CODING CHALLENGE · N°07

Token-Level F1

Medium · AI Engineering · Evals · NLP

The metric behind QA evaluation (SQuAD and friends): how well does a predicted answer overlap a reference, treating both as bags of words? Compute token precision and recall, then combine them into their harmonic mean, F1. Unlike exact match, it gives partial credit, so it moves as soon as predictions get closer to the reference.

The problem

Given a prediction string and a reference string, return the token-level F1. Lowercase and split both on whitespace into tokens. Count the multiset overlap (shared tokens, respecting duplicates). Then precision = overlap / len(pred), recall = overlap / len(ref), and F1 = 2·P·R / (P + R). Return 0.0 if either side is empty or there is no overlap.
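
To make the formula concrete, here is a minimal sketch that starts from counts you already have. The helper name f1_from_counts and its parameters are illustrative assumptions, not part of the challenge stub.

```python
def f1_from_counts(overlap: int, n_pred: int, n_ref: int) -> float:
    """Harmonic mean of precision and recall from raw token counts.

    Hypothetical helper for illustration; the challenge itself asks for
    token_f1(prediction, reference).
    """
    # Guard first: an empty side or zero overlap would divide by zero below.
    if n_pred == 0 or n_ref == 0 or overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


# 3 shared tokens, 4 predicted, 6 in the reference: P = 0.75, R = 0.5
score = f1_from_counts(3, 4, 6)

# Algebraically, 2PR / (P + R) reduces to 2 * overlap / (n_pred + n_ref).
shortcut = 2 * 3 / (4 + 6)

print(score, shortcut)
```

Both values agree up to floating-point rounding, which is a handy sanity check: F1 depends only on the overlap and the two token counts.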

EXAMPLE 1
Input pred = "the cat sat", ref = "the cat sat"
Output 1.0
exact match
EXAMPLE 2
Input pred = "the cat", ref = "the cat sat"
Output 0.8
P=1.0, R=0.667 → F1=0.8
EXAMPLE 3
Input pred = "a b", ref = "c d"
Output 0.0
no shared tokens
CONSTRAINTS
  • Case-insensitive; split on whitespace.
  • Overlap is a multiset intersection — "the the" vs "the" shares only one "the".
  • Return 0.0 when either string is empty or overlap is 0 (avoid dividing by zero).
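
The multiset constraint is the one people most often get wrong, so here is a small sketch of it in isolation. It assumes Python's collections.Counter, whose & operator keeps the minimum count per key; the function name multiset_overlap is made up for this illustration.

```python
from collections import Counter


def multiset_overlap(pred_tokens: list[str], ref_tokens: list[str]) -> int:
    """Count shared tokens, respecting duplicates (illustrative helper)."""
    # Counter & Counter keeps each token with the minimum of its two counts.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    return sum(common.values())


# "The the" vs "the": two predicted copies, but only one can be matched.
print(multiset_overlap("The the".lower().split(), "the".lower().split()))  # prints 1

# An empty prediction shares nothing.
print(multiset_overlap("".split(), "the cat".split()))  # prints 0
```

A plain set intersection would give the same answers here, but it would undercount when both sides repeat a token: "the the" vs "the the" shares two tokens, not one.
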
SOLVE IT YOURSELF

Your turn — write it

Edit the stub, hit Run (or ⌘/Ctrl + Enter), and watch the hidden tests. Stuck? The hints are listed below, and Reveal solution is one click away.

YOUR TASK

Implement token_f1(prediction, reference): lowercase + split both, count multiset token overlap, then return the harmonic mean of precision and recall (0.0 on empty / no overlap).

HINTS — 4 IDEAS
  1. Lowercase both strings and split() on whitespace into token lists.
  2. For multiset overlap, count reference tokens, then for each prediction token consume one from that count if available.
  3. precision = overlap / len(pred); recall = overlap / len(ref).
  4. F1 = 2·P·R / (P + R). Guard the empty / zero-overlap cases by returning 0.0 first.
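
The four hints above can be assembled into one possible solution sketch. This is an illustration that follows the hints step by step, not necessarily the reference solution the page reveals. It assumes str.split() with no arguments is the intended whitespace splitter.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    # Hint 1: lowercase, then split on any run of whitespace.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0

    # Hint 2: count reference tokens, then let each prediction token
    # consume one matching reference token if any are left.
    remaining = Counter(ref)
    overlap = 0
    for token in pred:
        if remaining[token] > 0:
            remaining[token] -= 1
            overlap += 1
    if overlap == 0:
        return 0.0

    # Hints 3 and 4: precision, recall, then their harmonic mean.
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat sat"))
print(round(token_f1("the cat", "the cat sat"), 3))
print(token_f1("a b", "c d"))
```

The three calls correspond to Examples 1 to 3. Looking up a missing key in a Counter returns 0 without raising, which is why the loop needs no membership check.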