Token-Level F1
The metric behind extractive QA evaluation (SQuAD and friends): how well does a predicted answer overlap a reference, as a bag of words? Compute token precision and recall, then combine them into their harmonic mean, F1. Unlike exact match, F1 gives partial credit when a prediction is only partly right.
The problem
Given a prediction string and a reference string, return the token-level F1. Lowercase and split both on whitespace into tokens. Count the multiset overlap (shared tokens, respecting duplicates). Then precision = overlap / len(pred), recall = overlap / len(ref), and F1 = 2·P·R / (P + R). Return 0.0 if either side is empty or there is no overlap.
pred = "the cat sat", ref = "the cat sat" → 1.0
pred = "the cat", ref = "the cat sat" → 0.8
pred = "a b", ref = "c d" → 0.0
- Case-insensitive; split on whitespace.
- Overlap is a multiset intersection — "the the" vs "the" shares only one "the".
- Return 0.0 when either string is empty or overlap is 0 (avoid dividing by zero).
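To see where the 0.8 in the second example comes from, the arithmetic can be traced step by step. The sketch below uses Python's collections.Counter, whose & operator keeps each token with the minimum of its two counts (a multiset intersection). The variable names are illustrative only and are not part of the exercise.

```python
from collections import Counter

pred_tokens = "the cat".lower().split()
ref_tokens = "the cat sat".lower().split()

# Counter & Counter keeps each token with the minimum of its two counts.
overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())  # 2 shared tokens

precision = overlap / len(pred_tokens)  # 2 / 2
recall = overlap / len(ref_tokens)      # 2 / 3
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))

# Duplicates are respected: "the the" vs "the" shares a single "the".
dup_overlap = sum((Counter("the the".split()) & Counter("the".split())).values())
print(dup_overlap)
```

Precision is perfect here because every predicted token appears in the reference; recall is what pulls the score down, since "sat" is missing from the prediction.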
Your turn — write it
Implement token_f1(prediction, reference): lowercase + split both, count multiset token overlap, then return the harmonic mean of precision and recall (0.0 on empty / no overlap).
- Lowercase both strings and split() on whitespace into token lists.
- For multiset overlap, count reference tokens, then for each prediction token consume one from that count if available.
- precision = overlap / len(pred); recall = overlap / len(ref).
- F1 = 2·P·R / (P + R). Guard the empty / zero-overlap cases by returning 0.0 first.
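Putting the hints together, one possible implementation might look like the sketch below. It follows the "count, then consume" approach from the hints and assumes only what the problem statement specifies (lowercasing and whitespace splitting, with no punctuation or article stripping). Treat it as one reasonable solution, not the reference answer behind the hidden tests.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Guard: an empty side makes precision or recall undefined.
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count reference tokens, then let each prediction token consume one.
    remaining = Counter(ref_tokens)
    overlap = 0
    for token in pred_tokens:
        if remaining[token] > 0:
            remaining[token] -= 1
            overlap += 1

    # Guard: with no shared tokens, P + R would be 0.
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat sat"))
    print(token_f1("the cat", "the cat sat"))
    print(token_f1("a b", "c d"))
```

Because the second value is a floating-point result, compare it against 0.8 with a tolerance (for example math.isclose) rather than with ==.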