CODING CHALLENGE · N°07

Token-Level F1

Medium · AI Engineering · Evals · NLP

The metric behind QA evaluation (SQuAD and friends): how well does a predicted answer overlap a reference, as a bag of words? Compute token precision and recall, then combine them into the harmonic-mean F1 — the number that actually moves when your model gets better.

The problem

Given a prediction string and a reference string, return the token-level F1. Lowercase both strings and split them on whitespace into tokens. Count the multiset overlap (shared tokens, respecting duplicates). Then precision = overlap / len(pred) and recall = overlap / len(ref), where len counts tokens rather than characters, and F1 = 2·P·R / (P + R). Return 0.0 if either side is empty or there is no overlap.

EXAMPLE 1

Input pred = "the cat sat", ref = "the cat sat"

Output 1.0

exact match

EXAMPLE 2

Input pred = "the cat", ref = "the cat sat"

Output 0.8

P = 1.0, R ≈ 0.667 → F1 = 0.8
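
For readers who want the arithmetic spelled out: both prediction tokens ("the" and "cat") appear in the reference, so the overlap is 2, the prediction has 2 tokens, and the reference has 3. Under those counts the numbers work out as follows.

```latex
\begin{aligned}
P   &= \frac{\text{overlap}}{|\text{pred}|} = \frac{2}{2} = 1, \\
R   &= \frac{\text{overlap}}{|\text{ref}|}  = \frac{2}{3} \approx 0.667, \\
F_1 &= \frac{2PR}{P + R}
     = \frac{2 \cdot 1 \cdot \tfrac{2}{3}}{1 + \tfrac{2}{3}}
     = \frac{4/3}{5/3}
     = \frac{4}{5} = 0.8.
\end{aligned}
```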

EXAMPLE 3

Input pred = "a b", ref = "c d"

Output 0.0

no shared tokens

CONSTRAINTS

Case-insensitive; split on whitespace.
Overlap is a multiset intersection — "the the" vs "the" shares only one "the".
Return 0.0 when either string is empty or overlap is 0 (avoid dividing by zero).
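
The multiset rule in the constraints above can be illustrated in isolation. The sketch below is one possible way to count the overlap, using the `&` operator on `collections.Counter`, which keeps each token with the smaller of its two counts. The helper name `multiset_overlap` is made up for this illustration and is not part of the challenge stub.

```python
from collections import Counter


def multiset_overlap(pred_tokens, ref_tokens):
    """Count tokens shared by both lists, respecting duplicates."""
    # Counter & Counter keeps each token with the minimum of its two counts,
    # so a token repeated on one side is only matched as often as the other
    # side can pay for it.
    shared = Counter(pred_tokens) & Counter(ref_tokens)
    return sum(shared.values())


print(multiset_overlap(["the", "the"], ["the"]))  # 1
print(multiset_overlap("The cat".lower().split(), "the cat sat".split()))  # 2
print(multiset_overlap([], ["the"]))  # 0
```

A plain set intersection would also report 1 for the first call, but it would undercount whenever both sides legitimately repeat a token, which is why the constraint asks for a multiset.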

SOLVE IT YOURSELF

Your turn — write it

Edit the stub, hit Run (or ⌘/Ctrl + Enter), and watch the hidden tests. Stuck? The hints are listed below, and Reveal solution is one click away.

YOUR TASK

Implement token_f1(prediction, reference): lowercase both strings and split them on whitespace, count the multiset token overlap, then return the harmonic mean of precision and recall (0.0 if either side is empty or nothing overlaps).

HINTS — 4 IDEAS

Lowercase both strings and split() on whitespace into token lists.
For multiset overlap, count reference tokens, then for each prediction token consume one from that count if available.
precision = overlap / len(pred); recall = overlap / len(ref).
F1 = 2·P·R / (P + R). Guard the empty / zero-overlap cases by returning 0.0 first.
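
Put together, the four hints above could be sketched roughly as follows. This is one possible reading of the task, not necessarily the page's reference solution. It follows the "consume one from the count" idea from the second hint, and it assumes the hidden tests pass plain Python strings.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, compared as bags of words."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Guard first: an empty side would make precision or recall undefined.
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count reference tokens, then let each prediction token consume one.
    remaining = Counter(ref_tokens)
    overlap = 0
    for token in pred_tokens:
        if remaining[token] > 0:
            remaining[token] -= 1
            overlap += 1

    # Guard again: zero overlap would make P + R equal to zero.
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat sat"))  # 1.0
print(round(token_f1("the cat", "the cat sat"), 3))  # 0.8
print(token_f1("a b", "c d"))  # 0.0
```

The second call is rounded only for display, because floating-point division can leave the result a hair away from 0.8. A test that compares against 0.8 is safer with a small tolerance than with strict equality.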

CPython · WebAssembly