3.1 The kitchen analogy
Imagine you've hired an AI agent as a chef. How do you tell if it's any good? Four levels of scrutiny:
🥄
Manual testing
The tasting spoon
You eat the food. Works for ten dishes. Doesn't scale to a thousand orders a night. But it's the only level that catches the thing nobody thought to test.
📋
Code-based grader
The recipe checklist
Right amount of salt? Cooked for the right time? Yes/no checks. Cheap, fast, and easy to scale. Misses whether the dish actually tastes good.
👨‍🍳
LLM judge
The head chef review
You hire a more experienced chef to taste the food and rate it on a rubric. More flexible than a checklist. Has weird preferences and costs money.
⭐
Human evaluation
The Michelin inspector
Sometimes you bring in a top expert for the gold-standard verdict. Slow and expensive — used to calibrate the head chef, not for every dish.
A healthy program uses all four, in different proportions: recipe checklists for fast feedback during dev; head chef for nuanced things; the occasional Michelin inspector to keep the head chef honest; and tasting spoons forever, because nothing beats a human catching a thing the system missed.
3.2 Family 1 — Code-based graders
These are the cheapest, fastest graders you can write. Reach for them first. They check whether the output has a verifiable shape.
| Grader | Checks | When |
| --- | --- | --- |
| ExactMatch | Output equals an expected string | Single canonical answer |
| RegexMatch | Output matches a pattern | Structured fields (numbers, IDs, currency) |
| FuzzyMatch | Output similar to expected (Levenshtein ratio ≥ threshold) | Paraphrase tolerance without LLM judge |
| ContainsAll / ContainsAny | All / any of N substrings appear | Required facts; required apologies; CTA mentions |
| JsonMatch | Output parses as JSON and contains expected keys/values | Structured-output APIs |
| ToolCallCheck | Specific tools were (or weren't) called | Verify search happened, dangerous tool didn't |
| OutcomeCheck | The world ended up in the right state | Verify the file got written, the refund was issued |
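To make a few rows of the table concrete, here is a minimal sketch of what such graders might look like as plain Python functions. The function names, the default threshold of 0.8, and the choice to normalize the edit distance by the longer string are all assumptions for illustration; they are not the `agent_evals` implementations.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass when the similarity ratio is at least `threshold`.

    Assumption: ratio = 1 - distance / length of the longer string.
    Other normalizations exist and give slightly different numbers.
    """
    longest = max(len(output), len(expected))
    if longest == 0:
        return True  # two empty strings are identical
    return 1.0 - levenshtein(output, expected) / longest >= threshold


def regex_match(output: str, pattern: str) -> bool:
    """Pass when the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def contains_all(output: str, required: list[str]) -> bool:
    """Pass when every required substring appears (case-insensitive)."""
    lowered = output.lower()
    return all(item.lower() in lowered for item in required)
```

Each function is a pure yes/no check on a string, which is what keeps this family cheap to run and easy to debug.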
The trap: writing a code-based grader so strict it punishes correct behavior. "96.12" rejected because the spec wanted "96.124991024571248". Always specify tolerance up front — numeric ε, fuzzy threshold, subset match for JSON.
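One way to build that tolerance in is sketched below, using only the standard library. The helper names and the default tolerance of 0.01 are hypothetical choices for illustration, and the number extraction is deliberately naive (first decimal number in the output).

```python
import json
import math
import re


def numeric_match(output: str, expected: float, abs_tol: float = 0.01) -> bool:
    """Pass if the first number in `output` is within `abs_tol` of `expected`."""
    found = re.search(r"-?\d+(?:\.\d+)?", output)
    if found is None:
        return False
    return math.isclose(float(found.group()), expected, rel_tol=0.0, abs_tol=abs_tol)


def is_subset(expected, actual) -> bool:
    """True if every key/value in `expected` also appears in `actual`.

    Dicts are compared recursively; everything else by equality.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


def json_subset_match(output: str, expected: dict) -> bool:
    """Pass if `output` parses as JSON and contains the expected keys/values.

    Extra keys in the output are ignored rather than punished.
    """
    try:
        actual = json.loads(output)
    except json.JSONDecodeError:
        return False
    return is_subset(expected, actual)
```

With these defaults, `numeric_match("96.12", 96.124991024571248)` passes because the difference is under 0.01, while an exact string comparison would have failed the same answer.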
3.3 Family 2 — LLM judges (model-based graders)
When the dimension you care about is open-ended — tone, helpfulness, polish, empathy — code can't grade it. You hand the trial to another LLM and ask it to score against a structured rubric.
from agent_evals.graders import LLMJudge
rubric = """Score 0–5 on:
1. helpfulness — does it answer the actual question?
2. tone — warm and professional?
3. brevity — no padding?
Return JSON: {"scores": {...}, "passed": bool, "rationale": "..."}.
Use "unknown": true if the rubric or transcript is unclear.
"""
grader = LLMJudge(rubric=rubric, pass_threshold=4)
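The rubric above asks the judge to reply in JSON, so something has to turn that reply into a verdict. The sketch below shows one way this could be done by hand. `Verdict` and `parse_verdict` are hypothetical names, and the rule that every dimension must reach the threshold is an assumption; the source does not say how `LLMJudge` applies `pass_threshold`.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    status: str      # "pass", "fail", or "unknown"
    scores: dict
    rationale: str


def parse_verdict(raw: str, pass_threshold: int = 4) -> Verdict:
    """Turn the judge's raw JSON reply into a three-way verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict("unknown", {}, "judge reply was not valid JSON")
    if not isinstance(data, dict):
        return Verdict("unknown", {}, "judge reply was not a JSON object")

    rationale = str(data.get("rationale", ""))
    scores = data.get("scores")
    # Treat an explicit "unknown": true, or missing scores, as no verdict.
    if data.get("unknown") is True or not isinstance(scores, dict) or not scores:
        return Verdict("unknown", {}, rationale)

    # Assumption: a trial passes only if every dimension meets the threshold.
    # The judge's own "passed" field is ignored and recomputed from the scores.
    passed = all(
        isinstance(score, (int, float)) and score >= pass_threshold
        for score in scores.values()
    )
    return Verdict("pass" if passed else "fail", scores, rationale)
```

Recomputing the pass/fail decision from the scores, rather than trusting the judge's own `passed` field, keeps the threshold in code where it can be changed and tested.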
Three rules for LLM judges:
- Calibrate against humans. Have a human grade ~20 trials by hand. Run the judge on the same 20. If the agreement (Cohen's kappa) is below 0.6, the judge is unreliable — fix the rubric and re-run.
- Provide an "Unknown" escape hatch. Forced binary verdicts hide judgment uncertainty. The judge should be able to say "unknown": true when the rubric or transcript doesn't apply cleanly.
- Cache the rubric. The rubric is identical on every call; the trial is the only thing that changes. With Anthropic's prompt caching, the cached rubric tokens are billed at roughly a tenth of the normal input price on every call that hits the cache, which stays warm for 5 minutes by default.
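The calibration rule relies on Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance: kappa = (observed - expected) / (1 - expected). A minimal sketch for pass/fail labels follows; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa between two raters who labeled the same trials."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Fraction of trials where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every trial
    return (observed - expected) / (1.0 - expected)


# Illustrative labels for 20 trials (not real data).
human_labels = ["pass"] * 10 + ["fail"] * 10
judge_labels = ["pass"] * 9 + ["fail"] + ["fail"] * 8 + ["pass"] * 2
kappa = cohens_kappa(human_labels, judge_labels)
```

In this made-up example the raters agree on 17 of 20 trials (85%), but chance alone would give 50%, so kappa comes out at about 0.7, just above the 0.6 bar.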
3.4 Family 3 — Human evaluation
The gold standard. Slow. Expensive. Irreplaceable. You use humans for three things:
- Ground truth. The grades you'll measure your LLM judges against.
- Sanity sampling. 10 random trials per week, read by a human. Catches the failure modes from Chapter 7 before they show up in customer complaints.
- The "this feels wrong" test. Humans notice things no rubric was written for — exactly the gap the cold-open story sat in.
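The weekly sanity sample is easy to automate. The sketch below assumes trials are identified by string IDs and seeds the random generator from the ISO week, so that everyone who runs it during the same week reads the same trials; `weekly_sample` is a hypothetical helper, not part of any library named in this chapter.

```python
import datetime
import random


def weekly_sample(trial_ids: list[str], day: datetime.date, k: int = 10) -> list[str]:
    """Pick up to k trial IDs, reproducibly for the ISO week containing `day`."""
    iso = day.isocalendar()
    seed = iso[0] * 100 + iso[1]  # ISO year and week number, e.g. 202502
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(trial_ids), min(k, len(trial_ids)))
```

Sampling uniformly at random, rather than letting a reviewer pick, avoids reading only the trials that already look interesting.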
3.5 Composite graders
Most real evals don't use one grader — they use 2 to 4 stacked together. "This trial passes if the right tool was called, AND the answer contains the right facts, AND the tone scored ≥4."
from agent_evals.graders import Composite, ContainsAll, ToolCallCheck, LLMJudge

grader = Composite([
    ContainsAll(),
    ToolCallCheck(required=["search"]),
    LLMJudge(rubric=tone_rubric, pass_threshold=4),
], mode="all", name="full_check")
Build composites incrementally. Don't try to write one mega-grader.
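To show what "incrementally" can mean in practice, here is a minimal, library-free sketch of an all-must-pass composite. `NamedGrader`, `run_all`, and the shape of the trial dict are assumptions for illustration; they do not describe how `Composite` is implemented.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape: a grader is any function from a trial dict to pass/fail.
Grader = Callable[[dict], bool]


@dataclass
class NamedGrader:
    name: str
    check: Grader


def run_all(graders: list[NamedGrader], trial: dict) -> tuple[bool, list[str]]:
    """Run every grader; pass only if all pass. Returns (passed, failed_names)."""
    failed = [g.name for g in graders if not g.check(trial)]
    return (not failed, failed)


# Start with one check, confirm it behaves, then add the next.
checks = [NamedGrader("tool_called", lambda t: "search" in t["tool_calls"])]
checks.append(NamedGrader("has_facts", lambda t: "refund" in t["output"].lower()))

good_trial = {"tool_calls": ["search"], "output": "Your refund is on its way."}
passed, failed_names = run_all(checks, good_trial)
```

Returning the names of the failed checks, not just a boolean, is what makes a stacked grader debuggable: a failing trial tells you which layer rejected it.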
Where a code-based check can do the job, a 10-line grader beats any LLM judge: cheaper, faster, more debuggable. Reach for the judge only when you must.