A self-contained handbook · ~45 min read

The Agent Evaluations Handbook

Everything you need to learn agent evaluations — theory first, then interactive widgets to feel the concepts. Written so a product manager, a founder, an engineer, and someone who has never written a line of code can all leave understanding the same things.

10chapters
5interactive widgets
~45 minfront to back
0signups, ads, scripts

Last March, a friend's team upgraded their AI customer-support agent to a newer model. Internal dashboards barely moved. Two weeks later, a board member forwarded a long-time customer's email: "Your support used to be incredible. The last few weeks it's gotten… weirdly cold? Like talking to someone reading from a script."

The new model was technically better — fewer factual errors, faster, better at edge cases. It had simply lost a tiny amount of warmth that nobody had thought to grade. The fix was three days of work. The lesson took a quarter to absorb:

If a behavior isn't graded, it isn't a feature — it's an accident. This handbook is how you stop having that accident.
Chapter 1

Foundations: what an eval is and why you need one

Before any code, three ideas. What an AI agent actually is. Why testing it is fundamentally harder than testing software. What an "evaluation" therefore has to be.

1.1 What is an "AI agent," really?

When you ask ChatGPT a question, that's a chatbot: one prompt, one reply, done. An AI agent is the same brain plugged into the world. It can search the web, call APIs, refund an order, write code and run it, look at a screenshot and click a button. It takes multiple steps, decides what to do next based on what just happened, and aims to finish a task — not just answer a question.

Concretely, an agent is built from three pieces:

  • A model — the LLM that generates text and decides what to do.
  • Tools — functions the model can call: search_docs, issue_refund, send_email.
  • A loop ("scaffold" or "harness") — code that runs the model, executes any tools it requests, feeds the results back, and repeats until the task is done.
Definition

Agent

A system built around an LLM that can take multiple steps, use tools, and adapt mid-task to achieve a goal — distinct from a chatbot, which produces one reply per prompt.

1.2 Why testing AI is fundamentally different from testing software

Testing a calculator is straightforward: add(2, 3) returns 5 today, tomorrow, and forever. AI breaks every assumption that makes that easy:

  1. Same input, different output. Ask a model "Write me a poem about the sea" twice. You'll get two different poems. Both can be right.
  2. There is no single correct answer. What makes a good poem? Reasonable people disagree. Software tests assume a right answer; AI usually has a range of right answers — and an even bigger range of wrong ones.
  3. The output can be a sequence of actions, not just text. When a refund agent succeeds, the "answer" is "the right tool was called with the right arguments, the email went out, the customer felt heard." That's three things, all with their own definitions of right.
  4. Quality is multidimensional. A response can be accurate but rude. Helpful but unsafe. Concise but missing key info. A binary pass/fail loses too much information.
  5. The same code might work for one user and fail for the next. An agent that handles English perfectly may butcher Spanish. One that's polite to calm users may flounder under abuse.

This is why we don't call them "tests" — we call them evaluations. The word change matters: tests have a binary verdict, evaluations have a distribution of verdicts across multiple trials, multiple dimensions, and multiple inputs.

1.3 So what is an "evaluation," precisely?

Definition

Evaluation

A piece of code that gives an AI agent a task with pre-defined success criteria, runs the agent multiple times, and uses one or more graders to decide which trials passed. The output is not "did it work" but "what fraction of trials worked, by which criteria, and where did the rest fail."

If you've ever written a unit test, you're 80% of the way there. The differences:

  • The "function under test" is probabilistic, so you usually run multiple trials per task and look at aggregate behavior.
  • The output is often free-form text, so the assertion sometimes calls another model to act as the judge.
  • The "test suite" is called an eval suite, and it lives next to your code with the same gravity as your unit tests — it runs on every PR.

1.4 Why this matters — three concrete payoffs

Building an eval suite is unglamorous work. Here is what it buys you:

Faster shipping

The next great model lands in three months. Teams with evals upgrade in a week and know exactly what got better and what regressed. Teams without evals spend a quarter agonizing.

🛡

Fewer surprises in production

Every Slack #complaints message is a free task waiting to be written down. A growing regression suite eats the long-tail of issues your QA could never reproduce.

🔁

Real institutional memory

Engineers leave. Prompts get rewritten. Without evals, the folklore "we tried that and it didn't work" walks out the door. Each fix lands as a test that survives the people who wrote it.

Teams without evals get bogged down in reactive loops — fixing one failure, creating another, unable to distinguish real regressions from noise. Teams that invest early find the opposite: development accelerates because failures become test cases.
Chapter 2

The three building blocks: Task, Agent, Grader

Every eval is made of exactly three things. If you can name them, you can read any eval codebase.

2.1 Task

A task is one test case: an input the agent receives plus the criteria you'll use to judge the result. Examples:

  • "Refund the customer's last order." Success = the refund tool was called with the right order ID.
  • "What was Q3 revenue per the attached 10-K?" Success = the answer matches the value on page 47.
  • "A customer asks for a refund 60 days after purchase. Our policy is 30 days." Success = the agent declines politely and references the policy.

A collection of tasks is a task set or eval suite. A small but well-curated suite (20 tasks) is far more useful than a large but vague one (500 synthetic tasks).

2.2 Agent (the system under test)

The agent is whatever you're evaluating. The eval doesn't care about its internals — it cares only about behavior. This is important: it means you can swap the underlying model, swap the prompt, swap the framework, and the eval keeps measuring the same thing. That's the whole point.

2.3 Grader

A grader is the logic that decides if a trial passed. It takes the task and the agent's full output (the transcript) and returns a verdict: passed/failed, a 0–1 score, and a one-line rationale.

Good graders share three properties:

  • Idempotent. Same trial in, same verdict out (or, if the grader uses an LLM, calibrated against humans so the variance is small).
  • Honest about uncertainty. When the grader can't tell, it returns "unknown" — not a guess.
  • Cheap to debug. The rationale answers "why did this trial fail?" in under 100 characters.

2.4 Trials, transcripts, and outcomes

Three more vocabulary words you'll see everywhere:

Trial

One attempt at one task. Because models are non-deterministic, you run multiple trials per task and look at aggregate pass rate.

Transcript (a.k.a. trace, trajectory)

The complete record of a trial: every prompt, every tool call, every model output, every intermediate result. Reading transcripts is the single highest-leverage habit in eval work.

Outcome

The actual final state of the world after the trial — separate from what the agent claims in the transcript. An agent might say "I filed the report"; the outcome grader checks whether the file actually exists.

2.5 The whole picture

📋 Task prompt + criteria 🤖 Agent model + tools + scaffolding ⚖️ Grader pass / fail / why ✓✗ verdict prompt transcript ×N trials per task

Run a task through the agent N times → collect transcripts → grade each → aggregate the verdicts. Below, you can do this yourself with a tiny browser-based grader. Type a response, type what you expected, and watch four common code-based graders fire on every keystroke.

Edit this. Watch the verdicts below change.
For ContainsAll, separate terms with commas.
Chapter 3

Graders, in depth

Three families of graders exist. Most teams over-use the expensive ones and under-use the cheap ones. Here's when to reach for what.

3.1 The kitchen analogy

Imagine you've hired an AI agent as a chef. How do you tell if it's any good? Four levels of scrutiny:

🥄
Manual testing

The tasting spoon

You eat the food. Works for ten dishes. Doesn't scale to a thousand orders a night. But it's the only level that catches the thing nobody thought to test.

📋
Code-based grader

The recipe checklist

Right amount of salt? Cooked the right time? Yes/no checks. Cheap, fast, scale-free. Misses whether the dish actually tastes good.

👨‍🍳
LLM judge

The head chef review

You hire a more experienced chef to taste the food and rate it on a rubric. More flexible than a checklist. Has weird preferences and costs money.

Human evaluation

The Michelin inspector

Sometimes you bring in a top expert for the gold-standard verdict. Slow and expensive — used to calibrate the head chef, not for every dish.

A healthy program uses all four, in different proportions: recipe checklists for fast feedback during dev; head chef for nuanced things; the occasional Michelin inspector to keep the head chef honest; and tasting spoons forever, because nothing beats a human catching a thing the system missed.

3.2 Family 1 — Code-based graders

These are the cheapest, fastest graders you can write. Reach for them first. They check whether the output has a verifiable shape.

GraderChecksWhen
ExactMatchOutput equals an expected stringSingle canonical answer
RegexMatchOutput matches a patternStructured fields (numbers, IDs, currency)
FuzzyMatchOutput similar to expected (Levenshtein ratio ≥ threshold)Paraphrase tolerance without LLM judge
ContainsAll / ContainsAnyAll / any of N substrings appearRequired facts; required apologies; CTA mentions
JsonMatchOutput parses as JSON and contains expected keys/valuesStructured-output APIs
ToolCallCheckSpecific tools were (or weren't) calledVerify search happened, dangerous tool didn't
OutcomeCheckThe world ended up in the right stateVerify the file got written, the refund was issued

The trap: writing a code-based grader so strict it punishes correct behavior. "96.12" rejected because the spec wanted "96.124991024571248". Always specify tolerance up front — numeric ε, fuzzy threshold, subset match for JSON.

3.3 Family 2 — LLM judges (model-based graders)

When the dimension you care about is open-ended — tone, helpfulness, polish, empathy — code can't grade it. You hand the trial to another LLM and ask it to score against a structured rubric.

from agent_evals.graders import LLMJudge

rubric = """Score 0–5 on:
1. helpfulness — does it answer the actual question?
2. tone — warm and professional?
3. brevity — no padding?

Return JSON: {"scores": {...}, "passed": bool, "rationale": "..."}.
Use "unknown": true if the rubric or transcript is unclear.
"""

grader = LLMJudge(rubric=rubric, pass_threshold=4)

Three rules for LLM judges:

  • Calibrate against humans. Have a human grade ~20 trials by hand. Run the judge on the same 20. If the agreement (Cohen's kappa) is below 0.6, the judge is unreliable — fix the rubric and re-run.
  • Provide an "Unknown" escape hatch. Forced binary verdicts hide judgment uncertainty. The judge should be able to say "unknown": true when the rubric or transcript doesn't apply cleanly.
  • Cache the rubric. The rubric is identical on every call; the trial is the only thing that changes. Anthropic's prompt caching makes the rubric ~10× cheaper after the first call within a 5-minute window.

3.4 Family 3 — Human evaluation

The gold standard. Slow. Expensive. Irreplaceable. You use humans for three things:

  1. Ground truth. The grades you'll measure your LLM judges against.
  2. Sanity sampling. 10 random trials per week, read by a human. Catches the failure modes from Chapter 7 before they show up in customer complaints.
  3. The "this feels wrong" test. Humans notice things no rubric was written for — exactly the gap the cold-open story sat in.

3.5 Composite graders

Most real evals don't use one grader — they use 2 to 4 stacked together. "This trial passes if the right tool was called, AND the answer contains the right facts, AND the tone scored ≥4."

from agent_evals.graders import Composite, ContainsAll, ToolCallCheck, LLMJudge

grader = Composite([
    ContainsAll(),
    ToolCallCheck(required=["search"]),
    LLMJudge(rubric=tone_rubric, pass_threshold=4),
], mode="all", name="full_check")

Build composites incrementally. Don't try to write one mega-grader.

A 10-line code-based grader beats any LLM judge — cheaper, faster, more debuggable. Reach for the judge only when you must.
Chapter 4

How to write a rubric an LLM judge will follow

Most LLM-judge failures are rubric failures. Here's how to write one that doesn't lie to you.

4.1 The shape of a good rubric

A working rubric has four parts:

  1. The dimensions — 3 to 7 named axes you score on (more than 7 and the judge starts averaging things together).
  2. What each score means — concrete language for what a 5 looks like vs. a 3 vs. a 0.
  3. The pass threshold — usually "every dimension ≥ 4" or "weighted average ≥ 0.7."
  4. The escape hatch — explicit permission to return "unknown" with a reason rather than guessing.

An example of each part working:

Score the agent's response on three dimensions, each 0–5:

1. helpfulness — does it answer the actual question?
   5 = directly answers, complete, no padding
   3 = partially answers; missing one sub-point
   0 = doesn't answer or answers a different question

2. tone — warm, professional, no jargon
   5 = sounds like a thoughtful human
   3 = correct but flat or robotic
   0 = rude, dismissive, or aggressive

3. brevity — no filler or padding
   5 = every sentence earns its place
   3 = one or two sentences could be cut
   0 = a wall of text for a simple question

Pass threshold: every dimension must score ≥ 4.
If the response is unparseable or off-topic, return "unknown": true.
Return ONLY a JSON object: {"scores": {...}, "passed": bool, "rationale": "...", "unknown": bool}

4.2 The calibration loop

You can't ship a rubric you haven't tested against humans. The loop:

  1. Pick 20 trials from your eval suite (mix of obvious passes and obvious fails).
  2. Grade them by hand — write down a pass/fail verdict for each.
  3. Run your LLM judge on the same 20.
  4. Compute Cohen's kappa — a chance-corrected agreement metric. 1.0 = perfect agreement, 0 = chance, <0 = worse than chance.
  5. If kappa ≥ 0.6 — the judge is reliable enough to ship. If < 0.6 — the rubric is broken. Find the trials where you and the judge disagreed, figure out why, tighten the rubric, re-run.
Watch out for the "always says PASS" trap. If 80% of your calibration trials are passes, an "always says PASS" judge will agree with you 80% of the time — but Cohen's kappa will be near zero. Raw agreement lies; kappa doesn't.

4.3 Rubric mistakes that bite

  • Too many dimensions. Beyond ~7, the judge stops scoring them independently. Split into multiple judges instead.
  • Vague score levels. "5 = excellent, 0 = bad" tells the judge nothing. Describe what each level looks like in concrete behavior.
  • No escape hatch. Without "unknown," the judge guesses on weird inputs and you can't tell which scores to trust.
  • Mixing dimensions and pass-criteria. "Score helpfulness 0–5 AND penalize for typos" forces the judge to weight two unrelated things into one number. Split them.
  • No format requirement. "Return JSON" beats "score the response." Without a format, you're parsing free-form text into your eval pipeline. That's how silent grader failures happen.
Chapter 5

Metrics & the math of non-determinism

Run the same task ten times — some pass, some fail. So what's the "real" pass rate? It depends on a question most teams never explicitly ask.

5.1 Why a single trial isn't enough

If a model has a 70% chance of answering correctly on any given attempt, a single trial tells you almost nothing. You'll see "passed" 70% of the time and "failed" 30% — but with one observation you can't distinguish that from a 50% model that got lucky, or a 90% model that got unlucky.

The solution is statistical: run N trials, observe how many passed (call it C), and compute a rate. With N=20 trials, your confidence interval shrinks to ±10–15 percentage points; with N=100, to ±5 points. For most teams, 3–10 trials per task is the sweet spot — enough to catch obvious flakiness, cheap enough to run in CI.

5.2 The two metrics that matter

Once you have N trials with C successes, you can compute two different things, and they answer two opposite questions:

📈
pass@k

"At least once in k tries"

The probability the agent gets it right at least once if the user gets k attempts. Goes up as k grows. Use this when one working answer is enough — coding agents the user retries, research agents that just need a good answer somewhere.

📉
pass^k

"Every single time across k tries"

The probability the agent gets it right every single time across k attempts. Goes down as k grows. Use this when consistency matters — customer-facing chat that has to be polite every time, unattended workflows where any failure breaks the chain.

5.3 The coin-flip intuition

Imagine a coin that lands heads 70% of the time. Heads = success. Then:

kpass@k (at least once)pass^k (every time)
10.700.70
21 − 0.3² = 0.910.7² = 0.49
51 − 0.3⁵ = 0.9980.7⁵ = 0.17
101 − 0.3¹⁰ ≈ 1.0000.7¹⁰ ≈ 0.03

Same coin. Five attempts. One says "you'll virtually never miss." The other says "you'll only succeed all five times in 17% of runs." Both are correct — they're answering different questions.

5.4 The unbiased estimators

In real evals you only run N trials, not infinite. So you estimate. The unbiased formulas (from the OpenAI Codex paper) for "what would pass@k be if I'd only run k attempts":

pass@k(N, C, k) = 1 − C(N − C, k) / C(N, k)
pass^k(N, C, k) = C(C, k) / C(N, k)

where C(n, r) is "n choose r" (the binomial coefficient).

Worked example: N=10 trials, C=7 passed, k=3.

  • pass@3 = 1 − C(3,3)/C(10,3) = 1 − 1/120 ≈ 0.992 (very likely to get one right in 3 tries)
  • pass^3 = C(7,3)/C(10,3) = 35/120 ≈ 0.292 (about 29% chance all 3 are right)

5.5 Drag the sliders

The chart below computes both metrics live as you change p (single-attempt success rate), n (trials run), and k (attempts in production). Watch the gap open up between the two curves as k grows.

pass@k — at least once in k tries pass^k — every time across k tries
at k = 3
attempts
pass@k = 0.97at-least-once
p = 0.70
single try
pass^k = 0.34every-time
The real-world trap: a "70% pass rate" sounds great on a slide. But if your agent runs unattended for 5 turns, the user only sees a working flow 17% of the time. Pick the metric that matches how the user actually experiences your product.
Chapter 6

Capability vs. regression — same code, opposite goals

Every eval suite is one of two flavors. A healthy team has both, and they don't get confused.

Capability and regression evals look identical from the outside — same code, same harness, same graders. The difference is what they're for.

Capability evalRegression eval
Question"Can the agent do this hard new thing yet?""Can it still do the things it used to?"
Target pass rateLow at first (10–60%), climbingNear 100%, defended
Reaction to failure"There's a hill to climb — invest here next sprint""Something broke — block release"
LifecycleGraduates to a regression eval once mostly solvedStays forever — institutional memory
How often it runsWeekly, when iterating on the featureOn every PR, in CI

6.1 Why you need both

Without capability evals, your team has no signal on whether the new features they're building are improving. Every iteration feels equally good or bad; you can't tell.

Without regression evals, your old features rot silently every release. The cold-open story at the top of this handbook is exactly this failure mode — the team had no warmth grader on existing behavior, so the upgrade silently degraded it.

6.2 The graduation rhythm

When a capability suite hits ~95% pass rate, it's stopped giving you signal. Two things happen:

  1. Graduate it to regression. The tasks move into the suite that runs on every PR. The pass rate is now defended, not climbed.
  2. Write harder tasks for the new capability suite. The next-hardest things you want the agent to do.

This cadence — graduate, write harder, repeat — is what keeps a long-running eval program useful. Without it, suites saturate and the metrics stop telling you anything.

6.3 See it in action

The toggle below shows a real-shaped scenario. v1 is the previous release. v2 is after a model upgrade. Click between them and watch the capability score jump while the regression score quietly drops. Without a regression suite, this ships and customers feel it before you do.

Capability suite (5 hard tasks)

42%
+0%

Hard new tasks. Target: climb.

Regression suite (50 stable tasks)

99%
+0%

Things it already did right. Target: defend.

Click "v2" to see what the upgrade did. Bet you can guess which dashboard would have caught it.
Chapter 7

Where evals go wrong (and how to spot it)

The most insidious eval bugs aren't in the agent — they're in the test itself. Seven canonical failure modes, each with the symptom and the fix.

If you only remember one thing from this chapter: read transcripts every week. There is no dashboard substitute. The single highest-leverage habit in eval work is sampling 10 random trial transcripts (5 random failures, 5 random passes) and reading them by hand. This catches every failure mode below — and it catches the ones not in this list.

Failure 1 of 7 · severity: high

Rigid grading

The grader demands an exact match when the answer admits valid variations. "96.12" rejected because the spec wanted "96.124991024571248".

SymptomPass rate drops on a model upgrade, but every "failed" trial looks correct on inspection. CauseTolerance was never specified up front. The grader uses exact equality where it shouldn't. FixSpecify tolerance explicitly: numeric ε (abs(diff) < 0.01), text fuzzy threshold, JSON subset match instead of strict equality.
Failure 2 of 7 · severity: high

Ambiguous specs

The task says "answer concisely" with no length cap, or "be polite" with no rubric. Two senior reviewers read the same trial and disagree on whether it passed.

SymptomReviewers disagree on what counts as passing. Pass rate flips around as different people grade. CauseThe success criteria were never tightened past intuition. FixThe "two experts agree" rule: if two domain experts can't reach the same verdict independently, the task is broken. Tighten the spec or convert to a structured rubric with concrete score levels.
Failure 3 of 7 · severity: high

Stochastic impossibility

"Generate three random passwords" graded by exact match. The agent is doing the right thing; the grader can't measure it.

SymptomAlways 0% pass rate, no matter what model you swap in. CauseThe grader demands exact reproducibility on output that's intentionally random. FixGrade properties, not values. For passwords: length ≥12, contains at least one upper / lower / digit / symbol, all three are unique. The properties are deterministic; the values aren't.
Failure 4 of 7 · severity: medium

Goal confusion

Tasks that the spec implies should fail are passing — or vice versa — and you can't articulate why.

SymptomWrong things pass / right things fail, and reading the trial doesn't make it obvious why. CauseThe grading criteria contradict the stated objective. Spec says "the agent should refuse," grader checks for the word "absolutely." The agent says "absolutely not, I cannot help" — passes the grader, fails the spec. FixRead your graders alongside your spec, line by line. Have someone who didn't write either review them.
Failure 5 of 7 · severity: medium

Harness flakiness (correlated failures)

Pass rate is bouncy across runs even with the same agent and same tasks. Sometimes 92%, sometimes 78%, no real change between runs.

SymptomVariance between runs that's larger than your trials_per_task can explain. CauseShared state across trials — a tmp file from trial 1 is still on disk when trial 5 runs and uses it; rate limits cause mid-run timeouts; environmental drift between dev and CI. FixEach trial starts from a clean state. Treat eval infrastructure with the same gravity as production: if your harness flakes, evals lie.
Failure 6 of 7 · severity: medium

Gamer optimization (Goodhart's law)

The grader rewards a measurable proxy (length, presence of keywords, a polite phrase) and the model has learned to game it.

SymptomPass rate climbs steadily but real users say it's gotten worse. CauseThe proxy your grader measures has drifted away from the underlying quality you actually care about. FixDiversify graders so no single one can be gamed. Read transcripts weekly. Sample real production traces and score them with the same graders — divergence between eval pass rate and prod pass rate is your tripwire.
Failure 7 of 7 · severity: lower

Over-specified paths

Your grader enforces a specific tool-call sequence: "must call search then summarize." The agent answered correctly from context without searching. Eval marks it failed.

SymptomThe agent finds a perfectly valid solution but you record it as a failure. CauseYou wrote a grader that constrains how the agent solves the problem instead of whether it solved it. FixGrade outcomes, not paths. Frontier models find solutions designers didn't imagine — anticipate creativity rather than punishing it.

7.1 Train your eye — five real broken evals

Below are five real broken eval cases. For each, pick the failure mode you think is happening. We'll tell you why.

Chapter 8

The lifecycle of an eval program

What "mature" looks like, six months in. Use this as a north star.

Eval programs grow in stages. The shape is more or less universal across teams; only the speed varies. Below is the arc a real refund-bot team walked, day 0 to day 180.

Day 0The wake-up call

Production incident, no evals

The bot refunds $500 to a customer who returned a $50 item. Cue weekend. Root cause: nobody had written a test for "refund the correct order." There was no automated way to have caught it.

Day 7The first 5 tasks

The smallest possible regression suite

Engineering and product write 5 tasks: one happy path, two ambiguous-input cases, one scope-creep escalation, one policy refusal. Two graders: OutcomeCheck for the refund ID, ContainsAll for required policy mentions.

5 tasks2 graderspass rate: 80%
Day 30The flywheel starts

47 tasks, all from real production failures

Three new tasks per week from Slack #support-bot-issues. The mix is roughly 60% positive cases ("should do X"), 30% negative cases ("should NOT do X" — refund frauds, scope-creep), 10% edge cases (empty input, very long input, multiple intents).

47 tasksruns in CI on every PR
Day 45The first save

The regression suite catches a model upgrade

Engineering wants to upgrade to a newer model for cost reasons. CI runs the suite. Pass rate drops from 100% to 89% — the new model is more "agreeable" and issues refunds when policy says no. Upgrade blocked. Two days of prompt fixes. Suite back to 100%. Upgrade ships safely.

incident prevented~$X saved in incorrect refunds
Day 90Adding the LLM judge

A board member complains the bot "sounds robotic"

Tone isn't gradeable with regex. The team writes a 4-dimension rubric (empathy / clarity / warmth / closure). Calibrates against 30 human-graded trials — kappa = 0.71. Iterates the rubric, re-calibrates to 0.83. Trustworthy. Costs ~$0.30 per CI run. The "robotic" complaint never recurs.

tone judge liveκ = 0.83
Day 180Mature program

What "mature" looks like

187 regression tasks. 22 capability tasks for a new feature. Two graders per task on average. Smoke suite (15 tasks) on every PR in 30s for $0.02. Full suite (187) nightly in 8 minutes for $1.20. Quarterly grader review. Slack alert if nightly drops below 95%. Time to evaluate a new model release: 2 days.

CSAT 4.1 → 4.6model upgrades in days, not quarters
Chapter 9

The 8-step roadmap from zero to trusted evals

A field-tested progression. The first three steps you can do this week without writing a single line of code.

Step 01

Start early. Twenty real failures beat two hundred synthetic ones.

Open your support tracker, your bug list, your Slack #complaints channel. Pick the 20 most common things that go wrong. Each one is a free task waiting to be written down. Synthetic tasks generated by an LLM are a poor substitute — they tend to test things the agent already does well.

Step 02

Write tasks two domain experts will agree on.

The "two experts agree" rule. If two domain experts read the trial and the success criteria and reach different verdicts, the task is broken. Fix the spec before grading. This rule is more useful than any abstract principle because it's testable: have two people grade 20 trials by hand and count agreements. Below 85% means the task spec is the problem, not the agent.

Step 03

Build balanced sets — positive AND negative cases.

Half your tasks should test "the agent should do X." The other half should test "the agent should NOT do X" — refusing dangerous requests, declining out-of-scope asks, asking for clarification on ambiguous inputs. Without negative cases, your eval is biased toward "be helpful," and the model that wins is the one that always helps — including when it shouldn't.

Step 04

Robust harness, clean state per trial.

Each trial starts from a clean environment. Shared state across trials = mysterious flakiness that you'll spend a day debugging. Common culprits: leftover tmp files, cached LLM responses, leaky database state, accumulated rate-limit backoff. Treat eval infra with production-grade gravity.

Step 05

Prefer deterministic graders. Reach for an LLM judge only when you must.

A 10-line code-based grader is cheaper, faster, and more debuggable than any LLM judge. Use the judge for the dimensions that genuinely need it (tone, helpfulness, polish); use code for everything verifiable. The bias should be code-first.

Step 06

Read transcripts. Always. Every week.

The single highest-leverage habit. Most "the agent failed" reports turn out, on inspection, to be "the grader was wrong." Build a weekly habit: 5 random failures, 5 random passes. Skim them. You'll catch every failure mode in Chapter 7 this way, and you'll spot the next ones to add.

Step 07

Watch for saturation.

When your capability suite hits ~95% pass rate, it's stopped giving you signal. Graduate it to regression and write harder tasks for the new capability suite. This rhythm is what keeps a long-running eval program useful — without it, suites saturate and the metrics stop telling you anything.

Step 08

Maintain like unit tests.

Designate an owner. Run on every PR. Have a "grader review" meeting once a quarter where you pull 10 random failures and ask: "Was this failure real, or did our grader misjudge?" Most teams accumulate grader debt the way they accumulate tech debt. The grader review is how you pay it down.

Chapter 10

For non-engineers: how to commission your first eval

You don't have to write code to start an eval program. You do have to be specific about what you want measured.

10.1 The work only you can do

Engineers can grade whatever you tell them to; they can't tell you what to grade. Three things only product/leadership can decide:

  1. What behavior counts as success. Each grader is a knob you set.
  2. How to spend the quality budget. Should the agent be polite or fast? Verbose or concise? Aggressive or cautious?
  3. What the user actually does with the output. A perfect answer that nobody reads is worse than a 90% answer in the right format.

10.2 The 4-step starter

Tomorrow morning, before any meeting:

  1. Open a doc. Write down the 20 most important things your AI product should be able to do. (You don't need engineering for this.)
  2. For each, write down: "How would I know if it did this right?" That's a grader spec. Be specific. "Polite tone" isn't enough; "starts with an acknowledgment of the user's situation" is.
  3. For each, find one real example from production where it didn't go right. Open Slack, search for "this is wrong" or "the bot did". That's a task.
  4. Email engineering. "Can we land these 20 as a regression suite this sprint?" That's how a quality program starts.

10.3 Reading an eval report without becoming an engineer

When engineering hands you a report, ignore the dashboards for 30 seconds and ask three questions:

  1. What's the overall pass rate, and what was it last week? Direction matters more than absolute level.
  2. Which 5 tasks failed? Read those transcripts. (Yes, you. Not the dashboard. The actual conversations.)
  3. Are there tasks that pass too easily? A 100% pass rate on the easy stuff often masks regressions on the hard stuff.

You'll know a team is doing evals well when reviewing their report feels like reading a balanced design critique — wins, losses, surprises — and not like reading a status dashboard.

10.4 The grader review meeting (your new ritual)

Once a quarter, sit down with engineering and review the graders themselves. Not the results. Pull up 10 random failing transcripts and ask:

  • Was the failure real or did the grader misjudge?
  • Has the world changed since we wrote this rubric?
  • Is there a thing users now expect that we never wrote a grader for?

Most teams accumulate grader debt the way they accumulate tech debt. The grader review is how you pay it down.

Reference

Glossary

Every word you'll see in eval discussions, in plain language. Bookmark this section; you'll come back.

Agent
A system built around an LLM that can take multiple steps, use tools, and adapt mid-task to achieve a goal — distinct from a chatbot, which produces one reply per prompt.
Capability eval
An eval that targets things the agent struggles with. Pass rates start low; the point is to find a hill to climb.
Code-based grader
A grader implemented in plain code (string match, regex, JSON equality). Fast, cheap, deterministic; brittle to valid variations.
Cohen's kappa
A chance-corrected agreement metric between two raters. 1.0 = perfect, 0 = chance, <0 = worse than chance. Used to validate that an LLM judge agrees with human grades.
Composite grader
A grader built by combining other graders with AND/OR/weighted logic.
Dataset / Task set
A collection of tasks, often stored in JSONL. Equivalent to a unit-test file in regular software.
Eval / Evaluation
Code that gives an agent a task, runs it, and grades the result against criteria you specified ahead of time.
Eval harness
The code that orchestrates an eval run end-to-end: load tasks, dispatch to agent, capture transcripts, run graders, persist results.
Golden dataset
A curated, hand-vetted set of tasks with known-good answers. The "ground truth" used for regression evals.
Grader
A piece of logic that decides if a trial passed. Three families: code-based, model-based (LLM judge), human.
Groundedness
How well the agent's claims are supported by source material. Common grader for research and RAG agents.
Holdout set
A portion of your tasks reserved and never shown during development. Used to verify you haven't overfit to your visible eval set.
LLM judge
A grader that uses another model call to score the trial. Flexible but non-deterministic and expensive.
Multi-turn eval
An eval where the agent and user (often a simulated one) exchange multiple messages. Used for conversational agents.
Outcome
The actual final state of the world after a trial — separate from what the transcript claims happened.
Pairwise comparison
A judge pattern where instead of scoring on a rubric, you ask "is response A or response B better?" Often more reliable than absolute scoring.
pass@k
Probability the agent gets the task right at least once in k trials. Goes up with k. Right metric when one working answer suffices.
pass^k
Probability the agent gets the task right every time in k trials. Goes down with k. Right metric when consistency is required.
Prompt caching
An API feature that lets you mark stable parts of a prompt as cacheable, dramatically reducing cost on repeat calls. Anthropic uses explicit cache_control; OpenAI auto-caches stable prefixes.
Regression eval
An eval that targets things the agent already does correctly. Pass rates should stay near 100%; a drop means something broke.
Rubric
A structured scoring guide for an LLM judge. Lists the dimensions (clarity, accuracy, tone) and the criteria for each score level.
Saturation
When all (or nearly all) tasks in a suite pass. The eval has stopped giving signal — time to graduate it to regression and write harder tasks.
Task
A single test case: inputs + success criteria. The unit of an eval suite.
Tool call verification
A grader that inspects the transcript to check whether specific tools were called (or not called).
Trajectory / Trace
Synonyms for transcript.
Transcript
The complete recording of a trial: prompts, model outputs, tool calls, tool results, intermediate reasoning. The single most important debugging artifact.
Trial
One attempt at a single task. Multiple trials per task account for non-determinism.
User simulator
A model configured to play the role of a user, used in multi-turn evals to drive realistic conversations against the agent under test.
You've finished the handbook

If you want to actually run the code,

everything in this handbook is implemented in a small Python toolkit — Claude, OpenAI, or local Ollama, your choice. pip install -e toolkit/ and you have your first eval running in 30 seconds.

Finished this one? 0 / 12 Handbooks done