Golden sets, LLM-as-judge, regression testing, and the offline-vs-online divide, demystified.
Before metrics and benchmarks, you need a mental model for why LLM evaluation is fundamentally different — and why most teams get it wrong.
Traditional ML ships a model that outputs a single label or number. You measure accuracy, F1, AUC against a held-out set and you're mostly done. LLM systems output free-form text — there are infinitely many correct answers, and the same prompt can produce different responses each run.
That breaks three assumptions at once. First, you can't enumerate "correct" outputs, so exact-match collapses. Second, the system is a pipeline (prompt + retrieval + model + tools + post-processing) — a regression can come from any layer. Third, quality is multi-dimensional — a response can be factually correct but unsafe, helpful but verbose, on-topic but off-brand.
Add the production reality — silent regressions from prompt edits, model version bumps, vendor deprecations, retrieval index drift — and evals stop being a one-time gate. They become the production system you build around the model.
Testing and evaluation sound interchangeable, but they serve different loops. Testing answers a binary question: did this specific behavior happen or not? It's pass/fail, written like unit tests, runs in CI, and protects you from regressions on known cases. Evaluation answers a distributional question: across a representative sample of inputs, what is the overall quality of the system?
| Dimension | Testing | Evaluation |
|---|---|---|
| Question | Did X happen? | How good is the system overall? |
| Granularity | Single case | Distribution over many cases |
| Output | Pass / fail | Score, percentile, win-rate |
| When run | Every commit / PR | Pre-release, weekly, ad hoc |
| Cardinality | 10s–1000s of assertions | 100s–10000s of scored cases |
| Analogy | Unit test | Statistical experiment |
A healthy program has both. Testing catches known failures you've already seen in production. Evaluation detects unknown shifts in quality. Conflating them is why teams either over-invest in brittle pass/fail suites that block every release, or under-invest in CI and ship silent regressions.
There's no single taxonomy of evals, but a useful interview-friendly one has three axes: what you measure, how you measure, and where the data lives.
A concrete eval plan is a point on each axis. "I measure factuality (what) using a GPT-4-class judge with citations (how) on a versioned golden set of 500 support tickets (where)." That one sentence is the shape of every good eval.
A good framework is not a single dashboard — it's a layered system where each layer catches a different class of failure at a different speed and cost.
The principle — push work down the pyramid. Every failure that a cheap deterministic check can catch should not reach an LLM judge, and every failure an LLM judge can catch should not reach a human. Teams that flip this pyramid (human review for everything) burn money; teams that skip the top (no human calibration ever) drift without knowing.
Public benchmarks measure general model capability — they tell you whether a model is plausibly competent at broad domains. They say almost nothing about whether it will work for your task, your users, your data distribution.
Three structural problems make public benchmarks unreliable as production signals:
Use benchmarks for model selection shortlisting — a 20-point MMLU gap is a real signal. Use your own evals for everything downstream of that.
Capability evaluations probe an underlying skill the model has or doesn't — arithmetic, code generation, long-context recall, multilingual reasoning. Examples: GSM8K for math, HumanEval for code, needle-in-a-haystack for context retention. They're model-facing — they help you decide "is this model capable enough?".
Task-specific evaluations measure whether the full system does your job — resolve a support ticket, extract fields from an invoice, answer a policy question with the right citation. They're product-facing — they tell you "does this shipped pipeline work for our users?".
They're complementary, not competing. Capability evals tell you why a task eval regressed (the new model lost long-context recall). Task evals tell you whether the regression matters (our prompts never exceed 8k tokens, so we don't care).
The quality of your evals is capped by the quality of your data. This part covers how to build, size, version, and protect the ground truth that everything else sits on.
A golden dataset (or "golden set", "eval set", "canonical set") is a curated, versioned, human-reviewed collection of input–expected-output pairs that represents what correct looks like for your task. It's the ground truth every metric, judge, and regression test is ultimately calibrated against.
It's foundational because every other piece of your eval stack inherits its biases. A skewed golden set means a skewed LLM-judge, a misleading regression signal, a misguided A/B readout. If your golden set over-represents English short questions, your whole system will pass evals and still fail on long multilingual queries in production.
Note the tags, source, version, and reviewer. A golden set isn't just inputs and outputs — it's provenance. Without that, you can't audit biases, can't slice by failure mode, and can't prove the set wasn't scraped from your own prod logs containing PII.
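The record this paragraph describes is not reproduced in this copy. A minimal sketch of what one golden-set item could look like follows; every field name and value is a hypothetical illustration, not the author's schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class GoldenExample:
    """One golden-set item; all field names here are illustrative assumptions."""
    id: str
    input: str
    expected: str
    tags: list[str] = field(default_factory=list)  # slices, e.g. intent or language
    source: str = "synthetic"   # where the prompt came from
    version: str = "1.0.0"      # golden-set version that introduced the item
    reviewer: str = ""          # who signed off on the label


example = GoldenExample(
    id="support-0001",
    input="How do I cancel my subscription?",
    expected="Explains the cancellation steps and confirms the effective date.",
    tags=["intent:cancel", "lang:en", "length:short"],
    source="internal-dogfood",
    version="1.0.0",
    reviewer="reviewer-a",
)

print(json.dumps(asdict(example), indent=2))
```

Keeping provenance on the record itself, rather than in a side spreadsheet, is what makes slicing and auditing a one-line filter later.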
The trap is going straight to labeling. The sequence that actually works is segment → sample → label → stratify → freeze.
Segment means naming the axes your users vary on — intent, language, input length, politeness, edge cases. Sample draws from real traffic (ideal) or synthesized prompts (bootstrap). Label is where a domain expert writes what should have come out — or, for subjective tasks, what a good response would contain. Stratify ensures every segment is present in enough volume to make the metric stable. Freeze the set at v1 and only change it via explicit versioning.
Two common shortcuts to resist — labeling whatever shows up first (you'll over-index on common cases) and using the model's own output as the "expected" (you've just measured self-consistency, not correctness).
Total size is the wrong question. The right question is how many examples per slice. A 10,000-item set that's all one category is weaker than a 500-item set with 25 examples in each of 20 slices.
A working rule of thumb for binary pass/fail metrics: treat 50–100 examples per slice as a starting floor, not a guarantee. At a 90% pass rate, 100 examples give a 95% interval of roughly ±6 points, so a 5-point quality shift is only borderline detectable at that size. Graded scores (1–5) usually need fewer examples; rare failure modes (jailbreaks, PII leaks) need more.
Two quick sanity checks — (1) bootstrap your current set and confirm the confidence interval on your headline metric is narrower than the regression you care about; (2) plot per-slice scores and look for slices with wild variance — those are usually under-sampled.
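The first sanity check can be sketched as a percentile bootstrap over per-example outcomes. This is a minimal illustration using only the standard library; the function name, resample count, and the 90-of-100 slice are assumptions made for the example.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores (0/1 or graded)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the slice with replacement and record its mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical slice: 100 examples, 90 passes.
outcomes = [1] * 90 + [0] * 10
lo, hi = bootstrap_ci(outcomes)
print(f"pass rate 0.90, 95% CI [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```

If the printed width is larger than the regression you want to catch, the slice needs more examples before its score can gate anything.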
Contamination is when eval data has leaked into training data, so the model "remembers" the answer rather than earning it. Scores go up, real-world capability doesn't. It happens three ways — (1) you published your golden set; (2) the model was trained on scrapes that include your source; (3) you evaluate on data the model has already seen via retrieval or system prompts during labeling.
Defenses — keep a private held-out slice that's never shared with vendors; canary strings unique to your dataset so you can detect memorization; label with humans, not the model under test; refresh a rolling fraction of the set every quarter with net-new prompts.
Golden sets decay. User behavior shifts, products add features, edge cases are discovered, labels go stale as policies change. A never-updated set silently drifts from reality — your evals pass, your users suffer.
The operating pattern — treat the golden set like code. Semantic versions (v1.2.0), pull requests for additions, changelogs, deprecations. Minor version for "added 50 new examples in segment X", major version when changes break comparability with prior runs.
When you bump major, always re-run the prior release model on the new set and record the score — you need that anchor point so historical comparisons stay interpretable.
Most teams ship a bad golden set and don't know it because their metrics look stable. Stable metrics on a bad set are the worst case — confident wrongness.
| Smell | What it looks like | What it costs you |
|---|---|---|
| Easy mode | Every frontier model scores 95%+ | Metric is saturated — can't distinguish models |
| Skewed | 80% of examples are one intent | Head-case wins mask tail-case failures |
| Stale | No updates in 6+ months | Passes don't predict prod behavior anymore |
| Self-labeled | "Expected" was generated by the model | You're measuring self-agreement, not quality |
| Leaked | Available on the public internet | Memorization inflates scores ~5–15 points |
| Under-slotted | No per-segment tags | Can't diagnose which slice regressed |
| Solo-authored | One person labeled everything | One person's biases = ground truth |
Diagnosis heuristic — if your golden-set score barely moves between a tiny 7B open model and a frontier model, your set is either too easy or contaminated. If it moves by 30 points but production telemetry doesn't change at all, your set is unrepresentative.
Production is the best possible eval source — it's literally your distribution — but it carries PII, compliance, and consent risk. The pattern that works is sample → redact → review → promote.
Key choices: consent-friendly sampling (users who opted into training data sharing, or internal dogfooding traffic); synthetic twins when you can't use raw data (rewrite the prompt, preserving its structure but replacing identifying details); and difficulty-stratified sampling, so you don't only collect what the system already handles well. Sampling biased toward failures reveals the real edges.
Also — keep production sampling continuous. A one-time snapshot becomes stale in weeks for an actively-developed product.
From BLEU to BERTScore to pass@k — which metric actually moves with quality, and which ones just look rigorous without meaning much.
Reference-based metrics compare a model's output to one or more human-authored "correct" answers. Reference-free metrics score an output on its own merits — grammaticality, factuality, relevance — without needing a gold answer to compare against.
Most production systems use both — reference-based for structured outputs (JSON fields, extracted entities) and reference-free for long-form generation (summaries, chat responses). A common pattern — reference-based for the "must haves" (required fields present), reference-free for the "feel" (tone, helpfulness).
BLEU and similar overlap metrics such as ROUGE are all lexical: they compare surface-level n-gram or character overlap with a reference. That works for constrained tasks (translation against a parallel corpus) but breaks down the moment the model is allowed to paraphrase, reorder, or add useful context.
Lexical metrics reward surface similarity and punish paraphrase. Worse, they can rank a wrong answer above a correct one when the wrong answer copies more reference words. Exact match is even more brittle — a trailing period or capitalization difference fails a correct answer.
They still have a place — they're cheap, fast, deterministic, useful as a fast tripwire in CI. But they should never be your headline quality metric for free-form generation.
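A toy illustration of the ranking failure described above. The `unigram_f1` function is a deliberately crude stand-in for overlap metrics, not BLEU itself, and the sentences are invented for the example.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 on lowercased whitespace tokens (crude overlap metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the refund will arrive within 5 business days"
wrong = "the refund will arrive within 30 business days"
correct = "you should get your money back in about a week"

print(unigram_f1(wrong, reference))    # high overlap, wrong fact
print(unigram_f1(correct, reference))  # no shared tokens, right meaning
print("Paris." == "paris")             # exact match fails on punctuation and case
```

The wrong answer copies seven of the eight reference tokens and outscores the paraphrase, which shares none.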
Factuality = does every factual claim in the output match a trusted source? Hallucination = a factual claim with no such support (or contradicted by one). For RAG systems the source is retrieved context; for open-domain, it's an external knowledge base or a reference answer.
Key metrics in use —
| Metric | What it measures | Best for |
|---|---|---|
| Faithfulness | % claims supported by retrieved context | RAG |
| Answer relevance | % output sentences relevant to the question | QA |
| Citation precision | % citations that actually support claim | Grounded generation |
| Citation recall | % claims that have a citation | Grounded generation |
| Hallucination rate | % outputs with ≥1 unsupported claim | Headline dashboard |
The practical move — use an LLM judge with a structured rubric that forces claim-by-claim verification against provided evidence, rather than a single holistic "is this hallucinated?" vote. Decomposition beats gestalt.
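Once a judge (or a human) has produced claim-by-claim verdicts, the table's faithfulness and hallucination-rate numbers reduce to simple counting. A sketch under that assumption follows; the `ClaimVerdict` type and the sample claims are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # verdict from a judge or human, given the evidence


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Share of claims in one output that the evidence supports."""
    if not verdicts:
        return 1.0  # assumption: an output with no factual claims counts as faithful
    return sum(v.supported for v in verdicts) / len(verdicts)


def hallucination_rate(outputs: list[list[ClaimVerdict]]) -> float:
    """Share of outputs containing at least one unsupported claim."""
    flagged = sum(any(not v.supported for v in claims) for claims in outputs)
    return flagged / len(outputs)


outputs = [
    [ClaimVerdict("Plan costs $10/month", True),
     ClaimVerdict("Includes phone support", False)],
    [ClaimVerdict("Refunds take 5 days", True)],
]
print([faithfulness(o) for o in outputs])
print(hallucination_rate(outputs))
```

Storing the per-claim verdicts, not just the aggregate, also gives reviewers something concrete to audit when the judge is wrong.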
Instead of matching words, embed both output and reference in a vector space and measure how close they are. BERTScore tokenizes both, embeds with a pretrained transformer, and computes token-level cosine similarity. Sentence-level semantic similarity is the same idea at sentence or paragraph granularity, using sentence-transformer embeddings.
Embedding metrics are a real upgrade over BLEU — they handle paraphrase, word order, synonymy. But they have blind spots — they can't tell you if a fact is wrong, can't penalize a fluent hallucination, and are sensitive to the embedding model's training biases.
Treat them as a mid-layer metric — better than BLEU, cheaper than an LLM judge, and directionally useful as a CI gate. Not a replacement for a judge or human review.
Single-turn scoring doesn't survive contact with a conversation. A response can be locally correct but break context from turn 3, or locally off but recover context from turn 2. Three scoring levels operate together —
Common failure modes specific to multi-turn — context forgetting (forgetting user preferences stated three turns back), repetition (asking for info already given), goal drift (sliding from the original intent), and sycophancy accumulation (progressively agreeing with incorrect user assertions). Each needs its own tag in the golden set and its own rubric line in the judge prompt.
Practical tip — for scripted multi-turn evals, use simulated users — another LLM role-playing the user with a fixed goal and persona. That lets you replay the exact same dialogue deterministically across model versions.
pass@k is the probability that at least one of k independently sampled outputs passes a correctness check. If you sample 10 candidate solutions and any one of them compiles and passes unit tests, pass@10 counts the problem as solved.
It matters whenever you can cheaply verify and reject bad outputs — code generation (run tests), math (check against a checker), tool use (re-try on error). In those regimes, a model with pass@1=40% but pass@10=85% is genuinely more useful than a model with pass@1=55% but pass@10=60%, because production can sample and filter.
Watch for the pass@1 vs pass@k gap — a big gap says the model has the knowledge but sampling is noisy (good candidate for best-of-N, self-consistency, or re-ranking). A small gap says the model is bottlenecked on capability (won't improve with more samples).
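In practice pass@k is usually estimated from n samples per problem rather than by drawing exactly k. A sketch of the combinatorial estimator commonly used with HumanEval-style benchmarks, 1 - C(n-c, k) / C(n, k), where c is the number of passing samples; the numbers in the usage lines are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, 4 passed the unit tests.
print(pass_at_k(n=20, c=4, k=1))
print(pass_at_k(n=20, c=4, k=10))
```

Averaging this value over all problems gives the benchmark-level pass@k; comparing the k=1 and k=10 averages is the gap discussed above.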
The technique that made scalable evaluation possible — and the landmine pattern if you deploy it without calibration. Biases, validation, and the prompts that actually work.
LLM-as-judge is using a language model (usually a strong one — Claude, GPT-4-class) to score another model's outputs according to a rubric. Instead of writing heuristics or asking humans, you give the judge the input, the output, optionally a reference, and a prompt describing what "good" means.
It became dominant because it hits a sweet spot that nothing else does — scalable like heuristics, nuanced like humans. It handles paraphrase, reasoning, tone, structure. It's available on demand. And — when validated properly — it correlates with human judgment well enough for many product decisions.
Three rules before you trust it — (1) validate against human labels on a representative sample, (2) use a different model as judge than the one under test where possible, and (3) give the judge rubrics, not vibes.
Judge bias is the single most likely follow-up in an evals interview. Know the catalog cold.
| Bias | What happens | Mitigation |
|---|---|---|
| Position bias | In pairwise, judge prefers A or B based on order | Randomize order, run both orderings, average |
| Verbosity bias | Longer answers score higher regardless of quality | Explicit rubric against length; normalize |
| Self-preference | Judge prefers outputs from its own model family | Use different family; validate with humans |
| Sycophancy | Judge agrees with leading language in the prompt | Neutral rubric, no "this looks good, rate it" |
| Authority bias | Confident tone scored higher than hedged | Separate confidence from correctness in rubric |
| Anchoring | First score sets expectation for later ones | Independent judgments; no streaming context |
| Format bias | Bulleted/Markdown answers beat plain prose | Instruct judge to ignore formatting |
| Refusal tolerance | Judge lets a refusal pass when it shouldn't | Add "did the model actually answer?" to rubric |
None of these go away completely. The goal is to know them, measure their impact on your task, and apply targeted mitigations. A 2-percentage-point position bias is fine if you're deciding between a 15-point quality gap; it's a disaster if you're deciding between a 3-point one.
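The position-bias mitigation from the table can be sketched as follows. The judge is modeled as a plain callable, and the two toy judges stand in for real model calls; the choice to return a tie when the two orderings disagree (instead of averaging scores) is one variant, picked here for simplicity.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge in both orders; only declare a winner if the verdicts agree."""
    forward = judge(question, a, b)    # a shown first
    backward = judge(question, b, a)   # b shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with order: treat as position bias, not signal


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias, standing in for a real model call."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that is order-independent (it has verbosity bias instead)."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "short", "a longer answer"))
print(debiased_preference(longer_wins, "q", "short", "a longer answer"))
```

The share of items that come back as a tie is itself a rough measurement of how much position bias the judge has on your task.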
A judge that hasn't been validated is a random number generator with vocabulary. Validation means showing that the judge's scores correlate with human judgment on your task.
Target judge-human agreement ≥ human-human agreement. If two humans agree 85% of the time on your rubric, a judge with 85%+ agreement is performing as well as another human. If judge-human agreement is substantially lower than human-human, the rubric is ambiguous or the judge is wrong for this task.
Quick sanity tests beyond top-line agreement — slice-level agreement (does the judge disagree with humans disproportionately on one failure mode?); edge-case probes (give the judge deliberately bad outputs — does it catch them?); ordering robustness (same items in different orders — same scores?).
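A sketch of the agreement comparison, using invented labels. Raw agreement is what the text describes; Cohen's kappa is an addition here, included because raw agreement can look high by chance when one label dominates.

```python
from collections import Counter


def agreement(a: list[str], b: list[str]) -> float:
    """Raw fraction of items on which two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement corrected for chance, using each rater's label frequencies."""
    n = len(a)
    observed = agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on the same 10 outputs.
human_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
human_2 = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]

print("human-human:", agreement(human_1, human_2))
print("judge-human:", agreement(judge, human_1))
print("judge-human kappa:", round(cohens_kappa(judge, human_1), 3))
```

Running the same two functions per slice (filtering the lists by tag first) covers the slice-level check mentioned above.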
LLM judges run in three modes. They solve different problems and have different noise profiles.
| Mode | Question it answers | Best for | Main bias |
|---|---|---|---|
| Single (pointwise) | How good is this output on scale of 1–5? | Tracking quality over time | Calibration drift, verbosity |
| Pairwise | Is A better than B? | Model comparison / A-B preference | Position bias |
| Reference-based | Does this match the expected answer? | When gold answer exists | Over-penalizing valid paraphrase |
A pragmatic shortcut — pairwise is your most reliable signal when you have two systems. Humans are better at "A or B?" than at "rate 1–5", and so are LLMs. If you only need to ship one of two variants, don't over-engineer — run pairwise, randomize order, done.
A vague prompt produces a vague judge. A good judge prompt has five components, each doing specific work —
Two high-leverage tweaks — few-shot with borderline examples (1 clearly good, 1 clearly bad, 2 borderline, each with rationale) anchors the rubric far better than definitions alone; ask for confidence — a judge's low-confidence items are exactly where you should route human review.
Avoid — "rate this on 1–10 overall." That's vibes, not eval. Avoid single-number scales without anchors. Avoid asking the judge to rank more than 2 items in one call (performance collapses).
A frontier-model judge on 10,000 examples is real money — and slow. The cost envelope dictates how often you can run evals, which dictates how fast your dev loop is. Six practical levers —
Rule of thumb — if your eval run costs more than one engineer-hour, people stop running it. Optimize aggressively to keep it cheaper than that. Cheap evals that run 10× a day beat rigorous evals that run once a week.
The judge is itself an LLM served by a vendor. When the vendor updates or deprecates the judge model, the same outputs can receive different scores — even with identical rubric and code. Your quality trend chart moves and it's not the system under test; it's the judge.
Defenses — pin the judge model version explicitly (never use "latest"); maintain an anchor set of 50–100 outputs with known human labels, re-run them whenever the judge model changes, and measure if judge-human agreement still holds; monitor per-slice score distributions over time — a sudden shift without a code change is the classic judge-drift fingerprint.
When a new judge model version does launch — don't blindly migrate. Re-validate on the anchor set, publish a "calibration delta", and only then switch. Historical scores before and after should be marked with a judge-version label so comparisons stay honest.
Turning evals into tripwires that catch regressions before they ship — CI suites, thresholds, non-determinism, and shadow evaluation.
Regression testing means detecting when a change to any part of the stack makes quality worse on cases that previously worked. For LLMs it's broader than unit testing because the "change" can come from many places.
The implication — regression tests must exercise the full pipeline, not just the model. A regression on raw-model output is diagnostic; a regression on end-to-end output is what users experience.
CI wants three properties — fast, cheap, deterministic-enough. Most LLM evals are none of those. The design pattern is tiering.
Rules that keep CI fast — cache model calls keyed on (prompt-hash, model-version, temperature); parallelize across examples; fail fast on deterministic tiers before spending judge budget. And — critical for keeping devs shipping — block merges only on Tier 2 failures. Tier 3 regressions page you; Tier 4 is for release decisions.
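The caching rule can be sketched with an in-memory dictionary. `fake_model` and the version strings are placeholders, and a real setup would presumably persist the cache between CI runs.

```python
import hashlib
import json
from typing import Callable

_cache: dict[str, str] = {}


def cache_key(prompt: str, model_version: str, temperature: float) -> str:
    """Stable key over (prompt-hash, model-version, temperature)."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    payload = json.dumps([prompt_hash, model_version, temperature])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(call_model: Callable[[str], str], prompt: str,
                model_version: str, temperature: float) -> str:
    """Return a cached response when the key matches; otherwise call the model once."""
    key = cache_key(prompt, model_version, temperature)
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]


calls = []


def fake_model(prompt: str) -> str:
    """Stand-in for a real client; records each invocation."""
    calls.append(prompt)
    return f"echo: {prompt}"


cached_call(fake_model, "hello", "model-2024-01", 0.0)
cached_call(fake_model, "hello", "model-2024-01", 0.0)  # served from cache
cached_call(fake_model, "hello", "model-2024-02", 0.0)  # new version, new key
print(len(calls))
```

Including the model version in the key is what keeps a vendor-side version bump from silently reusing stale outputs.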
Assertion-style tests have a single, well-defined pass/fail condition on a single example. "When asked to cancel, the response must contain a confirmation token." They're unit-test shaped — one failure, one owner, one line of output.
Scoring-style tests run an evaluator over many examples and compare an aggregate score to a baseline. "On the 500-item support set, factuality score ≥ 0.87." They detect distribution shifts no individual assertion would catch.
A mature program has both. Every production incident should spawn one new assertion test — this is how "technical debt that caused the incident" becomes "regression that can't recur." Meanwhile, scoring tests ride the distribution.
Naive thresholds are "score must be ≥ 0.9." They fail two ways — too tight (every noisy run is red), or too loose (a 3-point regression slides through). The good pattern is relative, not absolute.
Practical rule — set the tolerance to the 95% confidence interval of your eval run. Bootstrap your golden set, measure run-to-run variance, and any drop larger than that CI is real. Tight thresholds without calibration produce red builds on noise; calibrated thresholds produce red builds on actual regressions.
Separate safety from quality. Safety thresholds are absolute, non-negotiable (zero PII leaks, zero jailbreak passes). Quality thresholds are delta-based and can negotiate.
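Put together, the two threshold styles might look like the gate below. This is a sketch: the function name and numbers are illustrative, and `noise_tolerance` is assumed to come from measured run-to-run variance such as a bootstrap interval half-width.

```python
def gate(baseline: float, candidate: float, noise_tolerance: float,
         safety_failures: int) -> tuple[bool, str]:
    """Release gate: absolute floor for safety, delta-vs-baseline for quality."""
    if safety_failures > 0:
        # Safety is checked first and cannot be offset by a quality win.
        return False, f"blocked: {safety_failures} safety failure(s)"
    drop = baseline - candidate
    if drop > noise_tolerance:
        return False, f"blocked: quality dropped {drop:.3f} (tolerance {noise_tolerance:.3f})"
    return True, "ok"


print(gate(baseline=0.87, candidate=0.86, noise_tolerance=0.02, safety_failures=0))
print(gate(baseline=0.87, candidate=0.83, noise_tolerance=0.02, safety_failures=0))
print(gate(baseline=0.87, candidate=0.90, noise_tolerance=0.02, safety_failures=1))
```

The third call shows the asymmetry: a candidate that improves quality is still blocked by a single safety failure.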
LLMs are non-deterministic even at temperature 0 (vendors reserve the right to vary inference internals). That makes traditional "expected == actual" testing unreliable. Three stabilization strategies cover most cases —
Pattern for CI — run each PR-tier example once at temperature 0, and each nightly-tier example 3–5 times with temperature matching production. PR tier gets fast signal; nightly tier gets statistical confidence.
And — don't fight non-determinism by trying to assert exact strings with temperature 0. It's a losing battle against vendor-side changes. Design tests that would pass for a human writing the same answer differently.
A shadow eval runs the new candidate model or configuration alongside the currently-shipped one, on real production traffic, without affecting user experience. The new system's outputs are captured and scored offline; only the shipped system's outputs reach users.
When to use shadow — before any A/B test touches users. A shadow run tells you if the candidate produces obviously worse outputs on your real distribution. It catches prompt regressions, latency blowups, schema breaks, cost explosions, and policy violations without user exposure.
Key constraints — shadow adds real cost (you're paying for 2× inference), and for multi-turn or state-changing flows it's hard (shadow can't actually execute the user's booking). Mitigation — sample a fraction of traffic, not 100%. A 5% shadow is usually enough to surface showstoppers.
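One way to take a stable fraction of traffic is to hash a request identifier into a bucket, so the same request always gets the same decision. A sketch, assuming requests carry a string ID; the `req-` IDs are invented.

```python
import hashlib


def in_shadow_sample(request_id: str, fraction: float = 0.05) -> bool:
    """Deterministically route a stable fraction of requests to the shadow path."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


sampled = sum(in_shadow_sample(f"req-{i}") for i in range(100_000))
print(sampled / 100_000)
```

Hashing instead of calling a random generator makes the sample reproducible, which helps when a shadow failure has to be replayed later.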
The most important split in LLM evaluation. Offline lets you iterate fast and safely; online tells you if any of it actually matters to users. You need both, and you need to close the loop between them.
Offline evaluation runs on a fixed, curated dataset with scored outputs — no users involved. Results are reproducible, cheap, and fast. You control the distribution and the judge. Online evaluation measures the system as it serves real users, typically via A/B tests, implicit signals (clicks, thumbs, retention), or explicit feedback.
The cognitive frame — offline measures capability, online measures value. A change can improve offline metrics and not move the needle online (users didn't notice). It can also degrade offline metrics and improve online ones (the eval was measuring the wrong thing). Both outcomes happen routinely and both are informative.
| Dimension | Offline | Online |
|---|---|---|
| Speed | Minutes to hours | Days to weeks (stat sig) |
| Cost | Judge tokens + data curation | User exposure + infra + analysis |
| Reproducibility | High (fixed set) | Low (traffic changes daily) |
| Signal type | Proxy metrics | Ground-truth business metrics |
| Coverage | Only what you curated | Real distribution incl. tail |
| Risk | None — no users | User harm, revenue, brand |
| Good for | Iteration, regression, ship/no-ship | Final validation, value discovery |
The classic failure modes on each side —
Neither wins. The point is to triangulate — use offline to gate and iterate, online to validate and discover, each informing the other's design.
An A/B test randomly splits users between control (current system) and treatment (new system) and measures downstream outcome metrics. For LLM features, three twists matter —
Sample size is non-trivial. LLM outputs are high-variance per user — you often need 10×–100× the users you'd need for a simple UI change. Pre-compute the needed sample with a power calculation based on observed per-user variance in your task.
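A back-of-the-envelope version of that power calculation, using the standard normal-approximation formula for a two-sided test on a difference in means. The default alpha and power, and the example inputs, are conventional assumptions, not values from the text.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(std_dev: float, min_detectable_effect: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / min_detectable_effect) ** 2
    return ceil(n)


# Same target effect, low-variance metric vs. a metric with 3x the per-user spread.
print(users_per_arm(std_dev=1.0, min_detectable_effect=0.1))
print(users_per_arm(std_dev=3.0, min_detectable_effect=0.1))
```

Required users scale with the variance, so tripling the per-user standard deviation raises the requirement roughly ninefold. That scaling is the mechanism behind the large multipliers mentioned above.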
Common pitfalls — peeking at results before stat sig and calling winners early; single-metric tunnel vision (north-star up, safety quietly down); and running A/B before any offline validation so a dangerous regression gets user exposure.
Online signals split into explicit (user volunteered) and implicit (inferred from behavior). Implicit signals dominate because they require no user effort and scale.
Sobering fact: explicit signals are biased. Users who rate are disproportionately those who are delighted or furious. A 4.8 average in-product rating is compatible with a silent 30% regression if middling users don't rate at all. Implicit signals close that gap by capturing what silent users do.
The loop that separates serious eval programs from theatre — production failures flow back into the golden set so they can be caught offline next time.
The practical mechanism — a weekly triage where PM + engineer go through prod failures, cluster them, pick 10–20 to promote, and add them with tags. Over a year, this grows a golden set that actually mirrors your production distribution rather than what you imagined it would be on day one.
Without this loop, offline evals stay frozen at whatever they were when you shipped v1. With this loop, they compound and your system gets monotonically harder to regress.
Interleaving is an alternative to classic A/B where, for a single user request, you mix outputs from two variants and measure which one the user prefers through direct behavior (click-through, copy, selection). Common in search ranking, increasingly applied to LLM response selection and citation ranking.
When it works — ranked lists (search results, citations, suggestions) where "user picked from variant X" is a clean signal. When it doesn't — single-response chat, where you can't show two answers side-by-side without breaking UX.
For LLM products the practical application is usually candidate re-ranking: your product shows a top-5 list of retrieved docs with entries interleaved from ranker A and ranker B, and you observe which ones the user clicks or cites. The power-efficiency gain is large because each request yields a paired preference signal.
Evaluating what happens at the edges — jailbreaks, PII leaks, prompt perturbations — and how RAG systems and agents need their own distinct eval frameworks.
Safety eval is structurally different from quality eval in one way: the metric you care about is a rare-event rate. A 99.5% safe system still leaks PII on 1 in 200 requests; at 10M requests/day that's 50,000 incidents per day. So safety evals oversample adversarial inputs on purpose.
Gate safety with hard floors, not deltas. Quality can regress 1% and we negotiate; PII leakage must be zero or near-zero. Safety failures block releases independent of quality wins.
And remember that false positives have real cost. An over-aggressive safety layer that refuses legitimate queries tanks product utility. Measure refusal rate on benign prompts alongside harmful-content rate on adversarial ones, and optimize the two jointly.
Red-teaming is humans or agents actively trying to break your system through creative, open-ended attack — same spirit as security red teams. Systematic adversarial eval runs a fixed, versioned suite of known attacks against every release.
| Red-teaming | Systematic adversarial eval |
|---|---|
| Discovers new failure modes | Prevents known ones from recurring |
| Creative, open-ended, unstructured | Fixed suite, CI-friendly |
| Humans / LLM attackers | Deterministic replays of known attacks |
| Output: a list of new jailbreaks found | Output: pass/fail on each known attack |
| Before major launches | Every release |
They feed each other — red-team finds a jailbreak once, that jailbreak becomes a test case in the systematic suite forever. Over time the systematic suite grows to cover the union of all discovered attacks. Red-teaming focuses on the frontier of what's not yet in the suite.
A robust system gives consistent-quality answers when the user rephrases, misspells, capitalizes oddly, or inserts noise. Fragile systems score 90% on a golden set and 55% on the same questions reworded.
The eval technique — perturbation batteries. For each golden-set item, generate variants and score each.
Report robustness as score variance across perturbations — a system that scores 0.88 on originals and 0.85 on perturbations with low variance is robust; one that averages 0.86 but swings from 0.60 to 0.98 is brittle.
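A perturbation battery can be sketched with a handful of surface-level transforms. The specific perturbations and the toy scorer below are illustrative assumptions; a fuller battery would likely also include paraphrases, which need a model to generate.

```python
import random
from statistics import mean, pstdev
from typing import Callable


def perturb(text: str, seed: int = 0) -> dict[str, str]:
    """Generate a few surface-level variants; the set of perturbations is illustrative."""
    rng = random.Random(seed)
    chars = list(text)
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap two adjacent characters
    return {
        "original": text,
        "lower": text.lower(),
        "upper": text.upper(),
        "typo": "".join(chars),
        "padded": f"  {text}  !!",
    }


def robustness_report(text: str, score: Callable[[str], float]) -> dict[str, float]:
    """Score every variant and summarize mean, spread, and worst case."""
    scores = [score(variant) for variant in perturb(text).values()]
    return {"mean": mean(scores), "stdev": pstdev(scores),
            "min": min(scores), "max": max(scores)}


def brittle_scorer(prompt: str) -> float:
    """Toy stand-in for 'run the system, then judge it': only handles one exact casing."""
    return 1.0 if "Refund" in prompt else 0.4


print(robustness_report("Refund policy for annual plans?", brittle_scorer))
```

Reporting the minimum and the spread next to the mean is what separates the robust and brittle systems in the example above, since their averages can look alike.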
A RAG system has two stages — retrieval (find relevant docs) and generation (answer from them). You must eval each separately, and then the end-to-end system. Blaming the wrong stage is the #1 RAG failure mode in teams.
Diagnosis pattern — if retrieval Recall@5 is 0.95 but end-to-end answer correctness is 0.60, your generator is wasting good context. If Recall@5 is 0.50 and answer correctness is 0.45, your retriever is the bottleneck. Ablation — replace retrieved context with the gold doc and see how much answer quality improves.
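The retrieval metric and the triage logic in the diagnosis pattern can be sketched as below. The thresholds in `likely_bottleneck` are illustrative assumptions chosen to reproduce the two cases in the text, not recommended values.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def likely_bottleneck(recall: float, answer_correctness: float,
                      good_recall: float = 0.9, gap: float = 0.2) -> str:
    """Heuristic triage; both thresholds are illustrative assumptions."""
    if recall >= good_recall and recall - answer_correctness >= gap:
        return "generator"
    if recall < good_recall:
        return "retriever"
    return "neither stands out"


print(recall_at_k(["d3", "d9", "d1", "d7", "d2", "d4"], {"d1", "d4"}, k=5))
print(likely_bottleneck(0.95, 0.60))
print(likely_bottleneck(0.50, 0.45))
```

A heuristic like this only points at a stage; the gold-document ablation is still the check that confirms it.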
An agent makes decisions, calls tools, and uses their outputs to decide the next action. Quality isn't a single response — it's a trajectory. Three scoring levels parallel the multi-turn frame —
Practical techniques —
| Technique | What it measures |
|---|---|
| Mocked tool sandboxes | Deterministic tool responses → reproducible trajectories for CI |
| Trajectory rubric | LLM judge scores whole trace against expected plan |
| Step budget checks | Agent must complete within N steps / $X cost |
| End-state assertions | Post-condition checks on world state (e.g., DB row created) |
| Loop / thrash detectors | Same tool called with same args twice → flag as stuck |
The single highest-leverage agent eval is a mocked sandbox — tools return fixed responses so the entire trajectory is replayable. Without this, agent evals are non-deterministic and you can't tell if a change helped.
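Two of the cheaper checks from the table, the loop detector and the step budget, can run over a recorded trajectory without any model call. A sketch, with a hypothetical `ToolCall` record and invented tool names:

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


def find_repeated_calls(trajectory: list[ToolCall]) -> list[int]:
    """Return indices of calls that repeat an earlier (tool, args) pair exactly."""
    seen: set[tuple[str, str]] = set()
    repeats = []
    for i, call in enumerate(trajectory):
        # Serialize args with sorted keys so key order does not hide a repeat.
        key = (call.tool, json.dumps(call.args, sort_keys=True))
        if key in seen:
            repeats.append(i)
        seen.add(key)
    return repeats


def within_step_budget(trajectory: list[ToolCall], max_steps: int) -> bool:
    """True when the agent finished within the allowed number of tool calls."""
    return len(trajectory) <= max_steps


trajectory = [
    ToolCall("search_flights", {"from": "SFO", "to": "JFK"}),
    ToolCall("get_price", {"flight": "F1"}),
    ToolCall("search_flights", {"to": "JFK", "from": "SFO"}),  # same call, keys reordered
]
print(find_repeated_calls(trajectory))
print(within_step_budget(trajectory, max_steps=5))
```

Because both checks are deterministic, they fit the cheap end of the pyramid and can run on every trajectory a mocked sandbox produces.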
Capability evals ask "can the model do X?" — solve problems, reason, code, use tools. Alignment evals ask "does the model do what we want, when we want, and refrain from what we don't?" — honesty, helpfulness, harmlessness, following instructions, respecting constraints.
A more capable model is not automatically more aligned. Sometimes the opposite holds, because a more capable model is also more capable of persuasively producing misleading or harmful content. Capability gains can outpace alignment gains from one release to the next; alignment evals are what pick that up.
For product teams — capability evals bound your shortlist; alignment evals determine deployability. A model can crush MMLU and still fail your "don't give financial advice" rubric in ways that block launch.
Evals in production, drift detection, culture, and how to synthesize everything into answers that land in the interview itself.
Production monitoring is continuous, sampled evaluation on live traffic. You can't run a full LLM judge on every request (too expensive), so the pattern is tiered sampling.
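One way to sketch tiered sampling is shown below. The three tiers (cheap checks on everything, an LLM judge on a sample, human review on a smaller sample) and the rates are assumptions chosen for illustration, and `sample_bucket` and `assign_tiers` are hypothetical names. Hashing the request ID keeps the sampling decision stable across retries and replays.

```python
import hashlib


def sample_bucket(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 16**8


def assign_tiers(
    request_id: str,
    judge_rate: float = 0.05,    # assumed: 5% of traffic gets an LLM judge
    human_rate: float = 0.005,   # assumed: 0.5% also goes to human review
) -> list[str]:
    """Decide which evaluation tiers run for one production request."""
    bucket = sample_bucket(request_id)
    tiers = ["cheap_checks"]     # deterministic checks run on every request
    if bucket < judge_rate:
        tiers.append("llm_judge")
    if bucket < human_rate:
        tiers.append("human_review")
    return tiers
```

With `human_rate` below `judge_rate`, every human-reviewed request has also been judged, which gives you paired labels for checking the judge against humans.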
Surface the outputs in a rolling dashboard with per-slice quality, refusal rate, safety incidents, and the operational metrics (latency, cost, tool errors). Alerts fire on week-over-week drops, sudden spikes in refusals, or new clusters of 👎 feedback.
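A week-over-week alert on per-slice quality could be sketched like this. The function name, the `max_drop` threshold, and the `min_samples` guard are assumptions; the inputs are taken to be dictionaries mapping a slice name to that week's sampled quality scores.

```python
from statistics import mean


def week_over_week_alerts(
    this_week: dict[str, list[float]],
    last_week: dict[str, list[float]],
    max_drop: float = 0.05,   # assumed alert threshold
    min_samples: int = 30,    # assumed guard against noisy small slices
) -> list[tuple[str, float]]:
    """Return (slice, drop) pairs where mean quality fell by more than max_drop."""
    alerts = []
    for slice_name, scores in this_week.items():
        previous = last_week.get(slice_name, [])
        if len(scores) < min_samples or len(previous) < min_samples:
            continue  # too few samples to trust the comparison
        drop = mean(previous) - mean(scores)
        if drop > max_drop:
            alerts.append((slice_name, round(drop, 3)))
    return alerts
```

Comparing per slice is the point of the sketch: a sharp drop in one slice can disappear inside a flat overall average.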
They overlap and interviewers often blur them. Hold the line —
| Dimension | Evals | Observability |
|---|---|---|
| Primary question | "Is output quality good?" | "What is happening in the system?" |
| Signal type | Scored judgment of quality | Traces, logs, metrics |
| Granularity | Aggregate + per-slice | Per-request, per-span |
| Time horizon | Trend over days/weeks | Real-time + per-incident |
| Who consumes | PM, ML eng, Research | On-call, SRE, ML eng |
| Fires on | Quality regression | Latency spike, error, token blow-up |
They need each other. Observability without evals tells you the system is fast and cheap but says nothing about whether it's right. Evals without observability tell you quality scores trended down but can't tell you which prompt/model/tool change caused it. A unified trace ID that joins "this request produced this output which scored this way" is the bridge.
Drift can come from three sources — data drift (user behavior changes), model drift (vendor updates the underlying model), and concept drift (the definition of "good" changed, e.g., new policy).
Response playbook — (1) confirm it's not your own deployment (check the change log); (2) run the anchor set to isolate judge vs system; (3) replay recent prod requests on the prior model version to localize; (4) if the vendor changed, trigger re-validation on the sealed set and decide whether to pin a prior version or accept the new behavior.
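The first two steps of the playbook can be sketched as a small triage function. This assumes the anchor set is a frozen set of outputs with reference judge scores, so that re-scoring it with the current judge isolates judge drift from system drift; that reading, the function name, and the `tolerance` value are assumptions for illustration.

```python
from statistics import mean


def triage_drift(
    anchor_reference: dict[str, float],
    anchor_rescored: dict[str, float],
    own_change_deployed: bool,
    tolerance: float = 0.05,   # assumed acceptable judge-score shift
) -> str:
    """Classify a quality drop using the change log and the anchor set."""
    # Step 1: rule out your own deployment before blaming anything else.
    if own_change_deployed:
        return "own_deployment"
    # Step 2: the anchor outputs are frozen, so any score shift is the judge's.
    shifts = [
        abs(anchor_rescored[item_id] - anchor_reference[item_id])
        for item_id in anchor_reference
    ]
    if mean(shifts) > tolerance:
        return "judge_drift"
    # The judge is stable, so the change is in the system or the vendor model.
    return "system_or_vendor_drift"
```

A `"system_or_vendor_drift"` result is the cue to move on to steps (3) and (4): replay recent requests against the prior model version and, if the vendor changed, re-validate on the sealed set.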
Most LLM teams treat evals as a pre-launch activity. Eval-driven teams treat them as a dev loop — every prompt change, every retrieval tweak is measured against the eval set before it merges.
The signal of a healthy culture — engineers reach for the golden set before they reach for the prompt editor. "Let me write a failing test first" becomes reflexive for LLM work the same way it did for backend work in mature test-driven teams.
| Mistake | Why it's wrong | What to say instead |
|---|---|---|
| "We use BLEU." | Signals unawareness of lexical-metric limitations | "BLEU for sanity, judge for semantics, humans for calibration" |
| Conflating eval and test | Suggests no production experience | "Tests catch known failures; evals detect distributional drift" |
| "MMLU looks good" | Confuses benchmark with product fit | "Benchmark filters shortlist; internal eval decides ship" |
| No slicing | Average metric masks targeted regressions | "I slice by intent, language, length, user cohort" |
| Unvalidated judge | Treats LLM judge as ground truth | "Judge validated against ≥100 human-labeled items; agreement ≥ human-human" |
| No loop back | Static eval sets go stale | "Every prod incident → golden set addition" |
| Safety as add-on | Bolts safety on at the end | "Hard floors on safety independent of quality delta gates" |
| Offline or online, not both | False dichotomy | "Offline gates ship, online validates value, closed-loop between them" |
The fastest way to signal seniority is to name the tradeoff explicitly. "I'd pick X here because the alternative Y has this specific problem in my context." Juniors name techniques; seniors name tradeoffs.
This is the most common open-ended question in LLM eval interviews. Use a repeatable 6-step template.
Worked example — "design an eval for a customer-support reply bot."
Deliver that in two minutes and you've outscored most candidates regardless of the specific X.
Everything in this handbook fits on one diagram — the closed-loop LLM eval stack. Keep it in your head and you can reconstruct any specific answer from first principles.
Five ideas, one loop. If an interview question doesn't map onto a layer here — golden set, scoring, CI, production stages, monitoring/feedback — you're either being asked something very narrow, or the question itself is confused and you can reframe it onto the stack.
The whole point of this handbook is that evals are not a metric, they're a system. Learn the system; the metrics follow.