Design LLM Eval & Observability — A Guided System Design

Design LLM Eval & Observability — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

You can’t ship what you can’t measure

An LLM feature has no compiler and no unit test that says "correct." Outputs are open-ended, quality is subjective, and the same prompt can regress when you tweak a word, swap a model, or the provider updates theirs — silently, with no error. Demos are easy; knowing whether you’re actually getting better (or quietly getting worse) is the hard part that separates a toy from a product. How do you measure AI quality, continuously?

Build an eval + observability pipeline: trace every call (observability), score the system against versioned golden datasets with validated scorers (evaluation), gate changes in CI, and monitor production for drift — closing a loop where real failures become new test cases. Measurement is the system that lets you improve on purpose instead of by vibes.

Step 1 · Two loops

Offline evals and online observability

There are two questions: "is this change good before I ship it?" and "is it still good in production?" They need different machinery. What’s the overall shape?

Design decision: What are the two complementary halves of measuring an LLM system?

The call: Offline evaluation against fixed datasets (pre-ship) + online observability/monitoring of production traffic. Offline evals score changes against versioned golden sets to catch regressions before deploy; online observability traces and scores live traffic to catch drift and new failures after deploy. Both are needed.

Two complementary loops: offline evaluation runs a change against fixed, versioned datasets before shipping (catch regressions early), and online observability traces and scores production traffic (catch drift and novel failures). The rest of this design builds both on a shared foundation — traces in, scores out.

Pre-ship gate + prod watch: Offline evals are your regression tests; online monitoring is your production alerting. Neither alone is enough — offline can’t see the real distribution, online can’t stop a bad change before users feel it.

Step 2 · Capture everything

Tracing is the foundation

You can’t evaluate, debug, or mine examples from calls you didn’t record. Before any scoring, what has to be captured on every LLM interaction?

Design decision: What must you log on every LLM call to make evaluation possible?

The call: Full traces: prompt, output, tokens, latency, cost, tool calls, and user feedback. A Trace Collector records the complete interaction into a queryable store, the raw material for debugging, mining eval examples, monitoring, and cost attribution.

A Trace Collector captures the full interaction — prompt, output, tokens, latency, cost, tool calls, and any user feedback — into a durable Trace Store. This is the observability bedrock: you can’t evaluate, debug, attribute cost, or build datasets from calls you never recorded. Everything downstream reads from these traces.

Observability before evaluation: Comprehensive tracing is the precondition for everything else. Golden sets are mined from traces, monitoring scores traces, debugging replays traces. Instrument first; measure second.
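
As a rough illustration of what "capture everything" could look like, here is a minimal sketch of a trace record and a collector that wraps a model call. The names `Trace`, `TraceCollector`, and `fake_llm`, the flat per-token price, and the whitespace token count are all assumptions made for the example; a real system would take token usage and cost from the provider response and write to a durable store rather than an in-memory list.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from typing import Callable, Optional


@dataclass
class Trace:
    """One recorded LLM interaction (field names are illustrative)."""
    prompt: str
    output: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    tool_calls: list = field(default_factory=list)
    user_feedback: Optional[str] = None  # e.g. "thumbs_up", attached later
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


class TraceCollector:
    """Wraps a model call and appends one JSON line per interaction."""

    def __init__(self, sink: list, usd_per_1k_tokens: float = 0.0):
        self.sink = sink  # stand-in for a durable Trace Store
        self.usd_per_1k_tokens = usd_per_1k_tokens  # assumed flat price

    def call(self, model: str, prompt: str, llm: Callable[[str], str]) -> Trace:
        start = time.perf_counter()
        output = llm(prompt)
        latency_ms = (time.perf_counter() - start) * 1000.0
        # Whitespace counts are a crude placeholder for provider-reported usage.
        n_in, n_out = len(prompt.split()), len(output.split())
        trace = Trace(
            prompt=prompt, output=output, model=model,
            input_tokens=n_in, output_tokens=n_out,
            latency_ms=latency_ms,
            cost_usd=(n_in + n_out) / 1000.0 * self.usd_per_1k_tokens,
        )
        self.sink.append(json.dumps(asdict(trace)))
        return trace


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model client."""
    return "Paris is the capital of France."


store: list = []
collector = TraceCollector(store, usd_per_1k_tokens=0.5)
trace = collector.call("demo-model", "What is the capital of France?", fake_llm)
```

Writing one self-describing JSON record per call keeps the store queryable later for debugging, dataset mining, and cost attribution, which is the point of instrumenting before measuring.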

Step 3 · The yardstick

Versioned golden datasets

To say "better" or "worse" you need something fixed to measure against. Where does that yardstick come from, and why must it be versioned?

Design decision: What do you evaluate a change against?

The call: Curated, versioned golden datasets: real cases + edge cases, with expected behavior. Build eval sets from production traces (representative), known failures, and edge cases, each with expected behavior; version them so a score is comparable across runs and a change to the data is explicit.

Curate versioned golden datasets: representative cases mined from production traces, plus known failures and edge cases, each labeled with expected behavior. Version them so scores are comparable across runs and dataset changes are explicit. Your own eval set — not a public benchmark — is the yardstick that matters, and it grows as production reveals new failures.

Your data is your benchmark: Evaluation is only as good as the dataset. Mining real production traces (via observability) plus deliberate edge cases makes the yardstick reflect your reality — and versioning makes "we improved" a claim you can trust.
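
One way to make "dataset changes are explicit" concrete is to derive the version from the content itself. The sketch below is an assumption about how that might be done, not a prescribed scheme: `GoldenCase` and `dataset_version` are hypothetical names, and the version is a truncated SHA-256 over a canonical JSON form of the cases.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input: str
    expected: str  # expected behavior, phrased so a scorer can check it
    source: str    # "production", "known_failure", or "edge_case"


def dataset_version(cases: List[GoldenCase]) -> str:
    """Content hash: adding, removing, or editing a case changes the version."""
    canonical = json.dumps(
        sorted(
            ({"id": c.case_id, "input": c.input,
              "expected": c.expected, "source": c.source} for c in cases),
            key=lambda d: d["id"],  # order of cases must not affect the hash
        ),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1_cases = [
    GoldenCase("c1", "Refund policy for opened items?", "mentions 30-day window", "production"),
    GoldenCase("c2", "", "asks the user to rephrase", "edge_case"),
]
v2_cases = v1_cases + [
    GoldenCase("c3", "Cancel my order", "does not invent an order status", "known_failure"),
]
v1 = dataset_version(v1_cases)
v2 = dataset_version(v2_cases)
```

Storing the version string next to every eval score means two scores are only compared when they were produced against the same data.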

Step 4 · How to score

Programmatic, LLM-judge, human

For "did it return valid JSON?" a simple check works. For "was this answer helpful and correct?" there’s no regex. What scoring methods do you use, and which for what?

Design decision: How do you score open-ended LLM outputs?

The call: A mix: programmatic checks where possible, LLM-as-judge for open-ended quality, humans to calibrate and adjudicate. Use cheap deterministic checks for structured/verifiable outputs, an LLM-as-judge for subjective quality at scale, and human labels as ground truth to validate the judge and handle the ambiguous cases.

Layer the scorers: programmatic checks for verifiable/structured outputs (JSON valid, contains fact, matches regex), an LLM-as-judge for open-ended quality at scale, and human labels as ground truth. Critically, the judge is only trustworthy if validated against humans, measured for agreement and bias, because an uncalibrated judge poisons every number downstream.
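
The three programmatic checks named above can be sketched in a few lines of standard-library Python. The helper names (`is_valid_json`, `contains_fact`, `matches_regex`, `run_checks`) and the sample output are illustrative assumptions, not part of any particular eval framework.

```python
import json
import re
from typing import Callable, Dict


def is_valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_fact(fact: str) -> Callable[[str], bool]:
    """Case-insensitive substring check for a required fact."""
    return lambda output: fact.lower() in output.lower()


def matches_regex(pattern: str) -> Callable[[str], bool]:
    """True if the pattern occurs anywhere in the output."""
    compiled = re.compile(pattern)
    return lambda output: compiled.search(output) is not None


def run_checks(output: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every named check and report pass/fail per check."""
    return {name: check(output) for name, check in checks.items()}


checks = {
    "json_valid": is_valid_json,
    "mentions_paris": contains_fact("Paris"),
    "has_iso_date": matches_regex(r"\d{4}-\d{2}-\d{2}"),
}
result = run_checks('{"city": "paris", "as_of": "2024-01-31"}', checks)
bad = run_checks("not json", checks)
```

Checks like these are cheap and exact, so they can run on every golden case and on a large sample of production traffic; the judge is reserved for what they cannot express.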

Right scorer for the output: Deterministic where you can (cheap, exact), judge where you must (scalable, subjective), human to anchor (ground truth). The judge is powerful but must be treated as a model that needs its own evaluation — not an oracle.
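
Treating the judge as a model that needs its own evaluation can start with something as simple as chance-corrected agreement on a human-labeled sample. The sketch below computes Cohen's kappa between human and judge labels, plus the gap in pass rates as a crude leniency signal. The label lists are made up for illustration, and what counts as "good enough" agreement is a decision left to the team.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pass_rate(labels: Sequence[str]) -> float:
    return sum(label == "pass" for label in labels) / len(labels)


# Illustrative labels on the same eight outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(human, judge)                  # 0.5 on this sample
leniency = pass_rate(judge) - pass_rate(human)      # 0.25: judge passes more often
```

Here raw agreement is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance, and the positive leniency gap shows the judge's disagreements all lean toward "pass".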

Step 5 · Gate every change

Offline evals as a CI regression gate

A developer tweaks a prompt to fix one case. It quietly breaks five others. Without a gate, that ships. How do you stop regressions from reaching users?

Design decision: How do you prevent a prompt/model change from silently regressing quality?

The call: Run the eval suite in CI on every prompt/model/config change and block on a regression. The Eval Runner scores the change against the golden sets automatically; if key metrics regress past a threshold, the deploy is blocked, treating prompts/models like code under test.

Wire the Eval Runner into CI as a regression gate: every prompt, model, or config change runs the golden-set evals, and a regression past threshold blocks the deploy. Prompts and models become artifacts under test, just like code — so a "small tweak" can’t silently trade one fixed case for five broken ones.

Evals are your unit tests: Treating the eval suite as a required CI check is what makes iterating on prompts/models safe. It turns "I think this is better" into "the suite confirms it’s better, with no regressions" — the core of reliable AI iteration.
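
The gate itself can be a small comparison between baseline and candidate scores. The following is a sketch under assumptions: per-case scores in [0, 1] keyed by case id, a hypothetical `gate` function, and threshold values chosen only for the example. It checks both the mean and each individual case, since a mean alone can hide "fixed one, broke five".

```python
from typing import Dict, List, Tuple


def gate(baseline: Dict[str, float], candidate: Dict[str, float],
         max_mean_drop: float = 0.02,
         max_case_drop: float = 0.5) -> Tuple[bool, List[str]]:
    """Return (passed, reasons). Thresholds are illustrative defaults."""
    reasons: List[str] = []
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        reasons.append(f"candidate has no score for cases: {missing}")
    shared = sorted(set(baseline) & set(candidate))
    if shared:
        base_mean = sum(baseline[c] for c in shared) / len(shared)
        cand_mean = sum(candidate[c] for c in shared) / len(shared)
        if base_mean - cand_mean > max_mean_drop:
            reasons.append(f"mean score fell {base_mean:.3f} -> {cand_mean:.3f}")
        for c in shared:
            if baseline[c] - candidate[c] > max_case_drop:
                reasons.append(f"case {c} regressed {baseline[c]:.2f} -> {candidate[c]:.2f}")
    return (not reasons, reasons)


# The scenario from this step: the tweak fixes c1 and breaks c2..c6.
baseline = {"c1": 0.0, "c2": 1.0, "c3": 1.0, "c4": 1.0, "c5": 1.0, "c6": 1.0}
candidate = {"c1": 1.0, "c2": 0.0, "c3": 0.0, "c4": 0.0, "c5": 0.0, "c6": 0.0}
passed, reasons = gate(baseline, candidate)
# In a CI job, a failing gate would end the step with a non-zero exit code.
```

On this input the gate fails with six reasons: one for the mean and one for each of the five broken cases, which gives the developer the exact list to look at.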

Step 6 · Watch production

Online evaluation and drift

Offline evals pass, you ship — and a week later quality drifts: real inputs differ from your golden set, the provider silently updated the model, a new failure mode appears. Offline can’t see any of it. How do you catch production decay?

Design decision: What catches quality problems that offline evals never saw?

The call: Continuously sample and score live traffic, watching for drift and new failure modes. A production monitor samples real calls, scores them (programmatic + judge), and alerts on quality drift, novel failures, and cost/latency regressions: the real distribution offline evals can’t reproduce.

A Prod Monitor continuously samples and scores live traffic (programmatic checks + judge), alerting on quality drift, new failure modes, and cost/latency regressions. It sees the real distribution — new inputs, silent provider model updates, changing user behavior — that a frozen offline set never will. Offline gates the change; online watches reality.

Reality drifts from your dataset: Even a perfect golden set is a snapshot. Online evaluation is how you notice the world (and the provider’s model) moving out from under you — and the new failures it surfaces become tomorrow’s eval cases.
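
A minimal sketch of the monitoring idea, assuming a single pass/fail quality score per sampled trace: sample a fraction of traffic, keep a rolling window of scores, and alert when the window's pass rate falls too far below a baseline. `ProdMonitor`, its parameters, and the thresholds are hypothetical; a production monitor would also track cost and latency and would likely use a proper statistical test rather than a fixed drop.

```python
import random
from collections import deque
from typing import Deque, Optional


class ProdMonitor:
    def __init__(self, baseline_pass_rate: float, sample_rate: float = 0.1,
                 window: int = 200, max_drop: float = 0.05,
                 min_samples: int = 50, seed: Optional[int] = None):
        self.baseline = baseline_pass_rate
        self.sample_rate = sample_rate
        self.max_drop = max_drop
        self.min_samples = min_samples
        self.scores: Deque[float] = deque(maxlen=window)  # rolling window
        self.rng = random.Random(seed)

    def should_sample(self) -> bool:
        """Decide whether this production call gets scored."""
        return self.rng.random() < self.sample_rate

    def record(self, score: float) -> Optional[str]:
        """Add one scored trace; return an alert message if the window drifted."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return None  # not enough data to judge yet
        current = sum(self.scores) / len(self.scores)
        if self.baseline - current > self.max_drop:
            return f"quality drift: pass rate {current:.2f} vs baseline {self.baseline:.2f}"
        return None


monitor = ProdMonitor(baseline_pass_rate=0.9, window=100, min_samples=50)
healthy = ([1.0] * 9 + [0.0]) * 10   # 100 scores at a 90% pass rate
degraded = [1.0, 0.0] * 50           # 100 scores at a 50% pass rate
healthy_alerts = [a for a in map(monitor.record, healthy) if a]
degraded_alerts = [a for a in map(monitor.record, degraded) if a]
```

The healthy stream produces no alerts; once degraded scores displace enough of the window, alerts start firing, and the sampled traces behind them are candidates for new golden cases.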

Step 7 · Close the loop

Metrics, user signals, and feedback

Eval scores are proxies. The ground truth is whether users are actually served well. And every new production failure is a test case you don’t yet have. How do you tie it together and keep improving?

Design decision: How do you keep the eval system honest and growing?

The call: Correlate eval scores with user signals, and feed new production failures back into the golden set. Dashboards tie eval metrics to real user feedback (validating the proxy), and newly discovered failures from monitoring flow back as fresh eval cases, forming a compounding measurement flywheel.

Metrics + Feedback correlates eval scores with real user signals (thumbs, edits, escalations) to validate that the proxy tracks reality, and feeds new production failures back into the golden set. The loop closes: observability finds failures → they become eval cases → the gate prevents their return → monitoring finds the next ones. Your eval suite compounds into a moat.

Failures become fixtures: The highest-value eval cases are the ones you got wrong in production. Piping them back into versioned datasets means a known failure cannot return without tripping the gate, so measurement gets stronger with every incident.
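
Both halves of this step can be sketched briefly: check that the proxy tracks users by correlating eval scores with a user signal, and turn negatively rated traces into candidate golden cases. The trace fields (`judge_score`, `thumbs_up`), the function names, and the sample data are assumptions for illustration; promoted cases still need a human to write the expected behavior.

```python
from math import sqrt
from typing import Dict, List, Sequence, Set


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(vx * vy)


def promote_failures(traces: List[Dict], existing_inputs: Set[str]) -> List[Dict]:
    """Turn thumbs-down traces into candidate golden cases, skipping known inputs."""
    seen = set(existing_inputs)  # copy so the caller's set is not mutated
    new_cases = []
    for t in traces:
        if t["thumbs_up"] == 0 and t["prompt"] not in seen:
            new_cases.append({"input": t["prompt"], "expected": None,  # needs a human label
                              "source": "production_failure", "trace_id": t["trace_id"]})
            seen.add(t["prompt"])
    return new_cases


# Illustrative sampled traces with a judge score and a user thumbs signal.
traces = [
    {"trace_id": "t1", "prompt": "reset password", "judge_score": 0.9, "thumbs_up": 1},
    {"trace_id": "t2", "prompt": "refund opened item", "judge_score": 0.8, "thumbs_up": 1},
    {"trace_id": "t3", "prompt": "cancel order 123", "judge_score": 0.3, "thumbs_up": 0},
    {"trace_id": "t4", "prompt": "change email", "judge_score": 0.7, "thumbs_up": 1},
    {"trace_id": "t5", "prompt": "cancel order 123", "judge_score": 0.2, "thumbs_up": 0},
]
r = pearson([t["judge_score"] for t in traces], [t["thumbs_up"] for t in traces])
new_cases = promote_failures(traces, existing_inputs={"reset password"})
```

A high correlation suggests the judge score is a usable proxy for what users experience; a low or falling one is a signal to re-validate the scorer before trusting the dashboards.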

The payoff

You built LLM eval & observability

From "is this actually good?" to a measurement system: full tracing, versioned golden datasets mined from production, layered scorers (programmatic + validated judge + human), a CI regression gate, online production monitoring for drift, and a feedback loop where failures become fixtures.

Now consider the meta-failure: trusting the judge without checking it. Wire in an unvalidated LLM-as-judge and every number quietly goes wrong. The judge rewards length and its own style, so you optimize the metric, ship a regression, and the gate stays green while real quality falls. That is why the judge must be calibrated against humans, version-pinned, and spot-checked: an uncalibrated judge corrupts the whole pipeline it feeds.
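
One spot-check for the length bias described here, sketched under assumptions: collect response pairs that human raters scored as equal in quality, ask the judge to pick a winner, and measure how often it picks the longer one. The function name, the `"a"`/`"b"`/`"tie"` verdict format, and the sample pairs are hypothetical. With no length bias the rate should sit near 0.5 on a reasonably large sample.

```python
from typing import List, Tuple


def longer_preference_rate(pairs: List[Tuple[str, str, str]]) -> float:
    """pairs: (response_a, response_b, judge_choice) for human-rated ties.

    judge_choice is "a", "b", or "tie". Returns the fraction of decided
    pairs in which the judge preferred the longer response.
    """
    decided = 0
    picked_longer = 0
    for a, b, choice in pairs:
        if len(a) == len(b) or choice not in ("a", "b"):
            continue  # no length difference, or the judge declared a tie
        decided += 1
        longer = "a" if len(a) > len(b) else "b"
        picked_longer += int(choice == longer)
    if decided == 0:
        raise ValueError("no pairs with a length difference and a decided verdict")
    return picked_longer / decided


# Illustrative verdicts on four pairs humans considered equally good.
pairs = [
    ("Yes.", "Yes, that is correct, and here is some extra context.", "b"),
    ("The limit is 30 days from delivery, as stated in the policy.", "30 days.", "a"),
    ("Use the reset link.", "You can use the reset link sent to your inbox.", "b"),
    ("It ships Monday, barring delays at the warehouse.", "It ships Monday.", "b"),
]
rate = longer_preference_rate(pairs)  # 0.75: longer answer preferred in 3 of 4
```

A rate well above 0.5 on a larger sample is a reason to adjust the judge prompt or rubric and re-run the human calibration before its scores gate anything.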

Two loops: offline evals gate changes; online observability watches production
Tracing: capture every call, the foundation for everything downstream
Golden datasets: versioned, mined from production + edge cases; your real benchmark
Scorers: programmatic + LLM-judge + human, each for the right output
CI gate: evals run on every change and block regressions; prompts as code
Prod monitor: sample and score live traffic for drift offline never sees
Feedback loop: production failures become new eval cases; a compounding moat
The meta-failure: an unvalidated judge silently corrupts every metric; calibrate it