LLM evals are the systematic methods for measuring whether a language-model system is correct, reliable, and improving over time. They range from golden datasets and automated metrics to LLM-as-judge scoring, RAG and agent evals, red-teaming, and online monitoring of real traffic. Good evals are version-controlled and treated as part of the product, not an afterthought.

What is LLM-as-a-judge and when should you use it?

LLM-as-a-judge uses a model to score or compare outputs against a rubric, which scales evaluation of open-ended responses that exact-match metrics can't handle. Use it when outputs are free-form, but control for biases (position, verbosity, self-preference) with a clear rubric, randomized order, and a required rationale before the score.

What is the difference between offline and online evals?

Offline evals run against a fixed dataset before shipping — fast, repeatable, ideal for regression testing. Online evals measure real production traffic with live signals (user feedback, task success, guardrail triggers). You need both: offline to catch regressions pre-release, online to catch what your dataset didn't anticipate.

Is the LLM Evals Interview Handbook free?

Yes. All 51 LLM evals interview questions are free to read, with no sign-up required.

Visual Handbook · 2026

51 Questions · 8 Domains

Interview Preparation & Reference

The LLM Evals Interview Handbook

Golden sets, LLM-as-judge, regression testing,
and the offline-vs-online divide — demystified.

Golden Sets · LLM-as-Judge · Regression Tests · Offline Evals · Online A/B · RAG Evals · Agent Evals · Red-Teaming · Observability

Evals Foundations

Q1–6

Golden Sets & Data

Q7–13

Metrics & Scoring

Q14–19

LLM-as-Judge

Q20–26

Regression & CI

Q27–32

Offline vs Online

Q33–38

Safety & Adversarial

Q39–44

Production & Strategy

Q45–51

Interview-Ready

Tips & Mental Models

Saurabh Singh

AI Engineer & Builder.


Contents

What's Inside

I · Evals Foundations

Q1–6

Q1 Why do LLM evals matter more than traditional ML metrics?

Q2 Evaluation vs testing: what is the distinction?

Q3 What are the main categories of LLM evaluation?

Q4 What does a good evaluation framework look like?

Q5 Why can't you just rely on MMLU and public benchmarks?

Q6 Task-specific vs capability evaluations?

II · Golden Sets & Data

Q7–13

Q7 What is a golden dataset and why is it foundational?

Q8 How do you build a golden set from scratch?

Q9 How many examples do you actually need?

Q10 How do you handle dataset contamination?

Q11 When and how should you version golden sets?

Q12 What makes a bad golden set?

Q13 Sourcing real production traffic for evals?

III · Metrics & Scoring

Q14–19

Q14 Reference-based vs reference-free metrics?

Q15 Why do BLEU / ROUGE / exact-match fail for LLMs?

Q16 How do you measure factuality and hallucinations?

Q17 Embedding-based similarity: BERTScore & beyond?

Q18 How do you score multi-turn conversations?

Q19 What is pass@k and when does it matter?

IV · LLM-as-Judge

Q20–26

Q20 What is LLM-as-judge and why has it become dominant?

Q21 What are the known biases of LLM judges?

Q22 How do you validate an LLM judge?

Q23 Single vs pairwise vs reference-based judging?

Q24 How do you write a good judge prompt?

Q25 Handling cost and latency of LLM judges?

Q26 What is judge-model drift?

V · Regression & CI

Q27–32

Q27 What is regression testing for LLMs?

Q28 How do you design CI-friendly eval suites?

Q29 Assertion-style vs scoring-style regression tests?

Q30 How do you set thresholds and gates?

Q31 Handling non-determinism (temperature, sampling)?

Q32 What is a shadow eval?

VI · Offline vs Online

Q33–38

Q33 The core distinction between offline and online evals?

Q34 What are the tradeoffs of each approach?

Q35 How do online A/B tests work for LLM features?

Q36 What online signals complement offline metrics?

Q37 Closing the loop: online back to offline?

Q38 What is interleaving and when do you use it?

VII · Safety & Adversarial

Q39–44

Q39 How do you evaluate toxicity, jailbreaks, PII?

Q40 Red-teaming vs systematic adversarial evaluation?

Q41 How do you measure robustness to prompt variations?

Q42 How do you evaluate RAG systems specifically?

Q43 How do you evaluate agents and tool-use?

Q44 Capability vs alignment evaluation?

VIII · Production & Strategy

Q45–51

Q45 How do you monitor LLM quality in production?

Q46 Evals vs observability: what's the difference?

Q47 How do you detect and respond to model drift?

Q48 How do you build an eval-driven culture?

Q49 Common interview mistakes to avoid?

Q50 How do you answer "design an eval for X"?

Q51 What one framework ties all of this together?

Part I

Evals Foundations

Before metrics and benchmarks, you need a mental model for why LLM evaluation is fundamentally different — and why most teams get it wrong.

Why Evals Matter

Eval vs Test

Taxonomy

Frameworks

Benchmarks

Capability vs Task

Questions 1–6

Why do LLM evals matter more than traditional ML metrics?

Traditional ML ships a model that outputs a single label or number. You measure accuracy, F1, AUC against a held-out set and you're mostly done. LLM systems output free-form text — there are infinitely many correct answers, and the same prompt can produce different responses each run.

That breaks three assumptions at once. First, you can't enumerate "correct" outputs, so exact-match collapses. Second, the system is a pipeline (prompt + retrieval + model + tools + post-processing) — a regression can come from any layer. Third, quality is multi-dimensional — a response can be factually correct but unsafe, helpful but verbose, on-topic but off-brand.

WHY LLM EVALS ARE HARDER

Traditional ML

Single label output

Deterministic given input

One metric (F1 / AUC)

Labeled test set = truth

Closed-world, quantitative

LLM Systems

Free-form text output

Stochastic by default

Many dimensions at once

No single "ground truth"

Open-world, subjective, pipeline

Add the production reality — silent regressions from prompt edits, model version bumps, vendor deprecations, retrieval index drift — and evals stop being a one-time gate. They become the production system you build around the model.

→ Interview Framing

"Traditional ML evaluation is a measurement problem; LLM evaluation is an engineering problem." Say that sentence and follow with the three assumptions that break (enumerable correct outputs, a single model as the unit under test, a single quality metric). Interviewers remember the frame.

What is the difference between evaluation and testing for LLMs?

They sound interchangeable but they serve different loops. Testing answers a binary — did this specific behavior happen or not? It's pass/fail, written like unit tests, runs in CI, and protects you from regressions on known cases. Evaluation answers a distributional question — across a representative sample of inputs, what is the overall quality of the system?

Dimension	Testing	Evaluation
Question	Did X happen?	How good is the system overall?
Granularity	Single case	Distribution over many cases
Output	Pass / fail	Score, percentile, win-rate
When run	Every commit / PR	Pre-release, weekly, ad hoc
Cardinality	10s–1000s of assertions	100s–10000s of scored cases
Analogy	Unit test	Statistical experiment

A healthy program has both. Testing catches known failures you've already seen in production. Evaluation detects unknown shifts in quality. Conflating them is why teams either over-invest in brittle pass/fail suites that block every release, or under-invest in CI and ship silent regressions.

→ Mental Model

Testing is a tripwire — it fires when something specific breaks. Evaluation is a thermometer — it tells you whether the overall system is getting hotter or colder. You need both, and they belong in different parts of your dev loop.

What are the main categories of LLM evaluation?

There's no single taxonomy, but a useful interview-friendly one is three axes — what you measure, how you measure, and where the data lives.

THE THREE AXES OF LLM EVALUATION

1 · What you measure — Quality dimensions

Correctness, factuality, helpfulness, safety, format adherence, latency, cost.

2 · How you measure — Scoring method

Deterministic (exact match, regex, schema), heuristic (BLEU, ROUGE, BERTScore), model-graded (LLM judge), human review.

3 · Where the data lives — Offline vs online

Offline (curated golden set), online (live user traffic, A/B, implicit signals).

A concrete eval plan is a point on each axis. "I measure factuality (what) using a GPT-4-class judge with citations (how) on a versioned golden set of 500 support tickets (where)." That one sentence is the shape of every good eval.

→ Interview Tip

When asked "how would you evaluate X?" — don't rush to a metric. Name the three axes out loud, pick a point on each, and justify it. This structure alone separates senior from junior answers.

What does a good evaluation framework look like in practice?

A good framework is not a single dashboard — it's a layered system where each layer catches a different class of failure at a different speed and cost.

THE EVAL PYRAMID (FAST → SLOW, CHEAP → EXPENSIVE)

Layer 5 · Human review

Slowest, gold standard. Used to calibrate everything below.

↑

Layer 4 · LLM-as-judge

Scales subjective quality evaluation across thousands of examples.

↑

Layer 3 · Heuristics & embeddings

BERTScore, semantic similarity, toxicity classifiers, citation overlap.

↑

Layer 2 · Deterministic checks

JSON schema, regex, "must mention X", length bounds, tool-call correctness.

↑

Layer 1 · Operational invariants

Didn't 500, didn't exceed token budget, didn't leak secrets. Always-on, free.

The principle — push work down the pyramid. Every failure that a cheap deterministic check can catch should not reach an LLM judge, and every failure an LLM judge can catch should not reach a human. Teams that flip this pyramid (human review for everything) burn money; teams that skip the top (no human calibration ever) drift without knowing.
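A minimal sketch of "push work down the pyramid", assuming a JSON-producing system. The length budget, the secret pattern, the 0.5 pass threshold, and the `judge` callable are all hypothetical placeholders, not part of any real library. The point is only the ordering: cheap layers run first, and the judge is never called for an output they already rejected.

```python
import json
import re
from typing import Callable, Optional

# Illustrative limits and patterns; none of these come from the handbook.
MAX_OUTPUT_CHARS = 2000
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")


def check_invariants(output: str) -> Optional[str]:
    """Layer 1: operational invariants. Returns a failure reason or None."""
    if len(output) > MAX_OUTPUT_CHARS:
        return "output exceeds length budget"
    if SECRET_PATTERN.search(output):
        return "output looks like it contains a secret"
    return None


def check_deterministic(output: str, required_keys: list) -> Optional[str]:
    """Layer 2: deterministic checks (valid JSON object, required fields)."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    if not isinstance(payload, dict):
        return "output is not a JSON object"
    missing = [key for key in required_keys if key not in payload]
    if missing:
        return "missing fields: " + ", ".join(missing)
    return None


def evaluate(output: str, required_keys: list,
             judge: Optional[Callable[[str], float]] = None) -> dict:
    """Run the cheap layers first and call the judge only if they pass."""
    reason = check_invariants(output)
    if reason:
        return {"passed": False, "layer": "invariants", "reason": reason}
    reason = check_deterministic(output, required_keys)
    if reason:
        return {"passed": False, "layer": "deterministic", "reason": reason}
    if judge is None:
        return {"passed": True, "layer": "deterministic", "reason": None}
    score = judge(output)  # hypothetical judge returning a score in [0, 1]
    return {"passed": score >= 0.5, "layer": "judge", "reason": None}
```

Recording which layer produced the verdict also tells you where failures are being caught, which is a quick way to see whether the pyramid is actually doing its job.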

→ Key Insight

A good eval system is less about choosing the best metric and more about choosing the right metric at the right layer. Cheap metrics gate PRs; expensive metrics gate releases; human review calibrates the machine metrics.

Why can't you just rely on public benchmarks like MMLU or HELM?

Public benchmarks measure general model capability — they tell you whether a model is plausibly competent at broad domains. They say almost nothing about whether it will work for your task, your users, your data distribution.

Four structural problems make public benchmarks unreliable as production signals:

WHY PUBLIC BENCHMARKS FAIL IN PRODUCTION

01 · Contamination

Popular benchmarks leak into training data. The model has seen MMLU; the score is inflated.

02 · Distribution mismatch

Your users don't write like benchmark authors. Their tone, jargon, and failure modes differ.

03 · Wrong dimension

MMLU rewards factual recall. Your app may care about tone, safety, JSON adherence — none of which MMLU measures.

04 · No pipeline signal

Benchmarks test a bare model. Your system has retrieval, tools, post-processing — all invisible to MMLU.

Use benchmarks for model selection shortlisting — a 20-point MMLU gap is a real signal. Use your own evals for everything downstream of that.

→ Real-World Use

Say it plainly: "Public benchmarks are useful for deciding which three models to put in my bake-off. My internal eval decides which one ships." Interviewers want to hear that you treat benchmarks as a filter, not an answer.

Task-specific vs capability evaluations — what's the distinction?

Capability evaluations probe an underlying skill the model has or doesn't — arithmetic, code generation, long-context recall, multilingual reasoning. Examples: GSM8K for math, HumanEval for code, needle-in-a-haystack for context retention. They're model-facing — they help you decide "is this model capable enough?".

Task-specific evaluations measure whether the full system does your job — resolve a support ticket, extract fields from an invoice, answer a policy question with the right citation. They're product-facing — they tell you "does this shipped pipeline work for our users?".

CAPABILITY VS TASK EVALS

Capability

Tests the raw model

Generic across products

E.g. GSM8K, HumanEval

Useful for vendor selection

"Can the model do arithmetic?"

Task-specific

Tests the whole pipeline

Unique to your product

E.g. 500 real support tickets

Useful for ship decisions

"Does support-bot resolve tickets?"

They're complementary, not competing. Capability evals tell you why a task eval regressed (the new model lost long-context recall). Task evals tell you whether the regression matters (our prompts never exceed 8k tokens, so we don't care).

→ Mental Model

Capability evals are diagnostic. Task evals are prescriptive. In an interview, explicitly say you'd run both — capability to bound your model shortlist, task-specific to make the final call.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part II

Golden Sets & Datasets

The quality of your evals is capped by the quality of your data. This part covers how to build, size, version, and protect the ground truth that everything else sits on.

Building Golden Sets

Sample Sizing

Contamination

Versioning

Production Traffic

Anti-patterns

Questions 7–13

What is a golden dataset and why is it foundational?

A golden dataset (or "golden set", "eval set", "canonical set") is a curated, versioned, human-reviewed collection of input–expected-output pairs that represents what correct looks like for your task. It's the ground truth every metric, judge, and regression test is ultimately calibrated against.

It's foundational because every other piece of your eval stack inherits its biases. A skewed golden set means a skewed LLM-judge, a misleading regression signal, a misguided A/B readout. If your golden set over-represents English short questions, your whole system will pass evals and still fail on long multilingual queries in production.

WHAT A GOLDEN SET RECORD LOOKS LIKE

    {
      "id": "gs-2026-0142",
      "input": "Cancel my subscription effective next month.",
      "context": { "plan": "pro", "billing_cycle": "monthly" },
      "expected": {
        "intent": "cancel_subscription",
        "effective_date": "next_billing_cycle",
        "tone": "empathetic"
      },
      "tags": ["billing", "cancel", "edge-case"],
      "source": "prod-ticket-redacted",
      "version": "v1.3",
      "reviewer": "alex@"
    }

Note the tags, source, version, and reviewer. A golden set isn't just inputs and outputs — it's provenance. Without that, you can't audit biases, can't slice by failure mode, and can't prove the set wasn't scraped from your own prod logs containing PII.
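One way to enforce that provenance is a small lint step that runs before a record is accepted into the set. This is a sketch only: the helper names `provenance_gaps` and `slice_by_tag` are hypothetical, and the record simply mirrors the example fields shown above.

```python
REQUIRED_PROVENANCE = ("tags", "source", "version", "reviewer")


def provenance_gaps(record: dict) -> list:
    """Return the provenance fields that are missing or empty."""
    return [name for name in REQUIRED_PROVENANCE if not record.get(name)]


def slice_by_tag(records: list, tag: str) -> list:
    """Select the records that carry a given tag, for per-slice scoring."""
    return [r for r in records if tag in r.get("tags", [])]


record = {
    "id": "gs-2026-0142",
    "input": "Cancel my subscription effective next month.",
    "expected": {"intent": "cancel_subscription"},
    "tags": ["billing", "cancel", "edge-case"],
    "source": "prod-ticket-redacted",
    "version": "v1.3",
    "reviewer": "alex@",
}

print(provenance_gaps(record))                 # fields still missing, if any
print(len(slice_by_tag([record], "billing")))  # size of the "billing" slice
```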

→ Key Insight

"Garbage in, garbage out" applies to evals as strongly as to training. If you have 40 hours to invest in your eval stack, spend 30 on the golden set and 10 on everything else.

How do you build a golden set from scratch?

The trap is going straight to labeling. The sequence that actually works is segment → sample → label → stratify → freeze.

GOLDEN SET CONSTRUCTION PIPELINE

Step 01 · Segment: define input types & failure modes

Step 02 · Sample: from prod logs or synthesize

Step 03 · Label: human-author expected output

Step 04 · Stratify: balance tags + difficulty

Step 05 · Freeze: version and lock

Segment means naming the axes your users vary on — intent, language, input length, politeness, edge cases. Sample draws from real traffic (ideal) or synthesized prompts (bootstrap). Label is where a domain expert writes what should have come out — or, for subjective tasks, what a good response would contain. Stratify ensures every segment is present in enough volume to make the metric stable. Freeze the set at v1 and only change it via explicit versioning.

Two common shortcuts to resist — labeling whatever shows up first (you'll over-index on common cases) and using the model's own output as the "expected" (you've just measured self-consistency, not correctness).
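The sample and stratify steps can be sketched in a few lines. This assumes each candidate is a dict that already carries a segment label (the `intent` key in the usage note is hypothetical); it draws a fixed number per segment and reports segments that are too thin, instead of silently taking whatever shows up first.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, key, per_slice, seed=0):
    """Draw up to `per_slice` items from each segment.

    Returns (sample, short), where `short` maps under-filled segments
    to the number of candidates that were actually available.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    buckets = defaultdict(list)
    for item in candidates:
        buckets[item[key]].append(item)

    sample, short = [], {}
    for name in sorted(buckets):
        items = buckets[name]
        if len(items) < per_slice:
            short[name] = len(items)  # flag for targeted collection
        sample.extend(rng.sample(items, min(per_slice, len(items))))
    return sample, short
```

Called as `stratified_sample(pool, "intent", per_slice=25)`, the `short` mapping is the to-do list: those are the segments where you need more real traffic or synthesized prompts before freezing.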

→ Interview Tip

If asked to design a golden set on the spot, say "segment first, sample second" and list 5 segmentation axes for the domain. That single reordering — segment before sample — signals you've done this before.

How many examples do you actually need in a golden set?

Wrong question. The right question is how many examples per slice. A 10,000-item set that's all one category is weaker than a 500-item set with 25 examples across 20 slices.

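A working rule of thumb for binary pass/fail metrics: treat 50–100 examples per slice as a floor, not a guarantee. At 100 examples and a pass rate near 80%, the 95% confidence interval is roughly ±8 points, so resolving a 5-point shift with confidence usually takes several hundred examples per slice or a paired comparison on the same items. For graded scores (1–5), fewer are needed; for rare failure modes (jailbreaks, PII leaks), more.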

GOLDEN SET SIZING BY USE CASE

Smoke test: 20–50

PR regression: 100–300

Release gate: 500–1,500

Model selection: 2,000–5,000

Safety / red-team: 5,000+

Two quick sanity checks — (1) bootstrap your current set and confirm the confidence interval on your headline metric is narrower than the regression you care about; (2) plot per-slice scores and look for slices with wild variance — those are usually under-sampled.
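The first sanity check can be sketched with a plain percentile bootstrap over per-example pass/fail outcomes. The function name and defaults are illustrative, and the percentile method is just one reasonable choice of interval.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the eval set with replacement and recompute the pass rate.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 100 examples, 80 passes: how wide is the interval around 0.80?
outcomes = [1] * 80 + [0] * 20
low, high = bootstrap_ci(outcomes)
print(f"pass rate 0.80, 95% interval [{low:.2f}, {high:.2f}]")
```

If the interval is wider than the regression you want to catch, the set (or that slice of it) is too small for the decision you are asking it to support.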

→ Mental Model

Think about statistical power, not total count. If your metric moves by ±3 points just from re-running, your golden set is too small for a 2-point "improvement" to mean anything.

Q10

How do you handle dataset contamination?

Contamination is when eval data has leaked into training data, so the model "remembers" the answer rather than earning it. Scores go up, real-world capability doesn't. It happens three ways: (1) you published your golden set; (2) the model was trained on scrapes that include your source material; (3) the "expected" answers were produced with help from the model under test, so the labels are circular.

THREE FORMS OF CONTAMINATION

Public leak

Your eval set is on HuggingFace → frontier models saw it in pre-training.

Source leak

You labeled from Wikipedia / Stack Overflow — already in the model.

Loop leak

Labelers used the model to generate "expected" answers — circular.

Defenses: keep a private held-out slice that's never shared with vendors; embed canary strings unique to your dataset so you can detect memorization; label with humans, not the model under test; refresh a rolling fraction of the set every quarter with net-new prompts.
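A rough sketch of the canary idea, under the assumption that you can prompt the model with the first part of the canary and inspect its completion. The helper names are hypothetical and the model call itself is deliberately left out.

```python
import uuid


def make_canary(dataset_name: str) -> str:
    """Build a string that is vanishingly unlikely to occur anywhere else."""
    return f"{dataset_name}-canary-{uuid.uuid4()}"


def split_probe(canary: str, prefix_len: int) -> tuple:
    """Split the canary into a prompt prefix and the suffix to look for."""
    return canary[:prefix_len], canary[prefix_len:]


def looks_memorized(completion: str, expected_suffix: str) -> bool:
    """True if the model's completion reproduces the hidden suffix."""
    return expected_suffix in completion


canary = make_canary("support-gold")
prefix, suffix = split_probe(canary, prefix_len=25)
# Send `prefix` to the model under test, then check its completion:
# looks_memorized(completion, suffix)
```

A model that has never seen the dataset has no way to guess the suffix, so a match is strong evidence that the files carrying the canary ended up in its training data.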

→ Real-World Use

For frontier-model evals, keep a sealed holdout: never send it to an API provider for labeling, never put it in a doc that could be scraped, never commit it to a public repo. That sealed 200-example set is worth more than 5,000 public examples.

Q11

When and how should you version and update golden sets?

Golden sets decay. User behavior shifts, products add features, edge cases are discovered, labels go stale as policies change. A never-updated set silently drifts from reality — your evals pass, your users suffer.

The operating pattern — treat the golden set like code. Semantic versions (v1.2.0), pull requests for additions, changelogs, deprecations. Minor version for "added 50 new examples in segment X", major version when changes break comparability with prior runs.

GOLDEN SET VERSIONING RHYTHM

Weekly · Review prod failures

Pipe incidents, thumbs-down, escalations into a candidate bucket.

Monthly · Add + label

Human review; promote curated items into the next minor version.

Quarterly · Re-stratify

Check distribution against current prod traffic; rebalance slices.

Annually · Major bump

Retire stale examples; re-review every label for drift.

When you bump major, always re-run the prior release model on the new set and record the score — you need that anchor point so historical comparisons stay interpretable.

→ Interview Tip

Strong answer: "Golden sets are code. I version them semantically, PR additions, and keep an anchor rerun on major bumps." This signals you've been on the other side of a stale-set incident.

Q12

What makes a bad golden set — and how do you recognize one?

Most teams ship a bad golden set and don't know it because their metrics look stable. Stable metrics on a bad set are the worst case — confident wrongness.

Smell	What it looks like	What it costs you
Easy mode	Every frontier model scores 95%+	Metric is saturated — can't distinguish models
Skewed	80% of examples are one intent	Head-case wins mask tail-case failures
Stale	No updates in 6+ months	Passes don't predict prod behavior anymore
Self-labeled	"Expected" was generated by the model	You're measuring self-agreement, not quality
Leaked	Available on the public internet	Memorization inflates scores ~5–15 points
Under-slotted	No per-segment tags	Can't diagnose which slice regressed
Solo-authored	One person labeled everything	One person's biases = ground truth

Diagnosis heuristic — if your golden-set score barely moves between a tiny 7B open model and a frontier model, your set is either too easy or contaminated. If it moves by 30 points but production telemetry doesn't change at all, your set is unrepresentative.

→ Mental Model

Run a "weak model sanity check": your cheapest / smallest model should clearly lose. If it ties the frontier model, you don't have an eval — you have a vibes check.

Q13

How do you source real production traffic for evals without breaking privacy?

Production is the best possible eval source, since it is literally your distribution, but it carries PII, compliance, and consent risk. The pattern that works is sample → redact → review → label → promote.

FROM PROD TRAFFIC TO GOLDEN SET

Step 01 · Sample: stratified across users & intents

Step 02 · Redact: PII scrubber + human QA

Step 03 · Review: legal / privacy sign-off

Step 04 · Label: SME authors expected output

Step 05 · Promote: into versioned golden set

Key choices: consent-friendly sampling (users who opted into training data sharing, or internal dogfooding traffic); synthetic twins when you can't use raw data (rewrite the prompt preserving the structure but replacing identifying details); and difficulty-stratified sampling, so you don't only collect what the system already handles well. Sampling biased toward failures reveals the real edge cases.

Also — keep production sampling continuous. A one-time snapshot becomes stale in weeks for an actively-developed product.

→ Real-World Use

A useful split: 60% historical "happy path" prod examples + 30% thumbs-down and escalation examples + 10% synthesized adversarial cases. That mix catches regressions and tail behavior simultaneously.

Part III

Metrics & Scoring

From BLEU to BERTScore to pass@k — which metric actually moves with quality, and which ones just look rigorous without meaning much.

Reference-based

Reference-free

BLEU / ROUGE Limits

Factuality

Embedding Metrics

Multi-turn

pass@k

Questions 14–19

Q14

Reference-based vs reference-free metrics — when to use each?

Reference-based metrics compare a model's output to one or more human-authored "correct" answers. Reference-free metrics score an output on its own merits — grammaticality, factuality, relevance — without needing a gold answer to compare against.

WHEN EACH WINS

Reference-based

Exact match / F1 (QA)

BLEU / ROUGE (translation, summary)

BERTScore vs reference

Use when "correct" is well-defined
and you can afford to label.

Reference-free

Perplexity, fluency scores

Factuality (vs retrieved doc)

LLM-as-judge on rubrics

Use for open-ended generation
where many outputs are valid.

Most production systems use both — reference-based for structured outputs (JSON fields, extracted entities) and reference-free for long-form generation (summaries, chat responses). A common pattern — reference-based for the "must haves" (required fields present), reference-free for the "feel" (tone, helpfulness).

→ Mental Model

Reference-based = "did you match the expected answer?" Reference-free = "is this answer good in isolation?" Both answers can be yes, no, or different — and you usually need to ask both.

Q15

Why do BLEU, ROUGE, and exact-match fail for generative LLMs?

They're all lexical — they compare surface-level n-gram or character overlap with a reference. That works for constrained tasks (translation against a parallel corpus) but breaks down the moment the model is allowed to paraphrase, reorder, or add useful context.

SAME MEANING, DIFFERENT SCORES

REFERENCE: "The meeting was rescheduled to Thursday afternoon."

Candidate 1: "The meeting got moved to Thursday afternoon." · BLEU 0.21 · ✓ correct

Candidate 2: "Thursday afternoon, rescheduled." · BLEU 0.11 · ✓ correct

Candidate 3: "The meeting was rescheduled to Tuesday afternoon." · BLEU 0.88 · ✗ WRONG DAY

Lexical metrics reward surface similarity and punish paraphrase. Worse, they can rank a wrong answer above a correct one when the wrong answer copies more reference words. Exact match is even more brittle — a trailing period or capitalization difference fails a correct answer.
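The ranking inversion is easy to reproduce with a simplified lexical metric. The sketch below uses unigram-overlap F1, which is not BLEU and will not reproduce the BLEU figures quoted above, but it fails in the same way: the wrong-day sentence shares more words with the reference than the correct paraphrase does.

```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase and split into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The meeting was rescheduled to Thursday afternoon."
paraphrase = "The meeting got moved to Thursday afternoon."      # correct
wrong_day = "The meeting was rescheduled to Tuesday afternoon."  # wrong

print(round(unigram_f1(paraphrase, reference), 3))
print(round(unigram_f1(wrong_day, reference), 3))
```

The paraphrase matches 5 of 7 tokens and the wrong answer matches 6 of 7, so any threshold on this score alone would pass the error and could fail the correct response.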

They still have a place — they're cheap, fast, deterministic, useful as a fast tripwire in CI. But they should never be your headline quality metric for free-form generation.

→ Interview Tip

Don't just say "BLEU is bad." Say: "BLEU correlates with quality in narrow tasks with parallel references. For open-ended generation, I'd pair it with a semantic metric and an LLM judge." That calibrated answer reads much better than a dismissal.

Q16

How do you measure factuality and hallucinations?

Factuality = does every factual claim in the output match a trusted source? Hallucination = a factual claim with no such support (or contradicted by one). For RAG systems the source is retrieved context; for open-domain, it's an external knowledge base or a reference answer.

FACTUALITY EVAL PIPELINE

Step 01 · Decompose: split output into atomic claims

Step 02 · Retrieve: gather supporting evidence

Step 03 · Verify: entail / contradict / unsupported

Step 04 · Aggregate: % claims grounded

Key metrics in use —

Metric	What it measures	Best for
Faithfulness	% claims supported by retrieved context	RAG
Answer relevance	% output sentences relevant to the question	QA
Citation precision	% citations that actually support claim	Grounded generation
Citation recall	% claims that have a citation	Grounded generation
Hallucination rate	% outputs with ≥1 unsupported claim	Headline dashboard

The practical move — use an LLM judge with a structured rubric that forces claim-by-claim verification against provided evidence, rather than a single holistic "is this hallucinated?" vote. Decomposition beats gestalt.
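Once a judge has returned one verdict per claim, the aggregate step is simple arithmetic. This sketch assumes verdicts arrive as the strings `supported`, `contradicted`, or `unsupported` (a hypothetical label scheme) and that an output with no factual claims counts as fully faithful, which is a policy choice you may want to make differently.

```python
def faithfulness(claim_verdicts: list) -> float:
    """Share of claims in one output that the evidence supports."""
    if not claim_verdicts:
        return 1.0  # assumption: no factual claims means nothing to contradict
    supported = sum(v == "supported" for v in claim_verdicts)
    return supported / len(claim_verdicts)


def hallucination_rate(outputs: list) -> float:
    """Share of outputs with at least one claim that is not supported."""
    flagged = sum(any(v != "supported" for v in verdicts) for verdicts in outputs)
    return flagged / len(outputs)


# Per-claim verdicts for four outputs, as a judge might return them.
verdicts_by_output = [
    ["supported", "supported"],
    ["supported", "unsupported"],
    ["contradicted"],
    [],
]
print([faithfulness(v) for v in verdicts_by_output])
print(hallucination_rate(verdicts_by_output))
```

Keeping the per-claim verdicts around, not just the aggregate, is what lets you show a reviewer exactly which sentence failed.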

→ Key Insight

"Hallucination" is not a single metric — it's a family. Separate faithfulness (to retrieved context) from factuality (to world knowledge). A model can be 100% faithful to a bad doc and still be factually wrong.

Q17

What are embedding-based similarity metrics (BERTScore, semantic similarity)?

Instead of matching words, embed both output and reference in a vector space and measure how close they are. BERTScore tokenizes both, embeds each token with a pretrained transformer, and greedily matches every token to its most similar counterpart by cosine similarity, which yields precision, recall, and F1. Sentence-level semantic similarity applies the same idea at sentence or paragraph granularity using sentence-transformer embeddings.

LEXICAL VS SEMANTIC SIMILARITY

BLEU: weak, word overlap only

ROUGE-L: longest common subsequence

BERTScore: contextual embeddings

Sent-embedding: semantic, paragraph-level

LLM-judge: can reason about intent

Axis: correlation with human quality judgment (schematic)

Embedding metrics are a real upgrade over BLEU — they handle paraphrase, word order, synonymy. But they have blind spots — they can't tell you if a fact is wrong, can't penalize a fluent hallucination, and are sensitive to the embedding model's training biases.

Treat them as a mid-layer metric — better than BLEU, cheaper than an LLM judge, and directionally useful as a CI gate. Not a replacement for a judge or human review.
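To make the mechanics concrete, here is a toy version of BERTScore-style greedy matching. It takes token vectors as plain lists, so the two-dimensional vectors below stand in for real transformer embeddings, and it omits the idf weighting and baseline rescaling that the full metric offers.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs, ref_vecs) -> float:
    """Match each token to its most similar counterpart, then combine as F1."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ref = [[1.0, 0.0], [0.0, 1.0]]
reordered = [[0.0, 1.0], [1.0, 0.0]]  # same tokens, different order
partial = [[1.0, 0.0], [1.0, 0.0]]    # covers only half of the reference

print(greedy_match_f1(reordered, ref))
print(greedy_match_f1(partial, ref))
```

The reordered candidate scores as a perfect match, which shows both the strength (word order does not matter) and the blind spot: nothing in this computation checks whether the statement is true.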

→ Real-World Use

A cost-aware stack: embedding similarity on every PR (fast, free), LLM judge on nightly builds (slow, expensive), human review monthly (golden calibration). Embedding similarity earns its keep in CI specifically.

Q18

How do you score multi-turn conversations?

Single-turn scoring doesn't survive contact with a conversation. A response can be locally correct but break context from turn 3, or locally off but recover context from turn 2. Three scoring levels operate together —

THREE LEVELS OF CONVERSATION SCORING

Turn-level

Each response scored in its local context. Cheap, easy to localize regressions.

Trajectory-level

Did the conversation keep context, avoid contradictions, stay on task across N turns?

Outcome-level

Did the user's underlying goal get achieved? (task success, resolution, correct booking)

Common failure modes specific to multi-turn — context forgetting (forgetting user preferences stated three turns back), repetition (asking for info already given), goal drift (sliding from the original intent), and sycophancy accumulation (progressively agreeing with incorrect user assertions). Each needs its own tag in the golden set and its own rubric line in the judge prompt.

Practical tip: for scripted multi-turn evals, use simulated users, meaning another LLM role-playing the user with a fixed goal and persona. Pin the simulator's model, prompt, and sampling settings so the same scenario can be replayed across model versions with as little simulator-side variance as possible.

→ Interview Tip

If a question touches chat/agents, volunteer the three-level frame (turn / trajectory / outcome). Many candidates mention only turn-level scoring, and then struggle when asked "but did the user's task actually complete?"

Q19

What is pass@k and when does it matter?

pass@k is the probability that at least one of k independently sampled outputs passes a correctness check. If you sample 10 candidate solutions and any one of them compiles and passes unit tests, pass@10 counts the problem as solved.

pass@k ≈ 1 − (1 − p)^k, where p is the per-sample pass probability

It matters whenever you can cheaply verify and reject bad outputs — code generation (run tests), math (check against a checker), tool use (re-try on error). In those regimes, a model with pass@1=40% but pass@10=85% is genuinely more useful than a model with pass@1=55% but pass@10=60%, because production can sample and filter.
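In practice you rarely know p directly. A common approach, popularized by the HumanEval code benchmark and not described in the text above, is to draw n samples per problem, count the c that pass, and use the unbiased estimator 1 − C(n − c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that passed, k: budget being evaluated.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three problems, 10 samples each, with 4, 0 and 10 passing samples.
results = [(10, 4), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=5))
```

Averaging per-problem estimates matters because pass rates differ across problems: plugging one pooled p into 1 − (1 − p)^k tends to overstate pass@k when a few problems are never solved.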

WHEN pass@k IS AND ISN'T THE RIGHT METRIC

✓ Use pass@k

Code gen with test harness · math with a verifier · structured extraction with schema validation · any pipeline where you can cheaply sample-and-select.

✗ Don't use pass@k

Open-ended chat · creative writing · anything without a cheap automatic verifier. If a human has to pick the best of k, you've just multiplied cost.

Watch for the pass@1 vs pass@k gap — a big gap says the model has the knowledge but sampling is noisy (good candidate for best-of-N, self-consistency, or re-ranking). A small gap says the model is bottlenecked on capability (won't improve with more samples).

→ Mental Model

pass@k is a production-configurability signal, not a pure quality signal. It answers: "if I'm willing to run k samples and pick the best, how well does this model do?" That's a different business question than "how good is a single response?"

Part IV

LLM-as-Judge

The technique that made scalable evaluation possible — and the landmine pattern if you deploy it without calibration. Biases, validation, and the prompts that actually work.

Why It Works

Bias Catalog

Judge Validation

Pairwise vs Single

Prompt Design

Cost Control

Model Drift

Questions 20–26

Q20

What is LLM-as-judge and why has it become dominant?

LLM-as-judge is using a language model (usually a strong one — Claude, GPT-4-class) to score another model's outputs according to a rubric. Instead of writing heuristics or asking humans, you give the judge the input, the output, optionally a reference, and a prompt describing what "good" means.

It became dominant because it hits a sweet spot that nothing else does — scalable like heuristics, nuanced like humans. It handles paraphrase, reasoning, tone, structure. It's available on demand. And — when validated properly — it correlates with human judgment well enough for many product decisions.

THE EVALUATION COST / QUALITY FRONTIER

Exact match

free · brittle

BLEU/ROUGE

cheap · weak

BERTScore

cheap · semantic

LLM-judge

$$ · flexible · scalable

Human review

$$$ · gold standard

Axis: quality of signal (schematic)

Three rules before you trust it — (1) validate against human labels on a representative sample, (2) use a different model as judge than the one under test where possible, and (3) give the judge rubrics, not vibes.

→ Key Insight

LLM-as-judge isn't magic. It's a measurement instrument — it needs calibration, needs to be checked for drift, needs a known error profile. Teams that treat it as ground truth skip all of that and produce metrics that look rigorous but aren't.

Q21

What are the known biases of LLM judges?

This is the single most likely follow-up in an evals interview. Know the catalog cold.

Bias	What happens	Mitigation
Position bias	In pairwise, judge prefers A or B based on order	Randomize order, run both orderings, average
Verbosity bias	Longer answers score higher regardless of quality	Explicit rubric against length; normalize
Self-preference	Judge prefers outputs from its own model family	Use different family; validate with humans
Sycophancy	Judge agrees with leading language in the prompt	Neutral rubric, no "this looks good, rate it"
Authority bias	Confident tone scored higher than hedged	Separate confidence from correctness in rubric
Anchoring	First score sets expectation for later ones	Independent judgments; no streaming context
Format bias	Bulleted/Markdown answers beat plain prose	Instruct judge to ignore formatting
Refusal tolerance	Judge lets a refusal pass when it shouldn't	Add "did the model actually answer?" to rubric

None of these go away completely. The goal is to know them, measure their impact on your task, and apply targeted mitigations. A 2-percentage-point position bias is fine when the quality gap you are trying to resolve is 15 points; it's a disaster when the gap is 3 points.
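The position-bias mitigation in the table (run both orderings, combine) can be sketched as below. The `judge` callable and its return values ("first", "second", "tie") are hypothetical; a real judge would wrap a model call. The second helper estimates how often the verdict depends on order alone.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]

def debiased_pairwise(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge in both orders; return 'A', 'B', or 'tie' when the two orderings cancel out."""
    forward = judge(question, a, b)  # A shown first
    reverse = judge(question, b, a)  # B shown first
    votes_a = (forward == "first") + (reverse == "second")
    votes_b = (forward == "second") + (reverse == "first")
    if votes_a > votes_b:
        return "A"
    if votes_b > votes_a:
        return "B"
    return "tie"

def order_sensitivity(judge: Judge, items: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) items whose verdict changes when the order is swapped."""
    consistent = {("first", "second"), ("second", "first"), ("tie", "tie")}
    items = list(items)
    flips = sum((judge(q, a, b), judge(q, b, a)) not in consistent for q, a, b in items)
    return flips / len(items)
```

A judge that always prefers the first slot yields "tie" from `debiased_pairwise` and an `order_sensitivity` of 1.0, which is the signal that its raw verdicts should not be trusted.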

→ Interview Tip

Memorize three: position, verbosity, self-preference. If asked "what's a failure mode of LLM-as-judge?", hit all three in two sentences and you've instantly cleared the senior bar.

Q22

How do you validate an LLM judge?

A judge that hasn't been validated is a random number generator with vocabulary. Validation means showing that the judge's scores correlate with human judgment on your task.

LLM-JUDGE VALIDATION LOOP

Step 01 · Sample: 100–300 outputs from golden set

Step 02 · Human label: ≥2 raters, measure IAA

Step 03 · Judge label: same outputs, same rubric

Step 04 · Agreement: κ, Pearson, % match

Step 05 · Iterate: refine rubric, re-validate

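Target judge-human agreement at or near human-human agreement. If two humans agree 85% of the time on your rubric, a judge that agrees with humans 85% of the time is performing about as well as another human rater. If judge-human agreement is substantially lower than human-human, the rubric is ambiguous or the judge is wrong for this task. Report a chance-corrected statistic such as Cohen's κ alongside raw % match, since raw agreement is inflated when one label dominates.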

Quick sanity tests beyond top-line agreement — slice-level agreement (does the judge disagree with humans disproportionately on one failure mode?); edge-case probes (give the judge deliberately bad outputs — does it catch them?); ordering robustness (same items in different orders — same scores?).
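The agreement step can be sketched in a few lines of plain Python: raw percent agreement, Cohen's κ (which corrects for chance agreement), and a per-slice breakdown for the slice-level sanity test. The label lists and slice tags are assumed inputs; nothing here is tied to a specific eval library.

```python
from collections import Counter, defaultdict

def percent_agreement(human: list, judge: list) -> float:
    """Share of items where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_e = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

def agreement_by_slice(human: list, judge: list, slices: list) -> dict:
    """Percent agreement per slice tag, to spot failure modes the judge handles badly."""
    grouped = defaultdict(lambda: ([], []))
    for h, j, s in zip(human, judge, slices):
        grouped[s][0].append(h)
        grouped[s][1].append(j)
    return {s: percent_agreement(hs, js) for s, (hs, js) in grouped.items()}
```

The same `cohens_kappa` call on two human raters gives the human-human baseline that the judge is compared against.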

→ Mental Model

Judge validation is an upfront investment that pays back on every eval run afterwards. Spend 1–2 days calibrating the judge, then let it do months of work, re-validating only when the judge model or the rubric changes. Skipping this step is the #1 reason eval programs lose credibility internally.

Q23

Single vs pairwise vs reference-based judging — when to use each?

Three modes. They solve different problems and have different noise profiles.

Mode	Question it answers	Best for	Main bias
Single (pointwise)	How good is this output on a scale of 1–5?	Tracking quality over time	Calibration drift, verbosity
Pairwise	Is A better than B?	Model comparison / A-B preference	Position bias
Reference-based	Does this match the expected answer?	When gold answer exists	Over-penalizing valid paraphrase

DECISION TREE

Do you have a gold answer?

Yes → reference-based. No → keep asking.

Are you comparing two models?

Yes → pairwise (more reliable than single + subtract).

Tracking one system over time?

Yes → single with a stable rubric (same judge + rubric month-over-month).

A pragmatic shortcut — pairwise is your most reliable signal when you have two systems. Humans are better at "A or B?" than at "rate 1–5", and so are LLMs. If you only need to ship one of two variants, don't over-engineer — run pairwise, randomize order, done.

→ Interview Tip

If the interviewer describes a "did the new model beat the old model?" scenario, immediately say "pairwise, with position randomization." That one phrase answers two questions at once — mode and bias mitigation.

Q24

How do you write a good judge prompt?

A vague prompt produces a vague judge. A good judge prompt has five components, each doing specific work —

ANATOMY OF A JUDGE PROMPT

1 · Role & task framing

"You are an expert evaluator of customer support replies…"

2 · Rubric with explicit criteria

Named axes (correctness, tone, actionability) with definitions + examples at each level.

3 · Anti-bias instructions

"Do not reward length. Do not reward formatting. Judge content only."

4 · Chain-of-thought before score

Reasoning first in a "reasoning" field; score last. This materially improves accuracy.

5 · Structured output

JSON schema: {reasoning, per-axis scores, overall, confidence}. Parseable, auditable.

Two high-leverage tweaks — few-shot with borderline examples (1 clearly good, 1 clearly bad, 2 borderline, each with rationale) anchors the rubric far better than definitions alone; ask for confidence — a judge's low-confidence items are exactly where you should route human review.

Avoid "rate this on 1–10 overall." That's vibes, not eval. Avoid single-number scales without anchors. Avoid asking the judge to rank more than 2 items in one call; reliability tends to drop as the list grows.
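To make the five components concrete, here is one possible shape for a judge prompt and its output parser. The axis names come from the anatomy above; the one-line axis definitions, the 1-5 scale, and the helper names `build_prompt` and `parse_verdict` are illustrative assumptions. The parser rejects malformed verdicts instead of silently scoring them.

```python
import json

AXES = ("correctness", "tone", "actionability")

# Literal braces in the JSON example are doubled because str.format is used below.
JUDGE_PROMPT = """You are an expert evaluator of customer support replies.

Rubric (score each axis from 1 to 5; definitions are illustrative):
- correctness: 1 = factually wrong, 5 = fully correct
- tone: 1 = rude or dismissive, 5 = professional and empathetic
- actionability: 1 = no next step, 5 = clear next step the user can take

Do not reward length. Do not reward formatting. Judge content only.

Write your reasoning first, then the scores. Return only JSON:
{{"reasoning": "<string>", "scores": {{"correctness": <int>, "tone": <int>, "actionability": <int>}}, "overall": <int>, "confidence": <float 0-1>}}

User message:
{question}

Reply to evaluate:
{answer}
"""

def build_prompt(question: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, answer=answer)

def parse_verdict(raw: str) -> dict:
    """Parse and validate the judge's JSON; raises ValueError on anything malformed."""
    verdict = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    reasoning = verdict.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("missing reasoning")
    scores = verdict.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("missing per-axis scores")
    for name, value in [(a, scores.get(a)) for a in AXES] + [("overall", verdict.get("overall"))]:
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5")
    confidence = verdict.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return verdict
```

Items whose parsed `confidence` falls below a threshold you choose are natural candidates for the human-review queue.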

→ Real-World Use

Version-control judge prompts exactly like code. A 2-line change to the rubric can shift scores by 10 points. You need a diff log so when a metric moves, you can tell whether the model changed or the judge did.

Q25

How do you handle cost and latency of LLM judging at scale?

A frontier-model judge on 10,000 examples is real money — and slow. The cost envelope dictates how often you can run evals, which dictates how fast your dev loop is. Six practical levers —

COST / LATENCY LEVERS FOR JUDGES

01 · Tiered judges

Cheap fast judge on 100% · frontier judge only on disagreement or critical slices.

02 · Smaller specialized judges

Distill or fine-tune a small model on expensive-judge labels for recurring rubrics.

03 · Batching + caching

Prompt caching on rubric. Async batch API for nightlies. Only re-judge changed outputs.

04 · Sampling & stratification

Don't judge all 10k every PR — judge a stratified 500. Full run weekly.

05 · Cheap pre-filter

Deterministic / embedding checks first · send only survivors to the judge.

06 · Short rubric, short output

Compact rubric; structured JSON with minimal reasoning. Every token matters at scale.

Rule of thumb — if your eval run costs more than one engineer-hour, people stop running it. Optimize aggressively to keep it cheaper than that. Cheap evals that run 10× a day beat rigorous evals that run once a week.
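Lever 04 (judge a stratified subset on every PR) might look like the sketch below. Proportional allocation with a per-slice minimum is one reasonable choice, not the only one; the `slice` field, the minimum of 5, and the fixed seed are assumptions for illustration. Because of rounding and the minimum, the returned size can differ slightly from `total`.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, total, min_per_slice=5, seed=0):
    """Sample roughly `total` items, proportionally per slice, with a floor per slice."""
    rng = random.Random(seed)  # fixed seed so every PR judges the same subset
    by_slice = defaultdict(list)
    for item in items:
        by_slice[key(item)].append(item)
    sample = []
    for name in sorted(by_slice):
        group = by_slice[name]
        share = round(total * len(group) / len(items))
        take = min(len(group), max(min_per_slice, share))
        sample.extend(rng.sample(group, take))
    return sample
```

Keeping the seed fixed makes PR-to-PR scores comparable; rotating it on the weekly full run avoids overfitting to one subset.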

→ Real-World Use

Prompt caching on the judge rubric alone can cut cost by 60–80% when the rubric dominates the prompt, because the rubric is identical across every call. It's the first optimization that pays for itself the day you ship it.

Q26

What is judge-model drift — and how do you detect it?

The judge is itself an LLM served by a vendor. When the vendor updates or deprecates the judge model, the same outputs can receive different scores — even with identical rubric and code. Your quality trend chart moves and it's not the system under test; it's the judge.

WHY JUDGE DRIFT IS PARTICULARLY DANGEROUS

Silent

Scores change without any change in your codebase or golden set. Attribution is ambiguous.

Compounding

Every downstream decision (ship/no-ship, model rollback) uses a drifted reference point.

Defenses — pin the judge model version explicitly (never use "latest"); maintain an anchor set of 50–100 outputs with known human labels, re-run them whenever the judge model changes, and measure if judge-human agreement still holds; monitor per-slice score distributions over time — a sudden shift without a code change is the classic judge-drift fingerprint.

When a new judge model version does launch — don't blindly migrate. Re-validate on the anchor set, publish a "calibration delta", and only then switch. Historical scores before and after should be marked with a judge-version label so comparisons stay honest.
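A sketch of the anchor-set check and the "calibration delta" it produces. The labels are assumed to be categorical and aligned by index; the 2-point maximum drop in `safe_to_migrate` is an arbitrary illustrative threshold, not a recommendation from this handbook.

```python
def agreement(human: list, judge: list) -> float:
    """Share of anchor items where the judge label matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def calibration_delta(human: list, old_judge: list, new_judge: list) -> dict:
    """Compare the pinned judge and a candidate judge version on the same anchor set."""
    old = agreement(human, old_judge)
    new = agreement(human, new_judge)
    changed = sum(o != n for o, n in zip(old_judge, new_judge)) / len(human)
    return {
        "old_agreement": old,
        "new_agreement": new,
        "delta": new - old,          # negative means the new judge agrees less with humans
        "labels_changed": changed,   # how much historical scores would shift
    }

def safe_to_migrate(report: dict, max_drop: float = 0.02) -> bool:
    """Illustrative gate: allow the switch only if agreement did not drop by more than max_drop."""
    return report["delta"] >= -max_drop
```

Storing the returned report next to the judge-version label gives later readers the context for any step change in the trend chart.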

→ Key Insight

The judge is infrastructure, not a free oracle. Version it, monitor it, re-validate it. Teams that forget this ship regressions caused by the judge itself and blame their product code for months.

Part V

Regression & CI

Turning evals into tripwires that catch regressions before they ship — CI suites, thresholds, non-determinism, and shadow evaluation.

What Regresses

CI Design

Assertion vs Score

Thresholds

Non-Determinism

Shadow Evals

Questions 27–32

Q27

What is regression testing for LLMs — and what actually regresses?

Regression testing means detecting when a change to any part of the stack makes quality worse on cases that previously worked. For LLMs it's broader than unit testing because the "change" can come from many places.

WHAT CAN CAUSE A REGRESSION

01 · Prompt change

Someone tweaks a system prompt to fix one bug, breaks three others.

02 · Model version bump

Vendor updates the model — same version string, different behavior.

03 · Retrieval change

New embedding model, new chunking, new index — all invisible to the LLM.

04 · Tool schema change

An API a tool calls changes response shape → agent fails silently.

05 · Post-processing

A parser is tightened and now rejects outputs that used to pass.

06 · Dependency drift

A library upgrade changes tokenization, truncation, retry behavior.

The implication — regression tests must exercise the full pipeline, not just the model. A regression on raw-model output is diagnostic; a regression on end-to-end output is what users experience.

→ Mental Model

Every regression story ends "…and we didn't catch it because our eval only tested X in isolation." Your eval must test the same surface the user hits.

Q28

How do you design CI-friendly eval suites?

CI wants three properties — fast, cheap, deterministic-enough. Most LLM evals are none of those. The design pattern is tiering.

TIERED CI EVAL SUITE

Tier 1 · Pre-commit (seconds)

Lint, schema checks, 10–30 deterministic assertions. No model calls.

Tier 2 · PR CI (minutes)

~100 golden-set items, embedding + heuristic metrics, cheap judge on a subset.

Tier 3 · Nightly (~1 hour)

Full golden set, strong LLM judge, all slices, regression vs last green run.

Tier 4 · Release gate

Extended set, safety + red-team, pairwise vs currently-shipped model, human spot check.

Rules that keep CI fast: cache model calls keyed on (prompt-hash, model-version, temperature); parallelize across examples; fail fast on deterministic tiers before spending judge budget. And, critical for keeping devs shipping, block merges only on Tier 1–2 failures. Tier 3 regressions page you; Tier 4 is for release decisions.
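The caching rule can be sketched as a thin wrapper around whatever function calls the model. `call_model` is a hypothetical callable taking a prompt and returning text; the in-memory dict stands in for a persistent cache. Reusing cached outputs only makes sense where you want one fixed sample per prompt, typically the temperature-0 PR tier.

```python
import hashlib
import json

def cache_key(prompt: str, model_version: str, temperature: float) -> str:
    """Stable key over (prompt hash, model version, temperature)."""
    payload = json.dumps(
        {
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "model_version": model_version,
            "temperature": temperature,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CachedModel:
    """Wraps a model-calling function; identical requests hit the cache instead of the API."""

    def __init__(self, call_model, model_version: str, temperature: float = 0.0):
        self._call = call_model
        self.model_version = model_version
        self.temperature = temperature
        self._cache = {}
        self.misses = 0

    def __call__(self, prompt: str) -> str:
        key = cache_key(prompt, self.model_version, self.temperature)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._call(prompt)
        return self._cache[key]
```

Because the model version is part of the key, bumping the pinned version invalidates the cache automatically.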

→ Interview Tip

Strongest answer: "Tier 1 and 2 must complete in under 5 minutes or engineers will bypass them. Tier 3 runs nightly. That pacing is non-negotiable." Shows you've lived through devs disabling slow CI.

Q29

Assertion-style vs scoring-style regression tests — what's the difference?

Assertion-style tests have a single, well-defined pass/fail condition on a single example. "When asked to cancel, the response must contain a confirmation token." They're unit-test shaped — one failure, one owner, one line of output.

Scoring-style tests run an evaluator over many examples and compare an aggregate score to a baseline. "On the 500-item support set, factuality score ≥ 0.87." They detect distribution shifts no individual assertion would catch.

TWO SHAPES OF REGRESSION TESTS

Assertion-style

Pass/fail, single case

Named after the bug it prevents

Lives forever (regression prevention)

Blocks PRs instantly

Good for known failure modes

Scoring-style

Aggregate score vs threshold

Across stratified samples

Tracked over time (trend)

Flags drift, not defects

Good for unknown distribution shifts

A mature program has both. Every production incident should spawn one new assertion test — this is how "technical debt that caused the incident" becomes "regression that can't recur." Meanwhile, scoring tests ride the distribution.
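The two shapes side by side, written as plain test functions that a runner such as pytest could collect. `support_bot`, the confirmation token, the two-item golden set, and the `factuality` scorer are all stand-ins invented for this sketch; a real scorer would be a judge or metric.

```python
from statistics import mean

def support_bot(message: str) -> str:
    """Hypothetical system under test."""
    if "cancel" in message.lower():
        return "Your subscription is cancelled. Confirmation: CANCEL-OK"
    return "Thanks for reaching out. Here is what I found."

# Illustrative golden set and thresholds.
GOLDEN_SET = [
    {"input": "Please cancel my plan", "must_contain": "cancelled"},
    {"input": "Where is my invoice?", "must_contain": "found"},
]
BASELINE = 0.87
TOLERANCE = 0.02

def factuality(example: dict, output: str) -> float:
    """Toy per-example scorer in [0, 1]."""
    return 1.0 if example["must_contain"] in output else 0.0

def test_cancel_returns_confirmation_token():
    # Assertion-style: one case, one pass/fail condition, named after the bug it prevents.
    assert "CANCEL-OK" in support_bot("Please cancel my plan")

def test_factuality_does_not_regress():
    # Scoring-style: aggregate over many examples, compared with a baseline minus tolerance.
    score = mean(factuality(ex, support_bot(ex["input"])) for ex in GOLDEN_SET)
    assert score >= BASELINE - TOLERANCE
```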

→ Real-World Use

"Every incident produces an assertion test." Name this norm explicitly in interviews. It's what separates teams that learn from incidents from teams that keep rediscovering the same bugs.

Q30

How do you set thresholds and gates for passing builds?

Naive thresholds are "score must be ≥ 0.9." They fail two ways — too tight (every noisy run is red), or too loose (a 3-point regression slides through). The good pattern is relative, not absolute.

FOUR THRESHOLD PATTERNS

Absolute floor

Score ≥ X. Simple, brittle. Use only for hard safety constraints.

Delta vs baseline

Score ≥ baseline − tolerance. Main pattern. Tolerance from bootstrap CI.

Per-slice delta

No slice may drop by >Y. Prevents average hiding targeted regressions.

Statistical significance

Fail only if the score drop is significant at p < 0.05 in a paired test over examples.

Practical rule: set the tolerance to the half-width of the 95% confidence interval of your eval score. Bootstrap your golden set and repeat runs to measure sampling and run-to-run variance; a drop larger than that interval is unlikely to be noise. Tight thresholds without calibration produce red builds on noise; calibrated thresholds produce red builds on actual regressions.

Separate safety from quality. Safety thresholds are absolute, non-negotiable (zero PII leaks, zero jailbreak passes). Quality thresholds are delta-based and can negotiate.
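One way to turn the delta, per-slice, and significance patterns into code is a paired bootstrap over per-example score differences. This is a sketch under assumptions: scores are aligned per example between baseline and candidate, a percentile bootstrap stands in for the "paired test", and the gate fails only when the whole 95% interval of the mean delta sits below zero.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def delta_gate(baseline_scores, candidate_scores):
    """Paired comparison: candidate minus baseline, per example."""
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    lo, hi = bootstrap_ci(deltas)
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "ci": (lo, hi),
        "passed": hi >= 0,  # fail only if even the optimistic end of the interval is a drop
    }

def slice_gate(baseline_by_slice, candidate_by_slice, max_drop):
    """Per-slice delta: names of slices whose mean score dropped by more than max_drop."""
    def avg(xs):
        return sum(xs) / len(xs)
    return [
        s for s in baseline_by_slice
        if avg(baseline_by_slice[s]) - avg(candidate_by_slice[s]) > max_drop
    ]
```

Safety checks would not go through this gate at all; they keep their absolute floors.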

→ Key Insight

Quality thresholds should never be "score ≥ 0.87." They should be "score ≥ baseline − 1.5 × noise", with the noise measured from your own runs. Otherwise you're picking an arbitrary number and pretending it means something.

Q31

How do you handle non-determinism (temperature, sampling) in tests?

Hosted LLMs can be non-deterministic even at temperature 0: batching, floating-point ordering, and hardware differences vary between runs, and vendors don't guarantee bit-identical outputs. That makes traditional "expected == actual" testing unreliable. Three stabilization strategies cover most cases:

STABILIZATION STRATEGIES FOR NON-DETERMINISM

Lower variance

Temperature 0, fixed seed (if supported), stable rubric. Reduces but doesn't eliminate noise.

Semantic tests

Assert properties (contains X, valid JSON, within bounds) — not exact strings.

N-sample aggregation

Run each example N times, report mean + CI. Thresholds on distribution, not single run.

Pattern for CI — run each PR-tier example once at temperature 0, and each nightly-tier example 3–5 times with temperature matching production. PR tier gets fast signal; nightly tier gets statistical confidence.

And — don't fight non-determinism by trying to assert exact strings with temperature 0. It's a losing battle against vendor-side changes. Design tests that would pass for a human writing the same answer differently.
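Property assertions and N-sample aggregation can be combined as in this sketch. The `refund_amount` field and its 0-500 bound are invented for illustration, and `sample_fn` is a hypothetical zero-argument callable that returns one model output. The interval uses a normal approximation, which is rough at the 3-5 samples suggested for the nightly tier.

```python
import json
from statistics import mean, stdev

def check_properties(output: str) -> bool:
    """Assert properties, not exact text: valid JSON, required key, value within bounds."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    amount = data.get("refund_amount")
    return isinstance(amount, (int, float)) and 0 <= amount <= 500

def pass_rate_with_ci(sample_fn, n: int):
    """Run the same example n times; return (pass rate, approximate 95% half-width)."""
    results = [1.0 if check_properties(sample_fn()) else 0.0 for _ in range(n)]
    rate = mean(results)
    half_width = 1.96 * stdev(results) / n ** 0.5 if n > 1 else float("nan")
    return rate, half_width
```

Two differently worded outputs that both carry a valid `refund_amount` pass the same check, which is the behavior an exact-string assertion cannot give you.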

→ Interview Tip

"Even at temperature 0, outputs aren't reproducible across inference runs at scale — vendors don't guarantee it. So tests assert properties, not exact text." That sentence alone is senior-level evidence.

Q32

What is a shadow eval — and when do you use one?

A shadow eval runs the new candidate model or configuration alongside the currently-shipped one, on real production traffic, without affecting user experience. The new system's outputs are captured and scored offline; only the shipped system's outputs reach users.

SHADOW EVALUATION ARCHITECTURE

User request → Router → Shipped model → returned to user

Router → Candidate model → logged only → Offline judge → Comparison dashboard

When to use shadow — before any A/B test touches users. A shadow run tells you if the candidate produces obviously worse outputs on your real distribution. It catches prompt regressions, latency blowups, schema breaks, cost explosions, and policy violations without user exposure.

Key constraints — shadow adds real cost (you're paying for 2× inference), and for multi-turn or state-changing flows it's hard (shadow can't actually execute the user's booking). Mitigation — sample a fraction of traffic, not 100%. A 5% shadow is usually enough to surface showstoppers.
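A minimal sketch of the router with fractional shadow sampling. `shipped` and `candidate` are hypothetical callables from prompt to response, and the in-memory `log` stands in for whatever store feeds the offline judge. In a real service the candidate call would run off the request path (a queue or background task) so it cannot add user-facing latency; it is inline here only to keep the example short.

```python
import hashlib

def in_shadow_sample(request_id: str, fraction: float) -> bool:
    """Deterministic sampling: hash the request id into [0, 1) and compare with the fraction."""
    if fraction >= 1.0:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction

class ShadowRouter:
    def __init__(self, shipped, candidate, fraction: float = 0.05):
        self.shipped = shipped
        self.candidate = candidate
        self.fraction = fraction
        self.log = []  # records scored offline later

    def handle(self, request_id: str, prompt: str) -> str:
        response = self.shipped(prompt)  # the only output the user ever sees
        if in_shadow_sample(request_id, self.fraction):
            try:
                shadow = self.candidate(prompt)
            except Exception as exc:  # candidate failures must never reach the user
                shadow = f"<candidate error: {exc!r}>"
            self.log.append({
                "request_id": request_id,
                "prompt": prompt,
                "shipped": response,
                "candidate": shadow,
            })
        return response
```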

→ Real-World Use

Every model or prompt change should run through shadow eval on real traffic before A/B. Shadow catches "this breaks production" cheaply; A/B answers "does this move business metrics" expensively. In that order.

Part VI

Offline vs Online

The most important split in LLM evaluation. Offline lets you iterate fast and safely; online tells you if any of it actually matters to users. You need both, and you need to close the loop between them.

Core Distinction

Tradeoffs

A/B Testing

Implicit Signals

Closing the Loop

Interleaving

Questions 33–38

Q33

What's the core distinction between offline and online evaluation?

Offline evaluation runs on a fixed, curated dataset with scored outputs — no users involved. Results are reproducible, cheap, and fast. You control the distribution and the judge. Online evaluation measures the system as it serves real users, typically via A/B tests, implicit signals (clicks, thumbs, retention), or explicit feedback.

OFFLINE VS ONLINE

Offline

Fixed golden set

Reproducible, cheap

Proxy metrics (judge, F1)

Minutes to hours to run

No user risk

"Would this change be good?"

Online

Live user traffic

Ground-truth business metrics

Behavior signals (click, retention)

Days to weeks for stat sig

User exposure

"Did this change actually help users?"

The cognitive frame — offline measures capability, online measures value. A change can improve offline metrics and not move the needle online (users didn't notice). It can also degrade offline metrics and improve online ones (the eval was measuring the wrong thing). Both outcomes happen routinely and both are informative.

→ Mental Model

Offline is the lab, online is the field. A new drug passes lab tests for months before entering clinical trials; you wouldn't ship a model to users without offline proof, and you wouldn't trust offline proof alone to know if it worked.

Q34

What are the tradeoffs of each approach?

Dimension	Offline	Online
Speed	Minutes to hours	Days to weeks (stat sig)
Cost	Judge tokens + data curation	User exposure + infra + analysis
Reproducibility	High (fixed set)	Low (traffic changes daily)
Signal type	Proxy metrics	Ground-truth business metrics
Coverage	Only what you curated	Real distribution incl. tail
Risk	None — no users	User harm, revenue, brand
Good for	Iteration, regression, ship/no-ship	Final validation, value discovery

The classic failure modes on each side —

WHAT GOES WRONG WITH EACH

Offline-only pitfalls

Goodhart — gaming the judge

Distribution mismatch

Ship, users don't notice

You're optimizing the wrong thing.

Online-only pitfalls

Too slow — can't iterate

Attribution is fuzzy

Regressions reach users first

You learn only after harm.

Neither wins. The point is to triangulate — use offline to gate and iterate, online to validate and discover, each informing the other's design.

→ Interview Framing

"Offline is fast and cheap but a proxy. Online is slow and expensive but real. A good program uses offline to gate shipping and online to decide if the shipping was worth it." Memorize.

Q35

How do online A/B tests work for LLM features?

An A/B test randomly splits users between control (current system) and treatment (new system) and measures downstream outcome metrics. For LLM features, three twists matter —

LLM A/B TEST DESIGN

1 · Randomize at the user level

Not per-request. Same user should see the same variant to avoid confusion & contamination.

2 · Choose guardrail + north-star metrics

North star: task success, resolution, retention. Guardrails: safety incidents, latency, cost.

3 · Watch for novelty effects

New UX or tone gets clicked more in week 1 just because it's new. Run ≥ 2 weeks.

Sample size is non-trivial. LLM outputs are high-variance per user — you often need 10×–100× the users you'd need for a simple UI change. Pre-compute the needed sample with a power calculation based on observed per-user variance in your task.
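Two pieces of the design above in code form: sticky user-level assignment, and a rough power calculation. The calculation assumes a binary per-user outcome (for example, task success) and a two-sided two-proportion z-test; metrics with heavier per-user variance need a variance-based formula instead. Function names and the experiment key are illustrative.

```python
import hashlib
from math import ceil
from statistics import NormalDist

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Hash (experiment, user) so the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"

def users_per_arm(p_control: float, p_treatment: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect p_control vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)
```

Running `users_per_arm` before launch also fixes the stopping point in advance, which is the simplest defense against peeking.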

Common pitfalls — peeking at results before stat sig and calling winners early; single-metric tunnel vision (north-star up, safety quietly down); and running A/B before any offline validation so a dangerous regression gets user exposure.

→ Real-World Use

Always run offline gate → shadow eval → 5% canary → 50% A/B → 100%. Jumping from "passed offline" to "50% A/B" is how regressions reach users at scale.

Q36

What online signals complement offline metrics?

Online signals split into explicit (user volunteered) and implicit (inferred from behavior). Implicit signals dominate because they require no user effort and scale.

ONLINE SIGNAL CATALOG

Explicit

👍 / 👎, star ratings, free-text feedback, escalation-to-human, conversation-rating surveys.

Implicit

Task completion, session length, return rate, copy events, retry rate and "regenerate" clicks, abandonment, time-to-resolution.

Operational

Latency p95, error rate, tool-call failure rate, token spend, retries, fallback hits.

Safety

Policy violations, PII surfaces, jailbreak attempts, abuse reports, legal escalations.

Sobering fact: explicit signals are biased. Users who rate are disproportionately those who are delighted or furious. A 4.8 average in-product rating is compatible with a silent 30% regression if middling users don't rate at all. Implicit signals close that gap: they capture what silent users do.

→ Key Insight

Your most valuable online signal is usually retry rate or regenerate clicks. It's far less self-selected than ratings, captures dissatisfaction without asking the user for anything, and moves fast enough to matter. Instrument it before you instrument a thumbs-up widget.

Q37

How do you close the loop — from online signal back to offline dataset?

The loop that separates serious eval programs from theatre — production failures flow back into the golden set so they can be caught offline next time.

THE CLOSED-LOOP EVAL FLYWHEEL

Step 01 · Detect online: 👎, retries, escalations

Step 02 · Triage: cluster, tag failure mode

Step 03 · Label: SME authors expected output

Step 04 · Promote: new version of golden set

Step 05 · Protect: CI catches it next time

The practical mechanism — a weekly triage where PM + engineer go through prod failures, cluster them, pick 10–20 to promote, and add them with tags. Over a year, this grows a golden set that actually mirrors your production distribution rather than what you imagined it would be on day one.

Without this loop, offline evals stay frozen at whatever they were when you shipped v1. With this loop, they compound and your system gets monotonically harder to regress.

→ Interview Tip

If asked "how would you improve an existing eval program?", lead with closing the loop. "Every online failure becomes an offline test case within a week." That's the single highest-leverage process intervention you can name.

Q38

What is interleaving and when should you use it?

Interleaving is an alternative to classic A/B where, for a single user request, you mix outputs from two variants and measure which one the user prefers through direct behavior (click-through, copy, selection). Common in search ranking, increasingly applied to LLM response selection and citation ranking.

INTERLEAVING VS A/B

A/B

User sees one variant

Compare aggregate metrics

Need lots of users

Slow, rigorous

Interleaving

User sees both (mixed)

Within-user preference

~10× more sample-efficient

Fast, narrow applicability

When it works — ranked lists (search results, citations, suggestions) where "user picked from variant X" is a clean signal. When it doesn't — single-response chat, where you can't show two answers side-by-side without breaking UX.

For LLM products the practical application is usually candidate re-ranking: your product shows the top retrieved docs interleaved from ranker A and ranker B, and you observe which ones the user clicks or cites. The statistical-power gains are large because each request yields a paired preference signal.
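The handbook does not name a specific interleaving scheme; the sketch below uses team-draft interleaving, a common choice from the search-ranking literature, as an assumed example. Each ranker "drafts" its best not-yet-shown document in turn, and clicks are credited to the team that contributed the clicked document.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, seed=None):
    """Build a mixed list of up to k docs; returns (shown, team) where team maps doc -> 'A' or 'B'."""
    rng = random.Random(seed)
    sources = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    shown, team = [], {}
    while len(shown) < k:
        # The team with fewer picks drafts next; a coin flip breaks ties.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for name in order:
            pick = next((d for d in sources[name] if d not in team), None)
            if pick is not None:
                shown.append(pick)
                team[pick] = name
                count[name] += 1
                break
        else:
            break  # both rankings exhausted
    return shown, team

def credit_clicks(team, clicked):
    """Count clicks per team; the ranker with more credited clicks wins this request."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins
```

Aggregating `credit_clicks` over many requests gives the within-user preference signal described above.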

→ Mental Model

Interleaving is the statistical win when you can afford to show both variants in one view. It's not a replacement for A/B — it's a sharp tool for ranking-shaped problems where A/B would be wastefully underpowered.

Part VII

Safety & Adversarial

Evaluating what happens at the edges — jailbreaks, PII leaks, prompt perturbations — and how RAG systems and agents need their own distinct eval frameworks.

Toxicity & Safety

Red-Teaming

Robustness

RAG Evals

Agent Evals

Alignment

Questions 39–44

Q39

How do you evaluate safety (toxicity, jailbreaks, PII leakage)?

Safety eval is structurally different from quality eval in one way: the metric you care about is a rare-event rate. A 99.5% safe system still fails on 1 in 200 requests; at 10M requests/day that's 50,000 incidents. So safety evals oversample adversarial inputs on purpose.

SAFETY EVAL FACETS

Toxicity

Hate speech, harassment, explicit content. Classifier score on outputs across adversarial prompts.

Jailbreaks

Known jailbreak corpora (AdvBench, HarmBench) → attack success rate. Must not rise as the model evolves.

PII leakage

Canary strings in context, NER + regex on outputs, prompt-injection leak attempts.

Policy adherence

Your product-specific rules — "don't give medical diagnosis", "don't quote prices". Rubric judge.

Gate safety with hard floors, not deltas. Quality can regress 1% and we negotiate; PII leakage must be zero or near-zero. Safety failures block releases independent of quality wins.

And remember — false positives have real cost. An over-aggressive safety layer that refuses legitimate queries tanks product utility. Measure refusal rate on benign prompts alongside harmful-content rate on adversarial ones — optimize the joint.
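A sketch of reporting both failure modes together, plus the rare-event arithmetic from the start of this answer. The inputs are assumed to be boolean flags per prompt (harmful output on an adversarial prompt, refusal on a benign prompt); the 2% refusal ceiling is an arbitrary illustrative value, while the harmful-rate floor of zero mirrors the hard-floor rule above.

```python
def rate(flags) -> float:
    """Share of True flags in a non-empty list."""
    flags = list(flags)
    return sum(bool(f) for f in flags) / len(flags)

def safety_report(adversarial_harmful, benign_refused,
                  max_harmful_rate: float = 0.0, max_refusal_rate: float = 0.02) -> dict:
    harmful = rate(adversarial_harmful)
    refusal = rate(benign_refused)
    return {
        "harmful_rate": harmful,                        # too permissive
        "benign_refusal_rate": refusal,                 # too restrictive
        "release_blocked": harmful > max_harmful_rate,  # hard floor, independent of quality wins
        "over_refusing": refusal > max_refusal_rate,
    }

def expected_daily_incidents(failure_rate: float, requests_per_day: int) -> float:
    """Rare-event arithmetic: a small rate times large traffic."""
    return failure_rate * requests_per_day
```

For the example in the text, `expected_daily_incidents(0.005, 10_000_000)` reproduces the 50,000 figure.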

→ Key Insight

Safety has two failure modes: harmful outputs (too permissive) and useless refusals (too restrictive). Report both in the same dashboard. A system that refuses to help with anything is not safe — it's broken.

Q40

Red-teaming vs systematic adversarial evaluation — what's the difference?

Red-teaming is humans or agents actively trying to break your system through creative, open-ended attack — same spirit as security red teams. Systematic adversarial eval runs a fixed, versioned suite of known attacks against every release.

Red-teaming	Systematic adversarial eval
Discovers new failure modes	Prevents known ones from recurring
Creative, open-ended, unstructured	Fixed suite, CI-friendly
Humans / LLM attackers	Deterministic replays of known attacks
Output: a list of new jailbreaks found	Output: pass/fail on each known attack
Before major launches	Every release

They feed each other — red-team finds a jailbreak once, that jailbreak becomes a test case in the systematic suite forever. Over time the systematic suite grows to cover the union of all discovered attacks. Red-teaming focuses on the frontier of what's not yet in the suite.

THE ATTACK FLYWHEEL

Step 01 · Red-team: find new attack

Step 02 · Reproduce: minimize + script

Step 03 · Fix: prompt or policy change

Step 04 · Add to suite: regression test forever

→ Interview Tip

Good answer includes both: "Red-teaming discovers; systematic eval prevents recurrence. Mature programs run red-team sprints pre-launch and keep a growing regression suite of known attacks that every release must pass."

Q41

How do you measure robustness to prompt variations?

A robust system gives consistent-quality answers when the user rephrases, misspells, capitalizes oddly, or inserts noise. Fragile systems score 90% on a golden set and 55% on the same questions reworded.

The eval technique — perturbation batteries. For each golden-set item, generate variants and score each.

PERTURBATION CATEGORIES

Surface-level

Typos, whitespace, casing, punctuation noise. Easy wins — a fragile system fails these.

Paraphrase

Same intent, different wording. LLM-generated paraphrases covering 3–5 variants per item.

Distractor

Irrelevant sentences added before/after. Tests whether the model stays on-task.

Adversarial

Prompt injection ("ignore previous"), role-confusion attacks, instruction smuggling in user text.

Report robustness as score variance across perturbations — a system that scores 0.88 on originals and 0.85 on perturbations with low variance is robust; one that averages 0.86 but swings from 0.60 to 0.98 is brittle.
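A small perturbation battery covering the surface-level and distractor categories, with the spread reported alongside the mean. `score_fn` is a hypothetical callable that runs your system on a variant and returns a score in [0, 1]; paraphrase and adversarial variants would come from an LLM or an attack corpus and are left out here. Inputs are assumed to be at least two characters long.

```python
import random
from statistics import mean, pstdev

def perturb(text: str, seed: int = 0) -> dict:
    """Return the original plus a few cheap surface-level and distractor variants."""
    rng = random.Random(seed)
    i = rng.randrange(len(text) - 1)
    typo = text[:i] + text[i + 1] + text[i] + text[i + 2:]  # swap two adjacent characters
    return {
        "original": text,
        "upper": text.upper(),
        "typo": typo,
        "whitespace": "  " + text.replace(" ", "  ") + " ",
        "distractor": "By the way, the weather is nice today. " + text,
    }

def robustness_report(text: str, score_fn) -> dict:
    """Score every variant; report mean, spread, and worst case, not just the average."""
    scores = {name: score_fn(variant) for name, variant in perturb(text).items()}
    values = list(scores.values())
    return {"scores": scores, "mean": mean(values), "std": pstdev(values), "worst": min(values)}
```

Tracking `worst` and `std` per golden-set item makes the brittle cases visible even when the mean looks healthy.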

→ Mental Model

Robustness is a distribution property, not a mean. Two systems can have the same average score and radically different worst-case behavior — and users remember worst cases.

Q42

How do you evaluate RAG systems specifically?

A RAG system has two stages — retrieval (find relevant docs) and generation (answer from them). You must eval each separately, and then the end-to-end system. Blaming the wrong stage is the #1 RAG failure mode in teams.

RAG EVAL MATRIX

Retrieval quality

Recall@k, MRR, nDCG vs labeled relevance. Did we retrieve the right docs at all?

Grounding / faithfulness

% claims in answer supported by retrieved context. Judge or NLI model.

Answer relevance

Does the answer actually address the user's question, not a tangent?

Context precision / noise resistance

Does answer quality hold when irrelevant context is included?

End-to-end task success

Did the user get a correct, useful answer? The one metric that ships.

Diagnosis pattern — if retrieval Recall@5 is 0.95 but end-to-end answer correctness is 0.60, your generator is wasting good context. If Recall@5 is 0.50 and answer correctness is 0.45, your retriever is the bottleneck. Ablation — replace retrieved context with the gold doc and see how much answer quality improves.
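The retrieval-side numbers in the diagnosis pattern can be computed with a few lines. This sketch covers Recall@k and MRR against labeled relevant doc ids; the doc ids and the `(retrieved, relevant)` pair shape are illustrative assumptions.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Share of the labeled relevant docs that appear in the top k retrieved."""
    if not relevant:
        raise ValueError("need at least one relevant doc")
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list) -> float:
    """MRR over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Reporting these next to end-to-end answer correctness on the same queries is what lets you attribute a failure to the retriever or to the generator.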

→ Interview Tip

When asked about RAG evals, immediately decompose: retrieval metrics, grounding metrics, end-to-end metrics. Name them in that order. Candidates who jump straight to "we use RAGAS" without understanding what RAGAS is actually measuring come across as shallow.

Q43

How do you evaluate agents and tool-use?

An agent makes decisions, calls tools, and uses their outputs to decide the next action. Quality isn't a single response — it's a trajectory. Three scoring levels parallel the multi-turn frame —

AGENT EVAL LEVELS

Step-level

Correct tool chosen? Correct args? Handled the tool's response properly?

Trajectory-level

Is the sequence efficient (no loops, no unnecessary calls)? Does state stay coherent?

Outcome-level

Did the agent achieve the goal? Is the world in the expected post-state?

Practical techniques —

Technique	What it measures
Mocked tool sandboxes	Deterministic tool responses → reproducible trajectories for CI
Trajectory rubric	LLM judge scores whole trace against expected plan
Step budget checks	Agent must complete within N steps / $X cost
End-state assertions	Post-condition checks on world state (e.g., DB row created)
Loop / thrash detectors	Same tool called with same args twice → flag as stuck

The single highest-leverage agent eval is a mocked sandbox: tools return fixed responses, so the tool side of every trajectory is replayable. Without this, live tool variance mixes with model sampling variance and you can't tell whether a change helped.
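As a rough illustration, a mocked tool plus the step-budget and loop checks from the table might look like the sketch below. `MockSearchTool`, `find_repeated_calls`, and `check_trajectory` are hypothetical names, and the three direct calls stand in for an agent that would normally choose them itself.

```python
import json


class MockSearchTool:
    """Hypothetical mocked tool: returns canned results and records every call."""

    def __init__(self, fixtures):
        self.fixtures = fixtures   # query string -> fixed response
        self.calls = []            # recorded (tool_name, args) pairs

    def __call__(self, query):
        self.calls.append(("search", {"query": query}))
        return self.fixtures.get(query, [])


def find_repeated_calls(calls):
    """Return calls whose (tool, args) pair already appeared earlier in the trace."""
    seen, repeats = set(), []
    for name, args in calls:
        key = (name, json.dumps(args, sort_keys=True))
        if key in seen:
            repeats.append((name, args))
        seen.add(key)
    return repeats


def check_trajectory(calls, max_steps):
    """Step-budget and loop checks over a recorded trajectory."""
    return {
        "within_budget": len(calls) <= max_steps,
        "repeated_calls": find_repeated_calls(calls),
    }


# Stand-in for an agent run against the mocked tool.
search = MockSearchTool({"refund policy": ["Refunds within 30 days."]})
search("refund policy")
search("shipping times")
search("refund policy")   # same tool, same args: should be flagged

report = check_trajectory(search.calls, max_steps=5)
```

Because the fixtures are fixed, the same agent version replays against identical tool outputs on every CI run; any remaining variance comes from the model's own sampling.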

→ Real-World Use

The interview test: "Our agent uses web search — how do you eval it?" Best answer: "Mock the search tool with fixed responses, score trajectories deterministically in CI. Run with live search less often — nightly — because non-determinism makes signal noisy."

Q44

Capability vs alignment evaluation — what's the distinction?

Capability evals ask "can the model do X?" — solve problems, reason, code, use tools. Alignment evals ask "does the model do what we want, when we want, and refrain from what we don't?" — honesty, helpfulness, harmlessness, following instructions, respecting constraints.

CAPABILITY VS ALIGNMENT

Capability

Can it solve this problem?

Scales with model size + data

Benchmarks: MMLU, HumanEval, math

"Is it smart enough?"

Alignment

Does it do what we want?

Shaped by RLHF, prompts, guardrails

Benchmarks: TruthfulQA, HarmBench

"Is it behaving correctly?"

A more capable model is not automatically more aligned, and sometimes the opposite holds, because a more capable model is also better at producing persuasive misleading or harmful content. Capability gains do not guarantee matching alignment gains; alignment evals are what detect the gap.

For product teams — capability evals bound your shortlist; alignment evals determine deployability. A model can crush MMLU and still fail your "don't give financial advice" rubric in ways that block launch.

→ Key Insight

Alignment is product-specific. A generally "aligned" model may still violate your product's policies — "don't diagnose", "don't quote prices", "don't speculate". Those rules don't come from the model vendor; they come from your eval suite.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part VIII

Production & Strategy

Evals in production, drift detection, culture, and how to synthesize everything into answers that land in the interview itself.

Production Monitoring

Observability

Drift

Culture

Interview Mistakes

Design Template

Unifying Frame

Questions 45–51

Q45

How do you monitor LLM quality in production?

Production monitoring is continuous, sampled evaluation on live traffic. You can't run a full LLM judge on every request (too expensive), so the pattern is tiered sampling.

PRODUCTION MONITORING LAYERS

100% · Deterministic checks

Schema validation, safety classifier, refusal detector, length bounds, PII scanners. Every request.

10% · Cheap LLM judge

Smaller/cheaper model scoring core quality axes. Sampled, rolling window.

1% · Strong LLM judge

Frontier judge on a stratified sample. Higher signal, used for week-over-week trends.

0.1% · Human review

Expert review of flagged outputs. Calibrates judges, spots new failure modes.

Surface the outputs in a rolling dashboard with per-slice quality, refusal rate, safety incidents, and the operational metrics (latency, cost, tool errors). Alerts fire on week-over-week drops, sudden spikes in refusals, or new clusters of 👎 feedback.
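One way to implement the tiers is deterministic hash-based sampling, sketched below. The rates come from the layers above; treating the tiers as nested (a human-reviewed request is also judged by both models) is an assumption of this sketch, and `bucket` / `checks_for` are hypothetical names.

```python
import hashlib

# Sampling rates from the monitoring layers (assumed nested in this sketch).
TIERS = [
    ("human_review", 0.001),
    ("strong_judge", 0.01),
    ("cheap_judge", 0.10),
]


def bucket(request_id):
    """Map a request ID to a stable float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:12], 16) / 16**12


def checks_for(request_id):
    """Deterministic checks always run; costlier tiers run on nested samples."""
    b = bucket(request_id)
    checks = ["deterministic"]
    checks += [name for name, rate in reversed(TIERS) if b < rate]
    return checks


print(checks_for("req-12345"))
```

Hashing the request ID rather than calling a random generator means the same request always lands in the same tier, which makes a sampled score reproducible during an investigation.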

→ Real-World Use

Stratified sampling beats uniform. Ensure every slice (intent, user cohort, language) has a minimum sample — otherwise rare segments get zero judge coverage and regressions hide there.
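A minimal sketch of stratified sampling with a per-slice floor, assuming each request is a dict carrying its slice label. The function name, the `language` field, and the toy traffic are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(requests, slice_key, rate, min_per_slice, seed=0):
    """Sample `rate` of each slice, but never fewer than `min_per_slice`
    (or the whole slice, if it is smaller than that)."""
    rng = random.Random(seed)
    by_slice = defaultdict(list)
    for req in requests:
        by_slice[req[slice_key]].append(req)

    sample = []
    for _, items in sorted(by_slice.items()):
        n = max(min_per_slice, round(len(items) * rate))
        n = min(n, len(items))
        sample.extend(rng.sample(items, n))
    return sample


# Toy traffic: one dominant language and one rare slice.
traffic = [{"id": i, "language": "en"} for i in range(1000)]
traffic += [{"id": 1000 + i, "language": "tl"} for i in range(8)]

picked = stratified_sample(traffic, "language", rate=0.01, min_per_slice=5)
per_slice = defaultdict(int)
for req in picked:
    per_slice[req["language"]] += 1

print(dict(per_slice))  # prints {'en': 10, 'tl': 5}
```

With uniform 1% sampling the rare slice would most likely contribute zero items; the floor guarantees it five.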

Q46

Evals vs observability — what's the difference?

They overlap and interviewers often blur them. Hold the line —

Dimension	Evals	Observability
Primary question	"Is output quality good?"	"What is happening in the system?"
Signal type	Scored judgment of quality	Traces, logs, metrics
Granularity	Aggregate + per-slice	Per-request, per-span
Time horizon	Trend over days/weeks	Real-time + per-incident
Who consumes	PM, ML eng, Research	On-call, SRE, ML eng
Fires on	Quality regression	Latency spike, error, token blow-up

They need each other. Observability without evals tells you the system is fast and cheap but says nothing about whether it's right. Evals without observability tell you quality scores trended down but can't tell you which prompt/model/tool change caused it. A unified trace ID that joins "this request produced this output which scored this way" is the bridge.
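The trace-ID join can be sketched in a few lines. The record shapes and field names below (`trace_id`, `prompt_version`, `judge_score`) are hypothetical and not taken from any particular platform.

```python
from collections import defaultdict

# Hypothetical observability records (one per request).
traces = [
    {"trace_id": "t1", "prompt_version": "v7", "latency_ms": 820},
    {"trace_id": "t2", "prompt_version": "v8", "latency_ms": 790},
    {"trace_id": "t3", "prompt_version": "v8", "latency_ms": 805},
]
# Hypothetical eval records, keyed by the same trace ID.
eval_scores = [
    {"trace_id": "t1", "judge_score": 4.5},
    {"trace_id": "t2", "judge_score": 2.0},
    {"trace_id": "t3", "judge_score": 3.0},
]


def mean_score_by(traces, eval_scores, field):
    """Join on trace_id, then average the judge score per value of `field`."""
    trace_by_id = {t["trace_id"]: t for t in traces}
    buckets = defaultdict(list)
    for row in eval_scores:
        trace = trace_by_id.get(row["trace_id"])
        if trace is not None:   # skip scores with no matching trace
            buckets[trace[field]].append(row["judge_score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


by_prompt = mean_score_by(traces, eval_scores, "prompt_version")
print(by_prompt)  # prints {'v7': 4.5, 'v8': 2.5}
```

Latency alone looks healthy for both prompt versions here; only the joined view shows that quality dropped with v8.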

→ Mental Model

Observability is descriptive (what happened). Evals are evaluative (was it good). Same infrastructure, different questions. Modern LLM platforms treat them as one product surface — but keep the conceptual distinction clear.

Q47

How do you detect and respond to model drift?

Drift can come from three sources — data drift (user behavior changes), model drift (vendor updates the underlying model), and concept drift (the definition of "good" changed, e.g., new policy).

DRIFT SIGNALS AND WHAT THEY MEAN

Input distribution shift

New intents, new languages, new phrasings. Cluster recent queries; compare embedding centroids to baseline.

Output distribution shift

Response length, refusal rate, tool-call frequency changes without a deployment. Sign of model-side change.

Quality regression

Judge scores, 👎 rate, retry rate trending the wrong way. Combine into a composite health score.

Operational anomalies

Latency / token-count / cost drift. Often the first visible signal of a silent vendor update.
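For the input-shift signal, comparing embedding centroids could look like the sketch below. The 2-D vectors stand in for real embeddings, and the `threshold` of 0.1 is a placeholder that would need tuning on your own traffic.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0   # treat a zero vector as maximally distant
    return 1.0 - dot / norm


def input_drift(baseline_vecs, recent_vecs, threshold=0.1):
    """Return (distance, drifted) for baseline vs recent query embeddings."""
    distance = cosine_distance(centroid(baseline_vecs), centroid(recent_vecs))
    return distance, distance > threshold


# Toy 2-D "embeddings"; real ones would come from an embedding model.
baseline = [[1.0, 0.0], [0.9, 0.1]]
same_mix = [[0.95, 0.05], [1.0, 0.0]]
shifted = [[0.1, 0.9], [0.0, 1.0]]

print(input_drift(baseline, same_mix)[1])  # prints False
print(input_drift(baseline, shifted)[1])   # prints True
```

A single centroid hides shifts that cancel out on average, so this check works best alongside the clustering step rather than instead of it.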

Response playbook — (1) confirm it's not your own deployment (check the change log); (2) run the anchor set to isolate judge vs system; (3) replay recent prod requests on the prior model version to localize; (4) if the vendor changed, trigger re-validation on the sealed set and decide whether to pin a prior version or accept the new behavior.

→ Real-World Use

Maintain a "drift dashboard" with four panels — input shift, output shift, quality trend, operational trend. When an alert fires, being able to glance at all four in one view tells you in seconds whether it's user-side, vendor-side, code-side, or judge-side.

Q48

How do you build an eval-driven development culture?

Most LLM teams treat evals as a pre-launch activity. Eval-driven teams treat them as a dev loop — every prompt change, every retrieval tweak is measured against the eval set before it merges.

ORG PRACTICES THAT MAKE EVALS STICK

01 · Evals in PRs

Every PR shows delta vs baseline. No mystery regressions.

02 · Eval before build

Proposals include "how we'll measure this", not just "what we'll build".

03 · Every incident → test

Post-incident: the failing case enters the golden set with a label.

04 · Weekly eval review

PM + Eng triage prod failures, update golden set, review trend.

05 · Eval is code

Golden set, judges, rubrics — all versioned, reviewed, deployed.

06 · Leaders model it

If the TL ships a prompt change without showing an eval delta, the norm dies within a month.

The signal of a healthy culture — engineers reach for the golden set before they reach for the prompt editor. "Let me write a failing test first" becomes reflexive for LLM work the same way it did for backend work in mature test-driven teams.

→ Interview Tip

When asked about culture, talk concretely. Don't say "I encourage a data-driven mindset." Say "every PR shows eval delta vs baseline; regressions can't merge without a paired test case." Specifics signal you've actually built this.

Q49

What common mistakes do candidates make when discussing evals in interviews?

Mistake	Why it's wrong	What to say instead
"We use BLEU."	Signals unawareness of lexical-metric limitations	"BLEU for sanity, judge for semantics, humans for calibration"
Conflating eval and test	Suggests no production experience	"Tests catch known failures; evals detect distributional drift"
"MMLU looks good"	Confuses benchmark with product fit	"Benchmark filters shortlist; internal eval decides ship"
No slicing	Average metric masks targeted regressions	"I slice by intent, language, length, user cohort"
Unvalidated judge	Treats LLM judge as ground truth	"Judge validated against ≥100 human-labeled items; agreement ≥ human-human"
No loop back	Static eval sets go stale	"Every prod incident → golden set addition"
Safety as add-on	Bolts safety on at the end	"Hard floors on safety independent of quality delta gates"
Offline or online, not both	False dichotomy	"Offline gates ship, online validates value, closed-loop between them"

The fastest way to signal seniority is to name the tradeoff explicitly. "I'd pick X here because the alternative Y has this specific problem in my context." Juniors name techniques; seniors name tradeoffs.

→ Interview Tip

If you only memorize one phrase from this handbook, make it: "offline evals gate shipping, online evals validate value, and every production failure feeds back into the offline set." That sentence alone answers a huge fraction of all eval interview questions.

Q50

How do you structure your answer to "design an eval for X"?

This is the most common open-ended question in LLM eval interviews. Use a repeatable 6-step template.

THE "DESIGN AN EVAL" TEMPLATE

🎯 Step 01 · Task: what does success look like
📏 Step 02 · Dimensions: axes of quality
📦 Step 03 · Data: golden set design
⚖️ Step 04 · Scoring: metric per dimension
🚦 Step 05 · Gating: thresholds + CI tier
🔁 Step 06 · Loop: prod feedback

Worked example — "design an eval for a customer-support reply bot."

Task: Resolve the user's ticket correctly and empathetically within policy.
Dimensions: Correctness, policy adherence, tone, conciseness, safety (no PII, no legal claims).
Data: 500 real redacted tickets stratified across billing / technical / cancellation / complaint, plus 50 adversarial cases (abusive users, prompt injections, PII probes).
Scoring: Deterministic for policy/PII, LLM-judge with rubric for correctness and tone, length budget relative to the expected reply for conciseness.
Gating: Safety absolute (zero PII leaks), quality drop vs baseline no larger than 1.5 × run-to-run noise, run in PR CI for subset, nightly for full.
Loop: Weekly triage of 👎 and escalation tickets into the golden set.
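The gating step of the worked example can be sketched as a small function. Interpreting "noise" as the standard deviation of repeated baseline runs is an assumption of this sketch, and `gate` is a hypothetical name.

```python
import statistics


def gate(baseline_runs, candidate_score, pii_leaks, noise_multiplier=1.5):
    """Return (passed, reason). Safety is an absolute floor; quality may drop
    by at most noise_multiplier x the baseline's run-to-run standard deviation."""
    if pii_leaks > 0:
        return False, "safety floor: PII leak detected"

    baseline_mean = statistics.mean(baseline_runs)
    noise = statistics.stdev(baseline_runs)   # needs at least 2 baseline runs
    drop = baseline_mean - candidate_score
    if drop > noise_multiplier * noise:
        return False, f"quality drop {drop:.3f} exceeds {noise_multiplier} x noise"
    return True, "ok"


# Repeated baseline runs on the same golden set give a noise estimate.
baseline_runs = [0.80, 0.82, 0.81, 0.79, 0.83]

print(gate(baseline_runs, 0.80, pii_leaks=0))  # prints (True, 'ok')
print(gate(baseline_runs, 0.90, pii_leaks=1))  # prints (False, 'safety floor: PII leak detected')
```

The safety check runs first and ignores the quality score entirely, so a large quality gain can never buy back a PII leak.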

Deliver that in two minutes and you've outscored most candidates regardless of the specific X.

→ Interview Tip

Narrate the template out loud — "I'll walk through task, dimensions, data, scoring, gating, loop" — before diving in. It tells the interviewer you have a repeatable framework and invites them to push on any step they care about.

Q51

What one framework ties all of this together?

Everything in this handbook fits on one diagram — the closed-loop LLM eval stack. Keep it in your head and you can reconstruct any specific answer from first principles.

THE CLOSED-LOOP EVAL STACK

Foundation · Golden set (versioned, stratified, sealed)

Ground truth. Everything else calibrates against this.

↑↓

Scoring · Deterministic → embedding → judge → human

Push each case as far down the cost ladder as possible. Judge validated against humans.

↑↓

CI & regression · PR tier, nightly tier, release tier

Fast assertions + stratified scoring + shadow + safety gates.

↑↓

Production · Shadow → canary → A/B → 100%

Each stage catches what the previous couldn't. User exposure is earned, not assumed.

↑↓

Monitoring & loop · Online signals → triage → golden set

The loop closes here. Every failure becomes a future test. The system compounds.

Five ideas, one loop. If an interview question doesn't map onto a layer here — golden set, scoring, CI, production stages, monitoring/feedback — you're either being asked something very narrow, or the question itself is confused and you can reframe it onto the stack.

The whole point of this handbook is that evals are not a metric, they're a system. Learn the system; the metrics follow.

→ Closing Thought

"LLM evaluation is the production engineering of language models." Say that, draw this diagram, and you've shown the interviewer that you see the whole picture — not just BLEU, not just A/B, but the system that makes both meaningful.


Complete

All 51 Questions.
Covered.

From golden-set construction to LLM-as-judge validation to closing the loop from production back to CI — the full vocabulary and framework for LLM evaluation interviews and the systems behind them.

51 Questions · 8 Topic Areas · 40+ Visual Diagrams


RUN IT YOURSELF

Token-level F1, the QA eval metric

A staple LLM eval metric: token-level F1 between a prediction and a reference answer (as used in QA benchmarks like SQuAD).

HOW TO READ THE CODE — 4 IDEAS

Lowercase and tokenise both the prediction and the reference (step 1).
Count the shared tokens (step 2).
Precision = shared / predicted; recall = shared / reference (steps 3–4).
F1 is their harmonic mean — high only when both are high.
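A minimal version of the metric, following the four ideas above. This sketch tokenises on whitespace only; SQuAD-style scoring scripts typically also strip punctuation and articles before comparing, which is omitted here for brevity.

```python
from collections import Counter


def token_f1(prediction, reference):
    # Step 1: lowercase and tokenise on whitespace.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Step 2: count shared tokens (multiset overlap, so repeats are matched one-to-one).
    shared = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if shared == 0:
        return 0.0

    # Step 3: precision = shared / predicted tokens.
    precision = shared / len(pred_tokens)
    # Step 4: recall = shared / reference tokens.
    recall = shared / len(ref_tokens)

    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


prediction = "The capital of France is Paris"
reference = "Paris"
print(round(token_f1(prediction, reference), 3))  # prints 0.286
```

The example shows why F1 punishes verbosity: recall is perfect (the one reference token is present) but precision is 1/6, so the harmonic mean stays low.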

