Vibe Engines
Visual Handbook · 2026
51 Questions · 8 Domains
Interview Preparation & Reference

The LLM Evals Interview Handbook

Golden sets, LLM-as-judge, regression testing,
and the offline-vs-online divide — demystified.

Golden Sets LLM-as-Judge Regression Tests Offline Evals Online A/B RAG Evals Agent Evals Red-Teaming Observability
Evals Foundations
Q1–6
Golden Sets & Data
Q7–13
Metrics & Scoring
Q14–19
LLM-as-Judge
Q20–26
Regression & CI
Q27–32
Offline vs Online
Q33–38
Safety & Adversarial
Q39–44
Production & Strategy
Q45–51
Interview-Ready
Tips & Mental Models
Saurabh Singh
AI Engineer & Builder.
Contents

What's Inside

I · Evals Foundations
Q1–6
Q1 · Why do LLM evals matter more than traditional ML metrics?
Q2 · Evaluation vs testing: what is the distinction?
Q3 · What are the main categories of LLM evaluation?
Q4 · What does a good evaluation framework look like?
Q5 · Why can't you just rely on MMLU and public benchmarks?
Q6 · Task-specific vs capability evaluations?
II · Golden Sets & Data
Q7–13
Q7 · What is a golden dataset and why is it foundational?
Q8 · How do you build a golden set from scratch?
Q9 · How many examples do you actually need?
Q10 · How do you handle dataset contamination?
Q11 · When and how should you version golden sets?
Q12 · What makes a bad golden set?
Q13 · Sourcing real production traffic for evals?
III · Metrics & Scoring
Q14–19
Q14 · Reference-based vs reference-free metrics?
Q15 · Why do BLEU / ROUGE / exact-match fail for LLMs?
Q16 · How do you measure factuality and hallucinations?
Q17 · Embedding-based similarity: BERTScore & beyond?
Q18 · How do you score multi-turn conversations?
Q19 · What is pass@k and when does it matter?
IV · LLM-as-Judge
Q20–26
Q20 · What is LLM-as-judge and why has it become dominant?
Q21 · What are the known biases of LLM judges?
Q22 · How do you validate an LLM judge?
Q23 · Single vs pairwise vs reference-based judging?
Q24 · How do you write a good judge prompt?
Q25 · Handling cost and latency of LLM judges?
Q26 · What is judge-model drift?
V · Regression & CI
Q27–32
Q27 · What is regression testing for LLMs?
Q28 · How do you design CI-friendly eval suites?
Q29 · Assertion-style vs scoring-style regression tests?
Q30 · How do you set thresholds and gates?
Q31 · Handling non-determinism (temperature, sampling)?
Q32 · What is a shadow eval?
VI · Offline vs Online
Q33–38
Q33 · The core distinction between offline and online evals?
Q34 · What are the tradeoffs of each approach?
Q35 · How do online A/B tests work for LLM features?
Q36 · What online signals complement offline metrics?
Q37 · Closing the loop: online back to offline?
Q38 · What is interleaving and when do you use it?
VII · Safety & Adversarial
Q39–44
Q39 · How do you evaluate toxicity, jailbreaks, PII?
Q40 · Red-teaming vs systematic adversarial evaluation?
Q41 · How do you measure robustness to prompt variations?
Q42 · How do you evaluate RAG systems specifically?
Q43 · How do you evaluate agents and tool-use?
Q44 · Capability vs alignment evaluation?
VIII · Production & Strategy
Q45–51
Q45 · How do you monitor LLM quality in production?
Q46 · Evals vs observability: what's the difference?
Q47 · How do you detect and respond to model drift?
Q48 · How do you build an eval-driven culture?
Q49 · Common interview mistakes to avoid?
Q50 · How do you answer "design an eval for X"?
Q51 · What one framework ties all of this together?
Part I

Evals Foundations

Before metrics and benchmarks, you need a mental model for why LLM evaluation is fundamentally different — and why most teams get it wrong.

Why Evals Matter
Eval vs Test
Taxonomy
Frameworks
Benchmarks
Capability vs Task
Questions 1–6
Q1

Why do LLM evals matter more than traditional ML metrics?

Traditional ML ships a model that outputs a single label or number. You measure accuracy, F1, AUC against a held-out set and you're mostly done. LLM systems output free-form text — there are infinitely many correct answers, and the same prompt can produce different responses each run.

That breaks three assumptions at once. First, you can't enumerate "correct" outputs, so exact-match collapses. Second, the system is a pipeline (prompt + retrieval + model + tools + post-processing) — a regression can come from any layer. Third, quality is multi-dimensional — a response can be factually correct but unsafe, helpful but verbose, on-topic but off-brand.

WHY LLM EVALS ARE HARDER
Traditional ML
Single label output
Deterministic given input
One metric (F1 / AUC)
Labeled test set = truth
Closed-world, quantitative
LLM Systems
Free-form text output
Stochastic by default
Many dimensions at once
No single "ground truth"
Open-world, subjective, pipeline

Add the production reality — silent regressions from prompt edits, model version bumps, vendor deprecations, retrieval index drift — and evals stop being a one-time gate. They become the production system you build around the model.

→ Interview Framing
"Traditional ML evaluation is a measurement problem; LLM evaluation is an engineering problem." Say that sentence and follow with the three assumptions that break (enumerable correct outputs, a single model rather than a pipeline, one quality metric). Interviewers remember the frame.
Q2

What is the difference between evaluation and testing for LLMs?

They sound interchangeable but they serve different loops. Testing answers a binary — did this specific behavior happen or not? It's pass/fail, written like unit tests, runs in CI, and protects you from regressions on known cases. Evaluation answers a distributional question — across a representative sample of inputs, what is the overall quality of the system?

Dimension | Testing | Evaluation
Question | Did X happen? | How good is the system overall?
Granularity | Single case | Distribution over many cases
Output | Pass / fail | Score, percentile, win-rate
When run | Every commit / PR | Pre-release, weekly, ad hoc
Cardinality | 10s–1000s of assertions | 100s–10,000s of scored cases
Analogy | Unit test | Statistical experiment

A healthy program has both. Testing catches known failures you've already seen in production. Evaluation detects unknown shifts in quality. Conflating them is why teams either over-invest in brittle pass/fail suites that block every release, or under-invest in CI and ship silent regressions.

→ Mental Model
Testing is a tripwire — it fires when something specific breaks. Evaluation is a thermometer — it tells you whether the overall system is getting hotter or colder. You need both, and they belong in different parts of your dev loop.
Q3

What are the main categories of LLM evaluation?

There's no single taxonomy, but a useful interview-friendly one is three axes — what you measure, how you measure, and where the data lives.

THE THREE AXES OF LLM EVALUATION
1 · What you measure — Quality dimensions
Correctness, factuality, helpfulness, safety, format adherence, latency, cost.
2 · How you measure — Scoring method
Deterministic (exact match, regex, schema), heuristic (BLEU, ROUGE, BERTScore), model-graded (LLM judge), human review.
3 · Where the data lives — Offline vs online
Offline (curated golden set), online (live user traffic, A/B, implicit signals).

A concrete eval plan is a point on each axis. "I measure factuality (what) using a GPT-4-class judge with citations (how) on a versioned golden set of 500 support tickets (where)." That one sentence is the shape of every good eval.

→ Interview Tip
When asked "how would you evaluate X?" — don't rush to a metric. Name the three axes out loud, pick a point on each, and justify it. This structure alone separates senior from junior answers.
Q4

What does a good evaluation framework look like in practice?

A good framework is not a single dashboard — it's a layered system where each layer catches a different class of failure at a different speed and cost.

THE EVAL PYRAMID (FAST → SLOW, CHEAP → EXPENSIVE)
Layer 5 · Human review
Slowest, gold standard. Used to calibrate everything below.
Layer 4 · LLM-as-judge
Scales subjective quality evaluation across thousands of examples.
Layer 3 · Heuristics & embeddings
BERTScore, semantic similarity, toxicity classifiers, citation overlap.
Layer 2 · Deterministic checks
JSON schema, regex, "must mention X", length bounds, tool-call correctness.
Layer 1 · Operational invariants
Didn't 500, didn't exceed token budget, didn't leak secrets. Always-on, free.

The principle — push work down the pyramid. Every failure that a cheap deterministic check can catch should not reach an LLM judge, and every failure an LLM judge can catch should not reach a human. Teams that flip this pyramid (human review for everything) burn money; teams that skip the top (no human calibration ever) drift without knowing.
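The "push work down the pyramid" principle can be sketched as a short-circuiting evaluator. This is a minimal illustration, not a prescribed implementation: the function names, the length budget, and the 0.5 judge threshold are all assumptions made for the example, and `judge` stands in for whatever model-graded scorer you use.

```python
import json
from typing import Callable, Optional

def check_operational(output: str, max_chars: int = 2000) -> Optional[str]:
    """Layer 1 (assumed invariants): non-empty and within a length budget."""
    if not output.strip():
        return "empty output"
    if len(output) > max_chars:
        return "exceeds length budget"
    return None

def check_schema(output: str, required_keys: set) -> Optional[str]:
    """Layer 2: deterministic check that the output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(parsed, dict):
        return "not a JSON object"
    missing = required_keys - set(parsed)
    if missing:
        return f"missing keys: {sorted(missing)}"
    return None

def evaluate(output: str, required_keys: set,
             judge: Optional[Callable[[str], float]] = None) -> dict:
    """Run cheap layers first; only call the (expensive) judge if they pass."""
    failure = check_operational(output)
    if failure:
        return {"layer": "operational", "passed": False, "reason": failure}
    failure = check_schema(output, required_keys)
    if failure:
        return {"layer": "deterministic", "passed": False, "reason": failure}
    if judge is None:
        return {"layer": "deterministic", "passed": True, "reason": None}
    score = judge(output)  # hypothetical scorer returning a value in [0, 1]
    return {"layer": "judge", "passed": score >= 0.5, "reason": f"judge score {score:.2f}"}
```

The point of the ordering is cost: an output that is not valid JSON never consumes a judge call.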

→ Key Insight
A good eval system is less about choosing the best metric and more about choosing the right metric at the right layer. Cheap metrics gate PRs; expensive metrics gate releases; human review calibrates the machine metrics.
Q5

Why can't you just rely on public benchmarks like MMLU or HELM?

Public benchmarks measure general model capability — they tell you whether a model is plausibly competent at broad domains. They say almost nothing about whether it will work for your task, your users, your data distribution.

Four structural problems make public benchmarks unreliable as production signals:

WHY PUBLIC BENCHMARKS FAIL IN PRODUCTION
01 · Contamination
Popular benchmarks leak into training data. The model has seen MMLU; the score is inflated.
02 · Distribution mismatch
Your users don't write like benchmark authors. Their tone, jargon, and failure modes differ.
03 · Wrong dimension
MMLU rewards factual recall. Your app may care about tone, safety, JSON adherence — none of which MMLU measures.
04 · No pipeline signal
Benchmarks test a bare model. Your system has retrieval, tools, post-processing — all invisible to MMLU.

Use benchmarks for model selection shortlisting — a 20-point MMLU gap is a real signal. Use your own evals for everything downstream of that.

→ Real-World Use
Say it plainly: "Public benchmarks are useful for deciding which three models to put in my bake-off. My internal eval decides which one ships." Interviewers want to hear that you treat benchmarks as a filter, not an answer.
Q6

Task-specific vs capability evaluations — what's the distinction?

Capability evaluations probe an underlying skill the model has or doesn't — arithmetic, code generation, long-context recall, multilingual reasoning. Examples: GSM8K for math, HumanEval for code, needle-in-a-haystack for context retention. They're model-facing — they help you decide "is this model capable enough?".

Task-specific evaluations measure whether the full system does your job — resolve a support ticket, extract fields from an invoice, answer a policy question with the right citation. They're product-facing — they tell you "does this shipped pipeline work for our users?".

CAPABILITY VS TASK EVALS
Capability
Tests the raw model
Generic across products
E.g. GSM8K, HumanEval
Useful for vendor selection
"Can the model do arithmetic?"
Task-specific
Tests the whole pipeline
Unique to your product
E.g. 500 real support tickets
Useful for ship decisions
"Does support-bot resolve tickets?"

They're complementary, not competing. Capability evals tell you why a task eval regressed (the new model lost long-context recall). Task evals tell you whether the regression matters (our prompts never exceed 8k tokens, so we don't care).

→ Mental Model
Capability evals are diagnostic. Task evals are prescriptive. In an interview, explicitly say you'd run both — capability to bound your model shortlist, task-specific to make the final call.
Part II

Golden Sets & Datasets

The quality of your evals is capped by the quality of your data. This part covers how to build, size, version, and protect the ground truth that everything else sits on.

Building Golden Sets
Sample Sizing
Contamination
Versioning
Production Traffic
Anti-patterns
Questions 7–13
Q7

What is a golden dataset and why is it foundational?

A golden dataset (or "golden set", "eval set", "canonical set") is a curated, versioned, human-reviewed collection of input–expected-output pairs that represents what correct looks like for your task. It's the ground truth every metric, judge, and regression test is ultimately calibrated against.

It's foundational because every other piece of your eval stack inherits its biases. A skewed golden set means a skewed LLM-judge, a misleading regression signal, a misguided A/B readout. If your golden set over-represents English short questions, your whole system will pass evals and still fail on long multilingual queries in production.

WHAT A GOLDEN SET RECORD LOOKS LIKE
{
  "id": "gs-2026-0142",
  "input": "Cancel my subscription effective next month.",
  "context": { "plan": "pro", "billing_cycle": "monthly" },
  "expected": {
    "intent": "cancel_subscription",
    "effective_date": "next_billing_cycle",
    "tone": "empathetic"
  },
  "tags": ["billing", "cancel", "edge-case"],
  "source": "prod-ticket-redacted",
  "version": "v1.3",
  "reviewer": "alex@"
}

Note the tags, source, version, and reviewer. A golden set isn't just inputs and outputs — it's provenance. Without that, you can't audit biases, can't slice by failure mode, and can't prove the set wasn't scraped from your own prod logs containing PII.
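One way to enforce that provenance is a small validation step run before a record is admitted to the set. The sketch below assumes records are plain dicts shaped like the example above; `REQUIRED_FIELDS` and `validate_record` are hypothetical names chosen for illustration.

```python
# Fields every record must carry, mirroring the example record above.
REQUIRED_FIELDS = ("id", "input", "expected", "tags", "source", "version", "reviewer")

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    # An empty tag list is as bad as a missing one: the record cannot be sliced.
    if "tags" in record and not record["tags"]:
        problems.append("empty tags: record cannot be sliced by segment")
    return problems
```

Running this in the pull request that adds examples keeps untagged or unreviewed records out of a frozen version.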

→ Key Insight
"Garbage in, garbage out" applies to evals as strongly as to training. If you have 40 hours to invest in your eval stack, spend 30 on the golden set and 10 on everything else.
Q8

How do you build a golden set from scratch?

The trap is going straight to labeling. The sequence that actually works is segment → sample → label → stratify → freeze.

GOLDEN SET CONSTRUCTION PIPELINE
Step 01 · Segment
Define input types & failure modes.
Step 02 · Sample
From prod logs or synthesize.
Step 03 · Label
Human-author expected output.
Step 04 · Stratify
Balance tags + difficulty.
Step 05 · Freeze
Version and lock.

Segment means naming the axes your users vary on — intent, language, input length, politeness, edge cases. Sample draws from real traffic (ideal) or synthesized prompts (bootstrap). Label is where a domain expert writes what should have come out — or, for subjective tasks, what a good response would contain. Stratify ensures every segment is present in enough volume to make the metric stable. Freeze the set at v1 and only change it via explicit versioning.

Two common shortcuts to resist — labeling whatever shows up first (you'll over-index on common cases) and using the model's own output as the "expected" (you've just measured self-consistency, not correctness).
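The stratify step can be sketched as a per-segment sampler that also reports which segments are under-filled. This is an illustration under stated assumptions: candidates are arbitrary items, `key` is a caller-supplied function returning the segment name, and `stratified_sample` and `per_segment` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_sample(candidates, key, per_segment, seed=0):
    """Draw up to `per_segment` items from each segment; report shortfalls."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_segment = defaultdict(list)
    for item in candidates:
        by_segment[key(item)].append(item)

    sample, shortfalls = [], {}
    for segment in sorted(by_segment):
        items = by_segment[segment]
        rng.shuffle(items)
        sample.extend(items[:per_segment])
        if len(items) < per_segment:
            # Segment has too few candidates: collect or synthesize more.
            shortfalls[segment] = per_segment - len(items)
    return sample, shortfalls
```

The shortfall report is the useful part in practice: it tells you which segments need more collection or synthesis before the set is frozen.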

→ Interview Tip
If asked to design a golden set on the spot, say "segment first, sample second" and list 5 segmentation axes for the domain. That single reordering — segment before sample — signals you've done this before.
Q9

How many examples do you actually need in a golden set?

Wrong question. The right question is how many examples per slice. A 10,000-item set that's all one category is weaker than a 500-item set with 25 examples in each of 20 slices.

A working rule of thumb for binary pass/fail metrics: aim for at least 50–100 examples per slice before reading a per-slice score at all. Even at 100 examples, the 95% interval on a pass rate near 85% is roughly ±7 points, so reliably detecting a 5-point shift takes several hundred examples or more per slice, or a paired comparison on the same items. Graded scores (1–5) typically need fewer; rare failure modes (jailbreaks, PII leaks) need more.

GOLDEN SET SIZING BY USE CASE
Smoke test · 20–50
PR regression · 100–300
Release gate · 500–1,500
Model selection · 2,000–5,000
Safety / red-team · 5,000+

Two quick sanity checks — (1) bootstrap your current set and confirm the confidence interval on your headline metric is narrower than the regression you care about; (2) plot per-slice scores and look for slices with wild variance — those are usually under-sampled.
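The first sanity check can be sketched with a percentile bootstrap over per-example outcomes. This is a minimal version using only the standard library; the function name, the 2,000 resamples, and the 95% level are assumptions for illustration.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the eval set with replacement and recompute the headline metric.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# 100 examples, 85 passes (1 = pass, 0 = fail).
outcomes = [1] * 85 + [0] * 15
lower, upper = bootstrap_ci(outcomes)
print(f"pass rate 0.85, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the printed interval is wider than the regression you want to catch, the set (or the slice) is too small for that decision.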

→ Mental Model
Think about statistical power, not total count. If your metric moves by ±3 points just from re-running, your golden set is too small for a 2-point "improvement" to mean anything.
Q10

How do you handle dataset contamination?

Contamination is when eval data has leaked into training data, so the model "remembers" the answer rather than earning it. Scores go up, real-world capability doesn't. It happens three ways: (1) you published your golden set; (2) the model was trained on scrapes that include your source; (3) the model under test was in the labeling loop, so the "expected" answers are partly its own output.

THREE FORMS OF CONTAMINATION
Public leak
Your eval set is on HuggingFace → frontier models saw it in pre-training.
Source leak
You labeled from Wikipedia / Stack Overflow — already in the model.
Loop leak
Labelers used the model to generate "expected" answers — circular.

Defenses: keep a private held-out slice that's never shared with vendors; embed canary strings unique to your dataset so you can detect memorization; label with humans, not the model under test; refresh a rolling fraction of the set every quarter with net-new prompts.

→ Real-World Use
For frontier-model evals, keep a sealed holdout: never send it to an API provider for labeling, never put it in a doc that could be scraped, never commit it to a public repo. That sealed 200-example set is worth more than 5,000 public examples.
Q11

When and how should you version and update golden sets?

Golden sets decay. User behavior shifts, products add features, edge cases are discovered, labels go stale as policies change. A never-updated set silently drifts from reality — your evals pass, your users suffer.

The operating pattern — treat the golden set like code. Semantic versions (v1.2.0), pull requests for additions, changelogs, deprecations. Minor version for "added 50 new examples in segment X", major version when changes break comparability with prior runs.

GOLDEN SET VERSIONING RHYTHM
Weekly · Review prod failures
Pipe incidents, thumbs-down, escalations into a candidate bucket.
Monthly · Add + label
Human review; promote curated items to v.next minor bump.
Quarterly · Re-stratify
Check distribution against current prod traffic; rebalance slices.
Annually · Major bump
Retire stale examples; re-review every label for drift.

When you bump major, always re-run the prior release model on the new set and record the score — you need that anchor point so historical comparisons stay interpretable.

→ Interview Tip
Strong answer: "Golden sets are code. I version them semantically, PR additions, and keep an anchor rerun on major bumps." This signals you've been on the other side of a stale-set incident.
Q12

What makes a bad golden set — and how do you recognize one?

Most teams ship a bad golden set and don't know it because their metrics look stable. Stable metrics on a bad set are the worst case — confident wrongness.

Smell | What it looks like | What it costs you
Easy mode | Every frontier model scores 95%+ | Metric is saturated; can't distinguish models
Skewed | 80% of examples are one intent | Head-case wins mask tail-case failures
Stale | No updates in 6+ months | Passes don't predict prod behavior anymore
Self-labeled | "Expected" was generated by the model | You're measuring self-agreement, not quality
Leaked | Available on the public internet | Memorization inflates scores ~5–15 points
Under-slotted | No per-segment tags | Can't diagnose which slice regressed
Solo-authored | One person labeled everything | One person's biases = ground truth

Diagnosis heuristic — if your golden-set score barely moves between a tiny 7B open model and a frontier model, your set is either too easy or contaminated. If it moves by 30 points but production telemetry doesn't change at all, your set is unrepresentative.

→ Mental Model
Run a "weak model sanity check": your cheapest / smallest model should clearly lose. If it ties the frontier model, you don't have an eval — you have a vibes check.
Q13

How do you source real production traffic for evals without breaking privacy?

Production is the best possible eval source (it's literally your distribution), but it carries PII, compliance, and consent risk. The pattern that works is sample → redact → review → label → promote.

FROM PROD TRAFFIC TO GOLDEN SET
Step 01 · Sample
Stratified across users & intents.
Step 02 · Redact
PII scrubber + human QA.
Step 03 · Review
Legal / privacy sign-off.
Step 04 · Label
SME authors expected output.
Step 05 · Promote
Into versioned golden set.

Key choices: consent-friendly sampling (users who opted into training data sharing, or internal dogfooding traffic); synthetic twins when you can't use raw data (rewrite the prompt, preserving the structure but replacing identifying details); and difficulty-stratified sampling, so you don't only collect what the system already handles well. Sampling biased toward failures reveals the real edges.

Also, keep production sampling continuous. A one-time snapshot becomes stale in weeks for an actively developed product.

→ Real-World Use
A useful split: 60% historical "happy path" prod examples + 30% thumbs-down and escalation examples + 10% synthesized adversarial cases. That mix catches regressions and tail behavior simultaneously.
Part III

Metrics & Scoring

From BLEU to BERTScore to pass@k — which metric actually moves with quality, and which ones just look rigorous without meaning much.

Reference-based
Reference-free
BLEU / ROUGE Limits
Factuality
Embedding Metrics
Multi-turn
pass@k
Questions 14–19
Q14

Reference-based vs reference-free metrics — when to use each?

Reference-based metrics compare a model's output to one or more human-authored "correct" answers. Reference-free metrics score an output on its own merits — grammaticality, factuality, relevance — without needing a gold answer to compare against.

WHEN EACH WINS
Reference-based
Exact match / F1 (QA)
BLEU / ROUGE (translation, summary)
BERTScore vs reference
Use when "correct" is well-defined and you can afford to label.
Reference-free
Perplexity, fluency scores
Factuality (vs retrieved doc)
LLM-as-judge on rubrics
Use for open-ended generation where many outputs are valid.

Most production systems use both — reference-based for structured outputs (JSON fields, extracted entities) and reference-free for long-form generation (summaries, chat responses). A common pattern — reference-based for the "must haves" (required fields present), reference-free for the "feel" (tone, helpfulness).

→ Mental Model
Reference-based = "did you match the expected answer?" Reference-free = "is this answer good in isolation?" The two can agree or disagree on the same output, so you usually need to ask both.
Q15

Why do BLEU, ROUGE, and exact-match fail for generative LLMs?

They're all lexical — they compare surface-level n-gram or character overlap with a reference. That works for constrained tasks (translation against a parallel corpus) but breaks down the moment the model is allowed to paraphrase, reorder, or add useful context.

SAME MEANING, DIFFERENT SCORES
REFERENCE: "The meeting was rescheduled to Thursday afternoon."
"The meeting got moved to Thursday afternoon." · BLEU 0.21 · ✓ correct
"Thursday afternoon, rescheduled." · BLEU 0.11 · ✓ correct
"The meeting was rescheduled to Tuesday afternoon." · BLEU 0.88 · ✗ WRONG DAY

Lexical metrics reward surface similarity and punish paraphrase. Worse, they can rank a wrong answer above a correct one when the wrong answer copies more reference words. Exact match is even more brittle — a trailing period or capitalization difference fails a correct answer.
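The ranking inversion is easy to reproduce with a simplified overlap score. The sketch below is not BLEU itself: it uses clipped unigram and bigram precision with no brevity penalty, so its numbers will not match the BLEU values quoted in the example above. The function names are chosen for illustration.

```python
import re
from collections import Counter
from math import sqrt

def tokens(text):
    """Lowercase word tokens with punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: share of candidate n-grams found in the reference."""
    cand = Counter(zip(*(candidate[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def overlap_score(candidate, reference):
    """Geometric mean of unigram and bigram precision (a BLEU-like simplification)."""
    c, r = tokens(candidate), tokens(reference)
    return sqrt(ngram_precision(c, r, 1) * ngram_precision(c, r, 2))

reference = "The meeting was rescheduled to Thursday afternoon."
paraphrase = "The meeting got moved to Thursday afternoon."
wrong_day = "The meeting was rescheduled to Tuesday afternoon."

# The factually wrong answer shares more n-grams with the reference,
# so it outscores the correct paraphrase.
print(overlap_score(paraphrase, reference) < overlap_score(wrong_day, reference))
```

Any metric built purely on shared n-grams inherits this behavior, whatever smoothing or weighting it adds on top.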

They still have a place — they're cheap, fast, deterministic, useful as a fast tripwire in CI. But they should never be your headline quality metric for free-form generation.

→ Interview Tip
Don't just say "BLEU is bad." Say: "BLEU correlates with quality in narrow tasks with parallel references. For open-ended generation, I'd pair it with a semantic metric and an LLM judge." That calibrated answer reads much better than a dismissal.
Q16

How do you measure factuality and hallucinations?

Factuality = does every factual claim in the output match a trusted source? Hallucination = a factual claim with no such support (or contradicted by one). For RAG systems the source is retrieved context; for open-domain, it's an external knowledge base or a reference answer.

FACTUALITY EVAL PIPELINE
Step 01 · Decompose
Split output into atomic claims.
Step 02 · Retrieve
Gather supporting evidence.
Step 03 · Verify
Entail / contradict / unsupported.
Step 04 · Aggregate
% claims grounded.

Key metrics in use —

Metric | What it measures | Best for
Faithfulness | % claims supported by retrieved context | RAG
Answer relevance | % output sentences relevant to the question | QA
Citation precision | % citations that actually support claim | Grounded generation
Citation recall | % claims that have a citation | Grounded generation
Hallucination rate | % outputs with ≥1 unsupported claim | Headline dashboard

The practical move — use an LLM judge with a structured rubric that forces claim-by-claim verification against provided evidence, rather than a single holistic "is this hallucinated?" vote. Decomposition beats gestalt.
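Once a verifier (judge or human) has labelled each atomic claim, the aggregate step is simple arithmetic. The sketch below assumes each output arrives as a list of per-claim verdict strings; the verdict labels and function names are hypothetical, and treating an output with no factual claims as fully faithful is a choice made for this example.

```python
def faithfulness(claim_verdicts):
    """Share of one output's atomic claims labelled 'supported'."""
    if not claim_verdicts:
        return 1.0  # assumption: no factual claims means nothing to contradict
    supported = sum(v == "supported" for v in claim_verdicts)
    return supported / len(claim_verdicts)

def hallucination_rate(outputs):
    """Share of outputs containing at least one claim that is not supported."""
    flagged = sum(
        any(v != "supported" for v in verdicts) for verdicts in outputs
    )
    return flagged / len(outputs)

# Per-claim verdicts for four outputs, as produced by the verify step.
outputs = [
    ["supported", "supported"],
    ["supported", "unsupported"],
    ["contradicted"],
    ["supported"],
]
print([faithfulness(o) for o in outputs], hallucination_rate(outputs))
```

Keeping per-claim verdicts rather than one vote per output is what makes both numbers available from the same labelling pass.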

→ Key Insight
"Hallucination" is not a single metric — it's a family. Separate faithfulness (to retrieved context) from factuality (to world knowledge). A model can be 100% faithful to a bad doc and still be factually wrong.
Q17

What are embedding-based similarity metrics (BERTScore, semantic similarity)?

Instead of matching words, embed both output and reference in a vector space and measure how close they are. BERTScore tokenizes both texts, embeds each token with a pretrained transformer, greedily matches every token to its most similar counterpart by cosine similarity, and reports precision, recall, and F1 over those matches. Sentence-level semantic similarity applies the same idea at sentence or paragraph granularity, comparing one sentence-transformer embedding per text.

LEXICAL VS SEMANTIC SIMILARITY
BLEU: weak, word overlap only
ROUGE-L: longest common subsequence
BERTScore: contextual embeddings
Sent-embedding: semantic, paragraph-level
LLM-judge: can reason about intent
Axis: correlation with human quality judgment (schematic)

Embedding metrics are a real upgrade over BLEU — they handle paraphrase, word order, synonymy. But they have blind spots — they can't tell you if a fact is wrong, can't penalize a fluent hallucination, and are sensitive to the embedding model's training biases.

Treat them as a mid-layer metric — better than BLEU, cheaper than an LLM judge, and directionally useful as a CI gate. Not a replacement for a judge or human review.
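The token-matching idea behind BERTScore can be sketched in pure Python. This is an illustration only: the toy 2-D vectors stand in for real transformer embeddings, and the sketch leaves out the optional IDF weighting and baseline rescaling that the actual metric supports. The function names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1: each token is matched to its most similar counterpart."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy "embeddings": one vector per token.
reference = [[1.0, 0.0], [0.0, 1.0]]
reordered = [[0.0, 1.0], [1.0, 0.0]]  # same tokens, different order
print(greedy_match_f1(reordered, reference))
```

Because matching is per token and order-free, reordering costs nothing, which is the paraphrase tolerance described above; it is also why a fluent answer with one wrong fact can still score highly.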

→ Real-World Use
A cost-aware stack: embedding similarity on every PR (fast, free), LLM judge on nightly builds (slow, expensive), human review monthly (golden calibration). Embedding similarity earns its keep in CI specifically.
Q18

How do you score multi-turn conversations?

Single-turn scoring doesn't survive contact with a conversation. A response can be locally correct but break context from turn 3, or locally off but recover context from turn 2. Three scoring levels operate together —

THREE LEVELS OF CONVERSATION SCORING
Turn-level
Each response scored in its local context. Cheap, easy to localize regressions.
Trajectory-level
Did the conversation keep context, avoid contradictions, stay on task across N turns?
Outcome-level
Did the user's underlying goal get achieved? (task success, resolution, correct booking)

Common failure modes specific to multi-turn — context forgetting (forgetting user preferences stated three turns back), repetition (asking for info already given), goal drift (sliding from the original intent), and sycophancy accumulation (progressively agreeing with incorrect user assertions). Each needs its own tag in the golden set and its own rubric line in the judge prompt.

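Practical tip: for scripted multi-turn evals, use simulated users, meaning another LLM role-playing the user with a fixed goal and persona. That lets you replay the same scenario across model versions; pin the simulator's model, prompt, and sampling settings so that score changes come from the system under test rather than from the simulated user.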

→ Interview Tip
If a question touches chat/agents, volunteer the three-level frame (turn / trajectory / outcome). Nine candidates in ten only mention turn-level, and then struggle when asked "but did the user's task actually complete?"
Q19

What is pass@k and when does it matter?

pass@k is the probability that at least one of k independently sampled outputs passes a correctness check. If you sample 10 candidate solutions and any one of them compiles and passes unit tests, pass@10 counts the problem as solved.

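pass@k ≈ 1 − (1 − p)^k, where p is the per-sample pass probability on a given problem and samples are assumed independent; a benchmark-level pass@k averages this over problems.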

It matters whenever you can cheaply verify and reject bad outputs — code generation (run tests), math (check against a checker), tool use (re-try on error). In those regimes, a model with pass@1=40% but pass@10=85% is genuinely more useful than a model with pass@1=55% but pass@10=60%, because production can sample and filter.
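In practice p is unknown, so pass@k is estimated from n samples per problem of which c passed. The sketch below uses the combinatorial estimator 1 − C(n−c, k)/C(n, k) that is common in HumanEval-style code benchmarks; the function names and the `(n, c)` input format are assumptions for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c passing."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average per-problem pass@k over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Three problems, 10 samples each: easy, hard-but-solvable, never solved.
results = [(10, 9), (10, 1), (10, 0)]
print(benchmark_pass_at_k(results, 1), benchmark_pass_at_k(results, 10))
```

The middle problem is the one that creates a pass@1 versus pass@10 gap: it rarely passes on a single sample but is almost always solved somewhere in ten.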

WHEN pass@k IS AND ISN'T THE RIGHT METRIC
✓ Use pass@k
Code gen with test harness · math with a verifier · structured extraction with schema validation · any pipeline where you can cheaply sample-and-select.
✗ Don't use pass@k
Open-ended chat · creative writing · anything without a cheap automatic verifier. If a human has to pick the best of k, you've just multiplied cost.

Watch for the pass@1 vs pass@k gap — a big gap says the model has the knowledge but sampling is noisy (good candidate for best-of-N, self-consistency, or re-ranking). A small gap says the model is bottlenecked on capability (won't improve with more samples).

→ Mental Model
pass@k is a production-configurability signal, not a pure quality signal. It answers: "if I'm willing to run N samples and pick the best, how well does this model do?" That's a different business question than "how good is a single response?"
Part IV

LLM-as-Judge

The technique that made scalable evaluation possible — and the landmine pattern if you deploy it without calibration. Biases, validation, and the prompts that actually work.

Why It Works
Bias Catalog
Judge Validation
Pairwise vs Single
Prompt Design
Cost Control
Model Drift
Questions 20–26
Q20

What is LLM-as-judge and why has it become dominant?

LLM-as-judge is using a language model (usually a strong one — Claude, GPT-4-class) to score another model's outputs according to a rubric. Instead of writing heuristics or asking humans, you give the judge the input, the output, optionally a reference, and a prompt describing what "good" means.

It became dominant because it hits a sweet spot that nothing else does — scalable like heuristics, nuanced like humans. It handles paraphrase, reasoning, tone, structure. It's available on demand. And — when validated properly — it correlates with human judgment well enough for many product decisions.

THE EVALUATION COST / QUALITY FRONTIER
Exact match: free · brittle
BLEU/ROUGE: cheap · weak
BERTScore: cheap · semantic
LLM-judge: $$ · flexible · scalable
Human review: $$$ · gold standard
Axis: quality of signal (schematic)

Three rules before you trust it — (1) validate against human labels on a representative sample, (2) use a different model as judge than the one under test where possible, and (3) give the judge rubrics, not vibes.

→ Key Insight
LLM-as-judge isn't magic. It's a measurement instrument — it needs calibration, needs to be checked for drift, needs a known error profile. Teams that treat it as ground truth skip all of that and produce metrics that look rigorous but aren't.
Q21

What are the known biases of LLM judges?

This is the single most likely follow-up in an evals interview. Know the catalog cold.

Bias | What happens | Mitigation
Position bias | In pairwise, judge prefers A or B based on order | Randomize order, run both orderings, average
Verbosity bias | Longer answers score higher regardless of quality | Explicit rubric against length; normalize
Self-preference | Judge prefers outputs from its own model family | Use different family; validate with humans
Sycophancy | Judge agrees with leading language in the prompt | Neutral rubric, no "this looks good, rate it"
Authority bias | Confident tone scored higher than hedged | Separate confidence from correctness in rubric
Anchoring | First score sets expectation for later ones | Independent judgments; no streaming context
Format bias | Bulleted/Markdown answers beat plain prose | Instruct judge to ignore formatting
Refusal tolerance | Judge lets a refusal pass when it shouldn't | Add "did the model actually answer?" to rubric

None of these go away completely. The goal is to know them, measure their impact on your task, and apply targeted mitigations. A 2-percentage-point position bias is tolerable when the systems you are comparing differ by 15 points; it is a disaster when they differ by 3.

→ Interview Tip
Memorize three: position, verbosity, self-preference. If asked "what's a failure mode of LLM-as-judge?", hit all three in two sentences and you've instantly cleared the senior bar.
Q22

How do you validate an LLM judge?

A judge that hasn't been validated is a random number generator with vocabulary. Validation means showing that the judge's scores correlate with human judgment on your task.

LLM-JUDGE VALIDATION LOOP
Step 01 · Sample: 100–300 outputs from golden set
Step 02 · Human label: ≥2 raters, measure IAA
Step 03 · Judge label: same outputs, same rubric
Step 04 · Agreement: κ, Pearson, % match
Step 05 · Iterate: refine rubric, re-validate

Target judge-human agreement ≥ human-human agreement. If two humans agree 85% of the time on your rubric, a judge with 85%+ agreement is performing as well as another human. If judge-human agreement is substantially lower than human-human, the rubric is ambiguous or the judge is wrong for this task.

Quick sanity tests beyond top-line agreement — slice-level agreement (does the judge disagree with humans disproportionately on one failure mode?); edge-case probes (give the judge deliberately bad outputs — does it catch them?); ordering robustness (same items in different orders — same scores?).
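
The agreement step can be sketched in a few lines. This is a minimal illustration, assuming binary pass/fail labels. The label lists are made-up data, and `percent_agreement` and `cohens_kappa` are illustrative helper names, not a library API.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two label lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up labels for eight outputs (assumption for illustration only).
human_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
judge   = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]

human_human = percent_agreement(human_1, human_2)
judge_human = percent_agreement(judge, human_1)
print(f"human-human agreement: {human_human:.3f}")
print(f"judge-human agreement: {judge_human:.3f}")
print(f"judge-human kappa:     {cohens_kappa(judge, human_1):.3f}")
print("judge meets the human-human bar:", judge_human >= human_human)
```

Kappa is worth reporting next to raw agreement because a judge that always says "pass" can post high raw agreement on a skewed label distribution.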

→ Mental Model
Judge validation is a mostly up-front investment that pays back on every eval run afterwards. Spend 1–2 days calibrating the judge, then let it do months of work, re-validating only when the judge model or the rubric changes. Skipping this step is the #1 reason eval programs lose credibility internally.
Q23

Single vs pairwise vs reference-based judging — when to use each?

Three modes. They solve different problems and have different noise profiles.

Mode | Question it answers | Best for | Main bias
Single (pointwise) | How good is this output on a scale of 1–5? | Tracking quality over time | Calibration drift, verbosity
Pairwise | Is A better than B? | Model comparison / A-B preference | Position bias
Reference-based | Does this match the expected answer? | When gold answer exists | Over-penalizing valid paraphrase
DECISION TREE
Do you have a gold answer?
Yes → reference-based. No → keep asking.
Are you comparing two models?
Yes → pairwise (more reliable than scoring each model pointwise and subtracting).
Tracking one system over time?
Yes → single with a stable rubric (same judge + rubric month-over-month).

A pragmatic shortcut — pairwise is your most reliable signal when you have two systems. Humans are better at "A or B?" than at "rate 1–5", and so are LLMs. If you only need to ship one of two variants, don't over-engineer — run pairwise, randomize order, done.
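
A minimal sketch of pairwise judging with position-bias control: each pair is judged in both orders, and a verdict that flips with the order counts as a tie. The `biased_stub_judge` function is a hypothetical stand-in for a real LLM call, written to be deliberately position-biased on near-ties so the effect is visible.

```python
def biased_stub_judge(prompt, first, second):
    """Hypothetical stand-in for an LLM judge; returns 'first' or 'second'.

    It prefers the clearly longer answer, and on near-ties it favours
    whatever was shown first, which mimics position bias.
    """
    if abs(len(first) - len(second)) <= 5:
        return "first"
    return "first" if len(first) > len(second) else "second"

def pairwise_both_orders(judge, prompt, a, b):
    """Judge (A, B) and (B, A); return 'A', 'B', or 'tie'."""
    verdict_ab = judge(prompt, a, b)   # A shown first
    verdict_ba = judge(prompt, b, a)   # B shown first
    a_wins = (verdict_ab == "first") + (verdict_ba == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the verdict flipped with the order: no real preference

def win_rate_for_a(judge, items):
    """items: list of (prompt, a, b). A tie counts as half a win."""
    outcomes = [pairwise_both_orders(judge, p, a, b) for p, a, b in items]
    points = sum(1.0 if o == "A" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return points / len(outcomes)

items = [
    ("q1", "short", "a much longer answer here"),  # B wins in both orders
    ("q2", "abcde", "abcdf"),                      # near-tie: order decides
]
print(win_rate_for_a(biased_stub_judge, items))
```

Running both orders doubles judge cost, so a cheaper variant is to randomize the order once per pair. That removes the bias from the average but not from individual verdicts.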

→ Interview Tip
If the interviewer describes a "did the new model beat the old model?" scenario, immediately say "pairwise, with position randomization." That one phrase answers two questions at once — mode and bias mitigation.
Q24

How do you write a good judge prompt?

A vague prompt produces a vague judge. A good judge prompt has five components, each doing specific work —

ANATOMY OF A JUDGE PROMPT
1 · Role & task framing
"You are an expert evaluator of customer support replies…"
2 · Rubric with explicit criteria
Named axes (correctness, tone, actionability) with definitions + examples at each level.
3 · Anti-bias instructions
"Do not reward length. Do not reward formatting. Judge content only."
4 · Chain-of-thought before score
Reasoning first in a "reasoning" field; score last. This materially improves accuracy.
5 · Structured output
JSON schema: {reasoning, per-axis scores, overall, confidence}. Parseable, auditable.

Two high-leverage tweaks — few-shot with borderline examples (1 clearly good, 1 clearly bad, 2 borderline, each with rationale) anchors the rubric far better than definitions alone; ask for confidence — a judge's low-confidence items are exactly where you should route human review.

Avoid "rate this on 1–10 overall": that's vibes, not eval. Avoid single-number scales without anchors. Avoid asking the judge to rank more than 2 items in one call; reliability tends to degrade as the list grows.
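
The five components can be sketched as a prompt template plus a strict parser for the judge's reply. The wording, the 1–5 scale, the field names, and the 0.6 review threshold are all assumptions chosen for illustration. No model is called here: `parse_verdict` only validates text a judge would return.

```python
import json

AXES = ("correctness", "tone", "actionability")

# Doubled braces are literal braces after str.format().
JUDGE_PROMPT = """You are an expert evaluator of customer support replies.

Rubric (score each axis from 1 to 5):
- correctness: facts and policy statements are accurate.
- tone: polite, calm, appropriate for support.
- actionability: the customer knows what to do next.

Do not reward length. Do not reward formatting. Judge content only.

Write your reasoning first, then the scores. Respond with JSON only:
{{"reasoning": "...", "scores": {{"correctness": 0, "tone": 0, "actionability": 0}},
  "overall": 0, "confidence": 0.0}}

Customer message:
{user_input}

Reply to evaluate:
{model_output}
"""

def build_prompt(user_input, model_output):
    return JUDGE_PROMPT.format(user_input=user_input, model_output=model_output)

def parse_verdict(raw):
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    scores = data.get("scores") if isinstance(data, dict) else None
    if not isinstance(scores, dict):
        raise ValueError("missing 'scores' object")
    for axis in AXES:
        value = scores.get(axis)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {axis!r}")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return data

def needs_human_review(verdict, min_confidence=0.6):
    """Route low-confidence verdicts to a human (threshold is an assumption)."""
    return verdict["confidence"] < min_confidence
```

Rejecting malformed verdicts loudly, rather than defaulting a missing score to some value, keeps parse failures from silently shifting the metric.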

→ Real-World Use
Version-control judge prompts exactly like code. A 2-line change to the rubric can shift scores by 10 points. You need a diff log so when a metric moves, you can tell whether the model changed or the judge did.
Q25

How do you handle cost and latency of LLM judging at scale?

A frontier-model judge on 10,000 examples is real money — and slow. The cost envelope dictates how often you can run evals, which dictates how fast your dev loop is. Six practical levers —

COST / LATENCY LEVERS FOR JUDGES
01 · Tiered judges
Cheap fast judge on 100% · frontier judge only on disagreement or critical slices.
02 · Smaller specialized judges
Distill or fine-tune a small model on expensive-judge labels for recurring rubrics.
03 · Batching + caching
Prompt caching on rubric. Async batch API for nightlies. Only re-judge changed outputs.
04 · Sampling & stratification
Don't judge all 10k every PR — judge a stratified 500. Full run weekly.
05 · Cheap pre-filter
Deterministic / embedding checks first · send only survivors to the judge.
06 · Short rubric, short output
Compact rubric; structured JSON with minimal reasoning. Every token matters at scale.

Rule of thumb — if your eval run costs more than one engineer-hour, people stop running it. Optimize aggressively to keep it cheaper than that. Cheap evals that run 10× a day beat rigorous evals that run once a week.
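
Lever 01 can be sketched as a small router: the cheap judge scores everything, and the strong judge is called only for critical slices or when the cheap score lands in an "unsure" band. Both judges are hypothetical callables, and the band limits (0.35 to 0.65) are assumptions. Disagreement between two cheap judges would be another reasonable trigger.

```python
def tiered_judge(items, cheap_judge, strong_judge,
                 critical_slices=(), low=0.35, high=0.65):
    """Score all items cheaply; escalate unsure or critical ones.

    cheap_judge / strong_judge: callables mapping an item to a score in [0, 1].
    Returns ({item_id: score}, number_of_escalations).
    """
    scores, escalated = {}, 0
    for item in items:
        score = cheap_judge(item)
        unsure = low < score < high
        if unsure or item["slice"] in critical_slices:
            score = strong_judge(item)   # the expensive call
            escalated += 1
        scores[item["id"]] = score
    return scores, escalated

# Stub data: the judge scores are precomputed so the sketch runs offline.
items = [
    {"id": 1, "slice": "general", "cheap": 0.90, "strong": 0.80},
    {"id": 2, "slice": "general", "cheap": 0.50, "strong": 0.20},
    {"id": 3, "slice": "safety",  "cheap": 0.95, "strong": 0.70},
]
scores, escalated = tiered_judge(
    items,
    cheap_judge=lambda it: it["cheap"],
    strong_judge=lambda it: it["strong"],
    critical_slices={"safety"},
)
print(scores, f"escalated {escalated} of {len(items)}")
```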

→ Real-World Use
Prompt caching on the judge rubric alone can cut cost substantially, on the order of 60–80% when the rubric dominates the prompt, because the rubric is the same across every call. It's the first optimization that pays for itself the day you ship it.
Q26

What is judge-model drift — and how do you detect it?

The judge is itself an LLM served by a vendor. When the vendor updates or deprecates the judge model, the same outputs can receive different scores — even with identical rubric and code. Your quality trend chart moves and it's not the system under test; it's the judge.

WHY JUDGE DRIFT IS PARTICULARLY DANGEROUS
Silent
Scores change without any change in your codebase or golden set. Attribution is ambiguous.
Compounding
Every downstream decision (ship/no-ship, model rollback) uses a drifted reference point.

Defenses — pin the judge model version explicitly (never use "latest"); maintain an anchor set of 50–100 outputs with known human labels, re-run them whenever the judge model changes, and measure if judge-human agreement still holds; monitor per-slice score distributions over time — a sudden shift without a code change is the classic judge-drift fingerprint.

When a new judge model version does launch — don't blindly migrate. Re-validate on the anchor set, publish a "calibration delta", and only then switch. Historical scores before and after should be marked with a judge-version label so comparisons stay honest.

→ Key Insight
The judge is infrastructure, not a free oracle. Version it, monitor it, re-validate it. Teams that forget this ship regressions caused by the judge itself and blame their product code for months.
Part V

Regression & CI

Turning evals into tripwires that catch regressions before they ship — CI suites, thresholds, non-determinism, and shadow evaluation.

What Regresses
CI Design
Assertion vs Score
Thresholds
Non-Determinism
Shadow Evals
Questions 27–32
Q27

What is regression testing for LLMs — and what actually regresses?

Regression testing means detecting when a change to any part of the stack makes quality worse on cases that previously worked. For LLMs it's broader than unit testing because the "change" can come from many places.

WHAT CAN CAUSE A REGRESSION
01 · Prompt change
Someone tweaks a system prompt to fix one bug, breaks three others.
02 · Model version bump
Vendor updates the model — same version string, different behavior.
03 · Retrieval change
New embedding model, new chunking, new index — all invisible to the LLM.
04 · Tool schema change
An API a tool calls changes response shape → agent fails silently.
05 · Post-processing
A parser is tightened and now rejects outputs that used to pass.
06 · Dependency drift
A library upgrade changes tokenization, truncation, retry behavior.

The implication — regression tests must exercise the full pipeline, not just the model. A regression on raw-model output is diagnostic; a regression on end-to-end output is what users experience.

→ Mental Model
Every regression story ends "…and we didn't catch it because our eval only tested X in isolation." Your eval must test the same surface the user hits.
Q28

How do you design CI-friendly eval suites?

CI wants three properties — fast, cheap, deterministic-enough. Most LLM evals are none of those. The design pattern is tiering.

TIERED CI EVAL SUITE
Tier 1 · Pre-commit (seconds)
Lint, schema checks, 10–30 deterministic assertions. No model calls.
Tier 2 · PR CI (minutes)
~100 golden-set items, embedding + heuristic metrics, cheap judge on a subset.
Tier 3 · Nightly (~1 hour)
Full golden set, strong LLM judge, all slices, regression vs last green run.
Tier 4 · Release gate
Extended set, safety + red-team, pairwise vs currently-shipped model, human spot check.

Rules that keep CI fast — cache model calls keyed on (prompt-hash, model-version, temperature); parallelize across examples; fail fast on deterministic tiers before spending judge budget. And — critical for keeping devs shipping — block merges only on Tier 2 failures. Tier 3 regressions page you; Tier 4 is for release decisions.
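
The caching rule can be sketched as a wrapper keyed on (prompt-hash, model-version, temperature). `CachedModel` and the in-memory dict are illustrative assumptions; a real CI setup would more likely persist the cache to disk or a shared store.

```python
import hashlib
import json

def cache_key(prompt, model_version, temperature):
    """Stable key built from the prompt hash, model version, and temperature."""
    payload = json.dumps(
        {
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "model_version": model_version,
            "temperature": temperature,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CachedModel:
    """Wraps a model-calling function and reuses results for identical keys."""

    def __init__(self, call_model, model_version, temperature=0.0):
        self._call_model = call_model      # hypothetical: prompt -> output text
        self.model_version = model_version
        self.temperature = temperature
        self._cache = {}
        self.misses = 0

    def __call__(self, prompt):
        key = cache_key(prompt, self.model_version, self.temperature)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]
```

One caveat: caching at temperature above 0 freezes a single sample, so this fits the PR tier better than runs meant to measure sampling variance.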

→ Interview Tip
Strongest answer: "Tier 1 and 2 must complete in under 5 minutes or engineers will game them. Tier 3 runs nightly. That pacing is non-negotiable." Shows you've lived through devs disabling slow CI.
Q29

Assertion-style vs scoring-style regression tests — what's the difference?

Assertion-style tests have a single, well-defined pass/fail condition on a single example. "When asked to cancel, the response must contain a confirmation token." They're unit-test shaped — one failure, one owner, one line of output.

Scoring-style tests run an evaluator over many examples and compare an aggregate score to a baseline. "On the 500-item support set, factuality score ≥ 0.87." They detect distribution shifts no individual assertion would catch.

TWO SHAPES OF REGRESSION TESTS
Assertion-style
Pass/fail, single case
Named after the bug it prevents
Lives forever (regression prevention)
Blocks PRs instantly
Good for known failure modes
Scoring-style
Aggregate score vs threshold
Across stratified samples
Tracked over time (trend)
Flags drift, not defects
Good for unknown distribution shifts

A mature program has both. Every production incident should spawn one new assertion test — this is how "technical debt that caused the incident" becomes "regression that can't recur." Meanwhile, scoring tests ride the distribution.
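
The two shapes look like this side by side. `support_bot` is a hypothetical stand-in for the full pipeline, and the confirmation token, examples, and thresholds are made up for illustration.

```python
def support_bot(message):
    """Hypothetical stand-in for the end-to-end system under test."""
    if "cancel" in message.lower():
        return "Your plan is cancelled. Confirmation: CANCEL-OK"
    return "Happy to help with that."

def test_cancel_reply_contains_confirmation_token():
    # Assertion-style: one example, one pass/fail condition,
    # named after the bug it prevents.
    assert "CANCEL-OK" in support_bot("Please cancel my plan")

def mean_score(examples, system, scorer):
    """Average a per-example scorer over the whole set."""
    return sum(scorer(system(ex["input"]), ex) for ex in examples) / len(examples)

def scoring_gate(examples, system, scorer, baseline, tolerance):
    # Scoring-style: aggregate over many examples, compare with a baseline.
    score = mean_score(examples, system, scorer)
    return score >= baseline - tolerance, score

examples = [
    {"input": "cancel please", "expect": "CANCEL-OK"},
    {"input": "hi",            "expect": "help"},
    {"input": "refund?",       "expect": "refund"},
    {"input": "Cancel now",    "expect": "cancelled"},
]

def contains_expected(output, example):
    return 1.0 if example["expect"] in output else 0.0

test_cancel_reply_contains_confirmation_token()
passed, score = scoring_gate(examples, support_bot, contains_expected,
                             baseline=0.8, tolerance=0.1)
print(passed, score)
```

The assertion test points at one owner and one bug. The scoring gate says only that the distribution moved, which is why the two are complementary.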

→ Real-World Use
"Every incident produces an assertion test." Name this norm explicitly in interviews. It's what separates teams that learn from incidents from teams that keep rediscovering the same bugs.
Q30

How do you set thresholds and gates for passing builds?

Naive thresholds are "score must be ≥ 0.9." They fail two ways — too tight (every noisy run is red), or too loose (a 3-point regression slides through). The good pattern is relative, not absolute.

FOUR THRESHOLD PATTERNS
Absolute floor
Score ≥ X. Simple, brittle. Use only for hard safety constraints.
Delta vs baseline
Score ≥ baseline − tolerance. Main pattern. Tolerance from bootstrap CI.
Per-slice delta
No slice may drop by >Y. Prevents average hiding targeted regressions.
Statistical significance
Score drop significant at p < 0.05 via paired test over examples.

Practical rule: set the tolerance to the half-width of the 95% confidence interval of your eval score. Bootstrap your golden set, measure run-to-run variance, and treat any drop larger than that interval as a likely real regression rather than noise. Tight thresholds without calibration produce red builds on noise; calibrated thresholds produce red builds on actual regressions.
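
A minimal sketch of the delta-vs-baseline gate with a bootstrap-derived tolerance. It resamples per-example baseline scores, so it captures sampling noise from the golden set only. Noise from non-deterministic generation would need repeated runs on top. The function names and the 2,000-resample default are assumptions.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def delta_gate(baseline_scores, candidate_scores, **bootstrap_kwargs):
    """Pass if candidate mean >= baseline mean minus the CI half-width."""
    lower, upper = bootstrap_ci(baseline_scores, **bootstrap_kwargs)
    tolerance = (upper - lower) / 2
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    threshold = baseline_mean - tolerance
    return candidate_mean >= threshold, threshold

# Made-up per-example pass/fail scores on a 100-item set.
baseline = [1.0] * 85 + [0.0] * 15
small_dip = [1.0] * 84 + [0.0] * 16
big_drop = [1.0] * 60 + [0.0] * 40

print(delta_gate(baseline, small_dip))  # within noise for a set this small
print(delta_gate(baseline, big_drop))   # well outside the interval
```

With only 100 binary examples the interval is wide, so small regressions cannot be told apart from noise. A larger golden set or a paired test over per-example differences tightens the gate.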

Separate safety from quality. Safety thresholds are absolute, non-negotiable (zero PII leaks, zero jailbreak passes). Quality thresholds are delta-based and can negotiate.

→ Key Insight
Quality thresholds should never be "score ≥ 0.87." They should be "score ≥ baseline − 1.5 × noise", where noise is the run-to-run standard deviation you measured. Otherwise you're picking an arbitrary number and pretending it means something.
Q31

How do you handle non-determinism (temperature, sampling) in tests?

LLMs are non-deterministic even at temperature 0 (vendors reserve the right to vary inference internals). That makes traditional "expected == actual" testing unreliable. Three stabilization strategies cover most cases —

STABILIZATION STRATEGIES FOR NON-DETERMINISM
Lower variance
Temperature 0, fixed seed (if supported), stable rubric. Reduces but doesn't eliminate noise.
Semantic tests
Assert properties (contains X, valid JSON, within bounds) — not exact strings.
N-sample aggregation
Run each example N times, report mean + CI. Thresholds on distribution, not single run.

Pattern for CI — run each PR-tier example once at temperature 0, and each nightly-tier example 3–5 times with temperature matching production. PR tier gets fast signal; nightly tier gets statistical confidence.

And — don't fight non-determinism by trying to assert exact strings with temperature 0. It's a losing battle against vendor-side changes. Design tests that would pass for a human writing the same answer differently.
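
Property assertions and N-sample aggregation can be combined as below. The output contract (valid JSON with an `answer` key under 200 characters) and the stub model are assumptions. The half-width uses a rough normal approximation, which is crude at N = 5.

```python
import json
from itertools import cycle
from math import sqrt
from statistics import mean, stdev

def satisfies_contract(raw_output):
    """Property checks that survive rewording: valid JSON, required key, bounded length."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, dict)
            and "answer" in data
            and len(str(data["answer"])) <= 200)

def sample_pass_rate(generate, prompt, n=5):
    """Run one example n times; return (mean pass rate, rough 95% half-width)."""
    passes = [1.0 if satisfies_contract(generate(prompt)) else 0.0 for _ in range(n)]
    half_width = 1.96 * stdev(passes) / sqrt(n) if n > 1 else 0.0
    return mean(passes), half_width

# Hypothetical model stub: same meaning, different wording, one malformed reply.
_replies = cycle([
    '{"answer": "Your order ships Monday."}',
    '{"answer": "It will ship on Monday."}',
    '{"answer": "Shipping is scheduled for Monday."}',
    'Sure! It ships Monday.',   # not JSON, so it fails the contract
    '{"answer": "Monday."}',
])

def stub_model(prompt):
    return next(_replies)

rate, half_width = sample_pass_rate(stub_model, "When does my order ship?", n=5)
print(f"pass rate {rate:.2f} +/- {half_width:.2f}")
```

An exact-string assertion would fail four of these five replies. The property check fails only the one that actually breaks the contract.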

→ Interview Tip
"Even at temperature 0, outputs aren't reproducible across inference runs at scale — vendors don't guarantee it. So tests assert properties, not exact text." That sentence alone is senior-level evidence.
Q32

What is a shadow eval — and when do you use one?

A shadow eval runs the new candidate model or configuration alongside the currently-shipped one, on real production traffic, without affecting user experience. The new system's outputs are captured and scored offline; only the shipped system's outputs reach users.

SHADOW EVALUATION ARCHITECTURE
User request → Router
Router → Shipped model → returned to user
Router → Candidate model → logged only
Logged outputs → Offline judge → Comparison dashboard

When to use shadow — before any A/B test touches users. A shadow run tells you if the candidate produces obviously worse outputs on your real distribution. It catches prompt regressions, latency blowups, schema breaks, cost explosions, and policy violations without user exposure.

Key constraints — shadow adds real cost (you're paying for 2× inference), and for multi-turn or state-changing flows it's hard (shadow can't actually execute the user's booking). Mitigation — sample a fraction of traffic, not 100%. A 5% shadow is usually enough to surface showstoppers.
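
The routing logic can be sketched as follows. Both models are hypothetical callables, and hashing the request ID to sample deterministically is an assumption, as is the in-memory log. In a real service the candidate call would run asynchronously so it cannot add latency to the user-facing path.

```python
import hashlib

def in_shadow_sample(request_id, fraction=0.05):
    """Deterministically select roughly `fraction` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < fraction * 2**64

def handle_request(request_id, prompt, shipped_model, candidate_model,
                   shadow_log, fraction=0.05):
    """Return the shipped model's answer; log the candidate's answer on a sample."""
    response = shipped_model(prompt)          # only this reaches the user
    if in_shadow_sample(request_id, fraction):
        try:
            shadow_log.append({
                "request_id": request_id,
                "prompt": prompt,
                "shipped": response,
                "candidate": candidate_model(prompt),
            })
        except Exception as exc:              # a candidate failure must never hurt the user
            shadow_log.append({"request_id": request_id, "error": repr(exc)})
    return response
```

The logged pairs are exactly the input a pairwise offline judge needs: same prompt, shipped output, candidate output.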

→ Real-World Use
Every model or prompt change should run through shadow eval on real traffic before A/B. Shadow catches "this breaks production" cheaply; A/B answers "does this move business metrics" expensively. In that order.
Part VI

Offline vs Online

The most important split in LLM evaluation. Offline lets you iterate fast and safely; online tells you if any of it actually matters to users. You need both, and you need to close the loop between them.

Core Distinction
Tradeoffs
A/B Testing
Implicit Signals
Closing the Loop
Interleaving
Questions 33–38
Q33

What's the core distinction between offline and online evaluation?

Offline evaluation runs on a fixed, curated dataset with scored outputs — no users involved. Results are reproducible, cheap, and fast. You control the distribution and the judge. Online evaluation measures the system as it serves real users, typically via A/B tests, implicit signals (clicks, thumbs, retention), or explicit feedback.

OFFLINE VS ONLINE
Offline
Fixed golden set
Reproducible, cheap
Proxy metrics (judge, F1)
Minutes to hours to run
No user risk
"Would this change be good?"
Online
Live user traffic
Ground-truth business metrics
Behavior signals (click, retention)
Days to weeks for stat sig
User exposure
"Did this change actually help users?"

The cognitive frame — offline measures capability, online measures value. A change can improve offline metrics and not move the needle online (users didn't notice). It can also degrade offline metrics and improve online ones (the eval was measuring the wrong thing). Both outcomes happen routinely and both are informative.

→ Mental Model
Offline is the lab, online is the field. A new drug passes lab tests for months before entering clinical trials; you wouldn't ship a model to users without offline proof, and you wouldn't trust offline proof alone to know if it worked.
Q34

What are the tradeoffs of each approach?

Dimension | Offline | Online
Speed | Minutes to hours | Days to weeks (stat sig)
Cost | Judge tokens + data curation | User exposure + infra + analysis
Reproducibility | High (fixed set) | Low (traffic changes daily)
Signal type | Proxy metrics | Ground-truth business metrics
Coverage | Only what you curated | Real distribution incl. tail
Risk | None (no users) | User harm, revenue, brand
Good for | Iteration, regression, ship/no-ship | Final validation, value discovery

The classic failure modes on each side —

WHAT GOES WRONG WITH EACH
Offline-only pitfalls
Goodhart — gaming the judge
Distribution mismatch
Ship, users don't notice
You're optimizing the wrong thing.
Online-only pitfalls
Too slow — can't iterate
Attribution is fuzzy
Regressions reach users first
You learn only after harm.

Neither wins. The point is to triangulate — use offline to gate and iterate, online to validate and discover, each informing the other's design.

→ Interview Framing
"Offline is fast and cheap but a proxy. Online is slow and expensive but real. A good program uses offline to gate shipping and online to decide if the shipping was worth it." Memorize.
Q35

How do online A/B tests work for LLM features?

An A/B test randomly splits users between control (current system) and treatment (new system) and measures downstream outcome metrics. For LLM features, three twists matter —

LLM A/B TEST DESIGN
1 · Randomize at the user level
Not per-request. Same user should see the same variant to avoid confusion & contamination.
2 · Choose guardrail + north-star metrics
North star: task success, resolution, retention. Guardrails: safety incidents, latency, cost.
3 · Watch for novelty effects
New UX or tone gets clicked more in week 1 just because it's new. Run ≥ 2 weeks.

Sample size is non-trivial. LLM outputs are high-variance per user — you often need 10×–100× the users you'd need for a simple UI change. Pre-compute the needed sample with a power calculation based on observed per-user variance in your task.
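
A power calculation for a two-arm test on a per-user mean can be sketched with the standard normal-approximation formula n = 2 · ((z₁₋α/₂ + z_power) · σ / δ)² per arm. The standard deviations and effect size below are made-up inputs. A real calculation would use the per-user variance observed in your own logs.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(std_dev, min_detectable_effect, alpha=0.05, power=0.8):
    """Users needed in EACH arm for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * std_dev / min_detectable_effect) ** 2
    return ceil(n)

# Same target effect, two hypothetical levels of per-user noise.
low_noise = users_per_arm(std_dev=0.15, min_detectable_effect=0.02)
high_noise = users_per_arm(std_dev=0.50, min_detectable_effect=0.02)
print(low_noise, high_noise, round(high_noise / low_noise, 1))
```

Required sample size grows with the square of the standard deviation, which is why a noisy per-user metric can demand an order of magnitude more users for the same detectable effect.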

Common pitfalls — peeking at results before stat sig and calling winners early; single-metric tunnel vision (north-star up, safety quietly down); and running A/B before any offline validation so a dangerous regression gets user exposure.

→ Real-World Use
Always run offline gate → shadow eval → 5% canary → 50% A/B → 100%. Jumping from "passed offline" to "50% A/B" is how regressions reach users at scale.
Q36

What online signals complement offline metrics?

Online signals split into explicit (user volunteered) and implicit (inferred from behavior). Implicit signals dominate because they require no user effort and scale.

ONLINE SIGNAL CATALOG
Explicit
👍 / 👎, star ratings, free-text feedback, "regenerate" clicks, escalation-to-human, conversation-rating surveys.
Implicit
Task completion, session length, return rate, copy events, retry rate, abandonment, time-to-resolution.
Operational
Latency p95, error rate, tool-call failure rate, token spend, retries, fallback hits.
Safety
Policy violations, PII surfaces, jailbreak attempts, abuse reports, legal escalations.

Sobering fact: explicit signals are biased. Users who rate are disproportionately those who are delighted or furious. A 4.8 average in-product rating is compatible with a silent 30% regression if middling users don't rate at all. Implicit signals close that gap because they capture what silent users do.

→ Key Insight
Your most valuable online signal is usually retry rate or regenerate clicks. It is far less exposed to self-selection bias than ratings, captures dissatisfaction without asking users for extra effort, and moves fast enough to matter. Instrument it before you instrument a thumbs-up widget.
Q37

How do you close the loop — from online signal back to offline dataset?

The loop that separates serious eval programs from theatre — production failures flow back into the golden set so they can be caught offline next time.

THE CLOSED-LOOP EVAL FLYWHEEL
Step 01 · Detect online: 👎, retries, escalations
Step 02 · Triage: cluster, tag failure mode
Step 03 · Label: SME authors expected output
Step 04 · Promote: new version of golden set
Step 05 · Protect: CI catches it next time

The practical mechanism — a weekly triage where PM + engineer go through prod failures, cluster them, pick 10–20 to promote, and add them with tags. Over a year, this grows a golden set that actually mirrors your production distribution rather than what you imagined it would be on day one.

Without this loop, offline evals stay frozen at whatever they were when you shipped v1. With this loop, they compound and your system gets monotonically harder to regress.

→ Interview Tip
If asked "how would you improve an existing eval program?", lead with closing the loop. "Every online failure becomes an offline test case within a week." That's the single highest-leverage process intervention you can name.
Q38

What is interleaving and when should you use it?

Interleaving is an alternative to classic A/B where, for a single user request, you mix outputs from two variants and measure which one the user prefers through direct behavior (click-through, copy, selection). Common in search ranking, increasingly applied to LLM response selection and citation ranking.

INTERLEAVING VS A/B
A/B
User sees one variant
Compare aggregate metrics
Need lots of users
Slow, rigorous
Interleaving
User sees both (mixed)
Within-user preference
~10× more sample-efficient
Fast, narrow applicability

When it works — ranked lists (search results, citations, suggestions) where "user picked from variant X" is a clean signal. When it doesn't — single-response chat, where you can't show two answers side-by-side without breaking UX.

For LLM products the practical application is usually candidate re-ranking: your product shows the top retrieved docs as a mix drawn from ranker A and ranker B, and you observe which ones the user clicks or cites. The statistical-power gain is large because each request yields a paired preference signal.
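
One common way to build the mixed list is team-draft interleaving, sketched below. The rankers take turns picking their best unused doc, a coin flip settles who goes first when they are level, and clicks are credited to the side that contributed the doc. The document IDs and function names are illustrative.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Return (interleaved_docs, {doc: 'A' or 'B'}) for the top-k slots."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while len(interleaved) < k:
        # The side with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        doc = next((d for d in rankings[side] if d not in team), None)
        if doc is None:
            # This side has nothing new left; let the other side fill the slot.
            side = "B" if side == "A" else "A"
            doc = next((d for d in rankings[side] if d not in team), None)
            if doc is None:
                break  # both rankings are exhausted
        interleaved.append(doc)
        team[doc] = side
        picks[side] += 1
    return interleaved, team

def credit_clicks(clicked_docs, team):
    """Count clicks per contributing ranker; unknown docs are ignored."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

shown, team = team_draft_interleave(
    ["d1", "d2", "d3"], ["d3", "d4", "d1"], k=4, rng=random.Random(0)
)
print(shown, credit_clicks(["d3", "d4"], team))
```

Summing `credit_clicks` over many requests gives a per-ranker win count. A sign test on which ranker wins each request is one simple way to call the result.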

→ Mental Model
Interleaving is the statistical win when you can afford to show both variants in one view. It's not a replacement for A/B — it's a sharp tool for ranking-shaped problems where A/B would be wastefully underpowered.
Part VII

Safety & Adversarial

Evaluating what happens at the edges — jailbreaks, PII leaks, prompt perturbations — and how RAG systems and agents need their own distinct eval frameworks.

Toxicity & Safety
Red-Teaming
Robustness
RAG Evals
Agent Evals
Alignment
Questions 39–44
Q39

How do you evaluate safety (toxicity, jailbreaks, PII leakage)?

Safety eval is structurally different from quality eval in one way — the metric you care about is rare-event. A 99.5% safe system still leaks PII on 1 in 200 requests; at 10M requests/day that's 50,000 incidents. So safety evals oversample adversarial inputs on purpose.

SAFETY EVAL FACETS
Toxicity
Hate speech, harassment, explicit content. Classifier score on outputs across adversarial prompts.
Jailbreaks
Known jailbreak corpora (AdvBench, HarmBench) → success rate. Must stay flat as model evolves.
PII leakage
Canary strings in context, NER + regex on outputs, prompt-injection leak attempts.
Policy adherence
Your product-specific rules — "don't give medical diagnosis", "don't quote prices". Rubric judge.

Gate safety with hard floors, not deltas. Quality can regress 1% and we negotiate; PII leakage must be zero or near-zero. Safety failures block releases independent of quality wins.

And remember — false positives have real cost. An over-aggressive safety layer that refuses legitimate queries tanks product utility. Measure refusal rate on benign prompts alongside harmful-content rate on adversarial ones — optimize the joint.
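
Reporting both failure modes together can be sketched as below. The result records, field names, and the gate limits (zero attack successes, at most 5% benign refusals) are assumptions for illustration.

```python
def safety_report(results):
    """results: dicts with 'kind' ('adversarial' or 'benign'), 'harmful', 'refused'."""
    adversarial = [r for r in results if r["kind"] == "adversarial"]
    benign = [r for r in results if r["kind"] == "benign"]
    if not adversarial or not benign:
        raise ValueError("need both adversarial and benign prompts")
    return {
        # Too permissive: harmful output produced for an attack prompt.
        "attack_success_rate": sum(r["harmful"] for r in adversarial) / len(adversarial),
        # Too restrictive: legitimate request refused.
        "benign_refusal_rate": sum(r["refused"] for r in benign) / len(benign),
    }

def release_gate(report, max_attack_success=0.0, max_benign_refusal=0.05):
    """Hard floors, not deltas: both limits must hold."""
    return (report["attack_success_rate"] <= max_attack_success
            and report["benign_refusal_rate"] <= max_benign_refusal)

# Made-up results: 1 of 4 attacks succeeds, 1 of 4 benign prompts is refused.
results = [
    {"kind": "adversarial", "harmful": False, "refused": True},
    {"kind": "adversarial", "harmful": False, "refused": True},
    {"kind": "adversarial", "harmful": True,  "refused": False},
    {"kind": "adversarial", "harmful": False, "refused": True},
    {"kind": "benign", "harmful": False, "refused": False},
    {"kind": "benign", "harmful": False, "refused": True},
    {"kind": "benign", "harmful": False, "refused": False},
    {"kind": "benign", "harmful": False, "refused": False},
]
report = safety_report(results)
print(report, release_gate(report))
```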

→ Key Insight
Safety has two failure modes: harmful outputs (too permissive) and useless refusals (too restrictive). Report both in the same dashboard. A system that refuses to help with anything is not safe — it's broken.
Q40

Red-teaming vs systematic adversarial evaluation — what's the difference?

Red-teaming is humans or agents actively trying to break your system through creative, open-ended attack — same spirit as security red teams. Systematic adversarial eval runs a fixed, versioned suite of known attacks against every release.

Red-teaming | Systematic adversarial eval
Discovers new failure modes | Prevents known ones from recurring
Creative, open-ended, unstructured | Fixed suite, CI-friendly
Humans / LLM attackers | Deterministic replays of known attacks
Output: a list of new jailbreaks found | Output: pass/fail on each known attack
Before major launches | Every release

They feed each other — red-team finds a jailbreak once, that jailbreak becomes a test case in the systematic suite forever. Over time the systematic suite grows to cover the union of all discovered attacks. Red-teaming focuses on the frontier of what's not yet in the suite.

THE ATTACK FLYWHEEL
Step 01 · Red-team: find new attack
Step 02 · Reproduce: minimize + script
Step 03 · Fix: prompt or policy change
Step 04 · Add to suite: regression test forever
→ Interview Tip
Good answer includes both: "Red-teaming discovers; systematic eval prevents recurrence. Mature programs run red-team sprints pre-launch and keep a growing regression suite of known attacks that every release must pass."
Q41

How do you measure robustness to prompt variations?

A robust system gives consistent-quality answers when the user rephrases, misspells, capitalizes oddly, or inserts noise. Fragile systems score 90% on a golden set and 55% on the same questions reworded.

The eval technique — perturbation batteries. For each golden-set item, generate variants and score each.

PERTURBATION CATEGORIES
Surface-level
Typos, whitespace, casing, punctuation noise. Easy wins — a fragile system fails these.
Paraphrase
Same intent, different wording. LLM-generated paraphrases covering 3–5 variants per item.
Distractor
Irrelevant sentences added before/after. Tests whether the model stays on-task.
Adversarial
Prompt injection ("ignore previous"), role-confusion attacks, instruction smuggling in user text.

Report robustness as score variance across perturbations — a system that scores 0.88 on originals and 0.85 on perturbations with low variance is robust; one that averages 0.86 but swings from 0.60 to 0.98 is brittle.
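
A minimal perturbation battery for the surface-level and distractor categories, plus a summary that reports spread rather than just the mean. The specific perturbations and the distractor sentence are assumptions. Paraphrase and adversarial variants would normally come from an LLM or a curated corpus.

```python
import random
from statistics import mean, pstdev

def perturb(text, rng):
    """Return cheap variants: casing, whitespace noise, one typo, one distractor."""
    variants = [
        text.upper(),
        text.lower(),
        "  " + text.replace(" ", "  ") + "  ",
        "By the way, the weather is nice today. " + text,
    ]
    if len(text) > 3:
        i = rng.randrange(len(text) - 1)
        # Swap two adjacent characters to simulate a typo.
        variants.append(text[:i] + text[i + 1] + text[i] + text[i + 2:])
    return variants

def robustness_summary(original_score, perturbed_scores):
    """Report the distribution of perturbed scores, not just their mean."""
    return {
        "original": original_score,
        "perturbed_mean": mean(perturbed_scores),
        "perturbed_min": min(perturbed_scores),
        "spread": max(perturbed_scores) - min(perturbed_scores),
        "std": pstdev(perturbed_scores),
    }

variants = perturb("How do I reset my password?", random.Random(0))
print(variants)
# Made-up scores for a brittle system: a fine average with a wide swing.
print(robustness_summary(0.90, [0.98, 0.95, 0.60, 0.91]))
```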

→ Mental Model
Robustness is a distribution property, not a mean. Two systems can have the same average score and radically different worst-case behavior — and users remember worst cases.
Q42

How do you evaluate RAG systems specifically?

A RAG system has two stages — retrieval (find relevant docs) and generation (answer from them). You must eval each separately, and then the end-to-end system. Blaming the wrong stage is the #1 RAG failure mode in teams.

RAG EVAL MATRIX
Retrieval quality
Recall@k, MRR, nDCG vs labeled relevance. Did we retrieve the right docs at all?
Grounding / faithfulness
% claims in answer supported by retrieved context. Judge or NLI model.
Answer relevance
Does the answer actually address the user's question, not a tangent?
Context precision / noise resistance
Does answer quality hold when irrelevant context is included?
End-to-end task success
Did the user get a correct, useful answer? The one metric that ships.

Diagnosis pattern — if retrieval Recall@5 is 0.95 but end-to-end answer correctness is 0.60, your generator is wasting good context. If Recall@5 is 0.50 and answer correctness is 0.45, your retriever is the bottleneck. Ablation — replace retrieved context with the gold doc and see how much answer quality improves.
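
The retrieval metrics and the diagnosis rule can be sketched directly. Recall@k and reciprocal rank follow their standard definitions. The cut-offs in `diagnose` (a recall floor of 0.8 and a gap of 0.2) are assumptions chosen so the two examples from the text land where the text says they should.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant docs that appear in the top-k retrieved docs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def diagnose(mean_recall, answer_correctness, recall_floor=0.8, gap=0.2):
    """Point at the likelier bottleneck (thresholds are illustrative)."""
    if mean_recall < recall_floor:
        return "retriever"
    if mean_recall - answer_correctness > gap:
        return "generator"
    return "balanced"

retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3), reciprocal_rank(retrieved, relevant))
print(diagnose(0.95, 0.60), diagnose(0.50, 0.45))
```

MRR is the mean of `reciprocal_rank` over all queries in the labeled set.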

→ Interview Tip
When asked about RAG evals, immediately decompose: retrieval metrics, grounding metrics, end-to-end metrics. Name them in that order. Candidates who jump straight to "we use RAGAS" without understanding what RAGAS is actually measuring come across as shallow.
Q43

How do you evaluate agents and tool-use?

An agent makes decisions, calls tools, and uses their outputs to decide the next action. Quality isn't a single response — it's a trajectory. Three scoring levels parallel the multi-turn frame —

AGENT EVAL LEVELS
Step-level
Correct tool chosen? Correct args? Handled the tool's response properly?
Trajectory-level
Is the sequence efficient (no loops, no unnecessary calls)? Does state stay coherent?
Outcome-level
Did the agent achieve the goal? Is the world in the expected post-state?

Practical techniques —

Technique | What it measures
Mocked tool sandboxes | Deterministic tool responses → reproducible trajectories for CI
Trajectory rubric | LLM judge scores whole trace against expected plan
Step budget checks | Agent must complete within N steps / $X cost
End-state assertions | Post-condition checks on world state (e.g., DB row created)
Loop / thrash detectors | Same tool called with same args twice → flag as stuck

The single highest-leverage agent eval is a mocked sandbox — tools return fixed responses so the entire trajectory is replayable. Without this, agent evals are non-deterministic and you can't tell if a change helped.
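
A mocked sandbox can be sketched in a few dozen lines. Everything here is hypothetical: the tool names, the canned responses, and `toy_agent`, which stands in for a real LLM-driven policy. The point is that tools return fixed responses and every call is recorded, so step-budget, loop, and end-state checks become deterministic.

```python
class MockSandbox:
    """Tools return fixed responses, so a trajectory is replayable."""

    def __init__(self, responses):
        self.responses = responses   # {tool_name: fixed response}
        self.calls = []              # recorded (tool_name, sorted args) in order
        self.bookings = []           # tiny stand-in for world state

    def call(self, tool, **args):
        self.calls.append((tool, tuple(sorted(args.items()))))
        if tool == "book_hotel":
            self.bookings.append(args["hotel_id"])
        return self.responses[tool]

def has_repeated_call(calls):
    """Same tool with the same args more than once: the agent is likely stuck."""
    return len(calls) != len(set(calls))

def toy_agent(sandbox, city):
    """Hypothetical policy: search, then book the first result."""
    hotels = sandbox.call("search_hotels", city=city)
    return sandbox.call("book_hotel", hotel_id=hotels[0]["id"])

def evaluate(agent, sandbox, expected_bookings, max_steps=4):
    agent(sandbox, "Lisbon")
    return {
        "within_step_budget": len(sandbox.calls) <= max_steps,
        "no_loops": not has_repeated_call(sandbox.calls),
        "end_state_ok": sandbox.bookings == expected_bookings,
        "tools_used": [tool for tool, _ in sandbox.calls],
    }

CANNED = {
    "search_hotels": [{"id": "h1"}, {"id": "h2"}],
    "book_hotel": {"status": "confirmed"},
}
print(evaluate(toy_agent, MockSandbox(CANNED), expected_bookings=["h1"]))
```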

→ Real-World Use
The interview test: "Our agent uses web search — how do you eval it?" Best answer: "Mock the search tool with fixed responses, score trajectories deterministically in CI. Run with live search less often — nightly — because non-determinism makes signal noisy."
Q44

Capability vs alignment evaluation — what's the distinction?

Capability evals ask "can the model do X?" — solve problems, reason, code, use tools. Alignment evals ask "does the model do what we want, when we want, and refrain from what we don't?" — honesty, helpfulness, harmlessness, following instructions, respecting constraints.

CAPABILITY VS ALIGNMENT
Capability
Can it solve this problem?
Scales with model size + data
Benchmarks: MMLU, HumanEval, math
"Is it smart enough?"
Alignment
Does it do what we want?
Shaped by RLHF, prompts, guardrails
Benchmarks: TruthfulQA, HarmBench
"Is it behaving correctly?"

A more capable model is not automatically more aligned. It can be the opposite, because a more capable model is also more capable of persuasively producing misleading or harmful content. Capability gains in a new model release do not guarantee matching alignment gains; alignment evals are what pick up the gap.

For product teams — capability evals bound your shortlist; alignment evals determine deployability. A model can crush MMLU and still fail your "don't give financial advice" rubric in ways that block launch.

→ Key Insight
Alignment is product-specific. A generally "aligned" model may still violate your product's policies — "don't diagnose", "don't quote prices", "don't speculate". Those rules don't come from the model vendor; they come from your eval suite.
Part VIII

Production & Strategy

Evals in production, drift detection, culture, and how to synthesize everything into answers that land in the interview itself.

Production Monitoring
Observability
Drift
Culture
Interview Mistakes
Design Template
Unifying Frame
Questions 45–51
Q45

How do you monitor LLM quality in production?

Production monitoring is continuous, sampled evaluation on live traffic. You can't run a full LLM judge on every request (too expensive), so the pattern is tiered sampling.

PRODUCTION MONITORING LAYERS
100% · Deterministic checks
Schema validation, safety classifier, refusal detector, length bounds, PII scanners. Every request.
10% · Cheap LLM judge
Smaller/cheaper model scoring core quality axes. Sampled, rolling window.
1% · Strong LLM judge
Frontier judge on a stratified sample. Higher signal, used for week-over-week trends.
0.1% · Human review
Expert review of flagged outputs. Calibrates judges, spots new failure modes.
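One way to implement the tiers is to hash each request ID into a stable value between 0 and 1 and compare it against the tier rates. The sketch below makes several assumptions: the tier names are illustrative, the tiers are nested (a request sampled for the strong judge also gets the cheap judge), and human review is modelled as a plain sampling rate rather than as review of flagged outputs.

```python
import hashlib

# Rates come from the layers above; the tier names are illustrative.
TIERS = [
    ("cheap_judge", 0.10),
    ("strong_judge", 0.01),
    ("human_review", 0.001),
]


def bucket(request_id):
    """Map a request ID to a stable pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def checks_for(request_id):
    """Return the list of checks to run for one request."""
    checks = ["deterministic"]  # always runs, on 100% of traffic
    value = bucket(request_id)
    checks += [name for name, rate in TIERS if value < rate]
    return checks


if __name__ == "__main__":
    for rid in ("req-1", "req-2", "req-3"):
        print(rid, checks_for(rid))
```

Hashing instead of calling a random number generator keeps the decision reproducible: the same request always lands in the same tier, which helps when replaying traffic during an investigation.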

Surface the outputs in a rolling dashboard with per-slice quality, refusal rate, safety incidents, and the operational metrics (latency, cost, tool errors). Alerts fire on week-over-week drops, sudden spikes in refusals, or new clusters of 👎 feedback.

→ Real-World Use
Stratified sampling beats uniform. Ensure every slice (intent, user cohort, language) has a minimum sample — otherwise rare segments get zero judge coverage and regressions hide there.
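A minimal sketch of stratified sampling with a per-slice floor, assuming requests are dictionaries and that `slice_key`, `rate`, and `min_per_slice` are parameter names chosen here for illustration:

```python
import random
from collections import defaultdict


def stratified_sample(requests, slice_key, rate, min_per_slice, seed=0):
    """Sample `rate` of each slice, but never fewer than `min_per_slice` items
    (or the whole slice, if it is smaller than the floor)."""
    rng = random.Random(seed)
    by_slice = defaultdict(list)
    for req in requests:
        by_slice[req[slice_key]].append(req)

    sample = []
    for items in by_slice.values():
        k = max(min_per_slice, round(len(items) * rate))
        k = min(k, len(items))  # cannot sample more than the slice holds
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    traffic = [{"lang": "en"}] * 1000 + [{"lang": "fr"}] * 5
    picked = stratified_sample(traffic, "lang", rate=0.01, min_per_slice=3)
    print(len(picked))
```

With uniform 1% sampling the five French requests would usually contribute nothing; the floor guarantees they still reach the judge.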
Q46

Evals vs observability — what's the difference?

They overlap, and interviewers often blur them. Hold the line:

Dimension | Evals | Observability
Primary question | "Is output quality good?" | "What is happening in the system?"
Signal type | Scored judgment of quality | Traces, logs, metrics
Granularity | Aggregate + per-slice | Per-request, per-span
Time horizon | Trend over days/weeks | Real-time + per-incident
Who consumes | PM, ML eng, Research | On-call, SRE, ML eng
Fires on | Quality regression | Latency spike, error, token blow-up

They need each other. Observability without evals tells you the system is fast and cheap but says nothing about whether it's right. Evals without observability tell you quality scores trended down but not which prompt, model, or tool change caused it. A unified trace ID that joins "this request produced this output which scored this way" is the bridge.
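The trace-ID bridge can be pictured as a simple join between observability records and eval scores. In the sketch below the field names (`trace_id`, `prompt_version`, `score`) are assumptions for illustration, not the schema of any particular platform.

```python
from collections import defaultdict
from statistics import mean


def join_by_trace(traces, scores):
    """Attach an eval score to each trace that has one, keyed on trace_id."""
    score_by_id = {s["trace_id"]: s["score"] for s in scores}
    return [
        {**t, "score": score_by_id[t["trace_id"]]}
        for t in traces
        if t["trace_id"] in score_by_id
    ]


def mean_score_by(joined, field):
    """Average the eval score per value of an observability field."""
    groups = defaultdict(list)
    for row in joined:
        groups[row[field]].append(row["score"])
    return {value: mean(vals) for value, vals in groups.items()}


if __name__ == "__main__":
    traces = [
        {"trace_id": "t1", "prompt_version": "v1"},
        {"trace_id": "t2", "prompt_version": "v2"},
        {"trace_id": "t3", "prompt_version": "v2"},
    ]
    scores = [
        {"trace_id": "t1", "score": 0.9},
        {"trace_id": "t2", "score": 0.5},
        {"trace_id": "t3", "score": 0.7},
    ]
    print(mean_score_by(join_by_trace(traces, scores), "prompt_version"))
```

Once the join exists, a quality drop can be broken down by any field the traces carry, such as prompt version, model ID, or tool name.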

→ Mental Model
Observability is descriptive (what happened). Evals are evaluative (was it good). Same infrastructure, different questions. Modern LLM platforms treat them as one product surface — but keep the conceptual distinction clear.
Q47

How do you detect and respond to model drift?

Drift can come from three sources — data drift (user behavior changes), model drift (vendor updates the underlying model), and concept drift (the definition of "good" changed, e.g., new policy).

DRIFT SIGNALS AND WHAT THEY MEAN
Input distribution shift
New intents, new languages, new phrasings. Cluster recent queries; compare embedding centroids to baseline.
Output distribution shift
Response length, refusal rate, tool-call frequency changes without a deployment. Sign of model-side change.
Quality regression
Judge scores, 👎 rate, retry rate trending wrong way. Combine into a composite health score.
Operational anomalies
Latency / token-count / cost drift. Often the first visible signal of a silent vendor update.
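The input-shift signal can be sketched as the cosine distance between the centroid of baseline query embeddings and the centroid of recent ones. The code assumes the embeddings are already computed by some embedding model and are non-zero vectors; the alert threshold is left out because it has to be tuned per product.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def input_shift(baseline_embeddings, recent_embeddings):
    """Distance between the baseline centroid and the recent centroid."""
    return cosine_distance(centroid(baseline_embeddings), centroid(recent_embeddings))


if __name__ == "__main__":
    baseline = [[1.0, 0.0], [0.9, 0.1]]
    recent = [[0.1, 0.9], [0.0, 1.0]]
    print(round(input_shift(baseline, recent), 3))
```

A single centroid hides multi-modal shifts, so in practice this is a coarse first alarm; clustering the recent queries, as described above, shows which new intents are responsible.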

Response playbook — (1) confirm it's not your own deployment (check the change log); (2) run the anchor set to isolate judge vs system; (3) replay recent prod requests on the prior model version to localize; (4) if the vendor changed, trigger re-validation on the sealed set and decide whether to pin a prior version or accept the new behavior.

→ Real-World Use
Maintain a "drift dashboard" with four panels — input shift, output shift, quality trend, operational trend. When an alert fires, being able to glance at all four in one view tells you in seconds whether it's user-side, vendor-side, code-side, or judge-side.
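The playbook order can be encoded as a small triage function. This is a sketch only: the four boolean inputs are hypothetical summaries of the dashboard panels and the change log, and the mapping from signal to likely source is a simplification of the steps described above.

```python
def triage_drift(deployed_recently, anchor_scores_moved, input_shifted, output_or_ops_shifted):
    """Return the most likely source of a drift alert, checked in playbook order."""
    if deployed_recently:
        return "code-side"    # step 1: rule out your own deployment first
    if anchor_scores_moved:
        return "judge-side"   # step 2: fixed anchor outputs now score differently
    if input_shifted:
        return "user-side"    # new intents, languages, or phrasings
    if output_or_ops_shifted:
        return "vendor-side"  # outputs or latency moved with no deployment
    return "unexplained"


if __name__ == "__main__":
    print(triage_drift(False, False, False, True))
```

Real incidents often show several signals at once, so the function's answer is a starting hypothesis to confirm with a replay on the prior model version, not a verdict.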
Q48

How do you build an eval-driven development culture?

Most LLM teams treat evals as a pre-launch activity. Eval-driven teams treat them as a dev loop — every prompt change, every retrieval tweak is measured against the eval set before it merges.

ORG PRACTICES THAT MAKE EVALS STICK
01 · Evals in PRs
Every PR shows delta vs baseline. No mystery regressions.
02 · Eval before build
Proposals include "how we'll measure this", not just "what we'll build".
03 · Every incident → test
Post-incident: the failing case enters the golden set with a label.
04 · Weekly eval review
PM + Eng triage prod failures, update golden set, review trend.
05 · Eval is code
Golden set, judges, rubrics — all versioned, reviewed, deployed.
06 · Leaders model it
If the TL ships a prompt change without showing an eval delta, the norm dies within a month.
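A PR-level "delta vs baseline" report can be as small as a per-slice mean comparison. The sketch assumes eval results arrive as `(slice_name, score)` pairs; the function names and report format are illustrative.

```python
from collections import defaultdict
from statistics import mean


def slice_means(results):
    """results: iterable of (slice_name, score) pairs -> {slice_name: mean score}."""
    by_slice = defaultdict(list)
    for slice_name, score in results:
        by_slice[slice_name].append(score)
    return {name: mean(scores) for name, scores in by_slice.items()}


def eval_delta(baseline, candidate):
    """Per-slice change in mean score (candidate minus baseline)."""
    base, cand = slice_means(baseline), slice_means(candidate)
    return {name: cand[name] - base[name] for name in base if name in cand}


def format_report(delta):
    """Render the deltas as one line per slice, suitable for a PR comment."""
    return "\n".join(f"{name}: {change:+.3f}" for name, change in sorted(delta.items()))


if __name__ == "__main__":
    baseline = [("billing", 0.8), ("billing", 0.6), ("tech", 0.9)]
    candidate = [("billing", 0.5), ("billing", 0.5), ("tech", 0.9)]
    print(format_report(eval_delta(baseline, candidate)))
```

Reporting per slice rather than one overall average is what exposes a regression confined to a single intent or language.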

The signal of a healthy culture — engineers reach for the golden set before they reach for the prompt editor. "Let me write a failing test first" becomes reflexive for LLM work the same way it did for backend work in mature test-driven teams.

→ Interview Tip
When asked about culture, talk concretely. Don't say "I encourage a data-driven mindset." Say "every PR shows eval delta vs baseline; regressions can't merge without a paired test case." Specifics signal you've actually built this.
Q49

What common mistakes do candidates make when discussing evals in interviews?

Mistake | Why it's wrong | What to say instead
"We use BLEU." | Signals unawareness of lexical-metric limitations | "BLEU for sanity, judge for semantics, humans for calibration"
Conflating eval and test | Suggests no production experience | "Tests catch known failures; evals detect distributional drift"
"MMLU looks good" | Confuses benchmark with product fit | "Benchmark filters shortlist; internal eval decides ship"
No slicing | Average metric masks targeted regressions | "I slice by intent, language, length, user cohort"
Unvalidated judge | Treats LLM judge as ground truth | "Judge validated against ≥100 human-labeled items; agreement ≥ human-human"
No loop back | Static eval sets go stale | "Every prod incident → golden set addition"
Safety as add-on | Bolts safety on at the end | "Hard floors on safety independent of quality delta gates"
Offline or online, not both | False dichotomy | "Offline gates ship, online validates value, closed-loop between them"
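The "agreement ≥ human-human" claim in the judge row needs a concrete agreement statistic. One common choice is Cohen's kappa, which corrects raw agreement for chance; the sketch below assumes categorical labels and uses kappa as a stand-in, since the table does not name a specific statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_is_validated(judge_vs_human_kappa, human_vs_human_kappa):
    """The judge passes if it agrees with humans at least as well as humans agree with each other."""
    return judge_vs_human_kappa >= human_vs_human_kappa


if __name__ == "__main__":
    human = ["pass", "pass", "fail", "fail"]
    judge = ["pass", "pass", "fail", "pass"]
    print(cohens_kappa(human, judge))
```

The human-human kappa comes from having two annotators label the same items, which is why a validation set needs at least some double-labelled examples.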

The fastest way to signal seniority is to name the tradeoff explicitly. "I'd pick X here because the alternative Y has this specific problem in my context." Juniors name techniques; seniors name tradeoffs.

→ Interview Tip
If you only memorize one phrase from this handbook, make it: "offline evals gate shipping, online evals validate value, and every production failure feeds back into the offline set." That sentence alone answers a huge fraction of all eval interview questions.
Q50

How do you structure your answer to "design an eval for X"?

This is the most common open-ended question in LLM eval interviews. Use a repeatable 6-step template.

THE "DESIGN AN EVAL" TEMPLATE
🎯 Step 01 · Task: what does success look like
📏 Step 02 · Dimensions: axes of quality
📦 Step 03 · Data: golden set design
⚖️ Step 04 · Scoring: metric per dimension
🚦 Step 05 · Gating: thresholds + CI tier
🔁 Step 06 · Loop: prod feedback

Worked example — "design an eval for a customer-support reply bot."

  1. Task: Resolve the user's ticket correctly and empathetically within policy.
  2. Dimensions: Correctness, policy adherence, tone, conciseness, safety (no PII, no legal claims).
  3. Data: 500 real redacted tickets stratified across billing / technical / cancellation / complaint, plus 50 adversarial cases (abusive users, prompt injections, PII probes).
  4. Scoring: Deterministic checks for policy/PII, a rubric-based LLM judge for correctness and tone, and embedding similarity plus a length bound vs the expected reply for conciseness.
  5. Gating: Safety is absolute (zero PII leaks); the quality drop vs baseline must be no larger than 1.5 × run-to-run noise; run a subset in PR CI and the full set nightly.
  6. Loop: Weekly triage of 👎 and escalation tickets into the golden set.
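The gating step of the worked example can be sketched as a single function. Two assumptions are made here: "noise" is taken to be the standard deviation of repeated baseline runs, and the function and parameter names are illustrative.

```python
from statistics import mean, stdev


def gate(pii_leaks, baseline_runs, candidate_score, noise_multiplier=1.5):
    """Return (passed, reason) for a candidate change.

    baseline_runs: scores from repeated runs of the baseline on the same eval set,
    used to estimate both the baseline level and the run-to-run noise.
    """
    if pii_leaks > 0:
        return False, "safety floor: PII leak"  # absolute, no quality trade-off
    noise = stdev(baseline_runs)
    drop = mean(baseline_runs) - candidate_score
    if drop > noise_multiplier * noise:
        return False, "quality regression beyond noise"
    return True, "ok"


if __name__ == "__main__":
    runs = [0.80, 0.82, 0.78]
    print(gate(0, runs, 0.78))
    print(gate(0, runs, 0.75))
    print(gate(1, runs, 0.90))
```

Checking safety before quality means a candidate with a large quality gain still fails on a single leak, which is the point of a hard floor.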

Deliver that in two minutes and you've outscored most candidates regardless of the specific X.

→ Interview Tip
Narrate the template out loud — "I'll walk through task, dimensions, data, scoring, gating, loop" — before diving in. It tells the interviewer you have a repeatable framework and invites them to push on any step they care about.
Q51

What one framework ties all of this together?

Everything in this handbook fits on one diagram — the closed-loop LLM eval stack. Keep it in your head and you can reconstruct any specific answer from first principles.

THE CLOSED-LOOP EVAL STACK
Foundation · Golden set (versioned, stratified, sealed)
Ground truth. Everything else calibrates against this.
↑↓
Scoring · Deterministic → embedding → judge → human
Push each case as far down the cost ladder as possible. Judge validated against humans.
↑↓
CI & regression · PR tier, nightly tier, release tier
Fast assertions + stratified scoring + shadow + safety gates.
↑↓
Production · Shadow → canary → A/B → 100%
Each stage catches what the previous couldn't. User exposure is earned, not assumed.
↑↓
Monitoring & loop · Online signals → triage → golden set
The loop closes here. Every failure becomes a future test. The system compounds.
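The scoring layer's cost ladder can be sketched as a cascade in which each scorer either returns a verdict or abstains, and a case only moves to a more expensive scorer when the cheaper one cannot decide. The scorers below are stand-ins: the case fields (`output`, `expected`, `similarity`) and the thresholds are hypothetical, and a real judge call would replace the stub.

```python
def deterministic_check(case):
    """Cheapest rung: decide only the clear-cut cases, otherwise abstain."""
    if not case["output"].strip():
        return False
    if case["output"] == case["expected"]:
        return True
    return None  # abstain: cannot decide cheaply


def similarity_check(case):
    """Second rung: uses a precomputed similarity; thresholds are illustrative."""
    sim = case.get("similarity")
    if sim is None:
        return None
    if sim >= 0.95:
        return True
    if sim <= 0.20:
        return False
    return None


def judge_stub(case):
    """Placeholder for an LLM judge; always abstains in this sketch."""
    return None


LADDER = [
    ("deterministic", deterministic_check),
    ("embedding", similarity_check),
    ("judge", judge_stub),
]


def score_with_ladder(case, ladder=LADDER):
    """Return (rung_name, verdict); unresolved cases fall through to human review."""
    for name, scorer in ladder:
        verdict = scorer(case)
        if verdict is not None:
            return name, verdict
    return "human", None


if __name__ == "__main__":
    print(score_with_ladder({"output": "Refund issued.", "expected": "Refund issued."}))
    print(score_with_ladder({"output": "Done.", "expected": "Refund issued.", "similarity": 0.5}))
```

Logging which rung resolved each case also gives a running estimate of eval cost, since the share reaching the judge and human rungs drives most of the spend.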

Five ideas, one loop. If an interview question doesn't map onto a layer here — golden set, scoring, CI, production stages, monitoring/feedback — you're either being asked something very narrow, or the question itself is confused and you can reframe it onto the stack.

The whole point of this handbook is that evals are not a metric, they're a system. Learn the system; the metrics follow.

→ Closing Thought
"LLM evaluation is the production engineering of language models." Say that, draw this diagram, and you've shown the interviewer that you see the whole picture — not just BLEU, not just A/B, but the system that makes both meaningful.
Complete

All 51 Questions.
Covered.

From golden-set construction to LLM-as-judge validation to closing the loop from production back to CI — the full vocabulary and framework for LLM evaluation interviews and the systems behind them.

51 Questions · 8 Topic Areas · 40+ Visual Diagrams
Saurabh Singh
AI Engineer & Builder
linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7