Vibe Engines
Visual Handbook · 2026
60 Questions · 8 Domains
Senior Interview Preparation & Reference

The Senior
AI Engineer
Interview Handbook

Sixty questions on architecture, production incidents, and the leadership signals that separate senior from staff-level AI engineers.

Architecture Incidents Agents RAG Evals Cost & Scale Safety Leadership
Architecture & Design
Q01–Q08
Production Incidents
Q09–Q16
Agentic Systems
Q17–Q24
RAG & Retrieval
Q25–Q32
Evals & Observability
Q33–Q40
Cost & Scaling
Q41–Q46
Safety & Trust
Q47–Q52
Leadership Signals
Q53–Q60
Saurabh Singh
AI Engineer & Builder.
Contents

What's
Inside

I · Architecture
Q01–Q08
Q01Designing a production LLM serving platform
Q02RAG vs fine-tuning vs in-context decisioning
Q03Multi-tenancy in shared AI platforms
Q04Model versioning, rollout, and rollback
Q05Multi-model routing layers
Q06Context window management at scale
Q07Deterministic code vs LLM boundaries
Q08Feedback loops that actually improve quality
II · Incidents
Q09–Q16
Q09Walk me through an LLM production incident
Q10Detecting and mitigating model drift
Q11Runbook for a provider outage
Q12Handling hallucinations in user-facing apps
Q13A fine-tune regression you caught late
Q14Rolling back a bad prompt deploy
Q15Cascading failures in agent workflows
Q16Post-mortems for non-deterministic systems
III · Agentic Systems
Q17–Q24
Q17Structuring a multi-agent system
Q18When agents should escalate to humans
Q19Handling tool-use failures
Q20Agent state, checkpointing, and resume
Q21Bounding agent cost and runaway loops
Q22Testing an agent before shipping
Q23Designing agent memory architectures
Q24Handoffs between specialized workers
IV · RAG & Retrieval
Q25–Q32
Q25Designing a production RAG system
Q26Chunking strategies and tradeoffs
Q27Evaluating retrieval quality
Q28Hybrid search vs pure vector search
Q29Handling index staleness
Q30Re-ranking strategies in production
Q31Debugging a poor RAG answer
Q32Scaling RAG to billions of docs
V · Evals & Observability
Q33–Q40
Q33Building an eval framework from scratch
Q34Using LLM-as-judge correctly
Q35Preventing eval set leakage
Q36Regression tests for prompts
Q37What AI observability actually looks like
Q38A/B testing a model change safely
Q39Measuring quality without ground truth
Q40Prioritising quality work from signals
VI · Cost & Scaling
Q41–Q46
Q41Cutting LLM inference cost by 50%
Q42Latency budgets and how to defend them
Q43When to cache vs recompute
Q44Capacity planning for AI workloads
Q45Hosted vs self-hosted models
Q46Prompt optimisation without quality loss
VII · Safety & Trust
Q47–Q52
Q47Defending against prompt injection
Q48PII across an LLM pipeline
Q49Jailbreaks in customer-facing agents
Q50Guardrails that don't break UX
Q51Auditing an AI system for compliance
Q52Red-teaming before launch
VIII · Leadership
Q53–Q60
Q53Leading an AI engineering team
Q54Research exploration vs shipping
Q55Onboarding engineers to an AI codebase
Q56Selling infrastructure investment upward
Q57Model vs architecture disagreements
Q58Measuring ROI of an AI initiative
Q59Mentoring engineers into AI
Q60Senior vs staff AI engineer signals
Part I

Architecture &
System Design

The senior interview rarely asks you to invent a transformer. It asks you to draw a production system on a whiteboard in forty minutes and defend every line you drew.

Serving
Multi-tenancy
Model Routing
Versioning
Context Windows
Feedback Loops
Questions 01–08
Q01

Design a production LLM serving platform for a product with 1M daily active users.

Start by naming the three workloads that share nothing: interactive chat (single-digit seconds, streaming), async jobs (bulk summarisation, embedding indexing), and batch evals. Trying to serve them from one pool is the most common rookie error — they contend for the same GPUs and interactive p99 falls apart the moment a batch job kicks in.

PLATFORM REFERENCE ARCHITECTURE
🔐
01
Gateway
auth, quotas, PII scan
🧭
02
Router
model & region pick
💾
03
Cache
prefix & semantic
🧠
04
Inference
vLLM / provider
📊
05
Telemetry
traces, evals, cost

The gateway terminates auth, enforces per-tenant quotas, and scrubs PII before anything touches a model. The router decides model, region, and fallback chain — this is where you bake in your cost strategy (cheap models first, escalate on low confidence). The inference tier is where you separate interactive vs batch GPU pools. Everything is fronted by a prefix + semantic cache because in chat, 30–60% of prompts share a system prefix that should hit the KV cache every time.

Senior interviewers are listening for three things you didn't say as much as what you did. First, queueing — you need a token-aware admission queue because ten 32k-token requests can starve a hundred 2k-token ones. Second, multi-region failover that doesn't break conversation state. Third, a model-agnostic request schema so swapping Claude for GPT or Gemini is a router config change, not a code change.

→ Interview Tip
When the interviewer asks "what's the bottleneck at 1M DAU?" — the answer is GPU minutes, not QPS. Reframe capacity in tokens-per-second, not requests-per-second, and you immediately sound like you've done this for real.
Q02

When would you pick RAG, fine-tuning, or plain in-context learning — and when would you use more than one?

This is a decision framework question, not a recipe question. The answer turns on three variables: how often the knowledge changes, how much you need the model to change its behaviour vs its facts, and how much latency and cost budget you have.

ApproachStrong whenWeak whenCost shape
In-contextFacts fit in prompt, change dailyLong tail of knowledge, repeated costsPer-request tokens
RAGKnowledge is large, updates often, auditableBehaviour change, reasoning styleIndex + per-request retrieval
Fine-tuningStyle, format, domain jargon, routingFacts, anything that changes weeklyTraining run + hosting
HybridRegulated domains needing bothPrototype / unclear requirementsAll of the above

The thing that gets a senior signal is naming the hybrid case out loud. Most real systems end up as RAG for facts + a small fine-tune for tone and structure. Medical copilots do this. Legal copilots do this. Support bots for a specific product line do this. The fine-tune teaches the model "how we sound" and RAG teaches it "what we currently know" — those are two orthogonal needs and trying to solve both with one lever always overfits one of them.

Staff-level candidates go one layer deeper: fine-tune the retriever, not the generator. If your off-the-shelf embedding model doesn't know your jargon, you get bad retrieval no matter how big the LLM is. A small contrastive fine-tune on your query/doc pairs often moves the quality needle more than fine-tuning a 70B model.

→ Mental Model
Ask yourself: would a human expert need to read a document, or have years of apprenticeship? Documents → RAG. Apprenticeship → fine-tune. Both → both. Say this sentence in the interview and watch the interviewer nod.
Q03

How do you design multi-tenancy into a shared AI platform without one noisy tenant hurting everyone else?

Multi-tenancy for AI has three concerns that traditional SaaS doesn't: token quotas, data isolation in caches and embeddings, and model-level noisy neighbours where one tenant's batch workload starves another's interactive traffic. You want per-tenant isolation on all three axes.

ISOLATION LAYERS PER TENANT
1 · Quota layer
Tokens-per-minute, RPM, concurrent-requests per tenant
2 · Data layer
Namespaced indices, tenant-tagged cache keys, separate KMS keys
3 · Compute layer
Priority classes, fair queueing, dedicated pools for premium tiers
4 · Observability layer
Per-tenant cost, quality metrics, audit logs scoped to tenant

The trap to call out: prompt caches are a data leak surface. If you dedupe by prompt hash across tenants, a hash collision becomes an information leak. Always key caches by tenant-id || prompt-hash, not just prompt-hash. Same rule for any LLM response cache or embedding cache.

For the noisy-neighbour problem, token-weighted fair queueing beats round-robin. Charge each request to the tenant's bucket in tokens, not requests — a 32k-token request costs 16x a 2k one, and round-robin treats them the same. Large tenants should hit a separate high-priority lane that can't starve the standard lane below a floor.

→ Real-World Use
If you're running on a hosted API like Claude or OpenAI, the provider has rate limits per-key. Rotating keys per tenant gives you free per-tenant isolation and lets you use their quotas as your first line of noisy-neighbour defence.
Q04

What's your approach to model versioning, rollout, and rollback in production?

The senior move is to treat model + prompt + retriever as one deployable unit. Versioning only the model is the fastest way to get into production bugs you can't reproduce: the model is fine but the prompt was regenerated from a different template and nobody noticed. Call this the inference stack and version the whole thing.

INFERENCE STACK · ONE DEPLOY UNIT
MODEL
claude-sonnet-4.6
weights hash
PROMPT
system v17
templates v9
RETRIEVER
embed-3-large
index v22
TOOLS
schemas v4
MCP v2

Rollout strategy: always shadow first, then canary, then ramp. Shadow mode sends the new stack the same traffic as prod but discards its output — you get real-distribution eval data without risking users. Canary then flips a small slice (1% → 5% → 25%) with automatic rollback tied to your eval gates. The key insight: your rollback trigger should be an eval metric, not an error rate. Hallucination regressions don't throw 500s.

Rollback has to be atomic across the whole stack. A common outage pattern: you roll back the model but forget the prompt was updated to match the new model's behaviour, so you now have a prompt that only works with the new model rolled back to the old one. Snapshot the whole stack, roll back as one unit.

→ Key Insight
Version the cache too. When you roll out a new prompt, the prefix cache from the old one is stale — you'll get mysteriously good latency with the wrong outputs until someone notices. Bump the cache namespace on every deploy.
Q05

Design a multi-model routing layer that picks the right model per request.

A good router is cheap by default, expensive by necessity. The structure that works: a small classifier decides a tier, the tier maps to a model, and every request has a fallback chain if the first pick fails or returns low confidence.

ROUTING DECISION TREE
📥
01
Classify
intent + difficulty
🎯
02
Pick tier
haiku / sonnet / opus
▶️
03
Execute
with timeout
✔️
04
Verify
confidence check
↗️
05
Escalate
fallback on fail

The classifier should be a tiny, cheap model or a fine-tuned distilbert — don't use a frontier model to decide which frontier model to use. Features: intent, estimated token length, whether tool use is required, whether the user tier allows premium models. The classifier must be deterministic enough that A/B tests are interpretable.

The three pitfalls to mention: (1) Double billing — if you escalate, you pay for both models, so only escalate on measurable low-confidence signals. (2) Latency cliffs — users notice when their query randomly takes 10x longer because it hit the big model. Stabilise routing per session so a user's experience is consistent. (3) Observability debt — every request needs to log which tier it hit and why, or you can't tune the router.

→ Interview Tip
The most impressive senior answer: "I'd start with a hand-coded router based on 3–4 features, then log the outcomes, then train a classifier from those logs once I have ground truth." Bottom-up, data-driven — classic staff-engineer pattern.
Q06

How do you manage context windows when user sessions exceed the model's limit?

There's no single answer — it depends on whether the domain needs recency (support chat) or completeness (code review, legal). The strategy is a stack of techniques, not one choice.

CONTEXT PACKING STRATEGIES
01 · Sliding window
Keep last N turns verbatim. Cheap, fast, loses history. Good for support.
02 · Rolling summary
Summarise older turns into a running digest. Lossy but continuous.
03 · Retrieved memory
Embed past turns, pull only what's relevant per new query.
04 · Structured state
Extract facts to JSON, pass as small structured payload.

For most production systems the right answer is a hybrid: sliding window of the last 6–10 turns + rolling summary of everything before + retrieved memory for specific entities. Structured state is underused — if your domain has stable slots (user preferences, order IDs, active project) extract them and pass them as a small JSON block. That beats summarisation because it's lossless for the things that matter.

The senior-signal detail: measure your context utilisation. If your users' p95 conversation is 4k tokens and your budget is 200k, you're paying for nothing. Cap the window at what you actually use, monitor how often you hit it, and only increase when the data demands it.

→ Real-World Use
Anthropic's prompt caching makes long static system prompts essentially free. Put your instructions, tools, and big static context in the cacheable prefix — then use a tiny variable suffix for the actual turn. Context budget problem partly solved.
Q07

Where do you draw the line between deterministic code and an LLM call?

The default should be "use code wherever code works". LLM calls are non-deterministic, slow, expensive, and hard to test — if a regex, a SQL query, or a finite-state machine can do the job, that's what you use. The LLM is reserved for tasks where the input is ambiguous in a way that rules can't resolve.

THE LINE · WHAT GOES WHERE
Code owns
Routing & control flow
Data fetches & writes
Validation & schemas
Retries & timeouts
Auth & permissions
Determinism, testability, cost
LLM owns
Open-ended generation
Ambiguous parsing
Natural-language UX
Summarisation
Intent classification
Ambiguity, breadth, generalisation

The practical rule I give junior engineers: "the LLM is an expensive interpreter, not a runtime". Let it parse the user's intent, pick a tool, and talk back to the user — but route the actual execution through deterministic code. Whenever I see an LLM being asked to "decide what to do and do it in one call", there's a bug waiting. Split the reasoning step from the execution step.

The counter-pattern to watch for is LLM creep: every new edge case gets solved by adding another line to the system prompt. Three months in you have a 4000-token prompt that nobody can reason about. When that happens, audit — most of that prompt belongs in code.

→ Mental Model
Think of every LLM call as costing 100ms + $0.01 + a non-zero chance of being wrong. Would you pay that for this task if a human had to review the output? If not, use code.
Q08

How do you design a feedback loop that actually improves the system over time?

Most "feedback loops" are thumbs-up/thumbs-down buttons that nobody clicks. A real feedback loop has four stages: capture, label, improve, verify — and the hardest one is label, not capture.

THE FEEDBACK LOOP THAT CLOSES
📡
01
Capture
implicit + explicit
🏷️
02
Label
LLM + human review
🔧
03
Improve
prompt / retrieve / FT
🧪
04
Verify
eval + shadow
🚀
05
Ship
canary + ramp

Capture should bias to implicit signals: did the user follow up with a rephrase (bad), copy the output (good), close the tab within 3 seconds (bad), ask a new question (neutral)? Thumbs are a 2% sample and heavily biased toward negative. Implicit signals are 100% coverage and much more useful for ranking which examples are worth a human look.

The critical design choice is where the loop re-enters the system. Cheap loops update prompts and retrieval. Medium-cost loops add examples to the golden eval set so regressions can't ship. Expensive loops do fine-tuning. Most teams jump straight to fine-tuning because it feels serious, but updating prompts and evals from production data beats fine-tuning on almost all quality dimensions for a fraction of the cost.

→ Key Insight
The single most valuable feedback artifact is not a fine-tune dataset — it's a growing golden eval set built from real production failures. Every incident postmortem should end with "we added N examples to the eval set." That's how you get compounding quality.
Part II

Production
Incidents

Senior interviews live and die in the war-story section. The questions here are where you prove you've taken the pager and come out with scar tissue, not just slide decks.

Debugging
Drift
Provider Outages
Hallucinations
Rollbacks
Post-mortems
Questions 09–16
Q09

Walk me through a production LLM incident you debugged end-to-end.

The interviewer is listening for four things, in order: (1) how you detected it, (2) how you isolated the cause, (3) how you mitigated without making it worse, and (4) what you changed so it wouldn't happen again. The story doesn't need to be glamorous — clarity beats drama.

THE INCIDENT LOOP
🚨
01
Detect
alert, signal, user report
🔍
02
Triage
scope, severity, owner
🧪
03
Isolate
bisect traces, repro
🛠️
04
Mitigate
rollback, gate, fallback
📝
05
Learn
eval, runbook, action

A good story to have ready: "A support bot started confidently answering billing questions with the wrong currency. No errors, no latency blip — just wrong. Our customer-success team pinged us." Detection came from a human, not a metric — call that out. Triage: pulled 50 traces, saw the model had started including an EU pricing doc in retrieval for US users. Isolation: the retriever was pulling by cosine similarity only, and a recent re-indexing had changed the embedding space enough that geography tags no longer clustered. Mitigation: hot-patched the retriever to filter by user region as a hard filter before vector search. Learning: added region-aware tests to the eval set and made metadata filters mandatory in the retriever contract.

The staff-level detail most people miss: explain the blast radius assessment. How did you decide this was a P1 vs a P2? Who was affected, how do you know, and how did you estimate cost of being wrong? That's the judgment layer senior interviewers are probing.

→ Interview Tip
Pre-write two incident stories: one where you were the first responder and one where you coordinated multiple teams. Practice both out loud. The number-one mistake is rambling — aim for 3 minutes with a clear before-during-after arc.
Q10

How do you detect and mitigate model drift in production?

"Drift" means three different things and senior candidates distinguish them: data drift (user inputs change), concept drift (the right answer for the same input changes), and model drift (the provider silently updates the model behind your API). Each one is detected and mitigated differently.

TypeSymptomDetectorMitigation
Data driftNew topics, longer queriesEmbedding distribution shiftRetrieval refresh, prompt update
Concept driftRight answer is wrong nowFeedback rate deltaRe-label golden set, update prompt
Model driftSame input, different outputCanary replay on golden setPin version, re-validate, possibly switch

The sharp detector for data drift is embedding PCA over a rolling window: project last-24-hours of user prompts into a 2-D space and compare to last-week's. A cluster of queries in an unfamiliar region is an early signal that your retrieval is about to go sideways. Cheaper version: track the rate of "I don't know" responses — it correlates surprisingly well with topic drift.

Model drift is the one most teams forget about and the one that bites hardest. When you call a hosted API you're implicitly pinned to whatever the provider ships. Run a daily replay of your golden set through the hosted endpoint and diff the results. If a meaningful fraction changed without you shipping anything, the provider updated something. This is how you catch it before a user does.

→ Real-World Use
Pin model versions with date suffixes (claude-sonnet-4-6-20260115) not aliases. Aliases move underneath you. Pinned versions force you to re-validate before upgrading — the validation is the entire point.
Q11

Your LLM provider goes down mid-day. What's your runbook?

This question is testing whether you've planned for the outage, not whether you can improvise during it. The right answer starts with "we designed for this" and ends with "here's the minute-by-minute runbook".

FAILURE MODES & RESPONSES
01 · Regional brownout
Fail over to secondary region. Keep same provider, same model.
02 · Provider-wide outage
Swap to secondary provider. Expect quality delta, accept it.
03 · Rate-limit squeeze
Degrade: trim context, drop non-critical calls, queue batch.
04 · Degraded accuracy
Surface a "limited mode" banner. Don't hide quality loss from users.

Multi-provider is cheap insurance if you designed for it from day one. That means: model-agnostic request schema, prompt templates that work across providers (or provider-specific variants versioned together), and a pre-approved fallback chain so incident commanders don't have to make architectural decisions at 2am. The worst time to discover your prompts don't port between providers is during the outage.

Also underrated: graceful degradation is a product decision, not just a technical one. Sometimes the right response is to serve a cached answer with a disclaimer. Sometimes it's to put the feature in read-only. Sometimes it's to fail loudly because a wrong answer would be worse than no answer. Have that conversation with your PM before the outage and put the answer in the runbook.

→ Key Insight
Run a provider outage game day every quarter. Block the primary API in staging and watch your system fall over in slow motion. Ten minutes of planned pain saves you four hours of real pain when it happens for real.
Q12

How do you handle hallucinations in user-facing applications?

The senior framing is: hallucinations are a system problem, not a model problem. You don't make them go away — you contain them, detect them, and design UX that tolerates them. Treat it like we treat SQL injection: defence in depth.

DEFENCE IN DEPTH · HALLUCINATION CONTAINMENT
1 · Ground every claim
RAG with source attribution; model can only cite what it retrieved
2 · Constrain outputs
Structured output; validate against schema before returning
3 · Verify claims
Secondary pass: check each claim is supported by the source
4 · UX honesty
Show citations inline; make uncertainty visible to users

Layer 1 is grounding: give the model the facts it needs and instruct it not to answer beyond them. Layer 2 is structure: if the output has to be JSON with specific fields, the surface area for freeform fabrication drops. Layer 3 is verification: run a second, cheaper model pass that asks "is every claim in this answer supported by the retrieved context?" — a surprisingly effective filter. Layer 4 is UX: never let a claim appear without a citation the user can click. Users are remarkably forgiving of "I'm not sure" and remarkably unforgiving of confident wrongness.

What you do not do is tell the model "don't hallucinate" in the system prompt. That's a meme answer. It doesn't work. Neither does temperature=0 — that makes the same wrong answer consistent, not less wrong.

→ Mental Model
Hallucination is what a language model does when you ask it something it doesn't know. The fix is never "make it know more" — it's "make it refuse, or ground it, or verify it." Senior engineers pick one of those three for every output surface.
Q13

Tell me about a time a fine-tuned model regressed in production and how you caught it.

Fine-tune regressions have a specific flavour: the model got better on the training distribution and worse on the tail. If your golden eval is a clone of your training distribution, you won't catch it — both move together. The senior story to tell is about a team that caught it because the eval set had a held-out tail slice.

A real pattern: a support team fine-tunes a model on last-90-days of resolved tickets. Accuracy on a random sample jumps. They ship. A week later, customer satisfaction on new-product questions tanks — the fine-tune had shifted the model toward the existing product mix, and it now underperforms on questions about anything that wasn't in the training window. The fix isn't another fine-tune. The fix is mixing a retrieval layer for long-tail queries and classifying inputs to decide which path to use.

EVAL SLICES THAT CATCH FINE-TUNE REGRESSIONS
In-distribution
+4%
Held-out tail
−11%
New topics
−14%
Refusal quality
−6%
Tone & style
+9%

The lesson to deliver: always keep a "rare but important" slice in your eval set. Rare queries about important things (legal, billing edge cases, new features) should each get 10–50 examples, even if they're 0.1% of traffic. Those are the slices fine-tunes regress on.

→ Interview Tip
If you haven't shipped a fine-tune, tell a story about a prompt regression instead — same arc, same lesson. "I changed the system prompt, aggregate scores went up, one slice tanked, we caught it because X." That story still signals seniority.
Q14

How do you roll back a bad prompt deployment safely?

The mechanics are simple if you planned for it. You need: (1) prompts stored as versioned artifacts separate from code, (2) a feature flag per deploy, and (3) automatic rollback triggers tied to eval gates. If any of those three are missing, the rollback is a code deploy — which is fine for the first incident and unacceptable after the second.

PROMPT ROLLBACK STACK
1 · Prompt registry
Each prompt version is immutable, has a hash, and ships independently
2 · Flag-gated rollout
Percent traffic flip — 1% → 10% → 50% → 100%
3 · Eval gates
Automated rollback if critical slice drops N% vs baseline
4 · Audit trail
Every request logs prompt hash — join with any complaint later

The subtle part is "rollback" is not just flipping a flag. You also have to: invalidate any prompt-prefix caches keyed on the new prompt, drain in-flight requests that were mid-stream on the new prompt, and communicate to downstream consumers that output format may have shifted back. A senior answer mentions these. A staff answer builds them into the registry from day one so the rollback is a single operation.

The story I'd tell: team shipped a prompt that added a new output field. Downstream code assumed the field was present. They rolled back the prompt but the downstream service was still on the new code path and started throwing nulls. The incident was longer than it should have been because the rollback wasn't atomic across the dependency graph.

→ Real-World Use
Treat prompt changes that modify output schema as breaking changes. Version them like API contracts. Ship the schema change, wait for all consumers to be ready, then ship the prompt. Boring discipline, zero incidents.
Q15

How do you handle cascading failures in multi-step agent workflows?

Agents cascade because a small upstream error compounds at every downstream step — a mis-parsed tool argument becomes a failed tool call becomes a confused planner becomes a 20-step loop. You contain this by putting circuit breakers at every step boundary, the same way you would in a microservice mesh.

CIRCUIT BREAKERS FOR AGENTS
🎯
01
Plan
budget: steps, $, time
🛠️
02
Act
validate args first
👀
03
Observe
tool result + schema
🧮
04
Reflect
progress check
🛑
05
Abort
if budget blown

Three concrete circuit breakers: a step-count limit (hard cap on iterations), a cost limit in tokens or dollars (when the agent has spent its budget, stop), and a no-progress detector (if the last three steps didn't change the state, the agent is stuck — stop). The no-progress detector is the one most teams forget; it catches the "agent is looping on the same failing tool call" pattern.

For recovery, the key design choice is whether to retry a failed step or abort and escalate to a human. The answer depends on reversibility: retries are fine for read-only tools; for anything that writes state, default to "escalate" unless you have idempotency guarantees on the tool itself. This is where senior candidates mention idempotency keys per agent run — a rare but correct detail.

→ Key Insight
The "budget" for an agent isn't one number, it's three: steps, tokens, wall-clock. Hit any one and stop. Give the agent a way to ask for more budget if it's close to the answer — otherwise it will hoard on step one and fail at step nine.
Q16

What does a post-mortem for a non-deterministic AI system look like?

Traditional post-mortems assume reproducibility. AI incidents often aren't reproducible — you can't replay the same user session and get the same wrong answer. The post-mortem template has to bend around that.

SectionClassicalAI-specific
ReproDeterministic stepsCaptured traces + inputs; probabilistic replay
Root causeSingle commit / configPrompt + model + retrieval + input shape
FixCode changePrompt, eval set, retrieval filter, UX change
PreventionTest caseGolden example + monitoring alert
MetricTTR, error rateTTR + quality-signal delta + slice recovery

Two rules I enforce: (1) Every AI post-mortem ends with at least one new eval example. That's how you make the incident compound into long-term protection instead of just a memory. (2) Root cause is plural. "The model hallucinated" is never the root cause — the root cause is why the system let a hallucination reach a user. Usually: missing grounding, missing verification, UX that hid uncertainty, or a retrieval filter that was off by one.

The senior framing I'd lead with: "post-mortems are how you turn probabilistic bugs into deterministic tests." Every user-reported failure, if captured well, becomes a line in the golden set. After two years of this, your eval set is the most valuable artifact on the team — it encodes every scar you have.

→ Interview Tip
If asked "how do you measure reliability for an LLM feature?" the answer is never pure uptime. Pair uptime with a quality SLO on a held-out slice — e.g. "95% of queries on the billing slice must still get an eval-judge score of 4+." Quality is part of the SLO or it doesn't exist.
Part III

Agentic
Systems

The topic every AI team is hiring for and the one where interviewers can tell in thirty seconds whether you've shipped one. The distinction is in the failure modes.

Orchestration
Tool use
State
Memory
Handoffs
Cost bounds
Questions 17–24
Q17

How do you structure a multi-agent system? Why not a single agent with more tools?

The honest senior answer: start with a single agent and split only when you have evidence. Multi-agent is the most over-engineered pattern in the space. A single agent with 8 well-designed tools beats a three-agent mesh in most domains — less state, less orchestration, fewer handoff bugs.

The cases where you genuinely need multi-agent are specific: (1) role specialisation where the prompts are too different to fit in one system prompt (planner vs executor vs critic), (2) security boundaries where one agent has write access and another doesn't, and (3) scale boundaries where the orchestrator coordinates N parallel workers that don't talk to each other.

WHEN TO SPLIT · SINGLE VS MULTI
Keep single
Under ~10 tools
One role
Shared context helps
Trust boundary uniform
Simpler to test, cheaper to run
Go multi
Distinct roles & prompts
Parallel workers
Security isolation
Specialised models
Pay for orchestration + state

When you do split, the pattern that works is planner → executor → critic, each running on a model sized to its job (cheap planner, cheap executor, more capable critic for the final check). The orchestrator is code, not an LLM, because code is the thing that reliably enforces the step count, budget, and handoff protocol.

→ Interview Tip
When a candidate jumps straight to "multi-agent system" without justifying it, senior interviewers mentally mark it as inexperience. The seasoned move is "I'd prototype with one agent first, then split when I can point at a specific failure mode." That's the answer to lead with.
Q18

When should an agent escalate to a human versus retry on its own?

The rule of thumb: retry on transient failures, escalate on ambiguity. A 429 rate-limit is a retry. A tool that returned zero results is ambiguity — the agent should not silently pivot to a different plan and hope for the best.

THE ESCALATION DECISION TABLE
01 · Retry silently
Network errors, rate limits, schema parse fail, stale tokens.
02 · Retry with repair
Invalid tool args — show error back to LLM, retry once.
03 · Ask the user
Ambiguous intent, missing required info, confidence too low.
04 · Escalate to human
Destructive action needed, policy violation, repeated failures.

The design detail that separates senior from staff: the escalation path is part of the product surface, not a fallback hack. If the agent can escalate to a human, the human's queue, the SLA, the "who gets paged" policy, and the handback flow all have to be designed. Otherwise escalation becomes a black hole that the user abandons.

Also mention the "ask the user once" rule: if the agent is unsure, it should ask a single clarifying question with a bounded set of answers. A free-form clarification loop devolves into conversation and burns tokens. Bounded clarifications feel like a good UI and are cheap.

→ Real-World Use
For any write action, require the agent to summarise the proposed change and show it to the user before executing. This is the single highest-leverage pattern for destructive tools. Cheap, obvious, under-used.
Q19

How do you handle tool-use failures inside an agent loop?

Tool failures come in three flavours and each needs different handling: schema errors (the model called the tool wrong), tool errors (the tool ran but returned an error), and semantic errors (the tool ran, returned "success", but the result is wrong for the task).

FailureDetectionFix
Schema errorJSON parse / type checkReturn error to LLM, let it retry with corrected args
Tool errorHTTP status / exceptionMap to a human-readable error the LLM can reason about
Empty resultZero hits, empty listSurface as "no results, consider alternatives"
Semantic errorVerifier / sanity checkTag as suspect, retry with different params
Repeated failureN retries exceededEscalate, don't loop

The underrated one is semantic errors. Tools often return "success" for wrong outcomes — a search tool returns hits that don't match the intent, a code-execution tool returns output that ran but didn't do the thing asked. You catch these with a verifier pass: a small prompt that asks "given the goal and this tool output, did we make progress?" It's cheap and it's the difference between agents that feel reliable and agents that feel slippery.

The schema design rule: tools should return errors that the LLM can reason about. "HTTP 400: bad field 'start_date', expected YYYY-MM-DD" is actionable. "Internal server error" is not. Wrap your tool errors in natural language on the way back to the model.

→ Key Insight
Give every tool an idempotency key parameter and accept it silently. Agents retry. Retries cause duplicate writes. Idempotency keys cost you one extra column and they save you one on-call shift per quarter.
Q20

How do you manage agent state — checkpointing, pausing, resuming long runs?

Agents that run more than a few seconds need to survive restarts, deploys, and human pauses. The mental model: treat an agent run like a workflow engine job. State lives outside the agent, the agent is a reducer over an append-only event log, and any step can be replayed from the log.

AGENT RUN AS WORKFLOW
1 · Event log
Every plan / call / observation appended with seq number
2 · Current state derived
Replay log → get conversation, tool cache, budget used
3 · Pause / resume
Paused = no worker polling. Resume = re-derive state, continue
4 · Human-in-the-loop
A human action is just another event in the log

The event-sourced design has a huge hidden benefit: you can replay an agent run on a newer model and diff the outcome. When you upgrade Sonnet, you replay yesterday's agent runs through the new model and see which steps went better or worse. That's your eval on real data for free.

For short-lived agents (<30 seconds, single request) this is overkill — just hold state in memory and be done. For anything that might span hours, survive deploys, or need human approval in the middle, the workflow-engine model is the right default. Temporal, Restate, and Inngest all ship patterns for this; rolling your own is fine if the domain is small.

→ Mental Model
Agent state = event log + derived view. The log is durable, the view is disposable. If the view goes wrong, rebuild it from the log. This mental model imports 30 years of database wisdom into agents.
Q21

How do you stop an agent from burning money in a runaway loop?

Three layers, enforced in code outside the LLM: step budget, token/dollar budget, wall-clock budget. Hit any one and the loop stops. Do not trust the LLM to track its own budget — it will not, because it doesn't know how to.

THREE HARD LIMITS
STEPS
≤ 25
Typical cap for interactive agents
COST
$ 0.50
Hard per-run spend ceiling
TIME
120s
Wall-clock wall, no exceptions

Beyond hard limits, use a no-progress detector: hash the last three (tool, args) pairs — if identical, abort. This catches the most common loop: the model keeps calling the same failing search with slightly different phrasing. Also log a loop-detected event so you can count how often agents hit it — that's your quality signal.

For runtime cost alerting, aggregate per tenant per hour and alarm on 5x-over-baseline. Runaway agents usually cluster around a broken deploy or a single tenant's weird input. Spotting the cluster fast matters more than per-request limits — one user spamming will always eat some budget, ten users hitting a broken prompt can take down a quarter's margin.

→ Key Insight
Budgets should be configurable per tool call, not global. A research agent searching arxiv can spend 50 steps. A confirm-this-transaction agent should cap at 3. One-size-fits-all budgets either starve real work or bankrupt you on edge cases.
Q22

How do you test an agent before you ship it?

Agents are stochastic and multi-step, so test pyramids from traditional software don't directly transfer. The version that works: tool tests at the bottom, trajectory tests in the middle, end-to-end evals on top.

AGENT TEST PYRAMID
1 · Tool unit tests
Every tool tested deterministically — no LLM involved
2 · Trajectory tests
Given input X, did the agent call the expected tools in sensible order?
3 · Outcome evals
LLM-as-judge scores final answer against rubric on N cases
4 · Adversarial / red-team
Prompt injection, jailbreaks, unsafe tool use — all must fail safely

The key trick is at layer 2: trajectory tests don't check the exact tool sequence — they check that required tools were called and forbidden ones weren't. "The agent must call verify_identity before update_email, and must not call delete_account anywhere" is a reliable invariant. Exact-match tool sequences break every time the model re-plans.

The layer most teams skip is adversarial and it's the one that matters most for agents with real-world side effects. Have a set of prompt injections, tool-abuse attempts, and policy violations in your eval set, and gate deploys on them passing. This is the eng equivalent of a sandbox test for a release.

→ Real-World Use
Record every production trajectory. Sample 5% weekly, diff them against a baseline, and add any surprising ones to your test set. Your test set grows organically from real traffic — the best possible source.
Q23

How do you design an agent's memory so it's useful and bounded?

"Agent memory" is vague. Senior candidates split it into four concrete kinds and wire each to its own store.

TypeContentsStoreRetrieval
WorkingCurrent task scratchpadPrompt contextAlways included
EpisodicPast sessions by userDB + vectorRecency + similarity
SemanticUser prefs & factsStructured row / JSONAlways included, small
Procedural"How we do X" patternsPrompt / toolsBaked into system prompt

The most valuable memory for production agents is semantic memory stored as structured facts, not prose. "User prefers metric units, based in Berlin, has project IDs PRJ-14, PRJ-27" is 60 tokens and a lossless feed into every new session. Beats any vector-based "memory system" you can build because it's deterministic and auditable.

Episodic memory is where people over-engineer. The brutal truth: you rarely need to retrieve specific past conversations. You need to retrieve the facts from them. Build a pipeline that extracts facts from episodes into structured memory and throws away the episode text after a while. Storage is cheap but context is expensive.

→ Interview Tip
If you're asked to design a chat assistant with memory, the right move is "structured facts in a user row + retrieved snippets for specifics + never the raw episode text." Say this sentence and you cut through a lot of hand-wavy designs.
Q24

How do you handle handoffs between specialised agents or roles?

Handoffs are where multi-agent systems earn their keep — or fall apart. The three things that must cross the handoff cleanly: (1) the goal, (2) the facts gathered so far, (3) the constraints and budget remaining. Miss any of them and the receiving agent starts from zero.

HANDOFF PROTOCOL · WHAT MUST CROSS
01 · Goal statement
Single sentence describing the outcome the receiving agent must produce.
02 · Structured facts
JSON of what's known — never "here's everything I thought about".
03 · Remaining budget
Steps/tokens/time left so the receiver plans within bounds.
04 · Return contract
Schema for what the sender expects back. Not optional.

The common anti-pattern is passing the entire conversation history across the handoff. That dumps the noise (exploration, false starts, errors) on the receiver, blows its context, and confuses its planning. Compress to structured facts first. The senior pattern: handoff is a pure function call with typed inputs and outputs, not a stream of consciousness.

The return path matters just as much. If the delegated agent fails or times out, the orchestrator needs a structured failure back, not silence. Design handoffs like RPC calls with typed success and failure envelopes — that one discipline makes multi-agent systems debuggable instead of mystical.

→ Mental Model
A handoff between agents should pass less context than a function call in a REST API. If you're passing the raw message history, you've built an agent mesh by accident — and it will fail the same way a microservice mesh fails when every call passes every cookie.
Part IV

RAG &
Retrieval

Retrieval is the discipline that decides whether your model looks smart or clueless. Senior questions skip the definitions and go straight to the production tradeoffs that make or break an answer.

Chunking
Hybrid search
Re-ranking
Staleness
Scaling
Debugging
Questions 25–32
Q25

Walk me through how you'd design a production RAG system from scratch.

The senior framing: RAG has two pipelines, not one — indexing and querying — and they have different SLAs. Indexing is batch, idempotent, and fault-tolerant. Querying is interactive, latency-sensitive, and read-only. Conflate them and you either overpay for batch or underperform at query time.

RAG · TWO PIPELINES
📥
INDEX
Ingest
parse, clean, extract
✂️
INDEX
Chunk
semantic + metadata
🧮
INDEX
Embed
dense + sparse
🗂️
INDEX
Store
vector + doc db
QUERY
Rewrite
expand, decontext
🔎
QUERY
Retrieve
hybrid search
📊
QUERY
Rerank
cross-encoder
✏️
QUERY
Answer
cite + verify

Walk the interviewer through the full pipeline and call out the non-obvious choices: (1) Query rewriting — a dedicated step where you expand the raw user question into a search-friendly form, decontextualize pronouns, and sometimes generate multiple queries. This single step is often the largest quality lever. (2) Hybrid search — BM25 + dense is the production default; pure vector loses precision on exact-match queries. (3) Re-ranking — a cross-encoder on the top-50 before the top-5 are sent to the LLM. (4) Answer verification — a second pass that checks every claim is grounded in a retrieved source.

The staff-level detail: metadata filters, not vector similarity, are the most important production lever. User region, document type, permission scope, date range — these should be hard filters applied before vector search, not post-hoc. Skip this and you'll spend quarters tuning embeddings to fix a problem a SQL WHERE clause would solve instantly.

→ Interview Tip
Ask the interviewer "what's the corpus size and update rate?" before answering. The answer for a 1k-doc corpus that updates monthly is different from a 50M-doc corpus that updates every minute. Showing you know that distinction is itself a signal.
Q26

What's your chunking strategy and why?

Chunking is the most underrated quality lever in RAG. The wrong answer is "512 tokens with 50 overlap" — that's a default, not a strategy. The right answer starts with "what's the shape of my documents?" and builds from there.

StrategyBest forTradeoff
Fixed-sizeUniform text, PDFsCuts across sentences; cheap baseline
Sentence / paragraphProse, blog postsVariable length, more semantic
Semantic (embedding gap)Mixed contentExpensive at index, cleaner chunks
Structural (markdown / headings)Technical docs, wikisNeeds clean source, best retrieval
Late chunkingLong coherent docsEmbeds full doc first, chunks the outputs
Contextual (Anthropic)Dense reference materialPrepends doc context to each chunk — quality bump

My production default for heterogeneous corpora: structural chunking on headings + contextual retrieval. The structural pass gives you chunks that respect document boundaries (no cutting mid-sentence), and contextual retrieval fixes the "chunk orphaned from its parent" problem by prepending a one-sentence doc summary to every chunk before embedding. Anthropic's paper showed this cuts retrieval failures by ~50% for not much index-time cost.

The thing to call out explicitly: chunking quality is bounded by parsing quality. If you feed the chunker garbage HTML from scraped PDFs, no chunking strategy will save you. Fifty percent of RAG projects' quality problems are upstream of chunking — they're in ingestion and normalization. Fix parsing first.

→ Real-World Use
Store both the chunk text and its parent doc ID. When a chunk wins retrieval, optionally expand to neighbouring chunks or the full section. "Small chunks for retrieval precision, bigger windows for generation context" is the pattern.
Q27

How do you evaluate retrieval quality — separate from answer quality?

Decoupling retrieval eval from generation eval is the single most important discipline in RAG. If you only measure end-to-end answer quality, you can't tell whether a regression is because the retriever missed the doc or the generator botched the synthesis.

RETRIEVAL METRICS · THE FOUR THAT MATTER
01 · Recall @ K
Did we retrieve any relevant doc in top K? The ceiling of the system.
02 · MRR
Mean reciprocal rank — how high is the first relevant hit?
03 · nDCG
Quality-weighted ranking — punishes bad ordering, not just absence.
04 · Context Precision
Of the chunks sent to the LLM, how many were actually useful?

To measure any of this you need a labelled retrieval set — queries paired with ground-truth document IDs. Build one from 200–500 real queries, have humans mark the correct docs, and run every retrieval change against it. Frameworks like RAGAS can substitute an LLM judge for the human labels in a pinch, but I'd still want a human-labelled gold slice for anything safety-critical.

The metric most people skip is Context Precision: of the top-K chunks you sent to the LLM, what fraction were actually used in the final answer? High precision means you can shrink K (cheaper prompts). Low precision means the re-ranker is broken or the prompt is wasteful. Measure this and every decision you make about K becomes quantifiable.

→ Key Insight
When retrieval fails, upstream fixes (parsing, chunking, metadata) beat downstream fixes (re-ranker, prompt). Always profile where in the pipeline your recall is dropping before reaching for the fancy tools.
Q28

When do you use hybrid search versus pure vector search?

The answer in 2026 is simple: almost always hybrid. Pure vector is elegant in papers and brittle in production — it loses to BM25 on anything involving exact product names, IDs, error codes, quoted phrases, or rare jargon. Hybrid is the boring, correct default.

WHERE EACH ONE WINS
BM25 wins
Exact product names
Error codes (E_404_BAR)
Rare domain jargon
Short queries
Lexical precision
Vector wins
Paraphrased questions
Cross-lingual
Semantic generalisation
Long natural queries
Semantic recall

Fusion: the standard approach is reciprocal rank fusion (RRF) — merge ranked lists from each retriever by summing 1/(k+rank). Cheap, needs no tuning, works. More sophisticated: learn to weight the two signals per query type, but the returns diminish fast.

Pure-vector-only is defensible in two cases: (1) your corpus is tiny and well-curated (say, a 200-FAQ knowledge base — BM25 will always have exact matches there), or (2) you're doing cross-lingual retrieval (English queries hitting French docs — BM25 can't help you). For anything else, ship hybrid from day one. You'll spend the same engineering effort on purely-vector with worse outcomes.

→ Real-World Use
Postgres with pgvector + tsvector gives you hybrid search in one database with one query. For many teams this beats a dedicated vector DB for the first year — fewer moving parts, transactional consistency, and you already have Postgres on-call.
Q29

How do you handle index staleness when documents change constantly?

Staleness has two sides: content drift (docs changed, embeddings didn't) and model drift (embedding model changed, old embeddings don't align with new queries). Most teams plan for the first and ignore the second.

STALENESS · TWO KINDS, TWO FIXES
Content drift → CDC pipeline
Doc update triggers re-embed + upsert. Event-driven, idempotent by doc-id.
Delete propagation
Soft-delete with retention window, hard-delete after. Never leave orphans.
Model drift → versioned indexes
Never mix embedding model versions. Blue/green whole index on upgrade.
Freshness SLO
"99% of doc updates visible in retrieval within N minutes." Measure it.

Practically: hook your ingestion to a change-data-capture stream from your source of truth (database or document store). Every change emits an event, a worker re-embeds just the affected chunks, and upserts them. Never reindex-the-world unless you absolutely must — reindexing is expensive, creates temporary inconsistency, and hides bugs.

For embedding model upgrades, the rule is brutal: never mix versions in one index. Old embeddings and new embeddings live in different spaces. Blue/green the whole index: reindex to v2 behind a flag, shadow-query to verify, flip traffic. This is the single most common "why is retrieval mysteriously worse" root cause I've seen.

→ Key Insight
Add doc_version and embed_model_version as required metadata on every chunk. Being able to query "show me chunks using model v1" is how you audit consistency during a migration — and how you find orphans afterwards.
Q30

What's your re-ranking strategy and when is it worth the cost?

Re-ranking is the second-pass quality amplifier on top of first-stage retrieval. First stage (BM25 + vector) is fast and recall-oriented — get the top 50–100 candidates. Second stage (cross-encoder) is slow and precision-oriented — rerank those 50 down to the top 5 that go to the LLM.

TWO-STAGE RETRIEVAL
🔎
STAGE 1
Retrieve
Hybrid → top 50
~10ms
⚖️
STAGE 2
Rerank
Cross-enc → top 5
~100ms
🧠
STAGE 3
Generate
LLM answer
~1-3s

Model choice depends on budget. Hosted options: Cohere Rerank (best quality, API call, predictable latency), Voyage Rerank (competitive quality, cheaper). Self-hosted: BGE-reranker or Jina Reranker (free, 50–200ms on a GPU, good enough for most cases). For tiny budgets, a well-tuned BM25 + vector fusion often matches a naive cross-encoder — don't reach for rerankers before you've tuned the first stage.

When is re-ranking not worth it? When your first-stage precision is already high (rare), when your latency budget is <300ms (interactive autocomplete), or when your top-K is already 3–5 and re-ranking just reorders the same small set. In practice: if your first stage returns 50 candidates and your LLM sees only 5, re-ranking is almost always worth the cost.

→ Mental Model
Retrieval is about recall (did we find it?). Re-ranking is about precision (did we put it first?). Don't try to make one stage do both — they have opposing optimal settings and one pipeline can't serve them both.
Q31

A user reports a RAG answer is wrong. Walk me through debugging it.

Debug RAG by walking the pipeline in order and answering three questions: (1) was the right doc in the corpus?, (2) did retrieval find it?, (3) did the LLM use it correctly?. Most teams jump to the LLM first; the correct order is the other way around.

DEBUG ORDER · OUTSIDE IN
1 · Is the right answer in the corpus at all?
Grep the raw docs for the keyword. If missing → ingestion bug.
2 · Did chunking keep it together?
Find the chunk(s). If split across chunks → chunking bug.
3 · Did retrieval rank it highly?
Replay query. Check top-K. If not present → retriever bug.
4 · Did the LLM use it?
Inspect the actual prompt sent. If present but ignored → prompt bug.

Most real RAG failures turn out to be at step 1 or 2, not at the model. The doc wasn't in the corpus, or it was but parsed poorly, or the chunk that contained the answer got separated from the chunk that contained the context that makes the answer recognisable. The senior habit: always grep the raw corpus before blaming the model.

Make this debug loop fast. Every RAG system should have a "replay query" tool that takes a question and shows: rewritten query, BM25 results, vector results, fused top-K, reranker output, chunks sent to the LLM, and final answer. Thirty seconds to diagnose — that's the tool that pays for itself in the first week.

→ Interview Tip
If asked "our RAG answers are sometimes wrong", the senior first question isn't "which model?" — it's "how do you tell whether retrieval found the source or not?" Framing the problem that way signals you've done this before.
Q32

How would you scale RAG to a billion documents?

At a billion documents, none of the defaults apply. You're no longer doing vector search on a laptop — you're designing a distributed system where the retrieval layer has its own on-call rotation. The key design moves: sharding, approximate search, and aggressive pre-filtering.

BILLION-DOC RAG · KEY LEVERS
01 · Shard by hard filter
Partition by tenant / geo / date so most queries hit one shard.
02 · ANN (HNSW / IVF-PQ)
Approximate — trade tiny recall loss for 100x speedup.
03 · Quantization
Int8 or binary embeddings cut memory 4-32x for small recall hit.
04 · Tiered storage
Hot data in RAM; warm on SSD; cold in object storage.

The biggest lever isn't the vector DB — it's sharding by the filter most queries use. If every query is scoped to a tenant and a date range, partition the index by (tenant, month). A query then hits 1% of the corpus, not 100%. You get a thousand-fold speedup for free before you touch any ANN parameters.

Quantization is underrated. Binary embeddings (1 bit per dim) are surprisingly competitive when paired with a reranker, and they cut memory by 32x. The pattern: binary for first-stage retrieval on the billion-doc shard, full-precision vectors for the few thousand you actually rerank. This is how the frontier search systems scale.

Finally — and this is the staff-level framing — most teams that think they need billion-doc RAG don't. They need filtered retrieval over the 100k docs their user actually cares about. Before designing a distributed index, ask whether the working set per query is actually that large.

→ Key Insight
Sharding strategy should match the filter shape, not alphabetical doc-id. If 99% of queries are scoped to one user's docs, shard by user. If 99% are scoped to the last 30 days, shard by date. Match the access pattern.
Part V

Evals &
Observability

If an interviewer asks only one thing about this topic, they'll ask something deceptively simple that exposes whether you've shipped without evals. These are that topic.

Golden sets
LLM-as-judge
Regression
Observability
A/B testing
Prioritisation
Questions 33–40
Q33

How would you build an LLM eval framework from scratch?

Start with the smallest thing that works and grow. Day one: a CSV, a script, and a golden set of 30 examples. Day ninety: sliced metrics, automated regression, CI integration. Day three-sixty: online evals, drift alerts, per-tenant quality tracking. The mistake is trying to jump from zero to the framework you'd see at OpenAI — you'll build infrastructure nobody uses.

EVAL FRAMEWORK · GROWTH STAGES
📝
D1
CSV + script
30 examples, run by hand
⚖️
D30
Judge model
scored rubric, sliced
🧪
D90
CI gate
block regressions on PR
📡
D180
Online evals
live traffic, drift alerts

The framework needs four things and nothing else at the start: a dataset (examples + expected criteria), a scorer (rule-based, LLM-judge, or human), a slicer (break results by segment — intent, model, user tier), and a runner (script that produces a comparable report between two runs). Everything else is infrastructure on top.

Senior candidates separate offline evals (run on a fixed dataset in CI) from online evals (run on live traffic in prod). Offline catches regressions pre-ship. Online catches problems that only appear with real inputs. Both are necessary; most teams have only offline and wonder why prod feels different from CI.

→ Interview Tip
When asked "how do you evaluate your LLM system?", the wrong answer is listing metric names. The right answer is "I have a golden set of N examples sliced by intent, an LLM judge with a rubric, and CI blocks regressions over 3%." Concrete numbers signal real experience.
Q34

How do you use LLM-as-judge without fooling yourself?

LLM-as-judge is powerful and it's also how most teams fool themselves. The failure mode is the judge agreeing with the model it's judging because they share a lineage. Your eval metric silently becomes "does output A look like output B" instead of "is output A good."

LLM-AS-JUDGE · AVOIDING THE TRAPS
01 · Pairwise, not scalar
"Is A better than B?" is more reliable than "rate A 1-5". Less drift.
02 · Different lineage
Judge with a different family than the one you're judging.
03 · Position-swap
Run each pair twice, A/B and B/A. Average to remove order bias.
04 · Calibrate against humans
Once a month, re-check agreement on a held-out set with human labels.

Most durable framing: LLM judge is a signal, human labels are ground truth. Use the judge for throughput (run it on thousands of examples per change) and humans for calibration (spot-check 50 cases a week to ensure the judge still agrees with reality). If agreement drops below 80%, the judge prompt is stale and needs updating.

The rubric matters more than the model. A specific, criterion-based rubric ("scored 5 if answer addresses all 3 sub-questions, cites at least 1 source, and contains no unsupported claims") outperforms generic "rate helpfulness 1-5" by a wide margin. Invest in the rubric.

→ Key Insight
Run the judge at temperature 0 and include the rubric inline in every call. You want reproducibility over creativity. A reproducible judge that's slightly wrong is more useful than a creative judge that's slightly right but non-deterministic.
Q35

How do you prevent your eval set from leaking into training or prompt optimization?

Leakage is subtle and it's the reason most teams over-trust their own eval numbers. Three leakage paths to defend: (1) training data leak (eval examples end up in fine-tune set), (2) prompt-tuning leak (you keep tweaking the prompt until eval scores go up — you've now overfit to eval), (3) provider leak (your eval examples were in the base model's pretraining data).

THE THREE LEAKAGE PATHS
1 · Training leak
Hash every eval example. Exclude hashes from training set by contract.
2 · Prompt overfit
Dev set for iteration, held-out test set touched only before release.
3 · Pretrain leak
Write fresh examples for important topics. Don't only use benchmarks.
4 · Real-world audit
Shadow a % of production traffic to a human-labelled sample monthly.

The pattern that fixes prompt-tuning leak: split evals into dev (used while iterating) and test (run once before shipping, never iterated on). Every time you look at the test set and change behaviour, you've polluted it. In practice teams don't have the discipline for this — so add a locked "gold" slice that even the engineers can't see the individual examples of, only the aggregate score.

For pretrain leak, the honest truth is you can't perfectly control what was in a hosted model's training data. The mitigation is to add novel, domain-specific examples you wrote yourself — these definitely weren't in the pretrain data. Don't rely on public benchmarks (MMLU, GSM8K) as your primary eval; they're memorised to some degree by every frontier model.

→ Real-World Use
Build your eval set from real user failures, not synthetic questions. Real failures are guaranteed to not be in pretrain data, and they're directly aligned with what you care about. Mining incidents for eval examples is the cheapest quality win in the industry.
Q36

How do you do regression testing for prompts?

Treat prompts like code. A prompt change is a diff, a regression test is an eval run, and CI blocks a merge that regresses a protected slice. The pipeline is Promptfoo-style: a YAML config defines providers, tests, and assertions, and eval run fails the build if any assertion drops.

PROMPT CI PIPELINE
✏️
01
Edit prompt
PR opened
🧪
02
Run evals
dev + regression slice
📊
03
Diff report
before / after slice scores
04
Gate merge
block on regression

Three things the test harness must do: (1) Deterministic replay — pin model version and temperature so results are comparable. (2) Slice-level gates — an aggregate 2% lift is fine, but a 10% drop on the "billing" slice is a block regardless. (3) Visible diffs — the PR reviewer sees "score on refund questions dropped from 0.82 to 0.64" with specific examples. Narrative beats numbers.

The discipline that matters: regressions block by default, you have to explicitly override with a reason. Teams that let regressions merge "to unblock" ship a worse product every sprint. Teams that block by default either fix the regression or have a conversation about why this one is acceptable. Both are better than shipping blind.

→ Interview Tip
The concrete tool names senior candidates drop: Promptfoo, Braintrust, Langfuse, Inspect AI. You don't need to have used all of them — naming one and explaining why it fits your workflow is enough to show you know this space is real.
Q37

What does AI observability actually look like in production?

Observability for an AI system has the same three pillars as any distributed system — logs, metrics, traces — but each pillar has a specific flavour. Traces especially: a single user request can produce a five-step agent trace with tool calls, retrieval hops, and reranker passes, and you need all of it in one view.

THE FOUR LAYERS OF AI OBSERVABILITY
01 · System metrics
Latency, error rate, QPS, GPU util — the boring ones you already know.
02 · Traces
Full prompt, tool calls, retrieved chunks, output — replayable.
03 · Quality signals
Online judge scores, user feedback, refusal rate, follow-up rate.
04 · Cost telemetry
Tokens per request per tenant; alert on per-tenant overspend.

The breakthrough is trace → eval → fix as a loop: a trace is a structured record of a single interaction; you can attach a score (from a judge or user) to it; bad scores bubble up into a review queue; the review produces an eval example and a fix. Tools like Braintrust, Langfuse, Phoenix, and LangSmith are built for this shape. Homegrown works too but you'll rebuild half of those tools.

Two details senior candidates always mention: (1) Correlate tokens with user IDs, not just request IDs — that's how you find which users are driving cost or breaking things. (2) Sample-and-log your prompts in full for X% of traffic and store them for 30+ days. When someone complains tomorrow about yesterday's answer, you need to be able to show them exactly what the model saw.

→ Real-World Use
Use OpenTelemetry semantic conventions for LLM spans (gen_ai.* attributes). They're standardising in 2025-2026 and they future-proof your traces so you can swap observability backends later.
Q38

How do you A/B test a model or prompt change in production?

The framework is standard experimentation — randomise users, hold the exposure stable, measure a primary metric, wait for significance — but the metric choice is the hard part. Unlike a classical web A/B test, there isn't a single conversion rate; quality is multi-dimensional.

Metric familyExamplesWatch out for
QualityJudge score, refusal rateJudge drift over time
EngagementFollow-up rate, session lengthLonger ≠ better
OutcomeTask completed, escalation rateSlow to accumulate
CostTokens/req, latencyEasy to forget, easy to blow
SafetyPolicy violations, PII leaksMust be a guardrail, not a trade

Use a primary metric + guardrails: pick one metric you're trying to move (say, judge score), and guardrails you won't trade (cost, latency, safety). A winning experiment must lift the primary without breaking a guardrail. Without this structure, teams ship changes that improve one axis and silently regress another.

The trap in stochastic systems: temperature > 0 adds noise, and noise adds the sample size you need. Run experiments at temperature 0 when possible, or size your sample 2-3x what a classical test would demand. And never compare model A at T=0 to model B at T=0.7 — you'll call randomness quality.

→ Key Insight
Randomise per user, not per request. Same user flipping between arms mid-conversation poisons the metric and confuses the UX. Stickiness at the user level is non-negotiable for LLM experiments.
Q39

How do you measure quality when there's no single ground truth?

Most real LLM tasks have no ground truth — summarisation, creative writing, open-ended Q&A all have many good answers. The senior move is to measure what you can and use proxies with honesty about what they're measuring.

PROXY METRICS · FROM HARD TO SOFT
Rule-based
tight
Reference ans
tight
Rubric judge
noisy
Pairwise judge
noisy
User feedback
biased
Downstream outcome
slow

The most underused technique is measurable sub-criteria: break "was this a good answer?" into three or four objective checks — did it include a citation? did it cover all sub-questions? did it refuse unsafe requests? did it return in the right format? — and score each one independently. Four 0/1 checks per example beats one hand-wavy 1-5 rating every time.

For truly subjective tasks, lean on pairwise comparisons against a baseline. "Is this output better than what the old prompt produced?" is a question a judge can answer reliably even when "is this output good?" is hopeless. You lose absolute quality tracking and gain comparability — usually a good trade.

The hardest and best signal is downstream outcome: did the user's task actually get done? Did the ticket get resolved without a follow-up? Did the code the agent wrote pass the tests? When you can tie quality to a real-world outcome, you stop arguing about judge prompts.

→ Mental Model
No single metric tells you the truth. Stack three or four weak proxies that correlate with quality in different ways, and only trust a change that lifts most of them. Any one metric is gameable; a portfolio is not.
Q40

You have 200 signals of quality problems and time for 10 fixes. How do you prioritise?

This is a judgement question disguised as a process question. The answer frames it as impact × reach × tractability, applied to clusters of signals, not individual signals.

PRIORITISATION MATRIX
1 · Cluster the signals
Group by failure mode, not by complaint. 200 signals often reduce to 15 clusters.
2 · Score each cluster
Reach (% users) × severity (S1–S3) × tractability (hours to fix)
3 · Pick from top of each quadrant
Mix of quick wins and big bets. Don't only ship 10 quick wins.
4 · Add each cluster to eval set
Even the ones you didn't fix. Guaranteed regression protection.

The clustering step is where staff-level engineers differ. Junior engineers sort a spreadsheet by frequency. Staff engineers find the shared root cause: half the 200 signals might all be one failure mode ("retrieval missing recent docs"), and one fix lifts all fifty.

The pattern that fails: shipping 10 fixes that each touch a different part of the prompt. Each fix might be net-positive, but together they conflict, and your eval scores oscillate. Batch related fixes, ship them together, and eval as one change. Fewer, bigger, more verified.

→ Interview Tip
The best answer here doesn't just prioritise — it names what you'd not do and why. "I wouldn't touch the system prompt this sprint even though 5 signals point there, because we're mid-migration and the prompt is about to change anyway." That's judgement.
Part VI

Cost &
Scaling

The questions your finance partner cares about and the questions that separate engineers who shipped a demo from engineers who shipped a P&L-accountable service.

Cost cuts
Latency budgets
Caching
Capacity
Self-host vs hosted
Prompt optimisation
Questions 41–46
Q41

Your LLM bill just jumped 50%. How do you cut inference cost in half without hurting quality?

The right first move is instrument, then act. Most teams try to optimise before they know where the tokens go. Always profile first: what percent of spend is on which model, which product surface, which tenant, prompt vs completion tokens? The answer usually surprises people — one feature or one tenant will be 60% of cost.

COST LEVERS · ORDERED BY ROI
Prompt cache
−30%
Model routing
−40%
Prompt shrink
−20%
Semantic cache
−15%
Batch APIs
−50% on batch
Self-host
varies

Three levers that stack cleanly: (1) Prompt caching — if your system prompt is stable, Anthropic and OpenAI both offer prefix caching that cuts input-token cost ~90% on the cached portion. Moving a 3k-token system prompt from uncached to cached is a single-afternoon change and can be a 30% cost win. (2) Model routing — downshift easy queries to a smaller model, keep the frontier model for what needs it. (3) Prompt shrinking — audit your prompts for copy-paste accretion and cut 20% of tokens; almost always invisible in quality.

What you do not do first is rewrite to self-hosted. Self-host pays off at high volume and stable traffic, but the engineering-months are non-trivial and you lose the provider's reliability and model-upgrade treadmill. Reach for it after the easy wins.

→ Key Insight
The biggest free lunch in LLM cost is the prompt cache, and it's the one most teams forget to verify. Log your cache hit rate as a first-class metric. If it's below 70% for your chat endpoint, your system prompt isn't stable enough — fix that before anything else.
Q42

How do you set and defend a latency budget for an LLM feature?

A latency budget is a contract: "p95 end-to-end latency for this feature is 3 seconds". You split that budget across stages and give each stage a sub-budget. When any stage blows its sub-budget, you know exactly what to fix.

LATENCY BUDGET · 3S INTERACTIVE CHAT
🔐
50ms
Auth + gate
fast path
🔎
200ms
Retrieve
hybrid search
⚖️
100ms
Rerank
cross-encoder
🧠
2400ms
LLM
TTFB 400 + stream
✔️
250ms
Verify
safety + format

The user-perceived metric to defend is time-to-first-token (TTFB) for streaming responses, not total latency. Users forgive a 6-second answer that starts flowing in 400ms; they hate a 3-second answer that blanks the screen for 3 seconds and dumps. Stream by default. Show progress. Pre-send an acknowledgement if there's a retrieval step.

Three levers for latency: (1) Parallelise anything independent — run BM25 and vector search in parallel, not sequentially. (2) Cache the stable parts — prompt prefix, embedding of the last query, reranker scores. (3) Cut tokens — generation cost scales with output length, so a shorter output is a faster output.

→ Real-World Use
Alert on TTFB p95 per surface, not end-to-end latency. TTFB is the number users feel. End-to-end is the number your CFO sees. You need both but optimise for TTFB in interactive flows.
Q43

When do you cache LLM outputs versus recompute every time?

There are four caching surfaces and each answers a different question. Confusing them is the source of most LLM-cache bugs.

CacheKeyed onWhat it savesWatch out
Prompt prefixPrompt hashInput token cost on cached portionAny prefix change invalidates
Exact responseFull prompt hashFull call costNon-determinism across temps
SemanticEmbedding similarityFull call cost on paraphrasesFalse hits are silent errors
KV cacheSession continuityInference compute on same sessionServer-side, framework-specific

The safe default: always use prompt-prefix caching (free, enabled at the API level, correctness preserved). Exact-response cache is fine for deterministic calls (temperature 0, no tools, no randomness in retrieval). Semantic caching is dangerous — the whole point is that similar-but-not-identical prompts share an answer, which is fine for FAQ but catastrophic for "what's my account balance?" where two similar questions have different right answers.

The rule of thumb: cache answers to questions whose answer doesn't depend on mutable state. "What are your business hours?" is cacheable forever. "What's the status of my order?" is never cacheable. "What's the weather in NYC today?" is cacheable for 15 minutes. Classify your queries, tag them, and cache-by-tag.

→ Mental Model
Every cache is a tradeoff between cost and freshness. For a semantic cache, the tradeoff is between cost and correctness — and correctness failures are usually more expensive than the call you saved. Default to caching off for anything user-specific.
Q44

How do you do capacity planning for AI workloads?

Capacity planning for LLMs is different because the unit isn't requests — it's tokens per second per tier. A request for 32k tokens in and 1k tokens out costs 33x what a 1k-in 100-out request costs on the same model. Plan capacity in tokens, and you'll stop being surprised by traffic that "looked flat" but spent 5x more.

CAPACITY PLANNING · WHAT TO PROJECT
1 · Forecast DAU growth
Baseline from existing signup curve; layer seasonality
2 · Tokens per active user
Median + p95 per surface — usually skewed, plan for p95
3 · Translate to TPM per model
Split by model tier; add 2x safety buffer for peaks
4 · Negotiate provider quota
Enterprise TPM contracts, reserved capacity, burst credits

Two uncomfortable facts worth naming: (1) Provider rate limits are the real ceiling, not your wallet. You can have unlimited budget and still hit a 500k TPM wall that takes weeks to raise. Plan at least 2 quarters ahead on quota negotiations. (2) Input tokens grow faster than output tokens as your product matures, because you add context, tools, memory, and retrieval. Forecast the growth direction, not just the magnitude.

For self-hosted inference: the unit isn't TPM, it's tokens-per-second per GPU. A single H100 running vLLM with Llama-70B at batch size 32 does roughly 2000 output tokens/sec. That's your planning unit. Utilisation below 60% is wasted spend; above 85% means tail latency is collapsing. Tune the batch size to your mix.

→ Key Insight
Negotiate rate limits before you need them, not after. Every provider's enterprise contract has a 4–8 week lead time for real capacity increases. If your growth is exponential, you're always one month from a ceiling.
Q45

Hosted APIs versus self-hosted models — how do you make the call?

The analysis is financial + strategic + operational. Financial alone almost always says "self-host at scale" — but the strategic cost of losing the model upgrade cycle and the operational cost of running GPUs in production is usually larger than the cash savings.

THE DECISION MATRIX
Hosted wins
Rapid iteration
Small team
Want latest frontier
Variable traffic
No infra team
Time-to-market wins
Self-host wins
Steady huge volume
Data residency / air gap
Custom fine-tune
Specific latency SLO
Infra team exists
Unit economics wins

Key framing: the breakeven isn't when self-hosted is cheaper per token — it's when the cash saved exceeds the engineering + ops + opportunity cost of not shipping features. For most startups under 50 engineers, that moment never comes. For BigCo with a dedicated ML platform team, it comes sooner.

The hybrid pattern: self-host the classifier/embedder/reranker, hosted for the generator. The embedding model is small, high-QPS, and not on the model-upgrade treadmill — cheap to self-host and you get to use the latest frontier LLM for the generation pass. This is what most mature teams land on.

→ Interview Tip
If you haven't actually run GPUs in production, don't pretend you have. Instead say: "I'd start hosted, instrument cost per surface, and reach for self-host when the cash savings beat ~$500k/yr of engineering time." That's a credible, grounded framing.
Q46

How do you optimise prompts for cost without hurting quality?

Prompt-level cost optimisation has three moves, in order of safety: (1) delete copy-paste accretion (safest), (2) move static content to cached prefix (safe), (3) compress with a smaller model (risky, evaluate).

PROMPT COMPRESSION · WHAT TO CUT FIRST
1 · Redundant instructions
"Be helpful. Be accurate. Be concise." — pick one, the others are noise.
2 · Stale few-shot examples
Drop examples that no longer match current data; keep diverse ones.
3 · Over-long rubrics
Frontier models follow short rubrics; shrink 2x, eval, shrink again.
4 · Retrieved context
Shrink K from 10 → 5 → 3 via reranker; measure at each step.

Output-side compression is just as important and often bigger: short output is half the cost of long output (output tokens are 3-5x the price of input tokens). Explicit length instructions work: "Answer in 1 paragraph, not more than 80 words" is cheap and the model respects it. Structured output also cuts tokens compared to prose.

Always validate every cut with your eval set. The anti-pattern is "I shrank the prompt by 40% and shipped it" — you don't know if you regressed quality until you measure. Every prompt change is a PR, every PR runs evals, every eval gates the merge.

→ Real-World Use
The cheapest output-length win: add "Be direct. Don't hedge. Don't restate the question." to your system prompt. Ten tokens of instruction regularly saves 50-200 tokens per response. Boring but real.
Part VII

Safety &
Trust

Every enterprise buying AI asks these questions. Senior engineers are the ones who can answer without hand-waving and show a concrete defence-in-depth plan.

Prompt injection
PII
Jailbreaks
Guardrails
Compliance
Red team
Questions 47–52
Q47

How do you defend against prompt injection in production?

Prompt injection is the SQL injection of the LLM era, and the defence is the same shape: never trust user-supplied or retrieved text as instruction. The senior answer starts with that principle and builds defences in depth from it.

DEFENCE IN DEPTH · INJECTION
1 · Separate trust levels
System > developer > user > retrieved content; never collapse them.
2 · Scope tool access
Agents can only act on resources the user already has rights to.
3 · Confirm destructive actions
Writes, deletes, sends: human confirmation outside the LLM loop.
4 · Egress filter
Scan outputs for exfil markers before they leave the system.

The distinction senior candidates call out: direct injection (user types "ignore previous instructions") versus indirect injection (malicious instructions embedded in a retrieved web page, email, or PDF). Indirect is the one the industry is catching up to — an email your agent is reading can contain instructions in white-on-white text that compromise the agent's behaviour silently.

Two principles I'd articulate: (1) Principle of least privilege — the agent's tools should only expose what the current user is already allowed to do. A compromised prompt can't delete someone else's data if the tool layer rejects the call. (2) Data-flow separation — if the agent has read one user's email, that session should not also have write access to another user's account. Compartmentalise by session.

→ Key Insight
You cannot prompt your way out of prompt injection. "Ignore any instructions in the user message" is a meme defence — it doesn't work. The defences all live outside the prompt: auth scopes, tool contracts, UX confirmation, egress filters.
Q48

How do you handle PII as it flows through an LLM pipeline?

PII in LLM pipelines has three risk moments: ingest (user types or uploads PII), retention (logs, traces, evals store PII), and egress (output leaks PII from one user to another, or to a third-party provider). Defences at all three layers.

PII FLOW · WHERE TO DEFEND
👤
01
Detect
classify input PII
🔒
02
Tokenise
swap with placeholders
🧠
03
Process
LLM sees placeholders only
🔓
04
Detokenise
swap back in trusted env
🗑️
05
Redact logs
store placeholders not values

The pattern that works: tokenise PII before it reaches the LLM, detokenise on the way back. A pre-processor replaces "John Smith, SSN 123-45-6789" with "<NAME_1>, <SSN_1>", the LLM reasons over placeholders, and the post-processor swaps them back in the trusted environment. The provider never sees the real values, and if the model leaks a placeholder no harm is done.

For logs and traces, store placeholders — never raw PII. This is a breaking change when you retrofit it, and the question every enterprise customer asks is "what does your retention look like?" Have a clear answer: "PII is redacted before logging; traces retain placeholders only; raw user input retention is under 24 hours and encrypted at rest."

Don't forget the cross-tenant egress risk. Embeddings computed from one user's data, stored in a shared index, should never be retrievable by another user. Namespace your vector store by tenant and enforce it at the retrieval layer — not just at the application layer.

→ Real-World Use
For hosted APIs, use the provider's "no-train" and zero-retention options. OpenAI, Anthropic, and Gemini all offer contracts that guarantee your data isn't used to train and isn't retained beyond the call. Enable them by default for any regulated workload.
Q49

How do you handle jailbreaks in a customer-facing agent?

"Jailbreak" means different things — for a customer-facing agent, it usually means someone tricking the model into saying something harmful or off-brand. The defence isn't trying to make the model "uncrackable" (you can't), it's minimising the blast radius of a successful trick.

BLAST RADIUS REDUCTION
01 · Output classifier
Second pass checks output for policy violation. Cheap, effective.
02 · Scope limits
System prompt locks agent to its domain. Refuses off-topic.
03 · Tool-level auth
Jailbreaks can't escalate what the user's own session can do.
04 · Monitor & patch
Log refusals and bypass attempts. Update policies weekly.

The cheapest, highest-leverage defence is an output classifier — a second small-model pass that reads the proposed answer and asks "does this violate policy? is it off-topic? does it say something the brand would never say?" before it reaches the user. Llama Guard, NeMo Guardrails, or a fine-tuned small model all work. Latency cost: ~100ms. Effectiveness: catches the vast majority of policy escapes.

The philosophical point: scope your agent so narrowly that jailbreaks are uninteresting. A customer support bot should refuse any question outside its domain by default. "How do I reset my password?" — yes. "What are your thoughts on geopolitics?" — "I'm a support assistant, I can only help with your account." This isn't censorship, it's product scoping. Jailbreaks of a narrowly-scoped agent produce at-most a mildly embarrassing screenshot — never a data breach.

→ Mental Model
Think about jailbreaks like XSS — you don't make every input "safe", you assume some will get through and make sure the consequences are small. Scope, auth, output filters, monitoring — that's the XSS playbook applied to LLMs.
Q50

How do you build guardrails that don't ruin the user experience?

Over-aggressive guardrails are the most-complained-about feature in AI products. The senior move is to target the bad outcomes, not the bad topics. A medical app can discuss symptoms without becoming a diagnostic tool; a financial app can discuss budgeting without giving personalised advice. Narrow, outcome-based guardrails beat keyword blocklists every time.

GUARDRAIL DESIGN · GOOD VS BAD
Bad guardrails
Keyword blocklists
Refuse-on-suspicion
Generic safety prompt
No escalation path
High false-positive rate
Good guardrails
Outcome-specific classifier
Targeted refusals
Clear safe alternative
Route to human
Low friction, high signal

Rule I enforce: every refusal must offer a next step. "I can't help with that" is a terrible UX. "I can't recommend specific dosages — but I can show you the manufacturer guidance, or connect you to a pharmacist" respects the boundary and keeps the user moving. That's not a nice-to-have — it's the difference between a 20% refusal satisfaction and a 70% one.

Measure guardrail false-positive rate alongside false-negative rate. Most teams only track "did we miss a bad output?" A good guardrail also tracks "did we block a good output?" Both are defects. A guardrail at 2% FN and 20% FP is worse for the product than one at 5% FN and 3% FP.

→ Real-World Use
Use the strict/loose toggle pattern. Expose a single classifier with two thresholds: strict mode for first-time users and unauthenticated sessions, loose mode for known trusted tenants. One code path, two risk profiles. Easy to tune over time.
Q51

How do you audit an AI system for compliance — SOC2, HIPAA, GDPR?

Auditors ask about data flow, access control, retention, and auditability. Your AI system has to answer those four questions at every layer — just like any other regulated system, but with a few LLM-specific wrinkles.

ConcernClassic answerAI-specific addition
Data flowDiagram + DPA with vendorsPrompt & response logging, embedding storage
AccessRBAC, audit logsPer-tenant isolation in vector store, namespaced caches
RetentionRetention scheduleLog redaction, eval set opt-out, right-to-delete on embeddings
AuditabilityImmutable logsTrace every output to prompt hash + model version
Sub-processorsVendor listModel provider, embedding provider, observability tools

The specific wrinkle: right-to-delete applied to embeddings. When a GDPR deletion request comes in, you must be able to remove that user's data not just from the primary DB but from every embedding derived from it. Build deletion hooks into your embedding pipeline on day one — retrofitting is painful. The pattern: every chunk row carries a user_id column; delete cascades from user → chunks → vectors.

Two other things auditors love: (1) a data flow diagram showing every hop, every vendor, every store. (2) DPAs (Data Processing Agreements) with every model and tool provider, with zero-retention and no-train flags enabled where available. Get these before the audit, not during.

→ Key Insight
Model providers count as sub-processors. Your enterprise customers will ask for the sub-processor list; OpenAI, Anthropic, and the major clouds are all on standard pre-approved lists. Smaller specialist APIs often aren't — budget procurement time for each new vendor.
Q52

What does a red-team process look like before launching an AI feature?

A real red-team is adversarial, diverse, and documented. Not "we had the QA team try some weird inputs for an afternoon". The senior answer has structure: threat model, attacker personas, a test script, a severity rubric, and a gate for "this doesn't ship until the P1 issues are fixed."

RED-TEAM PROCESS
🎯
01
Threat model
who, what, why
🎭
02
Personas
curious / malicious / naive
📋
03
Playbook
200+ scripted attacks
⚠️
04
Triage
P0–P3 with fix SLA
🔁
05
Re-test
gate launch on fix

Three persona archetypes to red-team against: the curious user (stumbles on bad outputs by accident — this is most of your traffic), the malicious user (actively tries to break things, post screenshots), and the naive but high-stakes user (asks a dangerous question without realising it). The playbook should have 30–50 attacks per persona, mixed to cover your feature's domain.

The gate matters more than the red team itself. Without a pre-committed launch gate — "no P0 issues, fewer than N P1s, all issues have a fix plan" — the red team becomes performative. With a gate, it becomes a decision-maker. Build the gate and get the leadership sign-off on it before you run the red team, so the results have teeth.

And finally: every red-team finding becomes a golden eval case. Future model/prompt changes can't regress any previously-found vulnerability without triggering a CI failure. That's how red teams compound into durable quality instead of being a one-time launch ritual.

→ Interview Tip
Senior candidates mention the external red team. At high stakes (healthcare, finance, security), hire an outside group to attack the system before launch. They're not biased by the product roadmap and they find things internal teams wouldn't.
Part VIII

Leadership
Signals

The half of the interview where the conversation stops being about models and starts being about judgement, people, and how you make decisions when nobody's going to tell you what to do.

Team leadership
Research vs shipping
Onboarding
Stakeholder alignment
ROI
Staff signals
Questions 53–60
Q53

How do you lead an AI engineering team that has to ship fast and stay safe?

Lead with clear stage-gates and explicit tradeoffs. The failure mode in AI teams is permanent prototype energy — everything is a demo and nothing is production. The opposite failure mode is a team so careful they never ship. The leadership job is naming which stage you're in and which rules apply.

THREE STAGES · DIFFERENT RULES
PROTO
Days. Hack, break things, no evals required. Goal: does it work?
BETA
Weeks. Evals exist, shadow traffic, internal users. Goal: is it safe?
GA
Months. SLOs, red team, runbook. Goal: is it reliable?

The cultural move that makes this work: celebrate the transitions. When a feature moves from proto to beta, it's a milestone. When it moves from beta to GA, it's a bigger one. Teams that don't ritualise the transition end up with five half-GA features and one on-call incident per day — all the same tier of fragility.

On ship fast: my rule is to keep the iteration loop under a day. From "idea" to "evaluated prompt change" should be hours, not weeks. That requires investment in the eval harness, the deploy pipeline, and the rollback tooling. Leaders who don't fund infrastructure in year one pay for it ten-fold in year two.

→ Interview Tip
When asked "what's your leadership style?", the specific-example answer always beats the generic one. "In our last quarter I made the call to pause feature work for two weeks to build an eval harness — here's the before/after metric" is the story that lands.
Q54

How do you balance research exploration with shipping features?

The senior framing: "research" and "shipping" are not opposites — they're different-cost experiments. A well-run team is constantly running experiments at multiple cost tiers, with clear criteria for promoting a cheap experiment into an expensive one.

EXPERIMENT TIERS
Tier 0 · Notebook
Hours. One engineer, dirty code, small eval. Promote if signal is strong.
Tier 1 · Shadow
Days. Runs on prod traffic silently. Measures real distribution.
Tier 2 · Canary
Week. Small % of users. Measures real outcomes.
Tier 3 · GA
Quarter. Full launch, SLOs, on-call, maintenance.

Kill criteria at each tier matter more than entry criteria. Most teams have no explicit rule for stopping an experiment that isn't working. Write it down: "Tier 1 is killed if shadow judge score is below baseline after 1000 examples." Without kill criteria, research becomes an indefinite line item.

On the team level, I aim for roughly 70% ship, 20% ship-adjacent research (paying off in this quarter), 10% horizon. The horizon bucket is how you stay ahead on model upgrades and new techniques — skip it and you wake up obsolete. Overfund it and you ship nothing. Most teams land at the wrong ratio in both directions.

→ Mental Model
Every research effort needs a "what would I ship" sentence from day one. If you can't name a concrete ship artifact the research would unlock, it's not research — it's learning, which is fine but should be on a smaller budget and a shorter timeline.
Q55

How do you onboard engineers to an AI codebase where half the system is a prompt?

AI codebases are famously hard to onboard to because the "source of truth" lives in prompts, evals, and traces — not in the code. A new engineer reading the repo gets maybe 40% of the picture. The onboarding plan has to compensate.

FIRST-WEEK ONBOARDING PLAN
Day 1–2 · Run one real query end-to-end
Open a trace, read the prompt, see the retrieval, read the output
Day 3 · Ship a trivial prompt change
Full loop: PR, eval, review, canary — builds confidence in the system
Day 4 · Read last 3 incident post-mortems
The best "how this actually fails" doc the team has
Day 5 · Shadow a real feature review
See how decisions get made in the team's actual language

What you don't do: sit them down with the full system prompt and the architecture diagram and expect it to click. The sequence "run → ship small → read scars → observe decisions" compresses six weeks of learning into five days. It works because every step is concrete and produces feedback.

The asset I invest in: a "read this first" doc that isn't a README. It's a 10-page narrative: "here's the feature, here's why we built it this way, here's the thing that nearly broke it, here's the eval that keeps it honest, here's where new engineers have gotten stuck before." One document, refreshed quarterly, beats a dozen auto-generated doc sites.

→ Key Insight
The fastest way to know if an engineer is "getting it" on an AI team: ask them to debug an old production incident trace. If they can navigate the trace, find the bad step, and propose a fix — they're ready to ship. If not, more shadowing.
Q56

How do you convince leadership to fund infrastructure instead of shipping the next feature?

The losing argument is "our infra is bad and it's embarrassing." The winning argument is "here are three features we couldn't ship last quarter because infra blocked them, and here's the cost of keeping that going." Leaders fund work that prevents measurable pain, not work that satisfies engineering aesthetics.

THE CONVERSATION STRUCTURE
1 · The cost of doing nothing
Engineer-weeks lost, incidents, features blocked, churn
2 · The proposal
Scope, team, timeline, clear end-state definition
3 · The payoff
Measurable "after" state — engineer time recovered, cost saved, risk cut
4 · The tradeoff
What feature slips, who's affected, who you've told

The framing that actually works with non-technical leaders: talk in dollars and weeks, never in abstractions. "Our eval harness takes 45 minutes to run, which means engineers wait or skip it, which means regressions ship, which means a customer-success rep spends 10 hours/week handling the fallout" — that's a business case. "We need a better eval harness" is a request.

When you get the funding, overcommunicate progress. Weekly update on what's done, what's left, what changed. Infra work is invisible to leadership by default; if you don't write about it, they'll forget you got approval and wonder why features are slow. A 5-minute Friday email beats a 60-minute meeting every time.

→ Interview Tip
Have one concrete story ready about infrastructure advocacy. Even if you didn't get approval the first time — the story of "I pitched it, got pushback, came back with better numbers, got approval" is actually stronger than a first-time yes.
Q57

Two senior engineers disagree on model vs architecture tradeoffs. How do you resolve it?

Rule 1: resolve technical disagreements with data, not seniority. If two strong engineers disagree, usually they're both looking at different parts of the elephant. Your job is to make the disagreement concrete — specific claim, specific metric, specific test — and let the data decide.

TURNING DEBATE INTO DECISION
💬
01
Clarify
what exactly is the disagreement?
🎯
02
Formalise
what would settle it?
🧪
03
Bake-off
time-boxed experiment
📉
04
Review
both eng in the room
📣
05
Commit
both disagree & commit

The pattern I use: "disagree and commit" after a time-boxed bake-off. Define the metric and the timeline in advance ("we'll evaluate both approaches on eval set X over 3 days, winner is whichever beats the other by more than 5%"). Lock both engineers into committing to the outcome before the test runs, so the losing side doesn't re-litigate afterwards.

Sometimes the debate is not settleable by data — it's about maintainability, clarity, or long-term direction. In that case the leader's job is to make the call explicitly, own it, and explain the reasoning. "Both approaches work; I'm picking B because it's closer to the direction the platform team is going — I might be wrong, we'll revisit in 6 months." Transparency about your own uncertainty keeps the other engineer's trust.

→ Real-World Use
The one thing not to do: appoint a third senior engineer to arbitrate. That creates a political dynamic and the losing side feels ganged up on. Either use data or take the decision yourself — never delegate the call to a tiebreaker.
Q58

How do you measure the ROI of an AI initiative?

ROI for AI initiatives falls into three buckets: revenue (new customers, upsell), cost (deflected work, reduced tool spend), and risk (incidents avoided, compliance posture). Every AI project must explicitly name which bucket it's in — and the metrics are different for each.

BucketMetric examplesAttribution tricks
RevenueTrial → paid conversion, upsell rate, feature-gated ARRRandomise feature access for 60 days
CostTickets deflected, hours saved per agentHold-out cohort + before/after
RiskIncidents avoided, SLA hitsHard — use leading indicators (coverage, drill results)
SatisfactionCSAT, NPS, retentionSegment by exposure, long window

The hardest one is ticket deflection — everyone wants to claim "our bot deflected 30% of tickets" and nobody can prove it without a control group. The honest measurement: hold 5% of users out of the AI feature, run for 30 days, compare their ticket volume to the exposed group. If you won't do that, don't claim the deflection number.

The trap to avoid: vanity metrics like "number of messages sent to the bot". Usage isn't value. A chatbot with 10k daily messages and a CSAT of 2.1 is actively destroying value. Tie every AI project to a business metric downstream of usage — conversion, resolution, retention — or the initiative doesn't justify its budget.

→ Mental Model
Before you build an AI feature, write the "if this works, this metric moves from X to Y" sentence. If you can't write it, you're not ready to build it — and you won't be able to prove ROI later.
Q59

How do you mentor an engineer from "knows Python" to "can ship AI features"?

Mentorship in AI engineering is less about teaching facts and more about building instincts for non-deterministic systems. The transition that matters: moving from "does my code run?" to "does my feature produce good outputs, reliably, over a distribution of inputs?"

FOUR HABITS I DRILL INTO NEW AI ENGINEERS
01 · Always look at data
Before you debug, read 20 real examples. The answer is usually in the data.
02 · Build the eval first
No ship, no feature, no PR until there's a way to measure it.
03 · Start with the simplest thing
Prompt before fine-tune; retrieval before agent; regex before LLM.
04 · Own the trace
When something is wrong, open the trace. Never ship on hope.

The single practice I require: read 20 real examples from the system every week. No dashboards, no summaries — actual traces from actual users. Pattern-matching on real data is how senior AI engineers develop intuition, and it's the one practice juniors skip. Make it part of the weekly ritual.

The blocker that's hardest to coach through: the fear of "wasting" an LLM call. Junior engineers over-optimise their prompt before running it because it "feels expensive." Mid-level engineers run it, see it fail, and iterate. You want to coach toward the latter — the learning loop is the whole job.

→ Interview Tip
If asked about mentoring, tell a before/after story with a specific engineer: "Engineer X couldn't ship a prompt change in a week. Here's what I changed about how they worked. Now they ship daily." Concrete beats generic every time.
Q60

What does "senior" look like versus "staff" in AI engineering — and which are you?

The clearest framing I've found: senior engineers own outcomes on a feature; staff engineers own outcomes across features. Senior ships the feature reliably; staff designs the platform that makes five teams ship reliably. The scope of what you're accountable for is the main axis.

SENIOR vs STAFF · WHERE THE LINE SITS
Senior owns
A feature end-to-end
Reliable shipping
Clean eval discipline
Incident response
Mentorship up to one level
Mastery in the box
Staff owns
A platform or domain
Architecture across teams
Eval culture at org level
Strategy & roadmap
Develops senior engineers
Leverage outside the box

In AI specifically, the staff signals I look for are: (1) can they design an eval culture, not just write evals for their own feature? (2) can they sequence the team's investments across model upgrades, infra improvements, and features over a 6-month horizon? (3) can they pick the right abstraction — knowing when to add a platform component vs when to let teams keep copy-pasting? That last one is where most senior-to-staff transitions live or die.

The answer to "which are you?" should be honest and forward-looking: "I'm operating at senior today. Here are the staff-level scopes I've taken on and the ones I know I haven't yet." That combination — self-awareness plus a growth direction — is what interviewers want to hear. Overclaiming burns trust. Underclaiming costs you the level.

→ Key Insight
Staff engineers make other engineers more effective. If you can point at three engineers whose work is better because of something you built or taught them, you're operating above a pure senior scope — say that, give examples, let the interviewer do the math.
Complete

All 60 Questions.
Covered.

Sixty production-grounded questions for senior AI engineer interviews — architecture, incidents, agents, RAG, evals, cost, safety, and the leadership signals hiring panels actually listen for.

60
Questions
8
Topic Areas
60+
Visual Diagrams
Saurabh Singh
AI Engineer & Builder
linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7