What do staff and senior AI engineer interviews focus on?

At senior and staff level the focus shifts from coding to judgement: system architecture for generative systems, debugging real production incidents, agentic design, RAG and evals, cost and latency trade-offs, safety and reliability, and technical leadership. Interviewers probe how you make decisions under ambiguity and own outcomes, not just whether you know an API.

How is a senior AI engineer interview different from a junior one?

Junior interviews test whether you can implement; senior ones test whether you can design, operate, and lead. Expect open-ended system design, production incident retrospectives, questions on cost and reliability at scale, and how you mentor and set technical direction — areas where there is no single right answer and your reasoning is the signal.

How should I prepare for an AI engineering system design round?

Practice reasoning out loud about trade-offs: when to use RAG vs fine-tuning, how to design an eval pipeline, how to control cost and latency, how to handle non-determinism and failure, and how to monitor a generative system in production. This handbook walks through 60 such questions across architecture, incidents, agents, RAG, evals, cost, safety, and leadership.

Is the Senior AI Engineer Interview Handbook free?

Yes. All 60 questions and their worked reasoning are free to read, with no sign-up required.

Visual Handbook · 2026

60 Questions · 8 Domains

Senior Interview Preparation & Reference

The Senior AI Engineer Interview Handbook

Sixty questions on architecture, production incidents, and the leadership signals that separate senior from staff-level AI engineers.

Architecture Incidents Agents RAG Evals Cost & Scale Safety Leadership

Architecture & Design

Q01–Q08

Production Incidents

Q09–Q16

Agentic Systems

Q17–Q24

RAG & Retrieval

Q25–Q32

Evals & Observability

Q33–Q40

Cost & Scaling

Q41–Q46

Safety & Trust

Q47–Q52

Leadership Signals

Q53–Q60

Saurabh Singh

AI Engineer & Builder.

LinkedIn Medium GitHub

Contents

What's Inside

I · Architecture

Q01–Q08

Q01 Designing a production LLM serving platform

Q02 RAG vs fine-tuning vs in-context decisioning

Q03 Multi-tenancy in shared AI platforms

Q04 Model versioning, rollout, and rollback

Q05 Multi-model routing layers

Q06 Context window management at scale

Q07 Deterministic code vs LLM boundaries

Q08 Feedback loops that actually improve quality

II · Incidents

Q09–Q16

Q09 Walk me through an LLM production incident

Q10 Detecting and mitigating model drift

Q11 Runbook for a provider outage

Q12 Handling hallucinations in user-facing apps

Q13 A fine-tune regression you caught late

Q14 Rolling back a bad prompt deploy

Q15 Cascading failures in agent workflows

Q16 Post-mortems for non-deterministic systems

III · Agentic Systems

Q17–Q24

Q17 Structuring a multi-agent system

Q18 When agents should escalate to humans

Q19 Handling tool-use failures

Q20 Agent state, checkpointing, and resume

Q21 Bounding agent cost and runaway loops

Q22 Testing an agent before shipping

Q23 Designing agent memory architectures

Q24 Handoffs between specialized workers

IV · RAG & Retrieval

Q25–Q32

Q25 Designing a production RAG system

Q26 Chunking strategies and tradeoffs

Q27 Evaluating retrieval quality

Q28 Hybrid search vs pure vector search

Q29 Handling index staleness

Q30 Re-ranking strategies in production

Q31 Debugging a poor RAG answer

Q32 Scaling RAG to billions of docs

V · Evals & Observability

Q33–Q40

Q33 Building an eval framework from scratch

Q34 Using LLM-as-judge correctly

Q35 Preventing eval set leakage

Q36 Regression tests for prompts

Q37 What AI observability actually looks like

Q38 A/B testing a model change safely

Q39 Measuring quality without ground truth

Q40 Prioritising quality work from signals

VI · Cost & Scaling

Q41–Q46

Q41 Cutting LLM inference cost by 50%

Q42 Latency budgets and how to defend them

Q43 When to cache vs recompute

Q44 Capacity planning for AI workloads

Q45 Hosted vs self-hosted models

Q46 Prompt optimisation without quality loss

VII · Safety & Trust

Q47–Q52

Q47 Defending against prompt injection

Q48 PII across an LLM pipeline

Q49 Jailbreaks in customer-facing agents

Q50 Guardrails that don't break UX

Q51 Auditing an AI system for compliance

Q52 Red-teaming before launch

VIII · Leadership

Q53–Q60

Q53 Leading an AI engineering team

Q54 Research exploration vs shipping

Q55 Onboarding engineers to an AI codebase

Q56 Selling infrastructure investment upward

Q57 Model vs architecture disagreements

Q58 Measuring ROI of an AI initiative

Q59 Mentoring engineers into AI

Q60 Senior vs staff AI engineer signals

Part I

Architecture & System Design

The senior interview rarely asks you to invent a transformer. It asks you to draw a production system on a whiteboard in forty minutes and defend every line you drew.

Serving

Multi-tenancy

Model Routing

Versioning

Context Windows

Feedback Loops

Questions 01–08

Q01

Design a production LLM serving platform for a product with 1M daily active users.

Start by naming the three workloads that share nothing: interactive chat (single-digit seconds, streaming), async jobs (bulk summarisation, embedding indexing), and batch evals. Trying to serve them from one pool is the most common rookie error — they contend for the same GPUs and interactive p99 falls apart the moment a batch job kicks in.

PLATFORM REFERENCE ARCHITECTURE

Gateway (auth, quotas, PII scan) → Router (model & region pick) → Cache (prefix & semantic) → Inference (vLLM / provider) → Telemetry (traces, evals, cost)

The gateway terminates auth, enforces per-tenant quotas, and scrubs PII before anything touches a model. The router decides model, region, and fallback chain — this is where you bake in your cost strategy (cheap models first, escalate on low confidence). The inference tier is where you separate interactive vs batch GPU pools. Everything is fronted by a prefix + semantic cache because in chat, 30–60% of prompts share a system prefix that should hit the KV cache every time.

Senior interviewers listen as much for what you leave out as for what you say. Three omissions stand out. First, queueing: you need a token-aware admission queue because ten 32k-token requests can starve a hundred 2k-token ones. Second, multi-region failover that doesn't break conversation state. Third, a model-agnostic request schema so swapping Claude for GPT or Gemini is a router config change, not a code change.
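A minimal sketch of token-aware admission, assuming each request arrives with a token estimate. The class name, the two-lane split, and the 50% cap on large requests are illustrative choices, not something the handbook prescribes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    tokens: int  # estimated prompt tokens plus the output cap


class TokenAwareAdmission:
    """Admit work by in-flight tokens, capping the share held by large requests."""

    def __init__(self, budget_tokens: int, large_threshold: int, large_share: float):
        self.budget_tokens = budget_tokens
        self.large_threshold = large_threshold
        self.large_cap = int(budget_tokens * large_share)
        self.inflight_total = 0
        self.inflight_large = 0
        self.waiting = deque()

    def _is_large(self, req: Request) -> bool:
        return req.tokens >= self.large_threshold

    def _fits(self, req: Request) -> bool:
        if self.inflight_total + req.tokens > self.budget_tokens:
            return False
        if self._is_large(req):
            return self.inflight_large + req.tokens <= self.large_cap
        return True

    def _admit(self, req: Request) -> None:
        self.inflight_total += req.tokens
        if self._is_large(req):
            self.inflight_large += req.tokens

    def submit(self, req: Request) -> bool:
        """Return True if admitted immediately, False if queued."""
        if self._fits(req):
            self._admit(req)
            return True
        self.waiting.append(req)
        return False

    def release(self, req: Request) -> list:
        """Free a finished request's tokens and admit any waiters that now fit."""
        self.inflight_total -= req.tokens
        if self._is_large(req):
            self.inflight_large -= req.tokens
        admitted, still_waiting = [], deque()
        while self.waiting:
            nxt = self.waiting.popleft()
            if self._fits(nxt):
                self._admit(nxt)
                admitted.append(nxt)
            else:
                still_waiting.append(nxt)
        self.waiting = still_waiting
        return admitted
```

With a 100k-token budget and a 50% cap, a second 32k request waits while 2k requests keep flowing, which is the starvation case the paragraph describes.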

→ Interview Tip

When the interviewer asks "what's the bottleneck at 1M DAU?" — the answer is GPU minutes, not QPS. Reframe capacity in tokens-per-second, not requests-per-second, and you immediately sound like you've done this for real.

Q02

When would you pick RAG, fine-tuning, or plain in-context learning — and when would you use more than one?

This is a decision framework question, not a recipe question. The answer turns on three variables: how often the knowledge changes, how much you need the model to change its behaviour vs its facts, and how much latency and cost budget you have.

Approach	Strong when	Weak when	Cost shape
In-context	Facts fit in prompt, change daily	Long tail of knowledge, repeated costs	Per-request tokens
RAG	Knowledge is large, updates often, auditable	Behaviour change, reasoning style	Index + per-request retrieval
Fine-tuning	Style, format, domain jargon, routing	Facts, anything that changes weekly	Training run + hosting
Hybrid	Regulated domains needing both	Prototype / unclear requirements	All of the above

The thing that gets a senior signal is naming the hybrid case out loud. Most real systems end up as RAG for facts + a small fine-tune for tone and structure. Medical copilots do this. Legal copilots do this. Support bots for a specific product line do this. The fine-tune teaches the model "how we sound" and RAG teaches it "what we currently know" — those are two orthogonal needs and trying to solve both with one lever always overfits one of them.

Staff-level candidates go one layer deeper: fine-tune the retriever, not the generator. If your off-the-shelf embedding model doesn't know your jargon, you get bad retrieval no matter how big the LLM is. A small contrastive fine-tune on your query/doc pairs often moves the quality needle more than fine-tuning a 70B model.

→ Mental Model

Ask yourself: would a human expert need to read a document, or have years of apprenticeship? Documents → RAG. Apprenticeship → fine-tune. Both → both. Say this sentence in the interview and watch the interviewer nod.

Q03

How do you design multi-tenancy into a shared AI platform without one noisy tenant hurting everyone else?

Multi-tenancy for AI has three concerns that traditional SaaS doesn't: token quotas, data isolation in caches and embeddings, and model-level noisy neighbours where one tenant's batch workload starves another's interactive traffic. You want per-tenant isolation on all three axes.

ISOLATION LAYERS PER TENANT

1 · Quota layer: Tokens-per-minute, RPM, concurrent-requests per tenant
2 · Data layer: Namespaced indices, tenant-tagged cache keys, separate KMS keys
3 · Compute layer: Priority classes, fair queueing, dedicated pools for premium tiers
4 · Observability layer: Per-tenant cost, quality metrics, audit logs scoped to tenant

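The trap to call out: prompt caches are a data leak surface. If you dedupe by prompt hash across tenants, two tenants who send the same prompt share one entry, so a response built from one tenant's retrieved data can be served to another, and a semantic cache widens that to merely similar prompts. Always key caches by tenant-id || prompt-hash, not just prompt-hash. Same rule for any LLM response cache or embedding cache.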
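The keying rule can be sketched in a few lines. The function name, the `namespace` parameter, and the ban on ":" in tenant IDs are assumptions made for this illustration.

```python
import hashlib


def cache_key(tenant_id: str, prompt: str, namespace: str = "stack-v1") -> str:
    """Build a response-cache key that can never be shared across tenants."""
    if not tenant_id or ":" in tenant_id:
        # Keep the key unambiguous: the separator must not appear in the tenant id.
        raise ValueError("tenant_id must be non-empty and must not contain ':'")
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"{namespace}:{tenant_id}:{prompt_hash}"


# Two tenants sending the identical prompt land on different cache entries.
key_a = cache_key("acme", "Summarise my latest invoice")
key_b = cache_key("globex", "Summarise my latest invoice")
```

Leaving the tenant ID readable in the key, rather than folding it into the hash, also makes per-tenant eviction a simple prefix delete.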

For the noisy-neighbour problem, token-weighted fair queueing beats round-robin. Charge each request to the tenant's bucket in tokens, not requests — a 32k-token request costs 16x a 2k one, and round-robin treats them the same. Large tenants should hit a separate high-priority lane that can't starve the standard lane below a floor.
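One simple way to approximate token-weighted fair queueing is "least tokens served goes next". This is a sketch under that assumption, with hypothetical names. A production scheduler would more likely use virtual finish times so that idle tenants cannot bank unlimited credit.

```python
from collections import defaultdict, deque


class TokenFairQueue:
    """Serve the tenant that has consumed the fewest tokens so far."""

    def __init__(self):
        self.queues = defaultdict(deque)       # tenant -> deque of (request_id, tokens)
        self.served_tokens = defaultdict(int)  # tenant -> tokens charged so far

    def enqueue(self, tenant: str, request_id: str, tokens: int) -> None:
        self.queues[tenant].append((request_id, tokens))

    def dequeue(self):
        """Return (tenant, request_id) for the next request, or None if idle."""
        candidates = [t for t, q in self.queues.items() if q]
        if not candidates:
            return None
        # Charge in tokens, not requests; ties break on tenant name for determinism.
        tenant = min(candidates, key=lambda t: (self.served_tokens[t], t))
        request_id, tokens = self.queues[tenant].popleft()
        self.served_tokens[tenant] += tokens
        return tenant, request_id
```

After one 32k-token request from a heavy tenant, sixteen 2k-token requests from a light tenant are served before the heavy tenant gets its next turn. Round-robin would have alternated them one for one.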

→ Real-World Use

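If you're running on a hosted API like Claude or OpenAI, the provider already enforces rate limits, usually per organisation, project, or workspace rather than per individual key. Where the provider lets you, giving each tenant its own project or workspace (with its own key) gets you per-tenant isolation almost for free and lets you use the provider's quotas as your first line of noisy-neighbour defence.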

Q04

What's your approach to model versioning, rollout, and rollback in production?

The senior move is to treat model + prompt + retriever as one deployable unit. Versioning only the model is the fastest way to get into production bugs you can't reproduce: the model is fine but the prompt was regenerated from a different template and nobody noticed. Call this the inference stack and version the whole thing.

INFERENCE STACK · ONE DEPLOY UNIT

MODEL: claude-sonnet-4.6, weights hash
PROMPT: system v17, templates v9
RETRIEVER: embed-3-large, index v22
TOOLS: schemas v4, MCP v2

Rollout strategy: always shadow first, then canary, then ramp. Shadow mode sends the new stack the same traffic as prod but discards its output — you get real-distribution eval data without risking users. Canary then flips a small slice (1% → 5% → 25%) with automatic rollback tied to your eval gates. The key insight: your rollback trigger should be an eval metric, not an error rate. Hallucination regressions don't throw 500s.
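The canary ramp with an eval-metric rollback trigger might look like the sketch below. `eval_score` is a hypothetical callback that returns the candidate stack's eval score at a given traffic percentage, and the threshold values are placeholders.

```python
from typing import Callable, List, Sequence, Tuple


def ramp_canary(
    stages: Sequence[int],
    eval_score: Callable[[int], float],
    baseline: float,
    max_drop: float,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Walk the traffic stages; return (final traffic percent, per-stage scores).

    A final percent of 0 means the eval gate tripped and the candidate was rolled back.
    """
    if not stages:
        raise ValueError("need at least one stage")
    history = []
    for percent in stages:
        score = eval_score(percent)
        history.append((percent, score))
        # The trigger is a quality metric, not an error rate.
        if baseline - score > max_drop:
            return 0, history
    return stages[-1], history


# Example with made-up scores: quality collapses once the canary reaches 25%.
scores = {1: 0.91, 5: 0.90, 25: 0.80, 100: 0.80}
final_percent, history = ramp_canary([1, 5, 25, 100], scores.__getitem__, baseline=0.90, max_drop=0.03)
```

In a real system the rollback branch would restore the whole stack snapshot (model, prompt, retriever, tools) rather than just returning 0.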

Rollback has to be atomic across the whole stack. A common outage pattern: you roll back the model but forget the prompt was updated to match the new model's behaviour, so you now have a prompt that only works with the new model rolled back to the old one. Snapshot the whole stack, roll back as one unit.

→ Key Insight

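Version the cache too. When you roll out a new prompt, response and semantic cache entries produced under the old one are stale: you'll get mysteriously good latency with outdated outputs until someone notices. An exact-prefix KV cache simply misses when the prefix changes, but anything keyed on the user query will keep serving old answers. Bump the cache namespace on every deploy.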

Q05

Design a multi-model routing layer that picks the right model per request.

A good router is cheap by default, expensive by necessity. The structure that works: a small classifier decides a tier, the tier maps to a model, and every request has a fallback chain if the first pick fails or returns low confidence.

ROUTING DECISION TREE

Classify (intent + difficulty) → Pick tier (haiku / sonnet / opus) → Execute (with timeout) → Verify (confidence check) → Escalate (fallback on fail)

The classifier should be a tiny, cheap model or a fine-tuned DistilBERT; don't use a frontier model to decide which frontier model to use. Features: intent, estimated token length, whether tool use is required, whether the user tier allows premium models. The classifier must be deterministic enough that A/B tests are interpretable.

The three pitfalls to mention: (1) Double billing — if you escalate, you pay for both models, so only escalate on measurable low-confidence signals. (2) Latency cliffs — users notice when their query randomly takes 10x longer because it hit the big model. Stabilise routing per session so a user's experience is consistent. (3) Observability debt — every request needs to log which tier it hit and why, or you can't tune the router.
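A hand-coded router with a fallback chain and per-request logging could be sketched as follows. The tier names, feature thresholds, and the `call_model` callback (returning text plus a confidence score) are all assumptions for illustration. Session stickiness is left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

TIERS = ["small", "medium", "large"]  # hypothetical tier names, cheapest first


@dataclass
class Features:
    estimated_tokens: int
    needs_tools: bool
    premium_user: bool


def pick_tier(f: Features) -> Tuple[str, str]:
    """Hand-coded first pick: return (tier, reason) from a few cheap features."""
    if f.needs_tools:
        return "medium", "tool use required"
    if f.estimated_tokens > 8_000:
        return "medium", "long input"
    return "small", "default cheap tier"


def route(f: Features, call_model: Callable[[str], Tuple[str, float]], min_confidence: float = 0.7):
    """Run the first pick, escalating one tier at a time on low confidence."""
    tier, reason = pick_tier(f)
    ceiling = TIERS.index("large" if f.premium_user else "medium")
    idx = TIERS.index(tier)
    log = []
    while True:
        text, confidence = call_model(TIERS[idx])
        # Log which tier ran and why, so the router can be tuned later.
        log.append({"tier": TIERS[idx], "reason": reason, "confidence": confidence})
        if confidence >= min_confidence or idx >= ceiling:
            return text, log
        idx += 1
        reason = "escalated: low confidence"
```

Every escalation shows up as a second log entry, which makes the double-billing rate directly measurable from the logs.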

→ Interview Tip

The most impressive senior answer: "I'd start with a hand-coded router based on 3–4 features, then log the outcomes, then train a classifier from those logs once I have ground truth." Bottom-up, data-driven — classic staff-engineer pattern.

Q06

How do you manage context windows when user sessions exceed the model's limit?

There's no single answer — it depends on whether the domain needs recency (support chat) or completeness (code review, legal). The strategy is a stack of techniques, not one choice.

CONTEXT PACKING STRATEGIES

01 · Sliding window: Keep last N turns verbatim. Cheap, fast, loses history. Good for support.
02 · Rolling summary: Summarise older turns into a running digest. Lossy but continuous.
03 · Retrieved memory: Embed past turns, pull only what's relevant per new query.
04 · Structured state: Extract facts to JSON, pass as small structured payload.

For most production systems the right answer is a hybrid: sliding window of the last 6–10 turns + rolling summary of everything before + retrieved memory for specific entities. Structured state is underused — if your domain has stable slots (user preferences, order IDs, active project) extract them and pass them as a small JSON block. That beats summarisation because it's lossless for the things that matter.
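A rough sketch of the hybrid packing described above: structured state, then the rolling summary, then the most recent turns. It assumes a word count as a crude stand-in for a real tokenizer and omits the retrieved-memory step. The function names and default budget are illustrative.

```python
import json
from typing import Callable, Dict, List


def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def pack_context(
    turns: List[str],
    summary: str,
    slots: Dict[str, str],
    window: int = 8,
    budget: int = 4000,
    count: Callable[[str], int] = rough_tokens,
) -> str:
    """Pack state + summary + recent turns, dropping the oldest turns to fit the budget."""
    header = []
    if slots:
        header.append("STATE: " + json.dumps(slots, sort_keys=True))
    if summary:
        header.append("SUMMARY: " + summary)
    recent = list(turns[-window:]) if window > 0 else []
    # The structured state and summary are kept; verbatim turns are trimmed oldest-first.
    while recent and count("\n".join(header + recent)) > budget:
        recent.pop(0)
    return "\n".join(header + recent)
```

Logging `count(packed)` per request gives the context-utilisation number that tells you whether the budget is oversized.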

The senior-signal detail: measure your context utilisation. If your users' p95 conversation is 4k tokens and your budget is 200k, you're paying for nothing. Cap the window at what you actually use, monitor how often you hit it, and only increase when the data demands it.

→ Real-World Use

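Anthropic's prompt caching makes long static system prompts much cheaper: cached prefix tokens are billed at a fraction of the normal input price, with a small premium on the request that writes the cache. Put your instructions, tools, and big static context in the cacheable prefix, then use a tiny variable suffix for the actual turn. Context budget problem partly solved.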

Q07

Where do you draw the line between deterministic code and an LLM call?

The default should be "use code wherever code works". LLM calls are non-deterministic, slow, expensive, and hard to test — if a regex, a SQL query, or a finite-state machine can do the job, that's what you use. The LLM is reserved for tasks where the input is ambiguous in a way that rules can't resolve.

THE LINE · WHAT GOES WHERE

Code owns (determinism, testability, cost): Routing & control flow; Data fetches & writes; Validation & schemas; Retries & timeouts; Auth & permissions
LLM owns (ambiguity, breadth, generalisation): Open-ended generation; Ambiguous parsing; Natural-language UX; Summarisation; Intent classification

The practical rule I give junior engineers: "the LLM is an expensive interpreter, not a runtime". Let it parse the user's intent, pick a tool, and talk back to the user — but route the actual execution through deterministic code. Whenever I see an LLM being asked to "decide what to do and do it in one call", there's a bug waiting. Split the reasoning step from the execution step.
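Splitting the reasoning step from the execution step can be sketched like this: the model only proposes a tool call as JSON, and deterministic code validates and runs it. The tool registry, the `get_order_status` tool, and the proposal shape are hypothetical.

```python
import json


def get_order_status(order_id: str) -> str:
    # Stand-in for a real, deterministic data fetch.
    return f"order {order_id}: shipped"


# Allowlist of tools with their expected argument types.
TOOLS = {
    "get_order_status": {"params": {"order_id": str}, "fn": get_order_status},
}


def execute_proposal(raw: str) -> str:
    """Validate an LLM-proposed tool call, then execute it in plain code."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"proposal is not valid JSON: {exc}") from exc
    if not isinstance(proposal, dict):
        raise ValueError("proposal must be a JSON object")
    name = proposal.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    spec = TOOLS[name]
    args = proposal.get("args", {})
    if not isinstance(args, dict) or set(args) != set(spec["params"]):
        raise ValueError("arguments do not match the tool schema")
    for key, expected_type in spec["params"].items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"argument {key!r} has the wrong type")
    return spec["fn"](**args)


# The string below stands in for model output from the reasoning step.
result = execute_proposal('{"tool": "get_order_status", "args": {"order_id": "A17"}}')
```

Because the executor is ordinary code, it can be unit-tested without calling a model at all.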

The counter-pattern to watch for is LLM creep: every new edge case gets solved by adding another line to the system prompt. Three months in you have a 4000-token prompt that nobody can reason about. When that happens, audit — most of that prompt belongs in code.

→ Mental Model

Think of every LLM call as costing 100ms + $0.01 + a non-zero chance of being wrong. Would you pay that for this task if a human had to review the output? If not, use code.

Q08

How do you design a feedback loop that actually improves the system over time?

Most "feedback loops" are thumbs-up/thumbs-down buttons that nobody clicks. A real feedback loop has four stages: capture, label, improve, verify — and the hardest one is label, not capture.

THE FEEDBACK LOOP THAT CLOSES

Capture (implicit + explicit) → Label (LLM + human review) → Improve (prompt / retrieve / FT) → Verify (eval + shadow) → Ship (canary + ramp)

Capture should bias to implicit signals: did the user follow up with a rephrase (bad), copy the output (good), close the tab within 3 seconds (bad), ask a new question (neutral)? Explicit thumbs typically cover only a small fraction of sessions (on the order of 2%) and skew heavily negative. Implicit signals cover every session and are much more useful for ranking which examples are worth a human look.

The critical design choice is where the loop re-enters the system. Cheap loops update prompts and retrieval. Medium-cost loops add examples to the golden eval set so regressions can't ship. Expensive loops do fine-tuning. Most teams jump straight to fine-tuning because it feels serious, but updating prompts and evals from production data beats fine-tuning on almost all quality dimensions for a fraction of the cost.

→ Key Insight

The single most valuable feedback artifact is not a fine-tune dataset — it's a growing golden eval set built from real production failures. Every incident postmortem should end with "we added N examples to the eval set." That's how you get compounding quality.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part II

Production Incidents

Senior interviews live and die in the war-story section. The questions here are where you prove you've taken the pager and come out with scar tissue, not just slide decks.

Debugging

Drift

Provider Outages

Hallucinations

Rollbacks

Post-mortems

Questions 09–16

Q09

Walk me through a production LLM incident you debugged end-to-end.

The interviewer is listening for four things, in order: (1) how you detected it, (2) how you isolated the cause, (3) how you mitigated without making it worse, and (4) what you changed so it wouldn't happen again. The story doesn't need to be glamorous — clarity beats drama.

THE INCIDENT LOOP

Detect (alert, signal, user report) → Triage (scope, severity, owner) → Isolate (bisect traces, repro) → Mitigate (rollback, gate, fallback) → Learn (eval, runbook, action)

A good story to have ready: "A support bot started confidently answering billing questions with the wrong currency. No errors, no latency blip — just wrong. Our customer-success team pinged us." Detection came from a human, not a metric — call that out. Triage: pulled 50 traces, saw the model had started including an EU pricing doc in retrieval for US users. Isolation: the retriever was pulling by cosine similarity only, and a recent re-indexing had changed the embedding space enough that geography tags no longer clustered. Mitigation: hot-patched the retriever to filter by user region as a hard filter before vector search. Learning: added region-aware tests to the eval set and made metadata filters mandatory in the retriever contract.
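The mitigation in that story, a hard metadata filter applied before similarity ranking, can be illustrated with a toy in-memory retriever. The document shape (`id`, `region`, `vec`) and the two-dimensional vectors are made up for the example. A real system would push the filter down into the vector store.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: List[float], docs: List[Dict], region: str, k: int = 3) -> List[str]:
    """Filter by region first, then rank only the survivors by cosine similarity."""
    allowed = [d for d in docs if d["region"] == region]
    ranked = sorted(allowed, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


docs = [
    {"id": "eu-pricing", "region": "EU", "vec": [1.0, 0.0]},
    {"id": "us-pricing", "region": "US", "vec": [0.9, 0.1]},
    {"id": "us-returns", "region": "US", "vec": [0.0, 1.0]},
]
# The EU doc is the closest vector, but a US user can never receive it.
top_for_us = search([1.0, 0.0], docs, region="US", k=1)
```

Making `region` a required argument is one way to encode "metadata filters are mandatory in the retriever contract".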

The staff-level detail most people miss: explain the blast radius assessment. How did you decide this was a P1 vs a P2? Who was affected, how do you know, and how did you estimate the cost of being wrong? That's the judgement layer senior interviewers are probing.

→ Interview Tip

Pre-write two incident stories: one where you were the first responder and one where you coordinated multiple teams. Practice both out loud. The number-one mistake is rambling — aim for 3 minutes with a clear before-during-after arc.

Q10

How do you detect and mitigate model drift in production?

"Drift" means three different things and senior candidates distinguish them: data drift (user inputs change), concept drift (the right answer for the same input changes), and model drift (the provider silently updates the model behind your API). Each one is detected and mitigated differently.

Type	Symptom	Detector	Mitigation
Data drift	New topics, longer queries	Embedding distribution shift	Retrieval refresh, prompt update
Concept drift	Right answer is wrong now	Feedback rate delta	Re-label golden set, update prompt
Model drift	Same input, different output	Canary replay on golden set	Pin version, re-validate, possibly switch

The sharp detector for data drift is embedding PCA over a rolling window: project last-24-hours of user prompts into a 2-D space and compare to last-week's. A cluster of queries in an unfamiliar region is an early signal that your retrieval is about to go sideways. Cheaper version: track the rate of "I don't know" responses — it correlates surprisingly well with topic drift.

Model drift is the one most teams forget about and the one that bites hardest. When you call a hosted API you're implicitly pinned to whatever the provider ships. Run a daily replay of your golden set through the hosted endpoint and diff the results. If a meaningful fraction changed without you shipping anything, the provider updated something. This is how you catch it before a user does.
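A daily golden-set replay reduces to a small diff job. This sketch assumes exact match after whitespace and case normalisation, which is a deliberately crude comparison. Many teams would substitute a judge model or task-specific scorer. `call_model` is a hypothetical wrapper around the hosted endpoint.

```python
from typing import Callable, Dict, List, Tuple


def normalise(text: str) -> str:
    # Ignore case and whitespace differences when diffing outputs.
    return " ".join(text.lower().split())


def replay_golden_set(
    golden: Dict[str, Tuple[str, str]],
    call_model: Callable[[str], str],
) -> Tuple[float, List[str]]:
    """golden maps case id -> (prompt, last validated output).

    Returns (fraction of cases whose output changed, ids of the changed cases).
    """
    if not golden:
        raise ValueError("golden set is empty")
    changed = [
        case_id
        for case_id, (prompt, expected) in sorted(golden.items())
        if normalise(call_model(prompt)) != normalise(expected)
    ]
    return len(changed) / len(golden), changed
```

Alert when the changed fraction crosses a threshold on a day with no deploys of your own. Outputs sampled at non-zero temperature will differ run to run, so the threshold has to sit above that noise floor.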

→ Real-World Use

Pin model versions with date suffixes (claude-sonnet-4-6-20260115) not aliases. Aliases move underneath you. Pinned versions force you to re-validate before upgrading — the validation is the entire point.

Q11

Your LLM provider goes down mid-day. What's your runbook?

This question is testing whether you've planned for the outage, not whether you can improvise during it. The right answer starts with "we designed for this" and ends with "here's the minute-by-minute runbook".

FAILURE MODES & RESPONSES

01 · Regional brownout: Fail over to secondary region. Keep same provider, same model.
02 · Provider-wide outage: Swap to secondary provider. Expect quality delta, accept it.
03 · Rate-limit squeeze: Degrade: trim context, drop non-critical calls, queue batch.
04 · Degraded accuracy: Surface a "limited mode" banner. Don't hide quality loss from users.

Multi-provider is cheap insurance if you designed for it from day one. That means: model-agnostic request schema, prompt templates that work across providers (or provider-specific variants versioned together), and a pre-approved fallback chain so incident commanders don't have to make architectural decisions at 2am. The worst time to discover your prompts don't port between providers is during the outage.

Also underrated: graceful degradation is a product decision, not just a technical one. Sometimes the right response is to serve a cached answer with a disclaimer. Sometimes it's to put the feature in read-only. Sometimes it's to fail loudly because a wrong answer would be worse than no answer. Have that conversation with your PM before the outage and put the answer in the runbook.

→ Key Insight

Run a provider outage game day every quarter. Block the primary API in staging and watch your system fall over in slow motion. Ten minutes of planned pain saves you four hours of real pain when it happens for real.

Q12

How do you handle hallucinations in user-facing applications?

The senior framing is: hallucinations are a system problem, not a model problem. You don't make them go away — you contain them, detect them, and design UX that tolerates them. Treat it like we treat SQL injection: defence in depth.

DEFENCE IN DEPTH · HALLUCINATION CONTAINMENT

1 · Ground every claim: RAG with source attribution; model can only cite what it retrieved
2 · Constrain outputs: Structured output; validate against schema before returning
3 · Verify claims: Secondary pass: check each claim is supported by the source
4 · UX honesty: Show citations inline; make uncertainty visible to users

Layer 1 is grounding: give the model the facts it needs and instruct it not to answer beyond them. Layer 2 is structure: if the output has to be JSON with specific fields, the surface area for freeform fabrication drops. Layer 3 is verification: run a second, cheaper model pass that asks "is every claim in this answer supported by the retrieved context?" — a surprisingly effective filter. Layer 4 is UX: never let a claim appear without a citation the user can click. Users are remarkably forgiving of "I'm not sure" and remarkably unforgiving of confident wrongness.
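The cheapest deterministic piece of these layers is checking that every claim carries a citation and that each citation points at something actually retrieved. The claim structure below is an assumed output schema, and this check does not verify that the source supports the claim. That still needs the second-pass model in layer 3.

```python
from typing import Dict, List, Set, Tuple


def check_citations(claims: List[Dict], retrieved_ids: Set[str]) -> List[Tuple[int, str]]:
    """Return (claim index, problem) pairs; an empty list means the answer may ship."""
    problems = []
    for index, claim in enumerate(claims):
        cited = claim.get("sources", [])
        if not cited:
            problems.append((index, "no citation"))
            continue
        for source_id in cited:
            if source_id not in retrieved_ids:
                # The model cited something it was never shown.
                problems.append((index, f"cites unretrieved source: {source_id}"))
    return problems


claims = [
    {"text": "Plan A costs $10 per month", "sources": ["doc-1"]},
    {"text": "Plan B is free", "sources": []},
    {"text": "Refunds take 3 days", "sources": ["doc-9"]},
]
problems = check_citations(claims, retrieved_ids={"doc-1", "doc-2"})
```

A non-empty result is a reasonable trigger for a refusal or regeneration rather than showing the answer.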

What you do not do is tell the model "don't hallucinate" in the system prompt. That's a meme answer. It doesn't work. Neither does temperature=0 — that makes the same wrong answer consistent, not less wrong.

→ Mental Model

Hallucination is what a language model does when you ask it something it doesn't know. The fix is never "make it know more" — it's "make it refuse, or ground it, or verify it." Senior engineers pick one of those three for every output surface.

Q13

Tell me about a time a fine-tuned model regressed in production and how you caught it.

Fine-tune regressions have a specific flavour: the model got better on the training distribution and worse on the tail. If your golden eval is a clone of your training distribution, you won't catch it — both move together. The senior story to tell is about a team that caught it because the eval set had a held-out tail slice.

A real pattern: a support team fine-tunes a model on last-90-days of resolved tickets. Accuracy on a random sample jumps. They ship. A week later, customer satisfaction on new-product questions tanks — the fine-tune had shifted the model toward the existing product mix, and it now underperforms on questions about anything that wasn't in the training window. The fix isn't another fine-tune. The fix is mixing a retrieval layer for long-tail queries and classifying inputs to decide which path to use.

EVAL SLICES THAT CATCH FINE-TUNE REGRESSIONS

In-distribution: +4%
Held-out tail: −11%
New topics: −14%
Refusal quality: −6%
Tone & style: +9%

The lesson to deliver: always keep a "rare but important" slice in your eval set. Rare queries about important things (legal, billing edge cases, new features) should each get 10–50 examples, even if they're 0.1% of traffic. Those are the slices fine-tunes regress on.
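The arithmetic behind "aggregate went up, one slice tanked" is worth seeing once. The scores and traffic shares below are invented for illustration, and both function names are hypothetical.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], traffic_share: Dict[str, float]) -> float:
    """Traffic-weighted aggregate, which is what a random production sample measures."""
    return sum(scores[name] * traffic_share[name] for name in traffic_share)


def regressed_slices(baseline: Dict[str, float], candidate: Dict[str, float], max_drop: float = 0.05) -> Dict[str, float]:
    """Return {slice: delta} for every slice that dropped by more than max_drop."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if baseline[name] - candidate[name] > max_drop
    }


baseline = {"in_distribution": 0.80, "held_out_tail": 0.70}
candidate = {"in_distribution": 0.84, "held_out_tail": 0.59}
share = {"in_distribution": 0.95, "held_out_tail": 0.05}

aggregate_improved = weighted_score(candidate, share) > weighted_score(baseline, share)
flagged = regressed_slices(baseline, candidate)
```

Here the aggregate improves while the tail slice falls by eleven points, so a gate on the aggregate alone would have shipped the regression.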

→ Interview Tip

If you haven't shipped a fine-tune, tell a story about a prompt regression instead — same arc, same lesson. "I changed the system prompt, aggregate scores went up, one slice tanked, we caught it because X." That story still signals seniority.

Q14

How do you roll back a bad prompt deployment safely?

The mechanics are simple if you planned for it. You need: (1) prompts stored as versioned artifacts separate from code, (2) a feature flag per deploy, and (3) automatic rollback triggers tied to eval gates. If any of those three is missing, the rollback is a code deploy, which is fine for the first incident and unacceptable after the second.

PROMPT ROLLBACK STACK

1 · Prompt registry: Each prompt version is immutable, has a hash, and ships independently
2 · Flag-gated rollout: Percent traffic flip: 1% → 10% → 50% → 100%
3 · Eval gates: Automated rollback if critical slice drops N% vs baseline
4 · Audit trail: Every request logs prompt hash; join with any complaint later

The subtle part is that "rollback" is not just flipping a flag. You also have to: invalidate any prompt-prefix caches keyed on the new prompt, drain in-flight requests that were mid-stream on the new prompt, and communicate to downstream consumers that output format may have shifted back. A senior answer mentions these. A staff answer builds them into the registry from day one so the rollback is a single operation.
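A toy registry shows how rollback and cache invalidation can become one operation: if the cache namespace is derived from the active prompt hash, rolling back changes the namespace automatically. The class and method names are hypothetical, and draining in-flight requests and notifying consumers are out of scope here.

```python
import hashlib


class PromptRegistry:
    """Immutable, hashed prompt versions with an activation history."""

    def __init__(self):
        self._versions = {}  # hash -> prompt text, never overwritten
        self._history = []   # activation order, newest last

    def publish(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._versions.setdefault(digest, text)
        return digest

    def activate(self, digest: str) -> None:
        if digest not in self._versions:
            raise KeyError(f"unknown prompt version: {digest}")
        self._history.append(digest)

    @property
    def active(self) -> str:
        if not self._history:
            raise RuntimeError("no prompt version is active")
        return self._history[-1]

    def cache_namespace(self) -> str:
        # Derived from the active hash, so a rollback never serves entries
        # written under the rolled-back prompt.
        return f"prompt-{self.active}"

    def rollback(self) -> str:
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.active
```

Logging `registry.active` on every request gives the audit trail from the stack above.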

The story I'd tell: team shipped a prompt that added a new output field. Downstream code assumed the field was present. They rolled back the prompt but the downstream service was still on the new code path and started throwing nulls. The incident was longer than it should have been because the rollback wasn't atomic across the dependency graph.

→ Real-World Use

Treat prompt changes that modify output schema as breaking changes. Version them like API contracts. Ship the schema change, wait for all consumers to be ready, then ship the prompt. Boring discipline, zero incidents.

Q15

How do you handle cascading failures in multi-step agent workflows?

Agents cascade because a small upstream error compounds at every downstream step — a mis-parsed tool argument becomes a failed tool call becomes a confused planner becomes a 20-step loop. You contain this by putting circuit breakers at every step boundary, the same way you would in a microservice mesh.

CIRCUIT BREAKERS FOR AGENTS

Plan (budget: steps, $, time) → Act (validate args first) → Observe (tool result + schema) → Reflect (progress check) → Abort (if budget blown)

Three concrete circuit breakers: a step-count limit (hard cap on iterations), a cost limit in tokens or dollars (when the agent has spent its budget, stop), and a no-progress detector (if the last three steps didn't change the state, the agent is stuck — stop). The no-progress detector is the one most teams forget; it catches the "agent is looping on the same failing tool call" pattern.
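The breakers fit naturally into the loop that drives the agent. In this sketch `step` is a hypothetical callback returning `(new_state, tokens_used, done)`, and "no progress" is approximated as the state comparing equal across consecutive steps. A wall-clock limit is included alongside the three breakers named above.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_seconds: float


def run_agent(
    step: Callable[[Any], Tuple[Any, int, bool]],
    state: Any,
    budget: Budget,
    stall_limit: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Any, str]:
    """Drive the agent loop in code; return (final state, outcome string)."""
    started = clock()
    tokens_spent = 0
    stalled_steps = 0
    for _ in range(budget.max_steps):
        new_state, used, done = step(state)
        tokens_spent += used
        # Count consecutive steps that left the state unchanged.
        stalled_steps = stalled_steps + 1 if new_state == state else 0
        state = new_state
        if done:
            return state, "done"
        if tokens_spent >= budget.max_tokens:
            return state, "abort: token budget"
        if clock() - started >= budget.max_seconds:
            return state, "abort: time budget"
        if stalled_steps >= stall_limit:
            return state, "abort: no progress"
    return state, "abort: step limit"
```

An agent that keeps retrying the same failing tool call leaves the state unchanged and is stopped after `stall_limit` steps, long before the step cap.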

For recovery, the key design choice is whether to retry a failed step or abort and escalate to a human. The answer depends on reversibility: retries are fine for read-only tools; for anything that writes state, default to "escalate" unless you have idempotency guarantees on the tool itself. This is where senior candidates mention idempotency keys per agent run — a rare but correct detail.

→ Key Insight

The "budget" for an agent isn't one number, it's three: steps, tokens, wall-clock. Hit any one and stop. Give the agent a way to ask for more budget if it's close to the answer — otherwise it will hoard on step one and fail at step nine.

Q16

What does a post-mortem for a non-deterministic AI system look like?

Traditional post-mortems assume reproducibility. AI incidents often aren't reproducible — you can't replay the same user session and get the same wrong answer. The post-mortem template has to bend around that.

Section	Classical	AI-specific
Repro	Deterministic steps	Captured traces + inputs; probabilistic replay
Root cause	Single commit / config	Prompt + model + retrieval + input shape
Fix	Code change	Prompt, eval set, retrieval filter, UX change
Prevention	Test case	Golden example + monitoring alert
Metric	TTR, error rate	TTR + quality-signal delta + slice recovery

Two rules I enforce: (1) Every AI post-mortem ends with at least one new eval example. That's how you make the incident compound into long-term protection instead of just a memory. (2) Root cause is plural. "The model hallucinated" is never the root cause — the root cause is why the system let a hallucination reach a user. Usually: missing grounding, missing verification, UX that hid uncertainty, or a retrieval filter that was off by one.

The senior framing I'd lead with: "post-mortems are how you turn probabilistic bugs into deterministic tests." Every user-reported failure, if captured well, becomes a line in the golden set. After two years of this, your eval set is the most valuable artifact on the team — it encodes every scar you have.

→ Interview Tip

If asked "how do you measure reliability for an LLM feature?" the answer is never pure uptime. Pair uptime with a quality SLO on a held-out slice — e.g. "95% of queries on the billing slice must still get an eval-judge score of 4+." Quality is part of the SLO or it doesn't exist.
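The quality SLO in that example reduces to a pass-rate check over judge scores for one slice. The 1 to 5 scale, the threshold, and the target mirror the example sentence. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def quality_slo_met(scores: Sequence[int], min_score: int = 4, target: float = 0.95) -> Tuple[bool, float]:
    """Return (SLO met?, pass rate) for one slice of eval-judge scores."""
    if not scores:
        raise ValueError("no scores for this slice")
    passing = sum(1 for score in scores if score >= min_score)
    rate = passing / len(scores)
    return rate >= target, rate


# Hypothetical billing-slice scores: 96 of 100 queries scored 4 or higher.
met, rate = quality_slo_met([5] * 96 + [3] * 4)
```

Reporting this next to uptime on the same dashboard is what makes quality part of the SLO in practice.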


Part III

Agentic
Systems

The topic every AI team is hiring for, and the one where interviewers can tell in thirty seconds whether you've actually shipped an agent. The distinction is in the failure modes.

Orchestration

Tool use

State

Memory

Handoffs

Cost bounds

Questions 17–24

Q17

How do you structure a multi-agent system? Why not a single agent with more tools?

The honest senior answer: start with a single agent and split only when you have evidence. Multi-agent is the most over-engineered pattern in the space. A single agent with 8 well-designed tools beats a three-agent mesh in most domains — less state, less orchestration, fewer handoff bugs.

The cases where you genuinely need multi-agent are specific: (1) role specialisation where the prompts are too different to fit in one system prompt (planner vs executor vs critic), (2) security boundaries where one agent has write access and another doesn't, and (3) scale boundaries where the orchestrator coordinates N parallel workers that don't talk to each other.

WHEN TO SPLIT · SINGLE VS MULTI

Keep single

Under ~10 tools

One role

Shared context helps

Trust boundary uniform

Simpler to test, cheaper to run

Go multi

Distinct roles & prompts

Parallel workers

Security isolation

Specialised models

Pay for orchestration + state

When you do split, the pattern that works is planner → executor → critic, each running on a model sized to its job (cheap planner, cheap executor, more capable critic for the final check). The orchestrator is code, not an LLM, because code is the thing that reliably enforces the step count, budget, and handoff protocol.

→ Interview Tip

When a candidate jumps straight to "multi-agent system" without justifying it, senior interviewers mentally mark it as inexperience. The seasoned move is "I'd prototype with one agent first, then split when I can point at a specific failure mode." That's the answer to lead with.

Q18

When should an agent escalate to a human versus retry on its own?

The rule of thumb: retry on transient failures, escalate on ambiguity. A 429 rate-limit is a retry. A tool that returned zero results is ambiguity — the agent should not silently pivot to a different plan and hope for the best.

THE ESCALATION DECISION TABLE

01 · Retry silently

Network errors, rate limits, schema parse fail, stale tokens.

02 · Retry with repair

Invalid tool args — show error back to LLM, retry once.

03 · Ask the user

Ambiguous intent, missing required info, confidence too low.

04 · Escalate to human

Destructive action needed, policy violation, repeated failures.
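The decision table above can be sketched as a small routing function that lives in the orchestrator, outside the LLM. This is a minimal sketch: the failure labels, the `decide` function, and the retry cap of 3 are illustrative assumptions, not a standard taxonomy.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry_silently"
    REPAIR = "retry_with_repair"
    ASK_USER = "ask_user"
    ESCALATE = "escalate_to_human"


# Failure labels are hypothetical; map your own error taxonomy onto them.
TRANSIENT = {"network_error", "rate_limit", "schema_parse_fail", "stale_token"}
NEEDS_USER = {"ambiguous_intent", "missing_info", "low_confidence"}


def decide(failure: str, retries_so_far: int, destructive: bool = False,
           max_retries: int = 3) -> Action:
    """Pick the next move for a failed agent step."""
    # Destructive actions, policy violations and repeated failures go to a human.
    if destructive or failure == "policy_violation" or retries_so_far >= max_retries:
        return Action.ESCALATE
    if failure in TRANSIENT:
        return Action.RETRY
    if failure == "invalid_tool_args":
        # Show the error to the model and allow exactly one repaired retry.
        return Action.REPAIR if retries_so_far == 0 else Action.ESCALATE
    if failure in NEEDS_USER:
        return Action.ASK_USER
    # Unknown failure kinds are treated as unsafe to retry.
    return Action.ESCALATE
```

Keeping this in plain code means the escalation policy can be unit-tested and changed without touching a prompt.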

The design detail that separates senior from staff: the escalation path is part of the product surface, not a fallback hack. If the agent can escalate to a human, the human's queue, the SLA, the "who gets paged" policy, and the handback flow all have to be designed. Otherwise escalation becomes a black hole that the user abandons.

Also mention the "ask the user once" rule: if the agent is unsure, it should ask a single clarifying question with a bounded set of answers. A free-form clarification loop devolves into conversation and burns tokens. Bounded clarifications feel like a good UI and are cheap.

→ Real-World Use

For any write action, require the agent to summarise the proposed change and show it to the user before executing. This is the single highest-leverage pattern for destructive tools. Cheap, obvious, under-used.

Q19

How do you handle tool-use failures inside an agent loop?

Tool failures come in three flavours and each needs different handling: schema errors (the model called the tool wrong), tool errors (the tool ran but returned an error), and semantic errors (the tool ran, returned "success", but the result is wrong for the task).

Failure	Detection	Fix
Schema error	JSON parse / type check	Return error to LLM, let it retry with corrected args
Tool error	HTTP status / exception	Map to a human-readable error the LLM can reason about
Empty result	Zero hits, empty list	Surface as "no results, consider alternatives"
Semantic error	Verifier / sanity check	Tag as suspect, retry with different params
Repeated failure	N retries exceeded	Escalate, don't loop

The underrated one is semantic errors. Tools often return "success" for wrong outcomes — a search tool returns hits that don't match the intent, a code-execution tool returns output that ran but didn't do the thing asked. You catch these with a verifier pass: a small prompt that asks "given the goal and this tool output, did we make progress?" It's cheap and it's the difference between agents that feel reliable and agents that feel slippery.

The schema design rule: tools should return errors that the LLM can reason about. "HTTP 400: bad field 'start_date', expected YYYY-MM-DD" is actionable. "Internal server error" is not. Wrap your tool errors in natural language on the way back to the model.
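A minimal sketch of that wrapping, assuming a hypothetical search tool with a single `start_date` argument. The helper names (`validate_search_args`, `call_tool`) and the result envelope are made up for illustration.

```python
import re
from typing import Callable, Optional

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_search_args(args: dict) -> Optional[str]:
    """Return None if args are valid, else a message the model can act on."""
    if "start_date" not in args:
        return "Missing required field 'start_date' (expected YYYY-MM-DD)."
    value = args["start_date"]
    if not isinstance(value, str) or not DATE_RE.match(value):
        return f"Bad field 'start_date': got {value!r}, expected YYYY-MM-DD."
    return None


def call_tool(args: dict, tool_fn: Callable) -> dict:
    """Run a tool and always hand back a result the LLM can reason about."""
    error = validate_search_args(args)
    if error:
        # Schema error: caught before the tool runs.
        return {"ok": False, "error": error}
    try:
        return {"ok": True, "result": tool_fn(**args)}
    except Exception as exc:
        # Tool error: name the exception instead of a bare "internal error".
        return {"ok": False, "error": f"Tool failed with {type(exc).__name__}: {exc}"}
```

The returned `error` string is what goes back into the model's context on the next turn.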

→ Key Insight

Give every tool an idempotency key parameter and accept it silently. Agents retry. Retries cause duplicate writes. Idempotency keys cost you one extra column and they save you one on-call shift per quarter.
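As a rough illustration of the idea, here is a toy write tool that deduplicates on an idempotency key. `TicketTool` and `key_for_step` are hypothetical names, and the in-memory dict stands in for the extra database column.

```python
class TicketTool:
    """Toy write tool (hypothetical): creates support tickets."""

    def __init__(self):
        self.tickets = []          # stands in for the real table
        self._results_by_key = {}  # the "one extra column": key -> stored result

    def create_ticket(self, title: str, idempotency_key: str) -> dict:
        # A retried call with a known key returns the original result
        # instead of writing a second row.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        ticket = {"id": len(self.tickets) + 1, "title": title}
        self.tickets.append(ticket)
        self._results_by_key[idempotency_key] = ticket
        return ticket


def key_for_step(run_id: str, step: int) -> str:
    """Derive a stable key per agent run and step, so retries reuse it."""
    return f"{run_id}:{step}"
```

Deriving the key from the run ID and step number, rather than asking the model to invent one, keeps retries of the same step on the same key.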

Q20

How do you manage agent state — checkpointing, pausing, resuming long runs?

Agents that run more than a few seconds need to survive restarts, deploys, and human pauses. The mental model: treat an agent run like a workflow engine job. State lives outside the agent, the agent is a reducer over an append-only event log, and any step can be replayed from the log.

AGENT RUN AS WORKFLOW

1 · Event log

Every plan / call / observation appended with seq number

↓

2 · Current state derived

Replay log → get conversation, tool cache, budget used

↓

3 · Pause / resume

Paused = no worker polling. Resume = re-derive state, continue

↓

4 · Human-in-the-loop

A human action is just another event in the log

The event-sourced design has a huge hidden benefit: you can replay an agent run on a newer model and diff the outcome. When you upgrade Sonnet, you replay yesterday's agent runs through the new model (with write tools stubbed out) and see which steps went better or worse. That's an eval on real data for little more than the cost of inference.

For short-lived agents (<30 seconds, single request) this is overkill — just hold state in memory and be done. For anything that might span hours, survive deploys, or need human approval in the middle, the workflow-engine model is the right default. Temporal, Restate, and Inngest all ship patterns for this; rolling your own is fine if the domain is small.
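The "agent as a reducer over an event log" idea can be sketched in a few lines. The event types and the `RunState` fields below are assumptions chosen for illustration; a real workflow engine would supply its own.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    messages: list = field(default_factory=list)
    tool_cache: dict = field(default_factory=dict)
    steps_used: int = 0
    paused: bool = False


def apply(state: RunState, event: dict) -> RunState:
    """Fold one event into the derived view."""
    kind = event["type"]
    if kind == "plan":
        state.messages.append(("assistant", event["text"]))
    elif kind == "tool_call":
        state.steps_used += 1
    elif kind == "observation":
        state.tool_cache[event["call_id"]] = event["result"]
    elif kind == "human_pause":
        state.paused = True
    elif kind == "human_resume":
        # A human action is just another event in the log.
        state.paused = False
    return state


def derive_state(log: list) -> RunState:
    """Rebuild the disposable view by replaying the durable log in seq order."""
    state = RunState()
    for event in sorted(log, key=lambda e: e["seq"]):
        apply(state, event)
    return state
```

Resuming after a deploy is then just `derive_state(log)` followed by the next step; nothing about the run lives only in worker memory.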

→ Mental Model

Agent state = event log + derived view. The log is durable, the view is disposable. If the view goes wrong, rebuild it from the log. This mental model imports 30 years of database wisdom into agents.

Q21

How do you stop an agent from burning money in a runaway loop?

Three layers, enforced in code outside the LLM: step budget, token/dollar budget, wall-clock budget. Hit any one and the loop stops. Do not trust the LLM to track its own budget — it will not, because it doesn't know how to.

THREE HARD LIMITS

STEPS

≤ 25

Typical cap for interactive agents

COST

$ 0.50

Hard per-run spend ceiling

TIME

120s

Wall-clock wall, no exceptions

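Beyond hard limits, use a no-progress detector: normalise and hash the last three (tool, args) pairs, and abort if they are identical. This catches the most common loop, where the model keeps calling the same failing tool with the same arguments. An exact hash will not catch retries with slightly different phrasing, so for search-style tools normalise the query first (lowercase, strip whitespace and punctuation) or compare with a similarity threshold. Also log a loop-detected event so you can count how often agents hit it; that count is a quality signal.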
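A minimal sketch of the three hard limits plus the no-progress check, enforced in orchestrator code. The class and function names are illustrative, the defaults mirror the numbers above, and the fingerprint only catches exact repeats after canonical JSON serialisation.

```python
import hashlib
import json
import time


class BudgetExceeded(Exception):
    pass


class RunBudget:
    """Steps, dollars and wall-clock, checked outside the LLM on every step."""

    def __init__(self, max_steps=25, max_cost_usd=0.50, max_seconds=120.0,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one step; raise as soon as any single limit is hit."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded("steps")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost")
        if self.clock() - self.started > self.max_seconds:
            raise BudgetExceeded("time")


def call_fingerprint(tool: str, args: dict) -> str:
    # sort_keys makes the hash independent of argument order.
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stuck(history: list, window: int = 3) -> bool:
    """True if the last `window` (tool, args) calls are identical."""
    recent = [call_fingerprint(tool, args) for tool, args in history[-window:]]
    return len(recent) == window and len(set(recent)) == 1
```

The agent loop would call `budget.charge(step_cost)` and `is_stuck(history)` after every tool call and stop on either signal.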

For runtime cost alerting, aggregate per tenant per hour and alarm on 5x-over-baseline. Runaway agents usually cluster around a broken deploy or a single tenant's weird input. Spotting the cluster fast matters more than per-request limits — one user spamming will always eat some budget, ten users hitting a broken prompt can take down a quarter's margin.

→ Key Insight

Budgets should be configurable per agent or task type, not global. A research agent searching arXiv can spend 50 steps. A confirm-this-transaction agent should cap at 3. One-size-fits-all budgets either starve real work or bankrupt you on edge cases.

Q22

How do you test an agent before you ship it?

Agents are stochastic and multi-step, so test pyramids from traditional software don't directly transfer. The version that works: tool tests at the bottom, trajectory tests in the middle, end-to-end evals on top.

AGENT TEST PYRAMID

1 · Tool unit tests

Every tool tested deterministically — no LLM involved

↑

2 · Trajectory tests

Given input X, did the agent call the expected tools in sensible order?

↑

3 · Outcome evals

LLM-as-judge scores final answer against rubric on N cases

↑

4 · Adversarial / red-team

Prompt injection, jailbreaks, unsafe tool use — all must fail safely

The key trick is at layer 2: trajectory tests don't check the exact tool sequence — they check that required tools were called and forbidden ones weren't. "The agent must call verify_identity before update_email, and must not call delete_account anywhere" is a reliable invariant. Exact-match tool sequences break every time the model re-plans.
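A trajectory invariant check of this kind can be sketched as below. The function name and the shape of the inputs (a flat list of tool names per run) are assumptions for illustration.

```python
def check_trajectory(calls, required=(), forbidden=(), must_precede=()):
    """Return a list of violated invariants for one recorded tool sequence.

    calls:        tool names in the order the agent called them
    required:     tools that must appear at least once
    forbidden:    tools that must never appear
    must_precede: (before, after) pairs; `before` must occur ahead of the
                  first `after`, if `after` is called at all
    """
    violations = []
    for tool in required:
        if tool not in calls:
            violations.append(f"missing required tool: {tool}")
    for tool in forbidden:
        if tool in calls:
            violations.append(f"forbidden tool called: {tool}")
    for before, after in must_precede:
        if after in calls:
            first_after = calls.index(after)
            if before not in calls[:first_after]:
                violations.append(f"{after} called before {before}")
    return violations
```

Because the check ignores any extra or reordered calls that do not touch an invariant, it keeps passing when the model re-plans.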

The layer most teams skip is adversarial and it's the one that matters most for agents with real-world side effects. Have a set of prompt injections, tool-abuse attempts, and policy violations in your eval set, and gate deploys on them passing. This is the eng equivalent of a sandbox test for a release.

→ Real-World Use

Record every production trajectory. Sample 5% weekly, diff them against a baseline, and add any surprising ones to your test set. Your test set grows organically from real traffic — the best possible source.

Q23

How do you design an agent's memory so it's useful and bounded?

"Agent memory" is vague. Senior candidates split it into four concrete kinds and wire each to its own store.

Type	Contents	Store	Retrieval
Working	Current task scratchpad	Prompt context	Always included
Episodic	Past sessions by user	DB + vector	Recency + similarity
Semantic	User prefs & facts	Structured row / JSON	Always included, small
Procedural	"How we do X" patterns	Prompt / tools	Baked into system prompt

The most valuable memory for production agents is semantic memory stored as structured facts, not prose. "User prefers metric units, based in Berlin, has project IDs PRJ-14, PRJ-27" is a few dozen tokens and a lossless feed into every new session. It beats any vector-based "memory system" you can build because it's deterministic and auditable.

Episodic memory is where people over-engineer. The brutal truth: you rarely need to retrieve specific past conversations. You need to retrieve the facts from them. Build a pipeline that extracts facts from episodes into structured memory and throws away the episode text after a while. Storage is cheap but context is expensive.

→ Interview Tip

If you're asked to design a chat assistant with memory, the right move is "structured facts in a user row + retrieved snippets for specifics + never the raw episode text." Say this sentence and you cut through a lot of hand-wavy designs.

Q24

How do you handle handoffs between specialised agents or roles?

Handoffs are where multi-agent systems earn their keep — or fall apart. The three things that must cross the handoff cleanly: (1) the goal, (2) the facts gathered so far, (3) the constraints and budget remaining. Miss any of them and the receiving agent starts from zero.

HANDOFF PROTOCOL · WHAT MUST CROSS

01 · Goal statement

Single sentence describing the outcome the receiving agent must produce.

02 · Structured facts

JSON of what's known — never "here's everything I thought about".

03 · Remaining budget

Steps/tokens/time left so the receiver plans within bounds.

04 · Return contract

Schema for what the sender expects back. Not optional.

The common anti-pattern is passing the entire conversation history across the handoff. That dumps the noise (exploration, false starts, errors) on the receiver, blows its context, and confuses its planning. Compress to structured facts first. The senior pattern: handoff is a pure function call with typed inputs and outputs, not a stream of consciousness.

The return path matters just as much. If the delegated agent fails or times out, the orchestrator needs a structured failure back, not silence. Design handoffs like RPC calls with typed success and failure envelopes — that one discipline makes multi-agent systems debuggable instead of mystical.
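One way to sketch the "handoff as a typed call" idea is with two small dataclasses and a wrapper that always returns a structured result. All names here (`Handoff`, `HandoffResult`, `delegate`) are hypothetical, and the worker is assumed to return a dict.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Handoff:
    goal: str             # one sentence: the outcome the receiver must produce
    facts: dict           # structured facts only, never raw history
    steps_left: int       # remaining budget so the receiver plans within bounds
    tokens_left: int
    seconds_left: float
    return_fields: tuple  # the return contract: keys the sender expects back


@dataclass(frozen=True)
class HandoffResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


def delegate(handoff: Handoff, worker: Callable[[Handoff], dict]) -> HandoffResult:
    """Call a worker agent and always return a typed success or failure."""
    if handoff.steps_left <= 0:
        return HandoffResult(ok=False, error="no step budget left")
    try:
        data = worker(handoff)
    except Exception as exc:
        # A crash becomes a structured failure, never silence.
        return HandoffResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    missing = [name for name in handoff.return_fields if name not in data]
    if missing:
        return HandoffResult(ok=False, error=f"missing fields: {missing}")
    return HandoffResult(ok=True, data=data)
```

The orchestrator can then branch on `result.ok` the same way it would on an RPC response.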

→ Mental Model

A handoff between agents should pass less context than a function call in a REST API. If you're passing the raw message history, you've built an agent mesh by accident — and it will fail the same way a microservice mesh fails when every call passes every cookie.


Part IV

RAG &
Retrieval

Retrieval is the discipline that decides whether your model looks smart or clueless. Senior questions skip the definitions and go straight to the production tradeoffs that make or break an answer.

Chunking

Hybrid search

Re-ranking

Staleness

Scaling

Debugging

Questions 25–32

Q25

Walk me through how you'd design a production RAG system from scratch.

The senior framing: RAG has two pipelines, not one — indexing and querying — and they have different SLAs. Indexing is batch, idempotent, and fault-tolerant. Querying is interactive, latency-sensitive, and read-only. Conflate them and you either overpay for batch or underperform at query time.

RAG · TWO PIPELINES

INDEX: Ingest (parse, clean, extract) → Chunk (semantic + metadata) → Embed (dense + sparse) → Store (vector + doc db)

QUERY: Rewrite (expand, decontext) → Retrieve (hybrid search) → Rerank (cross-encoder) → Answer (cite + verify)

Walk the interviewer through the full pipeline and call out the non-obvious choices: (1) Query rewriting — a dedicated step where you expand the raw user question into a search-friendly form, decontextualize pronouns, and sometimes generate multiple queries. This single step is often the largest quality lever. (2) Hybrid search — BM25 + dense is the production default; pure vector loses precision on exact-match queries. (3) Re-ranking — a cross-encoder on the top-50 before the top-5 are sent to the LLM. (4) Answer verification — a second pass that checks every claim is grounded in a retrieved source.

The staff-level detail: metadata filters, not vector similarity, are the most important production lever. User region, document type, permission scope, date range — these should be hard filters applied before vector search, not post-hoc. Skip this and you'll spend quarters tuning embeddings to fix a problem a SQL WHERE clause would solve instantly.
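A toy illustration of "filter first, then rank", using an in-memory list and cosine similarity. The chunk layout and the `search` signature are assumptions; in a real system the filter would be a WHERE clause or the vector store's native filter.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, filters, k=5):
    """Apply hard metadata filters, then rank only the survivors by similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]
```

A chunk from another tenant can never be returned here, no matter how similar its embedding is, which is the property post-hoc filtering cannot guarantee once top-K has already been cut.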

→ Interview Tip

Ask the interviewer "what's the corpus size and update rate?" before answering. The answer for a 1k-doc corpus that updates monthly is different from a 50M-doc corpus that updates every minute. Showing you know that distinction is itself a signal.

Q26

What's your chunking strategy and why?

Chunking is the most underrated quality lever in RAG. The wrong answer is "512 tokens with 50 overlap" — that's a default, not a strategy. The right answer starts with "what's the shape of my documents?" and builds from there.

Strategy	Best for	Tradeoff
Fixed-size	Uniform text, PDFs	Cuts across sentences; cheap baseline
Sentence / paragraph	Prose, blog posts	Variable length, more semantic
Semantic (embedding gap)	Mixed content	Expensive at index, cleaner chunks
Structural (markdown / headings)	Technical docs, wikis	Needs clean source, best retrieval
Late chunking	Long coherent docs	Embeds full doc first, chunks the outputs
Contextual (Anthropic)	Dense reference material	Prepends doc context to each chunk — quality bump

My production default for heterogeneous corpora: structural chunking on headings + contextual retrieval. The structural pass gives you chunks that respect document boundaries (no cutting mid-section or mid-sentence), and contextual retrieval fixes the "chunk orphaned from its parent" problem by prepending a short, LLM-generated context that situates each chunk within its document before embedding. Anthropic's write-up on contextual retrieval reported roughly 35% fewer retrieval failures from contextual embeddings alone and about 49% fewer when combined with contextual BM25, for modest index-time cost.
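A minimal sketch of the structural pass for markdown sources. It prepends the heading path to each chunk as a cheap stand-in for context; this is not Anthropic's LLM-generated context, just a simpler assumption for illustration. It also does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str, doc_id: str) -> list:
    """Split a markdown document into one chunk per heading section."""
    chunks = []
    path = []  # current heading trail, e.g. ["Guide", "Install"]
    body = []  # lines collected for the section being built

    def flush():
        text = "\n".join(body).strip()
        if text:
            section = " > ".join(path)
            chunks.append({
                "doc_id": doc_id,    # keep the parent so retrieval can expand later
                "section": section,
                "text": text,
                "embed_text": f"{section}\n{text}" if section else text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

`embed_text` is what would be sent to the embedding model, while `text` is what goes into the prompt.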

The thing to call out explicitly: chunking quality is bounded by parsing quality. If you feed the chunker garbage text from badly parsed PDFs or scraped HTML, no chunking strategy will save you. In my experience roughly half of RAG quality problems are upstream of chunking, in ingestion and normalization. Fix parsing first.

→ Real-World Use

Store both the chunk text and its parent doc ID. When a chunk wins retrieval, optionally expand to neighbouring chunks or the full section. "Small chunks for retrieval precision, bigger windows for generation context" is the pattern.

Q27

How do you evaluate retrieval quality — separate from answer quality?

Decoupling retrieval eval from generation eval is the single most important discipline in RAG. If you only measure end-to-end answer quality, you can't tell whether a regression is because the retriever missed the doc or the generator botched the synthesis.

RETRIEVAL METRICS · THE FOUR THAT MATTER

01 · Recall @ K

What fraction of the relevant docs made it into the top K? The ceiling of the system.

02 · MRR

Mean reciprocal rank — how high is the first relevant hit?

03 · nDCG

Quality-weighted ranking — punishes bad ordering, not just absence.

04 · Context Precision

Of the chunks sent to the LLM, how many were actually useful?

To measure any of this you need a labelled retrieval set — queries paired with ground-truth document IDs. Build one from 200–500 real queries, have humans mark the correct docs, and run every retrieval change against it. Frameworks like RAGAS can substitute an LLM judge for the human labels in a pinch, but I'd still want a human-labelled gold slice for anything safety-critical.

The metric most people skip is Context Precision: of the top-K chunks you sent to the LLM, what fraction were actually used in the final answer? High precision means you can shrink K (cheaper prompts). Low precision means the re-ranker is broken or the prompt is wasteful. Measure this and every decision you make about K becomes quantifiable.
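The three ranking metrics are short enough to write by hand, which helps when explaining them in an interview. This is a sketch using the standard textbook definitions; the function names and the graded `gains` dict are illustrative choices.

```python
import math


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant hit; average over queries to get MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list, gains: dict, k: int) -> float:
    """DCG of the returned order divided by the DCG of the ideal order."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

Run these per query over the labelled retrieval set and report the mean, ideally sliced by query type.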

→ Key Insight

When retrieval fails, upstream fixes (parsing, chunking, metadata) beat downstream fixes (re-ranker, prompt). Always profile where in the pipeline your recall is dropping before reaching for the fancy tools.

Q28

When do you use hybrid search versus pure vector search?

The answer in 2026 is simple: almost always hybrid. Pure vector is elegant in papers and brittle in production — it loses to BM25 on anything involving exact product names, IDs, error codes, quoted phrases, or rare jargon. Hybrid is the boring, correct default.

WHERE EACH ONE WINS

BM25 wins

Exact product names

Error codes (E_404_BAR)

Rare domain jargon

Short queries

Lexical precision

Vector wins

Paraphrased questions

Cross-lingual

Semantic generalisation

Long natural queries

Semantic recall

Fusion: the standard approach is reciprocal rank fusion (RRF) — merge ranked lists from each retriever by summing 1/(k+rank). Cheap, needs no tuning, works. More sophisticated: learn to weight the two signals per query type, but the returns diminish fast.
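Reciprocal rank fusion fits in a few lines. The sketch below assumes each retriever returns an ordered list of doc IDs; `k=60` is a commonly used constant, not something the text above prescribes.

```python
from collections import defaultdict


def rrf(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs by summing 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, `rrf([bm25_ids, dense_ids])` rewards documents that rank well in both lists without needing the two retrievers' scores to be on the same scale.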

Pure vector alone is defensible in two cases: (1) your corpus is tiny and well-curated (say, a 200-FAQ knowledge base, where queries are mostly paraphrases of the stored questions and BM25 adds little), or (2) you're doing cross-lingual retrieval (English queries hitting French docs, where BM25 can't help you). For anything else, ship hybrid from day one. You'll spend the same engineering effort on pure vector and get worse outcomes.

→ Real-World Use

Postgres with pgvector + tsvector gives you hybrid search in one database with one query. For many teams this beats a dedicated vector DB for the first year — fewer moving parts, transactional consistency, and you already have Postgres on-call.

Q29

How do you handle index staleness when documents change constantly?

Staleness has two sides: content drift (docs changed, embeddings didn't) and model drift (embedding model changed, old embeddings don't align with new queries). Most teams plan for the first and ignore the second.

STALENESS · TWO KINDS, TWO FIXES

Content drift → CDC pipeline

Doc update triggers re-embed + upsert. Event-driven, idempotent by doc-id.

↓

Delete propagation

Soft-delete with retention window, hard-delete after. Never leave orphans.

↓

Model drift → versioned indexes

Never mix embedding model versions. Blue/green whole index on upgrade.

↓

Freshness SLO

"99% of doc updates visible in retrieval within N minutes." Measure it.

Practically: hook your ingestion to a change-data-capture stream from your source of truth (database or document store). Every change emits an event, a worker re-embeds just the affected chunks, and upserts them. Never reindex-the-world unless you absolutely must — reindexing is expensive, creates temporary inconsistency, and hides bugs.

For embedding model upgrades, the rule is brutal: never mix versions in one index. Old embeddings and new embeddings live in different spaces. Blue/green the whole index: reindex to v2 behind a flag, shadow-query to verify, flip traffic. This is the single most common "why is retrieval mysteriously worse" root cause I've seen.

→ Key Insight

Add doc_version and embed_model_version as required metadata on every chunk. Being able to query "show me chunks using model v1" is how you audit consistency during a migration — and how you find orphans afterwards.

Q30

What's your re-ranking strategy and when is it worth the cost?

Re-ranking is the second-pass quality amplifier on top of first-stage retrieval. First stage (BM25 + vector) is fast and recall-oriented — get the top 50–100 candidates. Second stage (cross-encoder) is slow and precision-oriented — rerank those 50 down to the top 5 that go to the LLM.

TWO-STAGE RETRIEVAL

STAGE 1 · Retrieve (hybrid, top 50, ~10ms) → STAGE 2 · Rerank (cross-encoder, top 5, ~100ms) → STAGE 3 · Generate (LLM answer, ~1-3s)

Model choice depends on budget. Hosted options: Cohere Rerank (best quality, API call, predictable latency), Voyage Rerank (competitive quality, cheaper). Self-hosted: BGE-reranker or Jina Reranker (free, 50–200ms on a GPU, good enough for most cases). For tiny budgets, a well-tuned BM25 + vector fusion often matches a naive cross-encoder — don't reach for rerankers before you've tuned the first stage.

When is re-ranking not worth it? When your first-stage precision is already high (rare), when your latency budget is <300ms (interactive autocomplete), or when your top-K is already 3–5 and re-ranking just reorders the same small set. In practice: if your first stage returns 50 candidates and your LLM sees only 5, re-ranking is almost always worth the cost.

→ Mental Model

Retrieval is about recall (did we find it?). Re-ranking is about precision (did we put it first?). Don't try to make one stage do both: they have opposing optimal settings, and a single stage can't serve both.

Q31

A user reports a RAG answer is wrong. Walk me through debugging it.

Debug RAG by walking the pipeline in order and answering three questions: (1) was the right doc in the corpus? (2) did retrieval find it? (3) did the LLM use it correctly? Most teams jump to the LLM first; the correct order is the other way around.

DEBUG ORDER · OUTSIDE IN

1 · Is the right answer in the corpus at all?

Grep the raw docs for the keyword. If missing → ingestion bug.

↓

2 · Did chunking keep it together?

Find the chunk(s). If split across chunks → chunking bug.

↓

3 · Did retrieval rank it highly?

Replay query. Check top-K. If not present → retriever bug.

↓

4 · Did the LLM use it?

Inspect the actual prompt sent. If present but ignored → prompt bug.

Most real RAG failures turn out to be at step 1 or 2, not at the model. The doc wasn't in the corpus, or it was but parsed poorly, or the chunk that contained the answer got separated from the chunk that contained the context that makes the answer recognisable. The senior habit: always grep the raw corpus before blaming the model.

Make this debug loop fast. Every RAG system should have a "replay query" tool that takes a question and shows: rewritten query, BM25 results, vector results, fused top-K, reranker output, chunks sent to the LLM, and final answer. Thirty seconds to diagnose — that's the tool that pays for itself in the first week.

→ Interview Tip

If asked "our RAG answers are sometimes wrong", the senior first question isn't "which model?" — it's "how do you tell whether retrieval found the source or not?" Framing the problem that way signals you've done this before.

Q32

How would you scale RAG to a billion documents?

At a billion documents, none of the defaults apply. You're no longer doing vector search on a laptop — you're designing a distributed system where the retrieval layer has its own on-call rotation. The key design moves: sharding, approximate search, and aggressive pre-filtering.

BILLION-DOC RAG · KEY LEVERS

01 · Shard by hard filter

Partition by tenant / geo / date so most queries hit one shard.

02 · ANN (HNSW / IVF-PQ)

Approximate — trade tiny recall loss for 100x speedup.

03 · Quantization

Int8 or binary embeddings cut memory 4-32x for small recall hit.

04 · Tiered storage

Hot data in RAM; warm on SSD; cold in object storage.

The biggest lever isn't the vector DB; it's sharding by the filter most queries use. If every query is scoped to a tenant and a date range, partition the index by (tenant, month). A query then hits 1% of the corpus, not 100%. You cut the search space roughly a hundred-fold before you touch any ANN parameters.

Quantization is underrated. Binary embeddings (1 bit per dim) are surprisingly competitive when paired with a reranker, and they cut memory by 32x. The pattern: binary for first-stage retrieval on the billion-doc shard, full-precision vectors for the few thousand you actually rerank. This is how the frontier search systems scale.
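The binary-then-rescore pattern can be illustrated in pure Python. This is a toy sketch: the sign-based `binarize`, the packed-integer representation, and the `two_stage_search` helper are assumptions, and a production system would use a vector engine's packed bit vectors instead.

```python
def binarize(vec) -> int:
    """Keep one sign bit per dimension, packed into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary embeddings."""
    return bin(a ^ b).count("1")


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, docs, shortlist=100, k=10):
    """Cheap Hamming shortlist first, full-precision rescoring second."""
    q_bits = binarize(query)
    coarse = sorted(docs, key=lambda d: hamming(q_bits, d["bits"]))[:shortlist]
    return sorted(coarse, key=lambda d: dot(query, d["vec"]), reverse=True)[:k]
```

Only the shortlist ever touches the full-precision vectors, which is where the memory saving comes from.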

Finally — and this is the staff-level framing — most teams that think they need billion-doc RAG don't. They need filtered retrieval over the 100k docs their user actually cares about. Before designing a distributed index, ask whether the working set per query is actually that large.

→ Key Insight

Sharding strategy should match the filter shape, not alphabetical doc-id. If 99% of queries are scoped to one user's docs, shard by user. If 99% are scoped to the last 30 days, shard by date. Match the access pattern.


Part V

Evals &
Observability

If an interviewer asks only one thing about this topic, it will be something deceptively simple that exposes whether you've shipped without evals. These are those questions.

Golden sets

LLM-as-judge

Regression

Observability

A/B testing

Prioritisation

Questions 33–40

Q33

How would you build an LLM eval framework from scratch?

Start with the smallest thing that works and grow. Day one: a CSV, a script, and a golden set of 30 examples. Day ninety: sliced metrics, automated regression, CI integration. Day three-sixty: online evals, drift alerts, per-tenant quality tracking. The mistake is trying to jump from zero to the framework you'd see at OpenAI — you'll build infrastructure nobody uses.

EVAL FRAMEWORK · GROWTH STAGES

CSV + script (30 examples, run by hand) → D30 · Judge model (scored rubric, sliced) → D90 · CI gate (block regressions on PR) → D180 · Online evals (live traffic, drift alerts)

The framework needs four things and nothing else at the start: a dataset (examples + expected criteria), a scorer (rule-based, LLM-judge, or human), a slicer (break results by segment — intent, model, user tier), and a runner (script that produces a comparable report between two runs). Everything else is infrastructure on top.

Senior candidates separate offline evals (run on a fixed dataset in CI) from online evals (run on live traffic in prod). Offline catches regressions pre-ship. Online catches problems that only appear with real inputs. Both are necessary; most teams have only offline and wonder why prod feels different from CI.

→ Interview Tip

When asked "how do you evaluate your LLM system?", the wrong answer is listing metric names. The right answer is "I have a golden set of N examples sliced by intent, an LLM judge with a rubric, and CI blocks regressions over 3%." Concrete numbers signal real experience.

Q34

How do you use LLM-as-judge without fooling yourself?

LLM-as-judge is powerful and it's also how most teams fool themselves. The failure mode is the judge agreeing with the model it's judging because they share a lineage. Your eval metric silently becomes "does output A look like output B" instead of "is output A good."

LLM-AS-JUDGE · AVOIDING THE TRAPS

01 · Pairwise, not scalar

"Is A better than B?" is more reliable than "rate A 1-5". Less drift.

02 · Different lineage

Judge with a different family than the one you're judging.

03 · Position-swap

Run each pair twice, A/B and B/A. Average to remove order bias.

04 · Calibrate against humans

Once a month, re-check agreement on a held-out set with human labels.
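The pairwise and position-swap ideas combine naturally. In this sketch `judge` is a hypothetical callable that wraps whatever judge model you use and returns "first", "second", or "tie"; the two orderings are combined by vote counting rather than score averaging.

```python
def pairwise_verdict(judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with positions swapped and combine the votes.

    `judge(prompt, first, second)` must return "first", "second" or "tie".
    Returns "a", "b" or "tie".
    """
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    a_wins = (forward == "first") + (backward == "second")
    b_wins = (forward == "second") + (backward == "first")
    if a_wins > b_wins:
        return "a"
    if b_wins > a_wins:
        return "b"
    # A judge that only ever prefers one position cancels itself out here.
    return "tie"
```

A high rate of ties caused by flipped verdicts is itself worth logging, since it measures how position-biased the judge is.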

Most durable framing: LLM judge is a signal, human labels are ground truth. Use the judge for throughput (run it on thousands of examples per change) and humans for calibration (spot-check 50 cases a week to ensure the judge still agrees with reality). If agreement drops below 80%, the judge prompt is stale and needs updating.

The rubric matters more than the model. A specific, criterion-based rubric ("scored 5 if answer addresses all 3 sub-questions, cites at least 1 source, and contains no unsupported claims") outperforms generic "rate helpfulness 1-5" by a wide margin. Invest in the rubric.

→ Key Insight

Run the judge at temperature 0 and include the rubric inline in every call. You want reproducibility over creativity. A reproducible judge that's slightly wrong is more useful than a creative judge that's slightly right but non-deterministic.

Q35

How do you prevent your eval set from leaking into training or prompt optimization?

Leakage is subtle and it's the reason most teams over-trust their own eval numbers. Three leakage paths to defend: (1) training data leak (eval examples end up in fine-tune set), (2) prompt-tuning leak (you keep tweaking the prompt until eval scores go up — you've now overfit to eval), (3) provider leak (your eval examples were in the base model's pretraining data).

THE THREE LEAKAGE PATHS

1 · Training leak

Hash every eval example. Exclude hashes from training set by contract.

↓

2 · Prompt overfit

Dev set for iteration, held-out test set touched only before release.

↓

3 · Pretrain leak

Write fresh examples for important topics. Don't only use benchmarks.

↓

4 · Real-world audit

Shadow a % of production traffic to a human-labelled sample monthly.

The pattern that fixes prompt-tuning leak: split evals into dev (used while iterating) and test (run once before shipping, never iterated on). Every time you look at the test set and change behaviour, you've polluted it. In practice teams rarely have the discipline for this, so add a locked "gold" slice whose individual examples even the engineers can't see; they get only the aggregate score.
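For the training-leak path, the hash-and-exclude step might look like the sketch below. The normalisation (lowercase, collapsed whitespace) and the function names are assumptions, and exact-hash matching will not catch paraphrased copies of an eval example.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a lightly normalised example so trivial edits still match."""
    normalised = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_eval_overlap(train_rows: list, eval_rows: list) -> list:
    """Remove any training row whose fingerprint matches an eval example."""
    blocked = {fingerprint(row) for row in eval_rows}
    return [row for row in train_rows if fingerprint(row) not in blocked]
```

Running this as a required step in the fine-tune data pipeline turns "exclude by contract" into something a build can enforce.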

For pretrain leak, the honest truth is you can't perfectly control what was in a hosted model's training data. The mitigation is to add novel, domain-specific examples you wrote yourself — these definitely weren't in the pretrain data. Don't rely on public benchmarks (MMLU, GSM8K) as your primary eval; they're memorised to some degree by every frontier model.

→ Real-World Use

Build your eval set from real user failures, not synthetic questions. Real failures are very unlikely to be in pretraining data, and they're directly aligned with what you care about. Mining incidents for eval examples is the cheapest quality win in the industry.

Q36

How do you do regression testing for prompts?

Treat prompts like code. A prompt change is a diff, a regression test is an eval run, and CI blocks a merge that regresses a protected slice. The pipeline is Promptfoo-style: a YAML config defines providers, tests, and assertions, and the eval run fails the build if any assertion fails or a protected slice score drops.

PROMPT CI PIPELINE

✏️

Edit prompt

PR opened

→

🧪

Run evals

dev + regression slice

→

📊

Diff report

before / after slice scores

→

✅

Gate merge

block on regression

Three things the test harness must do: (1) Deterministic replay — pin model version and temperature so results are comparable. (2) Slice-level gates — an aggregate 2% lift is fine, but a 10% drop on the "billing" slice is a block regardless. (3) Visible diffs — the PR reviewer sees "score on refund questions dropped from 0.82 to 0.64" with specific examples. Narrative beats numbers.

The discipline that matters: regressions block by default, you have to explicitly override with a reason. Teams that let regressions merge "to unblock" ship a worse product every sprint. Teams that block by default either fix the regression or have a conversation about why this one is acceptable. Both are better than shipping blind.
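A slice-level gate of the kind described above might look like the following sketch. The slice names, scores, and the 0.05 tolerance are illustrative assumptions; the point is that the aggregate can rise while a protected slice still blocks the merge.

```python
def gate_merge(before, after, protected, max_drop=0.05):
    """Return (ok, failures). A protected slice may not drop by more than max_drop."""
    failures = []
    for slice_name in protected:
        drop = before[slice_name] - after[slice_name]
        if drop > max_drop:
            failures.append((slice_name, before[slice_name], after[slice_name]))
    return (not failures), failures

# Hypothetical scores from the eval run on main (before) and on the PR (after).
before = {"overall": 0.80, "billing": 0.82, "onboarding": 0.78}
after = {"overall": 0.82, "billing": 0.64, "onboarding": 0.85}
ok, failures = gate_merge(before, after, protected=["billing", "onboarding"])
```

Here the overall score improved, yet `ok` is false because the billing slice fell from 0.82 to 0.64. An explicit override with a recorded reason would sit outside this function.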

→ Interview Tip

The concrete tool names senior candidates drop: Promptfoo, Braintrust, Langfuse, Inspect AI. You don't need to have used all of them — naming one and explaining why it fits your workflow is enough to show you know this space is real.

Q37

What does AI observability actually look like in production?

Observability for an AI system has the same three pillars as any distributed system — logs, metrics, traces — but each pillar has a specific flavour. Traces especially: a single user request can produce a five-step agent trace with tool calls, retrieval hops, and reranker passes, and you need all of it in one view.

THE FOUR LAYERS OF AI OBSERVABILITY

01 · System metrics

Latency, error rate, QPS, GPU util — the boring ones you already know.

02 · Traces

Full prompt, tool calls, retrieved chunks, output — replayable.

03 · Quality signals

Online judge scores, user feedback, refusal rate, follow-up rate.

04 · Cost telemetry

Tokens per request per tenant; alert on per-tenant overspend.

The breakthrough is trace → eval → fix as a loop: a trace is a structured record of a single interaction; you can attach a score (from a judge or user) to it; bad scores bubble up into a review queue; the review produces an eval example and a fix. Tools like Braintrust, Langfuse, Phoenix, and LangSmith are built for this shape. Homegrown works too but you'll rebuild half of those tools.

Two details senior candidates always mention: (1) Correlate tokens with user IDs, not just request IDs — that's how you find which users are driving cost or breaking things. (2) Sample-and-log your prompts in full for X% of traffic and store them for 30+ days. When someone complains tomorrow about yesterday's answer, you need to be able to show them exactly what the model saw.
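Both details can be sketched with the standard library. The record fields and function names below are assumptions for illustration; hashing the request ID gives a sampling decision that is stable across retries and services, which random sampling does not.

```python
import hashlib
from collections import defaultdict

def should_log_full_prompt(request_id: str, sample_rate: float) -> bool:
    """Deterministically sample a fraction of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

def tokens_by_user(records):
    """Sum prompt + completion tokens per user ID."""
    totals = defaultdict(int)
    for r in records:
        totals[r["user_id"]] += r["prompt_tokens"] + r["completion_tokens"]
    return dict(totals)

# Hypothetical usage records, one per LLM call.
records = [
    {"request_id": "r1", "user_id": "u1", "prompt_tokens": 1200, "completion_tokens": 300},
    {"request_id": "r2", "user_id": "u2", "prompt_tokens": 400, "completion_tokens": 100},
    {"request_id": "r3", "user_id": "u1", "prompt_tokens": 9000, "completion_tokens": 500},
]
totals = tokens_by_user(records)
```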

→ Real-World Use

Use OpenTelemetry semantic conventions for LLM spans (gen_ai.* attributes). They're standardising in 2025-2026 and they future-proof your traces so you can swap observability backends later.

Q38

How do you A/B test a model or prompt change in production?

The framework is standard experimentation — randomise users, hold the exposure stable, measure a primary metric, wait for significance — but the metric choice is the hard part. Unlike a classical web A/B test, there isn't a single conversion rate; quality is multi-dimensional.

Metric family	Examples	Watch out for
Quality	Judge score, refusal rate	Judge drift over time
Engagement	Follow-up rate, session length	Longer ≠ better
Outcome	Task completed, escalation rate	Slow to accumulate
Cost	Tokens/req, latency	Easy to forget, easy to blow
Safety	Policy violations, PII leaks	Must be a guardrail, not a trade

Use a primary metric + guardrails: pick one metric you're trying to move (say, judge score), and guardrails you won't trade (cost, latency, safety). A winning experiment must lift the primary without breaking a guardrail. Without this structure, teams ship changes that improve one axis and silently regress another.

The trap in stochastic systems: temperature > 0 adds noise, and noise increases the sample size you need. Run experiments at temperature 0 when possible, or size your sample 2-3x what a classical test would demand. And never compare model A at T=0 to model B at T=0.7: you'll mistake randomness for a quality difference.

→ Key Insight

Randomise per user, not per request. Same user flipping between arms mid-conversation poisons the metric and confuses the UX. Stickiness at the user level is non-negotiable for LLM experiments.
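One common way to get user-level stickiness is to hash the user ID together with the experiment name, so assignment needs no stored state. A minimal sketch, with the function name and the 50/50 split as assumptions:

```python
import hashlib

def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Map a user to 'control' or 'treatment' deterministically for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same arm for a given experiment.
arm_first_request = assign_arm("user-42", "prompt-v2")
arm_later_request = assign_arm("user-42", "prompt-v2")
```

Including the experiment name in the key keeps assignments independent across experiments, so one user is not always in treatment everywhere.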

Q39

How do you measure quality when there's no single ground truth?

Most real LLM tasks have no ground truth — summarisation, creative writing, open-ended Q&A all have many good answers. The senior move is to measure what you can and use proxies with honesty about what they're measuring.

PROXY METRICS · FROM HARD TO SOFT

Rule-based

tight

Reference ans

tight

Rubric judge

noisy

Pairwise judge

noisy

User feedback

biased

Downstream outcome

slow

The most underused technique is measurable sub-criteria: break "was this a good answer?" into three or four objective checks — did it include a citation? did it cover all sub-questions? did it refuse unsafe requests? did it return in the right format? — and score each one independently. Four 0/1 checks per example beats one hand-wavy 1-5 rating every time.
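As a rough sketch of measurable sub-criteria, the checks below score one answer on three independent 0/1 tests. The specific heuristics (a bracketed citation marker, keyword coverage, a word limit) are assumptions chosen for illustration; real checks would be tailored to the task.

```python
import re

def score_answer(answer, required_topics, max_words=120):
    """Score one answer on independent 0/1 checks instead of a single 1-5 rating."""
    lowered = answer.lower()
    return {
        "has_citation": int(bool(re.search(r"\[\d+\]", answer))),
        "covers_topics": int(all(t.lower() in lowered for t in required_topics)),
        "within_length": int(len(answer.split()) <= max_words),
    }

answer = "Refunds take 5 days [1]. Duplicate charges are reversed automatically [2]."
scores = score_answer(answer, required_topics=["refund", "duplicate charge"])
```

Because each check is reported separately, a regression shows up as "citations dropped" rather than as a vague dip in an average rating.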

For truly subjective tasks, lean on pairwise comparisons against a baseline. "Is this output better than what the old prompt produced?" is a question a judge can answer reliably even when "is this output good?" is hopeless. You lose absolute quality tracking and gain comparability — usually a good trade.

The hardest and best signal is downstream outcome: did the user's task actually get done? Did the ticket get resolved without a follow-up? Did the code the agent wrote pass the tests? When you can tie quality to a real-world outcome, you stop arguing about judge prompts.

→ Mental Model

No single metric tells you the truth. Stack three or four weak proxies that correlate with quality in different ways, and only trust a change that lifts most of them. Any one metric is gameable; a portfolio is not.

Q40

You have 200 signals of quality problems and time for 10 fixes. How do you prioritise?

This is a judgement question disguised as a process question. The answer frames it as impact × reach × tractability, applied to clusters of signals, not individual signals.

PRIORITISATION MATRIX

1 · Cluster the signals

Group by failure mode, not by complaint. 200 signals often reduce to 15 clusters.

↓

2 · Score each cluster

Reach (% users) × severity (S1–S3) × tractability (1 / hours to fix)

↓

3 · Pick from top of each quadrant

Mix of quick wins and big bets. Don't only ship 10 quick wins.

↓

4 · Add each cluster to eval set

Even the ones you didn't fix. Guaranteed regression protection.

The clustering step is where staff-level engineers differ. Junior engineers sort a spreadsheet by frequency. Staff engineers find the shared root cause: half the 200 signals might all be one failure mode ("retrieval missing recent docs"), and one fix lifts all hundred.

The pattern that fails: shipping 10 fixes that each touch a different part of the prompt. Each fix might be net-positive, but together they conflict, and your eval scores oscillate. Batch related fixes, ship them together, and eval as one change. Fewer, bigger, more verified.
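The scoring step can be made concrete with a small sketch. It treats tractability as the inverse of hours to fix and maps S1 to the highest severity weight; both are assumptions, as are the example clusters and numbers.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    reach: float         # fraction of users affected, 0-1
    severity: int        # 3 = S1 (worst), 1 = S3 (mildest)
    hours_to_fix: float

def priority(c: Cluster) -> float:
    """Higher is better: reach times severity, divided by effort."""
    return c.reach * c.severity / c.hours_to_fix

clusters = [
    Cluster("retrieval misses recent docs", reach=0.25, severity=3, hours_to_fix=16),
    Cluster("wrong currency formatting", reach=0.02, severity=1, hours_to_fix=2),
    Cluster("refuses valid billing questions", reach=0.10, severity=2, hours_to_fix=4),
]
ranked = sorted(clusters, key=priority, reverse=True)
```

A pure ratio like this favours quick wins, which is why the matrix above still asks for a deliberate mix with bigger bets rather than taking the top ten by score.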

→ Interview Tip

The best answer here doesn't just prioritise — it names what you'd not do and why. "I wouldn't touch the system prompt this sprint even though 5 signals point there, because we're mid-migration and the prompt is about to change anyway." That's judgement.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part VI

Cost &
Scaling

The questions your finance partner cares about and the questions that separate engineers who shipped a demo from engineers who shipped a P&L-accountable service.

Cost cuts

Latency budgets

Caching

Capacity

Self-host vs hosted

Prompt optimisation

Questions 41–46

Q41

Your LLM bill just jumped 50%. How do you cut inference cost in half without hurting quality?

The right first move is instrument, then act. Most teams try to optimise before they know where the tokens go. Always profile first: what percent of spend is on which model, which product surface, which tenant, prompt vs completion tokens? The answer usually surprises people — one feature or one tenant will be 60% of cost.

COST LEVERS · ORDERED BY ROI

Prompt cache

−30%

Model routing

−40%

Prompt shrink

−20%

Semantic cache

−15%

Batch APIs

−50% on batch

Self-host

varies

Three levers that stack cleanly: (1) Prompt caching: if your system prompt is stable, Anthropic and OpenAI both offer prefix caching that discounts input tokens on the cached portion, by up to ~90% depending on provider and model. Moving a 3k-token system prompt from uncached to cached is a single-afternoon change and can be a 30% cost win. (2) Model routing: downshift easy queries to a smaller model, keep the frontier model for what needs it. (3) Prompt shrinking: audit your prompts for copy-paste accretion and cut 20% of tokens; almost always invisible in quality.

What you do not do first is rewrite to self-hosted. Self-host pays off at high volume and stable traffic, but the engineering-months are non-trivial and you lose the provider's reliability and model-upgrade treadmill. Reach for it after the easy wins.

→ Key Insight

The biggest free lunch in LLM cost is the prompt cache, and it's the one most teams forget to verify. Log your cache hit rate as a first-class metric. If it's below 70% for your chat endpoint, your system prompt isn't stable enough — fix that before anything else.
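To see why cache hit rate deserves to be a first-class metric, here is a back-of-the-envelope model of per-request cost as a function of hit rate. Prices are in arbitrary units, the 4x output-to-input ratio and 90% cached discount are assumptions, and any cache-write premium a provider charges is ignored.

```python
def request_cost(system_tokens, user_tokens, output_tokens,
                 in_price, out_price, cache_hit_rate=0.0, cached_discount=0.9):
    """Expected cost of one request when the system prompt is cached at a given hit rate."""
    cached_system = system_tokens * in_price * (1 - cached_discount)
    uncached_system = system_tokens * in_price
    system_cost = cache_hit_rate * cached_system + (1 - cache_hit_rate) * uncached_system
    return system_cost + user_tokens * in_price + output_tokens * out_price

# 3k-token system prompt, 500-token user turn, 300-token answer (hypothetical mix).
baseline = request_cost(3000, 500, 300, in_price=1.0, out_price=4.0)
with_cache = request_cost(3000, 500, 300, in_price=1.0, out_price=4.0, cache_hit_rate=0.7)
saving = 1 - with_cache / baseline
```

Under these assumptions a 70% hit rate cuts the per-request cost by roughly 40%; the saving shrinks quickly as the hit rate falls, which is the argument for logging it.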

Q42

How do you set and defend a latency budget for an LLM feature?

A latency budget is a contract: "p95 end-to-end latency for this feature is 3 seconds". You split that budget across stages and give each stage a sub-budget. When any stage blows its sub-budget, you know exactly what to fix.

LATENCY BUDGET · 3S INTERACTIVE CHAT

🔐

50ms

Auth + gate

fast path

→

🔎

200ms

Retrieve

hybrid search

→

⚖️

100ms

Rerank

cross-encoder

→

🧠

2400ms

LLM

TTFB 400 + stream

→

✔️

250ms

Verify

safety + format
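The budget above can be kept as data and checked against observed stage latencies. A minimal sketch, with the stage keys and the observed numbers as assumptions; note that per-stage p95s do not add up exactly to the end-to-end p95, so this is a diagnostic aid rather than a proof that the 3-second contract holds.

```python
BUDGET_MS = {"auth": 50, "retrieve": 200, "rerank": 100, "llm": 2400, "verify": 250}

def over_budget(observed_p95_ms, budget_ms=BUDGET_MS):
    """Return the stages whose observed p95 exceeds their sub-budget, with the overshoot in ms."""
    return {
        stage: observed_p95_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if observed_p95_ms.get(stage, 0) > limit
    }

# Hypothetical p95 latencies pulled from tracing.
observed = {"auth": 40, "retrieve": 310, "rerank": 90, "llm": 2300, "verify": 240}
violations = over_budget(observed)
```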

The user-perceived metric to defend is time-to-first-token for streaming responses (tracked here as TTFB, the time until the first streamed bytes arrive), not total latency. Users forgive a 6-second answer that starts flowing in 400ms; they hate a 3-second answer that blanks the screen for 3 seconds and dumps. Stream by default. Show progress. Pre-send an acknowledgement if there's a retrieval step.

Three levers for latency: (1) Parallelise anything independent — run BM25 and vector search in parallel, not sequentially. (2) Cache the stable parts — prompt prefix, embedding of the last query, reranker scores. (3) Cut tokens — generation cost scales with output length, so a shorter output is a faster output.
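The first lever, running independent retrievers in parallel, can be sketched with `asyncio.gather`. The two search functions are stand-ins that sleep instead of calling a real index, and the merge-by-first-seen rule is an assumption; a real system would fuse scores before reranking.

```python
import asyncio

async def bm25_search(query):
    await asyncio.sleep(0.05)  # stand-in for a keyword-index call
    return [("doc-a", "bm25"), ("doc-b", "bm25")]

async def vector_search(query):
    await asyncio.sleep(0.05)  # stand-in for a vector-store call
    return [("doc-b", "vector"), ("doc-c", "vector")]

async def hybrid_retrieve(query):
    """Run both retrievers concurrently and merge results, de-duplicating by doc ID."""
    bm25_hits, vector_hits = await asyncio.gather(bm25_search(query), vector_search(query))
    seen, merged = set(), []
    for doc_id, _source in bm25_hits + vector_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

docs = asyncio.run(hybrid_retrieve("refund policy"))
```

With both calls in flight at once, the retrieval stage costs roughly the slower of the two rather than their sum.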

→ Real-World Use

Alert on TTFB p95 per surface, not end-to-end latency. TTFB is the number users feel. End-to-end is the number your CFO sees. You need both but optimise for TTFB in interactive flows.

Q43

When do you cache LLM outputs versus recompute every time?

There are four caching surfaces and each answers a different question. Confusing them is the source of most LLM-cache bugs.

Cache	Keyed on	What it saves	Watch out
Prompt prefix	Prompt hash	Input token cost on cached portion	Any prefix change invalidates
Exact response	Full prompt hash	Full call cost	Non-determinism across temps
Semantic	Embedding similarity	Full call cost on paraphrases	False hits are silent errors
KV cache	Session continuity	Inference compute on same session	Server-side, framework-specific

The safe default: always use prompt-prefix caching (cheap, handled at the API level, correctness preserved; some providers apply it automatically, others need the cacheable prefix marked explicitly). Exact-response cache is fine for deterministic calls (temperature 0, no tools, no randomness in retrieval). Semantic caching is dangerous: the whole point is that similar-but-not-identical prompts share an answer, which is fine for FAQ but catastrophic for "what's my account balance?" where two similar questions have different right answers.

The rule of thumb: cache answers to questions whose answer doesn't depend on mutable state. "What are your business hours?" is cacheable forever. "What's the status of my order?" is never cacheable. "What's the weather in NYC today?" is cacheable for 15 minutes. Classify your queries, tag them, and cache-by-tag.
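Cache-by-tag can be sketched as a small TTL policy table in front of the cache. The tag names, the policy values, and the `TaggedCache` class are hypothetical; the query classifier that assigns tags is out of scope here.

```python
import time

# None = no expiry, 0 = never cache. Tags are assigned by an upstream classifier (not shown).
TTL_SECONDS = {"static": None, "time_sensitive": 15 * 60, "user_state": 0}

class TaggedCache:
    def __init__(self, clock=time.time):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, key, value, tag):
        ttl = TTL_SECONDS[tag]
        if ttl == 0:
            return  # answers that depend on mutable user state are never cached
        expires = None if ttl is None else self._clock() + ttl
        self._store[key] = (value, expires)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and self._clock() >= expires:
            del self._store[key]
            return None
        return value
```

Usage follows the three examples above: business hours go in under `static`, the weather under `time_sensitive`, and order status under `user_state`, where `put` is a no-op.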

→ Mental Model

Every cache is a tradeoff between cost and freshness. For a semantic cache, the tradeoff is between cost and correctness — and correctness failures are usually more expensive than the call you saved. Default to caching off for anything user-specific.

Q44

How do you do capacity planning for AI workloads?

Capacity planning for LLMs is different because the unit isn't requests; it's tokens per second per tier. A request for 32k tokens in and 1k tokens out consumes 30x the tokens of a 1k-in 100-out request on the same model (33k versus 1.1k). Plan capacity in tokens, and you'll stop being surprised by traffic that "looked flat" but spent 5x more.

CAPACITY PLANNING · WHAT TO PROJECT

1 · Forecast DAU growth

Baseline from existing signup curve; layer seasonality

↓

2 · Tokens per active user

Median + p95 per surface — usually skewed, plan for p95

↓

3 · Translate to TPM per model

Split by model tier; add 2x safety buffer for peaks

↓

4 · Negotiate provider quota

Enterprise TPM contracts, reserved capacity, burst credits

Two uncomfortable facts worth naming: (1) Provider rate limits are the real ceiling, not your wallet. You can have unlimited budget and still hit a 500k TPM wall that takes weeks to raise. Plan at least 2 quarters ahead on quota negotiations. (2) Input tokens grow faster than output tokens as your product matures, because you add context, tools, memory, and retrieval. Forecast the growth direction, not just the magnitude.

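For self-hosted inference: the unit isn't TPM, it's tokens-per-second per GPU. As an order-of-magnitude figure, an H100-class deployment running vLLM with Llama-70B at batch size 32 does roughly 2000 output tokens/sec. A 70B model in 16-bit weights (about 140 GB) does not fit in a single 80 GB H100, so that figure implies quantisation or more than one GPU, and the real number depends on context length and hardware. Benchmark your own workload and use the result as your planning unit. Utilisation below 60% is wasted spend; above 85% means tail latency is collapsing. Tune the batch size to your mix.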
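The projection steps for hosted capacity reduce to simple arithmetic. In this sketch the 15% peak-hour share is an assumption (the share of daily tokens that lands in the busiest hour), as are the example inputs; the 2x buffer follows the planning steps above.

```python
def required_tpm(dau, tokens_per_user_per_day_p95, peak_hour_share=0.15, safety_buffer=2.0):
    """Project peak tokens-per-minute from daily active users and per-user token usage."""
    daily_tokens = dau * tokens_per_user_per_day_p95
    peak_hour_tokens = daily_tokens * peak_hour_share
    return peak_hour_tokens / 60 * safety_buffer

# Hypothetical forecast: 50k DAU, 40k tokens per active user per day at p95.
tpm = required_tpm(dau=50_000, tokens_per_user_per_day_p95=40_000)
```

With these inputs the quota to negotiate is about 10 million TPM. Running the same function per model tier gives the split to take into the provider conversation.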

→ Key Insight

Negotiate rate limits before you need them, not after. Every provider's enterprise contract has a 4–8 week lead time for real capacity increases. If your growth is exponential, you're always one month from a ceiling.

Q45

Hosted APIs versus self-hosted models — how do you make the call?

The analysis is financial + strategic + operational. Financial alone almost always says "self-host at scale" — but the strategic cost of losing the model upgrade cycle and the operational cost of running GPUs in production is usually larger than the cash savings.

THE DECISION MATRIX

Hosted wins

Rapid iteration

Small team

Want latest frontier

Variable traffic

No infra team

Time-to-market wins

Self-host wins

Steady huge volume

Data residency / air gap

Custom fine-tune

Specific latency SLO

Infra team exists

Unit economics wins

Key framing: the breakeven isn't when self-hosted is cheaper per token — it's when the cash saved exceeds the engineering + ops + opportunity cost of not shipping features. For most startups under 50 engineers, that moment never comes. For BigCo with a dedicated ML platform team, it comes sooner.
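The breakeven framing can be written down as a one-line calculation. All figures below are invented for illustration, and the function name and parameters are assumptions; the useful part is that engineering and opportunity cost sit in the same equation as the token bill.

```python
def self_host_net_saving(hosted_cost_per_year, gpu_cost_per_year,
                         engineers, cost_per_engineer_per_year,
                         opportunity_cost_per_year=0.0):
    """Annual net saving from self-hosting; positive means self-hosting pays for itself."""
    cash_saved = hosted_cost_per_year - gpu_cost_per_year
    people_cost = engineers * cost_per_engineer_per_year
    return cash_saved - people_cost - opportunity_cost_per_year

# Hypothetical startup: a modest API bill, two engineers needed to run GPUs.
small = self_host_net_saving(600_000, 250_000, engineers=2, cost_per_engineer_per_year=250_000)
# Hypothetical large company: a 10x bill, a four-person platform team.
large = self_host_net_saving(6_000_000, 2_000_000, engineers=4, cost_per_engineer_per_year=250_000)
```

In the first case self-hosting loses money even though it is cheaper per token; in the second it clearly wins.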

The hybrid pattern: self-host the classifier/embedder/reranker, hosted for the generator. The embedding model is small, high-QPS, and not on the model-upgrade treadmill — cheap to self-host and you get to use the latest frontier LLM for the generation pass. This is what most mature teams land on.

→ Interview Tip

If you haven't actually run GPUs in production, don't pretend you have. Instead say: "I'd start hosted, instrument cost per surface, and reach for self-host when the cash savings beat ~$500k/yr of engineering time." That's a credible, grounded framing.

Q46

How do you optimise prompts for cost without hurting quality?

Prompt-level cost optimisation has three moves, in order of safety: (1) delete copy-paste accretion (safest), (2) move static content to cached prefix (safe), (3) compress with a smaller model (risky, evaluate).

PROMPT COMPRESSION · WHAT TO CUT FIRST

1 · Redundant instructions

"Be helpful. Be accurate. Be concise." — pick one, the others are noise.

↓

2 · Stale few-shot examples

Drop examples that no longer match current data; keep diverse ones.

↓

3 · Over-long rubrics

Frontier models follow short rubrics; shrink 2x, eval, shrink again.

↓

4 · Retrieved context

Shrink K from 10 → 5 → 3 via reranker; measure at each step.

Output-side compression is just as important and often bigger: output tokens are typically priced at 3-5x input tokens, so halving output length cuts the most expensive part of the bill in half. Explicit length instructions work: "Answer in 1 paragraph, not more than 80 words" is cheap and the model usually respects it. Structured output also cuts tokens compared to prose.

Always validate every cut with your eval set. The anti-pattern is "I shrank the prompt by 40% and shipped it" — you don't know if you regressed quality until you measure. Every prompt change is a PR, every PR runs evals, every eval gates the merge.

→ Real-World Use

The cheapest output-length win: add "Be direct. Don't hedge. Don't restate the question." to your system prompt. Ten tokens of instruction regularly saves 50-200 tokens per response. Boring but real.

Part VII

Safety &
Trust

Every enterprise buying AI asks these questions. Senior engineers are the ones who can answer without hand-waving and show a concrete defence-in-depth plan.

Prompt injection

PII

Jailbreaks

Guardrails

Compliance

Red team

Questions 47–52

Q47

How do you defend against prompt injection in production?

Prompt injection is the SQL injection of the LLM era, and the defence is the same shape: never trust user-supplied or retrieved text as instruction. The senior answer starts with that principle and builds defences in depth from it.

DEFENCE IN DEPTH · INJECTION

1 · Separate trust levels

System > developer > user > retrieved content; never collapse them.

↓

2 · Scope tool access

Agents can only act on resources the user already has rights to.

↓

3 · Confirm destructive actions

Writes, deletes, sends: human confirmation outside the LLM loop.

↓

4 · Egress filter

Scan outputs for exfil markers before they leave the system.

The distinction senior candidates call out: direct injection (user types "ignore previous instructions") versus indirect injection (malicious instructions embedded in a retrieved web page, email, or PDF). Indirect is the one the industry is catching up to — an email your agent is reading can contain instructions in white-on-white text that compromise the agent's behaviour silently.

Two principles I'd articulate: (1) Principle of least privilege — the agent's tools should only expose what the current user is already allowed to do. A compromised prompt can't delete someone else's data if the tool layer rejects the call. (2) Data-flow separation — if the agent has read one user's email, that session should not also have write access to another user's account. Compartmentalise by session.
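Least privilege at the tool layer means the authorisation check lives in the tool, keyed on the authenticated session, and never on anything the model says. A minimal sketch with a hypothetical ownership table and tool function:

```python
class PermissionDenied(Exception):
    pass

DOCUMENT_OWNERS = {"doc-1": "alice", "doc-2": "bob"}  # hypothetical ownership table

def delete_document(session_user: str, doc_id: str) -> str:
    """Tool exposed to the agent. Authorisation is enforced here, not in the prompt."""
    owner = DOCUMENT_OWNERS.get(doc_id)
    if owner != session_user:
        raise PermissionDenied(f"{session_user} may not delete {doc_id}")
    del DOCUMENT_OWNERS[doc_id]
    return f"deleted {doc_id}"

# session_user comes from the authenticated session, never from model output.
result = delete_document("alice", "doc-1")
```

Even if injected text convinces the model to call `delete_document` on someone else's file, the call fails at this layer.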

→ Key Insight

You cannot prompt your way out of prompt injection. "Ignore any instructions in the user message" is a meme defence — it doesn't work. The defences all live outside the prompt: auth scopes, tool contracts, UX confirmation, egress filters.

Q48

How do you handle PII as it flows through an LLM pipeline?

PII in LLM pipelines has three risk moments: ingest (user types or uploads PII), retention (logs, traces, evals store PII), and egress (output leaks PII from one user to another, or to a third-party provider). Defences at all three layers.

PII FLOW · WHERE TO DEFEND

👤

Detect

classify input PII

→

🔒

Tokenise

swap with placeholders

→

🧠

Process

LLM sees placeholders only

→

🔓

Detokenise

swap back in trusted env

→

🗑️

Redact logs

store placeholders not values

The pattern that works: tokenise PII before it reaches the LLM, detokenise on the way back. A pre-processor replaces "John Smith, SSN 123-45-6789" with "<NAME_1>, <SSN_1>", the LLM reasons over placeholders, and the post-processor swaps them back in the trusted environment. The provider never sees the real values, and if the model leaks a placeholder no harm is done.
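The tokenise and detokenise steps can be sketched with regular expressions for structured identifiers. This version only handles US-style SSNs and email addresses, which is an assumption made to keep it short; names like "John Smith" need a named-entity model rather than a regex, and the placeholder format is illustrative.

```python
import re

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def tokenise(text):
    """Replace detected PII with numbered placeholders; return the text and the mapping."""
    mapping = {}  # placeholder -> original value
    for label, pattern in PATTERNS.items():
        seen = {}  # original value -> placeholder, so repeats reuse one placeholder

        def swap(match, label=label, seen=seen):
            value = match.group(0)
            if value not in seen:
                seen[value] = f"<{label}_{len(seen) + 1}>"
                mapping[seen[value]] = value
            return seen[value]

        text = pattern.sub(swap, text)
    return text, mapping

def detokenise(text, mapping):
    """Swap placeholders back to real values; run this only in the trusted environment."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

masked, mapping = tokenise("SSN 123-45-6789, contact jane@example.com")
```

The mapping stays on your side of the trust boundary; only `masked` is sent to the provider and written to logs.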

For logs and traces, store placeholders — never raw PII. This is a breaking change when you retrofit it, and the question every enterprise customer asks is "what does your retention look like?" Have a clear answer: "PII is redacted before logging; traces retain placeholders only; raw user input retention is under 24 hours and encrypted at rest."

Don't forget the cross-tenant egress risk. Embeddings computed from one user's data, stored in a shared index, should never be retrievable by another user. Namespace your vector store by tenant and enforce it at the retrieval layer — not just at the application layer.

→ Real-World Use

For hosted APIs, use the provider's "no-train" and zero-retention options. OpenAI, Anthropic, and Google (Gemini) all offer enterprise terms under which your data isn't used for training, and zero-retention arrangements are available for eligible use cases, though they often have to be requested and may not cover every endpoint. Enable them by default for any regulated workload.

Q49

How do you handle jailbreaks in a customer-facing agent?

"Jailbreak" means different things — for a customer-facing agent, it usually means someone tricking the model into saying something harmful or off-brand. The defence isn't trying to make the model "uncrackable" (you can't), it's minimising the blast radius of a successful trick.

BLAST RADIUS REDUCTION

01 · Output classifier

Second pass checks output for policy violation. Cheap, effective.

02 · Scope limits

System prompt locks agent to its domain. Refuses off-topic.

03 · Tool-level auth

Jailbreaks can't escalate what the user's own session can do.

04 · Monitor & patch

Log refusals and bypass attempts. Update policies weekly.

The cheapest, highest-leverage defence is an output classifier — a second small-model pass that reads the proposed answer and asks "does this violate policy? is it off-topic? does it say something the brand would never say?" before it reaches the user. Llama Guard, NeMo Guardrails, or a fine-tuned small model all work. Latency cost: ~100ms. Effectiveness: catches the vast majority of policy escapes.

The philosophical point: scope your agent so narrowly that jailbreaks are uninteresting. A customer support bot should refuse any question outside its domain by default. "How do I reset my password?" — yes. "What are your thoughts on geopolitics?" — "I'm a support assistant, I can only help with your account." This isn't censorship, it's product scoping. Jailbreaks of a narrowly-scoped agent produce at-most a mildly embarrassing screenshot — never a data breach.

→ Mental Model

Think about jailbreaks like XSS — you don't make every input "safe", you assume some will get through and make sure the consequences are small. Scope, auth, output filters, monitoring — that's the XSS playbook applied to LLMs.

Q50

How do you build guardrails that don't ruin the user experience?

Over-aggressive guardrails are the most-complained-about feature in AI products. The senior move is to target the bad outcomes, not the bad topics. A medical app can discuss symptoms without becoming a diagnostic tool; a financial app can discuss budgeting without giving personalised advice. Narrow, outcome-based guardrails beat keyword blocklists every time.

GUARDRAIL DESIGN · GOOD VS BAD

Bad guardrails

Keyword blocklists

Refuse-on-suspicion

Generic safety prompt

No escalation path

High false-positive rate

Good guardrails

Outcome-specific classifier

Targeted refusals

Clear safe alternative

Route to human

Low friction, high signal

Rule I enforce: every refusal must offer a next step. "I can't help with that" is a terrible UX. "I can't recommend specific dosages — but I can show you the manufacturer guidance, or connect you to a pharmacist" respects the boundary and keeps the user moving. That's not a nice-to-have — it's the difference between a 20% refusal satisfaction and a 70% one.

Measure guardrail false-positive rate alongside false-negative rate. Most teams only track "did we miss a bad output?" A good guardrail also tracks "did we block a good output?" Both are defects. A guardrail at 2% FN and 20% FP is worse for the product than one at 5% FN and 3% FP.

→ Real-World Use

Use the strict/loose toggle pattern. Expose a single classifier with two thresholds: strict mode for first-time users and unauthenticated sessions, loose mode for known trusted tenants. One code path, two risk profiles. Easy to tune over time.
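The toggle is small enough to show in full. The threshold values and function names are assumptions; the risk score would come from whatever classifier you already run.

```python
THRESHOLDS = {"strict": 0.30, "loose": 0.70}  # hypothetical risk-score cut-offs

def risk_mode(authenticated: bool, trusted_tenant: bool) -> str:
    """Pick the risk profile: loose only for authenticated users of trusted tenants."""
    return "loose" if (authenticated and trusted_tenant) else "strict"

def should_block(risk_score: float, mode: str) -> bool:
    """Block when the classifier's risk score reaches the threshold for this mode."""
    return risk_score >= THRESHOLDS[mode]

# The same borderline output is blocked for an anonymous session and allowed for a trusted one.
blocked_anonymous = should_block(0.5, risk_mode(authenticated=False, trusted_tenant=False))
blocked_trusted = should_block(0.5, risk_mode(authenticated=True, trusted_tenant=True))
```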

Q51

How do you audit an AI system for compliance — SOC2, HIPAA, GDPR?

Auditors ask about data flow, access control, retention, and auditability. Your AI system has to answer those four questions at every layer — just like any other regulated system, but with a few LLM-specific wrinkles.

Concern	Classic answer	AI-specific addition
Data flow	Diagram + DPA with vendors	Prompt & response logging, embedding storage
Access	RBAC, audit logs	Per-tenant isolation in vector store, namespaced caches
Retention	Retention schedule	Log redaction, eval set opt-out, right-to-delete on embeddings
Auditability	Immutable logs	Trace every output to prompt hash + model version
Sub-processors	Vendor list	Model provider, embedding provider, observability tools

The specific wrinkle: right-to-delete applied to embeddings. When a GDPR deletion request comes in, you must be able to remove that user's data not just from the primary DB but from every embedding derived from it. Build deletion hooks into your embedding pipeline on day one — retrofitting is painful. The pattern: every chunk row carries a user_id column; delete cascades from user → chunks → vectors.
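The user → chunks → vectors cascade can be demonstrated with SQLite foreign keys. This sketch assumes the vectors live in the same relational store; with a separate vector database the same deletion hook has to call that store's delete API for the affected chunk IDs. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces cascades when this is on
conn.executescript("""
CREATE TABLE users   (user_id TEXT PRIMARY KEY);
CREATE TABLE chunks  (chunk_id INTEGER PRIMARY KEY,
                      user_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
                      text TEXT);
CREATE TABLE vectors (chunk_id INTEGER PRIMARY KEY REFERENCES chunks(chunk_id) ON DELETE CASCADE,
                      embedding BLOB);
""")
conn.execute("INSERT INTO users VALUES ('u1'), ('u2')")
conn.execute("INSERT INTO chunks VALUES (1, 'u1', 'a'), (2, 'u1', 'b'), (3, 'u2', 'c')")
conn.execute("INSERT INTO vectors VALUES (1, x'00'), (2, x'01'), (3, x'02')")

def delete_user(conn, user_id):
    """Delete a user; their chunks and vectors are removed by the cascade."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM users WHERE user_id = ?", (user_id,))

delete_user(conn, "u1")
remaining = [row[0] for row in conn.execute("SELECT chunk_id FROM vectors ORDER BY chunk_id")]
```

After the delete, only the vector belonging to the other user remains, which is the property an auditor will ask you to demonstrate.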

Two other things auditors love: (1) a data flow diagram showing every hop, every vendor, every store. (2) DPAs (Data Processing Agreements) with every model and tool provider, with zero-retention and no-train flags enabled where available. Get these before the audit, not during.

→ Key Insight

Model providers count as sub-processors. Your enterprise customers will ask for the sub-processor list; OpenAI, Anthropic, and the major clouds are all on standard pre-approved lists. Smaller specialist APIs often aren't — budget procurement time for each new vendor.

Q52

What does a red-team process look like before launching an AI feature?

A real red-team is adversarial, diverse, and documented. Not "we had the QA team try some weird inputs for an afternoon". The senior answer has structure: threat model, attacker personas, a test script, a severity rubric, and a gate for "this doesn't ship until the P0 issues are fixed."

RED-TEAM PROCESS

🎯

Threat model

who, what, why

→

🎭

Personas

curious / malicious / naive

→

📋

Playbook

200+ scripted attacks

→

⚠️

Triage

P0–P3 with fix SLA

→

🔁

Re-test

gate launch on fix

Three persona archetypes to red-team against: the curious user (stumbles on bad outputs by accident; this is most of your traffic), the malicious user (actively tries to break things and post screenshots), and the naive but high-stakes user (asks a dangerous question without realising it). The playbook should have 30–50 attacks per persona, mixed to cover your feature's domain.

The gate matters more than the red team itself. Without a pre-committed launch gate — "no P0 issues, fewer than N P1s, all issues have a fix plan" — the red team becomes performative. With a gate, it becomes a decision-maker. Build the gate and get the leadership sign-off on it before you run the red team, so the results have teeth.
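A pre-committed gate is easiest to hold to when it is code. The sketch below encodes the three conditions quoted above; the finding schema and the value of N (`max_p1`) are assumptions to be agreed with leadership before the red team runs.

```python
from collections import Counter

def launch_gate(findings, max_p1=3):
    """Return (ship, reasons). Each finding is a dict with 'severity' and 'has_fix_plan'."""
    counts = Counter(f["severity"] for f in findings)
    reasons = []
    if counts["P0"] > 0:
        reasons.append(f"{counts['P0']} open P0 issue(s)")
    if counts["P1"] >= max_p1:
        reasons.append(f"{counts['P1']} open P1 issues (limit is fewer than {max_p1})")
    if any(not f["has_fix_plan"] for f in findings):
        reasons.append("some issues have no fix plan")
    return (not reasons), reasons

# Hypothetical open findings after triage.
findings = [
    {"severity": "P0", "has_fix_plan": False},
    {"severity": "P2", "has_fix_plan": True},
]
ship, reasons = launch_gate(findings)
```

Returning the reasons alongside the decision keeps the gate auditable: the launch review sees exactly which condition failed.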

And finally: every red-team finding becomes a golden eval case. Future model/prompt changes can't regress any previously-found vulnerability without triggering a CI failure. That's how red teams compound into durable quality instead of being a one-time launch ritual.

→ Interview Tip

Senior candidates mention the external red team. At high stakes (healthcare, finance, security), hire an outside group to attack the system before launch. They're not biased by the product roadmap and they find things internal teams wouldn't.

Part VIII

Leadership
Signals

The half of the interview where the conversation stops being about models and starts being about judgement, people, and how you make decisions when nobody's going to tell you what to do.

Team leadership

Research vs shipping

Onboarding

Stakeholder alignment

ROI

Staff signals

Questions 53–60

Q53

How do you lead an AI engineering team that has to ship fast and stay safe?

Lead with clear stage-gates and explicit tradeoffs. The failure mode in AI teams is permanent prototype energy — everything is a demo and nothing is production. The opposite failure mode is a team so careful they never ship. The leadership job is naming which stage you're in and which rules apply.

THREE STAGES · DIFFERENT RULES

PROTO

Days. Hack, break things, no evals required. Goal: does it work?

BETA

Weeks. Evals exist, shadow traffic, internal users. Goal: is it safe?

GA

Months. SLOs, red team, runbook. Goal: is it reliable?

The cultural move that makes this work: celebrate the transitions. When a feature moves from proto to beta, it's a milestone. When it moves from beta to GA, it's a bigger one. Teams that don't ritualise the transition end up with five half-GA features and one on-call incident per day — all the same tier of fragility.

On ship fast: my rule is to keep the iteration loop under a day. From "idea" to "evaluated prompt change" should be hours, not weeks. That requires investment in the eval harness, the deploy pipeline, and the rollback tooling. Leaders who don't fund infrastructure in year one pay for it ten-fold in year two.

→ Interview Tip

When asked "what's your leadership style?", the specific-example answer always beats the generic one. "In our last quarter I made the call to pause feature work for two weeks to build an eval harness — here's the before/after metric" is the story that lands.

Q54

How do you balance research exploration with shipping features?

The senior framing: "research" and "shipping" are not opposites — they're different-cost experiments. A well-run team is constantly running experiments at multiple cost tiers, with clear criteria for promoting a cheap experiment into an expensive one.

EXPERIMENT TIERS

Tier 0 · Notebook

Hours. One engineer, dirty code, small eval. Promote if signal is strong.

↓

Tier 1 · Shadow

Days. Runs on prod traffic silently. Measures real distribution.

↓

Tier 2 · Canary

Week. Small % of users. Measures real outcomes.

↓

Tier 3 · GA

Quarter. Full launch, SLOs, on-call, maintenance.

Kill criteria at each tier matter more than entry criteria. Most teams have no explicit rule for stopping an experiment that isn't working. Write it down: "Tier 1 is killed if shadow judge score is below baseline after 1000 examples." Without kill criteria, research becomes an indefinite line item.

On the team level, I aim for roughly 70% ship, 20% ship-adjacent research (paying off in this quarter), 10% horizon. The horizon bucket is how you stay ahead on model upgrades and new techniques: skip it and you wake up obsolete; overfund it and you ship nothing. Most teams miss the ratio in one direction or the other.

→ Mental Model

Every research effort needs a "what would I ship" sentence from day one. If you can't name a concrete ship artifact the research would unlock, it's not research — it's learning, which is fine but should be on a smaller budget and a shorter timeline.

Q55

How do you onboard engineers to an AI codebase where half the system is a prompt?

AI codebases are famously hard to onboard to because the "source of truth" lives in prompts, evals, and traces — not in the code. A new engineer reading the repo gets maybe 40% of the picture. The onboarding plan has to compensate.

FIRST-WEEK ONBOARDING PLAN

Day 1–2 · Run one real query end-to-end

Open a trace, read the prompt, see the retrieval, read the output

↓

Day 3 · Ship a trivial prompt change

Full loop: PR, eval, review, canary — builds confidence in the system

↓

Day 4 · Read last 3 incident post-mortems

The best "how this actually fails" doc the team has

↓

Day 5 · Shadow a real feature review

See how decisions get made in the team's actual language

What you don't do: sit them down with the full system prompt and the architecture diagram and expect it to click. The sequence "run → ship small → read scars → observe decisions" compresses six weeks of learning into five days. It works because every step is concrete and produces feedback.

The asset I invest in: a "read this first" doc that isn't a README. It's a 10-page narrative: "here's the feature, here's why we built it this way, here's the thing that nearly broke it, here's the eval that keeps it honest, here's where new engineers have gotten stuck before." One document, refreshed quarterly, beats a dozen auto-generated doc sites.

→ Key Insight

The fastest way to know if an engineer is "getting it" on an AI team: ask them to debug an old production incident trace. If they can navigate the trace, find the bad step, and propose a fix — they're ready to ship. If not, more shadowing.

Q56

How do you convince leadership to fund infrastructure instead of shipping the next feature?

The losing argument is "our infra is bad and it's embarrassing." The winning argument is "here are three features we couldn't ship last quarter because infra blocked them, and here's the cost of keeping that going." Leaders fund work that prevents measurable pain, not work that satisfies engineering aesthetics.

THE CONVERSATION STRUCTURE

1 · The cost of doing nothing

Engineer-weeks lost, incidents, features blocked, churn

↓

2 · The proposal

Scope, team, timeline, clear end-state definition

↓

3 · The payoff

Measurable "after" state — engineer time recovered, cost saved, risk cut

↓

4 · The tradeoff

What feature slips, who's affected, who you've told

The framing that actually works with non-technical leaders: talk in dollars and weeks, never in abstractions. "Our eval harness takes 45 minutes to run, which means engineers wait or skip it, which means regressions ship, which means a customer-success rep spends 10 hours/week handling the fallout" — that's a business case. "We need a better eval harness" is a request.

When you get the funding, overcommunicate progress. Weekly update on what's done, what's left, what changed. Infra work is invisible to leadership by default; if you don't write about it, they'll forget you got approval and wonder why features are slow. A 5-minute Friday email beats a 60-minute meeting every time.

→ Interview Tip

Have one concrete story ready about infrastructure advocacy. Even if you didn't get approval the first time — the story of "I pitched it, got pushback, came back with better numbers, got approval" is actually stronger than a first-time yes.

Q57

Two senior engineers disagree on model vs architecture tradeoffs. How do you resolve it?

The first rule: resolve technical disagreements with data, not seniority. When two strong engineers disagree, they are usually looking at different parts of the elephant. Your job is to make the disagreement concrete (specific claim, specific metric, specific test) and let the data decide.

TURNING DEBATE INTO DECISION

💬

Clarify

what exactly is the disagreement?

→

🎯

Formalise

what would settle it?

→

🧪

Bake-off

time-boxed experiment

→

📉

Review

both engineers in the room

→

📣

Commit

both disagree & commit

The pattern I use: "disagree and commit" after a time-boxed bake-off. Define the metric and the timeline in advance ("we'll evaluate both approaches on eval set X over 3 days, winner is whichever beats the other by more than 5%"). Lock both engineers into committing to the outcome before the test runs, so the losing side doesn't re-litigate afterwards.
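The decision rule is worth writing down as code before the bake-off runs, so nobody argues about what "beats by more than 5%" meant afterwards. This is a minimal sketch that assumes the 5% is a relative margin on a single positive eval score; the function name and the example scores are hypothetical.

```python
from typing import Optional


def bakeoff_winner(score_a: float, score_b: float, margin: float = 0.05) -> Optional[str]:
    """Return "A" or "B" if one beats the other by more than `margin` (relative), else None."""
    if score_a <= 0 or score_b <= 0:
        raise ValueError("scores must be positive for a relative comparison")
    if score_a > score_b * (1 + margin):
        return "A"
    if score_b > score_a * (1 + margin):
        return "B"
    return None  # inside the margin: the data did not pick a winner


print(bakeoff_winner(0.81, 0.74))  # A
print(bakeoff_winner(0.76, 0.74))  # None
print(bakeoff_winner(0.70, 0.80))  # B
```

The `None` case matters as much as the winner: agree in advance what happens when neither approach clears the margin, otherwise the debate restarts the moment the results come in.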

Sometimes the debate is not settleable by data — it's about maintainability, clarity, or long-term direction. In that case the leader's job is to make the call explicitly, own it, and explain the reasoning. "Both approaches work; I'm picking B because it's closer to the direction the platform team is going — I might be wrong, we'll revisit in 6 months." Transparency about your own uncertainty keeps the other engineer's trust.

→ Real-World Use

The one thing not to do: appoint a third senior engineer to arbitrate. That creates a political dynamic and the losing side feels ganged up on. Either use data or take the decision yourself — never delegate the call to a tiebreaker.

Q58

How do you measure the ROI of an AI initiative?

ROI for AI initiatives falls into four buckets: revenue (new customers, upsell), cost (deflected work, reduced tool spend), risk (incidents avoided, compliance posture), and satisfaction (CSAT, NPS, retention). Every AI project must explicitly name which bucket it's in, because the metrics are different for each.

Bucket	Metric examples	Attribution tricks
Revenue	Trial → paid conversion, upsell rate, feature-gated ARR	Randomise feature access for 60 days
Cost	Tickets deflected, hours saved per agent	Hold-out cohort + before/after
Risk	Incidents avoided, SLA hits	Hard — use leading indicators (coverage, drill results)
Satisfaction	CSAT, NPS, retention	Segment by exposure, long window

The hardest one is ticket deflection — everyone wants to claim "our bot deflected 30% of tickets" and nobody can prove it without a control group. The honest measurement: hold 5% of users out of the AI feature, run for 30 days, compare their ticket volume to the exposed group. If you won't do that, don't claim the deflection number.
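The hold-out comparison is simple arithmetic once the cohorts exist. The sketch below computes deflection as the relative drop in tickets per user between the hold-out and the exposed group; the function names and all the counts are made up for illustration, and a real analysis would also want a significance test before anyone quotes the number.

```python
def tickets_per_user(tickets: int, users: int) -> float:
    """Ticket volume normalised by cohort size, so unequal cohorts are comparable."""
    if users <= 0:
        raise ValueError("cohort must have at least one user")
    return tickets / users


def deflection_rate(holdout_tickets: int, holdout_users: int,
                    exposed_tickets: int, exposed_users: int) -> float:
    """Relative reduction in tickets per user for the exposed group vs. the hold-out."""
    control = tickets_per_user(holdout_tickets, holdout_users)
    treated = tickets_per_user(exposed_tickets, exposed_users)
    if control == 0:
        raise ValueError("hold-out group filed no tickets; deflection is undefined")
    return (control - treated) / control


# Illustrative numbers only: a 5% hold-out over a 30-day window.
rate = deflection_rate(
    holdout_tickets=1_000, holdout_users=5_000,     # 0.20 tickets per user
    exposed_tickets=15_200, exposed_users=95_000,   # 0.16 tickets per user
)
print(f"{rate:.1%}")  # 20.0%
```

Normalising by users is the step people skip: the exposed group is 19 times larger than the hold-out, so raw ticket counts say nothing on their own.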

The trap to avoid: vanity metrics like "number of messages sent to the bot". Usage isn't value. A chatbot with 10k daily messages and a CSAT of 2.1 is actively destroying value. Tie every AI project to a business metric downstream of usage — conversion, resolution, retention — or the initiative doesn't justify its budget.

→ Mental Model

Before you build an AI feature, write the "if this works, this metric moves from X to Y" sentence. If you can't write it, you're not ready to build it — and you won't be able to prove ROI later.

Q59

How do you mentor an engineer from "knows Python" to "can ship AI features"?

Mentorship in AI engineering is less about teaching facts and more about building instincts for non-deterministic systems. The transition that matters: moving from "does my code run?" to "does my feature produce good outputs, reliably, over a distribution of inputs?"

FOUR HABITS I DRILL INTO NEW AI ENGINEERS

01 · Always look at data

Before you debug, read 20 real examples. The answer is usually in the data.

02 · Build the eval first

No ship, no feature, no PR until there's a way to measure it.

03 · Start with the simplest thing

Prompt before fine-tune; retrieval before agent; regex before LLM.

04 · Own the trace

When something is wrong, open the trace. Never ship on hope.

The single practice I require: read 20 real examples from the system every week. No dashboards, no summaries — actual traces from actual users. Pattern-matching on real data is how senior AI engineers develop intuition, and it's the one practice juniors skip. Make it part of the weekly ritual.

The blocker that's hardest to coach through: the fear of "wasting" an LLM call. Junior engineers over-optimise their prompt before running it because it "feels expensive." Mid-level engineers run it, see it fail, and iterate. You want to coach toward the latter — the learning loop is the whole job.

→ Interview Tip

If asked about mentoring, tell a before/after story with a specific engineer: "Engineer X couldn't ship a prompt change in a week. Here's what I changed about how they worked. Now they ship daily." Concrete beats generic every time.

Q60

What does "senior" look like versus "staff" in AI engineering — and which are you?

The clearest framing I've found: senior engineers own outcomes on a feature; staff engineers own outcomes across features. Senior ships the feature reliably; staff designs the platform that makes five teams ship reliably. The scope of what you're accountable for is the main axis.

SENIOR vs STAFF · WHERE THE LINE SITS

Senior owns

A feature end-to-end

Reliable shipping

Clean eval discipline

Incident response

Mentorship up to one level

Mastery in the box

Staff owns

A platform or domain

Architecture across teams

Eval culture at org level

Strategy & roadmap

Develops senior engineers

Leverage outside the box

In AI specifically, the staff signals I look for are: (1) can they design an eval culture, not just write evals for their own feature? (2) can they sequence the team's investments across model upgrades, infra improvements, and features over a 6-month horizon? (3) can they pick the right abstraction — knowing when to add a platform component vs when to let teams keep copy-pasting? That last one is where most senior-to-staff transitions live or die.

The answer to "which are you?" should be honest and forward-looking: "I'm operating at senior today. Here are the staff-level scopes I've taken on and the ones I know I haven't yet." That combination — self-awareness plus a growth direction — is what interviewers want to hear. Overclaiming burns trust. Underclaiming costs you the level.

→ Key Insight

Staff engineers make other engineers more effective. If you can point at three engineers whose work is better because of something you built or taught them, you're operating above a pure senior scope — say that, give examples, let the interviewer do the math.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Complete

All 60 Questions.
Covered.

Sixty production-grounded questions for senior AI engineer interviews — architecture, incidents, agents, RAG, evals, cost, safety, and the leadership signals hiring panels actually listen for.

60 Questions · 8 Topic Areas · 60+ Visual Diagrams


RUN IT YOURSELF

Exponential backoff with a cap

Production AI calls fail, so you retry with exponential backoff instead of hammering a struggling service. The example computes the delay schedule in Python; the factor and the cap are the two parameters worth experimenting with.

HOW TO READ THE CODE — 4 IDEAS

Each retry should wait longer than the last, to give the service room to recover.
Delay grows geometrically: base · factor^attempt (step 1).
A cap stops the delay growing without bound (step 2).
Real clients also add jitter (randomness) so retries don't all fire at once.
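The four ideas above can be sketched in a few lines. This is not the page's original interactive code; it is a minimal reconstruction under assumed defaults (base 1 second, factor 2, cap 30 seconds, 6 attempts), and `backoff_schedule` and `with_full_jitter` are hypothetical names. The jitter variant shown draws each delay uniformly between zero and the capped value, which is one common strategy among several.

```python
import random
from typing import Optional


def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 30.0, attempts: int = 6) -> list[float]:
    """Return the delay in seconds to wait before each retry attempt."""
    delays = []
    for attempt in range(attempts):
        delay = base * factor ** attempt  # step 1: geometric growth
        delays.append(min(delay, cap))    # step 2: cap the delay
    return delays


def with_full_jitter(delays: list[float],
                     rng: Optional[random.Random] = None) -> list[float]:
    """Randomise each delay in [0, delay] so clients do not retry in lockstep."""
    rng = rng or random.Random()
    return [rng.uniform(0, d) for d in delays]


print(backoff_schedule())                       # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print(backoff_schedule(factor=3.0, cap=20.0))   # [1.0, 3.0, 9.0, 20.0, 20.0, 20.0]
print(with_full_jitter(backoff_schedule()))     # random values, each at most the capped delay
```

Raising the factor reaches the cap sooner; lowering the cap bounds the worst-case wait at the price of more pressure on the failing service.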


The Senior AI Engineer Interview Handbook

What's Inside

Architecture & System Design

Design a production LLM serving platform for a product with 1M daily active users.

When would you pick RAG, fine-tuning, or plain in-context learning — and when would you use more than one?

How do you design multi-tenancy into a shared AI platform without one noisy tenant hurting everyone else?

What's your approach to model versioning, rollout, and rollback in production?

Design a multi-model routing layer that picks the right model per request.

How do you manage context windows when user sessions exceed the model's limit?

Where do you draw the line between deterministic code and an LLM call?

How do you design a feedback loop that actually improves the system over time?

Production Incidents

Walk me through a production LLM incident you debugged end-to-end.

How do you detect and mitigate model drift in production?

Your LLM provider goes down mid-day. What's your runbook?

How do you handle hallucinations in user-facing applications?

Tell me about a time a fine-tuned model regressed in production and how you caught it.

How do you roll back a bad prompt deployment safely?

How do you handle cascading failures in multi-step agent workflows?

What does a post-mortem for a non-deterministic AI system look like?

Agentic Systems

How do you structure a multi-agent system? Why not a single agent with more tools?

When should an agent escalate to a human versus retry on its own?

How do you handle tool-use failures inside an agent loop?

How do you manage agent state — checkpointing, pausing, resuming long runs?

How do you stop an agent from burning money in a runaway loop?

How do you test an agent before you ship it?

How do you design an agent's memory so it's useful and bounded?

How do you handle handoffs between specialised agents or roles?

RAG & Retrieval

Walk me through how you'd design a production RAG system from scratch.

What's your chunking strategy and why?

How do you evaluate retrieval quality — separate from answer quality?

When do you use hybrid search versus pure vector search?

How do you handle index staleness when documents change constantly?

What's your re-ranking strategy and when is it worth the cost?

A user reports a RAG answer is wrong. Walk me through debugging it.

How would you scale RAG to a billion documents?

Evals & Observability

How would you build an LLM eval framework from scratch?

How do you use LLM-as-judge without fooling yourself?

How do you prevent your eval set from leaking into training or prompt optimization?

How do you do regression testing for prompts?

What does AI observability actually look like in production?

How do you A/B test a model or prompt change in production?

How do you measure quality when there's no single ground truth?

You have 200 signals of quality problems and time for 10 fixes. How do you prioritise?

Cost & Scaling

Your LLM bill just jumped 50%. How do you cut inference cost in half without hurting quality?

How do you set and defend a latency budget for an LLM feature?

When do you cache LLM outputs versus recompute every time?

How do you do capacity planning for AI workloads?

Hosted APIs versus self-hosted models — how do you make the call?

How do you optimise prompts for cost without hurting quality?

Safety & Trust

How do you defend against prompt injection in production?

How do you handle PII as it flows through an LLM pipeline?

How do you handle jailbreaks in a customer-facing agent?

How do you build guardrails that don't ruin the user experience?

How do you audit an AI system for compliance — SOC2, HIPAA, GDPR?

What does a red-team process look like before launching an AI feature?

Leadership Signals

How do you lead an AI engineering team that has to ship fast and stay safe?

How do you balance research exploration with shipping features?

How do you onboard engineers to an AI codebase where half the system is a prompt?

How do you convince leadership to fund infrastructure instead of shipping the next feature?

Two senior engineers disagree on model vs architecture tradeoffs. How do you resolve it?

How do you measure the ROI of an AI initiative?

How do you mentor an engineer from "knows Python" to "can ship AI features"?

What does "senior" look like versus "staff" in AI engineering — and which are you?
