Sixty questions on architecture, production incidents, and the leadership signals that separate senior from staff-level AI engineers.
The senior interview rarely asks you to invent a transformer. It asks you to draw a production system on a whiteboard in forty minutes and defend every line you drew.
Start by naming the three workloads that share nothing: interactive chat (single-digit seconds, streaming), async jobs (bulk summarisation, embedding indexing), and batch evals. Trying to serve them from one pool is the most common rookie error — they contend for the same GPUs and interactive p99 falls apart the moment a batch job kicks in.
The gateway terminates auth, enforces per-tenant quotas, and scrubs PII before anything touches a model. The router decides model, region, and fallback chain — this is where you bake in your cost strategy (cheap models first, escalate on low confidence). The inference tier is where you separate interactive vs batch GPU pools. Everything is fronted by a prefix + semantic cache because in chat, 30–60% of prompts share a system prefix that should hit the KV cache every time.
Senior interviewers listen as much for what you leave out as for what you say, and three omissions stand out. First, queueing: you need a token-aware admission queue because ten 32k-token requests can starve a hundred 2k-token ones. Second, multi-region failover that doesn't break conversation state. Third, a model-agnostic request schema so swapping Claude for GPT or Gemini is a router config change, not a code change.
This is a decision framework question, not a recipe question. The answer turns on three variables: how often the knowledge changes, how much you need the model to change its behaviour vs its facts, and how much latency and cost budget you have.
| Approach | Strong when | Weak when | Cost shape |
|---|---|---|---|
| In-context | Facts fit in prompt, change daily | Long tail of knowledge, repeated costs | Per-request tokens |
| RAG | Knowledge is large, updates often, auditable | Behaviour change, reasoning style | Index + per-request retrieval |
| Fine-tuning | Style, format, domain jargon, routing | Facts, anything that changes weekly | Training run + hosting |
| Hybrid | Regulated domains needing both | Prototype / unclear requirements | All of the above |
The thing that gets a senior signal is naming the hybrid case out loud. Most real systems end up as RAG for facts + a small fine-tune for tone and structure. Medical copilots do this. Legal copilots do this. Support bots for a specific product line do this. The fine-tune teaches the model "how we sound" and RAG teaches it "what we currently know" — those are two orthogonal needs and trying to solve both with one lever always overfits one of them.
Staff-level candidates go one layer deeper: fine-tune the retriever, not the generator. If your off-the-shelf embedding model doesn't know your jargon, you get bad retrieval no matter how big the LLM is. A small contrastive fine-tune on your query/doc pairs often moves the quality needle more than fine-tuning a 70B model.
Multi-tenancy for AI has three concerns that traditional SaaS doesn't: token quotas, data isolation in caches and embeddings, and model-level noisy neighbours where one tenant's batch workload starves another's interactive traffic. You want per-tenant isolation on all three axes.
The trap to call out: prompt caches are a data leak surface. If you dedupe by prompt hash across tenants, two tenants who send the same prompt share one cache entry, so a response built from one tenant's retrieved context can be served to another. Always key caches by tenant-id || prompt-hash, not just prompt-hash. Same rule for any LLM response cache or embedding cache.
For the noisy-neighbour problem, token-weighted fair queueing beats round-robin. Charge each request to the tenant's bucket in tokens, not requests — a 32k-token request costs 16x a 2k one, and round-robin treats them the same. Large tenants should hit a separate high-priority lane that can't starve the standard lane below a floor.
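Token-weighted fair queueing can be sketched in a few lines. This is a simplified, self-clocked variant and not a production scheduler: the class name, the `weight` parameter, and the use of estimated tokens as cost are all assumptions for illustration.

```python
import heapq
from collections import defaultdict
from itertools import count


class TokenFairQueue:
    """Orders requests by virtual finish time, charged in tokens per tenant."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str, str]] = []
        self._last_finish: dict[str, float] = defaultdict(float)
        self._virtual_time = 0.0
        self._seq = count()  # tie-breaker so equal finish times stay FIFO

    def push(self, tenant: str, request_id: str, tokens: int, weight: float = 1.0) -> None:
        # A tenant's next request starts where its previous one finished,
        # so a burst of large requests pushes that tenant further back.
        start = max(self._virtual_time, self._last_finish[tenant])
        finish = start + tokens / weight
        self._last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant, request_id))

    def pop(self) -> tuple[str, str]:
        finish, _, tenant, request_id = heapq.heappop(self._heap)
        self._virtual_time = max(self._virtual_time, finish)
        return tenant, request_id
```

With three 32k-token requests from one tenant and three 2k-token requests from another queued together, the small requests are served first because their virtual finish times are lower.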
The senior move is to treat model + prompt + retriever as one deployable unit. Versioning only the model is the fastest way to get into production bugs you can't reproduce: the model is fine but the prompt was regenerated from a different template and nobody noticed. Call this the inference stack and version the whole thing.
Rollout strategy: always shadow first, then canary, then ramp. Shadow mode sends the new stack the same traffic as prod but discards its output — you get real-distribution eval data without risking users. Canary then flips a small slice (1% → 5% → 25%) with automatic rollback tied to your eval gates. The key insight: your rollback trigger should be an eval metric, not an error rate. Hallucination regressions don't throw 500s.
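The ramp-or-rollback decision is small enough to write down. A hedged sketch: the 1% → 5% → 25% steps come from the text above, while the final 100% step, the `max_regression` threshold, and the function name are assumptions.

```python
RAMP = [0.01, 0.05, 0.25, 1.0]  # final 100% step is an assumption


def next_traffic_share(
    current: float,
    baseline_score: float,
    canary_score: float,
    max_regression: float = 0.02,  # hypothetical eval-gate tolerance
) -> float:
    """Return the canary's next traffic share, or 0.0 to roll back.

    The trigger is an eval metric (higher is better), not an error rate.
    """
    if canary_score < baseline_score - max_regression:
        return 0.0  # quality regressed beyond tolerance: roll back
    for step in RAMP:
        if step > current:
            return step
    return 1.0  # already fully ramped
```

Feeding it the golden-set score of the canary versus production at each stage gives a rollback that fires on quality regressions even when every request returns 200.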
Rollback has to be atomic across the whole stack. A common outage pattern: you roll back the model but forget the prompt was updated to match the new model's behaviour, so you end up running a prompt tuned for the new model against the old one. Snapshot the whole stack, roll back as one unit.
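One way to make "roll back as one unit" concrete is to version an immutable snapshot of all three parts. A sketch under assumptions: `InferenceStack`, `StackRegistry`, and the version strings are hypothetical names, not a real registry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceStack:
    """Model, prompt, and retriever versioned together as one deployable."""
    model: str
    prompt_version: str
    retriever_version: str


class StackRegistry:
    """Keeps deploy history so rollback restores every component at once."""

    def __init__(self, initial: InferenceStack) -> None:
        self._history = [initial]

    @property
    def active(self) -> InferenceStack:
        return self._history[-1]

    def deploy(self, stack: InferenceStack) -> None:
        self._history.append(stack)

    def rollback(self) -> InferenceStack:
        if len(self._history) < 2:
            raise RuntimeError("no previous stack to roll back to")
        self._history.pop()
        return self.active
```

Because the snapshot is frozen, there is no code path that reverts the model while leaving the prompt behind.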
A good router is cheap by default, expensive by necessity. The structure that works: a small classifier decides a tier, the tier maps to a model, and every request has a fallback chain if the first pick fails or returns low confidence.
The classifier should be a tiny, cheap model or a fine-tuned DistilBERT; don't use a frontier model to decide which frontier model to use. Features: intent, estimated token length, whether tool use is required, whether the user tier allows premium models. The classifier must be deterministic enough that A/B tests are interpretable.
The three pitfalls to mention: (1) Double billing — if you escalate, you pay for both models, so only escalate on measurable low-confidence signals. (2) Latency cliffs — users notice when their query randomly takes 10x longer because it hit the big model. Stabilise routing per session so a user's experience is consistent. (3) Observability debt — every request needs to log which tier it hit and why, or you can't tune the router.
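The tier, fallback chain, and "log why" requirements fit in a short sketch. Everything here is illustrative: the rule-based `classify_tier` stands in for the small classifier, the tier names and thresholds are made up, and `call_model` is a caller-supplied function assumed to return an answer plus a confidence score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    text: str
    est_tokens: int
    needs_tools: bool = False
    premium_allowed: bool = False


FALLBACK_CHAIN = {
    "small": ["small", "medium", "large"],
    "medium": ["medium", "large"],
    "large": ["large"],
}


def classify_tier(req: Request) -> str:
    """Deterministic stand-in for the small routing classifier."""
    if req.needs_tools or req.est_tokens > 8000:
        return "large" if req.premium_allowed else "medium"
    return "medium" if req.est_tokens > 1000 else "small"


def route(
    req: Request,
    call_model: Callable[[str, str], tuple[str, float]],
    min_confidence: float = 0.7,
) -> tuple[str, list[tuple[str, float]]]:
    """Try tiers in order; escalate only on a measured low-confidence signal."""
    attempts: list[tuple[str, float]] = []  # logged so the router can be tuned
    answer = ""
    for tier in FALLBACK_CHAIN[classify_tier(req)]:
        if tier == "large" and not req.premium_allowed:
            continue
        answer, confidence = call_model(tier, req.text)
        attempts.append((tier, confidence))
        if confidence >= min_confidence:
            break
    return answer, attempts
```

The returned `attempts` list is the observability record: which tiers were hit and the confidence that triggered each escalation.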
There's no single answer — it depends on whether the domain needs recency (support chat) or completeness (code review, legal). The strategy is a stack of techniques, not one choice.
For most production systems the right answer is a hybrid: sliding window of the last 6–10 turns + rolling summary of everything before + retrieved memory for specific entities. Structured state is underused — if your domain has stable slots (user preferences, order IDs, active project) extract them and pass them as a small JSON block. That beats summarisation because it's lossless for the things that matter.
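A rough sketch of how the three pieces are assembled into one context. The function name, the `STATE:`/`SUMMARY:` labels, and the default window of 8 turns are assumptions; retrieved entity memory is left out to keep the example short.

```python
import json


def build_context(
    turns: list[tuple[str, str]],  # (role, text), oldest first
    summary: str,                  # rolling summary of everything before the window
    state: dict,                   # stable slots extracted from the conversation
    window: int = 8,
) -> str:
    """Combine structured state, rolling summary, and a sliding window of turns."""
    parts: list[str] = []
    if state:
        # Structured slots are passed verbatim, so nothing important is paraphrased away.
        parts.append("STATE: " + json.dumps(state, sort_keys=True))
    if summary:
        parts.append("SUMMARY: " + summary)
    parts.extend(f"{role}: {text}" for role, text in turns[-window:])
    return "\n".join(parts)
```

Logging the token length of the returned string per request is also the cheapest way to get the context-utilisation numbers discussed next.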
The senior-signal detail: measure your context utilisation. If your users' p95 conversation is 4k tokens and your budget is 200k, you're paying for nothing. Cap the window at what you actually use, monitor how often you hit it, and only increase when the data demands it.
The default should be "use code wherever code works". LLM calls are non-deterministic, slow, expensive, and hard to test — if a regex, a SQL query, or a finite-state machine can do the job, that's what you use. The LLM is reserved for tasks where the input is ambiguous in a way that rules can't resolve.
The practical rule I give junior engineers: "the LLM is an expensive interpreter, not a runtime". Let it parse the user's intent, pick a tool, and talk back to the user — but route the actual execution through deterministic code. Whenever I see an LLM being asked to "decide what to do and do it in one call", there's a bug waiting. Split the reasoning step from the execution step.
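Splitting reasoning from execution usually looks like this: the model emits a structured plan, and plain code validates and runs it. A minimal sketch, assuming the model returns JSON with `tool` and `args` fields. The tool `get_order_status` and its output are hypothetical.

```python
import json


def get_order_status(order_id: str) -> str:
    """Hypothetical deterministic tool; a real one would query a database."""
    return f"order {order_id}: shipped"


# Allowlist: tool name -> (callable, exact set of accepted argument names)
TOOLS = {"get_order_status": (get_order_status, {"order_id"})}


def execute_plan(llm_output: str) -> str:
    """Run the tool the LLM chose, but only after code validates the choice."""
    plan = json.loads(llm_output)
    name = plan.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    func, allowed_args = TOOLS[name]
    args = plan.get("args", {})
    if set(args) != allowed_args:
        raise ValueError(f"bad arguments for {name}: {sorted(args)}")
    return func(**args)
```

The LLM never touches the execution path directly, so the part with side effects stays testable with ordinary unit tests.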
The counter-pattern to watch for is LLM creep: every new edge case gets solved by adding another line to the system prompt. Three months in you have a 4000-token prompt that nobody can reason about. When that happens, audit — most of that prompt belongs in code.
Most "feedback loops" are thumbs-up/thumbs-down buttons that nobody clicks. A real feedback loop has four stages: capture, label, improve, verify — and the hardest one is label, not capture.
Capture should bias to implicit signals: did the user follow up with a rephrase (bad), copy the output (good), close the tab within 3 seconds (bad), ask a new question (neutral)? Thumbs are a 2% sample and heavily biased toward negative. Implicit signals are 100% coverage and much more useful for ranking which examples are worth a human look.
The critical design choice is where the loop re-enters the system. Cheap loops update prompts and retrieval. Medium-cost loops add examples to the golden eval set so regressions can't ship. Expensive loops do fine-tuning. Most teams jump straight to fine-tuning because it feels serious, but updating prompts and evals from production data beats fine-tuning on almost all quality dimensions for a fraction of the cost.
Senior interviews live and die in the war-story section. The questions here are where you prove you've taken the pager and come out with scar tissue, not just slide decks.
The interviewer is listening for four things, in order: (1) how you detected it, (2) how you isolated the cause, (3) how you mitigated without making it worse, and (4) what you changed so it wouldn't happen again. The story doesn't need to be glamorous — clarity beats drama.
A good story to have ready: "A support bot started confidently answering billing questions with the wrong currency. No errors, no latency blip, just wrong. Our customer-success team pinged us." Detection came from a human, not a metric; call that out. Triage: pulled 50 traces, saw the model had started including an EU pricing doc in retrieval for US users. Isolation: the retriever was pulling by cosine similarity only, and a recent re-indexing had changed the embedding space enough that geography tags no longer clustered. Mitigation: hot-patched the retriever to apply user region as a hard filter before vector search. Learning: added region-aware tests to the eval set and made metadata filters mandatory in the retriever contract.
The staff-level detail most people miss: explain the blast radius assessment. How did you decide this was a P1 vs a P2? Who was affected, how do you know, and how did you estimate the cost of being wrong? That's the judgment layer senior interviewers are probing.
"Drift" means three different things and senior candidates distinguish them: data drift (user inputs change), concept drift (the right answer for the same input changes), and model drift (the provider silently updates the model behind your API). Each one is detected and mitigated differently.
| Type | Symptom | Detector | Mitigation |
|---|---|---|---|
| Data drift | New topics, longer queries | Embedding distribution shift | Retrieval refresh, prompt update |
| Concept drift | Right answer is wrong now | Feedback rate delta | Re-label golden set, update prompt |
| Model drift | Same input, different output | Canary replay on golden set | Pin version, re-validate, possibly switch |
The sharp detector for data drift is embedding PCA over a rolling window: project last-24-hours of user prompts into a 2-D space and compare to last-week's. A cluster of queries in an unfamiliar region is an early signal that your retrieval is about to go sideways. Cheaper version: track the rate of "I don't know" responses — it correlates surprisingly well with topic drift.
Model drift is the one most teams forget about and the one that bites hardest. When you call a hosted API you're implicitly pinned to whatever the provider ships. Run a daily replay of your golden set through the hosted endpoint and diff the results. If a meaningful fraction changed without you shipping anything, the provider updated something. This is how you catch it before a user does.
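The daily replay can start as something this small. A sketch with assumptions: `call_model` is whatever function hits your hosted endpoint, and exact string comparison is a crude stand-in for a proper semantic or judge-based diff.

```python
from typing import Callable


def replay_diff(
    golden: dict[str, str],            # prompt -> output recorded on the last good run
    call_model: Callable[[str], str],  # caller-supplied client for the hosted endpoint
) -> tuple[float, list[str]]:
    """Replay the golden set and report which prompts now answer differently."""
    changed = [
        prompt
        for prompt, recorded in golden.items()
        if call_model(prompt).strip() != recorded.strip()
    ]
    fraction = len(changed) / len(golden) if golden else 0.0
    return fraction, changed
```

Alert when the changed fraction crosses a threshold on a day you shipped nothing. With sampling enabled some diff is expected, so the baseline needs to be measured first.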
This question is testing whether you've planned for the outage, not whether you can improvise during it. The right answer starts with "we designed for this" and ends with "here's the minute-by-minute runbook".
Multi-provider is cheap insurance if you designed for it from day one. That means: model-agnostic request schema, prompt templates that work across providers (or provider-specific variants versioned together), and a pre-approved fallback chain so incident commanders don't have to make architectural decisions at 2am. The worst time to discover your prompts don't port between providers is during the outage.
Also underrated: graceful degradation is a product decision, not just a technical one. Sometimes the right response is to serve a cached answer with a disclaimer. Sometimes it's to put the feature in read-only. Sometimes it's to fail loudly because a wrong answer would be worse than no answer. Have that conversation with your PM before the outage and put the answer in the runbook.
The senior framing is: hallucinations are a system problem, not a model problem. You don't make them go away — you contain them, detect them, and design UX that tolerates them. Treat it like we treat SQL injection: defence in depth.
Layer 1 is grounding: give the model the facts it needs and instruct it not to answer beyond them. Layer 2 is structure: if the output has to be JSON with specific fields, the surface area for freeform fabrication drops. Layer 3 is verification: run a second, cheaper model pass that asks "is every claim in this answer supported by the retrieved context?" — a surprisingly effective filter. Layer 4 is UX: never let a claim appear without a citation the user can click. Users are remarkably forgiving of "I'm not sure" and remarkably unforgiving of confident wrongness.
What you do not do is tell the model "don't hallucinate" in the system prompt. That's a meme answer. It doesn't work. Neither does temperature=0 — that makes the same wrong answer consistent, not less wrong.
Fine-tune regressions have a specific flavour: the model got better on the training distribution and worse on the tail. If your golden eval is a clone of your training distribution, you won't catch it — both move together. The senior story to tell is about a team that caught it because the eval set had a held-out tail slice.
A real pattern: a support team fine-tunes a model on last-90-days of resolved tickets. Accuracy on a random sample jumps. They ship. A week later, customer satisfaction on new-product questions tanks — the fine-tune had shifted the model toward the existing product mix, and it now underperforms on questions about anything that wasn't in the training window. The fix isn't another fine-tune. The fix is mixing a retrieval layer for long-tail queries and classifying inputs to decide which path to use.
The lesson to deliver: always keep a "rare but important" slice in your eval set. Rare queries about important things (legal, billing edge cases, new features) should each get 10–50 examples, even if they're 0.1% of traffic. Those are the slices fine-tunes regress on.
The mechanics are simple if you planned for it. You need: (1) prompts stored as versioned artifacts separate from code, (2) a feature flag per deploy, and (3) automatic rollback triggers tied to eval gates. If any of those three is missing, the rollback is a code deploy, which is fine for the first incident and unacceptable after the second.
The subtle part is that a rollback is not just flipping a flag. You also have to: invalidate any prompt-prefix caches keyed on the new prompt, drain in-flight requests that were mid-stream on the new prompt, and communicate to downstream consumers that output format may have shifted back. A senior answer mentions these. A staff answer builds them into the registry from day one so the rollback is a single operation.
The story I'd tell: team shipped a prompt that added a new output field. Downstream code assumed the field was present. They rolled back the prompt but the downstream service was still on the new code path and started throwing nulls. The incident was longer than it should have been because the rollback wasn't atomic across the dependency graph.
Agents cascade because a small upstream error compounds at every downstream step — a mis-parsed tool argument becomes a failed tool call becomes a confused planner becomes a 20-step loop. You contain this by putting circuit breakers at every step boundary, the same way you would in a microservice mesh.
Three concrete circuit breakers: a step-count limit (hard cap on iterations), a cost limit in tokens or dollars (when the agent has spent its budget, stop), and a no-progress detector (if the last three steps didn't change the state, the agent is stuck — stop). The no-progress detector is the one most teams forget; it catches the "agent is looping on the same failing tool call" pattern.
For recovery, the key design choice is whether to retry a failed step or abort and escalate to a human. The answer depends on reversibility: retries are fine for read-only tools; for anything that writes state, default to "escalate" unless you have idempotency guarantees on the tool itself. This is where senior candidates mention idempotency keys per agent run — a rare but correct detail.
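The reversibility rule reduces to a small decision function. A sketch only: the `Tool` fields, the retry cap, and the key format are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes_state: bool
    idempotent: bool = False  # True only if the tool honours idempotency keys


def on_step_failure(tool: Tool, attempt: int, max_retries: int = 2) -> str:
    """Decide between 'retry' and 'escalate' after a failed agent step."""
    if attempt >= max_retries:
        return "escalate"
    if not tool.writes_state or tool.idempotent:
        return "retry"  # safe: read-only, or duplicate writes are deduplicated
    return "escalate"   # a blind retry could apply the write twice


def idempotency_key(run_id: str, step_index: int) -> str:
    """Stable per-run, per-step key so a retried write is recognised as a duplicate."""
    return f"{run_id}:{step_index}"
```

The key has to be derived from the run and step, not generated fresh per attempt, otherwise the retry looks like a new write.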
Traditional post-mortems assume reproducibility. AI incidents often aren't reproducible — you can't replay the same user session and get the same wrong answer. The post-mortem template has to bend around that.
| Section | Classical | AI-specific |
|---|---|---|
| Repro | Deterministic steps | Captured traces + inputs; probabilistic replay |
| Root cause | Single commit / config | Prompt + model + retrieval + input shape |
| Fix | Code change | Prompt, eval set, retrieval filter, UX change |
| Prevention | Test case | Golden example + monitoring alert |
| Metric | TTR, error rate | TTR + quality-signal delta + slice recovery |
Two rules I enforce: (1) Every AI post-mortem ends with at least one new eval example. That's how you make the incident compound into long-term protection instead of just a memory. (2) Root cause is plural. "The model hallucinated" is never the root cause — the root cause is why the system let a hallucination reach a user. Usually: missing grounding, missing verification, UX that hid uncertainty, or a retrieval filter that was off by one.
The senior framing I'd lead with: "post-mortems are how you turn probabilistic bugs into deterministic tests." Every user-reported failure, if captured well, becomes a line in the golden set. After two years of this, your eval set is the most valuable artifact on the team — it encodes every scar you have.
Agents are the topic every AI team is hiring for, and the one where interviewers can tell in thirty seconds whether you've shipped one. The distinction is in the failure modes.
The honest senior answer: start with a single agent and split only when you have evidence. Multi-agent is the most over-engineered pattern in the space. A single agent with 8 well-designed tools beats a three-agent mesh in most domains — less state, less orchestration, fewer handoff bugs.
The cases where you genuinely need multi-agent are specific: (1) role specialisation where the prompts are too different to fit in one system prompt (planner vs executor vs critic), (2) security boundaries where one agent has write access and another doesn't, and (3) scale boundaries where the orchestrator coordinates N parallel workers that don't talk to each other.
When you do split, the pattern that works is planner → executor → critic, each running on a model sized to its job (cheap planner, cheap executor, more capable critic for the final check). The orchestrator is code, not an LLM, because code is the thing that reliably enforces the step count, budget, and handoff protocol.
The rule of thumb: retry on transient failures, escalate on ambiguity. A 429 rate-limit is a retry. A tool that returned zero results is ambiguity — the agent should not silently pivot to a different plan and hope for the best.
The design detail that separates senior from staff: the escalation path is part of the product surface, not a fallback hack. If the agent can escalate to a human, the human's queue, the SLA, the "who gets paged" policy, and the handback flow all have to be designed. Otherwise escalation becomes a black hole that the user abandons.
Also mention the "ask the user once" rule: if the agent is unsure, it should ask a single clarifying question with a bounded set of answers. A free-form clarification loop devolves into conversation and burns tokens. Bounded clarifications feel like a good UI and are cheap.
Tool failures come in three flavours and each needs different handling: schema errors (the model called the tool wrong), tool errors (the tool ran but returned an error), and semantic errors (the tool ran, returned "success", but the result is wrong for the task). Two operational cases, empty results and repeated failures, get their own rows in the table because they need their own handling.
| Failure | Detection | Fix |
|---|---|---|
| Schema error | JSON parse / type check | Return error to LLM, let it retry with corrected args |
| Tool error | HTTP status / exception | Map to a human-readable error the LLM can reason about |
| Empty result | Zero hits, empty list | Surface as "no results, consider alternatives" |
| Semantic error | Verifier / sanity check | Tag as suspect, retry with different params |
| Repeated failure | N retries exceeded | Escalate, don't loop |
The underrated one is semantic errors. Tools often return "success" for wrong outcomes — a search tool returns hits that don't match the intent, a code-execution tool returns output that ran but didn't do the thing asked. You catch these with a verifier pass: a small prompt that asks "given the goal and this tool output, did we make progress?" It's cheap and it's the difference between agents that feel reliable and agents that feel slippery.
The schema design rule: tools should return errors that the LLM can reason about. "HTTP 400: bad field 'start_date', expected YYYY-MM-DD" is actionable. "Internal server error" is not. Wrap your tool errors in natural language on the way back to the model.
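A thin wrapper is usually enough to turn exceptions into messages the model can act on. A sketch under assumptions: the envelope shape (`ok`, `result`, `error`) and the tool `search_events` are hypothetical, and mapping `TypeError` to "schema error" is a heuristic that would also catch a `TypeError` raised inside the tool.

```python
from datetime import date
from typing import Any, Callable


def search_events(start_date: str) -> list:
    """Hypothetical tool that validates its input before doing any work."""
    try:
        date.fromisoformat(start_date)
    except ValueError:
        raise ValueError("bad field 'start_date', expected YYYY-MM-DD") from None
    return []


def call_tool_safely(tool: Callable[..., Any], **kwargs: Any) -> dict:
    """Run a tool and convert failures into text the LLM can reason about."""
    name = tool.__name__
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except TypeError as exc:  # wrong or missing argument names
        return {"ok": False, "error": f"Schema error calling {name}: {exc}. Fix the argument names and retry."}
    except ValueError as exc:  # tool rejected the values
        return {"ok": False, "error": f"{name} rejected the input: {exc}"}
    except Exception as exc:  # anything else: say so, and discourage blind retries
        return {"ok": False, "error": f"{name} failed unexpectedly ({type(exc).__name__}). Do not retry with the same arguments."}
```

The error string goes back into the conversation as the tool result, which gives the model something specific to correct on its next attempt.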
Agents that run more than a few seconds need to survive restarts, deploys, and human pauses. The mental model: treat an agent run like a workflow engine job. State lives outside the agent, the agent is a reducer over an append-only event log, and any step can be replayed from the log.
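The reducer-over-a-log idea can be sketched without any workflow engine. The event types and `RunState` fields below are assumptions for illustration. A real system would persist the log in a database or a durable-execution framework.

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class RunState:
    steps: int = 0
    facts: tuple = ()
    awaiting_approval: bool = False
    done: bool = False


def apply_event(state: RunState, event: dict) -> RunState:
    """Pure reducer: current state + one logged event -> next state."""
    kind = event["type"]
    if kind == "tool_result":
        return replace(state, steps=state.steps + 1, facts=state.facts + (event["fact"],))
    if kind == "approval_requested":
        return replace(state, awaiting_approval=True)
    if kind == "approval_granted":
        return replace(state, awaiting_approval=False)
    if kind == "finished":
        return replace(state, done=True)
    return state  # unknown events are ignored so old logs stay replayable


def replay(log: list[dict]) -> RunState:
    """Rebuild state from the append-only log, e.g. after a restart or deploy."""
    return reduce(apply_event, log, RunState())
```

Because state is always derived from the log, resuming after a restart and replaying any prefix of a run for debugging are the same operation.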
The event-sourced design has a huge hidden benefit: you can replay an agent run on a newer model and diff the outcome. When you upgrade Sonnet, you replay yesterday's agent runs through the new model and see which steps went better or worse. That's your eval on real data for free.
For short-lived agents (<30 seconds, single request) this is overkill — just hold state in memory and be done. For anything that might span hours, survive deploys, or need human approval in the middle, the workflow-engine model is the right default. Temporal, Restate, and Inngest all ship patterns for this; rolling your own is fine if the domain is small.
Three layers, enforced in code outside the LLM: step budget, token/dollar budget, wall-clock budget. Hit any one and the loop stops. Do not trust the LLM to track its own budget — it will not, because it doesn't know how to.
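Beyond hard limits, use a no-progress detector: hash the last three (tool, args) pairs and abort if they are identical. This catches the most common loop, where the model keeps retrying the same failing call. An exact-match hash misses retries with slightly different phrasing, so normalise the arguments before hashing (lowercase, strip whitespace) or fall back to checking whether the agent's state changed. Also log a loop-detected event so you can count how often agents hit it; that count is your quality signal.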
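The three budgets and the repeat detector can live in one guard object that the orchestrator calls before every step. A sketch with made-up default limits, and `RunGuard` and `BudgetExceeded` are hypothetical names. The detector here matches exact (tool, args) repeats only.

```python
import hashlib
import json
import time


class BudgetExceeded(Exception):
    pass


class RunGuard:
    """Enforces step, token, and wall-clock budgets outside the LLM."""

    def __init__(self, max_steps=20, max_tokens=50_000, max_seconds=120.0, repeat_window=3):
        self.max_steps, self.max_tokens, self.max_seconds = max_steps, max_tokens, max_seconds
        self.repeat_window = repeat_window
        self.steps = 0
        self.tokens = 0
        self._started = time.monotonic()
        self._recent: list[str] = []

    def check(self, tool: str, args: dict, tokens_used: int) -> None:
        """Call once per agent step; raises as soon as any limit is hit."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded("step budget exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if time.monotonic() - self._started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exceeded")
        # No-progress detector: identical (tool, args) for the whole window.
        payload = json.dumps([tool, args], sort_keys=True).encode("utf-8")
        self._recent = (self._recent + [hashlib.sha256(payload).hexdigest()])[-self.repeat_window:]
        if len(self._recent) == self.repeat_window and len(set(self._recent)) == 1:
            raise BudgetExceeded("loop detected: same (tool, args) repeated")
```

Catching `BudgetExceeded` in the orchestrator is also the natural place to emit the loop-detected event for counting.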
For runtime cost alerting, aggregate per tenant per hour and alarm on 5x-over-baseline. Runaway agents usually cluster around a broken deploy or a single tenant's weird input. Spotting the cluster fast matters more than per-request limits — one user spamming will always eat some budget, ten users hitting a broken prompt can take down a quarter's margin.
Agents are stochastic and multi-step, so test pyramids from traditional software don't directly transfer. The version that works: tool tests at the bottom, trajectory tests in the middle, end-to-end evals on top.
The key trick is at layer 2: trajectory tests don't check the exact tool sequence — they check that required tools were called and forbidden ones weren't. "The agent must call verify_identity before update_email, and must not call delete_account anywhere" is a reliable invariant. Exact-match tool sequences break every time the model re-plans.
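A trajectory test of this kind is an invariant check over the list of tool names the agent called. A minimal sketch. The function name and the violation message format are assumptions.

```python
def check_trajectory(
    calls: list[str],
    must_precede: list[tuple[str, str]] = (),  # (before, after) ordering rules
    forbidden: list[str] = (),
) -> list[str]:
    """Return a list of violated invariants; an empty list means the run passes."""
    violations = []
    for tool in forbidden:
        if tool in calls:
            violations.append(f"forbidden tool called: {tool}")
    for before, after in must_precede:
        if after in calls:
            first_after = calls.index(after)
            # 'before' must appear somewhere ahead of the first 'after' call.
            if before not in calls[:first_after]:
                violations.append(f"{after} called without prior {before}")
    return violations
```

Any sequence that satisfies the rules passes, so the test survives the model re-planning with extra or reordered read-only steps.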
The layer most teams skip is adversarial and it's the one that matters most for agents with real-world side effects. Have a set of prompt injections, tool-abuse attempts, and policy violations in your eval set, and gate deploys on them passing. This is the eng equivalent of a sandbox test for a release.
"Agent memory" is vague. Senior candidates split it into four concrete kinds and wire each to its own store.
| Type | Contents | Store | Retrieval |
|---|---|---|---|
| Working | Current task scratchpad | Prompt context | Always included |
| Episodic | Past sessions by user | DB + vector | Recency + similarity |
| Semantic | User prefs & facts | Structured row / JSON | Always included, small |
| Procedural | "How we do X" patterns | Prompt / tools | Baked into system prompt |
The most valuable memory for production agents is semantic memory stored as structured facts, not prose. "User prefers metric units, based in Berlin, has project IDs PRJ-14, PRJ-27" is a few dozen tokens and a lossless feed into every new session. Beats any vector-based "memory system" you can build because it's deterministic and auditable.
Episodic memory is where people over-engineer. The brutal truth: you rarely need to retrieve specific past conversations. You need to retrieve the facts from them. Build a pipeline that extracts facts from episodes into structured memory and throws away the episode text after a while. Storage is cheap but context is expensive.
Handoffs are where multi-agent systems earn their keep — or fall apart. The three things that must cross the handoff cleanly: (1) the goal, (2) the facts gathered so far, (3) the constraints and budget remaining. Miss any of them and the receiving agent starts from zero.
The common anti-pattern is passing the entire conversation history across the handoff. That dumps the noise (exploration, false starts, errors) on the receiver, blows its context, and confuses its planning. Compress to structured facts first. The senior pattern: handoff is a pure function call with typed inputs and outputs, not a stream of consciousness.
The return path matters just as much. If the delegated agent fails or times out, the orchestrator needs a structured failure back, not silence. Design handoffs like RPC calls with typed success and failure envelopes — that one discipline makes multi-agent systems debuggable instead of mystical.
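Treating the handoff as a typed call might look like the sketch below. The envelope fields are one possible shape, not a standard, and `worker` is any callable that accepts a `Handoff` and returns a `Success` or `Failure`.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Handoff:
    goal: str
    facts: dict            # compressed findings, not the raw conversation
    remaining_steps: int
    remaining_tokens: int


@dataclass(frozen=True)
class Success:
    result: str
    tokens_used: int


@dataclass(frozen=True)
class Failure:
    reason: str
    retryable: bool
    tokens_used: int


def delegate(handoff: Handoff, worker: Callable[[Handoff], Union[Success, Failure]]) -> Union[Success, Failure]:
    """Call a sub-agent and always return a structured envelope, never silence."""
    if handoff.remaining_steps <= 0 or handoff.remaining_tokens <= 0:
        return Failure("budget exhausted before handoff", retryable=False, tokens_used=0)
    try:
        return worker(handoff)
    except TimeoutError:
        return Failure("worker timed out", retryable=True, tokens_used=0)
    except Exception as exc:
        return Failure(f"worker crashed: {type(exc).__name__}", retryable=False, tokens_used=0)
```

The orchestrator can then branch on the envelope type and the `retryable` flag instead of parsing free text from the sub-agent.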
Retrieval is the discipline that decides whether your model looks smart or clueless. Senior questions skip the definitions and go straight to the production tradeoffs that make or break an answer.
The senior framing: RAG has two pipelines, not one — indexing and querying — and they have different SLAs. Indexing is batch, idempotent, and fault-tolerant. Querying is interactive, latency-sensitive, and read-only. Conflate them and you either overpay for batch or underperform at query time.
Walk the interviewer through the full pipeline and call out the non-obvious choices: (1) Query rewriting — a dedicated step where you expand the raw user question into a search-friendly form, decontextualize pronouns, and sometimes generate multiple queries. This single step is often the largest quality lever. (2) Hybrid search — BM25 + dense is the production default; pure vector loses precision on exact-match queries. (3) Re-ranking — a cross-encoder on the top-50 before the top-5 are sent to the LLM. (4) Answer verification — a second pass that checks every claim is grounded in a retrieved source.
The staff-level detail: metadata filters, not vector similarity, are the most important production lever. User region, document type, permission scope, date range — these should be hard filters applied before vector search, not post-hoc. Skip this and you'll spend quarters tuning embeddings to fix a problem a SQL WHERE clause would solve instantly.
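The ordering matters: filter first, rank second. A toy sketch with in-memory documents and hand-rolled cosine similarity. In practice the filter would be a WHERE clause or the vector store's own metadata filter, and the document shape here is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: list[float], docs: list[dict], filters: dict, k: int = 5) -> list[str]:
    """Apply metadata as a hard filter, then rank only the survivors by similarity."""
    candidates = [
        doc for doc in docs
        if all(doc["meta"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return [doc["id"] for doc in ranked[:k]]
```

With a hard `region` filter, a document for the wrong region cannot reach the prompt however similar its embedding is to the query.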
Chunking is the most underrated quality lever in RAG. The wrong answer is "512 tokens with 50 overlap" — that's a default, not a strategy. The right answer starts with "what's the shape of my documents?" and builds from there.
| Strategy | Best for | Tradeoff |
|---|---|---|
| Fixed-size | Uniform text, PDFs | Cuts across sentences; cheap baseline |
| Sentence / paragraph | Prose, blog posts | Variable length, more semantic |
| Semantic (embedding gap) | Mixed content | Expensive at index, cleaner chunks |
| Structural (markdown / headings) | Technical docs, wikis | Needs clean source, best retrieval |
| Late chunking | Long coherent docs | Embeds full doc first, chunks the outputs |
| Contextual (Anthropic) | Dense reference material | Prepends doc context to each chunk — quality bump |
My production default for heterogeneous corpora: structural chunking on headings + contextual retrieval. The structural pass gives you chunks that respect document boundaries (no cutting mid-sentence), and contextual retrieval fixes the "chunk orphaned from its parent" problem by prepending a short, LLM-generated context that situates the chunk within its document before embedding. Anthropic's Contextual Retrieval write-up reported roughly 35% fewer top-20 retrieval failures from contextual embeddings alone and roughly 49% when combined with contextual BM25, for a modest index-time cost.
The thing to call out explicitly: chunking quality is bounded by parsing quality. If you feed the chunker garbage text from badly parsed PDFs or scraped HTML, no chunking strategy will save you. In my experience roughly half of RAG quality problems sit upstream of chunking, in ingestion and normalization. Fix parsing first.
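A structural chunker for markdown is short enough to sketch. Note the assumption: instead of an LLM-generated context, this version prepends a plain "title > heading" breadcrumb as a cheap stand-in, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)")


def chunk_markdown(doc_title: str, text: str) -> list[dict]:
    """Split on headings and prefix each chunk with where it came from."""
    sections: list[list] = []
    current = ["(intro)", []]  # [heading, lines]
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = [match.group(1).strip(), []]
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip headings with no content under them
            chunks.append({"heading": heading, "text": f"[{doc_title} > {heading}]\n{body}"})
    return chunks
```

Swapping the breadcrumb for a model-written sentence that situates the chunk is a change to one line, which makes it easy to A/B the two against a labelled retrieval set.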
Decoupling retrieval eval from generation eval is the single most important discipline in RAG. If you only measure end-to-end answer quality, you can't tell whether a regression is because the retriever missed the doc or the generator botched the synthesis.
To measure any of this you need a labelled retrieval set — queries paired with ground-truth document IDs. Build one from 200–500 real queries, have humans mark the correct docs, and run every retrieval change against it. Frameworks like RAGAS can substitute an LLM judge for the human labels in a pinch, but I'd still want a human-labelled gold slice for anything safety-critical.
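Once the labelled set exists, scoring a retriever against it takes a few lines. The sketch below uses recall@k and reciprocal rank, two standard choices. Picking these two metrics, and the function names, are my assumptions and not something the labelled set dictates.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the ground-truth docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_recall_at_k(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> float:
    """Average recall@k over every labelled query."""
    if not gold:
        return 0.0
    return sum(recall_at_k(runs.get(q, []), docs, k) for q, docs in gold.items()) / len(gold)
```

Here `gold` maps each query to its human-marked document IDs and `runs` maps the same queries to what the retriever returned, so every retrieval change yields one comparable number.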
The metric most people skip is Context Precision: of the top-K chunks you sent to the LLM, what fraction were actually used in the final answer? High precision means you can shrink K (cheaper prompts). Low precision means the re-ranker is broken or the prompt is wasteful. Measure this and every decision you make about K becomes quantifiable.
The answer in 2026 is simple: almost always hybrid. Pure vector is elegant in papers and brittle in production — it loses to BM25 on anything involving exact product names, IDs, error codes, quoted phrases, or rare jargon. Hybrid is the boring, correct default.
Fusion: the standard approach is reciprocal rank fusion (RRF) — merge ranked lists from each retriever by summing 1/(k+rank). Cheap, needs no tuning, works. More sophisticated: learn to weight the two signals per query type, but the returns diminish fast.
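RRF itself is only a few lines. A sketch, assuming each retriever returns an ordered list of document IDs. The constant `k=60` is the commonly used default and is an assumption here, as is the alphabetical tie-break.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for every doc in every list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 list `["a", "b", "c"]` with a dense list `["b", "c", "d"]` puts `b` and `c` first because both retrievers returned them, then `a`, then `d`.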
Pure-vector-only is defensible in two cases: (1) your corpus is tiny and well-curated (say, a 200-FAQ knowledge base, where lexical matching adds little over a good embedding model), or (2) you're doing cross-lingual retrieval (English queries hitting French docs, where BM25 can't help you). For anything else, ship hybrid from day one. You'll spend the same engineering effort on pure vector and get worse outcomes.
Staleness has two sides: content drift (docs changed, embeddings didn't) and model drift (embedding model changed, old embeddings don't align with new queries). Most teams plan for the first and ignore the second.
Practically: hook your ingestion to a change-data-capture stream from your source of truth (database or document store). Every change emits an event, a worker re-embeds just the affected chunks, and upserts them. Never reindex-the-world unless you absolutely must — reindexing is expensive, creates temporary inconsistency, and hides bugs.
For embedding model upgrades, the rule is brutal: never mix versions in one index. Old embeddings and new embeddings live in different spaces. Blue/green the whole index: reindex to v2 behind a flag, shadow-query to verify, flip traffic. This is the single most common "why is retrieval mysteriously worse" root cause I've seen.
Re-ranking is the second-pass quality amplifier on top of first-stage retrieval. First stage (BM25 + vector) is fast and recall-oriented: get the top 50–100 candidates. Second stage (cross-encoder) is slow and precision-oriented: rerank those candidates down to the top 5 that go to the LLM.
Model choice depends on budget. Hosted options: Cohere Rerank (best quality, API call, predictable latency), Voyage Rerank (competitive quality, cheaper). Self-hosted: BGE-reranker or Jina Reranker (free, 50–200ms on a GPU, good enough for most cases). For tiny budgets, a well-tuned BM25 + vector fusion often matches a naive cross-encoder — don't reach for rerankers before you've tuned the first stage.
When is re-ranking not worth it? When your first-stage precision is already high (rare), when your latency budget is <300ms (interactive autocomplete), or when your top-K is already 3–5 and re-ranking just reorders the same small set. In practice: if your first stage returns 50 candidates and your LLM sees only 5, re-ranking is almost always worth the cost.
Debug RAG by walking the pipeline in order and answering three questions: (1) was the right doc in the corpus? (2) did retrieval find it? (3) did the LLM use it correctly? Most teams jump to the LLM first; the correct order starts at the corpus and ends at the model.
Most real RAG failures turn out to be at step 1 or 2, not at the model. The doc wasn't in the corpus, or it was but parsed poorly, or the chunk that contained the answer got separated from the chunk that contained the context that makes the answer recognisable. The senior habit: always grep the raw corpus before blaming the model.
Make this debug loop fast. Every RAG system should have a "replay query" tool that takes a question and shows: rewritten query, BM25 results, vector results, fused top-K, reranker output, chunks sent to the LLM, and final answer. Thirty seconds to diagnose — that's the tool that pays for itself in the first week.
At a billion documents, none of the defaults apply. You're no longer doing vector search on a laptop — you're designing a distributed system where the retrieval layer has its own on-call rotation. The key design moves: sharding, approximate search, and aggressive pre-filtering.
The biggest lever isn't the vector DB; it's sharding by the filter most queries use. If every query is scoped to a tenant and a date range, partition the index by (tenant, month). A query then hits 1% of the corpus, not 100%. That is roughly a hundred-fold reduction in search work, for free, before you touch any ANN parameters.
Quantization is underrated. Binary embeddings (1 bit per dim) are surprisingly competitive when paired with a reranker, and they cut memory by 32x. The pattern: binary for first-stage retrieval on the billion-doc shard, full-precision vectors for the few thousand you actually rerank. This is how the frontier search systems scale.
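A toy sketch of the binary-first, full-precision-rerank pattern, using plain Python integers as bit vectors. The sign-based binarisation, the function names, and the three sample vectors are illustrative assumptions; a production system would use packed arrays and an ANN index.

```python
def to_binary(vec):
    """Pack the sign of each dimension into one Python int (1 bit per dim)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def search(query, docs, shortlist=2, top_k=1):
    """docs: {doc_id: float vector}. Binary first stage, float rerank."""
    q_bits = to_binary(query)
    codes = {d: to_binary(v) for d, v in docs.items()}  # precomputed offline in practice
    candidates = sorted(docs, key=lambda d: (hamming(q_bits, codes[d]), d))[:shortlist]
    return sorted(candidates, key=lambda d: -dot(query, docs[d]))[:top_k]


docs = {
    "x": [0.9, 0.1, -0.3, 0.4],
    "y": [0.2, 0.8, -0.1, 0.1],
    "z": [-0.7, -0.2, 0.5, -0.6],
}
print(search([1.0, 0.2, -0.2, 0.3], docs))  # ['x']
```

Here `x` and `y` share the same binary code as the query, so only the full-precision rerank separates them, which is exactly the division of labour described above.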
Finally — and this is the staff-level framing — most teams that think they need billion-doc RAG don't. They need filtered retrieval over the 100k docs their user actually cares about. Before designing a distributed index, ask whether the working set per query is actually that large.
If an interviewer asks only one thing about this topic, they'll ask something deceptively simple that exposes whether you've shipped without evals. These are those questions.
Start with the smallest thing that works and grow. Day one: a CSV, a script, and a golden set of 30 examples. Day ninety: sliced metrics, automated regression, CI integration. Day three-sixty: online evals, drift alerts, per-tenant quality tracking. The mistake is trying to jump from zero to the framework you'd see at OpenAI — you'll build infrastructure nobody uses.
The framework needs four things and nothing else at the start: a dataset (examples + expected criteria), a scorer (rule-based, LLM-judge, or human), a slicer (break results by segment — intent, model, user tier), and a runner (script that produces a comparable report between two runs). Everything else is infrastructure on top.
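Those four pieces can start out very small. The sketch below is one possible shape, with a hypothetical two-row dataset and a keyword-based scorer standing in for whatever scorer you actually use.

```python
from collections import defaultdict

# Dataset: examples plus the criteria they are scored against (hypothetical rows).
DATASET = [
    {"input": "How do I get a refund?", "must_include": "refund", "segment": "billing"},
    {"input": "Reset my password", "must_include": "reset", "segment": "account"},
]


def rule_scorer(example, output):
    """Scorer: 1.0 if the required keyword appears in the output, else 0.0."""
    return 1.0 if example["must_include"] in output.lower() else 0.0


def run_eval(generate, dataset=DATASET, scorer=rule_scorer):
    """Runner + slicer: call the system on every example, return mean score per segment."""
    by_segment = defaultdict(list)
    for ex in dataset:
        by_segment[ex["segment"]].append(scorer(ex, generate(ex["input"])))
    return {seg: sum(s) / len(s) for seg, s in by_segment.items()}


def compare(baseline, candidate):
    """Report: per-segment score delta between two runs."""
    return {seg: candidate[seg] - baseline[seg] for seg in baseline}


old = run_eval(lambda q: "Please contact support.")
new = run_eval(lambda q: "You can request a refund in Settings.")
print(compare(old, new))  # {'billing': 1.0, 'account': 0.0}
```

The `generate` argument is any callable that maps an input to a model output, so the same runner works for a prompt change, a model swap, or a retrieval change.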
Senior candidates separate offline evals (run on a fixed dataset in CI) from online evals (run on live traffic in prod). Offline catches regressions pre-ship. Online catches problems that only appear with real inputs. Both are necessary; most teams have only offline and wonder why prod feels different from CI.
LLM-as-judge is powerful and it's also how most teams fool themselves. The failure mode is the judge agreeing with the model it's judging because they share a lineage. Your eval metric silently becomes "does output A look like output B" instead of "is output A good."
Most durable framing: LLM judge is a signal, human labels are ground truth. Use the judge for throughput (run it on thousands of examples per change) and humans for calibration (spot-check 50 cases a week to ensure the judge still agrees with reality). If agreement drops below 80%, the judge prompt is stale and needs updating.
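The calibration check is simple arithmetic. A sketch, with made-up labels and the 80% threshold from above as the default:

```python
def agreement_rate(judge_labels, human_labels):
    """Share of spot-checked cases where the judge verdict matches the human one."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_is_stale(judge_labels, human_labels, threshold=0.80):
    """True when agreement has fallen below the threshold."""
    return agreement_rate(judge_labels, human_labels) < threshold


# Hypothetical weekly spot-check, shortened to 10 cases for readability.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
print(agreement_rate(judge, human), judge_is_stale(judge, human))  # 0.7 True
```

Note that all three disagreements here are cases the judge passed and the human failed, which is the lenient-judge pattern worth looking for before rewriting the rubric.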
The rubric matters more than the model. A specific, criterion-based rubric ("scored 5 if answer addresses all 3 sub-questions, cites at least 1 source, and contains no unsupported claims") outperforms generic "rate helpfulness 1-5" by a wide margin. Invest in the rubric.
Leakage is subtle and it's the reason most teams over-trust their own eval numbers. Three leakage paths to defend: (1) training data leak (eval examples end up in fine-tune set), (2) prompt-tuning leak (you keep tweaking the prompt until eval scores go up — you've now overfit to eval), (3) provider leak (your eval examples were in the base model's pretraining data).
The pattern that fixes prompt-tuning leak: split evals into dev (used while iterating) and test (run once before shipping, never iterated on). Every time you look at the test set and change behaviour, you've polluted it. In practice teams don't have the discipline for this, so add a locked "gold" slice whose individual examples even the engineers can't see; they get only the aggregate score.
For pretrain leak, the honest truth is you can't perfectly control what was in a hosted model's training data. The mitigation is to add novel, domain-specific examples you wrote yourself — these definitely weren't in the pretrain data. Don't rely on public benchmarks (MMLU, GSM8K) as your primary eval; they're memorised to some degree by every frontier model.
Treat prompts like code. A prompt change is a diff, a regression test is an eval run, and CI blocks a merge that regresses a protected slice. The pipeline is Promptfoo-style: a YAML config defines providers, tests, and assertions, and an eval run fails the build if any assertion fails.
Three things the test harness must do: (1) Deterministic replay — pin model version and temperature so results are comparable. (2) Slice-level gates — an aggregate 2% lift is fine, but a 10% drop on the "billing" slice is a block regardless. (3) Visible diffs — the PR reviewer sees "score on refund questions dropped from 0.82 to 0.64" with specific examples. Narrative beats numbers.
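A slice-level gate can be a single function that CI calls after the eval run. This is a sketch: the slice names, scores, and the 5-point `max_drop` tolerance are assumptions chosen to mirror the refund example above.

```python
def gate(baseline, candidate, protected, max_drop=0.05):
    """Return the protected slices whose score dropped by more than max_drop.

    baseline / candidate: {slice_name: score}. An empty result means the gate passes.
    """
    failures = {}
    for name in protected:
        drop = baseline[name] - candidate[name]
        if drop > max_drop:
            failures[name] = round(drop, 4)
    return failures


baseline = {"overall": 0.80, "billing": 0.82, "refunds": 0.82}
candidate = {"overall": 0.82, "billing": 0.81, "refunds": 0.64}
blocked = gate(baseline, candidate, protected=["billing", "refunds"])
print(blocked)  # {'refunds': 0.18}
```

The aggregate score went up in this example, yet the gate still blocks because a protected slice regressed.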
The discipline that matters: regressions block by default, you have to explicitly override with a reason. Teams that let regressions merge "to unblock" ship a worse product every sprint. Teams that block by default either fix the regression or have a conversation about why this one is acceptable. Both are better than shipping blind.
Observability for an AI system has the same three pillars as any distributed system — logs, metrics, traces — but each pillar has a specific flavour. Traces especially: a single user request can produce a five-step agent trace with tool calls, retrieval hops, and reranker passes, and you need all of it in one view.
The breakthrough is trace → eval → fix as a loop: a trace is a structured record of a single interaction; you can attach a score (from a judge or user) to it; bad scores bubble up into a review queue; the review produces an eval example and a fix. Tools like Braintrust, Langfuse, Phoenix, and LangSmith are built for this shape. Homegrown works too but you'll rebuild half of those tools.
Two details senior candidates always mention: (1) Correlate tokens with user IDs, not just request IDs — that's how you find which users are driving cost or breaking things. (2) Sample-and-log your prompts in full for X% of traffic and store them for 30+ days. When someone complains tomorrow about yesterday's answer, you need to be able to show them exactly what the model saw.
The framework is standard experimentation — randomise users, hold the exposure stable, measure a primary metric, wait for significance — but the metric choice is the hard part. Unlike a classical web A/B test, there isn't a single conversion rate; quality is multi-dimensional.
| Metric family | Examples | Watch out for |
|---|---|---|
| Quality | Judge score, refusal rate | Judge drift over time |
| Engagement | Follow-up rate, session length | Longer ≠ better |
| Outcome | Task completed, escalation rate | Slow to accumulate |
| Cost | Tokens/req, latency | Easy to forget, easy to blow |
| Safety | Policy violations, PII leaks | Must be a guardrail, not a trade |
Use a primary metric + guardrails: pick one metric you're trying to move (say, judge score), and guardrails you won't trade (cost, latency, safety). A winning experiment must lift the primary without breaking a guardrail. Without this structure, teams ship changes that improve one axis and silently regress another.
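The decision rule can be written down before the experiment runs. A sketch with hypothetical guardrail names and limits:

```python
def ship_decision(primary_lift, guardrails, min_lift=0.0):
    """guardrails: {name: (observed, limit)}; observed above limit is a breach."""
    breaches = [name for name, (observed, limit) in guardrails.items() if observed > limit]
    if breaches:
        return "reject", breaches
    if primary_lift > min_lift:
        return "ship", []
    return "no-change", []


result = ship_decision(
    primary_lift=0.04,  # judge score lift vs control
    guardrails={
        "p95_latency_s": (3.4, 3.0),
        "cost_per_req_usd": (0.011, 0.012),
        "policy_violation_rate": (0.0, 0.001),
    },
)
print(result)  # ('reject', ['p95_latency_s'])
```

A positive lift on the primary metric never overrides a breached guardrail, which is the whole point of separating the two.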
The trap in stochastic systems: temperature > 0 adds noise, and noise increases the sample size you need. Run experiments at temperature 0 when possible, or size your sample 2-3x what a classical test would demand. And never compare model A at T=0 to model B at T=0.7, or you'll mistake randomness for quality.
Most real LLM tasks have no ground truth — summarisation, creative writing, open-ended Q&A all have many good answers. The senior move is to measure what you can and use proxies with honesty about what they're measuring.
The most underused technique is measurable sub-criteria: break "was this a good answer?" into three or four objective checks — did it include a citation? did it cover all sub-questions? did it refuse unsafe requests? did it return in the right format? — and score each one independently. Four 0/1 checks per example beats one hand-wavy 1-5 rating every time.
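A sketch of what four independent 0/1 checks might look like. The specific checks (bracketed citations, keyword coverage, a word limit, a clean ending) are placeholder assumptions; substitute the criteria that matter for your feature.

```python
import re


def score_answer(answer, required_terms, max_words=80):
    """Run four independent pass/fail checks and return each as 0 or 1."""
    text = answer.lower()
    checks = {
        "has_citation": bool(re.search(r"\[\d+\]", answer)),
        "covers_sub_questions": all(t.lower() in text for t in required_terms),
        "within_length": len(answer.split()) <= max_words,
        "ends_cleanly": answer.rstrip().endswith((".", "?", "!")),
    }
    return {name: int(ok) for name, ok in checks.items()}


good = score_answer("Refunds take 5 days [1]. Shipping is free over $50 [2].",
                    required_terms=["refund", "shipping"])
bad = score_answer("It depends", required_terms=["refund", "shipping"])
print(good)
print(bad)
```

Because each check is scored separately, a regression shows up as "citations dropped" rather than as a vague dip in an overall rating.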
For truly subjective tasks, lean on pairwise comparisons against a baseline. "Is this output better than what the old prompt produced?" is a question a judge can answer reliably even when "is this output good?" is hopeless. You lose absolute quality tracking and gain comparability — usually a good trade.
The hardest and best signal is downstream outcome: did the user's task actually get done? Did the ticket get resolved without a follow-up? Did the code the agent wrote pass the tests? When you can tie quality to a real-world outcome, you stop arguing about judge prompts.
This is a judgement question disguised as a process question. The answer frames it as impact × reach × tractability, applied to clusters of signals, not individual signals.
The clustering step is where staff-level engineers differ. Junior engineers sort a spreadsheet by frequency. Staff engineers find the shared root cause: half the 200 signals might all be one failure mode ("retrieval missing recent docs"), and one fix lifts all hundred.
The pattern that fails: shipping 10 fixes that each touch a different part of the prompt. Each fix might be net-positive, but together they conflict, and your eval scores oscillate. Batch related fixes, ship them together, and eval as one change. Fewer, bigger, more verified.
The questions your finance partner cares about and the questions that separate engineers who shipped a demo from engineers who shipped a P&L-accountable service.
The right first move is instrument, then act. Most teams try to optimise before they know where the tokens go. Always profile first: what percent of spend is on which model, which product surface, which tenant, prompt vs completion tokens? The answer usually surprises people — one feature or one tenant will be 60% of cost.
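Profiling can start as a group-by over usage logs. The row schema below (`tenant`, `feature`, `cost_usd`) is a hypothetical one, not any provider's export format.

```python
from collections import defaultdict


def spend_share(usage_rows, dimension):
    """Return each value of `dimension` with its share of total cost, largest first."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row[dimension]] += row["cost_usd"]
    grand_total = sum(totals.values())
    shares = {key: cost / grand_total for key, cost in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: -kv[1]))


rows = [
    {"tenant": "acme", "feature": "chat", "cost_usd": 60.0},
    {"tenant": "acme", "feature": "summaries", "cost_usd": 15.0},
    {"tenant": "globex", "feature": "chat", "cost_usd": 25.0},
]
print(spend_share(rows, "tenant"))   # {'acme': 0.75, 'globex': 0.25}
print(spend_share(rows, "feature"))  # {'chat': 0.85, 'summaries': 0.15}
```

Running the same function over model, surface, and prompt-vs-completion dimensions answers the profiling questions listed above.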
Three levers that stack cleanly: (1) Prompt caching: if your system prompt is stable, Anthropic and OpenAI both offer prefix caching that discounts input tokens on the cached portion, by up to ~90% depending on provider and model. Moving a 3k-token system prompt from uncached to cached is a single-afternoon change and can be a 30% cost win. (2) Model routing: downshift easy queries to a smaller model, keep the frontier model for what needs it. (3) Prompt shrinking: audit your prompts for copy-paste accretion and cut 20% of tokens; the change is almost always invisible in quality.
What you do not do first is rewrite to self-hosted. Self-host pays off at high volume and stable traffic, but the engineering-months are non-trivial and you lose the provider's reliability and model-upgrade treadmill. Reach for it after the easy wins.
A latency budget is a contract: "p95 end-to-end latency for this feature is 3 seconds". You split that budget across stages and give each stage a sub-budget. When any stage blows its sub-budget, you know exactly what to fix.
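One way to make the contract checkable is to write the sub-budgets down as data. The stage names and the split below are hypothetical, and adding per-stage p95 values is a conservative simplification, since stage tails rarely coincide.

```python
# Hypothetical split of a 3-second p95 budget across pipeline stages (milliseconds).
BUDGET_MS = {"retrieval": 300, "rerank": 200, "llm_first_token": 1000, "llm_generation": 1500}
assert sum(BUDGET_MS.values()) == 3000  # sub-budgets add up to the contract


def over_budget(observed_p95_ms, budget=BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage that exceeded its sub-budget."""
    return {
        stage: observed_p95_ms[stage] - limit
        for stage, limit in budget.items()
        if observed_p95_ms.get(stage, 0) > limit
    }


observed = {"retrieval": 280, "rerank": 350, "llm_first_token": 900, "llm_generation": 1400}
print(over_budget(observed))  # {'rerank': 150}
```

The output names the stage to fix and by how much, even though the total here is still under three seconds.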
The user-perceived metric to defend is time-to-first-token (TTFT) for streaming responses, not total latency. Users forgive a 6-second answer that starts flowing in 400ms; they hate a 3-second answer that blanks the screen for 3 seconds and dumps. Stream by default. Show progress. Pre-send an acknowledgement if there's a retrieval step.
Three levers for latency: (1) Parallelise anything independent: run BM25 and vector search in parallel, not sequentially. (2) Cache the stable parts: prompt prefix, embedding of the last query, reranker scores. (3) Cut tokens: generation time scales with output length, so a shorter output is a faster output.
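The first lever, sketched with `asyncio`. The two search functions are stand-ins that sleep instead of calling a real index; their names and return values are made up.

```python
import asyncio


async def bm25_search(query):
    await asyncio.sleep(0.05)  # stand-in for a real keyword-index call
    return ["doc-1", "doc-2"]


async def vector_search(query):
    await asyncio.sleep(0.05)  # stand-in for a real vector-index call
    return ["doc-2", "doc-3"]


async def retrieve(query):
    # The two searches are independent, so run them concurrently:
    # wall-clock time is roughly the slower of the two, not their sum.
    bm25_hits, vector_hits = await asyncio.gather(bm25_search(query), vector_search(query))
    return bm25_hits, vector_hits


print(asyncio.run(retrieve("error code E1234")))
```

`asyncio.gather` preserves argument order, so the first result always belongs to BM25 regardless of which call finishes first.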
There are four caching surfaces and each answers a different question. Confusing them is the source of most LLM-cache bugs.
| Cache | Keyed on | What it saves | Watch out |
|---|---|---|---|
| Prompt prefix | Prompt hash | Input token cost on cached portion | Any prefix change invalidates |
| Exact response | Full prompt hash | Full call cost | Non-determinism across temps |
| Semantic | Embedding similarity | Full call cost on paraphrases | False hits are silent errors |
| KV cache | Session continuity | Inference compute on same session | Server-side, framework-specific |
The safe default: always use prompt-prefix caching (free, enabled at the API level, correctness preserved). Exact-response cache is fine for deterministic calls (temperature 0, no tools, no randomness in retrieval). Semantic caching is dangerous — the whole point is that similar-but-not-identical prompts share an answer, which is fine for FAQ but catastrophic for "what's my account balance?" where two similar questions have different right answers.
The rule of thumb: cache answers to questions whose answer doesn't depend on mutable state. "What are your business hours?" is cacheable forever. "What's the status of my order?" is never cacheable. "What's the weather in NYC today?" is cacheable for 15 minutes. Classify your queries, tag them, and cache-by-tag.
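Cache-by-tag can be as small as a TTL table keyed on query class. In this sketch the tag names and the `TaggedCache` class are hypothetical; the TTLs follow the three examples above, and the clock is injectable so expiry can be tested without waiting.

```python
import time

# TTL in seconds per query class; None means never cache.
TTL_BY_TAG = {"static": float("inf"), "slow_changing": 15 * 60, "user_state": None}


class TaggedCache:
    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def put(self, key, value, tag):
        ttl = TTL_BY_TAG[tag]
        if ttl is not None:  # uncacheable classes are silently skipped
            self._store[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or self._clock() >= entry[1]:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]


now = [0.0]
cache = TaggedCache(clock=lambda: now[0])
cache.put("business hours", "9-5", tag="static")
cache.put("weather nyc", "sunny", tag="slow_changing")
cache.put("order status", "shipped", tag="user_state")
now[0] = 16 * 60  # sixteen minutes later
print(cache.get("business hours"), cache.get("weather nyc"), cache.get("order status"))
# 9-5 None None
```

The classification step (deciding which tag a query gets) is the hard part and is left out here; a misclassified "user_state" query is the failure to guard against.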
Capacity planning for LLMs is different because the unit isn't requests; it's tokens per second per tier. A request for 32k tokens in and 1k tokens out moves 30x the tokens of a 1k-in 100-out request on the same model (33,000 vs 1,100). Plan capacity in tokens, and you'll stop being surprised by traffic that "looked flat" but spent 5x more.
Two uncomfortable facts worth naming: (1) Provider rate limits are the real ceiling, not your wallet. You can have unlimited budget and still hit a 500k TPM wall that takes weeks to raise. Plan at least 2 quarters ahead on quota negotiations. (2) Input tokens grow faster than output tokens as your product matures, because you add context, tools, memory, and retrieval. Forecast the growth direction, not just the magnitude.
For self-hosted inference: the unit isn't TPM, it's tokens-per-second per GPU. As an illustrative figure, a single H100 running vLLM with a quantised Llama-70B at batch size 32 does roughly 2000 output tokens/sec (the FP16 weights alone exceed one 80 GB card, so benchmark your own setup rather than trusting a quoted number). That's your planning unit. Utilisation below 60% is wasted spend; above 85% means tail latency is collapsing. Tune the batch size to your mix.
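Turning that planning unit into a GPU count is one line of arithmetic. In this sketch the peak-load figure and the 75% target utilisation (the midpoint of the 60–85% band above) are assumptions.

```python
import math


def gpus_needed(peak_output_tokens_per_sec, per_gpu_tokens_per_sec, target_utilisation=0.75):
    """GPUs required so that peak load sits at the target utilisation."""
    usable = per_gpu_tokens_per_sec * target_utilisation
    return math.ceil(peak_output_tokens_per_sec / usable)


print(gpus_needed(peak_output_tokens_per_sec=12_000, per_gpu_tokens_per_sec=2_000))  # 8
print(gpus_needed(peak_output_tokens_per_sec=13_000, per_gpu_tokens_per_sec=2_000))  # 9
```

The per-GPU throughput should come from a benchmark on your own prompt and output length mix, since it moves a lot with batch size and sequence length.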
The analysis is financial + strategic + operational. Financial alone almost always says "self-host at scale" — but the strategic cost of losing the model upgrade cycle and the operational cost of running GPUs in production is usually larger than the cash savings.
Key framing: the breakeven isn't when self-hosted is cheaper per token — it's when the cash saved exceeds the engineering + ops + opportunity cost of not shipping features. For most startups under 50 engineers, that moment never comes. For BigCo with a dedicated ML platform team, it comes sooner.
The hybrid pattern: self-host the classifier/embedder/reranker, hosted for the generator. The embedding model is small, high-QPS, and not on the model-upgrade treadmill — cheap to self-host and you get to use the latest frontier LLM for the generation pass. This is what most mature teams land on.
Prompt-level cost optimisation has three moves, in order of safety: (1) delete copy-paste accretion (safest), (2) move static content to cached prefix (safe), (3) compress with a smaller model (risky, evaluate).
Output-side compression is just as important and often bigger: output tokens are typically 3-5x the price of input tokens, so halving the output roughly halves the most expensive part of the bill. Explicit length instructions work: "Answer in 1 paragraph, not more than 80 words" is cheap and the model respects it. Structured output also cuts tokens compared to prose.
Always validate every cut with your eval set. The anti-pattern is "I shrank the prompt by 40% and shipped it" — you don't know if you regressed quality until you measure. Every prompt change is a PR, every PR runs evals, every eval gates the merge.
Every enterprise buying AI asks these questions. Senior engineers are the ones who can answer without hand-waving and show a concrete defence-in-depth plan.
Prompt injection is the SQL injection of the LLM era, and the defence is the same shape: never trust user-supplied or retrieved text as instruction. The senior answer starts with that principle and builds defences in depth from it.
The distinction senior candidates call out: direct injection (user types "ignore previous instructions") versus indirect injection (malicious instructions embedded in a retrieved web page, email, or PDF). Indirect is the one the industry is catching up to — an email your agent is reading can contain instructions in white-on-white text that compromise the agent's behaviour silently.
Two principles I'd articulate: (1) Principle of least privilege — the agent's tools should only expose what the current user is already allowed to do. A compromised prompt can't delete someone else's data if the tool layer rejects the call. (2) Data-flow separation — if the agent has read one user's email, that session should not also have write access to another user's account. Compartmentalise by session.
PII in LLM pipelines has three risk moments: ingest (user types or uploads PII), retention (logs, traces, evals store PII), and egress (output leaks PII from one user to another, or to a third-party provider). Defences at all three layers.
The pattern that works: tokenise PII before it reaches the LLM, detokenise on the way back. A pre-processor replaces "John Smith, SSN 123-45-6789" with "<NAME_1>, <SSN_1>", the LLM reasons over placeholders, and the post-processor swaps them back in the trusted environment. The provider never sees the real values, and if the model leaks a placeholder no harm is done.
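A minimal sketch of the round trip, assuming names arrive from an upstream detector (passed in here as `known_names`) and SSNs are caught with a simple regex. Real deployments would use a proper PII detection step; the helper names are hypothetical.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def tokenise(text, known_names=()):
    """Replace SSNs and known names with placeholders; return (safe_text, mapping)."""
    mapping = {}

    def _swap(kind, value):
        # Reuse the placeholder if this exact value was already seen.
        for placeholder, original in mapping.items():
            if original == value:
                return placeholder
        count = sum(1 for p in mapping if p.startswith(f"<{kind}_")) + 1
        placeholder = f"<{kind}_{count}>"
        mapping[placeholder] = value
        return placeholder

    for name in known_names:
        if name in text:
            text = text.replace(name, _swap("NAME", name))
    text = SSN_RE.sub(lambda m: _swap("SSN", m.group()), text)
    return text, mapping


def detokenise(text, mapping):
    """Swap placeholders back; run this only inside the trusted environment."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


safe, mapping = tokenise("John Smith, SSN 123-45-6789", known_names=["John Smith"])
print(safe)                       # <NAME_1>, SSN <SSN_1>
print(detokenise(safe, mapping))  # John Smith, SSN 123-45-6789
```

The mapping is the sensitive artefact here: it stays in the trusted environment and should never be logged alongside the placeholder text.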
For logs and traces, store placeholders — never raw PII. This is a breaking change when you retrofit it, and the question every enterprise customer asks is "what does your retention look like?" Have a clear answer: "PII is redacted before logging; traces retain placeholders only; raw user input retention is under 24 hours and encrypted at rest."
Don't forget the cross-tenant egress risk. Embeddings computed from one user's data, stored in a shared index, should never be retrievable by another user. Namespace your vector store by tenant and enforce it at the retrieval layer — not just at the application layer.
"Jailbreak" means different things — for a customer-facing agent, it usually means someone tricking the model into saying something harmful or off-brand. The defence isn't trying to make the model "uncrackable" (you can't), it's minimising the blast radius of a successful trick.
The cheapest, highest-leverage defence is an output classifier — a second small-model pass that reads the proposed answer and asks "does this violate policy? is it off-topic? does it say something the brand would never say?" before it reaches the user. Llama Guard, NeMo Guardrails, or a fine-tuned small model all work. Latency cost: ~100ms. Effectiveness: catches the vast majority of policy escapes.
The philosophical point: scope your agent so narrowly that jailbreaks are uninteresting. A customer support bot should refuse any question outside its domain by default. "How do I reset my password?": yes. "What are your thoughts on geopolitics?": "I'm a support assistant, I can only help with your account." This isn't censorship, it's product scoping. Jailbreaks of a narrowly-scoped agent produce at most a mildly embarrassing screenshot, never a data breach.
Over-aggressive guardrails are the most-complained-about feature in AI products. The senior move is to target the bad outcomes, not the bad topics. A medical app can discuss symptoms without becoming a diagnostic tool; a financial app can discuss budgeting without giving personalised advice. Narrow, outcome-based guardrails beat keyword blocklists every time.
Rule I enforce: every refusal must offer a next step. "I can't help with that" is a terrible UX. "I can't recommend specific dosages — but I can show you the manufacturer guidance, or connect you to a pharmacist" respects the boundary and keeps the user moving. That's not a nice-to-have — it's the difference between a 20% refusal satisfaction and a 70% one.
Measure guardrail false-positive rate alongside false-negative rate. Most teams only track "did we miss a bad output?" A good guardrail also tracks "did we block a good output?" Both are defects. A guardrail at 2% FN and 20% FP is worse for the product than one at 5% FN and 3% FP.
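Both rates fall out of the same human-reviewed sample. A sketch with made-up counts, where each record pairs "should this have been blocked?" with "was it blocked?":

```python
def guardrail_rates(records):
    """records: (should_block, was_blocked) boolean pairs from human review."""
    bad = [r for r in records if r[0]]
    good = [r for r in records if not r[0]]
    fn_rate = sum(1 for _, blocked in bad if not blocked) / len(bad)
    fp_rate = sum(1 for _, blocked in good if blocked) / len(good)
    return {"false_negative_rate": fn_rate, "false_positive_rate": fp_rate}


reviewed = (
    [(True, True)] * 19 + [(True, False)] * 1      # 20 genuinely bad outputs, 1 missed
    + [(False, False)] * 72 + [(False, True)] * 8  # 80 good outputs, 8 wrongly blocked
)
print(guardrail_rates(reviewed))
# {'false_negative_rate': 0.05, 'false_positive_rate': 0.1}
```

The sketch assumes the sample contains at least one good and one bad output; the false-positive denominator is the number of good outputs, not the total.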
Auditors ask about data flow, access control, retention, and auditability. Your AI system has to answer those four questions at every layer — just like any other regulated system, but with a few LLM-specific wrinkles.
| Concern | Classic answer | AI-specific addition |
|---|---|---|
| Data flow | Diagram + DPA with vendors | Prompt & response logging, embedding storage |
| Access | RBAC, audit logs | Per-tenant isolation in vector store, namespaced caches |
| Retention | Retention schedule | Log redaction, eval set opt-out, right-to-delete on embeddings |
| Auditability | Immutable logs | Trace every output to prompt hash + model version |
| Sub-processors | Vendor list | Model provider, embedding provider, observability tools |
The specific wrinkle: right-to-delete applied to embeddings. When a GDPR deletion request comes in, you must be able to remove that user's data not just from the primary DB but from every embedding derived from it. Build deletion hooks into your embedding pipeline on day one — retrofitting is painful. The pattern: every chunk row carries a user_id column; delete cascades from user → chunks → vectors.
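The cascade can be demonstrated with SQLite from the standard library. This sketch assumes vectors live in the same relational store as chunks; with an external vector database, the same hook would need to issue an explicit delete by chunk ID. Table and column names beyond `user_id` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces cascades when this is on
conn.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY);
CREATE TABLE chunks  (chunk_id INTEGER PRIMARY KEY,
                      user_id INTEGER NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
                      text TEXT);
CREATE TABLE vectors (chunk_id INTEGER PRIMARY KEY REFERENCES chunks(chunk_id) ON DELETE CASCADE,
                      embedding BLOB);
""")
conn.execute("INSERT INTO users VALUES (1), (2)")
conn.execute("INSERT INTO chunks VALUES (10, 1, 'alice note'), (20, 2, 'bob note')")
conn.execute("INSERT INTO vectors VALUES (10, x'00'), (20, x'01')")

conn.execute("DELETE FROM users WHERE user_id = 1")  # the deletion request
conn.commit()

remaining = [row[0] for row in conn.execute("SELECT chunk_id FROM vectors")]
print(remaining)  # [20]
```

Deleting the user row removes that user's chunks and, through them, the derived vectors, while the other user's data is untouched.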
Two other things auditors love: (1) a data flow diagram showing every hop, every vendor, every store. (2) DPAs (Data Processing Agreements) with every model and tool provider, with zero-retention and no-train flags enabled where available. Get these before the audit, not during.
A real red-team is adversarial, diverse, and documented. Not "we had the QA team try some weird inputs for an afternoon". The senior answer has structure: threat model, attacker personas, a test script, a severity rubric, and a gate for "this doesn't ship until the P1 issues are fixed."
Three persona archetypes to red-team against: the curious user (stumbles on bad outputs by accident; this is most of your traffic), the malicious user (actively tries to break things and posts screenshots), and the naive but high-stakes user (asks a dangerous question without realising it). The playbook should have 30–50 attacks per persona, mixed to cover your feature's domain.
The gate matters more than the red team itself. Without a pre-committed launch gate — "no P0 issues, fewer than N P1s, all issues have a fix plan" — the red team becomes performative. With a gate, it becomes a decision-maker. Build the gate and get the leadership sign-off on it before you run the red team, so the results have teeth.
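A pre-committed gate is easy to encode so nobody can renegotiate it after the results come in. In this sketch the severity labels, the finding schema, and the `max_p1=3` limit are assumptions standing in for whatever leadership signs off on.

```python
from collections import Counter


def launch_gate(findings, max_p1=3):
    """findings: dicts with 'severity' ('P0'/'P1'/'P2') and 'has_fix_plan' (bool).

    Returns (passed, reasons). Rule: no P0s, fewer than max_p1 P1s, every finding planned.
    """
    counts = Counter(f["severity"] for f in findings)
    reasons = []
    if counts["P0"] > 0:
        reasons.append(f"{counts['P0']} open P0 issue(s)")
    if counts["P1"] >= max_p1:
        reasons.append(f"{counts['P1']} P1 issues (limit is fewer than {max_p1})")
    if any(not f["has_fix_plan"] for f in findings):
        reasons.append("finding without a fix plan")
    return (not reasons), reasons


findings = [
    {"severity": "P1", "has_fix_plan": True},
    {"severity": "P2", "has_fix_plan": False},
]
print(launch_gate(findings))  # (False, ['finding without a fix plan'])
```

Returning the reasons alongside the verdict gives the launch review something concrete to act on.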
And finally: every red-team finding becomes a golden eval case. Future model/prompt changes can't regress any previously-found vulnerability without triggering a CI failure. That's how red teams compound into durable quality instead of being a one-time launch ritual.
The half of the interview where the conversation stops being about models and starts being about judgement, people, and how you make decisions when nobody's going to tell you what to do.
Lead with clear stage-gates and explicit tradeoffs. The failure mode in AI teams is permanent prototype energy — everything is a demo and nothing is production. The opposite failure mode is a team so careful they never ship. The leadership job is naming which stage you're in and which rules apply.
The cultural move that makes this work: celebrate the transitions. When a feature moves from proto to beta, it's a milestone. When it moves from beta to GA, it's a bigger one. Teams that don't ritualise the transition end up with five half-GA features and one on-call incident per day — all the same tier of fragility.
On ship fast: my rule is to keep the iteration loop under a day. From "idea" to "evaluated prompt change" should be hours, not weeks. That requires investment in the eval harness, the deploy pipeline, and the rollback tooling. Leaders who don't fund infrastructure in year one pay for it ten-fold in year two.
The senior framing: "research" and "shipping" are not opposites — they're different-cost experiments. A well-run team is constantly running experiments at multiple cost tiers, with clear criteria for promoting a cheap experiment into an expensive one.
Kill criteria at each tier matter more than entry criteria. Most teams have no explicit rule for stopping an experiment that isn't working. Write it down: "Tier 1 is killed if shadow judge score is below baseline after 1000 examples." Without kill criteria, research becomes an indefinite line item.
On the team level, I aim for roughly 70% ship, 20% ship-adjacent research (paying off in this quarter), 10% horizon. The horizon bucket is how you stay ahead on model upgrades and new techniques — skip it and you wake up obsolete. Overfund it and you ship nothing. Most teams land at the wrong ratio in both directions.
AI codebases are famously hard to onboard to because the "source of truth" lives in prompts, evals, and traces — not in the code. A new engineer reading the repo gets maybe 40% of the picture. The onboarding plan has to compensate.
What you don't do: sit them down with the full system prompt and the architecture diagram and expect it to click. The sequence "run → ship small → read scars → observe decisions" compresses six weeks of learning into five days. It works because every step is concrete and produces feedback.
The asset I invest in: a "read this first" doc that isn't a README. It's a 10-page narrative: "here's the feature, here's why we built it this way, here's the thing that nearly broke it, here's the eval that keeps it honest, here's where new engineers have gotten stuck before." One document, refreshed quarterly, beats a dozen auto-generated doc sites.
The losing argument is "our infra is bad and it's embarrassing." The winning argument is "here are three features we couldn't ship last quarter because infra blocked them, and here's the cost of keeping that going." Leaders fund work that prevents measurable pain, not work that satisfies engineering aesthetics.
The framing that actually works with non-technical leaders: talk in dollars and weeks, never in abstractions. "Our eval harness takes 45 minutes to run, which means engineers wait or skip it, which means regressions ship, which means a customer-success rep spends 10 hours/week handling the fallout" — that's a business case. "We need a better eval harness" is a request.
When you get the funding, overcommunicate progress. Weekly update on what's done, what's left, what changed. Infra work is invisible to leadership by default; if you don't write about it, they'll forget you got approval and wonder why features are slow. A 5-minute Friday email beats a 60-minute meeting every time.
Rule 1: resolve technical disagreements with data, not seniority. If two strong engineers disagree, usually they're both looking at different parts of the elephant. Your job is to make the disagreement concrete — specific claim, specific metric, specific test — and let the data decide.
The pattern I use: "disagree and commit" after a time-boxed bake-off. Define the metric and the timeline in advance ("we'll evaluate both approaches on eval set X over 3 days, winner is whichever beats the other by more than 5%"). Lock both engineers into committing to the outcome before the test runs, so the losing side doesn't re-litigate afterwards.
Sometimes the debate is not settleable by data — it's about maintainability, clarity, or long-term direction. In that case the leader's job is to make the call explicitly, own it, and explain the reasoning. "Both approaches work; I'm picking B because it's closer to the direction the platform team is going — I might be wrong, we'll revisit in 6 months." Transparency about your own uncertainty keeps the other engineer's trust.
ROI for AI initiatives falls into three main buckets: revenue (new customers, upsell), cost (deflected work, reduced tool spend), and risk (incidents avoided, compliance posture), with satisfaction as a slower-moving fourth signal. Every AI project must explicitly name which bucket it's in, and the metrics are different for each.
| Bucket | Metric examples | Attribution tricks |
|---|---|---|
| Revenue | Trial → paid conversion, upsell rate, feature-gated ARR | Randomise feature access for 60 days |
| Cost | Tickets deflected, hours saved per agent | Hold-out cohort + before/after |
| Risk | Incidents avoided, SLA hits | Hard — use leading indicators (coverage, drill results) |
| Satisfaction | CSAT, NPS, retention | Segment by exposure, long window |
The hardest one is ticket deflection — everyone wants to claim "our bot deflected 30% of tickets" and nobody can prove it without a control group. The honest measurement: hold 5% of users out of the AI feature, run for 30 days, compare their ticket volume to the exposed group. If you won't do that, don't claim the deflection number.
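The hold-out comparison itself is a per-user rate difference. A sketch with made-up 30-day numbers; it deliberately leaves out the significance test you would want before quoting the figure.

```python
def deflection_rate(holdout_tickets, holdout_users, exposed_tickets, exposed_users):
    """Relative drop in tickets per user for the exposed group vs the hold-out."""
    holdout_rate = holdout_tickets / holdout_users
    exposed_rate = exposed_tickets / exposed_users
    return (holdout_rate - exposed_rate) / holdout_rate


# Hypothetical 30-day numbers: a 5% hold-out from 100,000 users.
rate = deflection_rate(holdout_tickets=1_000, holdout_users=5_000,
                       exposed_tickets=13_300, exposed_users=95_000)
print(f"{rate:.0%}")  # 30%
```

Comparing tickets per user rather than raw ticket counts matters because the hold-out group is twenty times smaller than the exposed group.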
The trap to avoid: vanity metrics like "number of messages sent to the bot". Usage isn't value. A chatbot with 10k daily messages and a CSAT of 2.1 is actively destroying value. Tie every AI project to a business metric downstream of usage — conversion, resolution, retention — or the initiative doesn't justify its budget.
Mentorship in AI engineering is less about teaching facts and more about building instincts for non-deterministic systems. The transition that matters: moving from "does my code run?" to "does my feature produce good outputs, reliably, over a distribution of inputs?"
The single practice I require: read 20 real examples from the system every week. No dashboards, no summaries — actual traces from actual users. Pattern-matching on real data is how senior AI engineers develop intuition, and it's the one practice juniors skip. Make it part of the weekly ritual.
The blocker that's hardest to coach through: the fear of "wasting" an LLM call. Junior engineers over-optimise their prompt before running it because it "feels expensive." Mid-level engineers run it, see it fail, and iterate. You want to coach toward the latter — the learning loop is the whole job.
The clearest framing I've found: senior engineers own outcomes on a feature; staff engineers own outcomes across features. Senior ships the feature reliably; staff designs the platform that makes five teams ship reliably. The scope of what you're accountable for is the main axis.
In AI specifically, the staff signals I look for are: (1) can they design an eval culture, not just write evals for their own feature? (2) can they sequence the team's investments across model upgrades, infra improvements, and features over a 6-month horizon? (3) can they pick the right abstraction — knowing when to add a platform component vs when to let teams keep copy-pasting? That last one is where most senior-to-staff transitions live or die.
The answer to "which are you?" should be honest and forward-looking: "I'm operating at senior today. Here are the staff-level scopes I've taken on and the ones I know I haven't yet." That combination — self-awareness plus a growth direction — is what interviewers want to hear. Overclaiming burns trust. Underclaiming costs you the level.