Design an AI Coding Assistant — A Guided System Design

Design an AI Coding Assistant — the walkthrough in full

A written version of the interactive walkthrough: the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Autocomplete that understands your code

A coding assistant suggests the next lines as you type. To feel magical it must be three things at once: fast (a suggestion in a few hundred milliseconds, or the developer has moved on), context-aware (it knows this file, this repo, these imports), and right enough that accepting saves time instead of introducing bugs. Those three fight each other — more context and a bigger model mean more latency. How do you build inline completion that’s fast and grounded?

An AI coding assistant is a latency-obsessed retrieval + inference system: assemble the tightest useful context (cursor prefix/suffix, open tabs, retrieved repo snippets), feed a code-specialized model doing fill-in-the-middle, and stream a suggestion back — all under a strict time budget, measured by whether developers accept it.

How to read this: Each step opens with a real design decision, then gives the call that ships and the reasoning behind it. At the end, starve the context to see the assistant’s quiet failure.

Step 1 · The skeleton

Keystroke to suggestion, on a deadline

The developer types; you have a few hundred milliseconds to show a useful completion before it’s stale. What’s the request path, and what makes it different from a normal API call?

Design decision: What defines the shape of an inline-completion request?

The call: A debounced, cancellable request to a completion service that streams a suggestion under a latency budget. — The editor debounces keystrokes, sends context to a completion service, and streams back a suggestion — cancelling in-flight requests the instant the developer keeps typing so only fresh work completes.

The Editor debounces keystrokes and sends the cursor context to the Completion Service, which streams a suggestion back under a tight latency budget and cancels in-flight requests as soon as the developer types more. Everything downstream is engineered around that deadline — a completion that’s late is worthless.

Latency is the product: Unlike batch AI, inline completion is judged in milliseconds. Debounce, cancellation, and streaming aren’t optimizations — they’re the core UX. The whole system is shaped by "fast enough to accept before you’d have typed it yourself."
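The debounce-and-cancel behavior described in this step can be sketched with asyncio. This is a minimal illustration, not the shipped design: DebouncedCompleter, fake_fetch and the 50 ms pause are hypothetical names and values, and a real editor would also forward the cancellation to the server.

```python
import asyncio


class DebouncedCompleter:
    """Debounce keystrokes and cancel superseded completion requests."""

    def __init__(self, fetch, delay_s=0.05):
        self._fetch = fetch        # hypothetical async callable: context -> suggestion
        self._delay_s = delay_s    # assumed typing-pause threshold
        self._task = None          # the single pending or in-flight request

    def on_keystroke(self, context):
        # Must be called from inside a running event loop.
        # A new keystroke supersedes whatever is pending or in flight.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._run(context))
        return self._task

    async def _run(self, context):
        await asyncio.sleep(self._delay_s)   # debounce: wait for a typing pause
        return await self._fetch(context)    # reached only if not cancelled


async def demo():
    calls = []

    async def fake_fetch(context):           # stand-in for the Completion Service
        calls.append(context)
        await asyncio.sleep(0.01)
        return f"completion for {context!r}"

    completer = DebouncedCompleter(fake_fetch)
    first = completer.on_keystroke("pri")
    second = completer.on_keystroke("prin")
    last = completer.on_keystroke("print")
    result = await last
    return calls, result, first.cancelled(), second.cancelled()
```

Running asyncio.run(demo()) sends three rapid keystrokes; only the last context should ever reach the fetch function, because the first two tasks are cancelled while still waiting out the debounce.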

Step 2 · What to send

Context assembly and fill-in-the-middle

The model needs to know what you’re writing — but you can’t send the whole repo, and you have a small token budget. And unlike chat, code has text after the cursor too. What context do you assemble?

Design decision: What’s the right context for completing at the cursor?

The call: Cursor prefix + suffix (fill-in-the-middle), plus open tabs, imports, and recent edits — within a token budget. — The Context Builder packs the highest-signal context: the prefix and suffix around the cursor (FIM), nearby open files, imports, and recent edits — fit into a small budget so the model completes into the actual surrounding code.

The Context Builder assembles a fill-in-the-middle prompt — the cursor’s prefix and suffix — plus open tabs, imports, and recent edits, packed into a small token budget by relevance. FIM lets the model complete a span that fits both what came before and after the cursor, which plain left-to-right prompting can’t do.

Signal per token: With a tight budget, context assembly is a ranking problem: get the most useful bytes in. Prefix/suffix and nearby files are the base; the next step widens the lens to the whole repo without blowing the budget.
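As a rough sketch of that ranking problem, the function below keeps the prefix and suffix and then greedily adds extra snippets by relevance until the budget is spent. Everything here is assumed for illustration: the sentinel strings are placeholders (real FIM models define their own special tokens), count_tokens is a crude word count rather than a tokenizer, and placing the snippets ahead of the prefix is just one possible layout.

```python
from dataclasses import dataclass

# Placeholder sentinels: real FIM-trained models define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"


def count_tokens(text):
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


@dataclass
class Snippet:
    label: str    # e.g. "open tab: utils.py"
    text: str
    score: float  # relevance, higher is better (how it is computed is out of scope)


def build_fim_prompt(prefix, suffix, snippets, budget):
    """Keep prefix/suffix, then add snippets by relevance while they fit in `budget`."""
    # A real builder would also truncate the prefix/suffix if they alone exceed the budget.
    used = count_tokens(prefix) + count_tokens(suffix)
    header = ""
    for snip in sorted(snippets, key=lambda s: s.score, reverse=True):
        rendered = f"# {snip.label}\n{snip.text}\n"
        cost = count_tokens(rendered)
        if used + cost <= budget:   # skip anything that would blow the budget
            header += rendered
            used += cost
    return f"{header}{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```

The model is then asked to generate the text that belongs after the final sentinel, that is, the span between prefix and suffix.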

Step 3 · Know the whole repo

Repo-aware retrieval (RAG for code)

The function you’re calling is defined in another file the model has never seen. Local context alone makes the model guess its signature — and guess wrong. How does completion learn about the rest of the codebase?

Design decision: How does the assistant know about code defined elsewhere in the repo?

The call: Retrieve relevant snippets from a repo index (embeddings + symbols) and add them to the context. — A background-built index (semantic embeddings plus a symbol/definition map) lets the assistant pull the definitions and usages relevant to the cursor into the prompt — RAG applied to your codebase.

Repo Retrieval pulls the snippets most relevant to the cursor from a Code Index — embedded chunks for semantic matches plus a symbol index for exact definitions/usages — and folds them into the context. It’s RAG for code: the model completes against the functions and types that actually exist in this repo, not plausible inventions.

Grounding beats guessing: Retrieval is what turns "statistically likely code" into "code that fits your project." The index is built and refreshed in the background; the more relevant it makes the context, the fewer hallucinated symbols — the core quality lever.
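A toy version of the two-part index might look like the following. It is only a sketch under loud assumptions: embed is a bag-of-identifiers counter standing in for a learned embedding model, the symbol map only recognizes Python def/class lines, and CodeIndex is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: counts of identifier-like tokens. A real index would use a learned model."""
    return Counter(re.findall(r"[A-Za-z_]\w*", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class CodeIndex:
    def __init__(self):
        self.chunks = []    # (chunk_text, vector) pairs for semantic search
        self.symbols = {}   # symbol name -> defining chunk, for exact lookup

    def add(self, chunk):
        self.chunks.append((chunk, embed(chunk)))
        for name in re.findall(r"^\s*(?:def|class)\s+(\w+)", chunk, flags=re.M):
            self.symbols[name] = chunk

    def retrieve(self, cursor_text, k=2):
        # 1) Exact definitions for identifiers that appear near the cursor.
        hits = []
        for name in re.findall(r"[A-Za-z_]\w*", cursor_text):
            chunk = self.symbols.get(name)
            if chunk is not None and chunk not in hits:
                hits.append(chunk)
        # 2) Fill the remaining slots with the closest semantic matches.
        query = embed(cursor_text)
        ranked = sorted(self.chunks, key=lambda cv: cosine(query, cv[1]), reverse=True)
        for chunk, _ in ranked:
            if len(hits) >= k:
                break
            if chunk not in hits:
                hits.append(chunk)
        return hits[:k]
```

Putting exact symbol hits first reflects the point above: when the developer is calling a function, its real signature is usually worth more than a loosely similar chunk.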

Step 4 · The model that fills the gap

A code-specialized, streaming model

Now the context is assembled. What model turns it into a suggestion, and why not just use a general chat model?

Design decision: What kind of model serves inline code completion best?

The call: A code-specialized model trained for fill-in-the-middle, tuned for low latency and streaming. — A model trained on code with a FIM objective completes the span between prefix and suffix accurately, and a smaller/optimized variant meets the latency budget while streaming tokens as they generate.

A code-specialized model trained for fill-in-the-middle generates the completion, tuned for low latency and token streaming. Inline completion often uses a smaller/faster model than chat, because meeting the deadline matters more than a marginally better suggestion that arrives too late. (Heavier reasoning lives in the chat/agent mode, step 7.)

Right-size the model to the deadline: There isn’t one model — inline wants fast + FIM; agentic chat wants powerful + multi-step. Matching model to mode is how you hit both the latency bar and the hard-task bar.
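One way to picture streaming under a deadline is a wrapper that passes tokens through until the latency budget runs out. This is a hedged sketch: stream_with_deadline is a hypothetical helper, the injectable clock exists only to make the behavior testable, and showing the partial suggestion that arrived in time is one possible policy (discarding it entirely is another).

```python
import time


def stream_with_deadline(token_iter, budget_s, clock=time.monotonic):
    """Yield tokens until the stream ends or the latency budget is exhausted."""
    deadline = clock() + budget_s
    for token in token_iter:
        if clock() > deadline:
            break        # late tokens are dropped; the editor keeps what arrived in time
        yield token
```

For example, "".join(stream_with_deadline(model_tokens, budget_s=0.3)) would collect whatever a hypothetical model_tokens iterator produced within roughly 300 ms, a figure chosen here only to match the "few hundred milliseconds" target.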

Step 5 · Win the milliseconds

Debounce, cancel, cache

Even a fast model plus retrieval can blow the budget if you call it carelessly. The developer is typing quickly, invalidating requests constantly. How do you keep latency (and cost) down in a tight edit loop?

Design decision: What keeps completion fast and cheap under rapid typing?

The call: Debounce keystrokes, cancel superseded requests, and cache completions by context. — Wait for a typing pause (debounce), cancel any in-flight request the moment new input arrives, and serve a Completion Cache hit when the context repeats — cutting both latency and model spend.

Cut latency and cost with three levers: debounce keystrokes to avoid firing mid-type, cancel superseded requests so only fresh work finishes, and serve repeated contexts from a Completion Cache keyed on context so they return without a model call. In a fast edit loop these often do more for perceived speed than faster hardware.

Don’t compute what you’ll throw away: Most in-flight completions get invalidated by the next keystroke. Debounce + cancel stop paying for them; caching reuses the ones that recur. Latency is won by not doing work as much as by doing it fast.
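The cache lever can be sketched as a small LRU map keyed on a hash of the text around the cursor. The class name, the 200-character window and the entry limit are assumptions for illustration. Keying only on nearby text raises the hit rate but risks serving a completion that no longer fits after a distant edit, so a real key would likely fold in more of the context.

```python
import hashlib
from collections import OrderedDict


class CompletionCache:
    """Small LRU cache keyed on a hash of the context around the cursor."""

    def __init__(self, max_entries=256, window=200):
        self._entries = OrderedDict()
        self._max_entries = max_entries
        self._window = window   # assumed: only nearby characters decide the key

    def _key(self, prefix, suffix):
        near = prefix[-self._window:] + "\x00" + suffix[:self._window]
        return hashlib.sha256(near.encode("utf-8")).hexdigest()

    def get(self, prefix, suffix):
        key = self._key(prefix, suffix)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        return None

    def put(self, prefix, suffix, completion):
        key = self._key(prefix, suffix)
        self._entries[key] = completion
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A typical hit is the developer deleting a character and retyping it: the context repeats, so the earlier suggestion can be shown without another model call.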

Step 6 · Measure what matters

Acceptance rate, not benchmarks

Is the assistant actually good? Offline code benchmarks say one thing, but the only judgment that counts is whether developers keep your suggestions. How do you measure real quality?

Design decision: What’s the north-star quality metric for a coding assistant?

The call: Acceptance rate (and retention of accepted code) from live telemetry. — Track shown-vs-accepted suggestions and whether accepted code survives — the direct measure of usefulness, and the signal you optimize context, model, and latency against.

Log acceptance data: suggestions shown vs accepted, and whether accepted code is retained (not immediately deleted). Acceptance rate is the north-star metric: it captures real usefulness that offline benchmarks miss, and it is the yardstick for A/B tests of context strategies, models, and latency budgets.

The developer is the eval: Every accept/reject is a live label on suggestion quality in real context. Acceptance-rate telemetry turns the editor into a continuous evaluation harness — far more honest than any static benchmark.
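To make the metric concrete, here is a minimal sketch of aggregating that telemetry per experiment arm. The event fields and the function name are hypothetical, and the length of the retention window is left open since the text does not specify one.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    variant: str            # A/B arm that produced the suggestion (hypothetical field)
    accepted: bool          # developer accepted the suggestion
    retained: bool = False  # accepted code still present after a retention window


def acceptance_stats(events):
    """Aggregate shown/accepted/retained counts and rates per variant."""
    counts = {}
    for event in events:
        c = counts.setdefault(event.variant, {"shown": 0, "accepted": 0, "retained": 0})
        c["shown"] += 1
        if event.accepted:
            c["accepted"] += 1
            if event.retained:
                c["retained"] += 1
    report = {}
    for variant, c in counts.items():
        report[variant] = {
            **c,
            "acceptance_rate": c["accepted"] / c["shown"],
            # Retention is measured over accepted suggestions only.
            "retention_rate": c["retained"] / c["accepted"] if c["accepted"] else 0.0,
        }
    return report
```

Comparing acceptance_rate and retention_rate between two variants is the kind of readout an A/B test of a new context strategy would use.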

Step 7 · Beyond the next line

Agentic chat and multi-file edits

Inline completion suggests the next span, but developers also want "add tests for this," "refactor across these files," "why is this failing?" — tasks that need reasoning, multiple files, and running tools. How do you support that without breaking inline latency?

Design decision: How do you add multi-file, tool-using help alongside inline completion?

The call: A separate agent/chat mode: a stronger model that reads files, runs tools, and proposes diffs. — A distinct deliberate loop uses a more powerful model with a larger context, reads multiple files, runs tools (tests, search, build), and returns reviewable diffs — decoupled from the sub-second inline path.

Add a separate Agent / Chat mode: a stronger model with a larger context that reads multiple files, runs tools (tests, search, build), and proposes reviewable diffs in a deliberate loop — decoupled from the latency-bound inline path. Same retrieval and privacy plumbing, different model and cadence. (This is the agent loop from the AI-agent design, specialized to code.)

Two modes, two budgets: Inline = fast, local, ghost-text. Agentic = powerful, multi-file, diff-based. Sharing context/retrieval but splitting the models and latency budgets is how one assistant serves both "finish this line" and "do this task."
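The deliberate loop of the agent mode can be sketched as follows. It is an outline under assumptions, not a real agent framework: the model is any callable that returns an action dict, the action shapes and tool names are invented for the example, and scripted_model replaces a real model so the control flow is visible.

```python
def run_agent(task, model, tools, max_steps=8):
    """Ask the model for the next action, run the tool, feed the observation back."""
    transcript = [("task", task)]
    for _ in range(max_steps):
        action = model(transcript)       # hypothetical: returns a dict describing the next step
        if action["type"] == "propose_diff":
            return action["diff"]        # handed to the developer for review, not auto-applied
        tool = tools[action["tool"]]
        observation = tool(**action.get("args", {}))
        transcript.append((action["tool"], observation))
    return None                          # step budget exhausted without a proposal


def scripted_model(transcript):
    """Stand-in for a real model: replays a fixed plan based on how many steps have run."""
    script = [
        {"type": "tool", "tool": "read_file", "args": {"path": "calc.py"}},
        {"type": "tool", "tool": "run_tests"},
        {"type": "propose_diff", "diff": "--- a/calc.py\n+++ b/calc.py\n-return a - b\n+return a + b\n"},
    ]
    return script[len(transcript) - 1]


# Illustrative tools; real ones would touch the workspace and the test runner.
TOOLS = {
    "read_file": lambda path: "def add(a, b):\n    return a - b\n",
    "run_tests": lambda: "1 failed: test_add",
}
```

Calling run_agent("fix the failing test", scripted_model, TOOLS) reads a file, runs the tests, and returns a diff for review. The max_steps cap is the loop-level counterpart of the inline latency budget.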

Step 8 · Handle code responsibly

Privacy, secrets, and licensing

You’re shipping a company’s private source code to a model, and shipping model output back into their repo. That cuts both ways: secrets could leak out, and secrets or license-risky code could get suggested in. How do you keep it safe?

Design decision: What must a coding assistant do to handle code responsibly?

The call: Redact secrets from context, honor retention/training consent, and filter suggestions for leaked secrets and license-risky verbatim code. — A Secret Filter strips credentials/PII from outgoing context, respects code-retention and training-consent settings, and scans suggestions for leaked secrets or large verbatim copies that pose licensing risk.

Add a Secret Filter and privacy layer that redacts secrets/PII from outgoing context, honors retention and training-consent settings (no training on code without opt-in), and scans suggestions for leaked secrets or license-risky verbatim copies. Handling proprietary code demands guarantees in both directions: nothing sensitive leaves, nothing unsafe comes back.

Trust is the moat: Developers only adopt an assistant they trust with their source. Explicit consent, secret redaction, and output filtering aren’t compliance checkboxes — they’re the precondition for anyone letting the tool near a private repo.
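A very small sketch of the redaction half of the Secret Filter, usable on outgoing context and on incoming suggestions alike. The three patterns are illustrative and far from exhaustive; production scanners typically add many more rules plus entropy checks. The license and verbatim-copy scan is a separate problem not shown here.

```python
import re

# Illustrative rules only: (pattern, replacement). Not an exhaustive secret scanner.
_RULES = [
    # Strings shaped like AWS access key IDs.
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED]"),
    # PEM private key blocks.
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----", re.S),
     "[REDACTED]"),
    # Quoted values assigned to suspicious names; the name and quotes are kept.
    (re.compile(r"""(?i)\b(password|secret|api_key|token)(\s*[:=]\s*)(["'])[^"']+\3"""),
     r"\1\2\3[REDACTED]\3"),
]


def redact(text):
    """Return (redacted_text, number_of_redactions)."""
    hits = 0
    for pattern, replacement in _RULES:
        text, n = pattern.subn(replacement, text)
        hits += n
    return text, hits
```

On the outgoing path the redacted text is what gets sent to the model; on the return path a non-zero hit count could be a reason to suppress the suggestion rather than show a redacted one.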

The payoff

You built an AI coding assistant

From "suggest the next lines, instantly, correctly" to a real system: a debounced/cancellable completion service, fill-in-the-middle context assembly, repo-aware retrieval over a code index, a fast code-specialized model, latency wins from debounce/cancel/cache, acceptance-rate telemetry, a separate agentic chat mode, and a privacy/secret layer.

Now starve the context — send only the current line — and watch the model confidently complete against APIs that don’t exist: invented functions, wrong imports, guessed signatures, all syntactically perfect and all wrong. That’s why context assembly and repo retrieval are the quality engine, and why grounding beats a bigger model when the goal is code that actually runs.

Latency-bound — debounce, cancel, stream — a late completion is worthless
Context builder — fill-in-the-middle prefix/suffix + open tabs + imports, on a budget
Repo retrieval — RAG for code — embeddings + symbol index ground completions in the repo
Code model — code-specialized, FIM-tuned, fast and streaming
Latency tricks — debounce + cancel + cache beat brute-force hardware
Acceptance rate — the north-star metric — the developer is the eval
Agent mode — a separate powerful, multi-file, tool-using, diff-based loop
Privacy — redact secrets, opt-in training, filter risky output — trust is the moat
The failure — too little context → confident, hallucinated code, silently