Design Multi-Agent Orchestration — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

When one agent isn’t enough

A single agent works until the task gets broad: it needs many skills, a huge context, and dozens of steps — and its one prompt turns into an unfocused mess that loses the thread. The instinct is to split the work across specialist agents (a researcher, an analyst, a writer) coordinated toward one goal. But the moment you have many agents, you inherit distributed-systems problems: coordination, shared state, partial failure, and — worst — runaway loops. How do you orchestrate multiple agents reliably?

A multi-agent orchestration engine decomposes a goal, routes sub-tasks to specialist agents, coordinates them through shared state, and runs the whole thing as a durable workflow with hard budgets and termination. The discipline: gain the power of many focused agents without letting autonomy become an unbounded loop.

How to read this: Each step opens with a real design decision: make the call yourself before reading what ships. The final section covers what happens when the agents are uncapped, the multi-agent failure mode.

Step 1 · The skeleton

An orchestrator over worker agents

A broad goal comes in. Something has to break it down, decide who does what, and assemble the answer. What’s the basic coordination shape?

Design decision: What’s the minimal structure for coordinating multiple agents?

The call: An orchestrator that decomposes the goal, routes sub-tasks to workers, and assembles results. A coordinator plans the work, dispatches sub-tasks to specialist worker agents, tracks their results in shared state, and composes the final output, acting as a clear owner of the workflow.

Start with an Orchestrator over worker agents: it decomposes the goal, routes sub-tasks to specialists, and assembles their outputs into the answer. A clear coordinator owns the plan and the termination — the foundation every reliable multi-agent system is built on.

Orchestrator-worker first: The orchestrator-worker topology gives you one owner of the plan, the state, and the stop condition. Fancier topologies exist, but "a coordinator delegating to specialists" is the default that stays debuggable.
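The shape above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Orchestrator` class, the fixed three-step plan, and the stub workers are all hypothetical names, and each worker is a plain function standing in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A worker is modeled as a function: sub-task text in, result text out.
# In a real system this would wrap a model call; here it is a stub.
Worker = Callable[[str], str]


@dataclass
class Orchestrator:
    workers: Dict[str, Worker]
    results: Dict[str, str] = field(default_factory=dict)

    def decompose(self, goal: str) -> List[Tuple[str, str]]:
        # Hypothetical fixed plan of (role, sub-task) pairs.
        # A real orchestrator would usually ask a model to produce this.
        return [
            ("researcher", f"find sources for: {goal}"),
            ("analyst", f"compare findings for: {goal}"),
            ("writer", f"draft report for: {goal}"),
        ]

    def run(self, goal: str) -> str:
        plan = self.decompose(goal)
        for role, sub_task in plan:
            # Route each sub-task to its specialist and record the result.
            self.results[role] = self.workers[role](sub_task)
        # Assemble the outputs in plan order.
        return "\n".join(self.results[role] for role, _ in plan)


def make_stub(role: str) -> Worker:
    def worker(sub_task: str) -> str:
        return f"[{role}] done: {sub_task}"
    return worker


orch = Orchestrator(workers={r: make_stub(r) for r in ("researcher", "analyst", "writer")})
report = orch.run("vendor comparison")
print(report)
```

The point of the sketch is ownership: the plan, the collected results, and the end of the run all live in one place, which is what keeps the system debuggable.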

Step 2 · Don’t over-engineer

Specialists — and when NOT to go multi-agent

Multi-agent is powerful and fashionable — and often the wrong choice. Each agent adds latency, cost, and a coordination failure surface. When does splitting into agents actually help, and how do you scope them?

Design decision: When are multiple agents the right call over one good agent?

The call: When the task splits into distinct skills or parallel sub-tasks that overload one context. Split when sub-tasks need different tools/expertise or can run independently, and one agent’s context would be a muddle. Each specialist gets a focused prompt, its own tools, and an isolated context.

Use specialists when a task splits into distinct skills or parallel sub-tasks that would overload one agent’s context — and not otherwise, because every agent adds cost, latency, and failure surface. Each worker gets a focused prompt, its own tools, and an isolated context, so it does one thing well instead of everything poorly.

Context isolation is the real win: The reason multi-agent helps isn’t "more brains" — it’s that each specialist keeps a small, relevant context instead of one agent drowning in everything. If a task doesn’t need that, one agent is simpler and better.

Step 3 · How agents coordinate

Shared state vs. message passing

The researcher’s findings have to reach the analyst; the analyst’s conclusions have to reach the writer. If you just paste everything into every prompt, contexts explode and quality rots. How do agents share information?

Design decision: How should agents pass work between each other?

The call: A shared state (blackboard) that agents read and write, passing structured artifacts rather than raw context. A shared store holds task results and intermediate artifacts; agents read what they need and write structured outputs, so no single context has to carry everything and handoffs stay clean.

Agents coordinate through Shared State (a blackboard): each reads what it needs and writes structured artifacts — results, not raw transcripts. The orchestrator can also message-pass explicit handoffs. Either way, keep contexts small by sharing distilled outputs, so adding an agent doesn’t multiply everyone’s prompt size.

Share artifacts, not transcripts: The blackboard pattern lets many agents collaborate without every context containing every other context. Structured, distilled state is what keeps a multi-agent system’s token cost and quality from degrading as it grows.
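A small sketch of the blackboard idea, assuming an in-memory store. The names `Artifact`, `Blackboard`, and `build_prompt` are made up for illustration; a production system would likely back this with a database or the workflow engine's own state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    key: str
    producer: str
    summary: str  # the distilled result, not the raw transcript


class Blackboard:
    def __init__(self) -> None:
        self._items: Dict[str, Artifact] = {}

    def write(self, artifact: Artifact) -> None:
        self._items[artifact.key] = artifact

    def read(self, keys: List[str]) -> List[Artifact]:
        # Agents ask only for the keys they need; a missing key raises KeyError.
        return [self._items[k] for k in keys]


def build_prompt(task: str, inputs: List[Artifact]) -> str:
    # The prompt carries only the requested artifacts, so its size does not
    # grow with the number of agents writing to the board.
    context = "\n".join(f"- {a.key} ({a.producer}): {a.summary}" for a in inputs)
    return f"Task: {task}\nInputs:\n{context}"


board = Blackboard()
board.write(Artifact("pricing", "researcher", "Vendor A is cheapest at scale"))
board.write(Artifact("security", "researcher", "Vendor B holds the stronger certifications"))

# The analyst ranking by cost reads the pricing artifact and nothing else.
prompt = build_prompt("Rank vendors by cost", board.read(["pricing"]))
print(prompt)
```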

Step 4 · Survive failure

The durable workflow engine

A run is twenty steps and five minutes long. Step 14 calls a flaky API and the process crashes. Do you really restart from step 1 — re-paying for all that work? Agentic runs are long and partial failure is constant. How do you make them reliable?

Design decision: How do you keep a long multi-agent run reliable across failures?

The call: Run it as a durable workflow: checkpoint each step so you can retry or resume from the last good state. A workflow engine persists state after each step; a crash or a failed tool call retries that step or resumes from the last checkpoint, instead of throwing away completed work.

Run the orchestration on a Workflow Engine with durable execution: checkpoint state after each step so a crash or a flaky tool retries that step or resumes from the last good state — never restarting a long run from scratch. Agentic workflows are long-lived and failure-prone; durability (à la Temporal/Step Functions) is what makes them production-grade.

Agents are long-running workflows: Treat a multi-agent run like a durable business workflow, not a single request: checkpointed steps, idempotent retries, resumability. That reframing is what turns "a script that sometimes finishes" into a reliable system.
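To make checkpoint-and-resume concrete, here is a toy version that persists step results to a JSON file. It is only a sketch of the idea and not how Temporal or Step Functions work internally; `run_durable`, the step names, and the simulated failure are all hypothetical, and it assumes step results are JSON-serializable.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, str]], str]]


def run_durable(steps: List[Step], checkpoint_path: str) -> Dict[str, str]:
    # Load results of steps completed in an earlier attempt, if any.
    state: Dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for name, fn in steps:
        if name in state:
            continue  # already done; do not pay for it again
        state[name] = fn(state)
        # Persist after every step: write a temp file, then rename over the old one.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state


calls = {"fetch": 0, "flaky": 0}


def fetch(state: Dict[str, str]) -> str:
    calls["fetch"] += 1
    return "raw data"


def flaky(state: Dict[str, str]) -> str:
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("simulated API failure")
    return state["fetch"].upper()


steps = [("fetch", fetch), ("flaky", flaky)]
path = os.path.join(tempfile.mkdtemp(), "run.json")

try:
    run_durable(steps, path)  # first attempt crashes on the flaky step
except RuntimeError:
    pass

final = run_durable(steps, path)  # second attempt resumes after "fetch"
print(final, calls)
```

Note that a step which crashes after its side effect but before the checkpoint is written will run again on resume, which is why the retries need to be idempotent.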

Step 5 · Clean handoffs

Typed contracts between agents

The researcher hands off to the analyst, who hands off to the writer. If each handoff is freeform prose, meaning blurs at every hop: a game of telephone that degrades over the chain. How do you keep quality across handoffs?

Design decision: How do you stop quality degrading as work passes between agents?

The call: Define typed handoff contracts: structured outputs each agent must produce and the next consumes. Each agent emits a defined structured output (schema/fields) that the next agent consumes, so information passes precisely and predictably instead of blurring through freeform text.

Define typed handoff contracts: each agent produces a structured output (a schema — fields, not freeform prose) that the next agent consumes. Structured handoffs stop the telephone-game degradation, make each step testable in isolation, and let the orchestrator validate an agent’s work before passing it on.

Interfaces between agents: The same reason services use typed APIs: a defined contract makes each agent a replaceable, testable component and keeps meaning intact across the chain. Freeform handoffs are where multi-agent quality quietly leaks.
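One way a handoff contract might look, using only the standard library. The `ResearchFindings` schema and its fields are invented for this example; in practice you might reach for a schema library or a model provider's structured-output feature instead.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ResearchFindings:
    """Hypothetical contract: what the researcher must hand to the analyst."""
    vendor: str
    sources: List[str]
    key_points: List[str]


def parse_findings(raw: Dict[str, Any]) -> ResearchFindings:
    # The orchestrator validates the worker's output before passing it on.
    for name in ("vendor", "sources", "key_points"):
        if name not in raw:
            raise ValueError(f"missing field: {name}")
    if not isinstance(raw["vendor"], str) or not raw["vendor"]:
        raise ValueError("vendor must be a non-empty string")
    for name in ("sources", "key_points"):
        value = raw[name]
        if not isinstance(value, list) or not all(isinstance(x, str) for x in value):
            raise ValueError(f"{name} must be a list of strings")
    if not raw["sources"]:
        raise ValueError("at least one source is required")
    return ResearchFindings(raw["vendor"], list(raw["sources"]), list(raw["key_points"]))


good = parse_findings({
    "vendor": "Vendor A",
    "sources": ["pricing page"],
    "key_points": ["cheapest at scale"],
})

try:
    parse_findings({"vendor": "Vendor B", "key_points": ["fast"]})
    rejected = False
except ValueError as exc:
    rejected = True
    reason = str(exc)

print(good, rejected)
```

Because the check sits at the seam, a malformed output can be retried or flagged before the analyst ever sees it.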

Step 6 · Do things at once

Parallel fan-out / fan-in

Three of the sub-tasks don’t depend on each other — researching three vendors, say. Running them one after another triples the latency for no reason. How do you exploit independence?

Design decision: How do you speed up independent sub-tasks?

The call: Fan out independent sub-tasks to run in parallel, then fan in and aggregate the results. The orchestrator dispatches independent sub-tasks concurrently to multiple agents (fan-out), waits for them, and merges their structured outputs (fan-in), cutting latency for the parallelizable part of the plan.

Exploit independence with fan-out / fan-in: the orchestrator runs independent sub-tasks in parallel across agents, then aggregates their structured results. Dependencies form a DAG — parallelize the independent branches, sequence the dependent ones. It’s the map-reduce shape, applied to agents, and often the biggest latency lever in a multi-agent plan.

The plan is a DAG: A good orchestrator sees the task as a dependency graph, not a straight line: overlap what can overlap, order what must be ordered. Fan-out/fan-in turns a serial chain into a fast, parallel workflow where the structure allows.
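The three-vendor example maps naturally onto `asyncio.gather`. In this sketch `research_vendor` is a hypothetical stand-in for an agent call, with a short sleep in place of real model or tool latency.

```python
import asyncio
from typing import Dict, List


async def research_vendor(vendor: str) -> Dict[str, str]:
    # Stand-in for a worker agent; the sleep simulates model or tool latency.
    await asyncio.sleep(0.01)
    return {"vendor": vendor, "summary": f"notes on {vendor}"}


async def fan_out_fan_in(vendors: List[str]) -> str:
    # Fan-out: the independent sub-tasks run concurrently.
    results = await asyncio.gather(*(research_vendor(v) for v in vendors))
    # Fan-in: the dependent step runs once all branches have finished.
    # gather returns results in the order the awaitables were passed in.
    return "; ".join(r["summary"] for r in results)


report = asyncio.run(fan_out_fan_in(["A", "B", "C"]))
print(report)
```

With three branches of similar length, the wall-clock time is roughly that of one branch instead of three. Steps that depend on each other still have to run in sequence.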

Step 7 · Put a ceiling on autonomy

Budgets, loop detection, termination

Agents that can spawn agents and retry tools have no natural stopping point. Two agents can correct each other forever; a planner can delegate infinitely; a retry can loop. Nothing errors — it just spins and spends. What stops a runaway?

Design decision: What keeps a multi-agent system from looping and burning unbounded cost?

The call: Per-run budgets (max steps, depth, tokens/$), loop detection, and human-approval gates. Hard caps on steps, recursion depth, and token/dollar spend, plus loop detection and approval gates for consequential actions, bound the run, so autonomy always has a ceiling and can’t spin forever.

Enforce termination at the system level: per-run budgets (max steps, max recursion depth, token/$ caps), loop detection (spotting repeated states and ping-pong exchanges), and human-approval gates for consequential or expensive actions. Autonomy needs both a floor and a ceiling. Termination is the single most important safeguard, and removing it is how multi-agent systems fail.

Bounded autonomy or bust: The defining risk of multi-agent isn’t a wrong answer — it’s an expensive non-answer that loops forever. Budgets, loop detection, and approval gates are non-negotiable; they turn open-ended autonomy into a bounded, affordable workflow.
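A rough sketch of a per-run guard that combines step and token budgets with a simple form of loop detection. `RunGuard`, its limits, and the idea of a string "state fingerprint" are assumptions made for illustration; recursion-depth caps and approval gates are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(Exception):
    pass


class LoopDetected(Exception):
    pass


@dataclass
class RunGuard:
    max_steps: int
    max_tokens: int
    max_repeats: int = 2  # how often the same state may recur before we stop
    steps: int = 0
    tokens: int = 0
    seen: Dict[str, int] = field(default_factory=dict)

    def check(self, state_fingerprint: str, tokens_used: int) -> None:
        # Call once per agent step, before acting on the step's output.
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exceeded")
        count = self.seen.get(state_fingerprint, 0) + 1
        self.seen[state_fingerprint] = count
        if count > self.max_repeats:
            raise LoopDetected(f"state repeated {count} times: {state_fingerprint}")


# Two agents correcting each other forever: a critic and a writer ping-pong.
guard = RunGuard(max_steps=50, max_tokens=10_000)
stop_reason = None
for i in range(1000):
    fingerprint = "critic:revise" if i % 2 == 0 else "writer:redraft"
    try:
        guard.check(fingerprint, tokens_used=100)
    except (BudgetExceeded, LoopDetected) as exc:
        stop_reason = type(exc).__name__
        break

print(stop_reason, guard.steps)
```

Here the loop detector fires on the fifth step, long before the step or token budget would. The budgets remain as the backstop for loops whose states never repeat exactly.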

Step 8 · See the whole run

End-to-end tracing and evaluation

A multi-agent run failed to produce a good answer. Which agent? Which handoff? Which tool call? With work spread across many agents and steps, a single log line tells you nothing. How do you debug and evaluate the whole thing?

Design decision: How do you make a multi-agent run debuggable and evaluable?

The call: Trace the full run tree (every agent, prompt, tool call, handoff, and cost) and evaluate the workflow end to end. A structured trace of the whole run (each agent, prompt, tool call, handoff, token/cost) makes failures debuggable and lets you evaluate the end-to-end workflow, not just individual model calls.

Capture a Run Trace: the full tree of the workflow — every agent, prompt, tool call, handoff, and cost — so a failure is traceable to the exact step, and the whole workflow can be evaluated end to end (not just per model call). Multi-agent quality is a property of the system, so observability and eval have to span the entire run.

Evaluate the workflow, not the call: With many agents, the interesting failures live in the seams — a bad handoff, a mis-routed task, a loop. Tracing the run tree and evaluating end-to-end is the only way to see and fix them.
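A run trace can be modeled as a tree of spans. The sketch below is a simplified, hypothetical structure (`Span`, `total_tokens`, `first_error_path` are illustrative names), not the data model of any particular tracing tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    kind: str  # "agent", "tool", or "handoff"
    tokens: int = 0
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, kind: str, tokens: int = 0,
              error: Optional[str] = None) -> "Span":
        span = Span(name, kind, tokens, error)
        self.children.append(span)
        return span

    def total_tokens(self) -> int:
        # Cost rolls up the tree, so every agent's total spend is visible.
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def first_error_path(self) -> Optional[List[str]]:
        # Depth-first search for the first failing span, returned as a path.
        if self.error:
            return [self.name]
        for c in self.children:
            path = c.first_error_path()
            if path:
                return [self.name] + path
        return None


root = Span("orchestrator", "agent", tokens=200)
researcher = root.child("researcher", "agent", tokens=800)
researcher.child("web_search", "tool")
analyst = root.child("analyst", "agent", tokens=500)
analyst.child("pricing_api", "tool", error="timeout")

print(root.total_tokens())
print(root.first_error_path())
```

The error path answers "which agent, which tool call" directly, and the rolled-up token count gives the cost of the whole run instead of a single model call.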

The payoff

You built multi-agent orchestration

From "one agent can’t hold it all" to a coordinated system: an orchestrator over focused specialists (used only when the task warrants it), shared-state coordination via structured artifacts, a durable workflow engine with checkpoint/retry, typed handoffs, parallel fan-out/fan-in, hard budgets + termination, and end-to-end tracing.

Now uncap the agents — remove budgets and termination — and watch the system loop: agents correcting each other, planners delegating endlessly, retries spinning, tokens burning, with no error and no answer. That’s why termination conditions are the most important safeguard in multi-agent design, and why autonomy must always sit between a floor and a ceiling.

  • Orchestrator-worker — one owner of the plan, state, and stop condition
  • Use specialists wisely — split on distinct skills/parallelism — not by default
  • Shared state — coordinate via structured artifacts, not forwarded transcripts
  • Durable workflow — checkpoint each step; retry/resume, never restart
  • Typed handoffs — structured contracts stop telephone-game degradation
  • Fan-out / fan-in — the plan is a DAG — parallelize the independent branches
  • Budgets + termination — steps, depth, cost caps, loop detection, approval gates
  • Trace + eval — the run tree makes the workflow debuggable and evaluable
  • The failure — uncapped autonomy loops and burns unbounded cost, silently