Error Handling in Multi-Agent Systems: Retries, Fallbacks, Circuit Breakers
Error handling in multi-agent systems comprises all mechanisms that prevent a single agent's failure from toppling the entire system: timeouts, retries with backoff, fallback agents, circuit breakers and end-to-end observability. The goal is fault tolerance – a sub-agent may fail without errors cascading or propagating.
Key Takeaways
- Multi-agent systems have eight error classes that single agents do not have – among them cascading failures, resource deadlock, context fragmentation and debuggability collapse.
- The most important protection against error propagation is the separation of write and read paths: many agents read in parallel, but only one agent or one pipeline stage writes (Cognition's single-threaded writes principle).
- Timeouts on every A2A task, the explicit input-required state and sub-agent token caps prevent deadlocks and cost explosions.
- A verifier-judge agent (often a stronger model) intercepts hallucinated facts before the lead agent synthesises them as truth.
- In the DACH region, observability is not a pure engineering question but a compliance one: end-to-end trace IDs across every A2A task and MCP call are increasingly mandatory for BaFin/FMA auditability.
- Allianz Project Nemo demonstrates the DACH pattern: a dedicated audit agent and human-in-the-loop at the critical payout stage make errors manageable.
In a single-agent system, error handling is largely a solved problem; as soon as multiple agents with their own context windows, tools and roles work together, error classes arise that simply do not exist in a single tool-use loop.
- Separate the read and write paths: Many agents may read and make proposals in parallel, but only one (or one pipeline stage) commits – this is the most effective measure against error propagation.
- Bound every task: A timeout, a token cap and an explicit `failed` state per sub-agent prevent deadlocks and cost explosions.
- Verify before synthesis: A verifier-judge agent intercepts hallucinated facts before the lead agent writes them into the answer as truth.
Why error handling is different in multi-agent systems
A multi-agent system consists of several LLM-based agents – each with its own prompt, role, toolset and, in the strict case, its own context window – that solve a task together. It is precisely this distribution that creates new error surfaces. There is no longer a single trace, but N sub-agent trajectories plus a synthesis. Runs are non-deterministic: the same input spawns different sub-agents in a different order. And the system can exhibit emergent behaviour that no single agent planned that way.
The central danger is error propagation. If a sub-agent hallucinates a fact, the lead agent synthesises it into the final answer, and downstream agents act on the basis of this false fact. A local error turns into a system-wide wrong decision. Error handling in multi-agent systems therefore means two things: making individual agents robust against transient failures and preventing errors from spreading along the agent chain.
The eight error classes that single-agent systems do not have
The following table is the error catalogue documented in production by Anthropic, Cognition, Sierra, Salesforce and Microsoft. Anyone bringing a multi-agent system into production should know it.
| # | Error class | What happens | Countermeasure |
|---|---|---|---|
| 1 | Cascading failures | Sub-agent hallucinates, lead synthesises it as fact, downstream acts on it | Verifier-judge agent, grounded retrieval, mandatory source citations |
| 2 | Echo chamber | Sub-agents reinforce a false premise from the lead | Diversify models/prompts (MoA style), introduce a critic role |
| 3 | Authority confusion | Sub-agent overrides lead instructions, lead loses control | Clear role hierarchies, A2A task contracts with strong AgentCards |
| 4 | Resource deadlock | Agent A waits for B, B waits for clarification from A | Timeout on every task, explicit `input-required` state |
| 5 | Prompt-injection amplification | Each sub-agent context is a new attack surface | Prompt partitioning, provenance-based access control, no autonomous MCP installation |
| 6 | Context fragmentation | Sub-agents make incompatible implicit decisions | Share full traces under high coupling, single-threaded writes, decision contracts |
| 7 | Cost explosion | Token consumption escalates (multi-agent approx. 15x the tokens of a chat interaction in Anthropic's data) | Sub-agent token caps, QoS tiers, route to single-agent below complexity threshold |
| 8 | Debuggability collapse | No single trace covers the run | Distributed tracing across the A2A mesh, correlation IDs in every task and MCP call |
Three of these classes – cascades, deadlock and context fragmentation – are the actual drivers of error propagation. The other five amplify them or render them invisible.
Preventing error propagation: separating the read and write paths
The most important architectural decision against error propagation is not a tool but a principle. Cognition.ai put it in a nutshell in "Don't Build Multi-Agents" (June 2025) and in the update "Multi-Agents: What's Actually Working" (April 2026): multi-agent fan-out for reading is robust, multi-agent fan-out for writing is fragile. The consequence is the rule of single-threaded writes:
- Single-threaded writes: Many agents read, research, make proposals – but only one agent or one pipeline stage commits. A faulty read agent only delivers a fragment contribution that the lead can discard or request anew.
- Independent writes with reconciliation: Each sub-agent delivers a fragment, the lead agent reconciles them into the final output (the Anthropic research-agent pattern).
- Concurrent writes to shared state: The death spiral – avoid it.
Cognition's observation behind this: "apparent disagreements" between agents are mostly symptoms of context fragmentation, not genuine disagreement. Where coupling is high, you have to share full agent traces instead of just individual messages. For shared external storage (vector store, Postgres, SAP HANA), the same rule applies in most production systems: one writer, many readers.
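The single-threaded writes rule can be sketched in a few lines. This is a minimal illustration, not Cognition's or Anthropic's implementation: `read_agent`, `Proposal` and `Committer` are hypothetical names, and the read agent is a stub standing in for an LLM call. The point is only the shape: reads fan out in parallel and return proposals, and exactly one component mutates shared state.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    agent: str
    key: str
    value: str


def read_agent(name: str, topic: str) -> Proposal:
    # Hypothetical stand-in for an LLM-backed research agent.
    # It returns a proposal and never touches shared state.
    return Proposal(agent=name, key=topic, value=f"{name} finding on {topic}")


class Committer:
    """The single writer: the only component allowed to mutate shared state."""

    def __init__(self) -> None:
        self.state: dict[str, str] = {}

    def commit(self, proposals: list[Proposal]) -> None:
        for p in proposals:
            if not p.value.strip():
                continue  # discard empty fragments instead of writing them
            # First accepted proposal per key wins; later ones are ignored.
            self.state.setdefault(p.key, p.value)


def run(topics: list[str]) -> dict[str, str]:
    # Fan out the reads in parallel ...
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(read_agent, f"reader-{i}", t) for i, t in enumerate(topics)]
        proposals = [f.result() for f in futures]
    # ... then commit sequentially, in one place.
    committer = Committer()
    committer.commit(proposals)
    return committer.state
```

Because a faulty reader can only produce a bad `Proposal`, the committer is the one place where discarding or re-requesting has to be implemented.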
Retries with backoff, timeouts and circuit breakers
At the level of the individual agent or tool call, the classic resilience patterns apply – carried over from the microservice world:
- Retries with exponential backoff: Transient errors (rate limit, brief tool outage, timeout) are retried with increasing wait times, ideally with jitter, to avoid thundering-herd effects. Important: retries only help against transient, not against semantic errors. A repeated call to a hallucinating agent delivers the same hallucination.
- Timeouts on every task: The A2A protocol defines a task lifecycle `submitted → working → input-required → completed | failed | canceled`. A timeout on every task plus the explicit `input-required` state are the documented countermeasure against resource deadlocks. Without a timeout, a waiting agent can block the entire system.
- Circuit breakers: If errors from a sub-agent or tool accumulate above a threshold, the breaker opens and blocks further calls for a time window. This stops the cost explosion and creates room for a fallback.
- Fallback agents and degraded responses: If a specialised agent fails, a simpler fallback agent, a different model tier or a deliberately reduced response takes over. Better an honestly limited output than a cascaded wrong decision.
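The four patterns above compose naturally. The following is a minimal asyncio sketch under stated assumptions: `TransientError`, `call_with_retry`, `CircuitBreaker` and `call_agent` are hypothetical names, the thresholds are placeholders, and the agents are plain async callables rather than real A2A clients.

```python
import asyncio
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as rate limits."""


async def call_with_retry(task, *, attempts=3, base_delay=0.05, timeout=1.0):
    """Run an async callable with a per-attempt timeout and exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(task(), timeout=timeout)
        except (TransientError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise  # retries exhausted: let the caller decide
            # Full jitter: sleep a random time up to base_delay * 2**attempt.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, blocks calls for `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: permit one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()


async def call_agent(primary, fallback, breaker):
    """Try the primary agent behind the breaker; otherwise return the degraded answer."""
    if breaker.allow():
        try:
            result = await call_with_retry(primary)
            breaker.record(True)
            return result
        except (TransientError, asyncio.TimeoutError):
            breaker.record(False)
    return await fallback()
```

Note that only `TransientError` and timeouts are retried. A semantic failure such as a hallucinated answer returns normally and has to be caught elsewhere, for example by a verifier.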
The practical token discipline from the Anthropic research-agent reference complements this: explicitly cap sub-agent token spend (supported by Bedrock AgentCore, LangGraph and the Claude Agent SDK), force sub-agent outputs into a typed schema (Pydantic, JSON schema, A2A artifact) and compress before returning – never pass a full sub-agent transcript through to the lead. A typed schema is at the same time an error boundary: what does not fit the schema is rejected rather than passed on.
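To illustrate the schema as an error boundary, here is a standard-library sketch. It is a stand-in for what Pydantic or a JSON schema validator would do in practice; the `Finding` fields, `SchemaError`, `parse_finding` and the 280-character cap are assumptions made for the example, and the truncation is only a crude placeholder for real compression.

```python
import json
from dataclasses import dataclass

MAX_CLAIM_CHARS = 280  # hypothetical cap on what is passed up to the lead


class SchemaError(ValueError):
    """Raised when a sub-agent output does not match the expected shape."""


@dataclass(frozen=True)
class Finding:
    claim: str
    source_url: str
    confidence: float


def parse_finding(raw: str) -> Finding:
    """Validate raw sub-agent output; reject anything that does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")

    claim = data.get("claim")
    source = data.get("source_url")
    conf = data.get("confidence")

    if not isinstance(claim, str) or not claim.strip():
        raise SchemaError("claim must be a non-empty string")
    if not isinstance(source, str) or not source.startswith(("http://", "https://")):
        raise SchemaError("source_url must be an http(s) URL")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise SchemaError("confidence must be a number in [0, 1]")

    return Finding(
        claim=claim.strip()[:MAX_CLAIM_CHARS],
        source_url=source,
        confidence=float(conf),
    )
```

The lead agent then only ever sees `Finding` objects. A `SchemaError` is a clean signal to discard the fragment or request it again.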
Monitoring and observability: a prerequisite, not an add-on
Debuggability collapse is the most insidious error class: if something goes wrong, there is no single trace that explains the run. Observability is therefore not optional for any multi-agent system in production. The building blocks:
- Distributed tracing across the entire A2A mesh, with an end-to-end correlation/trace ID in every A2A task and every MCP call. OpenTelemetry is the cross-cutting standard in 2026.
- Tools (as of 2026): LangSmith for internal LangGraph traces; Galileo or Arize Phoenix for multi-agent traces across vendor boundaries; Pydantic Logfire in Python-heavy teams; Datadog/Splunk/Grafana for SIEM correlation.
- Verifier-judge pattern: A separate judge agent – often a stronger model than the workers – assesses each trajectory against a small rubric: task completed? Answer grounded? Agents stayed on task? Budget kept? This is the lightest production-ready answer against cascading failures.
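The four rubric questions can be written down as explicit checks. In a real system the judge would be a model call; the sketch below is a deterministic stand-in that only shows the gate the lead agent would apply. `Trajectory`, its fields, `judge` and `accept` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Trajectory:
    task_completed: bool
    # Each claim is paired with its source; None means the claim is ungrounded.
    claims: Tuple[Tuple[str, Optional[str]], ...]
    off_task_steps: int
    tokens_used: int
    token_budget: int


def judge(t: Trajectory) -> dict:
    """Score one sub-agent trajectory against the four rubric questions."""
    return {
        "task_completed": t.task_completed,
        # Grounded only if there is at least one claim and every claim has a source.
        "answer_grounded": bool(t.claims) and all(bool(source) for _, source in t.claims),
        "stayed_on_task": t.off_task_steps == 0,
        "budget_kept": t.tokens_used <= t.token_budget,
    }


def accept(t: Trajectory) -> bool:
    """Only trajectories that pass every check reach the synthesis step."""
    return all(judge(t).values())
```

Returning the per-question verdicts rather than a single boolean keeps the reason for a rejection available for the trace.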
For the DACH region, observability is moreover a compliance question. BaFin, FMA and FINMA will increasingly push in 2026 for end-to-end traces of every agent call, for reproducibility (pinning model versions in production, recording all sub-agent prompts, tool calls and AgentCards) and for append-only audit storage. Retention periods depend on the sector: BFSI 10 years, pharma GxP often 25–30 years.
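How one trace ID can reach every A2A task and MCP call is easiest to see in a small sketch. This version uses only the standard library (`contextvars`) and an in-memory list as a stand-in for a tracing backend; it is not the OpenTelemetry API, and `start_trace`, `record_span`, `run_subagent` and `call_mcp_tool` are hypothetical names.

```python
import contextvars
import uuid

# Holds the trace ID of the current run; every span reads it implicitly.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

SPANS = []  # in-memory stand-in for an append-only trace store


def start_trace() -> str:
    """Create one trace ID per end-to-end run and bind it to the current context."""
    trace_id = uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def record_span(kind: str, name: str, **attrs) -> dict:
    """Append a span that carries the current trace ID."""
    span = {"trace_id": trace_id_var.get(), "kind": kind, "name": name, **attrs}
    SPANS.append(span)
    return span


def call_mcp_tool(tool: str, query: str) -> str:
    # Hypothetical tool call; the span records what was asked.
    record_span("mcp_call", tool, query=query)
    return f"{tool} result for {query}"


def run_subagent(name: str, query: str) -> str:
    # Hypothetical A2A task; state changes are recorded as spans.
    record_span("a2a_task", name, state="working")
    result = call_mcp_tool("search", query)
    record_span("a2a_task", name, state="completed")
    return result
```

Filtering `SPANS` by a single `trace_id` then reconstructs the full run across agents and tools, which is the property an auditor would ask for.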
Example: Allianz Project Nemo
Project Nemo by Allianz – a German insurer – is one of the cleanest documented DACH-relevant multi-agent deployments and a textbook case for error handling. Seven specialised agents (planner, cyber, coverage, weather, fraud, payout, audit) process food-spoilage claims following natural disasters. The complete seven-agent workflow runs in under five minutes.
Two error-handling principles are structurally anchored here:
- Audit agent as a topology element: A dedicated agent generates a complete summary of all agent decisions and justifications – a complete audit trail for compliance, quality control and human review. Auditability is built into the agent topology, not only into the logging pipeline.
- Human-in-the-loop at the critical stage: A human caseworker reviews the audit summary and makes the final payout decision. A cascaded wrong decision is caught at the most expensive, irreversible point – as explicit policy.
The result: an 80% reduction in processing and settlement time for valid claims under AUD 500, live in under 100 days (Australia, July 2025). The lesson for error handling: modular agents with clear roles, a dedicated audit path and a human checkpoint make a system that is fast and manageable.
For agencies and B2B decision-makers
For marketing agencies and AI-native providers, error handling is a product-quality feature, not a background detail: a newsletter or SEO-audit pipeline with research, outline, draft and review agents stands or falls on the verifier-judge and the single-threaded write stage. For DACH B2B decision-makers the rule is: demand from every multi-agent pitch – internal or from a vendor – concrete answers on the timeout, fallback, verifier and trace strategy. A system without Galileo/LangSmith-class observability cannot be investigated when a regulator makes an inquiry. Those who build error handling into the agent topology from the start – following the Allianz Nemo pattern with an audit agent and human-in-the-loop – deliver systems that are fast and at the same time remain auditable.
FAQ
What is the difference between a cascading failure and context fragmentation?
Why are simple retries not enough in multi-agent systems?
How do you prevent a single agent failure from toppling the whole system?
What does circuit breaker mean in the context of agents?
What role does human-in-the-loop play in error handling?