5.8 · Intermediate · 8 min

Error Handling in Multi-Agent Systems: Retries, Fallbacks, Circuit Breakers

Blck Alpaca
Definition

Error handling in multi-agent systems comprises all mechanisms that prevent a single agent's failure from toppling the entire system: timeouts, retries with backoff, fallback agents, circuit breakers and end-to-end observability. The goal is fault tolerance – a sub-agent may fail without errors cascading or propagating.

Key Takeaways

  • Multi-agent systems have eight error classes that single agents do not have – among them cascading failures, resource deadlock, context fragmentation and debuggability collapse.
  • The most important protection against error propagation is the separation of write and read paths: many agents read in parallel, but only one agent or one pipeline stage writes (Cognition's single-threaded writes principle).
  • Timeouts on every A2A task, the explicit input-required state and sub-agent token caps prevent deadlocks and cost explosions.
  • A verifier-judge agent (often a stronger model) intercepts hallucinated facts before the lead agent synthesises them as truth.
  • In the DACH region, observability is not a pure engineering question but a compliance one: end-to-end trace IDs across every A2A task and MCP call are increasingly mandatory for BaFin/FMA auditability.
  • Allianz Project Nemo demonstrates the DACH pattern: a dedicated audit agent and human-in-the-loop at the critical payout stage make errors manageable.

In a single-agent system, error handling is largely a solved problem. As soon as multiple agents with their own context windows, tools and roles work together, error classes arise that simply do not exist in a single tool-use loop.

  • Separate the read and write paths: Many agents may read and make proposals in parallel, but only one (or one pipeline stage) commits – this is the most effective measure against error propagation.
  • Bound every task: A timeout, token cap and an explicit failed state per sub-agent prevent deadlocks and cost explosions.
  • Verify before synthesis: A verifier-judge agent intercepts hallucinated facts before the lead agent writes them into the answer as truth.

Why error handling is different in multi-agent systems

A multi-agent system consists of several LLM-based agents – each with its own prompt, role, toolset and, in the strict case, its own context window – that solve a task together. It is precisely this distribution that creates new error surfaces. There is no longer a single trace, but N sub-agent trajectories plus a synthesis. Runs are non-deterministic: the same input can spawn different sub-agents in a different order. And the system can exhibit emergent behaviour that no single agent planned.

The central danger is error propagation. If a sub-agent hallucinates a fact, the lead agent synthesises it into the final answer, and downstream agents act on the basis of this false fact. A local error turns into a system-wide wrong decision. Error handling in multi-agent systems therefore means two things: making individual agents robust against transient failures and preventing errors from spreading along the agent chain.

The eight error classes that single-agent systems do not have

The following catalogue lists the error classes that Anthropic, Cognition, Sierra, Salesforce and Microsoft have documented in production. Anyone taking a multi-agent system into production should know it.

| # | Error class | What happens | Countermeasure |
| --- | --- | --- | --- |
| 1 | Cascading failures | Sub-agent hallucinates, lead synthesises it as fact, downstream acts on it | Verifier-judge agent, grounded retrieval, mandatory source citations |
| 2 | Echo chamber | Sub-agents reinforce a false premise from the lead | Diversify models/prompts (MoA style), introduce a critic role |
| 3 | Authority confusion | Sub-agent overrides lead instructions, lead loses control | Clear role hierarchies, A2A task contracts with strong AgentCards |
| 4 | Resource deadlock | Agent A waits for B, B waits for clarification from A | Timeout on every task, explicit input-required state in A2A |
| 5 | Prompt-injection amplification | Each sub-agent context is a new attack surface | Prompt partitioning, provenance-based access control, no autonomous MCP installation |
| 6 | Context fragmentation | Sub-agents make incompatible implicit decisions | Share full traces under high coupling, single-threaded writes, decision contracts |
| 7 | Cost explosion | Token consumption escalates (orchestrator-worker approx. 15x vs. single-agent) | Sub-agent token caps, QoS tiers, route to single-agent below complexity threshold |
| 8 | Debuggability collapse | No single trace covers the run | Distributed tracing across the A2A mesh, correlation IDs in every task and MCP call |

Three of these classes – cascades, deadlock and context fragmentation – are the actual drivers of error propagation. The other five amplify them or render them invisible.

Preventing error propagation: separating the read and write paths

The most important architectural decision against error propagation is not a tool but a principle. Cognition summed it up in "Don't Build Multi-Agents" (June 2025) and in the update "Multi-Agents: What's Actually Working" (April 2026): multi-agent fan-out for reading is robust, multi-agent fan-out for writing is fragile. The consequence is the rule of single-threaded writes. The three write patterns, ordered from safest to most dangerous:

  • Single-threaded writes: Many agents read, research, make proposals – but only one agent or one pipeline stage commits. A faulty read agent only delivers a fragment contribution that the lead can discard or request anew.
  • Independent writes with reconciliation: Each sub-agent delivers a fragment, the lead agent reconciles them into the final output (the Anthropic research-agent pattern).
  • Concurrent writes to shared state: The death spiral – avoid it.

Cognition's observation behind this: "apparent disagreements" between agents are mostly symptoms of context fragmentation, not genuine disagreement. Where coupling is high, you have to share full agent traces instead of just individual messages. For shared external storage (vector store, Postgres, SAP HANA), the same rule applies in most production systems: one writer, many readers.
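The single-writer rule can be sketched in a few lines. This is a minimal illustration, not Cognition's implementation: the `Proposal` type, the `read_agent` stub and the agent names are assumptions made up for this example. Read agents run in parallel and only return proposals; one loop in the lead commits them, so a failed reader costs exactly one fragment.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    agent: str
    key: str
    value: str


async def read_agent(name: str, key: str, value: str, fail: bool = False) -> Proposal:
    """Hypothetical read-only agent: researches and returns a proposal, never writes."""
    await asyncio.sleep(0)  # stand-in for model and tool latency
    if fail:
        raise RuntimeError(f"{name} failed")
    return Proposal(agent=name, key=key, value=value)


async def run(shared_state: dict[str, str]) -> list[str]:
    # Fan-out: readers run in parallel and have no access to shared_state.
    results = await asyncio.gather(
        read_agent("coverage", "policy", "active"),
        read_agent("weather", "event", "flood", fail=True),
        read_agent("fraud", "risk", "low"),
        return_exceptions=True,
    )
    discarded = []
    # Single writer: only this loop commits, one proposal at a time.
    for result in results:
        if isinstance(result, Exception):
            discarded.append(str(result))  # a failed reader costs one fragment
            continue
        shared_state[result.key] = result.value
    return discarded


state: dict[str, str] = {}
errors = asyncio.run(run(state))
print(state, errors)
```

In this sketch the lead simply discards the failed fragment. Requesting it anew would be an equally valid choice at the same point in the loop.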

Retries with backoff, timeouts and circuit breakers

At the level of the individual agent or tool call, the classic resilience patterns apply – carried over from the microservice world:

  • Retries with exponential backoff: Transient errors (rate limit, brief tool outage, timeout) are retried with increasing wait times, ideally with jitter, to avoid thundering-herd effects. Important: retries only help against transient errors, not against semantic ones. A repeated call to a hallucinating agent is likely to deliver the same hallucination.
  • Timeouts on every task: The A2A protocol defines a task lifecycle: submitted → working → input-required → completed | failed | canceled. A timeout on every task and the explicit input-required state are the documented countermeasures against resource deadlocks. Without a timeout, a waiting agent can block the entire system.
  • Circuit breakers: If errors from a sub-agent or tool accumulate above a threshold, the breaker opens and blocks further calls for a time window. This stops the cost explosion and creates room for a fallback.
  • Fallback agents and degraded responses: If a specialised agent fails, a simpler fallback agent, a different model tier or a deliberately reduced response takes over. Better an honestly limited output than a cascaded wrong decision.
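The four patterns above compose naturally. The following is a minimal sketch using only the Python standard library. `TransientError`, `CircuitBreaker`, `call_with_retries` and `run_task` are hypothetical names for this example, not part of A2A or any agent SDK, and the result dictionary with its `degraded` flag is an assumed shape. Only transient errors and timeouts are retried; any other exception propagates unchanged, because a retry would not fix it.

```python
import asyncio
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, brief outage)."""


class CircuitBreaker:
    """Opens after max_failures consecutive failures; allows a trial call after cooldown_s."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: let one trial call through (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


async def call_with_retries(agent, task, *, attempts=3, timeout_s=5.0, base_delay_s=0.5):
    """Retry transient errors and timeouts with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(agent(task), timeout=timeout_s)
        except (TransientError, asyncio.TimeoutError):
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** attempt))


async def run_task(primary, fallback, task, breaker, **retry_opts):
    """Use the primary agent unless the breaker is open; otherwise degrade to the fallback."""
    if breaker.allow():
        try:
            result = await call_with_retries(primary, task, **retry_opts)
            breaker.record(ok=True)
            return {"state": "completed", "result": result, "degraded": False}
        except (TransientError, asyncio.TimeoutError):
            breaker.record(ok=False)
    return {"state": "completed", "result": await fallback(task), "degraded": True}


calls = {"primary": 0}


async def flaky_agent(task):
    calls["primary"] += 1
    raise TransientError("rate limited")


async def simple_agent(task):
    return f"limited answer for {task}"


async def demo():
    breaker = CircuitBreaker(max_failures=2, cooldown_s=60.0)
    opts = dict(attempts=2, timeout_s=1.0, base_delay_s=0.01)
    return [await run_task(flaky_agent, simple_agent, "claim-42", breaker, **opts) for _ in range(3)]


outcomes = asyncio.run(demo())
```

In the demo, the first two tasks each spend two attempts on the failing primary agent. The breaker then opens, and the third task goes straight to the fallback without calling the primary at all. Every result is marked as degraded, so the caller can tell a limited answer from a full one.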

The practical token discipline from the Anthropic research-agent reference complements this: explicitly cap sub-agent token spend (supported by Bedrock AgentCore, LangGraph and the Claude Agent SDK), force sub-agent outputs into a typed schema (Pydantic, JSON schema, A2A artifact) and compress before returning – never pass a full sub-agent transcript through to the lead. A typed schema is at the same time an error boundary: what does not fit the schema is rejected rather than passed on.
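A typed boundary of this kind might look as follows. The sketch uses a standard-library dataclass with manual checks as a stand-in for Pydantic or a JSON-schema validator. The `Finding` fields, the `SchemaError` type and the default limits are assumptions for illustration, not a format defined by A2A or Anthropic.

```python
import json
from dataclasses import dataclass


class SchemaError(ValueError):
    """Raised when a sub-agent output does not match the expected shape."""


@dataclass(frozen=True)
class Finding:
    claim: str
    sources: list[str]
    tokens_used: int


def parse_finding(raw: str, token_cap: int = 2000, max_claim_chars: int = 280) -> Finding:
    """Validate raw sub-agent JSON; reject anything malformed, unsourced or over budget."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    claim, sources, tokens = data.get("claim"), data.get("sources"), data.get("tokens_used")
    if not isinstance(claim, str) or not claim.strip():
        raise SchemaError("claim must be a non-empty string")
    if not isinstance(sources, list) or not sources or not all(isinstance(s, str) for s in sources):
        raise SchemaError("sources must be a non-empty list of strings")
    if not isinstance(tokens, int) or isinstance(tokens, bool) or tokens < 0:
        raise SchemaError("tokens_used must be a non-negative integer")
    if tokens > token_cap:
        raise SchemaError(f"token cap exceeded: {tokens} > {token_cap}")
    # Compress before returning: the lead sees a bounded claim, not the transcript.
    return Finding(claim=claim.strip()[:max_claim_chars], sources=sources, tokens_used=tokens)


ok = parse_finding('{"claim": "Outage lasted 9 hours", "sources": ["utility-report"], "tokens_used": 850}')

try:
    parse_finding('{"claim": "Outage lasted 9 hours", "sources": [], "tokens_used": 850}')
    rejected_unsourced = False
except SchemaError:
    rejected_unsourced = True
```

The second call is rejected because it carries no sources. The lead never sees that claim, which is the point of treating the schema as an error boundary rather than as formatting.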

Monitoring and observability: a prerequisite, not an add-on

Debuggability collapse is the most insidious error class: if something goes wrong, there is no single trace that explains the run. Observability is therefore not optional for any multi-agent system in production. The building blocks:

  • Distributed tracing across the entire A2A mesh, with an end-to-end correlation/trace ID in every A2A task and every MCP call. OpenTelemetry is the cross-cutting standard in 2026.
  • Tools (as of 2026): LangSmith for internal LangGraph traces; Galileo or Arize Phoenix for multi-agent traces across vendor boundaries; Pydantic Logfire in Python-heavy teams; Datadog/Splunk/Grafana for SIEM correlation.
  • Verifier-judge pattern: A separate judge agent – often a stronger model than the workers – assesses each trajectory against a small rubric: task completed? Answer grounded? Agents stayed on task? Budget kept? This is the most lightweight production-ready defence against cascading failures.
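The rubric gate from the verifier-judge pattern can be sketched as a small function that sits between the workers and the lead. Everything here is an illustrative assumption: the `Trajectory` fields, the check names and the sample data are made up, and `stub_judge` is a rule-based placeholder where a real system would call a judge model.

```python
from dataclasses import dataclass
from typing import Callable

RUBRIC = ("task_completed", "answer_grounded", "stayed_on_task", "budget_kept")


@dataclass
class Trajectory:
    agent: str
    answer: str
    sources: list[str]
    tokens_used: int
    token_budget: int


def verify(
    trajectory: Trajectory,
    judge: Callable[[Trajectory], dict[str, bool]],
) -> tuple[bool, list[str]]:
    """Return (accepted, failed_checks). A missing verdict counts as a failure."""
    verdict = judge(trajectory)
    failed = [check for check in RUBRIC if verdict.get(check) is not True]
    return (not failed, failed)


def stub_judge(t: Trajectory) -> dict[str, bool]:
    """Rule-based stand-in for the judge; a real judge would be a model call."""
    return {
        "task_completed": bool(t.answer.strip()),
        "answer_grounded": len(t.sources) > 0,
        "stayed_on_task": True,  # placeholder: needs a model-based judgement
        "budget_kept": t.tokens_used <= t.token_budget,
    }


grounded = Trajectory("weather", "Flood warning issued on 3 March", ["met-office-bulletin"], 900, 2000)
ungrounded = Trajectory("fraud", "Claimant has two prior fraud cases", [], 2500, 2000)

# Only trajectories that pass every rubric check reach the lead's synthesis step.
accepted = [t.agent for t in (grounded, ungrounded) if verify(t, stub_judge)[0]]
```

Treating a missing verdict as a failure is a deliberate choice in this sketch: if the judge does not answer a rubric question, the trajectory is held back rather than waved through.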

For the DACH region, observability is moreover a compliance question. BaFin, FMA and FINMA will increasingly push in 2026 for end-to-end traces of every agent call, for reproducibility (pinning model versions in production, recording all sub-agent prompts, tool calls and AgentCards) and for append-only audit storage. Retention periods depend on the sector: BFSI 10 years, pharma GxP often 25–30 years.
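What an end-to-end trace ID means in practice can be shown with the standard library alone. This sketch is not OpenTelemetry and not a regulator-approved format: `record`, `run_sub_agent`, `call_tool`, the field names and the model version string are assumptions for illustration, and the in-memory list merely stands in for append-only audit storage.

```python
import contextvars
import json
import uuid
from datetime import datetime, timezone

trace_id_var = contextvars.ContextVar("trace_id", default=None)
AUDIT_LOG: list[str] = []  # stand-in for append-only storage; entries are only appended


def record(event: str, **fields) -> None:
    """Append one JSON line that carries the current trace ID."""
    entry = {
        "trace_id": trace_id_var.get(),
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    AUDIT_LOG.append(json.dumps(entry, sort_keys=True))


def call_tool(name: str, arguments: dict) -> None:
    # Every tool call is logged under the same trace ID as its parent task.
    record("mcp_call", tool=name, arguments=arguments)


def run_sub_agent(agent: str, prompt: str) -> None:
    # Record the prompt and the pinned model version for reproducibility.
    record("a2a_task", agent=agent, prompt=prompt, model_version="example-model-pinned")
    call_tool("weather_lookup", {"region": "example"})


def handle_request(prompt: str) -> str:
    trace_id = uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        for agent in ("planner", "weather"):
            run_sub_agent(agent, prompt)
    finally:
        trace_id_var.reset(token)  # do not leak the ID into unrelated requests
    return trace_id


tid = handle_request("assess claim")
entries = [json.loads(line) for line in AUDIT_LOG]
```

Filtering the log by one `trace_id` then yields the full run: each sub-agent task, its prompt, the pinned model version and every tool call. A production setup would more likely rely on OpenTelemetry context propagation than on a hand-rolled context variable.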

Example: Allianz Project Nemo

Project Nemo by Allianz – a German insurer – is one of the most cleanly documented DACH-relevant multi-agent deployments and a textbook case for error handling. Seven specialised agents (planner, cyber, coverage, weather, fraud, payout, audit) process food-spoilage claims following natural disasters. The complete seven-agent workflow runs in under five minutes.

Two error-handling principles are structurally anchored here:

  • Audit agent as a topology element: A dedicated agent generates a complete summary of all agent decisions and justifications – a complete audit trail for compliance, quality control and human review. Auditability is built into the agent topology, not only into the logging pipeline.
  • Human-in-the-loop at the critical stage: A human caseworker reviews the audit summary and makes the final payout decision. A cascaded wrong decision is caught at the most expensive, irreversible point – as explicit policy.

The result: an 80% reduction in processing and settlement time for valid claims under AUD 500, live in under 100 days (Australia, July 2025). The lesson for error handling: modular agents with clear roles, a dedicated audit path and a human checkpoint make a system both fast and manageable.

Practical error-handling checklist

For agencies and B2B decision-makers

For marketing agencies and AI-native providers, error handling is a product-quality feature, not a background detail: a newsletter or SEO-audit pipeline with research, outline, draft and review agents stands or falls on the verifier-judge and the single-threaded write stage. For DACH B2B decision-makers the rule is: demand from every multi-agent pitch – internal or from a vendor – concrete answers on the timeout, fallback, verifier and trace strategy. A system without Galileo/LangSmith-class observability cannot be investigated when a regulator asks what happened in a given run. Those who build error handling into the agent topology from the start – following the Allianz Nemo pattern with an audit agent and human-in-the-loop – deliver systems that are fast and remain auditable.

FAQ

What is the difference between a cascading failure and context fragmentation?
A cascading failure arises when a sub-agent hallucinates a fact, the lead agent synthesises it into the answer as truth, and downstream agents build on it. Context fragmentation – Cognition's central criticism – describes, by contrast, parallel agents making incompatible implicit decisions because they do not share the same complete context. Both propagate, but require different remedies: verifier-judge and grounding against cascades, shared traces and single-threaded writes against fragmentation.

Why are simple retries not enough in multi-agent systems?
Retries with backoff catch transient errors – rate limits, timeouts, short-term tool failures. However, they do not help against semantic errors such as hallucinations, against context fragmentation or against deadlocks in which agents block one another. A repeated call to an agent that produces a wrong fact only delivers the same wrong fact. That is why you additionally need verifier agents, fallback paths, circuit breakers and timeouts on every task.

How do you prevent a single agent failure from toppling the whole system?
Through isolation and defined boundaries: each sub-agent gets its own token budget and a timeout; the A2A task lifecycle model makes failed states explicit; a circuit breaker stops repeatedly failing calls; fallback agents or a degraded response take over. The decisive factor is that the write path remains single-threaded, so that a faulty read agent only delivers a fragment contribution that the lead can discard or request anew.

What does circuit breaker mean in the context of agents?
A circuit breaker transfers the pattern familiar from the microservice world to agent calls: if errors from a particular sub-agent or tool accumulate above a threshold, the breaker opens and blocks further calls for a time window, instead of burning tokens and latency over and over again. This prevents the cost explosion documented in the research and gives the system time to switch to a fallback agent or a reduced response.

What role does human-in-the-loop play in error handling?
Human-in-the-loop is the last line of defence at critical, irreversible steps. Allianz Project Nemo demonstrates the pattern: seven agents work largely autonomously, but a human caseworker reviews the audit summary and makes the final payout decision. This way a cascaded wrong decision remains catchable at the most expensive point in the chain – a deliberate policy decision, not a technical stopgap.
