Pillar 3

Agent Architectures Overview

Overview of common agent architectures such as ReAct, planner-executor and reflection, and their use cases.

For: Software architects, developers, technical decision makers

Definition

Agent architectures are recurring design patterns that define how an LLM-based agent thinks, plans, calls tools and corrects itself. The five canonical patterns (ReAct, Reflexion, Plan-and-Execute, ReWOO and Tree-of-Thoughts) emerged in 2022/2023 and to this day form the foundation of nearly every production agent stack. They differ above all in how much is planned in advance, how often re-evaluation occurs and how elaborately the search space is explored, with correspondingly wide variation in cost, latency and error tolerance.

Key Takeaways

✓ ReAct (Yao et al., arXiv:2210.03629, Oct 2022) interleaves reasoning and acting in the same context (Thought → Action → Observation) and is the recommended entry-level pattern: low latency, low complexity, available in LangGraph as a one-liner (create_react_agent).
✓ Reflexion (Shinn et al., arXiv:2303.11366, NeurIPS 2023) adds a self-critique loop and achieved 91 % pass@1 on HumanEval (vs. ~80 % GPT-4 baseline in the study at the time), but fails without a reliable evaluator (MBPP example: underperformance due to false positives).
✓ Plan-and-Execute separates planning (large model) from execution (small model) and, according to the LangChain blog, saves roughly 30–60 % tokens compared to pure ReAct on multi-step tasks; corresponds to Anthropic's Orchestrator-Workers pattern.
✓ ReWOO (Xu et al., arXiv:2305.18323, May 2023) replaces N LLM calls with exactly two (Planner + Solver) and achieved 5x token efficiency on HotpotQA at +4 % accuracy compared to ReAct; ideal for deterministic n8n workflows in agencies.
✓ Tree-of-Thoughts (Yao et al., arXiv:2305.10601, NeurIPS 2023) searches a tree of solution branches and solved Game of 24 at 74 % (vs. 4 % CoT), but costs 10x to 100x the tokens and is largely obsolete for general reasoning with modern reasoning models.
✓ Routing and hierarchical patterns (Anthropic "Building Effective Agents", Dec 2024) complement the five base patterns: Routing classifies inputs and directs them to specialized paths, hierarchical setups nest orchestrator and sub-agents.
✓ The central practical lesson from production (Anthropic, Cognition, LangChain): start with the simplest pattern that works (usually ReAct) and only escalate to planning, reflexion or search once failure modes have been measured.
✓ Every agent needs hard upper limits (recursion_limit / max_iterations), persisted Thought/Action/Observation traces for auditability (relevant for GDPR and the EU AI Act) and observability tooling (LangSmith, Langfuse, Arize Phoenix).

Why agent architectures?

An LLM agent is more than a model with tools: it needs a structure that defines when it reasons, when it calls a tool, when it makes a plan and when it corrects itself. This structure is precisely what agent architectures describe: recurring design patterns that have become the industry's shared vocabulary over the past few years.

What is remarkable is how closely their origins cluster together: the five canonical patterns ReAct, Reflexion, Plan-and-Execute, ReWOO and Tree-of-Thoughts emerged within a narrow window between October 2022 and May 2023, predominantly from the orbit of Princeton, Google and Northeastern (Shunyu Yao is a co-author on three of the five papers). Since then, the industry has either generalized these patterns into framework primitives (such as LangGraph's create_react_agent or CrewAI's planning=True) or hybridized them into successor patterns (LATS, LLMCompiler, Plan-and-Act).

For decision-makers and tech leads in the DACH region, the core question is rarely "Which pattern is the most advanced?" but rather "Which pattern solves this concrete task at acceptable cost, latency and auditability?". This overview provides a vendor-neutral map for exactly that.

ReAct: reasoning and acting interleaved

ReAct (Yao et al., arXiv:2210.03629, October 2022, ICLR 2023) is the foundation of almost all of today's tool-using agents. The idea: the LLM alternately generates free-form "Thought" tokens (reasoning) and "Action" tokens (tool calls) and reads the "Observation" (the tool result) back in. The loop Thought → Action → Observation → Thought → … runs until the model emits a final Finish[answer].
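
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: scripted_model is a hypothetical stand-in for an LLM call, the multiply tool and the Action: name[args] text format are assumptions chosen to keep the example runnable without any provider.

```python
import re
from typing import Callable, Optional

# Hypothetical tool registry with a single toy tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "multiply": lambda arg: str(int(arg.split(",")[0]) * int(arg.split(",")[1])),
}

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call: reacts to the transcript, no real model involved."""
    if "Observation:" not in transcript:
        return "Thought: I need the product.\nAction: multiply[6,7]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the result.\nFinish[{observation}]"

def react(question: str, model=scripted_model, max_steps: int = 5) -> tuple[Optional[str], str]:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):  # hard upper limit on loop iterations
        output = model(transcript)
        transcript += output + "\n"
        finish = re.search(r"Finish\[(.*?)\]", output)
        if finish:
            return finish.group(1), transcript
        action = re.search(r"Action: (\w+)\[(.*?)\]", output)
        if action is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = action.groups()
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        # The observation is fed back so the next Thought can build on it.
        transcript += f"Observation: {observation}\n"
    return None, transcript  # step budget exhausted without a final answer

answer, trace = react("What is 6 times 7?")
```

The returned trace is the full Thought/Action/Observation transcript, which is what makes the pattern easy to inspect after the fact.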

The problem it solves: pure Chain-of-Thought (CoT) hallucinates facts because it has no grounding in reality. Pure action-only agents, in turn, cannot reason abstractly about long-term goals or recover from errors. ReAct unites both: reasoning steers tool use, and tool observations correct the reasoning. In the original study, this yielded +34 percentage points on ALFWorld (text-based household tasks) and +10 percentage points on WebShop (e-commerce navigation) compared to the imitation/RL baselines of the time. Those results were obtained with models of the GPT-3/PaLM generation, so the absolute figures are outdated today and should only be read as relative effects.

Strengths: low latency, low implementation complexity, high interpretability thanks to the traceable reasoning trace. In LangGraph, ReAct is a one-liner (create_react_agent); CrewAI uses it internally in every agent; n8n offers two native nodes, the "ReAct AI Agent" and the more modern "Tools Agent", whose execution view logs every Thought/Action step: a real advantage for non-developer marketing teams that need an auditable log.

Weaknesses: with ambiguous tool descriptions, the model hallucinates tool arguments. There is also "reasoning drift": once committed to a wrong thought, the agent interprets subsequent observations to fit it. And the context bloats because each step resends the system prompt and the entire trajectory so far (roughly O(N·T) input tokens for N steps at an average context length of T tokens). In practice, the upper limit is typically 10–25 steps before context loss or drift dominates.
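
To make the context growth concrete, the following back-of-the-envelope sketch adds up the input tokens when every step resends the system prompt plus the whole trajectory. The figures of 1,000 system-prompt tokens and 300 tokens added per step are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(steps: int, system_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens when each step resends the prompt and the full history."""
    total = 0
    for step in range(steps):
        # Before step k, the history already holds k earlier steps.
        total += system_tokens + step * tokens_per_step
    return total

# Assumed sizes: 1,000-token system prompt, 300 tokens appended per step.
totals = {n: cumulative_input_tokens(n, 1000, 300) for n in (5, 10, 25)}
print(totals)  # {5: 8000, 10: 23500, 25: 115000}
```

Under these assumptions, five times as many steps (5 to 25) cost more than fourteen times as many input tokens, which is why step limits and context trimming matter.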

An important practical lesson: modern frontier models master the reasoning-action loop natively via function calling. Explicit ReAct prompting has, according to the n8n blog, become largely unnecessary: what matters today is memory, iteration limits and traceability. The most common production failure with weaker models is malformed JSON in tool arguments; use structured output or function-calling mode wherever the provider supports it.

Reflexion: agents that critique themselves

Reflexion (Shinn et al., arXiv:2303.11366, NeurIPS 2023) wraps a self-correction loop around an existing agent. Three components interlock: an Actor (usually a ReAct or CoT agent) generates a trajectory; an Evaluator scores it (binary, scalar or via an external test suite); a Self-Reflection model turns the score plus trajectory into verbal feedback: a paragraph of natural-language critique that is stored in an episodic memory. On the next attempt, this reflexion is prepended to the Actor's prompt. The "policy update" is purely linguistic; no weights are changed.
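
A minimal sketch of the Actor / Evaluator / Self-Reflection loop, under stated assumptions: toy_actor and toy_reflect are hypothetical stand-ins for LLM calls, and the Evaluator is an external unit-test check rather than a model-generated one. It illustrates the control flow only, not the paper's code.

```python
def reflexion(task, actor, evaluator, reflect, max_attempts: int = 3):
    memory: list[str] = []  # episodic memory of verbal reflections
    attempt = None
    for _ in range(max_attempts):  # always cap the number of attempts
        attempt = actor(task, memory)
        passed, feedback = evaluator(attempt)
        if passed:
            break
        # The "policy update" is text only: store a critique for the next try.
        memory.append(reflect(task, attempt, feedback))
    return attempt, memory

def toy_actor(task, memory):
    # Stand-in for an LLM: the first draft is buggy, the draft written
    # after reading a reflection handles negative numbers.
    if not memory:
        return "def absolute(x):\n    return x"
    return "def absolute(x):\n    return -x if x < 0 else x"

def unit_test_evaluator(source):
    namespace = {}
    exec(source, namespace)  # acceptable here because the source is our own toy code
    fn = namespace["absolute"]
    for value, expected in [(3, 3), (-2, 2), (0, 0)]:
        if fn(value) != expected:
            return False, f"absolute({value}) returned {fn(value)}, expected {expected}"
    return True, "all tests passed"

def toy_reflect(task, attempt, feedback):
    return f"Previous attempt failed: {feedback}. Handle negative inputs explicitly."

solution, reflections = reflexion(
    "write absolute(x)", toy_actor, unit_test_evaluator, toy_reflect
)
```

In this toy run the first draft fails on the negative input, one reflection is stored, and the second draft passes.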

The paper's central insight: an LLM's self-verbalized error analysis is a stronger learning signal than a mere numerical reward, and it works entirely in-context across multiple attempts. On HumanEval (Python code generation), Reflexion achieved 91 % pass@1 versus roughly 80 % for the GPT-4 baseline at the time of the study.

One caveat matters for DACH B2B readers in particular: on MBPP, Reflexion underperformed the baseline because the self-generated unit tests had a high false-positive rate and the agent reported "success" prematurely. Reflexion is therefore no universal improver. It needs a high-quality evaluator signal. For tasks without a clear oracle (creative writing, open-ended research), reflexions degenerate into vague platitudes, and "confabulated reflexions" (i.e. misdiagnosed error causes) pass the wrong correction on to the next attempt.

Costs: per attempt, roughly 2x to 5x a single ReAct run, multiplied by K attempts; at a typical K=3 with all attempts used, therefore 6x to 15x. Attempts are necessarily sequential, which generally rules out real-time use cases.

Three practical rules from the research: first, always cap the iterations (max_reasoning_attempts in CrewAI, revision_number ≤ N in LangGraph). Second, where possible provide an external ground-truth signal (unit tests, RAG evaluator, regex match): self-assessment alone is unreliable. Third, cache reflexions: many teams persist them in a vector store by task type, building an emergent skill library this way. A note on naming: in some sources (LangChain blog), "Reflection" means any self-critique loop, whereas "Reflexion" refers specifically to the paper by Shinn et al. The distinction between Reflexion (Shinn et al.) and the Reflection pattern in the broader sense is worth making.

Plan-and-Execute: the plan first, then the execution

The Plan-and-Execute pattern decouples planning from execution. A Planner creates a numbered multi-step plan once, an Executor (often a ReAct sub-agent) works through it step by step, and a Replanner decides after each execution whether to terminate or output an adjusted remaining plan. Conceptually it is based on "Plan-and-Solve" prompting (Wang et al., arXiv:2305.04091, ACL 2023) and BabyAGI; the agent name stems from the LangChain port. The precise phrasing is therefore: the Plan-and-Execute architecture, popularized by LangChain on the basis of Plan-and-Solve prompting.
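
The three roles can be sketched as plain functions. Everything named toy_* below is a hypothetical stand-in: in a real system the planner and replanner would be calls to a large model and the executor a tool-using sub-agent. The sketch only shows how control passes between them.

```python
def plan_and_execute(task, planner, executor, replanner, max_steps: int = 10):
    plan = planner(task)  # one upfront planning call
    done = []             # (step, result) pairs completed so far
    while plan and len(done) < max_steps:
        step, remaining = plan[0], plan[1:]
        result = executor(step, done)
        done.append((step, result))
        # After every step the replanner may keep, change or empty the plan.
        plan = replanner(task, remaining, done)
    return done

def toy_planner(task):
    return ["fetch numbers", "sum numbers", "format report"]

def toy_executor(step, done):
    previous = done[-1][1] if done else None
    if step == "fetch numbers":
        return [4, 8, 15]
    if step == "sum numbers":
        return sum(previous)
    if step == "format report":
        return f"Total: {previous}"
    raise ValueError(f"unknown step: {step}")

def toy_replanner(task, remaining, done):
    last_result = done[-1][1]
    if last_result == []:
        return []        # nothing to process: terminate early
    return remaining     # otherwise continue with the rest of the plan

history = plan_and_execute("daily report", toy_planner, toy_executor, toy_replanner)
```

Because the plan is an explicit list, it can be logged or shown to a human before any step runs.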

The decisive lever: planning is a heavy reasoning task (large model), execution is per-step tool use (smaller, cheaper model). This model staggering empirically saves 30–60 % tokens compared to pure ReAct on multi-tool tasks, according to the LangChain blog. The explicitly articulated plans are moreover auditable, a strong argument for enterprise and compliance contexts.

Weaknesses: plans are brittle. If the upfront plan is wrong, the Executor wastes calls on steps doomed to fail until the Replanner notices. Execution remains sequential (no true parallelism). And every replan invokes the large model again; in highly stochastic environments, replanning happens at almost every step, which eats up the cost advantage. Rule of thumb: use it when tasks decompose into more than three independent steps, a clear oracle for plan validity exists and latency is not directly perceptible to the user. Do not use it when the environment is highly stochastic; there, ReAct's reactive loop is strictly better. Anthropic's "Orchestrator-Workers" pattern is essentially the same thing, reframed.

ReWOO: Reasoning Without Observation

ReWOO (Xu et al., arXiv:2305.18323, May 2023) is the cost-optimized answer to ReAct's token hunger. Three modules: a Planner generates, in a single LLM call, the complete chain of plan steps and tool calls, using a variable syntax (#E1, #E2, …) that allows references to results not yet available. A Worker executes the tools in the prescribed order and replaces the placeholders with real results. A Solver finally reads the task plus all evidence in a final LLM call and formulates the answer.
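
The placeholder mechanism is the part that is easiest to see in code. In this sketch, rewoo_planner and solver are hypothetical stand-ins for the two LLM calls, the lookup tool and its toy data are invented for illustration, and the plan format (variable, tool, argument) is an assumption rather than the paper's exact syntax.

```python
import re

def rewoo_planner(task):
    """Stand-in for the single Planner call: returns the whole plan at once."""
    return [
        ("#E1", "lookup", "capital of Austria"),
        ("#E2", "lookup", "population of #E1"),  # refers to a result not yet known
    ]

# Toy data for the hypothetical lookup tool.
FACTS = {"capital of Austria": "Vienna", "population of Vienna": "about 2 million"}
TOOLS = {"lookup": lambda query: FACTS.get(query, "no result")}

def worker(plan):
    evidence = {}
    for variable, tool, argument in plan:
        # Replace #E placeholders with evidence gathered in earlier steps.
        resolved = re.sub(
            r"#E\d+", lambda m: evidence.get(m.group(0), m.group(0)), argument
        )
        evidence[variable] = TOOLS[tool](resolved)
    return evidence

def solver(task, evidence):
    """Stand-in for the single Solver call that reads the task plus all evidence."""
    return f"{task} Answer: {evidence['#E2']}"

task = "How many people live in Austria's capital?"
evidence = worker(rewoo_planner(task))
answer = solver(task, evidence)
```

No model call happens inside worker, which is why the number of LLM calls stays at two regardless of how many tool steps the plan contains.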

The effect: N LLM calls are replaced by exactly two (Planner + Solver), independent of the number of tool steps. On HotpotQA, the paper reports 5x token efficiency at +4 % accuracy compared to ReAct, with consistent token reduction across six NLP benchmarks. ReWOO is moreover robust against tool failures: the plan is already fixed, and the Solver cleanly detects missing evidence.

Limits: there is no adjustment mid-execution. If evidence 3 contradicts the plan, ReWOO cannot replan within the same run. Without environmental context, the Planner struggles with unfamiliar tool ecosystems (few-shot prompting or fine-tuning needed). And tools run sequentially; true parallelism is only delivered by the successor LLMCompiler (Kim et al., arXiv:2312.04511) via a DAG.

For DACH marketing agencies, ReWOO is often the best pattern in n8n: n8n's strength is deterministic, declarative workflows, variable substitution is native, and tasks like "research X → enrich → format → send" are thereby debuggable, cheap and fault-tolerant. Field reports cite roughly 65 % token cost reduction at 4–5 % accuracy gain. Anti-pattern: do not use ReWOO for tasks where tool results frequently invalidate the plan (e.g. interactive web navigation): there, Plan-and-Execute or ReAct are better.

Tree-of-Thoughts: search instead of a line

Tree-of-Thoughts (Yao et al., arXiv:2305.10601, NeurIPS 2023) breaks with the strictly left-to-right logic of CoT. Instead of a single chain of thought, a tree is spanned: each node is a "Thought" (a partial solution), at each point the model generates k candidates, a State Evaluator scores them (e.g. via a sure/maybe/impossible vote), and a search algorithm (BFS or DFS with backtracking) explores the tree. The paper explicitly frames this as "System 2" search in the tradition of Newell & Simon's problem-solving formalisms.
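
The search skeleton can be shown on a toy puzzle (pick numbers that sum to a target). This is a sketch under assumptions: propose and evaluate are deterministic stand-ins for what would be LLM calls in a real ToT setup, and the puzzle, beam width and scoring are invented for illustration.

```python
NUMBERS = [7, 5, 3, 2]  # toy puzzle: choose numbers that sum to TARGET
TARGET = 10

def propose(state: tuple) -> list:
    """Thought generator: extend a partial selection by one later index."""
    start = state[-1] + 1 if state else 0
    return [state + (i,) for i in range(start, len(NUMBERS))]

def evaluate(state: tuple) -> float:
    """State evaluator: 1.0 = solved, 0.0 = impossible, in between = maybe."""
    total = sum(NUMBERS[i] for i in state)
    if total > TARGET:
        return 0.0
    return total / TARGET

def tot_bfs(beam_width: int = 2, max_depth: int = 4):
    frontier = [()]  # start from the empty partial solution
    for _ in range(max_depth):
        scored = [(evaluate(c), c) for s in frontier for c in propose(s)]
        for score, candidate in scored:
            if score == 1.0:
                return [NUMBERS[i] for i in candidate]
        # Prune impossible branches, keep only the best beam_width states.
        viable = [item for item in scored if item[0] > 0.0]
        ranked = sorted(viable, key=lambda item: item[0], reverse=True)
        frontier = [candidate for _, candidate in ranked[:beam_width]]
        if not frontier:
            return None
    return None

solution = tot_bfs()
```

With real models, every call to propose and evaluate costs tokens, which is where the large multiplier over a single CoT call comes from.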

The results are dramatic on search-heavy tasks: Game of 24 solved at 74 % (versus 4 % CoT, 7.3 % IO), mini-crosswords at 60 % word-level accuracy (vs. 16 % CoT), and creative writing with better coherence. The price, however, is enormous: 10x to 100x the tokens compared to a single CoT call, and effectiveness depends heavily on generator quality (GPT-3.5+ToT reached only 19 % instead of 74 % on Game of 24).

The most important field lesson on ToT: for general reasoning it is largely obsolete in 2026, because modern reasoning models (o-series, Claude with Extended Thinking, Gemini 2.x Thinking) internalize the search within the model. ToT remains relevant as a conceptual foundation for tree-structured agent search (it is the basis of LATS) as well as for three niches: audit-bound regulated industries, puzzles/optimization with verifiable rewards, and small-model deployments with cheap proposers. In tools like n8n, ToT is practically unscalable due to the combinatorial explosion; the clean approximation is "best-of-N sampling": several parallel runs from which a critic picks the best (ToT with depth 1).
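
Best-of-N sampling is small enough to sketch directly. toy_generate and toy_critic are hypothetical stand-ins for a sampling LLM call and a critic; the headline fragments and the 30-character limit are invented for illustration.

```python
import random

def best_of_n(generate, critic, n: int, seed: int = 0):
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    drafts = [generate(rng) for _ in range(n)]  # independent, parallelizable runs
    return max(drafts, key=critic), drafts      # the critic picks one winner

OPENERS = ["Save time", "Cut costs", "Ship faster"]
CLOSERS = ["with auditable agents", "today", "without adding headcount"]

def toy_generate(rng):
    # Stand-in for one sampled LLM draft.
    return f"{rng.choice(OPENERS)} {rng.choice(CLOSERS)}"

def toy_critic(draft, limit: int = 30):
    # Prefers the longest draft that still fits the character limit.
    return len(draft) if len(draft) <= limit else -1

best, drafts = best_of_n(toy_generate, toy_critic, n=5)
```

Unlike a full tree search, the cost grows linearly with N, and the drafts do not depend on each other.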

Routing and hierarchical patterns

Beyond the five reasoning patterns, Anthropic names two further structure-giving patterns in "Building Effective Agents" (December 2024) that often form the framing in practice. Routing classifies an input and directs it to a specialized path, e.g. a classifier step that distributes support requests by type to different prompts or models. Hierarchical architectures nest an orchestrator with sub-agents: a coordinating LLM decomposes the task dynamically and delegates to subordinate agents. Plan-and-Execute with ReAct sub-agents is exactly this variant (Anthropic's Orchestrator-Workers).
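
Routing in its simplest form is a classifier followed by a dispatch table. In the sketch below, classify uses keyword rules as a stand-in for an LLM or small classifier model, and the three handlers are hypothetical placeholders for specialized prompts or models.

```python
def classify(ticket: str) -> str:
    """Stand-in for a classifier model: keyword rules only."""
    text = ticket.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

# Each route would normally wrap its own prompt, tool set or model.
HANDLERS = {
    "billing": lambda ticket: f"[billing path] {ticket}",
    "technical": lambda ticket: f"[technical path] {ticket}",
    "general": lambda ticket: f"[general path] {ticket}",
}

def route(ticket: str) -> str:
    label = classify(ticket)
    return HANDLERS[label](ticket)

routed = [route(t) for t in ("Refund for invoice 17", "App crash on login", "Opening hours?")]
```

The design choice is that each path can be tuned and evaluated separately, instead of one prompt trying to cover every request type.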

On caution with hierarchy, Cognition (Devin) provides the most-cited lesson: "Don't Build Multi-Agents" (June 2025) warned that parallel agents implicitly make conflicting decisions and deliver fragile results. The updated position "Multi-Agents: What's Actually Working" (April 2026) holds multi-agent to be viable for read-parallel, write-single-threaded setups. Translated for agency clients: bet on a single strong agent with tools; parallel sub-agents only for information gathering, never for write or state changes.
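
One way to read "read-parallel, write-single-threaded" is sketched below with the standard library. research_agent is a hypothetical read-only sub-agent and write_report the single writer; the source names are invented. It is an interpretation of the principle, not Cognition's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def research_agent(source: str) -> dict:
    # Read-only stand-in for a sub-agent: returns findings, touches no shared state.
    return {"source": source, "finding": f"summary of {source}"}

def gather_parallel(sources: list[str]) -> list[dict]:
    # Information gathering may fan out; pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(research_agent, sources))

def write_report(findings: list[dict]) -> str:
    # Exactly one writer merges the findings and produces the only state change.
    return "\n".join(f"- {f['source']}: {f['finding']}" for f in findings)

report = write_report(gather_parallel(["pricing page", "changelog", "reviews"]))
```

Because only write_report produces output, the sub-agents cannot make conflicting decisions about the final result.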

Comparison: when to use which pattern

Use case (DACH B2B / marketing)	Recommended pattern	Rationale
Chatbot with CRM and KB access	ReAct	Reactive, low latency, native in all frameworks
Daily marketing report (scrape → analyze → write → send)	ReWOO or Plan-and-Execute	Plan once, execute cheaply
Code/bugfix agent	Reflexion + ReAct	HumanEval evidence, needs unit tests as oracle
Open-ended multi-step research (market/competitive analysis)	Plan-and-Execute + ReAct sub-agents	Long horizon, replanning needed
Creative copywriting with constraints	ToT (best-of-N) or LATS	Search over drafts pays off
Optimization/math/scheduling puzzles	ToT / LATS	Searchable, verifiable rewards
High-volume ticket triage	ReAct (Tools Agent)	Latency and cost dominate
Compliance-critical workflow (GDPR / EU AI Act)	Plan-and-Execute or ReWOO with human-in-the-loop	Auditable plan, deterministic execution

The following order-of-magnitude table should be read as a rough guideline (synthesized estimates from paper figures and field reports, not direct measurements); measure on your own workload:

Pattern	Tokens (relative to 1x CoT)	Latency (N tool steps)	Complexity
ReAct	3–10x	N × sequential	Low
Reflexion (K=3)	10–30x	K × ReAct, sequential	Medium
Plan-and-Execute	2–6x	1 plan + N sequential	Medium
ReWOO	1.5–3x	1 plan + N tools + 1 Solver	Medium
Tree-of-Thoughts (b=5, d=3)	50–150x	b^d evaluator calls	High
LATS (ToT + Reflexion)	100–300x	tree × reflexion	Very high

The common thread for practice

Across all patterns, the most important field lesson from production blog posts of 2024–2026 (Anthropic, Cognition, LangChain) is strikingly sober: start with the simplest pattern that works (usually ReAct) and only escalate to planning, reflexion or search when measured failure modes demand it. Anthropic puts it this way: "The most successful implementations used simple, composable patterns rather than complex frameworks."

The 2025/2026 generation of patterns is at its core a recombination of the original five: LATS = ToT + Reflexion + MCTS (native in LangGraph), LLMCompiler = ReWOO + parallel DAG (~3.6x speedup), Plan-and-Act = Plan-and-Execute for long horizons. The frameworks too are converging on a shared primitive of state graph plus tool calling: Microsoft has consolidated AutoGen and Semantic Kernel into the Microsoft Agent Framework (AutoGen is officially in maintenance mode), LangChain is moving create_react_agent into langchain.agents.create_agent with middleware. For DACH decision-makers evaluating the Microsoft stack, this is the relevant junction.

Three non-negotiable practical rules in closing: first, every agent needs hard upper limits (recursion_limit, max_iterations), otherwise loops escalate into costs. Second, observability tools (LangSmith, Langfuse, Arize Phoenix) are de facto mandatory. Third, for compliance, complete Thought/Action/Observation traces with PII scrubbing must be persisted: this is central precisely for GDPR- and EU-AI-Act-relevant systems.
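
The first and third rule can be combined in a small framework-independent sketch: a hard iteration limit plus a persisted trace with PII scrubbing. The names run_agent, looping_step and IterationLimitExceeded are hypothetical, and the single e-mail regex is a deliberately simplistic assumption; real PII scrubbing needs considerably more.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class IterationLimitExceeded(RuntimeError):
    """Raised when the agent does not finish within the hard step budget."""

def scrub_pii(text: str) -> str:
    # Illustrative only: masks e-mail addresses and nothing else.
    return EMAIL.sub("[email]", text)

def run_agent(step, max_iterations: int, trace: list) -> str:
    for index in range(max_iterations):
        record = step(index)  # expected keys: thought, action, observation, answer
        scrubbed = {key: scrub_pii(str(value)) for key, value in record.items()}
        trace.append(json.dumps({"step": index, **scrubbed}))  # one JSON line per step
        if record["answer"] is not None:
            return record["answer"]
    raise IterationLimitExceeded(f"no answer after {max_iterations} steps")

def looping_step(index):
    # A misbehaving agent that never finishes and leaks an address into its trace.
    return {"thought": "retry", "action": "crm_lookup",
            "observation": "contact: jane.doe@example.com", "answer": None}

trace: list = []
try:
    run_agent(looping_step, max_iterations=3, trace=trace)
    outcome = "finished"
except IterationLimitExceeded:
    outcome = "stopped by limit"
```

In this run the loop is cut off after three steps, and the three trace lines contain the masked placeholder instead of the address.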

Note: Compliance statements in this text are informational and do not constitute legal advice. The benchmark figures cited stem predominantly from the original papers of the years 2022–2023 (GPT-3.5/GPT-4 era) and should be understood as relative effect sizes, not as absolute values for today's frontier models.

All Articles in this Topic

7 Articles

2.2

The ReAct Pattern: Thought, Action, Observation

The ReAct pattern (Reasoning and Acting) is an agent design pattern in which an LLM alternates between reasoning (Thought), calling a tool (Action) and reading the result (Observation). This loop repeats until the agent produces a final answer. Introduced by Yao et al. (2022).

Intermediate·7 min

2.3

Chain-of-Thought for Agents: When Does It Help, and When Not?

Chain-of-Thought (CoT) is a prompting technique in which a large language model spells out its intermediate steps explicitly in words before answering. Instead of producing a result directly, the model writes down the solution path step by step. This improves accuracy on multi-step logic, mathematics and planning, but costs additional tokens and latency.

Intermediate·7 min

2.4

Tree of Thoughts: When One Path Is Not Enough

Tree of Thoughts (ToT) is a reasoning method for language models that, instead of a single linear chain of thought, generates, evaluates and explores multiple reasoning paths in parallel via search (BFS or DFS) with backtracking. This lets the model spot dead ends, backtrack and consider alternatives, rather than getting stuck on a wrong assumption.

Advanced·7 min

2.5

The Reflexion Pattern: Agents That Learn From Their Mistakes

The Reflexion pattern is an agent architecture in which an LLM agent reflects on its past attempts: an Actor produces a solution, an Evaluator assesses it, and a Self-Reflection model writes a verbal critique from this into a memory buffer. On the next attempt, the Actor reads this reflection and corrects itself, entirely without model training.

Advanced·7 min

2.6

Plan-and-Execute: Separating Planning from Execution

Plan-and-Execute is an agent architecture in which a Planner first creates a complete multi-step plan and an Executor works through it step by step. A Replanner adjusts the plan when needed. Separating planning from execution reduces LLM calls and improves control over long-horizon tasks compared with pure ReAct.

Intermediate·7 min

2.7

Hierarchical Agents: Supervisor and Sub-Agents

Hierarchical agents are a multi-agent architecture in which a supervisor agent decomposes a complex task, delegates subtasks to specialised sub-agents and merges their results. Instead of a single agent, a higher-level control instance coordinates several subordinate workers and aggregates their output into an overall solution.

Advanced·7 min

2.8

Event-driven agents: AutoGen v0.4 / AG2 architecture explained

Event-driven agents are autonomous software actors that communicate asynchronously via messages and events rather than in a fixed sequential loop. Each agent reacts to incoming events, processes them independently and publishes results, as in AutoGen v0.4 and AG2. This enables loose coupling, parallelism and long runtimes.

Advanced·7 min