
The Reflexion Pattern: Agents That Learn From Their Mistakes

Blck Alpaca
Definition

The Reflexion pattern is an agent architecture in which an LLM agent reflects on its past attempts: an Actor produces a solution, an Evaluator assesses it, and a Self-Reflection model turns that assessment into a verbal critique stored in a memory buffer. On the next attempt, the Actor reads this reflection and corrects itself, entirely without model training.

Key Takeaways

  • Reflexion (Shinn et al., arXiv:2303.11366, NeurIPS 2023) lets agents learn from mistakes in-context: the Actor generates, the Evaluator assesses, and Self-Reflection writes a verbal critique into an episodic memory.
  • Self-correction happens purely through language. No model weights are altered. This is not genuine training but verbal reinforcement across multiple attempts.
  • The strongest area of application is iterative tasks with a clear verification signal: on HumanEval (Python code generation), Reflexion achieved 91% pass@1 versus a GPT-4 baseline of around 80%.
  • Without a reliable evaluator, the pattern breaks down: on MBPP, Reflexion fell below the GPT-4 baseline because the self-generated unit tests produced too many false positives.
  • Reflexion costs roughly 10 to 30 times a single CoT run and operates strictly sequentially. It is suited to batch and back-office tasks, not real-time chat.
  • Always cap iterations hard and, where possible, use an external ground-truth signal (unit tests, RAG evaluator, regex). Self-assessment alone is unreliable.

The Reflexion pattern is an agent architecture in which an LLM agent systematically reflects on its own past attempts and improves as a result. Instead of a single answer, the agent works through several attempts: an Actor produces a solution, an Evaluator assesses it, and a Self-Reflection model writes a verbal critique from the assessment and the trajectory. This critique lands in a memory buffer and steers the next attempt. Remarkably, no model weights are altered. The improvement is purely linguistic.

The method goes back to the paper "Reflexion: Language Agents with Verbal Reinforcement Learning" by Shinn, Cassano, Berman, Gopinath, Narasimhan and Yao (arXiv:2303.11366, March 2023, NeurIPS 2023). It was the first widely cited demonstration that an LLM's self-verbalised error analysis can be a stronger learning signal for in-context improvement than a simple numerical reward.

The essentials in three sentences

  • What it is: A loop of Actor, Evaluator and Self-Reflection with episodic memory. The agent learns from mistakes in-context across multiple attempts, without fine-tuning.
  • When it pays off: On iterative tasks with a clear, automatable verification signal, such as code generation with unit tests. Here it measurably increases the success rate.
  • Where the limit lies: High cost (on the order of 10 to 30 times a CoT call), strictly sequential, and useless without a reliable evaluator. It is no substitute for genuine training.

How the Reflexion loop works

The pattern consists of three components that interact in a loop:

  1. Actor – produces the solution, or the solution path. This is typically itself a ReAct or chain-of-thought agent.
  2. Evaluator – assesses the attempt. The assessment can be binary (pass/fail), scalar, or carried out by an external test suite.
  3. Self-Reflection model – converts the assessment plus the trajectory into verbal feedback: a paragraph of natural-language critique. This text is stored in an episodic memory buffer.

On the next attempt, the reflection text is prepended to the Actor's context. The "policy update" is therefore purely linguistic; no weights change. The memory is deliberately kept small, with typically one to three reflections retained.

The process in brief:

```
Attempt_k: Actor → Evaluator → Self-Reflect → [append to memory]
Attempt_k+1: Actor sees memory → renewed attempt
... until success or max. attempts reached
```
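The memory handling described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `MAX_REFLECTIONS` and `build_actor_prompt`, the prompt wording and the sample notes are all assumptions made for this example.

```python
from collections import deque

MAX_REFLECTIONS = 3  # assumption: upper end of the "one to three" range


def build_actor_prompt(task: str, reflections: deque) -> str:
    """Prepend stored reflections to the task description for the next attempt."""
    if not reflections:
        return task
    notes = "\n".join(f"- {r}" for r in reflections)
    return f"Lessons from earlier attempts:\n{notes}\n\nTask:\n{task}"


# A deque with maxlen silently drops the oldest entry once it is full.
memory = deque(maxlen=MAX_REFLECTIONS)
for note in ["handle empty list", "fix off-by-one",
             "check negative input", "avoid recursion"]:
    memory.append(note)

prompt = build_actor_prompt("Write sum_positive(xs).", memory)
print(prompt)
```

After four reflections, only the three most recent remain in the prompt, which keeps the context small no matter how many attempts have run.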

The conceptual advantage over classical reinforcement learning: standard RL needs many samples and expensive fine-tuning to learn from feedback. Reflexion learns in-context across just a few attempts and needs no training run, which makes it far cheaper and faster to set up.

When the Reflexion pattern increases the success rate

Its usefulness stands or falls with the quality of the evaluator signal. The strongest documented example is code generation, because an automatic oracle exists there: unit tests.

| Benchmark | Reflexion | Comparison/baseline | Interpretation |
| --- | --- | --- | --- |
| HumanEval pass@1 (Python) | 91% | ~80% GPT-4 baseline (at time of paper) | Clear gain thanks to test oracle |
| ALFWorld (success rate) | near-perfect after a few attempts | | Iterative task with feedback |
| HotpotQA (Distractor) | substantial absolute gains over CoT/ReAct | CoT, ReAct alone | Reasoning task benefits |
| MBPP pass@1 (Python) | fell below the baseline | ~80% GPT-4 baseline | Warning sign: weak evaluator |

The ideal configuration for Reflexion is thus clearly delineated: the task is iterative, a failed attempt is cheap to repeat, and there is an automatable verification signal. This applies to code generation and bug fixing, to tasks with verifiable intermediate results, and to text-based search/action environments such as ALFWorld. In the decision matrix of the underlying research, the combination "Reflexion + ReAct" is expressly listed for code-generation and bug-fix agents, with the caveat: needs unit tests as an oracle.
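To make "unit tests as an oracle" concrete, here is a minimal sketch of such an evaluator. It assumes the tests are plain `assert` statements held as strings. `EvalResult`, the `mean` example and the test strings are hypothetical. Running model output through `exec()` is unsafe outside a sandbox and is used here only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    all_passed: bool
    errors: list = field(default_factory=list)


def run_unit_tests(code: str, tests: list) -> EvalResult:
    """Execute candidate code, then each test statement, collecting failures.

    Note: exec() on untrusted model output is unsafe; sandbox it in practice.
    """
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as exc:  # syntax or load-time error in the candidate
        return EvalResult(False, [f"code did not load: {exc!r}"])
    errors = []
    for test in tests:
        try:
            exec(test, namespace)
        except AssertionError:
            errors.append(f"failed: {test}")
        except Exception as exc:
            errors.append(f"error in {test}: {exc!r}")
    return EvalResult(not errors, errors)


candidate = "def mean(xs):\n    return sum(xs) / len(xs)\n"
tests = ["assert mean([2, 4]) == 3", "assert mean([]) == 0"]
result = run_unit_tests(candidate, tests)
print(result.all_passed, result.errors)
```

The error strings are what the Self-Reflection step would receive as input. Here the second test surfaces the unhandled empty-list case.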

The limits: cost, fallacies and no genuine learning

Reflexion is no silver bullet. The most important weaknesses, all documented in the paper and in practice:

  • The MBPP case (the paper's programming experiments): On MBPP, Reflexion fell below the GPT-4 baseline because the self-generated unit tests had a high false-positive rate. The agent declared the attempt successful prematurely. This is the canonical warning against the assumption "Reflexion always improves performance".
  • Confabulated reflections: If the model misdiagnoses why it failed, the next attempt inherits the wrong correction. The self-critique can therefore also lead the agent astray.
  • Reward hacking: The agent may learn to trick the evaluator rather than solve the actual task.
  • Vague reflections without an oracle: In creative writing or open-ended research, a clear verification signal is missing. The reflections then flatten into general platitudes.
  • Cost and latency: Per attempt, roughly two to five times a ReAct run, multiplied by K attempts. The attempts are necessarily sequential, each requiring the previous reflection. Real-time use cases are usually ruled out.
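The false-positive failure mode from the MBPP case can be illustrated with a toy example. All names and test cases are invented for this sketch and are not taken from the benchmark. A buggy draft passes the agent's own weak tests, so the loop stops early, while a held-out test would have caught the bug.

```python
def candidate_max(xs):
    """Buggy draft: starts from 0, so it is wrong for all-negative input."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def passes(fn, cases):
    """Return True if fn returns the expected value for every (arg, expected) pair."""
    return all(fn(arg) == expected for arg, expected in cases)


# Hypothetical self-generated tests: they only probe positive numbers.
self_generated = [([1, 5, 3], 5), ([2], 2)]
# Hypothetical held-out tests, standing in for the real test suite.
held_out = [([1, 5, 3], 5), ([-4, -2, -9], -2)]

accepted_by_agent = passes(candidate_max, self_generated)
actually_correct = passes(candidate_max, held_out)
false_positive = accepted_by_agent and not actually_correct
print(f"accepted={accepted_by_agent}, correct={actually_correct}, "
      f"false_positive={false_positive}")
```

Because the evaluator says "pass", no reflection is ever written, and the error goes out unnoticed. This is why an external ground-truth signal matters more than the number of attempts.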

The fundamental conceptual point for decision-makers: Reflexion is not genuine learning. No weights change, and what is "learned" lives in the context window of the current episode. Lasting knowledge only emerges when reflections are persisted externally.

Cost in comparison

The following figures are indicative (orders of magnitude), synthesised from the paper's data and field reports. They are no substitute for measurement against your own workload.

| Pattern | Tokens (relative to one CoT call = 1) | Latency (with N tool steps) | Complexity |
| --- | --- | --- | --- |
| ReAct | 3–10× | N × sequential | low |
| Reflexion (K=3 attempts) | 10–30× | K × ReAct, sequential | medium |
| Plan-and-Execute | 2–6× | 1 plan + N sequential | medium |
| ReWOO | 1.5–3× | 1 plan + N tools + 1 solver | medium |

Reflexion therefore sits deliberately at the upper end of the cost range. Anyone deploying it without a hard iteration limit risks the loop driving costs up.
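As a rough worked example, the Reflexion row can be approximated from the ReAct row by multiplying by the attempt cap K. The sketch below assumes every attempt is a full ReAct run and ignores the extra tokens for evaluation and reflection. The function name is hypothetical, and the result is a budgeting aid, not a measurement.

```python
def reflexion_token_range(react_low: float, react_high: float, attempts: int) -> tuple:
    """Worst-case token multiplier if every attempt is a full ReAct run.

    Multipliers are relative to one CoT call = 1.
    Reflection and evaluator tokens are ignored (simplifying assumption).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return (react_low * attempts, react_high * attempts)


low, high = reflexion_token_range(3, 10, attempts=3)
print(f"K=3: roughly {low}x to {high}x a single CoT call")
```

With the ReAct range of 3–10× and K=3, this gives 9–30×, in line with the indicative 10–30× figure. Raising K raises the upper bound linearly, which is the arithmetic reason for a hard cap.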

Practical example: a self-correcting code agent

A concrete scenario from the documented area of application: an agency needs an agent that delivers runnable Python code from a function description. The Reflexion loop in outline:

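```python
MAX_ATTEMPTS = 3  # hard cap on iterations
MAX_MEMORY = 3    # keep only the latest 1-3 reflections


def reflexion_loop(task, actor, evaluator, self_reflect):
    """actor, evaluator and self_reflect are placeholders for your own components."""
    reflections = []
    for _ in range(MAX_ATTEMPTS):
        code = actor.generate(task, context=reflections)        # Actor
        result = evaluator.run_unit_tests(code)                 # Evaluator = test oracle
        if result.all_passed:
            return code                                         # stop on success
        critique = self_reflect.analyse(code, result.errors)    # verbal critique
        reflections = (reflections + [critique])[-MAX_MEMORY:]  # cap memory
    return None  # no passing solution within the attempt budget
```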

To illustrate the logic behind the HumanEval result: a first draft fails two of ten tests. The Self-Reflection notes, in effect, "edge case of empty list not handled, off-by-one in the loop". The second attempt reads this note, corrects both, and passes all tests. It is precisely this pattern that drives the HumanEval improvement from around 80% to 91% pass@1.

Two non-negotiable production lessons from the research:

  • Always cap iterations. In CrewAI via max_reasoning_attempts, in LangGraph via a condition revision_number ≤ N. Otherwise the loop can escalate.
  • Use an external ground-truth signal wherever possible – unit tests, RAG evaluator, regex match. Pure self-assessment is unreliable; the MBPP case is the cautionary tale.

If you have recurring task types, you can also store reflections in a vector database, keyed by task type. This gives rise to an emergent skill library that operates beyond individual episodes, the bridge to skill-library approaches such as Agent Workflow Memory (arXiv:2409.07429).
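The idea of persisting reflections by task type can be sketched with a plain in-memory store. This is a stand-in only. A real setup would use a vector database with similarity search. `ReflectionStore`, its methods and the sample task type are hypothetical names for this example.

```python
from collections import defaultdict


class ReflectionStore:
    """In-memory stand-in for a vector database, keyed by task type."""

    def __init__(self, max_per_type: int = 3):
        if max_per_type < 1:
            raise ValueError("max_per_type must be at least 1")
        self.max_per_type = max_per_type
        self._by_type = defaultdict(list)

    def add(self, task_type: str, reflection: str) -> None:
        bucket = self._by_type[task_type]
        if reflection not in bucket:  # skip verbatim duplicates
            bucket.append(reflection)
        del bucket[:-self.max_per_type]  # keep only the newest entries

    def recall(self, task_type: str) -> list:
        """Return stored reflections for a task type (empty list if none)."""
        return list(self._by_type.get(task_type, []))


store = ReflectionStore(max_per_type=2)
store.add("csv-parsing", "check for empty input")
store.add("csv-parsing", "strip BOM before parsing")
store.add("csv-parsing", "validate header row")
print(store.recall("csv-parsing"))
```

At the start of a new episode, the recalled entries would be loaded into the Actor's context in the same way as reflections from the current episode.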

Placement within the framework ecosystem (as of 2026)

  • LangGraph: an official tutorial with three variants – Basic Reflection (generator-reflector pair), Reflexion (draft responder → tool execution → revisor loop with explicit reflection memory) and LATS (Reflexion plus tree search). A strong production pattern here is structured output that bundles answer, reflection (what is missing, what is superfluous) and follow-up search_queries.
  • CrewAI: two-tiered. At the agent level via Agent(reasoning=True, max_reasoning_attempts=N), at the crew level via a dedicated Critic Agent in the Hierarchical Process.
  • AutoGen: documented as a "Reflection" design pattern with a Coder and Reviewer agent that iterate until convergence or until max_iterations. Note: AutoGen is officially in maintenance mode, with Microsoft directing new projects to the Microsoft Agent Framework with its SPAR cycle (Sense, Plan, Act, Reflect).
  • n8n: no native Reflexion node. Implementable as a workflow loop – generator agent → critic agent → IF node on the verdict → back to the generator or end. Suited to batch and back-office tasks, not low-latency chat.
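The structured output mentioned for LangGraph, bundling answer, reflection and follow-up queries, can be modelled independently of any framework. The sketch below uses plain dataclasses and is not LangGraph's API. The class names, the field layout and the `needs_revision` stop rule (including the strict `<` comparison) are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reflection:
    missing: str      # what the answer lacks
    superfluous: str  # what the answer could drop


@dataclass
class DraftAnswer:
    answer: str
    reflection: Reflection
    search_queries: list = field(default_factory=list)


def needs_revision(draft: DraftAnswer, revision_number: int, max_revisions: int) -> bool:
    """Continue the loop only while there is open critique and budget left."""
    critique = draft.reflection
    has_critique = bool(critique.missing.strip() or critique.superfluous.strip())
    return has_critique and revision_number < max_revisions


draft = DraftAnswer(
    answer="Reflexion improves code generation.",
    reflection=Reflection(missing="no benchmark figure cited", superfluous=""),
    search_queries=["Reflexion HumanEval pass@1"],
)
print(needs_revision(draft, revision_number=0, max_revisions=2))
```

Keeping the critique in typed fields rather than free text makes the stop condition a simple check, in the same way that n8n routes on the critic's verdict with an IF node.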

For agencies and B2B

For DACH marketing agencies, Reflexion is the right tool for tightly scoped, verifiable batch tasks with a quality requirement: automated code or configuration generation, data enrichment with clear validation rules, structured content with a hard schema. The rule of thumb: only use it when you can define an automatable verification signal and latency does not become visible during a client conversation. For real-time chatbots, a reactive ReAct pattern is the better choice. At Blck Alpaca, we build precisely this architectural decision into the cost framework: first measure which failure modes occur, then escalate deliberately from simple ReAct to Reflexion, with a hard iteration limit and an external oracle. This keeps the agent reliable and the token budget predictable.

FAQ

What is the difference between Reflexion and reflection?
Reflexion with an x denotes the specific method from the paper by Shinn et al. (2023), comprising the triad of Actor, Evaluator and Self-Reflection together with episodic memory. Reflection is used in practice, for example on the LangChain blog, as an umbrella term for any self-critique loop, including simple generator-critic pairs. In German, where "Reflexion" is also the everyday spelling of "reflection", it is worth distinguishing explicitly between Reflexion (Shinn et al.) and the reflection pattern in the broader sense, to avoid misunderstandings.
Does a Reflexion agent really learn permanently?
No, not in the sense of classical training. No model weights are altered. The improvement arises in-context: the verbal critique is loaded into the Actor's context on the next attempt. What is learned applies, initially, only within that episode. Only when reflections are stored in a vector database and reused by task type does an emergent skill library arise across episodes.
When should you not use Reflexion?
For tasks without a clear verification signal, such as creative writing or open-ended research, the reflections degenerate into vague platitudes. It is equally unsuitable for latency-critical real-time applications, since the attempts necessarily run sequentially. It also struggles with long task horizons, where the cause of an error lies many steps before the failure signal, making diagnosis difficult.
How expensive is the Reflexion pattern compared with a simple agent?
As an order of magnitude: a Reflexion run costs, per attempt, roughly two to five times a single ReAct run, multiplied by the number of attempts K. With a typical K of three, that is around six to fifteen times a single ReAct run. The table expresses cost relative to a single CoT call instead, where Reflexion comes out at about 10 to 30 times. These figures are indicative and should be measured against your own workload.
Which frameworks support Reflexion?
LangGraph offers an official tutorial with three variants (Basic Reflection, Reflexion, LATS). CrewAI covers it in two ways: an agent with reasoning=True and max_reasoning_attempts, as well as a dedicated Critic Agent in the Hierarchical Process. AutoGen documents it as a Reflection design pattern with a Coder and Reviewer agent. In n8n there is no native node; you build a loop from a generator agent, a critic agent and an IF node (as of 2026).
