The Reflexion Pattern: Agents That Learn From Their Mistakes
The Reflexion pattern is an agent architecture in which an LLM agent reflects on its past attempts: an Actor produces a solution, an Evaluator assesses it, and a Self-Reflection model writes a verbal critique from this into a memory buffer. On the next attempt, the Actor reads this reflection and corrects itself, entirely without model training.
Key Takeaways
- Reflexion (Shinn et al., arXiv:2303.11366, NeurIPS 2023) lets agents learn from mistakes in-context: the Actor generates, the Evaluator assesses, and Self-Reflection writes a verbal critique into an episodic memory.
- Self-correction happens purely through language. No model weights are altered. This is not genuine training but verbal reinforcement across multiple attempts.
- The strongest area of application is iterative tasks with a clear verification signal: on HumanEval (Python code generation), Reflexion achieved 91% pass@1 versus a GPT-4 baseline of around 80%.
- Without a reliable evaluator, the pattern breaks down: on MBPP, Reflexion fell below the GPT-4 baseline because the self-generated unit tests produced too many false positives.
- Reflexion costs roughly 10 to 30 times a single CoT run and operates strictly sequentially. It is suited to batch and back-office tasks, not real-time chat.
- Always cap iterations hard and, where possible, use an external ground-truth signal (unit tests, RAG evaluator, regex). Self-assessment alone is unreliable.
The Reflexion pattern is an agent architecture in which an LLM agent systematically reflects on its own past attempts and improves as a result. Instead of a single answer, the agent works through several attempts: an Actor produces a solution, an Evaluator assesses it, and a Self-Reflection model writes a verbal critique from the assessment and the trajectory. This critique lands in a memory buffer and steers the next attempt. Remarkably, no model weights are altered. The improvement is purely linguistic.
The method goes back to the paper by Shinn, Cassano, Berman, Gopinath, Narasimhan and Yao: Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv:2303.11366, March 2023, NeurIPS 2023). It was the first widely cited demonstration that an LLM's self-verbalised error analysis is a stronger learning signal for in-context improvement than a simple numerical reward.
The essentials in three sentences
- What it is: A loop of Actor, Evaluator and Self-Reflection with episodic memory. The agent learns from mistakes in-context across multiple attempts, without fine-tuning.
- When it pays off: On iterative tasks with a clear, automatable verification signal, such as code generation with unit tests. Here it measurably increases the success rate.
- Where the limit lies: High cost (on the order of 10 to 30 times a CoT call), strictly sequential, and useless without a reliable evaluator. It is no substitute for genuine training.
How the Reflexion loop works
The pattern consists of three components that interact in a loop:
- Actor – produces the solution, or the solution path. This is typically itself a ReAct or chain-of-thought agent.
- Evaluator – assesses the attempt. The assessment can be binary (pass/fail), scalar, or carried out by an external test suite.
- Self-Reflection model – converts the assessment plus the trajectory into verbal feedback: a paragraph of natural-language critique. This text is stored in an episodic memory buffer.
On the next attempt, the reflection text is prepended to the Actor's context. The "policy update" is therefore purely linguistic; no weights change. The memory is deliberately kept small, with typically one to three reflections retained.
The process in brief:
```
Attempt_k: Actor → Evaluator → Self-Reflect → [append to memory]
Attempt_k+1: Actor sees memory → renewed attempt
... until success or max. attempts reached
```
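How the reflection text reaches the Actor can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `build_actor_prompt`, `MAX_REFLECTIONS` and the prompt wording are assumptions made for this example, and the bounded buffer is modelled with a standard-library `deque`.

```python
from collections import deque

MAX_REFLECTIONS = 3  # assumption: retain only the most recent critiques


def build_actor_prompt(task, reflections):
    """Prepend stored reflections to the task description."""
    if not reflections:
        return task
    notes = "\n".join(f"- {r}" for r in reflections)
    return f"Lessons from earlier attempts:\n{notes}\n\nTask:\n{task}"


# A deque with maxlen drops the oldest entry automatically when full.
memory = deque(maxlen=MAX_REFLECTIONS)
for note in [
    "handle empty list",
    "fix off-by-one",
    "check negative input",
    "avoid recursion",
]:
    memory.append(note)

prompt = build_actor_prompt("Write sum_positive(xs).", memory)
print(prompt)
```

After four failed attempts, only the three most recent notes remain in the prompt; the oldest one has been dropped from the buffer.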
The conceptual advantage over classical reinforcement learning: standard RL needs many samples and expensive fine-tuning to learn from feedback. Reflexion adapts in-context across just a few attempts, which is orders of magnitude cheaper and faster to set up.
When the Reflexion pattern increases the success rate
Its usefulness stands or falls with the quality of the evaluator signal. The strongest documented example is code generation, because an automatic oracle exists there: unit tests.
| Benchmark | Reflexion | Comparison/baseline | Interpretation |
|---|---|---|---|
| HumanEval pass@1 (Python) | 91% | ~80% GPT-4 baseline (at time of paper) | Clear gain thanks to test oracle |
| ALFWorld (success rate) | near-perfect after a few attempts | – | Iterative task with feedback |
| HotpotQA (Distractor) | substantial absolute gains over CoT/ReAct | CoT, ReAct alone | Reasoning task benefits |
| MBPP pass@1 (Python) | fell below the baseline | ~80% GPT-4 baseline | Warning sign: weak evaluator |
The ideal configuration for Reflexion is thus clearly delineated: the task is iterative, a failed attempt is cheap to repeat, and there is an automatable verification signal. This applies to code generation and bug fixing, to tasks with verifiable intermediate results, and to text-based search/action environments such as ALFWorld. In the decision matrix of the underlying research, the combination "Reflexion + ReAct" is expressly listed for code-generation and bug-fix agents, with the caveat: needs unit tests as an oracle.
The limits: cost, fallacies and no genuine learning
Reflexion is no silver bullet. The most important weaknesses, all documented in the paper and in practice:
- The MBPP case (the paper's programming experiments): On MBPP, Reflexion fell below the GPT-4 baseline because the self-generated unit tests had a high false-positive rate. The agent declared the attempt successful prematurely. This is the canonical warning against the assumption "Reflexion always improves performance".
- Confabulated reflections: If the model misdiagnoses why it failed, the next attempt inherits the wrong correction. The self-critique can therefore also lead the agent astray.
- Reward hacking: The agent may learn to trick the evaluator rather than solve the actual task.
- Vague reflections without an oracle: In creative writing or open-ended research, a clear verification signal is missing. The reflections then flatten into general platitudes.
- Cost and latency: Per attempt, roughly two to five times a ReAct run, multiplied by K attempts. The attempts are necessarily sequential, each requiring the previous reflection. Real-time use cases are usually ruled out.
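The false-positive problem from the MBPP case can be reproduced in miniature. The sketch below is a constructed toy example, not data from the paper: `median_buggy`, `run_tests` and both test sets are hypothetical. It shows how a weak, self-generated test suite lets a buggy solution pass while a stronger external suite catches it.

```python
def median_buggy(xs):
    """Candidate solution with a bug: wrong for even-length lists."""
    s = sorted(xs)
    return s[len(s) // 2]


def run_tests(func, cases):
    """Return True if func returns the expected value for every case."""
    return all(func(args) == expected for args, expected in cases)


# Weak self-generated tests: only odd-length inputs are covered.
self_generated = [([1, 3, 2], 2), ([5], 5)]
# External ground truth adds the even-length case the agent forgot.
ground_truth = self_generated + [([1, 2, 3, 4], 2.5)]

weak_verdict = run_tests(median_buggy, self_generated)
true_verdict = run_tests(median_buggy, ground_truth)
false_positive = weak_verdict and not true_verdict
print(weak_verdict, true_verdict, false_positive)
```

With only the self-generated tests as evaluator, the loop would stop here and return the buggy code as a success.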
The fundamental conceptual point for decision-makers: Reflexion is not genuine learning. No weights change, and what is "learned" lives in the context window of the current episode. Lasting knowledge only emerges when reflections are persisted externally.
Cost in comparison
The following figures are indicative (orders of magnitude), synthesised from the paper's data and field reports. They are no substitute for measurement against your own workload.
| Pattern | Tokens (relative to one CoT call = 1) | Latency (with N tool steps) | Complexity |
|---|---|---|---|
| ReAct | 3–10× | N × sequential | low |
| Reflexion (K=3 attempts) | 10–30× | K × ReAct, sequential | medium |
| Plan-and-Execute | 2–6× | 1 plan + N sequential | medium |
| ReWOO | 1.5–3× | 1 plan + N tools + 1 solver | medium |
Reflexion therefore sits deliberately at the upper end of the cost range. Anyone deploying it without a hard iteration limit risks the loop driving costs up.
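As a rough plausibility check, the Reflexion row can be read as K times the ReAct range. The sketch below is simple arithmetic under that assumption; it ignores the extra Evaluator and Self-Reflection calls, and the function names and the budget of 25 are illustrative only.

```python
def reflexion_token_range(react_range, attempts):
    """Scale the ReAct token multiplier range by the number of attempts."""
    low, high = react_range
    return low * attempts, high * attempts


def max_attempts_within_budget(budget_multiplier, react_high):
    """Largest attempt count whose worst case stays within the budget."""
    return int(budget_multiplier // react_high)


REACT_RANGE = (3, 10)  # from the table: ReAct costs 3-10x one CoT call
low, high = reflexion_token_range(REACT_RANGE, attempts=3)
print(f"K=3: about {low}x to {high}x one CoT call")

# Hypothetical budget: at most 25x a CoT call in the worst case.
allowed = max_attempts_within_budget(25, REACT_RANGE[1])
print(f"Attempts allowed within budget: {allowed}")
```

For K=3 this yields 9 to 30 times a CoT call, in line with the indicative 10–30× in the table. Turning the budget into an attempt cap is one way to derive the hard iteration limit from a cost ceiling rather than picking it arbitrarily.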
Practical example: a self-correcting code agent
A concrete scenario from the documented area of application. An agency is to build an agent that delivers runnable Python code from a function description. The pseudocode of the Reflexion loop:
```text
attempt = 0
reflections = []
while attempt < MAX_ATTEMPTS (e.g. 3):
    code = Actor.generate(task, context=reflections)      # Actor
    result = Evaluator.run_unit_tests(code)               # Evaluator = test oracle
    if result.all_passed:
        return code                                       # stop on success
    critique = SelfReflect.analyse(code, result.errors)   # verbal critique
    reflections.append(critique)                          # cap memory at 1–3 entries
    attempt += 1
```
In the logic of the HumanEval result: a first draft fails two of ten tests. The Self-Reflection notes, in effect, "edge case of empty list not handled, off-by-one in the loop". The second attempt reads this note, corrects both, and passes all tests. It is precisely this pattern that drives the HumanEval improvement from around 80% to 91% pass@1.
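To make the loop concrete, here is a small runnable Python sketch. It is a toy under stated assumptions: `toy_actor` and `toy_reflect` are hard-coded stand-ins for the LLM calls, `EvalResult`, `run_unit_tests` and `reflexion_loop` are hypothetical names, and the `mean` task is invented for the example. Only the control flow mirrors the pattern described above.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3     # hard iteration cap
MAX_REFLECTIONS = 3  # keep the memory small


@dataclass
class EvalResult:
    errors: list = field(default_factory=list)

    @property
    def all_passed(self):
        return not self.errors


def toy_actor(task, reflections):
    """Stand-in for the LLM Actor: fixes the bug once a reflection names it."""
    if any("empty" in r for r in reflections):
        return "def mean(xs):\n    return sum(xs) / len(xs) if xs else 0.0\n"
    return "def mean(xs):\n    return sum(xs) / len(xs)\n"


def run_unit_tests(source):
    """External oracle: execute the candidate and check fixed test cases."""
    namespace = {}
    exec(source, namespace)  # acceptable for this toy; sandbox real model output
    mean = namespace["mean"]
    errors = []
    for args, expected in [([1, 2, 3], 2.0), ([], 0.0)]:
        try:
            got = mean(args)
            if got != expected:
                errors.append(f"mean({args}) returned {got}, expected {expected}")
        except Exception as exc:
            errors.append(f"mean({args}) raised {type(exc).__name__}")
    return EvalResult(errors)


def toy_reflect(source, errors):
    """Stand-in for the Self-Reflection model: turn errors into a critique."""
    if any("[]" in e for e in errors):
        return "Edge case of empty list not handled."
    return "Unknown failure; re-read the task."


def reflexion_loop(task):
    reflections = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = toy_actor(task, reflections)
        result = run_unit_tests(code)
        if result.all_passed:
            return code, attempt  # stop on success
        reflections.append(toy_reflect(code, result.errors))
        reflections = reflections[-MAX_REFLECTIONS:]  # cap the memory
    return None, MAX_ATTEMPTS  # give up after the cap


code, attempts_used = reflexion_loop("Write mean(xs); return 0.0 for an empty list.")
print(attempts_used)
print(code)
```

In this toy run the first draft fails on the empty list, the reflection records that, and the second attempt passes. The `for` loop with a fixed range makes the iteration cap structural rather than something a forgotten counter increment could break.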
Two non-negotiable production lessons from the research:
- Always cap iterations. In CrewAI via `max_reasoning_attempts`, in LangGraph via a condition `revision_number ≤ N`. Otherwise the loop can escalate.
- Use an external ground-truth signal wherever possible: unit tests, RAG evaluator, regex match. Pure self-assessment is unreliable; the MBPP case is the cautionary tale.
If you have recurring task types, you can also store reflections in a vector database, keyed by task type. This gives rise to an emergent skill library that operates beyond individual episodes, the bridge to skill-library approaches such as Agent Workflow Memory (arXiv:2409.07429).
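The idea of persisting reflections per task type can be sketched without any database. The example below is a deliberately simplified stand-in: `ReflectionStore` is a hypothetical class that uses an in-memory dictionary with exact-match keys, whereas a real setup would use a vector database with similarity search as described above.

```python
from collections import defaultdict


class ReflectionStore:
    """In-memory stand-in for a persistent store keyed by task type."""

    def __init__(self, max_per_type=3):
        self._by_type = defaultdict(list)
        self._max = max_per_type

    def add(self, task_type, reflection):
        bucket = self._by_type[task_type]
        if reflection not in bucket:  # skip exact duplicates
            bucket.append(reflection)
        del bucket[:-self._max]  # keep only the newest entries

    def lookup(self, task_type):
        """Return a copy of the stored reflections for this task type."""
        return list(self._by_type.get(task_type, []))


store = ReflectionStore(max_per_type=2)
store.add("csv-parsing", "Check for a byte-order mark before reading headers.")
store.add("csv-parsing", "Quote fields that contain the delimiter.")
store.add("csv-parsing", "Treat empty trailing lines as end of data.")
store.add("sql-generation", "Always qualify column names in joins.")

print(store.lookup("csv-parsing"))
print(store.lookup("unknown-type"))
```

A new episode of a known task type would start with `store.lookup(task_type)` as its initial reflections instead of an empty list.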
Placement within the framework ecosystem (as of 2026)
- LangGraph: an official tutorial with three variants: Basic Reflection (generator-reflector pair), Reflexion (draft responder → tool execution → revisor loop with explicit reflection memory) and LATS (Reflexion plus tree search). A strong production pattern here is structured output that bundles `answer`, `reflection` (what is missing, what is superfluous) and follow-up `search_queries`.
- CrewAI: two-tiered. At the agent level via `Agent(reasoning=True, max_reasoning_attempts=N)`, at the crew level via a dedicated Critic Agent in the Hierarchical Process.
- AutoGen: documented as a "Reflection" design pattern with a Coder and Reviewer agent that iterate until convergence or until `max_iterations`. Note: AutoGen is officially in maintenance mode, with Microsoft directing new projects to the Microsoft Agent Framework with its SPAR cycle (Sense, Plan, Act, Reflect).
- n8n: no native Reflexion node. Implementable as a workflow loop: generator agent → critic agent → IF node on the verdict → back to the generator or end. Suited to batch and back-office tasks, not low-latency chat.
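The structured output and the loop-or-end decision mentioned in the list can be illustrated framework-independently. The sketch below uses plain dataclasses and is not the API of LangGraph, CrewAI, AutoGen or n8n: the field names `answer`, `reflection` and `search_queries` come from the description above, while `Reflection`, `DraftAnswer` and `next_step` are hypothetical names for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Reflection:
    missing: str      # what the answer still lacks
    superfluous: str  # what should be removed


@dataclass
class DraftAnswer:
    answer: str
    reflection: Reflection
    search_queries: list = field(default_factory=list)


def next_step(draft, revision_number, max_revisions):
    """Route like an IF node: revise while critique remains and budget allows."""
    needs_work = bool(draft.reflection.missing or draft.reflection.superfluous)
    if needs_work and revision_number < max_revisions:
        return "revise"
    return "end"


draft = DraftAnswer(
    answer="Reflexion improves agents through verbal feedback.",
    reflection=Reflection(missing="No benchmark figures cited.", superfluous=""),
    search_queries=["Reflexion HumanEval pass@1"],
)
print(next_step(draft, revision_number=1, max_revisions=3))
print(next_step(draft, revision_number=3, max_revisions=3))
```

Keeping the critique in typed fields rather than free text makes the routing condition trivial to evaluate, and the revision counter enforces the cap even when the critique never becomes empty.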
For agencies and B2B
For DACH marketing agencies, Reflexion is the right tool for tightly scoped, verifiable batch tasks with a quality requirement: automated code or configuration generation, data enrichment with clear validation rules, structured content with a hard schema. The rule of thumb: only use it when you can define an automatable verification signal and latency does not become visible during a client conversation. For real-time chatbots, a reactive ReAct pattern is the better choice. At Blck Alpaca, we build precisely this architectural decision into the cost framework: first measure which failure modes occur, then escalate deliberately from simple ReAct to Reflexion, with a hard iteration limit and an external oracle. This keeps the agent reliable and the token budget predictable.
FAQ
What is the difference between Reflexion and reflection?
Does a Reflexion agent really learn permanently?
When should you not use Reflexion?
How expensive is the Reflexion pattern compared with a simple agent?
Which frameworks support Reflexion?