Chain-of-Thought for Agents: When Does It Help, and When Not?
Chain-of-Thought (CoT) is a prompting technique in which a large language model spells out its intermediate steps explicitly in words before answering. Instead of producing a result directly, the model writes down the solution path step by step. This improves accuracy on multi-step logic, mathematics and planning, but costs additional tokens and latency.
Key Takeaways
- ✓Chain-of-Thought makes an LLM's step-by-step reasoning explicit and visible, which helps especially with multi-step logic, mathematics and planning.
- ✓CoT is strictly left-to-right and cannot backtrack: on search and backtracking problems (e.g. Game of 24), pure CoT collapses to a 4% success rate, while Tree-of-Thoughts reaches 74% (both measured with GPT-4).
- ✓Pure CoT tends to hallucinate facts because it has no external grounding; ReAct interleaves CoT with tool calls (Thought-Action-Observation) and thereby delivers verifiable intermediate results.
- ✓For simple lookups and latency- and cost-critical applications, CoT is often unnecessary and even disadvantageous.
- ✓Modern reasoning models (o-series, Claude with Extended Thinking, Gemini Thinking; as of 2026) internalise the thinking process, which makes explicit CoT prompting partly redundant.
- ✓Zero-shot CoT ("Let's think step by step") needs no examples; few-shot CoT supplies model solution paths and is more precise on domain tasks.
For agents, CoT is at the same time the conceptual core from which nearly all modern agent architectures are derived.
- Helps with: multi-step logic, mathematics, planning and anywhere a traceable thinking path improves the result.
- Does not help with: simple lookups, single-step classification, and latency- and cost-critical applications.
- Partly redundant with: reasoning-optimised models that already carry out the step-by-step thinking internally.
What Chain-of-Thought does technically
CoT forces the model to break the implicit jump from question to answer down into a visible chain of intermediate considerations. The practical lever is small, the effect often large: a short trigger phrase or an example solution path shifts the model's probability distribution towards structured, step-by-step generation.
Chain-of-Thought (Wei et al., 2022) is the historical root of practically all of today's agent patterns. ReAct, Tree-of-Thoughts, Plan-and-Solve/Plan-and-Execute and ReWOO all emerged from CoT, each either building on the explicit reasoning, restructuring it or deliberately rejecting it. Whoever understands CoT thereby understands the common foundation of these architectures.
The decisive property, and at the same time the central weakness, of pure CoT: the thinking path is strictly left-to-right. The model generates one step after another and cannot backtrack or revise a dead end it has already entered. On tasks that require search, lookahead or backtracking, CoT therefore collapses. This is precisely where Tree-of-Thoughts comes in, reinterpreting reasoning as a search over a tree of intermediate states.
Zero-shot vs. few-shot CoT
In practice there are two variants with different cost-benefit profiles.
- Zero-shot CoT: A trigger such as "Let's think step by step" is simply added to the task. No examples, minimal context overhead. Plan-and-Solve prompting (Wang et al., 2023) emerged from this idea: a zero-shot approach that first asks for a plan and then for its execution ("Let's first devise a plan / Let's carry out the plan"). It outperforms zero-shot CoT on mathematical reasoning and became the template for the Plan-and-Execute agent architecture.
- Few-shot CoT: The prompt contains one or more complete example solution paths. This anchors not only the step-by-step thinking but also a desired format and domain-specific heuristics. Few-shot CoT is usually more precise on specialised or format-critical tasks, but costs noticeably more context tokens per request.
Rule of thumb for agencies: zero-shot CoT as the cost-effective default, few-shot CoT only where the additional accuracy or format fidelity justifies the higher token consumption.
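The two variants differ only in how the prompt is assembled, which a minimal sketch can make concrete. The names `zero_shot_cot`, `few_shot_cot` and `WorkedExample` are hypothetical, and both the `Q:`/`A:` layout and the placement of the trigger after the question are assumptions of this sketch, not a fixed standard.

```python
from dataclasses import dataclass

ZERO_SHOT_TRIGGER = "Let's think step by step."


@dataclass
class WorkedExample:
    """One complete example solution path for few-shot CoT (hypothetical type)."""
    question: str
    reasoning: str
    answer: str


def zero_shot_cot(question: str) -> str:
    # No examples: only the trigger phrase is added to the task.
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot(question: str, examples: list[WorkedExample]) -> str:
    # Each example shows the full reasoning path and the answer format.
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # The new question comes last, with the answer left open for the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Invented example data, for illustration only.
example = WorkedExample(
    question="A campaign costs 200 EUR per day. What do 5 days cost?",
    reasoning="One day costs 200 EUR, so 5 days cost 5 * 200 = 1000 EUR.",
    answer="1000 EUR",
)
zero_prompt = zero_shot_cot("What do 12 days cost at 150 EUR per day?")
few_prompt = few_shot_cot("What do 12 days cost at 150 EUR per day?", [example])
```

The few-shot prompt grows with every example, which is exactly the context-token cost described above; the zero-shot prompt stays at question length plus one sentence.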
Relationship to ReAct: reasoning needs grounding
Pure CoT has a fundamental problem for agents: it tends to hallucinate facts because it has no external grounding. The model's reasoning may sound plausible, but it has no mechanism to check its assumptions against reality.
ReAct (Yao et al., 2022) solves exactly this by extending CoT with acting. Instead of only thinking, ReAct interleaves reasoning steps with tool calls:
```
Thought: I need to look up the customer's current revenue.
Action: crm_lookup(customer="Muster GmbH", field="revenue_q1")
Observation: Revenue Q1 = 1,240,000 EUR
Thought: That is 8% above the previous year. Now I compare with the pipeline.
Action: ...
```
The reasoning steps drive the tool usage; the tool observations correct the reasoning. CoT thus remains the core, with ReAct only adding the grounding that pure CoT lacks. For most production agents (chatbots with CRM and knowledge-base access, ticket triage), this grounded variant is the right starting point, not pure CoT.
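The Thought-Action-Observation cycle can be sketched as a small control loop. This is a simplified illustration, not the reference implementation from the ReAct paper: the `Final Answer:` marker, the action syntax, the regex-based parsing and the `crm_lookup` tool with its canned data are all assumptions, and `scripted_model` is a stand-in for a real LLM call.

```python
import re
from typing import Callable


def crm_lookup(customer: str, field: str) -> str:
    # Hypothetical tool with canned data; a real agent would query a CRM here.
    fake_crm = {("Muster GmbH", "revenue_q1"): "Revenue Q1 = 1,240,000 EUR"}
    return fake_crm.get((customer, field), "not found")


TOOLS: dict[str, Callable[..., str]] = {"crm_lookup": crm_lookup}
ACTION_RE = re.compile(r"^Action: (\w+)\((.*)\)$", re.MULTILINE)
ARG_RE = re.compile(r'(\w+)="([^"]*)"')


def react_loop(model: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # model emits a Thought plus an Action or a final answer
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match is None:
            observation = "no valid action found"
        else:
            name, raw_args = match.groups()
            tool = TOOLS.get(name)
            try:
                observation = (
                    tool(**dict(ARG_RE.findall(raw_args)))
                    if tool is not None
                    else f"unknown tool {name}"
                )
            except TypeError as exc:  # wrong or missing arguments
                observation = f"tool call failed: {exc}"
        # The observation is fed back so the next thought is grounded in it.
        transcript += f"Observation: {observation}\n"
    return "no answer within step budget"


def scripted_model(transcript: str) -> str:
    # Stand-in for an LLM: first requests a lookup, then answers from the observation.
    if "Observation:" not in transcript:
        return (
            "Thought: I need to look up the customer's current revenue.\n"
            'Action: crm_lookup(customer="Muster GmbH", field="revenue_q1")'
        )
    return "Thought: I have the figure.\nFinal Answer: 1,240,000 EUR"


answer = react_loop(scripted_model, "What was Muster GmbH's Q1 revenue?")
```

The step budget (`max_steps`) is a deliberate design choice in this sketch: it bounds tokens and latency if the model never reaches a final answer.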
When CoT helps, and when not
| Task type | CoT sensible? | Rationale |
|---|---|---|
| Multi-step logic / mathematics | Yes | Explicit intermediate steps significantly reduce calculation errors |
| Planning / task decomposition | Yes | Step-by-step thinking enforces global structure |
| Simple lookup / single-step | No | No gain in accuracy, only more tokens and latency |
| Classification / routing | Mostly no | Direct answer is faster and cheaper |
| Search/backtracking problems | Only as Tree-of-Thoughts | Left-to-right CoT cannot backtrack |
| Reasoning models (o-series, Extended Thinking; as of 2026) | Partly redundant | Model already thinks step by step internally |
| Audit/compliance context (DACH, EU AI Act) | Yes | Visible thinking path serves as an audit trail |
The most important insight from field reports between 2024 and 2026: start with the simplest pattern that works, and escalate only once measured error rates force it. Applied to CoT, this means not reflexively bloating every prompt with "step by step" instructions, but adding them only where the task type demonstrably requires it.
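The decision table above can be written down as a small routing function. This is one possible reading of the table, not a prescribed implementation: the task-type labels, the `Mode` enum and the two flags are hypothetical, and unknown task types fall back to the simplest pattern.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct answer"
    COT = "chain-of-thought"
    TREE_SEARCH = "tree-of-thoughts"


# Hypothetical task-type labels mapped to the table's recommendations.
DEFAULT_MODE = {
    "math": Mode.COT,
    "planning": Mode.COT,
    "lookup": Mode.DIRECT,
    "classification": Mode.DIRECT,
    "search": Mode.TREE_SEARCH,
}


def choose_mode(
    task_type: str,
    reasoning_model: bool = False,
    audit_required: bool = False,
) -> Mode:
    # Unknown task types start with the simplest pattern.
    mode = DEFAULT_MODE.get(task_type, Mode.DIRECT)
    if audit_required and mode is Mode.DIRECT:
        # A documented thinking path is needed, so reasoning is made explicit.
        return Mode.COT
    if reasoning_model and mode is Mode.COT and not audit_required:
        # The model already reasons internally; explicit CoT is partly redundant.
        return Mode.DIRECT
    return mode
```

For example, `choose_mode("math")` selects explicit CoT, while `choose_mode("math", reasoning_model=True)` falls back to a direct prompt. In practice, measured error rates rather than a static table should decide when to escalate.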
Example with and without CoT: the Game of 24
The "Game of 24" (forming 24 from four numbers using basic arithmetic) shows the limit of pure CoT particularly clearly. With GPT-4 as the underlying model, pure CoT achieves only a 4% success rate here, because the model commits to a path early and cannot backtrack. Tree-of-Thoughts, which generates and evaluates several candidates per step and pursues only the most promising ones, reaches 74% (breadth b=5, breadth-first search over three thinking steps).
The pattern repeats on 5x5 mini crosswords: measured by fully solved games, pure CoT solves only 1% of the tasks, Tree-of-Thoughts 20%. In creative writing with constraints, the average coherence rating is around 6.9 out of 10 for CoT and around 7.6 for Tree-of-Thoughts; plain input-output prompting without reasoning scores around 6.2.
The lesson is not "CoT is bad", but: CoT fits linear thinking tasks, not search problems. On a multi-step calculation or planning task, CoT noticeably improves the result; on a search problem with many dead ends, it needs a tree structure; on a pure lookup, no reasoning at all.
Note: these figures come from the original papers (predominantly the GPT-3.5/GPT-4 era, 2022-2023). Modern frontier models shift these values, so the magnitudes should be read as indicators of relative effect, not as today's absolute values.
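What "search with backtracking" means on the Game of 24 can be shown without any language model. The sketch below is explicitly not Tree-of-Thoughts (no LLM proposes or evaluates candidates); it is a plain exhaustive depth-first search that illustrates the one capability linear CoT lacks: abandoning a dead end and trying another branch. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return an expression that evaluates to target, or None if none exists."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two remaining numbers and combine them with one operation.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
            # Dead end: discard this branch and try the next candidate.
    return None


solution = solve_24([4, 9, 10, 13])   # solvable, e.g. via (10 - 4) * (13 - 9)
unsolvable = solve_24([1, 1, 1, 1])   # no combination of four ones reaches 24
```

A left-to-right CoT run corresponds to following a single path through this loop without ever reaching the "try the next candidate" line. Exact `Fraction` arithmetic is used so that divisions do not introduce rounding errors.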
CoT and cost: the latency/token calculation
Every reasoning step generates additional tokens and, because generation is sequential, additional latency. On a simple lookup that the model could answer directly, this is pure overhead. In high-volume applications (e.g. support ticket triage) this overhead quickly adds up to noticeable costs.
A practical cost lever is model tiering: a strong, expensive model for the demanding reasoning/planning phase, a smaller, cheaper model for the simple execution steps. This approach, known from the Plan-and-Execute pattern, is reported to save on the order of 40-70% of the tokens on multi-step workflows. For CoT, this means: buy reasoning deliberately where it counts, and do not distribute it equally across all steps.
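The tiering idea can be put into numbers with a small cost model. Everything in this sketch is an assumption made for illustration: the `Step` type, the token counts and the per-1,000-token prices are invented, and the resulting saving says nothing about real workloads.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tokens: int            # tokens generated in this step
    needs_reasoning: bool  # True for planning/reasoning, False for simple execution


def workflow_cost(steps, strong_price, cheap_price, tiered):
    """Total cost of a workflow; prices are per 1,000 tokens in arbitrary units."""
    total = 0.0
    for step in steps:
        # Without tiering, every step runs on the strong model.
        use_strong = step.needs_reasoning or not tiered
        price = strong_price if use_strong else cheap_price
        total += step.tokens / 1000 * price
    return total


# Invented workload: one planning step, four simple execution steps.
steps = [Step(800, True)] + [Step(300, False)] * 4
flat_cost = workflow_cost(steps, strong_price=10.0, cheap_price=1.0, tiered=False)
tiered_cost = workflow_cost(steps, strong_price=10.0, cheap_price=1.0, tiered=True)
saving = 1 - tiered_cost / flat_cost
```

The saving depends entirely on the price ratio between the two models and on the share of tokens that genuinely needs the strong model, so both should be measured on the actual workload before committing to a tiering scheme.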
Important for model choice in 2026: reasoning-optimised models (o-series, Claude with Extended Thinking, Gemini Thinking variants) already carry out the step-by-step thinking process internally. Explicit CoT prompting often brings little additional benefit there and can even interfere. Explicit CoT remains relevant above all for smaller/cheaper models, for traceable audit trails and in regulated DACH contexts where the thinking path has to be documented.
For agencies and B2B: the pragmatic decision
For DACH marketing agencies and B2B teams, CoT can be condensed into a simple heuristic: CoT is not a default surcharge, but a targeted tool. Use it when the task demands several thinking steps, calculation or planning, for example in preparing reports, sanity-checking key figures or decomposing complex customer enquiries. Do without it for lookups, classification and anything where speed and unit cost dominate. Choose deliberately between zero-shot (cheap) and few-shot (more precise, more expensive) according to the task. And with reasoning models, check whether explicit CoT still contributes anything at all before you spend tokens on it. Whoever makes these four decisions cleanly gets more accurate agents at controlled costs, instead of scattering expensive, slow reasoning overhead across the entire pipeline.