Tree of Thoughts: When One Path Is Not Enough
Tree of Thoughts (ToT) is a reasoning method for language models that, instead of a single linear chain of thought, generates, evaluates and explores multiple reasoning paths in parallel via search (BFS or DFS) with backtracking. This lets the model spot dead ends, backtrack and consider alternatives, rather than getting stuck on a wrong assumption.
Key Takeaways
- Tree of Thoughts replaces the linear chain-of-thought process with a search tree of multiple reasoning paths that are generated, evaluated and explored with backtracking.
- The benefit shows up on search- and lookahead-intensive tasks: in the Game of 24, ToT with GPT-4 achieved 74 percent success according to the paper, compared with 4 percent for classic CoT.
- The price is high: depending on branching, ToT costs roughly 50 to 150 times the token volume of a single CoT call and is significantly slower.
- Quality depends heavily on the generator model: with GPT-3.5 the Game-of-24 rate dropped to 19 percent, and small models such as Mixtral-8x7B failed at even the simplest puzzles.
- Modern reasoning models already internalise the search, which is why ToT (as of 2026) remains relevant mainly for audit-required domains, verifiable puzzles and low-cost small proposer models.
- The production-ready evolution is LATS (ToT plus Reflexion plus MCTS), available as a tutorial in LangGraph.
Tree of Thoughts was introduced by Yao et al. in the paper "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (arXiv:2305.10601, May 2023, NeurIPS 2023).
- What is new? ToT models reasoning as deliberate, search-driven "System 2" exploration over a tree of intermediate states, rather than as a single chain of thought.
- When is it needed? Whenever a single path is not enough: for puzzles, planning with dead ends, and tasks that require lookahead and backtracking.
- What does it cost? Roughly 50 to 150 times the token volume of a single chain-of-thought call, plus significantly higher latency.
Why a single path is often not enough
Chain of Thought (CoT) has substantially improved the reasoning quality of language models by having the model write out its intermediate steps explicitly. The catch: CoT runs strictly left to right and cannot backtrack. As soon as the model makes a wrong assumption early on, it drags that error through the entire remaining chain of thought.
For many tasks this barely matters. For problems that demand genuine search and lookahead, however, CoT collapses. The classic example from the paper is the number game "Game of 24" (form 24 from four numbers using the basic arithmetic operations). Here you have to try combinations, recognise dead ends and choose alternative calculation paths. This is precisely what a linear chain cannot do: in the paper, CoT with GPT-4 achieved only 4 percent success here.
Tree of Thoughts addresses this structural deficit. Instead of a chain, it builds a tree in which each node is a partial solution, each branch an alternative continuation, and each leaf represents either a solution or a dead end.
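The tree described above can be sketched as a small data type. The class name `ThoughtNode`, its fields and the hand-written example thoughts are illustrative assumptions; nothing here comes from the paper's code.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(eq=False)
class ThoughtNode:
    """One tree node: a partial solution plus the thought that led to it."""
    state: str                                    # e.g. the numbers still available
    thought: str = ""                             # step taken from the parent (empty at root)
    parent: ThoughtNode | None = field(default=None, repr=False)
    children: list[ThoughtNode] = field(default_factory=list)

    def add_child(self, thought: str, state: str) -> ThoughtNode:
        """Attach an alternative continuation (a branch) and return it."""
        child = ThoughtNode(state=state, thought=thought, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Thoughts from the root down to this node, i.e. one reasoning path."""
        node, steps = self, []
        while node.parent is not None:
            steps.append(node.thought)
            node = node.parent
        return steps[::-1]


# Hand-built example for the Game of 24 with the numbers 4, 9, 10, 13.
root = ThoughtNode(state="4 9 10 13")
first = root.add_child("13 - 9 = 4", "4 4 10")     # one branch
other = root.add_child("4 + 9 = 13", "10 13 13")   # an alternative branch
leaf = first.add_child("10 - 4 = 6", "4 6").add_child("4 * 6 = 24", "24")
print(leaf.path())  # ['13 - 9 = 4', '10 - 4 = 6', '4 * 6 = 24']
```

Keeping a parent pointer on every node means that any explored path, including discarded ones, can be reconstructed afterwards.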
The four design decisions of ToT
The ToT paper describes four modular building blocks that every concrete ToT implementation has to decide on:
- Thought decomposition: How is the problem broken down into individual "thoughts"? A thought can be a calculation step, a paragraph plan or a single crossword word.
- Thought generator: At a given state, k candidates for the next thought are proposed or sampled.
- State evaluator: Each candidate is assessed, either independently (in the paper via value labels such as sure/maybe/impossible, each sampled three times) or comparatively, by voting across the candidates.
- Search algorithm: BFS (the best b candidates per level, with the paper's default of b=5) or DFS with backtracking traverses the tree.
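The independent evaluation with repeated sure/maybe/impossible samples has to be turned into one score per candidate before the search can rank anything. The following is a minimal sketch of one way to do that. The numeric weights, the function names `aggregate` and `top_b`, and the sample labels are all assumptions chosen for illustration.

```python
# Hypothetical integer weights per label: an assumption for this sketch.
LABEL_WEIGHTS = {"sure": 20, "maybe": 1, "impossible": 0}


def aggregate(labels):
    """Combine the repeated evaluator samples for one candidate into one score."""
    return sum(LABEL_WEIGHTS[label] for label in labels)


def top_b(samples_by_candidate, b):
    """Keep the b best-scoring candidates and drop those that scored 0."""
    scored = [(aggregate(labels), cand) for cand, labels in samples_by_candidate.items()]
    alive = [(score, cand) for score, cand in scored if score > 0]
    alive.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in alive[:b]]


# Made-up evaluator output: three samples per candidate thought.
samples = {
    "13 - 9 = 4 (left: 4 4 10)": ["sure", "maybe", "sure"],
    "4 + 9 = 13 (left: 10 13 13)": ["impossible", "impossible", "impossible"],
    "10 - 4 = 6 (left: 6 9 13)": ["maybe", "maybe", "impossible"],
}
print(top_b(samples, b=2))
```

Sampling the evaluator several times smooths out a single noisy judgement, at the price of multiplying the number of evaluator calls.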
The paper explicitly draws its conceptual inspiration from the problem-solving research of Newell and Simon in the 1950s: deliberate, search-driven reasoning instead of an associative single answer.
Schematic example: the ToT loop as pseudocode
The following pseudocode shows a BFS variant with breadth b and depth d. It is intended as an illustration, not as a runnable implementation.
```
frontier = [root(problem)]                  # list of active states
for depth in range(d):                      # e.g. d = 3 steps
    candidates = []
    for state in frontier:
        for _ in range(k):                  # k proposals per state
            t = generator(state)            # next thought
            candidates.append(state + t)
    evaluated = [(evaluator(c), c) for c in candidates]
    # discard dead ends (impossible) -> implicit backtracking
    evaluated = [x for x in evaluated if x[0] != "impossible"]
    frontier = top_b(evaluated, b=5)        # continue only with the best b
solution = best(frontier)
```
Two points are decisive. First, the branching: k candidates are created per state, so the tree grows wide. Second, the top_b pruning combined with discarding impossible nodes: poorly rated paths are cut off, and the search concentrates the compute on the promising branches. With DFS, the backtracking appears explicitly: if a path turns out to be a dead end, the search returns to the last promising node.
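The DFS behaviour can be made concrete with a small runnable sketch for the Game of 24. It rests on strong simplifying assumptions: `propose` and `evaluate` are deterministic stand-ins for the LLM generator and evaluator (the generator simply enumerates every arithmetic step, the evaluator only judges finished states), and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def propose(state):
    """Stand-in for the generator: every way to combine two remaining numbers."""
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [x for n, x in enumerate(state) if n not in (i, j)]
        options = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                   (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            options.append((a / b, f"{a} / {b}"))
        if a != 0:
            options.append((b / a, f"{b} / {a}"))
        for value, text in options:
            yield rest + [value], f"{text} = {value}"


def evaluate(state):
    """Stand-in for the evaluator: only finished states get a firm verdict."""
    if len(state) == 1:
        return "sure" if state[0] == 24 else "impossible"
    return "maybe"


def dfs(state, path=()):
    verdict = evaluate(state)
    if verdict == "impossible":
        return None                      # dead end: the caller tries its next branch
    if verdict == "sure":
        return list(path)                # solution found, return the thoughts taken
    for child, thought in propose(state):
        found = dfs(child, path + (thought,))
        if found is not None:
            return found
    return None                          # every branch failed: backtrack further up


def solve(numbers):
    # Fractions keep division exact, so 24 is never missed by rounding.
    return dfs([Fraction(n) for n in numbers])


print(solve([4, 9, 10, 13]))   # a list of three steps ending in 24
print(solve([1, 1, 1, 1]))     # None: no path reaches 24
```

With a real language model, `propose` would return only a handful of sampled thoughts and `evaluate` would prune much earlier, which is where the savings over exhaustive search come from.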
Cost and benefit: ToT versus CoT
The benefit of ToT is considerable on the right tasks, but so is the price. The following figures come from the ToT paper (GPT-4, conditions as in the paper) and should be read as relative effect sizes, not as today's absolute values.
| Task (paper conditions, GPT-4) | Chain of Thought | Tree of Thoughts |
|---|---|---|
| Game of 24 (success rate) | 4% | 74% |
| Mini crossword 5x5 (game level) | 1% | 20% |
| Mini crossword 5x5 (word level) | 16% | 60% |
| Creative writing, coherence (scale 1–10) | ~6.9 | ~7.6 |
These quality gains come at a massive additional cost. The following table places ToT within the spectrum of common agent reasoning patterns; the token values are rough orders of magnitude relative to a single CoT call (= 1), synthesised from the paper's figures and field reports, and should be measured against your own workload.
| Pattern | Tokens (relative to CoT = 1) | Latency | Implementation complexity |
|---|---|---|---|
| Chain of Thought | 1 | low | very low |
| ReAct | 3–10x | N sequential steps | low |
| Tree of Thoughts (b=5, d=3) | 50–150x | b^d evaluator calls | high |
| LATS (ToT + Reflexion) | 100–300x | tree x reflexion | very high |
In the Game of 24, ToT consumed around 5,500 tokens per case according to the paper, compared with a few hundred for CoT. Latency is strictly worse than CoT, because it grows in proportion to branching, depth and sample rate.
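How branching, depth and sample rate add up can be estimated with a few lines of arithmetic. The sketch below counts model calls for a beam-style BFS, assuming one generator call per proposed thought and one evaluator call per sample. The function name `estimate_calls` and the values k=5 and three evaluator samples are illustrative assumptions; token volume also depends on prompt and completion length per call, which is not modelled here.

```python
def estimate_calls(b, k, d, eval_samples):
    """Count generator and evaluator calls for a BFS with beam width b."""
    frontier = 1                          # the search starts from a single root state
    generator_calls = 0
    evaluator_calls = 0
    for _ in range(d):
        candidates = frontier * k         # k proposals per active state
        generator_calls += candidates
        evaluator_calls += candidates * eval_samples
        frontier = min(b, candidates)     # only the best b states survive
    return generator_calls, evaluator_calls


gen, ev = estimate_calls(b=5, k=5, d=3, eval_samples=3)
print(gen, ev, gen + ev)  # 55 165 220
```

Under these assumptions a single problem triggers 220 model calls where plain CoT needs one, and the calls at each level can only start once the previous level has been evaluated.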
Quality stands or falls with the generator
A central finding from the research: ToT does not lift every model. The effect is dominated by the generator model. In the Game of 24, GPT-4 with ToT reached 74 percent, whereas GPT-3.5 with ToT reached only 19 percent. The paper's analysis cleanly separates the roles: GPT-4 as generator plus GPT-3.5 as evaluator yielded 64 percent, while the reverse configuration (GPT-3.5 generates, GPT-4 evaluates) yielded only 31 percent.
A Stanford CS224N project shows it even more clearly: Mixtral-8x7B solved none of the ten easiest Game-of-24 puzzles, because the model hallucinated numbers and failed at simple arithmetic, which rendered the evaluator worthless. In practice this means: ToT on a weak model burns budget without solving the problem.
When (not) to use ToT
Clear guardrails for 2026 emerge from the research.
ToT is suitable when:
- the task has a search character, that is, it features lookahead, branching and potential dead ends (puzzles, optimisation, scheduling, planning).
- a reliable evaluation signal exists, ideally with verifiable rewards.
- an audit trail of the alternatives examined is required, for example in regulated DACH industries under GDPR or the EU AI Act.
- a small, low-cost proposer model is deliberately used, whose weaker single result is to be compensated by search.
ToT is not suitable when:
- the model in use can already deliver the answer in one step.
- the application is latency-sensitive (such as live chat).
- there is no clear evaluator; without a good evaluation signal, the search comes to nothing.
The most important framing from the research: for general reasoning, explicit ToT is largely superseded, because modern reasoning models represent the tree search internally. ToT therefore remains relevant primarily as a conceptual foundation for tree-structured agent search, and in the special cases mentioned.
From method to agent: LATS
If you need ToT in an agent stack today, you usually reach not for pure ToT but for its successor LATS (Language Agent Tree Search, Zhou et al., arXiv:2310.04406). LATS combines the tree expansion of ToT with self-reflection (in the sense of Reflexion) and Monte Carlo Tree Search. LangGraph has an official tutorial for it; there is no dedicated ToT-only tutorial there. The LangChain blog argues that LATS outperforms comparable algorithms such as Tree of Thoughts, ReAct and Reflexion.
As for the frameworks: in CrewAI, ToT is not idiomatic, because its process models are sequential or hierarchical; in AutoGen it can be implemented via a Selector Group Chat, though only with considerable effort; and in n8n it is not realistic because of the combinatorial explosion, apart from best-of-N sampling (ToT with depth 1).
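The Monte Carlo Tree Search part differs from the BFS and DFS variants mainly in how it picks the next node to expand. A common selection rule in MCTS is UCT (upper confidence bound applied to trees), sketched below in its textbook form. This is not code from the LangGraph tutorial; the dictionary layout of a child and the exploration constant `c=1.0` are assumptions for illustration.

```python
import math


def uct(total_value, visits, parent_visits, c=1.0):
    """Textbook UCT score: average value plus an exploration bonus."""
    if visits == 0:
        return math.inf                   # unvisited children are tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select(children, parent_visits, c=1.0):
    """Pick the child with the highest UCT score."""
    return max(children, key=lambda ch: uct(ch["value"], ch["visits"], parent_visits, c))


# Hypothetical statistics for two children of a node visited 12 times.
children = [
    {"name": "a", "value": 9.0, "visits": 10},   # well explored, high average
    {"name": "b", "value": 1.0, "visits": 2},    # barely explored, lower average
]
print(select(children, parent_visits=12)["name"])        # b: exploration bonus wins
print(select(children, parent_visits=12, c=0.0)["name"])  # a: pure exploitation
```

Unlike the top-b pruning of BFS, this rule never discards a branch for good; rarely visited nodes keep gaining exploration bonus and can be revisited later.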
For agencies and B2B decision-makers
For marketing agencies and DACH B2B teams, the message is pragmatic: Tree of Thoughts is not a standard everyday tool but a specialist instrument. In the vast majority of client projects (chatbots, content pipelines, research workflows), cheaper patterns such as ReAct or ReWOO are the better choice. ToT and its successor LATS pay off only where a single solution path is demonstrably not enough, for example in constraint-heavy copywriting, optimisation tasks or regulated processes that require an auditable search trail over discarded alternatives. Anyone considering ToT should ensure two things beforehand: a capable generator model and a robust evaluation signal. If either is missing, the token costs multiply without any return. Blck Alpaca supports exactly this decision, namely the level-headed selection of the right reasoning pattern and a sound cost-benefit assessment, before budget is burned in production.