Tree of Thoughts: When One Path Is Not Enough

Blck Alpaca
Definition

Tree of Thoughts (ToT) is a reasoning method for language models that, instead of a single linear chain of thought, generates, evaluates and explores multiple reasoning paths in parallel via search (BFS or DFS) with backtracking. This lets the model spot dead ends, backtrack and consider alternatives, rather than getting stuck on a wrong assumption.

Key Takeaways

  • Tree of Thoughts replaces the linear chain-of-thought process with a search tree of multiple reasoning paths that are generated, evaluated and explored with backtracking.
  • The benefit shows up on search- and lookahead-intensive tasks: in the Game of 24, ToT with GPT-4 achieved 74 percent success according to the paper, compared with 4 percent for classic CoT.
  • The price is high: depending on branching, ToT costs roughly 50 to 150 times the token volume of a single CoT call and is significantly slower.
  • Quality depends heavily on the generator model: with GPT-3.5 the Game-of-24 rate dropped to 19 percent, and small models such as Mixtral-8x7B failed at even the simplest puzzles.
  • Modern reasoning models already internalise the search, which is why ToT (as of 2026) remains relevant mainly for audit-required domains, verifiable puzzles and low-cost small proposer models.
  • The production-ready evolution is LATS (ToT plus Reflexion plus MCTS), available as a tutorial in LangGraph.

The method was introduced by Yao et al. in the paper "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (arXiv:2305.10601, May 2023, NeurIPS 2023).

  • What is new? ToT models reasoning as deliberate, search-driven "System 2" exploration over a tree of intermediate states, rather than as a single chain of thought.
  • When is it needed? Whenever a single path is not enough: for puzzles, planning with dead ends, and tasks that require lookahead and backtracking.
  • What does it cost? Roughly 50 to 150 times the token volume of a single chain-of-thought call, plus significantly higher latency.

Why a single path is often not enough

Chain of Thought (CoT) has substantially improved the reasoning quality of language models by having the model write out its intermediate steps explicitly. The catch: CoT runs strictly left to right and cannot backtrack. As soon as the model makes a wrong assumption early on, it drags that error through the entire remaining chain of thought.

For many tasks this barely matters. For problems that demand genuine search and lookahead, however, CoT collapses. The classic example from the paper is the number game "Game of 24" (form 24 from four numbers using the basic arithmetic operations). Here you have to try combinations, recognise dead ends and choose alternative calculation paths. This is precisely what a linear chain cannot do: in the paper, CoT with GPT-4 achieved only 4 percent success here.
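
To make the search character of the game concrete, here is a minimal sketch of a conventional brute-force solver with no language model involved. The function names `solve_24` and `_search` are illustrative and not taken from the paper. The sketch combines two numbers at a time, recurses, and backtracks whenever a combination cannot reach the target.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression that evaluates to target, or None if none exists."""
    # Each item pairs an exact value with the expression that produced it.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick two numbers, combine them, and recurse on the shorter list.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[m] for m in range(len(items)) if m not in (i, j)]
        options = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            options.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            options.append((b / a, f"({eb} / {ea})"))
        for value, expr in options:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found  # success propagates back up
        # No option worked for this pair: dead end, try the next pair.
    return None
```

Calling `solve_24([4, 9, 10, 13])` returns an expression string that evaluates to 24, while `solve_24([1, 1, 1, 1])` returns `None`. A single left-to-right pass has no equivalent of the "try the next pair" step, which is the gap ToT is meant to close.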

Tree of Thoughts addresses this structural deficit. Instead of a chain, it builds a tree in which each node is a partial solution, each branch an alternative continuation, and each leaf represents either a solution or a dead end.
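
One way to picture this structure is a small node type. The following is a sketch under assumptions: the class name `ThoughtNode`, its fields and the verdict labels are hypothetical and do not come from the paper. It assumes Python 3.9 or newer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare nodes by identity to avoid recursive comparisons
class ThoughtNode:
    thought: str                                  # the step added at this node
    parent: Optional["ThoughtNode"] = None        # None for the root
    children: list["ThoughtNode"] = field(default_factory=list)
    verdict: str = "open"                         # hypothetical: "open", "solved", "dead_end"

    def add_child(self, thought: str) -> "ThoughtNode":
        """Attach an alternative continuation below this node."""
        child = ThoughtNode(thought, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Thoughts from the root down to this node, i.e. the partial solution."""
        node, steps = self, []
        while node is not None:
            steps.append(node.thought)
            node = node.parent
        return steps[::-1]
```

Each call to `add_child` creates a branch, and `path()` reconstructs the partial solution a node stands for. Keeping the parent link is what later makes it possible to return to an earlier node.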

The four design decisions of ToT

The ToT paper describes four modular building blocks that every concrete ToT implementation has to decide on:

  1. Thought decomposition: How is the problem broken down into individual "thoughts"? A thought can be a calculation step, a paragraph plan or a single crossword word.
  2. Thought generator: At a given state, k candidates for the next thought are proposed or sampled.
  3. State evaluator: Each candidate is assessed, either independently (in the paper via labels such as sure/maybe/impossible, each sampled three times) or comparatively, by voting across candidates.
  4. Search algorithm: BFS (keeping the best b candidates per level, with b=5 in the paper's Game of 24 setup) or DFS with backtracking traverses the tree.
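
The state evaluator from the list above can be sketched as follows. This is a minimal illustration: the `judge` callable stands in for a model call, and the numeric weights in `LABEL_SCORE` are assumptions chosen for readability, not the values used in the paper's experiments.

```python
from typing import Callable

# Hypothetical weights; the paper's experiments may weight the labels differently.
LABEL_SCORE = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}


def evaluate_state(state: str, judge: Callable[[str], str], n_samples: int = 3) -> float:
    """Average several independent judgments of one state into a single score."""
    labels = [judge(state) for _ in range(n_samples)]
    # Unknown labels count as 0 so a malformed reply never looks promising.
    return sum(LABEL_SCORE.get(label, 0.0) for label in labels) / n_samples
```

Sampling the judgment several times smooths out a single noisy verdict, at the price of multiplying the evaluator calls by `n_samples`.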

The paper explicitly draws its conceptual inspiration from the problem-solving research of Newell and Simon in the 1950s: deliberate, search-driven reasoning instead of an associative single answer.

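Schematic example: the ToT loop as a Python sketch

The following sketch shows a BFS variant with k proposals per state, breadth b and depth d. The generator and evaluator arguments are placeholders for model calls, so the code illustrates the control flow rather than a complete implementation.

```python
RANK = {"sure": 2, "maybe": 1}  # evaluator labels, best first


def top_b(evaluated, b):
    """Keep the b best-rated candidates (ties keep generation order)."""
    ranked = sorted(evaluated, key=lambda x: RANK.get(x[0], 0), reverse=True)
    return [state for _, state in ranked[:b]]


def tot_bfs(problem, generator, evaluator, k=3, b=5, d=3):
    frontier = [[problem]]                     # list of active states
    for _depth in range(d):                    # e.g. d = 3 steps
        candidates = []
        for state in frontier:
            for _ in range(k):                 # k proposals per state
                t = generator(state)           # next thought
                candidates.append(state + [t])
        evaluated = [(evaluator(c), c) for c in candidates]
        # discard dead ends (impossible) -> implicit backtracking
        evaluated = [x for x in evaluated if x[0] != "impossible"]
        frontier = top_b(evaluated, b)         # continue only with the best b
        if not frontier:
            return None                        # every path was a dead end
    return frontier[0]                         # best surviving state
```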

Two points are decisive. First, the branching: k candidates are created per state, so the tree grows wide. Second, the top_b pruning combined with discarding impossible nodes: poorly rated paths are cut off, and the search concentrates the compute on the promising branches. With DFS, the backtracking appears explicitly: if a path turns out to be a dead end, the search returns to the last promising node.
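
The DFS variant with explicit backtracking can be sketched in the same spirit. The callables `propose`, `evaluate` and `is_solution` and the `threshold` parameter are hypothetical placeholders introduced for this illustration; in a real setup the first two would wrap model calls.

```python
def tot_dfs(state, propose, evaluate, is_solution, max_depth, threshold=0.5):
    """Depth-first search over thoughts; returns a solved state or None."""
    if is_solution(state):
        return state
    if max_depth == 0:
        return None  # depth budget exhausted: treat as a dead end
    # Try the most promising continuations first.
    scored = sorted(
        ((evaluate(c), c) for c in propose(state)),
        key=lambda x: x[0],
        reverse=True,
    )
    for score, candidate in scored:
        if score < threshold:
            break  # everything after this is rated even lower: prune
        result = tot_dfs(candidate, propose, evaluate, is_solution,
                         max_depth - 1, threshold)
        if result is not None:
            return result
        # The subtree was a dead end: the loop backtracks to this node
        # and continues with the next sibling.
    return None


# Toy usage: find three steps from {1, 2, 3} that add up to 7.
found = tot_dfs(
    (),
    propose=lambda s: [s + (n,) for n in (1, 2, 3)],
    evaluate=lambda s: 1.0 if sum(s) <= 7 else 0.0,
    is_solution=lambda s: sum(s) == 7,
    max_depth=3,
)
```

Unlike the BFS loop, only one path is active at a time, so memory stays small, but a misleading high score early on can send the search deep into a poor subtree before it returns.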

Cost and benefit: ToT versus CoT

The benefit of ToT is considerable on the right tasks, but so is the price. The following figures come from the ToT paper (GPT-4, conditions as in the paper) and should be read as relative effect sizes, not as today's absolute values.

| Task (paper conditions, GPT-4) | Chain of Thought | Tree of Thoughts |
| --- | --- | --- |
| Game of 24 (success rate) | 4 % | 74 % |
| Mini crossword 5x5 (game level) | 1 % | 20 % |
| Mini crossword 5x5 (word level) | 16 % | 60 % |
| Creative writing, coherence (scale 1–10) | ~6.9 | ~7.6 |

These quality gains come at a massive additional cost. The following table places ToT within the spectrum of common agent reasoning patterns; the token values are rough orders of magnitude relative to a single CoT call (= 1), synthesised from the paper's figures and field reports, and should be measured against your own workload.

| Pattern | Tokens (relative to CoT = 1) | Latency | Implementation complexity |
| --- | --- | --- | --- |
| Chain of Thought | 1 | low | very low |
| ReAct | 3–10x | N sequential steps | low |
| Tree of Thoughts (b=5, d=3) | 50–150x | b^d evaluator calls | high |
| LATS (ToT + Reflexion) | 100–300x | tree x reflexion | very high |

In the Game of 24, ToT consumed around 5,500 tokens per case according to the paper, compared with a few hundred for CoT. Latency is strictly worse than CoT, because it grows in proportion to branching, depth and sample rate.
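
How branching, depth and sample rate drive the number of model calls can be estimated with a few lines of arithmetic. The sketch below rests on stated assumptions: one model call per proposal, `n_eval` evaluation samples per candidate, and no candidates discarded early. Implementations that request several proposals in one call, or that prune dead ends, need fewer calls. The function name `bfs_call_budget` is illustrative.

```python
def bfs_call_budget(k: int, b: int, d: int, n_eval: int = 3) -> dict:
    """Upper bound on model calls for a BFS tree search under the assumptions above."""
    frontier = 1  # the search starts from the root state
    generator_calls = 0
    evaluator_calls = 0
    for _ in range(d):
        candidates = frontier * k               # k proposals per active state
        generator_calls += candidates
        evaluator_calls += candidates * n_eval  # each candidate is judged n_eval times
        frontier = min(b, candidates)           # pruning keeps at most b states
    return {
        "generator": generator_calls,
        "evaluator": evaluator_calls,
        "total": generator_calls + evaluator_calls,
    }
```

With k = 5, b = 5, d = 3 and three evaluation samples, this counts 55 generator calls and 165 evaluator calls, 220 in total. Token volume depends on prompt and response length per call, so the figure is a call count only and should be checked against your own workload.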

Quality stands or falls with the generator

A central finding from the research: ToT does not lift every model. The effect is dominated by the generator model. In the Game of 24, GPT-4 with ToT reached 74 percent, whereas GPT-3.5 with ToT reached only 19 percent. The paper's analysis cleanly separates the roles: GPT-4 as generator plus GPT-3.5 as evaluator yielded 64 percent, while the reverse configuration (GPT-3.5 generates, GPT-4 evaluates) yielded only 31 percent.

A Stanford CS224N project shows it even more clearly: Mixtral-8x7B solved none of the ten easiest Game-of-24 puzzles, because the model hallucinated numbers and failed at simple arithmetic, which rendered the evaluator worthless. In practice this means: ToT on a weak model burns budget without solving the problem.

When (not) to use ToT

Clear guardrails for 2026 emerge from the research.

ToT is suitable when:

  • the task has a search character, that is, it features lookahead, branching and potential dead ends (puzzles, optimisation, scheduling, planning).
  • a reliable evaluation signal exists, ideally with verifiable rewards.
  • an audit trail of the alternatives examined is required, for example in regulated DACH industries under GDPR or the EU AI Act.
  • a small, low-cost proposer model is deliberately used, whose weaker single result is to be compensated by search.

ToT is not suitable when:

  • the model in use can already deliver the answer in one step.
  • the application is latency-sensitive (such as live chat).
  • there is no clear evaluator; without a good evaluation signal, the search comes to nothing.

The most important framing from the research: for general reasoning, explicit ToT is largely superseded, because modern reasoning models represent the tree search internally. ToT therefore remains relevant primarily as a conceptual foundation for tree-structured agent search, and in the special cases mentioned.

From method to agent: LATS

If you need ToT in an agent stack today, you usually reach not for pure ToT but for its successor LATS (Language Agent Tree Search, Zhou et al., arXiv:2310.04406). LATS combines the tree expansion of ToT with self-reflection (in the sense of Reflexion) and Monte Carlo Tree Search. LangGraph has an official tutorial for it; there is no dedicated ToT-only tutorial there. The LangChain blog argues that LATS outperforms comparable algorithms such as Tree of Thoughts, ReAct and Reflexion. As for framework fit: in CrewAI, ToT is not idiomatic, because CrewAI processes are sequential or hierarchical; in AutoGen it can be implemented via a Selector Group Chat, though only laboriously; and in n8n it is not realistic because of the combinatorial explosion, except as best-of-N sampling (ToT with depth 1).
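
Best-of-N sampling, the depth-1 special case mentioned above, is small enough to sketch in full. The `generate` and `score` callables are hypothetical stand-ins for a model call and an evaluation signal; nothing here is specific to n8n or any other framework.

```python
from typing import Callable


def best_of_n(prompt: str, generate: Callable[[str], str],
              score: Callable[[str], float], n: int = 5) -> str:
    """Sample n candidates and return the highest-scoring one (a depth-1 tree)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

There is no frontier, no depth loop and no backtracking, which is why this pattern fits linear workflow tools: the cost is n generator calls plus n scoring calls, with no combinatorial growth.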

For agencies and B2B decision-makers

For marketing agencies and DACH B2B teams, the message is pragmatic: Tree of Thoughts is not an everyday tool but a specialist instrument. In the vast majority of client projects (chatbots, content pipelines, research workflows), cheaper patterns such as ReAct or ReWOO are the better choice. ToT or its successor LATS only pays off where a single solution path is demonstrably not enough, for example in constraint-heavy copywriting, optimisation tasks or regulated processes that require an auditable trail of discarded alternatives. Anyone considering ToT should secure two things beforehand: a capable generator model and a robust evaluation signal. If either is missing, token costs multiply without any return. Blck Alpaca supports exactly this decision: the level-headed selection of the right reasoning pattern and a sound cost-benefit assessment, before budget is burned in production.

FAQ

What is the difference between Tree of Thoughts and Chain of Thought?

Chain of Thought (CoT) produces a single, strictly left-to-right chain of reasoning and cannot backtrack. Tree of Thoughts (ToT) generates multiple candidate thoughts at each step, evaluates them and searches the resulting tree via BFS or DFS with backtracking. This allows ToT to leave dead ends instead of remaining committed to a wrong assumption once made.

When is Tree of Thoughts worthwhile compared with the cheaper Chain of Thought?

ToT is worthwhile for tasks with a search character, lookahead and potential dead ends, such as puzzles, combinatorial optimisation, planning or texts with hard constraints. A prerequisite is a reliable evaluation signal (evaluator). For tasks the model solves in one step anyway, or for latency-critical applications, ToT is not justified, because token costs are roughly 50 to 150 times higher.

How does backtracking work in Tree of Thoughts?

Each node in the tree is a partial solution (a thought). A state evaluator assesses candidates, in the ToT paper for example via labels such as sure/maybe/impossible. If the evaluator rates a path as impossible, the search discards that branch and either returns to the last promising node (DFS) or continues with the best b candidates per level (BFS, with b=5 in the paper's Game of 24 setup).

Is Tree of Thoughts still relevant in 2026?

For general reasoning, explicit ToT is largely superseded, because modern reasoning models already represent the search internally. According to the research, ToT remains relevant primarily in three cases: in audit-required, regulated industries; for puzzles and optimisation with verifiable rewards; and in deployments with small, low-cost proposer models. Conceptually, ToT is also the foundation for modern tree-search agents such as LATS.

Which frameworks support Tree of Thoughts?

There is no dedicated ToT tutorial in LangGraph; the production-ready successor is LATS (Language Agent Tree Search), which combines ToT with MCTS and self-reflection and is available as a LangGraph tutorial. CrewAI is not an idiomatic fit, as it works sequentially or hierarchically. In AutoGen, ToT can be implemented via a Selector Group Chat, but it is laborious. In n8n, true ToT is not realistic because of the combinatorial explosion; at most, it can be approximated as best-of-N sampling (ToT with depth 1).
