2.10Beginner8 min

Tokenisation and Context Window: What Drives Agent Latency and Cost

Blck Alpaca·9 June 2026

Definition

Tokenisation breaks text into tokens, the smallest processing units of an LLM; the context window is the maximum number of tokens a model can process together per request. With AI agents, both directly determine cost and latency, because every step carries the entire prior context along again.

Key Takeaways

✓Tokens are the billing and processing unit of LLMs: cost and latency depend almost entirely on the token count, not the character count.
✓The context window is finite. Multi-step agents fill it quickly, because every step resends history, tool outputs and the system prompt, so the billed token volume grows disproportionately over the run.
✓With a full or very long context, answer quality drops measurably (lost-in-the-middle): information in the middle of long inputs is used less effectively than at the beginning or end.
✓Output tokens usually cost a multiple of input tokens (often three to six times as much); long contexts add further expense, for example Gemini 3.1 Pro with a tariff surcharge above 200K tokens (as of 2026).
✓Counter-strategies are context compression, summarisation of the history and retrieval (RAG) instead of loading everything into the prompt. They reduce cost, latency and degradation risk at the same time.
✓Model choice is a cost lever: workhorse and open-weight models are often a factor of 8 to 100 below frontier closed models per token (as of 2026).

Tokenisation breaks text into tokens, the smallest processing units of an LLM; the context window is the maximum number of tokens a model can process together per request. With AI agents, both directly determine cost and latency, because every agent step resends the entire prior context to the model. Anyone running agents in production controls the largest part of ongoing cost and response time through these two variables.

Tokens are the billing and processing unit. Cost and latency depend on the token count, not the character count. German text tends to produce more tokens per word than English.
The context window is finite. Multi-step agents fill it quickly, because every step carries along the system prompt, the entire history and all tool results. The token volume billed per run therefore grows disproportionately.
A full context does not mean a better context. With very long inputs, quality drops measurably (lost-in-the-middle). Context management is not a nice-to-have but a cost and quality lever at the same time.

What tokens are and why they are the cost driver

An LLM does not process plain text but tokens. The tokeniser breaks input text into units that usually correspond to a word part, a short whole word or a punctuation mark. The model computes exclusively with these tokens, and all providers bill per token, separated into input (what goes in) and output (what is generated).

Two properties are decisive in practice. First, token density is language-dependent: in German, one token corresponds roughly to 0.6 to 0.8 words. Long compounds, umlauts and inflectional endings mean that the same content often requires more tokens in German than in English. For DACH workloads this means: same task, higher token consumption, higher cost and faster filling of the context window.

Second, output is usually significantly more expensive than input. With the models available on the market, the output price is typically three to six times the input price - with individual cheap workhorse models the surcharge is smaller, with frontier models it tends towards the upper end. An agent that produces long, detailed answers is therefore disproportionately expensive relative to the volume of output tokens.

How the context window works

The context window is the maximum number of tokens a model can see together in a request - input and generated output combined. Everything the model is to take into account for an answer must fit into this window: system instruction, conversation history, supplied documents, tool definitions and tool results.

Window sizes have grown strongly in recent years. Current frontier models offer very large contexts: Claude Opus 4.7 and Gemini 3.1 Pro work with around 1 million tokens, Gemini depending on deployment with up to 2 million, Mistral Large 3 with 256K, Llama 4 Scout even with up to 10 million tokens (all figures as of 2026). A larger window shifts the hard upper limit - but it does not make carrying along context free. Two effects remain: cost and latency rise with input length, and answer quality degrades with very long inputs.

Context growth with multi-step agents

A single chat call is uncritical in cost terms. The problem arises with the agentic pattern: an agent solves a task not in one call but in many steps - plan, call a tool, evaluate the result, call the next tool, and so on. With each of these steps the entire prior history is resent as input, because the model is stateless and remembers nothing.

From this follows the central economic pattern of agents: a disproportionately growing token consumption. If step 1 still has 2,000 input tokens, step 2 already 5,000, step 3 then 9,000 and so on, the billed token consumption over the run adds up not linearly but far disproportionately - after all, every step carries along the grown context of all previous steps. Every additional tool output - an API response, a searched document, a search result - enlarges the context for all subsequent steps. Long tool outputs are the most frequent silent cost driver here.

Latency follows the same logic. The time to first token and the overall response time rise with input length, because the model has to read in the complete context before it answers. An agent that carries along 80,000 tokens of context towards the end of a long run is noticeably slower per step than at the beginning - precisely when the user is already waiting anyway.

Degradation: lost-in-the-middle with a full context

A widespread fallacy goes: if the window is large enough, you can simply pack everything in. That is technically true, but not qualitatively. LLMs use information at the beginning and end of a long input more reliably than information in the middle - an effect known as lost-in-the-middle. The longer the context, the higher the risk that the decisive information is weighted less or gets lost.

For agents this is doubly relevant. First, an agent often fills its context with history that is irrelevant for the current step. Second, an overfilled context can cause the model to ignore older instructions or early tool results from the middle of the history - with the result that answer quality drops despite technically sufficient space. More context is therefore not automatically more performance; beyond a certain point, less but curated context is better.

Terms and their agent implication

Term	Meaning	Agent implication
Token	Smallest processing and billing unit; word part, short word or punctuation mark	Drives cost and latency; German text produces more tokens per word
Tokenisation	Breaking text into tokens by the tokeniser	Determines how expensive a given input becomes; language- and model-dependent
Context window	Maximum token count (input plus output) per request	Hard upper limit; quickly exhausted with multi-step agents
Input tokens	Context sent to the model	Grow with every agent step; main cause of disproportionate cost
Output tokens	Response generated by the model	Typically 3 to 6 times more expensive than input; limit long answers deliberately
Lost-in-the-middle	Weaker use of information in the middle of long inputs	Overfilled context lowers quality; curation beats completeness
Context compression	Condensing/summarising the history	Reduces token count, latency and degradation risk in long runs
Retrieval (RAG)	Feeding in only relevant snippets via search instead of full text	Keeps context small and focused; reduces cost and improves hit quality

Example calculation: tokens to cost

A support agent solves a request in 6 steps. With each step it resends the entire context. Simplified assumption for the input progression: 2,000, 5,000, 9,000, 14,000, 20,000, 27,000 tokens - around 77,000 input tokens in total. In addition, it generates about 500 output tokens per step, so around 3,000 output tokens in total.

Let us work this through with three price levels (prices per 1 million tokens, input/output, as of 2026):

Model (as of 2026)	Input price	Output price	Input cost (77K)	Output cost (3K)	Total per run
Claude Opus 4.7 (frontier)	$5.00	$25.00	$0.385	$0.075	~$0.46
Mistral Large 3 (EU-sovereign)	$0.50	$1.50	$0.039	$0.0045	~$0.043
DeepSeek V4 Flash (workhorse)	$0.14	$0.28	$0.0108	$0.00084	~$0.012

A single run looks cheap. But at 50,000 such runs per month, the result is roughly 23,000 US dollars with the frontier model versus around 2,150 US dollars with the EU-sovereign model and around 600 US dollars with the cheap workhorse model. The factor between the frontier and workhorse tier lies between 8 and 100 depending on the model (as of 2026). Two lessons: first, with agents the carried-along input dominates the budget, not the output. Second, the model choice per sub-step is a massive cost lever.

On top of this comes a long-context surcharge. Gemini 3.1 Pro, for example, becomes more expensive above 200,000 tokens: the input price doubles, the output price also rises significantly ($4 / $18 instead of $2 / $12 per 1 million tokens, as of 2026). Anyone who lets an agent's context grow uncontrolled beyond this threshold pays not only for more tokens but also a higher tariff per token.

Counter-strategies: keep the context small and focused

The art of agent engineering lies in giving the model only what it really needs per step. Three strategies that can be combined:

Context compression and summarisation. Older steps are condensed into a short summary instead of carrying along the full-text history. Long tool outputs are reduced to the result. This lowers input tokens, latency and the lost-in-the-middle risk at the same time.
Retrieval instead of full text (RAG). Instead of loading entire knowledge bases or documents into the prompt, only the relevant snippets are fed in via search. This keeps the context small and increases hit quality, because the model is not distracted by the irrelevant.
Model routing and prompt caching. Simple sub-steps (classification, extraction, formatting) run on cheap workhorse or open-weight models; expensive frontier models are reserved only for the complex steps. Prompt caching additionally reduces the cost for recurring, stable context parts such as system prompts.

In addition: limit output deliberately (concise, structured answers instead of rambling prose) and tailor tool results to what is necessary. Taken together, these measures decide whether an agent looks elegant in the pilot phase and remains economical in production.

For agencies and B2B decision-makers

Tokenisation and the context window are not peripheral technical topics but the two most important levers for the economic viability of an AI agent. Anyone budgeting an agent solution should calculate the cost not per request but per complete multi-step run, extrapolated to the expected monthly volume - including the disproportionate context growth. For agencies, the advisory value lies in protecting clients from the nasty surprise that an agent that is cheap in the pilot becomes unpredictably expensive or slow under load. Blck Alpaca supports DACH companies in designing agent architectures so that context management, model routing and retrieval are built in from the outset - for predictable cost, acceptable latency and stable answer quality in production.

FAQ

What is the difference between a token and a word?

A token is a sub-unit that the tokeniser forms from text - usually a word part, a whole short word or a punctuation mark. In German, one token corresponds roughly to 0.6 to 0.8 words; long compounds and umlauts often produce more tokens than in English. Rule of thumb: German text consumes more tokens per word, which increases cost and context consumption.

Why do agents become slower and more expensive with every step?

With every reasoning step, an agent resends the entire prior history to the model: system prompt, user request, all previous responses and all tool results. This input grows with every step, and since input tokens drive both latency and cost, both add up disproportionately over a multi-step run.

What does lost-in-the-middle mean and why is it relevant for agents?

Lost-in-the-middle describes the fact that LLMs use information at the beginning and end of a long input more reliably than information in the middle. A large context window therefore does not automatically mean better results: if an agent overfills its context with irrelevant history, the decisive information can be lost and answer quality drops despite technically sufficient space.

Does a larger context window solve the problem?

Only partly. A larger window (around 1 million tokens with current frontier models such as Claude Opus 4.7 or Gemini 3.1 Pro, as of 2026) shifts the hard limit, but eliminates neither the cost and latency that rise with length nor the quality degradation with very long inputs. More space tends to tempt you into carrying along unnecessarily much context. Context management remains necessary.

How do I concretely reduce token costs in a production agent?

Three levers: first, shorten the context by summarising older steps and removing redundant tool outputs; second, retrieval instead of full text - feed in only the relevant knowledge snippets via RAG; third, model routing - assign simple sub-steps to cheaper workhorse or open-weight models and use frontier models only for complex steps. In addition, prompt caching reduces recurring input costs.

Want to go deeper?

Get new analyses straight to your inbox, or see how we put this knowledge to work for companies.

Subscribe to newsletter →Our services

NextTemperature, Top-p and Sampling: Settings for Deterministic Agents →