10.5 · Intermediate · 7 min

Token Economics: How AI Agent Costs Really Arise

Blck Alpaca
Definition

Token economics for AI agents describes the cost mechanics whereby every agent run is billed by the tokens consumed: input, output, cached and reasoning tokens. Unlike a chatbot, an agent multiplies consumption through multi-step loops, tool calls and sub-agents. As a result, real production costs typically exceed a list-price estimate by a factor of 2 to 10.

Key Takeaways

  • A single agent run in 2026 typically generates 5 to 20 LLM calls (planner, tool call, evaluation, verification) - sub-agent cascades multiply token consumption by an additional factor of 3 to 10.
  • Four token types drive the bill: input, output (more expensive), cached input (90 percent cheaper at Anthropic) and reasoning tokens (billed at the output rate).
  • API tokens are rarely more than 30 to 50 percent of total cost (TCO) - vector store, observability, compute, retries and compliance make up the rest.
  • Prompt caching is the single biggest FinOps lever in 2026: a 60 to 90 percent cache hit rate cuts input costs by 70 to 80 percent.
  • Eval-driven model routing (small model as default, large only on a measurable gap) saves 30 to 60 percent with no loss of quality.
  • All 2026 prices are volatile - budget guardrails such as per-workflow token caps are mandatory, not optional.

Token economics for AI agents describes the cost mechanics whereby every agent run is billed by the tokens consumed: input, output, cached and reasoning tokens. Unlike a chatbot, an agent multiplies consumption through multi-step loops, tool calls and sub-agents. As a result, real production costs typically exceed a list-price estimate by a factor of 2 to 10. Anyone planning an agent budget has to understand these mechanics, otherwise the plan will miss half the bill.

  • Tokens are the billing unit, not requests. As of 2026, a user request to an agent typically generates 5 to 20 LLM calls, each of which is billed separately.
  • Output and reasoning are expensive, cached input is cheap. Output usually costs 3 to 5 times as much as input; reused context (cache) costs only 10 percent of the base input price at Anthropic.
  • API tokens are rarely more than 30 to 50 percent of total cost. Vector store, observability, compute, retries and compliance drive the rest.

The four token types and how they are billed

Every LLM call breaks down into differently priced token buckets. Anyone who only sees the output systematically underestimates the input - because the agent drags along the system prompt, tool definitions and growing context again at every step.

  • Input tokens comprise everything that goes into the model: system prompt, tool definitions, retrieved RAG context, conversation history and the actual request. A workflow with five defined tools at 150 tokens each adds 750 input tokens to every request on its own.
  • Output tokens are the generated response. They regularly cost three to five times as much as input. With Claude Sonnet, the rate as of 2026 stands at USD 3 input against USD 15 output per million tokens - a factor of five.
  • Cached input tokens are reused, stable context. Anthropic reads cache at 0.1x the base price (90 percent discount), OpenAI on the GPT-5.x family at around 10 percent of the base price. This is the most impactful cost lever introduced since the start of usage-based billing.
  • Reasoning tokens arise from the internal thinking of reasoning models. As of 2026, they are simply billed at the output rate by both OpenAI and Anthropic. The practical consequence: a call with 20,000 reasoning tokens costs around USD 0.50 for the thinking alone at USD 25/million output - before a single visible word is generated.
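
The four buckets can be put into a small calculator. This is a minimal sketch, not a provider API: the names `Rates` and `call_cost` are made up for illustration, the rates are the 2026 figures quoted above (volatile), and cache-write surcharges are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Illustrative values; check current price lists."""
    input_usd: float
    output_usd: float
    cache_read_factor: float = 0.1  # assumption: cached input at 10% of the input rate


def call_cost(rates, input_tokens=0, cached_tokens=0, output_tokens=0, reasoning_tokens=0):
    """Return the USD cost of one LLM call, split by token bucket."""
    per_in = rates.input_usd / 1_000_000
    per_out = rates.output_usd / 1_000_000
    return {
        "input": input_tokens * per_in,
        "cached": cached_tokens * per_in * rates.cache_read_factor,
        "output": output_tokens * per_out,
        # reasoning tokens are billed at the output rate
        "reasoning": reasoning_tokens * per_out,
    }


# The reasoning example from the list: 20,000 thinking tokens at USD 25/million output.
thinking = call_cost(Rates(input_usd=5, output_usd=25), reasoning_tokens=20_000)
print(round(thinking["reasoning"], 2))  # 0.5
```

The split by bucket matters more than the total: it shows at a glance whether a workflow is input-heavy (a caching candidate) or reasoning-heavy (a routing candidate).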

The multiplier: why agents make consumption explode

The decisive difference from a chatbot is not the token price, but the number of calls. A single user request, which in 2023 was still one model call, routinely translates in 2026 into a chain: planner, tool selection, interpretation of the tool result, next-step decision, output formatting, often with explicit verification loops. That is 5 to 20 calls. Sub-agent cascades occasionally push this to 50 and more.

Three multipliers stack up:

  • Multi-step execution adds 50 to 200 percent on top of the direct API line. Each tool call is its own completion call with its own context.
  • Sub-agent fan-out multiplies token consumption by a factor of 3 to 10 compared with a single agent. Each sub-agent is a separate completion with its own context window and its own tool definitions.
  • Failure and retry add a factor of 1.3 to 3 with weak verification. Agentic workflows that fail and restart burn tokens on every failed attempt.

On top of this comes context growth: with each step the conversation history grows, which is paid for again as input. Long contexts are doubly expensive - above 200,000 tokens, several providers (Gemini Pro models, OpenAI GPT-5.5 from 272,000 tokens) charge a surcharge of 2x input and 1.5x output. Naively stuffing the context window full is therefore rarely the cheap solution.
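
How strongly context growth weighs can be estimated with a few lines. The sketch below assumes a simplified model in which every step resends a stable base context plus all history accumulated so far; the function name and the token figures are hypothetical, not measured values.

```python
def cumulative_input_tokens(steps, base_tokens, growth_per_step):
    """Total input tokens billed over one run when each step resends the
    stable base context plus the history accumulated in earlier steps."""
    total = 0
    history = 0
    for _ in range(steps):
        total += base_tokens + history
        history += growth_per_step  # tool results and answers join the history
    return total


# Hypothetical run: 8 steps, 2,000 tokens of base context, 500 new history tokens per step.
naive = 8 * 2_000
actual = cumulative_input_tokens(steps=8, base_tokens=2_000, growth_per_step=500)
print(naive, actual)  # 16000 30000
```

Under these assumptions the history share grows quadratically with the number of steps, which is why a cap on steps limits cost more than proportionally.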

Hidden costs: the bill below the API line

For an agentic load at enterprise scale, the API token line is rarely more than 30 to 50 percent of total cost (TCO). Budgets that only apply the model provider's price list systematically miss half the bill. The following items are the most common blind spots in the DACH region (shares as a guideline for a representative enterprise load, as of 2026):

| Cost driver | Cause | Lever |
| --- | --- | --- |
| Direct model tokens | Input, output, reasoning per call | Caching, routing, eval-based model choice |
| Tool-use cascade | 5 to 20 LLM calls per request | max_iterations and max_tool_calls as a hard cap |
| Sub-agent fan-out | Each sub-agent completion separate | Advisor pattern instead of a full sub-agent cascade |
| Retry loops | Weak verification, failed runs | Better verification, token budget per trace |
| Indirect model costs | Tool definitions as input per call (5 tools = 750 tokens) | Cache tool definitions, slim down output schemas |
| Vector store and embeddings | RAG storage, embedding generation, queries | Self-hosted Qdrant instead of managed; targeted retrieval |
| Compute and sandbox | Containers, VM minutes for coding/tool agents | Spin up only on actual demand |
| Observability | Monitoring of token consumption | Self-hosted Langfuse instead of Datadog at enterprise scale |
| EU-region surcharge | Around 10 percent (OpenAI on EU endpoints, Anthropic on inference_geo: "us") | Steady state on EU, burst to US where GDPR permits |
| Sovereignty premium | 1.5x to 3x the price at SAP, Telekom, OVHcloud | Only for regulated loads, otherwise a negotiation lever |
| Compliance ops | DPA chain, sub-processor disclosure per provider | Keep the number of providers low, use contract templates |

The hidden items in numbers: vector store and embeddings amount to 5 to 15 percent of total cost, observability to 2 to 8 percent, compute and sandbox to 10 to 25 percent. In the DACH region, factors are added that appear on no Californian price list: the EU-region surcharge of around 10 percent, the sovereignty premium of a factor of 1.5 to 3, and ongoing compliance costs of realistically EUR 5,000 to 20,000 per year per active contracting partner. These DACH-specific factors increase total cost by 15 to 35 percent compared with a comparable US load.
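
As a rough plausibility check, the API line can be scaled up to a total-cost band using the 30 to 50 percent share named above. A minimal sketch; `tco_range` is a hypothetical helper, and the shares are the article's guideline values, not measurements of any specific workload.

```python
def tco_range(api_cost_usd, api_share_low=0.30, api_share_high=0.50):
    """Return (low, high) total-cost estimates for a given API token line,
    assuming the API line makes up between 30 and 50 percent of TCO."""
    low = api_cost_usd / api_share_high   # API is half of the bill
    high = api_cost_usd / api_share_low   # API is under a third of the bill
    return low, high


low, high = tco_range(api_cost_usd=1_000)
print(round(low), round(high))  # 2000 3333
```

A budget of USD 1,000 on the API line therefore corresponds to roughly USD 2,000 to 3,300 in total under these assumptions, before any DACH-specific surcharge.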

The model-choice lever: small, large, or routed

Not every step needs the most expensive model. The second-biggest FinOps lever after caching is routing - the cheap model as the default, the expensive one only on a measurable gap. The price spread is considerable (as of 2026): Claude Haiku stands at USD 1 input / USD 5 output per million, Sonnet at 3 / 15, Opus at 5 / 25; GPT-5.5 at 5 / 30. On the open side, DeepSeek V4 Flash undercuts the frontier level on the input side by a factor of 36 with USD 0.14 input.

As of 2026, Anthropic has formalised the routing pattern with the Advisor Tool (in beta since 9 April 2026): Sonnet or Haiku as the executor, Opus as an advisor brought in on demand within a single API call. The published benchmarks show how powerful the lever is: Sonnet plus Opus advisor reached 74.8 percent on SWE-bench Multilingual versus 72.1 percent for Sonnet alone - at 11.9 percent lower cost than Opus solo. Haiku plus Opus advisor doubled the BrowseComp score (19.7 to 41.2 percent) at 85 percent lower cost than Sonnet solo.

The sober rule of thumb behind it: the cheapest model that passes the eval is the right model. Measured against the list prices above, Opus costs about 1.7 times as much as Sonnet and five times as much as Haiku, at a benchmark gap of 1 to 2 percentage points between Sonnet and Opus on most workflows. Teams that route by eval result rather than by gut feeling typically cut model costs by 30 to 60 percent with no loss of quality.
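
The rule of thumb translates directly into a selection function. The sketch below is an illustration under assumptions: `Candidate` and `cheapest_passing` are hypothetical names, and the costs and scores are invented placeholders that would come from a team's own eval set in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    usd_per_run: float  # measured average cost on the eval set
    eval_score: float   # pass rate on the eval set, 0.0 to 1.0


def cheapest_passing(candidates, min_score):
    """Return the cheapest candidate that meets the eval threshold, or None."""
    passing = [c for c in candidates if c.eval_score >= min_score]
    if not passing:
        return None  # nothing passes: fix the workflow or lower the bar deliberately
    return min(passing, key=lambda c: c.usd_per_run)


# Placeholder numbers for three model tiers on one workflow.
candidates = [
    Candidate("small", usd_per_run=0.04, eval_score=0.86),
    Candidate("medium", usd_per_run=0.12, eval_score=0.93),
    Candidate("large", usd_per_run=0.20, eval_score=0.94),
]
print(cheapest_passing(candidates, min_score=0.90).name)  # medium
```

The threshold is set per workflow, not globally: a classification step may accept the small model at 0.86, while a step with legal consequences may justify the large one.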

Worked example: 1,000 agent runs

Concretely, with Sonnet rates (USD 3 input / USD 15 output per million, as of 2026, volatile). Assume a research agent with an average of 8 LLM calls per run, at 4,000 input and 800 output tokens each.

Unoptimised, without caching:

  • Input: 1,000 runs x 8 calls x 4,000 tokens = 32 million tokens x USD 3 = USD 96
  • Output: 1,000 x 8 x 800 = 6.4 million tokens x USD 15 = USD 96
  • Direct token costs: around USD 192 per 1,000 runs

With an 80 percent cache hit rate on the input (the stable context share of system prompt and tool definitions, read from cache at USD 0.30/million instead of USD 3), the cached share is billed at one tenth of its base price. The weighted input costs fall from USD 96 to about USD 25 to 30, while the output stays at USD 96. Total: around USD 120 to 125 - a saving of around 35 percent through caching alone.

If 60 percent of the calls are then routed to Haiku (USD 1 / USD 5) instead of Sonnet, because the eval allows it, the direct line falls further towards USD 70 to 80 per 1,000 runs. And that is only the API line - if you add the vector store, observability and compliance ops, the real total cost is again significantly higher. This is precisely why per-workflow token caps (max_iterations, max_tool_calls, max_sub_agent_depth) are the governance standard as of 2026: they prevent a single run that has spun out of control from blowing up the calculation.
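
The three stages of the example can be recalculated in a few lines. The sketch makes simplifying assumptions that the text does not spell out: the 80 percent hit rate is applied to all input tokens, cache reads cost one tenth of the input rate for both models, and cache-write surcharges are ignored. All names are illustrative.

```python
RUNS, CALLS = 1_000, 8
IN_TOK, OUT_TOK = 4_000, 800          # tokens per call
SONNET = {"input": 3.0, "output": 15.0}  # USD per million tokens, as of 2026
HAIKU = {"input": 1.0, "output": 5.0}
CACHE_READ_FACTOR = 0.1               # assumption: cached input at 10% of the input rate


def run_cost(rates, calls, cache_hit_rate=0.0):
    """USD cost of one run with `calls` LLM calls (may be a fractional average)."""
    in_tokens = calls * IN_TOK
    cached = in_tokens * cache_hit_rate
    fresh = in_tokens - cached
    out_tokens = calls * OUT_TOK
    usd_per_million = (
        fresh * rates["input"]
        + cached * rates["input"] * CACHE_READ_FACTOR
        + out_tokens * rates["output"]
    )
    return usd_per_million / 1_000_000


baseline = RUNS * run_cost(SONNET, CALLS)
with_cache = RUNS * run_cost(SONNET, CALLS, cache_hit_rate=0.8)
# 40% of calls stay on Sonnet, 60% go to Haiku, caching stays on.
routed = RUNS * (
    run_cost(SONNET, CALLS * 0.4, cache_hit_rate=0.8)
    + run_cost(HAIKU, CALLS * 0.6, cache_hit_rate=0.8)
)

print(round(baseline, 2), round(with_cache, 2), round(routed, 2))  # 192.0 122.88 73.73
```

Under these assumptions the direct line falls from about USD 192 to about USD 123 with caching and to about USD 74 with caching plus routing, which matches the bands given in the text.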

FinOps and budget guardrails

The effective measures are not secret knowledge, but engineering practice. Stacked together, a well-instrumented FinOps programme delivers a 60 to 80 percent cost reduction compared with the unoptimised baseline:

  • Aggressive prompt caching as the single biggest lever: a 60 to 90 percent cache hit rate cuts input costs by 70 to 80 percent. The 5-minute cache pays off after the first read access, the 1-hour cache after the second.
  • Eval-driven routing via LiteLLM, OpenRouter or Portkey - the cheap model as the default, advisor/escalation patterns for the hard cases.
  • Batch API for non-real-time loads with a flat 50 percent discount, combinable with caching - a cached batch request can fall to 5 percent of the standard input price (50 percent batch discount on a cache read billed at 10 percent).
  • Token budget per workflow with hard caps and cost attribution per tenant, team or workflow - without this attribution, FinOps cannot answer the CFO's only question: which business unit is causing this bill?
  • Open-weight fallback for the long-tail loads (summarisation, classification, simple extraction) - in the DACH region via EU-hosted providers such as Together AI EU region or DeepInfra Frankfurt, since the China-hosted DeepSeek direct API is ruled out for GDPR-bound loads.
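
A per-workflow token budget with hard caps and attribution can be sketched as a small guard object. This is an illustration under assumptions, not the API of any framework: `RunBudget`, `BudgetExceeded` and the tenant label are hypothetical, and the cap is checked after a call has been booked, so the call that crosses the limit is still paid for.

```python
from collections import Counter


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its hard caps."""


class RunBudget:
    """Hard per-run caps plus token attribution to a tenant (hypothetical helper)."""

    def __init__(self, tenant, ledger, max_iterations, max_tool_calls, max_tokens):
        self.tenant = tenant
        self.ledger = ledger  # shared Counter: tenant -> tokens consumed
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.iterations = 0
        self.tool_calls = 0
        self.tokens = 0

    def record_call(self, tokens, tool_calls=0):
        """Book one completed LLM call, then stop the run if a cap is exceeded."""
        self.iterations += 1
        self.tool_calls += tool_calls
        self.tokens += tokens
        self.ledger[self.tenant] += tokens  # aborted runs are attributed too
        for name, used, cap in (
            ("max_iterations", self.iterations, self.max_iterations),
            ("max_tool_calls", self.tool_calls, self.max_tool_calls),
            ("max_tokens", self.tokens, self.max_tokens),
        ):
            if used > cap:
                raise BudgetExceeded(f"{name} exceeded: {used} > {cap}")


ledger = Counter()
budget = RunBudget("team-research", ledger,
                   max_iterations=3, max_tool_calls=5, max_tokens=20_000)
stopped_by = None
try:
    for _ in range(10):  # a loop that would otherwise keep going
        budget.record_call(tokens=4_800, tool_calls=1)
except BudgetExceeded as exc:
    stopped_by = str(exc)

print(stopped_by)               # max_iterations exceeded: 4 > 3
print(ledger["team-research"])  # 19200
```

Because the ledger is keyed by tenant, the same structure answers the attribution question: summing the counter per business unit shows who caused which share of the bill.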

For agencies and B2B decision-makers

Anyone running AI agents in production in the DACH region or building them for clients should make token economics a core competency - because the biggest cost levers lie not in the contract, but in the technical implementation. Procurement teams that wrestle over a five percent volume discount leave fifty percent on the table elsewhere.

For agencies this means: cost attribution per client via Helicone or Portkey, transparent cost pass-through with a clear margin on the operational complexity (10 contracting partners means 10 DPA chains), and sovereign hosting as a premium tier for clients with GDPR obligations. For B2B the rule is: eval-driven model choice, caching from day one, per-workflow token caps and an exit path to open-weight providers for any load that exceeds a sensible monthly threshold. Blck Alpaca from Vienna supports DACH companies with precisely this calculation - from workflow architecture through FinOps guardrails to sovereign deployment.

Note: all price figures in this article are as of 2026 and volatile; the price bands shift quarterly and should be checked against the current provider documentation before any budget decision.

FAQ

What is the difference between input, output and reasoning tokens?
Input tokens are everything that goes to the model (system prompt, tool definitions, context, user request). Output tokens are the generated response and usually cost three to five times as much as input. Reasoning tokens arise from the internal thinking of reasoning models and, as of 2026, are billed at the output rate by both OpenAI and Anthropic - a call with 20,000 reasoning tokens costs around USD 0.50 for the thinking alone at USD 25/million output. Cached input tokens are reused context and cost only 10 percent of the base input price at Anthropic.

Why does an AI agent cost so much more than a chatbot?
A chatbot is one call: question in, answer out. An agent typically runs through 5 to 20 LLM calls per user request - planning, tool selection, evaluation of the tool result, next-step decision, verification. With each call the context grows, and it is paid for again as input together with the tool definitions. Sub-agent patterns multiply this by a factor of 3 to 10, and failed runs with retries by a factor of 1.3 to 3. The result is consumption 5 to 50 times higher than with the classic prompt-in/answer-out pattern.

How do I calculate an agent's LLM costs realistically?
Do not just look at the token rate. Calculate per run: average number of calls times average input and output tokens per call, weighted by the cache hit ratio. Then add the multi-step and sub-agent factor as well as the retry surcharge. Next, add the hidden items: embeddings and vector store (5 to 15 percent), observability (2 to 8 percent), compute/sandbox (10 to 25 percent), compliance ops. In the end, the API tokens are usually less than half of total cost.

Which measures reduce AI agent costs the most?
Three levers dominate as of 2026. First, aggressive prompt caching: caching stable system prompts and tool definitions cuts input costs by 70 to 80 percent. Second, eval-driven model routing: the cheapest model that passes the test is the right one - this saves 30 to 60 percent. Third, batch processing for non-real-time loads with a flat 50 percent discount, combinable with caching. Stacked together, a well-instrumented FinOps programme achieves a 60 to 80 percent cost reduction compared with the unoptimised baseline.

Which hidden costs are most often overlooked in AI agent budgets?
The most expensive blind spots are: tool definitions, which are paid for as input on every call (5 tools at 150 tokens each are 750 tokens per request); retry loops with weak verification; the vector store for RAG; observability tooling; and, in the DACH region, the EU-region surcharge of around 10 percent, the sovereignty premium of a factor of 1.5 to 3 and the ongoing compliance costs per contracting partner. These DACH factors increase total cost by 15 to 35 percent compared with a comparable US load.
