Pillar 4

LLM Fundamentals for Agents

How LLMs work as the reasoning engine of agents: tokens, context windows, function calling and model selection.

For: Developers, product managers, AI practitioners

Definition

LLM fundamentals for agents refer to the technical baseline knowledge required to deploy Large Language Models (LLMs) as the reasoning engine in AI Agents: the interplay of tokens, context window, function calling, and model selection. In the agent context, the LLM is not merely a text generator but the central decision-making and planning authority that interacts with data sources and systems via structured tool calls. Anyone bringing agents into production must master these fundamentals just as much as the strategic choice between open-source and proprietary models.

Key Takeaways

✓ In the agent, the LLM is the reasoning engine: it plans, selects tools, and steers the loop. In practice, it is not model size but the quality of the context (Context Engineering) that determines 60-80% of production success (Anthropic Applied AI, 2025).
✓ Tokens are the unit of billing and processing. German produces 30-50% more tokens than English in common BPE tokenizers (compound nouns, inflection) - this shrinks effective context windows and increases costs accordingly.
✓ The context window is a finite resource with diminishing marginal returns. Despite a nominal 1M-2M tokens (Claude Opus 4.7, Gemini 3.1 Pro), the effective working capacity is around 30-50% for reasoning and 60-80% for retrieval tasks - 'Context Rot' (Chroma, July 2025).
✓ Function calling / tool-use is the bridge between the LLM and systems. The most common source of error is not the models but unclear or overlapping tool definitions; 3-7 actively loaded tools plus dynamic discovery are best practice in 2026.
✓ The Model Context Protocol (MCP, Anthropic Nov. 2024) has become the de facto standard for tool integration - according to industry reports, growing from ~100,000 to ~97M SDK downloads/month (March 2026), adopted by OpenAI, Google, and Microsoft.
✓ The capability gap between the best open-weight (DeepSeek V4 Pro, Kimi K2.6, Mistral Large 3) and frontier-closed (Claude Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Pro) has narrowed from 12-18 months (2024) to 3-6 months (2026); for coding/reasoning, sometimes zero.
✓ Cost ranges are substantial: closed-frontier with Claude Opus 4.7 ($5/$25) and GPT-5.5 Pro ($30/$180) per million tokens sits well above open-weight options like Mistral Large 3 ($0.50/$1.50) or DeepSeek V4 Flash ($0.14/$0.28).
✓ In 2026, model selection is no longer a binary open-vs-closed question but a portfolio allocation per workload along weights control, hosting sovereignty, customization path, and cost profile - for the DACH region additionally shaped by GDPR, the EU AI Act, and sovereignty sentiment (Bitkom 2025/2026).

What does "LLM as a reasoning engine" mean?

An AI Agent is more than a chatbot: it pursues a goal over multiple steps, calls tools, observes results, and decides what to do next. At the center of this loop sits a Large Language Model (LLM) as a reasoning engine - the authority that plans, weighs options, selects tools, and interprets results.

A metaphor that has shaped practice comes from Andrej Karpathy (June 2025): the LLM is the CPU, the context window is the working memory (RAM), and the engineer takes on the role of the operating system that fills the memory with the right information per step. From this follows the central insight for 2026: agents almost never fail in production because the model is "too small." They fail because the context is constructed incorrectly. Anthropic's Applied AI team positions Context Engineering as a discipline in its own right, one that today decides 60-80% of whether an agent runs reliably.

Anyone building agents therefore needs a solid foundation in four building blocks: tokens, context window, function calling, and model selection (open-source vs. proprietary). This overview page bundles the fundamentals; the deeper topics (Context Engineering, RAG, FinOps, compliance) are each their own building block.

Tokens: the unit of processing and cost

LLMs do not process text as words but as tokens - subword units that a tokenizer produces from the text. Tokens are also the billing unit: providers charge input and output tokens separately, usually per million tokens.

For the DACH region, tokenization is not a fringe topic but a measurable cost factor. German produces 30-50% more tokens per equivalent content than English in the common BPE tokenizers. The reasons:

Compound nouns: "Lebensversicherungsgesellschaftsangestellter" breaks down into a long subword chain, whereas "life insurance company employee" is split into familiar word tokens.
Inflection: German case and verb endings produce morphological variants that are captured as separate subwords.
Tokenizer bias: The training data of common tokenizers is English-heavy; rare German subwords are split more finely.

Three practical consequences follow from this. First, effective context windows hold less content: a 200K window on Sonnet 4.6 filled with German text holds roughly what 130K-150K tokens would hold in English. Second, the costs per call are correspondingly 30-50% higher. Third, prompt caching pays off even more for DACH workloads than for English ones, because the discount applies to a larger token count.
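The arithmetic behind the first two consequences can be sketched in a few lines. The function names are illustrative, the 30-50% overhead is the range quoted above, and the $5/$25 price is only a sample value taken from the pricing table further down; real overhead should be measured with the tokenizer of the model in use.

```python
def english_equivalent_capacity(window_tokens: int, overhead: float) -> int:
    """Content a window holds, measured in English-equivalent tokens.

    overhead is the extra token share of German vs. English (0.30 = 30% more tokens).
    """
    return round(window_tokens / (1.0 + overhead))


def with_overhead(tokens: int, overhead: float) -> int:
    """Token count of the same content once the language overhead is applied."""
    return round(tokens * (1.0 + overhead))


def cost_per_call(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Cost in USD for one call; prices are quoted per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed sample values: a 200K window and the 30-50% overhead range from the text.
worst_case = english_equivalent_capacity(200_000, 0.50)
best_case = english_equivalent_capacity(200_000, 0.30)

# Same request in English and in German at an assumed 40% overhead, priced at $5/$25.
english_cost = cost_per_call(10_000, 1_000, 5.00, 25.00)
german_cost = cost_per_call(with_overhead(10_000, 0.40),
                            with_overhead(1_000, 0.40), 5.00, 25.00)
print(worst_case, best_case, english_cost, german_cost)
```

Because both input and output grow by the same factor in this simplified model, the cost per call grows by exactly the overhead share.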

An important note on model migration: Claude Opus 4.7 shipped with a new tokenizer that generates up to 35% more tokens for many inputs than Opus 4.6 - with identical per-token prices, the effective cost-per-request can therefore rise solely due to the tokenizer change. Before every migration the rule is: benchmark against your own workload profile.
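One way to run such a benchmark is to count tokens for a sample of real prompts with both tokenizers and compare the totals. The sketch below is a minimal harness under stated assumptions: the two counter functions are hypothetical stand-ins (word count and four-character chunks), not the tokenizers of any real model, and would be replaced by the token counters of the old and new model.

```python
from typing import Callable, Iterable


def tokenizer_inflation(samples: Iterable[str],
                        count_old: Callable[[str], int],
                        count_new: Callable[[str], int]) -> float:
    """Ratio of new-tokenizer tokens to old-tokenizer tokens over a workload sample."""
    old_total = 0
    new_total = 0
    for text in samples:
        old_total += count_old(text)
        new_total += count_new(text)
    if old_total == 0:
        raise ValueError("sample produced no tokens with the old tokenizer")
    return new_total / old_total


# Hypothetical stand-in counters; swap in the real counters of both models.
def count_words(text: str) -> int:
    return len(text.split())


def count_four_char_chunks(text: str) -> int:
    return -(-len(text) // 4)  # ceiling division


workload = ["Bitte prüfe den Versicherungsvertrag.", "Summarize the attached invoice."]
ratio = tokenizer_inflation(workload, count_words, count_four_char_chunks)
print(f"token inflation on this sample: {ratio:.2f}x")
```

With identical per-token prices, the ratio is also the factor by which the cost of this workload changes after migration.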

The context window: nominal versus effective

The context window is the maximum number of tokens a model can "see" per request - that is, the system prompt, tool definitions, retrieval content, conversation history, and the space for the response combined. In 2026, nominally large windows are standard: Claude Opus 4.7 and Sonnet 4.6 support 1M tokens, Gemini 3.1 Pro up to 2M.

The decisive point, however, is: nominal does not equal effective capacity. Chroma's "Context Rot" study (July 2025, 18 frontier models) empirically demonstrated that all models degrade with increasing input length. Three mechanisms reinforce each other:

Lost-in-the-middle: Models pay more attention to the beginning and end of the context, and less to the middle.
Attention dilution: The quadratic complexity of attention means that at 100K tokens there are already around 10 billion pairwise relationships.
Distractor interference: Semantically similar but irrelevant content actively lures the model toward wrong answers.

The finding is task-dependent: for simple factoid retrieval, newer models have caught up considerably; for multi-hop reasoning, the effect remains structural. As a heuristic for production: effective working capacity is 30-50% of nominal for reasoning-heavy and 60-80% for retrieval-heavy tasks. Anyone who fills a 1M window completely pays for tokens that lower answer quality.
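The heuristic and the pairwise-relationship count from the list above reduce to simple arithmetic. The sketch below encodes them as two helper functions; the names are illustrative, and the percentages are the rule-of-thumb ranges from the text, not measured values for any specific model.

```python
# Heuristic shares of the nominal window, in percent (low, high), as quoted in the text.
EFFECTIVE_SHARE_PERCENT = {"reasoning": (30, 50), "retrieval": (60, 80)}


def effective_budget(nominal_tokens: int, task: str) -> tuple[int, int]:
    """Return the (low, high) effective working capacity in tokens for a task type."""
    low, high = EFFECTIVE_SHARE_PERCENT[task]
    return nominal_tokens * low // 100, nominal_tokens * high // 100


def attention_pairs(tokens: int) -> int:
    """Number of pairwise token relationships under quadratic attention."""
    return tokens * tokens


print(effective_budget(1_000_000, "reasoning"))
print(effective_budget(1_000_000, "retrieval"))
print(attention_pairs(100_000))
```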

Model	Nominal Context	Effective Working Capacity (Heuristic)
Claude Opus 4.7	1M	300-500K (Reasoning), 600-800K (Retrieval)
Claude Sonnet 4.6	1M (Standard 200K)	200-400K (Reasoning), 400-600K (Retrieval)
Gemini 3.1 Pro	2M	300-500K (Reasoning), 600K-1M (Retrieval)
DeepSeek V4 Pro	1M	Open-Weight; Long-Context performance below Closed-Weight

The practical answer to Context Rot is not to pack in more but to curate: cache stable building blocks, load dynamic content selectively, prune old content, and compress at 70-85% utilization (compaction). This is the subject of the Context Engineering building block.
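A compaction trigger of this kind can be sketched as follows. This is a deliberately simplified illustration: the `Block` type, the 75% trigger, and the 50% target are assumptions, and the sketch drops the oldest dynamic blocks outright, whereas a production system would typically summarize them instead.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    tokens: int
    stable: bool  # stable blocks (system prompt, tool definitions) are never pruned


def utilization(blocks: list[Block], budget: int) -> float:
    """Share of the token budget currently occupied."""
    return sum(block.tokens for block in blocks) / budget


def compact(blocks: list[Block], budget: int,
            trigger: float = 0.75, target: float = 0.50) -> list[Block]:
    """Once utilization exceeds trigger, drop the oldest non-stable blocks
    until utilization is at or below target. Blocks are ordered oldest first."""
    if utilization(blocks, budget) <= trigger:
        return list(blocks)
    kept = list(blocks)
    for block in blocks:
        if utilization(kept, budget) <= target:
            break
        if not block.stable:
            kept.remove(block)
    return kept


context = [
    Block("system", 2_000, stable=True),
    Block("turn1", 3_000, stable=False),
    Block("turn2", 3_000, stable=False),
    Block("turn3", 1_000, stable=False),
]
print([block.name for block in compact(context, budget=10_000)])
```

Keeping the stable blocks untouched and at the front also keeps them eligible for prompt caching, since the cached prefix does not change between steps.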

Function calling and tool-use: the LLM as actor

For an LLM to become the engine of an agent, it must be able to act - that is, generate structured calls to external functions. Function calling (also tool-use) is the mechanism for this: the model receives tool definitions in JSON schema format and, when needed, returns a structured call with parameters that the application executes.
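The round trip can be illustrated without any provider SDK. In the sketch below, the tool name `get_order_status`, its schema, and the shape of the model output are hypothetical; field names differ between providers, and the model call itself is replaced by a hard-coded JSON string.

```python
import json

# Hypothetical tool definition in JSON Schema style; exact field names vary by provider.
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the shipping status of one order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def get_order_status(order_id: str) -> dict:
    # Stand-in for a real backend lookup.
    return {"order_id": order_id, "status": "shipped"}


# The application, not the model, owns the mapping from tool name to code.
REGISTRY = {"get_order_status": get_order_status}


def execute_tool_call(raw_call: str) -> str:
    """Parse a model-emitted call, run the matching function, return the result as JSON."""
    call = json.loads(raw_call)
    func = REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(func(**call["arguments"]))


# Assumed model output: a structured call instead of prose.
model_output = '{"name": "get_order_status", "arguments": {"order_id": "A-1001"}}'
print(execute_tool_call(model_output))
```

The result string would be appended to the conversation so that the model can interpret it in the next step of the loop.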

The most important practical finding in 2026: when an agent acts incorrectly, the cause usually lies not with the model but with the tool definition. Anthropic phrases the guideline sharply: if a human engineer cannot clearly say which tool to use in a given situation, an agent cannot either. Concrete consequences:

Tool count: 3-7 actively loaded tools are optimal; from around 10 tools, measurable degradation begins. In Anthropic's internal MCP evals, tool-selection accuracy rose with dynamic tool search from 49% to 74% (Opus 4) and from 79.5% to 88.1% (Opus 4.5).
Tool overlap is fatal: Two tools that could plausibly answer the same request are the one problem that no prompt, however good, can solve. A clear "when not to use" clause in the description is the most effective, often forgotten component.
Structured outputs: For downstream systems, outputs must be reliably machine-readable. OpenAI Structured Outputs (GA since August 2024) achieves 100% schema conformance via constrained decoding; Anthropic achieves the equivalent via tool-use with JSON schema; open-weight models via grammar-constrained decoding (Outlines, jsonformer, vLLM).
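Dynamic discovery with a small active tool set can be sketched with a plain keyword match. Everything here is an assumption for illustration: the catalog, the `not_for` field that carries the "when not to use" clause, and the keyword-overlap scoring, which stands in for the embedding or BM25 search a real system would more likely use.

```python
import re

STOPWORDS = {"a", "an", "the", "for", "by", "or", "of", "to",
             "has", "been", "one", "whether", "is", "my"}

# Hypothetical catalog; "not_for" holds the "when not to use" clause of each tool.
CATALOG = [
    {"name": "search_invoices",
     "description": "Find invoices by customer or date.",
     "not_for": "Checking whether an invoice was paid."},
    {"name": "get_payment_status",
     "description": "Check whether one invoice has been paid.",
     "not_for": "Searching or listing invoices."},
    {"name": "create_ticket",
     "description": "Open a support ticket for a customer issue.",
     "not_for": "Billing questions."},
]


def keywords(text: str) -> set[str]:
    """Lowercased words of a text without stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def select_tools(request: str, catalog: list[dict], limit: int = 5) -> list[str]:
    """Return the names of at most `limit` tools whose name or description
    shares keywords with the request, best match first."""
    wanted = keywords(request)
    scored = []
    for tool in catalog:
        searchable = tool["name"].replace("_", " ") + " " + tool["description"]
        overlap = len(wanted & keywords(searchable))
        if overlap:
            scored.append((overlap, tool["name"]))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:limit]]


print(select_tools("Has invoice 4711 been paid?", CATALOG))
```

Only the selected definitions, including their `not_for` text, would be loaded into the context for the next model call; the rest of the catalog stays out of the window.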

The standard for tool integration in 2026 is the Model Context Protocol (MCP), published by Anthropic in November 2024 as a JSON-RPC standard. According to industry reports, SDK downloads grew from ~100,000/month (Nov. 2024) to ~97M/month (March 2026), with adoption by OpenAI, Google, and Microsoft. (These download figures come from vendor/industry reports and are not independently validated; the adoption pattern itself is considered undisputed.) For assessing tool-calling quality, the Berkeley Function-Calling Leaderboard (BFCL) remains the most relevant benchmark - with the caveat that no model leads across all categories and closed-weight is still ahead on complex multi-turn tool-use.

Open-source vs. proprietary: the model choice

The choice of model base is the most strategically consequential decision. In 2026 it is no longer a binary question but a portfolio allocation. First a clarification of terms, since "open" is multidimensional:

Closed/proprietary: API-only, weights not available (Claude, GPT-5.5, Gemini).
Open-weight: Weights downloadable, possibly under restrictive licenses. Llama is open-weight but is explicitly not classified as open-source by OSI/FSF - among other reasons because of a 700M-MAU threshold and an EU multimodal restriction in Llama 4.
Open-source AI per the OSI definition (OSAID 1.0, Oct. 2024): additionally requires open training/inference code and sufficient training-data transparency. Most models marketed as "open source" do not meet this threshold.

The central trend: the capability gap between the best open-weight (DeepSeek V4 Pro, Kimi K2.6, Mistral Large 3, Qwen 3.6, Llama 4 Maverick) and frontier-closed (Claude Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Pro) has narrowed from 12-18 months (2024) to 3-6 months (2026) - on individual workloads (coding, long-context, math) it is zero or negative. For instance, Kimi K2.6 (1T parameters, open-weight) ranks 4th on the Artificial Analysis Intelligence Index (score 54), directly behind Anthropic, Google, and OpenAI (57 each).

At the same time, closed-frontier models remain genuinely superior on premium workloads (top-tier agentic coding, frontier math, premium multimodal). Empirically, in DACH enterprises this premium category typically accounts for 15-35% of token volume but 60-80% of perceived strategic value. This makes a hybrid setup economically compelling.

Model	Tier	License / Access	Price $ / 1M Tok. (in/out)	Profile
Claude Opus 4.7	Frontier-Closed	Proprietary, API	$5 / $25	Frontier coding, tool orchestration
GPT-5.5 Pro	Frontier-Closed	Proprietary, API	$30 / $180	Top reasoning, frontier math; expensive
Gemini 3.1 Pro	Frontier-Closed	Proprietary, API	$2 / $12	1M-2M context, omnimodal
Mistral Large 3	Frontier-near	Apache 2.0 (EU)	$0.50 / $1.50	EU sovereign anchor, strongly multilingual
DeepSeek V4 Pro	Frontier-near	MIT-derived (CN)	$1.74 / $3.48	Cost disruption, coding/math parity
Kimi K2.6	Frontier-near	Modified MIT (CN)	$0.60 / $2.50	#1 among open-weight, strong agentic performance
DeepSeek V4 Flash	Workhorse	MIT-derived (CN)	$0.14 / $0.28	Deepest cost disruption

Prices per vendor public listing as of April-May 2026; verify before any multi-year commitment.
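The economics of a hybrid setup can be estimated directly from the table. The sketch below uses the listed prices for three of the models; the monthly volume of 1,000M input and 200M output tokens and the 25% premium share are assumed sample values, and the function names are illustrative.

```python
# USD per million tokens (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Mistral Large 3": (0.50, 1.50),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    """Cost in USD for a monthly volume given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_millions * price_in + output_millions * price_out


def hybrid_cost(premium_model: str, standard_model: str,
                input_millions: float, output_millions: float,
                premium_share: float) -> float:
    """Split the volume: premium_share goes to the premium model, the rest to the standard one."""
    standard_share = 1.0 - premium_share
    return (
        monthly_cost(premium_model, input_millions * premium_share,
                     output_millions * premium_share)
        + monthly_cost(standard_model, input_millions * standard_share,
                       output_millions * standard_share)
    )


# Assumed sample volume: 1,000M input and 200M output tokens per month.
all_frontier = monthly_cost("Claude Opus 4.7", 1_000, 200)
hybrid = hybrid_cost("Claude Opus 4.7", "Mistral Large 3", 1_000, 200, premium_share=0.25)
print(all_frontier, hybrid)
```

The calculation ignores self-hosting amortization, caching discounts, and quality differences, so it is a lower bound on the analysis rather than a decision on its own.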

The sober decision logic runs across four dimensions: weights control (who can deliver the model, when, and under what conditions?), hosting sovereignty (which jurisdiction governs the stack?), customization path (off-the-shelf, prompt, RAG, LoRA, full fine-tuning), and cost profile (per-token vs. self-hosting amortization). For most standard workloads - classification, extraction, RAG-backed knowledge assistants, German-language workflows - the closed-frontier premium is no longer economically compelling.
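As a rough illustration of portfolio allocation per workload, the sketch below routes on just two of the four dimensions, weights control and price, plus a premium flag. The candidate list copies tier, license type, and prices from the table; the `pick_model` rule (cheapest model that satisfies the constraints) is a hypothetical simplification that ignores hosting sovereignty, customization path, and measured quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    tier: str           # as labeled in the table above
    open_weights: bool
    price_in: float     # USD per million input tokens
    price_out: float    # USD per million output tokens


CANDIDATES = [
    Candidate("Claude Opus 4.7", "Frontier-Closed", False, 5.00, 25.00),
    Candidate("Mistral Large 3", "Frontier-near", True, 0.50, 1.50),
    Candidate("Kimi K2.6", "Frontier-near", True, 0.60, 2.50),
    Candidate("DeepSeek V4 Flash", "Workhorse", True, 0.14, 0.28),
]


def pick_model(premium_workload: bool, needs_weights_control: bool,
               candidates: list[Candidate] = CANDIDATES) -> Candidate:
    """Pick the cheapest candidate that satisfies the workload constraints."""
    pool = [c for c in candidates if c.open_weights or not needs_weights_control]
    if premium_workload:
        # Prefer frontier-closed; fall back to frontier-near if constraints exclude it.
        frontier = [c for c in pool if c.tier == "Frontier-Closed"]
        pool = frontier or [c for c in pool if c.tier == "Frontier-near"]
    if not pool:
        raise ValueError("no candidate satisfies the constraints")
    return min(pool, key=lambda c: c.price_in + c.price_out)


print(pick_model(premium_workload=True, needs_weights_control=False).name)
print(pick_model(premium_workload=False, needs_weights_control=False).name)
```

In practice the final `min` over price would be replaced by a score from the team's own eval set, with price as one input among several.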

DACH relevance: sovereignty and compliance

In the DACH region, three factors additionally shape the model choice. First, sovereignty sentiment: according to Bitkom (Digitale Souveränität 2025 study, 603 companies), 89% of companies that import digital technologies or services see themselves as dependent on them; in the population survey (calendar weeks 8-11 of 2026), 72% consider Germany too dependent on the USA for AI, and 67% would like to use a German AI. A commercially credible sovereign EU tier exists with Mistral and, following the Cohere/Aleph Alpha merger (April 2026, $20 billion valuation, STACKIT as cloud backbone), now has a second provider.

Second, an EU region on a US hyperscaler (e.g., Claude on AWS Bedrock Frankfurt) reduces latency and GDPR friction but does not eliminate US jurisdiction (CLOUD Act). True sovereignty requires a sovereign EU stack (STACKIT, OVHcloud, T-Systems, IONOS) or on-prem.

Third, some regulatory notes (informational, not legal advice): closed-API models are GPAI models of their providers; the DACH deployer bears deployer obligations. Anyone who substantially fine-tunes an open-weight model can themselves become a GPAI provider - the EU AI Office guideline (July 2025) names an indicative, non-binding threshold at >1/3 of the base pretraining compute (default 3.33 × 10²² FLOPs, if unknown). LoRA/QLoRA typically lie well below this threshold, continued pretraining above it. The GPAI provider obligations have applied since 2 August 2025; full enforcement powers follow from 2 August 2026 (dates from the EU AI Act timeline, to be treated as provisional). Added to this are GDPR obligations on the training pipeline when fine-tuning on personal data. These points are explored in depth in the respective compliance building blocks.

Outlook and practical note

The LLM fundamentals are not a one-time learning task but a moving target: tokenizers change, context windows grow, models appear on a monthly cadence, and benchmark scores fluctuate by 5-10 percentage points depending on the test harness. Three open questions for 2026/2027 are especially worth watching: whether the open-vs-closed capability gap continues to shrink, whether long context windows deliver on the "pack everything in" promise (currently: no, Context Engineering remains necessary), and whether sovereign EU infrastructure achieves scale parity with US hyperscalers.

For practice, therefore, a simple discipline applies: do not choose from the marketing blog post, but measure against your own eval set. Anyone who validates model selection, context budget, and tool design against real traces rather than intuition builds agents that hold up in production - and retains the strategic flexibility to switch providers when prices, licenses, or capabilities shift, without starting from scratch.

All Articles in this Topic

5 Articles

2.10

Tokenisation and Context Window: What Drives Agent Latency and Cost

Tokenisation breaks text into tokens, the smallest processing units of an LLM; the context window is the maximum number of tokens a model can process together per request. With AI agents, both directly determine cost and latency, because every step carries the entire prior context along again.

Beginner·8 min

2.11

Temperature, Top-p and Sampling: Settings for Deterministic Agents

Temperature, Top-p and Top-k are sampling parameters that control how randomly an LLM selects the next token. Low values (temperature 0 to 0.2) make outputs reproducible and are mandatory for tool calls and structured outputs; higher values increase variance and are suited to creative content.

Intermediate·7 min

2.12

Function Calling vs. Tool Use: Terminology and Implementations

Function Calling and Tool Use describe the same core capability: an LLM outputs not prose, but a structured, schema-compliant call to an externally defined function. OpenAI coined "Function Calling", Anthropic uses "Tool Use" - technically both are JSON-Schema-based and nearly identical, with differences in field names and API mechanics.

Intermediate·8 min

2.13

Structured Outputs with JSON Schema: Enforcing Reliable Agent Responses

Structured outputs with JSON Schema are a technique that forces an LLM to produce its response exactly according to a predefined JSON schema. Instead of free text, the model returns a machine-readable, validatable object. This makes agent pipelines reliable, because downstream program steps can depend on a guaranteed data structure.

Intermediate·7 min

2.14

LLM router: when to use a large frontier model, when small, when open source?

An LLM router is a routing logic that automatically assigns each agent step to the appropriate model: large frontier models for complex reasoning, small low-cost models for simple steps, open-source or EU-hosted models for sovereignty and cost control. The choice follows four criteria: quality, cost, latency and compliance.

Intermediate·8 min