Code Execution with MCP: Token Efficiency for Complex Agents
Code execution with MCP refers to an agent pattern in which an AI agent writes and runs code that calls MCP tools programmatically in a sandbox, instead of issuing many individual tool calls. This significantly reduces token consumption and latency, because intermediate results are processed in the code context rather than in the language model's context window.
Key Takeaways
- Code execution inverts the logic: the agent generates code that calls MCP tools, rather than routing each tool individually through the LLM context - tool definitions and intermediate results no longer land in full in the context window.
- The lever is token efficiency: with many available tools and large intermediate results, token consumption drops considerably, because filtering, loops and aggregation happen in code instead of in the model.
- The pattern pays off for complex, multi-step workflows with many tools and large volumes of data - not for simple single-step tasks, where direct tool calls remain faster and more transparent.
- Security is not optional: executable, model-generated code requires an isolated sandbox, scope-limited OAuth 2.1 tokens and least privilege - MCP follows a deliberately optimistic trust model that leaves hardening to the operator.
- As of 2026, MCP is the de facto standard for agent-to-tool communication (introduced by Anthropic in 2024, under the Linux Foundation's Agentic AI Foundation since 9 December 2025); its technical basis is JSON-RPC 2.0 over stdio or Streamable HTTP.
- For DACH agencies, code execution is the lever for running agentic products economically: fewer tokens per run means better unit economics for multi-tenant services.
Because tool definitions and intermediate results stay in the code context instead of passing in full through the language model's context window, the pattern is an optimisation for complex, data-intensive workflows, not a replacement for classic tool calls.
Quick answers
- What it is: the agent generates a program that calls MCP tools as functions; only the condensed final result returns to the model.
- What it is good for: token and latency efficiency with many tools, many steps and large intermediate results.
- What it requires: an isolated sandbox, scope-limited tokens and least privilege - executable model code is an attack surface in its own right.
The problem: tool calls scale poorly
The Model Context Protocol (MCP) was introduced by Anthropic as an open standard on 25 November 2024 to connect AI applications with external systems: file systems, databases, business applications, developer tools. Technically, MCP is based on JSON-RPC 2.0 over two standard transports: stdio for local servers and, since the 2025-03-26 spec revision, Streamable HTTP for remote ones. The same revision added OAuth 2.1 authorization, JSON-RPC batching and tool annotations. As of 2026, MCP is the de facto standard for agent-to-tool communication; on 9 December 2025 Anthropic donated the protocol to the newly founded Agentic AI Foundation (AAIF) under the umbrella of the Linux Foundation. The SDKs record more than 97 million monthly downloads across Python and TypeScript.
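To make the wire format concrete, the following minimal sketch builds a JSON-RPC 2.0 request of the kind MCP uses to invoke a tool (`tools/call` with a `name` and `arguments`). The tool name `get_contacts` and its `limit` argument are hypothetical and chosen only for illustration; the envelope fields are the part that follows the protocol.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialise an MCP tool invocation as a JSON-RPC 2.0 request."""
    message = {
        "jsonrpc": "2.0",        # fixed protocol version string
        "id": request_id,        # lets the client match the response
        "method": "tools/call",  # MCP method for invoking a tool
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool and argument, for illustration only.
request = make_tool_call(1, "get_contacts", {"limit": 100})
print(request)
```

In the classic pattern, one such request and its full response pass through the model for every single tool call.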
In the classic pattern, the agent calls each tool individually. Every call passes through the language model: the model sees the tool definition, formulates the call, receives the full result back into its context window, and decides on the next step. For simple tasks, this is ideal - transparent and easy to audit. For complex workflows, however, two token guzzlers arise:
- Tool definitions up front. If an agent has dozens or hundreds of MCP tools at its disposal, their descriptions must be loaded into the context in advance - often thousands of tokens before the agent even starts working.
- Intermediate results in the context. A tool that returns a long list, a large document or an extensive API response writes this in full into the context window - even if the agent needs only three values from it. Across multi-step chains, this adds up.
The solution: the agent writes code
The code execution pattern - published by Anthropic as an engineering approach in late 2025 - inverts the logic. The agent generates a program, typically Python or TypeScript, that treats the MCP tools as importable functions. This code runs in a sandbox. Filtering, loops, branching and aggregation happen there, in the code runtime context. Only the condensed final result returns to the model's context window.
The difference can be shown with pseudocode.
Classic - many tool calls through the model:
```
LLM: call get_contacts()
-> 5,000 contacts (in full into the context)
LLM: for each contact, call get_last_order(id)
-> 5,000 further tool calls, each one individually through the model
LLM: filter those with revenue > 10,000
-> large volumes of data in the context again
```
Code execution - one generated program:
```
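# code generated by the agent, executed in the sandbox
def top_customers(mcp):
    contacts = mcp.crm.get_contacts()
    top = []
    for c in contacts:
        revenue = mcp.crm.get_last_order(c.id).revenue  # one call per contact, no LLM round-trip
        if revenue > 10000:
            top.append({"name": c.name, "revenue": revenue})
    return top  # only this condensed list returns to the model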
```
In the second case, the model sees neither the 5,000 contacts nor the 5,000 order responses. It sees only the code it has written, and at the end the filtered list. The tool calls themselves run programmatically, not as individual LLM rounds. This saves both tokens and latency, because loops are executed without model round-trips.
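The same workflow can be tried out as a runnable sketch. `FakeCRM` is a hypothetical in-memory stand-in for the CRM MCP server (a real sandbox would expose bindings to the actual server instead), and the data is synthetic. The point is only to show what crosses the boundary back to the model.

```python
import json
from dataclasses import dataclass


@dataclass
class Contact:
    id: int
    name: str


@dataclass
class Order:
    contact_id: int
    revenue: int


class FakeCRM:
    """Hypothetical stand-in for an MCP-backed CRM; all data is synthetic."""

    def __init__(self, size: int) -> None:
        self._contacts = [Contact(i, f"Customer {i}") for i in range(size)]

    def get_contacts(self) -> list[Contact]:
        return list(self._contacts)

    def get_last_order(self, contact_id: int) -> Order:
        # Synthetic revenue: every 500th contact lands above the threshold.
        revenue = 12_000 if contact_id % 500 == 0 else 800
        return Order(contact_id, revenue)


def top_customers(crm: FakeCRM, threshold: int = 10_000) -> list[dict]:
    """What the agent-generated program does inside the sandbox."""
    result = []
    for contact in crm.get_contacts():
        revenue = crm.get_last_order(contact.id).revenue  # no LLM round-trip
        if revenue > threshold:
            result.append({"name": contact.name, "revenue": revenue})
    return result


crm = FakeCRM(size=5_000)
summary = top_customers(crm)

# Only this serialised summary would be handed back to the model.
payload = json.dumps(summary)
print(len(summary), "customers,", len(payload), "characters returned")
```

The 5,000 contact records and 5,000 order lookups stay inside the process; the payload handed back contains only the ten matching customers.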
A second effect is on-demand tool discovery: instead of loading all tool definitions in advance, the agent can browse the available MCP servers and their functions like a module directory and import only those actually needed. This keeps the up-front overhead small, even when hundreds of tools are theoretically available.
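On-demand discovery can be pictured as three progressively more detailed lookups. The sketch below assumes a hypothetical in-memory `CATALOGUE`; the server names, tool names and definition texts are invented for illustration, and a real setup might expose the same hierarchy as files or modules instead.

```python
# Hypothetical catalogue: server name -> tool name -> full definition text.
CATALOGUE = {
    "crm": {
        "get_contacts": "get_contacts() -> list of contacts with id and name",
        "get_last_order": "get_last_order(id) -> most recent order with revenue",
        "update_contact": "update_contact(id, fields) -> updated contact",
    },
    "mail": {
        "send_newsletter": "send_newsletter(list_id, body) -> delivery report",
    },
}


def list_servers() -> list[str]:
    """Step 1: the agent sees only server names."""
    return sorted(CATALOGUE)


def list_tools(server: str) -> list[str]:
    """Step 2: tool names of one server, still without full definitions."""
    return sorted(CATALOGUE[server])


def load_definition(server: str, tool: str) -> str:
    """Step 3: the full definition is read only for tools actually needed."""
    return CATALOGUE[server][tool]


needed = [("crm", "get_contacts"), ("crm", "get_last_order")]
loaded = {f"{s}.{t}": load_definition(s, t) for s, t in needed}

total_tools = sum(len(tools) for tools in CATALOGUE.values())
print(f"loaded {len(loaded)} of {total_tools} definitions")
```

With four tools the saving is trivial; the effect matters when the catalogue holds hundreds of definitions and a workflow needs a handful.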
Example with numbers
A simplified calculation illustrates the lever. Assume an agent is to identify the top customers from a CRM and enrich them with order data. The figures are illustrative and serve to convey orders of magnitude, not as a benchmark.
| Item | Classic tool calls | Code execution |
|---|---|---|
| Tool definitions up front | approx. 8,000 tokens (all tools loaded) | approx. 1,500 tokens (only those needed) |
| Intermediate results in the context | approx. 40,000 tokens (5,000 raw records) | approx. 0 tokens (filtered in code) |
| Generated code / final answer | - | approx. 2,500 tokens |
| LLM round-trips | 1 per tool call (many) | few (code generation + result) |
| Token total (rough order of magnitude) | approx. 48,000 | approx. 4,000 |
The decisive item is not the up-front load, but the intermediate results: they are processed in code and occupy no model context. This is precisely where the lever lies. The larger the volumes of data and the more iterations, the greater the difference. For a simple single query, by contrast, the calculation reverses - the overhead of generating and executing a program exceeds the benefit.
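The totals in the table can be reproduced with a few lines of arithmetic. The figures are the article's illustrative values, not measurements; treating the generated-code item as 0 tokens for the classic pattern is an assumption made here so that the columns add up to the stated totals.

```python
# Illustrative token counts from the table (not a benchmark).
classic = {
    "tool_definitions": 8_000,
    "intermediate_results": 40_000,
    "generated_code": 0,  # assumption: no entry in the classic column
}
code_execution = {
    "tool_definitions": 1_500,
    "intermediate_results": 0,
    "generated_code": 2_500,
}

classic_total = sum(classic.values())
code_total = sum(code_execution.values())
saving = 1 - code_total / classic_total

print(classic_total, code_total, f"{saving:.0%}")
```

Under these assumptions the code execution run needs roughly one twelfth of the tokens, and almost all of the difference comes from the intermediate results.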
When it makes sense - and when it does not
| Criterion | Code execution makes sense | Direct tool calls are better |
|---|---|---|
| Number of tools | Many (dozens+) | Few |
| Workflow steps | Multi-step, with loops/branching | One to two steps |
| Intermediate results | Large (lists, documents, API dumps) | Small |
| Determinism | Recurring, programmable logic | Exploratory, ad hoc |
| Traceability | Code trace + sandbox logs needed | Tool-call trace is sufficient |
Code execution is an optimisation, not a standard pattern for everything. It fits data-intensive, recurring multi-tool workflows. For simple, one-off actions, direct tool calls remain faster, cheaper and more transparent - the trace shows every step directly, without the detour via generated code.
Security: the sandbox is mandatory
The biggest difference from classic tool calls concerns security: in the code execution pattern, the system runs executable code generated by a language model, which is an attack surface in its own right. MCP itself follows a deliberately optimistic trust model that treats a syntactically valid tool description as semantically safe. Several concrete attack classes against it are documented as of 2026: indirect prompt injection via server descriptions, tool poisoning (Invariant Labs PoC, March 2025), look-alike server squatting, and CyberArk's full-schema poisoning, in which every part of a tool schema can be an injection vector. Hardening is explicitly the operator's task.
For code execution, this means concretely:
- Isolated sandbox. The generated code runs in a sealed-off runtime environment without broad file and network access, with resource and time limits against infinite loops and resource exhaustion.
- Scope-limited tokens. MCP servers are placed behind OAuth 2.1 (part of the specification since the 2025-03-26 revision), and tokens carry only the minimum necessary scopes, so the code gets access only to what the task requires.
- Least privilege and no autonomous installation. Agents must not load or install MCP servers from untrusted registries on their own.
- Complete audit logging. Every tool call from the sandbox code is logged. In DACH contexts, this is a compliance requirement as much as engineering practice: an end-to-end trace, pinned model versions and correlation via a trace ID.
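The time-limit part of the sandbox requirement can be sketched with the standard library alone. The helper `run_generated_code` below is hypothetical and deliberately minimal: it enforces a hard timeout and a throwaway working directory, nothing more. It is not a security boundary on its own; container or VM isolation, network policy and memory limits would have to be added around it.

```python
import subprocess
import sys
import tempfile


def run_generated_code(source: str, timeout_s: float = 5.0) -> dict:
    """Run model-generated Python in a child process with a hard time limit.

    Only a timeout and a temporary working directory are enforced here.
    This is NOT sufficient isolation for untrusted code in production.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            completed = subprocess.run(
                [sys.executable, "-I", "-c", source],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            return {"ok": False, "error": "timeout", "stdout": ""}
    return {
        "ok": completed.returncode == 0,
        "error": completed.stderr.strip(),
        "stdout": completed.stdout.strip(),
    }


good = run_generated_code("print(sum(range(10)))")
runaway = run_generated_code("while True: pass", timeout_s=1.0)
print(good["stdout"], runaway["error"])
```

The second call shows the case the time limit exists for: an infinite loop is terminated after one second instead of blocking the agent run.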
Anyone running executable model code without these protective layers trades token efficiency for a considerable security and data-protection risk - a no-go in regulated DACH industries.
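Of the protective layers, audit logging is the easiest to illustrate. The sketch below assumes a hypothetical wrapper class `AuditedTools` through which the sandbox code reaches its tools; the tool, its return value and the model identifier are invented. Each call is recorded with a shared trace ID and the pinned model version, whether it succeeds or fails.

```python
import json
import uuid
from datetime import datetime, timezone


class AuditedTools:
    """Wraps tool functions so every call is recorded under one trace ID."""

    def __init__(self, tools: dict, model_version: str) -> None:
        self._tools = tools
        self.trace_id = str(uuid.uuid4())   # one ID per agent run
        self.model_version = model_version  # pinned version, stored per entry
        self.log: list[dict] = []

    def call(self, name: str, **arguments):
        entry = {
            "trace_id": self.trace_id,
            "model_version": self.model_version,
            "tool": name,
            "arguments": arguments,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        try:
            result = self._tools[name](**arguments)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = f"error: {type(exc).__name__}"
            raise
        finally:
            self.log.append(entry)  # recorded whether the call succeeded or not


# Hypothetical tool and model identifier, for illustration only.
tools = AuditedTools(
    {"get_last_order": lambda id: {"id": id, "revenue": 12_000}},
    model_version="example-model-2026-01",
)
order = tools.call("get_last_order", id=7)
print(json.dumps(tools.log[0], indent=2))
```

In a real deployment the entries would go to an append-only store outside the sandbox, so that generated code cannot alter its own trail.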
For agencies and B2B decision-makers
For DACH agencies and AI vendors building products, code execution with MCP is the direct lever on the unit economics of agentic products. Those running multi-tenant services - such as SEO, newsletter or research agents - pay per run in tokens; the code execution pattern measurably lowers these costs for data-intensive workflows and at the same time reduces latency, which improves perceived product quality. The practical starting point: review existing MCP-based agents to identify which workflows run through many tools and large intermediate results, switch precisely those over to code execution - and build in sandbox, OAuth 2.1 scopes and audit logging from day one. For B2B decision-makers, the rule is: the pattern is an architectural decision per workflow, not a platform switch. It builds on the already established MCP standard and can be introduced step by step, without discarding the existing tool integration.