5.14 · Advanced · 7 min

Code Execution with MCP: Token Efficiency for Complex Agents

Blck Alpaca
Definition

Code execution with MCP refers to an agent pattern in which an AI agent writes and runs code that calls MCP tools programmatically in a sandbox, instead of issuing many individual tool calls. This significantly reduces token consumption and latency, because intermediate results are processed in the code context rather than in the language model's context window.

Key Takeaways

  • Code execution inverts the logic: the agent generates code that calls MCP tools, rather than routing each tool call individually through the LLM context. Tool definitions and intermediate results no longer land in full in the context window.
  • The lever is token efficiency: with many available tools and large intermediate results, token consumption drops considerably, because filtering, loops and aggregation happen in code instead of in the model.
  • The pattern pays off for complex, multi-step workflows with many tools and large volumes of data - not for simple single-step tasks, where direct tool calls remain faster and more transparent.
  • Security is not optional: executable, model-generated code requires an isolated sandbox, scope-limited OAuth 2.1 tokens and least privilege - MCP follows a deliberately optimistic trust model that leaves hardening to the operator.
  • As of 2026, MCP is the de facto standard for agent-to-tool communication (introduced by Anthropic in 2024, under the Linux Foundation's Agentic AI Foundation since 9 December 2025); its technical basis is JSON-RPC 2.0 over stdio or Streamable HTTP.
  • For DACH agencies, code execution is the lever for running agentic products economically: fewer tokens per run means better unit economics for multi-tenant services.

The saving has two sources: neither the tool definitions nor the intermediate results have to pass in full through the language model's context window, because both are handled in the code context. The pattern is an optimisation for complex, data-intensive workflows, not a replacement for classic tool calls.

Quick answers

  • What it is: the agent generates a program that calls MCP tools as functions; only the condensed final result returns to the model.
  • What it is good for: token and latency efficiency with many tools, many steps and large intermediate results.
  • What it requires: an isolated sandbox, scope-limited tokens and least privilege - executable model code is an attack surface in its own right.

The problem: tool calls scale poorly

The Model Context Protocol (MCP) was introduced by Anthropic as an open standard on 25 November 2024 to connect AI applications with external systems: file systems, databases, business applications, developer tools. Technically, MCP is based on JSON-RPC 2.0 over several transports: stdio for local servers and, since the 2025-03-26 spec revision, Streamable HTTP. That revision also added OAuth 2.1 authorisation, JSON-RPC batching (removed again in the 2025-06-18 revision) and tool annotations. As of 2026, MCP is the de facto standard for agent-to-tool communication; on 9 December 2025 Anthropic donated the protocol to the newly founded Agentic AI Foundation (AAIF) under the umbrella of the Linux Foundation. The SDKs record more than 97 million monthly downloads across Python and TypeScript.
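To make the wire format concrete, the following sketch builds the JSON-RPC 2.0 request for a single tool call. The `tools/call` method with its `name` and `arguments` parameters follows the MCP specification; the tool name `get_contacts` and its `limit` argument are hypothetical, and the stdio framing is shown in simplified form (one JSON message per line).

```python
import json
from itertools import count

_ids = count(1)  # request ids must be unique within a session


def tool_call_request(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def frame_for_stdio(message: dict) -> str:
    """Serialise a message as one newline-terminated line without embedded newlines."""
    return json.dumps(message, separators=(",", ":")) + "\n"


# Hypothetical tool and argument, for illustration only.
request = tool_call_request("get_contacts", {"limit": 100})
line = frame_for_stdio(request)
print(line, end="")
```

In the classic pattern, every such request and its response is mediated by the model; in the code execution pattern, the same messages are sent by the sandboxed program instead.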

In the classic pattern, the agent calls each tool individually. Every call passes through the language model: the model sees the tool definition, formulates the call, receives the full result back into its context window, and decides on the next step. For simple tasks this is ideal: transparent and easy to audit. For complex workflows, however, two sources of token overhead arise:

  1. Tool definitions up front. If an agent has dozens or hundreds of MCP tools at its disposal, their descriptions must be loaded into the context in advance - often thousands of tokens before the agent even starts working.
  2. Intermediate results in the context. A tool that returns a long list, a large document or an extensive API response writes this in full into the context window - even if the agent needs only three values from it. Across multi-step chains, this adds up.
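The second point can be made tangible with a rough measurement. The sketch below compares the serialised size of a full tool result with the size of the three values the agent actually needs. The record shape is invented, and the rule of thumb of about four characters per token is an assumption; real counts depend on the tokenizer and the data.

```python
import json


def estimate_tokens(payload) -> int:
    """Very rough estimate, assuming about 4 characters of JSON per token."""
    return len(json.dumps(payload)) // 4


# Hypothetical tool result: 5,000 contact records.
contacts = [
    {"id": i, "name": f"Contact {i}", "email": f"c{i}@example.com", "revenue": i * 3}
    for i in range(5000)
]

# What the agent actually needs from it: three values.
needed = [c["revenue"] for c in contacts[:3]]

full_cost = estimate_tokens(contacts)
needed_cost = estimate_tokens(needed)
print(f"full result: ~{full_cost} tokens, needed values: ~{needed_cost} tokens")
```

Whatever the exact numbers, the gap is several orders of magnitude, and in the classic pattern the model pays for the full result.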

The solution: the agent writes code

The code execution pattern - published by Anthropic as an engineering approach in late 2025 - inverts the logic. The agent generates a program, typically Python or TypeScript, that treats the MCP tools as importable functions. This code runs in a sandbox. Filtering, loops, branching and aggregation happen there, in the code runtime context. Only the condensed final result returns to the model's context window.

The difference can be shown with pseudocode.

Classic - many tool calls through the model:

```
LLM: call get_contacts()
-> 5,000 contacts (in full into the context)
LLM: for each contact, call get_last_order(id)
-> 5,000 further tool calls, each one individually through the model
LLM: filter those with revenue > 10,000
-> large volumes of data in the context again
```

Code execution - one generated program:

```

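# Code generated by the agent, executed in the sandbox.
# `mcp` is the sandbox's handle to the connected MCP servers.

def top_customers(mcp, min_revenue=10_000):
    top = []
    for c in mcp.crm.get_contacts():
        # One programmatic tool call per contact, no model round-trip.
        order = mcp.crm.get_last_order(c.id)
        if order.revenue > min_revenue:
            top.append({"name": c.name, "revenue": order.revenue})
    # Only this condensed list returns to the model's context.
    return top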
```

In the second case, the model sees neither the 5,000 contacts nor the 5,000 order responses. It sees only the code it has written, and at the end the filtered list. The tool calls themselves run programmatically, not as individual LLM rounds. This saves both tokens and latency, because loops are executed without model round-trips.
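The effect can be reproduced without any MCP server. The following sketch replaces the CRM with an in-memory stand-in that counts calls; `FakeCrm`, `Contact`, `Order` and the revenue formula are all invented for illustration and say nothing about a real CRM server's API.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    id: int
    name: str


@dataclass
class Order:
    revenue: float


class FakeCrm:
    """Stand-in for an MCP CRM server; counts calls instead of doing I/O."""

    def __init__(self, n: int):
        self.n = n
        self.calls = 0

    def get_contacts(self) -> list[Contact]:
        self.calls += 1
        return [Contact(i, f"Contact {i}") for i in range(self.n)]

    def get_last_order(self, contact_id: int) -> Order:
        self.calls += 1
        return Order(revenue=contact_id * 2.0 + 50.0)  # made-up revenue


def top_customers(crm: FakeCrm, min_revenue: float) -> list[dict]:
    result = []
    for c in crm.get_contacts():
        order = crm.get_last_order(c.id)
        if order.revenue > min_revenue:
            result.append({"name": c.name, "revenue": order.revenue})
    return result


crm = FakeCrm(5000)
summary = top_customers(crm, 10_000)
print(f"tool calls made in code: {crm.calls}")
print(f"records returned to the model: {len(summary)}")
```

All 5,001 tool calls happen inside the program; only the short `summary` list would be handed back to the model.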

A second effect is on-demand tool discovery: instead of loading all tool definitions in advance, the agent can browse the available MCP servers and their functions like a module directory and import only those actually needed. This keeps the up-front overhead small, even when hundreds of tools are theoretically available.
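One way to picture on-demand discovery is a catalogue that lists tool names cheaply and loads a full definition only when the agent asks for it. The sketch below is a minimal illustration under that assumption; `ToolCatalog` and the generated tool names are hypothetical, and only the fields `name`, `description` and `inputSchema` are taken from MCP tool definitions.

```python
from typing import Callable


class ToolCatalog:
    """Lists tool names cheaply; loads a full definition only when requested."""

    def __init__(self, loaders: dict[str, Callable[[], dict]]):
        self._loaders = loaders  # name -> function returning the full definition
        self._loaded: dict[str, dict] = {}

    def list_names(self) -> list[str]:
        return sorted(self._loaders)

    def load(self, name: str) -> dict:
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    @property
    def loaded_names(self) -> list[str]:
        return sorted(self._loaded)


def make_loader(name: str) -> Callable[[], dict]:
    # Hypothetical definition; a real one would be fetched from the MCP server.
    return lambda: {
        "name": name,
        "description": f"Description of {name}",
        "inputSchema": {"type": "object"},
    }


names = [f"crm.tool_{i}" for i in range(200)]
catalog = ToolCatalog({n: make_loader(n) for n in names})
definition = catalog.load("crm.tool_7")
print(len(catalog.list_names()), "tools available,", len(catalog.loaded_names), "loaded")
```

Of 200 available tools, only the one that was requested ever costs context; the rest stay names in a directory.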

Example with numbers

A simplified calculation illustrates the lever. Assume an agent is to identify the top customers from a CRM and enrich them with order data. The figures are illustrative and serve to convey orders of magnitude, not as a benchmark.

| Item | Classic tool calls | Code execution |
|---|---|---|
| Tool definitions up front | approx. 8,000 tokens (all tools loaded) | approx. 1,500 tokens (only those needed) |
| Intermediate results in the context | approx. 40,000 tokens (5,000 raw records) | approx. 0 tokens (filtered in code) |
| Generated code / final answer | - | approx. 2,500 tokens |
| LLM round-trips | 1 per tool call (many) | few (code generation + result) |
| Token total (rough order of magnitude) | approx. 48,000 | approx. 4,000 |

The decisive item is not the up-front load, but the intermediate results: they are processed in code and occupy no model context. This is precisely where the lever lies. The larger the volumes of data and the more iterations, the greater the difference. For a simple single query, by contrast, the calculation reverses - the overhead of generating and executing a program exceeds the benefit.
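The totals can be checked with a few lines of arithmetic. The figures are the illustrative ones from this example; the zero for generated code in the classic column is an assumption implied by the stated total of 48,000.

```python
# Illustrative token figures from the example, not a benchmark.
classic = {
    "tool_definitions": 8_000,
    "intermediate_results": 40_000,
    "generated_code_and_answer": 0,  # assumption: not counted in the classic total
}
code_execution = {
    "tool_definitions": 1_500,
    "intermediate_results": 0,
    "generated_code_and_answer": 2_500,
}

classic_total = sum(classic.values())
code_total = sum(code_execution.values())
saving = 1 - code_total / classic_total

print(f"classic: {classic_total}, code execution: {code_total}, saving: {saving:.1%}")
```

With these assumptions, code execution uses one twelfth of the tokens, a saving of roughly 92 per cent, and 40,000 of the 44,000 tokens saved come from the intermediate results alone.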

When it makes sense - and when it does not

| Criterion | Code execution makes sense | Direct tool calls are better |
|---|---|---|
| Number of tools | Many (dozens+) | Few |
| Workflow steps | Multi-step, with loops/branching | One to two steps |
| Intermediate results | Large (lists, documents, API dumps) | Small |
| Determinism | Recurring, programmable logic | Exploratory, ad hoc |
| Traceability | Code trace + sandbox logs needed | Tool-call trace is sufficient |

Code execution is an optimisation, not a standard pattern for everything. It fits data-intensive, recurring multi-tool workflows. For simple, one-off actions, direct tool calls remain faster, cheaper and more transparent - the trace shows every step directly, without the detour via generated code.
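The criteria can be turned into a simple per-workflow check. The sketch below is one possible heuristic, not an established rule: the `Workflow` fields, the numeric thresholds and the requirement that three of four signals agree are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    tool_count: int
    steps: int
    intermediate_tokens: int
    recurring: bool


def prefers_code_execution(w: Workflow) -> bool:
    """Return True when at least three of four criteria point to code execution."""
    signals = [
        w.tool_count >= 24,             # "dozens+"; the threshold is an assumption
        w.steps > 2,                    # more than "one to two steps"
        w.intermediate_tokens > 5_000,  # "large"; the threshold is an assumption
        w.recurring,                    # recurring, programmable logic
    ]
    return sum(signals) >= 3


crm_enrichment = Workflow(tool_count=40, steps=6, intermediate_tokens=40_000, recurring=True)
single_lookup = Workflow(tool_count=3, steps=1, intermediate_tokens=200, recurring=False)

print(prefers_code_execution(crm_enrichment), prefers_code_execution(single_lookup))
```

The point of such a check is less the exact thresholds than the habit of deciding per workflow rather than per platform.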

Security: the sandbox is mandatory

The biggest difference compared with classic tool calls is security-relevant: in the code execution pattern, the system runs executable code generated by a language model. This is an attack surface in its own right. As of 2026, MCP is documented as following a deliberately optimistic trust model, one that treats syntactic correctness as if it implied semantic safety. Concrete attack classes are documented as well: indirect prompt injection via server descriptions, tool poisoning (Invariant Labs PoC, March 2025), look-alike server squatting and CyberArk's full-schema poisoning, in which every part of a tool schema can be an injection vector. Hardening is explicitly the operator's task here.

For code execution, this means concretely:

  • Isolated sandbox. The generated code runs in a sealed-off runtime environment without broad file and network access, with resource and time limits against infinite loops and resource exhaustion.
  • Scope-limited tokens. MCP servers sit behind OAuth 2.1 (part of the spec since the 2025-03-26 revision) and grant only the minimum necessary permissions - the code gets access only to what the task requires.
  • Least privilege and no autonomous installation. Agents must not load or install MCP servers from untrusted registries on their own.
  • Complete audit logging. Every tool call from the sandbox code is logged. In DACH contexts, this is at the same time a compliance requirement, not just engineering practice - end-to-end trace, model versions pinned, correlation via a trace ID.
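Two of these layers, scope limitation and audit logging, can be sketched as a thin wrapper around the tool functions exposed to the sandbox. `ScopedToolProxy`, the tool names and the log format are hypothetical; the wrapper only illustrates the idea on the calling side and does not replace OAuth 2.1 scopes enforced by the MCP server.

```python
import logging
import uuid
from typing import Any, Callable, Optional

logger = logging.getLogger("mcp.audit")


class ScopedToolProxy:
    """Wraps tool functions: enforces an allow-list and records every call."""

    def __init__(
        self,
        tools: dict[str, Callable[..., Any]],
        allowed: set[str],
        trace_id: Optional[str] = None,
    ):
        self._tools = tools
        self._allowed = allowed
        self.trace_id = trace_id or uuid.uuid4().hex
        self.audit_log: list[dict] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        permitted = name in self._allowed
        entry = {"trace_id": self.trace_id, "tool": name, "args": kwargs, "permitted": permitted}
        self.audit_log.append(entry)  # denied attempts are logged too
        logger.info("tool call %s", entry)
        if not permitted:
            raise PermissionError(f"tool {name!r} is outside the granted scope")
        return self._tools[name](**kwargs)


# Hypothetical tools standing in for MCP server functions.
tools = {
    "crm.get_contacts": lambda: ["Ada", "Grace"],
    "crm.delete_contact": lambda contact_id: True,
}
proxy = ScopedToolProxy(tools, allowed={"crm.get_contacts"}, trace_id="run-001")

contacts = proxy.call("crm.get_contacts")
try:
    proxy.call("crm.delete_contact", contact_id=1)
    denied = ""
except PermissionError as exc:
    denied = str(exc)
```

Because every attempt carries the same trace ID, a run can later be reconstructed end to end, including the calls that were refused.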

Anyone running executable model code without these protective layers trades token efficiency for a considerable security and data-protection risk - a no-go in regulated DACH industries.
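The time limit against infinite loops is the easiest layer to demonstrate. The sketch below runs generated code in a child Python interpreter with a wall-clock timeout and an empty environment. `run_generated_code` is a hypothetical helper, and a child process of this kind is not a security boundary: real isolation additionally needs container, microVM or comparable OS-level sandboxing with file, network and memory restrictions.

```python
import subprocess
import sys


def run_generated_code(source: str, timeout_s: float = 2.0) -> dict:
    """Run source in a child interpreter with a wall-clock limit (illustration only)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode, ignores env vars and user site
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when the limit is exceeded
            env={},             # do not pass the parent's environment (and its secrets)
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if completed.returncode == 0 else "error"
    return {"status": status, "stdout": completed.stdout, "returncode": completed.returncode}


ok = run_generated_code("print(sum(range(10)))")
looped = run_generated_code("while True: pass", timeout_s=1.0)
print(ok["status"], looped["status"])
```

The first run completes normally; the second is terminated after one second instead of blocking the agent indefinitely.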

For agencies and B2B decision-makers

For DACH agencies and AI vendors building products, code execution with MCP is the direct lever on the unit economics of agentic products. Those running multi-tenant services - such as SEO, newsletter or research agents - pay per run in tokens; the code execution pattern measurably lowers these costs for data-intensive workflows and at the same time reduces latency, which improves perceived product quality. The practical starting point: review existing MCP-based agents to identify which workflows run through many tools and large intermediate results, switch precisely those over to code execution - and build in sandbox, OAuth 2.1 scopes and audit logging from day one. For B2B decision-makers, the rule is: the pattern is an architectural decision per workflow, not a platform switch. It builds on the already established MCP standard and can be introduced step by step, without discarding the existing tool integration.

FAQ

What exactly is code execution with MCP?
A pattern in which the AI agent does not send each MCP tool call individually through its context window, but instead writes a program that calls the MCP tools as functions. This code runs in an isolated sandbox; only the condensed final result returns to the context window. This significantly reduces token consumption and latency for complex workflows.

When is code worthwhile instead of individual tool calls?
When the workflow involves many tools, many steps or large intermediate results - for example, data filtering across multiple sources, loops over datasets or aggregations. For simple single-step tasks with small output, a direct tool call is faster, cheaper and easier to follow. Rule of thumb: the larger the volume of data and the more intermediate steps, the more code execution pays off.

Why does code execution save tokens?
Two effects. First, not all tool definitions have to be loaded into the context in advance - the agent can discover and import tools on demand. Second, intermediate results are processed, filtered and aggregated in code, rather than passing through the model in full. Only the condensed result consumes tokens. For long result lists, this is the biggest lever.

What security risks does executing agent code introduce?
The agent produces executable code - that is an attack surface in its own right. What is needed is an isolated sandbox without broad network and file access, scope-limited OAuth 2.1 tokens, least privilege and resource limits. For MCP, as of 2026 an optimistic trust model is documented, along with risks such as tool poisoning and full-schema poisoning; hardening is the operator's responsibility.

Does code execution replace normal MCP tool calls?
No. Both coexist. Direct tool calls remain the standard pattern for simple, individual actions and for maximum transparency in the trace. Code execution is the optimisation for complex, data-intensive multi-tool workflows. A good architecture decides per task which pattern applies.

Is code execution production-ready for DACH B2B projects?
The pattern builds on MCP, which as of 2026 is the established agent-to-tool standard under the Linux Foundation. It becomes production-ready through the surrounding sandbox, identity and observability discipline. For regulated DACH workflows, additional audit-trail, GDPR and data-sovereignty requirements apply, which must be met regardless of the execution pattern.
