2.12 · Intermediate · 8 min

Function Calling vs. Tool Use: Terminology and Implementations

Blck Alpaca·9 June 2026

Definition

Function Calling and Tool Use describe the same core capability: an LLM outputs not prose, but a structured, schema-compliant call to an externally defined function. OpenAI coined "Function Calling", Anthropic uses "Tool Use" - technically both are JSON-Schema-based and nearly identical, with differences in field names and API mechanics.

Key Takeaways

✓ Function Calling (OpenAI) and Tool Use (Anthropic) are largely synonymous - both let the LLM produce structured JSON instead of prose, which deterministic code then executes.
✓ The central implementation difference is field naming: OpenAI uses 'parameters' in the tool schema, Anthropic 'input_schema'. Both build on JSON Schema underneath.
✓ The LLM executes nothing itself: it delivers a tool call, your code runs it and returns the result. With Anthropic, the stop_reason 'tool_use' signals that a call is pending.
✓ Parallel tool calls are the default and can be switched off (Anthropic: disable_parallel_tool_use). Errors should be returned to the model as a structured tool_result with is_error, not as an exception.
✓ Tool selection does not scale arbitrarily: a few clearly delineated tools are more robust than many overlapping ones. Tool definitions are both prompt budget and cache target.
✓ Practical rule: always read tool inputs with a JSON parser, never via string matching - escaping can vary between model versions.

Function Calling and Tool Use describe, at their core, the same capability: at a particular point, a Large Language Model outputs not prose, but a structured, schema-compliant call to an externally defined function - including the appropriate arguments. OpenAI introduced the term "Function Calling", Anthropic speaks of "Tool Use". Technically, both are JSON-Schema-based and nearly identical. The differences lie in the naming of schema fields and in some API mechanics - not in the underlying principle.

This article is part of the cluster around the pillar "LLM Fundamentals for Agents" and deepens what is touched upon in the sister piece on tool-calling fundamentals. All technical details refer to the state of 2026.

Synonymous with a nuance: "Tool Use" is the slightly broader term, because with Anthropic a tool can also be a server-side service (web search, code execution) or an MCP server. "Function Calling" means, in the narrower sense, the call to a pure function.
The model executes nothing: it only delivers the call. Your code validates the arguments, executes the function and returns the result. Only then does the LLM formulate the answer.
Most important implementation difference: OpenAI uses the field parameters in the tool schema, Anthropic input_schema. Both build on JSON Schema underneath.

How an LLM produces structured tool calls

The mechanism is built up identically in both ecosystems. You hand the model a list of tool definitions in addition to the actual user request. Each definition describes a name, a description (when the tool is to be used and when not) and an input schema with typed, partly required parameters.

Anthropic injects the tool definitions into the context via an API-internal wrapper prompt - roughly along the following pattern:

```
In this environment you have access to a set of tools you can use to answer the user's question.
{{ FORMATTING INSTRUCTIONS }}
String and scalar parameters should be specified as is, while lists and objects should use JSON format.
Here are the functions available in JSONSchema format:
{{ TOOL DEFINITIONS IN JSON SCHEMA }}
{{ USER SYSTEM PROMPT }}
{{ TOOL CONFIGURATION }}
```

From this follows an often overlooked economic consequence: each tool, as a schema, costs tokens permanently (typically on the order of around 100-300 per tool, depending on schema complexity), because the model reads the definitions along with every inference turn. A catalogue with ten tools therefore sits roughly at 1,000-3,000 tokens per call. At 100,000 calls per month, that amounts to 100-300 million tokens of pure schema load - economically viable only through prompt caching.
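The arithmetic behind this estimate can be reproduced in a few lines. This is only a sketch: the function name is made up for illustration, and the per-tool token counts are the rough range quoted above, not measured values.

```python
def schema_tokens_per_month(tools: int, tokens_per_tool: int, calls_per_month: int) -> int:
    """Input tokens spent per month on tool schemas alone (no caching)."""
    return tools * tokens_per_tool * calls_per_month


# Ten tools at roughly 100-300 tokens each, 100,000 calls per month.
low = schema_tokens_per_month(10, 100, 100_000)
high = schema_tokens_per_month(10, 300, 100_000)
print(f"{low:,} to {high:,} schema tokens per month")
```

To replace the assumed range with real numbers, measure the token count of your own serialised tool catalogue with the provider's token counter.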

In Dex Horthy's 12-Factor Agents framing, the guiding principle therefore applies: tools are just structured outputs. What the LLM produces is schema-compliant JSON; what is executed is deterministic code that you own and control yourself.

Schema definition: JSON Schema as the common language

Both providers rely under the hood on JSON Schema as a lingua franca. Code-first tools such as Pydantic (Python) or Zod (TypeScript) generate this schema automatically from type definitions.

A robust tool definition stands or falls with its descriptions, at tool level as well as at field level. A parameter recipient_email with the description "RFC-5321-compliant email address of the recipient; must correspond to an existing customer record" is considerably more reliable than the same field without a description. As a practical rule, the description should say not only what a tool does, but above all when it is to be called and how the arguments are to be constructed. It is the permanent interface contract between model and API. On the more recent models, which tend to be reticent in reaching for tools, such a description is measurably more effective than a mere function summary.
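One way to enforce this rule in a build pipeline is a small lint over the tool catalogue. The following is a minimal sketch using only the standard library; the function missing_descriptions and the send_mail tool are hypothetical examples, not part of any SDK.

```python
def missing_descriptions(tool: dict) -> list:
    """Return the paths in a tool definition that lack a non-empty description."""
    missing = []
    if not tool.get("description", "").strip():
        missing.append("tool")
    # Accept both the Anthropic key (input_schema) and the OpenAI key (parameters).
    schema = tool.get("input_schema") or tool.get("parameters") or {}
    for field, spec in schema.get("properties", {}).items():
        if not spec.get("description", "").strip():
            missing.append(f"properties.{field}")
    return missing


send_mail = {
    "name": "send_mail",
    "description": "Sends an email to an existing customer. Do NOT use for internal notes.",
    "input_schema": {
        "type": "object",
        "properties": {
            "recipient_email": {
                "type": "string",
                "description": "RFC-5321-compliant email address of the recipient; "
                               "must correspond to an existing customer record",
            },
            "subject": {"type": "string"},  # deliberately left without a description
        },
        "required": ["recipient_email", "subject"],
    },
}

print(missing_descriptions(send_mail))  # ['properties.subject']
```

The check only covers top-level properties; nested objects would need a recursive walk.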

DACH-specific note: keep tool names and parameter identifiers in English (interoperability with libraries, logs, OpenAPI), but write the descriptions in the agent's runtime language. Mixed-language catalogues work with frontier models, but are an avoidable source of inconsistencies.

Implementation differences: OpenAI vs. Anthropic

The following table summarises the practically relevant differences (state of 2026). Important: the commonalities predominate. In production, many teams use an abstraction layer (such as LiteLLM or the Vercel AI SDK) to encapsulate the field-name differences.

Aspect	OpenAI (Function Calling)	Anthropic (Tool Use)
Term	Function Calling	Tool Use
Schema field for parameters	`parameters`	`input_schema`
Schema standard	JSON Schema	JSON Schema
Signal for tool call	`finish_reason: "tool_calls"` (Chat Completions)	`stop_reason: "tool_use"`
Selection control	`tool_choice`: auto / required / none / named	`tool_choice`: auto / any / tool / none
Parallel calls	Active by default	Active by default
Disabling parallelism	`parallel_tool_calls: false`	`disable_parallel_tool_use: true` in `tool_choice`
Result return	Tool-result message with call reference	`tool_result` block with matching `tool_use_id`
Signalling errors	structured error in the result	`tool_result` with `is_error: true`
Server-side tools	including code interpreter, web search	code execution, web search/fetch, computer use
Strict schema enforcement	Structured Outputs / `strict: true`	`strict: true` or `output_config.format`
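To make the field-name difference concrete, here is a minimal sketch of a converter between the two tool-definition shapes. It assumes the OpenAI Chat Completions layout, where the definition is wrapped in {"type": "function", "function": {...}}; the function names are made up for illustration, and abstraction layers do considerably more than this.

```python
def anthropic_to_openai(tool: dict) -> dict:
    """Anthropic tool definition -> OpenAI Chat Completions tool definition."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],  # same JSON Schema, different key
        },
    }


def openai_to_anthropic(tool: dict) -> dict:
    """OpenAI Chat Completions tool definition -> Anthropic tool definition."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }


anthropic_tool = {
    "name": "fetch_recent_orders",
    "description": "Loads a customer's recent orders.",
    "input_schema": {
        "type": "object",
        "properties": {"user_id": {"type": "string", "description": "Internal customer ID"}},
        "required": ["user_id"],
    },
}

openai_tool = anthropic_to_openai(anthropic_tool)
print(openai_tool["function"]["parameters"] == anthropic_tool["input_schema"])  # True
```

The JSON Schema object itself is passed through untouched, which is the point of the comparison: only the envelope differs.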

With Anthropic, four modes are available for tool_choice: {"type": "auto"} (model decides, default), {"type": "any"} (the model must use one of the provided tools), {"type": "tool", "name": "..."} (a specific tool is enforced) and {"type": "none"} (no tools). The auto, any and tool modes can additionally carry disable_parallel_tool_use: true. With auto this limits the response to at most one tool call, with any or tool to exactly one.
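A small helper can keep these tool_choice values consistent across a codebase. This is a sketch: the helper name and its single_call parameter are hypothetical, and leaving the flag out for mode none is a conservative assumption, since no tool call is possible there anyway.

```python
from typing import Optional

VALID_MODES = {"auto", "any", "tool", "none"}


def anthropic_tool_choice(mode: str = "auto", name: Optional[str] = None,
                          single_call: bool = False) -> dict:
    """Build a tool_choice value for the Anthropic Messages API."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown tool_choice mode: {mode!r}")
    if (mode == "tool") != (name is not None):
        raise ValueError("name is required for mode 'tool' and not allowed otherwise")
    choice = {"type": mode}
    if name is not None:
        choice["name"] = name
    if single_call and mode != "none":
        # Assumption: the flag is only attached where a tool call can occur.
        choice["disable_parallel_tool_use"] = True
    return choice


print(anthropic_tool_choice("tool", name="fetch_recent_orders", single_call=True))
```

The result is passed as the tool_choice argument of the API call, next to the tools list.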

Parallel tool calls and the agent loop

Frontier models can request several tool calls simultaneously in a single response - for example "fetch profile", "load recent orders" and "check stock level" in parallel. This noticeably reduces round-trips and latency. Your harness must work through all requested calls and return the results collectively in a single follow-up message.

The manual agent loop with Anthropic follows this pattern: call the API and check the response. As long as stop_reason equals tool_use, execute the tool_use blocks, append the complete model response and the tool_result blocks (each with a matching tool_use_id) to the message history, and start the next API call. The loop ends when stop_reason is no longer tool_use, typically end_turn. The official SDKs alternatively offer an automatic "tool runner" that encapsulates this loop; the manual loop is worth using when you need fine-grained control, for example approval gates or custom logging.
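Working through several requested calls and returning the results collectively can be sketched as follows. To stay runnable offline, the response content is modelled as plain dicts (the SDK returns objects with the same fields as attributes), and the two tools in TOOLS are hypothetical stand-ins for your own code.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local tools, keyed by tool name.
TOOLS = {
    "fetch_profile": lambda args: {"user_id": args["user_id"], "tier": "gold"},
    "check_stock": lambda args: {"sku": args["sku"], "in_stock": 3},
}


def run_one(block: dict) -> dict:
    """Execute one tool_use block and wrap the outcome as a tool_result."""
    try:
        data = TOOLS[block["name"]](block["input"])
    except Exception as exc:  # failures go back to the model, not up the stack
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,
        }
    return {"type": "tool_result", "tool_use_id": block["id"], "content": json.dumps(data)}


def run_all(content: list) -> list:
    """Run every tool_use block concurrently; results keep the request order."""
    calls = [b for b in content if b["type"] == "tool_use"]
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        return list(pool.map(run_one, calls))


content = [
    {"type": "text", "text": "Let me look that up."},
    {"type": "tool_use", "id": "toolu_1", "name": "fetch_profile", "input": {"user_id": "42"}},
    {"type": "tool_use", "id": "toolu_2", "name": "check_stock", "input": {"sku": "A-7"}},
    {"type": "tool_use", "id": "toolu_3", "name": "unknown_tool", "input": {}},
]
# All results travel back in ONE follow-up user message.
follow_up = {"role": "user", "content": run_all(content)}
print(len(follow_up["content"]))  # 3
```

Threads suit I/O-bound tools such as HTTP or database calls; every tool_use block gets exactly one tool_result, including the failed one.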

Error handling: robust instead of fragile

The second most frequent cause of unstable production agents - after ambiguous tool definitions - is poor error handling. Three principles apply across industries:

Return errors as a structured tool result, do not throw an exception that aborts the loop. A condensed { "error": "code", "message": "...", "retry_after_seconds": 30 } lets the model decide whether to correct the arguments, choose a different strategy or escalate. With Anthropic, you additionally set is_error: true in the tool_result block.
Keep error context compact. Do not dump raw stack traces into the context window - this pollutes the context and worsens the subsequent answers. Condense them.
Cap retries. Three attempts are typical. Differentiate by error type: retry tool errors (500/timeout) with backoff; retry validation errors (400) with adjusted arguments; escalate permission errors (403) to a human without a retry; handle rate limits (429) with backoff and, where appropriate, a model switch.
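The three principles can be sketched in code as follows. All names here (next_action, backoff_seconds, error_result) and the 300-character message cap are illustrative assumptions; the status codes mirror the ones named above.

```python
import json
from typing import Optional

MAX_ATTEMPTS = 3  # typical cap


def next_action(status: int, attempt: int) -> str:
    """Map an HTTP-style status and the 1-based attempt number to a decision."""
    if status == 403:
        return "escalate_to_human"  # permission errors are never retried
    if attempt >= MAX_ATTEMPTS:
        return "give_up"
    if status == 400:
        return "retry_with_adjusted_arguments"
    if status == 429 or status >= 500:
        return "retry_with_backoff"
    return "give_up"


def backoff_seconds(attempt: int) -> int:
    """Exponential backoff, capped at 30 seconds."""
    return min(2 ** attempt, 30)


def error_result(tool_use_id: str, code: str, message: str,
                 retry_after_seconds: Optional[int] = None) -> dict:
    """Build a compact, structured tool_result instead of raising."""
    payload = {"error": code, "message": message[:300]}  # condensed, no raw stack trace
    if retry_after_seconds is not None:
        payload["retry_after_seconds"] = retry_after_seconds
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(payload),
        "is_error": True,
    }


print(next_action(503, attempt=1), backoff_seconds(1))
print(error_result("toolu_1", "rate_limited", "Upstream quota exceeded", 30)["content"])
```

Whether a retry is performed by the harness itself or left to the model after it sees the error result is a design choice; the sketch only separates the decision from the reporting.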

A persistent anti-pattern is silent error suppression, where tool calls fail but the agent continues running as if everything were fine. Always return errors explicitly to the model.

A second practical rule: always read tool inputs with a JSON parser, never via string matching on the serialised input. The models of the current generation (such as Opus 4.6, 4.7 and 4.8 as well as Sonnet 4.6, state of 2026) can handle JSON escaping differently between versions, for instance Unicode or slash escaping. String matching on the raw text builds in a hard-to-find bug.
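A short illustration of why the parser matters: the two strings below are made-up serialisations of the same tool input, one with a Unicode escape and one with an escaped slash. Both are valid JSON for the same value, yet neither contains the literal text a naive string match would look for.

```python
import json

# Two hypothetical serialisations of the same input, differing only in escaping.
raw_a = '{"path": "docs/caf\\u00e9.md"}'   # Unicode escape for "é"
raw_b = '{"path": "docs\\/café.md"}'       # escaped forward slash

needle = "docs/café.md"
print(needle in raw_a, needle in raw_b)    # False False: string matching misses both

parsed_a = json.loads(raw_a)
parsed_b = json.loads(raw_b)
print(parsed_a == parsed_b)                # True: the parser normalises the escapes
print(parsed_a["path"] == needle)          # True
```

With the official SDKs the tool input typically already arrives as a parsed object, so the rule mainly concerns code that touches raw or streamed payloads.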

Concrete pseudocode example

The following example, written as Python-style pseudocode against the Anthropic SDK, shows a tool definition and a manual loop in the Anthropic notation; run_tool is a placeholder for your own code. In the tool definition, the difference from OpenAI consists essentially in input_schema being called parameters there.

```
import json

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment


class ToolError(Exception):
    """Raised by run_tool for errors the model should see and react to."""


def run_tool(name: str, args: dict) -> object:
    """Placeholder dispatcher: replace with your own validated tool code."""
    raise ToolError(f"Tool '{name}' is not implemented")


client = anthropic.Anthropic()

tool = {
    "name": "fetch_recent_orders",
    "description": "Loads a customer's recent orders. "
                   "Use when the user asks about order history, status "
                   "or deliveries. Do NOT use for pure "
                   "product search - use search_products for that. "
                   "Returns: list of {id, total, status, items}.",
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "Internal customer ID"},
            "limit": {"type": "integer", "description": "Max. count (default 10)"},
        },
        "required": ["user_id"],
    },
}

messages = [{"role": "user", "content": "Where is my last order?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-8",  # state of 2026
        max_tokens=1024,
        tools=[tool],
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # end_turn -> done

    # Append the complete assistant turn, including its tool_use blocks.
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":  # possibly several -> parallel
            try:
                data = run_tool(block.name, block.input)  # your own code
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(data),
                })
            except ToolError as e:
                # Report the failure to the model instead of aborting the loop.
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(e),
                    "is_error": True,
                })
    # Return all results together in a single user message.
    messages.append({"role": "user", "content": results})
```

A calculation example for the tool load: assume this agent runs with five tools at ~200 tokens of schema each (1,000 tokens) plus 800 tokens of system prompt, so 1,800 tokens of stable prefix. At 100,000 calls per month, around 180 million input tokens accrue for the stable prefix alone. Via Anthropic's prompt caching, a cache read costs only about 10 percent of the standard input rate (state of 2026: e.g. around 0.30 instead of 3.00 US dollars per million tokens with Sonnet 4.6) - provided the tool catalogue remains stable. Any change to tool definitions, even just the ordering, invalidates the cache completely. From this follows the production pattern: roll out tool catalogues in planned releases and avoid hotfixes to tool definitions.
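Expressed in dollars, the example works out as below. This is a back-of-the-envelope sketch using the rates quoted above; it assumes every call hits the cache and ignores cache writes, which are priced separately.

```python
TOOLS = 5
TOKENS_PER_TOOL = 200
SYSTEM_PROMPT_TOKENS = 800
CALLS_PER_MONTH = 100_000

STANDARD_RATE = 3.00    # USD per million input tokens (rate quoted in the text)
CACHE_READ_RATE = 0.30  # USD per million cached input tokens (about 10 percent)

prefix_tokens = TOOLS * TOKENS_PER_TOOL + SYSTEM_PROMPT_TOKENS  # 1,800
monthly_tokens = prefix_tokens * CALLS_PER_MONTH                 # 180,000,000

uncached_usd = round(monthly_tokens / 1_000_000 * STANDARD_RATE, 2)
cached_usd = round(monthly_tokens / 1_000_000 * CACHE_READ_RATE, 2)

print(f"prefix: {monthly_tokens:,} tokens per month")
print(f"uncached: ${uncached_usd:,.2f}  cached: ${cached_usd:,.2f}")
```

Under these assumptions the stable prefix costs about 540 US dollars per month uncached and about 54 US dollars when read from the cache.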

How many tools - and the overlap trap

Tool selection accuracy does not scale arbitrarily. It has proven worthwhile to permanently load only a small core set and to load further tools dynamically via a tool-search tool - this keeps the fixed context small and spares the cache, because schemas are appended instead of swapped. The actual bottleneck, however, is not the sheer number, but the overlap: even a handful of clearly delineated tools can be managed cleanly, whereas a few overlapping ones reliably drive a model into guessing.

Two tools that could plausibly answer the same request - such as search_documents and search_knowledge_base - are the problem that no prompt, however good, solves. With "Find information on X", the model guesses. The most effective, usually forgotten countermeasure is the when-not-to-use clause in every description. The Berkeley Function-Calling Leaderboard (BFCL) serves as a public benchmark reference for selection accuracy; however, its values fluctuate, so verify them against your own eval set rather than relying on a marketing blog post.

For agencies and B2B decision-makers

Function Calling, or Tool Use, is the bridge between the language model and your real systems - CRM, accounting, shipping, knowledge base. The terminological confusion is harmless; the implementation discipline is not. Whether an agent runs reliably in production is decided by cleanly delineated tool definitions, robust error handling as structured results, controlled termination conditions and a cache-aware catalogue strategy - not by the choice between OpenAI and Anthropic.

As a Vienna-based agency with a focus on AI agents, we support DACH companies from tool-schema design through provider abstraction to a production-ready, auditable agent loop. If you want to integrate structured tool calls reliably into your business processes, get in touch with us.

FAQ

Is Function Calling the same as Tool Use?

At the core, yes. Both terms describe the same mechanism: an LLM produces a structured, schema-compliant call to a function you have defined instead of prose. 'Function Calling' is OpenAI's term, 'Tool Use' is Anthropic's. The nuance: 'Tool Use' is the slightly broader term, because with Anthropic a tool can also be a server-side service (such as web search or code execution) or an MCP server, not just a client-side function. In practice, the terms are used synonymously.

What is the most important technical difference between OpenAI and Anthropic?

The field naming in the tool schema. OpenAI defines the parameters under the key 'parameters', Anthropic under 'input_schema'. Both use JSON Schema underneath, so the actual parameter object is almost identical. Further differences concern the signalling (Anthropic: stop_reason 'tool_use') and the return format of the results (Anthropic: a tool_result block with a matching tool_use_id).

Does the LLM execute the function itself?

No - not with classic, client-side tools. The model only delivers the structured call including arguments. Your code (the 'harness') validates the arguments, executes the function and sends the result back as a tool_result. Only then does the LLM formulate the final answer. An exception is server-side tools (such as Anthropic's code execution or web search), which run on the provider's infrastructure.

How do parallel tool calls work?

Frontier models can request several tool calls simultaneously in a single response, for example to query three independent data sources. This saves round-trips and latency. You must execute all requested calls and return the results collectively. If you want at most one call per response, the behaviour can be disabled - with Anthropic via disable_parallel_tool_use in the tool_choice parameter.

How do you handle errors in tool calls correctly?

Errors belong back in the model as a structured result, not as a thrown exception that aborts the agent loop. Return a tool_result with is_error true and a short, comprehensible error message. The model can then decide whether to correct the arguments, choose a different strategy or escalate. Stack traces should be condensed, not passed raw, so as not to pollute the context window.

How many tools should an agent have?

Less is usually more. A small, permanently loaded core set plus dynamic loading via a tool-search tool is more robust than a large static catalogue. The actual problem is not the sheer number, but overlap: two tools that could plausibly answer the same request cannot be cleanly separated by any prompt. Clean, unambiguous descriptions with a clear when-not-to-use clause are decisive.
