3.13 · Intermediate · 8 min

Few-Shot Prompting for Robust Agent Outputs

Blck Alpaca
Definition

Few-shot prompting refers to the technique of giving an AI agent a few examples (typically 2 to 5) of correct inputs and outputs within the prompt, so that it adopts the format, style and logic of a task via in-context learning, without the model being retrained. This makes output formats and tool calls considerably more reliable.

Key Takeaways

  • Few-shot prompting steers agents via 2 to 5 canonical examples in the context, not via model training. It is a building block of context engineering, not a separate method alongside prompt engineering.
  • Representativeness and diversity beat quantity: contradictory or duplicated examples tempt the model to copy the nearest example. Diverse, canonical cases are the goal.
  • For tool calling there are dedicated mechanics: Anthropic allows an input_examples array (1 to 3 calls) per tool definition. This is particularly effective at stabilising nested or optional parameters.
  • Where 100 percent format fidelity is mandatory, schema-constrained decoding (OpenAI Structured Outputs, GA since August 2024) replaces the few-shot heuristic for pure structure. Few-shot remains relevant for style and logic.
  • Zero-shot for simple tasks, few-shot for format and tool reliability, fine-tuning only at very high volume or under latency pressure. In the DACH context, German increases the token costs of examples by 30 to 50 percent.
  • Examples are not free: they cost tokens per call and are prone to overfitting. Stable example blocks belong in the cacheable prompt prefix (cache reads around 10 percent of the standard price, as of 2026).

Few-shot prompting shows the desired behaviour instead of describing it in prose: two to five representative cases against which the model aligns itself for the next inference turn. For production-ready agents, few-shot is not a gimmick but one of the most effective levers for reliable output formats and correct tool calling.

  • How many: Two to five examples for the general output, one to three canonical calls per tool. Adding more rarely improves results proportionally.
  • Which ones: Diverse, representative cases without duplicates and without contradictions. Quality and coverage beat quantity.
  • When not to: For simple tasks that zero-shot already handles, or when 100 percent format fidelity is required; in the latter case schema-constrained decoding is superior.

Few-Shot in Context: A Building Block of Context Engineering

Few-shot prompting is not a standalone method alongside prompt engineering, but a sub-discipline within context engineering. Andrej Karpathy explicitly names "few-shot examples" among the building blocks on the "science" side of context engineering, used to populate the context window for the next step, alongside task descriptions, RAG, tools and state. The corresponding mental model for agents is: examples are part of the token substrate that the model sees per turn, not a one-off instruction.

This framing has practical consequences. Examples compete with everything else for attention budget and token space. Anyone using few-shot must consider it within the context budget, not in isolation.

Zero-Shot, Few-Shot, One-Shot

  • Zero-shot: Only the task description, no examples. Fast, cheap, ideal for simple or obvious tasks.
  • One-shot: Exactly one example. Useful for anchoring an unambiguous format without a large token load.
  • Few-shot: Several examples that cover variants and edge cases. The standard for format- and logic-critical agent outputs.

Selection and Representativeness: The Real Engineering

The most common mistake is not the wrong number, but the wrong selection. A recurring system-prompt anti-pattern: multiple contradictory examples lead the model to copy the nearest one instead of inferring a rule. The correction is to use diverse, canonical examples without duplicates.

In concrete terms, representative selection means:

  • Cover variants, not the same case three times with slight modifications. If an agent processes invoices, credit notes and cancellations, one example of each belongs in the set, not three invoices.
  • Deliberately show edge cases, for instance a case with missing mandatory fields and the correct response to it. Examples also teach behaviour under ambiguity.
  • Consistent format across all examples. Format drift between the examples is poison; the model replicates the inconsistency.
  • No contradictory signals. If example A omits a field and example B fills it in, without the difference being explainable, the model will guess.
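The selection rules above can be checked mechanically before an example set ships. The following is a minimal sketch, not a definitive linter: the example record layout (`category`, `input`, `output`) and the function names are assumptions made for illustration. It flags duplicate inputs, categories covered more than once, and outputs whose key structure differs.

```python
from collections import Counter


def key_shape(value, prefix=""):
    """Return the set of key paths in a JSON-like value (all list items share one path)."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_shape(child, path)
    elif isinstance(value, list):
        for child in value:
            paths |= key_shape(child, prefix + "[]")
    return paths


def lint_examples(examples):
    """Flag duplicate inputs, repeated categories and format drift in a few-shot set."""
    problems = []
    # Normalised inputs that occur more than once are duplicates.
    inputs = Counter(ex["input"].strip().lower() for ex in examples)
    for text, count in inputs.items():
        if count > 1:
            problems.append(f"duplicate input: {text!r}")
    # Each variant (invoice, credit note, ...) should appear once, not three times.
    categories = Counter(ex["category"] for ex in examples)
    for category, count in categories.items():
        if count > 1:
            problems.append(f"category {category!r} appears {count} times")
    # All outputs should share one key structure; differences need a human look.
    shapes = {frozenset(key_shape(ex["output"])) for ex in examples}
    if len(shapes) > 1:
        problems.append("format drift: outputs do not share one key structure")
    return problems
```

A deliberate edge-case example that omits a field will also trigger the format-drift warning. That is intended: the warning is a prompt to confirm the difference is explainable, not an automatic rejection.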

Impact on Tool Calling and Output Reliability

This is where few-shot pays off most. Anthropic's tool definitions accept an input_examples array, for which one to three canonical calls are recommended. Without examples, the model guesses at nested or optional parameters; with examples, this source of error drops significantly. The connection to tool-selection reliability is close: Anthropic reports that disciplined tool catalogues plus tool search raise tool-selection accuracy on Opus 4 from 49 to 74 percent and on Opus 4.5 from 79.5 to 88.1 percent (internal MCP evals, as of 2026). Good examples in the tool description are part of the same discipline.
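As an illustration, a tool definition with an input_examples array might look like the sketch below. The create_order tool, its fields and the optional `note` parameter are hypothetical, and the exact field shape should be checked against the provider's current documentation. The helper `missing_required` is likewise an assumption: it only verifies that each example call carries the required fields of the schema, so that the examples cannot silently contradict the definition.

```python
# Hypothetical tool definition; names and fields are illustrative only.
create_order_tool = {
    "name": "create_order",
    "description": "Create an order record for a customer.",
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "sku": {"type": "string"},
                        "qty": {"type": "integer"},
                    },
                    "required": ["sku", "qty"],
                },
            },
            "note": {"type": "string"},  # optional parameter
        },
        "required": ["customer_id", "items"],
    },
    # One to three canonical calls covering the nested and the optional parameter.
    "input_examples": [
        {"customer_id": "9001", "items": [{"sku": "C-300", "qty": 2}]},
        {
            "customer_id": "9002",
            "items": [{"sku": "D-401", "qty": 1}, {"sku": "E-502", "qty": 4}],
            "note": "Deliver to back entrance",
        },
    ],
}


def missing_required(schema, value, path="$"):
    """List required fields absent from value, descending into objects and arrays."""
    missing = []
    if schema.get("type") == "object" and isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                missing.append(f"{path}.{name}")
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                missing += missing_required(sub, value[name], f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(value, list):
        for i, item in enumerate(value):
            missing += missing_required(schema.get("items", {}), item, f"{path}[{i}]")
    return missing
```

Running `missing_required` over every entry of `input_examples` in a test keeps examples and schema in step when either one changes.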

For the final output, an important distinction applies. Where mandatory machine-parsable structures are required, few-shot alone is not sufficient. OpenAI Structured Outputs (GA since August 2024, for gpt-4o-2024-08-06 and successors) enforces schema adherence via constrained decoding at the token level, documented at 100 percent. Anthropic achieves functionally equivalent results through forced tool use with a pseudo-tool such as return_structured_result. The clean division of labour in 2026: schema enforcement guarantees the structure, while few-shot shapes style, word choice, logic and the handling of edge cases that no schema captures.
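The two enforcement routes can be sketched as request fragments. These are plain dictionaries, not complete API calls, and the schema with its `status` field is hypothetical. The helper `strict_ready` is an assumption of this sketch: it checks two conditions commonly required for strict schema mode (every property listed as required, no additional properties), which should be verified against the provider documentation.

```python
# Hypothetical result schema; field names are illustrative only.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "status": {"type": "string", "enum": ["created", "rejected"]},
    },
    "required": ["customer_id", "status"],
    "additionalProperties": False,
}

# OpenAI Structured Outputs: the schema travels in response_format with strict mode on.
openai_response_format = {
    "type": "json_schema",
    "json_schema": {"name": "order_result", "strict": True, "schema": ORDER_SCHEMA},
}

# Anthropic forced tool use: the schema becomes the input_schema of a pseudo-tool,
# and tool_choice forces the model to call exactly that tool.
anthropic_tools = [
    {
        "name": "return_structured_result",
        "description": "Return the final result in the required structure.",
        "input_schema": ORDER_SCHEMA,
    }
]
anthropic_tool_choice = {"type": "tool", "name": "return_structured_result"}


def strict_ready(schema):
    """Check that objects forbid extra keys and mark every property as required."""
    kind = schema.get("type")
    if kind == "array":
        return strict_ready(schema.get("items", {}))
    if kind != "object":
        return True
    props = schema.get("properties", {})
    return (
        schema.get("additionalProperties") is False
        and set(schema.get("required", [])) == set(props)
        and all(strict_ready(sub) for sub in props.values())
    )
```

Neither fragment says anything about which status is correct for a given email. That decision is the part few-shot examples still have to carry.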

| Method | Purpose | Format fidelity | Effort / cost |
| --- | --- | --- | --- |
| Zero-shot | Simple tasks, obvious format | Variable | Minimal |
| Few-shot (2-5 examples) | Stabilise style, logic, tool calls | High, not guaranteed | Tokens per call, iterable |
| Structured Outputs / forced tool use | Mandatory JSON structure | 100 percent (schema) | Schema maintenance, low latency |
| Fine-tuning | Very high volume, latency pressure | High, model-dependent | Training cycle, data effort |

When Few-Shot, When Zero-Shot, When Fine-Tuning

The decision follows three axes: task complexity, volume and stability of requirements.

  • Zero-shot, when the task is simple and the format uncritical. Every additional example token would be wasted.
  • Few-shot, as soon as a specific format, a consistent style or a non-trivial logic is required and the requirements are still changing. Few-shot is iterable without a training cycle, which is its greatest advantage.
  • Fine-tuning, only at very high, stable volume, when the example tokens per call become economically significant or when latency becomes critical. Cognition Labs trained its own smaller summarisation model on its own trace data for Devin, because generic prompts lost too much detail, a classic case in which few-shot reached its limit.
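The decision logic above can be written down as a small function. This is one possible reading of the three axes, not a fixed rule: the `Task` fields and the function name are assumptions, and the booleans stand in for judgements (for example, what counts as "very high volume") that each team has to make for itself.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_specific_format_or_logic: bool  # format, style or non-trivial logic matters
    needs_guaranteed_structure: bool      # output must always parse against a schema
    requirements_stable: bool             # requirements no longer change often
    very_high_volume: bool                # example tokens per call matter economically
    latency_critical: bool


def choose_methods(task):
    """Return the suggested method, plus schema enforcement where structure is mandatory."""
    if not task.needs_specific_format_or_logic:
        base = "zero-shot"
    elif task.requirements_stable and (task.very_high_volume or task.latency_critical):
        base = "fine-tuning"
    else:
        # Requirements still moving, or volume too low to justify a training cycle.
        base = "few-shot"
    methods = [base]
    if task.needs_guaranteed_structure:
        methods.append("schema enforcement")
    return methods
```

For a typical B2B agent with a fixed JSON format and requirements that still change, `choose_methods` returns few-shot combined with schema enforcement.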

One intermediate stage deserves mention: in LLM-as-judge verification, few-shot examples in the judge prompt (positive and negative) are standard. Hamel Husain's recommendation is to calibrate such judge evals with over 100 labelled examples and maintain them weekly. This illustrates the boundary between few-shot in the prompt (few examples) and the eval dataset (many examples) behind it.

Pitfalls: Overfitting and Token Costs

Overfitting to examples is the most subtle trap. The agent copies surface features of the examples, for instance a particular order or phrasing, instead of generalising the underlying rule. Symptom: for inputs that resemble the examples the output is perfect, while for deviating cases it breaks down. The remedy is targeted diversity of the examples and an eval set that specifically tests the cases not covered.
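The symptom described above can be made visible by splitting eval results into cases that resemble the examples and cases that deviate from them. The sketch below assumes each eval case has already been labelled by hand as similar or deviating; the function names and the gap threshold of 0.2 are illustrative assumptions, not established values.

```python
def pass_rates(results):
    """Compute pass rates from (resembles_examples, passed) pairs of booleans."""
    totals = {True: [0, 0], False: [0, 0]}  # resembles_examples -> [passed, total]
    for resembles, passed in results:
        totals[resembles][0] += int(passed)
        totals[resembles][1] += 1

    def rate(bucket):
        passed, total = bucket
        return passed / total if total else None

    return {"similar": rate(totals[True]), "deviating": rate(totals[False])}


def looks_overfitted(results, max_gap=0.2):
    """Flag a set whose pass rate on deviating cases lags far behind similar cases."""
    rates = pass_rates(results)
    if rates["similar"] is None or rates["deviating"] is None:
        return False  # one group is empty, so no comparison is possible
    return rates["similar"] - rates["deviating"] > max_gap
```

If the eval set contains no deviating cases at all, the function cannot detect anything, which is itself a sign that the eval set needs extending.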

Token costs and context rot are the second trap. Every example is sent with every call. A Chroma study (July 2025) documents that frontier models degrade measurably as input length increases (context rot); the effective capacity for reasoning-heavy tasks is often only 30 to 50 percent of the nominal context window. In the DACH context, the added complication is that German text requires 30 to 50 percent more tokens than equivalent English. German few-shot examples are therefore noticeably more expensive.

The most important economic lever: stable example blocks belong in the cacheable prompt prefix. Anthropic cache reads cost around 10 percent of the standard input rate (as of 2026); for Sonnet 4.6 that is about 0.30 instead of 3.00 US dollars per million input tokens. Anyone who keeps their few-shot examples stable together with the system prompt and tool definitions, and places only the dynamic part at the end, pays for the example tokens predominantly at the cache rate. Any change to the examples invalidates the cache, so version the examples and change them in planned releases, not via hotfix.
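A rough calculation shows the size of the lever. The sketch below uses the figures quoted above (3.00 US dollars per million input tokens, cache reads at about 10 percent) and deliberately ignores cache-write surcharges and the first uncached call, so it is an approximation. The function name, the example size of 200 tokens and the call volume are assumptions.

```python
def example_cost_usd(example_tokens, calls, price_per_mtok,
                     cached=False, cache_read_factor=0.10, german_overhead=0.0):
    """Estimate what the few-shot block alone costs across a number of calls.

    german_overhead is the extra token share for German text, e.g. 0.4 for 40 percent.
    """
    tokens = example_tokens * (1 + german_overhead)
    rate = price_per_mtok * (cache_read_factor if cached else 1.0)
    return tokens * calls * rate / 1_000_000


uncached = example_cost_usd(200, 1_000_000, 3.00)
cached = example_cost_usd(200, 1_000_000, 3.00, cached=True)
cached_german = example_cost_usd(200, 1_000_000, 3.00, cached=True, german_overhead=0.4)
```

With these assumptions, a 200-token example block over one million calls costs about 600 US dollars uncached and about 60 at the cache-read rate; with a 40 percent German overhead the cached figure rises to about 84.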

A Concrete Example: With and Without Few-Shot

An agent has to turn an email into a structured call to a create_order tool.

Without few-shot (zero-shot):

```
System: From the email, generate a create_order call.
Input: "Please 3x item A-100 and 1 pc B-205 to customer no. 4711."
Output (typical): {"kunde": "4711", "artikel": "A-100, B-205", "menge": "3 und 1"}
```

The output is plausible but unusable: quantities and items are summarised as free text, and the field names deviate from the tool schema. Downstream, the parsing breaks.

With few-shot (two canonical examples in the tool prompt):

```
Example 1
Input: "2x C-300 for customer 9001"
Call: {"customer_id":"9001","items":[{"sku":"C-300","qty":2}]}

Example 2 (edge case: no quantity stated -> default 1)
Input: "Item D-401 to customer 9002"
Call: {"customer_id":"9002","items":[{"sku":"D-401","qty":1}]}

Real input: "Please 3x item A-100 and 1 pc B-205 to customer no. 4711."
Call: {"customer_id":"4711","items":[{"sku":"A-100","qty":3},{"sku":"B-205","qty":1}]}
```

Two examples are enough to anchor the field names, the array structure, the default behaviour when a quantity is missing, and the separation of multiple line items. On the cost side: the two examples add roughly 150 to 250 tokens per call, correspondingly more in German. If they sit in the cached prefix, this costs only about a tenth on repeated calls. For a 100 percent structure guarantee, you combine this few-shot setup with forced tool use on the create_order schema; in this way few-shot carries the logic, and the schema the form.
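The example can be turned into a small harness: one function assembles the few-shot prompt from the two canonical examples, another checks a returned call against the expected field names and types. This is a sketch under assumptions; `build_prompt` and `check_call` are hypothetical helpers, and the actual model call is left out.

```python
import json

# The two canonical examples from the prompt above.
EXAMPLES = [
    ("2x C-300 for customer 9001",
     {"customer_id": "9001", "items": [{"sku": "C-300", "qty": 2}]}),
    ("Item D-401 to customer 9002",  # edge case: no quantity stated -> default 1
     {"customer_id": "9002", "items": [{"sku": "D-401", "qty": 1}]}),
]


def build_prompt(examples, real_input):
    """Assemble instruction, examples and the real input into one prompt string."""
    lines = ["From the email, generate a create_order call.", ""]
    for i, (text, call) in enumerate(examples, start=1):
        compact = json.dumps(call, separators=(",", ":"))
        lines += [f"Example {i}", f'Input: "{text}"', f"Call: {compact}", ""]
    lines += [f'Input: "{real_input}"', "Call:"]
    return "\n".join(lines)


def check_call(raw):
    """Return a list of problems found in a JSON call; an empty list means usable."""
    call = json.loads(raw)
    problems = []
    if not isinstance(call.get("customer_id"), str):
        problems.append("customer_id missing or not a string")
    items = call.get("items")
    if not isinstance(items, list) or not items:
        problems.append("items missing or not a non-empty list")
        return problems
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{i}] is not an object")
            continue
        if not isinstance(item.get("sku"), str):
            problems.append(f"items[{i}].sku missing or not a string")
        if not isinstance(item.get("qty"), int) or item["qty"] < 1:
            problems.append(f"items[{i}].qty missing or not a positive integer")
    return problems
```

Applied to the zero-shot output above, `check_call` reports the missing customer_id and items fields; applied to the few-shot output, it returns an empty list.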

For Agencies and B2B Decision-Makers

Few-shot prompting is the fastest way to move an agent from "works in the demo" to "reliable in production" without a training budget. For agencies, this means: invest early in a curated, versioned collection of examples per use case; it is a reusable asset and a differentiator. For B2B decision-makers: demand from your implementation partner an eval set against which examples are tested, as well as a clear separation between few-shot for logic and schema enforcement for format. If you want to build agent workflows with demonstrable output reliability, Blck Alpaca in Vienna supports you with the design, evaluation and productive operation of DACH-compliant agents.

FAQ

How many examples do I need for few-shot prompting with agents?
As a rule of thumb, two to five examples for the general output, and one to three canonical calls per tool definition. More examples rarely yield linearly better results, but they cost tokens and increase the risk that the agent copies the nearest example instead of generalising. What matters is diversity, not quantity: cover the most important variants and edge cases, and avoid duplicates.

When is zero-shot better than few-shot?
Zero-shot is sufficient when the task is simple and the output format uncritical, for instance free-form summaries or simple classification with clear classes. As soon as a specific JSON/tool format, a consistent style or a non-obvious logic is required, few-shot improves reliability measurably. Where format fidelity is mandatory, schema-constrained decoding is the more robust choice than examples alone.

Few-shot prompting or fine-tuning?
Few-shot can be iterated without a training cycle and is ideal as long as requirements keep changing. Fine-tuning only pays off at very high, stable volume, when the example tokens per call become economically significant or when latency is critical. Cognition Labs, for example, trained its own smaller summarisation model on its own trace data for Devin, because generic prompts lost too much detail. For most B2B agents, few-shot is the pragmatic starting point.

What is in-context learning?
In-context learning is the ability of a language model to take on a task directly from examples in the prompt, without model weights being changed. Few-shot prompting is the practical application of this: the examples serve as a temporary instruction for precisely this inference turn. The effect is transient; it applies only to the respective context and must be sent along with every call.

Why do too many examples worsen agent outputs?
Three reasons: first, overfitting to the examples, where the agent copies surface features instead of generalising. Second, context rot, where long contexts degrade the effective model performance, even below the nominal limit. Third, token costs, since every example runs along with every call. In German this is exacerbated, because equivalent text requires 30 to 50 percent more tokens than in English.
