Pillar 10

Prompt Engineering for AI Agents

Prompt engineering for agents: techniques for system prompts, tool use and reliable behavior of autonomous AI Agents.

Definition

Prompt engineering for agents is the engineering discipline of shaping an AI Agent's entire context window across multiple inference turns so that the system prompt, tool descriptions, retrieved data, and history reliably produce the desired behavior. It goes beyond writing a single prompt and encompasses system-prompt architecture, tool design, planning loops such as ReAct, context-window management, and eval-driven iteration. Since 2025, this expanded understanding has been commonly referred to in the industry as context engineering, which subsumes prompt engineering but does not replace it.

Key Takeaways

  • Prompt engineering for agents has evolved from a single instruction string into architecting the entire context window across multiple turns; according to practitioner reports (Anthropic, Cognition 2026), this context-engineering discipline accounts for roughly 60 to 80 percent of whether an agent runs reliably in production.
  • Tool descriptions are part of the prompt budget and the most common source of errors: with Anthropic's tool_search and defer_loading, tool-selection accuracy rose on Opus 4 from 49 to 74 percent, and on Opus 4.5 from 79.5 to 88.1 percent (Anthropic, November 2025); the production pattern is 3 to 7 always-loaded tools plus dynamic discovery.
  • Reasoning models invert classic prompting practice: OpenAI explicitly advises against chain-of-thought prompts for the o-series; you specify the goal, constraints, and output contract without prescribing every intermediate step (OpenAI Reasoning Best Practices, GPT-5 Prompt Guidance 2026).
  • ReAct (Yao et al. 2022, Reason-Act-Observe) remains the default loop for tool use; on reasoning models it is increasingly being supplanted by interleaved thinking, which collapses several classic ReAct iterations into a single API call.
  • Effective context capacity does not scale linearly with the nominal capacity: Chroma's context-rot study (July 2025, 18 frontier models) shows degradation with increasing length; as a heuristic, the usable capacity is 30 to 50 percent for reasoning-heavy tasks and 60 to 80 percent for retrieval-heavy tasks.
  • Prompt caching is the dominant cost lever: Anthropic cache reads cost roughly 10 percent of the standard input rate (about a 90 percent discount), OpenAI offers around 50 percent; an arXiv study (February 2026) measures 41 to 80 percent lower API costs and 13 to 31 percent shorter time-to-first-token through strategic cache control.
  • Structured-output enforcement achieves 100 percent schema adherence in production in 2026 (OpenAI Structured Outputs since August 2024, Anthropic tool use with JSON schema) and replaces the earlier 'parse JSON and hope'.
  • Eval-driven iteration is non-negotiable: changes to the system prompt, tools, or retrieval are validated against an eval set, not intuitively; many popular prompt tips show no measurable effect in rigorous evals (Husain/Shankar, 'Look at your data').
  • For the DACH region, three hard constraints apply: German tokenization causes 30 to 50 percent higher token costs (and a correspondingly higher caching ROI), GDPR requires PII discipline within the context window, and EU AI Act logging under Art. 12 becomes fully applicable to high-risk systems as of 2 August 2026 (informational, not legal advice).

What is prompt engineering for agents?

Prompt engineering for agents refers to the engineering discipline by which an AI Agent consistently exhibits the desired behavior across multiple inference turns. Whereas classic prompt engineering (2022–2023) meant writing a single, clever instruction string, for an agent the context window is no longer static text but a dynamically composed system state: system prompt, tool definitions, tool results, conversation history, retrieved RAG chunks, scratchpad notes, and structured state.

Andrej Karpathy described this shift on 25 June 2025 as "the delicate art and science of filling the context window with just the right information for the next step," one day after Shopify CEO Tobi Lütke had coined the same term. Anthropic formalized it on 29 September 2025 in "Effective context engineering for AI agents" as "the natural progression of prompt engineering." Important for context: context engineering subsumes prompt engineering, it does not replace it. A good system prompt remains a necessary condition; it is simply no longer sufficient.

This hub page provides an overview of the five central building blocks: system prompts, tool descriptions, planning loops (ReAct), context-window management, and evaluation.

System-prompt architecture

The system prompt is the only piece of context an agent sees in every turn, so it should be treated as code: versioned, reviewable, diffable. In 2026, production system prompts are consistently structured in four layers:

Layer | Content | Typical length
Identity | Role, domain, boundaries ("You are X, responsible for Y") | 50–200 tokens
Capability | Available tools, what they do, when to use them | 800–2,000 tokens (incl. tool schemas)
Behavioral | Output format, style, "Never X," examples | 200–600 tokens
Context | Dynamic: today's date, current user, workflow | 100–400 tokens

Anthropic recommends separating these sections with XML tags or Markdown headers, because the model parses structured prompts more reliably. Anthropic calls the target level of specificity "the right altitude"; for length, practitioner reports converge on 500–3,000 tokens for the core (excluding tool schemas). Both extremes are harmful: overly long prompts with 47 numbered rules cause the model to apply later rules less often (a lost-in-the-middle effect within its own system prompt), while overly vague prompts ("You are a helpful assistant") leave too much to model inference.
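The four-layer structure can be sketched as a small renderer. This is a minimal illustration, not a vendor API: the `PromptLayers` class, the tag names, and the example texts are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PromptLayers:
    """Hypothetical container for the four layers from the table above."""
    identity: str    # role, domain, boundaries
    capability: str  # tools and when to use them
    behavioral: str  # output format, style, hard limits
    context: str     # dynamic per-run facts


def render_system_prompt(layers: PromptLayers) -> str:
    """Wrap each layer in its own XML-style tag, always in the same order."""
    sections = [
        ("identity", layers.identity),
        ("capability", layers.capability),
        ("behavioral", layers.behavioral),
        ("context", layers.context),
    ]
    return "\n\n".join(
        f"<{tag}>\n{text.strip()}\n</{tag}>" for tag, text in sections
    )


prompt = render_system_prompt(
    PromptLayers(
        identity="You are a support agent for the billing domain.",
        capability="Use get_invoice to look up invoices by ID.",
        behavioral="Answer in German, formal address (Sie).",
        context=f"Today's date: {date(2026, 1, 15).isoformat()}",
    )
)
print(prompt)
```

Keeping the dynamic context layer last means the stable layers form an unchanged prefix from run to run, which tends to suit prompt caching; whether that ordering fits a given provider is something to verify rather than assume.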

A central shift concerns absolute rules. OpenAI's GPT-5 guidance (2026) explicitly advises reserving ALWAYS/NEVER for genuine invariants: "for judgment calls, such as when to search, ask for clarification, use a tool, or keep iterating, prefer decision rules instead." For DACH teams, two additional rules apply: always set the response language explicitly (otherwise frontier models default to English for technical content), and pin the level of formality (Sie/Du) explicitly to avoid style drift in the loop.

Tool descriptions: where agents actually fail

When a production agent acts incorrectly, the cause, according to Anthropic's own engineering experience, lies "in most cases" not with the model but with the tool definition. The guiding question is: "If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better."

What is decisive is that tool descriptions are part of the prompt budget. Each tool adds 100–300 tokens "always-on"; a 10-tool catalog costs 1,000–3,000 tokens per call. The most impactful, most often forgotten component is the when-not-to-use clause: if search_web and query_internal_db exist in parallel, this clause determines tool selection. Tool overlap — two tools that could plausibly answer the same query — is the one problem that no prompt, however good, can solve.

The empirical evidence is clear. With Anthropic's tool_search tool and defer_loading: true for rarely used tools, tool-selection accuracy in internal MCP evals rose on Opus 4 from 49% to 74%, and on Opus 4.5 from 79.5% to 88.1% (Anthropic, November 2025), with around 85% token savings. Beyond roughly 10 active tools, measurable degradation begins. The production pattern is therefore 3–7 always-loaded tools plus tool search for the rest.

Further robust conventions: verb-noun names (get_user, send_email), field-level descriptions with semantics, documented return formats and failure modes, search-focused rather than list-all tools, as well as hard response-token limits (Anthropic guideline ~25,000 tokens per tool return). DACH practice: tool names and parameters in English (interoperability), descriptions in the agent's runtime language.
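As an illustration of these conventions, the sketch below shows one tool definition in the common name/description/JSON-schema shape, plus a tiny lint check. The tool itself, its wording, and the `lint_tool` helper are hypothetical; the regex only checks for snake_case with at least two words and cannot tell whether the first word is really a verb.

```python
import re

# Hypothetical tool definition; the description carries a when-not-to-use
# clause, the return format, and the failure mode.
QUERY_INTERNAL_DB = {
    "name": "query_internal_db",
    "description": (
        "Search the internal customer database by customer ID or email. "
        "Returns at most 20 matching records as JSON. "
        "Do not use for public information or current news; "
        "use search_web instead. "
        'On failure returns {"error": "<reason>"}.'
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Customer ID (e.g. 'C-1042') or email address.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of records to return (1-20).",
            },
        },
        "required": ["query"],
    },
}

# snake_case with at least two words, e.g. get_user or send_email
SNAKE_CASE_TWO_WORDS = re.compile(r"^[a-z]+(_[a-z]+)+$")


def lint_tool(tool: dict) -> list[str]:
    """Return a list of convention violations (empty if none were found)."""
    problems = []
    if not SNAKE_CASE_TWO_WORDS.match(tool["name"]):
        problems.append("name is not snake_case with at least two words")
    if "do not use" not in tool["description"].lower():
        problems.append("missing when-not-to-use clause")
    for field_name, spec in tool["input_schema"]["properties"].items():
        if not spec.get("description"):
            problems.append(f"field '{field_name}' has no description")
    return problems


print(lint_tool(QUERY_INTERNAL_DB))
```

A check like this could run in CI over the whole tool catalog; it catches missing clauses, not tool overlap, which still needs human review.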

Planning loops: ReAct and its successors

Planning is the structure of an agent's decision loop. Four patterns dominate in 2026, with clear trade-offs:

Pattern | Idea | Production use 2026
ReAct (Yao et al. 2022) | Reason → Act → Observe → … | Standard default for tool-use agents
Plan-and-Execute | Generate a plan first, then execute | Multi-step workflows, low latency
Reflexion (Shinn et al. 2023) | Generate → critique → revise | Quality-sensitive tasks (2–3× token cost)
Tree of Thoughts (Yao et al. 2023) | Several branches in parallel, merge | Hard reasoning, very expensive, rarely standard

ReAct interleaves reasoning and action (Thought → Action → Observation) and is the robust default for non-reasoning models. With reasoning models the picture shifts: interleaved thinking (Anthropic Claude with Extended Thinking, OpenAI o-series/GPT-5) lets the model re-plan between tool calls and thereby collapses many classic "ReAct-in-a-loop" implementations into a single API call. In practice, production agents use hybrids: a ReAct loop with an explicit planning step at the start and verification before irreversible actions.

Disciplined termination is the most important safeguard — infinite loops were the most common production bug class in 2025–2026. Robust loops combine max-iterations (hard cap, typically 10–30 general, 50–100 for coding), a success criterion (e.g., a submit_final_answer tool), cost caps, repeated-state detection against tool thrashing, and a human-escalation path.
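The termination guards listed above can be combined in one loop. The following is a minimal sketch under stated assumptions: `next_step` and `call_tool` are hypothetical callables standing in for the model call and the tool executor, and cost is modeled as a flat amount per iteration.

```python
from typing import Callable

# A step is a (tool_name, arguments) pair chosen by the model.
Step = tuple[str, str]


def run_agent(
    next_step: Callable[[list[str]], Step],  # hypothetical: wraps the model call
    call_tool: Callable[[str, str], str],    # hypothetical: executes a tool
    step_cost: float = 0.01,                 # assumed flat cost per iteration
    max_iterations: int = 20,
    max_cost: float = 1.0,
    max_repeats: int = 2,
) -> tuple[str, str]:
    """Return (status, detail); status names the guard that ended the run."""
    observations: list[str] = []
    seen: dict[Step, int] = {}
    spent = 0.0
    for _ in range(max_iterations):
        if spent + step_cost > max_cost:       # cost cap
            return "cost_cap", f"spent {spent:.2f}"
        spent += step_cost
        step = next_step(observations)
        tool, args = step
        if tool == "submit_final_answer":      # explicit success criterion
            return "done", args
        seen[step] = seen.get(step, 0) + 1
        if seen[step] > max_repeats:           # identical call again: thrashing
            return "escalate_to_human", f"repeated {tool}({args})"
        observations.append(call_tool(tool, args))
    return "max_iterations", f"stopped after {max_iterations} steps"


def stuck_policy(observations: list[str]) -> Step:
    """A policy that keeps issuing the same call, to show the thrashing guard."""
    return ("search_web", "same query")


status, detail = run_agent(stuck_policy, lambda tool, args: "no results")
print(status, detail)
```

Repeated-state detection here compares only the exact tool call; a production loop would probably also want to compare observations or a hash of the working state.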

Context-window management

Long context windows are available in 2026 (Claude Opus 1M, Gemini 2M tokens), but they are not uniformly usable. Chroma's "Context Rot" study (July 2025, 18 frontier models) shows that all models degrade with increasing input length. Three mechanisms compound: lost-in-the-middle (Liu et al., Stanford/TACL 2024: models attend well to the beginning and end of the input and poorly to the middle), attention dilution, and distractor interference. As a heuristic, the effective capacity is 30–50% of the nominal capacity for reasoning-heavy tasks and 60–80% for retrieval-heavy tasks; filling a 1M window completely costs more and lowers quality.
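Applied to concrete window sizes, the heuristic works out as follows. The percentages are the rule-of-thumb ranges from the paragraph above, not measured values, and the function name is made up for this sketch.

```python
# Heuristic usable share of the nominal window in percent, as (low, high).
USABLE_SHARE_PERCENT = {
    "reasoning": (30, 50),
    "retrieval": (60, 80),
}


def effective_capacity(nominal_tokens: int, workload: str) -> tuple[int, int]:
    """Return the (low, high) token budget the heuristic suggests."""
    low, high = USABLE_SHARE_PERCENT[workload]
    return nominal_tokens * low // 100, nominal_tokens * high // 100


print(effective_capacity(1_000_000, "reasoning"))  # (300000, 500000)
print(effective_capacity(200_000, "retrieval"))    # (120000, 160000)
```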

Lance Martin (LangChain) coined the canonical four-pillar taxonomy for this, which Anthropic and Manus have adopted:

  • Write — persist information outside the window (scratchpads, todo.md, memory store)
  • Select — bring the right tokens in per step (RAG, tool filtering, sub-agent dispatch)
  • Compress — keep only task-relevant tokens (summarization, Anthropic context editing)
  • Isolate — split context across sub-agents and schema fields

Three levers dominate the economics: prompt caching, pruning, and compaction. Prompt caching is the most important: Anthropic cache reads cost roughly 10% of the standard input rate (≈90% discount), OpenAI offers around 50%. An arXiv study (February 2026, "Don't Break the Cache") measures 41–80% lower API costs and 13–31% shorter time-to-first-token across agentic workloads through strategic cache-block control. Pruning removes old turns and stale tool results. Compaction compresses the context once it reaches 70–85% of capacity; in Claude Code this is done via /compact, which according to Anthropic preserves "architectural decisions, unresolved bugs, and implementation details" and discards redundant tool outputs. Sub-agent dispatch acts as a further compaction primitive: a sub-agent explores in its own window and returns only a 1,000–2,000-token summary.
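A back-of-the-envelope model makes the caching lever tangible. This is a deliberately simplified sketch: it assumes a fixed stable prefix and a fixed dynamic part per call, uses the roughly 10% cache-read figure from the text, and ignores cache-write surcharges, cache expiry, and growing history. The token counts are invented for illustration.

```python
CACHE_READ_FACTOR = 0.10  # cache reads at ~10% of the input rate (from the text)


def input_token_units(stable: int, dynamic: int, turns: int, cached: bool) -> float:
    """Input cost in 'full-price token' units over a run of `turns` calls.

    Simplification: the first call pays full price for everything; every
    later call reads the stable prefix from the cache.
    """
    per_call_full = stable + dynamic
    if not cached or turns == 0:
        return float(per_call_full * turns)
    per_call_cached = stable * CACHE_READ_FACTOR + dynamic
    return per_call_full + (turns - 1) * per_call_cached


def should_compact(used_tokens: int, window: int, threshold: float = 0.75) -> bool:
    """Trigger compaction once usage crosses the threshold (text: 70-85%)."""
    return used_tokens >= threshold * window


baseline = input_token_units(8_000, 2_000, turns=20, cached=False)
with_cache = input_token_units(8_000, 2_000, turns=20, cached=True)
savings = 1 - with_cache / baseline
print(f"input cost reduction: {savings:.0%}")
print(should_compact(150_000, 200_000))
```

With these made-up numbers the reduction comes out at about 68 percent of input cost. The larger the stable share of the prompt, the closer the result gets to the 90 percent ceiling; real workloads with growing histories will behave differently.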

Evaluation: if you don't measure, you've done nothing

The most brutal insight for tech leads is: context-engineering changes are validated by evals, not by intuition. Hamel Husain's most-cited advice — "Look at your data" — means concretely: read 50–100 real production traces, label failures freely, cluster them into a taxonomy, write a code eval or an LLM-as-judge eval per common mode, and integrate these into CI/monitoring.
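The label-then-cluster step can start very small. The sketch below assumes traces have already been read and hand-labelled; the trace IDs, label names, and the `min_count` cut-off are hypothetical.

```python
from collections import Counter

# Hypothetical hand-labelled traces: (trace_id, failure_label or None if OK).
labelled_traces = [
    ("t01", "wrong_tool"),
    ("t02", None),
    ("t03", "wrong_tool"),
    ("t04", "invalid_json"),
    ("t05", None),
    ("t06", "wrong_tool"),
    ("t07", "ignored_language_rule"),
    ("t08", "invalid_json"),
]


def failure_taxonomy(traces):
    """Count failure labels, most frequent first; traces without a label are skipped."""
    return Counter(label for _, label in traces if label).most_common()


def modes_worth_an_eval(traces, min_count=2):
    """Failure modes seen at least `min_count` times: candidates for a dedicated eval."""
    return [label for label, n in failure_taxonomy(traces) if n >= min_count]


print(failure_taxonomy(labelled_traces))
print(modes_worth_an_eval(labelled_traces))
```

The point of the frequency cut-off is to spend eval-writing effort on modes that actually recur in the data rather than on one-off failures.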

Husain also warns against pure eval-first development: "Write evaluators for errors you discover, not errors you imagine." The practicable middle path starts with a small end-to-end eval (10–50 representative tasks), iterates, and builds specific sub-evals for real failure modes. A sobering, empirically substantiated finding: many popular prompt tips ("You are an expert," "Think step by step," "I'll tip you $200") show minimal or no improvement in rigorous evals. On reasoning models, step-by-step reasoning is already the default behavior, and adding "think step by step" manually is often counterproductive.

Production maturity means: evals run automatically on every change to context, tools, or retrieval — a PR eval (20–50 tasks) blocks merge, a pre-deploy eval (200–2,000 tasks) blocks deploy, a post-deploy eval on production traces performs drift detection. Structured-output enforcement closes the loop: OpenAI Structured Outputs (GA since August 2024) and Anthropic tool use with JSON schema deliver 100% schema adherence and replace the earlier "parse JSON and hope."
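A PR-level gate of this kind can be sketched in a few lines. Everything here is illustrative: the output schema, the stub agent, the two tasks, and the 0.9 threshold are assumptions, and the schema check is hand-rolled where a real setup would rely on the provider's structured-output enforcement.

```python
from typing import Callable


def validate_output(output: dict) -> bool:
    """Check a hypothetical schema: {'answer': str, 'sources': [str, ...]}."""
    return (
        set(output) == {"answer", "sources"}
        and isinstance(output["answer"], str)
        and isinstance(output["sources"], list)
        and all(isinstance(s, str) for s in output["sources"])
    )


def pass_rate(agent: Callable[[str], dict], tasks: list[tuple[str, str]]) -> float:
    """Share of tasks whose output is schema-valid and contains the expected answer."""
    passed = 0
    for prompt, expected in tasks:
        output = agent(prompt)
        if validate_output(output) and expected in output["answer"]:
            passed += 1
    return passed / len(tasks)


def gate(rate: float, threshold: float = 0.9) -> str:
    """Turn a pass rate into a merge decision."""
    return "merge allowed" if rate >= threshold else "merge blocked"


def stub_agent(prompt: str) -> dict:
    """Stand-in for the real agent; deliberately gets one task wrong."""
    answers = {"2 + 2": "4", "3 * 3": "6"}
    return {"answer": answers.get(prompt, ""), "sources": []}


tasks = [("2 + 2", "4"), ("3 * 3", "9")]
rate = pass_rate(stub_agent, tasks)
print(rate, gate(rate))
```

The same function can serve the pre-deploy stage by swapping in the larger task set and, if desired, a stricter threshold.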

DACH relevance and compliance

For DACH teams (Germany, Austria, Switzerland), three hard engineering constraints come on top. First, German tokenization: compound nouns and inflection produce 30–50% more tokens per equivalent content than English. A 200K window filled with German text therefore carries only as much content as ~130–150K tokens of English would. The result is higher costs, but also a higher caching ROI, because the 90% discount applies to a larger token count.
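The window arithmetic behind the German-tokenization figure is a single division. The sketch below uses the 30–50% overhead range from the text; the function name is made up for illustration.

```python
def english_equivalent_tokens(window_tokens: int, overhead_percent: int) -> int:
    """English tokens that would carry the same content as a full window of German."""
    return window_tokens * 100 // (100 + overhead_percent)


for overhead in (30, 50):
    print(overhead, english_equivalent_tokens(200_000, overhead))
# 30 153846
# 50 133333
```

So a 200K window corresponds to roughly 133K to 154K tokens of English-equivalent content, in line with the approximate range quoted above.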

Second, GDPR discipline within the context window: personal data does not belong in it unfiltered. Common patterns are pseudonymization before context injection (resolving real names only at the tool layer), a PII-redaction layer before RAG injection, and an auditable session-state reset. Third, EU AI Act logging under Art. 12, which becomes fully applicable to high-risk systems as of 2 August 2026 (the applicability is provisional or phased; this classification is informational and not legal advice). On the engineering side this means: system-prompt version, tool-catalog version, retrieved documents (or IDs + hashes), user input, tool calls, tool results, and final output must be persisted in a manner that is both audit-capable and GDPR-deletable, ideally per tool call with a run correlation ID.
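One possible shape for pseudonymization plus an audit record is sketched below. It is an engineering illustration only, not a compliance recipe: the field names, the `PERSON_n` placeholder scheme, and the naive string matching are assumptions, and real PII detection needs far more than a list of known names.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field


def pseudonymize(text: str, names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace known names with placeholders; the mapping stays outside the context."""
    mapping: dict[str, str] = {}
    for i, name in enumerate(names, start=1):
        placeholder = f"PERSON_{i}"
        if name in text:
            text = text.replace(name, placeholder)
            mapping[placeholder] = name
    return text, mapping


def doc_hash(content: str) -> str:
    """SHA-256 hex digest, so the log can reference a document without storing it."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class AuditRecord:
    """Hypothetical per-run audit record with the fields listed in the text."""
    run_id: str
    system_prompt_version: str
    tool_catalog_version: str
    user_input: str  # stored in pseudonymized form
    retrieved_doc_hashes: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    final_output: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


safe_input, mapping = pseudonymize(
    "Anna Schmidt asks about invoice 17.", ["Anna Schmidt"]
)
record = AuditRecord(
    run_id=str(uuid.uuid4()),
    system_prompt_version="v12",
    tool_catalog_version="v4",
    user_input=safe_input,
    retrieved_doc_hashes=[doc_hash("invoice policy text")],
)
# Each tool call carries the run correlation ID.
record.tool_calls.append(
    {"run_id": record.run_id, "tool": "get_invoice", "args": {"id": 17}, "result": "paid"}
)
print(record.to_json())
```

Keeping the placeholder-to-name mapping in a separate store is one way the log itself can stay free of real names; whether that satisfies a given deletion obligation is a legal question this sketch does not answer.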

Outlook and practical note

In 2026, prompt engineering for agents is neither folklore nor a mere rebranding, but the response to the shift from one-shot calls to multi-stage agentic loops. Anyone building a production agent practices this discipline; the only question is whether they do so deliberately and reproducibly, or unconsciously and fragilely. Several fields remain in motion and should be read as a "current snapshot": the reasoning-model conventions, the multi-agent-vs.-single-agent heuristic (read-heavy parallel works, write-heavy does not), as well as prompt-optimization frameworks such as DSPy, which are suitable for narrowly scoped sub-tasks but do not yet form a production standard for complete agent loops.

The practical entry point is undramatic and well documented: pull the system prompt from the vendor playground and version it in Git, validate every output against a schema, enable prompt caching on the stable parts, set hard cost caps, and read 20 real traces weekly. Anthropic's guiding principle captures the entire goal more precisely than any tooling discussion: "Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome." That is engineering. And then: look at your data.

All Articles in this Topic

6 Articles

3.12 System Prompts for Agents: 12 Design Patterns for Production-Ready System Prompt Design

System Prompt Design refers to the structured construction of an AI agent's system prompt from reusable building blocks: role, goal, constraints, tool instructions, output format, examples, error handling, reflection, memory, escalation, stop criteria, and safety. A good agent system prompt is modular, auditable, and eval-driven rather than a wall of prose.

Intermediate · 7 min

3.13 Few-Shot Prompting for Robust Agent Outputs

Few-shot prompting refers to the technique of giving an AI agent a few examples (typically 2 to 5) of correct inputs and outputs within the prompt, so that it adopts the format, style and logic of a task via in-context learning, without the model being retrained. This makes output formats and tool calls considerably more reliable.

Intermediate · 8 min

3.14 Versioning Prompt Templates: A Git Workflow for Prompts

Prompt versioning means treating prompt templates like code: parameterised, separated from application logic, versioned in Git, checked via review, tested against regression through evals and rolled back when needed. This makes prompt changes traceable, reproducible and auditable instead of randomly scattered throughout the code.

Intermediate · 7 min

3.15 Meta-Prompting: When Agents Write Their Own Prompts

Meta-prompting refers to techniques in which an LLM generates, evaluates or improves its own prompts instead of formulating them manually. Rather than trial-and-error, an eval-driven process optimises instructions, examples and output formats programmatically against a test set. Frameworks such as DSPy automate this by treating prompts like compilable code.

Advanced · 7 min

3.16 Prompt Evaluation: Promptfoo, LangSmith, Langfuse Compared (As of 2026)

Prompt evaluation is the systematic, measurable testing of prompts and LLM outputs against a fixed eval set. Methods include rule-based assertions, LLM-as-judge, regression tests and human eval. Tools such as Promptfoo, LangSmith, Langfuse and DeepEval automate the assessment and embed it in CI/CD pipelines, so prompt changes are validated by data rather than intuition.

Advanced · 7 min

3.17 Prompt Injection Defence: 9 Techniques for Production Agents

Prompt injection defence is the multi-layered protection of AI agents against manipulated inputs that smuggle in instructions. Because language models cannot reliably separate instruction from data, effective defence combines instruction/data separation, least-privilege tools, output filters, human-in-the-loop and monitoring rather than relying on a single guardrail.

Advanced · 7 min
Prompt Engineering for AI Agents | Blck Alpaca