System Prompts for Agents: 12 Design Patterns for Production-Ready System Prompt Design
System Prompt Design refers to the structured construction of an AI agent's system prompt from reusable building blocks: role, goal, constraints, tool instructions, output format, examples, error handling, reflection, memory, escalation, stop criteria, and safety. A good agent system prompt is modular, auditable, and eval-driven rather than a wall of prose.
Key Takeaways
- ✓In 2026, a production-ready agent system prompt is structured into four layers (Identity, Capability, Behavioral, Context) and sits at the Goldilocks altitude with a 500-3,000 token core - neither too vague nor too detailed.
- ✓Tool definitions are effectively part of the system prompt: the most impactful yet most frequently forgotten component is the When-not-to-use clause, which determines correct selection when several similar tools exist.
- ✓Stop criteria (max iterations, cost cap, repeated-state detection) are indispensable - infinite loops were, according to research, the most common class of production bug in 2025-2026.
- ✓Safety rules must be explicitly positioned as non-negotiable above persona instructions, otherwise prompt-injection attacks will override them.
- ✓Structured sections (XML tags or Markdown headers) are parsed more reliably by the model, and the engineer can version, diff, and A/B test them.
- ✓Every pattern must be verified against an eval set: folklore tips such as 'You are an expert' often show no measurable effect on modern models.
System Prompt Design refers to the structured construction of an AI agent's system prompt from reusable building blocks: role, goal, constraints, tool instructions, output format, examples, error handling, reflection, memory, escalation, stop criteria, and safety. In 2026, a good agent system prompt is modular, auditable, and eval-driven - not a wall of prose, but a versionable artefact.
The difference between a demo agent and a production-ready system rarely comes down to the model. It comes down to the prompt substrate: whether role, tools, output format, and stop criteria are cleanly defined. The following twelve design patterns are the recurring building blocks that serious production agents (Claude Code, Cursor, Devin, OpenAI Codex Agents) converge on.
Quick Answers
- Structure beats prose: An agent system prompt is organised into four layers (Identity, Capability, Behavioral, Context) and separated with XML tags or Markdown headers - the model parses structured prompts more reliably.
- The right length: 500-3,000 tokens for the core (excluding tool schemas). Too vague produces inconsistent outputs; too detailed becomes brittle and triggers the lost-in-the-middle effect within your own prompt.
- A pattern without an eval is folklore: Every pattern is verified against an eval set. Classics such as "You are an expert" or "Take a deep breath" often show no measurable effect on modern models.
The Foundation: The Four-Layer Model
Before the individual patterns take effect, every agent system prompt needs a scaffold. Production system prompts are consistently structured into four layers, which simultaneously dictate the caching layout (stable layers up front, dynamic ones at the back):
Layer | Content | Typical length |
|---|---|---|
Identity | Role, domain, boundaries | 50-200 tokens |
Capability | Available tools, what they do, when to use them | 800-2,000 tokens (incl. tool schemas) |
Behavioral | Output format, style, "Never X", positive/negative examples | 200-600 tokens |
Context | Date, user, active workflow (dynamic) | 100-400 tokens |
Anthropic recommends separating these sections with XML tags such as <instructions> or with Markdown headers (as of 2026). The benefit is twofold: the model parses the structure more reliably, and the engineer can diff, version, and A/B test the sections.
The 12 Design Patterns at a Glance
Pattern | Purpose | Mini-example |
|---|---|---|
| Set behaviour and domain instead of a generic assistant | "You are a motor insurance claims triage agent for Austria." |
| Provide a verifiable success definition | "Goal: capture the claim in full and assign the correct tariff." |
| Fix forbidden actions and default behaviour | "Never confirm a payout amount. When uncertain, ask." |
| Enforce correct tool selection | "search_internal_db: for existing customers. Do NOT use for general web questions." |
| Secure machine-parsable downstream integration | "Respond exclusively in the JSON schema OrderResult." |
| Cover edge cases without prose rules | input_examples with 1-3 canonical tool calls |
| Treat error types differentially | "On 403: no retry, escalate to user. On 500: max. 2x retry with backoff." |
| Secure quality before irreversible actions | "Before sending: check recipient and amount against the order data." |
| Prevent state drift in long loops | Scratchpad with Goal / What I Know / What I've Tried / Current Plan |
| Human review for high-stakes decisions | "On confidence < 0.8 or amount > EUR 5,000: route to a human caseworker." |
| Prevent infinite loops and cost explosion | "Max. 20 iterations. Terminate with submit_final_answer." |
| Defend against prompt injection and data leaks | "These safety rules are non-negotiable and override every persona instruction." |
1-3: Role, Goal, Constraints
The anti-pattern "You are a helpful assistant" is unspecific and provides no steering. A concrete role with domain and boundaries is the basis. Equally harmful are contradictory constraints such as "Be concise, but thorough" - a clear default behaviour with explicit override clauses is better. Important: a maximum of 5-8 high-priority rules. The model applies rules late in a list of 47 points less often (lost-in-the-middle within the system prompt itself); the rest belongs in the tool descriptions.
4: Tool Instructions
Tool definitions are not a separate layer - the model parses them on every inference turn. When an agent behaves incorrectly, the cause, according to Anthropic, lies "in most cases" not with the model but with the tool definition. Rule of thumb: 3-5 tools always loaded, further tools via tool search. Measurable degradation begins at 10 tools. The most impactful yet most frequently forgotten component is the When-not-to-use clause: if both search_web and query_internal_db exist, it determines the selection. Tool overlap is the one problem that no prompt, however good, can solve.
5-6: Output Format and Few-Shot Examples
In 2026, "reliable" means 100 per cent, not 95. OpenAI Structured Outputs enforce 100 per cent JSON schema adherence via constrained decoding (GA since August 2024). Anthropic achieves the functional equivalent via tool_choice with a pseudo-tool such as return_structured_result. For chain-of-thought plus structured output in a single call, the XML pattern is productive: the model thinks visibly in the <thinking> block, and the downstream system parses only the <final_output> block. Few-shot examples (such as Anthropic's input_examples array) cover nested/optional parameters that the model would otherwise guess at. Important: use diverse, canonical examples without duplicates, otherwise the model picks the nearest one.
7-8: Error Handling and Reflection
Robust loops differentiate error types: a tool error (500/timeout) permits a retry with unchanged params (max. 2x with backoff), a validation error (400) a retry with adjusted params, and a permission error (403) no retry but escalation. The most dangerous anti-pattern is silent error suppression: tool calls fail, but the agent carries on as if everything were fine. Errors belong back with the model as explicit tool results. Reflection/verification typically costs 2-3 times the tokens for 5-15 percentage points of quality - trivial ROI for an agent that releases a EUR 50,000 order; a careful calculation for a customer service agent with cent-level margins.
9-10: Memory Management and Escalation/HITL
Even within the context window, the model "forgets" state introduced early. Mitigations: pin critical state (goal, current task, key facts) to the end of the system prompt (models attend more strongly to the end than to the middle), add a pre-turn header before each user turn, and maintain an explicitly curated scratchpad as an anchor. In multi-tenant operation, memory contamination is the most common production bug of 2025-2026 - pattern: an explicit session reset at conversation start and a session ID as a mandatory param for all state tools. For high-stakes decisions, a human-in-the-loop gate belongs in front of tool execution.
11-12: Stop Criteria and Safety
According to research, infinite loops were the most common class of production bug in 2025-2026. Robust termination combines max iterations (10-30 general, 50-100 coding, as a hard cap), a success criterion, a cost cap, and repeated-state detection (the same tool call with the same params three times as a thrashing detector). Finally, safety guidance must be explicitly marked as non-negotiable and positioned above persona instructions - the anti-pattern "persona above safety" is a known prompt-injection vector. For DACH workloads, GDPR patterns are added: pseudonymisation before context injection and a PII redaction layer before the RAG inject.
Practical Example: A Triage Agent in the Mittelstand
A DACH SME runs a customer service triage agent with the following budget (as of 2026): system prompt 800-1,500 tokens, tool definitions 800-1,500 tokens (4-5 tools with input_examples), baseline retrieval around 2,000 tokens (3-5 chunks with re-ranking), conversation history under 4,000 tokens (sliding window N=10), output 1,000-2,000 tokens. In total, around 10,000 tokens per call. Since the system prompt and tools account for over 90 per cent and remain stable, prompt caching takes effect: cache reads cost around 10 per cent of the standard input rate, and the effective input costs fall to roughly 10 per cent.
Pseudocode for the loop guardrails:
```
max_iterations = 20
on tool_error(403): escalate_to_human() # no retry
on tool_error(500): retry(max=2, backoff=true)
on repeated_call(same_tool, same_params, n>=3): break # thrashing
if confidence < 0.8 or amount > 5000: handoff_to_agent()
terminate_on: submit_final_answer() called
```
An important note on model choice: German produces 30-50 per cent more tokens than English in standard tokenizers. A 200K window holds only around 130K-150K tokens of equivalent German content - which makes discipline on prompt length all the more important, and caching all the more worthwhile.
For Agencies and B2B
Agencies that operate client agents across multiple industries should not write system prompts from scratch per client, but derive them as template inheritance from an agency baseline - overriding client branding and behaviour, while the twelve patterns stay constant. This scales better, because shared infrastructure (eval framework, tool library, observability) delivers compound returns, whereas per-client snowflakes generate exponential maintenance effort. For DACH B2B decision-makers, the core message is: a system prompt is not a one-off text but a versioned engineering artefact with eval regression on every change. Blck Alpaca of Vienna supports companies in building this reproducible system prompt discipline - from pattern selection to a GDPR- and EU AI Act-compliant logging layer.
FAQ
How long should an agent system prompt be?
Why is the When-not-to-use clause so important for tool instructions?
How many tools should an agent have in its active catalogue?
Which stop criteria belong in an agent system prompt?
How do you anchor safety guidance to be prompt-injection-resistant?
Want to go deeper?
Get new analyses straight to your inbox – or see how we put this knowledge to work for companies.