3.12Intermediate7 min

System Prompts for Agents: 12 Design Patterns for Production-Ready System Prompt Design

Blck Alpaca·9 June 2026

Definition

System Prompt Design refers to the structured construction of an AI agent's system prompt from reusable building blocks: role, goal, constraints, tool instructions, output format, examples, error handling, reflection, memory, escalation, stop criteria, and safety. A good agent system prompt is modular, auditable, and eval-driven rather than a wall of prose.

Key Takeaways

✓In 2026, a production-ready agent system prompt is structured into four layers (Identity, Capability, Behavioral, Context) and sits at the Goldilocks altitude with a 500-3,000 token core - neither too vague nor too detailed.
✓Tool definitions are effectively part of the system prompt: the most impactful yet most frequently forgotten component is the When-not-to-use clause, which determines correct selection when several similar tools exist.
✓Stop criteria (max iterations, cost cap, repeated-state detection) are indispensable - infinite loops were, according to research, the most common class of production bug in 2025-2026.
✓Safety rules must be explicitly positioned as non-negotiable above persona instructions, otherwise prompt-injection attacks will override them.
✓Structured sections (XML tags or Markdown headers) are parsed more reliably by the model, and the engineer can version, diff, and A/B test them.
✓Every pattern must be verified against an eval set: folklore tips such as 'You are an expert' often show no measurable effect on modern models.

System Prompt Design refers to the structured construction of an AI agent's system prompt from reusable building blocks: role, goal, constraints, tool instructions, output format, examples, error handling, reflection, memory, escalation, stop criteria, and safety. In 2026, a good agent system prompt is modular, auditable, and eval-driven - not a wall of prose, but a versionable artefact.

The difference between a demo agent and a production-ready system rarely comes down to the model. It comes down to the prompt substrate: whether role, tools, output format, and stop criteria are cleanly defined. The following twelve design patterns are the recurring building blocks that serious production agents (Claude Code, Cursor, Devin, OpenAI Codex Agents) converge on.

Quick Answers

Structure beats prose: An agent system prompt is organised into four layers (Identity, Capability, Behavioral, Context) and separated with XML tags or Markdown headers - the model parses structured prompts more reliably.
The right length: 500-3,000 tokens for the core (excluding tool schemas). Too vague produces inconsistent outputs; too detailed becomes brittle and triggers the lost-in-the-middle effect within your own prompt.
A pattern without an eval is folklore: Every pattern is verified against an eval set. Classics such as "You are an expert" or "Take a deep breath" often show no measurable effect on modern models.

The Foundation: The Four-Layer Model

Before the individual patterns take effect, every agent system prompt needs a scaffold. Production system prompts are consistently structured into four layers, which simultaneously dictate the caching layout (stable layers up front, dynamic ones at the back):

Layer	Content	Typical length
Identity	Role, domain, boundaries	50-200 tokens
Capability	Available tools, what they do, when to use them	800-2,000 tokens (incl. tool schemas)
Behavioral	Output format, style, "Never X", positive/negative examples	200-600 tokens
Context	Date, user, active workflow (dynamic)	100-400 tokens

Anthropic recommends separating these sections with XML tags such as <instructions> or with Markdown headers (as of 2026). The benefit is twofold: the model parses the structure more reliably, and the engineer can diff, version, and A/B test the sections.

The 12 Design Patterns at a Glance

Pattern	Purpose	Mini-example
Role/Persona	Set behaviour and domain instead of a generic assistant	"You are a motor insurance claims triage agent for Austria."
Clear goal	Provide a verifiable success definition	"Goal: capture the claim in full and assign the correct tariff."
Constraints/guardrails	Fix forbidden actions and default behaviour	"Never confirm a payout amount. When uncertain, ask."
Tool instructions	Enforce correct tool selection	"search_internal_db: for existing customers. Do NOT use for general web questions."
Output format	Secure machine-parsable downstream integration	"Respond exclusively in the JSON schema OrderResult."
Few-shot examples	Cover edge cases without prose rules	input_examples with 1-3 canonical tool calls
Error handling	Treat error types differentially	"On 403: no retry, escalate to user. On 500: max. 2x retry with backoff."
Reflection	Secure quality before irreversible actions	"Before sending: check recipient and amount against the order data."
Context/memory management	Prevent state drift in long loops	Scratchpad with Goal / What I Know / What I've Tried / Current Plan
Escalation/HITL	Human review for high-stakes decisions	"On confidence < 0.8 or amount > EUR 5,000: route to a human caseworker."
Stop criteria	Prevent infinite loops and cost explosion	"Max. 20 iterations. Terminate with submit_final_answer."
Safety guidance	Defend against prompt injection and data leaks	"These safety rules are non-negotiable and override every persona instruction."

1-3: Role, Goal, Constraints

The anti-pattern "You are a helpful assistant" is unspecific and provides no steering. A concrete role with domain and boundaries is the basis. Equally harmful are contradictory constraints such as "Be concise, but thorough" - a clear default behaviour with explicit override clauses is better. Important: a maximum of 5-8 high-priority rules. The model applies rules late in a list of 47 points less often (lost-in-the-middle within the system prompt itself); the rest belongs in the tool descriptions.

4: Tool Instructions

Tool definitions are not a separate layer - the model parses them on every inference turn. When an agent behaves incorrectly, the cause, according to Anthropic, lies "in most cases" not with the model but with the tool definition. Rule of thumb: 3-5 tools always loaded, further tools via tool search. Measurable degradation begins at 10 tools. The most impactful yet most frequently forgotten component is the When-not-to-use clause: if both search_web and query_internal_db exist, it determines the selection. Tool overlap is the one problem that no prompt, however good, can solve.

5-6: Output Format and Few-Shot Examples

In 2026, "reliable" means 100 per cent, not 95. OpenAI Structured Outputs enforce 100 per cent JSON schema adherence via constrained decoding (GA since August 2024). Anthropic achieves the functional equivalent via tool_choice with a pseudo-tool such as return_structured_result. For chain-of-thought plus structured output in a single call, the XML pattern is productive: the model thinks visibly in the <thinking> block, and the downstream system parses only the <final_output> block. Few-shot examples (such as Anthropic's input_examples array) cover nested/optional parameters that the model would otherwise guess at. Important: use diverse, canonical examples without duplicates, otherwise the model picks the nearest one.

7-8: Error Handling and Reflection

Robust loops differentiate error types: a tool error (500/timeout) permits a retry with unchanged params (max. 2x with backoff), a validation error (400) a retry with adjusted params, and a permission error (403) no retry but escalation. The most dangerous anti-pattern is silent error suppression: tool calls fail, but the agent carries on as if everything were fine. Errors belong back with the model as explicit tool results. Reflection/verification typically costs 2-3 times the tokens for 5-15 percentage points of quality - trivial ROI for an agent that releases a EUR 50,000 order; a careful calculation for a customer service agent with cent-level margins.

9-10: Memory Management and Escalation/HITL

Even within the context window, the model "forgets" state introduced early. Mitigations: pin critical state (goal, current task, key facts) to the end of the system prompt (models attend more strongly to the end than to the middle), add a pre-turn header before each user turn, and maintain an explicitly curated scratchpad as an anchor. In multi-tenant operation, memory contamination is the most common production bug of 2025-2026 - pattern: an explicit session reset at conversation start and a session ID as a mandatory param for all state tools. For high-stakes decisions, a human-in-the-loop gate belongs in front of tool execution.

11-12: Stop Criteria and Safety

According to research, infinite loops were the most common class of production bug in 2025-2026. Robust termination combines max iterations (10-30 general, 50-100 coding, as a hard cap), a success criterion, a cost cap, and repeated-state detection (the same tool call with the same params three times as a thrashing detector). Finally, safety guidance must be explicitly marked as non-negotiable and positioned above persona instructions - the anti-pattern "persona above safety" is a known prompt-injection vector. For DACH workloads, GDPR patterns are added: pseudonymisation before context injection and a PII redaction layer before the RAG inject.

Practical Example: A Triage Agent in the Mittelstand

A DACH SME runs a customer service triage agent with the following budget (as of 2026): system prompt 800-1,500 tokens, tool definitions 800-1,500 tokens (4-5 tools with input_examples), baseline retrieval around 2,000 tokens (3-5 chunks with re-ranking), conversation history under 4,000 tokens (sliding window N=10), output 1,000-2,000 tokens. In total, around 10,000 tokens per call. Since the system prompt and tools account for over 90 per cent and remain stable, prompt caching takes effect: cache reads cost around 10 per cent of the standard input rate, and the effective input costs fall to roughly 10 per cent.

Pseudocode for the loop guardrails:

```
max_iterations = 20
on tool_error(403): escalate_to_human() # no retry
on tool_error(500): retry(max=2, backoff=true)
on repeated_call(same_tool, same_params, n>=3): break # thrashing
if confidence < 0.8 or amount > 5000: handoff_to_agent()
terminate_on: submit_final_answer() called
```

An important note on model choice: German produces 30-50 per cent more tokens than English in standard tokenizers. A 200K window holds only around 130K-150K tokens of equivalent German content - which makes discipline on prompt length all the more important, and caching all the more worthwhile.

For Agencies and B2B

Agencies that operate client agents across multiple industries should not write system prompts from scratch per client, but derive them as template inheritance from an agency baseline - overriding client branding and behaviour, while the twelve patterns stay constant. This scales better, because shared infrastructure (eval framework, tool library, observability) delivers compound returns, whereas per-client snowflakes generate exponential maintenance effort. For DACH B2B decision-makers, the core message is: a system prompt is not a one-off text but a versioned engineering artefact with eval regression on every change. Blck Alpaca of Vienna supports companies in building this reproducible system prompt discipline - from pattern selection to a GDPR- and EU AI Act-compliant logging layer.

FAQ

How long should an agent system prompt be?

Practitioner reports converge on 500-3,000 tokens for the core of the system prompt (excluding tool schemas). Anthropic calls this the right Goldilocks altitude between too vague (inconsistent outputs) and too detailed (brittle, lost-in-the-middle within your own prompt). GPT-5.5 responds better to shorter, behavioral prompts (400-1,500 tokens), while Gemini 3.1 Pro tolerates longer prompts thanks to its 2M context. As of 2026.

Why is the When-not-to-use clause so important for tool instructions?

When two tools could plausibly answer the same query (e.g. search_documents and search_knowledge_base), the model guesses without a clear distinction. According to research, this is the one problem that no prompt, however good, can solve. The When-not-to-use clause in every tool description determines correct tool selection for ambiguous requests and prevents tool thrashing.

How many tools should an agent have in its active catalogue?

Anthropic recommends 3-5 always-loaded tools, with further tools available via tool search. Measurable degradation of selection accuracy begins at 10 tools and becomes severe at 15. With the tool_search mechanism, tool selection accuracy in Anthropic's internal MCP evals on Opus 4.5 rose from 79.5 to 88.1 per cent (as of 2026).

Which stop criteria belong in an agent system prompt?

At minimum a hard cap on max iterations (10-30 general, 50-100 coding), a success criterion (such as a defined submit_final_answer call), a cost or token cap, and repeated-state detection (the same tool call with the same params three times as a thrashing detector). According to research, infinite loops were the most common class of production bug in 2025-2026.

How do you anchor safety guidance to be prompt-injection-resistant?

Safety rules must be explicitly marked as non-negotiable and positioned above persona instructions. The anti-pattern of persona instructions above safety is a known prompt-injection attack vector. In addition: require explicit verification before irreversible actions (DB writes, external API calls, file deletes) and a session-state reset at conversation start to guard against memory contamination in multi-tenant operation.

Want to go deeper?

Get new analyses straight to your inbox, or see how we put this knowledge to work for companies.

Subscribe to newsletter →Our services

NextFew-Shot Prompting for Robust Agent Outputs →