16.8Advanced6 min

Red-Teaming for AI Agents: Uncovering Vulnerabilities Systematically

Blck Alpaca·9 June 2026

Definition

Red-teaming for AI agents refers to the systematic, simulated attacking of AI agents in order to uncover vulnerabilities such as prompt injection, jailbreaks, tool misuse and data exfiltration before real attackers exploit them. It combines automated attack tools with manual, multi-stage attack creativity and delivers measurable findings such as attack success rates rather than binary vulnerability lists.

Key Takeaways

✓AI red teaming is not the same as a classic penetration test: different skills (ML literacy, prompt engineering, multi-stage attack fluency), different tools (Garak, PyRIT, DeepTeam instead of Burp/Metasploit) and different finding formats (probabilistic attack success rates instead of binary CVE findings).
✓Automated scanning (Garak, PyRIT, DeepTeam, Promptfoo) covers broad, known attack classes cost-effectively; manual red-teaming finds the multi-stage, context-specific attacks (boiling-frog drift, A2A session smuggling, delayed tool invocation) that scanners miss.
✓Rule of thumb for frequency: a quarterly baseline plus trigger-driven runs - before every new tool integration with destructive permissions, after every substantial prompt/system message change, after every model upgrade, and ad hoc whenever a CVE or PoC affects your own stack.
✓Reporting should be measurable and framework-mapped: attack success rate, detection rate, time-to-detection and blast radius, mapped to the OWASP Agentic Top 10 (ASI01-ASI10), MITRE ATLAS and - emerging, as of 2026 - AIVSS as well as AVID-compatible records.
✓Every published guardrail has been bypassed by competent researchers within months (EchoLeak against Microsoft's XPIA classifier, CamoLeak against GitHub-side filters); red-teaming is therefore mandatory evidence, not a marketing argument - 'OWASP-compliant' is not a meaningful claim, because OWASP does not certify.
✓For DACH financial service providers, red-teaming is effectively regulated: DORA requires threat-led penetration testing (Art. 24-27), and the BaFin guidance (18 Dec 2025) explicitly recommends adversarial penetration tests.

This article is part of the hub "AI agent security according to OWASP" and makes concrete how you can offensively test the risks catalogued in the OWASP Agentic Top 10 (ASI01-ASI10).

Three quick answers

What is being attacked? The agent-specific attack surface - not just the model, but tool calls, persistent memory, inter-agent communication and the human-in-the-loop. The goal is to uncover goal hijack, tool misuse, memory poisoning and data exfiltration.
With what? Open-source scanners (Garak, PyRIT, DeepTeam, Promptfoo) for breadth, manual red-teaming for the multi-stage attacks, plus bug bounty for long-term coverage.
How often? A quarterly baseline plus trigger-driven (new tools, prompt changes, model upgrades, acute CVEs).

Why AI red teaming is not a classic pentest

The most important distinction first: AI red teaming is not synonymous with a traditional penetration test. DACH procurement teams regularly confuse the two - with costly consequences when a Burp Suite pentest is accepted as "AI security evidence".

The differences are fundamental:

Dimension	Classic pentest	AI red teaming
Required skills	network/AppSec knowledge	ML literacy, prompt-engineering creativity, multi-stage adversarial fluency
Typical tools	Burp, Metasploit, Cobalt Strike	Garak, PyRIT, DeepTeam, Promptfoo
Finding format	binary CVE-style findings	probabilistic attack success rates
Attack model	mostly single-shot exploit	multi-stage, multi-turn campaigns
Goal	code/configuration errors	behavioural and context manipulation

An AI agent often does not "break" in the classic sense at all - it follows instructions it has been deceived into believing are legitimate. Because agents and the underlying model cannot reliably distinguish instructions from data, every piece of text the agent reads is part of the attack surface. This demands a different testing discipline.

Which vulnerabilities a red team specifically looks for

Red-teaming works through the agent-specific threat classes of the OWASP Agentic Top 10. In practice, this means deliberately constructing the following attacks:

Prompt injection / goal hijack (ASI01). Direct and indirect injection - hidden instructions in documents, RAG corpus, emails, calendar invitations, PR descriptions or tool outputs. Particularly insidious: the "boiling-frog" multi-turn drift, in which each individual step appears plausible but the cumulative trajectory is malicious.
Jailbreaks. Bypassing the safety guardrails in order to trigger prohibited actions or content.
Tool misuse (ASI02). A legitimate function (e.g. send_email) is repurposed; auto-approve or "YOLO" modes that disable confirmation prompts are exploited.
Memory and context poisoning (ASI06). Content injected once permanently poisons the persistent memory; "delayed tool invocation" only fires weeks later on a trigger word.
Data exfiltration. Exfiltration via manipulated Markdown images, redirect chains or abused proxies.
Inter-agent attacks (ASI07) and human-agent trust exploitation (ASI09). Forged messages between agents, as well as deliberately undermining the human approval layer through confidently formulated but manipulated recommendations.

That these attacks are real is demonstrated by documented incidents: EchoLeak (CVE-2025-32711, CVSS 9.3) was the first real-world zero-click prompt injection in a production LLM system and bypassed Microsoft's XPIA classifier (Cross-Prompt Injection Attempt). CamoLeak (CVSS 9.6) exfiltrated private repository secrets and source code character by character via GitHub's own Camo image proxy. Both show: every published guardrail has been bypassed by competent researchers within months.

Approach: automated vs manual

Mature red-teaming combines two modes that complement each other.

Automated means scale and repeatability. Scanners run broad probe libraries against the agent and measure what proportion gets through. They are well suited to continuous integration into CI/CD and to regression testing after every change. Weakness: they predominantly find known attack patterns.

Manual means creativity and context. Experienced analysts construct multi-stage campaigns tailored to the specific agent - precisely those attacks that scanners miss. Examples from research: the "agent session smuggling" against Google's A2A protocol (Palo Alto Unit 42, November 2025) is not a single-shot injection but a sustained agent-against-agent social-engineering campaign. The Google Gemini memory attack (Johann Rehberger, February 2025) used "delayed tool invocation" to poison the memory on a time delay.

Tools and frameworks (as of 2026)

The following tools are the building blocks established in practice. Version and market details as of 2026.

Tool	Origin	Classification
Garak	NVIDIA (originally Leon Derczynski)	LLM vulnerability scanner with broad probe library
PyRIT	Microsoft AI Red Team	Python Risk Identification Tool, extensible
DeepEval / DeepTeam	Confident AI	supports the OWASP_ASI_2026 framework as a plug-in
Promptfoo Red Team	Promptfoo	listed by OWASP as a GenAI security solution
Spikee	Community	spike testing for LLM apps
MAESTRO Threat Analyzer	Cloud Security Alliance	AI-assisted threat modelling (not a pure red-team tool)

Commercial: Lakera Red (Swiss vendor, DACH-relevant), HiddenLayer AIDR, Robust Intelligence (Cisco), Trustwise and Cranium.

Bug bounty with an AI scope: HackerOne (GitHub used HackerOne for the CamoLeak disclosure), Bugcrowd and the EU-based, DACH-friendly Intigriti.

Reporting: measurable and framework-mapped

The value of a red-teaming exercise stands or falls with the report. Because findings are probabilistic, they must be quantified. Useful metrics:

Attack success rate - the proportion of successful attacks per class.
Detection rate - how many attacks the monitoring detected.
Time-to-detection - how long until detection.
Blast radius - how many downstream agents/systems would be affected.

Every finding should be mapped to established frameworks: the OWASP Agentic Top 10 (ASI01-ASI10) as a risk register, MITRE ATLAS as an adversary playbook (with the honest caveat that ATLAS lags the agentic frontier by 6-12 months, especially for ASI07, ASI08 and ASI10) and - emerging, as of 2026 - AIVSS (version 0.8, March 2026) for quantitative scoring. Findings can also be structured in an AVID-compatible way, which makes them usable as reproducible audit evidence for ISO 42001 A.5 and EU AI Act Art. 9. Important for procurement: "OWASP-compliant" is not a meaningful claim - OWASP does not certify.

How often and who

Frequency (rule of thumb): a quarterly baseline plus trigger-driven - before every new tool integration with destructive permissions, after every substantial prompt/system message change, after every model version upgrade and ad hoc as soon as a CVE or PoC affects your own stack.

Who: corporations with their own agent stack over sensitive data maintain a dedicated AI red team (in-house or retained externally) that works with Garak, PyRIT and DeepTeam against the OWASP_ASI_2026 framework, plus a bug-bounty programme. Mid-market deployers of managed-API agents outsource red-teaming to specialised providers, because the ML-specific skills are lacking internally.

For DACH financial service providers, this is effectively regulated: DORA (Art. 24-27) requires threat-led penetration testing, and the BaFin guidance of 18 Dec 2025 explicitly recommends adversarial penetration tests as well as the simulation of attacks (data poisoning, evasion). Both are formally non-binding, but in audits they effectively reverse the burden of proof.

Practical example with numbers

An insurer operates a multi-agent workflow for claims processing. An internal red team constructs an indirect injection: in a scanned copy of a doctor's letter, an instruction is hidden as OCR-reconstructable text to automatically approve cases of certain categories. The "risk-scoring" agent adopts the manipulated assessment and passes it on to the "pricing" and "compliance" agents.

The measured metrics: attack success rate of the injection 1 out of 1 (successful), time-to-detection > 4 hours, blast radius 3 downstream agents. For comparison: Galileo AI research (December 2025) showed in simulated multi-agent systems that a single compromised agent poisoned 87% of downstream decision-making within 4 hours. A documented manufacturing procurement incident (2025): an agent was gradually convinced over three weeks that its approval limit was USD 500,000 - the attacker subsequently placed USD 5 million in fraudulent orders across 10 transactions. Such findings translate abstract risks into board-ready numbers.

Practical checklist

Define scope: which tools, memory, inter-agent paths and HITL gates are in scope?
Derive threat-model-informed scenarios from OWASP ASI01-ASI10.
Anchor an automated baseline scan (Garak/PyRIT/DeepTeam) in CI/CD.
Add manual, multi-stage campaigns (boiling-frog, A2A, delayed invocation).
Collect metrics: attack success rate, detection rate, time-to-detection, blast radius.
Map findings to OWASP/MITRE ATLAS/AIVSS, document in an AVID-compatible way.
"Test injections" to check whether the human-in-the-loop actually takes effect.
Set a cadence: quarterly plus trigger-driven.

For agencies and B2B decision-makers

Anyone who builds agents for clients or deploys them in their own marketing and sales stack should understand red-teaming as a fixed component of the delivery and operations process - not as a one-off acceptance test. For agencies it is also a differentiator: demonstrable attack success rates and OWASP-mapped reporting build trust with DACH clients that "we use guardrails" cannot achieve. Blck Alpaca supports you in building a red-teaming setup that matches your organisation's maturity and regulatory situation - from tool selection through scenario development to an audit-ready report. Talk to us before your agent goes into production.

FAQ

What is the difference between AI red teaming and a classic penetration test?

A classic pentest looks for binary vulnerabilities (an open port, SQL injection) using tools such as Burp, Metasploit or Cobalt Strike. AI red teaming requires ML literacy, prompt-engineering creativity and multi-stage adversarial fluency, uses tools such as Garak, PyRIT and DeepTeam, and delivers probabilistic findings (attack success rates) rather than binary CVE findings. DACH procurement teams regularly confuse the two - they complement each other but do not replace one another.

How often should an AI agent be subjected to red-teaming?

As a rule of thumb: a quarterly baseline plus trigger-driven runs. Triggers are a new tool integration with destructive permissions, a substantial change to a prompt or system message, a model version upgrade, and ad hoc as soon as a CVE or proof-of-concept affects your own stack. For financial firms subject to DORA, formal threat-led penetration testing cycles are added on top.

Which tools are suitable for red-teaming AI agents?

Open source: Garak (NVIDIA, broad probe library), PyRIT (Microsoft AI Red Team), DeepEval/DeepTeam (Confident AI, supports the OWASP_ASI_2026 framework as a plug-in), Promptfoo Red Team and Spikee. Commercial: Lakera Red, HiddenLayer AIDR, Robust Intelligence (Cisco), Trustwise and Cranium. Bug-bounty platforms with an AI scope are HackerOne, Bugcrowd and the EU-based Intigriti. All version and market details as of 2026.

Is automated red-teaming sufficient?

No. Automated scanners cover broad, known attack classes quickly and cheaply and are well suited to continuous CI/CD integration. The most serious agent attacks, however, are multi-stage and context-specific - such as boiling-frog goal drift, A2A session smuggling or delayed tool invocation that only fires weeks later. Only manual red-teaming by experienced analysts finds these. Best practice is to combine both approaches.

Who should carry out the red-teaming - internal or external?

Both are legitimate and depend on maturity and budget. Corporations with their own agent stack over sensitive data typically maintain a dedicated AI red team (in-house or retained externally), quarterly plus change-triggered, supplemented by a bug-bounty programme with an explicit AI scope. Mid-market deployers of managed-API agents usually outsource red-teaming to specialised providers, because the ML-specific skills are lacking internally.

Want to go deeper?

Get new analyses straight to your inbox, or see how we put this knowledge to work for companies.

Subscribe to newsletter →Our services

Previous← Designing Human-in-the-Loop (HITL) Correctly: Approval Patterns for AI Agents NextAI Agent Monitoring with LangSmith and Langfuse: Observability for Secure AI Agents →