Red-Teaming for AI Agents: Uncovering Vulnerabilities Systematically
Red-teaming for AI agents refers to the systematic, simulated attacking of AI agents in order to uncover vulnerabilities such as prompt injection, jailbreaks, tool misuse and data exfiltration before real attackers exploit them. It combines automated attack tools with manual, multi-stage attack creativity and delivers measurable findings such as attack success rates rather than binary vulnerability lists.
Key Takeaways
- ✓AI red teaming is not the same as a classic penetration test: different skills (ML literacy, prompt engineering, multi-stage attack fluency), different tools (Garak, PyRIT, DeepTeam instead of Burp/Metasploit) and different finding formats (probabilistic attack success rates instead of binary CVE findings).
- ✓Automated scanning (Garak, PyRIT, DeepTeam, Promptfoo) covers broad, known attack classes cost-effectively; manual red-teaming finds the multi-stage, context-specific attacks (boiling-frog drift, A2A session smuggling, delayed tool invocation) that scanners miss.
- ✓Rule of thumb for frequency: a quarterly baseline plus trigger-driven runs - before every new tool integration with destructive permissions, after every substantial prompt/system message change, after every model upgrade, and ad hoc whenever a CVE or PoC affects your own stack.
- ✓Reporting should be measurable and framework-mapped: attack success rate, detection rate, time-to-detection and blast radius, mapped to the OWASP Agentic Top 10 (ASI01-ASI10), MITRE ATLAS and - emerging, as of 2026 - AIVSS as well as AVID-compatible records.
- ✓Every published guardrail has been bypassed by competent researchers within months (EchoLeak against Microsoft's XPIA classifier, CamoLeak against GitHub-side filters); red-teaming is therefore mandatory evidence, not a marketing argument - 'OWASP-compliant' is not a meaningful claim, because OWASP does not certify.
- ✓For DACH financial service providers, red-teaming is effectively regulated: DORA requires threat-led penetration testing (Art. 24-27), and the BaFin guidance (18 Dec 2025) explicitly recommends adversarial penetration tests.
Red-teaming for AI agents refers to the systematic, simulated attacking of AI agents in order to uncover vulnerabilities such as prompt injection, jailbreaks, tool misuse and data exfiltration before real attackers exploit them. It combines automated attack tools with manual, multi-stage attack creativity and delivers measurable findings such as attack success rates rather than binary vulnerability lists. Unlike a functional test, it does not ask "does the agent work?", but "how can it be abused against its actual purpose?".
This article is part of the hub "AI agent security according to OWASP" and makes concrete how you can offensively test the risks catalogued in the OWASP Agentic Top 10 (ASI01-ASI10).
Three quick answers
- What is being attacked? The agent-specific attack surface - not just the model, but tool calls, persistent memory, inter-agent communication and the human-in-the-loop. The goal is to uncover goal hijack, tool misuse, memory poisoning and data exfiltration.
- With what? Open-source scanners (Garak, PyRIT, DeepTeam, Promptfoo) for breadth, manual red-teaming for the multi-stage attacks, plus bug bounty for long-term coverage.
- How often? A quarterly baseline plus trigger-driven (new tools, prompt changes, model upgrades, acute CVEs).
Why AI red teaming is not a classic pentest
The most important distinction first: AI red teaming is not synonymous with a traditional penetration test. DACH procurement teams regularly confuse the two - with costly consequences when a Burp Suite pentest is accepted as "AI security evidence".
The differences are fundamental:
Dimension | Classic pentest | AI red teaming |
|---|---|---|
Required skills | network/AppSec knowledge | ML literacy, prompt-engineering creativity, multi-stage adversarial fluency |
Typical tools | Burp, Metasploit, Cobalt Strike | Garak, PyRIT, DeepTeam, Promptfoo |
Finding format | binary CVE-style findings | probabilistic attack success rates |
Attack model | mostly single-shot exploit | multi-stage, multi-turn campaigns |
Goal | code/configuration errors | behavioural and context manipulation |
An AI agent often does not "break" in the classic sense at all - it follows instructions it has been deceived into believing are legitimate. Because agents and the underlying model cannot reliably distinguish instructions from data, every piece of text the agent reads is part of the attack surface. This demands a different testing discipline.
Which vulnerabilities a red team specifically looks for
Red-teaming works through the agent-specific threat classes of the OWASP Agentic Top 10. In practice, this means deliberately constructing the following attacks:
- Prompt injection / goal hijack (ASI01). Direct and indirect injection - hidden instructions in documents, RAG corpus, emails, calendar invitations, PR descriptions or tool outputs. Particularly insidious: the "boiling-frog" multi-turn drift, in which each individual step appears plausible but the cumulative trajectory is malicious.
- Jailbreaks. Bypassing the safety guardrails in order to trigger prohibited actions or content.
- Tool misuse (ASI02). A legitimate function (e.g.
send_email) is repurposed; auto-approve or "YOLO" modes that disable confirmation prompts are exploited. - Memory and context poisoning (ASI06). Content injected once permanently poisons the persistent memory; "delayed tool invocation" only fires weeks later on a trigger word.
- Data exfiltration. Exfiltration via manipulated Markdown images, redirect chains or abused proxies.
- Inter-agent attacks (ASI07) and human-agent trust exploitation (ASI09). Forged messages between agents, as well as deliberately undermining the human approval layer through confidently formulated but manipulated recommendations.
That these attacks are real is demonstrated by documented incidents: EchoLeak (CVE-2025-32711, CVSS 9.3) was the first real-world zero-click prompt injection in a production LLM system and bypassed Microsoft's XPIA classifier (Cross-Prompt Injection Attempt). CamoLeak (CVSS 9.6) exfiltrated private repository secrets and source code character by character via GitHub's own Camo image proxy. Both show: every published guardrail has been bypassed by competent researchers within months.
Approach: automated vs manual
Mature red-teaming combines two modes that complement each other.
Automated means scale and repeatability. Scanners run broad probe libraries against the agent and measure what proportion gets through. They are well suited to continuous integration into CI/CD and to regression testing after every change. Weakness: they predominantly find known attack patterns.
Manual means creativity and context. Experienced analysts construct multi-stage campaigns tailored to the specific agent - precisely those attacks that scanners miss. Examples from research: the "agent session smuggling" against Google's A2A protocol (Palo Alto Unit 42, November 2025) is not a single-shot injection but a sustained agent-against-agent social-engineering campaign. The Google Gemini memory attack (Johann Rehberger, February 2025) used "delayed tool invocation" to poison the memory on a time delay.
Tools and frameworks (as of 2026)
The following tools are the building blocks established in practice. Version and market details as of 2026.
Tool | Origin | Classification |
|---|---|---|
Garak | NVIDIA (originally Leon Derczynski) | LLM vulnerability scanner with broad probe library |
PyRIT | Microsoft AI Red Team | Python Risk Identification Tool, extensible |
DeepEval / DeepTeam | Confident AI | supports the OWASP_ASI_2026 framework as a plug-in |
Promptfoo Red Team | Promptfoo | listed by OWASP as a GenAI security solution |
Spikee | Community | spike testing for LLM apps |
MAESTRO Threat Analyzer | Cloud Security Alliance | AI-assisted threat modelling (not a pure red-team tool) |
Commercial: Lakera Red (Swiss vendor, DACH-relevant), HiddenLayer AIDR, Robust Intelligence (Cisco), Trustwise and Cranium.
Bug bounty with an AI scope: HackerOne (GitHub used HackerOne for the CamoLeak disclosure), Bugcrowd and the EU-based, DACH-friendly Intigriti.
Reporting: measurable and framework-mapped
The value of a red-teaming exercise stands or falls with the report. Because findings are probabilistic, they must be quantified. Useful metrics:
- Attack success rate - the proportion of successful attacks per class.
- Detection rate - how many attacks the monitoring detected.
- Time-to-detection - how long until detection.
- Blast radius - how many downstream agents/systems would be affected.
Every finding should be mapped to established frameworks: the OWASP Agentic Top 10 (ASI01-ASI10) as a risk register, MITRE ATLAS as an adversary playbook (with the honest caveat that ATLAS lags the agentic frontier by 6-12 months, especially for ASI07, ASI08 and ASI10) and - emerging, as of 2026 - AIVSS (version 0.8, March 2026) for quantitative scoring. Findings can also be structured in an AVID-compatible way, which makes them usable as reproducible audit evidence for ISO 42001 A.5 and EU AI Act Art. 9. Important for procurement: "OWASP-compliant" is not a meaningful claim - OWASP does not certify.
How often and who
Frequency (rule of thumb): a quarterly baseline plus trigger-driven - before every new tool integration with destructive permissions, after every substantial prompt/system message change, after every model version upgrade and ad hoc as soon as a CVE or PoC affects your own stack.
Who: corporations with their own agent stack over sensitive data maintain a dedicated AI red team (in-house or retained externally) that works with Garak, PyRIT and DeepTeam against the OWASP_ASI_2026 framework, plus a bug-bounty programme. Mid-market deployers of managed-API agents outsource red-teaming to specialised providers, because the ML-specific skills are lacking internally.
For DACH financial service providers, this is effectively regulated: DORA (Art. 24-27) requires threat-led penetration testing, and the BaFin guidance of 18 Dec 2025 explicitly recommends adversarial penetration tests as well as the simulation of attacks (data poisoning, evasion). Both are formally non-binding, but in audits they effectively reverse the burden of proof.
Practical example with numbers
An insurer operates a multi-agent workflow for claims processing. An internal red team constructs an indirect injection: in a scanned copy of a doctor's letter, an instruction is hidden as OCR-reconstructable text to automatically approve cases of certain categories. The "risk-scoring" agent adopts the manipulated assessment and passes it on to the "pricing" and "compliance" agents.
The measured metrics: attack success rate of the injection 1 out of 1 (successful), time-to-detection > 4 hours, blast radius 3 downstream agents. For comparison: Galileo AI research (December 2025) showed in simulated multi-agent systems that a single compromised agent poisoned 87% of downstream decision-making within 4 hours. A documented manufacturing procurement incident (2025): an agent was gradually convinced over three weeks that its approval limit was USD 500,000 - the attacker subsequently placed USD 5 million in fraudulent orders across 10 transactions. Such findings translate abstract risks into board-ready numbers.
Practical checklist
- Define scope: which tools, memory, inter-agent paths and HITL gates are in scope?
- Derive threat-model-informed scenarios from OWASP ASI01-ASI10.
- Anchor an automated baseline scan (Garak/PyRIT/DeepTeam) in CI/CD.
- Add manual, multi-stage campaigns (boiling-frog, A2A, delayed invocation).
- Collect metrics: attack success rate, detection rate, time-to-detection, blast radius.
- Map findings to OWASP/MITRE ATLAS/AIVSS, document in an AVID-compatible way.
- "Test injections" to check whether the human-in-the-loop actually takes effect.
- Set a cadence: quarterly plus trigger-driven.
For agencies and B2B decision-makers
Anyone who builds agents for clients or deploys them in their own marketing and sales stack should understand red-teaming as a fixed component of the delivery and operations process - not as a one-off acceptance test. For agencies it is also a differentiator: demonstrable attack success rates and OWASP-mapped reporting build trust with DACH clients that "we use guardrails" cannot achieve. Blck Alpaca supports you in building a red-teaming setup that matches your organisation's maturity and regulatory situation - from tool selection through scenario development to an audit-ready report. Talk to us before your agent goes into production.
FAQ
What is the difference between AI red teaming and a classic penetration test?
How often should an AI agent be subjected to red-teaming?
Which tools are suitable for red-teaming AI agents?
Is automated red-teaming sufficient?
Who should carry out the red-teaming - internal or external?
Want to go deeper?
Get new analyses straight to your inbox – or see how we put this knowledge to work for companies.