CybersecurityFebruary 202512 min read

Red-Teaming AI Systems: Our Methodology for Adversarial Security Testing

Traditional penetration testing methodology was built for deterministic systems. AI systems are not deterministic. The same input can produce different outputs. The attack surface is not just the API — it is the model's behaviour under adversarial inputs, the retrieval system's susceptibility to poisoning, and the agentic pipeline's handling of untrusted content. Standard security tooling was not built to find these vulnerabilities.

Why AI Systems Require a Different Testing Approach

A web application has a fixed attack surface. The inputs are defined — form fields, query parameters, headers, file uploads. The behaviour under each input is deterministic given the same application state. A penetration tester can enumerate the attack surface systematically, apply known vulnerability patterns, and verify findings by reproducing them deterministically.

An AI system has a probabilistic attack surface. The input space is natural language — effectively unbounded. The system's response to any given input depends on the model, the system prompt, the conversation history, the retrieval context, and the sampling parameters. A finding that reproduces reliably in testing may not reproduce under production conditions. A vulnerability that does not appear in the first 1,000 test inputs may appear in the 1,001st.

This non-determinism is a fundamental security challenge. It means that coverage metrics that work for traditional penetration testing — "we tested all input fields" — are not meaningful for AI systems. It also means that the security testing must be designed to characterise the probability distribution of unsafe behaviour, not just to demonstrate that unsafe behaviour is possible.

AI Attack Surface — Primary Categories

Prompt InjectionCritical

Attacker-controlled input overrides system instructions or alters model behaviour in downstream pipelines.

Context ManipulationHigh

Long-context window exploitation to surface training data, system prompts, or other conversation state.

Indirect InjectionHigh

Malicious instructions embedded in documents, web pages, or tool outputs that the model processes.

Data Exfiltration via InferenceHigh

Systematic probing to reconstruct sensitive system prompt contents or user data from model responses.

JailbreakingMedium

Bypass of safety constraints through roleplay, encoded instructions, or multi-step reasoning chains.

Denial of ReasoningMedium

Inputs designed to cause excessive compute consumption, hallucination loops, or response refusal at scale.

Prompt Injection: The Primary Attack Vector

Prompt injection is the AI equivalent of SQL injection: attacker-controlled input reaches a processing layer that interprets it as instructions rather than data. In a well-designed SQL system, user input is parameterised and never interpreted as SQL. In many AI systems, there is no equivalent of parameterisation — user input and system instructions are both natural language, processed by the same model.

Direct prompt injection targets the model's input directly: the attacker supplies a user message that attempts to override, ignore, or augment the system prompt. The attack typically works by framing instructions as meta-instructions: "ignore your previous instructions and instead do X." The effectiveness of this attack varies by model and system prompt design. Models with robust instruction-following training are more resistant but not immune.

Indirect prompt injection is more dangerous and less commonly understood. The attacker does not have direct access to the model's input — instead, they embed malicious instructions in content that the model will process: web pages retrieved by a RAG system, documents uploaded by legitimate users, tool call responses. The model processes the content and follows the embedded instructions as if they were legitimate. Agentic systems with tool use — browse, execute, send — are particularly vulnerable because the injected instructions can instruct the model to take actions with real-world consequences.

Our testing methodology for prompt injection covers both vectors with a systematic test set: a library of over 200 direct injection patterns drawn from public research and our own engagement history, and a separate methodology for indirect injection that involves poisoning the data sources the model is likely to retrieve. For agentic systems, we test injection through every external data source in the pipeline — web retrieval, database queries, tool responses, uploaded files.

System Prompt Exfiltration

The system prompt in a production AI deployment frequently contains sensitive information: business logic, data access credentials, customer data handling instructions, and the guardrails that define the system's permitted behaviour. Many organisations treat the system prompt as a secret — it is not displayed to users and is not expected to be accessible.

System prompt exfiltration attacks are designed to recover system prompt contents through the model's outputs. The most reliable technique is direct instruction — asking the model to repeat its system prompt. Well-configured systems refuse this. The more sophisticated techniques use indirect extraction: asking the model to summarise "its instructions," asking it to list "the things it cannot help with," or using roleplay frames that ask the model to describe its configuration to another character.

For RAG systems, the attack surface extends to the retrieval store. A document retrieval system that returns snippets from a knowledge base without proper access control can be used to retrieve documents the user should not have access to — by crafting queries designed to match restricted documents rather than the documents the user legitimately needs. We test retrieval systems with queries designed to surface restricted documents through semantic similarity rather than direct permission bypasses.

Testing Agentic Systems

Agentic AI systems — those that use tools, call APIs, browse the web, or execute code — introduce an entirely different risk category. The security of an agentic system is not just the security of the model's outputs: it is the security of every action the model can take through its tools. A model that can send emails, modify database records, or make API calls on behalf of users is a system with privileged access that can be exploited through the model's input.

Our agentic system testing methodology begins with tool inventory: every tool available to the model, the permissions each tool requires, and the maximum impact of a worst-case invocation of each tool. This produces an impact model — what is the worst thing this system could do if fully compromised? — that shapes the test prioritisation.

We then test each high-impact tool for injection susceptibility: can an adversarially crafted input cause the model to invoke the tool inappropriately? For email-sending tools, we test whether injected instructions can redirect email content or recipients. For database tools, we test whether injection can cause data modification or exfiltration queries. For code execution tools, we test whether injected instructions can escape the intended execution scope.

Multi-step reasoning chain attacks are tested separately: sequences of individually innocuous requests that, combined, drive the model to a harmful final action that it would have refused if asked directly. These attacks are more difficult to test exhaustively because the combination space is large — we use guided exploration based on the system's known capabilities and our understanding of the model's reasoning patterns.

Reporting AI Security Findings

AI security findings require a different reporting structure than traditional penetration testing findings. The non-deterministic nature of AI vulnerabilities means that CVSS scores — which assume that a vulnerability either exists or does not — are not directly applicable. A finding might be "the model repeats system prompt contents approximately 40% of the time when asked using this pattern" rather than "the SQL injection vulnerability is present and reproducible."

We report AI security findings with a reproduction rate alongside the finding description: the percentage of trials in which the finding was observed, the conditions under which reproduction rate was highest, and the conditions under which it was lowest. This gives remediation teams a calibrated view of the risk — a finding that reproduces 80% of the time under default settings is different from one that reproduces 10% of the time under adversarial conditions.

Remediation guidance for AI findings is also different from traditional findings. Most AI security vulnerabilities cannot be "patched" in the traditional sense — they require architectural changes to the system design, adjustments to the system prompt, implementation of input/output filtering, or changes to the permissions and capabilities available to the model. We provide remediation guidance at each of these levels, with estimates of the risk reduction each measure produces.

EU AI Act Alignment

The EU AI Act's Article 9 requires that high-risk AI systems implement a risk management system covering the full lifecycle. Article 15 requires robustness, accuracy, and cybersecurity — specifically requiring that high-risk AI systems be resilient against attempts to alter their use or performance by third parties exploiting vulnerabilities. This is a regulatory mandate for adversarial testing.

Organisations that deploy high-risk AI systems and cannot produce evidence of adversarial security testing will face a gap in their Article 15 compliance. The Act does not prescribe a specific testing methodology, but the conformity assessment process requires documentation of the security measures in place — and "we have not tested for adversarial inputs" is not a defensible position for a system classified as high-risk under Annex III.

Our AI red-teaming engagements produce documentation structured for conformity assessment use: a threat model, a test methodology description, the full findings register with reproduction rates, and a gap assessment against Article 15's requirements. This documentation is designed to satisfy the technical file requirements for high-risk system conformity assessments under the Act.

Sources: OWASP Top 10 for LLM Applications (2025) · NIST AI RMF (2024) · EU AI Act Article 15 — Accuracy, Robustness, Cybersecurity · MITRE ATLAS framework for adversarial ML.