Now in early access, book a 30-minute demo →
← Back to blog Guide

Indirect Prompt Injection Explained: The Defining Attack on AI Agents

TL;DR
  • Indirect prompt injection (IPI) is the prompt-injection subtype where malicious instructions arrive inside external content an agent ingests - web pages, files, RAG documents, emails, MCP tool descriptions, tool outputs, and memory - not from the user.
  • The root cause is architectural: an LLM concatenates the system prompt, user input, retrieved data, and tool metadata into one token stream with no structural line between instructions and data. Any instruction-shaped text anywhere can be obeyed.
  • Passive IPI plants poisoned content and waits for retrieval (poisoned RAG doc, public GitHub issue). Active IPI pushes the payload at the agent (a crafted email, a malicious MCP tool description loaded at connect-time).
  • Simon Willison's lethal trifecta names the danger condition: private data + untrusted content + an exfiltration vector. Remove any one leg and the data-theft path breaks.
  • Input filtering is provably incomplete - any filter that catches most attacks still lets the rest through, and an attacker only needs one. The durable control is runtime visibility into what each agent and MCP reads and does, plus least privilege.
  • Map your fleet against the trifecta: find every agent and MCP server that simultaneously has private-data access, untrusted-content exposure, and an egress path, and govern those first.

Type a normal request to an AI agent - "summarize this support ticket" - and the agent does exactly what you asked. The problem is what else it reads on the way. The ticket body, the linked document, the web page it browses, the tool response it gets back: all of that lands in the same context window as your instruction, and the model cannot tell which words are *its task* and which are *just data it was handed*. Indirect prompt injection is the attack that lives in that gap, and it is the defining security problem of the agentic era.

Direct prompt injection - a user trying to jailbreak the model they are talking to - is mostly a content-safety concern. Indirect prompt injection is different in kind: the attacker is not the user, and the payload is smuggled inside content the agent ingests on its own. That is what makes it the threat that matters once an LLM has tools, data access, and the autonomy to act. This guide explains what it is, why it cannot be patched away, how it manifests across agents and MCP servers, and the controls a security team can actually rely on.

What indirect prompt injection is

Indirect prompt injection (IPI) is the subtype of prompt injection where malicious instructions arrive embedded in external, untrusted content the model ingests rather than typed by the user. OWASP, which ranks prompt injection as LLM01:2025 - the number-one risk for LLM applications - puts it plainly: indirect prompt injections occur when an LLM accepts input from external sources, such as websites or files, whose content alters the model's behavior.

The carriers are everything an agent reads: web pages it browses, PDFs and office files, documents retrieved from a RAG or vector store, incoming emails, MCP tool descriptions and tool responses, the output of any API it calls, and its own persistent memory. None of those are channels a user controls, and all of them flow into the same context the model reasons over.

It is worth being precise about a distinction teams routinely blur. Simon Willison, who coined the term *prompt injection* in September 2022, draws a hard line: prompt injection is mixing trusted and untrusted content together in the same context, whereas jailbreaking is directly tricking a model into harmful output. Indirect prompt injection is not about coaxing a model into saying something it shouldn't - it is about getting an agent with real tools and real data to act on an attacker's instruction. Treating it as a content-safety problem is why so many teams underestimate the data-theft risk.

The root cause: a collapsed data/instruction boundary

Indirect prompt injection is not a bug in any single product. It is an architectural property of how transformer LLMs work. A model assembles the system prompt, the user's input, retrieved documents, tool descriptions, and tool results into one continuous token stream - and that stream has no structural distinction between control (instructions) and data (content to be processed). If a sentence reads like an instruction, the model can follow it, no matter where in the stream it came from or who put it there.

This is the same insight that named the attack after SQL injection: the danger is mixing trusted and untrusted content with no enforced separator. In SQL you can parameterize queries to keep data out of the command channel. In an LLM there is no equivalent - no escape() that reliably marks "this text is only data, never an instruction." That is why the boundary cannot be restored at the model layer, and why no amount of patching makes the underlying problem go away.

Prompt injection is mixing together trusted and untrusted content in the same context. That is different from jailbreaking - and it is the part that matters for agents.Paraphrasing Simon Willison's distinction between prompt injection and jailbreaking

Passive vs active: how the payload reaches the agent

Indirect prompt injection splits cleanly into two delivery models, and the distinction shapes how you defend each one.

Passive IPI is plant-and-wait. The attacker places poisoned content somewhere the agent is likely to read it later and lets the agent come to it. A poisoned document in a RAG or vector store, a malicious public GitHub issue an agent triages, an SEO-indexed web page an agent browses, or hidden text inside a file shared into a workspace - all are passive. The attacker never touches the agent directly; they seed the environment.

Active IPI delivers the payload to the agent's context. A crafted email that lands in an assistant's inbox and gets summarized, or a malicious MCP server whose poisoned tool description is loaded the moment a client connects, are active deliveries. The attacker pushes the content toward the agent rather than waiting for retrieval.

The carriers are varied and many are invisible to a human reviewer. A few worth knowing by name:

  • Hidden web text - instructions in white-on-white text, zero-size fonts, or off-screen elements on a page an agent browses.
  • Document injection - the RAG case, where a poisoned retrieved document is the payload. Every RAG pipeline is an injection surface.
  • Email payloads - instructions hidden as HTML comments or white-on-white text in inbound mail an assistant processes.
  • MCP tool descriptions and responses - covered in detail below.
  • Tool and API outputs - any external response returned into context can carry instructions.
  • Memory - poisoned records written into an agent's persistent memory (the MINJA pattern).
  • Invisible Unicode - the deprecated tag block U+E0000-U+E007F renders as nothing on screen yet can encode a full set of instructions a model still reads.

The lethal trifecta: when IPI becomes data theft

Not every agent that reads untrusted content is a serious risk. The condition that makes one *structurally exploitable* is what Simon Willison named the lethal trifecta in June 2025: an agent that simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally - an outbound HTTP request, a rendered image URL, a clickable link, or any API that can carry data out.

When all three are present, indirect prompt injection has a complete path: untrusted content carries the instruction, private data is the target, and the egress vector is the way out. The framing is useful precisely because it is actionable for a CISO: remove any one leg and the data-theft path breaks. You do not have to win the unwinnable fight of sanitizing all untrusted text - you can cut data access, constrain tools, or block egress on the agents where all three overlap. We cover production exfiltration patterns in depth in our guide to the lethal trifecta and agent data exfiltration.

How it applies to AI agents and MCP

The abstract problem becomes concrete the moment an agent is wired to tools and data. Here is how IPI lands across the configurations you actually run:

Agent configurationIPI channelWhat an attacker plants
Single agent + RAGRetrieved documents (passive)A poisoned doc in the vector store; every RAG pipeline is an injection surface
Agent + web browsingPages the agent reads (active read)Hidden text that hijacks summarization or triggers an action
Agent + MCP serversTool descriptions (connect-time) and tool responses (runtime)Poisoned metadata or a malicious response; runtime responses are the unguarded gap
Agent + email/docs connectorsInbound content (active)An EchoLeak-class crafted message that reaches a private-data-capable agent
Multi-agent + shared memoryPersistent memory (passive)A MINJA-style record that activates on a later victim query

MCP makes it a supply-chain problem

The Model Context Protocol amplifies indirect prompt injection from a single-agent issue into a fleet-wide one, because every connected server contributes text to the model's context. OWASP's MCP Top 10 splits this into two entries. MCP03:2025 Tool Poisoning covers malicious instructions hidden in tool descriptions read at planning time, with sub-techniques including rug pulls, schema poisoning, and tool shadowing. MCP06:2025 Prompt Injection via Contextual Payloads covers instructions arriving in tool responses and other retrieved context.

The decisive insight is a trust gap between connect-time and runtime. A tool description may get reviewed once when the server is added, but the tool's *responses* flow into the model on every call with no equivalent check. That unguarded runtime channel is the real attack surface. Invariant Labs, which coined the term Tool Poisoning Attack, demonstrated it concretely: a poisoned add-tool description steered Cursor into reading ~/.ssh/id_rsa. We trace that pattern further in our advisory on the MCP tool poisoning campaign, and the broader surface in the MCP server security guide.

The same data/instruction collapse is what we keep finding in real coding-agent incidents - from the Amazon Q Developer wiper prompt injection to the Cursor CurXecute MCP RCE. In each, attacker-controlled content reached an agent with the privilege to act, and the agent acted.

Concrete controls a security team takes

There is no single fix, so the work is layered. OWASP's LLM01:2025 lists mitigations worth implementing together; the practical sequence for a security team looks like this.

  1. Discover first. You cannot govern agents and MCP servers you have not inventoried. Build a complete picture of which agents exist, which MCP servers they connect to, what data each can reach, and what egress each has. This is the prerequisite for every control below.
  2. Apply the lethal-trifecta lens. For each agent and MCP server, record whether it has private-data access, untrusted-content exposure, and an egress path. The ones with all three are your highest-priority targets - govern those before anything else.
  3. Enforce least privilege. Shrink each agent's data scope and tool permissions to the minimum it needs. A narrower blast radius means injected instructions have less to steal and fewer tools to abuse. See least privilege for AI agents.
  4. Break an exfiltration leg. Restrict outbound destinations, disable automatic image/link rendering where data can ride out, and allowlist the egress an agent genuinely needs.
  5. Filter and label - but don't trust it. Apply input/output filtering and segregate and clearly mark external content. These raise attacker cost; they do not close the boundary.
  6. Require human approval for high-risk actions. Gate irreversible or privileged tool calls behind a human checkpoint so a steered plan cannot execute silently.
  7. Monitor at runtime. Observe what each agent ingests, which tools it calls, what data it touches, and what egress it attempts - and keep an audit trail. This is where an injection that slips past every other layer becomes visible.

Two facts should set expectations. First, input filtering is provably incomplete: any filter that catches most attempts still lets a residual fraction through, and an attacker only needs the one payload that gets past it. Second, the stronger architectural defenses come with a capability cost. Google DeepMind's CaMeL ("Defeating Prompt Injections by Design") splits work between a privileged LLM that sees only the trusted request and a quarantined LLM that parses untrusted content but cannot call tools, enforcing data-flow policy with capabilities; in the AgentDojo benchmark it solved 77% of tasks with provable security versus 84% for an undefended system. Plan-then-execute patterns give similar control-flow integrity. They are real progress - and they are trade-offs, not free wins.

Terms worth knowing

A few related research terms come up constantly and are worth defining so they don't get conflated:

  • MINJA (Memory INJection Attack) - IPI that poisons an agent's memory bank through normal queries alone; the malicious record activates on a later victim query and can persist across sessions. See AI agent memory poisoning.
  • Document injection - the name for IPI delivered through a RAG-retrieved document.
  • XPIA - Microsoft's Cross-Prompt Injection Attack classifier, an example of a detection layer (which real-world exploits such as EchoLeak have shown can be bypassed).
  • Adversarial / multimodal / obfuscated variants - OWASP catalogs instructions hidden in images, encoded in Base64, split across languages, or carried by adversarial suffixes.

Where continuous agent and MCP visibility fits

Every defense above degrades to the same prerequisite: you have to know what your agents are reading and doing. Input sanitization is incomplete by construction, and the data/instruction boundary cannot be fully restored in the model. That leaves runtime observability as the control that holds when the others leak - which is exactly the layer Anomity is built for.

Anomity discovers every AI agent and MCP server across the fleet - the *anonymity* half of the problem, the agents and servers operating invisibly - then observes what each one ingests, which tools it invokes, what data it touches, and what egress it attempts, flagging the anomalous reads and outbound calls that an injection produces. That maps directly onto the trifecta: you can see which agents hold all three legs at once and watch the egress attempt the moment it happens. It complements least privilege and a human-approval gate rather than replacing them, and it keeps the audit trail that lets you answer, after the fact, exactly what an agent read and did. This is the practical meaning of the principle behind why we built Anomity: you can't govern what you can't see - and with indirect prompt injection, what you can't see is precisely what the attacker is counting on.

If you are starting from zero, the first move is an inventory - our guide on how to build an AI agent inventory walks through it - followed by runtime monitoring and anomaly detection on the agents that matter most. Indirect prompt injection is not going to be patched. It is going to be governed.

Frequently asked questions

What is indirect prompt injection?

Indirect prompt injection is a subtype of prompt injection where malicious instructions reach the model through external content the agent ingests - a web page it browses, a file or PDF it reads, a document retrieved from a RAG store, an incoming email, an MCP tool description, a tool's API response, or its own memory - rather than being typed by the user. OWASP describes it in LLM01:2025 as occurring when an LLM accepts input from external sources, such as websites or files, whose content then alters the model's behavior. Because the agent treats that external text as part of its working context, any instruction-shaped content it contains can be acted on.

How is indirect prompt injection different from a jailbreak?

They are different problems and conflating them is dangerous. Simon Willison, who coined the term prompt injection in September 2022, frames it as mixing trusted and untrusted content in the same context window. Jailbreaking is directly tricking a model into producing harmful output it was trained to refuse. Indirect prompt injection is not about getting the model to say something bad - it is about getting an agent with real tools and real data access to take an action on an attacker's behalf, such as exfiltrating data or calling a privileged tool. A model can be perfectly aligned and still be hijacked by injected instructions.

Why can't indirect prompt injection just be patched?

Because it is an architectural property of how transformer LLMs work, not a bug in one product. The model concatenates the system prompt, user input, retrieved documents, tool descriptions, and tool results into a single undifferentiated token stream, and it has no structural distinction between control instructions and processed data. Whatever looks like an instruction can be treated as one, regardless of where in the stream it came from. Input filters help but are incomplete: any filter that catches most attempts still lets a residual fraction through, and an attacker only needs one payload to land.

What is the lethal trifecta?

The lethal trifecta is Simon Willison's framing (June 2025) of the conditions that make an agent structurally exploitable for data theft: it has (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally - an HTTP request, a rendered image, a link, or an outbound API call. When all three are present, indirect prompt injection can turn the agent into an exfiltration channel. The defensive insight is that removing any one leg breaks the data-theft path, so the practical move is to enumerate which of your agents and MCP servers have all three at once and constrain them first.

How does indirect prompt injection affect MCP servers?

MCP turns indirect injection into a fleet-wide supply-chain problem. There are two channels. At connect-time, a malicious server's tool descriptions are read into the model's planning context - OWASP MCP03:2025 calls this tool poisoning. At runtime, tool responses flow back into the model with no equivalent review - OWASP MCP06:2025 covers prompt injection via contextual payloads. The gap is that descriptions may be reviewed once when a server is added, but responses are never re-checked, so the runtime channel is the unguarded one. Invariant Labs demonstrated a poisoned tool description steering Cursor to read a private SSH key.

Can input filtering or guardrails stop indirect prompt injection?

Not completely, and treating them as a complete control is the common mistake. Filtering and content labeling raise the cost of an attack and should be deployed, but they cannot restore the data/instruction boundary at the model layer, and any incomplete filter is bypassable by an attacker who iterates. Research defenses such as Google DeepMind's CaMeL and plan-then-execute patterns provide stronger guarantees but trade away capability. The durable, layered answer is least privilege, breaking a leg of the lethal trifecta, and runtime visibility into what each agent actually reads and does.

What is memory injection (MINJA)?

Memory injection, formalized as MINJA (Memory INJection Attack) in 2025 research, is indirect prompt injection that targets an agent's persistent memory rather than its immediate context. The attacker interacts with the agent using only normal queries - no direct access to the memory store - and gets a malicious record written into the agent's memory bank. Later, when a victim's query retrieves that record, the injected instruction activates. It matters because it persists across sessions and can propagate between agents that share memory, so a single poisoning can have a long, quiet tail.

How should a security team prioritize indirect prompt injection risk?

Start with discovery, because you cannot govern agents and MCP servers you have not inventoried. Then apply the lethal-trifecta lens: for each agent and MCP server, record whether it has private-data access, untrusted-content exposure, and an egress path, and treat the ones with all three as your highest priority. Apply least privilege to shrink data access and tool scope, break an exfiltration leg where you can, require human approval for high-risk actions, and put runtime monitoring on what each agent ingests and which tools it calls so an anomalous read or egress attempt is visible and recorded.

Ask AI about Anomity
ChatGPT Claude Perplexity Google AI Grok