Now in early access, book a 30-minute demo →
← Back to blog Guide

AI Agent Memory Poisoning: Poison Once, Exploit Forever

TL;DR
  • AI agent memory poisoning plants malicious content in an agent's *persistent* memory or knowledge store so it re-fires across future sessions - unlike a prompt injection, which dies when the session resets.
  • The threat is recognized by OWASP as ASI06: Memory & Context Poisoning in the Top 10 for Agentic Applications (released December 9, 2025).
  • Research shows the attack works even without direct memory access: MINJA (arXiv 2503.03704) reports an average 98.2% injection success rate and 76.8% attack success rate using only ordinary user queries.
  • Newer work - MemoryGraft (arXiv 2512.16962) - poisons stored RAG/experience records and abuses agents' tendency to imitate past successful runs.
  • The blast radius compounds with the lethal trifecta: an agent with private data access, exposure to untrusted content, and an exfiltration path turns one poisoned memory into ongoing data loss.
  • Defense is provenance tagging of memory writes, monitoring of persisted memory, behavioral drift detection, and a tamper-evident audit trail - not just input filtering.

A prompt injection is a one-night stand. Memory poisoning is a marriage you never agreed to. The injection lives in the current context window, does its damage, and dies when the session resets. Poison the agent's *memory* and the malicious instruction gets written to disk, reloaded on the next run, and re-fired against the next user - silently, indefinitely, until someone notices the agent has been quietly working for someone else.

This is the difference that makes AI agent memory poisoning one of the harder agentic threats to reason about. It converts a transient exploit into a durable one. And the unsettling part, confirmed by current research, is that an attacker often does not need to touch your database, your model, or your infrastructure to do it. They just need to talk to the agent.

This guide explains what memory poisoning is, the mechanics behind it, how it lands specifically on AI agents and MCP-connected tools, and the concrete controls a security team puts in place. It is grounded in current research and the OWASP framework that now names this risk explicitly.

What memory poisoning is

Memory poisoning is the act of writing malicious or false content into the parts of an agent's state that persist across sessions, so the agent later reads that content back as trusted context and acts on it. The defining property is persistence. A normal prompt injection lives and dies inside a single context window. A poisoned memory survives the reset.

Modern agents accumulate several kinds of persistent state, and each is a poisoning target:

  • Long-term memory - summaries, user preferences, and "facts" the agent stores to recall in future conversations.
  • Conversation history - prior turns replayed into context, including any session summaries the agent writes about itself.
  • Retrieval-augmented (RAG) knowledge - vector stores and document indexes the agent queries at run time.
  • Experience / episodic stores - records of past task executions an agent consults to decide how to act next.
  • Shared scratch state in multi-agent systems - files, tickets, or message buses that one agent writes and another reads.

Poison any of these and you are not changing what the agent thinks *right now* - you are changing what it will believe is true the next time it loads that record. The attacker writes once; the agent re-reads forever.

Where OWASP places it

This is not a speculative category. OWASP names it directly as ASI06: Memory & Context Poisoning in the OWASP Top 10 for Agentic Applications, released December 9, 2025. ASI06 sits alongside goal hijack, tool misuse, and identity abuse as one of the ten agent-specific risks curated from real deployments. If you are building a threat model, this is the line item it belongs under.

How it works: the mechanics

There are two broad ways to get malicious content into persistent memory, and they differ sharply in how much access the attacker needs.

Direct writes (the obvious path)

If an attacker can write to the backing store - a compromised RAG document, an editable wiki the agent indexes, a shared ticket queue, a poisoned tool output - they can plant content directly. This overlaps with AI supply-chain attacks: the agent trusts a source, the source is tainted, and the taint gets retrieved as authoritative context. The fix here is conventional integrity control on the source plus skepticism about anything the agent ingests.

Indirect writes through ordinary use (the dangerous path)

The harder problem is that agents write to their own memory based on conversation. An attacker who can only *talk* to the agent can still steer what gets persisted. This is where the research gets pointed.

MINJA (Memory INJection Attack, arXiv 2503.03704) demonstrates exactly this. The attacker has no direct access to the memory store and no access to the model - only the ability to send normal-looking user queries. By crafting queries that cause the agent to record poisoned reasoning into its own memory bank, later legitimate queries from *other* users retrieve that poisoned record and are steered toward attacker-chosen outputs. The reported results are not marginal:

MINJA metricReported result
Injection success rate98.2% (average across configurations)
Attack success rate76.8% (average across configurations)
Attacker access requiredUser queries only - no memory or model access
Agents evaluatedEHRAgent, RAP, and a QA agent

The takeaway is that the trust boundary you actually have to defend is not "who can write to the database." It is "who can influence what the agent decides to remember" - which, for a public-facing or shared agent, is everyone.

Poisoning learned experience

More recent work pushes the idea up a level. MemoryGraft (arXiv 2512.16962, December 2025) targets not raw memory but an agent's *stored experiences* - the records some agents keep of past successful task runs and consult to decide how to handle similar tasks. MemoryGraft exploits the imitation heuristic: an agent's tendency to copy a prior run that looked successful. Plant a poisoned "successful" experience and the agent will faithfully reproduce the attacker's pattern on future tasks. It was demonstrated against MetaGPT's DataInterpreter agent running on GPT-4o. The lesson: any store an agent treats as a source of "how I did this before" is a poisoning surface, not just the literal memory table.

Forged self-summaries

Palo Alto Networks' Unit 42, working with Amazon Bedrock Agents, described a concrete variant that coined much of the popular framing for this attack: a manipulated session summary. The agent writes a summary of the session to carry forward; an attacker uses indirect prompt injection - including forged conversation-delimiter tags in the summarization input - so the persisted summary contains instructions for a *delayed, silent* action, such as exfiltrating conversation history to a remote server on a later session rather than the current one. The damage is decoupled in time from the injection, which is precisely what defeats reactive, single-turn detection.

Why agents and MCP make it worse

Memory poisoning is bad in a chatbot. In an agent with tools, it is a different class of problem, for three structural reasons.

Persistence meets capability. A poisoned chatbot says something wrong. A poisoned agent *does* something wrong - and it does it every session the record is loaded. The same property that makes agents useful (memory across runs) is the property the attacker is weaponizing. This is closely related to how a single planted instruction propagates in multi-agent prompt-injection and credential-theft chains, where one agent's poisoned output becomes another agent's trusted input.

The lethal trifecta turns persistence into exfiltration. Memory poisoning supplies durability; the lethal trifecta supplies the damage. When an agent has (1) access to private data, (2) exposure to untrusted content, and (3) a path to communicate externally, a single poisoned memory can drive repeated, silent data exfiltration over many sessions - because the instruction is reloaded every time the agent wakes up. The forged-summary delayed-exfiltration pattern above is exactly this trifecta closing.

MCP widens the ingestion surface. Every MCP server an agent connects to is a source of content that can land in memory: tool descriptions, tool outputs, retrieved documents. A poisoned tool output is an indirect memory write. This is the same trust problem covered in MCP tool poisoning via hidden instructions and in the broader MCP server security guide - except the consequence here is not a one-shot exploit but a planted record that keeps paying out. If you do not know which MCP servers your agents trust, you cannot reason about what can write to their memory at all; that visibility gap is the same one behind AI agents being the new shadow IT.

Memory poisoning vs. prompt injection at a glance

PropertyPrompt injectionMemory poisoning
LifetimeSession-scoped; dies on resetPersistent; survives resets
Re-firingOne executionRe-fires on every retrieval
Blast radiusCurrent session / userFuture sessions, often other users
Detection windowReactive, same turnDelayed; decoupled from injection
Primary controlInput handling, context isolationMemory provenance, write monitoring, drift detection
OWASP mappingGoal hijack familyASI06: Memory & Context Poisoning

The two are related - a prompt injection is frequently the *delivery mechanism* for a memory write - but they demand different defenses. Filtering the input that triggered the write does nothing for the record already sitting in the store. For the injection side, see indirect prompt injection explained; this guide is about what happens after the injection succeeds in writing something durable.

Controls a security team puts in place

Because the attack is about persistence, the controls cluster around the memory lifecycle - what gets written, what gets read, and whether you can tell the difference between a legitimate and a poisoned record after the fact. No single control is sufficient; layer them.

1. Tag provenance on every memory write

Treat memory like any other data store with a trust model. Every record the agent persists should carry metadata: who or what produced it, from which source, in which session, and whether that source was trusted or untrusted. A summary the agent wrote after ingesting an external web page is *untrusted-derived* and should never be treated with the same authority as a verified configuration value. Provenance is the precondition for every other control here - you cannot quarantine what you cannot attribute.

2. Separate trusted from untrusted context

Do not let content derived from untrusted sources be written into the same memory namespace the agent treats as ground truth. Keep retrieved external content, tool outputs, and user-supplied data in a clearly lower-trust tier, and require the agent to treat that tier as data, not instructions. This is the persistent-store analogue of context isolation in securing AI coding agents and CLIs.

3. Monitor reads and writes to persistent memory

Input filtering alone misses this attack because the malicious content is already inside the trust boundary. You need visibility into the memory operations themselves: what the agent writes, what it retrieves, and whether a retrieved record is steering behavior. Watch for memory writes that contain instruction-like content, references to external recipients, or encoded payloads. This is a core part of runtime monitoring and anomaly detection for AI agents.

4. Run behavioral drift detection

A poisoned memory shows up as a *change in behavior over time*: an agent that suddenly calls a tool it never used, sends data to a new recipient, or follows a task pattern inconsistent with its established baseline. Establishing that baseline per agent - its normal tools, data flows, and counterparties - lets you flag the deviation even when the triggering record looks innocuous in isolation. The delayed, decoupled nature of forged-summary attacks means drift detection is often the *only* control that catches them, because there is no malicious input in the session where the damage fires.

5. Constrain capability with least privilege

The damage a poisoned memory can do is bounded by what the agent is allowed to do. An agent that cannot reach external networks cannot exfiltrate, no matter what its memory tells it to do. Apply least privilege for AI agents so that breaking the lethal trifecta - removing private-data access, untrusted exposure, or the exfiltration path - is a deliberate, enforced boundary rather than a hope.

6. Make memory tamper-evident with an audit trail

When you find anomalous behavior, you need to trace it back to the record that caused it and the event that wrote that record. A tamper-evident AI agent audit trail that logs memory writes with their provenance turns an open-ended investigation into a query. It is also what lets you remediate confidently: you can identify and purge the specific poisoned records rather than wiping all memory and losing legitimate state.

7. Remediate the store, not the session

Clearing the context window does nothing for a poisoned long-term store or RAG index - that content persists independently of the session and will be retrieved again on the next run. Incident response for memory poisoning means identifying the poisoned records in the *persistent* store, removing them, and re-validating anything derived from them. Fold this into your AI agent incident response playbook as a distinct path from session-scoped injection cleanup.

A practical checklist

  1. Inventory every agent's persistent stores - long-term memory, conversation history, RAG indexes, experience stores, shared scratch state.
  2. Tag every memory write with provenance and a trust tier.
  3. Forbid untrusted-derived content from being treated as instructions or ground truth.
  4. Instrument memory reads and writes for monitoring, not just inputs.
  5. Baseline each agent's normal behavior and alert on drift.
  6. Enforce least privilege so a poisoned memory cannot complete the lethal trifecta.
  7. Keep a tamper-evident audit trail of memory operations with provenance.
  8. Write a memory-store remediation path into incident response - purge records, not just sessions.

Where continuous visibility fits

Most of these controls share one prerequisite: you have to be able to *see* the agents, their MCP connections, and the memory they read and write. You cannot tag provenance on a store you did not know existed, baseline an agent you never inventoried, or detect drift without a record of normal. This is the visibility layer Anomity is built for - discovering every agent and MCP server across the fleet, monitoring what they connect to and how they behave, flagging anomalies against a learned baseline, and keeping an audit trail you can query after an incident.

That is deliberately not a claim that visibility alone prevents memory poisoning. Provenance tagging, trust separation, and least privilege are engineering work you do in the agent itself. But none of it operates without an accurate, continuous picture of what is running - and for memory poisoning specifically, where the damage is delayed and decoupled from the injection, the ability to detect behavioral drift and trace it back through a memory audit trail is frequently the difference between catching a poisoned agent and discovering it after months of quiet exfiltration. You can't govern what you can't see, and you certainly can't un-poison a memory you never knew the agent had.

Frequently asked questions

What is AI agent memory poisoning?

It is an attack that plants malicious or false content into an AI agent's persistent memory - its long-term store, conversation history, or retrieval-augmented (RAG) knowledge base - so that the agent treats the planted content as trusted context in future sessions. Because the content persists, the attacker influences the agent's behavior long after the original interaction ends.

How is memory poisoning different from prompt injection?

Prompt injection is session-scoped: the malicious instruction lives in the current context window and disappears when the session resets. Memory poisoning is persistent: the malicious content is written into storage the agent reads back on later runs, so a single successful poisoning re-fires across many future sessions and often many users.

Can an attacker poison memory without direct access to the database?

Yes. The MINJA research (arXiv 2503.03704) demonstrates injection through ordinary user queries alone, with no direct access to the memory store or the model. Across its evaluated configurations it reported an average 98.2% injection success rate and a 76.8% attack success rate against agents including EHRAgent, RAP, and a QA agent, showing the attack does not require a breach of the backing store.

Is memory poisoning in any official security framework?

Yes. OWASP lists it as ASI06: Memory & Context Poisoning in the Top 10 for Agentic Applications, released December 9, 2025. It is one of the ten agent-specific risk categories curated from real agentic deployments.

What is MemoryGraft?

MemoryGraft (arXiv 2512.16962, December 2025) is research that poisons an agent's stored experiences or RAG records and exploits the imitation heuristic - an agent's tendency to copy past runs that appeared successful. It was demonstrated against MetaGPT's DataInterpreter agent running on GPT-4o, extending the threat from raw memory to learned-experience stores.

Why does memory poisoning make data exfiltration worse?

Memory poisoning supplies the persistence; the lethal trifecta supplies the damage. An agent that has access to private data, ingests untrusted content, and can communicate externally can be steered by a single poisoned memory into repeated, silent exfiltration over many sessions, because the malicious instruction is reloaded every time.

How do you detect memory poisoning?

You monitor what gets written to and read from persistent memory, tag every memory record with its provenance, and run behavioral drift detection on the agent's actions over time. A poisoned memory shows up as anomalous tool calls, recipients, or data flows that deviate from the agent's established baseline, and a tamper-evident audit trail lets you trace the bad behavior back to the poisoning event.

Does clearing the context window remove a poisoned memory?

No. Clearing the context only resets session-scoped state. A poisoned long-term memory store or RAG index persists independently of the session, so it will be retrieved again on the next run unless you remediate the underlying record.

Ask AI about Anomity
ChatGPT Claude Perplexity Google AI Grok