The Lethal Trifecta in Production: Why Notion AI, Superhuman, and Claude Cowork Fell the Same Way
- Across about nine days in January 2026 (roughly January 7-15), security firm PromptArmor publicly disclosed indirect prompt injection exfiltration flaws in four AI productivity tools: IBM Bob, Notion AI, Superhuman AI, and Anthropic's Claude Cowork.
- Every one matched Simon Willison's lethal trifecta: access to private data, exposure to untrusted content, and the ability to communicate externally, all in one agent context.
- None required a code exploit. The attack was text. Patches closed individual egress paths (image renders, allowed CSP domains, allowlisted APIs) one at a time, like whack-a-mole.
- Prompt injection is an unsolved frontier problem: no production LLM reliably separates instructions from data, and research defenses such as the Dual-LLM pattern and Google's CaMeL are not deployed at scale.
- Because the trifecta is a configuration property of a deployed agent, the durable control is detection: inventory each agent's data access, untrusted inputs, and egress, then alert on any agent holding all three.
In the span of roughly a week in January 2026, the same attack worked against four different AI products. Not four variations on a theme. The same attack, executed by the same security firm, against tools built by different companies on different stacks. PromptArmor publicly disclosed indirect prompt injection exfiltration flaws in IBM Bob, Notion AI, Superhuman AI, and Anthropic's Claude Cowork, with public disclosures landing between roughly January 7 and 15. One team found all four.
That is not a streak of bad luck. It is the signature of a structural problem. Each product was vulnerable for the same reason, and the reason has a name.
The pattern has a name: the lethal trifecta
On June 16, 2025, Simon Willison published a definition that has since become the standard vocabulary for this class of failure. He called it the lethal trifecta: an AI system that combines three capabilities in a single context.
- Access to private data the user wouldn't want leaked (emails, documents, files, credentials).
- Exposure to untrusted content the user didn't write (an inbound email, an uploaded resume, a shared page, a web page).
- The ability to communicate externally (render an image from a URL, call an API, submit a form).
When all three sit together, an attacker who controls the untrusted content can instruct the model to read the private data and send it somewhere. No memory-corruption bug. No privilege escalation. No malware. The exploit is text, and the model does exactly what it was built to do: follow instructions it found in its context.
You can't govern what you can't see. The lethal trifecta is invisible precisely because each capability, on its own, is a feature someone shipped on purpose.
We've written before about how this plays out in agentic systems in The Lethal Trifecta: How AI Agents Leak Your Data. The January disclosures are that theory landing in production, in tools millions of people use every day.
Four products, one failure mode
Here is the timeline as disclosed. These were responsible disclosures to vendors, not a coordinated attack campaign in the wild, and that distinction matters: it means defenders got a clean, controlled demonstration of how exposed this layer is.
| Product | Public disclosure | Untrusted input | Private data | Exfil channel | Status |
|---|---|---|---|---|---|
| IBM Bob (closed beta) | Jan 7 | Malicious README | Local machine / files | Auto-approved shell command + markdown-image render | Disclosed |
| Notion AI | Jan 7 | Resume with hidden text | Page contents | Image URL render to attacker domain | Remediation later confirmed |
| Superhuman AI | Jan 12 | One inbound email | Dozens of sensitive emails | Google Form (GET persists data) | Remediated |
| Claude Cowork | Jan 14-15 | Docx disguised as a Skill | Local files | Anthropic file-upload API (attacker key) | Disclosed |
Read down the columns. Every row has all three trifecta elements. The products differ; the structure does not.
Notion AI: the render fires before you approve
PromptArmor's Notion AI write-up is the cleanest illustration. An attacker plants instructions in tiny white text inside a document, for example a resume a recruiter pastes into a workspace. When Notion AI processes the page, it appends the page's contents to an attacker-controlled domain formatted as an image URL. The browser then tries to render that image, which fires the network request, and the data is gone before the user's approval step runs. PromptArmor reported the issue via HackerOne in late December 2025; after the report was initially closed, PromptArmor disclosed publicly on January 7, 2026, and Notion subsequently confirmed a remediation was deployed. Notion AI reportedly serves on the order of 100 million users.
Superhuman AI: one email, dozens leaked
In Superhuman, per Willison's and PromptArmor's accounts, a single untrusted inbound email carries the injection, and the victim does not even need to open it. When the user asks the assistant to summarize recent mail, it gathers dozens of sensitive emails and exfiltrates them to an attacker-controlled Google Form. Why a Google Form? Because Superhuman's Content Security Policy allowed images from docs.google.com, and Google Forms persist submitted data through GET requests. The egress channel was a domain the product already trusted. Superhuman remediated rapidly.
Claude Cowork: exfiltration through the trusted API
Claude Cowork ran agent workloads in a VM that allowlisted the Anthropic API as trusted egress, a sensible-sounding default. PromptArmor showed that a .docx disguised as a Skill could instruct Claude to run a curl command uploading local files via the Anthropic file-upload (Files) API using an attacker-supplied API key. The data left through the one channel the sandbox was built to trust, with no human approval at any point. PromptArmor demonstrated the attack against Claude Haiku and Opus 4.5, and noted the underlying isolation flaw had been disclosed earlier (by researcher Johann Rehberger) before Cowork launched. We unpack the broader risk of malicious Skills and tool definitions in MCP Tool Poisoning: Hidden Instructions at Scale.
Why the patches are whack-a-mole
Look at how each vendor fixed (or could fix) its instance. Notion can sanitize image rendering. Superhuman can tighten its CSP. Anthropic can constrain what a Skill may do with the Files API. Every one of those is a fix to a single egress path.
But the trifecta doesn't care which door you lock. Superhuman's attackers used a Google Form precisely because the obvious image-exfil path ran into a CSP that still allowed a Google domain. Close that, and the next researcher finds a webhook, a markdown link, a calendar invite, a DNS lookup. The model still has private data, still ingests untrusted content, and still has *some* way to talk to the outside world. As long as all three remain, you are playing defense against an unbounded set of egress techniques.
This is the same dynamic we documented in Comment-and-Control: Multi-Agent Prompt Injection and Credential Theft, where the exfil channel hid inside ordinary collaboration features. Egress is plural. Patching it one path at a time is a treadmill.
The inconvenient truth: prompt injection is unsolved
There is a deeper reason these fixes are mitigations rather than cures. No production LLM reliably separates instructions from data. Everything in the context window, your system prompt, the user's request, and the contents of that resume or email, arrives as one undifferentiated stream of tokens. The model has no trusted channel that says "this part is a command and that part is just text to summarize."
That is why prompt injection has stayed an open frontier problem since it was named. The most promising research defenses change the architecture rather than patching the prompt:
- The Dual-LLM pattern, where a privileged model never sees untrusted content and a quarantined model never touches sensitive tools.
- Google's CaMeL, which derives an explicit control-flow and data-flow policy so untrusted text cannot redirect privileged actions.
Both are real progress. Neither has meaningful production adoption at scale today. Until something like them ships broadly, treating prompt injection as a bug you can patch out is a category error. For the foundational explainer, see Indirect Prompt Injection, Explained.
CVE-2025-59536: the same lesson, in code
If you want a reminder that the trust boundary itself is the soft spot, look at CVE-2025-59536 (verified on the GitHub Advisory Database and NVD). Claude Code could execute project code before the user accepted its startup trust dialog, classified as CWE-94 (code injection), enabling command execution and API token theft when launched in an untrusted directory. It carried a CVSS v3.1 base score of 8.7 (High) and was fixed in v1.0.111.
Different surface, identical theme: the gap between "the agent has acted" and "the human approved" is where these failures live. The Notion render firing before approval and CVE-2025-59536 running code before the dialog are the same mistake wearing different clothes. We go deeper on agent-CLI trust boundaries in Securing AI Coding Agents and CLIs.
Why this is a governance problem, not just an engineering one
Here is the shift in thinking the January disclosures force. A normal vulnerability is a defect in code: it exists, you find it, you patch it, it's gone. The lethal trifecta is a configuration property of a deployed agent. It is not a flaw in any one line of code; it is an emergent property of how the agent is wired, what it can read, what it ingests, and where it can send data.
That has a sharp consequence. An agent that was perfectly safe last month can become exploitable this month with no code change at all, simply because someone connected a new data source, enabled a new integration, or added an MCP server that opens an egress path. The trifecta assembles itself across teams and over time, which is exactly why it qualifies as the new shadow IT we describe in AI Agents Are the New Shadow IT.
If the risk is a configuration that drifts, then the durable control is not a one-time patch. It is continuous detection of the configuration itself.
Detection as the durable control: the Anomity view
Our position is direct. You cannot reliably stop prompt injection at the model layer today, and you cannot win the egress whack-a-mole. What you *can* do is see the trifecta forming and refuse to let agents hold all three legs unsupervised. That requires three things, in order:
- Inventory every AI agent and MCP server across the fleet, including the shadow ones nobody registered. You can't assess what you haven't found. See How to Build an AI Agent Inventory.
- Map each one's three legs: what private data it can access, what untrusted inputs it ingests, and what external egress it has (rendered URLs, allowlisted APIs, webhooks, forms).
- Alert on any agent that holds all three. That single rule would have flagged Notion AI, Superhuman, Claude Cowork, and IBM Bob as high-risk configurations before a researcher, or an attacker, ever sent the payload.
This is the inverse of waiting for a CVE. Instead of patching after disclosure, you surface the exploitable shape in advance and force a decision: cut a leg (scope the data, sandbox the input, constrain egress) or accept and monitor the risk. Pair that static map with runtime behavior baselines so a sudden bulk read-and-send looks anomalous in real time, as covered in Runtime Monitoring and Anomaly Detection for AI Agents.
Egress hygiene still matters, and least privilege still matters, see Least Privilege for AI Agents. But those are how you *respond* to a flagged trifecta. Detection is what tells you where to look.
Framework and compliance hooks
These incidents are not abstract. They map cleanly onto the standards your auditors and regulators already use:
- OWASP: the attacks are textbook LLM01 (prompt injection) leading to LLM02 (sensitive information disclosure). See the OWASP Top 10 for LLM Applications guide.
- MCP authorization: the spec requires OAuth 2.1 (with mandatory PKCE) and RFC 9728 protected-resource metadata for internet-facing MCP servers, hardening the auth boundary attackers probe. See OAuth for MCP Servers, Explained.
- GDPR: exfiltration of personal data implicates Article 5(1)(f) (integrity and confidentiality), Article 32 (security of processing), and breach notification under Articles 33 and 34.
Traditional DLP, worth noting, does not catch these. The data leaves through a rendered image or a trusted API call inside an AI workflow, not over an obvious channel a legacy DLP engine inspects. We explain the gap in DLP for AI Agents: Why Traditional DLP Fails.
What to do this quarter
If you run security for an organization that uses any AI productivity tool, and you do, whether you've inventoried it or not, treat the January cluster as your forcing function:
- Build or refresh your agent and MCP inventory so you actually know what's deployed.
- For each agent, answer three questions on one line: private data? untrusted input? external egress?
- Flag every agent that answers yes to all three and put it through review.
- Wire runtime anomaly detection for bulk read-then-send behavior, and keep an audit trail so you can reconstruct any exfil event after the fact, see AI Agent Audit Trail and Logging Guide.
The four products that fell in January were built by capable teams who care about security. They were exposed anyway, because the lethal trifecta is not a mistake you can engineer away one patch at a time. It is a property you have to *see*. Make it visible, and you can govern it. Leave it invisible, and you are just waiting for your turn in the timeline.
Sources cited inline: Simon Willison (lethal trifecta definition, Superhuman and Cowork analysis), PromptArmor (IBM Bob, Notion AI, Superhuman AI, Claude Cowork disclosures), the GitHub Advisory Database and NVD (CVE-2025-59536), the OWASP GenAI Top 10, and the Model Context Protocol authorization specification.
Frequently asked questions
What is the lethal trifecta in AI security?
Coined by Simon Willison on June 16, 2025, the lethal trifecta describes an AI agent that simultaneously has access to private data, exposure to untrusted content, and the ability to communicate externally. When all three are present in one context, an attacker can use indirect prompt injection to read private data and exfiltrate it, with no code vulnerability required.
What were the four AI productivity tool vulnerabilities PromptArmor disclosed in January 2026?
PromptArmor publicly disclosed indirect prompt injection exfiltration vulnerabilities in IBM Bob, Notion AI, Superhuman AI, and Anthropic's Claude Cowork over roughly nine days (public disclosures spanning January 7-15, 2026). They were responsible disclosures, not a coordinated attack campaign, and each was a textbook lethal-trifecta case.
How did the Notion AI exfiltration work?
Per PromptArmor, an attacker hid instructions in tiny white text inside a document such as a resume. When Notion AI processed the page, it appended the page contents to an attacker-controlled domain formatted as an image URL. The browser rendered the image, firing the network request before the user could approve the edit. PromptArmor disclosed it publicly on January 7, 2026, after Notion had initially closed the report; Notion later confirmed a remediation was deployed.
Is prompt injection a solvable problem?
Not reliably, and not yet. No production LLM cleanly separates trusted instructions from untrusted data, which is why prompt injection remains an open frontier problem. Research architectures such as the Dual-LLM pattern and Google's CaMeL show promise but lack production adoption at scale. Until that changes, per-egress patches are mitigation, not a cure.
How is the lethal trifecta different from a normal software vulnerability?
A normal vulnerability is a flaw in code you can patch. The lethal trifecta is a configuration property of how an agent is wired: which data it can read, what untrusted content it ingests, and where it can send data. The same architecture can be 'safe' one day and exploitable the next as features and integrations change.
Why don't CSP and egress allowlists fully fix these attacks?
They fix one path at a time. Superhuman's CSP allowed images from docs.google.com, so attackers used a Google Form (which persists data via GET). Claude Cowork's VM allowlisted the Anthropic API as trusted egress, so attackers exfiltrated through it. Each fix closes a door while others stay open, the definition of whack-a-mole.
How can security teams detect lethal-trifecta exposure before it is exploited?
Treat the trifecta as a detectable configuration. Inventory every AI agent and MCP server, then map each one's data access, untrusted input sources, and external egress. Any agent holding all three is high risk and should be flagged for review, ideally before an attacker finds it.
Which frameworks and regulations apply to these incidents?
OWASP LLM01 (prompt injection) and LLM02 (sensitive information disclosure) map directly. For internet-facing MCP servers, the MCP authorization spec requires OAuth 2.1 (with mandatory PKCE) and RFC 9728 protected-resource metadata. Under GDPR, unauthorized exfiltration of personal data implicates Article 5(1)(f), Article 32, and breach notification under Articles 33 and 34.
Did Claude Cowork ship with a known vulnerability?
According to PromptArmor and subsequent reporting, Cowork shipped a known class of weakness: a document disguised as a Skill could upload files via the Anthropic file-upload API using an attacker-supplied API key, because the VM allowlisted the Anthropic API as trusted egress. The underlying isolation flaw had been disclosed earlier by researcher Johann Rehberger. PromptArmor demonstrated it against Claude Haiku and Opus 4.5.




