
vLLM Unauthenticated RCE via Malicious Video URL - CVE-2026-22778

Anomity Threat Research · Mar 26, 2026 · 5 min read

LLM Gateways & Proxies · Critical · CVE-2026-22778 (GHSA-4r2x-xpjr-7cvv) · Mar 26, 2026

Affected: vLLM 0.8.3 through 0.14.0 (multimodal/video models); fixed in 0.14.1

On February 2, 2026, the vLLM project disclosed CVE-2026-22778 (also tracked as GHSA-4r2x-xpjr-7cvv), a pre-authentication remote code execution chain in the vLLM inference server, and shipped version 0.14.1 to fix it. The flaw is rated CVSS 9.8. It affects vLLM releases 0.8.3 through 0.14.0, and only deployments serving a video-capable model are exploitable. This advisory covers the chain, why a vulnerable inference server is an agentic-endpoint problem, and how Anomity surfaces and governs the agents that route through it.

What happened

vLLM is a high-throughput inference server that fronts large language and multimodal models behind an OpenAI-compatible API. Callers send requests to routes such as the Completions or Invocations API, and for multimodal models those requests can include media references like a video_url. CVE-2026-22778 is a two-step chain reached through that path with no authentication.

The first step is an information leak. When an attacker submits an invalid image to a multimodal endpoint, the Python imaging library PIL raises an exception whose message includes the memory address of a BytesIO object. vLLM returns that error directly to the client, handing the attacker a live heap address and defeating address-space layout randomization.

The second step is a heap buffer overflow in the video-processing pipeline. The JPEG2000 decoder in the bundled OpenCV honors a cdef box that remaps color channels. An attacker who crafts a video, referenced through video_url, so that the larger Y luma plane is mapped into the smaller U chroma buffer causes the decoder to write out of bounds on the heap. With the leaked address giving the attacker a known target, that controlled overflow becomes a reliable path to remote code execution on the host.
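
The size mismatch behind the overflow can be illustrated with simple arithmetic. This sketch makes several assumptions that the advisory does not state: 4:2:0 chroma subsampling, 8-bit samples, and a 64x64 frame. Python cannot overflow a buffer the way native code does, so checked_copy (a hypothetical name) models the bounds check whose absence makes the native write dangerous.

```python
def plane_sizes(width: int, height: int) -> dict[str, int]:
    """Byte sizes of 8-bit Y, U, V planes under assumed 4:2:0 subsampling."""
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return {
        "Y": width * height,
        "U": chroma_w * chroma_h,
        "V": chroma_w * chroma_h,
    }


def checked_copy(dst: bytearray, src: bytes) -> None:
    """Copy src into dst, refusing any write past the end of dst."""
    if len(src) > len(dst):
        raise ValueError(
            f"refusing to write {len(src)} bytes into a {len(dst)}-byte buffer"
        )
    dst[: len(src)] = src


sizes = plane_sizes(64, 64)
u_buffer = bytearray(sizes["U"])  # buffer sized for the chroma plane
y_data = bytes(sizes["Y"])        # luma data, four times larger here

# Bytes that an unchecked native copy would write past the U buffer.
overflow = sizes["Y"] - sizes["U"]
print(sizes, overflow)

try:
    checked_copy(u_buffer, y_data)
except ValueError as exc:
    print(exc)
```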

The fix in 0.14.1 patches both halves: it stops the information leak that exposes the heap address and corrects the underlying heap overflow in the video decode path. Because the overflow lives in the video pipeline, only deployments serving a video-capable model are exploitable; a vLLM instance serving a text-only model does not expose the vulnerable decode path.

Detail	Value
Identifier	CVE-2026-22778 (GHSA-4r2x-xpjr-7cvv)
Type	Pre-auth RCE chain (info leak + heap overflow)
CVSS	9.8 (Critical)
Trigger	Completions/Invocations request with a malicious video_url
Affected	vLLM 0.8.3 – 0.14.0 (video-capable models only)
Fixed in	0.14.1 (disclosed February 2, 2026)

Why this is an agentic-endpoint risk

An inference server rarely sits alone. It exists because AI agents, CLIs, and developer tooling need a place to send model traffic. On a managed endpoint, the vLLM process is an AI artifact in its own right, and so are the Claude Code sessions, MCP servers, and command-line agents that point at it. When that server runs untrusted media through a native decode path, it becomes a pre-auth code-execution surface in the middle of the model fleet.

That is the risk. Unauthenticated code execution on an inference host does not stop at one process; the attacker inherits whatever that host can reach, including model-provider keys, model weights, and the network paths the serving process trusts. Network and EDR controls see the connection and may flag the spawned process after the fact, but they cannot tell you which agents on which endpoints were configured to route through the affected version, or what those agents were allowed to do.

This is the same artifact-layer blind spot we track across the gateway cluster, including the sibling case in LiteLLM pre-auth SQL injection - CVE-2026-42208. The inference server is one node in a graph of AI artifacts, and you can't govern what you can't see. Fleet-wide inventory of every AI artifact is the precondition for scoping an incident like this one.

How Anomity surfaces and governs it

Anomity inventories eight AI artifact types on every managed endpoint: AI agents, MCP servers, extensions, skills, plugins, secrets, hooks, and CLIs. For CVE-2026-22778 that means the vLLM process and its version are catalogued alongside the agents and CLIs that route through it, so you can answer "which endpoints run an affected vLLM build, which of those serve video-capable models, and what talks to them" from the fleet inventory instead of guessing.

On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs. That is the enforcement point in runtime governance: a tool call that routes to a known-vulnerable vLLM version, or that submits media to a model endpoint outside policy, can be denied or logged in line rather than discovered after the host is already compromised. Anomity collects metadata only and redacts secrets on the endpoint, so it never has to read the very credentials an RCE on the host would expose.
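
The allow, deny, or log decision described above can be sketched as a small policy function. Everything here is hypothetical and for illustration only: ToolCall, decide, and the shape of the version inventory are not Anomity's or Claude Code's actual API, and a real hook would receive its input in whatever format the agent defines.

```python
from dataclasses import dataclass

# Affected range taken from the advisory: 0.8.3 through 0.14.0 inclusive.
AFFECTED_MIN = (0, 8, 3)
AFFECTED_MAX = (0, 14, 0)


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical metadata about a tool call, seen before it runs."""
    agent: str
    target_host: str
    has_media: bool


def decide(call: ToolCall, vllm_versions: dict[str, tuple[int, int, int]]) -> str:
    """Return 'allow', 'deny', or 'log' for a single tool call."""
    version = vllm_versions.get(call.target_host)
    if version is None:
        return "allow"  # target is not a known vLLM host
    if AFFECTED_MIN <= version <= AFFECTED_MAX:
        return "deny"   # routes to a known-vulnerable build
    if call.has_media:
        return "log"    # patched host, but media submissions are recorded
    return "allow"


# Hypothetical inventory: host name -> installed vLLM version.
inventory = {"gpu-01": (0, 13, 2), "gpu-02": (0, 14, 1)}

print(decide(ToolCall("claude-code", "gpu-01", has_media=True), inventory))
print(decide(ToolCall("claude-code", "gpu-02", has_media=True), inventory))
print(decide(ToolCall("claude-code", "gpu-02", has_media=False), inventory))
```

Keeping the decision a pure function of call metadata and inventory means it can be evaluated without reading request bodies or credentials, which matches the metadata-only approach described above.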

Every decision is written to a queryable 90-day audit trail. After a disclosure like this, that trail is what lets responders scope the event: which agents called through the affected server, when, and what each call was allowed to do. Anomity routes those decisions to SIEM, Slack, email, or Jira so the right team sees them in the tool they already use. Together, these form the timeline and enforcement record described under outcomes.
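
Scoping an event from such a trail amounts to filtering decisions by host and time window. The sketch below assumes a hypothetical record shape (AuditRecord) and a hypothetical helper (scope_incident); it does not reflect Anomity's actual schema or query interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry: one decision on one tool call."""
    timestamp: datetime
    agent: str
    target_host: str
    decision: str  # "allow", "deny", or "log"


def scope_incident(records, affected_hosts, now, retention_days=90):
    """Return decisions that touched an affected host within the retention
    window, ordered oldest first to form a timeline."""
    cutoff = now - timedelta(days=retention_days)
    hits = [
        r for r in records
        if r.target_host in affected_hosts and r.timestamp >= cutoff
    ]
    return sorted(hits, key=lambda r: r.timestamp)


now = datetime(2026, 3, 26, tzinfo=timezone.utc)
records = [
    AuditRecord(now - timedelta(days=1), "agent-b", "gpu-01", "deny"),
    AuditRecord(now - timedelta(days=10), "agent-a", "gpu-01", "allow"),
    AuditRecord(now - timedelta(days=120), "agent-a", "gpu-01", "allow"),
    AuditRecord(now - timedelta(days=2), "agent-c", "gpu-02", "allow"),
]

for record in scope_incident(records, {"gpu-01"}, now):
    print(record.timestamp.date(), record.agent, record.decision)
```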

Anomity complements your existing Network, EDR, DLP, and GRC controls rather than replacing them. It adds the agentic-endpoint layer those tools cannot see. See how it works and how Anomity compares for where it fits.

What to check across your fleet

Identify every endpoint and service running vLLM and record the exact version; treat anything from 0.8.3 through 0.14.0 as affected.
Determine which of those vLLM hosts serve a video-capable multimodal model, since only video-capable deployments expose the vulnerable decode path.
Upgrade affected hosts to vLLM 0.14.1 or later, which patches both the information leak and the underlying heap buffer overflow.
Restrict network reachability of the inference API so untrusted callers cannot reach the Completions or Invocations endpoints, and require authentication at the network edge.
Review process and network logs on affected hosts for unexpected child processes or outbound connections spawned by the vLLM serving process around and after disclosure.
Rotate any model-provider keys, tokens, or credentials present on a host that ran an affected build while reachable by untrusted callers.
Enumerate which AI agents, CLIs, and MCP servers were configured to route inference through the affected server, using a fleet-wide AI artifact inventory.
Confirm hook-based allow/deny/log enforcement is active on agents that route model traffic, so calls to a vulnerable vLLM version can be blocked.
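
The version check in the first step of the list above can be sketched with the standard library. The helper names parse_release and is_affected are hypothetical. The parser looks only at the first three numeric components and ignores pre-release, post-release, and local build suffixes, so a string such as a release candidate of a fixed version should still be verified by hand.

```python
import re

# Affected range from the advisory: 0.8.3 through 0.14.0 inclusive.
AFFECTED_MIN = (0, 8, 3)
AFFECTED_MAX = (0, 14, 0)


def parse_release(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '0.14.0+cu121' -> (0, 14, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip().lstrip("v"))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions and truncate long ones to exactly three components.
    return (parts + (0, 0, 0))[:3]


def is_affected(version: str) -> bool:
    """True if the version falls in the range the advisory lists as affected."""
    return AFFECTED_MIN <= parse_release(version) <= AFFECTED_MAX


for candidate in ["0.8.2", "0.8.3", "0.13.0", "0.14.0", "0.14.1"]:
    print(candidate, is_affected(candidate))
```

On a host with vLLM installed, the version string could be read with importlib.metadata.version("vllm") and passed to is_affected; that call is omitted from the sketch so it runs without vLLM present.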

CVE-2026-22778 turns one reachable inference server into unauthenticated code execution on the host, which is exactly why the AI artifact layer needs its own inventory and enforcement. For the full cluster context, see the pillar on securing LLM gateways and proxies. To see Anomity inventory your agents, govern tool calls at the hook, and keep a 90-day audit trail, request early access.

Frequently asked questions

Does upgrading vLLM to 0.14.1 fully resolve CVE-2026-22778?

Upgrading to vLLM 0.14.1 closes both halves of the chain: it stops the information leak that returns a heap address to the client and patches the underlying heap buffer overflow in the video-processing pipeline. That removes the path to remote code execution. It does not, however, undo any exposure that already occurred on an affected build. If a deployment ran a version from 0.8.3 through 0.14.0 while serving a video-capable model and was reachable by untrusted callers, treat the host as a potential compromise: review process and network logs, rotate any credentials or model-provider keys present on the server, and confirm no unexpected child processes were spawned by the serving process.

Which vLLM deployments are actually exploitable?

Only deployments serving a video-capable multimodal model are exploitable, because the overflow lives in the video-processing pipeline reached through a video_url in a Completions or Invocations API request. A vLLM instance serving a text-only model does not expose the vulnerable decode path. That said, model configuration changes over time, and a single endpoint can be repurposed to a multimodal model without the security team noticing. The practical guidance is to inventory which vLLM hosts run a version in the affected range and which of those serve video-capable models, then prioritize the intersection for the 0.14.1 upgrade and for restricting who can reach the inference API.

How does the two-step chain reach remote code execution?

The chain pairs an information leak with a memory-corruption bug. First, an invalid image submitted to a multimodal endpoint makes PIL raise an exception that includes the memory address of a BytesIO object, and vLLM returns that error to the client, leaking a live heap address and defeating address-space layout randomization. Second, the JPEG2000 decoder in bundled OpenCV honors a cdef box that remaps color channels; mapping the larger Y luma plane into the smaller U chroma buffer writes out of bounds on the heap. With the leaked address giving the attacker a known target, the controlled overflow becomes a reliable path to code execution, all without authentication.

How does Anomity reduce exposure when an inference server like vLLM is vulnerable?

Anomity treats the vLLM server as an AI artifact on the endpoint, so it inventories the process, its version, and the local agents, CLIs, and MCP servers that route inference through it. On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs, so calls that route to a known-vulnerable vLLM version can be denied or logged in line. Every decision lands in a queryable 90-day audit trail, giving responders the timeline they need to scope an unauthenticated-RCE event. Anomity collects metadata only and redacts secrets on the endpoint, complementing your existing Network, EDR, DLP, and GRC controls.