
vLLM Prompt-Embeds Tensor Deserialization Memory Corruption - CVE-2025-62164

LLM Gateways & Proxies · High · CVE-2025-62164 (GHSA-mrw7-hf4f-83pf)
Affected: vLLM 0.10.2 up to but not including 0.11.1; fixed in 0.11.1

On November 20, 2025, CVE-2025-62164 (also tracked as GHSA-mrw7-hf4f-83pf) was disclosed against the vLLM inference and serving engine: a memory-corruption flaw in its Completions API that can cause a denial of service and, in the worst case, remote code execution. It affects vLLM releases from 0.10.2 up to but not including 0.11.1, with the fix in 0.11.1. This advisory covers what the bug exposes, why a serving-engine compromise is an agentic-endpoint problem, and how Anomity surfaces and governs the agents that route through it.

What happened

vLLM is a high-throughput LLM inference and serving engine that exposes an OpenAI-compatible API. Its Completions API supports a prompt-embeds feature, where a caller can supply precomputed embedding tensors instead of plain text. To accept those tensors, vLLM has to deserialize binary data that the caller controls.

In the affected versions, vLLM deserialized user-supplied prompt embeddings, sent as serialized PyTorch tensors, using torch.load() inside _load_and_validate_embed without sufficient validation. Since PyTorch 2.8.0, torch.load with weights_only=True does not validate sparse-tensor invariants unless that is explicitly enabled. A maliciously crafted sparse tensor therefore bypasses the internal bounds checks that would normally reject a malformed structure.

The corruption happens later. When vLLM calls .to_dense() on the deserialized tensor, PyTorch dereferences attacker-controlled index arrays and writes outside the allocated buffer. That out-of-bounds write can crash the server, producing a denial of service, and can potentially be steered toward remote code execution on the serving host. Any user with access to the Completions API can exploit it; no separate authentication bypass is required.
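To make the failure mode concrete, here is a minimal pure-Python sketch of a sparse COO structure being densified. It is not PyTorch's implementation: SparseCOO, check_invariants, and to_dense are hypothetical names chosen for illustration. The point is only that the write offset is computed from caller-supplied indices, so skipping the invariant check lets a crafted index address memory outside the buffer.

```python
from dataclasses import dataclass


@dataclass
class SparseCOO:
    """Toy stand-in for a sparse COO tensor: coordinates plus values."""
    indices: list[tuple[int, int]]  # one (row, col) pair per stored value
    values: list[float]
    shape: tuple[int, int]


def check_invariants(t: SparseCOO) -> None:
    """Reject structures whose coordinates fall outside the declared shape."""
    if len(t.indices) != len(t.values):
        raise ValueError("indices and values differ in length")
    rows, cols = t.shape
    for r, c in t.indices:
        if not (0 <= r < rows and 0 <= c < cols):
            raise ValueError(f"index ({r}, {c}) outside shape {t.shape}")


def to_dense(t: SparseCOO) -> list[float]:
    """Scatter values into a flat row-major buffer, trusting the indices."""
    rows, cols = t.shape
    dense = [0.0] * (rows * cols)
    for (r, c), v in zip(t.indices, t.values):
        # The offset comes straight from caller-supplied data.
        dense[r * cols + c] = v
    return dense


# A crafted structure: declared shape is 2x2, but one index points far outside it.
crafted = SparseCOO(indices=[(0, 0), (7, 3)], values=[1.0, 2.0], shape=(2, 2))
```

In Python, to_dense(crafted) stops with an IndexError because list writes are bounds-checked. Native tensor code has no such guard at the write site, which is why the invariant check has to run before densification; check_invariants(crafted) rejects the structure up front.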

The fix in vLLM 0.11.1 closes the deserialization path. Beyond upgrading, the project's guidance is to avoid loading untrusted serialized tensors and to validate user-supplied embeddings before they reach the deserialization sink.
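For teams that deserialize tensors in their own services, the guidance above might look something like the following sketch. It is not the vLLM patch: precheck_payload, load_dense_embed, and the MAX_EMBED_BYTES cap are hypothetical, and the size limit is an arbitrary placeholder. It assumes PyTorch is installed and relies on torch.sparse.check_sparse_tensor_invariants() to turn the sparse checks back on during loading, then accepts only dense (strided) tensors.

```python
import io

MAX_EMBED_BYTES = 8 * 1024 * 1024  # hypothetical cap; tune to your model's embedding size


def precheck_payload(raw: bytes, max_bytes: int = MAX_EMBED_BYTES) -> bytes:
    """Cheap gate applied before any deserialization is attempted."""
    if not isinstance(raw, (bytes, bytearray)):
        raise TypeError("payload must be bytes")
    if len(raw) == 0 or len(raw) > max_bytes:
        raise ValueError(f"payload size {len(raw)} outside 1..{max_bytes} bytes")
    return bytes(raw)


def load_dense_embed(raw: bytes):
    """Deserialize with sparse invariant checks enabled; accept only dense tensors."""
    import torch  # imported lazily so the precheck itself has no torch dependency

    data = precheck_payload(raw)
    # Re-enable the sparse-tensor invariant checks for the duration of the load.
    with torch.sparse.check_sparse_tensor_invariants():
        tensor = torch.load(io.BytesIO(data), weights_only=True, map_location="cpu")
    if not isinstance(tensor, torch.Tensor):
        raise ValueError("payload did not deserialize to a tensor")
    if tensor.layout != torch.strided:
        # Prompt embeddings have no need to be sparse, so refuse rather than densify.
        raise ValueError(f"unsupported tensor layout: {tensor.layout}")
    return tensor
```

Rejecting non-dense layouts outright is a design choice in this sketch: it means .to_dense() is never called on caller-supplied data. None of this replaces upgrading to 0.11.1.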

| Detail | Value |
| --- | --- |
| Identifier | CVE-2025-62164 (GHSA-mrw7-hf4f-83pf) |
| Type | Untrusted tensor deserialization memory corruption (Completions API) |
| Impact | Denial of service; potential remote code execution |
| Affected | vLLM 0.10.2 up to but not including 0.11.1 |
| Fixed in | 0.11.1 |
| Root cause | torch.load() of user-supplied prompt embeds; sparse-tensor invariants unvalidated since PyTorch 2.8.0; out-of-bounds write on .to_dense() |

Why this is an agentic-endpoint risk

A serving engine rarely sits alone. vLLM exists because AI agents, CLIs, and developer tooling need a place to send inference traffic. On a managed endpoint, the vLLM process is an AI artifact in its own right, and so are the Claude Code sessions, MCP servers, and command-line agents that point at it. When that process can be crashed, or potentially turned into code execution, by anyone who can reach its Completions API, it becomes one of the most dangerous AI artifacts on the host.

The blast radius runs past the engine itself. A denial of service against a shared inference endpoint takes down every agent and pipeline that depends on it, and code execution on the serving host would expose the model weights, any provider credentials reachable from that host, and a foothold to pivot toward the tools that route through it. Network and EDR controls can see the request, but they cannot tell you which agents on which endpoints were configured to send prompt-embeds to the affected vLLM build, or what those agents were allowed to do once they reached it.

This is the same artifact-layer blind spot we track across the gateway cluster, including the sibling case in LiteLLM pre-auth SQL injection - CVE-2026-42208 and the LightLLM unauthenticated pickle deserialization RCE - CVE-2026-26220. The serving engine is one node in a graph of AI artifacts, and you can't govern what you can't see. Fleet-wide inventory of every AI artifact is the precondition for scoping an incident like this one.

How Anomity surfaces and governs it

Anomity inventories eight AI artifact types on every managed endpoint: AI agents, MCP servers, extensions, skills, plugins, secrets, hooks, and CLIs. For CVE-2025-62164 that means the vLLM process and its version are catalogued alongside the agents and CLIs that route inference through it, so you can answer "which endpoints run an affected vLLM build, and what talks to it" from the fleet inventory instead of guessing.

On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs. That is the enforcement point in runtime governance: a tool call that routes to a known-vulnerable vLLM version, or that submits prompt-embeds to a serving endpoint outside policy, can be denied or logged in line rather than discovered after the server crashes. Anomity collects metadata only and redacts secrets on the endpoint, so it never has to read the credentials a compromised host might expose.
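As a rough illustration of what an allow, deny, or log decision could look like, here is a minimal sketch of a policy function. It is not Anomity's API or policy language: ToolCall, decide, and the AFFECTED_HOSTS set are hypothetical, and a real deployment would draw the affected hosts from the fleet inventory rather than a hard-coded constant.

```python
from dataclasses import dataclass

# Hypothetical: hosts the inventory reports as running an affected vLLM build.
AFFECTED_HOSTS = frozenset({"inference-a.internal"})


@dataclass(frozen=True)
class ToolCall:
    """The few facts about a tool call that this toy policy looks at."""
    agent: str
    target_host: str
    uses_prompt_embeds: bool


def decide(call: ToolCall, affected_hosts: frozenset[str] = AFFECTED_HOSTS) -> str:
    """Return 'deny', 'log', or 'allow' for one tool call."""
    if call.target_host in affected_hosts:
        # Deny the risky payload type; record everything else that touches the host.
        return "deny" if call.uses_prompt_embeds else "log"
    return "allow"
```

Under this policy, a call that submits prompt-embeds to an affected host is denied, a plain-text call to the same host is logged, and traffic to other hosts passes through.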

Every decision is written to a queryable 90-day audit trail. After a disclosure like this, that trail is what lets responders scope the event: which agents called through the engine, when, and what each call was allowed to do. Anomity routes those decisions to SIEM, Slack, email, or Jira so the right team sees them in the tool they already use. The result is the timeline and the enforcement record described under outcomes.

Anomity complements your existing Network, EDR, DLP, and GRC controls rather than replacing them. It adds the agentic-endpoint layer those tools cannot see. See how it works and how Anomity compares for where it fits.

What to check across your fleet

  • Identify every endpoint and service running vLLM and record the exact version; treat anything from 0.10.2 up to but not including 0.11.1 as affected.
  • Upgrade to vLLM 0.11.1 or later, which closes the prompt-embeds deserialization path.
  • Determine whether the prompt-embeds feature is enabled and whether the Completions API is reachable from untrusted callers; disable prompt-embeds where it is not required.
  • Avoid loading untrusted serialized tensors and validate any user-supplied embeddings before they reach the deserialization path.
  • Restrict network reachability of the Completions API and require authentication at the network edge so arbitrary callers cannot submit tensors.
  • Review host and serving logs for vLLM worker crashes, restarts, and unexpected child processes around prompt-embeds requests.
  • Enumerate which AI agents, CLIs, and MCP servers were configured to route inference through the affected vLLM build, using a fleet-wide AI artifact inventory.
  • Confirm hook-based allow/deny/log enforcement is active on agents that route inference, so calls submitting prompt-embeds to a vulnerable vLLM version can be blocked.
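The version check in the first item can be scripted. The sketch below uses only the standard library and a simplified parser, so treat it as a starting point rather than a full PEP 440 implementation. The is_affected helper and the fleet mapping are hypothetical, and flagging pre-releases of 0.11.1 as affected is a conservative assumption, not something the advisory states.

```python
import re

FIRST_AFFECTED = (0, 10, 2)
FIXED = (0, 11, 1)

_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(.*)$")
_PRERELEASE = re.compile(r"^\.?(a|b|rc|dev)\d*", re.IGNORECASE)


def is_affected(version: str) -> bool:
    """True if the version falls in [0.10.2, 0.11.1), using a simplified parse."""
    m = _VERSION.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in m.group(1, 2, 3))
    suffix = m.group(4)
    if release == FIXED and _PRERELEASE.match(suffix):
        # Assumption: a pre-release of the fixed version may predate the fix.
        return True
    return FIRST_AFFECTED <= release < FIXED


# Hypothetical inventory: host name -> installed vLLM version.
fleet = {"gpu-01": "0.10.2", "gpu-02": "0.11.1", "gpu-03": "0.11.0", "gpu-04": "0.9.2"}
affected = sorted(host for host, v in fleet.items() if is_affected(v))
```

Here affected contains gpu-01 and gpu-03. Version strings the parser does not recognize raise an error instead of being silently treated as safe.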

CVE-2025-62164 turns one reachable inference endpoint into a denial of service and a possible path to code execution, which is exactly why the AI artifact layer needs its own inventory and enforcement. For the full cluster context, see the pillar on securing LLM gateways and proxies. To see Anomity inventory your agents, govern tool calls at the hook, and keep a 90-day audit trail, request early access.

Frequently asked questions

What exactly triggers CVE-2025-62164 in vLLM?

The vLLM Completions API accepts user-supplied prompt embeddings as serialized PyTorch tensors and deserializes them with torch.load() inside _load_and_validate_embed without sufficient validation. Since PyTorch 2.8.0, torch.load with weights_only=True does not validate sparse-tensor invariants unless that is explicitly enabled, so a maliciously crafted sparse tensor bypasses internal bounds checks. When vLLM later calls .to_dense(), PyTorch dereferences attacker-controlled index arrays and writes outside the allocated buffer. That out-of-bounds write can crash the server and, in the worst case, lead to remote code execution. Any user who can reach the Completions API can trigger it.

Which vLLM versions are affected and where is the fix?

The flaw affects vLLM releases from 0.10.2 up to but not including 0.11.1. The fix ships in 0.11.1, so upgrading to 0.11.1 or later is the primary remediation. Beyond upgrading, operators should avoid loading untrusted serialized tensors and should validate any user-supplied embeddings before they reach the deserialization path. If you cannot upgrade immediately, restrict who can reach the Completions API and disable the prompt-embeds feature where it is not required, since the vulnerable path is reachable by any caller with API access.

Why is the prompt-embeds feature the risky part?

Prompt embeds let a caller supply precomputed embedding tensors instead of plain text, which is useful for advanced inference workflows. To accept them, vLLM has to deserialize a binary tensor the caller controls, and it used torch.load() to do so. PyTorch's own documentation has long warned that loading untrusted serialized data is unsafe. Since PyTorch 2.8.0, sparse-tensor invariant checks are also off unless explicitly enabled, which removes the bounds check that would have caught a malformed tensor. The combination turns a convenience feature into an untrusted-input deserialization sink reachable over the network.

How does Anomity help when a serving engine like vLLM is exposed?

Anomity treats the vLLM serving process as an AI artifact on the endpoint, so it inventories the process, its version, and the local agents and CLIs that route inference through it. On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs, so a call that routes to a known-vulnerable vLLM build, or that submits prompt-embeds to one, can be denied or logged in line. Every decision lands in a queryable 90-day audit trail, giving responders the timeline they need to scope the event across the fleet.
