
Ollama Bleeding Llama Unauthenticated Memory Leak - CVE-2026-7482

LLM Gateways & Proxies · Critical · CVE-2026-7482 (Bleeding Llama)
Affected: Ollama before 0.17.1; fixed in 0.17.1

On May 5, 2026, Cyera disclosed CVE-2026-7482, named Bleeding Llama, a critical unauthenticated memory-disclosure flaw in Ollama rated CVSS 9.1, with research estimating roughly 300,000 exposed servers. The bug is an out-of-bounds heap read in how Ollama parses GGUF model files, fixed in Ollama 0.17.1. This advisory covers what the flaw exposes, why a local inference endpoint is an agentic-endpoint problem, and how Anomity surfaces and governs the agents that route through it.

What happened

Ollama is a tool for running large language models locally, exposing an HTTP API for model management and inference. One of those routes, /api/create, creates a model and parses the supplied GGUF model file. GGUF is the on-disk format that carries model weights along with metadata such as each tensor's shape.

In every version before 0.17.1, an attacker can send a crafted GGUF file to the /api/create endpoint with a tensor shape set to a very large number. During model creation, the parser uses that attacker-controlled value to size a read without checking it against the real buffer, which triggers an out-of-bounds heap read. The leaked process memory is returned to the caller, and it can contain user prompts, system prompts, environment variables, API keys, and the conversation data of other concurrent users on the same instance.
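The missing check can be illustrated with a minimal sketch. This is not Ollama's code (Ollama is written in Go) and not the actual patch; the function names, the two-entry type table, and the buffer layout are assumptions made for illustration. Python slices cannot read out of bounds, so the sketch only shows the order of operations a parser in a lower-level setting needs: compute the size the declared shape implies, compare it with the bytes really available, and only then read.

```python
# Element sizes for two illustrative tensor types (an assumed subset,
# not the full GGUF type list).
ELEMENT_SIZE = {"F32": 4, "F16": 2}


def tensor_byte_length(shape, dtype):
    """Return the byte length implied by a declared shape and element type."""
    if dtype not in ELEMENT_SIZE:
        raise ValueError(f"unsupported dtype: {dtype}")
    count = 1
    for dim in shape:
        if dim < 0:
            raise ValueError("negative dimension")
        count *= dim
    return count * ELEMENT_SIZE[dtype]


def read_tensor(buf, offset, shape, dtype):
    """Slice tensor bytes only after checking the declared size against the buffer."""
    need = tensor_byte_length(shape, dtype)
    # The declared size comes from the file, so it is untrusted input.
    if offset < 0 or offset > len(buf) or need > len(buf) - offset:
        raise ValueError("declared tensor size exceeds file data")
    return buf[offset:offset + need]
```

A shape of `[2, 2]` with `F32` elements against a 16-byte buffer passes the check, while a shape of `[2**40]` is rejected before any read happens. In a language with fixed-width integers the size computation would also need an overflow check, which Python's arbitrary-precision integers hide.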

Two properties make Bleeding Llama hard to catch: the attack needs only three API calls, and it leaves no error in the logs. With no crash and no stack trace to alert on, an exposed instance can be read repeatedly without leaving anything for server-side logging to surface, which is why detection depends on dedicated endpoint monitoring.

The fix ships in Ollama 0.17.1, which corrects how the GGUF parser validates tensor shapes before using them to size a read. Because Ollama is frequently run as a local or self-hosted inference endpoint, an exposed instance becomes a direct leak of sensitive conversational and credential data.

Detail      Value
Identifier  CVE-2026-7482 (Bleeding Llama)
Type        Unauthenticated out-of-bounds heap read (GGUF parser)
CVSS        9.1 (Critical)
Affected    Ollama before 0.17.1
Fixed in    Ollama 0.17.1
Exposure    ~300,000 exposed servers; three API calls, no log error

Why this is an agentic-endpoint risk

A local inference endpoint rarely sits alone. Ollama exists because AI agents, CLIs, and developer tooling need somewhere to send model traffic, and on many endpoints that somewhere is a process listening on localhost or a shared port. On a managed endpoint the Ollama process is an AI artifact in its own right, and so are the Claude Code sessions, MCP servers, and command-line agents that point at it.

That concentration is the risk. The process memory this bug exposes is where the most sensitive material lives: system prompts, in-flight user prompts, the API keys the process holds in its environment, and the conversation data of every concurrent user. A single unauthenticated read can pull all of it at once, and because the attack leaves no error in the logs, the host gives no signal that it happened. Network and EDR controls may see the connection, but they cannot tell you which agents on which endpoints were configured to route inference through the affected version.

This is the same artifact-layer blind spot we track across the gateway cluster, including the sibling case in LiteLLM pre-auth SQL injection - CVE-2026-42208. The inference endpoint is one node in a graph of AI artifacts, and you can't govern what you can't see. Fleet-wide inventory of every AI artifact is the precondition for scoping an incident like this one.

How Anomity surfaces and governs it

Anomity inventories eight AI artifact types on every managed endpoint: AI agents, MCP servers, extensions, skills, plugins, secrets, hooks, and CLIs. For CVE-2026-7482 that means the Ollama process and its version are catalogued alongside the agents and CLIs that route inference through it, so you can answer "which endpoints run an affected Ollama build, and what talks to it" from the fleet inventory instead of guessing across hundreds of hosts.
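To make the scoping question concrete, here is a minimal sketch of that query over a flat list of inventory records. The `Artifact` record, its field names, and the `routes_to` link are hypothetical and do not describe Anomity's data model or API; only the 0.17.1 fixed version comes from this advisory.

```python
from dataclasses import dataclass

FIXED = (0, 17, 1)  # first fixed Ollama release, per this advisory


def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' (optional leading 'v') into a tuple of ints.

    Pre-release or build suffixes are dropped, which is a simplification.
    """
    core = text.strip().lstrip("v").split("-")[0].split("+")[0]
    parts = [int(p) for p in core.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


@dataclass
class Artifact:  # hypothetical inventory record
    endpoint: str        # host the artifact was found on
    kind: str            # e.g. "ollama", "agent", "cli", "mcp_server"
    name: str
    version: str = ""
    routes_to: str = ""  # name of the local inference artifact it calls


def affected_scope(inventory):
    """Map each endpoint running Ollama older than FIXED to what routes through it."""
    scope = {}
    for a in inventory:
        if a.kind == "ollama" and parse_version(a.version) < FIXED:
            scope[a.endpoint] = sorted(
                c.name
                for c in inventory
                if c.endpoint == a.endpoint and c.routes_to == a.name
            )
    return scope
```

Given one host with Ollama 0.16.3 plus an agent that routes to it, and a second host already on 0.17.1, the function returns only the first host with that agent listed under it. The join is per endpoint here; agents that reach a shared Ollama server over the network would need the link keyed on address rather than host name.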

On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs. That is the enforcement point in runtime governance: a tool call that routes to a known-vulnerable Ollama version, or that reaches an inference endpoint outside policy, can be denied or logged in line rather than discovered after the fact. Anomity collects metadata only and redacts secrets on the endpoint, so it never has to read the prompts and keys this bug targets.
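As an illustration of that decision point, the sketch below maps one tool call to allow, deny, or log. It is not Anomity's policy engine, and the `tool_input` field name, the `vulnerable` and `approved` sets, and the URL-matching approach are all assumptions chosen for the example.

```python
import json
import re
from urllib.parse import urlsplit

# Match http(s) URLs inside serialized tool input; stop at quotes,
# whitespace, angle brackets, and backslashes.
URL_RE = re.compile(r"https?://[^\s\"'<>\\]+")


def decide(tool_call, vulnerable, approved):
    """Return 'deny', 'log', or 'allow' for one tool call.

    vulnerable and approved are sets of 'host:port' strings
    (hypothetical policy inputs).
    """
    text = json.dumps(tool_call.get("tool_input", {}))
    decision = "allow"
    for url in URL_RE.findall(text):
        try:
            parts = urlsplit(url)
            port = parts.port or (443 if parts.scheme == "https" else 80)
        except ValueError:
            decision = "log"  # unparseable target: record it for review
            continue
        if parts.hostname is None:
            continue
        target = f"{parts.hostname}:{port}"
        if target in vulnerable:
            return "deny"     # known-vulnerable inference endpoint
        if target not in approved:
            decision = "log"  # outside policy, but not known-bad
    return decision
```

With `vulnerable = {"localhost:11434"}`, a shell command that curls `http://localhost:11434/api/generate` is denied, a call to an approved host is allowed, and a call to any other host is logged. The match is a plain string comparison, so aliases such as `127.0.0.1` for `localhost` would have to be listed separately or normalized first.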

Every decision is written to a queryable 90-day audit trail. Because Bleeding Llama leaves no error in the server logs, that trail is what gives responders an on-endpoint timeline: which agents called through the local model endpoint, when, and what each call was allowed to do. Anomity routes those decisions to SIEM, Slack, email, or Jira so the right team sees them in the tool they already use. The result is the timeline and enforcement record described under outcomes.

Anomity complements your existing Network, EDR, DLP, and GRC controls rather than replacing them; it adds the agentic-endpoint layer those tools cannot see. See how it works and how Anomity compares for where it fits.

What to check across your fleet

  • Identify every endpoint, laptop, and server running Ollama and record the exact version; treat anything before 0.17.1 as affected.
  • Upgrade to Ollama 0.17.1 or later, which corrects how the GGUF parser validates tensor shapes before sizing a read.
  • Confirm the Ollama API is not reachable from untrusted networks; bind it to localhost or place it behind authentication at the network edge.
  • Rotate any API keys that were present in the Ollama process environment, since a successful read can expose environment variables.
  • Treat recent system prompts, user prompts, and conversation data on exposed instances as potentially leaked, including data from other concurrent users.
  • Do not rely on server logs to confirm exposure; the attack needs only three API calls and leaves no error, so use endpoint monitoring instead.
  • Enumerate which AI agents, CLIs, and MCP servers were configured to route inference through the affected Ollama endpoint, using a fleet-wide AI artifact inventory.
  • Confirm hook-based allow/deny/log enforcement is active on agents that route model traffic, so calls to a vulnerable version can be blocked.
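The reachability check in the list above can be partly automated. The sketch below classifies a configured bind value as loopback-only or not, assuming the value is a `host[:port]` string such as the one Ollama reads from its `OLLAMA_HOST` environment variable. The function name is hypothetical, and a loopback bind says nothing about reverse proxies or port forwards placed in front of the process.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_only(bind):
    """Return True when a host[:port] bind value names a loopback address."""
    value = bind.strip()
    if "://" not in value:
        value = "//" + value  # let urlsplit parse a bare host[:port]
    host = urlsplit(value).hostname
    if not host:
        return False          # no explicit host: treat as exposed
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False          # unknown host name: treat as exposed until verified
```

Values such as `127.0.0.1:11434` and `[::1]:11434` are classified as loopback-only, while `0.0.0.0:11434` and a LAN address are not. Anything the function cannot parse is reported as exposed, which errs toward a manual review.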

CVE-2026-7482 turns one reachable inference endpoint into a direct leak of prompts and credentials, which is exactly why the AI artifact layer needs its own inventory and enforcement. For the full cluster context, see the pillar on securing LLM gateways and proxies. To see Anomity inventory your agents, govern tool calls at the hook, and keep a 90-day audit trail, request early access.

Frequently asked questions

Does upgrading to Ollama 0.17.1 fully resolve CVE-2026-7482?

Upgrading to Ollama 0.17.1 closes the out-of-bounds read by fixing how the GGUF parser validates tensor shapes during model creation, so the patch stops the bug itself. It does not undo any disclosure that already happened. Because the attack needs only three API calls and leaves no error in the logs, any instance that ran an affected build while reachable should be treated as a possible leak of process memory. Rotate any API keys that were present in the environment, treat recent prompts and conversation data as potentially exposed, and confirm the endpoint is no longer reachable from untrusted networks.

How does the Bleeding Llama out-of-bounds read actually work?

Ollama parses GGUF model files when it creates a model. An attacker sends a crafted GGUF to the /api/create endpoint with a tensor shape set to a very large number. During model creation the parser uses that attacker-controlled value to size a read without bounding it against the real buffer, so it reads past the allocated heap region. The leaked process memory is returned to the caller and can contain user prompts, system prompts, environment variables, API keys, and the conversation data of other concurrent users. No authentication is required, and the whole sequence is three API calls.

Why is an exposed local Ollama instance such a high-value target?

Ollama is frequently run as a local or self-hosted inference endpoint on developer laptops, build agents, and shared servers. Research tied to this disclosure estimates roughly 300,000 exposed servers. When that endpoint is reachable, a process-memory read does not leak one isolated value; it can expose system prompts, in-flight user prompts, the API keys the process holds in its environment, and the conversation data of everyone using the same instance at once. That turns a single unauthenticated request into a direct leak of sensitive conversational and credential data from the host.

How does Anomity reduce exposure from a flaw like Bleeding Llama?

Anomity treats Ollama as an AI artifact on the endpoint, so it inventories the Ollama process, its version, and the local agents and CLIs that route inference through it. On agents that expose a hook, such as Claude Code PreToolUse, Anomity returns allow, deny, or log on each tool call before it runs, so calls that route to a known-vulnerable Ollama version can be denied or logged in line. Because the attack leaves no error in the logs, the queryable 90-day audit trail of those decisions gives responders the on-endpoint timeline that server logs alone do not provide.
