KV cache

Documents layer-major key/value allocation, dimensions, write and read operations, committed length, reset semantics, memory formula, and overflow handling.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Layout

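Keys and values are stored in two separate contiguous f32 vectors. Each vector holds layers × max_context × kv_heads × head_dim elements in layer-major order, so the combined allocation is:

KV-cache byte formula
2 × layers × max_context × kv_heads × head_dim × 4 bytes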

For the supplied model (4 layers, 512-token maximum context, 8 KV heads, head dimension 64), the full cache allocation is 2 × 4 × 512 × 8 × 64 × 4 = 8,388,608 bytes, or 8 MiB.
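
The arithmetic above can be reproduced with a small helper. This is a minimal sketch, not code from the reviewed snapshot: the function name kv_cache_bytes and the use of checked multiplication are assumptions made for illustration.

```rust
/// Hypothetical helper (not taken from the reviewed source): total bytes for
/// the key and value buffers together, following the formula above.
/// Returns `None` if any intermediate product overflows `usize`.
pub fn kv_cache_bytes(
    layers: usize,
    max_context: usize,
    kv_heads: usize,
    head_dim: usize,
) -> Option<usize> {
    // Elements in ONE of the two vectors (keys or values).
    let elems_per_vector = layers
        .checked_mul(max_context)?
        .checked_mul(kv_heads)?
        .checked_mul(head_dim)?;
    // Two vectors (K and V), 4 bytes per f32 element.
    elems_per_vector
        .checked_mul(2)?
        .checked_mul(std::mem::size_of::<f32>())
}
```

Calling kv_cache_bytes(4, 512, 8, 64) yields Some(8_388_608), matching the 8 MiB figure for the supplied model.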

Write protocol

store_layer_at writes one layer’s K/V for a token at an explicit position. commit_len advances the visible cache length only after every layer for that token has been stored successfully, so a partially written token is never exposed to readers.
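
The two-phase write can be sketched as follows. Only the names store_layer_at and commit_len come from this page; the struct fields, signatures, error type, and the exact element ordering within a layer are assumptions chosen for illustration.

```rust
/// Illustrative cache. Field names, signatures, and the `&'static str`
/// error type are assumptions, not the reviewed implementation.
pub struct KvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
    layers: usize,
    max_context: usize,
    row: usize, // kv_heads * head_dim: elements per (layer, position)
    len: usize, // committed (visible) length in tokens
}

impl KvCache {
    /// Allocates both vectors up front. Plain multiplication is used here
    /// for brevity; a real allocator would likely check for overflow.
    pub fn new(layers: usize, max_context: usize, kv_heads: usize, head_dim: usize) -> Self {
        let row = kv_heads * head_dim;
        let n = layers * max_context * row;
        Self { keys: vec![0.0; n], values: vec![0.0; n], layers, max_context, row, len: 0 }
    }

    /// Writes one layer's K/V row at `pos` without changing the visible length.
    pub fn store_layer_at(
        &mut self,
        layer: usize,
        pos: usize,
        k: &[f32],
        v: &[f32],
    ) -> Result<(), &'static str> {
        if layer >= self.layers {
            return Err("layer out of range");
        }
        if pos >= self.max_context {
            return Err("position exceeds max_context");
        }
        if k.len() != self.row || v.len() != self.row {
            return Err("K/V row has wrong length");
        }
        // Layer-major: all positions of layer 0, then layer 1, and so on.
        let start = (layer * self.max_context + pos) * self.row;
        self.keys[start..start + self.row].copy_from_slice(k);
        self.values[start..start + self.row].copy_from_slice(v);
        Ok(())
    }

    /// Publishes the new length once every layer of the token has been stored.
    pub fn commit_len(&mut self, new_len: usize) -> Result<(), &'static str> {
        if new_len > self.max_context {
            return Err("length exceeds max_context");
        }
        self.len = new_len;
        Ok(())
    }

    /// Number of committed tokens.
    pub fn len(&self) -> usize {
        self.len
    }
}
```

In this sketch a caller loops over the layers calling store_layer_at, returns early on the first error, and calls commit_len only after the loop finishes. If any layer fails, the committed length is left untouched.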

Reads

Head-level access computes checked offsets and returns slices for the requested layer, position, and KV head.
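
A checked head-level lookup might look like the sketch below. The KvDims type, the head_slice function, the layer/position/head element ordering, and the choice to reject positions at or beyond the committed length are all assumptions. The page states only that offsets are checked and that slices are returned.

```rust
/// Hypothetical dimensions descriptor (not from the reviewed source).
#[derive(Clone, Copy)]
pub struct KvDims {
    pub layers: usize,
    pub max_context: usize,
    pub kv_heads: usize,
    pub head_dim: usize,
}

/// Returns the `head_dim`-long slice for (layer, pos, head) from a key or
/// value buffer, or `None` if any index is out of range or the offset
/// arithmetic overflows. `len` is the committed length; reading past it is
/// rejected in this sketch.
pub fn head_slice<'a>(
    buf: &'a [f32],
    dims: KvDims,
    len: usize,
    layer: usize,
    pos: usize,
    head: usize,
) -> Option<&'a [f32]> {
    if layer >= dims.layers || head >= dims.kv_heads {
        return None;
    }
    if len > dims.max_context || pos >= len {
        return None;
    }
    // Assumed ordering: ((layer * max_context + pos) * kv_heads + head) * head_dim
    let start = layer
        .checked_mul(dims.max_context)?
        .checked_add(pos)?
        .checked_mul(dims.kv_heads)?
        .checked_add(head)?
        .checked_mul(dims.head_dim)?;
    let end = start.checked_add(dims.head_dim)?;
    // `get` also returns None if the buffer is shorter than expected.
    buf.get(start..end)
}
```

Returning Option rather than indexing directly means a bad layer, position, or head index becomes a recoverable None instead of a panic.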

Reset

Reset sets logical length to zero; backing vectors remain allocated for reuse. Freeing the model drops the cache.
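
The reset behaviour can be illustrated with a reduced cache that tracks only its buffers and committed length. The struct and method names other than commit_len are hypothetical, and leaving old element values in place (rather than zeroing them) is an assumption. The page states only that the logical length becomes zero and the vectors stay allocated.

```rust
/// Reduced, illustrative cache: just the buffers and the committed length.
pub struct KvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
    len: usize,
}

impl KvCache {
    /// Allocates `elems` f32 slots for keys and the same for values.
    pub fn with_elems(elems: usize) -> Self {
        Self { keys: vec![0.0; elems], values: vec![0.0; elems], len: 0 }
    }

    /// Sets the committed length (bounds checks omitted in this sketch).
    pub fn commit_len(&mut self, new_len: usize) {
        self.len = new_len;
    }

    /// Logical reset: no deallocation and no zeroing. Stale data stays in
    /// the buffers but is no longer visible because the length is zero.
    pub fn reset(&mut self) {
        self.len = 0;
    }

    /// Number of committed tokens.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Total f32 slots still allocated across both buffers.
    pub fn allocated_elems(&self) -> usize {
        self.keys.len() + self.values.len()
    }
}
```

Because reset touches only the length field, its cost does not depend on the cache size, and the next sequence reuses the same allocation.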

Limits

The cache is always f32, even for q8/q4 weights. There is no paged cache, sliding window, quantized KV, prefix sharing, session snapshot, or eviction.

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?