KV cache sizing, indexing, and commit discipline

Defines the layer-major cache layout, exact byte formula, write-before-commit behavior, reset semantics, and current equal-head restriction.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Layout

Keys and values are separate contiguous Vec<f32> stores indexed as [layer][position][kv_head][head_dim]. The offset is computed with checked multiplication and bounds validation.
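The offset computation described above can be sketched as follows. This is a minimal illustration, not the reviewed source: the `KvShape` struct, its field names, and `row_offset` are hypothetical names introduced here. It assumes head_dim varies fastest, which is what the [layer][position][kv_head][head_dim] index order implies for a contiguous store.

```rust
/// Hypothetical shape descriptor; the struct and field names are
/// illustrative and are not taken from the reviewed source.
#[derive(Clone, Copy, Debug)]
pub struct KvShape {
    pub layers: usize,
    pub max_context: usize,
    pub kv_heads: usize,
    pub head_dim: usize,
}

impl KvShape {
    /// Returns the start offset, in f32 elements, of the head_dim-long row
    /// for (layer, position, kv_head). Returns None if any index is out of
    /// range or if the arithmetic would overflow usize.
    pub fn row_offset(&self, layer: usize, position: usize, kv_head: usize) -> Option<usize> {
        // Bounds validation: every index must lie inside its dimension.
        if layer >= self.layers || position >= self.max_context || kv_head >= self.kv_heads {
            return None;
        }
        // ((layer * max_context + position) * kv_heads + kv_head) * head_dim,
        // with every step checked so overflow yields None instead of wrapping.
        layer
            .checked_mul(self.max_context)?
            .checked_add(position)?
            .checked_mul(self.kv_heads)?
            .checked_add(kv_head)?
            .checked_mul(self.head_dim)
    }
}
```

With this layout, all positions of one layer are adjacent in memory, and the same offset applies to both the key store and the value store because they share one shape.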

Formula

kv_bytes = 2 × layers × max_context × kv_heads × head_dim × 4, where the factor 2 covers the separate key and value stores and 4 is the size of one f32 in bytes.

TinyLM-16M-shaped example

Four layers, 512 positions, eight KV heads, and head dimension 64 require 2 × 4 × 512 × 8 × 64 × 4 = 8,388,608 bytes (8 MiB).
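The formula can be checked with a short function. This is a sketch; the function name `kv_cache_bytes` is hypothetical, and the use of checked multiplication here is an assumption carried over from the layout description rather than something confirmed for the sizing code.

```rust
/// Total bytes for the key store plus the value store, both holding f32.
/// Returns None if the product would overflow usize.
/// The function name is illustrative, not taken from the reviewed source.
pub fn kv_cache_bytes(
    layers: usize,
    max_context: usize,
    kv_heads: usize,
    head_dim: usize,
) -> Option<usize> {
    // The leading 2 accounts for the two stores (keys and values).
    2usize
        .checked_mul(layers)?
        .checked_mul(max_context)?
        .checked_mul(kv_heads)?
        .checked_mul(head_dim)?
        // size_of::<f32>() is 4, matching the trailing factor in the formula.
        .checked_mul(std::mem::size_of::<f32>())
}
```

Calling `kv_cache_bytes(4, 512, 8, 64)` returns `Some(8_388_608)`, which matches the TinyLM-16M-shaped example.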

Commit discipline

Every layer writes K and V at the current position; only after the layer loop completes does commit_len(position + 1) expose the new cache length. This avoids publishing a partially written token position.
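The write-then-commit ordering can be illustrated with a small sketch. Only `commit_len` is a name taken from the text above; `KvCacheSketch`, `write`, `append_token`, the error strings, and the choice to pass one flattened K row and one V row per layer are assumptions made for illustration.

```rust
/// Illustrative cache. Only `commit_len` is a name from the text;
/// everything else here is hypothetical.
pub struct KvCacheSketch {
    layers: usize,
    max_context: usize,
    row_len: usize, // kv_heads * head_dim
    keys: Vec<f32>,
    values: Vec<f32>,
    len: usize, // number of committed positions visible to readers
}

impl KvCacheSketch {
    pub fn new(layers: usize, max_context: usize, kv_heads: usize, head_dim: usize) -> Option<Self> {
        let row_len = kv_heads.checked_mul(head_dim)?;
        let total = layers.checked_mul(max_context)?.checked_mul(row_len)?;
        Some(Self { layers, max_context, row_len, keys: vec![0.0; total], values: vec![0.0; total], len: 0 })
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Writes K and V for one layer at `position`. Does not change `len`.
    pub fn write(&mut self, layer: usize, position: usize, k: &[f32], v: &[f32]) -> Result<(), &'static str> {
        if layer >= self.layers || position >= self.max_context {
            return Err("index out of range");
        }
        if k.len() != self.row_len || v.len() != self.row_len {
            return Err("row length mismatch");
        }
        // Cannot overflow: indices are in range and `new` checked the total size.
        let start = (layer * self.max_context + position) * self.row_len;
        self.keys[start..start + self.row_len].copy_from_slice(k);
        self.values[start..start + self.row_len].copy_from_slice(v);
        Ok(())
    }

    /// Publishes a new committed length.
    pub fn commit_len(&mut self, new_len: usize) -> Result<(), &'static str> {
        if new_len > self.max_context {
            return Err("length exceeds max_context");
        }
        self.len = new_len;
        Ok(())
    }
}

/// Writes one token's K/V rows for every layer, then commits once.
/// If any write fails, the committed length is left unchanged.
pub fn append_token(cache: &mut KvCacheSketch, k_rows: &[Vec<f32>], v_rows: &[Vec<f32>]) -> Result<(), &'static str> {
    if k_rows.len() != cache.layers || v_rows.len() != cache.layers {
        return Err("expected one K row and one V row per layer");
    }
    let position = cache.len();
    for layer in 0..cache.layers {
        cache.write(layer, position, &k_rows[layer], &v_rows[layer])?;
    }
    // Only now does the new position become visible.
    cache.commit_len(position + 1)
}
```

In this sketch an error partway through the layer loop returns early, so `len` still points at the last fully written position and the partially written slot is simply overwritten by the next attempt.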

Current limit

Forward scratch rejects any configuration where head_count != kv_head_count. The header can describe grouped-query or multi-query attention (GQA/MQA), but this implementation cannot execute it.
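A guard of this kind might look like the sketch below. The error type `HeadConfigError` and the function `check_equal_heads` are hypothetical; the text only establishes that unequal head counts are rejected, not how the rejection is reported.

```rust
/// Hypothetical error type; the reviewed source may report this differently.
#[derive(Debug, PartialEq, Eq)]
pub enum HeadConfigError {
    UnequalHeads { head_count: usize, kv_head_count: usize },
}

/// Accepts only configurations where every query head has its own KV head.
/// GQA and MQA configurations (kv_head_count < head_count) are rejected.
pub fn check_equal_heads(head_count: usize, kv_head_count: usize) -> Result<(), HeadConfigError> {
    if head_count != kv_head_count {
        return Err(HeadConfigError::UnequalHeads { head_count, kv_head_count });
    }
    Ok(())
}
```

Under this sketch, a header with 32 query heads and 4 KV heads parses but fails the check, while 8 and 8 passes.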

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?