This page describes head-local score calculation, the numerically stable softmax, KV cache reads, scaling, the causal range, complexity, and the current restriction that the KV-head count must equal the query-head count.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Score path
For each query head and each cached position from zero through the current position, the runtime computes the dot product of that head's query vector with the stored key vector and multiplies the result by 1 / sqrt(head_dim).
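The score path for a single head can be sketched as follows. This is a minimal illustration, not the reviewed source: the function name `head_scores`, the flat key layout (one `head_dim`-length vector per cached position), and the `f32` element type are all assumptions made for the example.

```rust
/// Scaled dot-product scores for one query head.
///
/// Assumed (hypothetical) layout: `keys` holds the cached key vectors for
/// this head back to back, one `head_dim`-length vector per position.
fn head_scores(query: &[f32], keys: &[f32], head_dim: usize, pos: usize) -> Vec<f32> {
    assert_eq!(query.len(), head_dim);
    assert!(keys.len() >= (pos + 1) * head_dim);

    // Scaling factor 1 / sqrt(head_dim).
    let scale = 1.0 / (head_dim as f32).sqrt();

    // Visit cached positions 0 through `pos`, inclusive.
    (0..=pos)
        .map(|t| {
            let key = &keys[t * head_dim..(t + 1) * head_dim];
            let dot: f32 = query.iter().zip(key).map(|(q, k)| q * k).sum();
            dot * scale
        })
        .collect()
}
```

With `head_dim = 4` the scale is 0.5, so a query of `[1, 2, 3, 4]` against a key of `[1, 1, 1, 1]` yields a score of 5.0.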
Normalization and value mix
The scores are passed through a numerically stable softmax, and the resulting weights are used to form a weighted sum of the cached value vectors for that head.
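A minimal sketch of these two steps is shown below. It assumes the common max-subtraction form of stable softmax and the same flat, per-head value layout as a hypothetical cache; the names `softmax_in_place` and `mix_values` are illustrative and are not taken from the reviewed source.

```rust
/// Numerically stable softmax: subtracting the maximum score before
/// exponentiating keeps `exp` from overflowing on large inputs.
fn softmax_in_place(scores: &mut [f32]) {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    for s in scores.iter_mut() {
        *s /= sum;
    }
}

/// Weighted sum of cached value vectors for one head.
///
/// Assumed (hypothetical) layout: `values` holds one `head_dim`-length
/// vector per cached position, back to back.
fn mix_values(weights: &[f32], values: &[f32], head_dim: usize) -> Vec<f32> {
    assert!(values.len() >= weights.len() * head_dim);
    let mut out = vec![0.0f32; head_dim];
    for (t, w) in weights.iter().enumerate() {
        let v = &values[t * head_dim..(t + 1) * head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

Without the max subtraction, two scores of 1000.0 would overflow `exp` in `f32` and produce NaN weights; with it, each weight is 0.5.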
Causality
During decoding the cache holds no future positions, so iterating only from position zero through the current position is equivalent to applying the causal mask; no explicit mask is needed.
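The equivalence can be checked with a small sketch that compares an explicit mask against the prefix-only range. The function names and the idea of having scores for "future" positions available at all are assumptions made purely for the comparison; in the decode loop described here those positions do not exist yet.

```rust
/// Max-subtraction softmax returning a new vector.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Explicit causal mask: positions after `pos` are set to negative
/// infinity, so they receive zero weight after softmax.
fn masked_weights(full_scores: &[f32], pos: usize) -> Vec<f32> {
    let masked: Vec<f32> = full_scores
        .iter()
        .enumerate()
        .map(|(t, &s)| if t <= pos { s } else { f32::NEG_INFINITY })
        .collect();
    softmax(&masked)
}

/// Decode-style range: only positions 0 through `pos` are visited.
fn prefix_weights(full_scores: &[f32], pos: usize) -> Vec<f32> {
    softmax(&full_scores[..=pos])
}
```

Both functions give the same weights on the visible prefix; the masked version simply carries trailing zeros for the positions the decode loop never touches.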
Complexity
For each head and each decoded token, attention cost grows linearly with cache length, so the attention cost of a full response is quadratic in the generated sequence length. The scalar loops and repeated cache reads are a primary performance bottleneck.
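The quadratic total follows from summing the per-token linear cost. The sketch below counts score dot products for one head in one layer, under the simplifying assumption that decoding starts from an empty cache with no prompt; both function names are illustrative.

```rust
/// Score dot products for one head and one layer when `n` tokens are
/// decoded from an empty cache: the token at position p visits p + 1 keys.
fn score_dot_products(n: u64) -> u64 {
    (0..n).map(|p| p + 1).sum()
}

/// Closed form of the same sum: n * (n + 1) / 2.
fn closed_form(n: u64) -> u64 {
    n * (n + 1) / 2
}
```

For 1,000 tokens this is 500,500 dot products per head per layer, and doubling the length to 2,000 roughly quadruples it to 2,001,000.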
Restriction
The header metadata allows a separate KV-head count, but the forward scratch rejects any artifact where head_count != kv_head_count. Grouped-query attention (GQA) and multi-query attention (MQA) artifacts are therefore not supported.
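A guard of this kind might look like the sketch below. The error type, its variant, and the function name are hypothetical and are not taken from the reviewed source; only the `head_count != kv_head_count` condition comes from the description above.

```rust
/// Hypothetical error type for the example.
#[derive(Debug, PartialEq)]
enum AttentionConfigError {
    UnsupportedKvHeads { head_count: u32, kv_head_count: u32 },
}

/// Rejects any configuration whose KV-head count differs from the
/// query-head count, which rules out GQA and MQA layouts.
fn check_equal_kv_heads(
    head_count: u32,
    kv_head_count: u32,
) -> Result<(), AttentionConfigError> {
    if head_count != kv_head_count {
        return Err(AttentionConfigError::UnsupportedKvHeads {
            head_count,
            kv_head_count,
        });
    }
    Ok(())
}
```

Under this check, a 32-head model with 32 KV heads passes, while 32 query heads with 8 KV heads (GQA) or 1 KV head (MQA) is rejected before any attention work starts.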
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?