Causal attention

Describes the head-local score calculation, stable softmax, KV cache reads, scaling, causal range, complexity, and the current equal-KV-head restriction.

Experimental. Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Score path

For each query head and each cached position from zero through the current position, the runtime computes a dot product with the stored key and multiplies by 1 / sqrt(head_dim).
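The score path for a single head can be sketched as follows. This is a minimal illustration, not the reviewed source: the function name `attention_scores` and the `Vec<Vec<f32>>` cache layout are assumptions made for the example, and `keys` is assumed to hold exactly the cached positions from zero through the current one.

```rust
/// Hypothetical helper: scaled dot-product scores for one query head.
/// `keys` holds one key vector per cached position, 0 through the current position.
fn attention_scores(query: &[f32], keys: &[Vec<f32>]) -> Vec<f32> {
    let head_dim = query.len();
    // Scale factor 1 / sqrt(head_dim), as described in the text.
    let scale = 1.0 / (head_dim as f32).sqrt();
    keys.iter()
        .map(|key| {
            debug_assert_eq!(key.len(), head_dim);
            // Plain scalar dot product between the query and one cached key.
            let dot: f32 = query.iter().zip(key.iter()).map(|(q, k)| q * k).sum();
            dot * scale
        })
        .collect()
}
```

With `head_dim = 4` the scale is 0.5, so a query `[1, 0, 0, 0]` against a key `[2, 0, 0, 0]` yields a score of 1.0.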

Normalization and value mix

Scores are passed through a numerically stable softmax, and the resulting weights are used to form a weighted sum of the cached value vectors for that head.
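A sketch of both steps is shown below. It assumes the common max-subtraction form of stable softmax, which the page does not spell out; the function names `softmax_in_place` and `weighted_value_sum` and the cache layout are hypothetical.

```rust
/// Hypothetical helper: numerically stable softmax, computed in place.
/// Assumes the max-subtraction variant; the reviewed source may differ in detail.
fn softmax_in_place(scores: &mut [f32]) {
    if scores.is_empty() {
        return;
    }
    // Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    for s in scores.iter_mut() {
        *s /= sum;
    }
}

/// Hypothetical helper: weighted sum of cached value vectors for one head.
fn weighted_value_sum(weights: &[f32], values: &[Vec<f32>], head_dim: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; head_dim];
    for (w, v) in weights.iter().zip(values.iter()) {
        for (o, x) in out.iter_mut().zip(v.iter()) {
            *o += w * x;
        }
    }
    out
}
```

Two equal scores of 1000.0 would overflow a naive `exp`, but here they normalize to weights of 0.5 each.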

Causality

During decoding the cache holds no future positions, so iterating from position zero through the current position implements the causal mask.
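The idea can be illustrated with a small per-head cache that grows by one entry per decoded token. The `HeadCache` type and its methods are hypothetical names for this sketch, not taken from the reviewed source.

```rust
/// Hypothetical per-head KV cache: one key/value pair is appended per decoded token.
struct HeadCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl HeadCache {
    fn new() -> Self {
        HeadCache { keys: Vec::new(), values: Vec::new() }
    }

    /// Append the current token's key and value and return its position.
    fn push(&mut self, key: Vec<f32>, value: Vec<f32>) -> usize {
        self.keys.push(key);
        self.values.push(value);
        self.keys.len() - 1
    }

    /// Keys visible to a query at `pos`: positions 0 through `pos`, nothing later.
    fn visible_keys(&self, pos: usize) -> &[Vec<f32>] {
        &self.keys[..=pos]
    }

    /// Values visible to a query at `pos`, using the same range as the keys.
    fn visible_values(&self, pos: usize) -> &[Vec<f32>] {
        &self.values[..=pos]
    }
}
```

Because the newest entry is always the current position, a query at position `pos` sees exactly `pos + 1` entries and no mask tensor is needed in this sketch.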

Complexity

Per token and per head, decode attention cost grows linearly with cache length; summed over a full response, the cost is quadratic in the generated sequence length. The scalar loops and repeated cache reads are a primary performance bottleneck.
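The linear and quadratic terms can be written out as a rough cost model. The symbols are assumptions introduced here for illustration: H is the head count, d is head_dim, t is the zero-based position of the token being decoded, and T is the total number of positions. Constant factors, the layer count, and non-attention work are ignored.

```latex
\[
C_{\text{token}}(t) \propto H \cdot d \cdot (t + 1),
\qquad
C_{\text{response}}(T) \propto H \cdot d \sum_{t=0}^{T-1} (t + 1)
  = H \cdot d \cdot \frac{T(T+1)}{2}
  = O\!\left(H \, d \, T^{2}\right)
\]
```

For example, under this model T = 1024 gives T(T+1)/2 = 524,800 key reads per head, compared with 1,024 for the final token alone.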

Restriction

Header metadata allows a separate KV-head count, but the forward scratch rejects any artifact where head_count != kv_head_count. Grouped-query attention (GQA) and multi-query attention (MQA) artifacts are therefore not supported.
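A check of this kind might look like the sketch below. The error type `AttentionConfigError` and the function `check_equal_kv_heads` are hypothetical; only the `head_count != kv_head_count` condition comes from the page.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum AttentionConfigError {
    UnequalKvHeads { head_count: usize, kv_head_count: usize },
}

/// Hypothetical guard: accept only artifacts with one KV head per query head.
/// GQA and MQA artifacts (kv_head_count < head_count) are rejected.
fn check_equal_kv_heads(
    head_count: usize,
    kv_head_count: usize,
) -> Result<(), AttentionConfigError> {
    if head_count != kv_head_count {
        return Err(AttentionConfigError::UnequalKvHeads { head_count, kv_head_count });
    }
    Ok(())
}
```

Failing early with both counts in the error makes it clear that the artifact is unsupported rather than corrupt.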

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?