Reusable scratch memory and logits

Details model-load allocation of forward buffers, peak-scratch accounting, reuse across prefill and decode, vocabulary logits reuse, and remaining allocations.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Forward scratch

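At model load, the runtime allocates one scratch buffer for each forward-pass intermediate: the residual, the normalized state, Q, K, and V, the attention buffer, the projected output, the FFN normalized state, the W1 and W3 intermediates, the gated hidden state, the FFN output, and the attention scores.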
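
A minimal sketch of such a load-time allocation, assuming a single struct owns every buffer. ScratchDims, ForwardScratch, the field names, and the buffer shapes are hypothetical and chosen for illustration; they are not taken from the reviewed source.

```rust
/// Hypothetical model dimensions. Names and shapes are assumptions for illustration.
#[derive(Clone, Copy, Debug)]
pub struct ScratchDims {
    pub dim: usize,         // residual width
    pub kv_dim: usize,      // width of the K and V projections
    pub hidden_dim: usize,  // FFN inner width
    pub n_heads: usize,     // attention heads
    pub max_seq_len: usize, // longest context the score buffer must cover
}

/// One f32 buffer per intermediate named in the text (hypothetical layout).
pub struct ForwardScratch {
    pub residual: Vec<f32>,
    pub normed: Vec<f32>,
    pub q: Vec<f32>,
    pub k: Vec<f32>,
    pub v: Vec<f32>,
    pub attn: Vec<f32>,
    pub proj: Vec<f32>,
    pub ffn_normed: Vec<f32>,
    pub w1: Vec<f32>,
    pub w3: Vec<f32>,
    pub gated: Vec<f32>,
    pub ffn_out: Vec<f32>,
    pub scores: Vec<f32>,
}

impl ForwardScratch {
    /// Allocate every buffer once; a forward pass would only write into them.
    pub fn new(d: ScratchDims) -> Self {
        let scores_len = d.n_heads * d.max_seq_len;
        Self {
            residual: vec![0.0; d.dim],
            normed: vec![0.0; d.dim],
            q: vec![0.0; d.dim],
            k: vec![0.0; d.kv_dim],
            v: vec![0.0; d.kv_dim],
            attn: vec![0.0; d.dim],
            proj: vec![0.0; d.dim],
            ffn_normed: vec![0.0; d.dim],
            w1: vec![0.0; d.hidden_dim],
            w3: vec![0.0; d.hidden_dim],
            gated: vec![0.0; d.hidden_dim],
            ffn_out: vec![0.0; d.dim],
            scores: vec![0.0; scores_len],
        }
    }

    /// Total number of f32 elements held across all buffers.
    pub fn total_len(&self) -> usize {
        self.residual.len() + self.normed.len() + self.q.len() + self.k.len()
            + self.v.len() + self.attn.len() + self.proj.len()
            + self.ffn_normed.len() + self.w1.len() + self.w3.len()
            + self.gated.len() + self.ffn_out.len() + self.scores.len()
    }
}
```

Because every length is fixed by the dimensions, the total footprint is known as soon as the struct is built.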

Reuse

The same buffers are reused for every layer and every token, across both prefill and decode. A separate logits vector, sized to the vocabulary, is reused by generate and generate_next_token.
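
The reuse pattern for the logits vector can be sketched as follows. Only the names generate and generate_next_token come from the text; the Generator type, the signatures, and the toy scoring rule are assumptions standing in for a real forward pass.

```rust
/// Hypothetical generator that owns a reusable logits buffer.
pub struct Generator {
    logits: Vec<f32>, // sized to the vocabulary once, never resized
}

impl Generator {
    /// Assumes vocab_size > 0.
    pub fn new(vocab_size: usize) -> Self {
        Self { logits: vec![0.0; vocab_size] }
    }

    /// Stand-in for a forward pass: overwrites the logits in place.
    /// Toy rule (an assumption): the next token is (token + 1) % vocab.
    fn fill_logits(&mut self, token: usize) {
        let n = self.logits.len();
        for (i, l) in self.logits.iter_mut().enumerate() {
            *l = if i == (token + 1) % n { 1.0 } else { 0.0 };
        }
    }

    /// Greedy argmax over the reused buffer; performs no allocation.
    pub fn generate_next_token(&mut self, token: usize) -> usize {
        self.fill_logits(token);
        let mut best = 0;
        for i in 1..self.logits.len() {
            if self.logits[i] > self.logits[best] {
                best = i;
            }
        }
        best
    }

    /// Multi-step generation. The output vector is per request,
    /// while the logits buffer is shared with generate_next_token.
    pub fn generate(&mut self, start: usize, steps: usize) -> Vec<usize> {
        let mut out = Vec::with_capacity(steps);
        let mut tok = start;
        for _ in 0..steps {
            tok = self.generate_next_token(tok);
            out.push(tok);
        }
        out
    }

    /// Address of the logits storage, useful for checking that it is reused.
    pub fn logits_ptr(&self) -> *const f32 {
        self.logits.as_ptr()
    }
}
```

Comparing logits_ptr before and after several calls is a cheap way to confirm that no reallocation happened.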

Reported metric

Diagnostics exposes peak_scratch_bytes, which is calculated from the known lengths of these f32 buffers. It is an accounting figure derived from buffer sizes, not a measurement of process peak memory.
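
As a rough illustration of that accounting, the sketch below sums element counts and multiplies by the size of f32. The function signature is hypothetical, and summing all buffers (rather than taking a maximum) is an assumption based on the buffers being allocated together at load.

```rust
/// Hypothetical accounting helper: total bytes for a set of f32 buffers.
/// Assumes every buffer is live at once, so lengths are summed.
pub fn peak_scratch_bytes(buffer_lens: &[usize]) -> usize {
    let elems: usize = buffer_lens.iter().sum();
    elems * std::mem::size_of::<f32>() // 4 bytes per f32 element
}
```

A figure computed this way ignores allocator overhead, model tensors, and the KV cache, which is why it cannot stand in for process peak memory.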

Remaining dynamic work

Tokenizer output and generated-token vectors can grow per request. Result text and diagnostics JSON are rebuilt rather than reused. Model loading allocates tensor vectors. Helper attention code outside the central path still allocates a dynamic score vector, whereas the central generation path uses the preallocated scratch.
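
The difference between a dynamic score vector and a scratch-backed one can be sketched as two variants of the same computation. Both functions and their names are hypothetical, and they compute raw dot products only, with no scaling or softmax.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Allocating variant: builds a fresh Vec on every call.
pub fn scores_alloc(q: &[f32], keys: &[Vec<f32>]) -> Vec<f32> {
    keys.iter().map(|k| dot(q, k)).collect()
}

/// Scratch variant: writes into a caller-owned buffer and returns
/// the filled prefix, so repeated calls allocate nothing.
pub fn scores_into<'a>(q: &[f32], keys: &[Vec<f32>], scratch: &'a mut [f32]) -> &'a [f32] {
    assert!(keys.len() <= scratch.len(), "scratch buffer too small");
    for (s, k) in scratch.iter_mut().zip(keys) {
        *s = dot(q, k);
    }
    &scratch[..keys.len()]
}
```

The two variants return the same values; only the ownership of the output storage differs.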

Next memory instrumentation

Record each of the following separately:

  • Model storage by dtype
  • KV capacity and active bytes
  • Scratch
  • Logits
  • Token vectors
  • Result and diagnostic capacity
  • Source transfer
  • Allocator high-water mark
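
One possible shape for such a report is a struct with one field per category, so that no figure is folded into another. MemoryReport and every field and method name below are assumptions for illustration, not part of the reviewed source.

```rust
/// Hypothetical per-category memory report; all names are assumptions.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MemoryReport {
    pub model_bytes_by_dtype: Vec<(String, usize)>,
    pub kv_capacity_bytes: usize,
    pub kv_active_bytes: usize,
    pub scratch_bytes: usize,
    pub logits_bytes: usize,
    pub token_vector_bytes: usize,
    pub result_diag_capacity_bytes: usize,
    pub source_transfer_bytes: usize,
    /// None when no tracking allocator is installed.
    pub allocator_high_water_bytes: Option<usize>,
}

impl MemoryReport {
    /// Total model storage across all dtypes.
    pub fn model_bytes(&self) -> usize {
        self.model_bytes_by_dtype.iter().map(|(_, b)| *b).sum()
    }

    /// KV bytes reserved but not currently in use.
    pub fn kv_unused_bytes(&self) -> usize {
        self.kv_capacity_bytes.saturating_sub(self.kv_active_bytes)
    }
}
```

Keeping the allocator high-water mark optional makes it explicit when the report contains only computed sizes and no measured value.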

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?