Details model-load allocation of forward buffers, peak-scratch accounting, reuse across prefill and decode, vocabulary logits reuse, and remaining allocations.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Forward scratch
Model load allocates buffers for residual, normalized states, Q/K/V, attention, projected output, FFN normalized state, W1/W3 intermediates, gated hidden, FFN output, and attention scores.
Reuse
The same buffers are reused for every layer and token. A separate logits vector sized to vocabulary is reused by generate and generate_next_token.
Reported metric
Diagnostics exposes peak_scratch_bytes, calculated from these known f32 buffer lengths. This is not process peak memory.
Remaining dynamic work
Tokenizer output and generated-token vectors can grow per request. Result text and diagnostics JSON are rebuilt. Model loading allocates tensor vectors. Helper attention code outside the central path contains a dynamic score vector, while the central generation path uses scratch.
Next memory instrumentation
Record model storage by dtype, KV capacity and active bytes, scratch, logits, token vectors, result/diagnostic capacity, source transfer, and allocator high-water mark separately.
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?