Forward scratch sizing and reuse

Derives the exact reusable f32 scratch formula and explains why the runtime allocates once per accepted model rather than once per token.

Experimental
Last verified
2026-06-25 00:00 UTC
Updated
Reading time
1 minute

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Buffer inventory

The scratch object contains ten hidden-width vectors, three feed-forward-width vectors in one shared allocation, and one context-width attention-score vector.
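The inventory above can be pictured as a single struct. The following is a minimal sketch, not the reviewed source: the type name ForwardScratch, its field names, and the ffn_parts and allocation_count helpers are all hypothetical and chosen only to mirror the three buffer groups.

```rust
/// Hypothetical layout; the names here are illustrative and are not taken
/// from the reviewed source.
pub struct ForwardScratch {
    /// Ten hidden-width vectors, each holding `hidden_size` f32 values.
    pub hidden: [Vec<f32>; 10],
    /// One allocation holding three feed-forward-width vectors back to back.
    pub ffn: Vec<f32>,
    /// One attention-score vector holding `max_context` f32 values.
    pub scores: Vec<f32>,
    /// Width of each of the three slices inside `ffn`.
    ffn_size: usize,
}

impl ForwardScratch {
    /// Allocates every buffer once, zero-filled.
    pub fn new(hidden_size: usize, ffn_size: usize, max_context: usize) -> Self {
        Self {
            hidden: std::array::from_fn(|_| vec![0.0; hidden_size]),
            ffn: vec![0.0; 3 * ffn_size],
            scores: vec![0.0; max_context],
            ffn_size,
        }
    }

    /// Splits the shared allocation into its three feed-forward-width views.
    pub fn ffn_parts(&mut self) -> (&mut [f32], &mut [f32], &mut [f32]) {
        let (a, rest) = self.ffn.split_at_mut(self.ffn_size);
        let (b, c) = rest.split_at_mut(self.ffn_size);
        (a, b, c)
    }

    /// Number of distinct heap allocations: ten hidden, one shared FFN, one scores.
    pub fn allocation_count(&self) -> usize {
        self.hidden.len() + 2
    }
}
```

Under this layout the scratch is backed by twelve heap allocations, because the three feed-forward-width vectors share one.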

Formula

scratch_bytes = 4 × (10 × hidden_size + 3 × ffn_size + max_context), where 4 is the size of one f32 element in bytes.
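As a minimal sketch, the formula translates directly into a small helper. The function name scratch_bytes and its signature are assumptions made for illustration; the checked arithmetic is an added precaution against overflow on untrusted model metadata, not something the source is known to do.

```rust
/// Size of one f32 element in bytes (always 4).
const F32_BYTES: usize = std::mem::size_of::<f32>();

/// Hypothetical helper: returns the scratch payload in bytes,
/// or `None` if the arithmetic would overflow `usize`.
pub fn scratch_bytes(hidden_size: usize, ffn_size: usize, max_context: usize) -> Option<usize> {
    // Element count: ten hidden-width, three FFN-width, one context-width vector.
    let elements = hidden_size
        .checked_mul(10)?
        .checked_add(ffn_size.checked_mul(3)?)?
        .checked_add(max_context)?;
    elements.checked_mul(F32_BYTES)
}
```

Returning an Option lets a loader reject a model whose declared dimensions cannot be sized, before any allocation is attempted.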

TinyLM-16M-shaped example

For hidden_size = 512, ffn_size = 2048, and max_context = 512, the formula yields 47,104 bytes. This figure covers only the scratch payload: it excludes the KV cache, logits, model storage, token vectors, and allocator overhead.
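Substituting the example dimensions step by step shows where the total comes from:

```latex
\begin{aligned}
\text{scratch\_bytes} &= 4 \times (10 \times 512 + 3 \times 2048 + 512) \\
                      &= 4 \times (5120 + 6144 + 512) \\
                      &= 4 \times 11776 \\
                      &= 47104 \ \text{bytes} = 46 \ \text{KiB}
\end{aligned}
```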

Reuse guarantee

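The source tests snapshot all twelve backing addresses (ten hidden-width vectors, the shared feed-forward allocation, and the attention-score vector) across generate and generate_next_token. The intended invariant is that no scratch vector is reallocated during steady-state decoding.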
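An address-snapshot check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the source's test: the Scratch type, backing_addresses, and fake_decode_step are hypothetical names, and fake_decode_step merely stands in for the real generate and generate_next_token paths by overwriting the buffers in place.

```rust
/// Hypothetical scratch layout: ten hidden-width vectors, one shared
/// feed-forward allocation, and one attention-score vector.
pub struct Scratch {
    hidden: [Vec<f32>; 10],
    ffn: Vec<f32>,
    scores: Vec<f32>,
}

impl Scratch {
    pub fn new(hidden_size: usize, ffn_size: usize, max_context: usize) -> Self {
        Self {
            hidden: std::array::from_fn(|_| vec![0.0; hidden_size]),
            ffn: vec![0.0; 3 * ffn_size],
            scores: vec![0.0; max_context],
        }
    }

    /// Snapshots the twelve backing addresses (10 hidden + 1 FFN + 1 scores).
    pub fn backing_addresses(&self) -> [usize; 12] {
        let mut out = [0usize; 12];
        // `zip` stops after the ten hidden vectors, filling slots 0..10.
        for (slot, v) in out.iter_mut().zip(self.hidden.iter()) {
            *slot = v.as_ptr() as usize;
        }
        out[10] = self.ffn.as_ptr() as usize;
        out[11] = self.scores.as_ptr() as usize;
        out
    }

    /// Stand-in for one decode step: overwrites every buffer in place and
    /// never pushes to, resizes, or replaces a vector.
    pub fn fake_decode_step(&mut self, token: u32) {
        let x = token as f32;
        for v in self.hidden.iter_mut() {
            v.fill(x);
        }
        self.ffn.fill(x);
        self.scores.fill(x);
    }
}
```

A test built on this sketch records backing_addresses() once after construction, runs a number of decode steps, and asserts that a second snapshot is identical. Any push past capacity or replacement of a vector would change at least one address and fail the comparison.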

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?