Explains how one model API selects f32, q8_0, or q4_0 behavior without decoding every quantized tensor to a permanent float copy.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Storage enum
Each loaded tensor is one of three runtime-owned variants: f32 values, q8 signed bytes plus row scales, or q4 packed nibbles plus block scales and a block size.
Operation dispatch
Embedding and norm reads call copy helpers. Matrix operations call a storage-aware matvec method. Quantized matrix rows are consumed directly by q8 or q4 kernels; only requested rows or vectors are expanded into caller-provided scratch when required.
Evidence boundary
This is compact runtime storage, not zero-copy file mapping. The parser copies accepted bytes into Rust-owned vectors. The browser transfer buffer is released only after loading returns.
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?