This page explains the complete model-admission sequence, the runtime-owned storage variants, pre-resolved tensor indices, tokenizer selection, and the memory implications of loading a model.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Admission sequence
- Parse fixed header and checksum.
- Parse tensor directory and ranges.
- Parse BTOK or BPE1 section.
- Verify required tensor hashes.
- Verify exact global and per-layer shapes.
- Decode or copy each tensor into a typed storage variant.
- Resolve global and per-layer indices.
- Allocate KV cache, forward scratch, and logits.
- Install the model into Runtime.
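The index-resolution step can be illustrated with a minimal sketch. It assumes the tensor directory is available as an ordered list of names and that each name is looked up once at load time, so later code can use plain integer indices. The tensor names (tok_embd, layer.N.attn_norm, layer.N.ffn_norm), the struct names, and the function resolve_indices are hypothetical and are not taken from the reviewed source.

```rust
use std::collections::HashMap;

/// Hypothetical per-layer indices into the tensor directory.
#[derive(Debug, PartialEq)]
pub struct LayerIndices {
    pub attn_norm: usize,
    pub ffn_norm: usize,
}

/// Hypothetical set of indices resolved once during admission.
#[derive(Debug, PartialEq)]
pub struct ResolvedIndices {
    pub token_embedding: usize,
    pub layers: Vec<LayerIndices>,
}

/// Resolve every required tensor name to its directory position.
/// Fails on the first missing tensor, so an incomplete model is never admitted.
pub fn resolve_indices(directory: &[&str], n_layers: usize) -> Result<ResolvedIndices, String> {
    // Build the name -> index map once; it is only needed during loading.
    let by_name: HashMap<&str, usize> = directory
        .iter()
        .enumerate()
        .map(|(index, name)| (*name, index))
        .collect();

    let find = |name: &str| -> Result<usize, String> {
        by_name
            .get(name)
            .copied()
            .ok_or_else(|| format!("missing tensor: {}", name))
    };

    let token_embedding = find("tok_embd")?;

    let mut layers = Vec::with_capacity(n_layers);
    for layer in 0..n_layers {
        layers.push(LayerIndices {
            attn_norm: find(&format!("layer.{}.attn_norm", layer))?,
            ffn_norm: find(&format!("layer.{}.ffn_norm", layer))?,
        });
    }

    Ok(ResolvedIndices { token_embedding, layers })
}
```

With this shape, string lookups happen only during admission, and a missing tensor is reported before any model state is installed.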
Storage enum
- F32: Vec<f32>.
- Q8: Vec<i8> plus row scales.
- Q4: packed Vec<u8>, block scales, and block size.
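The three variants could be modeled roughly as follows. This is a sketch, not the reviewed definition: the enum name TensorStorage, the field names, the use of f32 for scales, and the assumption that Q4 packs two 4-bit values per byte are all illustrative.

```rust
/// Hypothetical sketch of a runtime-owned tensor storage enum.
#[derive(Debug, Clone, PartialEq)]
pub enum TensorStorage {
    /// Full-precision values.
    F32(Vec<f32>),
    /// Signed 8-bit values with one scale per row (scale type assumed f32).
    Q8 { data: Vec<i8>, row_scales: Vec<f32> },
    /// Packed 4-bit values with one scale per block.
    /// Assumption: two values per byte.
    Q4 { packed: Vec<u8>, block_scales: Vec<f32>, block_size: usize },
}

impl TensorStorage {
    /// Number of logical elements stored in the variant.
    pub fn len(&self) -> usize {
        match self {
            TensorStorage::F32(values) => values.len(),
            TensorStorage::Q8 { data, .. } => data.len(),
            // Two 4-bit values per packed byte (assumed layout).
            TensorStorage::Q4 { packed, .. } => packed.len() * 2,
        }
    }

    /// Bytes occupied by the element and scale payloads.
    /// Excludes spare Vec capacity and allocator overhead.
    pub fn payload_bytes(&self) -> usize {
        let f32_size = std::mem::size_of::<f32>();
        match self {
            TensorStorage::F32(values) => values.len() * f32_size,
            TensorStorage::Q8 { data, row_scales } => {
                data.len() + row_scales.len() * f32_size
            }
            TensorStorage::Q4 { packed, block_scales, .. } => {
                packed.len() + block_scales.len() * f32_size
            }
        }
    }
}
```

Because each variant owns its vectors, the runtime does not need to keep the original file bytes alive after a tensor has been decoded or copied.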
Selective dequantization
Matrix operations dispatch directly by storage type. Small vectors and embedding rows can be copied into reusable f32 scratch. Borrowing a quantized tensor as an f32 slice is rejected, which prevents an accidental, hidden expansion of the whole model to f32.
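A minimal sketch of this behavior, limited to the F32 and Q8 cases for brevity. The names Storage, StorageError, as_f32_slice, matvec, and copy_row_into are hypothetical, and the row-major layout and f32 scales are assumptions rather than details confirmed by the reviewed source.

```rust
#[derive(Debug)]
pub enum Storage {
    F32(Vec<f32>),
    Q8 { data: Vec<i8>, row_scales: Vec<f32> },
}

#[derive(Debug, PartialEq)]
pub enum StorageError {
    QuantizedBorrow,
    ShapeMismatch,
}

impl Storage {
    /// Borrowing as f32 succeeds only for F32 storage. Quantized storage is
    /// rejected instead of being silently expanded.
    pub fn as_f32_slice(&self) -> Result<&[f32], StorageError> {
        match self {
            Storage::F32(values) => Ok(values.as_slice()),
            Storage::Q8 { .. } => Err(StorageError::QuantizedBorrow),
        }
    }

    /// Compute y = W x for a row-major rows x cols matrix, per storage type.
    pub fn matvec(&self, rows: usize, cols: usize, x: &[f32], y: &mut [f32]) -> Result<(), StorageError> {
        if x.len() != cols || y.len() != rows {
            return Err(StorageError::ShapeMismatch);
        }
        match self {
            Storage::F32(w) => {
                if w.len() != rows * cols {
                    return Err(StorageError::ShapeMismatch);
                }
                for (r, out) in y.iter_mut().enumerate() {
                    let row = &w[r * cols..(r + 1) * cols];
                    *out = row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
                }
            }
            Storage::Q8 { data, row_scales } => {
                if data.len() != rows * cols || row_scales.len() != rows {
                    return Err(StorageError::ShapeMismatch);
                }
                for (r, out) in y.iter_mut().enumerate() {
                    let row = &data[r * cols..(r + 1) * cols];
                    // Accumulate on the quantized values, then apply the row scale once.
                    let acc = row.iter().zip(x).map(|(&q, &b)| q as f32 * b).sum::<f32>();
                    *out = acc * row_scales[r];
                }
            }
        }
        Ok(())
    }

    /// Dequantize or copy a single row into caller-owned, reusable f32 scratch.
    pub fn copy_row_into(&self, row: usize, cols: usize, scratch: &mut [f32]) -> Result<(), StorageError> {
        if scratch.len() != cols {
            return Err(StorageError::ShapeMismatch);
        }
        let (start, end) = (row * cols, row * cols + cols);
        match self {
            Storage::F32(w) => {
                let src = w.get(start..end).ok_or(StorageError::ShapeMismatch)?;
                scratch.copy_from_slice(src);
            }
            Storage::Q8 { data, row_scales } => {
                let src = data.get(start..end).ok_or(StorageError::ShapeMismatch)?;
                let scale = *row_scales.get(row).ok_or(StorageError::ShapeMismatch)?;
                for (dst, &q) in scratch.iter_mut().zip(src) {
                    *dst = q as f32 * scale;
                }
            }
        }
        Ok(())
    }
}
```

In this sketch the only f32 memory a quantized tensor ever produces is one row of scratch, which the caller can reuse across calls.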
Memory accounting gap
The runtime exposes scratch bytes and KV length but does not yet report total model heap, transfer peak, allocator overhead, browser ArrayBuffer duplication, or process/GPU memory.
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?