Traces the exact per-token execution order through embedding, four transformer layers, final normalization, output projection, and logits.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Token entry
The selected token ID indexes one embedding row into the residual buffer.
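A minimal sketch of the lookup, assuming the embedding table is a flat row-major f32 slice of shape [vocab_size, dim]. The function name embed_token and the layout are illustrative assumptions, not taken from the reviewed source.

```rust
/// Copy the embedding row for `token` into the residual buffer.
/// Assumption: `embeddings` is row-major with `dim` values per token.
fn embed_token(embeddings: &[f32], dim: usize, token: usize, residual: &mut [f32]) {
    assert_eq!(residual.len(), dim, "residual must hold exactly one row");
    let start = token * dim;
    // Slicing panics if the token ID is out of range for the table.
    residual.copy_from_slice(&embeddings[start..start + dim]);
}
```

The copy, rather than a borrowed view of the row, reflects that the residual is later updated in place by each layer.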
Per-layer attention block
- RMS-normalize the residual with attn_norm.
- Project Q, K, and V.
- Apply RoPE to each Q and K head at the current position.
- Store K and V in the layer-major cache.
- Compute causal attention over cached positions 0 through the current position.
- Project attention output with WO.
- Add the projection to the residual.
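The RoPE and causal-attention steps above can be sketched for a single head as follows. This is a hedged illustration only: the adjacent-pair rotation convention, the theta_base parameter, and the per-head cache layout [position][head_dim] are assumptions, and the names rope and attend_one_head are hypothetical. The reviewed code may pair dimensions differently or index its layer-major cache in another way.

```rust
/// Rotate adjacent pairs (x[2i], x[2i+1]) of one head by a position-dependent angle.
/// Assumption: adjacent-pair convention; `head.len()` is even.
fn rope(head: &mut [f32], pos: usize, theta_base: f32) {
    let head_dim = head.len();
    for i in (0..head_dim).step_by(2) {
        let freq = 1.0 / theta_base.powf(i as f32 / head_dim as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (head[i], head[i + 1]);
        head[i] = a * cos - b * sin;
        head[i + 1] = a * sin + b * cos;
    }
}

/// Causal attention for one head over cached positions 0..=pos.
/// Assumption: `k_cache` and `v_cache` hold this head's rows as [position][head_dim].
fn attend_one_head(q: &[f32], k_cache: &[f32], v_cache: &[f32], pos: usize, out: &mut [f32]) {
    let head_dim = q.len();
    let scale = 1.0 / (head_dim as f32).sqrt();

    // Scaled dot-product score against every cached key up to the current position.
    let mut scores: Vec<f32> = (0..=pos)
        .map(|t| {
            let k = &k_cache[t * head_dim..(t + 1) * head_dim];
            q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();

    // Numerically stable softmax over the scores.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    for s in scores.iter_mut() {
        *s /= sum;
    }

    // Weighted sum of the cached value rows.
    out.fill(0.0);
    for (t, w) in scores.iter().enumerate() {
        let v = &v_cache[t * head_dim..(t + 1) * head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += *w * *x;
        }
    }
}
```

Causality here comes from the loop bound alone: only positions 0..=pos are read, so no explicit mask is needed when decoding one token at a time.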
Per-layer FFN block
- RMS-normalize the updated residual with ffn_norm.
- Project W1 and W3.
- Compute elementwise SwiGLU.
- Project through W2.
- Add the result to the residual.
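The FFN steps above can be sketched as plain scalar loops. This is a sketch under stated assumptions, not the reviewed implementation: weights are taken to be flat row-major f32 slices, SiLU is assumed to be applied to the W1 branch (the common Llama-style convention), and eps is a caller-supplied parameter. The helper names rmsnorm, matvec, silu, and ffn_block are hypothetical.

```rust
/// RMSNorm: scale x by the inverse root-mean-square, then by a learned weight.
fn rmsnorm(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    for ((o, v), w) in out.iter_mut().zip(x).zip(weight) {
        *o = *v * inv * *w;
    }
}

/// Scalar matrix-vector product; `w` is row-major with shape [out.len(), x.len()].
fn matvec(w: &[f32], x: &[f32], out: &mut [f32]) {
    let cols = x.len();
    for (row, o) in out.iter_mut().enumerate() {
        let r = &w[row * cols..(row + 1) * cols];
        *o = r.iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
    }
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// One FFN block: norm, W1/W3 projections, SwiGLU, W2 projection, residual add.
/// Assumption: SiLU gates the W1 branch; W3 is the linear branch.
fn ffn_block(
    residual: &mut [f32],
    ffn_norm: &[f32],
    w1: &[f32],
    w3: &[f32],
    w2: &[f32],
    hidden: usize,
    eps: f32,
) {
    let dim = residual.len();
    let mut normed = vec![0.0f32; dim];
    rmsnorm(residual, ffn_norm, eps, &mut normed);

    let mut gate = vec![0.0f32; hidden];
    let mut up = vec![0.0f32; hidden];
    matvec(w1, &normed, &mut gate);
    matvec(w3, &normed, &mut up);

    // Elementwise SwiGLU: silu(W1 x) * (W3 x).
    for (g, u) in gate.iter_mut().zip(&up) {
        *g = silu(*g) * *u;
    }

    let mut down = vec![0.0f32; dim];
    matvec(w2, &gate, &mut down);
    for (r, d) in residual.iter_mut().zip(&down) {
        *r += *d;
    }
}
```

The attention block's normalization step has the same shape, with attn_norm in place of ffn_norm.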
Output
After the last layer, final RMSNorm is applied and the output projection produces vocabulary logits. Sampling selects a token.
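A small sketch of the last two steps, assuming the hidden vector has already passed through the final RMSNorm and that the output projection is a row-major [vocab, dim] matrix. The source does not state which sampling strategy is used, so greedy argmax is shown only as the simplest possible choice. The names output_logits and argmax are hypothetical.

```rust
/// Project the normalized hidden state to one logit per vocabulary entry.
/// Assumption: `w_out` is row-major with shape [vocab, hidden.len()].
fn output_logits(hidden: &[f32], w_out: &[f32], vocab: usize) -> Vec<f32> {
    let dim = hidden.len();
    (0..vocab)
        .map(|t| {
            let row = &w_out[t * dim..(t + 1) * dim];
            row.iter().zip(hidden).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}

/// Greedy selection: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```

Any temperature, top-k, or top-p sampling would replace argmax and operate on the same logits vector.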
Execution model
All operations are synchronous scalar CPU loops inside WASM. There is no graph compiler, operator fusion, batching, or device abstraction.
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?