Transformer forward pass

Traces the exact per-token execution order through embedding, four transformer layers, final normalization, output projection, and logits.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Token entry

The selected token ID picks one row of the embedding table, and that row is loaded into the residual buffer.
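
The lookup can be sketched as a single slice copy. This is a minimal illustration, not the reviewed code: it assumes a row-major f32 table of shape [vocab_size, dim], and the name load_embedding is hypothetical. A quantized GGUF tensor would need dequantizing first.

```rust
/// Copies the embedding row for `token_id` into the residual buffer.
/// Assumption: `embeddings` is row-major f32 with shape [vocab_size, dim].
fn load_embedding(embeddings: &[f32], dim: usize, token_id: usize, residual: &mut [f32]) {
    assert_eq!(residual.len(), dim, "residual buffer must hold exactly one row");
    let start = token_id * dim;
    // Indexing panics if token_id is out of range for the table.
    residual.copy_from_slice(&embeddings[start..start + dim]);
}
```

From here on the residual buffer is the only per-token state that every later block reads and updates.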

Per-layer attention block

  1. RMS-normalize the residual with attn_norm.
  2. Project Q, K, and V.
  3. Apply RoPE to each Q and K head at the current position.
  4. Store K and V in the layer-major cache.
  5. Compute causal attention over cached positions 0 through the current position.
  6. Project attention output with WO.
  7. Add the projection to the residual.
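
Steps 3 through 5 can be sketched for a single head as follows. This is an illustration under stated assumptions, not the reviewed implementation: the adjacent-pair RoPE convention, the base passed as theta_base, the KvCache type, and its index formula are all hypothetical. The normalization and the Q, K, V, and WO projections are omitted.

```rust
/// Rotates adjacent pairs (x[2i], x[2i+1]) of one head by a position-dependent angle.
/// Assumption: adjacent pairing and an even head size; some models pair (i, i + d/2).
fn rope(head: &mut [f32], pos: usize, theta_base: f32) {
    let d = head.len();
    for i in (0..d).step_by(2) {
        let freq = theta_base.powf(-(i as f32) / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (head[i], head[i + 1]);
        head[i] = a * cos - b * sin;
        head[i + 1] = a * sin + b * cos;
    }
}

/// Hypothetical layer-major cache: element index = (layer * max_seq + pos) * head_dim + i.
struct KvCache {
    k: Vec<f32>,
    v: Vec<f32>,
    max_seq: usize,
    head_dim: usize,
}

impl KvCache {
    fn new(n_layers: usize, max_seq: usize, head_dim: usize) -> Self {
        let len = n_layers * max_seq * head_dim;
        KvCache { k: vec![0.0; len], v: vec![0.0; len], max_seq, head_dim }
    }

    fn offset(&self, layer: usize, pos: usize) -> usize {
        (layer * self.max_seq + pos) * self.head_dim
    }

    fn store(&mut self, layer: usize, pos: usize, k: &[f32], v: &[f32]) {
        let o = self.offset(layer, pos);
        self.k[o..o + self.head_dim].copy_from_slice(k);
        self.v[o..o + self.head_dim].copy_from_slice(v);
    }
}

/// Single-head causal attention for the token at `pos`:
/// softmax(q . K^T / sqrt(d)) . V over cached positions 0..=pos.
fn attend(cache: &KvCache, layer: usize, pos: usize, q: &[f32], out: &mut [f32]) {
    let d = cache.head_dim;
    let scale = 1.0 / (d as f32).sqrt();
    let mut scores = vec![0.0f32; pos + 1];
    for t in 0..=pos {
        let o = cache.offset(layer, t);
        let mut dot = 0.0f32;
        for i in 0..d {
            dot += q[i] * cache.k[o + i];
        }
        scores[t] = dot * scale;
    }
    // Numerically stable softmax: subtract the maximum before exponentiating.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    // Weighted sum of the cached values.
    out.fill(0.0);
    for t in 0..=pos {
        let w = scores[t] / sum;
        let o = cache.offset(layer, t);
        for i in 0..d {
            out[i] += w * cache.v[o + i];
        }
    }
}
```

Causality needs no explicit mask in this per-token form: the loop stops at the current position, so later positions are never read.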

Per-layer FFN block

  1. RMS-normalize the updated residual with ffn_norm.
  2. Project the normalized residual through W1 and W3.
  3. Compute elementwise SwiGLU.
  4. Project through W2.
  5. Add the result to the residual.
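
Steps 2 through 5 can be sketched with plain scalar loops. This is a sketch under assumptions rather than the reviewed code: weights are taken to be row-major f32, the function names are hypothetical, and W1 is assumed to be the SiLU-gated branch (the LLaMA convention). The caller passes in the already-normalized residual.

```rust
/// Scalar matrix-vector product: out = W * x, with W row-major [out.len(), x.len()].
fn matvec(w: &[f32], x: &[f32], out: &mut [f32]) {
    let cols = x.len();
    for (r, o) in out.iter_mut().enumerate() {
        let row = &w[r * cols..(r + 1) * cols];
        *o = row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
    }
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// FFN block: residual += W2 * (silu(W1 * h) * (W3 * h)), where `normed` is h.
/// Assumption: W1 is the gate branch and W3 is the linear "up" branch.
fn ffn_block(
    residual: &mut [f32],
    normed: &[f32],
    w1: &[f32],
    w3: &[f32],
    w2: &[f32],
    hidden: usize,
) {
    let mut gate = vec![0.0f32; hidden];
    let mut up = vec![0.0f32; hidden];
    matvec(w1, normed, &mut gate);
    matvec(w3, normed, &mut up);
    // Elementwise SwiGLU: gate the "up" branch with SiLU of the gate branch.
    for (g, u) in gate.iter_mut().zip(&up) {
        *g = silu(*g) * u;
    }
    let mut down = vec![0.0f32; residual.len()];
    matvec(w2, &gate, &mut down);
    // Residual connection.
    for (r, d) in residual.iter_mut().zip(&down) {
        *r += d;
    }
}
```

Reusing the gate buffer for the activated values avoids a third hidden-sized allocation, a small saving when every operation is a scalar loop.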

Output

After the last layer, a final RMSNorm is applied to the residual and the output projection maps the result to vocabulary logits. Sampling then selects the next token from those logits.
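
The output stage can be sketched as three small functions. The names, the eps parameter, and the row-major f32 layout are assumptions for illustration. Greedy argmax stands in for sampling here, since the source does not specify the sampling strategy.

```rust
/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
fn rmsnorm(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    for i in 0..x.len() {
        out[i] = x[i] * inv * weight[i];
    }
}

/// Output projection: logits[v] = dot(row v of w_out, normed).
/// Assumption: `w_out` is row-major f32 with shape [vocab, dim].
fn project_logits(w_out: &[f32], normed: &[f32], vocab: usize) -> Vec<f32> {
    let dim = normed.len();
    (0..vocab)
        .map(|v| {
            w_out[v * dim..(v + 1) * dim]
                .iter()
                .zip(normed)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Greedy selection: index of the largest logit (the first one wins on ties).
/// This is a stand-in; temperature or top-k sampling would replace this step.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

The same rmsnorm shape would serve the per-layer attn_norm and ffn_norm steps, with only the weight vector changing.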

Execution model

All operations are synchronous scalar CPU loops inside WASM. There is no graph compiler, operator fusion, batching, or device abstraction.

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?