Transformer forward pass

Traces the exact per-token execution order: embedding lookup, four transformer layers, final normalization, and the output projection that produces logits.

Experimental

Last verified: 2026-06-25 00:00 UTC
Updated: 2026-06-25
Reading time: 2 minutes

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Token entry

The selected token ID picks one row of the embedding table, and that row is written into the residual buffer.
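
As a rough illustration of the lookup described above, here is a minimal Python sketch. The function name `embed_token`, the list-of-rows table layout, and the toy values are assumptions made for this example; they are not taken from the reviewed source.

```python
def embed_token(embedding, token_id):
    """Return a copy of one embedding row to use as the residual buffer."""
    if not 0 <= token_id < len(embedding):
        raise IndexError(f"token id {token_id} is outside the vocabulary")
    # Copy the row so later residual additions do not modify the table.
    return list(embedding[token_id])


# Toy table (assumed values): vocabulary of 3 tokens, model width 4.
embedding = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

residual = embed_token(embedding, 1)
```

Copying the row is a choice made for the sketch: the residual is updated in every layer, so it should not alias the weight table.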

Per-layer attention block

RMS-normalize the residual with attn_norm.
Project Q, K, and V.
Apply RoPE to each Q and K head at the current position.
Store K and V in the layer-major cache.
Compute causal attention over positions 0 through the current position.
Project attention output with W_O.
Add the projection to the residual.
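
The seven steps above can be sketched in plain Python with scalar loops. This is an illustration under stated assumptions, not the reviewed implementation: the epsilon of 1e-5, the RoPE base of 10000 with adjacent-pair rotation, equal numbers of query and key/value heads, the weight dictionary keys other than attn_norm, and the `cache[layer][position]` layout are all chosen for the example.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    # Scale by the reciprocal root mean square, then by the learned weight.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def matvec(matrix, x):
    # One output element per matrix row.
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def rope(vec, pos, head_dim, base=10000.0):
    # Rotate adjacent pairs inside each head (assumed pairing convention).
    out = list(vec)
    for start in range(0, len(vec), head_dim):
        for i in range(0, head_dim, 2):
            theta = pos * base ** (-i / head_dim)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[start + i], vec[start + i + 1]
            out[start + i] = a * c - b * s
            out[start + i + 1] = a * s + b * c
    return out


def attention_block(residual, layer, pos, w, k_cache, v_cache, n_heads):
    dim = len(residual)
    head_dim = dim // n_heads
    x = rms_norm(residual, w["attn_norm"])            # step 1
    q = rope(matvec(w["wq"], x), pos, head_dim)       # steps 2 and 3
    k = rope(matvec(w["wk"], x), pos, head_dim)
    v = matvec(w["wv"], x)
    k_cache[layer][pos] = k                           # step 4
    v_cache[layer][pos] = v
    attn = [0.0] * dim
    for h in range(n_heads):                          # step 5
        lo, hi = h * head_dim, (h + 1) * head_dim
        scores = [
            sum(q[j] * k_cache[layer][t][j] for j in range(lo, hi))
            / math.sqrt(head_dim)
            for t in range(pos + 1)                   # causal: 0..pos only
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        for t, e in enumerate(exps):
            p = e / total
            for j in range(lo, hi):
                attn[j] += p * v_cache[layer][t][j]
    proj = matvec(w["wo"], attn)                      # step 6
    return [r + p for r, p in zip(residual, proj)]    # step 7


# Toy setup (assumed sizes): width 4, 2 heads, 1 layer, identity weights.
dim, n_heads, n_layers, max_seq = 4, 2, 1, 8
eye = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
weights = {"attn_norm": [1.0] * dim, "wq": eye, "wk": eye, "wv": eye, "wo": eye}
k_cache = [[None] * max_seq for _ in range(n_layers)]
v_cache = [[None] * max_seq for _ in range(n_layers)]

out0 = attention_block([1.0, 2.0, 3.0, 4.0], 0, 0, weights, k_cache, v_cache, n_heads)
out1 = attention_block(out0, 0, 1, weights, k_cache, v_cache, n_heads)
```

At position 0 the softmax has a single entry, so with identity weights the block reduces to the residual plus its normalized copy, which makes the sketch easy to check by hand.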

Per-layer FFN block

RMS-normalize the updated residual with ffn_norm.
Project the normalized vector through W1 and W3.
Compute elementwise SwiGLU.
Project through W2.
Add the result to the residual.
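
A minimal Python sketch of the FFN steps above follows. It assumes the common LLaMA-style convention in which W1 is the gate branch that receives SiLU and W3 is the linear branch; the page does not state which branch is activated, and the epsilon of 1e-5, the dictionary keys other than ffn_norm, and the toy sizes are likewise assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    # Scale by the reciprocal root mean square, then by the learned weight.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def matvec(matrix, x):
    # One output element per matrix row.
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def silu(x):
    # SiLU (swish): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def ffn_block(residual, w):
    x = rms_norm(residual, w["ffn_norm"])
    gate = matvec(w["w1"], x)   # assumed gate branch
    up = matvec(w["w3"], x)     # assumed linear branch
    hidden = [silu(g) * u for g, u in zip(gate, up)]  # elementwise SwiGLU
    down = matvec(w["w2"], hidden)
    return [r + d for r, d in zip(residual, down)]


# Toy setup (assumed sizes): width 4, hidden width 4, identity weights.
dim = 4
eye = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
weights = {"ffn_norm": [1.0] * dim, "w1": eye, "w3": eye, "w2": eye}
residual = [1.0, 2.0, 3.0, 4.0]

out = ffn_block(residual, weights)
```

In a real model the hidden width is usually larger than the model width; the toy example keeps them equal only so that identity matrices can be used.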

Output

After the last layer, a final RMSNorm is applied to the residual, and the output projection maps the normalized vector to vocabulary logits. Sampling then selects the next token from those logits.
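
The output stage can be sketched as follows. The page does not say which sampling strategy is used, so this example assumes greedy argmax purely for illustration; the function names, the epsilon of 1e-5, and the toy weights are also assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    # Scale by the reciprocal root mean square, then by the learned weight.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def matvec(matrix, x):
    # One output element per matrix row.
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def logits_from_residual(residual, final_norm, w_out):
    # Final RMSNorm, then one logit per vocabulary row of w_out.
    return matvec(w_out, rms_norm(residual, final_norm))


def argmax(values):
    # Greedy selection (assumed strategy): index of the largest logit.
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i
    return best


# Toy setup (assumed values): model width 2, vocabulary of 3 tokens.
residual = [1.0, -2.0]
final_norm = [1.0, 1.0]
w_out = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -2.0],
]

logits = logits_from_residual(residual, final_norm, w_out)
next_token = argmax(logits)
```

A temperature or top-k sampler would replace `argmax` without changing the normalization or projection steps.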

Execution model

All operations are synchronous scalar CPU loops inside WASM. There is no graph compiler, operator fusion, batching, or device abstraction.

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

Identify the source, version, target environment, and owner.
Separate observed values from estimates and externally reported values.
Record trade-offs, unsupported cases, and fallback behavior.
Link performance statements to a compatible benchmark methodology.

Verification questions

What exact artifact, revision, backend, and environment were reviewed?
Which assumptions could change the result?
Which data should be retained so another engineer can reproduce the conclusion?
