How TinyRustLM generates one token

A source-grounded walk through embedding lookup, pre-norm attention, RoPE, causal KV-cache attention, residual flow, SwiGLU, logits, and sampling.

The current runtime executes an ordinary decoder-only transformer one token at a time through handwritten scalar Rust loops compiled to WebAssembly.
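A minimal sketch of what "one token at a time" implies for the outer loop. The names `generate`, `forward`, and `argmax` are hypothetical, and greedy selection is assumed here only to keep the example deterministic; the runtime's actual sampler may differ.

```rust
/// Index of the largest logit (greedy decoding, assumed for this sketch).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Feeds tokens through the model one at a time.
/// `forward(token, pos)` is a hypothetical stand-in for a single model step
/// that returns the logits for the next token.
fn generate<F>(mut forward: F, prompt: &[usize], max_new: usize) -> Vec<usize>
where
    F: FnMut(usize, usize) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut logits = Vec::new();
    // Without batched prefill, the prompt is also consumed token by token.
    for (pos, &t) in prompt.iter().enumerate() {
        logits = forward(t, pos);
    }
    for _ in 0..max_new {
        let next = argmax(&logits);
        tokens.push(next);
        // The new token's position is its index in the sequence so far.
        logits = forward(next, tokens.len() - 1);
    }
    tokens
}
```

Every call to `forward` handles exactly one position, so prompt processing costs the same per token as generation.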

Layer order

Embedding lookup, then in each layer: attention RMSNorm → Q/K/V → RoPE → KV store → causal attention → WO → residual → FFN RMSNorm → W1/W3 → SwiGLU → W2 → residual. After the last layer, a final RMSNorm and the output projection produce logits.
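The per-layer order can be sketched as a single scalar function. This is a simplified illustration, not the project's source: it assumes one attention head, an even `dim`, adjacent-pair RoPE with base 10000, and an RMSNorm epsilon of 1e-5, and the names `Layer`, `KvCache`, and `layer_step` are hypothetical.

```rust
/// Hypothetical single-head layer; all matrices are row-major.
struct Layer {
    dim: usize,
    hidden: usize,
    attn_norm: Vec<f32>,
    ffn_norm: Vec<f32>,
    wq: Vec<f32>, // dim x dim
    wk: Vec<f32>, // dim x dim
    wv: Vec<f32>, // dim x dim
    wo: Vec<f32>, // dim x dim
    w1: Vec<f32>, // hidden x dim (gate)
    w3: Vec<f32>, // hidden x dim (up)
    w2: Vec<f32>, // dim x hidden (down)
}

/// One cached key and value vector per position seen so far.
struct KvCache {
    k: Vec<Vec<f32>>,
    v: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn matvec(w: &[f32], x: &[f32], rows: usize) -> Vec<f32> {
    let cols = x.len();
    (0..rows).map(|r| dot(&w[r * cols..(r + 1) * cols], x)).collect()
}

fn rmsnorm(x: &[f32], gain: &[f32]) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + 1e-5).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv * g).collect()
}

/// Rotates adjacent pairs (2i, 2i+1) by a position-dependent angle.
fn rope(v: &mut [f32], pos: usize) {
    let dim = v.len();
    for i in (0..dim).step_by(2) {
        let freq = 10000f32.powf(-(i as f32) / dim as f32);
        let (s, c) = (pos as f32 * freq).sin_cos();
        let (a, b) = (v[i], v[i + 1]);
        v[i] = a * c - b * s;
        v[i + 1] = a * s + b * c;
    }
}

/// Runs one layer for the token at position `cache.k.len()`, updating `x` in place.
fn layer_step(l: &Layer, cache: &mut KvCache, x: &mut [f32]) {
    let pos = cache.k.len();
    // Attention RMSNorm -> Q/K/V -> RoPE -> KV store
    let h = rmsnorm(x, &l.attn_norm);
    let mut q = matvec(&l.wq, &h, l.dim);
    let mut k = matvec(&l.wk, &h, l.dim);
    let v = matvec(&l.wv, &h, l.dim);
    rope(&mut q, pos);
    rope(&mut k, pos);
    cache.k.push(k);
    cache.v.push(v);
    // Causal attention: the cache holds only positions 0..=pos, so no mask is needed.
    let scale = 1.0 / (l.dim as f32).sqrt();
    let scores: Vec<f32> = cache.k.iter().map(|kt| dot(&q, kt) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut ctx = vec![0.0f32; l.dim];
    for (w, vt) in exps.iter().zip(&cache.v) {
        for (c, vv) in ctx.iter_mut().zip(vt) {
            *c += w / denom * vv;
        }
    }
    // WO -> residual
    let attn_out = matvec(&l.wo, &ctx, l.dim);
    for (xi, a) in x.iter_mut().zip(&attn_out) {
        *xi += a;
    }
    // FFN RMSNorm -> W1/W3 -> SwiGLU -> W2 -> residual
    let h = rmsnorm(x, &l.ffn_norm);
    let gate = matvec(&l.w1, &h, l.hidden);
    let up = matvec(&l.w3, &h, l.hidden);
    let act: Vec<f32> = gate
        .iter()
        .zip(&up)
        .map(|(g, u)| g / (1.0 + (-g).exp()) * u) // SiLU(gate) * up
        .collect();
    let ffn_out = matvec(&l.w2, &act, l.dim);
    for (xi, f) in x.iter_mut().zip(&ffn_out) {
        *xi += f;
    }
}
```

Because the cache only ever contains past and current positions, causality falls out of the data layout rather than an explicit mask.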

Current bottleneck

There is no batched prefill, SIMD-specific kernel, WebGPU dispatch, or worker isolation. Quantization primarily reduces memory footprint and bandwidth demand; it does not change the scalar execution topology.
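To illustrate the point about quantization, here is a hedged sketch of an int8 matrix-vector product. The scheme (symmetric int8 with one f32 scale per row) and the names `QMatrix`, `quantize`, and `matvec_q8` are assumptions for illustration; the runtime's actual quantization format is not described here.

```rust
/// Hypothetical row-quantized matrix: int8 weights plus one f32 scale per row.
struct QMatrix {
    rows: usize,
    cols: usize,
    q: Vec<i8>,
    scales: Vec<f32>,
}

/// Symmetric per-row quantization of a row-major f32 matrix.
fn quantize(w: &[f32], rows: usize, cols: usize) -> QMatrix {
    let mut q = Vec::with_capacity(rows * cols);
    let mut scales = Vec::with_capacity(rows);
    for r in 0..rows {
        let row = &w[r * cols..(r + 1) * cols];
        let max_abs = row.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        // Avoid dividing by zero for an all-zero row.
        let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        scales.push(scale);
        q.extend(row.iter().map(|v| (v / scale).round() as i8));
    }
    QMatrix { rows, cols, q, scales }
}

/// Same nested scalar loop as an f32 matvec; only the weight load is narrower.
fn matvec_q8(m: &QMatrix, x: &[f32], out: &mut [f32]) {
    for r in 0..m.rows {
        let mut acc = 0.0f32;
        for c in 0..m.cols {
            acc += m.q[r * m.cols + c] as f32 * x[c];
        }
        out[r] = acc * m.scales[r];
    }
}
```

Each weight occupies one byte instead of four, so less data moves per token, but the loop still performs one multiply-add per weight in sequence.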

Read the source-level execution path.