Current performance model and bottlenecks

Derives where time and memory are spent from the actual scalar execution path without fabricating device measurements.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Decode work

For every generated token, each layer performs four attention projections, causal score/value work, one output projection, three FFN projections, normalization, and RoPE. A final output projection runs once after the last layer. Matrix-vector operations dominate weight reads.
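To make the weight-read claim concrete, here is a minimal sketch of a scalar matrix-vector product. It is not taken from the reviewed snapshot: the function names `matvec` and `weights_read` and the row-major layout are assumptions chosen for illustration. The point is that every weight is touched exactly once per call and nothing is reused across tokens.

```rust
/// Scalar row-major matrix-vector product, y = W x.
/// Hypothetical helper: the name and layout are illustrative assumptions,
/// not the snapshot's kernel. Each of the `rows * cols` weights is read once.
pub fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols, "weight slice must hold rows * cols values");
    assert_eq!(x.len(), cols, "input length must equal the column count");
    let mut y = vec![0.0f32; rows];
    for (r, out) in y.iter_mut().enumerate() {
        let row = &w[r * cols..(r + 1) * cols];
        let mut acc = 0.0f32;
        for (wi, xi) in row.iter().zip(x) {
            acc += wi * xi; // one weight read and one multiply-add
        }
        *out = acc;
    }
    y
}

/// Number of weights read (and multiply-adds performed) by one `matvec` call.
pub fn weights_read(rows: usize, cols: usize) -> usize {
    rows * cols
}
```

Summing `weights_read` over every projection in every layer gives a rough per-token weight-read count, which is why these operations dominate.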

Memory bandwidth

F32 costs four bytes per weight read. q8 costs roughly one byte per weight plus row scales, and q4 roughly half a byte per weight plus block scales. Kernels that operate directly on quantized data reduce memory traffic but add dequantization arithmetic.
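The byte counts above can be turned into a rough per-matrix estimate. The sketch below is an assumption-laden illustration: the enum `WeightFormat`, the function `matrix_bytes`, the 4-byte scale size, and the block size passed in are all hypothetical, because the text does not state the snapshot's actual scale width or block length.

```rust
/// Weight formats named in the text. The scale layout is an assumption.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WeightFormat {
    F32,
    /// One byte per weight plus one scale per row.
    Q8RowScaled,
    /// Half a byte per weight plus one scale per block of `block_size` weights.
    Q4BlockScaled { block_size: usize },
}

/// Assumed scale width in bytes (hypothetical; the real format may differ).
pub const SCALE_BYTES: usize = 4;

/// Approximate bytes read to stream one `rows x cols` matrix once.
pub fn matrix_bytes(format: WeightFormat, rows: usize, cols: usize) -> usize {
    match format {
        WeightFormat::F32 => rows * cols * 4,
        WeightFormat::Q8RowScaled => rows * cols + rows * SCALE_BYTES,
        WeightFormat::Q4BlockScaled { block_size } => {
            assert!(block_size > 0, "block size must be positive");
            let packed_per_row = (cols + 1) / 2; // two 4-bit weights per byte
            let blocks_per_row = (cols + block_size - 1) / block_size;
            rows * packed_per_row + rows * blocks_per_row * SCALE_BYTES
        }
    }
}
```

Under these assumptions a 4 x 64 matrix costs 1024 bytes as F32, 272 bytes as q8, and 160 bytes as q4 with 32-weight blocks, so the scale overhead is small but not zero.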

Attention growth

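KV storage is allocated at model load and fixed at that capacity, while attention work grows with the active context. Prefill is token-serial rather than batched, so each prompt token runs a full pass through every layer and reads the attention history accumulated so far. Long prompts therefore multiply both the full-layer passes and the attention history reads.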
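The growth can be counted directly. The sketch below assumes that the token at 0-based position t attends to t + 1 cached entries (itself included) in every layer, which gives n(n + 1)/2 history reads per layer for an n-token prompt. The function names are hypothetical and the counts are in cached positions, not bytes.

```rust
/// Full-layer passes needed to prefill `prompt_len` tokens one at a time.
pub fn prefill_layer_passes(prompt_len: u64, layers: u64) -> u64 {
    prompt_len * layers
}

/// Cached positions read by attention during a token-serial prefill.
/// Assumption: position t reads t + 1 entries per layer, so the per-layer
/// total is 1 + 2 + ... + n = n * (n + 1) / 2.
pub fn prefill_history_reads(prompt_len: u64, layers: u64) -> u64 {
    layers * prompt_len * (prompt_len + 1) / 2
}
```

Doubling the prompt length doubles the layer passes but roughly quadruples the history reads, which is why long prompts are disproportionately expensive in this model.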

Browser overhead

Several costs fall outside kernel time: full-file fetch and copy, WASM allocation, tensor materialization, main-thread synchronous execution, result decoding, and diagnostics rendering.

Known optimization order

  1. Measure per-operator and total wall time.
  2. Move execution to a worker and add cancellation.
  3. Vectorize and tile the CPU kernels, or add a WebGPU backend.
  4. Add batched prefill.
  5. Optimize BPE and result allocations.
  6. Add persistent verified artifact caching.
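Step 1 in the list above could start from something as small as the accumulator sketched here. `OpTimer` and its methods are hypothetical names, not part of the reviewed source. The sketch uses `std::time::Instant`, which suits a native host build; a browser WASM target would likely need a different clock, so treat the timing source as an assumption.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Accumulates wall time per named operator (hypothetical helper).
#[derive(Default)]
pub struct OpTimer {
    totals: BTreeMap<&'static str, Duration>,
}

impl OpTimer {
    /// Runs `f`, charges its wall time to `op`, and returns its result.
    pub fn time<T>(&mut self, op: &'static str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.record(op, start.elapsed());
        out
    }

    /// Adds an already-measured duration to the total for `op`.
    pub fn record(&mut self, op: &'static str, elapsed: Duration) {
        *self.totals.entry(op).or_default() += elapsed;
    }

    /// Time charged to one operator, or zero if it was never recorded.
    pub fn get(&self, op: &str) -> Duration {
        self.totals.get(op).copied().unwrap_or_default()
    }

    /// Sum of all recorded operator times.
    pub fn total(&self) -> Duration {
        self.totals.values().copied().sum()
    }
}
```

Wrapping each projection, the attention step, and the final output projection in `time` would yield a per-operator breakdown alongside the total, without publishing any throughput figure.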

No fabricated throughput

The source UI can display host-side tokens per second, but no comparable hardware/browser measurement dataset was supplied. MiRust publishes no tokens/s claim for this snapshot.

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?