RMSNorm and residual flow

Defines normalization math, epsilon handling, learned scales, residual ordering, scratch reuse, and numerical test expectations.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

RMSNorm as implemented by the runtime, for an input x of length N:
mean_square = sum(x[i] * x[i]) / N
inv_rms = 1 / sqrt(mean_square + epsilon)
y[i] = x[i] * inv_rms * scale[i]
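
The three formulas above translate almost line for line into Rust. The following is a minimal sketch, not the runtime's actual code: the function name `rms_norm`, the caller-provided `out` slice, and the choice to accumulate the sum of squares in `f32` are all assumptions made for illustration.

```rust
/// Illustrative RMSNorm over one vector (names are hypothetical).
/// Writes `x[i] * inv_rms * scale[i]` into `out` without allocating.
pub fn rms_norm(x: &[f32], scale: &[f32], epsilon: f32, out: &mut [f32]) {
    // Lengths must match and the input cannot be empty.
    assert!(!x.is_empty(), "input must not be empty");
    assert_eq!(x.len(), scale.len(), "input and scale lengths differ");
    assert_eq!(x.len(), out.len(), "input and output lengths differ");

    // mean_square = sum(x[i] * x[i]) / N
    // Assumption: accumulation in f32; a real kernel may use a wider accumulator.
    let mean_square = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;

    // inv_rms = 1 / sqrt(mean_square + epsilon)
    let inv_rms = 1.0 / (mean_square + epsilon).sqrt();

    // y[i] = x[i] * inv_rms * scale[i]
    for ((y, &xi), &si) in out.iter_mut().zip(x).zip(scale) {
        *y = xi * inv_rms * si;
    }
}
```

With a positive epsilon, an all-zero input produces an all-zero output rather than a division by zero, because the denominator stays strictly positive.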

Placement

The model uses pre-normalization: attention and FFN each consume a normalized copy while the original residual is updated by adding the projected block output.

Scratch ownership

A reusable normed buffer serves attention normalization and a separate FFN-normalized buffer serves the feed-forward block. No per-layer normalization vector is allocated during steady-state generation.
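
The placement and scratch rules above can be sketched together. This example assumes the usual attention-then-FFN order within a layer. The `Scratch` struct, its field names, the extra `block_out` buffer, and the closure-based `attention` and `ffn` stand-ins are hypothetical and not taken from the reviewed source.

```rust
/// Scratch buffers allocated once and reused by every layer (illustrative names).
struct Scratch {
    normed: Vec<f32>,     // normalized input for attention
    ffn_normed: Vec<f32>, // normalized input for the feed-forward block
    block_out: Vec<f32>,  // assumed buffer for the projected block output
}

impl Scratch {
    fn new(dim: usize) -> Self {
        Scratch {
            normed: vec![0.0; dim],
            ffn_normed: vec![0.0; dim],
            block_out: vec![0.0; dim],
        }
    }
}

/// RMSNorm into a caller-owned buffer; no allocation.
fn rms_norm_into(x: &[f32], scale: &[f32], epsilon: f32, out: &mut [f32]) {
    let mean_square = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_square + epsilon).sqrt();
    for ((y, &xi), &si) in out.iter_mut().zip(x).zip(scale) {
        *y = xi * inv_rms * si;
    }
}

/// One pre-norm layer: each block reads a normalized copy, and the residual
/// is only ever modified by adding the block output.
fn layer_step<A, F>(
    residual: &mut [f32],
    attn_scale: &[f32],
    ffn_scale: &[f32],
    epsilon: f32,
    scratch: &mut Scratch,
    attention: A,
    ffn: F,
) where
    A: Fn(&[f32], &mut [f32]),
    F: Fn(&[f32], &mut [f32]),
{
    // Attention sub-block: normalize, run the block, add into the residual.
    rms_norm_into(residual, attn_scale, epsilon, &mut scratch.normed);
    attention(&scratch.normed[..], &mut scratch.block_out[..]);
    for (r, o) in residual.iter_mut().zip(&scratch.block_out) {
        *r += *o;
    }

    // FFN sub-block: normalizes the residual that already includes attention.
    rms_norm_into(residual, ffn_scale, epsilon, &mut scratch.ffn_normed);
    ffn(&scratch.ffn_normed[..], &mut scratch.block_out[..]);
    for (r, o) in residual.iter_mut().zip(&scratch.block_out) {
        *r += *o;
    }
}
```

Because `layer_step` only borrows `Scratch`, calling it once per layer and per token reuses the same three vectors; nothing in the loop body allocates.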

Failure checks

Input and scale lengths must match, and neither may be empty. Product validation should include NaN/Inf behavior, epsilon extremes, denormal handling, and reference comparisons across quantized modes.
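
As a sketch of how the length checks might be expressed, the following returns a typed error instead of panicking. The error type `NormInputError` and both function names are hypothetical. The `first_non_finite` helper is meant only as a test aid for probing NaN/Inf behavior; this page does not say whether the runtime rejects such inputs.

```rust
/// Hypothetical error type for the two checks stated above.
#[derive(Debug, PartialEq)]
enum NormInputError {
    Empty,
    LengthMismatch { input: usize, scale: usize },
}

/// Rejects empty slices and mismatched input/scale lengths.
fn check_norm_inputs(x: &[f32], scale: &[f32]) -> Result<(), NormInputError> {
    if x.is_empty() || scale.is_empty() {
        return Err(NormInputError::Empty);
    }
    if x.len() != scale.len() {
        return Err(NormInputError::LengthMismatch {
            input: x.len(),
            scale: scale.len(),
        });
    }
    Ok(())
}

/// Test helper: index of the first NaN or infinite value, if any.
fn first_non_finite(values: &[f32]) -> Option<usize> {
    values.iter().position(|v| !v.is_finite())
}
```

A validation suite could run `first_non_finite` over both the input and the normalized output to record how NaN and Inf values propagate under each epsilon setting.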

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?