Generation prefill and decode

Explains prompt encoding, EOS removal, context checks, token-by-token prefill, autoregressive sampling, result decoding, and continuation state.

Experimental

Last verified: 2026-06-25 00:00 UTC
Updated: 2026-06-25

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Prompt preparation

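The tokenizer returns a BOS token, the body tokens, and a trailing EOS token. Generation strips that trailing EOS so the model continues the prompt instead of treating it as finished. A prompt whose tokens already fill the context window is rejected, since no position would remain for a generated token.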
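A minimal sketch of the preparation step described above, not the reviewed implementation. The names `prepare_prompt` and `PromptError`, the `u32` token type, and the exact `>=` comparison are assumptions made for illustration.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum PromptError {
    /// The prompt leaves no free position in the context window.
    ContextFull { prompt_len: usize, context_len: usize },
}

/// Strips one trailing EOS (if present), then checks that the prompt
/// still leaves room for at least one generated token.
pub fn prepare_prompt(
    mut tokens: Vec<u32>,
    eos_id: u32,
    context_len: usize,
) -> Result<Vec<u32>, PromptError> {
    // Remove the terminal EOS so the sequence stays open for continuation.
    if tokens.last() == Some(&eos_id) {
        tokens.pop();
    }
    // Assumed rule: a prompt that occupies every position is rejected.
    if tokens.len() >= context_len {
        return Err(PromptError::ContextFull {
            prompt_len: tokens.len(),
            context_len,
        });
    }
    Ok(tokens)
}
```

With BOS as `1` and EOS as `2`, `prepare_prompt(vec![1, 10, 11, 2], 2, 8)` yields the three remaining tokens, while the same call with a context length of 3 returns `ContextFull`.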

Prefill

Each prompt token is forwarded one at a time, in order. This builds the KV cache and leaves the logits for the first generated token. There is no batched prefill matrix path, so prefill costs one forward pass per prompt token.
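The sequential prefill can be sketched as a plain loop over the prompt. The `Forward` trait, the `prefill` function, and `ToyModel` are hypothetical stand-ins; the toy model only records which positions were cached and is not a real transformer.

```rust
/// Hypothetical single-token forward interface for this sketch.
pub trait Forward {
    /// Processes one token at `pos`, extends the KV cache, and returns
    /// the logits for the position that follows.
    fn forward(&mut self, token: u32, pos: usize) -> Vec<f32>;
}

/// Feeds the prompt one token at a time and returns the logits produced
/// by the last prompt token. Returns `None` for an empty prompt.
pub fn prefill<M: Forward>(model: &mut M, prompt: &[u32]) -> Option<Vec<f32>> {
    let mut logits = None;
    for (pos, &token) in prompt.iter().enumerate() {
        // One forward pass per token: no batched matrix path.
        logits = Some(model.forward(token, pos));
    }
    logits
}

/// Toy stand-in model: caches (position, token) pairs and puts all
/// weight on `token + 1` as the next token.
pub struct ToyModel {
    pub vocab: usize,
    pub kv_cache: Vec<(usize, u32)>,
}

impl Forward for ToyModel {
    fn forward(&mut self, token: u32, pos: usize) -> Vec<f32> {
        self.kv_cache.push((pos, token));
        let mut logits = vec![0.0; self.vocab];
        logits[(token as usize + 1) % self.vocab] = 1.0;
        logits
    }
}
```

After `prefill` runs on a three-token prompt, the toy cache holds three entries at positions 0, 1, and 2, and the returned logits belong to position 3.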

Decode loop

1. Sample a token from the current logits.
2. Stop if the token is EOS.
3. Append the token to the generated state.
4. Forward it at the next position.
5. Repeat until the max-token limit or the context boundary is reached.
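The loop above can be sketched as follows. This is an illustration under stated assumptions: greedy argmax stands in for whatever sampler the runtime uses, `forward` is a hypothetical callback that returns the next logits, and the boundary test `pos < context_len` is a guess at how the context limit is enforced.

```rust
/// Greedy argmax, used here in place of a real sampler.
pub fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Runs the decode loop. `logits` are the logits left by prefill and
/// `pos` is the next free position (the prompt length).
pub fn decode<F>(
    mut forward: F,
    mut logits: Vec<f32>,
    mut pos: usize,
    eos_id: u32,
    max_tokens: usize,
    context_len: usize,
) -> Vec<u32>
where
    F: FnMut(u32, usize) -> Vec<f32>,
{
    let mut generated = Vec::new();
    // Step 5: repeat until the token budget or the context boundary is hit.
    while generated.len() < max_tokens && pos < context_len {
        let token = argmax(&logits); // step 1: pick a token
        if token == eos_id {
            break; // step 2: stop on EOS
        }
        generated.push(token); // step 3: record it
        logits = forward(token, pos); // step 4: forward at the next position
        pos += 1;
    }
    generated
}
```

Note that the EOS token itself is never appended, so the generated state contains only tokens that should appear in the output.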

Output text

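Only the generated token IDs are decoded into last_result, so the prompt text is not echoed back. The runtime token state keeps both the prompt IDs and the generated IDs so that a later call can continue from the same position.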
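One way to picture this split is a single token buffer plus a marker for where the prompt ends. The `TokenState` struct, its fields other than `last_result`, and the `detokenize` callback are hypothetical; the source does not show how the state is laid out.

```rust
/// Hypothetical runtime state for this sketch.
pub struct TokenState {
    /// Prompt IDs followed by generated IDs, kept for continuation.
    pub tokens: Vec<u32>,
    /// Number of leading IDs that came from the prompt.
    /// Assumed invariant: `prompt_len <= tokens.len()`.
    pub prompt_len: usize,
    /// Text decoded from the generated IDs only.
    pub last_result: String,
}

impl TokenState {
    /// The generated suffix of the token buffer.
    pub fn generated(&self) -> &[u32] {
        &self.tokens[self.prompt_len..]
    }

    /// Decodes only the generated IDs into `last_result`.
    /// `detokenize` is a stand-in for the real tokenizer's decoder.
    pub fn finish<D: Fn(&[u32]) -> String>(&mut self, detokenize: D) {
        let text = detokenize(self.generated());
        self.last_result = text;
    }
}
```

The full `tokens` buffer is left untouched by `finish`, which is what allows a follow-up step to resume at position `tokens.len()`.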

Step API

generate_next_token runs one additional continuation step after an initial generate request. The browser UI does not currently use it for live streaming.
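A rough sketch of what a single continuation step might involve, assuming the state from the initial request (token IDs and the most recent logits) is retained between calls. The `Session` struct, the `next_token_step` function, the greedy sampler, and the `forward` callback are all hypothetical; the actual signature of generate_next_token is not shown in the source.

```rust
/// Hypothetical state carried between calls.
pub struct Session {
    /// Prompt and generated IDs so far.
    pub tokens: Vec<u32>,
    /// Logits left by the most recent forward pass.
    pub logits: Vec<f32>,
    pub eos_id: u32,
    pub context_len: usize,
}

/// Greedy pick, used here in place of a real sampler.
fn greedy(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Performs one continuation step. Returns the new token, or `None`
/// when EOS is sampled or the context is already full.
pub fn next_token_step<F>(session: &mut Session, mut forward: F) -> Option<u32>
where
    F: FnMut(u32, usize) -> Vec<f32>,
{
    let pos = session.tokens.len();
    if pos >= session.context_len {
        return None;
    }
    let token = greedy(&session.logits);
    if token == session.eos_id {
        return None;
    }
    session.tokens.push(token);
    // Refresh the stored logits so the next call can continue.
    session.logits = forward(token, pos);
    Some(token)
}
```

A caller that wanted streaming output could invoke such a step repeatedly and emit each returned token as it arrives.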

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

Identify the source, version, target environment, and owner.
Separate observed values from estimates and externally reported values.
Record trade-offs, unsupported cases, and fallback behavior.
Link performance statements to a compatible benchmark methodology.

Verification questions

What exact artifact, revision, backend, and environment were reviewed?
Which assumptions could change the result?
Which data should be retained so another engineer can reproduce the conclusion?
