Explains prompt validation, context reset, tokenization, prefill, decode, error cleanup, result publication, and which failures preserve the accepted model.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Transaction start
generate first checks that a model is loaded and that max_new_tokens is nonzero. It then clears the prompt tokens, generated tokens, KV cache, and generation diagnostics before encoding the new prompt.
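The ordering above (check the preconditions, then reset per-request state) can be sketched as follows. This is a minimal illustration, not the project's code: Runtime, begin_generation, StartError, and every field name are hypothetical. The sketch also assumes that Vec::clear is an acceptable stand-in for "reset", since it empties the buffer but keeps its capacity for reuse.

```rust
/// Hypothetical runtime state; field names are illustrative, not the project's API.
#[derive(Debug, Default)]
pub struct Runtime {
    pub model_loaded: bool,
    pub prompt_tokens: Vec<u32>,
    pub generated_tokens: Vec<u32>,
    pub kv_cache: Vec<f32>,
    pub diagnostics: Vec<String>,
}

/// Hypothetical precondition failures for starting a generation.
#[derive(Debug, PartialEq)]
pub enum StartError {
    NoModel,
    ZeroMaxNewTokens,
}

impl Runtime {
    /// Checks preconditions first, then resets per-request state.
    /// In this sketch nothing is cleared if a precondition fails.
    pub fn begin_generation(&mut self, max_new_tokens: usize) -> Result<(), StartError> {
        if !self.model_loaded {
            return Err(StartError::NoModel);
        }
        if max_new_tokens == 0 {
            return Err(StartError::ZeroMaxNewTokens);
        }
        self.prompt_tokens.clear();
        self.generated_tokens.clear();
        self.kv_cache.clear(); // empties the cache but keeps its capacity for reuse
        self.diagnostics.clear();
        Ok(())
    }
}
```

Validating before clearing means a rejected call cannot leave the runtime half-reset.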
Execution
The tokenizer emits BOS and EOS tokens as its contract specifies. Before prefill, generation strips a terminal EOS if one is present and keeps every other token, including BOS. It then rejects a prompt that is empty or that already fills the context window. Prefill forwards each prompt token through the model. Decode then samples and forwards new tokens until the model emits EOS, max_new_tokens is reached, or the context is full.
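The prefill and decode loop described above can be sketched against a hypothetical Model trait. The names run, Model, GenError, eos, and context_len are assumptions made for illustration. The sketch also assumes that a sampled EOS ends generation without being recorded or forwarded; the reviewed source may handle that token differently.

```rust
/// Hypothetical validation failures raised before prefill.
#[derive(Debug, PartialEq)]
pub enum GenError {
    EmptyPrompt,
    PromptFillsContext,
}

/// Hypothetical model interface: `forward` feeds one token (growing the KV cache),
/// `sample` picks the next token from the latest logits.
pub trait Model {
    fn forward(&mut self, token: u32);
    fn sample(&mut self) -> u32;
}

/// Prefill then decode. Returns the newly generated token IDs (EOS excluded).
pub fn run<M: Model>(
    model: &mut M,
    mut prompt: Vec<u32>,
    eos: u32,
    max_new_tokens: usize,
    context_len: usize,
) -> Result<Vec<u32>, GenError> {
    // Strip only a single terminal EOS; BOS and interior tokens are untouched.
    if prompt.last() == Some(&eos) {
        prompt.pop();
    }
    if prompt.is_empty() {
        return Err(GenError::EmptyPrompt);
    }
    if prompt.len() >= context_len {
        return Err(GenError::PromptFillsContext);
    }
    // Prefill: forward every prompt token.
    for &t in &prompt {
        model.forward(t);
    }
    // Decode: stop on EOS, the token limit, or a full context.
    let mut used = prompt.len();
    let mut out = Vec::new();
    while out.len() < max_new_tokens && used < context_len {
        let next = model.sample();
        if next == eos {
            break;
        }
        out.push(next);
        model.forward(next);
        used += 1;
    }
    Ok(out)
}
```

For example, with eos = 2, the prompt [1, 5, 2] is prefilled as [1, 5]. A prompt consisting of EOS alone is rejected as empty.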
Success commit
Generated token IDs are decoded to UTF-8, copied into the result buffer, appended to the runtime token context, and reflected in diagnostics.
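A minimal sketch of the commit step follows. It assumes a vocabulary that maps each token ID to raw bytes, and it assumes the result buffer is a String. The function commit, its parameters, and the diagnostics text are hypothetical. The token pieces are concatenated before UTF-8 validation because a multi-byte character can be split across two tokens. Outputs are written only after decoding succeeds.

```rust
/// Hypothetical commit step. `vocab` maps token IDs to raw byte pieces (assumed layout).
/// Returns the number of UTF-8 bytes published.
pub fn commit(
    generated: &[u32],
    vocab: &[Vec<u8>],
    result: &mut String,
    context: &mut Vec<u32>,
    diagnostics: &mut Vec<String>,
) -> Result<usize, String> {
    // Concatenate bytes first: a multi-byte UTF-8 character may span token pieces.
    let mut bytes = Vec::new();
    for &id in generated {
        let piece = vocab
            .get(id as usize)
            .ok_or_else(|| format!("token id {id} out of range"))?;
        bytes.extend_from_slice(piece);
    }
    let text = String::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {e}"))?;
    // Publish only after decoding succeeded, so a failure leaves outputs untouched.
    result.clear();
    result.push_str(&text);
    context.extend_from_slice(generated);
    diagnostics.push(format!("generated_tokens={}", generated.len()));
    Ok(text.len())
}
```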
Error recovery
Tokenization, shape, context, sampling, and decoding failures clear the generation state but retain the accepted model and its reusable allocations, so a later valid request can succeed without reloading the artifact. Only an explicit free_model call discards model-owned state.
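This recovery contract can be sketched as a session whose request wrapper clears per-request state on any error. The model itself is left in place, and only free_model drops it. Session, request, and the field layout are hypothetical. The model is modeled as an Option, and the request body is passed in as a closure so that any failure kind can be simulated.

```rust
/// Hypothetical session: the model survives failed requests; only `free_model` drops it.
#[derive(Debug, Default)]
pub struct Session {
    pub model: Option<Vec<f32>>, // stand-in for the accepted model's weights
    pub kv_cache: Vec<f32>,      // reusable allocation
    pub generated: Vec<u32>,     // per-request generation state
}

impl Session {
    /// Runs one request. On any error, clears request state but keeps the model.
    pub fn request<F>(&mut self, step: F) -> Result<(), String>
    where
        F: FnOnce(&mut Vec<f32>, &mut Vec<u32>) -> Result<(), String>,
    {
        if self.model.is_none() {
            return Err("no model loaded".into());
        }
        let outcome = step(&mut self.kv_cache, &mut self.generated);
        if outcome.is_err() {
            self.generated.clear();
            self.kv_cache.clear(); // contents dropped, capacity kept for the next request
        }
        outcome
    }

    /// The only operation in this sketch that discards model-owned state.
    pub fn free_model(&mut self) {
        self.model = None;
        self.kv_cache = Vec::new(); // releases the allocation
        self.generated.clear();
    }
}
```

A failed request therefore leaves the session ready for the next call, while free_model is the explicit point at which the model and its allocations are released.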
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?