Documents the fixed byte-tokenizer vocabulary, special token IDs, encoding and decoding behavior, the BTOK binary section, and limitations for use with trained language models.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Vocabulary
Token IDs 0–255 map directly to bytes. BOS = 256, EOS = 257, PAD = 258, and UNK = 259, producing a declared vocabulary of 260.
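The mapping above can be written down as a handful of constants. This is a minimal sketch: the numeric values come from the text, while the constant and function names are illustrative and not taken from the reviewed source.

```python
# Values are from the vocabulary description; names are hypothetical.
BYTE_TOKENS = 256          # IDs 0..255 map directly to byte values
BOS, EOS, PAD, UNK = 256, 257, 258, 259
VOCAB_SIZE = BYTE_TOKENS + 4   # declared vocabulary size


def is_byte_token(token_id: int) -> bool:
    """True when the ID stands for a raw byte value."""
    return 0 <= token_id < BYTE_TOKENS


def is_special_token(token_id: int) -> bool:
    """True when the ID is one of BOS, EOS, PAD, or UNK."""
    return BYTE_TOKENS <= token_id < VOCAB_SIZE
```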
Encoding
The tokenizer emits BOS, one token per UTF-8 byte, then EOS. This is deterministic and makes prompt transport easy to verify without a tokenizer dependency.
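The encoding rule is small enough to reproduce independently, which is what makes prompt transport easy to check. The following sketch assumes the IDs given in the vocabulary description; the function name `encode` is illustrative and is not claimed to match the reviewed source.

```python
from typing import List

BOS, EOS = 256, 257  # special IDs from the vocabulary description


def encode(text: str) -> List[int]:
    """Return BOS, one token ID per UTF-8 byte of `text`, then EOS."""
    # Iterating over a bytes object yields integers in 0..255,
    # which are exactly the byte token IDs.
    return [BOS, *text.encode("utf-8"), EOS]
```

For example, `encode("hi")` yields `[256, 104, 105, 257]`, and a two-byte character such as "é" contributes two tokens (195 and 169).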
Decoding
BOS, EOS, and PAD are skipped. UNK emits a literal "?". The remaining byte IDs are concatenated and the result is validated as UTF-8.
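The decoding rules can be sketched as follows. Two behaviors here are assumptions, because the text does not specify them: IDs outside the 260-entry vocabulary raise `ValueError`, and invalid UTF-8 raises `UnicodeDecodeError` rather than being replaced. The function name `decode` is illustrative.

```python
from typing import Iterable

BOS, EOS, PAD, UNK = 256, 257, 258, 259  # special IDs from the vocabulary description


def decode(token_ids: Iterable[int]) -> str:
    """Skip BOS/EOS/PAD, map UNK to '?', and validate the bytes as UTF-8."""
    out = bytearray()
    for token_id in token_ids:
        if token_id in (BOS, EOS, PAD):
            continue                      # control tokens carry no text
        if token_id == UNK:
            out += b"?"                   # UNK is rendered as a question mark
        elif 0 <= token_id <= 255:
            out.append(token_id)          # byte token: append the raw byte
        else:
            # Assumption: out-of-vocabulary IDs are treated as an error.
            raise ValueError(f"token ID out of range: {token_id}")
    # Strict decoding; assumption: invalid UTF-8 is reported, not repaired.
    return out.decode("utf-8")
```

Under these assumptions, `decode([256, 104, 105, 257])` returns `"hi"`, and a truncated multi-byte sequence such as `[195]` fails validation.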
Binary section
The fixed BTOK section is 28 bytes and declares magic, version, vocabulary size, and special token IDs.
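One layout that adds up to 28 bytes is a 4-byte magic followed by six unsigned 32-bit integers. The sketch below uses that layout purely for illustration. The field order, the little-endian byte order, the ASCII magic `BTOK`, and the version value 1 are all assumptions; only the 28-byte size, the vocabulary size, and the special IDs come from the text.

```python
import struct

# Hypothetical layout: magic, version, vocab size, BOS, EOS, PAD, UNK.
BTOK_FORMAT = "<4s6I"                      # 4 + 6 * 4 = 28 bytes, no padding
BTOK_SIZE = struct.calcsize(BTOK_FORMAT)
BTOK_MAGIC = b"BTOK"                       # assumed magic bytes


def pack_btok(version: int = 1) -> bytes:
    """Serialize the section under the assumed layout."""
    return struct.pack(BTOK_FORMAT, BTOK_MAGIC, version, 260, 256, 257, 258, 259)


def parse_btok(data: bytes) -> dict:
    """Parse and sanity-check a section under the assumed layout."""
    if len(data) != BTOK_SIZE:
        raise ValueError(f"expected {BTOK_SIZE} bytes, got {len(data)}")
    magic, version, vocab, bos, eos, pad, unk = struct.unpack(BTOK_FORMAT, data)
    if magic != BTOK_MAGIC:
        raise ValueError("bad magic")
    return {"version": version, "vocab_size": vocab,
            "bos": bos, "eos": eos, "pad": pad, "unk": unk}
```

A reader checking a real artifact should confirm the actual field order and byte order against the reviewed source before relying on this layout.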
Limits
Byte tokenization increases sequence length, makes the 512-token context roughly a byte budget rather than a word budget, and provides neither text normalization nor compatibility with a model's training tokenizer. It is appropriate only for deterministic runtime smoke tests unless a model was explicitly trained with the same mapping.
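The byte-budget point can be made concrete with a little arithmetic. The sketch below assumes that BOS and EOS count toward the 512-token context, which would leave 510 bytes for the prompt itself; the helper names are illustrative.

```python
CONTEXT_TOKENS = 512   # context length stated in the text
FRAMING_TOKENS = 2     # BOS + EOS; assumed to count toward the context


def token_count(text: str) -> int:
    """Tokens produced for `text`: one per UTF-8 byte plus BOS and EOS."""
    return len(text.encode("utf-8")) + FRAMING_TOKENS


def fits_context(text: str, context: int = CONTEXT_TOKENS) -> bool:
    """True when the encoded prompt fits within the context length."""
    return token_count(text) <= context


# 11 characters, but 13 UTF-8 bytes because "é" and "ö" take two bytes each.
sample = "héllo wörld"
sample_tokens = token_count(sample)   # 13 bytes + 2 framing tokens = 15
```

Non-ASCII text consumes the budget faster than its character count suggests, since each character can occupy up to four bytes in UTF-8.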
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?