BTOK byte tokenizer

Documents the fixed byte-tokenizer vocabulary, special IDs, encoding and decoding behavior, binary section, and limitations for trained language use.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Vocabulary

Token IDs 0–255 map directly to bytes. BOS = 256, EOS = 257, PAD = 258, and UNK = 259, producing a declared vocabulary of 260.
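
The ID layout above can be written down as a handful of constants. This is a minimal sketch in Python, not the reviewed implementation; the names BYTE_VOCAB, VOCAB_SIZE, and describe are hypothetical and chosen only for illustration.

```python
# Byte IDs occupy 0..255; the four special IDs follow directly after them.
BYTE_VOCAB = 256
BOS, EOS, PAD, UNK = 256, 257, 258, 259
VOCAB_SIZE = BYTE_VOCAB + 4  # 260, the declared vocabulary size


def describe(token_id: int) -> str:
    """Return a readable label for a token ID (hypothetical helper)."""
    if 0 <= token_id < BYTE_VOCAB:
        return f"byte 0x{token_id:02X}"
    names = {BOS: "BOS", EOS: "EOS", PAD: "PAD", UNK: "UNK"}
    if token_id in names:
        return names[token_id]
    # Rejecting out-of-range IDs is an assumption of this sketch.
    raise ValueError(f"token ID {token_id} is outside the {VOCAB_SIZE}-entry vocabulary")
```

For example, describe(65) labels the byte for ASCII "A", and describe(257) returns "EOS".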

Encoding

The tokenizer emits BOS, one token per UTF-8 byte, then EOS. This is deterministic and makes prompt transport easy to verify without a tokenizer dependency.
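
Because the mapping is fixed, the encoding step can be reproduced in a few lines. The following is a sketch under the rules stated above, not the project's code; the function name encode is an assumption.

```python
from typing import List

BOS, EOS = 256, 257


def encode(text: str) -> List[int]:
    """Emit BOS, one token per UTF-8 byte, then EOS."""
    # Iterating over a bytes object yields integers in 0..255,
    # which are exactly the byte token IDs.
    return [BOS, *text.encode("utf-8"), EOS]


print(encode("hi"))  # [256, 104, 105, 257]
print(encode("é"))   # [256, 195, 169, 257]
```

A non-ASCII character such as "é" occupies two UTF-8 bytes and therefore two tokens, which is why token counts track bytes rather than characters.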

Decoding

BOS, EOS, and PAD are skipped. UNK emits a literal "?" character. The remaining byte IDs are concatenated, and the result is validated as UTF-8.
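
The decoding rules can be sketched the same way. This sketch makes two assumptions that the description does not settle: invalid UTF-8 raises an error rather than being replaced, and IDs outside the 260-entry vocabulary are rejected. The function name decode is hypothetical.

```python
from typing import Iterable

BOS, EOS, PAD, UNK = 256, 257, 258, 259


def decode(token_ids: Iterable[int]) -> str:
    """Skip BOS/EOS/PAD, map UNK to '?', and validate the bytes as UTF-8."""
    out = bytearray()
    for token_id in token_ids:
        if token_id in (BOS, EOS, PAD):
            continue  # control tokens contribute no output
        if token_id == UNK:
            out += b"?"
        elif 0 <= token_id <= 255:
            out.append(token_id)
        else:
            # Assumption: out-of-vocabulary IDs are an error in this sketch.
            raise ValueError(f"token ID {token_id} is not in the vocabulary")
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8 (assumption).
    return out.decode("utf-8")


print(decode([256, 104, 105, 257]))  # hi
```

A truncated multi-byte sequence, such as a lone 0xC3 byte, fails validation in this sketch instead of producing partial text.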

Binary section

The fixed BTOK section is 28 bytes and declares magic, version, vocabulary size, and special token IDs.
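
One layout that is consistent with a 28-byte section is a 4-byte magic followed by six 32-bit unsigned integers (version, vocabulary size, then the four special IDs). The sketch below assumes that layout, little-endian byte order, the magic bytes "BTOK", and version 1. None of these details are confirmed by the description above, so treat the field order and values as illustrative only.

```python
import struct

# Assumed layout: 4-byte magic + six little-endian u32 fields = 28 bytes.
LAYOUT = struct.Struct("<4s6I")
ASSUMED_MAGIC = b"BTOK"  # assumption, not confirmed by the source


def pack_section(version: int = 1) -> bytes:
    """Build a 28-byte section using the assumed field order."""
    return LAYOUT.pack(ASSUMED_MAGIC, version, 260, 256, 257, 258, 259)


def parse_section(buf: bytes) -> dict:
    """Parse and sanity-check a section under the assumed layout."""
    if len(buf) != LAYOUT.size:
        raise ValueError(f"expected {LAYOUT.size} bytes, got {len(buf)}")
    magic, version, vocab_size, bos, eos, pad, unk = LAYOUT.unpack(buf)
    if magic != ASSUMED_MAGIC:
        raise ValueError(f"unexpected magic {magic!r}")
    return {"version": version, "vocab_size": vocab_size,
            "bos": bos, "eos": eos, "pad": pad, "unk": unk}


section = pack_section()
print(len(section))  # 28
```

A reader that checks the section length and magic before trusting the declared IDs can reject a mismatched file early.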

Limits

Byte tokenization increases sequence length, so the 512-token context is roughly a byte budget rather than a word budget. It provides no text normalization and is not compatible with a model's training tokenizer. It is therefore appropriate only for deterministic runtime smoke tests, unless a model was explicitly trained with the same mapping.
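
The byte-budget effect is easy to check with a little arithmetic. This sketch assumes that BOS and EOS both count against the 512-token context and ignores any tokens the model generates; the helper names are hypothetical.

```python
CONTEXT_TOKENS = 512  # context size stated above


def token_count(text: str) -> int:
    """Tokens used by a prompt: one per UTF-8 byte, plus BOS and EOS."""
    return len(text.encode("utf-8")) + 2


def fits_context(text: str, context: int = CONTEXT_TOKENS) -> bool:
    """True if the encoded prompt fits within the context (assumed rule)."""
    return token_count(text) <= context


prompt = "naïve café"
print(len(prompt.split()), token_count(prompt))  # 2 14
```

Two words cost 14 tokens here because each accented letter takes two UTF-8 bytes. Under the stated assumption, a prompt can hold at most 510 bytes of text.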

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?