Describes the custom embedded BPE token table, merge records, ranked deterministic merge algorithm, parser validation, fixture proof, and scaling concerns.
Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.
Section layout
BPE1 begins with a 36-byte header containing version, vocabulary size, four special IDs, token count, and merge count. Variable-length token records follow, each storing a token ID, a byte length, and the raw token bytes. Fixed 16-byte merge records store left, right, output, and rank.
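To make the fixed-size merge record concrete, the following is a minimal sketch of reading such records from a byte slice. It assumes each of the four fields is a little-endian u32 (16 bytes across four fields suggests 4 bytes each, but the byte order is not stated here). The names MergeRecord, MERGE_RECORD_LEN, read_u32_le, and parse_merges are hypothetical and are not taken from the reviewed source.

```rust
/// One fixed-size merge record: left + right -> output, ordered by rank.
/// Field widths and byte order are assumptions for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MergeRecord {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Size of one merge record in bytes, as stated in the text.
pub const MERGE_RECORD_LEN: usize = 16;

/// Read a little-endian u32 at `offset`, or None if the slice is too short.
fn read_u32_le(bytes: &[u8], offset: usize) -> Option<u32> {
    let b = bytes.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parse exactly `count` merge records.
/// Returns None on short input or trailing bytes, so the allocation below
/// is always bounded by the real input length.
pub fn parse_merges(bytes: &[u8], count: usize) -> Option<Vec<MergeRecord>> {
    let expected = count.checked_mul(MERGE_RECORD_LEN)?;
    if bytes.len() != expected {
        return None;
    }
    let mut merges = Vec::with_capacity(count);
    for rec in bytes.chunks_exact(MERGE_RECORD_LEN) {
        merges.push(MergeRecord {
            left: read_u32_le(rec, 0)?,
            right: read_u32_le(rec, 4)?,
            output: read_u32_le(rec, 8)?,
            rank: read_u32_le(rec, 12)?,
        });
    }
    Some(merges)
}
```

Checking the length before allocating means a corrupted merge count cannot trigger an oversized allocation. The header and the variable-length token records are left out of this sketch because their exact field widths are not given above.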
Encoding algorithm
- Convert UTF-8 input to byte-fallback token IDs.
- Find available adjacent merges.
- Select the lowest-rank merge deterministically.
- Replace the pair and repeat until no merge applies.
- Add BOS and EOS.
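The steps above can be sketched as follows. This is an illustration under stated assumptions, not the reviewed implementation: byte b is assumed to map to byte-fallback token ID b, ties on rank are broken by taking the leftmost pair (the text only says the choice is deterministic), and the names Merge, merge_tokens, and encode, along with the bos and eos parameters, are hypothetical.

```rust
/// A merge rule: adjacent (left, right) becomes `output`; lower rank wins.
#[derive(Debug, Clone, Copy)]
pub struct Merge {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Apply merges until none applies.
/// Ties on rank go to the leftmost pair (an assumption).
pub fn merge_tokens(mut ids: Vec<u32>, merges: &[Merge]) -> Vec<u32> {
    loop {
        // Best candidate so far: (rank, position, output).
        let mut best: Option<(u32, usize, u32)> = None;
        for pos in 0..ids.len().saturating_sub(1) {
            // Simple scan over all merge rules, mirroring a vector-based table.
            for m in merges {
                if m.left == ids[pos] && m.right == ids[pos + 1] {
                    let better = best.map_or(true, |b| (m.rank, pos) < (b.0, b.1));
                    if better {
                        best = Some((m.rank, pos, m.output));
                    }
                }
            }
        }
        match best {
            // Replace the pair with its output token; the sequence shrinks
            // by one each round, so the loop always terminates.
            Some((_, pos, output)) => {
                ids[pos] = output;
                ids.remove(pos + 1);
            }
            None => return ids,
        }
    }
}

/// Encode UTF-8 text: byte fallback, ranked merges, then BOS and EOS.
pub fn encode(text: &str, merges: &[Merge], bos: u32, eos: u32) -> Vec<u32> {
    // Assumption: byte value b maps directly to token ID b.
    let bytes: Vec<u32> = text.bytes().map(u32::from).collect();
    let merged = merge_tokens(bytes, merges);
    let mut out = Vec::with_capacity(merged.len() + 2);
    out.push(bos);
    out.extend(merged);
    out.push(eos);
    out
}
```

With the two merges t + h → 260 and 260 + e → 261, merge_tokens reduces the bytes of "the" (116, 104, 101) first to 260, 101 and then to the single token 261. The BOS and EOS values passed to encode are caller-supplied in this sketch.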
Validation
The parser rejects unsupported versions, header/vocabulary drift, duplicate IDs, empty token bytes, out-of-range IDs, missing merge-output tokens, malformed lengths, and trailing bytes.
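A few of these checks can be illustrated on already-parsed records. The sketch below covers only duplicate IDs, empty token bytes, out-of-range IDs, and missing merge-output tokens; version, length, and trailing-byte checks belong to the byte-level parser and are omitted. All names (Token, Merge, TableError, validate) are hypothetical, and the order in which errors are reported is an arbitrary choice for this example.

```rust
use std::collections::HashSet;

/// A parsed token record: ID plus raw bytes.
pub struct Token {
    pub id: u32,
    pub bytes: Vec<u8>,
}

/// A parsed merge record.
pub struct Merge {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Reasons a table can be rejected in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TableError {
    DuplicateId(u32),
    EmptyToken(u32),
    IdOutOfRange(u32),
    MissingMergeOutput(u32),
}

/// Validate parsed tokens and merges against the declared vocabulary size.
pub fn validate(
    vocab_size: u32,
    tokens: &[Token],
    merges: &[Merge],
) -> Result<(), TableError> {
    let mut seen = HashSet::with_capacity(tokens.len());
    for t in tokens {
        if t.id >= vocab_size {
            return Err(TableError::IdOutOfRange(t.id));
        }
        if t.bytes.is_empty() {
            return Err(TableError::EmptyToken(t.id));
        }
        // `insert` returns false when the ID was already present.
        if !seen.insert(t.id) {
            return Err(TableError::DuplicateId(t.id));
        }
    }
    for m in merges {
        for id in [m.left, m.right, m.output] {
            if id >= vocab_size {
                return Err(TableError::IdOutOfRange(id));
            }
        }
        // Every merge output must have a token record to decode to.
        if !seen.contains(&m.output) {
            return Err(TableError::MissingMergeOutput(m.output));
        }
    }
    Ok(())
}
```

Returning a typed error rather than a boolean keeps the offending ID available for diagnostics.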
Checked fixture
The BPE smoke artifact defines two merges: t + h → 260, then 260 + e → 261. Diagnostics for the prompt "the" record the token output 256,261 before generation.
Scaling issue
The current representation uses simple vectors and repeated scans. A production vocabulary requires indexed pair lookup, normalization policy, tokenizer checksum/identity, corpus compatibility tests, and bounded parser allocation.
Scope
This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.
Engineering considerations
- Identify the source, version, target environment, and owner.
- Separate observed values from estimates and externally reported values.
- Record trade-offs, unsupported cases, and fallback behavior.
- Link performance statements to a compatible benchmark methodology.
Verification questions
- What exact artifact, revision, backend, and environment were reviewed?
- Which assumptions could change the result?
- Which data should be retained so another engineer can reproduce the conclusion?