BPE1 tokenizer

Describes the custom embedded BPE token table, merge records, ranked deterministic merge algorithm, parser validation, fixture proof, and scaling concerns.

Experimental
Last verified: 2026-06-25 00:00 UTC

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Section layout

BPE1 begins with a 36-byte header containing the version, vocabulary size, four special IDs, token count, and merge count. Variable-length token records each store a token ID, a byte length, and the raw token bytes. Fixed 16-byte merge records each store the left, right, and output token IDs and the merge rank.
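The layout can be illustrated with a small Python reader. This is a sketch under stated assumptions, not the reviewed parser: the little-endian byte order, the 4-byte magic that makes eight 32-bit fields add up to 36 bytes, the field order, the 32-bit token ID and length prefix, and the name `parse_bpe1` are all hypothetical. Only the 36-byte header size, the listed fields, and the 16-byte merge record come from the description above.

```python
import struct

# Assumed encodings (not confirmed by the source snapshot):
HEADER = struct.Struct("<4s8I")      # magic + 8 little-endian u32 fields = 36 bytes
TOKEN_PREFIX = struct.Struct("<II")  # token ID, byte length
MERGE = struct.Struct("<4I")         # left, right, output, rank = 16 bytes


def parse_bpe1(blob: bytes) -> dict:
    """Read header, token records, and merge records from a BPE1-style section."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, vocab_size, *rest = HEADER.unpack_from(blob, 0)
    special_ids, token_count, merge_count = tuple(rest[:4]), rest[4], rest[5]

    offset = HEADER.size
    tokens = []
    for _ in range(token_count):
        # Check bounds before every read so a bad count or length fails early.
        if offset + TOKEN_PREFIX.size > len(blob):
            raise ValueError("truncated token record")
        token_id, length = TOKEN_PREFIX.unpack_from(blob, offset)
        offset += TOKEN_PREFIX.size
        if offset + length > len(blob):
            raise ValueError("token length exceeds input")
        tokens.append((token_id, blob[offset:offset + length]))
        offset += length

    # Merge records are fixed size, so the remaining length must match exactly.
    if offset + merge_count * MERGE.size != len(blob):
        raise ValueError("merge section size mismatch or trailing bytes")
    merges = [MERGE.unpack_from(blob, offset + i * MERGE.size)
              for i in range(merge_count)]

    return {"magic": magic, "version": version, "vocab_size": vocab_size,
            "special_ids": special_ids, "tokens": tokens, "merges": merges}
```

Checking every offset against the input length before reading is one simple way to keep a corrupt count or length from driving large allocations.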

Encoding algorithm

  1. Convert UTF-8 input to byte-fallback token IDs.
  2. Find available adjacent merges.
  3. Select the lowest-rank merge deterministically.
  4. Replace the pair and repeat until no merge applies.
  5. Add BOS and EOS.
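
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the reviewed implementation: the function name `encode`, the `(left, right, output, rank)` tuple layout, the leftmost-position tie-break, and the assumption that token IDs 0 to 255 are the byte-fallback tokens are all hypothetical. It rescans the merge list on every pass.

```python
def encode(text, merges, bos_id, eos_id):
    """Greedy ranked BPE over byte-fallback IDs.

    merges: iterable of (left, right, output, rank) tuples.
    Assumes byte value b maps to token ID b (IDs 0..255).
    """
    ids = list(text.encode("utf-8"))  # step 1: byte-fallback token IDs
    while True:
        best = None  # (rank, position, output)
        # Step 2: find every adjacent pair that has a merge.
        for pos in range(len(ids) - 1):
            for left, right, output, rank in merges:
                if ids[pos] == left and ids[pos + 1] == right:
                    candidate = (rank, pos, output)
                    # Step 3: lowest rank wins; the leftmost position
                    # breaks ties (an assumed tie-break rule).
                    if best is None or candidate < best:
                        best = candidate
        if best is None:
            break  # step 4 ends: no merge applies
        _, pos, output = best
        ids[pos:pos + 2] = [output]  # step 4: replace the pair
    return [bos_id] + ids + [eos_id]  # step 5
```

With the two merges from the checked fixture, assumed ranks 0 and 1, an assumed BOS ID of 256, and a hypothetical EOS ID of 257, `encode("the", [(116, 104, 260, 0), (260, 101, 261, 1)], 256, 257)` returns `[256, 261, 257]`.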

Validation

The parser rejects unsupported versions, header/vocabulary drift, duplicate IDs, empty token bytes, out-of-range IDs, missing merge-output tokens, malformed lengths, and trailing bytes.
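
The record-level checks can be sketched as a single function over already decoded records. This is an illustration under assumptions, not the reviewed parser: the name `validate_table`, the supported version set, and the reading of "header/vocabulary drift" as a token count that disagrees with the records or exceeds the vocabulary size are all hypothetical. Length and trailing-byte checks depend on byte offsets and are left out.

```python
SUPPORTED_VERSIONS = {1}  # assumed; the supported set is not stated in the source


def validate_table(version, vocab_size, token_count, tokens, merges):
    """Raise ValueError on the first record-level problem.

    tokens: list of (token_id, raw_bytes)
    merges: list of (left, right, output, rank)
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version {version}")
    # Assumed meaning of header/vocabulary drift.
    if len(tokens) != token_count or token_count > vocab_size:
        raise ValueError("header counts disagree with token records")

    seen = set()
    for token_id, raw in tokens:
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token ID {token_id} out of range")
        if token_id in seen:
            raise ValueError(f"duplicate token ID {token_id}")
        if len(raw) == 0:
            raise ValueError(f"token {token_id} has empty bytes")
        seen.add(token_id)

    for left, right, output, _rank in merges:
        for ref in (left, right, output):
            if not 0 <= ref < vocab_size:
                raise ValueError(f"merge references out-of-range ID {ref}")
        if output not in seen:
            raise ValueError(f"merge output {output} has no token record")
```

Failing on the first problem keeps the sketch short; a diagnostic tool might collect every violation instead.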

Checked fixture

The BPE smoke artifact defines t + h → 260, then 260 + e → 261. Diagnostics for the prompt "the" record the token output 256,261 before generation.

Scaling issue

The current representation uses simple vectors and repeated scans. A production vocabulary requires indexed pair lookup, normalization policy, tokenizer checksum/identity, corpus compatibility tests, and bounded parser allocation.
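
One of the listed requirements, indexed pair lookup, could look like the following sketch. It is not taken from the reviewed source: the names `build_pair_index` and `best_merge`, the `(left, right, output, rank)` tuple layout, and the leftmost-position tie-break are hypothetical. The idea is to replace the scan over all merge records with one hash lookup per adjacent pair.

```python
def build_pair_index(merges):
    """Map (left, right) -> (rank, output) for constant-time pair lookup."""
    index = {}
    for left, right, output, rank in merges:
        if (left, right) in index:
            # Two merges for one pair would make selection ambiguous.
            raise ValueError(f"duplicate merge for pair ({left}, {right})")
        index[(left, right)] = (rank, output)
    return index


def best_merge(ids, index):
    """Return (rank, position, output) for the lowest-rank adjacent pair,
    or None when no merge applies. Ties go to the leftmost position."""
    best = None
    for pos in range(len(ids) - 1):
        hit = index.get((ids[pos], ids[pos + 1]))
        if hit is not None:
            candidate = (hit[0], pos, hit[1])
            if best is None or candidate < best:
                best = candidate
    return best
```

Each pass still walks the token sequence once, so a larger vocabulary or longer inputs may also call for a priority queue over candidate pairs; that is a design option, not something the source describes.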

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

  • Identify the source, version, target environment, and owner.
  • Separate observed values from estimates and externally reported values.
  • Record trade-offs, unsupported cases, and fallback behavior.
  • Link performance statements to a compatible benchmark methodology.

Verification questions

  • What exact artifact, revision, backend, and environment were reviewed?
  • Which assumptions could change the result?
  • Which data should be retained so another engineer can reproduce the conclusion?