BPE1 tokenizer

Describes the custom embedded BPE token table, merge records, ranked deterministic merge algorithm, parser validation, fixture proof, and scaling concerns.

Experimental

Last verified: 2026-06-25 00:00 UTC
Updated: 2026-06-25

Implementation evidence: this topic is grounded in the reviewed GGUF.MiRust.com source snapshot. It documents observed code and artifacts without claiming broad deployment, model quality, or production readiness.

Section layout

BPE1 begins with a 36-byte header containing the version, vocabulary size, four special IDs, token count, and merge count. Variable-length token records store a token ID, a byte length, and the raw bytes. Fixed 16-byte merge records store left, right, output, and rank.
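
The fixed-size merge record is the easiest part of this layout to illustrate. The following is a minimal sketch, not the reviewed parser: it assumes each of the four fields is a little-endian u32 stored in the order left, right, output, rank. The 16-byte size comes from the description above; the field widths, byte order, and the names MergeRecord and decode_merges are assumptions.

```rust
/// One fixed-size merge record: left + right -> output, prioritized by rank.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MergeRecord {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Size of one merge record in bytes, as stated in the layout description.
pub const MERGE_RECORD_LEN: usize = 16;

/// Reads a little-endian u32 at `offset`. The byte order is an assumption.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let b = buf.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Decodes `count` merge records from `buf`, which must hold exactly
/// `count * 16` bytes. Returns None on overflow or a length mismatch, so
/// the allocation below is bounded by the size of the input.
pub fn decode_merges(buf: &[u8], count: usize) -> Option<Vec<MergeRecord>> {
    let expected = count.checked_mul(MERGE_RECORD_LEN)?;
    if buf.len() != expected {
        return None;
    }
    let mut merges = Vec::with_capacity(count);
    for chunk in buf.chunks_exact(MERGE_RECORD_LEN) {
        merges.push(MergeRecord {
            left: read_u32_le(chunk, 0)?,
            right: read_u32_le(chunk, 4)?,
            output: read_u32_le(chunk, 8)?,
            rank: read_u32_le(chunk, 12)?,
        });
    }
    Some(merges)
}
```

Checking the declared count against the buffer length before allocating means a corrupt merge count cannot request more memory than the file itself occupies.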

Encoding algorithm

1. Convert UTF-8 input to byte-fallback token IDs.
2. Find available adjacent merges.
3. Select the lowest-rank merge deterministically.
4. Replace the pair and repeat until no merge applies.
5. Add BOS and EOS.
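
The five steps above can be sketched as a repeated-scan loop. This is an illustration under stated assumptions, not the reviewed implementation: it assumes byte value b maps to token ID b, that ties between equal ranks are broken by the leftmost position, and that BOS and EOS are supplied by the caller. The names Merge and encode are hypothetical.

```rust
/// A merge rule: the adjacent pair (left, right) becomes `output`.
/// A lower `rank` means a higher priority.
#[derive(Debug, Clone, Copy)]
pub struct Merge {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Encodes `text` by scanning the merge list until no merge applies.
/// Assumptions: byte value b maps to token ID b, and equal ranks are
/// resolved in favor of the leftmost position.
pub fn encode(text: &str, merges: &[Merge], bos: Option<u32>, eos: Option<u32>) -> Vec<u32> {
    // Step 1: byte-fallback IDs, one per UTF-8 byte.
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();

    loop {
        // Steps 2 and 3: among all adjacent pairs that match a merge,
        // keep the one with the lowest (rank, position).
        let mut best: Option<(u32, usize, u32)> = None; // (rank, position, output)
        for pos in 0..ids.len().saturating_sub(1) {
            for m in merges {
                if m.left == ids[pos] && m.right == ids[pos + 1] {
                    let better = best.map_or(true, |b| (m.rank, pos) < (b.0, b.1));
                    if better {
                        best = Some((m.rank, pos, m.output));
                    }
                }
            }
        }
        // Step 4: apply the selected merge, or stop when nothing matches.
        match best {
            Some((_, pos, output)) => {
                ids[pos] = output;
                ids.remove(pos + 1);
            }
            None => break,
        }
    }

    // Step 5: wrap the result with whichever special tokens were requested.
    let mut out = Vec::with_capacity(ids.len() + 2);
    out.extend(bos);
    out.extend_from_slice(&ids);
    out.extend(eos);
    out
}
```

With the two merges t + h → 260 and 260 + e → 261, calling encode("the", &merges, Some(256), None) returns [256, 261]. Treating 256 as the BOS ID is an assumption made for this example.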

Validation

The parser rejects unsupported versions, header/vocabulary drift, duplicate IDs, empty token bytes, out-of-range IDs, missing merge-output tokens, malformed lengths, and trailing bytes.
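
Several of these checks can be expressed over already-decoded tables. The sketch below covers only duplicate IDs, empty token bytes, out-of-range IDs, and missing merge-output tokens. It assumes "out of range" means an ID greater than or equal to the declared vocabulary size, and that a merge output must have its own token record. All type and function names here are hypothetical.

```rust
use std::collections::HashSet;

pub struct Token {
    pub id: u32,
    pub bytes: Vec<u8>,
}

pub struct Merge {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TableError {
    DuplicateId(u32),
    EmptyToken(u32),
    IdOutOfRange(u32),
    MissingMergeOutput(u32),
}

/// Validates decoded token and merge tables against the declared
/// vocabulary size. Returns the first problem found.
pub fn validate(vocab_size: u32, tokens: &[Token], merges: &[Merge]) -> Result<(), TableError> {
    let mut seen: HashSet<u32> = HashSet::with_capacity(tokens.len());
    for t in tokens {
        if t.id >= vocab_size {
            return Err(TableError::IdOutOfRange(t.id));
        }
        if t.bytes.is_empty() {
            return Err(TableError::EmptyToken(t.id));
        }
        // `insert` returns false when the ID was already present.
        if !seen.insert(t.id) {
            return Err(TableError::DuplicateId(t.id));
        }
    }
    for m in merges {
        for id in [m.left, m.right, m.output].iter() {
            if *id >= vocab_size {
                return Err(TableError::IdOutOfRange(*id));
            }
        }
        // Assumption: every merge output needs a matching token record.
        if !seen.contains(&m.output) {
            return Err(TableError::MissingMergeOutput(m.output));
        }
    }
    Ok(())
}
```

Returning a distinct error per rule keeps rejected files diagnosable; version, length, and trailing-byte checks belong to the byte-level parser and are not shown here.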

Checked fixture

The BPE smoke artifact defines two merges: t + h → 260, then 260 + e → 261. Diagnostics for the prompt "the" record the token output 256,261 before generation.

Scaling issue

The current representation uses simple vectors and repeated scans. A production vocabulary requires indexed pair lookup, normalization policy, tokenizer checksum/identity, corpus compatibility tests, and bounded parser allocation.
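
Of the items listed, indexed pair lookup is the most direct to illustrate. The sketch below replaces a scan over the merge list with a hash map keyed by the adjacent pair. It is one possible approach, not something taken from the reviewed source, and the names PairIndex, build, and lookup are hypothetical.

```rust
use std::collections::HashMap;

pub struct Merge {
    pub left: u32,
    pub right: u32,
    pub output: u32,
    pub rank: u32,
}

/// Maps an adjacent pair (left, right) to its (rank, output).
pub struct PairIndex {
    map: HashMap<(u32, u32), (u32, u32)>,
}

impl PairIndex {
    /// Builds the index once, in time proportional to the merge count.
    pub fn build(merges: &[Merge]) -> PairIndex {
        let mut map = HashMap::with_capacity(merges.len());
        for m in merges {
            // If a pair appears more than once, keep the lowest rank.
            let entry = map.entry((m.left, m.right)).or_insert((m.rank, m.output));
            if m.rank < entry.0 {
                *entry = (m.rank, m.output);
            }
        }
        PairIndex { map }
    }

    /// Returns (rank, output) for a pair, or None if no merge applies.
    pub fn lookup(&self, left: u32, right: u32) -> Option<(u32, u32)> {
        self.map.get(&(left, right)).copied()
    }
}
```

With such an index, checking one adjacent pair costs an expected constant-time lookup instead of a pass over every merge record. Selecting the lowest-rank candidate across the whole sequence still needs its own strategy, such as a priority queue.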

Scope

This starter page defines the questions, boundaries, evidence, and failure modes that should be recorded before a capability is presented as supported.

Engineering considerations

Identify the source, version, target environment, and owner.
Separate observed values from estimates and externally reported values.
Record trade-offs, unsupported cases, and fallback behavior.
Link performance statements to a compatible benchmark methodology.

Verification questions

What exact artifact, revision, backend, and environment were reviewed?
Which assumptions could change the result?
Which data should be retained so another engineer can reproduce the conclusion?
