Fundamentals

BPE

Byte Pair Encoding, Subword Tokenization
The most common algorithm for building tokenizer vocabularies. BPE starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens ("the," "function") while rare words are split into subword pieces ("un" + "common"). Used by GPT, Claude, Llama, and most modern LLMs.

Why it matters

BPE is the reason your tokenizer works the way it does. It explains why common words are cheap (one token), why rare words are expensive (many tokens), and why non-English text costs more (fewer merges allocated to non-English character pairs). Understanding BPE helps you predict token counts, optimize prompts, and understand why different tokenizers produce different results for the same text.
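To make the cost difference concrete, here is a minimal sketch of how a word is split once a merge table exists. The `encode` helper and the four-entry `toy_merges` table are invented for illustration; real tokenizers have tens of thousands of merges and usually work on bytes with a pre-tokenization step.

```python
def encode(word, merges):
    """Split `word` into tokens by applying merges in priority order."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair that was learned earliest.
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, invented for illustration only.
toy_merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

for word in ["the", "thing", "zebra"]:
    tokens = encode(word, toy_merges)
    print(word, "->", tokens, f"({len(tokens)} tokens)")
```

With this toy table, "the" collapses to one token, "thing" to two, and "zebra" stays at five single characters because none of its pairs were ever merged.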

Deep Dive

The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority that is reused when encoding new text: merges learned earlier are applied first. "th" might be merge #50 while "ing" is merge #200, meaning "th" was the more frequent pair at the time it was merged and is a more fundamental unit in this tokenizer.
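The steps above can be sketched in a few lines of Python. This is a simplified, character-level version that assumes the corpus is already split into words; the names `train_bpe` and `merge_pair` and the toy corpus are invented for illustration, and production tokenizers add byte-level handling, pre-tokenization rules, and much faster data structures.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Step 1: every word starts as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Step 3: merge the most frequent pair and record it.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


# Toy corpus: word frequencies chosen so that no two pairs tie.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = train_bpe(corpus, 3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

The returned list is the merge table: its order is the priority order, and the final vocabulary is the base symbols plus one new token per merge.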

SentencePiece

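SentencePiece (Google) is a popular tokenizer library that implements BPE (alongside a unigram language-model algorithm) and operates directly on raw text rather than pre-tokenized words, treating whitespace as an ordinary symbol. This means it can handle any language without language-specific preprocessing: no need for word segmentation in Chinese or morphological analysis in Turkish. Many LLMs use SentencePiece (Llama 2, for example), while others use byte-level BPE, which starts from the 256 possible byte values (GPT-2 and its successors, Llama 3). The alternative, WordPiece (used by BERT), is similar but chooses merges with a likelihood-based score rather than raw pair frequency.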
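The difference in merge criterion can be illustrated on a toy example. The sketch below assumes the commonly described WordPiece score, the pair count divided by the product of the two symbols' counts; the function names and word frequencies are invented for illustration, and BERT's actual WordPiece training differs in details such as the "##" continuation prefix.

```python
from collections import Counter


def pair_and_symbol_counts(words):
    """Count adjacent pairs and individual symbols, weighted by word frequency."""
    pairs, symbols = Counter(), Counter()
    for syms, freq in words.items():
        for s in syms:
            symbols[s] += freq
        for pair in zip(syms, syms[1:]):
            pairs[pair] += freq
    return pairs, symbols


def best_bpe_pair(pairs):
    """BPE: pick the pair with the highest raw count."""
    return max(pairs, key=pairs.get)


def best_wordpiece_pair(pairs, symbols):
    """WordPiece (as commonly described): normalise by how frequent each part is alone."""
    return max(pairs, key=lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]]))


# Toy word frequencies, invented for illustration.
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
pairs, symbols = pair_and_symbol_counts(words)
bpe_choice = best_bpe_pair(pairs)
wordpiece_choice = best_wordpiece_pair(pairs, symbols)
print("BPE merges first:      ", bpe_choice)
print("WordPiece merges first:", wordpiece_choice)
```

On this data BPE picks ("u", "g"), the most frequent pair, while the normalised score prefers ("g", "s"): "s" is rare on its own, so almost every occurrence of it sits next to "g".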

The Training Corpus Matters

BPE merges reflect the training corpus's statistics. A tokenizer trained on English code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text. This is why multilingual tokenizers need balanced training corpora: the merge table must allocate enough merges to every language's common patterns. Llama 3's tokenizer, for example, expanded the vocabulary from Llama 2's 32K tokens to 128K, with part of the added capacity aimed at non-English text; the resulting gain in token efficiency varies by language.
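One way to see what is at stake is to look at the starting point before any merges apply. Assuming a byte-level base vocabulary, text begins as its UTF-8 bytes, and scripts such as Devanagari and Arabic need several bytes per character. The helper name `base_token_count` and the sample words are chosen for illustration.

```python
def base_token_count(text):
    """Tokens before any merge applies, assuming a byte-level base vocabulary."""
    return len(text.encode("utf-8"))


# Sample greetings, chosen for illustration only.
samples = {"English": "hello", "Hindi": "नमस्ते", "Arabic": "مرحبا"}

for name, text in samples.items():
    print(f"{name}: {len(text)} code points -> {base_token_count(text)} bytes")
```

An unmerged Hindi word therefore starts at roughly three tokens per code point, against one for ASCII text, and only merges learned from Hindi data can close that gap.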
