The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority: "th" might be merge #50 while "ing" is merge #200, meaning "th" is a more fundamental unit in this tokenizer.
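The four steps above can be sketched in a few lines. This is a toy illustration, not a production implementation: it starts from characters instead of bytes (for readability), trains on a single string, and the function and variable names are my own.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: learn an ordered merge table from a corpus."""
    tokens = list(corpus)                # step 1: base vocabulary = characters
    merges = []                          # merge #0 is the highest-priority merge
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))   # step 2: count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # step 3: most frequent pair
        merges.append(best)
        new_token = best[0] + best[1]
        merged, i = [], 0
        while i < len(tokens):                     # replace every occurrence
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                            # step 4: repeat on merged stream
    return merges, tokens

merges, tokens = train_bpe("the thing that things think", 10)
print(merges[0])   # ('t', 'h') — "th" is the most frequent pair, so it merges first
```

Note how the merge order falls out of the counts: "th" appears five times in this toy corpus, so it wins the first round, exactly the priority effect described above.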
SentencePiece (from Google) is a popular tokenizer library that treats the input as a raw character stream rather than as pre-tokenized words. This means it can handle any language without language-specific preprocessing — no word segmentation for Chinese, no morphological analysis for Turkish. Most modern LLMs use SentencePiece or a byte-level BPE variant (which starts from the 256 raw bytes, as in the GPT family). The main alternative, WordPiece (used by BERT), follows the same merge loop but selects pairs by a likelihood-based score rather than raw pair frequency.
BPE merges reflect the statistics of the training corpus. A tokenizer trained mostly on English text and code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text into near character-level pieces. This is why multilingual tokenizers need balanced training corpora — the merge table must allocate enough merges to each language's common patterns. Llama 3's tokenizer was explicitly trained on more balanced multilingual data, improving non-English token efficiency by 2–3x compared to Llama 2.
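The fragmentation effect can be demonstrated by encoding text with a fixed merge table. The table below is hypothetical — a handful of English-centric merges I invented for illustration — and the greedy encoder applies them in priority order:

```python
def apply_merges(text, merges):
    """Encode text with a fixed BPE merge table, applied in priority order."""
    tokens = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge table learned from English-heavy text/code
merges = [("r", "e"), ("re", "t"), ("u", "r"), ("ret", "ur"), ("retur", "n")]

print(apply_merges("return", merges))  # ['return'] — collapses to a single token
print(apply_merges("वापसी", merges))    # one token per character: no merges apply
```

The English keyword collapses to one token while the Hindi word ("वापसी", roughly "return") stays at one token per character — a 5x cost for the same concept, which is exactly the imbalance a balanced multilingual merge table is meant to fix.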