Basics

BPE

Byte Pair Encoding, Subword Tokenization
The most common algorithm for building a tokenizer's vocabulary. BPE starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens ("the", "function") while rare words are split into subword pieces ("un" + "common"). GPT, Claude, Llama, and most other modern LLMs use it.

Why It Matters

BPE is the reason your tokenizer works the way it does. It explains why common words are cheap (one token), why rare words are expensive (many tokens), and why non-English text costs more (fewer merges are allocated to non-English character pairs). Understanding BPE helps you predict token counts, optimize prompts, and see why different tokenizers produce different results on the same text.

Deep Dive

The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority: "th" might be merge #50 while "ing" is merge #200, meaning "th" is a more fundamental unit in this tokenizer.
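To make the loop concrete, here is a minimal character-level training sketch in Python (the function name `train_bpe` and the toy corpus are illustrative, not any production tokenizer's code; a real byte-level BPE would start from 256 byte tokens and run over a large corpus):

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a list of words; returns merges in priority order."""
    # Step 1: start from individual characters (byte-level BPE would start from bytes).
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = "".join(best)
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    # Step 4: the caller repeats until the target vocabulary size is reached.
    return merges

corpus = ["the", "the", "then", "thin", "function", "function"]
print(train_bpe(corpus, 5))
```

On this toy corpus the first merge is ('t', 'h'), the most frequent adjacent pair, followed by ('th', 'e'), which is exactly the priority ordering described above: earlier merges are more fundamental units.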

SentencePiece

SentencePiece (Google) is a popular tokenizer library that implements BPE (and a unigram alternative) directly on raw text streams rather than on pre-tokenized words. Because it needs no language-specific preprocessing, it handles any language out of the box: no word segmentation for Chinese, no morphological analysis for Turkish. Many modern LLMs use SentencePiece or a byte-level BPE variant of the same idea. The main alternative, WordPiece (used by BERT), is similar but uses a slightly different merge criterion.
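A sketch of what this looks like with the `sentencepiece` Python package (the file name corpus.txt, the model prefix, and the vocabulary size are placeholder values; training assumes a reasonably large text file exists at that path):

```python
import sentencepiece as spm

# Train a BPE model on raw text, one sentence per line, no pre-tokenization.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to a raw text corpus
    model_prefix="bpe_demo",   # writes bpe_demo.model and bpe_demo.vocab
    vocab_size=4000,           # placeholder; production models use 32K-128K
    model_type="bpe",          # SentencePiece also supports 'unigram'
)

sp = spm.SentencePieceProcessor(model_file="bpe_demo.model")
# Spaces are encoded with the meta symbol '▁', so any language works unsegmented.
print(sp.encode("The function returns a value.", out_type=str))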

The Training Corpus Matters

BPE merges reflect the training corpus's statistics. A tokenizer trained on English code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text. This is why multilingual tokenizers need balanced training corpora: the merge table must allocate enough merges to every language's common patterns. Llama 3's tokenizer was explicitly trained on more balanced multilingual data, improving non-English token efficiency by 2–3x compared to Llama 2.
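You can observe the corpus effect directly by counting tokens. A small sketch using the `tiktoken` package and its `cl100k_base` encoding (the example sentences are arbitrary, and exact counts depend on which encoding you load):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is nice today."
hindi = "आज मौसम अच्छा है।"  # roughly the same sentence in Hindi

# English tends to need few tokens: the corpus gave it many useful merges.
print(len(enc.encode(english)))
# Hindi tends to need more tokens: fewer merges cover Devanagari pairs.
print(len(enc.encode(hindi)))
```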
