Basics

BPE

Byte Pair Encoding, Subword Tokenization
The most common algorithm for building a tokenizer's vocabulary. BPE starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens ("the", "function"), while rare words are split into subword pieces ("un" + "common"). GPT, Claude, Llama, and most modern LLMs use it.

Why It Matters

BPE is the reason your tokenizer behaves the way it does. It explains why common words are cheap (one token), why rare words are expensive (many tokens), and why non-English text costs more (fewer merges are allocated to non-English character pairs). Understanding BPE helps you predict token counts, optimize prompts, and understand why different tokenizers produce different results for the same text.
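For a quick sense of these costs, here is a minimal sketch using OpenAI's tiktoken library as one concrete BPE tokenizer; the encoding name and example sentences are purely illustrative:

```python
# Minimal token-counting sketch (assumes `pip install tiktoken`).
# cl100k_base is the encoding used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳過了懶惰的狗。"

# Common English words usually map to a single token each; CJK text typically
# needs more tokens per character, because fewer merges cover those byte pairs.
print(len(enc.encode(english)), "tokens for the English sentence")
print(len(enc.encode(chinese)), "tokens for the Chinese sentence")
```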

Deep Dive

The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority: "th" might be merge #50 while "ing" is merge #200, meaning "th" is a more fundamental unit in this tokenizer.
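To make those four steps concrete, here is a small character-level sketch in Python, assuming a toy word-frequency table as the corpus (production tokenizers start from raw bytes and train on far larger data):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count every adjacent symbol pair across the corpus (step 2)."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol (step 3)."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges from a {word: count} dictionary."""
    # Step 1: start from individual characters
    # (a real byte-level BPE starts from the 256 possible bytes).
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)               # merge order doubles as priority
    return merges

# Tiny illustrative corpus; real tokenizers train on billions of words.
merges = learn_bpe({"the": 50, "then": 20, "thin": 10, "ing": 30, "sing": 15}, num_merges=5)
print(merges)  # e.g. [('t', 'h'), ('th', 'e'), ...] depending on the corpus
```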

SentencePiece

SentencePiece (Google) is a popular tokenizer library that implements BPE (plus a unigram variant) and treats the input as a raw character stream rather than pre-tokenized words. This means it can handle any language without language-specific preprocessing: no word segmentation for Chinese, no morphological analysis for Turkish. Most modern LLMs use SentencePiece or a similar byte-level BPE variant. The alternative, WordPiece (used by BERT), is similar but picks merges by a likelihood score over the training data rather than by raw pair frequency.
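As a rough usage sketch (not the exact setup of any particular LLM), training and loading a small SentencePiece BPE model from Python looks like this; the file names and vocabulary size are placeholders:

```python
# Assumes `pip install sentencepiece` and a plain-text file corpus.txt.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",       # raw text, one sentence per line; no pre-tokenization needed
    model_prefix="bpe_demo",  # writes bpe_demo.model and bpe_demo.vocab
    vocab_size=8000,
    model_type="bpe",         # SentencePiece also supports the "unigram" algorithm
)

sp = spm.SentencePieceProcessor(model_file="bpe_demo.model")
print(sp.encode("Byte pair encoding builds subword vocabularies.", out_type=str))
# Tokens carry a leading ▁ marker wherever a word boundary (whitespace) was.
```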

The Training Corpus Matters

BPE merges reflect the statistics of the training corpus. A tokenizer trained on English code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text. This is why multilingual tokenizers need balanced training corpora: the merge table must allocate enough merges to every language's common patterns. Llama 3's tokenizer was explicitly trained on more balanced multilingual data, improving non-English token efficiency by 2–3x compared to Llama 2.
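The Llama tokenizers themselves are gated downloads, but the same corpus-and-vocabulary effect is easy to see with tiktoken's bundled encodings, used here purely as stand-ins for an older, English-heavy tokenizer and a newer, more multilingual one:

```python
# Rough sketch of how tokenizer vintage affects non-English efficiency
# (assumes `pip install tiktoken`; exact counts vary with the text).
import tiktoken

text = "機器學習模型的分詞器效率"   # a short Chinese phrase, purely illustrative

old = tiktoken.get_encoding("gpt2")         # ~50K vocab, trained mostly on English web text
new = tiktoken.get_encoding("cl100k_base")  # ~100K vocab, trained on a broader corpus

# The newer, more multilingual merge table typically needs noticeably fewer tokens here.
print("gpt2:       ", len(old.encode(text)), "tokens")
print("cl100k_base:", len(new.encode(text)), "tokens")
```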
