BPE: Definition & Meaning — AI Wiki

The most common algorithm for building tokenizer vocabularies. BPE (byte-pair encoding) starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens ("the", "function") while rare words are split into subword pieces ("un" + "common"). Used by GPT, Claude, Llama, and most modern LLMs.

Why it matters

BPE is the reason your tokenizer behaves the way it does. It explains why common words are cheap (one token), why rare words are expensive (many tokens), and why non-English text costs more (fewer merges are allocated to non-English character pairs). Understanding BPE helps you predict token counts, optimize prompts, and understand why different tokenizers produce different results for the same text.

Deep Dive

The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority that is reused when encoding new text, where merges are applied in the order they were learned: "th" might be merge #50 while "ing" is merge #200, meaning "th" was the more frequent pattern in the training corpus and is a more fundamental unit in this tokenizer.
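The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not any production tokenizer's implementation: it assumes a character-level base vocabulary, a corpus given as a plain list of words, and a tie-break rule chosen only to make the output deterministic. The names `merge_pair`, `train_bpe`, and `encode` are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol sequence with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered merge list from a list of words (assumed character-level base)."""
    # Step 1: every word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Step 3: pick the most frequent pair (ties broken arbitrarily but deterministically).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus  # Step 4: repeat on the updated corpus.
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

training_words = ["the"] * 10 + ["then"] * 3 + ["uncommon"]
merges = train_bpe(training_words, num_merges=2)
print(merges)                   # [('t', 'h'), ('th', 'e')]
print(encode("the", merges))    # ['the']
print(encode("other", merges))  # ['o', 'the', 'r']
```

With only two merges, the frequent word "the" already collapses into one token, while an unseen word like "other" stays fragmented around the learned piece. Real tokenizers add details this sketch omits, such as byte-level base symbols, pre-tokenization rules, and faster pair-count updates.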

SentencePiece

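SentencePiece (Google) is a popular tokenizer library that implements BPE (alongside a unigram language-model algorithm) and trains directly on raw text rather than pre-tokenized words, treating whitespace as an ordinary symbol. This means it can handle any language without language-specific preprocessing: no need for word segmentation in Chinese or morphological analysis in Turkish. Most modern LLMs use either SentencePiece or a byte-level BPE variant, which starts from the 256 possible byte values instead of characters. The alternative, WordPiece (used by BERT), is similar but chooses merges by how much they improve the likelihood of the training data rather than by raw pair frequency.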
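To make the difference in merge criterion concrete, here is a small sketch comparing the two selection rules on a toy corpus. The WordPiece-style score used here, count(ab) / (count(a) × count(b)), is the formulation commonly given in tutorials; it is an assumption about the criterion rather than a reproduction of Google's training code, and the function names are hypothetical.

```python
from collections import Counter

def pair_and_symbol_counts(corpus):
    """`corpus` maps a tuple of symbols to the frequency of that word."""
    pairs, symbols = Counter(), Counter()
    for word, freq in corpus.items():
        for s in word:
            symbols[s] += freq
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs, symbols

def best_pair_bpe(corpus):
    """BPE rule: pick the pair with the highest raw count."""
    pairs, _ = pair_and_symbol_counts(corpus)
    return max(pairs, key=lambda p: (pairs[p], p))

def best_pair_wordpiece_style(corpus):
    """Assumed WordPiece-style rule: normalize the pair count by its parts' counts."""
    pairs, symbols = pair_and_symbol_counts(corpus)

    def score(p):
        return pairs[p] / (symbols[p[0]] * symbols[p[1]])

    return max(pairs, key=lambda p: (score(p), p))

toy_corpus = {
    ("t", "h", "e"): 10,
    ("t", "h", "a", "t"): 5,
    ("q", "u"): 2,
}
print(best_pair_bpe(toy_corpus))              # ('t', 'h')
print(best_pair_wordpiece_style(toy_corpus))  # ('q', 'u')
```

BPE prefers "th" because it is simply the most frequent pair. The normalized score prefers "qu" because "q" and "u" never appear apart in this corpus, so merging them loses nothing, even though the pair is rare.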

The Training Corpus Matters

BPE merges reflect the training corpus's statistics. A tokenizer trained on English code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text. This is why multilingual tokenizers need balanced training corpora: the merge table must allocate enough merges to every language's common patterns. Llama 3's tokenizer, whose vocabulary grew from Llama 2's 32K entries to 128K, was explicitly trained on more balanced multilingual data, reportedly improving non-English token efficiency by as much as 2–3x compared to Llama 2.
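The effect of corpus imbalance can be reproduced with a toy experiment: train merges on a few English code keywords, then compare the average tokens per word for in-domain and out-of-domain text. This is a sketch under simplifying assumptions (character-level base symbols, a tiny made-up corpus, hypothetical function names); a byte-level tokenizer would fragment the Hindi and Arabic words even further, since each of their characters spans multiple UTF-8 bytes.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def tokens_per_word(words, merges):
    """Average number of tokens each word is split into."""
    total = 0
    for word in words:
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        total += len(symbols)
    return total / len(words)

# Assumed toy corpus: only English code keywords, so all merges go to them.
english_code = ["function", "return", "const"] * 20
merges = train_bpe(english_code, num_merges=30)

in_domain = tokens_per_word(["function", "return", "const"], merges)
out_of_domain = tokens_per_word(["नमस्ते", "مرحبا"], merges)
print(f"in-domain tokens/word:     {in_domain:.1f}")
print(f"out-of-domain tokens/word: {out_of_domain:.1f}")
```

Every merge was spent on English patterns, so the training words compress to one token each while the Hindi and Arabic words fall back to one token per character. Adding those languages to the training list would let them compete for merges, which is the balancing act a multilingual tokenizer has to perform.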
