Fundamentals

Tokenizer

Tokenization
The algorithm that converts raw text into tokens before a model can process it. A tokenizer maintains a fixed vocabulary of token types and splits any input text into a sequence of those tokens. Different models use different tokenizers: the same sentence tokenizes differently for Claude, GPT, and Llama, which affects context usage and cost.

Why It Matters

The tokenizer is the invisible layer between your text and the model. It determines how many tokens your prompt costs, why some languages cost more than others, and why code sometimes burns through context faster than prose. When you hit a context limit or see unexpected API costs, the tokenizer is usually the explanation.

Deep Dive

Most modern tokenizers use Byte Pair Encoding (BPE) or a close relative (SentencePiece, often mentioned alongside it, is a tokenization library that implements both BPE and the unigram model rather than a BPE variant). BPE works by starting with individual bytes or characters and repeatedly merging the most frequent adjacent pair into a new token. After thousands of merges, common words like "the" become single tokens, while rare words get split into subword pieces. The word "tokenization" might become ["token", "ization"] or ["token", "iz", "ation"] depending on the specific merge table.
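The merge loop above can be sketched in a few dozen lines. This is a minimal toy, not any production tokenizer: the corpus, merge count, and character-level starting alphabet are all illustrative assumptions (real BPE implementations work over bytes, handle word boundaries with special markers, and learn tens of thousands of merges).

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols (single characters to start).
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def bpe_encode(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = bpe_train("low low low lower lowest new newer newest", 6)
print(bpe_encode("low", merges))     # frequent word -> single token: ['low']
print(bpe_encode("lowest", merges))  # rarer word -> subword pieces: ['lowe', 'st']
```

Note that encoding replays merges in the order they were learned, which is exactly why the same word can split differently under two models: their merge tables were learned from different corpora.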

Vocabulary Size Matters

A tokenizer's vocabulary size is a real engineering trade-off. Larger vocabularies (100K+ tokens) compress text more efficiently — common words and phrases get dedicated tokens, so less context is consumed. But larger vocabularies also mean a bigger embedding table at the model's input and output layers. For a model with dimension 4096, each vocabulary entry adds 4096 parameters to both the embedding and the unembedding layers. At 128K vocabulary, that's over a billion parameters just for the token tables. Smaller models feel this overhead proportionally more.
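The arithmetic behind that "over a billion parameters" figure is worth making explicit. A quick back-of-envelope check, using the dimensions quoted above:

```python
# Parameters spent on token tables alone, for the figures in the text.
d_model = 4096
vocab_size = 128_000

# One input embedding matrix plus one output (unembedding) matrix,
# each of shape (vocab_size, d_model).
token_table_params = 2 * vocab_size * d_model
print(f"{token_table_params:,}")  # 1,048,576,000
```

For a 70B-parameter model this is rounding error; for a 1B-parameter model it would dominate, which is why smaller models tend to ship with smaller vocabularies.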

The Multilingual Tax

Tokenizers are trained on a corpus, and the language distribution of that corpus determines efficiency. English text typically tokenizes at roughly 1 token per word. But languages like Chinese, Japanese, Korean, Arabic, and Hindi can require 2–4x more tokens for equivalent meaning, because their characters appear less frequently in English-dominated training data and earn fewer dedicated merges. This isn't just an academic concern — it means non-English users pay more per API call and fit less content in the context window. Some newer tokenizers (like Llama 3's) explicitly train on more balanced multilingual data to reduce this gap.
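The mechanism behind the tax can be shown with a toy: a vocabulary whose entries came from English text, with raw-byte fallback for anything else. The vocabulary and the greedy longest-match scheme here are illustrative assumptions, not how any specific production tokenizer works, but the cost asymmetry they produce is the real phenomenon.

```python
# Hypothetical English-heavy vocabulary; real ones hold ~100K entries.
english_vocab = {"the", "cat", "sat", "on", "mat", " "}

def tokenize(text, vocab):
    """Greedy longest-match against the vocab; unmatched characters
    fall back to their individual UTF-8 bytes."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry covers this character: emit one token per byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

print(len(tokenize("the cat sat", english_vocab)))  # 5 tokens
# Japanese: every character is 3 UTF-8 bytes, none earned a vocab entry.
print(len(tokenize("猫が座った", english_vocab)))     # 15 tokens for 5 characters
```

An 11-character English sentence costs 5 tokens while a 5-character Japanese one costs 15: the text that never earned merges pays per byte.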

Tokenizer Artifacts

Quirks in tokenization explain several LLM behaviors people find puzzling. Models struggle with character-level tasks (counting letters in "strawberry") because they see tokens, not characters. They handle some variable names better than others because common names like "result" are single tokens while unusual ones fragment. They sometimes produce slightly different outputs for semantically identical inputs because the token boundaries differ. Understanding the tokenizer helps you understand the model.
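The strawberry example comes down to what the model actually receives. Counting characters is trivial over raw text, but the model sees opaque token IDs; the token split below is hypothetical, for illustration only.

```python
word = "strawberry"
print(word.count("r"))  # 3 -- trivial over characters

# What a model might receive instead (hypothetical token boundaries):
tokens = ["str", "aw", "berry"]
# The answer is smeared across pieces; no single token "contains" it,
# and the model never sees the letters inside each piece directly.
print([t.count("r") for t in tokens])  # [1, 0, 2]
```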

Related Concepts

← All terms
← Token Tool Use →