Vocabulary: Definition & Meaning — AI Wiki

The fixed set of tokens that a model can recognize and produce. A vocabulary is built by the tokenizer during training and typically contains 32K to 128K entries: common words, subword fragments, individual characters, and special tokens. Any text the model processes must be expressible as a sequence of tokens from this vocabulary. Strings that are not in the vocabulary are broken into smaller pieces that are.
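As a rough illustration of how text is expressed with a fixed vocabulary, the sketch below segments a string by greedy longest match. The vocabulary entries, the `<UNK>` token, and the `encode` helper are all made up for this example; real tokenizers apply learned merge rules and usually fall back to bytes rather than to an unknown token.

```python
# Toy vocabulary (hypothetical entries): special tokens, a few words and
# subword fragments, a space, and the lowercase ASCII letters.
SPECIALS = ["<BOS>", "<EOS>", "<UNK>"]
PIECES = ["the", "un", "common", "token", " "] + list("abcdefghijklmnopqrstuvwxyz")

# Map each token string to an integer id, dropping duplicates in order.
VOCAB = {tok: i for i, tok in enumerate(dict.fromkeys(SPECIALS + PIECES))}


def encode(text, vocab=VOCAB):
    """Split `text` into vocabulary entries using greedy longest match."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched, not even a single character.
            tokens.append("<UNK>")
            i += 1
    return tokens


if __name__ == "__main__":
    for sample in ("the uncommon tokens", "zebra"):
        pieces = encode(sample)
        print(sample, "->", pieces, [VOCAB[p] for p in pieces])
```

A word that has its own entry costs one token, while a word that does not is spelled out from smaller entries, so the same text can be cheap or expensive depending on what the vocabulary contains.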

Why It Matters

The vocabulary determines what the model can "see." A vocabulary trained mostly on English will handle English efficiently (often one token per word) but may fragment Chinese, Arabic, or code into many small tokens, which makes processing more expensive and slower and leaves less room in the context window. Vocabulary design is one of the most consequential and least discussed decisions in model development.

Deep Dive

Building a vocabulary: the tokenizer algorithm (usually BPE) starts with individual bytes or characters and iteratively merges the most frequent pairs. After 32K–128K merges, you have a vocabulary where common words are single tokens ("the," "and," "function") and rare words are split into subword pieces ("un" + "common," "pre" + "process" + "ing"). Special tokens like <BOS> (beginning of sequence), <EOS> (end), and <PAD> (padding) are added explicitly.
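The merge loop described above can be sketched in a few lines. This is a toy character-level version under simplifying assumptions: the function names `train_bpe` and `build_vocab` are hypothetical, the corpus is a plain list of words, and ties between equally frequent pairs are broken arbitrarily but deterministically. Production tokenizers add pre-tokenization, byte-level alphabets, and much larger corpora.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of single characters with a frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair with one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges


def build_vocab(corpus, num_merges, specials=("<BOS>", "<EOS>", "<PAD>")):
    """Vocabulary = special tokens + base characters + one token per merge."""
    chars = sorted({c for w in corpus for c in w})
    merges = train_bpe(corpus, num_merges)
    tokens = list(specials) + chars + [a + b for a, b in merges]
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}, merges


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    vocab, merges = build_vocab(corpus, num_merges=3)
    print(merges)
    print(sorted(vocab, key=vocab.get))
```

The number of merges directly sets the vocabulary size: each merge adds exactly one new token on top of the base alphabet and the special tokens.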

The Size Trade-off

Larger vocabularies compress text better (fewer tokens per sentence means lower cost and more text per context window) but increase the size of the model's embedding table. A 128K vocabulary with 4096-dimensional embeddings adds roughly 500M parameters (128K × 4096) for the token embedding table alone, and about twice that if the output projection is not tied to it. For a 7B model, that is about 7% of total parameters doing nothing but mapping tokens to vectors. For a 1B model with the same embedding width, it would be about 50%. This is why smaller models tend to use smaller vocabularies.
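The arithmetic behind these percentages is easy to reproduce. The helpers below are a minimal sketch; `embedding_params` and `embedding_share` are names chosen for this example, 128K is taken as 128,000, and the `tied` flag is an assumption about whether the input embedding and output projection share one table.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token tables: one table if tied, two if not."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


def embedding_share(vocab_size, d_model, total_params, tied=True):
    """Fraction of the model's total parameters spent on token tables."""
    return embedding_params(vocab_size, d_model, tied) / total_params


if __name__ == "__main__":
    vocab_size, d_model = 128_000, 4096
    print(f"table size: {embedding_params(vocab_size, d_model):,} parameters")
    for total in (7e9, 1e9):
        share = embedding_share(vocab_size, d_model, total)
        print(f"{total / 1e9:.0f}B model: {share:.1%} of parameters")
```

Holding the embedding width fixed isolates the effect of vocabulary size; in practice a 1B model would usually also use a narrower embedding, which lowers the share somewhat.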

Multilingual Vocabulary

A vocabulary's language coverage depends on the corpus the tokenizer was trained on. Llama's early tokenizer was trained predominantly on English and represented Chinese characters as 3–4 tokens each, making Chinese inference roughly 3–4x more expensive than English. Llama 3's tokenizer was trained on more balanced multilingual data, dramatically improving non-English efficiency. This is a solvable problem, but it requires deliberate effort; without it, the default is English-dominant.
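One mechanism that can produce a multi-token cost per character is byte fallback: a character with no vocabulary entry is emitted as its UTF-8 bytes, one token per byte. The sketch below assumes that mechanism and a hypothetical character-level vocabulary covering only lowercase English; it does not reproduce any real Llama tokenizer.

```python
def byte_fallback_tokens(text, vocab):
    """Count tokens when characters missing from `vocab` become UTF-8 bytes."""
    count = 0
    for ch in text:
        if ch in vocab:
            count += 1  # the character has its own token
        else:
            count += len(ch.encode("utf-8"))  # one token per byte
    return count


def tokens_per_char(text, vocab):
    """Average token cost per character of `text` under byte fallback."""
    return byte_fallback_tokens(text, vocab) / len(text)


if __name__ == "__main__":
    # Hypothetical English-only vocabulary: lowercase letters and space.
    english_only = set("abcdefghijklmnopqrstuvwxyz ")
    for sample in ("hello", "你好"):
        print(sample, byte_fallback_tokens(sample, english_only),
              tokens_per_char(sample, english_only))
```

Most Chinese characters occupy three bytes in UTF-8, so under this assumption each uncovered character costs three tokens. Adding the characters to the vocabulary brings the cost back to one token each.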
