Fundamentals

Vocabulary

Also known as: Vocab, Token Vocabulary
The fixed set of tokens that a model can recognize and produce. The vocabulary is built when the tokenizer is trained, before the model itself is trained, and typically contains 32K to 128K entries: common words, subword fragments, individual characters, and special tokens. Any text the model processes must be expressible as a sequence of tokens from this vocabulary. Words that are not in the vocabulary are broken into smaller pieces that are.

Why it matters

The vocabulary determines what the model can "see." A vocabulary trained mostly on English will handle English efficiently (often one token per common word) but may fragment Chinese, Arabic, or code into many small tokens, which makes that text more expensive and slower to process and leaves less room in the context window. Vocabulary design is one of the most consequential and least discussed decisions in model development.

Deep Dive

Building a vocabulary: the tokenizer algorithm (usually BPE, byte-pair encoding) starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. Once enough merges have been made to reach the target size of 32K–128K entries, you have a vocabulary where common words are single tokens ("the," "and," "function") and rare words are split into subword pieces ("un" + "common," "pre" + "process" + "ing"). Special tokens like <BOS> (beginning of sequence), <EOS> (end of sequence), and <PAD> (padding) are added explicitly.
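
The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not any production tokenizer: it works on characters rather than bytes, ignores word-boundary markers and special tokens, and the names train_bpe, merge_pair, and encode are hypothetical helpers chosen for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words; return them in order."""
    # Every word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    toy_merges = train_bpe(toy_corpus, num_merges=10)
    # A word never seen in training is still expressible as known pieces.
    print(encode("lowest", toy_merges))
```

Frequent words collapse into a single token after enough merges, while a word that was never seen in training still encodes, just as a longer sequence of smaller pieces.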

The Size Trade-off

Larger vocabularies compress text better (fewer tokens per sentence means cheaper inference and more content per context window) but increase the size of the model's embedding table. A 128K vocabulary with 4096-dimensional embeddings adds roughly 524M parameters per token table (128,000 × 4,096), and twice that if the output projection is not tied to the input embedding. For a 7B model, one table is about 7% of total parameters, spent on nothing but mapping tokens to vectors. For a 1B model at the same embedding width, it would be about 50%. This is one reason smaller models tend to use smaller vocabularies.
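
The arithmetic behind these figures is easy to reproduce. The sketch below assumes a single vocab_size × d_model matrix per token table, and optionally a second one for an untied output projection. The function names are hypothetical, and the 7B and 1B totals are round numbers used only for illustration.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters held in the token tables.

    One vocab_size x d_model matrix when input and output embeddings are tied,
    two when the output projection has its own matrix.
    """
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


def embedding_share(vocab_size, d_model, total_params, tied=True):
    """Fraction of the model's total parameters spent on the token tables."""
    return embedding_params(vocab_size, d_model, tied) / total_params


if __name__ == "__main__":
    table = embedding_params(128_000, 4096)
    print(f"one table: {table:,} parameters")  # 524,288,000
    print(f"share of a 7B model: {embedding_share(128_000, 4096, 7e9):.1%}")
    print(f"share of a 1B model: {embedding_share(128_000, 4096, 1e9):.1%}")
```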

Multilingual Vocabulary

A vocabulary's language coverage depends on the corpus its tokenizer was trained on. Llama's early tokenizer was trained predominantly on English and could represent a single Chinese character as 3–4 tokens, making Chinese inference roughly 3–4x more expensive than English for comparable content. Llama 3's tokenizer was trained on more balanced multilingual data and uses a much larger vocabulary (about 128K entries, up from 32K), dramatically improving non-English efficiency. This is a solvable problem, but it requires deliberate effort, because the default is English-dominant.
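
One way to see where the multi-token cost comes from: when a tokenizer has no learned token covering a character, a byte-level scheme falls back to one token per UTF-8 byte. The sketch below only counts UTF-8 bytes per character, which gives that worst case. It is not a measurement of any real tokenizer, and bytes_per_char is a hypothetical helper.

```python
def bytes_per_char(text):
    """Average UTF-8 bytes per character in a non-empty string.

    Under pure byte fallback (no merges covering the script), this is also
    the number of tokens per character.
    """
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    samples = {
        "English": "hello",
        "Arabic": "مرحبا",
        "Chinese": "你好",
    }
    for language, text in samples.items():
        # ASCII letters take 1 byte, Arabic letters 2, common Chinese characters 3.
        print(f"{language}: {bytes_per_char(text):.1f} bytes per character")
```

A vocabulary trained on enough text in a given script learns merges that bring this cost back down toward one token per character or per word.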
