
Vocabulary

Vocab, Token Vocabulary
The fixed set of tokens a model can recognize and produce. The vocabulary is built when the tokenizer is trained and typically contains 32K to 128K entries: common words, subword fragments, single characters, and special tokens. Any text the model processes must be expressible as a sequence of tokens from this vocabulary; anything not in the vocabulary is broken into smaller pieces that are.
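
A minimal sketch of this in Python, assuming the tiktoken library is installed; the cl100k_base encoding is used purely as an example vocabulary, not as the tokenizer of any particular model discussed here:

```python
import tiktoken

# Load an off-the-shelf BPE vocabulary (roughly 100K entries).
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # size of the fixed token set

# Common words tend to map to a single token; rarer words are split
# into subword pieces that do exist in the vocabulary.
for word in ["the", "function", "uncommonness"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", ids, pieces)
```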

Why It Matters

The vocabulary determines what the model can "see." A vocabulary trained mostly on English handles English efficiently (roughly one token per word) but may shred Chinese, Arabic, or code into many small tokens, which is more expensive, slower, and leaves less room in the context window. Vocabulary design is one of the most consequential and least discussed decisions in model development.
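
One way to make this concrete is to count tokens per character for the same idea written in different scripts; a rough sketch, again assuming tiktoken, with cl100k_base standing in for whatever vocabulary a given model actually ships with:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The model reads the document and answers the question.",
    "Chinese": "模型讀取文件並回答問題。",
}

# Fewer tokens per character means cheaper inference and more room in context.
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens for {len(text)} characters "
          f"({n / len(text):.2f} tokens/char)")
```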

Deep Dive

Building a vocabulary: the tokenizer algorithm (usually BPE) starts with individual bytes or characters and iteratively merges the most frequent pairs. After 32K–128K merges, you have a vocabulary where common words are single tokens ("the," "and," "function") and rare words are split into subword pieces ("un" + "common," "pre" + "process" + "ing"). Special tokens like <BOS> (beginning of sequence), <EOS> (end), and <PAD> (padding) are added explicitly.
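
A toy version of that merge loop in plain Python (real tokenizers operate on bytes, handle word frequencies, and are heavily optimized; this only illustrates the idea of merging frequent pairs into new vocabulary entries):

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word as a sequence of single characters.
    words = [list(w) for w in corpus_words]
    vocab = set(ch for w in words for ch in w)

    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)  # the new, longer token joins the vocabulary

        # Rewrite every word using the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return vocab, words

vocab, segmented = bpe_merges(["low", "lower", "lowest", "newest", "widest"], num_merges=10)
print(sorted(vocab))
print(segmented)
```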

The Size Trade-off

Larger vocabularies compress text better (fewer tokens per sentence = cheaper, fits more in context) but increase the model's embedding table size. A 128K vocabulary with 4096-dimensional embeddings adds ~500M parameters just for the token tables. For a 7B model, that's 7% of total parameters doing nothing but mapping tokens to vectors. For a 1B model, it would be 50%. This is why smaller models tend to use smaller vocabularies.
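
The arithmetic behind those numbers, assuming a single tied embedding table (an untied output projection would roughly double the count):

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    # One row of hidden_dim weights per vocabulary entry.
    return vocab_size * hidden_dim

table = embedding_params(128_000, 4096)   # 128K vocab, 4096-dim embeddings
print(f"embedding table: {table / 1e6:.0f}M parameters")
print(f"share of a 7B model: {table / 7e9:.1%}")
print(f"share of a 1B model: {table / 1e9:.1%}")
```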

Multilingual Vocabulary

A vocabulary's language coverage depends on its training corpus. Llama's early tokenizer was trained predominantly on English and represented Chinese characters as 3–4 tokens each, making Chinese inference 3–4x more expensive than English. Llama 3's tokenizer was trained on more balanced multilingual data, dramatically improving non-English efficiency. This is a solvable problem, but it requires deliberate effort — the default is English-dominant.
