Basics

Vocabulary

Vocab, Token Vocabulary
The fixed set of tokens a model can recognize and produce. The vocabulary is built by the tokenizer at training time and typically contains 32K to 128K entries: common words, subword fragments, individual characters, and special tokens. Any text the model processes must be representable as a sequence of tokens from this vocabulary; anything not in the vocabulary is broken down into smaller pieces that are.
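A minimal sketch of what this looks like in practice, using OpenAI's tiktoken library as a stand-in (an assumption; any BPE tokenizer behaves the same way, and cl100k_base is just a convenient ~100K-entry vocabulary):

```python
# Sketch: inspecting a real vocabulary with tiktoken (assumed installed: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A common word is usually a single token; a rarer word gets split into subword pieces.
for word in ["the", "function", "uncommonness"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")

# Round-trip: any text the model sees is just a sequence of vocabulary IDs.
text = "Vocabulary size shapes everything downstream."
ids = enc.encode(text)
assert enc.decode(ids) == text
print(f"{len(text)} characters -> {len(ids)} tokens")
```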

Why It Matters

The vocabulary determines what the model can "see." A vocabulary trained mostly on English handles English efficiently (roughly one token per word) but may split Chinese, Arabic, or code into many small tokens (more expensive, slower, and less fits in context). Vocabulary design is one of the most consequential and least discussed decisions in model development.

Deep Dive

Building a vocabulary: the tokenizer algorithm (usually BPE) starts with individual bytes or characters and iteratively merges the most frequent pairs. After 32K–128K merges, you have a vocabulary where common words are single tokens ("the," "and," "function") and rare words are split into subword pieces ("un" + "common," "pre" + "process" + "ing"). Special tokens like <BOS> (beginning of sequence), <EOS> (end), and <PAD> (padding) are added explicitly.
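The merge loop itself is short. Below is a toy training sketch, not any production tokenizer's actual code, that learns merges from a four-word corpus just to show the mechanics; real tokenizers such as SentencePiece or tiktoken operate on bytes over much larger corpora:

```python
# Toy BPE training: repeatedly fuse the most frequent adjacent symbol pair into a new vocab entry.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from characters; each word is a sequence of single-character symbols.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new vocabulary entry
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe(["lower", "lowest", "newer", "newest"], num_merges=5))
# ('w', 'e') appears in every word, so it is merged first; rarer pairs follow.
```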

The Size Trade-off

Larger vocabularies compress text better (fewer tokens per sentence = cheaper, fits more in context) but increase the model's embedding table size. A 128K vocabulary with 4096-dimensional embeddings adds ~500M parameters just for the token tables. For a 7B model, that's 7% of total parameters doing nothing but mapping tokens to vectors. For a 1B model, it would be 50%. This is why smaller models tend to use smaller vocabularies.
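The arithmetic behind those percentages is a one-liner, sketched below with the figures from the paragraph; the count covers only the input embedding table (an untied output head would double it):

```python
# Back-of-envelope: embedding-table cost of a vocabulary.
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    # One row of `hidden_dim` values per vocabulary entry (input embeddings only;
    # an untied LM head would add the same amount again).
    return vocab_size * hidden_dim

table = embedding_params(vocab_size=128_000, hidden_dim=4096)
print(f"embedding table: {table / 1e6:.0f}M parameters")   # ~524M
print(f"share of a 7B model: {table / 7e9:.0%}")            # ~7%
print(f"share of a 1B model: {table / 1e9:.0%}")            # ~52%
```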

Multilingual Vocabulary

A vocabulary's language coverage depends on its training corpus. Llama's early tokenizer was trained predominantly on English and represented Chinese characters as 3–4 tokens each, making Chinese inference 3–4x more expensive than English. Llama 3's tokenizer was trained on more balanced multilingual data, dramatically improving non-English efficiency. This is a solvable problem, but it requires deliberate effort — the default is English-dominant.
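A hedged sketch of how you could measure this yourself with Hugging Face transformers; the checkpoint names are assumptions (both are gated and require accepting Meta's license), and any two locally available tokenizers can be swapped in:

```python
# Sketch: comparing per-language tokenization efficiency between two tokenizers.
# Assumes `transformers` is installed and you have access to the gated Llama checkpoints.
from transformers import AutoTokenizer

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Chinese": "敏捷的棕色狐狸跳过了懒狗。",
}

for name in ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        n = len(tok.encode(text, add_special_tokens=False))
        # Tokens per character: lower means cheaper inference and more room in context.
        print(f"{name} | {lang}: {n} tokens ({n / len(text):.2f} tokens/char)")
```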
