
Corpus

Also known as: Dataset, Training Data
The body of text (or other data) used to train a model. A corpus can range from a curated collection of books and papers to a massive scrape of the entire internet. Its quality and composition fundamentally shape what a model knows and how it behaves.

Why It Matters

Garbage in, garbage out. A model trained on Reddit talks differently from one trained on scientific papers. That is why we curated our own corpus for Sarah: generic web crawls produced confused, incoherent results.

Deep Dive

Building a corpus is deceptively simple in concept and brutally complex in practice. At the most basic level, you gather text, clean it, and feed it to a model. But "cleaning" is where the real work lives. Raw web scrapes contain duplicate pages, boilerplate navigation text, SEO spam, encoding errors, truncated documents, and vast quantities of low-quality machine-generated content. Projects like Common Crawl provide petabytes of raw web data, but turning that into a usable training corpus requires aggressive deduplication (exact and near-duplicate removal), language identification, quality filtering, and content classification. The Pile, RedPajama, FineWeb, and DCLM each represent different philosophies about how to do this filtering, and the quality differences in downstream models are measurable.
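The deduplication step mentioned above can be sketched in its simplest form: exact-duplicate removal by hashing lightly normalized text. This is a minimal illustration, not a production pipeline; real systems such as those behind FineWeb or RedPajama also perform near-duplicate removal (e.g. MinHash over shingles), which this sketch omits.

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates by hashing lightly normalized document text.

    Near-duplicate removal (MinHash/SimHash over shingles) is deliberately
    omitted here; this handles only byte-identical pages after normalization.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello world.", "hello world.", "A different page."]
print(dedup_exact(docs))  # → ['Hello world.', 'A different page.']
```

Exact dedup alone already removes a surprising fraction of raw web scrapes, since mirrors and reposts are byte-identical far more often than one might expect.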

The Data Mix

Corpus composition has a direct, often surprising impact on what a model can do. If 80% of your training data is English, the model will be mediocre at French even if French text is technically present. If your corpus is heavy on code, the model gets better at structured reasoning even for non-code tasks — this was one of the unexpected findings from early Codex training at OpenAI. The ratio of different domains matters too: too much social media text and the model learns to be glib; too much academic text and it becomes stilted. Most frontier labs treat their data mix as a closely guarded secret, because it is one of the few remaining competitive advantages that is not just about having more GPUs.
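One common way domain ratios are operationalized is weighted sampling: each document source gets a mix weight, and training examples are drawn in proportion. The weights below are purely illustrative (real lab mixes are, as noted, unpublished), assuming a hypothetical four-domain split.

```python
import random
from collections import Counter

# Hypothetical domain weights — illustrative only, not any lab's real recipe.
MIX = {"web": 0.55, "code": 0.20, "books": 0.15, "academic": 0.10}

def sample_mix(n_docs, seed=0):
    """Draw document source domains in proportion to the mix weights."""
    rng = random.Random(seed)
    domains = rng.choices(list(MIX), weights=list(MIX.values()), k=n_docs)
    return Counter(domains)

counts = sample_mix(10_000)
# Observed counts track the weights: web > code > books > academic.
```

Changing a single weight here changes what the model sees billions of times over a full training run, which is why these numbers are tuned so carefully.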

From Text to Tokens

Tokenization is the bridge between a raw corpus and what the model actually sees. Before training, every document gets broken into tokens — subword units learned by algorithms like BPE (byte pair encoding) or SentencePiece. The tokenizer is trained on the corpus itself, so a corpus heavy on code will produce a tokenizer that efficiently represents programming constructs, while a multilingual corpus yields a tokenizer with better coverage of non-Latin scripts. This step is usually done once and then frozen: you tokenize the entire corpus into binary shards that can be loaded efficiently during training. For a large corpus, this is itself a multi-day, multi-terabyte operation. A 185-billion-token corpus, for instance, might produce several hundred gigabytes of tokenized shards.
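The "tokenize once, freeze, write binary shards" step described above can be sketched as follows. The tokenizer here is a toy stand-in (raw UTF-8 bytes as token ids); a real pipeline would run a trained BPE or SentencePiece model at that point. Packing ids as `uint16` is why a corpus of hundreds of billions of tokens fits in a few hundred gigabytes of shards, as the text notes: two bytes per token, for any vocabulary under 65,536.

```python
import numpy as np

def write_shard(token_ids, path):
    """Pack token ids into a compact binary shard; uint16 covers vocabs < 65,536."""
    np.asarray(token_ids, dtype=np.uint16).tofile(path)

def read_shard(path):
    """Load a shard back as a flat array of token ids."""
    return np.fromfile(path, dtype=np.uint16)

# Toy stand-in tokenizer: raw UTF-8 bytes as token ids.
# A real pipeline would substitute a trained BPE/SentencePiece encode() here.
text = "def add(a, b): return a + b"
ids = list(text.encode("utf-8"))
write_shard(ids, "shard_00000.bin")
assert read_shard("shard_00000.bin").tolist() == ids
```

During training, shards like these are memory-mapped and sliced into fixed-length sequences, so the expensive text-to-token conversion never has to happen twice.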

Quality vs. Quantity

The curation-versus-scale debate is one of the most important ongoing discussions in the field. For years, the dominant view was that more data is always better — just throw everything in and let the model sort it out. But empirical results have repeatedly shown that a smaller, carefully curated corpus can outperform a much larger noisy one. The Phi series of models from Microsoft demonstrated that high-quality "textbook-like" data could produce surprisingly capable small models. On the other end, the Chinchilla scaling laws showed that most models were trained on too little data relative to their parameter count. The practical lesson: data quality and data quantity are not interchangeable, and the best results come from getting both right.
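The Chinchilla finding mentioned above is often summarized as a rule of thumb of roughly 20 training tokens per parameter (the exact optimum depends on the compute budget). The arithmetic is simple enough to sketch:

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget (~20 tokens/parameter rule of thumb)."""
    return tokens_per_param * n_params

# A 70B-parameter model wants on the order of 1.4 trillion tokens.
print(f"{chinchilla_tokens(70e9):.1e}")  # → 1.4e+12
```

By this yardstick, many pre-Chinchilla models with hundreds of billions of parameters but only a few hundred billion training tokens were badly under-trained, which is the "too little data" point the paragraph makes.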

Related Concepts
