Building a corpus is deceptively simple in concept and brutally complex in practice. At the most basic level, you gather text, clean it, and feed it to a model. But "cleaning" is where the real work lives. Raw web scrapes contain duplicate pages, boilerplate navigation text, SEO spam, encoding errors, truncated documents, and vast quantities of low-quality machine-generated content. Projects like Common Crawl provide petabytes of raw web data, but turning that into a usable training corpus requires aggressive deduplication (exact and near-duplicate removal), language identification, quality filtering, and content classification. The Pile, RedPajama, FineWeb, and DCLM each represent different philosophies about how to do this filtering, and the quality differences in downstream models are measurable.
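The cleaning steps above can be sketched in miniature. This is a toy illustration, not a production pipeline: the normalization, the word-count floor, and the symbol-ratio threshold are all assumed heuristics chosen for the example, and real pipelines add near-duplicate detection (e.g. MinHash), language ID, and learned quality classifiers on top.

```python
import hashlib
import re

def exact_dedup(docs):
    """Drop byte-identical documents by hashing whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        # Collapse whitespace so trivially reformatted copies hash identically
        key = hashlib.md5(re.sub(r"\s+", " ", doc).strip().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def quality_filter(doc, min_words=5, max_symbol_ratio=0.3):
    """Crude heuristic filter: drop very short docs and symbol-heavy spam."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(doc), 1) <= max_symbol_ratio

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # reformatted duplicate
    "BUY NOW!!! $$$ >>> !!!",                         # SEO-spam-like junk
    "ok",                                             # too short to be useful
]
cleaned = [d for d in exact_dedup(docs) if quality_filter(d)]
```

Even this toy version shows the shape of the work: each filter is cheap on its own, but tuning thresholds so they remove junk without discarding good text is where the effort goes.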
Corpus composition has a direct, often surprising impact on what a model can do. If 80% of your training data is English, the model will be mediocre at French even if French text is technically present. If your corpus is heavy on code, the model gets better at structured reasoning even for non-code tasks — this was one of the unexpected findings from early Codex training at OpenAI. The ratio of different domains matters too: too much social media text and the model learns to be glib; too much academic text and it becomes stilted. Most frontier labs treat their data mix as a closely guarded secret, because it is one of the few remaining competitive advantages that is not just about having more GPUs.
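A data mix is usually realized as sampling weights over domain buckets. The sketch below draws training examples according to a target mixture; the specific fractions are illustrative assumptions for the example, not any lab's actual recipe.

```python
import random

# Hypothetical target mix -- fractions are illustrative only
target_mix = {"web": 0.55, "code": 0.20, "academic": 0.15, "social": 0.10}

def sample_domain(mix, rng):
    """Draw one domain according to the target mixture weights."""
    r = rng.random()
    cum = 0.0
    for domain, weight in mix.items():
        cum += weight
        if r < cum:
            return domain
    return domain  # guard against floating-point residue summing to < 1.0

rng = random.Random(0)
counts = {d: 0 for d in target_mix}
for _ in range(10_000):
    counts[sample_domain(target_mix, rng)] += 1
```

After 10,000 draws the empirical counts track the target fractions closely, which is the point: the mix is enforced at sampling time, so upweighting a small high-value domain (like code) is a one-line change that reshapes what the model sees.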
Tokenization is the bridge between a raw corpus and what the model actually sees. Before training, every document gets broken into tokens — subword units learned by algorithms like BPE (byte pair encoding) or SentencePiece. The tokenizer is trained on the corpus itself, so a corpus heavy on code will produce a tokenizer that efficiently represents programming constructs, while a multilingual corpus yields a tokenizer with better coverage of non-Latin scripts. This step is usually done once and then frozen: you tokenize the entire corpus into binary shards that can be loaded efficiently during training. For a large corpus, this is itself a multi-day, multi-terabyte operation. A 185-billion-token corpus, for instance, might produce several hundred gigabytes of tokenized shards.
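The core of BPE training fits in a few lines: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The sketch below runs two merge steps on a tiny assumed word-frequency table; real tokenizer training operates at byte level over billions of words, but the algorithm is the same.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word pre-split into characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(2):  # learn two merges: ('l','o') then ('lo','w')
    words = merge_pair(words, most_frequent_pair(words))
```

After two merges, "low" has collapsed to a single token, which is exactly why a code-heavy corpus yields merges for `def` or `{}` while a multilingual one spends its merge budget on non-Latin scripts. The shard-size arithmetic in the paragraph also checks out: 185 billion tokens stored as 2-byte integers is roughly 370 GB.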
The curation-versus-scale debate is one of the most important ongoing discussions in the field. For years, the dominant view was that more data is always better — just throw everything in and let the model sort it out. But empirical results have repeatedly shown that a smaller, carefully curated corpus can outperform a much larger noisy one. The Phi series of models from Microsoft demonstrated that high-quality "textbook-like" data could produce surprisingly capable small models. At the other end of the spectrum, the Chinchilla scaling laws showed that most models had been trained on too little data relative to their parameter count. The practical lesson: data quality and data quantity are not interchangeable, and the best results come from getting both right.
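The Chinchilla finding is often reduced to a rule of thumb of roughly 20 training tokens per parameter. As a back-of-the-envelope sketch (the 20:1 ratio is an approximation of the paper's result, not an exact law):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget using the ~20 tokens-per-parameter
    heuristic associated with the Chinchilla scaling laws."""
    return n_params * tokens_per_param

# A 7B-parameter model would want roughly 140B training tokens
# under this heuristic.
budget_7b = chinchilla_tokens(7e9)

# GPT-3 (175B parameters) was trained on about 300B tokens -- an
# order of magnitude below what this heuristic would prescribe,
# which is the sense in which it was "undertrained".
budget_175b = chinchilla_tokens(175e9)
```

The arithmetic makes the debate concrete: once parameter counts reach the hundreds of billions, the compute-optimal token budget runs into the trillions, and assembling that much *high-quality* text is exactly where curation and scale collide.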