
Pre-training

The initial, massive training phase where a model learns language (or other modalities) from a huge corpus. This is the expensive part — thousands of GPUs running for weeks or months, costing millions of dollars. The result is a foundation model that understands language but hasn't been specialized for any task yet.

Why it matters

Pre-training is what makes foundation models possible. It's also why only a handful of companies can create frontier models — the compute costs are astronomical. Everything else (fine-tuning, RLHF, prompting) builds on this base.

Deep Dive

The dominant pre-training objective for language models is next-token prediction: given a sequence of tokens, predict what comes next. The model processes trillions of tokens from the training corpus; for each position, it computes a probability distribution over the entire vocabulary and is penalized (via cross-entropy loss) for assigning low probability to the actual next token. This deceptively simple objective turns out to be extraordinarily powerful: to predict the next word well in diverse contexts, the model must implicitly learn grammar, facts, reasoning patterns, coding conventions, and much more. The loss starts high (with a vocabulary of 32,000 to 128,000 tokens, random guessing gives a cross-entropy near ln(V), roughly 10 to 12 nats) and gradually decreases as the model internalizes the statistical structure of language.

For transformer-based models, this is the standard recipe. Alternative architectures such as state-space models (Mamba, RWKV) use the same objective but replace the attention mechanism with recurrent state updates, achieving comparable quality with better computational scaling on long sequences.
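The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not any framework's actual loss implementation: real training code uses fused, batched GPU kernels, but the math is the same.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy loss for next-token prediction.

    logits:  (seq_len, vocab_size) model scores for each position's next token
    targets: (seq_len,) index of the actual next token at each position
    """
    # Numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # The penalty is the negative log-probability assigned to the true next token
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 32_000
rng = np.random.default_rng(0)
logits = rng.standard_normal((8, vocab_size))   # untrained model: random scores
targets = rng.integers(0, vocab_size, size=8)
loss = next_token_loss(logits, targets)          # near ln(32000) ≈ 10.4
```

With random logits the loss sits near ln(vocab_size), matching the "essentially random guessing" starting point described above; a model that concentrates probability on the correct token drives it toward zero.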

At Scale

The scale of modern pre-training is staggering and has been doubling roughly every 6-9 months. GPT-3 (2020) trained on 300 billion tokens. LLaMA 2 (2023) used 2 trillion. LLaMA 3 (2024) used over 15 trillion. The compute is measured in floating-point operations, and a frontier pre-training run might require 10^25 FLOPs — a number that translates to thousands of GPUs running for months and costs tens of millions of dollars in electricity and hardware alone. The training is distributed across GPUs using techniques like data parallelism (each GPU processes different data batches), tensor parallelism (each layer's computation is split across GPUs), and pipeline parallelism (different layers live on different GPUs). Frameworks like Megatron-LM, DeepSpeed, and FSDP (PyTorch's Fully Sharded Data Parallel) handle the complexity of keeping thousands of GPUs synchronized, but failures are common — hardware errors, network issues, and numerical instabilities mean that large training runs require robust checkpointing and automatic recovery.
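The 10^25 figure can be sanity-checked with the standard back-of-envelope rule that dense transformer training costs about 6 FLOPs per parameter per token (roughly 2 for the forward pass and 4 for the backward pass). The model size below is an illustrative assumption, not a reference to any specific run.

```python
def pretrain_flops(n_params: float, n_tokens: float) -> float:
    # Common approximation for dense transformers:
    # ~2 FLOPs/param/token forward + ~4 backward ≈ 6 total
    return 6 * n_params * n_tokens

# Hypothetical example: a 70B-parameter model trained on 15 trillion tokens
flops = pretrain_flops(70e9, 15e12)
print(f"{flops:.1e}")  # 6.3e+24, on the order of the 10^25 figure above
```

Dividing such a total by realistic per-GPU throughput (a fraction of peak FLOPs, since utilization is never 100%) is how the "thousands of GPUs for months" estimates are derived.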

The Training Recipe

Not all pre-training is created equal, and the details of the training recipe matter as much as the data and compute. The learning rate schedule is typically a warmup phase (linearly increasing the learning rate over the first few thousand steps) followed by a cosine decay to near zero. Batch size often increases during training — starting small for more frequent, noisier gradient updates and growing larger for more stable later-stage training. The sequence length (how many tokens the model sees at once) has a major impact on what the model learns: longer sequences let it capture longer-range dependencies but cost quadratically more memory for attention-based models. Many teams now use progressive sequence length training, starting with shorter contexts and increasing to the full context window later. The optimizer is almost universally AdamW, though newer approaches like SOAP and Muon are gaining traction for their potentially better convergence properties.
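The warmup-then-cosine schedule described above is simple enough to write out directly. The specific learning rates and step counts here are illustrative defaults, not a recommendation.

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5,
               warmup_steps=2_000, total_steps=100_000):
    """Linear warmup followed by cosine decay (illustrative hyperparameters)."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to max_lr over the first steps
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay: glide from max_lr down to min_lr over the rest of training
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    return min_lr + (max_lr - min_lr) * cosine
```

Warmup avoids destabilizing the randomly initialized network with large early updates; the slow cosine tail lets the model settle into a low-loss region rather than bouncing around it.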

The Multi-Stage Pipeline

Pre-training is not a monolithic single phase anymore. Modern training pipelines often include multiple stages with different data mixes. The main pre-training phase uses a broad corpus, then a "mid-training" or "continued pre-training" phase uses a higher-quality or more domain-specific data mix, sometimes with longer context lengths. This is how models learn to handle long documents effectively — training on 128K-token sequences from the start would be prohibitively expensive, but a short fine-tuning phase on long-context data at the end works surprisingly well. After pre-training comes supervised fine-tuning (SFT) on instruction data, then alignment via RLHF or DPO. Each stage builds on the last, and the boundaries between them are increasingly blurred. What used to be a clean three-step pipeline (pre-train, SFT, RLHF) is now a multi-stage curriculum with distinct data mixes, learning rates, and objectives at each phase.
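A multi-stage curriculum like this is often expressed as a list of stage configs. The stage names, token budgets, context lengths, and data-mix weights below are entirely hypothetical, meant only to show the shape of such a recipe:

```python
# Hypothetical training curriculum; numbers and mixes are illustrative,
# not any lab's actual recipe.
stages = [
    {"name": "pretrain",     "tokens": 14e12,  "seq_len": 8_192,
     "mix": {"web": 0.7, "code": 0.2, "books": 0.1}},
    {"name": "mid_train",    "tokens": 1e12,   "seq_len": 8_192,
     "mix": {"curated": 0.6, "code": 0.3, "math": 0.1}},
    {"name": "long_context", "tokens": 0.1e12, "seq_len": 131_072,
     "mix": {"long_docs": 0.8, "curated": 0.2}},
]

for stage in stages:
    # Each stage's data-mix weights should sum to 1
    assert abs(sum(stage["mix"].values()) - 1.0) < 1e-9
    print(f'{stage["name"]}: {stage["tokens"]/1e12:.1f}T tokens '
          f'@ {stage["seq_len"]}-token context')
```

Note how the expensive long-context stage gets a tiny fraction of the total token budget, reflecting the observation above that a short late phase on long sequences works surprisingly well.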
