Zubnet AILearnWiki › Distributed Training
Infrastructure

Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a model across multiple GPUs or machines simultaneously. Data parallelism gives each GPU a copy of the model and splits the training data. Model parallelism splits the model itself across GPUs when it's too large for one. Modern approaches like FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, enabling training of models with hundreds of billions of parameters.

Why it matters

No frontier model fits on a single GPU. Training GPT-4 or Claude requires thousands of GPUs working together for months. Distributed training is the engineering that makes this possible — it's as critical as the architecture or the data. The efficiency of your distributed training directly determines how much model you can train for a given budget.

Deep Dive

Data parallelism (DP): each GPU holds a full copy of the model, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU.

Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU.

Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
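The gradient-averaging step at the heart of data parallelism can be sketched in plain Python. This is an illustrative simulation of what an all-reduce achieves, not real multi-GPU code (frameworks like PyTorch do this with `torch.distributed.all_reduce`); the gradient values and the 4-GPU setup are made up for the example.

```python
# Plain-Python sketch of the gradient-averaging step in data parallelism.
# Each "GPU" computes gradients on its own mini-batch; an all-reduce then
# averages them elementwise so every replica applies the identical update.

def all_reduce_mean(per_gpu_grads):
    """Average gradients elementwise across all replicas."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n
            for i in range(len(per_gpu_grads[0]))]

# Hypothetical gradients from 4 GPUs for a tiny 3-parameter model
grads = [
    [0.4, -0.2, 1.0],
    [0.0,  0.2, 0.6],
    [0.8, -0.6, 1.4],
    [0.4, -0.2, 1.0],
]
avg = all_reduce_mean(grads)  # every GPU now applies this same gradient
```

Because every replica applies the same averaged gradient, the model copies stay bit-for-bit synchronized without ever exchanging the weights themselves.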

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
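A back-of-envelope calculation shows why the ZeRO stages matter. The sketch below uses the common mixed-precision Adam accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus Adam momentum and variance); the 7B-parameter model and 8-GPU cluster are assumptions for illustration, and activations and buffers are ignored.

```python
# Back-of-envelope per-GPU memory for DeepSpeed ZeRO stages, assuming
# mixed-precision Adam: 2 bytes/param (fp16 weights), 2 bytes/param
# (fp16 grads), 12 bytes/param (fp32 master weights + momentum + variance).
# Activations, buffers, and fragmentation are ignored.

def zero_memory_gb(n_params, n_gpus, stage):
    P, G, O = 2 * n_params, 2 * n_params, 12 * n_params  # bytes
    if stage >= 1:
        O /= n_gpus          # Stage 1: shard optimizer states
    if stage >= 2:
        G /= n_gpus          # Stage 2: also shard gradients
    if stage >= 3:
        P /= n_gpus          # Stage 3: also shard parameters
    return (P + G + O) / 1e9

# A hypothetical 7B-parameter model trained on 8 GPUs:
for s in range(4):
    print(f"ZeRO stage {s}: {zero_memory_gb(7e9, 8, s):.1f} GB per GPU")
```

Under these assumptions the per-GPU footprint drops from 112 GB with no sharding to 14 GB at Stage 3, which is the difference between impossible and comfortable on an 80 GB GPU.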

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model and pipeline parallelism). This traffic travels over NVLink within a node (roughly 900 GB/s) or InfiniBand between nodes (400 Gb/s, i.e. about 50 GB/s, more than an order of magnitude slower). Training efficiency drops when GPUs spend more time waiting on communication than computing. Optimal configurations therefore keep tightly coupled operations (like tensor parallelism) within a node and loosely coupled operations (like data parallelism) across nodes, minimizing cross-node traffic.
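The bandwidth gap above can be put into numbers. This sketch estimates the time to all-reduce the gradients of an assumed 7B-parameter model in fp16 over each link, using the common ring all-reduce approximation that each GPU moves roughly 2x the gradient volume; latency and compute/communication overlap are ignored.

```python
# Rough all-reduce time for the fp16 gradients of a hypothetical
# 7B-parameter model over the two interconnects discussed above.
# Ring all-reduce moves ~2x the data per GPU; latency and overlap
# with computation are ignored in this estimate.

def allreduce_seconds(grad_bytes, bandwidth_bytes_per_s):
    return 2 * grad_bytes / bandwidth_bytes_per_s

grad_bytes = 7e9 * 2                 # 7B params at 2 bytes each (fp16)
nvlink = 900e9                       # ~900 GB/s within a node
infiniband = 400e9 / 8               # 400 Gb/s = ~50 GB/s between nodes

print(f"NVLink:     {allreduce_seconds(grad_bytes, nvlink) * 1000:.0f} ms")
print(f"InfiniBand: {allreduce_seconds(grad_bytes, infiniband) * 1000:.0f} ms")
```

The same synchronization that takes tens of milliseconds over NVLink takes over half a second over InfiniBand, which is why tensor parallelism (which communicates every layer) is kept inside a node while data parallelism (which communicates once per step) is allowed to span nodes.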
