
Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a single model across multiple GPUs or machines simultaneously. Data parallelism gives each GPU a copy of the model and splits the training data between them. Model parallelism splits the model itself across GPUs when it is too large for one. Modern approaches such as FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, enabling the training of models with hundreds of billions of parameters.

Why It Matters

No frontier model fits on a single GPU. Training GPT-4 or Claude requires thousands of GPUs working together for months. Distributed training is the engineering that makes this possible; it is as critical as the architecture or the data. How efficiently you distribute training directly determines how large a model you can train for a given budget.

Deep Dive

Data parallelism (DP): each GPU has a full model copy, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU. Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU. Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
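A minimal data-parallel sketch using PyTorch DistributedDataParallel. The model, data, and hyperparameters here are placeholders, assuming one process per GPU launched with something like `torchrun --nproc_per_node=8 train.py`:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK/WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = torch.nn.Linear(1024, 1024).to(device)   # stand-in for a real model
    model = DDP(model, device_ids=[device.index])    # full copy on every GPU
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank would normally read a different shard of the dataset
        # (e.g. via DistributedSampler); random data keeps the sketch short.
        x = torch.randn(32, 1024, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()          # DDP averages gradients across GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```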

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
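A rough FSDP sketch under the same multi-process setup as above; the toy model stands in for a real network, which would typically use an auto-wrap policy so each transformer block becomes its own shard unit:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(          # stand-in for a real model
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # Parameters, gradients, and optimizer state are sharded across ranks
    # (roughly ZeRO Stage 3); full parameters are gathered for a layer only
    # for the duration of its forward/backward pass, then freed.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # built after wrapping

    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```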

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model/pipeline parallelism). This communication happens over NVLink (within a node, 900 GB/s) or InfiniBand (between nodes, 400 Gb/s). Training efficiency drops when GPUs spend more time waiting for communication than computing. Optimal configurations minimize cross-node communication by keeping tightly-coupled operations (like tensor parallelism) within a node and loosely-coupled operations (like data parallelism) across nodes.
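A back-of-the-envelope sketch of why this matters (illustrative figures only, ignoring latency, overlap, and topology): the same gradient all-reduce is roughly an order of magnitude slower over the cross-node fabric than over NVLink.

```python
# A ring all-reduce moves about 2*(N-1)/N of the gradient bytes over each
# GPU's link, so time scales with data size / link bandwidth.

def allreduce_seconds(data_bytes: float, n_gpus: int, link_bytes_per_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * data_bytes
    return traffic / link_bytes_per_s

grad_bytes = 7e9 * 2      # 7B parameters, bf16 gradients: ~14 GB per step
nvlink = 900e9            # ~900 GB/s within a node
infiniband = 400e9 / 8    # 400 Gb/s ≈ 50 GB/s between nodes

print(f"intra-node (8 GPUs):  {allreduce_seconds(grad_bytes, 8, nvlink) * 1e3:.0f} ms")
print(f"cross-node (64 GPUs): {allreduce_seconds(grad_bytes, 64, infiniband) * 1e3:.0f} ms")
# ~27 ms vs ~550 ms: keep chatty parallelism (TP) inside the node and
# quieter parallelism (DP) across nodes.
```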
