Infrastructure

Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a single model on multiple GPUs or machines at once. Data parallelism gives each GPU a copy of the model and splits the training data between them. Model parallelism splits the model itself across GPUs when it is too large for a single GPU. Modern approaches such as FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, making it possible to train models with hundreds of billions of parameters.

Why It Matters

No frontier model fits on a single GPU. Training GPT-4 or Claude takes thousands of GPUs working together for months. Distributed training is the engineering that makes this possible, and it matters as much as the architecture or the data. How efficiently you distribute training directly determines how large a model you can train on a given budget.

Deep Dive

Data parallelism (DP): each GPU has a full model copy, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU.

Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU.

Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
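
As a concrete sketch of the data-parallel case, the minimal PyTorch DistributedDataParallel (DDP) loop below gives every rank a full model copy while a DistributedSampler hands each rank a different slice of the data; gradients are averaged automatically during backward(). The toy model, random dataset, and torchrun launch are illustrative assumptions, not anything prescribed here.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumed launch: torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # one process per GPU

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    model = DDP(model, device_ids=[local_rank])  # full copy on every rank

    # Placeholder dataset; the sampler gives each rank a disjoint shard.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = loss_fn(model(x), y)
        loss.backward()            # DDP all-reduces (averages) gradients here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```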

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
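
A minimal sketch of the FSDP side, assuming PyTorch's FullyShardedDataParallel wrapper with a size-based auto-wrap policy; the transformer model and the one-million-parameter threshold are arbitrary illustrative choices. DeepSpeed ZeRO Stage 3 reaches the same full sharding through a config passed to deepspeed.initialize instead.

```python
# Sketch: shard parameters, gradients, and optimizer state with PyTorch FSDP.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Transformer(d_model=1024, num_encoder_layers=12).cuda()  # placeholder

# Every submodule above ~1M parameters becomes its own FSDP unit: its parameters
# live sharded across ranks and are gathered only for that unit's forward and
# backward pass, then freed again.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)

# Build the optimizer after wrapping so it tracks the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```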

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model/pipeline parallelism). This communication happens over NVLink (within a node, 900 GB/s) or InfiniBand (between nodes, 400 Gb/s). Training efficiency drops when GPUs spend more time waiting for communication than computing. Optimal configurations minimize cross-node communication by keeping tightly-coupled operations (like tensor parallelism) within a node and loosely-coupled operations (like data parallelism) across nodes.
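
To see why topology matters, here is a rough back-of-envelope calculation (an illustration, not a benchmark) using the bandwidth figures above; the 70B-parameter model size, bf16 gradients, and 8-GPU ring are assumed purely for the example.

```python
# Estimate one ring all-reduce of the full gradient over NVLink vs. InfiniBand.
PARAMS = 70e9           # assumed 70B-parameter model
BYTES_PER_GRAD = 2      # bf16 gradients
NUM_GPUS = 8

NVLINK_BW = 900e9       # ~900 GB/s within a node
IB_BW = 400e9 / 8       # 400 Gb/s between nodes = 50 GB/s

def ring_allreduce_seconds(num_bytes: float, n_gpus: int, bandwidth: float) -> float:
    # A ring all-reduce moves roughly 2*(n-1)/n of the data over each link.
    return 2 * (n_gpus - 1) / n_gpus * num_bytes / bandwidth

grad_bytes = PARAMS * BYTES_PER_GRAD
print(f"NVLink:     {ring_allreduce_seconds(grad_bytes, NUM_GPUS, NVLINK_BW):.2f} s")  # ~0.27 s
print(f"InfiniBand: {ring_allreduce_seconds(grad_bytes, NUM_GPUS, IB_BW):.2f} s")      # ~4.9 s
```

The roughly 18x gap is the whole argument for keeping bandwidth-hungry collectives such as tensor parallelism inside a node and reserving the slower inter-node links for the less frequent data-parallel gradient sync.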
