
Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a single model across multiple GPUs or machines at once. Data parallelism gives each GPU a full copy of the model and splits the training data between them. Model parallelism splits the model itself across GPUs when it is too large for a single GPU. Modern approaches like FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, making it possible to train models with hundreds of billions of parameters.

Why It Matters

No frontier model fits on a single GPU. Training GPT-4 or Claude requires thousands of GPUs working together for months. Distributed training is the engineering that makes this possible, and it is as critical as architecture or data. How efficiently you distribute training directly determines how large a model you can train on a given budget.

Deep Dive

Data parallelism (DP): each GPU has a full model copy, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU. Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU. Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
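As a rough sketch of the simplest case, data parallelism with PyTorch's DistributedDataParallel might look like the following. This is a minimal illustration, not a production recipe: `MyModel` and `dataloader` are hypothetical placeholders, and the script assumes it is launched with one process per GPU (for example via `torchrun`).

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# MyModel and dataloader are placeholders; launch with e.g.
#   torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = MyModel().cuda()                         # hypothetical model class
    model = DDP(model, device_ids=[torch.cuda.current_device()])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for batch in dataloader:                         # each rank sees a different data shard
        optimizer.zero_grad()
        loss = model(batch).mean()                   # assumes the model returns a per-example loss
        loss.backward()                              # DDP averages gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```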

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
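A minimal FSDP sketch, roughly corresponding to ZeRO Stage 3 (parameters, gradients, and optimizer states all sharded), could look like this. `MyModel` and `MyTransformerBlock` are hypothetical names; the auto-wrap policy tells FSDP to shard at the transformer-block level so parameters are gathered and released block by block.

```python
# Minimal FSDP sketch (full sharding of params, grads, optimizer states).
# MyModel and MyTransformerBlock are placeholder names.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},      # shard at the block level
)

model = FSDP(
    MyModel(),                                       # placeholder model
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```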

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model/pipeline parallelism). This communication happens over NVLink (within a node, 900 GB/s) or InfiniBand (between nodes, 400 Gb/s). Training efficiency drops when GPUs spend more time waiting for communication than computing. Optimal configurations minimize cross-node communication by keeping tightly-coupled operations (like tensor parallelism) within a node and loosely-coupled operations (like data parallelism) across nodes.
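One concrete (hedged) example of this principle is FSDP's hybrid sharding strategy, which shards parameters only within a node, where all-gathers run over NVLink, and falls back to data-parallel gradient averaging across nodes over the slower interconnect. `MyModel` is again a placeholder.

```python
# Sketch of keeping heavy communication inside a node with FSDP hybrid sharding:
# parameters are sharded intra-node (NVLink all-gathers) and replicated
# across nodes, with only gradient all-reduces crossing the InfiniBand fabric.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = FSDP(
    MyModel(),                                        # placeholder model
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
    device_id=torch.cuda.current_device(),
)
```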
