Infrastructure

Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a model across multiple GPUs or machines simultaneously. Data parallelism gives each GPU a copy of the model and splits the training data between them. Model parallelism splits the model itself across GPUs when it is too large for one. Modern approaches such as FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, enabling the training of models with hundreds of billions of parameters.
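A minimal sketch of data parallelism with PyTorch DistributedDataParallel, assuming a torchrun launch with one process per GPU; MyModel and make_dataloader are placeholder names, not part of this article.

```python
# Data parallelism sketch: every GPU holds a full model copy and sees a
# different shard of the data; gradients are averaged during backward().
# Assumes launch with `torchrun --nproc_per_node=<gpus> train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = MyModel().cuda(local_rank)                   # placeholder model, full copy on each GPU
    model = DDP(model, device_ids=[local_rank])          # hooks gradient all-reduce into backward()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for batch in make_dataloader(rank=dist.get_rank()):  # placeholder: each rank reads its own shard
        loss = model(batch.cuda(local_rank)).mean()
        loss.backward()                                  # gradients averaged across all GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```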

Why it matters

No frontier model fits on a single GPU. Training GPT-4 or Claude requires thousands of GPUs working together for months. Distributed training is the engineering that makes this possible; it is as critical as the architecture or the data. The efficiency of your distributed training directly determines how large a model you can train on a given budget.

Deep Dive

Data parallelism (DP): each GPU has a full model copy, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU. Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU. Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
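To make the tensor-parallel idea concrete, here is a sketch of a column-parallel linear layer: each rank stores only a slice of the weight matrix, computes a partial matmul, and an all-gather reassembles the full output. This illustrates the forward pass only; real implementations (e.g. Megatron-LM) use autograd-aware collectives and schedule the communication more carefully.

```python
# Tensor parallelism sketch: the output dimension of one linear layer is split
# across ranks; each GPU computes its columns, then results are all-gathered.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # This rank stores only out_features / world_size output columns.
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)                        # partial matmul on this GPU
        parts = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_out)                # exchange partial outputs (forward only)
        return torch.cat(parts, dim=-1)                  # full activation on every rank
```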

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
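A minimal FSDP sketch along these lines, assuming a recent PyTorch and a torchrun launch; MyTransformer and TransformerBlock are placeholder names for the model and the layer class used as the sharding unit.

```python
# FSDP sketch: parameters, gradients, and optimizer states are sharded across
# ranks; each block's full weights are gathered only while that block runs.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = MyTransformer()                                  # placeholder model
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},            # placeholder layer class: one shard unit per block
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The training loop itself looks the same as in data parallelism: forward, backward, step.
```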

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model/pipeline parallelism). This communication happens over NVLink (within a node, 900 GB/s) or InfiniBand (between nodes, 400 Gb/s). Training efficiency drops when GPUs spend more time waiting for communication than computing. Optimal configurations minimize cross-node communication by keeping tightly-coupled operations (like tensor parallelism) within a node and loosely-coupled operations (like data parallelism) across nodes.
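One way to express that layout in code is a 2-D device mesh: the inner dimension (tensor parallelism) stays within a node on NVLink, while the outer dimension (data parallelism) crosses nodes over InfiniBand. This is a sketch assuming PyTorch's init_device_mesh (available in recent releases) and a hypothetical cluster of 4 nodes with 8 GPUs each.

```python
# Topology-aware layout sketch: keep tensor parallelism on NVLink within a node,
# let data parallelism cross InfiniBand between nodes. Cluster shape is assumed.
from torch.distributed.device_mesh import init_device_mesh

num_nodes = 4          # assumption, not from the article
gpus_per_node = 8      # assumption, not from the article

mesh = init_device_mesh(
    "cuda",
    mesh_shape=(num_nodes, gpus_per_node),    # (data-parallel dim, tensor-parallel dim)
    mesh_dim_names=("dp", "tp"),
)
# mesh["tp"]: the 8 GPUs inside one node -> heavy per-layer activation traffic stays on NVLink.
# mesh["dp"]: one GPU from each node     -> the gradient all-reduce crosses InfiniBand once per step.
```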
