
Distributed Training

Data Parallelism, Model Parallelism, FSDP
Training a model across multiple GPUs or machines simultaneously. Data parallelism gives each GPU a copy of the model and splits the training data. Model parallelism splits the model itself across GPUs when it is too large for one. Modern approaches such as FSDP (Fully Sharded Data Parallel) and DeepSpeed combine both, enabling the training of models with hundreds of billions of parameters.

Why It Matters

No frontier model fits on a single GPU. Training GPT-4 or Claude requires thousands of GPUs working together for months. Distributed training is the engineering that makes this possible; it is as critical as the architecture or the data. The efficiency of your distributed training directly determines how large a model you can train on a given budget.

Deep Dive

Data parallelism (DP): each GPU has a full model copy, processes a different mini-batch, and gradients are averaged across GPUs. Simple and efficient for models that fit on one GPU. Tensor parallelism (TP): individual layers are split across GPUs, with each GPU computing part of each matrix multiplication. Needed when a single layer's weights don't fit on one GPU. Pipeline parallelism (PP): different layers run on different GPUs, with micro-batches flowing through the pipeline.
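
A minimal data-parallelism sketch using PyTorch's DistributedDataParallel; the model, batch size, and hyperparameters are illustrative, and the script name in the launch comment is hypothetical. Each process owns one GPU and a full model replica, and gradients are averaged across ranks during the backward pass:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; launch with: torchrun --nproc_per_node=<gpus> ddp_demo.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Every rank holds a full copy of the (toy) model.
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[int(os.environ["LOCAL_RANK"])])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank draws a different mini-batch; during backward(), DDP
    # all-reduces the gradients so every replica applies the same update.
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```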

FSDP and DeepSpeed

Fully Sharded Data Parallel (FSDP, from PyTorch) and DeepSpeed ZeRO (from Microsoft) shard model parameters, gradients, and optimizer states across GPUs. Each GPU only stores a fraction of the model, and parameters are gathered on-demand for computation, then released. This enables training models much larger than a single GPU's memory. DeepSpeed ZeRO has three stages: Stage 1 shards optimizer states, Stage 2 adds gradients, Stage 3 adds parameters.
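
A minimal FSDP sketch (toy model and sizes are illustrative; a real setup would also pass an auto-wrap policy so each transformer block becomes its own sharded unit). The wrapped module trains like an ordinary PyTorch model, but each rank stores only its shard:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Each rank stores only ~1/world_size of the parameters, gradients, and
# optimizer states; full weights are all-gathered just before a unit's
# forward/backward pass and freed immediately afterwards.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()   # reduce-scatter leaves each rank with its gradient shard
opt.step()        # the optimizer updates only the local shard

dist.destroy_process_group()
```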

The Communication Bottleneck

The fundamental challenge of distributed training is communication: GPUs must synchronize gradients (in data parallelism) or exchange activations (in model/pipeline parallelism). This communication happens over NVLink (within a node, 900 GB/s) or InfiniBand (between nodes, 400 Gb/s; note the unit, that is about 50 GB/s, roughly 18x slower). Training efficiency drops when GPUs spend more time waiting for communication than computing. Optimal configurations minimize cross-node communication by keeping tightly-coupled operations (like tensor parallelism) within a node and loosely-coupled operations (like data parallelism) across nodes.
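
A sketch of this topology-aware layout using PyTorch's DeviceMesh (available in recent PyTorch releases), assuming a hypothetical cluster of 2 nodes with 8 GPUs each. The fast intra-node dimension carries tensor parallelism; the slow cross-node dimension carries data parallelism:

```python
# Launch with 16 processes, e.g.: torchrun --nnodes=2 --nproc_per_node=8 ...
import os
import torch
from torch.distributed.device_mesh import init_device_mesh

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Arrange the 16 ranks as a 2 x 8 mesh: the "dp" dimension spans nodes
# (InfiniBand), the "tp" dimension spans GPUs within a node (NVLink).
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))

# Per-layer tensor-parallel collectives stay on NVLink; the once-per-step
# gradient all-reduce of data parallelism is what crosses InfiniBand.
tp_group = mesh["tp"].get_group()
dp_group = mesh["dp"].get_group()
```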
