Chapter 35: Key Takeaways

1. Data Parallelism Is the Simplest and Most Widely Used Strategy

DDP replicates the full model on every GPU and shards the data across them. After each backward pass, gradients are synchronized via all-reduce. With NCCL and proper gradient bucketing, DDP achieves 85-95% scaling efficiency and should be the first approach to try whenever the model fits in a single GPU's memory.
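
The sketch below shows a minimal single-node DDP training step under a few assumptions: it is launched with torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables), and the model and data are placeholders rather than anything from this chapter.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE before this runs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced during backward
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(32, 1024, device="cuda")     # each rank would see its own data shard
    loss = model(x).pow(2).mean()
    loss.backward()                              # bucketed all-reduce overlaps with compute
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, for example, torchrun --nproc_per_node=8 train.py, each process drives one GPU; a DistributedSampler (omitted here) would give each rank a distinct slice of the dataset.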

2. Communication Primitives Underpin All Distributed Training

All-reduce, all-gather, reduce-scatter, and broadcast are the fundamental building blocks. Ring all-reduce is bandwidth-optimal: each GPU sends and receives only 2(N-1)/N times the data size, nearly independent of the number of GPUs N. Understanding these operations is essential for diagnosing performance bottlenecks and designing custom parallelism strategies.
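
As a concrete illustration, the sketch below exercises the four primitives through torch.distributed. It assumes an already-initialized NCCL process group (for example via torchrun) and uses a small tensor filled with the rank index so the results are easy to predict.

```python
import torch
import torch.distributed as dist

def demo_primitives(rank: int, world_size: int, device: torch.device):
    t = torch.full((4,), float(rank), device=device)

    # all-reduce: every rank ends up with the elementwise sum over all ranks.
    summed = t.clone()
    dist.all_reduce(summed, op=dist.ReduceOp.SUM)

    # all-gather: every rank ends up with a copy of every rank's tensor.
    gathered = [torch.empty_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)

    # reduce-scatter: rank i receives the sum of every rank's i-th input tensor.
    inputs = [t.clone() for _ in range(world_size)]
    shard = torch.empty_like(t)
    dist.reduce_scatter(shard, inputs, op=dist.ReduceOp.SUM)

    # broadcast: rank 0's tensor overwrites the tensor on every other rank.
    dist.broadcast(t, src=0)
    return summed, gathered, shard, t
```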

3. Learning Rate Scaling and Warmup Are Critical for Large Batch Training

When data parallelism increases the effective batch size by a factor k, the learning rate should be multiplied by k (linear scaling rule) with a warmup period to avoid early training instability. Without proper scaling, large-batch training significantly degrades model quality.
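
A minimal sketch of the linear scaling rule with warmup is shown below; the base learning rate, the factor k = 8, and the 500-step warmup are illustrative values, not recommendations from this chapter.

```python
import torch

def scaled_lr(base_lr: float, k: int) -> float:
    # Linear scaling rule: the effective batch grows by k, so the LR does too.
    return base_lr * k

def linear_warmup(optimizer, warmup_steps: int):
    # Ramp the LR from near zero up to the scaled value, then hold it constant.
    def lr_lambda(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(0.1, k=8))
scheduler = linear_warmup(optimizer, warmup_steps=500)
# In the training loop: optimizer.step(); scheduler.step()
```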

4. Training Memory Has Four Components

Total training memory equals parameters + gradients + optimizer states + activations. For a model with P parameters trained with AdamW in float32, the non-activation memory is 16P bytes: 4P for weights, 4P for gradients, and 8P for AdamW's two moment estimates. Understanding this decomposition is essential for choosing the right parallelism strategy.
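
The helper below simply restates that arithmetic in code; the 7-billion-parameter example is illustrative.

```python
def non_activation_memory_bytes(num_params: int) -> int:
    """Float32 AdamW training: weights + gradients + two moment estimates."""
    weights = 4 * num_params
    gradients = 4 * num_params
    adamw_states = 8 * num_params   # first and second moments, both float32
    return weights + gradients + adamw_states  # = 16 * num_params

# Example: a 7-billion-parameter model needs ~112 GB before activations.
print(non_activation_memory_bytes(7_000_000_000) / 1e9)  # 112.0
```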

5. FSDP and ZeRO Shard Everything Across GPUs

Fully Sharded Data Parallel (FSDP) and ZeRO partition parameters, gradients, and optimizer states across GPUs, enabling training of models much larger than a single GPU's memory. Parameters are gathered (all-gather) just before computation and resharded immediately afterward, trading extra communication for memory.
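
A minimal sketch of wrapping a model with PyTorch's FSDP is shown below; it assumes a torchrun launch and a toy model, and it omits wrapping policies and other tuning options.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Parameters, gradients, and optimizer states are sharded across ranks; each
# forward/backward all-gathers the parameters it needs and reshards afterward.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```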

6. Mixed Precision Training Provides Speed and Memory Benefits

Training in float16 or bfloat16 (with critical operations in float32) approximately halves memory usage and doubles compute throughput on modern GPUs. BFloat16 is preferred for its larger dynamic range, eliminating the need for loss scaling required by float16.
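
The sketch below shows bfloat16 autocast on a placeholder model; unlike float16, no gradient scaler is needed because bfloat16 shares float32's exponent range.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    # Matrix multiplies run in bfloat16; numerically sensitive ops stay in float32.
    loss = model(x).pow(2).mean()

loss.backward()        # parameters and their gradients remain float32
optimizer.step()
optimizer.zero_grad()
```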

7. Gradient Accumulation Simulates Larger Batch Sizes

When the desired batch size exceeds GPU memory capacity, gradient accumulation performs N forward-backward passes before each optimizer step, simulating a batch size N times larger. This is essential when combining data parallelism with memory-constrained settings.
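
A minimal sketch of the accumulation loop follows; dividing the loss by the number of accumulation steps keeps the accumulated gradient equal to that of one large batch. The model, data, and accumulation count are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8   # effective batch size = micro-batch size * accum_steps

micro_batches = torch.randn(64, 32, 1024, device="cuda")  # 64 micro-batches of 32 samples
for step, x in enumerate(micro_batches):
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()                       # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```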

8. Pipeline Parallelism Assigns Layers to Different GPUs

Pipeline parallelism splits the model depth-wise, assigning groups of layers to different GPUs. Micro-batching reduces the pipeline bubble (the idle time while the pipeline fills and drains), but cannot eliminate it entirely. It is primarily used in combination with data and tensor parallelism for the largest models.
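
The idle fraction can be quantified with the standard GPipe-style estimate: with p pipeline stages and m micro-batches, the bubble occupies (p - 1) / (m + p - 1) of each step. The helper below just evaluates that expression.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # GPipe-style schedule: (p - 1) / (m + p - 1) of the step is idle time.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# Example: 4 stages fed with 16 micro-batches leave roughly 16% of the pipeline idle.
print(round(pipeline_bubble_fraction(4, 16), 3))  # 0.158
```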

9. Tensor Parallelism Splits Individual Layers

Tensor parallelism partitions weight matrices within a single layer across GPUs, enabling individual layers that are too large for one GPU. Because activations must be exchanged inside every parallelized layer, it requires high-bandwidth, low-latency interconnects and is therefore typically confined to GPUs within a node connected by NVLink.
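
The single-process sketch below mimics a Megatron-style column-parallel linear layer numerically: the weight's output dimension is split into four shards, each shard produces a partial output, and concatenating the partials (the stand-in for an all-gather) reproduces the full result. The shapes are arbitrary examples.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 1024)          # activations, replicated on every shard
w = torch.randn(4096, 1024)       # full weight of a Linear(1024 -> 4096)

shards = torch.chunk(w, chunks=4, dim=0)          # 4-way split along the output dimension
partials = [x @ shard.t() for shard in shards]    # each "GPU" computes its own output slice
y_parallel = torch.cat(partials, dim=-1)          # stand-in for the all-gather

y_full = x @ w.t()
print(torch.allclose(y_parallel, y_full, atol=1e-5))  # True
```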

10. Practical Distributed Training Requires Systematic Engineering

Beyond the algorithms, production distributed training demands deterministic data loading with proper sharding, robust checkpoint saving and loading, fault tolerance and elastic scaling, profiling and bottleneck identification, and cost-aware resource allocation. Libraries such as Hugging Face Accelerate and DeepSpeed handle much of this plumbing.
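
As one illustration, the sketch below uses Hugging Face Accelerate with a toy model and synthetic data; the same script runs on one GPU or many when started with accelerate launch, and prepare() takes care of device placement, data sharding, and model wrapping.

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()
model = torch.nn.Linear(1024, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 1024), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

# Handles device placement, data sharding, and DDP/FSDP wrapping in one call.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)    # used instead of loss.backward() so any scaling is applied
    optimizer.step()
    optimizer.zero_grad()

accelerator.save_state("checkpoint_dir")   # model, optimizer, and RNG states
```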