Chapter 26: Key Takeaways

  1. Data parallelism (DDP) is the right default for distributed training; reach for model parallelism only when the model does not fit on a single GPU. DDP replicates the full model on every GPU, partitions the data, and synchronizes gradients via ring all-reduce — a communication primitive whose per-GPU bandwidth cost approaches a constant as the number of GPUs grows. This simplicity makes DDP the most debuggable, most portable, and most efficient strategy for models that fit in single-GPU memory. StreamRec's two-tower model (45M active parameters) trains on the full 1.2-billion-event dataset in under one hour on 4 GPUs with DDP — no model parallelism, pipeline parallelism, or FSDP required. Reserve model parallelism (pipeline, tensor, or FSDP/ZeRO-3) for models whose parameters, optimizer state, and activations genuinely exceed single-GPU memory, as Climate DL's 1.2-billion-parameter transformer does.
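
The bandwidth claim can be checked with one line of arithmetic: in a ring all-reduce, each GPU sends (and receives) $2(N-1)/N \times M$ bytes for an $M$-byte gradient buffer, which approaches $2M$ regardless of GPU count. A minimal sketch, using StreamRec's 45M-parameter figure and assuming FP32 gradients:

```python
def ring_allreduce_bytes_per_gpu(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in a ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

grad_bytes = 45_000_000 * 4  # 45M FP32 gradients = 180 MB
for n in (2, 4, 64, 1024):
    mb = ring_allreduce_bytes_per_gpu(grad_bytes, n) / 1e6
    print(f"{n:5d} GPUs -> {mb:.0f} MB per GPU")
# Per-GPU traffic saturates near 2 x 180 MB = 360 MB, independent of N --
# which is why DDP's communication cost stays flat as the cluster grows.
```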

  2. GPU memory optimization — mixed precision, gradient checkpointing, and FlashAttention — is a prerequisite for large-scale training, not an optional enhancement. BF16 mixed precision delivers a 2-8x throughput improvement by leveraging Tensor Cores, and halves activation memory. Gradient checkpointing reduces activation memory by approximately 85% at the cost of 30-40% extra recomputation time. FlashAttention eliminates the $O(s^2)$ attention matrix from HBM entirely, enabling longer sequences and larger batch sizes. For Climate DL, these three optimizations together reduced per-GPU memory from 71 GB (would not fit) to 34 GB (fits comfortably at batch size 4). Without them, distributed training is impossible — not just slow, but impossible — because the model does not fit on a single device at any useful batch size.
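
The first two optimizations compose in a few lines of PyTorch. The sketch below is a toy, CPU-runnable illustration (the layer stack and `device_type="cpu"` are assumptions for portability; real training would use `device_type="cuda"` and the actual transformer blocks):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack standing in for a transformer's layers.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.GELU())
                         for _ in range(8)])
x = torch.randn(4, 64, requires_grad=True)

# BF16 autocast halves activation memory; on GPU it also engages Tensor Cores.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # checkpoint_sequential stores activations only at segment boundaries
    # and recomputes the rest during the backward pass.
    out = checkpoint_sequential(layers, segments=2, input=x,
                                use_reentrant=False)
out.float().sum().backward()
# Gradients flow through the recomputed activations: x.grad has shape (4, 64).
```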

  3. Communication overhead is the primary scaling bottleneck, and reducing it requires understanding the hardware topology. Intra-node communication via NVLink (600 GB/s on DGX A100) is fast enough that gradient all-reduce adds less than 1% overhead. Inter-node communication via InfiniBand (25 GB/s effective) introduces meaningful overhead — 16.5% for Climate DL at 64 GPUs. The mitigation toolkit includes gradient accumulation (fewer all-reduce operations), DDP's overlap of communication with backward-pass computation, and larger local batch sizes (more compute per all-reduce). Profiling the communication-to-computation ratio is the single most informative diagnostic for distributed training performance.
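
A back-of-the-envelope model makes the NVLink/InfiniBand gap concrete: estimate all-reduce time as ring traffic divided by link bandwidth, then report it as a fraction of step time. The compute time below is an assumed illustrative value, not a measurement, and the model ignores DDP's communication/computation overlap, so it is an upper bound:

```python
def allreduce_overhead(grad_bytes: float, bandwidth_bps: float,
                       compute_s: float, n_gpus: int) -> float:
    """Fraction of step time spent on a non-overlapped ring all-reduce."""
    comm_s = 2 * (n_gpus - 1) / n_gpus * grad_bytes / bandwidth_bps
    return comm_s / (compute_s + comm_s)

grad_bytes = 1_200_000_000 * 2   # 1.2B BF16 gradients, ~2.4 GB
compute_s = 1.0                  # assumed forward+backward time per step
for name, bw in [("NVLink", 600e9), ("InfiniBand", 25e9)]:
    pct = 100 * allreduce_overhead(grad_bytes, bw, compute_s, 64)
    print(f"{name}: {pct:.1f}% overhead")
# The ~600 GB/s vs ~25 GB/s bandwidth gap is what separates sub-1%
# intra-node overhead from double-digit inter-node overhead at 64 GPUs.
```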

  4. Large-batch training requires learning rate scaling, warmup, and layer-wise adaptive optimizers to maintain model quality. Naively increasing batch size degrades generalization because the gradient noise that helps escape sharp minima diminishes. The linear scaling rule ($\eta_{\text{new}} = \eta_{\text{base}} \times B_{\text{new}}/B_{\text{base}}$) compensates, but requires warmup (5-10% of training steps) to avoid early divergence. Beyond the critical batch size, LARS (for SGD) and LAMB (for Adam) provide per-layer learning rate adaptation that maintains convergence. The practical recipe: start with a well-tuned single-GPU baseline, scale batch size gradually, apply linear scaling with warmup, and switch to LAMB if the batch size exceeds approximately 4,096.
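
The linear scaling rule plus warmup fits in a few lines. A minimal sketch, assuming a tuned single-GPU baseline of lr 0.1 at batch 256 scaled to batch 4,096 (decay after warmup is omitted for brevity):

```python
def scaled_lr(step: int, *, base_lr: float, base_batch: int,
              new_batch: int, warmup_steps: int) -> float:
    """Linear-scaling-rule learning rate with linear warmup."""
    peak = base_lr * new_batch / base_batch   # eta_new = eta_base * B_new/B_base
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps   # ramp from near 0 to peak
    return peak

warmup = 500   # e.g. 5% of a 10,000-step run, per the 5-10% rule of thumb
print(scaled_lr(0, base_lr=0.1, base_batch=256, new_batch=4096,
                warmup_steps=warmup))    # tiny LR at step 0 avoids divergence
print(scaled_lr(warmup, base_lr=0.1, base_batch=256, new_batch=4096,
                warmup_steps=warmup))    # peak = 0.1 * 4096/256 = 1.6
```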

  5. Cost management is a first-class engineering concern, not an afterthought. The Climate DL model costs $24,640 per training run on demand — roughly $9 million per year at daily cadence. Spot instances with checkpoint-based fault tolerance reduce the per-run cost to $10,775 (56% savings). BF16 training doubles throughput (halving GPU-hours). Gradient checkpointing enables larger batches that reduce total steps. Combined, these optimizations reduced Climate DL's per-run cost by 7x (from $37,000 FP32 on-demand to $5,400). The TrainingCostEstimator and CheckpointManager classes from this chapter provide the infrastructure to make cost-aware decisions systematically.
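
The core arithmetic behind such an estimator is simple. This sketch is not the chapter's TrainingCostEstimator interface — the GPU count, hours, hourly rate, spot discount, and retry overhead below are assumed values chosen to land near the per-run figures quoted above:

```python
def run_cost(gpu_hours: float, rate_per_gpu_hour: float,
             spot_discount: float = 0.0, retry_overhead: float = 0.0) -> float:
    """Cost of one run; spot applies a discount plus retry/checkpoint overhead."""
    return gpu_hours * rate_per_gpu_hour * (1 - spot_discount) * (1 + retry_overhead)

# Assumed: 64 GPUs x 96 h at $4.01/GPU-hour on demand; spot at a 60% discount
# with ~10% extra GPU-hours for preemption recovery from checkpoints.
on_demand = run_cost(64 * 96, 4.01)
spot = run_cost(64 * 96, 4.01, spot_discount=0.60, retry_overhead=0.10)
print(f"on-demand ${on_demand:,.0f}  spot ${spot:,.0f}  "
      f"savings {100 * (1 - spot / on_demand):.0f}%")
```

Note that the net savings (56%) is smaller than the raw spot discount (60%) precisely because fault tolerance is not free — the retry overhead claws back part of the discount.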

  6. Profile before you parallelize. Both case studies in this chapter — Climate DL and StreamRec — arrived at a simpler parallelism strategy than the team initially planned. Climate DL avoided tensor parallelism because memory optimizations made data parallelism sufficient. StreamRec avoided pipeline parallelism because the model was too small to benefit from it. In each case, profiling the single-GPU memory footprint, computation time, and communication requirements before choosing a strategy saved weeks of implementation time and produced a faster, simpler training pipeline. The 2-hour profiling investment is the highest-ROI engineering activity in distributed training.
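
That profiling investment starts with a script small enough to write in minutes. A CPU-runnable sketch — the `Linear` layer is a stand-in for the real model; on a GPU you would additionally read `torch.cuda.max_memory_allocated()` for the memory footprint:

```python
import time
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 512)            # stand-in batch

def step():
    opt.zero_grad()
    model(x).sum().backward()
    opt.step()

step()                              # warm-up before timing
t0 = time.perf_counter()
n = 20
for _ in range(n):
    step()
step_s = (time.perf_counter() - t0) / n

# Parameter bytes bound the gradient buffer DDP must all-reduce each step.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{step_s * 1e3:.2f} ms/step, {param_bytes / 1e6:.2f} MB of FP32 parameters")
```

With the measured step time and gradient size in hand, the communication-to-computation ratio from takeaway 3 follows directly, and the parallelism decision becomes arithmetic rather than guesswork.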