Chapter 35: Exercises -- Distributed Training and Scaling
Section 1: Communication Primitives
Exercise 35.1: Collective Operations
Implement all-reduce, all-gather, and reduce-scatter operations using only point-to-point send/receive primitives. Verify correctness by comparing with torch.distributed implementations.
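A possible warm-up, assuming a process group has already been initialized (for instance via torchrun with the gloo backend): a naive all-reduce that funnels everything through rank 0 using only send/recv. It is a correctness baseline, not an efficient scheme.

```python
import torch
import torch.distributed as dist

def naive_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Sum-all-reduce built from send/recv only: gather on rank 0, then fan out."""
    rank, world = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        total = tensor.clone()
        buf = torch.empty_like(tensor)
        for src in range(1, world):
            dist.recv(buf, src=src)      # collect each rank's contribution
            total += buf
        for dst in range(1, world):
            dist.send(total, dst=dst)    # push the reduced result back out
        tensor.copy_(total)
    else:
        dist.send(tensor, dst=0)
        dist.recv(tensor, src=0)
    return tensor
```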
Exercise 35.2: Ring All-Reduce
Implement the ring all-reduce algorithm for N processes. Measure the per-process communication volume and compare it with the theoretical optimum of 2(N-1)/N times the data size.
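A minimal sketch of the ring schedule, assuming an initialized process group and a 1-D tensor whose length is divisible by the world size:

```python
import torch
import torch.distributed as dist

def ring_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Assumes a 1-D tensor with numel divisible by the world size."""
    rank, world = dist.get_rank(), dist.get_world_size()
    chunks = list(tensor.chunk(world))            # views into `tensor`
    left, right = (rank - 1) % world, (rank + 1) % world
    recv_buf = torch.empty_like(chunks[0])

    # Reduce-scatter: after world-1 steps, chunk (rank+1) % world holds the full sum.
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        req = dist.isend(chunks[send_idx].clone(), dst=right)
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx] += recv_buf

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(world - 1):
        send_idx = (rank + 1 - step) % world
        recv_idx = (rank - step) % world
        req = dist.isend(chunks[send_idx].clone(), dst=right)
        dist.recv(recv_buf, src=left)
        req.wait()
        chunks[recv_idx].copy_(recv_buf)

    return tensor  # reduced in place via the chunk views
```

Each rank sends (and receives) 2(N-1) chunks of size D/N, i.e. 2(N-1)/N times the data size, which is the figure to compare your measurements against.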
Exercise 35.3: Communication Bandwidth
Measure the effective bandwidth of all-reduce for varying tensor sizes (1KB to 1GB) on your system. Plot bandwidth vs message size and identify the latency-dominated and bandwidth-dominated regimes.
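A timing harness along the lines below is sufficient; it assumes an initialized process group and reports the conventional "bus bandwidth", which charges each rank 2(N-1)/N times the message size.

```python
import time
import torch
import torch.distributed as dist

def allreduce_bandwidth(numel: int, iters: int = 20) -> float:
    """Effective bus bandwidth in GB/s for one message size."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(numel, device=device)
    for _ in range(3):                     # warm-up excludes one-time setup cost
        dist.all_reduce(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    world = dist.get_world_size()
    bytes_per_rank = 2 * (world - 1) / world * x.numel() * x.element_size()
    return bytes_per_rank / elapsed / 1e9
```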
Section 2: Data Parallelism
Exercise 35.4: DDP from Scratch
Implement simplified data-parallel training using only torch.distributed.all_reduce (no DistributedDataParallel wrapper). Verify that gradients are correctly synchronized across processes.
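The heart of the exercise is a few lines: after loss.backward(), average every gradient across ranks before optimizer.step(). A sketch, assuming an initialized process group and identical initial parameters on every rank (e.g. broadcast from rank 0):

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across ranks -- the core of what DistributedDataParallel automates."""
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)
```

Call it between loss.backward() and optimizer.step(); if every rank starts from the same parameters and sees the same averaged gradients, the replicas stay bit-for-bit synchronized.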
Exercise 35.5: DDP Scaling Efficiency
Train a ResNet-18 model using DDP with 1, 2, and 4 GPUs (or simulated processes). Measure throughput (samples/second) for each configuration and compute the scaling efficiency.
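One common definition of scaling efficiency is measured throughput divided by the ideal linear-scaling throughput:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    # e.g. 3400 samples/s on 4 GPUs vs 1000 samples/s on 1 GPU -> 0.85
    return throughput_n / (n_gpus * throughput_1)
```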
Exercise 35.6: Gradient Accumulation
Implement gradient accumulation to simulate a batch size of 256 using an actual batch size of 32. Compare the training curves and final accuracy with and without gradient accumulation.
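A minimal sketch with a stand-in model and synthetic data; the important detail is dividing each micro-batch loss by the number of accumulation steps so the accumulated gradient matches the large-batch gradient.

```python
import torch
from torch import nn

# Toy setup purely for illustration; swap in your real model, DataLoader, and loss.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 16), torch.randint(0, 2, (32,))) for _ in range(64)]

accum_steps = 256 // 32   # 8 micro-batches of 32 emulate an effective batch of 256
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps   # scale so summed grads match the big batch
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```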
Exercise 35.7: Learning Rate Scaling
Implement the linear scaling rule (lr = base_lr * world_size) with learning rate warmup. Compare training with and without the scaling rule when using 4x the original batch size.
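One way to combine the two pieces, using a LambdaLR warmup on top of the scaled base learning rate (the numbers are placeholders):

```python
import torch

base_lr, world_size, warmup_steps = 0.1, 4, 500          # assumed values for illustration
params = [torch.nn.Parameter(torch.zeros(10))]            # placeholder parameters
optimizer = torch.optim.SGD(params, lr=base_lr * world_size)   # linear scaling rule

def warmup(step: int) -> float:
    # ramp linearly from ~0 up to the scaled learning rate over the first warmup_steps updates
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)
# call scheduler.step() after every optimizer.step()
```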
Exercise 35.8: DDP Communication Overlap
Demonstrate how DDP overlaps gradient communication with backward computation. Measure the wall-clock time with and without communication overlap for a multi-layer model.
Section 3: Model Parallelism
Exercise 35.9: Pipeline Parallelism
Implement simple pipeline parallelism by splitting a 4-layer model across 2 devices. Demonstrate the pipeline bubble problem and compute the pipeline efficiency for 1, 2, 4, and 8 micro-batches.
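For the efficiency calculation, the usual bubble accounting gives m / (m + p - 1) for p pipeline stages and m micro-batches:

```python
# GPipe-style bubble model: useful work fills m of the m + p - 1 pipeline slots.
p = 2
for m in (1, 2, 4, 8):
    print(f"{m} micro-batches: efficiency = {m / (m + p - 1):.2f}")
# 1 -> 0.50, 2 -> 0.67, 4 -> 0.80, 8 -> 0.89
```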
Exercise 35.10: Tensor Parallelism
Implement column-parallel and row-parallel linear layers for tensor parallelism. Verify that the combined output matches a single large linear layer.
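A single-process sketch of the math to verify: split the first weight by output features (column-parallel) and the second by input features (row-parallel), so each shard produces a partial output and the sum, which an all-reduce provides in the distributed version, matches the unsharded layers.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_in, d_hidden, n_shards = 8, 16, 2
full1 = nn.Linear(d_in, d_hidden, bias=False)
full2 = nn.Linear(d_hidden, d_in, bias=False)

# Column-parallel: split the first weight along its output dimension.
w1_shards = full1.weight.chunk(n_shards, dim=0)
# Row-parallel: split the second weight along its input dimension.
w2_shards = full2.weight.chunk(n_shards, dim=1)

x = torch.randn(4, d_in)
partials = [
    (x @ w1.T) @ w2.T                 # each "rank" computes its local partial output
    for w1, w2 in zip(w1_shards, w2_shards)
]
tp_out = sum(partials)                 # stands in for the all-reduce across ranks
ref = full2(full1(x))
print(torch.allclose(tp_out, ref, atol=1e-6))   # True
```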
Exercise 35.11: Activation Checkpointing
Implement gradient checkpointing for a 12-layer transformer. Measure memory savings and compute overhead compared to standard backpropagation.
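A sketch using torch.utils.checkpoint, with a plain MLP standing in for the transformer blocks:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# 12 blocks; activations inside each checkpointed block are freed after forward
# and recomputed during backward, trading compute for memory.
layers = nn.ModuleList(nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(12))
x = torch.randn(8, 256)

h = x
for layer in layers:
    h = checkpoint(layer, h, use_reentrant=False)
h.sum().backward()
```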
Section 4: FSDP and ZeRO
Exercise 35.12: ZeRO Stage 1
Implement ZeRO Stage 1 (optimizer state partitioning) for AdamW. Verify that each process stores only 1/N of the optimizer states and that training is equivalent to standard training.
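A simplified sketch that partitions whole parameter tensors round-robin (real ZeRO shards flattened buffers at the element level, but the accounting works the same way); it assumes gradients have already been averaged across ranks:

```python
import torch
import torch.distributed as dist

def build_zero1_optimizer(model: torch.nn.Module, lr: float = 1e-3):
    """Each rank builds AdamW states only for the parameter tensors it owns."""
    rank, world = dist.get_rank(), dist.get_world_size()
    params = list(model.parameters())
    owned = params[rank::world]                  # parameter i is owned by rank i % world
    return torch.optim.AdamW(owned, lr=lr), params

def zero1_step(optimizer, params, world):
    optimizer.step()                             # updates only this rank's shard
    optimizer.zero_grad(set_to_none=True)
    for i, p in enumerate(params):
        dist.broadcast(p.data, src=i % world)    # owners publish their updated parameters
```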
Exercise 35.13: FSDP Wrapping Strategies
Compare different FSDP wrapping strategies: per-layer wrapping, per-block wrapping, and full model wrapping. Measure memory usage and communication volume for each.
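A starting point for the per-block case, assuming an initialized process group; omitting auto_wrap_policy gives full-model wrapping, and wrapping each layer explicitly gives the per-layer case.

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def wrap_blockwise(model: torch.nn.Module) -> FSDP:
    # Any submodule with >= 1M parameters becomes its own FSDP unit (per-block wrapping).
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
    return FSDP(model, auto_wrap_policy=policy)
```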
Exercise 35.14: FSDP with Mixed Precision
Configure FSDP with mixed precision (compute in float16, reduce in float32, parameters in float16). Measure the memory savings and throughput improvement compared to full float32 training.
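The configuration named in the exercise maps directly onto FSDP's MixedPrecision policy; a sketch, to be attached when constructing the FSDP module:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.float16,     # sharded params are gathered and used in fp16
    reduce_dtype=torch.float32,    # gradients are reduce-scattered in fp32
    buffer_dtype=torch.float16,
)
# fsdp_model = FSDP(model, mixed_precision=mp_policy)
```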
Exercise 35.15: Memory Profiling
Profile the GPU memory usage during training of a transformer model. Break down memory into: parameters, gradients, optimizer states, and activations. Verify the memory formula from the chapter.
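A quick check of the static part of the breakdown, using the common fp32 + AdamW accounting of 4 B weights + 4 B gradients + 8 B optimizer states per parameter (requires a GPU; the gap between the two numbers is dominated by activations):

```python
import torch
from torch import nn

model = nn.Linear(4096, 4096, bias=False).cuda()
opt = torch.optim.AdamW(model.parameters())
out = model(torch.randn(64, 4096, device="cuda"))
out.sum().backward()
opt.step()                                        # materializes exp_avg / exp_avg_sq
torch.cuda.synchronize()

n_params = sum(p.numel() for p in model.parameters())
print(f"expected static: {n_params * 16 / 2**20:.1f} MiB")   # 4 + 4 + 8 bytes per parameter
print(f"allocated:       {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
```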
Section 5: Mixed Precision Training
Exercise 35.16: AMP Training
Implement mixed precision training using torch.cuda.amp.autocast and GradScaler. Compare training speed and memory usage with full precision training.
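The canonical loop, sketched with a toy model (requires a CUDA device):

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(x), y)        # matmuls run in float16 here
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                   # unscales grads; skips the step on overflow
    scaler.update()
```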
Exercise 35.17: Loss Scaling Analysis
Experiment with different loss scaling strategies: fixed scale, dynamic scaling with various growth intervals. Measure how often overflow occurs and its impact on convergence.
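GradScaler exposes the knobs to sweep; the values shown match the usual defaults, and watching scaler.get_scale() across steps reveals overflow events, since the scale drops by backoff_factor whenever the previous step overflowed.

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0**16,       # starting loss scale
    growth_factor=2.0,        # multiply the scale after growth_interval clean steps
    backoff_factor=0.5,       # shrink the scale when an inf/nan gradient is detected
    growth_interval=2000,     # steps without overflow before growing
)
```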
Exercise 35.18: BFloat16 vs Float16
Compare BF16 and FP16 training for a transformer model. Measure numerical stability (gradient overflow frequency) and model quality for both formats.
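The trade-off is visible directly in the formats' limits: BF16 keeps FP32's exponent range (so overflows are rare and loss scaling is usually unnecessary) but has fewer mantissa bits than FP16.

```python
import torch
print(torch.finfo(torch.float16).max)    # 65504.0  -- narrow range, usually needs loss scaling
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 -- fp32-like range, 7-bit vs 10-bit mantissa
```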
Section 6: Practical Distributed Training
Exercise 35.19: Multi-Node Training
Write a training script that works across multiple nodes using torchrun. Handle environment-variable configuration and checkpoint saving/loading, and fail gracefully on errors.
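torchrun sets the rendezvous environment variables for every worker, so the script mainly needs to read them; a sketch (the launch command uses placeholder addresses):

```python
import os
import torch
import torch.distributed as dist

def setup():
    # torchrun exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return dist.get_rank(), local_rank, dist.get_world_size()

# Launch on 2 nodes x 8 GPUs (hypothetical endpoint):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 train.py
```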
Exercise 35.20: HuggingFace Accelerate
Rewrite a single-GPU training script to use HuggingFace Accelerate. Demonstrate launching with different configurations: single GPU, multi-GPU DDP, and FSDP.
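A minimal port, sketched with a toy model and dataset; the same script is then launched with `accelerate launch` under whichever configuration `accelerate config` produced (single GPU, DDP, or FSDP).

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,))),
    batch_size=32,
)
# prepare() wraps the objects for the launch configuration and moves them to the right device
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)   # replaces loss.backward() so scaling/FSDP hooks apply
    optimizer.step()
```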
Exercise 35.21: Checkpoint Management
Implement distributed checkpoint saving and loading for FSDP. Handle the case where the number of GPUs changes between saving and loading.
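One approach in recent PyTorch releases is the torch.distributed.checkpoint (DCP) API, which saves sharded state and reshards on load when the world size changes. Treat the exact call names below as assumptions to verify against your installed version.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, path):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=path)

def load_checkpoint(model, optimizer, path):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=path)        # reshards to the current world size
    set_state_dict(model, optimizer,
                   model_state_dict=state["model"], optim_state_dict=state["optim"])
```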
Exercise 35.22: Cost Optimization
Given a training job that takes 24 hours on 8 A100 GPUs, analyze the cost trade-offs of: (a) 16 A100s for 12 hours, (b) 32 A10Gs for 48 hours, (c) spot instances with checkpointing. Consider both dollar cost and total time.
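A back-of-the-envelope framework; the prices and the spot-restart overhead below are placeholders to replace with your provider's actual rates and your measured preemption behavior.

```python
# GPU-hours = gpus * hours; cost = GPU-hours * hourly rate.
PRICE = {"A100": 4.00, "A10G": 1.20, "A100_spot": 1.50}     # $/GPU-hour, hypothetical

def cost(gpus: int, hours: float, rate: float) -> float:
    return gpus * hours * rate

baseline = cost(8, 24, PRICE["A100"])                        # 192 A100-hours
print("baseline 8x A100, 24 h:", baseline)
print("(a) 16x A100, 12 h:", cost(16, 12, PRICE["A100"]))    # same GPU-hours if scaling is perfect
print("(b) 32x A10G, 48 h:", cost(32, 48, PRICE["A10G"]))    # cheaper GPUs, 8x the GPU-hours
print("(c) 8x A100 spot:", cost(8, 24 * 1.1, PRICE["A100_spot"]))  # assumed 10% restart overhead
```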
Exercise 35.23: Debugging Distributed Training
Implement diagnostic tools for distributed training: (a) gradient norm monitoring per rank, (b) loss synchronization verification, (c) communication timing breakdown.
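A sketch of part (a), gathering each rank's gradient norm so a diverging worker stands out; parts (b) and (c) follow the same all_gather pattern with loss values and timer readings.

```python
import torch
import torch.distributed as dist

def report_grad_norms(model: torch.nn.Module) -> None:
    """Collect every rank's total gradient norm on rank 0."""
    local = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    )
    norms = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(norms, local)
    if dist.get_rank() == 0:
        print("grad norm per rank:", [f"{n.item():.4f}" for n in norms])
```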
Exercise 35.24: End-to-End Distributed Fine-Tuning
Fine-tune a pre-trained language model using FSDP with mixed precision, gradient accumulation, and cosine learning rate scheduling. Train on at least 2 GPUs and verify convergence.