Chapter 35: Exercises -- Distributed Training and Scaling

Section 1: Communication Primitives

Exercise 35.1: Collective Operations

Implement all-reduce, all-gather, and reduce-scatter operations using only point-to-point send/receive primitives. Verify correctness by comparing with torch.distributed implementations.
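
A minimal sketch of how such a collective might be assembled from blocking send/receive calls, assuming the process group is already initialized (e.g. via torchrun). Rank 0 gathers, sums, and redistributes; real implementations use ring or tree schedules instead.

```python
# A naive all-reduce from blocking point-to-point calls: rank 0 collects,
# sums, and redistributes. Assumes dist.init_process_group has already run.
import torch
import torch.distributed as dist

def naive_all_reduce(tensor: torch.Tensor) -> None:
    rank, world = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        buf = torch.empty_like(tensor)
        for src in range(1, world):       # gather contributions one by one
            dist.recv(buf, src=src)
            tensor += buf
        for dst in range(1, world):       # send the reduced result back out
            dist.send(tensor, dst=dst)
    else:
        dist.send(tensor, dst=0)
        dist.recv(tensor, src=0)
```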

Exercise 35.2: Ring All-Reduce

Implement the ring all-reduce algorithm for N processes. Measure the per-process communication volume and compare it with the theoretical optimum of 2(N-1)/N times the data size (each process sends N-1 chunks of size D/N during the reduce-scatter phase and another N-1 during the all-gather phase).
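
A sketch of the ring schedule, assuming the tensor length is divisible by the world size and the process group is initialized; each step pairs one send and one receive with dist.batch_isend_irecv.

```python
# Ring all-reduce sketch. Chunks are views into `tensor`, so the reduced
# result lands in place.
import torch
import torch.distributed as dist

def ring_all_reduce(tensor: torch.Tensor) -> None:
    rank, world = dist.get_rank(), dist.get_world_size()
    left, right = (rank - 1) % world, (rank + 1) % world
    chunks = list(tensor.chunk(world))
    recv_buf = torch.empty_like(chunks[0])

    def exchange(send_chunk):
        # Pair the send and receive so neither side blocks the other.
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, send_chunk.contiguous(), right),
            dist.P2POp(dist.irecv, recv_buf, left),
        ])
        for r in reqs:
            r.wait()

    # Reduce-scatter: after world-1 steps, chunk (rank+1) % world is fully summed.
    for step in range(world - 1):
        exchange(chunks[(rank - step) % world])
        chunks[(rank - step - 1) % world] += recv_buf

    # All-gather: circulate the reduced chunks until every rank has all of them.
    for step in range(world - 1):
        exchange(chunks[(rank - step + 1) % world])
        chunks[(rank - step) % world].copy_(recv_buf)
```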

Exercise 35.3: Communication Bandwidth

Measure the effective bandwidth of all-reduce for varying tensor sizes (1KB to 1GB) on your system. Plot bandwidth vs message size and identify the latency-dominated and bandwidth-dominated regimes.
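
A rough sweep sketch, assuming an initialized process group and one CUDA device per rank; it reports the simple bytes/time ratio (NCCL's "bus bandwidth" additionally multiplies by 2(N-1)/N).

```python
# Bandwidth sweep over message sizes from 1 KiB to 1 GiB.
import time
import torch
import torch.distributed as dist

def bandwidth_sweep(warmup: int = 5, iters: int = 20) -> None:
    device = torch.device("cuda", torch.cuda.current_device())
    for p in range(10, 31, 2):                        # 2^10 B .. 2^30 B
        nbytes = 2 ** p
        x = torch.randn(nbytes // 4, device=device)   # float32 elements
        for _ in range(warmup):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters
        if dist.get_rank() == 0:
            print(f"{nbytes:>12d} B  {nbytes / elapsed / 1e9:7.2f} GB/s  {elapsed * 1e6:9.1f} us")
```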

Section 2: Data Parallelism

Exercise 35.4: DDP from Scratch

Implement simplified data-parallel training using only torch.distributed.all_reduce (no DistributedDataParallel wrapper). Verify that gradients are correctly synchronized across processes.
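
A minimal data-parallel step without the DDP wrapper, assuming the process group is initialized and each rank already loads its own shard of the data.

```python
# One data-parallel training step with manual gradient averaging.
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum across ranks
            p.grad /= world                                  # then average
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```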

Exercise 35.5: DDP Scaling Efficiency

Train a ResNet-18 model using DDP with 1, 2, and 4 GPUs (or simulated processes). Measure throughput (samples/second) for each configuration and compute the scaling efficiency: throughput with N workers divided by N times the single-worker throughput.

Exercise 35.6: Gradient Accumulation

Implement gradient accumulation to simulate a batch size of 256 using an actual batch size of 32. Compare the training curves and final accuracy with and without gradient accumulation.
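
A sketch of the accumulation loop, with model, loader, loss_fn, and optimizer passed in from an otherwise standard training setup.

```python
# Gradient accumulation: with per-step batches of 32 and accum_steps=8, each
# optimizer update reflects an effective batch of 256.
def train_with_accumulation(model, loader, loss_fn, optimizer, accum_steps=8):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets) / accum_steps  # scale so grads average, not sum
        loss.backward()                                       # accumulates into .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```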

Exercise 35.7: Learning Rate Scaling

Implement the linear scaling rule (lr = base_lr * world_size) with learning rate warmup. Compare training with and without the scaling rule when using 4x the original batch size.
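
One way to sketch the warmup, using torch.optim.lr_scheduler.LambdaLR; the optimizer choice and hyperparameters here are illustrative.

```python
# Linear scaling rule plus warmup: the optimizer is built with the scaled
# learning rate and the scheduler ramps it up from zero over warmup_steps.
import torch

def scaled_optimizer_and_warmup(model, base_lr, world_size, warmup_steps=500):
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=base_lr * world_size, momentum=0.9)
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)   # linear warmup, then hold
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler   # call scheduler.step() once per iteration
```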

Exercise 35.8: DDP Communication Overlap

Demonstrate how DDP overlaps gradient communication with backward computation. Measure the wall-clock time with and without communication overlap for a multi-layer model.
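
A CUDA timing helper sketch for the comparison; step_fn is assumed to run one forward/backward/synchronization iteration. The manual all-reduce loop from Exercise 35.4 approximates the no-overlap case, while DistributedDataParallel overlaps bucketed all-reduces with backward by default.

```python
# Average per-iteration wall-clock time for a given step function.
import time
import torch

def time_steps(step_fn, iters: int = 50, warmup: int = 10) -> float:
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```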

Section 3: Model Parallelism

Exercise 35.9: Pipeline Parallelism

Implement simple pipeline parallelism by splitting a 4-layer model across 2 devices. Demonstrate the pipeline bubble problem and compute the pipeline efficiency for 1, 2, 4, and 8 micro-batches per mini-batch.
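
Under the common idealization that all stages take equal time, the bubble-free fraction of a pipeline with p stages and m micro-batches is m / (m + p - 1), as in this small check.

```python
# Ideal pipeline efficiency with p stages and m micro-batches: m / (m + p - 1).
def pipeline_efficiency(num_stages: int, num_microbatches: int) -> float:
    return num_microbatches / (num_microbatches + num_stages - 1)

for m in (1, 2, 4, 8):
    print(f"2 stages, {m:>2} micro-batches: {pipeline_efficiency(2, m):.2f}")
# -> 0.50, 0.67, 0.80, 0.89
```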

Exercise 35.10: Tensor Parallelism

Implement column-parallel and row-parallel linear layers for tensor parallelism. Verify that the combined output matches a single large linear layer.
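
The decomposition can be sanity-checked in a single process before adding any communication; the two-shard split below stands in for two tensor-parallel ranks.

```python
# Single-process check that column- and row-parallel splits reproduce Y = X @ W.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)          # (batch, in_features)
W = torch.randn(8, 6)          # (in_features, out_features)
reference = X @ W

# Column parallel: each shard owns half of the output features; concatenate.
col_out = torch.cat([X @ w for w in W.chunk(2, dim=1)], dim=1)

# Row parallel: each shard owns half of the input features; partial products
# are summed (the role played by all-reduce across tensor-parallel ranks).
row_out = sum(x @ w for x, w in zip(X.chunk(2, dim=1), W.chunk(2, dim=0)))

assert torch.allclose(reference, col_out, atol=1e-5)
assert torch.allclose(reference, row_out, atol=1e-5)
```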

Exercise 35.11: Activation Checkpointing

Implement gradient checkpointing for a 12-layer transformer. Measure memory savings and compute overhead compared to standard backpropagation.
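
A sketch using torch.utils.checkpoint, which recomputes each block's activations during backward instead of storing them.

```python
# Wrap each layer in a checkpoint call so its activations are recomputed
# during backward rather than kept in memory.
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    def __init__(self, layers: torch.nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x, use_reentrant=False)
        return x
```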

Section 4: FSDP and ZeRO

Exercise 35.12: ZeRO Stage 1

Implement ZeRO Stage 1 (optimizer state partitioning) for AdamW. Verify that each process stores only 1/N of the optimizer states and that the resulting parameter updates match those of standard, non-partitioned data-parallel training.
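
A coarse sketch of the idea, assuming an initialized process group; it assigns whole parameter tensors to ranks round-robin rather than flattening and sharding them as real ZeRO implementations do.

```python
# Per-parameter ZeRO-1 sketch: gradients are averaged everywhere, but AdamW
# states exist only on the rank that owns each parameter.
import torch
import torch.distributed as dist

def build_sharded_optimizer(model, lr=1e-3):
    rank, world = dist.get_rank(), dist.get_world_size()
    params = list(model.parameters())
    owned = [p for i, p in enumerate(params) if i % world == rank]
    return torch.optim.AdamW(owned, lr=lr), params

def sharded_step(optimizer, params):
    world = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()                        # updates only locally owned params
    for i, p in enumerate(params):          # owners publish updated values
        dist.broadcast(p.data, src=i % world)
    for p in params:
        p.grad = None
```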

Exercise 35.13: FSDP Wrapping Strategies

Compare different FSDP wrapping strategies: per-layer wrapping, per-block wrapping, and full model wrapping. Measure memory usage and communication volume for each.
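
A wrapping sketch using the auto-wrap policies shipped with recent PyTorch releases; block_cls stands for whatever module class defines one "block" of your model.

```python
# Per-block wrapping via an auto-wrap policy vs. full-model wrapping.
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def wrap_per_block(model, block_cls):
    policy = functools.partial(transformer_auto_wrap_policy,
                               transformer_layer_cls={block_cls})
    return FSDP(model, auto_wrap_policy=policy)   # one FSDP unit per block

def wrap_full_model(model):
    return FSDP(model)                            # a single shard/unshard unit
```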

Exercise 35.14: FSDP with Mixed Precision

Configure FSDP with mixed precision (compute in float16, reduce in float32, parameters in float16). Measure the memory savings and throughput improvement compared to full float32 training.
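
A configuration sketch using FSDP's MixedPrecision policy, with dtypes following the split described in the exercise.

```python
# FSDP mixed-precision policy: fp16 parameters and compute, fp32 gradient
# reduction.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.float16,     # gathered parameters / forward compute
    reduce_dtype=torch.float32,    # gradient reduce-scatter accumulates in fp32
    buffer_dtype=torch.float16,
)
# model = FSDP(model, mixed_precision=mp_policy)
```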

Exercise 35.15: Memory Profiling

Profile the GPU memory usage during training of a transformer model. Break down memory into: parameters, gradients, optimizer states, and activations. Verify the memory formula from the chapter.
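
A rough accounting sketch, assuming plain float32 training with AdamW (4 bytes each for parameters and gradients plus 8 bytes for the two Adam moments per parameter); activations are whatever remains in the measured peak.

```python
# Static memory estimate vs. measured usage for fp32 AdamW training.
import torch

def memory_report(model: torch.nn.Module) -> None:
    n = sum(p.numel() for p in model.parameters())
    static = n * (4 + 4 + 8)     # fp32 params + grads + two Adam moments
    print(f"params: {n / 1e6:.1f}M, static estimate: {static / 2**30:.2f} GiB")
    print(f"allocated now: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"peak (incl. activations): {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```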

Section 5: Mixed Precision Training

Exercise 35.16: AMP Training

Implement mixed precision training using torch.cuda.amp.autocast and GradScaler. Compare training speed and memory usage with full precision training.
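
The standard pattern, sketched as a single step function.

```python
# One mixed-precision training step with autocast and dynamic loss scaling.
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, batch, loss_fn, optimizer):
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads; skips the step on overflow
    scaler.update()                 # grows/shrinks the scale factor
    return loss.detach()
```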

Exercise 35.17: Loss Scaling Analysis

Experiment with different loss scaling strategies: a fixed scale and dynamic scaling with various growth intervals. Measure how often overflow occurs and its impact on convergence.
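
The relevant GradScaler constructor arguments, with illustrative values; overflow events show up as drops in scaler.get_scale() and as skipped optimizer steps.

```python
# Dynamic loss scaling knobs (values shown are illustrative).
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,    # starting loss scale
    growth_factor=2.0,       # multiply after growth_interval overflow-free steps
    backoff_factor=0.5,      # shrink when an inf/nan gradient is detected
    growth_interval=2000,    # steps between growth attempts
)
```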

Exercise 35.18: BFloat16 vs Float16

Compare BF16 and FP16 training for a transformer model. Measure numerical stability (gradient overflow frequency) and model quality for both formats.

Section 6: Practical Distributed Training

Exercise 35.19: Multi-Node Training

Write a training script that works across multiple nodes using torchrun. Handle environment variable configuration, checkpoint saving/loading, and graceful error handling.
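
A setup sketch for the torchrun case: the launcher exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so init_process_group can read them from the environment; the launch command shown is illustrative.

```python
# Process-group setup under torchrun, which exports the rendezvous variables.
import os
import torch
import torch.distributed as dist

def setup():
    dist.init_process_group(backend="nccl")       # reads RANK/WORLD_SIZE/MASTER_* from env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), local_rank, dist.get_world_size()

# Example launch for 2 nodes with 4 GPUs each (run on every node):
#   torchrun --nnodes=2 --nproc-per-node=4 \
#            --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:29500 train.py
```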

Exercise 35.20: HuggingFace Accelerate

Rewrite a single-GPU training script to use HuggingFace Accelerate. Demonstrate launching with different configurations: single GPU, multi-GPU DDP, and FSDP.
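
A sketch of the Accelerate version of the loop; the same script is then driven by `accelerate config` and `accelerate launch` for each configuration.

```python
# Accelerate-based loop: prepare() wraps the model, optimizer, and dataloader
# for whatever backend `accelerate launch` was configured with.
from accelerate import Accelerator

def train(model, optimizer, loader, loss_fn, epochs=1):
    accelerator = Accelerator()
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            accelerator.backward(loss)    # replaces loss.backward()
            optimizer.step()
```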

Exercise 35.21: Checkpoint Management

Implement distributed checkpoint saving and loading for FSDP. Handle the case where the number of GPUs changes between saving and loading.
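
One hedged approach to the resharding case is to gather a full state dict on rank 0 at save time so it can be reloaded under any GPU count; the APIs below exist in recent PyTorch releases but have changed across versions, so treat this as a sketch.

```python
# Gather a full state dict on rank 0 so the checkpoint is independent of the
# sharding layout; reload it after re-wrapping the model on the new world size.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP, StateDictType, FullStateDictConfig)

def save_full_checkpoint(model: FSDP, path: str) -> None:
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```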

Exercise 35.22: Cost Optimization

Given a training job that takes 24 hours on 8 A100 GPUs, analyze the cost trade-offs of: (a) 16 A100s for 12 hours, (b) 32 A10Gs for 48 hours, (c) spot instances with checkpointing. Consider both dollar cost and total time.
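
A small GPU-hour scaffold with a hypothetical $4/GPU-hour rate; substitute real prices and a realistic A10G/spot discount.

```python
# GPU-hour arithmetic with placeholder prices.
def job_cost(num_gpus: int, hours: float, price_per_gpu_hour: float) -> float:
    return num_gpus * hours * price_per_gpu_hour

baseline = job_cost(8, 24, 4.00)     # 192 GPU-hours -> $768 at $4/GPU-h
option_a = job_cost(16, 12, 4.00)    # same GPU-hours, half the wall-clock time
# Options (b) and (c) need the A10G hourly rate, the spot discount, and an
# estimate of time lost to preemptions and checkpoint restarts.
```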

Exercise 35.23: Debugging Distributed Training

Implement diagnostic tools for distributed training: (a) gradient norm monitoring per rank, (b) loss synchronization verification, (c) communication timing breakdown.
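
A sketch for item (a), assuming an initialized process group; each rank computes its local global gradient norm and an all-gather makes every rank's value visible.

```python
# Item (a): per-rank global gradient norm, made visible to every rank.
import torch
import torch.distributed as dist

def gradient_norms_per_rank(model):
    local = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return [g.item() for g in gathered]   # diverging values flag a sync problem
```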

Exercise 35.24: End-to-End Distributed Fine-Tuning

Fine-tune a pre-trained language model using FSDP with mixed precision, gradient accumulation, and cosine learning rate scheduling. Train on at least 2 GPUs and verify convergence.