Chapter 26: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario.
- Two stars (**): Extend the technique or combine it with a previous chapter's methods.
- Three stars (***): Derive a result, implement from scratch, or design a system component.
- Four stars (****): Research-level problems that connect to open questions in the field.
Data Parallelism and Gradient Synchronization
Exercise 26.1 (*)
You are training a ResNet-50 (25.6M parameters) on ImageNet (1.28M training images) with 8 A100 GPUs using DDP.
(a) If the local batch size per GPU is 256, what is the global batch size? How many gradient synchronization steps are required per epoch?
(b) The gradient tensor for ResNet-50 is approximately 100 MB in FP32. Using the ring all-reduce formula $T_{\text{ring}} = 2(N-1) \cdot \alpha + 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\beta}$, estimate the all-reduce time with $\alpha = 10\,\mu\text{s}$ and $\beta = 50\,\text{GB/s}$ (NVLink effective bandwidth). What fraction of a 250ms training step does communication consume?
(c) Repeat the calculation for 64 GPUs across 8 nodes with $\beta = 12.5\,\text{GB/s}$ (InfiniBand effective bandwidth). How does the communication fraction change?
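As a starting point for parts (b) and (c), the ring all-reduce formula from the exercise can be evaluated directly. The helper below is a minimal sketch (the function name and argument conventions are ours, not the chapter's API):

```python
def ring_allreduce_seconds(n_gpus, msg_bytes, alpha_s, beta_bytes_per_s):
    """Estimate T_ring = 2(N-1)*alpha + 2*((N-1)/N)*(M/beta)."""
    latency = 2 * (n_gpus - 1) * alpha_s
    transfer = 2 * (n_gpus - 1) / n_gpus * msg_bytes / beta_bytes_per_s
    return latency + transfer

# Part (b): 8 GPUs, 100 MB gradient, alpha = 10 us, beta = 50 GB/s
t = ring_allreduce_seconds(8, 100e6, 10e-6, 50e9)
comm_fraction = t / 0.250  # fraction of a 250 ms training step
```

For part (c), rerun with `n_gpus=64` and `beta_bytes_per_s=12.5e9`; note that the bandwidth term dominates the latency term in both regimes.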
Exercise 26.2 (*)
Explain why calling dataloader.sampler.set_epoch(epoch) is necessary in DDP training. What happens if you omit it? Design a simple experiment (pseudocode is sufficient) that would detect this bug by monitoring training loss convergence.
Exercise 26.3 (**)
In the verify_parameter_sync function (Section 26.2.5), a scalar fingerprint is computed by summing all parameter values. This is a weak hash — two different parameter tensors could produce the same sum.
(a) Propose a stronger verification method that detects any per-element divergence, while keeping communication cost at $O(1)$ tensors per GPU (not sending the full parameter tensor).
(b) When should you run parameter synchronization checks in production? After every step? Every $k$ steps? Justify your choice with a cost-benefit analysis.
Exercise 26.4 (**)
A team observes that DDP training on 4 GPUs achieves only 2.8x speedup over single-GPU training (instead of the ideal 4x).
(a) List four possible causes for the sub-linear scaling. For each cause, describe a diagnostic measurement that would confirm or rule it out.
(b) The team profiles the training step and finds: compute = 180ms, all-reduce = 45ms, data loading = 25ms, other = 10ms. What is the theoretical maximum speedup (Amdahl's Law) if communication and data loading scale perfectly? What is the actual efficiency?
(c) Propose two optimizations that would improve scaling. Estimate the expected speedup from each.
Parallelism Strategies
Exercise 26.5 (*)
For each of the following models, select the most appropriate parallelism strategy and justify your choice:
- A 300M parameter BERT model fine-tuned on a single 8-GPU node with A100 80GB GPUs.
- A 13B parameter LLM pre-trained on 512 GPUs across 64 nodes.
- A 100M parameter recommender model trained on 50 billion interactions, where the embedding table is 200 GB.
- A 500M parameter image segmentation model with input resolution $4096 \times 4096$.
Exercise 26.6 (**)
Consider pipeline parallelism with 4 stages and $M$ micro-batches using the GPipe schedule.
(a) Derive the bubble fraction formula $\frac{N-1}{N+M-1}$ from first principles. Count the total time slots in the schedule and the number of idle slots.
(b) Plot the bubble fraction as a function of $M$ for $N = 4, 8, 16$. At what value of $M$ does the bubble fraction drop below 10% for each value of $N$?
(c) Increasing $M$ requires storing activations for more micro-batches simultaneously. If each micro-batch's activation memory is $A$ GB, what is the peak activation memory as a function of $M$ and $N$? At what point does the memory cost of reducing the bubble become prohibitive?
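For part (b), the bubble fraction is cheap to tabulate numerically before plotting. A one-line sketch (helper name is ours):

```python
def bubble_fraction(n_stages, n_microbatches):
    """GPipe bubble fraction (N-1)/(N+M-1): idle slots over total slots."""
    return (n_stages - 1) / (n_stages + n_microbatches - 1)

# Sweep M for a fixed N to find where the bubble drops below 10%
for m in (4, 8, 16, 32):
    frac = bubble_fraction(4, m)
```

Solving $(N-1)/(N+M-1) < 0.1$ gives $M > 9(N-1)$, which you can confirm against the sweep.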
Exercise 26.7 (***)
Tensor parallelism for a transformer's self-attention layer requires splitting the query, key, and value projections.
(a) Describe how to split a multi-head attention layer with $h = 32$ heads across 4 GPUs. How many attention heads does each GPU compute? What are the communication operations required (and when)?
(b) For the feed-forward network with column-parallel $W_1$ and row-parallel $W_2$ (as in Section 26.3.4), prove that only one all-reduce per FFN is needed (rather than two) by showing that the GeLU nonlinearity between $W_1$ and $W_2$ can be applied locally.
(c) Compare the communication volume (bytes) of tensor parallelism vs. data parallelism for a single transformer layer with $d = 4096$, $B = 32$, $s = 1024$, on 4 GPUs. Which strategy has lower communication cost? Under what conditions does this change?
GPU Memory and FlashAttention
Exercise 26.8 (*)
Using the memory breakdown from Section 26.4.1, compute the total memory consumption for:
- A 350M parameter transformer trained with AdamW in FP32, batch size 16, sequence length 512, hidden dimension 1024, 24 layers.
- The same model trained with AMP (BF16 activations, FP32 master weights).
- The same model with AMP and gradient checkpointing (every 4 layers).
For each configuration, state whether it fits on (a) an A100 40GB, (b) an A100 80GB, (c) a V100 32GB.
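For the static (non-activation) part of the budget, a simple per-parameter accounting helps sanity-check your totals. This sketch assumes full-FP32 AdamW training (4-byte weights and gradients, two 4-byte moments per parameter); the function name is illustrative:

```python
def adamw_training_state_gb(n_params, param_bytes=4, grad_bytes=4,
                            optim_bytes=8):
    """Per-replica static memory (GB) for weights, gradients, and AdamW
    moments. Activations are excluded and must be added separately."""
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9
```

For the AMP configuration, adjust `param_bytes`/`grad_bytes` to reflect BF16 working copies alongside FP32 master weights before comparing against each GPU's capacity.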
Exercise 26.9 (**)
The roofline model (Section 26.4.3) classifies operations as compute-bound or memory-bound.
(a) Compute the arithmetic intensity of a matrix multiplication $C = AB$ where $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$, assuming FP16 (2 bytes per element). Express the result in terms of $M$, $N$, and $K$.
(b) For a square matrix multiplication ($M = N = K$), at what dimension $d$ does the operation become compute-bound on an A100 (312 TFLOP/s, 2.0 TB/s bandwidth)? The ridge point is at $I = 156$ FLOP/byte.
(c) The batch size $B$ in a transformer forward pass affects the arithmetic intensity of the attention computation. Explain why very small batch sizes (e.g., $B = 1$) make attention memory-bound even for large sequence lengths, and suggest a practical solution for inference.
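A small helper makes part (b) concrete: count FLOPs and bytes moved for the matmul, then compare the ratio against the A100 ridge point. The function name and the FP16 default are ours:

```python
def matmul_intensity(m, n, k, bytes_per_el=2):
    """Arithmetic intensity (FLOP/byte) of C = A @ B:
    2*m*n*k FLOPs over bytes_per_el*(m*k + k*n + m*n) bytes moved."""
    return (2 * m * n * k) / (bytes_per_el * (m * k + k * n + m * n))
```

For square matrices this reduces to $d/3$ FLOP/byte, so the compute-bound threshold $I > 156$ lands near $d \approx 468$.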
Exercise 26.10 (**)
You are implementing FlashAttention-style tiling for a custom attention variant.
(a) Explain why the softmax operation prevents naive tiling: why can't you simply compute softmax on each tile independently and concatenate?
(b) The online softmax algorithm maintains a running maximum $m_{\text{prev}}$ and a running sum of exponentials $\ell_{\text{prev}}$. Write the update equations for processing a new tile of attention scores $s_{\text{new}}$:
- Updated running maximum: $m_{\text{new}} = \ldots$
- Correction factor for the old sum: $\ell_{\text{corrected}} = \ldots$
- Updated running sum: $\ell_{\text{new}} = \ldots$
(c) How does FlashAttention handle the backward pass without storing the full attention matrix? What auxiliary information does it store, and what is its memory cost?
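One way to check your answer to part (b) numerically is a pure-Python streaming pass over tiles, verifying it matches the softmax denominator computed in one shot. This is a sketch of the online-softmax bookkeeping only (no attention weights), with names of our choosing:

```python
import math

def online_softmax_stats(scores, tile_size):
    """Streaming max m and sum-of-exponentials l over tiles of `scores`.
    When a tile raises the maximum, the old sum is rescaled by
    exp(m_prev - m_new) before the new tile's contributions are added."""
    m, l = float("-inf"), 0.0
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        m_new = max(m, max(tile))
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in tile)
        m = m_new
    return m, l
```

The result should be independent of `tile_size`, which is exactly the property that makes tiled attention possible.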
Exercise 26.11 (**)
The AttentionBenchmark dataclass in Section 26.7.3 computes estimated memory for standard vs. FlashAttention.
(a) Extend it to include the backward pass memory. For standard attention, the backward pass stores the attention matrix ($B \times h \times s \times s$) and the output of the value projection. For FlashAttention, the backward pass stores only the logsumexp values ($B \times h \times s$).
(b) Compute the crossover sequence length $s^*$ at which FlashAttention's memory savings become greater than 2x compared to standard attention. Express $s^*$ in terms of $d$ (model dimension) and $h$ (number of heads).
(c) For the Climate DL model ($d = 1536$, $h = 24$), what is $s^*$? Is this consistent with the 11.1x savings reported in Section 26.7.3 for $s = 2048$?
Mixed Precision and Gradient Checkpointing
Exercise 26.12 (*)
A team switches their training from FP32 to AMP with FP16. Training is unstable: the loss diverges after 2,000 steps. The GradScaler's loss scale drops from $2^{16}$ to $2^3$ and continues decreasing.
(a) Explain what is happening: why is the loss scale decreasing, and what does this indicate about the gradient distribution?
(b) List three possible fixes in order of decreasing invasiveness. For each fix, explain why it addresses the root cause.
(c) Would switching from FP16 to BF16 help? Explain, referencing the exponent and mantissa differences between the two formats.
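For part (c), it helps to compute the largest finite value each format can represent from its field widths (FP16: 5 exponent / 10 mantissa bits; BF16: 8 / 7). A generic sketch, with a function name of our choosing:

```python
def float_max(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style float with the given widths.
    The top exponent code is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
```

BF16's FP32-sized exponent range is why it rarely needs loss scaling, at the cost of three fewer mantissa bits of precision.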
Exercise 26.13 (*)
Gradient checkpointing stores activations at checkpoint boundaries and recomputes intermediate activations during the backward pass.
(a) For a 48-layer transformer with checkpoint interval $k$, write the formulas for activation memory and forward-pass compute overhead as functions of $k$.
(b) Plot both curves and find the value of $k$ that minimizes total training time, assuming compute overhead translates linearly to wall-clock time and the freed memory is used to double the batch size (which halves the number of steps per epoch).
(c) Is the $\sqrt{L}$ checkpoint interval always optimal? Describe a scenario where a different interval is better (hint: consider layers with very different activation sizes).
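For part (b), the standard accounting (an assumption of this sketch: every layer's activation footprint is one unit $A$) is that interval $k$ stores $L/k$ checkpoint boundaries plus one $k$-layer segment during recomputation:

```python
def checkpointed_activation_units(n_layers, interval):
    """Peak activation memory, in per-layer units A, with checkpoint
    interval k: n_layers/k stored boundaries + k recomputed layers."""
    return n_layers / interval + interval
```

Minimizing $L/k + k$ over $k$ yields the familiar $k \approx \sqrt{L}$, which part (c) asks you to stress-test.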
Exercise 26.14 (**)
Implement a function estimate_memory_consumption that takes a PyTorch model, a sample input batch, and a boolean flag for gradient checkpointing, and returns an estimate of:
- Parameter memory
- Optimizer state memory (for AdamW)
- Activation memory (by running a forward pass and checking torch.cuda.memory_allocated)
- Total memory
Test it on the TwoTowerModel and CheckpointedTransformer classes from this chapter. Verify that gradient checkpointing reduces activation memory as predicted.
Large-Batch Training
Exercise 26.15 (*)
Using the linear scaling rule, compute the scaled learning rate for the following configurations:
| Base batch size | Base LR | Target batch size | Scaled LR |
|---|---|---|---|
| 256 | $10^{-3}$ | 4096 | ? |
| 64 | $5 \times 10^{-4}$ | 2048 | ? |
| 512 | $3 \times 10^{-4}$ | 32768 | ? |
For the third row, the scaled LR exceeds 0.01. Is this likely to cause problems? What technique from Section 26.8.4 should you apply?
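The table entries follow directly from the rule; a one-liner (name is ours) makes the third-row concern explicit:

```python
def linear_scaled_lr(base_lr, base_batch, target_batch):
    """Linear scaling rule: multiply the LR by the batch-size ratio."""
    return base_lr * target_batch / base_batch

lr3 = linear_scaled_lr(3e-4, 512, 32768)  # third row of the table
```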
Exercise 26.16 (**)
The critical batch size $B_{\text{crit}}$ is the batch size beyond which increasing batch size yields diminishing returns in terms of compute efficiency.
(a) McCandlish et al. (2018) define the critical batch size as $B_{\text{crit}} = \frac{B_{\text{noise}}}{1 - B_{\text{noise}} / B}$, where $B_{\text{noise}} = \text{tr}(\Sigma) / \|\nabla \mathcal{L}\|^2$ is the gradient noise scale. Explain intuitively why larger gradient noise corresponds to a larger critical batch size.
(b) Design an experiment to estimate $B_{\text{crit}}$ for the StreamRec two-tower model. Describe the measurements you would take and the analysis you would perform.
(c) If the estimated $B_{\text{crit}} = 4096$ and you have 16 GPUs, what local batch size should each GPU use? If you have 128 GPUs, what strategy should you use to avoid exceeding $B_{\text{crit}}$?
Exercise 26.17 (***)
Implement the LAMB optimizer from scratch in PyTorch. Your implementation should:
- Compute the Adam update (bias-corrected first and second moments).
- Add weight decay to the update (decoupled, as in AdamW).
- Compute the layer-wise trust ratio $r_l = \|w_l\| / \|u_l\|$ where $u_l$ is the Adam update plus weight decay for layer $l$.
- Apply the scaled update: $w_l \leftarrow w_l - \eta \cdot r_l \cdot u_l$.
Test your implementation by training a 3-layer MLP on a synthetic dataset and comparing convergence with AdamW at batch sizes of 256, 2048, and 16384.
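Before writing the torch.optim.Optimizer subclass, it can help to validate the math on a single layer in plain Python. This sketch (function name, list-based tensors, and defaults are all ours) follows the update order listed above:

```python
import math

def lamb_layer_step(w, g, m, v, lr, beta1=0.9, beta2=0.999,
                    eps=1e-6, wd=0.01, t=1):
    """One LAMB step for a single layer, with weights/grads as lists."""
    # Adam moments with bias correction
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Decoupled weight decay folded into the update direction
    u = [mh / (math.sqrt(vh) + eps) + wd * wi
         for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio r = ||w|| / ||u||
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in u))
    r = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * r * ui for wi, ui in zip(w, u)]
    return w, m, v
```

A tensor implementation should reproduce these numbers layer by layer; the trust-ratio guard (returning 1.0 when either norm is zero) matches common practice for freshly initialized or zero-gradient layers.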
DeepSpeed, FSDP, and ZeRO
Exercise 26.18 (*)
For a 7B parameter model trained with Adam in FP32 on 8 GPUs:
(a) Compute the per-GPU memory for standard DDP, ZeRO-1, ZeRO-2, and ZeRO-3. Present the results in a table.
(b) Which ZeRO stage is required to fit the model on A100 40GB GPUs (considering that activations will consume ~15 GB additional)?
(c) ZeRO-3 introduces additional communication (all-gather before each forward layer). Qualitatively, how does this affect the communication-to-computation ratio compared to DDP?
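For part (a), the per-GPU totals follow from sharding each state category. The sketch below assumes the FP32 Adam accounting used in this exercise (4-byte weights, 4-byte gradients, 8 bytes of moments per parameter); names are ours:

```python
def zero_per_gpu_gb(n_params, n_gpus, stage):
    """Per-GPU static memory (GB) for FP32 Adam under each ZeRO stage."""
    w, g, o = 4.0, 4.0, 8.0          # bytes/param: weights, grads, moments
    if stage == 0:                    # plain DDP: everything replicated
        per_param = w + g + o
    elif stage == 1:                  # shard optimizer states
        per_param = w + g + o / n_gpus
    elif stage == 2:                  # shard optimizer states + gradients
        per_param = w + (g + o) / n_gpus
    else:                             # ZeRO-3: shard everything
        per_param = (w + g + o) / n_gpus
    return n_params * per_param / 1e9
```

Add the ~15 GB of activations from part (b) to each row before checking the 40 GB limit.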
Exercise 26.19 (**)
Compare PyTorch FSDP and DeepSpeed ZeRO-3 along the following dimensions:
- Integration with PyTorch ecosystem (autograd, distributed, torchrun)
- Configuration complexity
- Support for mixed precision
- CPU offloading
- Activation checkpointing integration
- Community adoption and maintenance
For each dimension, state which framework has an advantage (or state parity) and provide a one-sentence justification. Conclude with a recommendation for (a) a team already using PyTorch Lightning, and (b) a team training on Azure with managed infrastructure.
Cost Estimation and Management
Exercise 26.20 (*)
Using the TrainingCostEstimator class, compute the cost of training the following models on AWS (A100 80GB at $3.50/GPU-hour on-demand, 65% spot discount):
- BERT-base (110M parameters, 3.3B tokens, 8 GPUs, MFU 0.45)
- GPT-2 medium (345M parameters, 40B tokens, 16 GPUs, MFU 0.42)
- A 3B parameter vision model (500B tokens, 64 GPUs, MFU 0.38)
For each, compute on-demand and spot costs, and the break-even number of spot preemptions (assuming 30 minutes of lost work per preemption) at which spot instances become more expensive than on-demand.
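If you don't have the TrainingCostEstimator class handy, the on-demand figure can be approximated from the $6PD$ FLOPs rule. This sketch assumes A100 peak throughput of 312 TFLOP/s (as in the roofline exercise) and a flat MFU; the function name is ours:

```python
def training_cost_usd(n_params, n_tokens, n_gpus, mfu, usd_per_gpu_hour,
                      peak_flops=312e12):
    """On-demand training cost via the 6*P*D FLOPs approximation."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (n_gpus * peak_flops * mfu)
    return seconds / 3600 * n_gpus * usd_per_gpu_hour
```

Multiply by 0.35 for the 65% spot discount, then add the preemption-loss term to find the break-even point.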
Exercise 26.21 (**)
A team has a $50,000 compute budget and needs to train a 1B parameter model. They have two options:
Option A: 32 A100 80GB GPUs on-demand at $3.50/GPU-hour. MFU = 0.45.
Option B: 128 A100 40GB spot GPUs at $1.10/GPU-hour. MFU = 0.35 (lower due to cross-node communication and smaller GPU memory). Expected 2 preemptions per day, each costing 1 hour of recomputation.
(a) For a target of $10^{21}$ FLOPs total compute, compute the wall-clock time and total cost for each option.
(b) Which option provides more total compute within the $50,000 budget?
(c) Beyond cost, list three non-financial factors that might favor Option A despite higher unit cost.
Exercise 26.22 (**)
The CheckpointManager class from Section 26.10.3 saves checkpoints at fixed intervals.
(a) Propose an adaptive checkpointing strategy that increases checkpoint frequency when spot instance termination probability is high (e.g., based on spot price history).
(b) Implement an extension to CheckpointManager that supports asynchronous checkpointing: saving the checkpoint to disk in a background thread while the next training step proceeds. What synchronization issues must you handle?
(c) For a training run on 64 GPUs where each checkpoint is 15 GB, estimate the I/O time to save a checkpoint to (1) local NVMe SSD, (2) network-attached storage (1 GB/s), and (3) cloud object storage (200 MB/s). Which is the bottleneck for each?
Integration and Production
Exercise 26.23 (**)
Design the complete distributed training configuration for a production recommendation system with the following requirements:
- Model: 500M parameter deep ranking model (DCN-V2 architecture from Chapter 24)
- Training data: 10B user-item interactions per day, 30-day training window (300B interactions total)
- Retraining cadence: daily, must complete within 6 hours
- Hardware budget: 32 A100 80GB GPUs on a single DGX cluster
- Quality constraint: NDCG@10 must be within 0.5% of the best single-GPU baseline
Specify: (1) parallelism strategy, (2) AMP configuration, (3) gradient checkpointing decision, (4) batch size and learning rate, (5) checkpoint frequency, and (6) estimated training cost per day.
Exercise 26.24 (***)
Implement a complete DDP training script for a transformer language model using the building blocks from this chapter. Your script should:
- Accept a command-line config (world size, local batch size, AMP dtype, checkpoint interval).
- Initialize DDP with NCCL backend.
- Use DistributedSampler with correct set_epoch() calls.
- Implement AMP with GradScaler (for FP16) or autocast (for BF16).
- Implement warmup + cosine annealing LR schedule with linear scaling.
- Save and resume from checkpoints.
- Log metrics only from rank 0.
- Clean up the process group on exit.
Test by training a 6-layer transformer on WikiText-2 with 1, 2, and 4 GPUs. Verify that the final perplexity is within 2% across configurations and that wall-clock time scales near-linearly.
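The schedule requirement is a self-contained piece you can unit-test before wiring up the distributed script. A minimal sketch (name and signature are ours; apply the linearly scaled LR as `base_lr`):

```python
import math

def warmup_cosine_lr(step, warmup_steps, total_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In the training loop this would typically be applied per step by setting `param_group["lr"]` on the optimizer.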
Exercise 26.25 (***)
The chapter focused on synchronous data parallelism (all GPUs synchronize gradients every step). Asynchronous data parallelism allows GPUs to update a shared parameter server without waiting for other GPUs.
(a) Explain why asynchronous SGD can converge despite using stale gradients. What is the staleness bound, and how does it affect convergence rate?
(b) When does asynchronous training outperform synchronous training? When does it underperform? Consider both hardware heterogeneity (different GPU speeds) and statistical efficiency (convergence per gradient step).
(c) Local SGD (Lin et al., 2020) is a hybrid: GPUs run $H$ local steps before synchronizing. Analyze the communication-computation trade-off as a function of $H$. What is the effect on the critical batch size?
Exercise 26.26 (****)
The linear scaling rule (Goyal et al., 2017) states that scaling the learning rate linearly with the batch size preserves the expected gradient update. This assumes the gradient noise is Gaussian with constant variance.
(a) Derive the linear scaling rule formally. Start with the SGD update $\theta_{t+1} = \theta_t - \eta \cdot g_t$, where $g_t = \frac{1}{B} \sum_{i=1}^B \nabla \ell_i$, and show that $\mathbb{E}[g_t] = \nabla \mathcal{L}$ independent of $B$ but $\text{Var}(g_t) = \frac{\sigma^2}{B}$.
(b) The linear scaling rule breaks down for very large batch sizes. Hoffer et al. (2017) and Smith et al. (2018) argue this is because the gradient noise becomes negligible, and the dynamics shift from a stochastic to a deterministic regime. Explain this transition and its consequences for generalization.
(c) Recent work (Malladi et al., 2023) on memory-efficient fine-tuning (MeZO) uses zeroth-order optimization that estimates gradients without backpropagation. Analyze the memory trade-off: MeZO uses $O(P)$ memory (no activations, no gradients), but requires $O(d)$ forward passes per step. For what model sizes and hardware configurations is this preferable to standard backpropagation with FSDP?
Exercise 26.27 (****)
The Chinchilla scaling laws (Hoffmann et al., 2022) suggest that for compute-optimal training, the number of training tokens $D$ should scale linearly with model parameters $P$: $D \approx 20P$.
(a) Using the $6PD$ FLOPs approximation, express the compute-optimal training cost in terms of $P$ alone. How does cost scale with model size?
(b) A team has a fixed compute budget of $C$ FLOPs. Should they train a larger model for fewer tokens or a smaller model for more tokens? Derive the optimal allocation using the Chinchilla loss function $L(P, D) = \frac{A}{P^\alpha} + \frac{B}{D^\beta} + L_\infty$ with $\alpha = 0.34$, $\beta = 0.28$.
(c) The Chinchilla scaling laws were derived for language models. Discuss whether they are likely to hold for the Climate DL model (dense prediction on gridded atmospheric data) and the StreamRec recommender (sparse interaction data). What domain-specific factors might shift the optimal compute allocation?
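For parts (a) and (b), substituting $D \approx 20P$ into $C = 6PD$ gives $C = 120P^2$, so the compute-optimal model size scales as $P = \sqrt{C/120}$. A sketch of that allocation (names are ours):

```python
def chinchilla_optimal(c_flops, tokens_per_param=20.0):
    """Compute-optimal (P, D) under C = 6*P*D with D = tokens_per_param*P."""
    p = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    return p, tokens_per_param * p
```

Note that cost therefore grows quadratically in $P$ along the compute-optimal frontier, which is the starting point for the trade-off analysis in part (b).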