Chapter 26: Quiz
Test your understanding of distributed training, GPU optimization, and compute cost management. Answers follow each question.
Question 1
In data-parallel training with $N$ GPUs, what operation synchronizes gradients across all GPUs, and why is ring all-reduce preferred over naive all-reduce?
Answer
The **all-reduce** operation computes the element-wise sum (or mean) of gradient tensors across all GPUs, producing the result on every GPU. **Ring all-reduce** is preferred because its per-GPU communication cost is $2 \cdot \frac{N-1}{N} \cdot M$ bytes, which approaches $2M$ bytes as $N$ grows — the bandwidth cost per GPU is nearly independent of the number of GPUs. In contrast, naive all-reduce funnels all communication through a single GPU (the reducer), creating a bottleneck with $2(N-1) \cdot M$ bytes flowing through one network interface. Ring all-reduce distributes the communication load evenly across all GPUs, achieving near-optimal bandwidth utilization.
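As a quick sanity check of these formulas, here is a sketch computing both costs for an assumed 4 GB gradient buffer (the buffer size and GPU counts are illustrative, not from the chapter):

```python
# Communication cost per gradient sync for a model with M bytes of
# gradients across N GPUs.
def ring_allreduce_bytes_per_gpu(n_gpus: int, grad_bytes: float) -> float:
    # Each GPU sends and receives 2 * (N-1)/N * M bytes in total
    # (reduce-scatter plus all-gather, each moving (N-1)/N * M).
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

def naive_allreduce_bytes_at_root(n_gpus: int, grad_bytes: float) -> float:
    # The single reducer receives (N-1) * M bytes and sends (N-1) * M back.
    return 2 * (n_gpus - 1) * grad_bytes

M = 4e9  # ~1B parameters in FP32, about 4 GB of gradients (assumed)
for n in (2, 8, 64):
    ring = ring_allreduce_bytes_per_gpu(n, M)
    naive = naive_allreduce_bytes_at_root(n, M)
    print(f"N={n:3d}  ring per-GPU: {ring/1e9:6.2f} GB  naive at root: {naive/1e9:8.2f} GB")
```

The ring cost saturates near $2M$ per GPU while the naive root's traffic grows linearly with $N$.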
Question 2
What is the mathematical equivalence guarantee of DDP? Under what conditions can this guarantee break?
Answer
DDP guarantees that distributed training with $N$ GPUs and local batch size $b$ produces **exactly the same gradient** as single-GPU training with batch size $B = Nb$. This is because the averaged gradient $\bar{g} = \frac{1}{N}\sum_k g_k$ equals the gradient computed on the full batch. This guarantee relies on **all GPUs starting with identical parameters**. It can break if: (1) models are initialized with different random seeds on different ranks, (2) non-deterministic CUDA operations (e.g., `atomicAdd`) cause divergent activations, or (3) floating-point non-associativity in the all-reduce causes cumulative drift (negligible in practice).
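The equivalence can be checked numerically. A minimal sketch with a made-up linear model and mean-squared error (all shapes and data invented for illustration):

```python
import numpy as np

# For a loss that is a mean over examples, averaging per-"GPU" gradients
# over equal-sized shards equals the full-batch gradient.
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)

def grad(Xb, yb, w):
    # d/dw of mean((Xb @ w - yb)^2) = 2/B * Xb.T @ (Xb @ w - yb)
    return 2 / len(yb) * Xb.T @ (Xb @ w - yb)

full = grad(X, y, w)                                     # single GPU, B = 8
shards = [grad(X[k::2], y[k::2], w) for k in range(2)]   # N = 2 ranks, b = 4
assert np.allclose(np.mean(shards, axis=0), full)        # same gradient
```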
Question 3
Why must you call `dataloader.sampler.set_epoch(epoch)` in each training epoch when using DDP?
Answer
The `DistributedSampler` uses the epoch number as a random seed to determine which data partition each GPU receives. Without `set_epoch()`, the sampler uses the same seed every epoch, meaning each GPU sees the same partition of data in every epoch. This reduces effective data diversity — the model sees the same GPU-to-sample mapping repeatedly — and can degrade convergence. Calling `set_epoch(epoch)` ensures a different shuffling per epoch, so that each sample is seen by different GPUs across epochs.
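A stripped-down model of the sampler's behavior (simplified: the real `DistributedSampler` pads the dataset so every rank receives the same number of samples, and seeds a `torch.Generator` with `seed + epoch`):

```python
import random

# Each rank shuffles the full index list with a seed derived from the epoch,
# then takes a strided slice. If the epoch never changes, neither does the shard.
def partition(indices, epoch, rank, world_size, seed=0):
    order = list(indices)
    random.Random(seed + epoch).shuffle(order)  # what set_epoch(epoch) changes
    return order[rank::world_size]

data = range(8)
print(partition(data, epoch=0, rank=0, world_size=2))
print(partition(data, epoch=1, rank=0, world_size=2))
# Shards for the two ranks always cover the dataset exactly once per epoch.
```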
Question 4
Name and briefly describe the four parallelism strategies for distributed training. When is each one appropriate?
Answer
**Data parallelism:** Replicate the model on every GPU, partition the data, synchronize gradients via all-reduce. Use when the model fits on a single GPU and the dataset is large. **Model parallelism (naive):** Assign different layers to different GPUs. Use when the model does not fit on a single GPU, but suffers from pipeline bubbles (low utilization). **Pipeline parallelism:** Split the model into stages across GPUs and interleave micro-batches to fill the bubble. Use for very deep models; reduces idle time to a fraction $\frac{N-1}{N+M-1}$ of total time. **Tensor parallelism:** Split individual layers (e.g., weight matrices) across GPUs. Use for very wide layers (e.g., large transformer FFN) within a single high-bandwidth node (NVLink). In practice, these are combined into 3D parallelism for very large models.
Question 5
What is the "pipeline bubble" in model parallelism, and how does GPipe reduce it?
Answer
The pipeline bubble is the idle time each GPU spends waiting for activations (forward pass) or gradients (backward pass) from other GPUs. In naive model parallelism with $N$ GPUs and 1 micro-batch, utilization is only $1/N$. GPipe reduces the bubble by splitting the global batch into $M$ micro-batches and feeding them through the pipeline sequentially, so that while GPU $k$ processes micro-batch $m$, GPU $k-1$ can process micro-batch $m+1$. The bubble fraction becomes $\frac{N-1}{N+M-1}$, which decreases as $M$ increases. With $N=4$ and $M=16$, the bubble is only about 16%.
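The bubble formula is easy to verify numerically, including the $N=4$, $M=16$ example:

```python
# Bubble fraction (N - 1) / (N + M - 1) for an N-stage pipeline
# fed with M micro-batches.
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    return (n_stages - 1) / (n_stages + n_microbatches - 1)

# Naive model parallelism is the M = 1 case: utilization 1/N, bubble 1 - 1/N.
assert bubble_fraction(4, 1) == 0.75

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))  # M=16 gives ~0.158, "about 16%"
```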
Question 6
In tensor parallelism for a transformer FFN, why can the GeLU activation be applied locally on each GPU without communication?
Answer
The first weight matrix $W_1$ is split **column-wise** across GPUs, so each GPU $k$ computes $y_k = x W_1^{(k)}$, producing a partition of the full output in the hidden-dimension direction. GeLU is an **element-wise** operation — it is applied independently to each element — so $\text{GeLU}(y_k)$ produces the correct result for GPU $k$'s partition without needing any information from other GPUs. The communication (an all-reduce) is only needed after the second matrix multiplication ($W_2$, split row-wise), because the output of $W_2$ must be summed across GPUs to produce the correct result.
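A two-"GPU" NumPy sketch of this sharding, with tiny invented shapes (the tanh approximation stands in for the exact GeLU kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))      # (batch, d_model)
W1 = rng.normal(size=(8, 32))    # d_model -> d_hidden, split column-wise
W2 = rng.normal(size=(32, 8))    # d_hidden -> d_model, split row-wise

def gelu(z):
    # tanh approximation of GeLU; crucially, it is element-wise
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

ref = gelu(x @ W1) @ W2          # reference: no parallelism

# "GPU" k holds a column block of W1 and the matching row block of W2;
# GeLU runs locally on each partial hidden activation.
partials = [gelu(x @ W1[:, k::2]) @ W2[k::2, :] for k in range(2)]
assert np.allclose(sum(partials), ref)   # the all-reduce (sum) restores ref
```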
Question 7
What are the four main consumers of GPU memory during training, and which one dominates for large-batch training of deep models?
Answer
The four main consumers are: (1) **Model parameters** ($4P$ bytes in FP32), (2) **Gradients** ($4P$ bytes), (3) **Optimizer state** ($8P$ bytes for Adam — first and second moments), and (4) **Activations** stored for the backward pass, which scale with batch size, sequence length, and model depth ($O(B \cdot L \cdot d)$ or $O(B \cdot L \cdot s^2)$ for attention). For large-batch training of deep models, **activations dominate**. For a 48-layer transformer with batch size 32 and sequence length 1024, activations can exceed 16 GB — more than the combined parameter and optimizer memory.
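A rough budget model makes the scaling difference concrete. The model size, width, and the per-layer activation multiplier `act_factor` below are assumptions for illustration, not figures from the chapter:

```python
# Per-GPU training memory (GB) for FP32 weights with Adam. The activation
# term follows the O(B * L * d) estimate; act_factor (buffers retained per
# layer) and act_bytes (BF16 activations) are assumed constants.
def training_memory_gb(n_params, batch, seq_len, d_model, n_layers,
                       act_bytes=2, act_factor=4):
    acts = act_bytes * act_factor * batch * seq_len * d_model * n_layers
    return {"params": 4 * n_params / 2**30,
            "grads": 4 * n_params / 2**30,
            "optimizer": 8 * n_params / 2**30,
            "activations": acts / 2**30}

mem = training_memory_gb(n_params=1.3e9, batch=32, seq_len=1024,
                         d_model=1600, n_layers=48)
print({k: round(v, 1) for k, v in mem.items()})

# Doubling the batch doubles only the activation term, not the others.
big = training_memory_gb(n_params=1.3e9, batch=64, seq_len=1024,
                         d_model=1600, n_layers=48)
assert big["activations"] == 2 * mem["activations"]
assert big["params"] == mem["params"]
```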
Question 8
What is arithmetic intensity, and how does the roofline model use it to classify operations as compute-bound or memory-bound?
Answer
**Arithmetic intensity** $I$ is the ratio of floating-point operations (FLOPs) to bytes accessed from memory (FLOP/byte). The **roofline model** defines the maximum achievable performance as $\min(\text{Peak FLOP/s},\; I \times \text{Memory Bandwidth})$. Operations with $I$ above the ridge point ($\text{Peak FLOP/s} / \text{Bandwidth}$) are **compute-bound** — they are limited by the GPU's arithmetic throughput. Operations below the ridge point are **memory-bound** — they are limited by the speed of reading/writing data from HBM. For an A100 (312 TFLOP/s, 2.0 TB/s), the ridge point is 156 FLOP/byte. Large matrix multiplications (arithmetic intensity ~4096) are compute-bound; element-wise operations like ReLU (arithmetic intensity ~0.25) are memory-bound.
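These A100 numbers plug directly into the roofline formula:

```python
PEAK_FLOPS = 312e12   # A100 BF16 tensor-core peak, FLOP/s
BANDWIDTH = 2.0e12    # HBM bandwidth, bytes/s

ridge = PEAK_FLOPS / BANDWIDTH   # ridge point in FLOP/byte

def attainable_flops(intensity: float) -> float:
    # The roofline: capped by either compute or memory traffic.
    return min(PEAK_FLOPS, intensity * BANDWIDTH)

def bound(intensity: float) -> str:
    return "compute-bound" if intensity >= ridge else "memory-bound"

print(ridge)                                          # 156.0
print(bound(4096), attainable_flops(4096) / 1e12)     # big matmul: full 312
print(bound(0.25), attainable_flops(0.25) / 1e12)     # ReLU: capped at 0.5
```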
Question 9
What is Model FLOP Utilization (MFU), and what does a low MFU (e.g., 15%) indicate?
Answer
**MFU** is the fraction of a GPU's theoretical peak FLOP/s that is actually used for model computation: $\text{MFU} = \frac{\text{Observed model FLOP/s}}{\text{GPU peak FLOP/s}}$. It excludes communication, memory-bound overhead, and pipeline bubbles. An MFU of 15% means that 85% of the GPU's compute capacity is wasted — likely due to memory-bound operations dominating the workload, excessive communication overhead, small batch sizes that under-utilize Tensor Cores, or data loading bottlenecks. Typical well-optimized training achieves 40-55% MFU; values below 30% indicate significant optimization opportunities.
Question 10
Explain how Automatic Mixed Precision (AMP) works. What are the roles of autocast and GradScaler?
Answer
AMP uses reduced precision (FP16 or BF16) for compute-heavy operations to leverage Tensor Core acceleration, while maintaining FP32 for numerically sensitive operations. **`autocast`** is a context manager that automatically selects the appropriate precision for each operation: matrix multiplications and convolutions run in reduced precision (high arithmetic intensity, tolerant of lower precision), while reductions, layer normalization, softmax, and loss computation run in FP32 (sensitive to precision). **`GradScaler`** prevents gradient underflow in FP16 by scaling the loss by a large constant before backpropagation, then unscaling the gradients before the optimizer step. If gradients overflow (become `inf`), the step is skipped and the scale is halved; if no overflow occurs for a window, the scale is doubled. BF16 has the same exponent range as FP32, so it rarely needs loss scaling.
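The scale-adjustment policy can be sketched as a toy controller. PyTorch's actual `GradScaler` defaults are an initial scale of $2^{16}$, backoff factor 0.5, growth factor 2.0, and a growth interval of 2000 steps; the tiny interval below just keeps the demo short:

```python
# Toy dynamic loss-scale controller mimicking GradScaler's policy.
class ToyScaler:
    def __init__(self, scale=2.0**16, growth_interval=3):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good = 0  # consecutive steps without overflow

    def step(self, grads_overflowed: bool) -> str:
        if grads_overflowed:
            self.scale *= 0.5        # back off and skip this update
            self._good = 0
            return "skipped"
        self._good += 1
        if self._good == self.growth_interval:
            self.scale *= 2.0        # stable window: try a larger scale
            self._good = 0
        return "applied"

s = ToyScaler()
history = [s.step(over) for over in (False, False, True, False, False, False)]
print(history, s.scale)  # one skipped step; scale halved, then doubled back
```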
Question 11
Why does naive FP16 training fail, and how does BF16 address the primary failure mode?
Answer
Naive FP16 training fails primarily because of **gradient underflow**: small gradients (common in early layers of deep networks) fall below FP16's minimum positive value ($\approx 5.96 \times 10^{-8}$) and become zero, causing parameters to stop updating. FP16 has a 5-bit exponent (range $\pm 65504$), so values outside this range overflow. **BF16** addresses this by using an 8-bit exponent — the same as FP32 — giving it the same dynamic range ($\pm 3.4 \times 10^{38}$). Gradients that would underflow in FP16 are representable in BF16. The trade-off is that BF16 has lower precision (7-bit mantissa vs. FP16's 10-bit mantissa), but this precision loss is tolerable for most deep learning operations.
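The underflow is easy to reproduce with NumPy (which has FP16 but not BF16; the BF16 half of the story is that its 8-bit exponent would keep the same value representable):

```python
import numpy as np

g = 1e-8                         # a small but real gradient value
assert np.float16(g) == 0.0      # FP16 flushes it to zero: the update is lost
assert np.float32(g) != 0.0      # FP32 (and BF16, same exponent) keeps it

# Loss scaling rescues FP16: scale up before the cast, unscale after.
scale = 2.0**16
scaled = np.float16(g * scale)
assert scaled != 0.0
assert abs(float(scaled) / scale - g) / g < 0.01   # recovered within ~1%

# Smallest positive normal and largest FP16 values:
print(np.finfo(np.float16).tiny, np.finfo(np.float16).max)
```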
Question 12
What is gradient checkpointing, and what trade-off does it make?
Answer
Gradient checkpointing **discards intermediate activations** during the forward pass and **recomputes them** during the backward pass. This trades compute for memory: activation memory drops from $O(L)$ (storing all $L$ layers' outputs) to $O(\sqrt{L})$ with the optimal checkpoint interval of $\sqrt{L}$ layers, at the cost of approximately 33% additional forward-pass computation (each discarded activation is recomputed once). For a 48-layer transformer, this reduces activation memory by approximately 85% while increasing training time by 30-40%. It is most valuable when activations dominate memory and the freed memory can be used for larger batch sizes, which in turn improve GPU utilization and reduce communication overhead.
Question 13
How does FlashAttention achieve both lower memory usage and higher throughput than standard attention?
Answer
FlashAttention avoids materializing the full $s \times s$ attention matrix in HBM by **tiling the computation on SRAM**. It divides $Q$, $K$, $V$ into blocks that fit in the GPU's on-chip SRAM (192 KB of shared memory per SM, roughly 20 MB in total on an A100), computes partial attention scores and outputs within SRAM, and uses the **online softmax** trick to compute numerically correct softmax in a single pass without storing intermediate results. This provides: (1) **Lower memory** — $O(Bs)$ instead of $O(Bs^2)$, because the attention matrix is never written to HBM; and (2) **Higher throughput** — by fusing multiple HBM read/write operations into a single kernel, FlashAttention eliminates the round trips that make standard attention memory-bound. The speedup comes not from reducing FLOPs (it computes the same operations) but from reducing HBM traffic.
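The online-softmax trick can be demonstrated on a plain vector: process it in blocks, carry a running max and normalizer, and rescale previously computed terms whenever the max grows. A NumPy sketch, ignoring the $V$-weighted accumulation FlashAttention also carries:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=12)

def online_softmax(x, block=4):
    m, l, out = -np.inf, 0.0, []          # running max, normalizer, outputs
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        # Rescale previously accumulated mass to the new max, then add block.
        l = l * np.exp(m - m_new) + np.exp(b - m_new).sum()
        out = [o * np.exp(m - m_new) for o in out] + list(np.exp(b - m_new))
        m = m_new
    return np.array(out) / l

ref = np.exp(scores - scores.max())
ref /= ref.sum()
assert np.allclose(online_softmax(scores), ref)   # blockwise == one-shot
```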
Question 14
What is the linear scaling rule for large-batch training, and why does it require learning rate warmup?
Answer
The **linear scaling rule** (Goyal et al., 2017) states: when the batch size is multiplied by $k$, multiply the learning rate by $k$. The intuition is that a $k$-times-larger batch reduces gradient variance by a factor of $k$, so a $k$-times-larger step size is needed to maintain the same expected progress per epoch. **Warmup** is needed because the linear scaling rule assumes the loss landscape is smooth enough for large steps, which is false at the beginning of training when parameters are random. Large learning rates applied to randomly initialized weights cause divergence. Linear warmup gradually increases the learning rate from near-zero to the scaled target over the first 5-10% of training steps, allowing the network to reach a region where the loss landscape supports larger updates.
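A minimal schedule implementing both rules (the base LR of 0.1 at batch 256 follows Goyal et al.; the target batch size and warmup length are illustrative):

```python
# Linear scaling rule plus linear warmup over the first `warmup` steps.
def lr_at(step, *, base_lr=0.1, base_batch=256, batch=8192, warmup=500):
    scaled = base_lr * batch / base_batch      # linear scaling rule: k-fold LR
    if step < warmup:
        return scaled * (step + 1) / warmup    # ramp from near-zero to scaled
    return scaled

assert abs(lr_at(10_000) - 3.2) < 1e-12        # 0.1 * 8192/256 after warmup
print([round(lr_at(s), 4) for s in (0, 249, 499, 500, 1000)])
```

In practice this warmup is composed with a decay schedule (cosine, step, etc.) after the ramp.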
Question 15
How do LARS and LAMB differ from standard SGD and Adam, and when are they needed?
Answer
**LARS** (Layer-wise Adaptive Rate Scaling) and **LAMB** (Layer-wise Adaptive Moments for Batch training) adjust the learning rate **per layer** based on the ratio of the weight norm to the update norm: $r_l = \|w_l\| / \|u_l\|$. Standard SGD and Adam use a single global learning rate for all layers, which is suboptimal because different layers have different gradient-to-weight ratios. LARS applies this to SGD; LAMB applies it to Adam. They are needed when training with **very large batch sizes** (typically $> 4096$) where the linear scaling rule alone is insufficient — different layers need different scaling to maintain stable convergence. LAMB enabled training BERT with batch size 65,536 in 76 minutes.
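The layer-wise rescaling can be sketched in a few lines (full LARS also folds weight decay and momentum into the update; those are omitted here):

```python
import numpy as np

# Trust ratio ||w|| / ||u||: layers whose update is large relative to their
# weights get damped; layers with tiny updates get boosted.
def trust_ratio(w, u, eps=1e-12):
    return np.linalg.norm(w) / (np.linalg.norm(u) + eps)

w = np.ones(100)                       # ||w|| = 10
u_small = np.full(100, 1e-3)           # tiny update -> large ratio
u_large = np.full(100, 1.0)            # update as big as the weights -> ~1

print(trust_ratio(w, u_small))         # this layer can take much bigger steps
print(trust_ratio(w, u_large))         # this layer stays near the global LR
# Effective per-layer update: global_lr * trust_ratio(w, u) * u
```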
Question 16
What are the three stages of ZeRO optimization, and what does each stage shard?
Answer
**ZeRO-1** shards the **optimizer state** across GPUs. Each GPU stores only $1/N$ of the Adam first and second moments, reducing optimizer memory from $8P$ to $8P/N$ per GPU. **ZeRO-2** shards both **optimizer state and gradients**. Gradients are reduced and immediately discarded after use, so each GPU stores only its shard. **ZeRO-3** shards **everything**: parameters, gradients, and optimizer state. Each GPU stores only $16P/N$ bytes total (vs. $16P$ in standard DDP). The trade-off is communication: ZeRO-3 requires an all-gather operation before each forward/backward layer to reconstruct the full parameters, adding communication that does not exist in DDP or ZeRO-1/2.
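The per-GPU byte counts follow directly from the $4P + 4P + 8P$ breakdown:

```python
# Per-GPU bytes per parameter for DDP vs the three ZeRO stages, with FP32
# params (4 B), FP32 grads (4 B), and Adam state (8 B), sharded over N GPUs.
def bytes_per_param(stage: str, n: int) -> float:
    p, g, o = 4, 4, 8
    if stage == "ddp":
        return p + g + o           # 16: everything replicated
    if stage == "zero1":
        return p + g + o / n       # shard optimizer state
    if stage == "zero2":
        return p + (g + o) / n     # ... plus gradients
    if stage == "zero3":
        return (p + g + o) / n     # ... plus parameters
    raise ValueError(stage)

n = 64
for s in ("ddp", "zero1", "zero2", "zero3"):
    print(s, round(bytes_per_param(s, n), 3))
# Example: a 7B-parameter model needs 112 GB/GPU under DDP but only
# 1.75 GB/GPU under ZeRO-3 with 64 GPUs (activations not included).
```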
Question 17
Why are spot/preemptible instances cost-effective for training but risky without checkpointing?
Answer
Spot instances offer **60-90% discounts** compared to on-demand pricing because the cloud provider can reclaim them with minimal notice (2 minutes on AWS, 30 seconds on GCP). This is cost-effective for training because training is a long-running computation with a clear restart strategy. However, without checkpointing, a preemption **loses all progress** since the last save. For a training run that takes 167 hours (like Climate DL), losing even a few hours of work is costly. With checkpoint-based fault tolerance, the cost of preemption is bounded: at most the work since the last checkpoint (e.g., 30 minutes) is lost. The savings from spot pricing (e.g., $24,000 for Climate DL) far outweigh the expected cost of re-doing a few checkpoints' worth of work.
Question 18
Using the $6PD$ approximation, estimate the training FLOPs for a 7B parameter model trained on 1.4 trillion tokens. How many A100-hours would this require at 45% MFU?
Answer
Total FLOPs: $C = 6 \times 7 \times 10^9 \times 1.4 \times 10^{12} = 5.88 \times 10^{22}$ FLOPs. An A100 at BF16 delivers 312 TFLOP/s peak. At 45% MFU, effective throughput is $312 \times 0.45 = 140.4$ TFLOP/s $= 1.404 \times 10^{14}$ FLOP/s. Time: $5.88 \times 10^{22} / 1.404 \times 10^{14} = 4.19 \times 10^8$ seconds $= 116,300$ GPU-hours. On 256 GPUs, this takes approximately 454 hours (19 days). At $3.50/GPU-hour on-demand, the cost is approximately $407,000.
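The same arithmetic, reproduced as a script:

```python
# Training-cost estimate via the C = 6 * P * D approximation.
P, D = 7e9, 1.4e12
flops = 6 * P * D                    # ~5.88e22 FLOPs

peak, mfu = 312e12, 0.45             # A100 BF16 peak, assumed utilization
effective = peak * mfu               # ~1.404e14 FLOP/s per GPU

gpu_seconds = flops / effective
gpu_hours = gpu_seconds / 3600
print(f"{flops:.3e} FLOPs -> {gpu_hours:,.0f} GPU-hours")
print(f"{gpu_hours / 256:,.0f} wall-clock hours on 256 GPUs (ideal scaling)")
print(f"${gpu_hours * 3.50:,.0f} at $3.50/GPU-hour on-demand")
```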
Question 19
A team profiles their DDP training and finds a communication fraction of 35%. What are three strategies to reduce it, and what trade-off does each involve?
Answer
Three strategies to reduce communication fraction: (1) **Increase local batch size** — processes more data per GPU before each all-reduce, amortizing the fixed communication cost. Trade-off: more GPU memory consumed by activations; may require gradient checkpointing. (2) **Gradient accumulation** — accumulate gradients over $K$ micro-batches before the all-reduce, reducing communication frequency by $K\times$. Trade-off: $K\times$ larger effective batch size may require learning rate adjustment (linear scaling rule); model updates are $K\times$ less frequent, which can slow convergence per wall-clock second. (3) **Gradient compression** — quantize or sparsify gradients before all-reduce (e.g., 1-bit Adam, Top-K sparsification). Trade-off: compression introduces approximation error; some methods require warmup steps with full precision; convergence may be slightly slower in terms of steps to reach target quality.
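Strategy (2) preserves the gradient exactly when micro-batches are equal-sized, which is worth verifying. A NumPy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(16, 4)), rng.normal(size=16), rng.normal(size=4)

def grad(Xb, yb, w):
    # mean-squared-error gradient over a batch
    return 2 / len(yb) * Xb.T @ (Xb @ w - yb)

K = 4
micro = np.array_split(np.arange(16), K)   # K equal micro-batches of 4

# Average K micro-batch gradients locally, then do ONE all-reduce:
accumulated = np.mean([grad(X[i], y[i], w) for i in micro], axis=0)
assert np.allclose(accumulated, grad(X, y, w))   # K-fold fewer syncs, same math
```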
Question 20
For the StreamRec progressive project (M10a), why is DDP the right parallelism choice rather than FSDP or pipeline parallelism? What would need to change about the model for FSDP to become necessary?