Chapter 35: Quiz -- Distributed Training and Scaling
Question 1
In data parallelism, what operation synchronizes gradients across GPUs?
A) Broadcast B) All-reduce C) All-gather D) Reduce-scatter
Answer: B Explanation: All-reduce sums the gradients computed on each GPU and distributes the result to all GPUs, ensuring every replica has identical averaged gradients before the optimizer step.
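A minimal sketch of the collective that DDP performs under the hood, assuming a process group has already been initialized (e.g., via torch.distributed.init_process_group); DistributedDataParallel does this automatically and overlaps it with the backward pass:

```python
import torch.distributed as dist

def average_gradients(model):
    """Manually all-reduce and average gradients across all ranks.
    Shown only to illustrate the collective that DDP automates."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum across GPUs
            param.grad /= world_size                           # average in place
```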
Question 2
What is the communication volume of ring all-reduce for N GPUs with data size D?
A) N * D B) 2 * D C) 2(N-1)/N * D D) D / N
Answer: C Explanation: Ring all-reduce runs two phases (reduce-scatter followed by all-gather), and in each phase every GPU sends and receives (N-1)/N * D. The total per-GPU volume is therefore 2(N-1)/N * D, which approaches 2D as N grows, making the algorithm bandwidth-optimal.
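A quick numerical check of the formula (the GPU counts and gradient size below are illustrative):

```python
def ring_allreduce_volume(n_gpus: int, data_bytes: float) -> float:
    """Bytes each GPU sends (and receives) in ring all-reduce: 2 * (N-1)/N * D."""
    return 2 * (n_gpus - 1) / n_gpus * data_bytes

print(ring_allreduce_volume(4, 1e9))   # 1.5e9  -> 1.5 GB per GPU for D = 1 GB
print(ring_allreduce_volume(64, 1e9))  # ~1.97e9 -> approaches the 2 GB limit
```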
Question 3
What is the linear scaling rule for learning rate in distributed training?
A) Divide the learning rate by the number of GPUs B) Multiply the learning rate by the number of GPUs (or effective batch size ratio) C) Keep the learning rate constant regardless of the number of GPUs D) Use the square root of the number of GPUs as a multiplier
Answer: B Explanation: When the effective batch size increases by a factor of k (due to data parallelism), the linear scaling rule multiplies the learning rate by k. This keeps the overall progress per epoch roughly constant even though k times fewer optimizer steps are taken, and it is typically combined with a warmup period to stabilize early training.
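A worked example with illustrative numbers (the base learning rate and batch size are assumptions):

```python
base_lr = 1e-3      # tuned for a single GPU with a per-GPU batch of 32 (assumed)
world_size = 8      # data-parallel replicas -> effective batch of 256
scaled_lr = base_lr * world_size   # linear scaling rule: 8e-3

# In practice the scaled rate is reached via warmup, e.g. ramping linearly
# from a small value up to scaled_lr over the first few thousand steps.
```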
Question 4
What problem does gradient accumulation solve?
A) It reduces communication overhead in distributed training B) It simulates a larger effective batch size when GPU memory is insufficient for the desired batch size C) It speeds up the forward pass D) It eliminates the need for gradient synchronization
Answer: B Explanation: Gradient accumulation runs multiple forward-backward passes without an optimizer step, accumulating gradients. The optimizer step is taken after N accumulation steps, simulating a batch size N times larger than what fits in memory.
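A minimal sketch of the accumulation loop, assuming `model`, `criterion`, `optimizer`, and `dataloader` are already defined:

```python
accumulation_steps = 4                      # illustrative: simulates a 4x larger batch
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    (loss / accumulation_steps).backward()  # divide so the accumulated sum matches a large-batch mean
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one optimizer step per 4 micro-batches
        optimizer.zero_grad()
```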
Question 5
What does FSDP (Fully Sharded Data Parallelism) shard across GPUs?
A) Only the training data B) Only the model parameters C) Parameters, gradients, and optimizer states D) Only the optimizer states
Answer: C Explanation: FSDP shards all three: parameters, gradients, and optimizer states across GPUs. This is equivalent to ZeRO Stage 3. Parameters are gathered (all-gather) before each forward/backward pass and resharded after.
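A minimal FSDP wrapping sketch, assuming the script is launched with torchrun and `MyTransformer` stands in for your model class:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = MyTransformer().cuda()     # placeholder model class
model = FSDP(model)                # shards parameters, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # create after wrapping
```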
Question 6
What is the memory formula for training a model with P parameters using AdamW in float32?
A) 4P bytes B) 8P bytes C) 12P bytes D) 16P + activations bytes
Answer: D Explanation: Training requires 4P (parameters) + 4P (gradients) + 4P (Adam first moment) + 4P (Adam second moment) = 16P bytes, plus activation memory, which varies with batch size, sequence length, and model architecture.
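The arithmetic as a quick helper (a rough estimate that ignores activations, buffers, and memory fragmentation):

```python
def training_state_bytes(num_params: int) -> int:
    """float32 AdamW: 4 bytes each for parameters, gradients, and the two Adam moments."""
    return 16 * num_params

print(training_state_bytes(7_000_000_000) / 1e9)   # a 7B-parameter model -> 112 GB
```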
Question 7
In pipeline parallelism, what is the "bubble" problem?
A) Memory overflow from large activations B) Idle time when some pipeline stages have no work because they are waiting for inputs or gradients C) Communication bottleneck between nodes D) Gradient explosion in deep networks
Answer: B Explanation: In pipeline parallelism, stages wait for inputs from the previous stage (forward) or gradients from the next stage (backward). This idle time is called the pipeline bubble. Micro-batching reduces but does not eliminate it.
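For a simple GPipe-style schedule, the idle fraction is commonly estimated as (p - 1) / (m + p - 1) for p stages and m micro-batches; a quick check of that estimate:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Standard GPipe-style estimate of pipeline idle time."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(4, 1))    # 0.75  -- no micro-batching, mostly idle
print(bubble_fraction(4, 16))   # ~0.16 -- more micro-batches shrink the bubble
```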
Question 8
What is the key difference between tensor parallelism and pipeline parallelism?
A) Tensor parallelism splits data, pipeline parallelism splits the model B) Tensor parallelism splits individual layers across GPUs, pipeline parallelism assigns different layers to different GPUs C) They are identical approaches with different names D) Tensor parallelism is only for CNNs, pipeline parallelism is for transformers
Answer: B Explanation: Tensor parallelism partitions the computation within a single layer (e.g., splitting weight matrices column-wise or row-wise). Pipeline parallelism partitions the model layer-by-layer, assigning groups of layers to different GPUs.
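A toy, single-device illustration of the column-wise split used in tensor parallelism (real implementations such as Megatron-LM place each shard on a different GPU and gather the partial outputs with a collective):

```python
import torch

x = torch.randn(2, 8)                      # input activations
w = torch.randn(8, 16)                     # full weight matrix

w_shard_a, w_shard_b = w.chunk(2, dim=1)   # each "GPU" holds half the columns
y_a = x @ w_shard_a                        # computed on GPU 0
y_b = x @ w_shard_b                        # computed on GPU 1
y = torch.cat([y_a, y_b], dim=1)           # all-gather recombines the outputs

assert torch.allclose(y, x @ w, atol=1e-5)
```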
Question 9
What does torch.cuda.amp.autocast do?
A) Automatically selects the optimal GPU for training B) Runs operations in lower precision (float16 or bfloat16) where safe, while keeping critical operations in float32 C) Casts all tensors to float64 for maximum precision D) Enables asynchronous computation across GPUs
Answer: B Explanation: autocast automatically runs matrix multiplications and convolutions in float16/bfloat16 for speed, while keeping reductions, normalization, and loss computation in float32 for numerical stability.
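A minimal autocast sketch; recent PyTorch also spells this torch.amp.autocast("cuda", ...):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)                      # the matmul runs in float16
    loss = y.float().pow(2).mean()    # a toy loss; reductions kept in float32
```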
Question 10
Why does mixed precision training use a gradient scaler?
A) To reduce memory usage B) To prevent float16 gradients from underflowing to zero by scaling the loss before backward, then unscaling gradients before the optimizer step C) To speed up the optimizer step D) To normalize gradients across GPUs
Answer: B Explanation: Float16 has a limited dynamic range. Small gradient values can underflow to zero. The gradient scaler multiplies the loss by a large factor before backward (preventing underflow), then divides gradients by the same factor before the optimizer step.
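A sketch of the standard scaler loop, assuming `model`, `criterion`, `optimizer`, and `dataloader` are defined; newer PyTorch exposes the same class as torch.amp.GradScaler("cuda"):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss so fp16 grads do not underflow
    scaler.unscale_(optimizer)         # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)             # skips the step if inf/nan gradients were found
    scaler.update()                    # adapts the scale factor for the next iteration
```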
Question 11
What communication backend is recommended for GPU-to-GPU distributed training?
A) Gloo B) MPI C) NCCL D) TCP
Answer: C Explanation: NCCL (NVIDIA Collective Communications Library) is optimized for NVIDIA GPU-to-GPU communication, utilizing NVLink and InfiniBand when available. It provides the highest bandwidth for GPU collective operations.
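Typical initialization when launched with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")               # NCCL for GPU collectives
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # bind each process to its GPU
# Gloo remains the usual fallback for CPU tensors or debugging without GPUs.
```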
Question 12
What is activation checkpointing (gradient checkpointing)?
A) Saving model checkpoints during training B) Discarding intermediate activations during forward pass and recomputing them during backward, trading compute for memory C) Checking that activations are within valid ranges D) Saving activations to disk for later analysis
Answer: B Explanation: Activation checkpointing saves memory by not storing all intermediate activations. Instead, it recomputes them during the backward pass from saved checkpoints. This trades approximately 33% more compute for significant memory savings.
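A small sketch using torch.utils.checkpoint.checkpoint_sequential, which stores activations only at segment boundaries and recomputes the rest during backward (the layer sizes below are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(512, 512) for _ in range(20)]).cuda()
x = torch.randn(8, 512, device="cuda", requires_grad=True)

# Split into 4 segments; activations inside each segment are recomputed in backward.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```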
Question 13
In ZeRO Stage 1, what is partitioned across GPUs?
A) Parameters only B) Optimizer states only C) Parameters and gradients D) Parameters, gradients, and optimizer states
Answer: B Explanation: ZeRO has three stages. Stage 1 partitions only optimizer states (reducing memory by ~4x for Adam). Stage 2 adds gradient partitioning. Stage 3 adds parameter partitioning (equivalent to FSDP).
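A minimal DeepSpeed config sketch selecting Stage 1 (keys follow the DeepSpeed config schema; batch size and optimizer values are illustrative):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 1},   # 2 adds gradient partitioning, 3 adds parameters
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
# Passed to deepspeed.initialize(model=model, config=ds_config, ...) in a DeepSpeed script.
```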
Question 14
What is DeepSpeed's offloading capability?
A) Offloading data preprocessing to CPUs B) Moving optimizer states and optionally parameters to CPU RAM or NVMe storage when not actively needed on GPU C) Offloading inference to edge devices D) Moving logging data to cloud storage
Answer: B Explanation: DeepSpeed ZeRO-Offload and ZeRO-Infinity can offload optimizer states, gradients, and even parameters to CPU RAM or NVMe storage. This enables training models much larger than GPU memory, at the cost of additional data transfer time.
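Extending the previous sketch with offloading (keys follow the DeepSpeed config schema; values are illustrative):

```python
ds_offload_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Stage 3 additionally supports "offload_param" to CPU or NVMe (ZeRO-Infinity).
    },
}
```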
Question 15
What does torchrun provide for distributed training?
A) A faster optimizer for distributed training B) Automatic process launching, fault tolerance, and elastic scaling for distributed PyTorch training C) GPU memory optimization D) Model parallelism implementation
Answer: B Explanation: torchrun (torch.distributed.run) handles process launching, environment variable setup, fault detection, and worker restart for distributed PyTorch training. It replaces the older torch.distributed.launch utility.
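A typical launch command and the matching script-side setup (host name, port, and node counts are placeholders):

```python
# Shell launch, e.g. 2 nodes with 8 GPUs each:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=host0:29500 train.py
#
# Inside train.py, the environment variables set by torchrun drive the setup:
import os
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # reads RANK, WORLD_SIZE, MASTER_ADDR/PORT
local_rank = int(os.environ["LOCAL_RANK"])   # per-node GPU index provided by torchrun
```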
Scoring Guide
| Score | Level | Recommendation |
|---|---|---|
| 14-15 | Expert | Ready to scale training to multi-node clusters |
| 11-13 | Advanced | Strong understanding, practice with real multi-GPU setups |
| 8-10 | Intermediate | Good grasp of concepts, implement DDP and FSDP |
| 5-7 | Developing | Review parallelism strategies and memory analysis |
| 0-4 | Beginning | Re-read the chapter focusing on data parallelism and communication |