Chapter 12: Quiz

Test your understanding of training deep networks. Each question has a single best answer unless otherwise stated.


Question 1

Why should you not apply `torch.nn.Softmax` before passing logits to `torch.nn.CrossEntropyLoss`?

Show Answer `CrossEntropyLoss` applies `log_softmax` internally, using the log-sum-exp trick for numerical stability: it subtracts the maximum logit before exponentiation, keeping every intermediate value in a numerically safe range. Computing `log(softmax(z))` as two separate steps loses this protection; for very large or very small logits, softmax can produce values that are exactly 0.0 or 1.0 due to floating-point precision limits, and `log(0)` is `-inf`. Additionally, passing already-softmaxed probabilities to `CrossEntropyLoss` means softmax is effectively applied twice (once by you, once inside the loss), which compresses the outputs and produces incorrect gradients.
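
A minimal sketch contrasting the two paths (the logits are made up to exaggerate the effect):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[20.0, 0.0, -20.0]])   # extreme but plausible logits
target = torch.tensor([0])

criterion = nn.CrossEntropyLoss()

# Correct: pass raw logits; the loss applies log_softmax internally.
loss_ok = criterion(logits, target)

# Wrong: softmax first, so softmax is effectively applied twice inside the loss.
probs = nn.Softmax(dim=1)(logits)
loss_bad = criterion(probs, target)

print(loss_ok.item(), loss_bad.item())   # the two values disagree substantially
```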

Question 2

Which optimizer applies weight decay directly to the parameters rather than adding an L2 penalty to the loss gradient?

A) SGD
B) Adam
C) AdamW
D) RMSprop

Show Answer **C) AdamW**. AdamW implements *decoupled weight decay*, which subtracts a fraction of the parameter value directly during the update step: $\theta_{t+1} = \theta_t - \eta(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t)$. In contrast, standard Adam with L2 regularization adds the penalty to the gradient, which then gets scaled by Adam's adaptive learning rate, effectively reducing the regularization strength for parameters with large gradients.
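
A minimal sketch of the two choices on a placeholder model (hyperparameters are arbitrary):

```python
import torch

model = torch.nn.Linear(128, 10)

# torch.optim.Adam folds weight_decay into the gradient as an L2 penalty,
# so the penalty is rescaled by the adaptive denominator.
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# torch.optim.AdamW applies decoupled weight decay directly to the parameters.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```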

Question 3

You are training a transformer model. The loss decreases for the first 100 steps, then suddenly becomes NaN. What should be your first diagnostic step?

A) Increase the model size
B) Switch from Adam to SGD
C) Check gradient norms and add gradient clipping
D) Double the batch size

Show Answer **C) Check gradient norms and add gradient clipping.** A sudden NaN loss typically indicates a gradient explosion---gradients become so large that the parameter update produces inf or NaN values. The first step is to monitor gradient norms to confirm this, then add gradient clipping (e.g., `clip_grad_norm_` with `max_norm=1.0`). Other causes include division by zero in the loss computation or NaN values in the input data, but gradient explosion is the most common cause when loss was initially decreasing normally.
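
A sketch of the diagnostic on a placeholder model and synthetic batch; `clip_grad_norm_` returns the pre-clipping norm, so it can be logged directly:

```python
import torch

model = torch.nn.Linear(16, 4)                        # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

# Returns the total gradient norm *before* clipping, so it doubles as a monitor.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if not torch.isfinite(total_norm):
    print(f"non-finite gradient norm: {total_norm.item()}")

optimizer.step()
optimizer.zero_grad()
```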

Question 4

What is the purpose of bias correction in the Adam optimizer?

Show Answer Adam's first moment $m_t$ and second moment $v_t$ are initialized to zero vectors. Because they are updated via exponential moving averages ($m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$), both estimates are biased toward zero in the early steps of training. Bias correction divides by $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ respectively, which removes this initialization bias. Without it, the early updates would be distorted: with the defaults $\beta_2 = 0.999 > \beta_1 = 0.9$, the second moment is underestimated much more severely than the first, so the uncorrected ratio $\frac{m_t}{\sqrt{v_t}}$ (and hence the effective step size) would be several times larger than intended, destabilizing early training. The bias correction terms approach 1 as $t$ grows, so the correction is only significant in early training.
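
A quick numerical check at $t = 1$ with the default betas (the gradient value 0.5 is arbitrary):

```python
beta1, beta2, eps, g = 0.9, 0.999, 1e-8, 0.5

# First update (t = 1), starting from m_0 = v_0 = 0
m = (1 - beta1) * g          # 0.05
v = (1 - beta2) * g ** 2     # 0.00025

raw = m / (v ** 0.5 + eps)                          # ~3.16: uncorrected ratio is inflated
m_hat, v_hat = m / (1 - beta1 ** 1), v / (1 - beta2 ** 1)
corrected = m_hat / (v_hat ** 0.5 + eps)            # ~1.0: the intended magnitude
print(raw, corrected)
```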

Question 5

Which initialization scheme should you use for a network with ReLU activations?

A) Xavier (Glorot) uniform
B) Xavier (Glorot) normal
C) He (Kaiming) normal
D) All zeros

Show Answer **C) He (Kaiming) normal.** ReLU zeroes out approximately half of its inputs, effectively halving the variance of activations at each layer. He initialization accounts for this by using variance $\frac{2}{n_{\text{in}}}$ instead of Xavier's $\frac{1}{n_{\text{in}}}$. The factor of 2 compensates for the variance reduction caused by ReLU. Xavier initialization was derived assuming linear or symmetric activations (like tanh) and leads to vanishing activations in deep ReLU networks.
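
A minimal sketch of applying it to a layer (the layer sizes are arbitrary):

```python
import torch.nn as nn

layer = nn.Linear(512, 512)
# He/Kaiming init sized for ReLU: Var(W) = 2 / fan_in
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
```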

Question 6

In mixed precision training, why are model weights kept in FP32 ("master weights") even though forward/backward passes use FP16?

Show Answer Weight updates involve adding a small gradient value (often $10^{-5}$ or smaller after scaling by learning rate) to a large weight value (often around $10^{-1}$ to $10^{0}$). FP16 has only about 3-4 decimal digits of precision, so adding a value smaller than roughly $10^{-4}$ to a value of 1.0 has no effect---the small value is simply rounded away. This means many gradient updates would be silently discarded in FP16, preventing the model from learning. FP32 master weights have about 7 decimal digits of precision, which is sufficient to accumulate these small updates over many steps.
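
A small demonstration of the rounding effect (values chosen for illustration):

```python
import torch

w16 = torch.tensor(1.0, dtype=torch.float16)
print(w16 + torch.tensor(1e-4, dtype=torch.float16))   # tensor(1., dtype=torch.float16): the update is lost

w32 = torch.tensor(1.0, dtype=torch.float32)
print(w32 + 1e-4)                                        # tensor(1.0001): the update is retained
```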

Question 7

A model has training accuracy of 99% but validation accuracy of 65%. Which combination of techniques is most likely to help?

A) Increase learning rate and remove normalization
B) Add dropout, increase weight decay, and use data augmentation
C) Use a deeper model with more parameters
D) Switch from Adam to SGD

Show Answer **B) Add dropout, increase weight decay, and use data augmentation.** The large gap between training and validation accuracy (99% vs. 65%) is a clear sign of overfitting---the model has memorized the training data instead of learning generalizable features. The correct response is to add regularization: dropout prevents co-adaptation of neurons, weight decay penalizes large weights, and data augmentation increases the effective size of the training set. Increasing model size (C) would likely worsen overfitting. Changing the optimizer (D) might help marginally but does not address the core issue.
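
A sketch of what that combination can look like in code; the architecture and hyperparameters here are illustrative, not a recommendation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # dropout: prevents co-adaptation of neurons
    nn.Linear(256, 10),
)
# Stronger weight decay via AdamW; data augmentation would be applied in the
# dataset/transform pipeline (e.g. random crops and flips for images).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```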

Question 8

When should you call `model.eval()` during a training pipeline? Select all that apply.

A) Before computing training loss
B) Before computing validation loss
C) Before saving a model checkpoint
D) Before running inference on test data

Show Answer **B and D.** `model.eval()` switches the model to evaluation mode, which changes the behavior of batch normalization (use running statistics instead of batch statistics) and dropout (disabled). This should be called before any evaluation---both validation during training (B) and final test inference (D). You should NOT call it before training loss computation (A), as training requires batch statistics and active dropout. Calling it before saving a checkpoint (C) is not strictly necessary since the checkpoint saves the model state dict, not the mode. After evaluation, always call `model.train()` to resume training.
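
A sketch of the usual pattern around validation (placeholder model and data):

```python
import torch

model = torch.nn.Linear(16, 4)                                    # placeholder model
val_loader = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))]    # placeholder data

model.eval()                      # B/D: running BN stats, dropout disabled
with torch.no_grad():
    for x, y in val_loader:
        val_loss = torch.nn.functional.cross_entropy(model(x), y)

model.train()                     # switch back before resuming training
```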

Question 9

What is the difference between a per-epoch learning rate scheduler and a per-step scheduler?

Show Answer A **per-epoch scheduler** (e.g., `StepLR`, `CosineAnnealingLR`) updates the learning rate once after each complete pass through the training data. You call `scheduler.step()` in the outer epoch loop. A **per-step scheduler** (e.g., `OneCycleLR`, custom warmup schedules) updates the learning rate after every mini-batch. You call `scheduler.step()` inside the inner batch loop. Confusing the two is a common bug: calling a per-epoch scheduler every step causes the learning rate to decay far too quickly (e.g., finishing the entire cosine decay in the first epoch), while calling a per-step scheduler only once per epoch causes the learning rate to barely change.
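
A sketch showing where each kind of `scheduler.step()` call belongs (placeholder model, data, and epoch counts):

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]

epoch_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        # a per-step scheduler (e.g. OneCycleLR) would call .step() here
    epoch_scheduler.step()        # per-epoch scheduler: once per epoch
```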

Question 10

Which normalization layer is standard for transformer architectures?

A) Batch normalization
B) Layer normalization
C) Group normalization
D) Instance normalization

Show Answer **B) Layer normalization.** Transformers use layer normalization because: (1) it normalizes across the feature dimension independently for each sample, so it does not depend on batch size; (2) its behavior is identical during training and inference, avoiding the train/eval mode discrepancy of batch norm; (3) it works naturally with variable-length sequences, whereas batch norm would mix statistics across different sequence positions; and (4) the original Transformer paper (Vaswani et al., 2017) established this as the standard, and subsequent work confirmed its effectiveness.

Question 11

What does `optimizer.zero_grad(set_to_none=True)` do differently from `optimizer.zero_grad()`?

Show Answer `optimizer.zero_grad(set_to_none=False)` writes zeros into each parameter's existing `.grad` tensor. `optimizer.zero_grad(set_to_none=True)` sets the `.grad` attribute to `None` instead. Setting to `None` is slightly more memory-efficient (no zero tensor needs to be kept or written) and can be marginally faster, since the next backward pass assigns a fresh gradient tensor rather than accumulating into a zeroed one. However, it requires that downstream code checks for `None` gradients rather than assuming gradients are always tensors. In modern PyTorch (2.0 and later), `set_to_none=True` is the default and the recommended practice, and it is compatible with all standard optimizers and gradient clipping functions.

Question 12

In the context of gradient clipping by norm with `max_norm=1.0`, if the global gradient norm is 5.0, by what factor are all gradients scaled?

A) 0.2
B) 0.5
C) 1.0
D) 5.0

Show Answer **A) 0.2.** Gradient clipping by norm scales all gradients by $\frac{\text{max\_norm}}{\|g\|} = \frac{1.0}{5.0} = 0.2$ when the global norm exceeds the threshold. This preserves the direction of the gradient vector while reducing its magnitude to exactly `max_norm`. Each individual gradient component is multiplied by 0.2, so the resulting global norm equals 1.0.

Question 13

You have a GPU with 8 GB of memory. Your model requires 6 GB in FP32. Which technique would allow you to train with a larger batch size?

A) Gradient clipping
B) Mixed precision training
C) Weight initialization
D) Label smoothing

Show Answer **B) Mixed precision training.** FP16 activations use half the memory of FP32, so mixed precision reduces activation memory by approximately 50%. This frees up GPU memory that can be used for larger batch sizes. Note that the model parameters themselves still need FP32 master weights, so the parameter memory savings are smaller, but activation memory (which scales with batch size) is the dominant factor during training. Gradient clipping (A) does not affect memory. Weight initialization (C) and label smoothing (D) do not affect memory usage.

Question 14

What does the `GradScaler` do when it detects inf or NaN in the gradients?

Show Answer When `GradScaler` detects inf or NaN values in the unscaled gradients, it **skips the optimizer step entirely** (does not update model parameters) and **reduces the loss scale factor** (typically by half). This prevents corrupted gradients from ruining the model weights. In subsequent steps, if no inf/NaN is detected for a configurable number of consecutive steps (default: 2000), the scaler **increases the loss scale** (typically by a factor of 2) to maximize the dynamic range used. This dynamic scaling ensures that the loss scale is as large as possible without causing overflow.
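
A sketch of the standard AMP training step in which this logic runs, assuming a CUDA device is available (placeholder model and data):

```python
import torch

device = "cuda"
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()   # backward on the scaled loss
scaler.step(optimizer)          # unscales grads; skips the step if inf/NaN is found
scaler.update()                 # shrinks or grows the loss scale accordingly
optimizer.zero_grad()
print(scaler.get_scale())
```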

Question 15

Which of the following is the correct order of operations in a training step?

A) forward pass -> zero_grad -> backward -> step
B) zero_grad -> forward pass -> backward -> step
C) zero_grad -> backward -> forward pass -> step
D) forward pass -> backward -> zero_grad -> step

Show Answer **B) zero_grad -> forward pass -> backward -> step.** The canonical order is: (1) zero out gradients left over from the previous step, (2) compute the forward pass to get predictions and loss, (3) compute gradients via backpropagation, (4) update parameters with the optimizer. Option C computes backward before forward, which is impossible. Option D zeros the gradients after they have been computed but before the optimizer uses them, discarding the current step's gradients. Option A (zeroing between forward and backward) actually produces correct updates, since gradients are only created during backward, but zeroing at the top of the loop is the standard idiom and keeps variants such as gradient accumulation easy to reason about.
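
A minimal training step in that order (placeholder model and data):

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

optimizer.zero_grad()                                   # (1) clear stale gradients
loss = torch.nn.functional.cross_entropy(model(x), y)   # (2) forward pass
loss.backward()                                         # (3) backpropagation
optimizer.step()                                        # (4) parameter update
```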

Question 16

What is the "overfit-one-batch" test and why is it useful?

Show Answer The overfit-one-batch test takes a single mini-batch from the training set and trains the model on only that batch for many iterations, without any regularization. The model should be able to achieve near-zero training loss and near-100% accuracy on this single batch. If it cannot, there is a bug in the code---the model architecture, loss function, optimizer, or data pipeline is broken. This test is useful because it is fast (only one batch), eliminates data complexity as a variable, and catches fundamental bugs like mismatched dimensions, incorrect loss functions, softmax-before-cross-entropy errors, and wrong learning rates. It should be the first thing you try when debugging a new training pipeline.
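
A sketch of the test on a placeholder model and a single synthetic batch:

```python
import torch

model = torch.nn.Linear(16, 4)                            # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))      # one fixed mini-batch

for _ in range(1000):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())   # should approach zero; if not, something in the pipeline is broken
```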

Question 17

Cosine annealing with $T_{\max} = 100$ and $\eta_{\min} = 0$ starts with $\eta_{\max} = 0.1$. What is the learning rate at step $t = 50$?

A) 0.1
B) 0.075
C) 0.05
D) 0.025

Show Answer **C) 0.05.** Using the cosine annealing formula: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{t}{T_{\max}}\pi))$. At $t = 50$: $\eta_{50} = 0 + \frac{1}{2}(0.1 - 0)(1 + \cos(\frac{50}{100}\pi)) = 0.05 \times (1 + \cos(\frac{\pi}{2})) = 0.05 \times (1 + 0) = 0.05$. Intuitively, at the midpoint of cosine annealing, the learning rate is exactly halfway between the maximum and minimum.

Question 18

Why should biases and normalization layer parameters typically be excluded from weight decay?

Show Answer Weight decay penalizes large parameter values, pushing them toward zero. This is appropriate for weight matrices, where large values can indicate overfitting. However, **biases** serve as offsets that allow the model to fit data that is not centered at zero---penalizing them restricts the model's representational capacity without providing regularization benefit. **Normalization layer parameters** ($\gamma$ for scale, $\beta$ for shift) are designed to undo normalization when beneficial. Applying weight decay to $\gamma$ pushes it toward zero, which would effectively disable the layer's output, and applying decay to $\beta$ prevents the model from learning appropriate feature shifts. Empirically, excluding these parameters from weight decay consistently improves performance.
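
A sketch of the usual way to split parameters into groups; the 1-D/`.bias` filter is a common heuristic, not the only possible one:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.ReLU(), nn.Linear(32, 4))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Biases and normalization parameters (gamma/beta) are 1-D; weight matrices are 2-D.
    if param.ndim == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 1e-2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)
```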

Question 19

What problem does the symmetry-breaking property of weight initialization address?

Show Answer If all weights in a layer are initialized to the same value (e.g., all zeros), then every neuron in that layer computes exactly the same output for any input. During backpropagation, every neuron receives exactly the same gradient. After the update, all weights are still identical. This symmetry is never broken, so the layer effectively has only one neuron regardless of its width---all neurons are redundant copies. Random initialization breaks this symmetry by giving each neuron different initial weights, so they compute different functions, receive different gradients, and specialize to different features during training.
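
A small demonstration, assuming a toy linear layer initialized to a constant:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 3, bias=False)
nn.init.constant_(layer.weight, 0.5)   # every neuron starts identical

x = torch.randn(8, 4)
out = layer(x)                         # all 3 output columns are identical
out.sum().backward()
print(layer.weight.grad)               # all 3 rows of the gradient are identical too
```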

Question 20

In a linear warmup schedule over 1000 steps with a target learning rate of $3 \times 10^{-4}$, what is the learning rate at step 200?

A) $3 \times 10^{-5}$
B) $6 \times 10^{-5}$
C) $1.5 \times 10^{-4}$
D) $3 \times 10^{-4}$

Show Answer **B) $6 \times 10^{-5}$.** Linear warmup computes: $\eta_t = \eta_{\text{target}} \times \frac{t}{T_{\text{warmup}}} = 3 \times 10^{-4} \times \frac{200}{1000} = 3 \times 10^{-4} \times 0.2 = 6 \times 10^{-5}$. The learning rate increases linearly from 0 to the target rate over the warmup period.
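
A sketch using `LambdaLR`, where the lambda returns a multiplier on the base learning rate (placeholder model):

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, step / warmup_steps)
)

for _ in range(200):
    # ... forward / backward / optimizer.step() would go here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])   # 6e-05
```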

Question 21

Which of the following is NOT a benefit of batch normalization?

A) Allows higher learning rates
B) Eliminates the need for weight initialization
C) Acts as a mild regularizer
D) Smooths the loss landscape

Show Answer **B) Eliminates the need for weight initialization.** While batch normalization reduces sensitivity to weight initialization (because it re-normalizes activations at each layer), it does not eliminate the need for initialization entirely. Extremely poor initialization can still cause problems, and the first forward pass before normalization statistics are established still depends on initialization. The other three options are genuine benefits: BN allows higher learning rates by stabilizing activation distributions (A), the noise from using batch statistics acts as regularization (C), and BN has been shown to smooth the loss landscape, making optimization easier (D).

Question 22

What happens if you forget to call `model.eval()` before validation?

Show Answer Two things go wrong: (1) **Batch normalization** uses the current batch's mean and variance instead of the running statistics accumulated during training. This means validation results depend on the specific batch of data being evaluated, and if the validation batch distribution differs from training, the normalization will be incorrect. With a batch size of 1, batch norm becomes meaningless. (2) **Dropout** remains active, randomly zeroing neurons during evaluation. This introduces noise into predictions and typically reduces accuracy. The combined effect is that validation metrics will be inconsistent (varying between evaluations of the same data), lower than actual model performance, and unreliable for model selection or early stopping decisions.

Question 23

Consider focal loss with $\gamma = 2$ and a correctly classified example with predicted probability $\hat{p}_t = 0.9$. By what factor is the loss reduced compared to standard cross-entropy?

A) $(0.1)^2 = 0.01$
B) $(0.9)^2 = 0.81$
C) $0.1$
D) $0.9$

Show Answer **A) $(0.1)^2 = 0.01$.** Focal loss multiplies the standard cross-entropy by $(1 - \hat{p}_t)^\gamma$. For $\hat{p}_t = 0.9$ and $\gamma = 2$: $(1 - 0.9)^2 = (0.1)^2 = 0.01$. So the loss contribution of this easy, correctly classified example is reduced to 1% of what it would be under standard cross-entropy. This is the key mechanism of focal loss: it dramatically down-weights easy examples, allowing the model to focus training on hard, misclassified examples. For a hard example with $\hat{p}_t = 0.1$, the factor would be $(0.9)^2 = 0.81$, barely reducing the loss.
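
A sketch of multi-class focal loss built on top of PyTorch's cross-entropy, omitting the optional class-balancing $\alpha$ term:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-example cross-entropy
    p_t = torch.exp(-ce)                                       # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()                  # down-weight easy examples

logits = torch.tensor([[3.0, 0.0], [0.1, 0.0]])
targets = torch.tensor([0, 0])
print(focal_loss(logits, targets))
```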

Question 24

What is the primary difference between `DataParallel` and `DistributedDataParallel` in PyTorch?

Show Answer `DataParallel` (DP) uses a single process with multiple threads and replicates the model on each GPU every forward pass. It gathers outputs and scatters gradients through a single "master" GPU, creating a communication bottleneck. `DistributedDataParallel` (DDP) uses one process per GPU, each with its own model replica that persists across iterations. Gradients are synchronized using efficient all-reduce operations (via NCCL backend) that overlap with backward computation. DDP is significantly faster (near-linear scaling with GPU count), avoids Python's GIL bottleneck, and is the recommended approach for multi-GPU training. The only trade-off is slightly more complex setup code (process group initialization, distributed sampler).
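
A minimal DDP sketch, assuming a single machine and a launch via `torchrun --nproc_per_node=<num_gpus> train.py` (the model is a placeholder):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 4).to(local_rank)   # one persistent replica per process
model = DDP(model, device_ids=[local_rank])
# A DistributedSampler would be attached to the DataLoader so each process
# sees a distinct shard of the dataset.
```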

Question 25

You are training with gradient accumulation over 4 steps. The loss for each micro-batch is 2.0, 2.5, 1.8, and 2.2. What value should be used for the optimizer step?

A) The sum: 8.5
B) The average: 2.125
C) The last value: 2.2
D) The maximum: 2.5

Show Answer **B) The average: 2.125.** When using gradient accumulation, you must divide each micro-batch loss by the number of accumulation steps before calling backward. This ensures that the accumulated gradients approximate the gradient of a single large batch of size `batch_size * accumulation_steps`. If you accumulate without dividing, the effective learning rate is multiplied by the number of accumulation steps, leading to instability. In code: `loss = criterion(outputs, targets) / accumulation_steps` before `loss.backward()`.
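
A sketch of the accumulation loop (placeholder model and synthetic micro-batches):

```python
import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
accumulation_steps = len(micro_batches)

optimizer.zero_grad()
for x, y in micro_batches:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()   # scale so grads match one large batch
optimizer.step()
```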