Chapter 7: Quiz
Test your understanding of deep network training techniques. Answers follow each question.
Question 1
What problem does random weight initialization solve, and why can't all weights be initialized to zero?
Answer
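To make the symmetry problem concrete, here is a minimal numpy sketch (the layer sizes, the constant init value 0.5, and the data are all illustrative): when every weight starts at the same constant, the gradient columns for all hidden units are identical, so the units can never differentiate. With an all-zero init, the gradients below are not just identical but zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))                    # toy batch
y = rng.normal(size=(8, 1))                    # toy targets

# Every weight starts at the same constant (0.5 is arbitrary):
W1 = np.full((2, 3), 0.5)
W2 = np.full((3, 1), 0.5)

h = np.tanh(x @ W1)                            # all hidden columns identical
err = h @ W2 - y

# Manual backprop for the MSE loss:
dW2 = h.T @ err / len(x)
dh = err @ W2.T * (1 - h**2)
dW1 = x.T @ dh / len(x)

# Each hidden unit receives the same gradient, so they stay identical forever.
print(dW1)                                     # all three columns are equal
```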
Random initialization solves the **symmetry breaking** problem. If all weights are initialized to the same value (including zero), every neuron in a layer computes the same function, receives the same gradient, and remains identical throughout training. The network effectively has only one neuron per layer regardless of its width. Random initialization gives each neuron a different starting point, allowing them to specialize during training.
Question 2
Xavier initialization sets weight variance to $\frac{2}{n_{\text{in}} + n_{\text{out}}}$. Why is it the harmonic mean of $1/n_{\text{in}}$ and $1/n_{\text{out}}$ rather than simply $1/n_{\text{in}}$?
Answer
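The harmonic-mean identity and the resulting compromise can be checked numerically (a numpy sketch; the layer sizes are arbitrary, not from the chapter):

```python
import numpy as np

n_in, n_out = 400, 200
a, b = 1 / n_in, 1 / n_out
var_w = 2 / (n_in + n_out)
assert np.isclose(var_w, 2 * a * b / (a + b))  # harmonic mean of 1/n_in, 1/n_out

rng = np.random.default_rng(0)
W = rng.normal(0.0, np.sqrt(var_w), size=(n_in, n_out))
x = rng.normal(size=(10_000, n_in))            # unit-variance inputs

# Forward-pass output variance: n_in * Var(w) = 2*n_in/(n_in + n_out),
# which is close to (but not exactly) 1 whenever n_in != n_out.
out = x @ W
print(out.var())                               # ≈ 1.33 for these sizes
```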
Variance preservation is needed in both directions: the forward pass (activations) and the backward pass (gradients). The forward pass requires $\text{Var}(w) = 1/n_{\text{in}}$ to prevent activation explosion/collapse. The backward pass requires $\text{Var}(w) = 1/n_{\text{out}}$ to prevent gradient explosion/collapse. Since both conditions cannot be satisfied simultaneously when $n_{\text{in}} \neq n_{\text{out}}$, Glorot and Bengio proposed the compromise $\text{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}})$, which balances forward and backward stability.
Question 3
Why does He initialization use $\text{Var}(w) = 2/n_{\text{in}}$ instead of $1/n_{\text{in}}$?
Answer
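Both halves of the argument (ReLU halving the second moment, and the factor of 2 compensating for it) can be checked with a numpy sketch; the depth and width here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# ReLU halves the second moment of a zero-mean Gaussian input:
z = rng.normal(size=1_000_000)
print((np.maximum(z, 0) ** 2).mean())          # ≈ 0.5

# With He init, the activation scale survives 20 ReLU layers:
n = 512
h = rng.normal(size=(1000, n))
for _ in range(20):
    W = rng.normal(0.0, np.sqrt(2 / n), size=(n, n))   # Var(w) = 2/n_in
    h = np.maximum(h @ W, 0)
print((h ** 2).mean())                         # stays on the order of 1
```

With $\text{Var}(w) = 1/n$ instead, the final activation scale would shrink by roughly $2^{-20}$.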
ReLU sets all negative inputs to zero, which halves the variance of the output: $\text{Var}(\text{ReLU}(z)) = \frac{1}{2}\text{Var}(z)$ for zero-mean Gaussian inputs. He initialization compensates by doubling the weight variance (factor of 2 in the numerator), so that the combined effect of the linear transformation and ReLU preserves the overall activation variance through layers.
Question 4
In batch normalization, the output is $y_i = \gamma \hat{x}_i + \beta$ where $\hat{x}_i$ is the normalized input. If $\gamma$ and $\beta$ can undo the normalization, why normalize at all?
Answer
The learned parameters $\gamma$ and $\beta$ *can* undo the normalization, but they would only do so if that is optimal for the loss. In practice, the network learns $\gamma$ and $\beta$ values that are different from the pre-normalization statistics, meaning normalization changes the effective parameterization. The benefit is that the normalization **reparameterizes the optimization problem** into one with a smoother loss landscape and more predictable gradients. The network retains full representational power (it can represent any function it could before) but the optimization dynamics are improved.
Question 5
A colleague argues that batch normalization works by reducing "internal covariate shift." What is the more recent understanding of why BN actually helps?
Answer
Santurkar et al. (2018) showed that BN does not meaningfully reduce internal covariate shift (measured by changes in activation distributions). Instead, BN helps by **smoothing the loss landscape**: it makes the loss function more Lipschitz continuous, meaning gradients are more predictive of the actual loss change. This makes optimization steps more reliable and allows the use of higher learning rates. The "internal covariate shift" explanation is at best a partial story — networks with artificially injected covariate shift still train faster with BN.
Question 6
Why does batch normalization behave differently during training and inference, and what is the consequence of forgetting to switch modes?
Answer
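The failure mode is easy to reproduce with a stripped-down batch norm in numpy (a minimal sketch, not PyTorch's implementation; the feature count, batch sizes, and input statistics are made up):

```python
import numpy as np

class BatchNorm1d:
    """Minimal per-feature batch norm, illustrating train vs. eval statistics."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:                      # batch statistics
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:                                  # frozen running statistics
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
for _ in range(100):                           # "training": accumulate running stats
    bn(rng.normal(3.0, 2.0, size=(32, 4)))

x = rng.normal(3.0, 2.0, size=(1, 4))          # a single inference sample
train_mode_out = bn(x)                         # forgot to switch modes: batch of 1!
bn.training = False
eval_mode_out = bn(x)                          # correct: uses running averages
print(train_mode_out, eval_mode_out)
```

With a batch of one, the batch mean equals the input itself, so the "train mode" output collapses to $\beta$ (all zeros here) regardless of the input.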
During training, BN uses the current mini-batch's mean and variance. During inference, BN uses running averages (exponential moving averages accumulated during training), because the inference batch may be size 1 or contain non-representative samples. If you forget to call `model.eval()` before inference, BN uses batch statistics from whatever inputs happen to be in the current batch, producing **inconsistent predictions** — the same input can produce different outputs depending on what other inputs are in the batch. This is one of the most common production bugs in deep learning.
Question 7
When would you choose layer normalization over batch normalization?
Answer
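For contrast with batch norm, here is a minimal numpy layer norm (the feature dimension is arbitrary). It normalizes each sample over its own features, so a batch of one is no problem and there is no train/eval distinction:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize across features, independently per sample (no batch statistics)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(1, d))                    # works even with batch size 1
out = layer_norm(x, np.ones(d), np.zeros(d))
print(out.mean(), out.var())                   # ≈ 0 and ≈ 1 for every sample
```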
Choose layer normalization when: (1) training with small batch sizes (where batch statistics are noisy), (2) using transformer or recurrent architectures (where different positions in a sequence have different distributions, making per-position batch statistics meaningless), or (3) when you need identical behavior during training and inference (layer norm has no running statistics and no train/eval mode distinction). Layer norm is the standard for transformers and modern LLMs.
Question 8
Explain the difference between standard dropout (scale at test time) and inverted dropout (scale at train time). Why is inverted dropout preferred in practice?
Answer
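A minimal numpy sketch of inverted dropout (the rate and shapes are illustrative): the expected activation is preserved at train time, and inference is a plain identity:

```python
import numpy as np

def inverted_dropout(x, p, training, rng):
    """Zero units with prob p during training; rescale so E[output] == x."""
    if not training or p == 0.0:
        return x                               # inference: identity, p not needed
    mask = rng.random(x.shape) >= p            # keep with probability 1 - p
    return x * mask / (1.0 - p)                # scale survivors at TRAIN time

rng = np.random.default_rng(0)
x = np.ones(100_000)
out = inverted_dropout(x, p=0.3, training=True, rng=rng)
print(out.mean())                              # ≈ 1.0: expectation preserved
print(inverted_dropout(x, p=0.3, training=False, rng=rng).mean())  # exactly 1.0
```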
**Standard dropout** zeros neurons with probability $p$ during training, then multiplies all activations by $(1-p)$ at inference time to match the expected value. **Inverted dropout** zeros neurons during training and divides surviving activations by $(1-p)$, so inference requires no modification. Inverted dropout is preferred because: (1) the inference code does not need to know the dropout rate, simplifying deployment; (2) no computation is wasted at inference time; and (3) the model can be evaluated with the same code path used for production serving.
Question 9
A model with highly correlated financial features (debt-to-income, credit utilization, total balance) produces unstable predictions when deployed to a new customer segment. How does weight decay address this problem?
Answer
Without weight decay, the model can assign large positive weight to one correlated feature and large negative weight to another — the effects cancel on the training data, but the individual weights are fragile. When the correlations shift slightly in a new segment, the cancellation breaks and predictions swing wildly. Weight decay penalizes total weight magnitude, forcing the model to distribute its weight budget across correlated features rather than concentrating in large-magnitude pairs that cancel. This produces smaller, more stable weights that generalize better across segments.
Question 10
What is the difference between L2 regularization and decoupled weight decay when using the Adam optimizer?
Answer
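The difference is easiest to see in a deliberately contrived single-parameter case with zero data gradient, so the only force on the weight is the decay term (a numpy sketch using common Adam defaults, not a full optimizer): Adam's adaptive scaling normalizes the L2 term away, so $\lambda$ barely matters, while decoupled decay scales with $\lambda$ as intended.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step; `l2` is folded into the gradient, `decoupled_wd` is not."""
    g = g + l2 * w                             # L2: passes through adaptive scaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - lr * decoupled_wd * w              # AdamW: applied directly to weights
    return w, m, v

l2_steps, awd_steps = {}, {}
for lam in (0.01, 1.0):
    w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=lam)
    l2_steps[lam] = 1.0 - w                    # how much the weight shrank
    w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=lam)
    awd_steps[lam] = 1.0 - w

print(l2_steps)   # both ≈ lr: a 100x change in λ has almost no effect
print(awd_steps)  # 1e-5 vs 1e-3: the decay scales with λ, as intended
```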
L2 regularization adds $\lambda w$ to the gradient *before* Adam's adaptive scaling (division by the second moment estimate). This means the effective regularization strength varies across parameters — parameters with large gradients get less regularization. **Decoupled weight decay** (AdamW) applies the decay term directly to the weights, outside the adaptive scaling, so every parameter receives the same regularization strength regardless of its gradient history. In practice, AdamW generalizes better than Adam with L2 regularization, and the hyperparameter $\lambda$ has different optimal ranges ($\sim 10^{-4}$ for L2, $\sim 10^{-2}$ for decoupled).
Question 11
Why does early stopping work as a regularizer?
Answer
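A minimal patience-based early-stopping helper in plain Python (the validation curve below is fabricated to bottom out at epoch 3):

```python
class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.best_epoch = -1

    def step(self, epoch, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0                # improvement: reset the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses: minimum at epoch 3, then rising (overfitting).
losses = [0.9, 0.6, 0.5, 0.45, 0.47, 0.49, 0.52, 0.55, 0.60]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        print(f"stop at epoch {epoch}, restore weights from epoch {stopper.best_epoch}")
        break
```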
Early stopping limits the number of gradient descent steps, which limits how far the model parameters can move from their initial values. For quadratic loss surfaces, Bishop (1995) showed this is mathematically equivalent to L2 regularization: fewer steps = stronger regularization. The model effectively stays in a neighborhood of the initialization, which constrains its complexity. In general, training longer allows the model to fit increasingly fine-grained (and potentially spurious) patterns in the training data, so stopping early prevents this overfitting.
Question 12
Why is learning rate warmup important for Adam and AdamW, particularly at the start of training?
Answer
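A linear warmup schedule is a few lines of plain Python (the base rate and step count are illustrative defaults):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp from ~0 to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(warmup_lr(0), warmup_lr(499), warmup_lr(2000))  # 1e-6, 5e-4, 1e-3
```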
Adam maintains a running estimate of the second moment of gradients ($v_t$), initialized to zero. In early training, $v_t$ is estimated from only a handful of gradients, so it is noisy and often far too small; dividing by it produces disproportionately large parameter changes (bias correction fixes the estimate's expected value, not its variance). Warmup starts with a small learning rate and gradually increases it, giving the second-moment estimate time to stabilize before the learning rate reaches its full magnitude. Without warmup, the early large updates can push the model into a bad region of the loss landscape from which it struggles to recover.
Question 13
Describe the three phases of the one-cycle learning rate policy and the role of each phase.
Answer
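The phases can be sketched as a plain-Python schedule (parameter names loosely follow PyTorch's `OneCycleLR`; this sketch collapses the annealing and annihilation phases into one cosine decay that ends at the annihilation value):

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-2, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """One-cycle schedule: linear warmup, then cosine annealing to a tiny LR."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = int(pct_start * total_steps)
    if step < warmup_steps:                    # phase 1: ramp up to max_lr
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # phases 2-3: cosine decay from max_lr down to the annihilation value
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(one_cycle_lr(0, total))                  # max_lr / div_factor = 4e-4
print(one_cycle_lr(300, total))                # max_lr (the peak)
print(one_cycle_lr(999, total))                # ≈ min_lr (annihilation)
```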
**Phase 1 (warmup, ~30% of training):** Learning rate increases linearly from $\eta_{\max}/\text{div\_factor}$ to $\eta_{\max}$. Purpose: explore the loss landscape broadly, acting as a regularizer by preventing premature convergence to sharp minima. **Phase 2 (annealing, the remaining ~70%):** Learning rate decreases from $\eta_{\max}$ back toward $\eta_{\max}/\text{div\_factor}$ via cosine decay. Purpose: refine the solution by making smaller, more precise updates. **Phase 3 (annihilation, the final few percent of the schedule):** Learning rate continues falling to a very small value (e.g. $\eta_{\max}/10^4$). Purpose: fine convergence near the minimum. Momentum follows the inverse schedule (low when LR is high, high when LR is low) to maintain a consistent effective step size.
Question 14
Gradient clipping by global norm preserves the gradient direction, while clipping by value does not. Why does direction preservation matter?
Answer
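The two clipping modes contrasted in a numpy sketch (the gradient values are chosen so the norms are easy to read):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale ALL gradients by one shared factor so the global norm <= max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, clipped[0])            # 13.0; still proportional to [3, 4]

value_clipped = [np.clip(g, -1.0, 1.0) for g in grads]
print(value_clipped[0])            # [1, 1]: the 3:4 direction is destroyed
```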
The gradient direction indicates the steepest descent direction in the loss landscape. Clipping by global norm scales all gradient components by the same factor, shrinking the step size while maintaining the descent direction. Clipping by value independently clamps each component, which changes the *direction* of the gradient — the resulting update may no longer point toward a local minimum. Direction preservation means gradient clipping acts purely as a step-size limiter, which is a much safer modification to the optimization process.
Question 15
Explain how gradient accumulation simulates a larger batch size. Why must the loss be divided by the number of accumulation steps?
Answer
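The equivalence can be verified exactly for a linear model with MSE loss (a numpy sketch; the batch size, $k$, and the model are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

def grad_mse(Xb, yb, w):
    """Gradient of the mean-squared-error loss for a linear model."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one pass:
g_full = grad_mse(X, y, w)

# Gradient accumulation over k micro-batches, dividing each loss by k
# (equivalently: dividing each micro-batch gradient by k before summing):
k = 4
g_accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, k), np.split(y, k)):
    g_accum += grad_mse(Xb, yb, w) / k         # the "loss / k" trick

print(np.allclose(g_full, g_accum))            # True
```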
Gradient accumulation runs multiple forward-backward passes, accumulating (summing) gradients before executing an optimizer step. This produces the same total gradient as a single forward-backward pass on a batch $k$ times larger. The loss must be divided by $k$ (the number of accumulation steps) because each `loss.backward()` call *adds* to the existing gradients. Without the division, the accumulated gradient would be $k$ times larger than the true large-batch gradient. Dividing the loss by $k$ at each step ensures the final accumulated gradient has the correct scale: $\frac{1}{k}\sum_{i=1}^{k} \nabla L_i = \nabla L_{\text{big batch}}$.
Question 16
What is the key difference between fp16 and bf16, and why does bf16 often not need loss scaling?
Answer
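The fp16 side is easy to demonstrate with numpy (stock numpy has no bfloat16 dtype, so only the fp16 underflow/overflow behavior is shown; the 1024x loss scale is an arbitrary example):

```python
import numpy as np

# fp16's limited range: small gradients underflow, large values overflow.
tiny = np.array(1e-8, dtype=np.float32)
print(tiny.astype(np.float16))                 # 0.0: underflows in fp16
print((tiny * 1024).astype(np.float16))        # ~1e-5: loss scaling rescues it
print(np.array(70000.0, dtype=np.float16))     # inf: exceeds the fp16 max (65504)
```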
**fp16** has 5 exponent bits and 10 mantissa bits (range $\pm 6.5 \times 10^4$, precision ~3.3 decimal digits). **bf16** has 8 exponent bits and 7 mantissa bits (range $\pm 3.4 \times 10^{38}$, precision ~2.4 decimal digits). The critical difference is **dynamic range**. bf16 has the same range as fp32 (same 8 exponent bits), so gradients rarely underflow or overflow. fp16's limited range means small gradients (common in deep networks) underflow to zero, which is why loss scaling is needed. bf16 sacrifices precision for range — a trade-off that is usually acceptable because gradient precision matters less than gradient existence.
Question 17
Which of the following operations should remain in fp32 even during mixed precision training, and why?
(a) Matrix multiplications in linear layers
(b) Loss computation (reduction over predictions)
(c) ReLU activations
(d) Softmax
(e) Batch normalization variance computation
Answer
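The softmax case can be demonstrated directly in fp16 with numpy (the logits are made up; the max-subtraction in the stable version is the standard fix):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())            # shift so the largest exponent is exp(0)
    return e / e.sum()

logits = np.array([12.0, 1.0, 0.5], dtype=np.float16)
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(logits))       # exp(12) ≈ 163k > 65504: inf/inf -> nan
print(softmax_stable(logits))          # finite, sums to ~1
```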
**(b) Loss computation** — reductions (sums/means) over large tensors accumulate rounding errors in low precision. **(d) Softmax** — involves exponentials that can overflow in fp16 (max fp16 value is 65,504; $e^{11} \approx 60,000$ already approaches the limit). **(e) Batch normalization variance computation** — the variance formula involves squared differences and averaging, which is numerically sensitive in low precision. **(a) Matrix multiplications** can safely run in fp16/bf16 because modern GPUs have hardware support (Tensor Cores) that performs the multiply-accumulate in higher precision internally. **(c) ReLU** is a simple comparison and is safe in any precision.
Question 18
Your training loss is decreasing smoothly, but the validation loss plateaus after 5 epochs and then slowly increases. The training loss reaches 0.001 while the validation loss stabilizes around 0.350. Diagnose the problem and propose a specific sequence of fixes.
Answer
This is classic **overfitting** — the model memorizes the training data without generalizing. Sequence of fixes (in order of invasiveness): 1. **Early stopping** with patience 5-10 — restore the model from the epoch where validation loss was lowest. 2. **Increase dropout** — start with 0.3 and increase to 0.5 if needed. 3. **Increase weight decay** — try $10^{-3}$ to $10^{-1}$ with AdamW. 4. **Data augmentation** — if applicable to the domain. 5. **Reduce model size** — fewer layers or smaller hidden dimensions. 6. **Get more training data** — the most reliable fix but often the most expensive. Apply these one at a time and monitor the train-val gap at each step. If the gap closes but both losses are high, you have shifted to underfitting and need to relax the regularization.
Question 19
You observe that the gradient norms for layers 1-3 (closest to the input) are $10^{-7}$, while the gradient norms for layers 8-10 (closest to the output) are $10^{-1}$. What is this phenomenon called, and what are three techniques to address it?
Answer
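The effect can be reproduced with a hand-rolled backward pass through a sigmoid MLP in numpy (the depth, width, and top-level gradient are all illustrative; sigmoid is used because its derivative is at most 0.25, which makes the shrinkage obvious):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 10

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Ws = [rng.normal(0.0, np.sqrt(1 / n), size=(n, n)) for _ in range(depth)]

# Forward pass, keeping each layer's activations:
acts = [rng.normal(size=n)]
for W in Ws:
    acts.append(sigmoid(acts[-1] @ W))

# Backward pass from an arbitrary upstream gradient at the output:
grad = np.ones(n)
norms = []
for W, a in zip(reversed(Ws), reversed(acts[1:])):
    grad = W @ (grad * a * (1 - a))            # chain rule through sigmoid'
    norms.append(float(np.linalg.norm(grad)))

print(norms[0], norms[-1])                     # norm collapses layer by layer
```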
This is the **vanishing gradient problem** — gradients shrink exponentially as they propagate backward through the network, causing early layers to learn extremely slowly. Three techniques to address it: 1. **Residual (skip) connections** — provide a direct gradient highway from the loss to earlier layers, bypassing the vanishing gradient path. 2. **Normalization layers** (batch norm or layer norm) — keep activations and gradients in a healthy range at every layer. 3. **Proper initialization** — He initialization for ReLU networks ensures gradients start at a reasonable scale. Alternatively, use activation functions with non-vanishing gradients in the negative domain (Leaky ReLU, GELU) instead of standard ReLU.
Question 20
True or False: If you train the same model architecture with the same data and hyperparameters but change only the initialization seed, you will get the same final test accuracy.