Chapter 13: Quiz

Test your understanding of regularization and generalization. Each question has a single best answer unless otherwise noted.


Question 1

What is the generalization gap?

  • (a) The difference between the model's capacity and the dataset size
  • (b) The difference between test error and training error
  • (c) The difference between validation accuracy and test accuracy
  • (d) The difference between the learning rate and weight decay
**Answer: (b)** The generalization gap is defined as the difference between test error and training error: $\mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$. A large generalization gap indicates overfitting, where the model performs much better on training data than on unseen data.

Question 2

Which of the following is a sign of underfitting?

  • (a) Low training error, high test error
  • (b) High training error, high test error
  • (c) Low training error, low test error
  • (d) High training error, low test error
**Answer: (b)** Underfitting occurs when the model is too simple to capture the data's patterns, resulting in high error on both training and test data. Option (a) describes overfitting, option (c) describes a well-fit model, and option (d) rarely occurs in practice.

Question 3

What is the key difference between L1 and L2 regularization?

  • (a) L1 penalizes the square of weights; L2 penalizes their absolute value
  • (b) L1 produces sparse solutions; L2 produces small but non-zero weights
  • (c) L1 is always better for deep learning; L2 is better for linear models
  • (d) L1 requires a larger learning rate than L2
**Answer: (b)** L1 regularization ($\sum |w_i|$) pushes weights to exactly zero because its gradient has constant magnitude regardless of how small the weight is, producing sparse solutions. L2 regularization ($\sum w_i^2$) pushes weights toward zero proportionally to their current magnitude, producing small but generally non-zero weights.
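
As a quick illustration, here is a minimal PyTorch sketch of adding either penalty to a loss by hand; the model, data, and `lambda_reg` value are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                     # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)

mse = nn.MSELoss()(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())   # sum |w_i|  -> sparse weights
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # sum w_i^2  -> small weights

lambda_reg = 1e-4                            # illustrative regularization strength
loss = mse + lambda_reg * l1_penalty         # swap in l2_penalty for L2 regularization
loss.backward()
```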

Question 4

Why should you use AdamW instead of Adam with L2 regularization?

  • (a) AdamW is faster to compute
  • (b) Adam's adaptive learning rates interact poorly with the L2 penalty; decoupling weight decay from the gradient update works better
  • (c) L2 regularization is incompatible with Adam
  • (d) AdamW uses L1 regularization instead of L2
**Answer: (b)** In standard Adam with an L2 penalty, the weight decay is scaled by the adaptive learning rate (the second-moment estimate), which means parameters with large gradients receive less regularization. AdamW applies weight decay directly to the weights, decoupled from the gradient-based update, which provides more uniform regularization. This was demonstrated by Loshchilov and Hutter (2019).
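
A minimal sketch of the two optimizer setups in PyTorch; the model, learning rate, and decay values are illustrative.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# Adam with weight_decay: the L2 term is folded into the gradient, so the decay
# gets rescaled by the adaptive step size and varies per parameter.
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight decay is applied directly to the weights, decoupled from the
# gradient-based update, giving more uniform regularization.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```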

Question 5

During training with dropout rate $p = 0.3$, what happens to surviving neurons in PyTorch's inverted dropout?

  • (a) They are scaled by 0.3
  • (b) They are scaled by 0.7
  • (c) They are scaled by $1 / 0.7 \approx 1.43$
  • (d) They are not scaled during training
**Answer: (c)** In inverted dropout, surviving neurons are scaled by $\frac{1}{1-p} = \frac{1}{0.7} \approx 1.43$ during training. This ensures the expected value of the output remains the same with or without dropout, eliminating the need for rescaling at inference time.
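
You can verify the scaling directly; a small sketch with `torch.nn.Dropout`:

```python
import torch

p = 0.3
x = torch.ones(1000)

drop = torch.nn.Dropout(p)
drop.train()                      # dropout active
y = drop(x)

print(y[y > 0][0].item())         # surviving values are 1 / (1 - p) ≈ 1.4286
print(y.mean().item())            # ≈ 1.0 in expectation, matching the input

drop.eval()                       # dropout disabled: identity, no rescaling needed
print(drop(x).mean().item())      # exactly 1.0
```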

Question 6

What is the most critical step to remember when using dropout in PyTorch?

  • (a) Set the dropout rate to exactly 0.5
  • (b) Call model.train() before training and model.eval() before evaluation
  • (c) Apply dropout only to the output layer
  • (d) Use the same seed for dropout masks in every batch
**Answer: (b)** Calling `model.train()` activates dropout during training, and `model.eval()` disables it during evaluation. Forgetting this switch is one of the most common bugs in deep learning code, leading to degraded inference performance because dropout is still randomly zeroing activations.
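
A small sketch that makes the effect of the switch visible; the toy model is illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(0.3), nn.Linear(10, 2))
x = torch.randn(4, 10)

model.train()                               # dropout active: repeated passes differ
print(torch.allclose(model(x), model(x)))   # False (almost surely)

model.eval()                                # dropout disabled: passes are deterministic
print(torch.allclose(model(x), model(x)))   # True
```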

Question 7

Why is Dropout2d (spatial dropout) preferred over standard dropout for convolutional layers?

  • (a) It is faster to compute
  • (b) It drops entire feature maps, which is more effective because adjacent pixels are correlated
  • (c) It applies dropout only to the bias terms
  • (d) It uses a different dropout rate for each layer
**Answer: (b)** In convolutional layers, adjacent spatial positions are highly correlated. Standard dropout on individual pixels is ineffective because neighboring pixels carry redundant information. Dropout2d drops entire feature maps (channels), forcing the network to learn multiple independent representations of the same spatial features.
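
A minimal sketch of `nn.Dropout2d` inside a convolutional block; the layer sizes and dropout rate are illustrative.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.2),   # zeroes whole channels (feature maps), not individual pixels
)
block.train()

out = block(torch.randn(8, 3, 32, 32))

# Dropped channels are zero everywhere; surviving channels are scaled by 1 / (1 - p).
per_channel_max = out.abs().amax(dim=(2, 3))          # shape: (batch, channels)
print((per_channel_max == 0).float().mean().item())   # roughly 0.2 of channels zeroed
```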

Question 8

What is the primary advantage of data augmentation over other regularization techniques?

  • (a) It is computationally free
  • (b) It addresses the root cause of overfitting by increasing data diversity
  • (c) It always improves accuracy by exactly 10%
  • (d) It eliminates the need for a validation set
**Answer: (b)** Data augmentation is powerful because it directly addresses the root cause of overfitting: insufficient diversity in the training data. By applying label-preserving transformations, it effectively expands the training set, helping the model learn invariant features rather than memorizing specific examples.
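
For reference, a representative torchvision augmentation pipeline; the particular transforms and sizes are illustrative choices, not a prescription.

```python
from torchvision import transforms

# Label-preserving augmentations for training on natural images.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

# Evaluation uses deterministic preprocessing only.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```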

Question 9

What is RandAugment's key simplification compared to AutoAugment?

  • (a) It uses only horizontal flips
  • (b) It reduces the search space to just two hyperparameters: number of operations and magnitude
  • (c) It eliminates all augmentation
  • (d) It uses a fixed augmentation policy for all datasets
**Answer: (b)** RandAugment simplifies the augmentation policy search from AutoAugment's large discrete search space to just two hyperparameters: $N$ (number of augmentation operations per image) and $M$ (magnitude of the operations). Despite this simplification, it achieves competitive performance at a fraction of the computational cost.
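
In torchvision (0.11 or newer), the two hyperparameters map directly onto the constructor arguments; the values below are the library defaults.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),   # N = 2 operations, magnitude M = 9
    transforms.ToTensor(),
])
```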

Question 10

In early stopping, what does "patience" refer to?

  • (a) The total number of training epochs
  • (b) The number of epochs to wait after the last improvement before stopping
  • (c) The minimum validation accuracy required
  • (d) The learning rate reduction factor
**Answer: (b)** Patience is the number of consecutive epochs without improvement in the monitored metric (typically validation loss) that the training loop will tolerate before stopping. For example, `patience=10` means training will stop if validation loss does not improve for 10 consecutive epochs after the best recorded value.
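
A minimal sketch of a patience-based stopper; the class name and the `min_delta` threshold are our own illustrative additions, not a standard PyTorch API.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the patience counter
            self.counter = 0
        else:
            self.counter += 1         # no improvement this epoch
        return self.counter >= self.patience
```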

Question 11

How does early stopping relate to L2 regularization theoretically?

  • (a) They are completely unrelated techniques
  • (b) Early stopping is equivalent to L2 regularization with strength inversely proportional to training steps
  • (c) Early stopping is equivalent to L1 regularization
  • (d) L2 regularization makes early stopping unnecessary
**Answer: (b)** For linear models trained with gradient descent, Bishop (1995) showed that early stopping is equivalent to L2 regularization where the regularization strength is inversely proportional to the number of training steps. Stopping earlier corresponds to stronger regularization because the weights remain closer to their initial (small) values.

Question 12

With label smoothing parameter $\alpha = 0.1$ and $K = 10$ classes, what is the target probability for the correct class?

  • (a) 0.9
  • (b) 0.91
  • (c) 0.99
  • (d) 0.1
**Answer: (b)** The smoothed target for the correct class is $1 - \alpha + \frac{\alpha}{K} = 1 - 0.1 + \frac{0.1}{10} = 0.91$. The remaining probability $\frac{\alpha}{K} = 0.01$ is assigned to each incorrect class.
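
The arithmetic, plus the built-in PyTorch equivalent (the logits and targets below are random placeholders):

```python
import torch
import torch.nn as nn

alpha, K = 0.1, 10
print(1 - alpha + alpha / K)   # 0.91 target for the correct class
print(alpha / K)               # 0.01 target for each incorrect class

# CrossEntropyLoss applies the same smoothing internally.
criterion = nn.CrossEntropyLoss(label_smoothing=alpha)
loss = criterion(torch.randn(4, K), torch.randint(0, K, (4,)))
```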

Question 13

In mixup, what distribution is used to sample the mixing coefficient $\lambda$?

  • (a) Uniform distribution
  • (b) Normal distribution
  • (c) Beta distribution
  • (d) Bernoulli distribution
**Answer: (c)** Mixup samples $\lambda$ from a $\text{Beta}(\alpha, \alpha)$ distribution. The hyperparameter $\alpha$ controls the strength of mixing: small $\alpha$ (e.g., 0.2) produces $\lambda$ values close to 0 or 1 (mild mixing), while $\alpha = 1.0$ produces uniform mixing.
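
A minimal mixup sketch over one batch, assuming one-hot labels; the helper function is our own illustration.

```python
import torch
import torch.nn.functional as F

def mixup(x, y_onehot, alpha=0.2):
    """Blend a batch with a shuffled copy of itself; lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = torch.randn(8, 3, 32, 32)
y = F.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
x_mix, y_mix = mixup(x, y)
```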

Question 14

What is the key difference between mixup and CutMix?

  • (a) Mixup is for images only; CutMix is for text only
  • (b) Mixup blends entire images pixel-wise; CutMix replaces a rectangular region
  • (c) CutMix does not modify labels; mixup does
  • (d) They are mathematically identical
**Answer: (b)** Mixup creates training examples by blending two images pixel-wise across the entire image ($\tilde{x} = \lambda x_i + (1-\lambda) x_j$), which can create unrealistic blurry images. CutMix replaces a rectangular region of one image with a patch from another, maintaining local image coherence and being more effective for tasks that require spatial reasoning.
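
For comparison with the mixup sketch above, a hand-rolled CutMix step; the helper is illustrative and follows the area-based label weighting described in the CutMix paper.

```python
import torch

def cutmix(x, y_onehot, alpha=1.0):
    """Paste a random rectangle from a shuffled batch; mix labels by area ratio."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    _, _, h, w = x.shape

    # Rectangle covering roughly (1 - lam) of the image, centred at a random point.
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    x_mix = x.clone()
    x_mix[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
    lam_adj = 1 - ((y2 - y1) * (x2 - x1)) / (h * w)   # fraction of the image kept
    y_mix = lam_adj * y_onehot + (1 - lam_adj) * y_onehot[perm]
    return x_mix, y_mix
```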

Question 15

Why do models trained with large batch sizes tend to generalize worse?

  • (a) Large batches cause numerical overflow
  • (b) Large batches provide more accurate gradients that tend to converge to sharp minima, which generalize poorly
  • (c) Large batches require more memory, leaving less for the model
  • (d) Large batches always lead to underfitting
**Answer: (b)** Large batch sizes provide more accurate gradient estimates, reducing the stochastic noise in optimization. This low-noise optimization tends to converge to sharp minima (narrow valleys in the loss landscape). Sharp minima generalize poorly because the test loss landscape is slightly different from the training loss landscape, and small parameter perturbations at sharp minima cause large increases in loss.

Question 16

What is the linear scaling rule for batch size?

  • (a) Double the model size when doubling the batch size
  • (b) Multiply the learning rate by the same factor as the batch size increase
  • (c) Divide the learning rate by the batch size
  • (d) Keep the learning rate constant regardless of batch size
**Answer: (b)** The linear scaling rule states that when you multiply the batch size by a factor $k$, you should also multiply the learning rate by $k$: $\eta_{\text{new}} = \eta_{\text{base}} \times \frac{B_{\text{new}}}{B_{\text{base}}}$. This keeps the expected magnitude of weight updates approximately constant across different batch sizes.
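
A one-line worked example of the rule; the base recipe (learning rate 0.1 at batch size 256) is an illustrative placeholder.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the learning rate with the batch size."""
    return base_lr * new_batch / base_batch

print(scaled_lr(0.1, 256, 1024))   # 0.4: batch size x4, so learning rate x4
```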

Question 17

In the double descent phenomenon, what happens at the interpolation threshold?

  • (a) Test error reaches its minimum
  • (b) Test error peaks because the model uses all its capacity for memorization
  • (c) Training error starts to increase
  • (d) The learning rate is automatically reduced
**Answer: (b)** At the interpolation threshold, the model has just enough parameters to perfectly fit (interpolate) the training data. It must use all its capacity for memorization, leaving no room for generalization. This causes test error to peak. Beyond this point, in the overparameterized regime, the model has excess capacity and can find smoother, better-generalizing solutions.

Question 18

What does the lottery ticket hypothesis claim?

  • (a) Only randomly initialized networks can achieve good performance
  • (b) A randomly initialized dense network contains a sparse subnetwork that can match the full network's performance when trained from the same initialization
  • (c) Smaller networks always outperform larger networks
  • (d) Pruning always improves accuracy
**Answer: (b)** The lottery ticket hypothesis (Frankle and Carbin, 2019) claims that within a randomly initialized dense network, there exists a sparse subnetwork (the "winning ticket") that, when trained in isolation from its original initialization, can match the test accuracy of the full network in a comparable number of training steps. The large network serves as a search space for finding these winning tickets.

Question 19

Which regularization technique is generally most effective for very small datasets (fewer than 1,000 samples)?

  • (a) Reducing weight decay to zero
  • (b) Using the largest possible batch size
  • (c) Heavy data augmentation combined with transfer learning
  • (d) Removing all dropout
**Answer: (c)** For very small datasets, the most effective strategy is heavy data augmentation (to synthetically expand the dataset) combined with transfer learning (to leverage knowledge from larger datasets). These approaches directly address the core problem of limited data. Weight decay and dropout also help, but data augmentation and transfer learning have the largest impact.

Question 20

Why is dropout often set to zero in very large language models?

  • (a) Dropout is incompatible with transformer architectures
  • (b) The massive amount of training data already provides sufficient regularization
  • (c) Large models cannot overfit by definition
  • (d) Dropout increases training time too much
**Answer: (b)** Very large language models are trained on billions of tokens, and the sheer volume and diversity of training data acts as a natural regularizer. The models rarely see the same exact sequence twice, which prevents memorization. In this regime, explicit regularization like dropout provides diminishing returns and can even slow down training unnecessarily.

Question 21

What form of implicit regularization does mini-batch SGD provide?

  • (a) L1 regularization
  • (b) Gradient noise that helps escape sharp minima
  • (c) Automatic weight pruning
  • (d) Feature normalization
**Answer: (b)** Mini-batch SGD computes gradients from a random subset of the data, introducing noise in the gradient estimate. This stochasticity helps the optimizer escape sharp local minima and settle into flatter minima that generalize better. Smaller batch sizes introduce more noise, providing stronger implicit regularization.

Question 22

When fine-tuning a pretrained model on a small dataset, which regularization strategy is most appropriate?

  • (a) Remove all regularization since the model is already pretrained
  • (b) Use high weight decay, low learning rate for pretrained layers, and strong data augmentation
  • (c) Use very high dropout (0.9) and no weight decay
  • (d) Freeze all layers and only train the final classification head with no regularization
**Answer: (b)** When fine-tuning on small datasets, high weight decay prevents the model from straying too far from the well-learned pretrained weights. A low learning rate for pretrained layers serves a similar purpose. Strong data augmentation is critical because the dataset is small. While freezing layers (option d) can work, it is overly restrictive and option (b) provides a better balance.
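
A minimal fine-tuning setup along these lines, using parameter groups to give the pretrained backbone a lower learning rate than the new head; the backbone choice, class count, and hyperparameter values are illustrative (assumes torchvision 0.13 or newer, and downloads ImageNet weights).

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")        # any pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 10)    # fresh head for 10 classes

# Low learning rate for pretrained layers, higher for the new head,
# with weight decay throughout to stay close to the pretrained weights.
backbone = [p for name, p in model.named_parameters() if not name.startswith("fc.")]
optimizer = torch.optim.AdamW(
    [
        {"params": backbone, "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ],
    weight_decay=1e-2,
)
```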

Question 23

What does the weight_decay parameter in torch.optim.AdamW control?

  • (a) The rate at which the learning rate decreases
  • (b) The coefficient for decoupled weight decay applied directly to the weights
  • (c) The L1 regularization strength
  • (d) The dropout rate
**Answer: (b)** In `AdamW`, the `weight_decay` parameter specifies the coefficient for decoupled weight decay. At each update step, each weight is multiplied by $(1 - \text{lr} \times \text{weight\_decay})$ before the Adam gradient update is applied. This is decoupled from the gradient computation, unlike L2 regularization which adds the penalty to the loss.
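
Written out by hand for a single step (with a plain gradient step standing in for Adam's moment-based update, to keep the sketch short):

```python
import torch

lr, weight_decay = 1e-3, 1e-2
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

with torch.no_grad():
    w.mul_(1 - lr * weight_decay)   # decoupled decay, applied directly to the weights
    w.sub_(lr * w.grad)             # gradient step (Adam's update in the real optimizer)
```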

Question 24

Which of the following is NOT an example of implicit regularization?

  • (a) The noise introduced by mini-batch gradient estimation
  • (b) The architectural bias of convolutional networks toward spatial locality
  • (c) Adding an explicit L2 penalty to the loss function
  • (d) The noise in batch normalization statistics
**Answer: (c)** Adding an L2 penalty to the loss function is explicit regularization: it is a deliberate modification to the training objective. Implicit regularization refers to regularization effects that arise naturally from design choices without being explicitly imposed, such as mini-batch noise (a), architectural biases (b), and batch normalization noise (d).

Question 25

You observe that adding dropout to your convolutional network with batch normalization makes performance worse. What is the most likely explanation?

  • (a) Your dropout rate is too low
  • (b) Dropout and batch normalization interact poorly because dropout changes the variance of inputs to batch norm
  • (c) Your model is underfitting
  • (d) You forgot to call model.eval()
**Answer: (b)** Dropout and batch normalization can interact poorly. Dropout changes the distribution of activations (introducing variance), while batch normalization computes running statistics assuming a stable distribution. During training, dropout alters the variance of batch norm inputs; at test time, dropout is disabled, creating a mismatch with the stored running statistics. This is why many modern architectures avoid combining them in the same block.