Chapter 6: Quiz
Test your understanding of neural network fundamentals. Answers follow each question.
Question 1
A single perceptron with step activation can learn the AND function but not the XOR function. Why?
Answer
AND is linearly separable: there exists a hyperplane that separates the positive example (both inputs 1) from the negative examples (at least one input 0). For example, $w_1 = 1, w_2 = 1, b = -1.5$ gives the correct output for all four input combinations. XOR is not linearly separable: no single hyperplane can separate $\{(0,0), (1,1)\}$ (output 0) from $\{(0,1), (1,0)\}$ (output 1). Minsky and Papert (1969) proved this limitation; solving XOR requires at least one hidden layer.
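As a quick sanity check, the suggested weights can be verified exhaustively (a minimal sketch; the `perceptron` helper is ours, not from the chapter):

```python
# Verify that w1 = 1, w2 = 1, b = -1.5 with a step activation implements AND.
def perceptron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    """Step-activated perceptron: outputs 1 iff w1*x1 + w2*x2 + b > 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        assert perceptron(x1, x2) == (x1 and x2)  # matches AND on all four inputs
```

No analogous weight setting exists for XOR, which is exactly the non-separability claim above.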
Question 2
A network has architecture [10, 64, 32, 1] with ReLU hidden activations and sigmoid output. How many trainable parameters does it have?
Answer
Layer 1: $10 \times 64 + 64 = 704$ (weights + biases). Layer 2: $64 \times 32 + 32 = 2{,}080$. Layer 3: $32 \times 1 + 1 = 33$. Total: $704 + 2{,}080 + 33 = 2{,}817$ parameters.
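The per-layer arithmetic generalizes to any fully connected architecture; a small sketch (the `count_params` helper is hypothetical):

```python
# Each layer from n_in to n_out units contributes n_in*n_out weights + n_out biases.
def count_params(sizes):
    """sizes = [n_in, h1, ..., n_out] for a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

assert count_params([10, 64, 32, 1]) == 2817
```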
Question 3
Why does stacking multiple linear layers without activation functions not increase the expressiveness of the network?
Answer
The composition of linear transformations is itself linear: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = \mathbf{W}_2\mathbf{W}_1\mathbf{x} + \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2 = \mathbf{W}'\mathbf{x} + \mathbf{b}'$. The product $\mathbf{W}' = \mathbf{W}_2\mathbf{W}_1$ is a single matrix, and $\mathbf{b}' = \mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2$ is a single bias vector. The multi-layer network is equivalent to a single-layer network. Nonlinear activation functions break this collapse and are essential for depth to provide additional expressiveness.
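The collapse can be confirmed numerically; a minimal numpy sketch (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...collapse into one: W' = W2 W1, b' = W2 b1 + b2.
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

assert np.allclose(two_layer, one_layer)
```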
Question 4
The sigmoid activation has a maximum gradient of 0.25. Why is this problematic for deep networks?
Answer
During backpropagation, the gradient is multiplied by the activation gradient at each layer. With sigmoid, the maximum multiplicative factor per layer is 0.25. For an $L$-layer network, the activation derivatives scale the gradient at the first layer by a factor of at most $0.25^{L-1}$. For a 10-layer network, this is $0.25^9 \approx 3.8 \times 10^{-6}$ — the gradient effectively vanishes, making the first layer untrainable. This is the vanishing gradient problem.
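Both claims are easy to confirm numerically (a quick numpy check; the grid resolution is arbitrary):

```python
import numpy as np

# sigmoid'(z) = sigma(z) * (1 - sigma(z)), maximized at z = 0.
z = np.linspace(-10, 10, 100001)
sigma = 1 / (1 + np.exp(-z))
max_grad = (sigma * (1 - sigma)).max()
assert abs(max_grad - 0.25) < 1e-6

# Worst-case attenuation through 9 intermediate sigmoid layers:
assert 0.25 ** 9 < 4e-6
```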
Question 5
What is the "dead neuron" problem, and which activation function causes it?
Answer
A dead neuron is a ReLU unit whose pre-activation $z$ is negative for every training example. Since $\text{ReLU}'(z) = 0$ for $z < 0$, no gradient flows through the neuron, so its weights never update. The neuron is permanently inactive. This typically happens when a large learning rate causes weights to overshoot into a regime where all inputs produce negative pre-activations. GELU and Swish mitigate this by having non-zero gradients for negative inputs.
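A minimal PyTorch sketch of a dead unit (the weight and inputs are contrived so every pre-activation is negative):

```python
import torch

# A ReLU unit whose pre-activation is negative for every input gets no gradient.
w = torch.tensor([-5.0], requires_grad=True)   # weight pushed into a bad regime
x = torch.tensor([[1.0], [2.0], [3.0]])        # all-positive inputs => all-negative z
z = x * w
a = torch.relu(z)                              # every activation is 0
loss = a.sum()
loss.backward()

assert torch.all(z < 0)
assert w.grad.item() == 0.0   # no gradient flows: the weight can never recover
```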
Question 6
True or False: Backpropagation is a learning algorithm.
Answer
**False.** Backpropagation is a gradient computation algorithm — it computes $\frac{\partial \mathcal{L}}{\partial \theta}$ for all parameters $\theta$ using the chain rule applied to the computational graph. Learning happens when an optimizer (SGD, Adam, etc.) uses these gradients to update the parameters. The same backpropagation algorithm can be paired with any gradient-based optimizer.
Question 7
For a binary classification problem with sigmoid output and BCE loss, the combined gradient $\frac{\partial \mathcal{L}}{\partial z}$ simplifies to $\hat{y} - y$. Derive this result.
Answer
The chain rule gives $\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$. For BCE: $\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$. For sigmoid: $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$. Multiplying: $\frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) = \hat{y} - y$. The $\hat{y}(1-\hat{y})$ terms cancel exactly. This is not a coincidence — it follows from the sigmoid being the canonical link function for Bernoulli data in the exponential family.
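The cancellation can be checked against a centered finite difference (a small sketch; the helper names are ours):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def bce_of_logit(z, y):
    """BCE loss written directly as a function of the logit z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce_of_logit(z + eps, y) - bce_of_logit(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y          # the simplified gradient: y_hat - y
assert abs(numeric - analytic) < 1e-6
```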
Question 8
He initialization sets the variance of weights to $\frac{2}{n_{\text{in}}}$. Where does the factor of 2 come from?
Answer
For ReLU activation, if $z \sim \mathcal{N}(0, \sigma^2)$, then $\text{ReLU}(z) = \max(0, z)$ preserves only the positive half of the distribution. Since the distribution is symmetric around zero, exactly half the values are zeroed out, halving the second moment: $\mathbb{E}[\text{ReLU}(z)^2] = \sigma^2 / 2$ (the He derivation tracks second moments of activations rather than variances). To compensate and keep the activation magnitude stable across layers, we need $n_\text{in} \cdot \text{Var}(w) / 2 = 1$, giving $\text{Var}(w) = 2/n_\text{in}$. The factor of 2 directly compensates for ReLU halving the signal.
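A quick simulation of the halving (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=1_000_000)   # z ~ N(0, 1), so E[z^2] = 1
relu = np.maximum(0.0, z)

# ReLU zeroes the negative half, so the second moment drops to ~0.5.
assert abs(np.mean(relu**2) - 0.5) < 0.01
```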
Question 9
What is the difference between Xavier/Glorot initialization and He initialization? When should you use each?
Answer
Xavier initialization: $\text{Var}(w) = \frac{2}{n_\text{in} + n_\text{out}}$. It compromises between maintaining forward and backward variance and assumes a linear or tanh activation (where the derivative at zero is 1). He initialization: $\text{Var}(w) = \frac{2}{n_\text{in}}$. It accounts for ReLU's halving of variance and focuses on the forward pass. Use Xavier for sigmoid, tanh, and other activations where the derivative at zero is approximately 1. Use He for ReLU and its variants (Leaky ReLU, PReLU). In modern practice, He initialization is the default because ReLU is the default activation.
Question 10
In gradient checking, why do we use the centered difference $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$ rather than the forward difference $\frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$?
Answer
The centered difference has error $O(\epsilon^2)$ while the forward difference has error $O(\epsilon)$. This comes from Taylor expansion. For centered: $f(\theta + \epsilon) = f(\theta) + f'(\theta)\epsilon + \frac{f''(\theta)}{2}\epsilon^2 + O(\epsilon^3)$ and $f(\theta - \epsilon) = f(\theta) - f'(\theta)\epsilon + \frac{f''(\theta)}{2}\epsilon^2 + O(\epsilon^3)$. Subtracting cancels the even-order terms, including the $\epsilon^2$ term: $\frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon} = f'(\theta) + O(\epsilon^2)$. With $\epsilon = 10^{-5}$, the centered difference is accurate to about $10^{-10}$, while the forward difference is accurate to only $10^{-5}$.
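The two error orders are easy to observe numerically; a sketch using $f(x) = e^x$, where the exact derivative is known:

```python
import math

def forward_diff(f, x, eps):
    return (f(x + eps) - f(x)) / eps

def centered_diff(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x, eps = 1.0, 1e-5
exact = math.exp(1.0)                     # d/dx exp(x) = exp(x)
err_forward = abs(forward_diff(math.exp, x, eps) - exact)
err_centered = abs(centered_diff(math.exp, x, eps) - exact)

assert err_centered < err_forward         # O(eps^2) beats O(eps)
assert err_forward > 1e-6                 # ~ eps * f''(x)/2 ~ 1.4e-5
assert err_centered < 1e-9
```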
Question 11
The universal approximation theorem guarantees that a single-hidden-layer network can approximate any continuous function. Name three things the theorem does NOT guarantee.
Answer
The theorem does not guarantee: (1) **How many neurons are needed** — the required width can be exponentially large in the input dimension. (2) **That gradient descent can find the right weights** — the theorem is an existence result and says nothing about optimization. (3) **Sample efficiency** — it says nothing about how much training data is needed to learn the approximation. Additionally, it does not address the depth-width tradeoff (deeper networks can be exponentially more efficient), computational cost, or generalization to unseen data.
Question 12
In the forward pass, why do we cache the pre-activation values $\mathbf{Z}^{(\ell)}$ and post-activation values $\mathbf{A}^{(\ell)}$?
Answer
The backward pass requires these intermediate values to compute gradients. Specifically: (1) The activation gradient $\sigma'(\mathbf{z}^{(\ell)})$ requires the pre-activation $\mathbf{Z}^{(\ell)}$. (2) The weight gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \delta^{(\ell)T} \mathbf{A}^{(\ell-1)}$ requires the previous layer's post-activation $\mathbf{A}^{(\ell-1)}$. Without caching, we would need to recompute the forward pass for each gradient computation. This is the fundamental time-memory tradeoff in deep learning. Gradient checkpointing (recomputing instead of caching) saves memory at the cost of additional computation.
Question 13
A colleague claims their MLP achieves 99% training accuracy but only 55% validation accuracy on a binary classification task. What is the most likely problem, and what should they try?
Answer
The model is severely overfitting — it has memorized the training data but has not learned generalizable patterns. The large gap between training and validation accuracy is the hallmark of overfitting. Remedies include: (1) **Regularization** — add dropout, weight decay (L2 regularization), or both. (2) **More data** — if available, more training data reduces overfitting. (3) **Simpler model** — reduce the number of layers or neurons. (4) **Early stopping** — stop training when validation loss starts increasing. (5) **Data augmentation** — if applicable, generate synthetic training examples. These techniques are covered in depth in Chapter 7.
Question 14
What does `optimizer.zero_grad()` do in PyTorch, and why is it necessary?
Answer
`optimizer.zero_grad()` sets the `.grad` attribute of all parameters to zero. It is necessary because PyTorch *accumulates* gradients by default — each call to `loss.backward()` adds to the existing `.grad` values rather than replacing them. Without `zero_grad()`, the gradients from the current batch would be added to the gradients from the previous batch, producing incorrect updates. Gradient accumulation is intentional: it enables computing gradients over effective batch sizes larger than what fits in memory by accumulating gradients across multiple forward-backward passes before calling `optimizer.step()`.
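A minimal PyTorch sketch of the accumulation behavior, zeroing `.grad` by hand rather than through an optimizer (note that recent PyTorch versions set gradients to `None` rather than zero by default):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
loss = (w * 3).sum()                   # d(loss)/dw = 3

loss.backward(retain_graph=True)
first = w.grad.clone()                 # grad is 3
loss.backward(retain_graph=True)       # without zeroing, gradients accumulate
assert torch.equal(w.grad, 2 * first)  # grad is now 6, not 3

w.grad.zero_()                         # what optimizer.zero_grad() does per parameter
loss.backward()
assert torch.equal(w.grad, first)      # back to 3 after zeroing
```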
Question 15
Explain the difference between automatic differentiation, numerical differentiation, and symbolic differentiation.
Answer
**Numerical differentiation** computes gradients using finite differences: $f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$. It is simple but slow ($O(p)$ function evaluations for $p$ parameters) and suffers from truncation and round-off errors. **Symbolic differentiation** manipulates mathematical expressions to derive closed-form gradient formulas (like Mathematica or Wolfram Alpha). It produces exact derivatives but can lead to expression swell — the resulting formula can be exponentially larger than the original. **Automatic differentiation** decomposes computation into elementary operations and applies the chain rule at each step. It is exact (no approximation error), efficient (one forward + one backward pass regardless of parameter count), and works on arbitrary programs. PyTorch uses reverse-mode automatic differentiation (equivalent to backpropagation).
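The exactness of automatic differentiation can be illustrated with a toy forward-mode implementation based on dual numbers (our `Dual` class is a teaching sketch, not how PyTorch works internally; PyTorch uses reverse mode):

```python
# Forward-mode autodiff: a dual number carries (value, derivative) and
# propagates exact derivatives through elementary ops via the chain rule.
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)  # product rule

def f(x):
    return x * x * x + x     # f(x) = x^3 + x, so f'(x) = 3x^2 + 1

y = f(Dual(2.0, 1.0))        # seed derivative dx/dx = 1
assert y.val == 10.0         # f(2) = 10
assert y.dot == 13.0         # f'(2) = 13, exact: no finite-difference error
```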
Question 16
For a network with architecture [20, 256, 256, 256, 1], what is the total number of multiply-add operations in the forward pass for a single example?
Answer
Layer 1: $20 \times 256 = 5{,}120$ multiply-adds (plus 256 bias additions). Layers 2 and 3: $256 \times 256 = 65{,}536$ multiply-adds each (plus 256 bias additions each). Layer 4: $256 \times 1 = 256$ multiply-adds (plus 1 bias addition). Total multiply-adds: $5{,}120 + 65{,}536 + 65{,}536 + 256 = 136{,}448$. Activation functions add $256 + 256 + 256 + 1 = 769$ elementwise operations. The dominant cost is the matrix multiplications, particularly the two $256 \times 256$ layers. For a batch of $B$ examples, the cost scales linearly with $B$.
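The layer-by-layer count can be sketched as (the `multiply_adds` helper is hypothetical):

```python
# Each layer from n_in to n_out units costs n_in * n_out multiply-adds per example.
def multiply_adds(sizes):
    """Multiply-add count for one example through a fully connected net."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))

assert multiply_adds([20, 256, 256, 256, 1]) == 136_448
```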
Question 17
You are training a neural network and the loss is NaN after a few epochs. List three possible causes and how to diagnose each.
Answer
(1) **Exploding gradients:** Large gradient norms cause parameter updates that are too large, leading to overflow. Diagnose by monitoring gradient norms per layer; fix with gradient clipping or a smaller learning rate. (2) **Numerical overflow in activation/loss:** Computing $\log(0)$ in cross-entropy or $\exp(\text{large})$ in softmax/sigmoid produces NaN/Inf. Diagnose by adding numerical stability checks (clipping predicted probabilities away from 0 and 1); fix by using numerically stable implementations like `BCEWithLogitsLoss`. (3) **Learning rate too high:** Large learning rate causes the loss to diverge rather than converge. Diagnose by plotting the loss curve; fix by reducing the learning rate by a factor of 10.
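Cause (2) is easy to reproduce and guard against; a numpy sketch (the clipping threshold is arbitrary):

```python
import numpy as np

y, p = 1.0, 0.0                      # a fully saturated, wrong prediction
with np.errstate(divide="ignore"):
    naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))
assert np.isinf(naive)               # log(0) poisons the loss

eps = 1e-7
p_safe = np.clip(p, eps, 1 - eps)    # clip probabilities away from 0 and 1
stable = -(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
assert np.isfinite(stable)           # large but finite loss
```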
Question 18
Why is the GELU activation function preferred over ReLU in transformer architectures?
Answer
GELU ($z \cdot \Phi(z)$, where $\Phi$ is the standard Gaussian CDF) has several advantages over ReLU in transformers: (1) **Smooth everywhere** — GELU is differentiable at $z = 0$, avoiding the sharp corner of ReLU. This smoothness can benefit optimization. (2) **No dead neurons** — GELU has a non-zero gradient for slightly negative inputs, so neurons are not permanently killed. (3) **Probabilistic interpretation** — GELU can be viewed as a stochastic regularizer: it multiplies the input by the probability that the input exceeds a standard normal sample, providing a soft form of dropout. Empirically, GELU tends to produce slightly better results than ReLU in transformers, though the difference is often small.
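The gradient claim in (2) can be checked with the exact GELU and a finite difference (helper names are ours):

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return z * 0.5 * (1 + math.erf(z / math.sqrt(2)))

def num_grad(f, z, eps=1e-6):
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = -0.5
assert num_grad(gelu, z) != 0.0                    # GELU passes gradient for negative z
assert num_grad(lambda t: max(0.0, t), z) == 0.0   # ReLU does not
```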
Question 19
In the `verify_numpy_pytorch_equivalence` function, why might the outputs differ by more than machine epsilon even when the weights are identical?
Answer
Even with identical weights in float32, the outputs can differ due to: (1) **Operation ordering** — numpy and PyTorch may compute matrix multiplications using different BLAS implementations with different accumulation orders. Floating-point addition is not associative, so $(a + b) + c \neq a + (b + c)$ in general. (2) **Fused operations** — PyTorch may use fused multiply-add (FMA) instructions that differ from numpy's separate multiply-then-add. (3) **Parallel reduction** — GPU implementations sum partial results in different orders depending on thread scheduling. The differences are typically $O(10^{-7})$ for float32, which is acceptable. This is why the verification function uses an absolute tolerance (e.g., $10^{-6}$) rather than exact equality.
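The non-associativity in point (1) takes one line to demonstrate:

```python
# Summing the same numbers in a different order gives a different float result.
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)              # 0.6000000000000001 vs 0.6

# The discrepancy is tiny, which is why equivalence checks use a tolerance.
assert abs(((a + b) + c) - (a + (b + c))) < 1e-6
```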
Question 20
A network with sigmoid activations has 10 hidden layers. You observe that the gradient norm at layer 1 is $10^{-8}$ while the gradient norm at layer 10 is $10^{-1}$. You switch to ReLU activations with He initialization and observe that gradient norms are approximately $10^{-1}$ at all layers. Explain the change.