Quiz: Neural Networks from Scratch
Test your understanding of the concepts from Chapter 11. Try to answer each question before revealing the solution.
Question 1
What is the fundamental limitation of a single-layer perceptron?
Show Answer
A single-layer perceptron can only learn **linearly separable** patterns. It computes a linear decision boundary (a hyperplane) in the input space, so it cannot solve problems where the classes are not separable by a single line (or hyperplane). The classic example is the XOR problem, where no single line can separate the positive from the negative examples. Minsky and Papert proved this limitation in 1969, which led to the first "AI winter" for neural networks.

Question 2
Why is a nonlinear activation function essential in a multi-layer network?
Show Answer
Without nonlinear activation functions, the composition of multiple linear layers collapses into a single linear transformation. Mathematically, if each layer computes **a**^[l] = **W**^[l] **a**^[l-1] + **b**^[l] (with no activation), then the entire network computes **y** = **W_eff** **x** + **b_eff**, where **W_eff** and **b_eff** are products and sums of the individual layer parameters. This is equivalent to a single linear layer, making the additional layers useless. A nonlinear activation function breaks this collapse and allows the network to represent complex, nonlinear functions.
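A minimal NumPy sketch (illustrative, not code from the chapter) showing the collapse for two activation-free layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                 # one input column vector

# Two "linear" layers with no activation function
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

print(np.allclose(two_layer, one_layer))    # True: the two layers collapse into one
```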
Question 3
What is the vanishing gradient problem, and which activation functions are most susceptible to it?
Show Answer
The vanishing gradient problem occurs when gradients become extremely small as they are propagated backward through many layers, making it nearly impossible for early layers to learn. The **sigmoid** and **tanh** activation functions are most susceptible because their derivatives are bounded: sigmoid'(z) is at most 0.25, and tanh'(z) is at most 1.0. When |z| is large, both derivatives approach zero. In a deep network, the product of many such small derivatives shrinks exponentially. **ReLU** mitigates this because its derivative is exactly 1 for positive inputs, allowing gradients to flow unchanged.
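A quick sketch of the exponential shrinkage (plain NumPy, assuming the best case where every pre-activation is at z = 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Even at the maximum derivative (z = 0, derivative = 0.25), the backward
# signal shrinks by a factor of 4 per layer; for larger |z| it shrinks faster.
for depth in (5, 10, 20):
    print(depth, sigmoid_prime(0.0) ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```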
Question 4
Given a sigmoid output sigma(z) = 0.88, what is sigma'(z)?
Show Answer
Using the property that sigma'(z) = sigma(z) * (1 - sigma(z)):

sigma'(z) = 0.88 * (1 - 0.88) = 0.88 * 0.12 = **0.1056**

Question 5
What is the "dying ReLU" problem, and how does Leaky ReLU address it?
Show Answer
The dying ReLU problem occurs when a neuron's pre-activation value z is consistently negative. Since ReLU(z) = 0 for z < 0, the neuron outputs zero. More critically, ReLU'(z) = 0 for z < 0, so the gradient is also zero, meaning the neuron's weights never get updated. The neuron is effectively "dead" and cannot recover. **Leaky ReLU** addresses this by allowing a small, non-zero output for negative inputs: LeakyReLU(z) = alpha * z for z < 0 (typically alpha = 0.01). This ensures the gradient is alpha (not zero) for negative inputs, so the neuron can still learn and potentially recover.
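A small NumPy sketch (illustrative helper functions, not the chapter's implementation) contrasting the two activations and their gradients:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))               # [0. 0. 0. 2.]          -- zero output for all negative z
print(leaky_relu(z))         # [-0.03 -0.005 0. 2.]   -- a small signal survives
print(leaky_relu_prime(z))   # [0.01 0.01 0.01 1.]    -- gradient never exactly zero
```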
Question 6
For a network with architecture [784, 256, 128, 10], how many trainable parameters are there?
Show Answer
Count weights and biases for each layer:

- Layer 1: 784 * 256 weights + 256 biases = 200,704 + 256 = **200,960**
- Layer 2: 256 * 128 weights + 128 biases = 32,768 + 128 = **32,896**
- Layer 3: 128 * 10 weights + 10 biases = 1,280 + 10 = **1,290**

Total: 200,960 + 32,896 + 1,290 = **235,146 parameters**
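The count generalizes to any fully connected architecture. A short sketch (count_parameters is an illustrative helper, not a chapter function):

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector per layer
    return total

print(count_parameters([784, 256, 128, 10]))  # 235146
```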
Question 7
In the forward pass equation z^[l] = W^[l] a^[l-1] + b^[l], what are the shapes of W^[l], a^[l-1], and b^[l] for a layer with 64 inputs and 32 outputs, processing a batch of 16 examples?
Show Answer
- **W**^[l] has shape **(32, 64)** --- (n_out, n_in)
- **a**^[l-1] has shape **(64, 16)** --- (n_in, batch_size)
- **b**^[l] has shape **(32, 1)** --- (n_out, 1), broadcast across the 16 columns
- **z**^[l] has shape **(32, 16)** --- (n_out, batch_size)

Note: This is the convention used in the chapter (column vectors, features-first). PyTorch uses the transposed convention (batch_size, features), so the shapes would be transposed in PyTorch.
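A quick shape check in NumPy under the chapter's features-first convention (a sketch, not the chapter's code):

```python
import numpy as np

n_in, n_out, batch = 64, 32, 16
W = np.zeros((n_out, n_in))        # (32, 64)
a_prev = np.zeros((n_in, batch))   # (64, 16)
b = np.zeros((n_out, 1))           # (32, 1), broadcast across the 16 columns

z = W @ a_prev + b
print(z.shape)                     # (32, 16)
```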
Question 8
What is the purpose of He initialization, and what distribution does it use?
Show Answer
He initialization (He et al., 2015) is designed specifically for layers with **ReLU** activation functions. It samples each weight from a zero-mean Gaussian with standard deviation sqrt(2 / n_in), where n_in is the number of inputs to the layer. The factor of 2 accounts for the fact that ReLU zeros out approximately half the activations. This initialization keeps the variance of activations approximately constant across layers, preventing both vanishing and exploding activations/gradients at the start of training. Biases are typically initialized to zero.
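A minimal sketch of He initialization in NumPy (he_init is an illustrative helper, not the chapter's function):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He initialization for a ReLU layer: zero-mean Gaussian, std = sqrt(2 / n_in)."""
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    b = np.zeros((n_out, 1))
    return W, b

W, b = he_init(784, 256)
print(W.std())   # roughly sqrt(2 / 784) ~= 0.0505
```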
Question 9
Why do we clip the input to log() when computing binary cross-entropy loss? What would happen without clipping?
Show Answer
We clip the predictions away from exact 0 and 1 (or equivalently add a small epsilon, e.g., 1e-8) to prevent computing log(0), which is negative infinity. Without this safeguard, if the model predicts exactly 0 for a positive example (y=1, y_hat=0), the loss term y * log(y_hat) = 1 * log(0) = -infinity. Similarly, if the model predicts exactly 1 for a negative example, (1-y) * log(1-y_hat) = 1 * log(0) = -infinity. This would produce infinite or NaN values that corrupt all subsequent computations. Clipping makes the loss very large but finite, preserving numerical stability.
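A sketch of a numerically safe BCE in NumPy (an illustrative implementation, assuming an epsilon of 1e-8 as in the answer):

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-8):
    """Mean BCE with predictions clipped away from exact 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1.0, 0.0])
y_hat = np.array([0.0, 1.0])             # worst possible predictions
print(binary_cross_entropy(y_hat, y))    # large but finite (~18.4), not inf or NaN
```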
Question 10
Derive the gradient delta^[L] for the output layer when using sigmoid activation with binary cross-entropy loss.
Show Answer
Starting from the loss L = -[y log(y_hat) + (1-y) log(1 - y_hat)]:

dL/dy_hat = -y/y_hat + (1-y)/(1-y_hat)

And the sigmoid derivative:

dy_hat/dz = y_hat (1 - y_hat)

Applying the chain rule:

delta^[L] = dL/dz = (dL/dy_hat) * (dy_hat/dz)
          = [-y/y_hat + (1-y)/(1-y_hat)] * y_hat (1 - y_hat)
          = -y(1 - y_hat) + (1-y) y_hat
          = -y + y*y_hat + y_hat - y*y_hat
          = **y_hat - y**

This elegant result means the output layer error signal is simply the difference between prediction and truth.
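If you want to sanity-check the algebra, a finite-difference spot check (a sketch, not part of the derivation) confirms that dL/dz equals y_hat - y at an arbitrary point:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.7, 1.0, 1e-6
numerical = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
analytical = sigmoid(z) - y          # delta^[L] = y_hat - y
print(numerical, analytical)         # both ~ -0.3318
```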
Question 11
In backpropagation, why do we propagate the error signal using (W^[l+1])^T rather than W^[l+1] itself?
Show Answer
During the forward pass, layer l+1 computes **z**^[l+1] = **W**^[l+1] **a**^[l] + **b**^[l+1]. The gradient of the loss with respect to **a**^[l] is:

dL/d**a**^[l] = (dL/d**z**^[l+1]) * (d**z**^[l+1]/d**a**^[l]) = (**W**^[l+1])^T delta^[l+1]

The transpose arises from the chain rule applied to matrix operations. Intuitively, the forward pass "fans out" the signal from n_l neurons to n_{l+1} neurons via **W**^[l+1]; the backward pass must "fan in" the error signal from n_{l+1} neurons back to n_l neurons, which requires the transpose. The shapes confirm this: if **W**^[l+1] is (n_{l+1}, n_l) and delta^[l+1] is (n_{l+1}, m), then (**W**^[l+1])^T delta^[l+1] has shape (n_l, m), matching the shape of **a**^[l].

Question 12
What is the element-wise multiplication in delta^[l] = [(W^[l+1])^T delta^[l+1]] * f'(z^[l]) computing, and why is it necessary?
Show Answer
The element-wise multiplication (Hadamard product) applies the **derivative of the activation function** at each neuron. The term (**W**^[l+1])^T delta^[l+1] gives the error signal flowing back to the activations **a**^[l]. But to get the error signal at the pre-activations **z**^[l], we need to account for how the activation function transformed **z**^[l] into **a**^[l]. The chain rule requires multiplying by f'(**z**^[l]) element-wise (since the activation function is applied element-wise). For ReLU, this masks out neurons where z was negative (gradient = 0) and passes the gradient unchanged where z was positive (gradient = 1).
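A NumPy sketch of this single backward step (illustrative names, features-first convention as in Question 7) that shows both the transpose fan-in from Question 11 and the ReLU mask:

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_next, m = 4, 3, 5                       # layer widths and batch size

W_next     = rng.normal(size=(n_next, n_l))    # W^[l+1]
delta_next = rng.normal(size=(n_next, m))      # delta^[l+1]
z_l        = rng.normal(size=(n_l, m))         # pre-activations of layer l

relu_prime = (z_l > 0).astype(float)           # f'(z^[l]): 1 where z > 0, else 0

# Fan the error back in with the transpose, then mask with f'(z^[l])
delta_l = (W_next.T @ delta_next) * relu_prime
print(delta_l.shape)                           # (4, 5) -- matches z^[l]
```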
Question 13
What does optimizer.zero_grad() do in PyTorch, and what happens if you forget it?
Show Answer
`optimizer.zero_grad()` sets all parameter gradients to zero. PyTorch **accumulates** gradients by default---each call to `loss.backward()` adds to the existing `.grad` attributes rather than replacing them. If you forget `zero_grad()`, the gradients from the current iteration will be added to the gradients from previous iterations, resulting in incorrect (and increasingly large) gradient values. This leads to erratic parameter updates and usually causes the loss to diverge. Gradient accumulation is a deliberate design choice that enables techniques like accumulating gradients over multiple mini-batches, but it requires explicit zeroing when you do not want accumulation.
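A tiny demonstration of the accumulation (a sketch using a bare tensor rather than an optimizer; `w.grad.zero_()` plays the role of `optimizer.zero_grad()`):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

for step in range(3):
    loss = (w * x).sum()        # dloss/dw = 2 on every iteration
    loss.backward()
    print(w.grad)               # without zeroing: 2., 4., 6. -- gradients pile up
    # w.grad.zero_()            # uncomment to get 2. on every iteration
```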
Question 14
What is the difference between torch.no_grad() and tensor.detach()?
Show Answer
- **`torch.no_grad()`** is a context manager that disables gradient computation for all operations within its scope. Tensors created inside `with torch.no_grad():` will not have their operations tracked in the computational graph. This is used during inference and evaluation to save memory and computation.
- **`tensor.detach()`** creates a new tensor that shares the same data but is detached from the current computational graph. The detached tensor has `requires_grad=False`. This is used when you want to use a tensor's value without allowing gradients to flow through it (e.g., when computing a target value that should not be differentiated).

Both prevent gradient computation, but `no_grad()` applies to a block of code while `detach()` applies to a specific tensor.
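A short sketch contrasting the two:

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():           # nothing inside this block is tracked
    y = x * 2
print(y.requires_grad)          # False

z = (x * 2).detach()            # same values, but cut out of the graph
print(z.requires_grad)          # False

loss = (x * 2).sum()            # tracked as usual outside the block
loss.backward()
print(x.grad)                   # tensor([2., 2., 2.])
```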
Question 15
Explain the three-step training pattern in PyTorch: zero_grad(), backward(), step().
Show Answer
1. **`optimizer.zero_grad()`**: Clears the gradient buffers of all parameters managed by the optimizer. This is necessary because PyTorch accumulates gradients by default.
2. **`loss.backward()`**: Traverses the computational graph in reverse (from the loss back to all parameters with `requires_grad=True`) and computes gradients using the chain rule (automatic differentiation). After this call, each parameter's `.grad` attribute contains dL/d(parameter).
3. **`optimizer.step()`**: Updates all parameters using the computed gradients and the optimizer's update rule. For basic SGD, this performs: parameter = parameter - lr * parameter.grad. For Adam or other optimizers, the update rule is more complex and may use running averages of gradients and squared gradients.

This three-step pattern is universal across all PyTorch training loops, regardless of model complexity.
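A minimal loop showing the pattern (a sketch with an arbitrary toy model and random data, not the chapter's training script):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

X = torch.randn(128, 20)
y = torch.randint(0, 2, (128, 1)).float()

for epoch in range(10):
    optimizer.zero_grad()            # 1. clear accumulated gradients
    loss = loss_fn(model(X), y)      # forward pass
    loss.backward()                  # 2. backprop: fill .grad on every parameter
    optimizer.step()                 # 3. update parameters from their gradients
```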
Question 16
What is the universal approximation theorem, and what are its practical limitations?
Show Answer
The universal approximation theorem (Cybenko, 1989; Hornik, 1991) states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n to arbitrary accuracy, given a suitable non-linear activation function.

**Practical limitations:**

1. It says nothing about **how many** neurons are needed. The required number may be exponentially large.
2. It is an **existence** theorem, not a constructive one. It does not guarantee that gradient descent will find the right weights.
3. It does not address **generalization**. The network may perfectly fit the training data but fail on new data.
4. In practice, **deeper** networks (multiple hidden layers) tend to be more parameter-efficient than very wide single-layer networks for representing hierarchical functions.

Question 17
How does the softmax function differ from applying sigmoid independently to each output neuron in a multi-class classification problem?
Show Answer
**Sigmoid** applied independently to K output neurons produces K values, each in (0, 1), but these values do NOT necessarily sum to 1. Each output is independent: sigmoid treats each class as a separate binary decision.

**Softmax** produces K values that are all in (0, 1) and are guaranteed to sum to 1, forming a valid probability distribution over the K classes. Softmax introduces competition between classes: if one class's probability increases, the others' must decrease.

Use **sigmoid** for multi-label classification (an example can belong to multiple classes simultaneously). Use **softmax** for multi-class classification (an example belongs to exactly one class).
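A short comparison in PyTorch (a sketch with arbitrary example logits):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])

independent = torch.sigmoid(logits)          # each value in (0, 1), no coupling
distribution = torch.softmax(logits, dim=0)  # classes compete, values sum to 1

print(independent, independent.sum())        # sum is ~2.14, not 1
print(distribution, distribution.sum())      # sums to 1
```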
Question 18
What is numerical gradient checking, and when should you use it?
Show Answer
Numerical gradient checking approximates the gradient using the finite difference formula:

dL/dtheta_j approximately equals [L(theta_j + epsilon) - L(theta_j - epsilon)] / (2 * epsilon)

where epsilon is typically 10^{-7}. This two-sided difference is more accurate than the one-sided version.

**When to use it:**
- When implementing backpropagation from scratch, to verify your analytical gradients are correct
- When debugging a custom layer or loss function
- As a one-time verification step during development

**When NOT to use it:**
- During actual training (far too slow---requires 2 forward passes per parameter)
- In production code
- On very large networks (impractical)

A relative difference below 10^{-5} between analytical and numerical gradients indicates a correct implementation. A difference above 10^{-3} strongly suggests a bug.
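A sketch of the check on a toy loss (illustrative helpers; in practice you would loop over your network's weights instead of this two-parameter example):

```python
import numpy as np

def loss(theta):
    # Any scalar loss works; here a simple quadratic in two parameters.
    return (theta[0] - 3.0) ** 2 + theta[0] * theta[1]

def numerical_grad(f, theta, eps=1e-7):
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[j] += eps
        minus[j] -= eps
        grad[j] = (f(plus) - f(minus)) / (2 * eps)   # two-sided difference
    return grad

theta = np.array([1.0, 2.0])
analytical = np.array([2 * (theta[0] - 3.0) + theta[1], theta[0]])
numerical = numerical_grad(loss, theta)
rel_diff = np.linalg.norm(analytical - numerical) / (
    np.linalg.norm(analytical) + np.linalg.norm(numerical))
print(rel_diff)   # should be well below 1e-5
```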
Question 19
In PyTorch, what is the difference between nn.BCELoss and nn.BCEWithLogitsLoss?
Show Answer
- **`nn.BCELoss`** expects the input to already have a sigmoid applied (values in (0, 1)). It computes: L = -[y * log(y_hat) + (1-y) * log(1-y_hat)].
- **`nn.BCEWithLogitsLoss`** expects raw logits (pre-sigmoid values) and applies the sigmoid internally before computing the loss. It uses the **log-sum-exp trick** for numerical stability.

**Always prefer `nn.BCEWithLogitsLoss`** because:
1. It is more numerically stable (avoids log(0) issues).
2. It is slightly more efficient (combines sigmoid and loss in one operation).
3. It avoids potential bugs from applying sigmoid in the model and again in the loss.

When using `nn.BCEWithLogitsLoss`, your model's output layer should NOT include a sigmoid activation.
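A quick sketch showing the two losses agree when fed consistently (arbitrary example values):

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.5, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

# Numerically stable: takes raw logits, applies sigmoid internally
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent but less stable: apply sigmoid first, then BCELoss
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_logits, loss_probs)   # same value here; they diverge for extreme logits
```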
Question 20
Why does PyTorch use model.train() and model.eval() modes?
Show Answer
Some layers behave differently during training and inference:

- **Dropout** randomly zeroes elements during training (for regularization) but passes all values through during evaluation (no randomness).
- **Batch normalization** uses per-batch statistics (mean, variance) during training but uses the running averages accumulated during training when evaluating.

`model.train()` enables training behavior for these layers; `model.eval()` switches to inference behavior. Forgetting to call `model.eval()` before inference can lead to:

- Inconsistent predictions (dropout introduces randomness)
- Reduced accuracy (batch norm uses batch statistics from a single test batch instead of the learned running statistics)

Always pair `model.eval()` with `torch.no_grad()` during inference for both correctness and efficiency.
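A small sketch making the dropout difference visible (toy model, not from the chapter):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 1))
x = torch.randn(1, 10)

model.train()
print(model(x), model(x))        # typically differ: dropout is active and random

model.eval()
with torch.no_grad():            # pair eval mode with disabled gradient tracking
    print(model(x), model(x))    # identical: dropout is a no-op in eval mode
```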
Question 21
What is the relationship between the learning rate and the loss landscape?
Show Answer
The learning rate controls the step size in parameter space during gradient descent:

- **Too small**: The optimizer takes tiny steps, converging very slowly. It may get trapped in poor local minima or saddle points.
- **Too large**: The optimizer overshoots minima, causing the loss to oscillate or diverge (loss goes to infinity or NaN).
- **Just right**: The optimizer converges efficiently to a good minimum.

The optimal learning rate depends on the **curvature** of the loss landscape. In flat regions, a larger learning rate is beneficial. In regions with high curvature (sharp minima), a smaller learning rate is needed. This is why adaptive optimizers like Adam (Chapter 12) adjust the effective learning rate per-parameter based on historical gradient information. As discussed in Chapter 9, learning rate scheduling (gradually decreasing the learning rate during training) can help navigate from large initial steps (for exploration) to small final steps (for fine-tuning).

Question 22
Explain why the Xavier/Glorot initialization uses sqrt(2 / (n_in + n_out)) while He initialization uses sqrt(2 / n_in).
Show Answer
Both initializations aim to keep the variance of activations and gradients approximately constant across layers, preventing vanishing or exploding values.

**Xavier/Glorot initialization** (Glorot & Bengio, 2010) was designed for layers with **symmetric** activation functions (tanh, sigmoid). It balances the forward pass (variance flowing from inputs to outputs, which favors variance 1/n_in) against the backward pass (gradients flowing from outputs to inputs, which favors variance 1/n_out), giving the compromise sqrt(2 / (n_in + n_out)).

**He initialization** (He et al., 2015) was designed for **ReLU** activation functions. Since ReLU zeros out approximately half the neurons (those with negative pre-activations), the variance of the output is halved compared to the linear case. To compensate, the initial variance needs to be doubled. He initialization only considers the forward pass, using sqrt(2 / n_in), where the factor of 2 accounts for ReLU's halving effect.

Using He initialization with ReLU and Xavier with tanh/sigmoid leads to stable training in deep networks.
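PyTorch exposes both schemes through `torch.nn.init`; a brief sketch (arbitrary layer sizes) showing the resulting standard deviations:

```python
import torch
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
relu_layer = nn.Linear(256, 128)

# Xavier/Glorot: std = sqrt(2 / (n_in + n_out)), suited to tanh/sigmoid
nn.init.xavier_normal_(tanh_layer.weight)

# He/Kaiming: std = sqrt(2 / n_in), suited to ReLU
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')

print(tanh_layer.weight.std())   # ~ sqrt(2 / 384) ~= 0.072
print(relu_layer.weight.std())   # ~ sqrt(2 / 256) ~= 0.088
```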
Question 23
A network has 5 hidden layers, each with 100 neurons using sigmoid activation. If the maximum sigmoid derivative is 0.25, what is the worst-case gradient attenuation from the output layer to the first hidden layer?
Show Answer
Each backward step through a sigmoid layer multiplies the gradient by the local sigmoid derivative. Even if every derivative takes its maximum value of 0.25, the gradient is multiplied by 0.25 at each of the 5 hidden layers:

Attenuation = 0.25^5 = **9.77 x 10^{-4}**

This means the gradient reaching the first hidden layer is less than 0.1% of the gradient at the output. In practice, the situation is usually worse because the sigmoid derivative equals 0.25 only at z = 0; for all other z values it is smaller. Additionally, the weight magnitudes can further attenuate or amplify the gradient at each layer. This is why deep sigmoid networks are nearly impossible to train, and why ReLU (with a derivative of 1 for positive inputs) was a breakthrough for deep learning.

Question 24
What does the requires_grad attribute control in PyTorch?
Show Answer
The `requires_grad` attribute (default: `False` for tensors created directly, `True` for `nn.Parameter` objects) controls whether PyTorch tracks operations on a tensor for automatic differentiation.

When `requires_grad=True`:
- All operations on the tensor are recorded in a computational graph
- Calling `.backward()` on a downstream result computes gradients with respect to this tensor
- The gradient is stored in the tensor's `.grad` attribute
- More memory is used to store the computational graph

When `requires_grad=False`:
- Operations are not tracked (faster, less memory)
- No gradient is computed for this tensor
- Used for input data, fixed parameters, and during inference

Input data typically has `requires_grad=False` because we want to optimize the model weights, not the input features. Model parameters (`nn.Parameter`) have `requires_grad=True` by default.
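A brief sketch contrasting untracked input data with a tracked parameter (illustrative tensors, not from the chapter):

```python
import torch

x = torch.randn(8, 3)                        # input data: not tracked by default
w = torch.randn(3, 1, requires_grad=True)    # parameter: tracked

loss = (x @ w).sum()
loss.backward()

print(x.requires_grad, x.grad)               # False, None -- no gradient for the data
print(w.requires_grad, w.grad.shape)         # True, torch.Size([3, 1])
```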
Question 25
You train a neural network and observe that the training loss decreases steadily, but the training accuracy stays at 50% (random chance for binary classification). What is likely wrong?