Exercises: Neural Networks from Scratch
These exercises progress from foundational concepts to full implementations. Exercises marked with a single asterisk (*) are intermediate; those with a double asterisk (**) are advanced. Solutions are provided in code/exercise-solutions.py.
Section A: The Perceptron and Biological Inspiration (Exercises 1--5)
Exercise 1: Perceptron Logic Gates
Implement a single perceptron (no hidden layers) that computes the logical AND function. Find weights w_1, w_2 and bias b such that the perceptron correctly classifies all four input combinations: (0,0), (0,1), (1,0), (1,1).
(a) Write out the decision boundary equation w_1 x_1 + w_2 x_2 + b = 0 and verify it separates the positive and negative examples.
(b) Repeat for the logical OR function.
(c) Explain why no single perceptron can compute XOR. Your explanation should reference the geometry of the decision boundary.
Exercise 2: Perceptron Learning Algorithm
Implement the perceptron learning algorithm from scratch in NumPy. Use it to learn the AND function starting from random weights. Track the number of epochs required for convergence.
(a) Run the algorithm 10 times with different random seeds. What is the average number of epochs to convergence?
(b) What happens if you set the learning rate to 0? To 100? Explain the behavior in each case.
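As a starting point for Exercise 2, here is one possible skeleton of the mistake-driven update rule; the function and variable names are illustrative, not taken from the chapter code:

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, max_epochs=100, seed=0):
    """Train a perceptron with the classic mistake-driven update rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    b = rng.normal()
    for epoch in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            y_pred = int(w @ xi + b > 0)     # step activation
            update = lr * (yi - y_pred)      # zero when the prediction is correct
            w += update * xi
            b += update
            mistakes += int(update != 0)
        if mistakes == 0:                    # one clean pass over the data: converged
            return w, b, epoch + 1
    return w, b, max_epochs

# Example: learn AND from the four binary input combinations.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w, b, epochs = perceptron_train(X_and, y_and)
```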
Exercise 3: Perceptron Decision Boundaries *
Generate a linearly separable 2D dataset with 200 points using sklearn.datasets.make_blobs with centers=2 and random_state=42. Train a perceptron on this data and plot the decision boundary.
(a) How does the final decision boundary change with different random initializations?
(b) Does the perceptron converge to the same boundary as logistic regression (Chapter 10)? Why or why not?
Exercise 4: McCulloch-Pitts Neuron
Implement a McCulloch-Pitts neuron with binary inputs and binary weights. Show that it can compute AND, OR, and NOT, but not XOR. How does this model differ from Rosenblatt's perceptron?
Exercise 5: Biological Plausibility *
Research and write a 200-word paragraph comparing artificial neural networks to biological neural networks. Address at least three specific differences (e.g., learning rules, signal types, connectivity patterns). Why do these differences matter for interpreting AI capabilities?
Section B: Activation Functions (Exercises 6--10)
Exercise 6: Activation Function Implementation
Implement the following activation functions and their derivatives in NumPy: sigmoid, tanh, ReLU, Leaky ReLU (alpha = 0.01), and GELU (using the tanh approximation).
(a) Plot each function and its derivative on the interval [-5, 5].
(b) For each function, compute the output and derivative at z = -2.0, 0.0, and 2.0. Present your results in a table.
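A minimal template for one function/derivative pair; the remaining activations in Exercise 6 follow the same pattern (helper names here are only suggestions):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z: np.ndarray) -> np.ndarray:
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z) * (1 - sigma(z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(sigmoid_derivative(z))
```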
Exercise 7: Vanishing Gradients
Consider a 10-layer network where every layer uses the sigmoid activation function and each layer has a single neuron. Assume all weights are 0.5.
(a) Compute the product of sigmoid derivatives across all 10 layers for an input that produces z = 0 at every layer. What is the magnitude of the gradient at the first layer?
(b) Repeat for z = 3 at every layer. How much smaller is the gradient now?
(c) Explain why ReLU mitigates this problem.
Exercise 8: Dying ReLU *
Create a simple experiment that demonstrates the dying ReLU problem:
(a) Initialize a single-hidden-layer network (input: 2, hidden: 10, output: 1) with large negative biases in the hidden layer. Show that all hidden neurons output zero after the forward pass.
(b) Show that the gradients for the hidden layer weights are all zero during backpropagation.
(c) Implement Leaky ReLU and show that it resolves the issue.
Exercise 9: Softmax Properties
(a) Prove mathematically that the outputs of the softmax function sum to 1 for any input vector.
(b) Show that softmax is invariant to adding a constant to all inputs: softmax(z) = softmax(z + c) for any scalar c. Why is this property useful for numerical stability?
(c) Implement a numerically stable softmax in NumPy and verify it produces the same results as the naive implementation for z = [1000, 1001, 1002].
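For part (c), the usual fix is the max-subtraction trick, which relies on the shift invariance you prove in part (b); a sketch:

```python
import numpy as np

def softmax_stable(z: np.ndarray) -> np.ndarray:
    # Subtracting max(z) leaves the output unchanged (part b)
    # but keeps np.exp from overflowing for large inputs.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

print(softmax_stable(np.array([1000.0, 1001.0, 1002.0])))
```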
Exercise 10: GELU vs. ReLU **
(a) Plot GELU and ReLU on the same axes for z in [-4, 4]. Highlight the region where they differ most.
(b) Compute and plot the derivatives of both functions. Where does GELU have a negative derivative? What does this mean for gradient flow?
(c) Train a 3-layer MLP on the moons dataset (Section 11.8.1) using ReLU and then GELU. Compare training loss curves and final accuracy. Does GELU provide any advantage on this small dataset?
Section C: Forward Pass and Architecture (Exercises 11--15)
Exercise 11: Manual Forward Pass
Given a 2-layer network with:
$$ \mathbf{W}^{[1]} = \begin{bmatrix} 0.1 & 0.3 \\ -0.2 & 0.5 \\ 0.4 & -0.1 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.0 \\ 0.1 \\ -0.2 \end{bmatrix} $$
$$ \mathbf{W}^{[2]} = \begin{bmatrix} 0.2 & -0.4 & 0.6 \end{bmatrix}, \quad b^{[2]} = 0.3 $$
Compute the forward pass for input x = [1.0, -0.5]^T using ReLU for the hidden layer and sigmoid for the output. Show all intermediate values (z^[1], a^[1], z^[2], y_hat).
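A short NumPy sketch you can use to check your hand computation; the array shapes mirror the matrices given above:

```python
import numpy as np

W1 = np.array([[0.1, 0.3], [-0.2, 0.5], [0.4, -0.1]])   # (3, 2)
b1 = np.array([[0.0], [0.1], [-0.2]])                    # (3, 1)
W2 = np.array([[0.2, -0.4, 0.6]])                        # (1, 3)
b2 = 0.3
x = np.array([[1.0], [-0.5]])                            # (2, 1)

z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                                 # ReLU
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))                        # sigmoid
print(z1.ravel(), a1.ravel(), z2.item(), y_hat.item())
```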
Exercise 12: Batch Forward Pass
Extend Exercise 11 to process a mini-batch of three examples simultaneously:
$$ \mathbf{X} = \begin{bmatrix} 1.0 & 0.0 & -1.0 \\ -0.5 & 0.5 & 1.0 \end{bmatrix} $$
Verify the shapes of all intermediate matrices and confirm that each column of the output matches the result of processing each example individually.
Exercise 13: Network Capacity *
(a) How many trainable parameters does a network with architecture [784, 256, 128, 10] have? Show your calculation, counting both weights and biases.
(b) Write a function count_parameters(layer_dims: list[int]) -> int that computes the total number of parameters for any architecture.
(c) If each parameter is a 32-bit float, how much memory (in MB) does this network require?
Exercise 14: Universal Approximation *
(a) Create a 1D target function: f(x) = sin(2πx) for x in [0, 1]. Generate 1000 training points.
(b) Train a single-hidden-layer network with varying numbers of hidden neurons: 2, 5, 10, 50, 200. Plot the learned function and the true function for each.
(c) At what point does the network closely approximate the true function? How does this relate to the universal approximation theorem?
Exercise 15: Feature Visualization **
Train a network with architecture [2, 8, 4, 1] on the XOR problem.
(a) After training, extract the activations of the first hidden layer for all four inputs. Plot these 8-dimensional representations (use the first 2 or 3 dimensions) and observe how the network has transformed the input space.
(b) Do the same for the second hidden layer (4-dimensional). Is XOR linearly separable in this space?
(c) Explain how the hidden layers create a representation that makes the problem solvable.
Section D: Backpropagation (Exercises 16--22)
Exercise 16: Manual Backpropagation
Using the network and input from Exercise 11, assume the true label is y = 1.
(a) Compute the binary cross-entropy loss.
(b) Compute delta^[2] = y_hat - y.
(c) Compute dL/dW^[2] and dL/db^[2].
(d) Backpropagate to the hidden layer: compute delta^[1].
(e) Compute dL/dW^[1] and dL/db^[1].
(f) Verify your gradients using numerical gradient checking with epsilon = 10^{-7}.
Exercise 17: Gradient Checking Implementation
Implement a general-purpose gradient checking function:
```python
def gradient_check(
    parameters: dict,
    x: np.ndarray,
    y: np.ndarray,
    num_layers: int,
    epsilon: float = 1e-7,
) -> float:
    """Compare analytical and numerical gradients.

    Returns the relative difference.
    """
```
Apply it to your network trained on XOR. What relative difference do you observe? Is your backpropagation correct?
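The numerical side of the check is a central-difference loop over a flattened parameter vector. Here is a sketch under the assumption that you can express the loss as a function of that flat vector; loss_fn and the flattening/unflattening steps are yours to write, and the helper names are illustrative:

```python
import numpy as np

def numerical_gradient(loss_fn, theta: np.ndarray, epsilon: float = 1e-7) -> np.ndarray:
    """Central-difference approximation of d(loss)/d(theta) for a flat parameter vector."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2.0 * epsilon)
    return grad

def relative_difference(g_analytic: np.ndarray, g_numeric: np.ndarray) -> float:
    return np.linalg.norm(g_analytic - g_numeric) / (
        np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
    )
```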
Exercise 18: Backpropagation with tanh *
(a) Derive the backpropagation equations for a network that uses tanh instead of ReLU in the hidden layers. How does the derivative of tanh change the computation compared to ReLU?
(b) Modify the NumPy implementation from Section 11.6 to support tanh. Train on the XOR dataset and compare the convergence speed to ReLU.
Exercise 19: Multi-Class Backpropagation *
Derive the backpropagation equations for a network with softmax output and categorical cross-entropy loss.
(a) Show that delta^[L] = y_hat - y (where y is one-hot encoded), analogous to the sigmoid + BCE case.
(b) Implement the forward and backward passes for a 3-class classification problem.
(c) Generate a 3-class dataset using make_blobs and train your network. Report the classification accuracy.
Exercise 20: Computational Graph *
Draw the computational graph for a 2-layer network computing binary cross-entropy loss. Label each node with the operation it performs and each edge with the intermediate value it carries. Then annotate the graph with the gradients computed during the backward pass.
Exercise 21: Batch Normalization Gradient **
(a) In batch normalization, the pre-activations are normalized as:
$$ \hat{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} $$
where mu and sigma^2 are the batch mean and variance. Derive dL/dz_i given dL/dz_hat_i.
(b) Why is this derivation more complex than standard backpropagation? (Hint: mu and sigma^2 both depend on all z_i in the batch.)
Exercise 22: Backpropagation Complexity **
(a) What is the computational complexity (in terms of FLOPs) of the forward pass through a fully connected layer with n_in inputs and n_out outputs, processing a batch of m examples?
(b) Show that the backward pass through the same layer has approximately the same computational cost as the forward pass.
(c) For a network with layers [784, 512, 256, 128, 10], compute the total FLOPs for one forward-backward pass on a batch of 64 examples.
Section E: NumPy to PyTorch (Exercises 23--30)
Exercise 23: Tensor Operations
Perform the following operations in both NumPy and PyTorch, and verify the results match:
(a) Create a 3x4 matrix of ones. Multiply it element-wise by a vector [1, 2, 3, 4].
(b) Compute the matrix product of a (5, 3) random matrix and a (3, 7) random matrix.
(c) Apply softmax to the vector [2.0, 1.0, 0.1].
Exercise 24: Autograd Exploration
(a) Create a scalar computation: y = 3x^2 + 2x + 1 for x = 2.0 with requires_grad=True. Call y.backward() and verify that x.grad equals 14.0 (the analytical derivative 6x + 2 evaluated at x = 2).
(b) Create a vector computation: y = Wx + b where W is (3, 2), x is (2,), and b is (3,). Compute the loss as the sum of y. What are the shapes of the gradients of W, x, and b?
(c) What happens to the stored gradients if you run two forward-and-backward passes without calling optimizer.zero_grad() (or .grad.zero_()) in between? Demonstrate with a simple example. (Note that calling .backward() twice on the same graph raises an error unless you pass retain_graph=True, so rebuild the computation for the second pass.)
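A minimal sketch of the autograd calls involved in parts (a) and (b); remember that a gradient is only stored for tensors created with requires_grad=True:

```python
import torch

# Part (a): scalar computation.
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x**2 + 2 * x + 1
y.backward()
print(x.grad)                     # dy/dx = 6x + 2 evaluated at x = 2

# Part (b): vector computation; sum() reduces the output to a scalar,
# so .backward() needs no argument.
W = torch.randn(3, 2, requires_grad=True)
v = torch.randn(2, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = (W @ v + b).sum()
loss.backward()
print(W.grad.shape, v.grad.shape, b.grad.shape)
```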
Exercise 25: Custom nn.Module *
Create a PyTorch nn.Module that implements a network with:
- Input dimension: 10
- Three hidden layers with dimensions 64, 32, 16
- ReLU activations between hidden layers
- Output dimension: 1 with sigmoid activation
- Dropout (p=0.2) after each hidden layer
(a) Print the model architecture using print(model).
(b) Count the total number of trainable parameters.
(c) Train the model on a randomly generated binary classification dataset with 1000 examples.
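One possible skeleton for the module in Exercise 25; this reflects one reading of the spec, with dropout placed after each ReLU, so adjust as you see fit (the class name is arbitrary):

```python
import torch
from torch import nn

class ExerciseMLP(nn.Module):
    def __init__(self, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(32, 16), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ExerciseMLP()
print(model)                                                          # part (a)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # part (b)
```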
Exercise 26: PyTorch XOR *
Solve the XOR problem using PyTorch. Experiment with:
(a) Different optimizers: SGD, Adam, and RMSprop. Which converges fastest?
(b) Different hidden layer sizes: 2, 4, 8, 16. What is the minimum size that reliably solves XOR?
(c) Different learning rates: 0.001, 0.01, 0.1, 1.0. Plot the loss curves for each.
Exercise 27: PyTorch vs. NumPy Speed *
Compare the execution time of a forward pass in your NumPy and PyTorch implementations for networks of increasing size:
(a) Architecture [100, 50, 10] with batch size 32.
(b) Architecture [1000, 500, 200, 100, 10] with batch size 256.
(c) Architecture [10000, 5000, 1000, 100] with batch size 1024.
Use time.perf_counter() for timing. At what scale does PyTorch (CPU) become faster than NumPy? If you have a GPU, add GPU timings.
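A timing pattern you can adapt; this sketch times only a single matrix multiply with illustrative sizes, so extending it to your full forward passes (and to GPU tensors) is the exercise:

```python
import time
import numpy as np
import torch

def time_fn(fn, n_repeats: int = 50) -> float:
    fn()                                   # warm-up call
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    return (time.perf_counter() - start) / n_repeats

batch, d_in, d_out = 256, 1000, 500        # illustrative sizes
X_np = np.random.randn(batch, d_in)
W_np = np.random.randn(d_in, d_out)
X_t, W_t = torch.from_numpy(X_np), torch.from_numpy(W_np)

print("NumPy  :", time_fn(lambda: X_np @ W_np))
print("PyTorch:", time_fn(lambda: X_t @ W_t))
```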
Exercise 28: Custom Loss Function *
Implement Huber loss as a custom PyTorch loss function:
$$ \mathcal{L}_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} $$
(a) Implement it as a Python function that operates on tensors.
(b) Verify that autograd correctly computes its gradient by comparing to a numerical approximation.
(c) Train a regression network using your Huber loss and compare to MSE on a dataset with outliers. Which loss is more robust?
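One way to translate the piecewise definition above into tensor operations (a sketch; PyTorch also ships a built-in torch.nn.HuberLoss you can compare against):

```python
import torch

def huber_loss(y: torch.Tensor, y_hat: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    err = y - y_hat
    abs_err = err.abs()
    quadratic = 0.5 * err**2                      # |y - y_hat| <= delta branch
    linear = delta * abs_err - 0.5 * delta**2     # otherwise
    return torch.where(abs_err <= delta, quadratic, linear).mean()
```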
Exercise 29: Learning Rate Finder **
Implement a simple learning rate finder that:
(a) Starts with a very small learning rate (e.g., 1e-7) and exponentially increases it over one epoch.
(b) Records the loss at each step.
(c) Plots loss vs. learning rate on a log scale.
(d) Suggests the optimal learning rate as the one where the loss is decreasing most rapidly.
Apply this to the moons dataset from Section 11.8.1.
Exercise 30: Regularization Preview **
(a) Train an MLP with architecture [2, 128, 64, 1] on the moons dataset with only 50 training examples. Observe overfitting by plotting training and test loss.
(b) Add L2 regularization by modifying the loss: L_total = L_BCE + (lambda/2m) * sum(||W^[l]||_F^2). Implement this in both NumPy and PyTorch (hint: use the weight_decay parameter in PyTorch optimizers).
(c) Compare the decision boundaries with and without regularization. We will explore regularization techniques in depth in Chapter 12.
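For the PyTorch half of part (b), L2 regularization is usually applied through the optimizer rather than by editing the loss; a sketch, with an arbitrary placeholder coefficient:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(2, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
# weight_decay adds lambda * W to each gradient, which is equivalent to an L2 penalty
# on the loss. Note that it also decays biases unless you use separate parameter groups.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-3)
```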
Section F: Synthesis and Challenge Problems (Exercises 31--35)
Exercise 31: Full NumPy Implementation *
Extend the NumPy implementation from Section 11.6 to support:
(a) Multiple activation functions (sigmoid, tanh, ReLU) configurable per layer.
(b) Multiple loss functions (MSE, binary cross-entropy).
(c) Mini-batch gradient descent with a configurable batch size.
Train on the moons dataset and verify that your implementation achieves >90% accuracy.
Exercise 32: Neural Network as Function Composition *
(a) Write a mathematical expression for the function computed by a 3-layer network with ReLU activations and a linear output, treating it as a composition of functions.
(b) For a network with architecture [1, 3, 3, 1] and ReLU activations, what is the maximum number of linear regions the network can represent? (This is a research question; give your best analysis.)
(c) Verify your answer empirically by training such a network on a piecewise linear function and counting the linear segments in the output.
Exercise 33: Implementing Momentum **
(a) Implement SGD with momentum in NumPy:
$$ \mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta) \nabla_\theta \mathcal{L} $$
$$ \theta \leftarrow \theta - \eta \mathbf{v}_t $$
(b) Train on the moons dataset with momentum values beta = 0.0, 0.5, 0.9, and 0.99. Plot the loss curves and compare convergence speed.
(c) Verify that your results match PyTorch's torch.optim.SGD with momentum. Note that PyTorch's default update omits the (1 - beta) factor (its dampening parameter defaults to 0), so set dampening=beta, or drop the factor from your implementation, for an exact match.
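A sketch of one momentum step in the damped form given above; each parameter array keeps its own velocity buffer, initialized to zeros (names are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """One update of a single parameter array with its velocity buffer."""
    velocity = beta * velocity + (1.0 - beta) * grad
    theta = theta - lr * velocity
    return theta, velocity

# Usage inside the training loop, e.g. for W1 with v_W1 = np.zeros_like(W1) initially:
# W1, v_W1 = sgd_momentum_step(W1, dW1, v_W1, lr=0.1, beta=0.9)
```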
Exercise 34: Weight Initialization Experiment **
Compare the following weight initialization strategies on a 5-layer network [2, 64, 64, 64, 64, 1]:
(a) All zeros.
(b) All ones.
(c) Small random: zero-mean Gaussian with standard deviation 0.01.
(d) Xavier/Glorot: zero-mean Gaussian with standard deviation sqrt(2 / (n_in + n_out)).
(e) He: zero-mean Gaussian with standard deviation sqrt(2 / n_in).
For each, plot the distribution of activations and gradients at each layer after one forward-backward pass. Which initializations lead to healthy gradient flow?
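A helper that produces the five schemes listed above, assuming weight matrices of shape (n_out, n_in) as in Exercise 11; the function name and scheme labels are illustrative:

```python
import numpy as np

def init_weights(n_in: int, n_out: int, scheme: str, rng=None) -> np.ndarray:
    if rng is None:
        rng = np.random.default_rng(0)
    shape = (n_out, n_in)
    if scheme == "zeros":
        return np.zeros(shape)
    if scheme == "ones":
        return np.ones(shape)
    if scheme == "small":
        return rng.normal(0.0, 0.01, size=shape)
    if scheme == "xavier":
        return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=shape)
    if scheme == "he":
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=shape)
    raise ValueError(f"unknown scheme: {scheme}")
```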
Exercise 35: Build Your Own Deep Learning Framework **
Design and implement a minimal deep learning framework in Python that supports:
(a) A Tensor class that tracks operations for autograd.
(b) A backward() method that computes gradients via reverse-mode automatic differentiation.
(c) At minimum, support for: addition, multiplication, matrix multiplication, ReLU, and sigmoid.
(d) Verify your framework by training a 2-layer network on XOR and comparing gradients to your NumPy implementation.
This exercise is deliberately open-ended. Focus on correctness over performance.