Case Study 1: Handwritten Digit Classification from Scratch
Overview
In this case study, you will build a handwritten digit classifier from scratch using only NumPy, then rebuild it in PyTorch. The task is to classify 28x28 grayscale images of handwritten digits (0--9) from the MNIST dataset---a benchmark that has been called the "Hello World" of deep learning.
This case study brings together every concept from Chapter 11: forward passes, activation functions, loss computation, backpropagation, and gradient descent. By the end, you will have a working system that achieves over 95% accuracy on digit classification, built from first principles.
The Dataset
The MNIST dataset contains 60,000 training images and 10,000 test images. Each image is 28x28 pixels, which we flatten into a 784-dimensional vector. Each pixel value ranges from 0 (black) to 255 (white).
import numpy as np
def load_mnist_data() -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
"""Load and preprocess MNIST data.
Returns:
Tuple of (X_train, Y_train, X_test, Y_test).
X arrays have shape (784, m), Y arrays have shape (10, m).
"""
# In practice, use sklearn or torchvision to load MNIST
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data.astype(np.float64)
y = mnist.target.astype(np.int64)
# Normalize pixel values to [0, 1]
X = X / 255.0
# Train/test split
X_train, X_test = X[:60000].T, X[60000:].T # (784, m)
y_train, y_test = y[:60000], y[60000:]
# One-hot encode labels
Y_train = np.eye(10)[y_train].T # (10, 60000)
Y_test = np.eye(10)[y_test].T # (10, 10000)
return X_train, Y_train, X_test, Y_test
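The np.eye(10)[y] indexing used above is a compact one-hot trick: row y[j] of the 10x10 identity matrix is exactly the one-hot vector for label y[j]. A quick illustration on three made-up labels:

```python
import numpy as np

y = np.array([3, 0, 7])          # three integer labels
one_hot = np.eye(10)[y].T        # shape (10, 3): column j encodes y[j]

print(one_hot.shape)             # (10, 3)
print(one_hot[:, 0])             # 1.0 in position 3, zeros elsewhere
```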
Network Architecture
We use a three-layer network:
- Input: 784 neurons (one per pixel)
- Hidden layer 1: 128 neurons, ReLU activation
- Hidden layer 2: 64 neurons, ReLU activation
- Output layer: 10 neurons, softmax activation (one per digit class)
This architecture has 784 x 128 + 128 + 128 x 64 + 64 + 64 x 10 + 10 = 109,386 parameters.
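The count can be checked with a quick sketch: each fully connected layer l contributes n_l x n_(l-1) weights plus n_l biases.

```python
layer_dims = [784, 128, 64, 10]

# Weights (n_l x n_{l-1}) plus biases (n_l) for each of the three layers.
total = sum(
    layer_dims[l] * layer_dims[l - 1] + layer_dims[l]
    for l in range(1, len(layer_dims))
)
print(total)  # 109386
```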
NumPy Implementation
Activation Functions
def relu(z: np.ndarray) -> np.ndarray:
"""ReLU activation."""
return np.maximum(0, z)
def relu_derivative(z: np.ndarray) -> np.ndarray:
"""Derivative of ReLU."""
return (z > 0).astype(np.float64)
def softmax(z: np.ndarray) -> np.ndarray:
"""Numerically stable softmax.
Args:
z: Pre-activation of shape (K, m).
Returns:
Probabilities of shape (K, m), columns sum to 1.
"""
exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
return exp_z / np.sum(exp_z, axis=0, keepdims=True)
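The max-subtraction is not cosmetic: np.exp overflows to inf for logits much above 700, while softmax itself is unchanged by adding a constant to every entry of a column. A small demonstration of the failure mode and the fix:

```python
import numpy as np

z = np.array([[1000.0], [1001.0], [1002.0]])  # large logits, shape (3, 1)

# Naive softmax: exp(1000) overflows to inf, so inf / inf produces nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.sum(np.exp(z), axis=0, keepdims=True)

# Stable softmax: shift by the column max first; the result is
# mathematically identical because the shift cancels in the ratio.
shifted = np.exp(z - np.max(z, axis=0, keepdims=True))
stable = shifted / np.sum(shifted, axis=0, keepdims=True)

print(naive.ravel())   # [nan nan nan]
print(stable.ravel())  # approximately [0.090 0.245 0.665]
```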
Weight Initialization
def initialize_parameters(
layer_dims: list[int],
) -> dict[str, np.ndarray]:
"""He initialization for weights, zeros for biases.
Args:
layer_dims: E.g., [784, 128, 64, 10].
Returns:
Dictionary with W1, b1, W2, b2, W3, b3.
"""
np.random.seed(42)
params = {}
for l in range(1, len(layer_dims)):
params[f"W{l}"] = np.random.randn(
layer_dims[l], layer_dims[l - 1]
) * np.sqrt(2.0 / layer_dims[l - 1])
params[f"b{l}"] = np.zeros((layer_dims[l], 1))
return params
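The sqrt(2/n) factor is chosen so that, with unit-variance inputs, each pre-activation has variance about 2; ReLU then zeroes the negative half, bringing the second moment back to about 1 and keeping signal magnitudes stable across layers. An empirical sanity check of this reasoning:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, m = 784, 128, 10_000

x = rng.normal(size=(n_in, m))                             # unit-variance input
w = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)   # He scaling
z = w @ x
a = np.maximum(0, z)                                       # ReLU

print(z.var())          # close to 2.0
print((a ** 2).mean())  # close to 1.0: second moment preserved
```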
Forward Pass
def forward(
x: np.ndarray,
params: dict[str, np.ndarray],
) -> tuple[np.ndarray, dict]:
"""Forward pass for the 3-layer MNIST network.
Args:
x: Input of shape (784, m).
params: Network parameters.
Returns:
Tuple of (predictions of shape (10, m), cache for backprop).
"""
# Layer 1
z1 = params["W1"] @ x + params["b1"]
a1 = relu(z1)
# Layer 2
z2 = params["W2"] @ a1 + params["b2"]
a2 = relu(z2)
# Layer 3 (output)
z3 = params["W3"] @ a2 + params["b3"]
a3 = softmax(z3)
cache = {
"Z1": z1, "A1": a1,
"Z2": z2, "A2": a2,
"Z3": z3, "A3": a3,
"X": x,
}
return a3, cache
Loss Function
For multi-class classification, we use categorical cross-entropy:
def compute_loss(
y_hat: np.ndarray,
y: np.ndarray,
) -> float:
"""Categorical cross-entropy loss.
Args:
y_hat: Predictions of shape (10, m).
y: One-hot labels of shape (10, m).
Returns:
Scalar loss value.
"""
m = y.shape[1]
epsilon = 1e-8
loss = -np.sum(y * np.log(y_hat + epsilon)) / m
return float(loss)
Backward Pass
The backward pass for softmax + cross-entropy yields a clean gradient, dz3 = y_hat - y, just like sigmoid + BCE:
def backward(
y: np.ndarray,
params: dict[str, np.ndarray],
cache: dict,
) -> dict[str, np.ndarray]:
"""Backpropagation for the 3-layer MNIST network.
Args:
y: One-hot labels of shape (10, m).
params: Network parameters.
cache: Forward pass cache.
Returns:
Gradients dictionary.
"""
m = y.shape[1]
grads = {}
# Output layer (softmax + cross-entropy)
dz3 = cache["A3"] - y # (10, m)
grads["dW3"] = dz3 @ cache["A2"].T / m
grads["db3"] = np.sum(dz3, axis=1, keepdims=True) / m
# Hidden layer 2
da2 = params["W3"].T @ dz3
dz2 = da2 * relu_derivative(cache["Z2"])
grads["dW2"] = dz2 @ cache["A1"].T / m
grads["db2"] = np.sum(dz2, axis=1, keepdims=True) / m
# Hidden layer 1
da1 = params["W2"].T @ dz2
dz1 = da1 * relu_derivative(cache["Z1"])
grads["dW1"] = dz1 @ cache["X"].T / m
grads["db1"] = np.sum(dz1, axis=1, keepdims=True) / m
return grads
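The dz3 = A3 - y shortcut can be verified numerically: on a tiny random problem, the analytic gradient of the averaged cross-entropy with respect to the logits should agree with central finite differences. A standalone check (independent of the functions above):

```python
import numpy as np

def sm(z):
    """Numerically stable softmax over columns."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def ce(z, y):
    """Averaged cross-entropy loss on logits z, one-hot labels y."""
    return -np.sum(y * np.log(sm(z) + 1e-12)) / y.shape[1]

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 4))                  # logits for 4 examples
y = np.eye(10)[rng.integers(0, 10, 4)].T      # one-hot labels, shape (10, 4)

# Shortcut formula: gradient of the averaged loss w.r.t. the logits.
analytic = (sm(z) - y) / y.shape[1]

# Central finite differences on each logit entry.
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numeric[i, j] = (ce(zp, y) - ce(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny, well below 1e-6
```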
Training Loop with Mini-Batches
Processing all 60,000 examples at once is memory-intensive and yields accurate but slow progress: one parameter update per full pass through the data. Mini-batch gradient descent (Chapter 9) offers a practical middle ground, trading noisier gradient estimates for many more updates per epoch:
def create_mini_batches(
x: np.ndarray,
y: np.ndarray,
batch_size: int,
) -> list[tuple[np.ndarray, np.ndarray]]:
"""Split data into mini-batches.
Args:
x: Input data of shape (n_features, m).
y: Labels of shape (n_classes, m).
batch_size: Number of examples per batch.
Returns:
List of (X_batch, Y_batch) tuples.
"""
m = x.shape[1]
permutation = np.random.permutation(m)
x_shuffled = x[:, permutation]
y_shuffled = y[:, permutation]
batches = []
num_complete = m // batch_size
for i in range(num_complete):
start = i * batch_size
end = start + batch_size
batches.append((x_shuffled[:, start:end], y_shuffled[:, start:end]))
# Handle remaining examples
if m % batch_size != 0:
batches.append((x_shuffled[:, num_complete * batch_size:],
y_shuffled[:, num_complete * batch_size:]))
return batches
def train_mnist(
x_train: np.ndarray,
y_train: np.ndarray,
x_test: np.ndarray,
y_test: np.ndarray,
layer_dims: list[int],
learning_rate: float = 0.1,
num_epochs: int = 20,
batch_size: int = 128,
) -> tuple[dict, list[float], list[float]]:
"""Train the MNIST classifier.
Args:
x_train: Training inputs of shape (784, 60000).
y_train: Training labels of shape (10, 60000).
x_test: Test inputs of shape (784, 10000).
y_test: Test labels of shape (10, 10000).
layer_dims: Network architecture.
learning_rate: Step size for gradient descent.
num_epochs: Number of passes through the training set.
batch_size: Mini-batch size.
Returns:
Tuple of (trained parameters, train losses, test accuracies).
"""
params = initialize_parameters(layer_dims)
train_losses = []
test_accuracies = []
for epoch in range(num_epochs):
epoch_loss = 0.0
batches = create_mini_batches(x_train, y_train, batch_size)
for x_batch, y_batch in batches:
# Forward pass
y_hat, cache = forward(x_batch, params)
# Compute loss
loss = compute_loss(y_hat, y_batch)
epoch_loss += loss * x_batch.shape[1]
# Backward pass
grads = backward(y_batch, params, cache)
# Update parameters
for l in range(1, len(layer_dims)):
params[f"W{l}"] -= learning_rate * grads[f"dW{l}"]
params[f"b{l}"] -= learning_rate * grads[f"db{l}"]
# Epoch statistics
avg_loss = epoch_loss / x_train.shape[1]
train_losses.append(avg_loss)
# Test accuracy
y_pred, _ = forward(x_test, params)
accuracy = np.mean(
np.argmax(y_pred, axis=0) == np.argmax(y_test, axis=0)
)
test_accuracies.append(accuracy)
print(
f"Epoch {epoch + 1:2d}/{num_epochs} | "
f"Loss: {avg_loss:.4f} | "
f"Test Acc: {accuracy:.4f}"
)
return params, train_losses, test_accuracies
Running the NumPy Implementation
# Load data
X_train, Y_train, X_test, Y_test = load_mnist_data()
# Train
params, losses, accuracies = train_mnist(
X_train, Y_train, X_test, Y_test,
layer_dims=[784, 128, 64, 10],
learning_rate=0.1,
num_epochs=20,
batch_size=128,
)
# Expected output (approximate):
# Epoch 1/20 | Loss: 0.4523 | Test Acc: 0.9125
# Epoch 5/20 | Loss: 0.1432 | Test Acc: 0.9543
# Epoch 10/20 | Loss: 0.0782 | Test Acc: 0.9651
# Epoch 20/20 | Loss: 0.0298 | Test Acc: 0.9712
PyTorch Implementation
Now we rebuild the same network in PyTorch, demonstrating how the framework handles the mechanics:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
class MNISTNet(nn.Module):
"""Three-layer MLP for MNIST digit classification.
Architecture: 784 -> 128 (ReLU) -> 64 (ReLU) -> 10 (logits).
No explicit softmax layer: nn.CrossEntropyLoss applies log-softmax internally.
"""
def __init__(self) -> None:
super().__init__()
self.net = nn.Sequential(
nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
)
self._init_weights()
def _init_weights(self) -> None:
"""Apply He initialization to all linear layers."""
for module in self.net:
if isinstance(module, nn.Linear):
nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
nn.init.zeros_(module.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass.
Args:
x: Input tensor of shape (batch_size, 784).
Returns:
Logits of shape (batch_size, 10).
"""
return self.net(x)
def train_pytorch(
num_epochs: int = 20,
batch_size: int = 128,
learning_rate: float = 0.1,
) -> tuple[list[float], list[float]]:
"""Train the PyTorch MNIST classifier.
Args:
num_epochs: Number of training epochs.
batch_size: Mini-batch size.
learning_rate: Learning rate for SGD.
Returns:
Tuple of (train losses, test accuracies).
"""
torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load data (reuse sklearn loader, convert to tensors)
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = torch.tensor(mnist.data / 255.0, dtype=torch.float32)
y = torch.tensor(mnist.target.astype(int), dtype=torch.long)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
train_loader = DataLoader(
TensorDataset(X_train, y_train),
batch_size=batch_size,
shuffle=True,
)
# Model, loss, optimizer
model = MNISTNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_losses = []
test_accuracies = []
for epoch in range(num_epochs):
model.train()
epoch_loss = 0.0
for x_batch, y_batch in train_loader:
x_batch, y_batch = x_batch.to(device), y_batch.to(device)
# Forward pass
logits = model(x_batch)
loss = criterion(logits, y_batch)
# Backward pass and update
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item() * x_batch.size(0)
avg_loss = epoch_loss / len(X_train)
train_losses.append(avg_loss)
# Evaluate
model.eval()
with torch.no_grad():
logits = model(X_test.to(device))
preds = logits.argmax(dim=1)
accuracy = (preds == y_test.to(device)).float().mean().item()
test_accuracies.append(accuracy)
print(
f"Epoch {epoch + 1:2d}/{num_epochs} | "
f"Loss: {avg_loss:.4f} | "
f"Test Acc: {accuracy:.4f}"
)
return train_losses, test_accuracies
Results and Analysis
Both implementations achieve approximately 97% test accuracy after 20 epochs. Key observations:
- Training dynamics: Both implementations show similar loss curves because they use the same architecture, initialization scheme, and learning rate.
- Code complexity: The NumPy implementation requires about 150 lines of mathematical code. The PyTorch version requires about 60 lines, with the framework handling backpropagation, mini-batching, and GPU transfer.
- Speed: The PyTorch implementation is significantly faster, especially on GPU. On a modern GPU, each epoch takes about 3 seconds; in NumPy on CPU, each epoch takes about 30 seconds.
- Error analysis: The most common errors are between visually similar digits: 4 and 9, 3 and 8, 7 and 1. Convolutional neural networks (Chapter 13) handle these distinctions better by exploiting spatial structure.
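One way to see this pattern is a confusion matrix: rows index the true digit, columns the predicted digit, and large off-diagonal entries mark the troublesome pairs. A minimal sketch on hand-made labels (on the real test set you would pass np.argmax(Y_test, axis=0) and the model's argmax predictions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Count matrix: rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hand-made example: one 4 misread as a 9, one 3 misread as an 8.
y_true = np.array([4, 4, 3, 3, 7, 1])
y_pred = np.array([4, 9, 3, 8, 7, 1])
cm = confusion_matrix(y_true, y_pred)
print(cm[4, 9], cm[3, 8])  # 1 1
```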
Key Takeaways
- A three-layer MLP with only 109,386 parameters can achieve 97%+ accuracy on MNIST.
- The NumPy implementation reveals every mathematical operation; the PyTorch implementation demonstrates production patterns.
- Mini-batch gradient descent is essential for large datasets---processing all 60,000 examples at once is impractical.
- Softmax + cross-entropy yields the same clean gradient (y_hat - y) as sigmoid + BCE, making multi-class backpropagation elegant.
- This baseline sets the stage for convolutional networks in Chapter 13, which exploit the 2D spatial structure that our MLP ignores.