Case Study 1: Handwritten Digit Classification from Scratch

Overview

In this case study, you will build a handwritten digit classifier from scratch using only NumPy, then rebuild it in PyTorch. The task is to classify 28x28 grayscale images of handwritten digits (0--9) from the MNIST dataset---a benchmark that has been called the "Hello World" of deep learning.

This case study brings together every concept from Chapter 11: forward passes, activation functions, loss computation, backpropagation, and gradient descent. By the end, you will have a working system that achieves over 95% accuracy on digit classification, built from first principles.

The Dataset

The MNIST dataset contains 60,000 training images and 10,000 test images. Each image is 28x28 pixels, which we flatten into a 784-dimensional vector. Each pixel value ranges from 0 (black) to 255 (white).

import numpy as np

def load_mnist_data() -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Load and preprocess MNIST data.

    Returns:
        Tuple of (X_train, Y_train, X_test, Y_test).
        X arrays have shape (784, m), Y arrays have shape (10, m).
    """
    # In practice, use sklearn or torchvision to load MNIST
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X = mnist.data.astype(np.float64)
    y = mnist.target.astype(np.int64)

    # Normalize pixel values to [0, 1]
    X = X / 255.0

    # Train/test split
    X_train, X_test = X[:60000].T, X[60000:].T  # (784, m)
    y_train, y_test = y[:60000], y[60000:]

    # One-hot encode labels
    Y_train = np.eye(10)[y_train].T  # (10, 60000)
    Y_test = np.eye(10)[y_test].T    # (10, 10000)

    return X_train, Y_train, X_test, Y_test

Network Architecture

We use a three-layer network:

  • Input: 784 neurons (one per pixel)
  • Hidden layer 1: 128 neurons, ReLU activation
  • Hidden layer 2: 64 neurons, ReLU activation
  • Output layer: 10 neurons, softmax activation (one per digit class)

This architecture has 784 × 128 + 128 + 128 × 64 + 64 + 64 × 10 + 10 = 109,386 parameters.
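
A quick way to verify this count directly from the layer sizes (each layer contributes a weight matrix of shape (n_out, n_in) plus a bias vector of length n_out):

layer_dims = [784, 128, 64, 10]
num_params = sum(
    layer_dims[l] * layer_dims[l - 1] + layer_dims[l]  # weights + biases for layer l
    for l in range(1, len(layer_dims))
)
print(num_params)  # 109386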

NumPy Implementation

Activation Functions

def relu(z: np.ndarray) -> np.ndarray:
    """ReLU activation."""
    return np.maximum(0, z)


def relu_derivative(z: np.ndarray) -> np.ndarray:
    """Derivative of ReLU."""
    return (z > 0).astype(np.float64)


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax.

    Args:
        z: Pre-activation of shape (K, m).

    Returns:
        Probabilities of shape (K, m), columns sum to 1.
    """
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)
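
A brief aside on the stability trick (an illustrative snippet, not part of the pipeline): softmax is invariant to adding a constant to every logit, so subtracting the column-wise max changes nothing mathematically, but it prevents np.exp from overflowing on large pre-activations.

z_large = np.full((10, 3), 1000.0)  # logits so large that np.exp(1000.0) would overflow to inf
probs = softmax(z_large)
print(probs[:, 0])        # uniform: every entry is 0.1
print(probs.sum(axis=0))  # each column sums to 1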

Weight Initialization

def initialize_parameters(
    layer_dims: list[int],
) -> dict[str, np.ndarray]:
    """He initialization for weights, zeros for biases.

    Args:
        layer_dims: E.g., [784, 128, 64, 10].

    Returns:
        Dictionary with W1, b1, W2, b2, W3, b3.
    """
    np.random.seed(42)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = np.random.randn(
            layer_dims[l], layer_dims[l - 1]
        ) * np.sqrt(2.0 / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
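
As a rough check (illustrative only), each weight matrix should have shape (n_out, n_in) and a sample standard deviation close to sqrt(2 / n_in):

params = initialize_parameters([784, 128, 64, 10])
print(params["W1"].shape, params["b1"].shape)  # (128, 784) (128, 1)
print(params["W1"].std())  # roughly 0.05, close to sqrt(2 / 784) ≈ 0.0505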

Forward Pass

def forward(
    x: np.ndarray,
    params: dict[str, np.ndarray],
) -> tuple[np.ndarray, dict]:
    """Forward pass for the 3-layer MNIST network.

    Args:
        x: Input of shape (784, m).
        params: Network parameters.

    Returns:
        Tuple of (predictions of shape (10, m), cache for backprop).
    """
    # Layer 1
    z1 = params["W1"] @ x + params["b1"]
    a1 = relu(z1)

    # Layer 2
    z2 = params["W2"] @ a1 + params["b2"]
    a2 = relu(z2)

    # Layer 3 (output)
    z3 = params["W3"] @ a2 + params["b3"]
    a3 = softmax(z3)

    cache = {
        "Z1": z1, "A1": a1,
        "Z2": z2, "A2": a2,
        "Z3": z3, "A3": a3,
        "X": x,
    }
    return a3, cache

Loss Function

For multi-class classification, we use categorical cross-entropy:

def compute_loss(
    y_hat: np.ndarray,
    y: np.ndarray,
) -> float:
    """Categorical cross-entropy loss.

    Args:
        y_hat: Predictions of shape (10, m).
        y: One-hot labels of shape (10, m).

    Returns:
        Scalar loss value.
    """
    m = y.shape[1]
    epsilon = 1e-8
    loss = -np.sum(y * np.log(y_hat + epsilon)) / m
    return float(loss)
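
A useful sanity check (illustrative only): a model that assigns uniform probability 0.1 to every class incurs a loss of -ln(0.1) ≈ 2.303, which is roughly where an untrained classifier should start.

m = 5
y_hat_uniform = np.full((10, m), 0.1)         # uniform predictions
y_onehot = np.eye(10)[np.arange(m) % 10].T    # arbitrary one-hot labels, shape (10, m)
print(compute_loss(y_hat_uniform, y_onehot))  # ≈ 2.3026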

Backward Pass

The backward pass for softmax combined with cross-entropy yields the clean gradient delta^[3] = y_hat - y, just as sigmoid combined with binary cross-entropy (BCE) does:

def backward(
    y: np.ndarray,
    params: dict[str, np.ndarray],
    cache: dict,
) -> dict[str, np.ndarray]:
    """Backpropagation for the 3-layer MNIST network.

    Args:
        y: One-hot labels of shape (10, m).
        params: Network parameters.
        cache: Forward pass cache.

    Returns:
        Gradients dictionary.
    """
    m = y.shape[1]
    grads = {}

    # Output layer (softmax + cross-entropy)
    dz3 = cache["A3"] - y  # (10, m)
    grads["dW3"] = dz3 @ cache["A2"].T / m
    grads["db3"] = np.sum(dz3, axis=1, keepdims=True) / m

    # Hidden layer 2
    da2 = params["W3"].T @ dz3
    dz2 = da2 * relu_derivative(cache["Z2"])
    grads["dW2"] = dz2 @ cache["A1"].T / m
    grads["db2"] = np.sum(dz2, axis=1, keepdims=True) / m

    # Hidden layer 1
    da1 = params["W2"].T @ dz2
    dz1 = da1 * relu_derivative(cache["Z1"])
    grads["dW1"] = dz1 @ cache["X"].T / m
    grads["db1"] = np.sum(dz1, axis=1, keepdims=True) / m

    return grads
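
Before training, it is worth confirming that backward agrees with a numerical gradient. The sketch below is an illustrative helper, not part of the training code; gradient_check_single and the tiny random batch are introduced only for this example. It compares one entry of dW3 against a central-difference estimate, and the two values should match to several decimal places.

def gradient_check_single(
    params: dict[str, np.ndarray],
    x: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-5,
) -> tuple[float, float]:
    """Compare the analytic and numerical gradient for W3[0, 0]."""
    _, cache = forward(x, params)
    analytic = backward(y, params, cache)["dW3"][0, 0]

    # Central difference on the same entry of W3
    original = params["W3"][0, 0]
    params["W3"][0, 0] = original + eps
    loss_plus = compute_loss(forward(x, params)[0], y)
    params["W3"][0, 0] = original - eps
    loss_minus = compute_loss(forward(x, params)[0], y)
    params["W3"][0, 0] = original  # restore the original weight

    numerical = (loss_plus - loss_minus) / (2 * eps)
    return float(analytic), float(numerical)


rng = np.random.default_rng(0)
x_small = rng.random((784, 4))                       # tiny fake batch
y_small = np.eye(10)[rng.integers(0, 10, size=4)].T  # random one-hot labels
p_check = initialize_parameters([784, 128, 64, 10])
print(gradient_check_single(p_check, x_small, y_small))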

Training Loop with Mini-Batches

Processing all 60,000 examples at once is memory-intensive and yields only one parameter update per pass through the data, while purely stochastic updates (one example at a time) are noisy and slow in wall-clock terms. Mini-batch gradient descent (Chapter 9) offers a practical middle ground:

def create_mini_batches(
    x: np.ndarray,
    y: np.ndarray,
    batch_size: int,
) -> list[tuple[np.ndarray, np.ndarray]]:
    """Split data into mini-batches.

    Args:
        x: Input data of shape (n_features, m).
        y: Labels of shape (n_classes, m).
        batch_size: Number of examples per batch.

    Returns:
        List of (X_batch, Y_batch) tuples.
    """
    m = x.shape[1]
    permutation = np.random.permutation(m)
    x_shuffled = x[:, permutation]
    y_shuffled = y[:, permutation]

    batches = []
    num_complete = m // batch_size
    for i in range(num_complete):
        start = i * batch_size
        end = start + batch_size
        batches.append((x_shuffled[:, start:end], y_shuffled[:, start:end]))

    # Handle remaining examples
    if m % batch_size != 0:
        batches.append((x_shuffled[:, num_complete * batch_size:],
                        y_shuffled[:, num_complete * batch_size:]))
    return batches


def train_mnist(
    x_train: np.ndarray,
    y_train: np.ndarray,
    x_test: np.ndarray,
    y_test: np.ndarray,
    layer_dims: list[int],
    learning_rate: float = 0.1,
    num_epochs: int = 20,
    batch_size: int = 128,
) -> tuple[dict, list[float], list[float]]:
    """Train the MNIST classifier.

    Args:
        x_train: Training inputs of shape (784, 60000).
        y_train: Training labels of shape (10, 60000).
        x_test: Test inputs of shape (784, 10000).
        y_test: Test labels of shape (10, 10000).
        layer_dims: Network architecture.
        learning_rate: Step size for gradient descent.
        num_epochs: Number of passes through the training set.
        batch_size: Mini-batch size.

    Returns:
        Tuple of (trained parameters, train losses, test accuracies).
    """
    params = initialize_parameters(layer_dims)
    train_losses = []
    test_accuracies = []

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        batches = create_mini_batches(x_train, y_train, batch_size)

        for x_batch, y_batch in batches:
            # Forward pass
            y_hat, cache = forward(x_batch, params)

            # Compute loss
            loss = compute_loss(y_hat, y_batch)
            epoch_loss += loss * x_batch.shape[1]

            # Backward pass
            grads = backward(y_batch, params, cache)

            # Update parameters
            for l in range(1, len(layer_dims)):
                params[f"W{l}"] -= learning_rate * grads[f"dW{l}"]
                params[f"b{l}"] -= learning_rate * grads[f"db{l}"]

        # Epoch statistics
        avg_loss = epoch_loss / x_train.shape[1]
        train_losses.append(avg_loss)

        # Test accuracy
        y_pred, _ = forward(x_test, params)
        accuracy = np.mean(
            np.argmax(y_pred, axis=0) == np.argmax(y_test, axis=0)
        )
        test_accuracies.append(accuracy)

        print(
            f"Epoch {epoch + 1:2d}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"Test Acc: {accuracy:.4f}"
        )

    return params, train_losses, test_accuracies

Running the NumPy Implementation

# Load data
X_train, Y_train, X_test, Y_test = load_mnist_data()

# Train
params, losses, accuracies = train_mnist(
    X_train, Y_train, X_test, Y_test,
    layer_dims=[784, 128, 64, 10],
    learning_rate=0.1,
    num_epochs=20,
    batch_size=128,
)

# Expected output (approximate):
# Epoch  1/20 | Loss: 0.4523 | Test Acc: 0.9125
# Epoch  5/20 | Loss: 0.1432 | Test Acc: 0.9543
# Epoch 10/20 | Loss: 0.0782 | Test Acc: 0.9651
# Epoch 20/20 | Loss: 0.0298 | Test Acc: 0.9712

PyTorch Implementation

Now we rebuild the same network in PyTorch, demonstrating how the framework handles the mechanics:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class MNISTNet(nn.Module):
    """Three-layer MLP for MNIST digit classification.

    Architecture: 784 -> 128 (ReLU) -> 64 (ReLU) -> 10 logits.
    Softmax is applied implicitly by nn.CrossEntropyLoss during training.
    """

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )
        self._init_weights()

    def _init_weights(self) -> None:
        """Apply He initialization to all linear layers."""
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
                nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (batch_size, 784).

        Returns:
            Logits of shape (batch_size, 10).
        """
        return self.net(x)


def train_pytorch(
    num_epochs: int = 20,
    batch_size: int = 128,
    learning_rate: float = 0.1,
) -> tuple[list[float], list[float]]:
    """Train the PyTorch MNIST classifier.

    Args:
        num_epochs: Number of training epochs.
        batch_size: Mini-batch size.
        learning_rate: Learning rate for SGD.

    Returns:
        Tuple of (train losses, test accuracies).
    """
    torch.manual_seed(42)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load data (reuse sklearn loader, convert to tensors)
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X = torch.tensor(mnist.data / 255.0, dtype=torch.float32)
    y = torch.tensor(mnist.target.astype(int), dtype=torch.long)

    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    train_loader = DataLoader(
        TensorDataset(X_train, y_train),
        batch_size=batch_size,
        shuffle=True,
    )

    # Model, loss, optimizer
    model = MNISTNet().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

    train_losses = []
    test_accuracies = []

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0

        for x_batch, y_batch in train_loader:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)

            # Forward pass
            logits = model(x_batch)
            loss = criterion(logits, y_batch)

            # Backward pass and update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * x_batch.size(0)

        avg_loss = epoch_loss / len(X_train)
        train_losses.append(avg_loss)

        # Evaluate
        model.eval()
        with torch.no_grad():
            logits = model(X_test.to(device))
            preds = logits.argmax(dim=1)
            accuracy = (preds == y_test.to(device)).float().mean().item()
            test_accuracies.append(accuracy)

        print(
            f"Epoch {epoch + 1:2d}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"Test Acc: {accuracy:.4f}"
        )

    return train_losses, test_accuracies
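
To run the PyTorch version and confirm that it has the same number of parameters as the NumPy network, a short usage snippet (illustrative only) might look like this:

model = MNISTNet()
print(sum(p.numel() for p in model.parameters()))  # 109386, matching the NumPy network

train_losses, test_accuracies = train_pytorch(
    num_epochs=20,
    batch_size=128,
    learning_rate=0.1,
)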

Results and Analysis

Both implementations achieve approximately 97% test accuracy after 20 epochs. Key observations:

  1. Training dynamics: Both implementations show similar loss curves because they use the same architecture, initialization scheme, and learning rate.

  2. Code complexity: The NumPy implementation requires about 150 lines of mathematical code. The PyTorch version requires about 60 lines, with the framework handling backpropagation, mini-batching, and GPU transfer.

  3. Speed: The PyTorch implementation is significantly faster, especially on GPU. On a modern GPU, each epoch takes about 3 seconds; in NumPy on CPU, each epoch takes about 30 seconds.

  4. Error analysis: The most common errors are between visually similar digits: 4 and 9, 3 and 8, 7 and 1. Convolutional neural networks (Chapter 13) handle these distinctions better by exploiting spatial structure. A confusion-matrix sketch after this list shows one way to examine these errors.
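
One way to surface these confusions is a confusion matrix over the test set. This is a minimal sketch that assumes the trained NumPy parameters params and the arrays X_test and Y_test from the earlier training run:

# Confusion matrix: rows are true digits, columns are predicted digits
y_pred, _ = forward(X_test, params)
pred_labels = np.argmax(y_pred, axis=0)
true_labels = np.argmax(Y_test, axis=0)

confusion = np.zeros((10, 10), dtype=int)
for t, p in zip(true_labels, pred_labels):
    confusion[t, p] += 1

# Report the five most frequent off-diagonal (true, predicted) pairs
off_diag = confusion.copy()
np.fill_diagonal(off_diag, 0)
for idx in np.argsort(off_diag, axis=None)[::-1][:5]:
    t, p = divmod(int(idx), 10)
    print(f"true {t} predicted as {p}: {off_diag[t, p]} times")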

Key Takeaways

  • A three-layer MLP with only 109,386 parameters can achieve 97%+ accuracy on MNIST.
  • The NumPy implementation reveals every mathematical operation; the PyTorch implementation demonstrates production patterns.
  • Mini-batch gradient descent is essential for large datasets---processing all 60,000 examples at once is impractical.
  • Softmax + cross-entropy yields the same clean gradient (y_hat - y) as sigmoid + BCE, making multi-class backpropagation elegant.
  • This baseline sets the stage for convolutional networks in Chapter 13, which exploit the 2D spatial structure that our MLP ignores.