Case Study 1: Optimizing a Neural Network Loss Landscape
Overview
In this case study, we visualize and navigate the loss landscape of a small neural network trained on a classification task. We will see firsthand how the concepts from this chapter — gradients, critical points, and optimizer dynamics — manifest in real training scenarios. By projecting the high-dimensional parameter space onto two dimensions, we gain visual intuition for why certain optimizers outperform others and how the landscape's geometry shapes the training process.
Motivation
The loss landscape of a neural network is a function over millions (or billions) of dimensions. We cannot visualize it directly, but we can slice through it along chosen directions. This technique, pioneered by Li et al. (2018) in their paper "Visualizing the Loss Landscape of Neural Nets," reveals that network architecture choices profoundly affect the landscape's smoothness and the ease of optimization.
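Concretely, the slice is built from a reference point $\theta^{*}$ in parameter space (here, the trained weights) and two fixed direction vectors $\delta_1$ and $\delta_2$ of the same dimension. The quantity we plot is the two-variable function

$$
f(\alpha, \beta) = L\!\left(\theta^{*} + \alpha\,\delta_1 + \beta\,\delta_2\right),
$$

evaluated on a grid of $(\alpha, \beta)$ values; Part 3 implements exactly this computation.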
In this case study, we will:
- Train a small neural network on a synthetic 2D classification dataset.
- Visualize the loss landscape along two directions in parameter space.
- Observe optimizer trajectories on this landscape.
- Investigate saddle points and their effect on training.
- Compare how different optimizers navigate challenging terrain.
Part 1: Setting Up the Problem
We begin with a synthetic two-class spiral dataset that is not linearly separable, requiring a nonlinear model to classify correctly.
import numpy as np
import matplotlib.pyplot as plt
def generate_spiral_data(
n_points: int = 200, noise: float = 0.3, seed: int = 42
) -> tuple[np.ndarray, np.ndarray]:
"""Generate a two-class spiral dataset.
Args:
n_points: Number of points per class.
noise: Standard deviation of Gaussian noise.
seed: Random seed for reproducibility.
Returns:
Tuple of (features array shape (2*n_points, 2),
labels array shape (2*n_points,)).
"""
np.random.seed(seed)
theta = np.linspace(0, 4 * np.pi, n_points)
# Class 0: spiral arm 1
r0 = theta / (4 * np.pi)
x0 = r0 * np.cos(theta) + noise * np.random.randn(n_points)
y0 = r0 * np.sin(theta) + noise * np.random.randn(n_points)
# Class 1: spiral arm 2 (rotated by pi)
r1 = theta / (4 * np.pi)
x1 = r1 * np.cos(theta + np.pi) + noise * np.random.randn(n_points)
y1 = r1 * np.sin(theta + np.pi) + noise * np.random.randn(n_points)
X = np.vstack([np.column_stack([x0, y0]),
np.column_stack([x1, y1])])
y = np.hstack([np.zeros(n_points), np.ones(n_points)])
return X, y
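As a quick sanity check, the snippet below is a minimal usage sketch of my own (not part of the case-study file; the output file name is an illustrative choice). It generates the dataset with the default settings and scatter-plots the two arms.

# Generate the spiral data and plot the two classes in different colors
X, y = generate_spiral_data()
plt.figure(figsize=(5, 5))
plt.scatter(X[y == 0, 0], X[y == 0, 1], s=10, label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], s=10, label="class 1")
plt.legend()
plt.axis("equal")
plt.savefig("spiral_data.png", dpi=150)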
Part 2: Building a Small Neural Network
We implement a two-layer neural network with ReLU activation — small enough to visualize meaningfully, yet capable of learning the spiral pattern.
class SmallNetwork:
"""A two-layer neural network for binary classification.
Architecture: Input(2) -> Dense(16, ReLU) -> Dense(1, Sigmoid)
Attributes:
params: Dictionary of weight matrices and bias vectors.
"""
def __init__(self, seed: int = 42) -> None:
np.random.seed(seed)
self.params = {
"W1": np.random.randn(2, 16) * 0.5,
"b1": np.zeros(16),
"W2": np.random.randn(16, 1) * 0.5,
"b2": np.zeros(1),
}
def forward(self, X: np.ndarray) -> tuple[np.ndarray, dict]:
"""Forward pass through the network.
Args:
X: Input features, shape (n_samples, 2).
Returns:
Tuple of (predictions shape (n_samples,), cache dict).
"""
z1 = X @ self.params["W1"] + self.params["b1"]
h1 = np.maximum(0, z1) # ReLU
z2 = h1 @ self.params["W2"] + self.params["b2"]
y_pred = 1.0 / (1.0 + np.exp(-z2.ravel())) # Sigmoid
cache = {"X": X, "z1": z1, "h1": h1, "z2": z2, "y_pred": y_pred}
return y_pred, cache
def compute_loss(
self, y_pred: np.ndarray, y_true: np.ndarray
) -> float:
"""Binary cross-entropy loss.
Args:
y_pred: Predicted probabilities, shape (n_samples,).
y_true: True labels, shape (n_samples,).
Returns:
Scalar loss value.
"""
eps = 1e-12
y_pred = np.clip(y_pred, eps, 1 - eps)
loss = -np.mean(
y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
)
return loss
def backward(
self, cache: dict, y_true: np.ndarray
) -> dict[str, np.ndarray]:
"""Backward pass: compute gradients.
Args:
cache: Intermediate values from forward pass.
y_true: True labels, shape (n_samples,).
Returns:
Dictionary of gradients for each parameter.
"""
n = len(y_true)
y_pred = cache["y_pred"]
# Output layer gradient
dz2 = (y_pred - y_true).reshape(-1, 1) / n
dW2 = cache["h1"].T @ dz2
db2 = np.sum(dz2, axis=0)
# Hidden layer gradient
dh1 = dz2 @ self.params["W2"].T
dz1 = dh1 * (cache["z1"] > 0) # ReLU backward
dW1 = cache["X"].T @ dz1
db1 = np.sum(dz1, axis=0)
return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
def get_flat_params(self) -> np.ndarray:
"""Flatten all parameters into a single vector."""
return np.concatenate([
self.params["W1"].ravel(),
self.params["b1"].ravel(),
self.params["W2"].ravel(),
self.params["b2"].ravel(),
])
def set_flat_params(self, flat: np.ndarray) -> None:
"""Set parameters from a flattened vector."""
idx = 0
for key, shape in [("W1", (2, 16)), ("b1", (16,)),
("W2", (16, 1)), ("b2", (1,))]:
size = np.prod(shape)
self.params[key] = flat[idx:idx + size].reshape(shape)
idx += size
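Before relying on the analytic gradients, it is worth checking them numerically. The sketch below is my own verification code (not from the case-study file): it compares backward() against central-difference gradients over every parameter. The result should be very small, e.g. below 1e-6.

def numerical_gradient_check(net: SmallNetwork, X: np.ndarray,
                             y: np.ndarray, eps: float = 1e-5) -> float:
    """Max absolute difference between analytic and central-difference gradients."""
    _, cache = net.forward(X)
    grads = net.backward(cache, y)
    # Concatenate analytic gradients in the same order as get_flat_params
    analytic = np.concatenate(
        [grads[k].ravel() for k in ("W1", "b1", "W2", "b2")]
    )
    flat = net.get_flat_params()
    numeric = np.zeros_like(flat)
    for i in range(flat.size):
        for sign in (1.0, -1.0):
            perturbed = flat.copy()
            perturbed[i] += sign * eps
            net.set_flat_params(perturbed)
            y_hat, _ = net.forward(X)
            numeric[i] += sign * net.compute_loss(y_hat, y) / (2 * eps)
    net.set_flat_params(flat)  # Restore the original parameters
    return float(np.max(np.abs(analytic - numeric)))

X, y = generate_spiral_data()
net = SmallNetwork()
print(numerical_gradient_check(net, X, y))  # Expect a value close to zero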
Part 3: Visualizing the Loss Landscape
To visualize the loss landscape, we choose a trained parameter point and two random directions in parameter space. We then evaluate the loss on a grid around the trained parameters.
def visualize_loss_landscape(
net: SmallNetwork,
X: np.ndarray,
y: np.ndarray,
center_params: np.ndarray,
direction1: np.ndarray,
direction2: np.ndarray,
grid_range: float = 1.0,
grid_points: int = 50,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Compute loss values on a 2D slice of parameter space.
Args:
net: The neural network.
X: Input features.
y: True labels.
center_params: Flattened parameter vector at the center.
direction1: First direction vector (normalized).
direction2: Second direction vector (normalized).
grid_range: Range of alpha, beta values.
grid_points: Number of grid points per dimension.
Returns:
Tuple of (alpha values, beta values, loss grid).
"""
alphas = np.linspace(-grid_range, grid_range, grid_points)
betas = np.linspace(-grid_range, grid_range, grid_points)
loss_grid = np.zeros((grid_points, grid_points))
for i, alpha in enumerate(alphas):
for j, beta in enumerate(betas):
params = center_params + alpha * direction1 + beta * direction2
net.set_flat_params(params)
y_pred, _ = net.forward(X)
loss_grid[j, i] = net.compute_loss(y_pred, y)
    # Restore the parameters that were overwritten during the grid sweep
    net.set_flat_params(center_params)
    return alphas, betas, loss_grid
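The driver below is an illustrative sketch of my own (not the case-study file itself): assuming net has already been trained on X, y (the training routine appears in Part 4), it draws two random unit directions with the same dimensionality as the flattened parameters, sweeps the grid, and plots the slice as filled contours.

# Assumes `net` has been trained on (X, y); see Part 4 for the training routine.
center = net.get_flat_params()

rng = np.random.default_rng(0)
d1 = rng.standard_normal(center.shape)
d2 = rng.standard_normal(center.shape)
d1 /= np.linalg.norm(d1)  # Unit length so alpha and beta are on comparable scales
d2 /= np.linalg.norm(d2)

alphas, betas, loss_grid = visualize_loss_landscape(net, X, y, center, d1, d2)

plt.figure(figsize=(6, 5))
plt.contourf(alphas, betas, loss_grid, levels=30)  # rows index beta, columns index alpha
plt.colorbar(label="loss")
plt.xlabel("alpha (direction 1)")
plt.ylabel("beta (direction 2)")
plt.title("Loss landscape slice around the trained parameters")
plt.savefig("landscape_slice.png", dpi=150)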
Part 4: Tracking Optimizer Trajectories
We train the network with different optimizers and record the parameter trajectory.
def train_with_trajectory(
net: SmallNetwork,
X: np.ndarray,
y: np.ndarray,
optimizer_name: str = "sgd",
learning_rate: float = 0.01,
n_epochs: int = 200,
momentum: float = 0.9,
beta1: float = 0.9,
beta2: float = 0.999,
) -> tuple[list[np.ndarray], list[float]]:
"""Train the network and record parameter trajectory.
Args:
net: The neural network.
X: Training features.
y: Training labels.
optimizer_name: One of 'sgd', 'momentum', 'adam'.
learning_rate: Learning rate.
n_epochs: Number of training epochs.
momentum: Momentum coefficient (for momentum optimizer).
beta1: First moment decay (for Adam).
beta2: Second moment decay (for Adam).
Returns:
Tuple of (parameter trajectory, loss history).
"""
trajectory = [net.get_flat_params().copy()]
loss_history = []
    # Initialize optimizer state: momentum velocity and Adam moment estimates
    velocity = {k: np.zeros_like(p) for k, p in net.params.items()}
    m = {k: np.zeros_like(p) for k, p in net.params.items()}  # Adam first moment
    v = {k: np.zeros_like(p) for k, p in net.params.items()}  # Adam second moment
t = 0
for epoch in range(n_epochs):
y_pred, cache = net.forward(X)
loss = net.compute_loss(y_pred, y)
loss_history.append(loss)
grads = net.backward(cache, y)
t += 1
for key in net.params:
if optimizer_name == "sgd":
net.params[key] -= learning_rate * grads[key]
elif optimizer_name == "momentum":
velocity[key] = momentum * velocity[key] + grads[key]
net.params[key] -= learning_rate * velocity[key]
elif optimizer_name == "adam":
m[key] = beta1 * m[key] + (1 - beta1) * grads[key]
v[key] = beta2 * v[key] + (1 - beta2) * grads[key] ** 2
m_hat = m[key] / (1 - beta1 ** t)
v_hat = v[key] / (1 - beta2 ** t)
net.params[key] -= (
learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8)
)
trajectory.append(net.get_flat_params().copy())
return trajectory, loss_history
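The driver below is my own illustrative sketch (the learning rates are placeholder choices, not tuned values): it trains three copies of the network from the same initialization, reports the final losses, and projects each trajectory onto the two directions from the Part 3 sketch so the paths can be overlaid on the contour plot. Random directions are only roughly orthogonal in this dimension, so the dot products give approximate slice coordinates; exact coordinates would require a small least-squares solve.

X, y = generate_spiral_data()
runs = {}
for name, lr in [("sgd", 0.5), ("momentum", 0.1), ("adam", 0.01)]:
    net = SmallNetwork(seed=42)  # Identical initialization for every optimizer
    trajectory, losses = train_with_trajectory(
        net, X, y, optimizer_name=name, learning_rate=lr, n_epochs=200
    )
    runs[name] = np.array(trajectory)  # Shape: (n_epochs + 1, n_params)
    print(f"{name}: final loss = {losses[-1]:.4f}")

# Project each trajectory onto the slice spanned by d1 and d2, centered at `center`
# (center, d1, d2 come from the Part 3 sketch).
plt.figure(figsize=(6, 5))
for name, trajectory in runs.items():
    offsets = trajectory - center
    plt.plot(offsets @ d1, offsets @ d2, marker=".", markersize=2, label=name)
plt.xlabel("alpha (direction 1)")
plt.ylabel("beta (direction 2)")
plt.legend()
plt.savefig("trajectories.png", dpi=150)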
Part 5: Analysis and Observations
5.1 The Landscape Is Not a Bowl
When we visualize the loss landscape of even this small network, we observe that it is far from the smooth bowl of a convex function. The surface features:
- Flat regions where gradients are near zero, making progress slow.
- Narrow valleys where the loss drops steeply in one direction but is nearly flat in others.
- Saddle points where the surface curves up in some directions and down in others (probed numerically in the sketch after this list).
- Multiple basins corresponding to different local (and possibly global) minima.
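One way to make the saddle-point observation concrete is to estimate the curvature of the loss along the two plot directions with central finite differences; opposite signs along the two directions indicate saddle-like geometry on this slice. This is a rough numerical probe of my own, not part of the case-study file; it reuses net, X, y, center, d1, and d2 from the Part 3 sketch.

def directional_curvature(net: SmallNetwork, X: np.ndarray, y: np.ndarray,
                          center: np.ndarray, direction: np.ndarray,
                          h: float = 1e-2) -> float:
    """Estimate the second derivative of the loss along `direction`
    with a central finite difference at `center`."""
    def loss_at(params: np.ndarray) -> float:
        net.set_flat_params(params)
        y_pred, _ = net.forward(X)
        return net.compute_loss(y_pred, y)

    f_plus = loss_at(center + h * direction)
    f_minus = loss_at(center - h * direction)
    f_center = loss_at(center)
    net.set_flat_params(center)  # Restore the center parameters
    return (f_plus - 2.0 * f_center + f_minus) / h ** 2

# Probe the trained point along the two plot directions.
print(directional_curvature(net, X, y, center, d1))
print(directional_curvature(net, X, y, center, d2))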
5.2 Optimizer Behavior
Comparing the three optimizers on the same landscape:
- SGD follows the steepest descent direction at each point but oscillates in narrow valleys and can stall in flat regions.
- Momentum builds up speed along directions of consistent gradient, damping the side-to-side oscillations that plague SGD in narrow valleys and making faster progress along the valley floor, though it can overshoot the minimum and take a few steps to settle.
- Adam adapts its step size per parameter, navigating both flat regions (where it takes larger steps) and steep regions (where it takes smaller steps) effectively.
5.3 The Effect of Initialization
Running the experiment with different random seeds for weight initialization often leads to different final solutions. This illustrates that the loss landscape has multiple basins of attraction, and the initial point determines which basin the optimizer settles into.
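One way to see this directly (an illustrative experiment of my own, not part of the case-study file) is to train from several seeds and compare the final losses and the pairwise distances between the final parameter vectors. Comparable losses reached at widely separated points are consistent with multiple basins, although permutation symmetry of the hidden units means distinct parameter vectors can also represent the same function.

X, y = generate_spiral_data()
final_params = []
for seed in range(5):
    net = SmallNetwork(seed=seed)
    _, losses = train_with_trajectory(net, X, y, optimizer_name="adam",
                                      learning_rate=0.01, n_epochs=200)
    final_params.append(net.get_flat_params())
    print(f"seed {seed}: final loss = {losses[-1]:.4f}")

# Pairwise distances between the solutions each run converged to
for i in range(len(final_params)):
    for j in range(i + 1, len(final_params)):
        dist = np.linalg.norm(final_params[i] - final_params[j])
        print(f"seed {i} vs seed {j}: parameter distance = {dist:.2f}")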
Key Takeaways
- Loss landscapes are complex. Even small neural networks have non-trivial loss surfaces with saddle points, flat regions, and multiple minima.
- Optimizer choice matters. Different optimizers navigate the landscape differently, and the "best" optimizer depends on the landscape geometry.
- Visualization is limited but valuable. We can only see 2D slices of a high-dimensional space, but these slices provide genuine insight into optimization dynamics.
- Initialization affects the outcome. The starting point of optimization determines the final solution, underscoring the importance of initialization strategies (which we will explore in Chapter 6).
- Theory connects to practice. The concepts of gradients, critical points, and curvature that we studied analytically in this chapter directly explain the behavior we observe in these visualizations.
Further Exploration
- Modify the network architecture (add layers, change width) and observe how the landscape changes.
- Implement the filter-normalized visualization method from Li et al. (2018) for a fairer comparison across architectures; a starting-point sketch follows this list.
- Add regularization (L2 penalty) and observe how it smooths the landscape.
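As a starting point for the filter-normalization exercise, here is a sketch of my own that adapts the idea to this network's dense layers. Li et al. define the method for convolutional filters; treating each hidden unit's incoming weight column as a "filter" is the usual adaptation, and the handling of biases here is a simplification rather than the paper's exact procedure.

def filter_normalized_direction(net: SmallNetwork, seed: int = 0) -> np.ndarray:
    """Random direction rescaled so each unit's slice matches the norm
    of the corresponding trained weights (dense-layer adaptation)."""
    rng = np.random.default_rng(seed)
    pieces = []
    for key in ("W1", "b1", "W2", "b2"):
        w = net.params[key]
        d = rng.standard_normal(w.shape)
        if w.ndim == 2:
            # Each column holds one output unit's incoming weights ("filter")
            d *= np.linalg.norm(w, axis=0, keepdims=True) / (
                np.linalg.norm(d, axis=0, keepdims=True) + 1e-10
            )
        else:
            # Biases: match the overall norm (Li et al. instead leave them at zero)
            d *= np.linalg.norm(w) / (np.linalg.norm(d) + 1e-10)
        pieces.append(d.ravel())
    return np.concatenate(pieces)

The returned vector follows the same W1, b1, W2, b2 layout as get_flat_params, so it can be passed directly as direction1 or direction2 to visualize_loss_landscape.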
The complete code for this case study, including all visualizations, is available in code/case-study-code.py.