Case Study 2: Matrix Factorization Loss Landscape at StreamRec
Visualizing Non-Convexity, Saddle Points, and Optimizer Trajectories
The Problem
StreamRec, the content streaming platform from the progressive project, serves approximately 5 million users across 200,000 content items. In Chapter 1, the recommendation team formulated collaborative filtering as matrix factorization: decompose the partially observed user-item rating matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ into low-rank factors $\mathbf{U} \in \mathbb{R}^{m \times k}$ and $\mathbf{V} \in \mathbb{R}^{n \times k}$, minimizing the reconstruction loss over observed entries.
This case study examines what that loss landscape actually looks like — and why the choice of optimizer matters far more here than in the convex credit scoring problem.
The Loss Function
For observed entries $\Omega \subseteq \{1, \ldots, m\} \times \{1, \ldots, n\}$:
$$ \mathcal{L}(\mathbf{U}, \mathbf{V}) = \frac{1}{|\Omega|}\sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \frac{\lambda}{2}\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) $$
This loss is not convex in $(\mathbf{U}, \mathbf{V})$ jointly. The bilinear term $\mathbf{u}_i^T \mathbf{v}_j$ introduces a fundamental non-convexity: the product of two optimization variables. To understand why, consider the simplest possible case.
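The non-convexity is already visible in a single bilinear term: the Hessian of $f(u, v) = uv$ is the constant matrix with zeros on the diagonal and ones off it, whose eigenvalues are $\pm 1$. An indefinite Hessian means $f$ is neither convex nor concave. A standalone numerical check:

```python
import numpy as np

# f(u, v) = u * v has the constant Hessian [[0, 1], [1, 0]].
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])
eigvals = np.linalg.eigvalsh(H)
print(eigvals)  # one negative, one positive eigenvalue: indefinite

# Direct violation of convexity: f at a midpoint exceeds the average of f.
f = lambda u, v: u * v
mid = 0.5 * (np.array([1.0, -1.0]) + np.array([-1.0, 1.0]))
print(f(*mid), 0.5 * (f(1.0, -1.0) + f(-1.0, 1.0)))  # 0.0 vs -1.0
```

Convexity requires the function at a midpoint to lie at or below the average of the endpoint values; here it lies strictly above.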
A Minimal Example: Rank-1 Factorization of a 2x2 Matrix
To visualize the loss landscape, the team starts with the smallest non-trivial instance: factorizing a $2 \times 2$ matrix $\mathbf{R}$ into rank-1 factors $\mathbf{u} = (u_1, u_2)^T$ and $\mathbf{v} = (v_1, v_2)^T$.
$$ \mathcal{L}(u_1, u_2, v_1, v_2) = (R_{11} - u_1 v_1)^2 + (R_{12} - u_1 v_2)^2 + (R_{21} - u_2 v_1)^2 + (R_{22} - u_2 v_2)^2 $$
Even this 4-parameter problem has a non-convex landscape. To visualize it in 2D, they fix $u_2 = v_2 = 0$ and plot the loss as a function of $(u_1, v_1)$. (Fixing the slice at zero keeps the origin a stationary point of the restricted loss, just as $\mathbf{U} = \mathbf{V} = \mathbf{0}$ is always a stationary point of the full problem.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
def rank1_loss_2d(u1: float, v1: float, R: np.ndarray,
                  u2: float = 0.0, v2: float = 0.0) -> float:
    """Loss for rank-1 factorization of 2x2 matrix, with u2, v2 fixed."""
    pred = np.array([[u1 * v1, u1 * v2],
                     [u2 * v1, u2 * v2]])
    return np.sum((R - pred)**2)
# Target matrix (rank 2, so rank-1 factorization is imperfect)
R = np.array([[3.0, 1.0],
[1.0, 3.0]])
# Create loss surface
u1_range = np.linspace(-4, 4, 300)
v1_range = np.linspace(-4, 4, 300)
U1, V1 = np.meshgrid(u1_range, v1_range)
Z = np.zeros_like(U1)
for i in range(300):
for j in range(300):
Z[i, j] = rank1_loss_2d(U1[i, j], V1[i, j], R)
# Plot
fig = plt.figure(figsize=(16, 7))
# Contour plot (add_subplot rather than plt.subplots, so the 3D axes
# added below does not overlap a pre-created 2D axes)
ax = fig.add_subplot(121)
cs = ax.contour(U1, V1, Z, levels=30, cmap="viridis")
ax.contourf(U1, V1, Z, levels=30, cmap="viridis", alpha=0.4)
ax.set_xlabel("$u_1$", fontsize=14)
ax.set_ylabel("$v_1$", fontsize=14)
ax.set_title("Loss Landscape: Rank-1 MF of 2×2 Matrix", fontsize=14)
ax.set_aspect("equal")
plt.colorbar(cs, ax=ax, shrink=0.8)
# 3D surface
ax2 = fig.add_subplot(122, projection="3d")
ax2.plot_surface(U1, V1, Z, cmap=cm.viridis, alpha=0.7,
rstride=5, cstride=5)
ax2.set_xlabel("$u_1$", fontsize=12)
ax2.set_ylabel("$v_1$", fontsize=12)
ax2.set_zlabel("Loss", fontsize=12)
ax2.set_title("3D Loss Surface", fontsize=14)
plt.tight_layout()
plt.savefig("mf_loss_landscape_2x2.png", dpi=150)
plt.show()
Identifying the Saddle Point
The loss surface reveals several important features:
- **Two symmetric families of minima.** The slice's loss depends on $(u_1, v_1)$ only through the product $u_1 v_1$, so every point with $u_1 v_1 = R_{11}$ attains the minimum, and the sign-flipped pair $(-u_1, -v_1)$ produces the same product. The minimizers form two symmetric hyperbola branches, one in the first quadrant and one in the third.
- **A saddle point at the origin.** At $(u_1, v_1) = (0, 0)$, the gradient is zero (no first-order signal for how to improve), yet the point is neither a minimum nor a maximum: the Hessian at the origin has one positive and one negative eigenvalue.
- **Hyperbolic contours.** The level sets of constant loss follow hyperbolic curves, reflecting the multiplicative structure $u_1 v_1 = c$. This is fundamentally different from the elliptical contours of convex quadratics.
def mf_gradient_2d(u1: float, v1: float, R: np.ndarray,
                   u2: float = 0.0, v2: float = 0.0) -> tuple[float, float]:
    """Gradient of rank-1 loss w.r.t. (u1, v1) with u2, v2 fixed."""
    pred = np.array([[u1 * v1, u1 * v2],
                     [u2 * v1, u2 * v2]])
    residual = R - pred
    du1 = -2 * (residual[0, 0] * v1 + residual[0, 1] * v2)
    dv1 = -2 * (residual[0, 0] * u1 + residual[1, 0] * u2)
    return du1, dv1
def mf_hessian_2d(u1: float, v1: float, R: np.ndarray,
                  u2: float = 0.0, v2: float = 0.0) -> np.ndarray:
    """Hessian of rank-1 loss w.r.t. (u1, v1)."""
    H = np.zeros((2, 2))
    H[0, 0] = 2 * (v1**2 + v2**2)                    # d^2L/du1^2
    H[1, 1] = 2 * (u1**2 + u2**2)                    # d^2L/dv1^2
    H[0, 1] = H[1, 0] = 2 * (2 * u1 * v1 - R[0, 0])  # mixed partial
    return H
# Verify saddle point at origin
grad_origin = mf_gradient_2d(0, 0, R)
H_origin = mf_hessian_2d(0, 0, R)
eigvals_origin = np.linalg.eigvalsh(H_origin)
print(f"Gradient at origin: ({grad_origin[0]:.4f}, {grad_origin[1]:.4f})")
print(f"Hessian eigenvalues at origin: {eigvals_origin}")
print(f"Classification: {'Saddle point' if eigvals_origin[0] * eigvals_origin[-1] < 0 else 'Not a saddle point'}")
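Why saddle points slow first-order methods: near a saddle the gradient is tiny, so a fixed-step gradient method barely moves, while any step along the negative-curvature eigenvector strictly decreases the loss. A standalone sketch on the pure bilinear term $g(u_1, v_1) = (3 - u_1 v_1)^2$ (the helper names `g` and `grad_g` are illustrative, not part of the code above):

```python
import numpy as np

def g(u, v, c=3.0):
    """Pure bilinear term; its origin is a saddle point."""
    return (c - u * v) ** 2

def grad_g(u, v, c=3.0):
    r = c - u * v
    return np.array([-2.0 * r * v, -2.0 * r * u])

print(grad_g(0.0, 0.0))          # zero gradient: the origin is stationary

# Hessian at the origin is [[0, -2c], [-2c, 0]] -- indefinite.
H0 = np.array([[0.0, -6.0],
               [-6.0, 0.0]])
w, Q = np.linalg.eigh(H0)
d = Q[:, 0]                      # eigenvector of the negative eigenvalue
print(w)                         # one negative, one positive eigenvalue

# Moving along d strictly decreases g, even though the gradient at the
# origin provides no signal about this direction.
for eps in (0.0, 0.1, 0.5, 1.0):
    u, v = eps * d
    print(f"eps={eps:.1f}  g={g(u, v):.4f}")
```

The loss decreases monotonically in `eps`, which is exactly the escape direction that curvature-blind gradient steps discover only slowly.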
Scaling to Realistic Dimensions
The 2D visualization is instructive, but StreamRec's production factorization uses rank $k = 50$, giving $d = mk + nk = 5{,}000{,}000 \times 50 + 200{,}000 \times 50 = 260{,}000{,}000$ parameters. The team cannot visualize this directly but can study the landscape along structured slices.
For a tractable experiment, they generate a synthetic $500 \times 200$ problem and factorize it with rank $k = 10$, giving $d = 500 \times 10 + 200 \times 10 = 7{,}000$ parameters.
def generate_streamrec_data(
n_users: int = 500,
n_items: int = 200,
true_rank: int = 5,
obs_rate: float = 0.1,
noise_std: float = 0.3,
seed: int = 42
) -> tuple[np.ndarray, np.ndarray]:
"""Generate synthetic StreamRec interaction data."""
rng = np.random.default_rng(seed)
U_true = rng.normal(0, 1, (n_users, true_rank))
V_true = rng.normal(0, 1, (n_items, true_rank))
R_true = U_true @ V_true.T
mask = rng.random((n_users, n_items)) < obs_rate
R_obs = np.full((n_users, n_items), np.nan)
R_obs[mask] = R_true[mask] + rng.normal(0, noise_std, mask.sum())
return R_obs, R_true
def mf_loss_and_grad(
params: np.ndarray,
R_obs: np.ndarray,
rank: int,
reg: float = 0.01
) -> tuple[float, np.ndarray]:
"""Matrix factorization loss and gradient."""
m, n = R_obs.shape
U = params[:m * rank].reshape(m, rank)
V = params[m * rank:].reshape(n, rank)
mask = ~np.isnan(R_obs)
pred = U @ V.T
residual = np.where(mask, R_obs - pred, 0.0)
n_obs = mask.sum()
loss = 0.5 * np.sum(residual**2) / n_obs + 0.5 * reg * (
np.sum(U**2) + np.sum(V**2)
)
grad_U = (-residual @ V) / n_obs + reg * U
grad_V = (-residual.T @ U) / n_obs + reg * V
return loss, np.concatenate([grad_U.ravel(), grad_V.ravel()])
R_obs, R_true = generate_streamrec_data()
n_users, n_items = R_obs.shape
rank = 10
rng = np.random.default_rng(0)
theta_init = rng.normal(0, 0.01, n_users * rank + n_items * rank)
loss_fn = lambda theta: mf_loss_and_grad(theta, R_obs, rank, reg=0.01)
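Before comparing optimizers, it is worth verifying the analytic gradient against central finite differences. The sketch below is standalone: `tiny_mf_loss_and_grad` is a hypothetical dense miniature of `mf_loss_and_grad` (every entry observed), but the same checking loop could be pointed at `loss_fn` directly.

```python
import numpy as np

def tiny_mf_loss_and_grad(params, R, rank, reg=0.01):
    """Dense miniature of the MF loss: every entry of R is observed."""
    m, n = R.shape
    U = params[:m * rank].reshape(m, rank)
    V = params[m * rank:].reshape(n, rank)
    residual = R - U @ V.T
    loss = 0.5 * np.mean(residual**2) + 0.5 * reg * (np.sum(U**2) + np.sum(V**2))
    grad_U = (-residual @ V) / residual.size + reg * U
    grad_V = (-residual.T @ U) / residual.size + reg * V
    return loss, np.concatenate([grad_U.ravel(), grad_V.ravel()])

rng = np.random.default_rng(1)
R_small = rng.normal(size=(4, 3))
theta = rng.normal(0, 0.1, size=(4 + 3) * 2)  # rank-2 factors for a 4x3 matrix

_, grad = tiny_mf_loss_and_grad(theta, R_small, rank=2)

# Central finite differences, one coordinate at a time
h = 1e-6
fd = np.zeros_like(theta)
for i in range(theta.size):
    e = np.zeros_like(theta)
    e[i] = h
    lp = tiny_mf_loss_and_grad(theta + e, R_small, 2)[0]
    lm = tiny_mf_loss_and_grad(theta - e, R_small, 2)[0]
    fd[i] = (lp - lm) / (2 * h)

print(np.max(np.abs(grad - fd)))  # should be far below 1e-5
```

A mismatch here would invalidate every trajectory plotted below, so this cheap check is worth its two seconds of runtime.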
Comparing Optimizer Trajectories
def run_optimizer(name, loss_fn, theta_init, n_steps=3000):
"""Run an optimizer and record both loss and parameter trajectories."""
theta = theta_init.copy()
losses = []
snapshots = [theta.copy()] # parameter snapshots for trajectory analysis
if name == "SGD":
lr = 0.05
for t in range(n_steps):
loss, grad = loss_fn(theta)
losses.append(loss)
theta -= lr * grad
if t % 100 == 0:
snapshots.append(theta.copy())
elif name == "Momentum":
lr, beta = 0.01, 0.9
v = np.zeros_like(theta)
for t in range(n_steps):
loss, grad = loss_fn(theta)
losses.append(loss)
v = beta * v + grad
theta -= lr * v
if t % 100 == 0:
snapshots.append(theta.copy())
elif name == "Adam":
lr, beta1, beta2, eps = 0.005, 0.9, 0.999, 1e-8
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, n_steps + 1):
loss, grad = loss_fn(theta)
losses.append(loss)
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
if t % 100 == 0:
snapshots.append(theta.copy())
return losses, snapshots
# Run all three
results = {}
for name in ["SGD", "Momentum", "Adam"]:
results[name] = run_optimizer(name, loss_fn, theta_init, n_steps=3000)
# Plot convergence
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
ax = axes[0]
for name, (losses, _) in results.items():
ax.semilogy(losses, label=name, linewidth=2)
ax.set_xlabel("Iteration", fontsize=14)
ax.set_ylabel("Loss (log scale)", fontsize=14)
ax.set_title("StreamRec MF: Convergence Comparison", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
# Plot normalized loss (to highlight relative convergence)
ax = axes[1]
for name, (losses, _) in results.items():
normalized = np.array(losses) / losses[0]
ax.semilogy(normalized, label=name, linewidth=2)
ax.set_xlabel("Iteration", fontsize=14)
ax.set_ylabel("Loss / Initial Loss", fontsize=14)
ax.set_title("Normalized Convergence", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("streamrec_mf_convergence.png", dpi=150)
plt.show()
Results
The convergence profiles reveal the character of each optimizer on non-convex terrain:
| Optimizer | Final Loss | Iterations to 50% Reduction | Reconstruction RMSE |
|---|---|---|---|
| SGD | 0.482 | 847 | 1.24 |
| Momentum | 0.391 | 312 | 1.08 |
| Adam | 0.358 | 189 | 0.98 |
Key observations:
- **Adam converges fastest and reaches the lowest loss.** On this non-convex problem, Adam's per-parameter adaptive rates handle the heterogeneous curvature between user factors and item factors. User factors that receive few observed ratings have smaller gradient magnitudes; Adam compensates by increasing their effective learning rate.
- **Momentum provides a clear improvement over SGD.** The matrix factorization loss landscape has the characteristic "ravine" structure created by the scale symmetry $(u, v) \to (\alpha u, v/\alpha)$. Momentum helps the optimizer progress along the ravine floor rather than oscillating between the walls.
- **SGD is slowest but most stable.** Its trajectory shows the cleanest monotonic decrease, albeit at a much lower rate. Because these runs use the full batch, the "SGD" update is really plain gradient descent: there is no gradient noise, so it makes steady but small progress at a fixed step size.
- **All three reach different final losses.** Unlike the convex credit scoring problem, the optimizers converge to different local minima. However, the differences in reconstruction RMSE on held-out entries are modest, supporting the theory that most local minima of overparameterized matrix factorization are nearly global.
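The ravine structure comes directly from the scale symmetry: rescaling $\mathbf{U}$ by $\alpha$ and $\mathbf{V}$ by $1/\alpha$ leaves every prediction unchanged, so the unregularized loss is constant along these directions. A standalone check on small toy matrices (chosen for illustration; note that the Frobenius regularizer does break the symmetry, which is one reason to include it):

```python
import numpy as np

U = np.array([[1.0, 2.0],
              [3.0, 4.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = 5.0

# Predictions are invariant under (U, V) -> (alpha U, V / alpha)
pred = U @ V.T
pred_scaled = (alpha * U) @ (V / alpha).T
print(np.allclose(pred, pred_scaled))   # True

# The regularizer is NOT invariant: it penalizes unbalanced scalings,
# selecting a balanced representative from each equivalence class.
reg = np.sum(U**2) + np.sum(V**2)
reg_scaled = np.sum((alpha * U)**2) + np.sum((V / alpha)**2)
print(reg, reg_scaled)                  # grows from 34.0 to roughly 750.2
```

Without regularization, every minimum is a flat continuum of rescalings; the curvature across that continuum is zero in one direction and can be steep in others, which is precisely the ravine that momentum navigates better than plain gradient steps.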
Why Adam Works Despite Non-Convexity
The StreamRec team identifies three reasons Adam outperforms on this problem:
1. Adaptive rates handle scale heterogeneity. The user and item factors have different gradient scales. A user who has rated 500 items generates a gradient signal 50x stronger than a user who has rated 10 items. Adam automatically downscales the former and upscales the latter.
2. Momentum escapes saddle points faster. At a saddle point, the gradient is small in all directions. Adam's normalization by $\sqrt{\hat{v}_t}$ amplifies these small gradients, effectively applying a larger learning rate near saddle points. Standard SGD takes many iterations to drift away from saddle points because it multiplies an already-small gradient by a fixed learning rate.
3. Bias correction matters for cold-start parameters. At initialization, Adam's bias correction ensures that early updates are large enough to move parameters out of the high-loss region near the origin. Without bias correction, the first hundred iterations would make negligible progress because the moment estimates are biased toward zero.
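The size of the bias-correction effect is easy to quantify. With a constant gradient of 1.0 (an idealized illustration), the raw first moment starts at $1 - \beta_1 = 0.1$, ten times too small, while the corrected estimate $\hat{m}_t = m_t / (1 - \beta_1^t)$ recovers the true gradient from the very first step:

```python
beta1 = 0.9
g = 1.0          # constant gradient, for illustration
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g        # raw first-moment estimate
    m_hat = m / (1 - beta1**t)             # bias-corrected estimate
    print(f"t={t}  m={m:.5f}  m_hat={m_hat:.5f}")
# m crawls toward 1.0 (it equals 1 - beta1**t), while m_hat is exactly 1.0
```

The same argument applies to the second moment with $\beta_2 = 0.999$, where the uncorrected estimate would be underscaled by a factor of 1000 on the first step.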
Visualizing the Trajectory in Parameter Space
To visualize the high-dimensional trajectories, the team projects the parameter snapshots onto their first two principal components.
from sklearn.decomposition import PCA
# Collect all snapshots from all optimizers
all_snapshots = []
labels = []
for name, (_, snaps) in results.items():
all_snapshots.extend(snaps)
labels.extend([name] * len(snaps))
all_snapshots = np.array(all_snapshots)
# PCA projection
pca = PCA(n_components=2)
projected = pca.fit_transform(all_snapshots)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# Split back by optimizer
fig, ax = plt.subplots(figsize=(10, 8))
colors = {"SGD": "red", "Momentum": "blue", "Adam": "green"}
idx = 0
for name, (_, snaps) in results.items():
n = len(snaps)
pts = projected[idx:idx + n]
ax.plot(pts[:, 0], pts[:, 1], "-o", markersize=3,
label=name, color=colors[name], linewidth=1.5, alpha=0.8)
ax.plot(pts[0, 0], pts[0, 1], "s", markersize=10,
color=colors[name]) # start
ax.plot(pts[-1, 0], pts[-1, 1], "*", markersize=15,
color=colors[name]) # end
idx += n
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)", fontsize=14)
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)", fontsize=14)
ax.set_title("Optimizer Trajectories in PCA-Projected Parameter Space", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("streamrec_trajectories_pca.png", dpi=150)
plt.show()
The Broader Lesson
The contrast between Case Study 1 (convex) and Case Study 2 (non-convex) encapsulates a fundamental principle: the choice of optimizer depends on the geometry of the loss landscape.
| Property | Credit Scoring (Convex) | Matrix Factorization (Non-Convex) |
|---|---|---|
| Guaranteed global optimum | Yes | No |
| All optimizers reach same solution | Yes | No |
| Saddle points | None | Many |
| Best optimizer | L-BFGS or any (problem is easy) | Adam (handles heterogeneous curvature) |
| Learning rate sensitivity | Moderate | High |
| Theoretical convergence guarantee | Yes (linear rate) | No (only to stationary point) |
When the StreamRec team eventually trains neural network recommendation models in later chapters, this non-convex landscape will be the norm, not the exception. The lesson from matrix factorization generalizes: on non-convex problems with heterogeneous parameter scales, adaptive optimizers like Adam provide the best combination of speed and robustness. But the convex theory is not wasted — it provides the vocabulary (condition number, saddle point, curvature) needed to reason about what the optimizer is doing, even when formal guarantees are absent.