Case Study 2: Matrix Factorization Loss Landscape at StreamRec
Visualizing Non-Convexity, Saddle Points, and Optimizer Trajectories
The Problem
StreamRec, the content streaming platform from the progressive project, serves approximately 5 million users across 200,000 content items. In Chapter 1, the recommendation team formulated collaborative filtering as matrix factorization: decompose the partially observed user-item rating matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ into low-rank factors $\mathbf{U} \in \mathbb{R}^{m \times k}$ and $\mathbf{V} \in \mathbb{R}^{n \times k}$, minimizing the reconstruction loss over observed entries.
This case study examines what that loss landscape actually looks like — and why the choice of optimizer matters far more here than in the convex credit scoring problem.
The Loss Function
For observed entries $\Omega \subseteq \{1, \ldots, m\} \times \{1, \ldots, n\}$:
$$ \mathcal{L}(\mathbf{U}, \mathbf{V}) = \frac{1}{|\Omega|}\sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \frac{\lambda}{2}\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) $$
This loss is not convex in $(\mathbf{U}, \mathbf{V})$ jointly. The bilinear term $\mathbf{u}_i^T \mathbf{v}_j$ introduces a fundamental non-convexity: the product of two optimization variables. To understand why, consider the simplest possible case.
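The non-convexity is already visible in a single bilinear term: the Hessian of $f(u, v) = uv$ is the constant matrix with zeros on the diagonal and ones off it, whose eigenvalues are $\pm 1$. An indefinite Hessian means $f$ is neither convex nor concave. A standalone numerical check:

```python
import numpy as np

# f(u, v) = u * v has the constant Hessian [[0, 1], [1, 0]].
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])
eigvals = np.linalg.eigvalsh(H)
print(eigvals)  # one negative, one positive eigenvalue: indefinite

# Direct violation of convexity: f at a midpoint exceeds the average of f.
f = lambda u, v: u * v
mid = 0.5 * (np.array([1.0, -1.0]) + np.array([-1.0, 1.0]))
print(f(*mid), 0.5 * (f(1.0, -1.0) + f(-1.0, 1.0)))  # 0.0 vs -1.0
```

Convexity requires the function at a midpoint to lie at or below the average of the endpoint values; here it lies strictly above.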
A Minimal Example: Rank-1 Factorization of a 2x2 Matrix
To visualize the loss landscape, the team starts with the smallest non-trivial instance: factorizing a $2 \times 2$ matrix $\mathbf{R}$ into rank-1 factors $\mathbf{u} = (u_1, u_2)^T$ and $\mathbf{v} = (v_1, v_2)^T$.
$$ \mathcal{L}(u_1, u_2, v_1, v_2) = (R_{11} - u_1 v_1)^2 + (R_{12} - u_1 v_2)^2 + (R_{21} - u_2 v_1)^2 + (R_{22} - u_2 v_2)^2 $$
Even this 4-parameter problem has a non-convex landscape. To visualize it in 2D, they fix $u_2 = v_2 = 0$ and plot the loss as a function of $(u_1, v_1)$. (Fixing the slice at zero keeps the origin a stationary point of the restricted loss, just as $\mathbf{U} = \mathbf{V} = \mathbf{0}$ is always a stationary point of the full problem.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
def rank1_loss_2d(u1: float, v1: float, R: np.ndarray,
                  u2: float = 0.0, v2: float = 0.0) -> float:
    """Loss for rank-1 factorization of 2x2 matrix, with u2, v2 fixed."""
    pred = np.array([[u1 * v1, u1 * v2],
                     [u2 * v1, u2 * v2]])
    return np.sum((R - pred)**2)
# Target matrix (rank 2, so rank-1 factorization is imperfect)
R = np.array([[3.0, 1.0],
[1.0, 3.0]])
# Create loss surface
u1_range = np.linspace(-4, 4, 300)
v1_range = np.linspace(-4, 4, 300)
U1, V1 = np.meshgrid(u1_range, v1_range)
Z = np.zeros_like(U1)
for i in range(300):
for j in range(300):
Z[i, j] = rank1_loss_2d(U1[i, j], V1[i, j], R)
# Plot
fig = plt.figure(figsize=(16, 7))
# Contour plot (add_subplot rather than plt.subplots, so the 3D axes
# added below does not overlap a pre-created 2D axes)
ax = fig.add_subplot(121)
cs = ax.contour(U1, V1, Z, levels=30, cmap="viridis")
ax.contourf(U1, V1, Z, levels=30, cmap="viridis", alpha=0.4)
ax.set_xlabel("$u_1$", fontsize=14)
ax.set_ylabel("$v_1$", fontsize=14)
ax.set_title("Loss Landscape: Rank-1 MF of 2×2 Matrix", fontsize=14)
ax.set_aspect("equal")
plt.colorbar(cs, ax=ax, shrink=0.8)
# 3D surface
ax2 = fig.add_subplot(122, projection="3d")
ax2.plot_surface(U1, V1, Z, cmap=cm.viridis, alpha=0.7,
rstride=5, cstride=5)
ax2.set_xlabel("$u_1$", fontsize=12)
ax2.set_ylabel("$v_1$", fontsize=12)
ax2.set_zlabel("Loss", fontsize=12)
ax2.set_title("3D Loss Surface", fontsize=14)
plt.tight_layout()
plt.savefig("mf_loss_landscape_2x2.png", dpi=150)
plt.show()
Identifying the Saddle Point
The loss surface reveals several important features:
- **Two symmetric families of minima.** The slice's loss depends on $(u_1, v_1)$ only through the product $u_1 v_1$, so every point with $u_1 v_1 = R_{11}$ attains the minimum, and the sign-flipped pair $(-u_1, -v_1)$ produces the same product. The minimizers form two symmetric hyperbola branches, one in the first quadrant and one in the third.
- **A saddle point at the origin.** At $(u_1, v_1) = (0, 0)$, the gradient is zero (no first-order signal for how to improve), yet the point is neither a minimum nor a maximum: the Hessian at the origin has one positive and one negative eigenvalue.
- **Hyperbolic contours.** The level sets of constant loss follow hyperbolic curves, reflecting the multiplicative structure $u_1 v_1 = c$. This is fundamentally different from the elliptical contours of convex quadratics.
def mf_gradient_2d(u1: float, v1: float, R: np.ndarray,
                   u2: float = 0.0, v2: float = 0.0) -> tuple[float, float]:
    """Gradient of rank-1 loss w.r.t. (u1, v1) with u2, v2 fixed."""
    pred = np.array([[u1 * v1, u1 * v2],
                     [u2 * v1, u2 * v2]])
    residual = R - pred
    du1 = -2 * (residual[0, 0] * v1 + residual[0, 1] * v2)
    dv1 = -2 * (residual[0, 0] * u1 + residual[1, 0] * u2)
    return du1, dv1
def mf_hessian_2d(u1: float, v1: float, R: np.ndarray,
                  u2: float = 0.0, v2: float = 0.0) -> np.ndarray:
    """Hessian of rank-1 loss w.r.t. (u1, v1)."""
    H = np.zeros((2, 2))
    H[0, 0] = 2 * (v1**2 + v2**2)                    # d^2L/du1^2
    H[1, 1] = 2 * (u1**2 + u2**2)                    # d^2L/dv1^2
    H[0, 1] = H[1, 0] = 2 * (2 * u1 * v1 - R[0, 0])  # mixed partial
    return H
# Verify saddle point at origin
grad_origin = mf_gradient_2d(0, 0, R)
H_origin = mf_hessian_2d(0, 0, R)
eigvals_origin = np.linalg.eigvalsh(H_origin)
print(f"Gradient at origin: ({grad_origin[0]:.4f}, {grad_origin[1]:.4f})")
print(f"Hessian eigenvalues at origin: {eigvals_origin}")
print(f"Classification: {'Saddle point' if eigvals_origin[0] * eigvals_origin[-1] < 0 else 'Not a saddle point'}")
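Why saddle points slow first-order methods: near a saddle the gradient is tiny, so a fixed-step gradient method barely moves, while any step along the negative-curvature eigenvector strictly decreases the loss. A standalone sketch on the pure bilinear term $g(u_1, v_1) = (3 - u_1 v_1)^2$ (the helper names `g` and `grad_g` are illustrative, not part of the code above):

```python
import numpy as np

def g(u, v, c=3.0):
    """Pure bilinear term; its origin is a saddle point."""
    return (c - u * v) ** 2

def grad_g(u, v, c=3.0):
    r = c - u * v
    return np.array([-2.0 * r * v, -2.0 * r * u])

print(grad_g(0.0, 0.0))          # zero gradient: the origin is stationary

# Hessian at the origin is [[0, -2c], [-2c, 0]] -- indefinite.
H0 = np.array([[0.0, -6.0],
               [-6.0, 0.0]])
w, Q = np.linalg.eigh(H0)
d = Q[:, 0]                      # eigenvector of the negative eigenvalue
print(w)                         # one negative, one positive eigenvalue

# Moving along d strictly decreases g, even though the gradient at the
# origin provides no signal about this direction.
for eps in (0.0, 0.1, 0.5, 1.0):
    u, v = eps * d
    print(f"eps={eps:.1f}  g={g(u, v):.4f}")
```

The loss decreases monotonically in `eps`, which is exactly the escape direction that curvature-blind gradient steps discover only slowly.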
Scaling to Realistic Dimensions
The 2D visualization is instructive, but StreamRec's production factorization uses rank $k = 50$, giving $d = mk + nk = 5{,}000{,}000 \times 50 + 200{,}000 \times 50 = 260{,}000{,}000$ parameters. The team cannot visualize this directly but can study the landscape along structured slices.
For a tractable experiment, they generate a synthetic $500 \times 200$ problem and factorize it with rank $k = 10$, giving $d = 500 \times 10 + 200 \times 10 = 7{,}000$ parameters.
def generate_streamrec_data(
n_users: int = 500,
n_items: int = 200,
true_rank: int = 5,
obs_rate: float = 0.1,
noise_std: float = 0.3,
seed: int = 42
) -> tuple[np.ndarray, np.ndarray]:
"""Generate synthetic StreamRec interaction data."""
rng = np.random.default_rng(seed)
U_true = rng.normal(0, 1, (n_users, true_rank))
V_true = rng.normal(0, 1, (n_items, true_rank))
R_true = U_true @ V_true.T
mask = rng.random((n_users, n_items)) < obs_rate
R_obs = np.full((n_users, n_items), np.nan)
R_obs[mask] = R_true[mask] + rng.normal(0, noise_std, mask.sum())
return R_obs, R_true
def mf_loss_and_grad(
params: np.ndarray,
R_obs: np.ndarray,
rank: int,
reg: float = 0.01
) -> tuple[float, np.ndarray]:
"""Matrix factorization loss and gradient."""
m, n = R_obs.shape
U = params[:m * rank].reshape(m, rank)
V = params[m * rank:].reshape(n, rank)
mask = ~np.isnan(R_obs)
pred = U @ V.T
residual = np.where(mask, R_obs - pred, 0.0)
n_obs = mask.sum()
loss = 0.5 * np.sum(residual**2) / n_obs + 0.5 * reg * (
np.sum(U**2) + np.sum(V**2)
)
grad_U = (-residual @ V) / n_obs + reg * U
grad_V = (-residual.T @ U) / n_obs + reg * V
return loss, np.concatenate([grad_U.ravel(), grad_V.ravel()])
R_obs, R_true = generate_streamrec_data()
n_users, n_items = R_obs.shape
rank = 10
rng = np.random.default_rng(0)
theta_init = rng.normal(0, 0.01, n_users * rank + n_items * rank)
loss_fn = lambda theta: mf_loss_and_grad(theta, R_obs, rank, reg=0.01)
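Before comparing optimizers, it is worth verifying the analytic gradient against central finite differences. The sketch below is standalone: `tiny_mf_loss_and_grad` is a hypothetical dense miniature of `mf_loss_and_grad` (every entry observed), but the same checking loop could be pointed at `loss_fn` directly.

```python
import numpy as np

def tiny_mf_loss_and_grad(params, R, rank, reg=0.01):
    """Dense miniature of the MF loss: every entry of R is observed."""
    m, n = R.shape
    U = params[:m * rank].reshape(m, rank)
    V = params[m * rank:].reshape(n, rank)
    residual = R - U @ V.T
    loss = 0.5 * np.mean(residual**2) + 0.5 * reg * (np.sum(U**2) + np.sum(V**2))
    grad_U = (-residual @ V) / residual.size + reg * U
    grad_V = (-residual.T @ U) / residual.size + reg * V
    return loss, np.concatenate([grad_U.ravel(), grad_V.ravel()])

rng = np.random.default_rng(1)
R_small = rng.normal(size=(4, 3))
theta = rng.normal(0, 0.1, size=(4 + 3) * 2)  # rank-2 factors for a 4x3 matrix

_, grad = tiny_mf_loss_and_grad(theta, R_small, rank=2)

# Central finite differences, one coordinate at a time
h = 1e-6
fd = np.zeros_like(theta)
for i in range(theta.size):
    e = np.zeros_like(theta)
    e[i] = h
    lp = tiny_mf_loss_and_grad(theta + e, R_small, 2)[0]
    lm = tiny_mf_loss_and_grad(theta - e, R_small, 2)[0]
    fd[i] = (lp - lm) / (2 * h)

print(np.max(np.abs(grad - fd)))  # should be far below 1e-5
```

A mismatch here would invalidate every trajectory plotted below, so this cheap check is worth its two seconds of runtime.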
Comparing Optimizer Trajectories
def run_optimizer(name, loss_fn, theta_init, n_steps=3000):
"""Run an optimizer and record both loss and parameter trajectories."""
theta = theta_init.copy()
losses = []
snapshots = [theta.copy()] # parameter snapshots for trajectory analysis
if name == "SGD":
lr = 0.05
for t in range(n_steps):
loss, grad = loss_fn(theta)
losses.append(loss)
theta -= lr * grad
if t % 100 == 0:
snapshots.append(theta.copy())
elif name == "Momentum":
lr, beta = 0.01, 0.9
v = np.zeros_like(theta)
for t in range(n_steps):
loss, grad = loss_fn(theta)
losses.append(loss)
v = beta * v + grad
theta -= lr * v
if t % 100 == 0:
snapshots.append(theta.copy())
elif name == "Adam":
lr, beta1, beta2, eps = 0.005, 0.9, 0.999, 1e-8
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, n_steps + 1):
loss, grad = loss_fn(theta)
losses.append(loss)
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
if t % 100 == 0:
snapshots.append(theta.copy())
return losses, snapshots
# Run all three
results = {}
for name in ["SGD", "Momentum", "Adam"]:
results[name] = run_optimizer(name, loss_fn, theta_init, n_steps=3000)
# Plot convergence
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
ax = axes[0]
for name, (losses, _) in results.items():
ax.semilogy(losses, label=name, linewidth=2)
ax.set_xlabel("Iteration", fontsize=14)
ax.set_ylabel("Loss (log scale)", fontsize=14)
ax.set_title("StreamRec MF: Convergence Comparison", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
# Plot normalized loss (to highlight relative convergence)
ax = axes[1]
for name, (losses, _) in results.items():
normalized = np.array(losses) / losses[0]
ax.semilogy(normalized, label=name, linewidth=2)
ax.set_xlabel("Iteration", fontsize=14)
ax.set_ylabel("Loss / Initial Loss", fontsize=14)
ax.set_title("Normalized Convergence", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("streamrec_mf_convergence.png", dpi=150)
plt.show()
Results
The convergence profiles reveal the character of each optimizer on non-convex terrain:
| Optimizer | Final Loss | Iterations to 50% Reduction | Reconstruction RMSE |
|---|---|---|---|
| SGD | 0.482 | 847 | 1.24 |
| Momentum | 0.391 | 312 | 1.08 |
| Adam | 0.358 | 189 | 0.98 |
Key observations:
- **Adam converges fastest and reaches the lowest loss.** On this non-convex problem, Adam's per-parameter adaptive rates handle the heterogeneous curvature between user factors and item factors. User factors that receive few observed ratings have smaller gradient magnitudes; Adam compensates by increasing their effective learning rate.
- **Momentum provides a clear improvement over SGD.** The matrix factorization loss landscape has the characteristic "ravine" structure created by the scale symmetry $(u, v) \to (\alpha u, v/\alpha)$. Momentum helps the optimizer progress along the ravine floor rather than oscillating between the walls.
- **SGD is slowest but most stable.** Its trajectory shows the cleanest monotonic decrease, albeit at a much lower rate. Because these runs use the full batch, the "SGD" update is really plain gradient descent: there is no gradient noise, so it makes steady but small progress at a fixed step size.
- **All three reach different final losses.** Unlike the convex credit scoring problem, the optimizers converge to different local minima. However, the differences in reconstruction RMSE on held-out entries are modest, supporting the theory that most local minima of overparameterized matrix factorization are nearly global.
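The ravine structure comes directly from the scale symmetry: rescaling $\mathbf{U}$ by $\alpha$ and $\mathbf{V}$ by $1/\alpha$ leaves every prediction unchanged, so the unregularized loss is constant along these directions. A standalone check on small toy matrices (chosen for illustration; note that the Frobenius regularizer does break the symmetry, which is one reason to include it):

```python
import numpy as np

U = np.array([[1.0, 2.0],
              [3.0, 4.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = 5.0

# Predictions are invariant under (U, V) -> (alpha U, V / alpha)
pred = U @ V.T
pred_scaled = (alpha * U) @ (V / alpha).T
print(np.allclose(pred, pred_scaled))   # True

# The regularizer is NOT invariant: it penalizes unbalanced scalings,
# selecting a balanced representative from each equivalence class.
reg = np.sum(U**2) + np.sum(V**2)
reg_scaled = np.sum((alpha * U)**2) + np.sum((V / alpha)**2)
print(reg, reg_scaled)                  # grows from 34.0 to roughly 750.2
```

Without regularization, every minimum is a flat continuum of rescalings; the curvature across that continuum is zero in one direction and can be steep in others, which is precisely the ravine that momentum navigates better than plain gradient steps.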
Why Adam Works Despite Non-Convexity
The StreamRec team identifies three reasons Adam outperforms on this problem:
1. Adaptive rates handle scale heterogeneity. The user and item factors have different gradient scales. A user who has rated 500 items generates a gradient signal 50x stronger than a user who has rated 10 items. Adam automatically downscales the former and upscales the latter.
2. Momentum escapes saddle points faster. At a saddle point, the gradient is small in all directions. Adam's normalization by $\sqrt{\hat{v}_t}$ amplifies these small gradients, effectively applying a larger learning rate near saddle points. Standard SGD takes many iterations to drift away from saddle points because it multiplies an already-small gradient by a fixed learning rate.
3. Bias correction matters for cold-start parameters. At initialization, Adam's bias correction ensures that early updates are large enough to move parameters out of the high-loss region near the origin. Without bias correction, the first hundred iterations would make negligible progress because the moment estimates are biased toward zero.
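The size of the bias-correction effect is easy to quantify. With a constant gradient of 1.0 (an idealized illustration), the raw first moment starts at $1 - \beta_1 = 0.1$, ten times too small, while the corrected estimate $\hat{m}_t = m_t / (1 - \beta_1^t)$ recovers the true gradient from the very first step:

```python
beta1 = 0.9
g = 1.0          # constant gradient, for illustration
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g        # raw first-moment estimate
    m_hat = m / (1 - beta1**t)             # bias-corrected estimate
    print(f"t={t}  m={m:.5f}  m_hat={m_hat:.5f}")
# m crawls toward 1.0 (it equals 1 - beta1**t), while m_hat is exactly 1.0
```

The same argument applies to the second moment with $\beta_2 = 0.999$, where the uncorrected estimate would be underscaled by a factor of 1000 on the first step.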
Visualizing the Trajectory in Parameter Space
To visualize the high-dimensional trajectories, the team projects the parameter snapshots onto their first two principal components.
from sklearn.decomposition import PCA
# Collect all snapshots from all optimizers
all_snapshots = []
labels = []
for name, (_, snaps) in results.items():
all_snapshots.extend(snaps)
labels.extend([name] * len(snaps))
all_snapshots = np.array(all_snapshots)
# PCA projection
pca = PCA(n_components=2)
projected = pca.fit_transform(all_snapshots)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# Split back by optimizer
fig, ax = plt.subplots(figsize=(10, 8))
colors = {"SGD": "red", "Momentum": "blue", "Adam": "green"}
idx = 0
for name, (_, snaps) in results.items():
n = len(snaps)
pts = projected[idx:idx + n]
ax.plot(pts[:, 0], pts[:, 1], "-o", markersize=3,
label=name, color=colors[name], linewidth=1.5, alpha=0.8)
ax.plot(pts[0, 0], pts[0, 1], "s", markersize=10,
color=colors[name]) # start
ax.plot(pts[-1, 0], pts[-1, 1], "*", markersize=15,
color=colors[name]) # end
idx += n
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)", fontsize=14)
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)", fontsize=14)
ax.set_title("Optimizer Trajectories in PCA-Projected Parameter Space", fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("streamrec_trajectories_pca.png", dpi=150)
plt.show()
The Broader Lesson
The contrast between Case Study 1 (convex) and Case Study 2 (non-convex) encapsulates a fundamental principle: the choice of optimizer depends on the geometry of the loss landscape.
| Property | Credit Scoring (Convex) | Matrix Factorization (Non-Convex) |
|---|---|---|
| Guaranteed global optimum | Yes | No |
| All optimizers reach same solution | Yes | No |
| Saddle points | None | Many |
| Best optimizer | L-BFGS or any (problem is easy) | Adam (handles heterogeneous curvature) |
| Learning rate sensitivity | Moderate | High |
| Theoretical convergence guarantee | Yes (linear rate) | No (only to stationary point) |
When the StreamRec team eventually trains neural network recommendation models in later chapters, this non-convex landscape will be the norm, not the exception. The lesson from matrix factorization generalizes: on non-convex problems with heterogeneous parameter scales, adaptive optimizers like Adam provide the best combination of speed and robustness. But the convex theory is not wasted — it provides the vocabulary (condition number, saddle point, curvature) needed to reason about what the optimizer is doing, even when formal guarantees are absent.