Case Study 1 — Recommendation System via Matrix Factorization
StreamRec's User-Item Interaction Matrix
Context
StreamRec, a mid-size content streaming platform with approximately 5 million users and 200,000 items (articles, videos, podcasts), needs to recommend content that users will engage with. The data science team has a user-item interaction matrix: rows are users, columns are items, and entries record engagement signals (a composite score from 1 to 5 derived from view duration, completion rate, and explicit ratings).
The matrix is extremely sparse. The average user interacts with roughly 40 items out of 200,000 — a sparsity of 99.98%. The recommendation task is to predict the missing entries: which items would a user rate highly if they saw them?
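To make the sparsity concrete, here is a back-of-the-envelope comparison of dense versus coordinate-format storage at full platform scale (float32 values, int32 indices; the numbers follow directly from the figures above):

```python
n_users, n_items = 5_000_000, 200_000
interactions_per_user = 40

n_observed = n_users * interactions_per_user
sparsity = 1 - n_observed / (n_users * n_items)

# Dense float32 matrix vs. coordinate triplets (int32 row, int32 col,
# float32 value) holding only the observed entries
dense_bytes = n_users * n_items * 4
sparse_bytes = n_observed * (4 + 4 + 4)

print(f"Sparsity:     {sparsity:.4%}")               # 99.9800%
print(f"Dense matrix: {dense_bytes / 1e12:.1f} TB")  # 4.0 TB
print(f"Sparse (COO): {sparse_bytes / 1e9:.1f} GB")  # 2.4 GB
```

A dense matrix at this scale is out of the question; every practical pipeline stores only the observed triplets.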
This case study builds a baseline recommender using SVD, analyzes the latent factor structure, and quantifies the accuracy-compression tradeoff.
Constructing the Interaction Matrix
We work with a representative sample of 10,000 users and 5,000 items. The sample preserves the roughly 40 interactions per user seen on the full platform; because the item catalog is 40× smaller, the resulting density (0.8%) is correspondingly higher than the full platform's 0.02%.
```python
import numpy as np
from typing import NamedTuple


class StreamRecData(NamedTuple):
    """Container for StreamRec interaction data."""

    R_observed: np.ndarray  # Observed ratings (0 where missing)
    mask: np.ndarray        # Boolean mask of observed entries
    R_true: np.ndarray      # Ground truth (for evaluation)
    user_ids: np.ndarray    # User identifiers
    item_ids: np.ndarray    # Item identifiers


def load_streamrec_sample(
    n_users: int = 10_000,
    n_items: int = 5_000,
    n_latent_true: int = 15,
    density: float = 0.008,
    noise_std: float = 0.3,
    seed: int = 42,
) -> StreamRecData:
    """Generate a realistic StreamRec sample with known latent structure.

    The data is generated from a low-rank model to simulate the latent
    factor structure observed in real recommendation datasets.

    Args:
        n_users: Number of users in the sample.
        n_items: Number of items in the sample.
        n_latent_true: True number of latent factors.
        density: Fraction of observed entries.
        noise_std: Observation noise standard deviation.
        seed: Random seed for reproducibility.

    Returns:
        StreamRecData with observed ratings, mask, and ground truth.
    """
    rng = np.random.default_rng(seed)

    # True latent factors with varying importance.
    # Factor importance decays: factor 1 explains more variance than factor 15.
    factor_weights = np.array([1.0 / (i + 1) ** 0.5 for i in range(n_latent_true)])
    P_true = rng.standard_normal((n_users, n_latent_true)) * factor_weights
    Q_true = rng.standard_normal((n_items, n_latent_true)) * factor_weights

    # True rating matrix, normalized to the 1-5 scale
    R_true = P_true @ Q_true.T
    R_true = 3.0 + (R_true - R_true.mean()) / R_true.std()
    R_true = np.clip(R_true, 1.0, 5.0)

    # Observation mask (not missing at random: popular items are more likely
    # to be observed). Duplicate draws land on the same entry, so the
    # realized density is slightly below `density`.
    item_popularity = rng.dirichlet(np.ones(n_items) * 0.3)
    user_activity = rng.dirichlet(np.ones(n_users) * 0.5)
    mask = np.zeros((n_users, n_items), dtype=bool)
    n_observed = int(n_users * n_items * density)
    users = rng.choice(n_users, size=n_observed, p=user_activity)
    items = rng.choice(n_items, size=n_observed, p=item_popularity)
    mask[users, items] = True

    # Observed ratings with noise
    R_observed = np.where(
        mask,
        np.clip(R_true + noise_std * rng.standard_normal(R_true.shape), 1.0, 5.0),
        0.0,
    )
    return StreamRecData(
        R_observed=R_observed,
        mask=mask,
        R_true=R_true,
        user_ids=np.arange(n_users),
        item_ids=np.arange(n_items),
    )


data = load_streamrec_sample()
print(f"Matrix shape: {data.R_observed.shape}")
print(f"Observed entries: {data.mask.sum():,}")
print(f"Sparsity: {1 - data.mask.mean():.4%}")
print(f"Mean observed rating: {data.R_observed[data.mask].mean():.2f}")
```
SVD-Based Collaborative Filtering
The simplest matrix factorization approach: fill missing values, compute SVD, and truncate.
```python
def svd_recommender(
    R: np.ndarray,
    mask: np.ndarray,
    k: int,
    fill_strategy: str = "user_mean",
) -> np.ndarray:
    """Build a rank-k SVD recommender.

    Args:
        R: Observed rating matrix (n_users, n_items).
        mask: Boolean observation mask.
        k: Number of latent factors (rank of approximation).
        fill_strategy: How to impute missing values before SVD.
            Options: "global_mean", "user_mean", "item_mean".

    Returns:
        Predicted rating matrix of shape (n_users, n_items).
    """
    # Impute missing values
    if fill_strategy == "global_mean":
        fill_value = R[mask].mean()
        R_filled = np.where(mask, R, fill_value)
    elif fill_strategy == "user_mean":
        user_means = np.where(
            mask.sum(axis=1, keepdims=True) > 0,
            (R * mask).sum(axis=1, keepdims=True)
            / np.maximum(mask.sum(axis=1, keepdims=True), 1),
            R[mask].mean(),
        )
        R_filled = np.where(mask, R, user_means)
    elif fill_strategy == "item_mean":
        item_means = np.where(
            mask.sum(axis=0, keepdims=True) > 0,
            (R * mask).sum(axis=0, keepdims=True)
            / np.maximum(mask.sum(axis=0, keepdims=True), 1),
            R[mask].mean(),
        )
        R_filled = np.where(mask, R, item_means)
    else:
        raise ValueError(f"Unknown fill strategy: {fill_strategy}")

    # Center by user means so the SVD models deviations from each
    # user's baseline rating
    user_means_final = R_filled.mean(axis=1, keepdims=True)
    R_centered = R_filled - user_means_final

    # Truncated SVD: keep only the top k singular triplets
    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)
    R_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :] + user_means_final

    # Clip to the valid rating range
    return np.clip(R_hat, 1.0, 5.0)
```
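Why truncate at rank k at all? The Eckart-Young theorem guarantees that the truncated SVD is the best rank-k approximation in Frobenius norm, and that its error equals the root-sum-square of the discarded singular values. A quick self-contained check on a random matrix (independent of the StreamRec data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k truncation

# Eckart-Young: no rank-k matrix is closer in Frobenius norm, and the
# error equals the root-sum-square of the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt((s[k:] ** 2).sum())
print(np.isclose(err, expected))  # True
```

This optimality is exactly what `svd_recommender` exploits, with the caveat (revisited in the Key Findings) that it holds for the imputed matrix, not the sparse observations themselves.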
Analyzing the Singular Value Spectrum
The singular value spectrum answers a critical question: what is the effective dimensionality of user preferences?
```python
def analyze_spectrum(R: np.ndarray, mask: np.ndarray) -> dict:
    """Analyze the singular value spectrum of the rating matrix.

    Args:
        R: Observed rating matrix.
        mask: Observation mask.

    Returns:
        Dictionary with spectrum analysis results.
    """
    # Impute missing entries with user means, then center
    user_means = np.where(
        mask.sum(axis=1, keepdims=True) > 0,
        (R * mask).sum(axis=1, keepdims=True)
        / np.maximum(mask.sum(axis=1, keepdims=True), 1),
        R[mask].mean(),
    )
    R_filled = np.where(mask, R, user_means)
    R_centered = R_filled - R_filled.mean(axis=1, keepdims=True)

    # Full SVD (only the singular values are needed here)
    _, sigma, _ = np.linalg.svd(R_centered, full_matrices=False)

    # Energy analysis: squared singular values as fractions of the total
    energy = sigma**2
    cumulative_ratio = np.cumsum(energy) / energy.sum()

    # Spectral gap analysis: large relative drops suggest elbow candidates
    relative_gaps = np.diff(sigma) / sigma[:-1]
    elbow_candidates = np.where(np.abs(relative_gaps[:100]) > 0.05)[0]

    # Effective rank at different energy thresholds
    effective_ranks = {
        t: int(np.searchsorted(cumulative_ratio, t) + 1)
        for t in (0.5, 0.8, 0.9, 0.95, 0.99)
    }
    return {
        "singular_values": sigma,
        "cumulative_energy": cumulative_ratio,
        "effective_ranks": effective_ranks,
        "elbow_candidates": elbow_candidates,
        "top_20_sv": sigma[:20],
        "energy_ratio_top_15": float(cumulative_ratio[14]),
    }


spectrum = analyze_spectrum(data.R_observed, data.mask)
print("Effective ranks at energy thresholds:")
for threshold, rank in spectrum["effective_ranks"].items():
    print(f"  {threshold * 100:.0f}%: {rank}")
print(f"\nEnergy in top 15 factors: {spectrum['energy_ratio_top_15']:.1%}")
print(f"\nTop 20 singular values:\n{spectrum['top_20_sv'].round(1)}")
```
In real recommendation datasets, the singular value spectrum typically shows a rapid initial decay (the first few factors capture genre-level preferences) followed by a long tail (nuanced, individual-specific preferences). The "elbow" in the spectrum suggests the number of latent factors that separate signal from noise.
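The elbow can also be located programmatically, as the largest relative drop between consecutive singular values. A self-contained sketch on a synthetic spectrum with 15 planted signal directions (the shapes and thresholds here are illustrative, not tuned):

```python
import numpy as np

# Synthetic spectrum: 15 planted "signal" singular values (100 down to 30)
# followed by a slowly decaying noise tail
signal = np.linspace(100.0, 30.0, 15)
noise = 10.0 * 0.99 ** np.arange(200)
sigma = np.sort(np.concatenate([signal, noise]))[::-1]

# The largest relative drop between consecutive singular values
# marks the elbow
rel_drop = -np.diff(sigma) / sigma[:-1]
elbow = int(np.argmax(rel_drop[:50])) + 1
print(f"Detected elbow at rank {elbow}")  # rank 15: the planted signal rank
```

On real spectra the drop is rarely this clean, so the relative-gap heuristic is best read alongside the cumulative-energy thresholds computed above.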
The Accuracy-Compression Tradeoff
How many latent factors do we need? More factors improve training fit but risk overfitting the noise in observed ratings.
```python
def evaluate_across_ranks(
    data: StreamRecData,
    ranks: list[int],
) -> dict:
    """Evaluate the SVD recommender at multiple ranks.

    Uses a held-out evaluation set: observed entries are split 80/20.

    Args:
        data: StreamRec dataset.
        ranks: List of rank values to evaluate.

    Returns:
        Dictionary mapping rank to train and test RMSE.
    """
    rng = np.random.default_rng(123)

    # Split observed entries into train/test
    observed_indices = np.argwhere(data.mask)
    n_obs = len(observed_indices)
    perm = rng.permutation(n_obs)
    n_train = int(0.8 * n_obs)
    train_idx = observed_indices[perm[:n_train]]
    test_idx = observed_indices[perm[n_train:]]

    train_mask = np.zeros_like(data.mask)
    train_mask[train_idx[:, 0], train_idx[:, 1]] = True
    R_train = data.R_observed * train_mask

    results = {}
    for k in ranks:
        R_hat = svd_recommender(R_train, train_mask, k)
        train_rmse = np.sqrt(np.mean(
            (R_hat[train_mask] - data.R_observed[train_mask]) ** 2
        ))
        test_rmse = np.sqrt(np.mean(
            (R_hat[test_idx[:, 0], test_idx[:, 1]]
             - data.R_observed[test_idx[:, 0], test_idx[:, 1]]) ** 2
        ))
        results[k] = {"train_rmse": train_rmse, "test_rmse": test_rmse}
    return results


ranks = [1, 2, 5, 10, 15, 20, 30, 50, 100, 200]
results = evaluate_across_ranks(data, ranks)

print(f"{'Rank':>6} | {'Train RMSE':>12} | {'Test RMSE':>12} | {'Overfit?':>10}")
print("-" * 50)
best_test_rmse = float("inf")
best_k = 0
for k, metrics in results.items():
    # Flag ranks where test RMSE has risen above the best seen so far
    overfit = "←" if metrics["test_rmse"] > best_test_rmse else ""
    if metrics["test_rmse"] < best_test_rmse:
        best_test_rmse = metrics["test_rmse"]
        best_k = k
    print(f"{k:>6} | {metrics['train_rmse']:>12.4f} | "
          f"{metrics['test_rmse']:>12.4f} | {overfit:>10}")
print(f"\nBest rank: {best_k} (test RMSE: {best_test_rmse:.4f})")
```
What Do the Latent Factors Represent?
The right singular vectors (columns of $\mathbf{V}$) define the item-side latent space. Each singular vector is a "direction" in item space — a pattern of items that tend to co-vary.
```python
def interpret_factors(
    R: np.ndarray,
    mask: np.ndarray,
    n_factors: int = 5,
    n_top_items: int = 10,
) -> None:
    """Examine the top items associated with each latent factor.

    Args:
        R: Rating matrix.
        mask: Observation mask.
        n_factors: Number of factors to examine.
        n_top_items: Number of top items to show per factor.
    """
    # Impute with user means and center (same preprocessing as before)
    user_means = np.where(
        mask.sum(axis=1, keepdims=True) > 0,
        (R * mask).sum(axis=1, keepdims=True)
        / np.maximum(mask.sum(axis=1, keepdims=True), 1),
        R[mask].mean(),
    )
    R_filled = np.where(mask, R, user_means)
    R_centered = R_filled - R_filled.mean(axis=1, keepdims=True)
    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)

    for f in range(n_factors):
        factor = Vt[f, :]
        top_pos = np.argsort(factor)[-n_top_items:][::-1]
        top_neg = np.argsort(factor)[:n_top_items]
        print(f"\n--- Factor {f + 1} (σ = {sigma[f]:.1f}) ---")
        print(f"  Positive end (items {top_pos[:5]}): "
              f"loadings {factor[top_pos[:5]].round(3)}")
        print(f"  Negative end (items {top_neg[:5]}): "
              f"loadings {factor[top_neg[:5]].round(3)}")


interpret_factors(data.R_observed, data.mask)
```
In a real system with item metadata, these latent factors often correspond to interpretable dimensions: genre preferences (comedy vs. drama), format preferences (long-form vs. short clips), or topical interests (technology vs. sports). The first factor typically captures overall popularity; subsequent factors capture increasingly specific preference dimensions.
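One practical consequence: nearest-neighbor queries in the latent space yield "similar items". A minimal sketch, using a random stand-in matrix in place of the centered StreamRec ratings (the helper `similar_items` is ours, not a library function):

```python
import numpy as np

rng = np.random.default_rng(7)
R_centered = rng.standard_normal((200, 100))  # stand-in for centered ratings

k = 10
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
item_vecs = (Vt[:k, :] * s[:k, None]).T  # (n_items, k) latent item vectors

def similar_items(item: int, n: int = 5) -> np.ndarray:
    """Return the n items whose latent vectors are most cosine-similar."""
    v = item_vecs[item]
    sims = item_vecs @ v / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(v) + 1e-12
    )
    sims[item] = -np.inf  # exclude the query item itself
    return np.argsort(sims)[::-1][:n]

print(similar_items(0))
```

Scaling each row of Vt by its singular value weights the comparison toward the factors that explain the most variance; dropping the scaling gives a purely directional notion of similarity.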
Key Findings
- Low-rank structure is real. The singular value spectrum of real interaction matrices decays rapidly. Typically 10-50 latent factors capture the meaningful preference structure among millions of users and items.
- Overfitting is the central challenge. Training RMSE decreases monotonically with rank, but test RMSE reaches a minimum and then increases. The optimal rank balances model capacity against the noise level and sparsity of the data.
- SVD provides the optimal factorization, but only for the imputed matrix. The fundamental limitation of the SVD approach is that it operates on a fully observed matrix (after imputation), not on the sparse observed entries directly. Production systems use optimization-based methods (ALS, SGD) that minimize reconstruction error only on observed entries, with explicit regularization to prevent overfitting.
- Latent factors are interpretable in aggregate but not individually. Individual factors may not map to human-understandable concepts, but the latent space as a whole captures the structure of user preferences. Two users with similar latent representations have similar tastes; two items with similar latent representations attract similar audiences.
- The singular value spectrum is a diagnostic tool. Before choosing a factorization method or model architecture, examine the spectrum. If the top 10 singular values dominate, a simple low-rank model suffices. If the spectrum decays slowly, the preference structure is complex and may benefit from nonlinear methods (neural collaborative filtering, explored in Chapter 13).
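The ALS approach mentioned above fits in a few lines: alternately solve a ridge-regression problem for the user factors and for the item factors, using only observed entries. A minimal sketch on a tiny synthetic matrix (illustrative, not production code):

```python
import numpy as np

def als_factorize(R, mask, k=3, lam=0.1, n_iters=20, seed=0):
    """Minimal ALS: alternately solve a ridge regression for each user's
    and each item's latent vector, using only observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        for u in range(n_users):
            obs = mask[u]                # items this user rated
            Qo = Q[obs]
            P[u] = np.linalg.solve(Qo.T @ Qo + reg, Qo.T @ R[u, obs])
        for i in range(n_items):
            obs = mask[:, i]             # users who rated this item
            Po = P[obs]
            Q[i] = np.linalg.solve(Po.T @ Po + reg, Po.T @ R[obs, i])
    return P, Q

# Tiny rank-3 ground truth with roughly half the entries observed
rng = np.random.default_rng(1)
P0, Q0 = rng.standard_normal((30, 3)), rng.standard_normal((20, 3))
R_full = P0 @ Q0.T
obs_mask = rng.random(R_full.shape) < 0.5
P, Q = als_factorize(R_full * obs_mask, obs_mask, k=3)
rmse = np.sqrt(np.mean(((P @ Q.T - R_full)[obs_mask]) ** 2))
print(f"Observed-entry RMSE: {rmse:.4f}")
```

Unlike the imputation-plus-SVD pipeline, this objective never touches the missing entries, which is why regularized ALS (and its SGD counterpart) scales to matrices where imputation is infeasible.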
Case Study 1 for Chapter 1 of Advanced Data Science