Case Study 1 — Recommendation System via Matrix Factorization

StreamRec's User-Item Interaction Matrix


Context

StreamRec, a mid-size content streaming platform with approximately 5 million users and 200,000 items (articles, videos, podcasts), needs to recommend content that users will engage with. The data science team has a user-item interaction matrix: rows are users, columns are items, and entries record engagement signals (a composite score from 1 to 5 derived from view duration, completion rate, and explicit ratings).

The matrix is extremely sparse. The average user interacts with roughly 40 items out of 200,000 — a sparsity of 99.98%. The recommendation task is to predict the missing entries: which items would a user rate highly if they saw them?

This case study builds a baseline recommender using SVD, analyzes the latent factor structure, and quantifies the accuracy-compression tradeoff.


Constructing the Interaction Matrix

We work with a representative sample: 10,000 users and 5,000 items. The sample is deliberately denser than the full platform (0.8% observed entries versus roughly 0.02%) so that each sampled user retains enough interactions to support factorization.

import numpy as np
from typing import NamedTuple

class StreamRecData(NamedTuple):
    """Container for StreamRec interaction data."""
    R_observed: np.ndarray   # Observed ratings (0 where missing)
    mask: np.ndarray         # Boolean mask of observed entries
    R_true: np.ndarray       # Ground truth (for evaluation)
    user_ids: np.ndarray     # User identifiers
    item_ids: np.ndarray     # Item identifiers

def load_streamrec_sample(
    n_users: int = 10_000,
    n_items: int = 5_000,
    n_latent_true: int = 15,
    density: float = 0.008,
    noise_std: float = 0.3,
    seed: int = 42,
) -> StreamRecData:
    """Generate a realistic StreamRec sample with known latent structure.

    The data is generated from a low-rank model to simulate the latent
    factor structure observed in real recommendation datasets.

    Args:
        n_users: Number of users in the sample.
        n_items: Number of items in the sample.
        n_latent_true: True number of latent factors.
        density: Fraction of observed entries.
        noise_std: Observation noise standard deviation.
        seed: Random seed for reproducibility.

    Returns:
        StreamRecData with observed ratings, mask, and ground truth.
    """
    rng = np.random.default_rng(seed)

    # True latent factors with varying importance
    # Factor importance decays: factor 1 explains more variance than factor 15
    factor_weights = 1.0 / np.sqrt(np.arange(1, n_latent_true + 1))

    P_true = rng.standard_normal((n_users, n_latent_true)) * factor_weights
    Q_true = rng.standard_normal((n_items, n_latent_true)) * factor_weights

    # True rating matrix
    R_true = P_true @ Q_true.T

    # Standardize to mean 3 and unit variance, then clip to the 1-5 rating scale
    R_true = 3.0 + (R_true - R_true.mean()) / R_true.std()
    R_true = np.clip(R_true, 1.0, 5.0)

    # Observation mask (not missing at random — popular items more likely observed)
    item_popularity = rng.dirichlet(np.ones(n_items) * 0.3)
    user_activity = rng.dirichlet(np.ones(n_users) * 0.5)

    mask = np.zeros((n_users, n_items), dtype=bool)
    n_observed = int(n_users * n_items * density)

    # Vectorized sampling; duplicate (user, item) draws collapse into a single
    # mask entry, so the realized count can be slightly below n_observed
    u_idx = rng.choice(n_users, size=n_observed, p=user_activity)
    i_idx = rng.choice(n_items, size=n_observed, p=item_popularity)
    mask[u_idx, i_idx] = True

    # Observed ratings with noise
    R_observed = np.where(
        mask,
        np.clip(R_true + noise_std * rng.standard_normal(R_true.shape), 1.0, 5.0),
        0.0,
    )

    return StreamRecData(
        R_observed=R_observed,
        mask=mask,
        R_true=R_true,
        user_ids=np.arange(n_users),
        item_ids=np.arange(n_items),
    )

data = load_streamrec_sample()
print(f"Matrix shape: {data.R_observed.shape}")
print(f"Observed entries: {data.mask.sum():,}")
print(f"Sparsity: {1 - data.mask.mean():.4%}")
print(f"Mean observed rating: {data.R_observed[data.mask].mean():.2f}")

SVD-Based Collaborative Filtering

The simplest matrix factorization approach: fill missing values, compute SVD, and truncate.

def svd_recommender(
    R: np.ndarray,
    mask: np.ndarray,
    k: int,
    fill_strategy: str = "user_mean",
) -> np.ndarray:
    """Build a rank-k SVD recommender.

    Args:
        R: Observed rating matrix (n_users, n_items).
        mask: Boolean observation mask.
        k: Number of latent factors (rank of approximation).
        fill_strategy: How to impute missing values before SVD.
            Options: "global_mean", "user_mean", "item_mean".

    Returns:
        Predicted rating matrix of shape (n_users, n_items).
    """
    # Impute missing values
    if fill_strategy == "global_mean":
        fill_value = R[mask].mean()
        R_filled = np.where(mask, R, fill_value)
    elif fill_strategy == "user_mean":
        user_means = np.where(
            mask.sum(axis=1, keepdims=True) > 0,
            (R * mask).sum(axis=1, keepdims=True) /
            np.maximum(mask.sum(axis=1, keepdims=True), 1),
            R[mask].mean(),
        )
        R_filled = np.where(mask, R, user_means)
    elif fill_strategy == "item_mean":
        item_means = np.where(
            mask.sum(axis=0, keepdims=True) > 0,
            (R * mask).sum(axis=0, keepdims=True) /
            np.maximum(mask.sum(axis=0, keepdims=True), 1),
            R[mask].mean(),
        )
        R_filled = np.where(mask, R, item_means)
    else:
        raise ValueError(f"Unknown fill strategy: {fill_strategy}")

    # Center by user means
    user_means_final = R_filled.mean(axis=1, keepdims=True)
    R_centered = R_filled - user_means_final

    # Truncated SVD
    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)
    R_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :] + user_means_final

    # Clip to valid rating range
    return np.clip(R_hat, 1.0, 5.0)
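Once `R_hat` is available, serving recommendations reduces to ranking each user's unseen items by predicted rating. A minimal sketch, assuming predictions live in a dense matrix (the helper name `top_n_recommendations` and the toy data are illustrative, not part of StreamRec's code):

```python
import numpy as np

def top_n_recommendations(R_hat: np.ndarray, mask: np.ndarray, n: int = 5) -> np.ndarray:
    """Return the indices of each user's top-n unseen items."""
    scores = np.where(mask, -np.inf, R_hat)       # never re-recommend seen items
    return np.argsort(scores, axis=1)[:, ::-1][:, :n]

# Toy example: 4 users, 10 items
rng = np.random.default_rng(0)
R_hat = rng.uniform(1.0, 5.0, size=(4, 10))
mask = rng.random((4, 10)) < 0.3                  # ~30% of entries already seen
recs = top_n_recommendations(R_hat, mask, n=3)
print(recs.shape)  # (4, 3)
```

Masking seen items with negative infinity before sorting guarantees they can never enter a user's top-n list as long as the user has at least n unseen items.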

Analyzing the Singular Value Spectrum

The singular value spectrum answers a critical question: what is the effective dimensionality of user preferences?

def analyze_spectrum(R: np.ndarray, mask: np.ndarray) -> dict:
    """Analyze the singular value spectrum of the rating matrix.

    Args:
        R: Observed rating matrix.
        mask: Observation mask.

    Returns:
        Dictionary with spectrum analysis results.
    """
    # Impute and center
    user_means = np.where(
        mask.sum(axis=1, keepdims=True) > 0,
        (R * mask).sum(axis=1, keepdims=True) /
        np.maximum(mask.sum(axis=1, keepdims=True), 1),
        R[mask].mean(),
    )
    R_filled = np.where(mask, R, user_means)
    R_centered = R_filled - R_filled.mean(axis=1, keepdims=True)

    # Full SVD
    _, sigma, _ = np.linalg.svd(R_centered, full_matrices=False)

    # Energy analysis
    energy = sigma**2
    total_energy = energy.sum()
    cumulative_ratio = np.cumsum(energy) / total_energy

    # Spectral gap analysis: flag ranks just before a sharp relative drop
    relative_drops = -np.diff(sigma) / sigma[:-1]
    elbow_candidates = np.where(relative_drops[:100] > 0.05)[0] + 1

    # Effective rank at different thresholds
    thresholds = {0.5: None, 0.8: None, 0.9: None, 0.95: None, 0.99: None}
    for t in thresholds:
        thresholds[t] = int(np.searchsorted(cumulative_ratio, t) + 1)

    return {
        "singular_values": sigma,
        "cumulative_energy": cumulative_ratio,
        "effective_ranks": thresholds,
        "elbow_candidates": elbow_candidates,
        "top_20_sv": sigma[:20],
        "energy_ratio_top_15": float(cumulative_ratio[14]),
    }

spectrum = analyze_spectrum(data.R_observed, data.mask)
print("Effective ranks at energy thresholds:")
for threshold, rank in spectrum["effective_ranks"].items():
    print(f"  {threshold*100:.0f}%: {rank}")
print(f"\nEnergy in top 15 factors: {spectrum['energy_ratio_top_15']:.1%}")
print(f"\nTop 20 singular values:\n{spectrum['top_20_sv'].round(1)}")

In real recommendation datasets, the singular value spectrum typically shows a rapid initial decay (the first few factors capture genre-level preferences) followed by a long tail (nuanced, individual-specific preferences). The "elbow" in the spectrum suggests the number of latent factors that separate signal from noise.
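One common heuristic makes the elbow concrete: choose the rank just before the largest relative drop between consecutive singular values. A minimal sketch on a synthetic spectrum (the `find_elbow` helper and the synthetic decay are illustrative assumptions, not StreamRec's data):

```python
import numpy as np

def find_elbow(sigma: np.ndarray) -> int:
    """Rank just before the largest relative drop in the spectrum."""
    drops = -np.diff(sigma) / sigma[:-1]   # relative drop between consecutive values
    return int(np.argmax(drops)) + 1

# Synthetic spectrum: 10 decaying signal factors, then a flat noise tail
sigma = np.concatenate([50.0 / np.arange(1, 11), np.full(90, 0.5)])
print(find_elbow(sigma))  # 10
```

The heuristic is sensitive to noise in real spectra, so in practice it is best read alongside the cumulative-energy thresholds computed above rather than trusted on its own.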


The Accuracy-Compression Tradeoff

How many latent factors do we need? More factors improve training fit but risk overfitting the noise in observed ratings.

def evaluate_across_ranks(
    data: StreamRecData,
    ranks: list[int],
) -> dict:
    """Evaluate SVD recommender at multiple ranks.

    Uses a held-out evaluation set: observed entries are split 80/20.

    Args:
        data: StreamRec dataset.
        ranks: List of rank values to evaluate.

    Returns:
        Dictionary mapping rank to (train_rmse, test_rmse).
    """
    rng = np.random.default_rng(123)

    # Split observed entries into train/test
    observed_indices = np.argwhere(data.mask)
    n_obs = len(observed_indices)
    perm = rng.permutation(n_obs)
    n_train = int(0.8 * n_obs)

    train_idx = observed_indices[perm[:n_train]]
    test_idx = observed_indices[perm[n_train:]]

    train_mask = np.zeros_like(data.mask)
    train_mask[train_idx[:, 0], train_idx[:, 1]] = True

    R_train = data.R_observed * train_mask

    results = {}
    for k in ranks:
        R_hat = svd_recommender(R_train, train_mask, k)

        train_rmse = np.sqrt(np.mean(
            (R_hat[train_mask] - data.R_observed[train_mask])**2
        ))
        test_rmse = np.sqrt(np.mean(
            (R_hat[test_idx[:, 0], test_idx[:, 1]] -
             data.R_observed[test_idx[:, 0], test_idx[:, 1]])**2
        ))

        results[k] = {"train_rmse": train_rmse, "test_rmse": test_rmse}

    return results

ranks = [1, 2, 5, 10, 15, 20, 30, 50, 100, 200]
results = evaluate_across_ranks(data, ranks)

print(f"{'Rank':>6} | {'Train RMSE':>12} | {'Test RMSE':>12} | {'Overfit?':>10}")
print("-" * 50)
best_test_rmse = float("inf")
best_k = 0
for k, metrics in results.items():
    overfit = "←" if metrics["test_rmse"] > best_test_rmse else ""
    if metrics["test_rmse"] < best_test_rmse:
        best_test_rmse = metrics["test_rmse"]
        best_k = k
    print(f"{k:>6} | {metrics['train_rmse']:>12.4f} | {metrics['test_rmse']:>12.4f} | {overfit:>10}")

print(f"\nBest rank: {best_k} (test RMSE: {best_test_rmse:.4f})")

What Do the Latent Factors Represent?

The right singular vectors (columns of $\mathbf{V}$) define the item-side latent space. Each singular vector is a "direction" in item space — a pattern of items that tend to co-vary.

def interpret_factors(
    R: np.ndarray,
    mask: np.ndarray,
    n_factors: int = 5,
    n_top_items: int = 10,
) -> None:
    """Examine the top items associated with each latent factor.

    Args:
        R: Rating matrix.
        mask: Observation mask.
        n_factors: Number of factors to examine.
        n_top_items: Number of top items to show per factor.
    """
    user_means = np.where(
        mask.sum(axis=1, keepdims=True) > 0,
        (R * mask).sum(axis=1, keepdims=True) /
        np.maximum(mask.sum(axis=1, keepdims=True), 1),
        R[mask].mean(),
    )
    R_filled = np.where(mask, R, user_means)
    R_centered = R_filled - R_filled.mean(axis=1, keepdims=True)

    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)

    for f in range(n_factors):
        factor = Vt[f, :]
        top_pos = np.argsort(factor)[-n_top_items:][::-1]
        top_neg = np.argsort(factor)[:n_top_items]

        print(f"\n--- Factor {f+1} (σ = {sigma[f]:.1f}) ---")
        print(f"  Positive end (items {top_pos[:5]}): "
              f"loadings {factor[top_pos[:5]].round(3)}")
        print(f"  Negative end (items {top_neg[:5]}): "
              f"loadings {factor[top_neg[:5]].round(3)}")

interpret_factors(data.R_observed, data.mask)

In a real system with item metadata, these latent factors often correspond to interpretable dimensions: genre preferences (comedy vs. drama), format preferences (long-form vs. short clips), or topical interests (technology vs. sports). The first factor typically captures overall popularity; subsequent factors capture increasingly specific preference dimensions.
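The "similar latent representations attract similar audiences" claim can be made operational with cosine similarity in the item latent space. A self-contained sketch on a toy dense matrix (the `cosine_sim` helper and the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.uniform(1.0, 5.0, size=(50, 8))            # toy dense rating matrix
U, s, Vt = np.linalg.svd(R - R.mean(axis=1, keepdims=True), full_matrices=False)

k = 3
item_factors = Vt[:k].T * s[:k]                    # (n_items, k) latent item vectors

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two latent item vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_sim(item_factors[0], item_factors[1])
print(f"similarity(item 0, item 1) = {sim:.3f}")
```

Item-item similarities computed this way underpin "because you watched X" style recommendations: rank all items by cosine similarity to a seed item's latent vector.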


Key Findings

  1. Low-rank structure is real. The singular value spectrum of real interaction matrices decays rapidly. Typically 10-50 latent factors capture the meaningful preference structure among millions of users and items.

  2. Overfitting is the central challenge. Training RMSE decreases monotonically with rank, but test RMSE reaches a minimum and then increases. The optimal rank balances model capacity against the noise level and sparsity of the data.

  3. SVD provides the optimal factorization — but only for the imputed matrix. The fundamental limitation of the SVD approach is that it operates on a fully observed matrix (after imputation), not on the sparse observed entries directly. Production systems use optimization-based methods (ALS, SGD) that minimize reconstruction error only on observed entries, with explicit regularization to prevent overfitting.

  4. Latent factors are interpretable in aggregate but not individually. Individual factors may not map to human-understandable concepts, but the latent space as a whole captures the structure of user preferences. Two users with similar latent representations have similar tastes; two items with similar latent representations attract similar audiences.

  5. The singular value spectrum is a diagnostic tool. Before choosing a factorization method or model architecture, examine the spectrum. If the top 10 singular values dominate, a simple low-rank model suffices. If the spectrum decays slowly, the preference structure is complex and may benefit from nonlinear methods (neural collaborative filtering, explored in Chapter 13).
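Finding 3 deserves a sketch. Below is a minimal SGD factorizer that minimizes squared error only on observed entries, with L2 regularization; the function name, hyperparameters, and toy data are illustrative and untuned, not a production recipe:

```python
import numpy as np

def sgd_factorize(rows, cols, vals, n_users, n_items, k=5,
                  lr=0.05, reg=0.02, epochs=200, seed=0):
    """SGD matrix factorization on observed entries only, with L2 penalty."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in zip(rows, cols, vals):
            pu = P[u].copy()                      # snapshot before updating
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Toy example: 4 users x 6 items with rank-2 ground truth, 80% observed
rng = np.random.default_rng(1)
P0 = rng.standard_normal((4, 2))
Q0 = rng.standard_normal((6, 2))
R = P0 @ Q0.T
rows, cols = np.nonzero(rng.random((4, 6)) < 0.8)
vals = R[rows, cols]
P, Q = sgd_factorize(rows, cols, vals, 4, 6, k=2)
rmse = np.sqrt(np.mean((vals - np.sum(P[rows] * Q[cols], axis=1))**2))
print(f"train RMSE: {rmse:.3f}")
```

Unlike the impute-then-SVD pipeline above, this loop never invents values for missing entries; the regularization term is what keeps the factors from overfitting the sparse observations.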


Case Study 1 for Chapter 1 of Advanced Data Science