In This Chapter
- Collaborative Filtering, Content-Based, and Hybrid Approaches
- The Highest-Impact ML You Will Ever Deploy
- Part 1: The Recommendation Problem
- Part 2: Collaborative Filtering
- Part 3: Content-Based Filtering
- Part 4: Matrix Factorization (SVD)
- Part 5: Evaluation --- Ranking Metrics That Actually Matter
- Part 6: The Cold Start Problem
- Part 7: Building a Hybrid Recommender
- Part 8: Implicit Feedback and Beyond
- Part 9: Putting It All Together --- A Recommender System Design Checklist
- Summary
Chapter 24: Recommender Systems
Collaborative Filtering, Content-Based, and Hybrid Approaches
Learning Objectives
By the end of this chapter, you will be able to:
- Implement user-based and item-based collaborative filtering
- Build a content-based recommender using TF-IDF feature similarity
- Apply matrix factorization (SVD) for latent factor models
- Evaluate recommender systems with ranking metrics (NDCG, MAP, Hit Rate)
- Design a hybrid recommender that combines collaborative and content-based signals
The Highest-Impact ML You Will Ever Deploy
Core Principle --- Recommender systems generate more revenue per engineering hour than almost any other machine learning application. Netflix estimates that its recommender system saves $1 billion per year in reduced churn. Amazon attributes 35% of its revenue to recommendations. Spotify's Discover Weekly playlist converted millions of casual listeners into engaged subscribers. If you build one ML system in your career, there is a decent chance it will be a recommender.
Yet most introductory treatments get the evaluation wrong. They teach you to minimize RMSE on held-out ratings, which is what the Netflix Prize and many Kaggle competitions optimized. In production, nobody cares whether you predicted a 4.2 when the user rated 3.9. What matters is whether the items you surfaced at the top of the list were items the user actually engaged with. That is a ranking problem, not a regression problem. This chapter treats it as such from the start.
The three paradigms are straightforward:
Collaborative filtering says: "Users who behaved similarly to you in the past liked these items. You probably will too." It needs no knowledge of what items are --- only who interacted with what.
Content-based filtering says: "You liked items with these features. Here are other items with similar features." It needs no knowledge of other users --- only the features of items and the user's history.
Hybrid approaches combine both signals, and in practice, every production recommender is a hybrid. Pure collaborative filtering cannot handle new items (the cold start problem). Pure content-based filtering cannot capture taste patterns that transcend item features. Hybrids get the best of both.
This chapter builds all three, evaluates them with ranking metrics that actually matter, and closes with two case studies: ShopSmart's product recommender and StreamFlow's content recommender for churn reduction.
Part 1: The Recommendation Problem
Explicit vs. Implicit Feedback
The first design decision is what data you have. Recommender systems consume two fundamentally different types of feedback, and the choice shapes every algorithm decision downstream.
Explicit feedback is a direct expression of preference: star ratings, thumbs up/down, "not interested" clicks. It is clean, interpretable, and rare. Netflix's 5-star rating system generated perhaps 200 million ratings from 200 million subscribers. That sounds like a lot until you realize the catalog has tens of thousands of titles. The user-item matrix is over 99% empty.
Implicit feedback is behavioral data that implies preference: views, clicks, purchases, time spent, scroll depth, add-to-cart events. It is noisy, ambiguous, and abundant. A user who watched 90% of a movie probably liked it. A user who watched 5 minutes and switched to something else probably did not. But "probably" is doing a lot of work there --- maybe they were interrupted, maybe they fell asleep, maybe they were hate-watching.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
np.random.seed(42)
# --- Simulate explicit feedback: user-item rating matrix ---
n_users = 500
n_items = 200
# Simulate sparse ratings (only 5% of entries are observed)
n_ratings = int(0.05 * n_users * n_items)
user_ids = np.random.randint(0, n_users, size=n_ratings)
item_ids = np.random.randint(0, n_items, size=n_ratings)
# Create latent preferences to make ratings non-random
n_factors = 5
user_factors = np.random.randn(n_users, n_factors)
item_factors = np.random.randn(n_items, n_factors)
# Rating = dot product of latent factors + noise, clipped to [1, 5]
raw_scores = user_factors[user_ids] * item_factors[item_ids]
ratings = np.sum(raw_scores, axis=1) + np.random.randn(n_ratings) * 0.5
ratings = np.clip(np.round(ratings * 0.5 + 3), 1, 5)
ratings_df = pd.DataFrame({
'user_id': user_ids,
'item_id': item_ids,
'rating': ratings
}).drop_duplicates(subset=['user_id', 'item_id']).reset_index(drop=True)
print(f"Explicit feedback: {len(ratings_df)} ratings")
print(f"Users: {ratings_df['user_id'].nunique()}, Items: {ratings_df['item_id'].nunique()}")
print(f"Sparsity: {1 - len(ratings_df) / (n_users * n_items):.4f}")
print(f"\nRating distribution:")
print(ratings_df['rating'].value_counts().sort_index())
Practical Note --- In production, implicit feedback dominates. Most users never rate anything. Amazon does not wait for star ratings to recommend products; it uses purchase history, browsing behavior, and add-to-cart events. If your data is implicit, treat it as binary (interacted / did not interact) or weighted by engagement strength (e.g., minutes watched). Do not try to convert implicit signals into fake star ratings.
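The weighting the note describes can be sketched in a few lines. The event types and weights below are invented for illustration; the point is that raw events are aggregated into interaction strengths (or a binary flag), never into fake star ratings:

```python
import pandas as pd

# Hypothetical implicit-feedback log: one row per user-item event.
events = pd.DataFrame({
    'user_id': [0, 0, 1, 1, 2],
    'item_id': [10, 10, 10, 20, 20],
    'event':   ['view', 'purchase', 'view', 'add_to_cart', 'view'],
})

# Illustrative weights: stronger signals count more (values are assumptions).
event_weights = {'view': 1.0, 'add_to_cart': 3.0, 'purchase': 5.0}
events['weight'] = events['event'].map(event_weights)

# Aggregate into a per-pair interaction strength.
interactions = (
    events.groupby(['user_id', 'item_id'])['weight']
          .sum()
          .reset_index(name='strength')
)

# Binary alternative: interacted / did not interact.
interactions['binary'] = 1
print(interactions)
```

The `strength` column then plays the role that ratings play in the explicit-feedback code above.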
The User-Item Matrix
Regardless of feedback type, the core data structure is the user-item matrix. Rows are users, columns are items, and entries are ratings (explicit) or interaction counts (implicit). The matrix is almost always extremely sparse.
# Build the user-item matrix
from scipy.sparse import csr_matrix
# Pivot to dense matrix (for illustration; use sparse in production)
user_item_matrix = ratings_df.pivot_table(
index='user_id', columns='item_id', values='rating'
)
print(f"Matrix shape: {user_item_matrix.shape}")
print(f"Non-null entries: {user_item_matrix.notna().sum().sum()}")
print(f"Sparsity: {user_item_matrix.isna().sum().sum() / user_item_matrix.size:.4f}")
# For computation, fill NaN with 0 and convert to sparse
user_item_dense = user_item_matrix.fillna(0).values
user_item_sparse = csr_matrix(user_item_dense)
print(f"\nSparse matrix: {user_item_sparse.nnz} stored elements")
The sparsity is the defining challenge. With 99%+ of entries missing, every algorithm is essentially an exercise in intelligent extrapolation: given the tiny fraction of the matrix you can observe, predict the missing entries well enough to rank items for each user.
Part 2: Collaborative Filtering
Collaborative filtering is the oldest and most intuitive approach: recommend items that similar users liked (user-based CF) or items that are similar to items the user already liked (item-based CF). "Similar" is defined by the observed ratings or interactions.
User-Based Collaborative Filtering
The idea is simple. To predict whether user A will like item X:
- Find users who are similar to user A (based on their rating patterns)
- Check how those similar users rated item X
- Predict user A's rating as a weighted average of similar users' ratings on item X
Similarity is typically measured with cosine similarity or Pearson correlation, computed over the co-rated items.
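A toy example (numbers invented for illustration) of Pearson correlation restricted to co-rated items; the mean-centered cosine used in the listing below is a closely related quantity:

```python
import numpy as np

# Ratings of two users over a shared catalog; 0 means "not rated".
user_a = np.array([5, 3, 0, 4, 0])
user_b = np.array([4, 2, 5, 5, 1])

# Restrict to items both users have rated before correlating.
co_rated = (user_a != 0) & (user_b != 0)
pearson = np.corrcoef(user_a[co_rated], user_b[co_rated])[0, 1]
print(f"Pearson over {co_rated.sum()} co-rated items: {pearson:.3f}")
```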
from sklearn.metrics.pairwise import cosine_similarity
# Center ratings per user (mean-centering improves similarity)
user_means = np.where(
user_item_dense.sum(axis=1, keepdims=True) != 0,
user_item_dense.sum(axis=1, keepdims=True) /
np.maximum((user_item_dense != 0).sum(axis=1, keepdims=True), 1),
0
)
user_item_centered = np.where(user_item_dense != 0, user_item_dense - user_means, 0)
# Compute user-user cosine similarity
user_similarity = cosine_similarity(user_item_centered)
np.fill_diagonal(user_similarity, 0) # don't recommend based on self
print(f"User similarity matrix shape: {user_similarity.shape}")
print(f"Mean similarity: {user_similarity[user_similarity != 0].mean():.4f}")
print(f"Max similarity: {user_similarity.max():.4f}")
def predict_user_cf(user_id, item_id, k=20):
"""
Predict rating for a user-item pair using user-based CF.
Parameters
----------
user_id : int
Target user index.
item_id : int
Target item index.
k : int
Number of nearest neighbors to use.
Returns
-------
float
Predicted rating, or NaN if no neighbors rated the item.
"""
# Find users who rated this item
item_raters = np.where(user_item_dense[:, item_id] != 0)[0]
if len(item_raters) == 0:
return np.nan
# Get similarities to those users
sims = user_similarity[user_id, item_raters]
# Keep top-k most similar users with positive similarity
positive_mask = sims > 0
if positive_mask.sum() == 0:
return np.nan
item_raters = item_raters[positive_mask]
sims = sims[positive_mask]
if len(sims) > k:
top_k_idx = np.argsort(sims)[-k:]
item_raters = item_raters[top_k_idx]
sims = sims[top_k_idx]
# Weighted average of centered ratings + user's mean
centered_ratings = user_item_centered[item_raters, item_id]
prediction = user_means[user_id, 0] + np.dot(sims, centered_ratings) / np.sum(sims)
return np.clip(prediction, 1, 5)
# Test prediction
test_user, test_item = 0, 15
pred = predict_user_cf(test_user, test_item, k=20)
print(f"\nPredicted rating for user {test_user}, item {test_item}: {pred:.2f}")
Practical Note --- User-based CF has a fatal scaling problem. Computing user-user similarity is O(n_users^2), and user profiles change every time someone rates a new item. For a platform with millions of users, recomputing user similarity daily is expensive. This is why production systems almost always prefer item-based CF or matrix factorization.
Item-Based Collaborative Filtering
Item-based CF flips the approach: instead of finding similar users, find similar items. To predict whether user A will like item X, find items similar to X that user A has already rated, and weight their ratings by item similarity.
The key insight is that item-item similarity is more stable than user-user similarity. A movie's "personality" (who watches it) changes slowly. A user's preferences can shift overnight. Amazon's original recommender system was item-based CF, and the stability of item similarity was a major reason.
# Compute item-item cosine similarity (on centered ratings, transposed)
item_similarity = cosine_similarity(user_item_centered.T)
np.fill_diagonal(item_similarity, 0)
print(f"Item similarity matrix shape: {item_similarity.shape}")
print(f"Mean similarity: {item_similarity[item_similarity != 0].mean():.4f}")
def predict_item_cf(user_id, item_id, k=20):
"""
Predict rating for a user-item pair using item-based CF.
Parameters
----------
user_id : int
Target user index.
item_id : int
Target item index.
k : int
Number of nearest neighbor items to use.
Returns
-------
float
Predicted rating, or NaN if the user has no rated similar items.
"""
# Find items this user has rated
rated_items = np.where(user_item_dense[user_id] != 0)[0]
if len(rated_items) == 0:
return np.nan
# Get similarities between target item and rated items
sims = item_similarity[item_id, rated_items]
# Keep top-k with positive similarity
positive_mask = sims > 0
if positive_mask.sum() == 0:
return np.nan
rated_items = rated_items[positive_mask]
sims = sims[positive_mask]
if len(sims) > k:
top_k_idx = np.argsort(sims)[-k:]
rated_items = rated_items[top_k_idx]
sims = sims[top_k_idx]
# Weighted average of the user's ratings on similar items
user_ratings = user_item_dense[user_id, rated_items]
prediction = np.dot(sims, user_ratings) / np.sum(sims)
return np.clip(prediction, 1, 5)
pred_item = predict_item_cf(test_user, test_item, k=20)
print(f"Predicted rating for user {test_user}, item {test_item}: {pred_item:.2f}")
User-Based vs. Item-Based: When to Use Each
| Factor | User-Based CF | Item-Based CF |
|---|---|---|
| Similarity stability | Low (user tastes change) | High (item profiles stable) |
| Computational cost | O(n_users^2) | O(n_items^2) |
| Better when... | n_items >> n_users | n_users >> n_items (typical) |
| Cold start for new users | Cannot recommend (no history) | Cannot recommend (no history) |
| Cold start for new items | Can recommend once a few users rate it | Cannot recommend (no similarity data) |
| Interpretability | "Users like you also liked..." | "Because you liked X, you'll like Y..." |
In most production settings, n_users >> n_items, which makes item-based CF computationally cheaper and more stable. Amazon's original 2003 paper on item-based CF made exactly this argument.
Part 3: Content-Based Filtering
Collaborative filtering ignores what items are. It only cares about who interacted with what. Content-based filtering takes the opposite approach: it recommends items whose features are similar to items the user has liked before.
The advantage is that content-based filtering handles the cold start problem for new items. A new movie with known genres, cast, and plot keywords can be recommended immediately based on its features --- no one needs to have watched it first.
The disadvantage is the filter bubble: content-based filtering only recommends more of the same. If you have watched ten action movies, it recommends action movie number eleven. It cannot discover that you might enjoy a documentary.
TF-IDF Feature Vectors
For items described by text (product descriptions, movie plots, article content), TF-IDF provides a natural feature representation. Each item becomes a vector in term space, and similarity between items is cosine similarity between their TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
# Simulate item catalog with text descriptions
item_descriptions = {
0: "action thriller with car chases and explosions in a futuristic city",
1: "romantic comedy set in New York with a love triangle",
2: "science fiction epic about space exploration and alien contact",
3: "dark thriller about a detective solving serial murders",
4: "romantic drama about long distance relationships and sacrifice",
5: "action science fiction with robots and artificial intelligence",
6: "comedy about family reunions and holiday disasters",
7: "documentary about climate change and ocean conservation",
8: "horror thriller set in an abandoned hospital",
9: "animated comedy for families with talking animals",
10: "science fiction thriller about time travel paradoxes",
11: "romantic comedy about dating apps and modern relationships",
12: "action adventure with treasure hunting and ancient ruins",
13: "documentary about the history of jazz music",
14: "drama about a family dealing with addiction and recovery",
}
descriptions_list = [item_descriptions[i] for i in range(len(item_descriptions))]
# Build TF-IDF vectors
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(descriptions_list)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(tfidf.vocabulary_)}")
# Compute item-item content similarity
content_similarity = cos_sim(tfidf_matrix)
np.fill_diagonal(content_similarity, 0)
# Show most similar items for item 0 (action thriller)
item_0_sims = content_similarity[0]
top_similar = np.argsort(item_0_sims)[::-1][:5]
print(f"\nMost similar items to item 0 ('{descriptions_list[0][:50]}...'):")
for idx in top_similar:
print(f" Item {idx}: sim={item_0_sims[idx]:.3f} - {descriptions_list[idx][:60]}")
Building a Content-Based Recommender
The content-based recommender builds a user profile from the features of items the user has liked, then recommends items whose features best match that profile.
def build_user_profile(user_id, ratings_df, tfidf_matrix, min_rating=3.5):
"""
Build a user profile by averaging TF-IDF vectors of highly-rated items.
Parameters
----------
user_id : int
Target user.
ratings_df : DataFrame
User-item ratings.
tfidf_matrix : sparse matrix
TF-IDF features per item.
min_rating : float
Minimum rating to consider as "liked."
Returns
-------
ndarray
User profile vector (averaged TF-IDF of liked items).
"""
user_ratings = ratings_df[ratings_df['user_id'] == user_id]
liked_items = user_ratings[user_ratings['rating'] >= min_rating]['item_id'].values
# Filter to items that have descriptions
liked_items = liked_items[liked_items < tfidf_matrix.shape[0]]
if len(liked_items) == 0:
return np.zeros(tfidf_matrix.shape[1])
# Average the TF-IDF vectors of liked items
profile = tfidf_matrix[liked_items].mean(axis=0)
return np.asarray(profile).flatten()
def recommend_content_based(user_id, ratings_df, tfidf_matrix, n=5):
"""
Recommend items using content-based filtering.
Parameters
----------
user_id : int
Target user.
ratings_df : DataFrame
User-item ratings.
tfidf_matrix : sparse matrix
TF-IDF features per item.
n : int
Number of recommendations.
Returns
-------
list of int
Recommended item IDs.
"""
profile = build_user_profile(user_id, ratings_df, tfidf_matrix)
if np.sum(np.abs(profile)) == 0:
return []
# Score every item against user profile
scores = cos_sim(profile.reshape(1, -1), tfidf_matrix).flatten()
# Exclude items user has already rated
rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'].values)
for item_id in rated_items:
if item_id < len(scores):
scores[item_id] = -1
# Return top-n
return list(np.argsort(scores)[::-1][:n])
# Test: recommend for a user
recs = recommend_content_based(0, ratings_df, tfidf_matrix, n=5)
print(f"Content-based recommendations for user 0:")
for item_id in recs:
print(f" Item {item_id}: {descriptions_list[item_id][:70]}")
Practical Note --- TF-IDF is the simplest content representation. In production, you would use pre-trained embeddings (sentence-transformers, CLIP for images, or domain-specific embeddings) for richer feature representations. The framework is identical: represent items as vectors, compute similarity, rank by profile match.
Part 4: Matrix Factorization (SVD)
Collaborative filtering with nearest neighbors has two problems at scale: the user-item matrix is too sparse for reliable similarity, and computing pairwise similarity is expensive. Matrix factorization addresses both by compressing the user-item matrix into a low-rank approximation.
The intuition: if a user-item matrix has 500 users and 200 items, it has 100,000 entries (mostly missing). But the underlying preferences might be driven by only 10-20 latent factors --- "likes action," "prefers long movies," "watches with family." Matrix factorization discovers these latent factors automatically.
SVD for Recommender Systems
In the recommender context, SVD (Singular Value Decomposition) decomposes the user-item matrix R into:
$$R \approx U \cdot \Sigma \cdot V^T$$
Where U is the user-factor matrix, V is the item-factor matrix, and the factors represent latent preferences. The prediction for user u on item i is the dot product of their latent vectors.
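The low-rank idea can be sketched directly with scipy's truncated SVD before reaching for a dedicated library. This is a rough illustration on a small synthetic matrix; real recommenders fit only the observed entries, as the SVD model below does, rather than factoring a zero-filled matrix:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# A small "ratings" matrix generated from 3 latent factors plus noise.
true_u = rng.normal(size=(30, 3))
true_v = rng.normal(size=(20, 3))
R = true_u @ true_v.T + rng.normal(scale=0.1, size=(30, 20))

# Rank-3 truncated SVD: R ~ U @ diag(s) @ Vt.
U, s, Vt = svds(R, k=3)
R_hat = U @ np.diag(s) @ Vt

# Split the singular values so each prediction is a plain dot product
# of a user latent vector and an item latent vector.
user_vecs = U @ np.diag(np.sqrt(s))
item_vecs = Vt.T @ np.diag(np.sqrt(s))
err = np.abs(R - R_hat).mean()
print(f"Mean reconstruction error with 3 factors: {err:.3f}")
```

Three factors recover the matrix almost exactly, which is the compression argument from the intuition above: 100,000 entries explained by a handful of latent dimensions.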
The surprise library (installed via `pip install scikit-surprise`) provides optimized implementations specifically designed for recommender systems:
from surprise import Dataset, Reader, SVD, KNNBasic, KNNWithMeans
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy
# Prepare data for surprise
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(
ratings_df[['user_id', 'item_id', 'rating']], reader
)
# Train SVD model
svd_model = SVD(
n_factors=50, # number of latent factors
n_epochs=20, # training iterations
lr_all=0.005, # learning rate
reg_all=0.02, # regularization
random_state=42
)
# Cross-validate
cv_results = cross_validate(svd_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(f"\nSVD Cross-Validation Results:")
print(f" Mean RMSE: {cv_results['test_rmse'].mean():.4f} (+/- {cv_results['test_rmse'].std():.4f})")
print(f" Mean MAE: {cv_results['test_mae'].mean():.4f} (+/- {cv_results['test_mae'].std():.4f})")
Core Principle --- RMSE tells you how well you predict ratings. It does not tell you how well you rank items. A model with lower RMSE can produce worse top-N recommendations than a model with higher RMSE. Always evaluate with ranking metrics when the task is "recommend the best items," not "predict the exact rating."
Training and Extracting Latent Factors
# Train on full dataset to inspect latent factors
trainset = data.build_full_trainset()
svd_model.fit(trainset)
# User and item factor matrices
user_factors_svd = svd_model.pu # shape: (n_users, n_factors)
item_factors_svd = svd_model.qi # shape: (n_items, n_factors)
print(f"User factors shape: {user_factors_svd.shape}")
print(f"Item factors shape: {item_factors_svd.shape}")
# Predict a specific rating
prediction = svd_model.predict(uid=0, iid=15)
print(f"\nSVD prediction for user 0, item 15: {prediction.est:.2f}")
# Generate top-N recommendations for a user
def recommend_svd(model, user_id, ratings_df, n=10):
"""
Generate top-N recommendations using a trained SVD model.
Parameters
----------
model : surprise.SVD
Trained SVD model.
user_id : int
Target user.
ratings_df : DataFrame
Observed ratings (to exclude already-rated items).
n : int
Number of recommendations.
Returns
-------
list of tuples
(item_id, predicted_rating) sorted by predicted rating descending.
"""
rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'])
all_items = set(ratings_df['item_id'].unique())
unrated_items = all_items - rated_items
predictions = []
for item_id in unrated_items:
pred = model.predict(uid=user_id, iid=item_id)
predictions.append((item_id, pred.est))
predictions.sort(key=lambda x: x[1], reverse=True)
return predictions[:n]
top_recs = recommend_svd(svd_model, user_id=0, ratings_df=ratings_df, n=10)
print(f"\nTop-10 SVD recommendations for user 0:")
for item_id, score in top_recs:
print(f" Item {item_id}: predicted rating {score:.2f}")
Comparing CF Methods with Surprise
from surprise import KNNBasic, KNNWithMeans, NMF
# Define models to compare
models = {
'User-KNN (cosine)': KNNBasic(
k=20, sim_options={'name': 'cosine', 'user_based': True},
verbose=False
),
'Item-KNN (cosine)': KNNBasic(
k=20, sim_options={'name': 'cosine', 'user_based': False},
verbose=False
),
'User-KNN (Pearson, mean-centered)': KNNWithMeans(
k=20, sim_options={'name': 'pearson', 'user_based': True},
verbose=False
),
'SVD (50 factors)': SVD(n_factors=50, random_state=42),
'NMF (50 factors)': NMF(n_factors=50, random_state=42),
}
print("Model Comparison (5-fold CV):")
print(f"{'Model':<40} {'RMSE':>8} {'MAE':>8}")
print("-" * 58)
for name, model in models.items():
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
print(f"{name:<40} {results['test_rmse'].mean():>8.4f} {results['test_mae'].mean():>8.4f}")
Practical Note --- SVD and NMF almost always beat neighborhood methods on RMSE. But RMSE is not the metric that matters. The next section covers the metrics that do.
Part 5: Evaluation --- Ranking Metrics That Actually Matter
This is where most textbook treatments fail. They evaluate recommenders with RMSE, which measures rating prediction accuracy. In production, the question is not "how accurately can you predict a 4.3-star rating?" but "did the user engage with the items you put at the top of the list?" That is a ranking quality problem, and it requires ranking metrics.
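A two-item toy example (invented numbers) makes the gap concrete: model A has the lower RMSE yet ranks the items backwards, while model B's inflated score on the loved item gets the order right:

```python
import numpy as np

true = np.array([5.0, 1.0])       # item 0 is the one the user loves
pred_a = np.array([3.9, 4.0])     # close on average, wrong order
pred_b = np.array([9.0, 1.0])     # wildly off on item 0, right order

def rmse(pred):
    return np.sqrt(np.mean((true - pred) ** 2))

print(f"RMSE A={rmse(pred_a):.2f}, B={rmse(pred_b):.2f}")
print(f"A top pick: item {pred_a.argmax()}, B top pick: item {pred_b.argmax()}")
```

If you deploy by RMSE you ship model A; if you deploy by what the user sees first, you ship model B.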
Train-Test Split for Ranking Evaluation
The standard protocol: for each user, hold out a fraction of their interactions as the test set. Train on the remaining interactions. Generate a ranked list of recommendations. Check whether the held-out items appear in the top-N of the ranked list.
from collections import defaultdict
def train_test_split_by_user(ratings_df, test_fraction=0.2, random_state=42):
"""
Split ratings so each user has some ratings in train and some in test.
Parameters
----------
ratings_df : DataFrame
Columns: user_id, item_id, rating.
test_fraction : float
Fraction of each user's ratings to hold out.
random_state : int
Random seed.
Returns
-------
tuple of DataFrames
(train_df, test_df)
"""
rng = np.random.RandomState(random_state)
train_rows = []
test_rows = []
for user_id, group in ratings_df.groupby('user_id'):
n_test = max(1, int(len(group) * test_fraction))
test_idx = rng.choice(group.index, size=n_test, replace=False)
train_rows.append(group.drop(test_idx))
test_rows.append(group.loc[test_idx])
return pd.concat(train_rows), pd.concat(test_rows)
train_df, test_df = train_test_split_by_user(ratings_df, test_fraction=0.2)
print(f"Train: {len(train_df)} ratings, Test: {len(test_df)} ratings")
print(f"Users in train: {train_df['user_id'].nunique()}")
print(f"Users in test: {test_df['user_id'].nunique()}")
Hit Rate
The simplest ranking metric: did at least one of the held-out items appear in the top-N recommendations?
def hit_rate_at_k(recommendations, test_items, k=10):
"""
Compute Hit Rate@K: fraction of users where at least one test item
appears in the top-K recommendations.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Number of recommendations to consider.
Returns
-------
float
Hit rate between 0 and 1.
"""
hits = 0
total = 0
for user_id in test_items:
if user_id not in recommendations:
continue
top_k = set(recommendations[user_id][:k])
if top_k & test_items[user_id]: # set intersection
hits += 1
total += 1
return hits / total if total > 0 else 0.0
NDCG (Normalized Discounted Cumulative Gain)
NDCG measures not just whether relevant items appear, but where they appear. An item at position 1 is more valuable than the same item at position 10.
def dcg_at_k(relevant_items, ranked_list, k=10):
"""
Compute DCG@K for a single user.
Parameters
----------
relevant_items : set
Items the user actually interacted with.
ranked_list : list
Ranked recommendations.
k : int
Cutoff.
Returns
-------
float
DCG score.
"""
dcg = 0.0
for i, item in enumerate(ranked_list[:k]):
if item in relevant_items:
dcg += 1.0 / np.log2(i + 2) # i+2 because log2(1)=0
return dcg
def ndcg_at_k(recommendations, test_items, k=10):
"""
Compute mean NDCG@K across all users.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Cutoff.
Returns
-------
float
Mean NDCG.
"""
ndcg_scores = []
for user_id in test_items:
if user_id not in recommendations:
continue
relevant = test_items[user_id]
ranked = recommendations[user_id]
# Actual DCG
dcg = dcg_at_k(relevant, ranked, k)
# Ideal DCG (all relevant items at the top)
ideal_list = list(relevant)[:k]
idcg = dcg_at_k(relevant, ideal_list, k)
if idcg > 0:
ndcg_scores.append(dcg / idcg)
else:
ndcg_scores.append(0.0)
return np.mean(ndcg_scores) if ndcg_scores else 0.0
Mean Average Precision (MAP)
MAP measures precision at every position where a relevant item appears, then averages.
def average_precision_at_k(relevant_items, ranked_list, k=10):
"""
Compute Average Precision@K for a single user.
Parameters
----------
relevant_items : set
Items the user actually interacted with.
ranked_list : list
Ranked recommendations.
k : int
Cutoff.
Returns
-------
float
Average precision.
"""
hits = 0
sum_precision = 0.0
for i, item in enumerate(ranked_list[:k]):
if item in relevant_items:
hits += 1
sum_precision += hits / (i + 1)
n_relevant = min(len(relevant_items), k)
return sum_precision / n_relevant if n_relevant > 0 else 0.0
def mean_average_precision(recommendations, test_items, k=10):
"""
Compute MAP@K across all users.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Cutoff.
Returns
-------
float
MAP score.
"""
ap_scores = []
for user_id in test_items:
if user_id not in recommendations:
continue
ap = average_precision_at_k(
test_items[user_id], recommendations[user_id], k
)
ap_scores.append(ap)
return np.mean(ap_scores) if ap_scores else 0.0
Full Evaluation Pipeline
# Train SVD on train set
reader = Reader(rating_scale=(1, 5))
train_data = Dataset.load_from_df(
train_df[['user_id', 'item_id', 'rating']], reader
)
trainset = train_data.build_full_trainset()
svd_eval = SVD(n_factors=50, n_epochs=20, random_state=42)
svd_eval.fit(trainset)
# Generate recommendations for all test users
all_items = set(ratings_df['item_id'].unique())
test_items_dict = test_df.groupby('user_id')['item_id'].apply(set).to_dict()
train_items_dict = train_df.groupby('user_id')['item_id'].apply(set).to_dict()
recommendations = {}
for user_id in test_items_dict:
if user_id not in train_items_dict:
continue
# Score all unseen items
seen = train_items_dict[user_id]
unseen = all_items - seen
preds = []
for item_id in unseen:
pred = svd_eval.predict(uid=user_id, iid=item_id)
preds.append((item_id, pred.est))
preds.sort(key=lambda x: x[1], reverse=True)
recommendations[user_id] = [item_id for item_id, _ in preds]
# Evaluate
for k in [5, 10, 20]:
hr = hit_rate_at_k(recommendations, test_items_dict, k=k)
ndcg = ndcg_at_k(recommendations, test_items_dict, k=k)
map_score = mean_average_precision(recommendations, test_items_dict, k=k)
print(f"@{k:>2}: Hit Rate={hr:.4f} NDCG={ndcg:.4f} MAP={map_score:.4f}")
Core Principle --- Always report ranking metrics at multiple cutoffs (5, 10, 20). A model that looks great at @20 might be mediocre at @5, meaning it finds relevant items but buries them in the list. In production, users see 5-10 recommendations. Performance at @5 is what matters for the user experience.
Part 6: The Cold Start Problem
The cold start problem is the most important practical challenge in recommender systems, and it comes in two forms.
New user cold start: A user just signed up and has no interaction history. Collaborative filtering cannot work --- there are no ratings to compute similarity from. Content-based filtering cannot work either --- there is no profile to match against.
New item cold start: A new item was just added to the catalog. No one has interacted with it yet. Collaborative filtering cannot recommend it because there is no interaction data. Content-based filtering can recommend it if the item has content features.
# Demonstrate cold start quantitatively
user_rating_counts = ratings_df.groupby('user_id').size()
item_rating_counts = ratings_df.groupby('item_id').size()
print("User rating counts:")
print(user_rating_counts.describe())
print(f"\nUsers with < 5 ratings: {(user_rating_counts < 5).sum()} "
f"({(user_rating_counts < 5).mean():.1%})")
print(f"Users with < 3 ratings: {(user_rating_counts < 3).sum()} "
f"({(user_rating_counts < 3).mean():.1%})")
print("\nItem rating counts:")
print(item_rating_counts.describe())
print(f"\nItems with < 5 ratings: {(item_rating_counts < 5).sum()} "
f"({(item_rating_counts < 5).mean():.1%})")
Cold Start Mitigation Strategies
| Strategy | Cold Start Type | Approach |
|---|---|---|
| Popularity baseline | Both | Recommend most popular items to new users; let new items ride on category popularity |
| Content-based fallback | New items | Use item features to recommend even without interaction data |
| Onboarding questionnaire | New users | Ask the user to rate 5-10 items at signup |
| Demographic filtering | New users | Use age, location, or other profile data to bootstrap preferences |
| Hybrid model | Both | Blend CF (for users with history) and content (for cold start cases) |
def popularity_baseline(ratings_df, n=10):
    """
    Return the n most popular items (by number of ratings) as a baseline.
    Items that many users rate highly are safe default recommendations.
    """
    popularity = (
        ratings_df.groupby('item_id')
        .agg(n_ratings=('rating', 'count'), mean_rating=('rating', 'mean'))
    )
    # Bayesian average shrinks small-sample items toward the global mean
    C = popularity['n_ratings'].mean()
    m = popularity['mean_rating'].mean()
    popularity['bayesian_avg'] = (
        (popularity['n_ratings'] * popularity['mean_rating'] + C * m) /
        (popularity['n_ratings'] + C)
    )
    return popularity.nlargest(n, 'bayesian_avg').index.tolist()
popular_items = popularity_baseline(ratings_df, n=10)
print(f"Popularity baseline (top 10): {popular_items}")
Practical Note --- Every production recommender has a cold start fallback. The most common pattern is a cascade: try the personalized model first; if the user has fewer than N interactions, fall back to content-based; if the item has no content features, fall back to popularity. The fallback is never as good as the personalized model, but it is infinitely better than showing nothing.
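Part 7's switching hybrid implements the user-level switch; the item-feature tier of the cascade described in the note can be sketched as a thin dispatch layer on top of it. This is only a sketch: `personalized_fn`, `content_fn`, and `popular_items` are hypothetical stand-ins for the models built in this chapter, and the threshold is illustrative.

```python
def cascade_recommend(user_id, n_interactions, has_content_features,
                      personalized_fn, content_fn, popular_items,
                      min_interactions=5, n=10):
    """Cold start cascade: personalized -> content-based -> popularity."""
    if n_interactions >= min_interactions:
        # Enough history to trust the personalized model
        return personalized_fn(user_id, n), "personalized"
    if has_content_features:
        # Thin history, but content features are available
        return content_fn(user_id, n), "content-based"
    # Coldest case: fall back to globally popular items
    return popular_items[:n], "popularity"

# Toy wiring with stand-in recommenders for each tier
recs, method = cascade_recommend(
    user_id=7, n_interactions=0, has_content_features=False,
    personalized_fn=lambda u, n: ["personalized"] * n,
    content_fn=lambda u, n: ["content"] * n,
    popular_items=[3, 7, 1, 9], n=2
)
print(method, recs)  # falls through to the popularity tier
```

In production the same dispatch usually lives in the serving layer rather than the model code, so each tier can be monitored and rolled back independently.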
Part 7: Building a Hybrid Recommender
Hybrid recommenders combine multiple signals. The two most common architectures are weighted blending (average the scores from multiple models) and switching (use different models for different situations, e.g., content-based for new users, CF for established users).
Weighted Hybrid
def hybrid_recommend(user_id, ratings_df, svd_model, tfidf_matrix,
                     content_weight=0.3, cf_weight=0.7, n=10):
    """
    Hybrid recommender that blends SVD collaborative filtering scores
    with content-based similarity scores.

    Parameters
    ----------
    user_id : int
        Target user.
    ratings_df : DataFrame
        Observed ratings.
    svd_model : surprise.SVD
        Trained SVD model.
    tfidf_matrix : sparse matrix
        TF-IDF features per item.
    content_weight : float
        Weight for the content-based score.
    cf_weight : float
        Weight for the collaborative filtering score.
    n : int
        Number of recommendations.

    Returns
    -------
    list of tuples
        (item_id, blended_score) sorted descending.
    """
    rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'])
    all_items = set(ratings_df['item_id'].unique())
    candidates = all_items - rated_items

    # CF scores from SVD, normalized from the 1-5 rating scale to 0-1
    cf_scores = {}
    for item_id in candidates:
        pred = svd_model.predict(uid=user_id, iid=item_id)
        cf_scores[item_id] = (pred.est - 1) / 4

    # Content scores: cosine similarity between the user profile and each item
    profile = build_user_profile(user_id, ratings_df, tfidf_matrix)
    content_scores = {}
    if np.sum(np.abs(profile)) > 0:
        all_content_scores = cos_sim(
            profile.reshape(1, -1), tfidf_matrix
        ).flatten()
        for item_id in candidates:
            if item_id < len(all_content_scores):
                content_scores[item_id] = float(all_content_scores[item_id])
            else:
                content_scores[item_id] = 0.0

    # Blend the two signals with the configured weights
    blended = []
    for item_id in candidates:
        cf = cf_scores.get(item_id, 0.5)
        cb = content_scores.get(item_id, 0.0)
        score = cf_weight * cf + content_weight * cb
        blended.append((item_id, score))
    blended.sort(key=lambda x: x[1], reverse=True)
    return blended[:n]
# Test hybrid recommender
hybrid_recs = hybrid_recommend(
user_id=0, ratings_df=ratings_df,
svd_model=svd_model, tfidf_matrix=tfidf_matrix,
content_weight=0.3, cf_weight=0.7, n=10
)
print("Hybrid recommendations for user 0:")
for item_id, score in hybrid_recs:
    desc = descriptions_list[item_id] if item_id < len(descriptions_list) else "N/A"
    print(f"  Item {item_id}: score={score:.3f} - {desc[:60]}")
Switching Hybrid (Cold Start Aware)
def switching_hybrid(user_id, ratings_df, svd_model, tfidf_matrix,
                     min_ratings_for_cf=5, n=10):
    """
    Switching hybrid: use CF for established users,
    content-based for new users, popularity for the coldest cases.

    Parameters
    ----------
    user_id : int
        Target user.
    ratings_df : DataFrame
        Observed ratings.
    svd_model : surprise.SVD
        Trained SVD model.
    tfidf_matrix : sparse matrix
        TF-IDF features per item.
    min_ratings_for_cf : int
        Minimum ratings needed to trust CF predictions.
    n : int
        Number of recommendations.

    Returns
    -------
    tuple
        (recommendations, method_used)
    """
    user_ratings = ratings_df[ratings_df['user_id'] == user_id]
    n_ratings = len(user_ratings)
    if n_ratings >= min_ratings_for_cf:
        # Established user: use hybrid CF + content
        recs = hybrid_recommend(
            user_id, ratings_df, svd_model, tfidf_matrix,
            content_weight=0.2, cf_weight=0.8, n=n
        )
        return [r[0] for r in recs], "hybrid (CF-dominant)"
    elif n_ratings > 0:
        # Some history: rely on content-based
        recs = recommend_content_based(
            user_id, ratings_df, tfidf_matrix, n=n
        )
        return recs, "content-based"
    else:
        # New user: popularity baseline
        recs = popularity_baseline(ratings_df, n=n)
        return recs, "popularity baseline"
# Demonstrate switching behavior
for uid in [0, 450, 999]:
    recs, method = switching_hybrid(
        uid, ratings_df, svd_model, tfidf_matrix
    )
    n_user_ratings = len(ratings_df[ratings_df['user_id'] == uid])
    print(f"User {uid} ({n_user_ratings} ratings): {method} -> {recs[:5]}")
Core Principle --- The switching hybrid is the production pattern. Pure CF, pure content-based, and pure popularity each have a regime where they are the best option. The switching hybrid uses the right tool for each user's situation. This is not an algorithmic sophistication --- it is an engineering pattern for handling the real-world messiness of uneven user engagement.
Part 8: Implicit Feedback and Beyond
Most production recommender systems run on implicit feedback --- clicks, views, purchases, time spent --- not explicit ratings. Implicit feedback requires different algorithms because the signal is fundamentally different: a missing entry means "we don't know," not "the user would rate this low."
Alternating Least Squares (ALS) for Implicit Feedback
The standard approach for implicit feedback is ALS (Alternating Least Squares), which treats the user-item matrix as binary (interacted vs. not interacted) with confidence weights.
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
# Simulate implicit feedback: binary interactions with confidence
np.random.seed(42)
n_users_impl = 1000
n_items_impl = 300
n_interactions = 15000
impl_user_ids = np.random.randint(0, n_users_impl, size=n_interactions)
impl_item_ids = np.random.randint(0, n_items_impl, size=n_interactions)
impl_confidence = np.random.exponential(scale=2.0, size=n_interactions) + 1
impl_df = pd.DataFrame({
    'user_id': impl_user_ids,
    'item_id': impl_item_ids,
    'confidence': impl_confidence
}).groupby(['user_id', 'item_id']).agg(
    confidence=('confidence', 'sum'),
    interactions=('confidence', 'count')
).reset_index()
print(f"Implicit feedback: {len(impl_df)} user-item pairs")
print(f"Users: {impl_df['user_id'].nunique()}, Items: {impl_df['item_id'].nunique()}")
print(f"\nConfidence distribution:")
print(impl_df['confidence'].describe())
# Build sparse implicit matrix
implicit_matrix = csr_matrix(
(impl_df['confidence'].values, (impl_df['user_id'].values, impl_df['item_id'].values)),
shape=(n_users_impl, n_items_impl)
)
# Use TruncatedSVD as a simple matrix factorization baseline
svd_implicit = TruncatedSVD(n_components=30, random_state=42)
user_factors_impl = svd_implicit.fit_transform(implicit_matrix)
item_factors_impl = svd_implicit.components_.T
print(f"\nUser factors shape: {user_factors_impl.shape}")
print(f"Item factors shape: {item_factors_impl.shape}")
print(f"Explained variance ratio (first 5): {svd_implicit.explained_variance_ratio_[:5]}")
# Recommend for a user via dot product
def recommend_implicit(user_id, user_factors, item_factors, seen_items, n=10):
    """
    Recommend items for a user based on the implicit factor model.
    """
    scores = user_factors[user_id] @ item_factors.T
    # Exclude items the user has already interacted with
    for item_id in seen_items:
        scores[item_id] = -np.inf
    top_n = np.argsort(scores)[::-1][:n]
    return [(int(item_id), float(scores[item_id])) for item_id in top_n]
# Test
seen = set(impl_df[impl_df['user_id'] == 0]['item_id'].values)
recs = recommend_implicit(0, user_factors_impl, item_factors_impl, seen, n=10)
print(f"\nImplicit-feedback recommendations for user 0:")
for item_id, score in recs:
    print(f"  Item {item_id}: score={score:.4f}")
Practical Note --- For production-grade implicit feedback recommenders, use the `implicit` library (https://github.com/benfred/implicit), which implements ALS, BPR (Bayesian Personalized Ranking), and logistic matrix factorization with GPU acceleration. The `surprise` library is designed for explicit feedback and does not natively handle implicit data.
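The TruncatedSVD baseline above ignores the confidence weighting that the ALS formulation is built around. For intuition, here is a minimal dense NumPy sketch of the update from Hu, Koren, and Volinsky's implicit ALS paper (ICDM 2008). The name `implicit_als` is ours, not an API of the `implicit` library, and this version is toy-scale only: it materializes the confidence matrix densely and solves one small linear system per user and per item.

```python
import numpy as np

def implicit_als(R, factors=8, alpha=40.0, reg=0.1, iterations=10, seed=0):
    """Minimal dense ALS for implicit feedback.

    R : (n_users, n_items) array of raw interaction counts.
    Returns user factors X and item factors Y.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    C = 1.0 + alpha * R               # confidence: observed entries weigh more
    P = (R > 0).astype(float)         # binary preference target
    X = rng.normal(scale=0.01, size=(n_users, factors))
    Y = rng.normal(scale=0.01, size=(n_items, factors))
    I = reg * np.eye(factors)
    for _ in range(iterations):
        # Solve for user factors with item factors fixed, then vice versa
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + I, Y.T @ Cu @ P[u])
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + I, X.T @ Ci @ P[:, i])
    return X, Y

# Toy demo: factorize a small random interaction matrix
rng = np.random.default_rng(1)
R = (rng.random((8, 6)) > 0.6) * rng.integers(1, 5, size=(8, 6))
X, Y = implicit_als(R, factors=4, iterations=15)
scores = X @ Y.T  # observed pairs should score higher than unobserved ones
```

The `implicit` library computes the same updates with sparse algebra and the (Y^T Y + Y^T (Cu - I) Y) trick, which is what makes ALS tractable on millions of users.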
Part 9: Putting It All Together --- A Recommender System Design Checklist
Building a recommender in a notebook is one thing. Putting it into production is another. Here is the practitioner's checklist:
1. Define the objective. "Recommend stuff" is not an objective. "Increase click-through rate on the homepage carousel from 3% to 5%" is. "Reduce subscriber churn by recommending content that increases weekly engagement" is. The objective determines the evaluation metric.
2. Characterize your data.
   - Explicit or implicit feedback?
   - How sparse is the user-item matrix?
   - How severe is the cold start problem (what fraction of users have fewer than 5 interactions)?
   - Do items have content features (text, categories, images)?
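These questions take only a few lines to answer. A sketch, assuming an interaction log with one row per observed (user, item) pair; the helper name `characterize` is ours:

```python
import pandas as pd

def characterize(ratings_df, cold_threshold=5):
    """Summarize sparsity and cold start severity of an interaction log."""
    n_users = ratings_df['user_id'].nunique()
    n_items = ratings_df['item_id'].nunique()
    density = len(ratings_df) / (n_users * n_items)
    per_user = ratings_df.groupby('user_id').size()
    return {
        'n_users': n_users,
        'n_items': n_items,
        'sparsity': 1.0 - density,
        'cold_user_fraction': float((per_user < cold_threshold).mean()),
    }

# Toy example: 2 users, 5 items, 6 observed interactions
toy = pd.DataFrame({'user_id': [0, 0, 0, 0, 0, 1],
                    'item_id': [0, 1, 2, 3, 4, 0]})
print(characterize(toy))
```

Production matrices routinely come out above 99% sparse, which is the single number that most constrains your choice of algorithm.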
3. Establish baselines. Before building anything sophisticated, measure the performance of:
   - Random recommendations
   - Popularity baseline (most popular items)
   - Simple item-based CF
If your fancy model does not beat popularity by a meaningful margin, you do not need a fancy model.
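A minimal harness for that comparison, using leave-one-out hit rate. The recommenders below are toy stand-ins constructed so that popularity wins by design; in practice you would plug in the chapter's actual models.

```python
import numpy as np

def hit_rate(recommend_fn, test_items_by_user, n=10):
    """Fraction of users whose held-out item appears in the top-n list."""
    hits = sum(
        1 for user, held_out in test_items_by_user.items()
        if held_out in recommend_fn(user, n)
    )
    return hits / len(test_items_by_user)

# Toy setup: 100-item catalog, every user's held-out item is item 0
rng = np.random.default_rng(0)
catalog = list(range(100))
test_items = {u: 0 for u in range(50)}
random_rec = lambda u, n: list(rng.choice(catalog, size=n, replace=False))
popular_rec = lambda u, n: list(range(n))  # pretend 0..n-1 are most popular
print(hit_rate(random_rec, test_items), hit_rate(popular_rec, test_items))
```

The same `hit_rate` function accepts any `recommend_fn(user, n)` callable, so random, popularity, item-based CF, and the hybrid can all be scored on identical held-out data.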
4. Choose your approach.
| Situation | Recommended Approach |
|---|---|
| Dense explicit ratings | SVD or NMF |
| Sparse explicit ratings | SVD with regularization |
| Implicit feedback | ALS or BPR |
| Rich item content, sparse interactions | Content-based or hybrid |
| Severe cold start | Switching hybrid with popularity fallback |
| Need real-time updates | Online learning or approximate nearest neighbors |
5. Evaluate with ranking metrics. Hit Rate, NDCG, and MAP at multiple cutoffs (5, 10, 20). Report all three. A model that improves NDCG@10 by 0.05 over the baseline is a material improvement.
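Reporting a metric at several cutoffs is straightforward once it is written as a function of the cutoff. A compact binary-relevance NDCG sketch (assuming the input is a 0/1 relevance vector in ranked order, with all of the user's relevant items counted within that vector):

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k):
    """Binary-relevance NDCG@k for one user's ranked list."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    # DCG: relevance discounted by log2 of the (1-based) position + 1
    dcg = (rel / np.log2(np.arange(2, rel.size + 2))).sum()
    # Ideal DCG: all relevant items packed at the top of the list
    n_rel = int(np.asarray(ranked_relevance, dtype=float).sum())
    ideal = np.ones(min(k, n_rel)) if n_rel else np.zeros(1)
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Report one ranked list at the three standard cutoffs
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
for k in (5, 10, 20):
    print(k, round(ndcg_at_k(ranked, k), 4))
```

Averaging `ndcg_at_k` over all test users at each cutoff gives the numbers to put in the report.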
6. A/B test before full deployment. Offline ranking metrics are necessary but not sufficient. A model with better NDCG offline can still perform worse in production due to position bias, presentation effects, or feedback loops. A/B test with a business metric (CTR, revenue per session, churn rate) as the primary outcome.
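For the simple CTR comparison, the significance check needs nothing beyond the standard library. A sketch of the pooled two-proportion z-test; the click and impression counts below are illustrative, not from a real experiment:

```python
from math import sqrt, erf

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for a difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control CTR 3.0% vs variant CTR 3.5%, 20k impressions each arm
z, p = two_proportion_z(600, 20_000, 700, 20_000)
print(f"z={z:.2f}, p={p:.4f}")
```

This only covers the per-impression CTR case; session-level or revenue metrics have correlated observations and usually call for bootstrap or cluster-robust methods instead.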
Core Principle --- A recommender system is not a model. It is a product feature. The model is one component; the others are the candidate generation pipeline, the ranking layer, the filtering rules (e.g., "do not recommend items the user already purchased"), the presentation logic, and the feedback loop. Getting the model right is necessary. It is not sufficient.
Summary
This chapter covered the three paradigms of recommender systems --- collaborative filtering, content-based filtering, and hybrid approaches --- along with the evaluation framework that separates textbook exercises from production systems.
Collaborative filtering (user-based and item-based) exploits the structure of the user-item interaction matrix. Matrix factorization (SVD) compresses that structure into latent factors that generalize better to sparse data. Content-based filtering uses item features to handle the cold start problem. Hybrid recommenders combine all three signals, switching strategies based on how much data is available for each user.
The critical evaluation insight: use ranking metrics (NDCG, MAP, Hit Rate), not RMSE. A recommender's job is to surface the right items at the top of the list, not to predict exact ratings. Evaluate at multiple cutoffs. A/B test before you ship.
The two case studies that follow apply these concepts to ShopSmart (product recommendations) and StreamFlow (content recommendations for churn reduction). Both demonstrate the full pipeline: data preparation, model training, offline evaluation with ranking metrics, cold start handling, and business impact estimation.
Key Terms
| Term | Definition |
|---|---|
| Recommender system | A system that predicts which items a user will prefer, typically by ranking items |
| Collaborative filtering | Recommendation based on patterns in user-item interactions across all users |
| Content-based filtering | Recommendation based on similarity between item features and user preferences |
| Hybrid approach | Combining collaborative and content-based methods |
| User-based CF | Finding similar users and recommending what they liked |
| Item-based CF | Finding similar items to those the user already liked |
| Cosine similarity | Similarity measure based on the angle between two vectors |
| Pearson correlation | Similarity measure based on the linear correlation of co-rated items |
| Matrix factorization | Decomposing the user-item matrix into low-rank factor matrices |
| SVD (recommender context) | Singular value decomposition used to learn latent user and item factors |
| Latent factors | Hidden dimensions (e.g., genre preference, quality taste) discovered by matrix factorization |
| Cold start problem | Inability to recommend for new users or new items that lack interaction data |
| Implicit feedback | Behavioral signals (clicks, views, purchases) that imply preference |
| Explicit feedback | Direct preference expressions (ratings, reviews, thumbs up/down) |
| NDCG | Normalized Discounted Cumulative Gain --- ranking metric that rewards relevant items at higher positions |
| MAP | Mean Average Precision --- ranking metric averaging precision at each relevant item's position |
| Hit Rate | Fraction of users for whom at least one relevant item appears in the top-N |
| Surprise library | Python library for building and evaluating recommender systems on explicit feedback |
| Sparsity problem | The challenge that most entries in the user-item matrix are missing |
Next: Case Study 1 --- ShopSmart Product Recommendations applies collaborative filtering and hybrid methods to e-commerce product recommendations. Case Study 2 --- StreamFlow Content Recommendations builds a content recommender designed to reduce subscriber churn.