In This Chapter
- Collaborative Filtering, Content-Based, and Hybrid Approaches
- The Highest-Impact ML You Will Ever Deploy
- Part 1: The Recommendation Problem
- Part 2: Collaborative Filtering
- Part 3: Content-Based Filtering
- Part 4: Matrix Factorization (SVD)
- Part 5: Evaluation --- Ranking Metrics That Actually Matter
- Part 6: The Cold Start Problem
- Part 7: Building a Hybrid Recommender
- Part 8: Implicit Feedback and Beyond
- Part 9: Putting It All Together --- A Recommender System Design Checklist
- Summary
Chapter 24: Recommender Systems
Collaborative Filtering, Content-Based, and Hybrid Approaches
Learning Objectives
By the end of this chapter, you will be able to:
- Implement user-based and item-based collaborative filtering
- Build a content-based recommender using TF-IDF feature similarity
- Apply matrix factorization (SVD) for latent factor models
- Evaluate recommender systems with ranking metrics (NDCG, MAP, Hit Rate)
- Design a hybrid recommender that combines collaborative and content-based signals
The Highest-Impact ML You Will Ever Deploy
Core Principle --- Recommender systems generate more revenue per engineering hour than almost any other machine learning application. Netflix estimates that its recommender system saves $1 billion per year in reduced churn. Amazon attributes 35% of its revenue to recommendations. Spotify's Discover Weekly playlist converted millions of casual listeners into engaged subscribers. If you build one ML system in your career, there is a decent chance it will be a recommender.
Yet most introductory treatments get the evaluation wrong. They teach you to minimize RMSE on held-out ratings, which is what the Netflix Prize and many Kaggle competitions optimized. In production, nobody cares whether you predicted a 4.2 when the user rated 3.9. What matters is whether the items you surfaced at the top of the list were items the user actually engaged with. That is a ranking problem, not a regression problem. This chapter treats it as such from the start.
The three paradigms are straightforward:
Collaborative filtering says: "Users who behaved similarly to you in the past liked these items. You probably will too." It needs no knowledge of what items are --- only who interacted with what.
Content-based filtering says: "You liked items with these features. Here are other items with similar features." It needs no knowledge of other users --- only the features of items and the user's history.
Hybrid approaches combine both signals, and in practice, every production recommender is a hybrid. Pure collaborative filtering cannot handle new items (the cold start problem). Pure content-based filtering cannot capture taste patterns that transcend item features. Hybrids get the best of both.
This chapter builds all three, evaluates them with ranking metrics that actually matter, and closes with two case studies: ShopSmart's product recommender and StreamFlow's content recommender for churn reduction.
Part 1: The Recommendation Problem
Explicit vs. Implicit Feedback
The first design decision is what data you have. Recommender systems consume two fundamentally different types of feedback, and the choice shapes every algorithm decision downstream.
Explicit feedback is a direct expression of preference: star ratings, thumbs up/down, "not interested" clicks. It is clean, interpretable, and rare. Netflix's 5-star rating system generated perhaps 200 million ratings from 200 million subscribers. That sounds like a lot until you realize the catalog has tens of thousands of titles. The user-item matrix is over 99% empty.
Implicit feedback is behavioral data that implies preference: views, clicks, purchases, time spent, scroll depth, add-to-cart events. It is noisy, ambiguous, and abundant. A user who watched 90% of a movie probably liked it. A user who watched 5 minutes and switched to something else probably did not. But "probably" is doing a lot of work there --- maybe they were interrupted, maybe they fell asleep, maybe they were hate-watching.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
np.random.seed(42)
# --- Simulate explicit feedback: user-item rating matrix ---
n_users = 500
n_items = 200
# Simulate sparse ratings (only 5% of entries are observed)
n_ratings = int(0.05 * n_users * n_items)
user_ids = np.random.randint(0, n_users, size=n_ratings)
item_ids = np.random.randint(0, n_items, size=n_ratings)
# Create latent preferences to make ratings non-random
n_factors = 5
user_factors = np.random.randn(n_users, n_factors)
item_factors = np.random.randn(n_items, n_factors)
# Rating = dot product of latent factors + noise, clipped to [1, 5]
raw_scores = user_factors[user_ids] * item_factors[item_ids]
ratings = np.sum(raw_scores, axis=1) + np.random.randn(n_ratings) * 0.5
ratings = np.clip(np.round(ratings * 0.5 + 3), 1, 5)
ratings_df = pd.DataFrame({
'user_id': user_ids,
'item_id': item_ids,
'rating': ratings
}).drop_duplicates(subset=['user_id', 'item_id']).reset_index(drop=True)
print(f"Explicit feedback: {len(ratings_df)} ratings")
print(f"Users: {ratings_df['user_id'].nunique()}, Items: {ratings_df['item_id'].nunique()}")
print(f"Sparsity: {1 - len(ratings_df) / (n_users * n_items):.4f}")
print(f"\nRating distribution:")
print(ratings_df['rating'].value_counts().sort_index())
Practical Note --- In production, implicit feedback dominates. Most users never rate anything. Amazon does not wait for star ratings to recommend products; it uses purchase history, browsing behavior, and add-to-cart events. If your data is implicit, treat it as binary (interacted / did not interact) or weighted by engagement strength (e.g., minutes watched). Do not try to convert implicit signals into fake star ratings.
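The weighting the note describes can be sketched in a few lines. The event types and weights below are invented for illustration; the point is that raw events are aggregated into interaction strengths (or a binary flag), never into fake star ratings:

```python
import pandas as pd

# Hypothetical implicit-feedback log: one row per user-item event.
events = pd.DataFrame({
    'user_id': [0, 0, 1, 1, 2],
    'item_id': [10, 10, 10, 20, 20],
    'event':   ['view', 'purchase', 'view', 'add_to_cart', 'view'],
})

# Illustrative weights: stronger signals count more (values are assumptions).
event_weights = {'view': 1.0, 'add_to_cart': 3.0, 'purchase': 5.0}
events['weight'] = events['event'].map(event_weights)

# Aggregate into a per-pair interaction strength.
interactions = (
    events.groupby(['user_id', 'item_id'])['weight']
          .sum()
          .reset_index(name='strength')
)

# Binary alternative: interacted / did not interact.
interactions['binary'] = 1
print(interactions)
```

The `strength` column then plays the role that ratings play in the explicit-feedback code above.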
The User-Item Matrix
Regardless of feedback type, the core data structure is the user-item matrix. Rows are users, columns are items, and entries are ratings (explicit) or interaction counts (implicit). The matrix is almost always extremely sparse.
# Build the user-item matrix
from scipy.sparse import csr_matrix
# Pivot to dense matrix (for illustration; use sparse in production)
user_item_matrix = ratings_df.pivot_table(
index='user_id', columns='item_id', values='rating'
)
print(f"Matrix shape: {user_item_matrix.shape}")
print(f"Non-null entries: {user_item_matrix.notna().sum().sum()}")
print(f"Sparsity: {user_item_matrix.isna().sum().sum() / user_item_matrix.size:.4f}")
# For computation, fill NaN with 0 and convert to sparse
user_item_dense = user_item_matrix.fillna(0).values
user_item_sparse = csr_matrix(user_item_dense)
print(f"\nSparse matrix: {user_item_sparse.nnz} stored elements")
The sparsity is the defining challenge. With 99%+ of entries missing, every algorithm is essentially an exercise in intelligent extrapolation: given the tiny fraction of the matrix you can observe, predict the missing entries well enough to rank items for each user.
Part 2: Collaborative Filtering
Collaborative filtering is the oldest and most intuitive approach: recommend items that similar users liked (user-based CF) or items that are similar to items the user already liked (item-based CF). "Similar" is defined by the observed ratings or interactions.
User-Based Collaborative Filtering
The idea is simple. To predict whether user A will like item X:
- Find users who are similar to user A (based on their rating patterns)
- Check how those similar users rated item X
- Predict user A's rating as a weighted average of similar users' ratings on item X
Similarity is typically measured with cosine similarity or Pearson correlation, computed over the co-rated items.
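A toy example (numbers invented for illustration) of Pearson correlation restricted to co-rated items; the mean-centered cosine used in the listing below is a closely related quantity:

```python
import numpy as np

# Ratings of two users over a shared catalog; 0 means "not rated".
user_a = np.array([5, 3, 0, 4, 0])
user_b = np.array([4, 2, 5, 5, 1])

# Restrict to items both users have rated before correlating.
co_rated = (user_a != 0) & (user_b != 0)
pearson = np.corrcoef(user_a[co_rated], user_b[co_rated])[0, 1]
print(f"Pearson over {co_rated.sum()} co-rated items: {pearson:.3f}")
```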
from sklearn.metrics.pairwise import cosine_similarity
# Center ratings per user (mean-centering improves similarity)
user_means = np.where(
user_item_dense.sum(axis=1, keepdims=True) != 0,
user_item_dense.sum(axis=1, keepdims=True) /
np.maximum((user_item_dense != 0).sum(axis=1, keepdims=True), 1),
0
)
user_item_centered = np.where(user_item_dense != 0, user_item_dense - user_means, 0)
# Compute user-user cosine similarity
user_similarity = cosine_similarity(user_item_centered)
np.fill_diagonal(user_similarity, 0) # don't recommend based on self
print(f"User similarity matrix shape: {user_similarity.shape}")
print(f"Mean similarity: {user_similarity[user_similarity != 0].mean():.4f}")
print(f"Max similarity: {user_similarity.max():.4f}")
def predict_user_cf(user_id, item_id, k=20):
"""
Predict rating for a user-item pair using user-based CF.
Parameters
----------
user_id : int
Target user index.
item_id : int
Target item index.
k : int
Number of nearest neighbors to use.
Returns
-------
float
Predicted rating, or NaN if no neighbors rated the item.
"""
# Find users who rated this item
item_raters = np.where(user_item_dense[:, item_id] != 0)[0]
if len(item_raters) == 0:
return np.nan
# Get similarities to those users
sims = user_similarity[user_id, item_raters]
# Keep top-k most similar users with positive similarity
positive_mask = sims > 0
if positive_mask.sum() == 0:
return np.nan
item_raters = item_raters[positive_mask]
sims = sims[positive_mask]
if len(sims) > k:
top_k_idx = np.argsort(sims)[-k:]
item_raters = item_raters[top_k_idx]
sims = sims[top_k_idx]
# Weighted average of centered ratings + user's mean
centered_ratings = user_item_centered[item_raters, item_id]
prediction = user_means[user_id, 0] + np.dot(sims, centered_ratings) / np.sum(sims)
return np.clip(prediction, 1, 5)
# Test prediction
test_user, test_item = 0, 15
pred = predict_user_cf(test_user, test_item, k=20)
print(f"\nPredicted rating for user {test_user}, item {test_item}: {pred:.2f}")
Practical Note --- User-based CF has a fatal scaling problem. Computing user-user similarity is O(n_users^2), and user profiles change every time someone rates a new item. For a platform with millions of users, recomputing user similarity daily is expensive. This is why production systems almost always prefer item-based CF or matrix factorization.
Item-Based Collaborative Filtering
Item-based CF flips the approach: instead of finding similar users, find similar items. To predict whether user A will like item X, find items similar to X that user A has already rated, and weight their ratings by item similarity.
The key insight is that item-item similarity is more stable than user-user similarity. A movie's "personality" (who watches it) changes slowly. A user's preferences can shift overnight. Amazon's original recommender system was item-based CF, and the stability of item similarity was a major reason.
# Compute item-item cosine similarity (on centered ratings, transposed)
item_similarity = cosine_similarity(user_item_centered.T)
np.fill_diagonal(item_similarity, 0)
print(f"Item similarity matrix shape: {item_similarity.shape}")
print(f"Mean similarity: {item_similarity[item_similarity != 0].mean():.4f}")
def predict_item_cf(user_id, item_id, k=20):
"""
Predict rating for a user-item pair using item-based CF.
Parameters
----------
user_id : int
Target user index.
item_id : int
Target item index.
k : int
Number of nearest neighbor items to use.
Returns
-------
float
Predicted rating, or NaN if the user has no rated similar items.
"""
# Find items this user has rated
rated_items = np.where(user_item_dense[user_id] != 0)[0]
if len(rated_items) == 0:
return np.nan
# Get similarities between target item and rated items
sims = item_similarity[item_id, rated_items]
# Keep top-k with positive similarity
positive_mask = sims > 0
if positive_mask.sum() == 0:
return np.nan
rated_items = rated_items[positive_mask]
sims = sims[positive_mask]
if len(sims) > k:
top_k_idx = np.argsort(sims)[-k:]
rated_items = rated_items[top_k_idx]
sims = sims[top_k_idx]
# Weighted average of the user's ratings on similar items
user_ratings = user_item_dense[user_id, rated_items]
prediction = np.dot(sims, user_ratings) / np.sum(sims)
return np.clip(prediction, 1, 5)
pred_item = predict_item_cf(test_user, test_item, k=20)
print(f"Predicted rating for user {test_user}, item {test_item}: {pred_item:.2f}")
User-Based vs. Item-Based: When to Use Each
| Factor | User-Based CF | Item-Based CF |
|---|---|---|
| Similarity stability | Low (user tastes change) | High (item profiles stable) |
| Computational cost | O(n_users^2) | O(n_items^2) |
| Better when... | n_items >> n_users | n_users >> n_items (typical) |
| Cold start for new users | Cannot recommend (no history) | Cannot recommend (no history) |
| Cold start for new items | Can recommend once a few users rate it | Cannot recommend (no similarity data) |
| Interpretability | "Users like you also liked..." | "Because you liked X, you'll like Y..." |
In most production settings, n_users >> n_items, which makes item-based CF computationally cheaper and more stable. Amazon's original 2003 paper on item-based CF made exactly this argument.
Part 3: Content-Based Filtering
Collaborative filtering ignores what items are. It only cares about who interacted with what. Content-based filtering takes the opposite approach: it recommends items whose features are similar to items the user has liked before.
The advantage is that content-based filtering handles the cold start problem for new items. A new movie with known genres, cast, and plot keywords can be recommended immediately based on its features --- no one needs to have watched it first.
The disadvantage is the filter bubble: content-based filtering only recommends more of the same. If you have watched ten action movies, it recommends action movie number eleven. It cannot discover that you might enjoy a documentary.
TF-IDF Feature Vectors
For items described by text (product descriptions, movie plots, article content), TF-IDF provides a natural feature representation. Each item becomes a vector in term space, and similarity between items is cosine similarity between their TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
# Simulate item catalog with text descriptions
item_descriptions = {
0: "action thriller with car chases and explosions in a futuristic city",
1: "romantic comedy set in New York with a love triangle",
2: "science fiction epic about space exploration and alien contact",
3: "dark thriller about a detective solving serial murders",
4: "romantic drama about long distance relationships and sacrifice",
5: "action science fiction with robots and artificial intelligence",
6: "comedy about family reunions and holiday disasters",
7: "documentary about climate change and ocean conservation",
8: "horror thriller set in an abandoned hospital",
9: "animated comedy for families with talking animals",
10: "science fiction thriller about time travel paradoxes",
11: "romantic comedy about dating apps and modern relationships",
12: "action adventure with treasure hunting and ancient ruins",
13: "documentary about the history of jazz music",
14: "drama about a family dealing with addiction and recovery",
}
descriptions_list = [item_descriptions[i] for i in range(len(item_descriptions))]
# Build TF-IDF vectors
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(descriptions_list)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(tfidf.vocabulary_)}")
# Compute item-item content similarity
content_similarity = cos_sim(tfidf_matrix)
np.fill_diagonal(content_similarity, 0)
# Show most similar items for item 0 (action thriller)
item_0_sims = content_similarity[0]
top_similar = np.argsort(item_0_sims)[::-1][:5]
print(f"\nMost similar items to item 0 ('{descriptions_list[0][:50]}...'):")
for idx in top_similar:
print(f" Item {idx}: sim={item_0_sims[idx]:.3f} - {descriptions_list[idx][:60]}")
Building a Content-Based Recommender
The content-based recommender builds a user profile from the features of items the user has liked, then recommends items whose features best match that profile.
def build_user_profile(user_id, ratings_df, tfidf_matrix, min_rating=3.5):
"""
Build a user profile by averaging TF-IDF vectors of highly-rated items.
Parameters
----------
user_id : int
Target user.
ratings_df : DataFrame
User-item ratings.
tfidf_matrix : sparse matrix
TF-IDF features per item.
min_rating : float
Minimum rating to consider as "liked."
Returns
-------
ndarray
User profile vector (averaged TF-IDF of liked items).
"""
user_ratings = ratings_df[ratings_df['user_id'] == user_id]
liked_items = user_ratings[user_ratings['rating'] >= min_rating]['item_id'].values
# Filter to items that have descriptions
liked_items = liked_items[liked_items < tfidf_matrix.shape[0]]
if len(liked_items) == 0:
return np.zeros(tfidf_matrix.shape[1])
# Average the TF-IDF vectors of liked items
profile = tfidf_matrix[liked_items].mean(axis=0)
return np.asarray(profile).flatten()
def recommend_content_based(user_id, ratings_df, tfidf_matrix, n=5):
"""
Recommend items using content-based filtering.
Parameters
----------
user_id : int
Target user.
ratings_df : DataFrame
User-item ratings.
tfidf_matrix : sparse matrix
TF-IDF features per item.
n : int
Number of recommendations.
Returns
-------
list of int
Recommended item IDs.
"""
profile = build_user_profile(user_id, ratings_df, tfidf_matrix)
if np.sum(np.abs(profile)) == 0:
return []
# Score every item against user profile
scores = cos_sim(profile.reshape(1, -1), tfidf_matrix).flatten()
# Exclude items user has already rated
rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'].values)
for item_id in rated_items:
if item_id < len(scores):
scores[item_id] = -1
# Return top-n
return list(np.argsort(scores)[::-1][:n])
# Test: recommend for a user
recs = recommend_content_based(0, ratings_df, tfidf_matrix, n=5)
print(f"Content-based recommendations for user 0:")
for item_id in recs:
print(f" Item {item_id}: {descriptions_list[item_id][:70]}")
Practical Note --- TF-IDF is the simplest content representation. In production, you would use pre-trained embeddings (sentence-transformers, CLIP for images, or domain-specific embeddings) for richer feature representations. The framework is identical: represent items as vectors, compute similarity, rank by profile match.
Part 4: Matrix Factorization (SVD)
Collaborative filtering with nearest neighbors has two problems at scale: the user-item matrix is too sparse for reliable similarity, and computing pairwise similarity is expensive. Matrix factorization addresses both by compressing the user-item matrix into a low-rank approximation.
The intuition: if a user-item matrix has 500 users and 200 items, it has 100,000 entries (mostly missing). But the underlying preferences might be driven by only 10-20 latent factors --- "likes action," "prefers long movies," "watches with family." Matrix factorization discovers these latent factors automatically.
SVD for Recommender Systems
In the recommender context, SVD (Singular Value Decomposition) decomposes the user-item matrix R into:
$$R \approx U \cdot \Sigma \cdot V^T$$
Where U is the user-factor matrix, V is the item-factor matrix, and the factors represent latent preferences. The prediction for user u on item i is the dot product of their latent vectors.
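The low-rank idea can be sketched directly with scipy's truncated SVD before reaching for a dedicated library. This is a rough illustration on a small synthetic matrix; real recommenders fit only the observed entries, as the SVD model below does, rather than factoring a zero-filled matrix:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# A small "ratings" matrix generated from 3 latent factors plus noise.
true_u = rng.normal(size=(30, 3))
true_v = rng.normal(size=(20, 3))
R = true_u @ true_v.T + rng.normal(scale=0.1, size=(30, 20))

# Rank-3 truncated SVD: R ~ U @ diag(s) @ Vt.
U, s, Vt = svds(R, k=3)
R_hat = U @ np.diag(s) @ Vt

# Split the singular values so each prediction is a plain dot product
# of a user latent vector and an item latent vector.
user_vecs = U @ np.diag(np.sqrt(s))
item_vecs = Vt.T @ np.diag(np.sqrt(s))
err = np.abs(R - R_hat).mean()
print(f"Mean reconstruction error with 3 factors: {err:.3f}")
```

Three factors recover the matrix almost exactly, which is the compression argument from the intuition above: 100,000 entries explained by a handful of latent dimensions.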
The surprise library (installed via `pip install scikit-surprise`) provides optimized implementations specifically designed for recommender systems:
from surprise import Dataset, Reader, SVD, KNNBasic, KNNWithMeans
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy
# Prepare data for surprise
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(
ratings_df[['user_id', 'item_id', 'rating']], reader
)
# Train SVD model
svd_model = SVD(
n_factors=50, # number of latent factors
n_epochs=20, # training iterations
lr_all=0.005, # learning rate
reg_all=0.02, # regularization
random_state=42
)
# Cross-validate
cv_results = cross_validate(svd_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(f"\nSVD Cross-Validation Results:")
print(f" Mean RMSE: {cv_results['test_rmse'].mean():.4f} (+/- {cv_results['test_rmse'].std():.4f})")
print(f" Mean MAE: {cv_results['test_mae'].mean():.4f} (+/- {cv_results['test_mae'].std():.4f})")
Core Principle --- RMSE tells you how well you predict ratings. It does not tell you how well you rank items. A model with lower RMSE can produce worse top-N recommendations than a model with higher RMSE. Always evaluate with ranking metrics when the task is "recommend the best items," not "predict the exact rating."
Training and Extracting Latent Factors
# Train on full dataset to inspect latent factors
trainset = data.build_full_trainset()
svd_model.fit(trainset)
# User and item factor matrices
user_factors_svd = svd_model.pu # shape: (n_users, n_factors)
item_factors_svd = svd_model.qi # shape: (n_items, n_factors)
print(f"User factors shape: {user_factors_svd.shape}")
print(f"Item factors shape: {item_factors_svd.shape}")
# Predict a specific rating
prediction = svd_model.predict(uid=0, iid=15)
print(f"\nSVD prediction for user 0, item 15: {prediction.est:.2f}")
# Generate top-N recommendations for a user
def recommend_svd(model, user_id, ratings_df, n=10):
"""
Generate top-N recommendations using a trained SVD model.
Parameters
----------
model : surprise.SVD
Trained SVD model.
user_id : int
Target user.
ratings_df : DataFrame
Observed ratings (to exclude already-rated items).
n : int
Number of recommendations.
Returns
-------
list of tuples
(item_id, predicted_rating) sorted by predicted rating descending.
"""
rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'])
all_items = set(ratings_df['item_id'].unique())
unrated_items = all_items - rated_items
predictions = []
for item_id in unrated_items:
pred = model.predict(uid=user_id, iid=item_id)
predictions.append((item_id, pred.est))
predictions.sort(key=lambda x: x[1], reverse=True)
return predictions[:n]
top_recs = recommend_svd(svd_model, user_id=0, ratings_df=ratings_df, n=10)
print(f"\nTop-10 SVD recommendations for user 0:")
for item_id, score in top_recs:
print(f" Item {item_id}: predicted rating {score:.2f}")
Comparing CF Methods with Surprise
from surprise import KNNBasic, KNNWithMeans, NMF
# Define models to compare
models = {
'User-KNN (cosine)': KNNBasic(
k=20, sim_options={'name': 'cosine', 'user_based': True},
verbose=False
),
'Item-KNN (cosine)': KNNBasic(
k=20, sim_options={'name': 'cosine', 'user_based': False},
verbose=False
),
'User-KNN (Pearson, mean-centered)': KNNWithMeans(
k=20, sim_options={'name': 'pearson', 'user_based': True},
verbose=False
),
'SVD (50 factors)': SVD(n_factors=50, random_state=42),
'NMF (50 factors)': NMF(n_factors=50, random_state=42),
}
print("Model Comparison (5-fold CV):")
print(f"{'Model':<40} {'RMSE':>8} {'MAE':>8}")
print("-" * 58)
for name, model in models.items():
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
print(f"{name:<40} {results['test_rmse'].mean():>8.4f} {results['test_mae'].mean():>8.4f}")
Practical Note --- SVD and NMF almost always beat neighborhood methods on RMSE. But RMSE is not the metric that matters. The next section covers the metrics that do.
Part 5: Evaluation --- Ranking Metrics That Actually Matter
This is where most textbook treatments fail. They evaluate recommenders with RMSE, which measures rating prediction accuracy. In production, the question is not "how accurately can you predict a 4.3-star rating?" but "did the user engage with the items you put at the top of the list?" That is a ranking quality problem, and it requires ranking metrics.
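A two-item toy example (invented numbers) makes the gap concrete: model A has the lower RMSE yet ranks the items backwards, while model B's inflated score on the loved item gets the order right:

```python
import numpy as np

true = np.array([5.0, 1.0])       # item 0 is the one the user loves
pred_a = np.array([3.9, 4.0])     # close on average, wrong order
pred_b = np.array([9.0, 1.0])     # wildly off on item 0, right order

def rmse(pred):
    return np.sqrt(np.mean((true - pred) ** 2))

print(f"RMSE A={rmse(pred_a):.2f}, B={rmse(pred_b):.2f}")
print(f"A top pick: item {pred_a.argmax()}, B top pick: item {pred_b.argmax()}")
```

If you deploy by RMSE you ship model A; if you deploy by what the user sees first, you ship model B.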
Train-Test Split for Ranking Evaluation
The standard protocol: for each user, hold out a fraction of their interactions as the test set. Train on the remaining interactions. Generate a ranked list of recommendations. Check whether the held-out items appear in the top-N of the ranked list.
from collections import defaultdict
def train_test_split_by_user(ratings_df, test_fraction=0.2, random_state=42):
"""
Split ratings so each user has some ratings in train and some in test.
Parameters
----------
ratings_df : DataFrame
Columns: user_id, item_id, rating.
test_fraction : float
Fraction of each user's ratings to hold out.
random_state : int
Random seed.
Returns
-------
tuple of DataFrames
(train_df, test_df)
"""
rng = np.random.RandomState(random_state)
train_rows = []
test_rows = []
for user_id, group in ratings_df.groupby('user_id'):
n_test = max(1, int(len(group) * test_fraction))
test_idx = rng.choice(group.index, size=n_test, replace=False)
train_rows.append(group.drop(test_idx))
test_rows.append(group.loc[test_idx])
return pd.concat(train_rows), pd.concat(test_rows)
train_df, test_df = train_test_split_by_user(ratings_df, test_fraction=0.2)
print(f"Train: {len(train_df)} ratings, Test: {len(test_df)} ratings")
print(f"Users in train: {train_df['user_id'].nunique()}")
print(f"Users in test: {test_df['user_id'].nunique()}")
Hit Rate
The simplest ranking metric: did at least one of the held-out items appear in the top-N recommendations?
def hit_rate_at_k(recommendations, test_items, k=10):
"""
Compute Hit Rate@K: fraction of users where at least one test item
appears in the top-K recommendations.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Number of recommendations to consider.
Returns
-------
float
Hit rate between 0 and 1.
"""
hits = 0
total = 0
for user_id in test_items:
if user_id not in recommendations:
continue
top_k = set(recommendations[user_id][:k])
if top_k & test_items[user_id]: # set intersection
hits += 1
total += 1
return hits / total if total > 0 else 0.0
NDCG (Normalized Discounted Cumulative Gain)
NDCG measures not just whether relevant items appear, but where they appear. An item at position 1 is more valuable than the same item at position 10.
def dcg_at_k(relevant_items, ranked_list, k=10):
"""
Compute DCG@K for a single user.
Parameters
----------
relevant_items : set
Items the user actually interacted with.
ranked_list : list
Ranked recommendations.
k : int
Cutoff.
Returns
-------
float
DCG score.
"""
dcg = 0.0
for i, item in enumerate(ranked_list[:k]):
if item in relevant_items:
dcg += 1.0 / np.log2(i + 2) # i+2 because log2(1)=0
return dcg
def ndcg_at_k(recommendations, test_items, k=10):
"""
Compute mean NDCG@K across all users.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Cutoff.
Returns
-------
float
Mean NDCG.
"""
ndcg_scores = []
for user_id in test_items:
if user_id not in recommendations:
continue
relevant = test_items[user_id]
ranked = recommendations[user_id]
# Actual DCG
dcg = dcg_at_k(relevant, ranked, k)
# Ideal DCG (all relevant items at the top)
ideal_list = list(relevant)[:k]
idcg = dcg_at_k(relevant, ideal_list, k)
if idcg > 0:
ndcg_scores.append(dcg / idcg)
else:
ndcg_scores.append(0.0)
return np.mean(ndcg_scores) if ndcg_scores else 0.0
Mean Average Precision (MAP)
MAP measures precision at every position where a relevant item appears, then averages.
def average_precision_at_k(relevant_items, ranked_list, k=10):
"""
Compute Average Precision@K for a single user.
Parameters
----------
relevant_items : set
Items the user actually interacted with.
ranked_list : list
Ranked recommendations.
k : int
Cutoff.
Returns
-------
float
Average precision.
"""
hits = 0
sum_precision = 0.0
for i, item in enumerate(ranked_list[:k]):
if item in relevant_items:
hits += 1
sum_precision += hits / (i + 1)
n_relevant = min(len(relevant_items), k)
return sum_precision / n_relevant if n_relevant > 0 else 0.0
def mean_average_precision(recommendations, test_items, k=10):
"""
Compute MAP@K across all users.
Parameters
----------
recommendations : dict
{user_id: [ranked list of item_ids]}
test_items : dict
{user_id: set of held-out item_ids}
k : int
Cutoff.
Returns
-------
float
MAP score.
"""
ap_scores = []
for user_id in test_items:
if user_id not in recommendations:
continue
ap = average_precision_at_k(
test_items[user_id], recommendations[user_id], k
)
ap_scores.append(ap)
return np.mean(ap_scores) if ap_scores else 0.0
Full Evaluation Pipeline
# Train SVD on train set
reader = Reader(rating_scale=(1, 5))
train_data = Dataset.load_from_df(
train_df[['user_id', 'item_id', 'rating']], reader
)
trainset = train_data.build_full_trainset()
svd_eval = SVD(n_factors=50, n_epochs=20, random_state=42)
svd_eval.fit(trainset)
# Generate recommendations for all test users
all_items = set(ratings_df['item_id'].unique())
test_items_dict = test_df.groupby('user_id')['item_id'].apply(set).to_dict()
train_items_dict = train_df.groupby('user_id')['item_id'].apply(set).to_dict()
recommendations = {}
for user_id in test_items_dict:
if user_id not in train_items_dict:
continue
# Score all unseen items
seen = train_items_dict[user_id]
unseen = all_items - seen
preds = []
for item_id in unseen:
pred = svd_eval.predict(uid=user_id, iid=item_id)
preds.append((item_id, pred.est))
preds.sort(key=lambda x: x[1], reverse=True)
recommendations[user_id] = [item_id for item_id, _ in preds]
# Evaluate
for k in [5, 10, 20]:
hr = hit_rate_at_k(recommendations, test_items_dict, k=k)
ndcg = ndcg_at_k(recommendations, test_items_dict, k=k)
map_score = mean_average_precision(recommendations, test_items_dict, k=k)
print(f"@{k:>2}: Hit Rate={hr:.4f} NDCG={ndcg:.4f} MAP={map_score:.4f}")
Core Principle --- Always report ranking metrics at multiple cutoffs (5, 10, 20). A model that looks great at @20 might be mediocre at @5, meaning it finds relevant items but buries them in the list. In production, users see 5-10 recommendations. Performance at @5 is what matters for the user experience.
Part 6: The Cold Start Problem
The cold start problem is the most important practical challenge in recommender systems, and it comes in two forms.
New user cold start: A user just signed up and has no interaction history. Collaborative filtering cannot work --- there are no ratings to compute similarity from. Content-based filtering cannot work either --- there is no profile to match against.
New item cold start: A new item was just added to the catalog. No one has interacted with it yet. Collaborative filtering cannot recommend it because there is no interaction data. Content-based filtering can recommend it if the item has content features.
# Demonstrate cold start quantitatively
user_rating_counts = ratings_df.groupby('user_id').size()
item_rating_counts = ratings_df.groupby('item_id').size()
print("User rating counts:")
print(user_rating_counts.describe())
print(f"\nUsers with < 5 ratings: {(user_rating_counts < 5).sum()} "
f"({(user_rating_counts < 5).mean():.1%})")
print(f"Users with < 3 ratings: {(user_rating_counts < 3).sum()} "
f"({(user_rating_counts < 3).mean():.1%})")
print("\nItem rating counts:")
print(item_rating_counts.describe())
print(f"\nItems with < 5 ratings: {(item_rating_counts < 5).sum()} "
f"({(item_rating_counts < 5).mean():.1%})")
Cold Start Mitigation Strategies
| Strategy | Cold Start Type | Approach |
|---|---|---|
| Popularity baseline | Both | Recommend most popular items to new users; let new items ride on category popularity |
| Content-based fallback | New items | Use item features to recommend even without interaction data |
| Onboarding questionnaire | New users | Ask the user to rate 5-10 items at signup |
| Demographic filtering | New users | Use age, location, or other profile data to bootstrap preferences |
| Hybrid model | Both | Blend CF (for users with history) and content (for cold start cases) |
def popularity_baseline(ratings_df, n=10):
    """
    Return the n most popular items (by number of ratings) as a baseline.
    Items that many users rate highly are safe default recommendations.
    """
    popularity = (
        ratings_df.groupby('item_id')
        .agg(n_ratings=('rating', 'count'), mean_rating=('rating', 'mean'))
    )
    # Bayesian average shrinks small-sample items toward the global mean
    C = popularity['n_ratings'].mean()
    m = popularity['mean_rating'].mean()
    popularity['bayesian_avg'] = (
        (popularity['n_ratings'] * popularity['mean_rating'] + C * m) /
        (popularity['n_ratings'] + C)
    )
    return popularity.nlargest(n, 'bayesian_avg').index.tolist()
popular_items = popularity_baseline(ratings_df, n=10)
print(f"Popularity baseline (top 10): {popular_items}")
Practical Note --- Every production recommender has a cold start fallback. The most common pattern is a cascade: try the personalized model first; if the user has fewer than N interactions, fall back to content-based; if the item has no content features, fall back to popularity. The fallback is never as good as the personalized model, but it is infinitely better than showing nothing.
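Part 7's switching hybrid implements the user-level switch; the item-feature tier of the cascade described in the note can be sketched as a thin dispatch layer on top of it. This is only a sketch: `personalized_fn`, `content_fn`, and `popular_items` are hypothetical stand-ins for the models built in this chapter, and the threshold is illustrative.

```python
def cascade_recommend(user_id, n_interactions, has_content_features,
                      personalized_fn, content_fn, popular_items,
                      min_interactions=5, n=10):
    """Cold start cascade: personalized -> content-based -> popularity."""
    if n_interactions >= min_interactions:
        # Enough history to trust the personalized model
        return personalized_fn(user_id, n), "personalized"
    if has_content_features:
        # Thin history, but content features are available
        return content_fn(user_id, n), "content-based"
    # Coldest case: fall back to globally popular items
    return popular_items[:n], "popularity"

# Toy wiring with stand-in recommenders for each tier
recs, method = cascade_recommend(
    user_id=7, n_interactions=0, has_content_features=False,
    personalized_fn=lambda u, n: ["personalized"] * n,
    content_fn=lambda u, n: ["content"] * n,
    popular_items=[3, 7, 1, 9], n=2
)
print(method, recs)  # falls through to the popularity tier
```

In production the same dispatch usually lives in the serving layer rather than the model code, so each tier can be monitored and rolled back independently.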
Part 7: Building a Hybrid Recommender
Hybrid recommenders combine multiple signals. The two most common architectures are weighted blending (average the scores from multiple models) and switching (use different models for different situations, e.g., content-based for new users, CF for established users).
Weighted Hybrid
def hybrid_recommend(user_id, ratings_df, svd_model, tfidf_matrix,
                     content_weight=0.3, cf_weight=0.7, n=10):
    """
    Hybrid recommender that blends SVD collaborative filtering scores
    with content-based similarity scores.

    Parameters
    ----------
    user_id : int
        Target user.
    ratings_df : DataFrame
        Observed ratings.
    svd_model : surprise.SVD
        Trained SVD model.
    tfidf_matrix : sparse matrix
        TF-IDF features per item.
    content_weight : float
        Weight for the content-based score.
    cf_weight : float
        Weight for the collaborative filtering score.
    n : int
        Number of recommendations.

    Returns
    -------
    list of tuples
        (item_id, blended_score) sorted descending.
    """
    rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'])
    all_items = set(ratings_df['item_id'].unique())
    candidates = all_items - rated_items

    # CF scores from SVD, normalized from the 1-5 rating scale to 0-1
    cf_scores = {}
    for item_id in candidates:
        pred = svd_model.predict(uid=user_id, iid=item_id)
        cf_scores[item_id] = (pred.est - 1) / 4

    # Content scores: cosine similarity between the user profile and each item
    profile = build_user_profile(user_id, ratings_df, tfidf_matrix)
    content_scores = {}
    if np.sum(np.abs(profile)) > 0:
        all_content_scores = cos_sim(
            profile.reshape(1, -1), tfidf_matrix
        ).flatten()
        for item_id in candidates:
            if item_id < len(all_content_scores):
                content_scores[item_id] = float(all_content_scores[item_id])
            else:
                content_scores[item_id] = 0.0

    # Blend the two signals with the configured weights
    blended = []
    for item_id in candidates:
        cf = cf_scores.get(item_id, 0.5)
        cb = content_scores.get(item_id, 0.0)
        score = cf_weight * cf + content_weight * cb
        blended.append((item_id, score))
    blended.sort(key=lambda x: x[1], reverse=True)
    return blended[:n]
# Test hybrid recommender
hybrid_recs = hybrid_recommend(
user_id=0, ratings_df=ratings_df,
svd_model=svd_model, tfidf_matrix=tfidf_matrix,
content_weight=0.3, cf_weight=0.7, n=10
)
print("Hybrid recommendations for user 0:")
for item_id, score in hybrid_recs:
    desc = descriptions_list[item_id] if item_id < len(descriptions_list) else "N/A"
    print(f"  Item {item_id}: score={score:.3f} - {desc[:60]}")
Switching Hybrid (Cold Start Aware)
def switching_hybrid(user_id, ratings_df, svd_model, tfidf_matrix,
                     min_ratings_for_cf=5, n=10):
    """
    Switching hybrid: use CF for established users,
    content-based for new users, popularity for the coldest cases.

    Parameters
    ----------
    user_id : int
        Target user.
    ratings_df : DataFrame
        Observed ratings.
    svd_model : surprise.SVD
        Trained SVD model.
    tfidf_matrix : sparse matrix
        TF-IDF features per item.
    min_ratings_for_cf : int
        Minimum ratings needed to trust CF predictions.
    n : int
        Number of recommendations.

    Returns
    -------
    tuple
        (recommendations, method_used)
    """
    user_ratings = ratings_df[ratings_df['user_id'] == user_id]
    n_ratings = len(user_ratings)
    if n_ratings >= min_ratings_for_cf:
        # Established user: use hybrid CF + content
        recs = hybrid_recommend(
            user_id, ratings_df, svd_model, tfidf_matrix,
            content_weight=0.2, cf_weight=0.8, n=n
        )
        return [r[0] for r in recs], "hybrid (CF-dominant)"
    elif n_ratings > 0:
        # Some history: rely on content-based
        recs = recommend_content_based(
            user_id, ratings_df, tfidf_matrix, n=n
        )
        return recs, "content-based"
    else:
        # New user: popularity baseline
        recs = popularity_baseline(ratings_df, n=n)
        return recs, "popularity baseline"
# Demonstrate switching behavior
for uid in [0, 450, 999]:
    recs, method = switching_hybrid(
        uid, ratings_df, svd_model, tfidf_matrix
    )
    n_user_ratings = len(ratings_df[ratings_df['user_id'] == uid])
    print(f"User {uid} ({n_user_ratings} ratings): {method} -> {recs[:5]}")
Core Principle --- The switching hybrid is the production pattern. Pure CF, pure content-based, and pure popularity each have a regime where they are the best option. The switching hybrid uses the right tool for each user's situation. This is not an algorithmic sophistication --- it is an engineering pattern for handling the real-world messiness of uneven user engagement.
Part 8: Implicit Feedback and Beyond
Most production recommender systems run on implicit feedback --- clicks, views, purchases, time spent --- not explicit ratings. Implicit feedback requires different algorithms because the signal is fundamentally different: a missing entry means "we don't know," not "the user would rate this low."
Alternating Least Squares (ALS) for Implicit Feedback
The standard approach for implicit feedback is ALS (Alternating Least Squares), which treats the user-item matrix as binary (interacted vs. not interacted) with confidence weights.
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
# Simulate implicit feedback: binary interactions with confidence
np.random.seed(42)
n_users_impl = 1000
n_items_impl = 300
n_interactions = 15000
impl_user_ids = np.random.randint(0, n_users_impl, size=n_interactions)
impl_item_ids = np.random.randint(0, n_items_impl, size=n_interactions)
impl_confidence = np.random.exponential(scale=2.0, size=n_interactions) + 1
impl_df = pd.DataFrame({
    'user_id': impl_user_ids,
    'item_id': impl_item_ids,
    'confidence': impl_confidence
}).groupby(['user_id', 'item_id']).agg(
    confidence=('confidence', 'sum'),
    interactions=('confidence', 'count')
).reset_index()
print(f"Implicit feedback: {len(impl_df)} user-item pairs")
print(f"Users: {impl_df['user_id'].nunique()}, Items: {impl_df['item_id'].nunique()}")
print(f"\nConfidence distribution:")
print(impl_df['confidence'].describe())
# Build sparse implicit matrix
implicit_matrix = csr_matrix(
(impl_df['confidence'].values, (impl_df['user_id'].values, impl_df['item_id'].values)),
shape=(n_users_impl, n_items_impl)
)
# Use TruncatedSVD as a simple matrix factorization baseline
svd_implicit = TruncatedSVD(n_components=30, random_state=42)
user_factors_impl = svd_implicit.fit_transform(implicit_matrix)
item_factors_impl = svd_implicit.components_.T
print(f"\nUser factors shape: {user_factors_impl.shape}")
print(f"Item factors shape: {item_factors_impl.shape}")
print(f"Explained variance ratio (first 5): {svd_implicit.explained_variance_ratio_[:5]}")
# Recommend for a user via dot product
def recommend_implicit(user_id, user_factors, item_factors, seen_items, n=10):
    """
    Recommend items for a user based on the implicit factor model.
    """
    scores = user_factors[user_id] @ item_factors.T
    # Exclude items the user has already interacted with
    for item_id in seen_items:
        scores[item_id] = -np.inf
    top_n = np.argsort(scores)[::-1][:n]
    return [(int(item_id), float(scores[item_id])) for item_id in top_n]
# Test
seen = set(impl_df[impl_df['user_id'] == 0]['item_id'].values)
recs = recommend_implicit(0, user_factors_impl, item_factors_impl, seen, n=10)
print(f"\nImplicit-feedback recommendations for user 0:")
for item_id, score in recs:
    print(f"  Item {item_id}: score={score:.4f}")
Practical Note --- For production-grade implicit feedback recommenders, use the `implicit` library (https://github.com/benfred/implicit), which implements ALS, BPR (Bayesian Personalized Ranking), and logistic matrix factorization with GPU acceleration. The `surprise` library is designed for explicit feedback and does not natively handle implicit data.
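The TruncatedSVD baseline above ignores the confidence weighting that the ALS formulation is built around. For intuition, here is a minimal dense NumPy sketch of the update from Hu, Koren, and Volinsky's implicit ALS paper (ICDM 2008). The name `implicit_als` is ours, not an API of the `implicit` library, and this version is toy-scale only: it materializes the confidence matrix densely and solves one small linear system per user and per item.

```python
import numpy as np

def implicit_als(R, factors=8, alpha=40.0, reg=0.1, iterations=10, seed=0):
    """Minimal dense ALS for implicit feedback.

    R : (n_users, n_items) array of raw interaction counts.
    Returns user factors X and item factors Y.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    C = 1.0 + alpha * R               # confidence: observed entries weigh more
    P = (R > 0).astype(float)         # binary preference target
    X = rng.normal(scale=0.01, size=(n_users, factors))
    Y = rng.normal(scale=0.01, size=(n_items, factors))
    I = reg * np.eye(factors)
    for _ in range(iterations):
        # Solve for user factors with item factors fixed, then vice versa
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + I, Y.T @ Cu @ P[u])
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + I, X.T @ Ci @ P[:, i])
    return X, Y

# Toy demo: factorize a small random interaction matrix
rng = np.random.default_rng(1)
R = (rng.random((8, 6)) > 0.6) * rng.integers(1, 5, size=(8, 6))
X, Y = implicit_als(R, factors=4, iterations=15)
scores = X @ Y.T  # observed pairs should score higher than unobserved ones
```

The `implicit` library computes the same updates with sparse algebra and the (Y^T Y + Y^T (Cu - I) Y) trick, which is what makes ALS tractable on millions of users.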
Part 9: Putting It All Together --- A Recommender System Design Checklist
Building a recommender in a notebook is one thing. Putting it into production is another. Here is the practitioner's checklist:
1. Define the objective. "Recommend stuff" is not an objective. "Increase click-through rate on the homepage carousel from 3% to 5%" is. "Reduce subscriber churn by recommending content that increases weekly engagement" is. The objective determines the evaluation metric.
2. Characterize your data.
   - Explicit or implicit feedback?
   - How sparse is the user-item matrix?
   - How severe is the cold start problem (what fraction of users have fewer than 5 interactions)?
   - Do items have content features (text, categories, images)?
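These questions take only a few lines to answer. A sketch, assuming an interaction log with one row per observed (user, item) pair; the helper name `characterize` is ours:

```python
import pandas as pd

def characterize(ratings_df, cold_threshold=5):
    """Summarize sparsity and cold start severity of an interaction log."""
    n_users = ratings_df['user_id'].nunique()
    n_items = ratings_df['item_id'].nunique()
    density = len(ratings_df) / (n_users * n_items)
    per_user = ratings_df.groupby('user_id').size()
    return {
        'n_users': n_users,
        'n_items': n_items,
        'sparsity': 1.0 - density,
        'cold_user_fraction': float((per_user < cold_threshold).mean()),
    }

# Toy example: 2 users, 5 items, 6 observed interactions
toy = pd.DataFrame({'user_id': [0, 0, 0, 0, 0, 1],
                    'item_id': [0, 1, 2, 3, 4, 0]})
print(characterize(toy))
```

Production matrices routinely come out above 99% sparse, which is the single number that most constrains your choice of algorithm.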
3. Establish baselines. Before building anything sophisticated, measure the performance of:
   - Random recommendations
   - Popularity baseline (most popular items)
   - Simple item-based CF
If your fancy model does not beat popularity by a meaningful margin, you do not need a fancy model.
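A minimal harness for that comparison, using leave-one-out hit rate. The recommenders below are toy stand-ins constructed so that popularity wins by design; in practice you would plug in the chapter's actual models.

```python
import numpy as np

def hit_rate(recommend_fn, test_items_by_user, n=10):
    """Fraction of users whose held-out item appears in the top-n list."""
    hits = sum(
        1 for user, held_out in test_items_by_user.items()
        if held_out in recommend_fn(user, n)
    )
    return hits / len(test_items_by_user)

# Toy setup: 100-item catalog, every user's held-out item is item 0
rng = np.random.default_rng(0)
catalog = list(range(100))
test_items = {u: 0 for u in range(50)}
random_rec = lambda u, n: list(rng.choice(catalog, size=n, replace=False))
popular_rec = lambda u, n: list(range(n))  # pretend 0..n-1 are most popular
print(hit_rate(random_rec, test_items), hit_rate(popular_rec, test_items))
```

The same `hit_rate` function accepts any `recommend_fn(user, n)` callable, so random, popularity, item-based CF, and the hybrid can all be scored on identical held-out data.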
4. Choose your approach.
| Situation | Recommended Approach |
|---|---|
| Dense explicit ratings | SVD or NMF |
| Sparse explicit ratings | SVD with regularization |
| Implicit feedback | ALS or BPR |
| Rich item content, sparse interactions | Content-based or hybrid |
| Severe cold start | Switching hybrid with popularity fallback |
| Need real-time updates | Online learning or approximate nearest neighbors |
5. Evaluate with ranking metrics. Hit Rate, NDCG, and MAP at multiple cutoffs (5, 10, 20). Report all three. A model that improves NDCG@10 by 0.05 over the baseline is a material improvement.
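Reporting a metric at several cutoffs is straightforward once it is written as a function of the cutoff. A compact binary-relevance NDCG sketch (assuming the input is a 0/1 relevance vector in ranked order, with all of the user's relevant items counted within that vector):

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k):
    """Binary-relevance NDCG@k for one user's ranked list."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    # DCG: relevance discounted by log2 of the (1-based) position + 1
    dcg = (rel / np.log2(np.arange(2, rel.size + 2))).sum()
    # Ideal DCG: all relevant items packed at the top of the list
    n_rel = int(np.asarray(ranked_relevance, dtype=float).sum())
    ideal = np.ones(min(k, n_rel)) if n_rel else np.zeros(1)
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Report one ranked list at the three standard cutoffs
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
for k in (5, 10, 20):
    print(k, round(ndcg_at_k(ranked, k), 4))
```

Averaging `ndcg_at_k` over all test users at each cutoff gives the numbers to put in the report.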
6. A/B test before full deployment. Offline ranking metrics are necessary but not sufficient. A model with better NDCG offline can still perform worse in production due to position bias, presentation effects, or feedback loops. A/B test with a business metric (CTR, revenue per session, churn rate) as the primary outcome.
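For the simple CTR comparison, the significance check needs nothing beyond the standard library. A sketch of the pooled two-proportion z-test; the click and impression counts below are illustrative, not from a real experiment:

```python
from math import sqrt, erf

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for a difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control CTR 3.0% vs variant CTR 3.5%, 20k impressions each arm
z, p = two_proportion_z(600, 20_000, 700, 20_000)
print(f"z={z:.2f}, p={p:.4f}")
```

This only covers the per-impression CTR case; session-level or revenue metrics have correlated observations and usually call for bootstrap or cluster-robust methods instead.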
Core Principle --- A recommender system is not a model. It is a product feature. The model is one component; the others are the candidate generation pipeline, the ranking layer, the filtering rules (e.g., "do not recommend items the user already purchased"), the presentation logic, and the feedback loop. Getting the model right is necessary. It is not sufficient.
Summary
This chapter covered the three paradigms of recommender systems --- collaborative filtering, content-based filtering, and hybrid approaches --- along with the evaluation framework that separates textbook exercises from production systems.
Collaborative filtering (user-based and item-based) exploits the structure of the user-item interaction matrix. Matrix factorization (SVD) compresses that structure into latent factors that generalize better to sparse data. Content-based filtering uses item features to handle the cold start problem. Hybrid recommenders combine all three signals, switching strategies based on how much data is available for each user.
The critical evaluation insight: use ranking metrics (NDCG, MAP, Hit Rate), not RMSE. A recommender's job is to surface the right items at the top of the list, not to predict exact ratings. Evaluate at multiple cutoffs. A/B test before you ship.
The two case studies that follow apply these concepts to ShopSmart (product recommendations) and StreamFlow (content recommendations for churn reduction). Both demonstrate the full pipeline: data preparation, model training, offline evaluation with ranking metrics, cold start handling, and business impact estimation.
Key Terms
| Term | Definition |
|---|---|
| Recommender system | A system that predicts which items a user will prefer, typically by ranking items |
| Collaborative filtering | Recommendation based on patterns in user-item interactions across all users |
| Content-based filtering | Recommendation based on similarity between item features and user preferences |
| Hybrid approach | Combining collaborative and content-based methods |
| User-based CF | Finding similar users and recommending what they liked |
| Item-based CF | Finding similar items to those the user already liked |
| Cosine similarity | Similarity measure based on the angle between two vectors |
| Pearson correlation | Similarity measure based on the linear correlation of co-rated items |
| Matrix factorization | Decomposing the user-item matrix into low-rank factor matrices |
| SVD (recommender context) | Singular value decomposition used to learn latent user and item factors |
| Latent factors | Hidden dimensions (e.g., genre preference, quality taste) discovered by matrix factorization |
| Cold start problem | Inability to recommend for new users or new items that lack interaction data |
| Implicit feedback | Behavioral signals (clicks, views, purchases) that imply preference |
| Explicit feedback | Direct preference expressions (ratings, reviews, thumbs up/down) |
| NDCG | Normalized Discounted Cumulative Gain --- ranking metric that rewards relevant items at higher positions |
| MAP | Mean Average Precision --- ranking metric averaging precision at each relevant item's position |
| Hit Rate | Fraction of users for whom at least one relevant item appears in the top-N |
| Surprise library | Python library for building and evaluating recommender systems on explicit feedback |
| Sparsity problem | The challenge that most entries in the user-item matrix are missing |
Next: Case Study 1 --- ShopSmart Product Recommendations applies collaborative filtering and hybrid methods to e-commerce product recommendations. Case Study 2 --- StreamFlow Content Recommendations builds a content recommender designed to reduce subscriber churn.