Exercises: Chapter 24

Recommender Systems: Collaborative Filtering, Content-Based, and Hybrid Approaches


Exercise 1: Collaborative Filtering by Hand (Conceptual)

Consider the following user-item rating matrix (blank entries = unobserved):

Item A Item B Item C Item D Item E
User 1 5 4 3
User 2 4 5 2
User 3 3 4 5
User 4 3 4 5
User 5 5 3 4

a) Mean-center the ratings for each user (subtract each user's mean rating from their observed ratings). Show the centered matrix.

b) Compute the cosine similarity between User 1 and every other user using the centered ratings. Only use items that both users have rated (co-rated items). Which user is most similar to User 1?

c) Using user-based CF with k=2 nearest neighbors, predict User 1's rating for Item C. Show the formula and calculation.

d) Now compute item-item cosine similarity between Item C and all other items (using the original, non-centered ratings). Which item is most similar to Item C?

e) Using item-based CF with k=2 nearest neighbor items, predict User 1's rating for Item C. Compare this prediction with your answer from (c). Why might they differ?

f) The sparsity of this matrix is 40% (10 of the 25 cells are empty). In a production system with 1 million users and 50,000 items, what sparsity would you expect? What problems does extreme sparsity cause for neighborhood-based CF?
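The mechanics of parts (a)-(c) can be sketched in a few lines of numpy. The matrix below is a hypothetical stand-in, not the table above, so the by-hand exercise is not given away; `np.nan` marks unobserved entries.

```python
import numpy as np

# Hypothetical toy matrix (NOT the one in the exercise); np.nan = unobserved.
R = np.array([
    [5.0, 4.0, np.nan, 3.0],
    [4.0, np.nan, 4.0, 2.0],
    [np.nan, 3.0, 5.0, 4.0],
])

def predict_user_based(R, user, item, k=2):
    """Predict R[user, item] with mean-centered user-based CF."""
    means = np.array([np.nanmean(row) for row in R])
    centered = R - means[:, None]

    sims = []
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue  # neighbor must have rated the target item
        mask = ~np.isnan(R[user]) & ~np.isnan(R[v])  # co-rated items only
        a, b = centered[user, mask], centered[v, mask]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append((float(a @ b) / denom, v))

    top = sorted(sims, reverse=True)[:k]  # k most similar neighbors
    num = sum(s * centered[v, item] for s, v in top)
    den = sum(abs(s) for s, _ in top)
    return means[user] + (num / den if den > 0 else 0.0)

print(round(predict_user_based(R, user=0, item=2), 2))
```

Note the structure mirrors the exercise: center first, compute similarity on co-rated items only, then add the weighted-neighbor deviation back onto the target user's mean.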


Exercise 2: Content-Based Filtering (Conceptual + Code)

You have a movie catalog with the following features:

Movie  Genre   Director    Year  Keywords
M1     action  director_a  2020  explosions, cars, heist
M2     action  director_b  2021  space, robots, ai
M3     comedy  director_c  2019  romance, wedding, family
M4     drama   director_a  2020  war, history, sacrifice
M5     action  director_a  2022  explosions, military, espionage

A user has rated M1 = 5, M4 = 4, M3 = 2.

a) (Conceptual) Based on the user's ratings, describe qualitatively what kind of movies this user prefers. What item features seem to matter most?

b) (Conceptual) Using only the Genre and Director features, which unrated movie (M2 or M5) would a content-based recommender rank higher? Explain.

c) (Code) Implement a content-based recommender using TF-IDF on the Keywords column. Build the user profile from positively-rated items (rating >= 4) and rank M2 and M5.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

keywords = [
    "explosions cars heist",
    "space robots ai",
    "romance wedding family",
    "war history sacrifice",
    "explosions military espionage"
]

user_ratings = {0: 5, 3: 4, 2: 2}  # movie_index: rating

# Your code here:
# 1. Build TF-IDF matrix from keywords
# 2. Build user profile from items rated >= 4
# 3. Compute similarity between profile and M2, M5
# 4. Which movie ranks higher?
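One possible shape for the solution (a sketch, not the only valid approach): average the TF-IDF vectors of the positively-rated items into a profile vector, then score each unrated movie by cosine similarity to that profile.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

keywords = [
    "explosions cars heist",          # M1
    "space robots ai",                # M2
    "romance wedding family",         # M3
    "war history sacrifice",          # M4
    "explosions military espionage",  # M5
]
user_ratings = {0: 5, 3: 4, 2: 2}  # movie_index: rating

tfidf = TfidfVectorizer().fit_transform(keywords)        # 1. TF-IDF matrix
liked = [i for i, r in user_ratings.items() if r >= 4]   # 2. items rated >= 4
profile = np.asarray(tfidf[liked].mean(axis=0)).reshape(1, -1)

for idx in (1, 4):                                       # 3. score M2 and M5
    sim = cosine_similarity(profile, tfidf[idx].toarray())[0, 0]
    print(f"M{idx + 1}: similarity = {sim:.3f}")
```

M5 should come out on top: it shares "explosions" with liked movie M1, while M2 shares no keywords with any liked movie, so its cosine similarity to the profile is zero.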

d) (Conceptual) The user has never rated a documentary. Can a content-based recommender ever recommend a documentary to this user? What about a collaborative filtering system? Explain the filter bubble problem.


Exercise 3: SVD Matrix Factorization (Code)

Using the surprise library, train SVD models with different numbers of latent factors and compare their performance.

import numpy as np
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

np.random.seed(42)

# Generate synthetic ratings
n_users, n_items = 300, 100
n_ratings = 5000

user_ids = np.random.randint(0, n_users, size=n_ratings)
item_ids = np.random.randint(0, n_items, size=n_ratings)

# Latent structure: 3 true factors
true_factors = 3
U = np.random.randn(n_users, true_factors)
V = np.random.randn(n_items, true_factors)
ratings = np.clip(
    np.sum(U[user_ids] * V[item_ids], axis=1) * 0.5 + 3 + np.random.randn(n_ratings) * 0.3,
    1, 5
).round(1)

df = pd.DataFrame({'user': user_ids, 'item': item_ids, 'rating': ratings})
df = df.drop_duplicates(subset=['user', 'item']).reset_index(drop=True)

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

# a) Train SVD with n_factors = [2, 3, 5, 10, 20, 50, 100]
#    Use 5-fold cross-validation. Record RMSE and MAE for each.
#    Plot RMSE vs. n_factors.

# b) The true number of latent factors is 3.
#    Does the RMSE curve show an elbow near 3?
#    What happens when n_factors >> true factors? Why?

# c) Train SVD with n_factors=3 and n_factors=50.
#    For each, extract the item factor matrices (model.qi).
#    Compute pairwise cosine similarity between the first 10 items.
#    Are the similarity patterns consistent between the two models?

# d) For the n_factors=50 model, how many of the 50 factors
#    capture meaningful variance vs. noise?
#    Hint: look at the magnitude of the factor vectors.
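For part (d), the hint can be made concrete with plain numpy. The matrix below is a hypothetical stand-in for `model.qi` (substitute your fitted factors): three signal-bearing factors plus many heavily shrunk ones, mimicking what L2 regularization does to factors that only fit noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model.qi (n_items x n_factors).
n_items = 100
qi = np.hstack([
    rng.normal(scale=1.0, size=(n_items, 3)),    # strong, signal-bearing factors
    rng.normal(scale=0.05, size=(n_items, 47)),  # weak, noise-fitting factors
])

norms = np.linalg.norm(qi, axis=0)  # one magnitude per latent factor
print("largest factor norms:", np.round(np.sort(norms)[::-1][:5], 2))
print("factors above 10x the median norm:", int((norms > 10 * np.median(norms)).sum()))
```

A sharp gap in the sorted norms (here, three large values followed by a long flat tail) is one practical signal that only a few factors carry meaningful variance.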

Exercise 4: Ranking Metrics (Code)

Implement and compare ranking evaluation metrics on a simulated recommender output.

import numpy as np

np.random.seed(42)

# Simulated scenario: 100 users, each with 5 relevant items
# Three recommender systems produce different ranked lists of 20 items

n_users = 100
n_total_items = 500
k = 20  # recommendation list length

# Ground truth: 5 relevant items per user
test_items = {}
for u in range(n_users):
    test_items[u] = set(np.random.choice(n_total_items, size=5, replace=False))

# Model A: relevant items clustered at positions 1-5 (excellent ranking)
# Model B: relevant items scattered uniformly across positions 1-20
# Model C: relevant items clustered at positions 16-20 (poor ranking)

recs_a, recs_b, recs_c = {}, {}, {}
for u in range(n_users):
    relevant = list(test_items[u])
    irrelevant = list(set(range(n_total_items)) - test_items[u])
    np.random.shuffle(irrelevant)

    # Model A: place relevant items at top
    recs_a[u] = relevant + irrelevant[:k - len(relevant)]

    # Model B: scatter relevant items randomly in list
    mixed = relevant + irrelevant[:k - len(relevant)]
    np.random.shuffle(mixed)
    recs_b[u] = mixed

    # Model C: place relevant items at bottom
    recs_c[u] = irrelevant[:k - len(relevant)] + relevant

# a) Implement Hit Rate@K, NDCG@K, and MAP@K (or use the implementations
#    from the chapter). Compute all three metrics at k=5, 10, 20
#    for each model.

# b) Which metric best distinguishes Model A from Model B?
#    Which metric is least sensitive to the position of relevant items?

# c) Create a Model D where relevant items are at positions 1, 5, 10, 15, 20.
#    Compute NDCG@10 and MAP@10. Explain why NDCG and MAP give
#    different scores for this model.

# d) Suppose Model A achieves NDCG@10 = 0.92 offline but only 0.74
#    in an A/B test. List three possible reasons for the discrepancy.
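If you are not reusing the chapter's implementations, here is one reasonable sketch of the three metrics with binary relevance. Conventions vary; this MAP variant normalizes by min(|relevant|, k).

```python
import numpy as np

def hit_rate_at_k(recs, relevant, k):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return float(any(item in relevant for item in recs[:k]))

def ndcg_at_k(recs, relevant, k):
    """Binary-relevance NDCG: DCG of the list over DCG of an ideal list."""
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(recs[:k])
              if item in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def map_at_k(recs, relevant, k):
    """Mean of precision@i over every rank i holding a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recs[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(relevant), k) if relevant else 0.0

# Sanity check: a perfect ranking should score 1.0 on all three metrics.
recs, relevant = [1, 2, 3, 4, 5, 6], {1, 2, 3}
print(hit_rate_at_k(recs, relevant, 5),
      ndcg_at_k(recs, relevant, 5),
      map_at_k(recs, relevant, 5))  # → 1.0 1.0 1.0
```

Averaging each metric over all users of a model gives the per-model numbers asked for in part (a).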

Exercise 5: Cold Start Strategies (Conceptual + Code)

A music streaming service launches with 10 million songs and 0 users. Over the first month, users sign up at a rate of 50,000 per day.

a) (Conceptual) On day 1, what recommendation strategy is the only option? Why?

b) (Conceptual) By day 7, you have 350,000 users but most have listened to fewer than 10 songs. Which users can benefit from collaborative filtering? What threshold would you set?

c) (Conceptual) You have rich metadata for every song: artist, genre, tempo, key, danceability, energy, valence (from audio analysis). Design a content-based system that can provide personalized recommendations from a user's very first listen. Describe the user profile update process.

d) (Code) Implement a switching hybrid that transitions from popularity to content-based to CF as user history grows.

import numpy as np
import pandas as pd

np.random.seed(42)

# Simulate 1000 users with varying amounts of history
n_users = 1000
# User history lengths: exponentially distributed (many new users, few veterans)
history_lengths = np.random.exponential(scale=10, size=n_users).astype(int)
history_lengths = np.clip(history_lengths, 0, 100)

print(f"Users with 0 listens: {(history_lengths == 0).sum()}")
print(f"Users with 1-4 listens: {((history_lengths >= 1) & (history_lengths < 5)).sum()}")
print(f"Users with 5-19 listens: {((history_lengths >= 5) & (history_lengths < 20)).sum()}")
print(f"Users with 20+ listens: {(history_lengths >= 20).sum()}")

# Implement:
# 1. A function that assigns each user to a strategy tier:
#    - 0 listens: popularity
#    - 1-4 listens: content-based
#    - 5+ listens: collaborative filtering (hybrid)
# 2. Compute what fraction of users fall into each tier
# 3. Simulate the distribution after 30 days (assume each user
#    listens to 3 songs per day on average). How does the tier
#    distribution change?
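Step 1 of the task above can be sketched as follows (tier names and thresholds copied from the comments; the simulation reuses the same history-length distribution):

```python
import numpy as np
from collections import Counter

def strategy_tier(n_listens):
    """Map a user's history length to a recommendation strategy tier."""
    if n_listens == 0:
        return "popularity"
    if n_listens < 5:
        return "content-based"
    return "collaborative"

np.random.seed(42)
history_lengths = np.clip(
    np.random.exponential(scale=10, size=1000).astype(int), 0, 100)

counts = Counter(strategy_tier(int(h)) for h in history_lengths)
for tier in ("popularity", "content-based", "collaborative"):
    print(f"{tier}: {counts[tier] / len(history_lengths):.1%}")
```

For step 3, adding roughly 3 × 30 listens to every user's history and re-running the same tally shows nearly everyone migrating into the collaborative tier.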

Exercise 6: Building a Complete Hybrid Recommender (Code)

Build a hybrid recommender from scratch on the MovieLens-style synthetic data below. This exercise integrates all concepts from the chapter.

import numpy as np
import pandas as pd
from surprise import Dataset, Reader, SVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(42)

# --- Generate synthetic movie data ---
n_users = 500
n_movies = 150
n_ratings = 8000

# Movie metadata
genres = ['action', 'comedy', 'drama', 'horror', 'sci-fi',
          'romance', 'thriller', 'documentary', 'animation', 'mystery']
movie_genres = [np.random.choice(genres, size=np.random.randint(1, 4),
                replace=False).tolist() for _ in range(n_movies)]
movie_descriptions = [' '.join(g) + ' ' + ' '.join(
    np.random.choice(['adventure', 'dark', 'funny', 'intense', 'emotional',
                       'suspense', 'family', 'classic', 'modern', 'indie',
                       'blockbuster', 'cerebral', 'atmospheric', 'gritty'],
                      size=3, replace=False))
    for g in movie_genres]

# Ratings
user_ids = np.random.randint(0, n_users, size=n_ratings)
item_ids = np.random.randint(0, n_movies, size=n_ratings)
# Create some latent structure
U = np.random.randn(n_users, 5)
V = np.random.randn(n_movies, 5)
ratings = np.clip(
    np.sum(U[user_ids] * V[item_ids], axis=1) * 0.4 + 3 + np.random.randn(n_ratings) * 0.5,
    1, 5
).round(1)

ratings_df = pd.DataFrame({'user_id': user_ids, 'item_id': item_ids, 'rating': ratings})
ratings_df = ratings_df.drop_duplicates(subset=['user_id', 'item_id']).reset_index(drop=True)

# Tasks:
# a) Split data: 80% train, 20% test (per-user split)
# b) Train an SVD model on the training data
# c) Build TF-IDF content features from movie_descriptions
# d) Implement a hybrid recommender that:
#    - Users with 10+ train ratings: 70% CF + 30% content
#    - Users with 3-9 train ratings: 30% CF + 70% content
#    - Users with < 3 train ratings: pure content-based
#    - Users with 0 train ratings: popularity baseline
# e) Generate top-10 recommendations for all test users
# f) Evaluate with Hit Rate@10, NDCG@10, and MAP@10
# g) Compare the hybrid against:
#    - Pure SVD
#    - Pure content-based
#    - Popularity baseline
# h) For which user segment (by number of train ratings) does each
#    method perform best?
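The core of the hybrid in task (d) is a per-user weighted blend of normalized CF and content scores. A minimal sketch, with the tier weights from the task list and hypothetical score arrays:

```python
import numpy as np

def hybrid_scores(cf_scores, content_scores, pop_scores, n_train_ratings):
    """Blend per-item score arrays according to the user's training history."""
    if n_train_ratings == 0:
        return pop_scores              # popularity baseline
    if n_train_ratings < 3:
        return content_scores          # pure content-based
    w_cf = 0.7 if n_train_ratings >= 10 else 0.3

    def minmax(x):
        # put both score types on a comparable 0-1 scale before blending
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return w_cf * minmax(cf_scores) + (1 - w_cf) * minmax(content_scores)

# hypothetical scores for 5 candidate movies
cf = np.array([4.2, 3.1, 3.9, 2.5, 4.8])
content = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
pop = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

top = np.argsort(hybrid_scores(cf, content, pop, n_train_ratings=12))[::-1]
print("ranked items:", top.tolist())
```

The min-max normalization matters: SVD predictions live on the 1-5 rating scale while cosine similarities live in roughly 0-1, so blending raw scores would let one signal dominate regardless of the weights.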

Exercise 7: Evaluation Pitfalls (Conceptual)

a) A recommender system achieves NDCG@10 = 0.85 on a held-out test set. The team deploys it, and average session time drops by 8%. What went wrong? List three specific failure modes where strong offline metrics do not translate to good online performance.

b) You are evaluating a recommender on implicit feedback data (clicks). Your test set is built by holding out the most recent click for each user. A colleague points out that this evaluation is biased. Explain the bias and propose a fix.

c) A product manager asks: "Our recommender has MAP@10 = 0.12. Is that good?" How do you answer this question? What context is needed to interpret an absolute MAP score?

d) Your team is debating whether to optimize for NDCG@5 or Hit Rate@10. The product is a music app where users typically listen to the first suggestion and skip the rest. Which metric better captures the user experience? What if the product were a news feed where users scroll through 10-20 articles?

e) You train a recommender on 12 months of data. An A/B test shows it outperforms the baseline in months 13-14 but underperforms in months 15-16. Diagnose this decay and propose a solution.


Exercise 8: Recommender System Design (Applied)

Design (do not implement) a recommender system for each of the following scenarios. For each, specify: (1) feedback type (explicit/implicit), (2) primary algorithm, (3) cold start strategy, (4) evaluation metric, and (5) one unique challenge.

a) Online grocery store. Users reorder many of the same items weekly but occasionally try new products.

b) Job posting board. Employers post jobs; candidates browse and apply. Recommendations serve both sides.

c) Academic paper search. Researchers look for relevant papers to cite. The catalog grows by 2 million papers per year.

d) Real estate listings. Home buyers search for properties. Each user buys at most one item (a house). The "item" disappears from the catalog once sold.

e) Playlist generation for a music app. The system must generate a sequence of 30 songs, not just a set. Order matters: tempo, energy, and mood should flow naturally.


These exercises cover Chapter 24: Recommender Systems. Return to the chapter for reference.