Case Study 1: ShopSmart Product Recommendations
Background
ShopSmart, the e-commerce platform from Chapters 20-23, has 200,000 monthly active customers and a catalog of 12,000 products across 45 categories. The existing recommendation system is the association rule engine from Chapter 23: if a customer has item X in their cart, recommend items that co-occur with X above a lift threshold.
The association rule approach works for cross-selling at checkout --- "customers who bought this also bought that" --- but it does not solve the broader recommendation problem. When a customer lands on the homepage, there is no cart to trigger rules from. When a customer browses a category page, the association rules have no context beyond the current product. The merchandising team wants a personalized homepage and category page experience: recommendations tailored to each customer's purchase history, not just the current session.
Marcus Chen, the Head of Analytics, frames the project in business terms: Build a recommender system that personalizes the "Recommended for You" section on the homepage and the "You Might Also Like" section on product pages. Measure success with click-through rate (CTR) and add-to-cart rate. The baseline is the current most-popular-items approach, which achieves 2.8% CTR on the homepage widget.
The team has 18 months of purchase history, product descriptions, and category metadata. The question is whether personalized recommendations can outperform the popularity baseline by a large enough margin to justify the engineering investment.
The Data
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Simulate ShopSmart purchase data ---
n_customers = 5000
n_products = 800

# Product metadata
categories = ['electronics', 'clothing', 'home', 'kitchen', 'sports',
              'books', 'beauty', 'toys', 'office', 'garden']
product_category = np.random.choice(categories, size=n_products)

keyword_pools = {
    'electronics': ['wireless', 'bluetooth', 'smart', 'portable', 'digital', 'usb', 'led'],
    'clothing': ['cotton', 'casual', 'formal', 'slim', 'comfortable', 'lightweight', 'durable'],
    'home': ['modern', 'minimalist', 'storage', 'organizer', 'decorative', 'compact', 'sturdy'],
    'kitchen': ['stainless', 'nonstick', 'ceramic', 'electric', 'dishwasher', 'compact', 'premium'],
    'sports': ['fitness', 'outdoor', 'training', 'lightweight', 'adjustable', 'performance', 'grip'],
    'books': ['bestseller', 'guide', 'reference', 'illustrated', 'practical', 'comprehensive', 'beginner'],
    'beauty': ['organic', 'moisturizing', 'gentle', 'fragrance', 'natural', 'hydrating', 'SPF'],
    'toys': ['educational', 'creative', 'interactive', 'colorful', 'safe', 'buildable', 'STEM'],
    'office': ['ergonomic', 'professional', 'adjustable', 'desk', 'productivity', 'quiet', 'compact'],
    'garden': ['outdoor', 'weather-resistant', 'solar', 'planter', 'organic', 'durable', 'seasonal'],
}
product_keywords = []
for i in range(n_products):
    cat = product_category[i]
    kw = np.random.choice(keyword_pools[cat], size=3, replace=False)
    product_keywords.append(f"{cat} {' '.join(kw)}")
# Customer purchase data with explicit 1-5 ratings,
# generated from a latent preference structure
n_factors = 8
customer_factors = np.random.randn(n_customers, n_factors)
product_factors = np.random.randn(n_products, n_factors)

# Generate purchases based on latent affinity
purchases = []
for cust in range(n_customers):
    # Number of purchases: heavy-tailed (exponential)
    n_purchases = max(1, int(np.random.exponential(scale=12)))
    # Score all products
    scores = customer_factors[cust] @ product_factors.T
    scores += np.random.randn(n_products) * 0.5  # noise
    # Sample from top-scoring products
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    bought = np.random.choice(n_products, size=min(n_purchases, 50),
                              replace=False, p=probs)
    for prod in bought:
        # Assign a rating (explicit feedback, 1-5)
        true_score = customer_factors[cust] @ product_factors[prod]
        rating = np.clip(round(true_score * 0.5 + 3.5 + np.random.randn() * 0.4), 1, 5)
        purchases.append({'customer_id': cust, 'product_id': prod, 'rating': float(rating)})

purchases_df = pd.DataFrame(purchases)
purchases_df = purchases_df.drop_duplicates(
    subset=['customer_id', 'product_id']
).reset_index(drop=True)

print(f"Purchase data: {len(purchases_df)} ratings")
print(f"Customers: {purchases_df['customer_id'].nunique()}")
print(f"Products: {purchases_df['product_id'].nunique()}")
print(f"Sparsity: {1 - len(purchases_df) / (n_customers * n_products):.4f}")
print("\nPurchases per customer:")
print(purchases_df.groupby('customer_id').size().describe())
print("\nRatings per product:")
print(purchases_df.groupby('product_id').size().describe())
The sparsity is extreme. Most customers have purchased a small fraction of the catalog, and most products have been purchased by a small fraction of customers. This is the norm in e-commerce.
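This sparsity is why `csr_matrix` was imported above: a sparse matrix stores only the nonzero interactions, so memory scales with purchases rather than with customers times products. A minimal sketch on toy triples (the ids and ratings here are illustrative, not from the simulation):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for purchases_df: (customer_id, product_id, rating) triples
toy = pd.DataFrame({
    'customer_id': [0, 0, 1, 2],
    'product_id':  [1, 3, 2, 3],
    'rating':      [4.0, 5.0, 3.0, 4.0],
})

# customers x products matrix: only the 4 nonzero entries are stored,
# regardless of how large the dense shape is
ui = csr_matrix(
    (toy['rating'], (toy['customer_id'], toy['product_id'])),
    shape=(3, 5),
)
print(ui.shape)   # (3, 5)
print(ui.nnz)     # 4 stored entries
print(ui[0, 3])   # 5.0
```

The same constructor applied to `purchases_df` would give a 5000 x 800 matrix holding only the tens of thousands of actual ratings.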
Step 1: Establish Baselines
Before building anything sophisticated, measure the baselines.
# Train-test split (per customer)
def train_test_split_per_user(df, test_frac=0.2, random_state=42):
    rng = np.random.RandomState(random_state)
    train_parts, test_parts = [], []
    for uid, group in df.groupby('customer_id'):
        n_test = max(1, int(len(group) * test_frac))
        test_idx = rng.choice(group.index, size=n_test, replace=False)
        train_parts.append(group.drop(test_idx))
        test_parts.append(group.loc[test_idx])
    return pd.concat(train_parts), pd.concat(test_parts)

train_df, test_df = train_test_split_per_user(purchases_df)
print(f"Train: {len(train_df)}, Test: {len(test_df)}")
# Build test set lookup
test_items = test_df.groupby('customer_id')['product_id'].apply(set).to_dict()
train_items = train_df.groupby('customer_id')['product_id'].apply(set).to_dict()

# --- Baseline 1: Popularity ---
product_popularity = (
    train_df.groupby('product_id')
    .agg(n_purchases=('rating', 'count'), mean_rating=('rating', 'mean'))
)
# Damped (Bayesian) average: shrink each product's mean rating toward
# the global mean m with strength C, so thinly rated products cannot
# dominate the ranking
C = product_popularity['n_purchases'].mean()
m = product_popularity['mean_rating'].mean()
product_popularity['score'] = (
    (product_popularity['n_purchases'] * product_popularity['mean_rating'] + C * m) /
    (product_popularity['n_purchases'] + C)
)
popular_products = product_popularity.nlargest(50, 'score').index.tolist()

popularity_recs = {}
for uid in test_items:
    seen = train_items.get(uid, set())
    popularity_recs[uid] = [p for p in popular_products if p not in seen][:20]

# --- Baseline 2: Random ---
all_products = list(purchases_df['product_id'].unique())
random_recs = {}
rng = np.random.RandomState(42)
for uid in test_items:
    seen = train_items.get(uid, set())
    candidates = [p for p in all_products if p not in seen]
    rng.shuffle(candidates)
    random_recs[uid] = candidates[:20]
Ranking Evaluation Functions
def hit_rate_at_k(recommendations, test_items, k=10):
    hits, total = 0, 0
    for uid in test_items:
        if uid not in recommendations:
            continue
        if set(recommendations[uid][:k]) & test_items[uid]:
            hits += 1
        total += 1
    return hits / total if total > 0 else 0.0
def ndcg_at_k(recommendations, test_items, k=10):
    scores = []
    for uid in test_items:
        if uid not in recommendations:
            continue
        relevant = test_items[uid]
        ranked = recommendations[uid][:k]
        dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked)
                  if item in relevant)
        # Ideal DCG: min(len(relevant), k) relevant items in the top positions
        n_rel = min(len(relevant), k)
        idcg = sum(1.0 / np.log2(i + 2) for i in range(n_rel))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return np.mean(scores) if scores else 0.0
def map_at_k(recommendations, test_items, k=10):
    ap_scores = []
    for uid in test_items:
        if uid not in recommendations:
            continue
        relevant = test_items[uid]
        ranked = recommendations[uid][:k]
        hits = 0
        sum_prec = 0.0
        for i, item in enumerate(ranked):
            if item in relevant:
                hits += 1
                sum_prec += hits / (i + 1)
        n_rel = min(len(relevant), k)
        ap_scores.append(sum_prec / n_rel if n_rel > 0 else 0.0)
    return np.mean(ap_scores) if ap_scores else 0.0
# Evaluate baselines
print("Baseline Performance:")
print(f"{'Method':<20} {'HR@5':>8} {'HR@10':>8} {'NDCG@10':>8} {'MAP@10':>8}")
print("-" * 56)
for name, recs in [('Random', random_recs), ('Popularity', popularity_recs)]:
    hr5 = hit_rate_at_k(recs, test_items, k=5)
    hr10 = hit_rate_at_k(recs, test_items, k=10)
    ndcg = ndcg_at_k(recs, test_items, k=10)
    map_score = map_at_k(recs, test_items, k=10)
    print(f"{name:<20} {hr5:>8.4f} {hr10:>8.4f} {ndcg:>8.4f} {map_score:>8.4f}")
Practical Note --- The popularity baseline is always stronger than you expect. In e-commerce, popular products are popular because they appeal broadly. Any personalized model must beat popularity by a meaningful margin to justify its complexity. If your SVD model beats popularity by 0.01 on NDCG@10, you probably do not need SVD.
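One way to judge whether a margin is "meaningful" rather than noise is a paired bootstrap over per-user metric differences. A minimal sketch on synthetic per-user NDCG@10 arrays (the arrays here are illustrative; in context they would be the per-user scores underlying the evaluation functions above):

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, seed=0):
    """Bootstrap a 95% confidence interval for mean(scores_b - scores_a),
    resampling users with replacement so the per-user pairing is kept."""
    rng = np.random.RandomState(seed)
    diffs = np.asarray(scores_b) - np.asarray(scores_a)
    means = [diffs[rng.randint(0, len(diffs), len(diffs))].mean()
             for _ in range(n_boot)]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)

# Synthetic per-user NDCG@10 for two methods (illustrative only)
rng = np.random.RandomState(1)
pop = np.clip(rng.normal(0.10, 0.05, size=500), 0, 1)
hyb = np.clip(pop + rng.normal(0.03, 0.04, size=500), 0, 1)

lo, hi = paired_bootstrap_ci(pop, hyb)
print(f"95% CI for NDCG@10 lift: [{lo:.4f}, {hi:.4f}]")
# An interval that excludes zero suggests the lift is not noise
```

The pairing matters: resampling users, not individual scores, keeps each user's popularity and hybrid results together, which is what makes the comparison fair.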
Step 2: Collaborative Filtering with SVD
# Train SVD
reader = Reader(rating_scale=(1, 5))
train_surprise = Dataset.load_from_df(
    train_df[['customer_id', 'product_id', 'rating']], reader
)
trainset = train_surprise.build_full_trainset()
svd = SVD(n_factors=50, n_epochs=30, lr_all=0.005, reg_all=0.02, random_state=42)
svd.fit(trainset)

# Generate recommendations: score every unseen product per customer
all_product_ids = set(purchases_df['product_id'].unique())
svd_recs = {}
for uid in test_items:
    seen = train_items.get(uid, set())
    unseen = all_product_ids - seen
    preds = [(pid, svd.predict(uid=uid, iid=pid).est) for pid in unseen]
    preds.sort(key=lambda x: x[1], reverse=True)
    svd_recs[uid] = [pid for pid, _ in preds[:20]]

# Evaluate
print("\nSVD Performance:")
for k in [5, 10, 20]:
    hr = hit_rate_at_k(svd_recs, test_items, k=k)
    ndcg = ndcg_at_k(svd_recs, test_items, k=k)
    map_score = map_at_k(svd_recs, test_items, k=k)
    print(f"  @{k:>2}: HR={hr:.4f}  NDCG={ndcg:.4f}  MAP={map_score:.4f}")
Step 3: Content-Based Filtering
# Build TF-IDF features from product keywords
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(product_keywords)
print(f"TF-IDF matrix: {tfidf_matrix.shape}")

# Content similarity between all products
content_sim = cosine_similarity(tfidf_matrix)
np.fill_diagonal(content_sim, 0)

def content_recommend(uid, train_df, content_sim, n=20):
    """
    Content-based: recommend products similar to what the customer purchased,
    weighted by their rating.
    """
    user_history = train_df[train_df['customer_id'] == uid]
    if len(user_history) == 0:
        return []
    seen = set(user_history['product_id'].values)
    # Weight each purchased product's similarity by the rating
    scores = np.zeros(content_sim.shape[0])
    total_weight = 0
    for _, row in user_history.iterrows():
        pid = int(row['product_id'])
        weight = row['rating'] / 5.0
        if pid < content_sim.shape[0]:
            scores += weight * content_sim[pid]
            total_weight += weight
    if total_weight > 0:
        scores /= total_weight
    # Exclude seen items from the ranking
    for pid in seen:
        if pid < len(scores):
            scores[pid] = -1
    top_n = np.argsort(scores)[::-1][:n]
    return list(top_n)

# Generate content-based recommendations
content_recs = {}
for uid in test_items:
    content_recs[uid] = content_recommend(uid, train_df, content_sim, n=20)

# Evaluate
print("Content-Based Performance:")
for k in [5, 10, 20]:
    hr = hit_rate_at_k(content_recs, test_items, k=k)
    ndcg = ndcg_at_k(content_recs, test_items, k=k)
    map_score = map_at_k(content_recs, test_items, k=k)
    print(f"  @{k:>2}: HR={hr:.4f}  NDCG={ndcg:.4f}  MAP={map_score:.4f}")
Step 4: Hybrid Model
The hybrid blends SVD and content-based scores, with the weights adapting to each customer's history depth.
def hybrid_recommend_shopsmart(uid, train_df, svd_model, content_sim,
                               all_products, n=20):
    """
    Switching hybrid recommender for ShopSmart.
    - 10+ purchases: 75% CF + 25% content
    - 3-9 purchases: 40% CF + 60% content
    - 1-2 purchases: pure content-based
    - 0 purchases: popularity baseline
    """
    user_history = train_df[train_df['customer_id'] == uid]
    n_hist = len(user_history)
    seen = set(user_history['product_id'].values)
    candidates = [p for p in all_products if p not in seen]
    if n_hist == 0:
        return [p for p in popular_products if p not in seen][:n]

    # Content scores
    content_scores = np.zeros(max(all_products) + 1)
    total_weight = 0
    for _, row in user_history.iterrows():
        pid = int(row['product_id'])
        weight = row['rating'] / 5.0
        if pid < content_sim.shape[0]:
            content_scores[:content_sim.shape[0]] += weight * content_sim[pid]
            total_weight += weight
    if total_weight > 0:
        content_scores /= total_weight

    # Normalize content scores to 0-1 over the candidate set
    cs_min, cs_max = content_scores[candidates].min(), content_scores[candidates].max()
    if cs_max > cs_min:
        content_scores_norm = (content_scores - cs_min) / (cs_max - cs_min)
    else:
        content_scores_norm = np.zeros_like(content_scores)

    if n_hist < 3:
        # Pure content-based
        scored = [(p, content_scores_norm[p]) for p in candidates]
    else:
        # Blend with CF
        cf_weight = 0.75 if n_hist >= 10 else 0.40
        cb_weight = 1.0 - cf_weight
        # CF scores from SVD
        cf_scores = {}
        for pid in candidates:
            pred = svd_model.predict(uid=uid, iid=pid)
            cf_scores[pid] = (pred.est - 1) / 4  # normalize 1-5 to 0-1
        scored = [
            (p, cf_weight * cf_scores.get(p, 0.5) + cb_weight * content_scores_norm[p])
            for p in candidates
        ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:n]]

# Generate hybrid recommendations
hybrid_recs = {}
for uid in test_items:
    hybrid_recs[uid] = hybrid_recommend_shopsmart(
        uid, train_df, svd, content_sim, all_product_ids, n=20
    )

# Evaluate
print("Hybrid Performance:")
for k in [5, 10, 20]:
    hr = hit_rate_at_k(hybrid_recs, test_items, k=k)
    ndcg = ndcg_at_k(hybrid_recs, test_items, k=k)
    map_score = map_at_k(hybrid_recs, test_items, k=k)
    print(f"  @{k:>2}: HR={hr:.4f}  NDCG={ndcg:.4f}  MAP={map_score:.4f}")
Step 5: Full Comparison
# Compile all results
print("\n" + "=" * 70)
print("ShopSmart Recommender Comparison")
print("=" * 70)
print(f"{'Method':<25} {'HR@5':>8} {'HR@10':>8} {'NDCG@10':>8} {'MAP@10':>8}")
print("-" * 60)
for name, recs in [
    ('Random', random_recs),
    ('Popularity', popularity_recs),
    ('Content-Based', content_recs),
    ('SVD (CF)', svd_recs),
    ('Hybrid', hybrid_recs),
]:
    hr5 = hit_rate_at_k(recs, test_items, k=5)
    hr10 = hit_rate_at_k(recs, test_items, k=10)
    ndcg = ndcg_at_k(recs, test_items, k=10)
    map_score = map_at_k(recs, test_items, k=10)
    print(f"{name:<25} {hr5:>8.4f} {hr10:>8.4f} {ndcg:>8.4f} {map_score:>8.4f}")
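matplotlib was imported at the top of the script; the comparison table is easier to present as a grouped bar chart. A minimal sketch with placeholder metric values (in context, the values would come from the metric calls in the loop above):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render to file, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values for illustration; in context these come from
# hit_rate_at_k / ndcg_at_k in the comparison loop
methods = ['Random', 'Popularity', 'Content', 'SVD', 'Hybrid']
hr10 = [0.02, 0.18, 0.15, 0.24, 0.27]
ndcg10 = [0.01, 0.09, 0.08, 0.13, 0.15]

x = np.arange(len(methods))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, hr10, width, label='HR@10')
ax.bar(x + width / 2, ndcg10, width, label='NDCG@10')
ax.set_xticks(x)
ax.set_xticklabels(methods)
ax.set_ylabel('Score')
ax.set_title('ShopSmart recommender comparison')
ax.legend()
fig.tight_layout()
fig.savefig('recommender_comparison.png')
```

A chart like this is what goes in Marcus's deck; the raw table stays in the appendix.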
Step 6: Analyze by User Segment
The aggregate numbers hide important variation. How does each method perform for heavy buyers vs. light buyers?
# Segment users by purchase history
user_purchase_counts = train_df.groupby('customer_id').size()
segments = {
    'Light (1-3 purchases)': user_purchase_counts[
        (user_purchase_counts >= 1) & (user_purchase_counts <= 3)
    ].index,
    'Medium (4-10 purchases)': user_purchase_counts[
        (user_purchase_counts >= 4) & (user_purchase_counts <= 10)
    ].index,
    'Heavy (11+ purchases)': user_purchase_counts[
        user_purchase_counts >= 11
    ].index,
}

print("\nNDCG@10 by User Segment:")
print(f"{'Segment':<30} {'Popularity':>12} {'Content':>12} {'SVD':>12} {'Hybrid':>12}")
print("-" * 80)
for seg_name, seg_users in segments.items():
    seg_test = {u: test_items[u] for u in seg_users if u in test_items}
    if len(seg_test) == 0:
        continue
    pop_ndcg = ndcg_at_k(popularity_recs, seg_test, k=10)
    cb_ndcg = ndcg_at_k(content_recs, seg_test, k=10)
    svd_ndcg = ndcg_at_k(svd_recs, seg_test, k=10)
    hyb_ndcg = ndcg_at_k(hybrid_recs, seg_test, k=10)
    print(f"{seg_name:<30} {pop_ndcg:>12.4f} {cb_ndcg:>12.4f} "
          f"{svd_ndcg:>12.4f} {hyb_ndcg:>12.4f}")

print(f"\n(Users per segment: "
      f"{', '.join(f'{k}: {len(v)}' for k, v in segments.items())})")
Practical Note --- The segment analysis almost always reveals that CF methods dominate for heavy users (lots of interaction data to learn from) while content-based methods close the gap for light users (where CF has insufficient signal). The hybrid wins overall because it allocates the right method to each segment. This is the key insight: no single algorithm is best for all users.
Business Impact Estimation
# Translate NDCG improvement to CTR improvement (rough estimation)
# Industry heuristic: NDCG improvement correlates ~linearly with CTR
# improvement in the 0.05-0.20 NDCG range
popularity_ndcg = ndcg_at_k(popularity_recs, test_items, k=10)
hybrid_ndcg = ndcg_at_k(hybrid_recs, test_items, k=10)
relative_improvement = (hybrid_ndcg - popularity_ndcg) / max(popularity_ndcg, 0.001)

baseline_ctr = 0.028  # 2.8% current CTR
estimated_ctr = baseline_ctr * (1 + relative_improvement)

print("Business Impact Estimation:")
print(f"  Baseline CTR (popularity): {baseline_ctr:.1%}")
print(f"  NDCG improvement: {popularity_ndcg:.4f} -> {hybrid_ndcg:.4f} "
      f"({relative_improvement:+.1%})")
print(f"  Estimated CTR with hybrid: {estimated_ctr:.1%}")
print("\n  Monthly sessions: 200,000 customers * 8 sessions = 1,600,000")
print(f"  Additional clicks/month: {1_600_000 * (estimated_ctr - baseline_ctr):,.0f}")
print(f"  At $0.50 avg revenue/click: ${1_600_000 * (estimated_ctr - baseline_ctr) * 0.50:,.0f}/month")
print("\n  NOTE: This is an offline estimate. A/B testing is required to")
print("  validate the CTR lift before committing to full deployment.")
Marcus's Recommendation to the VP
Marcus presents the following to the VP of Product:
- The hybrid model outperforms popularity across all segments, with the largest gains for medium and heavy buyers who have enough history for personalization.
- The recommended deployment is the switching hybrid from Step 4: popularity for brand-new customers, pure content-based recommendations for customers with only 1-2 purchases, a content-weighted blend for customers with 3-9 purchases, and the CF-weighted hybrid for established customers with 10 or more.
- Estimated revenue impact: the offline metrics suggest a 20-40% relative improvement in recommendation CTR over the popularity baseline. At ShopSmart's volume, this translates to meaningful incremental revenue.
- Required A/B test: a 2-week A/B test splitting 10% of traffic to the hybrid model, measuring CTR, add-to-cart rate, and revenue per session as the primary metrics. If the hybrid outperforms popularity by at least 15% relative CTR lift, the team recommends full deployment.
- Cold start is handled by design: the switching architecture ensures every customer gets reasonable recommendations regardless of history depth.
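The proposed test design can be sanity-checked with a standard two-proportion power calculation before it runs. A minimal sketch, assuming the 2.8% baseline CTR from the brief and the 15% relative lift as the minimum detectable effect (normal-approximation formula, which is a simplification):

```python
from math import sqrt
from scipy.stats import norm

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Per-arm sample size to detect a shift from rate p1 to rate p2
    with a two-sided two-proportion z-test (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

p1 = 0.028          # baseline widget CTR from the brief
p2 = 0.028 * 1.15   # 15% relative lift, the decision threshold
n = sample_size_per_arm(p1, p2)
print(f"~{n:,.0f} widget impressions per arm")
```

The result lands in the tens of thousands of impressions per arm, so 10% of ShopSmart's monthly traffic over two weeks should be comfortably sufficient, which is worth confirming before the test is scheduled.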
Core Principle --- The deliverable is not the model. The deliverable is the A/B test plan. No offline metric, no matter how encouraging, justifies shipping a recommender to production without controlled online evaluation. Marcus's presentation closes with the A/B test design, not the NDCG numbers.
This case study applies Chapter 24 concepts to ShopSmart's e-commerce recommendation problem. Return to the chapter or continue to Case Study 2 --- StreamFlow Content Recommendations.