Case Study 2: StreamRec Recommendation Explanations — Attention, SHAP, and Natural Language at Scale
Context
StreamRec serves 50 million monthly active users with 400 million recommendation impressions per day. The three-stage pipeline — retrieval (two-tower model, Chapter 13), ranking (transformer session model, Chapter 10), and re-ranking (gradient-boosted tree with fairness and diversity constraints, Chapters 24 and 31) — has been deployed, monitored (Chapter 30), audited for fairness (Chapter 31), and trained with differential privacy for EU users (Chapter 32).
A recurring theme emerged from the product team's user research: users want to know why they see what they see. Qualitative interviews with 200 users across three markets (US, EU, Japan) revealed three distinct user needs:
| User Need | Representative Quote | Frequency |
|---|---|---|
| Calibrate trust | "I want to know if it's showing me this because I'd actually like it, or because it's promoting it." | 68% |
| Discover control | "If I knew why it showed me this, I could tell it to stop showing me stuff like that." | 54% |
| Understand serendipity | "Sometimes it recommends something totally unexpected and great. I want to know how it found it." | 31% |
The VP of Product approved a project to add user-facing explanations to every recommendation. The ML team was tasked with: (1) generating explanations that are technically grounded, (2) presenting them in natural language that non-technical users understand, and (3) serving them at the scale and latency of the recommendation pipeline (p99 < 100ms).
The Architecture
The explanation system consists of three components: a SHAP attribution module for the re-ranking model, an attention extraction module for the session transformer, and a natural language generation module that combines both signals into user-facing text.
```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np


@dataclass
class RecommendationContext:
    """Full context for one recommendation to be explained."""

    user_id_hash: str
    recommended_item_id: str
    session_history: List[str]       # Recent item IDs
    user_features: Dict[str, float]  # 20 user features
    item_features: Dict[str, float]  # 20 item features
    ranking_score: float
    reranking_score: float
    position: int                    # Position in the recommendation list


@dataclass
class ExplanationSignals:
    """Raw explanation signals from SHAP and attention.

    Combines signals from the re-ranking model (SHAP) and the
    session transformer (attention) into a unified representation
    for downstream NL generation.
    """

    # SHAP signals from the re-ranking model
    shap_top_features: List[Tuple[str, float]]  # (feature, SHAP value)
    shap_dominant_category: Optional[str]       # If a category feature dominates
    shap_user_vs_item: float                    # Fraction of SHAP mass on user vs. item features
    # Attention signals from the session transformer
    attention_top_items: List[Tuple[str, float]]  # (item_id, attention weight)
    attention_recency_bias: float                 # Correlation between attention and recency
    attention_category_match: float               # Fraction of top-attended items in same category
    # Contextual signals
    item_popularity_percentile: float  # Popularity of the recommended item
    item_is_trending: bool             # Whether the item is in the top-100 trending
    user_region: str                   # User's geographic region
```
```python
class StreamRecExplanationEngine:
    """Generates explanations for StreamRec recommendations.

    Combines SHAP attribution from the re-ranking model with
    attention signals from the session transformer to produce
    natural language explanations suitable for end users.
    """

    # Explanation type thresholds
    ATTENTION_THRESHOLD = 0.5            # Use watch-history explanation
    CATEGORY_THRESHOLD = 0.6             # Use category-based explanation
    TRENDING_POPULARITY_PERCENTILE = 95  # Use trending explanation

    # Feature name to user-friendly mapping
    FEATURE_DISPLAY_NAMES = {
        "user_category_affinity_score": "your interest in this category",
        "user_session_length": "your current viewing session",
        "item_engagement_rate": "how much viewers enjoy this content",
        "item_category": "the content category",
        "user_item_category_match": "match with your preferred categories",
        "item_freshness_days": "how recently this was published",
        "user_avg_watch_time": "your typical viewing patterns",
        "item_creator_subscriber_count": "the creator's audience size",
        "user_region_item_popularity": "popularity in your region",
        "collaborative_similarity_score": "similarity to content you've enjoyed",
    }

    def __init__(
        self,
        shap_explainer,
        session_model,
        item_metadata: Dict[str, Dict],
    ):
        self.shap_explainer = shap_explainer
        self.session_model = session_model
        self.item_metadata = item_metadata

    def explain(
        self, context: RecommendationContext
    ) -> Tuple[str, str, Dict]:
        """Generate a natural language explanation for one recommendation.

        Args:
            context: Full context for the recommendation.

        Returns:
            Tuple of (explanation_text, explanation_type, raw_signals).
        """
        start = time.monotonic()

        # Step 1: Extract SHAP signals from the re-ranking model
        shap_signals = self._compute_shap_signals(context)

        # Step 2: Extract attention signals from the session transformer
        attention_signals = self._compute_attention_signals(context)

        # Step 3: Combine into ExplanationSignals
        signals = ExplanationSignals(
            shap_top_features=shap_signals["top_features"],
            shap_dominant_category=shap_signals.get("dominant_category"),
            shap_user_vs_item=shap_signals["user_vs_item_ratio"],
            attention_top_items=attention_signals["top_items"],
            attention_recency_bias=attention_signals["recency_bias"],
            attention_category_match=attention_signals["category_match"],
            item_popularity_percentile=context.item_features.get(
                "popularity_percentile", 50.0
            ),
            item_is_trending=bool(
                context.item_features.get("is_trending", False)
            ),
            user_region=context.user_features.get("region", "unknown"),
        )

        # Step 4: Select explanation type and generate NL
        explanation_text, explanation_type = self._generate_nl(signals, context)

        elapsed_ms = (time.monotonic() - start) * 1000
        raw_signals = {
            "shap_top_features": signals.shap_top_features,
            "attention_top_items": signals.attention_top_items,
            "explanation_type": explanation_type,
            "computation_time_ms": elapsed_ms,
        }
        return explanation_text, explanation_type, raw_signals

    def _compute_shap_signals(
        self, context: RecommendationContext
    ) -> Dict:
        """Compute SHAP signals from the re-ranking model."""
        combined_features = {**context.user_features, **context.item_features}
        shap_values = self.shap_explainer.explain(combined_features)

        # Separate user and item feature contributions
        user_shap_mass = sum(
            abs(v) for k, v in shap_values.items()
            if k.startswith("user_")
        )
        item_shap_mass = sum(
            abs(v) for k, v in shap_values.items()
            if k.startswith("item_")
        )
        total = user_shap_mass + item_shap_mass
        user_ratio = user_shap_mass / total if total > 0 else 0.5

        # Identify top features
        sorted_feats = sorted(
            shap_values.items(), key=lambda x: abs(x[1]), reverse=True
        )

        # Check for category dominance; the category name itself lives in
        # item metadata, not in the numeric feature dict
        dominant_cat = None
        for feat, _ in sorted_feats[:3]:
            if "category" in feat.lower():
                dominant_cat = self.item_metadata.get(
                    context.recommended_item_id, {}
                ).get("category")
                break

        return {
            "top_features": sorted_feats[:5],
            "dominant_category": dominant_cat,
            "user_vs_item_ratio": user_ratio,
        }

    def _compute_attention_signals(
        self, context: RecommendationContext
    ) -> Dict:
        """Compute attention signals from the session transformer."""
        if not context.session_history:
            return {
                "top_items": [],
                "recency_bias": 0.0,
                "category_match": 0.0,
            }

        # Extract attention weights for the target item
        # against all items in the session history
        attention_weights = self.session_model.get_attention_weights(
            context.session_history,
            context.recommended_item_id,
        )

        # Top attended items
        item_attention = list(zip(context.session_history, attention_weights))
        item_attention.sort(key=lambda x: x[1], reverse=True)

        # Recency bias: correlation between attention and position
        positions = list(range(len(attention_weights)))
        if len(positions) > 1:
            recency_bias = float(np.corrcoef(positions, attention_weights)[0, 1])
        else:
            recency_bias = 0.0

        # Category match: fraction of top-3 attended items sharing category
        recommended_category = self.item_metadata.get(
            context.recommended_item_id, {}
        ).get("category", "unknown")
        top_3 = item_attention[:3]
        category_matches = sum(
            1 for item_id, _ in top_3
            if self.item_metadata.get(item_id, {}).get("category") == recommended_category
        )
        category_match = category_matches / max(len(top_3), 1)

        return {
            "top_items": item_attention[:5],
            "recency_bias": recency_bias,
            "category_match": category_match,
        }

    def _generate_nl(
        self,
        signals: ExplanationSignals,
        context: RecommendationContext,
    ) -> Tuple[str, str]:
        """Select explanation type and generate natural language.

        Selection priority:
            1. Watch history (if attention signals are strong)
            2. Category match (if SHAP shows category dominance)
            3. Trending (if item is in top percentile)
            4. Similar viewers (fallback)
        """
        # Priority 1: Watch history explanation
        if (signals.attention_top_items and
                sum(w for _, w in signals.attention_top_items[:2]) > self.ATTENTION_THRESHOLD):
            top_items = signals.attention_top_items[:2]
            item_titles = [
                self.item_metadata.get(iid, {}).get("title", iid)
                for iid, _ in top_items
            ]
            if len(item_titles) == 2:
                text = f'Because you watched "{item_titles[0]}" and "{item_titles[1]}"'
            else:
                text = f'Because you watched "{item_titles[0]}"'
            return text, "watch_history"

        # Priority 2: Category explanation
        if (signals.shap_dominant_category and
                signals.attention_category_match > self.CATEGORY_THRESHOLD):
            category = signals.shap_dominant_category
            text = f"Popular in {category} with viewers like you"
            return text, "category_match"

        # Priority 3: Trending explanation
        if (signals.item_is_trending or
                signals.item_popularity_percentile > self.TRENDING_POPULARITY_PERCENTILE):
            region = signals.user_region
            if region and region != "unknown":
                text = f"Trending in {region}"
            else:
                text = "Trending now"
            return text, "trending"

        # Priority 4: Fallback — similar viewers
        return "Recommended for you based on your viewing history", "personalized"
```
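The p99 < 100 ms latency budget can be checked offline from the `computation_time_ms` values the engine logs with each explanation. A minimal sketch, using synthetic lognormal timings purely for illustration:

```python
import numpy as np


def latency_report(times_ms: np.ndarray, budget_ms: float = 100.0) -> dict:
    """Summarize explanation latency against the p99 budget."""
    p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
    return {
        "p50_ms": float(p50),
        "p95_ms": float(p95),
        "p99_ms": float(p99),
        "within_budget": bool(p99 < budget_ms),
    }


# Synthetic log of computation_time_ms values (~10 ms median)
rng = np.random.default_rng(0)
times = rng.lognormal(mean=np.log(10), sigma=0.6, size=100_000)
report = latency_report(times)
print(report)
```

In production this check would run against the audit log's `computation_time_ms` column rather than synthetic samples.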
Evaluation: Online A/B Test
The team deployed the explanation system in an A/B test across 5 million users (2.5M treatment, 2.5M control) for 21 days.
Treatment group: Each recommendation displayed a one-line explanation beneath the item thumbnail (e.g., "Because you watched 'Chef's Table' and 'Salt Fat Acid Heat'").
Control group: No explanation text displayed. Standard recommendation layout.
Primary Metrics
| Metric | Control | Treatment | Relative Change | p-value |
|---|---|---|---|---|
| Click-through rate (CTR) | 4.21% | 4.38% | +4.0% | < 0.001 |
| Watch-through rate (>50% completion) | 62.3% | 64.1% | +2.9% | < 0.001 |
| Session length (minutes) | 34.2 | 35.8 | +4.7% | < 0.001 |
| "Not interested" clicks | 2.1% | 2.5% | +19.0% | < 0.001 |
| User satisfaction (weekly survey) | 3.8 / 5.0 | 4.1 / 5.0 | +7.9% | 0.003 |
The increases in CTR and watch-through rate supported the hypothesis that explanations help users identify content they will enjoy. The increase in "not interested" clicks was unexpected but, on analysis, positive: explanations gave users enough information to reject irrelevant content before watching it, rather than abandoning mid-stream. This reduced wasted viewing time and improved the signal quality for the recommendation model's feedback loop.
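The significance column can be reproduced with a standard two-proportion z-test. A minimal sketch for the CTR row, using illustrative per-arm impression counts (a production analysis would also account for clustering of impressions within users):

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple:
    """Two-sided two-proportion z-test on conversion rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal CDF
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# CTR row: treatment 4.38% vs. control 4.21% (illustrative n per arm)
z, p = two_proportion_z(0.0438, 2_500_000, 0.0421, 2_500_000)
print(f"z = {z:.1f}, p = {p:.2e}")  # p well below 0.001
```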
Explanation Type Distribution
| Explanation Type | Frequency | CTR (treatment) | Watch-Through (treatment) |
|---|---|---|---|
| Watch history | 42.3% | 4.62% | 66.8% |
| Category match | 28.1% | 4.21% | 63.2% |
| Trending | 14.8% | 4.18% | 60.4% |
| Personalized (fallback) | 14.8% | 4.09% | 61.5% |
Watch-history explanations — "Because you watched X and Y" — were the most effective, with the highest CTR and watch-through rate. This makes intuitive sense: a concrete reference to content the user has seen and enjoyed provides the strongest calibration signal. Category explanations were second. Trending explanations performed similarly to the fallback, suggesting that popularity alone is a weak explanation signal.
Explanation Faithfulness Audit
The ML team conducted a post-hoc faithfulness audit on 10,000 explanations.
Audit methodology: For each watch-history explanation ("Because you watched X and Y"), the team computed the integrated gradients attribution of items X and Y to the transformer's ranking score. If the integrated gradients attribution of the referenced items was in the top-5 of all session items, the explanation was considered "faithful."
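The per-explanation faithfulness check reduces to a top-k membership test. A minimal sketch, assuming the integrated gradients attribution for each session item has already been computed (`is_faithful` is a hypothetical helper, not part of the production engine):

```python
from typing import Dict, List


def is_faithful(
    referenced_items: List[str],
    ig_attributions: Dict[str, float],
    top_k: int = 5,
) -> bool:
    """An explanation is 'faithful' if every item it references is among
    the top-k session items by |integrated gradients| attribution."""
    ranked = sorted(
        ig_attributions, key=lambda i: abs(ig_attributions[i]), reverse=True
    )
    top = set(ranked[:top_k])
    return all(item in top for item in referenced_items)


# Example: a watch-history explanation referenced items "a" and "b"
ig = {"a": 0.40, "b": 0.25, "c": 0.10, "d": 0.05, "e": 0.02, "f": 0.30}
print(is_faithful(["a", "b"], ig))  # True: both are in the IG top-5
```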
| Explanation Type | Faithfulness Rate | Notes |
|---|---|---|
| Watch history | 78.3% | 21.7% referenced items with high attention but low IG attribution |
| Category match | 91.2% | Category features consistently in top-3 SHAP |
| Trending | N/A | Factual claim, not model attribution |
| Personalized | N/A | Generic, no specific claim |
The 78.3% faithfulness rate for watch-history explanations confirmed the concern from Section 35.9: attention weights are not perfectly faithful to model behavior. In 21.7% of cases, the items the model attended to most were not the items whose removal would most change the prediction (as measured by integrated gradients). The team decided this was acceptable for user-facing explanations — the attention-based explanation was "directionally correct" (the model did use the session history) even when it was not precise about which items were most important.
For regulatory contexts, the team would use TreeSHAP or integrated gradients rather than attention. For user-facing explanations, the team accepted the tradeoff: attention-based explanations are more natural ("Because you watched X") than SHAP-based explanations ("Your category affinity score contributed +0.12") and have sufficient directional faithfulness for user trust calibration.
Audit Logging at Scale
The explanation system generates 400 million audit log entries per day — one per recommendation impression. The audit infrastructure uses the following design:
```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamRecAuditEntry:
    """Compact audit entry for one recommendation explanation.

    Designed for high-throughput append-only storage.
    Size: ~500 bytes per entry (compressed).
    """

    timestamp_ms: int           # Unix milliseconds
    user_id_hash: str           # 16-char hex (truncated SHA-256)
    item_id: str
    model_version: str
    position: int
    ranking_score: float
    explanation_type: str       # "watch_history", "category_match", "trending", "personalized"
    explanation_text_hash: str  # 16-char hex (not full text, for storage)
    shap_top_3: List[Tuple[str, float]]       # Top 3 SHAP features
    attention_top_2: List[Tuple[str, float]]  # Top 2 attended items
    computation_time_ms: float


# Storage: Apache Parquet on S3, partitioned by date and region
# Retention: 90 days hot (queryable), 2 years cold (archived)
# Daily volume: ~400M entries x 500 bytes = ~200 GB/day (compressed)
# Monthly cost: ~$3,000 (S3 Standard) + ~$500 (cold archive)
```
The storage cost — approximately $3,500 per month for the full audit trail — is modest relative to the recommendation system's infrastructure budget. The 90-day hot retention enables real-time querying for explanation monitoring (is the distribution of explanation types shifting? Are certain items always explained as "trending" when they are not?), while the 2-year cold archive satisfies potential regulatory requests.
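One example of a hot-tier monitoring query, detecting a shift in the explanation-type mix against the A/B-test baseline, can be sketched as a total variation distance check. The 0.05 alert threshold and the sample distributions below are illustrative:

```python
def type_distribution_drift(
    baseline: dict, current: dict, threshold: float = 0.05
) -> tuple:
    """Total variation distance between two explanation-type distributions.

    Returns (tvd, alert). Distributions are {type: fraction} dicts
    that each sum to 1.0.
    """
    types = set(baseline) | set(current)
    tvd = 0.5 * sum(
        abs(baseline.get(t, 0.0) - current.get(t, 0.0)) for t in types
    )
    return tvd, tvd > threshold


# Baseline mix from the A/B test vs. a hypothetical day with a trending spike
baseline = {"watch_history": 0.423, "category_match": 0.281,
            "trending": 0.148, "personalized": 0.148}
today = {"watch_history": 0.38, "category_match": 0.27,
         "trending": 0.23, "personalized": 0.12}
tvd, alert = type_distribution_drift(baseline, today)
print(f"TVD = {tvd:.3f}, alert = {alert}")
```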
Lessons Learned
- Users value explanations for trust calibration, not model understanding. No user in the qualitative interviews wanted to understand the model's architecture. They wanted to know whether the recommendation was for them (based on their history) or for everyone (based on popularity). The explanation system's primary value is helping users distinguish personalized from generic recommendations.
- Attention-based explanations are imperfect but sufficient for user-facing use. The 78.3% faithfulness rate means that roughly 1 in 5 watch-history explanations references items that are not the most important by integrated gradients. For user trust calibration, this is acceptable — the explanation is directionally correct. For regulatory or fairness audit contexts, use SHAP or integrated gradients.
- Explanation type selection is itself a modeling problem. The heuristic rules (attention threshold > 0.5, category match > 0.6) were tuned by hand based on the A/B test results. A more sophisticated approach would train a classifier to predict which explanation type maximizes user engagement conditional on the recommendation context. The team deferred this to avoid the meta-problem of explaining the explanation selector.
- Audit logging at recommendation scale is a storage engineering problem. The 400 million daily entries require compact serialization, efficient partitioning, and tiered storage. The decision to log SHAP top-3 and attention top-2 (rather than the full attribution vector) was a pragmatic compromise between audit completeness and storage cost. Full attribution vectors are computed on demand for specific investigations.