Case Study 2: StreamRec Recommendation Explanations — Attention, SHAP, and Natural Language at Scale
Context
StreamRec serves 50 million monthly active users with 400 million recommendation impressions per day. The three-stage pipeline — retrieval (two-tower model, Chapter 13), ranking (transformer session model, Chapter 10), and re-ranking (gradient-boosted tree with fairness and diversity constraints, Chapters 24 and 31) — has been deployed, monitored (Chapter 30), audited for fairness (Chapter 31), and trained with differential privacy for EU users (Chapter 32).
A recurring theme emerged from the product team's user research: users want to know why they see what they see. Qualitative interviews with 200 users across three markets (US, EU, Japan) revealed three distinct user needs:
| User Need | Representative Quote | Frequency |
|---|---|---|
| Calibrate trust | "I want to know if it's showing me this because I'd actually like it, or because it's promoting it." | 68% |
| Discover control | "If I knew why it showed me this, I could tell it to stop showing me stuff like that." | 54% |
| Understand serendipity | "Sometimes it recommends something totally unexpected and great. I want to know how it found it." | 31% |
The VP of Product approved a project to add user-facing explanations to every recommendation. The ML team was tasked with: (1) generating explanations that are technically grounded, (2) presenting them in natural language that non-technical users understand, and (3) serving them at the scale and latency of the recommendation pipeline (p99 < 100ms).
The Architecture
The explanation system consists of three components: a SHAP attribution module for the re-ranking model, an attention extraction module for the session transformer, and a natural language generation module that combines both signals into user-facing text.
```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np


@dataclass
class RecommendationContext:
    """Full context for one recommendation to be explained."""

    user_id_hash: str
    recommended_item_id: str
    session_history: List[str]       # Recent item IDs
    user_features: Dict[str, float]  # 20 user features
    item_features: Dict[str, float]  # 20 item features
    ranking_score: float
    reranking_score: float
    position: int                    # Position in the recommendation list


@dataclass
class ExplanationSignals:
    """Raw explanation signals from SHAP and attention.

    Combines signals from the re-ranking model (SHAP) and the
    session transformer (attention) into a unified representation
    for downstream NL generation.
    """

    # SHAP signals from the re-ranking model
    shap_top_features: List[Tuple[str, float]]  # (feature, SHAP value)
    shap_dominant_category: Optional[str]       # If a category feature dominates
    shap_user_vs_item: float                    # Fraction of SHAP mass on user vs. item features
    # Attention signals from the session transformer
    attention_top_items: List[Tuple[str, float]]  # (item_id, attention weight)
    attention_recency_bias: float                 # Correlation between attention and recency
    attention_category_match: float               # Fraction of top-attended items in same category
    # Contextual signals
    item_popularity_percentile: float  # Popularity of the recommended item
    item_is_trending: bool             # Whether the item is in the top-100 trending
    user_region: str                   # User's geographic region
```
```python
class StreamRecExplanationEngine:
    """Generates explanations for StreamRec recommendations.

    Combines SHAP attribution from the re-ranking model with
    attention signals from the session transformer to produce
    natural language explanations suitable for end users.
    """

    # Explanation type thresholds
    ATTENTION_THRESHOLD = 0.5            # Use watch-history explanation
    CATEGORY_THRESHOLD = 0.6             # Use category-based explanation
    TRENDING_POPULARITY_PERCENTILE = 95  # Use trending explanation

    # Feature name to user-friendly mapping
    FEATURE_DISPLAY_NAMES = {
        "user_category_affinity_score": "your interest in this category",
        "user_session_length": "your current viewing session",
        "item_engagement_rate": "how much viewers enjoy this content",
        "item_category": "the content category",
        "user_item_category_match": "match with your preferred categories",
        "item_freshness_days": "how recently this was published",
        "user_avg_watch_time": "your typical viewing patterns",
        "item_creator_subscriber_count": "the creator's audience size",
        "user_region_item_popularity": "popularity in your region",
        "collaborative_similarity_score": "similarity to content you've enjoyed",
    }

    def __init__(
        self,
        shap_explainer,
        session_model,
        item_metadata: Dict[str, Dict],
    ):
        self.shap_explainer = shap_explainer
        self.session_model = session_model
        self.item_metadata = item_metadata

    def explain(
        self, context: RecommendationContext
    ) -> Tuple[str, str, Dict]:
        """Generate a natural language explanation for one recommendation.

        Args:
            context: Full context for the recommendation.

        Returns:
            Tuple of (explanation_text, explanation_type, raw_signals).
        """
        start = time.monotonic()

        # Step 1: Extract SHAP signals from the re-ranking model
        shap_signals = self._compute_shap_signals(context)

        # Step 2: Extract attention signals from the session transformer
        attention_signals = self._compute_attention_signals(context)

        # Step 3: Combine into ExplanationSignals
        signals = ExplanationSignals(
            shap_top_features=shap_signals["top_features"],
            shap_dominant_category=shap_signals.get("dominant_category"),
            shap_user_vs_item=shap_signals["user_vs_item_ratio"],
            attention_top_items=attention_signals["top_items"],
            attention_recency_bias=attention_signals["recency_bias"],
            attention_category_match=attention_signals["category_match"],
            item_popularity_percentile=context.item_features.get(
                "popularity_percentile", 50.0
            ),
            item_is_trending=bool(
                context.item_features.get("is_trending", False)
            ),
            user_region=context.user_features.get("region", "unknown"),
        )

        # Step 4: Select explanation type and generate NL
        explanation_text, explanation_type = self._generate_nl(signals, context)

        elapsed_ms = (time.monotonic() - start) * 1000
        raw_signals = {
            "shap_top_features": signals.shap_top_features,
            "attention_top_items": signals.attention_top_items,
            "explanation_type": explanation_type,
            "computation_time_ms": elapsed_ms,
        }
        return explanation_text, explanation_type, raw_signals

    def _compute_shap_signals(
        self, context: RecommendationContext
    ) -> Dict:
        """Compute SHAP signals from the re-ranking model."""
        combined_features = {**context.user_features, **context.item_features}
        shap_values = self.shap_explainer.explain(combined_features)

        # Separate user and item feature contributions
        user_shap_mass = sum(
            abs(v) for k, v in shap_values.items()
            if k.startswith("user_")
        )
        item_shap_mass = sum(
            abs(v) for k, v in shap_values.items()
            if k.startswith("item_")
        )
        total = user_shap_mass + item_shap_mass
        user_ratio = user_shap_mass / total if total > 0 else 0.5

        # Identify top features
        sorted_feats = sorted(
            shap_values.items(), key=lambda x: abs(x[1]), reverse=True
        )

        # Check for category dominance; the category name itself lives in
        # item metadata, not in the numeric feature dict
        dominant_cat = None
        for feat, _ in sorted_feats[:3]:
            if "category" in feat.lower():
                dominant_cat = self.item_metadata.get(
                    context.recommended_item_id, {}
                ).get("category")
                break

        return {
            "top_features": sorted_feats[:5],
            "dominant_category": dominant_cat,
            "user_vs_item_ratio": user_ratio,
        }

    def _compute_attention_signals(
        self, context: RecommendationContext
    ) -> Dict:
        """Compute attention signals from the session transformer."""
        if not context.session_history:
            return {
                "top_items": [],
                "recency_bias": 0.0,
                "category_match": 0.0,
            }

        # Extract attention weights for the target item
        # against all items in the session history
        attention_weights = self.session_model.get_attention_weights(
            context.session_history,
            context.recommended_item_id,
        )

        # Top attended items
        item_attention = list(zip(context.session_history, attention_weights))
        item_attention.sort(key=lambda x: x[1], reverse=True)

        # Recency bias: correlation between attention and position
        positions = list(range(len(attention_weights)))
        if len(positions) > 1:
            recency_bias = float(np.corrcoef(positions, attention_weights)[0, 1])
        else:
            recency_bias = 0.0

        # Category match: fraction of top-3 attended items sharing category
        recommended_category = self.item_metadata.get(
            context.recommended_item_id, {}
        ).get("category", "unknown")
        top_3 = item_attention[:3]
        category_matches = sum(
            1 for item_id, _ in top_3
            if self.item_metadata.get(item_id, {}).get("category") == recommended_category
        )
        category_match = category_matches / max(len(top_3), 1)

        return {
            "top_items": item_attention[:5],
            "recency_bias": recency_bias,
            "category_match": category_match,
        }

    def _generate_nl(
        self,
        signals: ExplanationSignals,
        context: RecommendationContext,
    ) -> Tuple[str, str]:
        """Select explanation type and generate natural language.

        Selection priority:
            1. Watch history (if attention signals are strong)
            2. Category match (if SHAP shows category dominance)
            3. Trending (if item is in top percentile)
            4. Similar viewers (fallback)
        """
        # Priority 1: Watch history explanation
        if (signals.attention_top_items and
                sum(w for _, w in signals.attention_top_items[:2]) > self.ATTENTION_THRESHOLD):
            top_items = signals.attention_top_items[:2]
            item_titles = [
                self.item_metadata.get(iid, {}).get("title", iid)
                for iid, _ in top_items
            ]
            if len(item_titles) == 2:
                text = f'Because you watched "{item_titles[0]}" and "{item_titles[1]}"'
            else:
                text = f'Because you watched "{item_titles[0]}"'
            return text, "watch_history"

        # Priority 2: Category explanation
        if (signals.shap_dominant_category and
                signals.attention_category_match > self.CATEGORY_THRESHOLD):
            category = signals.shap_dominant_category
            text = f"Popular in {category} with viewers like you"
            return text, "category_match"

        # Priority 3: Trending explanation
        if (signals.item_is_trending or
                signals.item_popularity_percentile > self.TRENDING_POPULARITY_PERCENTILE):
            region = signals.user_region
            if region and region != "unknown":
                text = f"Trending in {region}"
            else:
                text = "Trending now"
            return text, "trending"

        # Priority 4: Fallback — similar viewers
        return "Recommended for you based on your viewing history", "personalized"
```
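The p99 < 100 ms latency budget can be checked offline from the `computation_time_ms` values the engine logs with each explanation. A minimal sketch, using synthetic lognormal timings purely for illustration:

```python
import numpy as np


def latency_report(times_ms: np.ndarray, budget_ms: float = 100.0) -> dict:
    """Summarize explanation latency against the p99 budget."""
    p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
    return {
        "p50_ms": float(p50),
        "p95_ms": float(p95),
        "p99_ms": float(p99),
        "within_budget": bool(p99 < budget_ms),
    }


# Synthetic log of computation_time_ms values (~10 ms median)
rng = np.random.default_rng(0)
times = rng.lognormal(mean=np.log(10), sigma=0.6, size=100_000)
report = latency_report(times)
print(report)
```

In production this check would run against the audit log's `computation_time_ms` column rather than synthetic samples.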
Evaluation: Online A/B Test
The team deployed the explanation system in an A/B test across 5 million users (2.5M treatment, 2.5M control) for 21 days.
Treatment group: Each recommendation displayed a one-line explanation beneath the item thumbnail (e.g., "Because you watched 'Chef's Table' and 'Salt Fat Acid Heat'").
Control group: No explanation text displayed. Standard recommendation layout.
Primary Metrics
| Metric | Control | Treatment | Relative Change | p-value |
|---|---|---|---|---|
| Click-through rate (CTR) | 4.21% | 4.38% | +4.0% | < 0.001 |
| Watch-through rate (>50% completion) | 62.3% | 64.1% | +2.9% | < 0.001 |
| Session length (minutes) | 34.2 | 35.8 | +4.7% | < 0.001 |
| "Not interested" clicks | 2.1% | 2.5% | +19.0% | < 0.001 |
| User satisfaction (weekly survey) | 3.8 / 5.0 | 4.1 / 5.0 | +7.9% | 0.003 |
The increases in CTR and watch-through rate supported the hypothesis that explanations help users identify content they will enjoy. The increase in "not interested" clicks was unexpected but, on analysis, positive: explanations gave users enough information to reject irrelevant content before watching it, rather than abandoning mid-stream. This reduced wasted viewing time and improved the signal quality for the recommendation model's feedback loop.
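The significance column can be reproduced with a standard two-proportion z-test. A minimal sketch for the CTR row, using illustrative per-arm impression counts (a production analysis would also account for clustering of impressions within users):

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple:
    """Two-sided two-proportion z-test on conversion rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal CDF
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# CTR row: treatment 4.38% vs. control 4.21% (illustrative n per arm)
z, p = two_proportion_z(0.0438, 2_500_000, 0.0421, 2_500_000)
print(f"z = {z:.1f}, p = {p:.2e}")  # p well below 0.001
```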
Explanation Type Distribution
| Explanation Type | Frequency | CTR (treatment) | Watch-Through (treatment) |
|---|---|---|---|
| Watch history | 42.3% | 4.62% | 66.8% |
| Category match | 28.1% | 4.21% | 63.2% |
| Trending | 14.8% | 4.18% | 60.4% |
| Personalized (fallback) | 14.8% | 4.09% | 61.5% |
Watch-history explanations — "Because you watched X and Y" — were the most effective, with the highest CTR and watch-through rate. This makes intuitive sense: a concrete reference to content the user has seen and enjoyed provides the strongest calibration signal. Category explanations were second. Trending explanations performed similarly to the fallback, suggesting that popularity alone is a weak explanation signal.
Explanation Faithfulness Audit
The ML team conducted a post-hoc faithfulness audit on 10,000 explanations.
Audit methodology: For each watch-history explanation ("Because you watched X and Y"), the team computed the integrated gradients attribution of items X and Y to the transformer's ranking score. If the integrated gradients attribution of the referenced items was in the top-5 of all session items, the explanation was considered "faithful."
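The per-explanation faithfulness check reduces to a top-k membership test. A minimal sketch, assuming the integrated gradients attribution for each session item has already been computed (`is_faithful` is a hypothetical helper, not part of the production engine):

```python
from typing import Dict, List


def is_faithful(
    referenced_items: List[str],
    ig_attributions: Dict[str, float],
    top_k: int = 5,
) -> bool:
    """An explanation is 'faithful' if every item it references is among
    the top-k session items by |integrated gradients| attribution."""
    ranked = sorted(
        ig_attributions, key=lambda i: abs(ig_attributions[i]), reverse=True
    )
    top = set(ranked[:top_k])
    return all(item in top for item in referenced_items)


# Example: a watch-history explanation referenced items "a" and "b"
ig = {"a": 0.40, "b": 0.25, "c": 0.10, "d": 0.05, "e": 0.02, "f": 0.30}
print(is_faithful(["a", "b"], ig))  # True: both are in the IG top-5
```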
| Explanation Type | Faithfulness Rate | Notes |
|---|---|---|
| Watch history | 78.3% | 21.7% referenced items with high attention but low IG attribution |
| Category match | 91.2% | Category features consistently in top-3 SHAP |
| Trending | N/A | Factual claim, not model attribution |
| Personalized | N/A | Generic, no specific claim |
The 78.3% faithfulness rate for watch-history explanations confirmed the concern from Section 35.9: attention weights are not perfectly faithful to model behavior. In 21.7% of cases, the items the model attended to most were not the items whose removal would most change the prediction (as measured by integrated gradients). The team decided this was acceptable for user-facing explanations — the attention-based explanation was "directionally correct" (the model did use the session history) even when it was not precise about which items were most important.
For regulatory contexts, the team would use TreeSHAP or integrated gradients rather than attention. For user-facing explanations, the team accepted the tradeoff: attention-based explanations are more natural ("Because you watched X") than SHAP-based explanations ("Your category affinity score contributed +0.12") and have sufficient directional faithfulness for user trust calibration.
Audit Logging at Scale
The explanation system generates 400 million audit log entries per day — one per recommendation impression. The audit infrastructure uses the following design:
```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamRecAuditEntry:
    """Compact audit entry for one recommendation explanation.

    Designed for high-throughput append-only storage.
    Size: ~500 bytes per entry (compressed).
    """

    timestamp_ms: int           # Unix milliseconds
    user_id_hash: str           # 16-char hex (truncated SHA-256)
    item_id: str
    model_version: str
    position: int
    ranking_score: float
    explanation_type: str       # "watch_history", "category_match", "trending", "personalized"
    explanation_text_hash: str  # 16-char hex (not full text, for storage)
    shap_top_3: List[Tuple[str, float]]       # Top 3 SHAP features
    attention_top_2: List[Tuple[str, float]]  # Top 2 attended items
    computation_time_ms: float


# Storage: Apache Parquet on S3, partitioned by date and region
# Retention: 90 days hot (queryable), 2 years cold (archived)
# Daily volume: ~400M entries x 500 bytes = ~200 GB/day (compressed)
# Monthly cost: ~$3,000 (S3 Standard) + ~$500 (cold archive)
```
The storage cost — approximately $3,500 per month for the full audit trail — is modest relative to the recommendation system's infrastructure budget. The 90-day hot retention enables real-time querying for explanation monitoring (is the distribution of explanation types shifting? Are certain items always explained as "trending" when they are not?), while the 2-year cold archive satisfies potential regulatory requests.
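One example of a hot-tier monitoring query, detecting a shift in the explanation-type mix against the A/B-test baseline, can be sketched as a total variation distance check. The 0.05 alert threshold and the sample distributions below are illustrative:

```python
def type_distribution_drift(
    baseline: dict, current: dict, threshold: float = 0.05
) -> tuple:
    """Total variation distance between two explanation-type distributions.

    Returns (tvd, alert). Distributions are {type: fraction} dicts
    that each sum to 1.0.
    """
    types = set(baseline) | set(current)
    tvd = 0.5 * sum(
        abs(baseline.get(t, 0.0) - current.get(t, 0.0)) for t in types
    )
    return tvd, tvd > threshold


# Baseline mix from the A/B test vs. a hypothetical day with a trending spike
baseline = {"watch_history": 0.423, "category_match": 0.281,
            "trending": 0.148, "personalized": 0.148}
today = {"watch_history": 0.38, "category_match": 0.27,
         "trending": 0.23, "personalized": 0.12}
tvd, alert = type_distribution_drift(baseline, today)
print(f"TVD = {tvd:.3f}, alert = {alert}")
```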
Lessons Learned
- Users value explanations for trust calibration, not model understanding. No user in the qualitative interviews wanted to understand the model's architecture. They wanted to know whether the recommendation was for them (based on their history) or for everyone (based on popularity). The explanation system's primary value is helping users distinguish personalized from generic recommendations.
- Attention-based explanations are imperfect but sufficient for user-facing use. The 78.3% faithfulness rate means that roughly 1 in 5 watch-history explanations references items that are not the most important by integrated gradients. For user trust calibration, this is acceptable — the explanation is directionally correct. For regulatory or fairness audit contexts, use SHAP or integrated gradients.
- Explanation type selection is itself a modeling problem. The heuristic rules (attention threshold > 0.5, category match > 0.6) were tuned by hand based on the A/B test results. A more sophisticated approach would train a classifier to predict which explanation type maximizes user engagement conditional on the recommendation context. The team deferred this to avoid the meta-problem of explaining the explanation selector.
- Audit logging at recommendation scale is a storage engineering problem. The 400 million daily entries require compact serialization, efficient partitioning, and tiered storage. The decision to log SHAP top-3 and attention top-2 (rather than the full attribution vector) was a pragmatic compromise between audit completeness and storage cost. Full attribution vectors are computed on demand for specific investigations.