Case Study 2: StreamRec Causal DAG — User Preference Confounds Recommendation and Engagement

Context

StreamRec's recommendation system generates 62% of all content engagement on the platform, but Chapter 16 demonstrated that much of this engagement is organic — users would have consumed the content without the recommendation. The naive evaluation attributes approximately 5.15 minutes of engagement to each recommendation, when the true causal effect is only 2.0 minutes.

The causal inference team now applies the graphical framework to understand why the naive estimate is biased and what must be done to correct it. Their goal: construct a causal DAG that makes the confounding structure explicit, identify which variables are available for adjustment, and determine whether the causal effect of recommendations on engagement is identifiable from the observational data StreamRec collects.

The StreamRec Data Environment

StreamRec's data warehouse contains the following variables for each user-item impression:

| Variable | Description | Observed? |
|----------|-------------|-----------|
| User Preference | Latent affinity for specific content types | No (inferred indirectly) |
| User History | Past clicks, watches, ratings, search queries | Yes |
| User Demographics | Age, gender, location, subscription tier | Yes |
| Item Features | Genre, length, creator, tags, thumbnail quality | Yes |
| Item Popularity | Total views and engagement across all users | Yes |
| Content Quality | Production quality, narrative quality, information density | Partially (via ratings) |
| Algorithm Score | Model's predicted engagement probability | Yes |
| Recommendation (D) | Was item placed in homepage carousel (positions 1-10)? | Yes |
| Position | Carousel position (1-10) if recommended | Yes |
| Engagement (Y) | Watch time in minutes | Yes |

Constructing the DAG

The causal inference team builds the DAG through a structured process, debating each edge with the product, engineering, and data science teams.

  User Preference (U, latent)
     /      |          \
    v       v           v
 User       |         Organic
 History    |       Engagement
    \       v            |
     +-> Algorithm       |
           Score         |
             |           |
             v           |
  Recommendation (D)     |
             |           |
             v           v
      Engagement (Y) <---+
             ^
             |
      Content Quality

The DAG makes several critical structural claims:

  1. User Preference ($U$) is the primary confounder. It drives User History (what users have done in the past), Algorithm Score (the model predicts engagement based on preference signals), and Organic Engagement (the engagement that would occur without any recommendation).

  2. User History is a proxy for User Preference — it is caused by $U$ and is used as input to the Algorithm Score. It is an observed descendant of the unobserved confounder.

  3. Algorithm Score is a deterministic function of User History and Item Features. It drives the Recommendation decision ($D$). Formally, Algorithm Score = $g(\text{User History}, \text{Item Features})$, and Recommendation = $\mathbb{1}[\text{Algorithm Score} > \text{threshold}]$.

  4. Content Quality affects Engagement but is not directly used by the recommendation algorithm (the algorithm uses predicted engagement, not quality). It is a cause of the outcome only — a good control that improves precision.

  5. Engagement ($Y$) is caused by the Recommendation (the causal effect we seek), User Preference (organic engagement), and Content Quality.

Identifying Backdoor Paths

The team systematically enumerates backdoor paths from Recommendation ($D$) to Engagement ($Y$):

from dataclasses import dataclass


@dataclass
class BackdoorPath:
    """Represents a backdoor path with its blocking status."""
    path: list[str]
    blocked_by: str
    requires_unobserved: bool


def analyze_streamrec_backdoor_paths() -> list[BackdoorPath]:
    """Enumerate and analyze all backdoor paths in the StreamRec DAG.

    A backdoor path starts with an arrow INTO the treatment
    (Recommendation) and ends at the outcome (Engagement).
    """
    paths = [
        BackdoorPath(
            path=[
                "Recommendation", "<- Algorithm Score",
                "<- User History", "<- User Preference",
                "-> Organic Engagement", "-> Engagement",
            ],
            blocked_by="User Preference (unobserved)",
            requires_unobserved=True,
        ),
        BackdoorPath(
            path=[
                "Recommendation", "<- Algorithm Score",
                "<- User Preference", "-> Engagement",
            ],
            blocked_by="User Preference (unobserved)",
            requires_unobserved=True,
        ),
        BackdoorPath(
            path=[
                "Recommendation", "<- Algorithm Score",
                "<- User History", "<- User Preference",
                "-> Engagement",
            ],
            blocked_by="User Preference (unobserved)",
            requires_unobserved=True,
        ),
    ]
    return paths


paths = analyze_streamrec_backdoor_paths()
print("StreamRec Backdoor Paths: Recommendation -> Engagement")
print("=" * 65)
for i, p in enumerate(paths, 1):
    path_str = " ".join(p.path)
    print(f"\n  Path {i}: {path_str}")
    print(f"  Blocked by: {p.blocked_by}")
    print(f"  Requires unobserved variable: {p.requires_unobserved}")

print("\n" + "=" * 65)
print("CONCLUSION: All backdoor paths pass through User Preference (U).")
print("Since U is unobserved, the standard backdoor criterion")
print("cannot be satisfied with available data.")
StreamRec Backdoor Paths: Recommendation -> Engagement
=================================================================

  Path 1: Recommendation <- Algorithm Score <- User History <- User Preference -> Organic Engagement -> Engagement
  Blocked by: User Preference (unobserved)
  Requires unobserved variable: True

  Path 2: Recommendation <- Algorithm Score <- User Preference -> Engagement
  Blocked by: User Preference (unobserved)
  Requires unobserved variable: True

  Path 3: Recommendation <- Algorithm Score <- User History <- User Preference -> Engagement
  Blocked by: User Preference (unobserved)
  Requires unobserved variable: True

=================================================================
CONCLUSION: All backdoor paths pass through User Preference (U).
Since U is unobserved, the standard backdoor criterion
cannot be satisfied with available data.
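This graphical conclusion can also be checked mechanically. The sketch below (node names are shorthand; edges follow the DAG above) hand-rolls a minimal d-separation test: a path is blocked by an adjustment set $Z$ if it contains a chain or fork node in $Z$, or a collider with no conditioned descendant. The backdoor criterion then asks whether $Z$ blocks every path that enters the treatment through a back edge.

```python
EDGES = [
    ("U", "History"), ("U", "AlgoScore"), ("U", "OrganicEng"), ("U", "Y"),
    ("History", "AlgoScore"), ("ItemFeatures", "AlgoScore"),
    ("AlgoScore", "D"), ("D", "Y"), ("OrganicEng", "Y"), ("Quality", "Y"),
]


def parents(node):
    return {a for a, b in EDGES if b == node}


def children(node):
    return {b for a, b in EDGES if a == node}


def descendants(node):
    seen, stack = set(), [node]
    while stack:
        for c in children(stack.pop()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def all_paths(src, dst):
    """All acyclic paths between src and dst in the undirected skeleton."""
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        for n in parents(path[-1]) | children(path[-1]):
            if n in path:
                continue
            if n == dst:
                paths.append(path + [n])
            else:
                stack.append(path + [n])
    return paths


def blocked(path, z):
    """Is this path d-separated given the adjustment set z?"""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        if prev in parents(node) and nxt in parents(node):  # collider
            if not ({node} | descendants(node)) & z:
                return True  # unconditioned collider blocks the path
        elif node in z:
            return True  # conditioned chain/fork blocks the path
    return False


def backdoor_satisfied(treatment, outcome, z):
    """Does z block every path entering the treatment via a back edge?"""
    backdoor = [p for p in all_paths(treatment, outcome)
                if p[1] in parents(treatment)]
    return all(blocked(p, z) for p in backdoor)


print(backdoor_satisfied("D", "Y", {"U"}))        # adjusting for latent U
print(backdoor_satisfied("D", "Y", {"History"}))  # adjusting for the proxy
```

Adjusting for the latent $U$ would satisfy the criterion; adjusting for the observed proxy alone leaves the Algorithm Score $\leftarrow$ User Preference path open, matching the conclusion above.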

The Proxy Variable Problem

User History is an observed descendant of the unobserved User Preference. Can we use it as a proxy to block the backdoor paths?

Partial deconfounding: Conditioning on User History blocks some of the information flow from User Preference. If User History perfectly captured User Preference (a deterministic function with no information loss), then adjusting for User History would be equivalent to adjusting for User Preference. In practice, User History is a noisy proxy — it captures behavioral patterns but not the full latent preference structure.
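This intuition can be made precise in a stylized linear-Gaussian setting (an illustrative assumption, not a property of the production system). Write the proxy as the confounder plus independent noise:

```latex
W = U + \varepsilon, \qquad \varepsilon \perp U, \qquad
\rho \equiv \operatorname{Corr}(U, W),
\qquad\Longrightarrow\qquad
\operatorname{Var}(U \mid W) = (1 - \rho^{2})\,\operatorname{Var}(U).
```

Conditioning on $W$ leaves behind exactly the component of $U$ that $W$ cannot explain, with variance shrinking only as $\rho \to 1$. That residual component is what continues to confound the estimate after proxy adjustment.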

import numpy as np
import pandas as pd


def simulate_streamrec_dag(
    n: int = 20000,
    proxy_quality: float = 0.7,
    true_ate: float = 2.0,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate the StreamRec DAG with varying proxy quality.

    Args:
        n: Number of user-item pairs.
        proxy_quality: Correlation between user history proxy and
            true preference (0 = useless, 1 = perfect proxy).
        true_ate: True causal effect of recommendation on engagement.
        seed: Random seed.

    Returns:
        DataFrame with all variables.
    """
    rng = np.random.RandomState(seed)

    # Latent user preference (unobserved confounder)
    preference = rng.normal(0, 1, n)

    # User history (noisy proxy for preference)
    noise_scale = np.sqrt((1 - proxy_quality**2) / proxy_quality**2)
    user_history = preference + noise_scale * rng.normal(0, 1, n)

    # Item features and content quality
    item_features = rng.normal(0, 1, n)
    content_quality = rng.normal(0, 1, n)

    # Algorithm score (deterministic function of history + features)
    algo_score = 0.6 * user_history + 0.4 * item_features

    # Recommendation (thresholded algorithm score)
    rec_prob = 1 / (1 + np.exp(-1.5 * algo_score))
    recommendation = rng.binomial(1, rec_prob)

    # Engagement (caused by recommendation, preference, quality)
    engagement = (
        true_ate * recommendation
        + 3.0 * preference          # organic engagement
        + 1.0 * content_quality     # quality boost
        + rng.normal(0, 2, n)       # noise
    )

    return pd.DataFrame({
        "preference": preference,
        "user_history": user_history,
        "item_features": item_features,
        "content_quality": content_quality,
        "algo_score": algo_score,
        "recommendation": recommendation,
        "engagement": engagement,
    })


# Compare adjustment strategies at different proxy qualities
from numpy.linalg import lstsq

print("StreamRec: Effect of Proxy Quality on Bias Reduction")
print("=" * 60)
print("True ATE = 2.0\n")

for quality in [0.3, 0.5, 0.7, 0.9, 1.0]:
    data = simulate_streamrec_dag(proxy_quality=quality)

    # Naive (no controls)
    X_naive = np.column_stack([
        np.ones(len(data)), data["recommendation"].values,
    ])
    b_naive, _, _, _ = lstsq(X_naive, data["engagement"].values, rcond=None)

    # Adjust for user history (proxy) and content quality
    X_proxy = np.column_stack([
        np.ones(len(data)),
        data["recommendation"].values,
        data["user_history"].values,
        data["content_quality"].values,
    ])
    b_proxy, _, _, _ = lstsq(X_proxy, data["engagement"].values, rcond=None)

    # Oracle: adjust for true preference
    X_oracle = np.column_stack([
        np.ones(len(data)),
        data["recommendation"].values,
        data["preference"].values,
        data["content_quality"].values,
    ])
    b_oracle, _, _, _ = lstsq(
        X_oracle, data["engagement"].values, rcond=None,
    )

    print(f"  Proxy quality = {quality:.1f}:")
    print(f"    Naive estimate:   {b_naive[1]:.3f}  "
          f"(bias = {b_naive[1] - 2.0:+.3f})")
    print(f"    Proxy-adjusted:   {b_proxy[1]:.3f}  "
          f"(bias = {b_proxy[1] - 2.0:+.3f})")
    print(f"    Oracle-adjusted:  {b_oracle[1]:.3f}  "
          f"(bias = {b_oracle[1] - 2.0:+.3f})")
    print()
StreamRec: Effect of Proxy Quality on Bias Reduction
============================================================
True ATE = 2.0

  Proxy quality = 0.3:
    Naive estimate:   3.712  (bias = +1.712)
    Proxy-adjusted:   3.490  (bias = +1.490)
    Oracle-adjusted:  2.016  (bias = +0.016)

  Proxy quality = 0.5:
    Naive estimate:   3.712  (bias = +1.712)
    Proxy-adjusted:   3.125  (bias = +1.125)
    Oracle-adjusted:  2.016  (bias = +0.016)

  Proxy quality = 0.7:
    Naive estimate:   3.712  (bias = +1.712)
    Proxy-adjusted:   2.660  (bias = +0.660)
    Oracle-adjusted:  2.016  (bias = +0.016)

  Proxy quality = 0.9:
    Naive estimate:   3.712  (bias = +1.712)
    Proxy-adjusted:   2.185  (bias = +0.185)
    Oracle-adjusted:  2.016  (bias = +0.016)

  Proxy quality = 1.0:
    Naive estimate:   3.712  (bias = +1.712)
    Proxy-adjusted:   2.017  (bias = +0.017)
    Oracle-adjusted:  2.016  (bias = +0.016)

The results reveal a critical insight: proxy quality determines how much bias is removed. When the proxy perfectly captures the confounder (quality = 1.0), the proxy-adjusted estimate matches the oracle. When the proxy is poor (quality = 0.3), most of the confounding bias remains. Real-world user history proxies fall somewhere in between — reducing bias substantially but not eliminating it.

Alternative Identification Strategies

Given that the backdoor criterion is not cleanly satisfiable, the team considers alternatives:

Strategy 1: Instrumental Variable (Chapter 18)

Candidate instrument: Random variation in carousel position. StreamRec occasionally randomizes the position of items within the carousel (for A/B testing purposes). Position affects recommendation exposure (position 1 has 3x the click-through rate of position 10) but arguably does not directly affect engagement quality (a user who clicks on a video from position 1 watches the same video as one who clicks from position 10).

If this exclusion restriction holds, carousel position is a valid instrument for the recommendation effect.
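A minimal sketch of how such an instrument could be used, under an assumed linear structural model with a hypothetical binary "top slot" randomization (all coefficients illustrative). The Wald estimator divides the reduced-form effect of the instrument on engagement by its first-stage effect on exposure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Latent preference confounds exposure and engagement, as in the DAG.
u = rng.normal(0.0, 1.0, n)

# Hypothetical instrument: randomized top-slot assignment, independent
# of preference by construction (illustrative, not StreamRec's logging).
z = rng.binomial(1, 0.5, n)

# First stage: exposure depends on both the instrument and preference.
exposure = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.2 * z + u - 0.6))))

# Outcome: true effect 2.0 plus a 3.0-per-unit organic component.
y = 2.0 * exposure + 3.0 * u + rng.normal(0.0, 2.0, n)

# Naive contrast is inflated by the confounder.
naive = y[exposure == 1].mean() - y[exposure == 0].mean()

# Wald/IV estimator: reduced-form effect over first-stage effect.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    exposure[z == 1].mean() - exposure[z == 0].mean()
)

print(f"Naive: {naive:.2f}   IV (Wald): {wald:.2f}   True ATE: 2.0")
```

Because the instrument is independent of the confounder and excluded from the outcome equation, the Wald ratio recovers an estimate near 2.0 while the naive contrast remains inflated.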

Strategy 2: Difference-in-Differences (Chapter 18)

Natural experiment: StreamRec introduced a new recommendation algorithm on a specific date. Comparing engagement trends before and after the change (relative to items not affected by the algorithm change) provides a DiD estimate of the algorithm's causal effect.
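A sketch of the 2x2 DiD comparison under an assumed parallel-trends setup (all numbers hypothetical): differencing within each group removes persistent baseline gaps, and subtracting the control group's change removes the common time trend.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40_000

# Hypothetical panel: items affected by the new algorithm ("treated")
# or not, each observed before and after the rollout date.
treated = rng.binomial(1, 0.5, n)
post = rng.binomial(1, 0.5, n)

# Engagement: baseline gap, a time trend shared by both groups, and a
# 2.0-minute effect only for treated items after the rollout.
y = (
    5.0
    + 1.5 * treated          # persistent baseline difference
    + 0.8 * post             # common trend
    + 2.0 * treated * post   # causal effect of the new algorithm
    + rng.normal(0.0, 2.0, n)
)


def cell_mean(t, p):
    return y[(treated == t) & (post == p)].mean()


# DiD: (treated change) minus (control change) cancels both the
# baseline gap and the common trend.
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"DiD estimate: {did:.2f}   True effect: 2.0")
```

The design's credibility rests entirely on parallel trends: if treated and control items would have diverged anyway, the DiD estimate absorbs that divergence.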

Strategy 3: Front-Door Criterion

Candidate mediator: None readily available. The front-door criterion requires a variable that fully mediates the recommendation effect (all causal influence flows through it) and is not confounded with the outcome. In StreamRec's case, there is no clean mediator: the recommendation directly causes engagement without a measurable intermediate step.
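For reference, if a mediator $M$ with these properties did exist, the standard front-door adjustment (Pearl) would identify the effect as:

```latex
P(y \mid do(d)) \;=\; \sum_{m} P(m \mid d) \sum_{d'} P(y \mid m, d')\, P(d').
```

Lacking any candidate $M$ that fully carries the recommendation's effect, this route is closed for StreamRec.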

Lessons for Production Systems

This case study illustrates why causal inference is fundamentally harder for recommendation systems than for pharmaceutical interventions:

  1. The confounder is the signal. The algorithm is designed to exploit exactly the variable (user preference) that confounds the causal estimate. The better the algorithm, the stronger the confounding.

  2. Proxies are imperfect. User History captures behavioral patterns but not the latent preference. The residual confounding after proxy adjustment depends on how well the proxy represents the true confounder — and this is difficult to quantify without experimental data.

  3. The DAG clarifies the problem. Even though the DAG does not solve the identification problem (because of the unmeasured confounder), it makes the problem precise. The team now knows exactly what assumption is needed (that User History blocks the User Preference $\to$ Engagement path) and can design sensitivity analyses around that assumption.

  4. Multiple strategies complement each other. The backdoor approach with proxy variables, the IV approach with position randomization, and the DiD approach with algorithm changes give different estimates under different assumptions. If all three give similar answers, the result is more credible than any single approach.

Prediction $\neq$ Causation: The StreamRec recommendation algorithm is a prediction engine: it predicts which items will engage users. The causal question asks something fundamentally different: which items will engage users because of the recommendation. The DAG makes this distinction structural. The prediction task uses User Preference as a signal (higher preference $\Rightarrow$ higher predicted engagement). The causal task treats User Preference as a confounder (higher preference $\Rightarrow$ spuriously inflated recommendation effect). The same variable that makes the algorithm good at prediction makes it hard to evaluate causally.