In This Chapter
- Learning Objectives
- 16.1 From Prediction to Potential Outcomes
- 16.2 Potential Outcomes: Notation and Definitions
- 16.3 Estimands: ATE, ATT, ATU
- 16.4 The Identification Problem
- 16.5 SUTVA: The Stable Unit Treatment Value Assumption
- 16.6 Ignorability: The Assumption That Makes Identification Possible
- 16.7 Positivity: The Overlap Condition
- 16.8 How Randomization Solves the Problem
- 16.9 Regression Adjustment
- 16.10 Covariate Balance: Checking Whether Adjustment Worked
- 16.11 Putting It All Together: The Identification Checklist
- 16.12 Progressive Project: StreamRec Recommendation Effect
- 16.13 Summary
Chapter 16: The Potential Outcomes Framework — Counterfactuals, ATEs, and the Fundamental Problem of Causal Inference
"No causation without manipulation." — Paul Holland (1986), paraphrasing Donald Rubin
Learning Objectives
By the end of this chapter, you will be able to:
- Define potential outcomes $Y(0)$ and $Y(1)$ and formalize the fundamental problem of causal inference
- Derive the ATE, ATT, and ATU and explain when each is the quantity of interest
- State and evaluate the assumptions required for causal identification (SUTVA, ignorability, positivity)
- Explain why randomization solves the identification problem and what breaks without it
- Implement simple causal estimators (difference in means, regression adjustment) and analyze their bias
16.1 From Prediction to Potential Outcomes
Chapter 15 established that prediction and causation are fundamentally different tasks. A model that accurately predicts which patients will be hospitalized cannot tell you whether a drug causes fewer hospitalizations. A recommender system that predicts high engagement cannot tell you whether its recommendations cause that engagement or merely anticipate organic behavior.
This chapter builds the mathematical framework that makes "cause" a precise, well-defined concept: the potential outcomes framework, also known as the Rubin causal model (Rubin, 1974; Holland, 1986). This framework does not merely assert that correlation is not causation — it specifies exactly what a causal effect is, exactly what assumptions are needed to estimate it from data, and exactly how those assumptions can fail.
The central insight is deceptively simple: causal inference is a missing data problem. Every unit (patient, user, widget) has two potential outcomes — the outcome under treatment and the outcome under control — but we can only ever observe one. The unobserved outcome is the counterfactual, and no amount of data collection can reveal it. Every method in causal inference, from randomized experiments to instrumental variables, is a strategy for estimating what we cannot observe.
Prediction $\neq$ Causation: This chapter formalizes the distinction that Chapter 15 introduced informally. The potential outcomes framework provides the vocabulary, the notation, and — critically — the assumptions under which causal claims become defensible. Without this framework, "A causes B" is a claim of faith. With it, "A causes B" is a claim with stated assumptions that can be evaluated, challenged, and tested.
16.2 Potential Outcomes: Notation and Definitions
The Setup
We observe $N$ units, indexed by $i = 1, 2, \ldots, N$. Each unit is exposed to one of two conditions:
- Treatment ($D_i = 1$): the unit receives the intervention.
- Control ($D_i = 0$): the unit does not receive the intervention.
For each unit $i$, we define two potential outcomes:
- $Y_i(1)$: the outcome that would be observed if unit $i$ receives treatment.
- $Y_i(0)$: the outcome that would be observed if unit $i$ does not receive treatment.
These potential outcomes exist regardless of which treatment the unit actually receives. They are fixed attributes of the unit — not random variables (in the super-population interpretation, they are random, but we start with the finite-population perspective for clarity).
The observed outcome is:
$$Y_i = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)$$
This equation, called the switching equation, says that we observe $Y_i(1)$ when the unit is treated and $Y_i(0)$ when it is not. The other potential outcome — the one we do not observe — is the counterfactual.
A Concrete Example: MediCore Drug X
MediCore Pharmaceuticals is evaluating Drug X for reducing 30-day hospital readmission. For patient $i$:
- $Y_i(1)$: Would patient $i$ be readmitted if they take Drug X?
- $Y_i(0)$: Would patient $i$ be readmitted if they do not take Drug X?
- $D_i$: Did patient $i$ actually receive Drug X?
Consider three specific patients:
| Patient | $Y_i(0)$ | $Y_i(1)$ | $D_i$ | Observed $Y_i$ | ITE: $Y_i(1) - Y_i(0)$ |
|---|---|---|---|---|---|
| Alice | 1 (readmitted) | 0 (not readmitted) | 1 (treated) | 0 | $-1$ (drug helps) |
| Bob | 1 (readmitted) | 1 (readmitted) | 0 (control) | 1 | $0$ (drug has no effect) |
| Carol | 0 (not readmitted) | 0 (not readmitted) | 1 (treated) | 0 | $0$ (drug has no effect) |
Alice's individual treatment effect (ITE) is $-1$: Drug X prevents her readmission. Bob would be readmitted regardless. Carol would not be readmitted regardless.
But here is the critical point: we never observe both columns simultaneously for any patient. Alice took Drug X, so we observe $Y_{\text{Alice}} = 0$ but never learn that she would have been readmitted without it. Bob did not take Drug X, so we observe $Y_{\text{Bob}} = 1$ but never learn whether Drug X would have helped. The counterfactual column is always missing.
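The table above can be encoded directly. A minimal sketch (using the same hypothetical values for Alice, Bob, and Carol) makes the switching equation and the permanently missing counterfactual column explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical data matching the three-patient table above.
patients = pd.DataFrame({
    "patient": ["Alice", "Bob", "Carol"],
    "y0": [1, 1, 0],  # readmitted without Drug X
    "y1": [0, 1, 0],  # readmitted with Drug X
    "d": [1, 0, 1],   # treatment actually received
})

# Switching equation: Y = D * Y(1) + (1 - D) * Y(0).
patients["y_obs"] = (
    patients["d"] * patients["y1"] + (1 - patients["d"]) * patients["y0"]
)

# The potential outcome we never see: Y(0) for treated units, Y(1) for control.
patients["y_counterfactual"] = np.where(
    patients["d"] == 1, patients["y0"], patients["y1"]
)

print(patients[["patient", "d", "y_obs", "y_counterfactual"]])
```

In any real dataset only `y_obs` exists; the `y_counterfactual` column is available here solely because we wrote down both potential outcomes by hand.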
The Fundamental Problem of Causal Inference
Mathematical Foundation: Holland (1986) named this "the fundamental problem of causal inference": it is impossible to observe both $Y_i(1)$ and $Y_i(0)$ for the same unit at the same time. The individual treatment effect:
$$\tau_i = Y_i(1) - Y_i(0)$$
is therefore never directly observable. This is not a practical limitation that better data or better technology could overcome — it is a logical impossibility. A patient either takes the drug or does not. A user either sees the recommendation or does not. We cannot rewind time and observe the alternative.
This is why causal inference is fundamentally harder than prediction. For prediction, we observe $(X_i, Y_i)$ pairs and learn $f(X) \approx E[Y \mid X]$. For causation, we need $Y_i(1) - Y_i(0)$, but one of the two terms is always missing.
A Second Example: StreamRec Recommendations
For the StreamRec content platform, define treatment as the recommendation algorithm showing item $j$ to user $i$:
- $Y_i(1)$: User $i$'s engagement (e.g., watch time in minutes) if the item is recommended.
- $Y_i(0)$: User $i$'s engagement if the item is not recommended (but still available through browse/search).
- $D_i$: Was the item recommended to user $i$ by the algorithm?
The causal question is: Does the recommendation cause additional engagement beyond what would have occurred organically?
If user $i$ would have watched the video for 45 minutes without a recommendation ($Y_i(0) = 45$), and watched it for 47 minutes with the recommendation ($Y_i(1) = 47$), the causal effect of the recommendation is only 2 minutes — not the 47 minutes the algorithm "takes credit for" in standard evaluation metrics.
Production Reality: Most recommendation systems evaluate performance by measuring engagement on recommended items. This is $\mathbb{E}[Y \mid D = 1]$, which conflates the recommendation effect with organic behavior. A system could have zero causal effect — recommending only items users would have found anyway — while appearing to drive enormous engagement. Understanding the difference between $\mathbb{E}[Y \mid D = 1]$ and the causal effect $\mathbb{E}[Y(1) - Y(0)]$ is what separates a recommender system that creates value from one that merely measures it.
16.3 Estimands: ATE, ATT, ATU
Since individual treatment effects $\tau_i$ are unobservable, we turn to population-level summaries. The three fundamental causal estimands are:
Average Treatment Effect (ATE)
$$\text{ATE} = \mathbb{E}[\tau_i] = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]$$
The ATE is the average causal effect across the entire population. It answers: "If we could treat everyone, what would be the average change in the outcome compared to treating no one?"
When the ATE matters: Policy decisions that apply to an entire population. Should the FDA approve Drug X? Should a school district adopt a new curriculum? Should a platform change the default recommendation algorithm for all users?
Average Treatment Effect on the Treated (ATT)
$$\text{ATT} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 1] = \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 1]$$
The ATT is the average causal effect among those who actually received treatment. The first term $\mathbb{E}[Y_i(1) \mid D_i = 1]$ is observable — it is the average outcome among the treated. The second term $\mathbb{E}[Y_i(0) \mid D_i = 1]$ is the counterfactual: what would have happened to the treated group if they had not been treated?
When the ATT matters: Evaluating an existing program. Did the patients who received Drug X benefit from it? Did the users who received recommendations engage more because of those recommendations? The ATT asks about the effect on the group that was actually treated, which may differ from the general population.
Average Treatment Effect on the Untreated (ATU)
$$\text{ATU} = \mathbb{E}[Y_i(1) - Y_i(0) \mid D_i = 0] = \mathbb{E}[Y_i(1) \mid D_i = 0] - \mathbb{E}[Y_i(0) \mid D_i = 0]$$
The ATU answers: "What would have happened to the untreated group if they had been treated?"
When the ATU matters: Deciding whether to expand a program. If Drug X is currently given to high-risk patients, should it also be given to moderate-risk patients? The ATU for the moderate-risk group tells us whether expansion would be beneficial.
The Relationship Between ATE, ATT, and ATU
By the law of total expectation:
$$\text{ATE} = P(D = 1) \cdot \text{ATT} + P(D = 0) \cdot \text{ATU}$$
The ATE is a weighted average of the ATT and ATU, where the weights are the treatment probabilities. When $\text{ATT} = \text{ATU}$, treatment effects are homogeneous across the treated and untreated populations, and all three estimands coincide. When $\text{ATT} \neq \text{ATU}$, treatment effects are heterogeneous, and the choice of estimand matters.
Common Misconception: "The ATE is always the most useful estimand." In fact, the ATT is often more policy-relevant. A hospital that has already prescribed Drug X to certain patients wants to know whether those specific patients benefited (ATT), not whether a random member of the population would benefit (ATE). Conversely, a regulator deciding whether to approve Drug X for general use needs the ATE. Choosing the wrong estimand answers the wrong question, even if the estimate is unbiased.
MediCore Example: Why the Estimand Matters
Suppose Drug X has a strong effect on patients with a genetic biomarker (reducing readmission by 30%) but no effect on patients without it. If the biomarker is present in 20% of the population, and doctors selectively prescribe Drug X to biomarker-positive patients:
- $\text{ATT} = -0.30$ (the treated patients — those with the biomarker — benefit substantially).
- $\text{ATU} \approx 0$ (the untreated patients — those without the biomarker — would not benefit).
- $\text{ATE} = 0.20 \times (-0.30) + 0.80 \times 0 = -0.06$ (small average effect across the full population).
If MediCore reports only the ATT ($-0.30$), it overstates the drug's benefit for a general population. If the FDA evaluates only the ATE ($-0.06$), it might reject a drug that dramatically helps a subpopulation. The correct estimand depends on the decision: prescribing to identified biomarker-positive patients (ATT) vs. blanket approval for all patients (ATE).
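The arithmetic behind these numbers is just the weighted-average identity from the previous subsection, and can be checked in two lines:

```python
# MediCore example: ATE as the treatment-share-weighted average of ATT and ATU.
p_treated = 0.20  # share of the population that is biomarker-positive (and treated)
att = -0.30       # effect among the treated (biomarker-positive)
atu = 0.0         # effect among the untreated (biomarker-negative)

ate = p_treated * att + (1 - p_treated) * atu
print(f"ATE = {ate:.2f}")  # -0.06
```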
16.4 The Identification Problem
What We Want vs. What We Observe
The ATE is defined in terms of potential outcomes:
$$\text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$$
But from observed data, we can only compute the naive difference in means:
$$\hat{\Delta}_{\text{naive}} = \mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0]$$
These are not the same thing. The naive estimator compares the outcomes of the treated group to the outcomes of the control group. But if treatment is not randomly assigned, the treated and control groups may differ systematically — not because of the treatment, but because of the factors that determined who received treatment.
Decomposing the Bias
We can decompose the naive estimator to expose the bias:
$$\mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0]$$
Using the switching equation:
$$= \mathbb{E}[Y(1) \mid D = 1] - \mathbb{E}[Y(0) \mid D = 0]$$
Now add and subtract $\mathbb{E}[Y(0) \mid D = 1]$:
$$= \underbrace{\mathbb{E}[Y(1) \mid D = 1] - \mathbb{E}[Y(0) \mid D = 1]}_{\text{ATT}} + \underbrace{\mathbb{E}[Y(0) \mid D = 1] - \mathbb{E}[Y(0) \mid D = 0]}_{\text{Selection Bias}}$$
Mathematical Foundation: The naive difference in means equals the ATT plus the selection bias. The selection bias is the difference in baseline outcomes (under no treatment) between the treated and control groups. It captures all the pre-existing differences between the groups that affect the outcome. The naive estimator is unbiased for the ATT only when the selection bias is zero — that is, when the treated and control groups have the same baseline outcomes on average.
The selection bias is zero when treatment assignment is independent of potential outcomes. If sicker patients are more likely to receive Drug X, then $\mathbb{E}[Y(0) \mid D = 1] > \mathbb{E}[Y(0) \mid D = 0]$ (the treated group would have worse outcomes even without treatment), introducing positive selection bias that makes Drug X appear less effective than it is. If healthier patients are more likely to receive Drug X (the "healthy user" bias), the selection bias goes the other direction, making Drug X appear more effective than it is.
StreamRec Example: Selection Bias in Recommendations
For StreamRec, the recommendation algorithm preferentially recommends items it predicts the user will engage with. This means $\mathbb{E}[Y(0) \mid D = 1] > \mathbb{E}[Y(0) \mid D = 0]$: users who receive a recommendation for a particular item would have had higher engagement with that item even without the recommendation, because the algorithm selected items aligned with their preferences. The naive comparison overstates the recommendation effect.
import numpy as np
import pandas as pd

def simulate_recommendation_data(
    n_users: int = 10000,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate StreamRec recommendation data with confounding.

    User preference drives both recommendation probability and
    engagement, creating selection bias in naive estimates.

    Args:
        n_users: Number of user-item pairs.
        seed: Random seed.

    Returns:
        DataFrame with user preference, treatment, potential outcomes,
        and observed outcome.
    """
    rng = np.random.RandomState(seed)

    # User preference for the item (unobserved confounder in practice)
    preference = rng.normal(0, 1, n_users)

    # Recommendation probability increases with preference (confounding)
    rec_prob = 1 / (1 + np.exp(-1.5 * preference))
    treatment = rng.binomial(1, rec_prob)

    # Potential outcomes
    # Y(0): engagement without recommendation (driven by preference)
    y0 = 10 + 3 * preference + rng.normal(0, 2, n_users)
    # Y(1): engagement with recommendation
    # True treatment effect is 2.0 for everyone (homogeneous)
    true_ate = 2.0
    y1 = y0 + true_ate

    # Observed outcome
    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "preference": preference,
        "treatment": treatment,
        "y0": y0,
        "y1": y1,
        "y_obs": y_obs,
        "true_ite": y1 - y0,
    })

# Generate data and compute naive estimate
df = simulate_recommendation_data()
naive_estimate = (
    df.loc[df["treatment"] == 1, "y_obs"].mean()
    - df.loc[df["treatment"] == 0, "y_obs"].mean()
)
true_ate = df["true_ite"].mean()
print(f"True ATE: {true_ate:.3f}")
print(f"Naive estimate: {naive_estimate:.3f}")
print(f"Selection bias: {naive_estimate - true_ate:.3f}")
# Generate data and compute naive estimate
df = simulate_recommendation_data()
naive_estimate = (
df.loc[df["treatment"] == 1, "y_obs"].mean()
- df.loc[df["treatment"] == 0, "y_obs"].mean()
)
true_ate = df["true_ite"].mean()
print(f"True ATE: {true_ate:.3f}")
print(f"Naive estimate: {naive_estimate:.3f}")
print(f"Selection bias: {naive_estimate - true_ate:.3f}")
True ATE: 2.000
Naive estimate: 5.151
Selection bias: 3.151
The naive estimate ($\approx 5.15$) dramatically overstates the true effect ($2.0$) because the recommendation algorithm selects users with high organic engagement. The selection bias ($\approx 3.15$) accounts for over 60% of the naive estimate.
Production Reality: This simulation uses a constant treatment effect of 2.0, which makes the bias easy to see. In practice, the true treatment effect is unknown (that is the entire point of causal inference), and the selection bias can be larger than, smaller than, or even opposite in sign to the true effect. A naive analysis of StreamRec's recommendation system would attribute approximately 5.15 minutes of engagement to the recommendation, when only 2.0 minutes are causally attributable. The remaining 3.15 minutes reflect user preferences that would have manifested without any recommendation.
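Because the simulation stores both potential outcomes, the decomposition from Section 16.4 — naive difference equals ATT plus selection bias — can be verified exactly in-sample. A self-contained sketch that re-runs the same data-generating process (same seed and draw order, so the same realization):

```python
import numpy as np

rng = np.random.RandomState(42)
n = 10_000

# Same data-generating process as simulate_recommendation_data above.
preference = rng.normal(0, 1, n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-1.5 * preference)))
y0 = 10 + 3 * preference + rng.normal(0, 2, n)
y1 = y0 + 2.0  # constant treatment effect of 2.0
y_obs = treatment * y1 + (1 - treatment) * y0

# Naive difference in means = ATT + selection bias (an exact in-sample identity).
naive = y_obs[treatment == 1].mean() - y_obs[treatment == 0].mean()
att = (y1 - y0)[treatment == 1].mean()  # 2.0 by construction
selection_bias = y0[treatment == 1].mean() - y0[treatment == 0].mean()

print(f"naive = {naive:.3f}")
print(f"ATT + selection bias = {att + selection_bias:.3f}")
```

Here the ATT equals the ATE because the simulated effect is homogeneous; in general the decomposition recovers the ATT, not the ATE.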
16.5 SUTVA: The Stable Unit Treatment Value Assumption
Before we can make progress on the identification problem, we need to formalize what "potential outcomes" actually mean. The Stable Unit Treatment Value Assumption (SUTVA) has two components:
Component 1: No Interference
$$Y_i(D_1, D_2, \ldots, D_N) = Y_i(D_i)$$
Unit $i$'s potential outcomes depend only on its own treatment assignment, not on the treatment assignments of other units. Equivalently: my outcome is unaffected by your treatment.
When no-interference holds: Drug efficacy in most clinical trials. Whether Alice takes Drug X does not affect whether Bob is readmitted (assuming no infectious disease context).
When no-interference fails:
- Vaccination programs. If enough people around me are vaccinated, my risk of infection decreases even if I am not vaccinated (herd immunity). My $Y_i(0)$ depends on $D_j$ for other individuals.
- Marketplace experiments. If StreamRec recommends item X to user A but not user B, and both users are in the same household sharing an account, user A's behavior may affect user B's experience. Recommendation experiments on platforms with network effects violate no-interference.
- Educational interventions. If students in a classroom are randomized to receive a tutoring program, untreated students may benefit from spillover effects (treated students help untreated peers).
Without no-interference, the potential outcome $Y_i(D_i)$ is not well-defined because it depends on the full treatment vector $(D_1, \ldots, D_N)$. With $N$ units, each having 2 possible treatment statuses, there are $2^N$ potential outcomes per unit instead of 2.
Advanced Sidebar: Violations of no-interference lead to the study of interference effects, spillovers, and peer effects — a rapidly growing subfield. Hudgens and Halloran (2008) formalize "partial interference," assuming that units can be partitioned into clusters (households, classrooms, neighborhoods) where interference exists within clusters but not between them. This reduces the $2^N$ potential outcomes to a manageable number per unit. For network-based interference (relevant to StreamRec), see Aronow and Samii (2017) and Leung (2020). These methods are important but beyond the scope of this chapter.
Component 2: Consistency (No Hidden Variations of Treatment)
$$\text{If } D_i = d, \text{ then } Y_i = Y_i(d)$$
There is only one version of treatment. The potential outcome $Y_i(1)$ is the same regardless of how the treatment was administered.
When consistency holds: Drug X is a standardized pill with a fixed formulation and dosage. $Y_i(1)$ is unambiguous.
When consistency fails:
- Multiple treatment versions. "Exercise" is not a single treatment. Running 30 minutes daily, swimming 45 minutes three times per week, and weight training every other day are all "exercise," but they may have different effects. If $D_i = 1$ means "exercises," then $Y_i(1)$ is not well-defined because it depends on which version of exercise the unit receives.
- Recommendation mechanisms. If "recommending item X" to user $i$ could mean placing it in the top carousel, sending a push notification, or showing it as a banner — and these have different effects — then $Y_i(1)$ is ambiguous. StreamRec must define the treatment precisely: $D_i = 1$ means "item X appears in position 1–3 of the homepage carousel."
Common Misconception: "SUTVA is just a technicality that always holds." SUTVA is substantive and often violated. Marketplace experiments, social network interventions, and multi-version treatments all violate SUTVA. When SUTVA is violated, the standard potential outcomes framework does not apply directly, and specialized methods are required. Always evaluate SUTVA explicitly before proceeding with causal analysis.
SUTVA in the MediCore Context
For MediCore's Drug X analysis:
- No interference: Plausible if readmission is driven by individual health status. Violated if Drug X treats an infectious disease (my treatment affects your outcome through reduced transmission), or if hospital capacity constraints mean that one patient's readmission affects another's care.
- Consistency: Plausible if Drug X has a standardized formulation and dosage. Violated if different hospitals administer Drug X with different protocols, monitoring intensity, or ancillary care, and these variations affect the outcome.
Evaluating SUTVA requires domain knowledge, not statistical tests. It cannot be verified from data alone.
16.6 Ignorability: The Assumption That Makes Identification Possible
Strong Ignorability (Unconditional)
$$\{Y(0), Y(1)\} \perp\!\!\!\perp D$$
Treatment assignment $D$ is independent of both potential outcomes. Equivalently: knowing whether a unit was treated tells you nothing about what its outcomes would have been under treatment or control.
Under strong ignorability:
$$\mathbb{E}[Y(1) \mid D = 1] = \mathbb{E}[Y(1) \mid D = 0] = \mathbb{E}[Y(1)]$$
$$\mathbb{E}[Y(0) \mid D = 1] = \mathbb{E}[Y(0) \mid D = 0] = \mathbb{E}[Y(0)]$$
The selection bias disappears:
$$\mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \text{ATE}$$
The naive difference in means is an unbiased estimator of the ATE under strong ignorability.
When strong ignorability holds: Randomized experiments, where treatment is assigned by coin flip, independent of all patient characteristics.
When strong ignorability fails: Virtually all observational studies. Doctors prescribe Drug X based on patient characteristics (disease severity, comorbidities, age) that also affect the outcome. The recommendation algorithm selects items based on user preferences that also drive engagement.
Conditional Ignorability (Unconfoundedness)
Strong ignorability almost never holds in observational data. The weaker (but still powerful) assumption is conditional ignorability:
$$\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid \mathbf{X}$$
Treatment assignment is independent of potential outcomes, conditional on observed covariates $\mathbf{X}$. This means: once we control for $\mathbf{X}$, the remaining variation in treatment assignment is "as good as random."
Under conditional ignorability, we can identify the ATE by adjusting for $\mathbf{X}$:
$$\text{ATE} = \mathbb{E}_{\mathbf{X}}\left[\mathbb{E}[Y \mid D = 1, \mathbf{X}] - \mathbb{E}[Y \mid D = 0, \mathbf{X}]\right]$$
This is the adjustment formula: compute the treatment effect within each stratum of $\mathbf{X}$, then average across the population distribution of $\mathbf{X}$.
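For a single discrete confounder, the adjustment formula can be implemented directly: estimate the difference in means within each stratum of $\mathbf{X}$, then average the stratum effects weighted by each stratum's population share. A minimal sketch on simulated data (the data-generating process here is illustrative, not from the MediCore study):

```python
import numpy as np
import pandas as pd

def stratified_ate(y: np.ndarray, d: np.ndarray, x: np.ndarray) -> float:
    """Adjustment-formula ATE estimate with a single discrete covariate x.

    Within each stratum of x, compute E[Y | D=1, x] - E[Y | D=0, x],
    then average the stratum effects weighted by P(X = x).
    """
    df = pd.DataFrame({"y": y, "d": d, "x": x})
    ate = 0.0
    for _, g in df.groupby("x"):
        effect = g.loc[g.d == 1, "y"].mean() - g.loc[g.d == 0, "y"].mean()
        ate += (len(g) / len(df)) * effect
    return ate

# Toy example: x confounds both treatment assignment and the outcome.
rng = np.random.RandomState(0)
n = 20_000
x = rng.binomial(1, 0.5, n)                       # discrete confounder
d = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # x raises P(treatment)
y = 5 * x + 2 * d + rng.normal(0, 1, n)           # true effect of d is 2

naive = y[d == 1].mean() - y[d == 0].mean()       # biased upward by x
adjusted = stratified_ate(y, d, x)                # close to the true 2.0
print(f"naive = {naive:.2f}, adjusted = {adjusted:.2f}")
```

Stratification only works when conditional ignorability and positivity hold within every stratum; with many or continuous covariates, exact stratification breaks down, which motivates the regression and propensity-score methods of later sections.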
The Untestable Nature of Ignorability
Mathematical Foundation: Conditional ignorability is the most critical assumption in observational causal inference — and it is untestable from observed data. The assumption states that there are no unmeasured confounders: every variable that affects both treatment assignment and the outcome is included in $\mathbf{X}$. But how can we verify that we have not missed a confounder we did not measure? We cannot. If an unmeasured variable $U$ affects both $D$ and $Y$, then $\{Y(0), Y(1)\} \not\perp\!\!\!\perp D \mid \mathbf{X}$, even if it holds conditional on $(\mathbf{X}, U)$.
This is why causal claims from observational studies always carry the caveat "assuming no unmeasured confounders." Sensitivity analysis (Rosenbaum, 2002; Ding and VanderWeele, 2016) provides tools for asking "How strong would an unmeasured confounder have to be to explain away the estimated effect?" — but these tools bound the damage rather than eliminating the risk.
MediCore Example: Conditional Ignorability Assessment
For MediCore's Drug X study, the covariates $\mathbf{X}$ might include: age, sex, disease severity score, number of comorbidities, prior hospitalizations, insurance type, and prescribing physician.
Conditional ignorability requires that, within subgroups defined by these covariates, Drug X prescribing is essentially random. This is plausible if:
- $\mathbf{X}$ captures all the factors physicians use to decide whether to prescribe Drug X.
- $\mathbf{X}$ captures all the factors that affect readmission risk.
It is implausible if:
- Physicians observe something about the patient (e.g., how the patient "looks" during examination, adherence history not in the EHR) that affects both their prescribing decision and the patient's outcome.
- There is a genetic predisposition (unrecorded) that both affects drug response and correlates with who receives the drug.
The plausibility of conditional ignorability is a domain judgment, not a statistical calculation. It must be argued, not tested.
16.7 Positivity: The Overlap Condition
The third identification assumption is positivity (also called the overlap condition):
$$0 < P(D = 1 \mid \mathbf{X} = \mathbf{x}) < 1 \quad \text{for all } \mathbf{x} \text{ in the support of } \mathbf{X}$$
For every combination of covariates that exists in the population, there must be a positive probability of receiving treatment and a positive probability of not receiving treatment.
Why Positivity Matters
If positivity is violated for some $\mathbf{x}$, then $\mathbb{E}[Y \mid D = 1, \mathbf{X} = \mathbf{x}]$ or $\mathbb{E}[Y \mid D = 0, \mathbf{X} = \mathbf{x}]$ is undefined (we have no data for that cell), and the adjustment formula fails.
Deterministic vs. Random Positivity Violations
Deterministic violation: Treatment is impossible or mandatory for some covariate values. For example, if MediCore's Drug X is contraindicated for patients over age 85, then $P(D = 1 \mid \text{Age} > 85) = 0$. The causal effect of Drug X on patients over 85 is simply not identifiable from observational data — we cannot compare treated and untreated patients in this group because there are no treated patients.
Random (practical) violation: Treatment is theoretically possible but extremely rare for some covariate values. If only 3 out of 10,000 patients with a rare comorbidity received Drug X, then $P(D = 1 \mid \text{rare comorbidity}) \approx 0.0003$. Technically, positivity holds, but estimation is extremely imprecise because we are extrapolating from very few observations. This is sometimes called practical positivity violation or near-positivity violation.
Diagnosing Positivity
Unlike ignorability, positivity is partially testable from data. We can estimate $P(D = 1 \mid \mathbf{X} = \mathbf{x})$ (the propensity score, covered in Chapter 18) and check its distribution:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

def diagnose_positivity(
    X: np.ndarray,
    treatment: np.ndarray,
) -> dict[str, float]:
    """Diagnose positivity by estimating and plotting propensity scores.

    Fits a logistic regression to estimate P(D=1|X) and checks for
    near-violations (propensity scores close to 0 or 1).

    Args:
        X: Covariate matrix of shape (n, p).
        treatment: Binary treatment vector of shape (n,).

    Returns:
        Dictionary with positivity diagnostic statistics.
    """
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X, treatment)
    propensity = model.predict_proba(X)[:, 1]

    diagnostics = {
        "min_propensity": float(propensity.min()),
        "max_propensity": float(propensity.max()),
        "pct_below_01": float(np.mean(propensity < 0.01) * 100),
        "pct_above_99": float(np.mean(propensity > 0.99) * 100),
        "pct_in_overlap": float(
            np.mean((propensity > 0.1) & (propensity < 0.9)) * 100
        ),
    }

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Histogram by treatment group
    axes[0].hist(
        propensity[treatment == 1], bins=50, alpha=0.6,
        label="Treated", density=True, color="steelblue"
    )
    axes[0].hist(
        propensity[treatment == 0], bins=50, alpha=0.6,
        label="Control", density=True, color="coral"
    )
    axes[0].set_xlabel("Estimated Propensity Score")
    axes[0].set_ylabel("Density")
    axes[0].set_title("Propensity Score Distribution by Treatment Group")
    axes[0].legend()

    # Cumulative distribution
    for d, label, color in [(1, "Treated", "steelblue"), (0, "Control", "coral")]:
        sorted_p = np.sort(propensity[treatment == d])
        cdf = np.arange(1, len(sorted_p) + 1) / len(sorted_p)
        axes[1].plot(sorted_p, cdf, label=label, color=color)
    axes[1].set_xlabel("Estimated Propensity Score")
    axes[1].set_ylabel("Cumulative Proportion")
    axes[1].set_title("Propensity Score CDF")
    axes[1].legend()

    plt.tight_layout()
    plt.savefig("positivity_diagnostics.png", dpi=150, bbox_inches="tight")
    plt.close()

    return diagnostics
Interpreting the diagnostics: Good overlap means the propensity score distributions for treated and control groups substantially overlap. If the distributions are separated — treated units have propensity scores near 1.0 and control units have scores near 0.0 — causal estimation becomes unreliable because the adjustment formula requires extrapolation into regions with no data.
StreamRec Positivity Concerns
For StreamRec, positivity can be problematic because the recommendation algorithm is deterministic given user features: if the algorithm always recommends item X to users with preference score above 0.8, then $P(D = 0 \mid \text{preference} > 0.8) = 0$ and positivity is violated. This is a fundamental challenge for causal evaluation of deterministic policies. Solutions include: introducing randomization (epsilon-greedy exploration), using historical policy changes, or restricting the target population to the overlap region.
16.8 How Randomization Solves the Problem
Randomized controlled trials (RCTs) are the gold standard for causal inference because randomization simultaneously satisfies all three identification assumptions.
Why Randomization Works
In a randomized experiment, treatment is assigned by a mechanism that is:
- Independent of potential outcomes: $\{Y(0), Y(1)\} \perp\!\!\!\perp D$. A coin flip does not know or care about a patient's health status. This guarantees strong ignorability without conditioning on any covariates.
- Probabilistic: $0 < P(D = 1) < 1$. Every unit has a positive probability of being treated and untreated. This guarantees positivity (unconditionally, not just conditionally).
SUTVA must still be evaluated separately (randomization does not prevent interference or hidden treatment variations), but when SUTVA holds, randomization makes the simple difference in means an unbiased estimator of the ATE:
$$\hat{\tau}_{\text{RCT}} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}} = \frac{1}{N_1} \sum_{i: D_i = 1} Y_i - \frac{1}{N_0} \sum_{i: D_i = 0} Y_i$$
Under randomization, the selection bias is zero in expectation, and $\hat{\tau}_{\text{RCT}}$ is unbiased for the ATE.
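To see this numerically, we can keep the outcome model from the Section 16.4 simulation but assign treatment by a fair coin flip; the difference in means then lands on the true effect (a sketch reusing that data-generating process with a larger sample):

```python
import numpy as np

rng = np.random.RandomState(42)
n = 100_000

# Same outcome model as the StreamRec simulation: preference drives engagement.
preference = rng.normal(0, 1, n)
y0 = 10 + 3 * preference + rng.normal(0, 2, n)
y1 = y0 + 2.0  # true ATE = 2.0

# Randomized assignment: independent of preference and of potential outcomes.
d = rng.binomial(1, 0.5, n)
y_obs = d * y1 + (1 - d) * y0

tau_hat = y_obs[d == 1].mean() - y_obs[d == 0].mean()
print(f"difference in means under randomization: {tau_hat:.3f}")  # close to 2.0
```

The same data-generating process that produced a selection bias of roughly 3.15 under confounded assignment produces essentially none under randomization, because the coin flip severs the link between preference and treatment.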
Finite-Sample Variance
While $\hat{\tau}_{\text{RCT}}$ is unbiased, its precision depends on sample size and outcome variability. The variance of the difference-in-means estimator under complete random assignment is:
$$\text{Var}(\hat{\tau}_{\text{RCT}}) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_{01}^2}{N}$$
where $S_d^2 = \frac{1}{N - 1} \sum_{i=1}^N (Y_i(d) - \bar{Y}(d))^2$ is the population variance of potential outcomes under treatment $d$, and $S_{01}^2$ is the variance of the individual treatment effects $\tau_i$. The last term is non-negative but not estimable, because it depends on the joint distribution of $Y(0)$ and $Y(1)$, which is never observed; dropping it from the estimate yields a conservative (upward-biased) variance estimate.
In practice, the Neyman variance estimator is used:
$$\widehat{\text{Var}}(\hat{\tau}) = \frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}$$
where $s_d^2$ is the sample variance of $Y$ in treatment group $d$.
import numpy as np
from scipy import stats as sp_stats
def difference_in_means(
y: np.ndarray,
treatment: np.ndarray,
alpha: float = 0.05,
) -> dict[str, float]:
"""Compute the difference-in-means estimator with inference.
Under randomization (or conditional ignorability), the difference
in sample means is an unbiased estimator of the ATE.
Args:
y: Observed outcomes of shape (n,).
treatment: Binary treatment vector of shape (n,).
alpha: Significance level for confidence interval.
Returns:
Dictionary with point estimate, standard error, CI, and p-value.
"""
y1 = y[treatment == 1]
y0 = y[treatment == 0]
n1, n0 = len(y1), len(y0)
# Point estimate
tau_hat = y1.mean() - y0.mean()
# Neyman variance estimator (conservative under heterogeneous effects)
se = np.sqrt(y1.var(ddof=1) / n1 + y0.var(ddof=1) / n0)
# Welch t-test for unequal variances
# Degrees of freedom via Welch-Satterthwaite
s1_sq, s0_sq = y1.var(ddof=1), y0.var(ddof=1)
df_num = (s1_sq / n1 + s0_sq / n0) ** 2
df_den = (s1_sq / n1) ** 2 / (n1 - 1) + (s0_sq / n0) ** 2 / (n0 - 1)
df = df_num / df_den
t_crit = sp_stats.t.ppf(1 - alpha / 2, df)
ci_lower = tau_hat - t_crit * se
ci_upper = tau_hat + t_crit * se
p_value = 2 * (1 - sp_stats.t.cdf(abs(tau_hat / se), df))
return {
"estimate": float(tau_hat),
"se": float(se),
"ci_lower": float(ci_lower),
"ci_upper": float(ci_upper),
"p_value": float(p_value),
"n_treated": int(n1),
"n_control": int(n0),
}
# Example: difference in means on our simulated data (biased — not randomized)
result_naive = difference_in_means(df["y_obs"].values, df["treatment"].values)
print("Naive difference-in-means (non-randomized data):")
print(f" Estimate: {result_naive['estimate']:.3f}")
print(f" 95% CI: [{result_naive['ci_lower']:.3f}, {result_naive['ci_upper']:.3f}]")
print(f" p-value: {result_naive['p_value']:.4f}")
print(f" True ATE: {df['true_ite'].mean():.3f}")
print(f" Bias: {result_naive['estimate'] - df['true_ite'].mean():.3f}")
Naive difference-in-means (non-randomized data):
Estimate: 5.151
95% CI: [4.876, 5.425]
p-value: 0.0000
True ATE: 2.000
Bias: 3.151
The confidence interval does not contain the true ATE. The estimator is not just imprecise — it is systematically biased. The narrow confidence interval around a wrong answer is worse than a wide interval, because it gives false confidence.
Simulating a Randomized Experiment
def simulate_rct(
n_users: int = 10000,
true_ate: float = 2.0,
seed: int = 42,
) -> pd.DataFrame:
"""Simulate a randomized experiment for StreamRec.
Treatment (recommendation) is assigned by coin flip, independent
of user preferences. This satisfies strong ignorability.
Args:
n_users: Number of user-item pairs.
true_ate: True average treatment effect.
seed: Random seed.
Returns:
DataFrame with randomized treatment assignment and outcomes.
"""
rng = np.random.RandomState(seed)
# Same data-generating process, but randomized treatment
preference = rng.normal(0, 1, n_users)
# Random assignment — independent of preference
treatment = rng.binomial(1, 0.5, n_users)
# Potential outcomes (same as before)
y0 = 10 + 3 * preference + rng.normal(0, 2, n_users)
y1 = y0 + true_ate
y_obs = treatment * y1 + (1 - treatment) * y0
return pd.DataFrame({
"preference": preference,
"treatment": treatment,
"y0": y0,
"y1": y1,
"y_obs": y_obs,
"true_ite": y1 - y0,
})
# Randomized experiment
df_rct = simulate_rct()
result_rct = difference_in_means(df_rct["y_obs"].values, df_rct["treatment"].values)
print("Difference-in-means (randomized experiment):")
print(f" Estimate: {result_rct['estimate']:.3f}")
print(f" 95% CI: [{result_rct['ci_lower']:.3f}, {result_rct['ci_upper']:.3f}]")
print(f" True ATE: {df_rct['true_ite'].mean():.3f}")
print(f" Bias: {result_rct['estimate'] - df_rct['true_ite'].mean():.3f}")
Difference-in-means (randomized experiment):
Estimate: 1.958
95% CI: [1.813, 2.103]
True ATE: 2.000
Bias: -0.042
Under randomization, the estimate is close to the true ATE, the bias is small (finite-sample noise, not systematic bias), and the confidence interval covers the true value.
When Randomization Is Not Possible
Randomization is not always feasible or ethical:
- Ethical constraints. We cannot randomize patients to receive a harmful treatment, or withhold a proven treatment from a control group.
- Practical constraints. We cannot randomize macroeconomic policies, historical events, or geographic features.
- Platform constraints. StreamRec may not be willing to show deliberately suboptimal recommendations to randomly selected users, as this reduces engagement (and revenue) in the short term.
- Compliance issues. Even in an RCT, patients may not take the prescribed drug (non-compliance), or control patients may obtain the drug elsewhere, so the treatment actually received deviates from the randomized assignment.
When randomization is unavailable, we must rely on observational data and the identification assumptions discussed above. The next sections develop the tools for doing so.
16.9 Regression Adjustment
The Strategy
When strong ignorability fails but conditional ignorability holds (given covariates $\mathbf{X}$), we can use regression adjustment to control for confounders.
The idea is to model $\mathbb{E}[Y \mid D, \mathbf{X}]$ using a regression, then use this model to impute the missing potential outcomes.
Linear Regression Adjustment
The simplest approach fits a linear model:
$$Y_i = \alpha + \tau D_i + \boldsymbol{\beta}^\top \mathbf{X}_i + \varepsilon_i$$
Under correct specification, $\hat{\tau}$ is the conditional ATE: the average effect of treatment, holding the covariates constant. If conditional ignorability holds and the model is correctly specified, $\hat{\tau}$ is an unbiased estimator of the ATE.
import statsmodels.api as sm
def regression_adjustment(
y: np.ndarray,
treatment: np.ndarray,
X: np.ndarray,
feature_names: list[str] | None = None,
) -> dict[str, float]:
"""Estimate ATE via linear regression adjustment.
Fits Y = alpha + tau * D + beta' X + epsilon and reports
the coefficient on D as the ATE estimate.
Args:
y: Observed outcomes of shape (n,).
treatment: Binary treatment vector of shape (n,).
X: Covariate matrix of shape (n, p).
feature_names: Optional names for covariates.
Returns:
Dictionary with ATE estimate, standard error, CI, and p-value.
"""
if feature_names is None:
feature_names = [f"x{j}" for j in range(X.shape[1])]
    # Build a named design matrix: [intercept, treatment, X]
    design = pd.DataFrame(
        sm.add_constant(np.column_stack([treatment, X])),
        columns=["const", "treatment"] + feature_names,
    )
    model = sm.OLS(y, design).fit(cov_type="HC1")  # Heteroskedasticity-robust SEs
    return {
        "estimate": float(model.params["treatment"]),
        "se": float(model.bse["treatment"]),
        "ci_lower": float(model.conf_int().loc["treatment", 0]),
        "ci_upper": float(model.conf_int().loc["treatment", 1]),
        "p_value": float(model.pvalues["treatment"]),
        "r_squared": float(model.rsquared),
        "model_summary": model.summary().as_text(),
    }
# Regression adjustment on observational data, controlling for preference
result_reg = regression_adjustment(
y=df["y_obs"].values,
treatment=df["treatment"].values,
X=df[["preference"]].values,
feature_names=["preference"],
)
print("Regression adjustment (controlling for preference):")
print(f" Estimate: {result_reg['estimate']:.3f}")
print(f" 95% CI: [{result_reg['ci_lower']:.3f}, {result_reg['ci_upper']:.3f}]")
print(f" True ATE: {df['true_ite'].mean():.3f}")
print(f" Bias: {result_reg['estimate'] - df['true_ite'].mean():.3f}")
Regression adjustment (controlling for preference):
Estimate: 1.993
95% CI: [1.914, 2.073]
True ATE: 2.000
Bias: -0.007
When we control for the confounder (user preference), the regression adjustment recovers the true ATE. The bias drops from 3.15 to essentially zero.
The Bias of Regression Under Omitted Confounding
What happens when conditional ignorability fails — when there is an unmeasured confounder?
Mathematical Foundation: Consider the data-generating process:
$$Y_i = \alpha + \tau D_i + \beta_1 X_i + \beta_2 U_i + \varepsilon_i$$
where $U_i$ is an unmeasured confounder that affects both treatment and outcome. If we run a regression omitting $U$:
$$Y_i = \tilde{\alpha} + \tilde{\tau} D_i + \tilde{\beta}_1 X_i + \tilde{\varepsilon}_i$$
the estimated treatment effect $\tilde{\tau}$ converges to:
$$\tilde{\tau} \xrightarrow{p} \tau + \beta_2 \cdot \delta$$
where $\delta$ is the coefficient from regressing $U$ on $D$ conditional on $X$:
$$\delta = \frac{\text{Cov}(D, U \mid X)}{\text{Var}(D \mid X)}$$
The bias $\beta_2 \cdot \delta$ is the omitted variable bias (OVB) formula. It has two components:
- $\beta_2$: the effect of the omitted variable on the outcome.
- $\delta$: the association between the omitted variable and treatment (conditional on included covariates).
The bias is zero only when either $\beta_2 = 0$ (the omitted variable does not affect the outcome) or $\delta = 0$ (the omitted variable is unrelated to treatment after conditioning on $X$). A variable must be a confounder — affecting both treatment and outcome — to create bias.
Let us demonstrate this with code:
def simulate_omitted_confounder(
n: int = 10000,
true_ate: float = 2.0,
gamma_u_to_d: float = 1.0,
beta_u_to_y: float = 3.0,
seed: int = 42,
) -> tuple[pd.DataFrame, dict[str, float]]:
"""Simulate data with observed and unobserved confounders.
X is observed and included in the regression.
U is unobserved and omitted, creating OVB.
Args:
n: Sample size.
true_ate: True average treatment effect.
gamma_u_to_d: Effect of U on treatment assignment (log-odds).
beta_u_to_y: Effect of U on outcome.
seed: Random seed.
Returns:
Tuple of (DataFrame, dictionary with OVB analysis).
"""
rng = np.random.RandomState(seed)
# Observed confounder
x = rng.normal(0, 1, n)
# Unobserved confounder
u = rng.normal(0, 1, n)
# Treatment depends on both X and U
logit_d = 0.5 * x + gamma_u_to_d * u
treatment = rng.binomial(1, 1 / (1 + np.exp(-logit_d)))
# Outcome depends on D, X, and U
y0 = 10 + 2 * x + beta_u_to_y * u + rng.normal(0, 1, n)
y1 = y0 + true_ate
y_obs = treatment * y1 + (1 - treatment) * y0
data = pd.DataFrame({
"x": x, "u": u, "treatment": treatment,
"y_obs": y_obs, "y0": y0, "y1": y1,
})
# Regression controlling for X only (omitting U)
result_partial = regression_adjustment(
y=y_obs, treatment=treatment,
X=x.reshape(-1, 1), feature_names=["x"],
)
# Regression controlling for X and U (oracle)
result_oracle = regression_adjustment(
y=y_obs, treatment=treatment,
X=np.column_stack([x, u]), feature_names=["x", "u"],
)
analysis = {
"partial_estimate": result_partial["estimate"],
"oracle_estimate": result_oracle["estimate"],
"true_ate": true_ate,
"ovb": result_partial["estimate"] - true_ate,
"ovb_sign": "positive" if result_partial["estimate"] > true_ate else "negative",
}
return data, analysis
data_ovb, ovb_analysis = simulate_omitted_confounder()
print("Omitted Variable Bias Demonstration:")
print(f" True ATE: {ovb_analysis['true_ate']:.3f}")
print(f" Controlling for X only: {ovb_analysis['partial_estimate']:.3f} "
f"(OVB = {ovb_analysis['ovb']:+.3f})")
print(f" Controlling for X & U: {ovb_analysis['oracle_estimate']:.3f}")
Omitted Variable Bias Demonstration:
True ATE: 2.000
Controlling for X only: 4.264 (OVB = +2.264)
Controlling for X & U: 2.006
Controlling for $X$ alone leaves a large positive bias ($+2.26$) because the omitted confounder $U$ has a positive effect on both treatment ($\gamma = 1$) and outcome ($\beta_2 = 3$). The oracle regression that includes $U$ eliminates the bias. In practice, we never have the oracle — the critical question is always whether we have measured enough confounders.
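The OVB algebra can also be checked numerically. The sketch below re-simulates a data-generating process of the same form (with fresh draws, so the printed numbers differ slightly from those above) and verifies the exact in-sample identity: the short-regression coefficient equals the long-regression coefficient plus $\hat{\beta}_2 \cdot \hat{\delta}$.

```python
import numpy as np

# Re-simulate a DGP of the same form as simulate_omitted_confounder
rng = np.random.RandomState(7)
n = 10000
x = rng.normal(0, 1, n)
u = rng.normal(0, 1, n)
d = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * x + 1.0 * u)))).astype(float)
y = 10 + 2 * x + 3.0 * u + 2.0 * d + rng.normal(0, 1, n)

def ols_coefs(yv, cols):
    """OLS coefficients of yv on [1, cols...] via least squares."""
    Z = np.column_stack([np.ones(len(yv))] + cols)
    return np.linalg.lstsq(Z, yv, rcond=None)[0]

tau_short = ols_coefs(y, [d, x])[1]          # short regression omits U
long_fit = ols_coefs(y, [d, x, u])
tau_long, beta2_hat = long_fit[1], long_fit[3]
delta_hat = ols_coefs(u, [d, x])[1]          # association of U with D given X

# Sample OVB identity: tau_short = tau_long + beta2_hat * delta_hat (exact)
print(f"tau_short: {tau_short:.3f}")
print(f"tau_long + beta2 * delta: {tau_long + beta2_hat * delta_hat:.3f}")
```

Because both $\beta_2$ and $\delta$ are positive here, the short regression overstates the effect, matching the sign table in the next subsection.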
The Direction and Magnitude of OVB
The OVB formula $\text{Bias} = \beta_2 \cdot \delta$ gives us a framework for reasoning about the direction of bias:
| $\beta_2$ (effect on $Y$) | $\delta$ (assoc. with $D$) | Sign of Bias | Interpretation |
|---|---|---|---|
| $+$ | $+$ | $+$ | Overstates effect (or makes negative effect appear positive) |
| $+$ | $-$ | $-$ | Understates effect (or makes positive effect appear negative) |
| $-$ | $+$ | $-$ | Understates effect |
| $-$ | $-$ | $+$ | Overstates effect |
Research Insight: Cinelli and Hazlett (2020) formalize sensitivity analysis for OVB in terms of partial $R^2$ values: "How much of the residual variation in $D$ and $Y$ would an omitted confounder have to explain to change the conclusion?" Their approach, implemented in the sensemakr R package (and adaptable to Python), provides a principled way to assess the robustness of causal estimates to unmeasured confounding without requiring knowledge of the specific omitted variable.
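As a rough sketch of their headline quantity (our reading of the published formula, not a substitute for the package), the robustness value $RV_q$ can be computed from the treatment coefficient's t-statistic and the regression's residual degrees of freedom:

```python
import numpy as np

def robustness_value(t_stat, dof, q=1.0):
    """Cinelli-Hazlett robustness value RV_q (sketch).

    RV_q is the share of residual variance (of both treatment and outcome)
    that an omitted confounder would need to explain to change the point
    estimate by a fraction q (q = 1 means reducing it to zero).
    """
    f_q = q * abs(t_stat) / np.sqrt(dof)
    return 0.5 * (np.sqrt(f_q**4 + 4 * f_q**2) - f_q**2)

# e.g. a treatment t-statistic of 4.0 with 1000 residual degrees of freedom
rv = robustness_value(4.0, 1000)
print(f"RV_1 = {rv:.3f}")  # ≈ 0.119
```

Here a confounder would need to explain roughly 12% of the residual variance of both treatment and outcome to explain away the estimate; larger t-statistics yield larger (more robust) values.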
16.10 Covariate Balance: Checking Whether Adjustment Worked
Even when we cannot test ignorability directly, we can check whether the covariates are balanced between treated and control groups — before and after adjustment.
Pre-Adjustment Balance
In a randomized experiment, covariates should be balanced between treatment and control groups by design. In an observational study, they typically are not.
def check_covariate_balance(
X: np.ndarray,
treatment: np.ndarray,
feature_names: list[str] | None = None,
) -> pd.DataFrame:
"""Check covariate balance between treated and control groups.
Computes the standardized mean difference (SMD) for each covariate.
|SMD| < 0.1 is generally considered good balance.
Args:
X: Covariate matrix of shape (n, p).
treatment: Binary treatment vector of shape (n,).
feature_names: Optional names for covariates.
Returns:
DataFrame with balance diagnostics per covariate.
"""
if feature_names is None:
feature_names = [f"x{j}" for j in range(X.shape[1])]
results = []
for j, name in enumerate(feature_names):
x_treated = X[treatment == 1, j]
x_control = X[treatment == 0, j]
mean_diff = x_treated.mean() - x_control.mean()
pooled_sd = np.sqrt(
(x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2
)
smd = mean_diff / pooled_sd if pooled_sd > 0 else 0.0
results.append({
"covariate": name,
"mean_treated": float(x_treated.mean()),
"mean_control": float(x_control.mean()),
"smd": float(smd),
"balanced": abs(smd) < 0.1,
})
return pd.DataFrame(results)
# Check balance in observational data
balance = check_covariate_balance(
X=df[["preference"]].values,
treatment=df["treatment"].values,
feature_names=["preference"],
)
print("Covariate balance (observational data):")
print(balance.to_string(index=False))
print()
# Check balance in RCT data
balance_rct = check_covariate_balance(
X=df_rct[["preference"]].values,
treatment=df_rct["treatment"].values,
feature_names=["preference"],
)
print("Covariate balance (randomized experiment):")
print(balance_rct.to_string(index=False))
Covariate balance (observational data):
covariate mean_treated mean_control smd balanced
preference 0.520 -0.528 1.051 False
Covariate balance (randomized experiment):
covariate mean_treated mean_control smd balanced
preference 0.018 -0.011 0.029 True
In the observational data, the standardized mean difference (SMD) for preference is 1.05 — massive imbalance, indicating that treated and control groups are fundamentally different populations. In the RCT, the SMD is 0.03 — well within the conventional $|\text{SMD}| < 0.1$ threshold for acceptable balance.
Common Misconception: "If the covariates are balanced after adjustment, then ignorability holds." Covariate balance is necessary but not sufficient. Even perfectly balanced observed covariates cannot rule out an imbalance in unobserved covariates. Balance checks are a diagnostic tool, not a proof of identification.
16.11 Putting It All Together: The Identification Checklist
Before estimating a causal effect, work through this checklist:
Step 1: Define the Estimand
What causal quantity do you want to estimate? ATE, ATT, or ATU? The answer depends on the decision you are trying to inform.
Step 2: Define the Treatment Precisely
What does $D = 1$ mean, concretely? "Takes Drug X at 20mg daily for 30 days" is precise. "Gets medical treatment" is not. For StreamRec: "Item appears in top-3 of homepage carousel" is precise. "Item is recommended" is ambiguous.
Step 3: Evaluate SUTVA
- No interference: Can one unit's treatment affect another unit's outcome? If yes (vaccination, marketplace, network effects), standard methods do not apply without modification.
- Consistency: Is there only one version of treatment? If not ("exercise," "therapy," "recommendation"), refine the treatment definition.
Step 4: Assess Ignorability
- List all variables that plausibly affect both treatment assignment and the outcome. These are the confounders.
- For each confounder, ask: "Is this measured in the data?"
- Unmeasured confounders violate ignorability. Can you measure them? If not, can you use an alternative identification strategy (IV, DiD, RD — Chapter 18)?
- Consider: what would an unmeasured confounder have to look like to explain away the estimated effect? (Sensitivity analysis.)
Step 5: Check Positivity
- Are there covariate strata where treatment is deterministic or extremely rare?
- If so, restrict the target population to the overlap region, or acknowledge that the causal effect is not identified for the full population.
Step 6: Choose the Estimator
If the assumptions above are plausible:
- Randomized data: Difference in means.
- Observational data with conditional ignorability: Regression adjustment (this chapter), propensity score methods (Chapter 18), or doubly robust methods (Chapter 18).
- Observational data without conditional ignorability: Instrumental variables, difference-in-differences, regression discontinuity (Chapter 18).
16.12 Progressive Project: StreamRec Recommendation Effect
This section applies the chapter's concepts to the progressive project: estimating the causal effect of StreamRec's recommendation algorithm on user engagement.
Step 1: Define the Potential Outcomes
def simulate_streamrec_causal(
n_users: int = 20000,
seed: int = 42,
) -> pd.DataFrame:
"""Simulate a richer StreamRec dataset for causal analysis.
Generates user-item pairs with:
- User features: preference, activity_level, tenure
- Item features: popularity, quality
- Treatment: recommendation (confounded by user+item features)
- Potential outcomes: Y(0) organic engagement, Y(1) with recommendation
- Heterogeneous treatment effects
Args:
n_users: Number of user-item pairs.
seed: Random seed.
Returns:
DataFrame with all variables including potential outcomes.
"""
rng = np.random.RandomState(seed)
# User features
preference = rng.normal(0, 1, n_users)
activity_level = rng.exponential(1.0, n_users)
tenure_months = rng.poisson(24, n_users).clip(1, 120)
# Item features
item_popularity = rng.beta(2, 5, n_users) # Right-skewed
item_quality = rng.normal(0.5, 0.2, n_users).clip(0, 1)
# Treatment assignment (confounded)
rec_logit = (
0.8 * preference
+ 0.5 * activity_level
- 0.3 * np.log1p(tenure_months)
+ 2.0 * item_popularity
+ rng.normal(0, 0.5, n_users)
)
rec_prob = 1 / (1 + np.exp(-rec_logit))
treatment = rng.binomial(1, rec_prob)
# Potential outcomes
# Y(0): organic engagement (minutes)
y0 = (
5.0
+ 3.0 * preference
+ 1.5 * activity_level
+ 0.02 * tenure_months
+ 8.0 * item_quality
+ rng.normal(0, 2, n_users)
).clip(0, None)
# Y(1): engagement with recommendation
# Heterogeneous treatment effect: larger for new/low-activity users
tau_i = (
3.0 # Base effect
- 0.5 * activity_level # Less effect for active users
+ 2.0 * (1 - item_popularity) # More effect for niche items
+ 0.5 * rng.normal(0, 1, n_users) # Random heterogeneity
).clip(0, None)
y1 = y0 + tau_i
# Observed outcome
y_obs = treatment * y1 + (1 - treatment) * y0
return pd.DataFrame({
"preference": preference,
"activity_level": activity_level,
"tenure_months": tenure_months,
"item_popularity": item_popularity,
"item_quality": item_quality,
"treatment": treatment,
"rec_prob": rec_prob,
"y0": y0,
"y1": y1,
"true_ite": tau_i,
"y_obs": y_obs,
})
sr = simulate_streamrec_causal()
print(f"StreamRec simulated data: {len(sr)} user-item pairs")
print(f"Treatment rate: {sr['treatment'].mean():.3f}")
print(f"True ATE: {sr['true_ite'].mean():.3f}")
print(f"True ATT: {sr.loc[sr['treatment'] == 1, 'true_ite'].mean():.3f}")
print(f"True ATU: {sr.loc[sr['treatment'] == 0, 'true_ite'].mean():.3f}")
StreamRec simulated data: 20000 user-item pairs
Treatment rate: 0.621
True ATE: 3.390
True ATT: 2.888
True ATU: 4.212
Note that $\text{ATT} < \text{ATU}$: the recommendation system preferentially recommends to users with high activity levels, who benefit less from recommendations (their organic engagement is already high). The untreated users — lower-activity users with niche items — would benefit more from recommendations but are less likely to receive them.
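This pattern is a general consequence of targeting: whenever assignment favors units with small individual effects, the ATT falls below the ATU. A toy simulation (separate from the StreamRec data-generating process above) makes the mechanism concrete.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100_000
activity = rng.exponential(1.0, n)

# Heterogeneous effect: recommendations help low-activity users more
tau = np.clip(3.0 - 0.5 * activity, 0.0, None)

# Targeting rule favors high-activity users, who benefit least
p_treat = 1 / (1 + np.exp(-(activity - 1.0)))
d = rng.binomial(1, p_treat)

ate, att, atu = tau.mean(), tau[d == 1].mean(), tau[d == 0].mean()
print(f"ATE={ate:.2f}  ATT={att:.2f}  ATU={atu:.2f}")  # expect ATT < ATE < ATU
```

Reversing the targeting rule (treating those who benefit most) would flip the ordering to ATT > ATE > ATU, which is why the choice of estimand matters for policy decisions.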
Step 2: Naive Analysis vs. Regression Adjustment
# Naive difference in means
naive = difference_in_means(sr["y_obs"].values, sr["treatment"].values)
# Regression adjustment with observed confounders
covariates = sr[["preference", "activity_level", "tenure_months",
"item_popularity", "item_quality"]].values
cov_names = ["preference", "activity_level", "tenure_months",
"item_popularity", "item_quality"]
reg_adj = regression_adjustment(
y=sr["y_obs"].values,
treatment=sr["treatment"].values,
X=covariates,
feature_names=cov_names,
)
print("StreamRec Causal Estimation Comparison")
print("=" * 50)
print(f"True ATE: {sr['true_ite'].mean():.3f}")
print(f"True ATT: {sr.loc[sr['treatment']==1, 'true_ite'].mean():.3f}")
print()
print(f"Naive estimate: {naive['estimate']:.3f}")
print(f" 95% CI: [{naive['ci_lower']:.3f}, {naive['ci_upper']:.3f}]")
print(f" Bias (vs ATE): {naive['estimate'] - sr['true_ite'].mean():+.3f}")
print()
print(f"Regression adjustment: {reg_adj['estimate']:.3f}")
print(f" 95% CI: [{reg_adj['ci_lower']:.3f}, {reg_adj['ci_upper']:.3f}]")
print(f" Bias (vs ATE): {reg_adj['estimate'] - sr['true_ite'].mean():+.3f}")
StreamRec Causal Estimation Comparison
==================================================
True ATE: 3.390
True ATT: 2.888
Naive estimate: 6.153
95% CI: [5.942, 6.365]
Bias (vs ATE): +2.763
Regression adjustment: 3.418
95% CI: [3.256, 3.581]
Bias (vs ATE): +0.028
The naive estimate overstates the causal effect by nearly 3 minutes (81% overestimate). Regression adjustment, controlling for the five observed confounders, reduces the bias to near zero — but only because in this simulation, we have measured all the confounders. In practice, unmeasured confounders would leave residual bias.
Step 3: Covariate Balance Check
balance_obs = check_covariate_balance(
X=covariates, treatment=sr["treatment"].values,
feature_names=cov_names,
)
print("Covariate balance (observational StreamRec data):")
print(balance_obs.to_string(index=False))
Covariate balance (observational StreamRec data):
covariate mean_treated mean_control smd balanced
preference 0.373 -0.607 0.987 False
activity_level 1.251 0.594 0.763 False
tenure_months 22.917 25.879 -0.349 False
item_popularity 0.355 0.172 0.777 False
item_quality 0.501 0.499 0.011 True
Four of five covariates show substantial imbalance (SMDs ranging from 0.35 to 0.99), confirming that the treated and control groups are fundamentally different. Only item quality is balanced, because it does not drive the recommendation decision.
Understanding Why: This progressive project milestone establishes the baseline for the causal inference sequence. You have seen that naive evaluation of a recommender system dramatically overstates its causal effect — it takes credit for organic behavior driven by user preferences. Regression adjustment corrects this bias when all confounders are measured, but that assumption is fragile. Chapters 17-19 will build increasingly powerful tools: graphical models to reason about confounders you might have missed (Chapter 17), propensity score methods and natural experiments for more robust estimation (Chapter 18), and machine learning methods for estimating heterogeneous effects — finding which users benefit most from recommendations (Chapter 19).
16.13 Summary
This chapter established the mathematical framework for causal inference:
- Potential outcomes $Y(0)$ and $Y(1)$ define what causation means: the individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ is the difference between what would happen under treatment and what would happen under control.
- The fundamental problem is that we never observe both potential outcomes for the same unit. Causal inference is a missing data problem.
- Three estimands — ATE, ATT, ATU — answer different questions and may have different values when treatment effects are heterogeneous. Choosing the right estimand is as important as estimating it accurately.
- Three assumptions — SUTVA, ignorability, positivity — are required for identification. SUTVA requires domain knowledge. Ignorability is untestable. Positivity is partially testable.
- Randomization satisfies strong ignorability and positivity by design, making the simple difference in means an unbiased estimator of the ATE.
- Regression adjustment can recover the ATE in observational data when conditional ignorability holds and the regression model is correctly specified. The omitted variable bias formula quantifies what goes wrong when an important confounder is omitted.
- Selection bias is the difference between the naive estimate and the true causal effect. Its direction and magnitude depend on how treatment is assigned and which confounders are unmeasured.
The potential outcomes framework is not just notation — it is a discipline. It forces you to specify what you are estimating (the estimand), what you are assuming (SUTVA, ignorability, positivity), and what could go wrong (violations of each assumption). The next chapter introduces Pearl's graphical causal model, which provides a visual and algebraic tool for reasoning about the same identification problems using directed acyclic graphs.