Chapter 16: Quiz

Test your understanding of the potential outcomes framework. Answers follow each question.


Question 1

Define the potential outcomes $Y_i(0)$ and $Y_i(1)$. What makes them "potential"?

Answer $Y_i(0)$ is the outcome that **would be observed** for unit $i$ if it receives the control condition ($D_i = 0$). $Y_i(1)$ is the outcome that **would be observed** if it receives the treatment ($D_i = 1$). They are called "potential" because both exist as well-defined quantities for every unit, but only one is ever realized (observed). The unobserved potential outcome is the **counterfactual** — what would have happened under the alternative treatment assignment.

Question 2

State the fundamental problem of causal inference in one sentence.

Answer It is impossible to observe both $Y_i(1)$ and $Y_i(0)$ for the same unit at the same time, so the individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ is never directly observable. This is not a practical limitation (better data collection or technology cannot fix it) but a logical impossibility: a patient either takes the drug or does not — we cannot observe both realities simultaneously.

Question 3

Write the switching equation and explain what it means.

Answer

$$Y_i = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)$$

The switching equation connects the observed outcome $Y_i$ to the two potential outcomes. When unit $i$ is treated ($D_i = 1$), we observe $Y_i(1)$. When unit $i$ is not treated ($D_i = 0$), we observe $Y_i(0)$. The equation makes explicit that the observed data reveals only one of the two potential outcomes per unit — the other is permanently missing.
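The switching equation can be verified mechanically. A minimal numpy sketch (the potential outcomes and the constant effect of 2 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
Y0 = rng.normal(10.0, 1.0, n)   # potential outcome under control
Y1 = Y0 + 2.0                   # potential outcome under treatment (effect = 2)
D = rng.integers(0, 2, n)       # treatment indicator

# Switching equation: the observed outcome equals exactly one potential outcome.
Y = D * Y1 + (1 - D) * Y0

# Treated units reveal Y1; control units reveal Y0; the other is missing.
assert np.allclose(Y[D == 1], Y1[D == 1])
assert np.allclose(Y[D == 0], Y0[D == 0])
```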

Question 4

Define the ATE, ATT, and ATU. Give a scenario where the ATT and ATU differ substantially.

Answer

- **ATE** (Average Treatment Effect): $\mathbb{E}[Y(1) - Y(0)]$ — the average effect across the entire population.
- **ATT** (Average Treatment Effect on the Treated): $\mathbb{E}[Y(1) - Y(0) \mid D = 1]$ — the average effect among those who actually received treatment.
- **ATU** (Average Treatment Effect on the Untreated): $\mathbb{E}[Y(1) - Y(0) \mid D = 0]$ — the average effect among those who did not receive treatment.

**Example:** A drug prescribed selectively to patients with a genetic biomarker who respond strongly to it (ATT = large). Patients without the biomarker (the untreated) would not respond (ATU $\approx$ 0). The ATT and ATU differ because treatment selection is correlated with treatment effect heterogeneity.

Question 5

Show that $\text{ATE} = P(D=1) \cdot \text{ATT} + P(D=0) \cdot \text{ATU}$.

Answer By the law of total expectation:

$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$$
$$= \mathbb{E}[Y(1) - Y(0) \mid D=1] \cdot P(D=1) + \mathbb{E}[Y(1) - Y(0) \mid D=0] \cdot P(D=0)$$
$$= \text{ATT} \cdot P(D=1) + \text{ATU} \cdot P(D=0)$$

The ATE is a weighted average of the ATT and ATU, where the weights are the treatment and control proportions in the population.
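The identity holds exactly in any sample, not just in expectation. A quick numerical check (the heterogeneous effect and the selection mechanism are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
U = rng.normal(size=n)
Y0 = U
Y1 = Y0 + np.where(U > 0, 3.0, 1.0)   # heterogeneous treatment effect
# Selection on U: high-U units are more likely to be treated.
D = (rng.uniform(size=n) < 1 / (1 + np.exp(-U))).astype(int)

tau = Y1 - Y0
ate = tau.mean()
att = tau[D == 1].mean()
atu = tau[D == 0].mean()
p = D.mean()

# ATE = P(D=1)*ATT + P(D=0)*ATU holds as an in-sample identity.
assert np.isclose(ate, p * att + (1 - p) * atu)
```

Because the treated are selected toward high $U$ (where the effect is larger), the ATT exceeds the ATU in this construction.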

Question 6

Write the decomposition of the naive difference in means into ATT + selection bias. What is the selection bias, intuitively?

Answer

$$\underbrace{\mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0]}_{\text{Naive estimate}} = \underbrace{\mathbb{E}[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{\mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]}_{\text{Selection Bias}}$$

The **selection bias** is the difference in baseline outcomes (under control) between the treated and control groups. Intuitively, it measures how much the two groups would have differed even if nobody had been treated. If sicker patients receive treatment, the treated group has worse baseline outcomes, producing negative selection bias that makes the treatment appear less effective than it is.
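This decomposition is likewise an in-sample identity. A hedged simulation sketch, with an invented "sicker patients get treated" mechanism, shows the naive estimate splitting into the ATT plus a negative selection bias:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
severity = rng.normal(size=n)            # confounder: sicker -> more likely treated
Y0 = -severity + rng.normal(size=n)      # sicker patients do worse at baseline
Y1 = Y0 + 2.0                            # true effect = 2 for everyone
D = (rng.uniform(size=n) < 1 / (1 + np.exp(-severity))).astype(int)
Y = D * Y1 + (1 - D) * Y0

naive = Y[D == 1].mean() - Y[D == 0].mean()
att = (Y1 - Y0)[D == 1].mean()                         # = 2.0 by construction
selection_bias = Y0[D == 1].mean() - Y0[D == 0].mean() # negative here

# In-sample, the decomposition is exact: naive = ATT + selection bias.
assert np.isclose(naive, att + selection_bias)
```

With this mechanism `selection_bias` comes out negative, so the naive estimate understates the true effect, matching the sicker-patients intuition above.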

Question 7

What are the two components of SUTVA?

Answer

1. **No interference:** Unit $i$'s potential outcomes depend only on its own treatment assignment, not on the treatment assignments of other units: $Y_i(D_1, \ldots, D_N) = Y_i(D_i)$. My outcome is unaffected by whether you are treated.
2. **Consistency (no hidden treatment variations):** If $D_i = d$, then $Y_i = Y_i(d)$. There is only one version of treatment. The potential outcome $Y_i(1)$ is the same regardless of how or why the unit received treatment.

Question 8

Give an example where the no-interference component of SUTVA is violated. Explain the mechanism.

Answer **Vaccination.** If person $A$ is vaccinated against a communicable disease, this reduces person $B$'s probability of infection even if $B$ is not vaccinated (herd immunity). Person $B$'s potential outcome $Y_B(0)$ (outcome when unvaccinated) depends on how many other people in the population are vaccinated. The treatment assignment of others directly affects my outcome, violating the no-interference assumption. Other valid examples: marketplace experiments with shared inventory/pricing, educational interventions in shared classrooms, social network interventions where behavior spreads through connections.

Question 9

What does conditional ignorability (unconfoundedness) state? Why is it untestable?

Answer Conditional ignorability states: $\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid \mathbf{X}$. Treatment assignment is independent of potential outcomes, conditional on observed covariates $\mathbf{X}$. Once we control for $\mathbf{X}$, the remaining variation in who receives treatment is "as good as random." It is **untestable** because it asserts the absence of unmeasured confounders — variables that affect both treatment and outcome but are not included in $\mathbf{X}$. We cannot verify from observed data that no such variable exists, because by definition, unmeasured variables are not in the data. The assumption must be justified by domain knowledge and argument, not by statistical tests.

Question 10

State the positivity assumption. Is it testable? How do you diagnose violations?

Answer **Positivity:** $0 < P(D = 1 \mid \mathbf{X} = \mathbf{x}) < 1$ for all $\mathbf{x}$ in the support of $\mathbf{X}$. Every covariate stratum must have both treated and control units with positive probability. It is **partially testable**: we can estimate propensity scores $\hat{e}(\mathbf{x}) = \hat{P}(D = 1 \mid \mathbf{X} = \mathbf{x})$ and check their distribution. Propensity scores near 0 or 1 indicate practical positivity violations. Diagnostic tools include propensity score histograms by treatment group (looking for non-overlapping regions) and checking the fraction of units with extreme propensity scores (e.g., below 0.01 or above 0.99).
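A sketch of the diagnostic on a discrete covariate (the strata and treatment probabilities are invented; stratum 0 has a deterministic violation built in):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
age_band = rng.integers(0, 5, n)  # discrete covariate stratum
# Stratum 0: P(D=1|X)=0 exactly (deterministic violation); stratum 1 is near-violating.
p_treat = np.array([0.0, 0.02, 0.3, 0.6, 0.9])[age_band]
D = (rng.uniform(size=n) < p_treat).astype(int)

# Empirical propensity per stratum: P-hat(D = 1 | X = x).
strata = np.unique(age_band)
e_hat = np.array([D[age_band == s].mean() for s in strata])

# Flag strata with extreme estimated propensities (thresholds 0.01 / 0.99).
violations = strata[(e_hat < 0.01) | (e_hat > 0.99)]
print(dict(zip(strata.tolist(), e_hat.round(3))))
print("strata flagged for positivity problems:", violations)
```

For continuous covariates the same idea applies to estimated propensity scores, e.g. overlaid histograms by treatment group.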

Question 11

Why does randomization solve the identification problem? Which assumptions does it satisfy?

Answer Randomization assigns treatment independently of all unit characteristics, including potential outcomes. This guarantees:

1. **Strong ignorability:** $\{Y(0), Y(1)\} \perp\!\!\!\perp D$ — treatment is independent of potential outcomes without needing to condition on any covariates.
2. **Positivity:** $0 < P(D = 1) < 1$ — every unit has a positive probability of receiving either condition.

Under these conditions, the selection bias is zero in expectation, and the simple difference in means $\bar{Y}_{D=1} - \bar{Y}_{D=0}$ is an unbiased estimator of the ATE. SUTVA is NOT guaranteed by randomization — interference and consistency must still be evaluated separately.
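A small simulation illustrates the unbiasedness claim: repeating a randomized experiment many times, the naive difference in means centers on the true ATE (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, true_ate = 2_000, 2.0
estimates = []
for _ in range(500):                  # repeat the randomized experiment
    Y0 = rng.normal(10.0, 2.0, n)
    Y1 = Y0 + true_ate
    D = rng.integers(0, 2, n)         # randomized assignment
    Y = D * Y1 + (1 - D) * Y0
    estimates.append(Y[D == 1].mean() - Y[D == 0].mean())

# Averaged over randomizations, the difference in means centers on the ATE.
print(np.mean(estimates))
```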

Question 12

In the StreamRec example, the naive difference in means estimated a recommendation effect of ~5.15 when the true effect was 2.0. Explain why the naive estimate is biased upward in this setting.

Answer The recommendation algorithm preferentially recommends items to users with high organic preference for those items. This creates positive **selection bias**: $\mathbb{E}[Y(0) \mid D = 1] > \mathbb{E}[Y(0) \mid D = 0]$. Users who receive recommendations would have had higher engagement even without the recommendation, because their preferences align with the recommended items. The naive comparison attributes this organic preference-driven engagement to the recommendation, overstating the causal effect by approximately 3.15 minutes (the selection bias).

Question 13

What is the omitted variable bias formula? State each component and explain when the bias is zero.

Answer When the true model is $Y = \alpha + \tau D + \beta_1 X + \beta_2 U + \varepsilon$ and we omit $U$, the estimated treatment effect converges to:

$$\tilde{\tau} \to \tau + \underbrace{\beta_2}_{\text{effect of } U \text{ on } Y} \cdot \underbrace{\delta}_{\text{association of } U \text{ with } D \mid X}$$

where $\delta = \text{Cov}(D, U \mid X) / \text{Var}(D \mid X)$. The OVB is zero when either:

- $\beta_2 = 0$: the omitted variable has no effect on the outcome, OR
- $\delta = 0$: the omitted variable is unrelated to treatment assignment conditional on the included covariates.

A variable must affect **both** treatment and outcome to create omitted variable bias.
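The formula can be confirmed numerically. In a finite sample the decomposition holds exactly when $\beta_2$ is replaced by its estimate from the long regression. A sketch with invented coefficients (`ols` is a small hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
X = rng.normal(size=n)
U = 0.5 * X + rng.normal(size=n)                 # omitted variable
D = (0.8 * U + rng.normal(size=n) > 0).astype(float)  # treatment depends on U
tau, beta1, beta2 = 2.0, 1.0, 3.0
Y = 1.0 + tau * D + beta1 * X + beta2 * U + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients for y on an intercept plus the given columns."""
    Z = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

tau_short = ols(Y, D, X)[1]       # U omitted -> biased estimate of tau
long_fit = ols(Y, D, X, U)
tau_long, beta2_hat = long_fit[1], long_fit[3]   # U included -> unbiased
delta = ols(U, D, X)[1]           # coefficient on D in U ~ D + X

# Exact in-sample OVB identity: short = long + beta2_hat * delta.
assert np.isclose(tau_short, tau_long + beta2_hat * delta)
```

Here $\beta_2 > 0$ and $\delta > 0$, so the short regression overstates $\tau$.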

Question 14

A researcher controls for a post-treatment variable (a variable that is caused by the treatment) in a regression. Why is this problematic?

Answer Controlling for a post-treatment variable (a **mediator** or **collider**) can introduce bias rather than remove it. If a variable $M$ is on the causal path $D \to M \to Y$, controlling for $M$ blocks the causal effect that operates through $M$, underestimating the total effect. If $M$ is a collider ($D \to M \leftarrow U \to Y$), conditioning on $M$ opens a non-causal path between $D$ and $Y$ through $U$, introducing **collider bias** (also called selection bias or Berkson's bias). The general rule: only condition on pre-treatment variables that satisfy the backdoor criterion (Chapter 17). Never condition on variables that are consequences of the treatment.
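Collider bias is easy to demonstrate by simulation: with a randomized treatment, the unadjusted regression is fine, but adding the post-treatment collider distorts the estimate (all parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
D = rng.integers(0, 2, n).astype(float)   # randomized treatment
U = rng.normal(size=n)                    # unobserved cause of Y
M = D + U + 0.5 * rng.normal(size=n)      # collider: D -> M <- U
Y = 2.0 * D + U + rng.normal(size=n)      # true effect of D on Y is 2

def ols(y, *cols):
    """OLS coefficients for y on an intercept plus the given columns."""
    Z = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

tau_good = ols(Y, D)[1]     # randomization -> unbiased, near 2.0
tau_bad = ols(Y, D, M)[1]   # conditioning on M opens D -> M <- U -> Y

print(tau_good, tau_bad)
```

Within levels of `M`, high `D` implies low `U`, and low `U` lowers `Y`, so the adjusted coefficient is pulled well below the true effect.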

Question 15

What is the standardized mean difference (SMD), and what threshold indicates acceptable covariate balance?

Answer The SMD for covariate $X_j$ is:

$$\text{SMD}_j = \frac{\bar{X}_{j, \text{treated}} - \bar{X}_{j, \text{control}}}{\sqrt{(s_{j, \text{treated}}^2 + s_{j, \text{control}}^2) / 2}}$$

It measures the difference in means between treated and control groups, standardized by the pooled standard deviation. The conventional threshold for acceptable balance is $|\text{SMD}| < 0.1$ (Rubin, 2001; Austin, 2011). An SMD of 0.1 means the groups differ by 0.1 standard deviations on that covariate — small enough that the covariate is unlikely to confound the treatment effect substantially.
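A minimal implementation of the formula (`smd` is a hypothetical helper; the data are illustrative):

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference with the pooled-SD denominator."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(7)
# Same distribution in both groups: SMD near 0 (well under the 0.1 threshold).
balanced = smd(rng.normal(0.0, 1.0, 10_000), rng.normal(0.0, 1.0, 10_000))
# Means differ by half a standard deviation: SMD near 0.5.
imbalanced = smd(rng.normal(0.5, 1.0, 10_000), rng.normal(0.0, 1.0, 10_000))

print(round(balanced, 3), round(imbalanced, 3))
```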

Question 16

Explain the difference between a deterministic positivity violation and a practical (random) positivity violation.

Answer A **deterministic violation** means treatment is impossible or mandatory for some covariate stratum by structural rule: e.g., a drug is contraindicated for patients over 85, so $P(D = 1 \mid \text{age} > 85) = 0$ exactly. The causal effect is simply not identifiable for this stratum because there are no treated individuals to compare against. A **practical violation** means treatment is theoretically possible but extremely rare: e.g., only 3 out of 10,000 patients with a rare condition received the drug, so $P(D = 1 \mid \text{rare condition}) \approx 0.0003$. Technically positivity holds, but estimation is extremely unstable because a tiny number of observations in one cell receive disproportionate weight. Estimates may be wildly imprecise or driven by a handful of individuals.

Question 17

Why is covariate balance necessary but not sufficient for causal identification?

Answer Covariate balance (low SMDs for all observed covariates) is **necessary** because large imbalances indicate that the treated and control groups differ systematically, which causes selection bias. It is **not sufficient** because balance on observed covariates does not guarantee balance on **unobserved** covariates. An unmeasured confounder could be severely imbalanced between groups even when all measured covariates are perfectly balanced. Ignorability (no unmeasured confounders) is what we actually need, and covariate balance can only verify the observable part of this assumption.

Question 18

In the StreamRec progressive project, why is the ATT less than the ATU? What does this imply about the recommendation algorithm's targeting?

Answer The ATT ($\approx 2.89$) is less than the ATU ($\approx 4.21$) because the recommendation algorithm preferentially targets high-activity users, who benefit **less** from recommendations (they would find and engage with content organically). Low-activity users — who are less likely to receive recommendations — would benefit **more** because they are less likely to discover content on their own. This implies that the algorithm is **suboptimally targeted** from a causal perspective: it recommends to users where the marginal causal impact is smallest. A causal targeting policy (Chapter 19) would redirect recommendations toward users with the highest treatment effect, potentially increasing the overall causal impact of the system while recommending to fewer users overall.

Question 19

A team computes a 95% confidence interval for the naive difference in means and finds $[4.88, 5.43]$. The true ATE is 2.0. What does this interval tell us about the estimator's properties?

Answer The confidence interval is **narrow and does not contain the true ATE**. This demonstrates a critical distinction: the interval reflects **precision** (low variance), not **accuracy** (low bias). The naive estimator is biased because of confounding, and the CI is centered on the biased estimate rather than the true effect. A narrow CI around a biased estimate is arguably worse than a wide CI around an unbiased estimate, because it gives false confidence. Confidence intervals quantify **sampling variability** — they do not account for **systematic bias** from confounding, model misspecification, or SUTVA violations.

Question 20

When should you use each of the following estimators, and what does each assume?

(a) Unadjusted difference in means
(b) Regression adjustment
(c) Methods from Chapter 18 (IV, DiD, RD)

Answer

**(a) Unadjusted difference in means:** Use when treatment is randomized (RCT). Assumes SUTVA and strong ignorability ($\{Y(0), Y(1)\} \perp\!\!\!\perp D$). No covariate adjustment needed, though adjustment can improve precision.

**(b) Regression adjustment:** Use with observational data when conditional ignorability is plausible — when you believe all confounders are measured and included in $\mathbf{X}$. Assumes SUTVA, conditional ignorability ($\{Y(0), Y(1)\} \perp\!\!\!\perp D \mid \mathbf{X}$), positivity, and (for the parametric version) correct model specification.

**(c) IV, DiD, RD (Chapter 18):** Use when conditional ignorability is not plausible — when unmeasured confounding is likely. Each method uses a different structural feature of the data to achieve identification without requiring all confounders to be measured: instrumental variables exploit exogenous variation, difference-in-differences exploits parallel trends over time, and regression discontinuity exploits sharp cutoffs in treatment assignment.