Chapter 18: Quiz

Test your understanding of causal estimation methods. Answers follow each question.


Question 1

What is the propensity score, and what property makes it useful for causal estimation?

**Answer:** The propensity score is the conditional probability of receiving treatment given observed covariates: $e(X) = P(D = 1 \mid X)$. Its key property is the **balancing property** (Rosenbaum and Rubin, 1983): conditioning on the propensity score makes covariates independent of treatment assignment, $X \perp\!\!\!\perp D \mid e(X)$. This reduces the matching problem from $p$ dimensions (matching on all covariates) to a single dimension (matching on the scalar propensity score), making matching feasible even with many covariates.
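The balancing property can be checked in a small simulation. A minimal sketch, assuming NumPy and using the *true* propensity score for clarity (in practice $e(X)$ would be estimated, e.g., by logistic regression); all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))                              # observed covariates
e = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))   # true propensity score
D = rng.binomial(1, e)                                   # treatment assignment

# Raw imbalance: treated units have systematically higher X[:, 0]
raw_gap = X[D == 1, 0].mean() - X[D == 0, 0].mean()

# Within a narrow propensity-score stratum, the same covariate is balanced
s = (e > 0.45) & (e < 0.55)
strat_gap = X[s & (D == 1), 0].mean() - X[s & (D == 0), 0].mean()
print(raw_gap, strat_gap)  # the within-stratum gap is far smaller
```

Comparing units only within strata of $e(X)$ removes most of the covariate gap, which is exactly what makes matching on the scalar score feasible.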

Question 2

A researcher estimates propensity scores and finds that the AUC for predicting treatment assignment is 0.98. Should she be pleased? Explain.

**Answer:** No. An AUC of 0.98 means the propensity score model can nearly perfectly separate treated and control units, which implies that the propensity score distributions for the two groups have minimal overlap. This signals severe **positivity violations**: there are regions of covariate space where units are almost exclusively treated or exclusively control. IPW weights in these regions will be extreme (approaching infinity), leading to unstable estimates with high variance. The goal of propensity score estimation is covariate balance, not predictive accuracy. A model with lower AUC but better overlap is preferable for causal estimation.
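The link between separation and extreme weights is easy to see numerically. A sketch with simulated data (NumPy assumed, all names hypothetical), comparing the largest implied IPW weight under mild versus near-complete separation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)

def max_ipw_weight(beta):
    """Largest IPW weight implied by a propensity model with slope beta."""
    e = 1 / (1 + np.exp(-beta * x))
    d = rng.binomial(1, e)
    w = np.where(d == 1, 1 / e, 1 / (1 - e))
    return w.max()

w_mild = max_ipw_weight(0.5)   # good overlap: modest weights
w_sep = max_ipw_weight(8.0)    # near-separation: extreme weights
print(w_mild, w_sep)
```

A steeper slope means a more "accurate" treatment classifier, yet the resulting weights blow up: exactly the tradeoff the answer describes.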

Question 3

Explain the difference between the Horvitz-Thompson (HT) and Hajek estimators for the ATE.

**Answer:** The **Horvitz-Thompson** estimator uses unnormalized weights: $\hat{\tau}_{\text{HT}} = \frac{1}{N}\sum_i \left[\frac{D_i Y_i}{e(X_i)} - \frac{(1-D_i)Y_i}{1-e(X_i)}\right]$. It is unbiased but can have very large variance when propensity scores are extreme. The **Hajek** estimator normalizes the weights so they sum to one within each treatment group: $\hat{\tau}_{\text{Hajek}} = \frac{\sum_i D_i Y_i / e(X_i)}{\sum_i D_i / e(X_i)} - \frac{\sum_i (1-D_i) Y_i / (1-e(X_i))}{\sum_i (1-D_i) / (1-e(X_i))}$. It is technically biased (the expectation of a ratio is not the ratio of expectations), but it has substantially lower variance than HT, especially when propensity scores vary widely. In practice, the Hajek estimator almost always has lower mean squared error and is the recommended default.
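Both estimators are a few lines of array arithmetic. A sketch on simulated data (NumPy assumed, true propensity known, names hypothetical; the true ATE is 2.0):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                  # true propensity score
d = rng.binomial(1, e)
y = 2.0 * d + x + rng.normal(size=n)      # true ATE = 2.0

w1 = d / e                                # treated-group IPW weights
w0 = (1 - d) / (1 - e)                    # control-group IPW weights

tau_ht = np.mean(w1 * y - w0 * y)                                  # unnormalized
tau_hajek = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()  # normalized
print(tau_ht, tau_hajek)   # both should land near 2.0
```

With well-behaved propensity scores the two agree closely; the Hajek estimator's advantage shows up when the weights become more variable.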

Question 4

What does "doubly robust" mean in the context of the AIPW estimator?

**Answer:** The AIPW estimator combines an outcome model $\hat{\mu}_d(X)$ and a propensity score model $\hat{e}(X)$. **Doubly robust** means the estimator is consistent (converges to the true ATE) if **either** the outcome model is correctly specified **or** the propensity score model is correctly specified. Both do not need to be correct simultaneously. If the outcome model is right, the IPW residuals average to zero regardless of propensity model quality. If the propensity model is right, the outcome model residuals are properly reweighted regardless of outcome model misspecification. If both are correct, AIPW achieves the semiparametric efficiency bound — the lowest possible variance among regular estimators.
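Double robustness can be demonstrated directly. A sketch (simulated data, NumPy assumed, names hypothetical) in which the outcome model is deliberately misspecified (plain group means, ignoring $X$) while the propensity model is correct; the true ATE is 2.0:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                      # true (known) propensity score
d = rng.binomial(1, e)
y = 2.0 * d + x + rng.normal(size=n)          # true ATE = 2.0

naive = y[d == 1].mean() - y[d == 0].mean()   # confounded by x

# Deliberately misspecified outcome models: plain group means, ignoring x
mu1 = np.full(n, y[d == 1].mean())
mu0 = np.full(n, y[d == 0].mean())

# AIPW: outcome-model difference plus reweighted residual corrections
tau_aipw = np.mean(
    mu1 - mu0
    + d * (y - mu1) / e
    - (1 - d) * (y - mu0) / (1 - e)
)
print(naive, tau_aipw)   # naive is biased upward; AIPW recovers roughly 2.0
```

The correction terms reweight the outcome-model residuals by the true propensity, which is why a badly wrong $\hat{\mu}_d$ still yields a consistent estimate.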

Question 5

State the three conditions for a valid instrumental variable.

**Answer:**

1. **Relevance:** The instrument $Z$ must be correlated with the endogenous treatment $D$: $\text{Cov}(Z, D) \neq 0$. This is testable via the first-stage regression.
2. **Exclusion restriction:** The instrument must affect the outcome $Y$ **only through** the treatment $D$: there is no direct effect of $Z$ on $Y$. This is generally **untestable** and must be argued from domain knowledge.
3. **Independence (exogeneity):** The instrument must be uncorrelated with the unmeasured confounders $U$: $\text{Cov}(Z, U) = 0$. This is also generally **untestable** and requires substantive justification.

Question 6

In the MediCore IV example, distance to the nearest Drug-X-prescribing hospital is used as an instrument. Provide a scenario where the exclusion restriction might be violated.

**Answer:** The exclusion restriction requires that distance affects readmission **only** through Drug X prescription. This could be violated if:

- **Distance to hospital affects care quality:** Patients far from the Drug-X hospital may also be far from high-quality hospitals in general, receiving lower-quality care that increases readmission independently of Drug X.
- **Distance correlates with socioeconomic status:** Urban vs. rural residence affects both distance to specialized hospitals and health outcomes (access to follow-up care, diet, stress, environmental factors).
- **Distance affects treatment adherence:** Patients who travel far for an initial prescription may have more difficulty with follow-up appointments, affecting readmission through a channel other than Drug X itself.

Any of these would create a direct path from $Z$ (distance) to $Y$ (readmission) that bypasses $D$ (Drug X), violating the exclusion restriction.

Question 7

What is the first-stage F-statistic in 2SLS, and what is the standard rule of thumb for instrument strength?

**Answer:** The first-stage F-statistic tests the null hypothesis that the instrument(s) have zero coefficient in the first-stage regression of treatment on the instrument(s) and controls. It measures instrument **relevance** — how strongly the instrument predicts treatment. The standard rule of thumb (Staiger and Stock, 1997) is $F > 10$ for a single endogenous regressor with one instrument. Below this threshold, the instrument is considered "weak," and 2SLS estimates are biased toward OLS, confidence intervals are unreliable, and size distortions of hypothesis tests become severe. For weak instruments ($F < 10$), use weak-instrument-robust inference methods such as the Anderson-Rubin test or the conditional likelihood ratio test.
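With a single instrument, the first-stage F is just the squared t-statistic on the instrument's coefficient. A sketch (simulated data, NumPy assumed, names hypothetical) with a deliberately strong instrument:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)                    # instrument
d = 0.4 * z + rng.normal(size=n)          # first stage: relevance holds

# First-stage OLS of d on [1, z]; F = t^2 for the coefficient on z
Z = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(Z, d, rcond=None)
resid = d - Z @ beta
s2 = resid @ resid / (n - 2)              # residual variance estimate
se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
F = (beta[1] / se) ** 2
print(F)   # comfortably above the Staiger-Stock threshold of 10
```

Shrinking the 0.4 first-stage coefficient toward zero drives F below 10, at which point weak-instrument-robust methods become necessary.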

Question 8

The 2SLS estimator identifies a LATE, not an ATE. What is a "complier," and why is the distinction important?

**Answer:** A **complier** is a unit whose treatment status changes in response to the instrument. In the MediCore example, compliers are patients who would receive Drug X if they live near a prescribing hospital but would **not** receive it if they live far away. The 2SLS estimate applies only to this subpopulation. **Always-takers** (patients who get Drug X regardless of distance) and **never-takers** (patients who refuse Drug X regardless of distance) contribute nothing to the IV estimate. The distinction matters because compliers may not be representative of the full population. If Drug X works differently for compliers than for always-takers or never-takers, the LATE differs from the ATE. Whether the LATE is policy-relevant depends on the decision being made: if the policy changes the instrument (e.g., building a new hospital nearby), the LATE is exactly the right estimand.

Question 9

State the parallel trends assumption for difference-in-differences. Is it testable?

**Answer:** The parallel trends assumption states that, **in the absence of treatment**, the treated and control groups would have experienced the same change in outcomes over time:

$$\mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid G = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid G = 0]$$

This does **not** require equal outcome levels — only equal trends. The assumption is **not testable** for the post-treatment period because we cannot observe $Y(0)$ for the treated group after treatment. However, it is **plausibility-checkable** using pre-treatment data: if the two groups followed parallel trends in the periods before treatment, this provides suggestive (but not conclusive) evidence that they would have continued to do so. Event study plots that show pre-treatment coefficients near zero support the assumption.
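The 2×2 DiD logic (level differences cancel, the common time trend cancels, only the treatment effect survives) can be verified in a sketch, assuming NumPy and simulated data with hypothetical names; the true effect is 2.0:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
g = rng.binomial(1, 0.5, size=n)            # treated-group indicator
level = 3.0 * g + rng.normal(size=n)        # groups differ in levels...
y_pre = level + rng.normal(size=n)
# ...but share a common trend (+1.0); treatment adds 2.0 for the treated
y_post = level + 1.0 + 2.0 * g + rng.normal(size=n)

did = (y_post[g == 1].mean() - y_pre[g == 1].mean()) \
    - (y_post[g == 0].mean() - y_pre[g == 0].mean())
print(did)   # near the true effect of 2.0; the level gap of 3.0 cancels
```

A simple post-minus-pre comparison for the treated group alone would instead return about 3.0 (effect plus trend), which is exactly what differencing against the control group removes.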

Question 10

Why is the standard two-way fixed effects (TWFE) estimator potentially biased under staggered adoption?

**Answer:** Goodman-Bacon (2021) showed that the TWFE estimator is a weighted average of all possible 2×2 DiD comparisons between groups and time periods. Under staggered adoption, some of these comparisons use **already-treated** units as the "control" group. When treatment effects are dynamic (changing over time since adoption), these comparisons are invalid: the "control" group's outcomes reflect a mixture of the treatment effect at their adoption date and the time trend. The resulting bias can even flip the sign of the estimate. Modern estimators (Callaway and Sant'Anna, 2021; Sun and Abraham, 2021) avoid this problem by using only not-yet-treated (or never-treated) units as controls.

Question 11

What is the difference between a sharp and a fuzzy regression discontinuity design?

**Answer:** In a **sharp RD**, treatment is a deterministic function of the running variable: $D_i = \mathbf{1}[X_i \geq c]$. Every unit above the cutoff is treated; every unit below is untreated. The RD estimate is the jump in the outcome at the cutoff. In a **fuzzy RD**, the cutoff changes the **probability** of treatment but does not determine it perfectly. Some units above the cutoff are untreated (e.g., applications denied despite meeting the score threshold) and some below are treated (e.g., approved through manual override). The fuzzy RD uses the cutoff indicator $\mathbf{1}[X_i \geq c]$ as an instrument for the actual treatment $D_i$, and the estimate is the ratio of the jump in the outcome to the jump in treatment probability — analogous to a Wald/IV estimate. The fuzzy RD identifies a LATE: the effect among compliers at the cutoff.
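The Wald-ratio form of the fuzzy RD estimate can be sketched with simple means in a narrow window on each side of the cutoff (simulated data, NumPy assumed, names hypothetical; the true effect is 1.0). Note that raw window means leave a small trend bias that a local linear fit would remove:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
x = rng.uniform(-1, 1, size=n)               # running variable, cutoff at 0
above = (x >= 0).astype(float)
p_treat = 0.2 + 0.6 * above                  # cutoff raises P(treat) by 0.6
d = rng.binomial(1, p_treat)
y = 1.0 * d + 0.5 * x + rng.normal(size=n)   # true treatment effect = 1.0

h = 0.1                                      # narrow window around the cutoff
left = (x > -h) & (x < 0)
right = (x >= 0) & (x < h)
jump_y = y[right].mean() - y[left].mean()    # jump in the outcome
jump_d = d[right].mean() - d[left].mean()    # jump in treatment probability
tau_frd = jump_y / jump_d                    # Wald ratio
print(tau_frd)   # near the true effect of 1.0
```

If `p_treat` jumped from 0 to 1 at the cutoff, `jump_d` would be 1 and the estimator would reduce to the sharp RD.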

Question 12

In an RD design, the McCrary density test finds significant bunching of the running variable just above the cutoff. What does this imply?

**Answer:** Bunching above the cutoff suggests that units are **manipulating** the running variable to receive treatment. For example, if credit applicants can retake an exam or dispute credit report items to push their FICO score above the threshold, then units just above the cutoff are systematically different from units just below — they are "self-selected" into treatment. This violates the RD identifying assumption that potential outcomes are continuous at the cutoff: the units just above the cutoff are not comparable to those just below because they have taken an action (manipulation) that is correlated with the outcome. The RD estimate is likely biased. Solutions include: a "donut hole" RD that excludes units near the cutoff, or using a different running variable that is harder to manipulate.

Question 13

What role does bandwidth selection play in RD, and what is the bias-variance tradeoff?

**Answer:** The bandwidth $h$ determines how many observations near the cutoff are included in the local regression. A **narrow bandwidth** includes only units very close to the cutoff, reducing bias (these units are most comparable) but increasing variance (fewer observations). A **wide bandwidth** includes more observations, reducing variance but increasing bias (units far from the cutoff may have different potential outcomes, and the linear approximation becomes less accurate). Optimal bandwidths (Imbens-Kalyanaraman, Calonico-Cattaneo-Titiunik) minimize the asymptotic mean squared error, balancing these two concerns. The CCT bandwidth includes a bias-correction step that allows the use of a wider bandwidth while maintaining correct coverage of confidence intervals. Sensitivity analysis (reporting estimates at multiple bandwidths) is essential to ensure robustness.
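The recommended sensitivity analysis can be sketched directly: local linear estimates of a sharp RD jump at several bandwidths (simulated data, NumPy assumed, names hypothetical; the true jump is 2.0):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x = rng.uniform(-1, 1, size=n)                      # running variable, cutoff 0
d = (x >= 0).astype(float)
y = 2.0 * d + x + 0.8 * x**2 + rng.normal(size=n)   # true jump = 2.0

def local_linear_jump(h):
    """Sharp-RD estimate: separate OLS lines fit within +/- h of the cutoff."""
    def intercept(mask):
        X = np.column_stack([np.ones(mask.sum()), x[mask]])
        return np.linalg.lstsq(X, y[mask], rcond=None)[0][0]
    return intercept((x >= 0) & (x < h)) - intercept((x < 0) & (x > -h))

for h in (0.05, 0.2, 0.5):
    print(h, local_linear_jump(h))   # estimates should hover near 2.0
```

The smallest bandwidth uses the fewest observations (noisy estimate), while the widest leans hardest on the linear approximation; reporting the whole range shows whether the conclusion depends on that choice.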

Question 14

For the StreamRec recommendation system, the naive engagement estimate was 6.15 minutes, while the IPW estimate was 3.43 minutes. Explain the source of the 2.72-minute gap in business terms.

**Answer:** The 2.72-minute gap is **confounding bias**, not the algorithm's causal effect. The recommendation algorithm selects items for users based on features (user preference, activity level, item popularity) that independently predict higher engagement. Users who receive recommendations are more active and prefer more engaging content, so they would have consumed more content **even without the recommendation**. In business terms: the recommendation system's ROI calculation based on the naive 6.15-minute figure claims credit for 2.72 minutes of engagement that would have occurred organically. The system's true causal contribution is 3.43 minutes, meaning the naive metric overstates the system's value by approximately 80%. Investment decisions (e.g., expanding the recommendation infrastructure) should be based on the causal estimate, not the confounded metric.

Question 15

You are choosing between IPW and AIPW for a causal analysis. Under what circumstances would you prefer IPW alone?

**Answer:** In practice, AIPW is almost always preferred over IPW alone because it is doubly robust and more efficient. The main circumstance where you might prefer IPW is when: (1) you have strong confidence in the propensity score model but no credible outcome model (e.g., the outcome process is highly nonlinear and difficult to model), and (2) implementing the additional outcome model adds complexity without clear benefit. However, these situations are rare. If you can fit a propensity model, you can usually fit an outcome model. AIPW adds minimal computational cost and provides insurance against propensity model misspecification. **AIPW should be the default** for selection-on-observables designs.

Question 16

A colleague proposes matching treated and control units on as many covariates as possible "to be safe." Explain two problems with this approach.

**Answer:**

1. **Curse of dimensionality:** As the number of covariates increases, exact matches become impossible and nearest-neighbor matches become poor (in high dimensions, all points are approximately equidistant). This leads either to large matching distances or to many unmatched treated units, both of which compromise the analysis.
2. **Conditioning on bad controls:** Not all covariates should be matched on. Matching on a **collider** (a common effect of treatment and outcome) opens a spurious association. Matching on a **post-treatment variable** (a mediator or consequence of treatment) blocks part of the causal effect.

The decision of which covariates to include should be guided by the causal DAG (Chapter 17), not by a "more is better" heuristic. Only confounders — variables that affect both treatment and outcome — should be included in the propensity score model.

Question 17

In the Meridian Financial RD example, the estimated effect applies to applicants near the 660 FICO cutoff. Can this estimate be generalized to applicants with FICO scores of 750?

**Answer:** Not directly. The RD estimate identifies a **local** effect at the cutoff: $\mathbb{E}[Y(1) - Y(0) \mid X = 660]$. This is the causal effect of credit card approval for applicants with FICO scores near 660. Applicants with scores of 750 likely differ systematically in risk behavior, income, financial discipline, and other characteristics that may moderate the treatment effect. The default probability for a 750-FICO applicant is likely much lower than for a 660-FICO applicant, so the effect of receiving a credit card (increased default probability) may be smaller. Extrapolation from the cutoff requires additional assumptions about how the treatment effect varies with the running variable. Without such assumptions, the RD estimate should be interpreted as local to the cutoff.

Question 18

List the five causal estimation methods covered in this chapter and the key identifying assumption for each.

**Answer:**

| Method | Key Identifying Assumption |
|--------|---------------------------|
| **Matching / PSM** | Conditional ignorability (unconfoundedness): all confounders are measured |
| **IPW / AIPW** | Conditional ignorability + positivity (overlap in propensity scores) |
| **Instrumental Variables (2SLS)** | Exclusion restriction: instrument affects outcome only through treatment; plus relevance and exogeneity |
| **Difference-in-Differences** | Parallel trends: treated and control groups would have followed the same outcome trajectory absent treatment |
| **Regression Discontinuity** | Continuity: potential outcomes are continuous functions of the running variable at the cutoff (no manipulation) |

Each assumption is untestable (or only partially testable), and the choice of method depends on which assumption is most defensible given the data and domain knowledge.

Question 19

You estimate a causal effect using AIPW and get $\hat{\tau} = -0.045$ with a 95% CI of $[-0.062, -0.028]$. A sensitivity analysis shows that an unmeasured confounder explaining 5% of the residual variance in both treatment and outcome would reduce the estimate to zero. Is this result robust?

**Answer:** This result is **moderately fragile**. A confounder that explains only 5% of the residual variance in both treatment and outcome — a relatively weak confounder — would be sufficient to explain away the estimated effect. For comparison, if the strongest observed confounder explains 15% of the residual variance in treatment and 10% in outcome, then a confounder half as strong as the strongest observed one could nullify the result. This level of sensitivity warrants caution: the result could be real, but it is not robust to modest amounts of unmeasured confounding. The appropriate conclusion is to report the estimate alongside the sensitivity analysis, noting that the finding depends on the assumption that no unmeasured confounder of this magnitude exists — an assumption that must be evaluated using domain knowledge.

Question 20

Suppose you have access to both an RD design (at a credit score cutoff) and an IPW analysis (using the full population with observed confounders) for estimating the effect of credit approval on default. The RD gives $\hat{\tau} = 0.048$ and the IPW gives $\hat{\tau} = 0.031$. How do you reconcile these two estimates?

**Answer:** The two estimates target **different populations**, so they need not agree:

- The **RD estimate** (0.048) is the local effect at the cutoff — the effect for applicants with FICO scores near 660, who are marginal borrowers with moderate credit risk.
- The **IPW estimate** (0.031) is the ATE across the full population, which includes both marginal borrowers and applicants with very high or very low credit scores.

The difference ($0.048 > 0.031$) is consistent with treatment effect heterogeneity: marginal borrowers near the cutoff have higher default risk when given credit than the average applicant. This makes economic sense — applicants near the approval threshold are the most financially constrained, so the causal effect of additional credit is larger for them. To reconcile: (1) both estimates can be correct for their respective populations; (2) the RD is more internally valid (fewer assumptions) but less generalizable; (3) the IPW is more generalizable but relies on conditional ignorability, which may not hold. If the goal is to assess the effect of changing the cutoff, the RD is most relevant. If the goal is to assess the overall effect of credit access, the IPW (with sensitivity analysis) is more informative.