Chapter 18: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Propensity Scores and Matching
Exercise 18.1 (*)
A job training program enrolls 500 participants and has data on 1,500 non-participants. The following covariates are measured: age, education (years), prior earnings, gender, and a binary indicator for whether the person was employed in the prior year.
(a) Fit a logistic regression to estimate the propensity score $\hat{e}(X_i) = P(\text{training} = 1 \mid X_i)$. Report the coefficient for each covariate and interpret the two largest (in absolute value).
(b) Plot the propensity score distributions for the treated and control groups as overlapping histograms. Identify any regions of non-overlap (positivity violations).
(c) Perform nearest-neighbor propensity score matching (1:1, with replacement). Report the ATT. How many unique control units were used?
(d) Compute the standardized mean differences for all five covariates before and after matching. Create a Love plot. Which covariates have $|SMD| > 0.1$ after matching?
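The mechanics of parts (a)–(d) can be sketched as follows. This is a minimal starting point, not a full solution: it assumes the covariates and treatment indicator are already in NumPy arrays, and the function names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def fit_propensity(X, d):
    """Logistic-regression propensity scores P(D = 1 | X)."""
    return LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]

def match_att(y, d, ps):
    """1:1 nearest-neighbor matching on the propensity score, with
    replacement; returns the ATT and the matched control indices."""
    treated, control = np.where(d == 1)[0], np.where(d == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
    matched = control[idx.ravel()]
    return (y[treated] - y[matched]).mean(), matched

def smd(x, d):
    """Standardized mean difference for one covariate (pooled SD)."""
    x1, x0 = x[d == 1], x[d == 0]
    pooled = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled
```

The number of *unique* control units in part (c) is `len(np.unique(matched))`; comparing `smd` before matching and on the matched sample gives the Love plot inputs for part (d).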
Exercise 18.2 (*)
Using the MediCore observational simulation from Section 18.6:
(a) Compute the IPW estimate using the Horvitz-Thompson (unnormalized) estimator. Compare with the Hajek (normalized) estimator from the chapter. Which has a narrower confidence interval?
(b) Compute the effective sample sizes for treated and control groups. What is the ESS ratio for each?
(c) Trim the propensity scores at thresholds of 0.01, 0.02, 0.05, and 0.10. Plot the IPW estimate as a function of the trimming threshold. At what point does trimming change the estimate substantively?
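The two estimators in part (a) and the Kish effective sample size in part (b) are short enough to write out directly. A sketch, assuming `y`, `d`, `ps` are NumPy arrays:

```python
import numpy as np

def ipw_estimates(y, d, ps):
    """Horvitz-Thompson (unnormalized) and Hajek (normalized) IPW ATEs."""
    w1, w0 = d / ps, (1 - d) / (1 - ps)
    ht = (w1 * y).mean() - (w0 * y).mean()
    hajek = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()
    return ht, hajek

def effective_sample_size(w):
    """Kish effective sample size of a weight vector."""
    return w.sum() ** 2 / (w ** 2).sum()
```

For part (b), pass the treated-group weights `1 / ps[d == 1]` and the control-group weights `1 / (1 - ps[d == 0])`; the ESS ratio is the ESS divided by the raw group size.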
Exercise 18.3 (**)
Consider a scenario with a binary confounder $X \in \{0, 1\}$:
| Group | $X = 0$ | $X = 1$ | Total |
|---|---|---|---|
| Treated | 200 | 800 | 1,000 |
| Control | 600 | 400 | 1,000 |
Outcomes: $\bar{Y}_{D=1, X=0} = 5.0$, $\bar{Y}_{D=1, X=1} = 8.0$, $\bar{Y}_{D=0, X=0} = 4.0$, $\bar{Y}_{D=0, X=1} = 7.0$.
(a) Compute the naive difference in means $\bar{Y}_{D=1} - \bar{Y}_{D=0}$. What is the true ATE (compute by stratification)?
(b) Compute the propensity score $e(X)$ for each stratum. Verify the balancing property by showing that within propensity score strata, treatment and control groups have the same distribution of $X$.
(c) Compute the IPW estimate manually (show all arithmetic). Verify that it equals the ATE from part (a).
(d) What happens to the IPW estimate if you add a redundant covariate $Z$ that is independent of both treatment and outcome? Show that the IPW estimate does not change.
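For self-checking parts (a)–(c), the table's arithmetic can be verified directly in a few lines (all quantities below come from the counts and stratum means given above):

```python
# Strata: (group, X) counts and mean outcomes from the table above.
n = {("t", 0): 200, ("t", 1): 800, ("c", 0): 600, ("c", 1): 400}
ybar = {("t", 0): 5.0, ("t", 1): 8.0, ("c", 0): 4.0, ("c", 1): 7.0}
N = sum(n.values())  # 2000

# (a) Naive difference in means.
y_t = (n["t", 0] * ybar["t", 0] + n["t", 1] * ybar["t", 1]) / 1000  # 7.4
y_c = (n["c", 0] * ybar["c", 0] + n["c", 1] * ybar["c", 1]) / 1000  # 5.2
naive = y_t - y_c  # 2.2

# (b) Propensity score within each stratum of X.
e0 = n["t", 0] / (n["t", 0] + n["c", 0])  # 200/800  = 0.25
e1 = n["t", 1] / (n["t", 1] + n["c", 1])  # 800/1200 = 2/3

# (a) Stratified ATE: within-stratum effects weighted by stratum shares.
p0, p1 = (n["t", 0] + n["c", 0]) / N, (n["t", 1] + n["c", 1]) / N  # 0.4, 0.6
ate = p0 * (ybar["t", 0] - ybar["c", 0]) + p1 * (ybar["t", 1] - ybar["c", 1])  # 1.0

# (c) Horvitz-Thompson IPW, written out stratum by stratum.
ipw = (n["t", 0] * ybar["t", 0] / e0 + n["t", 1] * ybar["t", 1] / e1
       - n["c", 0] * ybar["c", 0] / (1 - e0)
       - n["c", 1] * ybar["c", 1] / (1 - e1)) / N  # 1.0
```

The naive difference (2.2) overstates the ATE (1.0) because the treated group is concentrated in the high-outcome $X = 1$ stratum; the IPW estimate reproduces the stratified answer exactly.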
Exercise 18.4 (**)
The propensity score model in the chapter uses logistic regression. Compare the following alternatives:
(a) Fit a gradient-boosted tree (XGBoost or LightGBM) to estimate the propensity score. Plot the propensity score distribution and compute the covariate balance after IPW. Compare with logistic regression.
(b) Fit a random forest. Repeat the analysis from (a).
(c) Which model gives better covariate balance? Which gives a more precise ATE estimate? Explain why the "best prediction model" for propensity scores is not necessarily the best for causal estimation (hint: consider the bias-variance tradeoff in the weights, not the predictions).
Exercise 18.5 (***)
Derive the variance of the Horvitz-Thompson IPW estimator under the following assumptions: treatment is independent across units, propensity scores are known (not estimated), and outcomes are bounded.
(a) Starting from $\hat{\tau}_{\text{HT}} = \frac{1}{N} \sum_{i=1}^N \left[\frac{D_i Y_i}{e(X_i)} - \frac{(1-D_i)Y_i}{1 - e(X_i)}\right]$, show that:
$$\text{Var}(\hat{\tau}_{\text{HT}}) = \frac{1}{N^2} \sum_{i=1}^N \left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + (\tau(X_i) - \tau)^2\right]$$
where $\sigma_d^2(X) = \text{Var}(Y(d) \mid X)$ and $\tau(X) = \mathbb{E}[Y(1) - Y(0) \mid X]$.
(b) Using this formula, explain why extreme propensity scores inflate the variance. What is the variance contribution of a unit with $e(X_i) = 0.01$ vs. $e(X_i) = 0.5$?
(c) Derive the efficiency bound for the ATE (the Cramér-Rao lower bound for the semiparametric model). Show that the AIPW estimator achieves this bound when both models are correctly specified.
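For intuition on part (b): holding $\sigma_1^2 = \sigma_0^2 = 1$, a unit's contribution to the variance scales with $1/e(X_i) + 1/(1 - e(X_i))$. Tabulating this factor (an illustration, not part of the derivation):

```python
# Per-unit variance scale factor 1/e + 1/(1 - e) from the formula above,
# taking sigma_1^2 = sigma_0^2 = 1 for illustration.
for e in (0.5, 0.1, 0.02, 0.01):
    print(f"e = {e:>4}: variance factor = {1 / e + 1 / (1 - e):8.2f}")
```

A unit with $e(X_i) = 0.01$ contributes a factor of about 101, versus 4 at $e(X_i) = 0.5$ — roughly a 25-fold inflation from a single extreme score.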
Doubly Robust Estimation
Exercise 18.6 (*)
Using the MediCore simulation:
(a) Estimate the ATE using AIPW with three different outcome models: linear regression, random forest, and a gradient-boosted tree. Use logistic regression for the propensity score in all cases. Report the estimates and confidence intervals.
(b) Now deliberately misspecify the propensity score model (omit one confounder). Re-run AIPW with the correct outcome model and with the misspecified propensity score. Does the estimate remain close to the true ATE? This demonstrates double robustness.
(c) Deliberately misspecify both models (omit different confounders from each). What happens to the AIPW estimate?
Exercise 18.7 (**)
Implement a doubly robust estimator for the ATT (not the ATE). The ATT version of AIPW is:
$$\hat{\tau}_{\text{ATT, DR}} = \frac{1}{N_1} \sum_{i=1}^N \left[ D_i Y_i - \frac{(1 - D_i)\,\hat{e}(X_i)}{1 - \hat{e}(X_i)} \cdot Y_i - \frac{D_i - \hat{e}(X_i)}{1 - \hat{e}(X_i)} \cdot \hat{\mu}_0(X_i) \right]$$
(a) Implement this estimator in Python.
(b) Apply it to the MediCore simulation. Compare with the ATT from PSM and the ATE from AIPW. When do the ATT and ATE differ, and what drives the difference?
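For part (a), the estimator above simplifies: treated units contribute $Y_i - \hat{\mu}_0(X_i)$ and control units contribute $-\frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)}(Y_i - \hat{\mu}_0(X_i))$, which is algebraically identical to the displayed formula. A sketch of one possible implementation, assuming NumPy arrays:

```python
import numpy as np

def att_dr(y, d, e_hat, mu0_hat):
    """Doubly robust ATT: treated residuals from the control-outcome model,
    debiased by odds-weighted control residuals."""
    resid = y - mu0_hat
    terms = d * resid - (1 - d) * e_hat / (1 - e_hat) * resid
    return terms.sum() / d.sum()
```

If `mu0_hat` is exactly correct, the control term averages to zero and the estimator recovers the ATT even with a badly misspecified `e_hat` — the double-robustness property part (b) asks you to observe empirically.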
Exercise 18.8 (***)
The AIPW estimator can be understood through the lens of influence functions. The efficient influence function for the ATE is:
$$\phi(Y, D, X) = \frac{D(Y - \mu_1(X))}{e(X)} - \frac{(1-D)(Y - \mu_0(X))}{1 - e(X)} + \mu_1(X) - \mu_0(X) - \tau$$
(a) Show that $\mathbb{E}[\phi(Y, D, X)] = 0$ when either the outcome model or the propensity model is correctly specified.
(b) Show that the AIPW estimator $\hat{\tau}_{\text{AIPW}}$ solves the sample estimating equation $\frac{1}{N} \sum_i \hat{\phi}_i(\tau) = 0$, i.e., it is the sample analog of setting $\mathbb{E}[\phi] = 0$ and solving for $\tau$.
(c) Explain the connection to Neyman orthogonality (Chapter 19 preview): the derivative of $\mathbb{E}[\phi]$ with respect to nuisance parameters is zero at the true values. Why does this make AIPW robust to slow convergence rates of the nuisance parameter estimates?
Instrumental Variables
Exercise 18.9 (*)
For the MediCore IV simulation (Section 18.7):
(a) Report the first-stage regression results. What is the coefficient on distance? Interpret it: a 10-mile increase in distance to a Drug-X-prescribing hospital changes the probability of receiving Drug X by how much?
(b) Compute the reduced-form estimate (regress the outcome on the instrument and controls). Compute the Wald estimate $\hat{\tau}_{\text{Wald}} = \text{reduced form} / \text{first stage}$. Verify that it approximately equals the 2SLS estimate.
(c) Add a second instrument: whether the patient's primary care physician was trained at a hospital that uses Drug X frequently (simulated as a random binary variable correlated with treatment). Re-estimate using 2SLS with both instruments. Does the first-stage F-statistic increase?
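For part (b), the Wald ratio is just two simple-regression slopes. A sketch, assuming 1-D NumPy arrays for the outcome, treatment, and a single instrument (the simulated data in the test below is illustrative, not the MediCore data):

```python
import numpy as np

def wald_iv(y, d, z):
    """Wald estimate: reduced form divided by first stage (one instrument)."""
    rf = np.cov(z, y)[0, 1] / np.var(z, ddof=1)  # slope of y on z
    fs = np.cov(z, d)[0, 1] / np.var(z, ddof=1)  # slope of d on z
    return rf / fs
```

Note that the $\text{Var}(z)$ terms cancel, so the Wald estimate equals $\text{Cov}(z, y) / \text{Cov}(z, d)$; with controls included, replace the simple slopes with the corresponding partialled-out regression coefficients.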
Exercise 18.10 (*)
For each of the following proposed instruments, evaluate whether the three IV conditions (relevance, exclusion, independence) are plausible:
(a) To estimate the effect of education on earnings: quarter of birth (born in Q1 vs. Q4 affects years of schooling through compulsory attendance laws).
(b) To estimate the effect of military service on lifetime earnings: Vietnam draft lottery number.
(c) To estimate the effect of hospital quality on patient outcomes: ambulance company assignment (which hospital the patient goes to depends on which ambulance company responds).
(d) To estimate the effect of screen time on children's test scores: the number of TV channels available in the household.
(e) To estimate the effect of a streaming platform's recommendation on user retention: the position the item is shown in (position 1 vs. position 10 in the StreamRec carousel).
Exercise 18.11 (**)
Weak instrument simulation study.
(a) Generate data where the instrument $Z$ explains 20%, 5%, 1%, and 0.1% of the variance in treatment $D$. For each scenario, compute 1,000 2SLS estimates and plot the sampling distributions.
(b) Compute the bias of the 2SLS estimator relative to the OLS estimator as a function of instrument strength. Confirm that with very weak instruments, 2SLS bias approaches OLS bias.
(c) For the weak instrument cases ($R^2 < 1\%$), compute the Anderson-Rubin confidence set and compare it with the standard 2SLS confidence interval. The AR set is formed by inverting the Anderson-Rubin test, which remains valid regardless of instrument strength.
Exercise 18.12 (***)
Derive the 2SLS estimator in matrix notation.
(a) Let $Y = D\tau + X\beta + \varepsilon$ and $D = Z\pi + X\gamma + \nu$, where $Z$ is the instrument matrix and $X$ is the exogenous controls matrix. Show that the 2SLS estimator for $[\tau, \beta]$ is:
$$\hat{\delta}_{\text{2SLS}} = \left(\tilde{X}^\top P_Z \tilde{X}\right)^{-1} \tilde{X}^\top P_Z Y$$
where $\tilde{X} = [D, X]$ and $P_Z = W(W^\top W)^{-1} W^\top$, with $W = [Z, X]$, is the projection matrix onto the column space of $[Z, X]$ (the exogenous controls serve as their own instruments).
(b) Show that when $Z$ is binary (one instrument), this reduces to the Wald estimator from Section 18.7.
(c) Explain why the 2SLS standard errors from the manual two-step procedure are incorrect. What is the correct formula?
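The matrix formula from part (a) can be verified numerically. A sketch, assuming `D` and `y` are 1-D arrays and `X`, `Z` are 2-D design matrices (forming the full $N \times N$ projection matrix is fine for small $N$ only; a QR-based implementation scales better):

```python
import numpy as np

def tsls(y, D, X, Z):
    """2SLS via the projection-matrix formula, with W = [Z, X] the full
    instrument set (included exogenous controls instrument themselves)."""
    Xt = np.column_stack([D, X])              # tilde-X = [D, X]
    W = np.column_stack([Z, X])
    Pw = W @ np.linalg.solve(W.T @ W, W.T)    # projection onto col(W)
    return np.linalg.solve(Xt.T @ Pw @ Xt, Xt.T @ Pw @ y)
```

The first element of the returned vector is $\hat{\tau}$; for part (b), running this with a single binary instrument and an intercept-only $X$ reproduces the Wald estimate.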
Difference-in-Differences
Exercise 18.13 (*)
A hospital network implemented a new discharge protocol in January 2024. You have monthly readmission rates for 50 network hospitals (treated) and 50 non-network hospitals (control) from July 2023 to June 2024.
(a) Compute the 2x2 DiD estimate using the six months before and after the policy change.
(b) Estimate the DiD regression model $Y_{ht} = \alpha + \beta_G G_h + \beta_T T_t + \tau (G_h \times T_t) + \varepsilon_{ht}$ with clustered standard errors (clustered at the hospital level). Report the estimate and 95% CI for $\tau$.
(c) Create an event study plot showing treatment effects for each month relative to the policy change. Are the pre-treatment coefficients consistent with parallel trends?
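The 2x2 estimate in part (a) is the difference of two before-after differences, which can be computed directly from cell means before turning to the regression in part (b). A minimal sketch, assuming NumPy arrays for the outcome, group indicator, and post-period indicator:

```python
import numpy as np

def did_2x2(y, g, t):
    """2x2 difference-in-differences from unit-period observations:
    (treated post - treated pre) - (control post - control pre)."""
    cell = lambda gi, ti: y[(g == gi) & (t == ti)].mean()
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
```

The regression coefficient $\tau$ in part (b) equals this quantity exactly in the balanced 2x2 case; the regression's value added is the clustered standard errors.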
Exercise 18.14 (**)
Generate simulated panel data where the parallel trends assumption is violated: the treated group has a steeper declining trend in readmission rates even before the policy change.
(a) Apply the standard DiD estimator. How biased is the estimate?
(b) Add a group-specific linear time trend to the regression: $Y_{ht} = \alpha_h + \lambda_t + \gamma_G \cdot t + \tau D_{ht} + \varepsilon_{ht}$. Does this reduce the bias?
(c) Create an event study plot for the data with violated parallel trends. What visual pattern in the pre-treatment coefficients would alert you to the violation?
Exercise 18.15 (**)
Staggered adoption problem. Five hospitals adopt the discharge protocol in January 2024, five in April 2024, and five never adopt.
(a) Apply the standard TWFE regression $Y_{ht} = \alpha_h + \lambda_t + \tau D_{ht} + \varepsilon_{ht}$. Report the estimate.
(b) Following Goodman-Bacon (2021), decompose the TWFE estimate into the constituent 2x2 comparisons. Identify which comparisons use already-treated units as controls.
(c) Apply the Callaway-Sant'Anna estimator (using the did package or manual implementation) that uses only not-yet-treated units as controls. Compare with the TWFE estimate.
Exercise 18.16 (***)
Triple differences (DDD). Sometimes the parallel trends assumption is strengthened by adding a third difference.
(a) The hospital policy affects only heart failure patients. Diabetes patients in the same hospitals serve as an additional control. Write the DDD specification:
$$Y_{ihdt} = \alpha + \beta_1 G_i + \beta_2 T_t + \beta_3 \text{HF}_d + \beta_4 (G_i \times T_t) + \beta_5 (G_i \times \text{HF}_d) + \beta_6 (T_t \times \text{HF}_d) + \tau (G_i \times T_t \times \text{HF}_d) + \varepsilon_{ihdt}$$
What does $\tau$ estimate? What assumption does DDD require that is weaker than DiD's parallel trends?
(b) Simulate data with a differential time trend for treated vs. control hospitals (violating DiD parallel trends) but where the differential trend is the same for heart failure and diabetes patients. Show that DDD recovers the true treatment effect while DiD does not.
Regression Discontinuity
Exercise 18.17 (*)
Using the Meridian Financial simulation from Section 18.9:
(a) Estimate the RD effect at bandwidths of 10, 20, 30, and 40 FICO points. Plot the estimate and 95% CI as a function of bandwidth.
(b) Perform the McCrary density test at the cutoff. Is there evidence of manipulation (bunching of FICO scores at 660)?
(c) Check covariate balance at the cutoff for income, age, and debt ratio. Are any covariates discontinuous at the cutoff? What would a discontinuity imply?
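For part (a), a bare-bones sharp RD estimate fits a separate line within the bandwidth on each side of the cutoff and takes the gap in intercepts. A sketch with `np.polyfit` (variable names are illustrative; production analyses would use local polynomial methods with robust bandwidth selection):

```python
import numpy as np

def rd_local_linear(y, r, cutoff, bw):
    """Sharp RD: separate linear fits on [cutoff - bw, cutoff) and
    [cutoff, cutoff + bw]; the estimate is the jump in intercepts at 0."""
    x = r - cutoff
    left = (x >= -bw) & (x < 0)
    right = (x >= 0) & (x <= bw)
    # polyfit returns [slope, intercept]; intercept = fitted value at x = 0
    b_left = np.polyfit(x[left], y[left], 1)[1]
    b_right = np.polyfit(x[right], y[right], 1)[1]
    return b_right - b_left
```

Looping this over bandwidths of 10, 20, 30, and 40 FICO points produces the sensitivity curve part (a) asks for.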
Exercise 18.18 (*)
Create a visualization of the RD design:
(a) Scatter plot of the outcome (default) against the running variable (FICO score), with a vertical line at the cutoff.
(b) Add local polynomial fits (linear, quadratic) on each side of the cutoff.
(c) Bin the running variable into equal-width bins and plot bin means. The visual "jump" at the cutoff should approximate the RD estimate.
Exercise 18.19 (**)
Fuzzy RD. Modify the Meridian Financial simulation so that the cutoff is not sharp: applicants above 660 have an 85% approval probability, and applicants below 660 have a 15% approval probability (manual overrides in both directions).
(a) Estimate the sharp RD (ignoring the fuzziness). How biased is the estimate?
(b) Estimate the fuzzy RD using 2SLS with $\mathbf{1}[\text{FICO} \geq 660]$ as the instrument for actual approval. Report the first-stage coefficient (the "jump" in approval probability at the cutoff).
(c) Explain why the fuzzy RD estimate is larger (in absolute value) than the sharp RD estimate. Connect to the LATE interpretation: who are the "compliers" in this context?
Exercise 18.20 (**)
RD with manipulation. Modify the simulation so that applicants who are just below the cutoff (FICO 655-659) are able to improve their score to just above the cutoff (FICO 660-664) by disputing items on their credit report. This creates "bunching" at 660.
(a) Run the McCrary density test. Does it detect the manipulation?
(b) Estimate the RD effect. Is it biased relative to the unmanipulated simulation?
(c) Propose and implement a "donut hole" RD that excludes applicants within $\pm 5$ FICO points of the cutoff. Does this reduce the bias? What does it cost in terms of precision?
Method Selection and Comparison
Exercise 18.21 (**)
For the MediCore setting, you have access to all three identification strategies simultaneously:
(a) Estimate the treatment effect using: (i) AIPW with observed confounders, (ii) 2SLS with distance as instrument, (iii) DiD around the policy change. Report all three estimates with 95% CIs.
(b) The three estimates are likely different. Explain what each estimates (ATE vs. ATT vs. LATE) and why the target populations differ.
(c) A clinical decision-maker asks: "So what is the actual effect of Drug X?" Write a one-paragraph response that explains why a single number is insufficient and how the three estimates complement each other.
Exercise 18.22 (**)
Sensitivity analysis for unobserved confounders.
For the IPW estimate from the MediCore analysis:
(a) Using the Rosenbaum bounds framework, compute the critical value of $\Gamma$ (the maximum odds ratio of treatment assignment for matched pairs with identical observed covariates) at which the significance of the treatment effect would be overturned. Is the estimate sensitive to moderate levels of hidden bias?
(b) Using the Cinelli and Hazlett (2020) framework from Chapter 16, compute the partial $R^2$ values that an unobserved confounder would need to have (with both treatment and outcome) to explain away the AIPW estimate. Benchmark against the strongest observed confounder.
Exercise 18.23 (***)
Synthetic control method. The synthetic control (Abadie, Diamond, and Hainmueller, 2010) constructs a "synthetic" version of the treated unit as a weighted combination of control units that matches the treated unit's pre-treatment outcomes.
(a) Implement a simplified synthetic control method for the MediCore DiD data. For the treated group of hospitals, find weights $w_1, \ldots, w_K$ on the control hospitals such that $\sum_k w_k \bar{Y}_{k, \text{pre}} = \bar{Y}_{\text{treated, pre}}$ and $\sum_k w_k = 1, w_k \geq 0$.
(b) Estimate the treatment effect as the post-treatment difference between the treated group and the synthetic control. Compare with the DiD estimate.
(c) Conduct a placebo (permutation) test: apply the synthetic control method to each control hospital as if it were treated. Does the actual treated unit's effect stand out relative to the placebo distribution?
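Part (a) is a constrained least-squares problem on the simplex. One way to set it up, sketched with `scipy.optimize.minimize` (SLSQP); this simplified version matches pre-period outcomes only, whereas the full method also matches covariates:

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(Y0_pre, y1_pre):
    """Simplex-constrained weights so the weighted controls track the
    treated unit's pre-period path: min ||Y0_pre @ w - y1_pre||^2
    subject to w >= 0 and sum(w) = 1.  Y0_pre is (T_pre, K)."""
    K = Y0_pre.shape[1]
    obj = lambda w: np.sum((Y0_pre @ w - y1_pre) ** 2)
    res = minimize(obj, np.full(K, 1.0 / K), method="SLSQP",
                   bounds=[(0.0, 1.0)] * K,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x
```

The treatment effect in part (b) is then the post-period gap `y1_post - Y0_post @ w`, and the placebo test in part (c) re-runs `synth_weights` with each control unit playing the treated role.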
Exercise 18.24 (****)
Causal estimation under interference. The SUTVA assumption (no interference) is violated in many real settings. Consider the StreamRec recommendation system, where recommending item A to user $i$ may affect user $j$'s outcomes (e.g., through word of mouth or social influence).
(a) Formalize the potential outcomes under interference: $Y_i(\mathbf{D})$ depends on the full treatment vector $\mathbf{D} = (D_1, \ldots, D_N)$. Explain why the number of potential outcomes per unit grows exponentially in $N$.
(b) Under a "partial interference" assumption (interference only within clusters, no interference across clusters), define the direct effect, the indirect (spillover) effect, and the total effect. Implement an estimator for each using a cluster-randomized design.
(c) For the StreamRec case, propose a practical design that would allow estimation of both direct and spillover effects of recommendations. What constraints does the social network structure impose?
StreamRec Progressive Project
Exercise 18.25 (**)
Extend the StreamRec IPW analysis from Section 18.12:
(a) Instead of logistic regression, use a gradient-boosted tree to estimate the propensity score. Compare covariate balance and ATE estimates between the two propensity models.
(b) Compute the ATT and ATU separately using IPW. Confirm that $\text{ATT} < \text{ATU}$ (the algorithm recommends where the causal effect is smallest). Explain why this pattern is expected given the recommendation algorithm's design.
(c) Propose a simple modification to the StreamRec algorithm that would increase the causal impact of recommendations (hint: recommend where $\hat{\tau}(X)$ is large, not where $\hat{Y}(1 \mid X)$ is large). This previews the targeting policy from Chapter 19.
Exercise 18.26 (***)
Design a difference-in-differences strategy for StreamRec. Suppose StreamRec is rolling out a new recommendation algorithm to 50% of users (randomly selected) on a specific date.
(a) Define the treatment and control groups, the pre and post periods, and the outcome.
(b) Write the DiD regression specification with user fixed effects and time fixed effects.
(c) Simulate the rollout data (including the baseline engagement patterns and the causal effect of the new algorithm). Estimate the DiD effect and create an event study plot.
(d) Discuss the threats to the parallel trends assumption in this context. Can StreamRec validate parallel trends using pre-rollout data?
Exercise 18.27 (***)
Implement a complete causal estimation pipeline using DoWhy for the StreamRec dataset:
(a) Define the causal graph (treatment: recommendation, outcome: engagement, confounders: user preference, activity level, tenure, item popularity, item quality).
(b) Use DoWhy's identify_effect to determine the estimand and valid adjustment sets.
(c) Estimate using three different methods: propensity score weighting, propensity score matching, and the linear regression method. Report all estimates.
(d) Run all four DoWhy refutation tests (placebo treatment, random common cause, data subset, unobserved common cause). Interpret the results.
Exercise 18.28 (****)
Combining methods for robust causal estimation.
Design a procedure that combines IPW, AIPW, and IV to estimate the recommendation effect under different scenarios:
(a) Define three scenarios: (i) all confounders observed, (ii) one confounder unobserved but a valid instrument exists, (iii) neither full conditional ignorability nor a valid instrument is available.
(b) For scenario (iii), implement the Oster (2019) method for bounding the treatment effect based on the degree of selection on observables as a guide to selection on unobservables. Compute the identified set: what range of treatment effects is consistent with the data and an assumption about the degree of unobserved confounding?
(c) Write a function robust_causal_estimate() that takes data and assumptions as input, automatically selects the appropriate method(s), runs diagnostics, and returns estimates with honest uncertainty quantification.