Chapter 15: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Prediction vs. Causation

Exercise 15.1 (*)

For each of the following questions, classify it as descriptive, predictive, or causal. Justify your classification in one sentence.

(a) What percentage of customers who received a promotional email made a purchase last quarter?

(b) Which customers are most likely to churn in the next 30 days?

(c) Does sending a promotional email increase purchase probability?

(d) Among customers who churned, what was their average tenure?

(e) If we increase the free trial period from 7 to 14 days, will conversion rates increase?

(f) What is the predicted demand for product X next month?

(g) Did the price reduction last quarter cause the increase in sales, or was it seasonal?


Exercise 15.2 (*)

A ride-sharing company builds a model to predict driver cancellation rates. The model finds that "distance to pickup" is the strongest predictor: longer pickups are more likely to be cancelled.

(a) If the company uses this model to decide which ride requests to show to each driver (filtering out long-distance pickups), are they answering a predictive or causal question?

(b) What confounders might explain the association between distance and cancellation, beyond the direct causal effect of distance?

(c) Design a simple experiment that would isolate the causal effect of pickup distance on cancellation.


Exercise 15.3 (*)

Consider the hospital readmission example from Section 15.1. A hospital deploys a readmission risk model to target patients for a post-discharge care management program.

(a) Write the predictive question the model answers using mathematical notation.

(b) Write the causal question the intervention program requires using mathematical notation.

(c) Explain in plain language why these are different questions, using a specific patient example.

(d) Propose two patient types where the predictive and causal answers diverge: one where high predicted risk does NOT correspond to a large treatment effect, and one where moderate predicted risk DOES correspond to a large treatment effect.


Exercise 15.4 (**)

Using the simulate_prediction_causation_gap function from Section 15.1, reproduce the simulation and extend it:

(a) Plot a scatter plot of risk_score (x-axis) vs. treatment_effect (y-axis). Add a horizontal line at the mean treatment effect and a vertical line at the risk threshold for the top 10%. Annotate the four quadrants: high-risk/high-effect, high-risk/low-effect, low-risk/high-effect, low-risk/low-effect.

(b) Vary the budget from 100 to 5000 in increments of 100. For each budget level, compute the readmissions prevented by risk-based targeting and by causal-oracle targeting, and plot both curves. At what budget level does risk-based targeting first surpass random targeting? At what budget level does the causal oracle's advantage shrink below 50% (i.e., the oracle prevents fewer than 1.5 times as many readmissions as risk-based targeting)?

(c) Design a "hybrid" strategy that targets the top $k/2$ patients by risk and the top $k/2$ patients by treatment effect (removing duplicates). Compare this to risk-only and causal-only strategies. Under what conditions does the hybrid approach perform well?
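One way to structure the selection logic in part (c), assuming the risk_score and treatment_effect columns from the Section 15.1 simulation (the function name hybrid_targets is ours, and this is a sketch, not the only valid design):

```python
import pandas as pd

def hybrid_targets(df: pd.DataFrame, k: int) -> pd.Index:
    """Pick k patients: top k/2 by risk plus top k/2 by treatment effect.

    A patient appearing on both lists consumes only one budget slot;
    the freed slots are refilled from the next-best by treatment effect.
    """
    half = k // 2
    by_risk = df.nlargest(half, "risk_score").index
    by_effect = df.nlargest(half, "treatment_effect").index
    chosen = by_risk.union(by_effect)  # union drops duplicates
    if len(chosen) < k:
        backfill = (df.drop(index=chosen)
                      .nlargest(k - len(chosen), "treatment_effect")
                      .index)
        chosen = chosen.union(backfill)
    return chosen
```

The union step is the subtle part: when risk and treatment effect are positively correlated, the two lists overlap heavily and the backfill decides how the freed budget is spent.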


Confounding

Exercise 15.5 (*)

For each of the following observed associations, identify at least one plausible confounder and explain how it could create the association without a direct causal link.

(a) Students who sit in the front of the classroom have higher GPAs.

(b) Countries with more firefighters per capita have more fires.

(c) Patients who receive a particular surgical procedure have higher mortality rates than those treated with medication.

(d) Companies that hire data scientists have higher revenue growth.

(e) People who use standing desks report fewer back problems.


Exercise 15.6 (**)

Using the simulate_confounded_drug_effect function from Section 15.3:

(a) Reproduce the simulation and verify that the naive estimate has the wrong sign.

(b) Implement a stratified analysis: divide patients into 10 equally sized groups by disease severity, compute the drug effect within each stratum, and average. How close is the stratified estimate to the true effect?

def stratified_estimate(df: pd.DataFrame, n_strata: int = 10) -> float:
    """Estimate the drug effect using stratification on disease severity.

    Args:
        df: DataFrame with columns 'severity', 'drug_x', 'hospitalized'.
        n_strata: Number of severity strata.

    Returns:
        Stratified estimate of the average treatment effect.
    """
    # Your implementation here
    pass

(c) What happens to the stratified estimate when you use only 3 strata? 20 strata? 100 strata? Explain the bias-variance tradeoff in the number of strata.

(d) Modify the simulation so that there is an unobserved confounder (a variable that affects both treatment and outcome but is not in the dataset). Repeat the stratified analysis on the observed confounder only. How much residual bias remains?


Exercise 15.7 (**)

A pharmaceutical company analyzes observational data and finds that patients who take Statin X have lower rates of hip fracture. A naive analyst concludes that Statin X protects bones.

(a) Propose a confounding explanation for this finding.

(b) Suppose the company has data on the following variables: age, sex, BMI, smoking status, exercise frequency, calcium supplement use, bone density measurements, and cholesterol levels. Draw a causal diagram (using text notation: A -> B means A causes B) that includes the statin, hip fracture, and at least three confounders.

(c) Based on your diagram, which variables should be adjusted for, and which should not be adjusted for? Justify using the concepts of confounders and colliders.


Exercise 15.8 (***)

Prove algebraically that the naive difference in means, $E[Y \mid T=1] - E[Y \mid T=0]$, can be decomposed as:

$$E[Y \mid T=1] - E[Y \mid T=0] = \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}$$

(a) Start from the observed conditional expectations, then add and subtract $E[Y(0) \mid T=1]$. Show each step.

(b) Interpret the selection bias term: what does $E[Y(0) \mid T=1] - E[Y(0) \mid T=0]$ mean in the context of the MediCore drug example?

(c) Under what condition does the selection bias term vanish? Connect this to the concept of randomization, and note that randomization also makes the ATT coincide with the ATE.
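Before attempting the algebra, it can help to confirm the identity numerically on fabricated potential outcomes, where treatment is deliberately selected on $Y(0)$ (a sanity check only; all parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y0 = rng.normal(size=n)
y1 = y0 + rng.normal(0.5, 1.0, size=n)          # heterogeneous effects
t = (rng.random(n) < 1 / (1 + np.exp(-y0)))     # selection on Y(0)
y = np.where(t, y1, y0)                          # observed outcome

observed_diff = y[t].mean() - y[~t].mean()
att = (y1 - y0)[t].mean()                        # effect among the treated
selection_bias = y0[t].mean() - y0[~t].mean()
# the identity holds exactly in-sample, not just in expectation
assert np.isclose(observed_diff, att + selection_bias)
```

Because treated units have systematically higher $Y(0)$, the selection bias term here is strictly positive, inflating the naive difference.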


Simpson's Paradox

Exercise 15.9 (*)

A university's admissions data shows that 44% of male applicants are admitted versus 35% of female applicants. An investigation into gender discrimination begins. But when the data is stratified by department, female applicants have equal or higher admission rates in every department.

(a) Explain how this can happen using the concept of Simpson's paradox and confounding.

(b) Is the aggregate statistic (44% vs. 35%) or the department-specific statistic the better basis for evaluating discrimination? Why?

(c) This scenario is based on the famous UC Berkeley admissions study (Bickel, Hammel, and O'Connell, 1975). Research the original study and describe the confounding mechanism they identified.


Exercise 15.10 (**)

Construct your own numerical example of Simpson's paradox using the following setup:

  • Two treatments (A and B)
  • Two subgroups (Group 1 and Group 2)
  • Treatment A is better in both subgroups but worse overall

(a) Fill in a $2 \times 2$ table with success rates and sample sizes that exhibit the reversal. Verify that the paradox holds.

(b) Write a Python function that takes the four success rates and four sample sizes as input and checks whether Simpson's paradox holds (i.e., the aggregate ranking reverses the within-group ranking).

def check_simpsons_paradox(
    rate_a1: float, n_a1: int,  # Treatment A, Group 1
    rate_b1: float, n_b1: int,  # Treatment B, Group 1
    rate_a2: float, n_a2: int,  # Treatment A, Group 2
    rate_b2: float, n_b2: int,  # Treatment B, Group 2
) -> bool:
    """Check whether the given rates and sample sizes exhibit Simpson's paradox.

    Returns True if Treatment A is better in both subgroups but worse overall
    (or vice versa).

    Args:
        rate_a1, n_a1: Success rate and sample size for Treatment A in Group 1.
        rate_b1, n_b1: Success rate and sample size for Treatment B in Group 1.
        rate_a2, n_a2: Success rate and sample size for Treatment A in Group 2.
        rate_b2, n_b2: Success rate and sample size for Treatment B in Group 2.

    Returns:
        True if Simpson's paradox is present.
    """
    # Your implementation here
    pass

(c) What constraints on the sample sizes are necessary for the paradox to occur? Can it happen when the sample sizes are equal across all four cells?
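For intuition before building your own table, the classic kidney-stone comparison (Charig et al., 1986) exhibits exactly this reversal. The aggregate rate is a sample-size-weighted average of the subgroup rates, which is what makes reversal possible:

```python
# (successes, trials) for each treatment within each stone-size subgroup
open_surgery = {"small": (81, 87), "large": (192, 263)}     # "Treatment A"
percutaneous = {"small": (234, 270), "large": (55, 80)}     # "Treatment B"

def rate(successes, trials):
    return successes / trials

def overall(groups):
    s = sum(succ for succ, _ in groups.values())
    n = sum(tot for _, tot in groups.values())
    return s / n

# A wins in both subgroups...
assert rate(*open_surgery["small"]) > rate(*percutaneous["small"])
assert rate(*open_surgery["large"]) > rate(*percutaneous["large"])
# ...but loses overall, because A is given mostly to the harder large-stone cases
assert overall(open_surgery) < overall(percutaneous)
```

Note how unbalanced the cell sizes are: open surgery is dominated by large stones (263 of 350 cases), dragging its aggregate rate down.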


Exercise 15.11 (***)

Prove that Simpson's paradox cannot occur when the confounder $Z$ is independent of the treatment $T$ (i.e., when there is no confounding).

(a) Start from the law of total probability:

$$P(Y=1 \mid T=t) = \sum_z P(Y=1 \mid T=t, Z=z) \cdot P(Z=z \mid T=t)$$

(b) Show that when $P(Z=z \mid T=t) = P(Z=z)$ for all $z$ and $t$ (independence of treatment and confounder), the aggregate comparison must agree with the within-stratum comparison.

(c) Interpret this result in one sentence: what does it tell us about when Simpson's paradox is and is not a concern?
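A quick numeric illustration of part (b): when every stratum carries the same weight under both treatments, each aggregate is the same convex combination of that treatment's stratum rates, so the ordering cannot flip (illustrative numbers only):

```python
weights = [0.3, 0.7]        # P(Z=z), shared by both arms (no confounding)
rates_a = [0.90, 0.40]      # P(Y=1 | T=a, Z=z)
rates_b = [0.80, 0.30]      # A beats B in every stratum

agg_a = sum(w * r for w, r in zip(weights, rates_a))
agg_b = sum(w * r for w, r in zip(weights, rates_b))
assert agg_a > agg_b        # same weights => no reversal possible
```

The proof in part (b) is just this observation written in general form: with common weights, the aggregate difference is a weighted average of the per-stratum differences.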


Colliders and Selection Bias

Exercise 15.12 (*)

For each of the following scenarios, identify whether the described variable is a confounder, a collider, or a mediator on the path between treatment $T$ and outcome $Y$.

(a) $T$ = exercise, $Y$ = weight loss, variable = caloric intake. Exercise reduces caloric intake (for some people), and caloric intake directly affects weight loss.

(b) $T$ = education level, $Y$ = income, variable = hiring at a specific company. Both education and income potential influence who gets hired at competitive companies.

(c) $T$ = smoking, $Y$ = lung cancer, variable = socioeconomic status. Lower SES is associated with higher smoking rates and worse health outcomes.

(d) $T$ = training program, $Y$ = employee performance, variable = manager nomination. Managers nominate underperforming employees for training based on their current performance.


Exercise 15.13 (**)

Using the collider_bias_example function from Section 15.5:

(a) Modify the selection mechanism so that selection depends on the product of talent and attractiveness (multiplicative interaction) rather than the sum. How does this change the collider bias?

(b) Vary the selection stringency from the top 50% to the top 1%. Plot the collider-induced correlation as a function of selection stringency. What pattern do you observe?

(c) Add a third independent variable ("connections") that also affects selection. In the selected population, compute the pairwise correlations among all three variables. How does the number of collider-inducing paths affect the magnitude of bias?
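If the Section 15.5 code is not at hand, the baseline mechanism can be reconstructed in a few lines (a sketch with assumed parameters: two independent standard-normal traits, selection on their sum at the top 10%):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
talent = rng.normal(size=n)
attractiveness = rng.normal(size=n)      # independent of talent by construction

score = talent + attractiveness          # selection depends on BOTH traits
selected = score > np.quantile(score, 0.90)   # keep the top 10%

corr_all = np.corrcoef(talent, attractiveness)[0, 1]
corr_sel = np.corrcoef(talent[selected], attractiveness[selected])[0, 1]
```

In the full population the correlation is essentially zero; among the selected it is strongly negative, because clearing the threshold with low talent requires high attractiveness and vice versa. Parts (a)-(c) then ask how this baseline changes under different selection mechanisms.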


Exercise 15.14 (**)

A hospital-based study finds that among hospitalized patients, having diabetes is negatively associated with being diagnosed with pneumonia. A researcher concludes that diabetes might protect against pneumonia.

(a) Explain why this is likely Berkson's paradox (a form of collider bias). What is the collider?

(b) Draw the causal structure using text notation.

(c) Would you expect the same negative association in a population-based study (random sample from the general population)? Why or why not?


Exercise 15.15 (***)

Write a simulation that demonstrates Berkson's paradox in a healthcare setting.

def simulate_berksons_paradox(
    n_population: int = 100000,
    n_hospitalized: int = 5000,
    seed: int = 42
) -> Tuple[float, float]:
    """Simulate Berkson's paradox in a hospital setting.

    Two diseases are independent in the population, but conditioning
    on hospitalization (caused by either disease) creates a spurious
    negative association.

    Args:
        n_population: Total population size.
        n_hospitalized: Number of hospitalized individuals to sample.
        seed: Random seed.

    Returns:
        Tuple of (population_correlation, hospitalized_correlation).
    """
    # Your implementation here
    pass

(a) Implement the simulation with two independent diseases (prevalence ~5% each) where either disease independently causes hospitalization (with different probabilities).

(b) Compute the correlation between the two diseases in the general population and among hospitalized patients. Verify the negative association among hospitalized patients.

(c) Explain in mathematical terms why conditioning on the collider (hospitalization) induces a negative association.


The Ladder of Causation

Exercise 15.16 (*)

For each of the following questions about StreamRec's recommendation system, identify the rung of the Ladder of Causation:

(a) What fraction of users who were shown documentary content clicked on it?

(b) If we recommend documentaries to all users (regardless of their past viewing history), what would the average click rate be?

(c) A specific user was recommended a documentary and watched it. Would they have watched it if it had not been recommended?

(d) What is the average session length for users in the top engagement quartile?

(e) If we remove the "trending" section from the homepage, will session length decrease?


Exercise 15.17 (**)

Consider the statement: "Our recommendation system has a click-through rate of 12%, which means it is adding value to the platform."

(a) On which rung of the Ladder of Causation does the 12% CTR live?

(b) Explain why this statement conflates association with causation. What is the causal question the platform actually needs to answer?

(c) Construct a numerical example where the recommendation system has a 12% CTR but creates zero incremental engagement (i.e., users would have found and clicked on the same items without the recommendation).

(d) Construct a numerical example where the recommendation system has a 6% CTR but creates more incremental engagement than the 12% CTR system.


The Fundamental Problem

Exercise 15.18 (**)

A company runs a training program and compares the performance of employees who chose to attend versus those who did not.

(a) Write out the potential outcomes framework for this scenario: define $Y(1)$, $Y(0)$, and the treatment $T$.

(b) The company finds that attendees improved their performance by an average of 15 points, while non-attendees improved by 5 points. The company concludes the program caused a 10-point improvement. What is the selection bias concern?

(c) Suppose you learn that employees with higher baseline motivation are more likely to attend. How does this bias the naive estimate? In which direction?

(d) Propose a quasi-experimental design that could reduce (not eliminate) the selection bias.


Exercise 15.19 (***)

Consider a binary treatment $T \in \{0,1\}$, a binary outcome $Y \in \{0,1\}$, and the following potential outcomes table for a population of 6 individuals:

| Individual | $Y(0)$ | $Y(1)$ | $T$ (observed) |
|:----------:|:------:|:------:|:--------------:|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 1 | 1 |
| 6 | 0 | 0 | 0 |

(a) Compute the true ATE: $\frac{1}{N}\sum_{i=1}^{N}[Y_i(1) - Y_i(0)]$.

(b) Compute the observed difference in means: $\bar{Y}_{T=1} - \bar{Y}_{T=0}$.

(c) Compute the selection bias: $E[Y(0) \mid T=1] - E[Y(0) \mid T=0]$.

(d) Compute the ATT, $E[Y(1) - Y(0) \mid T=1]$, and verify that the decomposition holds: observed difference = ATT + selection bias.

(e) Is the treatment assignment in this table positively or negatively correlated with $Y(0)$? What does this imply about the selection mechanism?


Exercise 15.20 (***)

Implement a simulation that demonstrates why the fundamental problem is not solvable with more data.

def more_data_does_not_help(
    sample_sizes: List[int] = [100, 1000, 10000, 100000],
    true_ate: float = -0.10,
    confounding_strength: float = 0.5,
    seed: int = 42
) -> pd.DataFrame:
    """Show that increasing sample size does not reduce confounding bias.

    The naive estimator converges to a BIASED value (not the true ATE)
    as sample size increases, demonstrating that bias is a systematic
    error, not a sampling error.

    Args:
        sample_sizes: List of sample sizes to evaluate.
        true_ate: True average treatment effect.
        confounding_strength: Strength of confounding (0 = none, 1 = strong).
        seed: Random seed.

    Returns:
        DataFrame with sample size, naive estimate, standard error, and bias.
    """
    # Your implementation here
    pass

(a) Implement the simulation. For each sample size, compute the naive estimate, its standard error, and its bias (distance from true ATE).

(b) Plot the naive estimate $\pm$ 2 standard errors as a function of $\log_{10}(n)$. Add a horizontal line at the true ATE. What happens as $n \to \infty$?

(c) Explain in one paragraph why increasing the sample size reduces the standard error but does NOT reduce the bias.


Applied Problems

Exercise 15.21 (*)

A marketing team at an e-commerce company shows the following analysis: "Users who redeemed a 20% discount coupon had a 35% higher average order value than users who did not redeem a coupon. Therefore, coupons increase revenue."

(a) Identify the causal question implicit in the conclusion.

(b) Identify at least two confounders that could explain the association without a causal effect.

(c) Propose an experimental design that could credibly answer the causal question.


Exercise 15.22 (**)

An online education platform compares completion rates between students who use the discussion forum and those who do not. Forum users have a 72% completion rate versus 41% for non-users.

(a) The platform wants to use this evidence to justify investing in forum features. What causal question should they be asking?

(b) Name three confounders. For each, explain the direction of bias they introduce (does confounding make the forum look more or less effective than it really is?).

(c) The platform logs when each student first posts in the forum. Propose a quasi-experimental design using this timing data.

(d) Even with a valid causal design, the estimated effect of forum use might not generalize to a policy of requiring all students to use the forum. Explain why, using the concept of heterogeneous treatment effects.


Exercise 15.23 (***)

Implement the CausalQuestionFramework from Section 15.11 for a domain of your choice (not healthcare or recommendations). Choose a real-world decision problem where a prediction model might be deployed but a causal question is actually needed.

(a) Fill in all fields: question, type, treatment, outcome, confounders, can_randomize, barrier, assumptions.

(b) For each confounder you list, explain the direction and likely magnitude of the bias it introduces.

(c) Propose the strongest feasible causal design (experimental, quasi-experimental, or observational) for your problem. Justify why stronger designs are not feasible.


Exercise 15.24 (***)

Using the StreamRec simulation from Section 15.9, extend the analysis:

(a) Modify the simulation so that the recommendation effect depends on user characteristics (heterogeneous treatment effects). Specifically, let new users (first month) have higher recommendation effects than veteran users.

(b) Show that a model trained on overall data underestimates the recommendation effect for new users and overestimates it for veteran users.

(c) Implement a stratified analysis that estimates the recommendation effect separately for new and veteran users. Compare to the pooled estimate.

(d) Explain why this matters for StreamRec's business: if recommendation effects are heterogeneous, the optimal recommendation strategy should allocate "discovery" recommendations to new users and "familiar" recommendations to veteran users.


Exercise 15.25 (****)

The causal hierarchy theorem (Bareinboim, Correa, Ibeling, and Icard, 2022) states that for almost all structural causal models (SCMs), the observational distribution $P(\mathbf{V})$ does not determine the interventional distribution $P(\mathbf{V} \mid \text{do}(\mathbf{X}))$, and the interventional distribution does not determine counterfactual quantities.

(a) Construct a simple example with three variables ($X$, $Y$, $Z$) where two different SCMs produce the same observational distribution but different interventional distributions.

(b) Construct an example where two SCMs produce the same observational and interventional distributions but different counterfactual quantities.

(c) Relate this to the fundamental problem of causal inference: why does part (a) imply that no algorithm can learn causal effects from observational data alone without assumptions about the data-generating process?

(d) The theorem says "almost all" SCMs. Can you think of a special case where the observational distribution does determine the interventional distribution? (Hint: consider what happens when the causal graph is known.)


Exercise 15.26 (****)

A/B tests are considered the gold standard for causal inference in technology companies. But they are not assumption-free.

(a) List three assumptions of a standard A/B test that, if violated, would make the causal conclusion invalid.

(b) Network interference: In a social network, treating user A may change user B's behavior. Explain why standard A/B testing fails under interference and describe one alternative experimental design (e.g., cluster randomization, ego-network randomization).

(c) Long-term effects: An A/B test runs for two weeks and shows a 3% increase in engagement. But the novelty effect may inflate the short-term estimate. Describe a method for estimating the long-term causal effect from short-term experimental data. What additional assumptions are needed?

(d) Implement a simulation of an A/B test with network interference:

def ab_test_with_interference(
    n_users: int = 10000,
    n_edges: int = 30000,
    direct_effect: float = 0.05,
    spillover_effect: float = 0.02,
    treatment_fraction: float = 0.5,
    seed: int = 42
) -> Dict[str, float]:
    """Simulate an A/B test where treated users affect control users.

    Args:
        n_users: Number of users.
        n_edges: Number of social network edges.
        direct_effect: Direct treatment effect on treated users.
        spillover_effect: Effect on control users per treated neighbor.
        treatment_fraction: Fraction of users assigned to treatment.
        seed: Random seed.

    Returns:
        Dict with true ATE, naive ATE estimate, and bias.
    """
    # Your implementation here
    pass

Show that the naive A/B test estimate is biased when spillover effects are present.


Exercise 15.27 (**)

Return to the confounded feature fragility example from Section 15.7. Extend the analysis:

(a) Instead of a single shift in the confounder distribution, simulate a gradual drift: the confounder mean increases by 0.1 per time period over 20 periods. Plot the AUC degradation curve.

(b) Train a model using only causal features (features that directly cause the outcome, not confounded proxies). Show that this model degrades less under the same drift.

(c) In practice, you do not know which features are causal and which are confounded. Propose a heuristic for identifying likely confounded features from observational data. (Hint: think about stability across subpopulations.)


Exercise 15.28 (***)

MediCore wants to build a system that can distinguish between "high-risk" patients (high $P(Y=1 \mid X)$) and "high-uplift" patients (high $P(Y(0)=1 \mid X) - P(Y(1)=1 \mid X)$, i.e., patients who would benefit most from treatment).

(a) Explain why these are fundamentally different quantities and why a single prediction model cannot estimate both.

(b) In a perfect world (with experimental data), how would you estimate each quantity? What data would you need?

(c) Using simulation, create a dataset of 10,000 patients where risk and uplift are negatively correlated (the highest-risk patients have the lowest uplift). Train a standard risk prediction model and an uplift model using the transformed outcome approach: $Z = Y \cdot T / e(X) - Y \cdot (1-T) / (1 - e(X))$, where $e(X) = P(T=1 \mid X)$. (Note that $E[Z \mid X] = E[Y(1) - Y(0) \mid X]$, the negative of the uplift as defined above, so rank patients by predicted $-Z$.) Compare the targeting performance of each.
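The transformed outcome in part (c) is a one-liner once the propensity $e(X)$ is in hand. A minimal sketch under randomized assignment with $e(X) = 0.5$ and a constant effect chosen only so the answer is checkable (all parameters below are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
e = 0.5                                   # known propensity (randomized trial)
t = (rng.random(n) < e).astype(float)
y0 = rng.normal(size=n)
tau = 0.3                                 # constant effect, for checking only
y = y0 + tau * t

# transformed outcome: unbiased for E[Y(1) - Y(0) | X]
z = y * t / e - y * (1 - t) / (1 - e)
ate_hat = z.mean()                        # should be close to tau
```

Note the high variance of $Z$ relative to $Y$: regressing $Z$ on $X$ works, but it needs far more data than a risk model of the same accuracy, which is part of why uplift modeling is harder in practice. With uplift defined as in this exercise (a reduction in hospitalization), you would target patients with the most negative predicted $Z$.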


Exercise 15.29 (***)

Consider an observational study where MediCore wants to estimate the effect of Drug X on hospitalization. They have data on 50,000 patients with the following variables: age, sex, BMI, 17 comorbidity indicators, 4 lab values, insurance type, and hospital ID.

(a) For each variable, classify it as a potential confounder, collider, mediator, or irrelevant. Justify each classification with a one-sentence causal argument.

(b) A colleague suggests "just control for everything." Explain why this can be worse than controlling for nothing, using the concepts from this chapter.

(c) Another colleague suggests using a prediction model (e.g., gradient-boosted trees) to adjust for confounding by including all covariates. Explain why this approach can work for some purposes (propensity score estimation, Chapter 18) but not others (identifying mediators, avoiding colliders).


Exercise 15.30 (****)

Implement a comprehensive comparison of the three targeting strategies from Section 15.1 (risk-based, causal oracle, and random) under varying simulation parameters.

def comprehensive_targeting_comparison(
    n_patients: int = 50000,
    budget_fractions: List[float] = [0.01, 0.05, 0.10, 0.20, 0.50],
    correlation_values: List[float] = [-0.5, -0.2, 0.0, 0.2, 0.5],
    seed: int = 42
) -> pd.DataFrame:
    """Compare targeting strategies across budget levels and risk-uplift correlations.

    Varies two key parameters:
    1. Budget fraction: what proportion of patients can be treated
    2. Risk-uplift correlation: the relationship between baseline risk and
       treatment effect (positive = high risk patients also have high effects,
       negative = high risk patients have low effects)

    Args:
        n_patients: Number of patients.
        budget_fractions: List of budget fractions to evaluate.
        correlation_values: List of risk-uplift correlations to evaluate.
        seed: Random seed.

    Returns:
        DataFrame with results for each parameter combination.
    """
    # Your implementation here
    pass

(a) Implement the simulation. Create a heatmap where the x-axis is budget fraction, the y-axis is risk-uplift correlation, and the cell color is the ratio of risk-based prevented / causal-oracle prevented.

(b) Under what conditions does risk-based targeting approximate causal targeting? Under what conditions is it worst?

(c) Relate your findings to a real-world recommendation: when should a data science team invest in building uplift models versus using existing risk models?