Chapter 15: Quiz

Test your understanding of the concepts introduced in this chapter. Answers follow each question.


Question 1

What is the fundamental difference between a predictive question and a causal question?

Answer: A **predictive question** asks about the conditional distribution of an outcome given observed features: $P(Y \mid X)$. It exploits all statistical associations in the data, regardless of their source (causal, reverse causal, or confounded). A **causal question** asks about the effect of an *intervention*: $P(Y \mid \text{do}(X))$. It requires isolating the causal pathway from treatment to outcome and excluding confounded associations. Prediction asks "what will happen?"; causation asks "what would happen if I changed something?"

Question 2

A hospital builds a model to predict 30-day readmission and uses it to target patients for a care management intervention. Explain why this can lead to worse outcomes than random targeting.

Answer: The prediction model identifies patients at highest *risk* of readmission, not patients who would *benefit most* from the intervention. These are different populations. High-risk patients may have unmodifiable drivers of readmission (e.g., end-stage disease) where the intervention has no effect. Moderate-risk patients may have modifiable drivers (e.g., poor medication adherence) where the intervention could prevent readmission. By targeting the highest-risk patients, the hospital wastes intervention slots on patients whose outcomes cannot be changed and misses patients whose outcomes could be improved. If high risk is negatively correlated with treatment effect (as is often the case), risk-based targeting can be worse than random.

Question 3

Name the three sources of statistical association between two variables $X$ and $Y$.

Answer: There are three sources:

1. **$X$ causes $Y$** (direct or indirect causal effect): the association reflects a genuine causal pathway.
2. **$Y$ causes $X$** (reverse causation): the causal arrow points opposite to the direction of analysis; $Y$ influences $X$.
3. **A common cause $Z$ causes both $X$ and $Y$** (confounding): a third variable creates a non-causal association between $X$ and $Y$.

Prediction models exploit all three sources indiscriminately. Causal inference methods aim to isolate only the first.

Question 4

Define "confounder" and give an example from the MediCore pharmaceutical scenario.

Answer: A **confounder** is a variable that (1) causally influences both the treatment and the outcome, and (2) is not itself caused by the treatment. In the MediCore scenario, **disease severity** is a confounder: sicker patients are more likely to receive Drug X (because their doctors prescribe it for severe cases) and are also more likely to be hospitalized (because they are sicker). Disease severity creates a spurious positive association between Drug X and hospitalization, making a beneficial drug appear harmful in naive observational analysis.

Question 5

Explain Simpson's paradox in one paragraph. What causes it?

Answer: Simpson's paradox occurs when a statistical trend that appears in every subgroup of the data reverses when the subgroups are combined. It is caused by **confounding**: when a confounder has a different distribution across treatment groups, the aggregate comparison is a weighted average using different weights for each treatment. If the confounder is associated with both the treatment assignment and the outcome, the weights can distort the aggregate comparison enough to reverse the direction of the effect. For example, Treatment A may be superior in both small-stone and large-stone patients, but if Treatment A is disproportionately given to large-stone patients (harder cases), its aggregate success rate can be lower than Treatment B's.
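The kidney-stone example can be checked by hand with the standard textbook counts. A minimal sketch (the counts are the widely cited figures from the classic study; the helper function is ours):

```python
# (successes, patients) per treatment and stone size -- the classic
# kidney-stone counts used in most textbook presentations.
groups = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def success_rate(treatment, size=None):
    """Success rate for a treatment, optionally within one stone-size stratum."""
    cells = [v for (t, s), v in groups.items()
             if t == treatment and (size is None or s == size)]
    return sum(s for s, _ in cells) / sum(n for _, n in cells)

# A wins in both strata (0.93 vs 0.87 small; 0.73 vs 0.69 large)...
in_stratum = (success_rate("A", "small") > success_rate("B", "small")
              and success_rate("A", "large") > success_rate("B", "large"))
# ...yet loses in aggregate (0.78 vs 0.83), because A got the hard cases.
aggregate_reversed = success_rate("A") < success_rate("B")
```

The reversal is pure arithmetic: Treatment A's aggregate rate is dominated by its many large-stone (hard) cases, B's by its many small-stone (easy) ones.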

Question 6

What is a collider, and how does conditioning on a collider introduce bias?

Answer: A **collider** is a variable that is caused by two or more other variables. In the causal graph $X \to Z \leftarrow Y$, variable $Z$ is a collider on the path between $X$ and $Y$. When $X$ and $Y$ are marginally independent, conditioning on (or controlling for) the collider $Z$ induces a **spurious association** between $X$ and $Y$. This happens because among observations where $Z$ takes a particular value, knowing that $X$ did not contribute to $Z$ makes it more likely that $Y$ did (and vice versa). This is called **collider bias**; the special case arising from sample selection in medical settings is known as **Berkson's paradox**.
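A quick simulation makes the induced association concrete. This is a sketch with arbitrary parameters: $X$ and $Y$ are independent normals, $Z = X + Y$ is the collider, and "conditioning on $Z$" is implemented as selecting observations with large $Z$:

```python
import random

random.seed(0)

# X and Y are independent standard normals; Z = X + Y is a collider.
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50_000)]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

marginal = corr(data)                                        # near zero
# "Conditioning on Z": keep only observations where Z = X + Y exceeds 1.
conditional = corr([(x, y) for x, y in data if x + y > 1])   # strongly negative
```

Among the selected observations, a large $X$ makes a large $Y$ unnecessary to clear the threshold (and vice versa), which is exactly the "explaining away" mechanism described above.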

Question 7

Why is the advice "control for as many variables as possible" potentially harmful?

Answer: Controlling for a **confounder** reduces bias by blocking the backdoor path between treatment and outcome. But controlling for a **collider** *introduces* bias by opening a previously blocked path. Controlling for a **mediator** (a variable on the causal pathway from treatment to outcome) blocks part of the causal effect you are trying to estimate, leading to an underestimate. Without understanding the causal structure (which variables are confounders, colliders, and mediators), indiscriminate adjustment can make estimates worse, not better.

Question 8

State the fundamental problem of causal inference.

Answer: The **fundamental problem of causal inference** (Holland, 1986) is that for any individual, we can never observe both potential outcomes $Y_i(1)$ and $Y_i(0)$ simultaneously. Each individual either receives the treatment or does not; the counterfactual outcome — what *would have happened* under the alternative treatment — is inherently unobservable. This means the individual treatment effect $\tau_i = Y_i(1) - Y_i(0)$ is never directly measurable. Causal inference methods aim to estimate population-level summaries (like the ATE) under assumptions that bridge the gap between observed and counterfactual outcomes.
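In a simulation we can write down both potential outcomes and then watch one of them disappear. A sketch with invented rates (treatment truly lowers the outcome probability from 0.6 to 0.4, so the ATE is $-0.2$):

```python
import random

random.seed(5)

# Each simulated unit carries BOTH potential outcomes.
units = [(random.random() < 0.5,     # T: randomized assignment
          random.random() < 0.6,     # Y(0): outcome without treatment
          random.random() < 0.4)     # Y(1): outcome with treatment
         for _ in range(100_000)]

# Only the simulator can compute this: it needs both outcomes per unit.
true_ate = sum(y1 - y0 for _, y0, y1 in units) / len(units)   # about -0.2

# Real data only ever contains (T, observed Y); y1 - y0 per unit is gone.
observed = [(t, y1 if t else y0) for t, y0, y1 in units]
treated = [y for t, y in observed if t]
control = [y for t, y in observed if not t]
# Because T was randomized, a difference in means still recovers the ATE.
est_ate = sum(treated) / len(treated) - sum(control) / len(control)
```

Individual effects $y_1 - y_0$ are unrecoverable from `observed`, yet the population-level ATE is: that is exactly the gap the fundamental problem describes and randomization bridges.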

Question 9

Why does randomization solve the confounding problem?

Answer: Randomization makes treatment assignment statistically independent of all pre-treatment variables — both observed and unobserved. Formally, $T \perp\!\!\!\perp (Y(0), Y(1))$. This means the treated and control groups are, on average, identical in every respect except the treatment itself. The confounding bias term $E[Y(0) \mid T=1] - E[Y(0) \mid T=0]$ equals zero because the distribution of baseline outcomes is the same in both groups. The naive difference in means, $E[Y \mid T=1] - E[Y \mid T=0]$, becomes an unbiased estimator of the ATE.
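The bias term vanishing under randomization can be seen directly in a simulation. A sketch with invented parameters: severity confounds treatment and outcome, and the true ATE is $-0.2$:

```python
import random

random.seed(1)
TRUE_ATE = -0.2   # treatment lowers the bad-outcome probability by 0.2

def difference_in_means(randomized, n=200_000):
    """Naive E[Y|T=1] - E[Y|T=0] on simulated data."""
    sums, counts = [0.0, 0.0], [0, 0]
    for _ in range(n):
        severity = random.random()             # confounder, uniform on [0, 1)
        if randomized:
            t = random.random() < 0.5          # coin flip, independent of severity
        else:
            t = random.random() < severity     # sicker patients treated more often
        y = random.random() < 0.2 + 0.6 * severity + (TRUE_ATE if t else 0.0)
        counts[t] += 1
        sums[t] += y
    return sums[1] / counts[1] - sums[0] / counts[0]

naive = difference_in_means(randomized=False)  # near 0: bias (+0.2) masks the effect
rct = difference_in_means(randomized=True)     # near -0.2: unbiased for the ATE
```

Same outcome model, same estimator; only the assignment mechanism differs. With the coin flip, severity has the same distribution in both arms, so the difference in means is the ATE.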

Question 10

Name three situations where running an RCT is not feasible, and for each, suggest an alternative causal inference approach.

Answer:

1. **Ethical constraints** (e.g., withholding a known effective treatment): use **observational methods** with careful confounding adjustment (propensity score matching, inverse probability weighting) on existing clinical data.
2. **Scale/cost constraints** (e.g., testing a recommendation algorithm change on millions of users): use **quasi-experimental methods** like regression discontinuity (if a threshold determines treatment) or difference-in-differences (if treatment rolls out at different times).
3. **Temporal constraints** (e.g., evaluating the effect of a historical policy change): use **natural experiments** — instrumental variables, interrupted time series, or synthetic control methods that exploit the policy's timing or geographic variation.

Question 11

Describe the three rungs of Pearl's Ladder of Causation and give one example question at each rung.

Answer:

- **Rung 1: Association (Seeing)** — $P(Y \mid X)$. Questions about observed joint distributions. Example: "What is the click-through rate for users who were shown item X?"
- **Rung 2: Intervention (Doing)** — $P(Y \mid \text{do}(X))$. Questions about the effects of active interventions. Example: "If we recommend item X to a random user, what is the expected click probability?"
- **Rung 3: Counterfactual (Imagining)** — $P(Y_{x'} \mid X=x, Y=y)$. Questions about specific individuals under alternative conditions. Example: "This user clicked on item X after we recommended it. Would they have clicked if we had NOT recommended it?"

Each rung subsumes the one below. Moving up requires additional causal assumptions beyond the data.

Question 12

Why can't a standard offline evaluation of a recommendation system measure its causal impact?

Answer: Standard offline evaluation measures **predictive accuracy**: does the model correctly predict which items users will engage with? This is a Rung 1 (association) quantity. A model that perfectly predicts *organic* behavior (what users would have done without recommendations) achieves excellent offline metrics but creates zero incremental value. The causal impact of the recommendation system — the difference between engagement *with* and *without* the recommendation — is a Rung 2 (interventional) quantity that cannot be estimated from observational log data without additional causal assumptions or experimental data.

Question 13

In the MediCore drug example, the naive observational comparison shows Drug X increases hospitalization, while the true causal effect is a decrease. What specific mechanism causes this sign reversal?

Answer: The sign reversal is caused by **confounding by indication** (also called confounding by disease severity). Sicker patients are more likely to be prescribed Drug X *and* more likely to be hospitalized, regardless of Drug X. When we compare Drug X patients to non-Drug X patients, we are comparing a sicker population to a healthier one. The severity difference overwhelms the true beneficial effect of the drug, producing a positive association between Drug X and hospitalization even though the drug's causal effect is negative (protective). The confounding bias is positive and larger in magnitude than the true negative causal effect, resulting in a sign flip.
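Both the reversal and its repair by severity adjustment can be simulated. A MediCore-flavored sketch with invented rates (the drug truly cuts hospitalization risk by 0.15, but severe patients get it far more often):

```python
import random

random.seed(3)

# Severe patients: more likely to get Drug X AND more likely to be hospitalized.
records = []
for _ in range(200_000):
    severe = random.random() < 0.5
    drug = random.random() < (0.85 if severe else 0.15)
    p_hosp = (0.70 if severe else 0.20) - (0.15 if drug else 0.0)
    records.append((severe, drug, random.random() < p_hosp))

def hosp_rate(drug_val, severe_val=None):
    """Hospitalization rate, optionally within one severity stratum."""
    rows = [h for s, d, h in records
            if d == drug_val and (severe_val is None or s == severe_val)]
    return sum(rows) / len(rows)

naive = hosp_rate(True) - hosp_rate(False)                        # about +0.20
within_severe = hosp_rate(True, True) - hosp_rate(False, True)    # about -0.15
within_mild = hosp_rate(True, False) - hosp_rate(False, False)    # about -0.15
```

The naive comparison flips the sign because the bias (+0.35 worth of severity imbalance, roughly) exceeds the true effect; stratifying on the confounder recovers the protective effect in both strata.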

Question 14

Explain the difference between the Average Treatment Effect (ATE) and the effect that risk-based targeting implicitly optimizes for.

Answer: The **ATE** is the average causal effect of the treatment across the entire population: $E[Y(1) - Y(0)]$. Risk-based targeting implicitly optimizes for **high $P(Y=1 \mid X)$** — the baseline probability of the adverse outcome, regardless of whether the treatment changes it. The quantity that optimal targeting should use is the **Conditional Average Treatment Effect (CATE)**: $E[Y(1) - Y(0) \mid X = x]$, which measures how much the treatment changes the outcome for individuals with characteristics $x$. High $P(Y=1 \mid X)$ does not imply high CATE; they can even be negatively correlated.
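A toy calculation shows the gap between the two targeting rules. The three patient groups and their rates are invented for illustration; risk and effect are deliberately made to move in opposite directions:

```python
# (population share, baseline risk P(Y=1 | untreated), risk reduction if treated)
types = [
    (0.2, 0.9, 0.00),   # very sick, unmodifiable drivers: high risk, zero effect
    (0.3, 0.5, 0.15),   # modifiable drivers: moderate risk, large effect
    (0.5, 0.1, 0.02),   # low risk, little to gain
]

ate = sum(share * effect for share, _, effect in types)   # population ATE = 0.055

def events_averted(rank_key, budget=0.2):
    """Adverse events averted per capita when `budget` of the population
    can be treated, filling slots group-by-group in `rank_key` order."""
    total = 0.0
    for share, risk, effect in sorted(types, key=rank_key, reverse=True):
        treat = min(share, budget)
        total += treat * effect
        budget -= treat
        if budget <= 0:
            break
    return total

by_risk = events_averted(lambda t: t[1])   # all slots -> 0.9-risk group: 0 averted
by_cate = events_averted(lambda t: t[2])   # slots -> large-effect group: 0.03 averted
```

With a 20% treatment budget, risk-based targeting spends every slot on the group the treatment cannot help, while CATE-based targeting averts 0.03 events per capita.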

Question 15

A spurious correlation exists between per-capita cheese consumption and the number of people who died by becoming tangled in their bedsheets (an actual example from Tyler Vigen's "Spurious Correlations"). Classify this as confounded, reverse-caused, or coincidental, and explain why a prediction model would still exploit this association.

Answer: This is most likely **coincidental** — a statistical artifact of testing thousands of pairwise correlations across many time series. With enough variables, some will be correlated by chance. However, it could also reflect **confounding** by a shared time trend (both variables increased over the same period due to unrelated underlying causes). A prediction model would exploit this association because it is a real statistical pattern in the training data: cheese consumption is a useful *predictor* of bedsheet deaths (they move together). The model does not need the association to be causal — it only needs it to be statistically reliable in the prediction window. This association would break under intervention (eating more cheese would not change bedsheet mortality) and would likely break under distribution shift (the shared trend is unlikely to continue indefinitely).

Question 16

What is selection bias, and how does it relate to colliders?

Answer: **Selection bias** occurs when the sample used for analysis is not representative of the target population due to a systematic relationship between the variables under study and the selection mechanism. Selection bias is the practical consequence of **collider bias**: the selection criterion (e.g., "patient is hospitalized," "user is active," "candidate was hired") is a collider caused by multiple variables. Conditioning on this collider — by analyzing only the selected sample — induces spurious associations between the variables that caused selection. For example, analyzing only hospitalized patients (selected by having at least one serious condition) creates a negative association between conditions that independently cause hospitalization.
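The hospital example can be simulated directly (Berkson's original setting). All rates are made up: two independent conditions, each of which sends a patient to hospital half the time:

```python
import random

random.seed(4)

# Two independent conditions; either can cause hospitalization.
population = []
for _ in range(200_000):
    a = random.random() < 0.10                 # condition A, 10% prevalence
    b = random.random() < 0.10                 # condition B, independent of A
    hospitalized = (a and random.random() < 0.5) or (b and random.random() < 0.5)
    population.append((a, b, hospitalized))

def assoc(rows):
    """P(B | A) - P(B | not A): zero when A and B are unrelated."""
    with_a = [b for a, b, _ in rows if a]
    without_a = [b for a, b, _ in rows if not a]
    return sum(with_a) / len(with_a) - sum(without_a) / len(without_a)

whole_population = assoc(population)                        # near zero
hospital_sample = assoc([r for r in population if r[2]])    # strongly negative
```

In the hospitalized-only sample, a patient without condition A *must* have condition B (how else did they get admitted?), so the two independent diseases look strongly negatively associated.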

Question 17

A data scientist argues: "We have 10 million observations. With this much data, any confounding will average out." Is this correct? Why or why not?

Answer: This is **incorrect**. Confounding bias is a **systematic** error, not a **random** (sampling) error. Increasing the sample size reduces the standard error (random variation around the estimate), but it does not reduce the bias (distance between the expected value of the estimator and the true causal effect). With 10 million observations, you get a very precise estimate of the *wrong* quantity: the confounded association rather than the causal effect. The estimate converges to the biased value, not the true ATE. Formally, if the naive estimator has bias $b$ and variance $\sigma^2/n$, then as $n \to \infty$, the estimator converges to $\text{ATE} + b$, not to $\text{ATE}$.
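A simulation shows the estimate tightening around the wrong value as $n$ grows. Sketch with invented parameters: the true ATE is $-0.2$, but a binary confounder pushes the naive difference in means to a limit of about $+0.04$:

```python
import random

random.seed(2)
TRUE_ATE = -0.2

def naive_estimate(n):
    """Unadjusted difference in means on confounded simulated data."""
    sums, counts = [0.0, 0.0], [0, 0]
    for _ in range(n):
        sick = random.random() < 0.5                    # binary confounder
        t = random.random() < (0.8 if sick else 0.2)    # sick treated more often
        y = random.random() < (0.7 if sick else 0.3) + (TRUE_ATE if t else 0.0)
        counts[t] += 1
        sums[t] += y
    return sums[1] / counts[1] - sums[0] / counts[0]

small = naive_estimate(2_000)      # noisy, scattered around +0.04
large = naive_estimate(500_000)    # precise, still around +0.04, never -0.2
```

The large sample does not drift toward $-0.2$; it converges to $\text{ATE} + b \approx +0.04$ with a shrinking confidence interval around the wrong sign.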

Question 18

In the StreamRec context, explain the difference between a recommendation model that achieves high offline Hit@10 and one that creates high incremental engagement.

Answer: A model with high **offline Hit@10** accurately predicts items that users will engage with in held-out data. This measures predictive accuracy for organic behavior — it succeeds by identifying items users would have found and consumed anyway. A model that creates high **incremental engagement** identifies items where the recommendation *causes* engagement that would not have occurred without it. These are items the user would not have discovered through search, browsing, or organic navigation. The first model takes credit for existing behavior; the second model creates new value. They can have very different (even inversely related) item selections.

Question 19

Why is the assumption of "no unmeasured confounders" both critical and untestable in observational causal inference?

Answer: The assumption of **no unmeasured confounders** (also called "ignorability" or "unconfoundedness") states that, conditional on observed covariates, treatment assignment is independent of potential outcomes: $(Y(0), Y(1)) \perp\!\!\!\perp T \mid X$. This assumption is **critical** because all observational causal inference methods (matching, propensity scores, etc.) require it for identification — without it, the causal effect is not identifiable from the data. It is **untestable** because verifying it would require observing the unmeasured confounders, which contradicts the premise. The assumption is a **domain knowledge claim**: the analyst must argue, based on subject-matter expertise, that all relevant confounders have been measured. This is why causal inference from observational data always involves a degree of judgment that cannot be replaced by statistical methodology.

Question 20

List the four main categories of causal inference methods discussed in the landscape section, ordered from weakest to strongest required assumptions. For each, give one example method.

Answer:

1. **Experimental methods** (fewest assumptions): **Randomized controlled trial (RCT)**. Randomization eliminates confounding by design.
2. **Quasi-experimental methods**: **Difference-in-differences (DiD)**. Requires the parallel trends assumption but does not require randomization.
3. **Observational methods** (require no unmeasured confounders): **Propensity score matching**. Adjusts for observed confounders but cannot address unmeasured ones.
4. **Causal machine learning methods** (combine ML flexibility with causal assumptions): **Causal forests**. Estimate heterogeneous treatment effects but still require the same identification assumptions as observational methods.

The ordering reflects assumption strength: experimental methods make the fewest assumptions about the data-generating process, while observational and causal ML methods require increasingly strong (and untestable) assumptions.