Case Study 1: The Four Journeys — A Retrospective on Statistical Thinking in Action

The Setup

It's late May. The semester is ending, and four people whose stories have woven through this entire textbook are reflecting on how far they've come.

Each of them started with a question. Each of them learned, struggled, iterated, and grew. And each of them ended up somewhere they didn't expect.

This case study isn't about a new analysis. It's about something rarer and more important: looking back at the full arc of a data project and asking, "What did I actually learn?"


Part 1: Maya's Reflection — From Numbers to Neighborhoods

Dr. Maya Chen is presenting her environmental health findings to the Riverside County Board of Supervisors. She's nervous — not about the statistics, but about the stakes.

Her presentation includes:

  • A map showing childhood asthma rates by community, with the three neighborhoods near the Henderson Chemical plant highlighted in red
  • A confidence interval for the difference in asthma rates between near-plant and far-plant communities: (2.8, 5.4) additional diagnoses per 1,000 children per year (95% CI)
  • A multiple regression table showing that distance from the plant remains a significant predictor after controlling for income, smoking rates, and healthcare access (b = -2.7, p < 0.001): asthma rates fall as distance increases
  • A Simpson's paradox check demonstrating that the effect is consistent across age groups and income brackets — no reversals
  • An ethics section addressing privacy protections (data reported at community level, not individual level) and the potential for stigmatization

One board member interrupts: "So you're saying the Henderson plant is causing asthma?"

Maya takes a breath. She's rehearsed this.

"I'm saying our data shows a strong, statistically significant association between proximity to the plant and childhood asthma that persists after we control for the major confounders. The effect size is meaningful — approximately 4.1 additional cases per 1,000 children per year in the closest communities. But this is observational data. I cannot establish causation from a regression model. What I can tell you is that the evidence is strong enough to justify two things: first, further investigation, including air quality monitoring and a prospective study; and second, precautionary action to protect the children who are getting sick right now."
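Maya's headline figure can be sanity-checked from the reported interval alone. Assuming a symmetric normal-approximation CI (an assumption; the chapter doesn't state how the interval was built), the point estimate is the interval's midpoint and the standard error is the half-width divided by 1.96:

```python
# Sanity-check Maya's reported numbers from the 95% CI alone.
# Assumes a symmetric normal-approximation interval (not stated in the text).
lo, hi = 2.8, 5.4            # additional diagnoses per 1,000 children per year

estimate = (lo + hi) / 2     # midpoint -> point estimate
se = (hi - lo) / (2 * 1.96)  # half-width / 1.96 -> standard error
z = estimate / se            # implied z-statistic

print(f"estimate = {estimate:.1f} per 1,000")  # matches the 4.1 Maya quotes
print(f"SE = {se:.2f}, implied z = {z:.1f}")   # consistent with p < 0.001
```

The implied z of about 6.2 is well past the 3.29 threshold for p < 0.001, so the quoted significance level and the interval tell a consistent story.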

Analysis Questions

(a) Maya's answer distinguishes between association and causation. Using concepts from Chapters 4, 22, and 23, explain why she can't make a causal claim despite having a significant regression coefficient and a confidence interval that doesn't include zero.

(b) What confounders has Maya controlled for? What additional confounders might exist that she hasn't measured? How does this relate to the concept of "unmeasured confounders" from Chapter 23?

(c) Maya chose to report data at the community level rather than the individual level. Discuss the tradeoff between statistical precision and individual privacy. How does this connect to the re-identification risks discussed in Chapter 27?

(d) The board member's question ("Is the plant causing asthma?") reflects a desire for a simple yes-or-no answer. How is Maya's response an embodiment of Theme 4 (uncertainty is not failure)?

(e) Maya recommends "precautionary action." This involves making policy decisions under uncertainty. Using the ethical frameworks from Chapter 27 (utilitarian, rights-based, care ethics), evaluate her recommendation. Which framework best supports her position?


Part 2: Alex's Reflection — From Optimization to Responsibility

Alex Rivera is leading a team meeting at StreamVibe. The agenda: reviewing the results of three A/B tests from the past quarter.

Test 1: Autoplay preview length. Extending the autoplay preview from 5 seconds to 8 seconds increased click-through rate by 2.1 percentage points (p = 0.003, 95% CI: 0.7 to 3.5 pp). Effect size: Cohen's h = 0.04 (very small). Recommendation: ship the feature. The improvement is small per user but scales to millions of sessions.

Test 2: Emotional content boost. Boosting "high engagement" (anger-inducing and fear-inducing) content in the recommendation feed increased time-on-platform by 11 minutes per session (p < 0.001, 95% CI: 8.2 to 13.8 min). Effect size: Cohen's d = 0.52 (medium). Recommendation: do not ship. The test produced a statistically and practically significant result, but the ethical review flagged concerns about user well-being and the Facebook emotional contagion parallels from Chapter 27.

Test 3: Personalized thumbnails. Showing different thumbnail images to different users based on viewing history had no significant effect on click-through rate (p = 0.34, 95% CI: -1.2 to 0.4 pp). Power analysis: the test had 91% power to detect a 1 percentage-point improvement, so the null result is informative. Recommendation: don't ship. The feature adds engineering complexity without a measurable benefit.
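All three tests report Cohen's h, the standard effect size for a difference between two proportions, defined through an arcsine transform. A minimal helper shows the mechanics; the baseline click-through rate below is hypothetical, since the memo doesn't report it:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    phi = lambda p: 2 * math.asin(math.sqrt(p))
    return phi(p2) - phi(p1)

# Hypothetical baseline CTR of 40%; Test 1 reported a +2.1 pp lift.
h = cohens_h(0.40, 0.421)
print(f"h = {h:.2f}")  # a lift this small lands in the "very small" range
```

Because of the arcsine transform, the same percentage-point lift maps to a different h at different baselines, which is why the memo reports h rather than raw percentage points alone.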

"Three tests, three different outcomes, three different recommendations," Alex tells the team. "Two years ago, we would have shipped anything with p < 0.05 and moved on. Now we ask: Is it significant? Is it meaningful? Is it ethical? Those three questions, in that order."

Analysis Questions

(a) For Test 1, Alex's team shipped a feature with a very small effect size (h = 0.04). Under what circumstances is a very small effect size worth acting on? How does this connect to the discussion of statistical vs. practical significance in Chapter 17?

(b) For Test 2, the statistical evidence was strong (p < 0.001, d = 0.52), but Alex chose not to ship. What does this tell us about the relationship between statistical significance and ethical decision-making? Is there a tension between Themes 1 and 6 here?

(c) For Test 3, Alex explicitly mentions the power of the test. Why is this important for interpreting a null result? How does it change the interpretation compared to a null result from a test with only 30% power?

(d) Alex says the old approach was "ship anything with p < 0.05." Using concepts from Chapters 13, 17, and 27, explain why this approach is problematic.

(e) Write a two-paragraph summary of Test 2's results as you would present it to StreamVibe's VP of Product, who is not a statistician. Use the communication principles from Chapter 25.


Part 3: James's Reflection — From Analysis to Impact

Professor James Washington is in his office, staring at a framed copy of the legislation that his research helped inspire: the Algorithmic Accountability in Criminal Justice Act.

The bill requires three things:

  1. All risk assessment algorithms used in bail and sentencing must be publicly audited for racial disparities in error rates every two years
  2. Defendants must be informed when an algorithmic risk score is used in their case and told the algorithm's error rates for their demographic group
  3. Jurisdictions must report the outcomes (recidivism rates, false positive rates, false negative rates) disaggregated by race, gender, and age

James's contribution wasn't just the statistical analysis. It was the translation. He took a two-proportion z-test (z = -4.67, p < 0.001) and turned it into something legislators could understand:

"For every 100 Black defendants flagged as high risk by this algorithm, 31 were wrong. For every 100 white defendants flagged as high risk, 13 were wrong. That's not noise. That's not random variation. That's a systematic disparity that affects thousands of people every year."
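James's statistic comes from a pooled two-proportion z-test comparing error rates of roughly 31% and 13%. The chapter doesn't report the group sizes, so the counts below are hypothetical, chosen only to illustrate the mechanics; the resulting z is near, but not exactly, the reported -4.67:

```python
import math

def two_prop_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z-test statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Hypothetical counts (not from the chapter): 71 of 230 high-risk flags for
# Black defendants were wrong (~31%); 30 of 230 for white defendants (~13%).
z = two_prop_z(71, 230, 30, 230)
print(f"z = {z:.2f}")  # a disparity this large is far beyond chance
```

With groups of this size the statistic is already beyond -4.6, which is what licenses James's "that's not noise" translation.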

But James is honest about the limitations. In his testimony, he also said:

"I want to be clear about what my research shows and what it doesn't show. My analysis establishes that the disparity in false positive rates is statistically significant and practically meaningful. It does not establish that the algorithm was designed to be biased. It does not establish that eliminating the algorithm would produce better outcomes. And it does not resolve the fundamental fairness question — because as Chouldechova proved mathematically, you cannot equalize all fairness metrics simultaneously when base rates differ. What I'm advocating for is not the elimination of algorithms. It's transparency, auditing, and community input into the values embedded in these tools."
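The impossibility result James cites can be seen with a few lines of arithmetic. For a binary risk tool, the false positive rate is tied to the positive predictive value (PPV), the true positive rate (TPR), and the group's base rate p by FPR = (p / (1 - p)) · ((1 - PPV) / PPV) · TPR. A sketch with hypothetical base rates shows that equalizing PPV and TPR across groups with different base rates forces their FPRs apart:

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """FPR forced by fixing PPV and TPR at a given base rate.

    Derivation: TP = tpr * p * N; predicted positives = TP / ppv;
    FP = TP * (1 - ppv) / ppv; FPR = FP / ((1 - p) * N).
    """
    p = base_rate
    return (p / (1 - p)) * ((1 - ppv) / ppv) * tpr

# Hypothetical base rates for two groups; PPV and TPR held equal for both.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.7, tpr=0.6)
fpr_b = implied_fpr(base_rate=0.3, ppv=0.7, tpr=0.6)
print(f"group A FPR = {fpr_a:.2f}, group B FPR = {fpr_b:.2f}")
# Equal PPV and equal TPR with unequal base rates => unequal FPRs.
```

No tuning of the algorithm escapes this identity; the only choices are which fairness metric to equalize and which disparity to accept, which is exactly why James frames the issue as one of transparency and community input rather than a technical fix.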

Analysis Questions

(a) James translated a z-statistic of -4.67 into "31 out of 100 were wrong." Which communication principle from Chapter 25 is this an example of? Why is this translation effective?

(b) James acknowledges the fairness impossibility theorem. Explain this result in your own words, using concepts from Chapter 16 and Chapter 27. Why does it matter for policy?

(c) The legislation requires disaggregated reporting by race, gender, and age. Connect this requirement to Simpson's paradox (Chapter 27) and Theme 2 (human stories behind the data). What could go wrong if results were only reported in aggregate?

(d) James says his analysis "does not establish that eliminating the algorithm would produce better outcomes." What does this acknowledgment reveal about the difference between statistical evidence and policy decisions?

(e) Using the three ethical frameworks from Chapter 27, evaluate the legislation James helped inspire. Which framework best supports each of the three requirements?


Part 4: Sam's Reflection — From Intern to Analyst

Sam Okafor is cleaning out the intern desk. Except it's not the intern desk anymore — it's Sam's desk now. The nameplate says "Junior Data Analyst, Riverside Raptors."

On the wall is a printout of Daria's shooting chart: 258 three-point attempts over the full season, with a running proportion that starts volatile and gradually stabilizes near 37.6%. Sam drew a horizontal line at 31% (the old average) and another at 37.6% (the new proportion). The gap between those lines represents the question that took an entire textbook to answer.

Sam remembers the day in Chapter 1 when the head coach asked: "Has Daria really improved, or is she just getting lucky?" Sam remembers the frustration of Chapter 14, when p = 0.097 wasn't enough to say yes, and the power analysis in Chapter 17 that explained why: the test had only 24% power. And finally, the resolution: z = 2.28, p = 0.011, with 258 attempts and 82% power.
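The jump from 24% power at n = 65 to 82% at n = 258 reflects how power scales with sample size. Under the normal approximation, power for a one-sided proportion test grows roughly like Φ(h·√n - z_α): quadrupling n only doubles √n. A sketch using Cohen's h = 0.14 from Sam's report (the exact figures depend on the approximation used, so these won't match Sam's numbers precisely):

```python
import math

def norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def approx_power(h: float, n: int) -> float:
    """Approximate one-sided power: Phi(h * sqrt(n) - z_alpha)."""
    z_alpha = 1.645  # one-sided 5% critical value
    return norm_cdf(h * math.sqrt(n) - z_alpha)

for n in (65, 258):
    print(f"n = {n:3d}: power = {approx_power(0.14, n):.0%}")
# Power climbs with sqrt(n), not n: quadrupling the sample doubles h*sqrt(n),
# which is why power roughly triples rather than quadruples here.
```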

But the analysis Sam is most proud of isn't the final hypothesis test. It's the full report:

Component                    Value                            Interpretation
Sample proportion            37.6% (97/258)                   Current season performance
Historical proportion        31.0%                            Baseline comparison
Hypothesis test              z = 2.28, p = 0.011              Statistically significant improvement
95% CI for true proportion   (31.7%, 43.5%)                   Plausible range of true shooting %
Cohen's h                    0.14                             Small but real improvement
Power                        82% at n = 258                   Adequate power to detect this effect
Practical impact             +1.3 made 3s/game                Approximately 3.9 points per game
Caveat                       Regression to the mean possible  Continued monitoring recommended
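Most of the numbers in Sam's report can be reproduced with the standard library alone (the z and p here may differ from the table's in the last digit because of rounding):

```python
import math

x, n, p0 = 97, 258, 0.31           # makes, attempts, historical proportion
p_hat = x / n                      # sample proportion, about 0.376

# One-proportion z-test against the 31% baseline (null SE uses p0)
se0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))   # one-sided

# 95% CI for the true proportion (Wald interval, SE uses p_hat)
se1 = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se1, p_hat + 1.96 * se1)

# Cohen's h via the arcsine transform
h = 2 * math.asin(math.sqrt(p_hat)) - 2 * math.asin(math.sqrt(p0))

print(f"p_hat = {p_hat:.1%}, z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI = ({ci[0]:.1%}, {ci[1]:.1%}), h = {h:.2f}")
```

Running this recovers the CI of (31.7%, 43.5%), h of 0.14, and a p-value of 0.011, which is the point of Sam's full-report style: every row of the table is auditable.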

"The old me would have just said 'p < 0.05, she improved, done,'" Sam tells a friend. "The new me says 'here's the effect size, here's the CI, here's the power, here's what we still don't know, and here's what I recommend we do next.'"

Daria signed her extension. The coaching staff cited Sam's analysis. And next season, Sam will track whether the improvement holds or regresses — because that's what statistical thinking demands. Not certainty. Continued observation.

Analysis Questions

(a) Sam's analysis evolved across seven chapters. Create a timeline showing the key milestone in each: Chapters 1, 6, 11, 13, 14, 17, and 28. For each, note what tool was used and what conclusion was reached.

(b) At n = 65, Sam's power was 24%. At n = 258, it was 82%. Explain, using the power analysis concepts from Chapter 17, why quadrupling the sample size didn't quadruple the power.

(c) Sam includes a regression-to-the-mean caveat. Using the concept from Chapter 22, explain what regression to the mean is and why it's relevant here.

(d) Sam's report includes eight components. Which three do you think are most important for the coaching staff's decision? Which three are most important for statistical rigor? Explain the difference.

(e) Sam says "The old me would have just said 'p < 0.05, she improved, done.'" In 100-150 words, describe the difference between the "old Sam" and the "new Sam" using the vocabulary and concepts from this course. How does this transformation mirror your own growth as a statistical thinker?


Synthesis

(f) All four characters ultimately had to communicate their findings to non-statisticians (a county board, a VP of product, a legislative committee, a coaching staff). Choose the character whose communication challenge was most difficult, and explain why. What principles from Chapter 25 did they apply?

(g) Each character's story illustrates multiple recurring themes. For each character, identify the single theme that was most central to their journey, and explain your reasoning.

(h) Write a one-paragraph reflection on how the four anchor examples changed your understanding of statistics over the course of this textbook. Which character's story resonated with you most, and why?