
Chapter 17: Power, Effect Sizes, and What "Significant" Really Means

"The primary product of a research inquiry is one or more measures of effect size, not p-values." — Jacob Cohen

Chapter Overview

I need to tell you something that might make you uncomfortable.

Everything you've learned about hypothesis testing in the last four chapters — the careful logic, the p-values, the rejection regions, the five-step procedures — all of it is incomplete. Not wrong. Just incomplete. And the missing piece changes everything about how you should interpret statistical results.

Here's the problem. In Chapter 16, Alex found that StreamVibe's new algorithm increased average watch time by 4.5 minutes, with $p = 0.012$. Statistically significant! But is 4.5 minutes a big effect or a small one? Is it worth the engineering cost of deploying a new algorithm? The p-value can't tell you that.

And there's a darker version of this problem. With a large enough sample, any difference — no matter how tiny, no matter how meaningless — becomes statistically significant. A drug that lowers blood pressure by 0.1 mmHg. A teaching method that raises test scores by 0.2 points. A website redesign that increases click-through rates by 0.001%. All statistically significant with enough data. All utterly trivial.

Meanwhile, with a small enough sample, even a large, important effect can be missed entirely. A drug that actually saves lives. A training program that genuinely works. A real act of racial bias hidden in an algorithm. All invisible if you don't have enough data to detect them.

This chapter is about the gap between "statistically significant" and "actually important." It's about the tools that bridge that gap — effect sizes and power analysis — and about why the entire scientific community has been reckoning with the consequences of ignoring them.

In this chapter, you will learn to:

  • Define and calculate statistical power
  • Explain why "statistically significant" does not mean "important"
  • Calculate and interpret effect sizes (Cohen's d, $r^2$)
  • Conduct a basic power analysis to determine required sample size
  • Critically evaluate published research findings using effect sizes and confidence intervals

Fast Track: If you've encountered effect sizes before, skim Sections 17.1–17.3, then jump to Section 17.7 (power analysis in Python). Complete quiz questions 1, 10, and 17 to verify.

Deep Dive: After this chapter, read Case Study 1 (Alex's effect size analysis — was the A/B test result practically significant?) and Case Study 2 (James's effect size of algorithmic vs. human bail decisions — the fairness question with proper context). Both include full worked solutions with Python code.


17.1 A Puzzle Before We Start (Productive Struggle)

Before we get into formulas, try this thought experiment.

The Magic Pill

A pharmaceutical company tests a new blood pressure medication in a clinical trial with 50,000 participants. They find that the drug lowers systolic blood pressure by an average of 0.3 mmHg compared to the placebo, with $p = 0.001$.

Meanwhile, a small university lab tests a different medication with 50 participants. They find that their drug lowers systolic blood pressure by an average of 12 mmHg compared to the placebo, with $p = 0.08$.

(a) Which result is statistically significant at $\alpha = 0.05$?

(b) Which drug would you rather take?

(c) Why do your answers to (a) and (b) differ?

(d) What does this tell you about what p-values can and cannot reveal?

Take 3 minutes. Part (c) is the key insight for this chapter.

Here's what I hope you noticed:

For part (a), the answer is clear: only the 50,000-participant study is statistically significant ($p = 0.001 < 0.05$). The 50-participant study's result ($p = 0.08$) does not reach the conventional threshold.

For part (b), the answer is equally clear: you'd rather take the drug that lowers blood pressure by 12 mmHg. That's a clinically meaningful reduction. A 0.3 mmHg reduction is so small it would never be noticed by a patient or a doctor.

Part (c) is where the insight lives. The p-value depends on two things: the size of the effect and the size of the sample. The first study found a tiny effect with a huge sample — so the p-value was small. The second study found a large effect with a small sample — so the p-value was large. The p-value conflates the size of the effect with the size of the sample. It cannot tell you which one is driving the result.

And part (d) is the thesis of this chapter: p-values tell you whether a result is surprising under the null hypothesis. They do not tell you whether a result is important, meaningful, or worth caring about. For that, you need something else entirely.


17.2 The "Significant but Trivial" Problem

Let's make the productive struggle concrete with numbers.

Suppose the true difference between two groups is incredibly small — say, $\mu_1 - \mu_2 = 0.1$ (in whatever units you're measuring), with both groups having standard deviation $\sigma = 10$.

The test statistic for a two-sample t-test is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

The numerator (the observed difference) stays roughly the same regardless of sample size. But the denominator (the standard error) shrinks as $n$ increases, because $SE = \sigma\sqrt{1/n_1 + 1/n_2}$.

Watch what happens:

| Sample Size (per group) | True Difference | SE | t-statistic | p-value | Significant at $\alpha = 0.05$? |
|---|---|---|---|---|---|
| 50 | 0.1 | 2.00 | 0.05 | 0.960 | No |
| 500 | 0.1 | 0.632 | 0.16 | 0.875 | No |
| 5,000 | 0.1 | 0.200 | 0.50 | 0.617 | No |
| 50,000 | 0.1 | 0.063 | 1.58 | 0.114 | No |
| 500,000 | 0.1 | 0.020 | 5.00 | $< 0.001$ | Yes |
| 5,000,000 | 0.1 | 0.006 | 15.81 | $< 0.001$ | Yes |

The true difference never changed. It was always 0.1 — a completely trivial amount (one-hundredth of a standard deviation). But with enough data, the standard error shrinks until even this meaningless difference produces a huge test statistic and a tiny p-value.
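You can check these rows yourself. Here is a minimal sketch (ours, not part of the chapter's later code) that recomputes the SE, test statistic, and p-value for each sample size, using the large-sample normal approximation for the p-value:

```python
import numpy as np
from scipy import stats

true_diff, sigma = 0.1, 10.0   # a trivial true difference, common SD

for n in [50, 500, 5_000, 50_000, 500_000, 5_000_000]:
    se = sigma * np.sqrt(1/n + 1/n)     # SE of the difference, equal group sizes
    t = true_diff / se                  # test statistic at the true difference
    p = 2 * stats.norm.sf(abs(t))       # two-sided p-value, normal approximation
    print(f"n = {n:>9,}   SE = {se:.3f}   t = {t:.2f}   p = {p:.3f}")
```

The p-values match the table to rounding because at these sample sizes the t distribution is essentially normal.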

Key Insight: With a large enough sample, any nonzero difference — no matter how trivial — becomes statistically significant.

This isn't a bug in hypothesis testing. It's a feature, working exactly as designed. The test correctly detects that the difference isn't exactly zero. But "not exactly zero" is a very low bar. The real question is: is the difference big enough to matter?

This is a pervasive problem in the age of big data. Companies like StreamVibe, Google, and Facebook run A/B tests on millions of users. With millions of observations, even a 0.1-second difference in page load time will be "statistically significant." The question isn't whether the difference is real — it almost certainly is. The question is whether anyone should care.

Real-World Examples

  • Education: A meta-analysis found that class size reduction from 25 to 15 students improved test scores by a statistically significant but small amount — about 2 percentile points. Is that worth the cost of hiring thousands of additional teachers?
  • Medicine: Many drug trials with tens of thousands of participants find statistically significant reductions in cholesterol or blood pressure that are too small to have clinical relevance.
  • Technology: A/B tests at tech companies routinely find statistically significant differences in click rates at the fourth decimal place.

The common thread: statistical significance without practical significance is noise dressed up as signal.


17.3 The "Important but Not Significant" Problem

The flip side is equally dangerous.

With a small sample, even a large, important effect can fail to reach statistical significance. The standard error is so large that the test statistic stays small, and the p-value stays large — even when something genuinely important is happening.

Remember Sam's analysis of Daria's shooting? In Chapter 14, Sam tested whether Daria's three-point percentage had improved from 31% to 38.5% (25 out of 65 shots). The z-test gave $p = 0.097$ — not significant at $\alpha = 0.05$.

But a 7.5-percentage-point improvement in three-point shooting is enormous in basketball. If real, it would move Daria from below-average to elite. The reason Sam's test didn't detect it isn't that the improvement doesn't exist — it's that 65 shots isn't enough data to distinguish a real improvement from random variation.

Key Insight: With a small enough sample, even large, important effects can be missed.

Failing to reject $H_0$ does not mean $H_0$ is true. It may simply mean you didn't have enough data. This is the Type II error from Chapter 13 — a false negative. And the probability of a Type II error depends on something called statistical power.

Here's the uncomfortable truth: many published studies in psychology, medicine, and social science are underpowered — they simply don't have enough participants to reliably detect the effects they're looking for. An underpowered study is like trying to spot a bird with binoculars that are out of focus. The bird might be right there, but you'll never see it.

🔄 Spaced Review 1 (from Ch.13): P-Value Definition — Now We See Its Limitations

In Chapter 13, we defined the p-value as "the probability of observing data as extreme as or more extreme than what was actually observed, if the null hypothesis were true." We explored five common misconceptions and introduced the ASA's six principles about p-values.

Principle 5 is the one that matters most right now: "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result."

At the time, this was a warning. Now we'll understand why it matters — and what to use instead. The p-value tells you about the compatibility of your data with the null hypothesis. It does not tell you about the magnitude of the effect. A small p-value can come from a large effect in a small sample, or from a tiny effect in a massive sample. You need additional tools — effect sizes and confidence intervals — to distinguish between these two very different scenarios.


17.4 Effect Size: Measuring What Actually Matters

So if the p-value can't tell us how big an effect is, what can?

Effect sizes. An effect size is a standardized measure of the magnitude of a phenomenon — how big the difference is, how strong the relationship is, how much one variable changes when another changes. Unlike p-values, effect sizes are independent of sample size. They tell you about the world, not about your study.

Concept 1: Effect Size

An effect size is a quantitative measure of the magnitude of a phenomenon. It answers the question "how big is the effect?" rather than "is there an effect?" Effect sizes are independent of sample size: a drug that lowers blood pressure by 5 mmHg has the same effect size whether you test it on 50 people or 50,000.

There are many different effect size measures for different situations. We'll focus on the two most important ones.

Cohen's d: The Effect Size for Comparing Two Groups

When comparing the means of two groups (as in the two-sample t-test from Chapter 16), the most common effect size is Cohen's d.

The idea is simple: express the difference between the two group means in units of standard deviation.

$$\boxed{d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}}$$

where $s_p$ is the pooled standard deviation:

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Concept 2: Cohen's d

Cohen's d expresses the difference between two group means as a number of standard deviations. A Cohen's d of 0.5 means the two groups differ by half a standard deviation. It is the most widely used effect size for comparing two means, and it is independent of sample size.

In plain English: Cohen's d tells you how many standard deviations apart the two groups are. A $d$ of 1.0 means the groups are separated by one full standard deviation — that's a big gap. A $d$ of 0.1 means they're separated by just one-tenth of a standard deviation — barely different at all.

Interpreting Cohen's d

Jacob Cohen, who invented this measure in the 1960s, proposed benchmarks for interpreting $d$:

| Cohen's d | Interpretation | What it looks like |
|---|---|---|
| 0.2 | Small effect | Requires careful measurement to detect; barely visible in raw data |
| 0.5 | Medium effect | Visible to the naked eye; a noticeable difference |
| 0.8 | Large effect | Obvious and substantial; hard to miss |

A word of caution about benchmarks. Cohen himself warned against applying these benchmarks mechanically. What counts as "small" or "large" depends on context. In clinical medicine, a "small" effect on mortality (say, $d = 0.2$) could represent thousands of lives saved. In education, a $d$ of 0.4 might represent an entire year's worth of learning gains. Always interpret effect sizes in the context of the field and the stakes involved.

Let's see what $d$ looks like visually. When $d = 0$ (no effect), the two group distributions overlap completely. As $d$ increases, the distributions pull apart:

  • $d = 0.2$ (small): The distributions overlap about 85%. If you randomly picked one person from each group, the person from the "higher" group would score higher only about 56% of the time (barely better than a coin flip).
  • $d = 0.5$ (medium): The distributions overlap about 67%. The person from the "higher" group would score higher about 64% of the time.
  • $d = 0.8$ (large): The distributions overlap about 53%. The person from the "higher" group would score higher about 71% of the time.
  • $d = 2.0$ (very large): The distributions barely overlap. The person from the higher group would score higher about 92% of the time.
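The "who scores higher" percentages above come from a standard conversion: for two normal distributions with equal SDs, the probability that a random draw from the higher group beats a random draw from the lower group is $\Phi(d/\sqrt{2})$. A quick check in Python:

```python
import numpy as np
from scipy.stats import norm

# Probability of superiority: P(draw from higher group > draw from lower group)
for d in [0.2, 0.5, 0.8, 2.0]:
    p_sup = norm.cdf(d / np.sqrt(2))
    print(f"d = {d:.1f}: superiority probability ≈ {p_sup:.0%}")
# d = 0.2 -> about 56%; d = 0.5 -> 64%; d = 0.8 -> 71%; d = 2.0 -> 92%
```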

Worked Example: Alex's A/B Test

In Chapter 16, Alex's A/B test found:

  • Old algorithm: $\bar{x}_1 = 42.3$ minutes, $s_1 = 18.5$, $n_1 = 247$
  • New algorithm: $\bar{x}_2 = 46.8$ minutes, $s_2 = 21.2$, $n_2 = 253$

First, compute the pooled standard deviation:

$$s_p = \sqrt{\frac{(247-1)(18.5)^2 + (253-1)(21.2)^2}{247 + 253 - 2}} = \sqrt{\frac{246(342.25) + 252(449.44)}{498}}$$

$$s_p = \sqrt{\frac{84,193.5 + 113,258.9}{498}} = \sqrt{\frac{197,452.4}{498}} = \sqrt{396.5} \approx 19.9$$

Now compute Cohen's d:

$$d = \frac{46.8 - 42.3}{19.9} = \frac{4.5}{19.9} \approx 0.23$$

Alex's effect is $d \approx 0.23$ — a small effect by Cohen's benchmarks.

Here's the critical insight: the A/B test was statistically significant ($p = 0.012$) but the effect is small. This isn't a contradiction — it's the whole point. With 500 users, even a small effect can be detected. But knowing the effect is small changes how Alex should interpret and communicate the result.
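The arithmetic above translates directly into a small helper function (the name `cohens_d` is ours, for illustration):

```python
import numpy as np

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using the pooled standard deviation."""
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m2 - m1) / sp

# Alex's A/B test from Chapter 16
d = cohens_d(42.3, 18.5, 247, 46.8, 21.2, 253)
print(f"Cohen's d = {d:.2f}")   # 0.23 -- a small effect by Cohen's benchmarks
```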

Alex's Dilemma: Is 0.23 Standard Deviations Worth It?

A Cohen's d of 0.23 means the new algorithm shifts the average user's watch time by about one-quarter of a standard deviation. That's a real difference, but a modest one. Is it worth the engineering cost of deploying the new algorithm across 12 million users?

Here's where context matters. A 4.5-minute increase per session, applied to 12 million active users across an average of 3 sessions per week, translates to roughly 8.4 billion additional viewing minutes per year. In the streaming industry, that could mean millions of dollars in advertising revenue and reduced churn.

So the effect is "small" by Cohen's generic benchmarks but potentially very large in business impact. Always interpret effect sizes in context.

Worked Example: Sam's Shooting Data

In Chapter 14, Daria's three-point percentage was 38.5% (25/65) versus a historical baseline of 31%. For proportions, we can compute Cohen's h (a related effect size for proportions):

$$h = 2\arcsin\left(\sqrt{p_1}\right) - 2\arcsin\left(\sqrt{p_2}\right)$$

$$h = 2\arcsin(\sqrt{0.385}) - 2\arcsin(\sqrt{0.31}) = 2(0.669) - 2(0.590) = 1.339 - 1.181 \approx 0.158$$

That's a small effect by Cohen's benchmarks ($h < 0.2$). Combined with only 65 observations, it's no surprise Sam's test didn't reach significance. The sample was too small to detect an effect that small.

But wait — that 7.5-percentage-point improvement is meaningful in basketball context. This is where Cohen's benchmarks can mislead: the benchmarks are generic, and what counts as "meaningful" depends entirely on the domain.
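For reference, Cohen's h is a one-liner in code (the helper name `cohens_h` is ours):

```python
import numpy as np

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

h = cohens_h(0.385, 0.31)   # Daria's 38.5% vs. the 31% baseline
print(f"Cohen's h = {h:.3f}")   # ≈ 0.16 -- small by Cohen's benchmarks
```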


17.5 $r^2$: How Much Variance Does the Effect Explain?

Cohen's d tells you how far apart two group means are. But there's another way to quantify effect size: ask what proportion of the total variability in the data is explained by the group difference.

This is called $r^2$ (r-squared), also known as the coefficient of determination when we get to regression in Chapter 22, or eta-squared ($\eta^2$) in the context of group comparisons.

Concept 3: $r^2$ as Effect Size

$r^2$ measures the proportion of variance in the outcome variable that is explained by the predictor (or group membership). An $r^2$ of 0.25 means the group variable explains 25% of the variation in scores. The remaining 75% is due to other factors.

For a two-group comparison, you can convert between Cohen's d and $r^2$:

$$r^2 = \frac{d^2}{d^2 + 4}$$

Or equivalently, from a t-test:

$$r^2 = \frac{t^2}{t^2 + df}$$

Interpreting $r^2$

| $r^2$ | Cohen's d equivalent | Interpretation |
|---|---|---|
| 0.01 | $d \approx 0.2$ | Small effect — explains 1% of variance |
| 0.06 | $d \approx 0.5$ | Medium effect — explains 6% of variance |
| 0.14 | $d \approx 0.8$ | Large effect — explains 14% of variance |

Those numbers might shock you. A "large" effect explains only 14% of the variance? That means even when $d = 0.8$, 86% of the variation is due to other factors.

This is why effect sizes are humbling. Most interventions in education, psychology, and business explain a tiny fraction of the variance in outcomes. The world is complicated, and any single factor — even an important one — typically accounts for only a sliver of what's going on.

Alex's $r^2$

From Alex's A/B test: $t = 2.53$ with $df \approx 492$.

$$r^2 = \frac{2.53^2}{2.53^2 + 492} = \frac{6.40}{498.40} \approx 0.013$$

The new algorithm explains about 1.3% of the variance in watch time. The remaining 98.7% is due to individual differences in viewing preferences, time of day, content available, mood, and a thousand other factors.

Statistically significant? Yes. But it explains 1.3% of what's going on. This is what "significant but small" looks like in the real world.
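The two conversion formulas from this section agree on Alex's data; here's a quick sketch (the function names are ours) showing both routes to the same answer:

```python
def r2_from_d(d):
    """Convert Cohen's d to r^2 (proportion of variance explained)."""
    return d**2 / (d**2 + 4)

def r2_from_t(t, df):
    """Compute r^2 directly from a t-statistic and its degrees of freedom."""
    return t**2 / (t**2 + df)

# Alex's A/B test, both routes
print(f"From d = 0.23:           r^2 = {r2_from_d(0.23):.3f}")
print(f"From t = 2.53, df = 492: r^2 = {r2_from_t(2.53, 492):.3f}")
# Both give roughly 0.013 -- about 1.3% of variance explained
```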

🔄 Spaced Review 2 (from Ch.12): Confidence Intervals — Effect Size + Uncertainty in One Package

Here's something beautiful about confidence intervals that you might not have appreciated in Chapter 12: a confidence interval simultaneously tells you the effect size and the uncertainty about it.

Alex's 95% CI for the difference in watch time was (1.01, 7.99) minutes. This single interval answers two questions:

  1. Is the effect statistically significant? Yes — the interval doesn't contain zero.
  2. How big is the effect? The true difference is plausibly anywhere from 1 minute to 8 minutes.

The CI gives you something the p-value alone never could: a range of plausible effect sizes. A difference of 1 minute might not be worth pursuing, while a difference of 8 minutes definitely is. The CI shows you that both are plausible, which is far more informative than just "significant at $p = 0.012$."
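That interval is easy to reproduce from the Chapter 16 summary statistics. A sketch using the normal approximation (with roughly 500 users, the z- and t-based intervals agree to two decimals):

```python
import numpy as np
from scipy import stats

# Chapter 16 summary statistics
m1, s1, n1 = 42.3, 18.5, 247   # old algorithm
m2, s2, n2 = 46.8, 21.2, 253   # new algorithm

diff = m2 - m1
se = np.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
z = stats.norm.ppf(0.975)               # 1.96 for a 95% interval
lo, hi = diff - z * se, diff + z * se
print(f"95% CI for the difference: ({lo:.2f}, {hi:.2f})")
# -> approximately (1.01, 7.99) minutes
```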

This is why many statisticians argue that confidence intervals should be the primary tool of inference, with p-values as secondary. The CI conveys the direction, the magnitude, and the precision of the estimate — three things the p-value cannot.


17.6 Statistical Power: The Probability of Finding What's There

Now let's turn to the other side of the coin. We've seen that big samples can make trivial effects significant, and small samples can make important effects invisible. The concept that quantifies this is statistical power.

What Power Is

Recall from Chapter 13 the two types of errors:

| | $H_0$ is actually true | $H_0$ is actually false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct! (Power) |
| Fail to reject $H_0$ | Correct! | Type II error ($\beta$) |

  • Type I error ($\alpha$): Rejecting $H_0$ when it's true. A false alarm. Probability = $\alpha$.
  • Type II error ($\beta$): Failing to reject $H_0$ when it's false. A missed detection. Probability = $\beta$.

Concept 4: Statistical Power

Statistical power is the probability of correctly rejecting $H_0$ when it is actually false:

$$\text{Power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$$

Power is the probability of detecting a real effect when one exists. A study with 80% power has an 80% chance of finding a statistically significant result if the effect is real. Equivalently, it has a 20% chance of missing the effect (Type II error, $\beta = 0.20$).

Think of it this way: if you're trying to find a needle in a haystack, power is the probability that your search method actually finds the needle — given that the needle is there. A powerful search method finds the needle most of the time. An underpowered method misses it frequently.

How Much Power Is Enough?

The conventional minimum is 80% power ($\beta = 0.20$). This means you accept a 20% chance of missing a real effect. Some fields aim for 90% power in high-stakes studies.

| Power Level | $\beta$ | Interpretation |
|---|---|---|
| 50% | 0.50 | Coin flip — you'll miss the effect half the time |
| 70% | 0.30 | Inadequate for most purposes |
| 80% | 0.20 | Conventional minimum |
| 90% | 0.10 | Desirable for well-funded studies |
| 95% | 0.05 | High-stakes research (clinical trials) |

A study with 50% power is essentially a coin flip. You'd miss the effect as often as you'd find it. Yet many published studies in psychology and social science have power levels around 50% or lower. This means that roughly half of all real effects go undetected — and of the ones that are detected, many are overestimates (because only the samples that happened to produce large, significant results got published).

The Four Factors That Determine Power

Power isn't a fixed number — it depends on four interconnected factors. Understanding these is the key to designing good studies.

1. Significance level ($\alpha$)

A smaller $\alpha$ (e.g., 0.01 instead of 0.05) means you need stronger evidence to reject $H_0$. This reduces the false alarm rate (Type I error), but it also makes it harder to detect real effects — reducing power.

There's an inherent tradeoff: you can't reduce both Type I and Type II error simultaneously without increasing sample size. Decreasing $\alpha$ lowers Type I error but increases $\beta$, which decreases power.

2. Sample size ($n$)

Larger samples → smaller standard errors → larger test statistics → easier to reject $H_0$ → higher power.

This is the factor researchers have the most control over. When planning a study, the primary question is usually: "How large a sample do I need to achieve adequate power?"

3. Effect size

Larger effects are easier to detect. A drug that lowers blood pressure by 20 mmHg is easier to find than one that lowers it by 2 mmHg. If the true effect is large, you need fewer observations; if the effect is small, you need more.

4. Variability ($\sigma$)

More variable data → more noise → harder to detect a signal → lower power. This is why the paired t-test from Chapter 16 is often more powerful than the independent-samples t-test: by comparing each subject to themselves, you eliminate between-subject variability, effectively reducing $\sigma$.

🔄 Spaced Review 3 (from Ch.4): Study Design and Power

In Chapter 4, you learned about different study designs — observational studies, experiments, randomized controlled trials, stratified sampling, and matched-pairs designs.

Now you can see why these design choices matter from a power perspective:

  • Randomized experiments eliminate confounders, making the effect cleaner (effectively reducing variability and increasing power).
  • Stratified sampling reduces between-stratum variability, increasing precision and power.
  • Matched-pairs designs eliminate between-subject variability, often dramatically increasing power — which is exactly why Sam's paired analysis in Chapter 16 detected Daria's improvement ($p = 0.0007$) when the unpaired analysis in Chapter 14 did not ($p = 0.097$).
  • Larger samples increase power, but with diminishing returns (because SE decreases with $\sqrt{n}$, not $n$).

Study design and statistical analysis are two sides of the same coin. A well-designed study is inherently more powerful than a poorly designed one, even with the same sample size.

The Power Equation (Conceptual)

For a one-sample z-test, the power can be expressed as:

$$\text{Power} = P\left(Z > z_{\alpha} - \frac{\delta}{\sigma/\sqrt{n}}\right)$$

where $\delta$ is the true difference from the null hypothesis value, $\sigma$ is the population standard deviation, and $z_\alpha$ is the critical value.

Don't worry about memorizing this formula. The key relationships are:

$$\text{Power} \uparrow \text{ when } \begin{cases} n \uparrow & \text{(more data)} \\ \delta \uparrow & \text{(larger effect)} \\ \sigma \downarrow & \text{(less noise)} \\ \alpha \uparrow & \text{(more lenient threshold)} \end{cases}$$

Each arrow tells a story. You can always increase power by collecting more data. You can't change the effect size (that's nature's decision), but you can choose to study effects that are expected to be large. You can reduce variability through better measurement, more precise instruments, or paired designs. And you can increase $\alpha$, but that increases false alarms — so there's a tradeoff.
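You don't have to take these arrows on faith — power can be estimated by simulation. Here is a sketch (ours) for a one-sample t-test: draw many samples from a world where the effect is real, and count how often the test rejects $H_0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)

def simulated_power(delta, sigma, n, alpha=0.05, reps=2000):
    """Estimate power by simulation: fraction of samples where H0 is rejected."""
    rejections = 0
    for _ in range(reps):
        sample = rng.normal(delta, sigma, n)   # a world where the true mean is delta
        _, p = stats.ttest_1samp(sample, 0)    # test H0: mu = 0
        rejections += (p < alpha)
    return rejections / reps

# Same effect, more data -> higher power
print(simulated_power(delta=2, sigma=10, n=50))    # small effect, small n: low power
print(simulated_power(delta=2, sigma=10, n=500))   # same effect, 10x data: high power
```

Try varying `delta`, `sigma`, and `alpha` to watch the other three arrows in action.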

Why Underpowered Studies Are Dangerous

An underpowered study is one that doesn't have enough data to reliably detect the effect it's looking for. Underpowered studies cause two serious problems:

Problem 1: They miss real effects. An underpowered study has a high probability of producing $p > 0.05$ even when the effect is real. The researcher concludes "no effect" when there is one. In medicine, this could mean an effective treatment is abandoned. In criminal justice, it could mean real bias goes undetected.

Problem 2: They overestimate effects that they do find. This is subtler but equally important. Among underpowered studies, only those that happen to overestimate the effect size will produce significant results (because the true effect is too small to be detected reliably). This creates "winner's curse" or "effect size inflation": published effects from underpowered studies tend to be larger than the true effect, and subsequent replications with larger samples find smaller effects. This is a major driver of the replication crisis we discussed in Chapter 1 and Chapter 13.


17.7 Power Analysis in Python

Power analysis answers two critical questions:

  1. Prospective (planning): How many participants do I need to detect an effect of a given size with a given power?
  2. Retrospective (evaluation): Given my sample size and effect size, what was my power to detect the effect?

Python's statsmodels library provides a clean interface for both.

Power Analysis for a Two-Sample t-Test

```python
from statsmodels.stats.power import TTestIndPower

# Create a power analysis object
power_analysis = TTestIndPower()

# ============================================================
# Question 1: How many participants do I need?
# ============================================================
# Goal: Detect a medium effect (d = 0.5) with 80% power at alpha = 0.05

n_per_group = power_analysis.solve_power(
    effect_size=0.5,    # Cohen's d
    alpha=0.05,         # significance level
    power=0.80,         # desired power
    ratio=1.0,          # equal group sizes (n2/n1)
    alternative='two-sided'
)

print(f"Required sample size per group: {n_per_group:.0f}")
# Output: Required sample size per group: 64

# That's 64 per group, or 128 total participants

# ============================================================
# Question 2: What was our power in a completed study?
# ============================================================
# Alex's A/B test: d = 0.23, n per group ≈ 250

power_achieved = power_analysis.solve_power(
    effect_size=0.23,
    nobs1=250,
    alpha=0.05,
    ratio=1.0,
    alternative='two-sided'
)

print(f"\nAlex's study power: {power_achieved:.3f}")
# Output: Alex's study power: approximately 0.73

# About 73% power — below the 80% convention
# This means there was roughly a 27% chance of missing the effect
```

Sample Size Planning: Maya's New Study

Maya wants to design a study comparing asthma rates in two communities. Based on pilot data and prior research, she expects the difference to correspond to about $d = 0.3$ (a small-to-medium effect). How many participants does she need per community?

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# ============================================================
# Maya's sample size planning
# ============================================================

# Scenario 1: Standard 80% power
n_80 = power_analysis.solve_power(
    effect_size=0.3, alpha=0.05, power=0.80,
    alternative='two-sided'
)
print(f"For 80% power: n = {n_80:.0f} per group ({2*n_80:.0f} total)")

# Scenario 2: Higher 90% power
n_90 = power_analysis.solve_power(
    effect_size=0.3, alpha=0.05, power=0.90,
    alternative='two-sided'
)
print(f"For 90% power: n = {n_90:.0f} per group ({2*n_90:.0f} total)")

# Scenario 3: What if the effect is smaller than expected?
n_small = power_analysis.solve_power(
    effect_size=0.2, alpha=0.05, power=0.80,
    alternative='two-sided'
)
print(f"For d=0.2, 80% power: n = {n_small:.0f} per group ({2*n_small:.0f} total)")

# Output:
# For 80% power: n = 176 per group (352 total)
# For 90% power: n = 235 per group (470 total)
# For d=0.2, 80% power: n = 394 per group (788 total)
```

Notice the pattern: detecting smaller effects requires dramatically larger samples. Going from $d = 0.3$ to $d = 0.2$ more than doubles the required sample size. This is why researchers should think carefully about the smallest effect that would be practically important — and design their study to detect that.

Power Curves

One of the most useful visualizations in power analysis is the power curve: a plot showing how power changes as a function of sample size for different effect sizes.

```python
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
import numpy as np

power_analysis = TTestIndPower()

# ============================================================
# Power curves for different effect sizes
# ============================================================

fig, ax = plt.subplots(figsize=(10, 6))

sample_sizes = np.arange(10, 500, 5)
effect_sizes = [0.2, 0.5, 0.8]
colors = ['#e74c3c', '#3498db', '#2ecc71']
labels = ['Small (d=0.2)', 'Medium (d=0.5)', 'Large (d=0.8)']

for d, color, label in zip(effect_sizes, colors, labels):
    powers = [power_analysis.solve_power(
        effect_size=d, nobs1=n, alpha=0.05,
        ratio=1.0, alternative='two-sided'
    ) for n in sample_sizes]
    ax.plot(sample_sizes, powers, color=color, linewidth=2.5, label=label)

ax.axhline(y=0.80, color='gray', linestyle='--', alpha=0.7, label='80% power')
ax.set_xlabel('Sample Size (per group)', fontsize=13)
ax.set_ylabel('Statistical Power', fontsize=13)
ax.set_title('Power Curves: How Sample Size Affects Power', fontsize=14)
ax.legend(fontsize=12, loc='lower right')
ax.set_ylim(0, 1.05)
ax.set_xlim(0, 500)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('power_curves.png', dpi=150, bbox_inches='tight')
plt.show()

# Read off the curves:
# - To detect a LARGE effect (d=0.8) with 80% power: ~26 per group
# - To detect a MEDIUM effect (d=0.5) with 80% power: ~64 per group
# - To detect a SMALL effect (d=0.2) with 80% power: ~394 per group
```

The power curves tell a dramatic story:

  • Large effects ($d = 0.8$) require modest samples — about 26 per group for 80% power
  • Medium effects ($d = 0.5$) need about 64 per group
  • Small effects ($d = 0.2$) require nearly 400 per group

If your expected effect is small, you need a lot of data.

Power for Proportions

For comparing two proportions (like James's recidivism study), use a different power function:

from statsmodels.stats.power import NormalIndPower
import numpy as np

# ============================================================
# James's study: power analysis for proportions
# ============================================================

# James observed: algorithm 21.6% vs. judge 27.6% recidivism
# Cohen's h for proportions
p1, p2 = 0.216, 0.276
h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
print(f"Cohen's h = {h:.3f}")

# Power analysis
power_prop = NormalIndPower()

# What was James's power with n1=412, n2=388?
power_james = power_prop.solve_power(
    effect_size=abs(h),
    nobs1=412,
    alpha=0.05,
    ratio=388/412,
    alternative='two-sided'
)
print(f"James's study power: {power_james:.3f}")
# Output: James's study power: approximately 0.50

# James had only about 50% power to detect this effect!
# That means his p = 0.049 (barely significant) was lucky —
# there was roughly a coin-flip chance he would have missed it.

This is a sobering finding. James's overall comparison ($p = 0.049$, barely significant) was essentially a coin flip in terms of power. If he'd had a slightly different sample, the same real effect would have produced $p > 0.05$, and the study would have been a "failure" — not because the effect doesn't exist, but because the study was underpowered.
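A back-of-the-envelope check, using the closed-form z-test formula with scipy (inputs taken from the proportions above), shows how far short the study fell:

```python
import numpy as np
from scipy.stats import norm

# Effect size James observed (from the proportions above)
p1, p2 = 0.216, 0.276
h = abs(2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2)))

# Per-group n for 80% power, two-sided alpha = 0.05, equal groups:
# n per group ~ 2 * ((z_{alpha/2} + z_beta) / h)^2
z = norm.ppf(0.975) + norm.ppf(0.80)
n_needed = 2 * (z / h) ** 2
print(f"h = {h:.3f}, n needed per group ≈ {n_needed:.0f}")
# Roughly 800+ per group — about double what James actually had
```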

Sam's Sample Size Question — Finally Answered

How many shots would Daria need to confirm her improvement from 31% to 38.5%?

from statsmodels.stats.power import NormalIndPower
import numpy as np

# ============================================================
# Sam's question: How many shots does Daria need?
# ============================================================

# Effect: improvement from 31% to 38.5%
p_old, p_new = 0.31, 0.385
h = 2 * np.arcsin(np.sqrt(p_new)) - 2 * np.arcsin(np.sqrt(p_old))
print(f"Cohen's h = {h:.3f}")

# For a one-sample z-test against a known baseline:
# effect_size = (p_new - p_old) / sqrt(p_old * (1 - p_old))
es = (0.385 - 0.31) / np.sqrt(0.31 * 0.69)
print(f"Standardized effect size: {es:.3f}")

power_1samp = NormalIndPower()

# One-sided test (we expect improvement, not decline).
# In statsmodels, ratio=0 gives the one-sample case.
n_needed = power_1samp.solve_power(
    effect_size=es,
    alpha=0.05,
    power=0.80,
    ratio=0,
    alternative='larger'
)
print(f"\nShots needed for 80% power: {n_needed:.0f}")

n_90 = power_1samp.solve_power(
    effect_size=es, alpha=0.05, power=0.90,
    ratio=0, alternative='larger'
)
print(f"Shots needed for 90% power: {n_90:.0f}")

# What was Sam's power with n = 65?
power_sam = power_1samp.solve_power(
    effect_size=es, nobs1=65, alpha=0.05,
    ratio=0, alternative='larger'
)
print(f"\nSam's power with n=65: {power_sam:.3f}")

# Output (approximate):
# Cohen's h ≈ 0.158
# Standardized effect size ≈ 0.162
# Shots needed for 80% power: ~235
# Shots needed for 90% power: ~326
# Sam's power with n=65: ~0.37

Now we understand why Sam's test didn't reject the null hypothesis. With only 65 shots, Sam had roughly 37% power — only about a one-in-three chance of detecting Daria's improvement, even if it's real. Sam would need roughly 235 shots for 80% power, or about 326 shots for 90% power.

Sam's recommendation to the analytics director can now be much more precise: "We need roughly 170 more shots to have adequate power to detect this level of improvement. Based on Daria's current playing time, that's about 20-25 more games."
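The statsmodels numbers can be cross-checked by hand: the one-sample, one-sided z-test sample size has a simple closed form, $n = ((z_\alpha + z_\beta)/\text{es})^2$. A sketch with scipy, using the same inputs as above:

```python
import numpy as np
from scipy.stats import norm

p_old, p_new = 0.31, 0.385
es = (p_new - p_old) / np.sqrt(p_old * (1 - p_old))  # standardized effect

def n_one_sample(es, alpha=0.05, power=0.80):
    """One-sided, one-sample z-test: n = ((z_alpha + z_beta) / es)^2."""
    return ((norm.ppf(1 - alpha) + norm.ppf(power)) / es) ** 2

print(f"80% power: n ≈ {n_one_sample(es):.0f} shots")
print(f"90% power: n ≈ {n_one_sample(es, power=0.90):.0f} shots")
```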


17.8 The ASA Statement Revisited: What Should We Do?

In Chapter 13, you encountered the American Statistical Association's 2016 statement on p-values. Let's revisit it now with deeper understanding.

The ASA's six principles were:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true.
  3. Scientific conclusions should not be based only on whether a p-value passes a threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value does not measure the size of an effect or the importance of a result.
  6. A p-value does not provide a good measure of evidence regarding a model.

When you first read these in Chapter 13, principles 2 and 5 might have seemed like abstract warnings. Now you've seen them in action:

Principle 2 explains why Alex's $p = 0.012$ doesn't mean there's a 98.8% chance the new algorithm is better. It means the data would be unlikely if the two algorithms were identical.

Principle 3 explains why James's $p = 0.049$ shouldn't be treated radically differently from a hypothetical $p = 0.051$. The effect is the same — only the arbitrary cutoff changes.

Principle 5 explains why Alex's "significant" result with $d = 0.23$ is fundamentally different from a significant result with $d = 0.8$, even though both produce $p < 0.05$.

The 2019 Follow-Up: "Retire Statistical Significance"

In 2019, the conversation went further. A special issue of the ASA's journal The American Statistician, alongside a Nature commentary co-signed by more than 800 researchers, advanced a radical proposal: retire the concept of "statistical significance" entirely.

The key arguments:

  1. The 0.05 threshold is arbitrary. Fisher originally suggested it as a rough guideline, not an inviolable boundary.
  2. Binary thinking is harmful. Treating results as "significant" or "not significant" leads to treating evidence as "exists" or "doesn't exist" — when really, evidence is a continuum.
  3. Effect sizes and CIs are more informative. A confidence interval tells you everything a p-value tells you and more.
  4. Thresholds incentivize p-hacking. When there's a bright line at $p = 0.05$, researchers have strong incentives to push their results below it.

The proposed alternative: report effect sizes, confidence intervals, and p-values together, without making binary declarations of "significant" or "not significant." Instead of "the effect was significant ($p = 0.03$)," say "the observed difference was 4.5 minutes (95% CI: 1.0 to 8.0), $d = 0.23$, $p = 0.012$."

🗣️ Debate Framework: Should We Replace P-Values?

Position A — Replace p-values: P-values have caused enormous damage through misinterpretation and misuse. They conflate effect size and sample size, encourage binary thinking, incentivize p-hacking, and have been a primary driver of the replication crisis. Confidence intervals and Bayesian methods provide all the same information and more. It's time to move on.

Position B — Fix, don't replace: The p-value itself isn't the problem — the misuse of p-values is the problem. Banning p-values would just shift the problem to a different tool. We should instead require reporting of effect sizes, confidence intervals, and p-values together. Education (like this chapter!) is the solution, not prohibition.

Position C — Use Bayesian methods instead: Bayes factors and posterior probabilities directly answer the question researchers actually care about: "Given the data, how likely is the hypothesis?" They don't suffer from the same misinterpretation problems as p-values. The tools exist; it's time to adopt them.

Consider: What are the strengths and weaknesses of each position? Which position do you find most compelling?


17.9 Publication Bias and the File Drawer Problem

Theme 6 Connection: P-Hacking, Publication Bias, and Ethics

Publication bias and p-hacking are not just statistical problems — they are ethical problems. When research findings are filtered through a system that rewards "significant" results and punishes null results, the published literature becomes a distorted mirror: it shows a world where treatments always work, differences always exist, and effects are always large. Real people make real decisions based on this distorted picture — patients take ineffective drugs, policymakers fund wasteful programs, and companies deploy useless algorithms.

The ethical obligation is clear: report what you find, not what you wish you'd found. Pre-register your hypotheses. Report all analyses, not just the significant ones. Report effect sizes alongside p-values. And treat a well-conducted null result as valuable information, not as a failure.

We touched on publication bias in Chapter 13. Now let's dig deeper.

The File Drawer Problem

Imagine 20 research labs independently test the same null hypothesis (which happens to be true — there is no real effect). Each lab uses $\alpha = 0.05$.

By definition, each lab has a 5% chance of a false positive. Across 20 labs, the expected number of false positives is $20 \times 0.05 = 1$.

Now here's the problem: the 19 labs that found "nothing" file their results away in a drawer. The one lab that found $p < 0.05$ publishes in a prestigious journal. The published literature now contains one study showing a "significant" effect, and zero studies showing no effect.

A reader of the journal sees a clean, significant result and thinks: "The evidence supports this effect." But in reality, it's the one false positive out of 20 attempts. The evidence actually refutes the effect — but the evidence against it is locked in 19 file drawers.

This is the file drawer problem, named by psychologist Robert Rosenthal in 1979.
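The arithmetic of the file drawer is easy to verify: with each lab running one honest test at $\alpha = 0.05$, the chance that at least one of 20 labs reports a false positive is $1 - 0.95^{20} \approx 0.64$. A two-line check:

```python
# Probability that at least one of k independent labs hits p < alpha
# when the null hypothesis is true everywhere
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k
print(f"P(at least one false positive in {k} labs) = {p_at_least_one:.3f}")
# -> 0.642

expected_fp = k * alpha   # expected number of false positives
print(f"Expected number of false positives: {expected_fp:.1f}")
# -> 1.0
```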

P-Hacking: A Simulation

Let's see p-hacking in action with a Python simulation. We'll generate data from a world where nothing is happening — both groups are drawn from the same population — and see how easily a "significant" result can be manufactured through data exploration.

import numpy as np
from scipy import stats

np.random.seed(42)

# ============================================================
# P-HACKING SIMULATION
# The truth: NO difference between groups (both drawn from N(0,1))
# Goal: Show how many "significant" results appear by chance
# ============================================================

n_studies = 10000
n_per_group = 30
alpha = 0.05

# ---- Scenario 1: One honest test per study ----
false_positives_honest = 0

for _ in range(n_studies):
    group1 = np.random.normal(0, 1, n_per_group)
    group2 = np.random.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(group1, group2)
    if p < alpha:
        false_positives_honest += 1

fp_rate_honest = false_positives_honest / n_studies
print("=" * 55)
print("P-HACKING SIMULATION: No Real Effect (H₀ is true)")
print("=" * 55)
print(f"\nScenario 1: One test per study")
print(f"  False positive rate: {fp_rate_honest:.3f}")
print(f"  Expected: {alpha:.3f}")

# ---- Scenario 2: Try 5 different outcome variables ----
# (pick the one with the smallest p-value)
false_positives_5tests = 0

for _ in range(n_studies):
    min_p = 1.0
    for _ in range(5):
        group1 = np.random.normal(0, 1, n_per_group)
        group2 = np.random.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(group1, group2)
        min_p = min(min_p, p)
    if min_p < alpha:
        false_positives_5tests += 1

fp_rate_5 = false_positives_5tests / n_studies
print(f"\nScenario 2: Best of 5 tests per study")
print(f"  False positive rate: {fp_rate_5:.3f}")
print(f"  Expected: {1 - 0.95**5:.3f}")

# ---- Scenario 3: Try 20 different analyses ----
false_positives_20tests = 0

for _ in range(n_studies):
    min_p = 1.0
    for _ in range(20):
        group1 = np.random.normal(0, 1, n_per_group)
        group2 = np.random.normal(0, 1, n_per_group)
        _, p = stats.ttest_ind(group1, group2)
        min_p = min(min_p, p)
    if min_p < alpha:
        false_positives_20tests += 1

fp_rate_20 = false_positives_20tests / n_studies
print(f"\nScenario 3: Best of 20 tests per study")
print(f"  False positive rate: {fp_rate_20:.3f}")
print(f"  Expected: {1 - 0.95**20:.3f}")

# Output (approximate):
# Scenario 1: One test per study
#   False positive rate: 0.050
#   Expected: 0.050
#
# Scenario 2: Best of 5 tests per study
#   False positive rate: 0.226
#   Expected: 0.226
#
# Scenario 3: Best of 20 tests per study
#   False positive rate: 0.642
#   Expected: 0.642

The results are striking:

| Strategy | Tests per Study | False Positive Rate |
|---|---|---|
| Honest: one pre-specified test | 1 | 5% |
| Try 5 variables, report the best | 5 | 23% |
| Try 20 analyses, report the best | 20 | 64% |

With 20 different analyses, you have a 64% chance of finding "significance" even when nothing is happening. This isn't fraud — it's the natural consequence of exploring data flexibly and reporting selectively.

Theme 4 Connection: Uncertainty as Framework

The replication crisis, publication bias, and p-hacking are all consequences of the same fundamental mistake: refusing to take uncertainty seriously.

When researchers treat $p < 0.05$ as a binary verdict of "true" rather than as one piece of evidence, they overstate their confidence. When journals refuse to publish null results, they create a literature that systematically understates uncertainty. When analysts explore data flexibly and report only the "wins," they manufacture false certainty.

The solution has been the theme of this entire course since Chapter 1: embrace uncertainty. Report effect sizes and confidence intervals. Acknowledge what you don't know. Treat null results as valuable information. And never, ever confuse "statistically significant" with "true."


17.10 The Replication Crisis — Now Fully Explained

In Chapter 1, you learned about the replication crisis: only 36% of 100 psychology studies replicated successfully. In Chapter 13, you learned about p-hacking and the false discovery rate. Now you have the complete picture.

The replication crisis isn't caused by any single problem. It's the result of four factors working together:

Factor 1: Underpowered Studies + Publication Bias

Most studies in psychology have historically been underpowered — with average power around 50% or less. This means:

  • Real effects are missed about half the time ($\beta \approx 0.50$)
  • The effects that are detected tend to be overestimates (winner's curse)
  • Combined with publication bias (only significant results are published), the literature is filled with inflated effect sizes from underpowered studies
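The winner's curse is easy to demonstrate by simulation (a sketch under assumed values: true $d = 0.3$, $n = 20$ per group): when an underpowered study does clear $p < 0.05$, the effect estimate that got it there is almost always inflated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n = 0.30, 20            # a real but small effect, small samples
sig_ds = []                     # effect estimates from "significant" studies

for _ in range(20000):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        # pooled-SD Cohen's d, recorded only when the study "succeeds"
        s_pooled = np.sqrt((np.var(control, ddof=1) +
                            np.var(treatment, ddof=1)) / 2)
        sig_ds.append((treatment.mean() - control.mean()) / s_pooled)

print(f"True effect: d = {true_d}")
print(f"Share of studies reaching significance: {len(sig_ds)/20000:.2f}")
print(f"Mean |d| among significant studies: {np.mean(np.abs(sig_ds)):.2f}")
# The "published" (significant) studies substantially overestimate
# the true effect, because only large sample fluctuations cross p < .05
```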

Factor 2: P-Hacking (The Garden of Forking Paths)

As the simulation above showed, exploring data flexibly inflates false positive rates far beyond the nominal $\alpha = 0.05$. In practice, researcher degrees of freedom include:

  • Choosing which variables to analyze
  • Choosing which subgroups to examine
  • Choosing how to handle outliers
  • Choosing which covariates to include
  • Choosing between one-tailed and two-tailed tests
  • Choosing when to stop collecting data

Each choice is a "fork" in the analysis path, and each fork inflates the false positive rate.

Factor 3: Small Effects + Large Noise

Many effects in social science are genuinely small ($d \approx 0.2$ to $d \approx 0.4$). Detecting them reliably requires large samples. But the incentive structure of academia rewards novelty and significance, not precision and replication. So researchers use small samples, get noisy estimates, and publish the ones that happen to cross $p < 0.05$.

Factor 4: The Threshold Problem

The $p = 0.05$ threshold creates a cliff: results with $p = 0.049$ are "significant" and publishable; results with $p = 0.051$ are "not significant" and often unpublishable. This cliff incentivizes everything above — p-hacking to push results below the threshold, and file-drawering null results.

The Solution

The statistical community's response has been multifaceted:

  1. Pre-registration: Publicly commit to hypotheses and analysis plans before collecting data (prevents p-hacking)
  2. Registered reports: Journals commit to publishing a study based on its design, regardless of results (prevents publication bias)
  3. Effect size reporting: Always report Cohen's d or another effect size alongside p-values
  4. Power analysis: Require sample size justification based on power analysis
  5. Replication incentives: Value and fund replication studies
  6. Open data and code: Share data and analysis code for verification

Connection to Chapter 1: The replication crisis case study asked: "If standard statistical practices could produce such an obviously wrong conclusion [precognition], what else had they gotten wrong?" Now you have the full answer. The problem wasn't the statistics — it was the ecosystem: underpowered studies + publication bias + p-hacking + binary threshold thinking. The tools in this chapter — effect sizes, power analysis, and a nuanced understanding of what p-values can and cannot do — are the antidote.


17.11 Putting It All Together: A Reporting Checklist

Here's what good statistical reporting looks like after this chapter. Instead of just reporting a p-value, you should report:

| Component | What It Tells You | Example (Alex's A/B test) |
|---|---|---|
| Point estimate | The size of the observed effect | Difference = 4.5 minutes |
| Confidence interval | The plausible range of the true effect | 95% CI: (1.0, 8.0) minutes |
| Effect size | Standardized magnitude | Cohen's $d$ = 0.23 (small) |
| $r^2$ or $\eta^2$ | Proportion of variance explained | $r^2$ = 0.013 (1.3%) |
| P-value | Compatibility with $H_0$ | $p = 0.012$ |
| Power | Probability the study could detect this effect | Power $\approx$ 0.76 |
| Sample size | How much data the conclusion rests on | $n_1 = 247$, $n_2 = 253$ |

Notice how much richer this is than "the result was significant ($p = 0.012$)." The full report tells a nuanced story: a statistically significant but small effect, with moderate power, explaining about 1% of the variance.

How to Read a Research Paper After This Chapter

When you encounter a statistical result in a journal article, news story, or business report, ask these questions:

  1. What is the effect size? If only a p-value is reported, be suspicious. How big is the effect?
  2. What is the confidence interval? If the CI is wide, the estimate is imprecise.
  3. What was the sample size? Large samples can make trivial effects significant.
  4. Was the study pre-registered? If not, the results might reflect data exploration rather than hypothesis confirmation.
  5. What was the power? If the study was underpowered and found significance, the effect is probably overestimated.
  6. Is the result practically significant? Would this difference matter to the people involved?

🧠 Threshold Concept: Statistical Significance vs. Practical Significance

This is one of those concepts that, once you truly understand it, permanently changes how you evaluate evidence.

Statistical significance answers: "Is this result unlikely to have occurred by chance alone?" Practical significance answers: "Is this result large enough to matter in the real world?"

These are fundamentally different questions, and they can give different answers:

| | Practically Significant | Not Practically Significant |
|---|---|---|
| Statistically Significant | The ideal — a real and important effect | "Significant but trivial" (large $n$, tiny effect) |
| Not Statistically Significant | "Important but missed" (small $n$, real effect) | Consistent with no meaningful effect |

The upper-left cell is what we hope for. The lower-right cell is fine. But the upper-right and lower-left cells are where mistakes happen — and they happen constantly because most people only look at p-values.

After this chapter, you'll never look at "statistically significant" the same way again. You'll always ask: but how big is the effect?


17.12 Python: Effect Sizes and Power Analysis — Complete Toolkit

Let's put everything together in a comprehensive Python toolkit.

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower, TTestPower
import matplotlib.pyplot as plt

# ============================================================
# EFFECT SIZE CALCULATIONS
# ============================================================

def cohens_d(group1, group2):
    """Calculate Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)

    # Pooled standard deviation
    s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

    d = (np.mean(group1) - np.mean(group2)) / s_pooled
    return d

def r_squared_from_t(t_stat, df):
    """Calculate r² from a t-statistic and degrees of freedom."""
    return t_stat**2 / (t_stat**2 + df)

def cohens_d_to_r2(d):
    """Convert Cohen's d to r²."""
    return d**2 / (d**2 + 4)

# ============================================================
# COMPLETE ANALYSIS: Alex's A/B Test
# ============================================================

np.random.seed(42)

# Simulate Alex's data (same seed as Ch.16)
old_algo = np.random.gamma(shape=5.2, scale=8.13, size=247)
new_algo = np.random.gamma(shape=4.9, scale=9.55, size=253)

# Traditional test
t_stat, p_value = stats.ttest_ind(old_algo, new_algo, equal_var=False)

# Effect sizes
d = cohens_d(new_algo, old_algo)
df_approx = len(old_algo) + len(new_algo) - 2
r2 = r_squared_from_t(t_stat, df_approx)

# Confidence interval for the difference
diff = np.mean(new_algo) - np.mean(old_algo)
se = np.sqrt(np.var(old_algo, ddof=1)/len(old_algo) +
             np.var(new_algo, ddof=1)/len(new_algo))
ci_low = diff - 1.96 * se
ci_high = diff + 1.96 * se

# Power
power_analysis = TTestIndPower()
power = power_analysis.solve_power(
    effect_size=abs(d), nobs1=len(old_algo),
    alpha=0.05, ratio=len(new_algo)/len(old_algo),
    alternative='two-sided'
)

def _interpret_d(d):
    """Interpret Cohen's d using conventional benchmarks."""
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    else:
        return "large"

print("=" * 60)
print("COMPLETE STATISTICAL REPORT: StreamVibe A/B Test")
print("=" * 60)
print(f"\n--- Descriptive Statistics ---")
print(f"  Old algorithm: mean = {np.mean(old_algo):.2f}, SD = {np.std(old_algo, ddof=1):.2f}, n = {len(old_algo)}")
print(f"  New algorithm: mean = {np.mean(new_algo):.2f}, SD = {np.std(new_algo, ddof=1):.2f}, n = {len(new_algo)}")
print(f"\n--- Effect Size ---")
print(f"  Difference: {diff:.2f} minutes")
print(f"  95% CI: ({ci_low:.2f}, {ci_high:.2f}) minutes")
print(f"  Cohen's d: {abs(d):.3f} ({_interpret_d(abs(d))})")
print(f"  r²: {r2:.4f} ({r2*100:.1f}% of variance explained)")
print(f"\n--- Significance ---")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.4f}")
print(f"  Significant at α=0.05? {'Yes' if p_value < 0.05 else 'No'}")
print(f"\n--- Power ---")
print(f"  Achieved power: {power:.3f} ({power*100:.1f}%)")
print(f"  Adequate (≥80%)? {'Yes' if power >= 0.80 else 'No'}")

# ============================================================
# VISUALIZATION: Effect size and overlap
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, d_val, label in zip(axes, [0.2, 0.5, 0.8],
                             ['Small (d=0.2)', 'Medium (d=0.5)', 'Large (d=0.8)']):
    x = np.linspace(-4, 4 + d_val, 300)
    y1 = stats.norm.pdf(x, 0, 1)
    y2 = stats.norm.pdf(x, d_val, 1)

    ax.fill_between(x, y1, alpha=0.3, color='steelblue', label='Group 1')
    ax.fill_between(x, y2, alpha=0.3, color='coral', label='Group 2')
    ax.plot(x, y1, color='steelblue', linewidth=2)
    ax.plot(x, y2, color='coral', linewidth=2)
    ax.set_title(label, fontsize=13)
    ax.set_ylabel('Density')
    ax.legend(fontsize=9)
    ax.set_xlim(-4, 5)

plt.suptitle('Visualizing Effect Sizes: How Cohen\'s d Separates Groups',
             fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig('effect_size_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

17.13 James's Analysis: The Effect Size of Algorithmic vs. Human Bail Decisions

Let's return to Professor Washington's criminal justice research with the tools from this chapter.

In Chapter 16, James compared recidivism rates:

  • Algorithm group: 89/412 = 21.6% re-arrested
  • Judge group: 107/388 = 27.6% re-arrested
  • Overall result: $z = -1.97$, $p = 0.049$

The p-value was barely significant. But what does the effect size tell us?

import numpy as np
from scipy import stats

# ============================================================
# James's bail study — effect size analysis
# ============================================================

# Overall comparison
p_alg, n_alg = 89/412, 412
p_judge, n_judge = 107/388, 388

# Cohen's h for proportions
h = 2 * np.arcsin(np.sqrt(p_alg)) - 2 * np.arcsin(np.sqrt(p_judge))
print(f"Overall comparison:")
print(f"  Difference: {p_alg - p_judge:.3f} ({(p_alg - p_judge)*100:.1f} percentage points)")
print(f"  Cohen's h: {abs(h):.3f} (small effect)")

# Confidence interval for the difference
diff = p_alg - p_judge
se = np.sqrt(p_alg*(1-p_alg)/n_alg + p_judge*(1-p_judge)/n_judge)
ci_low = diff - 1.96 * se
ci_high = diff + 1.96 * se
print(f"  95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"  Range: {ci_low*100:.1f} to {ci_high*100:.1f} percentage points")

# The racial disparity — THIS is where the effect size matters most
print(f"\n{'='*55}")
print("Racial Disparity in False Positive Rates")
print(f"{'='*55}")

# False positive rates (from Ch.16 case study 2)
fp_white, n_white = 0.133, 225   # 13.3% false positive rate for white defendants
fp_black, n_black = 0.312, 275   # 31.2% false positive rate for Black defendants

h_racial = 2 * np.arcsin(np.sqrt(fp_black)) - 2 * np.arcsin(np.sqrt(fp_white))
diff_racial = fp_black - fp_white

print(f"  White defendants FP rate: {fp_white*100:.1f}%")
print(f"  Black defendants FP rate: {fp_black*100:.1f}%")
print(f"  Difference: {diff_racial*100:.1f} percentage points")
print(f"  Cohen's h: {h_racial:.3f} (medium-to-large effect)")

# This effect size tells a much more important story than the p-value
print(f"\n--- Interpretation ---")
print(f"  The overall algorithm-vs-judge comparison shows a SMALL effect (h={abs(h):.2f})")
print(f"  The racial disparity shows a MEDIUM-LARGE effect (h={h_racial:.2f})")
print(f"  Black defendants are {fp_black/fp_white:.1f}x more likely to be")
print(f"  falsely flagged as high-risk than white defendants")

The effect sizes reveal what the p-values obscure:

  • The overall comparison (algorithm vs. judge) shows a small effect ($h \approx 0.14$). The algorithm is slightly better overall, but the advantage is modest.
  • The racial disparity in false positive rates shows a medium-to-large effect ($h \approx 0.43$). Black defendants are 2.3 times more likely to be falsely flagged as high-risk.

The racial disparity is both statistically significant (from Chapter 16: $z = -4.67$, $p < 0.001$) and practically significant ($h = 0.43$, a medium effect). This is the upper-left cell of our significance matrix — a real and important effect.

Connection to Theme 2: Human Stories Behind the Data

When we say the false positive disparity has a "medium effect size," we're talking about real people. In James's dataset, the disparity translates to roughly 38 excess false positives among Black defendants — 38 people who were incorrectly flagged as high-risk and potentially denied bail or subjected to more restrictive conditions. The effect size gives this injustice a number. The confidence interval gives it a range. And the power analysis tells us whether we had enough data to detect it reliably.


17.14 Common Mistakes and How to Avoid Them

| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Equating "significant" with "important" | A tiny, meaningless effect can be significant with enough data | Always report effect sizes alongside p-values |
| Equating "not significant" with "no effect" | A large, important effect can be non-significant with too little data | Check the power; report the confidence interval |
| Using Cohen's benchmarks as rigid rules | What's "small" in one field may be "large" in another | Interpret effect sizes in the context of your field |
| Conducting post-hoc power analysis | Computing the power of a completed study using the observed effect size is circular and misleading | Use the expected/hypothesized effect size, or focus on the confidence interval width |
| Ignoring the confidence interval | The CI tells you the effect size and the uncertainty — far more informative than $p$ or $d$ alone | Always report and interpret the CI |
| P-hacking without realizing it | Exploring data flexibly and reporting only the significant finding inflates the false positive rate | Pre-register hypotheses; distinguish exploratory from confirmatory analysis |
| Treating $p = 0.049$ and $p = 0.051$ as fundamentally different | Both represent similar levels of evidence; the 0.05 cutoff is arbitrary | Report the exact p-value and let the reader assess the evidence |

17.15 Chapter Summary

This chapter challenged everything you thought you knew about statistical significance. Here's what we learned:

  1. Statistical significance is not the same as practical significance. A result can be significant but trivial (large $n$, tiny effect) or important but non-significant (small $n$, real effect). The p-value conflates effect size and sample size.

  2. Effect sizes measure what matters. Cohen's d tells you how many standard deviations apart two groups are (small $\approx 0.2$, medium $\approx 0.5$, large $\approx 0.8$). $r^2$ tells you what proportion of variance is explained. Both are independent of sample size.

  3. Statistical power is the probability of detecting a real effect. Power depends on $\alpha$, $n$, effect size, and variability. The conventional minimum is 80%. Many published studies are woefully underpowered.

  4. Power analysis is essential for study design. Before collecting data, use power analysis to determine how many observations you need. After, use it to evaluate whether your conclusions are credible.

  5. Publication bias and p-hacking distort the scientific literature. When only significant results are published and researchers explore data flexibly, the published literature overestimates effect sizes and underestimates uncertainty. Pre-registration, effect size reporting, and open science practices are the antidote.

  6. Confidence intervals are the best single tool for inference. They tell you the effect size, the direction, the precision, and (through their relationship to hypothesis tests) the significance — all in one package.

  7. Always interpret results in context. A "small" effect by generic benchmarks might be transformative in the right domain. A "large" effect might be meaningless if it doesn't address the right question. Numbers without context are just noise.

What's Next: Chapter 18 will introduce a revolutionary approach to inference that doesn't require any of the assumptions we've been making — no normality, no known distributions, no formulas at all. The bootstrap method generates sampling distributions through simulation, providing confidence intervals and hypothesis tests for virtually any statistic. It's the computational approach to inference, and it connects directly to the philosophy of this chapter: letting the data speak for itself.


"The best thing about being a statistician is that you get to play in everyone's backyard." — John Tukey