Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)

"The Earth is round (p < 0.05)." — Title of a famous paper by Jacob Cohen (1994)


Chapter Overview

I need to be honest with you: this is the chapter I've been both looking forward to and dreading.

Looking forward to it because hypothesis testing is one of the most powerful ideas in all of science. It gives you a rigorous framework for deciding whether the patterns you see in data are real or just noise. Every clinical trial that determines whether a drug works, every A/B test that decides which website design converts better, every scientific paper that claims a discovery — they all rest on hypothesis testing.

Dreading it because hypothesis testing is also the single most misunderstood topic in statistics. Study after study has shown that students, researchers, and even textbook authors routinely get the interpretation of p-values wrong. In 2016, the American Statistical Association took the extraordinary step of issuing a formal statement on what p-values do and don't mean, because the confusion had become so widespread and so damaging.

So here's my promise: we're going to take this slowly. We're going to build intuition through simulation before we touch any formulas. We're going to be painfully explicit about what things mean and don't mean. And by the end, you'll have a genuinely correct understanding of hypothesis testing — one that will serve you better than the garbled version that unfortunately circulates in much of the scientific world.

Let's begin.

In this chapter, you will learn to:

  1. State null and alternative hypotheses for a given research question and explain the logic of "proof by contradiction" (all paths)
  2. Simulate a null distribution in Python and use it to compute a p-value from first principles (all paths)
  3. Interpret a p-value correctly, avoiding the five most common misconceptions (all paths)
  4. Perform a two-sample t-test and a chi-square test of independence using scipy.stats (standard + deep dive paths)
  5. Distinguish between statistical significance and practical significance using effect sizes (all paths)
  6. Identify Type I and Type II errors and explain how they relate to statistical power (standard + deep dive paths)
  7. Apply hypothesis testing to test whether vaccination rates differ between income groups (all paths)

23.1 The Logic of Hypothesis Testing: Proof by Contradiction

Before any code, before any formulas, let's understand the logic of hypothesis testing through an analogy.

The Courtroom Analogy

In a criminal trial in many legal systems, the defendant is presumed innocent until proven guilty. The prosecution must present evidence strong enough to overcome that presumption. The jury doesn't prove the defendant is guilty with certainty — they ask whether the evidence is strong enough that the "innocent" explanation is no longer reasonable.

Hypothesis testing works the same way:

  1. Start by assuming nothing interesting is happening. This is the null hypothesis (H₀) — the equivalent of "innocent until proven guilty." Maybe the drug doesn't work. Maybe the two groups are the same. Maybe the pattern is just random noise.

  2. Look at the evidence (data). Compute a summary of the data that measures how far the observed results are from what you'd expect if the null hypothesis were true.

  3. Ask: "If nothing interesting is really happening, how surprising is this evidence?" This is the p-value — the probability of seeing results as extreme as (or more extreme than) what you observed, assuming the null hypothesis is true.

  4. Make a decision. If the evidence would be very surprising under the null hypothesis (small p-value), reject the null and conclude that something interesting probably is happening. If the evidence isn't that surprising (large p-value), fail to reject the null — the data is consistent with "nothing happening."

Let's make this concrete.

A Concrete Example: The Suspicious Coin

Imagine someone hands you a coin and claims it's fair — equal probability of heads and tails. You're skeptical. You flip it 100 times and get 63 heads.

Is the coin unfair? Or could you get 63 heads just by chance with a fair coin?

Let's think this through:

  • Null hypothesis (H₀): The coin is fair. P(heads) = 0.5.
  • Alternative hypothesis (H₁): The coin is not fair. P(heads) ≠ 0.5.
  • Data: 63 heads out of 100 flips.
  • Question: If the coin really is fair, how likely is it to get 63 or more heads (or 37 or fewer, since we're testing "not fair" in either direction)?

We can answer this with simulation:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Simulate 100,000 experiments of flipping a FAIR coin 100 times
n_simulations = 100000
n_flips = 100
fair_coin_results = np.random.binomial(n_flips, 0.5, n_simulations)

# How often do we get 63 or more heads (or 37 or fewer)?
observed = 63
extreme_count = np.sum((fair_coin_results >= observed) |
                        (fair_coin_results <= n_flips - observed))
p_value = extreme_count / n_simulations

print(f"Observed: {observed} heads out of {n_flips}")
print(f"Simulated p-value: {p_value:.4f}")
print(f"Interpretation: If the coin is fair, we'd see results this extreme")
print(f"  about {p_value*100:.1f}% of the time.")
Observed: 63 heads out of 100
Simulated p-value: 0.0120
Interpretation: If the coin is fair, we'd see results this extreme
  about 1.2% of the time.

Let's visualize what the simulation is doing:

fig, ax = plt.subplots(figsize=(12, 5))

# Histogram of fair coin results
counts, bins, patches = ax.hist(fair_coin_results, bins=range(25, 76),
                                 color='steelblue', alpha=0.7, edgecolor='white',
                                 density=True)

# Color the extreme regions
for patch, left_edge in zip(patches, bins[:-1]):
    if left_edge >= observed or left_edge <= n_flips - observed:
        patch.set_facecolor('#e74c3c')
        patch.set_alpha(0.8)

ax.axvline(observed, color='red', linewidth=2, linestyle='--',
           label=f'Observed: {observed} heads')
ax.axvline(n_flips - observed, color='red', linewidth=2, linestyle='--',
           label=f'Mirror: {n_flips - observed} heads')
ax.axvline(50, color='black', linewidth=1, linestyle=':',
           label='Expected if fair (50)')

ax.set_xlabel('Number of Heads in 100 Flips', fontsize=12)
ax.set_ylabel('Proportion', fontsize=12)
ax.set_title('Null Distribution: What a Fair Coin Produces\n'
             f'Red = as extreme or more extreme than observed (p = {p_value:.3f})',
             fontsize=13)
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('null_distribution_coin.png', dpi=150, bbox_inches='tight')
plt.show()

The blue histogram shows what 100,000 fair coins produce. The red regions show results as extreme as or more extreme than 63 heads (or 37 heads on the other side). The red area is the p-value: about 1.2% of the total area.

Is 1.2% "surprising enough" to reject the null hypothesis? That depends on your threshold — which brings us to the significance level.


23.2 The Significance Level: Drawing a Line in the Sand

The significance level (denoted α, "alpha") is the threshold below which you'll reject the null hypothesis. By convention, the most common choice is α = 0.05, meaning you'll reject the null if the p-value is less than 0.05.

For our coin example: p = 0.012 < 0.05, so we reject the null hypothesis. We have statistically significant evidence that the coin is not fair.

But why 0.05? Why not 0.01 or 0.10?

The honest answer: 0.05 is a convention, not a law of nature. It was popularized by Ronald Fisher in the 1920s and has stuck largely through inertia. Fisher himself didn't intend it to be a rigid cutoff — he suggested it as a rough guide, not a sacred boundary.

The choice of α reflects a trade-off:

  • Smaller α (like 0.01): Harder to reject the null. You need stronger evidence. Fewer false alarms, but you might miss real effects.
  • Larger α (like 0.10): Easier to reject the null. You need less evidence. More likely to catch real effects, but also more false alarms.

We'll formalize this trade-off in Section 23.6 when we discuss Type I and Type II errors.
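To see what the threshold does in practice, here's a minimal sketch (reusing the simulated p-value of 0.012 from the coin example) of how the same evidence leads to different decisions at different values of α:

```python
# Same evidence, different thresholds: the decision depends on alpha.
# (p_value is the simulated value from the coin example above.)
p_value = 0.012

for alpha in [0.01, 0.05, 0.10]:
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
```

At α = 0.01 the 63-heads result is not enough to reject; at α = 0.05 or 0.10 it is. The evidence never changed; only the line in the sand did.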

Important

The significance level must be chosen before you look at the data. Choosing α after seeing the p-value — "My p-value is 0.03, so I'll set α at 0.05" or "My p-value is 0.06, so I'll set α at 0.10" — is cheating. It's called p-hacking, and it undermines the entire framework.


23.3 What P-Values Actually Are (and the Five Things They're Not)

Now for the most important section of this chapter. Please read this carefully.

What a P-Value IS

Definition: The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one computed from the data, assuming the null hypothesis is true.

Let me unpack every part of that definition:

  • "the probability of observing..." — It's about the data, not about the hypothesis. The p-value is a property of the data given the null, not a property of the null given the data.

  • "as extreme as or more extreme than" — We're not asking "what's the probability of getting exactly 63 heads?" We're asking "what's the probability of getting 63 or more (or 37 or fewer)?" This is because 64, 65, 66... heads would be even more evidence against the null.

  • "assuming the null hypothesis is true" — This is the crucial caveat. The p-value is computed in a hypothetical world where the null is true. It doesn't tell you the probability that the null is true.

What a P-Value Is NOT

Here are the five most common misconceptions. Each one is wrong, each one is tempting, and each one leads to bad decisions.

Misconception 1: "The p-value is the probability that the null hypothesis is true."

No. The p-value is P(data this extreme | H₀ true), not P(H₀ true | data). These are different things, just as P(wet ground | it rained) is different from P(it rained | wet ground). The wet ground could be from a sprinkler.
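To make the distinction concrete, here's a toy Bayes calculation. The 0.03, 0.60, and 0.5 below are made-up numbers for illustration, not estimates from any real study:

```python
# Toy numbers (assumptions for illustration only):
p_data_given_h0 = 0.03   # probability of data this extreme IF the null is true
p_data_given_h1 = 0.60   # probability of data this extreme IF the effect is real
prior_h0 = 0.5           # prior probability that the null is true

# Bayes' rule: P(H0 | data) = P(data | H0) * P(H0) / P(data)
p_data = p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0)
p_h0_given_data = p_data_given_h0 * prior_h0 / p_data

print(f"P(data | H0) = {p_data_given_h0:.3f}   (the 'p-value-like' quantity)")
print(f"P(H0 | data) = {p_h0_given_data:.3f}   (what Misconception 1 claims it is)")
```

With these numbers, P(H₀ | data) ≈ 0.048, not 0.03, and with a different prior it could come out almost anywhere. The two conditional probabilities are simply different quantities.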

Misconception 2: "A p-value of 0.03 means there's a 3% chance the result is due to chance."

Close but wrong. It means: "If the null is true (i.e., if the result really is due to chance), there's a 3% probability of seeing data this extreme." The 3% is the probability of the data, not the probability of the explanation.

Misconception 3: "A small p-value proves the alternative hypothesis is true."

No. A small p-value says the data is unlikely under the null hypothesis. But the data might be even more unlikely under some alternative hypotheses. The p-value doesn't compare the null to the alternative — it only evaluates the null.

Misconception 4: "A large p-value (like 0.30) proves the null hypothesis is true."

No. Failing to reject the null is not the same as proving the null is true. "Absence of evidence is not evidence of absence." A large p-value could mean the null is true, OR it could mean the effect exists but your sample was too small to detect it.

Misconception 5: "A p-value of 0.049 is meaningfully different from a p-value of 0.051."

No. The 0.05 threshold is a convention, not a cliff. Treating p = 0.049 as "significant" and p = 0.051 as "not significant" creates a false dichotomy. In reality, these two results convey almost identical evidence against the null.

Let's drive this home with a simulation:

# Demonstrating that p-values are UNIFORMLY distributed under the null
np.random.seed(42)

# Simulate 10,000 experiments where the null IS true
# (Both groups drawn from the same population)
p_values_null_true = []

for _ in range(10000):
    group_a = np.random.normal(50, 10, 30)  # Same population
    group_b = np.random.normal(50, 10, 30)  # Same population
    _, p = stats.ttest_ind(group_a, group_b)
    p_values_null_true.append(p)

p_values_null_true = np.array(p_values_null_true)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(p_values_null_true, bins=50, color='steelblue', alpha=0.7,
        edgecolor='white', density=True)
ax.axhline(y=1, color='red', linewidth=2, linestyle='--',
           label='Uniform distribution (expected)')
ax.axvline(x=0.05, color='orange', linewidth=2, linestyle='-',
           label=f'α = 0.05 ({(p_values_null_true < 0.05).mean():.1%} fall below)')
ax.set_xlabel('P-value', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('P-values When the Null is TRUE\n(They should be uniformly distributed)',
             fontsize=13)
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('p_values_under_null.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Proportion of p-values < 0.05: {(p_values_null_true < 0.05).mean():.3f}")
print(f"(Should be close to 0.05)")

When the null is true, p-values are uniformly distributed between 0 and 1. About 5% fall below 0.05 by pure chance. This is the false positive rate — and it's exactly what the significance level controls.


23.4 The Mechanics: Test Statistics and Common Tests

Now that you understand the logic, let's look at the machinery.

The Test Statistic

A test statistic is a single number that summarizes how far your data is from what the null hypothesis predicts. Different questions call for different test statistics.

The general pattern:

$$\text{test statistic} = \frac{\text{observed difference} - \text{expected difference under H₀}}{\text{standard error of the difference}}$$

This is a signal-to-noise ratio: how big is the effect (signal) relative to the random variability (noise)?
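To make the signal-to-noise idea concrete, here's a sketch that computes the ratio by hand for two small samples (the numbers are invented for illustration, and the denominator is a Welch-style standard error):

```python
import numpy as np

# Two small made-up samples
x1 = np.array([5.1, 4.8, 5.4, 5.0, 4.9])
x2 = np.array([4.2, 4.5, 4.1, 4.4, 4.3])

signal = x1.mean() - x2.mean()          # observed difference; H0 expects 0
noise = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t = (signal - 0) / noise                # test statistic: signal relative to noise
print(f"difference = {signal:.2f}, SE = {noise:.3f}, t = {t:.2f}")
```

Here the observed difference sits nearly six standard errors from zero, which a t table would flag as highly unlikely under H₀.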

The Two-Sample T-Test: Comparing Two Group Means

The most common hypothesis test in data science is comparing the means of two groups. "Do high-income countries have higher vaccination rates than low-income countries?" This is a two-sample t-test.

Let's build it from scratch before using scipy:

np.random.seed(42)

# Our data: vaccination rates for two income groups
high_income = np.array([88, 92, 85, 90, 87, 93, 91, 86, 89, 94,
                        88, 90, 87, 92, 85, 91, 93, 89, 86, 90,
                        88, 92, 87, 91, 85, 90, 89, 93, 88, 86])

low_income = np.array([52, 45, 58, 48, 55, 42, 60, 50, 47, 53,
                       56, 44, 51, 49, 57, 46, 54, 43, 59, 48,
                       52, 50, 55, 47, 53, 58, 45, 51, 49, 56])

print("High-income countries:")
print(f"  n = {len(high_income)}, mean = {high_income.mean():.1f}%, "
      f"std = {high_income.std(ddof=1):.1f}%")
print(f"\nLow-income countries:")
print(f"  n = {len(low_income)}, mean = {low_income.mean():.1f}%, "
      f"std = {low_income.std(ddof=1):.1f}%")
print(f"\nObserved difference: {high_income.mean() - low_income.mean():.1f} "
      f"percentage points")
High-income countries:
  n = 30, mean = 89.2%, std = 2.7%

Low-income countries:
  n = 30, mean = 51.1%, std = 5.0%

Observed difference: 38.1 percentage points

The difference looks large. But is it "statistically significant"? Could it be due to chance?

Step 1: State the hypotheses

  • H₀: μ_high = μ_low (no difference in population means)
  • H₁: μ_high ≠ μ_low (there is a difference)

Step 2: Simulate the null distribution

If there's truly no difference between the groups, we can simulate what "no difference" looks like by randomly shuffling the labels:

# Permutation test: simulate the null by shuffling labels
all_data = np.concatenate([high_income, low_income])
n_high = len(high_income)
observed_diff = high_income.mean() - low_income.mean()

n_permutations = 100000
perm_diffs = []

for _ in range(n_permutations):
    shuffled = np.random.permutation(all_data)
    perm_high = shuffled[:n_high]
    perm_low = shuffled[n_high:]
    perm_diffs.append(perm_high.mean() - perm_low.mean())

perm_diffs = np.array(perm_diffs)

# P-value: proportion of permuted differences as extreme as observed
p_value_perm = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.1f}")
print(f"Permutation p-value: {p_value_perm:.6f}")
Observed difference: 38.1
Permutation p-value: 0.000000

Not a single one of the 100,000 random shuffles produced a difference as large as the one we observed, so the estimated p-value is zero. Strictly speaking, all we can say is p < 1/100,000: a permutation test can never resolve probabilities smaller than one over the number of permutations.

Step 3: Use scipy for the standard t-test

# Two-sample t-test using scipy
t_stat, p_value = stats.ttest_ind(high_income, low_income)
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.2e}")
print(f"\nConclusion: With p < 0.001, we reject the null hypothesis.")
print(f"There is strong evidence that vaccination rates differ between")
print(f"high-income and low-income countries.")
t-statistic: 36.14
p-value: 5.03e-41

Conclusion: With p < 0.001, we reject the null hypothesis.
There is strong evidence that vaccination rates differ between
high-income and low-income countries.

The t-statistic of 36 is enormous — the observed difference is 36 standard errors away from zero. The p-value is absurdly small (5 × 10⁻⁴¹). This is about as clear-cut as it gets.

Let's visualize the permutation null distribution:

fig, ax = plt.subplots(figsize=(12, 5))
ax.hist(perm_diffs, bins=80, color='steelblue', alpha=0.7,
        edgecolor='white', density=True)
ax.axvline(observed_diff, color='red', linewidth=2, linestyle='--',
           label=f'Observed difference: {observed_diff:.1f}')
ax.axvline(-observed_diff, color='red', linewidth=2, linestyle='--',
           label=f'Mirror: {-observed_diff:.1f}')

ax.set_xlabel('Difference in Means (permuted)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Null Distribution from Permutation Test\n'
             '(The observed difference is completely off the chart)',
             fontsize=13)
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('permutation_test_vaccination.png', dpi=150, bbox_inches='tight')
plt.show()

The Chi-Square Test: Comparing Proportions

Sometimes your data is categorical rather than numerical. Instead of comparing means, you're comparing proportions. "Is there an association between income level and whether a country reaches the 90% vaccination target?"

The chi-square test of independence handles this:

import pandas as pd

# Create a contingency table
# Country income level vs. whether they meet 90% vaccination target
contingency_data = pd.DataFrame({
    'Income Level': ['High']*50 + ['Middle']*60 + ['Low']*40,
    'Meets Target': (['Yes']*38 + ['No']*12 +   # 76% of high-income
                     ['Yes']*30 + ['No']*30 +   # 50% of middle-income
                     ['Yes']*8  + ['No']*32)    # 20% of low-income
})

# Create contingency table
contingency_table = pd.crosstab(contingency_data['Income Level'],
                                 contingency_data['Meets Target'])
print("Contingency Table:")
print(contingency_table)
print()

# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.2e}")
print(f"\nExpected frequencies (if no association):")
print(pd.DataFrame(expected,
                    index=contingency_table.index,
                    columns=contingency_table.columns).round(1))
Contingency Table:
Meets Target  No  Yes
Income Level
High          12   38
Low           32    8
Middle        30   30

Chi-square statistic: 27.90
Degrees of freedom: 2
P-value: 8.75e-07

Expected frequencies (if no association):
Meets Target    No   Yes
Income Level
High          24.7  25.3
Low           19.7  20.3
Middle        29.6  30.4

The p-value (about 9 × 10⁻⁷) is extremely small: if income level and meeting the target were truly independent, a table this uneven would almost never arise by chance. But notice: the chi-square test tells you that an association exists, not how strong it is. For that, you need measures of association like Cramér's V:

# Cramér's V: effect size for chi-square
n = contingency_table.sum().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f"Cramér's V: {cramers_v:.3f}")
print(f"Interpretation: {'Small' if cramers_v < 0.3 else 'Medium' if cramers_v < 0.5 else 'Large'} effect size")

23.5 Statistical Significance vs. Practical Significance

Here's where many people go wrong, and where we need to have an honest conversation.

Statistical significance means: the result is unlikely to have occurred by chance.

Practical significance means: the result is large enough to matter in the real world.

These are NOT the same thing. And confusing them is one of the most common errors in data analysis.

The Giant Sample Trap

With a large enough sample, any difference — no matter how tiny — becomes statistically significant.

np.random.seed(42)

# Two groups with a TINY real difference (0.5 percentage points)
group_a = np.random.normal(70.0, 15, 100000)  # Mean = 70.0
group_b = np.random.normal(70.5, 15, 100000)  # Mean = 70.5

t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_b.mean() - group_a.mean()

print(f"Group A mean: {group_a.mean():.2f}%")
print(f"Group B mean: {group_b.mean():.2f}%")
print(f"Difference: {diff:.2f} percentage points")
print(f"P-value: {p_value:.2e}")
print(f"Statistically significant? {'Yes' if p_value < 0.05 else 'No'}")
print(f"\nBut is a {diff:.2f}-point difference practically meaningful?")
print(f"Probably not. It's less than 1% of the mean.")
Group A mean: 70.01%
Group B mean: 70.54%
Difference: 0.53 percentage points
P-value: 2.30e-09
Statistically significant? Yes

But is a 0.53-point difference practically meaningful?
Probably not. It's less than 1% of the mean.

The p-value is tiny (2.3 × 10⁻⁹). "Highly statistically significant." But the actual difference is half a percentage point — practically meaningless for most purposes.

This is why you should always report effect sizes alongside p-values.

Effect Size: How Big Is the Difference?

Cohen's d is the most common effect size for comparing two means:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$

It measures the difference in units of standard deviations. Conventional benchmarks:

  • Small: d ≈ 0.2
  • Medium: d ≈ 0.5
  • Large: d ≈ 0.8

def cohens_d(group1, group2):
    """Compute Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std

# The tiny-difference example
d_tiny = cohens_d(group_b, group_a)
print(f"Tiny difference example:")
print(f"  Cohen's d = {d_tiny:.3f} (trivially small)")

# The vaccination rate example from earlier
d_vax = cohens_d(high_income, low_income)
print(f"\nVaccination rate example:")
print(f"  Cohen's d = {d_vax:.3f} (enormous)")
Tiny difference example:
  Cohen's d = 0.033 (trivially small)

Vaccination rate example:
  Cohen's d = 9.330 (enormous)

The vaccination rate difference (d = 9.3) isn't just statistically significant — it's massively practically significant. The income groups differ by over 9 standard deviations. That's a real, important, policy-relevant difference.

The tiny-difference example (d = 0.03) is statistically significant but practically trivial. No one should change a policy based on a 0.03 standard-deviation difference.

Rule of thumb: Always report three things: (1) the p-value (is the effect statistically distinguishable from zero?), (2) the effect size (how large is the effect?), and (3) the confidence interval for the effect (what's the plausible range for the true effect?).
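Putting the rule of thumb into practice, here's one possible helper that returns all three numbers for two independent samples. The function `report` is a hypothetical convenience wrapper (not a scipy API), and the data below is synthetic:

```python
import numpy as np
from scipy import stats

def report(group1, group2, alpha=0.05):
    """Hypothetical helper: p-value, Cohen's d, and a CI for the mean difference."""
    n1, n2 = len(group1), len(group2)
    _, p = stats.ttest_ind(group1, group2)

    # Cohen's d using the pooled standard deviation
    pooled = np.sqrt(((n1 - 1) * group1.var(ddof=1) +
                      (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))
    d = (group1.mean() - group2.mean()) / pooled

    # (1 - alpha) CI for the difference in means (Welch-style standard error)
    diff = group1.mean() - group2.mean()
    se = np.sqrt(group1.var(ddof=1) / n1 + group2.var(ddof=1) / n2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)
    return p, d, (diff - t_crit * se, diff + t_crit * se)

rng = np.random.default_rng(0)
a = rng.normal(75, 10, 40)   # synthetic group with a real 5-point advantage
b = rng.normal(70, 10, 40)
p, d, (lo, hi) = report(a, b)
print(f"p = {p:.4f}, d = {d:.2f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```

One function call gives the reader everything needed to judge both statistical and practical significance at once.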


23.6 Type I Errors, Type II Errors, and Power

Every time you make a decision based on a hypothesis test, you might be wrong. There are exactly two ways to be wrong:

Type I Error (False Positive)

You reject the null hypothesis when it's actually true. You conclude the drug works when it doesn't. You declare two groups different when they're the same.

The probability of a Type I error equals the significance level α. If you use α = 0.05, you'll make a Type I error 5% of the time when the null is true.

Type II Error (False Negative)

You fail to reject the null hypothesis when it's actually false. You conclude the drug doesn't work when it does. You declare two groups the same when they're actually different.

The probability of a Type II error is denoted β. It depends on:

  • The significance level α (lower α → higher β)
  • The sample size n (larger n → lower β)
  • The true effect size (larger effects are easier to detect → lower β)

Power: The Ability to Detect Real Effects

Statistical power = 1 - β = the probability of correctly rejecting a false null hypothesis.

In other words, power is the probability of detecting an effect when one truly exists.

# Simulate power: what fraction of the time do we detect a real difference?
np.random.seed(42)

true_difference = 5  # True difference between groups (5 percentage points)
n_per_group = 30
population_std = 15
n_simulations = 10000

p_values = []
for _ in range(n_simulations):
    # Groups with a REAL difference
    group1 = np.random.normal(70, population_std, n_per_group)
    group2 = np.random.normal(70 + true_difference, population_std, n_per_group)
    _, p = stats.ttest_ind(group1, group2)
    p_values.append(p)

p_values = np.array(p_values)
power = (p_values < 0.05).mean()

print(f"True difference: {true_difference} percentage points")
print(f"Sample size per group: {n_per_group}")
print(f"Power: {power:.3f} ({power*100:.1f}%)")
print(f"\nInterpretation: With n=30 per group and a {true_difference}-point")
print(f"true difference, we detect it {power*100:.1f}% of the time.")
print(f"We miss it {(1-power)*100:.1f}% of the time (Type II error).")
True difference: 5 percentage points
Sample size per group: 30
Power: 0.239 (23.9%)

Interpretation: With n=30 per group and a 5-point
true difference, we detect it 23.9% of the time.
We miss it 76.1% of the time (Type II error).

That's troubling. With 30 countries per group, we detect a 5-percentage-point difference less than a quarter of the time! Let's see how power changes with sample size:

sample_sizes = [10, 20, 30, 50, 75, 100, 150, 200]
powers = []

for n in sample_sizes:
    sig_count = 0
    for _ in range(5000):
        g1 = np.random.normal(70, 15, n)
        g2 = np.random.normal(75, 15, n)
        _, p = stats.ttest_ind(g1, g2)
        if p < 0.05:
            sig_count += 1
    powers.append(sig_count / 5000)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(sample_sizes, powers, 'o-', color='steelblue', linewidth=2, markersize=8)
ax.axhline(y=0.80, color='red', linewidth=1, linestyle='--',
           label='80% power (conventional target)')
ax.set_xlabel('Sample Size per Group', fontsize=12)
ax.set_ylabel('Power', fontsize=12)
ax.set_title('Power vs. Sample Size\n(True difference = 5 points, σ = 15)',
             fontsize=13)
ax.legend(fontsize=11)
ax.set_ylim(0, 1)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('power_curve.png', dpi=150, bbox_inches='tight')
plt.show()

The conventional target for power is 80%, meaning you want at least an 80% chance of detecting the effect if it exists. For our example (d = 5/15 ≈ 0.33), that requires roughly 140-150 observations per group.
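We can cross-check the simulation analytically. The statsmodels power module (we use statsmodels again in Section 23.8, so it's assumed to be installed) can solve for the required sample size directly:

```python
from statsmodels.stats.power import TTestIndPower

# Our example: true difference of 5 points, sigma = 15 -> d = 1/3
d = 5 / 15
n_needed = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"Sample size per group needed for 80% power: {n_needed:.0f}")

# And the power our n = 30 simulation should approach:
power_30 = TTestIndPower().power(effect_size=d, nobs1=30, alpha=0.05)
print(f"Analytic power at n = 30 per group: {power_30:.3f}")
```

The analytic answer (about 142 per group for these numbers) pins down where the simulated curve crosses 0.80, and the analytic power at n = 30 matches what the simulation estimates.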

The Decision Matrix

                       H₀ is TRUE           H₀ is FALSE
Reject H₀              Type I Error (α)     Correct! (Power = 1-β)
Fail to reject H₀      Correct! (1-α)       Type II Error (β)

Think of it as a signal detection problem:

  • Type I error = false alarm (you detected something that isn't there)
  • Type II error = missed detection (you failed to notice something that is there)


23.7 One-Tailed vs. Two-Tailed Tests

In our coin example, we tested whether the coin was "not fair" — it could be biased toward heads OR toward tails. This is a two-tailed test (H₁: p ≠ 0.5).

Sometimes you have a directional hypothesis: "I believe this drug lowers blood pressure" (not just "changes" blood pressure). This is a one-tailed test (H₁: μ_drug < μ_placebo).

# Two-tailed vs. one-tailed example
np.random.seed(42)

# Vaccination rates: does Region A have HIGHER rates than Region B?
region_a = np.random.normal(75, 12, 40)
region_b = np.random.normal(70, 12, 40)

# Two-tailed: is there ANY difference?
t_stat, p_two = stats.ttest_ind(region_a, region_b)
print(f"Two-tailed p-value: {p_two:.4f}")

# One-tailed: is Region A HIGHER than Region B?
# (half the two-tailed p-value, if the observed difference is in the right direction)
if region_a.mean() > region_b.mean():
    p_one = p_two / 2
else:
    p_one = 1 - p_two / 2

print(f"One-tailed p-value: {p_one:.4f}")
print(f"\nObserved: Region A mean = {region_a.mean():.1f}, "
      f"Region B mean = {region_b.mean():.1f}")

When to use each:

  • Two-tailed: Default choice. Use when you don't have a strong directional prediction, or when an effect in either direction would be interesting.
  • One-tailed: Use only when you have a clear directional hypothesis stated before looking at the data, and an effect in the other direction would be scientifically meaningless.

Most of the time, use two-tailed. When in doubt, use two-tailed.


23.8 The Multiple Testing Problem

Here's a scenario that should make you uncomfortable.

You test 20 different hypotheses, all of which are null (there's no real effect for any of them). You use α = 0.05 for each test. How many will come back "significant"?

np.random.seed(42)

n_tests = 20
significant_count = 0
results = []

for i in range(n_tests):
    # Both groups drawn from the SAME population (null is true)
    group1 = np.random.normal(70, 15, 50)
    group2 = np.random.normal(70, 15, 50)
    t, p = stats.ttest_ind(group1, group2)
    sig = p < 0.05
    if sig:
        significant_count += 1
    results.append({'Test': i+1, 'p-value': p, 'Significant': sig})

results_df = pd.DataFrame(results)
print(f"Of {n_tests} tests where the null is TRUE:")
print(f"  {significant_count} came back 'significant' (p < 0.05)")
print(f"\nThe 'significant' results are FALSE POSITIVES.\n")
print(results_df[results_df['Significant']].to_string(index=False))

On average, 1 out of 20 tests (5%) will be "significant" by pure chance. If you run enough tests, you're guaranteed to find something "significant" — even when nothing is real.

This is the multiple testing problem, and it's one of the driving forces behind the replication crisis in science (more on this in Case Study 2).

Corrections for Multiple Testing

Bonferroni correction: The simplest fix. Instead of using α = 0.05 for each test, use α = 0.05/k, where k is the number of tests. If you run 20 tests, each test must have p < 0.0025 to be declared significant.

from statsmodels.stats.multitest import multipletests

# Apply Bonferroni correction
p_values_all = results_df['p-value'].values
rejected_bonf, p_corrected_bonf, _, _ = multipletests(
    p_values_all, alpha=0.05, method='bonferroni'
)

print(f"Without correction: {(p_values_all < 0.05).sum()} significant")
print(f"With Bonferroni:    {rejected_bonf.sum()} significant")

Benjamini-Hochberg (False Discovery Rate): A less conservative approach that controls the expected proportion of false positives among rejected hypotheses, rather than the probability of any false positive.

rejected_bh, p_corrected_bh, _, _ = multipletests(
    p_values_all, alpha=0.05, method='fdr_bh'
)
print(f"With Benjamini-Hochberg (FDR): {rejected_bh.sum()} significant")
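The logic behind `fdr_bh` is simple enough to reproduce by hand. Here is a sketch of the step-up procedure (not a replacement for the library): sort the p-values, compare the i-th smallest to iα/m, find the largest rank that clears its threshold, and reject everything at or below it.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of rejected hypotheses (BH step-up procedure)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # ranks, smallest p first
    thresholds = alpha * np.arange(1, m + 1) / m   # the BH "staircase"
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()            # largest rank that clears its step
        rejected[order[:k + 1]] = True             # reject everything at or below it
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_vals).tolist())  # → [True, True, False, False, False, False]
```

Note the step-up detail: a p-value can be rejected even if it fails its own threshold, as long as some larger p-value at a higher rank passes.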

23.9 The Connection Between Confidence Intervals and Hypothesis Tests

Here's a beautiful fact that connects this chapter to Chapter 22:

A two-sided hypothesis test at significance level α is equivalent to checking whether the hypothesized value falls inside a (1-α) confidence interval.

If the 95% confidence interval for the difference between two means doesn't include zero, the two-tailed t-test at α = 0.05 will reject the null hypothesis. If the CI includes zero, the test won't reject.

# Demonstrate the CI-test equivalence (pooled two-sample t-test)
from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(high_income, low_income)
diff = high_income.mean() - low_income.mean()
n1, n2 = len(high_income), len(low_income)
df = n1 + n2 - 2
# Pooled variance, matching scipy's default (equal_var=True) t-test
sp2 = ((n1 - 1) * high_income.var(ddof=1) +
       (n2 - 1) * low_income.var(ddof=1)) / df
se_diff = np.sqrt(sp2 * (1/n1 + 1/n2))
t_crit = stats.t.ppf(0.975, df=df)

ci_lower = diff - t_crit * se_diff
ci_upper = diff + t_crit * se_diff

print(f"Observed difference: {diff:.1f}")
print(f"95% CI for difference: ({ci_lower:.1f}, {ci_upper:.1f})")
print(f"Does CI include 0? {'Yes' if ci_lower <= 0 <= ci_upper else 'No'}")
print(f"P-value: {p_val:.2e}")
print(f"Reject H₀? {'Yes' if p_val < 0.05 else 'No'}")
print(f"\nThe CI and the test always agree: the CI excludes 0")
print(f"exactly when the test rejects H₀. Same data, same conclusion.")

This equivalence has a practical advantage: the confidence interval gives you more information than the p-value alone. It tells you not just "is the difference significant?" but "how big might the difference plausibly be?"


23.10 Progressive Project: Testing Vaccination Rate Differences

Time to apply everything to the vaccination project. Our question: Do vaccination rates differ significantly between income groups?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Create our project dataset
np.random.seed(42)

income_groups = {
    'High income': {'n': 55, 'mean': 82, 'std': 9},
    'Upper middle': {'n': 50, 'mean': 72, 'std': 14},
    'Lower middle': {'n': 50, 'mean': 60, 'std': 17},
    'Low income':  {'n': 40, 'mean': 48, 'std': 20},
}

rows = []
for group, params in income_groups.items():
    rates = np.random.normal(params['mean'], params['std'], params['n'])
    rates = np.clip(rates, 5, 99)
    for rate in rates:
        rows.append({'income_group': group, 'vaccination_rate': rate})

df = pd.DataFrame(rows)
print("Summary by income group:")
summary = df.groupby('income_group')['vaccination_rate'].agg(['count', 'mean', 'std'])
print(summary.round(1))

Test 1: Overall Difference (ANOVA)

Are there any significant differences among the four income groups?

# One-way ANOVA: are the group means all equal?
groups = [df[df['income_group'] == g]['vaccination_rate'].values
          for g in income_groups.keys()]

f_stat, p_value = stats.f_oneway(*groups)
print(f"\nOne-way ANOVA:")
print(f"  F-statistic: {f_stat:.2f}")
print(f"  P-value: {p_value:.2e}")
print(f"  Conclusion: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'}")
if p_value < 0.05:
    print("  At least one group mean differs from the others.")

Test 2: Pairwise Comparisons

Which specific pairs of groups differ?

# Pairwise t-tests with Bonferroni correction
from itertools import combinations

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using a pooled SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) +
                  (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

group_names = list(income_groups.keys())
n_comparisons = len(list(combinations(group_names, 2)))
bonferroni_alpha = 0.05 / n_comparisons

print(f"\nPairwise t-tests (Bonferroni-corrected α = {bonferroni_alpha:.4f}):")
print("=" * 75)

for g1, g2 in combinations(group_names, 2):
    data1 = df[df['income_group'] == g1]['vaccination_rate']
    data2 = df[df['income_group'] == g2]['vaccination_rate']
    t_stat, p_val = stats.ttest_ind(data1, data2)
    d = cohens_d(data1, data2)
    diff = data1.mean() - data2.mean()
    sig = "***" if p_val < bonferroni_alpha else "n.s."

    print(f"  {g1:15s} vs {g2:15s}: "
          f"diff = {diff:+6.1f},  d = {abs(d):.2f},  "
          f"p = {p_val:.4f}  {sig}")

Test 3: High vs. Low Income (Primary Comparison)

# Our main comparison for the project
high = df[df['income_group'] == 'High income']['vaccination_rate']
low = df[df['income_group'] == 'Low income']['vaccination_rate']

t_stat, p_val = stats.ttest_ind(high, low, equal_var=False)  # Welch's t-test, matching the Welch CI below
d = cohens_d(high, low)
diff = high.mean() - low.mean()

# Confidence interval for the difference
se_diff = np.sqrt(high.var(ddof=1)/len(high) + low.var(ddof=1)/len(low))
df_welch = ((high.var(ddof=1)/len(high) + low.var(ddof=1)/len(low))**2 /
            ((high.var(ddof=1)/len(high))**2/(len(high)-1) +
             (low.var(ddof=1)/len(low))**2/(len(low)-1)))
t_crit = stats.t.ppf(0.975, df=df_welch)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print("\n" + "=" * 60)
print("PRIMARY ANALYSIS: High Income vs. Low Income")
print("=" * 60)
print(f"High income: mean = {high.mean():.1f}%, n = {len(high)}")
print(f"Low income:  mean = {low.mean():.1f}%, n = {len(low)}")
print(f"Difference:  {diff:.1f} percentage points")
print(f"95% CI:      ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"Cohen's d:   {abs(d):.2f} (large effect)")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value:     {p_val:.2e}")
print(f"\nConclusion: Vaccination rates are significantly higher in")
print(f"high-income countries (M = {high.mean():.1f}%) than in low-income")
print(f"countries (M = {low.mean():.1f}%), t = {t_stat:.1f}, p < 0.001,")
print(f"d = {abs(d):.2f}. The difference of {diff:.1f} percentage points is")
print(f"both statistically significant and practically meaningful.")

Visualization

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
order = ['High income', 'Upper middle', 'Lower middle', 'Low income']
colors = ['#2ecc71', '#3498db', '#e67e22', '#e74c3c']

bp = axes[0].boxplot([df[df['income_group'] == g]['vaccination_rate'].values
                       for g in order],
                      labels=[g.replace(' ', '\n') for g in order],
                      patch_artist=True, widths=0.6)

for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[0].set_ylabel('Vaccination Rate (%)', fontsize=12)
axes[0].set_title('Vaccination Rates by Income Group', fontsize=13)

# Effect size comparison
comparisons = ['High vs\nUpper Mid', 'High vs\nLower Mid',
               'High vs\nLow', 'Upper Mid\nvs Low']
effects = []
for g1, g2 in [('High income', 'Upper middle'),
                ('High income', 'Lower middle'),
                ('High income', 'Low income'),
                ('Upper middle', 'Low income')]:
    d1 = df[df['income_group'] == g1]['vaccination_rate']
    d2 = df[df['income_group'] == g2]['vaccination_rate']
    effects.append(abs(cohens_d(d1, d2)))

bar_colors = ['#3498db' if e < 0.5 else '#e67e22' if e < 0.8
              else '#e74c3c' for e in effects]
axes[1].bar(comparisons, effects, color=bar_colors, alpha=0.7, edgecolor='white')
axes[1].axhline(y=0.2, color='gray', linestyle=':', alpha=0.5, label='Small (0.2)')
axes[1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Medium (0.5)')
axes[1].axhline(y=0.8, color='gray', linestyle='-', alpha=0.5, label='Large (0.8)')
axes[1].set_ylabel("Cohen's d (Effect Size)", fontsize=12)
axes[1].set_title('Effect Sizes for Pairwise Comparisons', fontsize=13)
axes[1].legend(fontsize=9)

plt.tight_layout()
plt.savefig('vaccination_hypothesis_tests.png', dpi=150, bbox_inches='tight')
plt.show()

23.11 A Practical Framework for Hypothesis Testing

Let me give you a step-by-step framework you can follow for any hypothesis test:

Step 1: State the Hypotheses

Write down H₀ and H₁ in plain language AND in mathematical notation. Be specific about what "different" means — different means? Different proportions? Different distributions?

Step 2: Choose Your Significance Level

Usually α = 0.05. Choose this BEFORE looking at the data. If the consequences of a false positive are severe (e.g., approving an ineffective drug), use a smaller α (like 0.01).

Step 3: Check Assumptions

Every test has assumptions. For a t-test: independent observations, approximately normal distributions (or large samples), roughly equal variances. Violated assumptions can make p-values unreliable.
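You can check the t-test's assumptions directly with scipy.stats. A quick sketch on simulated data (the seed and group parameters here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(70, 15, 50)
group2 = rng.normal(70, 15, 50)

# Shapiro-Wilk: H0 = "this sample came from a normal distribution"
_, p_norm1 = stats.shapiro(group1)
_, p_norm2 = stats.shapiro(group2)

# Levene's test: H0 = "the groups have equal variances"
_, p_var = stats.levene(group1, group2)

print(f"Normality p-values:     {p_norm1:.3f}, {p_norm2:.3f}")
print(f"Equal-variance p-value: {p_var:.3f}")

# If equal variances look doubtful, switch to Welch's t-test:
# stats.ttest_ind(group1, group2, equal_var=False)
```

A small p-value from either check is a warning sign, not a verdict: with large samples these tests flag trivial departures, and with small samples they miss serious ones. Plot your data too.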

Step 4: Compute the Test Statistic and P-Value

Use scipy.stats or the appropriate function. Don't do this by hand.

Step 5: Report the Results Fully

Always report:

- The p-value (the strength of evidence against H₀)
- The effect size (how large the effect is)
- The confidence interval (the plausible range for the effect)
- The sample sizes (how much data you had)

Step 6: Interpret in Context

Translate the statistical conclusion into a substantive conclusion. "The data provide strong evidence that vaccination rates differ between income groups (difference = 34.1 percentage points, 95% CI: [28.3, 39.9], p < 0.001, d = 2.1)."
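Put together, the six steps compress into a few annotated lines. This is a sketch with invented numbers (the simple pooled formula for d assumes equal group sizes):

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu_A = mu_B;  H1: mu_A != mu_B (two-sided)
# Step 2: significance level fixed before looking at the data
alpha = 0.05

rng = np.random.default_rng(1)  # illustrative data
a = rng.normal(52, 10, 80)
b = rng.normal(48, 10, 80)

# Step 3: assumptions. Welch's test drops the equal-variance requirement
# Step 4: test statistic and p-value
t, p = stats.ttest_ind(a, b, equal_var=False)

# Step 5: effect size alongside the p-value (equal group sizes assumed)
d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}, n = {len(a)} + {len(b)}")

# Step 6: interpret in context. Report the size of the difference,
# not just whether it crossed alpha
```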


23.12 Common Pitfalls and How to Avoid Them

Pitfall 1: P-Hacking

Testing many hypotheses and only reporting the significant ones. Slicing the data in many ways until you find a "significant" result. Adding or removing variables until p < 0.05. All of these inflate the false positive rate.

Fix: Pre-register your analysis plan. Decide what you'll test before looking at the data. Correct for multiple comparisons.

Pitfall 2: Confusing "Not Significant" with "No Effect"

A p-value of 0.15 doesn't mean there's no effect. It means you don't have enough evidence to be sure. The effect might be real but your sample might be too small to detect it.

Fix: Report the confidence interval. If the CI is wide (e.g., -5 to +25), you should say "the data are inconclusive" rather than "there is no effect."
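A quick simulation shows how often an underpowered study calls a real effect "not significant". The numbers are illustrative: a true 10-point difference, SD 15, only 12 observations per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 2000, 12
missed = 0
for _ in range(n_sims):
    # A genuine 10-point effect (SD 15), but only 12 observations per group
    treated = rng.normal(75, 15, n)
    control = rng.normal(65, 15, n)
    _, p = stats.ttest_ind(treated, control)
    if p >= 0.05:
        missed += 1

print(f"Real effect missed in {missed / n_sims:.0%} of simulated studies")
```

In this setup the majority of simulated studies fail to reach significance, even though the effect is real and fairly large. "Not significant" here simply means "too little data".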

Pitfall 3: Ignoring Effect Sizes

A p-value of 0.001 with Cohen's d of 0.01 is a trivial effect detected with a huge sample. Don't treat all "significant" results as important.

Fix: Always compute and report effect sizes.
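Here is an illustration of exactly that trap, with parameters invented for the demo: a difference of 0.2 points on a 15-point-SD scale is negligible, but a million observations per group makes it "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000  # an enormous sample
a = rng.normal(100.0, 15, n)
b = rng.normal(100.2, 15, n)  # true difference: 0.2 points on a 15-point SD

t, p = stats.ttest_ind(a, b)
d = (b.mean() - a.mean()) / 15  # approximate d, using the known SD
print(f"p = {p:.1e}  (highly 'significant')")
print(f"d = {d:.3f}  (a trivial effect)")
```

The p-value answers "is there any difference at all?", and with enough data the answer to that question is almost always yes. Only the effect size tells you whether anyone should care.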

Pitfall 4: Testing Without Thinking

Running a t-test because it's the first thing you learned, without considering whether the assumptions are met or whether it's the right test for the question.

Fix: Start with the question, not the test. "What am I trying to learn?" comes before "What function do I call?"

Pitfall 5: Treating 0.05 as a Cliff

A p-value of 0.048 and a p-value of 0.052 convey essentially the same evidence. Don't treat the 0.05 boundary as a bright line between "true" and "false."

Fix: Report exact p-values rather than just "significant" or "not significant." Let readers evaluate the strength of evidence for themselves.


23.13 Chapter Summary

This chapter covered the logic, mechanics, interpretation, and pitfalls of hypothesis testing. Here's the journey:

  1. The logic: Hypothesis testing works by proof by contradiction. You assume nothing is happening (null hypothesis), compute how surprising your data would be under that assumption (p-value), and decide whether to reject the assumption.

  2. The p-value: The probability of seeing data at least this extreme if the null is true. NOT the probability that the null is true. NOT the probability the result is "due to chance." NOT a measure of effect size.

  3. The mechanics: Common tests include the t-test (comparing means), the chi-square test (testing whether two categorical variables are associated), and ANOVA (comparing several means at once). scipy.stats handles the computation.

  4. The distinction that matters most: Statistical significance (p < 0.05) is not the same as practical significance (large enough to matter). Always report effect sizes.

  5. The errors: Type I (false positive) and Type II (false negative) are the two ways to be wrong. The significance level α caps the Type I error rate. Power — which depends on sample size, effect size, and α — determines the Type II error rate (β = 1 − power).

  6. The multiple testing problem: If you test many hypotheses, some will be "significant" by chance. Corrections like Bonferroni or Benjamini-Hochberg are needed.

  7. The relationship to confidence intervals: A test at α = 0.05 is equivalent to checking whether the hypothesized value falls inside the 95% CI. CIs give you more information than p-values alone.

Next up: Chapter 24, where we explore correlation and causation. You've now learned to test whether two groups differ. But what about the relationship between two continuous variables? And — critically — why finding a relationship doesn't mean one thing causes the other.


You now have the tools to test claims about the world with data. Use them wisely. A hypothesis test is a flashlight, not a verdict — it illuminates the evidence, but the interpretation is up to you.