Key Takeaways: Hypothesis Testing

This is your reference card for Chapter 23. Bookmark it, and come back whenever you need to remember what p-values actually mean, which test to use, or how to avoid the most common mistakes.


Key Concepts

  • Hypothesis testing is proof by contradiction. You assume nothing is happening (null hypothesis), measure how surprising your data would be under that assumption (p-value), and decide whether to reject the assumption.

  • The p-value is NOT the probability the null is true. It's the probability of seeing data at least this extreme if the null were true. This distinction matters enormously and is the most commonly misunderstood concept in statistics.

  • Statistical significance is not the same as practical significance. A tiny difference can be "significant" with a large enough sample. Always report effect sizes (like Cohen's d) alongside p-values.

  • Two types of errors, two ways to be wrong. Type I = false positive (rejecting a true null). Type II = false negative (failing to reject a false null). You can't minimize both simultaneously — there's always a trade-off.

  • Power determines whether your study can succeed. If you don't have enough data to detect a realistic effect, your study is a waste of time. Do a power analysis before collecting data.

  • Multiple testing inflates false positives. Testing many hypotheses without correction guarantees spurious "discoveries." Use Bonferroni or Benjamini-Hochberg corrections.


The Logic in Four Steps

1. STATE THE HYPOTHESES
   H₀: Nothing is happening (e.g., no difference between groups)
   H₁: Something IS happening (e.g., groups differ)

2. CHOOSE α (before looking at data)
   Usually α = 0.05 (reject H₀ if p < 0.05)

3. COMPUTE THE P-VALUE
   "If H₀ is true, how surprising is this data?"
   Small p → data is very surprising under H₀
   Large p → data is not surprising under H₀

4. DECIDE AND INTERPRET
   p < α → Reject H₀ (evidence supports H₁)
   p ≥ α → Fail to reject H₀ (insufficient evidence)
   Always report effect size and CI alongside p-value
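
The four steps above can be sketched in Python with scipy. The groups here are synthetic (the means, spread, and seed are illustrative assumptions, not data from the chapter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: H0: the two group means are equal; H1: they differ.
# (Synthetic data: group B's true mean is shifted up by 1.)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Step 2: choose alpha BEFORE looking at the data.
alpha = 0.05

# Step 3: compute the p-value with a two-sample t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: decide and interpret.
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```

Note that `alpha` is fixed in step 2, before `p_value` exists; swapping those steps is where p-hacking starts.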

What P-Values Are and Aren't

| Statement | Correct? |
|---|---|
| "P(data at least this extreme \| H₀ true)" | YES - this is the definition |
| "P(H₀ is true)" | NO - common misconception |
| "Probability the result is due to chance" | NO - subtly wrong |
| "P(H₁ is true) = 1 - p" | NO - that's not how it works |
| "Effect size" | NO - p-values don't measure effect size |
| "The result is 'real' if p < 0.05" | NO - it means we reject H₀, not that the effect is large or important |

Which Test to Use

| Question | Test | Function |
|---|---|---|
| Is this sample mean different from a known value? | One-sample t-test | ttest_1samp(data, value) |
| Are two independent group means different? | Two-sample t-test | ttest_ind(group1, group2) |
| Did values change (same subjects, two timepoints)? | Paired t-test | ttest_rel(before, after) |
| Are multiple group means different? | One-way ANOVA | f_oneway(g1, g2, g3, ...) |
| Are two categorical variables associated? | Chi-square test | chi2_contingency(table) |
| Is a proportion different from a known value? | Proportion z-test | proportions_ztest() (from statsmodels, not scipy.stats) |
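
As a worked example of one row of this table, a chi-square test of association on a hypothetical 2×2 contingency table (the counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = treatment/control, columns = outcome yes/no.
table = np.array([[30, 10],
                  [20, 40]])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence (H0).
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

A small p here says the observed counts would be very surprising if the two variables were independent.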

Effect Size Reference

Cohen's d (for comparing two means):

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$

| d Value | Interpretation | What It Means |
|---|---|---|
| 0.2 | Small | Groups overlap a lot; hard to notice |
| 0.5 | Medium | Noticeable difference; visible in data |
| 0.8 | Large | Substantial difference; obvious in plots |
| > 1.0 | Very large | Groups are clearly distinct |
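
The formula above translates directly into NumPy; this small helper (a sketch, not a library function) uses the pooled standard deviation with sample variances (ddof=1):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Toy data: y's mean is exactly one pooled SD above x's, so d = -1.0.
x = np.array([4.0, 5.0, 6.0])
y = np.array([5.0, 6.0, 7.0])
print(cohens_d(x, y))  # -1.0
```

The sign only encodes direction (which group is larger); interpret the magnitude against the table above.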

Cramér's V (for chi-square tests):

| V Value | Interpretation |
|---|---|
| < 0.1 | Negligible association |
| 0.1 - 0.3 | Small association |
| 0.3 - 0.5 | Medium association |
| > 0.5 | Large association |
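
Cramér's V is computed from the chi-square statistic as V = √(χ² / (n·(min(r, c) − 1))). A minimal sketch (the helper name and the table counts are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

table = np.array([[30, 10],
                  [20, 40]])
print(round(cramers_v(table), 3))  # 0.408 -> medium association
```

Unlike the p-value, V does not grow with sample size, which is exactly why it belongs next to the chi-square result in a report.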

The Decision Matrix

| | H₀ is Actually TRUE | H₀ is Actually FALSE |
|---|---|---|
| You reject H₀ | TYPE I ERROR (False Positive). Probability = α | CORRECT (True Positive). Probability = Power = 1 - β |
| You fail to reject H₀ | CORRECT (True Negative). Probability = 1 - α | TYPE II ERROR (False Negative). Probability = β |

Power increases with:

  • Larger sample size (n)
  • Larger true effect size (d)
  • Higher significance level (α)
  • Lower variability (σ)

Rule of thumb: Aim for at least 80% power.
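
A power analysis before data collection can be done with statsmodels (assuming it is installed; scipy itself does not ship a power calculator):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power at alpha = 0.05, two-sided two-sample t-test.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group: {n_per_group:.1f}")  # ~64
```

Leaving exactly one of `effect_size`, `nobs1`, `alpha`, `power` unspecified tells `solve_power` which quantity to solve for.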


The Complete Report

When reporting hypothesis test results, always include ALL of the following:

1. The test used:     "We performed a two-sample t-test..."
2. Sample sizes:      "...with n₁ = 55 and n₂ = 40..."
3. Descriptive stats: "High-income countries (M = 82.3, SD = 9.1)
                       vs. low-income (M = 48.1, SD = 20.2)..."
4. Test statistic:    "t(93) = 10.42..."
5. P-value:           "p < 0.001..."
6. Effect size:       "Cohen's d = 2.14..."
7. Confidence interval: "95% CI for the difference: [28.3, 39.9]..."
8. Interpretation:    "The difference is both statistically significant
                       and practically meaningful."
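
Every quantity in that checklist can be computed in one pass. This sketch draws hypothetical samples whose population parameters match the descriptive stats above (the data themselves are simulated, not the chapter's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(82.3, 9.1, 55)    # hypothetical group 1
b = rng.normal(48.1, 20.2, 40)   # hypothetical group 2

# Items 1-5: test, sample sizes, descriptives, statistic, p-value.
t_stat, p = stats.ttest_ind(a, b)
n1, n2 = len(a), len(b)
df = n1 + n2 - 2

# Item 6: Cohen's d via the pooled standard deviation.
sp = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df)
d = (a.mean() - b.mean()) / sp

# Item 7: 95% CI for the difference in means (pooled-variance t interval).
diff = a.mean() - b.mean()
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"t({df}) = {t_stat:.2f}, p = {p:.2e}, d = {d:.2f}, "
      f"95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

Item 8, the interpretation, is the one piece code cannot supply.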

Common Pitfalls

| Pitfall | What Happens | How to Avoid It |
|---|---|---|
| P-hacking | Testing many analyses, reporting only significant ones | Pre-register hypotheses; report all tests |
| Giant sample trap | Trivial effects become "significant" | Always compute and report effect sizes |
| Confusing significance with truth | Treating p < 0.05 as proof | Remember: p-values quantify surprise, not truth |
| Ignoring power | Underpowered studies miss real effects | Conduct power analysis before collecting data |
| Multiple testing | Running 20 tests, finding 1 "significant" result | Apply Bonferroni or BH correction |
| Treating 0.05 as a cliff | p = 0.049 is "real", p = 0.051 is "not" | Report exact p-values; don't dichotomize |
| Equating non-significance with no effect | "p = 0.15, so there's no difference" | Report CI; a wide CI means inconclusive, not absent |

Multiple Testing Corrections

| Method | Formula / Idea | When to Use |
|---|---|---|
| Bonferroni | α_adjusted = α / k (where k = number of tests) | Conservative; use when false positives are costly |
| Holm-Bonferroni | Step-down procedure; less conservative than Bonferroni | General purpose; more powerful than Bonferroni |
| Benjamini-Hochberg | Controls false discovery rate (FDR) at α | Exploratory analyses with many tests |
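
All three corrections are available in statsmodels. The p-values below are hypothetical, chosen so the methods disagree (BH, being less conservative, rejects one more hypothesis than the other two here):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 5 tests.
pvals = [0.001, 0.008, 0.020, 0.041, 0.20]

results = {}
for method in ("bonferroni", "holm", "fdr_bh"):
    # multipletests returns (reject flags, adjusted p-values, ...).
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    results[method] = int(reject.sum())
    print(f"{method}: {results[method]} rejected, adjusted = {p_adj.round(3)}")
```

Compare the adjusted p-values, not the raw ones, against α when deciding which results survive.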

What You Should Be Able to Do Now

  • [ ] State null and alternative hypotheses for any research question
  • [ ] Explain the logic of hypothesis testing as "proof by contradiction"
  • [ ] Interpret a p-value correctly, avoiding the five major misconceptions
  • [ ] Perform a two-sample t-test using scipy.stats.ttest_ind()
  • [ ] Perform a chi-square test using scipy.stats.chi2_contingency()
  • [ ] Compute Cohen's d and interpret its magnitude
  • [ ] Distinguish statistical significance from practical significance
  • [ ] Describe Type I and Type II errors and explain the trade-off
  • [ ] Explain what statistical power is and why it matters
  • [ ] Identify and correct for the multiple testing problem
  • [ ] Write a complete results statement with p-value, effect size, CI, and interpretation

If you checked every box, you're ready for Chapter 24 — where we explore the relationship between variables (correlation) and confront the most important distinction in all of data science: correlation does not equal causation.