Key Takeaways: Hypothesis Testing
This is your reference card for Chapter 23. Bookmark it, and come back whenever you need to remember what p-values actually mean, which test to use, or how to avoid the most common mistakes.
Key Concepts
- **Hypothesis testing is proof by contradiction.** You assume nothing is happening (the null hypothesis), measure how surprising your data would be under that assumption (the p-value), and decide whether to reject the assumption.
- **The p-value is NOT the probability the null is true.** It is the probability of seeing data this extreme if the null were true. This distinction matters enormously and is the most commonly misunderstood concept in statistics.
- **Statistical significance is not the same as practical significance.** A tiny difference can be "significant" with a large enough sample. Always report effect sizes (like Cohen's d) alongside p-values.
- **Two types of errors, two ways to be wrong.** Type I = false positive (rejecting a true null). Type II = false negative (failing to reject a false null). You can't minimize both simultaneously; there is always a trade-off.
- **Power determines whether your study can succeed.** If you don't have enough data to detect a realistic effect, your study is a waste of time. Do a power analysis before collecting data.
- **Multiple testing inflates false positives.** Testing many hypotheses without correction guarantees spurious "discoveries." Use Bonferroni or Benjamini-Hochberg corrections.
The Logic in Four Steps
1. STATE THE HYPOTHESES
   - H₀: Nothing is happening (e.g., no difference between groups)
   - H₁: Something IS happening (e.g., the groups differ)
2. CHOOSE α (before looking at the data)
   - Usually α = 0.05 (reject H₀ if p < 0.05)
3. COMPUTE THE P-VALUE
   - "If H₀ is true, how surprising is this data?"
   - Small p → the data is very surprising under H₀
   - Large p → the data is not surprising under H₀
4. DECIDE AND INTERPRET
   - p < α → reject H₀ (the evidence supports H₁)
   - p ≥ α → fail to reject H₀ (insufficient evidence)
   - Always report the effect size and a confidence interval alongside the p-value
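The four steps map directly onto a few lines of code. A minimal sketch with synthetic data (the group means, spread, and seed are made up for illustration, not from the chapter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: state the hypotheses
#   H0: mu_A = mu_B (no difference)    H1: mu_A != mu_B
group_a = rng.normal(loc=50, scale=10, size=40)  # synthetic "control" group
group_b = rng.normal(loc=56, scale=10, size=40)  # synthetic "treatment" group

# Step 2: choose alpha BEFORE looking at the data
alpha = 0.05

# Step 3: compute the p-value
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: decide and interpret (report effect size and CI too)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```

The only decision you make after seeing the data is the comparison in Step 4; α was fixed in Step 2.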
What P-Values Are and Aren't
| Statement | Correct? |
|---|---|
| "P(data this extreme | H₀ true)" | YES - this is the definition |
| "P(H₀ is true)" | NO - common misconception |
| "Probability the result is due to chance" | NO - subtly wrong |
| "P(H₁ is true) = 1 - p" | NO - that's not how it works |
| "Effect size" | NO - p-values don't measure effect size |
| "The result is 'real' if p < 0.05" | NO - it means we reject H₀, not that the effect is large or important |
Which Test to Use
| Question | Test | Function (scipy.stats unless noted) |
|---|---|---|
| Is this sample mean different from a known value? | One-sample t-test | ttest_1samp(data, value) |
| Are two independent group means different? | Two-sample t-test | ttest_ind(group1, group2) |
| Did values change (same subjects, two timepoints)? | Paired t-test | ttest_rel(before, after) |
| Are multiple group means different? | One-way ANOVA | f_oneway(g1, g2, g3, ...) |
| Are two categorical variables associated? | Chi-square test | chi2_contingency(table) |
| Is a proportion different from a known value? | Proportion z-test | proportions_ztest() (statsmodels) |
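Two rows of the table as runnable calls, using made-up numbers (a sketch; the measurements are hypothetical):

```python
import numpy as np
from scipy import stats

# Paired t-test: same subjects measured before and after an intervention
before = np.array([71.0, 68.5, 74.2, 66.1, 70.3, 69.8])
after = np.array([69.2, 66.0, 72.5, 65.0, 68.1, 68.9])
t_stat, p_paired = stats.ttest_rel(before, after)

# Chi-square test of association on a 2x2 contingency table
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"paired t-test: t = {t_stat:.2f}, p = {p_paired:.4f}")
print(f"chi-square:    chi2 = {chi2:.2f}, p = {p_chi:.4f}, dof = {dof}")
```

Each function returns the test statistic and the p-value (chi2_contingency also returns the degrees of freedom and the expected counts under H₀).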
Effect Size Reference
Cohen's d (for comparing two means):
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$
| d Value | Interpretation | What It Means |
|---|---|---|
| 0.2 | Small | Groups overlap a lot; hard to notice |
| 0.5 | Medium | Noticeable difference; visible in data |
| 0.8 | Large | Substantial difference; obvious in plots |
| > 1.0 | Very large | Groups are clearly distinct |
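The pooled-SD formula above is easy to compute by hand; scipy.stats has no built-in Cohen's d, so here is a small sketch with hypothetical samples:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled SD (ddof=1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(f"d = {d:.2f}")  # → d = 1.26, "very large" by the table above
```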
Cramér's V (for chi-square tests):
| V Value | Interpretation |
|---|---|
| < 0.1 | Negligible association |
| 0.1 - 0.3 | Small association |
| 0.3 - 0.5 | Medium association |
| > 0.5 | Large association |
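Cramér's V can likewise be derived from the chi-square statistic. A sketch with a hypothetical 2×2 table (recent SciPy versions also offer scipy.stats.contingency.association(table, method="cramer")):

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = np.asarray(table)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

v = cramers_v([[30, 10], [20, 40]])
print(f"V = {v:.2f}")  # → V = 0.41, a medium association by the table above
```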
The Decision Matrix
| | H₀ is Actually TRUE | H₀ is Actually FALSE |
|---|---|---|
| You reject H₀ | TYPE I ERROR (False Positive). Probability = α | CORRECT (True Positive). Probability = Power = 1 - β |
| You fail to reject H₀ | CORRECT (True Negative). Probability = 1 - α | TYPE II ERROR (False Negative). Probability = β |
Power increases with:
- Larger sample size (n)
- Larger true effect size (d)
- Higher significance level (α)
- Lower variability (σ)
Rule of thumb: Aim for at least 80% power.
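You can gauge power before collecting data. A sketch using the normal approximation for a two-sided, two-sample test (an approximation; statsmodels.stats.power.TTestIndPower gives exact t-based values):

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)          # critical value for alpha
    noncentrality = d * np.sqrt(n_per_group / 2)    # how far the alternative sits
    return stats.norm.cdf(noncentrality - z_crit)

# A medium effect (d = 0.5) with 64 subjects per group reaches roughly 80% power
power = approx_power(d=0.5, n_per_group=64)
print(f"power ≈ {power:.2f}")  # → power ≈ 0.81
```

Running this for a grid of sample sizes before the study tells you how much data a realistic effect size demands.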
The Complete Report
When reporting hypothesis test results, always include ALL of the following:
1. The test used: "We performed a two-sample t-test..."
2. Sample sizes: "...with n₁ = 55 and n₂ = 40..."
3. Descriptive stats: "High-income countries (M = 82.3, SD = 9.1) vs. low-income (M = 48.1, SD = 20.2)..."
4. Test statistic: "t(93) = 10.42..."
5. P-value: "p < 0.001..."
6. Effect size: "Cohen's d = 2.14..."
7. Confidence interval: "95% CI for the difference: [28.3, 39.9]..."
8. Interpretation: "The difference is both statistically significant and practically meaningful."
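Every number in that report can come from one script. A sketch with synthetic data (the group parameters and seed are made up, not the chapter's income data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
high = rng.normal(loc=82, scale=9, size=55)   # synthetic "high-income" sample
low = rng.normal(loc=48, scale=20, size=40)   # synthetic "low-income" sample

t_stat, p_value = stats.ttest_ind(high, low)
df = len(high) + len(low) - 2

# Cohen's d with the pooled standard deviation
sp = np.sqrt(((len(high) - 1) * high.var(ddof=1)
              + (len(low) - 1) * low.var(ddof=1)) / df)
d = (high.mean() - low.mean()) / sp

# 95% CI for the difference in means (pooled-variance t interval)
diff = high.mean() - low.mean()
se = sp * np.sqrt(1 / len(high) + 1 / len(low))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3g}, d = {d:.2f}, "
      f"95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```

One line of output supplies items 4 through 7 of the checklist; the sample sizes, descriptives, and interpretation come from you.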
Common Pitfalls
| Pitfall | What Happens | How to Avoid It |
|---|---|---|
| P-hacking | Testing many analyses, reporting only significant ones | Pre-register hypotheses; report all tests |
| Giant sample trap | Trivial effects become "significant" | Always compute and report effect sizes |
| Confusing significance with truth | Treating p < 0.05 as proof | Remember: p-values quantify surprise, not truth |
| Ignoring power | Underpowered studies miss real effects | Conduct power analysis before collecting data |
| Multiple testing | Running 20 tests, finding 1 "significant" result | Apply Bonferroni or BH correction |
| Treating 0.05 as a cliff | p = 0.049 is "real", p = 0.051 is "not" | Report exact p-values; don't dichotomize |
| Equating non-significance with no effect | "p = 0.15, so there's no difference" | Report CI; large CI means inconclusive, not absent |
Multiple Testing Corrections
| Method | Formula | When to Use |
|---|---|---|
| Bonferroni | α_adjusted = α / k (where k = number of tests) | Conservative; use when false positives are costly |
| Holm-Bonferroni | Step-down procedure; less conservative than Bonferroni | General purpose; more powerful than Bonferroni |
| Benjamini-Hochberg | Controls false discovery rate (FDR) at α | Exploratory analyses with many tests |
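Both corrections are a few lines each. A hand-rolled sketch with hypothetical p-values (statsmodels.stats.multitest.multipletests implements these and more):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 wherever p < alpha / k."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the largest set of ordered p-values with p_(i) <= (i/k) * alpha."""
    pvals = np.asarray(pvals)
    k = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, k + 1) / k
    passed = pvals[order] <= thresholds
    reject = np.zeros(k, dtype=bool)
    if passed.any():
        cutoff = np.max(np.where(passed)[0])
        reject[order[:cutoff + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74]
print("Bonferroni rejections:", int(bonferroni(pvals).sum()))          # → 1
print("BH rejections:        ", int(benjamini_hochberg(pvals).sum()))  # → 2
```

On these p-values Bonferroni keeps only the smallest one, while BH keeps two, illustrating why BH is preferred for exploratory work with many tests.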
What You Should Be Able to Do Now
- [ ] State null and alternative hypotheses for any research question
- [ ] Explain the logic of hypothesis testing as "proof by contradiction"
- [ ] Interpret a p-value correctly, avoiding the five major misconceptions
- [ ] Perform a two-sample t-test using scipy.stats.ttest_ind()
- [ ] Perform a chi-square test using scipy.stats.chi2_contingency()
- [ ] Compute Cohen's d and interpret its magnitude
- [ ] Distinguish statistical significance from practical significance
- [ ] Describe Type I and Type II errors and explain the trade-off
- [ ] Explain what statistical power is and why it matters
- [ ] Identify and correct for the multiple testing problem
- [ ] Write a complete results statement with p-value, effect size, CI, and interpretation
If you checked every box, you're ready for Chapter 24 — where we explore the relationship between variables (correlation) and confront the most important distinction in all of data science: correlation does not equal causation.