Key Takeaways: Hypothesis Testing
This is your reference card for Chapter 23. Bookmark it, and come back whenever you need to remember what p-values actually mean, which test to use, or how to avoid the most common mistakes.
Key Concepts
- **Hypothesis testing is proof by contradiction.** You assume nothing is happening (the null hypothesis), measure how surprising your data would be under that assumption (the p-value), and decide whether to reject the assumption.
- **The p-value is NOT the probability the null is true.** It is the probability of seeing data this extreme if the null were true. This distinction matters enormously and is the most commonly misunderstood concept in statistics.
- **Statistical significance is not the same as practical significance.** A tiny difference can be "significant" with a large enough sample. Always report effect sizes (like Cohen's d) alongside p-values.
- **Two types of errors, two ways to be wrong.** Type I = false positive (rejecting a true null). Type II = false negative (failing to reject a false null). You can't minimize both simultaneously; there is always a trade-off.
- **Power determines whether your study can succeed.** If you don't have enough data to detect a realistic effect, your study is a waste of time. Do a power analysis before collecting data.
- **Multiple testing inflates false positives.** Testing many hypotheses without correction guarantees spurious "discoveries." Use Bonferroni or Benjamini-Hochberg corrections.
The Logic in Four Steps
1. STATE THE HYPOTHESES
   - H₀: Nothing is happening (e.g., no difference between groups)
   - H₁: Something IS happening (e.g., the groups differ)
2. CHOOSE α (before looking at the data)
   - Usually α = 0.05 (reject H₀ if p < 0.05)
3. COMPUTE THE P-VALUE
   - "If H₀ is true, how surprising is this data?"
   - Small p → the data is very surprising under H₀
   - Large p → the data is not surprising under H₀
4. DECIDE AND INTERPRET
   - p < α → reject H₀ (the evidence supports H₁)
   - p ≥ α → fail to reject H₀ (insufficient evidence)
   - Always report the effect size and a confidence interval alongside the p-value
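The four steps map directly onto a few lines of code. A minimal sketch with synthetic data (the group means, spread, and seed are made up for illustration, not from the chapter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: state the hypotheses
#   H0: mu_A = mu_B (no difference)    H1: mu_A != mu_B
group_a = rng.normal(loc=50, scale=10, size=40)  # synthetic "control" group
group_b = rng.normal(loc=56, scale=10, size=40)  # synthetic "treatment" group

# Step 2: choose alpha BEFORE looking at the data
alpha = 0.05

# Step 3: compute the p-value
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: decide and interpret (report effect size and CI too)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")
```

The only decision you make after seeing the data is the comparison in Step 4; α was fixed in Step 2.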
What P-Values Are and Aren't
| Statement | Correct? |
|---|---|
| "P(data this extreme | H₀ true)" | YES - this is the definition |
| "P(H₀ is true)" | NO - common misconception |
| "Probability the result is due to chance" | NO - subtly wrong |
| "P(H₁ is true) = 1 - p" | NO - that's not how it works |
| "Effect size" | NO - p-values don't measure effect size |
| "The result is 'real' if p < 0.05" | NO - it means we reject H₀, not that the effect is large or important |
Which Test to Use
| Question | Test | Function (scipy.stats unless noted) |
|---|---|---|
| Is this sample mean different from a known value? | One-sample t-test | ttest_1samp(data, value) |
| Are two independent group means different? | Two-sample t-test | ttest_ind(group1, group2) |
| Did values change (same subjects, two timepoints)? | Paired t-test | ttest_rel(before, after) |
| Are multiple group means different? | One-way ANOVA | f_oneway(g1, g2, g3, ...) |
| Are two categorical variables associated? | Chi-square test | chi2_contingency(table) |
| Is a proportion different from a known value? | Proportion z-test | proportions_ztest() (statsmodels) |
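Two rows of the table as runnable calls, using made-up numbers (a sketch; the measurements are hypothetical):

```python
import numpy as np
from scipy import stats

# Paired t-test: same subjects measured before and after an intervention
before = np.array([71.0, 68.5, 74.2, 66.1, 70.3, 69.8])
after = np.array([69.2, 66.0, 72.5, 65.0, 68.1, 68.9])
t_stat, p_paired = stats.ttest_rel(before, after)

# Chi-square test of association on a 2x2 contingency table
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"paired t-test: t = {t_stat:.2f}, p = {p_paired:.4f}")
print(f"chi-square:    chi2 = {chi2:.2f}, p = {p_chi:.4f}, dof = {dof}")
```

Each function returns the test statistic and the p-value (chi2_contingency also returns the degrees of freedom and the expected counts under H₀).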
Effect Size Reference
Cohen's d (for comparing two means):
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$
| d Value | Interpretation | What It Means |
|---|---|---|
| 0.2 | Small | Groups overlap a lot; hard to notice |
| 0.5 | Medium | Noticeable difference; visible in data |
| 0.8 | Large | Substantial difference; obvious in plots |
| > 1.0 | Very large | Groups are clearly distinct |
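The pooled-SD formula above is easy to compute by hand; scipy.stats has no built-in Cohen's d, so here is a small sketch with hypothetical samples:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled SD (ddof=1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(f"d = {d:.2f}")  # → d = 1.26, "very large" by the table above
```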
Cramér's V (for chi-square tests):
| V Value | Interpretation |
|---|---|
| < 0.1 | Negligible association |
| 0.1 - 0.3 | Small association |
| 0.3 - 0.5 | Medium association |
| > 0.5 | Large association |
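Cramér's V can likewise be derived from the chi-square statistic. A sketch with a hypothetical 2×2 table (recent SciPy versions also offer scipy.stats.contingency.association(table, method="cramer")):

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = np.asarray(table)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

v = cramers_v([[30, 10], [20, 40]])
print(f"V = {v:.2f}")  # → V = 0.41, a medium association by the table above
```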
The Decision Matrix
| | H₀ is Actually TRUE | H₀ is Actually FALSE |
|---|---|---|
| You reject H₀ | TYPE I ERROR (False Positive). Probability = α | CORRECT (True Positive). Probability = Power = 1 - β |
| You fail to reject H₀ | CORRECT (True Negative). Probability = 1 - α | TYPE II ERROR (False Negative). Probability = β |
Power increases with:
- Larger sample size (n)
- Larger true effect size (d)
- Higher significance level (α)
- Lower variability (σ)
Rule of thumb: Aim for at least 80% power.
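You can gauge power before collecting data. A sketch using the normal approximation for a two-sided, two-sample test (an approximation; statsmodels.stats.power.TTestIndPower gives exact t-based values):

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)          # critical value for alpha
    noncentrality = d * np.sqrt(n_per_group / 2)    # how far the alternative sits
    return stats.norm.cdf(noncentrality - z_crit)

# A medium effect (d = 0.5) with 64 subjects per group reaches roughly 80% power
power = approx_power(d=0.5, n_per_group=64)
print(f"power ≈ {power:.2f}")  # → power ≈ 0.81
```

Running this for a grid of sample sizes before the study tells you how much data a realistic effect size demands.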
The Complete Report
When reporting hypothesis test results, always include ALL of the following:
1. The test used: "We performed a two-sample t-test..."
2. Sample sizes: "...with n₁ = 55 and n₂ = 40..."
3. Descriptive stats: "High-income countries (M = 82.3, SD = 9.1) vs. low-income (M = 48.1, SD = 20.2)..."
4. Test statistic: "t(93) = 10.42..."
5. P-value: "p < 0.001..."
6. Effect size: "Cohen's d = 2.14..."
7. Confidence interval: "95% CI for the difference: [28.3, 39.9]..."
8. Interpretation: "The difference is both statistically significant and practically meaningful."
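Every number in that report can come from one script. A sketch with synthetic data (the group parameters and seed are made up, not the chapter's income data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
high = rng.normal(loc=82, scale=9, size=55)   # synthetic "high-income" sample
low = rng.normal(loc=48, scale=20, size=40)   # synthetic "low-income" sample

t_stat, p_value = stats.ttest_ind(high, low)
df = len(high) + len(low) - 2

# Cohen's d with the pooled standard deviation
sp = np.sqrt(((len(high) - 1) * high.var(ddof=1)
              + (len(low) - 1) * low.var(ddof=1)) / df)
d = (high.mean() - low.mean()) / sp

# 95% CI for the difference in means (pooled-variance t interval)
diff = high.mean() - low.mean()
se = sp * np.sqrt(1 / len(high) + 1 / len(low))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3g}, d = {d:.2f}, "
      f"95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```

One line of output supplies items 4 through 7 of the checklist; the sample sizes, descriptives, and interpretation come from you.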
Common Pitfalls
| Pitfall | What Happens | How to Avoid It |
|---|---|---|
| P-hacking | Testing many analyses, reporting only significant ones | Pre-register hypotheses; report all tests |
| Giant sample trap | Trivial effects become "significant" | Always compute and report effect sizes |
| Confusing significance with truth | Treating p < 0.05 as proof | Remember: p-values quantify surprise, not truth |
| Ignoring power | Underpowered studies miss real effects | Conduct power analysis before collecting data |
| Multiple testing | Running 20 tests, finding 1 "significant" result | Apply Bonferroni or BH correction |
| Treating 0.05 as a cliff | p = 0.049 is "real", p = 0.051 is "not" | Report exact p-values; don't dichotomize |
| Equating non-significance with no effect | "p = 0.15, so there's no difference" | Report CI; large CI means inconclusive, not absent |
Multiple Testing Corrections
| Method | Formula | When to Use |
|---|---|---|
| Bonferroni | α_adjusted = α / k (where k = number of tests) | Conservative; use when false positives are costly |
| Holm-Bonferroni | Step-down procedure; less conservative than Bonferroni | General purpose; more powerful than Bonferroni |
| Benjamini-Hochberg | Controls false discovery rate (FDR) at α | Exploratory analyses with many tests |
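Both corrections are a few lines each. A hand-rolled sketch with hypothetical p-values (statsmodels.stats.multitest.multipletests implements these and more):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 wherever p < alpha / k."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the largest set of ordered p-values with p_(i) <= (i/k) * alpha."""
    pvals = np.asarray(pvals)
    k = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, k + 1) / k
    passed = pvals[order] <= thresholds
    reject = np.zeros(k, dtype=bool)
    if passed.any():
        cutoff = np.max(np.where(passed)[0])
        reject[order[:cutoff + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74]
print("Bonferroni rejections:", int(bonferroni(pvals).sum()))          # → 1
print("BH rejections:        ", int(benjamini_hochberg(pvals).sum()))  # → 2
```

On these p-values Bonferroni keeps only the smallest one, while BH keeps two, illustrating why BH is preferred for exploratory work with many tests.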
What You Should Be Able to Do Now
- [ ] State null and alternative hypotheses for any research question
- [ ] Explain the logic of hypothesis testing as "proof by contradiction"
- [ ] Interpret a p-value correctly, avoiding the five major misconceptions
- [ ] Perform a two-sample t-test using scipy.stats.ttest_ind()
- [ ] Perform a chi-square test using scipy.stats.chi2_contingency()
- [ ] Compute Cohen's d and interpret its magnitude
- [ ] Distinguish statistical significance from practical significance
- [ ] Describe Type I and Type II errors and explain the trade-off
- [ ] Explain what statistical power is and why it matters
- [ ] Identify and correct for the multiple testing problem
- [ ] Write a complete results statement with p-value, effect size, CI, and interpretation
If you checked every box, you're ready for Chapter 24 — where we explore the relationship between variables (correlation) and confront the most important distinction in all of data science: correlation does not equal causation.