Key Takeaways: Hypothesis Testing: Making Decisions with Data
One-Sentence Summary
Hypothesis testing uses indirect reasoning — assume nothing is happening (the null hypothesis), measure how surprising the data would be under that assumption (the p-value), and reject the assumption only if the data are sufficiently surprising (below the significance level $\alpha$) — but the p-value is NOT the probability that the null is true, and "statistically significant" does NOT mean "important."
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Null hypothesis ($H_0$) | Default assumption of no effect, no difference, or status quo | The starting point for every hypothesis test — what we assume until evidence says otherwise |
| Alternative hypothesis ($H_a$) | The claim we're trying to find evidence for | Determines the direction of the test (one- or two-tailed) |
| P-value | Probability of data this extreme or more extreme, IF $H_0$ is true | Measures how surprising the data are under $H_0$ — smaller = more surprising = more evidence against $H_0$ |
| Significance level ($\alpha$) | Pre-set threshold for rejecting $H_0$ (typically 0.05) | Controls the Type I error rate — the false alarm probability |
| Test statistic | Standardized measure of how far the data are from $H_0$ | Converts the raw evidence into a number we can look up on a probability table |
The Logic of Hypothesis Testing
```
1. Assume H₀ is true (nothing is happening)
        ↓
2. Collect data and compute test statistic
        ↓
3. Ask: "How surprising are these data under H₀?"
        ↓
4. Compute p-value = P(data this extreme | H₀ true)
        ↓
5. If p-value ≤ α → Reject H₀ (data are too surprising)
   If p-value > α → Fail to reject H₀ (data are plausible under H₀)
```
The courtroom analogy:

- $H_0$ = presumption of innocence
- Data = prosecution's evidence
- p-value = how convincing the evidence is
- $\alpha$ = "beyond a reasonable doubt" threshold
- Reject $H_0$ = guilty verdict
- Fail to reject $H_0$ = not guilty (NOT the same as innocent)
Test Statistic Formulas
One-Sample z-Test for a Mean (known $\sigma$)
$$\boxed{z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}}$$
Conditions:

1. Random sample (or random assignment)
2. Independence (10% condition: $n \leq 0.10 \times N$)
3. Nearly normal population or large sample ($n \geq 30$, by CLT)
One-Sample z-Test for a Proportion
$$\boxed{z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}}$$
Note: Uses $p_0$ (not $\hat{p}$) in the standard error, because we assume $H_0$ is true.
Conditions:

1. Random sample
2. Independence (10% condition)
3. Success-failure: $np_0 \geq 10$ and $n(1-p_0) \geq 10$
P-Value Calculation
| Alternative Hypothesis | P-Value Formula | Picture |
|---|---|---|
| $H_a: \mu > \mu_0$ (right-tailed) | $P(Z \geq z)$ | Right tail |
| $H_a: \mu < \mu_0$ (left-tailed) | $P(Z \leq z)$ | Left tail |
| $H_a: \mu \neq \mu_0$ (two-tailed) | $2 \times P(Z \geq \lvert z \rvert)$ | Both tails |
What the P-Value IS and IS NOT
| The p-value IS... | The p-value is NOT... |
|---|---|
| $P(\text{data this extreme} \mid H_0 \text{ true})$ | $P(H_0 \text{ true} \mid \text{data})$ |
| A measure of how surprising the data are under $H_0$ | A measure of how likely $H_0$ is to be true |
| A continuous measure of evidence against $H_0$ | A measure of effect size or practical importance |
| Valid for one pre-specified test | Valid after cherry-picking from many tests |
The #1 misconception: "The p-value is the probability the null hypothesis is true." NO. The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$. Confusing these is the prosecutor's fallacy (Chapter 9) applied to statistics.
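A quick Bayes' rule sketch shows how far apart $P(\text{data} \mid H_0)$ and $P(H_0 \mid \text{data})$ can be. The prior (0.5) and power (0.8) below are illustrative assumptions, not values from the chapter:

```python
# Illustrative (assumed) numbers showing P(data | H0) != P(H0 | data).
prior_h0 = 0.5      # assumed prior: P(H0 true) before seeing data
alpha = 0.05        # P(significant result | H0 true)
power = 0.80        # assumed: P(significant result | H0 false)

# Bayes' rule: P(H0 true | significant result)
p_sig = alpha * prior_h0 + power * (1 - prior_h0)
p_h0_given_sig = alpha * prior_h0 / p_sig
print(f"P(H0 true | p <= 0.05) = {p_h0_given_sig:.3f}")  # ~0.059, not 0.05
```

Even in this best-case sketch the posterior probability of $H_0$ is not the p-value, and with a more skeptical prior the gap grows much larger.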
The Error Matrix
| Decision | $H_0$ is TRUE | $H_0$ is FALSE |
|---|---|---|
| Reject $H_0$ | Type I Error (false alarm), $P = \alpha$ | Correct (power = $1-\beta$) |
| Fail to reject $H_0$ | Correct, $P = 1-\alpha$ | Type II Error (missed detection), $P = \beta$ |
The seesaw: For fixed $n$, decreasing $\alpha$ increases $\beta$ (and vice versa). The only way to reduce both: increase sample size.
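The seesaw can be checked numerically. The sketch below assumes a right-tailed one-sample z-test with a true effect of 0.5 SD (an illustrative choice, not a value from the chapter):

```python
import numpy as np
from scipy import stats

def beta_error(alpha, n, delta=0.5):
    """Type II error rate for a right-tailed z-test with true effect delta SDs."""
    z_crit = stats.norm.ppf(1 - alpha)            # rejection cutoff under H0
    return stats.norm.cdf(z_crit - delta * np.sqrt(n))

# Fixed n: lowering alpha raises beta (the seesaw)
for alpha in (0.10, 0.05, 0.01):
    print(f"n=25,  alpha={alpha}: beta = {beta_error(alpha, 25):.3f}")

# Fixed alpha: a larger n lowers beta
for n in (25, 50, 100):
    print(f"alpha=0.05, n={n}: beta = {beta_error(0.05, n):.3f}")
```

The first loop shows $\beta$ climbing as $\alpha$ shrinks; the second shows that only a larger sample drives both error rates down together.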
One-Tailed vs. Two-Tailed: Decision Guide
| Use One-Tailed | Use Two-Tailed |
|---|---|
| Directional prediction specified before data | Effect in either direction is interesting |
| Only one direction is meaningful | Exploratory research |
| You committed to the direction in advance | When in doubt |
Warning: Never switch from two-tailed to one-tailed after seeing the data. That's p-hacking.
The CI-Hypothesis Test Connection
$$\boxed{\text{Reject } H_0: \mu = \mu_0 \text{ at } \alpha = 0.05 \iff \mu_0 \text{ is NOT in the 95\% CI}}$$
The confidence interval contains all values of the parameter that would not be rejected. The hypothesis test tells you whether a specific value is plausible.
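A minimal sketch of the duality, reusing the chapter's z-test-for-a-mean numbers ($\bar{x} = 134.2$, $\mu_0 = 130$, $\sigma = 20$, $n = 64$):

```python
import numpy as np
from scipy import stats

x_bar, mu_0, sigma, n = 134.2, 130, 20, 64
se = sigma / np.sqrt(n)

# Two-tailed z-test at alpha = 0.05
z = (x_bar - mu_0) / se
p_two = 2 * (1 - stats.norm.cdf(abs(z)))

# 95% confidence interval
z_star = stats.norm.ppf(0.975)
ci = (x_bar - z_star * se, x_bar + z_star * se)

reject = p_two <= 0.05
outside_ci = not (ci[0] <= mu_0 <= ci[1])
print(f"p = {p_two:.4f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"Reject H0? {reject}  |  mu_0 outside CI? {outside_ci}")  # always agree
```

Here $p \approx 0.093 > 0.05$ and $\mu_0 = 130$ falls inside the interval, so both views reach the same conclusion, as the boxed equivalence guarantees.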
The Five Steps (Summary)
| Step | Action | Key Question |
|---|---|---|
| 1 | State $H_0$ and $H_a$ | What's the claim? One- or two-tailed? |
| 2 | Choose $\alpha$ | What false alarm rate am I willing to accept? |
| 3 | Compute test statistic | How far are the data from $H_0$, in SE units? |
| 4 | Find p-value | How surprising are the data under $H_0$? |
| 5 | Conclude in context | What does the decision mean for the real-world question? |
P-Hacking and the Replication Crisis
| Number of Tests | False Positive Rate (all under $H_0$) |
|---|---|
| 1 | 5.0% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
| 50 | 92.3% |
$$P(\text{at least 1 false positive in } k \text{ tests}) = 1 - (1 - \alpha)^k$$
Solution: Pre-register hypotheses. Report all analyses. Use multiple testing corrections (Bonferroni: $\alpha' = \alpha/k$). Distinguish exploratory from confirmatory research.
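The formula above reproduces the table, and the effect of the Bonferroni correction can be verified the same way:

```python
import numpy as np

alpha = 0.05
ks = np.array([1, 5, 10, 20, 50])

# Chance of at least one false positive across k independent tests under H0
fwer = 1 - (1 - alpha) ** ks
for k, f in zip(ks, fwer):
    print(f"k = {k:>2}: {f:.1%}")   # matches the table above

# Bonferroni: test each hypothesis at alpha/k instead
fwer_bonf = 1 - (1 - alpha / ks) ** ks
print(fwer_bonf)  # every entry stays at or below alpha
```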
Python Quick Reference
```python
import numpy as np
from scipy import stats

# --- Manual z-test for a proportion ---
p_hat, p_0, n = 0.38, 0.31, 65
z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)
p_value_right = 1 - stats.norm.cdf(z)            # right-tailed
p_value_left = stats.norm.cdf(z)                 # left-tailed
p_value_two = 2 * (1 - stats.norm.cdf(abs(z)))   # two-tailed

# --- Manual z-test for a mean (known σ) ---
x_bar, mu_0, sigma, n = 134.2, 130, 20, 64
z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_value = 1 - stats.norm.cdf(z)                  # right-tailed

# --- One-sample t-test (unknown σ, the usual case) ---
data = np.array([...])                           # your data
t_stat, p_two = stats.ttest_1samp(data, popmean=mu_0)
# Note: returns a TWO-tailed p-value.
# For one-tailed: p_one = p_two / 2 (if t is in the expected direction),
# or pass alternative="greater" / "less" (SciPy >= 1.6).
```
Common Misconceptions
| Misconception | Reality |
|---|---|
| "p = 0.03 means 3% chance $H_0$ is true" | p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$ |
| "Small p-value = large effect" | p-value depends on sample size, not just effect size |
| "Not significant = no effect" | Absence of evidence ≠ evidence of absence |
| "Significant = important" | Statistical significance ≠ practical significance |
| "$\alpha = 0.05$ is sacred" | It's a convention; context should guide the choice |
| "Failing to reject = accepting $H_0$" | You just don't have enough evidence to reject it |
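The "small p-value = large effect" misconception is easy to demonstrate: hold the effect fixed at 0.1 SD (an arbitrary illustrative value) and let $n$ grow:

```python
import numpy as np
from scipy import stats

# Same tiny effect (0.1 SD), increasing n: the p-value shrinks anyway.
effect, sigma = 0.1, 1.0
pvals = []
for n in (50, 500, 5000):
    z = effect / (sigma / np.sqrt(n))            # z grows with sqrt(n)
    p = 2 * (1 - stats.norm.cdf(z))              # two-tailed p-value
    pvals.append(p)
    print(f"n = {n:>4}: z = {z:.2f}, p = {p:.4f}")
```

The identical 0.1 SD effect is "not significant" at $n = 50$ but highly significant at $n = 5000$, which is why a p-value alone says nothing about practical importance.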
How This Chapter Connects
| This Chapter | Builds On | Leads To |
|---|---|---|
| P-value definition | Conditional probability (Ch.9), CLT (Ch.11) | All future inference chapters |
| Test statistic (z) | z-scores (Ch.6), standard error (Ch.11) | t-tests (Ch.15), z-tests for proportions (Ch.14) |
| CI-test duality | Confidence intervals (Ch.12) | Every inference result is both a CI and a test |
| Type I/II errors | Probability (Ch.8) | Power analysis (Ch.17) |
| P-hacking | Study design (Ch.4), replication crisis (Ch.1) | Ethics of data practice (Ch.27) |
| Significance level | Random sampling (Ch.4) | Choosing $\alpha$ in context (Ch.17) |
Threads Resolved
- "P-value explained properly" (from Ch.1): fully delivered
- "What 'statistically significant' means" (from Ch.1): fully delivered
- "Daria's shooting analysis" (from Ch.1): partially resolved (formal test conducted; two-sample framework in Ch.16, power analysis in Ch.17)
The One Thing to Remember
If you forget everything else from this chapter, remember this:
The p-value is the probability of seeing data this extreme or more extreme IF the null hypothesis is true. It is NOT the probability that the null hypothesis is true. A small p-value means the data are surprising under $H_0$ — nothing more, nothing less. "Statistically significant" means $p \leq \alpha$, not "important." These two distinctions — between $P(\text{data} \mid H_0)$ and $P(H_0 \mid \text{data})$, and between statistical significance and practical significance — are the most consequential misunderstandings in all of statistics. Getting them right makes you a better scientist, a better analyst, and a better citizen. Getting them wrong has contributed to a crisis of confidence in published research that we're still recovering from.
Key Terms
| Term | Definition |
|---|---|
| Null hypothesis ($H_0$) | The default assumption of no effect, no difference, or status quo; assumed true until evidence says otherwise |
| Alternative hypothesis ($H_a$) | The claim the researcher is trying to find evidence for; the complement of $H_0$ |
| P-value | The probability of observing data as extreme as or more extreme than what was observed, assuming $H_0$ is true |
| Significance level ($\alpha$) | The pre-set threshold for rejecting $H_0$; equals the maximum acceptable Type I error rate |
| Test statistic | A standardized measure of how far the sample data are from the null hypothesis value, in standard errors |
| Type I error | Rejecting $H_0$ when it is actually true; a false positive / false alarm; probability = $\alpha$ |
| Type II error | Failing to reject $H_0$ when it is actually false; a false negative / missed detection; probability = $\beta$ |
| Statistically significant | A result where $p\text{-value} \leq \alpha$; the data are sufficiently surprising under $H_0$ to reject it |
| One-tailed test | A test where $H_a$ specifies a direction ($>$ or $<$); p-value uses one tail of the distribution |
| Two-tailed test | A test where $H_a$ is non-directional ($\neq$); p-value uses both tails of the distribution |
| Rejection region | The set of test statistic values that lead to rejecting $H_0$; determined by $\alpha$ and the test direction |
| Fail to reject | The conclusion when $p\text{-value} > \alpha$; not enough evidence to overturn $H_0$ (does NOT mean $H_0$ is true) |