Key Takeaways: Hypothesis Testing: Making Decisions with Data

One-Sentence Summary

Hypothesis testing uses indirect reasoning — assume nothing is happening (the null hypothesis), measure how surprising the data would be under that assumption (the p-value), and reject the assumption only if the data are sufficiently surprising (below the significance level $\alpha$) — but the p-value is NOT the probability that the null is true, and "statistically significant" does NOT mean "important."

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Null hypothesis ($H_0$) | Default assumption of no effect, no difference, or status quo | The starting point for every hypothesis test — what we assume until evidence says otherwise |
| Alternative hypothesis ($H_a$) | The claim we're trying to find evidence for | Determines the direction of the test (one- or two-tailed) |
| P-value | Probability of data this extreme or more extreme, IF $H_0$ is true | Measures how surprising the data are under $H_0$ — smaller = more surprising = more evidence against $H_0$ |
| Significance level ($\alpha$) | Pre-set threshold for rejecting $H_0$ (typically 0.05) | Controls the Type I error rate — the false alarm probability |
| Test statistic | Standardized measure of how far the data are from $H_0$ | Converts the raw evidence into a number we can look up on a probability table |

The Logic of Hypothesis Testing

1. Assume H₀ is true (nothing is happening)
         ↓
2. Collect data and compute test statistic
         ↓
3. Ask: "How surprising are these data under H₀?"
         ↓
4. Compute p-value = P(data this extreme | H₀ true)
         ↓
5. If p-value ≤ α → Reject H₀ (data are too surprising)
   If p-value > α → Fail to reject H₀ (data are plausible under H₀)

The courtroom analogy:

  • $H_0$ = presumption of innocence
  • Data = prosecution's evidence
  • p-value = how convincing the evidence is
  • $\alpha$ = "beyond a reasonable doubt" threshold
  • Reject $H_0$ = guilty verdict
  • Fail to reject $H_0$ = not guilty (NOT the same as innocent)
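The five-step loop can be run end to end in a few lines. This is a minimal sketch with hypothetical numbers (a right-tailed test of $H_0: p = 0.31$ against $H_a: p > 0.31$), not a general-purpose routine:

```python
import numpy as np
from scipy import stats

# Steps 1-2: hypotheses (H0: p = 0.31 vs Ha: p > 0.31) and threshold
alpha = 0.05
p_hat, p_0, n = 0.38, 0.31, 65          # hypothetical sample result

# Step 3: test statistic (SE uses p_0, because we assume H0 is true)
z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)

# Step 4: right-tailed p-value = P(Z >= z)
p_value = 1 - stats.norm.cdf(z)

# Step 5: decide by comparing p-value to alpha
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
```

Here $z \approx 1.22$ and the p-value is about 0.11, so the data are not surprising enough to reject $H_0$ at $\alpha = 0.05$.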

Test Statistic Formulas

One-Sample z-Test for a Mean (known $\sigma$)

$$\boxed{z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}}$$

Conditions:
1. Random sample (or random assignment)
2. Independence (10% condition: $n \leq 0.10 \times N$)
3. Nearly normal population or large sample ($n \geq 30$, by the CLT)

One-Sample z-Test for a Proportion

$$\boxed{z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}}$$

Note: Uses $p_0$ (not $\hat{p}$) in the standard error, because we assume $H_0$ is true.

Conditions:
1. Random sample
2. Independence (10% condition)
3. Success-failure: $np_0 \geq 10$ and $n(1-p_0) \geq 10$

P-Value Calculation

| Alternative Hypothesis | P-Value Formula | Picture |
|---|---|---|
| $H_a: \mu > \mu_0$ (right-tailed) | $P(Z \geq z)$ | Right tail |
| $H_a: \mu < \mu_0$ (left-tailed) | $P(Z \leq z)$ | Left tail |
| $H_a: \mu \neq \mu_0$ (two-tailed) | $2 \times P(Z \geq \lvert z \rvert)$ | Both tails |

What the P-Value IS and IS NOT

| The p-value IS... | The p-value is NOT... |
|---|---|
| $P(\text{data this extreme} \mid H_0 \text{ true})$ | $P(H_0 \text{ true} \mid \text{data})$ |
| A measure of how surprising the data are under $H_0$ | A measure of how likely $H_0$ is to be true |
| A continuous measure of evidence against $H_0$ | A measure of effect size or practical importance |
| Valid for one pre-specified test | Valid after cherry-picking from many tests |

The #1 misconception: "The p-value is the probability the null hypothesis is true." NO. The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$. Confusing these is the prosecutor's fallacy (Chapter 9) applied to statistics.

The Error Matrix

| | $H_0$ is TRUE | $H_0$ is FALSE |
|---|---|---|
| Reject $H_0$ | Type I Error (false alarm), $P = \alpha$ | Correct (power = $1-\beta$) |
| Fail to reject $H_0$ | Correct, $P = 1-\alpha$ | Type II Error (missed detection), $P = \beta$ |

The seesaw: For fixed $n$, decreasing $\alpha$ increases $\beta$ (and vice versa). The only way to reduce both: increase sample size.
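The seesaw can be made concrete for a right-tailed z-test. The helper below (`beta_right_tailed` is defined here for illustration, not a library function, and the numbers are hypothetical) computes $\beta$ for a given true mean:

```python
import numpy as np
from scipy import stats

def beta_right_tailed(mu_0, mu_a, sigma, n, alpha):
    """Type II error rate of a right-tailed z-test when the true mean is mu_a."""
    se = sigma / np.sqrt(n)
    x_crit = mu_0 + stats.norm.ppf(1 - alpha) * se  # rejection cutoff on the x-bar scale
    return stats.norm.cdf((x_crit - mu_a) / se)     # P(fail to reject | true mean = mu_a)

# Hypothetical scenario: H0: mu = 130, true mean 134, sigma = 20
b_05 = beta_right_tailed(130, 134, 20, n=64, alpha=0.05)
b_01 = beta_right_tailed(130, 134, 20, n=64, alpha=0.01)    # stricter alpha -> larger beta
b_big_n = beta_right_tailed(130, 134, 20, n=256, alpha=0.01)  # larger n shrinks beta again
```

Tightening $\alpha$ from 0.05 to 0.01 at $n = 64$ pushes $\beta$ up (roughly 0.52 to 0.77 here), while quadrupling $n$ brings it back down — the seesaw in action.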

One-Tailed vs. Two-Tailed: Decision Guide

| Use One-Tailed | Use Two-Tailed |
|---|---|
| Directional prediction specified before data | Effect in either direction is interesting |
| Only one direction is meaningful | Exploratory research |
| You committed to the direction in advance | When in doubt |

Warning: Never switch from two-tailed to one-tailed after seeing the data. That's p-hacking.

The CI-Hypothesis Test Connection

$$\boxed{\text{Reject } H_0: \mu = \mu_0 \text{ at } \alpha = 0.05 \iff \mu_0 \text{ is NOT in the 95\% CI}}$$

The confidence interval contains all values of the parameter that would not be rejected. The hypothesis test tells you whether a specific value is plausible.
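The duality is easy to check numerically. A sketch with hypothetical summary statistics ($\bar{x} = 134.2$, known $\sigma = 20$, $n = 64$): any $\mu_0$ inside the 95% CI yields a two-tailed p-value above 0.05, and any $\mu_0$ outside it yields a p-value at or below 0.05.

```python
import numpy as np
from scipy import stats

x_bar, sigma, n, alpha = 134.2, 20, 64, 0.05   # hypothetical sample summary
se = sigma / np.sqrt(n)
z_star = stats.norm.ppf(1 - alpha / 2)          # 1.96 for a 95% CI
ci = (x_bar - z_star * se, x_bar + z_star * se)

def two_tailed_p(mu_0):
    z = (x_bar - mu_0) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

# mu_0 inside the CI  <=>  p > alpha  <=>  fail to reject
for mu_0 in (130, 131, 140):
    inside = ci[0] <= mu_0 <= ci[1]
    assert inside == (two_tailed_p(mu_0) > alpha)
```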

The Five Steps (Summary)

| Step | Action | Key Question |
|---|---|---|
| 1 | State $H_0$ and $H_a$ | What's the claim? One- or two-tailed? |
| 2 | Choose $\alpha$ | What false alarm rate am I willing to accept? |
| 3 | Compute test statistic | How far are the data from $H_0$, in SE units? |
| 4 | Find p-value | How surprising are the data under $H_0$? |
| 5 | Conclude in context | What does the decision mean for the real-world question? |

P-Hacking and the Replication Crisis

| Number of Tests | False Positive Rate (all under $H_0$) |
|---|---|
| 1 | 5.0% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
| 50 | 92.3% |

$$P(\text{at least 1 false positive in } k \text{ tests}) = 1 - (1 - \alpha)^k$$

Solution: Pre-register hypotheses. Report all analyses. Use multiple testing corrections (Bonferroni: $\alpha' = \alpha/k$). Distinguish exploratory from confirmatory research.

Python Quick Reference

```python
import numpy as np
from scipy import stats

# --- Manual z-test for a proportion ---
p_hat, p_0, n = 0.38, 0.31, 65
z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)
p_value_right = 1 - stats.norm.cdf(z)          # right-tailed
p_value_left = stats.norm.cdf(z)               # left-tailed
p_value_two = 2 * (1 - stats.norm.cdf(abs(z))) # two-tailed

# --- Manual z-test for a mean (known σ) ---
x_bar, mu_0, sigma, n = 134.2, 130, 20, 64
z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_value = 1 - stats.norm.cdf(z)  # right-tailed

# --- One-sample t-test (unknown σ, the usual case) ---
data = np.array([...])  # your data
t_stat, p_two = stats.ttest_1samp(data, popmean=mu_0)
# Note: returns a TWO-tailed p-value
# For one-tailed: p_one = p_two / 2 (if t is in the expected direction)
```

Common Misconceptions

| Misconception | Reality |
|---|---|
| "p = 0.03 means 3% chance $H_0$ is true" | p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$ |
| "Small p-value = large effect" | p-value depends on sample size, not just effect size |
| "Not significant = no effect" | Absence of evidence ≠ evidence of absence |
| "Significant = important" | Statistical significance ≠ practical significance |
| "$\alpha = 0.05$ is sacred" | It's a convention; context should guide the choice |
| "Failing to reject = accepting $H_0$" | You just don't have enough evidence to reject it |

How This Chapter Connects

| This Chapter | Builds On | Leads To |
|---|---|---|
| P-value definition | Conditional probability (Ch.9), CLT (Ch.11) | All future inference chapters |
| Test statistic (z) | z-scores (Ch.6), standard error (Ch.11) | t-tests (Ch.15), z-tests for proportions (Ch.14) |
| CI-test duality | Confidence intervals (Ch.12) | Every inference result is both a CI and a test |
| Type I/II errors | Probability (Ch.8) | Power analysis (Ch.17) |
| P-hacking | Study design (Ch.4), replication crisis (Ch.1) | Ethics of data practice (Ch.27) |
| Significance level | Random sampling (Ch.4) | Choosing $\alpha$ in context (Ch.17) |

Threads Resolved

  • "P-value explained properly" (from Ch.1): fully delivered
  • "What 'statistically significant' means" (from Ch.1): fully delivered
  • "Daria's shooting analysis" (from Ch.1): partially resolved (formal test conducted; two-sample framework in Ch.16, power analysis in Ch.17)

The One Thing to Remember

If you forget everything else from this chapter, remember this:

The p-value is the probability of seeing data this extreme or more extreme IF the null hypothesis is true. It is NOT the probability that the null hypothesis is true. A small p-value means the data are surprising under $H_0$ — nothing more, nothing less. "Statistically significant" means $p \leq \alpha$, not "important." These two distinctions — between $P(\text{data} \mid H_0)$ and $P(H_0 \mid \text{data})$, and between statistical significance and practical significance — are the most consequential misunderstandings in all of statistics. Getting them right makes you a better scientist, a better analyst, and a better citizen. Getting them wrong has contributed to a crisis of confidence in published research that we're still recovering from.

Key Terms

| Term | Definition |
|---|---|
| Null hypothesis ($H_0$) | The default assumption of no effect, no difference, or status quo; assumed true until evidence says otherwise |
| Alternative hypothesis ($H_a$) | The claim the researcher is trying to find evidence for; the complement of $H_0$ |
| P-value | The probability of observing data as extreme as or more extreme than what was observed, assuming $H_0$ is true |
| Significance level ($\alpha$) | The pre-set threshold for rejecting $H_0$; equals the maximum acceptable Type I error rate |
| Test statistic | A standardized measure of how far the sample data are from the null hypothesis value, in standard errors |
| Type I error | Rejecting $H_0$ when it is actually true; a false positive / false alarm; probability = $\alpha$ |
| Type II error | Failing to reject $H_0$ when it is actually false; a false negative / missed detection; probability = $\beta$ |
| Statistically significant | A result where $p\text{-value} \leq \alpha$; the data are sufficiently surprising under $H_0$ to reject it |
| One-tailed test | A test where $H_a$ specifies a direction ($>$ or $<$); p-value uses one tail of the distribution |
| Two-tailed test | A test where $H_a$ is non-directional ($\neq$); p-value uses both tails of the distribution |
| Rejection region | The set of test statistic values that lead to rejecting $H_0$; determined by $\alpha$ and the test direction |
| Fail to reject | The conclusion when $p\text{-value} > \alpha$; not enough evidence to overturn $H_0$ (does NOT mean $H_0$ is true) |