

Learning Objectives

  • State null and alternative hypotheses in words and symbols
  • Explain the logic of hypothesis testing as indirect reasoning (proof by contradiction analogy)
  • Calculate and interpret p-values
  • Make statistical decisions using significance levels
  • Distinguish between Type I and Type II errors and explain their real-world consequences

Chapter 13: Hypothesis Testing: Making Decisions with Data

"The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not." — Ronald Fisher, Statistical Methods for Research Workers (1925)

Chapter Overview

We need to talk about p-values.

If you've been following the news at all — or even just scrolling through headlines — you've seen them. "Study finds statistically significant link between..." or "Results were not statistically significant." These phrases appear in medical studies, business reports, court cases, and policy debates. They shape which drugs get approved, which products get launched, which suspects get convicted, and which school programs get funded.

And here's the uncomfortable truth: most people who use these phrases don't fully understand what they mean. Not just laypeople — researchers, journalists, even some statisticians get tripped up. A 2019 survey found that a majority of published research articles in several fields contained at least one incorrect interpretation of a p-value.

This chapter is going to fix that, for you at least. We're going to nail down exactly what a p-value is, what it isn't, and how to use it responsibly. And I promise: once you truly understand hypothesis testing, you'll never read a research headline the same way again.

But first, let's back up.

Remember Sam Okafor's question from Chapter 1? Daria Williams was shooting 38% from three-point range this season, up from 31% last season. Sam wanted to know: is this a real improvement, or just random variation? In Chapter 11, we used the Central Limit Theorem to compute that if Daria hadn't actually improved (if her true percentage was still 31%), there was about an 11% chance of seeing a shooting percentage of 38% or higher in 65 attempts. That 11% is a p-value. We just didn't call it that yet.

And remember Alex Rivera's A/B test at StreamVibe? Alex wanted to know if the new recommendation algorithm increased watch time. She ran the experiment, collected the data, and found a difference between the groups. But was the difference real, or was it just noise? Hypothesis testing is the framework that answers that question — and it's the framework Alex's engineering team uses every day.

In Chapter 12, you learned to estimate parameters with confidence intervals — your first inference tool. Now you're going to learn the second: hypothesis testing. Instead of asking "What is the value of this parameter?" you'll ask "Is this specific claim about the parameter supported by the data?"

These two tools — confidence intervals and hypothesis tests — are the twin pillars of statistical inference. They're mathematically connected (as we'll see), but they answer different questions. And right now, we need the hypothesis test.


Fast Track: If you've done hypothesis testing before and can define a p-value from memory, skim Sections 13.1–13.4 and jump to Section 13.8 (one-tailed vs. two-tailed). Complete quiz questions 1, 8, and 15 to verify your understanding. But be warned: most people who think they understand p-values have at least one misconception. Section 13.6 might surprise you.

Deep Dive: After this chapter, read Case Study 1 (the replication crisis revisited) for a detailed look at how p-value misuse contributed to science's biggest credibility crisis, then Case Study 2 (clinical trials and FDA approval) for a high-stakes application where Type I and Type II errors have life-or-death consequences.


13.1 A Puzzle Before We Start (Productive Struggle)

Before I explain anything, try this thought experiment.

The Coin Experiment

A friend claims they have a fair coin — one that lands heads 50% of the time. You're skeptical. You decide to test their claim by flipping the coin 100 times.

(a) If the coin lands heads 52 out of 100 times, would you accuse your friend of lying? Why or why not?

(b) If it lands heads 85 out of 100 times, would you accuse them? Why or why not?

(c) What about 60 out of 100? Or 65? Where exactly would you draw the line?

(d) Here's the hard question: whatever line you draw, is there a chance you're wrong? Could a fair coin actually land heads 65 times out of 100? What about 85 times?

Take 3 minutes. The question in part (c) is the one that matters most.

Here's what I hope you noticed:

For part (a), 52 out of 100 is basically what you'd expect from a fair coin. A fair coin won't land exactly 50-50 every time — in fact, it usually won't. So 52 heads is no big deal.

For part (b), 85 heads out of 100 is extremely unlikely from a fair coin. You'd probably conclude the coin is biased. Something is wrong.

Part (c) is where it gets interesting. There's no magic number where "probably fair" suddenly becomes "definitely unfair." The evidence against fairness grows gradually as you move from 52 to 60 to 65 to 85. At some point, the evidence becomes strong enough that you'd reject the "fair coin" claim — but where you draw that line involves judgment.

And part (d) is the kicker: even at 85 heads, there's a tiny, nonzero chance the coin is fair and you just got really unlucky. We can calculate that probability — it's astronomically small, but it's not zero. No matter where you draw the line, there's always a possibility of making a mistake.
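Those tail probabilities can be computed exactly. Here's a quick sketch in Python, using the exact binomial distribution for 100 flips of a fair coin:

```python
from math import comb

def binom_tail(k, n=100, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): exact sum over the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# How surprising is each outcome if the coin really is fair?
for k in (52, 60, 65, 85):
    print(f"P(at least {k} heads in 100 fair flips) = {binom_tail(k):.2e}")
```

Running this shows the gradual slide from "unremarkable" (52 heads happens well over a third of the time) to "astronomically unlikely but still not zero" (85 heads), which is exactly the judgment call from part (c).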

You've just done hypothesis testing. Everything in this chapter is a formalization of the intuitive reasoning you just used.


13.2 The Logic of Hypothesis Testing: Proof by Contradiction

Hypothesis testing uses a style of reasoning you may have encountered in math class: proof by contradiction (also called reductio ad absurdum).

Here's how proof by contradiction works in mathematics:

  1. Assume the thing you want to disprove is true
  2. Show that this assumption leads to a contradiction — something that's extremely unlikely or impossible
  3. Conclude that the assumption must be false

In statistics, we use a probabilistic version of this logic:

  1. Assume the "boring" explanation is true — nothing interesting is happening
  2. Calculate how likely the observed data would be under that assumption
  3. If the data would be very unlikely under that assumption, reject the assumption

The key difference from mathematical proof is that we can never be 100% certain. We deal in probabilities, not absolute truth. But the logical structure is the same.

The Courtroom Analogy

The best analogy for hypothesis testing is a criminal trial. Let me walk you through it carefully, because this analogy will come back throughout the chapter.

| Criminal Trial | Hypothesis Test |
|---|---|
| The defendant is presumed innocent | We start by assuming nothing special is happening |
| The prosecution presents evidence | We collect data |
| The jury asks: "Is this evidence convincing beyond a reasonable doubt?" | We ask: "Is this data unlikely enough under our assumption?" |
| Verdict: guilty or not guilty | Decision: reject the assumption or fail to reject it |
| "Not guilty" ≠ "innocent" | "Fail to reject" ≠ "the assumption is true" |

That last row is critical. Read it again.

In a courtroom, a verdict of "not guilty" doesn't mean the jury believes the defendant is innocent. It means the prosecution didn't present enough evidence to convince them beyond a reasonable doubt. Maybe the defendant did it. Maybe they didn't. The jury simply doesn't have enough evidence to convict.

In hypothesis testing, the same principle applies. When we "fail to reject" our assumption, we're not saying the assumption is definitely true. We're saying the data doesn't provide strong enough evidence to overturn it.

This is why the language matters. We never say "accept the null hypothesis." We say "fail to reject" it. The distinction is not just semantic — it reflects the fundamental logic of the procedure.

💡 Key Insight

Hypothesis testing is asymmetric. You start with a default assumption and ask whether the evidence is strong enough to overturn it. The burden of proof is on the evidence, just like it's on the prosecution in a trial.


13.3 Null and Alternative Hypotheses

Now let's formalize this with proper notation.

Every hypothesis test involves two competing claims:

The Null Hypothesis ($H_0$)

The null hypothesis is the default assumption — the "nothing special is happening" claim. It's the defendant's presumption of innocence. It typically represents:

  • No effect
  • No difference
  • No change
  • The status quo

The null hypothesis always contains an equality: $=$, $\leq$, or $\geq$.

Example formulations:

| Scenario | Null Hypothesis ($H_0$) |
|---|---|
| Is the coin fair? | $H_0: p = 0.50$ |
| Has Daria improved? | $H_0: p = 0.31$ (her old percentage) |
| Does the new algorithm increase watch time? | $H_0: \mu_{\text{new}} = \mu_{\text{old}}$ (or $\mu_{\text{new}} - \mu_{\text{old}} = 0$) |
| Is the mean blood pressure above 130? | $H_0: \mu \leq 130$ |

The Alternative Hypothesis ($H_a$ or $H_1$)

The alternative hypothesis is the claim you're trying to find evidence for. It's what you suspect might be true. It represents:

  • There is an effect
  • There is a difference
  • Something has changed
  • Something departs from the status quo

The alternative hypothesis is the complement of the null. Where the null has $=$, the alternative has $\neq$, $<$, or $>$.

Example formulations:

| Scenario | Alternative Hypothesis ($H_a$) |
|---|---|
| Is the coin fair? | $H_a: p \neq 0.50$ |
| Has Daria improved? | $H_a: p > 0.31$ |
| Does the new algorithm increase watch time? | $H_a: \mu_{\text{new}} > \mu_{\text{old}}$ (or $\mu_{\text{new}} - \mu_{\text{old}} > 0$) |
| Is the mean blood pressure above 130? | $H_a: \mu > 130$ |

Rules for Writing Hypotheses

  1. Hypotheses are always about population parameters ($\mu$, $p$, $\sigma$, etc.), never about sample statistics ($\bar{x}$, $\hat{p}$, $s$). You already know the sample statistic — you calculated it. The question is what it tells you about the population.

  2. The null hypothesis always gets the benefit of the doubt. It's assumed true until the data provide strong evidence against it.

  3. The alternative hypothesis is what you'd conclude if the null is rejected. It's usually the research hypothesis — the thing the researcher hopes to show.

  4. Hypotheses must be set before looking at the data. This is crucial. Formulating your hypothesis after peeking at the data is like a prosecutor choosing which crime to charge after the jury has already heard the evidence. (We'll return to this in the ethics section.)

🔄 Spaced Review 1 (from Ch.12): Confidence Intervals — The Other Side of the Coin

In Chapter 12, you learned to construct confidence intervals — ranges of plausible values for a population parameter. Here's the connection to hypothesis testing:

A 95% confidence interval contains all parameter values that would not be rejected by a two-sided hypothesis test at the $\alpha = 0.05$ significance level.

Let that sink in. If Sam's 95% CI for Daria's true three-point percentage was (26.7%, 50.3%), that means:

  • $H_0: p = 0.31$ would not be rejected (because 0.31 is inside the interval)
  • $H_0: p = 0.20$ would be rejected (because 0.20 is outside the interval)
  • $H_0: p = 0.55$ would be rejected (because 0.55 is outside the interval)

Every value inside the CI is a null hypothesis value the data are "compatible" with. Every value outside is one the data would reject. The CI and the hypothesis test are saying the same thing in two different languages.

We'll formalize this connection in Section 13.11.
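You can read the duality straight off the interval. A tiny sketch using Sam's 95% CI from Chapter 12 (the containment check *is* the duality in action; for proportions the correspondence is approximate, since the CI and the test use slightly different standard errors):

```python
ci_low, ci_high = 0.267, 0.503    # Sam's 95% CI for Daria's true percentage

def compatible_with_data(p0):
    """A two-sided test at alpha = 0.05 fails to reject H0: p = p0
    exactly when p0 lies inside the 95% confidence interval."""
    return ci_low < p0 < ci_high

for p0 in (0.31, 0.20, 0.55):
    verdict = "not rejected" if compatible_with_data(p0) else "rejected"
    print(f"H0: p = {p0} would be {verdict}")
```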

Worked Example: Setting Up the Hypotheses

Let's practice with our four running examples.

Sam's question about Daria:

"Has Daria's three-point shooting percentage improved from her career rate of 31%?"

  • $H_0: p = 0.31$ (Daria's true shooting percentage is still 31% — no improvement)
  • $H_a: p > 0.31$ (Daria's true shooting percentage has increased)

Note: This is a one-sided alternative because Sam only cares about improvement, not decline. We'll discuss the distinction in Section 13.8.

Alex's A/B test at StreamVibe:

"Does the new recommendation algorithm increase average watch time compared to the old algorithm?"

  • $H_0: \mu_{\text{new}} - \mu_{\text{old}} = 0$ (no difference in average watch time)
  • $H_a: \mu_{\text{new}} - \mu_{\text{old}} > 0$ (the new algorithm increases average watch time)

Dr. Maya Chen's blood pressure study:

"Is the average systolic blood pressure in this county above the hypertension threshold of 130 mmHg?"

  • $H_0: \mu = 130$ (the county average equals the threshold)
  • $H_a: \mu > 130$ (the county average exceeds the threshold)

Professor Washington's algorithm audit:

"Does the predictive policing algorithm's false positive rate differ between racial groups?"

  • $H_0: p_{\text{Black}} - p_{\text{White}} = 0$ (no difference in false positive rates)
  • $H_a: p_{\text{Black}} - p_{\text{White}} \neq 0$ (false positive rates differ)

Note: Washington uses a two-sided alternative because the concern is any difference, not a difference in a specific direction.


13.4 The Test Statistic: Measuring How Far the Data Are from the Null

Once you've stated your hypotheses, the next step is to measure how far your observed data are from what the null hypothesis predicts.

This measurement is called the test statistic. It answers the question: "How many standard errors away from the null hypothesis value is my sample statistic?"

You already know how to do this. In Chapter 6, you learned about z-scores — how many standard deviations a value is from the mean. In Chapter 11, you learned about standard errors — the standard deviation of a sampling distribution. A test statistic is just a z-score applied to a sample statistic instead of an individual value.

The One-Sample z-Test for a Population Mean

When we know the population standard deviation $\sigma$ (rare in practice, but important for learning), the test statistic for a hypothesis test about a population mean is:

$$\boxed{z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}}$$

Where:

  • $\bar{x}$ is the sample mean
  • $\mu_0$ is the hypothesized population mean (the value in $H_0$)
  • $\sigma$ is the known population standard deviation
  • $n$ is the sample size
  • $\sigma / \sqrt{n}$ is the standard error of $\bar{x}$ (from Chapter 11)

In plain English: the test statistic measures how many standard errors the sample mean is from the null hypothesis value.

If $z$ is close to 0, the data are close to what the null predicts — no reason to reject. If $z$ is far from 0 (positive or negative), the data are far from what the null predicts — evidence against the null.
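To make the formula concrete, here's a minimal sketch with made-up numbers, in the spirit of Dr. Chen's blood pressure study (the values 134, 15, and 50 are invented for illustration, not her actual data):

```python
from math import sqrt

# Hypothetical setup: H0: mu = 130, known sigma = 15, n = 50, sample mean 134
x_bar, mu0, sigma, n = 134, 130, 15, 50

se = sigma / sqrt(n)        # standard error of the sample mean
z = (x_bar - mu0) / se      # standard errors above the null value

print(f"standard error = {se:.3f}")
print(f"z = {z:.2f}")       # about 1.89
```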

The One-Sample z-Test for a Population Proportion

For a proportion, the test statistic is:

$$\boxed{z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}}$$

Where:

  • $\hat{p}$ is the sample proportion
  • $p_0$ is the hypothesized population proportion (the value in $H_0$)
  • $n$ is the sample size

Important note: For hypothesis tests about proportions, we use $p_0$ (the null hypothesis value) in the standard error formula, not $\hat{p}$. This is different from confidence intervals, where we used $\hat{p}$. Why? Because when testing a hypothesis, we assume $H_0$ is true, so we use the value of $p$ that $H_0$ specifies.

Worked Example: Sam's Daria Test

Let's compute the test statistic for Sam's question about Daria.

Setup:

  • $H_0: p = 0.31$
  • $H_a: p > 0.31$
  • Observed: $\hat{p} = 25/65 \approx 0.385$ (25 three-pointers made out of 65 attempts)
  • $n = 65$

To stay consistent with the Chapter 11 analysis, we'll round to $\hat{p} = 0.38$. The precise count doesn't change the story.

Test statistic:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.38 - 0.31}{\sqrt{0.31 \times 0.69 / 65}} = \frac{0.07}{\sqrt{0.2139/65}} = \frac{0.07}{\sqrt{0.003291}} = \frac{0.07}{0.05737} = 1.22$$

A $z$-score of 1.22. Daria's observed shooting percentage is 1.22 standard errors above what we'd expect if her true percentage were still 31%.
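If you'd like to check the arithmetic by machine, here's the same calculation as a short Python sketch (numbers taken from the worked example above):

```python
from math import sqrt

p_hat, p0, n = 0.38, 0.31, 65        # observed proportion, null value, attempts

se = sqrt(p0 * (1 - p0) / n)         # standard error under H0 (uses p0, not p_hat)
z = (p_hat - p0) / se

print(f"standard error = {se:.5f}")  # about 0.05737
print(f"z = {z:.2f}")                # about 1.22
```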

Is that far enough to reject the null? That's where the p-value comes in.


13.5 The P-Value: The Most Misunderstood Number in Science

Here it is. The concept I warned you about in the chapter overview. The concept that has caused more confusion, more misinterpretation, and more damage to science than perhaps any other single number.

Let me give you the definition first, then unpack it carefully.

Definition: P-Value

The p-value is the probability of observing data as extreme as (or more extreme than) what was actually observed, assuming the null hypothesis is true.

Let me say that again, slightly differently:

The p-value answers the question: "If nothing special were happening ($H_0$ is true), how surprising would my data be?"

And one more time, because this really matters:

The p-value is a conditional probability: $P(\text{data this extreme or more extreme} \mid H_0 \text{ is true})$.

Let's apply this to Sam and Daria.

The question: If Daria's true three-point percentage is still 31% (the null hypothesis), what's the probability of observing a shooting percentage of 38% or higher in 65 attempts?

From our test statistic, $z = 1.22$. Using the standard normal table from Chapter 10:

$$p\text{-value} = P(Z \geq 1.22) = 1 - P(Z < 1.22) = 1 - 0.8888 = 0.1112$$

The p-value is approximately 0.111, or about 11%.

Interpretation: If Daria's true shooting percentage hasn't changed from 31%, there's about an 11% chance of observing a sample percentage of 38% or higher in 65 attempts, just from random variation.

That's not terribly surprising. An 11% chance isn't astronomically unlikely; it's roughly the chance of rolling a total of 9 with two dice (4 of the 36 outcomes, about 11%). It could happen with a fair process. So the data don't provide strong evidence against the null hypothesis.
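Python's standard library can do the normal-table lookup for us. A minimal sketch, using the test statistic from Section 13.4:

```python
from statistics import NormalDist

z = 1.22                              # test statistic from Section 13.4
p_value = 1 - NormalDist().cdf(z)     # upper-tail area: P(Z >= 1.22)

print(f"p-value = {p_value:.4f}")     # about 0.1112
```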

Visualizing the P-Value

Here's what the p-value looks like on the sampling distribution:

The p-value is the shaded area

         Sampling distribution
         if H₀: p = 0.31 is true
               ┌──┐
              ╱│  │╲
            ╱  │  │  ╲
          ╱    │  │    ╲
        ╱      │  │      ╲
      ╱        │  │        ╲
    ╱──────────│──│──────────╲▒▒▒▒▒▒▒
   ───────────────────────────────────
   0.19  0.25  0.31  0.37  0.43  0.49
                ↑           ↑
                p₀        observed
             (null)       p̂ = 0.38

   ▒▒▒ = p-value ≈ 0.111
   "area in the tail beyond the observed value"

The p-value is the area under the curve to the right of the observed value. It represents all the outcomes that would be at least as extreme as what we observed, if the null hypothesis were true.

What "Extreme" Means

The word "extreme" in the p-value definition means "far from what the null hypothesis predicts, in the direction specified by the alternative hypothesis."

For Sam's one-sided test ($H_a: p > 0.31$), "extreme" means "higher than 0.38."

For a two-sided test ($H_a: p \neq 0.50$), "extreme" would mean "at least as far from 0.50 in either direction" — we'd look at both tails. More on this in Section 13.8.

🔄 Spaced Review 2 (from Ch.11): The CLT — Why We Can Calculate P-Values at All

Stop for a moment and appreciate what's happening here. We computed a p-value by looking up areas under a normal curve. But why is the sampling distribution normal?

Because of the Central Limit Theorem (Chapter 11). The CLT guarantees that the sampling distribution of $\hat{p}$ is approximately normal when $n$ is large enough — regardless of the population shape. Without the CLT, we wouldn't know what distribution to use, and we couldn't calculate p-values.

The CLT → we know the shape of the sampling distribution → we can measure how surprising our data are → we can calculate p-values → we can test hypotheses.

Every p-value you've ever seen in a newspaper, a medical journal, or a business report rests on this chain of logic. The CLT isn't just a theorem. It's the engine of modern statistical inference.


13.6 What the P-Value Is NOT (The Part That Trips Everyone Up)

This section might be the most important in the entire chapter. Maybe in the entire course.

The p-value is arguably the most misunderstood concept in all of science. And the misunderstandings aren't trivial — they lead to wrong conclusions, wasted resources, and bad policy. So let's be extremely clear about what the p-value does not tell you.

⚠️ MYTH vs. REALITY: The Five Biggest P-Value Misconceptions

| # | The Myth | The Reality |
|---|---|---|
| 1 | "The p-value is the probability that the null hypothesis is true" | No. The p-value is the probability of the data (or more extreme), given that $H_0$ is true. It's $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$. |
| 2 | "A small p-value means the effect is large/important" | No. A tiny p-value can come from a trivially small effect with a huge sample size. Statistical significance ≠ practical significance. |
| 3 | "$p = 0.05$ means there's only a 5% chance the results are due to chance" | No. The p-value is calculated assuming $H_0$ is true. It doesn't tell you the probability that chance explains your results. |
| 4 | "If $p > 0.05$, the null hypothesis is true (there's no effect)" | No. Failing to reject $H_0$ means you don't have enough evidence to reject it. Absence of evidence is not evidence of absence. |
| 5 | "If $p < 0.05$ in two studies, the effects are the same" | No. Two studies can both have $p < 0.05$ but show very different effect sizes. P-values don't measure effect size. |

Let me drill deeper into Misconception #1, because it's the most consequential.

The Critical Distinction: $P(\text{data} \mid H_0)$ vs. $P(H_0 \mid \text{data})$

Suppose Sam computes a p-value of 0.111 for Daria's shooting improvement.

What the p-value says: "If Daria hasn't improved ($H_0$ is true), there's an 11% chance of observing data this extreme."

What many people think it says: "There's an 11% chance that Daria hasn't improved."

These are completely different statements.

The first is $P(\text{data} \mid H_0)$ — the probability of the data, given the hypothesis. The second is $P(H_0 \mid \text{data})$ — the probability of the hypothesis, given the data.

🔄 Spaced Review 3 (from Ch.9): Bayes' Theorem — The Reason These Aren't the Same

In Chapter 9, you learned about conditional probability and Bayes' theorem. Remember the prosecutor's fallacy? Confusing $P(A \mid B)$ with $P(B \mid A)$ is one of the most common errors in probability — and it's exactly the error people make with p-values.

$P(\text{evidence} \mid \text{innocent})$ is not the same as $P(\text{innocent} \mid \text{evidence})$.

Remember: $P(\text{positive test} \mid \text{healthy}) = 5\%$ did NOT mean $P(\text{healthy} \mid \text{positive test}) = 5\%$. The base rate mattered enormously. A false positive rate of 5% with a disease prevalence of 1% meant the positive predictive value was only about 16% — most positive tests were false alarms.

The same logic applies to p-values. A p-value of 0.05 does NOT mean there's a 5% chance the null is true. To compute $P(H_0 \mid \text{data})$, you'd need Bayes' theorem — and you'd need a prior probability for $H_0$. The frequentist p-value deliberately avoids specifying a prior, which makes it more "objective" but also makes it say less than people think it says.

We'll revisit this Bayesian perspective in Chapter 17 and beyond.

To make this concrete, let me offer an analogy.

The Rare Disease Analogy

Suppose a medical test has a false positive rate of 5%. You test positive. What's the probability you're healthy?

If you said "5%," you just committed the same error as confusing a p-value with $P(H_0 \mid \text{data})$.

From Chapter 9, you know the answer depends on the base rate. If the disease affects 1 in 1,000 people, then even with a positive test, you're probably healthy. The false positive rate (analogous to the p-value) is just one piece of the puzzle.

The ASA Statement on P-Values

In 2016, the American Statistical Association (ASA) took the unprecedented step of issuing a formal statement about the proper use and interpretation of p-values. It was the first time in the organization's 177-year history that it had made such a statement about a specific statistical practice.

The six key principles from the ASA statement:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The fact that the world's largest organization of statisticians felt the need to issue these clarifications tells you just how pervasive the misunderstandings are.

🧠 Threshold Concept: The P-Value

The p-value is one of those concepts that, once you truly understand it, transforms how you think about evidence and uncertainty.

Here is the definition one final time:

The p-value is the probability of observing data as extreme as or more extreme than what was actually observed, IF the null hypothesis were true.

It is NOT:

  • The probability that the null hypothesis is true
  • The probability of making an error
  • The probability that the result is "due to chance"
  • A measure of effect size
  • A measure of practical importance

It IS:

  • A measure of how surprising the data would be under $H_0$
  • A conditional probability: $P(\text{data this extreme} \mid H_0)$
  • A continuous measure of evidence against $H_0$ (smaller = more evidence)

Once you get this — really get it — you'll never be fooled by a headline that says "study proves X" just because $p < 0.05$. You'll start asking the right questions: How large was the effect? How big was the sample? What were the hypotheses? Were they specified in advance?

This is the thread from Chapter 1 — finally delivered. The replication crisis case study introduced p-hacking and statistical significance. Now you understand the machinery behind those concepts. You understand why p-hacking is devastating (it inflates the false positive rate) and why "statistically significant" doesn't mean "important." In Section 13.12, we'll revisit the replication crisis with your new understanding.


13.7 The Significance Level and the Decision Rule

You've computed a p-value. Now what? How do you decide whether to reject the null hypothesis?

You need a significance level, denoted by the Greek letter $\alpha$ (alpha).

What $\alpha$ Is

The significance level $\alpha$ is the threshold you set before collecting data. If the p-value falls below $\alpha$, you reject $H_0$. If it doesn't, you fail to reject.

$$\boxed{\text{If } p\text{-value} \leq \alpha, \text{ reject } H_0. \quad \text{If } p\text{-value} > \alpha, \text{ fail to reject } H_0.}$$

The most common choice is $\alpha = 0.05$, which means you'll reject the null hypothesis if the data would occur less than 5% of the time under $H_0$. But 0.05 is a convention, not a law of nature. Other common choices:

| $\alpha$ Level | When It's Used |
|---|---|
| 0.10 | Exploratory research, preliminary studies |
| 0.05 | Standard in most fields (social science, medicine, business) |
| 0.01 | When you need stronger evidence (physics, genetics) |
| 0.001 | High-stakes decisions (particle physics uses "5 sigma," which is roughly $\alpha \approx 0.0000003$) |
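The boxed decision rule translates directly into code. A minimal sketch, with Sam's p-value from Section 13.5 as one of the inputs:

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 exactly when p-value <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.111))   # Sam's test at alpha = 0.05
print(decide(0.01))    # a much smaller p-value would clear the threshold
```

Note that the rule is deliberately mechanical: all the judgment went into choosing $\alpha$ before the data were collected.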

Making the Decision: Sam's Test

Sam's p-value for Daria's improvement was approximately 0.111.

Using $\alpha = 0.05$:

  • $p\text{-value} = 0.111 > 0.05 = \alpha$
  • Decision: Fail to reject $H_0$
  • Conclusion: At the 5% significance level, there is not sufficient evidence to conclude that Daria's three-point shooting percentage has improved from 31%.

Does this mean Daria hasn't improved? No! It means Sam doesn't have enough evidence yet. With only 65 attempts, the sample size is too small to detect a modest improvement with statistical confidence. Sam needs more data.

This is exactly what we found in Chapter 12. The 95% confidence interval for Daria's true percentage was (26.7%, 50.3%). That wide interval included 31% — the null hypothesis value — which is entirely consistent with failing to reject $H_0$ at the 5% level. Same data, same conclusion, two different languages.

The Rejection Region

An equivalent way to think about the decision is through the rejection region — the set of test statistic values that would lead to rejecting $H_0$.

For a one-sided test ($H_a: p > p_0$) at $\alpha = 0.05$:

$$\text{Rejection region: } z \geq z^* = 1.645$$

Sam's test statistic was $z = 1.22$, which does NOT fall in the rejection region. So we fail to reject.

            Fail to reject          │  Reject H₀
                                    │
        ┌──────────────────────────╮│╭──────────────────
       ╱                          ╰│╯                  ╲
     ╱                             │                     ╲
   ╱───────────────────────────────│───────────────────────╲
  ────────────────────────────────────────────────────────────
  -3    -2    -1    0    1    1.645  2    3
                          ↑     ↑
                       z=1.22  z*=1.645
                       Sam's   Critical
                       result  value

  The test statistic z = 1.22 falls in the "fail to reject" region.
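The critical value 1.645 isn't magic; it's just the point that cuts off the top 5% of the standard normal curve, and you can recover it from the standard library:

```python
from statistics import NormalDist

alpha = 0.05
z_star = NormalDist().inv_cdf(1 - alpha)   # upper critical value, right-tailed test

print(f"z* = {z_star:.3f}")                # about 1.645
print(f"reject H0? {1.22 >= z_star}")      # Sam's z does not reach the region
```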

The Word "Significant"

When $p\text{-value} \leq \alpha$, we say the result is statistically significant at level $\alpha$.

🔑 Resolving the Thread from Chapter 1

"What does 'statistically significant' mean?"

Now you know. It means: the p-value is below the chosen significance level. It means the data would be surprising (at the $\alpha$ level) if the null hypothesis were true.

It does NOT mean: - The effect is large - The result is important - The finding is practically meaningful - The null hypothesis is definitely false

"Significant" in statistics doesn't mean what it means in everyday English. A "statistically significant" result can be trivially small and practically meaningless. We'll explore this gap between statistical and practical significance extensively in Chapter 17.


13.8 One-Tailed vs. Two-Tailed Tests

The choice between a one-tailed and a two-tailed test depends on the alternative hypothesis.

Two-Tailed Tests ($H_a: \mu \neq \mu_0$)

A two-tailed test is used when you care about departures from the null in either direction. The p-value includes area in both tails.

When to use: When you'd be interested in the result regardless of direction.

Example: Professor Washington tests whether the algorithm's false positive rate differs between racial groups. He'd be concerned whether it's higher or lower for Black defendants — either direction is problematic.

$H_0: p_{\text{Black}} = p_{\text{White}}$ vs. $H_a: p_{\text{Black}} \neq p_{\text{White}}$

For a two-tailed test:

$$p\text{-value} = 2 \times P(Z \geq |z|)$$

The factor of 2 accounts for both tails. If $z = 2.10$:

$$p\text{-value} = 2 \times P(Z \geq 2.10) = 2 \times 0.0179 = 0.0358$$

Two-tailed test: p-value includes BOTH tails

                  ╭──────────╮
                ╱              ╲
              ╱                  ╲
    ▒▒▒▒    ╱                      ╲    ▒▒▒▒
    ▒▒▒▒▒ ╱                          ╲ ▒▒▒▒▒
   ──────────────────────────────────────────────
   -3    -2.10  -1    0    1    2.10    3
          ↑                      ↑
     left tail               right tail
     area ≈ 0.018            area ≈ 0.018

   p-value = 0.018 + 0.018 = 0.036
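You can verify this arithmetic with scipy's survival function (`stats.norm.sf`, the upper-tail area):

```python
from scipy import stats

z = 2.10
p_right = stats.norm.sf(abs(z))   # P(Z >= |z|), the upper-tail area
p_two = 2 * p_right               # double it to cover both tails
print(f"one tail:  {p_right:.4f}")   # 0.0179
print(f"two tails: {p_two:.4f}")     # 0.0357
```

(The hand calculation above gives 0.036 because 0.0179 was rounded before doubling.)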

One-Tailed Tests ($H_a: \mu > \mu_0$ or $H_a: \mu < \mu_0$)

A one-tailed test is used when you have a specific directional prediction. The p-value includes area in only one tail.

When to use: When the alternative is directional AND you would not take action for an effect in the opposite direction.

Example: Sam tests whether Daria's shooting has improved. He wouldn't react to evidence that she'd gotten worse — that's a different question.

$H_0: p = 0.31$ vs. $H_a: p > 0.31$

For a right-tailed test:

$$p\text{-value} = P(Z \geq z)$$

For Sam: $p\text{-value} = P(Z \geq 1.22) = 0.111$

Choosing: When in Doubt, Use Two-Tailed

Here's a practical guide:

| Use One-Tailed When... | Use Two-Tailed When... |
|---|---|
| You have a specific direction in mind before seeing data | You'd be interested in an effect in either direction |
| An effect in the other direction is irrelevant | You're exploring rather than confirming |
| The research question is inherently directional | You want to be conservative |
| You committed to the direction before data collection | You're not sure |

The general recommendation: When in doubt, use a two-tailed test. It's more conservative (harder to reject $H_0$), and it protects you from the temptation of choosing the tail after seeing which direction the data went.

For symmetric distributions like the normal, a two-tailed p-value is exactly twice the one-tailed p-value whenever the observed statistic falls on the side the one-tailed test predicted. So a one-tailed test is "easier" to achieve significance with — which is precisely why you need to be careful about using it.

⚠️ Warning: The One-Tail Temptation

Here's a common form of p-hacking: a researcher plans a two-tailed test, sees that $p = 0.07$ (not significant at $\alpha = 0.05$), and then switches to a one-tailed test, getting $p = 0.035$ (significant!). This is dishonest. The decision about one- vs. two-tailed must be made before looking at the data. Switching after the fact inflates the false positive rate.


13.9 Type I and Type II Errors: The Two Ways to Be Wrong

No matter how careful you are, hypothesis testing can lead to errors. There are exactly two types, and they have a seesaw relationship.

Type I Error: The False Alarm

A Type I error occurs when you reject $H_0$ when it's actually true. It's a false positive — you declared something significant that was actually just noise.

In the courtroom analogy: convicting an innocent person.

In Sam's case: concluding Daria improved when she actually didn't.

The probability of a Type I error is exactly $\alpha$, the significance level. If you set $\alpha = 0.05$, you'll commit a Type I error 5% of the time when $H_0$ is true.
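This claim is easy to check by simulation. Here's a sketch (using scipy's one-sample t-test, with illustrative parameters) that draws thousands of samples in which $H_0$ is true and counts how often the test rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n, reps = 0.05, 30, 20_000

# Every sample is drawn with mu = 0, so H0: mu = 0 is TRUE in all of them.
samples = rng.normal(loc=0, scale=1, size=(reps, n))
_, pvals = stats.ttest_1samp(samples, popmean=0, axis=1)

type_i_rate = np.mean(pvals <= alpha)
print(f"Rejection rate under a true H0: {type_i_rate:.3f}")  # close to alpha
```

The observed rejection rate hovers right around 5% — the false alarm rate is exactly the α you chose.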

Type II Error: The Missed Detection

A Type II error occurs when you fail to reject $H_0$ when it's actually false. It's a false negative — you missed a real effect.

In the courtroom analogy: acquitting a guilty person.

In Sam's case: concluding there's not enough evidence of improvement when Daria actually did improve.

The probability of a Type II error is denoted $\beta$. We'll explore this in depth in Chapter 17 when we discuss statistical power ($\text{Power} = 1 - \beta$).

The Error Matrix

Here's the complete picture:

| | $H_0$ is actually TRUE | $H_0$ is actually FALSE |
|---|---|---|
| Reject $H_0$ | Type I Error (false alarm), probability $= \alpha$ | Correct Decision (detected real effect), probability $= 1 - \beta$ = Power |
| Fail to reject $H_0$ | Correct Decision (rightly didn't reject), probability $= 1 - \alpha$ | Type II Error (missed detection), probability $= \beta$ |

The Seesaw Relationship

Here's the catch: you can't reduce both error rates simultaneously (for a given sample size).

  • Lower $\alpha$ (e.g., from 0.05 to 0.01) → fewer Type I errors but more Type II errors (harder to detect real effects)
  • Higher $\alpha$ (e.g., from 0.05 to 0.10) → fewer Type II errors but more Type I errors (more false alarms)

The only way to reduce both error rates is to increase the sample size. More data gives you more power to detect real effects while maintaining a low false alarm rate.
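Here's a sketch of that escape hatch, computing power analytically for a right-tailed z-test (illustrative numbers: a true effect of 5 mmHg with $\sigma = 20$, holding $\alpha = 0.05$ fixed):

```python
import numpy as np
from scipy import stats

alpha, effect, sigma = 0.05, 5, 20       # true mean sits 5 above mu_0
z_star = stats.norm.ppf(1 - alpha)       # 1.645 for a right-tailed test

powers = {}
for n in [16, 64, 256]:
    shift = effect / (sigma / np.sqrt(n))        # how far H_a pushes z
    powers[n] = stats.norm.sf(z_star - shift)    # P(reject | H_a is true)
    print(f"n = {n:>3}: power = {powers[n]:.2f}, beta = {1 - powers[n]:.2f}")
```

The Type I rate stays pinned at α throughout; only β shrinks as n grows.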

Real-World Consequences

The stakes of these errors depend entirely on context. Here are some examples:

| Scenario | Type I Error (False Alarm) | Type II Error (Missed Detection) | Which Is Worse? |
|---|---|---|---|
| Drug approval | Approve an ineffective (or harmful) drug | Reject an effective drug | Type I: patients harmed |
| Cancer screening | Tell a healthy person they have cancer | Miss an actual cancer | Type II: cancer progresses untreated |
| Criminal trial | Convict an innocent person | Acquit a guilty person | Type I: justice presumes innocence until proven guilty |
| Spam filter | Send a real email to spam | Let spam into your inbox | Type I: you miss an important email |
| Quality control | Shut down a working production line | Let defective products ship | Depends on product and defect |
| Alex's A/B test | Roll out an algorithm that doesn't help | Stick with the old algorithm when the new one is better | Type I: wastes engineering resources |

💡 Key Insight

The choice of $\alpha$ should reflect which error is more costly. If Type I errors are catastrophic (like convicting an innocent person), set $\alpha$ low (0.01 or lower). If Type II errors are catastrophic (like missing a cancer diagnosis), you might tolerate a higher $\alpha$ and focus on maximizing power.

The default $\alpha = 0.05$ is a compromise. It's not sacred. It's a convention that balances the two error types for a "typical" scenario — but many scenarios aren't typical.

Dr. Maya Chen's Dilemma

Let's make this concrete with Maya's work.

Maya is testing whether a county's diabetes prevalence exceeds the national rate of 11.3%. She has two concerns:

Type I Error: Concluding the county has elevated diabetes when it doesn't. Consequence: the county health board allocates extra resources for diabetes prevention that aren't needed, diverting funding from other programs.

Type II Error: Failing to detect an elevated diabetes rate when one exists. Consequence: a real public health problem goes unaddressed, and at-risk residents don't receive the screenings and programs they need.

For Maya, the Type II error seems worse — missing a real public health problem could harm an entire community. She might choose $\alpha = 0.10$ instead of $\alpha = 0.05$ to increase her power (reduce the chance of a Type II error), accepting a slightly higher risk of false alarm.


13.10 Putting It All Together: The Five Steps of Hypothesis Testing

Here's the complete procedure, from start to finish.

Step 1: State the Hypotheses

Write $H_0$ and $H_a$ in words and symbols. Decide whether the test is one-tailed or two-tailed. Do this before looking at the data.

Step 2: Choose the Significance Level

Set $\alpha$ (typically 0.05). Consider the consequences of Type I and Type II errors.

Step 3: Compute the Test Statistic

Calculate how many standard errors the sample statistic is from the null hypothesis value.

For a mean (known $\sigma$): $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

For a proportion: $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

Step 4: Find the P-Value

Calculate the probability of observing a test statistic this extreme (or more extreme) under $H_0$.

  • One-tailed (right): $p = P(Z \geq z)$
  • One-tailed (left): $p = P(Z \leq z)$
  • Two-tailed: $p = 2 \times P(Z \geq |z|)$
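These three cases can be wrapped in one small helper (a convenience sketch — `p_value` is not a scipy function):

```python
from scipy import stats

def p_value(z, tail):
    """P-value for a z test statistic; tail is 'right', 'left', or 'two'."""
    if tail == "right":
        return stats.norm.sf(z)            # P(Z >= z)
    if tail == "left":
        return stats.norm.cdf(z)           # P(Z <= z)
    return 2 * stats.norm.sf(abs(z))       # both tails

print(f"{p_value(1.22, 'right'):.4f}")   # Sam's test: 0.1112
print(f"{p_value(2.10, 'two'):.4f}")     # two-tailed example: 0.0357
```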

Step 5: State Your Conclusion

  • If $p\text{-value} \leq \alpha$: Reject $H_0$. State conclusion in context.
  • If $p\text{-value} > \alpha$: Fail to reject $H_0$. State conclusion in context.

Always interpret in context — never just say "reject" or "fail to reject" without explaining what that means for the real-world question.

Full Worked Example: Maya's Blood Pressure Test

A note before we set it up: in Chapter 12, Maya's county sample gave $\bar{x} = 128.3$ mmHg with a 95% CI of (124.9, 131.7). Testing whether that county's mean exceeds 130 would be pointless — the sample mean already sits below 130, so the data point the other way. Instead, we'll use new data from a different community.

Maya has a random sample of $n = 64$ adults with mean systolic blood pressure $\bar{x} = 134.2$ mmHg. From national health surveys, the population standard deviation is known: $\sigma = 20$ mmHg. She wants to test whether this community's mean exceeds the hypertension threshold of 130 mmHg.

Step 1: State the Hypotheses

  • $H_0: \mu = 130$ (population mean BP is 130)
  • $H_a: \mu > 130$ (population mean BP exceeds 130)
  • One-tailed test (right-tailed)

Step 2: Choose the Significance Level

$\alpha = 0.05$

Step 3: Compute the Test Statistic

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{134.2 - 130}{20 / \sqrt{64}} = \frac{4.2}{20/8} = \frac{4.2}{2.5} = 1.68$$

Step 4: Find the P-Value

$$p\text{-value} = P(Z \geq 1.68) = 1 - P(Z < 1.68) = 1 - 0.9535 = 0.0465$$

Step 5: State Your Conclusion

Since $p\text{-value} = 0.0465 < 0.05 = \alpha$, we reject $H_0$.

In context: At the 5% significance level, there is sufficient evidence to conclude that the mean systolic blood pressure in this community exceeds 130 mmHg. The sample data suggest this community may have an elevated average blood pressure, which could warrant further investigation and public health intervention.

Note what we are NOT saying: We are not saying the mean is definitely above 130. We are not saying the mean is 134.2. We're saying the data are statistically inconsistent with a mean of 130 or below, at the 5% significance level.

The Steps in a Table

| Step | What You Do | Maya's Example |
|---|---|---|
| 1. Hypotheses | State $H_0$ and $H_a$ | $H_0: \mu = 130$, $H_a: \mu > 130$ |
| 2. $\alpha$ | Choose significance level | $\alpha = 0.05$ |
| 3. Test statistic | Compute $z$ | $z = \frac{134.2 - 130}{20/\sqrt{64}} = 1.68$ |
| 4. P-value | Find tail probability | $P(Z \geq 1.68) = 0.0465$ |
| 5. Conclusion | Compare $p$ to $\alpha$, interpret | $0.0465 < 0.05$: reject $H_0$. Evidence that $\mu > 130$. |

13.11 The Duality: Hypothesis Tests and Confidence Intervals

I've hinted at this throughout the chapter, and now it's time to make it explicit.

The Duality Principle

A two-sided hypothesis test at significance level $\alpha$ rejects $H_0: \mu = \mu_0$ if and only if $\mu_0$ falls outside the $(1 - \alpha) \times 100\%$ confidence interval for $\mu$.

In other words:

  • Reject $H_0: \mu = \mu_0$ at $\alpha = 0.05$ $\iff$ $\mu_0$ is NOT in the 95% CI
  • Fail to reject $H_0: \mu = \mu_0$ at $\alpha = 0.05$ $\iff$ $\mu_0$ IS in the 95% CI

This makes perfect sense when you think about it. The confidence interval contains all "plausible" values for the parameter. If the hypothesized value isn't plausible, you reject it. If it is plausible, you don't.

Example: Sam and Daria

Sam's 95% CI for Daria's true three-point percentage: (26.7%, 50.3%)

  • Test $H_0: p = 0.31$: 0.31 is inside the CI → fail to reject. ✓ (Matches our p-value conclusion: $p = 0.111 > 0.05$)
  • Test $H_0: p = 0.20$: 0.20 is outside the CI → reject at $\alpha = 0.05$
  • Test $H_0: p = 0.55$: 0.55 is outside the CI → reject at $\alpha = 0.05$
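You can check this duality with Sam's numbers. One caveat: for proportions the match is only approximate, because the CI computes its standard error at $\hat{p}$ while the test computes it at $p_0$ — but here the two conclusions agree at every hypothesized value:

```python
import numpy as np
from scipy import stats

p_hat, n = 25 / 65, 65
se_ci = np.sqrt(p_hat * (1 - p_hat) / n)       # CI standard error uses p-hat
lo, hi = p_hat - 1.96 * se_ci, p_hat + 1.96 * se_ci
print(f"95% CI: ({lo:.3f}, {hi:.3f})")         # roughly (0.267, 0.503)

for p0 in [0.31, 0.20, 0.55]:
    se_test = np.sqrt(p0 * (1 - p0) / n)       # test standard error uses p0
    z = (p_hat - p0) / se_test
    p_two = 2 * stats.norm.sf(abs(z))
    in_ci = lo <= p0 <= hi
    agree = (p_two > 0.05) == in_ci
    print(f"H0: p = {p0}: two-tailed p = {p_two:.4f}, "
          f"in CI: {in_ci}, conclusions agree: {agree}")
```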

When to Use Which

| Use a Confidence Interval When... | Use a Hypothesis Test When... |
|---|---|
| You want to estimate a parameter | You want to test a specific claim |
| You want to show the range of plausible values | You need a yes/no decision |
| You want to communicate uncertainty | A threshold matters (e.g., "does it exceed 130?") |
| The audience is general | The audience expects a significance test |

In practice, reporting both is best. The CI tells you the range of plausible values; the hypothesis test tells you whether a specific value is plausible. Together, they give a more complete picture than either alone.


13.12 The Replication Crisis Revisited: Now You Understand Why

Remember the replication crisis from Chapter 1? The case study described how a study on "precognition" passed peer review using standard statistical methods, and how the Open Science Collaboration found that only 36% of 100 psychology studies replicated successfully.

At the time, I told you about p-hacking, publication bias, and small sample sizes, but you didn't yet have the vocabulary to understand why these practices are so damaging. Now you do.

P-Hacking: The Garden of Forking Paths

P-hacking means manipulating the analysis until you get $p < 0.05$. Here's why it's devastating:

When you set $\alpha = 0.05$, you're saying: "If I run this test once, and $H_0$ is true, there's only a 5% chance of a false positive." But that 5% guarantee only holds for a single, pre-specified test.

If you test 20 different hypotheses on the same dataset, the probability of finding at least one false positive is:

$$P(\text{at least one false positive in 20 tests}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64$$

That's a 64% chance of finding something "significant" even if nothing real is going on.
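The same formula, evaluated for a few values of the number of tests:

```python
# Family-wise false positive rate: alpha compounds across tests
alpha = 0.05
fwer = {k: 1 - (1 - alpha) ** k for k in [1, 5, 20, 100]}
for k, rate in fwer.items():
    print(f"{k:>3} tests: P(at least one false positive) = {rate:.2f}")
```

With 100 tests, a false positive is all but guaranteed.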

🔬 Ethical Analysis: The Garden of Forking Paths

Statistician Andrew Gelman describes the problem as the "garden of forking paths." At each step of a data analysis, the researcher faces choices:

  • Which variables to include?
  • Which subgroups to analyze?
  • How to handle outliers?
  • Which statistical test to use?
  • One-tailed or two-tailed?
  • What counts as the outcome variable?

Each choice is a fork in the path. If the researcher explores many paths and only reports the one that produced $p < 0.05$, the reported p-value is meaningless — it no longer has the interpretation "there's only a 5% chance of a false positive."

This is why pre-registration — publicly committing to your hypotheses and analysis plan before collecting data — has become a cornerstone of credible science. When a study is pre-registered, you know the researchers didn't explore dozens of paths and cherry-pick the one that "worked."

The ethical principle: A p-value is only valid when the hypothesis and analysis plan were specified before the data were examined. Post-hoc hypothesis testing is exploration, not confirmation — and there's nothing wrong with exploration, as long as you label it honestly.

Theme 6 Connection: P-Hacking and Ethics

P-hacking isn't always intentional. Researchers often don't realize they're doing it. The "garden of forking paths" problem can arise from innocent-seeming choices: "Let me just check if the effect is stronger in women" or "What if I remove that outlier?" or "The effect disappears when I include age — so let me not include age."

Each of these decisions, individually, seems reasonable. But collectively, they inflate the false positive rate far beyond the nominal 5%. And in a "publish or perish" academic culture, there's strong incentive to keep exploring until something "works."

This is one of the most important ethical challenges in modern data analysis. It doesn't require bad intent — just bad practices. And the solution isn't to abandon p-values (as some have suggested), but to use them correctly: pre-specify hypotheses, report all analyses (not just significant ones), and distinguish between confirmatory and exploratory research.

Why Small Samples Make It Worse

In Chapter 11, you learned that the standard error shrinks in proportion to $1/\sqrt{n}$. For small samples, the standard error is large, which means sample statistics bounce around a lot from sample to sample.

This means:

  • Small samples are more likely to produce extreme test statistics by chance
  • Small-sample studies that find "significant" effects tend to overestimate the true effect size (this is called the "winner's curse")
  • Replication studies with larger samples often find smaller effects

The Open Science Collaboration found that the average effect size in replications was about half the original. This is exactly what you'd expect if many original studies had small samples and were published because they happened to find large, significant effects.
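A simulation sketch of the winner's curse (illustrative parameters: many small studies of $n = 20$ with a modest true effect of 0.3): averaging only the "significant" studies overstates the effect, even though the full collection of studies is unbiased.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n, reps = 0.3, 20, 50_000        # small studies, modest effect

samples = rng.normal(loc=true_effect, scale=1.0, size=(reps, n))
means = samples.mean(axis=1)                  # each study's estimate
_, pvals = stats.ttest_1samp(samples, popmean=0, axis=1)

published = means[pvals < 0.05]   # only "significant" studies get published
print(f"True effect:                  {true_effect}")
print(f"Average estimate (all runs):  {means.mean():.3f}")
print(f"Average estimate (published): {published.mean():.3f}")
```

The published average lands well above the true effect — exactly the inflation that larger replications then deflate.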

Publication Bias: The File Drawer Problem

Publication bias compounds the problem. If 20 labs test the same null hypothesis, and only the one that finds $p < 0.05$ publishes its result, the published literature looks like there's strong evidence — when in fact, 19 out of 20 studies found nothing.

The published p-value of 0.03 looks convincing in isolation. But in context — as one of 20 independent tests — it's exactly what you'd expect from chance.

Connection to Theme 4: Uncertainty as Framework

The replication crisis is, at its core, a failure to take uncertainty seriously. When researchers treat $p < 0.05$ as a binary verdict of "true" rather than as one piece of evidence within a larger framework of uncertainty, they overstate their confidence. When journals only publish "significant" findings, they create a literature that systematically understates uncertainty.

The solution is the one this course has been building toward since Chapter 1: embrace uncertainty. Report effect sizes and confidence intervals alongside p-values. Acknowledge what you don't know. Distinguish between "we found evidence of X" and "we proved X." Statistics is the science of making decisions under uncertainty — and pretending the uncertainty doesn't exist defeats the entire purpose.


13.13 Python: Hypothesis Testing in Practice

Let's implement hypothesis tests in Python. We'll start with manual calculations and then use scipy.stats functions.

Manual Z-Test for a Proportion

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# ============================================================
# Sam's test: Has Daria's shooting improved?
# ============================================================

# Data
p_hat = 0.38       # observed proportion (25 out of ~65)
p_0 = 0.31         # null hypothesis value (last season)
n = 65              # number of attempts
alpha = 0.05        # significance level

# Step 1: State hypotheses
print("=" * 55)
print("HYPOTHESIS TEST: Daria's Three-Point Shooting")
print("=" * 55)
print(f"H₀: p = {p_0}  (no improvement)")
print(f"Hₐ: p > {p_0}  (improvement)")
print(f"α = {alpha}")
print(f"Test type: one-tailed (right)")

# Step 2: Compute test statistic
se = np.sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / se

print(f"\nStep 2: Test Statistic")
print(f"  SE = √({p_0} × {1-p_0} / {n}) = {se:.4f}")
print(f"  z = ({p_hat} - {p_0}) / {se:.4f} = {z:.2f}")

# Step 3: Find p-value (one-tailed, right)
p_value = 1 - stats.norm.cdf(z)

print(f"\nStep 3: P-value")
print(f"  P(Z ≥ {z:.2f}) = {p_value:.4f}")

# Step 4: Decision
print(f"\nStep 4: Decision")
if p_value <= alpha:
    print(f"  p-value ({p_value:.4f}) ≤ α ({alpha}): REJECT H₀")
    print(f"  Evidence supports that Daria has improved.")
else:
    print(f"  p-value ({p_value:.4f}) > α ({alpha}): FAIL TO REJECT H₀")
    print(f"  Insufficient evidence to conclude Daria has improved.")

# Output:
# ======================================================
# HYPOTHESIS TEST: Daria's Three-Point Shooting
# ======================================================
# H₀: p = 0.31  (no improvement)
# Hₐ: p > 0.31  (improvement)
# α = 0.05
# Test type: one-tailed (right)
#
# Step 2: Test Statistic
#   SE = √(0.31 × 0.69 / 65) = 0.0574
#   z = (0.38 - 0.31) / 0.0574 = 1.22
#
# Step 3: P-value
#   P(Z ≥ 1.22) = 0.1112
#
# Step 4: Decision
#   p-value (0.1112) > α (0.05): FAIL TO REJECT H₀
#   Insufficient evidence to conclude Daria has improved.

Manual Z-Test for a Mean

# ============================================================
# Maya's test: Is community mean BP above 130?
# ============================================================

x_bar = 134.2       # sample mean
mu_0 = 130          # null hypothesis value
sigma = 20          # known population SD
n = 64              # sample size
alpha = 0.05

print("\n" + "=" * 55)
print("HYPOTHESIS TEST: Community Blood Pressure")
print("=" * 55)
print(f"H₀: μ = {mu_0}")
print(f"Hₐ: μ > {mu_0}")
print(f"α = {alpha}")

# Test statistic
se = sigma / np.sqrt(n)
z = (x_bar - mu_0) / se

print(f"\nTest Statistic:")
print(f"  SE = {sigma} / √{n} = {se:.4f}")
print(f"  z = ({x_bar} - {mu_0}) / {se:.4f} = {z:.2f}")

# P-value
p_value = 1 - stats.norm.cdf(z)
print(f"\nP-value: P(Z ≥ {z:.2f}) = {p_value:.4f}")

# Decision
if p_value <= alpha:
    print(f"\nDecision: Reject H₀ (p = {p_value:.4f} < {alpha})")
    print(f"Conclusion: Evidence that mean BP exceeds {mu_0} mmHg.")
else:
    print(f"\nDecision: Fail to reject H₀ (p = {p_value:.4f} > {alpha})")

# Output:
# HYPOTHESIS TEST: Community Blood Pressure
# ======================================================
# H₀: μ = 130
# Hₐ: μ > 130
# α = 0.05
#
# Test Statistic:
#   SE = 20 / √64 = 2.5000
#   z = (134.2 - 130) / 2.5000 = 1.68
#
# P-value: P(Z ≥ 1.68) = 0.0465
#
# Decision: Reject H₀ (p = 0.0465 < 0.05)
# Conclusion: Evidence that mean BP exceeds 130 mmHg.

Using scipy.stats.ttest_1samp()

In practice, you'll rarely know $\sigma$, so you'll use a t-test instead of a z-test. The scipy.stats.ttest_1samp() function handles this automatically.

# ============================================================
# One-sample t-test using scipy.stats
# ============================================================

# Simulate Maya's blood pressure sample data
np.random.seed(42)
bp_data = np.random.normal(loc=134.2, scale=20, size=64)
# Note: in real work, you'd use actual data, not simulated

# Test: Is the mean different from 130?
t_stat, p_value_two = stats.ttest_1samp(bp_data, popmean=130)

print("=" * 55)
print("ONE-SAMPLE T-TEST (scipy.stats)")
print("=" * 55)
print(f"Sample mean: {bp_data.mean():.2f}")
print(f"Sample SD: {bp_data.std(ddof=1):.2f}")
print(f"n = {len(bp_data)}")
print(f"\nTest: H₀: μ = 130 vs Hₐ: μ ≠ 130")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value_two:.4f}")

# For a one-tailed test (Hₐ: μ > 130):
if t_stat > 0:
    p_value_one = p_value_two / 2
else:
    p_value_one = 1 - p_value_two / 2

print(f"\nFor one-tailed test (Hₐ: μ > 130):")
print(f"p-value (one-tailed): {p_value_one:.4f}")

# IMPORTANT: scipy's ttest_1samp always returns a TWO-tailed p-value.
# For a one-tailed test:
#   - If the t-statistic is in the expected direction: p_one = p_two / 2
#   - If it's in the opposite direction: p_one = 1 - p_two / 2

Visualizing the P-Value

# ============================================================
# Visualizing the p-value for Sam's test
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- Panel 1: One-tailed test (Sam's) ---
ax = axes[0]
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x)

ax.plot(x, y, 'k-', linewidth=2)
ax.fill_between(x, y, where=(x >= 1.22), color='coral',
                alpha=0.4, label=f'p-value = {stats.norm.sf(1.22):.3f}')
ax.axvline(x=1.22, color='red', linestyle='--', linewidth=1.5,
           label=f'z = 1.22 (observed)')
ax.axvline(x=1.645, color='blue', linestyle=':', linewidth=1.5,
           label=f'z* = 1.645 (critical)')

ax.set_title("Sam's Test: Has Daria Improved?\n"
             r"$H_0: p = 0.31$ vs $H_a: p > 0.31$",
             fontsize=12)
ax.set_xlabel('z-score')
ax.set_ylabel('Density')
ax.legend(fontsize=9)
ax.annotate('Fail to reject H₀\n(z < z*)',
            xy=(0, 0.15), fontsize=10, ha='center',
            color='green')

# --- Panel 2: Two-tailed test ---
ax = axes[1]
ax.plot(x, y, 'k-', linewidth=2)
z_obs = 2.10
ax.fill_between(x, y, where=(x >= z_obs), color='coral',
                alpha=0.4)
ax.fill_between(x, y, where=(x <= -z_obs), color='coral',
                alpha=0.4,
                label=f'p-value = {2*0.0179:.3f}')
ax.axvline(x=z_obs, color='red', linestyle='--', linewidth=1.5)
ax.axvline(x=-z_obs, color='red', linestyle='--', linewidth=1.5,
           label=f'z = ±{z_obs}')
ax.axvline(x=1.96, color='blue', linestyle=':', linewidth=1.5)
ax.axvline(x=-1.96, color='blue', linestyle=':', linewidth=1.5,
           label=f'z* = ±1.96 (critical)')

ax.set_title("Two-Tailed Test Example\n"
             r"$H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$",
             fontsize=12)
ax.set_xlabel('z-score')
ax.set_ylabel('Density')
ax.legend(fontsize=9)
ax.annotate('Reject H₀\n(|z| > z*)',
            xy=(2.5, 0.12), fontsize=10, ha='center',
            color='red')

plt.tight_layout()
plt.savefig('p_value_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

Demonstrating P-Hacking with Simulation

# ============================================================
# SIMULATION: Why p-hacking inflates the false positive rate
# ============================================================

np.random.seed(2024)
n_simulations = 10000
n_tests_per_sim = 20      # test 20 hypotheses per "study"
alpha = 0.05

# Each "study": draw from the null (mean = 0), test 20 variables
at_least_one_significant = 0

for _ in range(n_simulations):
    found_significant = False
    for _ in range(n_tests_per_sim):
        # Generate data from the null: mean = 0, sd = 1, n = 30
        data = np.random.normal(0, 1, 30)
        _, p = stats.ttest_1samp(data, 0)
        if p < alpha:
            found_significant = True
            break
    if found_significant:
        at_least_one_significant += 1

false_positive_rate = at_least_one_significant / n_simulations
theoretical = 1 - (1 - alpha) ** n_tests_per_sim

print("=" * 55)
print("P-HACKING SIMULATION")
print("=" * 55)
print(f"Setup: {n_tests_per_sim} tests per study, α = {alpha}")
print(f"All data generated under H₀ (nothing real)")
print(f"\nResults over {n_simulations:,} simulated studies:")
print(f"  Studies with at least one 'significant' result: "
      f"{at_least_one_significant:,} / {n_simulations:,}")
print(f"  False positive rate: {false_positive_rate:.1%}")
print(f"  Theoretical: 1 - (1-{alpha})^{n_tests_per_sim} = "
      f"{theoretical:.1%}")
print(f"\nWith just 1 test per study, the rate would be {alpha:.0%}.")
print(f"Testing {n_tests_per_sim} things inflates it to "
      f"{theoretical:.0%}!")

# Output:
# P-HACKING SIMULATION
# ======================================================
# Setup: 20 tests per study, α = 0.05
# All data generated under H₀ (nothing real)
#
# Results over 10,000 simulated studies:
#   Studies with at least one 'significant' result: 6,415 / 10,000
#   False positive rate: 64.2%
#   Theoretical: 1 - (1-0.05)^20 = 64.2%
#
# With just 1 test per study, the rate would be 5%.
# Testing 20 things inflates it to 64%!

Alex's A/B Test Connection

# ============================================================
# Alex's A/B Test: Does the new algorithm increase watch time?
# ============================================================

# Simulated data (in practice, Alex has real data)
np.random.seed(123)
old_algorithm = np.random.normal(loc=52, scale=22, size=500)
new_algorithm = np.random.normal(loc=55, scale=22, size=500)

# For now, let's focus on the new algorithm vs. a benchmark
# Full two-sample test comes in Chapter 16
# Here: one-sample test — is the new group's mean > 52 (old benchmark)?

t_stat, p_two = stats.ttest_1samp(new_algorithm, popmean=52)

print("=" * 55)
print("ALEX'S A/B TEST (simplified)")
print("=" * 55)
print(f"Old algorithm benchmark: μ₀ = 52 minutes")
print(f"New algorithm sample: n = {len(new_algorithm)}")
print(f"  Mean: {new_algorithm.mean():.2f} minutes")
print(f"  SD: {new_algorithm.std(ddof=1):.2f} minutes")
print(f"\nH₀: μ = 52  (no improvement)")
print(f"Hₐ: μ > 52  (algorithm improves watch time)")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_two:.4f}")

# One-tailed conversion
p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2
print(f"p-value (one-tailed): {p_one:.4f}")

if p_one < 0.05:
    print("\nDecision: Reject H₀ at α = 0.05")
    print("Evidence that the new algorithm increases watch time.")
else:
    print("\nDecision: Fail to reject H₀ at α = 0.05")
    print("Insufficient evidence of improvement.")

# Note: The real two-sample comparison (Chapter 16) is the proper
# approach for A/B testing. This one-sample version is simplified
# for illustration.

13.14 The Connection to Alex's A/B Test (Theme 5 Preview)

Let's think about what Alex at StreamVibe actually does with hypothesis testing.

Alex randomly assigns 1,000 users to two groups: 500 see the old recommendation algorithm, and 500 see the new one. After a week, she measures average watch time.

| Group | $n$ | $\bar{x}$ | $s$ |
|---|---|---|---|
| Old algorithm | 500 | 52.1 min | 21.8 min |
| New algorithm | 500 | 54.7 min | 22.3 min |

The difference is 2.6 minutes. Is that real, or noise?

Alex sets up the test:

  • $H_0: \mu_{\text{new}} - \mu_{\text{old}} = 0$ (no difference)
  • $H_a: \mu_{\text{new}} - \mu_{\text{old}} > 0$ (new is better)

This is a two-sample test, which we'll formalize in Chapter 16. But the logic is identical to what you've learned here. You compute a test statistic (how many standard errors is the observed difference from zero?), find a p-value, and make a decision.
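Chapter 16 will formalize this, but scipy can already run the comparison straight from the summary statistics above — a preview sketch using `ttest_ind_from_stats` (Welch's version, which doesn't assume equal variances):

```python
from scipy import stats

# Summary statistics from Alex's A/B test table
t_stat, p_two = stats.ttest_ind_from_stats(
    mean1=54.7, std1=22.3, nobs1=500,   # new algorithm
    mean2=52.1, std2=21.8, nobs2=500,   # old algorithm
    equal_var=False,                    # Welch's t-test
)
p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2   # right-tailed H_a
print(f"t = {t_stat:.2f}, one-tailed p = {p_one:.4f}")
```

The observed 2.6-minute difference is about 1.9 standard errors from zero, so the one-tailed p-value comes in just under 0.05.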

Theme 5 Connection: Correlation vs. Causation

Here's why Alex's A/B test is special: because she randomly assigned users to groups (Chapter 4), a significant result supports a causal claim. The new algorithm caused the increase in watch time.

Compare this to an observational study where Alex simply compared users who chose the new algorithm vs. those who stuck with the old one. Any difference could be confounded — maybe users who opt in are already more engaged. Without randomization, a significant p-value doesn't establish causation.

The hypothesis test tells you whether the difference is statistically real. The study design tells you whether you can make a causal claim. You need both.


13.15 Data Detective Portfolio: Formulate and Test a Hypothesis

Time to apply hypothesis testing to your own dataset. This is the Chapter 13 component of the Data Detective Portfolio.

Your Task

Formulate and test a hypothesis about a population parameter in your dataset.

  1. Choose a variable and formulate hypotheses. Based on your exploratory work in previous chapters, identify a question that can be answered with a one-sample hypothesis test.
     • State $H_0$ and $H_a$ in words and symbols
     • Justify your choice of one-tailed vs. two-tailed
     • Explain why you chose this hypothesis (what question does it answer?)

  2. Check conditions. Verify the requirements for the test:
     • Random sample (or approximately representative)
     • Independence (10% condition)
     • Normality (large sample via CLT, or approximately normal population)

  3. Conduct the test.
     • Compute the test statistic
     • Calculate the p-value
     • State your decision at $\alpha = 0.05$

  4. Interpret in context. Write 2-3 sentences explaining what the result means in the context of your research question. Don't just say "reject" or "fail to reject" — explain what that tells you about the real world.

  5. Connect to your CI. Look at the confidence interval you built in Chapter 12 for the same variable. Does the CI agree with the hypothesis test? Explain the connection.

Template Code

import pandas as pd
import numpy as np
from scipy import stats

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# ============================================================
# Part 1: One-Sample Hypothesis Test for a Mean
# ============================================================
variable = 'your_numerical_variable'
mu_0 = 0  # ← Replace with your null hypothesis value

data = df[variable].dropna()
n = len(data)
x_bar = data.mean()
s = data.std(ddof=1)
se = s / np.sqrt(n)

print("=" * 55)
print(f"HYPOTHESIS TEST: {variable}")
print("=" * 55)

# State hypotheses (edit as appropriate)
print(f"H₀: μ = {mu_0}")
print(f"Hₐ: μ ≠ {mu_0}  (two-tailed)")
# Or: print(f"Hₐ: μ > {mu_0}  (one-tailed, right)")
# Or: print(f"Hₐ: μ < {mu_0}  (one-tailed, left)")

# Check conditions
print(f"\nConditions:")
print(f"  Random sample: [assess based on your data source]")
print(f"  n = {n}")
print(f"  n ≥ 30 (CLT): {'✓' if n >= 30 else '✗ — check normality'}")
# 10% condition: n should be no more than ~10% of the population.
# The threshold below is a placeholder — compare n to the size of
# YOUR population.
if n < 50000:
    print(f"  10% condition: likely satisfied (n is small "
          f"relative to population)")

# Test using scipy (t-test, since σ is unknown)
t_stat, p_value_two = stats.ttest_1samp(data, mu_0)

print(f"\nResults:")
print(f"  x̄ = {x_bar:.4f}")
print(f"  s = {s:.4f}")
print(f"  SE = {se:.4f}")
print(f"  t-statistic = {t_stat:.4f}")
print(f"  p-value (two-tailed) = {p_value_two:.4f}")

# For one-tailed tests, let scipy compute the p-value directly
# (requires scipy >= 1.6):
# t_stat, p_one = stats.ttest_1samp(data, mu_0, alternative='greater')
# t_stat, p_one = stats.ttest_1samp(data, mu_0, alternative='less')
# Or convert the two-tailed p-value (this version is for Hₐ: μ > μ₀):
# p_one = p_value_two / 2 if t_stat > 0 else 1 - p_value_two / 2
# print(f"  p-value (one-tailed) = {p_one:.4f}")

# Decision
alpha = 0.05
print(f"\nDecision (α = {alpha}):")
if p_value_two <= alpha:
    print(f"  Reject H₀ (p = {p_value_two:.4f} ≤ {alpha})")
else:
    print(f"  Fail to reject H₀ (p = {p_value_two:.4f} > {alpha})")

# ============================================================
# Part 2: Connect to Confidence Interval
# ============================================================
ci = stats.t.interval(confidence=0.95, df=n-1,
                      loc=x_bar, scale=se)
print(f"\n95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"Is μ₀ = {mu_0} in the CI? "
      f"{'Yes' if ci[0] <= mu_0 <= ci[1] else 'No'}")
print(f"This {'agrees' if (ci[0] <= mu_0 <= ci[1]) == (p_value_two > alpha) else 'disagrees'} "
      f"with the hypothesis test.")

Portfolio Tip: Choose a null hypothesis value that has real meaning. If you're using the CDC BRFSS dataset, you might test whether the average BMI in your data differs from the CDC national average of 26.6. If you're using the World Happiness Report, you might test whether the mean happiness score exceeds the global average. The hypothesis test should answer a genuine question, not just serve as an exercise.


13.16 Chapter Summary

You've just learned the second major inference tool in statistics — and arguably the most consequential one.

Here's what you now know:

  1. Hypothesis testing uses indirect reasoning. You assume the null hypothesis is true and ask how surprising the data would be under that assumption. If the data are very surprising, you reject the null.

  2. The null hypothesis ($H_0$) is the default assumption (no effect, no difference, status quo). The alternative hypothesis ($H_a$) is what you're trying to find evidence for.

  3. The test statistic measures how far the sample statistic is from the null hypothesis value, in standard errors.

  4. The p-value is the probability of observing data this extreme or more extreme, IF the null hypothesis is true. It is NOT the probability that the null is true. It is NOT the probability of making an error.

  5. Statistical significance means $p\text{-value} \leq \alpha$, not that the result is important or large.

  6. Type I errors (false alarms) and Type II errors (missed detections) are the two ways hypothesis tests can go wrong. Lowering one increases the other, for a given sample size.

  7. Confidence intervals and hypothesis tests are two sides of the same coin. A value outside the CI would be rejected by the hypothesis test, and vice versa.

  8. P-hacking — exploring many analyses and reporting only the significant ones — inflates the false positive rate far beyond the nominal $\alpha$. It is one of the primary causes of the replication crisis.
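The inflation in item 8 is easy to verify by simulation. A sketch, assuming 20 independent tests run on pure-noise data (so every null hypothesis is true and any rejection is a false positive):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, k, trials = 0.05, 20, 1000

# Every null hypothesis here is TRUE: the data are pure noise with
# mean 0, so any rejection is a false positive.
samples = rng.normal(0, 1, size=(trials, k, 30))
p_values = stats.ttest_1samp(samples, 0, axis=-1).pvalue  # shape (trials, k)
any_hit = (p_values.min(axis=1) <= alpha).mean()

print(f"Empirical  P(>=1 false positive): {any_hit:.3f}")
print(f"Theoretical 1 - (1 - alpha)**k:   {1 - (1 - alpha) ** k:.3f}")
```

Both numbers land near 0.64 — test twenty true nulls and you have roughly a two-in-three chance of at least one "significant" result, far above the nominal 5%.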

Threads Resolved

This chapter resolves several threads that have been building since Chapter 1:

  • "P-value explained properly" — fully delivered in Sections 13.5-13.6
  • "What 'statistically significant' means" — fully delivered in Section 13.7
  • 🔄 "Daria's shooting analysis" — partially resolved (formal test, $z = 1.22$, $p = 0.111$, fail to reject at $\alpha = 0.05$; full two-sample test framework in Chapter 16; power analysis in Chapter 17)

What's Next

In Chapter 14, you'll apply hypothesis testing specifically to proportions — the most common type of statistical test in everyday life (polls, A/B tests, quality control, medical studies). You'll learn the full one-sample and two-sample z-tests for proportions, including the conditions and nuances we've only previewed here.

Chapter 15 will introduce the t-test for means — the version you'll use when $\sigma$ is unknown (which is almost always).

And in Chapter 17, you'll finally tackle the question that Sam should be asking right now: "How many more shots would Daria need to take before I could detect a real improvement?" That's the concept of statistical power, and it completes the hypothesis testing framework.

Sam isn't done with Daria's shooting data. Not by a long shot.


Key Formulas at a Glance

| Concept | Formula | When to Use |
|---|---|---|
| z-test for a mean | $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ | Testing a claim about a population mean (known $\sigma$) |
| z-test for a proportion | $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$ | Testing a claim about a population proportion |
| P-value (right tail) | $P(Z \geq z)$ | $H_a: \mu > \mu_0$ or $H_a: p > p_0$ |
| P-value (left tail) | $P(Z \leq z)$ | $H_a: \mu < \mu_0$ or $H_a: p < p_0$ |
| P-value (two-tailed) | $2 \times P(Z \geq \lvert z \rvert)$ | $H_a: \mu \neq \mu_0$ or $H_a: p \neq p_0$ |
| Multiple testing adjustment | $P(\geq 1 \text{ false positive}) = 1 - (1-\alpha)^k$ | Testing $k$ hypotheses on same data |
| Decision rule | Reject $H_0$ if $p\text{-value} \leq \alpha$ | Every hypothesis test |