Learning Objectives
- Conduct a one-sample z-test for a population proportion
- Construct and interpret confidence intervals for proportions
- Verify conditions for inference about proportions (success-failure condition)
- Interpret results in context (not just 'reject' or 'fail to reject')
- Apply proportion inference to real polling, survey, and public health data
In This Chapter
- Chapter Overview
- 14.1 A Puzzle Before We Start (Productive Struggle)
- 14.2 Review: The Foundations You Already Have
- 14.3 The One-Sample Z-Test for a Population Proportion
- 14.4 Worked Example 1: Maya Tests Disease Prevalence
- 14.5 Worked Example 2: Sam Tests Daria's Shooting (Revisited)
- 14.6 Confidence Intervals for Proportions: Going Deeper
- 14.7 Margin of Error in Polling: A Deep Dive
- 14.8 Python: Proportion Inference with statsmodels
- 14.9 Excel: Proportion Inference
- 14.10 When the Success-Failure Condition Fails
- 14.11 Putting It All Together: The Complete Decision Flowchart
- 14.12 Common Mistakes and How to Avoid Them
- 14.13 Data Detective Portfolio: Test a Proportion Hypothesis
- 14.14 Chapter Summary
- Key Formulas at a Glance
Chapter 14: Inference for Proportions
"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron (often attributed to Albert Einstein)
Chapter Overview
Here's a number that shaped the world: 48.2%.
On November 3, 2020, as election results trickled in across the United States, millions of people stared at screens showing percentages with margins of error. News anchors said things like "Candidate A leads with 51.8%, plus or minus 3 points." Pollsters had spent months surveying samples of 1,000 or 1,500 people, trying to estimate the voting intentions of 150 million. And now the question that mattered — "who's going to win?" — hung on whether those sample proportions were close enough to the population proportions.
That's inference for proportions. And it's everywhere.
When Maya wants to know whether the diabetes prevalence in her county exceeds the national rate of 11.3%, she's doing inference for a proportion. When Sam wants to test whether Daria's three-point shooting has improved from 31%, he's doing inference for a proportion. When a pharmaceutical company tests whether a new vaccine is more than 90% effective, when a tech company wants to know if more than 5% of users click on a new feature, when a school district asks whether the graduation rate has changed — they're all doing inference for proportions.
In fact, inference for proportions may be the single most common statistical procedure you'll encounter outside of a statistics classroom. Polls, surveys, A/B tests, quality control, medical studies, criminal justice research — proportions are everywhere.
The good news? You already know almost everything you need. In Chapter 12, you built confidence intervals for proportions. In Chapter 13, you learned the logic of hypothesis testing and even computed a z-test for Sam's shooting data. This chapter takes those pieces and assembles them into a complete, systematic toolkit for proportion inference.
The new material here is about being careful: checking conditions rigorously, understanding when the standard methods break down, and knowing about improved alternatives (the Wilson interval and plus-four method) for situations where the basic approach falls short.
In this chapter, you will learn to:
- Conduct a one-sample z-test for a population proportion
- Construct and interpret confidence intervals for proportions
- Verify conditions for inference about proportions (success-failure condition)
- Interpret results in context (not just "reject" or "fail to reject")
- Apply proportion inference to real polling, survey, and public health data
Fast Track: If you aced Sam's hypothesis test in Chapter 13 and Maya's proportion CI in Chapter 12, skim Sections 14.1–14.3 and jump to Section 14.6 (Wilson interval and plus-four method). Complete quiz questions 1, 10, and 15 to verify.
Deep Dive: After this chapter, read Case Study 1 (polling and election prediction) for a detailed look at how margin of error determines election night drama, then Case Study 2 (medical screening) for a deep connection between proportion inference and the false positive rates you studied in Chapter 9.
14.1 A Puzzle Before We Start (Productive Struggle)
Before we dive in, try this.
The Vaccine Trial
A pharmaceutical company claims its new vaccine is 90% effective — meaning that among vaccinated people who are exposed to the virus, 90% will not get sick.
To test this claim, researchers enroll 200 vaccinated volunteers and expose them to the virus under controlled conditions. Of the 200, 171 do not get sick (and 29 do).
(a) What is the sample proportion of people who were protected? Is it exactly 90%?
(b) The company claimed 90% effectiveness. Your sample shows 85.5%. Does this prove the vaccine is less effective than claimed?
(c) What if only 160 out of 200 were protected (80%)? Would you be more suspicious of the 90% claim then? Why?
(d) What if the study had 2,000 volunteers instead of 200, and 1,710 out of 2,000 were protected (85.5%)? Same percentage as part (b) — but does it feel different? Why?
Take 3 minutes. Part (d) is the one that matters most.
Here's what I hope you noticed:
For part (a), the sample proportion is $\hat{p} = 171/200 = 0.855$, or 85.5%. That's not 90%, but we know from Chapter 11 that sample proportions vary from sample to sample. Getting 85.5% in a sample doesn't mean the true proportion isn't 90%.
For part (b), no — a single sample proportion can't "prove" anything. The question is whether 85.5% is surprisingly far from 90%, or whether it's the kind of variation we'd expect from a truly 90%-effective vaccine with only 200 people in the trial. That's exactly what a hypothesis test will tell us.
For part (c), 80% feels further from 90% than 85.5% does — and it is. The farther the sample proportion is from the claimed value, the more suspicious we should be. But "suspicious" needs to be quantified. How far is too far?
And for part (d) — this is the key insight. The same proportion (85.5%) feels very different with $n = 200$ versus $n = 2{,}000$. With 200 people, 85.5% could easily arise from a truly 90%-effective vaccine, just by random variation. With 2,000 people, 85.5% is much harder to explain away — the standard error is smaller, so the sample proportion should be much closer to the true proportion. Same percentage. Very different evidence.
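That intuition is easy to quantify, running slightly ahead of the formal test in Section 14.3. The following Python sketch measures how many standard errors 85.5% sits from the claimed 90% at each sample size; the helper name `z_distance` is ours, not a library function:

```python
from math import sqrt

def z_distance(p_hat, p0, n):
    """How many standard errors p_hat lies from the claimed p0,
    with the standard error computed under the claim p = p0."""
    se = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# Same sample proportion (85.5%), two different sample sizes
z_small = z_distance(0.855, 0.90, 200)    # the 200-person trial
z_large = z_distance(0.855, 0.90, 2000)   # the 2,000-person trial

print(round(z_small, 2))  # about -2.12
print(round(z_large, 2))  # about -6.71
```

Roughly 2 standard errors below the claim with 200 volunteers, nearly 7 with 2,000. Same percentage, very different evidence.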
You've just been doing hypothesis testing for proportions in your head. This chapter teaches you how to do it with formulas, so you can quantify exactly how strong the evidence is.
14.2 Review: The Foundations You Already Have
Before we go further, let's take stock of what you already know. This chapter builds directly on two chapters, and I want to make the connections explicit.
From Chapter 11: The CLT for Proportions
The Central Limit Theorem tells us that the sampling distribution of $\hat{p}$ is approximately normal when the sample is large enough:
$$\hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$
This is why the z-test works for proportions. If $\hat{p}$ follows a normal distribution centered on the true $p$, then we can measure how many standard errors our observed $\hat{p}$ is from a hypothesized $p_0$ — and that gives us a z-score.
From Chapter 12: Confidence Intervals for Proportions
You learned the formula:
$$\hat{p} \pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
This is called the Wald interval (named after statistician Abraham Wald). It uses $\hat{p}$ in the standard error because you're estimating $p$.
From Chapter 13: Hypothesis Testing Framework
You learned the five-step procedure:
- State $H_0$ and $H_a$
- Choose $\alpha$
- Compute the test statistic
- Find the p-value
- Conclude in context
And you already applied it to proportions. Sam's test of Daria's shooting used:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.38 - 0.31}{\sqrt{0.31 \times 0.69 / 65}} = \frac{0.07}{0.0574} = 1.22$$
The p-value was 0.111, and Sam failed to reject $H_0$ at $\alpha = 0.05$.
What's New in This Chapter?
Three things:
1. Systematic conditions checking. We'll be rigorous about when the z-test is valid and when it's not. The success-failure condition is the gatekeeper.

2. Better confidence intervals. The Wald interval you learned in Chapter 12 has known problems — it can be too narrow, especially for small samples or extreme proportions. The Wilson interval and plus-four method fix these problems.

3. Applied practice. You'll work through complete, realistic examples — polling data, disease prevalence, quality control — with full five-step procedures and contextual interpretation.
🔄 Spaced Review 1 (from Ch.9): Bayes' Theorem and False Positive Rates
In Chapter 9, you learned that when a medical test has a 5% false positive rate and the disease prevalence is 1%, a positive result means only about an 8.8% chance of actually having the disease (the positive predictive value).
That calculation involved a population proportion — the disease prevalence of 1%. But here's the thing: that 1% is itself an estimate. Maya's county might have a different prevalence than the national average. The inference tools in this chapter let you estimate the actual prevalence in a specific population and test whether it differs from the assumed value.
This matters enormously for screening programs. If you plug 1% prevalence into your Bayes' theorem calculation but the actual prevalence is 3%, your PPV estimate will be wrong — you'll underestimate how many positive results are true positives. Proportion inference lets you get the base rate right, which makes every subsequent Bayesian calculation more accurate.
The chain: proportion inference (Ch.14) → accurate base rate → accurate Bayes' theorem (Ch.9) → accurate screening decisions. The tools connect.
14.3 The One-Sample Z-Test for a Population Proportion
Let's formalize the procedure. You've seen the pieces; now here's the complete system.
When Do You Use It?
Use the one-sample z-test for proportions when you want to test a claim about a single population proportion. Common scenarios:
- "Is the proportion of defective items less than 3%?"
- "Does more than 50% of the electorate favor this candidate?"
- "Has the disease prevalence changed from the historical rate?"
- "Does the customer churn rate exceed our target of 5%?"
- "Has Daria's three-point percentage improved from 31%?"
The Hypotheses
The null hypothesis specifies a value for the population proportion:
$$H_0: p = p_0$$
The alternative can be one of three forms:
| Alternative | Type | Use When |
|---|---|---|
| $H_a: p > p_0$ | Right-tailed | You suspect the proportion is higher than claimed |
| $H_a: p < p_0$ | Left-tailed | You suspect the proportion is lower than claimed |
| $H_a: p \neq p_0$ | Two-tailed | You suspect the proportion differs from the claim (either direction) |
The Conditions (The Three Checkpoints)
Before computing anything, you must verify three conditions. If any fails, the z-test results may not be reliable.
Condition 1: Random Sample (or Random Assignment)
The data must come from a random sample of the population, or from a randomized experiment. This ensures that the sample is representative and that the sampling distribution theory applies.
If you have a convenience sample — say, a Twitter poll — the z-test is technically invalid, no matter how large the sample. (Remember Chapter 4: biased sampling can't be fixed by a larger sample.)
Condition 2: Independence (The 10% Condition)
Each observation must be independent of the others. In practice, this means:
$$n \leq 0.10 \times N$$
where $N$ is the population size. If you're sampling more than 10% of the population, the observations aren't truly independent (drawing one person affects who's left to draw), and the standard error formula needs adjustment.
Condition 3: Success-Failure Condition (The Normal Approximation)
This is the big one — the condition that's specific to proportion inference. The sampling distribution of $\hat{p}$ is only approximately normal when there are enough "successes" and "failures" in the sample:
$$\boxed{np_0 \geq 10 \quad \text{and} \quad n(1-p_0) \geq 10}$$
⚠️ Important Detail: Which $p$ Do You Use?
For hypothesis tests, use $p_0$ (the null hypothesis value) in the success-failure check, because you're assuming $H_0$ is true when you compute the test statistic and p-value.
For confidence intervals, use $\hat{p}$ (the sample proportion), because there's no hypothesized value — you're estimating.
This subtle distinction trips up a lot of students. The reason: in hypothesis testing, you calculate everything under the assumption that $H_0$ is true, so you use $p_0$. In confidence intervals, you have no assumed value, so you use the best estimate you have: $\hat{p}$.
Why does this condition exist? The sampling distribution of $\hat{p}$ is actually a scaled binomial distribution (from Chapter 10), which is discrete and potentially skewed. The normal distribution is a continuous, symmetric approximation. That approximation is only good when there are enough observations on both sides — enough successes and enough failures — to smooth out the discreteness and balance the skewness.
When the condition fails — when $p$ is very close to 0 or 1, or when $n$ is small — the normal approximation breaks down. In those cases, you need either an exact binomial test (covered in more advanced courses) or the adjustments described in Section 14.6.
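The success-failure check is simple enough to automate. Here is a minimal Python helper (the name `success_failure_ok` is ours, not from any package); pass $p_0$ for a hypothesis test or $\hat{p}$ for a confidence interval:

```python
def success_failure_ok(n, p, threshold=10):
    """Check the success-failure condition: n*p >= 10 and n*(1-p) >= 10.
    For a hypothesis test, pass p = p0; for a confidence interval, pass p = p_hat."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(success_failure_ok(65, 0.31))   # True  (Sam's data: 20.15 and 44.85)
print(success_failure_ok(30, 0.90))   # False (27 expected successes, but only 3 failures)
```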
🔄 Spaced Review 2 (from Ch.11): The CLT for Proportions — Why This All Works
Remember the Central Limit Theorem for proportions from Chapter 11? It said that $\hat{p}$ is approximately $N(p, \sqrt{p(1-p)/n})$ when the sample is large enough. The success-failure condition ($np \geq 10$ and $n(1-p) \geq 10$) is the formal version of "large enough."
Here's the deeper connection. The binomial distribution $X \sim \text{Binomial}(n, p)$ counts the number of successes. The sample proportion is $\hat{p} = X/n$ — just the count divided by $n$. So $\hat{p}$ inherits the shape of the binomial distribution, scaled down by $n$.
For small $n$ or extreme $p$, the binomial is skewed and lumpy. As $n$ increases (and $p$ isn't too extreme), the binomial smooths out and approaches the normal shape — that's the CLT at work. The success-failure condition ensures we're in the region where the approximation is accurate.
Everything in this chapter rests on that approximation. The z-test, the confidence interval, the margin of error — all of them assume the sampling distribution of $\hat{p}$ is (approximately) normal. The success-failure condition is your quality check on that assumption.
The Test Statistic
$$\boxed{z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}}$$
where:
- $\hat{p} = X/n$ is the sample proportion (the number of successes divided by the sample size)
- $p_0$ is the population proportion assumed under $H_0$
- $n$ is the sample size
- $\sqrt{p_0(1-p_0)/n}$ is the standard error of $\hat{p}$ under $H_0$
In plain English: the test statistic measures how many standard errors the observed sample proportion is from the hypothesized proportion. A larger $|z|$ means more evidence against $H_0$.
Critical detail: The standard error uses $p_0$, not $\hat{p}$. Why? Because in hypothesis testing, you assume $H_0$ is true and ask, "How surprising is my data under that assumption?" If $H_0$ is true, then $p = p_0$, so the standard error should be based on $p_0$.
The P-Value
Once you have the z-score, find the p-value from the standard normal distribution:
| Alternative | P-Value |
|---|---|
| $H_a: p > p_0$ | $P(Z \geq z)$ |
| $H_a: p < p_0$ | $P(Z \leq z)$ |
| $H_a: p \neq p_0$ | $2 \times P(Z \geq |z|)$ |
The Decision
- If p-value $\leq \alpha$: Reject $H_0$
- If p-value $> \alpha$: Fail to reject $H_0$
And always — always — state the conclusion in context.
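The whole computation can be collected into one small function. This Python sketch uses only the standard library (`math.erf` gives the normal CDF); the function name `prop_ztest` and its argument names are our own, not from any statistics package:

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF via the error function (standard library only)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def prop_ztest(x, n, p0, alternative):
    """One-sample z-test for a proportion.
    alternative: 'greater', 'less', or 'two-sided'."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)          # standard error uses p0, not p_hat
    z = (p_hat - p0) / se
    if alternative == 'greater':
        p_value = 1 - norm_cdf(z)
    elif alternative == 'less':
        p_value = norm_cdf(z)
    else:
        p_value = 2 * (1 - norm_cdf(abs(z)))
    return z, p_value

# Vaccine puzzle from Section 14.1: 171 of 200 protected, testing against p = 0.90
z, p = prop_ztest(171, 200, 0.90, 'less')
print(round(z, 2), round(p, 3))  # -2.12 0.017
```

Applied to the Section 14.1 puzzle, the left-tailed p-value of about 0.017 says that 171 protected out of 200 would be fairly surprising if the vaccine really were 90% effective.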
14.4 Worked Example 1: Maya Tests Disease Prevalence
Let's work through a complete example with Dr. Maya Chen.
The Scenario
Maya is studying the prevalence of hypertension (high blood pressure) in a rural county. According to CDC data, the national prevalence of hypertension among U.S. adults is 47.0%. Maya suspects that her rural county may have a higher prevalence due to limited access to healthcare, higher obesity rates, and an older population.
She randomly samples 400 adults in the county and finds that 208 of them have hypertension.
Research question: Is the prevalence of hypertension in this county higher than the national rate of 47.0%?
Step 1: State the Hypotheses
- $H_0: p = 0.47$ — the county's hypertension prevalence equals the national rate
- $H_a: p > 0.47$ — the county's hypertension prevalence exceeds the national rate
- This is a one-tailed (right-tailed) test because Maya specifically suspects the rate is higher
Step 2: Choose the Significance Level
$\alpha = 0.05$
Maya considers the consequences:
- Type I error: Concluding the county has elevated hypertension when it doesn't. The county might allocate extra public health resources unnecessarily.
- Type II error: Failing to detect an elevated rate when one exists. At-risk residents don't receive targeted interventions.
The Type II error seems more consequential here (people's health), so Maya might consider $\alpha = 0.10$. But for this example, she'll use the standard $\alpha = 0.05$.
Step 3: Check Conditions
Random sample? Yes — Maya used a random sample of county residents.
Independence? $n = 400 \leq 0.10 \times 85{,}000$ (county adult population). ✓
Success-failure condition:
- $np_0 = 400 \times 0.47 = 188 \geq 10$ ✓
- $n(1 - p_0) = 400 \times 0.53 = 212 \geq 10$ ✓
All conditions met. The z-test is appropriate.
Step 4: Compute the Test Statistic
$$\hat{p} = \frac{208}{400} = 0.52$$
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.52 - 0.47}{\sqrt{\frac{0.47 \times 0.53}{400}}} = \frac{0.05}{\sqrt{\frac{0.2491}{400}}} = \frac{0.05}{\sqrt{0.000623}} = \frac{0.05}{0.02496} = 2.00$$
Step 5: Find the P-Value
Since $H_a: p > 0.47$ (right-tailed):
$$\text{p-value} = P(Z \geq 2.00) = 1 - P(Z \leq 2.00) = 1 - 0.9772 = 0.0228$$
Step 6: Make the Decision
$\text{p-value} = 0.023 < \alpha = 0.05$
Reject $H_0$.
Step 7: Interpret in Context
At the 5% significance level, there is statistically significant evidence that the prevalence of hypertension in this rural county exceeds the national rate of 47.0%. The sample proportion of 52.0% is about 2 standard errors above the national rate, which would be unlikely to occur by chance alone if the true county rate were 47.0% ($p = 0.023$).
This finding supports Maya's hypothesis that limited healthcare access and demographic factors may contribute to elevated hypertension in this community. She can now present this evidence to the county health board to advocate for targeted screening programs.
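As a quick check on the hand arithmetic, Maya's test statistic and p-value can be reproduced in a few lines of standard-library Python:

```python
from math import sqrt, erf

# Maya's test: 208 of 400 adults with hypertension, H0: p = 0.47, Ha: p > 0.47
p_hat = 208 / 400
se = sqrt(0.47 * 0.53 / 400)                    # standard error under H0
z = (p_hat - 0.47) / se
p_value = 0.5 * (1 - erf(z / sqrt(2)))          # right-tail area, P(Z >= z)
print(round(z, 2), round(p_value, 3))  # 2.0 0.023
```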
💡 Key Insight: Contextual Interpretation Matters
Notice how the last paragraph doesn't just say "reject $H_0$." It explains what the result means for Maya's real-world question. This is the difference between doing statistics mechanically and doing statistics meaningfully.
A number like $p = 0.023$ only becomes useful when you translate it: "The data provide strong evidence that this county's hypertension rate is above average, which justifies allocating extra public health resources." That's what the decision-maker needs to hear.
14.5 Worked Example 2: Sam Tests Daria's Shooting (Revisited)
We've done this test informally in Chapter 11 and formally in Chapter 13. Now let's do it one more time with the complete proportion inference framework, paying careful attention to conditions and contextual interpretation.
The Scenario
Sam Okafor's boss, the Riverside Raptors analytics director, asks for a formal statistical report on Daria Williams's three-point shooting. The question: has Daria's three-point percentage improved from her career rate of 31%?
Sam has data from the current season: Daria has made 25 out of 65 three-point attempts (38.5%).
Step 1: State the Hypotheses
- $H_0: p = 0.31$ — Daria's true shooting percentage is still 31%
- $H_a: p > 0.31$ — Daria's true shooting percentage has improved
- One-tailed test (Sam only cares about improvement)
Step 2: Choose the Significance Level
$\alpha = 0.05$
Step 3: Check Conditions
Random sample? This one requires some thought. Daria's 65 attempts aren't a "random sample" in the traditional sense — they're all the three-point shots she took this season. But we can think of them as a random sample from the hypothetical population of all three-point shots Daria could take under current conditions. This is a common and accepted framework in sports analytics.
Independence? Each shot is reasonably independent of the others. (In reality, there might be "hot hand" effects, but the statistical evidence for those is mixed, and we'll treat the shots as independent for now.)
Success-failure condition:
- $np_0 = 65 \times 0.31 = 20.15 \geq 10$ ✓
- $n(1 - p_0) = 65 \times 0.69 = 44.85 \geq 10$ ✓
All conditions met.
Step 4: Compute the Test Statistic
$$\hat{p} = \frac{25}{65} = 0.3846$$
$$z = \frac{0.3846 - 0.31}{\sqrt{\frac{0.31 \times 0.69}{65}}} = \frac{0.0746}{\sqrt{\frac{0.2139}{65}}} = \frac{0.0746}{\sqrt{0.003291}} = \frac{0.0746}{0.05737} = 1.30$$
Note
You might notice that Sam got $z = 1.22$ in Chapter 13, while we're getting $z = 1.30$ here. The difference is due to rounding. In Chapter 13, we used $\hat{p} = 0.38$ (rounded) rather than the exact $\hat{p} = 25/65 = 0.38462$. The moral: use exact values throughout your calculation, and only round at the end.
Step 5: Find the P-Value
$$\text{p-value} = P(Z \geq 1.30) = 1 - 0.9032 = 0.0968$$
Step 6: Make the Decision
$\text{p-value} = 0.097 > \alpha = 0.05$
Fail to reject $H_0$.
Step 7: Interpret in Context
At the 5% significance level, there is not sufficient evidence to conclude that Daria's three-point shooting percentage has improved from her career rate of 31%. Although her current-season percentage of 38.5% is higher than her career rate, the difference is not statistically significant ($z = 1.30$, $p = 0.097$).
This doesn't mean Daria hasn't improved — it means we don't have enough evidence to be confident she has. With only 65 attempts, there's too much sampling variability to distinguish a genuine improvement from a lucky streak.
What Should Sam Tell His Boss?
Here's where contextual interpretation really matters. Sam's report should say something like:
"Based on 65 three-point attempts this season, Daria's shooting percentage of 38.5% is suggestive of improvement from her career rate of 31%, but the sample size is too small to draw a definitive conclusion (p = 0.097). The 95% confidence interval for her true shooting percentage is approximately (27%, 50%), which includes both her old rate and substantially higher rates.
To determine with reasonable confidence whether Daria has truly improved, we would need data from approximately 150-200 additional three-point attempts. I recommend we revisit this analysis at the midpoint of the season."
That's a statistical report that a decision-maker can actually use.
💡 Key Insight: The CI-Test Connection (Revisited)
Remember from Chapter 13: a 95% CI contains all values that would not be rejected at $\alpha = 0.05$. Sam's CI of approximately (27%, 50%) contains $p_0 = 0.31$. That's consistent with failing to reject $H_0: p = 0.31$. The CI and the test agree — as they always must.
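You can verify this agreement directly. The following Python sketch computes Sam's 95% Wald interval and confirms that $p_0 = 0.31$ falls inside it:

```python
from math import sqrt

# Sam's data: 25 made out of 65 three-point attempts
p_hat = 25 / 65
se = sqrt(p_hat * (1 - p_hat) / 65)   # CI standard error uses p_hat
margin = 1.96 * se
low, high = p_hat - margin, p_hat + margin
print(round(low, 3), round(high, 3))  # 0.266 0.503
print(low <= 0.31 <= high)            # True: 0.31 is not rejected at alpha = 0.05
```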
Connecting Back: The Daria Thread
Let's trace how this thread has developed:
| Chapter | What We Learned About Daria |
|---|---|
| Ch.1 | Sam asks: "Is 38% really better than 31%, or just luck?" |
| Ch.11 | CLT analysis: $z = 1.23$, $P = 0.109$. Inconclusive. |
| Ch.12 | 95% CI: (26.7%, 50.3%). Wide interval contains 0.31. |
| Ch.13 | Formal hypothesis test: $z = 1.22$, $p = 0.111$. Fail to reject. |
| Ch.14 | Complete proportion inference: $z = 1.30$, $p = 0.097$. Same conclusion. Need more data. |
The answer has been consistent all along: 65 shots isn't enough to tell. Sam will revisit this question in Chapter 17 (power analysis), where he'll calculate exactly how many shots Daria would need.
14.6 Confidence Intervals for Proportions: Going Deeper
In Chapter 12, you learned the basic confidence interval for a proportion — the Wald interval. Now let's go deeper, because the Wald interval has some well-known problems, and there are better alternatives.
The Wald Interval (Review)
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
This is what most introductory textbooks teach, and it's what you learned in Chapter 12. It's simple, intuitive, and easy to compute.
But it has a dirty secret: it doesn't always work well.
The Problem with the Wald Interval
The Wald interval's actual coverage probability — the percentage of times the interval captures the true $p$ — can be substantially lower than the advertised confidence level. Here's what that means in practice:
- A "95% confidence interval" sometimes captures the true proportion only 85-90% of the time
- The problem is worst when $p$ is close to 0 or 1 (say, below 0.1 or above 0.9)
- It also fails for small sample sizes (even when the success-failure condition is technically met)
- The interval can have a lower bound below 0 or an upper bound above 1, which is nonsensical for a proportion
In 2001, statisticians Lawrence Brown, T. Tony Cai, and Anirban DasGupta published a landmark paper showing that the Wald interval's coverage problems were far worse than most statisticians realized. Their recommendation: stop using it, or at least know its limitations.
The Wilson Interval (The Better Alternative)
The Wilson interval (also called the Wilson score interval, developed by Edwin Wilson in 1927) fixes the Wald interval's coverage problems. It's computed as:
$$\frac{\hat{p} + \frac{z^{*2}}{2n} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^{*2}}{4n^2}}}{1 + \frac{z^{*2}}{n}}$$
Yes, that formula is uglier. But it has much better properties:
- Coverage is consistently close to the nominal level (95% really means about 95%)
- Never produces intervals outside [0, 1]
- Works well even for small samples and extreme proportions
- It's what many professional statisticians actually use in practice
Why the Wilson Interval Works Better
The Wald interval centers on $\hat{p}$ and uses the standard error based on $\hat{p}$. But $\hat{p}$ is an estimate — it could be wrong. When $\hat{p}$ is far from the true $p$ (which is more likely for small samples or extreme proportions), the Wald interval's standard error is also wrong, and the interval has poor coverage.
The Wilson interval adjusts for this by "pulling" the center of the interval slightly toward 0.5 and using a corrected standard error. This adjustment is small for large samples (where $\hat{p}$ is reliable) but important for small samples (where it's not).
You don't need to compute the Wilson interval by hand — that's what software is for. In Python, statsmodels computes it with one function call (we'll show this in Section 14.8).
The Plus-Four Method (The Simple Fix)
If you want a simple improvement over the Wald interval without the Wilson formula, the plus-four method is your friend.
Here's the idea: before computing the Wald interval, add 4 imaginary observations to your data — 2 successes and 2 failures. Then compute the Wald interval using the adjusted count.
$$\tilde{p} = \frac{X + 2}{n + 4}$$
$$\tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + 4}}$$
where $\tilde{p}$ (p-tilde) is the adjusted sample proportion.
That's it. Add 2 successes and 2 failures, then proceed as before. This remarkably simple trick substantially improves the interval's coverage, especially for small samples.
Why does adding fake data help? The 2 extra successes and 2 extra failures pull $\hat{p}$ toward 0.5, which is exactly the correction needed when $\hat{p}$ is near 0 or 1. For large samples, the 4 extra observations barely change anything. For small samples, they provide exactly the correction needed.
When to Use What
| Method | Use When | Pros | Cons |
|---|---|---|---|
| Wald | Large $n$, $p$ not extreme | Simple formula, easy to understand | Poor coverage for small $n$ or extreme $p$ |
| Wilson | Any situation | Best coverage properties | Complex formula (use software) |
| Plus-four | Small to moderate $n$ | Simple formula, good coverage | Slightly conservative for large $n$ |

Bottom line: For homework and exams, the Wald interval is usually expected (and it's fine for large samples with $p$ not near 0 or 1). For real-world work, use the Wilson interval. For a quick-and-easy improvement, use plus-four.
Worked Example: Comparing the Three Methods
A medical study tests a new surgical technique. Of 30 patients, 27 have successful outcomes. Construct a 95% CI for the true success rate.
Data: $n = 30$, $X = 27$, $\hat{p} = 27/30 = 0.90$
Check conditions: $n\hat{p} = 27 \geq 10$ ✓ and $n(1-\hat{p}) = 3$ ... that's less than 10. ❌
The success-failure condition fails for the Wald interval. This is exactly the scenario where the Wilson and plus-four methods shine.
Wald interval (shown for comparison, even though conditions aren't met):
$$0.90 \pm 1.960 \sqrt{\frac{0.90 \times 0.10}{30}} = 0.90 \pm 1.960 \times 0.0548 = 0.90 \pm 0.107$$
$$\text{Wald: } (0.793, 1.007)$$
An upper bound above 1.0 — a proportion greater than 100% — which is obviously impossible. This is the Wald interval misbehaving.
Plus-four method:
$$\tilde{p} = \frac{27 + 2}{30 + 4} = \frac{29}{34} = 0.8529$$
$$0.8529 \pm 1.960 \sqrt{\frac{0.8529 \times 0.1471}{34}} = 0.8529 \pm 1.960 \times 0.0607 = 0.8529 \pm 0.119$$
$$\text{Plus-four: } (0.734, 0.972)$$
Better — the upper bound is below 1.
Wilson interval (computed by software):
$$\text{Wilson: } (0.744, 0.965)$$
The Wilson interval gives a reasonable range that stays within [0, 1] and has good coverage properties.
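All three intervals for this example can be reproduced in a few lines of Python, straight from the formulas above (using the conventional $z^* = 1.96$):

```python
from math import sqrt

z = 1.96       # critical value for 95% confidence
x, n = 27, 30  # 27 successful outcomes among 30 patients

# Wald: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
p_hat = x / n
se_wald = sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - z * se_wald, p_hat + z * se_wald)

# Plus-four: add 2 successes and 2 failures, then proceed as for Wald
p_tilde = (x + 2) / (n + 4)
se_p4 = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
plus_four = (p_tilde - z * se_p4, p_tilde + z * se_p4)

# Wilson: center pulled toward 0.5, corrected standard error
center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
wilson = (center - half, center + half)

for name, (lo, hi) in [("Wald", wald), ("Plus-four", plus_four), ("Wilson", wilson)]:
    print(f"{name}: ({lo:.3f}, {hi:.3f})")
# Wald: (0.793, 1.007)
# Plus-four: (0.734, 0.972)
# Wilson: (0.744, 0.965)
```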
14.7 Margin of Error in Polling: A Deep Dive
This is the section where Theme 1 — statistics as a superpower for navigating the real world — comes alive. After reading this, you'll never hear "margin of error" on the news the same way again.
What Pollsters Mean by "Margin of Error"
When a news report says "Candidate A leads with 52%, margin of error ±3 points," they're reporting a 95% confidence interval: $(49\%, 55\%)$.
The margin of error for a proportion is:
$$E = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
For a 95% CI with $\hat{p}$ near 0.5 and $n = 1{,}000$:
$$E = 1.960 \sqrt{\frac{0.5 \times 0.5}{1000}} = 1.960 \times 0.0158 = 0.031 \approx \pm 3.1\%$$
That's the magic number: for proportions near 50%, a sample of 1,000 gives a margin of error of about ±3 percentage points. This is why most national polls survey around 1,000 to 1,500 people.
Why Most Polls Use Samples of About 1,000
| Sample Size | Margin of Error (95%, $p \approx 0.50$) |
|---|---|
| 100 | ±9.8% |
| 400 | ±4.9% |
| 1,000 | ±3.1% |
| 1,500 | ±2.5% |
| 4,000 | ±1.5% |
| 10,000 | ±1.0% |
Notice the diminishing returns (the $\sqrt{n}$ effect from Chapter 11). Going from 1,000 to 4,000 quadruples the cost but only halves the margin of error. For most purposes, ±3 points is "good enough" — which is why polling organizations rarely go above 1,500 respondents.
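The table above is easy to regenerate. A short Python sketch (the helper name `margin_of_error` is ours) makes the diminishing returns visible:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a sample proportion (worst case at p = 0.5)."""
    return z * sqrt(p * (1 - p) / n)

for n in (100, 400, 1000, 1500, 4000, 10000):
    print(f"n = {n:>6}: ±{margin_of_error(n):.1%}")
```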
When Polls Go Wrong: It's Usually Not the Margin of Error
Here's something that surprises most people: when major polls get an election wrong, it's almost never because of random sampling error (which the margin of error measures). It's because of bias.
🔄 Spaced Review 3 (from Ch.4): Sampling Bias Affects Validity
In Chapter 4, you learned about sampling bias — systematic errors that make a sample unrepresentative of the population. The 1936 Literary Digest poll is the classic example: a huge sample of 2.4 million people, but biased toward wealthier Americans (who favored Landon over Roosevelt).
The margin of error only measures random sampling error. It assumes a perfectly random sample. If the sample is biased — if certain groups are systematically under-represented — the margin of error understates the true uncertainty.
This is exactly what happened in 2016 and 2020 presidential polls. The margin of error was ±3 points, but the actual error was larger because non-college-educated white voters (who disproportionately favored Trump) were underrepresented in many polls. The random sampling error was small; the bias was not.
The lesson: A margin of error is only as good as the sampling method. ±3 points from a biased poll is not the same as ±3 points from a truly random sample.
The "Horse Race" Problem
When a poll says "Candidate A: 49%, Candidate B: 51%, margin of error ±3%," many people conclude that Candidate B is winning. But think about what the confidence intervals are:
- Candidate A: (46%, 52%)
- Candidate B: (48%, 54%)
These intervals overlap substantially. The race is within the margin of error — which means the poll cannot distinguish between "A is ahead" and "B is ahead." The statistically honest headline would be: "Race is too close to call."
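A quick sketch makes the overlap concrete (the candidate numbers and the ±3-point margin are the ones from this example):

```python
moe = 0.03  # ±3-point margin of error

# 95% CI for each candidate: point estimate ± margin of error
ci_a = (0.49 - moe, 0.49 + moe)  # Candidate A
ci_b = (0.51 - moe, 0.51 + moe)  # Candidate B

# Intervals overlap whenever A's upper bound exceeds B's lower bound
overlap = ci_a[1] > ci_b[0]
print(f"Candidate A: ({ci_a[0]:.0%}, {ci_a[1]:.0%})")
print(f"Candidate B: ({ci_b[0]:.0%}, {ci_b[1]:.0%})")
print("Too close to call" if overlap else "Lead exceeds the margin of error")
```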
Theme 1 Connection: Statistics as a Superpower
Understanding margin of error gives you a genuine superpower on election night. While most viewers are panicking over every fraction-of-a-percent shift, you can look at the results and immediately assess: "Is this lead larger than the margin of error? If not, it's noise." You can distinguish between a meaningful lead and statistical noise — which is exactly what the numbers are trying to tell you, if you know how to listen.
Worked Example: Interpreting a Real Poll
A Pew Research Center poll of 1,503 U.S. adults finds that 62% favor stricter gun control laws.
(a) Construct a 95% confidence interval.
$$\hat{p} = 0.62, \quad n = 1{,}503$$
Check conditions:
- Random sample: Pew uses rigorous random sampling methods. ✓
- Independence: $1{,}503 \leq 0.10 \times 258{,}000{,}000$ (U.S. adult population). ✓
- Success-failure: $1503 \times 0.62 = 932 \geq 10$ ✓ and $1503 \times 0.38 = 571 \geq 10$ ✓
$$E = 1.960 \sqrt{\frac{0.62 \times 0.38}{1503}} = 1.960 \sqrt{\frac{0.2356}{1503}} = 1.960 \times 0.01252 = 0.0245$$
$$\text{95% CI: } (0.62 - 0.025, 0.62 + 0.025) = (0.595, 0.645)$$
We are 95% confident that between 59.5% and 64.5% of U.S. adults favor stricter gun control laws.
(b) Can we conclude that a majority (more than 50%) support stricter gun laws?
Yes. The entire 95% CI is above 0.50. Equivalently, a hypothesis test of $H_0: p = 0.50$ vs. $H_a: p > 0.50$ would be rejected at $\alpha = 0.05$ (since 0.50 is outside the CI). The evidence strongly supports majority support.
(c) The headline says "62% of Americans favor gun control." What's a better headline?
A more statistically accurate headline: "Between 59.5% and 64.5% of Americans favor stricter gun control, according to a new poll." Or at minimum: "62% of Americans favor stricter gun control (±2.5 points)."
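If you want to verify parts (a) and (b) in code, here's a sketch using the statsmodels functions covered later in this chapter (the success count is reconstructed from the reported 62%):

```python
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

n = 1503
x = round(0.62 * n)  # ≈ 932 respondents in favor (reconstructed from the reported 62%)

# Part (a): 95% Wald confidence interval
lo, hi = proportion_confint(x, n, alpha=0.05, method='normal')
print(f"95% CI: ({lo:.3f}, {hi:.3f})")

# Part (b): H0: p = 0.50 vs Ha: p > 0.50, using p0 in the standard error
z, p_value = proportions_ztest(count=x, nobs=n, value=0.50,
                               alternative='larger', prop_var=0.50)
print(f"z = {z:.1f}, p-value = {p_value:.2g}")
```

The entire interval sits above 0.50, matching the conclusion in part (b).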
Theme 2 Connection: Who's Represented in the Data?
Even with a perfectly random sample and a valid margin of error, polls can miss the full story. Who is included in "U.S. adults"? What about people without phones? What about people who decline to participate (nonresponse bias, Chapter 4)? What about people who say one thing to a pollster but do another in the voting booth (response bias)?
The margin of error captures random sampling variability. It does not capture systematic errors from who's missing in the sample. When we report "62% ± 2.5%," we're being honest about the uncertainty we can measure — but there may be additional uncertainty we can't.
14.8 Python: Proportion Inference with statsmodels
Let's implement everything we've learned. We'll cover manual calculations, statsmodels, and visualization.
Manual Z-Test for a Proportion
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# ============================================================
# MAYA'S TEST: Is county hypertension prevalence above 47%?
# ============================================================
# Data
x = 208 # number of "successes" (hypertension cases)
n = 400 # sample size
p_hat = x / n # sample proportion
p_0 = 0.47 # null hypothesis value
alpha = 0.05
print("=" * 60)
print("ONE-SAMPLE Z-TEST FOR A PROPORTION")
print("Maya's Hypertension Study")
print("=" * 60)
# Step 1: Hypotheses
print(f"\nStep 1: Hypotheses")
print(f" H₀: p = {p_0}")
print(f" Hₐ: p > {p_0} (right-tailed)")
print(f" α = {alpha}")
# Step 2: Check conditions
print(f"\nStep 2: Check Conditions")
print(f" Random sample: Yes (stated)")
print(f" Independence: n = {n} ≤ 10% of population ✓")
sf_success = n * p_0
sf_failure = n * (1 - p_0)
print(f" Success-failure: np₀ = {sf_success:.0f} ≥ 10 ✓")
print(f" n(1-p₀) = {sf_failure:.0f} ≥ 10 ✓")
# Step 3: Test statistic
se = np.sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / se
print(f"\nStep 3: Test Statistic")
print(f" p̂ = {x}/{n} = {p_hat:.4f}")
print(f" SE = √({p_0} × {1-p_0} / {n}) = {se:.4f}")
print(f" z = ({p_hat:.4f} - {p_0}) / {se:.4f} = {z:.2f}")
# Step 4: P-value (right-tailed)
p_value = 1 - stats.norm.cdf(z)
print(f"\nStep 4: P-value")
print(f" P(Z ≥ {z:.2f}) = {p_value:.4f}")
# Step 5: Decision
print(f"\nStep 5: Decision")
if p_value <= alpha:
print(f" p-value ({p_value:.4f}) ≤ α ({alpha}): REJECT H₀")
print(f" Evidence that county hypertension prevalence exceeds {p_0:.0%}.")
else:
print(f" p-value ({p_value:.4f}) > α ({alpha}): FAIL TO REJECT H₀")
print(f" Insufficient evidence of elevated hypertension prevalence.")
# Output:
# ============================================================
# ONE-SAMPLE Z-TEST FOR A PROPORTION
# Maya's Hypertension Study
# ============================================================
#
# Step 1: Hypotheses
# H₀: p = 0.47
# Hₐ: p > 0.47 (right-tailed)
# α = 0.05
#
# Step 2: Check Conditions
# Random sample: Yes (stated)
# Independence: n = 400 ≤ 10% of population ✓
# Success-failure: np₀ = 188 ≥ 10 ✓
# n(1-p₀) = 212 ≥ 10 ✓
#
# Step 3: Test Statistic
# p̂ = 208/400 = 0.5200
# SE = √(0.47 × 0.53 / 400) = 0.0250
# z = (0.5200 - 0.47) / 0.0250 = 2.00
#
# Step 4: P-value
# P(Z ≥ 2.00) = 0.0226
#
# Step 5: Decision
# p-value (0.0226) ≤ α (0.05): REJECT H₀
# Evidence that county hypertension prevalence exceeds 47%.
Using statsmodels for the Z-Test
from statsmodels.stats.proportion import proportions_ztest
# ============================================================
# USING statsmodels.stats.proportion.proportions_ztest
# ============================================================
# Maya's data
x = 208 # number of successes
n = 400 # sample size
p_0 = 0.47 # null hypothesis proportion
# One-sample z-test
# count = number of successes, nobs = sample size
# value = hypothesized proportion, alternative = direction
# prop_var = p_0 uses the null value in the SE, matching this chapter's formula
# (by default, statsmodels uses the sample proportion in the SE)
z_stat, p_value = proportions_ztest(count=x, nobs=n,
                                    value=p_0,
                                    alternative='larger',
                                    prop_var=p_0)
print("=" * 60)
print("PROPORTIONS Z-TEST (statsmodels)")
print("=" * 60)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value (one-tailed): {p_value:.4f}")
# Sam's test
z_sam, p_sam = proportions_ztest(count=25, nobs=65,
                                 value=0.31,
                                 alternative='larger',
                                 prop_var=0.31)
print(f"\nSam's test (Daria): z = {z_sam:.4f}, p = {p_sam:.4f}")
# Output:
# ============================================================
# PROPORTIONS Z-TEST (statsmodels)
# ============================================================
# z-statistic: 2.0036
# p-value (one-tailed): 0.0226
#
# Sam's test (Daria): z = 1.3007, p = 0.0967
Confidence Intervals: Wald, Wilson, and Plus-Four
from statsmodels.stats.proportion import proportion_confint
# ============================================================
# COMPARING CI METHODS: Surgical success rate
# ============================================================
x = 27 # successes
n = 30 # total observations
alpha = 0.05
print("=" * 60)
print("CONFIDENCE INTERVAL COMPARISON")
print(f"Data: {x} successes out of {n} observations")
print(f"Sample proportion: {x/n:.4f}")
print("=" * 60)
# Method 1: Wald interval
ci_wald = proportion_confint(x, n, alpha=alpha, method='normal')
print(f"\nWald interval: ({ci_wald[0]:.4f}, {ci_wald[1]:.4f})")
# Method 2: Wilson interval
ci_wilson = proportion_confint(x, n, alpha=alpha, method='wilson')
print(f"Wilson interval: ({ci_wilson[0]:.4f}, {ci_wilson[1]:.4f})")
# Method 3: Plus-four (manual)
p_tilde = (x + 2) / (n + 4)
se_tilde = np.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
z_star = stats.norm.ppf(1 - alpha / 2)
ci_pf = (p_tilde - z_star * se_tilde, p_tilde + z_star * se_tilde)
print(f"Plus-four interval: ({ci_pf[0]:.4f}, {ci_pf[1]:.4f})")
# Method 4: Exact (Clopper-Pearson) - for reference
# ('beta' is statsmodels' name for the Clopper-Pearson interval)
ci_exact = proportion_confint(x, n, alpha=alpha, method='beta')
print(f"Exact (C-P) interval: ({ci_exact[0]:.4f}, {ci_exact[1]:.4f})")
print(f"\nNote: Wald upper bound ({ci_wald[1]:.4f}) exceeds 1.0!")
print("This is why Wilson or plus-four is preferred for extreme p̂.")
# Output:
# ============================================================
# CONFIDENCE INTERVAL COMPARISON
# Data: 27 successes out of 30 observations
# Sample proportion: 0.9000
# ============================================================
#
# Wald interval: (0.7926, 1.0074)
# Wilson interval: (0.7438, 0.9654)
# Plus-four interval: (0.7339, 0.9720)
# Exact (C-P) interval: (0.7347, 0.9789)
#
# Note: Wald upper bound (1.0074) exceeds 1.0!
# This is why Wilson or plus-four is preferred for extreme p̂.
Visualizing the P-Value for Maya's Test
# ============================================================
# VISUALIZATION: P-value for Maya's hypertension test
# ============================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# --- Panel 1: Maya's one-tailed test ---
ax = axes[0]
x_vals = np.linspace(-4, 4, 1000)
y_vals = stats.norm.pdf(x_vals)
ax.plot(x_vals, y_vals, 'k-', linewidth=2)
ax.fill_between(x_vals, y_vals, where=(x_vals >= 2.00),
color='coral', alpha=0.4,
label=f'p-value = 0.023')
ax.axvline(x=2.00, color='red', linestyle='--', linewidth=1.5,
label='z = 2.00 (observed)')
ax.axvline(x=1.645, color='blue', linestyle=':', linewidth=1.5,
label='z* = 1.645 (critical)')
ax.set_title("Maya's Test: Hypertension Prevalence\n"
r"$H_0: p = 0.47$ vs $H_a: p > 0.47$",
fontsize=12)
ax.set_xlabel('z-score')
ax.set_ylabel('Density')
ax.legend(fontsize=9)
ax.annotate('Reject H₀', xy=(2.8, 0.02), fontsize=11,
ha='center', color='red', fontweight='bold')
# --- Panel 2: Sam's one-tailed test ---
ax = axes[1]
ax.plot(x_vals, y_vals, 'k-', linewidth=2)
ax.fill_between(x_vals, y_vals, where=(x_vals >= 1.30),
color='skyblue', alpha=0.4,
label=f'p-value = 0.097')
ax.axvline(x=1.30, color='blue', linestyle='--', linewidth=1.5,
label='z = 1.30 (observed)')
ax.axvline(x=1.645, color='red', linestyle=':', linewidth=1.5,
label='z* = 1.645 (critical)')
ax.set_title("Sam's Test: Daria's Shooting\n"
r"$H_0: p = 0.31$ vs $H_a: p > 0.31$",
fontsize=12)
ax.set_xlabel('z-score')
ax.set_ylabel('Density')
ax.legend(fontsize=9)
ax.annotate('Fail to reject H₀', xy=(0, 0.15), fontsize=11,
ha='center', color='green', fontweight='bold')
plt.tight_layout()
plt.savefig('proportion_tests_visualization.png', dpi=150,
bbox_inches='tight')
plt.show()
14.9 Excel: Proportion Inference
You can perform proportion inference in Excel using built-in functions.
Z-Test for a Proportion (Manual)
Set up your spreadsheet like this:
Cell A1: "Data" Cell B1: "Value"
Cell A2: "Successes (X)" Cell B2: 208
Cell A3: "Sample size (n)" Cell B3: 400
Cell A4: "p₀ (null value)" Cell B4: 0.47
Cell A5: "α" Cell B5: 0.05
Cell A7: "Calculations"
Cell A8: "Sample proportion" Cell B8: =B2/B3
Cell A9: "Standard error" Cell B9: =SQRT(B4*(1-B4)/B3)
Cell A10: "z-statistic" Cell B10: =(B8-B4)/B9
Cell A11: "p-value (right)" Cell B11: =1-NORM.S.DIST(B10,TRUE)
Cell A12: "p-value (left)" Cell B12: =NORM.S.DIST(B10,TRUE)
Cell A13: "p-value (two-tail)" Cell B13: =2*(1-NORM.S.DIST(ABS(B10),TRUE))
Cell A15: "Decision"
Cell A16: "Reject H₀?" Cell B16: =IF(B11<=B5,"Yes - Reject","No - Fail to reject")
Confidence Interval for a Proportion (Wald)
Cell A18: "95% CI (Wald)"
Cell A19: "Critical value z*" Cell B19: =NORM.S.INV(1-B5/2)
Cell A20: "Margin of error" Cell B20: =B19*SQRT(B8*(1-B8)/B3)
Cell A21: "Lower bound" Cell B21: =B8-B20
Cell A22: "Upper bound" Cell B22: =B8+B20
Key Excel Functions for Proportions
| Function | Purpose | Example |
|---|---|---|
| `NORM.S.DIST(z, TRUE)` | $P(Z \leq z)$: cumulative standard normal probability | `=NORM.S.DIST(2.00, TRUE)` → 0.9772 |
| `NORM.S.INV(prob)` | Critical value: find $z$ for a given left-tail probability | `=NORM.S.INV(0.975)` → 1.960 |
| `SQRT()` | Square root (for standard error) | `=SQRT(0.47*0.53/400)` → 0.02496 |
| `ABS()` | Absolute value (for two-tailed p-values) | `=ABS(-2.00)` → 2.00 |
Excel Tip: Excel does not have a built-in one-sample z-test for proportions function. The manual approach above is necessary. For the Wilson interval, the formula is more complex but can be implemented with the same building blocks. In practice, most analysts use Python or R for proportion inference and Excel for simple calculations and data entry.
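As a concrete sketch of how those building blocks combine, here is one way to add the Wilson interval to the spreadsheet above (cell addresses are ours, reusing B8 = p̂, B3 = n, B19 = z* from earlier; verify the results against statsmodels before relying on them):

Cell A24: "Wilson center" Cell B24: =(B8+B19^2/(2*B3))/(1+B19^2/B3)
Cell A25: "Wilson margin" Cell B25: =B19/(1+B19^2/B3)*SQRT(B8*(1-B8)/B3+B19^2/(4*B3^2))
Cell A26: "Lower bound" Cell B26: =B24-B25
Cell A27: "Upper bound" Cell B27: =B24+B25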
14.10 When the Success-Failure Condition Fails
What do you do when $np_0 < 10$ or $n(1 - p_0) < 10$? This happens more often than you might think:
- Testing whether a rare disease rate exceeds a low baseline (e.g., cancer incidence of 0.5%)
- Testing defect rates in high-quality manufacturing (e.g., 99.9% pass rate)
- Small pilot studies with limited participants
- Testing very popular or very unpopular positions (e.g., 98% approval rate)
In these cases, the normal approximation is unreliable, and the z-test shouldn't be used. Here are your options:
Option 1: The Exact Binomial Test
Instead of approximating the binomial distribution with a normal distribution, use the exact binomial probabilities. Under $H_0: p = p_0$, the number of successes follows $X \sim \text{Binomial}(n, p_0)$.
The p-value is computed directly from the binomial distribution:
- Right-tailed: $P(X \geq x) = \sum_{k=x}^{n} \binom{n}{k} p_0^k (1-p_0)^{n-k}$
- Left-tailed: $P(X \leq x) = \sum_{k=0}^{x} \binom{n}{k} p_0^k (1-p_0)^{n-k}$
In Python:
from scipy import stats
# Example: 3 successes out of 20 trials
# Test H₀: p = 0.05 vs Hₐ: p > 0.05
x = 3
n = 20
p_0 = 0.05
# Check: np₀ = 20 × 0.05 = 1 < 10 — success-failure fails!
# Use exact binomial test instead
# Exact p-value (right-tailed)
p_value = 1 - stats.binom.cdf(x - 1, n, p_0)
# Equivalent: P(X ≥ 3) when X ~ Binomial(20, 0.05)
print(f"Exact binomial p-value: {p_value:.4f}")
# Or use scipy's binomtest function (SciPy ≥ 1.7)
result = stats.binomtest(x, n, p_0, alternative='greater')
print(f"binomtest p-value: {result.pvalue:.4f}")
# Output:
# Exact binomial p-value: 0.0755
# binomtest p-value: 0.0755
Option 2: The Plus-Four Method (for CIs)
As described in Section 14.6, adding 2 successes and 2 failures before computing the Wald interval works well even when the standard conditions fail. This is a recommended approach for confidence intervals when the sample is small.
Option 3: Use the Wilson Interval (for CIs)
The Wilson interval doesn't require the success-failure condition and works well even for small samples and extreme proportions.
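For reference, the Wilson interval has the form (general formula, with $z^*$ the usual critical value):

$$\frac{\hat{p} + \frac{z^{*2}}{2n}}{1 + \frac{z^{*2}}{n}} \;\pm\; \frac{z^*}{1 + \frac{z^{*2}}{n}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^{*2}}{4n^2}}$$

Notice that the center is pulled from $\hat{p}$ toward 0.5, which is why the interval never spills outside $[0, 1]$ the way the Wald interval can.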
💡 Key Insight
The success-failure condition is a checkpoint, not a roadblock. When it fails, it doesn't mean you can't do inference — it means you need a different tool. The exact binomial test and Wilson interval are always available and always valid.
14.11 Putting It All Together: The Complete Decision Flowchart
Here's the complete procedure for proportion inference, from start to finish.
START: You have a question about a population proportion
                     │
                     ▼
         Is this a test or an estimate?
                     │
         ┌───────────┴───────────┐
         ▼                       ▼
       TEST                  ESTIMATE
 (hypothesis test)     (confidence interval)
         │                       │
         ▼                       ▼
  State H₀ and Hₐ         Choose confidence
  Choose α                level (e.g., 95%)
         │                       │
         ▼                       ▼
  Check conditions        Check conditions
  using p₀                using p̂
         │                       │
         ▼                       ▼
  Conditions met?         Conditions met?
     ┌───┴───┐               ┌───┴───┐
     ▼       ▼               ▼       ▼
    Yes     No              Yes     No
     │       │               │       │
     ▼       ▼               ▼       ▼
  z-test   Exact          Wald     Wilson or
           binomial       or       plus-four
           test           Wilson   interval
     │       │               │       │
     └───┬───┘               └───┬───┘
         │                       │
         └───────────┬───────────┘
                     ▼
          INTERPRET IN CONTEXT
The Five Steps for a Proportion Hypothesis Test
| Step | Action | Proportion-Specific Detail |
|---|---|---|
| 1 | State $H_0$ and $H_a$ | $H_0: p = p_0$; choose one- or two-tailed |
| 2 | Choose $\alpha$ | Consider Type I vs. Type II error consequences |
| 3 | Check conditions | Random, independent, success-failure ($np_0 \geq 10$, $n(1-p_0) \geq 10$) |
| 4 | Compute test stat and p-value | $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$; find p-value from standard normal |
| 5 | Conclude in context | Translate the decision into language the audience understands |
14.12 Common Mistakes and How to Avoid Them
Let me save you some trouble by addressing the errors I see most often.
Mistake 1: Using $\hat{p}$ Instead of $p_0$ in the Test Statistic
Wrong: $z = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/n}}$
Right: $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
The standard error in the test statistic uses $p_0$ because you're computing probabilities under the null hypothesis. The standard error in a confidence interval uses $\hat{p}$ because there's no null to assume.
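A quick numeric check with Maya's data shows the two standard errors really do differ (only slightly here; the gap widens when $p_0$ and $\hat{p}$ are far apart):

```python
import numpy as np

p_hat, p_0, n = 0.52, 0.47, 400

se_null = np.sqrt(p_0 * (1 - p_0) / n)      # correct SE for the test
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)   # correct SE for the CI
print(f"SE under H0:   {se_null:.5f} -> z = {(p_hat - p_0) / se_null:.3f}")
print(f"SE from p-hat: {se_hat:.5f} -> z = {(p_hat - p_0) / se_hat:.3f}")
```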
Mistake 2: Confusing the Success-Failure Check for Tests vs. CIs
- Hypothesis test: check $np_0 \geq 10$ and $n(1-p_0) \geq 10$
- Confidence interval: check $n\hat{p} \geq 10$ and $n(1-\hat{p}) \geq 10$
Mistake 3: Concluding "The Proportion IS $p_0$" When You Fail to Reject
Failing to reject $H_0: p = 0.31$ does NOT mean $p = 0.31$. It means you don't have enough evidence to conclude otherwise. Sam can't say "Daria's shooting is 31%." He can only say "We can't confirm it's changed from 31%."
Mistake 4: Ignoring Context in the Conclusion
"We reject $H_0$" is incomplete. "We reject $H_0$ and conclude that the county's hypertension prevalence significantly exceeds the national rate of 47%, based on a sample of 400 adults ($\hat{p} = 0.52$, $z = 2.00$, $p = 0.023$)" is a proper conclusion.
Mistake 5: Applying the Z-Test to Non-Random Samples
The z-test assumes a random sample. If your data come from a convenience sample, voluntary response, or self-selected group, the z-test gives a valid answer to the wrong question. The p-value tells you about random sampling variability, but if your sample is biased, the bigger source of error is the bias, which the p-value can't measure.
💡 Key Insight
The z-test for proportions is a powerful tool, but it answers a narrow question: "Is the observed proportion far enough from $p_0$ to be unlikely due to random sampling alone?" It cannot detect bias, confounding, or any other systematic error. The test is only as good as the data going into it.
14.13 Data Detective Portfolio: Test a Proportion Hypothesis
Time to apply proportion inference to your own dataset. This is the Chapter 14 component of the Data Detective Portfolio.
Your Task
Identify a proportion in your dataset and conduct a complete inference procedure.
1. Identify a binary variable in your dataset (or create one by dichotomizing a categorical variable). Examples:
   - BRFSS: proportion of smokers, proportion with health insurance
   - College Scorecard: proportion of students who graduate, proportion receiving financial aid
   - Gapminder: proportion of countries with life expectancy above 70
2. Formulate a hypothesis. Based on a published benchmark, national average, or theoretical expectation, set up $H_0$ and $H_a$. Explain why you chose this null value.
3. Check all three conditions explicitly. If any fail, note which and use an appropriate alternative method (exact binomial test or Wilson interval).
4. Conduct the test and build a CI.
   - Compute the z-test statistic and p-value
   - Construct both a Wald CI and a Wilson CI
   - Compare the two CIs — do they differ substantially?
5. Interpret in context (at least 3-4 sentences). What does the result mean for your research question? Does the CI agree with the test?
Template Code
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# ============================================================
# Step 1: Identify the binary variable
# ============================================================
variable = 'your_binary_variable'
success_value = 'Yes' # ← adjust to match your data
# or for numerical: success_condition = df[variable] > threshold
successes = (df[variable] == success_value).sum()
n = df[variable].notna().sum()
p_hat = successes / n
print(f"Variable: {variable}")
print(f"Successes: {successes}, n = {n}, p̂ = {p_hat:.4f}")
# ============================================================
# Step 2: Set up hypotheses
# ============================================================
p_0 = 0.5 # ← Replace with your null hypothesis value (strictly between 0 and 1)
print(f"\nH₀: p = {p_0}")
print(f"Hₐ: p ≠ {p_0}") # ← adjust direction as needed
# ============================================================
# Step 3: Check conditions
# ============================================================
print(f"\nConditions:")
print(f" np₀ = {n * p_0:.1f} {'≥' if n*p_0 >= 10 else '<'} 10")
print(f" n(1-p₀) = {n*(1-p_0):.1f} {'≥' if n*(1-p_0) >= 10 else '<'} 10")
# ============================================================
# Step 4: Conduct the test
# ============================================================
z_stat, p_value = proportions_ztest(count=successes, nobs=n,
value=p_0,
alternative='two-sided')
print(f"\nz-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
# ============================================================
# Step 5: Confidence intervals (both methods)
# ============================================================
ci_wald = proportion_confint(successes, n, alpha=0.05, method='normal')
ci_wilson = proportion_confint(successes, n, alpha=0.05, method='wilson')
print(f"\n95% CI (Wald): ({ci_wald[0]:.4f}, {ci_wald[1]:.4f})")
print(f"95% CI (Wilson): ({ci_wilson[0]:.4f}, {ci_wilson[1]:.4f})")
# ============================================================
# Step 6: Interpret in context (write your interpretation below)
# ============================================================
# YOUR INTERPRETATION HERE:
# "Based on a sample of ... from ... , the proportion of ...
# is ... The hypothesis test (z = ..., p = ...) [rejects/fails
# to reject] H₀ at the 5% significance level, suggesting that..."
14.14 Chapter Summary
What You Learned
In this chapter, you assembled the complete toolkit for proportion inference — combining the CLT for proportions (Chapter 11), confidence intervals (Chapter 12), and hypothesis testing (Chapter 13) into a systematic, applied procedure.
You learned to:
- Set up and conduct a one-sample z-test for a proportion, with rigorous conditions checking
- Construct and compare three types of confidence intervals (Wald, Wilson, plus-four)
- Interpret polling margins of error — and understand their limitations
- Handle situations where standard conditions fail (exact binomial test, Wilson interval)
- Interpret results in context, not just as "reject" or "fail to reject"
The Key Formulas
| Formula | Purpose |
|---|---|
| $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$ | Test statistic (uses $p_0$ in SE) |
| $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ | Wald CI (uses $\hat{p}$ in SE) |
| $\tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$, where $\tilde{p} = \frac{X+2}{n+4}$ | Plus-four CI |
| Success-failure: $np_0 \geq 10$ and $n(1-p_0) \geq 10$ | Conditions for the z-test |
The Threads
- Sam and Daria: The shooting percentage analysis is now fully developed for the one-sample case. The conclusion remains: 65 shots isn't enough. Sam will learn about power analysis in Chapter 17 to determine exactly how many shots he needs.
- Maya: She tested whether county hypertension prevalence exceeds the national rate — and found statistically significant evidence that it does. This result will inform her public health advocacy.
- Polling and elections: You now understand the ±3 points that pollsters report, why polls use about 1,000 respondents, and why polls can still be wrong even with valid margins of error.
What's Coming Next
In Chapter 15, you'll learn the t-test for means — which is the version of hypothesis testing you'll use most often in practice (since we rarely know $\sigma$). The logic is identical to what you've learned here, but the distribution changes from the standard normal to the t-distribution.
Then in Chapter 16, you'll finally compare two groups — which is what Alex has been waiting for with her A/B test, and what Professor Washington needs for his algorithm audit.
Key Formulas at a Glance
| Concept | Formula | When to Use |
|---|---|---|
| Sample proportion | $\hat{p} = X/n$ | Always: count successes, divide by total |
| SE for hypothesis test | $\sqrt{p_0(1-p_0)/n}$ | Testing $H_0: p = p_0$ (use $p_0$) |
| SE for confidence interval | $\sqrt{\hat{p}(1-\hat{p})/n}$ | Estimating $p$ (use $\hat{p}$) |
| Test statistic | $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$ | One-sample z-test for proportion |
| Wald CI | $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$ | Large $n$, $p$ not extreme |
| Plus-four CI | $\tilde{p} \pm z^* \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$ | Small to moderate $n$ |
| Margin of error | $z^* \sqrt{\hat{p}(1-\hat{p})/n}$ | Half-width of CI |
| Sample size for desired MOE | $n = \left(\frac{z^*}{E}\right)^2 \hat{p}(1-\hat{p})$ | Planning a study |
| Success-failure condition | $np_0 \geq 10$ and $n(1-p_0) \geq 10$ | Check before using z-test |
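The sample-size row in this table can be turned into a tiny planning helper (a sketch; with no prior estimate of $p$, using 0.5 gives the largest, most conservative $n$):

```python
import math

def sample_size_for_moe(E, p=0.5, z_star=1.96):
    """Smallest n whose margin of error is at most E."""
    return math.ceil((z_star / E) ** 2 * p * (1 - p))

print(sample_size_for_moe(0.03))    # roughly the size of a typical national poll
print(sample_size_for_moe(0.025))
```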