Learning Objectives
- Distinguish between a population and a sample, and explain why sampling is necessary in almost all real-world data science
- Identify common sources of sampling bias and evaluate whether a given sample is likely to be representative
- Compute point estimates from sample data and explain why a single number is never the whole story
- Simulate sampling distributions in Python and connect the spread of the sampling distribution to the concept of standard error
- Construct and interpret confidence intervals for a population mean using both the formula-based approach and bootstrap resampling
- Apply confidence interval analysis to the progressive project by computing CIs for mean vaccination rate by region
In This Chapter
- Chapter Overview
- 22.1 The Fundamental Problem: You Can't Measure Everything
- 22.2 How Samples Go Wrong: A Brief History of Spectacular Failures
- 22.3 Sampling Done Right: Random and Stratified Sampling
- 22.4 From Sample to Estimate: Point Estimates and Why They're Not Enough
- 22.5 The Sampling Distribution: The Idea That Makes Inference Possible
- 22.6 Confidence Intervals: Communicating Uncertainty Honestly
- 22.7 What Confidence Intervals Actually Mean (and Don't Mean)
- 22.8 The Bootstrap: Confidence Intervals Without Formulas
- 22.9 Progressive Project: Confidence Intervals for Vaccination Rates by Region
- 22.10 Common Mistakes and Misconceptions
- 22.11 Connecting the Threads
- 22.12 Chapter Summary
Chapter 22: Sampling, Estimation, and Confidence Intervals — How to Learn About Millions from a Handful
"The best thing about being a statistician is that you get to play in everyone's backyard." — John Tukey
Chapter Overview
Here is one of the most remarkable facts in all of science: if you want to know the average height of every adult in the United States — all 260 million or so of them — you don't need to measure all 260 million. You can measure about 1,000 people, chosen carefully, and get an answer that's accurate to within about an inch.
One thousand people to represent 260 million. That ratio — roughly 1 in 260,000 — sounds absurd. It sounds like trying to understand an ocean by scooping out a teaspoon of water. And yet it works. It works so reliably that the entire infrastructure of modern society depends on it: election polls, medical trials, quality control in manufacturing, food safety inspections, economic indicators, and yes — estimates of vaccination coverage around the world.
This chapter is about why it works, when it doesn't work, and how to do it in Python. You'll learn the logic of sampling, the concept of estimation, and the construction of confidence intervals — those "(plus or minus 3 percentage points)" margins you see in every news poll. More importantly, you'll learn what those intervals actually mean (which is not what most people think they mean).
We're going to build understanding through simulation first, formulas second. By the time you see an equation, you'll already know what it's trying to say because you'll have watched the phenomenon happen on your screen.
In this chapter, you will learn to:
- Distinguish between a population and a sample, and explain why sampling is necessary in almost all real-world data science (all paths)
- Identify common sources of sampling bias and evaluate whether a given sample is likely to be representative (all paths)
- Compute point estimates from sample data and explain why a single number is never the whole story (all paths)
- Simulate sampling distributions in Python and connect the spread of the sampling distribution to the concept of standard error (standard + deep dive paths)
- Construct and interpret confidence intervals for a population mean using both the formula-based approach and bootstrap resampling (all paths)
- Apply confidence interval analysis to vaccination rate data in the progressive project (all paths)
22.1 The Fundamental Problem: You Can't Measure Everything
Let's start with a thought experiment.
Imagine you're Elena, our public health analyst, and your boss walks in Monday morning with a simple question: "What's the average COVID-19 vaccination rate across all countries in the world?"
Now, in an ideal world, you'd just look it up. You'd have a perfect database with the exact vaccination rate for every country, updated yesterday, with no errors, no missing data, and no ambiguity about what "vaccination rate" means. You'd compute the mean, report it, and go to lunch.
But you don't live in an ideal world. You live in a world where:
- Some countries don't report vaccination data at all
- Some countries report data months late
- Some countries use different definitions of "fully vaccinated" (one dose? two doses? boosted?)
- Some countries have unreliable record-keeping systems
- The data changes every day as more people get vaccinated
So you can't just "look up" the true average vaccination rate for the world. That number — the true average across all countries — exists in principle, but you can't observe it directly. What you can do is take the data you have — maybe 150 countries with reasonably reliable numbers — and use that data to estimate the true value.
Congratulations. You've just walked into the world of statistical inference.
Population vs. Sample: The Two Most Important Words in This Chapter
Let's define our terms precisely, because getting these right is the foundation for everything else.
Population: The complete set of individuals, items, or observations that you want to draw conclusions about.
Sample: A subset of the population that you actually observe and measure.
The population is the thing you care about. The sample is the thing you have.
A few examples to make this concrete:
| You want to know... | The population is... | Your sample might be... |
|---|---|---|
| The average income of all US adults | All ~260 million US adults | 5,000 people from a Census survey |
| Whether a new drug lowers blood pressure | All people with high blood pressure | 200 participants in a clinical trial |
| The defect rate in a factory's production | All widgets produced this month | 500 widgets pulled from the line |
| The average vaccination rate globally | All ~195 countries | 150 countries with available data |
| The proportion of voters who support a candidate | All eligible voters | 1,200 people in a phone poll |
Notice something crucial: the population doesn't have to be people. It can be countries, widgets, transactions, tweets, or anything else. What makes it the "population" is that it's the entire group you're trying to learn about.
And notice the gap between the two columns. That gap — between what you want to know and what you have — is the entire reason this chapter exists.
Why Not Just Measure Everyone?
Sometimes you can. If you're analyzing all the transactions in your company's database last month, you have the complete population. If you're looking at every student in your class, that's the population. When you have the whole population, you don't need sampling or estimation — you just compute the answer.
But most of the time, measuring everyone is either:
- Impossible: You can't survey all 8 billion people on Earth
- Too expensive: Medical testing every single person costs too much
- Destructive: To test whether a lightbulb lasts 1,000 hours, you'd have to destroy it
- Too slow: By the time you survey everyone, the answer has changed
- Unnecessary: A well-chosen sample gives you a good enough answer at a fraction of the cost
That last point is the magical one. Sampling isn't just a compromise we make because we can't do better — it's genuinely efficient. With the right techniques, a relatively small sample can tell you a lot about a very large population.
But — and this is a big "but" — only if the sample is chosen properly.
22.2 How Samples Go Wrong: A Brief History of Spectacular Failures
Before we talk about how to sample well, let's talk about how samples can go catastrophically wrong. Because the history of sampling is littered with confident predictions based on terrible samples.
The Literary Digest Disaster (1936)
The Literary Digest was one of the most widely read magazines in America in the 1930s. In 1936, they conducted one of the largest opinion polls ever attempted: 10 million questionnaires mailed out, 2.4 million responses received. Their prediction? Alf Landon would crush Franklin Roosevelt in the presidential election, winning 57% to 43%.
Roosevelt won in a historic landslide, carrying 61% of the popular vote and 46 of 48 states.
What went wrong? The Literary Digest had assembled their mailing list from telephone directories, automobile registrations, and their own subscriber list. In 1936, during the Great Depression, telephones, automobiles, and magazine subscriptions were luxuries. Their "sample" of 2.4 million responses systematically excluded working-class Americans — exactly the people who were most enthusiastic about Roosevelt's New Deal.
Two point four million responses, and it was garbage. Meanwhile, a young pollster named George Gallup correctly predicted Roosevelt's win using a sample of only 50,000 — because his sample was designed to represent the population, not just to be large.
The lesson: Sample size is not the same as sample quality. A large biased sample is worse than a small representative one.
Sampling Bias: When Your Sample Doesn't Look Like Your Population
The Literary Digest debacle illustrates sampling bias — a systematic tendency for your sample to differ from the population in ways that affect your conclusions.
Here are the most common forms:
Selection bias: When the process of selecting your sample favors certain individuals over others. The Literary Digest had selection bias because their lists excluded people without phones or cars.
Non-response bias: When the people who respond to your survey differ systematically from those who don't. Imagine surveying customer satisfaction: angry customers might respond at higher rates, or very happy customers might. Either way, the responders aren't representative.
Survivorship bias: When you can only observe the "survivors" of some process. The famous World War II example: the military wanted to armor the parts of planes that showed the most bullet holes. Statistician Abraham Wald pointed out that they were only seeing the planes that made it back. The holes showed where planes could take damage and survive. The parts with no holes on returning planes were the places where damage was fatal — those were the parts that needed armor.
Convenience sampling: When you sample whoever is easiest to reach. Polling people at a shopping mall on a weekday afternoon gives you a sample skewed toward people who don't work 9-to-5 jobs. Surveying your social media followers gives you a sample of people who already agree with you.
Voluntary response bias: When you let people opt into your sample. Online polls ("Click here to share your opinion!") attract people with strong feelings. A restaurant's comment cards get filled out by the ecstatic and the furious, not the quietly satisfied.
Let's see what sampling bias looks like in practice:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set up for reproducibility
np.random.seed(42)
# Create a "population" of 10,000 people with incomes
# This is our ground truth — in real life, we wouldn't have this
population_income = np.concatenate([
np.random.normal(35000, 10000, 6000), # Lower income group (60%)
np.random.normal(75000, 20000, 3000), # Middle income group (30%)
np.random.normal(150000, 40000, 1000), # Upper income group (10%)
])
population_income = np.maximum(population_income, 10000) # Floor at $10k
true_mean = population_income.mean()
print(f"True population mean income: ${true_mean:,.0f}")
# --- Biased sample: convenience sample from upper-income zip codes ---
# Suppose we only sample from the top 30% of earners (like Literary Digest)
biased_indices = np.where(population_income > np.percentile(population_income, 70))[0]
biased_sample = np.random.choice(population_income[biased_indices], size=500)
print(f"Biased sample mean: ${biased_sample.mean():,.0f}")
print(f"Bias (error): ${biased_sample.mean() - true_mean:,.0f}")
# --- Random sample: every person equally likely to be selected ---
random_sample = np.random.choice(population_income, size=500, replace=False)
print(f"Random sample mean: ${random_sample.mean():,.0f}")
print(f"Error: ${random_sample.mean() - true_mean:,.0f}")
True population mean income: $57,824
Biased sample mean: $107,342
Bias (error): $49,518
Random sample mean: $58,105
Error: $281
The biased sample — which is the same size as the random sample — is off by nearly $50,000. The random sample is off by about $300. Sample size: identical. Sample quality: night and day.
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Population
axes[0].hist(population_income, bins=50, color='steelblue', alpha=0.7, edgecolor='white')
axes[0].axvline(true_mean, color='red', linewidth=2, label=f'Mean: ${true_mean:,.0f}')
axes[0].set_title('Full Population')
axes[0].set_xlabel('Income ($)')
axes[0].legend()
# Biased sample
axes[1].hist(biased_sample, bins=30, color='salmon', alpha=0.7, edgecolor='white')
axes[1].axvline(biased_sample.mean(), color='red', linewidth=2,
label=f'Mean: ${biased_sample.mean():,.0f}')
axes[1].set_title('Biased Sample (n=500)')
axes[1].set_xlabel('Income ($)')
axes[1].legend()
# Random sample
axes[2].hist(random_sample, bins=30, color='mediumseagreen', alpha=0.7, edgecolor='white')
axes[2].axvline(random_sample.mean(), color='red', linewidth=2,
label=f'Mean: ${random_sample.mean():,.0f}')
axes[2].set_title('Random Sample (n=500)')
axes[2].set_xlabel('Income ($)')
axes[2].legend()
plt.tight_layout()
plt.savefig('sampling_bias_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
The visual tells the story instantly: the biased sample looks nothing like the population, while the random sample mirrors it faithfully.
22.3 Sampling Done Right: Random and Stratified Sampling
If bias is the enemy, randomness is the antidote.
Simple Random Sampling
The simplest and most fundamental sampling method is the simple random sample (SRS): every individual in the population has an equal probability of being selected.
Think of it as drawing names from a hat — if the hat is well-mixed and every name is on exactly one slip of paper, you get a random sample. In Python, np.random.choice() with replace=False is your hat.
# Simple random sample
population = np.arange(10000) # IDs for 10,000 people
sample_ids = np.random.choice(population, size=200, replace=False)
print(f"Selected {len(sample_ids)} individuals at random")
print(f"First 10 IDs: {sample_ids[:10]}")
The magic of random sampling is that it doesn't require you to know anything about the population in advance. You don't need to know the income distribution, the age distribution, or anything else. Randomness handles the representation for you — on average, your sample will look like the population.
The key phrase is "on average." Any single random sample might, by chance, over-represent one group or under-represent another. But across many possible random samples, the average of the sample means equals the population mean. We call this property unbiasedness.
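You can watch unbiasedness happen with a quick simulation (the exponential "population" below is an arbitrary toy choice — any skewed distribution makes the same point):

```python
import numpy as np

np.random.seed(0)

# A deliberately skewed toy population: most values small, a few huge
population = np.random.exponential(scale=50, size=10_000)

# Draw many independent random samples; record each sample's mean
sample_means = [
    np.random.choice(population, size=50, replace=False).mean()
    for _ in range(2_000)
]

print(f"Population mean:         {population.mean():.2f}")
print(f"Average of sample means: {np.mean(sample_means):.2f}")
```

Any single sample mean may land well above or below the truth, but the misses cancel out: the average of the 2,000 sample means sits right on top of the population mean.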
Stratified Sampling
Sometimes you can do even better than simple random sampling. If you know something about the structure of the population, you can exploit that knowledge.
Stratified sampling divides the population into subgroups (called strata) and then takes a random sample from each stratum.
For example, if you're sampling countries to estimate global vaccination rates, you might stratify by WHO region (Africa, Americas, South-East Asia, Europe, Eastern Mediterranean, Western Pacific). You'd ensure that each region is represented in your sample in proportion to the number of countries in that region.
# Stratified sampling example
countries = pd.DataFrame({
'country': [f'Country_{i}' for i in range(195)],
'region': np.random.choice(
['Africa', 'Americas', 'SE_Asia', 'Europe', 'E_Med', 'W_Pacific'],
size=195,
p=[0.28, 0.18, 0.06, 0.28, 0.11, 0.09] # Approximate proportions
),
'vax_rate': np.random.normal(70, 20, 195).clip(10, 99)
})
# Take a stratified sample: 20% from each region
stratified_sample = countries.groupby('region', group_keys=False).apply(
lambda x: x.sample(frac=0.2, random_state=42)
)
print("Population by region:")
print(countries['region'].value_counts().sort_index())
print(f"\nStratified sample by region:")
print(stratified_sample['region'].value_counts().sort_index())
print(f"\nPopulation mean vax rate: {countries['vax_rate'].mean():.1f}%")
print(f"Stratified sample mean: {stratified_sample['vax_rate'].mean():.1f}%")
Why does stratified sampling help? Because it guarantees that each subgroup is represented. With simple random sampling, you might by chance get zero countries from a small region. Stratified sampling prevents that.
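How often would that actually happen? A quick simulation puts a number on it (the 11-country region is hypothetical, roughly the size of the SE_Asia stratum above):

```python
import numpy as np

np.random.seed(42)

n_countries = 195
region = np.array(['small'] * 11 + ['other'] * 184)  # one small region of 11 countries

n_trials = 10_000
missed = 0
for _ in range(n_trials):
    picked = np.random.choice(n_countries, size=40, replace=False)
    if not np.any(region[picked] == 'small'):
        missed += 1  # this SRS contains zero countries from the small region

print(f"SRS of 40 left the small region empty in {missed / n_trials:.1%} of trials")
```

In a typical run this lands around 7-8% — not rare at all. A stratified sample that draws 20% from each region can never leave a region empty.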
Other Sampling Methods
There are several other methods worth knowing about:
- Cluster sampling: Divide the population into clusters (e.g., schools in a city), randomly select some clusters, and measure everyone in the selected clusters. This is practical when you can't list all individuals but can list groups.
- Systematic sampling: Select every kth individual from a list (e.g., every 50th person on a voter roll). Simple and efficient, but can introduce bias if there's a pattern in the list.
- Multi-stage sampling: Combine methods — for instance, randomly select states, then randomly select counties within those states, then randomly select households within those counties. This is how large national surveys like the Census Bureau's American Community Survey work.
Each method has trade-offs between cost, practicality, and statistical precision. For this course, the most important thing is to understand the principle: your sample needs to be representative of your population, and randomness is the primary tool for achieving that.
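Of the methods above, systematic sampling is the easiest to sketch in a few lines (the 10,000-person roster is a hypothetical):

```python
import numpy as np

np.random.seed(42)

roster = np.arange(10_000)       # hypothetical ordered list of 10,000 IDs
k = len(roster) // 200           # interval: every 50th person gives n = 200
start = np.random.randint(k)     # random start, so position 0 isn't privileged
systematic_sample = roster[start::k]

print(f"k = {k}, start = {start}, sample size = {len(systematic_sample)}")
```

The random start makes every individual equally likely to be chosen. But note the caveat from the list above: if the roster has a repeating pattern with period k (say, one entry per 50-person household), you'll sample the same position in every cycle.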
22.4 From Sample to Estimate: Point Estimates and Why They're Not Enough
Okay, so you've collected a good sample. Now what?
The immediate goal is to use your sample to estimate something about the population. This brings us to two important ideas: parameters and statistics.
- A parameter is a number that describes the population. The average income of all Americans is a parameter. The true global vaccination rate is a parameter. Parameters are fixed (though unknown).
- A statistic is a number that describes the sample. The average income in your survey is a statistic. The mean vaccination rate in your 150-country dataset is a statistic. Statistics vary from sample to sample.
When we use a statistic to guess a parameter, we call it a point estimate — a single number that represents our best guess.
# Our "population" — all 195 countries
np.random.seed(42)
population_vax = pd.Series(
np.random.normal(72, 18, 195).clip(10, 99),
name='vaccination_rate'
)
true_mean = population_vax.mean()
print(f"True population mean: {true_mean:.2f}%")
# Take a random sample of 40 countries
sample = population_vax.sample(n=40, random_state=42)
sample_mean = sample.mean()
print(f"Sample mean (our point estimate): {sample_mean:.2f}%")
print(f"Error: {sample_mean - true_mean:.2f} percentage points")
The sample mean is a perfectly reasonable estimate of the population mean. In fact, it has a lovely mathematical property: it's an unbiased estimator, meaning that if you took many, many samples and averaged all the sample means, you'd get exactly the population mean.
But here's the problem with point estimates.
Why a Single Number Is Never Enough
If someone asks "What's the average vaccination rate?" and you answer "71.3%", you've told them something useful. But you've left out something critical: how confident are you in that number?
Is it 71.3% plus or minus 0.5%? Or plus or minus 15%? Those are very different situations. The first means you're quite precise. The second means you barely know anything.
A point estimate without a measure of uncertainty is like a weather forecast that says "It will be 72 degrees tomorrow" without telling you whether that means "somewhere between 65 and 79" or "almost certainly between 71 and 73." The precision of the estimate matters as much as the estimate itself.
This is where confidence intervals enter the picture. But before we can build them, we need to understand one of the most beautiful ideas in statistics: the sampling distribution.
22.5 The Sampling Distribution: The Idea That Makes Inference Possible
This section contains the most important concept in the chapter. If you understand this — really feel it — everything else falls into place.
A Thought Experiment
Imagine you could repeat the following experiment thousands of times:
- Draw a random sample of 40 countries from the population of 195
- Compute the sample mean vaccination rate
- Write down that mean
- Put all the countries back
- Repeat
After 10,000 repetitions, you'd have 10,000 sample means. If you made a histogram of those 10,000 means, you'd see the sampling distribution of the sample mean.
In real life, you only get one sample. You can't redo the study thousands of times. But understanding what would happen if you could is the key to quantifying uncertainty.
Let's simulate it:
np.random.seed(42)
# The population
population_vax = np.random.normal(72, 18, 195).clip(10, 99)
true_mean = population_vax.mean()
# Simulate 10,000 samples of size 40
n_simulations = 10000
sample_size = 40
sample_means = []
for _ in range(n_simulations):
sample = np.random.choice(population_vax, size=sample_size, replace=False)
sample_means.append(sample.mean())
sample_means = np.array(sample_means)
print(f"True population mean: {true_mean:.2f}")
print(f"Mean of sample means: {sample_means.mean():.2f}")
print(f"Std dev of sample means: {sample_means.std():.2f}")
print(f"Population std dev / sqrt(n): {population_vax.std() / np.sqrt(sample_size):.2f}")
True population mean: 70.65
Mean of sample means: 70.64
Std dev of sample means: 2.41
Population std dev / sqrt(n): 2.46
Now let's visualize this:
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
# Left: the population distribution
axes[0].hist(population_vax, bins=30, color='steelblue', alpha=0.7, edgecolor='white')
axes[0].axvline(true_mean, color='red', linewidth=2, label=f'Mean: {true_mean:.1f}%')
axes[0].set_title('Population Distribution\n(Individual countries)', fontsize=13)
axes[0].set_xlabel('Vaccination rate (%)')
axes[0].set_ylabel('Count')
axes[0].legend()
# Right: the sampling distribution
axes[1].hist(sample_means, bins=50, color='mediumseagreen', alpha=0.7, edgecolor='white')
axes[1].axvline(true_mean, color='red', linewidth=2, label=f'True mean: {true_mean:.1f}%')
axes[1].set_title('Sampling Distribution of the Mean\n(10,000 samples of n=40)', fontsize=13)
axes[1].set_xlabel('Sample mean vaccination rate (%)')
axes[1].set_ylabel('Count')
axes[1].legend()
plt.tight_layout()
plt.savefig('sampling_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
Look at those two plots side by side and let three things sink in:
- The sampling distribution is centered at the true mean. The average of all the sample means is almost exactly the population mean. This is unbiasedness in action.
- The sampling distribution is narrower than the population distribution. Individual countries range from about 15% to 99%. But sample means cluster much more tightly — roughly from 63% to 78%. Averaging things together smooths out the extremes.
- The sampling distribution is approximately normal. Even though the population distribution isn't perfectly normal (we clipped it), the distribution of sample means looks remarkably bell-shaped. This is the Central Limit Theorem from Chapter 21 making its triumphant return.
Standard Error: The Standard Deviation of the Sampling Distribution
The spread of the sampling distribution has a special name: the standard error (SE).
Standard Error: The standard deviation of the sampling distribution of a statistic. It measures how much the statistic varies from sample to sample.
For the sample mean, the standard error has a beautiful formula:
$$SE = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the population standard deviation and $n$ is the sample size.
Since we rarely know $\sigma$, we estimate it from the sample:
$$\widehat{SE} = \frac{s}{\sqrt{n}}$$
where $s$ is the sample standard deviation.
Let's verify this with our simulation:
# From our simulation
print(f"Standard error (from simulation): {sample_means.std():.3f}")
# From the formula
se_formula = population_vax.std() / np.sqrt(sample_size)
print(f"Standard error (from formula): {se_formula:.3f}")
They match beautifully. The formula gives you the answer that simulation confirms.
How Sample Size Affects the Standard Error
The $\sqrt{n}$ in the denominator has a profound implication: as your sample gets larger, the standard error shrinks — but it shrinks slowly.
sample_sizes = [10, 25, 50, 100, 150, 195]
n_simulations = 5000
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()
for i, n in enumerate(sample_sizes):
means = [np.random.choice(population_vax, size=n, replace=False).mean()
for _ in range(n_simulations)]
axes[i].hist(means, bins=40, color='steelblue', alpha=0.7, edgecolor='white')
axes[i].axvline(true_mean, color='red', linewidth=2)
se = np.std(means)
axes[i].set_title(f'n = {n} | SE = {se:.2f}', fontsize=12)
axes[i].set_xlim(55, 85)
axes[i].set_xlabel('Sample mean (%)')
plt.suptitle('Sampling Distribution Narrows as Sample Size Increases',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('sample_size_effect.png', dpi=150, bbox_inches='tight')
plt.show()
The pattern is clear:
- At n = 10, sample means are all over the place (SE around 5)
- At n = 50, they're clustering nicely (SE around 2)
- At n = 150, they're very tightly packed (SE around 1)
- At n = 195 (the whole population), there's no sampling variability at all
But notice: going from n = 10 to n = 50 (5x more data) cuts the SE roughly in half. Going from n = 50 to n = 200 (4x more data) only cuts it in half again. This is the diminishing returns of sample size — you need to quadruple your sample size to cut the standard error in half.
This matters practically. If a polling firm is deciding whether to survey 1,000 people or 4,000 people, the larger survey is four times more expensive but only twice as precise. Sometimes the extra precision is worth it. Often it's not.
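The arithmetic behind that claim is worth checking directly (σ = 30 here is an arbitrary stand-in for the population standard deviation):

```python
import numpy as np

sigma = 30  # hypothetical population standard deviation

# SE = sigma / sqrt(n): each 4x increase in n halves the standard error
for n in [1_000, 4_000, 16_000]:
    se = sigma / np.sqrt(n)
    print(f"n = {n:>6,}: SE = {se:.3f}")
```

Quadruple the cost, double the precision — which is why survey budgets hit diminishing returns so quickly.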
22.6 Confidence Intervals: Communicating Uncertainty Honestly
Now we're ready for the main event. You have a sample. You've computed a point estimate. You know that estimate isn't exact — there's sampling variability. A confidence interval puts a range around your estimate that communicates how much variability there is.
The Intuition
Remember our sampling distribution? It showed us that sample means cluster around the true population mean, with most of them falling within about 2 standard errors.
So here's the reasoning: if 95% of sample means fall within 2 standard errors of the true mean, then if we build an interval that extends 2 standard errors in each direction from our sample mean, that interval will contain the true mean about 95% of the time.
That's a confidence interval.
The Formula
For a 95% confidence interval for a population mean:
$$\bar{x} \pm z^* \times \frac{s}{\sqrt{n}}$$
where:
- $\bar{x}$ is the sample mean
- $z^*$ is the critical value (1.96 for 95% confidence)
- $s$ is the sample standard deviation
- $n$ is the sample size
- $\frac{s}{\sqrt{n}}$ is the estimated standard error
The quantity $z^* \times \frac{s}{\sqrt{n}}$ is called the margin of error.
Common confidence levels and their z values:
- 90% confidence: z = 1.645
- 95% confidence: z = 1.960
- 99% confidence: z = 2.576
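These constants aren't magic — each z* is just the quantile of the standard normal that leaves half of the leftover probability in each tail:

```python
from scipy import stats

for cl in [0.90, 0.95, 0.99]:
    # Two-sided interval: put (1 - cl)/2 in each tail,
    # so we need the (1 + cl)/2 quantile (e.g. 0.975 for a 95% CI)
    z_star = stats.norm.ppf((1 + cl) / 2)
    print(f"{cl:.0%} confidence: z* = {z_star:.3f}")
```

Running this reproduces the table above, so you never need to memorize the values.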
Let's build one:
from scipy import stats
# Take a sample of 40 countries
np.random.seed(42)
sample = np.random.choice(population_vax, size=40, replace=False)
# Compute the 95% confidence interval
sample_mean = sample.mean()
sample_se = sample.std(ddof=1) / np.sqrt(len(sample))
z_star = 1.96
ci_lower = sample_mean - z_star * sample_se
ci_upper = sample_mean + z_star * sample_se
print(f"Sample mean: {sample_mean:.2f}%")
print(f"Standard error: {sample_se:.2f}")
print(f"Margin of error: {z_star * sample_se:.2f}")
print(f"95% CI: ({ci_lower:.2f}%, {ci_upper:.2f}%)")
print(f"True population mean: {true_mean:.2f}%")
print(f"Does the CI contain the true mean? {ci_lower <= true_mean <= ci_upper}")
Sample mean: 70.12%
Standard error: 2.55
Margin of error: 4.99
95% CI: (65.13%, 75.11%)
True population mean: 70.65%
Does the CI contain the true mean? True
Our 95% confidence interval is roughly (65.1%, 75.1%). The true mean is 70.65%, which falls inside the interval.
A Technical Note: z vs. t
When the sample size is small (typically n < 30), or whenever you want to be careful, you should use the t-distribution instead of the normal distribution for your critical values. The t-distribution has heavier tails than the normal, which makes the confidence interval wider — reflecting the extra uncertainty that comes from estimating the standard deviation from a small sample.
# Using the t-distribution (more appropriate for small-to-moderate samples)
n = len(sample)
t_star = stats.t.ppf(0.975, df=n-1) # 97.5th percentile for 95% CI
ci_lower_t = sample_mean - t_star * sample_se
ci_upper_t = sample_mean + t_star * sample_se
print(f"Using z* = {z_star:.3f}: CI = ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"Using t* = {t_star:.3f}: CI = ({ci_lower_t:.2f}, {ci_upper_t:.2f})")
print(f"\nt* is slightly larger, making the CI slightly wider.")
For samples of 30 or more, the difference between z and t is small. For samples of 10 or fewer, the difference matters a lot. When in doubt, use t — it's always at least as conservative, and scipy.stats makes it easy.
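To see how quickly t converges to z, print the 95% critical value for a few sample sizes:

```python
from scipy import stats

# 95% two-sided critical value from the t-distribution, by sample size
for n in [5, 10, 30, 100, 1000]:
    t_star = stats.t.ppf(0.975, df=n - 1)
    print(f"n = {n:>4}: t* = {t_star:.3f}")

print(f"z* (normal): {stats.norm.ppf(0.975):.3f}")
```

At n = 5 the t interval is roughly 40% wider than the z interval; by n = 100 the two critical values agree to about two decimal places.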
The Shortcut: scipy.stats
You don't have to compute confidence intervals by hand. scipy.stats has built-in functions:
# One-line confidence interval using scipy
ci = stats.t.interval(confidence=0.95, df=len(sample)-1,
loc=sample.mean(), scale=stats.sem(sample))
print(f"95% CI from scipy: ({ci[0]:.2f}%, {ci[1]:.2f}%)")
22.7 What Confidence Intervals Actually Mean (and Don't Mean)
This is the section where I need to be very careful, because the interpretation of confidence intervals is one of the most commonly misunderstood topics in all of statistics. Even textbooks get it wrong sometimes. Let's get it right.
The Correct Interpretation
A 95% confidence interval means: if we repeated the sampling process many times and built a confidence interval each time, approximately 95% of those intervals would contain the true population parameter.
The "95%" refers to the process, not to any single interval.
The Incorrect (But Tempting) Interpretation
Here's what a 95% confidence interval does NOT mean:
"There is a 95% probability that the true mean is inside this particular interval."
I know. It sounds like the same thing. But it's subtly and importantly different.
The true mean is a fixed number — it's not random. It either is or isn't inside your interval. It doesn't bounce around with a 95% probability of landing in any particular range. The randomness is in the interval, not in the parameter.
Think of it this way: imagine shooting arrows at a target. Each arrow is one confidence interval, and the target (the bullseye) is the true parameter. A 95% confidence interval means your archery technique hits the bullseye 95% of the time. But once an arrow has been shot, it's either in the bullseye or it's not — it doesn't have a 95% probability of being in the bullseye after the fact.
Seeing It in Action
Let's simulate 100 confidence intervals and see how many capture the true mean:
np.random.seed(42)
n_intervals = 100
sample_size = 40
confidence_level = 0.95
captured = 0
fig, ax = plt.subplots(figsize=(10, 12))
for i in range(n_intervals):
    samp = np.random.choice(population_vax, size=sample_size, replace=False)
    samp_mean = samp.mean()
    samp_se = samp.std(ddof=1) / np.sqrt(sample_size)
    t_crit = stats.t.ppf(0.975, df=sample_size - 1)
    lower = samp_mean - t_crit * samp_se
    upper = samp_mean + t_crit * samp_se
    contains_true = lower <= true_mean <= upper
    if contains_true:
        captured += 1
        color = 'steelblue'
        alpha = 0.5
    else:
        color = 'red'
        alpha = 0.9
    ax.plot([lower, upper], [i, i], color=color, alpha=alpha, linewidth=1.5)
    ax.plot(samp_mean, i, 'o', color=color, markersize=3)
ax.axvline(true_mean, color='black', linewidth=2, linestyle='--',
           label=f'True mean: {true_mean:.1f}%')
ax.set_xlabel('Vaccination Rate (%)', fontsize=12)
ax.set_ylabel('Sample Number', fontsize=12)
ax.set_title(f'100 Confidence Intervals (95% level)\n'
             f'{captured} of {n_intervals} contain the true mean',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.savefig('confidence_intervals_100.png', dpi=150, bbox_inches='tight')
plt.show()
In a typical run, about 94-96 of the 100 intervals will capture the true mean (shown in blue), while 4-6 will miss (shown in red). That's the 95% confidence level in action: not every interval succeeds, but the method succeeds about 95% of the time.
The Relationship Between Confidence Level and Interval Width
You might be thinking: "Why not just use 99% confidence? Or 99.9%? Wouldn't that be better?"
You can! But there's a trade-off. Higher confidence requires wider intervals.
confidence_levels = [0.80, 0.90, 0.95, 0.99]
print("Confidence Level | z* | Interval Width | Interpretation")
print("-" * 70)
for cl in confidence_levels:
    z = stats.norm.ppf((1 + cl) / 2)
    margin = z * sample_se
    lower = sample_mean - margin
    upper = sample_mean + margin
    width = upper - lower
    print(f" {cl*100:.0f}% | {z:.3f} | {width:.2f}% points | "
          f"({lower:.1f}, {upper:.1f})")
Confidence Level | z* | Interval Width | Interpretation
----------------------------------------------------------------------
80% | 1.282 | 6.53% points | (66.9, 73.4)
90% | 1.645 | 8.39% points | (65.9, 74.3)
95% | 1.960 | 9.99% points | (65.1, 75.1)
99% | 2.576 | 13.14% points | (63.6, 76.7)
At 99% confidence, your interval is much wider — you're more confident, but less precise. At 80% confidence, your interval is narrow and precise, but it'll miss the true value 20% of the time.
The 95% level is a convention, not a law of nature. It balances precision and coverage in a way that most fields find acceptable. But the right choice depends on the stakes. In pharmaceutical testing, you might want 99%. In a quick market research survey, 90% might be fine.
22.8 The Bootstrap: Confidence Intervals Without Formulas
Everything we've done so far relies on the normal distribution and the Central Limit Theorem. Those work beautifully for means with moderate-to-large samples. But what if you want a confidence interval for a median? Or a correlation coefficient? Or some complicated statistic that doesn't have a nice formula for the standard error?
Enter the bootstrap — one of the most powerful and elegant ideas in modern statistics.
The Idea
The bootstrap, invented by Bradley Efron in 1979, is based on a simple but profound insight: since we can't draw repeated samples from the population (that's the whole problem), we'll resample from our sample instead.
Here's the procedure:
1. Start with your sample of size n
2. Draw a new sample of size n with replacement from your original sample
3. Compute the statistic of interest (mean, median, whatever) for this new sample
4. Repeat steps 2-3 thousands of times
5. The distribution of those statistics is the bootstrap sampling distribution
6. Use the percentiles of that distribution as your confidence interval
"With replacement" is the key detail. It means that each observation in your original sample can appear zero, one, or multiple times in any given bootstrap sample. This is what creates the variability between bootstrap samples.
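To make the with-replacement idea concrete, here's a tiny sketch (the five values are made up): each resample repeats some observations and drops others, and on average a resample omits about a third of the original points.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([10, 20, 30, 40, 50])

# One bootstrap resample: same size as the original, drawn WITH replacement,
# so some values appear more than once and others not at all
resample = rng.choice(sample, size=len(sample), replace=True)
print("Resample:", resample)

# On average a resample omits about (1 - 1/n)^n of the original
# observations (~33% for n = 5, ~37% for large n)
omitted = np.mean([
    len(set(sample) - set(rng.choice(sample, size=len(sample), replace=True)))
    for _ in range(10_000)
]) / len(sample)
print(f"Average fraction of observations omitted: {omitted:.3f}")
```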
Bootstrap in Python
np.random.seed(42)
# Our original sample of 40 countries
sample = np.random.choice(population_vax, size=40, replace=False)
sample_mean = sample.mean()
# Bootstrap: resample with replacement 10,000 times
n_bootstrap = 10000
bootstrap_means = np.array([
    np.random.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_bootstrap)
])
# The percentile method for a 95% CI
ci_lower_boot = np.percentile(bootstrap_means, 2.5)
ci_upper_boot = np.percentile(bootstrap_means, 97.5)
print(f"Sample mean: {sample_mean:.2f}%")
print(f"Bootstrap 95% CI: ({ci_lower_boot:.2f}%, {ci_upper_boot:.2f}%)")
print(f"Formula-based 95% CI: ({ci_lower:.2f}%, {ci_upper:.2f}%)")
print(f"True population mean: {true_mean:.2f}%")
Sample mean: 70.12%
Bootstrap 95% CI: (65.31%, 74.77%)
Formula-based 95% CI: (65.13%, 75.11%)
True population mean: 70.65%
The bootstrap interval is very close to the formula-based interval — as it should be, since for the mean with a reasonable sample size, both methods are valid.
Let's visualize the bootstrap distribution:
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(bootstrap_means, bins=60, color='mediumseagreen', alpha=0.7,
        edgecolor='white', density=True)
ax.axvline(sample_mean, color='blue', linewidth=2, linestyle='-',
           label=f'Sample mean: {sample_mean:.1f}%')
ax.axvline(ci_lower_boot, color='red', linewidth=2, linestyle='--',
           label=f'95% CI: ({ci_lower_boot:.1f}, {ci_upper_boot:.1f})')
ax.axvline(ci_upper_boot, color='red', linewidth=2, linestyle='--')
ax.axvline(true_mean, color='black', linewidth=2, linestyle=':',
           label=f'True mean: {true_mean:.1f}%')
ax.set_title('Bootstrap Sampling Distribution (10,000 resamples)', fontsize=13)
ax.set_xlabel('Sample mean vaccination rate (%)')
ax.set_ylabel('Density')
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('bootstrap_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
Why the Bootstrap Is So Powerful
The real power of the bootstrap isn't for means — we already have formulas for those. The power is for statistics where formulas are complicated or don't exist:
# Bootstrap CI for the MEDIAN
bootstrap_medians = np.array([
    np.median(np.random.choice(sample, size=len(sample), replace=True))
    for _ in range(n_bootstrap)
])
ci_median = (np.percentile(bootstrap_medians, 2.5),
             np.percentile(bootstrap_medians, 97.5))
print(f"Sample median: {np.median(sample):.2f}%")
print(f"Bootstrap 95% CI for median: ({ci_median[0]:.2f}%, {ci_median[1]:.2f}%)")
# Bootstrap CI for the 25th PERCENTILE
bootstrap_q25 = np.array([
    np.percentile(np.random.choice(sample, size=len(sample), replace=True), 25)
    for _ in range(n_bootstrap)
])
ci_q25 = (np.percentile(bootstrap_q25, 2.5),
          np.percentile(bootstrap_q25, 97.5))
print(f"\nSample 25th percentile: {np.percentile(sample, 25):.2f}%")
print(f"Bootstrap 95% CI for 25th percentile: ({ci_q25[0]:.2f}%, {ci_q25[1]:.2f}%)")
# Bootstrap CI for the STANDARD DEVIATION
bootstrap_stds = np.array([
    np.std(np.random.choice(sample, size=len(sample), replace=True), ddof=1)
    for _ in range(n_bootstrap)
])
ci_std = (np.percentile(bootstrap_stds, 2.5),
          np.percentile(bootstrap_stds, 97.5))
print(f"\nSample std dev: {np.std(sample, ddof=1):.2f}%")
print(f"Bootstrap 95% CI for std dev: ({ci_std[0]:.2f}%, {ci_std[1]:.2f}%)")
Try getting a confidence interval for the 25th percentile using a formula. It's messy. With the bootstrap, it's exactly the same code — just swap out the statistic.
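Because only the statistic changes between those three blocks, the pattern is easy to wrap in a small helper. This is a sketch rather than a library function; `bootstrap_ci` is a name introduced here:

```python
import numpy as np

def bootstrap_ci(data, statistic, n_bootstrap=10_000, confidence=0.95, seed=None):
    """Percentile-method bootstrap CI for any statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement and apply the statistic each time
    boot_stats = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_bootstrap)
    ])
    alpha = (1 - confidence) / 2
    return (np.percentile(boot_stats, 100 * alpha),
            np.percentile(boot_stats, 100 * (1 - alpha)))

# Same helper, three different statistics (toy data):
data = np.random.default_rng(42).normal(70, 15, size=40)
print("mean  :", bootstrap_ci(data, np.mean, seed=1))
print("median:", bootstrap_ci(data, np.median, seed=1))
print("IQR   :", bootstrap_ci(
    data, lambda x: np.percentile(x, 75) - np.percentile(x, 25), seed=1))
```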
When the Bootstrap Doesn't Work
The bootstrap isn't magic. It has limitations:
- Very small samples: With n = 5 or 10, the bootstrap doesn't have enough data to resample meaningfully
- Heavy-tailed distributions: For distributions with extreme outliers, the bootstrap can be unreliable
- Dependent data: If your observations aren't independent (e.g., time series), you need a modified bootstrap (block bootstrap)
- Edge cases: The bootstrap struggles with statistics at the boundary of a distribution (e.g., the maximum value)
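To see the boundary problem concretely, here's a sketch (with simulated data) of bootstrapping the sample maximum: no resample can ever exceed the observed maximum, and most resamples hit it exactly, so the bootstrap distribution is badly degenerate.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0, 100, size=50)

# Bootstrap distribution of the MAXIMUM
boot_maxes = np.array([
    rng.choice(sample, size=len(sample), replace=True).max()
    for _ in range(10_000)
])

# No resample can contain a value larger than the observed max...
print(f"Observed max:          {sample.max():.2f}")
print(f"Largest bootstrap max: {boot_maxes.max():.2f}")
# ...and a large share of resamples (about 1 - (1 - 1/n)^n) hit it exactly
print(f"Fraction of resamples equal to the observed max: "
      f"{(boot_maxes == sample.max()).mean():.2f}")
```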
For most practical data science applications with samples of 30 or more, the bootstrap works wonderfully.
22.9 Progressive Project: Confidence Intervals for Vaccination Rates by Region
Time to apply everything we've learned to the vaccination dataset from our progressive project. We'll construct confidence intervals for the mean vaccination rate in each WHO region.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Create our project dataset
# (In a real project, you'd load this from your cleaned CSV)
np.random.seed(42)
regions = {
    'Africa': {'n': 47, 'mean': 52, 'std': 22},
    'Americas': {'n': 35, 'mean': 72, 'std': 15},
    'SE Asia': {'n': 11, 'mean': 68, 'std': 18},
    'Europe': {'n': 53, 'mean': 78, 'std': 12},
    'E Mediterranean': {'n': 22, 'mean': 61, 'std': 20},
    'W Pacific': {'n': 27, 'mean': 75, 'std': 14},
}
rows = []
for region, params in regions.items():
    rates = np.random.normal(params['mean'], params['std'], params['n'])
    rates = np.clip(rates, 5, 99)
    for rate in rates:
        rows.append({'region': region, 'vaccination_rate': rate})
df = pd.DataFrame(rows)
print(f"Total countries: {len(df)}")
print(f"\nCountries per region:")
print(df['region'].value_counts().sort_index())
Now let's compute confidence intervals for each region:
# Compute 95% CIs for each region
results = []
for region in df['region'].unique():
    data = df[df['region'] == region]['vaccination_rate']
    n = len(data)
    mean = data.mean()
    se = data.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n-1)
    ci_lower = mean - t_crit * se
    ci_upper = mean + t_crit * se
    margin = t_crit * se
    results.append({
        'Region': region,
        'n': n,
        'Mean': mean,
        'Std Dev': data.std(ddof=1),
        'Std Error': se,
        'CI Lower': ci_lower,
        'CI Upper': ci_upper,
        'Margin of Error': margin
    })
results_df = pd.DataFrame(results).sort_values('Mean')
print("\n95% Confidence Intervals for Mean Vaccination Rate by Region")
print("=" * 80)
for _, row in results_df.iterrows():
    print(f"{row['Region']:20s} n={row['n']:2.0f} "
          f"Mean={row['Mean']:5.1f}% "
          f"95% CI: ({row['CI Lower']:5.1f}%, {row['CI Upper']:5.1f}%) "
          f"MOE=+/-{row['Margin of Error']:.1f}")
95% Confidence Intervals for Mean Vaccination Rate by Region
================================================================================
Africa n=47 Mean=52.8% 95% CI: (46.4%, 59.2%) MOE=+/-6.4
E Mediterranean n=22 Mean=59.5% 95% CI: (51.5%, 67.5%) MOE=+/-8.0
SE Asia n=11 Mean=68.2% 95% CI: (56.2%, 80.2%) MOE=+/-12.0
Americas n=35 Mean=71.3% 95% CI: (66.3%, 76.3%) MOE=+/-5.0
W Pacific n=27 Mean=74.8% 95% CI: (69.4%, 80.2%) MOE=+/-5.4
Europe n=53 Mean=77.5% 95% CI: (74.2%, 80.8%) MOE=+/-3.3
Let's visualize these confidence intervals:
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c', '#e67e22', '#f39c12', '#3498db', '#2ecc71', '#9b59b6']
results_sorted = results_df.sort_values('Mean')
for i, (_, row) in enumerate(results_sorted.iterrows()):
    ax.plot([row['CI Lower'], row['CI Upper']], [i, i],
            color=colors[i], linewidth=3, solid_capstyle='round')
    ax.plot(row['Mean'], i, 'o', color=colors[i], markersize=10,
            zorder=5, markeredgecolor='white', markeredgewidth=2)
    ax.text(row['CI Upper'] + 1, i,
            f" {row['Mean']:.1f}% ({row['CI Lower']:.1f}, {row['CI Upper']:.1f})",
            va='center', fontsize=10)
ax.set_yticks(range(len(results_sorted)))
ax.set_yticklabels(results_sorted['Region'], fontsize=11)
ax.set_xlabel('Vaccination Rate (%)', fontsize=12)
ax.set_title('95% Confidence Intervals for Mean Vaccination Rate by Region',
             fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('vaccination_ci_by_region.png', dpi=150, bbox_inches='tight')
plt.show()
Interpreting the Results
Several things jump out from these intervals:
- Europe has the narrowest interval — both because it has the most countries (n=53) and the smallest standard deviation. We're quite precise about Europe's average.
- SE Asia has the widest interval — with only 11 countries, there's substantial uncertainty. The mean could plausibly be anywhere from about 56% to 80%.
- Africa and Europe don't overlap — their confidence intervals are completely separate. This strongly suggests that the true means are genuinely different. (We'll formalize this idea in Chapter 23 with hypothesis testing.)
- Americas and W Pacific overlap substantially — we can't confidently say their true means are different.
- The margin of error depends on both sample size and variability. Africa has a larger margin of error than Europe despite having a decent sample size, because Africa's vaccination rates are more variable.
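That last point falls straight out of the margin-of-error formula, MOE = t* × s / √n. Using the approximate regional values from the table above:

```python
import numpy as np
from scipy import stats

def margin_of_error(s, n, confidence=0.95):
    """t-based margin of error for a sample mean: t* * s / sqrt(n)."""
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return t_crit * s / np.sqrt(n)

# Approximate values from the regional table (s values are the
# simulation parameters, so results match the table only roughly):
print(f"Africa: n=47, s~22 -> MOE ~ {margin_of_error(22, 47):.1f} points")
print(f"Europe: n=53, s~12 -> MOE ~ {margin_of_error(12, 53):.1f} points")
```

Africa's larger spread outweighs the similar sample size, so its interval is roughly twice as wide as Europe's.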
Bootstrap Comparison
Let's also compute bootstrap intervals for comparison:
np.random.seed(42)
n_boot = 10000
print("Comparison: Formula vs. Bootstrap 95% CIs")
print("=" * 75)
for region in results_sorted['Region']:
    data = df[df['region'] == region]['vaccination_rate'].values
    # Bootstrap
    boot_means = [np.random.choice(data, size=len(data), replace=True).mean()
                  for _ in range(n_boot)]
    boot_lower = np.percentile(boot_means, 2.5)
    boot_upper = np.percentile(boot_means, 97.5)
    # Formula (already computed)
    row = results_sorted[results_sorted['Region'] == region].iloc[0]
    print(f"{region:20s} "
          f"Formula: ({row['CI Lower']:5.1f}, {row['CI Upper']:5.1f}) "
          f"Bootstrap: ({boot_lower:5.1f}, {boot_upper:5.1f})")
The bootstrap and formula-based intervals are very similar, which gives us confidence that both methods are working correctly.
22.10 Common Mistakes and Misconceptions
Before we close this chapter, let's address the mistakes that trip up almost everyone.
Mistake 1: "A Wider Interval Means Worse Data"
Not necessarily. A wider interval can mean:
- A smaller sample (less data)
- A more variable population (more spread)
- A higher confidence level (more conservative)
A wide interval is an honest interval. It's saying: "Here's what we know, and here's how much uncertainty remains." Narrowing the interval by lowering the confidence level doesn't make your data better — it just makes you more likely to be wrong.
Mistake 2: "95% Confidence Means 95% Probability"
We covered this in Section 22.7, but it's worth repeating because it's the single most common error. The 95% refers to the procedure's long-run success rate, not to the probability that any specific interval contains the truth.
Mistake 3: "My Interval Doesn't Contain Zero, So the Effect Is Real"
This is more relevant for confidence intervals around differences or coefficients (which we'll see in later chapters), but the logic applies here too. Whether a particular value is "in" or "out" of your interval depends heavily on your sample size and variability. A value just barely outside your interval isn't meaningfully different from a value just barely inside it.
Mistake 4: "More Data Is Always Better"
More data reduces the standard error, but it doesn't fix bias. A biased sample of 1 million is worse than an unbiased sample of 100. Always prioritize sample quality over sample size.
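A toy simulation (all numbers invented for illustration) makes the point: a million observations from a source that over-represents high values stays stubbornly wrong, while a hundred unbiased observations land close to the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(70, 15, size=1_000_000)

# Unbiased sample of 100
unbiased = rng.choice(population, size=100, replace=False)

# Biased "sample" of 500,000: over-samples high values 3-to-1,
# a crude stand-in for something like an opt-in internet survey
weights = np.where(population > 70, 3.0, 1.0)
weights /= weights.sum()
biased = rng.choice(population, size=500_000, replace=True, p=weights)

print(f"True mean:           {population.mean():.2f}")
print(f"Unbiased, n=100:     {unbiased.mean():.2f}")
print(f"Biased, n=500,000:   {biased.mean():.2f}")
```

More data shrinks the random error around the wrong answer; it never moves the answer toward the truth.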
Mistake 5: Ignoring the Sampling Design
The formulas in this chapter assume simple random sampling. If your data comes from a stratified sample, a cluster sample, or some other complex design, the standard error formulas need to be adjusted. Using simple random sampling formulas on data from a complex survey design can give you intervals that are too narrow (overconfident).
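As a rough illustration of why: for a cluster sample, a standard survey-sampling approximation inflates the variance by the design effect DEFF = 1 + (m - 1)ρ, where m is the average cluster size and ρ is the intra-cluster correlation. The numbers below are hypothetical.

```python
import numpy as np

def clustered_se(srs_se, cluster_size, icc):
    """Inflate a simple-random-sampling SE by the design effect
    DEFF = 1 + (m - 1) * icc (a standard survey-sampling approximation)."""
    deff = 1 + (cluster_size - 1) * icc
    return srs_se * np.sqrt(deff)

srs_se = 2.0        # SE computed as if the data were a simple random sample
m, icc = 20, 0.05   # hypothetical: clusters of 20, modest within-cluster correlation
print(f"Naive SE:           {srs_se:.2f}")
print(f"Design-adjusted SE: {clustered_se(srs_se, m, icc):.2f}")
```

Even a modest intra-cluster correlation nearly doubles the variance here, which is exactly how naive formulas end up overconfident.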
22.11 Connecting the Threads
Let's step back and see where this chapter fits in the bigger picture.
Looking back: In Chapter 19, we computed descriptive statistics — means, medians, standard deviations. Those were descriptions of the data we had. In Chapter 21, we learned about distributions and the Central Limit Theorem. This chapter has used the CLT to bridge the gap between "what our data shows" and "what we can say about the world."
Looking forward: In Chapter 23, we'll use sampling distributions to test hypotheses — asking not just "what's the plausible range for this parameter?" but "is there enough evidence to reject a specific claim?" Confidence intervals and hypothesis tests are two sides of the same coin.
The progressive project: We've now added uncertainty quantification to our vaccination analysis. We're no longer just saying "Africa's average vaccination rate is 52.8%." We're saying "Africa's average vaccination rate is somewhere around 46% to 59%, with 95% confidence." That's a more honest and more useful statement.
The recurring characters: Elena would use these intervals to report vaccination coverage to her public health department — and the margins of error would directly influence how many resources are allocated. Marcus might compute a confidence interval for his average daily revenue to decide whether his bakery can afford a new hire. Priya could report player statistics with margins of error rather than single numbers. Jordan could test whether the average GPA difference between departments is large enough to be meaningful.
Key Insight: Statistics is not about certainty. It's about quantifying how uncertain you are. A confidence interval is an honest admission: "Here's my best guess, and here's how wrong I might be." That honesty is what makes the estimate useful.
22.12 Chapter Summary
In this chapter, you learned the logic of sampling and estimation — one of the foundational ideas in statistical thinking. Here's the journey we took:
- Population vs. sample: You want to know about the population, but you only have a sample. The gap between them is the fundamental problem of statistical inference.
- Sampling bias: Not all samples are created equal. A biased sample — no matter how large — can give you wildly wrong conclusions. Randomness is the antidote to bias.
- Point estimates: Your sample mean is a reasonable guess at the population mean. But a single number without a measure of uncertainty is incomplete.
- Sampling distributions: If you could take many samples, the distribution of sample means would be centered at the true mean and have a spread (standard error) of $\sigma / \sqrt{n}$.
- Confidence intervals: By extending a margin of error around your point estimate, you create a range that captures the true parameter about 95% (or 90%, or 99%) of the time.
- The bootstrap: When formulas are unavailable or inappropriate, resampling from your sample gives you a practical way to estimate the sampling distribution and build confidence intervals for any statistic.
- Interpretation matters: "95% confidence" refers to the long-run success rate of the method, not to the probability that any particular interval contains the truth.
Next up: Chapter 23, where we take the logical next step — using these tools to test specific claims about the world. Does vaccination coverage really differ between income groups, or could the difference we see be just sampling noise? That's hypothesis testing, and it's where things get both powerful and dangerously easy to misunderstand.
You've now added uncertainty quantification to your statistical toolkit. Every number you report from this point forward should come with a measure of "how sure?" The confidence interval is your honest answer to that question.
Related Reading
Explore this topic in other books:
- Introductory Statistics: Sampling Distributions and CLT
- Introductory Statistics: Designing Studies
- Political Analytics: Survey Design