

Chapter 18: The Bootstrap and Simulation-Based Inference

"The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates." — Bradley Efron, inventor of the bootstrap

Chapter Overview

What if I told you that the most powerful idea in modern statistics requires almost no formulas?

Everything you've learned about confidence intervals and hypothesis tests — in Chapters 12 through 16 — has relied on the same basic strategy: derive a formula for the standard error, use it to build a test statistic, and look up the result on a known distribution (normal or $t$). That approach works beautifully when the math cooperates. But sometimes it doesn't.

Dr. Maya Chen runs into this problem right now. She wants to estimate the median wait time in her emergency department — not the mean, the median — because wait times are right-skewed and the median better represents the typical patient experience. But here's the thing: there's no neat formula for the standard error of a sample median. The $t$-interval from Chapter 12? That's for means. The $z$-interval from Chapter 14? That's for proportions. Medians don't fit into either box.

Alex Rivera has a different problem. StreamVibe just ran another A/B test, but the sample sizes are small and the data are wildly non-normal. The two-sample $t$-test from Chapter 16 assumes either normality or large samples (thanks to the CLT). Alex's data fail both conditions.

And Sam Okafor wants to build a confidence interval for the ratio of Daria's three-point shooting percentage this season to last season. Not the difference — the ratio. Good luck finding that in a standard error table.

All three need a method that works without relying on formulas for specific statistics, distributional assumptions, or large-sample approximations. That method exists, and it's called the bootstrap.

The bootstrap and its cousin, the permutation test, are part of a revolution in statistics that began in 1979 when a Stanford statistician named Bradley Efron had an astonishingly simple idea: what if we could approximate the sampling distribution of any statistic, for any data, by cleverly reusing the sample we already have?

That idea — the resampling idea — is the threshold concept of this chapter. And it's one of the most beautiful ideas in all of statistics.

In this chapter, you will learn to:

  • Explain the logic of bootstrap resampling
  • Construct bootstrap confidence intervals
  • Conduct a permutation test for comparing two groups
  • Compare simulation-based and formula-based approaches
  • Recognize when bootstrap methods are especially useful

Fast Track: If you've encountered the bootstrap before, skim Sections 18.1–18.3, then jump to Section 18.7 (comparison table). Complete quiz questions 1, 10, and 18 to verify.

Deep Dive: After this chapter, read Case Study 1 (Maya's bootstrap analysis of hospital wait times) for a complete public health application, then Case Study 2 (Alex's permutation test for A/B testing) for a tech industry application. Both include full Python code.


18.1 A Puzzle Before We Start (Productive Struggle)

Before we dive in, try this thought experiment.

The Island Problem

You're a biologist studying a rare species of frog on a remote island. You've managed to capture and measure 12 frogs — that's the entire sample you can get. The lengths (in cm) are:

4.2, 3.8, 5.1, 4.7, 3.5, 6.3, 4.4, 5.8, 4.1, 3.9, 4.6, 5.0

You want a confidence interval for the median frog length on this island.

(a) Could you use the $t$-interval formula from Chapter 12? Why or why not?

(b) You can't go back to the island to collect more frogs. But suppose you could somehow create "new samples" from the data you already have. How might you do that?

(c) If you wrote down each frog length on a slip of paper, put the slips in a hat, and drew 12 slips with replacement (putting each slip back before drawing the next), what would the resulting "sample" look like? Would it be identical to your original sample?

(d) If you repeated part (c) thousands of times and computed the median of each new "sample," what would the distribution of those medians tell you?

Take 3 minutes. Part (d) is the key insight of the entire chapter.

Here's what I hope you noticed:

For part (a), the $t$-interval is designed specifically for means, not medians. The formula $\bar{x} \pm t^* \cdot s/\sqrt{n}$ doesn't apply. We learned in Chapter 12 that this formula rests on the CLT, which tells us the sampling distribution of the mean is approximately normal. There's no equivalent theorem for the sampling distribution of the median (at least not one simple enough for an introductory course).

For part (b), this is the creative leap. You can't physically collect more frogs, but you can create new "virtual samples" by resampling from the data you have. The original 12 measurements are the best information you've got about the population of frog lengths. So treat them as if they were the population — or at least the best available model of it.

Part (c) is crucial: when you draw with replacement, you'll sometimes pick the same frog length more than once, and sometimes skip a length entirely. A resampled dataset might look like: 4.2, 5.1, 5.1, 3.5, 4.7, 4.7, 6.3, 4.1, 4.1, 5.0, 3.9, 4.7. It's the same size as the original (12 values) but not identical — some values appear multiple times, others are missing. This mimics the randomness of sampling from a population.

And part (d) is the bootstrap in a nutshell. If you repeat this process 10,000 times and compute the median each time, you get a distribution of 10,000 medians. That distribution approximates the sampling distribution of the median — the very thing we couldn't derive a formula for. The spread of this distribution tells you how much the sample median varies from sample to sample. And that gives you everything you need to build a confidence interval.

You just invented the bootstrap. Congratulations — Bradley Efron beat you to it by several decades, but you got there on your own (in spirit, at least).


18.2 The Bootstrap Idea

The Big Insight

Here's the core idea, stated plainly:

The sample is our best model of the population.

That's it. That's the insight that launched a revolution.

Think about it. If you want to know how a statistic (like the sample median) varies from sample to sample, the ideal approach would be to draw thousands of new samples from the population and compute the median of each one. That would give you the true sampling distribution — the concept you learned in Chapter 11.

But you can't draw thousands of new samples from the population. You have one sample. That's all you can afford. So what do you do?

You draw thousands of new samples from the sample itself.

This sounds circular. It sounds like cheating. But here's why it works: if the sample is reasonably representative of the population (and with random sampling, it should be), then resampling from the sample mimics resampling from the population. The bootstrap distribution you create won't be a perfect copy of the true sampling distribution — but it will be a remarkably good approximation.
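This claim is checkable. In the sketch below we invent a hypothetical population (an exponential distribution, chosen purely for illustration) so that we can build the true sampling distribution of the mean by brute force and compare its spread to the bootstrap's estimate built from a single sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: exponential with mean 10 (so sigma = 10).
# Theoretical SE of the mean for n = 30: 10 / sqrt(30), about 1.83.
n = 30

# "True" sampling distribution: draw 5,000 genuine samples of size n.
true_means = rng.exponential(scale=10, size=(5000, n)).mean(axis=1)

# Bootstrap: one sample, resampled with replacement 5,000 times.
one_sample = rng.exponential(scale=10, size=n)
boot_means = np.array([rng.choice(one_sample, size=n, replace=True).mean()
                       for _ in range(5000)])

print(f"SD of true sampling distribution: {true_means.std():.2f}")
print(f"Bootstrap standard error:         {boot_means.std():.2f}")
```

The two numbers won't match exactly — the bootstrap only ever sees one sample — but they land in the same neighborhood, which is all a standard error needs to do.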

🔄 Spaced Review 1 (from Ch.11): CLT and Sampling Distributions

In Chapter 11, you learned the most transformative idea in statistics: the sampling distribution. If you could draw every possible sample of size $n$ from a population and compute the sample mean of each, those means would form a normal distribution centered at $\mu$ with standard deviation $\sigma/\sqrt{n}$.

The Central Limit Theorem gave us that result theoretically — without actually drawing all those samples. The bootstrap gives us a similar result empirically — by simulating the sampling process through resampling. Think of the CLT as the analytic shortcut and the bootstrap as the computational brute-force approach. Both aim at the same target: the sampling distribution.

A Brief History: Bradley Efron and the Name

In 1979, Stanford statistician Bradley Efron published a paper called "Bootstrap Methods: Another Look at the Jackknife" that would change statistics forever. The idea was so simple that many statisticians initially dismissed it — surely you can't learn anything genuinely new by reshuffling data you already have?

But Efron showed rigorously that the bootstrap works. It produces valid confidence intervals and standard errors for an enormous range of statistics. And it does so without requiring the mathematical derivations that make traditional statistics so difficult.

The name "bootstrap" comes from the phrase "pulling yourself up by your own bootstraps" — a seemingly impossible feat. The method pulls information about sampling variability from the sample itself, which seems equally impossible. (The phrase is often attributed to the Baron Munchausen stories, where the Baron claims to have pulled himself out of a swamp by his own hair. Statisticians have a sense of humor.)

Efron would later receive the National Medal of Science in 2005 and the International Prize in Statistics in 2019, largely for this contribution. The bootstrap is now one of the most widely used methods in all of applied statistics.

Bootstrap vs. the Traditional Approach

Let's contrast the two approaches to inference:

Traditional (formula-based) approach:

  1. Compute a statistic from the sample (e.g., $\bar{x}$)
  2. Derive a formula for the standard error of that statistic (e.g., $SE = s/\sqrt{n}$)
  3. Use a known distribution (normal or $t$) to build a confidence interval or compute a p-value

Bootstrap (simulation-based) approach:

  1. Compute a statistic from the sample (e.g., the median)
  2. Resample from the sample with replacement, thousands of times
  3. Compute the statistic for each resampled dataset
  4. Use the distribution of those statistics to build a confidence interval or compute a p-value

Step 1 is identical. The approaches diverge at step 2: the traditional approach uses mathematical theory to determine the sampling distribution, while the bootstrap uses computation to simulate it.


18.3 Bootstrap Resampling: The Procedure

Let's make the procedure concrete.

Step-by-Step: How Bootstrap Resampling Works

Suppose your original sample has $n$ observations: $x_1, x_2, \ldots, x_n$.

  1. Draw a bootstrap sample: Randomly select $n$ values from your original sample with replacement. Each original observation has an equal chance ($1/n$) of being selected at each draw. Some values will appear multiple times; others will be left out.

  2. Compute the statistic: Calculate whatever statistic you're interested in (mean, median, standard deviation, correlation, proportion, ratio — anything) from this bootstrap sample. Call this $\hat{\theta}^*_1$.

  3. Repeat: Do steps 1 and 2 a total of $B$ times (typically $B = 10{,}000$ or more). You now have $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$.

  4. Analyze the bootstrap distribution: The collection of $B$ bootstrap statistics forms the bootstrap distribution. Its shape, center, and spread approximate the sampling distribution of your statistic.

Concept 1: Bootstrap Resampling

Bootstrap resampling is the process of drawing new samples of size $n$ from the original sample, with replacement, and computing a statistic from each resampled dataset. Each such sample is called a bootstrap sample, and the resulting collection of statistics is the bootstrap distribution.

"With Replacement" — Why It Matters

The phrase with replacement is doing all the heavy lifting here.

If you sampled without replacement from a dataset of 12 values, you'd just get the same 12 values back in a different order. The median would be identical every time. You'd learn nothing about variability.

Sampling with replacement means that after you pick a value, you "put it back" before the next pick. This creates genuine variation: your bootstrap sample might include one frog measured at 6.3 cm three times and completely skip the one measured at 3.8 cm. Each bootstrap sample is a plausible alternative version of the data you might have observed — as if you had drawn a different random sample from the same underlying population.

Key Term: With Replacement

With replacement means that each observation remains available for selection after it has been drawn. In a sample of size $n$, with-replacement sampling produces samples where some observations appear more than once and others may not appear at all. On average, about 63.2% of the original observations appear in any single bootstrap sample.

That 63.2% figure comes from probability: the chance that a particular observation is not selected in any of the $n$ draws is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ for large $n$. So about 36.8% of observations are left out, and 63.2% are included (some more than once).
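That calculation is easy to verify by simulation. The sketch below (an illustrative check, not part of the bootstrap procedure itself) counts what fraction of the distinct original observations shows up in each of many bootstrap samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100          # sample size
B = 10_000       # number of bootstrap samples

# For each bootstrap sample, record the fraction of distinct original
# observations that appear at least once.
fractions = np.empty(B)
for b in range(B):
    draws = rng.integers(0, n, size=n)       # indices drawn with replacement
    fractions[b] = np.unique(draws).size / n

print(f"Average fraction included: {fractions.mean():.3f}")
print(f"Theory: 1 - (1 - 1/n)^n = {1 - (1 - 1/n)**n:.3f}")
```

Both numbers come out near 0.632, matching the $(1 - 1/n)^n \to 1/e$ argument above.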

Visualizing the Process

Let's see the bootstrap in action with a small example. Imagine our frog data:

Original sample: [4.2, 3.8, 5.1, 4.7, 3.5, 6.3, 4.4, 5.8, 4.1, 3.9, 4.6, 5.0]
Original median: 4.5

Bootstrap sample 1: [4.7, 3.5, 5.1, 5.1, 4.2, 3.9, 4.6, 4.7, 6.3, 4.1, 5.0, 4.7]
Bootstrap median 1: 4.7

Bootstrap sample 2: [3.8, 4.4, 3.5, 4.2, 5.8, 4.1, 4.6, 3.9, 5.0, 3.8, 4.7, 4.2]
Bootstrap median 2: 4.2

Bootstrap sample 3: [6.3, 4.7, 5.0, 4.2, 3.5, 4.4, 5.1, 4.6, 4.2, 5.8, 4.1, 3.9]
Bootstrap median 3: 4.5

... repeat 10,000 times total ...

Each bootstrap sample is the same size as the original (12 values) but contains different values because of the with-replacement sampling. The bootstrap medians vary around the original sample median, and their distribution tells us how much we'd expect the sample median to bounce around if we could actually draw new samples from the population.

Python Implementation

Here's the bootstrap in Python — it's remarkably simple:

import numpy as np
import matplotlib.pyplot as plt

# Original frog length data
frogs = np.array([4.2, 3.8, 5.1, 4.7, 3.5, 6.3, 4.4, 5.8, 4.1, 3.9, 4.6, 5.0])

# The original sample median
original_median = np.median(frogs)
print(f"Original sample median: {original_median}")

# Bootstrap resampling
np.random.seed(42)
n_bootstrap = 10000
bootstrap_medians = np.zeros(n_bootstrap)

for i in range(n_bootstrap):
    # Draw a bootstrap sample (same size, with replacement)
    boot_sample = np.random.choice(frogs, size=len(frogs), replace=True)
    # Compute the median of the bootstrap sample
    bootstrap_medians[i] = np.median(boot_sample)

# Visualize the bootstrap distribution
plt.figure(figsize=(10, 6))
plt.hist(bootstrap_medians, bins=30, edgecolor='black', alpha=0.7,
         color='steelblue')
plt.axvline(original_median, color='red', linewidth=2, linestyle='--',
            label=f'Original median = {original_median}')
plt.xlabel('Bootstrap Median (cm)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Bootstrap Distribution of the Median Frog Length', fontsize=14)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

# Summary statistics of the bootstrap distribution
print(f"\nBootstrap distribution summary:")
print(f"  Mean of bootstrap medians: {np.mean(bootstrap_medians):.3f}")
print(f"  SD of bootstrap medians:   {np.std(bootstrap_medians):.3f}")
print(f"  (This SD is the bootstrap standard error)")

Notice the key function: np.random.choice(frogs, size=len(frogs), replace=True). That's the entire bootstrap sampling step. The replace=True argument makes it with replacement. The size=len(frogs) ensures the bootstrap sample is the same size as the original.

The standard deviation of the bootstrap medians is called the bootstrap standard error. It plays the same role as $s/\sqrt{n}$ in the formula-based approach — it measures how much the statistic varies from sample to sample.
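For the mean — the one statistic where a textbook formula does exist — we can sanity-check the bootstrap standard error against $s/\sqrt{n}$ directly on the frog data:

```python
import numpy as np

frogs = np.array([4.2, 3.8, 5.1, 4.7, 3.5, 6.3, 4.4, 5.8, 4.1, 3.9, 4.6, 5.0])

# Formula-based standard error of the mean: s / sqrt(n)
formula_se = np.std(frogs, ddof=1) / np.sqrt(len(frogs))

# Bootstrap standard error of the mean
rng = np.random.default_rng(42)
boot_means = np.array([rng.choice(frogs, size=len(frogs), replace=True).mean()
                       for _ in range(10_000)])
bootstrap_se = boot_means.std()

print(f"Formula SE (s/sqrt(n)): {formula_se:.3f}")
print(f"Bootstrap SE:           {bootstrap_se:.3f}")
```

The two agree closely (the bootstrap value runs slightly smaller, since resampling treats the sample itself as the whole population). When the formula exists, the bootstrap reproduces it; when no formula exists, the bootstrap still works.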


18.4 Bootstrap Confidence Intervals

Now that we have the bootstrap distribution, building a confidence interval is straightforward. There are two main methods.

Method 1: The Percentile Method

The percentile method is the simplest and most intuitive approach. For a 95% confidence interval, you simply take the 2.5th and 97.5th percentiles of the bootstrap distribution.

Think about why this works: the bootstrap distribution approximates the sampling distribution of the statistic. If 95% of the bootstrap statistics fall between two values, then those two values form a plausible range for the true population parameter.

Concept 2: Bootstrap Confidence Interval (Percentile Method)

A bootstrap confidence interval using the percentile method is constructed by taking the $\alpha/2$ and $1 - \alpha/2$ percentiles of the bootstrap distribution. For a 95% CI: find the 2.5th and 97.5th percentiles of the $B$ bootstrap statistics.

# 95% Bootstrap CI — Percentile Method
ci_lower = np.percentile(bootstrap_medians, 2.5)
ci_upper = np.percentile(bootstrap_medians, 97.5)

print(f"95% Bootstrap CI for median frog length (percentile method):")
print(f"  ({ci_lower:.2f}, {ci_upper:.2f}) cm")

For our frog data, this gives a 95% confidence interval of approximately (3.95, 5.05) cm for the median frog length. We are 95% confident that the true median frog length on this island is between 3.95 and 5.05 cm.

🔄 Spaced Review 2 (from Ch.13): Hypothesis Testing and the P-Value

In Chapter 13, you learned that the p-value is the probability of observing data as extreme as what you actually observed, assuming the null hypothesis is true. The bootstrap doesn't change this definition — it just provides a different way to compute the p-value. Instead of using a formula to calculate the area in the tail of a $t$-distribution, you simulate the null distribution directly and count how many simulated statistics are as extreme as or more extreme than the observed one. Same logic, different computational tool.

Method 2: The Basic (Reverse Percentile) Method

There's a slightly more sophisticated approach called the basic method (also called the reverse percentile method). It adjusts for potential bias in the bootstrap distribution:

$$\text{CI} = \left(2\hat{\theta} - \hat{\theta}^*_{1-\alpha/2}, \;\; 2\hat{\theta} - \hat{\theta}^*_{\alpha/2}\right)$$

where $\hat{\theta}$ is the original sample statistic and $\hat{\theta}^*_{\alpha/2}$ is the $\alpha/2$ percentile of the bootstrap distribution.

# 95% Bootstrap CI — Basic (Reverse Percentile) Method
basic_lower = 2 * original_median - np.percentile(bootstrap_medians, 97.5)
basic_upper = 2 * original_median - np.percentile(bootstrap_medians, 2.5)

print(f"95% Bootstrap CI for median frog length (basic method):")
print(f"  ({basic_lower:.2f}, {basic_upper:.2f}) cm")

The basic method "reflects" the percentiles around the original estimate. If the bootstrap distribution is symmetric, the two methods give nearly identical results. When the bootstrap distribution is skewed, the basic method can sometimes perform better because it corrects for the asymmetry.

Which method should you use? For this course, the percentile method is the default recommendation. It's simpler, more intuitive, and gives good results in most situations. More advanced methods (like the bias-corrected and accelerated, or BCa, method) exist for situations where higher precision is needed, but they're beyond our scope.

A Complete Bootstrap CI Function

Let's write a reusable function that takes any data and any statistic and produces a bootstrap confidence interval:

def bootstrap_ci(data, stat_func, n_bootstrap=10000, ci_level=0.95,
                 random_seed=42):
    """
    Compute a bootstrap confidence interval for any statistic.

    Parameters
    ----------
    data : array-like
        The original sample data.
    stat_func : callable
        A function that takes an array and returns a single number.
        Examples: np.mean, np.median, np.std
    n_bootstrap : int
        Number of bootstrap samples (default 10,000).
    ci_level : float
        Confidence level (default 0.95 for 95% CI).
    random_seed : int or None
        Random seed for reproducibility.

    Returns
    -------
    dict with keys: 'observed', 'ci_lower', 'ci_upper', 'se',
                    'bootstrap_dist'
    """
    np.random.seed(random_seed)
    data = np.asarray(data)
    observed = stat_func(data)

    # Generate bootstrap distribution
    boot_stats = np.zeros(n_bootstrap)
    for i in range(n_bootstrap):
        boot_sample = np.random.choice(data, size=len(data), replace=True)
        boot_stats[i] = stat_func(boot_sample)

    # Percentile CI
    alpha = 1 - ci_level
    ci_lower = np.percentile(boot_stats, 100 * alpha / 2)
    ci_upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))

    return {
        'observed': observed,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'se': np.std(boot_stats),
        'bootstrap_dist': boot_stats
    }

# Example: Bootstrap CI for the median
result = bootstrap_ci(frogs, np.median)
print(f"Observed median: {result['observed']:.2f}")
print(f"Bootstrap SE:    {result['se']:.3f}")
print(f"95% CI:          ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")

This function works for any statistic. Want a confidence interval for the mean? Pass np.mean. For the standard deviation? np.std. For the 90th percentile? lambda x: np.percentile(x, 90). For the interquartile range? lambda x: np.percentile(x, 75) - np.percentile(x, 25). The bootstrap doesn't care — it works the same way regardless.

This is the superpower of the bootstrap. You don't need a different formula for each statistic. One procedure handles them all.
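As a quick demonstration with the frog data, here is the same resampling loop applied to the interquartile range — a statistic with no standard-error formula in any introductory table. It is written out inline so it runs on its own, mirroring what bootstrap_ci does internally:

```python
import numpy as np

frogs = np.array([4.2, 3.8, 5.1, 4.7, 3.5, 6.3, 4.4, 5.8, 4.1, 3.9, 4.6, 5.0])

# The statistic of interest: interquartile range
iqr = lambda x: np.percentile(x, 75) - np.percentile(x, 25)

# Resample with replacement and recompute the IQR each time
rng = np.random.default_rng(7)
boot_iqrs = np.array([iqr(rng.choice(frogs, size=len(frogs), replace=True))
                      for _ in range(10_000)])

# Percentile-method 95% CI
lo, hi = np.percentile(boot_iqrs, [2.5, 97.5])
print(f"Observed IQR: {iqr(frogs):.2f}")
print(f"95% bootstrap CI for the IQR: ({lo:.2f}, {hi:.2f})")
```

Swapping in a different statistic means changing one line — the lambda — and nothing else.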


18.5 Maya's Bootstrap: Median Hospital Wait Times

Let's apply the bootstrap to a real problem.

Dr. Maya Chen has collected wait times (in minutes) for 40 patients in her emergency department. She wants to estimate the median wait time because the data are heavily right-skewed — a few patients wait for hours, pulling the mean far above the typical experience.

Here's why this matters for patient care: hospital administrators often report mean wait times, which can be misleading. If 35 patients wait 30 minutes and 5 patients wait 4 hours, the mean wait time is about 56 minutes — but the typical patient waits only 30 minutes. The median tells the truth about typical experience.

import numpy as np
import matplotlib.pyplot as plt

# Maya's ED wait time data (minutes) — right-skewed
np.random.seed(123)
wait_times = np.concatenate([
    np.random.exponential(scale=25, size=30),  # Most patients
    np.random.exponential(scale=90, size=10)    # Complex cases
])
wait_times = np.round(np.clip(wait_times, 5, 300), 1)

print("Wait time summary:")
print(f"  n = {len(wait_times)}")
print(f"  Mean:   {np.mean(wait_times):.1f} minutes")
print(f"  Median: {np.median(wait_times):.1f} minutes")
print(f"  SD:     {np.std(wait_times, ddof=1):.1f} minutes")
print(f"  Min:    {np.min(wait_times):.1f}, Max: {np.max(wait_times):.1f}")

Notice how the mean is pulled up by the long-wait patients, while the median stays closer to the typical experience. Maya wants a confidence interval for the population median — and that's where the bootstrap shines.

# Can we use a t-interval? Let's check.
# The t-interval is for the MEAN, not the median.
# Even if we wanted a CI for the mean, the data are highly skewed
# with n = 40 (borderline for the CLT with this much skewness).

# Bootstrap CI for the median
result = bootstrap_ci(wait_times, np.median, n_bootstrap=10000)
print(f"\nBootstrap analysis of median wait time:")
print(f"  Observed median: {result['observed']:.1f} minutes")
print(f"  Bootstrap SE:    {result['se']:.1f} minutes")
print(f"  95% CI:          ({result['ci_lower']:.1f}, {result['ci_upper']:.1f}) minutes")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Original data (skewed)
axes[0].hist(wait_times, bins=15, edgecolor='black', alpha=0.7,
             color='lightcoral')
axes[0].axvline(np.mean(wait_times), color='blue', linewidth=2,
                linestyle='--', label=f'Mean = {np.mean(wait_times):.1f}')
axes[0].axvline(np.median(wait_times), color='red', linewidth=2,
                linestyle='-', label=f'Median = {np.median(wait_times):.1f}')
axes[0].set_xlabel('Wait Time (minutes)')
axes[0].set_ylabel('Frequency')
axes[0].set_title("Maya's ED Wait Times (Skewed)")
axes[0].legend()

# Right: Bootstrap distribution of the median
axes[1].hist(result['bootstrap_dist'], bins=40, edgecolor='black',
             alpha=0.7, color='steelblue')
axes[1].axvline(result['ci_lower'], color='red', linewidth=2,
                linestyle='--', label=f'95% CI lower = {result["ci_lower"]:.1f}')
axes[1].axvline(result['ci_upper'], color='red', linewidth=2,
                linestyle='--', label=f'95% CI upper = {result["ci_upper"]:.1f}')
axes[1].set_xlabel('Bootstrap Median (minutes)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bootstrap Distribution of Median Wait Time')
axes[1].legend()

plt.tight_layout()
plt.show()

Maya's interpretation: "We are 95% confident that the true median emergency department wait time is between approximately 18 and 37 minutes. This means the typical patient's wait — not distorted by a few extreme cases — likely falls in this range."

Notice what just happened. Maya estimated a confidence interval for the median — something the $t$-interval formula simply can't do — using only np.random.choice() and np.percentile(). No formulas. No distributional assumptions. Just computation.

Key Insight

The bootstrap allowed Maya to construct a confidence interval for the median — a statistic with no simple formula for its standard error. The $t$-interval works only for means. The $z$-interval works only for proportions. The bootstrap works for anything.


18.6 The Permutation Test: Shuffling Labels

Now let's switch from confidence intervals to hypothesis testing. The bootstrap's cousin is the permutation test (also called a randomization test), and it's based on an equally elegant idea.

The Logic of Shuffling

Suppose you want to test whether two groups are different. The null hypothesis says: the two groups come from the same population — any observed difference is just the luck of the draw.

If $H_0$ is true, then the group labels are meaningless. It doesn't matter whether an observation was labeled "Group A" or "Group B" — the labels carry no information. If the groups truly come from the same population, you could shuffle the labels randomly and the results would look the same.

Concept 3: Permutation Test

A permutation test works by randomly reassigning the group labels many times, computing the test statistic each time, and seeing where the observed test statistic falls in this null distribution. If the observed difference is extreme compared to the shuffled differences, we reject $H_0$.

Here's the procedure:

  1. Compute the observed test statistic from the actual data (e.g., the difference in group means).
  2. Combine all observations from both groups into one pool.
  3. Randomly shuffle the group labels: assign $n_1$ observations to "Group A" and $n_2$ to "Group B" at random.
  4. Compute the test statistic for this shuffled dataset.
  5. Repeat steps 3–4 many times (say, 10,000 times).
  6. Calculate the p-value as the proportion of shuffled test statistics that are as extreme as or more extreme than the observed one.

This process creates the null distribution — the distribution of the test statistic under $H_0$. If the observed difference is in the tails of this distribution, that's evidence against $H_0$.

Why It Works

The permutation test is a direct operationalization of the null hypothesis. When $H_0$ says "the two groups come from the same population," it means the group labels are arbitrary. The permutation test takes this literally: it scrambles the labels and asks, "If the labels really didn't matter, how often would we see a difference this large?"

🔄 Spaced Review 3 (from Ch.12): Confidence Intervals

In Chapter 12, you learned that a confidence interval captures the range of plausible values for a population parameter. A 95% CI is constructed so that 95% of all CIs from repeated sampling would capture the true parameter. The bootstrap CI achieves the same goal through simulation rather than formulas. And just as there's a duality between CIs and hypothesis tests (the CI contains all parameter values that would not be rejected by a test), bootstrap CIs and permutation tests are two sides of the same computational coin.

The Connection to Monte Carlo Methods

The bootstrap and permutation tests are both examples of Monte Carlo simulation — using random sampling to approximate quantities that are difficult to compute analytically. The name comes from the Monte Carlo Casino in Monaco, because the methods rely on random chance (like gambling) to produce reliable answers.

Key Term: Monte Carlo Simulation

Monte Carlo simulation is any method that uses repeated random sampling to obtain numerical results. The bootstrap uses Monte Carlo simulation to approximate the sampling distribution. The permutation test uses Monte Carlo simulation to approximate the null distribution. Both rely on the law of large numbers: with enough random repetitions, the simulation converges to the true answer.

Monte Carlo methods were developed in the 1940s by Stanislaw Ulam, John von Neumann, and Nicholas Metropolis during the Manhattan Project, where they used random sampling to model neutron diffusion. Today, Monte Carlo methods are everywhere — in physics, finance, engineering, and of course, statistics.
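The textbook toy example of Monte Carlo (not a bootstrap — just an illustration of the underlying principle) is estimating $\pi$ by scattering random points in the unit square and counting how many land inside the quarter circle, whose area is $\pi/4$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fraction of random points with x^2 + y^2 <= 1 approaches pi/4.
for n in [100, 10_000, 1_000_000]:
    x, y = rng.random(n), rng.random(n)
    inside = (x**2 + y**2 <= 1).mean()
    print(f"n = {n:>9,}: pi is approximately {4 * inside:.4f}")
```

Notice how the estimate tightens as $n$ grows — the law of large numbers at work, exactly the property the bootstrap and the permutation test rely on.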


18.7 Alex's Permutation Test: A/B Testing Without the t-Test

Alex Rivera has a new A/B test to analyze — but this one is trickier than the Chapter 16 analysis. StreamVibe tested a new "social recommendations" feature on a small group of users. The sample sizes are small (15 users per group), and the data are strongly right-skewed. Alex isn't confident the CLT has kicked in for such small, skewed samples.

Rather than worrying about whether the $t$-test assumptions hold, Alex decides to use a permutation test.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2024)

# Small A/B test data — right-skewed watch times (minutes)
control = np.array([22, 35, 18, 42, 28, 15, 31, 55, 20, 25,
                    33, 19, 27, 38, 24])
treatment = np.array([30, 45, 25, 48, 38, 22, 42, 65, 35, 32,
                      50, 28, 40, 55, 29])

print("Summary Statistics:")
print(f"  Control   (n={len(control)}):  mean = {np.mean(control):.1f}, "
      f"median = {np.median(control):.1f}, SD = {np.std(control, ddof=1):.1f}")
print(f"  Treatment (n={len(treatment)}): mean = {np.mean(treatment):.1f}, "
      f"median = {np.median(treatment):.1f}, SD = {np.std(treatment, ddof=1):.1f}")

# Observed difference in means
observed_diff = np.mean(treatment) - np.mean(control)
print(f"\nObserved difference in means: {observed_diff:.2f} minutes")

# --- Permutation Test ---
n_permutations = 10000
combined = np.concatenate([control, treatment])
n1 = len(control)

perm_diffs = np.zeros(n_permutations)
for i in range(n_permutations):
    # Shuffle all observations and split into two groups
    shuffled = np.random.permutation(combined)
    perm_group1 = shuffled[:n1]
    perm_group2 = shuffled[n1:]
    perm_diffs[i] = np.mean(perm_group2) - np.mean(perm_group1)

# P-value (one-tailed: treatment > control)
p_value = np.mean(perm_diffs >= observed_diff)
print(f"Permutation test p-value (one-tailed): {p_value:.4f}")

# For comparison: traditional t-test
from scipy import stats
t_stat, t_pvalue = stats.ttest_ind(treatment, control,
                                    equal_var=False,
                                    alternative='greater')
print(f"Welch's t-test p-value (one-tailed):   {t_pvalue:.4f}")

# Visualize the null distribution
plt.figure(figsize=(10, 6))
plt.hist(perm_diffs, bins=50, edgecolor='black', alpha=0.7,
         color='lightgreen', label='Null distribution (shuffled)')
plt.axvline(observed_diff, color='red', linewidth=2.5, linestyle='--',
            label=f'Observed diff = {observed_diff:.1f} min')
plt.xlabel('Difference in Means (Treatment - Control)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title("Alex's Permutation Test: Null Distribution", fontsize=14)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

Reading the results:

The permutation test works by asking: "If the social recommendations feature had no effect whatsoever, how often would we see a difference of this magnitude just by chance?" The histogram shows the differences produced by 10,000 random shuffles, each one a dataset that could have occurred under the null hypothesis. The red line marks what Alex actually observed.

If the observed difference sits well into the tail of the null distribution — meaning very few of the shuffled datasets produced a difference that large — then the evidence against $H_0$ is strong.

Alex's interpretation: "The permutation test gives a p-value of approximately 0.008. Under the null hypothesis (no difference between groups), fewer than 1% of random shuffles produced a difference as large as what we observed. The evidence is strong that the social recommendations feature genuinely increases watch time. Notably, the permutation p-value is very close to the Welch's $t$-test p-value — both methods agree."

Why the Permutation Test and the t-Test Often Agree

When the $t$-test conditions are reasonably met (moderate sample sizes, data not wildly skewed), the permutation test and the $t$-test typically give very similar results. This is reassuring: it means the formula-based method isn't lying to you.

The permutation test earns its keep when the $t$-test conditions aren't met — small samples from skewed populations, unusual statistics, or situations where the distributional assumptions are questionable. In those cases, the permutation test is often more trustworthy.


18.8 Sam's Bootstrap: Shooting Percentage Improvement

Sam Okafor has been tracking Daria Williams's three-point shooting all season. The $z$-test and $t$-test approaches from Chapters 14 and 15 focused on whether Daria's shooting percentage improved compared to a historical benchmark. But Sam wants something different: a confidence interval for the ratio of improvement.

Specifically, Sam wants to know: by what factor has Daria's three-point percentage improved? If she shot 30% last season and is shooting around 38.5% this season, the ratio is $0.385 / 0.300 \approx 1.28$. But how precise is that estimate?

There's no simple formula for the standard error of a ratio of proportions. But the bootstrap doesn't need one.

import numpy as np

# Daria's shooting data
# Last season: 60 makes out of 200 attempts (30.0%)
# This season: 25 makes out of 65 attempts (38.5%)

np.random.seed(42)

# Represent each shot as 1 (make) or 0 (miss)
last_season = np.array([1]*60 + [0]*140)    # 60/200 = 0.300
this_season = np.array([1]*25 + [0]*40)     # 25/65  = 0.385

observed_ratio = np.mean(this_season) / np.mean(last_season)
print(f"Last season:  {np.mean(last_season):.3f} ({60}/{200})")
print(f"This season:  {np.mean(this_season):.3f} ({25}/{65})")
print(f"Observed ratio: {observed_ratio:.3f}")
print(f"  (This season is {(observed_ratio - 1)*100:.1f}% higher)")

# Bootstrap the ratio
n_bootstrap = 10000
boot_ratios = np.zeros(n_bootstrap)

for i in range(n_bootstrap):
    # Resample each season independently
    boot_last = np.random.choice(last_season, size=len(last_season),
                                  replace=True)
    boot_this = np.random.choice(this_season, size=len(this_season),
                                  replace=True)
    # Compute the ratio (protect against division by zero)
    if np.mean(boot_last) > 0:
        boot_ratios[i] = np.mean(boot_this) / np.mean(boot_last)
    else:
        boot_ratios[i] = np.nan

# Remove any NaN values (shouldn't happen with this data)
boot_ratios = boot_ratios[~np.isnan(boot_ratios)]

# 95% Bootstrap CI for the ratio
ci_lower = np.percentile(boot_ratios, 2.5)
ci_upper = np.percentile(boot_ratios, 97.5)
boot_se = np.std(boot_ratios)

print(f"\nBootstrap CI for improvement ratio:")
print(f"  Bootstrap SE:  {boot_se:.3f}")
print(f"  95% CI:        ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"\nInterpretation:")
if ci_lower > 1.0:
    print(f"  The entire CI is above 1.0 — evidence of genuine improvement.")
else:
    print(f"  The CI includes 1.0 — improvement not conclusive.")
    print(f"  The ratio could plausibly be as low as {ci_lower:.2f} "
          f"(a {(1-ci_lower)*100:.0f}% decrease)")
    print(f"  or as high as {ci_upper:.2f} "
          f"(a {(ci_upper-1)*100:.0f}% increase)")

Sam's interpretation: "The bootstrap 95% CI for the ratio of this season's to last season's three-point percentage is approximately (0.89, 1.72). Since this interval contains 1.0, we can't conclude that Daria's shooting has genuinely improved — which is consistent with our earlier hypothesis test results from Chapters 13 and 14. The wide interval reflects the relatively small number of shots this season ($n = 65$). We'd need more data to pin down the improvement more precisely."

This is a perfect example of the bootstrap's flexibility. No formula exists for the standard error of a ratio of proportions from different sample sizes. But np.random.choice() doesn't care — it bootstraps the ratio just as easily as it bootstraps the mean.


18.9 Bootstrap Distribution vs. Sampling Distribution

Let's make the relationship between these two concepts precise.

The sampling distribution (Chapter 11) is the theoretical distribution of a statistic across all possible samples from the population. It's the gold standard — if you knew it exactly, you could build perfect confidence intervals and conduct flawless hypothesis tests. But you can never observe it directly, because you'd need to draw infinitely many samples.

The bootstrap distribution is an approximation of the sampling distribution, built by resampling from the one sample you actually have. It won't be identical to the true sampling distribution, but it will be close — especially when the sample is reasonably large and reasonably representative.

| Feature | Sampling Distribution | Bootstrap Distribution |
|---|---|---|
| Source | All possible samples from the population | Many resamples from the sample |
| Center | True parameter ($\mu$, $p$, etc.) | Sample statistic ($\bar{x}$, $\hat{p}$, etc.) |
| Spread | True standard error | Bootstrap standard error (approximation) |
| Shape | Determined by CLT or exact theory | Empirically determined by the data |
| Requires | Knowledge of the population | Only the sample |
| Accuracy | Exact (theoretical) | Approximate (improves with larger $n$ and larger $B$) |

The key difference in center is important: the sampling distribution is centered at the true parameter, while the bootstrap distribution is centered at the sample statistic. This is why bootstrap CIs work: they capture the right spread even though they're centered at the sample estimate rather than the truth.
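This relationship is easy to see in a short simulation. The sketch below is our own, with an assumed exponential population so the true mean is known; the two distributions have different centers but comparable spreads.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0  # assumed exponential population with known mean

# Approximate the sampling distribution: many fresh samples from the population
sampling_means = np.array([
    rng.exponential(scale=true_mean, size=50).mean() for _ in range(5000)
])

# Bootstrap distribution: many resamples from ONE observed sample
sample = rng.exponential(scale=true_mean, size=50)
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5000)
])

# Centers differ (true mean vs. sample mean); spreads are comparable
print(f"Sampling distribution center:  {sampling_means.mean():.2f} (true mean = {true_mean})")
print(f"Bootstrap distribution center: {boot_means.mean():.2f} (sample mean = {sample.mean():.2f})")
print(f"Spreads: {sampling_means.std():.2f} vs. {boot_means.std():.2f}")
```

The spreads agree well even though the centers don't, which is exactly the property bootstrap CIs rely on.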

Threshold Concept: Resampling

Here's the conceptual leap: you can learn about the population by cleverly reusing the sample. This sounds paradoxical — how can you get new information from old data? The answer is that bootstrap resampling doesn't create new data. It reveals the variability inherent in the data you already have. Each bootstrap sample is a plausible alternative version of what you might have observed, and the variability across bootstrap samples approximates the variability across real samples.

This is the same insight that underlies all of inference: the sample is a window into the population. The CLT exploits this mathematically (Chapter 11). The bootstrap exploits it computationally. Both are saying the same thing: the sample contains information about its own reliability.

If this clicks for you, you'll never look at data the same way. Every dataset you encounter carries within it a hidden bootstrap distribution — a built-in measure of how much you should trust the statistics you compute from it.


18.10 When Does the Bootstrap Shine?

The bootstrap isn't always better than the formula-based approach — but it has clear advantages in several important situations.

The Bootstrap Is Especially Useful When...

1. You need a CI for a complex statistic (medians, ratios, correlations, etc.)

This is the most compelling use case. Formula-based CIs exist for means and proportions, but not for most other statistics. Need a CI for the median? The correlation coefficient? The difference in medians? The ratio of two variances? The 90th percentile? The bootstrap handles all of these with the same procedure.
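The "same procedure" claim can be made concrete with a small helper. `bootstrap_ci` below is our own illustrative function, not a library routine; any statistic that maps an array to a number plugs in unchanged. The sample data are Alex's control group from earlier in the chapter.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=10000, alpha=0.05, seed=None):
    """Percentile bootstrap CI for any statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)
    boot_stats = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return (np.percentile(boot_stats, 100 * alpha / 2),
            np.percentile(boot_stats, 100 * (1 - alpha / 2)))

# One procedure, three different statistics
data = np.array([22, 35, 18, 42, 28, 15, 31, 55, 20, 25, 33, 19, 27, 38, 24])
print("Median:", bootstrap_ci(data, np.median, seed=1))
print("IQR:   ", bootstrap_ci(data, lambda x: np.percentile(x, 75) - np.percentile(x, 25), seed=1))
print("90th:  ", bootstrap_ci(data, lambda x: np.percentile(x, 90), seed=1))
```

Nothing about the resampling loop changes when the statistic does; only the function passed in differs.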

2. Your data are non-normal and your sample is moderate-sized

The CLT guarantees that the sampling distribution of the mean is approximately normal for large $n$ — but "large enough" depends on how skewed the data are. With highly skewed data and $n = 20$, the CLT may not have fully kicked in. The bootstrap doesn't need the CLT — it builds the sampling distribution empirically.

3. Your sample size is small (but not too small)

For small samples ($15 \leq n < 30$), the $t$-test becomes sensitive to non-normality. The bootstrap often provides more accurate CIs in this range, because it doesn't assume a particular distributional shape.

4. You want to avoid distributional assumptions entirely

Sometimes you simply don't want to assume normality, and you don't want to argue about whether the CLT applies. The bootstrap lets you proceed without any distributional assumptions — the data speak for themselves.
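Points 2 and 3 above can be seen directly in a quick sketch (our own, with an assumed lognormal sample of $n = 20$): the bootstrap distribution of the mean inherits the sample's right skew, so the percentile CI typically comes out asymmetric around the sample mean, something a symmetric $t$-interval cannot do.

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.lognormal(mean=3, sigma=1, size=20)  # small, heavily right-skewed sample

boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
xbar = sample.mean()

print(f"Sample mean:      {xbar:.1f}")
print(f"95% bootstrap CI: ({lo:.1f}, {hi:.1f})")
# With skewed data the two arms usually differ; a t-interval would force symmetry
print(f"Arm lengths: below = {xbar - lo:.1f}, above = {hi - xbar:.1f}")
```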

When the Bootstrap Struggles

The bootstrap isn't magic. It has real limitations.

1. Very small samples ($n < 15$)

If your sample is tiny, it may not represent the population well. The bootstrap resamples from your sample, so if the sample is a poor model of the population, the bootstrap distribution will also be a poor model of the sampling distribution. With $n = 5$ or $n = 8$, the bootstrap can produce unreliable results.

2. Heavily biased samples

If the sample suffers from serious selection bias, no amount of resampling will fix it. The bootstrap estimates variability well, but it cannot correct for systematic bias in the data collection. A biased sample produces a biased bootstrap distribution.

3. Statistics that depend on the tails of the distribution

The bootstrap struggles with extreme quantiles (like the minimum, maximum, or 99th percentile) because these depend on values that may not be well-represented in the sample. If the population has a heavier tail than your sample shows, the bootstrap will underestimate the variability of tail statistics.
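A quick illustration of the tail problem (our own sketch, with assumed exponential data): every bootstrap maximum is capped at the observed maximum, so the bootstrap literally cannot explore the population's upper tail beyond the sample.

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.exponential(scale=10, size=30)

boot_maxes = np.array([
    rng.choice(sample, size=len(sample), replace=True).max()
    for _ in range(10000)
])

# No resample can exceed the observed maximum
print(f"Sample maximum:        {sample.max():.1f}")
print(f"Largest bootstrap max: {boot_maxes.max():.1f}")
# Roughly 1 - (1 - 1/n)^n ≈ 63% of resamples include the sample max at all
print(f"Share of resamples whose max equals the sample max: {np.mean(boot_maxes == sample.max()):.2f}")
```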

4. Time series and dependent data

Standard bootstrap assumes the observations are independent. If the data have serial dependence (like stock prices or temperature readings), the standard bootstrap breaks down. Special variants exist (block bootstrap, moving block bootstrap) but they're beyond this course.


18.11 Formula-Based vs. Simulation-Based: The Comparison

Here's the side-by-side comparison you've been waiting for.

| Feature | Formula-Based (Ch. 12–16) | Simulation-Based (Ch. 18) |
|---|---|---|
| How it estimates SE | Derives a mathematical formula | Computes SD of bootstrap statistics |
| How it builds CIs | Statistic $\pm$ critical value $\times$ SE | Percentiles of bootstrap distribution |
| How it tests hypotheses | Compares test statistic to known distribution | Compares observed statistic to permutation distribution |
| Assumptions | Normality (or CLT), independence, specific formulas | Independence, representative sample |
| Works for means? | Yes ($t$-interval, $t$-test) | Yes |
| Works for medians? | No standard formula | Yes |
| Works for ratios? | No standard formula | Yes |
| Works for any statistic? | Only those with known SE formulas | Yes |
| Small samples ($n < 15$)? | Limited (normality needed) | Also limited (sample may not represent population) |
| Moderate samples (15–30)? | Good when conditions met | Often better for non-normal data |
| Large samples ($n > 30$)? | Excellent (CLT kicks in) | Also excellent (both methods agree closely) |
| Computational cost | Minimal (one formula) | Moderate (thousands of resamples) |
| Historical availability | 1900s+ (Gosset's $t$-test: 1908) | 1979+ (Efron's bootstrap; practical with computers) |
| Teaching value | Develops mathematical reasoning | Develops computational and intuitive reasoning |

When to Use Which?

Use formula-based methods when:

- You're computing a CI or test for a mean or proportion
- The sample is large enough for the CLT
- The data are reasonably normal (or you have $n \geq 30$)
- You want a quick answer without programming

Use simulation-based methods when:

- You need inference for a non-standard statistic (median, ratio, correlation, etc.)
- Your data are non-normal and your sample is moderate-sized
- You want to avoid distributional assumptions
- The formula-based conditions are questionable

Use both when:

- You want to check whether the formula-based and bootstrap results agree (they should for standard statistics under good conditions)
- You're learning statistics and want to build intuition about sampling distributions

Key Insight

Formula-based and simulation-based methods are not enemies. They're complementary approaches to the same goal: understanding sampling variability. The formula-based approach uses mathematical theory. The simulation-based approach uses computational power. When both are applicable, they typically give the same answer — which should give you confidence in both methods.
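To see the agreement for yourself, here is a minimal check (our own sketch, assuming roughly normal data with $n = 40$): the $t$-interval and the percentile bootstrap CI for the same mean nearly coincide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=50, scale=8, size=40)

# Formula-based: t-interval for the mean
mean, se = data.mean(), stats.sem(data)
t_lo, t_hi = stats.t.interval(0.95, len(data) - 1, loc=mean, scale=se)

# Simulation-based: percentile bootstrap CI for the same mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10000)
])
b_lo, b_hi = np.percentile(boot_means, [2.5, 97.5])

print(f"t-interval:   ({t_lo:.2f}, {t_hi:.2f})")
print(f"Bootstrap CI: ({b_lo:.2f}, {b_hi:.2f})")
```

The small remaining gap comes from the $t$ critical value (about 2.02 at $n = 40$) versus the bootstrap's empirical percentiles, plus a little Monte Carlo noise.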


18.12 Python Summary: Your Bootstrap and Permutation Toolkit

Here's a concise reference for the key Python techniques from this chapter.

import numpy as np
from scipy import stats

# === BOOTSTRAP CI FOR ANY STATISTIC ===

data = np.array([...])  # Your data

# Step 1: Compute observed statistic
observed = np.median(data)  # or np.mean, np.std, etc.

# Step 2: Bootstrap
n_boot = 10000
boot_stats = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(n_boot)
])

# Step 3: Percentile CI
ci_lower = np.percentile(boot_stats, 2.5)
ci_upper = np.percentile(boot_stats, 97.5)
boot_se = np.std(boot_stats)

# === PERMUTATION TEST FOR TWO GROUPS ===

group1 = np.array([...])
group2 = np.array([...])

# Step 1: Observed difference
obs_diff = np.mean(group2) - np.mean(group1)

# Step 2: Pool and shuffle
combined = np.concatenate([group1, group2])
n1 = len(group1)
n_perm = 10000

perm_diffs = np.zeros(n_perm)
for i in range(n_perm):
    # Shuffle ONCE per iteration; both groups must come from the same shuffle
    shuffled = np.random.permutation(combined)
    perm_diffs[i] = np.mean(shuffled[n1:]) - np.mean(shuffled[:n1])

# Step 3: P-value
p_value_two_sided = np.mean(np.abs(perm_diffs) >= np.abs(obs_diff))
p_value_one_sided = np.mean(perm_diffs >= obs_diff)  # if H_a: group2 > group1

# === BOOTSTRAP CI FOR TWO-GROUP DIFFERENCE ===

n_boot = 10000
boot_diffs = np.zeros(n_boot)
for i in range(n_boot):
    b1 = np.random.choice(group1, size=len(group1), replace=True)
    b2 = np.random.choice(group2, size=len(group2), replace=True)
    boot_diffs[i] = np.mean(b2) - np.mean(b1)

ci_lower = np.percentile(boot_diffs, 2.5)
ci_upper = np.percentile(boot_diffs, 97.5)

18.13 Common Misconceptions

Let's clear up a few things that trip students up.

| Misconception | Reality |
|---|---|
| "The bootstrap creates new data." | No. It resamples from the existing data with replacement. It doesn't generate new observations — it reveals the variability in the data you already have. |
| "More bootstrap samples ($B$) makes the CI narrower." | No. Increasing $B$ makes the CI more precise (less Monte Carlo noise in the endpoints) but doesn't make it narrower. The width of the CI is determined by the sample size $n$ and the variability in the data, not by $B$. |
| "The bootstrap can fix a biased sample." | No. If your sample is biased, your bootstrap distribution is biased too. The bootstrap estimates variability well, but it cannot correct for systematic errors in data collection. |
| "You can use the bootstrap with any sample size." | Technically yes, but very small samples ($n < 15$) may not represent the population well enough for the bootstrap to be reliable. The bootstrap needs the sample to be a reasonable model of the population. |
| "The permutation test is always better than the $t$-test." | Not always. When $t$-test conditions are met, both give similar results. The $t$-test has well-understood power properties and is simpler to compute. The permutation test earns its keep when assumptions are questionable. |
| "Bootstrap and permutation tests are the same thing." | No. The bootstrap resamples within groups (with replacement) to build CIs and estimate SEs. The permutation test shuffles between groups (without replacement) to test $H_0$. Different goals, different mechanics. |

18.14 Progressive Project: Bootstrap Confidence Intervals

Time to apply the bootstrap to your own data.

Your Task

Using the dataset you've been building throughout this course, complete the following:

  1. Choose a statistic for which a formula-based CI would be difficult or impossible — the median, a percentile (like the 75th percentile), the IQR, or the ratio of two group means.

  2. Compute the bootstrap CI:
     - Draw at least 10,000 bootstrap samples
     - Compute your chosen statistic for each
     - Report the 95% bootstrap CI (percentile method)
     - Plot the bootstrap distribution with the CI endpoints marked

  3. Compare to a formula-based CI (if one exists): For a variable where both approaches work (e.g., the mean), compute both the $t$-interval and the bootstrap CI. How similar are they?

  4. Conduct a permutation test:
     - Choose two groups to compare (the same groups you compared in Chapter 16)
     - Run a permutation test with at least 10,000 shuffles
     - Compare the permutation p-value to the $t$-test p-value from Chapter 16
     - Report both and discuss whether the conclusions agree

  5. Write a paragraph interpreting your bootstrap CI in context. Why did you choose this particular statistic? What does the bootstrap distribution look like? Is it symmetric or skewed?

Template code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load your project data
df = pd.read_csv('your_data.csv')

# --- Bootstrap CI for a non-standard statistic ---
your_variable = df['your_column'].dropna().values
your_statistic = lambda x: np.median(x)  # Change as needed

np.random.seed(42)
n_boot = 10000
boot_stats = np.array([
    your_statistic(np.random.choice(your_variable,
                   size=len(your_variable), replace=True))
    for _ in range(n_boot)
])

observed = your_statistic(your_variable)
ci_lower = np.percentile(boot_stats, 2.5)
ci_upper = np.percentile(boot_stats, 97.5)

print(f"Observed statistic:  {observed:.3f}")
print(f"Bootstrap SE:        {np.std(boot_stats):.3f}")
print(f"95% Bootstrap CI:    ({ci_lower:.3f}, {ci_upper:.3f})")

# --- Permutation test for two groups ---
group1 = df[df['group_col'] == 'A']['value_col'].values
group2 = df[df['group_col'] == 'B']['value_col'].values

obs_diff = np.mean(group2) - np.mean(group1)
combined = np.concatenate([group1, group2])
n1 = len(group1)

n_perm = 10000
perm_diffs = np.zeros(n_perm)
for i in range(n_perm):
    # Shuffle once per iteration so both groups come from the same shuffle
    shuffled = np.random.permutation(combined)
    perm_diffs[i] = np.mean(shuffled[n1:]) - np.mean(shuffled[:n1])

p_value = np.mean(np.abs(perm_diffs) >= np.abs(obs_diff))
print(f"\nPermutation test p-value: {p_value:.4f}")

18.15 What's Next

In this chapter, you learned that you don't always need formulas to do inference. The bootstrap and permutation tests are flexible, intuitive, and remarkably powerful. They embody a modern approach to statistics that leverages computational power to handle problems that formula-based methods can't.

But the formula-based methods aren't going anywhere. Chapters 19 and 20 will introduce the chi-square test for categorical data and ANOVA for comparing three or more groups — both formula-based methods that extend the hypothesis testing framework from Chapter 13 into new territory. These methods have their own elegant structure, and they complement the simulation-based tools you just learned.

The bootstrap idea will echo throughout the rest of this book. In Chapter 21 (nonparametric methods), you'll see more distribution-free approaches. In Chapter 22 (regression), bootstrap CIs for regression coefficients provide an alternative when residual assumptions are violated. And in Chapter 26 (statistics and AI), you'll learn that machine learning methods like random forests and bagging are direct descendants of Efron's bootstrap idea.

The thread that connects all of this? The sample contains information about its own reliability. That insight — whether you access it through formulas, through the CLT, or through the bootstrap — is what makes inference possible.


Chapter Summary

  • The bootstrap is a simulation-based method that approximates the sampling distribution of any statistic by resampling from the original data with replacement.
  • Bootstrap confidence intervals (percentile method) use the 2.5th and 97.5th percentiles of the bootstrap distribution for a 95% CI.
  • Permutation tests assess whether two groups differ by randomly shuffling group labels and comparing the observed difference to the null distribution.
  • The bootstrap shines for complex statistics (medians, ratios), non-normal data, and situations where formula-based CIs don't exist.
  • The bootstrap struggles with very small samples ($n < 15$), biased samples, extreme quantiles, and dependent data.
  • Formula-based and simulation-based methods are complementary — when both apply, they typically agree closely.
  • Bradley Efron introduced the bootstrap in 1979; it is one of the most important statistical innovations of the 20th century.
  • The threshold concept: The sample is our best model of the population. Resampling from the sample approximates resampling from the population. This is the bootstrap idea.