Learning Objectives
- Define probability distribution and distinguish discrete from continuous distributions
- Describe and simulate the uniform, binomial, and Poisson distributions
- Explain the properties of the normal distribution and apply the empirical rule
- Use z-scores and scipy.stats to compute probabilities under the normal curve
- Explain the Central Limit Theorem through simulation and apply it to sampling distributions
In This Chapter
- Chapter Overview
- 21.1 What Is a Probability Distribution?
- 21.2 The Uniform Distribution — The Boring-but-Important One
- 21.3 The Binomial Distribution — Counting Successes
- 21.4 The Poisson Distribution — Counting Rare Events
- 21.5 The Normal Distribution — The Star of the Show
- 21.6 The Standard Normal and Z-Scores (Revisited)
- 21.7 The Central Limit Theorem — The Magic Trick of Statistics
- 21.8 Checking Normality: Is Your Data Actually Normal?
- 21.9 The Progressive Project: Vaccination Distributions
- 21.10 When Data Isn't Normal — And That's Okay
- 21.11 Chapter Summary
- Looking Ahead
- Connections to Previous and Future Chapters
Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
"The normal distribution is the most important distribution in statistics. It is also, unfortunately, the most overused." — An anonymous statistician who probably made this up but was right
Chapter Overview
In Chapter 19, you learned to describe the shape of data — symmetric, skewed, bimodal. In Chapter 20, you learned to think about probability — how likely different outcomes are. This chapter connects those two ideas. A probability distribution is a mathematical object that describes the shape of probability — it tells you how likely each possible outcome is, and it does so with a precise, mathematical form that you can work with, compute with, and reason about.
The most famous distribution of all is the normal distribution — the bell curve. You've seen it mentioned a hundred times. Now you'll understand why it's everywhere. The answer involves one of the most beautiful results in mathematics: the Central Limit Theorem. And the way we'll discover it is by running a simulation that will make you say "wait, THAT happened?"
But we'll start simple. We'll build up from distributions you can visualize on your fingers — coin flips, dice rolls, random numbers — before arriving at the bell curve. By the end, you'll know how to use scipy.stats to compute probabilities, how to check whether your data is actually normal (spoiler: it often isn't), and why the Central Limit Theorem means that normal distributions show up even when the underlying data isn't normal at all.
In this chapter, you will learn to:
- Define probability distribution and distinguish discrete from continuous distributions (all paths)
- Describe and simulate the uniform, binomial, and Poisson distributions (all paths)
- Explain the properties of the normal distribution and apply the empirical rule (all paths)
- Use z-scores and scipy.stats to compute probabilities under the normal curve (all paths)
- Explain the Central Limit Theorem through simulation and apply it to sampling distributions (standard + deep dive paths)
21.1 What Is a Probability Distribution?
A probability distribution is a mathematical description of the probabilities of all possible outcomes of a random variable. It answers the question: "If I repeat this random process many times, what pattern would the outcomes make?"
Think of it this way. In Chapter 19, you made histograms of data. The histogram showed you the shape — where values clustered, how spread out they were, whether there were tails. A probability distribution is like the idealized version of that histogram — the shape you'd approach if you had infinite data.
There are two flavors:
Discrete Distributions
A discrete distribution describes random variables that can only take specific, countable values. Examples: the number of heads in 10 coin flips (0, 1, 2, ..., 10), the number of customers who enter a store in an hour, the roll of a die.
For discrete distributions, we can list the probability of each possible outcome. This list is called the probability mass function (PMF).
import numpy as np
import matplotlib.pyplot as plt
# Example: roll a fair die — discrete uniform distribution
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6] * 6
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(outcomes, probabilities, color='steelblue', edgecolor='white', width=0.6)
ax.set_xlabel('Outcome')
ax.set_ylabel('Probability')
ax.set_title('Discrete Uniform Distribution: Fair Die')
ax.set_ylim(0, 0.25)
ax.set_xticks(outcomes)
plt.tight_layout()
plt.savefig('discrete_uniform.png', dpi=150, bbox_inches='tight')
plt.show()
Continuous Distributions
A continuous distribution describes random variables that can take any value in a range. Examples: height, temperature, time until the next bus, vaccination rate.
For continuous distributions, we can't assign a probability to each specific value (there are infinitely many), so instead we use a probability density function (PDF) that tells us the density of probability at each point. Probabilities come from areas under the curve.
The cumulative distribution function (CDF) tells you P(X <= x) — the probability that the random variable is less than or equal to x. It's the running total of probabilities from left to right.
from scipy import stats
# Standard normal distribution (the bell curve)
x = np.linspace(-4, 4, 1000)
pdf = stats.norm.pdf(x, loc=0, scale=1)
cdf = stats.norm.cdf(x, loc=0, scale=1)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# PDF
axes[0].plot(x, pdf, color='steelblue', linewidth=2)
axes[0].fill_between(x, pdf, alpha=0.2, color='steelblue')
axes[0].set_title('PDF: Probability Density Function\n"Where is the probability concentrated?"')
axes[0].set_ylabel('Density')
axes[0].set_xlabel('x')
# CDF
axes[1].plot(x, cdf, color='coral', linewidth=2)
axes[1].set_title('CDF: Cumulative Distribution Function\n"P(X ≤ x) — probability up to this point"')
axes[1].set_ylabel('Cumulative Probability')
axes[1].set_xlabel('x')
plt.tight_layout()
plt.savefig('pdf_vs_cdf.png', dpi=150, bbox_inches='tight')
plt.show()
A key point about the PDF: the height of the curve at any point is NOT a probability. Probabilities are areas under the curve. The probability that a normally distributed variable falls between 0 and 1 is the area under the curve from 0 to 1.
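As a quick illustration with the scipy.stats tools from the code above: the probability is a difference of CDF values, and the curve's height is clearly not a probability, since for a narrow distribution the density can exceed 1.

```python
from scipy import stats

# P(0 < X < 1) for the standard normal is an AREA: CDF(1) - CDF(0)
p_area = stats.norm.cdf(1) - stats.norm.cdf(0)
print(f"P(0 < X < 1) = {p_area:.4f}")  # ~0.3413

# The curve's HEIGHT is not a probability -- for a narrow normal
# (sigma = 0.1), the peak density is about 4, well above 1
peak = stats.norm.pdf(0, loc=0, scale=0.1)
print(f"Density at x=0 for N(0, 0.1): {peak:.2f}")
```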
21.2 The Uniform Distribution — The Boring-but-Important One
The uniform distribution is the simplest distribution. Every outcome is equally likely.
Discrete uniform: Rolling a fair die. Each of {1, 2, 3, 4, 5, 6} has probability 1/6.
Continuous uniform: Generating a random number between 0 and 1 using np.random.random(). Every value in [0, 1] is equally likely.
# Continuous uniform distribution
np.random.seed(42)
uniform_data = np.random.uniform(0, 1, 100000)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Histogram of samples
axes[0].hist(uniform_data, bins=50, color='steelblue', edgecolor='white',
alpha=0.8, density=True)
x = np.linspace(0, 1, 100)
axes[0].plot(x, stats.uniform.pdf(x), color='red', linewidth=2, label='Theoretical PDF')
axes[0].set_title('Uniform Distribution U(0, 1)')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].legend()
# CDF
axes[1].plot(x, stats.uniform.cdf(x), color='coral', linewidth=2)
axes[1].set_title('Uniform CDF')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('P(X ≤ x)')
plt.tight_layout()
plt.savefig('uniform_dist.png', dpi=150, bbox_inches='tight')
plt.show()
# Uniform distribution properties
print("Uniform(0, 1) properties:")
print(f" Mean: {stats.uniform.mean():.2f}")
print(f" Std: {stats.uniform.std():.4f}")
print(f" P(X < 0.5): {stats.uniform.cdf(0.5):.2f}")
print(f" P(0.3 < X < 0.7): {stats.uniform.cdf(0.7) - stats.uniform.cdf(0.3):.2f}")
The uniform distribution is boring in the best way — it's the distribution of "no information." When you don't know what shape your data should be, the uniform distribution says "everything is equally likely." It's the starting point for random number generation, which is why np.random.random() gives you uniform values.
Why the Uniform Distribution Matters
You might be thinking: "The uniform distribution is so simple — why spend time on it?" Three reasons:
1. It's the foundation of simulation. Every Monte Carlo simulation starts with uniform random numbers. When you call np.random.random(), you're drawing from a uniform distribution. Even generating random values from other distributions often starts by transforming uniform random numbers.
2. It represents maximum uncertainty. If you have no idea which values are more likely than others, the uniform distribution says "everything is equally possible." It's the starting point for Bayesian analysis when you have no prior information.
3. It's a useful contrast. Understanding the uniform distribution helps you appreciate what makes other distributions interesting. The normal distribution is interesting precisely because values aren't equally likely — some are much more common than others.
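That "transforming uniform random numbers" idea can be sketched with inverse transform sampling: push uniform draws through a target distribution's inverse CDF (ppf in scipy.stats) and the results follow that distribution. This is an illustration of the principle, not necessarily the algorithm NumPy uses internally.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# Start with plain uniform(0, 1) draws...
u = np.random.uniform(0, 1, 100000)

# ...and push each through the inverse CDF (ppf) of a target
# distribution. The results are distributed like that target.
normal_samples = stats.norm.ppf(u)           # standard normal
expon_samples = stats.expon.ppf(u, scale=5)  # exponential, mean 5

print(f"Normal samples: mean={normal_samples.mean():.3f}, std={normal_samples.std():.3f}")
print(f"Exponential samples: mean={expon_samples.mean():.3f} (theory: 5)")
```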
# From uniform to normal: the connection
# If you take the average of 12 uniform(0,1) values, you get something
# approximately normal! (This is a quick-and-dirty way to generate
# approximately normal random numbers)
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Average of 1 uniform
avg1 = np.random.uniform(0, 1, 10000)
axes[0].hist(avg1, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
axes[0].set_title('Average of 1 Uniform\n(Still uniform)')
# Average of 3 uniforms
avg3 = np.mean(np.random.uniform(0, 1, (10000, 3)), axis=1)
axes[1].hist(avg3, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
axes[1].set_title('Average of 3 Uniforms\n(Starting to look bell-shaped!)')
# Average of 12 uniforms
avg12 = np.mean(np.random.uniform(0, 1, (10000, 12)), axis=1)
axes[2].hist(avg12, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
x = np.linspace(0.2, 0.8, 100)
# SD of the mean of 12 uniforms: SD of one uniform / sqrt(12) = (1/sqrt(12))/sqrt(12) = 1/12
axes[2].plot(x, stats.norm.pdf(x, 0.5, 1/12), 'r-', linewidth=2)
axes[2].set_title('Average of 12 Uniforms\n(Practically normal!)')
plt.tight_layout()
plt.savefig('uniform_to_normal.png', dpi=150, bbox_inches='tight')
plt.show()
This is your first preview of the Central Limit Theorem — the star of Section 21.7. Even flat, uniform data becomes bell-shaped when you take averages. We'll explore why in detail later.
21.3 The Binomial Distribution — Counting Successes
The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has the same probability of success.
In plain English: if you flip a coin n times, the binomial distribution tells you the probability of getting exactly k heads.
Parameters:
- n: number of trials
- p: probability of success on each trial
# Binomial distribution: 20 coin flips, P(heads) = 0.5
n, p = 20, 0.5
k_values = np.arange(0, n + 1)
probabilities = stats.binom.pmf(k_values, n, p)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k_values, probabilities, color='steelblue', edgecolor='white', width=0.7)
ax.set_xlabel('Number of Heads')
ax.set_ylabel('Probability')
ax.set_title(f'Binomial Distribution: n={n}, p={p}\n"How many heads in {n} flips?"')
ax.set_xticks(k_values)
# Mark the expected value
expected = n * p
ax.axvline(expected, color='red', linestyle='--', label=f'Expected: {expected:.0f}')
ax.legend()
plt.tight_layout()
plt.savefig('binomial.png', dpi=150, bbox_inches='tight')
plt.show()
# Key calculations
print(f"P(exactly 10 heads in 20 flips): {stats.binom.pmf(10, 20, 0.5):.4f}")
print(f"P(8 or fewer heads): {stats.binom.cdf(8, 20, 0.5):.4f}")
print(f"P(more than 15 heads): {1 - stats.binom.cdf(15, 20, 0.5):.4f}")
print(f"Expected value (n*p): {n * p}")
print(f"Std dev (sqrt(n*p*(1-p))): {np.sqrt(n * p * (1-p)):.2f}")
Real-World Binomial Examples
The binomial distribution is everywhere:
- Quality control: Out of 100 manufactured items, how many are defective? (n=100, p=defect rate)
- Vaccination: Out of 50 vaccinated people, how many develop immunity? (n=50, p=vaccine efficacy)
- Surveys: Out of 1000 respondents, how many say "yes"? (n=1000, p=true proportion)
# Marcus's pastry problem: 50 pastries, 3% defect rate
n_pastries, p_defect = 50, 0.03
print("Marcus's bakery defect analysis:")
print(f" Expected defects per day: {n_pastries * p_defect:.1f}")
print(f" P(0 defects): {stats.binom.pmf(0, n_pastries, p_defect):.4f}")
print(f" P(1 defect): {stats.binom.pmf(1, n_pastries, p_defect):.4f}")
print(f" P(2 defects): {stats.binom.pmf(2, n_pastries, p_defect):.4f}")
print(f" P(3+ defects): {1 - stats.binom.cdf(2, n_pastries, p_defect):.4f}")
21.4 The Poisson Distribution — Counting Rare Events
The Poisson distribution describes the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate.
In plain English: "How many customers walk into the store in the next hour, if we average 5 per hour?"
Parameter:
- lambda (rate): the average number of events per interval
# Poisson distribution: average 5 customers per hour
lam = 5
k_values = np.arange(0, 16)
probabilities = stats.poisson.pmf(k_values, lam)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k_values, probabilities, color='mediumseagreen', edgecolor='white', width=0.7)
ax.set_xlabel('Number of Customers')
ax.set_ylabel('Probability')
ax.set_title(f'Poisson Distribution: λ={lam}\n"How many customers in the next hour?"')
ax.set_xticks(k_values)
ax.axvline(lam, color='red', linestyle='--', label=f'Expected: {lam}')
ax.legend()
plt.tight_layout()
plt.savefig('poisson.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"P(exactly 5 customers): {stats.poisson.pmf(5, lam):.4f}")
print(f"P(0 customers): {stats.poisson.pmf(0, lam):.4f}")
print(f"P(10+ customers): {1 - stats.poisson.cdf(9, lam):.4f}")
The Poisson distribution is useful for counting rare events: website clicks per minute, earthquakes per year, typos per page, disease cases per week. The key feature: the mean and variance are both equal to lambda.
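That mean-equals-variance property is easy to verify, both from scipy's theoretical moments and from simulated draws (the seed here is arbitrary):

```python
import numpy as np
from scipy import stats

lam = 5

# Theoretical moments straight from scipy
print(f"Theoretical mean: {stats.poisson.mean(lam):.2f}")     # 5.00
print(f"Theoretical variance: {stats.poisson.var(lam):.2f}")  # 5.00

# Simulated draws show the same pattern
np.random.seed(42)
draws = np.random.poisson(lam, 100000)
print(f"Simulated mean: {draws.mean():.2f}")
print(f"Simulated variance: {draws.var():.2f}")
```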
The Relationship Between Distributions
These distributions aren't isolated — they're connected in deep and useful ways:
1. Binomial becomes Poisson: When n is large and p is small, Binomial(n, p) is well-approximated by Poisson(n*p). This is why the Poisson is sometimes called the "law of rare events."
2. Binomial becomes Normal: When n is large, Binomial(n, p) is well-approximated by Normal(np, sqrt(np(1-p))). This is one of the earliest results in probability theory and a special case of the Central Limit Theorem.
3. Poisson becomes Normal: When lambda is large (say, > 20), Poisson(lambda) looks approximately normal.
These approximations aren't just mathematical curiosities — they're practically useful. Before computers, they allowed people to compute binomial and Poisson probabilities using normal distribution tables. Today, they help us understand why the normal distribution shows up so often.
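The binomial-to-Poisson connection is easy to check numerically. This sketch (with arbitrary example parameters n=1000, p=0.005, so n*p = 5) compares the two PMFs directly:

```python
import numpy as np
from scipy import stats

# Large n, small p: Binomial(1000, 0.005) vs Poisson(n*p = 5)
n, p = 1000, 0.005
lam = n * p
k = np.arange(0, 13)

binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

for ki in [3, 5, 7]:
    print(f"P(X = {ki}): binomial {binom_pmf[ki]:.4f}, Poisson {poisson_pmf[ki]:.4f}")
print(f"Largest pointwise difference: {np.abs(binom_pmf - poisson_pmf).max():.5f}")
```

The two columns agree to roughly three decimal places, which is why the Poisson is a practical stand-in for the binomial in rare-event settings.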
# Binomial → Normal approximation
n, p = 100, 0.3
k = np.arange(0, 60)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k, stats.binom.pmf(k, n, p), color='steelblue', alpha=0.6,
label='Binomial(100, 0.3)', width=0.8)
x = np.linspace(15, 45, 100)
ax.plot(x, stats.norm.pdf(x, n*p, np.sqrt(n*p*(1-p))),
color='red', linewidth=2, label=f'Normal({n*p}, {np.sqrt(n*p*(1-p)):.1f})')
ax.set_title('Binomial Approximated by Normal (n=100, p=0.3)')
ax.legend()
ax.set_xlabel('Number of Successes')
ax.set_ylabel('Probability')
plt.tight_layout()
plt.savefig('binomial_normal_approx.png', dpi=150, bbox_inches='tight')
plt.show()
21.5 The Normal Distribution — The Star of the Show
Okay. We've warmed up with the supporting cast. Now it's time for the headliner.
The normal distribution (also called the Gaussian distribution or the bell curve) is the most important distribution in all of statistics. It's symmetric, bell-shaped, and completely defined by just two parameters:
- mu: the mean (the center of the bell)
- sigma: the standard deviation (how wide the bell is)
# The normal distribution with different parameters
fig, ax = plt.subplots(figsize=(12, 6))
x = np.linspace(-15, 25, 1000)
distributions = [
(0, 1, 'N(0,1) — Standard Normal'),
(5, 1, 'N(5,1) — Shifted right'),
(0, 3, 'N(0,3) — Wider'),
(10, 2, 'N(10,2) — Shifted and wider'),
]
colors = ['steelblue', 'coral', 'mediumseagreen', 'mediumpurple']
for (mu, sigma, label), color in zip(distributions, colors):
ax.plot(x, stats.norm.pdf(x, mu, sigma), linewidth=2, label=label, color=color)
ax.set_title('Normal Distributions with Different Parameters')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
ax.set_ylim(0, 0.45)
plt.tight_layout()
plt.savefig('normal_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
Properties of the Normal Distribution
- Symmetric around the mean. The left half is a mirror image of the right half.
- Bell-shaped. The curve peaks at the mean and tapers off equally on both sides.
- Mean = Median = Mode. They're all the same value for a perfectly normal distribution.
- Tails never touch zero. They get very close, but technically the curve extends to infinity in both directions.
- 68-95-99.7 rule (the empirical rule): these are the percentages of data within 1, 2, and 3 standard deviations of the mean.
The Empirical Rule (68-95-99.7)
This is one of the most useful rules in all of statistics. For normally distributed data:
- 68% of values fall within 1 standard deviation of the mean
- 95% fall within 2 standard deviations
- 99.7% fall within 3 standard deviations
# Demonstrate the empirical rule
np.random.seed(42)
data = np.random.normal(100, 15, 100000) # Mean=100, SD=15
within_1 = np.mean(np.abs(data - 100) <= 15) * 100
within_2 = np.mean(np.abs(data - 100) <= 30) * 100
within_3 = np.mean(np.abs(data - 100) <= 45) * 100
print("Empirical Rule Verification (Mean=100, SD=15):")
print(f" Within 1 SD (85-115): {within_1:.1f}% (theory: 68.3%)")
print(f" Within 2 SD (70-130): {within_2:.1f}% (theory: 95.4%)")
print(f" Within 3 SD (55-145): {within_3:.1f}% (theory: 99.7%)")
# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x = np.linspace(40, 160, 1000)
y = stats.norm.pdf(x, 100, 15)
ax.plot(x, y, color='black', linewidth=2)
# Shade the regions
colors = ['steelblue', 'cornflowerblue', 'lightsteelblue']
labels = ['68% (1 SD)', '95% (2 SD)', '99.7% (3 SD)']
ranges = [(85, 115), (70, 130), (55, 145)]
for (low, high), color, label in zip(reversed(ranges), reversed(colors), reversed(labels)):
mask = (x >= low) & (x <= high)
ax.fill_between(x[mask], y[mask], color=color, alpha=0.7, label=label)
ax.set_title('The Empirical Rule (68-95-99.7)')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.legend(loc='upper right')
# Add SD markers
for i, sd in enumerate([1, 2, 3]):
ax.annotate(f'{100 - sd*15}', xy=(100 - sd*15, -0.001), ha='center', fontsize=8)
ax.annotate(f'{100 + sd*15}', xy=(100 + sd*15, -0.001), ha='center', fontsize=8)
plt.tight_layout()
plt.savefig('empirical_rule.png', dpi=150, bbox_inches='tight')
plt.show()
Why this matters: If someone tells you "the average test score is 75 with a standard deviation of 10," you can immediately sketch the entire distribution in your head. About 68% of students scored between 65 and 85. About 95% scored between 55 and 95. A score of 45 (3 SDs below the mean) is extremely rare.
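Those head-sketch numbers can be checked directly with scipy.stats (using the same mean=75, SD=10 setup):

```python
from scipy import stats

scores = stats.norm(loc=75, scale=10)

# The empirical-rule ranges from the paragraph above
print(f"P(65 < score < 85): {scores.cdf(85) - scores.cdf(65):.3f}")  # ~0.683
print(f"P(55 < score < 95): {scores.cdf(95) - scores.cdf(55):.3f}")  # ~0.954
# A score of 45 sits 3 SDs below the mean -- genuinely rare
print(f"P(score < 45): {scores.cdf(45):.5f}")  # ~0.00135
```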
Why Is the Normal Distribution Everywhere?
You might wonder: why is the bell curve so common in nature? Why are heights normally distributed, and why do measurement errors follow a bell curve?
The answer involves the Central Limit Theorem (which we'll prove by simulation in Section 21.7), but there's an intuitive explanation worth sharing now.
Many quantities in nature are the result of many small, independent effects adding together. Your height is influenced by hundreds of genes, each contributing a tiny amount, plus environmental factors (nutrition, health, etc.). Measurement errors come from many small disturbances (vibration, temperature, humidity) adding up. Exam scores reflect many small factors (each question, each night of sleep, each distracting thought).
When you add up many small, independent random effects, the result tends toward a normal distribution — regardless of what shape the individual effects have. This is the Central Limit Theorem, and it's the reason the bell curve is the most important shape in statistics.
But here's the flip side: when data is NOT the result of many small additive effects, it's often NOT normal. Income, for example, is multiplicative (a 10% raise compounds on previous raises), which produces a log-normal distribution (skewed right). The number of followers on social media involves preferential attachment (popular accounts get more followers because they're popular), which produces a power-law distribution (extremely skewed). Understanding what process generates your data helps you predict what distribution it follows.
# Demonstration: Adding small random effects produces the bell curve
np.random.seed(42)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n_effects in zip(axes, [1, 2, 10, 50]):
# Each "effect" is a random value from [-1, 1]
combined = np.sum(np.random.uniform(-1, 1, size=(10000, n_effects)), axis=1)
ax.hist(combined, bins=40, density=True, color='steelblue',
edgecolor='white', alpha=0.8)
if n_effects > 1:
x = np.linspace(combined.min(), combined.max(), 100)
sigma = np.sqrt(n_effects / 3) # SD of sum of n uniforms
ax.plot(x, stats.norm.pdf(x, 0, sigma), 'r-', linewidth=2)
ax.set_title(f'Sum of {n_effects} random effect{"s" if n_effects > 1 else ""}')
plt.suptitle('Many Small Effects → Bell Curve', fontsize=14)
plt.tight_layout()
plt.savefig('why_normal.png', dpi=150, bbox_inches='tight')
plt.show()
With just 10 random effects, the bell curve is already recognizable. With 50, it's nearly perfect. This is why height (hundreds of genetic effects) and measurement error (many small disturbances) follow the normal distribution so closely.
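The multiplicative flip side mentioned earlier can be simulated the same way. In this sketch (parameters chosen arbitrarily), one quantity accumulates small additive shocks while another compounds small percentage changes; the compounded one comes out right-skewed, with its mean pulled above its median, in the log-normal direction:

```python
import numpy as np

np.random.seed(42)
n_people, n_steps = 100000, 50

# Additive process: start at 100, add a small random shock each step
additive = 100 + np.random.uniform(-1, 1, (n_people, n_steps)).sum(axis=1)

# Multiplicative process: start at 100, apply a small random
# percentage change each step (like compounding raises)
growth = 1 + np.random.uniform(-0.05, 0.05, (n_people, n_steps))
multiplicative = 100 * np.prod(growth, axis=1)

# Additive: symmetric, mean and median agree
print(f"Additive: mean={additive.mean():.1f}, median={np.median(additive):.1f}")
# Multiplicative: right-skewed, mean noticeably above the median
print(f"Multiplicative: mean={multiplicative.mean():.1f}, "
      f"median={np.median(multiplicative):.1f}")
```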
21.6 The Standard Normal and Z-Scores (Revisited)
The standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1. It's the "reference normal" that all other normal distributions can be converted to.
Any normal distribution can be standardized using z-scores, which we first met in Chapter 19:
$$z = \frac{x - \mu}{\sigma}$$
This converts any value x (from a normal distribution with mean mu and standard deviation sigma) into the equivalent position on the standard normal.
# Converting between raw scores and z-scores
mu, sigma = 75, 10 # Test scores: mean=75, SD=10
# Student scored 92
raw_score = 92
z = (raw_score - mu) / sigma
print(f"Raw score: {raw_score}")
print(f"Z-score: {z:.2f}")
print(f"This score is {z:.1f} standard deviations above the mean")
# What percentage of students scored below 92?
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}th")
print(f" {percentile:.1f}% of students scored below {raw_score}")
# What score is needed to be in the top 10%?
z_top10 = stats.norm.ppf(0.90) # ppf = percent point function (inverse CDF)
score_top10 = mu + z_top10 * sigma
print(f"\nTo be in the top 10%, you need at least {score_top10:.1f}")
print(f" (z-score of {z_top10:.2f})")
Using scipy.stats.norm
The scipy.stats module is your toolkit for working with distributions. Here are the most important methods for the normal distribution:
from scipy import stats
# Create a normal distribution object
test_dist = stats.norm(loc=75, scale=10) # mean=75, std=10
# PDF: probability density at a point
print(f"Density at x=75 (the peak): {test_dist.pdf(75):.4f}")
print(f"Density at x=85 (1 SD up): {test_dist.pdf(85):.4f}")
# CDF: probability of being less than or equal to x
print(f"\nP(score <= 60): {test_dist.cdf(60):.4f} ({test_dist.cdf(60)*100:.1f}%)")
print(f"P(score <= 75): {test_dist.cdf(75):.4f} ({test_dist.cdf(75)*100:.1f}%)")
print(f"P(score <= 90): {test_dist.cdf(90):.4f} ({test_dist.cdf(90)*100:.1f}%)")
# Probability between two values
p_between = test_dist.cdf(85) - test_dist.cdf(65)
print(f"\nP(65 < score < 85): {p_between:.4f} ({p_between*100:.1f}%)")
# PPF: inverse CDF (find the score at a given percentile)
print(f"\n50th percentile score: {test_dist.ppf(0.50):.1f}")
print(f"90th percentile score: {test_dist.ppf(0.90):.1f}")
print(f"99th percentile score: {test_dist.ppf(0.99):.1f}")
# Generate random samples
samples = test_dist.rvs(size=10)
print(f"\n10 random scores: {np.round(samples, 1)}")
# Summary statistics
print(f"\nDistribution properties:")
print(f" Mean: {test_dist.mean():.1f}")
print(f" Std: {test_dist.std():.1f}")
print(f" Var: {test_dist.var():.1f}")
The scipy.stats Pattern
Every distribution in scipy.stats follows the same pattern. Once you learn it for the normal, you can use any distribution:
# The pattern works for any distribution:
# stats.<distribution>(parameters)
# .pdf(x) — density at x
# .cdf(x) — P(X <= x)
# .ppf(q) — value at the q-th quantile (inverse CDF)
# .rvs(size) — random samples
# .mean() — expected value
# .std() — standard deviation
# Normal: stats.norm(loc=mean, scale=std)
# Binomial: stats.binom(n=trials, p=prob)
# Poisson: stats.poisson(mu=rate)
# Uniform: stats.uniform(loc=start, scale=width)
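To see that the pattern really does carry over, here are the same calls on a frozen binomial object (example parameters chosen arbitrarily). The one wrinkle: discrete distributions use .pmf where continuous ones use .pdf.

```python
from scipy import stats

# Frozen binomial object -- same method names as stats.norm
flips = stats.binom(n=20, p=0.5)

print(f"P(exactly 10 heads): {flips.pmf(10):.4f}")  # discrete: .pmf, not .pdf
print(f"P(8 or fewer heads): {flips.cdf(8):.4f}")
print(f"Median (50th percentile): {flips.ppf(0.5):.0f}")
print(f"Mean: {flips.mean():.1f}, Std: {flips.std():.2f}")
print(f"5 random draws: {flips.rvs(size=5, random_state=42)}")
```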
21.7 The Central Limit Theorem — The Magic Trick of Statistics
Here it is. The most important theorem in statistics. And we're going to discover it by simulation before I tell you what it says.
The Experiment
Let's start with a decidedly NON-normal distribution — a uniform distribution. Every value between 0 and 1 is equally likely. No bell shape anywhere.
Now: take a random sample of 30 values from this uniform distribution. Compute the sample mean. Write it down.
Do this 10,000 times.
Plot the 10,000 sample means.
np.random.seed(42)
# The population: UNIFORM distribution (not normal at all!)
fig, axes = plt.subplots(2, 3, figsize=(15, 9))
# Show the population
population = np.random.uniform(0, 1, 1000000)
axes[0, 0].hist(population, bins=50, color='coral', edgecolor='white', alpha=0.8, density=True)
axes[0, 0].set_title('Population: Uniform(0,1)\n(definitely NOT normal)')
axes[0, 0].set_xlabel('Value')
# Take samples of increasing size and plot the distribution of means
sample_sizes = [2, 5, 10, 30, 100]
for idx, n in enumerate(sample_sizes):
row = (idx + 1) // 3
col = (idx + 1) % 3
sample_means = [np.mean(np.random.uniform(0, 1, n)) for _ in range(10000)]
axes[row, col].hist(sample_means, bins=50, color='steelblue',
edgecolor='white', alpha=0.8, density=True)
# Overlay the theoretical normal
x_range = np.linspace(min(sample_means), max(sample_means), 100)
theoretical_std = (1/np.sqrt(12)) / np.sqrt(n) # SD of uniform / sqrt(n)
axes[row, col].plot(x_range, stats.norm.pdf(x_range, 0.5, theoretical_std),
color='red', linewidth=2, label='Normal approximation')
axes[row, col].set_title(f'Sample means (n={n})')
axes[row, col].legend(fontsize=7)
plt.suptitle('The Central Limit Theorem in Action\n'
'Population is UNIFORM, but sample means become NORMAL!', fontsize=14)
plt.tight_layout()
plt.savefig('clt_uniform.png', dpi=150, bbox_inches='tight')
plt.show()
Look at that. The population is flat — uniform — nothing even remotely bell-shaped. But the means of samples from that population form a beautiful bell curve. The larger the sample size, the more perfect the bell.
Try It with an Extremely Non-Normal Distribution
Still skeptical? Let's try with an exponential distribution — wildly right-skewed:
np.random.seed(42)
fig, axes = plt.subplots(2, 3, figsize=(15, 9))
# Exponential population (very right-skewed)
population = np.random.exponential(5, 1000000)
axes[0, 0].hist(population, bins=100, color='coral', edgecolor='white',
alpha=0.8, density=True, range=(0, 40))
axes[0, 0].set_title('Population: Exponential(5)\n(very right-skewed!)')
for idx, n in enumerate(sample_sizes):
row = (idx + 1) // 3
col = (idx + 1) % 3
sample_means = [np.mean(np.random.exponential(5, n)) for _ in range(10000)]
axes[row, col].hist(sample_means, bins=50, color='steelblue',
edgecolor='white', alpha=0.8, density=True)
x_range = np.linspace(min(sample_means), max(sample_means), 100)
theoretical_std = 5 / np.sqrt(n)
axes[row, col].plot(x_range, stats.norm.pdf(x_range, 5, theoretical_std),
color='red', linewidth=2, label='Normal approximation')
axes[row, col].set_title(f'Sample means (n={n})')
axes[row, col].legend(fontsize=7)
plt.suptitle('CLT with Exponential Population\n'
'Even SKEWED data produces NORMAL sample means!', fontsize=14)
plt.tight_layout()
plt.savefig('clt_exponential.png', dpi=150, bbox_inches='tight')
plt.show()
There it is again. The population is dramatically skewed, but sample means become increasingly normal as n grows.
The Central Limit Theorem: What It Says
Now that you've seen it, here's the formal statement:
Central Limit Theorem (CLT): If you take random samples of size n from ANY population with a finite mean (mu) and finite standard deviation (sigma), the distribution of sample means will approach a normal distribution as n increases — regardless of the shape of the original population. The mean of this sampling distribution equals mu, and the standard deviation equals sigma / sqrt(n).
Let's break that down:
- "Any population" — it doesn't matter if the population is uniform, skewed, bimodal, or any other shape. The magic works regardless.
- "As n increases" — with small samples (n=2, n=5), the bell shape is approximate. By n=30, it's usually quite good. This is why "n=30" is often cited as a rule of thumb for when the CLT kicks in.
- "Sigma / sqrt(n)" — the standard deviation of sample means (called the standard error) gets smaller as n grows. Larger samples produce more precise estimates.
# Demonstrating sigma / sqrt(n)
population_sd = 5 # Exponential(5) has SD = 5
sample_sizes = [5, 10, 25, 50, 100, 500]
print("Standard Error = sigma / sqrt(n)")
print("=" * 45)
for n in sample_sizes:
se_theoretical = population_sd / np.sqrt(n)
sample_means = [np.mean(np.random.exponential(5, n)) for _ in range(10000)]
se_simulated = np.std(sample_means)
print(f" n={n:4d}: theoretical SE = {se_theoretical:.3f}, "
f"simulated SE = {se_simulated:.3f}")
Why the CLT Matters
The Central Limit Theorem is the reason that:
1. Opinion polls work. The average of a sample approaches the population average, and we can quantify how much it might be off.
2. Confidence intervals make sense. We'll use the CLT in Chapter 22 to build intervals that say "we're 95% confident the true value is between A and B."
3. Hypothesis tests work. In Chapter 23, we'll use the CLT to decide whether a difference is "statistically significant."
4. The normal distribution is everywhere. Many quantities we measure are themselves averages or sums of many small effects — and sums of many things tend toward normal, thanks to the CLT.
Why the CLT Works — Intuition, Not Proof
You don't need to understand the mathematical proof of the CLT (it involves characteristic functions and complex analysis). But you can develop a strong intuition for why it works.
Think about what happens when you average together many values:
1. Extreme values cancel out. If you sample 50 values from an exponential distribution, some will be very large. But most will be moderate. When you average them, the few large values are diluted by the many moderate ones. The average is always less extreme than the most extreme individual value.
2. There are many more ways to get a "middle" average than an extreme one. If you're averaging 50 values from Uniform(0, 100), getting an average of exactly 50 can happen in billions of ways (countless combinations that add up right). Getting an average of exactly 95 requires almost all 50 values to be near 100 — far fewer ways. This combinatorial argument is why the bell shape emerges: moderate averages are exponentially more likely than extreme ones.
3. Each additional observation pulls the average toward the center. The more values you average, the harder it is for any single extreme value to dominate. This is the smoothing effect of averaging, and it's why larger n produces a tighter, more normal distribution.
Jordan's analogy: "Think of it like a tug-of-war with 50 people on each side. Even if a few people on one side are incredibly strong, the average strength of each team is going to be similar — the individual differences get averaged out. That's the CLT."
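The "many more ways to get a middle average" argument is easy to verify by brute force. The quick sketch below (illustrative numbers, not part of the chapter's project code) averages 50 draws from Uniform(0, 100) a hundred thousand times and counts how often the average lands near the middle versus far out:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 experiments: average 50 draws from Uniform(0, 100)
averages = rng.uniform(0, 100, size=(100_000, 50)).mean(axis=1)

# Middle averages vastly outnumber extreme ones
near_middle = np.mean((averages > 45) & (averages < 55))
above_70 = np.mean(averages > 70)

print(f"P(45 < mean < 55) ≈ {near_middle:.4f}")  # most experiments land here
print(f"P(mean > 70)      ≈ {above_70:.6f}")     # essentially never happens
```

Even though individual draws above 70 are common (30% of them!), an *average* above 70 would require almost all 50 draws to be large simultaneously.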
The Standard Error: A Crucial Quantity
The standard error (SE) is so important that it deserves its own moment in the spotlight:
$$SE = \frac{\sigma}{\sqrt{n}}$$
In English: "The standard error is the population standard deviation divided by the square root of the sample size."
This formula tells you how precise your sample mean is. A smaller SE means your sample mean is likely close to the true population mean.
Key implications:
- To cut the SE in half, you need 4x the data. (Because sqrt(4n) = 2*sqrt(n).)
- To cut the SE to a third, you need 9x the data. Diminishing returns!
- Very large samples have tiny SEs. With n=10,000, the SE is sigma/100. Your sample mean is very precise.
# Standard error at different sample sizes
sigma = 20 # Population standard deviation
print(f"Population SD: {sigma}")
print(f"{'n':>8} {'SE':>8} {'95% range of means':>25}")
print("=" * 45)
for n in [5, 10, 25, 50, 100, 500, 1000, 10000]:
    se = sigma / np.sqrt(n)
    print(f"{n:>8d} {se:>8.2f} {f'± {1.96*se:.2f}':>25}")
This table is worth studying. With a sample of just 25, your mean is likely within about 8 units of the truth. With 10,000, it's within 0.4 units. The precision improves, but each doubling of precision costs four times as much data.
21.8 Checking Normality: Is Your Data Actually Normal?
The empirical rule and z-score calculations assume your data is normally distributed. But real data often isn't. How do you check?
Method 1: The Histogram Test (Eyeball It)
Plot a histogram and overlay the theoretical normal curve. Does it look roughly bell-shaped?
np.random.seed(42)
# Some data that IS roughly normal
normal_data = np.random.normal(50, 10, 500)
# Some data that ISN'T
skewed_data = np.random.exponential(10, 500) + 20
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, data, title in [(axes[0], normal_data, 'Roughly Normal'),
                        (axes[1], skewed_data, 'Not Normal (Skewed)')]:
    ax.hist(data, bins=30, density=True, color='steelblue',
            edgecolor='white', alpha=0.8, label='Data')
    x = np.linspace(data.min(), data.max(), 100)
    ax.plot(x, stats.norm.pdf(x, data.mean(), data.std()),
            color='red', linewidth=2, label='Normal fit')
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.savefig('normality_check_hist.png', dpi=150, bbox_inches='tight')
plt.show()
Method 2: The Q-Q Plot (Quantile-Quantile Plot)
A Q-Q plot compares the quantiles of your data to the quantiles of a theoretical normal distribution. If the data is normal, the points fall on a straight diagonal line. Deviations from the line reveal specific types of non-normality.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Q-Q plot for normal data
stats.probplot(normal_data, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot: Normal Data\n(Points follow the line)')
# Q-Q plot for skewed data
stats.probplot(skewed_data, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Skewed Data\n(Points curve away from the line)')
plt.tight_layout()
plt.savefig('qq_plots.png', dpi=150, bbox_inches='tight')
plt.show()
Reading a Q-Q plot:
- Points on the line: Data is approximately normal.
- Points deviate at both ends (below the line on the left, above on the right): Data has heavier tails than normal (more extreme values).
- Points curve away at one end: Data is skewed.
- S-shaped curve: Data has lighter tails than normal.
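A Q-Q plot is just a picture of a quantile comparison, so you can also see the heavy-tail pattern numerically. This sketch uses t-distributed data (3 degrees of freedom) as a stand-in for heavy-tailed data and compares its standardized quantiles to the normal ones:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
heavy_tailed = stats.t.rvs(df=3, size=5000)   # t with 3 df: heavier tails than normal
standardized = (heavy_tailed - heavy_tailed.mean()) / heavy_tailed.std()

# Compare extreme sample quantiles with the theoretical normal quantiles
for q in [0.01, 0.05, 0.50, 0.95, 0.99]:
    sample_q = np.quantile(standardized, q)
    normal_q = stats.norm.ppf(q)
    print(f"q={q:.2f}: sample quantile {sample_q:6.2f} vs normal {normal_q:6.2f}")
```

In the middle the quantiles roughly match, but at q=0.01 and q=0.99 the sample quantiles are more extreme than the normal ones. Those are exactly the points that sag below and rise above the line on a Q-Q plot.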
Method 3: Statistical Tests
Formal tests for normality exist, but they have limitations:
# Shapiro-Wilk test (good for small to medium samples)
stat_normal, p_normal = stats.shapiro(normal_data[:100]) # Limit to 100 for Shapiro
stat_skewed, p_skewed = stats.shapiro(skewed_data[:100])
print("Shapiro-Wilk Test (H0: data is normal):")
print(f" Normal data: statistic={stat_normal:.4f}, p-value={p_normal:.4f}")
print(f" Skewed data: statistic={stat_skewed:.4f}, p-value={p_skewed:.4f}")
print()
if p_normal > 0.05:
    print(" Normal data: p > 0.05 → fail to reject normality (looks normal)")
if p_skewed < 0.05:
    print(" Skewed data: p < 0.05 → reject normality (not normal)")
Important caveat: With very large samples, normality tests will reject normality for even tiny deviations from a perfect bell curve. With very small samples, they lack the power to detect non-normality. The Q-Q plot is often more informative than a formal test because it shows you how the data deviates from normal.
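You can watch this sample-size sensitivity in action. The sketch below (illustrative; it uses scipy's D'Agostino-Pearson test, stats.normaltest, which handles large samples) runs the same test on increasingly large slices of data that is only slightly skewed:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Slightly skewed data: close to normal, but not exactly (skew-normal with a=1)
slightly_off = stats.skewnorm.rvs(a=1, size=100_000)

# Same data, same tiny deviation — but the verdict flips as n grows
for n in [50, 500, 100_000]:
    _, p = stats.normaltest(slightly_off[:n])
    verdict = "fail to reject (looks normal)" if p > 0.05 else "reject normality"
    print(f"n={n:>7d}: p={p:.4g} → {verdict}")
```

The deviation from normality never changes; only the test's power to detect it does. That's why a significant p-value from a normality test tells you a deviation exists, not whether it matters.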
21.9 The Progressive Project: Vaccination Distributions
Time to apply everything to our Global Health Data Explorer. Elena wants to fit normal distributions to her vaccination rate data and check whether the normal assumption is appropriate.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulated vaccination data by income group
np.random.seed(21)
vaccination_df = pd.DataFrame({
    'income_group': np.repeat(
        ['Low income', 'Lower middle', 'Upper middle', 'High income'],
        [30, 55, 55, 50]
    ),
    'measles_vacc_rate': np.concatenate([
        np.clip(np.random.normal(52, 18, 30), 10, 99),
        np.clip(np.random.normal(68, 14, 55), 20, 99),
        np.clip(np.random.normal(82, 10, 55), 35, 99),
        np.clip(np.random.normal(93, 4, 50), 60, 99),
    ])
})
# For each income group: fit normal, create Q-Q plot, check assumptions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
groups = ['Low income', 'Lower middle', 'Upper middle', 'High income']
colors = ['#e74c3c', '#e67e22', '#2ecc71', '#3498db']
for idx, (group, color) in enumerate(zip(groups, colors)):
    data = vaccination_df[vaccination_df['income_group'] == group]['measles_vacc_rate']
    # Top row: histogram with normal fit
    ax_top = axes[0, idx]
    ax_top.hist(data, bins=12, density=True, color=color, edgecolor='white', alpha=0.7)
    x = np.linspace(data.min() - 5, data.max() + 5, 100)
    ax_top.plot(x, stats.norm.pdf(x, data.mean(), data.std()),
                color='black', linewidth=2)
    ax_top.set_title(f'{group}\nmean={data.mean():.1f}, sd={data.std():.1f}')
    ax_top.set_xlabel('Vaccination Rate (%)')
    # Bottom row: Q-Q plot
    ax_bot = axes[1, idx]
    stats.probplot(data, dist="norm", plot=ax_bot)
    ax_bot.set_title('Q-Q Plot')
    # Shapiro test
    if len(data) >= 8:
        _, p_val = stats.shapiro(data)
        normality = "Yes" if p_val > 0.05 else "No"
        ax_bot.text(0.05, 0.95, f'Shapiro p={p_val:.3f}\nNormal? {normality}',
                    transform=ax_bot.transAxes, fontsize=8, va='top',
                    bbox=dict(boxstyle='round', facecolor='lightyellow'))
plt.suptitle('Vaccination Rates by Income Group: Normality Assessment', fontsize=14)
plt.tight_layout()
plt.savefig('vaccination_normality.png', dpi=150, bbox_inches='tight')
plt.show()
# Compute probabilities using fitted normal distributions
print("\n=== Using Normal Distribution for Predictions ===\n")
for group in groups:
    data = vaccination_df[vaccination_df['income_group'] == group]['measles_vacc_rate']
    mu, sigma = data.mean(), data.std()
    dist = stats.norm(mu, sigma)
    print(f"{group} (N({mu:.1f}, {sigma:.1f})):")
    print(f" P(rate < 50%): {dist.cdf(50):.4f} ({dist.cdf(50)*100:.1f}%)")
    print(f" P(rate > 90%): {1-dist.cdf(90):.4f} ({(1-dist.cdf(90))*100:.1f}%)")
    print(f" P(70% < rate < 90%): {dist.cdf(90)-dist.cdf(70):.4f} "
          f"({(dist.cdf(90)-dist.cdf(70))*100:.1f}%)")
    print()
Elena notices that the high-income group is left-skewed (bumping against the ceiling of 99%) and the low-income group has a wide, somewhat flat distribution. Neither is perfectly normal. But the normal approximation is reasonable for the middle-income groups, and — critically — the CLT tells her that sample means from any of these groups will be approximately normal regardless.
Connecting the CLT to the Progressive Project
Let's make this concrete for Elena. She has vaccination rate data for 190 countries, grouped by income level. She wants to estimate the mean vaccination rate for each income group, but she only has data from some countries, not all.
The CLT tells her:
- Her sample mean is approximately normally distributed around the true population mean.
- The uncertainty in her estimate is captured by the standard error = SD / sqrt(n).
- She can quantify exactly how confident she is in her estimate (Chapter 22 will formalize this).
# Elena's CLT application
np.random.seed(21)
# "True" population of all countries in a group
true_population = np.clip(np.random.normal(72, 16, 500), 10, 99)
true_mean = np.mean(true_population)
true_sd = np.std(true_population)
print(f"True population mean: {true_mean:.1f}%")
print(f"True population SD: {true_sd:.1f}%")
# Elena only has a sample of 40 countries
sample = np.random.choice(true_population, 40, replace=False)
sample_mean = np.mean(sample)
se = true_sd / np.sqrt(40)
print(f"\nElena's sample of 40 countries:")
print(f" Sample mean: {sample_mean:.1f}%")
print(f" Standard error: {se:.1f}%")
print(f" 95% of sample means would fall between "
f"{true_mean - 1.96*se:.1f}% and {true_mean + 1.96*se:.1f}%")
print(f" Elena's sample mean of {sample_mean:.1f}% falls "
f"{'within' if abs(sample_mean - true_mean) < 1.96*se else 'outside'} this range")
This is exactly what polling organizations do. They survey 1,000 people, compute the sample proportion, and use the CLT to construct a margin of error. The math is identical — and now you understand the engine that makes it work.
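Here's that polling calculation as a back-of-the-envelope sketch (the 52% sample proportion is made up for illustration):

```python
import numpy as np

# Margin of error for a hypothetical poll of 1,000 people
n = 1000
p_hat = 0.52                       # sample proportion answering "yes"

# CLT: the sample proportion is approximately normal
# with standard error sqrt(p * (1 - p) / n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se                 # 95% margin of error

print(f"Sample proportion: {p_hat:.0%}")
print(f"Standard error:    {se:.4f}")
print(f"Margin of error:   ±{margin:.1%}")   # ≈ ±3.1%, the famous "plus or minus 3 points"
```

Note that n = 1,000 gives roughly a ±3 point margin no matter how large the population is. The population size barely matters; the sample size is what drives precision.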
21.10 When Data Isn't Normal — And That's Okay
Not all data is normal. Here are common situations and what to do:
Right-skewed data (income, response times, counts): Consider a log transformation, or use the median and IQR instead of mean and standard deviation.
Bounded data (percentages, proportions): Data that can't go below 0 or above 100 is often not normal. The beta distribution is designed for proportions.
Count data (number of events): Better modeled by Poisson or negative binomial distributions.
Bimodal data (mixed populations): No single distribution fits well. Consider mixture models or analyze each group separately.
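For the bounded-data case, here's a minimal sketch of fitting a beta distribution to proportion data. The data is simulated for illustration; fixing floc=0 and fscale=1 pins the fitted distribution's support to (0, 1):

```python
import numpy as np
from scipy import stats

np.random.seed(7)
# Hypothetical proportions in (0, 1) — e.g., vaccination rates as fractions
rates = np.random.beta(8, 2, size=300)       # bounded, left-skewed

# Fit a beta distribution, keeping the support fixed at (0, 1)
a, b, loc, scale = stats.beta.fit(rates, floc=0, fscale=1)
print(f"Fitted beta parameters: a={a:.2f}, b={b:.2f}")

# Probabilities from the fitted model, e.g. P(rate > 0.9)
fitted = stats.beta(a, b)
print(f"P(rate > 0.9) = {1 - fitted.cdf(0.9):.3f}")
```

Unlike a normal fit, the fitted beta distribution assigns zero probability outside (0, 1), so it can't predict impossible values like a 105% vaccination rate.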
# Log transformation for right-skewed data
np.random.seed(42)
income = np.random.lognormal(mean=10, sigma=0.8, size=1000)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Original (skewed)
axes[0].hist(income, bins=50, color='coral', edgecolor='white', alpha=0.8, density=True)
axes[0].set_title(f'Income Data (skewed)\nskewness={stats.skew(income):.2f}')
# Log-transformed (approximately normal)
log_income = np.log(income)
axes[1].hist(log_income, bins=50, color='steelblue', edgecolor='white', alpha=0.8, density=True)
x = np.linspace(log_income.min(), log_income.max(), 100)
axes[1].plot(x, stats.norm.pdf(x, log_income.mean(), log_income.std()),
color='red', linewidth=2)
axes[1].set_title(f'Log(Income) (approximately normal)\nskewness={stats.skew(log_income):.2f}')
plt.tight_layout()
plt.savefig('log_transform.png', dpi=150, bbox_inches='tight')
plt.show()
The key insight: you don't need your raw data to be normal. What you need is for your statistics (like sample means) to be approximately normal — and the Central Limit Theorem gives you that for free, as long as your sample size is reasonably large.
21.11 Chapter Summary
Let's bring it all together:
Probability distributions are mathematical descriptions of randomness — they tell you how probability is spread across possible outcomes.
| Distribution | Type | Use When | Key Parameters | Python |
|---|---|---|---|---|
| Uniform | Continuous (a discrete version exists) | All outcomes equally likely | low, high | stats.uniform (continuous), stats.randint (discrete) |
| Binomial | Discrete | Counting successes in n trials | n, p | stats.binom |
| Poisson | Discrete | Counting rare events in an interval | lambda | stats.poisson |
| Normal | Continuous | Data is symmetric, bell-shaped | mu, sigma | stats.norm |
The normal distribution is defined by its mean and standard deviation. The empirical rule (68-95-99.7) lets you quickly estimate probabilities. Z-scores let you standardize any normal distribution.
The Central Limit Theorem says that sample means are approximately normal, regardless of the population shape, when n is large enough. This is why the normal distribution shows up everywhere — and why it's the foundation of statistical inference.
Checking normality: Use histograms, Q-Q plots, and the Shapiro-Wilk test. But remember that perfect normality is rare, approximate normality is usually good enough, and the CLT often makes normality assumptions unnecessary.
scipy.stats pattern: Every distribution has .pdf() (or .pmf() for discrete distributions), .cdf(), .ppf(), and .rvs() methods. Learn the pattern once, use it for any distribution.
Looking Ahead
You now have the three pillars you need for statistical inference:
1. Descriptive statistics (Chapter 19) — summarizing what you observe
2. Probability (Chapter 20) — reasoning about uncertainty
3. Distributions (this chapter) — the mathematical shapes that connect data to probability
In Chapter 22, we'll put them all together to build confidence intervals — ranges that tell you how precise your estimates are. And in Chapter 23, we'll use them for hypothesis testing — deciding whether a pattern is real or just noise. The Central Limit Theorem you just learned is the engine that powers both.
Connections to Previous and Future Chapters
| Concept from This Chapter | Where It Came From | Where It's Going |
|---|---|---|
| Probability distributions | Connects Ch. 19 shapes to Ch. 20 probabilities | Foundation for all of inference (Ch. 22-23) |
| Normal distribution | Formalizes "bell curve" intuition | Used in confidence intervals (Ch. 22), hypothesis tests (Ch. 23), regression (Ch. 26) |
| Z-scores | First introduced in Ch. 19 | Central to hypothesis testing (Ch. 23) |
| Central Limit Theorem | Builds on law of large numbers (Ch. 20) | Justification for confidence intervals (Ch. 22) |
| scipy.stats | New in this chapter | Used throughout the rest of the book |
| Q-Q plots | New in this chapter | Used in regression diagnostics (Ch. 26) |
| Binomial distribution | Builds on probability simulation (Ch. 20) | Used in classification (Ch. 27) |