Learning Objectives
- Define probability distribution and distinguish discrete from continuous distributions
- Describe and simulate the uniform, binomial, and Poisson distributions
- Explain the properties of the normal distribution and apply the empirical rule
- Use z-scores and scipy.stats to compute probabilities under the normal curve
- Explain the Central Limit Theorem through simulation and apply it to sampling distributions
In This Chapter
- Chapter Overview
- 21.1 What Is a Probability Distribution?
- 21.2 The Uniform Distribution — The Boring-but-Important One
- 21.3 The Binomial Distribution — Counting Successes
- 21.4 The Poisson Distribution — Counting Rare Events
- 21.5 The Normal Distribution — The Star of the Show
- 21.6 The Standard Normal and Z-Scores (Revisited)
- 21.7 The Central Limit Theorem — The Magic Trick of Statistics
- 21.8 Checking Normality: Is Your Data Actually Normal?
- 21.9 The Progressive Project: Vaccination Distributions
- 21.10 When Data Isn't Normal — And That's Okay
- 21.11 Chapter Summary
- Looking Ahead
- Connections to Previous and Future Chapters
Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
"The normal distribution is the most important distribution in statistics. It is also, unfortunately, the most overused." — An anonymous statistician who probably made this up but was right
Chapter Overview
In Chapter 19, you learned to describe the shape of data — symmetric, skewed, bimodal. In Chapter 20, you learned to think about probability — how likely different outcomes are. This chapter connects those two ideas. A probability distribution is a mathematical object that describes the shape of probability — it tells you how likely each possible outcome is, and it does so with a precise, mathematical form that you can work with, compute with, and reason about.
The most famous distribution of all is the normal distribution — the bell curve. You've seen it mentioned a hundred times. Now you'll understand why it's everywhere. The answer involves one of the most beautiful results in mathematics: the Central Limit Theorem. And the way we'll discover it is by running a simulation that will make you say "wait, THAT happened?"
But we'll start simple. We'll build up from distributions you can visualize on your fingers — coin flips, dice rolls, random numbers — before arriving at the bell curve. By the end, you'll know how to use scipy.stats to compute probabilities, how to check whether your data is actually normal (spoiler: it often isn't), and why the Central Limit Theorem means that normal distributions show up even when the underlying data isn't normal at all.
In this chapter, you will learn to:
- Define probability distribution and distinguish discrete from continuous distributions (all paths)
- Describe and simulate the uniform, binomial, and Poisson distributions (all paths)
- Explain the properties of the normal distribution and apply the empirical rule (all paths)
- Use z-scores and scipy.stats to compute probabilities under the normal curve (all paths)
- Explain the Central Limit Theorem through simulation and apply it to sampling distributions (standard + deep dive paths)
21.1 What Is a Probability Distribution?
A probability distribution is a mathematical description of the probabilities of all possible outcomes of a random variable. It answers the question: "If I repeat this random process many times, what pattern would the outcomes make?"
Think of it this way. In Chapter 19, you made histograms of data. The histogram showed you the shape — where values clustered, how spread out they were, whether there were tails. A probability distribution is like the idealized version of that histogram — the shape you'd approach if you had infinite data.
There are two flavors:
Discrete Distributions
A discrete distribution describes random variables that can only take specific, countable values. Examples: the number of heads in 10 coin flips (0, 1, 2, ..., 10), the number of customers who enter a store in an hour, the roll of a die.
For discrete distributions, we can list the probability of each possible outcome. This list is called the probability mass function (PMF).
import numpy as np
import matplotlib.pyplot as plt
# Example: roll a fair die — discrete uniform distribution
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6] * 6
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(outcomes, probabilities, color='steelblue', edgecolor='white', width=0.6)
ax.set_xlabel('Outcome')
ax.set_ylabel('Probability')
ax.set_title('Discrete Uniform Distribution: Fair Die')
ax.set_ylim(0, 0.25)
ax.set_xticks(outcomes)
plt.tight_layout()
plt.savefig('discrete_uniform.png', dpi=150, bbox_inches='tight')
plt.show()
Continuous Distributions
A continuous distribution describes random variables that can take any value in a range. Examples: height, temperature, time until the next bus, vaccination rate.
For continuous distributions, we can't assign a probability to each specific value (there are infinitely many), so instead we use a probability density function (PDF) that tells us the density of probability at each point. Probabilities come from areas under the curve.
The cumulative distribution function (CDF) tells you P(X <= x) — the probability that the random variable is less than or equal to x. It's the running total of probabilities from left to right.
from scipy import stats
# Standard normal distribution (the bell curve)
x = np.linspace(-4, 4, 1000)
pdf = stats.norm.pdf(x, loc=0, scale=1)
cdf = stats.norm.cdf(x, loc=0, scale=1)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# PDF
axes[0].plot(x, pdf, color='steelblue', linewidth=2)
axes[0].fill_between(x, pdf, alpha=0.2, color='steelblue')
axes[0].set_title('PDF: Probability Density Function\n"Where is the probability concentrated?"')
axes[0].set_ylabel('Density')
axes[0].set_xlabel('x')
# CDF
axes[1].plot(x, cdf, color='coral', linewidth=2)
axes[1].set_title('CDF: Cumulative Distribution Function\n"P(X ≤ x) — probability up to this point"')
axes[1].set_ylabel('Cumulative Probability')
axes[1].set_xlabel('x')
plt.tight_layout()
plt.savefig('pdf_vs_cdf.png', dpi=150, bbox_inches='tight')
plt.show()
A key point about the PDF: the height of the curve at any point is NOT a probability. Probabilities are areas under the curve. The probability that a normally distributed variable falls between 0 and 1 is the area under the curve from 0 to 1.
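As a quick illustration with the scipy.stats tools from the code above: the probability is a difference of CDF values, and the curve's height is clearly not a probability, since for a narrow distribution the density can exceed 1.

```python
from scipy import stats

# P(0 < X < 1) for the standard normal is an AREA: CDF(1) - CDF(0)
p_area = stats.norm.cdf(1) - stats.norm.cdf(0)
print(f"P(0 < X < 1) = {p_area:.4f}")  # ~0.3413

# The curve's HEIGHT is not a probability -- for a narrow normal
# (sigma = 0.1), the peak density is about 4, well above 1
peak = stats.norm.pdf(0, loc=0, scale=0.1)
print(f"Density at x=0 for N(0, 0.1): {peak:.2f}")
```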
21.2 The Uniform Distribution — The Boring-but-Important One
The uniform distribution is the simplest distribution. Every outcome is equally likely.
Discrete uniform: Rolling a fair die. Each of {1, 2, 3, 4, 5, 6} has probability 1/6.
Continuous uniform: Generating a random number between 0 and 1 using np.random.random(). Every value in [0, 1] is equally likely.
# Continuous uniform distribution
np.random.seed(42)
uniform_data = np.random.uniform(0, 1, 100000)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Histogram of samples
axes[0].hist(uniform_data, bins=50, color='steelblue', edgecolor='white',
alpha=0.8, density=True)
x = np.linspace(0, 1, 100)
axes[0].plot(x, stats.uniform.pdf(x), color='red', linewidth=2, label='Theoretical PDF')
axes[0].set_title('Uniform Distribution U(0, 1)')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].legend()
# CDF
axes[1].plot(x, stats.uniform.cdf(x), color='coral', linewidth=2)
axes[1].set_title('Uniform CDF')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('P(X ≤ x)')
plt.tight_layout()
plt.savefig('uniform_dist.png', dpi=150, bbox_inches='tight')
plt.show()
# Uniform distribution properties
print("Uniform(0, 1) properties:")
print(f" Mean: {stats.uniform.mean():.2f}")
print(f" Std: {stats.uniform.std():.4f}")
print(f" P(X < 0.5): {stats.uniform.cdf(0.5):.2f}")
print(f" P(0.3 < X < 0.7): {stats.uniform.cdf(0.7) - stats.uniform.cdf(0.3):.2f}")
The uniform distribution is boring in the best way — it's the distribution of "no information." When you don't know what shape your data should be, the uniform distribution says "everything is equally likely." It's the starting point for random number generation, which is why np.random.random() gives you uniform values.
Why the Uniform Distribution Matters
You might be thinking: "The uniform distribution is so simple — why spend time on it?" Three reasons:
1. It's the foundation of simulation. Every Monte Carlo simulation starts with uniform random numbers. When you call np.random.random(), you're drawing from a uniform distribution. Even generating random values from other distributions often starts by transforming uniform random numbers.
2. It represents maximum uncertainty. If you have no idea which values are more likely than others, the uniform distribution says "everything is equally possible." It's the starting point for Bayesian analysis when you have no prior information.
3. It's a useful contrast. Understanding the uniform distribution helps you appreciate what makes other distributions interesting. The normal distribution is interesting precisely because values aren't equally likely — some are much more common than others.
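That "transforming uniform random numbers" idea can be sketched with inverse transform sampling: push uniform draws through a target distribution's inverse CDF (ppf in scipy.stats) and the results follow that distribution. This is an illustration of the principle, not necessarily the algorithm NumPy uses internally.

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# Start with plain uniform(0, 1) draws...
u = np.random.uniform(0, 1, 100000)

# ...and push each through the inverse CDF (ppf) of a target
# distribution. The results are distributed like that target.
normal_samples = stats.norm.ppf(u)           # standard normal
expon_samples = stats.expon.ppf(u, scale=5)  # exponential, mean 5

print(f"Normal samples: mean={normal_samples.mean():.3f}, std={normal_samples.std():.3f}")
print(f"Exponential samples: mean={expon_samples.mean():.3f} (theory: 5)")
```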
# From uniform to normal: the connection
# If you take the average of 12 uniform(0,1) values, you get something
# approximately normal! (This is a quick-and-dirty way to generate
# approximately normal random numbers)
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Average of 1 uniform
avg1 = np.random.uniform(0, 1, 10000)
axes[0].hist(avg1, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
axes[0].set_title('Average of 1 Uniform\n(Still uniform)')
# Average of 3 uniforms
avg3 = np.mean(np.random.uniform(0, 1, (10000, 3)), axis=1)
axes[1].hist(avg3, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
axes[1].set_title('Average of 3 Uniforms\n(Starting to look bell-shaped!)')
# Average of 12 uniforms
avg12 = np.mean(np.random.uniform(0, 1, (10000, 12)), axis=1)
axes[2].hist(avg12, bins=40, color='steelblue', edgecolor='white', alpha=0.8, density=True)
x = np.linspace(0.2, 0.8, 100)
# SD of the mean of 12 uniforms: SD of one uniform / sqrt(12) = (1/sqrt(12))/sqrt(12) = 1/12
axes[2].plot(x, stats.norm.pdf(x, 0.5, 1/12), 'r-', linewidth=2)
axes[2].set_title('Average of 12 Uniforms\n(Practically normal!)')
plt.tight_layout()
plt.savefig('uniform_to_normal.png', dpi=150, bbox_inches='tight')
plt.show()
This is your first preview of the Central Limit Theorem — the star of Section 21.7. Even flat, uniform data becomes bell-shaped when you take averages. We'll explore why in detail later.
21.3 The Binomial Distribution — Counting Successes
The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has the same probability of success.
In plain English: if you flip a coin n times, the binomial distribution tells you the probability of getting exactly k heads.
Parameters:
- n: number of trials
- p: probability of success on each trial
# Binomial distribution: 20 coin flips, P(heads) = 0.5
n, p = 20, 0.5
k_values = np.arange(0, n + 1)
probabilities = stats.binom.pmf(k_values, n, p)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k_values, probabilities, color='steelblue', edgecolor='white', width=0.7)
ax.set_xlabel('Number of Heads')
ax.set_ylabel('Probability')
ax.set_title(f'Binomial Distribution: n={n}, p={p}\n"How many heads in {n} flips?"')
ax.set_xticks(k_values)
# Mark the expected value
expected = n * p
ax.axvline(expected, color='red', linestyle='--', label=f'Expected: {expected:.0f}')
ax.legend()
plt.tight_layout()
plt.savefig('binomial.png', dpi=150, bbox_inches='tight')
plt.show()
# Key calculations
print(f"P(exactly 10 heads in 20 flips): {stats.binom.pmf(10, 20, 0.5):.4f}")
print(f"P(8 or fewer heads): {stats.binom.cdf(8, 20, 0.5):.4f}")
print(f"P(more than 15 heads): {1 - stats.binom.cdf(15, 20, 0.5):.4f}")
print(f"Expected value (n*p): {n * p}")
print(f"Std dev (sqrt(n*p*(1-p))): {np.sqrt(n * p * (1-p)):.2f}")
Real-World Binomial Examples
The binomial distribution is everywhere:
- Quality control: Out of 100 manufactured items, how many are defective? (n=100, p=defect rate)
- Vaccination: Out of 50 vaccinated people, how many develop immunity? (n=50, p=vaccine efficacy)
- Surveys: Out of 1000 respondents, how many say "yes"? (n=1000, p=true proportion)
# Marcus's pastry problem: 50 pastries, 3% defect rate
n_pastries, p_defect = 50, 0.03
print("Marcus's bakery defect analysis:")
print(f" Expected defects per day: {n_pastries * p_defect:.1f}")
print(f" P(0 defects): {stats.binom.pmf(0, n_pastries, p_defect):.4f}")
print(f" P(1 defect): {stats.binom.pmf(1, n_pastries, p_defect):.4f}")
print(f" P(2 defects): {stats.binom.pmf(2, n_pastries, p_defect):.4f}")
print(f" P(3+ defects): {1 - stats.binom.cdf(2, n_pastries, p_defect):.4f}")
21.4 The Poisson Distribution — Counting Rare Events
The Poisson distribution describes the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate.
In plain English: "How many customers walk into the store in the next hour, if we average 5 per hour?"
Parameter:
- lambda (rate): the average number of events per interval
# Poisson distribution: average 5 customers per hour
lam = 5
k_values = np.arange(0, 16)
probabilities = stats.poisson.pmf(k_values, lam)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k_values, probabilities, color='mediumseagreen', edgecolor='white', width=0.7)
ax.set_xlabel('Number of Customers')
ax.set_ylabel('Probability')
ax.set_title(f'Poisson Distribution: λ={lam}\n"How many customers in the next hour?"')
ax.set_xticks(k_values)
ax.axvline(lam, color='red', linestyle='--', label=f'Expected: {lam}')
ax.legend()
plt.tight_layout()
plt.savefig('poisson.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"P(exactly 5 customers): {stats.poisson.pmf(5, lam):.4f}")
print(f"P(0 customers): {stats.poisson.pmf(0, lam):.4f}")
print(f"P(10+ customers): {1 - stats.poisson.cdf(9, lam):.4f}")
The Poisson distribution is useful for counting rare events: website clicks per minute, earthquakes per year, typos per page, disease cases per week. The key feature: the mean and variance are both equal to lambda.
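That mean-equals-variance property is easy to verify, both from scipy's theoretical moments and from simulated draws (the seed here is arbitrary):

```python
import numpy as np
from scipy import stats

lam = 5

# Theoretical moments straight from scipy
print(f"Theoretical mean: {stats.poisson.mean(lam):.2f}")     # 5.00
print(f"Theoretical variance: {stats.poisson.var(lam):.2f}")  # 5.00

# Simulated draws show the same pattern
np.random.seed(42)
draws = np.random.poisson(lam, 100000)
print(f"Simulated mean: {draws.mean():.2f}")
print(f"Simulated variance: {draws.var():.2f}")
```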
The Relationship Between Distributions
These distributions aren't isolated — they're connected in deep and useful ways:
1. Binomial becomes Poisson: When n is large and p is small, Binomial(n, p) is well-approximated by Poisson(n*p). This is why the Poisson is sometimes called the "law of rare events."
2. Binomial becomes Normal: When n is large, Binomial(n, p) is well-approximated by Normal(np, sqrt(np(1-p))). This is one of the earliest results in probability theory and a special case of the Central Limit Theorem.
3. Poisson becomes Normal: When lambda is large (say, > 20), Poisson(lambda) looks approximately normal.
These approximations aren't just mathematical curiosities — they're practically useful. Before computers, they allowed people to compute binomial and Poisson probabilities using normal distribution tables. Today, they help us understand why the normal distribution shows up so often.
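The binomial-to-Poisson connection is easy to check numerically. This sketch (with arbitrary example parameters n=1000, p=0.005, so n*p = 5) compares the two PMFs directly:

```python
import numpy as np
from scipy import stats

# Large n, small p: Binomial(1000, 0.005) vs Poisson(n*p = 5)
n, p = 1000, 0.005
lam = n * p
k = np.arange(0, 13)

binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

for ki in [3, 5, 7]:
    print(f"P(X = {ki}): binomial {binom_pmf[ki]:.4f}, Poisson {poisson_pmf[ki]:.4f}")
print(f"Largest pointwise difference: {np.abs(binom_pmf - poisson_pmf).max():.5f}")
```

The two columns agree to roughly three decimal places, which is why the Poisson is a practical stand-in for the binomial in rare-event settings.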
# Binomial → Normal approximation
n, p = 100, 0.3
k = np.arange(0, 60)
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(k, stats.binom.pmf(k, n, p), color='steelblue', alpha=0.6,
label='Binomial(100, 0.3)', width=0.8)
x = np.linspace(15, 45, 100)
ax.plot(x, stats.norm.pdf(x, n*p, np.sqrt(n*p*(1-p))),
color='red', linewidth=2, label=f'Normal({n*p}, {np.sqrt(n*p*(1-p)):.1f})')
ax.set_title('Binomial Approximated by Normal (n=100, p=0.3)')
ax.legend()
ax.set_xlabel('Number of Successes')
ax.set_ylabel('Probability')
plt.tight_layout()
plt.savefig('binomial_normal_approx.png', dpi=150, bbox_inches='tight')
plt.show()
21.5 The Normal Distribution — The Star of the Show
Okay. We've warmed up with the supporting cast. Now it's time for the headliner.
The normal distribution (also called the Gaussian distribution or the bell curve) is the most important distribution in all of statistics. It's symmetric, bell-shaped, and completely defined by just two parameters:
- mu: the mean (the center of the bell)
- sigma: the standard deviation (how wide the bell is)
# The normal distribution with different parameters
fig, ax = plt.subplots(figsize=(12, 6))
x = np.linspace(-15, 25, 1000)
distributions = [
(0, 1, 'N(0,1) — Standard Normal'),
(5, 1, 'N(5,1) — Shifted right'),
(0, 3, 'N(0,3) — Wider'),
(10, 2, 'N(10,2) — Shifted and wider'),
]
colors = ['steelblue', 'coral', 'mediumseagreen', 'mediumpurple']
for (mu, sigma, label), color in zip(distributions, colors):
ax.plot(x, stats.norm.pdf(x, mu, sigma), linewidth=2, label=label, color=color)
ax.set_title('Normal Distributions with Different Parameters')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()
ax.set_ylim(0, 0.45)
plt.tight_layout()
plt.savefig('normal_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
Properties of the Normal Distribution
- Symmetric around the mean. The left half is a mirror image of the right half.
- Bell-shaped. The curve peaks at the mean and tapers off equally on both sides.
- Mean = Median = Mode. They're all the same value for a perfectly normal distribution.
- Tails never touch zero. They get very close, but technically the curve extends to infinity in both directions.
- 68-95-99.7 rule (the empirical rule): these are the percentages of data within 1, 2, and 3 standard deviations of the mean.
The Empirical Rule (68-95-99.7)
This is one of the most useful rules in all of statistics. For normally distributed data:
- 68% of values fall within 1 standard deviation of the mean
- 95% fall within 2 standard deviations
- 99.7% fall within 3 standard deviations
# Demonstrate the empirical rule
np.random.seed(42)
data = np.random.normal(100, 15, 100000) # Mean=100, SD=15
within_1 = np.mean(np.abs(data - 100) <= 15) * 100
within_2 = np.mean(np.abs(data - 100) <= 30) * 100
within_3 = np.mean(np.abs(data - 100) <= 45) * 100
print("Empirical Rule Verification (Mean=100, SD=15):")
print(f" Within 1 SD (85-115): {within_1:.1f}% (theory: 68.3%)")
print(f" Within 2 SD (70-130): {within_2:.1f}% (theory: 95.4%)")
print(f" Within 3 SD (55-145): {within_3:.1f}% (theory: 99.7%)")
# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x = np.linspace(40, 160, 1000)
y = stats.norm.pdf(x, 100, 15)
ax.plot(x, y, color='black', linewidth=2)
# Shade the regions
colors = ['steelblue', 'cornflowerblue', 'lightsteelblue']
labels = ['68% (1 SD)', '95% (2 SD)', '99.7% (3 SD)']
ranges = [(85, 115), (70, 130), (55, 145)]
for (low, high), color, label in zip(reversed(ranges), reversed(colors), reversed(labels)):
mask = (x >= low) & (x <= high)
ax.fill_between(x[mask], y[mask], color=color, alpha=0.7, label=label)
ax.set_title('The Empirical Rule (68-95-99.7)')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.legend(loc='upper right')
# Add SD markers
for i, sd in enumerate([1, 2, 3]):
ax.annotate(f'{100 - sd*15}', xy=(100 - sd*15, -0.001), ha='center', fontsize=8)
ax.annotate(f'{100 + sd*15}', xy=(100 + sd*15, -0.001), ha='center', fontsize=8)
plt.tight_layout()
plt.savefig('empirical_rule.png', dpi=150, bbox_inches='tight')
plt.show()
Why this matters: If someone tells you "the average test score is 75 with a standard deviation of 10," you can immediately sketch the entire distribution in your head. About 68% of students scored between 65 and 85. About 95% scored between 55 and 95. A score of 45 (3 SDs below the mean) is extremely rare.
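Those head-sketch numbers can be checked directly with scipy.stats (using the same mean=75, SD=10 setup):

```python
from scipy import stats

scores = stats.norm(loc=75, scale=10)

# The empirical-rule ranges from the paragraph above
print(f"P(65 < score < 85): {scores.cdf(85) - scores.cdf(65):.3f}")  # ~0.683
print(f"P(55 < score < 95): {scores.cdf(95) - scores.cdf(55):.3f}")  # ~0.954
# A score of 45 sits 3 SDs below the mean -- genuinely rare
print(f"P(score < 45): {scores.cdf(45):.5f}")  # ~0.00135
```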
Why Is the Normal Distribution Everywhere?
You might wonder: why is the bell curve so common in nature? Why are heights normally distributed, and why do measurement errors follow a bell curve?
The answer involves the Central Limit Theorem (which we'll prove by simulation in Section 21.7), but there's an intuitive explanation worth sharing now.
Many quantities in nature are the result of many small, independent effects adding together. Your height is influenced by hundreds of genes, each contributing a tiny amount, plus environmental factors (nutrition, health, etc.). Measurement errors come from many small disturbances (vibration, temperature, humidity) adding up. Exam scores reflect many small factors (each question, each night of sleep, each distracting thought).
When you add up many small, independent random effects, the result tends toward a normal distribution — regardless of what shape the individual effects have. This is the Central Limit Theorem, and it's the reason the bell curve is the most important shape in statistics.
But here's the flip side: when data is NOT the result of many small additive effects, it's often NOT normal. Income, for example, is multiplicative (a 10% raise compounds on previous raises), which produces a log-normal distribution (skewed right). The number of followers on social media involves preferential attachment (popular accounts get more followers because they're popular), which produces a power-law distribution (extremely skewed). Understanding what process generates your data helps you predict what distribution it follows.
# Demonstration: Adding small random effects produces the bell curve
np.random.seed(42)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n_effects in zip(axes, [1, 2, 10, 50]):
# Each "effect" is a random value from [-1, 1]
combined = np.sum(np.random.uniform(-1, 1, size=(10000, n_effects)), axis=1)
ax.hist(combined, bins=40, density=True, color='steelblue',
edgecolor='white', alpha=0.8)
if n_effects > 1:
x = np.linspace(combined.min(), combined.max(), 100)
sigma = np.sqrt(n_effects / 3) # SD of sum of n uniforms
ax.plot(x, stats.norm.pdf(x, 0, sigma), 'r-', linewidth=2)
ax.set_title(f'Sum of {n_effects} random effect{"s" if n_effects > 1 else ""}')
plt.suptitle('Many Small Effects → Bell Curve', fontsize=14)
plt.tight_layout()
plt.savefig('why_normal.png', dpi=150, bbox_inches='tight')
plt.show()
With just 10 random effects, the bell curve is already recognizable. With 50, it's nearly perfect. This is why height (hundreds of genetic effects) and measurement error (many small disturbances) follow the normal distribution so closely.
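The multiplicative flip side mentioned earlier can be simulated the same way. In this sketch (parameters chosen arbitrarily), one quantity accumulates small additive shocks while another compounds small percentage changes; the compounded one comes out right-skewed, with its mean pulled above its median, in the log-normal direction:

```python
import numpy as np

np.random.seed(42)
n_people, n_steps = 100000, 50

# Additive process: start at 100, add a small random shock each step
additive = 100 + np.random.uniform(-1, 1, (n_people, n_steps)).sum(axis=1)

# Multiplicative process: start at 100, apply a small random
# percentage change each step (like compounding raises)
growth = 1 + np.random.uniform(-0.05, 0.05, (n_people, n_steps))
multiplicative = 100 * np.prod(growth, axis=1)

# Additive: symmetric, mean and median agree
print(f"Additive: mean={additive.mean():.1f}, median={np.median(additive):.1f}")
# Multiplicative: right-skewed, mean noticeably above the median
print(f"Multiplicative: mean={multiplicative.mean():.1f}, "
      f"median={np.median(multiplicative):.1f}")
```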
21.6 The Standard Normal and Z-Scores (Revisited)
The standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1. It's the "reference normal" that all other normal distributions can be converted to.
Any normal distribution can be standardized using z-scores, which we first met in Chapter 19:
$$z = \frac{x - \mu}{\sigma}$$
This converts any value x (from a normal distribution with mean mu and standard deviation sigma) into the equivalent position on the standard normal.
# Converting between raw scores and z-scores
mu, sigma = 75, 10 # Test scores: mean=75, SD=10
# Student scored 92
raw_score = 92
z = (raw_score - mu) / sigma
print(f"Raw score: {raw_score}")
print(f"Z-score: {z:.2f}")
print(f"This score is {z:.1f} standard deviations above the mean")
# What percentage of students scored below 92?
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}th")
print(f" {percentile:.1f}% of students scored below {raw_score}")
# What score is needed to be in the top 10%?
z_top10 = stats.norm.ppf(0.90) # ppf = percent point function (inverse CDF)
score_top10 = mu + z_top10 * sigma
print(f"\nTo be in the top 10%, you need at least {score_top10:.1f}")
print(f" (z-score of {z_top10:.2f})")
Using scipy.stats.norm
The scipy.stats module is your toolkit for working with distributions. Here are the most important methods for the normal distribution:
from scipy import stats
# Create a normal distribution object
test_dist = stats.norm(loc=75, scale=10) # mean=75, std=10
# PDF: probability density at a point
print(f"Density at x=75 (the peak): {test_dist.pdf(75):.4f}")
print(f"Density at x=85 (1 SD up): {test_dist.pdf(85):.4f}")
# CDF: probability of being less than or equal to x
print(f"\nP(score <= 60): {test_dist.cdf(60):.4f} ({test_dist.cdf(60)*100:.1f}%)")
print(f"P(score <= 75): {test_dist.cdf(75):.4f} ({test_dist.cdf(75)*100:.1f}%)")
print(f"P(score <= 90): {test_dist.cdf(90):.4f} ({test_dist.cdf(90)*100:.1f}%)")
# Probability between two values
p_between = test_dist.cdf(85) - test_dist.cdf(65)
print(f"\nP(65 < score < 85): {p_between:.4f} ({p_between*100:.1f}%)")
# PPF: inverse CDF (find the score at a given percentile)
print(f"\n50th percentile score: {test_dist.ppf(0.50):.1f}")
print(f"90th percentile score: {test_dist.ppf(0.90):.1f}")
print(f"99th percentile score: {test_dist.ppf(0.99):.1f}")
# Generate random samples
samples = test_dist.rvs(size=10)
print(f"\n10 random scores: {np.round(samples, 1)}")
# Summary statistics
print(f"\nDistribution properties:")
print(f" Mean: {test_dist.mean():.1f}")
print(f" Std: {test_dist.std():.1f}")
print(f" Var: {test_dist.var():.1f}")
The scipy.stats Pattern
Every distribution in scipy.stats follows the same pattern. Once you learn it for the normal, you can use any distribution:
# The pattern works for any distribution:
# stats.<distribution>(parameters)
# .pdf(x) — density at x
# .cdf(x) — P(X <= x)
# .ppf(q) — value at the q-th quantile (inverse CDF)
# .rvs(size) — random samples
# .mean() — expected value
# .std() — standard deviation
# Normal: stats.norm(loc=mean, scale=std)
# Binomial: stats.binom(n=trials, p=prob)
# Poisson: stats.poisson(mu=rate)
# Uniform: stats.uniform(loc=start, scale=width)
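To see that the pattern really does carry over, here are the same calls on a frozen binomial object (example parameters chosen arbitrarily). The one wrinkle: discrete distributions use .pmf where continuous ones use .pdf.

```python
from scipy import stats

# Frozen binomial object -- same method names as stats.norm
flips = stats.binom(n=20, p=0.5)

print(f"P(exactly 10 heads): {flips.pmf(10):.4f}")  # discrete: .pmf, not .pdf
print(f"P(8 or fewer heads): {flips.cdf(8):.4f}")
print(f"Median (50th percentile): {flips.ppf(0.5):.0f}")
print(f"Mean: {flips.mean():.1f}, Std: {flips.std():.2f}")
print(f"5 random draws: {flips.rvs(size=5, random_state=42)}")
```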
21.7 The Central Limit Theorem — The Magic Trick of Statistics
Here it is. The most important theorem in statistics. And we're going to discover it by simulation before I tell you what it says.
The Experiment
Let's start with a decidedly NON-normal distribution — a uniform distribution. Every value between 0 and 1 is equally likely. No bell shape anywhere.
Now: take a random sample of 30 values from this uniform distribution. Compute the sample mean. Write it down.
Do this 10,000 times.
Plot the 10,000 sample means.
np.random.seed(42)
# The population: UNIFORM distribution (not normal at all!)
fig, axes = plt.subplots(2, 3, figsize=(15, 9))
# Show the population
population = np.random.uniform(0, 1, 1000000)
axes[0, 0].hist(population, bins=50, color='coral', edgecolor='white', alpha=0.8, density=True)
axes[0, 0].set_title('Population: Uniform(0,1)\n(definitely NOT normal)')
axes[0, 0].set_xlabel('Value')
# Take samples of increasing size and plot the distribution of means
sample_sizes = [2, 5, 10, 30, 100]
for idx, n in enumerate(sample_sizes):
row = (idx + 1) // 3
col = (idx + 1) % 3
sample_means = [np.mean(np.random.uniform(0, 1, n)) for _ in range(10000)]
axes[row, col].hist(sample_means, bins=50, color='steelblue',
edgecolor='white', alpha=0.8, density=True)
# Overlay the theoretical normal
x_range = np.linspace(min(sample_means), max(sample_means), 100)
theoretical_std = (1/np.sqrt(12)) / np.sqrt(n) # SD of uniform / sqrt(n)
axes[row, col].plot(x_range, stats.norm.pdf(x_range, 0.5, theoretical_std),
color='red', linewidth=2, label='Normal approximation')
axes[row, col].set_title(f'Sample means (n={n})')
axes[row, col].legend(fontsize=7)
plt.suptitle('The Central Limit Theorem in Action\n'
'Population is UNIFORM, but sample means become NORMAL!', fontsize=14)
plt.tight_layout()
plt.savefig('clt_uniform.png', dpi=150, bbox_inches='tight')
plt.show()
Look at that. The population is flat — uniform — nothing even remotely bell-shaped. But the means of samples from that population form a beautiful bell curve. The larger the sample size, the more perfect the bell.
Try It with an Extremely Non-Normal Distribution
Still skeptical? Let's try with an exponential distribution — wildly right-skewed:
np.random.seed(42)
fig, axes = plt.subplots(2, 3, figsize=(15, 9))
# Exponential population (very right-skewed)
population = np.random.exponential(5, 1000000)
axes[0, 0].hist(population, bins=100, color='coral', edgecolor='white',
alpha=0.8, density=True, range=(0, 40))
axes[0, 0].set_title('Population: Exponential(5)\n(very right-skewed!)')
for idx, n in enumerate(sample_sizes):
row = (idx + 1) // 3
col = (idx + 1) % 3
sample_means = [np.mean(np.random.exponential(5, n)) for _ in range(10000)]
axes[row, col].hist(sample_means, bins=50, color='steelblue',
edgecolor='white', alpha=0.8, density=True)
x_range = np.linspace(min(sample_means), max(sample_means), 100)
theoretical_std = 5 / np.sqrt(n)
axes[row, col].plot(x_range, stats.norm.pdf(x_range, 5, theoretical_std),
color='red', linewidth=2, label='Normal approximation')
axes[row, col].set_title(f'Sample means (n={n})')
axes[row, col].legend(fontsize=7)
plt.suptitle('CLT with Exponential Population\n'
'Even SKEWED data produces NORMAL sample means!', fontsize=14)
plt.tight_layout()
plt.savefig('clt_exponential.png', dpi=150, bbox_inches='tight')
plt.show()
There it is again. The population is dramatically skewed, but sample means become increasingly normal as n grows.
The Central Limit Theorem: What It Says
Now that you've seen it, here's the formal statement:
Central Limit Theorem (CLT): If you take random samples of size n from ANY population with a finite mean (mu) and finite standard deviation (sigma), the distribution of sample means will approach a normal distribution as n increases — regardless of the shape of the original population. The mean of this sampling distribution equals mu, and the standard deviation equals sigma / sqrt(n).
Let's break that down:
- "Any population" — it doesn't matter if the population is uniform, skewed, bimodal, or any other shape. The magic works regardless.
- "As n increases" — with small samples (n=2, n=5), the bell shape is approximate. By n=30, it's usually quite good. This is why "n=30" is often cited as a rule of thumb for when the CLT kicks in.
- "Sigma / sqrt(n)" — the standard deviation of sample means (called the standard error) gets smaller as n grows. Larger samples produce more precise estimates.
# Demonstrating sigma / sqrt(n)
population_sd = 5 # Exponential(5) has SD = 5
sample_sizes = [5, 10, 25, 50, 100, 500]
print("Standard Error = sigma / sqrt(n)")
print("=" * 45)
for n in sample_sizes:
se_theoretical = population_sd / np.sqrt(n)
sample_means = [np.mean(np.random.exponential(5, n)) for _ in range(10000)]
se_simulated = np.std(sample_means)
print(f" n={n:4d}: theoretical SE = {se_theoretical:.3f}, "
f"simulated SE = {se_simulated:.3f}")
Why the CLT Matters
The Central Limit Theorem is the reason that:
1. Opinion polls work. The average of a sample approaches the population average, and we can quantify how much it might be off.
2. Confidence intervals make sense. We'll use the CLT in Chapter 22 to build intervals that say "we're 95% confident the true value is between A and B."
3. Hypothesis tests work. In Chapter 23, we'll use the CLT to decide whether a difference is "statistically significant."
4. The normal distribution is everywhere. Many quantities we measure are themselves averages or sums of many small effects — and sums of many things tend toward normal, thanks to the CLT.
Why the CLT Works — Intuition, Not Proof
You don't need to understand the mathematical proof of the CLT (it involves characteristic functions and complex analysis). But you can develop a strong intuition for why it works.
Think about what happens when you average together many values:
1. Extreme values cancel out. If you sample 50 values from an exponential distribution, some will be very large. But most will be moderate. When you average them, the few large values are diluted by the many moderate ones. The average is always less extreme than the most extreme individual value.
2. There are many more ways to get a "middle" average than an extreme one. If you're averaging 50 values from Uniform(0, 100), getting an average of exactly 50 can happen in billions of ways (countless combinations that add up right). Getting an average of exactly 95 requires almost all 50 values to be near 100 — far fewer ways. This combinatorial argument is why the bell shape emerges: moderate averages are exponentially more likely than extreme ones.
3. Each additional observation pulls the average toward the center. The more values you average, the harder it is for any single extreme value to dominate. This is the smoothing effect of averaging, and it's why larger n produces a tighter, more normal distribution.
Jordan's analogy: "Think of it like a tug-of-war with 50 people on each side. Even if a few people on one side are incredibly strong, the average strength of each team is going to be similar — the individual differences get averaged out. That's the CLT."
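The "many more ways to get a middle average" argument is easy to verify by brute force. The quick sketch below (illustrative numbers, not part of the chapter's project code) averages 50 draws from Uniform(0, 100) a hundred thousand times and counts how often the average lands near the middle versus far out:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 experiments: average 50 draws from Uniform(0, 100)
averages = rng.uniform(0, 100, size=(100_000, 50)).mean(axis=1)

# Middle averages vastly outnumber extreme ones
near_middle = np.mean((averages > 45) & (averages < 55))
above_70 = np.mean(averages > 70)

print(f"P(45 < mean < 55) ≈ {near_middle:.4f}")  # most experiments land here
print(f"P(mean > 70)      ≈ {above_70:.6f}")     # essentially never happens
```

Even though individual draws above 70 are common (30% of them!), an *average* above 70 would require almost all 50 draws to be large simultaneously.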
The Standard Error: A Crucial Quantity
The standard error (SE) is so important that it deserves its own moment in the spotlight:
$$SE = \frac{\sigma}{\sqrt{n}}$$
In English: "The standard error is the population standard deviation divided by the square root of the sample size."
This formula tells you how precise your sample mean is. A smaller SE means your sample mean is likely close to the true population mean.
Key implications:
- To cut the SE in half, you need 4x the data. (Because sqrt(4n) = 2*sqrt(n).)
- To cut the SE to a third, you need 9x the data. Diminishing returns!
- Very large samples have tiny SEs. With n=10,000, the SE is sigma/100. Your sample mean is very precise.
# Standard error at different sample sizes
sigma = 20 # Population standard deviation
print(f"Population SD: {sigma}")
print(f"{'n':>8} {'SE':>8} {'95% range of means':>25}")
print("=" * 45)
for n in [5, 10, 25, 50, 100, 500, 1000, 10000]:
    se = sigma / np.sqrt(n)
    print(f"{n:>8d} {se:>8.2f} {f'± {1.96*se:.2f}':>25}")
This table is worth studying. With a sample of just 25, your mean is likely within about 8 units of the truth. With 10,000, it's within 0.4 units. The precision improves, but each doubling of precision costs four times as much data.
21.8 Checking Normality: Is Your Data Actually Normal?
The empirical rule and z-score calculations assume your data is normally distributed. But real data often isn't. How do you check?
Method 1: The Histogram Test (Eyeball It)
Plot a histogram and overlay the theoretical normal curve. Does it look roughly bell-shaped?
np.random.seed(42)
# Some data that IS roughly normal
normal_data = np.random.normal(50, 10, 500)
# Some data that ISN'T
skewed_data = np.random.exponential(10, 500) + 20
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, data, title in [(axes[0], normal_data, 'Roughly Normal'),
                        (axes[1], skewed_data, 'Not Normal (Skewed)')]:
    ax.hist(data, bins=30, density=True, color='steelblue',
            edgecolor='white', alpha=0.8, label='Data')
    x = np.linspace(data.min(), data.max(), 100)
    ax.plot(x, stats.norm.pdf(x, data.mean(), data.std()),
            color='red', linewidth=2, label='Normal fit')
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.savefig('normality_check_hist.png', dpi=150, bbox_inches='tight')
plt.show()
Method 2: The Q-Q Plot (Quantile-Quantile Plot)
A Q-Q plot compares the quantiles of your data to the quantiles of a theoretical normal distribution. If the data is normal, the points fall on a straight diagonal line. Deviations from the line reveal specific types of non-normality.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Q-Q plot for normal data
stats.probplot(normal_data, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot: Normal Data\n(Points follow the line)')
# Q-Q plot for skewed data
stats.probplot(skewed_data, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Skewed Data\n(Points curve away from the line)')
plt.tight_layout()
plt.savefig('qq_plots.png', dpi=150, bbox_inches='tight')
plt.show()
Reading a Q-Q plot:
- Points on the line: Data is approximately normal.
- Points deviate at both ends (below the line on the left, above on the right): Data has heavier tails than normal (more extreme values).
- Points curve away at one end: Data is skewed.
- S-shaped curve: Data has lighter tails than normal.
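A Q-Q plot is just a picture of a quantile comparison, so you can also see the heavy-tail pattern numerically. This sketch uses t-distributed data (3 degrees of freedom) as a stand-in for heavy-tailed data and compares its standardized quantiles to the normal ones:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
heavy_tailed = stats.t.rvs(df=3, size=5000)   # t with 3 df: heavier tails than normal
standardized = (heavy_tailed - heavy_tailed.mean()) / heavy_tailed.std()

# Compare extreme sample quantiles with the theoretical normal quantiles
for q in [0.01, 0.05, 0.50, 0.95, 0.99]:
    sample_q = np.quantile(standardized, q)
    normal_q = stats.norm.ppf(q)
    print(f"q={q:.2f}: sample quantile {sample_q:6.2f} vs normal {normal_q:6.2f}")
```

In the middle the quantiles roughly match, but at q=0.01 and q=0.99 the sample quantiles are more extreme than the normal ones. Those are exactly the points that sag below and rise above the line on a Q-Q plot.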
Method 3: Statistical Tests
Formal tests for normality exist, but they have limitations:
# Shapiro-Wilk test (good for small to medium samples)
stat_normal, p_normal = stats.shapiro(normal_data[:100]) # Limit to 100 for Shapiro
stat_skewed, p_skewed = stats.shapiro(skewed_data[:100])
print("Shapiro-Wilk Test (H0: data is normal):")
print(f" Normal data: statistic={stat_normal:.4f}, p-value={p_normal:.4f}")
print(f" Skewed data: statistic={stat_skewed:.4f}, p-value={p_skewed:.4f}")
print()
if p_normal > 0.05:
    print(" Normal data: p > 0.05 → fail to reject normality (looks normal)")
if p_skewed < 0.05:
    print(" Skewed data: p < 0.05 → reject normality (not normal)")
Important caveat: With very large samples, normality tests will reject normality for even tiny deviations from a perfect bell curve. With very small samples, they lack the power to detect non-normality. The Q-Q plot is often more informative than a formal test because it shows you how the data deviates from normal.
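You can watch this sample-size sensitivity in action. The sketch below (illustrative; it uses scipy's D'Agostino-Pearson test, stats.normaltest, which handles large samples) runs the same test on increasingly large slices of data that is only slightly skewed:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# Slightly skewed data: close to normal, but not exactly (skew-normal with a=1)
slightly_off = stats.skewnorm.rvs(a=1, size=100_000)

# Same data, same tiny deviation — but the verdict flips as n grows
for n in [50, 500, 100_000]:
    _, p = stats.normaltest(slightly_off[:n])
    verdict = "fail to reject (looks normal)" if p > 0.05 else "reject normality"
    print(f"n={n:>7d}: p={p:.4g} → {verdict}")
```

The deviation from normality never changes; only the test's power to detect it does. That's why a significant p-value from a normality test tells you a deviation exists, not whether it matters.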
21.9 The Progressive Project: Vaccination Distributions
Time to apply everything to our Global Health Data Explorer. Elena wants to fit normal distributions to her vaccination rate data and check whether the normal assumption is appropriate.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulated vaccination data by income group
np.random.seed(21)
vaccination_df = pd.DataFrame({
    'income_group': np.repeat(
        ['Low income', 'Lower middle', 'Upper middle', 'High income'],
        [30, 55, 55, 50]
    ),
    'measles_vacc_rate': np.concatenate([
        np.clip(np.random.normal(52, 18, 30), 10, 99),
        np.clip(np.random.normal(68, 14, 55), 20, 99),
        np.clip(np.random.normal(82, 10, 55), 35, 99),
        np.clip(np.random.normal(93, 4, 50), 60, 99),
    ])
})
# For each income group: fit normal, create Q-Q plot, check assumptions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
groups = ['Low income', 'Lower middle', 'Upper middle', 'High income']
colors = ['#e74c3c', '#e67e22', '#2ecc71', '#3498db']
for idx, (group, color) in enumerate(zip(groups, colors)):
    data = vaccination_df[vaccination_df['income_group'] == group]['measles_vacc_rate']
    # Top row: histogram with normal fit
    ax_top = axes[0, idx]
    ax_top.hist(data, bins=12, density=True, color=color, edgecolor='white', alpha=0.7)
    x = np.linspace(data.min() - 5, data.max() + 5, 100)
    ax_top.plot(x, stats.norm.pdf(x, data.mean(), data.std()),
                color='black', linewidth=2)
    ax_top.set_title(f'{group}\nmean={data.mean():.1f}, sd={data.std():.1f}')
    ax_top.set_xlabel('Vaccination Rate (%)')
    # Bottom row: Q-Q plot
    ax_bot = axes[1, idx]
    stats.probplot(data, dist="norm", plot=ax_bot)
    ax_bot.set_title('Q-Q Plot')
    # Shapiro test
    if len(data) >= 8:
        _, p_val = stats.shapiro(data)
        normality = "Yes" if p_val > 0.05 else "No"
        ax_bot.text(0.05, 0.95, f'Shapiro p={p_val:.3f}\nNormal? {normality}',
                    transform=ax_bot.transAxes, fontsize=8, va='top',
                    bbox=dict(boxstyle='round', facecolor='lightyellow'))
plt.suptitle('Vaccination Rates by Income Group: Normality Assessment', fontsize=14)
plt.tight_layout()
plt.savefig('vaccination_normality.png', dpi=150, bbox_inches='tight')
plt.show()
# Compute probabilities using fitted normal distributions
print("\n=== Using Normal Distribution for Predictions ===\n")
for group in groups:
    data = vaccination_df[vaccination_df['income_group'] == group]['measles_vacc_rate']
    mu, sigma = data.mean(), data.std()
    dist = stats.norm(mu, sigma)
    print(f"{group} (N({mu:.1f}, {sigma:.1f})):")
    print(f" P(rate < 50%): {dist.cdf(50):.4f} ({dist.cdf(50)*100:.1f}%)")
    print(f" P(rate > 90%): {1-dist.cdf(90):.4f} ({(1-dist.cdf(90))*100:.1f}%)")
    print(f" P(70% < rate < 90%): {dist.cdf(90)-dist.cdf(70):.4f} "
          f"({(dist.cdf(90)-dist.cdf(70))*100:.1f}%)")
    print()
Elena notices that the high-income group is left-skewed (bumping against the ceiling of 99%) and the low-income group has a wide, somewhat flat distribution. Neither is perfectly normal. But the normal approximation is reasonable for the middle-income groups, and — critically — the CLT tells her that sample means from any of these groups will be approximately normal regardless.
Connecting the CLT to the Progressive Project
Let's make this concrete for Elena. She has vaccination rate data for 190 countries, grouped by income level. She wants to estimate the mean vaccination rate for each income group, but she only has data from some countries, not all.
The CLT tells her:
- Her sample mean is approximately normally distributed around the true population mean.
- The uncertainty in her estimate is captured by the standard error = SD / sqrt(n).
- She can quantify exactly how confident she is in her estimate (Chapter 22 will formalize this).
# Elena's CLT application
np.random.seed(21)
# "True" population of all countries in a group
true_population = np.clip(np.random.normal(72, 16, 500), 10, 99)
true_mean = np.mean(true_population)
true_sd = np.std(true_population)
print(f"True population mean: {true_mean:.1f}%")
print(f"True population SD: {true_sd:.1f}%")
# Elena only has a sample of 40 countries
sample = np.random.choice(true_population, 40, replace=False)
sample_mean = np.mean(sample)
se = true_sd / np.sqrt(40)
print(f"\nElena's sample of 40 countries:")
print(f" Sample mean: {sample_mean:.1f}%")
print(f" Standard error: {se:.1f}%")
print(f" 95% of sample means would fall between "
f"{true_mean - 1.96*se:.1f}% and {true_mean + 1.96*se:.1f}%")
print(f" Elena's sample mean of {sample_mean:.1f}% falls "
f"{'within' if abs(sample_mean - true_mean) < 1.96*se else 'outside'} this range")
This is exactly what polling organizations do. They survey 1,000 people, compute the sample proportion, and use the CLT to construct a margin of error. The math is identical — and now you understand the engine that makes it work.
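Here's that polling calculation as a back-of-the-envelope sketch (the 52% sample proportion is made up for illustration):

```python
import numpy as np

# Margin of error for a hypothetical poll of 1,000 people
n = 1000
p_hat = 0.52                       # sample proportion answering "yes"

# CLT: the sample proportion is approximately normal
# with standard error sqrt(p * (1 - p) / n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se                 # 95% margin of error

print(f"Sample proportion: {p_hat:.0%}")
print(f"Standard error:    {se:.4f}")
print(f"Margin of error:   ±{margin:.1%}")   # ≈ ±3.1%, the famous "plus or minus 3 points"
```

Note that n = 1,000 gives roughly a ±3 point margin no matter how large the population is. The population size barely matters; the sample size is what drives precision.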
21.10 When Data Isn't Normal — And That's Okay
Not all data is normal. Here are common situations and what to do:
Right-skewed data (income, response times, counts): Consider a log transformation, or use the median and IQR instead of mean and standard deviation.
Bounded data (percentages, proportions): Data that can't go below 0 or above 100 is often not normal. The beta distribution is designed for proportions.
Count data (number of events): Better modeled by Poisson or negative binomial distributions.
Bimodal data (mixed populations): No single distribution fits well. Consider mixture models or analyze each group separately.
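For the bounded-data case, here's a minimal sketch of fitting a beta distribution to proportion data. The data is simulated for illustration; fixing floc=0 and fscale=1 pins the fitted distribution's support to (0, 1):

```python
import numpy as np
from scipy import stats

np.random.seed(7)
# Hypothetical proportions in (0, 1) — e.g., vaccination rates as fractions
rates = np.random.beta(8, 2, size=300)       # bounded, left-skewed

# Fit a beta distribution, keeping the support fixed at (0, 1)
a, b, loc, scale = stats.beta.fit(rates, floc=0, fscale=1)
print(f"Fitted beta parameters: a={a:.2f}, b={b:.2f}")

# Probabilities from the fitted model, e.g. P(rate > 0.9)
fitted = stats.beta(a, b)
print(f"P(rate > 0.9) = {1 - fitted.cdf(0.9):.3f}")
```

Unlike a normal fit, the fitted beta distribution assigns zero probability outside (0, 1), so it can't predict impossible values like a 105% vaccination rate.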
# Log transformation for right-skewed data
np.random.seed(42)
income = np.random.lognormal(mean=10, sigma=0.8, size=1000)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Original (skewed)
axes[0].hist(income, bins=50, color='coral', edgecolor='white', alpha=0.8, density=True)
axes[0].set_title(f'Income Data (skewed)\nskewness={stats.skew(income):.2f}')
# Log-transformed (approximately normal)
log_income = np.log(income)
axes[1].hist(log_income, bins=50, color='steelblue', edgecolor='white', alpha=0.8, density=True)
x = np.linspace(log_income.min(), log_income.max(), 100)
axes[1].plot(x, stats.norm.pdf(x, log_income.mean(), log_income.std()),
color='red', linewidth=2)
axes[1].set_title(f'Log(Income) (approximately normal)\nskewness={stats.skew(log_income):.2f}')
plt.tight_layout()
plt.savefig('log_transform.png', dpi=150, bbox_inches='tight')
plt.show()
The key insight: you don't need your raw data to be normal. What you need is for your statistics (like sample means) to be approximately normal — and the Central Limit Theorem gives you that for free, as long as your sample size is reasonably large.
21.11 Chapter Summary
Let's bring it all together:
Probability distributions are mathematical descriptions of randomness — they tell you how probability is spread across possible outcomes.
| Distribution | Type | Use When | Key Parameters | Python |
|---|---|---|---|---|
| Uniform | Continuous (a discrete version exists) | All outcomes equally likely | low, high | stats.uniform (continuous), stats.randint (discrete) |
| Binomial | Discrete | Counting successes in n trials | n, p | stats.binom |
| Poisson | Discrete | Counting rare events in an interval | lambda | stats.poisson |
| Normal | Continuous | Data is symmetric, bell-shaped | mu, sigma | stats.norm |
The normal distribution is defined by its mean and standard deviation. The empirical rule (68-95-99.7) lets you quickly estimate probabilities. Z-scores let you standardize any normal distribution.
The Central Limit Theorem says that sample means are approximately normal, regardless of the population shape, when n is large enough. This is why the normal distribution shows up everywhere — and why it's the foundation of statistical inference.
Checking normality: Use histograms, Q-Q plots, and the Shapiro-Wilk test. But remember that perfect normality is rare, approximate normality is usually good enough, and the CLT often makes normality assumptions unnecessary.
scipy.stats pattern: Every distribution has .pdf() (or .pmf() for discrete distributions), .cdf(), .ppf(), and .rvs() methods. Learn the pattern once, use it for any distribution.
Looking Ahead
You now have the three pillars you need for statistical inference:
1. Descriptive statistics (Chapter 19) — summarizing what you observe
2. Probability (Chapter 20) — reasoning about uncertainty
3. Distributions (this chapter) — the mathematical shapes that connect data to probability
In Chapter 22, we'll put them all together to build confidence intervals — ranges that tell you how precise your estimates are. And in Chapter 23, we'll use them for hypothesis testing — deciding whether a pattern is real or just noise. The Central Limit Theorem you just learned is the engine that powers both.
Connections to Previous and Future Chapters
| Concept from This Chapter | Where It Came From | Where It's Going |
|---|---|---|
| Probability distributions | Connects Ch. 19 shapes to Ch. 20 probabilities | Foundation for all of inference (Ch. 22-23) |
| Normal distribution | Formalizes "bell curve" intuition | Used in confidence intervals (Ch. 22), hypothesis tests (Ch. 23), regression (Ch. 26) |
| Z-scores | First introduced in Ch. 19 | Central to hypothesis testing (Ch. 23) |
| Central Limit Theorem | Builds on law of large numbers (Ch. 20) | Justification for confidence intervals (Ch. 22) |
| scipy.stats | New in this chapter | Used throughout the rest of the book |
| Q-Q plots | New in this chapter | Used in regression diagnostics (Ch. 26) |
| Binomial distribution | Builds on probability simulation (Ch. 20) | Used in classification (Ch. 27) |