Learning Objectives
- Distinguish between discrete and continuous probability distributions
- Calculate probabilities using the binomial distribution
- Describe the properties of the normal distribution
- Use z-scores and standard normal tables (or Python/Excel) to find probabilities
- Assess normality using QQ-plots and the Shapiro-Wilk test
In This Chapter
- Chapter Overview
- 10.1 A Puzzle Before We Start (Productive Struggle)
- 10.2 Random Variables and Probability Distributions
- 10.3 The Binomial Distribution: Counting Successes
- 10.4 From Discrete to Continuous: A Conceptual Leap
- 10.5 The Normal Distribution: The Bell Curve
- 10.6 The Standard Normal Distribution and Z-Scores
- 10.7 Finding Normal Probabilities: Three Worked Examples
- 10.8 The Continuity Correction (When Discrete Meets Continuous)
- 10.9 Assessing Normality: Is the Model Appropriate?
- 10.10 When Normality Matters and When It Doesn't
- 10.11 Putting It All Together: A Complete Worked Example
- 10.12 Data Detective Portfolio: Assessing Normality
- 10.13 Drift Check: Voice Consistency Audit
- 10.14 Chapter Summary
- Key Formulas at a Glance
Chapter 10: Probability Distributions and the Normal Curve
"All models are wrong, but some are useful." — George E. P. Box, Journal of the American Statistical Association (1976)
Chapter Overview
Here's something that might blow your mind: if you measure the heights of 10,000 randomly selected adults, and I know nothing else about the data except the mean and standard deviation, I can tell you — with surprising accuracy — how many people are between 5'4" and 5'8", how many are over 6'2", and how many are shorter than 5'0".
How? Because heights follow a pattern. A very specific pattern. A pattern so reliable that mathematicians worked out its exact shape more than 200 years ago.
That pattern is called the normal distribution, and it's arguably the single most important idea in all of statistics. It's the reason the Empirical Rule from Chapter 6 works. It's the engine behind the z-scores you already know how to calculate. And it's the mathematical foundation for nearly every inference technique you'll learn from here through the end of this course.
But here's the thing I need to be upfront about: no real dataset is perfectly normal. Not one. Ever. The normal distribution is a model — an idealized mathematical curve that approximates reality well enough to be incredibly useful. The quote at the top of this chapter, from the British statistician George Box, captures this perfectly. The normal distribution is "wrong" in the sense that real data never matches it exactly. But it's "useful" in the sense that it gets us close enough to make powerful predictions, smart decisions, and valid inferences.
That tension — between the idealized model and messy reality — is the threshold concept of this chapter. Understanding it will make you a better statistician, a better data analyst, and a better critical thinker.
In Chapter 8, you learned the basic rules of probability. In Chapter 9, you learned how new information changes probabilities through conditional probability and Bayes' theorem. But all those examples involved discrete events — things that either happened or didn't. A coin lands heads or tails. A patient has a disease or doesn't. An email is spam or it isn't.
Now we're making a leap. What happens when the variable you care about can take any value on a continuous scale? Heights, weights, blood pressure readings, test scores, wait times — these don't come in neat categories. A person isn't "tall" or "short." They're 5'7.324" (if you measure precisely enough). How do you assign probabilities to continuous variables?
That's what probability distributions are for. And the normal distribution is the most important one you'll ever meet.
In this chapter, you will learn to:

- Distinguish between discrete and continuous probability distributions
- Calculate probabilities using the binomial distribution
- Describe the properties of the normal distribution
- Use z-scores and standard normal tables (or Python/Excel) to find probabilities
- Assess normality using QQ-plots and the Shapiro-Wilk test
Fast Track: If you're comfortable with discrete vs. continuous random variables, understand the binomial distribution, and can use z-scores, skim Sections 10.1-10.5 and jump to Section 10.9 (Assessing Normality). Complete quiz questions 1, 10, and 17 to verify your understanding.
Deep Dive: After this chapter, read Case Study 1 (the normal distribution in standardized testing) for a detailed look at how SAT/ACT scores are designed to be normal, then Case Study 2 (when normality fails) for the important story of why income, wealth, and many real-world phenomena refuse to follow the bell curve — and why that matters.
10.1 A Puzzle Before We Start (Productive Struggle)
Before I teach you anything, try this.
The Height Puzzle
The average height of adult women in the U.S. is about 64 inches (5'4"), with a standard deviation of about 2.8 inches. The distribution is approximately bell-shaped.
Without using any tables, formulas, or computers, estimate:
(a) What percentage of women are taller than 69.6 inches (5'9.6")?
(b) What percentage of women are between 61.2 and 66.8 inches?
(c) How tall would a woman need to be to be in the tallest 0.15% of all women?
Hint: You already have a tool for this — the Empirical Rule from Chapter 6.
Take a moment and work through it. If you remember the 68-95-99.7 pattern, you can actually answer all three. And if you can answer all three, congratulations — you already understand the intuition behind the normal distribution. This chapter is about making that intuition precise, powerful, and general.
Here's the key to the puzzle: 69.6 inches is exactly 2 standard deviations above the mean (64 + 2 × 2.8 = 69.6). So the Empirical Rule tells us about 2.5% of women are taller — that's the tail above 2 standard deviations. Part (b) involves the range from 1 SD below to 1 SD above the mean, so about 68%. And part (c) asks for 3 standard deviations above: 64 + 3 × 2.8 = 72.4 inches, or about 6'0.4".
The Empirical Rule gave us approximate answers. But what if someone asks: "What percentage of women are between 65 and 68 inches?" That's not a neat 1-or-2-standard-deviation boundary. The Empirical Rule can't help. For that, you need the full machinery of the normal distribution — and that's what this chapter delivers.
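As a preview of that machinery, here is the puzzle checked in Python with scipy's normal-distribution functions (which we'll meet properly later in this chapter). This is a sketch using the puzzle's approximate figures, mean 64 inches and SD 2.8 inches:

```python
from scipy import stats

mu, sigma = 64, 2.8  # approximate mean and SD of adult women's heights (inches)

# (a) P(X > 69.6): the area above 2 SDs
p_taller = 1 - stats.norm.cdf(69.6, loc=mu, scale=sigma)

# (b) P(61.2 < X < 66.8): the area within 1 SD of the mean
p_middle = stats.norm.cdf(66.8, loc=mu, scale=sigma) - stats.norm.cdf(61.2, loc=mu, scale=sigma)

# (c) the height cutting off the top 0.15% (the 99.85th percentile)
cutoff = stats.norm.ppf(0.9985, loc=mu, scale=sigma)

print(f"(a) {p_taller:.4f}  (b) {p_middle:.4f}  (c) {cutoff:.1f} in")
```

The exact normal answers (about 2.3%, about 68.3%, and about 72.3 inches) land close to the Empirical Rule's rounded 2.5%, 68%, and 72.4 inches, which is exactly the point: the rule is a rounded summary of the true curve.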
10.2 Random Variables and Probability Distributions
What Is a Random Variable?
Let's start with a definition that sounds more intimidating than it actually is.
Key Concept: Random Variable
A random variable is a numerical outcome of a random process. We typically denote random variables with capital letters like $X$, $Y$, or $Z$.
- The specific value a random variable takes is denoted with a lowercase letter: $x$.
- Example: If $X$ = "the number of heads in 10 coin flips," then $X$ is the random variable, and $x = 7$ is one particular outcome.
The word "variable" should be familiar from Chapter 2. The word "random" means the outcome isn't predetermined — there's uncertainty involved. Put them together, and a random variable is just a way of attaching numbers to the outcomes of uncertain processes.
You've been working with random variables this whole time without calling them that:

- The number of patients who test positive in a batch of 50 screenings (Maya's world)
- The watch time of a randomly selected StreamVibe user (Alex's world)
- The risk score assigned to a randomly selected defendant by a predictive algorithm (Professor Washington's world)
- The number of three-pointers Daria makes in her next 20 attempts (Sam's world)
Random variables come in two flavors, and this distinction matters a lot.
Discrete vs. Continuous Random Variables
Key Concept: Discrete vs. Continuous
A discrete random variable takes on a countable number of values (you can list them, even if the list is long). Think: counts.
A continuous random variable can take any value within an interval (there are infinitely many possibilities between any two values). Think: measurements.
This should feel familiar — it's the discrete vs. continuous distinction from Chapter 2, applied to random processes. A few examples to solidify the idea:
| Random Variable | Discrete or Continuous? | Why? |
|---|---|---|
| Number of defective items in a batch of 100 | Discrete | Counts: 0, 1, 2, ..., 100 |
| Weight of a randomly selected newborn | Continuous | Could be 7.23 lbs, 7.231 lbs, 7.2314 lbs... |
| Number of customers arriving per hour | Discrete | Counts: 0, 1, 2, 3, ... |
| Time between customer arrivals | Continuous | Could be any positive number |
| SAT score | Discrete (technically) | Possible scores: 400, 410, 420, ..., 1600 |
| Body temperature | Continuous | Could be 98.62°F, 98.621°F, ... |
Notice that SAT scores are technically discrete — they come in increments of 10. But with 121 possible values, they're close enough to continuous that we often treat them as such. This kind of practical judgment — knowing when a discrete variable is "continuous enough" — is part of becoming a good statistician.
What Is a Probability Distribution?
Now the payoff. A probability distribution assigns probabilities to every possible value (or range of values) of a random variable.
Key Concept: Probability Distribution
A probability distribution is a mathematical description of all possible values a random variable can take and how likely each value (or range of values) is.
For discrete random variables: we list each value and its probability. The probabilities must sum to 1.
For continuous random variables: we use a curve (called a density curve or probability density function) where the area under the curve represents probability. The total area under the curve must equal 1.
Here's a simple discrete example. Suppose you roll a fair die. The random variable $X$ = the number showing. The probability distribution is:
| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $P(X = x)$ | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
Every value has the same probability. The probabilities sum to 1. Done.
But what about a random variable like "the height of a randomly selected woman"? You can't list every possible height and its probability — there are infinitely many. Instead, you use a smooth curve, and probability is represented by area under the curve. We'll get there. First, let's build up through the most important discrete distribution.
Expected Value: The Long-Run Average
Before we move to specific distributions, there's one more concept we need.
Key Concept: Expected Value
The expected value of a random variable (written $E(X)$ or $\mu$) is the long-run average outcome if you repeated the random process infinitely many times. It's the "center" of the probability distribution.
For a discrete random variable:
$$E(X) = \mu = \sum x \cdot P(X = x)$$
In words: multiply each possible value by its probability, then add them all up.
This connects directly to the long-run frequency interpretation of probability from Chapter 8. If you roll a fair die millions of times, the average of all those rolls approaches:
$$E(X) = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = \frac{21}{6} = 3.5$$
You can never actually roll a 3.5. But over millions of rolls, the average converges to 3.5. That's what "expected value" means — not a value you expect on any single trial, but the theoretical long-run center.
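The die calculation above takes one line of Python. Using exact fractions makes the arithmetic transparent:

```python
from fractions import Fraction

# E(X) = sum of x * P(X = x) for a fair six-sided die
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)  # each face is equally likely

expected_value = sum(x * prob for x in values)
print(expected_value)         # 7/2
print(float(expected_value))  # 3.5
```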
Spaced Review (SR.1) — Empirical Rule (Ch.6): In Chapter 6, you learned the Empirical Rule: for bell-shaped data, about 68% falls within 1 SD, 95% within 2 SDs, and 99.7% within 3 SDs. You applied it to Maya's body temperature data and Alex's engagement metrics. You used it informally, trusting the pattern. In this chapter, we'll see why the Empirical Rule works — it's a direct consequence of the normal distribution's mathematical formula. The Empirical Rule wasn't magic; it was the normal distribution all along.
10.3 The Binomial Distribution: Counting Successes
When Does the Binomial Apply?
Here's a scenario Sam Okafor thinks about constantly.
Daria Williams shoots three-pointers at a long-run rate of about 38% (Sam's tracked this meticulously over 500 shots). If she takes 10 three-point attempts in tonight's game, what's the probability she makes exactly 4? What about 5 or more?
This is a classic binomial situation. The binomial distribution models the number of "successes" in a fixed number of independent trials, where each trial has the same probability of success.
Key Concept: Binomial Distribution
A random variable $X$ follows a binomial distribution with parameters $n$ (number of trials) and $p$ (probability of success on each trial) — written $X \sim \text{Binomial}(n, p)$ — if:
- There are a fixed number of trials ($n$)
- Each trial has exactly two outcomes (success or failure)
- The probability of success ($p$) is the same on every trial
- The trials are independent (one outcome doesn't affect the others)
These four conditions are sometimes remembered as the acronym BINS: Binary outcomes, Independent trials, Number of trials is fixed, Same probability on each trial.
Let's check Daria's three-pointers against the BINS conditions:
- Binary? Yes — each shot either goes in (success) or doesn't (failure). ✓
- Independent? Mostly yes — whether she makes shot #3 doesn't significantly affect shot #4. (There are debates about "hot hand" effects, but the statistical evidence is surprisingly thin.) ✓
- Number fixed? Yes — we're considering exactly $n = 10$ attempts. ✓
- Same probability? Approximately yes — her long-run rate is about $p = 0.38$ per attempt. ✓
Spaced Review (SR.2) — Multiplication Rule (Ch.8): Remember the multiplication rule for independent events from Chapter 8? $P(A \text{ and } B) = P(A) \times P(B)$. The binomial distribution uses this rule repeatedly. The probability of a specific sequence of successes and failures (like make-miss-make-miss-miss-make-...) is found by multiplying the individual probabilities together. That's why independence (condition 2 of BINS) is so important — without it, we can't multiply.
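Here is that multiplication rule in action: under independence, any one specific sequence of 4 makes and 6 misses has probability $p^4(1-p)^6$, no matter how the makes are ordered. A quick sketch using Daria's $p = 0.38$ (the sequence lists are made up for illustration):

```python
import math

p, q = 0.38, 0.62  # P(make), P(miss) on each independent attempt

# Two different orderings of 4 makes and 6 misses (1 = make, 0 = miss)
seq_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
seq_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

def sequence_prob(seq):
    """Multiply per-trial probabilities (valid only because trials are independent)."""
    return math.prod(p if shot == 1 else q for shot in seq)

print(sequence_prob(seq_a))  # both equal p**4 * q**6, about 0.00118
print(sequence_prob(seq_b))
```

Every ordering shares the same probability; counting the orderings is the other half of the binomial formula, which comes next.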
The Binomial Formula
The probability that $X$ equals exactly $k$ successes in $n$ trials is:
Mathematical Formulation: Binomial Probability
$$\boxed{P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}}$$
where:

- $n$ = number of trials
- $k$ = number of successes (0, 1, 2, ..., $n$)
- $p$ = probability of success on each trial
- $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ = the number of ways to arrange $k$ successes among $n$ trials
In plain English: "the number of arrangements" × "the probability of each arrangement."
I know that formula looks scary. Let's break it down with Daria's example.
What's the probability Daria makes exactly 4 out of 10 three-pointers?
Here, $n = 10$, $k = 4$, $p = 0.38$.
Step 1: The arrangement count. How many different ways could she make exactly 4 out of 10? She could make shots 1, 2, 3, 4 and miss the rest. Or make shots 1, 3, 7, 9. Or any other combination of 4 out of 10.
$$\binom{10}{4} = \frac{10!}{4! \cdot 6!} = \frac{10 \times 9 \times 8 \times 7}{4 \times 3 \times 2 \times 1} = 210$$
There are 210 different ways to choose which 4 shots she makes.
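You can confirm the arrangement count with Python's standard library; math.comb computes $\binom{n}{k}$ directly:

```python
import math

# Number of ways to choose which 4 of 10 shots are makes
print(math.comb(10, 4))  # 210

# Same value from the factorial definition: 10! / (4! * 6!)
print(math.factorial(10) // (math.factorial(4) * math.factorial(6)))  # 210
```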
Step 2: The probability of each arrangement. Any specific sequence of 4 makes and 6 misses has the same probability:
$$p^4 \cdot (1-p)^6 = (0.38)^4 \cdot (0.62)^6 = 0.02085 \times 0.05680 \approx 0.001184$$
Step 3: Multiply.
$$P(X = 4) = 210 \times 0.001184 \approx 0.2487$$
There's about a 24.9% chance Daria makes exactly 4 of her 10 three-point attempts.
Binomial Mean and Standard Deviation
The expected value and standard deviation of a binomial distribution have beautifully simple formulas:
$$E(X) = \mu = np \qquad \qquad \sigma = \sqrt{np(1-p)}$$
For Daria:
$$\mu = 10 \times 0.38 = 3.8 \qquad \sigma = \sqrt{10 \times 0.38 \times 0.62} = \sqrt{2.356} \approx 1.535$$
On average, she'd make 3.8 out of 10. The standard deviation of 1.535 tells you that a typical game's three-point count would be within about 1.5 makes of that average.
Python Implementation
Let's do this calculation in Python, because for anything beyond small examples, you don't want to compute factorials by hand.
from scipy import stats
import numpy as np
# Define the binomial distribution
n = 10 # number of attempts
p = 0.38 # probability of making each shot
# P(X = 4): probability of exactly 4 makes
prob_exactly_4 = stats.binom.pmf(4, n, p)
print(f"P(X = 4) = {prob_exactly_4:.4f}") # 0.2487
# P(X >= 5): probability of 5 or more makes
prob_5_or_more = 1 - stats.binom.cdf(4, n, p)
print(f"P(X >= 5) = {prob_5_or_more:.4f}") # 0.3177
# Mean and standard deviation
mean = stats.binom.mean(n, p)
std = stats.binom.std(n, p)
print(f"Mean: {mean:.2f}, SD: {std:.3f}") # Mean: 3.80, SD: 1.535
# Full probability distribution
print("\nComplete distribution:")
print(f"{'k':>3} {'P(X=k)':>10} {'Cumulative':>12}")
print("-" * 28)
for k in range(n + 1):
    pmf_val = stats.binom.pmf(k, n, p)
    cdf_val = stats.binom.cdf(k, n, p)
    print(f"{k:3d} {pmf_val:10.4f} {cdf_val:12.4f}")
Output:
P(X = 4) = 0.2487
P(X >= 5) = 0.3177
Mean: 3.80, SD: 1.535

Complete distribution:
  k     P(X=k)   Cumulative
----------------------------
  0     0.0084       0.0084
  1     0.0514       0.0598
  2     0.1419       0.2017
  3     0.2319       0.4336
  4     0.2487       0.6823
  5     0.1829       0.8652
  6     0.0934       0.9587
  7     0.0327       0.9914
  8     0.0075       0.9989
  9     0.0010       0.9999
 10     0.0001       1.0000
Notice two important Python functions:
- stats.binom.pmf(k, n, p) — probability mass function — gives the probability of exactly $k$ successes
- stats.binom.cdf(k, n, p) — cumulative distribution function — gives the probability of $k$ or fewer successes
Visual description (the binomial distribution bar chart):
Imagine a bar chart with the x-axis showing the number of made three-pointers (0 through 10) and the y-axis showing the probability of each outcome. The tallest bar is at $k = 4$ (probability 0.249), with $k = 3$ nearly as tall (0.232). The distribution is slightly right-skewed — the left tail drops off a bit more steeply than the right tail. The bars for 0 and 10 are barely visible (probabilities less than 1%). The distribution is centered near 3.8, consistent with the expected value $np = 3.8$.
When the Binomial Doesn't Apply
Not everything is binomial. Watch out for these violations:
- Trials not independent: Drawing cards from a deck without replacement. Each draw changes the probabilities for the next draw. (This is why we have the hypergeometric distribution — but that's beyond our scope.)
- More than two outcomes: A customer can buy Product A, Product B, Product C, or nothing. That's not binary.
- Probability changes: If Daria gets tired and her shooting percentage drops over the course of the game, the "same probability" condition fails.
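The card-drawing case can be made concrete. scipy's hypergeom models draws without replacement; comparing it against the binomial you'd get by (wrongly) treating each draw as an independent trial shows how much the violation matters. A sketch (the two-aces scenario is my illustration, not from the text):

```python
from scipy import stats

# P(exactly 2 aces in a 5-card hand), drawing WITHOUT replacement:
# hypergeom parameters: M = 52 cards total, n = 4 aces, N = 5 draws
p_hyper = stats.hypergeom.pmf(2, 52, 4, 5)

# What the binomial would predict if each draw were an independent
# trial with p = 4/52 (it isn't -- each draw changes the deck)
p_binom = stats.binom.pmf(2, 5, 4/52)

print(f"hypergeometric: {p_hyper:.4f}")  # 0.0399
print(f"binomial:       {p_binom:.4f}")  # 0.0465
```

The two answers differ noticeably even for a single 5-card hand; the binomial is only safe when the BINS conditions genuinely hold.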
10.4 From Discrete to Continuous: A Conceptual Leap
Here's where things get interesting — and maybe a little weird.
With discrete distributions like the binomial, we can ask: "What's the probability that $X$ equals exactly 4?" And we get a specific number (0.2487 for Daria's three-pointers).
But with continuous variables, something strange happens. The probability that a continuous random variable equals any single exact value is zero.
Wait, what? Let me explain.
Suppose you're measuring the height of a randomly selected woman. What's the probability she's exactly 65.0000000... inches tall? Not 65.001, not 64.999 — but exactly 65.000000 to infinite decimal places?
Zero. Or so close to zero that it doesn't matter.
Why? Because there are infinitely many possible heights. If you divided probability equally among infinitely many values, each one gets an infinitely small share.
This isn't a problem — it just means we need to think differently. Instead of asking about exact values, we ask about ranges:
- What's the probability that her height is between 63 and 67 inches?
- What's the probability she's taller than 70 inches?
- What's the probability she's shorter than 60 inches?
These questions all have meaningful, nonzero answers. And the tool for answering them is the probability density function.
Key Concept: Probability Density Function (PDF)
A probability density function (PDF) is a curve that describes the relative likelihood of different values for a continuous random variable.
- The curve is always non-negative (at or above the x-axis)
- The total area under the curve equals 1
- Probability = area under the curve between two values
- The height of the curve at any point is not a probability — it's a density
That last point is crucial and trips up a lot of students. The height of the curve at $x = 65$ inches isn't the probability of being 65 inches tall (that's zero, remember). It's a relative measure — it tells you this value is more (or less) likely than other values. Only the area gives you probability.
Think of it like a topographic map. The height of the terrain at a single point doesn't tell you much. But if you draw a boundary and calculate the area within it, you learn something useful — like how much land is in a particular elevation range.
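"Probability = area" can be checked directly in code: numerically integrating the density curve between two values gives the same answer as the cumulative-probability functions we'll use later. A sketch using the women's-height figures from Section 10.1 (mean 64 inches, SD 2.8 inches):

```python
from scipy import stats
from scipy.integrate import quad

mu, sigma = 64, 2.8  # women's heights, from the Section 10.1 puzzle

# Area under the density curve between 63 and 67 inches, by numerical integration
area, _ = quad(stats.norm.pdf, 63, 67, args=(mu, sigma))

# The same probability via cumulative probabilities
prob = stats.norm.cdf(67, mu, sigma) - stats.norm.cdf(63, mu, sigma)

print(f"integrated area: {area:.4f}")
print(f"cdf difference:  {prob:.4f}")

# The height of the curve at a single point is a density, NOT a probability
print(f"density at 65:   {stats.norm.pdf(65, mu, sigma):.4f}")
```

Both routes give the same number (about 0.50 here): area under the curve is the probability, while the curve's height at any one point is only a density.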
Spaced Review (SR.3) — Distribution Shapes (Ch.5): In Chapter 5, you learned to describe distribution shapes: symmetric, skewed left, skewed right, unimodal, bimodal. You looked at histograms and identified patterns. Now we're taking those visual shapes and putting mathematical equations behind them. The histogram was the rough sketch. The probability density function is the precise blueprint. When Maya's flu data had a bimodal shape in Chapter 5, that told us no single bell-curve model would fit. Shape matters — it determines which model works.
10.5 The Normal Distribution: The Bell Curve
Why It's Everywhere
Before I give you the formula, I want you to appreciate just how weird it is that one mathematical curve shows up so often in the real world.
Heights of adults? Approximately normal. Measurement errors in a physics lab? Normal. Test scores on well-designed exams? Normal. Blood pressure readings? Approximately normal. The weight of apples from an orchard? Normal. Shoe sizes? Normal. IQ scores? Designed to be normal. The daily returns of stock prices? Approximately normal (with some important exceptions we'll get to).
Why? There's actually a deep mathematical reason, which you'll see in Chapter 11 (the Central Limit Theorem). The short version: whenever a measurement is the result of many small, independent, additive factors, the result tends toward a normal distribution. Your height is influenced by hundreds of genes, your nutrition, your health history, random developmental variation — all adding up. Each factor contributes a small amount, and their sum follows the bell curve.
But let's not get ahead of ourselves. For now, let's meet the curve.
Properties of the Normal Distribution
Key Concept: The Normal Distribution
A continuous random variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$ — written $X \sim N(\mu, \sigma^2)$ — if its probability density function has these properties:
- Bell-shaped: The curve rises to a peak, then falls symmetrically on both sides
- Symmetric about the mean: The left half is a mirror image of the right half
- Mean = median = mode: All three measures of center coincide at the peak
- Determined by two parameters: Once you know $\mu$ and $\sigma$, you know the entire shape
- The 68-95-99.7 rule holds exactly: This is the Empirical Rule from Chapter 6, which is actually a property of the normal distribution
- Tails extend to infinity: The curve never quite touches the x-axis, though the tails become vanishingly small beyond 3-4 standard deviations
- Total area = 1: The entire area under the curve sums to 1, as required for any probability distribution
Visual description (the normal curve):
Imagine a smooth, symmetric bell-shaped curve. The peak sits at the center, directly above the mean $\mu$. The curve drops away equally on both sides, falling steeply at first and then more gradually. At 1 standard deviation ($\sigma$) from the mean in either direction, the curve has its steepest descent — these are called the inflection points, where the curve changes from curving inward to curving outward. Beyond 3 standard deviations, the curve is very close to the x-axis but never actually touches it. The area under the curve between $\mu - \sigma$ and $\mu + \sigma$ represents 68.27% of the total area. The area between $\mu - 2\sigma$ and $\mu + 2\sigma$ covers 95.45%. And $\mu - 3\sigma$ to $\mu + 3\sigma$ covers 99.73%.
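The 68-95-99.7 claim is easy to verify numerically. A quick sketch confirming that the Empirical Rule percentages are exact properties of the normal curve:

```python
from scipy import stats

z = stats.norm(0, 1)  # standard normal; by symmetry the same holds for any mu, sigma

for k in (1, 2, 3):
    # Area under the curve within k standard deviations of the mean
    area = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {area:.4%}")
```

The printed areas are 68.2689%, 95.4500%, and 99.7300%, matching the figures in the visual description above.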
The Mathematical Formula
Here's the equation behind the curve. Take a breath — I'm showing you this because this is a math course and you should see the formula, but I promise you will never need to compute this by hand.
Mathematical Formulation: Normal Probability Density Function
$$\boxed{f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}}$$
where:

- $\mu$ = mean (center of the distribution)
- $\sigma$ = standard deviation (controls the spread)
- $e \approx 2.71828$ (Euler's number)
- $\pi \approx 3.14159$
You don't need to memorize this or compute it. What you should notice is:
- The exponent contains $\left(\frac{x-\mu}{\sigma}\right)^2$ — that's a squared z-score! The z-score you learned in Chapter 6 is literally baked into the formula.
- The negative sign in the exponent makes the curve decrease as you move away from $\mu$ — values far from the mean have smaller density.
- The $\frac{1}{\sigma\sqrt{2\pi}}$ out front is a scaling constant that ensures the total area equals 1.
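You'll never compute the formula by hand, but coding it once demystifies it. Here is a direct translation, checked against scipy's built-in density at a few points (the test values are arbitrary):

```python
import math
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Direct translation of the normal density formula."""
    z = (x - mu) / sigma  # the z-score, baked into the exponent
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-0.5 * z**2)

# Compare against scipy for N(64, 2.8^2) at a few points
for x in (58, 64, 70):
    print(f"x={x}: formula={normal_pdf(x, 64, 2.8):.6f}  scipy={stats.norm.pdf(x, 64, 2.8):.6f}")
```

The two columns agree to machine precision: the whole curve really is just this one expression in $\mu$ and $\sigma$.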
Different Normals for Different Situations
The normal distribution is actually a family of distributions, all with the same bell shape but with different centers and spreads.
- Heights of women: $X \sim N(64, 2.8^2)$ — mean 64 inches, SD 2.8 inches
- SAT scores: $X \sim N(1060, 217^2)$ — mean about 1060, SD about 217
- Body temperatures: $X \sim N(98.2, 0.7^2)$ — mean 98.2°F, SD 0.7°F
Each of these is a normal distribution, but they look different because they have different means and standard deviations. A narrow curve (small $\sigma$) is tall and concentrated. A wide curve (large $\sigma$) is flatter and more spread out.
Visual description (three normal curves on the same axes):
Imagine three bell curves overlaid. The first has $\mu = 0, \sigma = 0.5$ — it's narrow and tall, tightly concentrated around zero. The second has $\mu = 0, \sigma = 1$ — the "standard" width, shorter and wider than the first. The third has $\mu = 0, \sigma = 2$ — broad and flat, spreading probability over a wide range. All three are centered at the same spot, but the amount of spread varies dramatically. The area under all three curves is exactly 1.
10.6 The Standard Normal Distribution and Z-Scores
One Distribution to Rule Them All
Here's the problem: if there are infinitely many normal distributions (one for every combination of $\mu$ and $\sigma$), how do we find probabilities? Do we need a separate table for each one?
The brilliant insight: no, we just need one table. We can transform any normal distribution into the standard normal distribution — a special normal with $\mu = 0$ and $\sigma = 1$.
Key Concept: Standard Normal Distribution
The standard normal distribution is the normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$. We use the letter $Z$ for a standard normal variable:
$$Z \sim N(0, 1)$$
To transform any normal variable $X \sim N(\mu, \sigma^2)$ to the standard normal:
$$\boxed{Z = \frac{X - \mu}{\sigma}}$$
This is the z-score transformation — the exact same z-score formula from Chapter 6! Now we see its deeper purpose: it converts any normal distribution into the standard one.
The z-score transformation is like currency conversion. If you're comparing prices in dollars, euros, and yen, it's hard to make direct comparisons. But convert everything to one currency, and suddenly you can compare anything. The standard normal distribution is the "universal currency" of normal distributions.
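The currency-conversion idea can be demonstrated directly: standardize a value from any normal distribution, and the standard normal gives the same cumulative probability as the original distribution. A sketch using this chapter's height and blood-pressure parameters:

```python
from scipy import stats

# Two values from two different normal distributions...
x_height = 69.6   # inches, where heights ~ N(64, 2.8^2)
x_bp = 145        # mmHg, where blood pressure ~ N(120, 15^2)

# ...converted to the common "currency" of z-scores
z_height = (x_height - 64) / 2.8
z_bp = (x_bp - 120) / 15

# Asking the original distribution or the standard normal gives the same answer
p_height = stats.norm.cdf(x_height, loc=64, scale=2.8)
p_bp = stats.norm.cdf(x_bp, loc=120, scale=15)

print(f"z = {z_height:.2f}: P = {p_height:.4f} vs standard normal {stats.norm.cdf(z_height):.4f}")
print(f"z = {z_bp:.2f}: P = {p_bp:.4f} vs standard normal {stats.norm.cdf(z_bp):.4f}")
```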
The Z-Table: Finding Probabilities
A z-table (also called a standard normal table) tells you the area under the standard normal curve to the left of any z-score. This area is the cumulative probability $P(Z \leq z)$.
Key Concept: Z-Table
A z-table gives $P(Z \leq z)$ — the probability that a standard normal variable is less than or equal to $z$.
- This is the area under the curve to the left of $z$
- For negative z-values: the area is less than 0.5 (less than half the distribution)
- For positive z-values: the area is greater than 0.5 (more than half)
- At $z = 0$: the area is exactly 0.5 (half the distribution)
Here's a portion of the z-table for reference:
| $z$ | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| -2.0 | .0228 | .0222 | .0217 | .0212 | .0207 | .0202 | .0197 | .0192 | .0188 | .0183 |
| -1.5 | .0668 | .0655 | .0643 | .0630 | .0618 | .0606 | .0594 | .0582 | .0571 | .0559 |
| -1.0 | .1587 | .1562 | .1539 | .1515 | .1492 | .1469 | .1446 | .1423 | .1401 | .1379 |
| -0.5 | .3085 | .3050 | .3015 | .2981 | .2946 | .2912 | .2877 | .2843 | .2810 | .2776 |
| 0.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5239 | .5279 | .5319 | .5359 |
| 0.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7123 | .7157 | .7190 | .7224 |
| 1.0 | .8413 | .8438 | .8461 | .8485 | .8508 | .8531 | .8554 | .8577 | .8599 | .8621 |
| 1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 |
| 2.0 | .9772 | .9778 | .9783 | .9788 | .9793 | .9798 | .9803 | .9808 | .9812 | .9817 |
(A complete z-table appears in the Appendix.)
How to Read the Z-Table
The row gives the first decimal place, and the column gives the second. For example, to find $P(Z \leq 1.53)$:
- Find the row for $z = 1.5$
- Find the column for $0.03$
- Read the value: 0.9370
So $P(Z \leq 1.53) = 0.9370$, meaning 93.70% of the standard normal distribution falls below $z = 1.53$.
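You can cross-check any table entry against software: stats.norm.cdf plays the role of the z-table. A quick sketch verifying the lookup just worked through, plus two entries from the excerpted table:

```python
from scipy import stats

# P(Z <= 1.53), the lookup worked through above
print(f"P(Z <= 1.53)  = {stats.norm.cdf(1.53):.4f}")   # 0.9370

# Two more entries from the excerpted table
print(f"P(Z <= -1.25) = {stats.norm.cdf(-1.25):.4f}")  # 0.1056
print(f"P(Z <= 0.00)  = {stats.norm.cdf(0.0):.4f}")    # 0.5000
```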
10.7 Finding Normal Probabilities: Three Worked Examples
Let's put it all together with three examples from our anchor characters.
Example 1: Maya's Blood Pressure Data
Dr. Maya Chen is studying systolic blood pressure in a community. She knows from large-scale studies that blood pressure for healthy adults is approximately normally distributed with $\mu = 120$ mmHg and $\sigma = 15$ mmHg.
Question: What proportion of healthy adults have blood pressure above 145 mmHg (the threshold for Stage 2 hypertension)?
Step 1: Draw a picture. Sketch a normal curve centered at 120. Mark 145 to the right of center. Shade the area to the right of 145 — that's the probability we want.
Step 2: Convert to a z-score.
$$z = \frac{x - \mu}{\sigma} = \frac{145 - 120}{15} = \frac{25}{15} = 1.67$$
A blood pressure of 145 is 1.67 standard deviations above the mean.
Step 3: Look up the z-score. From the z-table, $P(Z \leq 1.67) = 0.9525$.
Step 4: Subtract from 1. We want the area to the right (above 145), but the table gives the area to the left:
$$P(X > 145) = 1 - P(Z \leq 1.67) = 1 - 0.9525 = 0.0475$$
Answer: About 4.75% of healthy adults have blood pressure above 145 mmHg.
In Python:
from scipy import stats
# P(X > 145) where X ~ N(120, 15)
prob = 1 - stats.norm.cdf(145, loc=120, scale=15)
print(f"P(X > 145) = {prob:.4f}") # 0.0478
(The slight difference — 0.0475 vs. 0.0478 — comes from rounding in the z-table. Python uses exact calculations.)
Example 2: Sam's Game Analysis
Sam Okafor is analyzing the Riverside Raptors' total points per game. Over the past three seasons, the team's scoring has been approximately normal with $\mu = 105$ points and $\sigma = 12$ points.
Question: In what proportion of games does the team score between 90 and 115 points?
Step 1: Convert both bounds to z-scores.
$$z_1 = \frac{90 - 105}{12} = \frac{-15}{12} = -1.25$$
$$z_2 = \frac{115 - 105}{12} = \frac{10}{12} = 0.83$$
Step 2: Find the cumulative probabilities.
From the z-table:
- $P(Z \leq 0.83) = 0.7967$
- $P(Z \leq -1.25) = 0.1056$
Step 3: Subtract.
$$P(90 < X < 115) = P(Z \leq 0.83) - P(Z \leq -1.25) = 0.7967 - 0.1056 = 0.6911$$
Answer: The team scores between 90 and 115 points in about 69.1% of their games.
In Python:
from scipy import stats
# P(90 < X < 115) where X ~ N(105, 12)
prob = stats.norm.cdf(115, loc=105, scale=12) - stats.norm.cdf(90, loc=105, scale=12)
print(f"P(90 < X < 115) = {prob:.4f}") # 0.6920
Example 3: Alex's User Watch Time
Alex Rivera knows that daily watch time for StreamVibe's active users follows an approximately normal distribution with $\mu = 52$ minutes and $\sigma = 18$ minutes.
Question: What watch time separates the top 10% of users (the "power users") from the rest?
This is a reverse problem — instead of going from a value to a probability, we're going from a probability to a value. We need to work backwards.
Step 1: Identify the z-score for the 90th percentile. We want the value where 90% of the distribution falls below and 10% falls above. Looking in the z-table for a cumulative probability closest to 0.9000:
$P(Z \leq 1.28) = 0.8997 \approx 0.90$
So $z \approx 1.28$.
Step 2: Convert back to the original scale.
$$x = \mu + z \cdot \sigma = 52 + 1.28 \times 18 = 52 + 23.04 = 75.04$$
Answer: Users who watch more than about 75 minutes per day are in the top 10%.
In Python: The ppf (percent point function) is the inverse of cdf:
from scipy import stats
# Find the 90th percentile
x = stats.norm.ppf(0.90, loc=52, scale=18)
print(f"90th percentile: {x:.1f} minutes") # 75.1 minutes
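The manual two-step route (find the standard-normal z, then rescale) gives exactly the same answer as letting ppf handle the location and scale, which is a useful check that the inverse transformation formula is doing what we claim:

```python
from scipy import stats

# Manual route: standard-normal z for the 90th percentile, then rescale
z = stats.norm.ppf(0.90)               # ≈ 1.2816
x_manual = 52 + z * 18

# Direct route: scipy applies loc and scale for us
x_direct = stats.norm.ppf(0.90, loc=52, scale=18)

print(f"manual: {x_manual:.2f}, direct: {x_direct:.2f}")  # both ≈ 75.07
```

The tiny gap from the hand calculation (75.04) comes from rounding z to 1.28 in the table.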
Key functions summary:
| Task | Math | Python |
|---|---|---|
| $P(X \leq x)$ — area to the left | Z-table lookup | stats.norm.cdf(x, loc=μ, scale=σ) |
| $P(X > x)$ — area to the right | $1 - \text{table lookup}$ | 1 - stats.norm.cdf(x, loc=μ, scale=σ) |
| $P(a < X < b)$ — area between | Two lookups, subtract | stats.norm.cdf(b, ...) - stats.norm.cdf(a, ...) |
| Find $x$ from probability | Reverse z-table lookup | stats.norm.ppf(probability, loc=μ, scale=σ) |
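All four tasks can be sanity-checked in one short script. This sketch reuses Maya's $N(120, 15)$ blood pressure model from Example 1, with 110 and 130 as arbitrary illustration bounds:

```python
from scipy import stats

mu, sigma = 120, 15  # Maya's blood pressure model from Example 1

left = stats.norm.cdf(130, loc=mu, scale=sigma)        # P(X <= 130)
right = 1 - stats.norm.cdf(130, loc=mu, scale=sigma)   # P(X > 130)
between = (stats.norm.cdf(130, loc=mu, scale=sigma)
           - stats.norm.cdf(110, loc=mu, scale=sigma)) # P(110 < X < 130)
x90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)        # 90th percentile

print(f"left={left:.4f}, right={right:.4f}, "
      f"between={between:.4f}, x90={x90:.1f}")
```

Note that the left and right areas always sum to 1, and feeding the ppf output back into cdf recovers the probability you started with.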
10.8 The Continuity Correction (When Discrete Meets Continuous)
Sometimes you'll want to use the normal distribution to approximate a discrete distribution (like the binomial). This is common when $n$ is large and computing exact binomial probabilities is cumbersome.
But there's a catch: the binomial is discrete (it takes values 0, 1, 2, ...), while the normal is continuous. When you approximate a discrete distribution with a continuous one, you lose a little accuracy at the boundaries. The continuity correction fixes this.
Key Concept: Continuity Correction
When using a normal distribution to approximate a discrete distribution, adjust the boundary by 0.5:
- $P(X \leq k)$ becomes $P(X \leq k + 0.5)$
- $P(X \geq k)$ becomes $P(X \geq k - 0.5)$
- $P(X = k)$ becomes $P(k - 0.5 \leq X \leq k + 0.5)$
The intuition: each discrete value "occupies" a half-unit on either side of itself on the continuous scale.
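The $P(X = k)$ case is worth seeing in code, since a continuous distribution assigns zero probability to any single point without the correction. A sketch using Daria's numbers from the example below ($n = 50$, $p = 0.38$), comparing the corrected normal approximation of $P(X = 19)$ to the exact binomial value:

```python
from scipy import stats

n, p, k = 50, 0.38, 19
mu = n * p                        # 19
sigma = (n * p * (1 - p)) ** 0.5  # ≈ 3.43

# Exact binomial probability of exactly k successes
exact = stats.binom.pmf(k, n, p)

# Normal approximation: P(k - 0.5 <= X <= k + 0.5)
approx = (stats.norm.cdf(k + 0.5, mu, sigma)
          - stats.norm.cdf(k - 0.5, mu, sigma))

print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```

The two values agree to about three decimal places, which is exactly the point of the half-unit adjustment.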
Example: Daria takes 50 three-point shots with $p = 0.38$. What's the probability she makes 20 or more?
With $n = 50$ and $p = 0.38$: $\mu = np = 19$, $\sigma = \sqrt{np(1-p)} = \sqrt{11.78} \approx 3.43$.
Without continuity correction: $P(X \geq 20)$, $z = \frac{20 - 19}{3.43} = 0.29$, $P(Z \geq 0.29) = 1 - 0.6141 = 0.3859$.
With continuity correction: $P(X \geq 19.5)$, $z = \frac{19.5 - 19}{3.43} = 0.15$, $P(Z \geq 0.15) = 1 - 0.5596 = 0.4404$.
Exact binomial answer (from Python): $P(X \geq 20) = 0.4468$.
The corrected normal approximation (0.4404) is much closer to the exact answer (0.4468) than the uncorrected version (0.3859). (As in Example 1, Python's unrounded z-scores give slightly different values, 0.4421 and 0.3854, but the conclusion is the same.)
from scipy import stats
n, p = 50, 0.38
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
# Exact binomial
exact = 1 - stats.binom.cdf(19, n, p)
print(f"Exact binomial: {exact:.4f}") # 0.4468
# Normal approximation without correction
no_correction = 1 - stats.norm.cdf(20, mu, sigma)
print(f"Normal (no correction): {no_correction:.4f}") # 0.3854
# Normal approximation with correction
with_correction = 1 - stats.norm.cdf(19.5, mu, sigma)
print(f"Normal (with correction): {with_correction:.4f}") # 0.4421
When do you need the continuity correction? Only when you're using a normal distribution to approximate a discrete one. If you're working with a genuinely continuous variable (heights, blood pressure), no correction is needed. And honestly? In the age of Python and R, you can usually just compute the exact binomial probability directly. The normal approximation is most valuable as a conceptual tool — it's the bridge to the Central Limit Theorem in Chapter 11.
10.9 Assessing Normality: Is the Model Appropriate?
Here's the moment of truth. The normal distribution is powerful, but it only works when data is approximately normal. So how do you check?
This is where the threshold concept of this chapter comes into sharp focus.
Threshold Concept: The Normal Distribution as a Model
The normal distribution is a model — a simplified mathematical description of reality. No real dataset is perfectly normal. The question is never "Is my data normal?" (the answer is always no). The question is: "Is my data close enough to normal that the normal model gives useful, approximately correct answers?"
George Box said it best: "All models are wrong, but some are useful." The normal distribution is the most useful wrong model in all of statistics.
This is a threshold concept because it requires a shift in thinking. You need to move from "Is this exactly true?" to "Is this approximately true enough?" That's not sloppy thinking — it's the way science and engineering actually work. Every physics equation you've ever seen is an approximation. The normal distribution is no different.
Method 1: The Histogram Check
The simplest approach: look at a histogram. Does it look roughly bell-shaped? Is it reasonably symmetric? Or does it have obvious skewness, multiple peaks, or other non-normal features?
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Generate some data (or use your own)
np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=200)
# Histogram with normal curve overlay
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(data, bins=20, density=True, alpha=0.7, color='steelblue',
edgecolor='white', label='Data')
# Overlay the theoretical normal curve
x = np.linspace(data.min() - 10, data.max() + 10, 200)
ax.plot(x, stats.norm.pdf(x, data.mean(), data.std()),
'r-', linewidth=2, label=f'Normal fit (μ={data.mean():.1f}, σ={data.std():.1f})')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Histogram with Normal Curve Overlay')
ax.legend()
plt.tight_layout()
plt.show()
The histogram check is quick and intuitive, but subjective. Two people can look at the same histogram and disagree about whether it's "close enough" to normal. We need something more rigorous.
Method 2: The QQ-Plot (Quantile-Quantile Plot)
The QQ-plot is the gold standard for visually assessing normality. It's more sensitive than a histogram, especially for detecting problems in the tails.
Key Concept: QQ-Plot
A QQ-plot (Quantile-Quantile plot) plots the quantiles of your data against the quantiles you'd expect if the data were perfectly normal.
- If the data is approximately normal, the points fall roughly along a straight diagonal line
- Deviations from the line reveal specific ways the data departs from normality
Reading QQ-plot deviations:
- S-shape, with points dipping below the line at the left end and rising above it at the right: heavier tails than normal (leptokurtic)
- The opposite, flattened S-shape, with points above the line at the left and below it at the right: lighter tails than normal (platykurtic)
- A single consistent curve (concave up or down): data is skewed
- Points follow the line with only small scatter: approximately normal with random variation
Here's how to create and interpret QQ-plots in Python:
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np
# Create a figure with multiple QQ-plots
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
np.random.seed(42)
# Panel 1: Normal data — should follow the line
normal_data = np.random.normal(loc=50, scale=10, size=200)
stats.probplot(normal_data, dist="norm", plot=axes[0, 0])
axes[0, 0].set_title("Normal Data\n(Points follow the line)")
# Panel 2: Right-skewed data — curved pattern
skewed_data = np.random.exponential(scale=10, size=200)
stats.probplot(skewed_data, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title("Right-Skewed Data\n(Points curve away from line)")
# Panel 3: Heavy-tailed data — S-shaped at extremes
heavy_tailed = np.random.standard_t(df=3, size=200)
stats.probplot(heavy_tailed, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title("Heavy-Tailed Data\n(Extremes flare out from line)")
# Panel 4: Uniform data — S-shaped, opposite pattern
uniform_data = np.random.uniform(low=0, high=100, size=200)
stats.probplot(uniform_data, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Uniform Data\n(Points flatten at both ends)")
plt.tight_layout()
plt.show()
Visual description (four QQ-plots):
Panel 1 (Normal data): The dots form a tight line from lower-left to upper-right, closely following the diagonal reference line. Small random deviations exist, but the overall pattern is clearly linear. Verdict: data is approximately normal.
Panel 2 (Right-skewed data): The dots follow the line for most of the range but then curve sharply upward on the right side. The largest values in the data are much larger than a normal distribution would predict. Verdict: right-skewed, not normal.
Panel 3 (Heavy-tailed data): The dots follow the line in the middle but flare outward at both extremes — dipping below the line on the left and rising above it on the right, forming an S-shape. Both the smallest and largest values are more extreme than a normal would predict. Verdict: heavy tails, not normal.
Panel 4 (Uniform data): The dots form a flattened S-shape — rising above the line in the left portion and falling below it in the right. The extremes are less extreme than a normal distribution would predict. Verdict: lighter tails than normal (too flat, no peak).
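As a numeric complement to the visual check, sample skewness and excess kurtosis (both roughly 0 for normal data) can be computed with scipy. The data here mirrors panels 1 and 2 above; treat the magnitudes as illustrative, not as formal cutoffs:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=200)
skewed_data = np.random.exponential(scale=10, size=200)

for name, data in [("normal", normal_data), ("exponential", skewed_data)]:
    skew = stats.skew(data)      # ≈ 0 for normal; large and positive for exponential
    kurt = stats.kurtosis(data)  # excess kurtosis: 0 for a normal distribution
    print(f"{name}: skew = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A skewness far from 0 usually shows up as the curved QQ-plot pattern of Panel 2; large excess kurtosis corresponds to the flared tails of Panel 3.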
Method 3: The Shapiro-Wilk Test
For a formal statistical test of normality, the Shapiro-Wilk test is widely used.
Key Concept: Shapiro-Wilk Test
The Shapiro-Wilk test tests the null hypothesis that a sample came from a normally distributed population.
- Null hypothesis: The data comes from a normal distribution
- Alternative hypothesis: The data does not come from a normal distribution
- A small p-value (typically < 0.05) suggests the data is significantly non-normal
- A large p-value means we don't have evidence to reject normality (but it doesn't prove the data is normal)
from scipy import stats
import numpy as np
np.random.seed(42)
# Test 1: Data that's actually normal
normal_data = np.random.normal(loc=50, scale=10, size=100)
stat, p_value = stats.shapiro(normal_data)
print(f"Normal data: W = {stat:.4f}, p-value = {p_value:.4f}")
# Expect large p-value (fail to reject normality)
# Test 2: Skewed data
skewed_data = np.random.exponential(scale=10, size=100)
stat, p_value = stats.shapiro(skewed_data)
print(f"Skewed data: W = {stat:.4f}, p-value = {p_value:.4f}")
# Expect small p-value (reject normality)
# Test 3: Uniform data
uniform_data = np.random.uniform(low=0, high=100, size=100)
stat, p_value = stats.shapiro(uniform_data)
print(f"Uniform data: W = {stat:.4f}, p-value = {p_value:.4f}")
# Expect small p-value (reject normality)
Typical output:
Normal data: W = 0.9939, p-value = 0.9356
Skewed data: W = 0.8946, p-value = 0.0000
Uniform data: W = 0.9564, p-value = 0.0025
The normal data has a large p-value (0.94) — no evidence against normality. The skewed data has a p-value of essentially zero — clearly not normal. The uniform data's p-value (0.0025) is below 0.05, suggesting non-normality.
Warning
The Shapiro-Wilk test is highly sensitive to sample size. With very large samples ($n > 5{,}000$), even tiny, practically meaningless departures from normality will produce statistically significant results. With very small samples ($n < 20$), the test has low power to detect even substantial non-normality.
Best practice: Use the Shapiro-Wilk test alongside QQ-plots. The plot tells you how the data departs from normality. The test tells you whether the departure is statistically significant. Together, they give you the full picture.
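The sample-size sensitivity is easy to demonstrate. This sketch uses artificially constructed data (a normal variable plus a small quadratic bend, giving skewness of roughly 0.3, a departure many analysts would call mild):

```python
import numpy as np
from scipy import stats

np.random.seed(0)

def mildly_skewed(n):
    # Normal data with a small quadratic bend (skewness ≈ 0.3)
    x = np.random.normal(size=n)
    return x + 0.05 * x**2

# Same distribution, two sample sizes
_, p_small = stats.shapiro(mildly_skewed(50))
_, p_large = stats.shapiro(mildly_skewed(4000))

print(f"n = 50:   p = {p_small:.4f}")  # usually fails to reject
print(f"n = 4000: p = {p_large:.4f}")  # tiny p-value despite the same mild skew
```

Same distribution, opposite conclusions: the large sample flags a departure that the small sample cannot see, even though the practical importance of the skew hasn't changed.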
10.10 When Normality Matters and When It Doesn't
Here's the question I get from students every semester: "My data isn't perfectly normal. Should I panic?"
No. Definitely not. Here's why.
Many statistical procedures are robust to violations of normality. This means they still work reasonably well even when the normality assumption isn't perfectly met. In later chapters, you'll learn about:
- t-tests (Chapter 15): surprisingly robust for moderate sample sizes, especially for two-tailed tests
- ANOVA (Chapter 20): robust when group sizes are equal and the data isn't wildly skewed
- Confidence intervals (Chapter 12): get more robust as sample size increases, thanks to the Central Limit Theorem (Chapter 11)
The key insight is that the Central Limit Theorem (coming in Chapter 11) guarantees that sample means are approximately normal even when the underlying data isn't — as long as the sample is large enough. This is one of the most remarkable results in all of mathematics, and it's the reason normality matters less than you might think.
When normality DOES matter:
- Small sample sizes ($n < 30$): With small samples, the CLT hasn't kicked in, so the shape of the data matters more
- Extreme outliers: Even robust procedures break down when there are extreme outliers
- Prediction intervals: If you're predicting individual outcomes (not averages), you need the underlying distribution to be approximately normal
- Certain tests: Some procedures (like certain control chart calculations in quality control) explicitly require normality
When normality DOESN'T matter much:
- Large sample sizes ($n > 30$ or so): The CLT rescues you
- Mild skewness: A little skewness usually doesn't cause problems
- Sample means and proportions: Even if individual observations aren't normal, their averages tend to be
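The last point, that averages of non-normal data tend toward normality, can be previewed with a quick simulation. This is a sketch of what Chapter 11 formalizes, using a strongly right-skewed exponential population:

```python
import numpy as np
from scipy import stats

np.random.seed(42)

# 10,000 samples of n = 50 draws each from a right-skewed population
population_draws = np.random.exponential(scale=10, size=(10_000, 50))

# One sample mean per row
sample_means = population_draws.mean(axis=1)

print(f"skew of raw draws:    {stats.skew(population_draws.ravel()):.2f}")  # ≈ 2
print(f"skew of sample means: {stats.skew(sample_means):.2f}")              # much smaller
```

The raw draws are heavily skewed, but the distribution of their means is already close to symmetric at n = 50. That's the Central Limit Theorem doing the rescuing described above.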
Theme 4 Connection: The Bell Curve Isn't Destiny
One of the most persistent — and most harmful — misapplications of the normal distribution is the claim that human traits like intelligence, talent, or potential must follow a bell curve. This reasoning has been used to argue that inequality is "natural" and that differences between groups are fixed. But this confuses a useful mathematical model with a deterministic law of nature.
The normal distribution describes what happens when many small, independent, additive effects combine. It says nothing about whether those effects are changeable, whether measurement tools are biased, or whether social structures constrain outcomes. IQ scores are designed to be normal — the test is constructed to produce a bell curve. That tells us about the test, not about human potential.
Professor Washington thinks about this constantly. When a predictive policing algorithm assigns risk scores that follow a normal distribution, it doesn't mean those scores represent some fixed, innate criminality. They represent the output of a model that's been fed data shaped by the very inequities the algorithm claims to objectively measure.
The model is useful. But mistaking the model for reality is dangerous.
Theme 3 Connection: AI Assumes Distributions
Many machine learning algorithms — including neural networks, Gaussian processes, and Bayesian classifiers — assume that data follows specific distributions, often normal ones. When Alex's recommendation algorithm at StreamVibe clusters users by behavior, it frequently models each cluster as a multivariate normal distribution. When Maya's epidemiological models predict disease spread, they often assume that measurement errors are normally distributed.
These assumptions aren't always checked. And when the assumptions are wrong — when the data is severely skewed, or has heavy tails, or is bimodal — the AI system can make systematically poor predictions. Understanding distributions isn't just a statistics class concept. It's the foundation for knowing when to trust an algorithm and when to question it.
10.11 Putting It All Together: A Complete Worked Example
Let's walk through a complete analysis using Professor Washington's research.
Washington is examining the risk scores assigned by a county's pretrial risk assessment algorithm. The algorithm scores defendants from 0 to 100, and the scores for the 2,000 defendants in his dataset have a mean of 42 and a standard deviation of 11. He wants to know if these scores are approximately normal, and if so, what proportion of defendants fall into various risk categories.
Step 1: Check Normality
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Washington's risk score data (simulated for demonstration)
np.random.seed(2024)
risk_scores = np.random.normal(loc=42, scale=11, size=2000)
risk_scores = np.clip(risk_scores, 0, 100) # Bounded 0-100
# Histogram with normal overlay
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left panel: Histogram
axes[0].hist(risk_scores, bins=30, density=True, alpha=0.7,
color='steelblue', edgecolor='white')
x = np.linspace(0, 100, 200)
axes[0].plot(x, stats.norm.pdf(x, risk_scores.mean(), risk_scores.std()),
'r-', linewidth=2)
axes[0].set_title('Risk Score Distribution')
axes[0].set_xlabel('Risk Score')
axes[0].set_ylabel('Density')
# Right panel: QQ-plot
stats.probplot(risk_scores, dist="norm", plot=axes[1])
axes[1].set_title('QQ-Plot of Risk Scores')
plt.tight_layout()
plt.show()
# Shapiro-Wilk test (on a subsample — test works best for n < 5000)
subsample = np.random.choice(risk_scores, size=500, replace=False)
stat, p_val = stats.shapiro(subsample)
print(f"Shapiro-Wilk: W = {stat:.4f}, p-value = {p_val:.4f}")
The histogram looks approximately bell-shaped, and the QQ-plot shows points following the diagonal line closely. The Shapiro-Wilk test on a subsample gives a reasonably large p-value. Washington concludes the normal model is a reasonable approximation for these risk scores.
Step 2: Calculate Probabilities
The county classifies risk as: Low (0-25), Medium (26-55), High (56-75), Very High (76+).
from scipy import stats
mu, sigma = 42, 11
# Low risk: P(X <= 25)
p_low = stats.norm.cdf(25, mu, sigma)
# Medium risk: P(25 < X <= 55)
p_medium = stats.norm.cdf(55, mu, sigma) - stats.norm.cdf(25, mu, sigma)
# High risk: P(55 < X <= 75)
p_high = stats.norm.cdf(75, mu, sigma) - stats.norm.cdf(55, mu, sigma)
# Very high risk: P(X > 75)
p_very_high = 1 - stats.norm.cdf(75, mu, sigma)
print(f"Low risk (0-25): {p_low:.4f} ({p_low*100:.1f}%)")
print(f"Medium risk (26-55): {p_medium:.4f} ({p_medium*100:.1f}%)")
print(f"High risk (56-75): {p_high:.4f} ({p_high*100:.1f}%)")
print(f"Very high risk (76+): {p_very_high:.4f} ({p_very_high*100:.1f}%)")
print(f"\nTotal: {p_low + p_medium + p_high + p_very_high:.4f}")
Output:
Low risk (0-25): 0.0614 (6.1%)
Medium risk (26-55): 0.8209 (82.1%)
High risk (56-75): 0.1164 (11.6%)
Very high risk (76+): 0.0013 (0.1%)
Total: 1.0000
Step 3: The Ethical Question
Washington notices something: the algorithm classifies the vast majority of defendants (82%) as medium risk. Only 6% get low risk scores. If "medium risk" is used to justify pretrial detention, this means most defendants — regardless of their individual circumstances — are being flagged for the same restrictive treatment.
"The normal distribution tells me the spread," Washington notes. "But it doesn't tell me whether 42 is a fair threshold, or whether the algorithm produces the same distribution for every demographic group."
This is the model-vs-reality distinction in action. The normal distribution fits the data. But fitting the data doesn't mean the system is just.
10.12 Data Detective Portfolio: Assessing Normality
Time to apply what you've learned to your own dataset. This is the Chapter 10 component of the Data Detective Portfolio.
Your Task
For each numerical variable in your dataset:
1. Create a histogram. Does it look approximately bell-shaped? Note any obvious skewness, multiple peaks, or outliers.
2. Create a QQ-plot. Do the points follow the diagonal line? Where do they deviate?
3. Run the Shapiro-Wilk test. Report the test statistic and p-value. Based on your sample size, is the p-value meaningful?
4. Make a judgment. For each variable, write one sentence: "The variable [name] is / is not approximately normally distributed because [evidence]."
5. Reflect. For variables that aren't normal, speculate about why. Is the variable bounded on one side (like income, which can't be negative)? Is it naturally skewed? Are there subgroups creating multiple peaks?
Template Code
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# Choose a numerical variable
variable = 'your_variable_name'
data = df[variable].dropna()
# 1. Histogram
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(data, bins=25, density=True, alpha=0.7,
color='steelblue', edgecolor='white')
x = np.linspace(data.min(), data.max(), 200)
axes[0].plot(x, stats.norm.pdf(x, data.mean(), data.std()),
'r-', linewidth=2)
axes[0].set_title(f'Histogram of {variable}')
axes[0].set_xlabel(variable)
axes[0].set_ylabel('Density')
# 2. QQ-plot
stats.probplot(data, dist="norm", plot=axes[1])
axes[1].set_title(f'QQ-Plot of {variable}')
plt.tight_layout()
plt.show()
# 3. Shapiro-Wilk test
if len(data) > 5000:
sample = data.sample(n=5000, random_state=42)
print(f"Note: Using random subsample of 5000 (full n = {len(data)})")
else:
sample = data
stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk test: W = {stat:.4f}, p-value = {p_value:.4f}")
# 4. Summary statistics for context
print(f"\nSummary statistics for {variable}:")
print(f" n = {len(data)}")
print(f" Mean = {data.mean():.2f}")
print(f" SD = {data.std():.2f}")
print(f" Min = {data.min():.2f}")
print(f" Max = {data.max():.2f}")
print(f" Skew = {data.skew():.3f}")
Portfolio Tip: If you're using the CDC BRFSS dataset, try assessing normality of _BMI5 (BMI, which tends to be right-skewed) and WTKG3 (weight in kilograms). For the Gapminder dataset, try life expectancy (often approximately normal within a given year) versus GDP per capita (almost always strongly right-skewed). These contrasts make for excellent portfolio entries.
10.13 Drift Check: Voice Consistency Audit
This section is a behind-the-scenes note on tone and style, relevant to instructors and textbook reviewers.
At Chapter 10 — the end of Part 3 (Probability) — I want to pause for a quick voice audit. We're now ten chapters deep, and it's important that the tone hasn't drifted.
Checklist:
- Are we still conversational? Using "you" and "I"? Contractions? ✓
- Are we still leading with stories and concrete examples before abstractions? ✓ (Blood pressure, Daria's three-pointers, StreamVibe watch time)
- Are we acknowledging math anxiety without condescending? ✓ ("Take a breath — I'm showing you this because...")
- Are formulas always accompanied by plain-English explanations? ✓ (Every formula is followed by an "In plain English" or "In words" translation)
- Are the four anchor characters still actively featured? ✓ (Maya: blood pressure; Alex: watch time and power users; Sam/Daria: three-pointers and binomial; Washington: risk scores and ethics)
- Are we building on prior chapters rather than re-teaching? ✓ (Three spaced review moments connect to Ch. 5, Ch. 6, and Ch. 8)
- Is the progressive project advancing naturally? ✓ (Assessing normality is a logical next step after Ch. 9's conditional probability analysis)
Tone adjustments noted: This chapter is more formula-heavy than Chapters 8-9, which could create a more "textbook-ish" feel. Mitigation: extra emphasis on worked examples, visual descriptions, and the "you will never compute this by hand" reassurance. The threshold concept block in Section 10.9 is the emotional heart of the chapter — it should land as empowering ("Is this close enough?" is a better question than "Is this exactly right?") rather than deflating ("nothing is perfectly normal so what's the point").
10.14 Chapter Summary
Let's step back and see what we've built.
You started this chapter knowing the Empirical Rule and z-scores from Chapter 6 — useful tools, but without a theoretical foundation. Now you have the full picture:
- Random variables assign numerical values to random outcomes. They're either discrete (countable) or continuous (measurable).
- Probability distributions describe how probability is spread across the values of a random variable. For discrete variables, we list probabilities. For continuous variables, we use density curves where area = probability.
- The binomial distribution models the count of successes in a fixed number of independent trials with the same probability of success. Formula: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. Python: scipy.stats.binom.
- The normal distribution is a symmetric, bell-shaped continuous distribution defined by its mean $\mu$ and standard deviation $\sigma$. It's the mathematical model behind the Empirical Rule. Formula: exists but don't compute it — use tables or Python.
- Z-scores transform any normal distribution to the standard normal ($Z \sim N(0, 1)$), allowing you to find probabilities from a single table. Formula: $z = (x - \mu) / \sigma$.
- The normal distribution is a model, not truth. Real data is never perfectly normal. The question is always whether it's close enough. Use histograms, QQ-plots, and the Shapiro-Wilk test to check.
- Normality matters less than you think for large samples, thanks to the Central Limit Theorem (Chapter 11). But for small samples and individual predictions, it matters more.
What's Next
In Chapter 11, we'll answer a question that's been lurking since Chapter 4: if you take different samples from the same population, you get different results. How much do those results vary? And why do sample means tend to be normally distributed even when the population isn't?
The answer is the Central Limit Theorem — arguably the single most important theorem in all of statistics. It's the bridge from probability to inference, and it'll explain why everything we've learned about the normal distribution matters even more than you currently think.
Key Formulas at a Glance
| Concept | Formula | When to Use |
|---|---|---|
| Expected value | $E(X) = \sum x \cdot P(X = x)$ | Finding the long-run average of a discrete random variable |
| Binomial probability | $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ | Counting successes in $n$ independent trials with constant $p$ |
| Binomial mean/SD | $\mu = np$, $\sigma = \sqrt{np(1-p)}$ | Center and spread of a binomial distribution |
| Z-score transformation | $z = \frac{x - \mu}{\sigma}$ | Converting any normal variable to the standard normal |
| Inverse transformation | $x = \mu + z \cdot \sigma$ | Converting from z-score back to original units |
| Normal CDF (Python) | stats.norm.cdf(x, loc=μ, scale=σ) | $P(X \leq x)$ for a normal variable |
| Inverse normal (Python) | stats.norm.ppf(p, loc=μ, scale=σ) | Find $x$ such that $P(X \leq x) = p$ |
| Binomial PMF (Python) | stats.binom.pmf(k, n, p) | $P(X = k)$ for a binomial variable |
| Shapiro-Wilk (Python) | stats.shapiro(data) | Test whether data is approximately normal |