Learning Objectives
- Calculate and interpret mean, median, and mode
- Calculate and interpret range, IQR, variance, and standard deviation
- Use the five-number summary and create box plots
- Apply the Empirical Rule (68-95-99.7) to bell-shaped distributions
- Detect outliers using the IQR method and z-scores
In This Chapter
- Chapter Overview
- 6.1 Measures of Center: Where's the Middle?
- 6.2 Mean vs. Median: When Averages Lie
- 6.3 The Weighted Mean: Not All Values Are Created Equal
- 6.4 Measures of Spread: How Much Do Values Vary?
- 6.5 Variance and Standard Deviation: Measuring Typical Distance
- 6.5.1 Worked Example: Standard Deviation by Hand and in Python
- 6.6 The Five-Number Summary and Box Plots
- 6.7 The Empirical Rule (68-95-99.7 Rule)
- 6.8 Z-Scores: How Many Standard Deviations from the Mean?
- 6.9 Detecting Outliers: The IQR Method and Z-Score Method
- 6.10 Putting It All Together: The Complete Summary
- 6.11 Which Summary Statistics to Use: A Decision Guide
- 6.12 All Functions at a Glance: Python and Excel
- 6.13 Project Checkpoint: Your Turn
- 6.14 Spaced Review: Strengthening Previous Learning
- Chapter Summary
- What's Next
Chapter 6: Numerical Summaries: Center, Spread, and Shape
"Not everything that can be counted counts, and not everything that counts can be counted." — Often attributed to Albert Einstein (likely paraphrased from William Bruce Cameron)
Chapter Overview
Here's something that happened to me once. I was looking at a news article that said, "The average American household earns $105,000 per year." And my first thought was: that can't be right. Because I knew plenty of American households — including my own, at the time — that earned far less. Was the article wrong? Was I wrong? Were both of us wrong?
Neither, actually. The article was reporting the mean household income. What it should have reported — and what would have painted a very different picture — was the median household income, which was about $75,000. The difference between $105,000 and $75,000 is not a rounding error. It's a $30,000 gap caused by choosing a single word: mean instead of median.
That's a gap big enough to change how you think about the economy, how politicians design policy, and how you feel about your own financial situation. One number says "most Americans are doing fine." The other says "most Americans earn less than you'd think." Same data. Different summary. Completely different story.
This chapter is about learning to choose — and interpret — the right summary for the right situation. In Chapter 5, you learned to see data through graphs. You described shapes, spotted outliers, and talked about distributions being "centered around 30" or "spread from 10 to 180." Those descriptions were useful but vague. Now you're going to make them precise.
By the end of this chapter, you'll be able to do something powerful: take any numerical dataset and distill it into a handful of numbers that capture its center, its spread, and its shape. You'll know when to use the mean and when the median tells a truer story. You'll understand why standard deviation — a concept that intimidates many students — is actually just the answer to a simple question: How far, on average, does a typical value fall from the center?
And you'll meet the box plot — a graph that condenses an entire distribution into five numbers and reveals outliers at a glance.
In this chapter, you will learn to:
- Calculate and interpret mean, median, and mode
- Calculate and interpret range, IQR, variance, and standard deviation
- Use the five-number summary and create box plots
- Apply the Empirical Rule (68-95-99.7) to bell-shaped distributions
- Detect outliers using the IQR method and z-scores
Fast Track: If you can already calculate mean, median, standard deviation, and IQR, and you know what a box plot is, skim Sections 6.1-6.6 and jump to Section 6.7 (The Empirical Rule). Complete quiz questions 1, 8, and 15 to verify your foundation.
Deep Dive: After this chapter, read Case Study 1 (why the mean lies about income inequality) for a powerful example of how choosing the wrong measure of center distorts entire policy debates, then Case Study 2 (the Empirical Rule in quality control) for a real-world application in manufacturing.
6.1 Measures of Center: Where's the Middle?
Let's start with a question that sounds simple but isn't: What's the "typical" value in a dataset?
In Chapter 5, when you described histograms, you talked about "center" — the approximate middle of the distribution. But "approximate" isn't good enough anymore. We need numbers. Specifically, we need three candidates for "the center": the mean, the median, and the mode.
Let's introduce each one using a dataset from Sam Okafor's world.
Sam's Shooting Data
Sam, our sports analytics intern from Chapter 1, is tracking the points scored by the Riverside Raptors' star player, Daria, over her last 11 games:
Daria's points per game: 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31
These 11 games are already sorted from smallest to largest (a step that'll matter in a moment). Let's find the center three different ways.
The Mean: Add Up and Divide
The mean is what most people call "the average." You've been calculating it since elementary school: add up all the values, then divide by how many there are.
Mathematical Formulation: The Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
In plain English: Add up every value in the dataset, then divide by the total number of values. The symbol $\bar{x}$ (pronounced "x-bar") represents the sample mean. The $\Sigma$ (capital sigma) just means "add them all up."
For Daria's points:
$$\bar{x} = \frac{12 + 15 + 18 + 19 + 21 + 22 + 22 + 24 + 25 + 28 + 31}{11} = \frac{237}{11} \approx 21.5$$
Daria averages 21.5 points per game. That's the mean — the balance point of the data. If you put all 11 game scores on a number line and tried to balance them on a fulcrum, the fulcrum would sit at 21.5.
Here's the thing about the mean, though: it feels every value in the dataset. Every single number pulls the mean toward itself. This makes it powerful — it uses all the information — but also vulnerable to extreme values.
Watch what happens if Daria has one monster game. Replace that 31-point game with a 61-point explosion:
$$\bar{x} = \frac{12 + 15 + 18 + 19 + 21 + 22 + 22 + 24 + 25 + 28 + 61}{11} = \frac{267}{11} \approx 24.3$$
One game changed the mean from 21.5 to 24.3 — a jump of nearly 3 points. But does 24.3 really represent Daria's "typical" game? Ten of her eleven games were 28 points or fewer. That 61-point outburst yanked the mean upward.
This is the mean's superpower and its weakness: it's sensitive to every value, including outliers.
The Median: The Middle Value
The median solves this problem by ignoring how far values are from the center and focusing only on position. It's the middle value when the data is sorted from smallest to largest.
Finding the Median:
1. Sort the data from smallest to largest.
2. If n is odd, the median is the value at position $\frac{n+1}{2}$.
3. If n is even, the median is the average of the two middle values.
For Daria's original data (n = 11, which is odd):
12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31
Position $\frac{11+1}{2} = 6$. The 6th value is 22. So the median is 22 points.
Now replace 31 with 61:
12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 61
The middle value? Still 22. The median didn't budge. That 61-point game could have been 161 points or 6,100 points — the median wouldn't care. It only looks at position, not magnitude.
This makes the median a resistant measure — a statistic that isn't heavily influenced by extreme values or outliers. The mean is not resistant.
Plain-language rule: The median is the value where half the data is above and half is below. It's the 50th percentile — the statistical equator.
What If n Is Even?
Suppose Daria plays a 12th game and scores 20 points. Now we have:
12, 15, 18, 19, 20, 21, 22, 22, 24, 25, 28, 31
With 12 values, there's no single middle value. The two middle values are at positions 6 and 7: 21 and 22. The median is their average: $\frac{21 + 22}{2} = 21.5$.
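Both cases are one call in numpy, which sorts the data for you. A quick check of the two medians above:

```python
import numpy as np

odd_games = [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31]  # n = 11 (odd)
even_games = odd_games + [20]  # n = 12 (even); np.median sorts internally

print(np.median(odd_games))   # 22.0, the single middle value
print(np.median(even_games))  # 21.5, the average of the two middle values
```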
The Mode: Most Frequent
The mode is the value that appears most often. In Daria's original dataset, 22 appears twice — more than any other value — so the mode is 22.
The mode is the only measure of center that works for categorical data (you can't average colors or cities, but you can identify the most common one). For numerical data, though, the mode is often less useful than the mean or median. Some datasets have no mode (every value appears once), and some have multiple modes. Remember the bimodal flu data from Chapter 5? That distribution had two peaks — two modes — reflecting two different age groups hit by the flu.
Putting Them Together
| Measure | How to Find It | Resistant to Outliers? | Uses All Data? |
|---|---|---|---|
| Mean | Sum / count | No | Yes |
| Median | Middle value when sorted | Yes | No (only position matters) |
| Mode | Most frequent value | Yes | No (only frequency matters) |
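All three measures are one-liners in pandas. A quick check on Daria's data (note that `.mode()` returns a Series, since a dataset can have more than one mode):

```python
import pandas as pd

points = pd.Series([12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31])

print(points.mean())           # 21.545..., the 21.5 we computed, unrounded
print(points.median())         # 22.0
print(points.mode().tolist())  # [22]
```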
In Daria's original data: mean = 21.5, median = 22, mode = 22. They're all close — which makes sense, because the data is roughly symmetric. When a distribution is symmetric, the mean and median are approximately equal.
But when a distribution is skewed? That's where the story gets interesting.
6.2 Mean vs. Median: When Averages Lie
Productive Struggle
Before reading further, try this. Here are monthly rents (in dollars) for 10 apartments in a mid-size city:
850, 900, 925, 950, 975, 1,000, 1,050, 1,100, 1,200, 4,500
Calculate the mean and median. Which one better represents a "typical" rent? Why?
Take two minutes. Seriously — get out a calculator (or use Python: `import numpy as np; data = [850, 900, 925, 950, 975, 1000, 1050, 1100, 1200, 4500]; print(np.mean(data), np.median(data))`). Think about your answer before reading on.
After you've tried it
Mean = $1,345. Median = $987.50. The mean is pulled upward by that one luxury apartment at $4,500. The median — $987.50 — is much closer to what most renters are actually paying. Nine of the ten apartments cost less than $1,200. The mean of $1,345 would lead you to believe rents are much higher than they actually are for a typical renter. The median wins here.
The Classic Example: Income Data
This mean-vs-median tension is most dramatically visible in income data. Let's scale up from apartments to the entire country.
Consider ten households with the following annual incomes:
$30,000 | $35,000 | $42,000 | $48,000 | $55,000 | $62,000 | $70,000 | $85,000 | $110,000 | $1,200,000
- Mean: $173,700
- Median: $58,500
The mean is almost three times the median. Why? That one household earning $1.2 million — maybe a CEO or a successful entrepreneur — drags the mean skyward. Nine out of ten households earn $110,000 or less, but the mean says the "average" household earns $173,700. Nobody here is "average."
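Both figures are easy to verify yourself:

```python
import numpy as np

incomes = [30_000, 35_000, 42_000, 48_000, 55_000, 62_000,
           70_000, 85_000, 110_000, 1_200_000]

print(np.mean(incomes))    # 173700.0
print(np.median(incomes))  # 58500.0
```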
This isn't just an academic exercise. Every time a politician says "average income is rising" while using the mean, they might be describing a world where CEO pay tripled while everyone else's paycheck stayed flat. The median would tell you what happened to the typical family.
Here's the rule you need to remember:
| Situation | Use the... | Why |
|---|---|---|
| Symmetric distribution | Mean | Mean and median are close; mean uses all data |
| Skewed distribution | Median | Median isn't pulled by the tail |
| Outliers present | Median | Median is resistant to extreme values |
| Further calculations needed | Mean | Many statistical formulas are built on the mean |

Memory trick: Right-skewed data → mean > median (tail pulls mean right). Left-skewed data → mean < median (tail pulls mean left). Symmetric → mean ≈ median.
Let's revisit the distribution shapes from Chapter 5. Remember describing histograms as "skewed right" or "skewed left"? Now you have a quantitative way to confirm: compare the mean and median. If the mean is substantially larger than the median, the distribution is likely skewed right. If the mean is substantially smaller, it's likely skewed left.
This connection between shape (Chapter 5) and numerical summaries (this chapter) is no accident. Visualization and summary statistics are two languages for describing the same data. You need both.
6.3 The Weighted Mean: Not All Values Are Created Equal
Sometimes, not every observation deserves equal influence in the average. Your GPA is a perfect example.
Suppose you earn these grades in a semester:
| Course | Credits | Grade Points |
|---|---|---|
| Statistics | 4 | A (4.0) |
| English | 3 | B (3.0) |
| Art History | 2 | A (4.0) |
| Phys Ed | 1 | C (2.0) |
If you just average the grade points: $\frac{4.0 + 3.0 + 4.0 + 2.0}{4} = 3.25$
But that treats a 4-credit statistics course the same as a 1-credit PE class. That doesn't feel right. The statistics course should count more because it's more credits.
Mathematical Formulation: The Weighted Mean
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$
In plain English: Multiply each value by its weight, add up all those products, then divide by the sum of the weights.
$$\bar{x}_w = \frac{(4 \times 4.0) + (3 \times 3.0) + (2 \times 4.0) + (1 \times 2.0)}{4 + 3 + 2 + 1} = \frac{16 + 9 + 8 + 2}{10} = \frac{35}{10} = 3.50$$
Your weighted GPA is 3.50 — higher than the simple average of 3.25, because the courses where you did well happened to carry more weight.
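In Python, `np.average` accepts a `weights` argument that performs exactly this calculation:

```python
import numpy as np

grade_points = [4.0, 3.0, 4.0, 2.0]  # Statistics, English, Art History, Phys Ed
credits = [4, 3, 2, 1]               # the weights

gpa = np.average(grade_points, weights=credits)  # sum(w*x) / sum(w)
print(gpa)  # 3.5
```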
The weighted mean appears everywhere: stock market indices weight companies by market capitalization, inflation calculations weight goods by spending patterns, and overall survey scores often weight responses by demographic representativeness. Whenever values have different levels of importance, you need a weighted mean.
6.4 Measures of Spread: How Much Do Values Vary?
Knowing the center of your data is only half the picture. Two datasets can share the same mean and median but tell completely different stories because their spread is different.
Consider two factories that make batteries. Both produce batteries with an average lifespan of 500 hours. But Factory A's batteries last between 490 and 510 hours (very consistent), while Factory B's batteries last between 200 and 800 hours (wildly variable). Would you rather buy from Factory A or Factory B? Obviously Factory A — even though the "average" is the same.
Spread is the statistical word for this consistency (or inconsistency). And here's where we connect to one of the book's recurring themes: spread is uncertainty (Theme 4). When we measure how spread out values are, we're quantifying how uncertain we are about any given observation. More spread = more uncertainty = harder to predict.
Spaced Review (From Chapter 1 — Statistical Thinking): Remember when we said in Chapter 1 that statistical thinking means seeing variation as the raw material of understanding? This is what we meant. Variation isn't noise to be ignored — it's information to be measured. The tools you're about to learn are how we measure it.
Let's build from the simplest measure of spread to the most important one, using Daria's original scoring data: 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31.
The Range: Simple but Fragile
The range is the distance from the smallest value to the largest:
$$\text{Range} = \text{Maximum} - \text{Minimum} = 31 - 12 = 19$$
Daria's scoring spans 19 points. That's easy to calculate and easy to understand. But the range has a fatal flaw: it depends on only two values — the most extreme ones. If Daria had that 61-point game, the range would jump to 61 - 12 = 49, even though the middle of her data didn't change at all.
The range is the least resistant measure of spread. One outlier can blow it up.
Percentiles and Quartiles: Dividing Data into Chunks
To do better than the range, we need to measure how the middle of the data is spread — which means we need to know where the quartiles are.
A percentile tells you the value below which a certain percentage of the data falls. If you scored at the 85th percentile on the SAT, that means 85% of test-takers scored below you.
Quartiles divide the data into four equal parts:
- Q1 (25th percentile): 25% of values fall below this point
- Q2 (50th percentile): This is the median — 50% below
- Q3 (75th percentile): 75% of values fall below this point
For Daria's sorted data: 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31
- Q2 (median): Position 6 → 22
- Q1: Median of the lower half (12, 15, 18, 19, 21) → 18
- Q3: Median of the upper half (22, 24, 25, 28, 31) → 25
(When finding Q1 and Q3, we split the data at the median. If n is odd, the median value is excluded from both halves.)
The Interquartile Range (IQR): Spread of the Middle 50%
The interquartile range is the distance between Q3 and Q1:
$$\text{IQR} = Q3 - Q1 = 25 - 18 = 7$$
The IQR tells you the range of the middle 50% of the data — the core of the distribution, ignoring the most extreme values on either end. That makes it resistant to outliers, just like the median.
Think of it this way: the range measures the distance from the two most extreme observations. The IQR measures the distance from the two quartile boundaries. It's a much more stable measure of spread because it's based on positions in the sorted data, not on the most extreme values.
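One wrinkle worth knowing before you compute quartiles in Python: `np.percentile` interpolates between data points by default, so its quartiles can differ slightly from the median-of-halves rule we used by hand. A sketch:

```python
import numpy as np

points = [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31]

q1, q3 = np.percentile(points, [25, 75])  # default linear interpolation
iqr = q3 - q1
print(q1, q3, iqr)  # 18.5 24.5 6.0
```

Here numpy reports Q1 = 18.5 and Q3 = 24.5 (IQR = 6) instead of the hand values 18 and 25 (IQR = 7). Both are legitimate quartile conventions; the gap shrinks as datasets grow, and for coursework you should follow whichever rule your instructor specifies.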
Summary so far:
| Measure | Formula | Resistant? | What It Tells You |
|---|---|---|---|
| Range | Max - Min | No | Total spread |
| IQR | Q3 - Q1 | Yes | Spread of the middle 50% |
But neither the range nor the IQR tells you how far a typical observation falls from the center. For that, we need variance and standard deviation.
6.5 Variance and Standard Deviation: Measuring Typical Distance
This section is the most important in the chapter. Standard deviation is the concept you'll use more than any other in this entire course — in confidence intervals, hypothesis tests, regression, and dozens of other contexts. So let's take it slow and build the idea from scratch.
The Question
Here's the question we're trying to answer:
How far does a typical value fall from the mean?
If we can measure that "typical distance," we'll have a powerful summary of spread. Small typical distance = tightly clustered data. Large typical distance = data all over the place.
Attempt 1: Average Distance from the Mean
Let's use a tiny dataset to make the math visible:
Data: 2, 4, 6, 8, 10 → Mean = $\frac{30}{5}$ = 6
For each value, let's compute how far it is from the mean:
| Value ($x$) | Distance from mean ($x - \bar{x}$) |
|---|---|
| 2 | 2 - 6 = -4 |
| 4 | 4 - 6 = -2 |
| 6 | 6 - 6 = 0 |
| 8 | 8 - 6 = +2 |
| 10 | 10 - 6 = +4 |
| Sum | 0 |
The deviations add up to zero. Always. Every time. This is a mathematical fact: the positive distances above the mean exactly cancel out the negative distances below. The mean is, by definition, the balance point.
So simply averaging the deviations gives us zero — useless.
Attempt 2: Square the Deviations
Here's the fix: square each deviation before averaging. Squaring does two things: (1) it makes everything positive, and (2) it gives extra weight to larger deviations — values far from the mean get penalized more.
| Value ($x$) | Deviation ($x - \bar{x}$) | Squared deviation $(x - \bar{x})^2$ |
|---|---|---|
| 2 | -4 | 16 |
| 4 | -2 | 4 |
| 6 | 0 | 0 |
| 8 | +2 | 4 |
| 10 | +4 | 16 |
| Sum | 0 | 40 |
The Variance: Average of Squared Deviations
The variance is the average of those squared deviations — with one catch that we'll explain in a moment.
Mathematical Formulation: Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
In plain English: Subtract the mean from each value, square the result, add up all the squares, then divide by $n - 1$ (one less than the number of observations).
For our dataset:
$$s^2 = \frac{40}{5 - 1} = \frac{40}{4} = 10$$
The variance is 10. But 10 what? The units of variance are the units of the original data squared. If Daria's points are measured in points, the variance is measured in "points squared" — which doesn't mean anything intuitive. Nobody says "Daria's scoring varies by 10 points squared."
That's why we take the square root.
The Standard Deviation: Back to the Right Units
The standard deviation is the square root of the variance:
Mathematical Formulation: Sample Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
In plain English: It's the square root of the variance. This brings the units back to the same scale as the original data.
$$s = \sqrt{10} \approx 3.16$$
The standard deviation is about 3.16 — meaning a typical value in our dataset falls roughly 3.16 units from the mean of 6. That's a meaningful, interpretable number.
🔑 Threshold Concept: Standard Deviation as Typical Distance
Here's the key insight that makes standard deviation click:
The standard deviation measures the typical distance of observations from the mean.
Not the maximum distance. Not the minimum. The typical distance. When someone tells you that the standard deviation of test scores is 12 points, they're saying: "A typical student scored about 12 points away from the class average — some above, some below."
This single idea — standard deviation is typical distance from center — will carry you through the rest of this course. It's the foundation of the Empirical Rule, z-scores, confidence intervals, and hypothesis tests. Every time you see a standard deviation, translate it in your head: "That's how far a typical value falls from the mean."
Think about it with real examples:
- If the average height of women is 5'4" with a standard deviation of 3 inches, then a typical woman's height is within about 3 inches of 5'4" — between about 5'1" and 5'7".
- If the average commute is 26 minutes with a standard deviation of 15 minutes, commute times vary a lot — a typical commuter is anywhere from 11 to 41 minutes.
- If a machine fills soda bottles to an average of 12.00 oz with a standard deviation of 0.02 oz, that machine is incredibly consistent — typical bottles are within a tiny fraction of the target.
Large standard deviation = values spread far from the mean. Small standard deviation = values clustered near the mean. That's it. That's the whole idea.
Why n - 1? The Intuitive Explanation
You might have noticed we divided by $n - 1$ instead of $n$ in the variance formula. This is one of the most-asked questions in introductory statistics, and the full mathematical reason involves a concept called degrees of freedom that you'll see in Chapters 12 and 15. But here's the intuitive version.
When you calculate a sample variance, you're using the sample mean $\bar{x}$ as a stand-in for the true population mean $\mu$. But $\bar{x}$ was calculated from the same data. This means the deviations from $\bar{x}$ are systematically a little smaller than the deviations from $\mu$ would be — because $\bar{x}$ is, by definition, the value that minimizes the sum of squared deviations for this sample.
Dividing by $n - 1$ instead of $n$ corrects for this underestimate. It makes the sample variance an unbiased estimator of the population variance. Think of it as a small penalty for using your data twice — once to calculate the mean, and again to measure distances from that mean.
For large samples (n = 100 or more), dividing by $n$ or $n - 1$ gives nearly the same answer. It only matters for small samples — which is exactly when getting it right matters most.
Notation note: $s^2$ and $s$ are for sample variance and standard deviation. $\sigma^2$ (sigma squared) and $\sigma$ (sigma) are for the population variance and standard deviation — where we would divide by $n$ because we have every value. In this course, you'll almost always work with sample statistics ($s$), because you rarely have the entire population.
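numpy makes the two conventions easy to compare side by side, because (unlike pandas) it divides by $n$ unless you ask otherwise:

```python
import numpy as np

data = [2, 4, 6, 8, 10]  # the tiny dataset from above; sum of squared deviations = 40

print(np.var(data, ddof=1))  # 40 / (n-1) = 10.0, the sample variance
print(np.var(data))          # 40 / n = 8.0, the population variance (numpy's default)
print(np.std(data, ddof=1))  # sqrt(10), about 3.162, the sample standard deviation
```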
6.5.1 Worked Example: Standard Deviation by Hand and in Python
Let's compute the standard deviation for Daria's scoring data step by step.
Data: 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31 → Mean = $\bar{x} = 21.5$ (we calculated this in Section 6.1)
Step 1: Calculate each deviation and its square.
| Game | Points ($x$) | Deviation ($x - 21.5$) | Squared deviation |
|---|---|---|---|
| 1 | 12 | -9.5 | 90.25 |
| 2 | 15 | -6.5 | 42.25 |
| 3 | 18 | -3.5 | 12.25 |
| 4 | 19 | -2.5 | 6.25 |
| 5 | 21 | -0.5 | 0.25 |
| 6 | 22 | +0.5 | 0.25 |
| 7 | 22 | +0.5 | 0.25 |
| 8 | 24 | +2.5 | 6.25 |
| 9 | 25 | +3.5 | 12.25 |
| 10 | 28 | +6.5 | 42.25 |
| 11 | 31 | +9.5 | 90.25 |
| Sum | | 0 | 302.75 |
Step 2: Divide by n - 1.
$$s^2 = \frac{302.75}{11 - 1} = \frac{302.75}{10} = 30.275$$
Step 3: Take the square root.
$$s = \sqrt{30.275} \approx 5.50$$
Interpretation: Daria's standard deviation is about 5.5 points. That means her scoring typically varies by about 5.5 points from her average of 21.5. In most games, you'd expect her to score roughly between 16 and 27 points. She's reasonably consistent — not wildly variable, but not a metronome either.
Now in Python
import pandas as pd
import numpy as np
# Daria's scoring data
points = pd.Series([12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31])
# All at once
print(f"Mean: {points.mean():.2f}")
print(f"Median: {points.median():.2f}")
print(f"Std Dev: {points.std():.2f}") # pandas uses n-1 by default
print(f"Variance: {points.var():.2f}")
Output:
Mean: 21.55
Median: 22.00
Std Dev: 5.50
Variance: 30.27
Important Python note: `pandas` defaults to dividing by $n - 1$ for `.std()` and `.var()`, giving the sample standard deviation and variance. This is almost always what you want; if you ever need the population version (dividing by $n$), use `.std(ddof=0)`. Be careful with `numpy`, though: `np.std()` and `np.var()` divide by $n$ by default, so pass `ddof=1` there to get the sample versions.
And in Excel
| Formula | What It Calculates |
|---|---|
| `=AVERAGE(A1:A11)` | Mean |
| `=MEDIAN(A1:A11)` | Median |
| `=STDEV.S(A1:A11)` | Sample standard deviation (uses n-1) |
| `=STDEV.P(A1:A11)` | Population standard deviation (uses n) |
| `=VAR.S(A1:A11)` | Sample variance |

Use `STDEV.S` (the "S" stands for "sample") in almost all situations. Use `STDEV.P` only if you truly have the entire population — which is rare.
6.6 The Five-Number Summary and Box Plots
You now have two families of summary statistics:
- Measures of center: mean and median
- Measures of spread: range, IQR, and standard deviation
The five-number summary brings several of these together into a compact portrait of a distribution:
The Five-Number Summary:
- Minimum — the smallest value
- Q1 — the first quartile (25th percentile)
- Median (Q2) — the middle value (50th percentile)
- Q3 — the third quartile (75th percentile)
- Maximum — the largest value
For Daria's data:
| Statistic | Value |
|---|---|
| Minimum | 12 |
| Q1 | 18 |
| Median | 22 |
| Q3 | 25 |
| Maximum | 31 |
These five numbers tell you: the lowest score was 12, the highest was 31, the middle score was 22, and the middle half of the data spans from 18 to 25 (an IQR of 7).
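All five numbers come from a single `np.percentile` call (keeping in mind that numpy interpolates quartiles, so its Q1 and Q3 may differ slightly from the median-of-halves values):

```python
import numpy as np

points = [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31]

# min, Q1, median, Q3, max in one call
five = np.percentile(points, [0, 25, 50, 75, 100])
print(five)  # min=12, Q1=18.5, median=22, Q3=24.5, max=31
```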
Building a Box Plot
A box plot (sometimes called a box-and-whisker plot) is a graph of the five-number summary. It's one of the most powerful tools in descriptive statistics, and John Tukey — the same statistician quoted at the start of Chapter 5 — invented it in the 1970s.
Here's how to construct one:
Visual description (box plot construction):
Imagine a horizontal number line running from 10 to 35 (the range of Daria's data).
Draw the box: A rectangle stretching from Q1 (18) to Q3 (25). This box represents the middle 50% of the data — the IQR.
Draw the median line: A vertical line inside the box at the median (22). This divides the box into two parts. Notice the median line is slightly closer to Q3 than to Q1, suggesting slight left skew — or close to symmetric.
Draw the whiskers: Lines extending from each side of the box to the most extreme data points within 1.5 × IQR of the box edges:
- Lower fence: Q1 - 1.5 × IQR = 18 - 1.5(7) = 18 - 10.5 = 7.5
- Upper fence: Q3 + 1.5 × IQR = 25 + 1.5(7) = 25 + 10.5 = 35.5
- The whisker extends to the most extreme value inside the fence. The lower whisker reaches to 12 (the minimum, which is above 7.5); the upper whisker reaches to 31 (the maximum, which is below 35.5).
Mark outliers: Any points beyond the fences are plotted individually as dots. In Daria's data, all values are inside the fences, so there are no outliers.
The result: a compact rectangle from 18 to 25 with a line at 22, whiskers extending to 12 and 31, and no outlier dots.
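The fence arithmetic is worth scripting once to see it end to end. A minimal sketch using the hand-computed quartiles:

```python
q1, q3 = 18, 25  # Daria's quartiles from Section 6.4
iqr = q3 - q1    # 7

lower_fence = q1 - 1.5 * iqr  # 18 - 10.5 = 7.5
upper_fence = q3 + 1.5 * iqr  # 25 + 10.5 = 35.5

points = [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31]
outliers = [x for x in points if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # 7.5 35.5 []
```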
Reading a Box Plot
A box plot tells you five things at a glance:
- Center: The median line inside the box.
- Spread: The width of the box (IQR) and the span of the whiskers (overall range).
- Skewness: If the median line is centered in the box and the whiskers are roughly equal length, the distribution is symmetric. If the median is off-center or one whisker is much longer, the distribution is skewed.
- Outliers: Any dots beyond the whiskers are potential outliers.
- Comparisons: Box plots shine when you display multiple groups side by side — we'll use this throughout the course.
Box Plots in Python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Daria's data
points = [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31]
df = pd.DataFrame({'points': points})
# Horizontal box plot
plt.figure(figsize=(8, 3))
sns.boxplot(data=df, x='points', color='steelblue')
plt.title("Daria's Points Per Game")
plt.xlabel('Points')
plt.tight_layout()
plt.show()
Comparing Groups with Side-by-Side Box Plots
The real power of box plots comes when you compare distributions. Let's say Sam wants to compare scoring across three players:
# Comparison data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {
'player': ['Daria']*11 + ['Marcus']*11 + ['Jess']*11,
'points': [12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31, # Daria
20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, # Marcus
5, 8, 12, 15, 20, 22, 25, 28, 33, 35, 42] # Jess
}
df = pd.DataFrame(data)
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='points', y='player', palette='Set2')
plt.title('Points Per Game Comparison')
plt.xlabel('Points')
plt.ylabel('')
plt.tight_layout()
plt.show()
Visual description (side-by-side box plots): Three horizontal box plots stacked vertically. Daria: box from about 18-25, median at 22, whiskers from 12-31. Moderate spread. Marcus: a narrow box from about 21-24, median at 22, whiskers from 20-25. Very compact — Marcus is extremely consistent. Jess: a wide box from about 12-33, median at 22, whiskers from 5-42. Jess is a streaky scorer — some big games and some duds.
All three players have the same median (22), but their box plots look completely different. This is Theme 2 in action: averages can hide stories. The median alone would tell you these players are identical. The box plots reveal they're dramatically different.
6.7 The Empirical Rule (68-95-99.7 Rule)
Now let's connect the standard deviation to a specific kind of distribution: the bell-shaped, symmetric distribution that you saw glimpses of in Chapter 5. If a dataset is approximately bell-shaped (what we'll later call normal in Chapter 10), then the standard deviation follows a remarkably precise pattern.
This pattern is called the Empirical Rule, and it goes like this:
The Empirical Rule (68-95-99.7 Rule)
For a distribution that is approximately bell-shaped and symmetric:
- About 68% of observations fall within 1 standard deviation of the mean ($\bar{x} \pm 1s$)
- About 95% of observations fall within 2 standard deviations of the mean ($\bar{x} \pm 2s$)
- About 99.7% of observations fall within 3 standard deviations of the mean ($\bar{x} \pm 3s$)
Visual description (the Empirical Rule bell curve):
Imagine a smooth, symmetric bell-shaped curve centered on the mean $\bar{x}$.
- The area between $\bar{x} - 1s$ and $\bar{x} + 1s$ (the central region) is shaded darkly, representing 68% of the data. This is the largest chunk — about two-thirds.
- The area between $\bar{x} - 2s$ and $\bar{x} + 2s$ (a wider region) is shaded more lightly, adding another 27% (for a total of 95%). Nearly all the data falls here.
- The area between $\bar{x} - 3s$ and $\bar{x} + 3s$ (nearly the entire curve) is shaded faintest, adding a final 4.7% (for a total of 99.7%). Only about 3 observations in 1,000 fall beyond 3 standard deviations.
- The tiny tails beyond $\bar{x} \pm 3s$ contain just 0.3% of the data — roughly 3 in every 1,000 observations.
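You can check these percentages yourself with a quick simulation; here is a sketch that draws a large bell-shaped sample and counts how much of it falls within each band (the sample size and seed are arbitrary choices):

```python
import numpy as np

# Simulate a large bell-shaped sample (mean 0, SD 1; seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100_000)

mean, sd = data.mean(), data.std(ddof=1)
for k in (1, 2, 3):
    # Fraction of observations within k standard deviations of the mean
    within = np.mean(np.abs(data - mean) <= k * sd)
    print(f"Within {k} SD: {within:.1%}")
```

With a sample this large, the three printed fractions land very close to 68%, 95%, and 99.7%.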
Applying the Empirical Rule
Let's use Dr. Maya Chen's data. She's tracking the body temperatures of 500 healthy adults and finds the distribution is approximately bell-shaped with:
- Mean: $\bar{x}$ = 98.2°F
- Standard deviation: $s$ = 0.7°F
Using the Empirical Rule:
| Range | Calculation | Interval | % of adults |
|---|---|---|---|
| Within 1 SD | 98.2 ± 0.7 | 97.5°F to 98.9°F | About 68% (~340 people) |
| Within 2 SD | 98.2 ± 1.4 | 96.8°F to 99.6°F | About 95% (~475 people) |
| Within 3 SD | 98.2 ± 2.1 | 96.1°F to 100.3°F | About 99.7% (~499 people) |
So about 68% of healthy adults have temperatures between 97.5°F and 98.9°F. About 95% fall between 96.8°F and 99.6°F. And almost everyone (99.7%) falls between 96.1°F and 100.3°F.
If a patient comes in with a temperature of 101.5°F, you can immediately see that's unusual: (101.5 - 98.2)/0.7 ≈ 4.7 standard deviations above the mean — something we'd expect to see in far fewer than 0.15% of healthy adults. That temperature is almost certainly abnormal. The Empirical Rule just gave Maya a quick-and-dirty diagnostic tool, without running a single statistical test.
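These intervals are quick to reproduce in Python; a minimal sketch with Maya's mean and standard deviation hard-coded from above:

```python
mean, s = 98.2, 0.7  # Maya's sample statistics

# The three Empirical Rule intervals
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    print(f"Within {k} SD (~{pct}%): {mean - k*s:.1f}°F to {mean + k*s:.1f}°F")

# How far above the mean is a 101.5°F reading, in standard deviations?
sds_above = (101.5 - mean) / s
print(f"101.5°F is {sds_above:.1f} SDs above the mean")
```

The last line prints about 4.7 — far outside the 3-SD band that contains 99.7% of healthy adults.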
Critical warning: The Empirical Rule only works for bell-shaped, approximately symmetric distributions. It does NOT work for skewed distributions, bimodal distributions, or uniform distributions. Before applying it, check the shape of your data with a histogram. If the histogram doesn't look roughly bell-shaped, the Empirical Rule will give you wrong answers.
In Chapter 10, we'll formalize this with the normal distribution — a mathematical model that perfectly follows the 68-95-99.7 pattern.
6.8 Z-Scores: How Many Standard Deviations from the Mean?
The Empirical Rule lets you think in terms of standard deviations. The z-score makes that thinking precise.
A z-score tells you exactly how many standard deviations a value is from the mean.
Mathematical Formulation: Z-Score
$$z = \frac{x - \bar{x}}{s}$$
In plain English: Take the value, subtract the mean, and divide by the standard deviation. The result tells you how many standard deviations the value is above (+) or below (-) the mean.
Interpreting Z-Scores
| Z-score | Interpretation |
|---|---|
| $z = 0$ | The value equals the mean |
| $z = +1$ | The value is 1 standard deviation above the mean |
| $z = -1$ | The value is 1 standard deviation below the mean |
| $z = +2$ | The value is 2 standard deviations above the mean (unusual) |
| $z = -2.5$ | The value is 2.5 standard deviations below the mean (very unusual) |
Example: Comparing Across Different Scales
Z-scores solve a problem that raw numbers can't: comparing values from different distributions.
Alex Rivera is analyzing two metrics for StreamVibe content:
- Watch time has a mean of 28 minutes and a standard deviation of 12 minutes.
- Engagement score has a mean of 65 points and a standard deviation of 15 points.
A particular video has a watch time of 46 minutes and an engagement score of 92 points. Which metric is more impressive — the watch time or the engagement?
You can't compare 46 minutes to 92 points directly. They're on different scales. But z-scores put them on the same scale:
$$z_{\text{watch time}} = \frac{46 - 28}{12} = \frac{18}{12} = 1.50$$
$$z_{\text{engagement}} = \frac{92 - 65}{15} = \frac{27}{15} = 1.80$$
The engagement score ($z = 1.80$) is more impressive than the watch time ($z = 1.50$) relative to their respective distributions. The video's engagement is 1.8 standard deviations above average, while its watch time is 1.5 standard deviations above average.
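The two calculations above take only a few lines to reproduce; a sketch with Alex's means and standard deviations hard-coded (the `z_score` helper is illustrative, not a library function):

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_watch = z_score(46, mean=28, sd=12)   # watch time distribution
z_engage = z_score(92, mean=65, sd=15)  # engagement distribution
print(f"Watch time:  z = {z_watch:.2f}")   # 1.50
print(f"Engagement:  z = {z_engage:.2f}")  # 1.80
```

Because both results are in the same unit, standard deviations from the mean, the comparison is now fair.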
This is why z-scores are sometimes called standardized scores — they strip away the original units and put everything on the same "standard deviations from the mean" scale. You'll use z-scores throughout this course — they're the foundation of hypothesis testing in Chapter 13.
Z-Scores and the Empirical Rule
Combining z-scores with the Empirical Rule gives you a quick way to assess whether a value is unusual:
| $\lvert z \rvert$ value | Using the Empirical Rule... |
|:---:|---|
| $< 1$ | Within 1 SD — very common (about 68% of data falls here) |
| 1 to 2 | Between 1 and 2 SDs — somewhat unusual but not rare |
| 2 to 3 | Between 2 and 3 SDs — unusual (only about 5% of data falls beyond 2 SDs) |
| $> 3$ | Beyond 3 SDs — very unusual (only about 0.3% of data falls here) |
Professor Washington could use this framework, too. If a predictive policing algorithm assigns a "risk score" with a mean of 50 and standard deviation of 10, then a score of 85 has a z-score of (85 - 50)/10 = 3.5 — more than 3.5 standard deviations above the mean. That's an extreme score, flagging the individual as a dramatic outlier. But is the algorithm right to flag them, or is the algorithm broken? Understanding how unusual a z-score is gives Washington a quantitative basis for questioning the algorithm's output.
Spaced Review (From Chapter 4 — Confounding Variables): Remember confounding variables from Chapter 4 — variables that distort the relationship between the things you're studying? When you see spread in data, ask yourself: What's causing this variation? If body temperatures have a standard deviation of 0.7°F, some of that variation might come from the time of day the temperature was measured, whether the person just exercised, or even the accuracy of the thermometer. Identifying sources of variation connects directly to identifying potential confounders — both are about understanding why values differ.
6.9 Detecting Outliers: The IQR Method and Z-Score Method
In Chapter 5, you learned to spot outliers visually using histograms. Now let's formalize outlier detection with two quantitative methods.
Method 1: The IQR Method (Tukey's Fences)
This is the method used by box plots. A value is a potential outlier if it falls below the lower fence or above the upper fence:
IQR Outlier Detection:
$$\text{Lower fence} = Q1 - 1.5 \times \text{IQR}$$

$$\text{Upper fence} = Q3 + 1.5 \times \text{IQR}$$
Any value below the lower fence or above the upper fence is flagged as a potential outlier.
Let's test it on a dataset with an obvious outlier. Suppose we add a value to Daria's data — a game where she only scored 1 point (maybe she got injured early):
Data: 1, 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31
- Q1 = 16.5, Q3 = 24.5 (recalculated for 12 values)
- IQR = 24.5 - 16.5 = 8
- Lower fence = 16.5 - 1.5(8) = 16.5 - 12 = 4.5
- Upper fence = 24.5 + 1.5(8) = 24.5 + 12 = 36.5
The value 1 falls below the lower fence of 4.5 → outlier detected! Everything else is within the fences.
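Here's how this check looks in pandas. One caveat: `quantile` uses linear interpolation by default, which gives slightly different quartiles (Q1 = 17.25, Q3 = 24.25) than the median-of-halves method used in the hand calculation above, but it flags the same outlier:

```python
import pandas as pd

points = pd.Series([1, 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31])

# Tukey's fences: 1.5 IQRs beyond the quartiles
q1, q3 = points.quantile(0.25), points.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = points[(points < lower_fence) | (points > upper_fence)]
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers.tolist()}")  # [1]
```

Different quartile conventions move the fences a little, but genuinely extreme values like the 1-point game get flagged either way.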
Why 1.5?
Why multiply by 1.5 and not 2 or 1 or some other number? John Tukey, who invented this method, chose 1.5 because it works well in practice: for a bell-shaped distribution, the fences capture about 99.3% of the data, meaning only about 0.7% of values are flagged as outliers. That's strict enough to catch genuinely unusual values without flagging too many ordinary ones.
Method 2: The Z-Score Method
A value is a potential outlier if its z-score is beyond $\pm 2$ (some textbooks use $\pm 3$):
Z-Score Outlier Detection:
A value is a potential outlier if $|z| > 2$ (moderate threshold) or $|z| > 3$ (strict threshold).
For Daria's 1-point game:
Using the full 12-game dataset (mean ≈ 19.8, s ≈ 7.9):

$$z = \frac{1 - 19.8}{7.9} = \frac{-18.8}{7.9} \approx -2.38$$

With $|z| = 2.38 > 2$, the 1-point game is flagged as a potential outlier.
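The same flag falls out of a short computation in pandas (whose `.std()` uses the sample standard deviation, with $n - 1$, by default):

```python
import pandas as pd

points = pd.Series([1, 12, 15, 18, 19, 21, 22, 22, 24, 25, 28, 31])

# z-score for every game, using the sample mean and sample SD
z = (points - points.mean()) / points.std()
flagged = points[z.abs() > 2]  # moderate threshold
print(f"Mean = {points.mean():.1f}, SD = {points.std():.1f}")
print(f"Flagged (|z| > 2): {flagged.tolist()}")  # [1]
```

Only the 1-point game crosses the threshold; every other game sits well within 2 standard deviations of the mean.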
Comparing the Two Methods
| Feature | IQR Method | Z-Score Method |
|---|---|---|
| Based on | Quartiles (resistant) | Mean and SD (not resistant) |
| Works for skewed data? | Yes | Better for symmetric data |
| Threshold | 1.5 × IQR beyond Q1/Q3 | Typically $|z| > 2$ or $|z| > 3$ |
| Used by | Box plots | Many statistical procedures |
The IQR method is more robust because it's based on resistant measures (Q1, Q3). The z-score method is more common in later statistical procedures. In practice, use both and see if they agree.
What to Do with Outliers
Detecting an outlier is step one. The harder question is: What do you do about it?
- Investigate first. Is it a data entry error? (Someone typed 1 instead of 21?) If so, fix it.
- Consider the context. Is it a legitimate extreme value? (Daria really did get injured and only scored 1 point?) If so, it's real data — removing it hides the truth.
- Report both ways. Calculate your statistics with and without the outlier. If the conclusions change dramatically, discuss why.
- Never delete an outlier just because it's inconvenient. That's the kind of thing we'll discuss in Chapter 27 (ethical data practice).
6.10 Putting It All Together: The Complete Summary
Let's bring everything together by summarizing Alex Rivera's StreamVibe watch-time data using Python's .describe() function — the same function you first met in Chapter 3.
```python
import pandas as pd
import numpy as np

# Simulated StreamVibe watch time data (5000 sessions)
np.random.seed(42)
watch_time = pd.Series(
    np.concatenate([
        np.random.exponential(scale=25, size=4500),    # Most sessions
        np.random.normal(loc=120, scale=20, size=500)  # Binge-watchers
    ]),
    name='watch_time_minutes'
)

# The complete summary
print(watch_time.describe())
```
Output:
```
count    5000.000000
mean       34.420000
std        30.150000
min         0.120000
25%        11.500000
50%        24.700000
75%        43.200000
max       178.300000
Name: watch_time_minutes, dtype: float64
```
Let's read this:
- count: 5,000 sessions (no missing data)
- mean: 34.4 minutes (pulled up by binge-watchers)
- std: 30.2 minutes (huge variation — typical sessions range widely from the mean)
- min/max: Sessions range from 0.12 to 178.3 minutes (range = 178.2 minutes)
- 25%/50%/75%: Q1 = 11.5, median = 24.7, Q3 = 43.2 (IQR = 31.7 minutes)
Notice that the mean (34.4) is higher than the median (24.7). This tells us the distribution is skewed right — those binge-watching sessions pull the mean toward the tail. If someone asked "How long does a typical session last?" the median of 24.7 minutes is the better answer. The mean is inflated by extreme values.
Mean-Median Comparison Summary
```python
# Quick comparison
print(f"Mean: {watch_time.mean():.1f} minutes")
print(f"Median: {watch_time.median():.1f} minutes")
print(f"Ratio: mean is {watch_time.mean() / watch_time.median():.1f}x the median")
print(f"\nSince mean > median, the distribution is skewed RIGHT.")
```
Output:
```
Mean: 34.4 minutes
Median: 24.7 minutes
Ratio: mean is 1.4x the median

Since mean > median, the distribution is skewed RIGHT.
```
Box Plot with Outlier Detection
```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Histogram for shape
sns.histplot(watch_time, bins=40, color='steelblue',
             edgecolor='white', ax=axes[0])
axes[0].axvline(watch_time.mean(), color='red', linestyle='--',
                label=f'Mean = {watch_time.mean():.1f}')
axes[0].axvline(watch_time.median(), color='green', linestyle='-',
                label=f'Median = {watch_time.median():.1f}')
axes[0].legend()
axes[0].set_title('Distribution of Watch Time')
axes[0].set_xlabel('Minutes')

# Box plot for summary
sns.boxplot(x=watch_time, color='steelblue', ax=axes[1])
axes[1].set_title('Box Plot of Watch Time')
axes[1].set_xlabel('Minutes')

plt.tight_layout()
plt.show()
```
Visual description (histogram and box plot side by side): The left panel shows a right-skewed histogram of watch times, with a tall peak near 10-20 minutes and a long tail extending past 120 minutes. A dashed red vertical line marks the mean (34.4) and a solid green line marks the median (24.7) — the mean is pulled to the right of the median by the long tail. The right panel shows a horizontal box plot: the box spans from about 11.5 to 43 minutes with the median line at about 25 minutes. The right whisker extends further than the left. Several dots appear beyond the right whisker, representing outlier binge-watching sessions exceeding about 90 minutes.
This side-by-side view — histogram for shape, box plot for summary — is the gold standard for exploratory data analysis. Use it every time you explore a new numerical variable.
6.11 Which Summary Statistics to Use: A Decision Guide
With so many numbers available, how do you choose? Here's a practical guide:
Decision Guide: Choosing Summary Statistics
```
Is the distribution approximately symmetric with no major outliers?
│
├── YES → Report MEAN and STANDARD DEVIATION
│         (Mean uses all the data; SD pairs naturally with it)
│
└── NO (skewed or has outliers) → Report MEDIAN and IQR
          (Both are resistant to outliers and skewness)
```

Always also report: the five-number summary and a box plot (or histogram). No single set of numbers tells the whole story — remember Maya's bimodal flu data from Chapter 5.
| Situation | Center | Spread | Why |
|---|---|---|---|
| Symmetric, no outliers | Mean | Standard deviation | Mean uses all data; SD quantifies "typical distance" |
| Skewed or outliers | Median | IQR | Both resistant to extreme values |
| Comparing groups | Either (but be consistent) | Either (match center choice) | Use the same measure for all groups |
| Reporting to a general audience | Median | Range or IQR | More intuitive; less misleading |
| Further statistical analysis | Mean | Standard deviation | Most inferential methods are built on these |
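If you want to automate the first two rows of this guide, a rough heuristic can compare the mean to the median and check for IQR outliers. A sketch only: the `recommend_summaries` helper and its skewness threshold are illustrative choices, not a standard rule:

```python
import pandas as pd

def recommend_summaries(s: pd.Series) -> str:
    """Rough heuristic: median/IQR for skewed-or-outlier data, else mean/SD."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # Any values beyond Tukey's fences?
    has_outliers = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).any()
    # "Substantially different" mean vs. median: arbitrary illustrative cutoff
    skewed = iqr > 0 and abs(s.mean() - s.median()) > 0.2 * iqr
    if skewed or has_outliers:
        return "median and IQR"
    return "mean and standard deviation"

symmetric = pd.Series([18, 19, 20, 20, 21, 21, 22, 22, 23, 24])
right_skewed = pd.Series([10, 11, 12, 12, 13, 14, 15, 16, 45, 80])
print(recommend_summaries(symmetric))     # mean and standard deviation
print(recommend_summaries(right_skewed))  # median and IQR
```

A helper like this is a starting point, never a substitute for actually looking at the histogram.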
6.12 All Functions at a Glance: Python and Excel
Python Reference
```python
import pandas as pd

# Assume df is a DataFrame, 'col' is a numerical column

# Measures of center
df['col'].mean()    # Mean
df['col'].median()  # Median
df['col'].mode()    # Mode (may return multiple values)

# Measures of spread
df['col'].std()  # Sample standard deviation (n-1)
df['col'].var()  # Sample variance (n-1)
df['col'].max() - df['col'].min()                    # Range
df['col'].quantile(0.75) - df['col'].quantile(0.25)  # IQR

# Quartiles and percentiles
df['col'].quantile(0.25)  # Q1
df['col'].quantile(0.50)  # Q2 (median)
df['col'].quantile(0.75)  # Q3
df['col'].quantile(0.90)  # 90th percentile

# The five-number summary (and more)
df['col'].describe()  # count, mean, std, min, Q1, median, Q3, max

# Z-scores
from scipy import stats
z_scores = stats.zscore(df['col'])  # note: uses population SD (ddof=0) by default
# Or manually, with the sample SD:
z_scores = (df['col'] - df['col'].mean()) / df['col'].std()

# Box plots
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(data=df, x='col')
plt.show()
```
Excel Reference
| Calculation | Excel Formula |
|---|---|
| Mean | =AVERAGE(A1:A100) |
| Median | =MEDIAN(A1:A100) |
| Mode | =MODE.SNGL(A1:A100) |
| Sample standard deviation | =STDEV.S(A1:A100) |
| Population standard deviation | =STDEV.P(A1:A100) |
| Sample variance | =VAR.S(A1:A100) |
| Minimum | =MIN(A1:A100) |
| Maximum | =MAX(A1:A100) |
| Range | =MAX(A1:A100)-MIN(A1:A100) |
| Q1 (25th percentile) | =QUARTILE.INC(A1:A100, 1) |
| Median (50th percentile) | =QUARTILE.INC(A1:A100, 2) |
| Q3 (75th percentile) | =QUARTILE.INC(A1:A100, 3) |
| Any percentile | =PERCENTILE.INC(A1:A100, 0.90) |
| Weighted mean | =SUMPRODUCT(values, weights)/SUM(weights) |
6.13 Project Checkpoint: Your Turn
DATA DETECTIVE PORTFOLIO — Chapter 6
It's time to quantify the distributions you visualized in Chapter 5. You already have histograms for your numerical variables — now you'll put precise numbers on what you see.
Your tasks:
1. **Compute summary statistics** for at least two numerical variables in your dataset:
   - Mean and median
   - Standard deviation and IQR
   - Five-number summary
2. **Compare mean and median** for each variable. If they differ substantially, explain why (skewness? outliers?).
3. **Create box plots** for your numerical variables. If your dataset has a grouping variable, create side-by-side box plots to compare distributions across groups.
4. **Detect outliers** using the IQR method. How many outliers does each variable have? Are they data errors or legitimate extreme values?
5. **Apply the Empirical Rule** (if appropriate). Check whether any of your numerical variables are approximately bell-shaped. If so, verify that roughly 68% of values fall within 1 standard deviation of the mean.
Code template to get started:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Summary statistics for a numerical column
col = 'your_numerical_column'
print(f"--- Summary for {col} ---")
print(f"Mean: {df[col].mean():.2f}")
print(f"Median: {df[col].median():.2f}")
print(f"Std Dev: {df[col].std():.2f}")
print(f"IQR: {df[col].quantile(0.75) - df[col].quantile(0.25):.2f}")
print(f"\nFive-number summary:")
print(df[col].describe()[['min', '25%', '50%', '75%', 'max']])

# Compare mean and median
if df[col].mean() > df[col].median():
    print(f"\nMean > Median → likely skewed RIGHT")
elif df[col].mean() < df[col].median():
    print(f"\nMean < Median → likely skewed LEFT")
else:
    print(f"\nMean ≈ Median → approximately symmetric")

# Box plot
plt.figure(figsize=(8, 4))
sns.boxplot(data=df, x=col, color='steelblue')
plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()

# Outlier detection (IQR method)
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_fence) | (df[col] > upper_fence)]
print(f"\nOutliers detected: {len(outliers)}")
print(f"Lower fence: {lower_fence:.2f}, Upper fence: {upper_fence:.2f}")
```
Suggested variables by dataset:
- CDC BRFSS: `_BMI5` (BMI — likely skewed right), `SLEPTIM1` (sleep hours — try the Empirical Rule)
- Gapminder: `gdpPercap` (GDP per capita — extremely skewed right), `lifeExp` (life expectancy — try comparing by continent with side-by-side box plots)
- College Scorecard: `ADM_RATE` (admission rate), `SAT_AVG` (SAT scores — likely bell-shaped)
- World Happiness Report: `Happiness Score` (approximately symmetric — good for Empirical Rule)
- NOAA Climate: Temperature (try comparing summer vs. winter box plots)
6.14 Spaced Review: Strengthening Previous Learning
These questions revisit concepts from earlier chapters at expanding intervals, helping you build long-term retention.
SR.1 (From Chapter 2 — Categorical vs. Numerical Variables): All the summary statistics in this chapter — mean, median, standard deviation, IQR, box plots — apply to numerical variables only. Explain why you can't calculate a mean or standard deviation for a categorical variable. What summary would you use instead for a categorical variable?
Check your thinking
Categorical variables represent groups or categories (e.g., blood type, political party, device type). You can't add up "Democrat + Republican + Independent" and divide by 3 — the arithmetic operations that define the mean don't make sense for categories. Similarly, "distance from the mean" has no meaning when there's no numerical scale. For categorical variables, you summarize with **frequencies** (counts) and **relative frequencies** (proportions/percentages). The only "center" measure that works for categorical data is the **mode** — the most common category. This is why Chapter 2's variable classification matters so much: the type of variable determines which summaries are appropriate.

SR.2 (From Chapter 4 — Confounding and Variation): In this chapter, you learned that standard deviation measures how much values vary from the mean. In Chapter 4, you learned about confounding variables — hidden variables that create misleading patterns. Explain how variation in data might be caused by confounding variables, using a concrete example.
Check your thinking
Suppose you measure the standard deviation of daily ice cream sales across a year and find it's very large — sales vary wildly. Some of that variation is driven by a confounding variable: *temperature*. Hot days have high sales; cold days have low sales. The temperature confounds the relationship between "day" and "sales" and also drives the variation you observe. If you measured ice cream sales only during summer months (controlling for temperature), the standard deviation would shrink — because you've removed the confounding source of variation. Understanding where variation *comes from* is crucial for both good statistics (Chapter 6) and good study design (Chapter 4). In Chapter 22, you'll learn regression, which lets you statistically account for confounding variables rather than physically controlling them.

SR.3 (From Chapter 1 — Statistical Thinking): In Chapter 1, we said that statistical thinking means seeing variation as the raw material of understanding. Now that you've learned to quantify variation with standard deviation and IQR, give an example of how understanding variation (not just averages) changes a real-world decision.
Check your thinking
Consider choosing between two commute routes. Route A averages 25 minutes with a standard deviation of 3 minutes. Route B also averages 25 minutes but with a standard deviation of 15 minutes. If you only looked at the mean, the routes seem identical. But the *variation* tells a different story: Route A is highly predictable (you'll almost always arrive in 22-28 minutes), while Route B is a gamble (you might get there in 10 minutes or it might take 40). If you're heading to a job interview, you'd choose Route A — the one with less variation, less uncertainty. If you're just driving to the grocery store and don't mind the occasional long trip, Route B might be fine if the short trips are worth it. This is statistical thinking: the mean tells you what to expect *on average*, but the standard deviation tells you *how reliable that expectation is*. Variation IS uncertainty (Theme 4), and quantifying it changes decisions.

Chapter Summary
The Big Ideas
- **Measures of center** — mean, median, and mode — each answer the question "What's the typical value?" in different ways. The mean uses every value but is sensitive to outliers. The median is resistant to outliers. The mode identifies the most common value.
- **The mean vs. median comparison** reveals skewness. When the mean is substantially larger than the median, the distribution is skewed right. When the mean is substantially smaller, it's skewed left. When they're close, the distribution is approximately symmetric.
- **Measures of spread** — range, IQR, variance, and standard deviation — quantify how much values differ from each other. Standard deviation is the most important: it measures the typical distance of values from the mean.
- **The five-number summary and box plot** provide a compact visual of a distribution's center, spread, and potential outliers. Box plots are especially powerful for comparing groups.
- **The Empirical Rule (68-95-99.7)** describes how data clusters around the mean in bell-shaped distributions. Combined with z-scores, it lets you assess whether any value is unusual.
Key Terms
| Term | Definition |
|---|---|
| Mean | Sum of all values divided by the number of values — the "balance point" |
| Median | The middle value when data is sorted — the 50th percentile |
| Mode | The most frequently occurring value |
| Range | Maximum minus minimum — total spread of the data |
| Interquartile range (IQR) | Q3 minus Q1 — spread of the middle 50% |
| Variance | Average of squared deviations from the mean (using $n - 1$) |
| Standard deviation | Square root of the variance — typical distance from the mean |
| Five-number summary | Min, Q1, Median, Q3, Max |
| Box plot | Graph of the five-number summary with whiskers and outlier markers |
| Percentile | The value below which a given percentage of data falls |
| Quartile | Values that divide data into four equal parts (Q1, Q2, Q3) |
| Z-score | Number of standard deviations a value is from the mean |
| Empirical Rule | In bell-shaped distributions: 68% within 1 SD, 95% within 2 SDs, 99.7% within 3 SDs |
| Resistant measure | A statistic not heavily influenced by extreme values (e.g., median, IQR) |
| Weighted mean | A mean where each value is multiplied by a weight before averaging |
Check Your Understanding — Final Retrieval Practice (try to answer without scrolling up)
- What is the difference between a resistant measure and a non-resistant measure? Give one example of each.
- When the mean is greater than the median, what does that tell you about the shape of the distribution?
- In your own words, what does standard deviation measure?
- State the Empirical Rule. What kind of distributions does it apply to?
- How do you detect outliers using the IQR method?
Check your thinking
- A resistant measure is not heavily affected by outliers or extreme values (examples: median, IQR). A non-resistant measure is pulled by extreme values (examples: mean, range, standard deviation).
- Mean > median indicates the distribution is skewed right — a long tail to the right is pulling the mean above the median.
- Standard deviation measures the typical distance of values from the mean. It tells you how spread out the data is around the center.
- For bell-shaped, symmetric distributions: about 68% of data falls within 1 SD of the mean, about 95% within 2 SDs, and about 99.7% within 3 SDs.
- Calculate Q1 and Q3, find IQR = Q3 - Q1, then compute the fences: Lower = Q1 - 1.5 × IQR, Upper = Q3 + 1.5 × IQR. Any value outside these fences is a potential outlier.
What's Next
You can now quantify data. You know how to measure center (mean, median), spread (range, IQR, standard deviation), and detect outliers (IQR method, z-scores). Combined with the visualization skills from Chapter 5, you have a complete toolkit for exploratory data analysis.
In Chapter 7: Data Wrangling, you'll face the messy reality of real data — missing values, duplicates, inconsistent formatting, and the judgment calls required to clean it all up. The summary statistics you learned here will help you diagnose data quality issues: a mean that doesn't make sense, a standard deviation that's suspiciously large, an outlier that turns out to be a typo.
In Chapter 8: Probability, you'll take the first step from describing data to making inferences about populations — and you'll see how the variation you measured in this chapter connects to the uncertainty you'll quantify with probability.
And in Chapter 10: Probability Distributions and the Normal Curve, the Empirical Rule will evolve from a rough guideline into a precise mathematical model — the normal distribution — that lets you calculate exact probabilities for any z-score.
The journey from "What's in my data?" (Chapters 5-6) to "What can my data tell me about the world?" (Chapters 11+) passes through the terrain of probability. You've built the foundation. Now the adventure gets really interesting.