Learning Objectives
- Compute and interpret mean, median, and mode as measures of center
- Compute and interpret variance, standard deviation, and IQR as measures of spread
- Describe distribution shape using terms like symmetric, skewed, and bimodal
- Detect outliers using IQR fences and z-scores
- Choose appropriate summary statistics based on distribution shape
In This Chapter
- Chapter Overview
- 19.1 Why One Number Is Never Enough
- 19.2 Measuring the Center: Mean, Median, and Mode
- 19.3 Measuring Spread: How Much Do Values Vary?
- 19.4 Distribution Shape: The Third Dimension
- 19.5 Detecting Outliers: When Values Don't Belong
- 19.6 Putting It All Together: A Descriptive Statistics Workflow
- 19.7 Choosing the Right Summary Statistics
- 19.8 The Progressive Project: Vaccination Rate Statistics by Income Group
- 19.9 Common Pandas Methods for Descriptive Statistics
- 19.10 The Anscombe's Quartet Warning
- 19.11 Chapter Summary
- Looking Ahead
- Connections to Previous and Future Chapters
Chapter 19: Descriptive Statistics — Center, Spread, Shape, and the Stories Numbers Tell
"The average human has one breast and one testicle." — Des MacHale, mathematician
Chapter Overview
That joke usually gets a laugh, and it should — because it perfectly illustrates something important. The "average" can be technically correct and completely useless at the same time. The average doesn't describe anyone. It erases the very differences that make the data interesting.
Welcome to descriptive statistics. This is the chapter where we learn to summarize data with numbers — and, just as importantly, where we learn that summarizing data badly can be worse than not summarizing it at all.
You've already done some of this informally. Back in Chapter 6, when you first explored a dataset, you probably used .describe() in pandas and got back a table full of means, standard deviations, and percentiles. At the time, you might have skimmed those numbers without thinking too hard about what they meant. That changes today.
By the end of this chapter, you'll know not just how to compute these numbers, but how to interpret them — when to trust the mean and when it lies, what standard deviation actually measures (hint: it's not as scary as it sounds), and why the shape of your data matters at least as much as any single number.
In this chapter, you will learn to:
- Compute and interpret mean, median, and mode as measures of center (all paths)
- Compute and interpret variance, standard deviation, and IQR as measures of spread (all paths)
- Describe distribution shape using terms like symmetric, skewed, and bimodal (all paths)
- Detect outliers using IQR fences and z-scores (standard + deep dive paths)
- Choose appropriate summary statistics based on distribution shape (all paths)
Note — Learning path annotations: Objectives marked (all paths) are essential for every reader. Those marked (standard + deep dive) can be skimmed on the Fast Track but are important for deeper understanding. See "How to Use This Book" for full path descriptions.
19.1 Why One Number Is Never Enough
Let me tell you about two cities.
City A has an average household income of $75,000. City B also has an average household income of $75,000.
Same number. Same average. Are these cities the same?
Not even close. City A is a middle-class suburb where most families earn between $60,000 and $90,000. City B is a deeply divided place where half the population lives on $30,000 a year and the other half earns $120,000, with a handful of millionaires pulling the average up. The average income is identical, but life in these cities is profoundly different.
This is why we need more than one number. A single summary statistic — no matter how carefully computed — can hide as much as it reveals. To actually understand a dataset, we need to answer three questions:
- Where is the center? (What's "typical"?)
- How spread out is the data? (How much do values vary?)
- What shape does the data make? (Is it symmetric? Lopsided? Clumpy?)
These three questions are the backbone of descriptive statistics, and they're the backbone of this chapter. Let's take them one at a time.
19.2 Measuring the Center: Mean, Median, and Mode
When someone asks "what's the typical value?", they're asking about the center of the data. But "center" is a slippery word. There are several ways to define it, and they don't always agree.
The Mean: Adding Up and Dividing
The mean — technically the arithmetic mean — is the one you already know. Add up all the values, divide by how many there are. Done.
In plain English: the mean is the balance point of the data. If you put all your data points on a seesaw, the mean is where you'd put the fulcrum to make it balance.
import numpy as np
# Household incomes in City A (in thousands)
city_a = [62, 68, 71, 73, 75, 76, 78, 80, 82, 85]
mean_a = np.mean(city_a)
print(f"City A mean income: ${mean_a:.1f}k")
# City A mean income: $75.0k
The formula, if you want it, is simple:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
But let's translate that into human language: "Add up everything, then divide by the count." That's it. The sigma symbol just means "add up," the x-bar is the mean, and n is how many data points you have. Nothing scary here.
When the mean works great: When your data is roughly symmetric — roughly the same number of values above and below the center — the mean is an excellent summary. It uses every data point, which means it's sensitive and responsive.
When the mean lies: That sensitivity is also its weakness. Throw one billionaire into a room of schoolteachers and the "average salary" skyrockets. The mean gets pulled toward extreme values.
The Median: The Middle Value
The median is the value that sits right in the middle when you line up all your data from smallest to largest. Half the data is below it, half is above it, and it doesn't care one bit about extreme values.
# Household incomes in City B (in thousands)
city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
mean_b = np.mean(city_b)
median_b = np.median(city_b)
print(f"City B mean income: ${mean_b:.1f}k")
print(f"City B median income: ${median_b:.1f}k")
# City B mean income: $113.2k
# City B median income: $74.0k
See that? The mean says $113.2k — a number that describes *nobody* in City B. Most people earn around $30k, and a few earn $120k+. The mean got dragged up by that $500k outlier. The median of $74k isn't perfect either, but at least it sits between the two groups.
To find the median:
1. Sort the data from smallest to largest.
2. If there's an odd number of values, the median is the one in the middle.
3. If there's an even number of values, the median is the average of the two middle values.
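Those three steps translate directly into a few lines of Python. Here's a quick sketch (the helper name median_by_hand is ours, not a library function):

```python
def median_by_hand(values):
    """Median via sort-and-pick-the-middle, following the recipe above."""
    ordered = sorted(values)                 # Step 1: sort
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                           # Step 2: odd count -> single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # Step 3: even count -> average the two middle values

print(median_by_hand([28, 30, 31, 32, 33, 115, 118, 120, 125, 500]))  # 74.0
print(median_by_hand([3, 1, 2]))                                      # 2
```

np.median does the same job in one call, but writing it once by hand makes the odd/even rule stick.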
When to use the median instead of the mean: Whenever your data is skewed or has outliers. Income, home prices, response times, hospital stays — all of these tend to be skewed, and the median is almost always a better summary than the mean.
Elena's notebook: Elena is computing vaccination rates across neighborhoods. Most neighborhoods have rates between 60% and 85%, but a few affluent areas hit 95% and one underserved area is at 23%. She reports the median vaccination rate (72%) rather than the mean (69%) because that one very low outlier drags the mean down and makes the overall picture look worse than it is for most neighborhoods.
The Mode: The Most Common Value
The mode is the value that appears most frequently. It's the simplest measure of center, and it's the only one that works for categorical data (you can't compute a mean of "red, blue, green, blue, green, green" — but you can say the mode is "green").
import pandas as pd
# Payment methods at Marcus's bakery
payments = pd.Series(['card', 'card', 'cash', 'card', 'mobile', 'card', 'cash', 'card'])
print(f"Most common payment method: {payments.mode()[0]}")
# Most common payment method: card
(Recent versions of SciPy's stats.mode accept only numeric arrays, so pandas' .mode() is the safer tool for categorical data.)
For numerical data, the mode is less commonly used, but it can be informative — especially when data clusters in unexpected ways.
Bimodal data has two modes — two peaks. This often signals that you're looking at two different groups mixed together. If you measured the heights of everyone at a basketball game (players and fans), you'd probably see a bimodal distribution: one peak around average adult height and another peak around NBA-player height.
Comparing the Three: A Visual Story
Let's make this concrete with a visualization. One of the most important skills you'll develop in this chapter is seeing the relationship between these three measures.
import matplotlib.pyplot as plt
import numpy as np
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Symmetric distribution
np.random.seed(42)
symmetric = np.random.normal(50, 10, 1000)
axes[0].hist(symmetric, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
axes[0].axvline(np.mean(symmetric), color='red', linestyle='--', label='Mean')
axes[0].axvline(np.median(symmetric), color='orange', linestyle='--', label='Median')
axes[0].set_title('Symmetric: Mean ≈ Median')
axes[0].legend()
# Right-skewed distribution
right_skewed = np.random.exponential(20, 1000)
axes[1].hist(right_skewed, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
axes[1].axvline(np.mean(right_skewed), color='red', linestyle='--', label='Mean')
axes[1].axvline(np.median(right_skewed), color='orange', linestyle='--', label='Median')
axes[1].set_title('Right-Skewed: Mean > Median')
axes[1].legend()
# Left-skewed distribution
left_skewed = 100 - np.random.exponential(20, 1000)
axes[2].hist(left_skewed, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
axes[2].axvline(np.mean(left_skewed), color='red', linestyle='--', label='Mean')
axes[2].axvline(np.median(left_skewed), color='orange', linestyle='--', label='Median')
axes[2].set_title('Left-Skewed: Mean < Median')
axes[2].legend()
plt.tight_layout()
plt.savefig('center_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
The pattern to remember:
- Symmetric data: Mean ≈ Median. Either one is fine.
- Right-skewed data (long tail to the right): Mean > Median. The mean gets pulled right. Use the median.
- Left-skewed data (long tail to the left): Mean < Median. The mean gets pulled left. Use the median.
Pause and think: Right now, before reading further, think about whether the following datasets would be symmetric or skewed: (a) test scores in a university course, (b) annual household income in the United States, (c) heights of adult women, (d) the number of Instagram followers per account. Jot down your guesses — we'll come back to them later.
19.3 Measuring Spread: How Much Do Values Vary?
Knowing the center is a start, but it's not enough. Two datasets can have the same center and look completely different. Remember our two cities? Both had an average of $75k, but one was tightly clustered and the other was wildly spread out.
Here's another way to think about it. Imagine two archery targets. On Target A, all the arrows are clustered tightly around the bullseye. On Target B, the arrows are scattered all over — some near the center, some at the edges. Both targets might have the same average position (maybe centered on the bullseye), but they tell very different stories about the archer's consistency.
That's what spread measures: consistency. Variability. Reliability. How far, typically, do values stray from the center? There are several ways to quantify this, and each has its strengths.
Range: Simple but Fragile
The range is the simplest measure of spread: the difference between the largest and smallest values.
city_a = [62, 68, 71, 73, 75, 76, 78, 80, 82, 85]
city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
print(f"City A range: {max(city_a) - min(city_a)}k") # 23k
print(f"City B range: {max(city_b) - min(city_b)}k") # 472k
The range tells us that City B is much more spread out — which is true! But the range has a fatal flaw: it depends entirely on the two most extreme values. Add one billionaire to any dataset and the range explodes, even if everyone else is tightly clustered.
The range is useful for a quick sanity check ("Are these values in the right ballpark?"), but it's too fragile for serious analysis.
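To see that fragility concretely, here's a toy example (the numbers are invented for illustration):

```python
tight = [70, 72, 74, 75, 76, 78, 80]           # tightly clustered incomes (thousands)
with_outlier = tight + [1000]                  # add one ultra-high earner

print(max(tight) - min(tight))                 # 10
print(max(with_outlier) - min(with_outlier))   # 930
```

One new value, and the range balloons from 10 to 930 even though the other seven points didn't move at all.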
Variance and Standard Deviation: The Workhorses
Here's where things get a little more involved — but stay with me, because I promise this is less scary than it looks.
The variance measures the average squared distance from the mean. In plain English: how far, on average, do values stray from the center?
Let me walk through the logic step by step.
Step 1: Find the mean.
data = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]
mean = np.mean(data) # 6.0
Step 2: Find how far each value is from the mean.
These are called deviations.
deviations = [x - mean for x in data]
print(deviations)
# [-2.0, 2.0, 0.0, -1.0, -3.0, 1.0, 3.0, -1.0, 0.0, 1.0]
Step 3: Square each deviation.
Why square? Because if you just add up the deviations, the positives and negatives cancel out to zero (try it!). Squaring makes everything positive.
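Taking up that "try it!" — the raw deviations really do cancel:

```python
data = [4, 8, 6, 5, 3, 7, 9, 5, 6, 7]
mean = sum(data) / len(data)     # 6.0
deviations = [x - mean for x in data]
print(sum(deviations))           # 0.0 -- the positives and negatives cancel exactly
```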
squared_deviations = [d**2 for d in deviations]
print(squared_deviations)
# [4.0, 4.0, 0.0, 1.0, 9.0, 1.0, 9.0, 1.0, 0.0, 1.0]
Step 4: Average the squared deviations.
variance = np.mean(squared_deviations) # 3.0
print(f"Variance: {variance}")
That's the variance. But notice a problem: our data was in the original units (like dollars or degrees), and the variance is in squared units. That's weird and hard to interpret.
Step 5: Take the square root.
The standard deviation is just the square root of the variance. It puts us back in the original units.
std_dev = np.sqrt(variance) # 1.73
print(f"Standard deviation: {std_dev:.2f}")
# Or just use NumPy directly:
print(f"NumPy std: {np.std(data):.2f}") # 1.73
In plain English: The standard deviation tells you, roughly, "how far a typical value is from the mean." If the standard deviation of test scores is 12 points, that means a typical score is about 12 points away from the class average — some above, some below.
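That "typical distance" reading can be checked directly. Below is a quick sanity check with simulated scores (not real data): for roughly bell-shaped data, the average absolute distance from the mean comes out to about 0.8 times the standard deviation, so the SD is a slightly inflated but honest stand-in for "how far a typical value sits from the mean."

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(75, 12, 100_000)   # simulated test scores: mean 75, SD 12

# Average absolute distance from the mean ("typical distance")
typical_distance = np.mean(np.abs(scores - scores.mean()))
print(f"SD: {scores.std():.1f}")                     # ~12.0
print(f"Mean |deviation|: {typical_distance:.1f}")   # ~9.6, about 0.8 * SD
```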
A note about n vs. n-1: You may notice that np.std(data) gives a slightly different result than np.std(data, ddof=1). The difference is whether you divide by n (the number of data points) or by n-1. When you're describing a sample and want to estimate the population's standard deviation, you use n-1 (called Bessel's correction). When you're describing the data you actually have, use n. Pandas uses n-1 by default (.std() uses ddof=1); NumPy uses n by default. For large datasets, the difference is negligible. We'll revisit this distinction in Chapter 22 when we talk about inference.
import pandas as pd
# Pandas vs. NumPy default behavior
data_series = pd.Series([4, 8, 6, 5, 3, 7, 9, 5, 6, 7])
print(f"Pandas std (ddof=1): {data_series.std():.2f}") # 1.83
print(f"NumPy std (ddof=0): {np.std(data):.2f}") # 1.73
print(f"NumPy std (ddof=1): {np.std(data, ddof=1):.2f}") # 1.83
IQR: The Robust Alternative
The interquartile range (IQR) is to the median what the standard deviation is to the mean. It measures spread, but it's robust against outliers.
To understand the IQR, we first need percentiles.
A percentile tells you what percentage of the data falls below a given value. The 25th percentile (also called Q1, the first quartile) means 25% of the data is below this value. The 75th percentile (Q3, the third quartile) means 75% is below it.
The IQR is simply Q3 minus Q1. It covers the middle 50% of the data.
city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
Q1 = np.percentile(city_b, 25)
Q3 = np.percentile(city_b, 75)
IQR = Q3 - Q1
print(f"Q1: {Q1}")  # 31.25
print(f"Q3: {Q3}")  # 119.5
print(f"IQR: {IQR}")  # 88.25
print(f"Std dev: {np.std(city_b, ddof=1):.1f}")  # 142.9
That $500k value inflated the standard deviation to $142.9k, but the IQR of $88.25k gives a more honest picture of how spread out the bulk of the data is.
When to use which: - Standard deviation when your data is roughly symmetric and doesn't have extreme outliers. - IQR when your data is skewed or has outliers.
This mirrors our earlier rule: mean + standard deviation go together for symmetric data; median + IQR go together for skewed data.
Practical Intuition for Standard Deviation
If standard deviation still feels abstract, here are some rules of thumb that will help you develop intuition.
When someone tells you the standard deviation, you can immediately picture the data:
- Small SD relative to the mean: The data is tightly clustered. A class with a mean score of 80 and SD of 5 means almost everyone scored between 70 and 90. Not much variation.
- Large SD relative to the mean: The data is widely spread. A class with a mean of 80 and SD of 20 means scores range from around 40 to 120 (well, probably capped at 100) — huge variation.
- SD of zero: Every value is identical. No variation at all. This basically never happens with real data.
# Building intuition: what SD "looks like"
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, sd, title in [(axes[0], 3, 'Tight (SD=3)'),
                      (axes[1], 10, 'Moderate (SD=10)'),
                      (axes[2], 25, 'Wide (SD=25)')]:
    data = np.random.normal(50, sd, 1000)
    ax.hist(data, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
    ax.set_title(f'Mean=50, {title}')
    ax.set_xlim(-30, 130)
plt.tight_layout()
plt.savefig('sd_intuition.png', dpi=150, bbox_inches='tight')
plt.show()
Marcus's insight: Marcus's bakery has average daily sales of $1,200 with a standard deviation of $150. That tells him most days fall between about $900 and $1,500. If someone told him the SD was $50 instead, he'd know the business was very predictable. If the SD was $500, he'd be worried about the volatility.
The Five-Number Summary
The five-number summary wraps up several ideas into one compact package:
- Minimum — the smallest value
- Q1 — the 25th percentile
- Median — the 50th percentile
- Q3 — the 75th percentile
- Maximum — the largest value
import pandas as pd
city_b_series = pd.Series(city_b, name="Income (thousands)")
print(city_b_series.describe())
count     10.000000
mean     113.200000
std      142.902298
min       28.000000
25%       31.250000
50%       74.000000
75%      119.500000
max      500.000000
Look at that output — pandas gives you the five-number summary (min, 25%, 50%, 75%, max) plus the count, mean, and standard deviation. You've been looking at this table since Chapter 7. Now you actually know what every line means.
The five-number summary is the basis of the box plot (which you may have encountered in Part III). The box spans from Q1 to Q3, the line inside the box is the median, and the "whiskers" extend to the min and max (or to a fence, as we'll see shortly).
19.4 Distribution Shape: The Third Dimension
Here's the threshold concept for this chapter — the idea that, once it clicks, changes how you think about data forever:
Threshold Concept: Distribution Thinking
A single number (the mean, the median) reduces your entire dataset to a point. But data has shape. It has peaks and valleys, tails and clusters. Learning to see data as a distribution — a shape rather than a single number — is one of the most important mindset shifts in all of data science.
Let me show you what I mean. Consider four datasets that all have a mean of 50 and a standard deviation of 10:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Dataset 1: Symmetric (normal)
d1 = np.random.normal(50, 10, 5000)
axes[0, 0].hist(d1, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[0, 0].set_title('Symmetric (Normal)')
# Dataset 2: Right-skewed
d2 = np.random.exponential(10, 5000) + 30
d2 = (d2 - d2.mean()) / d2.std() * 10 + 50 # rescale
axes[0, 1].hist(d2, bins=50, color='coral', edgecolor='white', alpha=0.8)
axes[0, 1].set_title('Right-Skewed')
# Dataset 3: Left-skewed
d3 = 100 - np.random.exponential(10, 5000)
d3 = (d3 - d3.mean()) / d3.std() * 10 + 50 # rescale
axes[1, 0].hist(d3, bins=50, color='mediumseagreen', edgecolor='white', alpha=0.8)
axes[1, 0].set_title('Left-Skewed')
# Dataset 4: Bimodal
d4_part1 = np.random.normal(35, 5, 2500)
d4_part2 = np.random.normal(65, 5, 2500)
d4 = np.concatenate([d4_part1, d4_part2])
d4 = (d4 - d4.mean()) / d4.std() * 10 + 50 # rescale
axes[1, 1].hist(d4, bins=50, color='mediumpurple', edgecolor='white', alpha=0.8)
axes[1, 1].set_title('Bimodal')
for ax in axes.flat:
    ax.axvline(50, color='red', linestyle='--', alpha=0.5, label='Mean=50')
    ax.legend()
plt.suptitle('Four Datasets, Same Mean (50), Same Std Dev (10)', fontsize=14)
plt.tight_layout()
plt.savefig('four_shapes.png', dpi=150, bbox_inches='tight')
plt.show()
Same mean. Same standard deviation. Completely different stories. If all you reported was "the mean is 50 and the standard deviation is 10," you'd be equally describing all four of these — and missing everything interesting.
This is why shape matters. Let's talk about the vocabulary for describing it.
Skewness: Which Way Does the Tail Point?
Skewness measures asymmetry. Imagine your histogram is a hill. Which direction does the long, thin tail stretch?
- Right-skewed (positive skew): The tail stretches to the right. Most values cluster on the left, with a few extreme high values pulling the tail rightward. Example: income, home prices, response times.
- Left-skewed (negative skew): The tail stretches to the left. Most values cluster on the right, with a few extreme low values. Example: age at retirement (most people retire around 62-67, but some retire very young), scores on an easy exam.
- Symmetric (zero skew): The data is roughly mirror-image around the center. Example: heights, many natural measurements.
from scipy import stats
# Generate example distributions
np.random.seed(42)
right_skew = np.random.exponential(10, 10000)
left_skew = 100 - np.random.exponential(10, 10000)
symmetric = np.random.normal(50, 10, 10000)
print(f"Right-skewed skewness: {stats.skew(right_skew):.2f}") # ~2.0
print(f"Left-skewed skewness: {stats.skew(left_skew):.2f}") # ~-2.0
print(f"Symmetric skewness: {stats.skew(symmetric):.2f}") # ~0.0
The number itself is useful — positive means right-skewed, negative means left-skewed, and close to zero means roughly symmetric — but the visual is more important than the number. Always plot your data.
A common confusion: "Right-skewed" means the tail goes right, not that the bulk of the data is on the right. Think of it as "the tail points right." The bulk of the data is actually on the left in a right-skewed distribution. This trips people up constantly. If it helps, think "the tail is right" or draw a quick sketch.
Kurtosis: How Heavy Are the Tails?
Kurtosis measures how much data is in the tails of the distribution compared to a normal distribution. High kurtosis means heavy tails (more extreme values than you'd expect). Low kurtosis means light tails.
# Normal distribution: kurtosis ≈ 0 (using Fisher's definition)
normal_data = np.random.normal(0, 1, 100000)
print(f"Normal kurtosis: {stats.kurtosis(normal_data):.2f}") # ~0.0
# Heavy-tailed: Student's t with 3 degrees of freedom
heavy_tails = np.random.standard_t(3, 100000)
print(f"Heavy-tailed kurtosis: {stats.kurtosis(heavy_tails):.2f}") # large and unstable — the theoretical value is infinite for df <= 4
# Light-tailed: Uniform distribution
light_tails = np.random.uniform(-1, 1, 100000)
print(f"Light-tailed kurtosis: {stats.kurtosis(light_tails):.2f}") # ~-1.2
In practice, you'll use kurtosis much less often than skewness. It's most useful when you're checking whether your data has more extreme values than a normal distribution would predict — something that matters a lot in finance (where heavy tails mean more frequent "once-in-a-century" events) and in any field where tail risk is important.
Note
scipy.stats.kurtosis() uses "excess kurtosis" by default — it subtracts 3 so that a normal distribution has kurtosis 0. Some textbooks use "regular" kurtosis where normal = 3. Same concept, different baseline. Just be aware of which convention you're using.
Modality: How Many Peaks?
A distribution can have:
- One peak (unimodal) — the most common case
- Two peaks (bimodal) — often signals two distinct groups
- Multiple peaks (multimodal) — multiple groups
Bimodal distributions are particularly important to recognize because they tell you that your data might be a mixture of two different populations. If you see two peaks in your vaccination rate data, that might mean some countries have high rates and others have low rates — and computing a single mean for the whole dataset would describe neither group well.
np.random.seed(42)
# Bimodal: mixing two groups
group_high = np.random.normal(80, 5, 500) # High-vaccination countries
group_low = np.random.normal(45, 8, 500) # Low-vaccination countries
combined = np.concatenate([group_high, group_low])
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Combined view
axes[0].hist(combined, bins=40, color='steelblue', edgecolor='white', alpha=0.8)
axes[0].axvline(np.mean(combined), color='red', linestyle='--', label=f'Mean={np.mean(combined):.0f}%')
axes[0].set_title('Combined: Bimodal Distribution')
axes[0].set_xlabel('Vaccination Rate (%)')
axes[0].legend()
# Separated groups
axes[1].hist(group_high, bins=25, color='mediumseagreen', edgecolor='white', alpha=0.8)
axes[1].axvline(np.mean(group_high), color='red', linestyle='--', label=f'Mean={np.mean(group_high):.0f}%')
axes[1].set_title('High-Vaccination Countries')
axes[1].set_xlabel('Vaccination Rate (%)')
axes[1].legend()
axes[2].hist(group_low, bins=25, color='coral', edgecolor='white', alpha=0.8)
axes[2].axvline(np.mean(group_low), color='red', linestyle='--', label=f'Mean={np.mean(group_low):.0f}%')
axes[2].set_title('Low-Vaccination Countries')
axes[2].set_xlabel('Vaccination Rate (%)')
axes[2].legend()
plt.tight_layout()
plt.savefig('bimodal_vaccination.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Combined mean: {np.mean(combined):.1f}%")
print(f"Combined median: {np.median(combined):.1f}%")
print(f"Neither number describes either group well!")
That combined mean of about 63% describes nobody. The high-vaccination countries are around 80%, the low-vaccination countries are around 45%, and 63% is a number that falls in the valley between the two peaks. This is exactly why plotting your data matters — no summary statistic would have told you about the two peaks.
Connecting Shape to Real-World Stories
The shape of a distribution isn't just a statistical fact — it tells you something meaningful about the phenomenon you're studying.
Right-skewed data often means there's a natural floor but no ceiling. Income can't be negative, but there's no upper limit — so a few high earners stretch the tail rightward. Response times can't be negative, but slow responses can be very slow. File sizes can't be negative, but some files are enormous. Wherever there's a floor and no ceiling, expect right skew.
Left-skewed data means there's a ceiling but no floor (or a natural clustering near the top). Exam scores on an easy test cluster near 100%, with a tail of lower scores stretching left. Life expectancy in developed countries clusters near 80 years, with a tail of early deaths. Whenever most people succeed and a few fail, expect left skew.
Bimodal data almost always means there are two different groups in your data that you should analyze separately. If you see two peaks, ask yourself: "What are the two populations here?" Once you split them apart, each group often has a much simpler, more interpretable distribution.
# Real-world shape examples
np.random.seed(42)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Right-skewed: Income
income = np.random.lognormal(10.5, 0.8, 5000)
axes[0, 0].hist(income, bins=50, color='coral', edgecolor='white', alpha=0.8, range=(0, 300000))
axes[0, 0].set_title('Income: Right-Skewed\n(Floor at $0, no ceiling)')
axes[0, 0].set_xlabel('Annual Income ($)')
# Left-skewed: Easy exam scores
easy_exam = np.clip(100 - np.random.exponential(8, 5000), 0, 100)
axes[0, 1].hist(easy_exam, bins=30, color='mediumseagreen', edgecolor='white', alpha=0.8)
axes[0, 1].set_title('Easy Exam: Left-Skewed\n(Ceiling at 100%, most students near top)')
axes[0, 1].set_xlabel('Score (%)')
# Symmetric: Heights
heights = np.random.normal(170, 10, 5000)
axes[1, 0].hist(heights, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
axes[1, 0].set_title('Heights: Symmetric\n(Natural variation around a center)')
axes[1, 0].set_xlabel('Height (cm)')
# Bimodal: Commute times (city vs. suburb)
commute = np.concatenate([np.random.normal(12, 4, 3000), np.random.normal(45, 8, 2000)])
axes[1, 1].hist(commute, bins=30, color='mediumpurple', edgecolor='white', alpha=0.8)
axes[1, 1].set_title('Commute Times: Bimodal\n(City workers vs. suburban commuters)')
axes[1, 1].set_xlabel('Minutes')
plt.tight_layout()
plt.savefig('shape_stories.png', dpi=150, bbox_inches='tight')
plt.show()
Priya's observation: Priya notices that three-point shooting percentages in the NBA are roughly symmetric — most players cluster around 35%, with tails in both directions. But minutes played per game is right-skewed — most players get limited minutes, but starters play a lot. The shape immediately tells her something about how playing time is distributed in the league.
19.5 Detecting Outliers: When Values Don't Belong
An outlier is a data point that's unusually far from the rest of the data. Outliers matter because they can distort summary statistics (especially the mean and standard deviation), and because they often represent something genuinely interesting — a data entry error, a measurement malfunction, or a genuinely unusual observation.
But "unusually far" is vague. How far is too far? There are two common approaches.
Method 1: The IQR Fence Method
The IQR fence method defines outliers as any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. These boundaries are called fences.
def find_outliers_iqr(data):
    """Identify outliers using the IQR fence method."""
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_fence = Q1 - 1.5 * IQR
    upper_fence = Q3 + 1.5 * IQR
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    print(f"Q1: {Q1:.2f}")
    print(f"Q3: {Q3:.2f}")
    print(f"IQR: {IQR:.2f}")
    print(f"Lower fence: {lower_fence:.2f}")
    print(f"Upper fence: {upper_fence:.2f}")
    print(f"Outliers: {outliers}")
    return outliers
# City B incomes
city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
outliers = find_outliers_iqr(city_b)
The 1.5 * IQR rule is a convention established by the statistician John Tukey — the same person who invented the box plot. It's not a law of nature; it's a reasonable default. Some analysts use 3 * IQR for "extreme outliers."
This is the method behind the whiskers on a box plot: the whiskers extend to the most extreme data points that are within the fences, and any points beyond the fences are plotted individually as dots.
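To see exactly where those whiskers would land for City B, here's a short sketch:

```python
import numpy as np

city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
q1, q3 = np.percentile(city_b, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data points still inside the fences
lower_whisker = min(x for x in city_b if x >= lower_fence)
upper_whisker = max(x for x in city_b if x <= upper_fence)
print(lower_whisker, upper_whisker)   # 28 125 -- the 500 gets drawn as a lone dot
```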
Method 2: Z-Scores
A z-score tells you how many standard deviations a value is from the mean. The formula is:
$$z = \frac{x - \bar{x}}{s}$$
In English: "How far is this value from the mean, measured in standard deviations?"
def compute_z_scores(data):
    """Compute z-scores for each value in the dataset."""
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    z_scores = [(x - mean) / std for x in data]
    return z_scores
city_b = [28, 30, 31, 32, 33, 115, 118, 120, 125, 500]
z_scores = compute_z_scores(city_b)
for value, z in zip(city_b, z_scores):
    flag = " <-- OUTLIER" if abs(z) > 2 else ""
    print(f"Value: {value:6}, z-score: {z:+.2f}{flag}")
A common rule of thumb: values with |z| > 2 are "unusual," and values with |z| > 3 are "very unusual." But be careful — z-scores assume your data is roughly normally distributed. If the data is skewed, z-scores become less reliable, and the IQR method is usually better.
# A complete outlier analysis with pandas
import pandas as pd
df = pd.DataFrame({'income': city_b})
df['z_score'] = (df['income'] - df['income'].mean()) / df['income'].std()
Q1, Q3 = df['income'].quantile(0.25), df['income'].quantile(0.75)
IQR = Q3 - Q1
df['iqr_outlier'] = (df['income'] < Q1 - 1.5 * IQR) | (df['income'] > Q3 + 1.5 * IQR)
df['z_outlier'] = df['z_score'].abs() > 2
print(df)
What to Do with Outliers
This is the question students always ask: "Should I remove outliers?"
The answer is: it depends on why they're there. Here's a decision framework:
- Data entry error? Fix or remove. If someone typed 5000 instead of 500, that's a mistake. Fix it if you can, remove it if you can't.
- Measurement error? Investigate. A temperature sensor that reads -999 degrees is malfunctioning. That's not real data.
- Genuinely unusual observation? Keep it, but be thoughtful. If Bill Gates walks into a coffee shop, the average net worth of the patrons goes up by billions. He's a real data point, but the mean is no longer useful — use the median.
- Interesting signal? Investigate further. Sometimes outliers are the most interesting part of the data. A neighborhood with an anomalously low vaccination rate might be exactly where Elena should focus her attention.
Marcus's insight: Marcus notices that one Saturday in October had three times more sales than any other day. He's tempted to exclude it as an outlier. But then he checks his records — it was the day of the town's annual harvest festival. That's not an error; it's a real spike. He keeps the data point but notes the special event. Next year, he'll stock up for it.
19.6 Putting It All Together: A Descriptive Statistics Workflow
Let's walk through a complete descriptive statistics analysis using our progressive project data. Elena is computing descriptive statistics for vaccination rates across countries, grouped by income level.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulate WHO-style vaccination data by income group
np.random.seed(19)
data = {
'country': [f'Country_{i}' for i in range(200)],
# Repeat labels in fixed counts so they line up with the
# vaccination_rate blocks concatenated below
'income_group': np.repeat(
['Low', 'Lower-Middle', 'Upper-Middle', 'High'],
[30, 60, 60, 50]
),
'vaccination_rate': np.concatenate([
np.clip(np.random.normal(52, 18, 30), 5, 100), # Low income
np.clip(np.random.normal(65, 15, 60), 10, 100), # Lower-middle
np.clip(np.random.normal(78, 12, 60), 20, 100), # Upper-middle
np.clip(np.random.normal(90, 6, 50), 40, 100), # High income
])
}
df = pd.DataFrame(data)
# Step 1: Overview
print("=== Step 1: First Look ===")
print(df['vaccination_rate'].describe())
print()
# Step 2: By-group statistics
print("=== Step 2: Statistics by Income Group ===")
group_stats = df.groupby('income_group')['vaccination_rate'].agg(
['count', 'mean', 'median', 'std', 'min', 'max']
).round(1)
print(group_stats)
print()
# Step 3: Check skewness by group
print("=== Step 3: Skewness by Group ===")
for group_name, group_data in df.groupby('income_group'):
skew = stats.skew(group_data['vaccination_rate'])
print(f" {group_name}: skewness = {skew:.2f}", end="")
if abs(skew) < 0.5:
print(" (roughly symmetric)")
elif skew > 0:
print(" (right-skewed)")
else:
print(" (left-skewed)")
# Step 4: Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
income_groups = ['Low', 'Lower-Middle', 'Upper-Middle', 'High']
colors = ['#e74c3c', '#e67e22', '#2ecc71', '#3498db']
for ax, group, color in zip(axes.flat, income_groups, colors):
group_data = df[df['income_group'] == group]['vaccination_rate']
ax.hist(group_data, bins=15, color=color, edgecolor='white', alpha=0.8)
ax.axvline(group_data.mean(), color='black', linestyle='--',
label=f'Mean={group_data.mean():.1f}')
ax.axvline(group_data.median(), color='black', linestyle=':',
label=f'Median={group_data.median():.1f}')
ax.set_title(f'{group} Income Countries')
ax.set_xlabel('Vaccination Rate (%)')
ax.legend(fontsize=8)
plt.suptitle('Vaccination Rates by Income Group', fontsize=14)
plt.tight_layout()
plt.savefig('vaccination_by_income.png', dpi=150, bbox_inches='tight')
plt.show()
# Step 5: Box plot comparison
fig, ax = plt.subplots(figsize=(10, 5))
# Make income_group an ordered categorical so the boxes plot low-to-high
# (otherwise pandas orders the groups alphabetically)
df['income_group'] = pd.Categorical(df['income_group'], categories=income_groups, ordered=True)
df.boxplot(column='vaccination_rate', by='income_group', ax=ax)
ax.set_title('Vaccination Rates by Income Group')
ax.set_xlabel('Income Group')
ax.set_ylabel('Vaccination Rate (%)')
plt.suptitle('') # Remove auto-generated title
plt.tight_layout()
plt.savefig('vaccination_boxplot.png', dpi=150, bbox_inches='tight')
plt.show()
Let's walk through what Elena sees and what decisions she makes:
Step 1: First Look. The overall mean vaccination rate is about 72%, but the standard deviation is 18% — that's a lot of variation. The range goes from about 5% to 100%.
Step 2: Group Comparison. Breaking it down by income group reveals that the story is really about inequality. High-income countries cluster tightly around 90%, while low-income countries are much more spread out, with a mean around 52%.
Step 3: Shape Check. The high-income group is left-skewed (most countries are near 90%, with a few trailing lower). The low-income group is roughly symmetric but with high variation.
Step 4: Visual Confirmation. The histograms confirm what the numbers suggested. The high-income group is tightly packed and left-skewed. The low-income group is wide and flat.
Step 5: Box Plot. The box plot makes group comparisons easy — you can see the medians, the spreads, and any outliers at a glance.
Elena's decision: For her report to the county health department, Elena uses the median for each income group (because some groups are skewed) and reports the IQR as the measure of spread. She notes that the mean and median are fairly close for the low-income group (roughly symmetric) but diverge for the high-income group (left-skewed). She includes both the box plot and the histograms — the box plot for quick comparison, the histograms for understanding shape.
Outliers in Practice: Elena's Vaccination Data
Let's apply outlier detection to Elena's real-world data. She has vaccination rates for 200 countries and wants to flag those that are unusually low or high for their income group.
import numpy as np
# For each income group, detect outliers
np.random.seed(19)
groups_data = {
'Low income': np.clip(np.random.normal(52, 18, 30), 5, 100),
'Lower-Middle': np.clip(np.random.normal(65, 15, 60), 10, 100),
'Upper-Middle': np.clip(np.random.normal(78, 12, 60), 20, 100),
'High income': np.clip(np.random.normal(90, 6, 50), 40, 100),
}
print("Outlier Detection by Income Group:")
print("=" * 60)
for group_name, data in groups_data.items():
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers_low = data[data < lower_fence]
outliers_high = data[data > upper_fence]
print(f"\n{group_name}:")
print(f" Q1={Q1:.1f}, Q3={Q3:.1f}, IQR={IQR:.1f}")
print(f" Fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f" Low outliers ({len(outliers_low)}): {sorted(outliers_low.round(1).tolist())}")
print(f" High outliers ({len(outliers_high)}): {sorted(outliers_high.round(1).tolist())}")
Elena notes that outliers in the "Low income" group include countries with extremely low vaccination rates (below 20%). These deserve investigation — are they conflict zones? Countries with supply chain failures? The outlier flag isn't a verdict; it's a prompt for deeper research.
In the "High income" group, outliers tend to be countries with surprisingly low rates despite having resources. These might reflect vaccine hesitancy in wealthy countries — a different kind of public health challenge.
19.7 Choosing the Right Summary Statistics
You now have a full toolkit of descriptive statistics. But knowing which tool to use when is just as important as knowing how to compute them. Here's your decision guide:
The Decision Framework
Step 1: Plot your data
|
v
Step 2: Is the distribution roughly symmetric?
|
├── YES --> Use mean + standard deviation
| These capture center and spread well
| for symmetric distributions.
|
└── NO (skewed or has outliers)
|
└── Use median + IQR
These are robust to outliers and
better represent the "typical" value.
In BOTH cases, also report:
- The five-number summary (min, Q1, median, Q3, max)
- The shape (symmetric, right-skewed, left-skewed, bimodal)
- Any notable outliers and what you think they represent
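The framework above can be sketched as a small helper function. The name summarize and the 0.5 skewness cutoff are illustrative choices for this sketch, not standard library features:

```python
import numpy as np
from scipy import stats

def summarize(data, skew_threshold=0.5):
    """Pick center/spread statistics based on distribution shape.

    The 0.5 skewness cutoff is a common rule of thumb, not a hard rule.
    """
    data = np.asarray(data, dtype=float)
    skew = stats.skew(data)
    if abs(skew) < skew_threshold:
        # Roughly symmetric: mean + standard deviation
        return {'center': ('mean', np.mean(data)),
                'spread': ('std', np.std(data, ddof=1))}
    # Skewed: median + IQR are more robust
    return {'center': ('median', np.median(data)),
            'spread': ('IQR', np.percentile(data, 75) - np.percentile(data, 25))}

symmetric = [48, 50, 51, 49, 52, 50, 47, 53, 50, 51]
skewed = [30, 32, 31, 33, 35, 34, 36, 40, 95, 210]
print(summarize(symmetric))  # recommends mean + std
print(summarize(skewed))     # recommends median + IQR
```

In a real analysis you would still plot the data first — a bimodal distribution can have near-zero skewness and still fool this check.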
Common Mistakes to Avoid
Mistake 1: Reporting the mean for skewed data. When someone says "the average home price is $500,000," that mean is pulled up by a few mansions. The median home price is almost always lower and more informative.
Mistake 2: Reporting only the center, not the spread. "The average score was 75" tells you nothing about whether everyone scored between 70-80 or whether scores ranged from 20 to 100. Always report a measure of spread.
Mistake 3: Ignoring bimodality. If your data has two peaks, no single number for center is appropriate. Report the two groups separately.
Mistake 4: Computing statistics for categorical data. The mean of zip codes is meaningless. The mean of Likert scale responses (1-5) is debatable. The mean of continuous measurements is usually fine. Know your data types.
Mistake 5: Removing outliers without thinking. Outliers might be errors or they might be the most interesting part of your data. Investigate before removing.
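To make Mistake 1 concrete, here's a toy neighborhood with hypothetical home prices (in thousands of dollars). One mansion drags the mean far above what any typical home costs:

```python
import numpy as np

# Hypothetical neighborhood: nine modest homes and one mansion ($1000s)
prices = np.array([310, 325, 340, 355, 360, 370, 385, 400, 420, 4500])

print(f"Mean:   ${prices.mean():,.1f}k")     # pulled up by the mansion
print(f"Median: ${np.median(prices):,.1f}k") # the typical home
```

The median here is $365k, while the mean is more than double that — and not a single home actually costs the mean price.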
19.8 The Progressive Project: Vaccination Rate Statistics by Income Group
Time to add to our Global Health Data Explorer. In previous chapters, you loaded, cleaned, and visualized WHO vaccination data. Now we're going to compute formal descriptive statistics and interpret them.
# Progressive Project Milestone: Chapter 19
# Compute descriptive statistics for vaccination rates by income group
import pandas as pd
import numpy as np
from scipy import stats
# Load your cleaned vaccination data
# (Using simulated data here — replace with your actual cleaned dataset)
np.random.seed(19)
vaccination_df = pd.DataFrame({
'country': [f'Country_{i}' for i in range(180)],
'income_group': np.repeat(
['Low income', 'Lower middle income', 'Upper middle income', 'High income'],
[27, 54, 54, 45]
),
'measles_vacc_rate': np.concatenate([
np.clip(np.random.normal(55, 20, 27), 5, 100),
np.clip(np.random.normal(68, 15, 54), 15, 100),
np.clip(np.random.normal(82, 10, 54), 30, 100),
np.clip(np.random.normal(93, 4, 45), 50, 100),
])
})
# --- Your analysis ---
# 1. Compute summary statistics by income group
summary = vaccination_df.groupby('income_group')['measles_vacc_rate'].describe()
print("Summary Statistics by Income Group:")
print(summary.round(1))
print()
# 2. Add skewness
skewness = vaccination_df.groupby('income_group')['measles_vacc_rate'].apply(
lambda x: stats.skew(x)
)
print("Skewness by Income Group:")
for group, skew in skewness.items():
shape = "symmetric" if abs(skew) < 0.5 else ("right-skewed" if skew > 0 else "left-skewed")
print(f" {group}: {skew:.2f} ({shape})")
print()
# 3. Interpret: For which groups should we use mean vs. median?
print("Interpretation Guide:")
for group_name, group_data in vaccination_df.groupby('income_group'):
data = group_data['measles_vacc_rate']
mean = data.mean()
median = data.median()
diff_pct = abs(mean - median) / median * 100
if diff_pct < 5:
recommendation = "Mean and median are close — either is fine"
else:
recommendation = f"Mean and median differ by {diff_pct:.1f}% — use median"
print(f" {group_name}: mean={mean:.1f}, median={median:.1f} --> {recommendation}")
# 4. Check for outliers in each group
print("\nOutlier Check (IQR method):")
for group_name, group_data in vaccination_df.groupby('income_group'):
data = group_data['measles_vacc_rate']
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
n_outliers = ((data < lower) | (data > upper)).sum()
print(f" {group_name}: {n_outliers} outliers (fences: {lower:.1f} to {upper:.1f})")
What to write in your project notebook: Summarize your findings in plain English. Which income groups have the highest and lowest vaccination rates? Which groups show the most variation? Are any distributions skewed, and what might that mean? Are there outliers that deserve investigation?
19.9 Common Pandas Methods for Descriptive Statistics
Here's your reference card for the pandas methods you'll use most often. Nearly all of them work on both Series and DataFrames.
import pandas as pd
import numpy as np
# Create a sample Series
s = pd.Series([23, 45, 12, 67, 89, 34, 56, 78, 23, 45, 67, 90, 12, 34, 56])
# Measures of center
s.mean() # Arithmetic mean
s.median() # Median (50th percentile)
s.mode() # Mode (most frequent value[s])
# Measures of spread
s.std() # Standard deviation (ddof=1 by default)
s.var() # Variance (ddof=1 by default)
s.min() # Minimum
s.max() # Maximum
s.quantile(0.25) # Any percentile (Q1 here)
s.quantile(0.75) # Q3
# Convenience
s.describe() # Five-number summary + count, mean, std
# Skewness and kurtosis
s.skew() # Skewness
s.kurt() # Excess kurtosis
# For DataFrames, these work column-wise by default
df = pd.DataFrame({
'temperature': [72, 68, 75, 80, 65],
'humidity': [45, 50, 55, 40, 60]
})
df.describe() # Summary for all numeric columns
df.mean() # Mean of each column
df.std() # Std dev of each column
df.corr() # Correlation matrix (we'll cover this in Ch. 24)
# Group-wise operations (vaccination_df is the DataFrame from Section 19.8)
df_grouped = vaccination_df.groupby('income_group')['measles_vacc_rate']
df_grouped.mean() # Mean by group
df_grouped.describe() # Full summary by group
df_grouped.agg(['mean', 'median', 'std', 'count']) # Custom aggregations
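One more groupby pattern worth knowing: named aggregation, which lets you label the output columns yourself instead of getting function names as headers. The DataFrame here is a small stand-in:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 12, 50, 20, 22, 21],
})

# Keyword arguments become the output column names
summary = df.groupby('group')['value'].agg(
    center='median',                                       # robust center
    spread=lambda s: s.quantile(0.75) - s.quantile(0.25),  # IQR
    n='count',
)
print(summary)
```

This produces one row per group with columns center, spread, and n — handy when the result feeds directly into a report or a plot.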
19.10 The Anscombe's Quartet Warning
We'll end with a famous cautionary tale. In 1973, the statistician Francis Anscombe created four datasets that have identical summary statistics — same mean, same variance, same correlation, same regression line — but look completely different when plotted.
import numpy as np
import matplotlib.pyplot as plt
# Anscombe's Quartet (built into many libraries)
# Dataset I: Normal linear relationship
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
# Dataset II: Non-linear relationship
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
# Dataset III: Linear with one outlier
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
# Dataset IV: Outlier-driven
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
datasets = [(x1, y1, 'I'), (x2, y2, 'II'), (x3, y3, 'III'), (x4, y4, 'IV')]
for ax, (x, y, label) in zip(axes.flat, datasets):
ax.scatter(x, y, color='steelblue', s=60)
ax.set_title(f'Dataset {label}')
ax.set_xlim(2, 20)
ax.set_ylim(2, 14)
# Add summary stats
ax.text(0.05, 0.95, f'Mean x={np.mean(x):.1f}\nMean y={np.mean(y):.2f}\nStd y={np.std(y, ddof=1):.2f}',
transform=ax.transAxes, verticalalignment='top', fontsize=8,
bbox=dict(boxstyle='round', facecolor='lightyellow'))
plt.suptitle("Anscombe's Quartet: Same Statistics, Different Stories", fontsize=14)
plt.tight_layout()
plt.savefig('anscombes_quartet.png', dpi=150, bbox_inches='tight')
plt.show()
The lesson of Anscombe's Quartet is simple and powerful: never trust summary statistics alone. Always plot your data.
This has been the unofficial motto of this entire book — from Chapter 6 when you made your first charts, through Part III when you built serious visualizations, and now here, where we've spent an entire chapter learning to compute summary statistics and I'm telling you that they're not enough by themselves.
Summary statistics and visualizations are partners. One without the other is incomplete. A histogram without numbers is vague. Numbers without a histogram can be misleading. Use both.
19.11 Chapter Summary
Let's zoom out and review the big picture.
The Center answers "what's typical?"
- Mean: The balance point. Sensitive to outliers. Best for symmetric data.
- Median: The middle value. Robust to outliers. Best for skewed data.
- Mode: The most frequent value. Works for categorical data too.
The Spread answers "how much do values vary?"
- Range: Max minus min. Quick but fragile.
- Variance / Standard Deviation: Average squared distance from the mean (variance) and its square root (standard deviation), which is back in the original units. The workhorse for symmetric data.
- IQR: The middle 50%. Robust to outliers. The workhorse for skewed data.
The Shape answers "what does the distribution look like?"
- Symmetric: Balanced around the center. Mean ≈ Median.
- Right-skewed: Long tail to the right. Mean > Median.
- Left-skewed: Long tail to the left. Mean < Median.
- Bimodal: Two peaks. Might be two groups mixed together.
The Decision Rule:
- Symmetric data → report mean + standard deviation
- Skewed or outlier-heavy data → report median + IQR
- Always report the five-number summary
- Always, always, always plot your data
The Threshold Concept, revisited: Distribution thinking — seeing data as a shape, not just a single number — is a mindset shift that will stay with you for the rest of this book and beyond. Every time someone tells you "the average is X," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?" If you've built that reflex, this chapter has done its job.
Looking Ahead
You now have the tools to describe data — to summarize what you see with numbers that capture center, spread, and shape. In the next chapter, we'll step into a world that's fundamentally different: probability. Where descriptive statistics tells you about the data you have, probability gives you a framework for reasoning about data you haven't seen yet — about uncertainty, randomness, and the patterns that emerge from chance.
If that sounds abstract, don't worry. We're going to start by flipping coins and rolling dice in Python, building your intuition through simulation before touching a single formula. See you in Chapter 20.
Connections to Previous and Future Chapters
| Concept from This Chapter | Where It Came From | Where It's Going |
|---|---|---|
| Mean, median, mode | Built on .describe() from Ch. 7 | Foundation for hypothesis testing (Ch. 23) |
| Standard deviation | New in this chapter | Core concept in confidence intervals (Ch. 22) |
| Distribution shape | First formal treatment | Central to Ch. 21 (normal curve) and Ch. 23 (test assumptions) |
| Z-scores | New in this chapter | Used in normal distribution (Ch. 21) and hypothesis testing (Ch. 23) |
| Outlier detection | Built on box plots from Part III | Applied in regression diagnostics (Ch. 26) |
| Five-number summary | Formalized box plot concepts | Quick diagnostic throughout the rest of the book |
| IQR | New in this chapter | Used in robust statistics throughout Part IV |