Key Takeaways: Numerical Summaries — Center, Spread, and Shape
One-Sentence Summary
Numerical summaries distill entire distributions into a handful of numbers — measures of center (mean, median) say where the data lives, measures of spread (standard deviation, IQR) say how tightly it's clustered, and the relationship between them reveals the distribution's shape and potential outliers.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Mean | Sum of values / count — the balance point | Uses all data but is pulled by outliers |
| Median | Middle value when sorted — the 50th percentile | Resistant to outliers; better for skewed data |
| Standard deviation | Typical distance of values from the mean | THE measure of spread — used in nearly every statistical procedure |
| IQR | Q3 - Q1 — spread of the middle 50% | Resistant to outliers; pairs with the median |
| Box plot | Graph of the five-number summary | Compact visual that reveals center, spread, skew, and outliers at a glance |
Decision Guide: Which Summary to Use
Is the distribution approximately symmetric with no major outliers?
│
├── YES → Report MEAN and STANDARD DEVIATION
│         (Mean uses all the data; SD pairs naturally with it)
│
└── NO (skewed or has outliers) → Report MEDIAN and IQR
          (Both are resistant to outliers and skewness)
Whichever pair you choose, also report the five-number summary and a box plot (or histogram). No single number tells the whole story.
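The decision guide above can be roughed out in code. This is only a sketch under a strong simplification: it checks for outliers with the 1.5 × IQR fences and does not test symmetry at all, so a histogram or box plot is still needed. The function name `choose_summary` and the sample data are hypothetical, and quartiles come from Python's standard `statistics` module with the "inclusive" method.

```python
from statistics import mean, median, quantiles, stdev

def choose_summary(values):
    """Pick a center/spread pair using only the 1.5 x IQR outlier check (a simplification)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    has_outliers = any(x < low or x > high for x in values)
    if has_outliers:
        # Resistant pair for data with outliers
        return {"center": ("median", median(values)), "spread": ("IQR", iqr)}
    # Non-resistant pair, reasonable when no value falls beyond the fences
    return {"center": ("mean", mean(values)), "spread": ("SD", stdev(values))}

clean = [10, 12, 13, 14, 15, 16, 18]       # hypothetical data, no extreme values
print(choose_summary(clean))
print(choose_summary(clean + [60]))        # same data plus one extreme value
```

Skew without outliers would slip past this check, which is why the visual check in the guide still matters.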
Measures of Center
| Measure | Formula | Resistant? | When to Use |
|---|---|---|---|
| Mean | $\bar{x} = \frac{\sum x_i}{n}$ | No | Symmetric data; when further calculations are needed |
| Median | Middle value (sorted) | Yes | Skewed data; when outliers are present |
| Mode | Most frequent value | Yes | Categorical data; identifying peaks |
| Weighted mean | $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$ | No | When observations have different importance (GPA, indices) |
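Key relationship: In right-skewed data, the mean is typically greater than the median because the long right tail pulls the mean up. In left-skewed data, the mean is typically less than the median. In symmetric data, mean ≈ median.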
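To see the pull of an outlier on the mean, here is a small sketch using Python's standard `statistics` module. The salary figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars; the last value is an extreme high earner.
salaries = [42, 45, 48, 50, 52, 55, 400]

salary_mean = mean(salaries)      # pulled upward by the 400
salary_median = median(salaries)  # middle of the sorted list; the size of the 400 does not matter

print(f"mean   = {salary_mean:.1f}")
print(f"median = {salary_median}")
print("mean > median, consistent with right skew:", salary_mean > salary_median)
```

Replacing 400 with 4,000 would change the mean again but leave the median at the same value, which is what "resistant" means in the table above.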
Measures of Spread
| Measure | Formula | Resistant? | What It Tells You |
|---|---|---|---|
| Range | Max - Min | No | Total spread (fragile — depends on two extreme values) |
| IQR | Q3 - Q1 | Yes | Spread of the middle 50% |
| Variance | $s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}$ | No | Average squared distance from the mean (units are squared) |
| Standard deviation | $s = \sqrt{s^2}$ | No | Typical distance from the mean (original units) |
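The "Resistant?" column can be checked directly. The sketch below, with hypothetical data and a hypothetical helper `spread_summary`, computes the range, sample standard deviation, and IQR before and after adding one extreme value. Quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions give slightly different IQRs.

```python
from statistics import quantiles, stdev

def spread_summary(values):
    """Return (range, sample SD, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), stdev(values), q3 - q1

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [60]   # one hypothetical extreme value added

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    rng, sd, iqr = spread_summary(data)
    print(f"{label:13s} range={rng:5.1f}  sd={sd:5.2f}  iqr={iqr:5.2f}")
```

The range and standard deviation jump when the extreme value is added, while the IQR barely moves.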
Formula Reference Card
Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Weighted Mean
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$
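A GPA is the standard weighted-mean example: grade points are the values and credit hours are the weights. The transcript below is hypothetical; the code is just the formula above written out.

```python
# Hypothetical transcript: (grade points, credit hours) per course.
courses = [(4.0, 3), (3.0, 4), (2.0, 1)]

weighted_sum = sum(points * hours for points, hours in courses)  # sum of w_i * x_i
total_weight = sum(hours for _, hours in courses)                # sum of w_i
gpa = weighted_sum / total_weight

# Plain mean of the grade points, ignoring credit hours, for comparison.
unweighted = sum(points for points, _ in courses) / len(courses)

print(f"weighted GPA   = {gpa:.3f}")
print(f"unweighted avg = {unweighted:.3f}")
```

The two results differ because the 4-credit course counts four times as much as the 1-credit course in the weighted version.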
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Sample Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
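The variance and standard deviation formulas can be traced step by step. The sketch below uses a small made-up sample and compares the hand calculation with `statistics.variance` and `statistics.stdev`, which also divide by $n - 1$.

```python
from math import sqrt
from statistics import stdev, variance

data = [2, 4, 4, 6, 9]   # small hypothetical sample

n = len(data)
x_bar = sum(data) / n                             # sample mean
squared_devs = [(x - x_bar) ** 2 for x in data]   # (x_i - x_bar)^2 for each value
s2 = sum(squared_devs) / (n - 1)                  # sample variance, dividing by n - 1
s = sqrt(s2)                                      # sample standard deviation

print(f"mean = {x_bar}, variance = {s2}, sd = {s:.4f}")
print("matches library:", abs(s2 - variance(data)) < 1e-12, abs(s - stdev(data)) < 1e-12)
```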
Interquartile Range
$$\text{IQR} = Q3 - Q1$$
Z-Score
$$z = \frac{x - \bar{x}}{s}$$
Outlier Fences (IQR Method)
$$\text{Lower fence} = Q1 - 1.5 \times \text{IQR}$$
$$\text{Upper fence} = Q3 + 1.5 \times \text{IQR}$$
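As a worked example, take hypothetical quartiles $Q1 = 20$ and $Q3 = 30$ (numbers chosen only for illustration):

$$\text{IQR} = 30 - 20 = 10$$
$$\text{Lower fence} = 20 - 1.5 \times 10 = 5$$
$$\text{Upper fence} = 30 + 1.5 \times 10 = 45$$

A value of 50 lies above the upper fence and would be flagged as a potential outlier; a value of 44 would not.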
The Empirical Rule (68-95-99.7)
Only for bell-shaped, approximately symmetric distributions:
| Range | % of Data |
|---|---|
| $\bar{x} \pm 1s$ | ~68% |
| $\bar{x} \pm 2s$ | ~95% |
| $\bar{x} \pm 3s$ | ~99.7% |
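The 68-95-99.7 figures come from the normal distribution. A quick sketch with `statistics.NormalDist` from Python's standard library reproduces them for an idealized normal model; real data that is only roughly bell-shaped will match these percentages only approximately. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```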
Z-Score Interpretation Guide
| Z-Score | Interpretation |
|---|---|
| $z = 0$ | Exactly at the mean |
| $\|z\| < 1$ | Very common (within 68% central region) |
| $1 < \|z\| < 2$ | Somewhat unusual but not rare |
| $\|z\| > 2$ | Unusual (beyond 95% of data in a bell-shaped distribution) |
| $\|z\| > 3$ | Very unusual (beyond 99.7% — only ~3 in 1,000) |
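The z-score formula and the interpretation table can be combined in a short sketch. The function names and exam scores are hypothetical, the labels assume roughly bell-shaped data, and values exactly at 1, 2, or 3 are assigned to the milder label here because the table does not say where they belong.

```python
from statistics import mean, stdev

def z_score(x, values):
    """Standardize x against the sample mean and sample SD of values."""
    return (x - mean(values)) / stdev(values)

def describe_z(z):
    """Map a z-score to rough labels from the interpretation table (bell-shaped data assumed)."""
    size = abs(z)
    if size > 3:
        return "very unusual"
    if size > 2:
        return "unusual"
    if size > 1:
        return "somewhat unusual"
    return "common"

scores = [70, 75, 80, 85, 90]   # hypothetical exam scores
for x in (80, 95):
    z = z_score(x, scores)
    print(f"score {x}: z = {z:.2f} ({describe_z(z)})")
```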
Five-Number Summary and Box Plot Construction
Five-number summary: Min, Q1, Median, Q3, Max
Box plot construction:
1. Draw a box from Q1 to Q3 (the IQR)
2. Draw a line inside the box at the median
3. Calculate fences: Q1 - 1.5×IQR and Q3 + 1.5×IQR
4. Draw whiskers to the most extreme values within the fences
5. Plot any values beyond the fences as individual dots (outliers)
Reading a box plot:
- Box width = IQR (spread of the middle 50%)
- Median position in box = symmetry or skew
- Whisker lengths = range of non-outlier data
- Dots beyond whiskers = potential outliers
- Comparing box plots side by side = comparing distributions
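The construction steps can be sketched as a small function that returns the numbers you would need to draw the plot by hand. The name `box_plot_stats` and the data are hypothetical, and quartiles use the "inclusive" method of `statistics.quantiles`; plotting libraries may use a slightly different quartile convention.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Compute the pieces needed to draw a box plot by hand."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")   # box edges and median line
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0],     # most extreme value still within the lower fence
        "whisker_high": inside[-1],   # most extreme value still within the upper fence
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

stats = box_plot_stats([10, 12, 13, 14, 15, 16, 18, 60])
for name, value in stats.items():
    print(f"{name:12s} {value}")
```

Note that the whiskers stop at actual data values (10 and 18 here), not at the fences themselves.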
Python Quick Reference
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Assumes df is a pandas DataFrame with a numeric column named 'col'
# All at once
df['col'].describe() # min, Q1, median, Q3, max, mean, std, count
# Individual measures
df['col'].mean() # Mean
df['col'].median() # Median
df['col'].std() # Sample standard deviation (n-1)
df['col'].var() # Sample variance (n-1)
df['col'].quantile(0.25) # Q1
df['col'].quantile(0.75) # Q3
# IQR and outlier detection
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['col'] < lower) | (df['col'] > upper)]  # rows beyond the fences
# Z-scores
z = (df['col'] - df['col'].mean()) / df['col'].std()
# Box plot
sns.boxplot(data=df, x='col')
plt.show()
Excel Quick Reference
| Calculation | Formula |
|---|---|
| Mean | =AVERAGE(range) |
| Median | =MEDIAN(range) |
| Sample Std Dev | =STDEV.S(range) |
| Sample Variance | =VAR.S(range) |
| Q1 | =QUARTILE(range, 1) |
| Q3 | =QUARTILE(range, 3) |
| Any percentile | =PERCENTILE(range, p), with p between 0 and 1 (e.g., 0.9 for the 90th percentile) |
Key Terms
| Term | Definition |
|---|---|
| Mean | Sum of all values divided by the count — the balance point of the data |
| Median | The middle value when data is sorted — the 50th percentile |
| Mode | The most frequently occurring value |
| Range | Maximum minus minimum — total spread |
| Interquartile range (IQR) | Q3 minus Q1 — spread of the middle 50% |
| Variance | Average of squared deviations from the mean (dividing by $n-1$) |
| Standard deviation | Square root of the variance — typical distance from the mean |
| Five-number summary | Min, Q1, Median, Q3, Max |
| Box plot | Graph of the five-number summary with whiskers and outlier markers |
| Percentile | The value below which a given percentage of data falls |
| Quartile | Values dividing sorted data into four equal parts |
| Z-score | Number of standard deviations a value is from the mean |
| Empirical Rule | 68-95-99.7 pattern for bell-shaped distributions |
| Resistant measure | A statistic not heavily influenced by outliers (e.g., median, IQR) |
| Weighted mean | A mean where each value is weighted by its importance |
The One Thing to Remember
If you forget everything else from this chapter, remember this:
The standard deviation is the typical distance of values from the mean. It's the single most important measure of spread in all of statistics — and it's the foundation of nearly every inferential procedure you'll learn. When someone tells you the standard deviation, translate it: "That's how far a typical value falls from the center."