Key Takeaways: Numerical Summaries — Center, Spread, and Shape

One-Sentence Summary

Numerical summaries distill an entire distribution into a handful of numbers: measures of center (mean, median) say where the data sit, measures of spread (standard deviation, IQR) say how tightly the values cluster, and comparing the mean with the median, together with the IQR fences, reveals the distribution's shape and potential outliers.

Core Concepts at a Glance

Concept	Definition	Why It Matters
Mean	Sum of values / count — the balance point	Uses all data but is pulled by outliers
Median	Middle value when sorted — the 50th percentile	Resistant to outliers; better for skewed data
Standard deviation	Typical distance of values from the mean	THE measure of spread — used in nearly every statistical procedure
IQR	Q3 - Q1 — spread of the middle 50%	Resistant to outliers; pairs with the median
Box plot	Graph of the five-number summary	Compact visual that reveals center, spread, skew, and outliers at a glance

Decision Guide: Which Summary to Use

Is the distribution approximately symmetric with no major outliers?
│
├── YES → Report MEAN and STANDARD DEVIATION
│         (Mean uses all the data; SD pairs naturally with it)
│
└── NO (skewed or has outliers) → Report MEDIAN and IQR
          (Both are resistant to outliers and skewness)

In either case, also report the five-number summary and a box plot (or histogram). No single number tells the whole story.
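The decision guide can be sketched as a small function. This is only an illustration: the `choose_summary` name and the `skew_tolerance` cutoff are hypothetical, and the text gives no numeric test for "approximately symmetric", so the rule used here (mean and median within a fraction of one standard deviation) is an assumption. Outliers are flagged with the 1.5 × IQR fences.

```python
import statistics

def choose_summary(data, skew_tolerance=0.2):
    """Pick a center/spread pair following the decision guide.

    skew_tolerance is a hypothetical cutoff (not from the text): the data
    counts as roughly symmetric when |mean - median| < skew_tolerance * s.
    """
    mean = statistics.mean(data)
    median = statistics.median(data)
    s = statistics.stdev(data)

    # Quartiles by linear interpolation, then the 1.5 * IQR fences
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    has_outliers = any(x < lower or x > upper for x in data)
    roughly_symmetric = abs(mean - median) < skew_tolerance * s

    if roughly_symmetric and not has_outliers:
        return {"center": ("mean", mean), "spread": ("sd", s)}
    return {"center": ("median", median), "spread": ("iqr", iqr)}

symmetric = [48, 50, 51, 52, 53, 54, 56]   # made-up, balanced around 52
skewed = [20, 22, 23, 25, 26, 28, 95]      # made-up, one extreme value

print(choose_summary(symmetric))  # selects mean and SD
print(choose_summary(skewed))     # selects median and IQR
```

A rule like this is no substitute for looking at a histogram or box plot; it only automates the first question in the guide.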

Measures of Center

Measure	Formula	Resistant?	When to Use
Mean	$\bar{x} = \frac{\sum x_i}{n}$	No	Symmetric data; when further calculations are needed
Median	Middle value (sorted)	Yes	Skewed data; when outliers are present
Mode	Most frequent value	Yes	Categorical data; identifying peaks
Weighted mean	$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$	No	When observations have different importance (GPA, indices)

Key relationship: In right-skewed data, the mean is typically greater than the median. In left-skewed data, the mean is typically less than the median. In symmetric data, mean ≈ median.
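A quick way to see this relationship is to add one extreme value to a small data set and watch the mean move while the median barely changes. The salary figures below are made up for illustration.

```python
import statistics

salaries = [42, 45, 47, 50, 52]    # hypothetical salaries, in $1000s
with_outlier = salaries + [400]    # same data plus one extreme value

# Without the outlier: mean 47.2, median 47 (nearly equal)
print(statistics.mean(salaries), statistics.median(salaries))

# With the outlier: mean 106, median 48.5 (mean pulled far to the right)
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The single large value creates a right skew, and the mean ends up well above the median.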

Measures of Spread

Measure	Formula	Resistant?	What It Tells You
Range	Max - Min	No	Total spread (fragile — depends on two extreme values)
IQR	Q3 - Q1	Yes	Spread of the middle 50%
Variance	$s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}$	No	Average squared distance from the mean (units are squared)
Standard deviation	$s = \sqrt{s^2}$	No	Typical distance from the mean (original units)
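The "Resistant?" column can be checked directly. The sketch below uses made-up data and a hypothetical `spread` helper; it replaces the largest value with an extreme one and compares the three measures. Quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several quartile conventions.

```python
import statistics

def spread(data):
    """Return range, IQR, and sample SD for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": statistics.stdev(data),
    }

base = [10, 12, 13, 15, 16, 18, 20]
contaminated = base[:-1] + [90]    # largest value replaced by an outlier

print(spread(base))
print(spread(contaminated))
```

For these values the IQR is the same in both cases, while the range and standard deviation grow sharply once the outlier is present.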

Formula Reference Card

Sample Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Weighted Mean

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$
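As a minimal sketch of the weighted mean, consider a GPA where each course's grade points are weighted by its credit hours. The `weighted_mean` helper and the course data are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

grade_points = [4.0, 3.0, 2.0]   # A, B, C in three hypothetical courses
credit_hours = [3, 4, 1]         # weight of each course

gpa = weighted_mean(grade_points, credit_hours)
plain_mean = sum(grade_points) / len(grade_points)
print(gpa, plain_mean)           # 3.25 versus 3.0
```

The weighted result is higher than the plain mean because the low grade came from the course with the fewest credits.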

Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Sample Standard Deviation

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
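The variance and standard deviation formulas can be followed step by step on a small made-up data set. The last two lines compare the hand calculation with Python's `statistics.variance` and `statistics.stdev`, which also divide by $n - 1$.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

x_bar = sum(data) / n                             # sample mean: 5.0
squared_devs = [(x - x_bar) ** 2 for x in data]   # (x_i - x_bar)^2 for each value
s2 = sum(squared_devs) / (n - 1)                  # sample variance: 32 / 7
s = math.sqrt(s2)                                 # sample standard deviation

print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```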

Interquartile Range

$$\text{IQR} = Q3 - Q1$$

Z-Score

$$z = \frac{x - \bar{x}}{s}$$
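To illustrate the z-score formula, the sketch below standardizes a small set of made-up exam scores using their own sample mean and standard deviation.

```python
import statistics

scores = [60, 70, 80, 90, 100]    # hypothetical exam scores

x_bar = statistics.mean(scores)   # 80
s = statistics.stdev(scores)      # square root of 250, about 15.8

# z = (x - x_bar) / s for every score
z_scores = [(x - x_bar) / s for x in scores]

for x, z in zip(scores, z_scores):
    print(x, round(z, 2))
```

The score equal to the mean gets $z = 0$, and the z-scores of a data set always sum to zero (up to rounding).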

Outlier Fences (IQR Method)

$$\text{Lower fence} = Q1 - 1.5 \times \text{IQR}$$

$$\text{Upper fence} = Q3 + 1.5 \times \text{IQR}$$
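The fence formulas translate directly into code. The `iqr_fences` helper and the data are hypothetical. Quartile values depend on the convention used; `method="inclusive"` interpolates linearly between data points, so other software or textbook methods may give slightly different fences.

```python
import statistics

def iqr_fences(data):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

values = [12, 14, 15, 15, 16, 17, 18, 19, 45]

lower, upper = iqr_fences(values)
outliers = [x for x in values if x < lower or x > upper]

print(lower, upper)   # 10.5 and 22.5 for this data
print(outliers)       # [45]
```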

The Empirical Rule (68-95-99.7)

Only for bell-shaped, approximately symmetric distributions:

Range	% of Data
$\bar{x} \pm 1s$	~68%
$\bar{x} \pm 2s$	~95%
$\bar{x} \pm 3s$	~99.7%
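One way to get a feel for the Empirical Rule is to simulate bell-shaped data and count how much of it falls within 1, 2, and 3 standard deviations of the mean. The mean of 100 and standard deviation of 15 are arbitrary choices for the simulation.

```python
import random
import statistics

rng = random.Random(0)    # fixed seed so the run is repeatable
data = [rng.gauss(100, 15) for _ in range(10_000)]   # simulated bell-shaped data

mean = statistics.mean(data)
s = statistics.stdev(data)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * s for x in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
print(shares)   # expect values close to 0.68, 0.95, and 0.997
```

Repeating the experiment with strongly skewed data would generally give different percentages, which is why the rule is limited to bell-shaped distributions.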

Z-Score Interpretation Guide

Z-Score	Interpretation
$z = 0$	Exactly at the mean
$|z| < 1$	Very common (within the central 68% region)
$1 < |z| < 2$	Somewhat unusual but not rare
$|z| > 2$	Unusual (beyond 95% of data in a bell-shaped distribution)
$|z| > 3$	Very unusual (beyond 99.7%; only about 3 in 1,000)
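The guide can be written as a lookup function. The `interpret_z` name is hypothetical, and because the table does not say where $|z|$ of exactly 1, 2, or 3 belongs, placing those boundary values in the less extreme band is an assumption made here.

```python
def interpret_z(z):
    """Map a z-score to the wording of the interpretation guide.

    Boundary values (|z| exactly 1, 2, or 3) are assigned to the less
    extreme band; the guide itself leaves them unassigned.
    """
    size = abs(z)
    if size == 0:
        return "exactly at the mean"
    if size <= 1:
        return "very common"
    if size <= 2:
        return "somewhat unusual but not rare"
    if size <= 3:
        return "unusual"
    return "very unusual"

for z in (0, -0.5, 1.5, -2.4, 3.2):
    print(z, interpret_z(z))
```

The sign of $z$ only says whether the value is above or below the mean, so the function works with the absolute value.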

Five-Number Summary and Box Plot Construction

Five-number summary: Min, Q1, Median, Q3, Max

Box plot construction:

1. Draw a box from Q1 to Q3 (the IQR)
2. Draw a line inside the box at the median
3. Calculate fences: Q1 - 1.5×IQR and Q3 + 1.5×IQR
4. Draw whiskers to the most extreme values within the fences
5. Plot any values beyond the fences as individual dots (outliers)
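The numbers behind a box plot can be computed without drawing anything. The sketch below follows the construction steps on made-up data; `box_plot_stats` is a hypothetical helper, and the quartiles use linear interpolation (`method="inclusive"`), which is an assumption since plotting libraries and textbooks use differing conventions.

```python
import statistics

def box_plot_stats(data):
    """Compute the five-number summary, whisker ends, and outliers."""
    data = sorted(data)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Step 3: fences
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Step 4: whiskers reach the most extreme values inside the fences
    inside = [x for x in data if lower_fence <= x <= upper_fence]

    # Step 5: anything beyond the fences is plotted individually
    outliers = [x for x in data if x < lower_fence or x > upper_fence]

    return {
        "five_number": (data[0], q1, median, q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

stats = box_plot_stats([5, 7, 8, 9, 10, 11, 12, 13, 30])
print(stats)
```

For this data the upper whisker stops at 13 rather than at the maximum of 30, because 30 lies beyond the upper fence of 18 and is marked as an outlier.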

Reading a box plot:

- Box width = IQR (spread of the middle 50%)
- Median position in box = symmetry or skew
- Whisker lengths = range of non-outlier data
- Dots beyond whiskers = potential outliers
- Comparing box plots side by side = comparing distributions

Python Quick Reference

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df is an existing DataFrame with a numeric column named 'col'

# All at once
df['col'].describe()        # count, mean, std, min, Q1 (25%), median (50%), Q3 (75%), max

# Individual measures
df['col'].mean()             # Mean
df['col'].median()           # Median
df['col'].std()              # Sample standard deviation (n-1)
df['col'].var()              # Sample variance (n-1)
df['col'].quantile(0.25)     # Q1
df['col'].quantile(0.75)     # Q3

# IQR and outlier fences
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR       # Lower fence
upper = Q3 + 1.5 * IQR       # Upper fence

# Flag values beyond the fences as potential outliers
outliers = df[(df['col'] < lower) | (df['col'] > upper)]

# Z-scores
z = (df['col'] - df['col'].mean()) / df['col'].std()

# Box plot
sns.boxplot(data=df, x='col')
plt.show()
```

Excel Quick Reference

Calculation	Formula
Mean	`=AVERAGE(range)`
Median	`=MEDIAN(range)`
Sample Std Dev	`=STDEV.S(range)`
Sample Variance	`=VAR.S(range)`
Q1	`=QUARTILE(range, 1)`
Q3	`=QUARTILE(range, 3)`
Any percentile	`=PERCENTILE(range, p)` (with `p` as a decimal from 0 to 1, e.g. 0.9 for the 90th percentile)

Key Terms

Term	Definition
Mean	Sum of all values divided by the count — the balance point of the data
Median	The middle value when data is sorted — the 50th percentile
Mode	The most frequently occurring value
Range	Maximum minus minimum — total spread
Interquartile range (IQR)	Q3 minus Q1 — spread of the middle 50%
Variance	Average of squared deviations from the mean (dividing by $n-1$)
Standard deviation	Square root of the variance — typical distance from the mean
Five-number summary	Min, Q1, Median, Q3, Max
Box plot	Graph of the five-number summary with whiskers and outlier markers
Percentile	The value below which a given percentage of data falls
Quartile	Values dividing sorted data into four equal parts
Z-score	Number of standard deviations a value is from the mean
Empirical Rule	68-95-99.7 pattern for bell-shaped distributions
Resistant measure	A statistic not heavily influenced by outliers (e.g., median, IQR)
Weighted mean	A mean where each value is weighted by its importance

The One Thing to Remember

If you forget everything else from this chapter, remember this:

The standard deviation is the typical distance of values from the mean. It's the single most important measure of spread in all of statistics — and it's the foundation of nearly every inferential procedure you'll learn. When someone tells you the standard deviation, translate it: "That's how far a typical value falls from the center."