Key Takeaways: Descriptive Statistics
This is your reference card for Chapter 19. When someone hands you a dataset and says "tell me what you see," this is the page to come back to.
The Threshold Concept
Distribution thinking — seeing data as a shape, not just a single number — is the mindset shift that makes everything else in statistics click. Every time someone gives you an "average," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?"
Measures of Center
| Measure | What It Is | When to Use It | Watch Out For |
|---|---|---|---|
| Mean | Sum of values / count — the balance point | Symmetric distributions without extreme outliers | Gets dragged by outliers and skewed tails |
| Median | The middle value when data is sorted — the 50th percentile | Skewed distributions, data with outliers | Ignores the actual values of extreme points (that's the feature, not a bug) |
| Mode | The most frequent value | Categorical data, detecting peaks in continuous data | Might not be unique; less useful for continuous data |
The key rule: If the mean and median are close (relative to the spread), the distribution is probably roughly symmetric and either works. If they're far apart, the data is skewed and you should use the median.
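A minimal sketch of this rule, using Python's standard `statistics` module. The salary figures are made up for illustration; the point is only to show how one extreme value moves the mean but not the median.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an extreme outlier.
salaries = [42, 45, 47, 50, 52, 55, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the 400
median_salary = statistics.median(salaries)  # middle value of the sorted list

print(f"mean = {mean_salary:.1f}, median = {median_salary}")
```

With this data the median is 50 while the mean is roughly 98.7, so the median is the more honest "typical" salary.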
Measures of Spread
| Measure | What It Is | When to Use It | Watch Out For |
|---|---|---|---|
| Range | Max - Min | Quick sanity check only | Determined entirely by the two most extreme values |
| Variance | Average of squared deviations from mean | Mathematical building block; used in formulas | Units are squared (hard to interpret directly) |
| Standard Deviation | Square root of variance — "typical distance from mean" | Symmetric data; pairs with the mean | Inflated by outliers |
| IQR | Q3 - Q1 — the range of the middle 50% | Skewed data, data with outliers; pairs with the median | Ignores the tails entirely |
The pairing rule:
- Symmetric data: mean + standard deviation
- Skewed data: median + IQR
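The four spread measures can be computed with the standard library alone. This is a sketch on made-up commute times; `method="inclusive"` is chosen here because it interpolates linearly between data points, which is also what pandas' `quantile()` does by default.

```python
import statistics

# Hypothetical commute times in minutes.
times = [10, 12, 14, 15, 16, 18, 20, 22, 25]

value_range = max(times) - min(times)   # max - min
variance = statistics.variance(times)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(times)       # square root of the sample variance

# Quartiles with linear interpolation between data points.
q1, q2, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1                           # width of the middle 50%

print(value_range, round(variance, 2), round(std_dev, 2), iqr)
```

`statistics.stdev` and `statistics.variance` divide by n - 1, matching the pandas defaults for `s.std()` and `s.var()`.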
The Five-Number Summary
Minimum ── Q1 ── Median ── Q3 ── Maximum
          |←───── IQR ─────→|
              (middle 50%)
This is what .describe() gives you (plus count, mean, and std). It's the skeleton of a box plot and a great first look at any numerical variable.
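If you want the five numbers on their own, a small helper is enough. This is a sketch, not part of pandas: `five_number_summary` is a hypothetical name, and the exam scores are invented.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # Linear interpolation between data points for the quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

scores = [55, 61, 68, 70, 74, 79, 83, 88, 97]  # hypothetical exam scores
summary = five_number_summary(scores)
print(summary)
```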
Distribution Shape
| Shape | What It Looks Like | Mean vs. Median | Real-World Examples |
|---|---|---|---|
| Symmetric | Mirror-image around center | Mean ≈ Median | Heights, many test scores |
| Right-skewed | Long tail to the RIGHT | Mean > Median | Income, home prices, wait times |
| Left-skewed | Long tail to the LEFT | Mean < Median | Age at death, easy exam scores |
| Bimodal | Two distinct peaks | Mean falls in the valley (misleading!) | Mixed populations |
The tail rule: Skewness is named for the direction of the tail, not the peak. Right-skewed means the tail goes right.
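One way to check the direction of the tail numerically is the sign of the skewness. The sketch below uses the simple moment-based formula (the mean of cubed z-scores). That is an assumption made for clarity: pandas' `s.skew()` applies a sample-size correction, so its value differs slightly, although the sign agrees. The data is invented.

```python
import statistics

def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population SD (divides by n)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]  # long tail to the right
left_skewed = [-x for x in right_skewed]        # mirror image: tail to the left

print(skewness(right_skewed) > 0)  # positive: tail goes right
print(skewness(left_skewed) < 0)   # negative: tail goes left
```

In `right_skewed` the mean (5.2) sits above the median (3), matching the table: the long right tail drags the mean toward it.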
Outlier Detection
IQR Fence Method:
- Lower fence = Q1 - 1.5 * IQR
- Upper fence = Q3 + 1.5 * IQR
- Values beyond the fences are flagged as outliers
- Robust: based on the quartiles, which extreme values barely move
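The fence calculation can be sketched in a few lines. `iqr_outliers` is a hypothetical helper name, the wait times are invented, and the multiplier `k=1.5` is the conventional choice from the rule above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, flagged_values)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return lower, upper, flagged

wait_times = [3, 4, 4, 5, 5, 6, 6, 7, 30]  # hypothetical minutes
lower, upper, flagged = iqr_outliers(wait_times)
print(lower, upper, flagged)
```

Here the fences come out at 1.0 and 9.0, so only the 30-minute wait is flagged.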
Z-Score Method:
- z = (x - mean) / standard deviation
- |z| > 2: unusual
- |z| > 3: very unusual
- Less robust: based on the mean and standard deviation, both of which the outliers themselves distort
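A short sketch of the z-score method on invented wait times. It uses the sample standard deviation (n - 1), the same convention as pandas' `s.std()`; `z_scores` is a hypothetical helper name.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (x - mean) / sample SD."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)  # sample SD (divides by n - 1)
    return [(x - mu) / sigma for x in values]

wait_times = [3, 4, 4, 5, 5, 6, 6, 7, 30]  # hypothetical minutes
z = z_scores(wait_times)

unusual = [x for x, zi in zip(wait_times, z) if abs(zi) > 2]
very_unusual = [x for x, zi in zip(wait_times, z) if abs(zi) > 3]
print(unusual, very_unusual)
```

The 30-minute wait lands between 2 and 3 standard deviations from the mean, not beyond 3, because it inflates the very standard deviation it is measured against. That is the lack of robustness in practice.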
Before removing any outlier, ask: Is it an error? A measurement problem? Or the most interesting data point in the set?
Essential Pandas Methods
# s is a pandas Series; df is a pandas DataFrame
s.mean()            # Mean
s.median()          # Median
s.mode()            # Mode(s), returned as a Series
s.std()             # Standard deviation (ddof=1 by default)
s.var()             # Variance (ddof=1 by default)
s.min(), s.max()    # Min, Max
s.quantile(0.25)    # 25th percentile (Q1); pass any q from 0 to 1
s.describe()        # Full summary
s.skew()            # Skewness
s.kurt()            # Excess kurtosis (normal distribution = 0)
# Grouped analysis
df.groupby('group')['value'].describe()
df.groupby('group')['value'].agg(['mean', 'median', 'std'])
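To make the grouped calls concrete, here is a small runnable sketch. The DataFrame and its values are invented for illustration; the column names `group` and `value` simply mirror the snippet above.

```python
import pandas as pd

# Hypothetical data: delivery times (minutes) for two stores.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "value": [10, 12, 11, 13, 20, 22, 21, 60],
})

# One row per group, one column per statistic.
summary = df.groupby("group")["value"].agg(["mean", "median", "std"])
print(summary)
```

Store B's mean (30.75) sits well above its median (21.5) because of the single 60-minute delivery, while Store A's mean and median agree at 11.5.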
Decision Flowchart
Got a numerical dataset?
|
v
STEP 1: Plot it (histogram or box plot)
|
v
STEP 2: Is it symmetric?
|
├── YES ─── Report mean ± std dev
|
└── NO ──── Report median (IQR)
|
v
STEP 3: Is it bimodal?
|
├── YES ─── Report each group separately
|
└── NO ──── Continue
|
v
STEP 4: Are there outliers?
|
├── YES ─── Investigate (error? interesting signal?)
|
└── NO ──── Continue
|
v
STEP 5: Are there meaningful subgroups?
|
├── YES ─── Stratify and report by group
|
└── NO ──── Report overall statistics
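Step 2 of the flowchart can be roughed out in code, with a heavy caveat: the symmetry check below (mean and median within `tol` standard deviations of each other) is a crude stand-in chosen for this sketch, not a rule from the chapter, and it does not replace looking at the plot. The function name and the `tol=0.2` default are assumptions.

```python
import statistics

def describe_center_and_spread(values, tol=0.2):
    """Pick mean ± SD or median (IQR) using a rough symmetry check."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    # Assumed heuristic: "symmetric enough" if mean and median
    # differ by at most tol standard deviations.
    if sd == 0 or abs(mean - median) / sd <= tol:
        return f"{mean:.1f} ± {sd:.1f} (mean ± SD)"
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return f"{median:.1f} (median, IQR {q1:.2f} to {q3:.2f})"

symmetric = [48, 49, 50, 51, 52]            # hypothetical data
skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]    # hypothetical data
print(describe_center_and_spread(symmetric))
print(describe_center_and_spread(skewed))
```

Steps 3 to 5 (bimodality, outliers, subgroups) still call for a histogram or box plot and some judgment; a single number cannot detect two peaks.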
Anscombe's Warning
Four datasets can have nearly identical means, standard deviations, and correlations but look completely different when plotted. Never trust summary statistics alone. Always plot your data.
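You can verify this yourself. The sketch below uses the values of Anscombe's quartet as they are commonly reproduced, and computes the summary statistics with the standard library; `pearson_r` is a small helper written for this example.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    print(name,
          round(statistics.fmean(xs), 2),   # mean of x
          round(statistics.fmean(ys), 2),   # mean of y
          round(statistics.stdev(ys), 2),   # sample SD of y
          round(pearson_r(xs, ys), 3))      # correlation
```

The four printed rows are almost indistinguishable, yet a scatter plot of each dataset shows a line with noise, a curve, a line with one outlier, and a vertical stack with one far-off point.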
What You Should Be Able to Do Now
- [ ] Compute and interpret mean, median, and mode in Python
- [ ] Compute and interpret standard deviation, variance, and IQR
- [ ] Describe a distribution's shape (symmetric, skewed, bimodal)
- [ ] Detect outliers using IQR fences and z-scores
- [ ] Choose mean + SD vs. median + IQR based on distribution shape
- [ ] Use .describe() and .groupby().agg() for grouped analysis
- [ ] Explain why plotting data is as important as computing statistics
- [ ] Write a plain-English interpretation of descriptive statistics for a non-technical audience
If all of these feel solid, you're ready for Chapter 20: Probability Thinking.