Key Takeaways: Descriptive Statistics

This is your reference card for Chapter 19. When someone hands you a dataset and says "tell me what you see," this is the page to come back to.


The Threshold Concept

Distribution thinking — seeing data as a shape, not just a single number — is the mindset shift that makes everything else in statistics click. Every time someone gives you an "average," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?"


Measures of Center

| Measure | What It Is | When to Use It | Watch Out For |
|---|---|---|---|
| Mean | Sum of values / count (the balance point) | Symmetric distributions without extreme outliers | Gets dragged by outliers and skewed tails |
| Median | The middle value when data is sorted (the 50th percentile) | Skewed distributions, data with outliers | Ignores the actual values of extreme points (that's the feature, not a bug) |
| Mode | The most frequent value | Categorical data, detecting peaks in continuous data | Might not be unique; less useful for continuous data |

The key rule: If the mean and median are close, the data is probably roughly symmetric and either works. If they're far apart, the data is skewed or has outliers, and the median is the better summary.
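To see the key rule in action, here is a minimal sketch. The salary figures and the color list are made up for illustration; they are not from the chapter.

```python
import pandas as pd

# Hypothetical salaries in thousands of dollars, with one extreme value.
salaries = pd.Series([40, 45, 50, 55, 60, 400])

mean = salaries.mean()      # pulled upward by the 400
median = salaries.median()  # middle of the sorted values, ignores how big 400 is

print(f"mean = {mean:.1f}, median = {median:.1f}")

# Mode is most natural for categorical data. It comes back as a Series
# because there can be ties.
colors = pd.Series(["red", "blue", "blue", "green"])
modes = colors.mode().tolist()
print("mode(s):", modes)
```

The mean lands at more than twice the median here, which is the signal to report the median instead.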


Measures of Spread

| Measure | What It Is | When to Use It | Watch Out For |
|---|---|---|---|
| Range | Max - Min | Quick sanity check only | Determined entirely by the two most extreme values |
| Variance | Average squared deviation from the mean (pandas divides by n - 1 by default) | Mathematical building block; used in formulas | Units are squared (hard to interpret directly) |
| Standard Deviation | Square root of variance ("typical distance from mean") | Symmetric data; pairs with the mean | Inflated by outliers |
| IQR | Q3 - Q1 (the range of the middle 50%) | Skewed data, data with outliers; pairs with the median | Ignores the tails entirely |

The pairing rule:
- Symmetric data: mean + standard deviation
- Skewed data: median + IQR
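The four spread measures can be computed side by side. This is a small sketch on made-up numbers; the variable names are only illustrative.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

value_range = values.max() - values.min()
variance = values.var()   # sample variance: divides by n - 1 (ddof=1)
std_dev = values.std()    # square root of the sample variance
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1             # width of the middle 50%

# Passing ddof=0 gives the population version (divides by n) instead.
pop_std = values.std(ddof=0)

print(f"range = {value_range}, variance = {variance:.2f}, "
      f"std = {std_dev:.2f}, IQR = {iqr}")
```

Note that pandas defaults to the sample formulas (ddof=1), so results can differ slightly from tools that divide by n.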


The Five-Number Summary

Minimum ── Q1 ── Median ── Q3 ── Maximum
           |←───── IQR ────→|
              (middle 50%)

This is what .describe() gives you (plus count, mean, and std). It's the skeleton of a box plot and a great first look at any numerical variable.
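As a quick illustration, the five numbers can be pulled straight out of the .describe() result by label. The data below is a toy example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

summary = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# Select just the five-number summary by its index labels.
five_number = summary[["min", "25%", "50%", "75%", "max"]]
print(five_number)
```

The labels "25%", "50%", and "75%" are Q1, the median, and Q3.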


Distribution Shape

| Shape | What It Looks Like | Mean vs. Median | Real-World Examples |
|---|---|---|---|
| Symmetric | Mirror-image around center | Mean ≈ Median | Heights, many test scores |
| Right-skewed | Long tail to the RIGHT | Mean > Median | Income, home prices, wait times |
| Left-skewed | Long tail to the LEFT | Mean < Median | Age at death, easy exam scores |
| Bimodal | Two distinct peaks | Mean falls in the valley (misleading!) | Mixed populations |

The tail rule: Skewness is named for the direction of the tail, not the peak. Right-skewed means the tail goes right.
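The tail rule can be checked numerically: the sign of .skew() follows the tail, and so does the mean. Here is a sketch with two made-up series that are mirror images of each other.

```python
import pandas as pd

# Long tail to the right: one large value far above the rest.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
# The mirror image: long tail to the left.
left_skewed = -right_skewed

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}: skew = {s.skew():.2f}, "
          f"mean = {s.mean():.2f}, median = {s.median():.2f}")
```

Positive skewness goes with mean > median, negative skewness with mean < median.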


Outlier Detection

IQR Fence Method:
- Lower fence = Q1 - 1.5 * IQR
- Upper fence = Q3 + 1.5 * IQR
- Values beyond the fences are flagged as outliers
- Robust: built from the quartiles, which extreme values barely move

Z-Score Method:
- z = (x - mean) / standard deviation
- |z| > 2: unusual
- |z| > 3: very unusual
- Less robust: built from the mean and standard deviation, both of which outliers distort

Before removing any outlier, ask: Is it an error? A measurement problem? Or the most interesting data point in the set?
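Both methods fit in a few lines of pandas. The sketch below uses invented measurements with one obvious stray value; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# IQR fence method
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = s[(s < lower_fence) | (s > upper_fence)]

# Z-score method (uses the sample standard deviation, ddof=1)
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]

print("IQR fences:", lower_fence, upper_fence)
print("Flagged by IQR:", iqr_outliers.tolist())
print("Flagged by |z| > 2:", z_outliers.tolist())
print("Largest |z|:", round(z.abs().max(), 2))
```

In this example the value 102 sits far outside the fences, yet its z-score stays below 3 because the outlier itself inflates the standard deviation. That is the robustness gap between the two methods.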


Essential Pandas Methods

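# Assumes `import pandas as pd`; s is a numeric pd.Series, df a pd.DataFrame
s.mean()             # Mean
s.median()           # Median
s.mode()             # Mode(s), returned as a Series (there can be ties)
s.std()              # Sample standard deviation (ddof=1 by default)
s.var()              # Sample variance (ddof=1 by default)
s.min(), s.max()     # Min, Max
s.quantile(0.25)     # 25th percentile (Q1); pass any q between 0 and 1
s.describe()         # count, mean, std, min, 25%, 50%, 75%, max
s.skew()             # Skewness (positive = right-skewed)
s.kurt()             # Excess kurtosis (about 0 for a normal distribution)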

# Grouped analysis
df.groupby('group')['value'].describe()
df.groupby('group')['value'].agg(['mean', 'median', 'std'])
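
For a runnable version of the grouped calls, here is a sketch on a tiny made-up DataFrame. The column names 'group' and 'value' match the snippet above; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 60],
})

# One row per group, one column per statistic.
stats = df.groupby("group")["value"].agg(["mean", "median", "std"])
print(stats)
```

Group "b" shows a mean well above its median, a hint that it is right-skewed even though the overall table hides that.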

Decision Flowchart

Got a numerical dataset?
  |
  v
STEP 1: Plot it (histogram or box plot)
  |
  v
STEP 2: Is it symmetric?
  |
  ├── YES ─── Report mean ± std dev
  |
  └── NO ──── Report median (IQR)
  |
  v
STEP 3: Is it bimodal?
  |
  ├── YES ─── Report each group separately
  |
  └── NO ──── Continue
  |
  v
STEP 4: Are there outliers?
  |
  ├── YES ─── Investigate (error? interesting signal?)
  |
  └── NO ──── Continue
  |
  v
STEP 5: Are there meaningful subgroups?
  |
  ├── YES ─── Stratify and report by group
  |
  └── NO ──── Report overall statistics
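
Step 2 of the flowchart can be roughed out in code. The helper below is hypothetical: the name summarize and the skewness cutoff of 1.0 are assumptions made for illustration, not rules from the chapter, and a numeric check like this is no substitute for the plot in Step 1.

```python
import pandas as pd

def summarize(s: pd.Series, skew_cutoff: float = 1.0) -> str:
    """Pick 'mean ± SD' for roughly symmetric data, else 'median (IQR)'.

    skew_cutoff is an illustrative assumption, not a standard threshold.
    """
    if abs(s.skew()) < skew_cutoff:
        return f"{s.mean():.1f} ± {s.std():.1f} (mean ± SD)"
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return f"{s.median():.1f} (IQR {q1:.2f} to {q3:.2f})"

symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

symmetric_report = summarize(symmetric)
skewed_report = summarize(skewed)
print(symmetric_report)
print(skewed_report)
```

Bimodality, outliers, and subgroups (Steps 3 to 5) still need a look at the plot and some judgment; this sketch only covers the symmetric-versus-skewed branch.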

Anscombe's Warning

Anscombe's quartet is a set of four datasets with nearly identical means, standard deviations, and correlations that look completely different when plotted. Never trust summary statistics alone. Always plot your data.
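
The warning is easy to verify. The sketch below types in the values of Anscombe's quartet as published by Anscombe (1973) and prints the summary statistics for each dataset; the dictionary layout and variable names are just one way to organize it.

```python
import pandas as pd

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": pd.DataFrame({"x": x123, "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                        7.24, 4.26, 10.84, 4.82, 5.68]}),
    "II": pd.DataFrame({"x": x123, "y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                                         6.13, 3.10, 9.13, 7.26, 4.74]}),
    "III": pd.DataFrame({"x": x123, "y": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                          6.08, 5.39, 8.15, 6.42, 5.73]}),
    "IV": pd.DataFrame({"x": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                        "y": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                              5.25, 12.50, 5.56, 7.91, 6.89]}),
}

for name, d in quartet.items():
    print(f"{name}: mean_x = {d['x'].mean():.2f}, mean_y = {d['y'].mean():.2f}, "
          f"std_y = {d['y'].std():.2f}, corr = {d['x'].corr(d['y']):.2f}")
```

The printed rows agree to two decimal places. Plotting each dataset (for example with d.plot.scatter(x="x", y="y"), which needs matplotlib installed) shows four very different pictures.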


What You Should Be Able to Do Now

  • [ ] Compute and interpret mean, median, and mode in Python
  • [ ] Compute and interpret standard deviation, variance, and IQR
  • [ ] Describe a distribution's shape (symmetric, skewed, bimodal)
  • [ ] Detect outliers using IQR fences and z-scores
  • [ ] Choose mean + SD vs. median + IQR based on distribution shape
  • [ ] Use .describe() and .groupby().agg() for grouped analysis
  • [ ] Explain why plotting data is as important as computing statistics
  • [ ] Write a plain-English interpretation of descriptive statistics for a non-technical audience

If all of these feel solid, you're ready for Chapter 20: Probability Thinking.