Key Takeaways: Descriptive Statistics

Contributors to Introduction to Data Science

This is your reference card for Chapter 19. When someone hands you a dataset and says "tell me what you see," this is the page to come back to.

The Threshold Concept

Distribution thinking — seeing data as a shape, not just a single number — is the mindset shift that makes everything else in statistics click. Every time someone gives you an "average," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?"

Measures of Center

Measure	What It Is	When to Use It	Watch Out For
Mean	Sum of values / count — the balance point	Symmetric distributions without extreme outliers	Gets dragged by outliers and skewed tails
Median	The middle value when data is sorted — the 50th percentile	Skewed distributions, data with outliers	Ignores the actual values of extreme points (that's the feature, not a bug)
Mode	The most frequent value	Categorical data, detecting peaks in continuous data	Might not be unique; less useful for continuous data

The key rule: If the mean and median are close, the data is probably roughly symmetric and either works. If they're far apart, the data is skewed or contains outliers, and the median is the safer summary.
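
To make the key rule concrete, here is a minimal sketch using made-up numbers. The `salaries` values are hypothetical (in thousands) and chosen so that a single extreme value pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical salaries in thousands; the last value is an extreme outlier.
salaries = pd.Series([42, 45, 48, 50, 52, 55, 400])

mean_salary = salaries.mean()      # pulled upward by the 400
median_salary = salaries.median()  # middle value of the sorted data

print(f"mean={mean_salary:.1f}, median={median_salary:.1f}")
# mean=98.9, median=50.0
```

Six of the seven values sit near 50, so the median describes a typical value here and the mean does not.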

Measures of Spread

Measure	What It Is	When to Use It	Watch Out For
Range	Max - Min	Quick sanity check only	Determined entirely by the two most extreme values
Variance	Average of squared deviations from mean	Mathematical building block; used in formulas	Units are squared (hard to interpret directly)
Standard Deviation	Square root of variance — "typical distance from mean"	Symmetric data; pairs with the mean	Inflated by outliers
IQR	Q3 - Q1 — the range of the middle 50%	Skewed data, data with outliers; pairs with the median	Ignores the tails entirely

The pairing rule:
- Symmetric data: mean + standard deviation
- Skewed data: median + IQR
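
One way to see why the pairing rule matters is to add a single extreme value and watch what happens to each measure of spread. This is a sketch with invented numbers; `spread` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18])
# Same data plus one extreme value (invented for illustration).
with_outlier = pd.concat([values, pd.Series([90])], ignore_index=True)

def spread(s):
    """Return the standard deviation and IQR of a Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {"std": s.std(), "iqr": q3 - q1}

base = spread(values)
out = spread(with_outlier)
print(base)
print(out)
```

The IQR moves from 3.0 to 3.75, while the standard deviation grows roughly tenfold. That is the practical reason to pair the median with the IQR when the data has a long tail.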

The Five-Number Summary

Minimum ── Q1 ── Median ── Q3 ── Maximum
           |←── IQR ──→|
           (middle 50%)

This is what .describe() gives you (plus count, mean, and std). It's the skeleton of a box plot and a great first look at any numerical variable.
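
As a quick check of that claim, the sketch below computes the five numbers directly with `quantile` and compares them with the matching rows of `.describe()`. The data is invented, and the labels assigned to `five.index` are just for readability.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 15])

# Five-number summary: min, Q1, median, Q3, max.
five = s.quantile([0, 0.25, 0.5, 0.75, 1.0])
five.index = ["min", "Q1", "median", "Q3", "max"]

desc = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max

print(five)
print(desc[["min", "25%", "50%", "75%", "max"]])
```

Both printouts show the same five values; `.describe()` simply labels the quartiles as `25%`, `50%`, and `75%`.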

Distribution Shape

Shape	What It Looks Like	Mean vs. Median	Real-World Examples
Symmetric	Mirror-image around center	Mean ≈ Median	Heights, many test scores
Right-skewed	Long tail to the RIGHT	Mean > Median	Income, home prices, wait times
Left-skewed	Long tail to the LEFT	Mean < Median	Age at death, easy exam scores
Bimodal	Two distinct peaks	Mean falls in the valley (misleading!)	Mixed populations

The tail rule: Skewness is named for the direction of the tail, not the peak. Right-skewed means the tail goes right.
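
The tail rule can be checked numerically: the sign of `s.skew()` follows the direction of the tail, and the mean is pulled the same way. The data below is invented, and the left-skewed series is just its mirror image.

```python
import pandas as pd

# Long tail to the right: most values are small, one is large.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
# Mirror image, so the long tail points left.
left_skewed = 21 - right_skewed

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}: skew={s.skew():+.2f}, "
          f"mean={s.mean():.1f}, median={s.median():.1f}")
```

The right-skewed series has positive skewness and a mean above its median; the mirrored series shows the opposite on both counts.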

Outlier Detection

IQR Fence Method:
- Lower fence = Q1 - 1.5 * IQR
- Upper fence = Q3 + 1.5 * IQR
- Values beyond the fences are flagged as outliers
- Robust: based on the quartiles, which a few extreme values barely move
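
A minimal sketch of the fence calculation follows. `iqr_outliers` is a hypothetical helper name, the wait times are invented, and the quartiles use pandas' default linear interpolation, so other tools may place the fences slightly differently.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Return the values of s that fall outside the IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]

# Hypothetical wait times in minutes.
wait_times = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 45])
flagged = iqr_outliers(wait_times)
print(flagged.tolist())  # [45]
```

Here Q1 = 5 and Q3 = 8, so the fences sit at 0.5 and 12.5 and only the 45 is flagged.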

Z-Score Method:
- z = (x - mean) / standard deviation
- |z| > 2: unusual
- |z| > 3: very unusual
- Less robust: based on the mean and standard deviation, both of which outliers themselves distort
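
The z-score formula translates directly into pandas. In this sketch `z_scores` is a hypothetical helper, the wait times are invented, and `s.std()` uses the pandas default of `ddof=1`.

```python
import pandas as pd

def z_scores(s):
    """Standardize a Series: distance from the mean in standard deviations."""
    return (s - s.mean()) / s.std()

# Hypothetical wait times in minutes, with one extreme value.
wait_times = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 45])
z = z_scores(wait_times)

largest = z.abs().max()
print(f"largest |z| = {largest:.2f}")  # between 2 and 3
```

The 45 is clearly extreme, yet its z-score stays under 3 because the outlier inflates the very standard deviation it is measured against. In small samples this can hide exactly the points you are looking for.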

Before removing any outlier, ask: Is it an error? A measurement problem? Or the most interesting data point in the set?

Essential Pandas Methods

s.mean()             # Mean
s.median()           # Median
s.mode()             # Mode(s)
s.std()              # Standard deviation (ddof=1 by default)
s.var()              # Variance (ddof=1 by default)
s.min(), s.max()     # Min, Max
s.quantile(0.25)     # 25th percentile (pass any q from 0 to 1)
s.describe()         # Full summary
s.skew()             # Skewness
s.kurt()             # Excess kurtosis (normal distribution = 0)

# Grouped analysis
df.groupby('group')['value'].describe()
df.groupby('group')['value'].agg(['mean', 'median', 'std'])
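
For a runnable version of the grouped calls, here is a small sketch on a toy DataFrame. The column names `group` and `value` match the snippet above; the data itself is invented.

```python
import pandas as pd

# Toy data: group "b" has a long right tail.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 60],
})

# One row per group, one column per statistic.
summary = df.groupby("group")["value"].agg(["mean", "median", "std"])
print(summary)
```

In group `b` the mean (30) sits well above the median (20), the same skew signal as before, now visible per group.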

Decision Flowchart

Got a numerical dataset?
  |
  v
STEP 1: Plot it (histogram or box plot)
  |
  v
STEP 2: Is it symmetric?
  |
  ├── YES ─── Report mean ± std dev
  |
  └── NO ──── Report median (IQR)
  |
  v
STEP 3: Is it bimodal?
  |
  ├── YES ─── Report each group separately
  |
  └── NO ──── Continue
  |
  v
STEP 4: Are there outliers?
  |
  ├── YES ─── Investigate (error? interesting signal?)
  |
  └── NO ──── Continue
  |
  v
STEP 5: Are there meaningful subgroups?
  |
  ├── YES ─── Stratify and report by group
  |
  └── NO ──── Report overall statistics
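
Step 2 of the flowchart can be roughly approximated in code, with caveats. In the sketch below, `summarize` is a hypothetical helper and the skewness cutoff of 1.0 is an assumed rule of thumb, not a value from this chapter. It does not replace Step 1: a bimodal distribution can have skewness near zero, so still plot first.

```python
import pandas as pd

def summarize(s, skew_cutoff=1.0):
    """Pick a center/spread pair from sample skewness (assumed cutoff)."""
    if abs(s.skew()) < skew_cutoff:
        return f"mean {s.mean():.1f} ± {s.std():.1f} (SD)"
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return f"median {s.median():.1f} (IQR {q1:.1f} to {q3:.1f})"

symmetric = pd.Series([10, 12, 13, 14, 15, 16, 18])
skewed = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 45])

print(summarize(symmetric))  # mean 14.0 ± 2.6 (SD)
print(summarize(skewed))     # median 6.0 (IQR 5.0 to 8.0)
```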

Anscombe's Warning

Four datasets can have nearly identical means, standard deviations, and correlations but look completely different when plotted. Never trust summary statistics alone. Always plot your data.
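
The warning can be verified with Anscombe's quartet itself. The sketch below hardcodes the quartet's values (typed in by hand, so worth checking against a trusted copy) and computes the summary statistics; the variable names are illustrative.

```python
import pandas as pd

# Anscombe's quartet. Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

rows = {}
for name, (x, y) in quartet.items():
    xs, ys = pd.Series(x, dtype=float), pd.Series(y)
    rows[name] = {
        "mean_y": ys.mean(),
        "std_y": ys.std(),
        "corr_xy": xs.corr(ys),  # Pearson correlation
    }

# One row per dataset, rounded to two decimals.
stats = pd.DataFrame(rows).T.round(2)
print(stats)
```

All four rows agree to two decimal places, yet scatter plots of the four sets show a noisy line, a curve, a line with one outlier, and a vertical stack with one distant point.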

What You Should Be Able to Do Now

[ ] Compute and interpret mean, median, and mode in Python
[ ] Compute and interpret standard deviation, variance, and IQR
[ ] Describe a distribution's shape (symmetric, skewed, bimodal)
[ ] Detect outliers using IQR fences and z-scores
[ ] Choose mean + SD vs. median + IQR based on distribution shape
[ ] Use .describe() and .groupby().agg() for grouped analysis
[ ] Explain why plotting data is as important as computing statistics
[ ] Write a plain-English interpretation of descriptive statistics for a non-technical audience

If all of these feel solid, you're ready for Chapter 20: Probability Thinking.