Chapter 25 Key Takeaways: Descriptive Statistics for Business Decisions
The Big Idea
Descriptive statistics transforms raw data into a summary that reveals the shape, center, and spread of your business metrics. The goal is not to compute numbers for their own sake — it is to move from "I think" to "the data shows" so that decisions are grounded in evidence rather than intuition alone.
Central Tendency: Choosing the Right "Typical"
Use the median when:
- Your data is skewed (e.g., revenue, deal sizes, salaries, customer LTV)
- You want to represent the "typical" case, not the aggregate
- Outliers exist and you do not want them to distort the picture

Use the mean when:
- Your data is roughly symmetric (similar mean and median)
- You need an aggregate projection (total revenue = mean × count)
- All data points matter equally and none are extreme outliers

Use the mode when:
- Your data is categorical (e.g., most common product, most common issue type)
- You want the single most representative value in a discrete dataset
The quick check: If mean >> median, you have right skew and a few large values pulling the average up. If mean << median, you have left skew. Either way, the median is the more honest "typical" value.
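The quick check takes two lines of pandas. A minimal sketch with hypothetical deal sizes, where a single whale deal drags the mean far above the median:

```python
import pandas as pd

# Hypothetical deal sizes (in $k): six ordinary deals plus one whale
deals = pd.Series([10, 12, 15, 14, 11, 13, 200])

mean, median = deals.mean(), deals.median()
print(mean, median)  # mean ≈ 39.3 vs. median 13.0: mean >> median, so right skew
```

Here the median (13) is the honest "typical" deal; the mean is inflated by one value.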
Spread: Consistency Is as Valuable as Performance
| Statistic | What It Measures | When to Use It |
|---|---|---|
| Range | Full span (max - min) | Quick first look; fragile with outliers |
| Standard Deviation | Typical distance from the mean | When distribution is roughly symmetric |
| Variance | Std dev squared | In formulas; hard to interpret directly |
| IQR | Spread of the middle 50% | When outliers exist; paired with median |
| Coefficient of Variation | Std dev as % of mean | Comparing variability across different-scale metrics |
Low standard deviation = high consistency = lower risk. Two teams with the same mean performance are not the same if one is consistent and one is wildly variable.
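A small illustration with invented weekly numbers for two hypothetical teams that share a mean but not a risk profile:

```python
import pandas as pd

# Hypothetical weekly sales: identical means, very different consistency
team_a = pd.Series([100, 102, 98, 101, 99])   # steady performer
team_b = pd.Series([60, 150, 80, 140, 70])    # erratic performer

# Both means are 100.0, but team_b's standard deviation is far larger
spread_a, spread_b = team_a.std(), team_b.std()

# Coefficient of variation: std as a % of the mean, comparable across scales
cv_a = team_a.std() / team_a.mean() * 100
cv_b = team_b.std() / team_b.mean() * 100
```

An average alone would rank these teams as equals; the spread statistics reveal the difference.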
Percentiles and Quartiles: The Language of Segmentation
- Percentiles answer "where does this value rank relative to all others?"
- Quartiles divide your data into four equal groups — essential for performance tiers
- The IQR (Q3 - Q1) is the most robust measure of spread for skewed business data
- Common business percentile uses:
- P80 revenue threshold = "premium customer" cutoff
- P90 resolution time = SLA benchmark
- P25 performance = threshold for coaching intervention
- P95+ values = candidates for outlier investigation
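Each of these cutoffs is one NumPy call. A sketch with hypothetical per-customer revenue figures (note that NumPy interpolates between observations by default):

```python
import numpy as np

# Hypothetical monthly revenue per customer
revenue = np.array([120, 340, 95, 410, 280, 150, 990, 220, 310, 180])

premium_cutoff = np.percentile(revenue, 80)  # "premium customer" threshold
sla_benchmark  = np.percentile(revenue, 90)  # top-decile boundary
coaching_line  = np.percentile(revenue, 25)  # bottom-quartile boundary
```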
Outlier Detection
Two standard methods:
1. IQR method: Outlier if value < Q1 - 1.5×IQR or value > Q3 + 1.5×IQR (robust; preferred for skewed data)
2. Z-score method: Outlier if |z-score| > 2 or 3 (works best with roughly normal distributions)
Not all outliers are errors — some are genuinely exceptional cases (a whale client, a record-breaking quarter). Always investigate before excluding. Report outliers separately when they would distort aggregate analysis.
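The z-score variant of the rule above can be sketched like this, using hypothetical daily order counts; the flagged 120 may well be a genuine record-breaking day rather than an error:

```python
import numpy as np
from scipy import stats

# Hypothetical daily order counts with one record-breaking day
orders = np.array([50, 52, 48, 51, 49, 53, 47, 120])

z = stats.zscore(orders)         # std devs from the mean for each value
flagged = orders[np.abs(z) > 2]  # the |z| > 2 rule flags only the 120
```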
Correlation: Relationships Without Causation
- Correlation ranges from -1.0 (perfect negative) to +1.0 (perfect positive)
- |r| > 0.7: strong; 0.4–0.7: moderate; < 0.4: weak (general business guidelines)
- .corr() in pandas calculates the full correlation matrix for all numeric columns
- Correlation only measures linear relationships — strong non-linear patterns can show near-zero correlation
- Correlation never proves causation — you need a controlled experiment or a clear causal mechanism
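The linear-only limitation is easy to demonstrate with toy data: a perfect parabolic relationship produces a Pearson r of essentially zero:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(-5, 6, dtype=float))
y = x ** 2            # y is fully determined by x, but not linearly

r = x.corr(y)         # Pearson r is ~0 despite the perfect dependence
```

A scatter plot would reveal the pattern instantly, which is one more reason to visualize before trusting a correlation matrix.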
Three Visualization Tools
| Chart Type | Best For |
|---|---|
| Histogram | Distribution shape — is it skewed? symmetric? bimodal? |
| Box plot | Comparing distributions across groups; spots the median and quartiles immediately |
| Violin plot | Combines box plot summary with full distribution shape |
Always mark the mean and median on histograms when presenting to business audiences — the gap between them often tells the most important story.
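A sketch of that practice with matplotlib and a synthetic right-skewed sample; the lognormal data and the output filename are illustrative choices, not fixed conventions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=3, sigma=0.6, size=500)  # hypothetical skewed revenue

fig, ax = plt.subplots()
ax.hist(data, bins=30)
ax.axvline(data.mean(), color="red", linestyle="--", label="mean")
ax.axvline(np.median(data), color="black", label="median")
ax.legend()
fig.savefig("revenue_hist.png")
```

The visible gap between the two vertical lines is the skew story the audience needs to see.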
pandas describe(): Your Starting Point
df.describe() gives you: count, mean, std, min, 25%, 50%, 75%, max.
Diagnostic checklist when reviewing describe() output:
1. Is count less than your total rows? → Missing data
2. Is mean >> 50% (median)? → Right skew; use median as "typical"
3. Is std more than ~50% of mean? → High variability; investigate consistency
4. Is max wildly larger than 75%? → Outliers present
5. Is min unexpectedly low (zero or negative where impossible)? → Data quality issue
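The checklist maps directly onto the describe() output. A sketch with a hypothetical revenue column containing one missing value and a heavy right tail; the 2× threshold in check 4 is an arbitrary heuristic for illustration, not a standard rule:

```python
import numpy as np
import pandas as pd

# Hypothetical revenue column: one missing value, one extreme tail value
df = pd.DataFrame({"revenue": [100, 120, 90, 110, 105, np.nan, 2000]})
desc = df["revenue"].describe()

missing_data   = desc["count"] < len(df)           # check 1: NaNs present
right_skew     = desc["mean"] > desc["50%"]        # check 2: mean >> median
high_variation = desc["std"] > 0.5 * desc["mean"]  # check 3: std > ~50% of mean
has_outliers   = desc["max"] > 2 * desc["75%"]     # check 4: rough heuristic
```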
scipy.stats Tools Worth Knowing
import numpy as np
from scipy import stats

stats.skew(data)        # skewness: positive = right tail, negative = left tail
stats.kurtosis(data)    # tail heaviness relative to normal distribution
stats.zscore(data)      # z-score for each value (how many std devs from mean)
np.percentile(data, 80) # value at the 80th percentile
Business Statistics Traps
Simpson's Paradox: A trend that appears in each subgroup can reverse or disappear when groups are combined, usually because groups are unequal in size. Always segment before concluding. Always ask: "Could a hidden variable be driving this?"
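A minimal numeric sketch of the paradox, with invented conversion counts: variant B wins inside every segment, yet variant A wins on the combined totals because the segment sizes are wildly unequal:

```python
import pandas as pd

# Invented A/B conversion data across two unequal segments
data = pd.DataFrame({
    "segment":     ["mobile", "mobile", "desktop", "desktop"],
    "variant":     ["A", "B", "A", "B"],
    "conversions": [10, 40, 450, 50],
    "visitors":    [100, 300, 900, 90],
})

# Per-segment rates: B beats A in both mobile and desktop
rates = data["conversions"] / data["visitors"]
per_segment = data.assign(rate=rates).set_index(["segment", "variant"])["rate"]

# Combined rates: A beats B once the segments are pooled
overall = data.groupby("variant")[["conversions", "visitors"]].sum()
overall["rate"] = overall["conversions"] / overall["visitors"]
```

Reporting only the pooled rate here would flip the conclusion, which is exactly why segmenting before concluding is non-negotiable.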
Survivorship Bias: Analyzing only successful or surviving cases misses the failures that could overturn your conclusion. Always ask: "What is not in my dataset because it did not survive to be measured?"
Correlation vs. Causation: Two variables moving together does not mean one causes the other. Both might be caused by a third variable. To establish causation, you need either a controlled experiment or a compelling causal mechanism with strong logical support.
The "So What" Discipline
Every statistic you report must connect to a decision. The chain:
- Observation: What does the number show?
- Interpretation: What does this mean about how the business operates?
- Implication: What would be true if this pattern continues?
- Decision: What should management do differently as a result?
A number without a decision is trivia. A number with a decision is insight.
Code Patterns to Memorize
# Central tendency
df["col"].mean()
df["col"].median()
df["col"].mode()[0]
# Spread
df["col"].std()
df["col"].var()
q1, q3 = df["col"].quantile(0.25), df["col"].quantile(0.75)
iqr = q3 - q1
# Percentiles
df["col"].quantile(0.80) # 80th percentile
df["col"].describe() # all key stats at once
# Outlier detection (IQR method)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df["col"] < lower) | (df["col"] > upper)]
# Correlation
df[cols].corr() # full matrix
df["a"].corr(df["b"]) # single pair
# Group statistics
df.groupby("category")["metric"].agg(["mean", "median", "std"])
Connections to the Next Chapter
Chapter 26 extends descriptive statistics into time: how do your metrics change over weeks, months, and years? Moving averages, trend lines, and seasonality are all built on the same foundational thinking developed here — measuring what is typical and what is variable, but now with the dimension of time added.