Chapter 25 Key Takeaways: Descriptive Statistics for Business Decisions


The Big Idea

Descriptive statistics transforms raw data into a summary that reveals the shape, center, and spread of your business metrics. The goal is not to compute numbers for their own sake — it is to move from "I think" to "the data shows" so that decisions are grounded in evidence rather than intuition alone.


Central Tendency: Choosing the Right "Typical"

Use the median when:
  • Your data is skewed (e.g., revenue, deal sizes, salaries, customer LTV)
  • You want to represent the "typical" case, not the aggregate
  • Outliers exist and you do not want them to distort the picture

Use the mean when:
  • Your data is roughly symmetric (similar mean and median)
  • You need an aggregate projection (total revenue = mean × count)
  • All data points matter equally and none are extreme outliers

Use the mode when:
  • Your data is categorical (e.g., most common product, most common issue type)
  • You want the single most representative value in a discrete dataset

The quick check: If mean >> median, you have right skew and a few large values pulling the average up. If mean << median, you have left skew. Either way, the median is the more honest "typical" value.
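The quick check above can be seen in a few lines of pandas. The deal sizes below are hypothetical figures chosen to illustrate right skew:

```python
import pandas as pd

# Hypothetical deal sizes: several mid-sized deals plus one whale (assumed data).
deals = pd.Series([12_000, 15_000, 14_000, 13_500, 16_000, 250_000])

mean, median = deals.mean(), deals.median()
print(f"mean={mean:,.0f}, median={median:,.0f}")
# The single large deal pulls the mean far above the median: right skew,
# so the median is the more honest "typical" deal size here.
```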


Spread: Consistency Is as Valuable as Performance

| Statistic | What It Measures | When to Use It |
| --- | --- | --- |
| Range | Full span (max − min) | Quick first look; fragile with outliers |
| Standard deviation | Typical distance from the mean | When the distribution is roughly symmetric |
| Variance | Standard deviation squared | In formulas; hard to interpret directly |
| IQR | Spread of the middle 50% | When outliers exist; pair with the median |
| Coefficient of variation | Std dev as % of mean | Comparing variability across different-scale metrics |

Low standard deviation = high consistency = lower risk. Two teams with the same mean performance are not the same if one is consistent and one is wildly variable.
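A sketch of that comparison, using hypothetical monthly figures for two teams with identical means. The coefficient of variation makes the difference in consistency explicit:

```python
import pandas as pd

# Hypothetical monthly sales for two teams with the same mean (assumed figures).
team_a = pd.Series([98, 102, 100, 101, 99])    # consistent
team_b = pd.Series([60, 140, 100, 150, 50])    # volatile

# Coefficient of variation: std as a fraction of the mean,
# comparable across metrics of different scales.
cv_a = team_a.std() / team_a.mean()
cv_b = team_b.std() / team_b.mean()
# Same mean performance, but team B carries far more variability (risk).
```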


Percentiles and Quartiles: The Language of Segmentation

  • Percentiles answer "where does this value rank relative to all others?"
  • Quartiles divide your data into four equal groups — essential for performance tiers
  • The IQR (Q3 - Q1) is the most robust measure of spread for skewed business data
  • Common business percentile uses:
      • P80 revenue threshold = "premium customer" cutoff
      • P90 resolution time = SLA benchmark
      • P25 performance = threshold for coaching intervention
      • P95+ values = candidates for outlier investigation
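The percentile cutoffs above map directly onto pandas' `quantile`. The customer revenue figures below are hypothetical:

```python
import pandas as pd

# Hypothetical per-customer revenue figures (assumed data).
revenue = pd.Series([120, 340, 560, 220, 980, 1500, 430, 610, 275, 890])

premium_cutoff = revenue.quantile(0.80)   # P80 -> "premium customer" threshold
coaching_line  = revenue.quantile(0.25)   # P25 -> coaching intervention threshold

premium = revenue[revenue >= premium_cutoff]   # customers at or above P80
```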

Outlier Detection

Two standard methods:

  1. IQR method: flag a value as an outlier if it is below Q1 - 1.5×IQR or above Q3 + 1.5×IQR (robust; preferred for skewed data)
  2. Z-score method: flag a value as an outlier if |z-score| > 2 or 3 (works best with roughly normal distributions)

Not all outliers are errors — some are genuinely exceptional cases (a whale client, a record-breaking quarter). Always investigate before excluding. Report outliers separately when they would distort aggregate analysis.
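A minimal sketch of the z-score method, using hypothetical weekly order counts with one exceptional week:

```python
import pandas as pd
from scipy import stats

# Hypothetical weekly order counts with one record-breaking week (assumed data).
orders = pd.Series([50, 52, 48, 51, 49, 50, 120])

# stats.zscore: how many standard deviations each value sits from the mean.
z = pd.Series(stats.zscore(orders), index=orders.index)

flagged = orders[z.abs() > 2]   # |z| > 2 threshold from the text
# The flagged week may be a genuine exception, not an error: investigate before excluding.
```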


Correlation: Relationships Without Causation

  • Correlation ranges from -1.0 (perfect negative) to +1.0 (perfect positive)
  • |r| > 0.7: strong; 0.4–0.7: moderate; < 0.4: weak (general business guidelines)
  • .corr() in pandas calculates the full correlation matrix for all numeric columns
  • Correlation only measures linear relationships — strong non-linear patterns can show near-zero correlation
  • Correlation never proves causation — you need a controlled experiment or a clear causal mechanism
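The linearity caveat above is easy to demonstrate: a perfect U-shaped relationship has a Pearson correlation of essentially zero. The data here is constructed for illustration:

```python
import numpy as np
import pandas as pd

# A deterministic quadratic relationship: the strongest possible pattern,
# yet the *linear* correlation is essentially zero.
x = pd.Series(np.arange(-5, 6))   # -5 ... 5, symmetric around 0
y = x ** 2                        # perfect non-linear dependence

r = x.corr(y)
# r is ~0 even though y is completely determined by x: always plot the data.
```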

Three Visualization Tools

| Chart Type | Best For |
| --- | --- |
| Histogram | Distribution shape: is it skewed? symmetric? bimodal? |
| Box plot | Comparing distributions across groups; shows the median and quartiles at a glance |
| Violin plot | Combines the box plot summary with the full distribution shape |

Always mark the mean and median on histograms when presenting to business audiences — the gap between them often tells the most important story.
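One way to mark the mean and median, sketched with matplotlib (assuming it is available) and hypothetical right-skewed revenue figures:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical right-skewed revenue data (assumed figures).
data = pd.Series([10, 12, 11, 13, 12, 14, 11, 60, 80])

fig, ax = plt.subplots()
ax.hist(data, bins=10)
ax.axvline(data.mean(), color="red", linestyle="--",
           label=f"mean = {data.mean():.1f}")
ax.axvline(data.median(), color="black",
           label=f"median = {data.median():.1f}")
ax.legend()
fig.savefig("revenue_hist.png")
# The visible gap between the two lines is the skew story for the audience.
```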


pandas describe(): Your Starting Point

df.describe() gives you: count, mean, std, min, 25%, 50%, 75%, max.

Diagnostic checklist when reviewing describe() output:

  1. Is count less than your total rows? → Missing data
  2. Is mean >> 50% (the median)? → Right skew; use the median as "typical"
  3. Is std more than ~50% of the mean? → High variability; investigate consistency
  4. Is max wildly larger than 75%? → Outliers present
  5. Is min unexpectedly low (zero or negative where impossible)? → Data quality issue
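The checklist can be run mechanically against describe() output. The column below is hypothetical, built to trip several of the checks at once:

```python
import pandas as pd

# Hypothetical metric with a missing value, right skew, and a high outlier (assumed data).
df = pd.DataFrame({"deal_size": [10, 12, 11, None, 13, 12, 14, 11, 200]})

desc = df["deal_size"].describe()
checks = {
    "missing_data":     desc["count"] < len(df),          # count below row total
    "right_skew":       desc["mean"] > desc["50%"],       # mean above median
    "high_variability": desc["std"] > 0.5 * desc["mean"], # std > ~50% of mean
    "outliers_high":    desc["max"] > 2 * desc["75%"],    # max far beyond Q3
}
```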


scipy.stats Tools Worth Knowing

import numpy as np
from scipy import stats

stats.skew(data)          # skewness: positive = right tail, negative = left tail
stats.kurtosis(data)      # tail heaviness relative to a normal distribution
stats.zscore(data)        # z-score for each value (how many std devs from the mean)
np.percentile(data, 80)   # value at the 80th percentile

Business Statistics Traps

Simpson's Paradox: A trend that appears in each subgroup can reverse or disappear when groups are combined, usually because groups are unequal in size. Always segment before concluding. Always ask: "Could a hidden variable be driving this?"
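A compact illustration of the paradox, with conversion figures constructed for the example: channel B wins within every segment, yet channel A wins in the combined totals because the two channels are concentrated in segments with very different base rates.

```python
import pandas as pd

# Hypothetical conversion data (assumed figures), constructed so the
# per-segment winner reverses once segments are pooled.
df = pd.DataFrame({
    "segment": ["Easy", "Easy", "Hard", "Hard"],
    "channel": ["A", "B", "A", "B"],
    "trials":  [800, 100, 200, 900],
    "wins":    [640, 90, 40, 270],
})

# Within each segment, B converts better than A.
by_segment = df.assign(rate=df["wins"] / df["trials"])

# Pooled across segments, A appears to convert better: Simpson's Paradox.
overall = df.groupby("channel")[["wins", "trials"]].sum()
overall["rate"] = overall["wins"] / overall["trials"]
```

The hidden variable is segment mix: A's trials sit mostly in the easy segment, B's in the hard one.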

Survivorship Bias: Analyzing only successful or surviving cases misses the failures that could overturn your conclusion. Always ask: "What is not in my dataset because it did not survive to be measured?"

Correlation vs. Causation: Two variables moving together does not mean one causes the other. Both might be caused by a third variable. To establish causation, you need either a controlled experiment or a compelling causal mechanism with strong logical support.


The "So What" Discipline

Every statistic you report must connect to a decision. The chain:

  1. Observation: What does the number show?
  2. Interpretation: What does this mean about how the business operates?
  3. Implication: What would be true if this pattern continues?
  4. Decision: What should management do differently as a result?

A number without a decision is trivia. A number with a decision is insight.


Code Patterns to Memorize

# Central tendency
df["col"].mean()
df["col"].median()
df["col"].mode()[0]

# Spread
df["col"].std()
df["col"].var()
q1, q3 = df["col"].quantile(0.25), df["col"].quantile(0.75)
iqr = q3 - q1

# Percentiles
df["col"].quantile(0.80)   # 80th percentile
df["col"].describe()        # all key stats at once

# Outlier detection (IQR method)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df["col"] < lower) | (df["col"] > upper)]

# Correlation
df[cols].corr()                          # full matrix
df["a"].corr(df["b"])                    # single pair

# Group statistics
df.groupby("category")["metric"].agg(["mean", "median", "std"])

Connections to the Next Chapter

Chapter 26 extends descriptive statistics into time: how do your metrics change over weeks, months, and years? Moving averages, trend lines, and seasonality are all built on the same foundational thinking developed here — measuring what is typical and what is variable, but now with the dimension of time added.