In This Chapter
- From Gut Feel to Evidence
- The Business Intuition for Each Statistic
- Correlation: Measuring Relationships Between Variables
- Business Statistics Traps
- Visualizing Distributions
- pandas describe(): Your Starting Point
- scipy.stats for More Advanced Calculations
- pandas describe() for Non-Numeric Data
- Practical Business Applications
- The "So What" Discipline
- Putting It All Together: A Complete Statistical Analysis
- Correlation Heatmaps: Seeing All Relationships at Once
- Chapter Summary
- Key Vocabulary
Chapter 25: Descriptive Statistics for Business Decisions
"Without data you're just another person with an opinion." — W. Edwards Deming
From Gut Feel to Evidence
Priya has been at Acme Corp for three years. She knows in her bones that the Q4 sales surge is real, that the enterprise reps consistently outperform the mid-market reps, and that the new product line is underperforming. She knows these things the way you know things when you have lived inside a business long enough — from patterns noticed over time, from water-cooler conversations, from a mental filing system built from hundreds of small observations.
But when Sandra Chen, Acme's CFO, asked for a budget justification last quarter, "I have a strong feeling" did not hold up in the room. Neither did "everyone knows" or "it's pretty obvious." Sandra wanted numbers. She wanted to know how much bigger Q4 was, how much better the enterprise reps performed, and precisely how far the new product line was trailing.
This is the gap that descriptive statistics closes. It transforms "I think" into "the data shows." It gives your intuition a voice that finance will respect.
Descriptive statistics is not about running complex models or proving causation. It is the art of summarizing a dataset so that the story inside it becomes visible. Where are the values clustered? How spread out are they? Are there outliers distorting the picture? What is the relationship between two variables? These are the questions descriptive statistics answers — and they are among the most powerful questions in business.
This chapter will teach you to read and produce those summaries using Python. By the end, you will be able to take a messy CSV of sales data and extract, in minutes, a clear statistical picture that would have taken an analyst days to produce manually.
The Business Intuition for Each Statistic
Before we write a single line of code, let us build intuition. Statistics without intuition produces numbers without meaning. We are going for meaning.
Measures of Central Tendency: What Is "Typical"?
Every manager wants to know what "typical" looks like. Typical deal size. Typical customer spend. Typical support ticket resolution time. There are three ways to define typical, and they tell you very different things.
The Mean: The Balancing Point
The mean (arithmetic average) is calculated by adding all the values and dividing by the count. Every person reading this book already knows how to compute an average. The reason to study it statistically is to know when it lies to you.
The mean is the balancing point of a distribution. Imagine your dataset as a seesaw: the mean is where you would have to put the fulcrum to balance it. If all your values are similar (all deals between $10,000 and $20,000), the mean tells you almost everything you need. If some values are wildly different (most deals between $10,000 and $20,000, but one massive enterprise deal at $2,000,000), the mean gets pulled toward the outlier and stops representing the typical case.
Business use cases for the mean:
- Average monthly revenue (when revenue is relatively stable)
- Average cost per acquisition across similar campaigns
- Average time-to-close for a uniform deal type
The Median: The Middle Value
The median is the value that splits your dataset exactly in half — 50% of values are below it, 50% are above. To find it, sort your values and pick the middle one.
The median is resistant to outliers. That massive enterprise deal that distorts your mean? The median does not care about it. The median only cares about the count: how many values are above and how many are below.
When the median is better than the mean:
Consider household income in any major city. A neighborhood might have 200 households earning between $40,000 and $80,000 per year, plus five tech executives earning $2,000,000+. The mean household income works out to roughly $107,000 (about $12 million from the ordinary households plus $10 million from the executives, divided by 205 households). That number does not represent any of the actual households accurately. The median would be around $58,000, which is a much more honest picture of what most families actually earn.
The same logic applies in business. If you are looking at customer lifetime value and you have a handful of whale accounts spending ten times what your typical customer spends, use the median to understand your typical customer. Use the mean when you need to project total revenue (because the whales matter for the total).
Business use cases for the median:
- Typical deal size (especially with enterprise outliers)
- Typical customer support wait time (especially with complex cases skewing the mean)
- Typical employee tenure (especially in a company with a few 20-year veterans)
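The mean-versus-median gap is easy to demonstrate. Here is a minimal sketch with made-up deal sizes, including one hypothetical whale deal:

```python
import statistics

# Nine typical deals plus one hypothetical "whale" enterprise deal
deals = [12_000, 14_000, 15_000, 15_500, 16_000,
         17_000, 18_000, 19_000, 20_000, 2_000_000]

mean = statistics.mean(deals)      # pulled far right by the whale
median = statistics.median(deals)  # unaffected by it

print(f"Mean:   ${mean:,.0f}")    # Mean:   $214,650
print(f"Median: ${median:,.0f}")  # Median: $16,500
```

One deal out of ten moves the mean an order of magnitude away from what a typical deal looks like, while the median does not budge.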
The Mode: The Most Common Value
The mode is simply the most frequently occurring value. It is less commonly used in business than mean or median, but it has specific applications.
Business use cases for the mode:
- Most common product ordered (great for inventory decisions)
- Most common support issue type (great for identifying training needs)
- Most common payment method (great for checkout optimization)
The mode becomes especially useful when you are dealing with categorical data — data that comes in discrete buckets rather than a continuous number. What is the average product category? That question is meaningless. What is the most common product category? That is the mode, and it is meaningful.
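For categorical data, pandas gives you the mode directly. A small sketch with hypothetical order data:

```python
import pandas as pd

orders = pd.Series(["Widget A", "Widget B", "Widget A",
                    "Widget C", "Widget A", "Widget B"])

# value_counts() ranks categories by frequency; the top entry is the mode
print(orders.value_counts())
print("Most common product:", orders.mode()[0])  # Widget A
```

`value_counts()` is usually more useful than `mode()` alone, because it shows the full frequency ranking, not just the winner.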
Measures of Spread: How Consistent Are Things?
Two sales teams can have the same average monthly revenue but be completely different businesses. Team A has every rep producing between $85,000 and $115,000 per month — consistent, predictable, easy to plan around. Team B has reps ranging from $20,000 to $280,000 per month — the same average, but wildly inconsistent, high-risk, hard to plan around.
Measures of spread tell you how consistent or variable your data is. In business, consistency is often as valuable as performance.
Range: The Simplest Spread Measure
Range is maximum minus minimum. It tells you the full span of your data.
The range is useful but fragile. If you have one unusually low or high value, the range balloons immediately. It is a good first cut — if the range is tiny, your data is consistent; if the range is enormous, you know to look deeper — but it should rarely be your only spread measure.
Standard Deviation: The Typical Distance from the Mean
Standard deviation (std) measures how far the typical value sits from the mean. Strictly, it is the square root of the average squared distance from the mean, but "the typical distance from the mean" is the right working intuition.
Think of it this way: if your mean monthly revenue per rep is $100,000 and your standard deviation is $10,000, most of your reps are somewhere between $90,000 and $110,000. If your standard deviation is $50,000, most of your reps are somewhere between $50,000 and $150,000. High standard deviation means high variability. Low standard deviation means high consistency.
Business intuition: Low standard deviation is often a sign of a well-managed, systematized process. High standard deviation often signals that results depend heavily on individual talent, chance, or undefined process — which is a management problem worth solving.
The 68-95-99.7 rule (the rough version): In roughly bell-shaped (normal) datasets, approximately 68% of values fall within one standard deviation of the mean, approximately 95% within two standard deviations, and almost all values within three. Skewed business data will drift from these percentages, but the rule still gives you a fast way to identify outliers: any value more than two or three standard deviations from the mean deserves a closer look.
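That rule turns into a quick outlier screen in a few lines. A sketch using simulated rep revenue (normal by construction, so the rule applies cleanly):

```python
import numpy as np

np.random.seed(0)
# Simulated monthly revenue for 200 reps: mean ~$100k, std ~$10k
revenue = np.random.normal(100_000, 10_000, 200)

mean, std = revenue.mean(), revenue.std()
# The 68-95-99.7 rule says roughly 5% of values should sit beyond 2 std
flagged = revenue[np.abs(revenue - mean) > 2 * std]
print(f"mean=${mean:,.0f}, std=${std:,.0f}, "
      f"flagged {len(flagged)} of {len(revenue)} values")
```

On normal data you should expect about 10 flags out of 200; a count far above that suggests your data is heavier-tailed than the rule assumes.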
Variance: Standard Deviation Squared
Variance is the standard deviation squared. The standard deviation is more interpretable because it is in the same units as your data (dollars, if you are measuring revenue in dollars). Variance is in squared units, which is hard to think about intuitively.
You will encounter variance in statistical formulas and in code output. Know that it is simply the standard deviation squared, and that it is less useful for day-to-day business interpretation. When choosing which to report, report standard deviation.
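The relationship is a one-liner to verify. One caution: numpy defaults to the population formula (`ddof=0`) while pandas defaults to the sample formula (`ddof=1`), so be explicit when comparing the two libraries:

```python
import numpy as np

revenue = np.array([10_000, 12_000, 15_000, 11_000, 13_000])

std = revenue.std(ddof=1)  # sample standard deviation, in dollars
var = revenue.var(ddof=1)  # sample variance, in dollars squared

print(f"std = {std:,.2f}, variance = {var:,.2f}")
print(np.isclose(var, std ** 2))  # True: variance is exactly std squared
```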
Interquartile Range (IQR): The Spread of the Middle
The IQR is the spread of the middle 50% of your data — specifically, the distance between the 25th percentile (Q1) and the 75th percentile (Q3).
Like the median, the IQR is resistant to outliers. It simply ignores the top 25% and bottom 25% of your data and measures how spread out the middle half is.
Business use case: IQR is excellent for identifying your "core" business behavior while setting aside extreme cases. The IQR of your deal sizes tells you how wide the range is for your typical deals, excluding your smallest and largest outliers.
Percentiles and Quartiles: Where Does Any Value Rank?
A percentile tells you what percentage of values fall below a given point. The 80th percentile order value is the order size below which 80% of orders fall. The bottom quartile of customers consists of the 25% of customers with the lowest spend.
Quartiles divide the data into four equal groups:
- Q1 (25th percentile): 25% of values fall below this point
- Q2 (50th percentile): The median; 50% of values fall below this point
- Q3 (75th percentile): 75% of values fall below this point
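numpy computes any percentile directly (pandas `quantile` returns the same numbers on a Series). A sketch with ten hypothetical order values:

```python
import numpy as np

order_values = np.array([120, 95, 310, 80, 150, 220, 60, 175, 400, 135])

# Quartiles and an arbitrary percentile in one call each
q1, q2, q3 = np.percentile(order_values, [25, 50, 75])
p80 = np.percentile(order_values, 80)

print(f"Q1={q1}, median={q2}, Q3={q3}")  # Q1=101.25, median=142.5, Q3=208.75
print(f"80th percentile={p80}")          # 238.0
```

Note the fractional results: with only ten values, numpy interpolates between neighboring observations by default.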
Business use cases for percentiles:
Customer segmentation: "The top decile (top 10%) of our customers account for 42% of revenue." This is actionable. It tells you who to protect, who to prioritize for account management, who to study when you want to understand what drives loyalty.
Performance benchmarking: "Our target is for every sales rep to reach the 75th percentile benchmark by their second year." Percentile-based goals are self-calibrating — as the whole team improves, the benchmark rises.
SLA setting: "Our support team resolves 95% of tickets within 4 hours." This is a 95th percentile statement. It is a much more honest SLA than "our average resolution time is 2 hours" (which could mask a long tail of tickets taking 24+ hours).
Outlier detection: Values beyond the 1st or 99th percentile are often outliers worth investigating. Is that giant order real, or a data entry error? Is that tiny order a customer test, or a legitimate small purchase?
Correlation: Measuring Relationships Between Variables
Correlation measures whether two variables tend to move together. A correlation of +1.0 means they move in perfect lockstep (as one goes up, the other goes up by a proportional amount). A correlation of -1.0 means they move in perfect opposition (as one goes up, the other goes down). A correlation of 0 means they have no linear relationship at all.
Business correlations worth measuring: - Marketing spend vs. new customer acquisitions - Average deal size vs. sales cycle length - Employee satisfaction score vs. customer satisfaction score - Temperature vs. sales of seasonal products
Calculating Correlation in pandas
Pandas makes correlation calculation trivially easy:
import pandas as pd
# Load your data
df = pd.read_csv("sales_data.csv")
# Correlation between two columns
correlation = df["marketing_spend"].corr(df["new_customers"])
print(f"Correlation: {correlation:.3f}")
# Full correlation matrix for all numeric columns
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)
The .corr() method returns the Pearson correlation coefficient by default, which measures linear relationships. Values between 0.7 and 1.0 (or -0.7 and -1.0) are generally considered strong correlations in business data. Values between 0.4 and 0.7 are moderate. Values below 0.4 are weak, though they may still be meaningful depending on context.
What Correlation Does Not Tell You
Correlation is one of the most misused concepts in business analytics. Here are the critical limitations:
Correlation is not causation. This is the most important sentence in this section. The fact that two variables move together does not mean one is causing the other. Ice cream sales and drowning rates are positively correlated — not because ice cream causes drowning, but because both are driven by hot weather. In business: customer support ticket volume and revenue may be correlated not because support causes revenue, but because both are driven by the number of active customers.
Correlation only measures linear relationships. Two variables can have a strong, predictable, non-linear relationship and show a correlation of near zero. If sales performance improves with experience up to a point and then plateaus, the overall correlation between tenure and performance might look weak even though there is clearly a pattern.
Correlation depends on the range of data. If you restrict your analysis to only your top 20% of customers, you might find that marketing spend has no correlation with their purchases — because within that high-value group, purchasing behavior is driven by relationship and contract, not by ads. But across the full customer base, the correlation might be strong.
Business Statistics Traps
Simpson's Paradox
Simpson's Paradox occurs when a trend appears in several groups of data but disappears or reverses when the groups are combined. It is one of the most counterintuitive and dangerous traps in business statistics.
Classic example: Suppose you are comparing two salespeople, Alice and Bob, across two quarters in which market conditions differed sharply.

| Quarter | Alice Close Rate | Bob Close Rate |
|---|---|---|
| Q1 (tough market) | 30% (3/10) | 40% (36/90) |
| Q2 (strong market) | 80% (72/90) | 90% (9/10) |
| Total | 75% (75/100) | 45% (45/100) |

Bob has a higher close rate in every individual quarter, yet Alice's overall close rate (75%) is far above Bob's (45%). The reversal happens because of deal mix: most of Bob's attempts fell in the tough Q1 market, while most of Alice's fell in the easy Q2 market. Judge on the pooled number alone and Alice looks dominant; judge quarter by quarter and Bob wins every time. Neither view is complete without understanding the mix.

Simpson's Paradox typically appears when groups are unequal in size and there is a confounding variable (deal difficulty and volume in the example above). The solution is to always segment your analysis and ask: could there be a hidden variable driving this result?
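The paradox is easy to reproduce in pandas. A minimal sketch with hypothetical deal records, where the counts are chosen so each rep's per-quarter ranking and pooled ranking disagree:

```python
import pandas as pd

# Hypothetical deal log. Bob's close rate beats Alice's inside each
# quarter, but Bob worked mostly hard Q1 deals and Alice mostly easy
# Q2 deals, so the pooled rates reverse.
deals = pd.DataFrame({
    "rep":     ["Alice"] * 100 + ["Bob"] * 100,
    "quarter": ["Q1"] * 10 + ["Q2"] * 90 + ["Q1"] * 90 + ["Q2"] * 10,
    "closed":  [1] * 3 + [0] * 7     # Alice Q1: 3/10  = 30%
             + [1] * 72 + [0] * 18   # Alice Q2: 72/90 = 80%
             + [1] * 36 + [0] * 54   # Bob Q1:   36/90 = 40%
             + [1] * 9 + [0] * 1,    # Bob Q2:   9/10  = 90%
})

print(deals.groupby(["quarter", "rep"])["closed"].mean())  # Bob ahead in both
print(deals.groupby("rep")["closed"].mean())               # Alice ahead pooled
```

Whenever a grouped view and a pooled view disagree like this, look for the lurking variable before trusting either number.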
Survivorship Bias
Survivorship bias occurs when you analyze only the cases that made it through a selection process and draw conclusions that do not account for the ones that did not.
Business example: Acme is trying to understand what makes their most successful product launches successful. They analyze all their current products and find that the successful ones all had a dedicated product champion in the early months. They conclude: "Product champions drive success — we should assign champions to every new product."
But wait. What about the products that failed and were discontinued? Were some of those also championed? If yes, then product champions do not actually correlate with success — they are just a common feature of all product launches, successful or not. By only looking at surviving products, the analysis missed the full picture.
Survivorship bias is everywhere in business: studying only successful companies, analyzing only customers who stayed (not those who churned), looking only at employees who were promoted. Always ask: what am I not seeing because it did not survive to be included in my dataset?
Correlation vs. Causation (in Practice)
We discussed this above in the context of correlation, but it deserves a section of its own because the practical business mistake is so common.
The business version: a company sees that customers who use their premium support plan have significantly higher retention rates. They conclude: "Premium support causes retention. Let us push everyone into premium support." They invest heavily in expanding premium support access.
But what if premium support does not cause retention — what if both are caused by a third factor: customer success and satisfaction with the core product? Happy customers who love the product both choose to invest in premium support and are likely to renew. Pushing unhappy customers into premium support would not improve their retention if their dissatisfaction stems from the product itself.
The test for causation: To establish causation, you need either a controlled experiment (randomly assign some customers to premium support and others not, then compare retention) or a compelling causal mechanism (a logical chain of events where A clearly produces B). Correlation alone never proves causation.
Visualizing Distributions
Numbers alone do not tell the full story. Distributions need to be visualized to be understood. Three visualizations are essential for business statistics work.
Histograms: The Shape of Your Distribution
A histogram divides your data into bins (ranges) and shows how many values fall in each bin. The shape tells you everything:
- Bell-shaped (normal): Most values cluster around the middle, tapering off symmetrically on both sides. Many natural business metrics approximate this: daily call volume, manufacturing defect rates.
- Right-skewed (long right tail): Most values cluster at the low end, with a few very large values pulling the right tail out. Revenue distributions are almost always right-skewed — most customers or deals are moderate, a few are enormous.
- Left-skewed (long left tail): Most values cluster at the high end, with a few very small values pulling the left tail. Customer satisfaction scores often look like this — most people rate highly, a few rate very low.
- Bimodal (two humps): Two distinct clusters in the data. This often signals two different populations mixed together — for example, if your customer base includes both individual consumers and enterprise clients, their purchase values might form two separate humps.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Sample revenue data (right-skewed, as real revenue data tends to be)
np.random.seed(42)
revenue = np.random.lognormal(mean=10.5, sigma=0.8, size=500)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(revenue, bins=40, color="#2196F3", edgecolor="white", alpha=0.8)
ax.axvline(np.mean(revenue), color="#F44336", linewidth=2,
label=f"Mean: ${np.mean(revenue):,.0f}")
ax.axvline(np.median(revenue), color="#4CAF50", linewidth=2,
label=f"Median: ${np.median(revenue):,.0f}")
ax.set_xlabel("Revenue ($)", fontsize=13)
ax.set_ylabel("Number of Transactions", fontsize=13)
ax.set_title("Revenue Distribution — Why Median Matters More Than Mean", fontsize=14)
ax.legend(fontsize=12)
plt.tight_layout()
plt.savefig("revenue_histogram.png", dpi=150)
plt.show()
When you run this, the gap between the mean (pulled right by large values) and the median (stable in the center) becomes immediately visible. This is worth showing to any manager who insists on discussing average deal size.
Box Plots: Five Numbers at Once
A box plot (box-and-whisker plot) displays five summary values simultaneously, plus outliers:
- Q1, the 25th percentile (one edge of the box)
- The median (the line inside the box)
- Q3, the 75th percentile (the other edge of the box)
- The whiskers, which extend to the most extreme values within 1.5 × IQR of the box edges (not necessarily the absolute minimum and maximum)
- Outliers (individual points plotted beyond the whiskers)
Box plots are exceptional for comparing distributions across groups. Comparing the distribution of deal sizes by region, or response times by support agent, or project margins by client type — box plots make these comparisons immediately readable.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(42)
regions = {
"North": np.random.normal(95000, 20000, 80),
"South": np.random.normal(78000, 35000, 80),
"East": np.random.normal(112000, 15000, 80),
"West": np.random.normal(88000, 28000, 80),
}
df = pd.DataFrame(regions)
fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(ax=ax, patch_artist=True,
boxprops=dict(facecolor="#E3F2FD", color="#1565C0"),
medianprops=dict(color="#F44336", linewidth=2))
ax.set_ylabel("Monthly Revenue ($)", fontsize=13)
ax.set_title("Sales Rep Revenue Distribution by Region", fontsize=14)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"${x:,.0f}"))
plt.tight_layout()
plt.savefig("regional_boxplot.png", dpi=150)
plt.show()
The box plot immediately reveals: East has the highest median and lowest spread (consistent high performers). South has the lowest median but the widest spread (some great reps, some struggling). This tells you exactly where management attention is most needed.
Violin Plots: The Box Plot's Richer Sibling
A violin plot combines the summary statistics of a box plot with a full view of the distribution shape. Where a box plot shows you Q1/median/Q3, a violin plot shows you the entire density — you can see whether the distribution is symmetric, bimodal, or heavily skewed.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(42)
data = [
np.random.normal(95000, 20000, 80), # North
np.random.normal(78000, 35000, 80), # South
np.random.normal(112000, 15000, 80), # East
np.random.normal(88000, 28000, 80), # West
]
fig, ax = plt.subplots(figsize=(10, 6))
parts = ax.violinplot(data, positions=[1, 2, 3, 4], showmedians=True)
for pc in parts["bodies"]:
pc.set_facecolor("#90CAF9")
pc.set_alpha(0.7)
ax.set_xticks([1, 2, 3, 4])
ax.set_xticklabels(["North", "South", "East", "West"])
ax.set_ylabel("Monthly Revenue ($)", fontsize=13)
ax.set_title("Revenue Distribution by Region — Violin Plot", fontsize=14)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"${x:,.0f}"))
plt.tight_layout()
plt.savefig("regional_violin.png", dpi=150)
plt.show()
pandas describe(): Your Starting Point
Every exploratory data analysis should begin with df.describe(). This single method gives you the eight most fundamental statistics for every numeric column in your dataset.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
"rep_name": [f"Rep_{i}" for i in range(50)],
"monthly_revenue": np.random.lognormal(11.3, 0.5, 50),
"deals_closed": np.random.randint(5, 35, 50),
"avg_deal_size": np.random.lognormal(9.5, 0.6, 50),
"close_rate_pct": np.random.normal(42, 12, 50).clip(10, 80),
})
print(df.describe())
Representative output:
monthly_revenue deals_closed avg_deal_size close_rate_pct
count 50.000000 50.000000 50.000000 50.000000
mean 93445.231847 19.560000 16847.432156 42.341200
std 52341.876543 7.234521 9234.123456 9.876543
min 31234.567890 5.000000 4567.890123 15.234567
25% 58234.123456 14.000000 9876.543210 35.678901
50% 81234.567890 20.000000 14567.890123 42.123456
75% 114567.890123 25.000000 22345.678901 49.012345
max 289345.678901 34.000000 52345.678901 74.567890
What each row tells you:
- count: How many non-null values exist. If this is less than your total row count, you have missing data.
- mean: The arithmetic average. Compare to median (50%) to sense skewness.
- std: Standard deviation. Compare to mean — if std is more than half the mean, your data is highly variable.
- min: The smallest value. Is it plausible? Could it be an error?
- 25%: The first quartile (Q1). 25% of values are below this.
- 50%: The median. This is your most reliable "typical" value.
- 75%: The third quartile (Q3). 75% of values are below this.
- max: The largest value. Is it plausible? Could it be an outlier?
The quick diagnosis: Look at the gap between mean and 50% (median). If mean is much larger than median, your data is right-skewed — a few large values are pulling the average up. If mean is much smaller than median, your data is left-skewed. If mean ≈ median, your data is roughly symmetric.
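That diagnosis is worth automating. A sketch of the check, with the ±10% thresholds being a judgment call rather than a standard:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
# Simulated revenue, right-skewed by construction (lognormal)
revenue = pd.Series(np.random.lognormal(11, 0.6, 500))

mean, median = revenue.mean(), revenue.median()
if mean > 1.1 * median:
    verdict = "right-skewed: a few large values pull the mean up"
elif mean < 0.9 * median:
    verdict = "left-skewed: a few small values pull the mean down"
else:
    verdict = "roughly symmetric"
print(f"mean ${mean:,.0f} vs median ${median:,.0f} -> {verdict}")
```

Running this on lognormal data, the mean lands well above the median, exactly the pattern you will see in most real revenue columns.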
scipy.stats for More Advanced Calculations
The scipy.stats module provides statistical functions that go beyond what pandas offers out of the box. You do not need to understand the underlying mathematics — you need to know which function to call and how to interpret the result.
from scipy import stats
import numpy as np
# Generate some sales performance data
np.random.seed(42)
sales = np.random.lognormal(11, 0.5, 100)
# Skewness: how asymmetric is the distribution?
# Positive = right-skewed (long right tail)
# Negative = left-skewed (long left tail)
# Near 0 = symmetric
skewness = stats.skew(sales)
print(f"Skewness: {skewness:.3f}")
# Kurtosis: how heavy are the tails?
# High kurtosis = more extreme values than a normal distribution
kurtosis = stats.kurtosis(sales)
print(f"Kurtosis: {kurtosis:.3f}")
# Percentile: what value is at the 80th percentile?
p80 = np.percentile(sales, 80)
print(f"80th percentile revenue: ${p80:,.0f}")
# Mode for continuous data (less useful, but here for completeness)
# For grouped/categorical data, use value_counts() in pandas instead
# Z-score: how many standard deviations from the mean is each value?
z_scores = stats.zscore(sales)
outliers = sales[np.abs(z_scores) > 2]
print(f"\nOutliers (>2 std devs from mean):")
for val in sorted(outliers):
print(f" ${val:,.0f}")
The z-score approach to outlier detection is one of the most practically useful tools in this chapter. Any value with an absolute z-score greater than 2 (or 3, depending on your standard) is worth examining. In business data, outliers are often:
- Data entry errors (a deal entered as $500,000 instead of $50,000)
- Genuinely exceptional events that need separate treatment (a one-time enterprise contract that should not be mixed into your typical deal analysis)
- Signals of a process problem (a support ticket that took 30 days instead of 2)
pandas describe() for Non-Numeric Data
Do not forget that describe() works on text columns too, with a different set of statistics:
import pandas as pd
df_cat = pd.DataFrame({
"rep_name": ["Alice", "Bob", "Alice", "Carol", "Alice", "Bob"],
"product_line": ["Enterprise", "Mid-Market", "Enterprise",
"SMB", "Mid-Market", "Enterprise"],
})
print(df_cat.describe(include="object"))
Output:
rep_name product_line
count 6 6
unique 3 3
top Alice Enterprise
freq 3 3
- count: Total non-null values
- unique: Number of distinct values
- top: The most common value (the mode)
- freq: How often the most common value appears
Practical Business Applications
Application 1: Pricing Analysis
You have a dataset of deals closed over the past year. You want to understand your pricing distribution before setting a new pricing strategy.
Key questions to answer with statistics:
1. What is the typical (median) deal size? (Use median, not mean, because large enterprise deals will skew the mean)
2. How variable is deal pricing? (Use IQR — what is the "normal" range?)
3. Are we leaving money on the table with our enterprise deals? (Compare enterprise deals at 75th percentile vs. median)
4. Do certain sales reps consistently price higher? (Box plot comparing rep pricing distributions)
Application 2: Sales Rep Performance Distribution
Understanding the performance distribution of your sales team is one of the highest-leverage analyses a sales manager can do. Key questions:
- Is performance normally distributed, or is there a cluster of high performers and a cluster of underperformers?
- What percentage of reps fall below the median? (By definition, 50% — but where do they fall relative to target?)
- What is the performance range in the top quartile? (Q3 to max)
- Which reps are statistical outliers (more than 2 standard deviations from the mean)?
Application 3: Product Demand Variance
For operations and supply chain planning, variance matters enormously. High variance in demand means you need more safety stock and more flexible capacity. Low variance means you can plan tightly. Comparing the IQR and standard deviation of demand across your product lines tells you which products need conservative inventory buffers and which can be managed lean.
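To compare variability across product lines that sell at different volumes, divide std by mean (the coefficient of variation). A sketch with simulated weekly demand for two hypothetical products:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
demand = pd.DataFrame({
    # 52 weeks of simulated unit demand; clip(0) prevents negative units
    "steady_widget":   np.random.normal(1_000, 50, 52).clip(0),
    "volatile_widget": np.random.normal(1_000, 300, 52).clip(0),
})

# Coefficient of variation: spread as a fraction of typical volume,
# comparable across products regardless of scale
cv = demand.std() / demand.mean()
print(cv.round(3))
```

Both products average around 1,000 units per week, but the volatile product's coefficient of variation is several times higher, so it needs the larger safety-stock buffer.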
The "So What" Discipline
Here is the most important thing you will learn in this chapter, and it has nothing to do with Python.
Every statistical finding must answer "so what?" A statistic without a decision it informs is just a number. Here is the framework:
- Observation: "The median deal size for our enterprise reps is $127,000, but the mean is $289,000."
- Interpretation: "A small number of very large deals are pulling our average way above what most enterprise reps typically close."
- Implication: "Our quota structure based on average deal size means most reps are set up to fail. Only the reps who land the occasional whale hit their numbers."
- Decision: "We should either restructure quotas around median deal size, or explicitly create a separate track and incentive structure for mega-deal hunters."
Without step 4, the statistic is trivia. With step 4, it is insight that drives action.
Always connect your statistical finding to a decision. If you cannot identify a decision it could influence, either the finding is not important enough to share, or you have not thought deeply enough about the implications.
Putting It All Together: A Complete Statistical Analysis
Here is the full workflow for analyzing a business dataset. We will walk through each step, and the complete code appears in code/stats_analysis.py.
Step 1: Load and Preview
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
df = pd.read_csv("sales_data.csv")
print(df.shape)
print(df.dtypes)
print(df.head())
print(df.describe())
Step 2: Check for Missing Data and Outliers
# Missing data
print(df.isnull().sum())
# Outlier detection using IQR method
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["revenue"] < lower_bound) | (df["revenue"] > upper_bound)]
print(f"\n{len(outliers)} outliers detected:")
print(outliers[["rep_name", "revenue"]])
The IQR method for outlier detection (values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR) is the standard approach in business analytics. It is the same method box plots use to determine what counts as an outlier point beyond the whiskers.
Step 3: Segment and Compare
# Revenue statistics by region
region_stats = df.groupby("region")["revenue"].agg([
"mean", "median", "std",
lambda x: x.quantile(0.75) - x.quantile(0.25) # IQR
]).rename(columns={"<lambda_0>": "IQR"})
print(region_stats)
Step 4: Visualize
Three charts tell the full story:
1. Histogram — shape of the overall distribution
2. Box plot by segment — comparison across groups
3. Correlation heatmap — relationships between numeric variables
Step 5: State Your Findings in Business Language
Statistical findings must be translated for non-technical audiences:
- NOT: "The revenue distribution is right-skewed with mean $289K and median $127K."
- YES: "Most of our enterprise deals close around $127,000, but a handful of large contracts brings our reported average to $289,000 — which means our quota model is built around numbers that most reps will never see."
Correlation Heatmaps: Seeing All Relationships at Once
When you have multiple numeric columns, a correlation heatmap lets you see all pairwise relationships simultaneously.
import matplotlib.pyplot as plt
corr = df[["revenue", "deals_closed", "avg_deal_size",
"close_rate_pct", "calls_made"]].corr()
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap="RdYlGn", vmin=-1, vmax=1)
plt.colorbar(im, ax=ax)
ticks = ["Revenue", "Deals Closed", "Avg Deal Size", "Close Rate", "Calls Made"]
ax.set_xticks(range(len(ticks)))
ax.set_yticks(range(len(ticks)))
ax.set_xticklabels(ticks, rotation=45, ha="right")
ax.set_yticklabels(ticks)
for i in range(len(ticks)):
for j in range(len(ticks)):
ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center",
fontsize=9, color="black")
ax.set_title("Sales Metrics Correlation Matrix", fontsize=14)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150)
plt.show()
Chapter Summary
Descriptive statistics is the foundation of data-driven business management. Before you build models, before you run experiments, before you make recommendations to leadership, you need to understand what your data actually looks like. The tools in this chapter — mean, median, mode, standard deviation, IQR, percentiles, correlation — are not exotic. They are the standard vocabulary of evidence-based business.
Use the median when outliers might distort the mean. Use standard deviation to understand consistency and risk. Use percentiles to segment customers, benchmark performance, and define SLAs. Use correlation to identify relationships worth investigating — but never mistake correlation for causation.
Most importantly: always ask "so what?" Every statistic you calculate should connect to a decision someone can make. Numbers in service of decisions — that is the discipline of business analytics.
In the next chapter, we will extend these ideas into time: how do business metrics change over weeks, months, and years, and how can we project where they are going?
Key Vocabulary
| Term | Definition |
|---|---|
| Mean | The arithmetic average; sum divided by count |
| Median | The middle value; resistant to outliers |
| Mode | The most frequently occurring value |
| Standard Deviation | Typical distance of values from the mean |
| Variance | Standard deviation squared |
| Range | Maximum minus minimum |
| IQR | Interquartile range; Q3 minus Q1; spread of the middle 50% |
| Percentile | The value below which a given percentage of observations fall |
| Quartile | Q1 (25th), Q2 (50th), Q3 (75th) percentiles |
| Correlation | Measure of linear relationship between two variables (-1 to +1) |
| Skewness | Measure of asymmetry in a distribution |
| Outlier | A value unusually far from the bulk of the data |
| Simpson's Paradox | A trend in groups reverses when groups are combined |
| Survivorship Bias | Analyzing only successful cases and missing failures |
| z-score | Number of standard deviations a value is from the mean |