Contributors to Introduction to Data Science

Case Study 1: What "Average" Salary Really Means — When the Mean Lies

Tier 3 — Illustrative/Composite Example: This case study uses a fictional technology company ("NovaTech") to illustrate real statistical phenomena that are well-documented in labor economics and compensation research. The salary distributions, gender pay gap dynamics, and statistical patterns described here are composites based on widely reported patterns in the tech industry. All specific figures, company names, and character details are invented for pedagogical purposes. No actual company is represented.

The Setting

Ava Chen is a data analyst at NovaTech, a mid-sized software company with about 800 employees. She's been there for three years, and she likes her job — but this morning, she's staring at a press release that her company's PR department just sent out, and something doesn't sit right.

The press release reads:

"NovaTech is proud to report that the average employee salary is $112,000, reflecting our commitment to competitive compensation across all levels and departments."

Ava earns $78,000. Most of the people she works with earn somewhere between $65,000 and $95,000. She's talked to enough colleagues to know that $112,000 doesn't feel like a "typical" salary at NovaTech. It feels high. Suspiciously high.

"That's the mean, isn't it?" she mutters to herself, opening her laptop. Ava has access to the company's anonymized salary data as part of her role in the analytics team. She decides to investigate.

This case study follows Ava's analysis, step by step, as she discovers how a single number — the mean — can tell a story that's technically true and deeply misleading.

The Investigation

Step 1: Look at the Distribution

The first rule of descriptive statistics, as we learned in Chapter 19: always plot your data before computing anything.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

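# Simulated NovaTech salary data (illustrative: the computed statistics
# approximate, but do not exactly reproduce, the figures quoted in the narrative)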
np.random.seed(42)

# Individual contributors (600 employees)
ic_salaries = np.random.normal(75000, 12000, 600)
ic_salaries = np.clip(ic_salaries, 45000, 120000)

# Mid-level managers (120 employees)
manager_salaries = np.random.normal(115000, 15000, 120)

# Senior leadership (50 employees)
senior_salaries = np.random.normal(180000, 30000, 50)

# C-suite (10 employees)
csuite_salaries = np.array([350000, 420000, 380000, 510000, 450000,
                            620000, 390000, 470000, 880000, 1200000])

all_salaries = np.concatenate([ic_salaries, manager_salaries,
                                senior_salaries, csuite_salaries])

# Plot the distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Full histogram
axes[0].hist(all_salaries, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[0].axvline(np.mean(all_salaries), color='red', linestyle='--', linewidth=2,
                label=f'Mean: ${np.mean(all_salaries):,.0f}')
axes[0].axvline(np.median(all_salaries), color='orange', linestyle='--', linewidth=2,
                label=f'Median: ${np.median(all_salaries):,.0f}')
axes[0].set_title('NovaTech Salary Distribution')
axes[0].set_xlabel('Annual Salary ($)')
axes[0].set_ylabel('Number of Employees')
axes[0].legend()

# Log-scale histogram for detail
axes[1].hist(all_salaries, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[1].axvline(np.mean(all_salaries), color='red', linestyle='--', linewidth=2,
                label=f'Mean: ${np.mean(all_salaries):,.0f}')
axes[1].axvline(np.median(all_salaries), color='orange', linestyle='--', linewidth=2,
                label=f'Median: ${np.median(all_salaries):,.0f}')
axes[1].set_title('Same Data, Log Scale on Y-axis')
axes[1].set_xlabel('Annual Salary ($)')
axes[1].set_ylabel('Number of Employees (log)')
axes[1].set_yscale('log')
axes[1].legend()

plt.tight_layout()
plt.savefig('novatech_salary_dist.png', dpi=150, bbox_inches='tight')
plt.show()

Ava looks at the histogram and immediately sees the problem. The distribution is massively right-skewed. There's a huge cluster of employees earning between $60,000 and $90,000, a smaller bump around $115,000 (managers), and then a long, thin tail stretching all the way to $1.2 million (the CEO).

The mean — that $112,000 number in the press release — sits to the right of where most employees are. The median, around $80,000, is much closer to what a "typical" NovaTech employee actually earns.

Step 2: Compute Proper Summary Statistics

print("=== NovaTech Salary Statistics ===")
print(f"  Count:            {len(all_salaries)}")
print(f"  Mean:             ${np.mean(all_salaries):>12,.0f}")
print(f"  Median:           ${np.median(all_salaries):>12,.0f}")
print(f"  Std Dev:          ${np.std(all_salaries, ddof=1):>12,.0f}")
print(f"  IQR:              ${np.percentile(all_salaries, 75) - np.percentile(all_salaries, 25):>12,.0f}")
print(f"  Min:              ${np.min(all_salaries):>12,.0f}")
print(f"  Q1 (25th %ile):   ${np.percentile(all_salaries, 25):>12,.0f}")
print(f"  Q3 (75th %ile):   ${np.percentile(all_salaries, 75):>12,.0f}")
print(f"  Max:              ${np.max(all_salaries):>12,.0f}")
print(f"  Skewness:         {stats.skew(all_salaries):>12.2f}")
print()
print(f"  Gap (Mean - Median): ${np.mean(all_salaries) - np.median(all_salaries):>10,.0f}")
print(f"  The mean is {(np.mean(all_salaries) / np.median(all_salaries) - 1) * 100:.0f}% higher than the median")

The numbers confirm what the histogram showed:

- The mean sits far above the median. The press release's $112k is 40% higher than the $80k median; the simulated data above yields a lower mean (roughly $94k) and a smaller gap, but the same pattern.
- The skewness is strongly positive (far above 1), indicating extreme right skew.
- The standard deviation is enormous because it's being inflated by the executive salaries.
- The IQR tells a more honest story about the spread of "typical" salaries.
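To see what the skewness figure measures, here is a minimal sketch of the moment-ratio calculation on two tiny made-up arrays. The helper name `moment_skewness` and the toy values are assumptions for illustration; the formula is the biased Fisher-Pearson coefficient, which is what `scipy.stats.skew` computes with its default settings.

```python
import numpy as np

def moment_skewness(values):
    """Biased sample skewness: third central moment over variance**1.5."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)   # second central moment (population variance)
    m3 = np.mean(deviations ** 3)   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]        # balanced around the mean
right_skewed = [1, 1, 1, 1, 10]    # one large value stretches the right tail

print(f"symmetric:    {moment_skewness(symmetric):.2f}")
print(f"right-skewed: {moment_skewness(right_skewed):.2f}")
```

Because the deviations are cubed, a single value far above the mean contributes far more than many values slightly below it, which is why a handful of executive salaries can dominate the statistic.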

Step 3: Understand the Impact of C-Suite Salaries

Ava wants to quantify exactly how much the top earners are distorting the picture:

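# What happens when we remove the top 10 earners?
# Every listed C-suite salary is at least $350,000, so a $300,000 cutoff is
# intended to separate those 10 people from the rest of the simulated staff.
salaries_no_csuite = all_salaries[all_salaries < 300000]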

print("=== Without C-Suite (top 10 earners) ===")
print(f"  Mean drops from ${np.mean(all_salaries):,.0f} to ${np.mean(salaries_no_csuite):,.0f}")
print(f"  That's a ${np.mean(all_salaries) - np.mean(salaries_no_csuite):,.0f} drop")
print(f"  Median barely changes: ${np.median(all_salaries):,.0f} -> ${np.median(salaries_no_csuite):,.0f}")
print()

# Impact analysis
pct_employees = 10 / len(all_salaries) * 100
mean_impact = (np.mean(all_salaries) - np.mean(salaries_no_csuite)) / np.mean(all_salaries) * 100
print(f"  {pct_employees:.1f}% of employees (C-suite) shifted the mean by {mean_impact:.1f}%")
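median_impact = (np.median(all_salaries) - np.median(salaries_no_csuite)) / np.median(all_salaries) * 100
print(f"  But the median moved by only {median_impact:.1f}%")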

Ten people, roughly 1.3% of the workforce, pulled the mean up by about $6,000 (between 6 and 7 percent in this simulation). The median barely moved. This is the power and the danger of the mean in skewed data.
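The threshold filter above works because the C-suite is cleanly separated from everyone else. A rank-based version of the same comparison is sketched below on a made-up ten-person payroll; `drop_top_k` and the toy salaries are hypothetical and are not part of Ava's dataset.

```python
import numpy as np

def drop_top_k(values, k):
    """Return the values with the k largest entries removed."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return ordered[:len(ordered) - k]

# Hypothetical payroll: nine staff members and one highly paid executive
toy_salaries = [62_000, 68_000, 71_000, 75_000, 78_000,
                82_000, 88_000, 95_000, 110_000, 900_000]

trimmed = drop_top_k(toy_salaries, k=1)

print(f"Mean:   ${np.mean(toy_salaries):,.0f} -> ${np.mean(trimmed):,.0f}")
print(f"Median: ${np.median(toy_salaries):,.0f} -> ${np.median(trimmed):,.0f}")
```

Removing one person out of ten cuts the mean roughly in half, while the median moves by only $2,000.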

Step 4: Break It Down by Level

The overall statistics hide important group differences. Ava breaks the analysis down:

# Create a DataFrame with job levels
df = pd.DataFrame({
    'salary': all_salaries,
    'level': (['Individual Contributor'] * 600 +
              ['Manager'] * 120 +
              ['Senior Leader'] * 50 +
              ['C-Suite'] * 10)
})

level_stats = df.groupby('level')['salary'].agg(
    ['count', 'mean', 'median', 'std']
).round(0)

level_stats.columns = ['Count', 'Mean', 'Median', 'Std Dev']
level_stats['Mean-Median Gap'] = (level_stats['Mean'] - level_stats['Median']).round(0)
print(level_stats.to_string())

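Now the picture is clear. Within the three largest levels, the mean and median sit close together. Only the ten-person C-suite still shows a large gap, because the $1.2 million and $880,000 salaries pull its mean (about $567,000) well above its median ($460,000). It's mostly the mixing of very different groups that creates the misleading overall mean.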
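The overall mean is a headcount-weighted average of the group means, which makes each level's pull easy to see. The sketch below uses the nominal centers from the simulation parameters (75,000, 115,000, and 180,000) and the ten listed C-suite salaries rather than the random draws, so it only approximates the simulated mean; the variable names are illustrative.

```python
import numpy as np

csuite = np.array([350000, 420000, 380000, 510000, 450000,
                   620000, 390000, 470000, 880000, 1200000])

# (headcount, nominal mean salary) for each level, taken from the simulation setup
levels = {
    'Individual Contributor': (600, 75000.0),
    'Manager': (120, 115000.0),
    'Senior Leader': (50, 180000.0),
    'C-Suite': (len(csuite), csuite.mean()),
}

total_headcount = sum(count for count, _ in levels.values())
overall_mean = sum(count * mean for count, mean in levels.values()) / total_headcount

# Compare each level's share of people with its share of total payroll
for name, (count, mean) in levels.items():
    headcount_share = count / total_headcount
    payroll_share = count * mean / (overall_mean * total_headcount)
    print(f"{name:<24} {headcount_share:6.1%} of staff, {payroll_share:6.1%} of payroll")

print(f"Overall mean from group means: ${overall_mean:,.0f}")
```

With these nominal figures the C-suite is about 1.3% of headcount but close to 8% of payroll, and the overall mean lands near $94,000. The simulation therefore reproduces the pattern behind the press release figure rather than its exact value.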

Step 5: The Gender Pay Gap Question

Ava's analysis takes a more serious turn when a colleague asks: "Is there a gender pay gap at NovaTech?"

The company's diversity report states: "The average salary for women at NovaTech is $94,000, compared to $118,000 for men — a gap of $24,000."

But Ava knows to dig deeper:

# Add gender to the simulation (reflecting typical tech industry demographics)
np.random.seed(42)

# Women are underrepresented in senior positions
gender = (
    np.random.choice(['F', 'M'], 600, p=[0.40, 0.60]).tolist() +   # ICs: 40% F
    np.random.choice(['F', 'M'], 120, p=[0.30, 0.70]).tolist() +   # Managers: 30% F
    np.random.choice(['F', 'M'], 50, p=[0.20, 0.80]).tolist() +    # Senior: 20% F
    np.random.choice(['F', 'M'], 10, p=[0.10, 0.90]).tolist()      # C-Suite: 10% F
)

df['gender'] = gender

# Overall gap
print("=== Overall Gender Salary Comparison ===")
gender_overall = df.groupby('gender')['salary'].agg(['mean', 'median', 'count'])
print(gender_overall.round(0))
print()

# Gap by level
print("=== Gender Salary Comparison BY LEVEL ===")
for level in ['Individual Contributor', 'Manager', 'Senior Leader', 'C-Suite']:
    level_data = df[df['level'] == level]
    by_gender = level_data.groupby('gender')['salary'].agg(['mean', 'median', 'count'])
    print(f"\n{level}:")
    print(by_gender.round(0))

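The results reveal something important: within each job level, the gender pay gap is much smaller (or might even reverse). In this simulation that holds by construction, because gender is assigned independently of salary within each level, so any within-level difference is random noise. The large overall gap comes from composition: women make up a larger share of ICs (earning around $75k) and a smaller share of executives (earning $400k+). In real data, both a within-level gap and a composition effect can be present at once.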

This doesn't mean there's no problem. The underrepresentation of women in senior positions is the problem — it's just a different problem than "women are paid less for the same work." The overall mean hides the distinction between these two very different issues.
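The distinction between a within-level gap and a composition effect can be shown on a deliberately extreme toy table in which men and women at the same level are paid identically. All counts and salaries below are invented for illustration and are not NovaTech figures.

```python
import pandas as pd

# Hypothetical workforce: identical pay within each level, unequal representation
rows = (
    [('IC', 'F', 75000)] * 40 + [('IC', 'M', 75000)] * 60 +
    [('Executive', 'F', 400000)] * 1 + [('Executive', 'M', 400000)] * 9
)
toy = pd.DataFrame(rows, columns=['level', 'gender', 'salary'])

# Overall gap: compares all men with all women, ignoring level
overall = toy.groupby('gender')['salary'].mean()
overall_gap = overall['M'] - overall['F']

# Within-level gap: compares men and women who hold the same level
by_level = toy.groupby(['level', 'gender'])['salary'].mean().unstack('gender')
within_gap = by_level['M'] - by_level['F']

print(f"Overall gap (M - F): ${overall_gap:,.0f}")
print(within_gap)
```

Here the entire overall gap of roughly $34,000 comes from who holds which job, since pay within each level is identical.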

The Lessons

Lesson 1: The Mean Can Be Technically Correct and Practically Misleading

NovaTech's press release wasn't lying. The mean salary really was $112,000. But that number doesn't describe the experience of a "typical" employee. More than 75% of employees earn less than the mean. The median of $80,000 is a far better summary.

Lesson 2: Always Report Shape and Spread, Not Just Center

If the press release had said "the typical salary is $80,000 (median), with the middle 50% of employees earning between $66,000 and $95,000 (the first and third quartiles)," that would have been honest and informative. A single number, any single number, is incomplete.
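A reporting helper along these lines would make the median and quartiles the default summary. The function name `typical_salary_summary` and the five-value sample are hypothetical; `np.percentile` uses linear interpolation by default.

```python
import numpy as np

def typical_salary_summary(salaries):
    """Describe center and spread with the median and the quartiles."""
    q1, median, q3 = np.percentile(salaries, [25, 50, 75])
    return (f"The typical salary is ${median:,.0f} (median); the middle 50% of "
            f"employees earn between ${q1:,.0f} and ${q3:,.0f}.")

sample = [50_000, 60_000, 70_000, 80_000, 90_000]  # made-up values
print(typical_salary_summary(sample))
```

Reporting the two quartiles themselves, rather than only their difference, tells the reader where the middle half of salaries actually sits.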

Lesson 3: Disaggregate Before Aggregating

The overall gender pay gap looked alarming. Breaking it down by job level told a more nuanced story. In data science, aggregated statistics often hide important group-level differences. Simpson's Paradox (Exercise 19.18) is the extreme version of this, where the aggregated trend actually reverses at the group level.
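A reversal of this kind is easy to construct. In the hypothetical numbers below (invented purely for illustration), women earn more than men at both levels, yet less overall, because far fewer of them hold the highly paid level.

```python
# Hypothetical (headcount, salary) pairs: women out-earn men at both levels
groups = {
    'F': {'IC': (90, 80000), 'Executive': (10, 420000)},
    'M': {'IC': (50, 75000), 'Executive': (50, 400000)},
}

def overall_mean(levels):
    """Headcount-weighted mean salary across levels."""
    headcount = sum(n for n, _ in levels.values())
    payroll = sum(n * salary for n, salary in levels.values())
    return payroll / headcount

# Level-by-level comparison: women are ahead in each row
for level in ['IC', 'Executive']:
    f_pay, m_pay = groups['F'][level][1], groups['M'][level][1]
    print(f"{level}: F ${f_pay:,} vs M ${m_pay:,}")

# Aggregated comparison: the direction flips
print(f"Overall: F ${overall_mean(groups['F']):,.0f} vs M ${overall_mean(groups['M']):,.0f}")
```

The aggregate comparison ($114,000 versus $237,500) points the opposite way from every level-by-level comparison, which is the signature of Simpson's Paradox.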

Lesson 4: Ask "Who Benefits from This Summary?"

When a company reports the mean salary, they're choosing the number that makes them look most generous. When a union reports the median salary, they're choosing the number that highlights how most workers are paid. Neither is wrong. But the choice of statistic is itself a story — and a thoughtful analyst recognizes that.

Discussion Questions

If you were writing NovaTech's annual report, how would you present salary data honestly? What statistics would you include?
A politician says "the average American household income increased by $5,000 last year." What follow-up questions should you ask before concluding that people are better off?
When is it appropriate to report the mean for skewed data? (Hint: think about situations where the total matters, not just the typical value — for example, total healthcare costs.)
How would you explain to a non-technical manager why the median is a better summary than the mean for their employee salary data?

Connection to the Chapter

This case study illustrates every concept from Section 19.2 (mean vs. median), Section 19.4 (skewness), and Section 19.7 (choosing the right statistic). The threshold concept — distribution thinking — is exactly what Ava used when she moved from "here's one number" to "here's the shape of the data." That shift from a single number to a shape is the key insight of descriptive statistics.

The gender pay gap analysis also previews the kind of careful group-comparison thinking you'll develop further in Chapters 22-23 (inference) and Chapter 24 (correlation). For now, notice how simply describing the data carefully — without any hypothesis tests or p-values — already reveals important patterns.