Case Study 25-2: When Your Average Is a Lie

Character: Maya Reyes, Independent Consultant Business Question: Is her consulting practice as profitable as she thinks? Key Insight: Mean project margin is masking a less flattering median reality


The Setup

Maya had been consulting independently for three years, and by her own assessment, things were going well. Her standard answer when anyone asked how business was going: "My average project margin is around 58%." She had calculated that number herself — total profit across all projects divided by total revenue. Fifty-eight percent. She was proud of it.

But her accountant, during their quarterly review, had asked a question that stuck with her: "Are your margins consistent, or are a few great projects making the average look better than it is?"

Maya had brushed it off at the time. But the question nagged at her. She had her project data in a spreadsheet — time tracking, revenue, costs, everything. On a slow Friday afternoon, she decided to actually look.


The Data

Maya exported her project log to a CSV file. Each row represented a completed project over the past 30 months.

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Load Maya's project data
df = pd.read_csv("maya_project_log.csv")

print(f"Total projects: {len(df)}")
print(f"Total revenue: ${df['revenue'].sum():,.0f}")
print(f"Total profit: ${df['profit'].sum():,.0f}")
print()
print(df.describe())

The dataset had these columns:

project_id, client_name, client_type, project_type,
revenue, cost, profit, margin_pct, duration_weeks,
start_date

client_type was one of: "startup", "mid-size_company", "enterprise", "nonprofit" project_type was one of: "strategy", "implementation", "training", "audit"


The First Surprise: Mean vs. Median

margin = df["margin_pct"]

mean_margin = margin.mean()
median_margin = margin.median()
gap = mean_margin - median_margin

print("Project Margin Statistics:")
print(f"  Mean margin   : {mean_margin:.1f}%")
print(f"  Median margin : {median_margin:.1f}%")
print(f"  Gap           : {gap:+.1f} percentage points")
print()
print("Distribution details:")
print(f"  Std deviation : {margin.std():.1f}%")
print(f"  Min margin    : {margin.min():.1f}%")
print(f"  Q1 (25th pct) : {margin.quantile(0.25):.1f}%")
print(f"  Q3 (75th pct) : {margin.quantile(0.75):.1f}%")
print(f"  Max margin    : {margin.max():.1f}%")

Output:

Project Margin Statistics:
  Mean margin   : 57.8%
  Median margin : 41.3%
  Gap           : +16.5 percentage points

Distribution details:
  Std deviation : 24.6%
  Min margin    : 8.2%
  Q1 (25th pct) : 29.4%
  Q3 (75th pct) : 62.1%
  Max margin    : 94.7%

Maya set down her coffee.

A 16.5 percentage point gap between mean and median. Her "58% average margin" figure — the one she had told her accountant, her friends, herself — was telling a fundamentally misleading story. The typical project (median) was earning her 41% margins. The mean was inflated by some very high-margin outliers.

She pulled up the histogram to see the shape.


Visualizing the Distribution

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Histogram with mean and median marked
ax1 = axes[0]
ax1.hist(df["margin_pct"], bins=18, color="#7C4DFF", edgecolor="white",
         alpha=0.85)
ax1.axvline(mean_margin, color="#F44336", linewidth=2.5, linestyle="--",
            label=f"Mean: {mean_margin:.1f}%")
ax1.axvline(median_margin, color="#00C853", linewidth=2.5,
            label=f"Median: {median_margin:.1f}%")
ax1.set_xlabel("Project Margin (%)")
ax1.set_ylabel("Number of Projects")
ax1.set_title("Project Margin Distribution")
ax1.legend(fontsize=11)
ax1.set_xlim(0, 100)

# Box plot by client type
ax2 = axes[1]
client_types = ["startup", "mid-size_company", "enterprise", "nonprofit"]
labels = ["Startup", "Mid-Size", "Enterprise", "Nonprofit"]
data = [df[df["client_type"] == ct]["margin_pct"] for ct in client_types]
bp = ax2.boxplot(data, labels=labels, patch_artist=True)
colors = ["#E8F5E9", "#E3F2FD", "#FFF8E1", "#FCE4EC"]
for patch, color in zip(bp["boxes"], colors):
    patch.set_facecolor(color)
for med in bp["medians"]:
    med.set_color("#D32F2F")
    med.set_linewidth(2)
ax2.set_ylabel("Margin (%)")
ax2.set_title("Margin by Client Type")
ax2.axhline(median_margin, color="#757575", linewidth=1, linestyle="--",
            label=f"Overall median: {median_margin:.1f}%")
ax2.legend(fontsize=9)

plt.suptitle("Maya Reyes Consulting — Project Profitability Analysis",
             fontsize=13, fontweight="bold")
plt.tight_layout()
plt.savefig("maya_margin_analysis.png", dpi=150)
plt.show()

The histogram told the story visually. The distribution had a long right tail — the bulk of projects clustered in the 25%–55% range, with a distinct group of high-margin projects ($85%+) pulling the mean rightward.


The Second Surprise: Which Projects Had the High Margins?

# Sort projects by margin descending
df_sorted = df.sort_values("margin_pct", ascending=False)

print("Top 10 highest-margin projects:")
print(df_sorted[["client_name", "client_type", "project_type",
                  "revenue", "margin_pct"]].head(10).to_string(index=False))

print("\nBottom 10 lowest-margin projects:")
print(df_sorted[["client_name", "client_type", "project_type",
                  "revenue", "margin_pct"]].tail(10).to_string(index=False))

Output (top 10, simplified):

  client_name   client_type  project_type  revenue  margin_pct
  TechVenture A    enterprise      training   42,000       94.7
  TechVenture B    enterprise      training   38,500       91.2
  CloudCo          enterprise      training   35,000       88.9
  MegaCorp         enterprise      audit       51,000       86.3
  ...

Bottom 10:
  client_name   client_type   project_type  revenue  margin_pct
  GreenNPO          nonprofit  implementation  18,000        8.2
  LocalNPO          nonprofit  implementation  14,000       11.4
  TechStartup_14    startup    implementation  22,000       14.8
  TechStartup_7     startup    implementation  19,500       16.1
  ...

A clear pattern: the highest-margin projects were enterprise training engagements. The lowest-margin projects were implementation work for nonprofits and startups.


Segmenting by Client Type and Project Type

# Margin by client type
print("Median Margin by Client Type:")
by_client = df.groupby("client_type")["margin_pct"].agg(
    count="count",
    median="median",
    mean="mean",
    std="std"
).round(1)
print(by_client.sort_values("median", ascending=False).to_string())

print()

# Margin by project type
print("Median Margin by Project Type:")
by_project = df.groupby("project_type")["margin_pct"].agg(
    count="count",
    median="median",
    mean="mean",
    std="std"
).round(1)
print(by_project.sort_values("median", ascending=False).to_string())

Output:

Median Margin by Client Type:
               count  median   mean   std
enterprise         9    68.4   71.2  14.3
mid-size_company  16    48.7   52.1  18.9
startup           22    34.2   38.6  22.4
nonprofit          8    17.8   22.3  11.2

Median Margin by Project Type:
                count  median   mean   std
training           11    79.4   81.3  10.1
audit              10    63.2   65.8  13.4
strategy           21    41.8   46.2  19.7
implementation     13    18.4   22.6  12.8

Maya looked at these tables for a long time.

The enterprise/training combination was generating median margins of nearly 80%. The nonprofit/implementation combination was generating median margins of around 14%. She had been treating all her projects as roughly equivalent. She had been wrong by a factor of five.


The Outlier Analysis

# IQR-based outlier detection
Q1 = df["margin_pct"].quantile(0.25)
Q3 = df["margin_pct"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

high_outliers = df[df["margin_pct"] > upper_bound]
low_outliers = df[df["margin_pct"] < lower_bound]

print(f"IQR: {IQR:.1f} percentage points")
print(f"Outlier bounds: {lower_bound:.1f}% to {upper_bound:.1f}%")
print()
print(f"High-margin outliers ({len(high_outliers)} projects):")
print(high_outliers[["client_type", "project_type", "revenue",
                       "margin_pct"]].to_string(index=False))

print(f"\nLow-margin outliers ({len(low_outliers)} projects):")
print(low_outliers[["client_type", "project_type", "revenue",
                      "margin_pct"]].to_string(index=False))

# What share of total profit came from high-margin outliers?
outlier_profit = high_outliers["profit"].sum()
total_profit = df["profit"].sum()
print(f"\nHigh-margin outlier projects represent:")
print(f"  {len(high_outliers)/len(df)*100:.1f}% of project count")
print(f"  {outlier_profit/total_profit*100:.1f}% of total profit")

Output:

IQR: 32.7 percentage points
Outlier bounds: -19.7% to 111.2%
(No statistical outliers by IQR method — the distribution is wide but no extreme values)

Wait — checking top quartile instead...

Projects above 75th percentile (>{Q3:.1f}%):
  14 projects above 62.1% margin

These 14 projects (25% of total) represent:
  52.4% of total profit

The high-margin projects — even without being statistical outliers by the IQR method — were doing enormous heavy lifting. A quarter of her projects were generating more than half her total profit.


The Profit Concentration Analysis

# Sort by profit contribution
df_profit = df.sort_values("profit", ascending=False).reset_index(drop=True)
df_profit["cumulative_profit"] = df_profit["profit"].cumsum()
df_profit["profit_share"] = df_profit["cumulative_profit"] / df_profit["profit"].sum() * 100

# How many projects to reach 50% of profit?
projects_for_50_pct = (df_profit["profit_share"] < 50).sum() + 1
print(f"Projects needed to reach 50% of total profit: {projects_for_50_pct}")
print(f"Out of {len(df_profit)} total projects ({projects_for_50_pct/len(df_profit)*100:.1f}%)")

# Cumulative profit curve
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(range(1, len(df_profit) + 1),
        df_profit["profit_share"], color="#7C4DFF", linewidth=2.5)
ax.axhline(50, color="#9E9E9E", linestyle=":", linewidth=1.5)
ax.axvline(projects_for_50_pct, color="#F44336", linestyle="--",
           linewidth=1.5, label=f"{projects_for_50_pct} projects = 50% of profit")
ax.set_xlabel("Number of Projects (ranked by profit, high to low)")
ax.set_ylabel("Cumulative % of Total Profit")
ax.set_title("Profit Concentration — Maya Reyes Consulting")
ax.legend()
ax.set_xlim(1, len(df_profit))
ax.set_ylim(0, 105)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:.0f}%"))
plt.tight_layout()
plt.savefig("maya_profit_concentration.png", dpi=150)
plt.show()

Output:

Projects needed to reach 50% of total profit: 9
Out of 55 total projects (16.4%)

Nine projects. Sixteen percent of her project count was generating half her profit. The bottom third of her projects — implementation work, primarily for startups and nonprofits — was profitable in absolute terms, but at margins that were costing her time she could have spent on higher-value work.


What This Changes

Maya sat with these findings for a while. Here is what the statistics had shown her:

  1. Her "58% average margin" was inflated by a small number of high-margin enterprise training engagements. The typical project (median) generated 41% margins — not 58%.

  2. Enterprise + training was her most profitable combination by a significant margin (79% median vs. 41% overall median).

  3. Implementation work, especially for nonprofits and early-stage startups, had margins below 20%. She was not losing money, but she was spending significant time for modest returns.

  4. A minority of her projects (the top 25%) generated the majority (52%) of her profit. The effort she was putting into low-margin projects was essentially subsidizing the rest of her business.

She thought back to how she had been selecting clients. She had said yes to a lot of implementation work because she liked the tangible, build-something quality of it. She had taken nonprofit clients at discounted rates because she believed in the causes. Both were legitimate choices — but she had made them without understanding the profitability cost.


The Strategic Decision

# Simulate: what if Maya rebalanced her project mix?
# Current mix
current_by_type = df.groupby("project_type").agg(
    count=("revenue", "count"),
    total_revenue=("revenue", "sum"),
    total_profit=("profit", "sum"),
).assign(effective_margin=lambda x: x["total_profit"] / x["total_revenue"] * 100)

print("Current Project Mix:")
print(current_by_type.round(1).to_string())

# Hypothetical: shift 5 implementation projects to training/strategy
impl_avg_revenue = df[df["project_type"] == "implementation"]["revenue"].median()
impl_avg_margin = df[df["project_type"] == "implementation"]["margin_pct"].median() / 100
training_avg_revenue = df[df["project_type"] == "training"]["revenue"].median()
training_avg_margin = df[df["project_type"] == "training"]["margin_pct"].median() / 100

lost_profit = 5 * impl_avg_revenue * impl_avg_margin
gained_profit = 5 * training_avg_revenue * training_avg_margin
net_change = gained_profit - lost_profit

print(f"\nHypothetical: Replace 5 implementation projects with 5 training projects")
print(f"  Lost profit from implementation : ${lost_profit:,.0f}")
print(f"  Gained profit from training      : ${gained_profit:,.0f}")
print(f"  Net profit improvement           : ${net_change:+,.0f}")

Output:

Hypothetical: Replace 5 implementation projects with 5 training projects
  Lost profit from implementation : $15,400
  Gained profit from training     : $31,200
  Net profit improvement          : +$15,800

Without working more hours, without raising her rates, just by changing the type of work she pursued — Maya could increase her annual profit by approximately $15,800.


The New Strategy

Maya updated her client intake criteria:

Prioritize: - Enterprise clients seeking training or strategy work - Mid-size company strategy engagements (strong margins, interesting work) - Any client where the project can be templated for training delivery

Evaluate carefully: - Implementation projects (only at higher rates, or for strategic relationships) - Nonprofit engagements (selective — limit to 1-2 per year as pro bono/relationship work, not core revenue)

New success metric: She replaced "average project margin" in her mental scorecard with "median project margin" — the number that actually represented a typical engagement, not the inflated mean. Her goal: get the median above 50% over the next 18 months.


What Maya Learned About Statistics

Looking back at this analysis, Maya identified three statistical lessons:

Lesson 1: Always compare mean and median. If they diverge significantly, your mean is being distorted by outliers. In business, this almost always means a few exceptional outcomes are masking a more modest typical reality.

Lesson 2: Segment before you conclude. "Average margin" across all projects was meaningless for decision-making. Margin by client type and project type was actionable.

Lesson 3: Profit concentration matters. Knowing that 16% of her projects generated 50% of her profit was not just interesting — it told her exactly where to point her business development energy.

Her accountant had asked the right question. The data had answered it honestly. Maya just had to be willing to look.


Technical Summary

Code Pattern What It Revealed
margin.mean() vs margin.median() 16.5 point gap — mean inflated by high-margin outliers
df.groupby("client_type")["margin_pct"].median() Enterprise clients: 68% median margin
df.groupby("project_type")["margin_pct"].median() Training: 79% vs. Implementation: 18%
IQR outlier detection No extreme outliers — but top quartile disproportionate
Cumulative profit analysis 16% of projects = 50% of total profit
Histogram with mean/median lines Right skew made visually immediate