
Chapter 28: Sales and Revenue Analytics

"Revenue solves a lot of problems." — Mark Cuban

You have been building Python skills throughout this book — loading data, cleaning it, reshaping it with pandas, visualizing it with matplotlib. Now it is time to put all of that to work on the problem that every business cares about most: understanding how much money you are making, where it is coming from, and what is actually driving growth.

Sales analytics is not just about adding up a column of numbers. It is about asking — and answering — the questions that change decisions. Which customers account for most of your revenue? Which products have the best margins? Is your West region struggling because the market is weak, or because you only have one salesperson? Are your best customers buying more, or quietly drifting away?

By the end of this chapter, you will be able to answer all of those questions using Python. You will build a complete sales analytics toolkit and apply it to Acme Corp's 2023 data — and you will watch Priya Okonkwo take that analysis to Sandra Chen's desk and change the company's entire West region strategy.


What You Will Learn

  • Computing and interpreting the core revenue metrics every business needs
  • Breaking down revenue by region, product, channel, and salesperson with pandas
  • Calculating month-over-month and year-over-year growth rates
  • Performing a Pareto (80/20) analysis to find your most important customers and products
  • Measuring revenue concentration risk — and why it matters
  • Building a sales funnel and calculating conversion rates
  • Analyzing cohorts: understanding customers by when they first bought from you
  • Introducing RFM scoring (Recency, Frequency, Monetary) — a preview of Chapter 27
  • Calculating sales velocity metrics
  • Building a four-panel matplotlib sales dashboard

28.1 Setting Up Your Sales Data

Before you can analyze anything, you need data. Sales data typically lives in one of three places: a CRM system (Salesforce, HubSpot), an ERP system (SAP, NetSuite), or a good old-fashioned spreadsheet. For analysis purposes, you want transactions in a flat CSV format — one row per line item.

Here is what Acme Corp's acme_sales_2023.csv looks like:

date,order_id,customer_id,customer_name,region,product,quantity,unit_price,cost_per_unit,salesperson,channel
2023-01-05,ORD-00001,C0001,Customer 0001,North,Printer Paper (Case),42,49.99,28.00,Leo Patel,Direct
2023-01-05,ORD-00001,C0001,Customer 0001,North,Staples (Box),18,4.99,1.80,Leo Patel,Direct
2023-01-07,ORD-00002,C0023,Customer 0023,East,Binders (6-Pack),25,21.99,10.50,Sam Torres,Online

Notice that a single order (ORD-00001) has two rows — one per product. This is standard line-item format, and it is exactly what pandas handles best.

Let us load this data and make it analysis-ready:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker


def load_sales_data(filepath: str) -> pd.DataFrame:
    """Load and prepare sales transaction data from a CSV file.

    Expects columns: date, order_id, customer_id, region, product,
    quantity, unit_price. Optional: cost_per_unit, salesperson, channel.

    Args:
        filepath: Path to the sales CSV file.

    Returns:
        DataFrame with revenue, margin, and time helper columns added.

    Raises:
        FileNotFoundError: If the CSV does not exist at the given path.
        ValueError: If required columns are missing.
    """
    required_columns = {"date", "order_id", "customer_id", "region",
                        "product", "quantity", "unit_price"}
    try:
        df = pd.read_csv(filepath, parse_dates=["date"])
    except FileNotFoundError:
        raise FileNotFoundError(f"Sales data file not found: {filepath}")

    missing = required_columns - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # Derive revenue if not already in the file
    if "revenue" not in df.columns:
        df["revenue"] = df["quantity"] * df["unit_price"]

    # Derive gross margin if cost data is available
    if "cost_per_unit" in df.columns and "gross_margin" not in df.columns:
        df["cost"] = df["quantity"] * df["cost_per_unit"]
        df["gross_margin"] = df["revenue"] - df["cost"]
        df["margin_pct"] = df["gross_margin"] / df["revenue"]

    # Add time helper columns for easy grouping later
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["quarter"] = df["date"].dt.quarter
    df["year_month"] = df["date"].dt.to_period("M")

    return df


df = load_sales_data("acme_sales_2023.csv")
print(
    f"Loaded {len(df):,} rows covering "
    f"{df['date'].min().date()} to {df['date'].max().date()}"
)

The parse_dates=["date"] argument tells pandas to read that column as actual datetime values rather than plain strings. The dt.to_period("M") call creates a "year-month" period column — this is your best friend for monthly trend analysis, because a monthly Period represents the whole calendar month (however many days it contains), groups transactions correctly, and sorts chronologically.
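To see the Period behavior in isolation, here is a tiny standalone illustration (toy dates, not the Acme file): dates in the same calendar month collapse into one Period, and Periods sort chronologically.

```python
import pandas as pd

# Three toy dates, deliberately out of order and spanning a month boundary
dates = pd.Series(pd.to_datetime(["2023-02-28", "2023-01-31", "2023-02-01"]))
periods = dates.dt.to_period("M")

# Both February dates map to the same 2023-02 period
monthly_counts = periods.value_counts().sort_index()
print(monthly_counts)
```

This is exactly what makes groupby("year_month") safe later: every transaction in a month lands in the same bucket, regardless of which day it occurred.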

A Quick Sanity Check

Always spend two minutes on a sanity check before any analysis. Garbage in, garbage out is never more true than in revenue reporting:

# Shape and data types
print(f"Shape: {df.shape}")
print(df.dtypes)

# Any missing values?
print("\nMissing values:")
print(df.isnull().sum())

# Revenue range — does it make sense?
print("\nRevenue statistics:")
print(df["revenue"].describe())

# Duplicate order-product combinations?
dupes = df.duplicated(subset=["order_id", "product"])
print(f"\nDuplicate line items: {dupes.sum()}")

# Negative revenues? (Possible for returns, but should be flagged)
negative = df[df["revenue"] < 0]
if len(negative) > 0:
    print(f"\nWarning: {len(negative)} rows with negative revenue found.")

If revenue values are negative, zero, or implausibly large, investigate before proceeding. A single duplicated row can inflate every metric downstream. A negative revenue row that represents a return needs different treatment from an error.
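If you do find negative rows, a small helper can separate plausible returns from likely data errors before they contaminate downstream metrics. This is a sketch that assumes returns are recorded with a negative quantity; adjust the rule to match however your source system encodes credits:

```python
import pandas as pd


def split_negative_revenue(df: pd.DataFrame):
    """Separate plausible returns (negative quantity) from likely errors.

    Assumption: a legitimate return carries quantity < 0. A row with
    negative revenue but non-negative quantity is flagged as an error.
    """
    neg = df["revenue"] < 0
    returns = df[neg & (df["quantity"] < 0)]
    errors = df[neg & (df["quantity"] >= 0)]
    clean = df[~neg]
    return clean, returns, errors


# Toy data: one normal sale, one return, one suspicious row
toy = pd.DataFrame({
    "quantity": [10, -2, 5],
    "revenue": [500.0, -100.0, -250.0],
})
clean, returns, errors = split_negative_revenue(toy)
```

Returns can then be netted against gross sales deliberately, while error rows go back to whoever owns the source system.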


28.2 Core Revenue Metrics

The first thing any business leader wants to know is the top-line number. But "total revenue" is just the start. Here are the metrics that give you a complete executive summary:

def calculate_revenue_metrics(df: pd.DataFrame) -> dict:
    """Compute executive-level revenue summary metrics.

    Args:
        df: Sales DataFrame with revenue, order_id, customer_id columns.

    Returns:
        Dictionary containing total_revenue, total_orders, unique_customers,
        average_order_value, median_order_value, revenue_per_customer,
        and optional gross margin metrics.
    """
    total_revenue = df["revenue"].sum()

    # Order-level metrics — always group by order_id first, not just mean of rows
    order_totals = df.groupby("order_id")["revenue"].sum()
    average_order_value = order_totals.mean()
    median_order_value = order_totals.median()

    unique_customers = df["customer_id"].nunique()
    revenue_per_customer = total_revenue / unique_customers if unique_customers else 0

    metrics = {
        "total_revenue": round(total_revenue, 2),
        "total_orders": df["order_id"].nunique(),
        "unique_customers": unique_customers,
        "average_order_value": round(average_order_value, 2),
        "median_order_value": round(median_order_value, 2),
        "revenue_per_customer": round(revenue_per_customer, 2),
    }

    if "gross_margin" in df.columns:
        metrics["total_gross_margin"] = round(df["gross_margin"].sum(), 2)
        metrics["overall_margin_pct"] = round(
            df["gross_margin"].sum() / total_revenue * 100, 1
        )

    return metrics

A subtle but important point: when calculating average order value, you must group by order_id first. If you call df["revenue"].mean() directly, you get the average revenue per line item — a meaningless number. An order containing three products has three rows, and a naive mean would artificially understate the true order size.
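You can see the distortion with four toy line items — order A has three rows, order B has one:

```python
import pandas as pd

# Toy line items: order A is one $200 order split across three rows
toy = pd.DataFrame({
    "order_id": ["A", "A", "A", "B"],
    "revenue": [100.0, 50.0, 50.0, 300.0],
})

line_item_mean = toy["revenue"].mean()                       # 125.0 — wrong
true_aov = toy.groupby("order_id")["revenue"].sum().mean()   # 250.0 — right
```

The naive mean reports $125 per "order" when the true average order value is $250 — understated by half, purely because multi-line orders get counted once per row.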

When Priya runs this on Acme's full 2023 year:

metrics = calculate_revenue_metrics(df)

for key, value in metrics.items():
    label = key.replace("_", " ").title()
    if isinstance(value, float):
        if "pct" in key:
            print(f"  {label:<30} {value:>10.1f}%")
        else:
            print(f"  {label:<30} ${value:>12,.2f}")
    else:
        print(f"  {label:<30} {value:>13,}")

Output:

  Total Revenue                  $2,847,392.00
  Total Orders                           1,300
  Unique Customers                         155
  Average Order Value            $    2,190.30
  Median Order Value             $    1,847.50
  Revenue Per Customer           $   18,370.27
  Total Gross Margin             $1,452,170.24
  Overall Margin Pct                   51.0%

The median order value being lower than the mean ($1,847 vs. $2,190) tells Priya something: there are some large orders pulling the average upward. That is worth investigating — those large orders may represent a handful of enterprise accounts with outsize impact.
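A five-order toy example (hypothetical order totals, not Acme's) shows how a single large order separates the two statistics:

```python
import pandas as pd

# One enterprise-sized order among four ordinary ones
order_totals = pd.Series([500.0, 700.0, 900.0, 1000.0, 9000.0])

mean_aov = order_totals.mean()      # 2420.0 — pulled up by the outlier
median_aov = order_totals.median()  # 900.0 — the "typical" order
```

Whenever the mean sits well above the median like this, go look at the tail: those few large orders are where the concentration-risk story in Section 28.6 begins.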


28.3 Revenue by Dimension: Region, Product, Channel, Rep

Breaking revenue down by a single dimension is one of the most frequently requested operations in business analytics. The pattern is always the same: groupby(), aggregate, sort descending, compute percentages.

def revenue_by_dimension(df: pd.DataFrame, dimension: str) -> pd.DataFrame:
    """Aggregate revenue and orders by a single dimension.

    Args:
        df: Sales DataFrame.
        dimension: Column name to group by. Common choices:
                   region, product, salesperson, channel, customer_name.

    Returns:
        DataFrame sorted by revenue descending with pct_of_total column.

    Raises:
        ValueError: If the specified column is not in the DataFrame.
    """
    if dimension not in df.columns:
        raise ValueError(
            f"Column '{dimension}' not found. "
            f"Available columns: {list(df.columns)}"
        )

    grouped = (
        df.groupby(dimension)
        .agg(
            revenue=("revenue", "sum"),
            orders=("order_id", "nunique"),
        )
        .reset_index()
        .sort_values("revenue", ascending=False)
    )

    grouped["pct_of_total"] = (
        grouped["revenue"] / grouped["revenue"].sum() * 100
    ).round(1)

    if "gross_margin" in df.columns:
        margin_data = df.groupby(dimension)["gross_margin"].sum().reset_index()
        grouped = grouped.merge(margin_data, on=dimension)
        grouped["margin_pct"] = (
            grouped["gross_margin"] / grouped["revenue"] * 100
        ).round(1)

    return grouped


# The exact same function works for any dimension
print("By Region:")
print(revenue_by_dimension(df, "region").to_string(index=False))

print("\nBy Channel:")
print(revenue_by_dimension(df, "channel").to_string(index=False))

print("\nBy Salesperson:")
print(revenue_by_dimension(df, "salesperson").to_string(index=False))

For Acme Corp, the regional breakdown is the first sign something is wrong:

Region     Revenue    Orders    % of Total    Margin %
North     $924,560       418         32.5%       51.2%
East      $807,210       391         28.3%       50.8%
South     $755,890       351         26.5%       51.5%
West      $359,732       140         12.6%       51.1%

West is generating only 12.6% of revenue while the other three regions cluster between 26% and 32%. Notably, the margin percentage is nearly identical across all four regions — meaning this is a volume problem, not a pricing or efficiency problem. We will dig into this in Section 28.13.
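One quick way to test that "volume, not pricing" read is revenue per order. Using the figures from the regional table above (a back-of-the-envelope sketch, not new analysis):

```python
import pandas as pd

# Figures copied from the regional breakdown table
regions = pd.DataFrame({
    "region": ["North", "East", "South", "West"],
    "revenue": [924_560, 807_210, 755_890, 359_732],
    "orders": [418, 391, 351, 140],
})
regions["revenue_per_order"] = (regions["revenue"] / regions["orders"]).round(0)
```

West's roughly $2,570 per order is actually the highest of the four regions; the entire shortfall is in order count (140 versus 350-plus elsewhere), which is exactly what a volume problem looks like.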


28.4 Month-over-Month and Year-over-Year Growth

Growth rates tell you whether your business is accelerating or decelerating. Two comparisons matter most for different reasons:

  • Month-over-month (MoM): Tracks recent momentum. Sensitive to seasonality.
  • Year-over-year (YoY): Removes seasonal effects by comparing equivalent periods. More reliable signal of underlying trend.

December is always better than November in most consumer and B2B markets, not because the business is growing but because of seasonal patterns. YoY comparison tells you whether this December is better than last December — which is the meaningful question.

def monthly_revenue_trend(df: pd.DataFrame) -> pd.DataFrame:
    """Build a monthly revenue trend table with MoM and YoY growth rates.

    Args:
        df: Sales DataFrame with year_month column.

    Returns:
        DataFrame with columns: year_month, revenue, orders,
        mom_change, mom_pct, yoy_revenue, yoy_pct.
    """
    monthly = (
        df.groupby("year_month")
        .agg(
            revenue=("revenue", "sum"),
            orders=("order_id", "nunique"),
        )
        .reset_index()
        .sort_values("year_month")
    )

    # pct_change() computes (current - prior) / prior automatically
    monthly["mom_change"] = monthly["revenue"].diff()
    monthly["mom_pct"] = (monthly["revenue"].pct_change() * 100).round(1)

    # shift(12) looks back 12 periods — for monthly data, that is one year ago.
    # With under 13 months of history (such as a single-year file), these
    # YoY columns will be NaN.
    monthly["yoy_revenue"] = monthly["revenue"].shift(12)
    monthly["yoy_pct"] = (
        (monthly["revenue"] - monthly["yoy_revenue"])
        / monthly["yoy_revenue"]
        * 100
    ).round(1)

    return monthly.round({"revenue": 2, "mom_change": 2})


monthly = monthly_revenue_trend(df)
print(monthly[["year_month", "revenue", "mom_pct", "yoy_pct"]].to_string(index=False))

For a visual that tells the growth story clearly:

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

months = [str(p) for p in monthly["year_month"]]
x_pos = range(len(months))

# Panel 1: Absolute revenue as bars
axes[0].bar(x_pos, monthly["revenue"], color="#4472C4", alpha=0.85)
axes[0].set_title("Monthly Revenue — Acme Corp 2023", fontweight="bold")
axes[0].set_ylabel("Revenue ($)")
axes[0].set_xticks(list(x_pos))
axes[0].set_xticklabels(months, rotation=45, ha="right")
axes[0].yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
)

# Panel 2: MoM growth as colored bars (green positive, red negative)
mom_values = monthly["mom_pct"].fillna(0)
bar_colors = ["#27AE60" if v >= 0 else "#E74C3C" for v in mom_values]
axes[1].bar(x_pos, mom_values, color=bar_colors)
axes[1].axhline(0, color="black", linewidth=0.8)
axes[1].set_title("Month-over-Month Revenue Growth (%)", fontweight="bold")
axes[1].set_ylabel("MoM Change (%)")
axes[1].set_xticks(list(x_pos))
axes[1].set_xticklabels(months, rotation=45, ha="right")

plt.tight_layout()
plt.savefig("acme_monthly_trend.png", dpi=150, bbox_inches="tight")
plt.show()

Green bars above zero mean the business grew that month; red bars mean it shrank. For Acme, Q4 bars are almost all green and tall — the autumn/holiday office supply restocking effect is visible directly in the data.

Quarterly Growth Rates

Sometimes you want to look at a slightly smoother view than month-by-month:

quarterly_revenue = (
    df.groupby(["year", "quarter"])["revenue"]
    .sum()
    .reset_index()
    .sort_values(["year", "quarter"])
)
quarterly_revenue["qoq_growth"] = (
    quarterly_revenue["revenue"].pct_change() * 100
).round(1)
print(quarterly_revenue)

Acme's Q4 2023 showing strong quarterly growth confirms the seasonal pattern. But Q3 to Q4 growth in isolation is not meaningful without comparison — you want Q4 2023 vs. Q4 2022 (YoY) to see if the business actually improved.
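If you had 2022 data alongside 2023, pct_change(periods=4) would give that quarter-over-same-quarter-last-year comparison directly, since looking back four quarters lands on the same quarter one year earlier. A sketch with made-up quarterly figures:

```python
import pandas as pd

# Hypothetical two-year quarterly revenue, sorted by year then quarter
q = pd.DataFrame({
    "year": [2022] * 4 + [2023] * 4,
    "quarter": [1, 2, 3, 4] * 2,
    "revenue": [600_000, 640_000, 620_000, 760_000,
                650_000, 700_000, 690_000, 880_000],
})

# periods=4 compares each quarter with the same quarter one year earlier
q["yoy_pct"] = (q["revenue"].pct_change(periods=4) * 100).round(1)
```

The first four rows have no prior-year counterpart and come back NaN; Q4 2023 versus Q4 2022 works out to 15.8% growth, which is the seasonally honest number.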


28.5 Pareto Analysis: The 80/20 Rule in Practice

The Pareto principle — roughly 80% of effects come from 20% of causes — is one of the most consistently observed patterns in business. In most companies, approximately 80% of revenue comes from approximately 20% of customers. Your actual split might be 70/30 or 90/10. The point is to measure it precisely rather than assume it.

def pareto_analysis(
    df: pd.DataFrame,
    dimension: str,
    value_col: str = "revenue",
    top_n: int = 20,
) -> pd.DataFrame:
    """Perform an 80/20 Pareto analysis on a revenue dimension.

    Args:
        df: Sales DataFrame.
        dimension: Column to analyze (customer_name, product, region, etc.).
        value_col: Numeric column to aggregate (default: revenue).
        top_n: Maximum number of rows to return (default 20).

    Returns:
        DataFrame with value, pct_of_total, and cumulative_pct columns,
        sorted by value descending.
    """
    totals = (
        df.groupby(dimension)[value_col]
        .sum()
        .sort_values(ascending=False)
    )
    # Compute percentages against the grand total of ALL groups before
    # truncating to top_n — otherwise pct_of_total and cumulative_pct are
    # inflated, summing to 100% within the displayed rows alone.
    grand_total = totals.sum()

    grouped = totals.head(top_n).reset_index()
    grouped["pct_of_total"] = (grouped[value_col] / grand_total * 100).round(1)
    grouped["cumulative_pct"] = grouped["pct_of_total"].cumsum().round(1)

    return grouped

Finding the exact 80% threshold:

customer_pareto = pareto_analysis(df, "customer_name", top_n=200)

# Find the first row where cumulative percentage reaches 80%
reaching_80 = customer_pareto[customer_pareto["cumulative_pct"] >= 80]
if reaching_80.empty:
    print("The customers shown never reach 80% — raise top_n and rerun.")
else:
    threshold_rank = reaching_80.index[0] + 1  # zero-indexed, so add 1
    total_customers = df["customer_id"].nunique()
    print(
        f"80% of revenue comes from {threshold_rank} customers "
        f"out of {total_customers} total "
        f"({threshold_rank / total_customers * 100:.0f}% of customers)"
    )

For Acme, this prints: "80% of revenue comes from 31 customers out of 155 total (20% of customers)." That is the textbook 80/20 split — but it also means that if even one or two of those top 31 customers leave, the impact is significant.

Building the Pareto Chart

A Pareto chart combines a bar chart (individual values) with a cumulative line chart, using two Y-axes:

top_15 = pareto_analysis(df, "customer_name", top_n=15)

fig, ax1 = plt.subplots(figsize=(13, 6))

# Primary axis: revenue bars
ax1.bar(
    range(len(top_15)),
    top_15["revenue"],
    color="#4472C4",
    alpha=0.85,
    label="Revenue",
)
ax1.set_ylabel("Revenue ($)", color="#4472C4")
ax1.set_xticks(range(len(top_15)))
ax1.set_xticklabels(
    top_15["customer_name"], rotation=45, ha="right", fontsize=8
)
ax1.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
)

# Secondary axis: cumulative percentage line
ax2 = ax1.twinx()
ax2.plot(
    range(len(top_15)),
    top_15["cumulative_pct"],
    color="#FF6B35",
    marker="o",
    linewidth=2,
    markersize=6,
    label="Cumulative %",
)
ax2.axhline(80, color="red", linestyle="--", alpha=0.7, linewidth=1.5,
            label="80% threshold")
ax2.set_ylabel("Cumulative % of Total Revenue", color="#FF6B35")
ax2.set_ylim(0, 110)

ax1.set_title(
    "Customer Pareto Analysis — Acme Corp 2023\n"
    "Top 15 Customers by Revenue",
    fontweight="bold",
)

lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc="center right")

plt.tight_layout()
plt.savefig("acme_customer_pareto.png", dpi=150)
plt.show()

28.6 Revenue Concentration Risk

Knowing that your top 10 customers generate 68% of revenue is one thing. Knowing how to frame that as a business risk — and communicate it to leadership — is another.

def revenue_concentration_risk(
    df: pd.DataFrame,
    top_n: int = 10,
) -> dict:
    """Measure how concentrated revenue is in your top N customers.

    High concentration (above 60% from top 10) means losing any single
    major customer would materially hurt the business.

    The Herfindahl-Hirschman Index (HHI) is a standard economic measure of
    market concentration, scaled to 0-10,000. Values above 2,500 indicate
    high concentration and elevated revenue risk.

    Args:
        df: Sales DataFrame with customer_id and revenue columns.
        top_n: Number of top customers to include in concentration measure.

    Returns:
        Dictionary with concentration_pct, herfindahl_index, and risk_level.
    """
    customer_revenue = df.groupby("customer_id")["revenue"].sum()
    total_revenue = customer_revenue.sum()
    top_revenue = customer_revenue.nlargest(top_n).sum()
    concentration_pct = top_revenue / total_revenue * 100

    # HHI: sum of squared revenue shares, scaled to 0-10,000
    shares = customer_revenue / total_revenue
    hhi = (shares ** 2).sum() * 10_000

    return {
        "total_customers": len(customer_revenue),
        "top_n": top_n,
        "top_n_revenue": round(top_revenue, 2),
        "total_revenue": round(total_revenue, 2),
        "concentration_pct": round(concentration_pct, 1),
        "herfindahl_index": round(hhi, 0),
        "risk_level": (
            "High" if concentration_pct > 70
            else "Medium" if concentration_pct > 50
            else "Low"
        ),
    }


concentration = revenue_concentration_risk(df, top_n=10)
print(f"Top 10 customers = {concentration['concentration_pct']}% of revenue")
print(f"Risk level: {concentration['risk_level']}")
print(f"HHI: {concentration['herfindahl_index']:.0f}")

You do not need to explain the HHI formula to Sandra Chen. You just need to say: "Our top 10 customers represent 68% of our revenue. That concentration level means losing Customer 0001 alone would create a meaningful gap in our year-end number."
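To build intuition for the HHI scale without real data, a two-line helper makes the point (the share lists here are hypothetical):

```python
def hhi(shares):
    """Herfindahl-Hirschman Index from revenue shares that sum to 1."""
    return round(sum(s ** 2 for s in shares) * 10_000)


# Ten equal customers at 10% each: low concentration
even = hhi([0.10] * 10)              # 1000
# One customer at 70%, ten small ones at 3% each: high concentration
skewed = hhi([0.70] + [0.03] * 10)   # 4990
```

Evenly spread revenue lands well under the 2,500 "high concentration" line; a single dominant customer blows straight past it.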


28.7 Product Mix Analysis: Revenue vs. Margin

Revenue and margin tell different stories. A high-revenue product with thin margins might actually be less valuable to the business than a lower-revenue product with strong margins. The only way to know is to look at both simultaneously.

def product_mix_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Analyze both revenue and margin contribution by product.

    Products where revenue_pct far exceeds margin_mix_pct are dragging
    on overall profitability. Products where margin_mix_pct exceeds
    revenue_pct are margin champions worth promoting.

    Args:
        df: Sales DataFrame with gross_margin column (if available).

    Returns:
        DataFrame sorted by revenue descending with mix percentages.
    """
    agg_dict = {
        "revenue": ("revenue", "sum"),
        "quantity": ("quantity", "sum"),
        "orders": ("order_id", "nunique"),
    }
    if "gross_margin" in df.columns:
        agg_dict["gross_margin"] = ("gross_margin", "sum")

    product_df = (
        df.groupby("product")
        .agg(**agg_dict)
        .reset_index()
        .sort_values("revenue", ascending=False)
    )

    product_df["revenue_pct"] = (
        product_df["revenue"] / product_df["revenue"].sum() * 100
    ).round(1)
    product_df["cumulative_revenue_pct"] = (
        product_df["revenue_pct"].cumsum().round(1)
    )

    if "gross_margin" in product_df.columns:
        product_df["margin_pct"] = (
            product_df["gross_margin"] / product_df["revenue"] * 100
        ).round(1)
        product_df["margin_mix_pct"] = (
            product_df["gross_margin"] / product_df["gross_margin"].sum() * 100
        ).round(1)

    product_df["avg_unit_price"] = (
        product_df["revenue"] / product_df["quantity"]
    ).round(2)

    return product_df.reset_index(drop=True)


product_mix = product_mix_analysis(df)

# Show the revenue vs. margin mix comparison
display_cols = ["product", "revenue_pct", "margin_pct", "margin_mix_pct"]
print(product_mix[display_cols].to_string(index=False))

For Acme's product line, a typical finding (against the company's 51% overall margin):
  • Printer Paper (Case): 23% of revenue at a 44% margin, so about 20% of total gross margin (under-indexes)
  • Shredders: 8% of revenue at a 64% margin, so about 10% of total gross margin (punches above its weight)
  • Staplers: 7% of revenue at a 58% margin, so about 8% of total gross margin (a margin champion)

Products like Staplers that over-index on margin contribution relative to revenue are strong candidates for cross-sell and promotional investment. Products that under-index may need repricing.
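A quick way to surface those candidates programmatically is to diff the two mix columns. The products and percentages below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical revenue-mix vs margin-mix percentages for three products
mix = pd.DataFrame({
    "product": ["A", "B", "C"],
    "revenue_pct": [23.0, 8.0, 7.0],
    "margin_mix_pct": [20.0, 10.0, 8.0],
})

# Positive leverage: contributes more of the margin pool than of revenue
mix["margin_leverage"] = mix["margin_mix_pct"] - mix["revenue_pct"]
promote = mix.loc[mix["margin_leverage"] > 0, "product"].tolist()  # ["B", "C"]
```

Products with positive margin_leverage are your cross-sell candidates; large negative values flag repricing conversations.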


28.8 Sales Funnel Analysis and Conversion Rates

Not all sales data starts at the transaction level. If your organization tracks leads and opportunities in a CRM, you can analyze pipeline health with conversion rates at each funnel stage.

def conversion_rate(
    total_leads: int,
    qualified_leads: int,
    proposals_sent: int,
    closed_won: int,
) -> dict:
    """Calculate sales funnel conversion rates at each stage.

    Args:
        total_leads: Total leads entering the top of the funnel.
        qualified_leads: Leads that passed qualification criteria.
        proposals_sent: Formal proposals or quotes sent to prospects.
        closed_won: Deals successfully closed.

    Returns:
        Dictionary with stage-by-stage and overall conversion rates.
    """
    def safe_pct(numerator: int, denominator: int) -> float:
        return round(numerator / denominator * 100, 1) if denominator else 0.0

    return {
        "lead_to_qualified_pct": safe_pct(qualified_leads, total_leads),
        "qualified_to_proposal_pct": safe_pct(proposals_sent, qualified_leads),
        "proposal_to_close_pct": safe_pct(closed_won, proposals_sent),
        "overall_conversion_pct": safe_pct(closed_won, total_leads),
    }

For Acme Corp's Q4 2023, Sandra's CRM data shows:
  • 240 new leads entered the funnel
  • 144 passed qualification (60% lead-to-qualified)
  • 86 proposals sent (60% qualified-to-proposal)
  • 31 closed won (36% close rate)

funnel = conversion_rate(
    total_leads=240,
    qualified_leads=144,
    proposals_sent=86,
    closed_won=31,
)

for stage, rate in funnel.items():
    label = stage.replace("_", " ").replace("pct", "").title().strip()
    print(f"  {label:<35} {rate:>6.1f}%")

The overall 12.9% conversion rate sounds low but is normal for B2B office supply (typical range: 10–20%). More importantly, the proposal-to-close rate of 36% is actually strong — the bottleneck is lead qualification at the top of the funnel. That tells Sandra: focus energy on improving lead quality and qualification criteria, not on pushing reps to close harder at the bottom.

Visualizing the Funnel

stages = ["All Leads\n(240)", "Qualified\n(144)", "Proposals\n(86)", "Closed Won\n(31)"]
values = [240, 144, 86, 31]
colors = ["#4472C4", "#5B9BD5", "#70AD47", "#27AE60"]

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(stages[::-1], values[::-1], color=colors[::-1], height=0.5)

for bar, val in zip(bars, values[::-1]):
    ax.text(
        bar.get_width() + 3,
        bar.get_y() + bar.get_height() / 2,
        f"{val:,}",
        va="center",
        fontweight="bold",
    )

ax.set_title("Q4 2023 Sales Funnel — Acme Corp", fontweight="bold")
ax.set_xlabel("Number of Opportunities")
ax.set_xlim(0, 285)
plt.tight_layout()
plt.savefig("acme_q4_funnel.png", dpi=150)
plt.show()

28.9 Cohort Analysis: Understanding Your Customers Over Time

A cohort is a group of customers who share a common characteristic — most commonly, the month they first purchased from you. Cohort analysis answers a question that aggregate summary metrics cannot: are new customers actually coming back and buying more, or are we constantly replacing churned customers with new ones?

def build_customer_cohorts(df: pd.DataFrame) -> pd.DataFrame:
    """Assign each customer to their acquisition cohort (first purchase month).

    Args:
        df: Sales DataFrame with customer_id, date, and revenue columns.

    Returns:
        Original DataFrame with cohort_month column appended.
    """
    # Find each customer's earliest purchase date and convert to month period
    first_purchase = (
        df.groupby("customer_id")["date"]
        .min()
        .dt.to_period("M")
        .rename("cohort_month")
    )

    # first_purchase is indexed by customer_id, so merge against that index
    return df.merge(first_purchase, left_on="customer_id", right_index=True)


def cohort_revenue_table(df: pd.DataFrame) -> pd.DataFrame:
    """Build a cohort x period revenue pivot table.

    Rows = acquisition cohort (month of first purchase).
    Columns = periods since first purchase (0 = first month, 1 = second, ...).
    Values = total revenue from that cohort in that period.

    Args:
        df: Sales DataFrame with cohort_month and year_month columns.

    Returns:
        Pivot table showing revenue retention by cohort.
    """
    if "cohort_month" not in df.columns:
        df = build_customer_cohorts(df)

    df = df.copy()
    # How many months after their first purchase is each transaction?
    # Subtracting two monthly Periods gives a month offset; .n is its count.
    df["period_number"] = (df["year_month"] - df["cohort_month"]).apply(
        lambda offset: offset.n
    )

    return df.pivot_table(
        index="cohort_month",
        columns="period_number",
        values="revenue",
        aggfunc="sum",
    )

To visualize the cohort table as a heatmap:

import seaborn as sns

df_cohorts = build_customer_cohorts(df)
cohort_table = cohort_revenue_table(df_cohorts)

fig, ax = plt.subplots(figsize=(14, 8))
sns.heatmap(
    cohort_table.fillna(0),
    annot=True,
    fmt=",.0f",
    cmap="Blues",
    ax=ax,
    linewidths=0.5,
)
ax.set_title(
    "Cohort Revenue Analysis — Acme Corp 2023\n"
    "Revenue ($) by Acquisition Cohort and Months Since First Purchase",
    fontweight="bold",
)
ax.set_xlabel("Months Since First Purchase (0 = acquisition month)")
ax.set_ylabel("Acquisition Cohort")
plt.tight_layout()
plt.savefig("acme_cohort_heatmap.png", dpi=150)
plt.show()

A healthy cohort table shows meaningful revenue in period 0 (the acquisition month) and continued revenue in periods 1, 3, 6, and beyond. For Acme's B2B office supply customers, you should see strong repeat patterns — businesses reorder office supplies continuously. If cohort revenue drops sharply after period 2, it may mean a competitor is winning the reorder business.

The cohort analysis is one of the most powerful tools in your analytics kit precisely because it is invisible in aggregate metrics. Your total monthly revenue could be growing while cohort retention is actually declining — you would just be onboarding new customers faster than old ones churn. Cohort analysis catches that before it becomes a crisis.
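The standard way to expose that decline is to normalize each cohort row by its period-0 revenue, turning dollars into retention rates that are comparable across cohorts of different sizes. A sketch on a toy two-cohort table:

```python
import pandas as pd

# Toy cohort table: rows = acquisition cohorts, columns = months since
# first purchase, values = revenue (the Feb cohort has no period-2 yet)
table = pd.DataFrame(
    {0: [1000.0, 800.0], 1: [600.0, 560.0], 2: [450.0, float("nan")]},
    index=pd.PeriodIndex(["2023-01", "2023-02"], freq="M"),
)

# Divide each row by its period-0 value to get revenue retention rates
retention = table.div(table[0], axis=0).round(2)
```

Here the January cohort retains 60% of its acquisition-month revenue in month 1 and 45% in month 2, while February is tracking slightly better at 70% — the row-by-row comparison aggregate metrics can never give you.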


28.10 RFM Scoring: A Preview of Chapter 27

RFM stands for Recency, Frequency, and Monetary value. It is one of the most powerful and widely used customer segmentation frameworks in existence, and we cover it in full depth in Chapter 27. But because RFM is a natural companion to the sales analytics we are building here, let us introduce the core concept now.

The idea is elegantly simple: your best customers bought recently, buy often, and spend a lot. Your least valuable customers bought a long time ago, rarely, and for small amounts. RFM gives you a numerical score for each dimension that you can use to segment your entire customer base automatically.

from datetime import timedelta


def rfm_scoring(
    df: pd.DataFrame,
    analysis_date: str | None = None,
) -> pd.DataFrame:
    """Compute RFM scores for each customer on a 1-4 scale per dimension.

    Score meanings per dimension:
        Recency   4 = bought most recently    1 = bought longest ago
        Frequency 4 = orders most often       1 = orders least often
        Monetary  4 = highest total spend     1 = lowest total spend

    Combined RFM score ranges from 3 (lowest: 1+1+1) to 12 (highest: 4+4+4).

    Args:
        df: Sales DataFrame with customer_id, date, order_id, revenue.
        analysis_date: Reference date as 'YYYY-MM-DD'. Defaults to
                       one day after the latest transaction in df.

    Returns:
        DataFrame with one row per customer, RFM scores, and segment label.
    """
    if analysis_date is None:
        ref_date = df["date"].max() + timedelta(days=1)
    else:
        ref_date = pd.Timestamp(analysis_date)

    rfm = (
        df.groupby("customer_id")
        .agg(
            last_purchase=("date", "max"),
            frequency=("order_id", "nunique"),
            monetary=("revenue", "sum"),
        )
        .reset_index()
    )

    rfm["recency_days"] = (ref_date - rfm["last_purchase"]).dt.days

    # Recency: fewer days = more recent = higher score — so invert
    rfm["r_score"] = pd.qcut(
        rfm["recency_days"], q=4, labels=False, duplicates="drop"
    )
    rfm["r_score"] = 4 - rfm["r_score"]

    # Frequency and Monetary: higher value = higher score
    rfm["f_score"] = (
        pd.qcut(rfm["frequency"], q=4, labels=False, duplicates="drop") + 1
    )
    rfm["m_score"] = (
        pd.qcut(rfm["monetary"], q=4, labels=False, duplicates="drop") + 1
    )

    rfm["rfm_score"] = rfm["r_score"] + rfm["f_score"] + rfm["m_score"]

    def segment_label(score: int) -> str:
        """Map total RFM score to a human-readable segment."""
        if score >= 10:
            return "Champions"
        elif score >= 8:
            return "Loyal Customers"
        elif score >= 6:
            return "Potential Loyalists"
        elif score >= 4:
            return "At Risk"
        else:
            return "Lost"

    rfm["rfm_segment"] = rfm["rfm_score"].apply(segment_label)

    return (
        rfm.drop(columns=["last_purchase"])
        .sort_values("rfm_score", ascending=False)
        .reset_index(drop=True)
    )


rfm_results = rfm_scoring(df)

# Summarize the distribution of segments
segment_summary = (
    rfm_results.groupby("rfm_segment")
    .agg(
        customers=("customer_id", "count"),
        avg_revenue=("monetary", "mean"),
    )
    .round(0)
    .sort_values("avg_revenue", ascending=False)
)
print(segment_summary)

The segment labels map directly to different business strategies:

- Champions (score 10–12): Your best customers. Reward them. Ask for referrals.
- Loyal Customers (8–9): Strong customers who are not yet Champions. Nurture them.
- Potential Loyalists (6–7): Bought recently but infrequently. Onboarding and cross-sell focus.
- At Risk (4–5): Used to buy but have not recently. Win-back campaigns and personal outreach.
- Lost (3): Last purchase was long ago, infrequent, small value. May not be worth marketing spend.

We will build much more sophisticated RFM analysis in Chapter 27, including two-dimensional segment maps and automated email targeting lists. For now, the key insight is that not all customers are equal — and Python makes it straightforward to quantify and act on those differences.
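As a small taste of acting on those differences, here is a sketch of turning RFM output into a targeting list. The DataFrame below is a hypothetical stand-in for what rfm_scoring(df) returns; the customer IDs and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical RFM output: in practice this comes from rfm_scoring(df)
rfm_results = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004"],
    "rfm_score": [11, 8, 5, 3],
    "rfm_segment": ["Champions", "Loyal Customers", "At Risk", "Lost"],
    "monetary": [48_000.0, 22_500.0, 9_800.0, 1_200.0],
})

# Win-back target list: At Risk customers, highest historical spend first
at_risk = (
    rfm_results[rfm_results["rfm_segment"] == "At Risk"]
    .sort_values("monetary", ascending=False)
)
print(at_risk[["customer_id", "monetary"]].to_string(index=False))
```

Sorting by monetary value means outreach effort goes to the lapsed customers with the most revenue at stake first.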


28.11 Sales Velocity

Sales velocity answers the question: how fast is money flowing into your business? It is particularly useful for forecasting and for comparing the pace of different time periods.

def sales_velocity(df: pd.DataFrame, period_days: int = 30) -> dict:
    """Calculate sales velocity and revenue run-rate metrics.

    Args:
        df: Sales DataFrame.
        period_days: Days in the projection period (default 30 for monthly).

    Returns:
        Dictionary with daily rate, run-rate, and deal size metrics.
    """
    date_range = (df["date"].max() - df["date"].min()).days or 1
    total_revenue = df["revenue"].sum()
    total_orders = df["order_id"].nunique()
    order_revenue = df.groupby("order_id")["revenue"].sum()

    return {
        "avg_deal_size": round(order_revenue.mean(), 2),
        "median_deal_size": round(order_revenue.median(), 2),
        "daily_revenue_rate": round(total_revenue / date_range, 2),
        "monthly_revenue_run_rate": round(
            total_revenue / date_range * period_days, 2
        ),
        "orders_per_day": round(total_orders / date_range, 2),
    }


# Full year baseline
fy_velocity = sales_velocity(df)
print(f"FY2023 Average deal size:        ${fy_velocity['avg_deal_size']:,.2f}")
print(f"FY2023 Monthly revenue run rate: ${fy_velocity['monthly_revenue_run_rate']:,.2f}")

# Compare Q4 velocity to Q1
q1 = df[df["quarter"] == 1]
q4 = df[df["quarter"] == 4]

q1_velocity = sales_velocity(q1)
q4_velocity = sales_velocity(q4)

print(f"\nQ1 daily revenue rate: ${q1_velocity['daily_revenue_rate']:,.2f}")
print(f"Q4 daily revenue rate: ${q4_velocity['daily_revenue_rate']:,.2f}")
print(
    f"Q4 vs Q1 pace: "
    f"+{(q4_velocity['daily_revenue_rate'] / q1_velocity['daily_revenue_rate'] - 1) * 100:.1f}%"
)

If you have two weeks of January data at a daily rate of $9,200, the projected January total is $9,200 × 31 = $285,200. If January of last year was $241,000, that projects to 18% YoY growth — useful information for Sandra's board presentation.
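That back-of-the-envelope projection can be written out directly. The figures below are the hypothetical ones from the paragraph above, not Acme data:

```python
# Hypothetical mid-month projection using a run rate (figures are illustrative)
daily_rate = 9_200          # observed daily revenue over the first two weeks
days_in_month = 31
projected_january = daily_rate * days_in_month   # full-month run-rate estimate

last_january = 241_000      # same month, prior year
yoy_growth_pct = (projected_january / last_january - 1) * 100

print(f"Projected January revenue: ${projected_january:,}")
print(f"Implied YoY growth:        {yoy_growth_pct:.1f}%")
```

The obvious caveat: a run rate assumes the rest of the month looks like the part you have already seen, so seasonal businesses should project with more care.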


28.12 Building the Full Sales Dashboard

A dashboard is a visual summary that tells a story at a glance. The design goal: a four-panel figure that any business leader can interpret in 30 seconds without reading a single paragraph of text.

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker


def build_sales_dashboard(
    df: pd.DataFrame,
    title: str = "Sales Analytics Dashboard",
    save_path: str | None = None,
) -> plt.Figure:
    """Generate a 4-panel sales analytics dashboard.

    Panels:
        Top-left:     Monthly revenue bar chart with trend line
        Top-right:    Top 10 customers by revenue (horizontal bar)
        Bottom-left:  Revenue by region (pie chart)
        Bottom-right: Top 8 products by revenue (horizontal bar)

    Args:
        df: Sales DataFrame from load_sales_data().
        title: Dashboard title.
        save_path: Optional path to save PNG output.

    Returns:
        matplotlib Figure object (caller can call plt.show() on it).
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle(title, fontsize=16, fontweight="bold", y=1.01)
    plt.subplots_adjust(hspace=0.4, wspace=0.35)

    # --- Panel 1: Monthly Revenue ---
    monthly = (
        df.groupby("year_month")["revenue"]
        .sum()
        .reset_index()
        .sort_values("year_month")
    )
    x_pos = list(range(len(monthly)))
    month_labels = [str(p) for p in monthly["year_month"]]

    axes[0, 0].bar(x_pos, monthly["revenue"], color="#4472C4", alpha=0.85)
    axes[0, 0].plot(
        x_pos, monthly["revenue"], color="#FF6B35",
        marker="o", linewidth=2, markersize=4
    )
    axes[0, 0].set_title("Monthly Revenue", fontweight="bold")
    axes[0, 0].set_xticks(x_pos)
    axes[0, 0].set_xticklabels(month_labels, rotation=45, ha="right", fontsize=8)
    axes[0, 0].yaxis.set_major_formatter(
        mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
    )

    # --- Panel 2: Top 10 Customers ---
    name_col = "customer_name" if "customer_name" in df.columns else "customer_id"
    top_customers = (
        df.groupby(name_col)["revenue"].sum().nlargest(10).sort_values()
    )
    axes[0, 1].barh(
        top_customers.index, top_customers.values, color="#5B9BD5"
    )
    axes[0, 1].set_title("Top 10 Customers by Revenue", fontweight="bold")
    axes[0, 1].xaxis.set_major_formatter(
        mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
    )

    # --- Panel 3: Revenue by Region ---
    region_rev = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
    region_colors = ["#4472C4", "#ED7D31", "#A9D18E", "#FF6B6B", "#FFC000"]
    axes[1, 0].pie(
        region_rev.values,
        labels=region_rev.index,
        autopct="%1.1f%%",
        colors=region_colors[: len(region_rev)],
        startangle=90,
    )
    axes[1, 0].set_title("Revenue by Region", fontweight="bold")

    # --- Panel 4: Top 8 Products ---
    top_products = (
        df.groupby("product")["revenue"].sum().nlargest(8).sort_values()
    )
    axes[1, 1].barh(
        top_products.index, top_products.values, color="#70AD47"
    )
    axes[1, 1].set_title("Top 8 Products by Revenue", fontweight="bold")
    axes[1, 1].xaxis.set_major_formatter(
        mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
    )

    if save_path:
        fig.savefig(save_path, dpi=150, bbox_inches="tight")
        print(f"Dashboard saved to: {save_path}")

    return fig


fig = build_sales_dashboard(
    df,
    title="Acme Corp — 2023 Sales Analytics Dashboard",
    save_path="acme_sales_dashboard.png",
)
plt.show()

A few deliberate design choices are worth noting. Horizontal bars for customers and products are easier to read when labels are long text strings. The pie chart works here because there are only four regions — if you had ten or more regions, a bar chart would be clearer. Dollar-formatted tick labels make the unit immediately clear without requiring a note or label.

The FuncFormatter pattern (lambda x, _: f"${x:,.0f}") is one of those one-liners that makes a chart look professional. The underscore in the lambda argument signature is a Python convention for "I'm ignoring this parameter" — FuncFormatter passes both the value and its tick position, but you only need the value.
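Here is the pattern in isolation, a minimal sketch with made-up quarterly figures, so you can see the formatter working outside the dashboard:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a script
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

fig, ax = plt.subplots()
ax.bar(["Q1", "Q2", "Q3", "Q4"], [410_000, 465_000, 520_000, 610_000])

# FuncFormatter receives (value, tick_position); we only need the value
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x:,.0f}"))

fig.canvas.draw()  # force tick layout so the label text is populated
print([label.get_text() for label in ax.get_yticklabels()])
```

Every y-axis tick now renders as a dollar amount with thousands separators, with no manual label writing.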


28.13 The West Region Investigation

Here is how all of these tools come together to solve a real business problem. Sandra Chen comes to Priya's desk on a Monday in January 2024:

"Priya, West region closed 2023 at $360k against a $620k target. That is a $260k miss. I need to understand what happened before I can decide whether to restructure, reinvest, or pull back. Can you give me something concrete by Wednesday?"

Priya has acme_sales_2023.csv. She runs through the toolkit methodically:

# Step 1: Load and get the full picture
df = load_sales_data("acme_sales_2023.csv")
metrics = calculate_revenue_metrics(df)

# Step 2: Regional comparison — this is where the story begins
region_df = revenue_by_dimension(df, "region")
print(region_df.to_string(index=False))

# Step 3: Isolate West region and compare to the rest
west = df[df["region"] == "West"].copy()
other = df[df["region"] != "West"].copy()

west_metrics = calculate_revenue_metrics(west)
other_metrics = calculate_revenue_metrics(other)

print(f"\nWest avg order value:         ${west_metrics['average_order_value']:,.2f}")
print(f"Other regions avg order value: ${other_metrics['average_order_value']:,.2f}")

print(f"\nWest unique customers:         {west_metrics['unique_customers']}")
print(f"Other regions unique customers: {other_metrics['unique_customers']}")

# Step 4: How does West look month by month?
west_monthly = west.groupby("year_month")["revenue"].sum()
company_avg = df.groupby("year_month")["revenue"].sum() / 4  # per-region average

comparison = pd.DataFrame({
    "West_Revenue": west_monthly,
    "Per_Region_Average": company_avg,
}).fillna(0)
comparison["West_As_Pct_of_Avg"] = (
    comparison["West_Revenue"] / comparison["Per_Region_Average"] * 100
).round(1)
print("\nWest vs. company average per region:")
print(comparison.to_string())

# Step 5: Who is selling in West?
print("\nWest salesperson performance:")
print(revenue_by_dimension(west, "salesperson").to_string(index=False))

# Step 6: How many customers?
print(f"\nWest customer count: {west['customer_id'].nunique()}")
west_customers = revenue_by_dimension(west, "customer_name")
print(f"West top customer: {west_customers.iloc[0]['customer_name']} "
      f"at ${west_customers.iloc[0]['revenue']:,.2f}")

What Priya finds tells an unambiguous story:

  1. Average order value is similar: West ($2,180) vs. other regions ($2,200). Not a deal-size or pricing problem.
  2. Customer count is drastically lower: West has 15 active accounts. North has 42. East has 38. South has 41.
  3. One salesperson covers the entire territory: Dave Nguyen handles all 15 West accounts alone, while other regions have 2–3 reps.
  4. Dave's per-account performance is comparable: Divided by active customers, Dave's account revenue is actually in line with other reps.
  5. No seasonality issue: West's monthly pattern mirrors the rest of the company, Q4 surge included. Proportionally, it underperforms by roughly the same margin in all four quarters.

Priya's conclusion: This is a capacity problem, not a market problem. Dave Nguyen is not underperforming; his per-account revenue is in line with every other rep's. But one rep can only carry about 15 accounts, while the other regions each carry roughly 40 across two or three reps. The West territory is being served by one person when it needs three.

She writes a two-page memo with this data and a regional comparison chart. Sandra presents it to the Acme executive team the following week. The West region hiring plan — two new sales reps, targeted enterprise acquisition, Q1 2024 start — is approved in February.

By Q3 2024, West region tracks toward $520k, still below the $620k original target but trending correctly for the first time. The $260k miss became a strategic investment when Priya gave it the right frame.


28.14 Summary

You have built a complete sales analytics toolkit in this chapter. Here is what it can do:

| Function | Question It Answers |
| --- | --- |
| calculate_revenue_metrics() | What is our overall top-line performance? |
| revenue_by_dimension() | Which region / product / rep is driving the most? |
| monthly_revenue_trend() | Are we growing month over month? |
| revenue_growth_rate() | What is our quarterly growth trajectory? |
| pareto_analysis() | Which 20% of customers drives 80% of revenue? |
| revenue_concentration_risk() | How dependent are we on a few large customers? |
| product_mix_analysis() | Which products win on revenue vs. on margin? |
| conversion_rate() | How efficient is our sales funnel at each stage? |
| cohort_revenue_table() | Are new customers actually coming back? |
| rfm_scoring() | Which customers need attention right now? |
| sales_velocity() | How fast is revenue flowing in? |
| build_sales_dashboard() | How do I show all of this visually at once? |

The most important skill in this chapter is not any individual function. It is the habit of combining them systematically. Revenue by dimension tells you where to look. Pareto tells you who matters most. Cohort analysis tells you whether customer relationships are getting stronger or weaker over time. RFM tells you which customers need attention right now. Sales velocity tells you whether the current period is tracking ahead of or behind plan.

Sales data is not just numbers on a spreadsheet. It is the story of every customer relationship your business has ever had. Python gives you the tools to read that story clearly — and to act on what you find before small problems become large ones.


Chapter Review Questions

  1. What is the difference between average order value and revenue per customer? When would each metric be more useful to a sales leader?
  2. Why must you group by order_id before computing average order value, rather than calling df["revenue"].mean() directly?
  3. Explain the difference between month-over-month and year-over-year growth. When is each more meaningful?
  4. What does a Herfindahl-Hirschman Index above 2,500 indicate about a company's customer revenue distribution?
  5. A product generates 20% of your revenue but only 10% of your gross margin. What does that tell you? What actions might follow?
  6. In a cohort analysis, what does it mean if period-0 revenue is strong but period-3 and period-6 revenue are near zero for most cohorts?
  7. Name the three components of RFM scoring. What does a customer with a total score of 12 look like behaviorally? What about a customer with a score of 3?
  8. Priya concluded that Acme's West region had a capacity problem rather than a market problem. List three specific data points that supported that conclusion.

Key Terms

Revenue concentration risk: The degree to which a business depends on a small number of large customers. Commonly measured as the percentage of total revenue coming from the top N customers.

Pareto analysis: An analytical technique for identifying the 20% of inputs (customers, products) that drive 80% of outputs (revenue, margin).

Cohort analysis: The practice of grouping customers by when they first purchased, then tracking their behavior over subsequent periods to measure retention and repeat purchasing.

RFM scoring: A three-dimensional customer segmentation framework that scores each customer on Recency (how recently they bought), Frequency (how often they buy), and Monetary value (how much they spend).

Sales velocity: The rate at which revenue is generated, expressed as daily or monthly run rate. Useful for short-term forecasting and period-over-period pace comparisons.

Product mix: The relative contribution of each product to total revenue and total gross margin. These two metrics frequently tell different stories about which products deserve strategic attention.

Month-over-month (MoM) growth: Percentage change in a metric from one calendar month to the previous month. Tracks recent momentum but is sensitive to seasonal patterns.

Year-over-year (YoY) growth: Percentage change in a metric compared to the same period in the prior year. Removes seasonal effects and provides a more reliable signal of underlying business trend.

Average order value (AOV): Total revenue divided by number of unique orders. Must be computed at the order level, not the line-item level, to be meaningful.

Gross margin: Revenue minus cost of goods sold. Represents the actual profit generated before operating expenses. Products with high revenue but low gross margin may be less valuable than lower-revenue, high-margin products.
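The order-level caveat in the AOV definition above is easy to demonstrate. Here is a minimal sketch with toy data (the order IDs and amounts are invented):

```python
import pandas as pd

# Toy line-item data: order A001 has two line items, A002 has one
lines = pd.DataFrame({
    "order_id": ["A001", "A001", "A002"],
    "revenue":  [100.0, 300.0, 200.0],
})

line_item_mean = lines["revenue"].mean()                   # averages rows, not orders
order_totals = lines.groupby("order_id")["revenue"].sum()  # A001 -> 400, A002 -> 200
aov = order_totals.mean()                                  # the true average order value

print(f"Line-item mean: ${line_item_mean:.2f}")  # misleadingly low
print(f"True AOV:       ${aov:.2f}")
```

Because multi-line orders contribute several rows, the naive line-item mean is always at or below the true AOV, and the gap grows as orders get more complex.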