Chapter 25 Exercises: Descriptive Statistics for Business Decisions

Complete these exercises in order. Each tier builds on the concepts in the previous one. You will need pandas, numpy, matplotlib, and scipy installed.


Tier 1 — Foundation (Conceptual + Simple Code)

These exercises test whether you understand the core concepts before applying them in code.

Exercise 1.1 — Mean vs. Median: Which Should You Use?

For each scenario below, state whether you would use the mean or the median to represent "typical" — and briefly explain why.

a) You want to understand the typical salary of employees at a startup with 50 employees, including a CEO who earns $2,000,000 and 49 employees who earn $65,000–$95,000.

b) You want to calculate the total revenue projection for next quarter based on individual sales rep estimates.

c) You want to understand the typical wait time for customer support tickets. Most tickets resolve in 2–4 hours, but 5% of tickets involve complex technical issues and take 2–5 days.

d) You want to know the most common city your customers ship to (for warehouse location planning).

e) You want to calculate the average manufacturing defect rate across 12 production lines, which are all roughly similar in scale and process.
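A quick sketch for scenario (a), using made-up salary numbers in the spirit of the scenario, shows how a single outlier pulls the mean away from the median:

```python
import numpy as np

# Hypothetical salaries: 49 employees between $65k and $95k, plus a $2M CEO
rng = np.random.default_rng(0)
salaries = np.append(rng.uniform(65_000, 95_000, 49), 2_000_000)

mean = salaries.mean()
median = np.median(salaries)
# The CEO alone adds roughly $38k to the mean; the median barely moves
print(f"mean:   ${mean:,.0f}")
print(f"median: ${median:,.0f}")
```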


Exercise 1.2 — Reading describe() Output

Given the following describe() output for a dataset of customer purchase values, answer the questions below:

count     842.000
mean      347.82
std       512.34
min        12.50
25%        89.75
50%       198.40
75%       412.00
max      8750.00

a) How many customers are in the dataset?

b) Is this distribution likely to be right-skewed, left-skewed, or roughly symmetric? How do you know?

c) What is the IQR? What does this tell you about the middle 50% of customers?

d) A new customer places an order for $3,500. Is this unusual? Use the statistics to justify your answer.

e) Your manager wants to know the "average customer order value" to put in the quarterly presentation. Should you report the mean or the median? Write one sentence explaining your recommendation.
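For part (d), the IQR fences can be computed directly from the quartiles in the summary above; a minimal sketch:

```python
# Quartiles taken from the describe() output above
q1, q3 = 89.75, 412.00
iqr = q3 - q1                      # 322.25

# Standard IQR outlier fences
lower_fence = q1 - 1.5 * iqr       # -393.625 (no lower outliers possible here)
upper_fence = q3 + 1.5 * iqr       # 895.375

order = 3_500
print(f"IQR = {iqr}, upper fence = {upper_fence}")
print(f"${order} is an outlier: {order > upper_fence}")
```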


Exercise 1.3 — Standard Deviation as a Risk Signal

You are comparing two potential suppliers for a component your company uses in manufacturing.

Supplier     Mean Lead Time (days)    Std Dev (days)
Supplier A   8.2                      0.9
Supplier B   7.8                      4.3

a) Which supplier is faster on average?

b) Which supplier is more reliable (consistent)?

c) Your production line needs components within 12 days or you face a line shutdown. Using the empirical (68-95-99.7) rule, under which roughly 95% of values fall within 2 standard deviations of the mean, estimate the probability of each supplier exceeding 12 days.

d) Which supplier would you choose, and why? There is no single right answer — justify your reasoning.
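If you want to go beyond the rough 2-standard-deviation estimate, scipy can give the exact normal-tail probability (this assumes lead times are roughly normal, which is itself a simplification):

```python
from scipy import stats

def p_exceeds(mean, std, threshold):
    """Probability that a normal variable exceeds the threshold."""
    z = (threshold - mean) / std
    return stats.norm.sf(z)  # survival function = 1 - CDF

# Supplier A: 12 days is over 4 standard deviations out -- essentially never
# Supplier B: 12 days is under 1 standard deviation out -- a real risk
print(f"Supplier A: {p_exceeds(8.2, 0.9, 12):.4%}")
print(f"Supplier B: {p_exceeds(7.8, 4.3, 12):.1%}")
```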


Exercise 1.4 — Percentile Thinking

Answer these business questions using percentile thinking:

a) You want to set a "premium customer" threshold for your loyalty program — defined as customers who spend more than 80% of all other customers annually. You look at your data and find the 80th percentile of annual spend is $4,200. If a customer spent $3,800 last year, do they qualify?

b) Your SLA promises to resolve 90% of support tickets within 4 hours. You measure that the 90th percentile resolution time is 6.5 hours. Are you meeting your SLA?

c) A new sales rep in her second quarter closed deals totaling $87,000. The 50th percentile for second-quarter performance among all reps is $72,000, and the 75th percentile is $95,000. How would you characterize her performance?


Exercise 1.5 — Correlation Intuition

For each pair of variables, predict whether the correlation would be positive, negative, or approximately zero — and explain your reasoning.

a) Marketing spend and new leads generated (same month)

b) Employee satisfaction and customer satisfaction (across company locations)

c) Price of a product and quantity sold (assuming a typical demand curve)

d) Height of a building (floors) and number of people who work in it

e) Day of the week (Monday=1, Sunday=7) and number of support tickets opened (for a B2B software company whose customers work weekdays)


Tier 2 — Application (Writing Code)

Exercise 2.1 — Basic Stats from Scratch

Create a Python list of the following monthly revenue figures (in dollars) for a retail store:

82400, 91200, 78600, 95300, 103400, 88700, 79200, 94100,
112300, 88900, 91800, 156700

Without using pandas, calculate:

a) The mean

b) The median (sort the list first)

c) The range

d) Identify any value that seems like an outlier and explain why

Then repeat the same calculations using pd.Series and its built-in methods. Verify your manual calculations match.
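If computing a median by hand is new to you, here is the general pattern on a small toy list (not the revenue data above; that part is yours to do):

```python
def mean(values):
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd count: middle element
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the two middle

toy = [3, 9, 1, 7]
print(mean(toy))    # 5.0
print(median(toy))  # sorted is [1, 3, 7, 9] -> (3 + 7) / 2 = 5.0
```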


Exercise 2.2 — describe() and IQR

Generate the following synthetic dataset:

import numpy as np
import pandas as pd

np.random.seed(99)
customer_data = pd.DataFrame({
    "customer_id": range(1, 201),
    "annual_spend": np.random.lognormal(7.5, 0.7, 200),
    "orders_per_year": np.random.randint(1, 25, 200),
    "avg_order_value": np.random.lognormal(5.8, 0.5, 200),
    "support_tickets": np.random.poisson(3.2, 200),
    "tenure_months": np.random.uniform(1, 60, 200).round(0),
})

a) Run describe() on the entire DataFrame and print the output.

b) Calculate the IQR for annual_spend manually (Q3 - Q1 using .quantile()).

c) Find all customers whose annual_spend is an outlier by the IQR method (below Q1 - 1.5×IQR or above Q3 + 1.5×IQR).

d) How many outliers are there? What percentage of customers are they?

e) Print the top 10 customers by annual spend along with all their other metrics.
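The IQR filter in part (c) follows a standard boolean-mask pattern; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 14, 12, 13])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Boolean mask: True where the value falls outside the fences
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(s[is_outlier])            # the 95 should be flagged
print(f"{is_outlier.mean():.0%} of values are outliers")
```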


Exercise 2.3 — Correlation Matrix

Using the customer_data DataFrame from Exercise 2.2:

a) Calculate the full correlation matrix using .corr().

b) Find all pairs of variables with a correlation above 0.4 (positive or negative). Print them along with their correlation value.

c) Which variable correlates most strongly with annual_spend?

d) Add a new column revenue_per_ticket = annual_spend / (support_tickets + 1) and recalculate. Does this new variable have a stronger or weaker correlation with annual_spend than the original variables?
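For part (b), one way to list variable pairs above a threshold is to keep only the upper triangle of the correlation matrix and stack it; a sketch on a toy frame (column names invented here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": x * 2 + rng.normal(size=200) * 0.5,  # strongly tied to x
    "z": rng.normal(size=200),                # unrelated noise
})

corr = df.corr()
# Keep only the upper triangle (k=1 drops the diagonal of 1.0s)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                         # MultiIndex Series of pairs
strong = pairs[pairs.abs() > 0.4]
print(strong)
```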


Exercise 2.4 — Visualizing Distributions

Using the customer_data DataFrame from Exercise 2.2:

a) Create a histogram of annual_spend with 20 bins. Add vertical lines for the mean and median. Label them.

b) Create a box plot comparing annual_spend across four tenure groups: 0-15 months, 16-30 months, 31-45 months, 46-60 months. (Use pd.cut() to create the groups.)

c) Describe in one sentence what each chart tells you that the other does not.
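If pd.cut() is new to you, here is its general shape on a toy tenure column, with bin edges matching the groups in part (b):

```python
import pandas as pd

tenure = pd.Series([2, 14, 16, 29, 33, 45, 47, 60])
# Default behavior: bins are right-inclusive, i.e. (0, 15], (15, 30], ...
groups = pd.cut(
    tenure,
    bins=[0, 15, 30, 45, 60],
    labels=["0-15", "16-30", "31-45", "46-60"],
)
print(groups.value_counts().sort_index())
```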


Tier 3 — Analysis (Combining Techniques)

Exercise 3.1 — Product Profitability Analysis

Generate a dataset of product sales:

import numpy as np
import pandas as pd

np.random.seed(77)
n = 300
products = pd.DataFrame({
    "order_id": range(1, n+1),
    "product_category": np.random.choice(
        ["Electronics", "Apparel", "Home Goods", "Books"], n,
        p=[0.25, 0.35, 0.25, 0.15]
    ),
    "region": np.random.choice(["North", "South", "East", "West"], n),
    "units_sold": np.random.randint(1, 50, n),
    "unit_price": np.random.lognormal(3.5, 0.6, n).round(2),
    "unit_cost": None,  # to be calculated
})
# Cost is 40-70% of price depending on category
cost_ratios = {"Electronics": 0.65, "Apparel": 0.50,
               "Home Goods": 0.55, "Books": 0.42}
products["unit_cost"] = products.apply(
    lambda r: r["unit_price"] * cost_ratios[r["product_category"]] *
              np.random.uniform(0.85, 1.15), axis=1
).round(2)
products["revenue"] = (products["unit_price"] * products["units_sold"]).round(2)
products["profit"] = ((products["unit_price"] - products["unit_cost"]) *
                       products["units_sold"]).round(2)
products["margin_pct"] = (products["profit"] / products["revenue"] * 100).round(1)

Using this dataset:

a) Calculate mean, median, std, and IQR of margin_pct for each product_category.

b) Which category has the highest median margin? Which has the highest mean? Are they the same category?

c) Which category has the most consistent margins (lowest coefficient of variation)?

d) Create a box plot comparing margin distributions across the four categories.

e) Write 3 business recommendations based on your findings.
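Parts (a) and (c) both come down to groupby aggregation; a sketch of the pattern on a toy frame, with a coefficient-of-variation helper (assumed here to mean std divided by mean):

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "margin_pct": [30.0, 32.0, 31.0, 10.0, 40.0, 25.0],
})

def cv(s):
    """Coefficient of variation: std relative to the mean."""
    return s.std() / s.mean()

# Strings name built-in aggregations; the callable adds a custom one
summary = toy.groupby("category")["margin_pct"].agg(
    ["mean", "median", "std", cv]
)
print(summary)
# Category A's margins are far more consistent (much lower cv)
```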


Exercise 3.2 — Identifying Simpson's Paradox

Construct a scenario that demonstrates Simpson's Paradox:

import pandas as pd

# Two sales reps: Alex and Jordan
# Alex handles mostly small accounts; Jordan handles mostly large accounts
data = pd.DataFrame({
    "rep": (["Alex"] * 120 + ["Jordan"] * 30),
    "account_size": (["small"] * 100 + ["large"] * 20 +
                     ["small"] * 5 + ["large"] * 25),
    "closed": ([1]*45 + [0]*55 +  # Alex small: 45%
               [1]*14 + [0]*6 +   # Alex large: 70%
               [1]*2 + [0]*3 +    # Jordan small: 40%
               [1]*18 + [0]*7),   # Jordan large: 72%
})

a) Calculate each rep's overall close rate (closed deals / total deals). Who appears better overall?

b) Calculate each rep's close rate separately for small accounts and large accounts. Who is better at small accounts? Who is better at large accounts?

c) Explain in plain English why the overall numbers are misleading. What is causing the paradox?

d) What management mistake might someone make if they only looked at the overall close rates without segmenting by account size?
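The calculations in (a) and (b) are each a one-line groupby; a sketch on a tiny stand-alone example (hypothetical numbers, not the data above):

```python
import pandas as pd

deals = pd.DataFrame({
    "rep":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "size":   ["s", "s", "l", "l", "s", "s", "l", "l"],
    "closed": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Overall close rate per rep: the mean of a 0/1 column is a proportion
print(deals.groupby("rep")["closed"].mean())

# Segmented close rate: group by rep AND account size
print(deals.groupby(["rep", "size"])["closed"].mean().unstack())
```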


Exercise 3.3 — Survivorship Bias Scenario

You are analyzing why certain product launches at your company succeeded. You pull a dataset of current active products and find that 85% of them had a dedicated product champion during launch.

a) Why might this finding be misleading?

b) What additional data would you need to draw a valid conclusion about whether product champions drive success?

c) Design a simple study structure (not a formal experiment — just a description of what data to collect and how to analyze it) that would give you a more reliable answer.


Tier 4 — Business Integration

Exercise 4.1 — Full Sales Team Analysis

Generate the following sales team dataset and perform a complete statistical analysis:

import numpy as np
import pandas as pd

np.random.seed(55)
n_reps = 75
reps = pd.DataFrame({
    "rep_id": [f"REP-{i:03d}" for i in range(1, n_reps+1)],
    "region": np.random.choice(["North", "South", "East", "West", "Central"], n_reps),
    "years_exp": np.random.uniform(0.5, 15, n_reps).round(1),
    "annual_revenue": np.random.lognormal(11.5, 0.6, n_reps).round(0),
    "quota": np.random.lognormal(11.4, 0.3, n_reps).round(0),
    "deals_closed": np.random.randint(3, 40, n_reps),
    "close_rate": np.random.normal(38, 13, n_reps).clip(5, 80).round(1),
})
reps["quota_attainment"] = (reps["annual_revenue"] / reps["quota"] * 100).round(1)
reps["avg_deal_size"] = (reps["annual_revenue"] / reps["deals_closed"]).round(0)

Your analysis should include:

a) Full describe() output and written interpretation of the mean/median gap in annual_revenue.

b) Percentile breakdown: at what revenue level are the 25th, 50th, 75th, and 90th percentile reps?

c) What percentage of reps are above quota (quota attainment > 100%)?

d) Correlation matrix for all numeric columns. What is the strongest predictor of annual_revenue?

e) Box plots comparing annual_revenue by region.

f) IQR-based outlier detection. How many reps are outliers? Are they overperformers, underperformers, or both?

g) A three-paragraph written summary suitable for a VP of Sales audience (no code in this part — just findings and recommendations).


Exercise 4.2 — Customer Segmentation by Statistics

You manage a subscription software product. Using the customer data below, segment customers into "Champion," "At Risk," and "Needs Attention" tiers based on their statistics.

import numpy as np
import pandas as pd

np.random.seed(33)
n = 500
customers = pd.DataFrame({
    "customer_id": [f"C-{i:04d}" for i in range(1, n+1)],
    "monthly_spend": np.random.lognormal(7.8, 0.65, n).round(2),
    "logins_per_month": np.random.poisson(18, n),
    "features_used": np.random.randint(1, 15, n),
    "support_tickets_ytd": np.random.poisson(2.5, n),
    "months_active": np.random.randint(1, 72, n),
    "nps_score": np.random.normal(6.5, 2.8, n).clip(0, 10).round(1),
})

a) Calculate descriptive statistics for each numeric column.

b) Define "Champion" as: monthly_spend > 75th percentile AND logins_per_month > median AND nps_score > 7. How many Champions are there?

c) Define "At Risk" as: logins_per_month < 25th percentile AND nps_score < 5. How many At-Risk customers are there?

d) Everyone else is "Needs Attention." Verify your three segments add up to the total customer count.

e) Compare the mean monthly_spend, logins_per_month, and months_active across the three segments. What story do the numbers tell?
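One way to assign tiers without double-counting is np.select, which applies conditions in priority order; a sketch using the exercise's tier names and thresholds on invented toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "monthly_spend":    [500, 5000, 80, 900, 4500],
    "logins_per_month": [20, 30, 2, 15, 28],
    "nps_score":        [6, 9, 3, 7, 8],
})

champion = (
    (toy["monthly_spend"] > toy["monthly_spend"].quantile(0.75))
    & (toy["logins_per_month"] > toy["logins_per_month"].median())
    & (toy["nps_score"] > 7)
)
at_risk = (
    (toy["logins_per_month"] < toy["logins_per_month"].quantile(0.25))
    & (toy["nps_score"] < 5)
)
# Conditions are checked in order; anything unmatched gets the default
toy["segment"] = np.select([champion, at_risk], ["Champion", "At Risk"],
                           default="Needs Attention")
print(toy["segment"].value_counts())
```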


Tier 5 — Challenge

Exercise 5.1 — Build a Statistical Report Function

Write a function business_stats_report(df, numeric_col, group_col=None) that:

  1. Calculates mean, median, std, IQR, min, max, and coefficient of variation for numeric_col
  2. Detects and reports outliers using both IQR and z-score methods
  3. If group_col is provided, repeats the analysis for each group and produces a formatted comparison table
  4. Prints the results in a clean, business-readable format (no raw DataFrame dumps — formatted strings with labels and currency or percentage formatting as appropriate)
  5. Returns a dictionary with all calculated values

Test your function on:

- The annual_revenue column of the sales data from Exercise 4.1
- The monthly_spend column of the customer data from Exercise 4.2, grouped by a quartile of months_active


Exercise 5.2 — Visualizing Skewness Across Business Metrics

Create a 2×3 figure showing six different business metric distributions side by side. For each subplot:

- Draw a histogram with 20 bins
- Add a vertical mean line and a vertical median line
- Label both lines in the legend
- Add a text annotation in the top-right corner showing the skewness value
- Title the subplot with the metric name and a brief note on what the skewness implies

Use the following dataset:

import numpy as np
import pandas as pd

np.random.seed(11)
n = 1000
business_metrics = pd.DataFrame({
    "support_ticket_hours": np.random.lognormal(1.5, 0.8, n),   # right-skewed
    "defect_rate_pct": np.random.beta(2, 8, n) * 100,            # right-skewed
    "satisfaction_score": np.random.beta(7, 3, n) * 10,          # left-skewed
    "daily_sales_count": np.random.normal(142, 18, n),            # symmetric
    "deal_size_usd": np.random.lognormal(9.5, 1.1, n),            # right-skewed
    "response_time_mins": np.random.gamma(2, 12, n),              # right-skewed
})

After creating the figure, write a three-sentence interpretation for each metric explaining what the shape of the distribution would mean for a business manager making decisions based on it.
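pandas computes the skewness value for the annotation directly; a quick sketch confirming the sign convention (positive means right-skewed, i.e. a long right tail):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
right_skewed = pd.Series(rng.lognormal(1.5, 0.8, 1000))  # long right tail
symmetric = pd.Series(rng.normal(142, 18, 1000))

print(f"lognormal skew: {right_skewed.skew():.2f}")   # clearly positive
print(f"normal skew:    {symmetric.skew():.2f}")      # near zero
```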


Exercise 5.3 — The Simpson's Paradox Detector

Write a function check_simpsons_paradox(df, target_col, group_col, segment_col) that:

  1. Calculates the mean of target_col for each value of group_col (the overall comparison)
  2. For each segment in segment_col, calculates the mean of target_col for each value of group_col (the segmented comparison)
  3. Detects whether any group that looks better overall looks worse in at least one segment, or vice versa
  4. Prints a clear warning if potential Simpson's Paradox is detected, along with the evidence

Test it on the data from Exercise 3.2 and on a dataset you construct yourself where the paradox does not appear.


All exercises were designed for Python 3.10+ with pandas, numpy, matplotlib, and scipy. Sample solutions are available in the instructor supplement.