Chapter 13 Exercises: Transforming and Aggregating Business Data

These exercises are organized into five tiers of increasing difficulty. Complete them in order for the best learning progression. Exercises marked with a star (*) are recommended as minimum completion for the chapter.


Setup: Shared Data

Use the following DataFrames for all exercises unless an exercise specifies its own data:

import pandas as pd
import numpy as np

sales = pd.DataFrame({
    "order_id": range(1001, 1021),
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-01-12", "2024-01-18", "2024-01-25",
        "2024-02-03", "2024-02-09", "2024-02-14", "2024-02-21",
        "2024-03-02", "2024-03-08", "2024-03-15", "2024-03-22",
        "2024-04-04", "2024-04-11", "2024-04-18", "2024-04-25",
        "2024-05-03", "2024-05-10", "2024-05-17", "2024-05-24",
    ]),
    "customer_id": [
        "C01","C02","C01","C03","C02","C04","C01","C03",
        "C05","C02","C04","C01","C03","C05","C02","C04",
        "C01","C03","C05","C02",
    ],
    "region": [
        "North","South","North","West","South","East","North","West",
        "North","South","East","North","West","North","South","East",
        "North","West","North","South",
    ],
    "product": [
        "Software","Hardware","Software","Services","Hardware","Software",
        "Services","Software","Hardware","Services","Software","Hardware",
        "Services","Software","Hardware","Services","Software","Services",
        "Hardware","Software",
    ],
    "revenue": [
        4500, 2300, 6700, 3800, 1900, 5100,
        7200, 4200, 3300, 2700, 4400, 8100,
        5600, 3900, 3100, 6200, 7800, 4900,
        2100, 5500,
    ],
    "cost": [
        900,  1150, 1340, 760,  950,  1020,
        1440, 840,  1320, 810,  880,  1620,
        1120, 780,  1550, 1240, 1560, 980,
        1050, 1100,
    ],
    "units": [3, 5, 4, 1, 8, 2, 3, 2, 6, 1, 2, 5, 1, 3, 7, 4, 5, 2, 9, 3],
})

customers = pd.DataFrame({
    "customer_id": ["C01","C02","C03","C04","C05"],
    "name": ["Goldfinch Inc","Redwood LLC","Birchwood Co","Magnolia Ltd","Cypress Group"],
    "tier": ["Gold","Silver","Bronze","Gold","Silver"],
    "region": ["North","South","West","East","North"],
    "annual_value": [120000, 48000, 24000, 95000, 62000],
})

sales["margin"] = sales["revenue"] - sales["cost"]
sales["margin_pct"] = (sales["margin"] / sales["revenue"] * 100).round(1)

Tier 1: Foundation (Recall and Recognition)

Exercise 1-1 * — Basic GroupBy Sum

Group the sales DataFrame by region and calculate the total revenue for each region. Print the result sorted from highest to lowest revenue.

Expected output columns: region, total_revenue
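
If the group-sum-sort pattern is unfamiliar, here is a minimal sketch on a small toy DataFrame (not the chapter's sales data):

```python
import pandas as pd

# Toy data to illustrate the pattern only.
df = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "revenue": [100, 200, 300, 50],
})

# Group by region, sum revenue, and sort from highest to lowest.
totals = (
    df.groupby("region", as_index=False)["revenue"]
      .sum()
      .rename(columns={"revenue": "total_revenue"})
      .sort_values("total_revenue", ascending=False)
)
print(totals)
```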


Exercise 1-2 * — Multiple Aggregations

Using .groupby() and .agg() with named aggregations, create a summary of sales by product with these columns:

  • total_revenue — sum of revenue
  • total_orders — count of orders
  • avg_order_value — mean revenue
  • avg_margin_pct — mean margin_pct

Print the result sorted by total_revenue descending.
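
As a reminder of the named-aggregation syntax, a sketch on toy data (column names here mirror the exercise, the data does not):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "A", "B"],
    "revenue": [100, 300, 200],
    "margin_pct": [20.0, 30.0, 25.0],
})

# Named aggregations: new_column=(source_column, aggregation_function)
summary = (
    df.groupby("product")
      .agg(
          total_revenue=("revenue", "sum"),
          total_orders=("revenue", "count"),
          avg_order_value=("revenue", "mean"),
          avg_margin_pct=("margin_pct", "mean"),
      )
      .sort_values("total_revenue", ascending=False)
      .reset_index()
)
```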


Exercise 1-3 * — Simple Merge

Merge the sales DataFrame with the customers DataFrame on customer_id using a left join. Call the result sales_enriched. Print the number of rows before and after the merge to confirm no rows were lost.
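
The row-count check can follow this shape (toy frames, not the chapter's data):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["C01", "C02", "C01"], "revenue": [100, 200, 300]})
right = pd.DataFrame({"customer_id": ["C01", "C02"], "tier": ["Gold", "Silver"]})

before = len(left)
merged = left.merge(right, on="customer_id", how="left")
after = len(merged)
# A left join on a key that is unique in the right table keeps the row count.
print(before, after)
```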


Exercise 1-4 — .apply() with Lambda

Add a column size_category to sales using .apply() with a lambda:

  • Revenue >= 6000: "Large"
  • Revenue 3000–5999: "Medium"
  • Revenue < 3000: "Small"

Print the value counts of size_category.
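
A three-way bucket inside a single lambda can be written with a nested conditional expression; a sketch on toy revenues:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [7000, 4500, 1200]})

# Nested conditional expression: checks the largest threshold first.
df["size_category"] = df["revenue"].apply(
    lambda r: "Large" if r >= 6000 else ("Medium" if r >= 3000 else "Small")
)
```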


Exercise 1-5 — .map() for Lookup

Create a dictionary that maps each region to a "zone": North and East are "Zone A", South and West are "Zone B". Add a zone column to sales using .map().
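
The dictionary-lookup pattern with .map() looks like this in miniature:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West"]})

# Keys not present in the dict would map to NaN, so cover every region.
zone_map = {"North": "Zone A", "East": "Zone A", "South": "Zone B", "West": "Zone B"}
df["zone"] = df["region"].map(zone_map)
```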


Tier 2: Application (Use in Context)

Exercise 2-1 * — Multi-Column GroupBy

Group sales by both region AND product, and compute:

  • total_revenue (sum)
  • deal_count (count)
  • avg_margin_pct (mean)

Sort by region, then total_revenue descending. Print the full result.
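
Grouping on two keys only changes the first argument to .groupby(); note the mixed sort directions, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product": ["Software", "Hardware", "Software"],
    "revenue": [100, 300, 200],
})

out = (
    df.groupby(["region", "product"])
      .agg(total_revenue=("revenue", "sum"), deal_count=("revenue", "count"))
      .reset_index()
      # ascending=[True, False]: region A-Z, then revenue high-to-low within it.
      .sort_values(["region", "total_revenue"], ascending=[True, False])
)
```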


Exercise 2-2 * — Pivot Table

Using pd.pivot_table(), create a table showing total revenue with:

  • Rows: region
  • Columns: product
  • Values: revenue (sum)
  • Fill missing combinations with 0
  • Include row and column totals (use margins=True)
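
The pivot_table arguments map one-to-one onto the requirements above; a toy-data sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product": ["Software", "Hardware", "Software"],
    "revenue": [100, 300, 200],
})

pivot = pd.pivot_table(
    df,
    values="revenue",       # what to aggregate
    index="region",         # rows
    columns="product",      # columns
    aggfunc="sum",
    fill_value=0,           # missing region-product combinations become 0
    margins=True,           # adds row/column totals
    margins_name="Total",
)
```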


Exercise 2-3 — Wide to Long with .melt()

Create a summary DataFrame with one row per region showing total revenue by month (Jan, Feb, Mar, Apr, May). Then use .melt() to reshape it from wide format (months as columns) to long format (one row per region-month combination).

Hint: Use .groupby() on region and order_date.dt.month_name(), then use .pivot() or pd.pivot_table() to get the wide format before melting.
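
The melt step itself is short; here is its shape on a toy wide table (two months only):

```python
import pandas as pd

wide = pd.DataFrame({
    "region": ["North", "South"],
    "Jan": [100, 50],
    "Feb": [200, 75],
})

# id_vars stay fixed; every other column becomes a (month, revenue) pair.
long = wide.melt(id_vars="region", var_name="month", value_name="revenue")
```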


Exercise 2-4 * — Date Extraction and Grouping

Add the following columns to sales:

  • month (integer: 1–12)
  • quarter (integer: 1–4)
  • day_of_week (integer: Monday=0)

Then group by quarter and compute total revenue and average margin percentage.
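
All three columns come from the .dt accessor on a datetime column; sketched on two toy dates:

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-05", "2024-04-11"])})

df["month"] = df["order_date"].dt.month
df["quarter"] = df["order_date"].dt.quarter
df["day_of_week"] = df["order_date"].dt.dayofweek  # Monday=0, Sunday=6
```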


Exercise 2-5 — .str Accessor

The customers DataFrame has a name column with values like "Goldfinch Inc" and "Redwood LLC". Using .str methods:

  1. Create a name_upper column with names in all capitals.
  2. Create a company_type column containing just the last word of the company name (e.g., "Inc", "LLC", "Co", "Ltd", "Group").
  3. Create a name_length column with the character count of each name.
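
The three tasks above correspond to three .str calls; a sketch on two toy names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Goldfinch Inc", "Cypress Group"]})

df["name_upper"] = df["name"].str.upper()
df["company_type"] = df["name"].str.split().str[-1]  # last whitespace-separated word
df["name_length"] = df["name"].str.len()
```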

Tier 3: Analysis (Derive Insights)

Exercise 3-1 * — Customer Tier Revenue Analysis

Merge sales with customers on customer_id (left join). Then group by tier to compute:

  • Number of unique customers
  • Total transactions
  • Total revenue
  • Revenue as a percentage of all revenue
  • Average margin percentage

Which tier contributes the most revenue? Which has the highest average margin?


Exercise 3-2 — Rolling 3-Month Average

Compute total monthly revenue across all regions. Then add:

  • A 3-month rolling average column
  • A month-over-month growth percentage column
  • A year-to-date cumulative revenue column

Print the result for all months. Explain in comments what NaN values mean in the rolling column.
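
The three derived columns correspond to .rolling().mean(), .pct_change(), and .cumsum(); a sketch on a toy monthly series:

```python
import pandas as pd

monthly = pd.Series([100, 120, 90, 150], index=["Jan", "Feb", "Mar", "Apr"])

rolling_3m = monthly.rolling(window=3).mean()  # first two entries are NaN:
                                              # fewer than 3 months of history
mom_growth = monthly.pct_change() * 100       # month-over-month % change
ytd = monthly.cumsum()                        # running year-to-date total
```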


Exercise 3-3 — Custom Aggregation Function

Write a custom function revenue_coefficient_of_variation(series) that returns the standard deviation divided by the mean (a normalized measure of variability). Apply it using .agg() to compute the coefficient of variation of revenue for each region.

Which region has the most consistent order values? Which is the most variable?
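
Passing a custom function to .agg() looks like this (toy data; lower values mean more consistent orders):

```python
import pandas as pd

def revenue_coefficient_of_variation(series):
    # Standard deviation divided by the mean: a scale-free variability measure.
    return series.std() / series.mean()

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [100, 100, 50, 150],
})

cv = df.groupby("region")["revenue"].agg(revenue_coefficient_of_variation)
```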


Exercise 3-4 * — .concat() Across Periods

Split sales into two DataFrames: orders in Q1 (January–March) and orders in Q2 (April–June). Compute a regional summary for each. Stack them using pd.concat() and print the combined result.
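
Stacking per-period summaries is a one-liner once each carries a period label; a toy sketch:

```python
import pandas as pd

q1 = pd.DataFrame({"region": ["North"], "total_revenue": [400]})
q2 = pd.DataFrame({"region": ["North"], "total_revenue": [550]})

# Label each summary before stacking so rows stay distinguishable.
q1["period"] = "Q1"
q2["period"] = "Q2"
combined = pd.concat([q1, q2], ignore_index=True)
```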


Exercise 3-5 — GroupBy with .transform()

Using .transform() instead of .agg(), add two new columns to the sales DataFrame:

  • region_total_revenue — the total revenue for that row's region (same value for all rows in the same region)
  • pct_of_region — each order's revenue as a percentage of its region total

This allows you to see how much each individual transaction contributes to its region without losing any rows.
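
The key difference from .agg() is that .transform() broadcasts the group result back to every row; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "revenue": [100, 300, 200],
})

# transform("sum") returns one value per ROW, repeated within each group.
df["region_total_revenue"] = df.groupby("region")["revenue"].transform("sum")
df["pct_of_region"] = df["revenue"] / df["region_total_revenue"] * 100
```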


Tier 4: Synthesis (Combine Techniques)

Exercise 4-1 * — Full Enrichment Pipeline

Build a complete enrichment pipeline:

  1. Start with sales.
  2. Add month, quarter, and day_of_week columns from order_date.
  3. Add size_category using .apply().
  4. Add zone using .map().
  5. Merge with customers on customer_id using a left join.
  6. Group by tier and product using named aggregations to get total revenue, total margin, and deal count.
  7. Add a margin_pct column calculated from totals.
  8. Sort by tier (Gold first), then by total revenue descending within each tier.

Print the final result.


Exercise 4-2 — Account Manager Performance Report

After merging sales with customers, suppose you want to generate an account manager performance report. Unfortunately, the customers table does not have an account manager column. Add one:

customers["account_manager"] = ["Alice", "Bob", "Carol", "Alice", "Bob"]

Then build a summary grouped by account_manager showing:

  • Number of unique customers managed
  • Total revenue
  • Average deal size
  • Average margin percentage
  • Best-performing product (by revenue) per manager

Hint: The last item requires a second groupby or a custom function.


Exercise 4-3 — Pivot + QoQ Comparison

Build two separate pivot tables:

  • Pivot A: Q1 revenue by region × product
  • Pivot B: Q2 revenue by region × product

After building both, subtract Pivot A from Pivot B (where both have data) to create a "change" pivot showing the dollar difference in revenue for each region-product combination.
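
Because pivot tables align on their row and column labels, the subtraction can use DataFrame.sub; a toy sketch:

```python
import pandas as pd

# Toy single-column pivots standing in for the Q1 and Q2 tables.
pivot_a = pd.DataFrame({"Software": [100, 50]}, index=["North", "South"])
pivot_b = pd.DataFrame({"Software": [150, 40]}, index=["North", "South"])

# fill_value=0 treats a missing region-product cell as zero revenue.
change = pivot_b.sub(pivot_a, fill_value=0)
```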


Exercise 4-4 — Long-to-Wide Reshape

Using the melted long-format DataFrame you created in Exercise 2-3 (or recreate it), use .pivot() to reshape it back to wide format. Verify that the round-trip (wide → long → wide) produces the same values as the original wide format.


Tier 5: Extension (Open-Ended Challenges)

Exercise 5-1 — Anomaly Detection

Using the sales DataFrame, identify orders that are statistical outliers by revenue. Define an outlier as any order where the revenue is more than 1.5 standard deviations above or below the mean for its region.

Use .groupby() with .transform() to compute the mean and standard deviation per region, then flag outlier rows. How many outliers are there? Which regions have the most?


Exercise 5-2 — Customer Cohort Analysis

Using the order_date column, assign each customer to an "acquisition cohort" based on the month of their first transaction. Then compute, for each cohort: - Number of customers - Total revenue in each subsequent month they purchased

Hint: Use .groupby() to find each customer's first order date, merge that back onto sales as cohort_month, then pivot by cohort_month and calendar month.


Exercise 5-3 — Dynamic Pricing Audit

Suppose Acme's pricing policy says that "Gold tier customers should receive a 10% discount off list price, Silver tier a 5% discount, and Bronze tier no discount." The revenue column reflects actual amounts billed.

Create a list_price_revenue column that reverses the discount for each order (based on the customer's tier after merging). Then compute the total revenue forgone due to discounting, broken down by tier.


Exercise 5-4 — Build a Reusable Reporting Function

Write a function generate_regional_report(transactions_df, customer_df, period_label) that:

  1. Merges transactions with customers on customer_id.
  2. Adds margin and margin_pct columns.
  3. Returns a dictionary with three keys:
     • "regional_summary" — grouped by region with total revenue, margin, deal count
     • "tier_summary" — grouped by tier with total revenue and customer count
     • "pivot" — pd.pivot_table of revenue by region × product

Test the function by calling it separately on Q1 and Q2 subsets of sales, then compare the regional_summary results from both calls.


Exercise 5-5 — Year-over-Year Simulation

Simulate two years of data by doubling the sales DataFrame and backdating the second copy by exactly one year. Then compute year-over-year revenue growth by region, accounting for the fact that some regions may have different transaction counts between years.

Present your results as a clean pivot table with regions as rows, years as columns, and a "YoY Growth %" column appended.
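
The backdating step can use pd.DateOffset, which shifts by calendar years rather than a fixed number of days; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-05"]), "revenue": [100]})

# Copy the data and shift the copy back exactly one calendar year.
prior = df.copy()
prior["order_date"] = prior["order_date"] - pd.DateOffset(years=1)

two_years = pd.concat([prior, df], ignore_index=True)
two_years["year"] = two_years["order_date"].dt.year
```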


Answer Guidance

Answers to starred exercises (*) are provided in the companion solutions file. For unstarred exercises, compare your output against the expected logic described in the exercise. Key checks:

  • Did you use named aggregations to get clean column names?
  • Did you reset the index when needed?
  • Did you verify row counts after merges?
  • Did you sort results logically for a business audience?
  • Are your percentage columns computed from totals (not averages of percentages)?