Chapter 13 Exercises: Transforming and Aggregating Business Data
These exercises are organized into five tiers of increasing difficulty. Complete them in order for the best learning progression. Exercises marked with a star (*) are recommended as minimum completion for the chapter.
Setup: Shared Data
Use the following DataFrames for all exercises unless an exercise specifies its own data:
import pandas as pd
import numpy as np
sales = pd.DataFrame({
"order_id": range(1001, 1021),
"order_date": pd.to_datetime([
"2024-01-05", "2024-01-12", "2024-01-18", "2024-01-25",
"2024-02-03", "2024-02-09", "2024-02-14", "2024-02-21",
"2024-03-02", "2024-03-08", "2024-03-15", "2024-03-22",
"2024-04-04", "2024-04-11", "2024-04-18", "2024-04-25",
"2024-05-03", "2024-05-10", "2024-05-17", "2024-05-24",
]),
"customer_id": [
"C01","C02","C01","C03","C02","C04","C01","C03",
"C05","C02","C04","C01","C03","C05","C02","C04",
"C01","C03","C05","C02",
],
"region": [
"North","South","North","West","South","East","North","West",
"North","South","East","North","West","North","South","East",
"North","West","North","South",
],
"product": [
"Software","Hardware","Software","Services","Hardware","Software",
"Services","Software","Hardware","Services","Software","Hardware",
"Services","Software","Hardware","Services","Software","Services",
"Hardware","Software",
],
"revenue": [
4500, 2300, 6700, 3800, 1900, 5100,
7200, 4200, 3300, 2700, 4400, 8100,
5600, 3900, 3100, 6200, 7800, 4900,
2100, 5500,
],
"cost": [
900, 1150, 1340, 760, 950, 1020,
1440, 840, 1320, 810, 880, 1620,
1120, 780, 1550, 1240, 1560, 980,
1050, 1100,
],
"units": [3, 5, 4, 1, 8, 2, 3, 2, 6, 1, 2, 5, 1, 3, 7, 4, 5, 2, 9, 3],
})
customers = pd.DataFrame({
"customer_id": ["C01","C02","C03","C04","C05"],
"name": ["Goldfinch Inc","Redwood LLC","Birchwood Co","Magnolia Ltd","Cypress Group"],
"tier": ["Gold","Silver","Bronze","Gold","Silver"],
"region": ["North","South","West","East","North"],
"annual_value": [120000, 48000, 24000, 95000, 62000],
})
sales["margin"] = sales["revenue"] - sales["cost"]
sales["margin_pct"] = (sales["margin"] / sales["revenue"] * 100).round(1)
Tier 1: Foundation (Recall and Recognition)
Exercise 1-1 * — Basic GroupBy Sum
Group the sales DataFrame by region and calculate the total revenue for each region. Print the result sorted from highest to lowest revenue.
Expected output columns: region, total_revenue
Exercise 1-2 * — Multiple Aggregations
Using .groupby() and .agg() with named aggregations, create a summary of sales by product with these columns:
- total_revenue — sum of revenue
- total_orders — count of orders
- avg_order_value — mean revenue
- avg_margin_pct — mean margin_pct
Print the result sorted by total_revenue descending.
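If the named-aggregation syntax is rusty, here is the call shape on a tiny toy DataFrame (illustrative data and column names, not the exercise data):

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["A", "A", "B"],
    "revenue": [100, 200, 50],
})

# Named aggregations: new_column=(source_column, aggregation)
summary = toy.groupby("product").agg(
    total_revenue=("revenue", "sum"),
    total_orders=("revenue", "count"),
    avg_order_value=("revenue", "mean"),
).reset_index()
print(summary)
```

The keyword name on the left becomes the output column, so no renaming step is needed afterward.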
Exercise 1-3 * — Simple Merge
Merge the sales DataFrame with the customers DataFrame on customer_id using a left join. Call the result sales_enriched. Print the number of rows before and after the merge to confirm no rows were lost.
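One way to structure the row-count check (a sketch on toy frames, not the solution itself):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["C01", "C02", "C01"], "revenue": [10, 20, 30]})
right = pd.DataFrame({"customer_id": ["C01", "C02"], "tier": ["Gold", "Silver"]})

before = len(left)
merged = left.merge(right, on="customer_id", how="left")
after = len(merged)

# A left join keeps every left-hand row; duplicate keys on the RIGHT
# side would inflate the count, so always compare before and after.
print(before, after)
assert before == after
```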
Exercise 1-4 — .apply() with Lambda
Add a column size_category to sales using .apply() with a lambda:
- Revenue >= 6000: "Large"
- Revenue 3000–5999: "Medium"
- Revenue < 3000: "Small"
Print the value counts of size_category.
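The conditional-lambda pattern looks like this on toy values (the thresholds here are arbitrary, not the exercise's):

```python
import pandas as pd

s = pd.Series([10, 55, 120])

# Chained conditional expressions inside a single lambda
labels = s.apply(lambda v: "Large" if v >= 100 else ("Medium" if v >= 50 else "Small"))
print(labels.tolist())  # ['Small', 'Medium', 'Large']
```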
Exercise 1-5 — .map() for Lookup
Create a dictionary that maps each region to a "zone": North and East are "Zone A", South and West are "Zone B". Add a zone column to sales using .map().
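As a reminder, .map() with a dict replaces each value by lookup (toy mapping shown, not the zone mapping asked for):

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East"]})

zone_lookup = {"North": "Zone A", "East": "Zone A", "South": "Zone B"}
df["zone"] = df["region"].map(zone_lookup)  # values missing from the dict become NaN
print(df)
```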
Tier 2: Application (Use in Context)
Exercise 2-1 * — Multi-Column GroupBy
Group sales by both region AND product, and compute:
- total_revenue (sum)
- deal_count (count)
- avg_margin_pct (mean)
Sort by region, then total_revenue descending. Print the full result.
Exercise 2-2 * — Pivot Table
Using pd.pivot_table(), create a table showing total revenue with:
- Rows: region
- Columns: product
- Values: revenue (sum)
- Fill missing combinations with 0
- Include row and column totals (use margins=True)
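The pd.pivot_table() call shape, on toy data (values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product": ["Software", "Hardware", "Software"],
    "revenue": [100, 50, 70],
})

pivot = pd.pivot_table(
    toy,
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
    margins=True,   # adds an "All" row and column of totals
)
print(pivot)
```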
Exercise 2-3 — Wide to Long with .melt()
Create a summary DataFrame with one row per region showing total revenue by month (Jan, Feb, Mar, Apr, May). Then use .melt() to reshape it from wide format (months as columns) to long format (one row per region-month combination).
Hint: Use .groupby() on region and order_date.dt.month_name(), then use .pivot() or pd.pivot_table() to get the wide format before melting.
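.melt() itself works like this on a toy wide table:

```python
import pandas as pd

wide = pd.DataFrame({
    "region": ["North", "South"],
    "Jan": [100, 80],
    "Feb": [120, 90],
})

# id_vars stay fixed; every other column becomes (variable, value) pairs
long = wide.melt(id_vars="region", var_name="month", value_name="revenue")
print(long)
```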
Exercise 2-4 * — Date Extraction and Grouping
Add the following columns to sales:
- month (integer: 1–12)
- quarter (integer: 1–4)
- day_of_week (integer: Monday=0)
Then group by quarter and compute total revenue and average margin percentage.
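The .dt accessor pattern, on a couple of illustrative dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-05", "2024-04-11"]))

print(dates.dt.month.tolist())      # [1, 4]
print(dates.dt.quarter.tolist())    # [1, 2]
print(dates.dt.dayofweek.tolist())  # Monday=0
```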
Exercise 2-5 — .str Accessor
The customers DataFrame has a name column with values like "Goldfinch Inc" and "Redwood LLC". Using .str methods:
- Create a name_upper column with names in all capitals.
- Create a company_type column containing just the last word of the company name (e.g., "Inc", "LLC", "Co", "Ltd", "Group").
- Create a name_length column with the character count of each name.
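The .str accessor chains look like this (toy names, not the customers table):

```python
import pandas as pd

names = pd.Series(["Goldfinch Inc", "Redwood LLC"])

print(names.str.upper().tolist())           # all capitals
print(names.str.split().str[-1].tolist())   # last word of each name
print(names.str.len().tolist())             # character counts
```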
Tier 3: Analysis (Derive Insights)
Exercise 3-1 * — Customer Tier Revenue Analysis
Merge sales with customers on customer_id (left join). Then group by tier to compute:
- Number of unique customers
- Total transactions
- Total revenue
- Revenue as a percentage of all revenue
- Average margin percentage
Which tier contributes the most revenue? Which has the highest average margin?
Exercise 3-2 — Rolling 3-Month Average
Compute total monthly revenue across all regions. Then add:
- A 3-month rolling average column
- A month-over-month growth percentage column
- A year-to-date cumulative revenue column
Print the result for all months. Explain in comments what NaN values mean in the rolling column.
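The three time-series building blocks, sketched on a toy monthly series:

```python
import pandas as pd

monthly = pd.Series([100.0, 110.0, 130.0, 120.0])

# A rolling window needs a full window of values before producing a
# result, so the first two entries of a 3-month rolling mean are NaN.
rolling_avg = monthly.rolling(window=3).mean()
mom_growth = monthly.pct_change() * 100   # month-over-month % change
ytd = monthly.cumsum()                    # year-to-date running total

print(rolling_avg.tolist())
```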
Exercise 3-3 — Custom Aggregation Function
Write a custom function revenue_coefficient_of_variation(series) that returns the standard deviation divided by the mean (a normalized measure of variability). Apply it using .agg() to compute the coefficient of variation of revenue for each region.
Which region has the most consistent order values? Which is the most variable?
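The general pattern for passing a custom function to .agg() (shown here with a different statistic, a max-minus-min range, so it doesn't give the answer away):

```python
import pandas as pd

def revenue_range(series):
    """A custom aggregation receives one group's values as a Series."""
    return series.max() - series.min()

toy = pd.DataFrame({"region": ["N", "N", "S"], "revenue": [100, 160, 70]})
result = toy.groupby("region")["revenue"].agg(revenue_range)
print(result)
```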
Exercise 3-4 * — .concat() Across Periods
Split sales into two DataFrames: orders in Q1 (January–March) and orders in Q2 (April–June). Compute a regional summary for each. Stack them using pd.concat() and print the combined result.
Exercise 3-5 — GroupBy with .transform()
Using .transform() instead of .agg(), add two new columns to the sales DataFrame:
- region_total_revenue — the total revenue for that row's region (same value for all rows in the same region)
- pct_of_region — each order's revenue as a percentage of its region total
This allows you to see how much each individual transaction contributes to its region without losing any rows.
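The .transform() shape on toy data: unlike .agg(), it returns one value per original row.

```python
import pandas as pd

toy = pd.DataFrame({"region": ["N", "N", "S"], "revenue": [100, 300, 50]})

# transform broadcasts each group's sum back to every row of that group
toy["group_total"] = toy.groupby("region")["revenue"].transform("sum")
toy["pct_of_group"] = toy["revenue"] / toy["group_total"] * 100
print(toy)
```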
Tier 4: Synthesis (Combine Techniques)
Exercise 4-1 * — Full Enrichment Pipeline
Build a complete enrichment pipeline:
- Start with sales.
- Add month, quarter, and day_of_week columns from order_date.
- Add size_category using .apply().
- Add zone using .map().
- Merge with customers on customer_id using a left join.
- Group by tier and product using named aggregations to get total revenue, total margin, and deal count.
- Add a margin_pct column calculated from totals.
- Sort by tier (Gold first), then by total revenue descending within each tier.
Print the final result.
Exercise 4-2 — Account Manager Performance Report
After merging sales with customers, suppose you want to generate an account manager performance report. Unfortunately, the customers table does not have an account manager column. Add one:
customers["account_manager"] = ["Alice", "Bob", "Carol", "Alice", "Bob"]
Then build a summary grouped by account_manager showing:
- Number of unique customers managed
- Total revenue
- Average deal size
- Average margin percentage
- Best-performing product (by revenue) per manager
Hint: The last item requires a second groupby or a custom function.
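One way to find a "best X per group" (a sketch on toy data using sort-then-deduplicate; a second groupby with idxmax works too):

```python
import pandas as pd

toy = pd.DataFrame({
    "manager": ["Alice", "Alice", "Bob"],
    "product": ["Software", "Hardware", "Software"],
    "revenue": [300, 500, 200],
})

# Sum revenue per (manager, product), sort descending, then keep the
# first (largest) row for each manager.
per_product = toy.groupby(["manager", "product"], as_index=False)["revenue"].sum()
best = (per_product.sort_values("revenue", ascending=False)
                   .drop_duplicates("manager")[["manager", "product"]])
print(best)
```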
Exercise 4-3 — Pivot + Quarter-over-Quarter Comparison
Build two separate pivot tables:
- Pivot A: Q1 revenue by region × product
- Pivot B: Q2 revenue by region × product
After building both, subtract Pivot A from Pivot B (where both have data) to create a "change" pivot showing the dollar difference in revenue for each region-product combination.
Exercise 4-4 — Long-to-Wide Reshape
Using the melted long-format DataFrame you created in Exercise 2-3 (or recreate it), use .pivot() to reshape it back to wide format. Verify that the round-trip (wide → long → wide) produces the same values as the original wide format.
Tier 5: Extension (Open-Ended Challenges)
Exercise 5-1 — Anomaly Detection
Using the sales DataFrame, identify orders that are statistical outliers by revenue. Define an outlier as any order where the revenue is more than 1.5 standard deviations above or below the mean for its region.
Use .groupby() with .transform() to compute the mean and standard deviation per region, then flag outlier rows. How many outliers are there? Which regions have the most?
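The per-group flagging step can be sketched as follows (toy data with an obvious outlier; the 1.5 threshold matches the exercise, the values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["N", "N", "N", "N", "N"],
    "revenue": [100, 100, 100, 100, 500],
})

grp = toy.groupby("region")["revenue"]
toy["grp_mean"] = grp.transform("mean")
toy["grp_std"] = grp.transform("std")   # sample std (ddof=1) by default
toy["is_outlier"] = (toy["revenue"] - toy["grp_mean"]).abs() > 1.5 * toy["grp_std"]
print(toy["is_outlier"].tolist())
```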
Exercise 5-2 — Customer Cohort Analysis
Using the order_date column, assign each customer to an "acquisition cohort" based on the month of their first transaction. Then compute, for each cohort:
- Number of customers
- Total revenue in each subsequent month they purchased
Hint: Use .groupby() to find each customer's first order date, merge that back onto sales as cohort_month, then pivot by cohort_month and calendar month.
Exercise 5-3 — Dynamic Pricing Audit
Suppose Acme's pricing policy says that "Gold tier customers should receive a 10% discount off list price, Silver tier a 5% discount, and Bronze tier no discount." The revenue column reflects actual amounts billed.
Create a list_price_revenue column that reverses the discount for each order (based on the customer's tier after merging). Then compute the total revenue forgone due to discounting, broken down by tier.
Exercise 5-4 — Build a Reusable Reporting Function
Write a function generate_regional_report(transactions_df, customer_df, week_label) that:
1. Merges transactions with customers on customer_id.
2. Adds margin and margin_pct columns.
3. Returns a dictionary with three keys:
- "regional_summary" — grouped by region with total revenue, margin, deal count
- "tier_summary" — grouped by tier with total revenue and customer count
- "pivot" — pd.pivot_table of revenue by region × product
Test the function by calling it separately on Q1 and Q2 subsets of sales, then compare the regional_summary results from both calls.
Exercise 5-5 — Year-over-Year Simulation
Simulate two years of data by doubling the sales DataFrame and backdating the second copy by exactly one year. Then compute year-over-year revenue growth by region, accounting for the fact that some regions may have different transaction counts between years.
Present your results as a clean pivot table with regions as rows, years as columns, and a "YoY Growth %" column appended.
Answer Guidance
Answers to starred exercises (*) are provided in the companion solutions file. For unstarred exercises, compare your output against the expected logic described in the exercise. Key checks:
- Did you use named aggregations to get clean column names?
- Did you reset the index when needed?
- Did you verify row counts after merges?
- Did you sort results logically for a business audience?
- Are your percentage columns computed from totals (not averages of percentages)?