In This Chapter
- Learning Objectives
- 13.1 Why Transformation and Aggregation Matter
- 13.2 Element-wise Transformations: .apply() and .map()
- 13.3 Grouping and Aggregating: .groupby() and .agg()
- 13.4 Pivot Tables with pd.pivot_table()
- 13.5 Combining DataFrames: .merge(), .join(), .concat()
- 13.6 Reshaping Data: .melt() and .pivot()
- 13.7 String Operations: The .str Accessor
- 13.8 Date and Time Operations
- 13.9 Window Functions: .rolling() and .expanding()
- 13.10 Putting It All Together: Priya's Weekly Regional Report
- 13.11 Chapter Summary
- Key Terms
Chapter 13: Transforming and Aggregating Business Data
Learning Objectives
By the end of this chapter, you will be able to:
- Apply element-wise transformations to DataFrame columns using .apply() and .map()
- Group data by one or more columns and compute aggregate statistics with .groupby() and .agg()
- Build pivot tables to summarize data across two dimensions
- Combine DataFrames using .merge(), .join(), and .concat()
- Reshape data between wide and long formats using .melt() and .pivot()
- Work with string data using the .str accessor
- Parse and extract date components using pd.to_datetime() and the .dt accessor
- Calculate rolling and expanding window statistics
13.1 Why Transformation and Aggregation Matter
Raw business data rarely arrives in the shape you need. A sales export from your CRM might have one row per transaction, but your VP wants a table showing revenue by region and week. A time-tracking system might record hours at the project-task level, but a client invoice needs totals. An HR system might store salaries as annual figures, but your analysis needs monthly equivalents.
Priya Okonkwo, business analyst at Acme Corp, spends a significant portion of every Monday morning doing exactly this kind of work. She receives a flat transaction file from the data warehouse and transforms it into the regional sales summary that Sandra Chen, VP of Sales, reviews at 9 a.m. Before learning pandas, Priya did this in Excel — pivot tables, VLOOKUP, manual filters. It took two hours. With pandas, it takes about ten minutes, and the process is fully reproducible.
This chapter covers the core pandas toolkit for transformation and aggregation. By the end, you will be able to reshape data as fluidly as an experienced analyst.
13.2 Element-wise Transformations: .apply() and .map()
13.2.1 The .apply() Method
.apply() lets you run a function against every element in a Series, or against every row or column of a DataFrame. It is one of the most flexible tools in pandas.
Applying a lambda to a column
Suppose Priya has a DataFrame of sales transactions and wants to add a column that categorizes order size:
import pandas as pd
sales = pd.DataFrame({
"order_id": [1001, 1002, 1003, 1004, 1005],
"revenue": [450.00, 12300.00, 875.50, 5200.00, 320.00],
"region": ["North", "West", "North", "East", "West"],
})
# Categorize order size with a lambda
sales["order_size"] = sales["revenue"].apply(
lambda x: "Large" if x >= 5000 else ("Medium" if x >= 1000 else "Small")
)
print(sales[["order_id", "revenue", "order_size"]])
Output:
order_id revenue order_size
0 1001 450.00 Small
1 1002 12300.00 Large
2 1003 875.50 Small
3 1004 5200.00 Large
4 1005 320.00 Small
The lambda receives one value at a time and returns one value. This is the simplest form of .apply().
Applying a named function
When your logic becomes too complex for a lambda, define a proper function:
def commission_rate(revenue):
"""Return the commission rate based on revenue tier."""
if revenue >= 10000:
return 0.10
elif revenue >= 5000:
return 0.07
elif revenue >= 1000:
return 0.05
else:
return 0.03
sales["commission_rate"] = sales["revenue"].apply(commission_rate)
sales["commission_amount"] = sales["revenue"] * sales["commission_rate"]
print(sales[["order_id", "revenue", "commission_rate", "commission_amount"]])
Output:
order_id revenue commission_rate commission_amount
0 1001 450.00 0.03 13.50
1 1002 12300.00 0.10 1230.00
2 1003 875.50 0.03 26.27
3 1004 5200.00 0.07 364.00
4 1005 320.00 0.03 9.60
Applying a function across rows (axis=1)
Sometimes you need to look at multiple columns together. Use axis=1 to pass each row as a Series:
sales["rep"] = ["Alice", "Bob", "Alice", "Carol", "Bob"]
sales["quota"] = [400.00, 10000.00, 800.00, 6000.00, 300.00]
def quota_status(row):
"""Return whether a rep met, exceeded, or missed quota."""
ratio = row["revenue"] / row["quota"]
if ratio >= 1.10:
return "Exceeded"
elif ratio >= 1.00:
return "Met"
else:
return "Missed"
sales["quota_status"] = sales.apply(quota_status, axis=1)
print(sales[["order_id", "rep", "revenue", "quota", "quota_status"]])
Output:
order_id rep revenue quota quota_status
0 1001 Alice 450.00 400.00 Exceeded
1 1002 Bob 12300.00 10000.00 Exceeded
2 1003 Alice 875.50 800.00 Exceeded
3 1004 Carol 5200.00 6000.00 Missed
4 1005 Bob 320.00 300.00 Exceeded
Performance note:
.apply() with axis=1 is slower than vectorized operations because it calls Python once for each row. For large datasets, prefer vectorized pandas or NumPy operations when possible. Use .apply() when the logic is genuinely complex or conditional.
13.2.2 The .map() Method
.map() works on a Series and is designed for element-wise substitution. You can pass it a dictionary to replace values, another Series to act as a lookup table, or a function.
Dictionary mapping — replacing codes with labels
region_names = {
"N": "North",
"S": "South",
"E": "East",
"W": "West",
}
orders = pd.DataFrame({
"order_id": [2001, 2002, 2003, 2004],
"region_code": ["N", "W", "E", "N"],
"revenue": [1200, 4500, 800, 3300],
})
orders["region_name"] = orders["region_code"].map(region_names)
print(orders)
Output:
order_id region_code revenue region_name
0 2001 N 1200 North
1 2002 W 4500 West
2 2003 E 800 East
3 2004 N 3300 North
If a key is missing from the dictionary, .map() returns NaN for that row — which is a useful signal that you have unmapped codes.
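A quick way to surface those unmapped codes is to filter on the NaN results. A minimal sketch, using a hypothetical unknown code "X" that is not in the mapping:

```python
import pandas as pd

region_names = {"N": "North", "S": "South", "E": "East", "W": "West"}

orders = pd.DataFrame({
    "order_id": [2001, 2002, 2003],
    "region_code": ["N", "X", "E"],  # "X" is a hypothetical unmapped code
})
orders["region_name"] = orders["region_code"].map(region_names)

# Rows where the lookup failed come back as NaN, so flag them for review
unmapped = orders[orders["region_name"].isna()]
print(unmapped["region_code"].tolist())  # ['X']
```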
Mapping with a Series
You can also map using a Series where the index is the lookup key:
rep_territory = pd.Series({
"Alice": "Northeast",
"Bob": "West Coast",
"Carol": "Midwest",
})
sales["territory"] = sales["rep"].map(rep_territory)
.map() vs .apply()
| Feature | .map() | .apply() |
|---|---|---|
| Works on | Series only | Series or DataFrame |
| Typical use | Element substitution/lookup | Conditional logic, multi-column |
| Accepts | dict, Series, function | function (lambda or named) |
| NaN handling | Returns NaN for missing keys | Depends on your function |
13.3 Grouping and Aggregating: .groupby() and .agg()
13.3.1 The Split-Apply-Combine Paradigm
Hadley Wickham, the statistician behind R's plyr and dplyr packages, described the core idea as "split-apply-combine":
- Split — divide the DataFrame into groups based on one or more columns
- Apply — compute a statistic or transformation within each group
- Combine — reassemble the results into a new DataFrame
pandas implements this through .groupby(). Priya uses this pattern every single day.
13.3.2 Basic .groupby() Usage
import pandas as pd
transactions = pd.DataFrame({
"date": pd.to_datetime([
"2024-01-05", "2024-01-08", "2024-01-12", "2024-01-15",
"2024-01-18", "2024-01-22", "2024-01-25", "2024-01-28",
]),
"region": ["North", "South", "North", "West",
"South", "North", "West", "South"],
"product_category": ["Software", "Hardware", "Software", "Services",
"Hardware", "Services", "Software", "Software"],
"revenue": [4500, 2300, 6800, 3200, 1900, 5100, 4700, 3600],
"cost": [900, 1400, 1360, 960, 1200, 1020, 940, 720],
})
transactions["margin"] = transactions["revenue"] - transactions["cost"]
# Group by region and sum revenue
regional_revenue = transactions.groupby("region")["revenue"].sum()
print(regional_revenue)
Output:
region
North 16400
South 7800
West 7900
Name: revenue, dtype: int64
The result is a Series with the group keys as the index. You can call .reset_index() to turn it into a regular DataFrame:
regional_revenue = (
transactions
.groupby("region")["revenue"]
.sum()
.reset_index()
.rename(columns={"revenue": "total_revenue"})
)
print(regional_revenue)
Output:
region total_revenue
0 North 16400
1 South 7800
2 West 7900
13.3.3 Common Aggregation Functions
pandas supports a wide range of built-in aggregation functions you can call directly on a GroupBy object:
| Method | Description |
|---|---|
| .sum() | Sum of values |
| .mean() | Arithmetic mean |
| .median() | Median |
| .count() | Count of non-null values |
| .nunique() | Count of unique values |
| .min() | Minimum |
| .max() | Maximum |
| .std() | Standard deviation |
| .first() | First value in the group |
| .last() | Last value in the group |
# Multiple aggregations in sequence
print("Revenue by region:")
print(transactions.groupby("region")["revenue"].mean().round(2))
print("\nTransaction count by region:")
print(transactions.groupby("region")["revenue"].count())
print("\nRevenue range by product category:")
print(transactions.groupby("product_category")[["revenue"]].agg(["min", "max"]))
13.3.4 .agg() — Multiple Aggregations at Once
.agg() (short for .aggregate()) is the powerhouse for building summary tables. You can compute multiple statistics for multiple columns in a single call.
Basic .agg() with a list of functions
summary = transactions.groupby("region")[["revenue", "margin"]].agg(
["sum", "mean", "count"]
)
print(summary)
Output:
       revenue                    margin
           sum         mean count    sum         mean count
region
North    16400  5466.666667     3  13120  4373.333333     3
South     7800  2600.000000     3   4480  1493.333333     3
West     7900  3950.000000      2   6000  3000.000000     2
The result has a MultiIndex on the columns. To flatten it:
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()
print(summary.columns.tolist())
# ['region', 'revenue_sum', 'revenue_mean', 'revenue_count',
# 'margin_sum', 'margin_mean', 'margin_count']
Named aggregations — the cleanest approach
pandas 0.25+ supports named aggregations with pd.NamedAgg, which lets you control output column names directly:
summary = transactions.groupby("region").agg(
total_revenue=("revenue", "sum"),
avg_revenue=("revenue", "mean"),
transaction_count=("revenue", "count"),
total_margin=("margin", "sum"),
avg_margin=("margin", "mean"),
).round(2).reset_index()
print(summary)
Output:
  region  total_revenue  avg_revenue  transaction_count  total_margin  avg_margin
0  North          16400      5466.67                  3         13120     4373.33
1  South           7800      2600.00                  3          4480     1493.33
2   West           7900      3950.00                  2          6000     3000.00
This is Priya's preferred syntax — the output column names are self-documenting, and there is no MultiIndex to flatten.
Named aggregations with custom functions
You can pass any callable as the aggregation function:
import numpy as np
def revenue_range(series):
"""Return the difference between max and min revenue."""
return series.max() - series.min()
def top_quartile_avg(series):
"""Return the mean of the top 25% of values."""
threshold = series.quantile(0.75)
return series[series >= threshold].mean()
summary = transactions.groupby("region").agg(
total_revenue=("revenue", "sum"),
revenue_spread=("revenue", revenue_range),
top_quartile_avg=("revenue", top_quartile_avg),
).reset_index()
print(summary)
13.3.5 Grouping by Multiple Columns
# Group by region AND product category
category_summary = transactions.groupby(
["region", "product_category"]
).agg(
total_revenue=("revenue", "sum"),
deal_count=("revenue", "count"),
).reset_index()
print(category_summary)
Output:
  region product_category  total_revenue  deal_count
0  North         Services           5100           1
1  North         Software          11300           2
2  South         Hardware           4200           2
3  South         Software           3600           1
4   West         Services           3200           1
5   West         Software           4700           1
13.3.6 The transform() Method
While .agg() reduces groups to single rows, .transform() returns a Series with the same length as the original DataFrame, making it ideal for adding group-level statistics as new columns:
# Add a column showing each region's total revenue (for percentage calculation)
transactions["region_total"] = transactions.groupby("region")["revenue"].transform("sum")
transactions["pct_of_region"] = (
transactions["revenue"] / transactions["region_total"] * 100
).round(1)
print(transactions[["region", "revenue", "region_total", "pct_of_region"]])
13.4 Pivot Tables with pd.pivot_table()
A pivot table is a two-dimensional summary: rows represent one dimension (e.g., region), columns represent another (e.g., product category), and cells contain an aggregated value (e.g., total revenue). It is one of the most requested deliverables in business analytics.
13.4.1 Basic Pivot Table
pivot = pd.pivot_table(
transactions,
values="revenue",
index="region",
columns="product_category",
aggfunc="sum",
fill_value=0, # Replace NaN with 0 where a combination doesn't exist
)
print(pivot)
Output:
product_category Hardware Services Software
region
North 0 5100 11300
South 4200 0 3600
West 0 3200 4700
13.4.2 Pivot Table with Multiple Values and Margins
pivot_full = pd.pivot_table(
transactions,
values=["revenue", "margin"],
index="region",
columns="product_category",
aggfunc="sum",
fill_value=0,
margins=True, # Add row and column totals
margins_name="Total", # Label for the totals row/column
)
print(pivot_full)
13.4.3 Pivot Table with Multiple Aggregation Functions
pivot_multi = pd.pivot_table(
transactions,
values="revenue",
index="region",
columns="product_category",
aggfunc=["sum", "count"],
fill_value=0,
)
print(pivot_multi)
13.4.4 Resetting the Pivot Table Index
After creating a pivot table, you often want to export it or work with it as a flat DataFrame:
pivot_flat = pivot.reset_index()
pivot_flat.columns.name = None # Remove the axis label "product_category"
print(pivot_flat)
13.5 Combining DataFrames: .merge(), .join(), .concat()
13.5.1 The Business Case for Joining Data
Priya's transaction data contains a customer_id column, but not the customer's name, tier (Gold/Silver/Bronze), or account manager. That information lives in a separate customer master table. To enrich her analysis, she needs to join the two tables — exactly the same operation as a SQL JOIN.
13.5.2 .merge() — The Primary Join Tool
transactions_extended = pd.DataFrame({
"order_id": [3001, 3002, 3003, 3004, 3005, 3006],
"customer_id": ["C001", "C002", "C001", "C003", "C002", "C004"],
"revenue": [5400, 2100, 8900, 3300, 6700, 1200],
"region": ["North", "South", "North", "West", "South", "East"],
})
customers = pd.DataFrame({
"customer_id": ["C001", "C002", "C003", "C005"],
"customer_name": ["Goldfinch Inc", "Redwood LLC", "Birchwood Co", "Maple Corp"],
"tier": ["Gold", "Silver", "Bronze", "Gold"],
"account_manager": ["Alice", "Bob", "Carol", "Alice"],
})
# Inner join: only rows where customer_id exists in BOTH tables
merged_inner = pd.merge(
transactions_extended,
customers,
on="customer_id",
how="inner",
)
print("Inner join — rows:", len(merged_inner))
print(merged_inner)
Output:
Inner join — rows: 5
order_id customer_id revenue region customer_name tier account_manager
0 3001 C001 5400 North Goldfinch Inc Gold Alice
1 3003 C001 8900 North Goldfinch Inc Gold Alice
2 3002 C002 2100 South Redwood LLC Silver Bob
3 3005 C002 6700 South Redwood LLC Silver Bob
4 3004 C003 3300 West Birchwood Co Bronze Carol
Notice that order 3006 (customer C004) and customer C005 are both missing — C004 has no entry in the customer master, and C005 has no transactions.
The four join types
# Left join: all rows from transactions_extended, matching rows from customers
merged_left = pd.merge(
transactions_extended,
customers,
on="customer_id",
how="left",
)
# Result: 6 rows; order 3006 appears with NaN in customer columns
# Right join: all rows from customers, matching rows from transactions_extended
merged_right = pd.merge(
transactions_extended,
customers,
on="customer_id",
how="right",
)
# Result: 6 rows; C005 appears with NaN in transaction columns
# (C001 and C002 each match two transactions)
# Outer join: all rows from both tables
merged_outer = pd.merge(
transactions_extended,
customers,
on="customer_id",
how="outer",
)
# Result: 7 rows; C004 and C005 appear with NaN in missing columns
| Join Type | Keeps | Use When |
|---|---|---|
| inner | Only matching rows | You only want rows that exist in both tables |
| left | All left rows + matching right | You want all transactions even if customer data is missing |
| right | All right rows + matching left | You want all customers even if they have no transactions |
| outer | All rows from both tables | You want a complete picture with all gaps visible |
Merging on columns with different names
orders_alt = pd.DataFrame({
"order_num": [4001, 4002],
"cust_code": ["C001", "C002"],
"amount": [7200, 3100],
})
merged_diff_names = pd.merge(
orders_alt,
customers,
left_on="cust_code",
right_on="customer_id",
how="left",
)
Merging on multiple columns
# Imagine you need to match on both customer_id and region
pd.merge(
left_df,
right_df,
on=["customer_id", "region"],
how="inner",
)
13.5.3 .join() — Index-based Joining
.join() merges on the DataFrame index rather than a column. It is a shortcut for index-to-index joins:
customers_indexed = customers.set_index("customer_id")
transactions_indexed = transactions_extended.set_index("customer_id")
joined = transactions_indexed.join(customers_indexed, how="left")
joined = joined.reset_index()
In practice, most business analysts use .merge() because data rarely arrives with meaningful indexes. .join() is most useful when you have pre-indexed lookup tables.
13.5.4 .concat() — Stacking DataFrames
.concat() stacks DataFrames either vertically (adding more rows) or horizontally (adding more columns). The most common use in business analysis is combining data from multiple periods or sources.
Vertical concatenation (stacking rows)
q1_sales = pd.DataFrame({
"quarter": ["Q1", "Q1", "Q1"],
"region": ["North", "South", "West"],
"revenue": [45000, 32000, 28000],
})
q2_sales = pd.DataFrame({
"quarter": ["Q2", "Q2", "Q2"],
"region": ["North", "South", "West"],
"revenue": [52000, 35000, 31000],
})
q3_sales = pd.DataFrame({
"quarter": ["Q3", "Q3", "Q3"],
"region": ["North", "South", "West"],
"revenue": [48000, 38000, 34000],
})
full_year = pd.concat([q1_sales, q2_sales, q3_sales], ignore_index=True)
print(full_year)
Output:
quarter region revenue
0 Q1 North 45000
1 Q1 South 32000
2 Q1 West 28000
3 Q2 North 52000
4 Q2 South 35000
5 Q2 West 31000
6 Q3 North 48000
7 Q3 South 38000
8 Q3 West 34000
ignore_index=True resets the index to a clean 0-based sequence. Without it, you would get duplicate index values from the original DataFrames.
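When the source frames do not already carry a period column, pd.concat can tag each frame itself via its keys parameter. A minimal sketch with two trimmed quarterly frames (the level name "quarter" is an assumption for illustration):

```python
import pandas as pd

q1 = pd.DataFrame({"region": ["North", "South"], "revenue": [45000, 32000]})
q2 = pd.DataFrame({"region": ["North", "South"], "revenue": [52000, 35000]})

# keys= labels each source frame in a MultiIndex level;
# reset_index then turns that label into a regular column.
full = (
    pd.concat([q1, q2], keys=["Q1", "Q2"], names=["quarter", None])
    .reset_index(level="quarter")
    .reset_index(drop=True)
)
print(full)
```

This avoids hand-typing a quarter column into each DataFrame before stacking.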
Horizontal concatenation (stacking columns)
revenue_data = pd.DataFrame({"revenue": [45000, 52000, 48000]})
cost_data = pd.DataFrame({"cost": [18000, 22000, 19000]})
combined = pd.concat([revenue_data, cost_data], axis=1)
combined["margin"] = combined["revenue"] - combined["cost"]
13.6 Reshaping Data: .melt() and .pivot()
13.6.1 Wide vs. Long Format
Business data often arrives in "wide" format — one row per entity, with multiple columns representing different time periods or categories. Analytical tools, including many charting libraries and statistical models, prefer "long" format — one row per observation.
Wide format (one row per region, columns are months):
region Jan Feb Mar
North 45000 52000 48000
South 32000 35000 38000
West 28000 31000 34000
Long format (one row per region-month combination):
region month revenue
North Jan 45000
North Feb 52000
North Mar 48000
South Jan 32000
...
13.6.2 .melt() — Wide to Long
wide_sales = pd.DataFrame({
"region": ["North", "South", "West"],
"Jan": [45000, 32000, 28000],
"Feb": [52000, 35000, 31000],
"Mar": [48000, 38000, 34000],
})
long_sales = wide_sales.melt(
id_vars="region", # Columns to keep as-is
var_name="month", # Name for the new column holding old column names
value_name="revenue", # Name for the new column holding values
)
print(long_sales.sort_values(["region", "month"]).reset_index(drop=True))
Output:
region month revenue
0 North Feb 52000
1 North Jan 45000
2 North Mar 48000
3 South Feb 35000
4 South Jan 32000
5 South Mar 38000
6 West Feb 31000
7 West Jan 28000
8 West Mar 34000
13.6.3 .pivot() — Long to Wide
.pivot() is the inverse of .melt():
wide_again = long_sales.pivot(
index="region",
columns="month",
values="revenue",
)
wide_again.columns.name = None # Clean up the axis label
wide_again = wide_again.reset_index()
print(wide_again)
Output:
region Feb Jan Mar
0 North 52000 45000 48000
1 South 35000 32000 38000
2 West 31000 28000 34000
.pivot() vs pd.pivot_table(): Use .pivot() when there is exactly one value per row/column combination. Use pd.pivot_table() when there might be duplicates and you need an aggregation function to resolve them.
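The distinction is easy to demonstrate: given duplicate row/column combinations, .pivot() raises a ValueError, while pd.pivot_table() resolves them with an aggregation function. A minimal sketch:

```python
import pandas as pd

# Two rows for the same (region, month) pair -- duplicates
dup = pd.DataFrame({
    "region": ["North", "North"],
    "month": ["Jan", "Jan"],
    "revenue": [1000, 2000],
})

try:
    dup.pivot(index="region", columns="month", values="revenue")
except ValueError as exc:
    print("pivot failed:", exc)  # duplicate entries cannot be reshaped

# pivot_table resolves the duplicates with an aggregation function
print(pd.pivot_table(dup, index="region", columns="month",
                     values="revenue", aggfunc="sum"))
```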
13.7 String Operations: The .str Accessor
When working with text columns, pandas provides the .str accessor, which gives you access to Python string methods vectorized across the entire column.
13.7.1 Common .str Methods
products = pd.DataFrame({
"product_code": ["SW-001", "HW-002", "SV-003", "SW-004"],
"product_name": [" Enterprise Suite ", "Laptop Pro 15", "consulting SERVICES", "Cloud Storage"],
"sku": ["sw001-a", "hw002-b", "sv003-c", "sw004-d"],
})
# Strip leading/trailing whitespace
products["product_name"] = products["product_name"].str.strip()
# Convert to title case
products["product_name"] = products["product_name"].str.title()
# Convert to uppercase / lowercase
products["sku_upper"] = products["sku"].str.upper()
# Check if string contains a pattern
products["is_software"] = products["product_code"].str.startswith("SW")
# Extract the category prefix (characters before the hyphen)
products["category"] = products["product_code"].str.split("-").str[0]
# Replace characters
products["sku_clean"] = products["sku"].str.replace("-", "_")
# String length
products["name_length"] = products["product_name"].str.len()
print(products)
13.7.2 Regular Expressions with .str
customer_data = pd.DataFrame({
"raw_phone": ["(555) 123-4567", "555.987.6543", "5551112222", "1-800-555-0199"],
"email": ["alice@acmecorp.com", "bob@gmail.com", "carol@acmecorp.com", "dave@yahoo.com"],
})
# Extract domain from email
customer_data["email_domain"] = customer_data["email"].str.extract(r"@(.+)$")
# Check if email is from Acme
customer_data["is_acme"] = customer_data["email"].str.contains(r"@acmecorp\.com", regex=True)
# Extract digits only from phone numbers
customer_data["phone_digits"] = customer_data["raw_phone"].str.replace(r"\D", "", regex=True)
print(customer_data)
13.7.3 Useful .str Methods Reference
| Method | Example Input | Result |
|---|---|---|
| .str.upper() | "hello" | "HELLO" |
| .str.lower() | "HELLO" | "hello" |
| .str.title() | "hello world" | "Hello World" |
| .str.strip() | "  hello  " | "hello" |
| .str.len() | "hello" | 5 |
| .str.contains("x") | "fox" | True |
| .str.startswith("x") | "xenon" | True |
| .str.endswith("x") | "fox" | True |
| .str.replace("a", "b") | "cat" | "cbt" |
| .str.split("-") | "a-b-c" | ["a", "b", "c"] |
| .str.extract(r"pattern") | regex capture group | captured string |
13.8 Date and Time Operations
Business data is almost always time-stamped, and date manipulation is one of the most common — and most error-prone — tasks in data analysis.
13.8.1 Parsing Dates with pd.to_datetime()
When you load data from a CSV, date columns arrive as strings. You need to convert them:
orders = pd.DataFrame({
"order_date": ["2024-01-15", "2024-02-03", "2024-03-22", "2024-04-10"],
"ship_date": ["01/18/2024", "02/06/2024", "03/25/2024", "04/14/2024"],
"revenue": [4500, 3200, 6700, 2900],
})
# pandas usually infers the format automatically
orders["order_date"] = pd.to_datetime(orders["order_date"])
# Specify format explicitly when parsing is ambiguous or slow
orders["ship_date"] = pd.to_datetime(orders["ship_date"], format="%m/%d/%Y")
print(orders.dtypes)
Output:
order_date datetime64[ns]
ship_date datetime64[ns]
revenue int64
dtype: object
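Real exports often contain malformed dates. Passing errors="coerce" to pd.to_datetime() converts unparseable values to NaT instead of raising, so you can count and inspect them before proceeding. A minimal sketch with a hypothetical bad value:

```python
import pandas as pd

raw_dates = pd.Series(["2024-01-15", "not a date", "2024-03-22"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(int(parsed.isna().sum()))  # 1 -- one bad value to investigate
```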
13.8.2 The .dt Accessor
Once a column is datetime type, the .dt accessor unlocks a library of date extraction methods:
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["month_name"] = orders["order_date"].dt.month_name()
orders["quarter"] = orders["order_date"].dt.quarter
orders["day_of_week"] = orders["order_date"].dt.dayofweek # Monday=0, Sunday=6
orders["day_name"] = orders["order_date"].dt.day_name()
orders["week_number"] = orders["order_date"].dt.isocalendar().week.astype(int)
orders["is_weekday"] = orders["order_date"].dt.dayofweek < 5
# Period: round to month or quarter
orders["month_period"] = orders["order_date"].dt.to_period("M")
orders["quarter_period"] = orders["order_date"].dt.to_period("Q")
print(orders[["order_date", "year", "month", "quarter", "day_name", "week_number"]])
13.8.3 Date Arithmetic
# Calculate fulfillment time in days
orders["fulfillment_days"] = (orders["ship_date"] - orders["order_date"]).dt.days
# Filter to a date range
jan_orders = orders[orders["order_date"].between("2024-01-01", "2024-01-31")]
# Add a fixed number of days (e.g., payment due date = order date + 30 days)
orders["payment_due"] = orders["order_date"] + pd.Timedelta(days=30)
# Time since today (days overdue)
today = pd.Timestamp("2024-05-01")
orders["days_since_order"] = (today - orders["order_date"]).dt.days
13.8.4 Grouping by Time Periods
# Group by month
monthly_revenue = orders.groupby(
orders["order_date"].dt.to_period("M")
)["revenue"].sum().reset_index()
monthly_revenue.columns = ["month", "revenue"]
# Group by quarter
quarterly_revenue = orders.groupby(
orders["order_date"].dt.to_period("Q")
)["revenue"].sum().reset_index()
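When the datetime column is set as the index, .resample() achieves the same monthly grouping and is often more convenient for time-series work. A minimal sketch with a few hypothetical orders:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-20", "2024-02-03"]),
    "revenue": [4500, 3200, 6700],
})

# .resample() requires a DatetimeIndex; "MS" bins by month start
monthly = (
    orders.set_index("order_date")["revenue"]
    .resample("MS")
    .sum()
)
print(monthly.tolist())  # [7700, 6700]
```

Unlike groupby on .dt.to_period(), resample also emits empty periods (as 0 for sum), which is useful when a report must show every month even if nothing was sold.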
13.9 Window Functions: .rolling() and .expanding()
Window functions compute statistics over a sliding or growing range of rows. They are essential for trend analysis, smoothing noisy data, and computing running totals.
13.9.1 .rolling() — Fixed-Size Window
A rolling window moves through the data, computing the statistic over the most recent N rows at each position:
monthly = pd.DataFrame({
"month": pd.date_range("2024-01-01", periods=12, freq="MS"),
"revenue": [42000, 38000, 51000, 55000, 47000, 60000,
63000, 58000, 67000, 71000, 65000, 79000],
})
# 3-month rolling average (smooths out noise)
monthly["rolling_3m_avg"] = monthly["revenue"].rolling(window=3).mean().round(0)
# 3-month rolling sum (useful for quarterly views)
monthly["rolling_3m_sum"] = monthly["revenue"].rolling(window=3).sum()
# Rolling standard deviation (volatility measure)
monthly["rolling_3m_std"] = monthly["revenue"].rolling(window=3).std().round(0)
print(monthly[["month", "revenue", "rolling_3m_avg", "rolling_3m_sum"]].head(6))
Output:
month revenue rolling_3m_avg rolling_3m_sum
0 2024-01-01 42000 NaN NaN
1 2024-02-01 38000 NaN NaN
2 2024-03-01 51000 43667.0 131000.0
3 2024-04-01 55000 48000.0 144000.0
4 2024-05-01 47000 51000.0 153000.0
5 2024-06-01 60000 54000.0 162000.0
The first two rows have NaN because a 3-month window cannot be computed with fewer than 3 data points. You can change this with the min_periods parameter.
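For example, min_periods=1 lets the early rows produce a partial-window average instead of NaN. A minimal sketch on the first four revenue values:

```python
import pandas as pd

revenue = pd.Series([42000, 38000, 51000, 55000])

# min_periods=1: average over however many rows are available so far
partial = revenue.rolling(window=3, min_periods=1).mean().round(2)
print(partial.tolist())  # [42000.0, 40000.0, 43666.67, 48000.0]
```

The first value is just itself, the second averages two rows, and from the third row on the full 3-row window applies.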
13.9.2 .expanding() — Cumulative Window
An expanding window grows from the beginning of the Series to include all preceding rows. It is ideal for running totals and cumulative averages:
# Year-to-date (YTD) cumulative revenue
monthly["ytd_revenue"] = monthly["revenue"].expanding().sum()
# Cumulative average (how does current month compare to all prior months?)
monthly["cumulative_avg"] = monthly["revenue"].expanding().mean().round(0)
print(monthly[["month", "revenue", "ytd_revenue", "cumulative_avg"]])
13.9.3 Month-over-Month Growth
# Percentage change from previous row
monthly["mom_growth_pct"] = monthly["revenue"].pct_change() * 100
# Year-over-year change (if you had 24 months, compare to 12 periods ago)
# monthly["yoy_growth_pct"] = monthly["revenue"].pct_change(periods=12) * 100
print(monthly[["month", "revenue", "mom_growth_pct"]].round(1))
13.10 Putting It All Together: Priya's Weekly Regional Report
This section traces Priya's complete workflow for building the weekly regional sales report that Sandra receives every Monday. It integrates everything covered so far in this chapter.
import pandas as pd
import numpy as np
# ── Load and clean raw transaction data ──────────────────────────────────────
raw = pd.DataFrame({
"transaction_date": [
"2024-01-08", "2024-01-09", "2024-01-10", "2024-01-12",
"2024-01-15", "2024-01-16", "2024-01-17", "2024-01-19",
"2024-01-22", "2024-01-23", "2024-01-24", "2024-01-26",
],
"customer_id": [
"C001", "C002", "C003", "C001",
"C004", "C002", "C005", "C003",
"C001", "C004", "C002", "C005",
],
"region": [
"North", "South", "West", "North",
"East", "South", "West", "West",
"North", "East", "South", "West",
],
"revenue": [
4500, 2300, 3800, 6700,
5100, 1900, 4200, 3300,
7800, 2700, 4400, 5600,
],
"cost": [
900, 1150, 760, 1340,
1020, 950, 840, 660,
1560, 540, 880, 1120,
],
})
customer_master = pd.DataFrame({
"customer_id": ["C001", "C002", "C003", "C004", "C005"],
"customer_name": [
"Goldfinch Inc", "Redwood LLC", "Birchwood Co",
"Magnolia Ltd", "Cypress Group",
],
"tier": ["Gold", "Silver", "Bronze", "Gold", "Silver"],
"account_manager": ["Alice", "Bob", "Carol", "Alice", "Bob"],
})
# ── Parse dates and derive time columns ──────────────────────────────────────
raw["transaction_date"] = pd.to_datetime(raw["transaction_date"])
raw["week_number"] = raw["transaction_date"].dt.isocalendar().week.astype(int)
raw["margin"] = raw["revenue"] - raw["cost"]
raw["margin_pct"] = (raw["margin"] / raw["revenue"] * 100).round(1)
# ── Merge with customer master (left join to keep all transactions) ───────────
enriched = pd.merge(raw, customer_master, on="customer_id", how="left")
# ── Regional weekly summary ───────────────────────────────────────────────────
weekly_regional = enriched.groupby(["region", "week_number"]).agg(
total_revenue=("revenue", "sum"),
total_margin=("margin", "sum"),
transaction_count=("revenue", "count"),
avg_order_value=("revenue", "mean"),
).round(2).reset_index()
weekly_regional["margin_pct"] = (
weekly_regional["total_margin"] / weekly_regional["total_revenue"] * 100
).round(1)
print("=== Weekly Regional Summary ===")
print(weekly_regional.to_string(index=False))
# ── Pivot to region × week matrix ────────────────────────────────────────────
pivot_view = pd.pivot_table(
enriched,
values="revenue",
index="region",
columns="week_number",
aggfunc="sum",
fill_value=0,
margins=True,
margins_name="Total",
)
print("\n=== Revenue by Region × Week ===")
print(pivot_view)
# ── Customer tier analysis ────────────────────────────────────────────────────
tier_summary = enriched.groupby("tier").agg(
customer_count=("customer_id", "nunique"),
total_revenue=("revenue", "sum"),
avg_revenue_per_transaction=("revenue", "mean"),
avg_margin_pct=("margin_pct", "mean"),
).round(2).reset_index()
tier_summary["revenue_share_pct"] = (
tier_summary["total_revenue"] / tier_summary["total_revenue"].sum() * 100
).round(1)
print("\n=== Customer Tier Analysis ===")
print(tier_summary.to_string(index=False))
This script goes from raw transaction data to three distinct deliverables — the weekly regional table, the pivot matrix, and the tier analysis — in a few dozen lines of clean, readable code once the input data is in place.
13.11 Chapter Summary
In this chapter, you mastered the core toolkit for transforming and aggregating business data with pandas:
- .apply() and .map() enable element-wise transformations — from simple value lookups to complex conditional logic.
- .groupby() and .agg() implement the split-apply-combine pattern, letting you summarize data across any dimension with multiple statistics in a single call.
- pd.pivot_table() builds two-dimensional summaries with optional margins and multiple aggregation functions.
- .merge(), .join(), and .concat() combine DataFrames — merge for column-based joins, join for index-based joins, concat for stacking.
- .melt() and .pivot() reshape data between wide and long formats to match the requirements of different tools and analyses.
- The .str accessor provides vectorized string operations — cleaning, extracting, and filtering text columns.
- pd.to_datetime() and the .dt accessor unlock the full power of time-series analysis: extracting components, computing differences, and grouping by period.
- .rolling() and .expanding() compute window statistics — moving averages, cumulative totals, and growth rates.
The combination of these tools covers the vast majority of data transformation work that business analysts perform every day. Practice them on real data, and they will become second nature.
Key Terms
Aggregation — The process of computing a summary statistic (sum, mean, count, etc.) over a group of values.
GroupBy — A pandas object that holds the result of splitting a DataFrame by one or more columns, before the apply and combine steps.
Named Aggregation — A pandas 0.25+ syntax where you specify both the output column name and the aggregation function in a single .agg() call.
Pivot Table — A two-dimensional summary table where row and column labels come from the data and cell values are aggregated.
Long Format — A data shape with one row per observation and one column per variable; preferred by most analytical tools.
Wide Format — A data shape with one row per entity and multiple columns representing different time periods or categories.
Rolling Window — A computation over a fixed-size, moving subset of rows.
Expanding Window — A computation over all rows from the beginning up to the current row (cumulative).
.str Accessor — A pandas object that provides vectorized string methods for Series with string dtype.
.dt Accessor — A pandas object that provides date/time component extraction and manipulation methods for datetime Series.