Chapter 13: Transforming and Aggregating Business Data

Learning Objectives

By the end of this chapter, you will be able to:

  • Apply element-wise transformations to DataFrame columns using .apply() and .map()
  • Group data by one or more columns and compute aggregate statistics with .groupby() and .agg()
  • Build pivot tables to summarize data across two dimensions
  • Combine DataFrames using .merge(), .join(), and .concat()
  • Reshape data between wide and long formats using .melt() and .pivot()
  • Work with string data using the .str accessor
  • Parse and extract date components using pd.to_datetime() and the .dt accessor
  • Calculate rolling and expanding window statistics

13.1 Why Transformation and Aggregation Matter

Raw business data rarely arrives in the shape you need. A sales export from your CRM might have one row per transaction, but your VP wants a table showing revenue by region and week. A time-tracking system might record hours at the project-task level, but a client invoice needs totals. An HR system might store salaries as annual figures, but your analysis needs monthly equivalents.

Priya Okonkwo, business analyst at Acme Corp, spends a significant portion of every Monday morning doing exactly this kind of work. She receives a flat transaction file from the data warehouse and transforms it into the regional sales summary that Sandra Chen, VP of Sales, reviews at 9 a.m. Before learning pandas, Priya did this in Excel — pivot tables, VLOOKUP, manual filters. It took two hours. With pandas, it takes about ten minutes, and the process is fully reproducible.

This chapter covers the core pandas toolkit for transformation and aggregation. By the end, you will be able to reshape data as fluidly as an experienced analyst.


13.2 Element-wise Transformations: .apply() and .map()

13.2.1 The .apply() Method

.apply() lets you run a function against every element in a Series, or against every row or column of a DataFrame. It is one of the most flexible tools in pandas.

Applying a lambda to a column

Suppose Priya has a DataFrame of sales transactions and wants to add a column that categorizes order size:

import pandas as pd

sales = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "revenue": [450.00, 12300.00, 875.50, 5200.00, 320.00],
    "region": ["North", "West", "North", "East", "West"],
})

# Categorize order size with a lambda
sales["order_size"] = sales["revenue"].apply(
    lambda x: "Large" if x >= 5000 else ("Medium" if x >= 1000 else "Small")
)

print(sales[["order_id", "revenue", "order_size"]])

Output:

   order_id    revenue order_size
0      1001     450.00      Small
1      1002   12300.00      Large
2      1003     875.50      Small
3      1004    5200.00      Large
4      1005     320.00      Small

The lambda receives one value at a time and returns one value. This is the simplest form of .apply().

Applying a named function

When your logic becomes too complex for a lambda, define a proper function:

def commission_rate(revenue):
    """Return the commission rate based on revenue tier."""
    if revenue >= 10000:
        return 0.10
    elif revenue >= 5000:
        return 0.07
    elif revenue >= 1000:
        return 0.05
    else:
        return 0.03


sales["commission_rate"] = sales["revenue"].apply(commission_rate)
sales["commission_amount"] = sales["revenue"] * sales["commission_rate"]

print(sales[["order_id", "revenue", "commission_rate", "commission_amount"]])

Output:

   order_id    revenue  commission_rate  commission_amount
0      1001     450.00             0.03              13.50
1      1002   12300.00             0.10            1230.00
2      1003     875.50             0.03              26.27
3      1004    5200.00             0.07             364.00
4      1005     320.00             0.03               9.60

Applying a function across rows (axis=1)

Sometimes you need to look at multiple columns together. Use axis=1 to pass each row as a Series:

sales["rep"] = ["Alice", "Bob", "Alice", "Carol", "Bob"]
sales["quota"] = [400.00, 10000.00, 800.00, 6000.00, 300.00]

def quota_status(row):
    """Return whether a rep met, exceeded, or missed quota."""
    ratio = row["revenue"] / row["quota"]
    if ratio >= 1.10:
        return "Exceeded"
    elif ratio >= 1.00:
        return "Met"
    else:
        return "Missed"

sales["quota_status"] = sales.apply(quota_status, axis=1)
print(sales[["order_id", "rep", "revenue", "quota", "quota_status"]])

Output:

   order_id    rep    revenue    quota quota_status
0      1001  Alice     450.00    400.00     Exceeded
1      1002    Bob   12300.00  10000.00     Exceeded
2      1003  Alice     875.50    800.00          Met
3      1004  Carol    5200.00   6000.00       Missed
4      1005    Bob     320.00    300.00     Exceeded

Performance note: .apply() with axis=1 is slower than vectorized operations because it calls Python for each row. For large datasets, prefer vectorized pandas or NumPy operations when possible. Use .apply() when the logic is genuinely complex or conditional.
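To make that comparison concrete, here is a sketch of a vectorized rewrite of the order-size lambda from earlier, using NumPy's np.select, which evaluates a list of boolean conditions in order and picks the first match (the small sales frame is re-created just for the example):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "revenue": [450.00, 12300.00, 875.50, 5200.00, 320.00],
})

# Conditions are checked in order; the first True wins.
conditions = [
    sales["revenue"] >= 5000,
    sales["revenue"] >= 1000,
]
choices = ["Large", "Medium"]
sales["order_size"] = np.select(conditions, choices, default="Small")

print(sales["order_size"].tolist())
```

Because np.select operates on whole arrays at once, it avoids the per-row Python function calls that make .apply() slow on large frames.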

13.2.2 The .map() Method

.map() works on a Series and is designed for element-wise substitution. You can pass it a dictionary to replace values, another Series to act as a lookup table, or a function.

Dictionary mapping — replacing codes with labels

region_names = {
    "N": "North",
    "S": "South",
    "E": "East",
    "W": "West",
}

orders = pd.DataFrame({
    "order_id": [2001, 2002, 2003, 2004],
    "region_code": ["N", "W", "E", "N"],
    "revenue": [1200, 4500, 800, 3300],
})

orders["region_name"] = orders["region_code"].map(region_names)
print(orders)

Output:

   order_id region_code  revenue region_name
0      2001           N     1200       North
1      2002           W     4500        West
2      2003           E      800        East
3      2004           N     3300       North

If a key is missing from the dictionary, .map() returns NaN for that row — which is a useful signal that you have unmapped codes.
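A quick sketch of how you might audit for unmapped codes before they silently propagate downstream (the frame below deliberately includes an unknown code "X"):

```python
import pandas as pd

region_names = {"N": "North", "S": "South", "E": "East", "W": "West"}

orders = pd.DataFrame({
    "order_id": [2001, 2002, 2003],
    "region_code": ["N", "X", "E"],  # "X" has no entry in the mapping
})

orders["region_name"] = orders["region_code"].map(region_names)

# Rows where .map() produced NaN reveal the unmapped codes
unmapped = orders[orders["region_name"].isna()]
print(unmapped["region_code"].tolist())
```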

Mapping with a Series

You can also map using a Series where the index is the lookup key:

rep_territory = pd.Series({
    "Alice": "Northeast",
    "Bob": "West Coast",
    "Carol": "Midwest",
})

sales["territory"] = sales["rep"].map(rep_territory)

.map() vs .apply()

Feature        .map()                         .apply()
Works on       Series only                    Series or DataFrame
Typical use    Element substitution/lookup    Conditional logic, multi-column
Accepts        dict, Series, function         function (lambda or named)
NaN handling   Returns NaN for missing keys   Depends on your function

13.3 Grouping and Aggregating: .groupby() and .agg()

13.3.1 The Split-Apply-Combine Paradigm

Hadley Wickham, the statistician who designed R's dplyr, described the core idea as "split-apply-combine":

  1. Split — divide the DataFrame into groups based on one or more columns
  2. Apply — compute a statistic or transformation within each group
  3. Combine — reassemble the results into a new DataFrame

pandas implements this through .groupby(). Priya uses this pattern every single day.

13.3.2 Basic .groupby() Usage

import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-05", "2024-01-08", "2024-01-12", "2024-01-15",
        "2024-01-18", "2024-01-22", "2024-01-25", "2024-01-28",
    ]),
    "region": ["North", "South", "North", "West",
               "South", "North", "West", "South"],
    "product_category": ["Software", "Hardware", "Software", "Services",
                         "Hardware", "Services", "Software", "Software"],
    "revenue": [4500, 2300, 6800, 3200, 1900, 5100, 4700, 3600],
    "cost": [900, 1400, 1360, 960, 1200, 1020, 940, 720],
})

transactions["margin"] = transactions["revenue"] - transactions["cost"]

# Group by region and sum revenue
regional_revenue = transactions.groupby("region")["revenue"].sum()
print(regional_revenue)

Output:

region
North    16400
South     7800
West      7900
Name: revenue, dtype: int64

The result is a Series with the group keys as the index. You can call .reset_index() to turn it into a regular DataFrame:

regional_revenue = (
    transactions
    .groupby("region")["revenue"]
    .sum()
    .reset_index()
    .rename(columns={"revenue": "total_revenue"})
)
print(regional_revenue)

Output:

  region  total_revenue
0  North          16400
1  South           7800
2   West           7900

13.3.3 Common Aggregation Functions

pandas supports a wide range of built-in aggregation functions you can call directly on a GroupBy object:

Method      Description
.sum()      Sum of values
.mean()     Arithmetic mean
.median()   Median
.count()    Count of non-null values
.nunique()  Count of unique values
.min()      Minimum
.max()      Maximum
.std()      Standard deviation
.first()    First value in the group
.last()     Last value in the group

# Multiple aggregations in sequence
print("Revenue by region:")
print(transactions.groupby("region")["revenue"].mean().round(2))

print("\nTransaction count by region:")
print(transactions.groupby("region")["revenue"].count())

print("\nRevenue range by product category:")
print(transactions.groupby("product_category")[["revenue"]].agg(["min", "max"]))

13.3.4 .agg() — Multiple Aggregations at Once

.agg() (short for .aggregate()) is the powerhouse for building summary tables. You can compute multiple statistics for multiple columns in a single call.

Basic .agg() with a list of functions

summary = transactions.groupby("region")[["revenue", "margin"]].agg(
    ["sum", "mean", "count"]
)
print(summary)

Output:

         revenue                   margin
             sum      mean count      sum      mean count
region
North      16400  5466.667     3    13120  4373.333     3
South       7800  2600.000     3     4480  1493.333     3
West        7900  3950.000     2     6000  3000.000     2

The result has a MultiIndex on the columns. To flatten it:

summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()
print(summary.columns.tolist())
# ['region', 'revenue_sum', 'revenue_mean', 'revenue_count',
#  'margin_sum', 'margin_mean', 'margin_count']

Named aggregations — the cleanest approach

pandas 0.25+ supports named aggregations with pd.NamedAgg, which lets you control output column names directly:

summary = transactions.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    transaction_count=("revenue", "count"),
    total_margin=("margin", "sum"),
    avg_margin=("margin", "mean"),
).round(2).reset_index()

print(summary)

Output:

  region  total_revenue  avg_revenue  transaction_count  total_margin  avg_margin
0  North          16400      5466.67                  3         13120     4373.33
1  South           7800      2600.00                  3          4480     1493.33
2   West           7900      3950.00                  2          6000     3000.00

This is Priya's preferred syntax — the output column names are self-documenting, and there is no MultiIndex to flatten.

Named aggregations with custom functions

You can pass any callable as the aggregation function:

import numpy as np

def revenue_range(series):
    """Return the difference between max and min revenue."""
    return series.max() - series.min()

def top_quartile_avg(series):
    """Return the mean of the top 25% of values."""
    threshold = series.quantile(0.75)
    return series[series >= threshold].mean()

summary = transactions.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    revenue_spread=("revenue", revenue_range),
    top_quartile_avg=("revenue", top_quartile_avg),
).reset_index()

print(summary)

13.3.5 Grouping by Multiple Columns

# Group by region AND product category
category_summary = transactions.groupby(
    ["region", "product_category"]
).agg(
    total_revenue=("revenue", "sum"),
    deal_count=("revenue", "count"),
).reset_index()

print(category_summary)

Output:

  region product_category  total_revenue  deal_count
0  North         Services           5100           1
1  North         Software          11300           2
2  South         Hardware           4200           2
3  South         Software           3600           1
4   West         Services           3200           1
5   West         Software           4700           1

13.3.6 The transform() Method

While .agg() reduces groups to single rows, .transform() returns a Series with the same length as the original DataFrame, making it ideal for adding group-level statistics as new columns:

# Add a column showing each region's total revenue (for percentage calculation)
transactions["region_total"] = transactions.groupby("region")["revenue"].transform("sum")
transactions["pct_of_region"] = (
    transactions["revenue"] / transactions["region_total"] * 100
).round(1)

print(transactions[["region", "revenue", "region_total", "pct_of_region"]])
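A common follow-on pattern is to compare each row against its own group's statistic. A minimal sketch, using a small stand-in frame:

```python
import pandas as pd

transactions = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [4500, 6800, 2300, 1900],
})

# .transform("mean") broadcasts each region's average back to every row,
# so the comparison lines up element by element
region_avg = transactions.groupby("region")["revenue"].transform("mean")
transactions["above_region_avg"] = transactions["revenue"] > region_avg

print(transactions)
```

Because .transform() preserves the original row count, the result can also be used directly as a boolean mask to filter, say, only the above-average deals in each region.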

13.4 Pivot Tables with pd.pivot_table()

A pivot table is a two-dimensional summary: rows represent one dimension (e.g., region), columns represent another (e.g., product category), and cells contain an aggregated value (e.g., total revenue). It is one of the most requested deliverables in business analytics.

13.4.1 Basic Pivot Table

pivot = pd.pivot_table(
    transactions,
    values="revenue",
    index="region",
    columns="product_category",
    aggfunc="sum",
    fill_value=0,  # Replace NaN with 0 where a combination doesn't exist
)

print(pivot)

Output:

product_category  Hardware  Services  Software
region
North                    0      5100     11300
South                 4200         0      3600
West                     0      3200      4700

13.4.2 Pivot Table with Multiple Values and Margins

pivot_full = pd.pivot_table(
    transactions,
    values=["revenue", "margin"],
    index="region",
    columns="product_category",
    aggfunc="sum",
    fill_value=0,
    margins=True,           # Add row and column totals
    margins_name="Total",   # Label for the totals row/column
)

print(pivot_full)

13.4.3 Pivot Table with Multiple Aggregation Functions

pivot_multi = pd.pivot_table(
    transactions,
    values="revenue",
    index="region",
    columns="product_category",
    aggfunc=["sum", "count"],
    fill_value=0,
)

print(pivot_multi)

13.4.4 Resetting the Pivot Table Index

After creating a pivot table, you often want to export it or work with it as a flat DataFrame:

pivot_flat = pivot.reset_index()
pivot_flat.columns.name = None  # Remove the axis label "product_category"
print(pivot_flat)

13.5 Combining DataFrames: .merge(), .join(), .concat()

13.5.1 The Business Case for Joining Data

Priya's transaction data contains a customer_id column, but not the customer's name, tier (Gold/Silver/Bronze), or account manager. That information lives in a separate customer master table. To enrich her analysis, she needs to join the two tables — exactly the same operation as a SQL JOIN.

13.5.2 .merge() — The Primary Join Tool

transactions_extended = pd.DataFrame({
    "order_id": [3001, 3002, 3003, 3004, 3005, 3006],
    "customer_id": ["C001", "C002", "C001", "C003", "C002", "C004"],
    "revenue": [5400, 2100, 8900, 3300, 6700, 1200],
    "region": ["North", "South", "North", "West", "South", "East"],
})

customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C005"],
    "customer_name": ["Goldfinch Inc", "Redwood LLC", "Birchwood Co", "Maple Corp"],
    "tier": ["Gold", "Silver", "Bronze", "Gold"],
    "account_manager": ["Alice", "Bob", "Carol", "Alice"],
})

# Inner join: only rows where customer_id exists in BOTH tables
merged_inner = pd.merge(
    transactions_extended,
    customers,
    on="customer_id",
    how="inner",
)
print("Inner join — rows:", len(merged_inner))
print(merged_inner)

Output:

Inner join — rows: 5
   order_id customer_id  revenue region customer_name    tier account_manager
0      3001        C001     5400  North  Goldfinch Inc    Gold           Alice
1      3002        C002     2100  South    Redwood LLC  Silver             Bob
2      3003        C001     8900  North  Goldfinch Inc    Gold           Alice
3      3004        C003     3300   West   Birchwood Co  Bronze           Carol
4      3005        C002     6700  South    Redwood LLC  Silver             Bob

Notice that order 3006 (customer C004) and customer C005 are both missing — C004 has no entry in the customer master, and C005 has no transactions.

The four join types

# Left join: all rows from transactions_extended, matching rows from customers
merged_left = pd.merge(
    transactions_extended,
    customers,
    on="customer_id",
    how="left",
)
# Result: 6 rows; order 3006 appears with NaN in customer columns

# Right join: all rows from customers, matching rows from transactions_extended
merged_right = pd.merge(
    transactions_extended,
    customers,
    on="customer_id",
    how="right",
)
# Result: 6 rows; C005 appears with NaN in transaction columns

# Outer join: all rows from both tables
merged_outer = pd.merge(
    transactions_extended,
    customers,
    on="customer_id",
    how="outer",
)
# Result: 7 rows; C004 and C005 appear with NaN in missing columns

Join Type  Keeps                            Use When
inner      Only matching rows               You only want rows that exist in both tables
left       All left rows + matching right   You want all transactions even if customer data is missing
right      All right rows + matching left   You want all customers even if they have no transactions
outer      All rows from both tables        You want a complete picture with all gaps visible
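When auditing which side of a join each row came from, pd.merge() accepts indicator=True, which adds a _merge column with values both, left_only, or right_only. A small sketch on trimmed-down versions of the tables above:

```python
import pandas as pd

transactions_extended = pd.DataFrame({
    "order_id": [3001, 3006],
    "customer_id": ["C001", "C004"],
})

customers = pd.DataFrame({
    "customer_id": ["C001", "C005"],
    "tier": ["Gold", "Gold"],
})

audit = pd.merge(
    transactions_extended,
    customers,
    on="customer_id",
    how="outer",
    indicator=True,  # adds "_merge": both / left_only / right_only
)
print(audit["_merge"].value_counts())
```

Filtering on _merge == "left_only" is a quick way to list transactions with no match in the customer master.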

Merging on columns with different names

orders_alt = pd.DataFrame({
    "order_num": [4001, 4002],
    "cust_code": ["C001", "C002"],
    "amount": [7200, 3100],
})

merged_diff_names = pd.merge(
    orders_alt,
    customers,
    left_on="cust_code",
    right_on="customer_id",
    how="left",
)

Merging on multiple columns

# Imagine you need to match on both customer_id and region.
# (left_df and right_df are placeholders for any two DataFrames
# that share those columns.)
pd.merge(
    left_df,
    right_df,
    on=["customer_id", "region"],
    how="inner",
)

13.5.3 .join() — Index-based Joining

.join() merges on the DataFrame index rather than a column. It is a shortcut for index-to-index joins:

customers_indexed = customers.set_index("customer_id")
transactions_indexed = transactions_extended.set_index("customer_id")

joined = transactions_indexed.join(customers_indexed, how="left")
joined = joined.reset_index()

In practice, most business analysts use .merge() because data rarely arrives with meaningful indexes. .join() is most useful when you have pre-indexed lookup tables.

13.5.4 .concat() — Stacking DataFrames

.concat() stacks DataFrames either vertically (adding more rows) or horizontally (adding more columns). The most common use in business analysis is combining data from multiple periods or sources.

Vertical concatenation (stacking rows)

q1_sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1"],
    "region": ["North", "South", "West"],
    "revenue": [45000, 32000, 28000],
})

q2_sales = pd.DataFrame({
    "quarter": ["Q2", "Q2", "Q2"],
    "region": ["North", "South", "West"],
    "revenue": [52000, 35000, 31000],
})

q3_sales = pd.DataFrame({
    "quarter": ["Q3", "Q3", "Q3"],
    "region": ["North", "South", "West"],
    "revenue": [48000, 38000, 34000],
})

full_year = pd.concat([q1_sales, q2_sales, q3_sales], ignore_index=True)
print(full_year)

Output:

  quarter region  revenue
0      Q1  North    45000
1      Q1  South    32000
2      Q1   West    28000
3      Q2  North    52000
4      Q2  South    35000
5      Q2   West    31000
6      Q3  North    48000
7      Q3  South    38000
8      Q3   West    34000

ignore_index=True resets the index to a clean 0-based sequence. Without it, you would get duplicate index values from the original DataFrames.
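If the source frames lack a column identifying the period, the keys= parameter of pd.concat() can tag each frame as it is stacked. A minimal sketch with two one-row frames:

```python
import pandas as pd

q1 = pd.DataFrame({"region": ["North"], "revenue": [45000]})
q2 = pd.DataFrame({"region": ["North"], "revenue": [52000]})

# keys= labels each source frame in a new index level named "quarter";
# reset_index then turns that level into an ordinary column
combined = (
    pd.concat([q1, q2], keys=["Q1", "Q2"], names=["quarter", None])
    .reset_index(level="quarter")
    .reset_index(drop=True)
)
print(combined)
```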

Horizontal concatenation (stacking columns)

revenue_data = pd.DataFrame({"revenue": [45000, 52000, 48000]})
cost_data = pd.DataFrame({"cost": [18000, 22000, 19000]})

combined = pd.concat([revenue_data, cost_data], axis=1)
combined["margin"] = combined["revenue"] - combined["cost"]
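One caveat: axis=1 concatenation aligns rows by index label, not by position. A short sketch of the gotcha and the usual fix:

```python
import pandas as pd

revenue = pd.DataFrame({"revenue": [100, 200]}, index=[0, 1])
cost = pd.DataFrame({"cost": [40, 90]}, index=[1, 2])  # shifted index

# Rows are matched by index label, so only label 1 lines up;
# the rest become NaN
misaligned = pd.concat([revenue, cost], axis=1)
print(misaligned)

# Reset both indexes first if you intend positional pairing
aligned = pd.concat(
    [revenue.reset_index(drop=True), cost.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```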

13.6 Reshaping Data: .melt() and .pivot()

13.6.1 Wide vs. Long Format

Business data often arrives in "wide" format — one row per entity, with multiple columns representing different time periods or categories. Analytical tools, including many charting libraries and statistical models, prefer "long" format — one row per observation.

Wide format (one row per region, columns are months):

region    Jan    Feb    Mar
North   45000  52000  48000
South   32000  35000  38000
West    28000  31000  34000

Long format (one row per region-month combination):

region  month  revenue
North     Jan    45000
North     Feb    52000
North     Mar    48000
South     Jan    32000
...

13.6.2 .melt() — Wide to Long

wide_sales = pd.DataFrame({
    "region": ["North", "South", "West"],
    "Jan": [45000, 32000, 28000],
    "Feb": [52000, 35000, 31000],
    "Mar": [48000, 38000, 34000],
})

long_sales = wide_sales.melt(
    id_vars="region",      # Columns to keep as-is
    var_name="month",      # Name for the new column holding old column names
    value_name="revenue",  # Name for the new column holding values
)

print(long_sales.sort_values(["region", "month"]).reset_index(drop=True))

Output:

   region month  revenue
0   North   Feb    52000
1   North   Jan    45000
2   North   Mar    48000
3   South   Feb    35000
4   South   Jan    32000
5   South   Mar    38000
6    West   Feb    31000
7    West   Jan    28000
8    West   Mar    34000

13.6.3 .pivot() — Long to Wide

.pivot() is the inverse of .melt():

wide_again = long_sales.pivot(
    index="region",
    columns="month",
    values="revenue",
)
wide_again.columns.name = None  # Clean up the axis label
wide_again = wide_again.reset_index()
print(wide_again)

Output:

  region    Feb    Jan    Mar
0  North  52000  45000  48000
1  South  35000  32000  38000
2   West  31000  28000  34000

.pivot() vs pd.pivot_table(): Use .pivot() when there is exactly one value per row/column combination. Use pd.pivot_table() when there might be duplicates and you need an aggregation function to resolve them.
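The distinction is easy to demonstrate. In the sketch below, a duplicated region/month pair makes .pivot() raise a ValueError, while pd.pivot_table() resolves the collision with an aggregation function:

```python
import pandas as pd

dup = pd.DataFrame({
    "region": ["North", "North"],
    "month": ["Jan", "Jan"],   # duplicate region/month combination
    "revenue": [1000, 2000],
})

# .pivot() cannot decide which of the two values to keep
try:
    dup.pivot(index="region", columns="month", values="revenue")
except ValueError as err:
    print("pivot failed:", err)

# pd.pivot_table() collapses the duplicates with aggfunc
resolved = pd.pivot_table(
    dup, index="region", columns="month",
    values="revenue", aggfunc="sum",
)
print(resolved)
```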


13.7 String Operations: The .str Accessor

When working with text columns, pandas provides the .str accessor, which gives you access to Python string methods vectorized across the entire column.

13.7.1 Common .str Methods

products = pd.DataFrame({
    "product_code": ["SW-001", "HW-002", "SV-003", "SW-004"],
    "product_name": ["  Enterprise Suite  ", "Laptop Pro 15", "consulting SERVICES", "Cloud Storage"],
    "sku": ["sw001-a", "hw002-b", "sv003-c", "sw004-d"],
})

# Strip leading/trailing whitespace
products["product_name"] = products["product_name"].str.strip()

# Convert to title case
products["product_name"] = products["product_name"].str.title()

# Convert to uppercase / lowercase
products["sku_upper"] = products["sku"].str.upper()

# Check if string contains a pattern
products["is_software"] = products["product_code"].str.startswith("SW")

# Extract the category prefix (characters before the hyphen)
products["category"] = products["product_code"].str.split("-").str[0]

# Replace characters
products["sku_clean"] = products["sku"].str.replace("-", "_")

# String length
products["name_length"] = products["product_name"].str.len()

print(products)

13.7.2 Regular Expressions with .str

customer_data = pd.DataFrame({
    "raw_phone": ["(555) 123-4567", "555.987.6543", "5551112222", "1-800-555-0199"],
    "email": ["alice@acmecorp.com", "bob@gmail.com", "carol@acmecorp.com", "dave@yahoo.com"],
})

# Extract domain from email
customer_data["email_domain"] = customer_data["email"].str.extract(r"@(.+)$")

# Check if email is from Acme
customer_data["is_acme"] = customer_data["email"].str.contains(r"@acmecorp\.com", regex=True)

# Extract digits only from phone numbers
customer_data["phone_digits"] = customer_data["raw_phone"].str.replace(r"\D", "", regex=True)

print(customer_data)

13.7.3 Useful .str Methods Reference

Method                   Example             Result
.str.upper()             "hello"             "HELLO"
.str.lower()             "HELLO"             "hello"
.str.title()             "hello world"       "Hello World"
.str.strip()             " hello "           "hello"
.str.len()               "hello"             5
.str.contains("x")       "fox"               True
.str.startswith("x")     "xenon"             True
.str.endswith("x")       "fox"               True
.str.replace("a", "b")   "cat"               "cbt"
.str.split("-")          "a-b-c"             ["a", "b", "c"]
.str.extract(r"@(.+)$")  "bob@gmail.com"     "gmail.com"

13.8 Date and Time Operations

Business data is almost always time-stamped, and date manipulation is one of the most common — and most error-prone — tasks in data analysis.

13.8.1 Parsing Dates with pd.to_datetime()

When you load data from a CSV, date columns arrive as strings. You need to convert them:

orders = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-02-03", "2024-03-22", "2024-04-10"],
    "ship_date": ["01/18/2024", "02/06/2024", "03/25/2024", "04/14/2024"],
    "revenue": [4500, 3200, 6700, 2900],
})

# pandas usually infers the format automatically
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Specify format explicitly when parsing is ambiguous or slow
orders["ship_date"] = pd.to_datetime(orders["ship_date"], format="%m/%d/%Y")

print(orders.dtypes)

Output:

order_date    datetime64[ns]
ship_date     datetime64[ns]
revenue                int64
dtype: object

13.8.2 The .dt Accessor

Once a column is datetime type, the .dt accessor unlocks a library of date extraction methods:

orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["month_name"] = orders["order_date"].dt.month_name()
orders["quarter"] = orders["order_date"].dt.quarter
orders["day_of_week"] = orders["order_date"].dt.dayofweek  # Monday=0, Sunday=6
orders["day_name"] = orders["order_date"].dt.day_name()
orders["week_number"] = orders["order_date"].dt.isocalendar().week.astype(int)
orders["is_weekday"] = orders["order_date"].dt.dayofweek < 5

# Period: round to month or quarter
orders["month_period"] = orders["order_date"].dt.to_period("M")
orders["quarter_period"] = orders["order_date"].dt.to_period("Q")

print(orders[["order_date", "year", "month", "quarter", "day_name", "week_number"]])

13.8.3 Date Arithmetic

# Calculate fulfillment time in days
orders["fulfillment_days"] = (orders["ship_date"] - orders["order_date"]).dt.days

# Filter to a date range
jan_orders = orders[orders["order_date"].between("2024-01-01", "2024-01-31")]

# Add a fixed number of days (e.g., payment due date = order date + 30 days)
orders["payment_due"] = orders["order_date"] + pd.Timedelta(days=30)

# Time since today (days overdue)
today = pd.Timestamp("2024-05-01")
orders["days_since_order"] = (today - orders["order_date"]).dt.days

13.8.4 Grouping by Time Periods

# Group by month
monthly_revenue = orders.groupby(
    orders["order_date"].dt.to_period("M")
)["revenue"].sum().reset_index()
monthly_revenue.columns = ["month", "revenue"]

# Group by quarter
quarterly_revenue = orders.groupby(
    orders["order_date"].dt.to_period("Q")
)["revenue"].sum().reset_index()
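An alternative worth knowing is .resample(), pandas' time-series-aware counterpart to .groupby(); on a DataFrame it accepts an on= keyword pointing at a datetime column. A minimal sketch:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-20", "2024-02-03"]),
    "revenue": [4500, 3200, 6700],
})

# "MS" buckets rows into calendar months labeled by month start
monthly = orders.resample("MS", on="order_date")["revenue"].sum()
print(monthly)
```

Unlike grouping on .dt.to_period(), .resample() also emits empty periods (with a sum of 0), which is often what you want for a gap-free monthly chart.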

13.9 Window Functions: .rolling() and .expanding()

Window functions compute statistics over a sliding or growing range of rows. They are essential for trend analysis, smoothing noisy data, and computing running totals.

13.9.1 .rolling() — Fixed-Size Window

A rolling window moves through the data, computing the statistic over the most recent N rows at each position:

monthly = pd.DataFrame({
    "month": pd.date_range("2024-01-01", periods=12, freq="MS"),
    "revenue": [42000, 38000, 51000, 55000, 47000, 60000,
                63000, 58000, 67000, 71000, 65000, 79000],
})

# 3-month rolling average (smooths out noise)
monthly["rolling_3m_avg"] = monthly["revenue"].rolling(window=3).mean().round(0)

# 3-month rolling sum (useful for quarterly views)
monthly["rolling_3m_sum"] = monthly["revenue"].rolling(window=3).sum()

# Rolling standard deviation (volatility measure)
monthly["rolling_3m_std"] = monthly["revenue"].rolling(window=3).std().round(0)

print(monthly[["month", "revenue", "rolling_3m_avg", "rolling_3m_sum"]].head(6))

Output:

       month  revenue  rolling_3m_avg  rolling_3m_sum
0 2024-01-01    42000             NaN             NaN
1 2024-02-01    38000             NaN             NaN
2 2024-03-01    51000         43667.0        131000.0
3 2024-04-01    55000         48000.0        144000.0
4 2024-05-01    47000         51000.0        153000.0
5 2024-06-01    60000         54000.0        162000.0

The first two rows have NaN because a 3-month window cannot be computed with fewer than 3 data points. You can change this with the min_periods parameter.
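A short sketch of the effect of min_periods, using the first few revenue values from above:

```python
import pandas as pd

revenue = pd.Series([42000, 38000, 51000, 55000])

# min_periods=1 produces a value as soon as one observation is available,
# so the leading rows show partial-window averages instead of NaN
partial = revenue.rolling(window=3, min_periods=1).mean()
print(partial.tolist())
```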

13.9.2 .expanding() — Cumulative Window

An expanding window grows from the beginning of the Series to include all preceding rows. It is ideal for running totals and cumulative averages:

# Year-to-date (YTD) cumulative revenue
monthly["ytd_revenue"] = monthly["revenue"].expanding().sum()

# Cumulative average (how does current month compare to all prior months?)
monthly["cumulative_avg"] = monthly["revenue"].expanding().mean().round(0)

print(monthly[["month", "revenue", "ytd_revenue", "cumulative_avg"]])

13.9.3 Month-over-Month Growth

# Percentage change from previous row
monthly["mom_growth_pct"] = monthly["revenue"].pct_change() * 100

# Year-over-year change (if you had 24 months, compare to 12 periods ago)
# monthly["yoy_growth_pct"] = monthly["revenue"].pct_change(periods=12) * 100

print(monthly[["month", "revenue", "mom_growth_pct"]].round(1))

13.10 Putting It All Together: Priya's Weekly Regional Report

This section traces Priya's complete workflow for building the weekly regional sales report that Sandra receives every Monday. It integrates everything covered so far in this chapter.

import pandas as pd
import numpy as np

# ── Load and clean raw transaction data ──────────────────────────────────────
raw = pd.DataFrame({
    "transaction_date": [
        "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-12",
        "2024-01-15", "2024-01-16", "2024-01-17", "2024-01-19",
        "2024-01-22", "2024-01-23", "2024-01-24", "2024-01-26",
    ],
    "customer_id": [
        "C001", "C002", "C003", "C001",
        "C004", "C002", "C005", "C003",
        "C001", "C004", "C002", "C005",
    ],
    "region": [
        "North", "South", "West", "North",
        "East", "South", "West", "West",
        "North", "East", "South", "West",
    ],
    "revenue": [
        4500, 2300, 3800, 6700,
        5100, 1900, 4200, 3300,
        7800, 2700, 4400, 5600,
    ],
    "cost": [
        900, 1150, 760, 1340,
        1020, 950, 840, 660,
        1560, 540, 880, 1120,
    ],
})

customer_master = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005"],
    "customer_name": [
        "Goldfinch Inc", "Redwood LLC", "Birchwood Co",
        "Magnolia Ltd", "Cypress Group",
    ],
    "tier": ["Gold", "Silver", "Bronze", "Gold", "Silver"],
    "account_manager": ["Alice", "Bob", "Carol", "Alice", "Bob"],
})

# ── Parse dates and derive time columns ──────────────────────────────────────
raw["transaction_date"] = pd.to_datetime(raw["transaction_date"])
raw["week_number"] = raw["transaction_date"].dt.isocalendar().week.astype(int)
raw["margin"] = raw["revenue"] - raw["cost"]
raw["margin_pct"] = (raw["margin"] / raw["revenue"] * 100).round(1)

# ── Merge with customer master (left join to keep all transactions) ───────────
enriched = pd.merge(raw, customer_master, on="customer_id", how="left")

# ── Regional weekly summary ───────────────────────────────────────────────────
weekly_regional = enriched.groupby(["region", "week_number"]).agg(
    total_revenue=("revenue", "sum"),
    total_margin=("margin", "sum"),
    transaction_count=("revenue", "count"),
    avg_order_value=("revenue", "mean"),
).round(2).reset_index()

weekly_regional["margin_pct"] = (
    weekly_regional["total_margin"] / weekly_regional["total_revenue"] * 100
).round(1)

print("=== Weekly Regional Summary ===")
print(weekly_regional.to_string(index=False))

# ── Pivot to region × week matrix ────────────────────────────────────────────
pivot_view = pd.pivot_table(
    enriched,
    values="revenue",
    index="region",
    columns="week_number",
    aggfunc="sum",
    fill_value=0,
    margins=True,
    margins_name="Total",
)

print("\n=== Revenue by Region × Week ===")
print(pivot_view)

# ── Customer tier analysis ────────────────────────────────────────────────────
tier_summary = enriched.groupby("tier").agg(
    customer_count=("customer_id", "nunique"),
    total_revenue=("revenue", "sum"),
    avg_revenue_per_transaction=("revenue", "mean"),
    avg_margin_pct=("margin_pct", "mean"),
).round(2).reset_index()

tier_summary["revenue_share_pct"] = (
    tier_summary["total_revenue"] / tier_summary["total_revenue"].sum() * 100
).round(1)

print("\n=== Customer Tier Analysis ===")
print(tier_summary.to_string(index=False))

This script goes from raw transaction data to three distinct deliverables — the weekly regional table, the pivot matrix, and the tier analysis — in well under a hundred lines of clean, readable code, most of which is just the sample data itself.


13.11 Chapter Summary

In this chapter, you mastered the core toolkit for transforming and aggregating business data with pandas:

  • .apply() and .map() enable element-wise transformations — from simple value lookups to complex conditional logic.
  • .groupby() and .agg() implement the split-apply-combine pattern, letting you summarize data across any dimension with multiple statistics in a single call.
  • pd.pivot_table() builds two-dimensional summaries with optional margins and multiple aggregation functions.
  • .merge(), .join(), and .concat() combine DataFrames — merge for column-based joins, join for index-based joins, concat for stacking.
  • .melt() and .pivot() reshape data between wide and long formats to match the requirements of different tools and analyses.
  • The .str accessor provides vectorized string operations — cleaning, extracting, and filtering text columns.
  • pd.to_datetime() and the .dt accessor unlock the full power of time-series analysis: extracting components, computing differences, and grouping by period.
  • .rolling() and .expanding() compute window statistics — moving averages, cumulative totals, and growth rates.

The combination of these tools covers the vast majority of data transformation work that business analysts perform every day. Practice them on real data, and they will become second nature.


Key Terms

Aggregation — The process of computing a summary statistic (sum, mean, count, etc.) over a group of values.

GroupBy — A pandas object that holds the result of splitting a DataFrame by one or more columns, before the apply and combine steps.

Named Aggregation — A pandas 0.25+ syntax where you specify both the output column name and the aggregation function in a single .agg() call.

Pivot Table — A two-dimensional summary table where row and column labels come from the data and cell values are aggregated.

Long Format — A data shape with one row per observation and one column per variable; preferred by most analytical tools.

Wide Format — A data shape with one row per entity and multiple columns representing different time periods or categories.

Rolling Window — A computation over a fixed-size, moving subset of rows.

Expanding Window — A computation over all rows from the beginning up to the current row (cumulative).

.str Accessor — A pandas object that provides vectorized string methods for Series with string dtype.

.dt Accessor — A pandas object that provides date/time component extraction and manipulation methods for datetime Series.