
Learning Objectives

  • Merge two DataFrames on shared keys using inner, left, right, and outer joins and explain when each is appropriate
  • Reshape data between wide and long formats using pivot_table and melt
  • Aggregate data using groupby with multiple aggregation functions (mean, sum, count, custom)
  • Chain multiple transformations (filter, merge, groupby, sort) into readable data pipelines
  • Diagnose common reshape errors (duplicate keys causing row explosion, mismatched key types)

Chapter 9: Reshaping and Transforming Data — Merge, Join, Pivot, Melt, and GroupBy

"Data is like a lump of clay. The analysis you need determines the shape you mold it into." — Hadley Wickham, paraphrased from "Tidy Data" (2014)


Chapter Overview

Here's a situation you're going to encounter constantly in data science: the data you have isn't in the shape you need.

Maybe you downloaded vaccination rates from the WHO in one file and GDP per capita from the World Bank in another. Both files have country names, but they're separate tables. To ask "Do wealthier countries have higher vaccination rates?", you need to combine them — to merge the two tables into one based on the shared country column.

Or maybe your dataset has a column for each year — coverage_2019, coverage_2020, coverage_2021 — spread wide across the page. But your plotting library wants one row per country-year combination, with a single year column and a single coverage column. You need to reshape the data from wide format to long format.

Or maybe you have thousands of rows of individual country-year observations, and you just need a summary: "What's the average vaccination rate by region?" You need to group the data, aggregate within each group, and produce a compact summary table.

These three operations — merging, reshaping, and grouping — are the structural transformations at the heart of data wrangling. They don't change the values in your data; they change its shape. And until you're comfortable with them, you'll be stuck: staring at data that has all the information you need but isn't arranged in a way that lets you use it.

This chapter is going to make you dangerous. By the end, you'll be able to take data in whatever shape it arrives and mold it into whatever shape your analysis demands.

In this chapter, you will learn to:

  1. Merge two DataFrames on shared keys using inner, left, right, and outer joins and explain when each is appropriate
  2. Reshape data between wide and long formats using pivot_table and melt
  3. Aggregate data using groupby with multiple aggregation functions
  4. Chain multiple transformations (filter, merge, groupby, sort) into readable data pipelines
  5. Diagnose common reshape errors (duplicate keys causing row explosion, mismatched key types)

9.1 Why Shape Matters: The Same Data, Three Layouts

Before we touch any code, let's build intuition about what we mean by "the shape of data."

Consider this simple fact: the measles vaccination rate in Brazil was 85% in 2021 and 90% in 2022, while in Kenya it was 69% in 2021 and 75% in 2022.

That's four data points. But look at how differently we can arrange them:

Layout 1: Wide format (one row per country, years as columns)

country    | coverage_2021 | coverage_2022
-----------+---------------+--------------
Brazil     |            85 |            90
Kenya      |            69 |            75

Layout 2: Long format (one row per country-year observation)

country    | year | coverage
-----------+------+---------
Brazil     | 2021 |       85
Brazil     | 2022 |       90
Kenya      | 2021 |       69
Kenya      | 2022 |       75

Layout 3: Summary table (one row per year, aggregated across countries)

year | mean_coverage
-----+--------------
2021 |          77.0
2022 |          82.5

Same underlying facts, three completely different shapes. The wide format is great for reading — your eye can compare Brazil and Kenya across years in a single glance. The long format is what most analysis and plotting tools expect, because each row represents one observation. The summary table is what a report might need: a compact answer to "How did average coverage change?"

The transformations you'll learn in this chapter are the operations that convert between these layouts:

WIDE FORMAT                     LONG FORMAT
  +--------+------+------+       +--------+------+------+
  | country| 2021 | 2022 |       | country| year | value|
  +--------+------+------+  melt | Brazil | 2021 | 85   |
  | Brazil |  85  |  90  | ====> | Brazil | 2022 | 90   |
  | Kenya  |  69  |  75  |       | Kenya  | 2021 | 69   |
  +--------+------+------+       | Kenya  | 2022 | 75   |
                           <==== +--------+------+------+
                          pivot_table

SEPARATE TABLES                         MERGED TABLE
  Table A:            Table B:
  +--------+-----+    +--------+------+         +--------+-----+------+
  | country| cov |    | country| gdp  |         | country| cov | gdp  |
  | Brazil |  90 | +  | Brazil | 8920 |  =====> | Brazil |  90 | 8920 |
  | Kenya  |  75 |    | Kenya  | 2007 |  merge  | Kenya  |  75 | 2007 |
  +--------+-----+    +--------+------+         +--------+-----+------+

RAW OBSERVATIONS                  GROUPED SUMMARY
  +--------+------+-----+           +--------+----------+
  | region | year | cov |           | region | mean_cov |
  | AFRO   | 2021 | 69  |  =====>   | AFRO   |     72.0 |
  | AFRO   | 2021 | 75  |  groupby  | AMRO   |     87.5 |
  | AMRO   | 2021 | 85  |           +--------+----------+
  | AMRO   | 2021 | 90  |
  +--------+------+-----+

Let's learn each of these, starting with the one you'll use most often: merging.


9.2 Merging DataFrames: Combining Tables on Shared Keys

The Problem Merge Solves

In the real world, data almost never arrives in a single, perfectly organized table. You'll get vaccination data from the WHO, economic data from the World Bank, population data from the UN, and geographic data from a mapping service. Each dataset has a column for country name (or country code), but the other columns are different. To ask any interesting cross-dataset question — "How does GDP relate to vaccination rates?" — you need to bring the tables together.

This is what pd.merge() does. It takes two DataFrames, finds rows that share a common key value, and combines them into a single row.

Let's set up some small example DataFrames so we can see exactly what each merge type does:

import pandas as pd

# Vaccination coverage data
vaccinations = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany'],
    'coverage': [90, 75, 88, 95]
})

# GDP per capita data (note: France is here but not in vaccinations)
gdp = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'Germany', 'France'],
    'gdp_per_capita': [8920, 2007, 48636, 42330]
})

Here's what these look like side by side:

vaccinations:                 gdp:
+----------+----------+      +----------+----------------+
|  country | coverage |      |  country | gdp_per_capita |
+----------+----------+      +----------+----------------+
| Brazil   |       90 |      | Brazil   |           8920 |
| Kenya    |       75 |      | Kenya    |           2007 |
| India    |       88 |      | Germany  |          48636 |
| Germany  |       95 |      | France   |          42330 |
+----------+----------+      +----------+----------------+

Notice: India is ONLY in vaccinations.
        France is ONLY in gdp.
        Brazil, Kenya, Germany are in BOTH.

This mismatch is completely typical. Datasets almost never have exactly the same set of keys. The join type you choose determines what happens to the mismatched rows.

Inner Join: Only the Overlap

An inner join keeps only the rows where the key appears in both DataFrames. If a country is in one table but not the other, that row is dropped.

inner = pd.merge(vaccinations, gdp, on='country', how='inner')
print(inner)
  country  coverage  gdp_per_capita
0  Brazil        90            8920
1   Kenya        75            2007
2 Germany        95           48636

Here's what happened visually:

vaccinations:          gdp:                 INNER JOIN result:
+---------+-----+     +---------+-------+  +---------+-----+-------+
| Brazil  |  90 |---->| Brazil  |  8920 |  | Brazil  |  90 |  8920 |
| Kenya   |  75 |---->| Kenya   |  2007 |  | Kenya   |  75 |  2007 |
| India   |  88 | ✗   |         |       |  | Germany |  95 | 48636 |
| Germany |  95 |---->| Germany | 48636 |  +---------+-----+-------+
+---------+-----+     | France  | 42330 | ✗
                      +---------+-------+
                                           India DROPPED (not in gdp)
                                           France DROPPED (not in vaccinations)

When to use inner join: When you only want rows where you have complete information from both sources. This is the strictest join — you'll never see NaN values from the merge itself.

Left Join: Keep Everything from the Left

A left join keeps all rows from the left DataFrame (the first one you pass to merge), whether or not they have a match in the right DataFrame. Where there's no match, the right-side columns get filled with NaN.

left = pd.merge(vaccinations, gdp, on='country', how='left')
print(left)
  country  coverage  gdp_per_capita
0  Brazil        90          8920.0
1   Kenya        75          2007.0
2   India        88             NaN
3 Germany        95         48636.0

Visually:

vaccinations:          gdp:                 LEFT JOIN result:
+---------+-----+     +---------+-------+  +---------+-----+--------+
| Brazil  |  90 |---->| Brazil  |  8920 |  | Brazil  |  90 |   8920 |
| Kenya   |  75 |---->| Kenya   |  2007 |  | Kenya   |  75 |   2007 |
| India   |  88 |---->|   ???   |       |  | India   |  88 |    NaN |
| Germany |  95 |---->| Germany | 48636 |  | Germany |  95 |  48636 |
+---------+-----+     | France  | 42330 | ✗ +---------+-----+--------+
                      +---------+-------+
                                           India KEPT (NaN for gdp)
                                           France DROPPED

When to use left join: When your left table is the "primary" dataset and you want to enrich it with information from the right table where available. This is the most common join type in practice. You're saying, "Keep all my main data; add extra info where you can."

Right Join: Keep Everything from the Right

A right join is the mirror image of a left join: it keeps all rows from the right DataFrame.

right = pd.merge(vaccinations, gdp, on='country', how='right')
print(right)
  country  coverage  gdp_per_capita
0  Brazil      90.0            8920
1   Kenya      75.0            2007
2 Germany      95.0           48636
3  France       NaN           42330

When to use right join: Honestly? Almost never. You can always swap the order of the two DataFrames and use a left join instead. A right join is just a left join read from the other direction. Most data scientists use left joins exclusively and simply choose which DataFrame to put on the left.
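Since the two calls are interchangeable, a quick sketch can confirm it: swapping the table order turns a right join into a left join that keeps the same rows.

```python
import pandas as pd

vaccinations = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany'],
    'coverage': [90, 75, 88, 95]
})
gdp = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'Germany', 'France'],
    'gdp_per_capita': [8920, 2007, 48636, 42330]
})

# A right join on vaccinations/gdp...
right = pd.merge(vaccinations, gdp, on='country', how='right')

# ...keeps the same rows as a left join with the tables swapped
swapped = pd.merge(gdp, vaccinations, on='country', how='left')

# Same countries in the same order, same values; only column order differs
print(list(right['country']) == list(swapped['country']))   # True
print(right['coverage'].equals(swapped['coverage']))        # True
```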

Outer Join: Keep Everything

An outer join (also called a full outer join) keeps all rows from both DataFrames. Where there's no match, the missing side gets NaN.

outer = pd.merge(vaccinations, gdp, on='country', how='outer')
print(outer)
  country  coverage  gdp_per_capita
0  Brazil      90.0          8920.0
1   Kenya      75.0          2007.0
2   India      88.0             NaN
3 Germany      95.0         48636.0
4  France       NaN         42330.0

Visually:

OUTER JOIN result:
+---------+------+--------+
| Brazil  |   90 |   8920 |  (matched)
| Kenya   |   75 |   2007 |  (matched)
| India   |   88 |    NaN |  (left only)
| Germany |   95 |  48636 |  (matched)
| France  |  NaN |  42330 |  (right only)
+---------+------+--------+

Nobody gets dropped. Everybody comes to the party.

When to use outer join: When you want to see the complete picture and need to know what's missing on each side. Useful for data auditing: "Which countries are in our vaccination data but missing from our GDP data, and vice versa?"

Comparison Table: Join Types at a Glance

Join Type | Keeps Left-Only? | Keeps Right-Only? | Keeps Matches? | NaN Values?
----------+------------------+-------------------+----------------+------------------------------------------
Inner     | No               | No                | Yes            | Never from the merge
Left      | Yes              | No                | Yes            | In right-side columns for unmatched rows
Right     | No               | Yes               | Yes            | In left-side columns for unmatched rows
Outer     | Yes              | Yes               | Yes            | On both sides for unmatched rows

Priya's mental model. Priya, our sports journalist, thinks of joins like this: "Inner join is an exclusive club — both tables need to vouch for you. Left join is my guest list — everyone I invited gets in, plus their friends if they bring any. Outer join is an open party — everybody shows up, and we figure out who knows whom later."
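To make the comparison concrete, here is a small sketch that re-runs the merge from this section with each how value and counts the resulting rows:

```python
import pandas as pd

vaccinations = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany'],
    'coverage': [90, 75, 88, 95]
})
gdp = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'Germany', 'France'],
    'gdp_per_capita': [8920, 2007, 48636, 42330]
})

# Same merge, four join types: only the row count changes
counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(vaccinations, gdp, on='country', how=how)
    counts[how] = len(merged)
    print(f"{how:>5} join: {counts[how]} rows")
# inner: 3, left: 4, right: 4, outer: 5
```

The counts follow directly from the overlap: three matched countries, one left-only (India), one right-only (France).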

Merging on Different Column Names

Sometimes the key column has different names in the two DataFrames. Maybe one file uses country and the other uses nation. Use left_on and right_on:

# If the GDP table called its column 'nation' instead of 'country'
gdp_renamed = gdp.rename(columns={'country': 'nation'})

merged = pd.merge(vaccinations, gdp_renamed,
                  left_on='country', right_on='nation',
                  how='inner')
print(merged)
  country  coverage nation  gdp_per_capita
0  Brazil        90 Brazil            8920
1   Kenya        75  Kenya            2007
2 Germany        95 Germany          48636

Notice that you end up with both the country and nation columns — pandas doesn't know they're the same thing. You'll usually want to drop the duplicate:

merged = merged.drop(columns='nation')

Merging on Multiple Keys

Sometimes a single column isn't enough to uniquely identify a match. For example, if you're merging vaccination data by country and year:

vax = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Kenya', 'Kenya'],
    'year': [2021, 2022, 2021, 2022],
    'coverage': [85, 90, 69, 75]
})

spending = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Kenya', 'Kenya'],
    'year': [2021, 2022, 2021, 2022],
    'health_spend_pct': [9.6, 9.9, 4.6, 4.8]
})

merged = pd.merge(vax, spending, on=['country', 'year'], how='inner')
print(merged)
  country  year  coverage  health_spend_pct
0  Brazil  2021        85               9.6
1  Brazil  2022        90               9.9
2   Kenya  2021        69               4.6
3   Kenya  2022        75               4.8

Pass a list of column names to on (or to left_on and right_on) when you need a composite key.

Check Your Understanding

  1. You have a table of 200 countries with vaccination data and a table of 180 countries with GDP data. If you do an inner join on country, what's the maximum number of rows in the result? What's the minimum?
  2. You choose a left join to merge vaccination data (left) with GDP data (right). A colleague says, "Some countries have NaN for GDP after the merge." Is this expected? Why or why not?
  3. When would you choose an outer join over an inner join? Give a concrete scenario.

9.3 The Danger Zone: Key Explosions and Mismatched Types

Merging is powerful, but it has two traps that catch nearly every beginner. Let's walk through both so you can recognize them before they ruin your afternoon.

Trap 1: Duplicate Keys and Row Explosion

What happens when a key appears more than once in both DataFrames? Pandas creates every possible combination. This is called a many-to-many join, and the result can be much larger than either input.

# Two students named 'Smith' in the class list
students = pd.DataFrame({
    'last_name': ['Smith', 'Smith', 'Jones'],
    'first_name': ['Alice', 'Bob', 'Carol'],
})

# Two Smiths also appear in the scholarship list
scholarships = pd.DataFrame({
    'last_name': ['Smith', 'Smith', 'Jones'],
    'scholarship': ['Merit Award', 'Need-Based', 'Research Grant'],
})

result = pd.merge(students, scholarships, on='last_name')
print(result)
  last_name first_name    scholarship
0     Smith      Alice    Merit Award
1     Smith      Alice     Need-Based
2     Smith        Bob    Merit Award
3     Smith        Bob     Need-Based
4     Jones      Carol  Research Grant

Wait — we started with 3 rows in each table, but the result has 5 rows. The two Smiths in the students table matched with the two Smiths in the scholarships table, producing 2 x 2 = 4 Smith rows. This is the Cartesian product of the duplicate keys, and it can be explosive.

Imagine a merge where one table has 1,000 rows with key "USA" and the other has 500 rows with key "USA". The result would have 1,000 x 500 = 500,000 rows just for that one key. A handful of keys like that and your 10,000-row dataset balloons into millions of rows, your computer runs out of memory, and you spend twenty minutes staring at a frozen Jupyter notebook.
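You can estimate the blow-up before merging. This sketch (with made-up toy tables) multiplies the per-key row counts from each side, which predicts the size of an inner join:

```python
import pandas as pd

# Toy tables with duplicate keys on both sides
left = pd.DataFrame({'key': ['USA', 'USA', 'USA', 'CAN'],
                     'value_l': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['USA', 'USA', 'USA', 'USA', 'CAN', 'CAN'],
                      'value_r': [5, 6, 7, 8, 9, 10]})

# Predicted inner-join size: for each key, (count in left) x (count in right)
predicted = (left['key'].value_counts() * right['key'].value_counts()).dropna().sum()
print(int(predicted))   # USA: 3 x 4 = 12, CAN: 1 x 2 = 2, total 14

actual = len(pd.merge(left, right, on='key'))
print(actual)           # 14 — matches the prediction
```

If the predicted number is far larger than either input, stop and deduplicate before merging.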

How to prevent key explosions:

  1. Check for duplicates before merging:

# Are there duplicate countries in each table?
print(f"Left duplicates: {vaccinations['country'].duplicated().sum()}")
print(f"Right duplicates: {gdp['country'].duplicated().sum()}")

  2. Use the validate parameter to have pandas raise an error if the merge is many-to-many:

# This will raise MergeError if either side has duplicate keys
merged = pd.merge(vaccinations, gdp, on='country',
                  how='inner', validate='one_to_one')

The validate options are:

  • 'one_to_one' — neither side should have duplicate keys
  • 'one_to_many' — left side must be unique, right side can repeat
  • 'many_to_one' — right side must be unique, left side can repeat
  • 'many_to_many' — no validation (the default, and the dangerous one)

  3. Use the indicator parameter to see where each row came from:
merged = pd.merge(vaccinations, gdp, on='country',
                  how='outer', indicator=True)
print(merged)
  country  coverage  gdp_per_capita      _merge
0  Brazil      90.0          8920.0        both
1   Kenya      75.0          2007.0        both
2   India      88.0             NaN   left_only
3 Germany      95.0         48636.0        both
4  France       NaN         42330.0  right_only

The _merge column tells you exactly which rows matched, which came only from the left, and which came only from the right. This is invaluable for debugging.

Trap 2: Mismatched Key Types

This is the subtler trap. Suppose your vaccination data has country codes as integers and your GDP data has them as strings:

vax = pd.DataFrame({
    'country_code': [76, 404, 276],    # integers
    'coverage': [90, 75, 95]
})

gdp = pd.DataFrame({
    'country_code': ['76', '404', '276'],  # strings!
    'gdp': [8920, 2007, 48636]
})

result = pd.merge(vax, gdp, on='country_code', how='inner')
print(result)
Empty DataFrame
Columns: [country_code, coverage, gdp]
Index: []

Zero rows! The merge returned nothing because 76 (integer) doesn't equal '76' (string) in Python. There was no error, no warning — just an empty result that might take you an hour to debug.

How to prevent type mismatches:

# Always check dtypes before merging
print(vax['country_code'].dtype)    # int64
print(gdp['country_code'].dtype)    # object (string)

# Fix the mismatch
gdp['country_code'] = gdp['country_code'].astype(int)

# Now the merge works
result = pd.merge(vax, gdp, on='country_code', how='inner')
print(len(result))  # 3

Debugging Walkthrough: The Disappearing Merge

Symptom: You merge two DataFrames and the result has far fewer rows than expected — or zero rows.

Diagnostic steps:

  1. Check dtypes: print(left[key].dtype, right[key].dtype). Are they the same?
  2. Check sample values: print(left[key].head(), right[key].head()). Do the actual values match? Watch for leading/trailing spaces.
  3. Check for NaN keys: print(left[key].isna().sum()). NaN never matches NaN in a merge.
  4. Try a manual check: print(left[key].isin(right[key]).sum()). How many left keys exist in the right table?

Common fixes:

  • .astype(int) or .astype(str) to align types
  • .str.strip() to remove whitespace
  • .str.lower() to standardize case
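The diagnostic steps above can be bundled into a small helper. This is a sketch; diagnose_merge is our own name for illustration, not a pandas function:

```python
import pandas as pd

def diagnose_merge(left, right, key):
    """Run quick pre-merge checks on a shared key column.

    (diagnose_merge is a hypothetical helper, not part of pandas.)
    Returns how many left keys were found in the right table.
    """
    print(f"dtypes: left={left[key].dtype}, right={right[key].dtype}")
    print(f"NaN keys: left={left[key].isna().sum()}, right={right[key].isna().sum()}")
    overlap = left[key].isin(right[key]).sum()
    print(f"{overlap} of {len(left)} left keys found in the right table")
    return overlap

# The mismatched-type tables from above: zero keys match
vax = pd.DataFrame({'country_code': [76, 404, 276], 'coverage': [90, 75, 95]})
gdp = pd.DataFrame({'country_code': ['76', '404', '276'], 'gdp': [8920, 2007, 48636]})
matches = diagnose_merge(vax, gdp, 'country_code')   # 0 of 3 — the type mismatch
```

A result of 0 matches when you expected overlap is the classic signature of a dtype or whitespace problem.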


9.4 Wide vs. Long Format: Understanding the Shape of Your Data

Now let's talk about the second major transformation: converting between wide format and long format.

What Are Wide and Long Format?

These terms describe how a dataset uses rows and columns to encode information.

Wide format uses columns to encode one dimension of the data. A wide table for vaccination rates might have a separate column for each year:

WIDE FORMAT:
+----------+------+------+------+
| country  | 2020 | 2021 | 2022 |
+----------+------+------+------+
| Brazil   |   83 |   85 |   90 |
| Kenya    |   66 |   69 |   75 |
| India    |   84 |   88 |   91 |
+----------+------+------+------+

3 rows x 4 columns
Each row = one country
Each column = one year's value

Long format (also called tidy format) uses rows to encode that dimension. The same data in long format has one row per observation:

LONG FORMAT:
+----------+------+----------+
| country  | year | coverage |
+----------+------+----------+
| Brazil   | 2020 |       83 |
| Brazil   | 2021 |       85 |
| Brazil   | 2022 |       90 |
| Kenya    | 2020 |       66 |
| Kenya    | 2021 |       69 |
| Kenya    | 2022 |       75 |
| India    | 2020 |       84 |
| India    | 2021 |       88 |
| India    | 2022 |       91 |
+----------+------+----------+

9 rows x 3 columns
Each row = one country-year observation

The wide table is more compact and easier to read visually. The long table is more verbose but far more flexible for analysis. Here's the key insight:

Most pandas operations — groupby, plotting, filtering by condition — work best with long format data. When you encounter data in wide format (which is very common, because humans prefer it for reading), you'll often need to melt it into long format before analyzing it.

Melting: Wide to Long

The pd.melt() function (or equivalently, df.melt()) converts a wide DataFrame to long format by "unpivoting" columns into rows.

wide = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India'],
    '2020': [83, 66, 84],
    '2021': [85, 69, 88],
    '2022': [90, 75, 91]
})

long = pd.melt(wide,
               id_vars='country',       # column(s) to KEEP as-is
               var_name='year',          # name for the new column holding old column headers
               value_name='coverage')    # name for the new column holding the values

print(long)
  country  year  coverage
0  Brazil  2020        83
1   Kenya  2020        66
2   India  2020        84
3  Brazil  2021        85
4   Kenya  2021        69
5   India  2021        88
6  Brazil  2022        90
7   Kenya  2022        75
8   India  2022        91

Let's trace exactly what happened:

BEFORE (wide):                          AFTER (long):
+--------+------+------+------+         +--------+------+----------+
| country| 2020 | 2021 | 2022 |         | country| year | coverage |
+--------+------+------+------+         +--------+------+----------+
| Brazil |   83 |   85 |   90 |  =====> | Brazil | 2020 |       83 |
| Kenya  |   66 |   69 |   75 |  melt   | Kenya  | 2020 |       66 |
| India  |   84 |   88 |   91 |         | India  | 2020 |       84 |
+--------+------+------+------+         | Brazil | 2021 |       85 |
                                        | Kenya  | 2021 |       69 |
  3 rows x 4 cols                       | India  | 2021 |       88 |
  = 3 x 3 = 9 observations             | Brazil | 2022 |       90 |
  packed into 3 rows                    | Kenya  | 2022 |       75 |
                                        | India  | 2022 |       91 |
                                        +--------+------+----------+
                                         9 rows x 3 cols
                                         9 observations in 9 rows

The id_vars parameter specifies which columns should stay put — they're the identifiers. Every other column gets "melted" into two new columns: one for the original column name (the var_name) and one for the value (the value_name).

Pro tip: After melting, you'll often want to fix the data type of the new variable column. In our example, year came from column headers, so it's a string. Convert it:

long['year'] = long['year'].astype(int)

Pivoting: Long to Wide

The reverse operation — going from long to wide — uses pivot_table(). This is useful when you want to create a summary view or prepare data for a specific type of visualization.

wide_again = long.pivot_table(
    index='country',      # what becomes the rows
    columns='year',       # what becomes the columns
    values='coverage'     # what fills the cells
)

print(wide_again)
year     2020  2021  2022
country
Brazil     83    85    90
India      84    88    91
Kenya      66    69    75

Here's the visual:

BEFORE (long):                    AFTER (wide via pivot_table):
+--------+------+------+          year     2020  2021  2022
| country| year | cov  |          country
+--------+------+------+  pivot   Brazil     83    85    90
| Brazil | 2020 |   83 |  =====>  India      84    88    91
| Brazil | 2021 |   85 |          Kenya      66    69    75
| Brazil | 2022 |   90 |
| Kenya  | 2020 |   66 |          year becomes column headers
| Kenya  | 2021 |   69 |          country becomes row index
| Kenya  | 2022 |   75 |          coverage fills the cells
| India  | 2020 |   84 |
| India  | 2021 |   88 |
| India  | 2022 |   91 |
+--------+------+------+

Why pivot_table instead of pivot? Pandas has both df.pivot() and df.pivot_table(). The difference is that pivot_table can handle duplicate entries by aggregating them (e.g., taking the mean), while pivot raises an error if there are duplicate index-column combinations. Use pivot_table by default — it's safer and more flexible.

# pivot_table can handle duplicates by aggregating
summary = long.pivot_table(
    index='country',
    columns='year',
    values='coverage',
    aggfunc='mean'  # how to handle duplicates (default is 'mean')
)
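A quick sketch (with a made-up duplicate row) shows the difference in action: pivot refuses, while pivot_table averages the duplicates.

```python
import pandas as pd

# Two Brazil/2021 measurements: a duplicate index-column pair
dup = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Kenya'],
    'year': [2021, 2021, 2021],
    'coverage': [84, 86, 69]
})

# pivot raises, because it can't decide which Brazil/2021 value to keep
try:
    dup.pivot(index='country', columns='year', values='coverage')
except ValueError as err:
    print(f"pivot failed: {err}")

# pivot_table aggregates the duplicates instead (mean by default)
table = dup.pivot_table(index='country', columns='year', values='coverage')
print(table)
# Brazil's 2021 cell is (84 + 86) / 2 = 85.0
```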

Worked Example: Reshaping WHO Vaccination Data

Let's say Elena downloads vaccination data from the WHO, and it arrives in wide format — one column per vaccine type:

who_wide = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Kenya', 'Kenya'],
    'year': [2021, 2022, 2021, 2022],
    'measles': [85, 90, 69, 75],
    'dtp3': [78, 83, 82, 85],
    'polio': [80, 86, 76, 79]
})

print(who_wide)
  country  year  measles  dtp3  polio
0  Brazil  2021       85    78     80
1  Brazil  2022       90    83     86
2   Kenya  2021       69    82     76
3   Kenya  2022       75    85     79

Elena wants to compare vaccine coverage rates, which is easiest when all the rates are in a single column. She melts the vaccine columns:

who_long = pd.melt(who_wide,
                   id_vars=['country', 'year'],
                   var_name='vaccine',
                   value_name='coverage')

print(who_long)
   country  year  vaccine  coverage
0   Brazil  2021  measles        85
1   Brazil  2022  measles        90
2    Kenya  2021  measles        69
3    Kenya  2022  measles        75
4   Brazil  2021     dtp3        78
5   Brazil  2022     dtp3        83
6    Kenya  2021     dtp3        82
7    Kenya  2022     dtp3        85
8   Brazil  2021    polio        80
9   Brazil  2022    polio        86
10   Kenya  2021    polio        76
11   Kenya  2022    polio        79

Now she can easily filter for a specific vaccine, group by vaccine type to compare averages, or plot all vaccines on the same chart with vaccine as the color.
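For example, grouping the melted table by vaccine answers "which vaccine has the highest average coverage?" in one line:

```python
import pandas as pd

# Elena's wide WHO table from above, melted into long format
who_wide = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Kenya', 'Kenya'],
    'year': [2021, 2022, 2021, 2022],
    'measles': [85, 90, 69, 75],
    'dtp3': [78, 83, 82, 85],
    'polio': [80, 86, 76, 79]
})
who_long = pd.melt(who_wide, id_vars=['country', 'year'],
                   var_name='vaccine', value_name='coverage')

# Average coverage per vaccine, across all countries and years
by_vaccine = who_long.groupby('vaccine')['coverage'].mean()
print(by_vaccine)
# dtp3: 82.0, measles: 79.75, polio: 80.25
```

This comparison would be awkward in the wide layout, where each vaccine lives in its own column.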

Check Your Understanding

  1. You have a DataFrame with 50 countries and columns for country, pop_2018, pop_2019, pop_2020, pop_2021, pop_2022. After melting all the population columns, how many rows will the result have?
  2. What are the three parameters of pd.melt() that you need to specify? What does each one control?
  3. If you melt a wide DataFrame and then pivot the result back, do you always get the original DataFrame? What might be different?

9.5 GroupBy: Split, Apply, Combine

The Pattern

groupby is arguably the most important data transformation in all of data analysis. Here's the idea in one sentence: split your data into groups, apply a computation to each group, and combine the results.

This is so fundamental that it has a name: the split-apply-combine pattern. Let's see it step by step.

Say you have vaccination data and you want to know the average coverage by region:

df = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany', 'Nigeria', 'Japan'],
    'region': ['AMRO', 'AFRO', 'SEARO', 'EURO', 'AFRO', 'WPRO'],
    'coverage': [90, 75, 88, 95, 62, 97]
})

result = df.groupby('region')['coverage'].mean()
print(result)
region
AFRO     68.5
AMRO     90.0
EURO     95.0
SEARO    88.0
WPRO     97.0
Name: coverage, dtype: float64

Here's what happened inside:

SPLIT (by region):

AFRO group:            AMRO group:     EURO group:
+--------+----+        +-------+--+    +--------+--+
| Kenya  | 75 |        | Brazil|90|    | Germany|95|
| Nigeria| 62 |        +-------+--+    +--------+--+
+--------+----+

SEARO group:           WPRO group:
+-------+----+         +------+--+
| India | 88 |         | Japan|97|
+-------+----+         +------+--+

APPLY (mean to each group):

AFRO:  (75 + 62) / 2 = 68.5
AMRO:  90 / 1 = 90.0
EURO:  95 / 1 = 95.0
SEARO: 88 / 1 = 88.0
WPRO:  97 / 1 = 97.0

COMBINE (into result):

region  | mean_coverage
--------+--------------
AFRO    |         68.5
AMRO    |         90.0
EURO    |         95.0
SEARO   |         88.0
WPRO    |         97.0

Multiple Aggregations with .agg()

Often you want more than just the mean. The .agg() method lets you apply multiple aggregation functions at once:

summary = df.groupby('region')['coverage'].agg(['mean', 'min', 'max', 'count'])
print(summary)
       mean  min  max  count
region
AFRO   68.5   62   75      2
AMRO   90.0   90   90      1
EURO   95.0   95   95      1
SEARO  88.0   88   88      1
WPRO   97.0   97   97      1

You can also use a dictionary to apply different aggregations to different columns:

df['population'] = [214_000_000, 54_000_000, 1_400_000_000,
                    84_000_000, 218_000_000, 125_000_000]

summary = df.groupby('region').agg(
    avg_coverage=('coverage', 'mean'),
    total_population=('population', 'sum'),
    num_countries=('country', 'count')
)
print(summary)
       avg_coverage  total_population  num_countries
region
AFRO           68.5         272000000              2
AMRO           90.0         214000000              1
EURO           95.0          84000000              1
SEARO          88.0        1400000000              1
WPRO           97.0         125000000              1

This named aggregation syntax — output_name=('column', 'function') — is clean, readable, and the recommended approach for modern pandas code.

Grouping by Multiple Columns

You can group by more than one column. This creates nested groups:

df2 = pd.DataFrame({
    'region': ['AFRO', 'AFRO', 'AFRO', 'AFRO', 'AMRO', 'AMRO'],
    'year': [2021, 2021, 2022, 2022, 2021, 2022],
    'country': ['Kenya', 'Nigeria', 'Kenya', 'Nigeria', 'Brazil', 'Brazil'],
    'coverage': [69, 55, 75, 62, 85, 90]
})

by_region_year = df2.groupby(['region', 'year'])['coverage'].mean()
print(by_region_year)
region  year
AFRO    2021    62.0
        2022    68.5
AMRO    2021    85.0
        2022    90.0
Name: coverage, dtype: float64

The result has a multi-index — an index with two levels (region and year). You can access specific values with:

# Average coverage for AFRO in 2022
print(by_region_year.loc[('AFRO', 2022)])  # 68.5

If the multi-index confuses you (and it confuses many people at first), you can flatten it:

by_region_year = by_region_year.reset_index()
print(by_region_year)
  region  year  coverage
0   AFRO  2021      62.0
1   AFRO  2022      68.5
2   AMRO  2021      85.0
3   AMRO  2022      90.0

That's a regular DataFrame with clean column names. Much easier to work with.

The transform Method: GroupBy Without Reducing

Sometimes you don't want to reduce each group to a single summary value — you want to compute something within each group and assign it back to every row. That's what .transform() does.

For example, you might want to add a column showing each country's deviation from its regional average:

df['region_mean'] = df.groupby('region')['coverage'].transform('mean')
df['deviation'] = df['coverage'] - df['region_mean']
print(df[['country', 'region', 'coverage', 'region_mean', 'deviation']])
   country region  coverage  region_mean  deviation
0   Brazil   AMRO        90         90.0        0.0
1    Kenya   AFRO        75         68.5        6.5
2    India  SEARO        88         88.0        0.0
3  Germany   EURO        95         95.0        0.0
4  Nigeria   AFRO        62         68.5       -6.5
5    Japan   WPRO        97         97.0        0.0

Notice that .transform() returned a Series with the same number of rows as the original DataFrame. Each row got its group's mean, not a single aggregated value. This is the key difference between agg (which reduces) and transform (which preserves shape).
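The string you pass to .transform() isn't limited to 'mean' — names like 'rank' order each row within its own group while still preserving the original shape. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['AFRO', 'AFRO', 'AMRO', 'AMRO'],
    'country': ['Nigeria', 'Kenya', 'Brazil', 'USA'],
    'coverage': [62, 75, 85, 92],
})

# Rank each country within its own region (1 = lowest coverage).
# The result has one value per original row, not one per group.
df['rank_in_region'] = df.groupby('region')['coverage'].transform('rank')
print(df)
```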

Check Your Understanding

  1. In your own words, explain the "split-apply-combine" pattern. What does each step do?
  2. What's the difference between .agg() and .transform()? When would you use each?
  3. You group a DataFrame with 1,000 rows by a column that has 10 unique values. After calling .mean(), how many rows does the result have?

9.6 Method Chaining: Building Data Pipelines

Up to now, we've been saving intermediate results into variables at every step:

# Step-by-step approach
filtered = df[df['coverage'] > 50]
merged = pd.merge(filtered, gdp, on='country')
grouped = merged.groupby('region')['gdp_per_capita'].mean()
sorted_result = grouped.sort_values(ascending=False)

This works, but it creates four variables that you never use again. Method chaining lets you connect operations together in a pipeline, with each step feeding into the next:

result = (
    df[df['coverage'] > 50]
    .merge(gdp, on='country')
    .groupby('region')['gdp_per_capita']
    .mean()
    .sort_values(ascending=False)
)

The parentheses let you break the chain across multiple lines for readability. Each line applies one transformation to the result of the previous line. You read it top to bottom: "Take df, filter for coverage > 50, merge with GDP data, group by region, compute the mean GDP, and sort highest to lowest."

Why Method Chaining?

  1. Readability. The chain tells a story. Each line is a verb: filter, merge, group, aggregate, sort.

  2. Fewer intermediate variables. You don't litter your notebook with temp1, temp2, filtered_merged_grouped.

  3. Easier refactoring. Want to add a step? Insert a new line. Want to remove one? Delete a line. Want to reorder? Move lines around.

Tips for Readable Chains

Use .assign() instead of direct column assignment when you need to create new columns in a chain:

result = (
    df
    .assign(coverage_pct=lambda x: x['coverage'] / 100)
    .groupby('region')['coverage_pct']
    .mean()
)

The lambda x inside .assign() refers to the DataFrame at that point in the chain. This keeps everything flowing without breaking the chain to create an intermediate variable.

Use .query() instead of bracket filtering for cleaner chain syntax:

result = (
    df
    .query('coverage > 50 and year == 2022')
    .groupby('region')['coverage']
    .mean()
    .sort_values(ascending=False)
)

Use .pipe() for custom functions that don't exist as DataFrame methods:

def flag_low_coverage(dataframe, threshold=70):
    return dataframe.assign(
        low_coverage=lambda x: x['coverage'] < threshold
    )

result = (
    df
    .pipe(flag_low_coverage, threshold=75)
    .groupby('low_coverage')['country']
    .count()
)

When NOT to Chain

Method chaining is a tool, not a religion. Break the chain when:

  • A step needs debugging. If you're getting unexpected results, assign the intermediate to a variable and inspect it.
  • The chain is too long. Chains of more than 8-10 operations become hard to follow. Break them into logical sections.
  • You need the intermediate result elsewhere. If you'll use the filtered data for both a merge and a separate plot, store it in a variable.
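There's also a middle ground between chaining everything and breaking the chain apart: a pass-through .pipe() step that inspects the intermediate result. The peek helper below is hypothetical (not a pandas built-in) — just a debugging sketch:

```python
import pandas as pd

def peek(df, label=''):
    """Print the shape of the DataFrame mid-chain, then pass it through unchanged."""
    print(f'{label}: {df.shape}')
    return df

df = pd.DataFrame({
    'region': ['AFRO', 'AFRO', 'AMRO'],
    'coverage': [62, 75, 90],
})

result = (
    df
    .query('coverage > 65')
    .pipe(peek, label='after filter')   # prints the shape without altering the data
    .groupby('region')['coverage']
    .mean()
)
```

Because peek returns its input untouched, you can drop the line in or out without changing the pipeline's result.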

Check Your Understanding

  1. Rewrite this step-by-step code as a method chain:

     temp = df.dropna(subset=['coverage'])
     temp = temp[temp['year'] >= 2020]
     temp = temp.groupby('region')['coverage'].mean()
     temp = temp.sort_values()
  2. What does the lambda x in .assign(new_col=lambda x: x['col'] * 2) refer to?
  3. Name two situations where you should not use method chaining.

9.7 Putting It All Together: A Complete Worked Example

Let's walk through a realistic scenario that uses merging, melting, grouping, and chaining together — the kind of workflow you'll do constantly in practice.

Scenario: Comparing Vaccination Coverage Across Income Groups

Elena wants to answer this question: "How does average vaccination coverage differ between high-income and low-income countries, and has the gap narrowed from 2020 to 2022?"

She has two datasets:

# Dataset 1: Vaccination coverage (wide format — one column per year)
vax_wide = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany', 'Nigeria', 'Japan',
                'Ethiopia', 'France', 'Bangladesh', 'USA'],
    'measles_2020': [83, 66, 84, 97, 54, 97, 58, 90, 84, 92],
    'measles_2021': [85, 69, 88, 95, 55, 97, 48, 91, 86, 92],
    'measles_2022': [90, 75, 91, 95, 62, 97, 52, 92, 88, 93],
})

# Dataset 2: Country income classification
income = pd.DataFrame({
    'country': ['Brazil', 'Kenya', 'India', 'Germany', 'Nigeria', 'Japan',
                'Ethiopia', 'France', 'Bangladesh', 'USA'],
    'income_group': ['Upper middle', 'Lower middle', 'Lower middle',
                     'High', 'Lower middle', 'High',
                     'Low', 'High', 'Lower middle', 'High']
})

Step 1: Melt the wide vaccination data to long format.

vax_long = pd.melt(
    vax_wide,
    id_vars='country',
    var_name='year_raw',
    value_name='measles_coverage'
)

# Clean up the year column — extract the number from 'measles_2020'
vax_long['year'] = vax_long['year_raw'].str.extract(r'(\d{4})').astype(int)
vax_long = vax_long.drop(columns='year_raw')

print(vax_long.head())
  country  measles_coverage  year
0  Brazil                83  2020
1   Kenya                66  2020
2   India                84  2020
3 Germany                97  2020
4 Nigeria                54  2020

Step 2: Merge with income classification.

merged = pd.merge(vax_long, income, on='country', how='left')
print(merged.head())
  country  measles_coverage  year  income_group
0  Brazil                83  2020  Upper middle
1   Kenya                66  2020  Lower middle
2   India                84  2020  Lower middle
3 Germany                97  2020          High
4 Nigeria                54  2020  Lower middle

Step 3: Group by income group and year, compute average coverage.

summary = (
    merged
    .groupby(['income_group', 'year'])
    .agg(avg_coverage=('measles_coverage', 'mean'),
         num_countries=('country', 'count'))
    .reset_index()
)

print(summary)
   income_group  year  avg_coverage  num_countries
0          High  2020         94.00              4
1          High  2021         93.75              4
2          High  2022         94.25              4
3           Low  2020         58.00              1
4           Low  2021         48.00              1
5           Low  2022         52.00              1
6  Lower middle  2020         72.00              4
7  Lower middle  2021         74.50              4
8  Lower middle  2022         79.00              4
9  Upper middle  2020         83.00              1
10 Upper middle  2021         85.00              1
11 Upper middle  2022         90.00              1

Step 4: Pivot to make it easy to compare years side by side.

comparison = summary.pivot_table(
    index='income_group',
    columns='year',
    values='avg_coverage'
)

print(comparison)
year           2020   2021   2022
income_group
High          94.00  93.75  94.25
Low           58.00  48.00  52.00
Lower middle  72.00  74.50  79.00
Upper middle  83.00  85.00  90.00

Now Elena can see the answer clearly: high-income countries maintained coverage around 94%, while lower-middle-income countries improved from 72% to 79%. The gap narrowed from 22 percentage points to about 15. That's a clear, communicable finding that started with two messy datasets and four transformation steps.

As a single chain, the entire analysis would look like this:

comparison = (
    pd.melt(vax_wide, id_vars='country',
            var_name='year_raw', value_name='measles_coverage')
    .assign(year=lambda x: x['year_raw'].str.extract(r'(\d{4})').astype(int))
    .drop(columns='year_raw')
    .merge(income, on='country', how='left')
    .groupby(['income_group', 'year'])
    .agg(avg_coverage=('measles_coverage', 'mean'))
    .reset_index()
    .pivot_table(index='income_group', columns='year', values='avg_coverage')
)

That's eight operations in one readable pipeline. Melt, assign, drop, merge, group, aggregate, reset, pivot. Each line is a transformation step, and the whole thing reads like a recipe.


9.8 Common Errors and How to Fix Them

Let's collect the most frequent mistakes into a debugging reference you can consult when things go wrong.

Error 1: "ValueError: columns overlap but no suffix specified"

What happened: You merged two DataFrames that share column names beyond the key. For example, both have a name column.

Fix: Use the suffixes parameter:

merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

This renames the overlapping columns as name_left and name_right.

Error 2: Row count explodes after merge

What happened: Duplicate keys on both sides caused a Cartesian product (see Section 9.3).

Fix: Check for duplicates before merging. Use validate='one_to_one' or validate='many_to_one'.

# Diagnostic
print(f"Left table rows: {len(df1)}")
print(f"Right table rows: {len(df2)}")
print(f"Left key duplicates: {df1['key'].duplicated().sum()}")
print(f"Right key duplicates: {df2['key'].duplicated().sum()}")

# Preventive
merged = pd.merge(df1, df2, on='key', validate='one_to_one')

Error 3: All NaN values after merge

What happened: The key columns have different data types (int vs string) or different formatting (leading spaces, different case).

Fix: Align types and formatting before merging.

df1['key'] = df1['key'].astype(str).str.strip().str.lower()
df2['key'] = df2['key'].astype(str).str.strip().str.lower()
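To see which rows are failing to match before (or instead of) fixing the keys, pd.merge's indicator parameter adds a _merge column labeling each row's origin:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['kenya', 'brazil'], 'coverage': [75, 90]})
df2 = pd.DataFrame({'key': ['kenya', 'japan'], 'gdp': [2000, 34000]})

# _merge is 'both', 'left_only', or 'right_only' for each row
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged[['key', '_merge']])
```

An outer join with indicator=True is a quick audit: any 'left_only' or 'right_only' rows are keys that would silently drop out of an inner join.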

Error 4: "ValueError: Index contains duplicate entries, cannot reshape"

What happened: You used pivot() (not pivot_table()) and there are duplicate index-column combinations.

Fix: Use pivot_table() instead, which handles duplicates via aggregation:

# Instead of df.pivot(...)
df.pivot_table(index='country', columns='year',
               values='coverage', aggfunc='mean')

Error 5: Multi-index confusion after groupby

What happened: Grouping by multiple columns creates a multi-level index that's hard to work with.

Fix: Add .reset_index() after groupby to flatten the result into a regular DataFrame:

result = df.groupby(['region', 'year'])['coverage'].mean().reset_index()

Error 6: "KeyError" when accessing columns after melt

What happened: After melting, the old column names no longer exist. The year values that were column headers are now values in the var_name column.

Fix: Access the melted data using the new column names you specified in var_name and value_name.

# Before melt: df['2020'] works
# After melt:  long['year'] == '2020'  (filter by value, not column access)
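As a runnable sketch of the same point — after melting, the old column headers are values, so you filter rows rather than access a column:

```python
import pandas as pd

wide = pd.DataFrame({
    'country': ['Brazil', 'Kenya'],
    '2020': [83, 66],
    '2021': [85, 69],
})

long = pd.melt(wide, id_vars='country', var_name='year', value_name='coverage')

# wide['2020'] was column access; in long format, filter by value instead.
# Note the year values are strings here, because column labels were strings.
rows_2020 = long[long['year'] == '2020']
print(rows_2020)
```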

Quick Reference: When to Use What

I need to...                         | Use...                       | Example
Combine two tables on a shared key   | pd.merge()                   | pd.merge(vax, gdp, on='country')
Add columns from another table       | pd.merge() with how='left'   | Left join to enrich
Convert columns into rows            | pd.melt()                    | Years-as-columns to year-as-row
Convert rows into columns            | df.pivot_table()             | Observations to summary table
Compute statistics by group          | df.groupby().agg()           | Mean by region
Add group stats to each row          | df.groupby().transform()     | Regional mean on every row
Connect operations in a pipeline     | Method chaining with ()      | Filter, merge, group, sort

9.9 Project Checkpoint: Reshaping the Global Health Data

Time to apply everything to your running project. In Chapter 7, you loaded the WHO vaccination dataset into pandas. In Chapter 8, you cleaned it. Now you're going to reshape it for analysis.

Task 1: Pivot Vaccination Data for Cross-Region Comparison

Your cleaned vaccination data is in long format (one row per country-year-vaccine observation). Create a pivot table that shows average vaccination coverage by WHO region and year:

# Starting from your cleaned WHO data (long format)
region_year_summary = (
    who_data
    .groupby(['who_region', 'year'])['coverage']
    .mean()
    .reset_index()
    .pivot_table(index='who_region', columns='year', values='coverage')
    .round(1)
)

print(region_year_summary)

This gives you a table where each row is a region and each column is a year — perfect for seeing trends at a glance.

Task 2: Merge with GDP Data

Download GDP per capita data (from the World Bank or a similar source) and merge it with your vaccination data:

# Load GDP data
gdp = pd.read_csv('gdp_per_capita.csv')

# Check key alignment
print(f"Countries in vaccination data: {who_data['country'].nunique()}")
print(f"Countries in GDP data: {gdp['country'].nunique()}")
print(f"Overlap: {who_data['country'].isin(gdp['country']).sum()}")

# Merge
enriched = pd.merge(
    who_data,
    gdp[['country', 'year', 'gdp_per_capita']],
    on=['country', 'year'],
    how='left',
    validate='many_to_one'
)

# Check for merge quality
print(f"Rows before merge: {len(who_data)}")
print(f"Rows after merge: {len(enriched)}")
print(f"GDP missing after merge: {enriched['gdp_per_capita'].isna().sum()}")

Task 3: Create an Income-Coverage Summary

Group your merged data to answer: "Does vaccination coverage vary systematically with income level?"

income_summary = (
    enriched
    .assign(income_quartile=lambda x: pd.qcut(
        x['gdp_per_capita'], q=4,
        labels=['Low', 'Lower-middle', 'Upper-middle', 'High']
    ))
    .groupby('income_quartile')['coverage']
    .agg(['mean', 'median', 'std', 'count'])
    .round(1)
)

print(income_summary)

Save your enriched, reshaped dataset — you'll need it for visualization in Part III and statistical analysis in Part IV.


9.10 Spaced Review: Reinforcing Earlier Concepts

This chapter built on skills from Chapters 7 and 8. Let's make sure those foundations are solid.

From Chapter 7: DataFrame Fundamentals

  1. What is the difference between .loc[] and .iloc[]? Give an example of when you'd use each. (If this feels shaky, revisit Section 7.4.)

  2. Why do we prefer vectorized operations over loops? The merge and groupby operations you learned today are all vectorized — pandas does the looping internally in optimized C code. Could you write a merge using a for loop? Sure. Should you? No — it would be orders of magnitude slower.

  3. What does df.dtypes show you, and why should you check it before merging? (Recall: type mismatches are the number-one cause of empty merge results.)

From Chapter 8: Data Cleaning

  1. After merging two datasets, you notice new NaN values. Is this a data quality problem or an expected consequence of the join type? How can you tell the difference?

  2. You're about to merge a vaccination dataset with a GDP dataset, but the country names don't match exactly — one uses "United States" and the other uses "USA." What technique from Chapter 8 would you use to fix this before the merge?

  3. After melting a wide DataFrame, the new year column has dtype object (string). What Chapter 8 technique would you use to convert it to integer?

Connecting to What's Next

The reshaping skills you learned in this chapter are prerequisites for almost everything that follows:

  • Chapter 10 (Text Data): You'll use .str methods to clean up keys before merging — standardizing country names, extracting codes, etc.
  • Chapter 11 (Dates and Times): Time series analysis often requires melting date-as-column data into long format, then setting date as the index.
  • Chapters 14-18 (Visualization): Plotting libraries like seaborn expect long-format data with categorical columns — exactly what melt and groupby produce.
  • Chapters 19-24 (Statistics): Computing group-level statistics — the bread and butter of statistical analysis — uses groupby constantly.

You now have the tools to take data in any shape and mold it into whatever your analysis requires. That's a superpower.


Chapter Summary

This chapter covered the three structural transformations that reshape data without changing its values:

Merging combines two DataFrames by matching rows on shared key columns. The four join types — inner, left, right, and outer — determine what happens to unmatched rows. Inner keeps only matches. Left keeps all left-side rows. Right keeps all right-side rows. Outer keeps everything. Always check for duplicate keys (to prevent row explosion) and mismatched key types (to prevent empty results) before merging.

Reshaping converts between wide format (values spread across columns) and long format (values stacked in rows). Use melt() to go from wide to long — you'll do this whenever your analysis tool expects one observation per row. Use pivot_table() to go from long to wide — you'll do this when creating summary views for reports.

Grouping splits data into groups, applies a computation to each group, and combines the results. Use groupby().agg() for summary statistics by group. Use groupby().transform() when you want to compute group-level values but keep the original row count. Use named aggregation syntax for clean, readable code.

Method chaining connects these operations into readable pipelines where each line applies one transformation. Wrap chains in parentheses for multi-line formatting. Use .assign() for new columns, .query() for filtering, and .pipe() for custom functions within chains.

Common pitfalls include key explosions from many-to-many merges, empty results from type-mismatched keys, multi-index confusion after grouping, and trying to access old column names after melting. Use validate, indicator, reset_index(), and careful dtype checking to catch these early.

These skills transform you from someone who can only analyze data that arrives in the right shape to someone who can analyze data in any shape. That's a fundamental shift in capability.


Looking Ahead

In Chapter 10, we'll learn to work with text data — extracting information from messy strings, matching patterns with regular expressions, and cleaning up the kinds of inconsistent text fields that make merging difficult. The str.strip() and str.lower() calls you used to fix merge keys in this chapter? That's just the beginning of what pandas can do with text.