Key Takeaways: Reshaping and Transforming Data
This is your reference card for Chapter 9 — the chapter where you learned to take data in any shape and mold it into whatever your analysis demands. Keep this nearby whenever you're combining, reshaping, or summarizing DataFrames.
Join Types Comparison
Use this table every time you write a merge until you've internalized the differences.
| Join Type | What It Keeps | NaN Behavior | When to Use |
|---|---|---|---|
| Inner | Only rows with keys in both tables | No NaN from the merge | You only want complete records |
| Left | All rows from the left table | NaN in right-side columns where no match | Your left table is "primary"; you're enriching it |
| Outer | All rows from both tables | NaN on both sides where no match | You need the full picture; auditing completeness |
| Right | All rows from the right table | NaN in left-side columns where no match | Rarely used; swap tables and use left instead |
The merge call:
pd.merge(left_df, right_df,
on='shared_key', # same column name in both
# OR:
left_on='key_a', # different column names
right_on='key_b',
how='left', # inner | left | right | outer
validate='many_to_one', # catch unexpected duplicates
indicator=True) # adds _merge column for debugging
Pre-merge checklist:
- [ ] Check both key columns have the same dtype (df['key'].dtype)
- [ ] Check for duplicate keys (df['key'].duplicated().sum())
- [ ] Check for key overlap (left['key'].isin(right['key']).sum())
- [ ] Check for NaN in key columns (df['key'].isna().sum())
- [ ] Strip whitespace and standardize case if keys are strings
Reshape Cheat Sheet
WIDE FORMAT --------melt()--------> LONG FORMAT
(values in <-----pivot_table()---- (values in
columns) rows)
Melt (wide to long):
pd.melt(df,
id_vars='country', # column(s) to KEEP as-is
var_name='year', # name for column of old headers
value_name='coverage') # name for column of old values
Rule of thumb: if your column headers contain data values (years, vaccine names, question numbers), you probably need to melt.
Pivot (long to wide):
df.pivot_table(
index='country', # what becomes the rows
columns='year', # what becomes the columns
values='coverage', # what fills the cells
aggfunc='mean') # how to handle duplicates
Use pivot_table (not pivot) to safely handle duplicate index-column pairs.
Shape arithmetic:
- Melting a table with R rows and C value columns produces R x C rows
- Pivoting reduces rows by the number of unique values in the columns parameter
GroupBy Patterns
The split-apply-combine pattern:
df.groupby('column')['target'].operation()
SPLIT SELECT APPLY + COMBINE
Common operations:
| Pattern | Code | Result |
|---|---|---|
| Single stat | .groupby('region')['coverage'].mean() |
One value per group |
| Multiple stats | .groupby('region')['coverage'].agg(['mean', 'min', 'max']) |
Multiple columns per group |
| Named aggregation | .groupby('region').agg(avg=('coverage', 'mean'), n=('country', 'count')) |
Named output columns |
| Different stats per column | .agg(avg_cov=('coverage', 'mean'), total_pop=('population', 'sum')) |
Mixed aggregations |
| Group stat on every row | .groupby('region')['coverage'].transform('mean') |
Same row count as input |
| Multi-level grouping | .groupby(['region', 'year'])['coverage'].mean() |
Multi-index result |
After groupby:
- Use .reset_index() to flatten a multi-index into regular columns
- Use .sort_values() to rank groups
- Use .unstack() as an alternative to pivot_table for reshaping grouped results
Method Chaining Template
result = (
df
.query('year >= 2020') # filter rows
.assign(rate=lambda x: x['count'] / x['total'] * 100) # add column
.merge(other_df, on='key', how='left') # combine tables
.groupby('category') # split
.agg(avg_rate=('rate', 'mean')) # aggregate
.sort_values('avg_rate', ascending=False) # sort
.reset_index() # flatten
)
Chaining tools:
- .query() for filtering (cleaner in chains than bracket notation)
- .assign() for new columns (returns new DataFrame, doesn't modify in place)
- .pipe(func) for custom functions that take a DataFrame and return a DataFrame
Common Errors Quick Fix
| Symptom | Likely Cause | Fix |
|---|---|---|
| Merge returns empty DataFrame | Key dtype mismatch (int vs string) | .astype(int) or .astype(str) on one side |
| Merge returns too many rows | Duplicate keys on both sides (many-to-many) | Check .duplicated() on keys; use validate |
| NaN everywhere after merge | Keys don't actually match (whitespace, case) | .str.strip().str.lower() on key columns |
ValueError: cannot reshape |
Duplicate index-column pairs in pivot | Use pivot_table with aggfunc instead of pivot |
| Multi-index confusion | Grouping by multiple columns | Add .reset_index() after groupby |
| KeyError after melt | Accessing old column names | Column names became values — filter, don't index |
| Overlapping column names in merge | Both tables share non-key columns | Use suffixes=('_left', '_right') |
Key Concepts
-
Merging is how you combine separate datasets into one analysis-ready table. The join type determines what happens to unmatched rows. Always check key alignment and data types before merging.
-
Wide format stores values across columns (one column per year, per vaccine, per metric). It's human-readable but hard to analyze programmatically.
-
Long format stores values in rows (one row per observation). It's verbose but flexible — most pandas operations, plotting libraries, and statistical tools expect it.
-
GroupBy implements the split-apply-combine pattern: divide data into groups, compute something within each group, reassemble the results. It's the foundation of virtually all summary statistics.
-
Method chaining connects transformations into readable pipelines. Each line is a verb (filter, merge, group, sort). Wrap chains in parentheses for multi-line formatting.
Terms to Remember
| Term | Definition |
|---|---|
| merge | Combining two DataFrames by matching rows on shared key columns |
| join | A general term for combining tables; in pandas, .join() is an index-based alternative to .merge() |
| inner join | Keeps only rows where the key appears in both tables |
| left join | Keeps all rows from the left table; fills NaN where the right table has no match |
| outer join | Keeps all rows from both tables; fills NaN on both sides where no match exists |
| pivot_table | Reshapes long data to wide by turning row values into column headers, with aggregation |
| melt | Reshapes wide data to long by unpivoting columns into rows |
| wide format | Data layout where one variable's values are spread across multiple columns |
| long format | Data layout where each row represents a single observation; also called tidy format |
| groupby | Splitting a DataFrame into groups based on column values for group-wise computation |
| aggregation | Reducing multiple values to a single summary value (mean, sum, count, etc.) |
| agg | The pandas method for applying one or more aggregation functions to grouped data |
| transform | A groupby method that broadcasts group-level results back to every row, preserving row count |
| multi-index | An index with two or more levels, created when grouping by multiple columns |
| method chaining | Connecting multiple DataFrame operations in sequence, with each step's output feeding the next |
What You Should Be Able to Do Now
Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section.
- [ ] Merge two DataFrames on a shared key using inner, left, and outer joins
- [ ] Handle different key column names using
left_onandright_on - [ ] Merge on multiple keys by passing a list of column names
- [ ] Check for key alignment (dtype, duplicates, NaN, whitespace) before merging
- [ ] Use
validateandindicatorto catch merge problems early - [ ] Melt a wide DataFrame into long format, specifying
id_vars,var_name, andvalue_name - [ ] Pivot a long DataFrame into wide format using
pivot_table - [ ] Group by one or more columns and compute aggregations with
.agg() - [ ] Use named aggregation syntax for clean, readable summary tables
- [ ] Use
.transform()to add group-level statistics to individual rows - [ ] Build a method chain that connects filtering, merging, grouping, and sorting
- [ ] Diagnose common errors: empty merge results, row explosions, multi-index confusion
If you checked every box, you now have the structural transformation skills that underpin every serious data analysis. The data you receive will almost never be in the shape you need — and now that's no longer a problem.