Key Takeaways: Reshaping and Transforming Data

This is your reference card for Chapter 9 — the chapter where you learned to take data in any shape and mold it into whatever your analysis demands. Keep this nearby whenever you're combining, reshaping, or summarizing DataFrames.


Join Types Comparison

Use this table every time you write a merge until you've internalized the differences.

Join Type What It Keeps NaN Behavior When to Use
Inner Only rows with keys in both tables No NaN from the merge You only want complete records
Left All rows from the left table NaN in right-side columns where no match Your left table is "primary"; you're enriching it
Outer All rows from both tables NaN on both sides where no match You need the full picture; auditing completeness
Right All rows from the right table NaN in left-side columns where no match Rarely used; swap tables and use left instead

The merge call:

pd.merge(left_df, right_df,
         on='shared_key',           # same column name in both
         # OR:
         left_on='key_a',           # different column names
         right_on='key_b',
         how='left',                # inner | left | right | outer
         validate='many_to_one',    # catch unexpected duplicates
         indicator=True)            # adds _merge column for debugging

Pre-merge checklist: - [ ] Check both key columns have the same dtype (df['key'].dtype) - [ ] Check for duplicate keys (df['key'].duplicated().sum()) - [ ] Check for key overlap (left['key'].isin(right['key']).sum()) - [ ] Check for NaN in key columns (df['key'].isna().sum()) - [ ] Strip whitespace and standardize case if keys are strings


Reshape Cheat Sheet

WIDE FORMAT  --------melt()-------->  LONG FORMAT
(values in   <-----pivot_table()----  (values in
 columns)                              rows)

Melt (wide to long):

pd.melt(df,
        id_vars='country',         # column(s) to KEEP as-is
        var_name='year',           # name for column of old headers
        value_name='coverage')     # name for column of old values

Rule of thumb: if your column headers contain data values (years, vaccine names, question numbers), you probably need to melt.

Pivot (long to wide):

df.pivot_table(
    index='country',       # what becomes the rows
    columns='year',        # what becomes the columns
    values='coverage',     # what fills the cells
    aggfunc='mean')        # how to handle duplicates

Use pivot_table (not pivot) to safely handle duplicate index-column pairs.

Shape arithmetic: - Melting a table with R rows and C value columns produces R x C rows - Pivoting reduces rows by the number of unique values in the columns parameter


GroupBy Patterns

The split-apply-combine pattern:

df.groupby('column')['target'].operation()
     SPLIT         SELECT     APPLY + COMBINE

Common operations:

Pattern Code Result
Single stat .groupby('region')['coverage'].mean() One value per group
Multiple stats .groupby('region')['coverage'].agg(['mean', 'min', 'max']) Multiple columns per group
Named aggregation .groupby('region').agg(avg=('coverage', 'mean'), n=('country', 'count')) Named output columns
Different stats per column .agg(avg_cov=('coverage', 'mean'), total_pop=('population', 'sum')) Mixed aggregations
Group stat on every row .groupby('region')['coverage'].transform('mean') Same row count as input
Multi-level grouping .groupby(['region', 'year'])['coverage'].mean() Multi-index result

After groupby: - Use .reset_index() to flatten a multi-index into regular columns - Use .sort_values() to rank groups - Use .unstack() as an alternative to pivot_table for reshaping grouped results


Method Chaining Template

result = (
    df
    .query('year >= 2020')                                    # filter rows
    .assign(rate=lambda x: x['count'] / x['total'] * 100)    # add column
    .merge(other_df, on='key', how='left')                    # combine tables
    .groupby('category')                                       # split
    .agg(avg_rate=('rate', 'mean'))                           # aggregate
    .sort_values('avg_rate', ascending=False)                  # sort
    .reset_index()                                             # flatten
)

Chaining tools: - .query() for filtering (cleaner in chains than bracket notation) - .assign() for new columns (returns new DataFrame, doesn't modify in place) - .pipe(func) for custom functions that take a DataFrame and return a DataFrame


Common Errors Quick Fix

Symptom Likely Cause Fix
Merge returns empty DataFrame Key dtype mismatch (int vs string) .astype(int) or .astype(str) on one side
Merge returns too many rows Duplicate keys on both sides (many-to-many) Check .duplicated() on keys; use validate
NaN everywhere after merge Keys don't actually match (whitespace, case) .str.strip().str.lower() on key columns
ValueError: cannot reshape Duplicate index-column pairs in pivot Use pivot_table with aggfunc instead of pivot
Multi-index confusion Grouping by multiple columns Add .reset_index() after groupby
KeyError after melt Accessing old column names Column names became values — filter, don't index
Overlapping column names in merge Both tables share non-key columns Use suffixes=('_left', '_right')

Key Concepts

  • Merging is how you combine separate datasets into one analysis-ready table. The join type determines what happens to unmatched rows. Always check key alignment and data types before merging.

  • Wide format stores values across columns (one column per year, per vaccine, per metric). It's human-readable but hard to analyze programmatically.

  • Long format stores values in rows (one row per observation). It's verbose but flexible — most pandas operations, plotting libraries, and statistical tools expect it.

  • GroupBy implements the split-apply-combine pattern: divide data into groups, compute something within each group, reassemble the results. It's the foundation of virtually all summary statistics.

  • Method chaining connects transformations into readable pipelines. Each line is a verb (filter, merge, group, sort). Wrap chains in parentheses for multi-line formatting.


Terms to Remember

Term Definition
merge Combining two DataFrames by matching rows on shared key columns
join A general term for combining tables; in pandas, .join() is an index-based alternative to .merge()
inner join Keeps only rows where the key appears in both tables
left join Keeps all rows from the left table; fills NaN where the right table has no match
outer join Keeps all rows from both tables; fills NaN on both sides where no match exists
pivot_table Reshapes long data to wide by turning row values into column headers, with aggregation
melt Reshapes wide data to long by unpivoting columns into rows
wide format Data layout where one variable's values are spread across multiple columns
long format Data layout where each row represents a single observation; also called tidy format
groupby Splitting a DataFrame into groups based on column values for group-wise computation
aggregation Reducing multiple values to a single summary value (mean, sum, count, etc.)
agg The pandas method for applying one or more aggregation functions to grouped data
transform A groupby method that broadcasts group-level results back to every row, preserving row count
multi-index An index with two or more levels, created when grouping by multiple columns
method chaining Connecting multiple DataFrame operations in sequence, with each step's output feeding the next

What You Should Be Able to Do Now

Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section.

  • [ ] Merge two DataFrames on a shared key using inner, left, and outer joins
  • [ ] Handle different key column names using left_on and right_on
  • [ ] Merge on multiple keys by passing a list of column names
  • [ ] Check for key alignment (dtype, duplicates, NaN, whitespace) before merging
  • [ ] Use validate and indicator to catch merge problems early
  • [ ] Melt a wide DataFrame into long format, specifying id_vars, var_name, and value_name
  • [ ] Pivot a long DataFrame into wide format using pivot_table
  • [ ] Group by one or more columns and compute aggregations with .agg()
  • [ ] Use named aggregation syntax for clean, readable summary tables
  • [ ] Use .transform() to add group-level statistics to individual rows
  • [ ] Build a method chain that connects filtering, merging, grouping, and sorting
  • [ ] Diagnose common errors: empty merge results, row explosions, multi-index confusion

If you checked every box, you now have the structural transformation skills that underpin every serious data analysis. The data you receive will almost never be in the shape you need — and now that's no longer a problem.