Key Takeaways: Reshaping and Transforming Data

Contributors to Introduction to Data Science

Key Takeaways: Reshaping and Transforming Data

This is your reference card for Chapter 9 — the chapter where you learned to take data in any shape and mold it into whatever your analysis demands. Keep this nearby whenever you're combining, reshaping, or summarizing DataFrames.

Join Types Comparison

Use this table every time you write a merge until you've internalized the differences.

Join Type	What It Keeps	NaN Behavior	When to Use
Inner	Only rows with keys in both tables	No NaN from the merge	You only want complete records
Left	All rows from the left table	NaN in right-side columns where no match	Your left table is "primary"; you're enriching it
Outer	All rows from both tables	NaN on both sides where no match	You need the full picture; auditing completeness
Right	All rows from the right table	NaN in left-side columns where no match	Rarely used; swap tables and use left instead

The merge call:

pd.merge(left_df, right_df,
         on='shared_key',           # same column name in both
         # OR:
         left_on='key_a',           # different column names
         right_on='key_b',
         how='left',                # inner | left | right | outer
         validate='many_to_one',    # catch unexpected duplicates
         indicator=True)            # adds _merge column for debugging

Pre-merge checklist: - [ ] Check both key columns have the same dtype (df['key'].dtype) - [ ] Check for duplicate keys (df['key'].duplicated().sum()) - [ ] Check for key overlap (left['key'].isin(right['key']).sum()) - [ ] Check for NaN in key columns (df['key'].isna().sum()) - [ ] Strip whitespace and standardize case if keys are strings

Reshape Cheat Sheet

WIDE FORMAT  --------melt()-------->  LONG FORMAT
(values in   <-----pivot_table()----  (values in
 columns)                              rows)

Melt (wide to long):

pd.melt(df,
        id_vars='country',         # column(s) to KEEP as-is
        var_name='year',           # name for column of old headers
        value_name='coverage')     # name for column of old values

Rule of thumb: if your column headers contain data values (years, vaccine names, question numbers), you probably need to melt.

Pivot (long to wide):

df.pivot_table(
    index='country',       # what becomes the rows
    columns='year',        # what becomes the columns
    values='coverage',     # what fills the cells
    aggfunc='mean')        # how to handle duplicates

Use pivot_table (not pivot) to safely handle duplicate index-column pairs.

Shape arithmetic: - Melting a table with R rows and C value columns produces R x C rows - Pivoting reduces rows by the number of unique values in the columns parameter

GroupBy Patterns

The split-apply-combine pattern:

df.groupby('column')['target'].operation()
     SPLIT         SELECT     APPLY + COMBINE

Common operations:

Pattern	Code	Result
Single stat	`.groupby('region')['coverage'].mean()`	One value per group
Multiple stats	`.groupby('region')['coverage'].agg(['mean', 'min', 'max'])`	Multiple columns per group
Named aggregation	`.groupby('region').agg(avg=('coverage', 'mean'), n=('country', 'count'))`	Named output columns
Different stats per column	`.agg(avg_cov=('coverage', 'mean'), total_pop=('population', 'sum'))`	Mixed aggregations
Group stat on every row	`.groupby('region')['coverage'].transform('mean')`	Same row count as input
Multi-level grouping	`.groupby(['region', 'year'])['coverage'].mean()`	Multi-index result

After groupby: - Use .reset_index() to flatten a multi-index into regular columns - Use .sort_values() to rank groups - Use .unstack() as an alternative to pivot_table for reshaping grouped results

Method Chaining Template

result = (
    df
    .query('year >= 2020')                                    # filter rows
    .assign(rate=lambda x: x['count'] / x['total'] * 100)    # add column
    .merge(other_df, on='key', how='left')                    # combine tables
    .groupby('category')                                       # split
    .agg(avg_rate=('rate', 'mean'))                           # aggregate
    .sort_values('avg_rate', ascending=False)                  # sort
    .reset_index()                                             # flatten
)

Chaining tools: - .query() for filtering (cleaner in chains than bracket notation) - .assign() for new columns (returns new DataFrame, doesn't modify in place) - .pipe(func) for custom functions that take a DataFrame and return a DataFrame

Common Errors Quick Fix

Symptom	Likely Cause	Fix
Merge returns empty DataFrame	Key dtype mismatch (int vs string)	`.astype(int)` or `.astype(str)` on one side
Merge returns too many rows	Duplicate keys on both sides (many-to-many)	Check `.duplicated()` on keys; use `validate`
NaN everywhere after merge	Keys don't actually match (whitespace, case)	`.str.strip().str.lower()` on key columns
`ValueError: cannot reshape`	Duplicate index-column pairs in pivot	Use `pivot_table` with `aggfunc` instead of `pivot`
Multi-index confusion	Grouping by multiple columns	Add `.reset_index()` after groupby
KeyError after melt	Accessing old column names	Column names became values — filter, don't index
Overlapping column names in merge	Both tables share non-key columns	Use `suffixes=('_left', '_right')`

Key Concepts

Merging is how you combine separate datasets into one analysis-ready table. The join type determines what happens to unmatched rows. Always check key alignment and data types before merging.
Wide format stores values across columns (one column per year, per vaccine, per metric). It's human-readable but hard to analyze programmatically.
Long format stores values in rows (one row per observation). It's verbose but flexible — most pandas operations, plotting libraries, and statistical tools expect it.
GroupBy implements the split-apply-combine pattern: divide data into groups, compute something within each group, reassemble the results. It's the foundation of virtually all summary statistics.
Method chaining connects transformations into readable pipelines. Each line is a verb (filter, merge, group, sort). Wrap chains in parentheses for multi-line formatting.

Terms to Remember

Term	Definition
merge	Combining two DataFrames by matching rows on shared key columns
join	A general term for combining tables; in pandas, `.join()` is an index-based alternative to `.merge()`
inner join	Keeps only rows where the key appears in both tables
left join	Keeps all rows from the left table; fills NaN where the right table has no match
outer join	Keeps all rows from both tables; fills NaN on both sides where no match exists
pivot_table	Reshapes long data to wide by turning row values into column headers, with aggregation
melt	Reshapes wide data to long by unpivoting columns into rows
wide format	Data layout where one variable's values are spread across multiple columns
long format	Data layout where each row represents a single observation; also called tidy format
groupby	Splitting a DataFrame into groups based on column values for group-wise computation
aggregation	Reducing multiple values to a single summary value (mean, sum, count, etc.)
agg	The pandas method for applying one or more aggregation functions to grouped data
transform	A groupby method that broadcasts group-level results back to every row, preserving row count
multi-index	An index with two or more levels, created when grouping by multiple columns
method chaining	Connecting multiple DataFrame operations in sequence, with each step's output feeding the next

What You Should Be Able to Do Now

Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section.

[ ] Merge two DataFrames on a shared key using inner, left, and outer joins
[ ] Handle different key column names using left_on and right_on
[ ] Merge on multiple keys by passing a list of column names
[ ] Check for key alignment (dtype, duplicates, NaN, whitespace) before merging
[ ] Use validate and indicator to catch merge problems early
[ ] Melt a wide DataFrame into long format, specifying id_vars, var_name, and value_name
[ ] Pivot a long DataFrame into wide format using pivot_table
[ ] Group by one or more columns and compute aggregations with .agg()
[ ] Use named aggregation syntax for clean, readable summary tables
[ ] Use .transform() to add group-level statistics to individual rows
[ ] Build a method chain that connects filtering, merging, grouping, and sorting
[ ] Diagnose common errors: empty merge results, row explosions, multi-index confusion

If you checked every box, you now have the structural transformation skills that underpin every serious data analysis. The data you receive will almost never be in the shape you need — and now that's no longer a problem.