Chapter 9 Exercises: Reshaping and Transforming Data
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 9 material, building from recall through application to original analysis. Part E applies your skills to new datasets. Part M mixes in concepts from earlier chapters to reinforce your growing toolkit. You'll need Python with pandas installed for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 9.1 — Join type selection
For each scenario below, state which join type (inner, left, right, or outer) is most appropriate and explain why in one or two sentences.
(a) You have a list of all students in a course and want to add their exam scores. Some students haven't taken the exam yet — you want to keep those students in the result with NaN for the score.
(b) You have employee records from HR and salary records from Payroll. You only want to analyze employees who appear in both systems.
(c) You're auditing two inventory systems and want to see every item that appears in either system, flagging items that are missing from one or the other.
(d) You have a primary customer list and want to enrich it with data from a third-party demographics dataset. You want to keep all your customers even if the third party doesn't have data for some of them.
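Before answering, it can help to see the four join types side by side on a toy pair of tables. The data below is made up and unrelated to the scenarios above; it just shows which keys survive each join.

```python
import pandas as pd

# Toy tables: key 'A' exists only on the left, 'C' only on the right
left = pd.DataFrame({'key': ['A', 'B'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'rval': [3, 4]})

for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left, right, on='key', how=how)
    print(how, '->', sorted(result['key']))
# inner keeps only B; left keeps A and B; right keeps B and C; outer keeps all three
```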
Guidance
(a) **Left join** — the student list is your primary dataset. You want every student, with scores where available. (b) **Inner join** — you only want complete records that exist in both systems. (c) **Outer join** — you need the full picture from both sides to identify gaps. (d) **Left join** — your customer list is primary; third-party data is supplementary.
Exercise 9.2 — Wide vs. long identification
Look at each dataset description and classify it as wide format or long format. Then state which format would be more useful for the analysis described.
(a) A spreadsheet with columns Student, Math_Score, Science_Score, English_Score. You want to compute the overall average score across all subjects.
(b) A table with columns City, Month, Temperature. Each city appears 12 times (once per month). You want to compare January temperatures across cities.
(c) A financial report with columns Company, Q1_Revenue, Q2_Revenue, Q3_Revenue, Q4_Revenue. You want to plot revenue trends over time for each company.
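As a concrete illustration of scenario (a), a wide table can be melted so that all scores live in one column, after which the overall average is a single-column mean. The score values below are made up for the sketch.

```python
import pandas as pd

# Scenario (a): subjects spread across columns (wide format)
wide = pd.DataFrame({
    'Student': ['Ana', 'Ben'],
    'Math_Score': [90, 80],
    'Science_Score': [85, 75],
    'English_Score': [88, 92]
})

# Melt to long: one row per student-subject observation
long = wide.melt(id_vars='Student', var_name='Subject', value_name='Score')

# The overall average across all subjects is now just a column mean
overall_avg = long['Score'].mean()
print(overall_avg)  # 85.0
```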
Guidance
(a) **Wide format** (subjects are spread across columns). For computing overall average across subjects, **long format** is better — you'd melt so all scores are in one column, then compute the mean. (b) **Long format** (each observation is one city-month). For comparing January specifically, this format works well — filter for `Month == 'January'`. (c) **Wide format** (quarters are spread across columns). For plotting time trends, **long format** is better — melt so you have a `Quarter` column and a `Revenue` column.
Exercise 9.3 — Split-apply-combine in plain English
Explain the split-apply-combine pattern by describing what happens when you run df.groupby('department')['salary'].mean() on a company employee dataset. Use the words "split," "apply," and "combine" in your answer, and describe what each step produces.
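A minimal sketch of the three steps, using hypothetical employee data. The `get_group` call makes the "split" step visible before the apply-and-combine happens in one line.

```python
import pandas as pd

# Hypothetical employee data
df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Mkt', 'Mkt'],
    'salary': [100, 120, 80, 90]
})

# Split: one sub-frame per unique department value
groups = df.groupby('department')
eng = groups.get_group('Eng')  # just the Engineering rows

# Apply: reduce each group's salary column to its mean
# Combine: collect the per-group means into one Series
result = groups['salary'].mean()
print(result)  # Eng -> 110.0, Mkt -> 85.0
```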
Guidance
**Split:** Pandas divides the DataFrame into separate groups based on the unique values in the `department` column (e.g., all Engineering rows in one group, all Marketing rows in another). **Apply:** For each group, pandas computes the mean of the `salary` column. **Combine:** Pandas collects the per-group means into a single Series, with department names as the index and mean salaries as the values.
Exercise 9.4 — Predicting merge output
Without running any code, predict the output of this merge. How many rows will the result have?
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'val2': [4, 5, 6]})
result = pd.merge(left, right, on='key', how='inner')
Now predict the output for how='outer'. How many rows? Which cells will be NaN?
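Once you have written down both predictions, you can verify them by running the merges and inspecting the row counts and NaN cells:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'val2': [4, 5, 6]})

inner = pd.merge(left, right, on='key', how='inner')
outer = pd.merge(left, right, on='key', how='outer')

print(len(inner), len(outer))           # compare with your predicted row counts
print(outer[outer.isna().any(axis=1)])  # the rows that contain NaN
```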
Guidance
**Inner join:** Only keys that appear in both tables — B and C. Result has **2 rows**: `(B, 2, 4)` and `(C, 3, 5)`. A and D are dropped. **Outer join:** All keys from both tables — A, B, C, D. Result has **4 rows**: `(A, 1, NaN)`, `(B, 2, 4)`, `(C, 3, 5)`, `(D, NaN, 6)`. A has NaN for val2 (not in right); D has NaN for val1 (not in left).
Exercise 9.5 — Method chaining readability
Rewrite the following step-by-step code as a single method chain. Then write a one-sentence description of what the entire pipeline does.
temp = df.dropna(subset=['revenue'])
temp = temp[temp['year'] >= 2020]
temp = temp.groupby('region')['revenue'].sum()
temp = temp.reset_index()
temp = temp.sort_values('revenue', ascending=False)
Guidance
result = (
df
.dropna(subset=['revenue'])
.query('year >= 2020')
.groupby('region')['revenue']
.sum()
.reset_index()
.sort_values('revenue', ascending=False)
)
Description: "This pipeline removes rows with missing revenue, filters to 2020 and later, computes total revenue by region, and sorts regions from highest to lowest total revenue."
Exercise 9.6 — Transform vs. agg
Explain the difference between groupby().agg() and groupby().transform(). For each of the following tasks, state which one you'd use:
(a) Compute the average salary in each department. (b) Add a column to each employee's row showing their department's average salary. (c) Create a summary table with the minimum, maximum, and mean temperature for each city. (d) Flag every transaction that's more than 2 standard deviations above its category's mean.
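The contrast between the two is easiest to see on a toy frame (the salary values here are made up): `.agg()` shrinks the output to one row per group, while `.transform()` keeps every original row.

```python
import pandas as pd

# Hypothetical salary data
df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Mkt'],
    'salary': [100, 120, 80]
})

# agg: one value per group (2 rows for 2 departments)
summary = df.groupby('department')['salary'].agg('mean')
print(len(summary))  # 2

# transform: row count preserved; each row gets its group's mean
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
print(len(df))  # still 3
```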
Guidance
`.agg()` **reduces** each group to a summary value, producing fewer rows. `.transform()` **preserves** the original row count, broadcasting group-level values back to each row. (a) `.agg()` — you want one number per department. (b) `.transform()` — you want the department average on *every* employee row. (c) `.agg()` — you want a summary table with fewer rows. (d) `.transform()` — you need the category mean and std on each row to compute the flag.
Part B: Applied Practice (2-star)
These exercises require writing code. Create the DataFrames as shown and verify your results.
Exercise 9.7 — Basic merge practice
Create these two DataFrames:
products = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105],
'name': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatchamacallit'],
'category': ['Tools', 'Electronics', 'Tools', 'Electronics', 'Toys']
})
sales = pd.DataFrame({
'product_id': [101, 102, 102, 103, 106],
'date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-15', '2024-01-18'],
'quantity': [10, 5, 8, 3, 12]
})
(a) Perform an inner join on product_id. How many rows are in the result? Which products are missing?
(b) Perform a left join with products on the left. How does the result differ from the inner join?
(c) Perform an outer join with indicator=True. Which products appear in only one table?
(d) Product 106 in the sales table is not in the products table — this is a data quality issue. Write code to identify all product_ids in sales that don't exist in products.
Guidance
(a) Inner join produces 4 rows (product_ids 101, 102, 102, 103). Products 104, 105 (no sales), and 106 (no product info) are missing. (b) Left join produces 6 rows — all 5 products plus the duplicate for 102. Products 104 and 105 appear with NaN for date and quantity. Product 106 is dropped. (c) Outer join produces 7 rows. The `_merge` column shows 104 and 105 as `left_only`, 106 as `right_only`. (d) `sales[~sales['product_id'].isin(products['product_id'])]` or check the `_merge == 'right_only'` rows from part (c).
Exercise 9.8 — Melt and pivot
Create this wide-format DataFrame:
temps = pd.DataFrame({
'city': ['Seattle', 'Phoenix', 'Chicago'],
'jan': [42, 67, 26],
'apr': [52, 85, 50],
'jul': [76, 106, 77],
'oct': [53, 88, 52]
})
(a) Melt it into long format with columns city, month, and temperature.
(b) Using the melted result, find the city with the highest temperature in any month.
(c) Using the melted result, compute the temperature range (max - min) for each city.
(d) Pivot the melted data back to wide format. Compare it to the original. Is it identical?
Guidance
(a) `pd.melt(temps, id_vars='city', var_name='month', value_name='temperature')` produces 12 rows. (b) `long.loc[long['temperature'].idxmax()]` — Phoenix in July (106). (c) `long.groupby('city')['temperature'].agg(lambda x: x.max() - x.min())` — Phoenix: 39, Chicago: 51, Seattle: 34. (d) `long.pivot_table(index='city', columns='month', values='temperature')`. It will be similar but the column order may differ and the index will be `city` rather than a default integer index.
Exercise 9.9 — GroupBy aggregations
Using the following DataFrame:
orders = pd.DataFrame({
'customer': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice', 'Carol', 'Bob'],
'product': ['Book', 'Book', 'Pen', 'Book', 'Pen', 'Book', 'Pen', 'Book'],
'amount': [12.99, 12.99, 2.50, 12.99, 2.50, 12.99, 2.50, 12.99],
'quantity': [1, 2, 5, 1, 3, 1, 2, 1]
})
(a) Compute the total amount spent by each customer.
(b) Compute both the total amount and total quantity per customer using .agg().
(c) Using named aggregation, create a summary with columns total_spent, num_orders, and avg_quantity grouped by customer.
(d) Add a column to the original DataFrame showing each customer's total spending (using .transform()).
Guidance
(a) `orders.groupby('customer')['amount'].sum()`. (b) `orders.groupby('customer')[['amount', 'quantity']].agg('sum')`. (c) `orders.groupby('customer').agg(total_spent=('amount', 'sum'), num_orders=('amount', 'count'), avg_quantity=('quantity', 'mean'))`. (d) `orders['customer_total'] = orders.groupby('customer')['amount'].transform('sum')`.
Exercise 9.10 — Multi-key merge
Create these DataFrames:
enrollment = pd.DataFrame({
'student': ['Ana', 'Ana', 'Ben', 'Ben', 'Cara'],
'course': ['Math', 'English', 'Math', 'Science', 'English'],
'semester': ['Fall', 'Fall', 'Fall', 'Fall', 'Fall']
})
grades = pd.DataFrame({
'student': ['Ana', 'Ana', 'Ben', 'Ben', 'Cara'],
'course': ['Math', 'English', 'Math', 'Science', 'English'],
'grade': [92, 88, 75, 91, 95]
})
(a) Merge on both student and course. Why would merging on just student give incorrect results?
(b) After merging, compute the average grade per course.
(c) Add a column showing each student's deviation from the course average using transform.
Guidance
(a) Merging on just `student` would create a many-to-many join: Ana's two enrollment rows would each match with Ana's two grade rows, producing 4 rows instead of 2. Merging on both `['student', 'course']` ensures each enrollment matches its correct grade. (b) `merged.groupby('course')['grade'].mean()`. (c) `merged['course_avg'] = merged.groupby('course')['grade'].transform('mean')` then `merged['deviation'] = merged['grade'] - merged['course_avg']`.
Exercise 9.11 — Method chain construction
Start with this DataFrame:
sales = pd.DataFrame({
'store': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
'month': ['Jan', 'Feb', 'Jan', 'Feb', 'Mar', 'Mar', 'Jan', 'Feb'],
'revenue': [1200, 1500, 800, 950, 1800, 1100, 1300, 870],
'returns': [50, 30, 20, 40, 10, 60, 45, 35]
})
Write a single method chain that:
1. Creates a net_revenue column (revenue minus returns)
2. Groups by store
3. Computes total net revenue per store
4. Sorts from highest to lowest
5. Resets the index
Guidance
result = (
sales
.assign(net_revenue=lambda x: x['revenue'] - x['returns'])
.groupby('store')['net_revenue']
.sum()
.sort_values(ascending=False)
.reset_index()
)
Exercise 9.12 — Diagnosing a key explosion
A colleague runs this code and their computer freezes:
visits = pd.DataFrame({
'patient_id': [1, 1, 1, 2, 2],
'visit_date': ['2024-01-01', '2024-02-15', '2024-03-20', '2024-01-10', '2024-02-28'],
'diagnosis': ['Cold', 'Flu', 'Cold', 'Sprain', 'Follow-up']
})
prescriptions = pd.DataFrame({
'patient_id': [1, 1, 1, 2, 2, 2],
'medication': ['Aspirin', 'Ibuprofen', 'Cough syrup', 'Bandage', 'Aspirin', 'Ice pack'],
'prescribed_date': ['2024-01-01', '2024-02-15', '2024-02-15', '2024-01-10', '2024-01-10', '2024-02-28']
})
result = pd.merge(visits, prescriptions, on='patient_id')
(a) Explain why this merge produces more rows than either input table.
(b) How many rows will the result have? Show your calculation.
(c) Suggest a better merge that avoids the explosion.
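You can check the row-count arithmetic for part (b) without running the risky merge at all: a merge on a single key produces, for each key value, (rows on the left) times (rows on the right). Multiplying the two `value_counts` Series aligns them by key.

```python
import pandas as pd

# Just the key columns are needed for the row-count check
visits = pd.DataFrame({'patient_id': [1, 1, 1, 2, 2]})
prescriptions = pd.DataFrame({'patient_id': [1, 1, 1, 2, 2, 2]})

v_counts = visits['patient_id'].value_counts()        # {1: 3, 2: 2}
p_counts = prescriptions['patient_id'].value_counts()  # {1: 3, 2: 3}

# Expected merged rows = sum over keys of (left count * right count)
expected = (v_counts * p_counts).sum()
print(expected)  # 3*3 + 2*3 = 15
```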
Guidance
(a) Patient 1 has 3 visits and 3 prescriptions, so the merge creates 3x3=9 rows just for patient 1. Patient 2 has 2 visits and 3 prescriptions, producing 2x3=6 rows. Total: 15 rows from 5+6=11 input rows. (b) 9 + 6 = **15 rows**. (c) Merge on both `patient_id` and date: `pd.merge(visits, prescriptions, left_on=['patient_id', 'visit_date'], right_on=['patient_id', 'prescribed_date'])`. This matches visits with prescriptions from the same date.
Exercise 9.13 — Fixing a type mismatch
You try to merge these DataFrames and get zero rows:
survey = pd.DataFrame({'zip_code': [10001, 10002, 10003], 'response': ['Yes', 'No', 'Yes']})
demographics = pd.DataFrame({'zip_code': ['10001', '10002', '10003'], 'population': [21000, 36000, 15000]})
result = pd.merge(survey, demographics, on='zip_code')
print(len(result)) # 0 — why?
(a) Diagnose the problem.
(b) Write code to fix it and produce the correct merge.
(c) How could you have caught this before the merge?
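A minimal sketch of the diagnosis and one possible fix. Converting the string column to int is one option; converting the int column to string would work equally well.

```python
import pandas as pd

survey = pd.DataFrame({'zip_code': [10001, 10002, 10003], 'response': ['Yes', 'No', 'Yes']})
demographics = pd.DataFrame({'zip_code': ['10001', '10002', '10003'], 'population': [21000, 36000, 15000]})

# Diagnose: the key columns have different dtypes, so no values can match
print(survey['zip_code'].dtype, demographics['zip_code'].dtype)  # int64 object

# Fix: make the dtypes match, then merge
demographics['zip_code'] = demographics['zip_code'].astype(int)
result = pd.merge(survey, demographics, on='zip_code')
print(len(result))  # 3
```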
Guidance
(a) `survey['zip_code']` is int64 while `demographics['zip_code']` is object (string). Integer 10001 does not equal string '10001'. (b) `demographics['zip_code'] = demographics['zip_code'].astype(int)` then merge. Or convert survey to string. (c) Check dtypes: `print(survey['zip_code'].dtype, demographics['zip_code'].dtype)`.
Exercise 9.14 — Complex melt with multiple id_vars
Create this DataFrame representing student test scores:
scores = pd.DataFrame({
'student': ['Amy', 'Ben', 'Cara'],
'grade_level': [10, 11, 10],
'math_midterm': [88, 72, 95],
'math_final': [91, 78, 93],
'english_midterm': [82, 90, 88],
'english_final': [85, 92, 90]
})
(a) Melt so the result has columns: student, grade_level, exam, score.
(b) From the melted data, extract two new columns: subject (math or english) and exam_type (midterm or final) by splitting the exam column on the underscore.
(c) Compute the average score by subject and exam type.
Guidance
(a) `pd.melt(scores, id_vars=['student', 'grade_level'], var_name='exam', value_name='score')`. (b) `melted[['subject', 'exam_type']] = melted['exam'].str.split('_', expand=True)`. (c) `melted.groupby(['subject', 'exam_type'])['score'].mean()`.
Part C: Real-World Scenarios (2-star to 3-star)
These exercises use realistic data scenarios that mirror actual data science work.
Exercise 9.15 — Merging messy country data (2-star)
You have vaccination data that uses full country names and GDP data that uses ISO country codes:
vax = pd.DataFrame({
'country_name': ['Brazil', 'Kenya', 'India', 'Germany'],
'coverage': [90, 75, 88, 95]
})
gdp = pd.DataFrame({
'iso_code': ['BRA', 'KEN', 'IND', 'DEU'],
'gdp_per_capita': [8920, 2007, 2257, 48636]
})
# You also have a lookup table
lookup = pd.DataFrame({
'country_name': ['Brazil', 'Kenya', 'India', 'Germany'],
'iso_code': ['BRA', 'KEN', 'IND', 'DEU']
})
Write code that uses the lookup table to merge vax and gdp into a single DataFrame with columns country_name, coverage, gdp_per_capita.
Guidance
Two-step merge: (1) Merge vax with lookup on `country_name` to add `iso_code`. (2) Merge the result with gdp on `iso_code`. Then drop the `iso_code` column if desired.
result = (
vax
.merge(lookup, on='country_name')
.merge(gdp, on='iso_code')
.drop(columns='iso_code')
)
Exercise 9.16 — Reshaping survey data (2-star)
A marketing survey arrives in this format:
survey = pd.DataFrame({
'respondent_id': [1, 2, 3, 4, 5],
'age_group': ['18-24', '25-34', '25-34', '35-44', '18-24'],
'q1_satisfaction': [4, 5, 3, 4, 2],
'q2_likelihood_recommend': [3, 5, 4, 5, 1],
'q3_ease_of_use': [5, 4, 4, 3, 3]
})
(a) Melt the three question columns into long format with columns respondent_id, age_group, question, rating.
(b) Compute the average rating per question.
(c) Compute the average rating per question per age group.
(d) Pivot part (c) into a table with age groups as rows and questions as columns.
Guidance
(a) `pd.melt(survey, id_vars=['respondent_id', 'age_group'], var_name='question', value_name='rating')`. (b) `long.groupby('question')['rating'].mean()`. (c) `long.groupby(['age_group', 'question'])['rating'].mean().reset_index()`. (d) `part_c.pivot_table(index='age_group', columns='question', values='rating')`.
Exercise 9.17 — Sales analysis pipeline (3-star)
You're an analyst at a retail company with two tables:
transactions = pd.DataFrame({
'transaction_id': range(1, 11),
'store_id': [1, 1, 2, 2, 1, 3, 3, 2, 1, 3],
'amount': [45.99, 23.50, 67.00, 12.75, 89.99, 34.50, 55.00, 41.25, 78.00, 19.99],
'date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-15', '2024-01-17',
'2024-01-18', '2024-01-15', '2024-01-16', '2024-01-18',
'2024-01-19', '2024-01-19'])
})
stores = pd.DataFrame({
'store_id': [1, 2, 3],
'city': ['Portland', 'Seattle', 'Portland'],
'state': ['OR', 'WA', 'OR']
})
Write a method chain that:
1. Merges transactions with stores
2. Adds a day_of_week column from the date
3. Groups by city
4. Computes total revenue and transaction count per city
5. Sorts by total revenue descending
Guidance
result = (
transactions
.merge(stores, on='store_id')
.assign(day_of_week=lambda x: x['date'].dt.day_name())
.groupby('city')
.agg(total_revenue=('amount', 'sum'),
num_transactions=('transaction_id', 'count'))
.sort_values('total_revenue', ascending=False)
.reset_index()
)
Exercise 9.18 — Handling duplicates in a merge (3-star)
You're merging student enrollment data with course instructor data:
enrollment = pd.DataFrame({
'student': ['Amy', 'Ben', 'Amy', 'Cara', 'Ben'],
'course': ['CS101', 'CS101', 'MATH201', 'CS101', 'MATH201']
})
instructors = pd.DataFrame({
'course': ['CS101', 'CS101', 'MATH201'],
'instructor': ['Dr. Smith', 'Dr. Jones', 'Dr. Patel'],
'section': ['A', 'B', 'A']
})
(a) Merge on course without specifying a section. How many rows result? Explain why.
(b) This is a many-to-many merge. Why is the result misleading?
(c) Add a section column to enrollment and merge on both course and section to get correct results.
Guidance
(a) CS101 has 3 enrollments and 2 instructor rows, producing 6 rows for CS101 alone. MATH201 has 2 enrollments and 1 instructor row, producing 2 rows. Total: 8 rows — more than either input! (b) It incorrectly pairs every CS101 student with every CS101 instructor, even though each student is in only one section. Amy is paired with both Dr. Smith and Dr. Jones. (c) Add section info to enrollment, then merge on `['course', 'section']`.
Exercise 9.19 — Weather data reshaping (2-star)
You download weather data that looks like this:
weather = pd.DataFrame({
'station': ['SEA', 'SEA', 'PDX', 'PDX'],
'year': [2022, 2023, 2022, 2023],
'jan_temp': [39, 41, 40, 38],
'feb_temp': [42, 44, 43, 41],
'mar_temp': [47, 45, 48, 46],
})
(a) Melt into long format with columns station, year, month, temperature.
(b) Clean the month column to contain just the month abbreviation (e.g., jan not jan_temp).
(c) Create a pivot table showing average temperature by station and month.
(d) Which station is warmer on average across all months and years?
Guidance
(a) `pd.melt(weather, id_vars=['station', 'year'], var_name='month', value_name='temperature')`. (b) `long['month'] = long['month'].str.replace('_temp', '')`. (c) `long.pivot_table(index='station', columns='month', values='temperature', aggfunc='mean')`. (d) `long.groupby('station')['temperature'].mean()` — compare SEA vs PDX.
Exercise 9.20 — Vaccination equity analysis (3-star)
Create this dataset:
vax_data = pd.DataFrame({
'country': ['Brazil', 'Brazil', 'Kenya', 'Kenya', 'Germany', 'Germany',
'Nigeria', 'Nigeria', 'Japan', 'Japan'],
'year': [2021, 2022, 2021, 2022, 2021, 2022, 2021, 2022, 2021, 2022],
'measles_cov': [85, 90, 69, 75, 95, 95, 55, 62, 97, 97],
'dtp3_cov': [78, 83, 82, 85, 94, 94, 57, 65, 96, 96]
})
income = pd.DataFrame({
'country': ['Brazil', 'Kenya', 'Germany', 'Nigeria', 'Japan'],
'income_group': ['Upper-middle', 'Lower-middle', 'High', 'Lower-middle', 'High']
})
Write a complete analysis pipeline that: 1. Melts the vaccine columns into long format 2. Merges with the income classification 3. Computes average coverage by income group, year, and vaccine 4. Pivots to show income groups as rows and years as columns (for each vaccine separately)
Guidance
long = (
pd.melt(vax_data, id_vars=['country', 'year'],
var_name='vaccine', value_name='coverage')
.merge(income, on='country')
.groupby(['income_group', 'year', 'vaccine'])['coverage']
.mean()
.reset_index()
)
# For each vaccine, pivot
for vaccine in long['vaccine'].unique():
subset = long[long['vaccine'] == vaccine]
pivoted = subset.pivot_table(index='income_group', columns='year', values='coverage')
print(f"\n{vaccine}:")
print(pivoted)
Part D: Synthesis and Extension (3-star to 4-star)
These exercises require combining multiple concepts and thinking critically.
Exercise 9.21 — Self-join: Comparing countries within regions (3-star)
A self-join is when you merge a DataFrame with itself. Using the vaccination DataFrame from Exercise 9.20, create a table that pairs every country with every other country in the same income group, showing both countries' measles coverage for 2022. (Hint: filter for 2022 first, then merge the filtered DataFrame with itself on income_group.)
Guidance
subset = vax_data[vax_data['year'] == 2022][['country', 'measles_cov']].merge(income, on='country')
pairs = pd.merge(subset, subset, on='income_group', suffixes=('_1', '_2'))
pairs = pairs[pairs['country_1'] != pairs['country_2']] # remove self-pairs
Exercise 9.22 — Building a complete report (4-star)
Using any publicly available dataset (e.g., Gapminder data, a Kaggle dataset, or your course's WHO vaccination data), perform a complete reshape-and-analyze workflow in a Jupyter notebook:
- Load the data
- Melt or pivot as needed to get it into analysis-ready format
- Merge with at least one supplementary dataset
- Use groupby to compute at least three different summary statistics
- Write a method chain that performs at least four transformations
- Document every step with Markdown explanations
This is an open-ended exercise. The goal is fluency with the reshape-merge-group workflow.
Guidance
There is no single correct answer. A strong response will: demonstrate correct use of melt/pivot, merge with appropriate join type, use named aggregation in groupby, include a readable method chain, and most importantly, tell a coherent analytical story with Markdown narration between code cells.
Exercise 9.23 — Performance awareness (3-star)
You have two DataFrames: big_table with 10 million rows and lookup with 500 rows. You need to add information from lookup to big_table based on a shared key.
(a) Which DataFrame should be on the left in a left join? Why?
(b) Before merging, you notice big_table['key'] has dtype object and lookup['key'] has dtype int64. What should you do?
(c) Your colleague suggests sorting both DataFrames by the key column before merging. Would this improve performance in pandas? (Research question — look up how pandas implements merge internally.)
Guidance
(a) `big_table` should be the left table (left join) since you want to keep all 10M rows and add lookup info. (b) Convert to matching types — `big_table['key'] = big_table['key'].astype(int)` or convert lookup to string. Mismatched types would give zero matches. (c) Pandas uses hash-based joins internally, so pre-sorting doesn't help significantly. (In SQL databases, sort-merge joins can benefit from pre-sorted data, but pandas' implementation is different.)
Exercise 9.24 — Chained debugging (3-star)
This method chain has a bug. Find it, explain it, and fix it.
result = (
df
.merge(gdp_data, on='country', how='left')
.groupby('region')['gdp_per_capita']
.mean()
.query('gdp_per_capita > 10000')
.sort_values(ascending=False)
)
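A runnable sketch of one possible fix. Note that `df` and `gdp_data` are not defined in the exercise, so the sample frames below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical sample data (not defined in the exercise itself)
df = pd.DataFrame({'country': ['A', 'B', 'C'], 'region': ['R1', 'R1', 'R2']})
gdp_data = pd.DataFrame({'country': ['A', 'B', 'C'],
                         'gdp_per_capita': [30000, 5000, 20000]})

# After .mean() the result is a Series, so filter it with .loc and a
# lambda instead of .query()
result = (
    df
    .merge(gdp_data, on='country', how='left')
    .groupby('region')['gdp_per_capita']
    .mean()
    .loc[lambda x: x > 10000]
    .sort_values(ascending=False)
)
print(result)
```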
Guidance
After `.mean()`, the result is a Series with `region` as the index and the mean values as the data. A Series has no `.query()` method, and no column named `gdp_per_capita`, so the chain fails at that step. Fix it either by filtering the Series directly with `.loc[lambda x: x > 10000]`, or by calling `.reset_index()` first to get a DataFrame and then using `.query('gdp_per_capita > 10000')`.
Exercise 9.25 — Multi-level groupby with unstack (4-star)
Using vax_data from Exercise 9.20, compute the mean measles coverage by income group and year using groupby with both columns. Then use .unstack() to convert the year level of the multi-index into columns (similar to a pivot). Compare the result with what you'd get using pivot_table.
Guidance
# GroupBy + unstack approach
grouped = (
vax_data
.merge(income, on='country')
.groupby(['income_group', 'year'])['measles_cov']
.mean()
.unstack('year')
)
# Pivot_table approach
pivoted = (
vax_data
.merge(income, on='country')
.pivot_table(index='income_group', columns='year', values='measles_cov', aggfunc='mean')
)
# Both should produce the same result
Part E: New Dataset Challenge (2-star to 3-star)
Apply your Chapter 9 skills to a completely new context.
Exercise 9.26 — Movie ratings analysis (2-star)
movies = pd.DataFrame({
'movie_id': [1, 2, 3, 4, 5],
'title': ['The Matrix', 'Inception', 'Parasite', 'Dune', 'Everything Everywhere'],
'genre': ['Sci-Fi', 'Sci-Fi', 'Thriller', 'Sci-Fi', 'Sci-Fi']
})
ratings = pd.DataFrame({
'user_id': [101, 101, 102, 102, 103, 103, 101, 103],
'movie_id': [1, 2, 1, 3, 2, 4, 5, 5],
'rating': [5, 4, 4, 5, 3, 4, 5, 4]
})
(a) Merge movies and ratings so each rating row includes the movie title and genre.
(b) Compute the average rating per movie.
(c) Compute the average rating per genre.
(d) Which user is the most generous rater (highest average rating)?
Exercise 9.27 — Fitness tracker data (3-star)
tracker = pd.DataFrame({
'user': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
'week': [1, 2, 3, 1, 2, 3],
'steps_mon': [8000, 9200, 7500, 6000, 6500, 7000],
'steps_wed': [10000, 8500, 11000, 7500, 8000, 6500],
'steps_fri': [7000, 9000, 8000, 5500, 7000, 8500],
})
(a) Melt the step columns into long format with columns user, week, day, steps.
(b) Clean the day column to remove the steps_ prefix.
(c) Compute average steps per user per day of the week.
(d) Who walks more on average, Alice or Bob? By how much?
(e) Write a method chain that melts, cleans, and computes average weekly steps per user.
Exercise 9.28 — Combining election data (3-star)
results_2020 = pd.DataFrame({
'state': ['Oregon', 'Washington', 'Idaho', 'Nevada'],
'dem_pct': [56.5, 58.4, 33.1, 50.1],
'rep_pct': [40.4, 38.8, 63.8, 47.7]
})
results_2024 = pd.DataFrame({
'state': ['Oregon', 'Washington', 'Idaho', 'Montana'],
'dem_pct': [55.8, 57.9, 31.2, 40.1],
'rep_pct': [41.2, 39.3, 66.5, 56.8]
})
demographics = pd.DataFrame({
'state': ['Oregon', 'Washington', 'Idaho', 'Nevada', 'Montana'],
'population': [4240000, 7740000, 1940000, 3180000, 1100000],
'median_income': [67058, 78687, 58915, 62043, 57153]
})
(a) Use an outer join to combine 2020 and 2024 results. Use suffixes to distinguish the years.
(b) Compute the change in dem_pct from 2020 to 2024 for states that appear in both elections.
(c) Merge the combined election data with demographics.
(d) Is there a pattern between median income and Democratic vote share in 2024?
Part M: Mixed Review (1-star to 3-star)
These exercises deliberately mix Chapter 9 skills with concepts from earlier chapters.
Exercise 9.29 — Full pipeline from loading to analysis (3-star)
Write a complete analysis starting from raw data. Create a CSV string and load it:
import io
csv_text = """country,year,vaccine,coverage,population
Brazil,2021,measles,85,213000000
Brazil,2022,measles,90,215000000
Kenya,2021,measles,69,55000000
Kenya,2022,measles,75,56000000
Brazil,2021,dtp3,78,213000000
Brazil,2022,dtp3,83,215000000
Kenya,2021,dtp3,82,55000000
Kenya,2022,dtp3,85,56000000"""
df = pd.read_csv(io.StringIO(csv_text))
Using skills from Chapters 7, 8, and 9:
(a) Check dtypes and verify there are no missing values (Ch. 7-8).
(b) Group by country and year, compute total weighted coverage (coverage * population / total population) across vaccines (Ch. 9).
(c) Pivot to show countries as rows and years as columns (Ch. 9).
(d) Add a column showing the year-over-year change for each country (Ch. 7).
Exercise 9.30 — Debugging a multi-step analysis (2-star)
Your colleague wrote this analysis but it produces incorrect results. Find and fix all the bugs. (There are at least three.)
# Bug-ridden code — find the problems
data = pd.DataFrame({
'store': ['A', 'B', 'A', 'B'],
'q1': [100, 200, 150, 250],
'q2': [120, 180, 160, 220]
})
lookup = pd.DataFrame({
'store': ['A', 'B'],
'region': ['West', 'East']
})
# Attempt to compute total quarterly revenue by region
result = (
data
.melt(id_vars='store', var_name='quarter', value_name='revenue') # Looks ok
.merge(lookup, on='store', how='inner')
.groupby('region')['revenue'].mean() # Bug 1: should be sum, not mean
.reset_index()
.rename(columns={'revenue': 'total_revenue'})
.sort_values('total_revenue', ascending=True) # Bug 2: should be descending for "top regions"
)
Also consider: is there a Bug 3 related to the data itself? Why do stores A and B each appear twice in the original data? Should those duplicate rows be summed first, or do they represent different records?