Chapter 9 Quiz: Reshaping and Transforming Data
Instructions: This quiz tests your understanding of Chapter 9. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (8 questions, 5 points each)
Question 1. You perform pd.merge(left, right, on='id', how='left'). The left DataFrame has 100 rows and the right has 80 rows, with 70 ids appearing in both. Assuming no duplicate ids in either table, how many rows will the result have?
- (A) 70
- (B) 80
- (C) 100
- (D) 180
Answer
**Correct: (C)**

- **(A)** would be the result of an inner join (only matching rows).
- **(B)** is just the size of the right table, irrelevant to a left join.
- **(C)** is correct. A left join keeps *all* rows from the left table. The 70 matching rows get values from the right table; the 30 non-matching left rows get NaN for the right-side columns. Total: 100.
- **(D)** exceeds even a full outer join, which would produce only 100 + 80 - 70 = 110 rows.

Question 2. What does pd.melt() do?
- (A) Removes columns from a DataFrame
- (B) Converts wide-format data to long-format by unpivoting columns into rows
- (C) Combines two DataFrames into one
- (D) Sorts a DataFrame by multiple columns
Answer
**Correct: (B)**

- **(A)** describes `drop()`, not `melt()`. Melt doesn't remove data; it restructures it.
- **(B)** is correct. `melt()` takes columns that represent variables (like year columns `2020`, `2021`, `2022`) and "unpivots" them into rows, creating a `variable` column (holding the old column names) and a `value` column (holding the cell values).
- **(C)** describes `merge()` or `concat()`.
- **(D)** describes `sort_values()`.

Question 3. You have a DataFrame with 1,000 rows and group it by a column with 25 unique values. After calling .mean() on a numeric column, how many rows does the result have?
- (A) 1,000
- (B) 25
- (C) 975
- (D) It depends on the data
Answer
**Correct: (B)**

- **(A)** would be the result of `.transform('mean')`, which preserves the original row count.
- **(B)** is correct. `groupby().mean()` reduces each group to a single value. With 25 unique groups, the result has 25 rows.
- **(C)** has no logical basis.
- **(D)** is incorrect: the number of output rows always equals the number of unique groups, regardless of group sizes.

Question 4. Which join type should you use when you want to keep all rows from both DataFrames, filling in NaN where there's no match?
- (A) Inner join
- (B) Left join
- (C) Right join
- (D) Outer join
Answer
**Correct: (D)**

- **(A)** keeps *only* matching rows, the opposite of what's described.
- **(B)** keeps all left rows but drops unmatched right rows.
- **(C)** keeps all right rows but drops unmatched left rows.
- **(D)** is correct. An outer (full outer) join keeps all rows from both sides, filling NaN where matches don't exist. This is the most inclusive join type.

Question 5. What is the most likely cause when pd.merge(df1, df2, on='key') returns an empty DataFrame?
- (A) The DataFrames have different numbers of rows
- (B) The key column has different data types in the two DataFrames
- (C) The DataFrames have too many columns
- (D) The `how` parameter was not specified
Answer
**Correct: (B)**

- **(A)** is normal; DataFrames almost always have different row counts when merging.
- **(B)** is correct. If one table has the key as integer (e.g., `76`) and the other has it as string (e.g., `'76'`), no values will match because `76 != '76'` in Python. This is the most common cause of unexpectedly empty merge results. (Recent pandas versions refuse to merge int and string key columns outright, raising a `ValueError` rather than silently returning an empty result.)
- **(C)** has no effect on whether keys match.
- **(D)** is incorrect; the default `how='inner'` would still return matches if the keys align.

Question 6. What does the validate parameter in pd.merge() do?
- (A) Checks that the merge key column exists in both DataFrames
- (B) Ensures the resulting DataFrame has no missing values
- (C) Raises an error if the merge relationship doesn't match the expected cardinality (e.g., one-to-one, one-to-many)
- (D) Automatically removes duplicate rows before merging
Answer
**Correct: (C)**

- **(A)** is done automatically by pandas (you'd get a KeyError if the column didn't exist).
- **(B)** is not what `validate` does; missing values can still appear in outer/left/right joins.
- **(C)** is correct. `validate='one_to_one'` raises `MergeError` if either side has duplicate keys; `validate='one_to_many'` requires the left side to be unique; and so on. This helps catch unexpected many-to-many merges that cause row explosions.
- **(D)** is not a feature of `validate`. Duplicate removal must be done explicitly with `drop_duplicates()`.

Question 7. In a method chain, what does .assign(new_col=lambda x: x['old_col'] * 2) do?
- (A) Modifies the original DataFrame in place
- (B) Returns a new DataFrame with `new_col` added, leaving the original unchanged
- (C) Raises an error because lambda functions can't be used in `.assign()`
- (D) Replaces `old_col` with the doubled values
Answer
**Correct: (B)**

- **(A)** is incorrect. `.assign()` returns a *new* DataFrame; it does not modify in place. This is what makes it safe for method chaining.
- **(B)** is correct. The `lambda x` receives the DataFrame at that point in the chain, and `.assign()` returns a copy with `new_col` added.
- **(C)** is incorrect; lambda functions are the standard way to reference the current DataFrame inside `.assign()`.
- **(D)** is incorrect. `.assign()` *adds* a new column; it doesn't replace existing ones (unless `new_col` has the same name as an existing column).

Question 8. What is the difference between df.pivot() and df.pivot_table()?
- (A) They are identical; `pivot_table` is just an alias for `pivot`
- (B) `pivot()` requires an aggregation function; `pivot_table()` does not
- (C) `pivot_table()` can handle duplicate index-column combinations by aggregating; `pivot()` raises an error on duplicates
- (D) `pivot()` works on numeric data only; `pivot_table()` works on any data type
Answer
**Correct: (C)**

- **(A)** is incorrect; they have different behavior.
- **(B)** is backwards. `pivot_table()` takes an optional `aggfunc` parameter; `pivot()` does not aggregate.
- **(C)** is correct. When there are duplicate entries for the same index-column pair, `pivot()` raises a `ValueError` because it doesn't know how to resolve the conflict. `pivot_table()` handles this by aggregating (defaulting to `mean`).
- **(D)** is incorrect; both can work with various data types.

Section 2: True/False (3 questions, 5 points each)
Question 9. True or False: After a left join, the resulting DataFrame always has the same number of rows as the left DataFrame.
Answer
**False.** A left join keeps all rows from the left table, but if the right table has *duplicate keys*, a single left row can match multiple right rows, producing more output rows than input rows. For example, if the left has one row for country "USA" and the right has three rows for "USA" (perhaps for different years), the left join produces three rows for "USA." The statement is only true when the right-side keys are unique.

Question 10. True or False: groupby().transform() returns a result with the same number of rows as the original DataFrame.
Answer
**True.** This is the defining characteristic of `.transform()` versus `.agg()`. While `.agg()` reduces each group to a single row, `.transform()` broadcasts the group-level result back to every row in the group, preserving the original row count. This is why `transform` is used when you want to add group-level statistics (like the group mean) as a new column on each original row.

Question 11. True or False: NaN values in a merge key column will match with NaN values in the other DataFrame's key column.
Answer
**True** — and this surprises many people. Unlike SQL, where `NULL` never matches `NULL`, pandas treats NaN as an ordinary key value during a merge: rows with NaN keys on both sides *will* be matched to each other. This differs from elementwise comparison, where `NaN != NaN` per the IEEE floating-point standard. If you don't want NaN keys to participate in a join, drop them first with `dropna(subset=['key'])` on each DataFrame.

Section 3: Short Answer (4 questions, 5 points each)
Question 12. Explain the "split-apply-combine" pattern in your own words. Give a one-sentence example of each step applied to a dataset of employee salaries grouped by department.
Answer
The split-apply-combine pattern is a three-phase strategy for group-wise computation. **Split:** Divide the employee DataFrame into separate groups, one per department (Engineering, Marketing, Sales, etc.). **Apply:** Compute the desired statistic (e.g., mean salary) within each department group independently. **Combine:** Collect the per-department results into a single output, either a summary table (one row per department) or a column added back to the original data.

Question 13. You have a wide-format DataFrame with 50 rows and columns country, pop_2018, pop_2019, pop_2020, pop_2021, pop_2022. After melting the five population columns, how many rows will the result have? Show your reasoning.
Answer
The result will have **250 rows**. Each of the 50 countries has 5 population values (one per year column). When melted, each value becomes its own row: 50 countries x 5 years = 250 rows. The resulting DataFrame will have 3 columns: `country` (the id_var), `year` (from the old column headers), and `population` (the values).

Question 14. Your merge produces twice as many rows as you expected. List three diagnostic steps you would take to figure out what went wrong.
Answer
1. **Check for duplicate keys:** `df['key'].duplicated().sum()` on both DataFrames. Duplicate keys on both sides cause a Cartesian product (many-to-many join), which multiplies rows.
2. **Use the `indicator` parameter:** Re-run the merge with `indicator=True` and examine the `_merge` column to see how many rows are `both`, `left_only`, and `right_only`.
3. **Compare row counts before and after:** Print `len(left)`, `len(right)`, and `len(result)`. If the result is larger than either input, there are almost certainly duplicate keys causing expansion. Use `validate='one_to_one'` or `validate='one_to_many'` to make pandas catch this automatically.

Question 15. Describe two benefits and one drawback of method chaining in pandas.
Answer
**Benefit 1:** Readability. The chain reads as a sequence of transformation steps (filter, merge, group, sort), making the analytical logic clear at a glance. Each line is a verb describing one operation.

**Benefit 2:** Fewer intermediate variables. You don't clutter the namespace with `temp1`, `temp2`, `filtered_df`, etc., reducing the risk of accidentally using a stale intermediate result.

**Drawback:** Debugging is harder. When a chain produces an unexpected result, you can't easily inspect the intermediate state at each step without breaking the chain apart into separate statements. Long chains can also become hard to read if they exceed 8-10 operations.

Section 4: Applied Scenarios (3 questions, 5 points each)
Question 16. You're an analyst at a hospital. You have a patients table (columns: patient_id, name, age, ward) and a test_results table (columns: patient_id, test_name, result, date). Each patient may have zero, one, or many test results.
(a) What join type would you use to create a table showing all patients along with their test results (if any)?
(b) Write the pandas merge call.
(c) A patient with no test results will appear in the result. What values will the test_name, result, and date columns contain for that patient?
Answer
(a) **Left join** — you want all patients (the primary dataset), with test results where available.

(b) `pd.merge(patients, test_results, on='patient_id', how='left')`

(c) Those columns will contain **missing values**: `NaN` for `test_name` and `result`, and `NaN` (or `NaT`, if the column holds datetimes) for `date`. The patient row is preserved because of the left join, but there's no matching test data to fill in.

Question 17. A climate researcher has temperature data in this format:
| station | year | jan | feb | mar | ... | dec |
| --- | --- | --- | --- | --- | --- | --- |
| SFO | 2020 | 51 | 54 | 56 | ... | 50 |
| SFO | 2021 | 52 | 55 | 57 | ... | 51 |
| LAX | 2020 | 58 | 59 | 60 | ... | 59 |
She wants to create a line chart showing monthly temperature trends for each station. The plotting library expects data in long format with columns station, year, month, temperature.
Write the pd.melt() call that transforms this data. What should id_vars, var_name, and value_name be?
Answer
long = pd.melt(
weather,
id_vars=['station', 'year'], # columns to keep as identifiers
var_name='month', # old column names become values here
value_name='temperature' # old cell values go here
)
`id_vars` should be `['station', 'year']` because these identify each observation and should remain as columns. `var_name` should be `'month'` because the month abbreviations (jan, feb, etc.) were the old column headers. `value_name` should be `'temperature'` because the cell values represent temperature readings.
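A runnable sketch of that same call on a toy version of the table (the station rows and temperature values here are invented for illustration, and only three month columns are shown):

```python
import pandas as pd

# Toy wide-format frame standing in for the researcher's data.
weather = pd.DataFrame({
    'station': ['SFO', 'SFO', 'LAX'],
    'year':    [2020, 2021, 2020],
    'jan':     [51, 52, 58],
    'feb':     [54, 55, 59],
    'mar':     [56, 57, 60],
})

long = pd.melt(
    weather,
    id_vars=['station', 'year'],   # identifier columns kept as-is
    var_name='month',              # old column names (jan, feb, mar)
    value_name='temperature',      # old cell values
)

print(long.shape)  # (9, 4): 3 rows x 3 month columns = 9 observations
```

Each (station, year, month) combination is now its own row, which is exactly the shape a long-format plotting interface expects.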
Question 18. Priya is analyzing NBA player statistics. She has per-game stats for 450 players and team records for 30 teams. She wants to know whether players on winning teams score more points on average.
Describe the complete workflow (in English, not code) using merge and groupby to answer her question. Include: what to merge on, what join type, what to group by, and what aggregation to use.
Answer
1. **Merge** the player stats with the team records on the `team` column (or `team_id`) using an **inner join** (we only want players on recognized teams with win/loss records).
2. **Create a derived column** indicating whether the team is a "winning team" — for example, `winning = (wins > losses)` — or classify by win percentage into groups.
3. **Group by** the winning/losing classification and compute the **mean** of the `points_per_game` column within each group.
4. **Compare** the two group means to see if players on winning teams average more points. (Note: this shows correlation, not causation — winning teams might attract better scorers, or good scorers might help teams win.)

Section 5: Code Analysis (2 questions, 5 points each)
Question 19. What does this code produce? Trace through it step by step.
import pandas as pd

df = pd.DataFrame({
'dept': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng'],
'salary': [90000, 85000, 70000, 75000, 95000]
})
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['above_avg'] = df['salary'] > df['dept_avg']
print(df['above_avg'].sum())
Answer
Step 1: Compute department averages using `.transform('mean')`:

- Eng: (90000 + 85000 + 95000) / 3 = 90000
- Sales: (70000 + 75000) / 2 = 72500

Step 2: The `dept_avg` column gets: [90000, 90000, 72500, 72500, 90000]

Step 3: Compare each salary to its department average:

- 90000 > 90000 = False
- 85000 > 90000 = False
- 70000 > 72500 = False
- 75000 > 72500 = True
- 95000 > 90000 = True

Step 4: `sum()` of [False, False, False, True, True] = **2**

The code prints `2` — two employees earn more than their department's average.

Question 20. This code is supposed to compute total sales by region by merging sales data with store locations, but it produces incorrect results. Identify the bug and explain the fix.
import pandas as pd

sales = pd.DataFrame({
'store_id': [1, 2, 3, 1, 2],
'amount': [100, 200, 150, 300, 250]
})
locations = pd.DataFrame({
'store_id': ['1', '2', '3'],
'region': ['North', 'South', 'North']
})
result = (
sales
.merge(locations, on='store_id', how='left')
.groupby('region')['amount']
.sum()
)
print(result)
Answer
**Bug:** The `store_id` column is `int64` in the `sales` DataFrame but `object` (string) in the `locations` DataFrame. Integer `1` does not equal string `'1'`, so the merge finds zero matches: every row ends up with NaN for `region`, and the groupby produces no meaningful groups. (Recent pandas versions raise a `ValueError` for mismatched int/string merge keys instead, which at least makes the problem visible.)

**Fix:** Convert the types to match before merging:

locations['store_id'] = locations['store_id'].astype(int)
Alternatively: `sales['store_id'] = sales['store_id'].astype(str)`.
**How to catch this:** Always check `df['key'].dtype` on both DataFrames before merging. If the merge result has unexpectedly many NaN values or fewer rows than expected, type mismatch is the first thing to investigate.
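To make the diagnosis and fix concrete, here is a runnable sketch using the quiz's own data; the `validate='many_to_one'` argument is an extra safeguard, not part of the original code:

```python
import pandas as pd

sales = pd.DataFrame({
    'store_id': [1, 2, 3, 1, 2],
    'amount': [100, 200, 150, 300, 250],
})
locations = pd.DataFrame({
    'store_id': ['1', '2', '3'],            # strings: the bug
    'region': ['North', 'South', 'North'],
})

# Diagnose: the key dtypes disagree, so no keys can ever match.
print(sales['store_id'].dtype)       # int64
print(locations['store_id'].dtype)   # object

# Fix: align the key types before merging.
locations['store_id'] = locations['store_id'].astype('int64')

result = (
    sales
    .merge(locations, on='store_id', how='left', validate='many_to_one')
    .groupby('region')['amount']
    .sum()
)
print(result)  # North: 100 + 150 + 300 = 550, South: 200 + 250 = 450
```

The `validate='many_to_one'` check also guards against the row-explosion problem from Question 14: it raises `MergeError` if `locations` ever gains duplicate store ids.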