Chapter 13 Quiz: Transforming and Aggregating Business Data

Instructions: Choose the best answer for each question. The answer key with explanations is provided at the end.


Questions

1. You have a DataFrame df with a column revenue. Which of the following correctly adds a new column category that labels each row as "High" if revenue >= 10000 and "Low" otherwise?

A) df["category"] = df["revenue"].map({"High": 10000, "Low": 0})

B) df["category"] = df["revenue"].apply(lambda x: "High" if x >= 10000 else "Low")

C) df["category"] = df.groupby("revenue")["category"].transform("High")

D) df["category"] = df["revenue"].agg(lambda x: "High" if x >= 10000 else "Low")
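A throwaway DataFrame (revenue figures invented) lets you test the candidates in a REPL; the element-wise labeling pattern looks like this:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [12000, 8000, 10000]})  # invented values

# Run each revenue value through a labeling function
df["category"] = df["revenue"].apply(lambda x: "High" if x >= 10000 else "Low")
print(df["category"].tolist())
```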


2. What is the correct way to use .map() to replace region abbreviations with full names?

codes = {"N": "North", "S": "South", "E": "East", "W": "West"}
df["region_name"] = df["region_code"].???

A) df["region_code"].apply(codes)

B) df["region_code"].replace(codes) (this substitutes values too, but it does not use .map())

C) df["region_code"].map(codes)

D) df["region_code"].merge(codes)
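A quick sketch of dictionary lookup on a Series (the codes Series below is invented):

```python
import pandas as pd

codes = {"N": "North", "S": "South", "E": "East", "W": "West"}
region_code = pd.Series(["N", "S", "N"])  # invented sample codes

# Each value is replaced via dictionary lookup
region_name = region_code.map(codes)
print(region_name.tolist())
```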


3. A colleague writes:

result = df.groupby("region")["revenue"].sum()

The result object is:

A) A DataFrame with columns region and revenue

B) A Series with region as the index and sum of revenue as values

C) A GroupBy object that needs .aggregate() called on it

D) A pivot table
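You can check the return type empirically with a toy DataFrame (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North"],  # invented rows
    "revenue": [100, 200, 50],
})
result = df.groupby("region")["revenue"].sum()

# Inspect what kind of object comes back and what its index is
print(type(result).__name__)
print(result.index.name)
```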


4. You want to compute both the sum and mean of revenue, and the count of orders, all in one .agg() call, with clean output column names. Which syntax is correct?

A)

df.groupby("region").agg(["sum", "mean", "count"])

B)

df.groupby("region")[["revenue","orders"]].agg({"sum","mean","count"})

C)

df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    order_count=("orders", "count"),
)

D)

df.groupby("region").agg(
    total_revenue="revenue.sum",
    avg_revenue="revenue.mean",
    order_count="orders.count",
)
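A minimal sketch of the named-aggregation syntax, using invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "S"],   # invented rows
    "revenue": [100, 200, 50],
    "orders": [1, 2, 1],
})

# output_col_name=("source_col", "func") controls the result columns directly
summary = df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    order_count=("orders", "count"),
)
print(summary.columns.tolist())
```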

5. What does fill_value=0 do in pd.pivot_table(df, values="revenue", index="region", columns="product", aggfunc="sum", fill_value=0)?

A) Replaces zeros in the revenue column with the mean

B) Fills cells where no data exists for a region-product combination with 0 instead of NaN

C) Sets the default revenue to 0 for all rows before aggregation

D) Fills missing index labels with 0
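A small experiment shows the effect (data invented; note that region "S" never sells product "B"):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "S"],       # invented rows
    "product": ["A", "B", "A"],
    "revenue": [100, 200, 50],
})

# The (S, B) combination has no underlying data
pivot = pd.pivot_table(df, values="revenue", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)
```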


6. You have a transactions table and a customers table. A transaction for customer ID "C099" exists in the transactions table, but "C099" does not exist in the customers table. After an inner join on customer_id, what happens to that transaction row?

A) It is kept with NaN values for the customer columns

B) It is dropped from the result

C) It raises a KeyError

D) It is added to the customers table automatically
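The scenario is easy to reproduce with two tiny tables (IDs invented):

```python
import pandas as pd

transactions = pd.DataFrame({"customer_id": ["C001", "C099"], "amount": [50, 75]})
customers = pd.DataFrame({"customer_id": ["C001"], "name": ["Acme"]})

# C099 has no match in customers
inner = transactions.merge(customers, on="customer_id", how="inner")
print(len(inner))
```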


7. What is the primary difference between .merge() with how="left" and .merge() with how="outer"?

A) Left join keeps only left table rows with matches; outer join keeps all rows from both tables

B) Left join keeps all rows from the left table (matched or not); outer join keeps all rows from both tables regardless of matches

C) They are identical except for how NaN values are filled

D) Left join uses the left index; outer join uses both indexes
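The row counts of the two join types can be compared directly (keys invented; "a" exists only on the left, "c" only on the right):

```python
import pandas as pd

left = pd.DataFrame({"k": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"k": ["b", "c"], "y": [10, 20]})

left_join = left.merge(right, on="k", how="left")    # every left row survives
outer_join = left.merge(right, on="k", how="outer")  # every row from both sides
print(len(left_join), len(outer_join))
```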


8. You want to stack three quarterly DataFrames q1, q2, q3 vertically into one DataFrame with a fresh 0-based index. Which call is correct?

A) pd.merge([q1, q2, q3], ignore_index=True)

B) pd.concat([q1, q2, q3], axis=1)

C) pd.concat([q1, q2, q3], ignore_index=True)

D) q1.join([q2, q3], how="outer")
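A sketch of vertical stacking with a reset index (quarterly frames invented):

```python
import pandas as pd

q1 = pd.DataFrame({"revenue": [1, 2]})  # invented quarterly data
q2 = pd.DataFrame({"revenue": [3]})
q3 = pd.DataFrame({"revenue": [4]})

# Stack vertically and discard the original indexes
stacked = pd.concat([q1, q2, q3], ignore_index=True)
print(stacked.index.tolist())
```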


9. A DataFrame wide has columns region, Jan, Feb, Mar. You want to reshape it so that each row represents one region-month combination. Which method and parameters achieve this?

A)

wide.pivot(index="region", columns=["Jan","Feb","Mar"], values="revenue")

B)

wide.melt(id_vars="region", var_name="month", value_name="revenue")

C)

wide.groupby("region")[["Jan","Feb","Mar"]].sum()

D)

pd.pivot_table(wide, index="region", columns="month", values="revenue")
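The wide-to-long reshape can be sketched with invented monthly figures:

```python
import pandas as pd

wide = pd.DataFrame({
    "region": ["N", "S"],   # invented values
    "Jan": [1, 2],
    "Feb": [3, 4],
    "Mar": [5, 6],
})

# Month columns become rows: one row per region-month pair
long = wide.melt(id_vars="region", var_name="month", value_name="revenue")
print(long.shape)
```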

10. Which .dt accessor property returns an integer from 0 (Monday) to 6 (Sunday) for a datetime Series?

A) .dt.weekday

B) .dt.day

C) .dt.dayofmonth

D) .dt.isoweekday
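Two known dates make the numbering easy to check (2024-01-01 was a Monday, 2024-01-07 a Sunday):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-01", "2024-01-07"]))
# 0 = Monday ... 6 = Sunday
print(dates.dt.weekday.tolist())
```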


11. You have a monthly revenue DataFrame and run:

monthly["rolling_3m_avg"] = monthly["revenue"].rolling(window=3).mean()

Why do the first two rows of rolling_3m_avg contain NaN?

A) The revenue values in rows 0 and 1 are null

B) A 3-month window requires at least 3 data points; the windows ending at rows 0 and 1 contain only 1 and 2 values respectively

C) pandas requires you to set min_periods=3 explicitly to compute any values

D) Rolling averages always skip the first and last rows
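A four-row sketch (values invented) makes the NaN pattern visible:

```python
import pandas as pd

monthly = pd.DataFrame({"revenue": [10, 20, 30, 40]})  # invented values

# Default min_periods equals window, so incomplete windows yield NaN
monthly["rolling_3m_avg"] = monthly["revenue"].rolling(window=3).mean()
print(monthly["rolling_3m_avg"].tolist())
```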


12. What is the key difference between .rolling() and .expanding()?

A) .rolling() computes the sum; .expanding() computes the mean

B) .rolling() uses a fixed window size; .expanding() grows the window from the start to the current row

C) .rolling() works on DataFrames; .expanding() works only on Series

D) They are identical; .expanding() is an alias for .rolling(window=len(df))
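The two window types can be compared side by side on an invented Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])  # invented values

rolling_mean = s.rolling(window=2).mean()  # always the last 2 rows
expanding_mean = s.expanding().mean()      # row 0 through the current row
print(rolling_mean.tolist())
print(expanding_mean.tolist())
```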


13. You want to add a column to df showing each row's revenue as a percentage of its region's total revenue, WITHOUT reducing the number of rows. Which approach is correct?

A)

df["region_total"] = df.groupby("region")["revenue"].sum()
df["pct"] = df["revenue"] / df["region_total"] * 100

B)

df["region_total"] = df.groupby("region")["revenue"].transform("sum")
df["pct"] = df["revenue"] / df["region_total"] * 100

C)

df["pct"] = df.groupby("region")["revenue"].agg("sum") / df["revenue"] * 100

D)

df["pct"] = df.groupby("region")["revenue"].apply(lambda x: x / x.sum() * 100)
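The length-preserving behavior of .transform() can be sketched with invented data:

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "S"], "revenue": [100, 300, 50]})

# transform returns one value per original row (the group total, repeated)
df["region_total"] = df.groupby("region")["revenue"].transform("sum")
df["pct"] = df["revenue"] / df["region_total"] * 100
print(df["pct"].tolist())
```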

14. A DataFrame df has a column customer_name with values like "Acme Corp" and "BigCo LLC". Which expression creates a new column containing only the first word of each name?

A) df["customer_name"].str[0]

B) df["customer_name"].str.split(" ").str[0]

C) df["customer_name"].apply(str.split)[0]

D) df["customer_name"].str.extract(r"(\w+) ")
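The split-then-index pattern is easy to try on the names from the question:

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["Acme Corp", "BigCo LLC"]})

# Split each string into a list of words, then take element 0 of each list
df["first_word"] = df["customer_name"].str.split(" ").str[0]
print(df["first_word"].tolist())
```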


15. What is the difference between .pivot() and pd.pivot_table()?

A) .pivot() aggregates duplicate index-column combinations; pd.pivot_table() does not

B) .pivot() requires exactly one value per index-column combination and raises an error on duplicates; pd.pivot_table() handles duplicates using an aggregation function

C) pd.pivot_table() is deprecated and you should use .pivot() instead

D) They are identical methods with different names
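The contrast shows up as soon as a duplicate (index, column) pair exists (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N"],      # duplicate (region, product) pair
    "product": ["A", "A"],
    "revenue": [100, 200],
})

# .pivot() cannot decide which of the two values to keep
try:
    df.pivot(index="region", columns="product", values="revenue")
    outcome = "no error"
except ValueError:
    outcome = "ValueError"
print(outcome)

# pivot_table resolves the duplicates with an aggregation function
table = pd.pivot_table(df, index="region", columns="product",
                       values="revenue", aggfunc="sum")
print(table.loc["N", "A"])
```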


16. You call pd.to_datetime(df["order_date"]) and get the error: "time data '25-01-2024' does not match format '%Y-%m-%d'". What is the most direct fix?

A) pd.to_datetime(df["order_date"], dayfirst=True)

B) pd.to_datetime(df["order_date"], format="%d-%m-%Y")

C) df["order_date"].str.replace("-", "/")

D) Both A and B would work
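The explicit-format fix can be sketched on the string from the error message (second date invented):

```python
import pandas as pd

s = pd.Series(["25-01-2024", "03-02-2024"])

# Tell pandas the strings are day-month-year
parsed = pd.to_datetime(s, format="%d-%m-%Y")
print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```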


17. After merging transactions with customers using how="left", you see that 3 rows have NaN in the tier column. What does this indicate?

A) The tier column in the customers table has missing values

B) Those 3 transactions have customer_id values that do not exist in the customers table

C) You should have used how="outer" to capture all tiers

D) The merge key customer_id has a different data type in the two tables
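One way to confirm the diagnosis is merge's indicator=True flag, which labels each row's match status (IDs invented):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": ["C001", "C002", "C098", "C099", "C100"],  # invented IDs
    "amount": [10, 20, 30, 40, 50],
})
customers = pd.DataFrame({"customer_id": ["C001", "C002"],
                          "tier": ["Gold", "Silver"]})

merged = transactions.merge(customers, on="customer_id",
                            how="left", indicator=True)
print(merged["tier"].isna().sum())              # rows with no tier
print((merged["_merge"] == "left_only").sum())  # same rows, flagged as unmatched
```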


18. Which of the following is NOT a valid aggregation function name you can pass as a string to .agg()?

A) "sum"

B) "nunique"

C) "percentile"

D) "first"
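You can probe which strings .agg() accepts with a toy groupby (data invented):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

# These string names all resolve to built-in aggregations
out = df.groupby("g")["v"].agg(["sum", "nunique", "first"])
print(out.loc["a"].tolist())

# An unrecognized name raises an error
try:
    df.groupby("g")["v"].agg("percentile")
    percentile_ok = True
except (AttributeError, ValueError, KeyError):
    percentile_ok = False
print(percentile_ok)
```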


19. You have a quarterly revenue DataFrame indexed by quarter name (Q1, Q2, Q3, Q4). You compute:

df["qoq_growth"] = df["revenue"].pct_change() * 100

What value will df["qoq_growth"] have for Q1 (the first row)?

A) 0.0 (no change from nothing)

B) 100.0 (an arbitrary default)

C) NaN (there is no prior period to compare)

D) The growth rate relative to Q4 of the previous year
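A three-quarter sketch (revenue invented) shows what the first row gets:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100, 110, 121]}, index=["Q1", "Q2", "Q3"])

# Each value is compared to the previous row
df["qoq_growth"] = df["revenue"].pct_change() * 100
print(df["qoq_growth"].tolist())
```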


20. Sandra asks Priya to add a column showing whether each week's revenue is "Above Target" or "Below Target", given a target of $30,000 per week. The weekly totals are in a Series called weekly_rev. Which approach is cleanest?

A) weekly_rev.apply(lambda x: "Above Target" if x >= 30000 else "Below Target")

B) weekly_rev.map(lambda x: "Above Target" if x >= 30000 else "Below Target")

C) np.where(weekly_rev >= 30000, "Above Target", "Below Target")

D) All of A, B, and C produce the same result
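All three candidates can be run side by side on invented weekly totals to compare their outputs:

```python
import numpy as np
import pandas as pd

weekly_rev = pd.Series([35000, 25000])  # invented weekly totals

a = weekly_rev.apply(lambda x: "Above Target" if x >= 30000 else "Below Target")
b = weekly_rev.map(lambda x: "Above Target" if x >= 30000 else "Below Target")
# np.where returns an array; wrap it to compare like-for-like
c = pd.Series(np.where(weekly_rev >= 30000, "Above Target", "Below Target"))
print(a.tolist(), b.tolist(), c.tolist())
```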


Answer Key

Q Answer Explanation
1 B .apply() with a lambda processes each element through the function. Option A inverts the dictionary (.map() keys must be the existing values, not the labels). .transform() in C and .agg() in D are group operations, not element-wise labeling.
2 C .map(dict) on a Series replaces each value using the dictionary as a lookup.
3 B A single-column .groupby().sum() returns a Series with the grouping column as the index.
4 C Named aggregations use the syntax output_col_name=("source_col", "func"). This is the cleanest approach because output column names are controlled directly.
5 B fill_value=0 substitutes 0 for any NaN cells created when a group combination doesn't exist in the data.
6 B Inner join keeps only rows where the key exists in BOTH tables. Unmatched rows are silently dropped.
7 B Left join preserves every row from the left table, filling NaN for unmatched right-table columns. Outer join preserves every row from both tables.
8 C pd.concat([list_of_dfs], ignore_index=True) stacks DataFrames vertically and resets the index. axis=1 would stack horizontally.
9 B .melt(id_vars="region", var_name="month", value_name="revenue") converts from wide (months as columns) to long (one row per region-month).
10 A .dt.dayofweek and .dt.weekday are equivalent; both return 0=Monday through 6=Sunday.
11 B With window=3, the first row has only 1 data point and the second has only 2. Both fall below the default min_periods which equals window.
12 B .rolling(n) always uses the last n rows. .expanding() grows from row 0 to the current row, so the window size increases with each row.
13 B .transform("sum") returns a Series the same length as the original DataFrame with each value replaced by its group's aggregate. Option A would produce index-misaligned NaN values.
14 B .str.split(" ") splits each string into a list; chaining .str[0] extracts the first element. Option A extracts the first character, not the first word.
15 B .pivot() raises a ValueError when there is more than one value for any (index, column) combination. pd.pivot_table() resolves duplicates with aggfunc.
16 D Both dayfirst=True (inference with day first) and format="%d-%m-%Y" (explicit) correctly parse '25-01-2024'. B is the more precise and reliable of the two.
17 B NaN in the merged customer columns means the transaction's customer_id had no match in the customers table. (A dtype mismatch on the key, as in D, can produce the same symptom, but B describes what the NaN directly indicates.)
18 C "percentile" is not a built-in aggregation string. For percentiles, use "median", .quantile(q), or a custom function wrapping np.percentile.
19 C pct_change() compares each value to the previous one. The first row has no previous row, so it returns NaN.
20 D All three produce the same output. .apply() and .map() with a function are equivalent on a Series; np.where() is a vectorized alternative that is often faster on large data.

Score Interpretation

  • 18–20 correct: Excellent — you have a strong command of the chapter material.
  • 14–17 correct: Good — review the topics corresponding to your missed questions.
  • 10–13 correct: Fair — revisit the chapter sections on groupby, merge, and window functions.
  • Below 10: Spend more time with the hands-on code examples before moving to Chapter 14.