Chapter 12 Quiz: Cleaning and Preparing Data for Analysis

Instructions: Choose the best answer for each multiple-choice question. For short-answer and code questions, write complete responses. The answer key appears at the end.

Time suggested: 30–40 minutes
Total questions: 20


Part A: Multiple Choice (Questions 1–13)

Question 1

You have a DataFrame with a price column. You run df["price"].fillna(df["price"].mean()) but then check df["price"].isna().sum() and see the original count of NaN values. What went wrong?

A) .fillna() only works on integer columns, not float
B) .fillna() returns a new Series but does not modify df["price"] — you must assign the result back
C) The mean of a column with NaN values is always NaN
D) .isna() does not work after .fillna()


Question 2

Which of the following correctly removes duplicate rows while keeping the last occurrence of each duplicate?

A) df.drop_duplicates(keep="last")
B) df.drop_duplicates(last=True)
C) df.remove_duplicates(keep="last")
D) df.deduplicate(keep="last")


Question 3

You call df["category"].map(category_map) where category_map does not include an entry for the value "Uncategorized". What happens to rows where category == "Uncategorized"?

A) They keep the value "Uncategorized"
B) They raise a KeyError
C) They become NaN
D) They are dropped from the DataFrame


Question 4

A column called revenue has values like "$1,234.56", "$987.00", and "$2,100". Which sequence of operations correctly converts this column to float64?

A)

df["revenue"] = pd.to_numeric(df["revenue"])

B)

df["revenue"] = df["revenue"].str.replace("$", "", regex=False)
df["revenue"] = df["revenue"].str.replace(",", "", regex=False)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

C)

df["revenue"] = df["revenue"].astype(float)

D)

df["revenue"] = float(df["revenue"])

Question 5

You want to standardize a status column that has 8 different capitalizations of 3 intended values. You want any value NOT in your mapping to become NaN so you can detect mapping gaps. Which method should you use?

A) .str.lower() followed by .str.title()
B) .replace() with a dictionary
C) .map() with a dictionary
D) .fillna() with the most common value


Question 6

Which of the following correctly fills missing values in a region column with forward fill (propagating the last known value forward)?

A) df["region"] = df["region"].fillna(method="bfill")
B) df["region"] = df["region"].fillna(method="ffill")
C) df["region"] = df["region"].interpolate()
D) df["region"] = df["region"].fillna(df["region"].shift(1))


Question 7

After loading a CSV, the sale_date column has dtype object. You run df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce"). One value in the column is "not a date". What is the result for that specific row?

A) The row is dropped
B) The value remains as the string "not a date"
C) The value becomes NaT (Not a Time)
D) A ValueError is raised


Question 8

You run .str.strip() on a city column and find that the number of unique values drops from 12 to 8. What does this tell you?

A) 4 rows were deleted
B) Some city names had leading or trailing whitespace that made them appear as different values
C) .str.strip() automatically merges similar-sounding city names
D) The column had 4 duplicate rows


Question 9

You are applying the IQR outlier detection method. Your data has Q1=10, Q3=30. What are the outlier boundaries?

A) Values < 5 or > 35
B) Values < -20 or > 60
C) Values < 0 or > 40
D) Values < -10 or > 50


Question 10

Which of the following dropna() calls removes rows ONLY if they have a missing value in the customer_id column?

A) df.dropna()
B) df.dropna(how="any")
C) df.dropna(subset=["customer_id"])
D) df.dropna(columns=["customer_id"])


Question 11

You want to find all rows in a product column that contain the word "chair" regardless of case. Which code is correct?

A) df[df["product"] == "chair"]
B) df[df["product"].str.contains("chair", case=False, na=False)]
C) df[df["product"].str.lower() == "chair"]
D) df[df["product"].contains("chair")]


Question 12

What is the primary advantage of .map() over .replace() for standardizing categorical values?

A) .map() is faster than .replace() for large DataFrames
B) .map() converts unmapped values to NaN, making it easy to detect mapping gaps
C) .replace() cannot handle dictionaries, only single values
D) .map() works in place while .replace() does not


Question 13

You are cleaning a date_joined column with these values: ["2023-01-15", "01/20/2023", NaN, "Feb 5 2023"]. Which call handles all three non-null formats and returns NaT for the null?

A) pd.to_datetime(df["date_joined"], format="%Y-%m-%d")
B) pd.to_datetime(df["date_joined"], errors="raise")
C) pd.to_datetime(df["date_joined"], errors="coerce")
D) df["date_joined"].astype("datetime64")


Part B: Short Answer (Questions 14–17)

Question 14

Explain the difference between .map() and .replace() for standardizing categorical values. Give a specific scenario where you would choose each one.


Question 15

What is the "fill vs. drop" decision framework for missing values? List at least four considerations that should inform the decision, and give an example of when dropping is appropriate and when filling is appropriate.


Question 16

A colleague says: "I just overwrite the original CSV file with the cleaned version. It saves disk space." Write a 3-4 sentence explanation of why this is a bad practice and what the correct approach is.


Question 17

Explain what a "cleaning log" is, why it is important, and what information it should contain. Give a concrete example of how a cleaning log could prevent a problem in a business setting.


Part C: Code Analysis (Questions 18–20)

Question 18

Examine the following cleaning pipeline. Identify all the bugs and explain how to fix each one.

import pandas as pd

df = pd.DataFrame({
    "name":    ["  Alice  ", "BOB", "carol", None],
    "revenue": ["$1,200", "$850", "N/A", "$2,100"],
    "status":  ["active", "ACTIVE", "Inactive", "active"]
})

# Bug 1: Clean the name column
df["name"].str.strip().str.title()

# Bug 2: Clean the revenue column
df["revenue"] = df["revenue"].str.replace("$", "")
df["revenue"] = df["revenue"].str.replace(",", "")
df["revenue"] = df["revenue"].astype(float)

# Bug 3: Standardize status
status_map = {"active": "Active", "ACTIVE": "Active"}
df["status"] = df["status"].map(status_map)

# Check result
print(df["status"].isna().sum())  # Should print 0

Identify:
a) Bug 1: What is wrong with the name cleaning code and how do you fix it?
b) Bug 2: What will happen when .astype(float) encounters "N/A"?
c) Bug 3: What will happen to "Inactive" which is not in the status_map?

Provide corrected code for all three issues.


Question 19

A junior analyst writes the following code to clean a discount column:

df = df.dropna(subset=["discount"])
print(f"Rows remaining: {len(df)}")

df["discount"] = df["discount"].fillna(0)
print(df["discount"].isna().sum())

The analyst is confused because the .fillna() seems unnecessary after .dropna(). Is the analyst correct? Explain why this code is logically redundant (though not incorrect) and rewrite it cleanly.


Question 20

Read the following messy data and write a complete cleaning pipeline:

import pandas as pd

raw = pd.DataFrame({
    "emp_id":   ["E01", "E02", "E03", "E04", "E05", "E03"],
    "dept":     ["SALES", "it", "Sales", "HR", "It", "Sales"],
    "salary":   ["$72,000", "$68,500", "$81,000", "$58,000", "$95,000", "$81,000"],
    "hire_date": ["2019-03-01", "2020-07-15", "01/01/2018", "2022-09-01",
                  "June 2017", "01/01/2018"],
    "active":   ["Yes", "yes", "YES", "No", "yes", "YES"]
})

Your pipeline must:
a) Remove exact duplicate rows
b) Standardize dept to title case
c) Clean salary (strip $ and ,) and convert to integer
d) Convert hire_date to datetime (handle mixed formats)
e) Standardize active to boolean (True/False)
f) Run a 5-point validation suite
g) Print a before/after comparison (row counts, dtypes, unique values per column)


Answer Key


Part A: Multiple Choice

Q Answer Explanation
1 B .fillna() returns a new Series. Without assignment back to df["price"], the original column is unchanged.
2 A keep="last" is the correct parameter value for keeping the last occurrence.
3 C .map() converts any value not in the dictionary to NaN. This is the key behavioral difference from .replace().
4 B First strip non-numeric characters, then use pd.to_numeric() with errors="coerce" for safety.
5 C .map() converts unmapped values to NaN, making mapping gaps immediately visible.
6 B method="ffill" propagates the last known (previous) value forward; bfill would be backward fill. (In pandas 2.x the method= argument is deprecated; prefer df["region"].ffill().)
7 C errors="coerce" converts unparseable values to NaT (Not a Time), the datetime equivalent of NaN.
8 B Whitespace made "Boston" and "Boston " appear as two different city names. Stripping reveals they are the same.
9 B IQR = Q3 − Q1 = 30 − 10 = 20. Lower bound = 10 − (1.5 × 20) = −20. Upper bound = 30 + (1.5 × 20) = 60.
10 C subset=["customer_id"] limits the dropna() to only rows missing customer_id.
11 B .str.contains() with case=False performs case-insensitive matching; na=False handles NaN safely.
12 B The key advantage of .map() is that unmapped values become NaN, revealing gaps in the mapping dictionary.
13 C errors="coerce" parses the formats it can and returns NaT for the null and for any unparseable values. (In pandas 2.x, mixed formats may additionally require format="mixed".)

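The Q9 arithmetic can be sanity-checked in a few lines (no pandas needed; the quartile values come straight from the question):

```python
# IQR outlier boundaries for Q1=10, Q3=30
q1, q3 = 10, 30
iqr = q3 - q1            # 20
lower = q1 - 1.5 * iqr   # 10 - 30 = -20.0
upper = q3 + 1.5 * iqr   # 30 + 30 = 60.0
print(lower, upper)
```

Any value below −20 or above 60 is flagged as an outlier, matching answer B.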


Part B: Short Answer

Question 14

.map() applies a dictionary to replace values, and any value NOT in the dictionary becomes NaN. It is strict: every possible value must be in the mapping, or it disappears. Use .map() when you want complete control and want missing mappings to be immediately visible as NaN.

.replace() applies a dictionary to replace values, leaving any value NOT in the dictionary unchanged. It is lenient: unspecified values pass through as-is. Use .replace() when you want to fix a few known problems while leaving other values alone.

Example for .map(): You want to standardize a status column from 8 variants to 3 canonical values. Using .map() ensures any new, unexpected status value becomes NaN instead of silently passing through unstandardized.

Example for .replace(): You discover one region value "NE" should be "Northeast". You want to fix that specific value while leaving all the other valid region values untouched. Using .replace({"NE": "Northeast"}) does exactly this.
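A minimal sketch of the behavioral difference described above, using a toy status Series (the values are illustrative, not from the quiz data):

```python
import pandas as pd

s = pd.Series(["active", "ACTIVE", "Inactive"])
mapping = {"active": "Active", "ACTIVE": "Active"}

# .map(): any value NOT in the dictionary becomes NaN — the gap is visible
mapped = s.map(mapping)
print(mapped.isna().sum())     # 1 ("Inactive" was not mapped)

# .replace(): values not in the dictionary pass through unchanged
replaced = s.replace(mapping)
print(replaced.tolist())       # ['Active', 'Active', 'Inactive']
```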


Question 15

The fill-vs-drop decision framework:

  1. Percentage of missing values — Under roughly 2–5%: consider dropping those rows. Over roughly 20%: consider dropping the column entirely. In between: assess further.
  2. Randomness of missing data — Randomly missing: safe to drop rows. Systematically missing (e.g., high-value customers more likely to skip a field): dropping introduces bias; fill instead.
  3. Availability of fill information — Can you infer the correct value from other columns? From domain knowledge? From the business source? If yes, fill.
  4. Business meaning of missing — Sometimes "missing" means something specific. A missing end_date means the project is still active — that is not an error, do not fill it.
  5. Downstream sensitivity — How sensitive is your analysis to this column? If region feeds a required regional grouping, missing values must be handled (fill or remove) before that analysis is valid.

Drop appropriate when: Missing customer_name in 0.5% of rows, and you need names for every row in an outreach campaign — drop the 0.5%.

Fill appropriate when: Missing discount_rate in a sales dataset — fill with 0 (no discount applied) based on business context.
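The two examples above can be sketched on a toy DataFrame (the column names and values are hypothetical, chosen only to mirror the scenarios):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_name": ["Ann", None, "Cai", "Dev"],
    "discount_rate": [0.10, np.nan, np.nan, 0.20],
})

# Consideration 1: what fraction of each column is missing?
pct_missing = df.isna().mean() * 100
print(pct_missing)

# Drop: customer_name is required for outreach, so drop rows missing it
df = df.dropna(subset=["customer_name"])

# Fill: a missing discount_rate means no discount applied (business default)
df["discount_rate"] = df["discount_rate"].fillna(0)
```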


Question 16

Overwriting the original file is dangerous because it destroys the raw data permanently. If your cleaning code has a bug, you cannot recover the original values. If business rules change and you need to re-clean the data differently, you have nothing to start from. Reproducibility also requires that someone can run your cleaning script on the raw file and produce the same clean output.

The correct approach: keep the original file untouched, give the clean file a different name (e.g., acme_sales_2023_clean.csv), and store your cleaning code in a script so the transformation can always be reproduced. Many teams also add a timestamp or version to the cleaned filename for additional traceability.


Question 17

A cleaning log is a recorded list of every modification made to a dataset during the cleaning process. Each entry should include: what was changed (which column), why it was changed (the data quality issue), how many rows were affected, and what the before/after state looks like.

The cleaning log is important because it makes the analysis reproducible and auditable. If a business stakeholder questions a number, the analyst can trace it back through the cleaning log to show exactly what was in the raw data and what was changed.

Example: Priya's log shows that 8 duplicate rows were removed and the category column was standardized. Three months later, Sandra asks why Office Chairs revenue increased compared to last quarter's report. Priya can show that last quarter used uncleaned data where three capitalization variants were counted separately, while this quarter uses cleaned data. Without the log, Priya could not explain the discrepancy.
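One lightweight way to implement such a log (a sketch, not a prescribed format — the dictionary keys here are one reasonable choice) is to append an entry after each cleaning step:

```python
import pandas as pd

log = []

df = pd.DataFrame({"category": ["office chairs", "Office Chairs", "OFFICE CHAIRS"]})

# Cleaning step: standardize capitalization, recording before/after state
unique_before = df["category"].nunique()
df["category"] = df["category"].str.title()
unique_after = df["category"].nunique()

log.append({
    "column": "category",
    "action": "standardized capitalization with .str.title()",
    "reason": "same category counted under multiple spellings",
    "unique_before": unique_before,
    "unique_after": unique_after,
})

# The log can be printed or saved alongside the cleaned file
print(pd.DataFrame(log))
```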


Part C: Code Analysis

Question 18

a) Bug 1 — The result of .str.strip().str.title() is computed but never assigned back to the DataFrame column:

# WRONG: result is discarded
df["name"].str.strip().str.title()

# CORRECT:
df["name"] = df["name"].str.strip().str.title()

b) Bug 2 — "N/A" cannot be converted by .astype(float); it will raise a ValueError. Use pd.to_numeric() with errors="coerce" instead:

df["revenue"] = df["revenue"].str.replace("$", "", regex=False)
df["revenue"] = df["revenue"].str.replace(",", "", regex=False)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# "N/A" becomes NaN instead of raising an error

c) Bug 3 — "Inactive" is not in status_map, so it becomes NaN. The final isna().sum() will print 1, not 0. Fix by adding "Inactive" to the map:

status_map = {
    "active": "Active",
    "ACTIVE": "Active",
    "Inactive": "Inactive",
    "inactive": "Inactive",
}
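Putting the three fixes together (one way to write the corrected pipeline, using the same sample data as the question):

```python
import pandas as pd

df = pd.DataFrame({
    "name":    ["  Alice  ", "BOB", "carol", None],
    "revenue": ["$1,200", "$850", "N/A", "$2,100"],
    "status":  ["active", "ACTIVE", "Inactive", "active"]
})

# Fix 1: assign the cleaned result back to the column
df["name"] = df["name"].str.strip().str.title()

# Fix 2: strip symbols, then coerce unparseable values ("N/A") to NaN
df["revenue"] = (
    df["revenue"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Fix 3: cover every observed status variant in the map
status_map = {"active": "Active", "ACTIVE": "Active",
              "Inactive": "Inactive", "inactive": "Inactive"}
df["status"] = df["status"].map(status_map)

print(df["status"].isna().sum())  # 0
```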

Question 19

The analyst is correct that the code is logically redundant. .dropna(subset=["discount"]) removes ALL rows where discount is missing. After this operation, df["discount"].isna().sum() is guaranteed to be 0. The subsequent .fillna(0) therefore has nothing to fill — it is a no-op.

Rewritten cleanly (choose one approach):

# Option A: Drop the missing rows
df = df.dropna(subset=["discount"])
print(f"Rows remaining: {len(df)}")

# Option B: Fill missing with 0 (if 0 is the correct business default)
df["discount"] = df["discount"].fillna(0)
print(f"Missing discount: {df['discount'].isna().sum()}")

The analyst needs to choose which approach is correct based on business context — not use both.


Question 20

import pandas as pd

raw = pd.DataFrame({
    "emp_id":    ["E01", "E02", "E03", "E04", "E05", "E03"],
    "dept":      ["SALES", "it", "Sales", "HR", "It", "Sales"],
    "salary":    ["$72,000", "$68,500", "$81,000", "$58,000", "$95,000", "$81,000"],
    "hire_date": ["2019-03-01", "2020-07-15", "01/01/2018", "2022-09-01",
                  "June 2017", "01/01/2018"],
    "active":    ["Yes", "yes", "YES", "No", "yes", "YES"]
})

df = raw.copy()

print("=== BEFORE ===")
print(f"Rows: {len(df)}")
print(df.dtypes)

# a) Remove duplicates
df = df.drop_duplicates()

# b) Standardize dept
df["dept"] = df["dept"].str.strip().str.title()

# c) Clean salary → integer
df["salary"] = (
    df["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# d) Convert hire_date to datetime (mixed formats)
# format="mixed" (pandas >= 2.0) parses each value individually;
# errors="coerce" turns anything unparseable into NaT
df["hire_date"] = pd.to_datetime(df["hire_date"], format="mixed", errors="coerce")

# e) Standardize active → boolean
active_map = {"Yes": True, "yes": True, "YES": True,
              "No": False, "no": False, "NO": False}
df["active"] = df["active"].map(active_map)

# f) Validation suite
assert df.duplicated().sum() == 0, "Duplicates remain"
assert df["dept"].str[0].str.isupper().all(), "dept not title case"
assert str(df["salary"].dtype) == "int64", "salary not int"
assert str(df["hire_date"].dtype).startswith("datetime"), "hire_date not datetime"
assert df["active"].dtype == bool, "active not bool"
print("All validations passed!")

# g) Before/after comparison: row counts, dtypes, unique values per column
print("\n=== AFTER ===")
print(f"Rows: {len(df)} (was {len(raw)})")
print(df.dtypes)
print("\nUnique values per column:")
print(pd.DataFrame({"before": raw.nunique(), "after": df.nunique()}))
print()
print(df.to_string(index=False))