Chapter 8 Quiz: Cleaning Messy Data
Instructions: This quiz tests your understanding of Chapter 8. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For code analysis questions, predict the output or identify the error before running anything. Total points: 90, plus up to 10 bonus points (see the scoring guide).
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. What does NaN stand for in pandas, and which of the following statements about it is TRUE?
- (A) "Not a Number" — and `NaN == NaN` evaluates to `True`
- (B) "Not a Number" — and `NaN == NaN` evaluates to `False`
- (C) "Null and Nil" — and it can only represent missing numeric values
- (D) "Not Available Now" — and it is identical to Python's `None` in all contexts
Answer
**Correct: (B)** NaN stands for "Not a Number." Its most counterintuitive property is that `NaN == NaN` evaluates to `False` — NaN is not equal to anything, including itself. This is why you must use `.isnull()` or `.isna()` to detect missing values, not `==`. Option (C) is wrong because NaN is used for missing values of any type in pandas. Option (D) is wrong because while pandas often converts `None` to `NaN`, they are distinct objects in Python.

Question 2. You have a DataFrame with 10,000 rows. Column `income` is missing in 800 rows (8%), and you discover that the missing values are concentrated among respondents under age 25. What type of missingness is this most likely?
- (A) Missing Completely at Random (MCAR)
- (B) Missing at Random (MAR)
- (C) Missing Not at Random (MNAR)
- (D) Cannot be determined from this information alone
Answer
**Correct: (B)** The missingness is related to an *observed* variable (age) rather than to the *missing* value itself (income). Young respondents are less likely to report income, but the missingness is predictable from the age column. This is MAR — Missing at Random, meaning "random conditional on observed data." If the missingness were related to income itself (e.g., people with very low income are embarrassed to report it), that would be MNAR. Option (D) is tempting, and in practice you can never fully prove the type of missingness, but the described pattern most closely fits MAR.

Question 3. Which of the following is the safest way to convert a column that might contain non-numeric values to numbers?
- (A) `df['col'].astype(float)`
- (B) `df['col'].astype(int)`
- (C) `pd.to_numeric(df['col'], errors='coerce')`
- (D) `pd.to_numeric(df['col'], errors='ignore')`
Answer
**Correct: (C)** `errors='coerce'` converts values that cannot be parsed as numbers to `NaN`, which you can then investigate and handle. Option (A) would raise a `ValueError` on any non-numeric value. Option (B) would also raise an error. Option (D) returns the column unchanged if it contains non-numeric values, which means you still have strings — not useful for numeric analysis.

Question 4. What does `df.duplicated(subset=['customer_id'], keep='first')` return?
- (A) A DataFrame with duplicate rows removed
- (B) A boolean Series that is `True` for ALL rows sharing a duplicated `customer_id`
- (C) A boolean Series that is `True` for the second and subsequent occurrences of each duplicated `customer_id`, but `False` for the first
- (D) The count of duplicate `customer_id` values
Answer
**Correct: (C)** `duplicated()` returns a boolean Series, not a modified DataFrame (that's `drop_duplicates()`). With `keep='first'`, the first occurrence of each duplicated value is marked `False` (not a duplicate), and subsequent occurrences are marked `True`. This lets you identify which rows *would* be removed if you called `drop_duplicates()`.

Question 5. You're cleaning a column of U.S. ZIP codes and notice that values like "02134" have been stored as the integer 2134. What's the best approach?
- (A) Add 0 to the beginning of all 4-digit numbers using arithmetic
- (B) Convert the column to string and use `.str.zfill(5)` to left-pad with zeros
- (C) Leave it as integers — the leading zero doesn't matter
- (D) Delete the rows with 4-digit ZIP codes
Answer
**Correct: (B)** ZIP codes are identifiers, not numbers — you'd never add or average them. Storing them as strings preserves leading zeros. `.str.zfill(5)` pads strings to 5 characters with leading zeros. Option (A) doesn't work — you can't "add 0 to the beginning" of an integer arithmetically. Option (C) is wrong because the leading zero is part of the ZIP code (02134 and 2134 are different identifiers). Option (D) destroys valid data.

Question 6. Which cleaning decision is MOST likely to introduce systematic bias into your analysis?
- (A) Standardizing capitalization of a `country` column
- (B) Converting a `date` column from string to datetime
- (C) Dropping all rows with missing `income` values when missingness correlates with race
- (D) Removing exact duplicate rows from a dataset
Answer
**Correct: (C)** If income missingness is correlated with race, dropping those rows systematically removes certain racial groups from the analysis — introducing bias. The analysis results will describe only those who reported income, which is not representative of the full population. Options (A), (B), and (D) are generally safe operations that don't selectively remove data from particular groups.

Question 7. What is the primary purpose of a data cleaning log?
- (A) To make the code run faster
- (B) To record every transformation applied to the data, with rationale, so decisions are reproducible and auditable
- (C) To automatically detect errors in the dataset
- (D) To create a backup of the original data
Answer
**Correct: (B)** A cleaning log documents what was done, why, and how many records were affected. This serves three purposes: reproducibility (others can follow your steps), auditability (reviewers can evaluate your choices), and memory (you'll forget your rationale in six months without it). It doesn't make code faster, detect errors automatically, or serve as a backup.

Question 8. You have a temperature dataset where the sensor records -999 when it malfunctions. What is the BEST first step?
- (A) Immediately replace -999 with `NaN` and move on
- (B) Replace -999 with the column mean
- (C) Investigate how many -999 values exist, whether they cluster at certain times, and document the sensor behavior before deciding on a strategy
- (D) Remove all rows with -999 and reduce the dataset
Answer
**Correct: (C)** Before taking any action, you should understand the scope and pattern of the problem. If -999 occurs mostly during extreme cold, the missingness is MNAR and simple imputation would bias the data. If it occurs randomly, MCAR strategies work fine. Options (A), (B), and (D) skip the investigation step and might introduce bias or lose information.

Question 9. When using `fillna()` with the mean, which potential problem is most concerning?
- (A) The fill operation is too slow for large datasets
- (B) The mean is distorted by outliers, so filled values may be unrealistic; additionally, filling with the mean reduces the variance of the column
- (C) `fillna()` changes the original DataFrame and cannot be undone
- (D) The mean can only be computed for integer columns, not floats
Answer
**Correct: (B)** Filling with the mean has two statistical problems: (1) outliers can make the mean unrepresentative of typical values, and (2) replacing missing values with a single constant (the mean) artificially reduces the spread of the data, making the distribution look more concentrated than it really is. Option (C) is incorrect because `fillna()` returns a new object by default and does not modify the original. Option (D) is incorrect — mean works on any numeric type.

Question 10. The chapter describes the threshold concept "Cleaning IS Analysis." Which of the following BEST captures this idea?
- (A) Data cleaning is more time-consuming than analysis, so it should be considered the main analytical activity
- (B) Every cleaning decision (dropping rows, imputing values, merging categories) encodes an assumption that shapes the results, so cleaning decisions should be treated with the same rigor as analytical decisions
- (C) You should analyze the data before cleaning it to know which cleaning operations to apply
- (D) Cleaning and analysis should always be done by the same person to maintain consistency
Answer
**Correct: (B)** The threshold concept is that cleaning is not a neutral, pre-analytical step — it IS analysis because every operation embeds an assumption. Dropping rows assumes the remaining data is representative. Imputing with the median assumes missing values follow the same distribution. These are analytical choices with real consequences for results.

Section 2: True/False (3 questions, 5 points each)
Question 11. True or False: `df.dropna()` removes columns with missing values by default.
Answer
**False.** By default, `dropna()` removes *rows* (axis=0) with missing values, not columns. To drop columns, you'd need `df.dropna(axis=1)`. This is a common source of confusion.

Question 12. True or False: In pandas, `None` and `np.nan` are treated identically in all operations.
Answer
**False.** While pandas generally treats both as missing values and converts `None` to `NaN` in numeric columns, there are subtle differences. In object-type columns, `None` and `NaN` can coexist as distinct values. Additionally, `None` is a Python object while `NaN` is a floating-point value, which matters for performance and some comparison operations.

Question 13. True or False: If 2% of rows in your dataset are missing a single column, dropping those rows is always safe and introduces no bias.
Answer
**False.** The percentage of missing data alone doesn't determine whether dropping is safe — the *mechanism* of missingness matters. Even if only 2% of rows are missing, if those rows represent a specific subgroup (e.g., all from rural areas), dropping them introduces systematic bias. Low percentage + MCAR = safe. Low percentage + MNAR = potentially biased.

Section 3: Short Answer (3 questions, 5 points each)
Question 14. Name three different strategies for handling missing values and give one scenario where each would be the best choice.
Answer
1. **Deletion (`dropna`):** Best when missing data is MCAR and represents a tiny fraction of the dataset (e.g., < 1%). Example: A sensor randomly fails 0.5% of the time, and you have millions of readings — dropping the missing ones is fine.
2. **Imputation (`fillna` with mean/median/mode):** Best when missing data is MAR and you want to preserve sample size. Example: Some survey respondents didn't report their age, but the missingness correlates with an observed variable (such as which survey version they received), so you can impute the median age within each group.
3. **Flagging (create an indicator column):** Best when the fact that data is missing is itself informative and may predict the outcome. Example: In medical records, patients who miss follow-up appointments may have worse health outcomes — the missingness itself is a predictive feature.

Question 15. Explain the difference between `df['col'].replace()` and `df['col'].map()` when standardizing categories. When would you prefer one over the other?
Answer
`replace()` substitutes specified values while leaving unmatched values unchanged. `map()` applies a mapping to every value, and any value not in the mapping becomes `NaN`. Use `replace()` when you want to fix specific variations while keeping everything else as-is (e.g., replacing "US" with "United States" but leaving "Brazil" untouched). Use `map()` when you have a complete mapping and want to ensure every value is accounted for — unmatched values becoming `NaN` helps you catch unmapped categories.

Question 16. What is the difference between an outlier that should be removed and an outlier that should be kept? Give an example of each.
Answer
An outlier that should be **removed** is one that results from a data error — it doesn't reflect reality. Example: A patient's age recorded as -5 (negative ages are impossible) or a temperature sensor reading of -999 (a placeholder code, not a real temperature). An outlier that should be **kept** is one that represents a genuine extreme value. Example: A CEO's salary of $5 million in a company salary dataset — it's real, it's just much higher than the median. Removing it would understate total compensation and hide real inequality. The key question: does this value reflect what actually happened in the world, or does it reflect a data collection/entry error?

Section 4: Code Analysis (4 questions, 5 points each)
Question 17. What is the output of this code?
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': [1, 2, np.nan, 4, np.nan, 6]
})

result = df.dropna()
print(len(result))
print(df['x'].fillna(0).sum())
print(df['x'].isnull().sum())
```
Answer
```
4
13.0
2
```
- `len(result)` = 4: `dropna()` removes 2 rows with NaN, leaving 4 rows.
- `df['x'].fillna(0).sum()` = 13.0: filling NaN with 0 gives [1, 2, 0, 4, 0, 6], sum = 13.0. Note: the original `df` is unchanged because neither `dropna()` nor `fillna()` modified it in place.
- `df['x'].isnull().sum()` = 2: the original still has 2 NaN values.
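The in-place point in the last bullet is easy to verify yourself: both calls return new objects, and the original data survives untouched. A minimal check, reusing the same toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 2, np.nan, 4, np.nan, 6]})

dropped = df.dropna()        # new DataFrame; df is not modified
filled = df['x'].fillna(0)   # new Series; df['x'] still has its NaNs

print(len(dropped))             # 4
print(filled.sum())             # 13.0
print(df['x'].isnull().sum())   # 2 (the original is unchanged)
```

To actually keep the cleaned result, reassign it (`df = df.dropna()`).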
Question 18. This code has a bug. What is it, and what will it print?
```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'new york', 'NEW YORK', 'Los Angeles', 'los angeles']
})

df['city_clean'] = df['city'].str.lower()
print(df['city_clean'].nunique())
```
Answer
This code actually works correctly. It prints `2` — after lowercasing, there are two unique values: "new york" and "los angeles". The trick: there is no bug. This is a question about whether you understand what the code does. If you expected 5, you forgot that `.str.lower()` makes all three "New York" variations identical.

Question 19. What does this code produce, and is there a potential problem?
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, np.nan, 30, 100, np.nan, 300]
})

df['value_filled'] = df['value'].fillna(df['value'].mean())
print(df)
```
Answer
The mean of `[10, 30, 100, 300]` (ignoring NaN) is 110. Both missing values get filled with 110. The problem: the two groups (A and B) have very different scales. Group A values are 10-30, but the missing value gets filled with 110 — wildly out of range for that group. Group B values are 100-300, and 110 is at the low end. A better approach would be group-based imputation:

```python
df['value_filled'] = df.groupby('group')['value'].transform(
    lambda x: x.fillna(x.mean())
)
```
This would fill group A's missing value with 20 and group B's with 200.
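For readers who want to confirm the 20 and 200 figures, here is the group-based fill as a self-contained, runnable sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, np.nan, 30, 100, np.nan, 300],
})

# transform() returns a Series aligned to df's index, so each group's
# missing value is filled with that group's own mean
df['value_filled'] = df.groupby('group')['value'].transform(
    lambda x: x.fillna(x.mean())
)

print(df['value_filled'].tolist())
# [10.0, 20.0, 30.0, 100.0, 200.0, 300.0]
```

Group A's mean (of 10 and 30) is 20; group B's mean (of 100 and 300) is 200, so each filled value stays in range for its group.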
Question 20. What does this code do, and what is the final shape of `clean`?
```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 3, 4, 5, 5, 5],
    'name': ['A', 'B', 'C', 'C', 'D', 'E', 'E', 'E'],
    'score': [80, 90, 85, 85, 70, 95, 95, 95]
})

clean = df.drop_duplicates()
print(f"After exact dedup: {clean.shape}")

clean2 = df.drop_duplicates(subset=['id'])
print(f"After id dedup: {clean2.shape}")
```
Answer
```
After exact dedup: (5, 3)
After id dedup: (5, 3)
```
The original has 8 rows. There are 3 exact duplicate rows (the second "C" row, and the second and third "E" rows). After exact deduplication: 5 rows remain.
For ID-based deduplication: IDs 3 and 5 have duplicates (3 appears twice, 5 appears three times). After keeping only the first occurrence of each ID: 5 unique IDs = 5 rows.
In this particular case, both approaches yield the same result because the duplicate rows happen to be exact duplicates. In general, these two operations can produce different results (e.g., if the duplicate IDs had different scores).
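To see the two strategies diverge, consider a hypothetical variant (not from the quiz data) where a repeated `id` carries conflicting scores:

```python
import pandas as pd

# Hypothetical data: id 3 appears twice with *different* scores
df = pd.DataFrame({
    'id': [1, 2, 3, 3],
    'score': [80, 90, 85, 88],
})

# No row is an exact duplicate, so exact dedup removes nothing
print(df.drop_duplicates().shape)               # (4, 2)

# ID-based dedup silently keeps only the first id-3 row (score 85)
print(df.drop_duplicates(subset=['id']).shape)  # (3, 2)
```

When subset-based deduplication drops rows that aren't exact duplicates, you are discarding conflicting information, so it's worth checking which rows differ before choosing `keep='first'` versus `keep='last'`.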
Scoring Guide
| Section | Points |
|---|---|
| Multiple Choice (10 x 4) | 40 |
| True/False (3 x 5) | 15 |
| Short Answer (3 x 5) | 15 |
| Code Analysis (4 x 5) | 20 |
| Total | 90 |
Note: The total is 90 points. Award up to 10 bonus points for exceptional depth in short answer responses. Passing score: 63/90 (70%).