Quiz: Data Wrangling — Cleaning and Preparing Real Data

Q: True or False: Feature engineering means creating new variables from existing ones in your dataset.

True. Feature engineering is the process of creating new variables ("features") from existing data to better capture patterns. Examples include creating BMI from height and weight, extracting the month from a date, or combining multiple variables into a risk score. Reference: Section 7.8

Contributors

Quiz: Data Wrangling — Cleaning and Preparing Real Data

Test your understanding before moving on. Target: 70% or higher to proceed confidently.

Section 1: Multiple Choice (1 point each)

1. Approximately what percentage of a typical data science project is spent on data cleaning and preparation?

A) 20%
B) 40%
C) 60%
D) 80%

Answer

**D)** 80%. *Why D:* "80%" is the figure most widely cited in the field, and it's the answer the chapter highlights. It traces to a 2014 *New York Times* article that reported a "50 to 80 percent" range from interviews and expert estimates. Treat it as a memorable rule of thumb rather than a precise measurement — later surveys (e.g., Anaconda's *State of Data Science*) put cleaning-and-loading closer to 45%. The key takeaway is qualitative: data preparation, not modeling, dominates most real projects. *Why not A-C:* While the true figure is debated and varies by project, these choices understate the consensus that preparation is the single largest phase of most analyses; 80% is the benchmark the chapter asks you to recognize. *Reference:* Section 7.1

2. A survey asks respondents about their income. Many high-income respondents leave this question blank. What type of missing data mechanism is this most likely?

A) MCAR (Missing Completely at Random)
B) MAR (Missing at Random)
C) MNAR (Missing Not at Random)
D) Not enough information to determine

Answer

**C)** MNAR (Missing Not at Random). *Why C:* The reason the data is missing (high income) is directly related to the value of the missing variable itself (income). The missingness depends on the unobserved value — this is the defining characteristic of MNAR. *Why not A:* MCAR would mean the missingness has no relationship to any variable. Here, it's clearly related to income level. *Why not B:* MAR would mean the missingness could be predicted by *other* observed variables (like education or job title), but the problem states the missingness is related to the income value itself. *Why not D:* The scenario gives enough information — the explicit connection between "high income" and "leaving blank" points to MNAR. *Reference:* Section 7.2

3. What does the pandas method .isna().sum() return when applied to a DataFrame?

A) The total number of missing values in the entire DataFrame
B) The number of missing values in each column
C) A True/False value for each cell
D) The percentage of missing values per column

Answer

**B)** The number of missing values in each column. *Why B:* `.isna()` returns a DataFrame of True/False values (True = missing), and `.sum()` adds up the True values (counted as 1) for each column, giving the count of missing values per column. *Why not A:* That would be `df.isna().sum().sum()` (two `.sum()` calls). *Why not C:* That's what `.isna()` alone returns — before the `.sum()`. *Why not D:* For percentages, you'd need `df.isna().mean() * 100`. *Reference:* Section 7.2

4. Why is mean imputation problematic for skewed data?

A) It changes the data type of the column
B) It introduces values that don't match the typical observation
C) It creates duplicate values
D) It only works for categorical variables

Answer

**B)** It introduces values that don't match the typical observation. *Why B:* In skewed distributions, the mean is pulled toward the tail (e.g., mean income > median income in right-skewed data). Imputing with the mean inserts values that are higher (or lower) than what a "typical" missing value would actually be. It also artificially reduces variance because every imputed value is identical. *Why not A:* Mean imputation doesn't change data types. *Why not C:* While imputed values are all the same (the mean), this is a side effect, not the core problem. *Why not D:* Mean imputation is specifically for numerical variables, not categorical. *Reference:* Section 7.3, Spaced Review SR.1

5. You run df.duplicated().sum() and get the value 15. What does this mean?

A) There are 15 unique rows in the DataFrame
B) There are 15 rows that are exact copies of earlier rows
C) There are 15 columns with duplicate values
D) There are 15 pairs of duplicate rows (30 rows total)

Answer

**B)** There are 15 rows that are exact copies of earlier rows. *Why B:* `.duplicated()` marks the *second* (and subsequent) occurrence of each duplicate row as True. A sum of 15 means 15 rows are duplicates of earlier rows. Note: this doesn't count the first occurrence, so there could be more than 15 rows involved in duplication. *Why not A:* The number of unique rows would be `len(df) - df.duplicated().sum()`. *Why not C:* `.duplicated()` operates on rows, not columns. *Why not D:* The 15 count represents duplicate rows, not pairs. Some original rows might have more than one copy. *Reference:* Section 7.4

6. Which of the following is NOT a rule of tidy data?

A) Each variable has its own column
B) Each observation has its own row
C) Each value has its own cell
D) Each column has no missing values

Answer

**D)** Each column has no missing values. *Why D:* Tidy data is about *structure*, not completeness. A tidy dataset can absolutely have missing values — what matters is that variables, observations, and values are organized correctly (one variable per column, one observation per row, one value per cell). *Why not A-C:* These are the three actual rules of tidy data as formalized by Hadley Wickham. *Reference:* Section 7.9

7. A column called "gender" contains the values: "Male", "male", "MALE", "M", "m". After running df['gender'].str.lower().str.strip(), what values would remain?

A) "male", "male", "male", "m", "m"
B) "male", "male", "male", "male", "male"
C) "Male", "male", "MALE", "M", "m"
D) "male", "male", "male", "M", "m"

Answer

**A)** "male", "male", "male", "m", "m". *Why A:* `.str.lower()` converts all text to lowercase ("Male" → "male", "MALE" → "male", "M" → "m"). `.str.strip()` removes whitespace (no effect here since there's no leading/trailing whitespace). You'd still have two distinct values: "male" and "m". To reduce to one value, you'd need an additional `.replace()` step. *Why not B:* `.lower()` and `.strip()` don't merge "m" into "male" — that requires an explicit mapping. *Why not C:* `.lower()` definitely changes uppercase letters. *Why not D:* `.lower()` would convert "M" to "m", not leave it unchanged. *Reference:* Section 7.5

8. What is binning?

A) Deleting outlier values from a dataset
B) Dividing a continuous variable into discrete intervals
C) Combining two categorical variables into one
D) Converting a categorical variable to a numerical one

Answer

**B)** Dividing a continuous variable into discrete intervals. *Why B:* Binning (also called discretization) takes a continuous variable like age (0 to 95) and divides it into categories like "0-17", "18-34", "35-54", "55+". The pandas function `pd.cut()` is the primary tool for this. *Why not A:* That's outlier removal, a different process entirely. *Why not C:* That would be recoding or collapsing categories. *Why not D:* That's encoding (e.g., one-hot encoding), not binning. *Reference:* Section 7.7

9. A researcher creates a missing-data flag column (age_was_missing) before imputing age values. Why is this a good practice?

A) It prevents pandas from throwing errors during imputation
B) It preserves the information about which values were originally missing
C) It speeds up the imputation process
D) It's required by pandas — .fillna() won't work without it

Answer

**B)** It preserves the information about which values were originally missing. *Why B:* Once you impute a value, you can't tell the difference between an original value and an imputed one. The flag lets you (and future analysts) know which values are real and which are estimated. The pattern of missingness itself might be informative — e.g., patients with missing ages might differ systematically from those with recorded ages. *Why not A:* `.fillna()` works fine without a flag. *Why not C:* The flag doesn't affect computation speed. *Why not D:* The flag is a best practice, not a technical requirement. *Reference:* Section 7.3

10. Which pandas function creates equal-frequency bins (bins with approximately the same number of observations)?

A) pd.cut()
B) pd.qcut()
C) pd.bin()
D) df.groupby()

Answer

**B)** `pd.qcut()`. *Why B:* `pd.qcut()` (quantile cut) divides data into bins based on quantiles, ensuring each bin contains approximately the same number of observations. `pd.qcut(df['age'], q=4)` creates quartile-based bins. *Why not A:* `pd.cut()` creates equal-*width* bins (same range), not equal-frequency. *Why not C:* `pd.bin()` doesn't exist in pandas. *Why not D:* `groupby()` groups data by existing categories — it doesn't create bins. *Reference:* Section 7.7

Section 2: True or False (1 point each)

11. True or False: If only 3% of values in a column are missing, it is always safe to simply delete those rows.

Answer

**False.** Even with a small percentage of missing data, deletion can introduce bias if the data is MNAR. For example, if 3% of income values are missing and they all belong to the wealthiest respondents, deleting those rows would understate the mean income and underestimate income inequality. The percentage missing matters, but the *mechanism* of missingness matters more. *Reference:* Sections 7.2, 7.3

12. True or False: The df.drop_duplicates() method removes all copies of a duplicated row, leaving no version of it in the dataset.

Answer

**False.** By default, `df.drop_duplicates()` uses `keep='first'`, which keeps the *first* occurrence and removes subsequent duplicates. To remove ALL copies (including the first), you'd need `keep=False` — but this is rarely what you want, as it removes the original observation along with its copies. *Reference:* Section 7.4

13. True or False: Feature engineering means creating new variables from existing ones in your dataset.

Answer

**True.** Feature engineering is the process of creating new variables ("features") from existing data to better capture patterns. Examples include creating BMI from height and weight, extracting the month from a date, or combining multiple variables into a risk score. *Reference:* Section 7.8

14. True or False: Reproducibility means getting the same results on a different dataset.

Answer

**False.** Reproducibility means that someone else can take the *same* raw data, follow your documented steps, and arrive at the *same* cleaned dataset and results. Getting similar results on a *different* dataset is called **replicability** — a related but distinct concept. Reproducibility requires documentation (a cleaning log, code, and clear decision records). *Reference:* Section 7.10

15. True or False: In pandas, .fillna() modifies the DataFrame in place by default.

Answer

**False.** By default, `.fillna()` returns a *new* Series or DataFrame — it does not modify the original. You must assign the result back: `df['col'] = df['col'].fillna(value)`. (Older code sometimes used `inplace=True`, but this is unreliable for single-column assignment under the Copy-on-Write model in pandas 3.0+ and is being phased out — assign the result back instead.) Forgetting to capture the result is one of the most common pandas cleaning errors. *Reference:* Section 7.11 (Debugging Spotlight, Error 3)

Section 3: Short Answer (2 points each)

16. You're cleaning a dataset and discover that the "temperature" column (which should contain body temperatures in Fahrenheit) has a minimum value of 36.5 and a maximum value of 104.2. The mean is 98.4.

(a) What's suspicious about the minimum value? (b) What's the most likely explanation? (c) How would you fix it?

Answer

(a) A body temperature of 36.5 degrees Fahrenheit is impossibly low — only a few degrees above the freezing point of water (32°F), and far below the roughly 70°F at which severe, life-threatening hypothermia sets in. Normal body temperature in Fahrenheit is around 98.6°F, so 36.5°F could not belong to a living patient. (b) The value 36.5 is a perfectly normal body temperature *in Celsius* (36.5°C = 97.7°F). Someone likely entered a Celsius value in a Fahrenheit column — a unit inconsistency. (c) You could convert the suspicious values using the formula: F = C × 9/5 + 32. In pandas: `df.loc[df['temperature'] < 90, 'temperature'] = df['temperature'] * 9/5 + 32`. You'd set the threshold at 90°F because any body temperature below 90°F is extremely unusual (severe hypothermia), while Celsius values would typically range from 35-42. Document this decision in your cleaning log.

17. Explain the difference between pd.cut() and pd.qcut(). Create a scenario where each would be the better choice.

Answer

`pd.cut()` creates **equal-width** bins — each bin spans the same range of values. `pd.qcut()` creates **equal-frequency** bins — each bin contains approximately the same number of observations. **When to use `pd.cut()`:** You're creating age groups for a public health report: 0-17, 18-34, 35-54, 55+. These boundaries have real-world meaning (legal adulthood, retirement age), so equal-width bins based on domain knowledge make sense. **When to use `pd.qcut()`:** You're dividing customers into spending quartiles (bottom 25%, lower-middle, upper-middle, top 25%) for a marketing campaign. You want each group to have roughly the same number of customers so you can allocate marketing resources evenly. Equal-frequency bins ensure balanced groups.

18. A colleague shows you the following cleaning code and asks you to find the bugs. Identify at least two errors and explain how to fix them.

# Clean the dataset
df_clean = df[df['age'] > 18]
df_clean['income_level'] = 'unknown'  # Bug?

df['salary'].fillna(df['salary'].mean())  # Bug?

df['name'] = df['name'].lower()  # Bug?

Answer

**Bug 1 (line 2-3):** `df_clean` might be a view of `df`, not a copy. Assigning a new column to it could trigger a `SettingWithCopyWarning` and may not work as expected. **Fix:** Use `df_clean = df[df['age'] > 18].copy()`. **Bug 2 (line 5):** `.fillna()` returns a new Series — it doesn't modify `df['salary']` in place. The result isn't assigned to anything, so the missing values remain. **Fix:** `df['salary'] = df['salary'].fillna(df['salary'].mean())`. **Bug 3 (line 7):** `.lower()` is a string method, but when applied to a pandas Series, you need the `.str` accessor. This will throw an `AttributeError`. **Fix:** `df['name'] = df['name'].str.lower()`.

19. Dr. Maya Chen is debating whether to impute the missing vaccination status in her flu dataset. The missing rate is 9.6%. Argue for and against imputation in this case, considering the likely missing data mechanism.

Answer

**Argument FOR imputation:** Deleting 9.6% of rows reduces sample size and statistical power. If vaccination status is needed for the analysis, imputation preserves those rows. If the missingness is MAR (explainable by clinic or age), group-based imputation could produce reasonable estimates. **Argument AGAINST imputation:** Vaccination status is likely MNAR — patients who aren't vaccinated may be less likely to know or report their status. If we impute with the mode ("Yes"), we'd overestimate the vaccination rate. If we impute with a model, the model would be trained on the non-missing data, which may not represent the missing population. **Recommendation:** Do NOT impute vaccination status. Instead, leave it as missing, create a flag, and analyze the data in three ways: (1) only among patients with known vaccination status, (2) assuming all missing are "No," and (3) assuming all missing are "Yes." If the conclusions are similar across all three, you can be more confident in the results. If they differ, report the sensitivity analysis. This is an ethical data practice — transparency about the limitation is better than false certainty.

20. Explain why the following statement is both true and important: "Every data cleaning decision is a study design decision in disguise."

Answer

When you decide to delete rows, impute values, or recode categories, you're changing *who* is in your sample and *what* their data looks like — which is exactly what study design decisions do. Deleting rows with missing income is equivalent to excluding certain people from your study. Imputing vaccination status with the mode is equivalent to assuming something about participants you didn't actually observe. Recoding race/ethnicity categories determines which groups get analyzed separately and which get combined. These decisions affect the representativeness of your sample (just like sampling method choices in [Chapter 4](../../part-01-getting-started/chapter-04-designing-studies/index.md)), can introduce bias (just like selection bias or nonresponse bias), and ultimately determine what conclusions your analysis can support. The key insight is that data cleaning is not a neutral, technical step — it's a series of choices that shape results. That's why documentation and transparency are so important: they let others evaluate not just your analysis, but the cleaning decisions that made the analysis possible. *Reference:* Sections 7.3, 7.10, Spaced Review SR.3