Chapter 8 Exercises: Cleaning Messy Data
How to use these exercises: Work through the sections in order. Parts A and B test core understanding. Parts C and D require writing and running code — have a Jupyter notebook open. Part E asks you to think about the bigger picture of cleaning decisions. Part M (Metacognitive) asks you to reflect on your own learning.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
These questions check whether you absorbed the core ideas. No code required.
Exercise 8.1 — The five types of mess
List the five types of data quality problems described in this chapter. For each one, give a one-sentence example that's different from the examples in the chapter.
Guidance
The five types are: (1) missing values, (2) duplicate records, (3) type errors, (4) inconsistent categories, and (5) outliers/implausible values. Your examples should be specific — "a column where the city is spelled three different ways" is better than "data that has errors."
Exercise 8.2 — MCAR, MAR, MNAR
Classify each of the following as MCAR, MAR, or MNAR. Briefly explain your reasoning.
- A fitness tracker fails to record heart rate during intense exercise because the sensor can't get a reliable reading when there's too much sweat.
- A spreadsheet was partially corrupted during a file transfer, losing 5% of rows at random.
- In a salary survey, respondents with higher salaries are more likely to skip the income question.
- An online survey has missing responses for participants using mobile devices because the form was poorly formatted on small screens.
- Patients with more severe illness are less likely to complete a follow-up questionnaire because they feel too unwell.
Guidance
1. **MNAR** — The missingness is directly related to the missing value itself (high heart rate causes the sensor to fail, so extreme values are systematically missing).
2. **MCAR** — The corruption is random and unrelated to data content.
3. **MNAR** — The decision not to report income is related to the income level itself.
4. **MAR** — The missingness is related to an observed variable (device type) rather than the missing values themselves. You can account for this if you know which device each respondent used.
5. **MNAR** — Illness severity (related to the missing follow-up data) is the reason for non-response.
Exercise 8.3 — Cleaning as analysis
In your own words, explain what the chapter means by "cleaning IS analysis." Give one specific example where a cleaning decision would change the conclusions of an analysis.
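If you want to see the effect before reading the guidance, here is a tiny simulation with made-up incomes: dropping the MNAR-missing values produces a "clean" average well above the true one.

```python
import pandas as pd

# Hypothetical survey: the two lowest earners skip the income question (MNAR).
true_incomes = pd.Series([20000, 25000, 30000, 60000, 70000, 80000])
reported = true_incomes.where(true_incomes >= 30000)  # low values go missing

true_mean = true_incomes.mean()          # the real average
clean_mean = reported.dropna().mean()    # the "cleaned" average
print(true_mean, clean_mean)
```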
Guidance
The core idea is that every cleaning operation encodes an assumption. For example: if you're computing the average income in a city and you drop all rows with missing income, you might get a higher average than the true one (if lower-income residents were less likely to report). The "clean" average is actually a biased estimate — and the bias was introduced by your cleaning decision, not by the original data.
Exercise 8.4 — Why not just dropna()?
Explain two scenarios where calling df.dropna() would be inappropriate. For each scenario, suggest a better approach.
Guidance
Scenario 1: A large fraction of rows have missing values (say 40%). Dropping all of them would dramatically reduce your sample size and statistical power. Better approach: impute values or analyze with missing-data methods.
Scenario 2: The missingness is not random — it correlates with a demographic group you care about (e.g., rural clinics have more missing data). Dropping would introduce systematic bias. Better approach: flag missing values and use group-based imputation.
Exercise 8.5 — Data type consequences
Explain what goes wrong in each of these scenarios:
- You try to compute the mean of a column, but pandas returns an error because the column is stored as `object` type.
- You store ZIP codes as integers and notice that "02134" (Boston) has become "2134."
- You merge two DataFrames on a "year" column, but get no matches — one DataFrame stores year as `int64` and the other as `object`.
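All three failures are easy to reproduce in a few lines; a short sketch with made-up values:

```python
import pandas as pd

# Scenario 1: one stray string forces the whole column to object dtype
counts = pd.Series(['10', '20', 'n/a'])
fixed = pd.to_numeric(counts, errors='coerce')
print(fixed.mean())        # the mean works once the bad value becomes NaN

# Scenario 2: integers strip leading zeros; identifiers belong in strings
print(str(2134).zfill(5))  # restores '02134'

# Scenario 3: an int year never equals a string year
print(2024 == '2024')      # False, so the merge keys never match
```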
Guidance
1. Numeric operations like `.mean()` don't work on string columns. The column likely contains a non-numeric value that forced pandas to read it as `object`. You need to find and fix the problematic values, then convert with `pd.to_numeric()`.
2. Integers strip leading zeros. ZIP codes are identifiers, not numbers — they should be stored as strings.
3. The merge fails because `2024` (int) does not equal `'2024'` (string). You need to convert one or both columns to the same type before merging.
Exercise 8.6 — Outlier or insight?
For each of the following, argue whether the value should be removed as an outlier or kept as a valid data point. There's no single right answer — the goal is to practice reasoning.
- In a dataset of employee salaries at a small company, the CEO's salary is 10x the median.
- In a dataset of daily temperatures, one reading shows -40 degrees in a city that normally ranges from 30 to 100 degrees Fahrenheit.
- In a dataset of website page load times, one page took 45 seconds to load when the average is 2 seconds.
- In a dataset of ages at a retirement home, one record shows age 23.
Guidance
1. Probably keep — CEO pay really is that high. Removing it would understate total compensation.
2. Could go either way — -40 is physically possible (it's where Fahrenheit and Celsius meet) but extremely rare. Check the date, the location, and whether the sensor was functioning.
3. Probably keep — slow page loads happen due to server issues. But verify it's not a data entry error (maybe someone entered seconds instead of milliseconds).
4. Probably investigate — age 23 is implausible for a retirement home resident. Could be a staff member, a data entry error, or an extraordinary situation. Don't remove without checking.
Part B: Code Reading and Prediction ⭐⭐
Read each code snippet and predict the output before running it. Then verify in your notebook.
Exercise 8.7 — isnull patterns
import pandas as pd
import numpy as np
s = pd.Series([1, np.nan, 3, None, 5, np.nan])
print(s.isnull().sum())
print(s.notnull().sum())
print(s.dropna().mean())
What three numbers will be printed?
Answer
- `s.isnull().sum()` = **3** (indices 1, 3, 5 are NaN/None)
- `s.notnull().sum()` = **3** (indices 0, 2, 4 are not null)
- `s.dropna().mean()` = **3.0** (mean of 1, 3, 5)
Exercise 8.8 — fillna behavior
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, np.nan, 3, np.nan, 5],
'B': ['x', None, 'y', 'z', None]
})
df_filled = df.fillna({'A': 0, 'B': 'unknown'})
print(df_filled)
print()
print(df) # Is the original changed?
What will the two print statements show? Will the original df be modified?
Answer
`df_filled` will have 0 in place of NaN in column A and 'unknown' in place of None in column B. The original `df` will NOT be changed — `fillna()` returns a new DataFrame by default. This is a common gotcha: if you forget to assign the result back to `df`, nothing changes.
Exercise 8.9 — Duplicate detection
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob'],
'score': [85, 92, 85, 78, 95]
})
print(df.duplicated().sum())
print(df.duplicated(subset=['name']).sum())
What two numbers will be printed, and why are they different?
Answer
- `df.duplicated().sum()` = **1** (only row 2 is an exact match of row 0 — same name AND same score)
- `df.duplicated(subset=['name']).sum()` = **2** (row 2 duplicates "Alice" and row 4 duplicates "Bob" — just checking names, not scores)
The difference: exact duplicates require ALL columns to match. Subset duplicates only check the specified columns.
Exercise 8.10 — to_numeric with errors
import pandas as pd
s = pd.Series(['10', '20', 'thirty', '40', 'N/A'])
print(pd.to_numeric(s, errors='coerce'))
print(pd.to_numeric(s, errors='coerce').sum())
What will be printed?
Answer
The first print shows: `[10.0, 20.0, NaN, 40.0, NaN]` — 'thirty' and 'N/A' become NaN. The second print shows: `70.0` — pandas ignores NaN in `.sum()` by default.
Part C: Applied Coding ⭐⭐
Open a Jupyter notebook and solve these problems with code. Each problem builds a skill you'll use in real projects.
Exercise 8.11 — Missing value audit function ⭐⭐
Write a function called missing_report(df) that takes a DataFrame and returns a new DataFrame with columns: column_name, missing_count, missing_pct, dtype. Sort the output by missing_pct descending. Only include columns that have at least one missing value.
Test it on this data:
import pandas as pd
import numpy as np
test_df = pd.DataFrame({
'id': range(100),
'age': [np.nan if i % 10 == 0 else i + 20 for i in range(100)],
'income': [np.nan if i % 5 == 0 else 50000 + i * 100 for i in range(100)],
'name': ['Alice'] * 100,
'score': [np.nan if i % 3 == 0 else 75 + (i % 25) for i in range(100)]
})
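Before writing the function, it helps to know what the right answer looks like; the missing counts follow directly from the modulo patterns in the test data:

```python
import pandas as pd
import numpy as np

test_df = pd.DataFrame({
    'id': range(100),
    'age': [np.nan if i % 10 == 0 else i + 20 for i in range(100)],
    'income': [np.nan if i % 5 == 0 else 50000 + i * 100 for i in range(100)],
    'name': ['Alice'] * 100,
    'score': [np.nan if i % 3 == 0 else 75 + (i % 25) for i in range(100)]
})

counts = test_df.isnull().sum()
print(counts.to_dict())
# A correct missing_report should therefore list score (34%), then
# income (20%), then age (10%), and omit id and name entirely.
```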
Guidance
def missing_report(df):
missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)
dtypes = df.dtypes
report = pd.DataFrame({
'column_name': missing.index,
'missing_count': missing.values,
'missing_pct': missing_pct.values,
'dtype': dtypes.values
})
report = report[report['missing_count'] > 0]
report = report.sort_values('missing_pct', ascending=False)
return report.reset_index(drop=True)
Exercise 8.12 — Cleaning a price column ⭐⭐
Given this Series of messy prices, write code to convert it to clean numeric values:
prices = pd.Series([
'$12.99', '$1,299.00', '15.50', '$0.99', 'free',
'N/A', '$45', ' $99.99 ', None, '1000'
])
Your cleaned Series should have: 12.99, 1299.0, 15.5, 0.99, 0.0, NaN, 45.0, 99.99, NaN, 1000.0
Note: "free" should become 0.0, not NaN.
Guidance
clean = (
prices
.str.strip()
.str.replace('$', '', regex=False)
.str.replace(',', '', regex=False)
.replace({'free': '0', 'N/A': np.nan})
.pipe(pd.to_numeric, errors='coerce')
)
Exercise 8.13 — Standardizing country names ⭐⭐
Given this Series of inconsistent country names, standardize them to a canonical form:
countries = pd.Series([
'United States', 'USA', 'US', 'U.S.A.', 'united states',
'Brazil', 'brasil', 'BRAZIL', 'Brazil ',
'United Kingdom', 'UK', 'U.K.', 'Great Britain',
'South Korea', 'Korea, Republic of', 'Republic of Korea',
None, '', 'Germany', 'germany'
])
Write code that produces a clean Series with consistent names. Handle the blank strings and None values.
Guidance
# Step 1: Normalize whitespace and case
clean = countries.str.strip().str.title()
# Step 2: Replace blanks with NaN
clean = clean.replace({'': np.nan})
# Step 3: Map variations to canonical names
mapping = {
'Usa': 'United States',
'Us': 'United States',
'U.S.A.': 'United States',
'Brasil': 'Brazil',
'Uk': 'United Kingdom',
'U.K.': 'United Kingdom',
'Great Britain': 'United Kingdom',
'Korea, Republic Of': 'South Korea',
'Republic Of Korea': 'South Korea',
}
clean = clean.replace(mapping)
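After mapping, check whether any variants slipped through. A sketch of the idea on a shortened list (the `expected` set here is mine):

```python
import pandas as pd
import numpy as np

countries = pd.Series(['United States', 'USA', 'uk', 'Brazil ', None, ''])

clean = countries.str.strip().str.title().replace({'': np.nan})
clean = clean.replace({'Usa': 'United States', 'Uk': 'United Kingdom'})

# Anything outside the expected canonical set is a variant the mapping missed.
expected = {'United States', 'United Kingdom', 'Brazil'}
unmapped = sorted(set(clean.dropna()) - expected)
print(unmapped)  # an empty list means every variant was handled
```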
Exercise 8.14 — Detecting implausible values ⭐⭐
Write a function flag_implausible(series, min_val=None, max_val=None) that takes a numeric Series and:
1. Returns a boolean Series where True means the value is outside the plausible range
2. Handles NaN values (they should NOT be flagged as implausible)
Test it:
ages = pd.Series([25, 30, -5, 150, None, 45, 0, 200, 67, np.nan])
print(flag_implausible(ages, min_val=0, max_val=120))
Expected: [False, False, True, True, False, False, False, True, False, False]
Guidance
def flag_implausible(series, min_val=None, max_val=None):
flags = pd.Series(False, index=series.index)
if min_val is not None:
flags = flags | (series < min_val)
if max_val is not None:
flags = flags | (series > max_val)
# Don't flag NaN as implausible
flags = flags & series.notna()
return flags
Exercise 8.15 — Group-based imputation ⭐⭐⭐
Given this dataset of student test scores with some missing values:
students = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve',
'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
'department': ['Math', 'Math', 'Math', 'English', 'English',
'English', 'Science', 'Science', 'Science', 'Science'],
'score': [85, np.nan, 92, 78, np.nan, 82, np.nan, 88, 91, np.nan]
})
Impute missing scores using the median score within each department. Explain why this is better than using the overall median.
Guidance
students['score_imputed'] = students.groupby('department')['score'].transform(
lambda x: x.fillna(x.median())
)
Group-based imputation is better because test scores may vary systematically by department (perhaps Science has harder exams). Using the overall median would push imputed values toward the global center, ignoring departmental differences. Group median preserves the structure of the data.
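To see why this matters, compare the overall median with the per-department medians (reusing the score data above, names omitted); overall-median imputation would hand every student the same fill value:

```python
import pandas as pd
import numpy as np

students = pd.DataFrame({
    'department': ['Math', 'Math', 'Math', 'English', 'English',
                   'English', 'Science', 'Science', 'Science', 'Science'],
    'score': [85, np.nan, 92, 78, np.nan, 82, np.nan, 88, 91, np.nan]
})

overall = students['score'].median()
by_dept = students.groupby('department')['score'].median()
print(overall)             # one fill value for everyone
print(by_dept.to_dict())   # a different fill value per department
```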
Exercise 8.16 — Complete cleaning pipeline ⭐⭐⭐
Clean this messy dataset end to end. Write a cleaning pipeline that handles all five types of data problems. Document each step.
messy = pd.DataFrame({
'employee_id': [1001, 1002, 1003, 1003, 1004, 1005, 1006, 1007, 1008, 1009,
1010, 1002, 1011, 1012, 1013],
'name': ['Alice Smith', 'bob jones', 'Carol Lee', 'Carol Lee', 'David Kim',
None, 'Eve Torres', 'FRANK NGUYEN', ' Grace Hall ', 'Henry Wu',
'Iris Patel', 'Bob Jones', 'Jack Ryan', 'Kate Mills', 'Leo Park'],
'department': ['Engineering', 'engineering', 'Sales', 'Sales', 'HR',
'Engineering', 'Sales', 'hr', 'Engineering', 'SALES',
'Marketing', 'Engineering', 'Marketing', None, 'marketing'],
'salary': ['75000', '82000', '65000', '65000', 'N/A', '91000',
'58000', '79000', '88000', '62000', '71000', '82000',
'67000', '73000', '-5000'],
'hire_date': ['2020-01-15', '2019-03-22', '2021-06-01', '2021-06-01',
'2022-11-10', '2018-07-30', 'June 5, 2023', '2017-09-14',
'2020-04-20', '2022-01-01', '2023-03-15', '2019-03-22',
'2021-08-08', '2020-12-01', '2019-05-17']
})
Your cleaned output should have: no duplicates, no missing names, proper data types, consistent department names, valid salaries, and parsed dates.
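One possible skeleton to start from (the function name and the choice to null out the negative salary are mine, and `format='mixed'` needs pandas 2.0 or later):

```python
import pandas as pd
import numpy as np

def clean_employees(messy: pd.DataFrame) -> pd.DataFrame:
    df = messy.copy()
    # 1. Duplicates: the same employee_id appears more than once
    df = df.drop_duplicates(subset=['employee_id'])
    # 2. Inconsistent text: strip whitespace, normalize case
    df['name'] = df['name'].str.strip().str.title()
    df['department'] = df['department'].str.strip().str.title()
    # 3. Types: salary to numeric ('N/A' becomes NaN), mixed-format dates parsed
    df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
    df['hire_date'] = pd.to_datetime(df['hire_date'], format='mixed', errors='coerce')
    # 4. Implausible values: a negative salary is treated as missing here
    df.loc[df['salary'] < 0, 'salary'] = np.nan
    return df
```

Handling the remaining missing names and departments, and documenting each step, is left to you.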
Guidance
Follow the cleaning checklist from the chapter:
1. Inspect with `.info()`, `.dtypes`, `.isnull().sum()`
2. Remove duplicates on `employee_id`
3. Standardize `name` (strip, title case) and `department` (strip, title case)
4. Convert `salary` to numeric with `pd.to_numeric(errors='coerce')`
5. Flag/fix the negative salary
6. Parse `hire_date` with `pd.to_datetime(format='mixed', errors='coerce')`
7. Handle the missing `name` and `department`
8. Document every step
Part D: Debugging and Troubleshooting ⭐⭐
Each problem presents buggy code. Find and fix the error.
Exercise 8.17 — The silent failure
import pandas as pd
import numpy as np
df = pd.DataFrame({
'temperature': [72.1, np.nan, 68.5, np.nan, 71.3, 69.8]
})
# "Fill missing temperatures with the mean"
df['temperature'].fillna(df['temperature'].mean())
print(f"Missing after fill: {df['temperature'].isnull().sum()}")
# Prints: Missing after fill: 2
# Expected: Missing after fill: 0
Why does the fill appear to have no effect? Fix the code.
Answer
`fillna()` returns a new Series — it doesn't modify `df['temperature']` in place. Fix: assign the result back:
df['temperature'] = df['temperature'].fillna(df['temperature'].mean())
Exercise 8.18 — The type surprise
import pandas as pd
df = pd.DataFrame({
'count': ['10', '20', '30', '40', '50']
})
print(df['count'].sum())
# Prints: 1020304050
# Expected: 150
What went wrong? Fix it.
Answer
The column is stored as strings, so `.sum()` concatenates instead of adding. Fix:
df['count'] = pd.to_numeric(df['count'])
print(df['count'].sum()) # 150
Exercise 8.19 — The disappearing rows
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'age': [25, 30, np.nan, 35, 40],
'score': [85, np.nan, 92, 78, np.nan]
})
# "I only want to drop rows where age is missing"
clean = df.dropna()
print(f"Rows remaining: {len(clean)}")
# Prints: Rows remaining: 2
# Expected: Rows remaining: 4
Why were more rows dropped than expected? Fix it.
Answer
`dropna()` without arguments drops rows with ANY missing value in ANY column. It removed rows where `age` OR `score` was missing. Fix:
clean = df.dropna(subset=['age'])  # Only drop where age is missing
Exercise 8.20 — The merge mystery
import pandas as pd
sales = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Doohickey'],
'revenue': [1000, 2000, 1500]
})
costs = pd.DataFrame({
'product': ['widget', 'gadget', 'doohickey'],
'cost': [400, 800, 600]
})
merged = sales.merge(costs, on='product')
print(merged)
# Empty DataFrame!
Why is the merge empty? Fix it.
Answer
The product names have different capitalization: "Widget" vs "widget". String merges are case-sensitive. Fix:
sales['product'] = sales['product'].str.lower()
# OR
costs['product'] = costs['product'].str.title()
# Then merge
merged = sales.merge(costs, on='product')
Exercise 8.21 — The NaN equality trap
import pandas as pd
import numpy as np
df = pd.DataFrame({
'status': ['active', np.nan, 'inactive', np.nan, 'active']
})
# "Count how many are missing"
missing_count = len(df[df['status'] == np.nan])
print(f"Missing: {missing_count}")
# Prints: Missing: 0
# Expected: Missing: 2
Why does the equality check fail? Fix it.
Answer
`np.nan == np.nan` is `False` by design (NaN is not equal to anything, including itself). You cannot use `==` to check for NaN. Fix:
missing_count = df['status'].isnull().sum()
# OR
missing_count = len(df[df['status'].isnull()])
Part E: Synthesis and Critical Thinking ⭐⭐⭐
These problems require combining multiple concepts and thinking carefully about trade-offs.
Exercise 8.22 — The imputation experiment
Using the following dataset, compute the mean of income using three different missing value strategies: (a) drop missing rows, (b) fill with the overall median, (c) fill with the group median by region. Compare the three results and explain which is most appropriate and why.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'person_id': range(1, 21),
'region': ['Urban'] * 10 + ['Rural'] * 10,
'income': [55000, 62000, np.nan, 71000, 48000,
np.nan, 83000, 59000, 65000, np.nan,
28000, np.nan, 32000, np.nan, 25000,
31000, np.nan, 29000, 33000, 27000]
})
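The three computations can be set up like this; it's worth running them before committing to an interpretation:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'person_id': range(1, 21),
    'region': ['Urban'] * 10 + ['Rural'] * 10,
    'income': [55000, 62000, np.nan, 71000, 48000,
               np.nan, 83000, 59000, 65000, np.nan,
               28000, np.nan, 32000, np.nan, 25000,
               31000, np.nan, 29000, 33000, 27000]
})

# (a) drop missing rows
mean_drop = data['income'].dropna().mean()
# (b) fill with the overall median
mean_overall = data['income'].fillna(data['income'].median()).mean()
# (c) fill with the group median by region
mean_group = (data.groupby('region')['income']
                  .transform(lambda s: s.fillna(s.median()))
                  .mean())
print(round(mean_drop, 2), round(mean_overall, 2), round(mean_group, 2))
```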
Guidance
Compute all three and compare. In this dataset the missing values are split evenly between regions (three per group), but the two regions have very different income levels, so the overall median falls between them and is a poor fill for either group: it drags imputed urban incomes well below typical urban values and pushes imputed rural incomes above typical rural values. Dropping rows simply averages the observed values. Group-median imputation is most appropriate here because it respects the regional structure.
Exercise 8.23 — Design a cleaning strategy
You've been given a dataset of 50,000 restaurant health inspection records with the following issues:
- 12% of rows are missing a value in the `inspection_score` column
- 3% of rows are exact duplicates
- The `restaurant_name` column has inconsistent capitalization and abbreviations ("McDonald's", "mcdonalds", "McDonalds", "MCDONALD'S")
- The `date` column is a mix of "2024-03-15" and "March 15, 2024" formats
- The `score` column has 47 values that are negative (scores range from 0 to 100)
- The `zip_code` column is stored as `int64`, losing leading zeros
Without writing code, describe your cleaning strategy in order. For each step, state what you'd do, why, and what assumptions you're making.
Guidance
A strong answer addresses each issue in a logical order (typically: duplicates first, then types, then missing values, then categories, then outliers) and justifies each decision. Key question for missing scores: are restaurants with missing scores systematically different from those with scores? For negative scores: are these data entry errors or a special code? For ZIP codes: convert to string and left-pad with zeros to 5 digits.
Exercise 8.24 — The ethical audit
A school district provides you with student test score data to analyze achievement gaps by race and socioeconomic status. You discover:
- 18% of scores are missing for Black students, compared to 6% for white students
- 22% of scores are missing for students receiving free/reduced lunch, compared to 4% for others
- Missing scores are concentrated in three schools that had testing irregularities
You've been asked to report the "average test score by race" for a school board meeting. Describe at least three different approaches you could take, the result each would produce, and which you'd recommend and why.
Guidance
Key approaches: (1) drop missing rows — inflates Black/low-income averages because missing students may have lower scores; (2) fill with zeros — deflates those averages dramatically; (3) fill with group median — preserves existing gaps but doesn't widen or narrow them; (4) report results with and without imputation and flag the missing data pattern. Approach 4 is most responsible because it makes the uncertainty visible. Any single approach hides information.
Exercise 8.25 — Sensitivity analysis
Take the student dataset from Exercise 8.15 and run your analysis (average score by department) under three different imputation strategies: (a) drop missing, (b) fill with 0, (c) fill with group median. Create a table comparing the results. How much do the department averages change depending on your strategy? What does this tell you about the sensitivity of your results to cleaning decisions?
Guidance
Create a side-by-side comparison table. If the results change dramatically between strategies, your conclusions are sensitive to cleaning choices, and you should report this uncertainty. If results are stable across strategies, your conclusions are robust.
Exercise 8.26 — Building a cleaning function
Write a function called clean_survey_data(df) that takes a raw survey DataFrame and returns a cleaned version. The function should:
1. Remove exact duplicate rows
2. Standardize all string columns (strip whitespace, title case)
3. Convert any column named *_date to datetime
4. Convert any column named *_score or *_count to numeric
5. Return a tuple of (cleaned_df, cleaning_report_dict)
The cleaning report should include: rows before/after, duplicates removed, columns converted, and values coerced to NaN.
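A skeleton that matches the required shape; the report keys and the column-matching rules here are one reasonable reading of the spec, not the only one:

```python
import pandas as pd

def clean_survey_data(df):
    """Sketch: returns (cleaned_df, report_dict); rule details are a design choice."""
    report = {'rows_before': len(df), 'values_coerced': 0, 'columns_converted': []}
    cleaned = df.drop_duplicates().copy()
    report['duplicates_removed'] = report['rows_before'] - len(cleaned)
    for col in cleaned.columns:
        # Standardize string columns before any conversion
        if cleaned[col].dtype == object:
            cleaned[col] = cleaned[col].str.strip().str.title()
        if col.endswith('_date'):
            nan_before = cleaned[col].isna().sum()
            cleaned[col] = pd.to_datetime(cleaned[col], errors='coerce')
            report['values_coerced'] += int(cleaned[col].isna().sum() - nan_before)
            report['columns_converted'].append(col)
        elif col.endswith('_score') or col.endswith('_count'):
            nan_before = cleaned[col].isna().sum()
            cleaned[col] = pd.to_numeric(cleaned[col], errors='coerce')
            report['values_coerced'] += int(cleaned[col].isna().sum() - nan_before)
            report['columns_converted'].append(col)
    report['rows_after'] = len(cleaned)
    return cleaned, report
```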
Guidance
This is a synthesis exercise. Use what you've learned about string methods, `pd.to_datetime`, `pd.to_numeric`, and cleaning logs to build a reusable function. The report dictionary is a simplified cleaning log.
Part M: Metacognitive Reflection ⭐
Exercise 8.27 — Your cleaning instincts
Think about a time you worked with data (even just a spreadsheet or a list of information) and encountered something messy. How did you handle it at the time? Now that you've read this chapter, what would you do differently?
Guidance
There's no wrong answer here. The goal is to connect the chapter's techniques to your own experience. Most people's initial instinct with messy data is to "just delete the bad rows" without thinking about who's being deleted or what assumptions they're making. If you can now articulate a more nuanced approach, that's growth.
Exercise 8.28 — What surprised you?
What was the most surprising or counterintuitive idea in this chapter? Why did it surprise you, and how does it change how you think about working with data?
Guidance
Common answers include: the idea that NaN != NaN, that cleaning is analysis, that dropping missing data can introduce bias, or that 80% of data science is cleaning. Reflect on which idea challenges your prior assumptions most.
Bonus Challenges ⭐⭐⭐⭐
For students who want to push further.
Exercise 8.29 — Fuzzy deduplication
Research the fuzzywuzzy or rapidfuzz Python library. Use it to find near-duplicate restaurant names in this list:
names = pd.Series([
"McDonald's", "McDonalds", "Mcdonald's", "mcdonalds",
"Burger King", "BURGER KING", "Burger king",
"Wendy's", "Wendys", "wendy's",
"Taco Bell", "TacoBell", "taco bell",
"Chick-fil-A", "Chickfila", "Chick Fil A"
])
Group the names into clusters of likely duplicates. What similarity threshold works best?
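If you want to prototype the clustering idea before installing anything, the standard library's difflib computes a comparable 0-100 similarity (slower than rapidfuzz, and the greedy clustering below is my own simplification):

```python
import pandas as pd
from difflib import SequenceMatcher

names = pd.Series(["McDonald's", "Mcdonalds", "Burger King",
                   "BURGER KING", "Wendy's", "Wendys"])

def similarity(a, b):
    # Case-insensitive similarity scaled to 0-100, roughly like fuzz.ratio
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(names, threshold=80):
    # Greedy clustering: attach each name to the first cluster
    # whose representative it matches above the threshold
    clusters = []
    for name in names:
        for c in clusters:
            if similarity(name, c[0]) >= threshold:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

for group in cluster(names):
    print(group)
```

Try varying `threshold` to see when distinct chains start collapsing into one cluster.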
Guidance
After `pip install rapidfuzz`, use `fuzz.ratio()` or `fuzz.token_sort_ratio()` to compare pairs. A threshold of 80-85% typically works well for name matching. `token_sort_ratio` is more robust because it ignores word order; lowercase the names before comparing so case differences don't matter either.
Exercise 8.30 — Automated data quality report
Write a comprehensive data_quality_report(df) function that produces a full diagnostic including:
- Missing values per column (count and percentage)
- Duplicate row count
- Data type summary
- Unique value counts per column
- Min/max/mean for numeric columns
- Value counts for categorical columns with fewer than 20 unique values
- Potential outliers using the IQR method
Format the output as a readable string or as a dictionary of DataFrames. Test it on a dataset of your choice.
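For the IQR bullet, a minimal sketch (the 1.5 multiplier is a convention, not a law):

```python
import pandas as pd

def iqr_outliers(series):
    # True where the value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

load_times = pd.Series([1.8, 2.1, 2.0, 1.9, 2.2, 45.0])  # made-up seconds
print(iqr_outliers(load_times).tolist())
```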