Chapter 12 Exercises: Cleaning and Preparing Data for Analysis

These exercises are organized into five tiers of increasing difficulty. All exercises use inline data — no external files required. Complete earlier tiers before attempting later ones.


Tier 1: Recall (Concept Checks)


Exercise 1.1 — Match the Method

For each task, identify the correct pandas method or function.

  1. Count missing values per column: _____
  2. Remove rows where all values are NaN: _____
  3. Fill missing values in a column with the column median: _____
  4. Remove exact duplicate rows, keeping the first occurrence: _____
  5. Convert a string column to float, turning errors into NaN: _____
  6. Convert a column of date strings to datetime objects: _____
  7. Remove leading and trailing whitespace from strings: _____
  8. Convert all strings in a column to lowercase: _____
  9. Replace old values with new ones using a dictionary: _____
  10. Apply a dictionary to recode values (unmapped → NaN): _____

Answers: .isna().sum(), dropna(how='all'), .fillna(df['col'].median()), .drop_duplicates(keep='first'), pd.to_numeric(errors='coerce'), pd.to_datetime(), .str.strip(), .str.lower(), .replace(), .map()


Exercise 1.2 — True or False

  1. df.dropna() modifies the original DataFrame in place by default.
  2. pd.to_numeric("$125.99", errors="coerce") returns NaN.
  3. .fillna(method="ffill") fills each NaN with the next known value.
  4. .drop_duplicates() removes a row only if every column is identical to another row.
  5. .str.strip() removes whitespace from inside a string, not just the edges.
  6. Using .map() with a dictionary, any value not in the dictionary becomes NaN.
  7. .str.lower() raises an error if the column contains NaN values.
  8. df["price"].astype(float) will succeed even if values contain "$".
  9. The IQR method defines outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
  10. df.duplicated(keep=False) marks ALL occurrences of a duplicate as True, not just the second one.
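Several of these statements can be checked directly in a REPL. A quick sketch for statements 2 and 10:

```python
import pandas as pd

# Statement 2: the "$" blocks numeric conversion, and errors="coerce" yields NaN
result = pd.to_numeric(pd.Series(["$125.99"]), errors="coerce")
print(result.isna().iloc[0])   # True

# Statement 10: keep=False flags every occurrence of a duplicate
s = pd.Series([1, 2, 2, 3])
print(s.duplicated(keep=False).tolist())   # [False, True, True, False]
print(s.duplicated().tolist())             # [False, False, True, False]
```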

Exercise 1.3 — Fill in the Blank

import pandas as pd

data = {
    "name":     [" Alice", "Bob ", " Carol ", "Dave", None],
    "score":    [85, 92, None, 78, 88],
    "grade":    ["A", "A+", "B", "b", "A"],
    "amount":   ["$125.00", "$200", "150", "$175.50", None]
}
df = pd.DataFrame(data)

# 1. Count missing values per column
print(df._____.sum())

# 2. Strip whitespace from name column
df["name"] = df["name"].str._____()

# 3. Fill missing score with the column mean
df["score"] = df["score"]._____(df["score"].mean())

# 4. Convert grade to lowercase
df["grade"] = df["grade"].str._____()

# 5. Strip $ from amount, then convert to float
df["amount"] = df["amount"].str.replace("$", "", _____=False)
df["amount"] = pd.to_numeric(df["amount"], errors=_____)

# 6. Remove rows where name is still None/NaN
df = df.dropna(subset=[_____])

Exercise 1.4 — Identify the Issue

For each of the following snippets, explain what problem it will cause or what result it will produce:

Snippet A:

df["category"] = df["category"].str.lower()
df["category"] = df["category"].map({
    "office chairs": "Office Chairs",
    "desks": "Desks",
    "technology": "Technology",
    "storage": "Storage",
})
# Then later:
print(df["category"].value_counts())

What happens to any original value, like the typo "office chair" (singular, not plural), that was not in the map?

Snippet B:

df["price"].fillna(df["price"].mean())
print(df["price"].isna().sum())  # still shows the original count!

What is wrong with this code?

Snippet C:

df["phone"] = df["phone"].str.replace("(", "")
df["phone"] = df["phone"].str.replace(")", "")
df["phone"] = df["phone"].str.replace("-", "")

This code is correct but inefficient. Rewrite it more cleanly.


Tier 2: Application (Single-Concept Practice)


Exercise 2.1 — Duplicate Detection and Removal

import pandas as pd

orders = pd.DataFrame({
    "order_id":   ["O001", "O002", "O003", "O004", "O002", "O005", "O001", "O006"],
    "customer":   ["Alice", "Bob", "Carol", "Dave", "Bob", "Eve", "Alice", "Frank"],
    "product":    ["Chair", "Desk", "Monitor", "Keyboard", "Desk", "Mouse", "Chair", "Webcam"],
    "quantity":   [2, 1, 3, 5, 1, 2, 2, 1],
    "total":      [499.98, 899.99, 1049.97, 149.95, 899.99, 59.98, 499.98, 89.99]
})

Write code to:
  a) Count exact duplicate rows
  b) Display the duplicated rows (all occurrences)
  c) Check if order_id has any duplicate values (even if other columns differ)
  d) Remove exact duplicates, keeping the first occurrence
  e) Print a before/after row count
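A sketch of parts (a) and (b), rebuilding the orders frame above so the block is self-contained:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["O001", "O002", "O003", "O004", "O002", "O005", "O001", "O006"],
    "customer": ["Alice", "Bob", "Carol", "Dave", "Bob", "Eve", "Alice", "Frank"],
    "product":  ["Chair", "Desk", "Monitor", "Keyboard", "Desk", "Mouse", "Chair", "Webcam"],
    "quantity": [2, 1, 3, 5, 1, 2, 2, 1],
    "total":    [499.98, 899.99, 1049.97, 149.95, 899.99, 59.98, 499.98, 89.99],
})

# (a) rows that are exact copies of an earlier row
n_dupes = orders.duplicated().sum()
print(n_dupes)   # 2

# (b) every occurrence of a duplicated row, not just the repeats
all_dupes = orders[orders.duplicated(keep=False)]
print(len(all_dupes))   # 4
```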


Exercise 2.2 — Missing Value Strategies

import pandas as pd
import numpy as np

employee_reviews = pd.DataFrame({
    "employee_id": ["E01", "E02", "E03", "E04", "E05", "E06", "E07", "E08"],
    "name":        ["Alice", "Bob", "Carol", None, "Eve", "Frank", "Grace", "Henry"],
    "department":  ["Sales", "IT", None, "HR", "Sales", None, "IT", "Sales"],
    "rating":      [4.5, 3.8, 4.2, None, 4.8, 3.5, None, 4.1],
    "review_date": ["2023-12-01", "2023-12-03", "2023-12-05", "2023-12-07",
                    None, "2023-12-10", "2023-12-12", None],
    "bonus_pct":   [10.0, 7.5, 8.0, 5.0, 15.0, None, None, 9.5]
})

Write code to:
  a) Show a full missing value report (count and percentage for each column)
  b) Drop rows where name is missing
  c) Fill missing department with "Unknown"
  d) Fill missing rating with the column median
  e) Fill missing bonus_pct with 0 (employees with no bonus get 0%)
  f) Convert review_date to datetime (leave NaT for missing dates)
  g) After all fixes, confirm no numeric columns have missing values
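For part (a), one compact approach combines .isna().sum() with .isna().mean() (the mean of a boolean column is the fraction of True values). A sketch on a tiny stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice", None, "Carol"],
    "rating": [4.5, None, None],
})

# Count and percentage of missing values per column
report = pd.DataFrame({
    "missing": df.isna().sum(),
    "pct":     (df.isna().mean() * 100).round(1),
})
print(report)
```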


Exercise 2.3 — Type Conversion: The Dollar Sign Problem

import pandas as pd

financial_report = pd.DataFrame({
    "quarter":     ["Q1", "Q2", "Q3", "Q4"],
    "revenue":     ["$1,250,000", "$1,487,500", "$1,623,000", "$1,891,200"],
    "cost":        ["$875,000", "$1,023,400", "$1,134,100", "$1,320,800"],
    "employees":   ["125", "131", "138", "147"],
    "report_date": ["2023-03-31", "2023-06-30", "2023-09-30", "2023-12-31"]
})

Write code to:
  a) Show the dtypes before any cleaning
  b) Strip $ and , from revenue and cost, then convert to float
  c) Convert employees to integer
  d) Convert report_date to datetime
  e) Calculate and print profit = revenue - cost for each quarter
  f) Show the dtypes after all conversions
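Part (b) can be done in one pass with a regex character class rather than two separate replaces. A sketch on a stand-in series:

```python
import pandas as pd

revenue = pd.Series(["$1,250,000", "$1,487,500"])

# Remove "$" and "," together with a character class, then convert
clean = pd.to_numeric(revenue.str.replace(r"[$,]", "", regex=True))
print(clean.tolist())   # [1250000, 1487500]
```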


Exercise 2.4 — String Normalization

import pandas as pd

contacts = pd.DataFrame({
    "name":     ["  Alice Smith  ", "BOB JONES", "carol  lee", "DAVID PARK",
                 "  Eve Torres", "Frank  Chen  "],
    "email":    ["alice@ACME.COM", "bob.jones@acme.com", "carol@ACME.COM",
                 "  david@acme.com  ", "EVE@ACME.COM", "frank@acme.com"],
    "phone":    ["(555) 123-4567", "555.987.6543", "555-456-7890",
                 "(555)234-5678", "5552345678", "555 876 5432"],
    "region":   ["northeast", "SOUTHWEST", "Midwest", "  Southeast  ",
                 "west", "NORTHEAST"]
})

Write code to:
  a) Strip whitespace from name and convert to title case
  b) Convert email to lowercase and strip whitespace
  c) Standardize region to title case after stripping
  d) Remove all non-digit characters from phone (result should be 10-digit strings)
  e) Verify all emails end with "@acme.com" using .str.endswith()
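For part (d), a single regex replace handles every phone format at once. A sketch on a stand-in series:

```python
import pandas as pd

phone = pd.Series(["(555) 123-4567", "555.987.6543", "555 876 5432"])

# \D matches any non-digit character; removing them leaves the 10 digits
digits = phone.str.replace(r"\D", "", regex=True)
print(digits.tolist())   # ['5551234567', '5559876543', '5558765432']

# sanity check: every result is exactly 10 characters long
print((digits.str.len() == 10).all())   # True
```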


Exercise 2.5 — Category Standardization with .map()

import pandas as pd

tickets = pd.DataFrame({
    "ticket_id":   range(1, 13),
    "priority":    ["HIGH", "Low", "medium", "CRITICAL", "High", "low",
                    "MEDIUM", "critical", "high", "Low", "Medium", "CRITICAL"],
    "status":      ["Open", "open", "OPEN", "Closed", "closed", "CLOSED",
                    "In Review", "in review", "IN REVIEW", "Open",
                    "Pending", "pending"],
    "category":    ["Bug", "Feature", "bug", "BUG", "FEATURE", "feature",
                    "Bug", "Docs", "Feature", "bug", "DOCS", "docs"]
})

Write code to:
  a) Define mapping dictionaries for priority, status, and category
  b) Apply .map() to standardize each column
  c) Verify no NaN values were introduced (all variants were in your mapping)
  d) Print a final value_counts() for each column
  e) (Challenge) What happens if you accidentally miss a variant? Demonstrate and fix it.
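One common pattern for parts (b), (c), and (e): lowercase first so the mapping stays small, then let .map()'s strict NaN behavior expose any missed variant. A sketch with a hypothetical typo "hgih" injected:

```python
import pandas as pd

priority = pd.Series(["HIGH", "Low", "medium", "CRITICAL", "hgih"])  # last one is a typo

priority_map = {"low": "Low", "medium": "Medium", "high": "High", "critical": "Critical"}
standardized = priority.str.lower().map(priority_map)

# .map() turns unmapped values into NaN, which makes the typo easy to find
print(standardized.isna().sum())                 # 1
print(priority[standardized.isna()].tolist())    # ['hgih']
```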


Tier 3: Synthesis (Multi-Concept Problems)


Exercise 3.1 — The Complete Cleaning Pipeline

You receive this messy sales dataset:

import pandas as pd
from io import StringIO

messy_data = """sale_id,rep_name,region,sale_date,product,qty,price,status
S001, Alice Johnson ,Northeast,2023-01-05,Widget Pro,3,$29.99,completed
S002,BOB SMITH,Southwest,01/10/2023,Gadget Plus,1,$149.99,Completed
S003,alice johnson,Northeast,2023-01-15,Widget Pro,3,$29.99,completed
S004,Carol Davis,Midwest,,Widget Basic,5,24.99,COMPLETED
S005,Dave Wilson,Southeast,2023-02-01,Gadget Plus,2,$149.99,In Progress
S006,Eve Martinez,West,February 5 2023,Widget Pro,1,$29.99,completed
S001, Alice Johnson ,Northeast,2023-01-05,Widget Pro,3,$29.99,completed
S007,Dave Wilson,southeast,2023-02-10,Widget Basic,10,24.99,in progress
S008,Carol Davis,MIDWEST,2023-02-15,Gadget Plus,1,$149.99,Cancelled
S009,Eve Martinez,West,2023-02-20,WIDGET PRO,4,$29.99,Completed
"""

df = pd.read_csv(StringIO(messy_data))

Build a complete cleaning pipeline that:
  a) Detects and removes exact duplicates (with logging)
  b) Strips whitespace from all string columns
  c) Standardizes rep_name to title case
  d) Standardizes region to title case and validates that only 5 expected regions exist
  e) Converts sale_date to datetime (handle mixed formats)
  f) Strips $ from price and converts to float
  g) Standardizes status to 3 canonical values: "Completed", "In Progress", "Cancelled"
  h) Runs a final validation suite (at least 6 checks)
  i) Prints a cleaning log with row-level impact for each step
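Part (e) is the trickiest step: sale_date mixes ISO dates, US-style dates, and a spelled-out month. One approach, assuming pandas 2.0 or newer, where format="mixed" infers a format for each entry independently:

```python
import pandas as pd

dates = pd.Series(["2023-01-05", "01/10/2023", "February 5 2023", None])

# format="mixed" (pandas >= 2.0) parses each entry on its own;
# errors="coerce" turns anything unparseable (including None) into NaT
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed)
```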


Exercise 3.2 — Outlier Investigation

import pandas as pd
import numpy as np

np.random.seed(42)
n = 200

sales = pd.DataFrame({
    "salesperson": np.random.choice(["Alice", "Bob", "Carol", "Dave", "Eve"], n),
    "units_sold":  np.random.normal(5, 2, n).round().clip(1).astype(int),
    "unit_price":  np.random.choice([29.99, 49.99, 99.99, 199.99, 499.99], n),
})

# Inject outliers
sales.loc[17, "units_sold"] = 500    # likely a data entry error
sales.loc[83, "units_sold"] = 250    # possibly legitimate bulk order
sales.loc[141, "unit_price"] = 9999.99  # suspicious

sales["total_revenue"] = sales["units_sold"] * sales["unit_price"]

Write code to:
  a) Apply the IQR method to detect outliers in units_sold
  b) Apply the IQR method to detect outliers in unit_price
  c) For each outlier, print the row with full context
  d) Calculate z-scores for units_sold and identify values with |z| > 3
  e) Decide which outliers to remove, which to cap (Winsorize), and which to keep — write the code for each decision
  f) Show total revenue before and after your outlier handling
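Parts (a) and (b) both benefit from a reusable helper. A sketch of the IQR mask, with k=1.5 as the conventional multiplier:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# quick check on a toy series with one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 100])
print(s[iqr_outliers(s)].tolist())   # [100]
```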


Exercise 3.3 — Before/After Analysis Report

Using the cleaned data from Exercise 3.1:

  a) Generate a revenue summary grouped by region (using dirty data)
  b) Generate the same summary using clean data
  c) Show the difference in revenue figures between dirty and clean
  d) Write a 3-paragraph "data quality impact statement" explaining what changed and why


Tier 4: Analysis (Open-Ended Problems)


Exercise 4.1 — Designing a Cleaning Pipeline Function

Write a reusable function clean_business_df() that:

Parameters:
  - df: the DataFrame to clean
  - string_cols: list of string columns to strip and apply title case
  - numeric_cols_with_dollar: list of columns with $ signs to strip and convert
  - date_cols: list of date columns to convert to datetime
  - category_maps: dictionary of {column_name: mapping_dict} for standardization
  - drop_duplicate_subset: optional list of columns to use for duplicate detection

Behavior:
  - Applies each cleaning operation in the correct order
  - Logs each step with row counts and change summaries
  - Returns the cleaned DataFrame and the cleaning log as a list of strings

Test it: Apply your function to any dataset from this chapter.
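A minimal skeleton under the spec above, to start from. Note the ordering choice: text is normalized before deduplication so that " alice " and "alice" collapse into one row. The logging here is deliberately simple; extend it to meet the full requirements:

```python
import pandas as pd

def clean_business_df(df, string_cols=(), numeric_cols_with_dollar=(),
                      date_cols=(), category_maps=None,
                      drop_duplicate_subset=None):
    log = []
    df = df.copy()

    for col in string_cols:
        df[col] = df[col].str.strip().str.title()
        log.append(f"{col}: stripped whitespace, applied title case")

    for col in numeric_cols_with_dollar:
        df[col] = pd.to_numeric(df[col].str.replace(r"[$,]", "", regex=True),
                                errors="coerce")
        log.append(f"{col}: stripped $/, and converted to numeric")

    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors="coerce")
        log.append(f"{col}: converted to datetime")

    for col, mapping in (category_maps or {}).items():
        df[col] = df[col].str.lower().map(mapping)
        log.append(f"{col}: recoded via mapping")

    before = len(df)
    df = df.drop_duplicates(subset=drop_duplicate_subset)
    log.append(f"drop_duplicates: removed {before - len(df)} rows")

    return df, log

# smoke test on a tiny frame
raw = pd.DataFrame({"name": ["  alice  ", "BOB", "alice"],
                    "amount": ["$1,000", "$250", "$1,000"]})
clean, log = clean_business_df(raw, string_cols=["name"],
                               numeric_cols_with_dollar=["amount"])
print(clean)
```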


Exercise 4.2 — The Reproducibility Challenge

Take any messy dataset from this chapter and write a cleaning script that:
  a) Is fully reproducible (running it twice produces the same result)
  b) Handles edge cases (empty DataFrame, columns not present, all values missing)
  c) Can be run on a dataset with different column names if a column mapping is provided
  d) Saves both the original and cleaned data with timestamp-based filenames


Tier 5: Extension (Beyond the Chapter)


Exercise 5.1 — String Cleaning with Regular Expressions

The .str.replace() method supports regular expressions. Using the following data:

import pandas as pd

df = pd.DataFrame({
    "phone": ["(555) 123-4567", "555.987.6543", "+1 555 456 7890",
              "555-234-5678", "tel: 555-876-5432", "1-800-555-0199"],
    "zip":   ["10001", "10001-1234", "SW1A 1AA", "90210",
              "10001-5678", "N/A"],
    "price": ["USD 49.99", "€29.99", "£19.99", "49.99", "$59.99", "AUD 39.99"],
})

Write code to:
  a) Extract just the 10 digits from phone (ignore country code and formatting)
  b) Extract just the 5-digit US ZIP code from zip (ignore extensions, mark non-US as NaN)
  c) Extract just the numeric price from price (strip currency symbols and codes)
  d) Use .str.fullmatch() with a regex to validate that every cleaned phone number is exactly 10 digits
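For part (b), .str.extract() with a capture group is the cleanest tool: anything that fails to match becomes NaN automatically. A sketch:

```python
import pandas as pd

zips = pd.Series(["10001", "10001-1234", "SW1A 1AA", "N/A"])

# capture a 5-digit ZIP at the start of the string; non-matches become NaN
zip5 = zips.str.extract(r"^(\d{5})", expand=False)
print(zip5)
```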


Exercise 5.2 — Time-Series Data Cleaning

import pandas as pd

# Daily revenue data with gaps
daily_data = {
    "date": pd.date_range("2023-01-02", "2023-01-31", freq="B"),  # Business days
    "revenue": [12500, 13200, None, 14100, 13800,
                None, None, 15200, 14700, 13900,
                16100, None, 15800, 14900, 15100,
                15600, None, 16200, 15900, None,
                17200, 16800]
}
df_daily = pd.DataFrame(daily_data)

Write code to:
  a) Identify which days have missing revenue
  b) Apply forward fill and show the result
  c) Apply backward fill and show the result
  d) Apply linear interpolation and show the result
  e) For each fill strategy, calculate the total monthly revenue — do they differ?
  f) Explain which strategy is most appropriate for revenue data and why
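The three fill strategies in parts (b) through (d), shown on a 4-point toy series to make the differences concrete:

```python
import pandas as pd

revenue = pd.Series([100.0, None, None, 130.0])

print(revenue.ffill().tolist())         # [100.0, 100.0, 100.0, 130.0]
print(revenue.bfill().tolist())         # [100.0, 130.0, 130.0, 130.0]
print(revenue.interpolate().tolist())   # [100.0, 110.0, 120.0, 130.0]
```

Even on this tiny example the totals differ (430 vs 490 vs 460), which is exactly the effect part (e) asks you to quantify on the full month.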


Exercise 5.3 — Multi-Dataset Cleaning and Merging

You have three messy datasets:

import pandas as pd

orders = pd.DataFrame({
    "order_id":   ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-002"],
    "customer_id": ["C101", "C102", "C101", "C103", "C102"],
    "amount":     ["$1,250.00", "$875.50", "$432.00", "$2,100.00", "$875.50"],
    "order_date": ["2023-01-05", "Jan 7, 2023", "2023-01-09", "01/12/2023", "Jan 7, 2023"]
})

customers = pd.DataFrame({
    "id":     ["C101", "C102", "C103", "C104"],
    "name":   ["  Acme Corp  ", "BRIGHTFIELD LLC", "Nova Dynamics", "CoreTech Ltd"],
    "tier":   ["Gold", "silver", "Gold", "BRONZE"],
    "region": ["Northeast", "southwest", "Midwest", None]
})

products_ordered = pd.DataFrame({
    "order":    ["ORD-001", "ORD-002", "ORD-003", "ORD-004"],
    "product":  ["widget", "GADGET", "Widget", "gadget"],
    "qty":      ["3", "2", "1", "5"]
})

Write a complete pipeline that:
  a) Cleans each DataFrame independently
  b) Merges all three into a single analysis-ready DataFrame
  c) Validates the merged result (check for unexpected NaN after merge)
  d) Produces a summary: total revenue by customer tier and product
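For part (c), merging with indicator=True makes unmatched rows easy to isolate. A sketch with a hypothetical unmatched customer C999 (not in the exercise data):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["ORD-001", "ORD-002"],
                       "customer_id": ["C101", "C999"]})   # C999 has no match
customers = pd.DataFrame({"id": ["C101", "C102"],
                          "tier": ["Gold", "Silver"]})

merged = orders.merge(customers, left_on="customer_id", right_on="id",
                      how="left", indicator=True)

# rows that failed to match appear as "left_only" in the _merge column
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["customer_id"].tolist())   # ['C999']
```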


Answer Hints (Selected)

Exercise 1.2 Answers:
  1. False — dropna() returns a new DataFrame; the original is unchanged
  2. True — the $ prevents numeric conversion; errors='coerce' returns NaN
  3. False — ffill fills with the PREVIOUS known value, not the next
  4. True
  5. False — .str.strip() only removes from the edges (beginning and end)
  6. True — this is .map() behavior; use .replace() if you want non-mapped values unchanged
  7. False — .str.lower() on a column with NaN simply leaves NaN as NaN
  8. False — .astype(float) will raise ValueError on "$125.99"
  9. True
  10. True — keep=False marks all occurrences; keep='first' only marks subsequent ones

Exercise 1.3 Answers: .isna(), strip, fillna, lower, regex, 'coerce', "name"

Exercise 1.4 Answers:

Snippet A: Any value not in the map (like "office chair" with no "s") becomes NaN. This is .map()'s strict behavior — it is a feature, not a bug, because it reveals gaps in your mapping.

Snippet B: .fillna() returns a NEW Series — it does not modify df["price"] in place. You must assign the result back: df["price"] = df["price"].fillna(df["price"].mean()).

Snippet C: More efficient rewrite:

df["phone"] = df["phone"].str.replace(r"[()\-]", "", regex=True)
# Or, to remove every non-digit character:
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)