Chapter 12 Exercises: Cleaning and Preparing Data for Analysis
These exercises are organized into five tiers of increasing difficulty. All exercises use inline data — no external files required. Complete earlier tiers before attempting later ones.
Tier 1: Recall (Concept Checks)
Exercise 1.1 — Match the Method
For each task, identify the correct pandas method or function.
| Task | Method/Function |
|---|---|
| Count missing values per column | _____ |
| Remove rows where all values are NaN | _____ |
| Fill missing values in a column with the column median | _____ |
| Remove exact duplicate rows, keeping first occurrence | _____ |
| Convert a string column to float, turning errors into NaN | _____ |
| Convert a column of date strings to datetime objects | _____ |
| Remove leading and trailing whitespace from strings | _____ |
| Convert all strings in a column to lowercase | _____ |
| Replace a dictionary of old values with new values | _____ |
| Apply a dictionary to recode values (unmapped → NaN) | _____ |
Answers: .isna().sum(), dropna(how='all'), .fillna(df['col'].median()), .drop_duplicates(keep='first'), pd.to_numeric(errors='coerce'), pd.to_datetime(), .str.strip(), .str.lower(), .replace(), .map()
Exercise 1.2 — True or False
1. `df.dropna()` modifies the original DataFrame in place by default.
2. `pd.to_numeric("$125.99", errors="coerce")` returns `NaN`.
3. `.fillna(method="ffill")` fills each NaN with the next known value.
4. `.drop_duplicates()` removes a row only if every column is identical to another row.
5. `.str.strip()` removes whitespace from inside a string, not just the edges.
6. Using `.map()` with a dictionary, any value not in the dictionary becomes `NaN`.
7. `.str.lower()` raises an error if the column contains NaN values.
8. `df["price"].astype(float)` will succeed even if values contain `"$"`.
9. The IQR method defines outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
10. `df.duplicated(keep=False)` marks ALL occurrences of a duplicate as True, not just the second one.
Exercise 1.3 — Fill in the Blank
import pandas as pd
data = {
"name": [" Alice", "Bob ", " Carol ", "Dave", None],
"score": [85, 92, None, 78, 88],
"grade": ["A", "A+", "B", "b", "A"],
"amount": ["$125.00", "$200", "150", "$175.50", None]
}
df = pd.DataFrame(data)
# 1. Count missing values per column
print(df._____.sum())
# 2. Strip whitespace from name column
df["name"] = df["name"].str._____()
# 3. Fill missing score with the column mean
df["score"] = df["score"]._____(df["score"].mean())
# 4. Convert grade to lowercase
df["grade"] = df["grade"].str._____()
# 5. Strip $ from amount, then convert to float
df["amount"] = df["amount"].str.replace("$", "", _____=False)
df["amount"] = pd.to_numeric(df["amount"], errors=_____)
# 6. Remove rows where name is still None/NaN
df = df.dropna(subset=[_____])
Exercise 1.4 — Identify the Issue
For each of the following snippets, explain what problem it will cause or what result it will produce:
Snippet A:
df["category"] = df["category"].str.lower()
df["category"] = df["category"].map({
"office chairs": "Office Chairs",
"desks": "Desks",
"technology": "Technology",
"storage": "Storage",
})
# Then later:
print(df["category"].value_counts())
What happens to any original value like "office chair" (singular, typo) that was not in the map?
Snippet B:
df["price"].fillna(df["price"].mean())
print(df["price"].isna().sum()) # still shows the original count!
What is wrong with this code?
Snippet C:
df["phone"] = df["phone"].str.replace("(", "")
df["phone"] = df["phone"].str.replace(")", "")
df["phone"] = df["phone"].str.replace("-", "")
This code is correct but inefficient. Rewrite it more cleanly.
Tier 2: Application (Single-Concept Practice)
Exercise 2.1 — Duplicate Detection and Removal
import pandas as pd
orders = pd.DataFrame({
"order_id": ["O001", "O002", "O003", "O004", "O002", "O005", "O001", "O006"],
"customer": ["Alice", "Bob", "Carol", "Dave", "Bob", "Eve", "Alice", "Frank"],
"product": ["Chair", "Desk", "Monitor", "Keyboard", "Desk", "Mouse", "Chair", "Webcam"],
"quantity": [2, 1, 3, 5, 1, 2, 2, 1],
"total": [499.98, 899.99, 1049.97, 149.95, 899.99, 59.98, 499.98, 89.99]
})
Write code to:
a) Count exact duplicate rows
b) Display the duplicated rows (all occurrences)
c) Check if order_id has any duplicate values (even if other columns differ)
d) Remove exact duplicates, keeping the first occurrence
e) Print a before/after row count
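Hint sketch: the calls involved, shown on a tiny toy frame rather than the exercise data, so the exercise stays yours to solve.

```python
import pandas as pd

# Toy frame with one exact duplicate (rows 0 and 2)
toy = pd.DataFrame({"id": ["A", "B", "A"], "val": [1, 2, 1]})

exact_dupes = toy.duplicated().sum()               # repeats after the first occurrence
all_occurrences = toy[toy.duplicated(keep=False)]  # every row involved in a duplicate
id_has_dupes = toy["id"].duplicated().any()        # duplicate check on a single column
deduped = toy.drop_duplicates(keep="first")

print(f"before: {len(toy)} rows, after: {len(deduped)} rows")
```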
Exercise 2.2 — Missing Value Strategies
import pandas as pd
import numpy as np
employee_reviews = pd.DataFrame({
"employee_id": ["E01", "E02", "E03", "E04", "E05", "E06", "E07", "E08"],
"name": ["Alice", "Bob", "Carol", None, "Eve", "Frank", "Grace", "Henry"],
"department": ["Sales", "IT", None, "HR", "Sales", None, "IT", "Sales"],
"rating": [4.5, 3.8, 4.2, None, 4.8, 3.5, None, 4.1],
"review_date": ["2023-12-01", "2023-12-03", "2023-12-05", "2023-12-07",
None, "2023-12-10", "2023-12-12", None],
"bonus_pct": [10.0, 7.5, 8.0, 5.0, 15.0, None, None, 9.5]
})
Write code to:
a) Show a full missing value report (count and percentage for each column)
b) Drop rows where name is missing
c) Fill missing department with "Unknown"
d) Fill missing rating with the column median
e) Fill missing bonus_pct with 0 (employees with no bonus get 0%)
f) Convert review_date to datetime (leave NaT for missing dates)
g) After all fixes, confirm no numeric columns have missing values
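Hint sketch for parts (a) and (d), on a toy frame with illustrative column names. Note that `.fillna()` returns a new Series, so the result must be assigned back.

```python
import pandas as pd

toy = pd.DataFrame({"rating": [4.0, None, 5.0], "dept": ["IT", None, None]})

# (a) count and percentage of missing values per column, as one small frame
report = pd.DataFrame({
    "missing": toy.isna().sum(),
    "pct": (toy.isna().mean() * 100).round(1),
})
print(report)

# (d) median fill -- assign the result back
toy["rating"] = toy["rating"].fillna(toy["rating"].median())
```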
Exercise 2.3 — Type Conversion: The Dollar Sign Problem
import pandas as pd
financial_report = pd.DataFrame({
"quarter": ["Q1", "Q2", "Q3", "Q4"],
"revenue": ["$1,250,000", "$1,487,500", "$1,623,000", "$1,891,200"],
"cost": ["$875,000", "$1,023,400", "$1,134,100", "$1,320,800"],
"employees": ["125", "131", "138", "147"],
"report_date": ["2023-03-31", "2023-06-30", "2023-09-30", "2023-12-31"]
})
Write code to:
a) Show the dtypes before any cleaning
b) Strip $ and , from revenue and cost, then convert to float
c) Convert employees to integer
d) Convert report_date to datetime
e) Calculate and print profit = revenue - cost for each quarter
f) Show the dtypes after all conversions
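The core pattern for part (b), sketched on a two-value Series. Both replacements are literal characters, so `regex=False` keeps them from being treated as patterns.

```python
import pandas as pd

raw = pd.Series(["$1,250,000", "$875,000"])
clean = (raw.str.replace("$", "", regex=False)
            .str.replace(",", "", regex=False))
clean = pd.to_numeric(clean, errors="coerce")
print(clean.tolist())  # [1250000.0, 875000.0]
```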
Exercise 2.4 — String Normalization
import pandas as pd
contacts = pd.DataFrame({
"name": [" Alice Smith ", "BOB JONES", "carol lee", "DAVID PARK",
" Eve Torres", "Frank Chen "],
"email": ["alice@ACME.COM", "bob.jones@acme.com", "carol@ACME.COM",
" david@acme.com ", "EVE@ACME.COM", "frank@acme.com"],
"phone": ["(555) 123-4567", "555.987.6543", "555-456-7890",
"(555)234-5678", "5552345678", "555 876 5432"],
"region": ["northeast", "SOUTHWEST", "Midwest", " Southeast ",
"west", "NORTHEAST"]
})
Write code to:
a) Strip whitespace from name and convert to title case
b) Convert email to lowercase and strip whitespace
c) Standardize region to title case after stripping
d) Remove all non-digit characters from phone (result should be 10-digit strings)
e) Verify all emails end with "@acme.com" using .str.endswith()
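Hint sketch for parts (a) and (d): chaining `.str` accessors handles whitespace and case, and a single regex replace handles the phone formats.

```python
import pandas as pd

names = pd.Series([" Alice Smith ", "BOB JONES"])
tidy = names.str.strip().str.title()

phones = pd.Series(["(555) 123-4567", "555.987.6543"])
digits = phones.str.replace(r"\D", "", regex=True)  # \D matches any non-digit
```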
Exercise 2.5 — Category Standardization with .map()
import pandas as pd
tickets = pd.DataFrame({
"ticket_id": range(1, 13),
"priority": ["HIGH", "Low", "medium", "CRITICAL", "High", "low",
"MEDIUM", "critical", "high", "Low", "Medium", "CRITICAL"],
"status": ["Open", "open", "OPEN", "Closed", "closed", "CLOSED",
"In Review", "in review", "IN REVIEW", "Open",
"Pending", "pending"],
"category": ["Bug", "Feature", "bug", "BUG", "FEATURE", "feature",
"Bug", "Docs", "Feature", "bug", "DOCS", "docs"]
})
Write code to:
a) Define mapping dictionaries for priority, status, and category
b) Apply .map() to standardize each column
c) Verify no NaN values were introduced (all variants were in your mapping)
d) Print a final value_counts() for each column
e) (Challenge) What happens if you accidentally miss a variant? Demonstrate and fix it.
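One approach worth considering before you write three large dictionaries: lowercase first, so a single mapping entry covers every casing variant.

```python
import pandas as pd

s = pd.Series(["HIGH", "Low", "medium"])
mapping = {"high": "High", "low": "Low", "medium": "Medium"}
out = s.str.lower().map(mapping)
assert out.isna().sum() == 0  # any NaN here means a variant slipped through
```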
Tier 3: Synthesis (Multi-Concept Problems)
Exercise 3.1 — The Complete Cleaning Pipeline
You receive this messy sales dataset:
import pandas as pd
from io import StringIO
messy_data = """sale_id,rep_name,region,sale_date,product,qty,price,status
S001, Alice Johnson ,Northeast,2023-01-05,Widget Pro,3,$29.99,completed
S002,BOB SMITH,Southwest,01/10/2023,Gadget Plus,1,$149.99,Completed
S003,alice johnson,Northeast,2023-01-15,Widget Pro,3,$29.99,completed
S004,Carol Davis,Midwest,,Widget Basic,5,24.99,COMPLETED
S005,Dave Wilson,Southeast,2023-02-01,Gadget Plus,2,$149.99,In Progress
S006,Eve Martinez,West,February 5 2023,Widget Pro,1,$29.99,completed
S001, Alice Johnson ,Northeast,2023-01-05,Widget Pro,3,$29.99,completed
S007,Dave Wilson,southeast,2023-02-10,Widget Basic,10,24.99,in progress
S008,Carol Davis,MIDWEST,2023-02-15,Gadget Plus,1,$149.99,Cancelled
S009,Eve Martinez,West,2023-02-20,WIDGET PRO,4,$29.99,Completed
"""
df = pd.read_csv(StringIO(messy_data))
Build a complete cleaning pipeline that:
a) Detects and removes exact duplicates (with logging)
b) Strips whitespace from all string columns
c) Standardizes rep_name to title case
d) Standardizes region to title case and validates only 5 expected regions exist
e) Converts sale_date to datetime (handle mixed formats)
f) Strips $ from price and converts to float
g) Standardizes status to 3 canonical values: "Completed", "In Progress", "Cancelled"
h) Runs a final validation suite (at least 6 checks)
i) Prints a cleaning log with row-level impact for each step
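Hint for step (e): the dataset mixes three date formats in one column. One possible approach, assuming pandas 2.0 or later (older versions would need to parse each format separately and combine the results):

```python
import pandas as pd

dates = pd.Series(["2023-01-05", "01/10/2023", "February 5 2023", None])
# format="mixed" infers the format for each element individually
parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
print(parsed)
```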
Exercise 3.2 — Outlier Investigation
import pandas as pd
import numpy as np
np.random.seed(42)
n = 200
sales = pd.DataFrame({
"salesperson": np.random.choice(["Alice", "Bob", "Carol", "Dave", "Eve"], n),
"units_sold": np.random.normal(5, 2, n).round().clip(1).astype(int),
"unit_price": np.random.choice([29.99, 49.99, 99.99, 199.99, 499.99], n),
})
# Inject outliers
sales.loc[17, "units_sold"] = 500 # likely a data entry error
sales.loc[83, "units_sold"] = 250 # possibly legitimate bulk order
sales.loc[141, "unit_price"] = 9999.99 # suspicious
sales["total_revenue"] = sales["units_sold"] * sales["unit_price"]
Write code to:
a) Apply the IQR method to detect outliers in units_sold
b) Apply the IQR method to detect outliers in unit_price
c) For each outlier, print the row with full context
d) Calculate z-scores for units_sold and identify values with |z| > 3
e) Decide which outliers to remove, which to cap (Winsorize), and which to keep — write the code for each decision
f) Show total revenue before and after your outlier handling
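The IQR fences from the chapter, sketched on a six-value Series so the arithmetic is easy to check by hand:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [100]
```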
Exercise 3.3 — Before/After Analysis Report
Using the cleaned data from Exercise 3.1:
a) Generate a revenue summary grouped by region (using dirty data)
b) Generate the same summary using clean data
c) Show the difference in revenue figures between dirty and clean
d) Write a 3-paragraph "data quality impact statement" explaining what changed and why
Tier 4: Analysis (Open-Ended Problems)
Exercise 4.1 — Designing a Cleaning Pipeline Function
Write a reusable function clean_business_df() that:
Parameters:
- df: The DataFrame to clean
- string_cols: List of string columns to strip and apply title case
- numeric_cols_with_dollar: List of columns with $ signs to strip and convert
- date_cols: List of date columns to convert to datetime
- category_maps: Dictionary of {column_name: mapping_dict} for standardization
- drop_duplicate_subset: Optional list of columns to use for duplicate detection
Behavior:
- Applies each cleaning operation in the correct order
- Logs each step with row counts and change summaries
- Returns the cleaned DataFrame and the cleaning log as a list of strings
Test it: Apply your function to any dataset from this chapter.
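One possible skeleton for the signature above. The step ordering and title-casing choices here are design decisions, not the only correct answer — for instance, you may prefer to strip strings before deduplicating so that whitespace-only differences collapse.

```python
import pandas as pd

def clean_business_df(df, string_cols=None, numeric_cols_with_dollar=None,
                      date_cols=None, category_maps=None,
                      drop_duplicate_subset=None):
    """Sketch of the requested pipeline -- extend the logging as needed."""
    log = []
    df = df.copy()
    before = len(df)
    df = df.drop_duplicates(subset=drop_duplicate_subset, keep="first")
    log.append(f"duplicates: removed {before - len(df)} rows")
    for col in (string_cols or []):
        df[col] = df[col].str.strip().str.title()
    for col in (numeric_cols_with_dollar or []):
        df[col] = pd.to_numeric(
            df[col].str.replace(r"[$,]", "", regex=True), errors="coerce")
    for col in (date_cols or []):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col, mapping in (category_maps or {}).items():
        df[col] = df[col].map(mapping)
    return df, log
```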
Exercise 4.2 — The Reproducibility Challenge
Take any messy dataset from this chapter and write a cleaning script that:
a) Is fully reproducible (running it twice produces the same result)
b) Handles edge cases (empty DataFrame, columns not present, all values missing)
c) Can be run on a dataset with different column names if a column mapping is provided
d) Saves both the original and cleaned data with timestamp-based filenames
Tier 5: Extension (Beyond the Chapter)
Exercise 5.1 — String Cleaning with Regular Expressions
The .str.replace() method supports regular expressions. Using the following data:
import pandas as pd
df = pd.DataFrame({
"phone": ["(555) 123-4567", "555.987.6543", "+1 555 456 7890",
"555-234-5678", "tel: 555-876-5432", "1-800-555-0199"],
"zip": ["10001", "10001-1234", "SW1A 1AA", "90210",
"10001-5678", "N/A"],
"price": ["USD 49.99", "€29.99", "£19.99", "49.99", "$59.99", "AUD 39.99"],
})
Write code to:
a) Extract just the 10 digits from phone (ignore country code and formatting)
b) Extract just the 5-digit US ZIP code from zip (ignore extensions, mark non-US as NaN)
c) Extract just the numeric price from price (strip currency symbols and codes)
d) Use a regex to validate that every cleaned phone value is exactly 10 digits
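Hint sketch for two of the extraction patterns. Keeping the last 10 characters with `.str[-10:]` is one way to drop a leading country code after stripping non-digits.

```python
import pandas as pd

phones = pd.Series(["(555) 123-4567", "+1 555 456 7890"])
digits = phones.str.replace(r"\D", "", regex=True).str[-10:]

zips = pd.Series(["10001-1234", "SW1A 1AA"])
us_zip = zips.str.extract(r"^(\d{5})")[0]  # non-matching rows become NaN
```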
Exercise 5.2 — Time-Series Data Cleaning
import pandas as pd
# Daily revenue data with gaps
daily_data = {
"date": pd.date_range("2023-01-02", "2023-01-31", freq="B"), # Business days
"revenue": [12500, 13200, None, 14100, 13800,
None, None, 15200, 14700, 13900,
16100, None, 15800, 14900, 15100,
15600, None, 16200, 15900, None,
17200, 16800]
}
df_daily = pd.DataFrame(daily_data)
Write code to:
a) Identify which days have missing revenue
b) Apply forward fill and show the result
c) Apply backward fill and show the result
d) Apply linear interpolation and show the result
e) For each fill strategy, calculate the total monthly revenue — do they differ?
f) Explain which strategy is most appropriate for revenue data and why
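The three fill strategies, sketched on a four-value Series so the differences are visible at a glance:

```python
import pandas as pd

s = pd.Series([10.0, None, None, 16.0])
ff = s.ffill()        # carry the last known value forward: 10, 10, 10, 16
bf = s.bfill()        # pull the next known value backward: 10, 16, 16, 16
li = s.interpolate()  # straight line between known neighbours: 10, 12, 14, 16
print(ff.sum(), bf.sum(), li.sum())  # three different totals
```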
Exercise 5.3 — Multi-Dataset Cleaning and Merging
You have three messy datasets:
import pandas as pd
orders = pd.DataFrame({
"order_id": ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-002"],
"customer_id": ["C101", "C102", "C101", "C103", "C102"],
"amount": ["$1,250.00", "$875.50", "$432.00", "$2,100.00", "$875.50"],
"order_date": ["2023-01-05", "Jan 7, 2023", "2023-01-09", "01/12/2023", "Jan 7, 2023"]
})
customers = pd.DataFrame({
"id": ["C101", "C102", "C103", "C104"],
"name": [" Acme Corp ", "BRIGHTFIELD LLC", "Nova Dynamics", "CoreTech Ltd"],
"tier": ["Gold", "silver", "Gold", "BRONZE"],
"region": ["Northeast", "southwest", "Midwest", None]
})
products_ordered = pd.DataFrame({
"order": ["ORD-001", "ORD-002", "ORD-003", "ORD-004"],
"product": ["widget", "GADGET", "Widget", "gadget"],
"qty": ["3", "2", "1", "5"]
})
Write a complete pipeline that:
a) Cleans each DataFrame independently
b) Merges all three into a single analysis-ready DataFrame
c) Validates the merged result (check for unexpected NaN after merge)
d) Produces a summary: total revenue by customer tier and product
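For step (b), the main wrinkle is that the join columns have different names (`customer_id` vs `id`, `order_id` vs `order`). A minimal sketch of one such join, using `indicator=True` as the basis for step (c)'s validation:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["O1", "O2"], "customer_id": ["C1", "C9"]})
customers = pd.DataFrame({"id": ["C1"], "tier": ["Gold"]})

merged = orders.merge(customers, left_on="customer_id", right_on="id",
                      how="left", indicator=True)
# orders that found no customer show up as "left_only" in the _merge column
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} order(s) with no matching customer")
```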
Answer Hints (Selected)
Exercise 1.2 Answers:
1. False — dropna() returns a new DataFrame; original is unchanged
2. True — the $ prevents numeric conversion; errors='coerce' returns NaN
3. False — ffill fills with the PREVIOUS known value, not the next
4. True
5. False — .str.strip() only removes from the edges (beginning and end)
6. True — this is .map() behavior; use .replace() if you want non-mapped values unchanged
7. False — .str.lower() on a column with NaN simply leaves NaN as NaN
8. False — .astype(float) will raise ValueError on "$125.99"
9. True
10. True — keep=False marks all occurrences; keep='first' only marks subsequent ones
Exercise 1.3 Answers:
.isna(), strip, fillna, lower, regex, 'coerce', "name"
Exercise 1.4 Answers:
Snippet A: Any value not in the map (like "office chair" with no "s") becomes NaN. This is .map()'s strict behavior — it is a feature, not a bug, because it reveals gaps in your mapping.
Snippet B: .fillna() returns a NEW Series — it does not modify df["price"] in place. You must assign the result back: df["price"] = df["price"].fillna(df["price"].mean()).
Snippet C: More efficient rewrite:
df["phone"] = df["phone"].str.replace(r"[()\-]", "", regex=True)
# Or for any non-digit character:
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)