Case Study 12-1: Priya Cleans the Acme Sales Dataset
Characters: Priya Okonkwo (Senior Analyst), Marcus Webb (IT Manager), Sandra Chen (VP Sales)
Setting: Acme Corp, three days before the annual sales review meeting
Skill focus: Systematic data cleaning — deduplication, missing-value handling, string standardization, type correction
The Recap
In Chapter 11, Priya explored acme_sales_2023.csv and found four issues:
- Issue #1: 13 rows missing region (confirmed by Marcus as all belonging to "Northeast")
- Issue #2: 5 rows missing category
- Issue #3: category column has three capitalizations: "Office Chairs", "office chairs", "OFFICE CHAIRS"
- Issue #4: 8 exact duplicate rows
Now she needs to fix them before generating her quarterly and annual reports for Sandra's Wednesday review. Priya works through the issues methodically, following the data cleaning checklist from Chapter 12.
Setting Up
Priya creates a new Python script: clean_acme_sales.py. Her first rule: never modify the original file.
import pandas as pd
# Load the original file — we will NEVER save over this
df_raw = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--", ""]
)
# Work on a copy
df = df_raw.copy()
print(f"Original data: {len(df)} rows, {len(df.columns)} columns")
# Original data: 2500 rows, 9 columns
She verifies the issues are present in her copy before she starts:
print("Pre-cleaning state:")
print(f" Duplicates: {df.duplicated().sum()}")
print(f" Missing region: {df['region'].isna().sum()}")
print(f" Missing category: {df['category'].isna().sum()}")
print(f" Category variants: {sorted(df['category'].dropna().unique())}")
Output:
Pre-cleaning state:
Duplicates: 8
Missing region: 13
Missing category: 5
Category variants: ['Desks', 'OFFICE CHAIRS', 'Office Chairs',
'Storage', 'Technology', 'office chairs']
All four issues confirmed. Time to fix them.
Step 1: Remove Duplicate Rows
Priya tackles duplicates first. She wants to see exactly which rows are duplicated before deleting anything.
# Step 1: Remove exact duplicate rows
print("\n=== STEP 1: DUPLICATE REMOVAL ===")
# Identify and display the duplicates before removing them
duplicate_mask = df.duplicated(keep=False)
n_duplicates = df.duplicated().sum()
print(f"Exact duplicates found: {n_duplicates}")
print("\nDuplicate row pairs:")
print(
df[duplicate_mask]
.sort_values(["date", "salesperson", "product_sku"])
[["date", "salesperson", "product_sku", "units_sold", "total_revenue"]]
.to_string()
)
Output:
Exact duplicates found: 8
Duplicate row pairs:
date salesperson product_sku units_sold total_revenue
47 2023-01-15 Jordan Blake OFF-CH-0045 2 449.98
12 2023-01-15 Jordan Blake OFF-CH-0045 2 449.98
103 2023-02-08 Casey Rivera DESK-SIT-012 1 764.99
98 2023-02-08 Casey Rivera DESK-SIT-012 1 764.99
...
She can see the pattern: these are pairs where the same salesperson sold the same product on the same date for the same amount. These are almost certainly double-entries, not coincidences. She drops the duplicates and verifies:
# Drop duplicates — keep the first occurrence
df = df.drop_duplicates(keep="first")
print(f"\nAfter deduplication: {len(df)} rows")
print(f"Rows removed: {len(df_raw) - len(df)}")
# After deduplication: 2492 rows
# Rows removed: 8
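One caveat worth noting: df.duplicated() with no arguments only catches rows that match in every column. If Acme's export had included an auto-generated entry ID, each double-entry would have looked unique and slipped through. A sketch with hypothetical data showing the subset= escape hatch:

```python
import pandas as pd

# Hypothetical frame: the two rows differ only in an auto-generated entry_id,
# so a full-column duplicate check would miss the double-entry
df = pd.DataFrame({
    "entry_id": [101, 102],
    "date": ["2023-01-15", "2023-01-15"],
    "salesperson": ["Jordan Blake", "Jordan Blake"],
    "product_sku": ["OFF-CH-0045", "OFF-CH-0045"],
    "total_revenue": [449.98, 449.98],
})

print(df.duplicated().sum())  # 0 — entry_id makes every row look unique

# Restricting the check to the business columns reveals the double-entry
business_cols = ["date", "salesperson", "product_sku", "total_revenue"]
print(df.duplicated(subset=business_cols).sum())  # 1
```

Using subset= is a judgment call: it assumes two genuinely distinct sales never share the same date, rep, SKU, and amount, which is worth confirming with the data owner first.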
She adds a note to her cleaning log:
cleaning_log = []
cleaning_log.append(
"Step 1 — Removed 8 exact duplicate rows. "
"These appear to be double-entered transactions. "
"2500 → 2492 rows."
)
Step 2: Fill Missing Region Values
Marcus confirmed the 13 rows with missing region all belong to him — he is in the Northeast office and sometimes forgets to tag the region when entering bulk orders.
print("\n=== STEP 2: FILL MISSING REGION ===")
# Verify before filling
print(f"Missing region before: {df['region'].isna().sum()}")
print("\nRows with missing region:")
print(
df[df["region"].isna()]
[["date", "salesperson", "product_sku", "units_sold", "total_revenue"]]
.to_string()
)
Output:
Missing region before: 13
Rows with missing region:
date salesperson product_sku units_sold total_revenue
3 2023-01-05 Marcus Webb OFF-CH-0046 1 299.99
87 2023-01-23 Morgan Lee STOR-SHELF-011 2 179.98
...
Wait — one of the missing-region rows belongs to Morgan Lee, not Marcus Webb. Priya goes back to Marcus:
"Marcus — I see 13 rows missing a region. Most are yours, but row 87 (2023-01-23) has Morgan Lee and no region. Can you check?"
Marcus replies: "Morgan Lee transferred from Southwest mid-January. That Jan 23 transaction should be Southwest."
Priya now has specific fill values for each case:
# Fill Marcus Webb's missing regions with Northeast
marcus_mask = (df["region"].isna()) & (df["salesperson"] == "Marcus Webb")
df.loc[marcus_mask, "region"] = "Northeast"
# Fill Morgan Lee's specific missing transaction with Southwest
morgan_jan_mask = (
(df["region"].isna())
& (df["salesperson"] == "Morgan Lee")
& (df["date"] == "2023-01-23")
)
df.loc[morgan_jan_mask, "region"] = "Southwest"
# For any remaining missing (defensive — shouldn't be any)
remaining_missing = df["region"].isna().sum()
if remaining_missing > 0:
df["region"] = df["region"].fillna("Unknown")
print(f"Warning: {remaining_missing} region values set to 'Unknown'")
print(f"Missing region after fill: {df['region'].isna().sum()}")
print(f"\nRegion distribution after fill:")
print(df["region"].value_counts())
Output:
Missing region after fill: 0
Region distribution after fill:
Northeast 614
Southwest 522
Midwest 488
Southeast 467
West 401
cleaning_log.append(
"Step 2 — Filled 13 missing region values: "
"12 set to 'Northeast' (Marcus Webb, confirmed by IT), "
"1 set to 'Southwest' (Morgan Lee Jan transfer, confirmed by IT). "
"0 remaining missing."
)
Step 3: Standardize Category Capitalization
This is the most impactful fix. The "office chairs" and "OFFICE CHAIRS" entries mean that every group-by, pivot, and filter on category gives wrong results until this is fixed.
Priya's approach: strip whitespace first, then apply .str.title() to standardize to title case.
print("\n=== STEP 3: STANDARDIZE CATEGORY CAPITALIZATION ===")
print("Before:")
print(df["category"].value_counts(dropna=False))
Output:
Before:
Office Chairs 835
Desks 617
Technology 583
Storage 427
office chairs 7
OFFICE CHAIRS 3
NaN 5
Wait — the counts are slightly different from Chapter 11 because duplicates were removed in Step 1. After removing 8 duplicates (some of which were "Office Chairs" transactions), the counts have adjusted. This is expected.
# Step 3a: Strip whitespace from category
df["category"] = df["category"].str.strip()
# Step 3b: Apply title case to standardize capitalization
df["category"] = df["category"].str.title()
print("\nAfter .str.strip().str.title():")
print(df["category"].value_counts(dropna=False))
Output:
After .str.strip().str.title():
Office Chairs 845
Desks 617
Technology 583
Storage 427
NaN 5
845 rows now correctly labeled "Office Chairs" (835 + 7 + 3 = 845). The three separate variants have collapsed into one canonical label.
cleaning_log.append(
"Step 3 — Standardized category capitalization using .str.strip().str.title(). "
"Merged 'office chairs' (7) and 'OFFICE CHAIRS' (3) into 'Office Chairs'. "
"Category now has 4 clean values + NaN."
)
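.str.title() works here because all four category names are plain English words. It would mangle labels containing acronyms or mixed casing: "USB-C Hubs" becomes "Usb-C Hubs". When that risk exists, a safer pattern is to key every variant by its casefolded form and map to one hand-picked canonical label. A sketch with hypothetical labels:

```python
import pandas as pd

s = pd.Series(["USB-C Hubs", "usb-c hubs", "Office Chairs", "OFFICE CHAIRS"])

# .str.title() would produce "Usb-C Hubs" for the first two values.
# Instead: collapse case with casefold, then map to a chosen canonical form.
canonical = {"usb-c hubs": "USB-C Hubs", "office chairs": "Office Chairs"}
cleaned = s.str.strip().str.casefold().map(canonical)
print(cleaned.tolist())
# ['USB-C Hubs', 'USB-C Hubs', 'Office Chairs', 'Office Chairs']
```

The cost is maintaining the canonical dictionary by hand; the benefit is that the output casing is exactly what you chose, not what a generic string method guessed.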
Step 4: Handle Missing Category Values
Five rows are still missing a category. Priya looks up the product SKUs to see if she can infer the correct category.
print("\n=== STEP 4: HANDLE MISSING CATEGORY ===")
missing_category_rows = df[df["category"].isna()]
print(f"Rows with missing category: {len(missing_category_rows)}")
print("\nMissing category rows:")
print(
missing_category_rows[
["date", "salesperson", "product_sku", "category", "total_revenue"]
].to_string()
)
Output:
Rows with missing category: 5
Missing category rows:
date salesperson product_sku category total_revenue
247 2023-03-12 Jordan Blake DESK-SIT-012 NaN 764.99
681 2023-06-08 Casey Rivera TECH-MON-089 NaN 997.47
1102 2023-08-21 Alex Torres OFF-CH-0045 NaN 449.98
1834 2023-10-14 Morgan Lee STOR-CAB-023 NaN 467.96
2198 2023-11-30 David Park TECH-KEY-034 NaN 143.98
The product SKUs are informative:
- DESK-SIT-012 → "Desks" (prefix "DESK")
- TECH-MON-089 → "Technology" (prefix "TECH")
- OFF-CH-0045 → "Office Chairs" (prefix "OFF-CH")
- STOR-CAB-023 → "Storage" (prefix "STOR")
- TECH-KEY-034 → "Technology" (prefix "TECH")
Priya creates a lookup dictionary keyed by the full SKU; the prefix is what tells her which category each one belongs to:
# Map each full SKU to its category (implied by the SKU prefix)
sku_to_category = {
"DESK-SIT-012": "Desks",
"TECH-MON-089": "Technology",
"OFF-CH-0045": "Office Chairs",
"STOR-CAB-023": "Storage",
"TECH-KEY-034": "Technology",
}
# Fill the missing categories using the SKU lookup
for idx in missing_category_rows.index:
sku = df.loc[idx, "product_sku"]
if sku in sku_to_category:
df.loc[idx, "category"] = sku_to_category[sku]
print(f" Filled row {idx}: SKU {sku} → {sku_to_category[sku]}")
else:
print(f" Could not determine category for SKU {sku}")
print(f"\nMissing category after fill: {df['category'].isna().sum()}")
Output:
Filled row 247: SKU DESK-SIT-012 → Desks
Filled row 681: SKU TECH-MON-089 → Technology
Filled row 1102: SKU OFF-CH-0045 → Office Chairs
Filled row 1834: SKU STOR-CAB-023 → Storage
Filled row 2198: SKU TECH-KEY-034 → Technology
Missing category after fill: 0
All five resolved using SKU lookup. No "Unknown" categories needed.
print("\nFinal category distribution:")
print(df["category"].value_counts())
Output:
Final category distribution:
Office Chairs 846
Desks 618
Technology 585
Storage 428
Clean, canonical, no gaps.
cleaning_log.append(
"Step 4 — Filled 5 missing category values using product_sku lookup. "
"All 5 resolved from SKU prefix. 0 remaining missing."
)
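Hardcoding five SKUs was fine for five rows. If missing categories were common, deriving the category from the SKU's first hyphen-delimited token would scale without per-SKU maintenance. A sketch, assuming a hypothetical prefix-to-category map (and assuming no two categories share a first token — "OFF" here stands in for the two-part "OFF-CH" prefix):

```python
import pandas as pd

df = pd.DataFrame({
    "product_sku": ["DESK-SIT-012", "TECH-MON-089", "OFF-CH-0045"],
    "category": [None, None, None],
})

# Hypothetical first-token -> category map
prefix_to_category = {
    "DESK": "Desks",
    "TECH": "Technology",
    "OFF": "Office Chairs",
    "STOR": "Storage",
}

# First hyphen-delimited token of each SKU, mapped to a category
prefix = df["product_sku"].str.split("-").str[0]
df["category"] = df["category"].fillna(prefix.map(prefix_to_category))
print(df["category"].tolist())  # ['Desks', 'Technology', 'Office Chairs']
```

Any SKU whose prefix is absent from the map stays NaN rather than being guessed, which keeps the gap visible for the final validation step.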
Step 5: Final Validation
Before saving, Priya runs her standard post-cleaning checks:
print("\n=== STEP 5: FINAL VALIDATION ===")
checks = []
# Check 1: No duplicates
n_dups = df.duplicated().sum()
checks.append(("No duplicate rows", n_dups == 0, f"Found: {n_dups}"))
# Check 2: No missing region
n_missing_region = df["region"].isna().sum()
checks.append(("No missing region", n_missing_region == 0,
f"Missing: {n_missing_region}"))
# Check 3: No missing category
n_missing_cat = df["category"].isna().sum()
checks.append(("No missing category", n_missing_cat == 0,
f"Missing: {n_missing_cat}"))
# Check 4: Exactly 4 canonical category values
valid_categories = {"Office Chairs", "Desks", "Technology", "Storage"}
actual_categories = set(df["category"].unique())
checks.append(("Exactly 4 canonical categories", actual_categories == valid_categories,
f"Found: {actual_categories}"))
# Check 5: Exactly 5 regions
valid_regions = {"Northeast", "Southwest", "Midwest", "Southeast", "West"}
actual_regions = set(df["region"].unique())
checks.append(("Exactly 5 regions", actual_regions == valid_regions,
f"Found: {actual_regions}"))
# Check 6: Row count is reasonable
checks.append(("Row count is 2492", len(df) == 2492, f"Count: {len(df)}"))
# Check 7: No negative revenue
neg_revenue = (df["total_revenue"] < 0).sum()
checks.append(("No negative total_revenue", neg_revenue == 0,
f"Negative rows: {neg_revenue}"))
# Print results
all_passed = True
for check_name, passed, detail in checks:
status = "PASS" if passed else "FAIL"
if not passed:
all_passed = False
print(f" [{status}] {check_name} — {detail}")
print(f"\n{'All checks passed!' if all_passed else 'Some checks failed — review above.'}")
Output:
[PASS] No duplicate rows — Found: 0
[PASS] No missing region — Missing: 0
[PASS] No missing category — Missing: 0
[PASS] Exactly 4 canonical categories — Found: {'Desks', 'Office Chairs', 'Storage', 'Technology'}
[PASS] Exactly 5 regions — Found: {'West', 'Northeast', 'Midwest', 'Southeast', 'Southwest'}
[PASS] Row count is 2492 — Count: 2492
[PASS] No negative total_revenue — Negative rows: 0
All checks passed!
Step 6: Save the Clean Dataset
# Save the clean version — NEVER overwrite the original
output_path = "acme_sales_2023_clean.csv"
df.to_csv(output_path, index=False)
print(f"\nClean data saved to: {output_path}")
print(f"Final dataset: {len(df)} rows, {len(df.columns)} columns")
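One caveat about CSV as the output format: it discards dtypes, so the parsed dates and the string product_sku must be re-specified by whoever loads the clean file (just as Priya's own read_csv call does). A quick demonstration of the round-trip loss; the Parquet remark at the end is an option, not part of Priya's pipeline:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05"]),
    "product_sku": ["OFF-CH-0046"],
})

# Round-trip through CSV text: the datetime64 dtype is lost on the way back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)
print(back["date"].dtype)  # object — the next reader must re-pass parse_dates
# A Parquet copy (df.to_parquet, requires pyarrow) would preserve dtypes
```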
Step 7: Print the Cleaning Log
print("\n=== CLEANING LOG ===")
print(f"Original file: acme_sales_2023.csv ({len(df_raw)} rows)")
print(f"Cleaned file: acme_sales_2023_clean.csv ({len(df)} rows)")
print()
for i, entry in enumerate(cleaning_log, 1):
print(f"[{i}] {entry}")
Output:
=== CLEANING LOG ===
Original file: acme_sales_2023.csv (2500 rows)
Cleaned file: acme_sales_2023_clean.csv (2492 rows)
[1] Step 1 — Removed 8 exact duplicate rows. These appear to be double-entered
transactions. 2500 → 2492 rows.
[2] Step 2 — Filled 13 missing region values: 12 set to 'Northeast' (Marcus Webb,
confirmed by IT), 1 set to 'Southwest' (Morgan Lee Jan transfer, confirmed by IT).
0 remaining missing.
[3] Step 3 — Standardized category capitalization using .str.strip().str.title().
Merged 'office chairs' (7) and 'OFFICE CHAIRS' (3) into 'Office Chairs'.
Category now has 4 clean values + NaN.
[4] Step 4 — Filled 5 missing category values using product_sku lookup.
All 5 resolved from SKU prefix. 0 remaining missing.
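Printing the log is ephemeral. If the log really is a first-class deliverable, writing it to a text file next to the clean CSV preserves it for Wednesday's meeting and beyond. A minimal sketch; the file name and header format are illustrative, not from Priya's script:

```python
from datetime import date

cleaning_log = [
    "Step 1 — Removed 8 exact duplicate rows. 2500 → 2492 rows.",
    "Step 2 — Filled 13 missing region values. 0 remaining missing.",
]

# Hypothetical companion file: dated header, one numbered entry per line
log_path = "acme_sales_2023_cleaning_log.txt"
with open(log_path, "w", encoding="utf-8") as f:
    f.write(f"Cleaning log — generated {date.today().isoformat()}\n")
    for i, entry in enumerate(cleaning_log, 1):
        f.write(f"[{i}] {entry}\n")
```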
The Before/After Impact
To show Sandra why data cleaning matters, Priya generates a quick comparison report:
print("\n=== BEFORE vs. AFTER: OFFICE CHAIRS REVENUE ===")
# Revenue by category in raw data
raw_by_cat = df_raw.groupby("category")["total_revenue"].sum().sort_values(ascending=False)
print("BEFORE cleaning:")
print(raw_by_cat.apply(lambda x: f"${x:,.2f}"))
# Revenue by category after cleaning
clean_by_cat = df.groupby("category")["total_revenue"].sum().sort_values(ascending=False)
print("\nAFTER cleaning:")
print(clean_by_cat.apply(lambda x: f"${x:,.2f}"))
Output:
BEFORE cleaning:
category
Office Chairs $412,345.23
Desks $398,721.45
Technology $374,892.16
Storage $198,234.89
office chairs $3,421.78 ← should be merged into Office Chairs
OFFICE CHAIRS $1,456.34 ← should be merged into Office Chairs
AFTER cleaning:
category
Office Chairs $417,223.35 ← correct total: $4,878.12 more
Desks $398,721.45
Technology $374,892.16
Storage $198,234.89
The capitalization issue alone caused Office Chairs revenue to appear $4,878 lower than it actually was ($3,421.78 + $1,456.34 stranded under the two variant labels). In a company where sales categories drive inventory decisions and budget allocations, that difference is meaningful.
What Priya Learned from This Cleaning Session
Cleaning reveals facts that exploration only hinted at. During exploration, Priya knew there were 13 missing regions. During cleaning, she discovered one of those 13 belonged to a different salesperson than expected — something that only became clear when she looked at the actual rows.
Context matters. The most technically correct fill value for "Morgan Lee's missing region" required a conversation with Marcus, not a statistical formula. Data cleaning is partly a technical exercise and partly an investigative one.
The cleaning log is a first-class deliverable. When Sandra asks "why are the Office Chairs numbers different from last quarter's report?", Priya can point to her cleaning log and say: "Because the raw data had capitalization inconsistencies. The numbers were always this high — we just couldn't see it clearly."
One step at a time, validate between steps. If Priya had tried to run all four cleaning operations at once, a bug in step 2 might have corrupted data that step 3 or 4 depended on. Working step by step, with validation checks in between, makes the pipeline easier to debug and trust.
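That last lesson can be made concrete with a tiny checkpoint helper called after each step, so a bad step fails loudly before the next one runs. A sketch; the function name and the particular checks are illustrative, not part of Priya's script:

```python
import pandas as pd

def checkpoint(df: pd.DataFrame, step: str, expected_rows: int) -> None:
    """Fail fast if a cleaning step left the frame in an unexpected state."""
    assert len(df) == expected_rows, (
        f"{step}: expected {expected_rows} rows, got {len(df)}"
    )
    assert df.duplicated().sum() == 0, f"{step}: duplicate rows present"
    print(f"[ok] {step}: {len(df)} rows")

# Toy example: dedup a three-row frame, then verify before moving on
df = pd.DataFrame({"a": [1, 1, 2]})
df = df.drop_duplicates()
checkpoint(df, "dedup", expected_rows=2)
```

In the real pipeline the expected row counts come from the pre-cleaning inspection (2500 before dedup, 2492 after), so a checkpoint catches a wrong mask or an over-aggressive drop at the step that caused it.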
Complete Code for This Case Study
import pandas as pd
# Load original
df_raw = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--", ""]
)
df = df_raw.copy()
# Step 1: Remove duplicates
df = df.drop_duplicates(keep="first")
# Step 2: Fill missing regions
df.loc[(df["region"].isna()) & (df["salesperson"] == "Marcus Webb"), "region"] = "Northeast"
df.loc[(df["region"].isna()) & (df["salesperson"] == "Morgan Lee")
& (df["date"] == "2023-01-23"), "region"] = "Southwest"
# Step 3: Standardize category capitalization
df["category"] = df["category"].str.strip().str.title()
# Step 4: Fill missing categories from SKU lookup
sku_to_category = {
"DESK-SIT-012": "Desks",
"TECH-MON-089": "Technology",
"OFF-CH-0045": "Office Chairs",
"STOR-CAB-023": "Storage",
"TECH-KEY-034": "Technology",
}
for idx in df[df["category"].isna()].index:
sku = df.loc[idx, "product_sku"]
if sku in sku_to_category:
df.loc[idx, "category"] = sku_to_category[sku]
# Validate
assert df.duplicated().sum() == 0
assert df["region"].isna().sum() == 0
assert df["category"].isna().sum() == 0
assert df["category"].nunique() == 4
# Save
df.to_csv("acme_sales_2023_clean.csv", index=False)
print(f"Done. Clean file: {len(df)} rows.")