Case Study 12-1: Priya Cleans the Acme Sales Dataset

Characters: Priya Okonkwo (Senior Analyst), Marcus Webb (IT Manager), Sandra Chen (VP Sales)
Setting: Acme Corp, three days before the annual sales review meeting
Skill focus: Systematic data cleaning — deduplication, missing value handling, string standardization, type correction


The Recap

In Chapter 11, Priya explored acme_sales_2023.csv and found four issues:

  • Issue #1: 13 rows missing region (confirmed by Marcus as all belonging to "Northeast")
  • Issue #2: 5 rows missing category
  • Issue #3: category column has three capitalizations: "Office Chairs", "office chairs", "OFFICE CHAIRS"
  • Issue #4: 8 exact duplicate rows

Now she needs to fix them before generating her quarterly and annual reports for Sandra's Wednesday review. Priya works through the issues methodically, following the data cleaning checklist from Chapter 12.


Setting Up

Priya creates a new Python script: clean_acme_sales.py. Her first rule: never modify the original file.

import pandas as pd

# Load the original file — we will NEVER save over this
df_raw = pd.read_csv(
    "acme_sales_2023.csv",
    parse_dates=["date"],
    dtype={"product_sku": str},
    na_values=["N/A", "n/a", "--", ""]
)

# Work on a copy
df = df_raw.copy()

print(f"Original data: {len(df)} rows, {len(df.columns)} columns")
# Original data: 2500 rows, 9 columns
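The na_values list is doing quiet but important work: pandas recognizes some sentinel strings (such as "N/A") on its own, but custom sentinels like "--" survive as ordinary text and silently escape .isna() checks. A minimal sketch with an inline two-row CSV (the toy data here is invented for illustration, not from the Acme file):

```python
import io

import pandas as pd

# A tiny CSV with one sentinel pandas knows ("N/A") and one it does not ("--")
raw_csv = "region,units\nNortheast,2\nN/A,1\n--,3\n"

# Default parsing already treats "N/A" as missing, but "--" survives as text
plain = pd.read_csv(io.StringIO(raw_csv))
print(plain["region"].isna().sum())    # 1 — the "--" row is not recognized

# Passing na_values adds the file's custom sentinels on top of the defaults
cleaned = pd.read_csv(io.StringIO(raw_csv), na_values=["N/A", "n/a", "--", ""])
print(cleaned["region"].isna().sum())  # 2 — both sentinel rows now count as missing
```

Because keep_default_na is True by default, the list extends pandas' built-in missing-value strings rather than replacing them.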

She verifies the issues are present in her copy before she starts:

print("Pre-cleaning state:")
print(f"  Duplicates:        {df.duplicated().sum()}")
print(f"  Missing region:    {df['region'].isna().sum()}")
print(f"  Missing category:  {df['category'].isna().sum()}")
print(f"  Category variants: {sorted(df['category'].dropna().unique())}")

Output:

Pre-cleaning state:
  Duplicates:        8
  Missing region:    13
  Missing category:  5
  Category variants: ['Desks', 'OFFICE CHAIRS', 'Office Chairs',
                      'Storage', 'Technology', 'office chairs']

All four issues confirmed. Time to fix them.


Step 1: Remove Duplicate Rows

Priya tackles duplicates first. She wants to see exactly which rows are duplicated before deleting anything.

# Step 1: Remove exact duplicate rows
print("\n=== STEP 1: DUPLICATE REMOVAL ===")

# Identify and display the duplicates before removing them
duplicate_mask = df.duplicated(keep=False)
n_duplicates = df.duplicated().sum()

print(f"Exact duplicates found: {n_duplicates}")
print("\nDuplicate row pairs:")
print(
    df[duplicate_mask]
    .sort_values(["date", "salesperson", "product_sku"])
    [["date", "salesperson", "product_sku", "units_sold", "total_revenue"]]
    .to_string()
)

Output:

Exact duplicates found: 8

Duplicate row pairs:
           date   salesperson   product_sku  units_sold  total_revenue
12   2023-01-15  Jordan Blake   OFF-CH-0045           2         449.98
47   2023-01-15  Jordan Blake   OFF-CH-0045           2         449.98
98   2023-02-08  Casey Rivera  DESK-SIT-012           1         764.99
103  2023-02-08  Casey Rivera  DESK-SIT-012           1         764.99
...

She can see the pattern: these are pairs where the same salesperson sold the same product on the same date for the same amount. These are almost certainly double-entries, not coincidences. She drops the duplicates and verifies:

# Drop duplicates — keep the first occurrence
df = df.drop_duplicates(keep="first")

print(f"\nAfter deduplication: {len(df)} rows")
print(f"Rows removed: {len(df_raw) - len(df)}")
# After deduplication: 2492 rows
# Rows removed: 8
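The keep argument decides which copy survives, and it behaves the same way in duplicated() and drop_duplicates(). A sketch on a three-row toy frame (invented data, not from the Acme file):

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates of each other
toy = pd.DataFrame({"sku": ["A", "A", "B"], "units": [2, 2, 1]})

# duplicated() flags only the later copies; keep=False flags every copy
print(toy.duplicated().sum())            # 1
print(toy.duplicated(keep=False).sum())  # 2

# keep="first" retains row 0; keep="last" would retain row 1 instead
deduped = toy.drop_duplicates(keep="first")
print(len(deduped))                      # 2
```

This is why Priya used keep=False to display the duplicates (she wanted to see both halves of each pair) but the default keep="first" to drop them.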

She adds a note to her cleaning log:

cleaning_log = []
cleaning_log.append(
    "Step 1 — Removed 8 exact duplicate rows. "
    "These appear to be double-entered transactions. "
    "2500 → 2492 rows."
)

Step 2: Fill Missing Region Values

Marcus confirmed the 13 rows with missing region all belong to him — he is in the Northeast office and sometimes forgets to tag the region when entering bulk orders.

print("\n=== STEP 2: FILL MISSING REGION ===")

# Verify before filling
print(f"Missing region before: {df['region'].isna().sum()}")
print("\nRows with missing region:")
print(
    df[df["region"].isna()]
    [["date", "salesperson", "product_sku", "units_sold", "total_revenue"]]
    .to_string()
)

Output:

Missing region before: 13

Rows with missing region:
           date  salesperson     product_sku  units_sold  total_revenue
3    2023-01-05  Marcus Webb     OFF-CH-0046           1         299.99
87   2023-01-23   Morgan Lee  STOR-SHELF-011           2         179.98
...

Wait — one of the missing-region rows belongs to Morgan Lee, not Marcus Webb, so Marcus's explanation doesn't cover all 13 rows after all. Priya goes back to him:

"Marcus — I see 13 rows missing a region. Most are yours, but row 87 (2023-01-23) has Morgan Lee and no region. Can you check?"

Marcus replies: "Morgan Lee transferred from Southwest mid-January. That Jan 23 transaction should be Southwest."

Priya now has specific fill values for each case:

# Fill Marcus Webb's missing regions with Northeast
marcus_mask = (df["region"].isna()) & (df["salesperson"] == "Marcus Webb")
df.loc[marcus_mask, "region"] = "Northeast"

# Fill Morgan Lee's specific missing transaction with Southwest
morgan_jan_mask = (
    (df["region"].isna())
    & (df["salesperson"] == "Morgan Lee")
    & (df["date"] == "2023-01-23")
)
df.loc[morgan_jan_mask, "region"] = "Southwest"

# For any remaining missing (defensive — shouldn't be any)
remaining_missing = df["region"].isna().sum()
if remaining_missing > 0:
    df["region"] = df["region"].fillna("Unknown")
    print(f"Warning: {remaining_missing} region values set to 'Unknown'")

print(f"Missing region after fill: {df['region'].isna().sum()}")
print("\nRegion distribution after fill:")
print(df["region"].value_counts())

Output:

Missing region after fill: 0

Region distribution after fill:
Northeast    614
Southwest    522
Midwest      488
Southeast    467
West         401
She logs the fill:

cleaning_log.append(
    "Step 2 — Filled 13 missing region values: "
    "12 set to 'Northeast' (Marcus Webb, confirmed by IT), "
    "1 set to 'Southwest' (Morgan Lee Jan transfer, confirmed by IT). "
    "0 remaining missing."
)
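The .loc[mask, column] pattern Priya uses is the reliable way to write a conditional fill. The tempting chained alternative, df[df["region"].isna()]["region"] = ..., assigns into a temporary copy and can leave the original frame untouched. A small sketch with invented toy data:

```python
import pandas as pd

toy = pd.DataFrame({"salesperson": ["Marcus Webb", "Morgan Lee"],
                    "region": [None, "West"]})

# Chained indexing writes into a temporary slice — pandas raises
# SettingWithCopyWarning and the original frame may be unchanged:
# toy[toy["region"].isna()]["region"] = "Northeast"

# .loc with a boolean mask modifies the frame itself, as intended
mask = toy["region"].isna() & (toy["salesperson"] == "Marcus Webb")
toy.loc[mask, "region"] = "Northeast"
print(toy["region"].tolist())  # ['Northeast', 'West']
```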

Step 3: Standardize Category Capitalization

This is the most impactful fix. The "office chairs" and "OFFICE CHAIRS" entries mean that every group-by, pivot, and filter on category gives wrong results until this is fixed.

Priya's approach: strip whitespace first, then apply .str.title() to standardize to title case.

print("\n=== STEP 3: STANDARDIZE CATEGORY CAPITALIZATION ===")

print("Before:")
print(df["category"].value_counts(dropna=False))

Output:

Before:
Office Chairs    835
Desks            632
Technology       583
Storage          427
office chairs      7
OFFICE CHAIRS      3
NaN                5

Wait — the counts are slightly different from Chapter 11 because duplicates were removed in Step 1. After removing 8 duplicates (some of which were "Office Chairs" transactions), the counts have adjusted. This is expected.

# Step 3a: Strip whitespace from category
df["category"] = df["category"].str.strip()

# Step 3b: Apply title case to standardize capitalization
df["category"] = df["category"].str.title()

print("\nAfter .str.strip().str.title():")
print(df["category"].value_counts(dropna=False))

Output:

After .str.strip().str.title():
Office Chairs    845
Desks            632
Technology       583
Storage          427
NaN                5

845 rows now correctly labeled "Office Chairs" (835 + 7 + 3 = 845). The three separate variants have collapsed into one canonical label.

cleaning_log.append(
    "Step 3 — Standardized category capitalization using .str.strip().str.title(). "
    "Merged 'office chairs' (7) and 'OFFICE CHAIRS' (3) into 'Office Chairs'. "
    "Category now has 4 clean values + NaN."
)
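.str.title() works here because all four category names are plain title-case words, but it capitalizes after every non-letter boundary, so acronyms and similar labels can come out mangled. When the set of valid labels is known, an explicit mapping keyed on casefolded text gives full control. A sketch — the "USB hubs" label and the canonical_map dict are invented for illustration, not part of the Acme data:

```python
import pandas as pd

labels = pd.Series([" office chairs ", "OFFICE CHAIRS", "USB hubs"])

# .str.title() fixes the case variants but mangles the acronym
print(labels.str.strip().str.title().tolist())
# ['Office Chairs', 'Office Chairs', 'Usb Hubs']

# A casefold-keyed lookup maps every variant to an exact canonical spelling
canonical_map = {"office chairs": "Office Chairs", "usb hubs": "USB Hubs"}
print(labels.str.strip().str.casefold().map(canonical_map).tolist())
# ['Office Chairs', 'Office Chairs', 'USB Hubs']
```

A side benefit of the map approach: any label missing from canonical_map becomes NaN, which surfaces unexpected values instead of silently restyling them.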

Step 4: Handle Missing Category Values

Five rows are still missing a category. Priya looks up the product SKUs to see if she can infer the correct category.

print("\n=== STEP 4: HANDLE MISSING CATEGORY ===")

missing_category_rows = df[df["category"].isna()]
print(f"Rows with missing category: {len(missing_category_rows)}")
print("\nMissing category rows:")
print(
    missing_category_rows[
        ["date", "salesperson", "product_sku", "category", "total_revenue"]
    ].to_string()
)

Output:

Rows with missing category: 5

Missing category rows:
           date    salesperson  product_sku category  total_revenue
247  2023-03-12   Jordan Blake  DESK-SIT-012      NaN         764.99
681  2023-06-08   Casey Rivera  TECH-MON-089      NaN         997.47
1102 2023-08-21   Alex Torres   OFF-CH-0045       NaN         449.98
1834 2023-10-14   Morgan Lee    STOR-CAB-023      NaN         467.96
2198 2023-11-30   David Park    TECH-KEY-034      NaN         143.98

The product SKUs are informative:

  • DESK-SIT-012 → "Desks" (prefix "DESK")
  • TECH-MON-089 → "Technology" (prefix "TECH")
  • OFF-CH-0045 → "Office Chairs" (prefix "OFF-CH")
  • STOR-CAB-023 → "Storage" (prefix "STOR")
  • TECH-KEY-034 → "Technology" (prefix "TECH")

Priya creates a lookup dictionary for these specific SKUs, guided by their prefixes:

# Create a mapping from SKU prefix to category
sku_to_category = {
    "DESK-SIT-012": "Desks",
    "TECH-MON-089": "Technology",
    "OFF-CH-0045":  "Office Chairs",
    "STOR-CAB-023": "Storage",
    "TECH-KEY-034": "Technology",
}

# Fill the missing categories using the SKU lookup
for idx in missing_category_rows.index:
    sku = df.loc[idx, "product_sku"]
    if sku in sku_to_category:
        df.loc[idx, "category"] = sku_to_category[sku]
        print(f"  Filled row {idx}: SKU {sku} → {sku_to_category[sku]}")
    else:
        print(f"  Could not determine category for SKU {sku}")

print(f"\nMissing category after fill: {df['category'].isna().sum()}")

Output:

  Filled row 247: SKU DESK-SIT-012 → Desks
  Filled row 681: SKU TECH-MON-089 → Technology
  Filled row 1102: SKU OFF-CH-0045 → Office Chairs
  Filled row 1834: SKU STOR-CAB-023 → Storage
  Filled row 2198: SKU TECH-KEY-034 → Technology

Missing category after fill: 0

All five resolved using SKU lookup. No "Unknown" categories needed.

print("\nFinal category distribution:")
print(df["category"].value_counts())

Output:

Final category distribution:
Office Chairs    846
Desks            633
Technology       585
Storage          428

Clean, canonical, no gaps.

cleaning_log.append(
    "Step 4 — Filled 5 missing category values using product_sku lookup. "
    "All 5 resolved from SKU prefix. 0 remaining missing."
)
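Priya's dictionary maps each full SKU, which is fine for five rows. If the same gap appeared across hundreds of SKUs, mapping on the prefix itself would scale better. A sketch, assuming (as the SKUs above suggest) that the leading segment before the first hyphen identifies the category; the toy frame is invented:

```python
import pandas as pd

toy = pd.DataFrame({"product_sku": ["DESK-SIT-012", "TECH-MON-089", "OFF-CH-0045"],
                    "category": [None, None, None]})

# One entry per leading SKU segment; "OFF" covers the OFF-CH-#### chairs
prefix_map = {"DESK": "Desks", "TECH": "Technology",
              "OFF": "Office Chairs", "STOR": "Storage"}

# Take the text before the first hyphen, look it up, and fill only the gaps
prefixes = toy["product_sku"].str.split("-").str[0]
toy["category"] = toy["category"].fillna(prefixes.map(prefix_map))
print(toy["category"].tolist())  # ['Desks', 'Technology', 'Office Chairs']
```

Because fillna only touches missing values, rows that already carry a category are left alone.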

Step 5: Final Validation

Before saving, Priya runs her standard post-cleaning checks:

print("\n=== STEP 5: FINAL VALIDATION ===")

checks = []

# Check 1: No duplicates
n_dups = df.duplicated().sum()
checks.append(("No duplicate rows", n_dups == 0, f"Found: {n_dups}"))

# Check 2: No missing region
n_missing_region = df["region"].isna().sum()
checks.append(("No missing region", n_missing_region == 0,
               f"Missing: {n_missing_region}"))

# Check 3: No missing category
n_missing_cat = df["category"].isna().sum()
checks.append(("No missing category", n_missing_cat == 0,
               f"Missing: {n_missing_cat}"))

# Check 4: Exactly 4 canonical category values
valid_categories = {"Office Chairs", "Desks", "Technology", "Storage"}
actual_categories = set(df["category"].unique())
checks.append(("Exactly 4 canonical categories", actual_categories == valid_categories,
               f"Found: {actual_categories}"))

# Check 5: Exactly 5 regions
valid_regions = {"Northeast", "Southwest", "Midwest", "Southeast", "West"}
actual_regions = set(df["region"].unique())
checks.append(("Exactly 5 regions", actual_regions == valid_regions,
               f"Found: {actual_regions}"))

# Check 6: Row count matches expectation (2500 original - 8 duplicates)
checks.append(("Row count is 2492", len(df) == 2492, f"Count: {len(df)}"))

# Check 7: No negative revenue
neg_revenue = (df["total_revenue"] < 0).sum()
checks.append(("No negative total_revenue", neg_revenue == 0,
               f"Negative rows: {neg_revenue}"))

# Print results
all_passed = True
for check_name, passed, detail in checks:
    status = "PASS" if passed else "FAIL"
    if not passed:
        all_passed = False
    print(f"  [{status}] {check_name} — {detail}")

print(f"\n{'All checks passed!' if all_passed else 'Some checks failed — review above.'}")

Output:

  [PASS] No duplicate rows — Found: 0
  [PASS] No missing region — Missing: 0
  [PASS] No missing category — Missing: 0
  [PASS] Exactly 4 canonical categories — Found: {'Desks', 'Office Chairs', 'Storage', 'Technology'}
  [PASS] Exactly 5 regions — Found: {'West', 'Northeast', 'Midwest', 'Southeast', 'Southwest'}
  [PASS] Row count is 2492 — Count: 2492
  [PASS] No negative total_revenue — Negative rows: 0

All checks passed!
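The check list can also be wrapped in a small helper so the same validation runs after every future refresh of the file. A sketch under assumptions: the function name validate_sales and the toy one-row frame are illustrative, not part of Priya's script.

```python
import pandas as pd

def validate_sales(df: pd.DataFrame) -> list:
    """Run the post-cleaning checks and return (name, passed) pairs."""
    return [
        ("no duplicate rows", df.duplicated().sum() == 0),
        ("no missing region", df["region"].notna().all()),
        ("no missing category", df["category"].notna().all()),
        ("no negative revenue", (df["total_revenue"] >= 0).all()),
    ]

# A one-row frame that satisfies every rule
toy = pd.DataFrame({"region": ["West"], "category": ["Desks"],
                    "total_revenue": [99.0]})
print(all(passed for _, passed in validate_sales(toy)))  # True
```

Returning (name, passed) pairs instead of raising on the first failure lets the caller report every failed check at once, which is friendlier when several issues appear together.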

Step 6: Save the Clean Dataset

# Save the clean version — NEVER overwrite the original
output_path = "acme_sales_2023_clean.csv"
df.to_csv(output_path, index=False)

print(f"\nClean data saved to: {output_path}")
print(f"Final dataset: {len(df)} rows, {len(df.columns)} columns")

Step 7: Print the Cleaning Log

print("\n=== CLEANING LOG ===")
print(f"Original file:  acme_sales_2023.csv  ({len(df_raw)} rows)")
print(f"Cleaned file:   acme_sales_2023_clean.csv  ({len(df)} rows)")
print()
for i, entry in enumerate(cleaning_log, 1):
    print(f"[{i}] {entry}")

Output:

=== CLEANING LOG ===
Original file:  acme_sales_2023.csv  (2500 rows)
Cleaned file:   acme_sales_2023_clean.csv  (2492 rows)

[1] Step 1 — Removed 8 exact duplicate rows. These appear to be double-entered
    transactions. 2500 → 2492 rows.

[2] Step 2 — Filled 13 missing region values: 12 set to 'Northeast' (Marcus Webb,
    confirmed by IT), 1 set to 'Southwest' (Morgan Lee Jan transfer, confirmed by IT).
    0 remaining missing.

[3] Step 3 — Standardized category capitalization using .str.strip().str.title().
    Merged 'office chairs' (7) and 'OFFICE CHAIRS' (3) into 'Office Chairs'.
    Category now has 4 clean values + NaN.

[4] Step 4 — Filled 5 missing category values using product_sku lookup.
    All 5 resolved from SKU prefix. 0 remaining missing.

The Before/After Impact

To show Sandra why data cleaning matters, Priya generates a quick comparison report:

print("\n=== BEFORE vs. AFTER: OFFICE CHAIRS REVENUE ===")

# Revenue by category in raw data
raw_by_cat = df_raw.groupby("category")["total_revenue"].sum().sort_values(ascending=False)
print("BEFORE cleaning:")
print(raw_by_cat.apply(lambda x: f"${x:,.2f}"))

# Revenue by category after cleaning
clean_by_cat = df.groupby("category")["total_revenue"].sum().sort_values(ascending=False)
print("\nAFTER cleaning:")
print(clean_by_cat.apply(lambda x: f"${x:,.2f}"))

Output:

BEFORE cleaning:
category
Office Chairs    $412,345.23
Desks            $398,721.45
Technology       $374,892.16
Storage          $198,234.89
office chairs      $3,421.78    ← should be merged into Office Chairs
OFFICE CHAIRS      $1,456.34    ← should be merged into Office Chairs

AFTER cleaning:
category
Office Chairs    $417,223.35   ← correct total: $4,878.12 more
Desks            $398,721.45
Technology       $374,892.16
Storage          $198,234.89

The capitalization issue alone caused Office Chairs revenue to appear about $4,878 lower than it actually was. In a company where sales categories drive inventory decisions and budget allocations, that difference is meaningful.
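The mechanism behind the understatement is easy to reproduce: groupby treats each capitalization as a distinct key. A two-row toy sketch (invented numbers):

```python
import pandas as pd

toy = pd.DataFrame({"category": ["Office Chairs", "office chairs"],
                    "revenue": [100.0, 50.0]})

# Two case variants become two separate groups — each total is understated
print(toy.groupby("category")["revenue"].sum().to_dict())
# {'Office Chairs': 100.0, 'office chairs': 50.0}

# After standardizing, a single group carries the full total
toy["category"] = toy["category"].str.title()
print(toy.groupby("category")["revenue"].sum().to_dict())
# {'Office Chairs': 150.0}
```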


What Priya Learned from This Cleaning Session

Cleaning reveals facts that exploration only hinted at. During exploration, Priya knew there were 13 missing regions. During cleaning, she discovered one of those 13 belonged to a different salesperson than expected — something that only became clear when she looked at the actual rows.

Context matters. The most technically correct fill value for "Morgan Lee's missing region" required a conversation with Marcus, not a statistical formula. Data cleaning is partly a technical exercise and partly an investigative one.

The cleaning log is a first-class deliverable. When Sandra asks "why are the Office Chairs numbers different from last quarter's report?", Priya can point to her cleaning log and say: "Because the raw data had capitalization inconsistencies. The numbers were always this high — we just couldn't see it clearly."

One step at a time, validate between steps. If Priya had tried to run all four cleaning operations at once, a bug in step 2 might have corrupted data that step 3 or 4 depended on. Working step by step, with validation checks in between, makes the pipeline easier to debug and trust.


Complete Code for This Case Study

import pandas as pd

# Load original
df_raw = pd.read_csv(
    "acme_sales_2023.csv",
    parse_dates=["date"],
    dtype={"product_sku": str},
    na_values=["N/A", "n/a", "--", ""]
)
df = df_raw.copy()

# Step 1: Remove duplicates
df = df.drop_duplicates(keep="first")

# Step 2: Fill missing regions
df.loc[(df["region"].isna()) & (df["salesperson"] == "Marcus Webb"), "region"] = "Northeast"
df.loc[(df["region"].isna()) & (df["salesperson"] == "Morgan Lee")
       & (df["date"] == "2023-01-23"), "region"] = "Southwest"

# Step 3: Standardize category capitalization
df["category"] = df["category"].str.strip().str.title()

# Step 4: Fill missing categories from SKU lookup
sku_to_category = {
    "DESK-SIT-012": "Desks",
    "TECH-MON-089": "Technology",
    "OFF-CH-0045":  "Office Chairs",
    "STOR-CAB-023": "Storage",
    "TECH-KEY-034": "Technology",
}
for idx in df[df["category"].isna()].index:
    sku = df.loc[idx, "product_sku"]
    if sku in sku_to_category:
        df.loc[idx, "category"] = sku_to_category[sku]

# Validate
assert df.duplicated().sum() == 0
assert df["region"].isna().sum() == 0
assert df["category"].isna().sum() == 0
assert df["category"].nunique() == 4

# Save
df.to_csv("acme_sales_2023_clean.csv", index=False)
print(f"Done. Clean file: {len(df)} rows.")