Chapter 12 Key Takeaways: Cleaning and Preparing Data for Analysis


The Big Ideas

1. Real data is always messy — plan for it, not against it. Data cleaning is not an exception to be dealt with occasionally. It is a core, recurring part of every data analysis project. Building systematic habits (checklists, cleaning logs, validation suites) transforms a frustrating chore into a repeatable, professional workflow.

2. Understand before you fix. Every data quality issue has a cause. A column with mixed capitalization was entered by multiple people over time. Missing values exist because the system does not require the field, or because the event hasn't happened yet. Dollar signs in numeric fields come from a finance system that treats prices as formatted strings. Understanding the cause leads to the right fix; guessing leads to broken data.

3. The clean dataset is not a file — it is a reproducible pipeline. The deliverable is not just data_clean.csv. It is the Python script that transforms data_raw.csv into data_clean.csv, with documented steps that can be rerun by anyone on any machine. This is what reproducibility means in practice.
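A minimal sketch of what such a pipeline script can look like (the file names follow the chapter's data_raw.csv / data_clean.csv example; the column names and specific steps here are illustrative, not the chapter's exact code):

```python
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform the raw dataset into the clean one, one documented step at a time."""
    df = raw.copy()                                   # never modify the source
    df = df.drop_duplicates()                         # 1. exact duplicate rows
    df["region"] = df["region"].fillna("Unknown")     # 2. missing values
    df["unit_price"] = pd.to_numeric(                 # 3. fix types
        df["unit_price"].str.replace("$", "", regex=False),
        errors="coerce",
    )
    df["category"] = df["category"].str.strip().str.title()  # 4-5. string cleanup
    return df

# In practice, the script is run end to end on the raw file:
# raw = pd.read_csv("data_raw.csv")
# clean(raw).to_csv("data_clean.csv", index=False)
```

Because the whole transformation lives in one function, anyone can rerun it on the original raw file and get the same clean file, which is the point of treating the pipeline, not the CSV, as the deliverable.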


The Data Cleaning Checklist

Work through these steps in order — earlier steps enable later ones:

  1. Remove exact duplicate rows — .duplicated(), .drop_duplicates()
  2. Handle missing values — .isna(), .fillna(), .dropna(), .interpolate()
  3. Fix data types — .astype(), pd.to_numeric(), pd.to_datetime()
  4. Strip whitespace from strings — .str.strip()
  5. Standardize string case — .str.lower(), .str.upper(), .str.title()
  6. Remove unwanted characters — .str.replace()
  7. Standardize categorical values — .map(), .replace()
  8. Detect and handle outliers — IQR method, z-scores, .clip()
  9. Validate the results — automated checks after each step
  10. Document the pipeline — cleaning log, comments, saved clean file

Core Methods Quick Reference

Missing Values

df.isna().sum()                          # Count NaN per column
df.isna().sum() / len(df) * 100          # Percentage missing
df[df.isna().any(axis=1)]                # Rows with any missing
df.dropna()                              # Drop rows with any NaN
df.dropna(subset=["col"])                # Drop rows where col is NaN
df["col"] = df["col"].fillna("Unknown")  # Fill with constant
df["col"] = df["col"].fillna(df["col"].median())  # Fill with median
df["col"] = df["col"].ffill()                     # Forward fill (fillna(method="ffill") is deprecated)
df["col"] = df["col"].interpolate()               # Linear interpolation

Duplicates

df.duplicated().sum()                    # Count duplicate rows
df[df.duplicated(keep=False)]            # Show all duplicate rows
df = df.drop_duplicates(keep="first")    # Remove duplicates
df.duplicated(subset=["col1","col2"])    # Check specific columns

Type Conversion

df["col"] = df["col"].astype(int)
df["col"] = df["col"].astype(float)
df["col"] = df["col"].astype(str)
df["col"] = df["col"].astype("category")

df["col"] = pd.to_numeric(df["col"], errors="coerce")
df["col"] = pd.to_datetime(df["col"], errors="coerce")
df["col"] = pd.to_datetime(df["col"], format="%m/%d/%Y")

String Operations

df["col"] = df["col"].str.strip()              # Remove edge whitespace
df["col"] = df["col"].str.lower()              # Lowercase
df["col"] = df["col"].str.upper()              # Uppercase
df["col"] = df["col"].str.title()              # Title Case
df["col"] = df["col"].str.replace("$", "", regex=False)
df["col"] = df["col"].str.replace(r"\D", "", regex=True)
df[df["col"].str.contains("word", case=False, na=False)]

Categorical Standardization

# .map() — strict: unmapped values become NaN
mapping = {"old_val": "New Val", "another_old": "New Val"}
df["col"] = df["col"].map(mapping)

# .replace() — lenient: unmapped values remain unchanged
df["col"] = df["col"].replace({"old_val": "New Val"})
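The difference is easiest to see side by side on a toy Series (values here are made up for illustration):

```python
import pandas as pd

s = pd.Series(["old_val", "another_old", "unmapped"])
mapping = {"old_val": "New Val", "another_old": "New Val"}

mapped = s.map(mapping)        # "unmapped" has no key, so it becomes NaN
replaced = s.replace(mapping)  # "unmapped" has no key, so it is left unchanged
```

The strictness of .map() is a feature when you intend a complete recoding: a NaN afterward tells you immediately that a value slipped past your mapping dictionary.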

Outlier Detection (IQR Method)

Q1 = df["col"].quantile(0.25)
Q3 = df["col"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["col"] < lower) | (df["col"] > upper)]

# Cap instead of removing (Winsorization)
df["col"] = df["col"].clip(lower=lower, upper=upper)
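The checklist also names z-scores as an alternative detection method. A sketch of that approach (the threshold of 3 standard deviations is a common convention, not a value from the chapter):

```python
import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where a value lies more than `threshold`
    standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return z.abs() > threshold

# Usage: df[zscore_outliers(df["col"])] shows the flagged rows
```

Note that z-scores assume roughly symmetric data; for heavily skewed columns, the IQR method above is usually the safer default.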

The Three Rules of Clean Data Work

Rule 1: Never modify the source file. Load from the original. Work on df = raw.copy(). Save as a new file. Always.

Rule 2: Assign your results back. df["col"].fillna(0) does nothing — the result is discarded. You need df["col"] = df["col"].fillna(0). This applies to every pandas method that returns a new object.

Rule 3: Validate after every step. After each cleaning operation, check that it had the intended effect. Check value counts before and after. Check that type conversions succeeded. Check that NaN counts changed correctly. Do not assume — verify.
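One way to make Rule 3 concrete is a small validation function of assertions run after each cleaning step (the column names here are illustrative, echoing the chapter's examples):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Raise immediately if a cleaning step did not have the intended effect."""
    assert df.duplicated().sum() == 0, "duplicate rows remain"
    assert df["region"].isna().sum() == 0, "region still has NaN"
    assert pd.api.types.is_numeric_dtype(df["unit_price"]), "unit_price not numeric"
    assert (df["unit_price"] >= 0).all(), "negative prices found"
```

Calling validate(df) after each step turns silent data corruption into a loud, immediate failure at the step that caused it.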


The Impact of Cleaning

| Issue (Before) | Impact on Analysis (If Uncleaned) | Fix |
| --- | --- | --- |
| "office chairs", "OFFICE CHAIRS", "Office Chairs" | Category revenue split across 3 rows instead of 1 | .str.strip().str.title() |
| Duplicate rows | Counts, sums, and averages artificially inflated | .drop_duplicates() |
| Missing region (13 rows) | 0.52% of transactions excluded from regional analysis | .fillna() with confirmed value |
| unit_price as "$249.99" (string) | Cannot sum, average, or compare prices | Strip "$", then pd.to_numeric() |
| "Completed" / "completed" / "DONE" | value_counts() splits one status into 3; filter for "Completed" misses the others | .map() with canonical dictionary |
| Date stored as string | Cannot filter by month, calculate duration, or group by week | pd.to_datetime() |

Key Distinctions to Remember

.map() vs .replace():
  - .map() — strict: unmapped values → NaN (use for complete recoding)
  - .replace() — lenient: unmapped values → unchanged (use for targeted fixes)

errors="coerce" vs errors="raise" in pd.to_numeric() / pd.to_datetime():
  - "coerce" — silently converts failures to NaN (safe for production pipelines)
  - "raise" — raises an exception at the first failure (useful for debugging)
  - "ignore" — leaves failures as-is (deprecated in recent pandas, and rarely what you want in a cleaning pipeline)

.ffill() vs .interpolate():
  - .ffill() — repeats the last known value (use for categorical or stepwise data)
  - .interpolate() — calculates a value between two known points (use for continuous numeric series)
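The difference is easiest to see on a small numeric series with a gap:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

filled = s.ffill()        # [1.0, 1.0, 1.0, 4.0]: repeats the last known value
interp = s.interpolate()  # [1.0, 2.0, 3.0, 4.0]: linear between known points
```

Forward fill is the right choice when a missing value means "nothing changed" (a status, a price that holds until updated); interpolation is the right choice when the series varies continuously between measurements.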


Acme Corp Cleaning Results (Chapter 12 Case Study)

| Issue | Rows Affected | Fix Applied | Before | After |
| --- | --- | --- | --- | --- |
| Duplicate rows | 8 removed | .drop_duplicates() | 2500 rows | 2492 rows |
| Missing region | 13 filled | .fillna() with confirmed values | 13 NaN | 0 NaN |
| Missing category | 5 filled | SKU-based lookup | 5 NaN | 0 NaN |
| Category capitalization | 10 rows fixed | .str.strip().str.title() | 6 variants | 4 canonical values |
| Net impact on Office Chairs revenue | | | $412,345 | $417,223 (+$4,878) |

One-Liner Summary

Cleaning data is a systematic, documented, repeatable process — not a one-time scramble. The tools are straightforward; the discipline is what separates professional work from analysis that quietly produces wrong answers.