Chapter 12 Key Takeaways: Cleaning and Preparing Data for Analysis
The Big Ideas
1. Real data is always messy — plan for it, not against it. Data cleaning is not an exception to be dealt with occasionally. It is a core, recurring part of every data analysis project. Building systematic habits (checklists, cleaning logs, validation suites) transforms a frustrating chore into a repeatable, professional workflow.
2. Understand before you fix. Every data quality issue has a cause. A column with mixed capitalization was entered by multiple people over time. Missing values exist because the system does not require the field, or because the event hasn't happened yet. Dollar signs in numeric fields come from a finance system that treats prices as formatted strings. Understanding the cause leads to the right fix; guessing leads to broken data.
3. The clean dataset is not a file — it is a reproducible pipeline.
The deliverable is not just data_clean.csv. It is the Python script that transforms data_raw.csv into data_clean.csv, with documented steps that can be rerun by anyone on any machine. This is what reproducibility means in practice.
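A minimal sketch of what such a pipeline script looks like (the column names `region` and `price` are invented for illustration; the real pipeline would match your dataset):

```python
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """One documented, rerunnable pipeline: raw DataFrame in, clean DataFrame out."""
    df = raw.copy()                                       # never modify the source
    df = df.drop_duplicates()                             # exact duplicate rows
    df["region"] = df["region"].str.strip().str.title()   # standardize strings
    df["price"] = pd.to_numeric(df["price"], errors="coerce")  # fix types
    return df

# The script, not the CSV, is the deliverable:
# raw = pd.read_csv("data_raw.csv")
# clean(raw).to_csv("data_clean.csv", index=False)
```

Anyone who reruns this script against `data_raw.csv` gets the same `data_clean.csv` — that is reproducibility in practice.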
The Data Cleaning Checklist
Work through these steps in order — earlier steps enable later ones:
- Remove exact duplicate rows — .duplicated(), .drop_duplicates()
- Handle missing values — .isna(), .fillna(), .dropna(), .interpolate()
- Fix data types — .astype(), pd.to_numeric(), pd.to_datetime()
- Strip whitespace from strings — .str.strip()
- Standardize string case — .str.lower(), .str.upper(), .str.title()
- Remove unwanted characters — .str.replace()
- Standardize categorical values — .map(), .replace()
- Detect and handle outliers — IQR method, z-scores, .clip()
- Validate the results — automated checks after each step
- Document the pipeline — cleaning log, comments, saved clean file
Core Methods Quick Reference
Missing Values
df.isna().sum() # Count NaN per column
df.isna().sum() / len(df) * 100 # Percentage missing
df[df.isna().any(axis=1)] # Rows with any missing
df.dropna() # Drop rows with any NaN
df.dropna(subset=["col"]) # Drop rows where col is NaN
df["col"] = df["col"].fillna("Unknown") # Fill with constant
df["col"] = df["col"].fillna(df["col"].median()) # Fill with median
df["col"] = df["col"].ffill() # Forward fill (fillna(method="ffill") is deprecated)
df["col"] = df["col"].interpolate() # Linear interpolation
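A quick sketch of the counting patterns above on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, "x"]})

na_counts = df.isna().sum()               # NaN count per column
na_pct = df.isna().sum() / len(df) * 100  # same, as a percentage
rows_with_na = df[df.isna().any(axis=1)]  # only rows containing any NaN
```

Running the counts before and after each fill is the cheapest validation check available.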
Duplicates
df.duplicated().sum() # Count duplicate rows
df[df.duplicated(keep=False)] # Show all duplicate rows
df = df.drop_duplicates(keep="first") # Remove duplicates
df.duplicated(subset=["col1","col2"]) # Check specific columns
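The same duplicate-handling calls on an invented three-row frame, showing why `keep=False` is useful for inspection before dropping:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "a", "b"]})

n_dupes = df.duplicated().sum()            # only the second (1, "a") row counts
all_dupes = df[df.duplicated(keep=False)]  # but both copies are shown for review
deduped = df.drop_duplicates(keep="first")
```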
Type Conversion
df["col"] = df["col"].astype(int)
df["col"] = df["col"].astype(float)
df["col"] = df["col"].astype(str)
df["col"] = df["col"].astype("category")
df["col"] = pd.to_numeric(df["col"], errors="coerce")
df["col"] = pd.to_datetime(df["col"], errors="coerce")
df["col"] = pd.to_datetime(df["col"], format="%m/%d/%Y")
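A small demonstration of `errors="coerce"` with made-up values — bad entries become NaN/NaT instead of crashing the pipeline:

```python
import pandas as pd

nums = pd.to_numeric(pd.Series(["10", "twenty", "30"]), errors="coerce")
# "twenty" cannot be parsed, so it becomes NaN

dates = pd.to_datetime(
    pd.Series(["03/15/2024", "not a date"]),
    format="%m/%d/%Y",
    errors="coerce",
)
# the non-matching string becomes NaT
```

After a coerce, always count the new NaNs — each one is a parse failure worth investigating.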
String Operations
df["col"] = df["col"].str.strip() # Remove edge whitespace
df["col"] = df["col"].str.lower() # Lowercase
df["col"] = df["col"].str.upper() # Uppercase
df["col"] = df["col"].str.title() # Title Case
df["col"] = df["col"].str.replace("$", "", regex=False)
df["col"] = df["col"].str.replace(r"\D", "", regex=True)
df[df["col"].str.contains("word", case=False, na=False)]
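Chaining these string methods handles the common "formatted price as string" case; the sample values are invented:

```python
import pandas as pd

s = pd.Series(["  $1,249.99 ", "$15.00"])
cleaned = (
    s.str.strip()                          # edge whitespace
     .str.replace("$", "", regex=False)    # literal dollar sign, not a regex anchor
     .str.replace(",", "", regex=False)    # thousands separators
)
prices = pd.to_numeric(cleaned)            # now a real numeric column
```

Note `regex=False` for `$`: treated as a regex, `$` means end-of-string and would match (and remove) nothing.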
Categorical Standardization
# .map() — strict: unmapped values become NaN
mapping = {"old_val": "New Val", "another_old": "New Val"}
df["col"] = df["col"].map(mapping)
# .replace() — lenient: unmapped values remain unchanged
df["col"] = df["col"].replace({"old_val": "New Val"})
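The strict/lenient distinction in action, with made-up status values:

```python
import pandas as pd

s = pd.Series(["completed", "DONE", "Shipped"])
mapping = {"completed": "Completed", "DONE": "Completed"}

mapped = s.map(mapping)        # "Shipped" is not in the mapping -> NaN
replaced = s.replace(mapping)  # "Shipped" passes through unchanged
```

The NaNs produced by `.map()` are a feature: they flag every value your mapping forgot to cover.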
Outlier Detection (IQR Method)
Q1 = df["col"].quantile(0.25)
Q3 = df["col"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["col"] < lower) | (df["col"] > upper)]
# Cap instead of removing (Winsorization)
df["col"] = df["col"].clip(lower=lower, upper=upper)
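The same IQR logic run end-to-end on a small invented series with one obvious outlier:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300])  # 300 is the outlier

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

outliers = s[(s < lower) | (s > upper)]        # flagged for review
capped = s.clip(lower=lower, upper=upper)      # Winsorized alternative
```

Whether to drop, cap, or keep an outlier is a judgment call about the data's cause, not a mechanical rule.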
The Three Rules of Clean Data Work
Rule 1: Never modify the source file.
Load from the original. Work on df = raw.copy(). Save as a new file. Always.
Rule 2: Assign your results back.
df["col"].fillna(0) does nothing — the result is discarded. You need df["col"] = df["col"].fillna(0). This applies to every pandas method that returns a new object.
Rule 3: Validate after every step. After each cleaning operation, check that it had the intended effect. Check value counts before and after. Check that type conversions succeeded. Check that NaN counts changed correctly. Do not assume — verify.
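One way to make Rule 3 concrete is to put assertions directly after each step, so the script fails loudly if a conversion misbehaves (column name and values invented):

```python
import pandas as pd

df = pd.DataFrame({"price": ["$10", "$20"]})
na_before = df["price"].isna().sum()

df["price"] = pd.to_numeric(
    df["price"].str.replace("$", "", regex=False), errors="coerce"
)

# Verify the intended effect instead of assuming it:
assert df["price"].dtype.kind in "if"            # now numeric
assert df["price"].isna().sum() == na_before     # coerce created no new NaNs
```

If a later data refresh introduces an unparseable price, the second assertion catches it immediately.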
The Impact of Cleaning
| Issue (Before) | Impact on Analysis (If Uncleaned) | Fix |
|---|---|---|
| "office chairs", "OFFICE CHAIRS", "Office Chairs" | Category revenue split across 3 rows instead of 1 | .str.strip().str.title() |
| Duplicate rows | Counts, sums, and averages artificially inflated | .drop_duplicates() |
| Missing region (13 rows) | 0.52% of transactions excluded from regional analysis | .fillna() with confirmed value |
| unit_price as "$249.99" (string) | Cannot sum, average, or compare prices | Strip $, then pd.to_numeric() |
| "Completed" / "completed" / "DONE" | value_counts() splits one status into 3; filter for "Completed" misses the others | .map() with canonical dictionary |
| Date stored as string | Cannot filter by month, calculate duration, or group by week | pd.to_datetime() |
Key Distinctions to Remember
.map() vs .replace():
- .map() — strict: unmapped values → NaN (use for complete recoding)
- .replace() — lenient: unmapped values → unchanged (use for targeted fixes)
errors="coerce" vs errors="raise" in pd.to_numeric() / pd.to_datetime():
- "coerce" — silently converts failures to NaN (safe for production pipelines)
- "raise" — raises an exception at the first failure (useful for debugging)
- "ignore" — leaves failures as-is (deprecated in recent pandas versions, and rarely what you want in a cleaning pipeline)
.ffill() vs .interpolate():
- .ffill() — repeats the last known value (use for categorical or stepwise data); the older .fillna(method="ffill") spelling is deprecated
- .interpolate() — calculates a value between two known points (use for continuous numeric series)
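The contrast is easiest to see side by side on a small made-up series with a two-value gap:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

stepwise = s.ffill()       # 1, 1, 1, 4 — repeats the last known value
smooth = s.interpolate()   # 1, 2, 3, 4 — evenly spaced between the known points
```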
Acme Corp Cleaning Results (Chapter 12 Case Study)
| Issue | Rows Affected | Fix Applied | Before | After |
|---|---|---|---|---|
| Duplicate rows | 8 removed | .drop_duplicates() | 2500 rows | 2492 rows |
| Missing region | 13 filled | .fillna() with confirmed values | 13 NaN | 0 NaN |
| Missing category | 5 filled | SKU-based lookup | 5 NaN | 0 NaN |
| Category capitalization | 10 rows fixed | .str.strip().str.title() | 6 variants | 4 canonical values |
| Net impact on Office Chairs revenue | — | — | $412,345 | $417,223 (+$4,878) |
One-Liner Summary
Cleaning data is a systematic, documented, repeatable process — not a one-time scramble. The tools are straightforward; the discipline is what separates professional work from analysis that quietly produces wrong answers.