Key Takeaways: Cleaning Messy Data
This is your reference card for Chapter 8. Keep it open while you're cleaning data — you'll reach for it constantly during the first few months, and occasionally for years after that.
The Core Insight
Cleaning IS analysis. Every cleaning decision — dropping rows, filling values, merging categories — encodes an assumption about the data. Treat cleaning decisions with the same rigor you'd bring to statistical tests, because they shape your results just as much.
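To make that concrete, here is a minimal sketch using a hypothetical income column with toy numbers: three defensible missing-value choices produce three different "average incomes" from the same file.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): income is missing for two respondents.
df = pd.DataFrame({"income": [22_000, np.nan, 95_000, 41_000, np.nan, 38_000]})

print(df["income"].dropna().mean())                       # 49000.0 -> "average income is 49k"
print(df["income"].fillna(df["income"].median()).mean())  # ~45833  -> "average income is ~46k"
print(df["income"].fillna(0).mean())                      # ~32667  -> "average income is ~33k"
```

Same raw data, three conclusions. None of these choices is a syntax error; the difference lives entirely in the assumption each one encodes.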
The Five Types of Data Mess
| Type | What It Looks Like | Primary Tools |
|---|---|---|
| Missing values | Blank cells, NaN, None, "N/A", empty strings | isnull(), dropna(), fillna() |
| Duplicate records | Same row appearing twice; same entity with different IDs | duplicated(), drop_duplicates() |
| Type errors | Numbers stored as strings; dates unparsed; ZIP codes as integers | pd.to_numeric(), pd.to_datetime(), astype() |
| Inconsistent categories | "Male" / "male" / "M"; "United States" / "US" / "USA" | .str.lower(), .str.strip(), replace(), map() |
| Outliers/implausible values | Negative ages, 999-degree temperatures, 250-year-old patients | Domain-based rules, IQR method, z-scores |
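The outlier tools in the last row are the only ones in this table not demonstrated in the quick reference later in this card, so here is a minimal sketch (toy ages, standard 1.5×IQR and 3-sigma thresholds) of all three approaches side by side:

```python
import pandas as pd

# Toy data: ages with one implausible entry (hypothetical example).
ages = pd.Series([23, 31, 27, 45, 38, 29, 250, 41])

# IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Z-score method: flag values more than 3 standard deviations from the mean.
# Note: in this tiny sample the extreme value inflates the std enough that
# its |z| stays below 3 -- a known weakness of z-scores on small data.
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[z.abs() > 3]

# Domain-based rule: no patient is older than ~120.
domain_outliers = ages[(ages < 0) | (ages > 120)]
```

Here the IQR and domain rules both catch the 250, while the z-score misses it; this is why the table lists domain-based rules first.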
Missing Value Decision Framework
```
Is the data missing for a random reason unrelated to the value?
|
+--> Yes (MCAR): Deletion is safe. Imputation also works.
|
+--> No: Is the missingness explained by OTHER observed variables?
     |
     +--> Yes (MAR): Use group-based imputation.
     |    Fill with the median/mean of the relevant group.
     |
     +--> No (MNAR): The missingness is related to the
          missing value itself. Proceed with extreme caution.
          Flag missingness. Run sensitivity analysis.
```
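A minimal sketch of the two non-MCAR branches, assuming a hypothetical survey where income is missing more often in one employment group (MAR), with a flag kept for the sensitivity analysis the MNAR branch calls for:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "employment": ["full", "full", "part", "part", "part", "full"],
    "income": [52_000, 48_000, np.nan, 19_000, np.nan, 55_000],
})

# MAR branch: missingness tracks an observed variable (employment),
# so fill each gap with the median of the relevant group.
df["income_imputed"] = df.groupby("employment")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Precaution for possible MNAR: keep a flag so the imputation stays
# visible downstream and results can be re-run with/without those rows.
df["income_was_missing"] = df["income"].isnull()
print(df.groupby("income_was_missing")["income_imputed"].mean())
```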
Missing Value Strategies at a Glance
| Strategy | When to Use | Watch Out For |
|---|---|---|
| Drop rows (dropna) | < 5% missing, MCAR, plenty of data | Removes people, not just numbers; check who you're losing |
| Fill with mean | Normally distributed numeric data, MCAR/MAR | Sensitive to outliers; reduces variance |
| Fill with median | Skewed data or data with outliers | Better than mean for skewed distributions |
| Fill with mode | Categorical data | Arbitrary if no clear mode |
| Fill with constant (0, "Unknown") | When you want to preserve the fact that data was missing | Zero can be confused with a real measurement |
| Group-based imputation | MAR data with known group structure | Requires enough non-missing data per group |
| Forward/backward fill | Time series data | Assumes values don't change suddenly |
| Flag + impute | When missingness is itself informative | Adds a column; slightly more complex |
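The last row, flag + impute, is the one strategy in this table that combines two steps, so here is a minimal sketch of the pattern using a hypothetical blood-pressure column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"blood_pressure": [120, np.nan, 135, np.nan, 118]})

# Step 1: record WHICH rows were missing before touching the values.
df["bp_missing"] = df["blood_pressure"].isnull()

# Step 2: impute (median here, since it is robust to outliers).
df["blood_pressure"] = df["blood_pressure"].fillna(df["blood_pressure"].median())

# The flag column lets later models treat "was missing" as a feature,
# and lets you re-run the analysis excluding imputed rows.
```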
Data Cleaning Workflow Checklist
Use this every time you start cleaning a new dataset (a consolidated audit sketch follows the list):
- [ ] Load and inspect: `df.shape`, `df.dtypes`, `df.head()`, `df.info()`, `df.describe()`
- [ ] Check missing values: `df.isnull().sum()` and `df.isnull().mean() * 100`
- [ ] Check duplicates: `df.duplicated().sum()` and `df.duplicated(subset=key_cols).sum()`
- [ ] Check data types: Are numeric columns numeric? Are dates parsed? Are IDs stored as strings?
- [ ] Check categorical consistency: `df[col].value_counts()` for each categorical column
- [ ] Check value ranges: For numeric columns, look at min/max for implausible values
- [ ] Handle duplicates (usually first, since it affects all subsequent counts)
- [ ] Fix data types (before imputation; you can't compute the median of strings)
- [ ] Handle missing values (strategy depends on the type and mechanism of missingness)
- [ ] Standardize categories (after type fixes, before analysis)
- [ ] Handle outliers (after everything else; it needs domain knowledge)
- [ ] Document everything in a cleaning log
- [ ] Verify: Re-run `df.shape`, `df.dtypes`, and `df.isnull().sum()` after cleaning
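The inspection half of the checklist can be bundled into one helper. This is a sketch, not a fixed recipe; the `audit` name and `key_cols` parameter are illustrative, and the column selections should be adapted to your dataset:

```python
import pandas as pd

def audit(df, key_cols=None):
    """Run the inspection steps of the cleaning checklist (sketch)."""
    print("shape:", df.shape)
    print("\ndtypes:\n", df.dtypes)
    print("\npercent missing:\n", (df.isnull().mean() * 100).round(1))
    print("\nexact duplicates:", df.duplicated().sum())
    if key_cols:
        print("key-based duplicates:", df.duplicated(subset=key_cols).sum())
    for col in df.select_dtypes(include="object"):
        print(f"\n{col} value counts:\n", df[col].value_counts().head(10))
    num = df.select_dtypes(include="number")
    if not num.empty:
        print("\nnumeric ranges:\n", num.agg(["min", "max"]))
```

Called as, say, `audit(df, key_cols=["patient_id"])` (a hypothetical key), it covers the first six checkboxes; the remaining steps require human judgment and belong in your cleaning log.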
Quick Reference: Essential Methods
Missing Values
```python
df.isnull().sum()                    # Count missing per column
df.isnull().mean() * 100             # Percentage missing
df.dropna()                          # Drop rows with any NaN
df.dropna(subset=['col1', 'col2'])   # Drop where specific cols are NaN
df['col'].fillna(value)              # Fill with constant/statistic
df.groupby('grp')['col'].transform(  # Group-based imputation
    lambda x: x.fillna(x.median()))
```
Duplicates
```python
df.duplicated().sum()                # Count exact duplicates
df.duplicated(subset=['id']).sum()   # Count key-based duplicates
df.drop_duplicates()                 # Remove exact duplicates
df.drop_duplicates(subset=['id'])    # Remove key-based duplicates
```
Type Conversion
```python
pd.to_numeric(df['col'], errors='coerce')   # String -> number (safe)
pd.to_datetime(df['col'], errors='coerce')  # String -> datetime (safe)
df['col'].astype(float)                     # Direct cast (raises on error)
df['col'].astype('Int64')                   # Nullable integer (handles NaN)
```
Category Standardization
```python
df['col'].str.strip()              # Remove whitespace
df['col'].str.lower()              # Lowercase
df['col'].str.title()              # Title Case
df['col'].replace({'old': 'new'})  # Map specific values
df['col'].map(mapping_dict)        # Map all values (unmapped -> NaN)
```
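A minimal end-to-end sketch combining these methods on the gender example from the table of mess types (the mapping values are illustrative):

```python
import pandas as pd

s = pd.Series([" Male", "male", "M", "FEMALE", "f ", "unknown"])

# Normalize whitespace and case first so the mapping stays short.
cleaned = s.str.strip().str.lower()

# Explicit mapping: anything unmapped becomes NaN, which surfaces
# categories you forgot instead of silently passing them through.
mapping = {"male": "M", "m": "M", "female": "F", "f": "F"}
standardized = cleaned.map(mapping)

print(cleaned[standardized.isnull()])  # review unmapped values ("unknown")
```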
Five Ethical Questions for Every Cleaning Decision
- Who is missing? Whose data am I removing, and are they different from those who remain?
- What am I assuming? What hypothesis does this cleaning operation encode?
- Could this change my answer? Would different cleaning choices lead to different conclusions?
- Am I erasing real variation? Could this outlier or unusual category reflect a genuine phenomenon?
- Is this reproducible? Would another analyst make the same choice given the same information?
What You Should Be Able to Do Now
- [ ] Audit a new dataset for all five types of data quality problems in under 10 minutes
- [ ] Choose between dropping, imputing, and flagging missing values — and justify the choice
- [ ] Convert columns to correct types using `pd.to_numeric`, `pd.to_datetime`, and `astype`
- [ ] Standardize messy categorical columns using string methods and explicit mappings
- [ ] Detect outliers using domain rules and statistical methods (IQR, z-scores)
- [ ] Write a cleaning log that documents every transformation with its rationale
- [ ] Explain why "cleaning IS analysis" — and give a concrete example of how a cleaning decision can change analytical conclusions
If any of these feel shaky, revisit the relevant section before moving on. The skills in this chapter are used in every subsequent chapter — they're the foundation of everything that follows.