Key Takeaways: Cleaning Messy Data

This is your reference card for Chapter 8. Keep it open while you're cleaning data — you'll reach for it constantly during the first few months, and occasionally for years after that.


The Core Insight

Cleaning IS analysis. Every cleaning decision — dropping rows, filling values, merging categories — encodes an assumption about the data. Treat cleaning decisions with the same rigor you'd bring to statistical tests, because they shape your results just as much.
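To make that concrete, here is a minimal sketch (pandas, with invented salary numbers) showing two defensible-looking cleaning choices that yield noticeably different answers:

```python
import pandas as pd
import numpy as np

# Hypothetical survey where some respondents did not report salary.
# The numbers are illustrative, not from the chapter.
df = pd.DataFrame({"salary": [30_000, 32_000, np.nan, np.nan, 90_000, 95_000]})

# Decision A: drop the non-respondents.
mean_dropped = df["salary"].dropna().mean()   # 61750.0

# Decision B: treat missing as zero income.
mean_zero = df["salary"].fillna(0).mean()     # ~41166.67

# Same raw data, two cleaning choices, a ~20k gap in the "average salary"
# you would report. The choice IS an analytical claim about non-respondents.
print(mean_dropped, mean_zero)
```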


The Five Types of Data Mess

Type | What It Looks Like | Primary Tools
Missing values | Blank cells, NaN, None, "N/A", empty strings | isnull(), dropna(), fillna()
Duplicate records | Same row appearing twice; same entity with different IDs | duplicated(), drop_duplicates()
Type errors | Numbers stored as strings; dates unparsed; ZIP codes as integers | pd.to_numeric(), pd.to_datetime(), astype()
Inconsistent categories | "Male" / "male" / "M"; "United States" / "US" / "USA" | .str.lower(), .str.strip(), replace(), map()
Outliers/implausible values | Negative ages, 999-degree temperatures, 250-year-old patients | Domain-based rules, IQR method, z-scores
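The IQR and z-score methods in the last row are not spelled out elsewhere on this card, so here is a hedged sketch (pandas, invented ages) of both:

```python
import pandas as pd

ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 250])  # 250 is implausible

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 3 standard deviations from the mean.
# Caveat: in small samples a single extreme value inflates the std enough
# to mask itself -- here 250 is NOT flagged by the z-score rule.
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[z.abs() > 3]

print(iqr_outliers.tolist())  # [250]
print(z_outliers.tolist())    # []
```

This is one reason the table lists domain-based rules first: "age > 120" catches the problem regardless of sample size.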

Missing Value Decision Framework

Is the data missing for a random reason unrelated to the value?
  |
  +--> Yes (MCAR): Deletion is safe. Imputation also works.
  |
  +--> No
        |
        Is the missingness explained by OTHER observed variables?
          |
          +--> Yes (MAR): Use group-based imputation.
          |     Fill with the median/mean of the relevant group.
          |
          +--> No (MNAR): The missingness is related to the
                missing value itself. Proceed with extreme caution.
                Flag missingness. Run sensitivity analysis.
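For the MNAR branch, "sensitivity analysis" can be as simple as imputing under several explicit assumptions and checking how far the headline number moves. A sketch with invented income values and invented scenario labels:

```python
import pandas as pd
import numpy as np

income = pd.Series([40_000, 55_000, np.nan, 70_000, np.nan, 52_000])

# Each scenario encodes a different story about WHY the values are missing.
scenarios = {
    "observed median": income.median(),
    "pessimistic (half the median)": income.median() / 2,
    "optimistic (90th percentile)": income.quantile(0.9),
}

results = {name: income.fillna(value).mean() for name, value in scenarios.items()}
for name, mean in results.items():
    print(f"{name}: {mean:,.0f}")
# If conclusions flip across scenarios, the missingness matters too much
# to paper over with a single imputation.
```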

Missing Value Strategies at a Glance

Strategy | When to Use | Watch Out For
Drop rows (dropna) | < 5% missing, MCAR, plenty of data | Removes people, not just numbers — check who you're losing
Fill with mean | Normally distributed numeric data, MCAR/MAR | Sensitive to outliers; reduces variance
Fill with median | Skewed data or data with outliers | Still reduces variance; a single fill value distorts the distribution
Fill with mode | Categorical data | Arbitrary if no clear mode
Fill with constant (0, "Unknown") | When you want to preserve the fact that data was missing | Zero can be confused with a real measurement
Group-based imputation | MAR data with known group structure | Requires enough non-missing data per group
Forward/backward fill | Time series data | Assumes values don't change suddenly
Flag + impute | When missingness is itself informative | Adds a column; slightly more complex
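The "flag + impute" row combines two steps: record the missingness before the imputation overwrites it. A minimal sketch (pandas, invented income values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, np.nan, 48_000]})

# Step 1: capture the missingness flag BEFORE filling, or it is lost forever.
df["income_was_missing"] = df["income"].isnull()

# Step 2: impute (median here, since income is typically skewed).
df["income"] = df["income"].fillna(df["income"].median())

# The flag column lets later models and audits see who was imputed.
print(df["income_was_missing"].sum())  # 2
```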

Data Cleaning Workflow Checklist

Use this every time you start cleaning a new dataset:

  • [ ] Load and inspect: df.shape, df.dtypes, df.head(), df.info(), df.describe()
  • [ ] Check missing values: df.isnull().sum() and df.isnull().mean() * 100
  • [ ] Check duplicates: df.duplicated().sum() and df.duplicated(subset=key_cols).sum() (key_cols = your list of identifying columns)
  • [ ] Check data types: Are numeric columns numeric? Are dates parsed? Are IDs stored as strings?
  • [ ] Check categorical consistency: df[col].value_counts() for each categorical column
  • [ ] Check value ranges: For numeric columns, look at min/max for implausible values
  • [ ] Handle duplicates (usually first — it affects all subsequent counts)
  • [ ] Fix data types (before imputation — you can't compute median of strings)
  • [ ] Handle missing values (strategy depends on type and mechanism of missingness)
  • [ ] Standardize categories (after type fixes, before analysis)
  • [ ] Handle outliers (after everything else — needs domain knowledge)
  • [ ] Document everything in a cleaning log
  • [ ] Verify: Re-run df.shape, df.dtypes, df.isnull().sum() after cleaning
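The inspection steps above can be bundled into a small helper you run at load time and again after cleaning. The function name `audit` and its output format are this card's invention, not a pandas API:

```python
import pandas as pd

def audit(df: pd.DataFrame, key_cols=None) -> None:
    """Print a quick data-quality audit covering the checklist steps."""
    print("Shape:", df.shape)
    print("\nDtypes:\n", df.dtypes)
    print("\nMissing (%):\n", (df.isnull().mean() * 100).round(1))
    print("\nExact duplicates:", df.duplicated().sum())
    if key_cols:
        print("Key-based duplicates:", df.duplicated(subset=key_cols).sum())
    # Value ranges for numeric columns (look for implausible min/max).
    for col in df.select_dtypes(include="number"):
        print(f"\n{col}: min={df[col].min()}, max={df[col].max()}")
    # Category spellings (look for "Male"/"male"/"M"-style splits).
    for col in df.select_dtypes(include=["object", "category"]):
        print(f"\n{col} categories:\n", df[col].value_counts().head())

# Tiny demo with invented data: the -1 age and duplicate row should stand out.
audit(pd.DataFrame({"age": [34, 34, -1], "name": ["Ann", "Ann", "Bo"]}),
      key_cols=["name"])
```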

Quick Reference: Essential Methods

Missing Values

df.isnull().sum()                          # Count missing per column
df.isnull().mean() * 100                   # Percentage missing
df.dropna()                                # Drop rows with any NaN
df.dropna(subset=['col1', 'col2'])         # Drop where specific cols are NaN
df['col'].fillna(value)                    # Fill with constant/statistic
df.groupby('grp')['col'].transform(        # Group-based imputation
    lambda x: x.fillna(x.median()))

Duplicates

df.duplicated().sum()                      # Count exact duplicates
df.duplicated(subset=['id']).sum()         # Count key-based duplicates
df.drop_duplicates()                       # Remove exact duplicates
df.drop_duplicates(subset=['id'])          # Remove key-based duplicates

Type Conversion

pd.to_numeric(df['col'], errors='coerce')  # String -> number (safe)
pd.to_datetime(df['col'], errors='coerce') # String -> datetime (safe)
df['col'].astype(float)                    # Direct cast (raises on error)
df['col'].astype('Int64')                  # Nullable integer (handles NaN)
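A quick demonstration of why errors='coerce' and the nullable Int64 dtype pair well (invented strings):

```python
import pandas as pd

raw = pd.Series(["42", "17", "n/a", ""])  # numbers stored as strings

# coerce: unparseable entries become NaN instead of raising an exception,
# so you can count and inspect the failures afterwards.
nums = pd.to_numeric(raw, errors="coerce")
print(nums.isnull().sum())  # 2 values failed to parse

# Plain int64 cannot hold NaN; the nullable Int64 dtype keeps them as <NA>.
ints = nums.astype("Int64")
```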

Category Standardization

df['col'].str.strip()                     # Remove whitespace
df['col'].str.lower()                     # Lowercase
df['col'].str.title()                     # Title Case
df['col'].replace({'old': 'new'})         # Map specific values
df['col'].map(mapping_dict)               # Map all values (unmapped -> NaN)
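These methods are typically chained: normalize whitespace and case first, then collapse the known variants with an explicit mapping. A sketch with invented country labels:

```python
import pandas as pd

s = pd.Series([" USA", "United States", "us ", "U.S.", "Canada"])

# Normalize first so the mapping only needs one spelling per variant.
cleaned = s.str.strip().str.lower().str.replace(".", "", regex=False)

country_map = {"usa": "US", "united states": "US", "us": "US", "canada": "CA"}
standardized = cleaned.map(country_map)  # anything unmapped becomes NaN

# Always audit the NaN count: a nonzero value means a variant you missed.
print(standardized.isnull().sum())  # 0
```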

Five Ethical Questions for Every Cleaning Decision

  1. Who is missing? Whose data am I removing, and are they different from those who remain?
  2. What am I assuming? What hypothesis does this cleaning operation encode?
  3. Could this change my answer? Would different cleaning choices lead to different conclusions?
  4. Am I erasing real variation? Could this outlier or unusual category reflect a genuine phenomenon?
  5. Is this reproducible? Would another analyst make the same choice given the same information?

What You Should Be Able to Do Now

  • [ ] Audit a new dataset for all five types of data quality problems in under 10 minutes
  • [ ] Choose between dropping, imputing, and flagging missing values — and justify the choice
  • [ ] Convert columns to correct types using pd.to_numeric, pd.to_datetime, and astype
  • [ ] Standardize messy categorical columns using string methods and explicit mappings
  • [ ] Detect outliers using domain rules and statistical methods (IQR, z-scores)
  • [ ] Write a cleaning log that documents every transformation with its rationale
  • [ ] Explain why "cleaning IS analysis" — and give a concrete example of how a cleaning decision can change analytical conclusions

If any of these feel shaky, revisit the relevant section before moving on. The skills in this chapter are used in every subsequent chapter — they're the foundation of everything that follows.