Key Takeaways: Data Wrangling — Cleaning and Preparing Real Data

One-Sentence Summary

Data cleaning is the unglamorous foundation of all analysis — handling missing values, fixing inconsistencies, removing duplicates, transforming variables, and documenting every decision so your results are transparent and reproducible.

Core Concepts at a Glance

Concept | Definition | Why It Matters
Missing data mechanisms | MCAR, MAR, MNAR — why data is missing | Determines which handling strategies are appropriate
Imputation | Replacing missing values with estimates | Preserves sample size but introduces assumptions
Tidy data | Each variable a column, each observation a row, each value a cell | The standard structure for analysis-ready data
Reproducibility | Others can follow your steps to the same result | The foundation of trustworthy, transparent analysis
Feature engineering | Creating new variables from existing ones | Captures patterns that raw variables might hide

Decision Guide: Handling Missing Data

Missing data detected in a column
│
├── How much is missing?
│   ├── < 5% → Simple methods usually fine
│   ├── 5-20% → Think carefully about mechanism
│   └── > 20% → Consider dropping the column or using advanced methods
│
├── What TYPE of variable?
│   ├── Numerical + symmetric distribution → Mean imputation
│   ├── Numerical + skewed distribution → Median imputation
│   └── Categorical → Mode imputation
│
├── What's the MECHANISM?
│   ├── MCAR → Deletion or imputation both OK
│   ├── MAR → Group-based imputation preferred
│   └── MNAR → No simple fix; document limitation
│
└── ALWAYS: Create a missing-data flag + document your decision
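The "flag first, then impute" rule at the bottom of the tree can be sketched in pandas; the column and data below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data).
df = pd.DataFrame({'age': [34, np.nan, 51, 29]})

# 1. Flag first, so the missingness pattern survives imputation.
df['age_was_missing'] = df['age'].isna()

# 2. Then impute; median is a safe default for a possibly skewed numeric column.
df['age'] = df['age'].fillna(df['age'].median())
```

The flag column costs almost nothing and lets you later test whether missingness itself predicts the outcome.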

Missing Data Mechanisms

Mechanism | What It Means | Example | Safe to Delete?
MCAR | Missingness unrelated to any variable | Coffee spilled on random forms | Yes (remaining data still representative)
MAR | Missingness predictable from other observed variables | Older patients more likely to have missing vaccination records | Risky — deletion biases the sample toward younger patients
MNAR | Missingness related to the missing value itself | High-income people skip the income question | No — the remaining data systematically misrepresents the population

Imputation Methods

Method | How | Pros | Cons
Listwise deletion | Drop rows with any NaN | Simple; no assumptions | Loses data; biased if not MCAR
Mean imputation | Replace NaN with column mean | Simple; preserves the mean | Reduces variance; biased for skewed data
Median imputation | Replace NaN with column median | Better for skewed data; resistant to outliers | Reduces variance
Mode imputation | Replace NaN with most common value | Works for categorical data | May overrepresent the dominant category
Group-based | Impute using group mean/median | Respects data structure | Requires a meaningful grouping variable
Flagging | Impute + create "was_missing" indicator | Preserves information about missingness | Adds a column; the flag only helps if you actually analyze it
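The group-based row of the table can be sketched with groupby plus transform; the columns and values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical vitals: impute missing weight within each sex group,
# rather than with one overall median that blurs the group difference.
df = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'weight': [60.0, np.nan, 70.0, 80.0, 90.0, np.nan],
})

# transform returns a Series aligned to the original rows,
# so each NaN is filled with its own group's median.
df['weight'] = df.groupby('sex')['weight'].transform(lambda s: s.fillna(s.median()))
```

Here the F row is filled with 65 and the M row with 85, instead of both getting the pooled median.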

Data Quality Checklist

Use this checklist every time you start a new data analysis:

  • [ ] Run .info() — Check data types and non-null counts
  • [ ] Run .describe() — Look for impossible min/max, suspicious means, large standard deviations
  • [ ] Run .value_counts() on categorical columns — Check for inconsistent categories
  • [ ] Check for duplicates — df.duplicated().sum()
  • [ ] Count missing values — df.isna().sum() and percentages
  • [ ] Check for placeholder values — 999, -1, -99, 0 in places where they shouldn't be
  • [ ] Verify date columns are datetime type, not strings
  • [ ] Check zip codes, phone numbers, IDs are strings, not numbers
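The first few checklist items can be run in one pass with a small helper; the function name and sample columns below are my own, not from the chapter:

```python
import pandas as pd

def quick_audit(df):
    """Summarize dtype, missing count, and missing percentage per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
    })

# Illustrative frame: note the zip code stored as a string, not a number.
df = pd.DataFrame({'zip': ['02139', None], 'temp_f': [98.6, 103.2]})
audit = quick_audit(df)
```

One glance at the result catches wrong dtypes and missing-value hot spots before any cleaning begins.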

The Data Cleaning Pipeline

Follow this order for every project:

1. ASSESS    → .info(), .describe(), .value_counts(), .duplicated()
      ↓
2. DEDUPLICATE → .drop_duplicates()
      ↓
3. FIX TYPES  → pd.to_datetime(), .astype()
      ↓
4. STANDARDIZE TEXT → .str.lower(), .str.strip(), .replace()
      ↓
5. FIX IMPOSSIBLE VALUES → Set out-of-range values to NaN
      ↓
6. HANDLE MISSING DATA → Flag first, then impute or drop
      ↓
7. CREATE NEW VARIABLES → Binning, calculations, recoding
      ↓
8. VERIFY    → Final .info(), .describe(), .isna().sum()
      ↓
9. DOCUMENT  → Write the cleaning log
      ↓
10. SAVE     → .to_csv('cleaned_data.csv')
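Steps 2 through 6 of the pipeline condense to a few lines of pandas; the column names and the 999 placeholder rule below are illustrative, not from the chapter:

```python
import numpy as np
import pandas as pd

# Made-up survey data: a duplicate row, padded/inconsistent text,
# a string date, and 999 used as a placeholder for "unknown age".
df = pd.DataFrame({
    'signup': ['2023-01-05', '2023-01-05', '2023-02-10'],
    'city':   [' Boston ', ' Boston ', 'chicago'],
    'age':    [34.0, 34.0, 999.0],
})

df = df.drop_duplicates()                        # 2. deduplicate
df['signup'] = pd.to_datetime(df['signup'])      # 3. fix types
df['city'] = df['city'].str.strip().str.lower()  # 4. standardize text
df.loc[df['age'] > 120, 'age'] = np.nan          # 5. impossible values -> NaN
df['age_was_missing'] = df['age'].isna()         # 6. flag first, then impute
df['age'] = df['age'].fillna(df['age'].median())
```

Note the order: fixing impossible values before handling missing data matters, because a 999 left in place would poison the median used for imputation.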

Tidy Data Rules

Rule | Violation Example | Fix
Each variable → its own column | "120/80" in a blood_pressure column | Split into systolic and diastolic columns
Each observation → its own row | Patient row with Jan_Temp, Feb_Temp, Mar_Temp columns | Reshape to long format: Patient, Month, Temperature
Each value → its own cell | "flu, pneumonia" in a diagnosis cell | Separate rows or separate columns
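The second violation (months spread across columns) is fixed with pd.melt; the data below is made up to mirror the table's example:

```python
import pandas as pd

# Wide layout: one row per patient, one column per month (not tidy).
wide = pd.DataFrame({
    'patient':  ['A', 'B'],
    'Jan_Temp': [98.6, 99.1],
    'Feb_Temp': [98.7, 98.4],
})

# Reshape to long/tidy: one row per patient-month observation.
long = pd.melt(wide, id_vars='patient',
               var_name='month', value_name='temperature')
long['month'] = long['month'].str.replace('_Temp', '', regex=False)
```

After melting, "month" is a real variable you can group and plot on, instead of a fact hidden in column names.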

Python Quick Reference

import pandas as pd
import numpy as np

# --- Assessment ---
df.info()                           # Types, non-null counts
df.describe()                       # Summary stats (spot impossibles)
df.isna().sum()                     # Missing count per column
df.isna().mean() * 100              # Missing percentage per column
df.duplicated().sum()               # Count duplicate rows
df['col'].value_counts()            # Category frequencies

# --- Cleaning ---
df.drop_duplicates()                # Remove duplicate rows
df.dropna()                         # Drop rows with any NaN
df.dropna(subset=['col'])           # Drop rows where 'col' is NaN
df['col'].fillna(value)             # Fill NaN with value
df['col'].fillna(df['col'].median()) # Fill with median
df['col'].str.lower().str.strip()   # Standardize text
df['col'].replace({'old': 'new'})   # Map values

# --- Transformation ---
pd.cut(df['col'], bins=..., labels=...)   # Equal-width bins
pd.qcut(df['col'], q=4, labels=...)       # Equal-frequency bins
pd.to_datetime(df['col'])                  # Convert to datetime
df.loc[condition, 'col'] = new_value       # Conditional assignment

# --- Reshaping ---
pd.melt(df, id_vars=..., value_vars=...)   # Wide → long
df.pivot(index=..., columns=..., values=...) # Long → wide

# REMEMBER: Most methods return a NEW object.
# Always assign the result back: df['col'] = df['col'].fillna(0)

Excel Quick Reference

Task | Method
Count blanks | =COUNTBLANK(range)
Remove spaces | =TRIM(cell)
Remove non-printable chars | =CLEAN(cell)
Standardize case | =LOWER(cell), =UPPER(cell), =PROPER(cell)
Replace values | Ctrl+H (Find & Replace)
Highlight blanks | Conditional Formatting > Blanks
Highlight duplicates | Conditional Formatting > Duplicate Values
Remove duplicates | Data > Remove Duplicates
Restrict future entries | Data > Data Validation
Handle errors | =IFERROR(formula, "")

Key Terms

Term | Definition
Missing data (NA/NaN) | Values absent from the dataset — blank cells, null values
MCAR | Missing Completely at Random — missingness unrelated to any variable
MAR | Missing at Random — missingness predictable from other observed variables
MNAR | Missing Not at Random — missingness related to the missing value itself
Imputation | Replacing missing values with estimated values
Data cleaning | Detecting and correcting errors, inconsistencies, and quality issues
Data wrangling | Transforming raw data into analysis-ready format (broader than cleaning)
Tidy data | Each variable a column, each observation a row, each value a cell
Recoding | Creating a new variable with different values or categories
Binning | Dividing a continuous variable into discrete intervals
Feature engineering | Creating new variables from existing ones
Data quality | Accuracy, completeness, consistency, and fitness for purpose
Reproducibility | Others can follow your steps and get the same results
Cleaning log | Documentation of all data cleaning decisions and justifications
Sensitivity analysis | Testing whether conclusions change under different cleaning choices

Common Pandas Pitfalls

Pitfall | Wrong | Right
Not saving the result | df['col'].fillna(0) | df['col'] = df['col'].fillna(0)
Missing .str accessor | df['col'].lower() | df['col'].str.lower()
Modifying a view | df_sub = df[mask], then modify | df_sub = df[mask].copy(), then modify
Replacing values with stray spaces | df['col'].replace('M', 'Male') misses ' M' | Strip first: df['col'].str.strip().replace('M', 'Male')
NaN forcing ints to float | Filling, then wondering why the dtype is float | Fill, then cast: df['col'].fillna(0).astype(int)
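The view-versus-copy and strip-before-replace pitfalls combine into one short runnable sketch; the data is a toy example:

```python
import pandas as pd

df = pd.DataFrame({'sex': [' M', 'F ', 'M'], 'age': [30, 40, 50]})

# df[mask] may be a view; take an explicit .copy() before modifying
# the subset, so writes are never silently lost (or flagged by pandas).
df_sub = df[df['age'] > 35].copy()

# Strip whitespace first, or ' M' would not match 'M' in the replace map.
df_sub['sex'] = df_sub['sex'].str.strip().replace({'M': 'Male', 'F': 'Female'})
```

The original frame is untouched; only the explicit copy carries the recoded values.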

The One Thing to Remember

If you forget everything else from this chapter, remember this:

Every data cleaning decision is a judgment call that shapes your conclusions. There is no "neutral" cleaning — every choice to delete, impute, recode, or transform changes the data and the story it tells. The ethical move is to make your choices transparently, document them in a cleaning log, and test whether different choices would lead to different conclusions. Reproducibility isn't a luxury — it's the price of being trustworthy.