Key Takeaways: Data Wrangling — Cleaning and Preparing Real Data

One-Sentence Summary

Data cleaning is the unglamorous foundation of all analysis — handling missing values, fixing inconsistencies, removing duplicates, transforming variables, and documenting every decision so your results are transparent and reproducible.

Core Concepts at a Glance

Concept | Definition | Why It Matters
Missing data mechanisms | MCAR, MAR, MNAR — why data is missing | Determines which handling strategies are appropriate
Imputation | Replacing missing values with estimates | Preserves sample size but introduces assumptions
Tidy data | Each variable a column, each observation a row, each value a cell | The standard structure for analysis-ready data
Reproducibility | Others can follow your steps to the same result | The foundation of trustworthy, transparent analysis
Feature engineering | Creating new variables from existing ones | Captures patterns that raw variables might hide

Decision Guide: Handling Missing Data

Missing data detected in a column
│
├── How much is missing?
│   ├── < 5% → Simple methods usually fine
│   ├── 5-20% → Think carefully about mechanism
│   └── > 20% → Consider dropping the column or using advanced methods
│
├── What TYPE of variable?
│   ├── Numerical + symmetric distribution → Mean imputation
│   ├── Numerical + skewed distribution → Median imputation
│   └── Categorical → Mode imputation
│
├── What's the MECHANISM?
│   ├── MCAR → Deletion or imputation both OK
│   ├── MAR → Group-based imputation preferred
│   └── MNAR → No simple fix; document limitation
│
└── ALWAYS: Create a missing-data flag + document your decision
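The "flag first, then impute" rule at the bottom of the tree can be sketched in pandas; the column and data below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data).
df = pd.DataFrame({'age': [34, np.nan, 51, 29]})

# 1. Flag first, so the missingness pattern survives imputation.
df['age_was_missing'] = df['age'].isna()

# 2. Then impute; median is a safe default for a possibly skewed numeric column.
df['age'] = df['age'].fillna(df['age'].median())
```

The flag column costs almost nothing and lets you later test whether missingness itself predicts the outcome.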

Missing Data Mechanisms

Mechanism | What It Means | Example | Safe to Delete?
MCAR | Missingness unrelated to any variable | Coffee spilled on random forms | Yes (remaining data still representative)
MAR | Missingness predictable from other observed variables | Older patients more likely to have missing vaccination records | Risky — deletion biases the sample toward younger patients
MNAR | Missingness related to the missing value itself | High-income people skip the income question | No — the remaining data systematically misrepresents the population

Imputation Methods

Method | How | Pros | Cons
Listwise deletion | Drop rows with any NaN | Simple; no assumptions | Loses data; biased if not MCAR
Mean imputation | Replace NaN with column mean | Simple; preserves the mean | Reduces variance; biased for skewed data
Median imputation | Replace NaN with column median | Better for skewed data; resistant to outliers | Reduces variance
Mode imputation | Replace NaN with most common value | Works for categorical data | May overrepresent the dominant category
Group-based | Impute using group mean/median | Respects data structure | Requires a meaningful grouping variable
Flagging | Impute + create "was_missing" indicator | Preserves information about missingness | Adds a column; the flag only helps if you actually analyze it
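The group-based row of the table can be sketched with groupby plus transform; the columns and values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical vitals: impute missing weight within each sex group,
# rather than with one overall median that blurs the group difference.
df = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'weight': [60.0, np.nan, 70.0, 80.0, 90.0, np.nan],
})

# transform returns a Series aligned to the original rows,
# so each NaN is filled with its own group's median.
df['weight'] = df.groupby('sex')['weight'].transform(lambda s: s.fillna(s.median()))
```

Here the F row is filled with 65 and the M row with 85, instead of both getting the pooled median.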

Data Quality Checklist

Use this checklist every time you start a new data analysis:

  • [ ] Run .info() — Check data types and non-null counts
  • [ ] Run .describe() — Look for impossible min/max, suspicious means, large standard deviations
  • [ ] Run .value_counts() on categorical columns — Check for inconsistent categories
  • [ ] Check for duplicates — df.duplicated().sum()
  • [ ] Count missing values — df.isna().sum() and percentages
  • [ ] Check for placeholder values — 999, -1, -99, 0 in places where they shouldn't be
  • [ ] Verify date columns are datetime type, not strings
  • [ ] Check zip codes, phone numbers, IDs are strings, not numbers
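The first few checklist items can be run in one pass with a small helper; the function name and sample columns below are my own, not from the chapter:

```python
import pandas as pd

def quick_audit(df):
    """Summarize dtype, missing count, and missing percentage per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
    })

# Illustrative frame: note the zip code stored as a string, not a number.
df = pd.DataFrame({'zip': ['02139', None], 'temp_f': [98.6, 103.2]})
audit = quick_audit(df)
```

One glance at the result catches wrong dtypes and missing-value hot spots before any cleaning begins.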

The Data Cleaning Pipeline

Follow this order for every project:

1. ASSESS    → .info(), .describe(), .value_counts(), .duplicated()
      ↓
2. DEDUPLICATE → .drop_duplicates()
      ↓
3. FIX TYPES  → pd.to_datetime(), .astype()
      ↓
4. STANDARDIZE TEXT → .str.lower(), .str.strip(), .replace()
      ↓
5. FIX IMPOSSIBLE VALUES → Set out-of-range values to NaN
      ↓
6. HANDLE MISSING DATA → Flag first, then impute or drop
      ↓
7. CREATE NEW VARIABLES → Binning, calculations, recoding
      ↓
8. VERIFY    → Final .info(), .describe(), .isna().sum()
      ↓
9. DOCUMENT  → Write the cleaning log
      ↓
10. SAVE     → .to_csv('cleaned_data.csv')
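Steps 2 through 6 of the pipeline condense to a few lines of pandas; the column names and the 999 placeholder rule below are illustrative, not from the chapter:

```python
import numpy as np
import pandas as pd

# Made-up survey data: a duplicate row, padded/inconsistent text,
# a string date, and 999 used as a placeholder for "unknown age".
df = pd.DataFrame({
    'signup': ['2023-01-05', '2023-01-05', '2023-02-10'],
    'city':   [' Boston ', ' Boston ', 'chicago'],
    'age':    [34.0, 34.0, 999.0],
})

df = df.drop_duplicates()                        # 2. deduplicate
df['signup'] = pd.to_datetime(df['signup'])      # 3. fix types
df['city'] = df['city'].str.strip().str.lower()  # 4. standardize text
df.loc[df['age'] > 120, 'age'] = np.nan          # 5. impossible values -> NaN
df['age_was_missing'] = df['age'].isna()         # 6. flag first, then impute
df['age'] = df['age'].fillna(df['age'].median())
```

Note the order: fixing impossible values before handling missing data matters, because a 999 left in place would poison the median used for imputation.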

Tidy Data Rules

Rule | Violation Example | Fix
Each variable → its own column | "120/80" in a blood_pressure column | Split into systolic and diastolic columns
Each observation → its own row | Patient row with Jan_Temp, Feb_Temp, Mar_Temp columns | Reshape to long format: Patient, Month, Temperature
Each value → its own cell | "flu, pneumonia" in a diagnosis cell | Separate rows or separate columns
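The second violation (months spread across columns) is fixed with pd.melt; the data below is made up to mirror the table's example:

```python
import pandas as pd

# Wide layout: one row per patient, one column per month (not tidy).
wide = pd.DataFrame({
    'patient':  ['A', 'B'],
    'Jan_Temp': [98.6, 99.1],
    'Feb_Temp': [98.7, 98.4],
})

# Reshape to long/tidy: one row per patient-month observation.
long = pd.melt(wide, id_vars='patient',
               var_name='month', value_name='temperature')
long['month'] = long['month'].str.replace('_Temp', '', regex=False)
```

After melting, "month" is a real variable you can group and plot on, instead of a fact hidden in column names.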

Python Quick Reference

import pandas as pd
import numpy as np

# --- Assessment ---
df.info()                           # Types, non-null counts
df.describe()                       # Summary stats (spot impossibles)
df.isna().sum()                     # Missing count per column
df.isna().mean() * 100              # Missing percentage per column
df.duplicated().sum()               # Count duplicate rows
df['col'].value_counts()            # Category frequencies

# --- Cleaning ---
df.drop_duplicates()                # Remove duplicate rows
df.dropna()                         # Drop rows with any NaN
df.dropna(subset=['col'])           # Drop rows where 'col' is NaN
df['col'].fillna(value)             # Fill NaN with value
df['col'].fillna(df['col'].median()) # Fill with median
df['col'].str.lower().str.strip()   # Standardize text
df['col'].replace({'old': 'new'})   # Map values

# --- Transformation ---
pd.cut(df['col'], bins=..., labels=...)   # Equal-width bins
pd.qcut(df['col'], q=4, labels=...)       # Equal-frequency bins
pd.to_datetime(df['col'])                  # Convert to datetime
df.loc[condition, 'col'] = new_value       # Conditional assignment

# --- Reshaping ---
pd.melt(df, id_vars=..., value_vars=...)   # Wide → long
df.pivot(index=..., columns=..., values=...) # Long → wide

# REMEMBER: Most methods return a NEW object.
# Always assign the result back: df['col'] = df['col'].fillna(0)

Excel Quick Reference

Task | Method
Count blanks | =COUNTBLANK(range)
Remove spaces | =TRIM(cell)
Remove non-printable chars | =CLEAN(cell)
Standardize case | =LOWER(cell), =UPPER(cell), =PROPER(cell)
Replace values | Ctrl+H (Find & Replace)
Highlight blanks | Conditional Formatting > Blanks
Highlight duplicates | Conditional Formatting > Duplicate Values
Remove duplicates | Data > Remove Duplicates
Restrict future entries | Data > Data Validation
Handle errors | =IFERROR(formula, "")

Key Terms

Term | Definition
Missing data (NA/NaN) | Values absent from the dataset — blank cells, null values
MCAR | Missing Completely at Random — missingness unrelated to any variable
MAR | Missing at Random — missingness predictable from other observed variables
MNAR | Missing Not at Random — missingness related to the missing value itself
Imputation | Replacing missing values with estimated values
Data cleaning | Detecting and correcting errors, inconsistencies, and quality issues
Data wrangling | Transforming raw data into analysis-ready format (broader than cleaning)
Tidy data | Each variable a column, each observation a row, each value a cell
Recoding | Creating a new variable with different values or categories
Binning | Dividing a continuous variable into discrete intervals
Feature engineering | Creating new variables from existing ones
Data quality | Accuracy, completeness, consistency, and fitness for purpose
Reproducibility | Others can follow your steps and get the same results
Cleaning log | Documentation of all data cleaning decisions and justifications
Sensitivity analysis | Testing whether conclusions change under different cleaning choices

Common Pandas Pitfalls

Pitfall | Wrong | Right
Not saving the result | df['col'].fillna(0) | df['col'] = df['col'].fillna(0)
Missing .str accessor | df['col'].lower() | df['col'].str.lower()
Modifying a view | df_sub = df[mask], then modify | df_sub = df[mask].copy(), then modify
Replacing values with stray spaces | df['col'].replace('M', 'Male') misses ' M' | Strip first: df['col'].str.strip().replace('M', 'Male')
NaN forcing ints to float | Filling, then wondering why the dtype is float | Fill, then cast: df['col'].fillna(0).astype(int)
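The view-versus-copy and strip-before-replace pitfalls combine into one short runnable sketch; the data is a toy example:

```python
import pandas as pd

df = pd.DataFrame({'sex': [' M', 'F ', 'M'], 'age': [30, 40, 50]})

# df[mask] may be a view; take an explicit .copy() before modifying
# the subset, so writes are never silently lost (or flagged by pandas).
df_sub = df[df['age'] > 35].copy()

# Strip whitespace first, or ' M' would not match 'M' in the replace map.
df_sub['sex'] = df_sub['sex'].str.strip().replace({'M': 'Male', 'F': 'Female'})
```

The original frame is untouched; only the explicit copy carries the recoded values.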

The One Thing to Remember

If you forget everything else from this chapter, remember this:

Every data cleaning decision is a judgment call that shapes your conclusions. There is no "neutral" cleaning — every choice to delete, impute, recode, or transform changes the data and the story it tells. The ethical move is to make your choices transparently, document them in a cleaning log, and test whether different choices would lead to different conclusions. Reproducibility isn't a luxury — it's the price of being trustworthy.