In This Chapter
- Chapter Overview
- 12.1 The Data Cleaning Mindset
- 12.2 The Data Cleaning Checklist
- 12.3 Handling Missing Values
- 12.4 Removing Duplicate Rows
- 12.5 Correcting Data Types
- 12.6 String Normalization
- 12.7 Standardizing Categorical Values
- 12.8 Detecting and Handling Outliers
- 12.9 Before and After: The Impact of Cleaning
- 12.10 Documenting Your Cleaning Steps
- 12.11 A Complete Cleaning Pipeline
- 12.12 Chapter Summary
- Key Terms
- What Comes Next
Chapter 12: Cleaning and Preparing Data for Analysis
"Cleaning data is not glamorous. It is also not optional. The quality of every insight you produce is bounded above by the quality of the data going in." — Maya Reyes, writing in her client project notes
Chapter Overview
Chapter 11 taught you how to load data and systematically identify its problems. Chapter 12 teaches you how to fix them. You will learn the complete toolkit for transforming messy, inconsistent, real-world business data into a clean, analysis-ready state.
The problems Priya found in the Acme sales data — missing region values, inconsistent category capitalization, duplicate rows — are not unusual or extreme. They are typical. Every analyst, in every industry, spends a substantial portion of their time on exactly this kind of work. The goal of this chapter is to make you fast, systematic, and confident at it.
By the end of this chapter, you will be able to:
- Apply the data cleaning mindset: understanding why data is messy and how to approach it methodically
- Detect and handle missing values using .isna(), .fillna(), .dropna(), and .interpolate()
- Find and remove duplicate rows with .duplicated() and .drop_duplicates()
- Correct data types using .astype(), pd.to_numeric(), and pd.to_datetime()
- Normalize string values with .str.strip(), .str.lower(), .str.replace(), and .str.contains()
- Detect outliers using the IQR method and z-scores
- Standardize categorical values with .map() and .replace()
- Follow a systematic data cleaning checklist
- Document your cleaning steps for reproducibility
12.1 The Data Cleaning Mindset
Before writing a single line of cleaning code, it helps to understand why data is messy in the first place. Data does not become dirty randomly — it becomes dirty for specific, predictable reasons. Understanding those reasons helps you anticipate what to look for.
Why Business Data Is Always Messy
Human data entry is imperfect. When people type values into a system, they make typos, use inconsistent capitalization, leave fields blank when they are in a hurry, and invent their own abbreviations. The same product category might appear as "Office Chairs," "office chairs," "OFC CHAIRS," and "Chairs - Office" depending on who entered it.
Systems are built at different times by different teams. The CRM was built in 2012. The inventory system was built in 2018. The e-commerce platform was added in 2021. Each system made different decisions about date formats, missing value codes, text casing, and field definitions. When data from these systems is combined, inconsistencies are inevitable.
Data formats evolve. A company might switch from storing prices as integers (in cents) to floats (in dollars) partway through a dataset. Date formats shift from MM/DD/YYYY to YYYY-MM-DD. What was called "Furniture" becomes "Office Furniture" when someone restructures the product catalog.
Exports and integrations introduce artifacts. CRM exports sometimes double-post records. Excel exports from finance teams include currency symbols, thousands separators, and formatting characters. Database backups preserve old deleted records alongside current ones.
Edge cases were not anticipated. The order management system was not designed to handle returns, so return transactions have negative quantities, and someone worked around the limitation by entering "0 units, $0 revenue." The billing system adds "VOID" to cancelled invoices. These edge cases end up in your dataset.
The Cleaning Mindset
Effective data cleaning requires a specific mental approach:
Investigate before you fix. When you find an issue, understand it fully before writing the fix. How many rows are affected? Are there patterns? What does the original value mean? Cleaning a value incorrectly is often worse than leaving it dirty.
Document every change. Keep a record of what you changed, why, and how many rows were affected. This is essential for reproducibility and for answering the inevitable question: "Why does this number look different from last month's report?"
Preserve the original data. Never overwrite the source file. Work on a copy. Store your cleaning code in a script so the transformation from raw data to clean data is always reproducible.
Validate after each step. After each cleaning operation, verify that it had the intended effect. Check the value counts before and after. Check that nulls were filled. Check that types changed correctly.
Know when to consult the source. Some data issues cannot be resolved by looking at the data alone. Missing region values might only be resolvable by asking the salesperson. An ambiguous category might require checking the product catalog. Know when to escalate.
12.2 The Data Cleaning Checklist
Professional analysts work through data cleaning systematically. Here is the checklist that Priya and Maya use — it is the framework for this entire chapter:
- Remove or investigate exact duplicate rows
- Handle missing values (fill, drop, or flag)
- Fix data type issues (convert strings to numbers, strings to dates)
- Normalize string values (strip whitespace, standardize case)
- Standardize categorical values (consolidate equivalent categories)
- Validate numeric ranges (identify and handle outliers)
- Validate relationships between columns (e.g., total_revenue = units_sold * unit_price * (1 - discount_rate))
- Document all cleaning steps
Work through this list in order. Type corrections (step 3) often must happen before string normalization (step 4) is possible. Whitespace stripping (step 4) must happen before category standardization (step 5) is effective.
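To see why the ordering matters, here is a minimal sketch (with invented category strings) of what happens when case standardization runs without stripping whitespace first:

```python
import pandas as pd

# Hypothetical dirty category values: same category, three surface forms
s = pd.Series(["Office Chairs", " office chairs", "OFFICE CHAIRS "])

# Standardizing case WITHOUT stripping leaves stray variants:
# " office chairs" title-cases to " Office Chairs" (leading space intact)
print(s.str.title().nunique())   # still 3 distinct values

# Strip first, then standardize: all three collapse to one value
cleaned = s.str.strip().str.title()
print(cleaned.nunique())         # 1
```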
12.3 Handling Missing Values
Missing values are the most common data quality issue in business data. pandas represents them as NaN (Not a Number) for numeric columns and NaT (Not a Time) for datetime columns. For all types, df.isna() returns True for missing values.
12.3.1 Detecting Missing Values
You learned this in Chapter 11, but it bears repeating because missing value detection is always the first step:
import pandas as pd
# Count missing values per column
print(df.isna().sum())
# Percentage of missing values per column
print((df.isna().sum() / len(df) * 100).round(2))
# Rows where any column has a missing value
df_with_missing = df[df.isna().any(axis=1)]
print(f"{len(df_with_missing)} rows have at least one missing value")
12.3.2 Dropping Rows with Missing Values: .dropna()
The simplest approach — remove any row that has a missing value. This is appropriate only when:
- The percentage of missing rows is very small (typically under 5%)
- The missing values are not systematically different from the present values
- You have enough data that dropping rows does not materially harm your analysis
# Drop all rows with ANY missing value
df_clean = df.dropna()
print(f"Rows before: {len(df)}, Rows after: {len(df_clean)}")
# Drop rows only if a SPECIFIC column is missing
df_clean = df.dropna(subset=["region"])
print(f"Rows with region: {len(df_clean)}")
# Drop columns that have more than 50% missing values
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))
# Only drop if ALL values in a row are missing
df_clean = df.dropna(how="all")
Important: dropna() returns a new DataFrame by default. The original df is unchanged. To modify in place, use inplace=True — but be cautious; most professional code avoids inplace=True because it makes debugging harder.
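A quick sketch (with invented data) confirming that dropna() leaves the original DataFrame intact:

```python
import pandas as pd

# A tiny frame with one missing region (invented data)
df_demo = pd.DataFrame({"region": ["NE", None, "SW"], "revenue": [100, 200, 300]})

cleaned = df_demo.dropna()
print(len(df_demo), len(cleaned))   # 3 2: the original frame is untouched
```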
12.3.3 Filling Missing Values: .fillna()
When you know (or can reasonably infer) what the missing value should be, fill it.
Fill with a constant value:
# Fill missing region with "Unknown"
df["region"] = df["region"].fillna("Unknown")
# Fill missing numeric values with zero
df["discount_rate"] = df["discount_rate"].fillna(0)
# Fill multiple columns at once with a dictionary
df = df.fillna({
"region": "Unknown",
"discount_rate": 0,
"category": "Uncategorized"
})
Fill with a statistical measure:
# Fill missing unit_price with the column median
# (median is better than mean when the data is skewed)
median_price = df["unit_price"].median()
df["unit_price"] = df["unit_price"].fillna(median_price)
# Fill missing units_sold with the column mean
mean_units = df["units_sold"].mean()
df["units_sold"] = df["units_sold"].fillna(mean_units)
Forward fill and backward fill (for time-series data):
# Forward fill: propagate the last known value forward
df["price"] = df["price"].ffill()
# Backward fill: propagate the next known value backward
df["price"] = df["price"].bfill()
# Note: the older spelling df["price"].fillna(method="ffill") is deprecated
# in recent pandas versions in favor of .ffill() and .bfill()
Forward fill is useful when you have daily records and want to propagate a value (like a stock price or exchange rate) until the next update. It assumes that the last known value is the best estimate for any gap.
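A minimal forward-fill sketch with invented prices:

```python
import pandas as pd

# A price series with gaps (hypothetical values)
prices = pd.Series([10.0, None, None, 12.5, None, 13.0])

filled = prices.ffill()   # each NaN takes the last known value
print(filled.tolist())    # [10.0, 10.0, 10.0, 12.5, 12.5, 13.0]
```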
12.3.4 Interpolation: .interpolate()
For numeric time-series data, linear interpolation fills missing values by fitting a straight line between the known values on either side.
# Fill missing values with linear interpolation
df["daily_revenue"] = df["daily_revenue"].interpolate(method="linear")
# Before: [100, NaN, NaN, 160]
# After: [100, 120, 140, 160] (evenly spaced between 100 and 160)
Interpolation is appropriate when you have regularly-spaced time data (daily, weekly, monthly) and missing values represent gaps in an otherwise continuous series. It is not appropriate for categorical data or when missing values are random rather than temporal.
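A runnable version of the example above, using invented daily revenue figures:

```python
import pandas as pd

# Two missing days between two known values
revenue = pd.Series([100.0, None, None, 160.0])

interpolated = revenue.interpolate(method="linear")
print(interpolated.tolist())   # [100.0, 120.0, 140.0, 160.0]
```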
12.3.5 The Fill-vs-Drop Decision
Use this framework to decide:
| Situation | Recommended Approach |
|---|---|
| Less than 2% missing, random pattern | Drop rows |
| Between 2-10% missing, known fill value | Fill with constant |
| Numeric column, data is approximately normal | Fill with mean |
| Numeric column, data is skewed | Fill with median |
| Time-series with gaps | Interpolate |
| Missing because the event hasn't happened yet (e.g., end_date for active projects) | Leave as NaN; filter out when needed |
| More than 20% missing | Investigate further before deciding; consider dropping the column entirely |
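The framework above can be sketched as a small reporting helper. The thresholds loosely follow the table, but the 10-20% band, the function name missingness_report, and the demo data are invented for this sketch:

```python
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column with a rough suggested action."""
    pct = df.isna().mean() * 100

    def suggest(p: float) -> str:
        if p == 0:
            return "nothing to do"
        if p < 2:
            return "consider dropping rows"
        if p <= 10:
            return "fill (constant/median)"
        if p <= 20:
            return "investigate pattern"
        return "investigate; maybe drop column"

    return pd.DataFrame({
        "missing_pct": pct.round(2),
        "suggestion": pct.map(suggest),
    })

# Demo on a tiny hypothetical frame
demo = pd.DataFrame({
    "region": ["NE", None, "SW", "NE"],
    "revenue": [100.0, 200.0, None, 400.0],
    "sku": ["A", "B", "C", "D"],
})
report = missingness_report(demo)
print(report)
```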
12.4 Removing Duplicate Rows
Duplicate records inflate counts, totals, and averages, and are nearly always a data quality error rather than a legitimate business event.
12.4.1 Detecting Duplicates: .duplicated()
# Count exact duplicate rows
n_dups = df.duplicated().sum()
print(f"Exact duplicate rows: {n_dups}")
# Show the duplicate rows themselves
# keep='first' marks all duplicates except the first occurrence as True
# keep=False marks ALL occurrences of any duplicate as True
print(df[df.duplicated(keep=False)])
# Check for duplicates based on specific columns
# (useful for finding transactions with the same date, salesperson, product)
df["is_dup"] = df.duplicated(
subset=["date", "salesperson", "product_sku"],
keep=False
)
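A minimal sketch (invented rows) showing how the keep parameter changes what counts as a duplicate:

```python
import pandas as pd

# One repeated row (rows 0 and 1 are identical)
df_demo = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-05", "2023-01-06"],
    "salesperson": ["Marcus Webb", "Marcus Webb", "Sarah Kim"],
    "units_sold": [1, 1, 2],
})

# keep='first' (the default) flags only the later copies
print(df_demo.duplicated().sum())            # 1

# keep=False flags every row involved in a duplicate
print(df_demo.duplicated(keep=False).sum())  # 2
```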
12.4.2 Removing Duplicates: .drop_duplicates()
# Remove exact duplicate rows, keeping the first occurrence
df_clean = df.drop_duplicates()
print(f"Rows removed: {len(df) - len(df_clean)}")
# Remove duplicates, keeping the last occurrence
df_clean = df.drop_duplicates(keep="last")
# Remove duplicates based on specific columns
# Useful when you want to keep only the most recent record per transaction
df_clean = df.drop_duplicates(
subset=["date", "salesperson", "product_sku"],
keep="first"
)
# Verify the result
print(f"Before: {len(df)} rows, After: {len(df_clean)} rows")
print(f"Removed: {len(df) - len(df_clean)} rows")
Always document what you dropped. Before dropping duplicates, save the removed rows to a separate variable so you can inspect or log them:
# Save the duplicates before dropping
duplicates_removed = df[df.duplicated(keep="first")]
print(f"Duplicate rows to be removed ({len(duplicates_removed)} rows):")
print(duplicates_removed)
# Now drop them
df_clean = df.drop_duplicates(keep="first")
12.5 Correcting Data Types
Incorrect data types prevent calculations and comparisons from working. This section covers the three most common type correction tasks in business data.
12.5.1 The .astype() Method
.astype() converts a column to a specified dtype. It is your general-purpose type conversion tool.
# Convert to integer
df["units_sold"] = df["units_sold"].astype(int)
# Convert to float
df["unit_price"] = df["unit_price"].astype(float)
# Convert to string
df["product_sku"] = df["product_sku"].astype(str)
# Convert to category (memory-efficient for low-cardinality strings)
df["region"] = df["region"].astype("category")
# Convert to boolean
df["is_active"] = df["is_active"].astype(bool)
Caution: .astype() will raise an error if the column contains values that cannot be converted. For example, df["price"].astype(float) will fail if any value is "$149.99" because of the dollar sign. You must clean the values first.
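A short sketch (invented values) demonstrating both the failure and the fix:

```python
import pandas as pd

# A price column where one value still carries a dollar sign (invented data)
prices = pd.Series(["149.99", "$149.99"])

conversion_failed = False
try:
    prices.astype(float)
except ValueError:
    conversion_failed = True
print(conversion_failed)   # True: one bad value blocks the whole conversion

# After stripping the symbol, the conversion succeeds
cleaned = prices.str.replace("$", "", regex=False).astype(float)
print(cleaned.tolist())    # [149.99, 149.99]
```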
12.5.2 Converting Strings to Numbers: pd.to_numeric()
pd.to_numeric() is safer than .astype(float) for columns that might contain non-numeric strings. Its errors parameter controls what happens when conversion fails:
# Raise an error if any value cannot be converted (default)
df["revenue"] = pd.to_numeric(df["revenue"])
# Coerce failures to NaN instead of raising an error
# This is the most useful option for messy business data
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# Ignore failures — non-convertible values remain as-is
# (note: errors="ignore" is deprecated in recent pandas versions;
# prefer errors="coerce" and investigate the resulting NaN rows)
df["revenue"] = pd.to_numeric(df["revenue"], errors="ignore")
errors="coerce" is extremely useful for the common case where a mostly-numeric column has a few text values. Instead of crashing, it converts what it can and marks the problematic values as NaN. After conversion, you can investigate and fix just those NaN rows.
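A minimal demonstration with an invented mixed column:

```python
import pandas as pd

# Mostly numeric strings with one stray text value
mixed = pd.Series(["100", "250.5", "N/A", "300"])

converted = pd.to_numeric(mixed, errors="coerce")
print(converted.tolist())       # [100.0, 250.5, nan, 300.0]
print(converted.isna().sum())   # 1 value to investigate
```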
The dollar sign problem — complete solution:
# Step 1: Strip the dollar sign and any commas
df["unit_price"] = df["unit_price"].str.replace("$", "", regex=False)
df["unit_price"] = df["unit_price"].str.replace(",", "", regex=False)
# Step 2: Strip any whitespace
df["unit_price"] = df["unit_price"].str.strip()
# Step 3: Convert to numeric, coercing failures to NaN
df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
# Step 4: Investigate any remaining NaN values
leftover_nulls = df["unit_price"].isna().sum()
if leftover_nulls > 0:
print(f"Warning: {leftover_nulls} values could not be converted")
print(df[df["unit_price"].isna()][["product_sku", "unit_price"]])
12.5.3 Converting Strings to Dates: pd.to_datetime()
When dates arrive as strings (because parse_dates was not used at load time, or the dates are in inconsistent formats), pd.to_datetime() converts them.
# Convert a column of date strings to datetime
df["sale_date"] = pd.to_datetime(df["sale_date"])
# Specify the format explicitly for unambiguous parsing (much faster on large datasets)
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%Y-%m-%d")
# Format string reference:
# %Y = four-digit year (2023)
# %m = two-digit month (01-12)
# %d = two-digit day (01-31)
# %H = hour (00-23), %M = minute (00-59), %S = second (00-59)
# Handle multiple formats and non-parseable values with errors='coerce'
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
# Verify conversion
print(df["sale_date"].dtype) # Should show datetime64[ns]
print(df["sale_date"].isna().sum()) # Any values that failed conversion
Common date format strings:
| Format | Example | Format String |
|---|---|---|
| ISO 8601 | 2023-01-15 | "%Y-%m-%d" |
| US style | 01/15/2023 | "%m/%d/%Y" |
| European style | 15/01/2023 | "%d/%m/%Y" |
| US with dashes | 01-15-2023 | "%m-%d-%Y" |
| Month name | January 15, 2023 | "%B %d, %Y" |
| Short month name | Jan 15, 2023 | "%b %d, %Y" |
12.6 String Normalization
String columns in business data almost always need normalization. The most common operations are removing extra whitespace, standardizing case, and removing unwanted characters.
12.6.1 The .str Accessor
All string operations in pandas are accessed through the .str accessor on a Series:
df["column"].str.method()
The .str accessor applies the string method to every value in the column. If a value is NaN, it remains NaN (the method is not applied to missing values, which is usually the correct behavior).
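A quick sketch (invented values) showing that missing values pass through .str operations untouched:

```python
import pandas as pd

s = pd.Series(["  Northeast", "SOUTHWEST ", None])

normalized = s.str.strip().str.lower()
print(normalized[0], "|", normalized[1])   # northeast | southwest
print(normalized.isna().tolist())          # [False, False, True], NaN preserved
```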
12.6.2 Removing Whitespace: .str.strip()
Extra whitespace is invisible but causes comparisons to fail. "Northeast" and "Northeast " (with a trailing space) are not equal, so a filter for one will miss the other.
# Strip leading and trailing whitespace from all values
df["region"] = df["region"].str.strip()
# Strip only leading whitespace
df["region"] = df["region"].str.lstrip()
# Strip only trailing whitespace
df["region"] = df["region"].str.rstrip()
# Strip a specific character
df["product_code"] = df["product_code"].str.strip("_")
Always strip whitespace before any other string operations. Many case-sensitivity and comparison issues disappear after stripping.
12.6.3 Standardizing Case: .str.lower(), .str.upper(), .str.title()
Case standardization is how you fix the "Office Chairs" vs "office chairs" vs "OFFICE CHAIRS" problem.
# Convert all to lowercase (most common for comparisons and merges)
df["category"] = df["category"].str.lower()
# Convert all to uppercase
df["region"] = df["region"].str.upper()
# Convert to title case (first letter of each word capitalized)
df["category"] = df["category"].str.title()
# Convert to sentence case (first letter of first word capitalized)
df["notes"] = df["notes"].str.capitalize()
Strategy for category standardization:
1. Strip whitespace first: .str.strip()
2. Convert to lowercase: .str.lower()
3. Map to canonical values: .map(canonical_dict) or .str.title()
The combination of stripping and lowercasing resolves most case-inconsistency issues before you need any manual mapping.
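The three-step strategy can be sketched end-to-end; the variant strings and the canonical mapping are invented for illustration:

```python
import pandas as pd

raw = pd.Series(["Office Chairs", " office chairs", "OFC CHAIRS", "Chairs - Office"])

# Steps 1-2: strip whitespace, then lowercase
lowered = raw.str.strip().str.lower()

# Step 3: map the surviving variants to one canonical label
canonical = {
    "office chairs": "Office Chairs",
    "ofc chairs": "Office Chairs",
    "chairs - office": "Office Chairs",
}
standardized = lowered.map(canonical)
print(standardized.nunique())   # 1: all four variants collapsed
```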
12.6.4 Replacing Characters: .str.replace()
# Remove dollar signs from price strings
df["unit_price"] = df["unit_price"].str.replace("$", "", regex=False)
# Remove commas from number strings like "1,234.56"
df["revenue"] = df["revenue"].str.replace(",", "", regex=False)
# Replace a pattern using regular expressions
# Remove any non-digit characters from a phone number
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True)
# Replace multiple spaces with a single space
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True).str.strip()
The regex=False parameter treats the search string as a literal string, not a regular expression pattern. Always specify it when you are not using regex — the $ character has special regex meaning, and regex=False prevents pandas from interpreting it as such.
12.6.5 Searching Within Strings: .str.contains()
.str.contains() creates a boolean mask for filtering rows where a column contains a specific substring.
# Find rows where category contains "chair" (case-insensitive)
chair_mask = df["category"].str.contains("chair", case=False, na=False)
chair_rows = df[chair_mask]
print(f"Rows with 'chair' in category: {len(chair_rows)}")
# Find rows where salesperson name contains "Blake"
blake_sales = df[df["salesperson"].str.contains("Blake", na=False)]
# Use regex for more complex patterns
# Find SKUs that start with "TECH-"
tech_skus = df[df["product_sku"].str.contains(r"^TECH-", regex=True, na=False)]
The na=False parameter handles NaN values gracefully — without it, .str.contains() returns NaN for missing values, and pandas raises an error when you try to filter rows with a mask that contains NaN.
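A minimal sketch with an invented category column containing a missing value:

```python
import pandas as pd

cats = pd.Series(["Office Chairs", "desk chair", None, "Storage"])

# na=False maps the missing value to False, giving a clean boolean mask
mask = cats.str.contains("chair", case=False, na=False)
print(mask.tolist())        # [True, True, False, False]
print(cats[mask].tolist())  # the two chair rows
```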
12.7 Standardizing Categorical Values
After normalizing case and whitespace, you may still have categories that need to be explicitly mapped to standard values. This is where .map() and .replace() come in.
12.7.1 Using .map() for Explicit Recoding
.map() applies a dictionary mapping to replace values. Any value not in the dictionary becomes NaN. This makes it strict — you must account for every value.
# Define the canonical mapping from all variants to the standard value
status_mapping = {
"Completed": "Completed",
"completed": "Completed",
"Complete": "Completed",
"complete": "Completed",
"DONE": "Completed",
"COMPLETED": "Completed",
"In Progress": "In Progress",
"in progress": "In Progress",
"On Hold": "On Hold",
"Cancelled": "Cancelled",
"Canceled": "Cancelled", # Common spelling variation
}
df["status"] = df["status"].map(status_mapping)
# Check: any values not in the mapping become NaN
print(df["status"].value_counts(dropna=False))
The strictness of .map() is a feature: if a new status value appears that is not in your dictionary, it becomes NaN — a visible signal that you need to update your mapping. This is better than silently passing through an unstandardized value.
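A short sketch (invented status values) showing the NaN signal in action:

```python
import pandas as pd

# "Pending Review" is a new value not yet covered by the mapping
status = pd.Series(["Completed", "DONE", "Pending Review"])
mapping = {"Completed": "Completed", "DONE": "Completed"}

mapped = status.map(mapping)
print(mapped.isna().sum())   # 1: the unmapped value surfaces as NaN
```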
12.7.2 Using .replace() for Flexible Recoding
.replace() is more flexible — values not in the mapping are left unchanged rather than becoming NaN. Use it when you want to recode specific values but leave everything else as-is.
# Replace specific values, leave others unchanged
df["category"] = df["category"].replace({
"office chairs": "Office Chairs",
"OFFICE CHAIRS": "Office Chairs",
"Chairs": "Office Chairs",
})
# Replace a single value
df["region"] = df["region"].replace("NE", "Northeast")
# Replace a list of values with a single value
df["status"] = df["status"].replace(
["Done", "DONE", "done", "Finished"], "Completed"
)
When to use .map() vs .replace():
- Use .map() when you want to recode every value and need strict control — any unmapped value becomes NaN, signaling a gap in your mapping.
- Use .replace() when you want to fix specific known problems while leaving other values alone.
12.8 Detecting and Handling Outliers
Outliers are values that are unusually distant from the rest of the data. They may be legitimate (a real bulk order of 50 laptops) or erroneous (a data entry mistake of 5000 units when 50 was intended). You need to investigate before deciding what to do.
12.8.1 The IQR Method
The IQR (Interquartile Range) method is the standard statistical approach for detecting outliers:
import pandas as pd
# Calculate Q1 (25th percentile), Q3 (75th percentile), and IQR
Q1 = df["units_sold"].quantile(0.25)
Q3 = df["units_sold"].quantile(0.75)
IQR = Q3 - Q1
# Define the outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Outlier range: < {lower_bound:.2f} or > {upper_bound:.2f}")
# Identify outlier rows
outliers = df[(df["units_sold"] < lower_bound) | (df["units_sold"] > upper_bound)]
print(f"\nOutliers found: {len(outliers)} rows")
print(outliers[["date", "salesperson", "product_sku", "units_sold", "total_revenue"]])
The IQR method is robust to extreme values because it is based on percentiles, not the mean. A single extreme outlier will not dramatically change Q1 or Q3.
12.8.2 The Z-Score Method
Z-scores measure how many standard deviations a value is from the mean. Values with |z| > 3 are commonly flagged as outliers:
# Calculate z-scores
mean = df["units_sold"].mean()
std = df["units_sold"].std()
df["units_sold_zscore"] = (df["units_sold"] - mean) / std
# Flag outliers with |z| > 3
outliers_z = df[df["units_sold_zscore"].abs() > 3]
print(f"Outliers by z-score (|z|>3): {len(outliers_z)} rows")
The z-score method is sensitive to extreme outliers — if one value is wildly extreme, it inflates the standard deviation and can make other moderate outliers look normal. Use the IQR method when your data might have extreme outliers.
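The masking effect is easy to demonstrate with invented numbers: a single extreme error inflates the standard deviation enough that a genuinely unusual order no longer crosses the threshold.

```python
import pandas as pd

# Eighteen typical 50-unit orders, one moderate outlier (500),
# one extreme data entry error (50000)
units = pd.Series([50] * 18 + [500, 50000])

z = (units - units.mean()) / units.std()
print((z.abs() > 3).sum())   # only the 50000 row is flagged; 500 slips through

# Remove the extreme error and recompute: now the 500-unit order is flagged
trimmed = units[units < 50000]
z2 = (trimmed - trimmed.mean()) / trimmed.std()
print((z2.abs() > 3).sum())
```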
12.8.3 What to Do with Outliers
Outliers should be investigated, not automatically removed. Common approaches:
Investigate first:
# Print the outlier rows with full context
print(outliers.to_string())
# Check if these are the same salesperson or product
print(outliers["salesperson"].value_counts())
print(outliers["product_sku"].value_counts())
If the outlier is a data entry error, correct or remove it:
# Remove rows where units_sold > 100 (identified as likely data entry errors)
df = df[df["units_sold"] <= 100]
If the outlier is legitimate, document it and keep it:
# Flag extreme values without removing them
df["is_large_order"] = df["units_sold"] > upper_bound
print(f"Large but legitimate orders: {df['is_large_order'].sum()}")
If you are unsure, apply capping (Winsorization) as a conservative approach:
# Cap values at the upper/lower bounds rather than removing them
df["units_sold_capped"] = df["units_sold"].clip(lower=lower_bound, upper=upper_bound)
12.9 Before and After: The Impact of Cleaning
To make the value of data cleaning concrete, consider this comparison for the Acme sales data:
Before Cleaning
# Category distribution with dirty data
print(df["category"].value_counts(dropna=False))
Office Chairs 847
Desks 623
Technology 589
Storage 431
office chairs 7
OFFICE CHAIRS 3
NaN 5
# Revenue by category with dirty data
print(df.groupby("category")["total_revenue"].sum().sort_values(ascending=False))
category
Office Chairs 412,345.23
Desks 398,721.45
Technology 374,892.16
Storage 198,234.89
office chairs 3,421.78 # ← Should be part of Office Chairs
OFFICE CHAIRS 1,456.34 # ← Should be part of Office Chairs
The category split means Office Chairs appears to generate $412,345 when it actually generates **$417,223** ($412,345.23 + $3,421.78 + $1,456.34 = $417,223.35). A report generated from dirty data would understate Office Chairs revenue by $4,878 — a meaningful error in a business context.
After Cleaning
# Standardize category — strip whitespace, then title case
df["category"] = df["category"].str.strip().str.title()
# Category distribution after cleaning
print(df["category"].value_counts(dropna=False))
Office Chairs 857
Desks 623
Technology 589
Storage 431
NaN 5
# Revenue by category after cleaning
print(df.groupby("category")["total_revenue"].sum().sort_values(ascending=False))
category
Office Chairs 417,223.35 # ← Correct, consolidated figure
Desks 398,721.45
Technology 374,892.16
Storage 198,234.89
12.10 Documenting Your Cleaning Steps
Reproducibility is a professional obligation. If you cannot explain exactly what you did to the data, and if someone else cannot reproduce your cleaned dataset from the raw file, your analysis is not fully trustworthy.
12.10.1 Writing a Cleaning Log
cleaning_log = []
# --- Step 1: Remove duplicates ---
n_before = len(df)
df = df.drop_duplicates()
n_after = len(df)
cleaning_log.append({
"step": "Remove exact duplicates",
"rows_before": n_before,
"rows_after": n_after,
"rows_changed": n_before - n_after,
"notes": "8 exact duplicate rows removed"
})
# --- Step 2: Standardize category ---
before_unique = df["category"].nunique()
df["category"] = df["category"].str.strip().str.title()
after_unique = df["category"].nunique()
cleaning_log.append({
"step": "Standardize category capitalization",
"rows_before": None,
"rows_after": None,
"rows_changed": None,
"notes": (f"Reduced unique category values from {before_unique} "
f"to {after_unique}")
})
# Print the cleaning log
log_df = pd.DataFrame(cleaning_log)
print(log_df.to_string(index=False))
12.10.2 Saving the Cleaned Dataset
After cleaning, save the clean version separately from the original:
# Save as CSV
df.to_csv("acme_sales_2023_clean.csv", index=False)
# Save as Excel
df.to_excel("acme_sales_2023_clean.xlsx", index=False)
print("Clean data saved.")
print(f"Final dataset: {len(df)} rows, {len(df.columns)} columns")
The index=False parameter prevents pandas from writing the row numbers as a column in the output file.
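A quick sketch (writing to an in-memory buffer rather than a file) showing the difference:

```python
import pandas as pd
from io import StringIO

df_demo = pd.DataFrame({"region": ["NE", "SW"], "revenue": [100, 200]})

with_index = StringIO()
without_index = StringIO()
df_demo.to_csv(with_index)                   # default: index written as a column
df_demo.to_csv(without_index, index=False)   # index omitted

print(with_index.getvalue().splitlines()[0])     # ,region,revenue
print(without_index.getvalue().splitlines()[0])  # region,revenue
```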
12.11 A Complete Cleaning Pipeline
The following is a complete, structured data cleaning pipeline that incorporates all the techniques from this chapter. It uses Acme Corp sales data defined inline, applies every cleaning step systematically, and logs each step.
import pandas as pd
from io import StringIO
# === RAW DATA (simulating the file Priya received) ===
raw_csv = """date,region,salesperson,product_sku,category,units_sold,unit_price,discount_rate,total_revenue
2023-01-03,Northeast,Jordan Blake,OFF-CH-0045,Office Chairs,2,249.99,0.10,449.98
2023-01-03,Southwest,Casey Rivera,DESK-SIT-012,Desks,1,899.99,0.15,764.99
2023-01-04,Midwest,Alex Torres,TECH-MON-089,Technology,3,349.99,0.05,997.47
2023-01-05,,Marcus Webb,OFF-CH-0046,Office Chairs,1,299.99,0.00,299.99
2023-01-05,Northeast,Jordan Blake,STOR-CAB-023,Storage,4,129.99,0.10,467.96
2023-01-08,West,Sarah Kim,TECH-KEY-034,Technology,2,79.99,0.10,143.98
2023-01-09,Southeast,David Park,DESK-SIT-012,Desks,1,899.99,0.20,719.99
2023-01-10,Northeast,Jordan Blake,OFF-CH-0045,office chairs,2,249.99,0.10,449.98
2023-01-11,Southwest,Casey Rivera,TECH-MON-089,Technology,1,349.99,0.00,349.99
2023-01-12,Midwest,Alex Torres,OFF-CH-0047,OFFICE CHAIRS,3,199.99,0.15,509.97
2023-01-05,,Marcus Webb,OFF-CH-0046,Office Chairs,1,299.99,0.00,299.99
"""
df = pd.read_csv(
StringIO(raw_csv),
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["", "N/A"]
)
print("=== BEFORE CLEANING ===")
print(f"Rows: {len(df)}")
df.info()
print(df["category"].value_counts(dropna=False))
cleaning_log = []
# --- STEP 1: Remove exact duplicates ---
n_before = len(df)
df = df.drop_duplicates()
n_after = len(df)
cleaning_log.append(f"Step 1: Removed {n_before - n_after} duplicate rows")
# --- STEP 2: Fill known missing region values ---
# Marcus Webb (confirmed with IT) is always Northeast
df.loc[df["salesperson"] == "Marcus Webb", "region"] = df.loc[
df["salesperson"] == "Marcus Webb", "region"
].fillna("Northeast")
n_remaining = df["region"].isna().sum()
cleaning_log.append(
f"Step 2: Filled Marcus Webb's missing region with 'Northeast'. "
f"Remaining missing: {n_remaining}"
)
# --- STEP 3: Standardize category capitalization ---
before_unique = sorted(df["category"].dropna().unique())
df["category"] = df["category"].str.strip().str.title()
after_unique = sorted(df["category"].dropna().unique())
cleaning_log.append(
f"Step 3: Standardized category from {before_unique} to {after_unique}"
)
# --- STEP 4: Handle remaining missing categories ---
df["category"] = df["category"].fillna("Uncategorized")
cleaning_log.append("Step 4: Filled remaining missing categories with 'Uncategorized'")
# === PRINT RESULTS ===
print("\n=== AFTER CLEANING ===")
print(f"Rows: {len(df)}")
df.info()
print(df["category"].value_counts(dropna=False))
print("\n=== CLEANING LOG ===")
for entry in cleaning_log:
print(f" - {entry}")
12.12 Chapter Summary
Data cleaning is the bridge between raw data and reliable analysis. The techniques in this chapter — handling missing values, removing duplicates, fixing data types, normalizing strings, standardizing categories, and detecting outliers — are the core toolkit of any working data analyst.
The most important habit is to work systematically. Use the cleaning checklist. Document every step. Validate after each operation. Never modify the source data. And always ask: "Do I understand this issue well enough to fix it correctly, or do I need more information first?"
Clean data is not just technically correct — it is trustworthy. When you hand Priya's cleaned sales dataset to Sandra Chen for the annual review, both of them can be confident that the numbers in the report reflect reality. That confidence is the whole point.
Key Terms
Missing value (NaN/NaT): A value that is absent from a dataset. In pandas, NaN represents missing numeric or string values; NaT represents missing datetime values.
Imputation: The process of replacing missing values with estimated or inferred values (means, medians, forward-fills, etc.).
Outlier: A value that is significantly different from the other values in a distribution. May be a data error or a legitimate extreme observation.
IQR (Interquartile Range): The difference between the 75th and 25th percentiles of a dataset. Used in the IQR method for outlier detection.
Winsorization: Capping extreme values at a specified percentile boundary rather than removing them. A conservative approach to handling outliers.
String normalization: The process of converting string values to a consistent form — stripping whitespace, standardizing case, removing special characters.
Reproducibility: The ability to recreate the same result by running the same code on the same input data. Reproducible cleaning pipelines are essential for professional work.
What Comes Next
With clean data in hand, Chapters 13 and 14 introduce the core data transformation operations in pandas: grouping, aggregating, merging, and reshaping. These are the operations that turn clean rows of transaction data into the summaries, comparisons, and insights that drive business decisions.