Learning Objectives

  • Detect missing values, duplicates, and type mismatches using pandas diagnostic methods (isnull, duplicated, dtypes)
  • Apply appropriate missing value strategies (deletion, imputation, flagging) and justify the choice for each situation
  • Correct data type errors using astype, to_numeric, and to_datetime with error handling
  • Standardize inconsistent categorical values using replace, map, and string methods
  • Document all cleaning decisions and their rationale in a reproducible notebook, treating cleaning as analysis

Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." — Arthur Conan Doyle, A Scandal in Bohemia


Chapter Overview

Let me tell you about the first time I saw a real dataset.

I was expecting something clean. Rows and columns, numbers and labels, everything neatly in its place — like the practice datasets from a textbook. What I got instead was a CSV file with 47,000 rows, three different spellings of "United States," a column called date that contained a mix of "2021-03-15," "March 15 2021," "15/03/21," and the occasional "TBD." There were 2,300 rows where the vaccination count was blank. There were 412 rows that appeared twice. And the column that was supposed to contain numbers? It had dollar signs in it.

I stared at that file for a long time.

If this sounds like a horror story, I have good news and bad news. The bad news: this is what real data looks like. Always. Every single time. The good news: cleaning messy data is a learnable skill, and by the end of this chapter, you'll have the tools and the mindset to handle it.

Here's something that might surprise you: data cleaning is not the boring prelude before the "real" analysis begins. Data cleaning is analysis. Every time you decide to drop a row, fill in a missing value, or merge two inconsistent categories, you're making an analytical decision that shapes your results. A researcher who drops all rows with missing income data has made a decision about whose voices get heard. A business analyst who fills missing sales with zeros has made a different decision than one who fills them with averages. Neither is automatically right or wrong — but both are choices, and both have consequences.

This is the chapter where we get honest about what data science actually looks like day to day. Surveys consistently find that data professionals spend 60 to 80 percent of their time on data cleaning and preparation. That's not a failure of the field — it's the nature of the work. The sooner you embrace it, the better you'll be at it.

In this chapter, you will learn to:

  1. Detect missing values, duplicates, and type mismatches using pandas diagnostic methods (isnull, duplicated, dtypes) (all paths)
  2. Apply appropriate missing value strategies (deletion, imputation, flagging) and justify the choice for each situation (all paths)
  3. Correct data type errors using astype, to_numeric, and to_datetime with error handling (all paths)
  4. Standardize inconsistent categorical values using replace, map, and string methods (standard + deep dive paths)
  5. Document all cleaning decisions and their rationale in a reproducible notebook, treating cleaning as analysis (all paths)

8.1 The Anatomy of Messy Data

Before we start fixing things, let's understand what can go wrong. In my experience, messy data falls into five categories, and most real datasets have problems in at least three of them simultaneously.

The Five Types of Data Mess

  1. Missing values — cells that should contain data but don't. The vaccination count is blank. The patient's age is recorded as "unknown." The sensor reading for 3 AM is absent because the sensor was rebooting.

  2. Duplicate records — rows that appear more than once, sometimes identically, sometimes with slight variations. The same patient shows up twice because they were entered at two different clinics. The same country appears under both "Cote d'Ivoire" and "Ivory Coast."

  3. Type errors — columns where the data type doesn't match what it should be. A column of numbers that pandas reads as text because one cell contains "N/A." A date column stored as a string. A ZIP code that lost its leading zero because Excel treated it as a number.

  4. Inconsistent categories — the same thing spelled different ways. "Male," "male," "M," "m," "MALE." "New York," "NY," "N.Y.," "new york," "New York City." You'd be amazed how many ways people can spell the same word.

  5. Outliers and implausible values — data points that are technically present but seem wrong. A patient listed as 250 years old. A temperature reading of -999 (a common placeholder for "missing" in scientific data). A sales figure that's ten thousand times the average.

Let's see all five in action. Here's a small dataset that contains every kind of mess:

import pandas as pd
import numpy as np

# A messy dataset — typical of what you'll encounter in the wild
messy_data = pd.DataFrame({
    'patient_id':    [101, 102, 103, 103, 104, 105, 106, 107, 108, 109],
    'name':          ['Alice Smith', 'Bob Jones', 'carol lee', 'Carol Lee',
                      'David Kim', None, 'Eve Torres', 'Frank N.',
                      'Grace Hall', 'Henry Wu'],
    'age':           [34, 'forty-five', 28, 28, 91, 42, -3, 55, None, 67],
    'gender':        ['Female', 'M', 'female', 'Female', 'Male', 'F',
                      'Female', 'male', 'Female', 'M'],
    'visit_date':    ['2024-03-15', '03/16/2024', '2024-03-15', '2024-03-15',
                      '2024-03-17', '2024-03-18', 'March 19, 2024',
                      '2024-03-20', '2024-03-21', '2024-3-22'],
    'blood_pressure': ['120/80', '140/90', '115/75', '115/75', '180/110',
                       None, '90/60', '130/85', '125/80', '$135/88'],
    'diagnosis':     ['Flu', 'flu', 'Diabetes', 'Diabetes', 'Hypertension',
                      'Flu', None, 'hypertension', 'Diabetes', 'FLU']
})

print(messy_data)
print(f"\nShape: {messy_data.shape}")
print(f"\nData types:\n{messy_data.dtypes}")

Take a moment to find the problems before reading on. How many can you spot?

Here's what I count:

  • Missing values: name for patient 105, age for patient 108, blood_pressure for patient 105, diagnosis for patient 106
  • Duplicate: patient 103 appears twice (rows 2 and 3)
  • Type errors: age contains "forty-five" (a string among numbers), and blood_pressure has a dollar sign in the last entry
  • Inconsistent categories: gender uses "Female," "F," "female," "M," "Male," "male"; diagnosis uses "Flu," "flu," "FLU"
  • Outlier/implausible: patient 106 has age -3 (negative age is not possible)

This is a tiny dataset with ten rows. Imagine these problems spread across 47,000 rows. That's the reality of data cleaning.

Check Your Understanding

  1. Which of the five types of data mess is hardest to detect automatically? Why?
  2. Patient 104 has age 91 and blood pressure 180/110. Is this an outlier that should be removed, or a legitimate record? How would you decide?
  3. Why might "March 19, 2024" cause problems even though it's a perfectly valid date?

8.2 Missing Values: The Ghosts in Your Data

Missing values are the single most common data quality issue you'll face. They're also the most dangerous, because how you handle them changes your results — sometimes dramatically.

What Does "Missing" Look Like in pandas?

In pandas, the primary representation for missing data is NaN — which stands for "Not a Number." Despite the name, NaN is used for missing values of any type, not just numbers. You'll also encounter None (Python's built-in null value) and sometimes NaT (Not a Time, for missing datetime values). pandas generally treats None and NaN interchangeably, converting None to NaN in numeric columns.

import pandas as pd
import numpy as np

# Different faces of "missing"
s = pd.Series([1, None, np.nan, pd.NaT, 5])
print(s)
print(f"\nNull check:\n{s.isnull()}")

A critical thing to understand: NaN is not equal to anything, not even itself.

print(np.nan == np.nan)     # False — this is by design!
print(np.nan != np.nan)     # True — the only value where this is True

# This is WHY we need isnull() — you can't use == to find NaN
print(None == None)         # True in Python, but...
# In pandas DataFrames, None gets converted to NaN in numeric columns

This is why you can never use == np.nan or == None to check for missing values in a DataFrame. Always use .isnull() (or its alias .isna()).

Detecting Missing Values

Your first step with any new dataset should be a missing value audit. Here's the toolkit:

# Load a sample dataset — let's use something realistic
# Imagine this is our WHO vaccination dataset
df = pd.read_csv('vaccination_data.csv')

# Strategy 1: The quick overview
print(df.isnull().sum())       # Count of missing values per column
print(df.isnull().sum().sum()) # Total missing values in entire DataFrame

# Strategy 2: The percentage view (more useful for large datasets)
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))

# Strategy 3: The visual check — which rows have missing values?
print(df[df.isnull().any(axis=1)])  # Show all rows with ANY missing value

# Strategy 4: The info() shortcut
df.info()  # Shows non-null count per column — quick way to spot gaps

Let me walk you through each of these with our messy patient data:

# Using our messy_data from above
print("Missing values per column:")
print(messy_data.isnull().sum())
print()
print("Missing value percentage per column:")
print((messy_data.isnull().mean() * 100).round(1))

Output:

Missing values per column:
patient_id        0
name              1
age               1
gender            0
visit_date        0
blood_pressure    1
diagnosis         1

Missing value percentage per column:
patient_id         0.0
name              10.0
age               10.0
gender             0.0
visit_date         0.0
blood_pressure    10.0
diagnosis         10.0

Ten percent missing in four columns. In a 10-row dataset that's one cell each. In a 100,000-row dataset, 10% means 10,000 gaps. The strategy that works for one scale might not work for the other.

The Three Strategies for Missing Values

When you find missing values, you have three basic options. Each has trade-offs, and the right choice depends on why the data is missing.

Strategy 1: Delete (dropna)

The simplest approach: just remove the rows (or columns) with missing values.

# Drop any row that has ANY missing value
clean_df = messy_data.dropna()
print(f"Before: {len(messy_data)} rows")
print(f"After:  {len(clean_df)} rows")

# Drop rows only if a SPECIFIC column is missing
clean_df = messy_data.dropna(subset=['diagnosis'])

# Drop rows only if ALL values are missing (rare but useful)
clean_df = messy_data.dropna(how='all')

# Drop COLUMNS that are missing more than 50% of their values
# (thresh is the minimum number of non-null values required to KEEP the column,
#  and it must be an integer)
clean_df = messy_data.dropna(axis=1, thresh=int(len(messy_data) * 0.5))

When deletion is appropriate:

  • The missing values are a small percentage of your data (often cited as less than 5%)
  • The data is missing completely at random (MCAR) — the reason for missingness has nothing to do with the value itself or any other variable
  • You have enough data that losing some rows won't affect your analysis

When deletion is dangerous:

  • The missingness is not random. If low-income patients are more likely to have missing insurance information, dropping those rows silently removes low-income patients from your analysis. Your results now describe only the people with complete records — which may not represent the population you care about.
  • You'd lose too much data. If 40% of rows have at least one missing value, dropna() throws away nearly half your dataset.

Ethical Analysis

Imagine you're analyzing COVID vaccination rates across U.S. ZIP codes. You notice that rural ZIP codes are more likely to have missing data — perhaps because smaller clinics report less consistently. If you simply drop all rows with missing vaccination counts, your analysis will over-represent urban areas and may miss the very disparities you're trying to study.

This is not a technical problem. It's an ethical one. The decision to drop rows is a decision about whose data matters. Always ask: who is missing from my dataset, and does their absence change my conclusions?

Strategy 2: Fill (fillna and Imputation)

Instead of deleting rows with missing values, you can fill them in — but what you fill them with matters enormously.

# Fill with a constant value — an explicit "we don't know"
df['diagnosis'] = df['diagnosis'].fillna('Unknown')

# Fill with the column mean (numeric columns only)
df['age'] = df['age'].fillna(df['age'].mean())

# Fill with the column median (more robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())

# Fill with the mode (most common value — good for categorical data)
df['diagnosis'] = df['diagnosis'].fillna(df['diagnosis'].mode()[0])

# Forward fill: use the previous row's value
# Useful for time series where values don't change abruptly
df['temperature'] = df['temperature'].ffill()

# Backward fill: use the next row's value
df['temperature'] = df['temperature'].bfill()

The term imputation refers to the process of replacing missing values with estimated values. Simple imputation uses basic statistics (mean, median, mode). More sophisticated imputation methods exist — multiple imputation, k-nearest neighbors imputation, regression imputation — but for now, the simple methods will serve you well.

Choosing between mean, median, and mode:

| Fill strategy | Best for | Watch out for |
|---|---|---|
| Mean | Normally distributed numeric data | Sensitive to outliers; if one age is 999, the mean is useless |
| Median | Skewed numeric data or data with outliers | Less sensitive to outliers, but still an estimate |
| Mode | Categorical data | If there's no clear mode, this is arbitrary |
| Constant ("Unknown", 0, -1) | When you want to preserve the fact that data was missing | Zero can be confused with a real value (did they score 0, or is the score missing?) |
| Forward/backward fill | Time series data | Assumes values don't change — can be wrong if they do |

# A practical example: filling missing ages with the median
median_age = messy_data['age'].median()  # This will fail because 'forty-five' is there!
# We'll fix the type error first (see Section 8.4), then impute

Strategy 3: Flag (Create a Missing Indicator)

Sometimes the fact that a value is missing is itself informative. Maybe patients who don't report their income have systematically different health outcomes. In that case, you want to preserve the missingness as a feature.

# Create a binary flag: 1 if age was missing, 0 if it wasn't
df['age_was_missing'] = df['age'].isnull().astype(int)

# Then fill the missing age with median (or whatever)
df['age'] = df['age'].fillna(df['age'].median())

# Now you have BOTH the imputed value AND a record of what was imputed

This approach is particularly valuable in machine learning, where the pattern of missingness can carry predictive information. We'll revisit this in Part V.

Why the Data Is Missing Matters

Data scientists and statisticians categorize missing data into three types. This might sound academic, but it directly affects which strategy you should use:

Missing Completely at Random (MCAR): The missingness has nothing to do with the data. A lab tech accidentally spilled coffee on a form, destroying some results. The missing values are random — they don't relate to the patient's age, diagnosis, or anything else. MCAR is the best-case scenario because deletion won't introduce bias.

Missing at Random (MAR): The missingness is related to other observed variables but not to the missing value itself. Older patients might be less likely to fill out an online survey, so age predicts missingness in the survey responses — but the content of their responses (if they'd given them) wouldn't be systematically different. MAR is trickier; deletion can introduce bias, but imputation based on the related variables can help.

Missing Not at Random (MNAR): The missingness is related to the missing value itself. People with very high incomes are less likely to report their income. Patients with severe depression are less likely to complete follow-up questionnaires. MNAR is the hardest case — no amount of imputation from other variables can fully correct the bias, because the very thing you're trying to impute is what's causing the missingness.

You often can't prove which type of missingness you have. But you should think about it, because it guides your strategy:

Is the missingness related to the missing value itself?
  |
  +--> Yes (or possibly yes) --> MNAR: Be very cautious.
  |     Flag the missingness. Consider sensitivity analysis
  |     (run your analysis with AND without the missing rows
  |     and see if conclusions change).
  |
  +--> No
        |
        Is the missingness related to other variables in the dataset?
          |
          +--> Yes --> MAR: Imputation using related variables
          |     can help. Group-based imputation (fill with
          |     group median) is better than overall median.
          |
          +--> No --> MCAR: Deletion is safe, imputation is
                fine, whatever is most convenient.
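The MAR branch above mentions group-based imputation. Here's a minimal sketch of what that looks like: filling gaps with each group's median instead of the overall median (the clinic column and values are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical records where age is missing more often at one clinic (MAR)
df = pd.DataFrame({
    'clinic': ['A', 'A', 'A', 'B', 'B', 'B'],
    'age':    [30, 34, np.nan, 60, np.nan, 64],
})

# The overall median (47.0) is typical of neither clinic.
# transform('median') computes each clinic's median and aligns it row-by-row,
# so fillna fills each gap with its own group's value.
df['age'] = df['age'].fillna(
    df.groupby('clinic')['age'].transform('median')
)
print(df)  # clinic A's gap becomes 32.0, clinic B's becomes 62.0
```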

Check Your Understanding

  1. You're analyzing a survey about internet use. Respondents over 65 are more likely to leave the "hours spent online per week" question blank. Is this MCAR, MAR, or MNAR?
  2. A temperature sensor occasionally fails at random intervals. The missing readings have no pattern — they occur equally in summer and winter, day and night. Is this MCAR, MAR, or MNAR?
  3. In a study of household income, wealthier households are less likely to disclose their income. Is this MCAR, MAR, or MNAR?

8.3 Duplicates: The Same Row, Twice (or Worse)

Duplicates are deceptively dangerous. A single duplicated row might not change your analysis much. But 500 duplicated rows can skew averages, inflate counts, and make your conclusions unreliable.

Finding Duplicates

# Check for exact duplicates (all columns match)
print(f"Number of duplicate rows: {messy_data.duplicated().sum()}")

# See which rows are duplicates
print(messy_data[messy_data.duplicated(keep=False)])
# keep=False marks ALL occurrences, not just the second one

# Check for duplicates based on specific columns
# (maybe the same patient visited twice — different visit dates are fine)
print(messy_data.duplicated(subset=['patient_id']).sum())

The keep parameter controls which duplicate is marked:

  • keep='first' (default): marks all duplicates except the first occurrence
  • keep='last': marks all duplicates except the last occurrence
  • keep=False: marks all duplicates (useful for inspection)

# Let's see our duplicate patients
dupes = messy_data[messy_data.duplicated(subset=['patient_id'], keep=False)]
print(dupes)

Output:

   patient_id       name  age  gender   visit_date blood_pressure diagnosis
2         103  carol lee   28  female  2024-03-15         115/75  Diabetes
3         103  Carol Lee   28  Female  2024-03-15         115/75  Diabetes

Look at that — the two rows for patient 103 are almost identical, but the name capitalization and gender formatting differ. Are these the same person entered twice, or two different people with the same ID? This is a judgment call that requires domain knowledge.
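One way to inform that judgment is to standardize the formatting first and then re-check: if the rows become identical after lowercasing and stripping, they are very likely the same entry. A minimal sketch with just the two suspect rows:

```python
import pandas as pd

# The two rows for patient 103, as originally entered
dupes = pd.DataFrame({
    'patient_id': [103, 103],
    'name':   ['carol lee', 'Carol Lee'],
    'gender': ['female', 'Female'],
})

# Normalize the text columns, then re-run the exact-duplicate check
normalized = dupes.copy()
for col in ['name', 'gender']:
    normalized[col] = normalized[col].str.lower().str.strip()

print(normalized.duplicated().sum())  # 1 — the rows match once formatting is normalized
```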

Removing Duplicates

# Remove exact duplicates
clean_df = messy_data.drop_duplicates()

# Remove duplicates based on specific columns, keeping the first occurrence
clean_df = messy_data.drop_duplicates(subset=['patient_id'], keep='first')

# Remove duplicates based on specific columns, keeping the last occurrence
clean_df = messy_data.drop_duplicates(subset=['patient_id'], keep='last')

When Duplicates Aren't Obvious

Sometimes duplicates are fuzzy — the same entity recorded slightly differently:

# These might all be the same country:
countries = pd.Series([
    'United States', 'United States of America', 'US', 'USA',
    'U.S.A.', 'united states', 'Untied States'  # typo!
])

Fuzzy deduplication is a harder problem that we'll touch on in Section 8.5 when we discuss inconsistent categories. For now, know that it exists and that real datasets are full of it.
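To give you a taste, Python's standard library includes difflib.get_close_matches, which scores string similarity. This is a rough sketch, not production-grade deduplication (dedicated fuzzy-matching libraries exist for that):

```python
import difflib
import pandas as pd

countries = pd.Series([
    'United States', 'United States of America', 'US', 'USA',
    'U.S.A.', 'united states', 'Untied States'  # typo!
])

# Which raw values are textually close to the canonical name?
# cutoff=0.8 requires 80% similarity — enough to catch the typo and the
# lowercase variant, but not abbreviations like 'USA'
matches = difflib.get_close_matches(
    'United States', list(countries.unique()), n=5, cutoff=0.8
)
print(matches)
```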

Scenario Walkthrough

Elena is working with vaccination records from multiple clinics. She notices that the same patient sometimes appears in records from two different clinics — the patient visited one for a first dose and another for a second dose. If she deduplicates on patient ID alone, she'll lose the second visit, which is a legitimate record. If she doesn't deduplicate at all, she'll double-count patients when calculating "number of unique individuals vaccinated."

The solution: Elena deduplicates for counting purposes (unique patients) but keeps all rows for visit-level analysis (number of doses administered). She creates two views of the data: one for patient-level questions and one for visit-level questions. This is a common pattern in healthcare data.
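The two-views pattern is short to express in pandas. A sketch with hypothetical columns:

```python
import pandas as pd

# Hypothetical visit-level records: one row per dose administered
visits = pd.DataFrame({
    'patient_id':  [1, 1, 2, 3, 3],
    'clinic':      ['North', 'South', 'North', 'South', 'South'],
    'dose_number': [1, 2, 1, 1, 2],
})

# Visit-level view: every row counts
n_doses = len(visits)                                     # 5 doses

# Patient-level view: one row per unique individual
patients = visits.drop_duplicates(subset=['patient_id'])
n_patients = len(patients)                                # 3 patients

print(f"Doses administered: {n_doses}")
print(f"Unique patients:    {n_patients}")
```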


8.4 Type Errors: When Numbers Aren't Numbers

One of the most frustrating data cleaning problems is when a column looks like it contains numbers but pandas treats it as text. This usually happens because one or more values in the column aren't valid numbers, so pandas reads the entire column as strings.

Diagnosing Type Problems

# Always check dtypes as part of your initial data exploration
print(messy_data.dtypes)

Output:

patient_id         int64
name              object    # object usually means string
age               object    # Uh oh — age should be numeric!
gender            object
visit_date        object    # Should be datetime
blood_pressure    object
diagnosis         object

The age column is object (pandas speak for "string") instead of int64 or float64. That means pandas found something in the age column that isn't a number. Let's find the culprit:

# Find non-numeric values in a column that should be numeric
def find_non_numeric(series):
    """Return values that can't be converted to numbers."""
    mask = pd.to_numeric(series, errors='coerce').isnull() & series.notnull()
    return series[mask]

print(find_non_numeric(messy_data['age']))

Output:

1    forty-five
Name: age, dtype: object

There it is. One cell contains "forty-five" instead of 45, and that single value forced the entire column to be stored as strings.

Fixing Type Errors

Numeric Conversion with to_numeric

# The gentle approach: coerce errors to NaN
messy_data['age_clean'] = pd.to_numeric(messy_data['age'], errors='coerce')
print(messy_data[['age', 'age_clean']])

The errors parameter is your best friend here:

  • errors='raise' (default): throws an error if any value can't be converted
  • errors='coerce': converts unconvertible values to NaN
  • errors='ignore': returns the original column unchanged (rarely useful, and deprecated in recent pandas versions)

# After coercion, the problematic value becomes NaN
# You can then decide: drop it, impute it, or look up the correct value

Type Casting with astype

When you're confident the conversion is safe, astype() performs a direct type cast:

# Convert a column that's already clean
df['zip_code'] = df['zip_code'].astype(str)       # ZIP codes are labels, not quantities
df['quantity'] = df['quantity'].astype(int)         # String "5" to integer 5
df['price'] = df['price'].astype(float)             # String "9.99" to float

# astype will RAISE an error on invalid values — unlike to_numeric with coerce
# Use it when you've already cleaned the column

Date Conversion with to_datetime

Date columns are especially common type errors, because dates come in countless formats:

# Convert various date formats to proper datetime
messy_data['visit_date_clean'] = pd.to_datetime(
    messy_data['visit_date'],
    format='mixed',      # Let pandas infer each value's format (pandas 2.0+)
    dayfirst=False        # Assume month comes first in ambiguous cases
)
print(messy_data[['visit_date', 'visit_date_clean']])

For more control, you can specify the exact format:

# When all dates are in the same format
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')

# Common format codes:
# %Y = four-digit year (2024)
# %m = two-digit month (03)
# %d = two-digit day (15)
# %B = full month name (March)
# %b = abbreviated month name (Mar)
# %H = hour (24h), %M = minute, %S = second
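Putting those codes together, here's a quick sketch parsing the formats we saw in the messy visit_date column:

```python
import pandas as pd

# Each format string must describe its input exactly
print(pd.to_datetime('2024-03-15', format='%Y-%m-%d'))
print(pd.to_datetime('03/16/2024', format='%m/%d/%Y'))
print(pd.to_datetime('March 19, 2024', format='%B %d, %Y'))
# All three parse to Timestamps in March 2024
```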

Debugging Walkthrough

You try pd.to_datetime(df['date']) and get this error: ParserError: Unknown string format: TBD

What happened: One of the date values is literally "TBD" (To Be Determined). pandas can't parse it as a date.

The fix: Use errors='coerce' to turn unparseable dates into NaT (Not a Time):

df['date'] = pd.to_datetime(df['date'], errors='coerce')

Then investigate: how many dates became NaT?

print(f"Unparseable dates: {df['date'].isna().sum()}")

The lesson: Always use errors='coerce' on your first pass, then investigate what couldn't be parsed. Don't silently lose data — know exactly what you're losing and why.

Cleaning Strings Before Conversion

Sometimes you need to clean a string before converting it to a number:

# Remove dollar signs and commas from a "price" column
df['price'] = df['price_raw'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Or chain it all together
df['price'] = (
    df['price_raw']
    .str.replace('[$,]', '', regex=True)   # Remove $ and , in one step
    .pipe(pd.to_numeric, errors='coerce')  # Convert to numeric
)

Check Your Understanding

  1. What's the difference between astype(float) and pd.to_numeric(errors='coerce')? When would you use each?
  2. A column called zip_code contains values like 02134 (Boston), 10001 (NYC), and 90210 (Beverly Hills). Should you store this as int or str? Why?
  3. You have a column with values like "120/80" (blood pressure). What steps would you take to extract the systolic (120) and diastolic (80) values into separate numeric columns?

8.5 Inconsistent Categories: "Male," "male," "M," and Other Headaches

Categorical data — columns with a fixed set of values like gender, country, or diagnosis — is a breeding ground for inconsistency. Every person who enters data has their own conventions. The result is chaos.

Seeing the Problem

# How many unique values does the gender column have?
print(messy_data['gender'].value_counts())

Output:

Female    4
M         2
female    1
F         1
Male      1
male      1

Six different representations of two categories. Let's fix that.

Standardizing with replace and map

# Approach 1: replace — specify a mapping of old values to new values
gender_map = {
    'Female': 'Female', 'female': 'Female', 'F': 'Female',
    'Male': 'Male', 'male': 'Male', 'M': 'Male'
}
messy_data['gender_clean'] = messy_data['gender'].replace(gender_map)
print(messy_data['gender_clean'].value_counts())

# Approach 2: map — similar, but unmapped values become NaN
# Useful when you want to be strict about valid categories
messy_data['gender_clean'] = messy_data['gender'].map(gender_map)

# Approach 3: String methods — when the inconsistency is just capitalization
messy_data['diagnosis_clean'] = messy_data['diagnosis'].str.title()
# 'flu' -> 'Flu', 'FLU' -> 'Flu', 'Flu' -> 'Flu'
print(messy_data['diagnosis_clean'].value_counts())

String Methods for Categorical Cleanup

The .str accessor gives you access to a whole family of string operations that work element-wise on a Series:

# Standardize capitalization
df['city'] = df['city'].str.lower()       # everything lowercase
df['city'] = df['city'].str.upper()       # EVERYTHING UPPERCASE
df['city'] = df['city'].str.title()       # Title Case Each Word
df['city'] = df['city'].str.strip()       # Remove leading/trailing whitespace

# Fix common problems
df['name'] = df['name'].str.strip()                    # " Alice " -> "Alice"
df['code'] = df['code'].str.replace(' ', '', regex=False)  # "AB 123" -> "AB123"
df['email'] = df['email'].str.lower().str.strip()     # Chain operations

# Search and filter
mask = df['city'].str.contains('york', case=False, na=False)
new_york_rows = df[mask]

A Systematic Approach to Category Standardization

Here's a workflow that works well in practice:

def standardize_column(series, mapping=None, case='lower'):
    """
    Standardize a categorical column.

    Parameters:
    -----------
    series : pd.Series
        The column to standardize
    mapping : dict, optional
        Explicit old->new value mapping
    case : str
        'lower', 'upper', or 'title'
    """
    # Step 1: Strip whitespace
    result = series.str.strip()

    # Step 2: Standardize case
    if case == 'lower':
        result = result.str.lower()
    elif case == 'upper':
        result = result.str.upper()
    elif case == 'title':
        result = result.str.title()

    # Step 3: Apply explicit mapping if provided
    if mapping:
        result = result.replace(mapping)

    return result

# Usage
messy_data['gender_clean'] = standardize_column(
    messy_data['gender'],
    mapping={'f': 'female', 'm': 'male'},
    case='lower'
)

When Categories Need Domain Knowledge

Sometimes standardization requires understanding what the values mean, not just what they look like:

# These might all be the same country
country_values = pd.Series([
    'Korea, Republic of', 'South Korea', 'Republic of Korea',
    'Korea (South)', 'Korea, Rep.', 'KOR'
])

# You need a lookup table (often called a "crosswalk")
# to map variations to canonical names
country_crosswalk = {
    'Korea, Republic of': 'South Korea',
    'Republic of Korea': 'South Korea',
    'Korea (South)': 'South Korea',
    'Korea, Rep.': 'South Korea',
    'KOR': 'South Korea'
}

country_values_clean = country_values.replace(country_crosswalk)

Building crosswalk tables is tedious but essential. In professional data science, entire teams maintain standardized code lists. The WHO, for instance, uses ISO 3166 country codes precisely to avoid this kind of ambiguity.
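Whenever you apply a crosswalk, check for values it doesn't cover — new variations appear constantly. A sketch continuing the Korea example (the 'Korea (S.)' variant is invented to show a miss):

```python
import pandas as pd

country_values = pd.Series([
    'Korea, Republic of', 'South Korea', 'Republic of Korea', 'Korea (S.)'
])

# A deliberately incomplete crosswalk
country_crosswalk = {
    'Korea, Republic of': 'South Korea',
    'Republic of Korea': 'South Korea',
}

cleaned = country_values.replace(country_crosswalk)

# Anything that's neither mapped nor already canonical needs a human look
canonical = {'South Korea'}
unmapped = cleaned[~cleaned.isin(canonical)]
print(unmapped)  # 'Korea (S.)' — a variation the crosswalk missed
```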

Real-World Application

Marcus is analyzing his bakery's sales data. The product names in his point-of-sale system are entered by different employees:

  • "Choc Croissant," "Chocolate Croissant," "choc croiss," "CHOCOLATE CROISSANT," "Chcolate Croissant" (typo)

These are all the same product, but a naive analysis would treat them as five different items. Marcus needs to standardize his product names before he can answer "which item sells the most?"

His approach: lowercase everything, strip whitespace, then build a mapping table for variations. For the typo ("Chcolate"), he'd catch it by sorting unique values and eyeballing the list for near-duplicates.
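Marcus's workflow takes only a few lines. Lowercasing and stripping first shrinks the list of variants, and sorting what's left puts near-duplicates next to each other (the product names are from the example above):

```python
import pandas as pd

sales = pd.Series([
    'Choc Croissant', 'Chocolate Croissant', 'choc croiss',
    'CHOCOLATE CROISSANT', 'Chcolate Croissant',  # typo
])

# Normalize the easy inconsistencies, then eyeball what's left
normalized = sales.str.lower().str.strip()
for name in sorted(normalized.unique()):
    print(name)
# The typo 'chcolate croissant' sorts alongside the 'choc...' variants,
# so adjacent lines make it easy to spot
```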


8.6 Outliers: When Valid Data Looks Wrong

An outlier is a data point that's far from the other values in the dataset. But "far" is a judgment call, and not all outliers are errors.

The Outlier Dilemma

Consider these scenarios:

  1. A patient's age is recorded as -3. That's an error — you can't be negative years old.
  2. A patient's age is recorded as 104. That's unusual but entirely possible. Removing it because it's "extreme" would be wrong.
  3. A test score is 0 in a dataset where most scores are between 60 and 100. Is the student very low-performing, or did they not take the test and 0 is a placeholder for "missing"?

The fundamental question with outliers is: is this a data error, or is this real data that happens to be extreme? There's no automatic answer. You need domain knowledge.

Detecting Outliers

# Method 1: Descriptive statistics
print(df['age'].describe())
# Look at min, max, and compare to the mean and standard deviation

# Method 2: The IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print(f"Outliers detected: {len(outliers)}")
print(outliers)

# Method 3: Z-scores (how many standard deviations from the mean)
from scipy import stats
z_scores = stats.zscore(df['age'], nan_policy='omit')  # NaN-aware; stays aligned with df
outliers = df[abs(z_scores) > 3]  # Common threshold: 3 standard deviations

# Method 4: Simple domain-based rules
impossible = df[(df['age'] < 0) | (df['age'] > 120)]
print(f"Impossible ages: {len(impossible)}")

Handling Outliers

Once you've identified outliers, your options include:

# Option 1: Remove them (only if you're sure they're errors)
df = df[df['age'] >= 0]                     # Remove impossible values
df = df[(df['age'] >= 0) & (df['age'] <= 120)]  # Remove implausible values

# Option 2: Cap them (winsorization)
df['age_capped'] = df['age'].clip(lower=0, upper=100)

# Option 3: Replace with NaN and then handle as missing data
df.loc[df['age'] < 0, 'age'] = np.nan

# Option 4: Keep them and investigate
# Sometimes the outlier IS the interesting finding!

Ethical Analysis

In 2019, a widely-cited machine learning study showed that an algorithm used by U.S. hospitals to prioritize patients for care management was systematically discriminating against Black patients. Part of the problem was data processing: the algorithm used healthcare spending as a proxy for health needs, but Black patients historically had less access to healthcare and thus lower spending — not because they were healthier, but because of systemic barriers.

What does this have to do with outliers? If Black patients' healthcare spending looked "unusually low" compared to white patients, a naive outlier detection approach might flag those values as anomalies to be "corrected." But the values weren't errors — they were reflections of real-world inequality. "Cleaning" them away would erase evidence of the very disparity the data should have revealed.

The lesson: never remove outliers without understanding why they're outliers. An outlier might be an error, or it might be the most important data point in your dataset.


8.7 The Data Cleaning Log: Document Everything

Here's a practice that separates professional data scientists from beginners: write down every cleaning decision you make.

This isn't busywork. It's essential for three reasons:

  1. Reproducibility. If someone asks "why did you drop those 200 rows?" you need to be able to answer.
  2. Auditability. If your analysis is challenged, your cleaning log shows your work — like showing your math on an exam.
  3. Future you. Six months from now, you won't remember why you replaced "N/A" with the median. Your log will.

What a Cleaning Log Looks Like

A data cleaning log is a record of every transformation applied to raw data, including the rationale. It can live in a Markdown cell in your Jupyter notebook, a separate text file, or even a dedicated DataFrame.

Here's a simple template:

# Data Cleaning Log — Vaccination Dataset
# Date: 2024-03-22
# Analyst: Elena
# Source file: raw_vaccination_data.csv (47,283 rows, 12 columns)

cleaning_log = []

def log_step(step_number, description, rows_before, rows_after, rationale):
    """Record a cleaning step."""
    cleaning_log.append({
        'step': step_number,
        'description': description,
        'rows_before': rows_before,
        'rows_after': rows_after,
        'rows_affected': rows_before - rows_after,
        'rationale': rationale
    })

# Step 1
n_before = len(df)
df = df.drop_duplicates(subset=['country', 'date', 'vaccine_type'])
log_step(1,
         'Remove exact duplicate records',
         n_before, len(df),
         'Duplicate entries from overlapping clinic reporting systems')

# Step 2
n_before = len(df)
df['vaccination_count'] = pd.to_numeric(df['vaccination_count'], errors='coerce')
n_coerced = df['vaccination_count'].isna().sum()
log_step(2,
         'Convert vaccination_count to numeric, coerce errors to NaN',
         n_before, n_before,  # No rows removed
         f'{n_coerced} values could not be parsed as numbers')

# ... continue for each step

# At the end, create a summary
log_df = pd.DataFrame(cleaning_log)
print(log_df.to_string(index=False))

The Notebook-Based Approach

In a Jupyter notebook, you can use Markdown cells to document inline:

### Cleaning Step 3: Handle missing vaccination counts

**Problem:** 2,347 rows (4.9%) have missing vaccination counts.

**Investigation:** Missing counts correlate with smaller clinics (< 50 patients/day).
This appears to be MAR — missingness is related to clinic size, not to the actual
vaccination count.

**Decision:** Fill missing counts with the median count for the same country and
vaccine type. This preserves the country-level patterns while not distorting the
overall distribution.

**Rationale:** Deletion would disproportionately remove data from small/rural
clinics, introducing urban bias. Group-median imputation is more appropriate than
overall median because vaccination rates vary substantially by country.

**Rows affected:** 2,347 values imputed. No rows removed.

Then the code cell immediately below performs the actual transformation:

# Cleaning Step 3: Impute missing vaccination counts with group median
df['vaccination_count'] = df.groupby(['country', 'vaccine_type'])[
    'vaccination_count'
].transform(lambda x: x.fillna(x.median()))

This pattern — Markdown explanation followed by code implementation — makes your notebook a self-documenting cleaning pipeline. Anyone reading it (including future you) can understand not just what you did but why you did it.

Action Checklist: The Data Cleaning Workflow

Use this checklist every time you clean a new dataset. It won't cover every edge case, but it will catch the most common problems.

  • [ ] Initial inspection: df.shape, df.dtypes, df.head(), df.describe(), df.info()
  • [ ] Missing values: df.isnull().sum(), decide on strategy per column
  • [ ] Duplicates: df.duplicated().sum(), investigate and remove/keep
  • [ ] Data types: Check each column — are numbers stored as strings? Are dates parsed correctly?
  • [ ] Categorical consistency: For each categorical column, df[col].value_counts() — look for misspellings, case variations, abbreviations
  • [ ] Value ranges: For numeric columns, check min/max for implausible values
  • [ ] Document: Log every cleaning decision with rationale
  • [ ] Verify: After cleaning, re-run df.shape, df.dtypes, df.isnull().sum() to confirm
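The inspection steps above can be bundled into a small reusable helper (a sketch; quick_audit is a name I've chosen, not a pandas function):

```python
import pandas as pd

def quick_audit(df: pd.DataFrame) -> None:
    """Print the first-pass checks from the cleaning checklist."""
    print(f"Shape: {df.shape}")
    print(f"\nDtypes:\n{df.dtypes}")
    print(f"\nMissing values per column:\n{df.isnull().sum()}")
    print(f"\nExact duplicate rows: {df.duplicated().sum()}")
    # Value counts for low-cardinality (likely categorical) columns
    for col in df.select_dtypes(include='object'):
        if df[col].nunique() <= 20:
            print(f"\n{col}:\n{df[col].value_counts(dropna=False)}")

# Toy example
df = pd.DataFrame({'country': ['US', 'US ', 'Brazil'],
                   'doses': [100, None, 250]})
quick_audit(df)
```

Running it once at the start and once at the end of a cleaning session covers the "initial inspection" and "verify" checkboxes with the same code.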

8.8 Threshold Concept: Cleaning IS Analysis

Threshold Concept: Cleaning IS Analysis

There's a persistent myth in data science education that cleaning is the "prep work" you do before the "real" analysis begins. This framing is wrong, and it's worth understanding why.

Consider this scenario: you have a dataset of 10,000 patient records, and 1,200 are missing the "income" field. You decide to drop those rows. You've now made an analytical decision: your results describe the 8,800 patients who reported income. Are those patients representative of all 10,000? Almost certainly not. People who decline to report income are different from people who report it — maybe they're wealthier, maybe they're poorer, maybe they distrust the institution collecting the data.

Here's another scenario: a temperature sensor records -999 when it malfunctions. You replace those values with the average temperature. But what if the sensor malfunctions more often in extreme cold? By replacing -999 with an average, you've biased your data toward moderate temperatures and hidden the very extremes you might want to study.
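A toy version of the sensor scenario makes the effect visible (the readings are invented):

```python
import pandas as pd
import numpy as np

# Six readings; -999 is the sensor's malfunction sentinel
readings = pd.Series([12.0, 14.0, -999.0, 13.0, -999.0, 15.0])

# Naive fix: replace the sentinel with the mean of the valid readings
valid_mean = readings[readings != -999].mean()
naive = readings.replace(-999, valid_mean)

# Honest fix: mark the sentinel as missing and keep it visible
honest = readings.replace(-999, np.nan)

print(f"Naive:  mean={naive.mean():.2f}, std={naive.std():.2f}, "
      f"missing={naive.isna().sum()}")
print(f"Honest: mean={honest.mean():.2f}, std={honest.std():.2f}, "
      f"missing={honest.isna().sum()}")
# Mean imputation shrinks the spread of the data and hides
# the fact that anything was ever missing.
```

The naive version reports a smaller standard deviation and zero missing values, which is exactly the "biased toward moderate temperatures" problem described above.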

Every cleaning operation is a hypothesis. "I'm dropping these rows" is a hypothesis that the remaining rows are representative. "I'm filling missing values with the median" is a hypothesis that the missing values are distributed like the non-missing ones. "I'm combining these two categories" is a hypothesis that the distinction between them doesn't matter.

Once you see cleaning as analysis, two things happen. First, you become more careful — you stop using dropna() casually and start thinking about who you're dropping. Second, you become more confident — because you realize that there's no "objectively clean" version of a dataset, only a version that's been cleaned in a way that's appropriate for your specific question. Your cleaning choices should be driven by your analytical goals, not by a generic checklist.

The professionals who are best at data cleaning aren't the ones with the fanciest tools. They're the ones who pause before every operation and ask: What am I assuming? What could go wrong? Who might be affected?


8.9 Putting It All Together: A Complete Cleaning Pipeline

Let's walk through a realistic cleaning scenario from start to finish. We'll work with a simulated vaccination dataset that has the kinds of problems you'll encounter in real public health data.

import pandas as pd
import numpy as np

# Simulating a messy vaccination dataset
np.random.seed(42)
n = 200

raw_data = pd.DataFrame({
    'country': np.random.choice(
        ['United States', 'US', 'USA', 'Brazil', 'brasil', 'Brazil ',
         'India', 'INDIA', 'Germany', 'germany', 'Nigeria', 'Japan',
         'Japan ', 'South Africa', 'S. Africa'], n
    ),
    'date': np.random.choice(
        ['2024-01-15', '01/15/2024', 'Jan 15, 2024', '2024-02-20',
         '2024-03-10', 'March 10 2024', 'TBD', '', '2024-04-05'], n
    ),
    'vaccine_type': np.random.choice(
        ['Pfizer', 'pfizer', 'PFIZER', 'Moderna', 'moderna',
         'AstraZeneca', 'astrazeneca', 'J&J', 'Johnson & Johnson'], n
    ),
    'doses_administered': np.random.choice(
        [str(x) for x in range(100, 10000)] + ['N/A', 'unknown', '-', ''],
        n
    ),
    'region': np.random.choice(
        ['North America', 'South America', 'Asia', 'Europe',
         'Africa', None, ''], n
    )
})

# Add some exact duplicates
raw_data = pd.concat([raw_data, raw_data.iloc[10:15]], ignore_index=True)

print(f"Raw data shape: {raw_data.shape}")
print(f"\nFirst few rows:")
print(raw_data.head(10))

Now let's clean it step by step:

# ============================================================
# DATA CLEANING PIPELINE — Vaccination Dataset
# ============================================================

df = raw_data.copy()  # Always work on a copy!
cleaning_log = []

# --- STEP 1: Initial Assessment ---
print("=" * 60)
print("STEP 1: INITIAL ASSESSMENT")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nSample:\n{df.head()}")

# --- STEP 2: Remove Exact Duplicates ---
n_before = len(df)
df = df.drop_duplicates()
n_after = len(df)
print(f"\nSTEP 2: Removed {n_before - n_after} exact duplicate rows")
cleaning_log.append(f"Step 2: Removed {n_before - n_after} exact duplicates")

# --- STEP 3: Standardize Country Names ---
# First, see what we're dealing with
print(f"\nSTEP 3: Country values before cleaning:")
print(df['country'].value_counts())

# Clean: strip whitespace, title case, then map variations
df['country'] = df['country'].str.strip().str.title()
country_map = {
    'Us': 'United States',
    'Usa': 'United States',
    'Brasil': 'Brazil',
    'S. Africa': 'South Africa'
}
df['country'] = df['country'].replace(country_map)
print(f"\nCountry values after cleaning:")
print(df['country'].value_counts())
cleaning_log.append("Step 3: Standardized country names (whitespace, case, aliases)")

# --- STEP 4: Standardize Vaccine Types ---
df['vaccine_type'] = df['vaccine_type'].str.strip().str.title()
vaccine_map = {
    'J&J': 'Johnson & Johnson',
    'Astrazeneca': 'AstraZeneca'  # .str.title() flattens the camel case
}
df['vaccine_type'] = df['vaccine_type'].replace(vaccine_map)
print(f"\nSTEP 4: Vaccine types after cleaning:")
print(df['vaccine_type'].value_counts())
cleaning_log.append("Step 4: Standardized vaccine type names")

# --- STEP 5: Parse Dates ---
df['date'] = df['date'].replace({'': np.nan, 'TBD': np.nan})
df['date'] = pd.to_datetime(df['date'], format='mixed', errors='coerce')
n_bad_dates = df['date'].isna().sum()
print(f"\nSTEP 5: {n_bad_dates} dates could not be parsed")
cleaning_log.append(f"Step 5: Parsed dates; {n_bad_dates} unparseable -> NaT")

# --- STEP 6: Convert Doses to Numeric ---
df['doses_administered'] = df['doses_administered'].replace(
    {'N/A': np.nan, 'unknown': np.nan, '-': np.nan, '': np.nan}
)
df['doses_administered'] = pd.to_numeric(
    df['doses_administered'], errors='coerce'
)
n_missing_doses = df['doses_administered'].isna().sum()
print(f"\nSTEP 6: {n_missing_doses} dose values are missing after conversion")
cleaning_log.append(f"Step 6: Converted doses to numeric; {n_missing_doses} missing")

# --- STEP 7: Handle Missing Region ---
df['region'] = df['region'].replace({'': np.nan})
n_missing_region = df['region'].isna().sum()

# Build a country-to-region lookup from non-missing data
region_lookup = (
    df.dropna(subset=['region'])
    .drop_duplicates(subset=['country'])
    .set_index('country')['region']
    .to_dict()
)
df['region'] = df['region'].fillna(df['country'].map(region_lookup))
n_still_missing = df['region'].isna().sum()
print(f"\nSTEP 7: Filled {n_missing_region - n_still_missing} missing regions using country lookup")
cleaning_log.append(f"Step 7: Imputed {n_missing_region - n_still_missing} regions from country lookup")

# --- STEP 8: Final Assessment ---
print("\n" + "=" * 60)
print("FINAL ASSESSMENT")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# --- CLEANING LOG ---
print("\n" + "=" * 60)
print("CLEANING LOG")
print("=" * 60)
for entry in cleaning_log:
    print(f"  {entry}")

Notice the pattern: assess, transform, verify, document. Every step follows this rhythm.


8.10 Project Checkpoint: Cleaning the Vaccination Dataset

It's time to apply everything you've learned to the progressive project.

Project Milestone: Clean the Vaccination Dataset

In Chapter 7, you loaded the WHO vaccination dataset into a pandas DataFrame and performed initial exploration. Now you're going to clean it.

Your tasks:

  1. Audit the data quality. Run df.info(), df.isnull().sum(), df.duplicated().sum(), and df.describe(). Write a Markdown cell summarizing what you find.

  2. Handle missing values.
     - For numeric columns: decide between dropping and imputing. Use group-based imputation where appropriate (e.g., fill missing vaccination counts with the median for the same country).
     - For categorical columns: decide whether to fill with "Unknown" or drop.
     - Document your rationale for each decision.

  3. Remove duplicates. Identify duplicate records (same country, same date, same vaccine type). Remove them, keeping the first occurrence.

  4. Fix data types. Ensure date columns are datetime64, numeric columns are float64 or int64, and categorical columns are clean strings.

  5. Standardize categories. Standardize country names, vaccine type names, and any other categorical columns.

  6. Create a cleaning log. Document every step: what you did, how many rows/values were affected, and why.

  7. Save the clean data. Export the cleaned DataFrame to a new CSV file:

     df.to_csv('vaccination_data_clean.csv', index=False)

  8. Verify. After saving, reload the file and confirm the data looks right:

     df_check = pd.read_csv('vaccination_data_clean.csv')
     print(df_check.shape)
     print(df_check.dtypes)
     print(df_check.isnull().sum())

Deliverable: A Jupyter notebook with your cleaning pipeline, Markdown documentation, and a final clean CSV file.


8.11 Common Cleaning Patterns: A Quick Reference

Here are the patterns you'll use most often, collected in one place for easy reference.

Missing Values

# Detect
df.isnull().sum()                    # Count per column
df.isnull().mean() * 100            # Percentage per column
df[df.isnull().any(axis=1)]         # Rows with any missing value

# Remove
df.dropna()                         # Drop rows with ANY missing value
df.dropna(subset=['col1', 'col2'])  # Drop rows where specific columns are missing
df.dropna(thresh=3)                 # Keep rows with at least 3 non-null values

# Fill
df['col'].fillna(0)                 # Fill with constant
df['col'].fillna(df['col'].mean())  # Fill with mean
df['col'].fillna(df['col'].median()) # Fill with median
df['col'].ffill()                   # Forward fill (fillna(method=...) is deprecated)
df.groupby('group')['col'].transform(
    lambda x: x.fillna(x.median())
)                                    # Group-based imputation

Duplicates

# Detect
df.duplicated().sum()                          # Count exact duplicates
df.duplicated(subset=['id_col']).sum()         # Duplicates by key
df[df.duplicated(subset=['id_col'], keep=False)]  # Show all duplicates

# Remove
df.drop_duplicates()                           # Remove exact duplicates
df.drop_duplicates(subset=['id_col'])          # Deduplicate by key
df.drop_duplicates(subset=['id_col'], keep='last')  # Keep last occurrence

Type Conversion

# Numeric
pd.to_numeric(df['col'], errors='coerce')      # Safe numeric conversion
df['col'].astype(float)                         # Direct cast (raises on error)

# Datetime
pd.to_datetime(df['col'], errors='coerce')      # Safe date parsing
pd.to_datetime(df['col'], format='%Y-%m-%d')    # Specified format

# String
df['col'].astype(str)                           # Convert to string

# Pre-clean for numeric conversion
df['col'].str.replace('[$,]', '', regex=True)   # Remove symbols first

Category Standardization

# Case normalization
df['col'].str.lower()                           # lowercase
df['col'].str.upper()                           # UPPERCASE
df['col'].str.title()                           # Title Case

# Whitespace
df['col'].str.strip()                           # Trim edges
df['col'].str.replace(r'\s+', ' ', regex=True)  # Collapse internal whitespace

# Mapping
df['col'].replace({'old1': 'new1', 'old2': 'new2'})  # Explicit mapping
df['col'].map(mapping_dict)                           # Map (unmapped -> NaN)
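The difference between replace() and map() on unmapped values trips people up, so here it is side by side:

```python
import pandas as pd

s = pd.Series(['US', 'USA', 'Brazil'])
mapping = {'US': 'United States', 'USA': 'United States'}

print(s.replace(mapping).tolist())
# ['United States', 'United States', 'Brazil']  <- unmapped values pass through

print(s.map(mapping).tolist())
# ['United States', 'United States', nan]       <- unmapped values become NaN
```

Use replace() when your mapping covers only the known problem values; use map() when you want unmapped values surfaced as NaN so you can catch categories you forgot.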

8.12 Ethical Dimensions of Data Cleaning

We've touched on ethics at several points in this chapter, but let's bring it together. Data cleaning is not ethically neutral. Every cleaning decision has the potential to amplify or mute certain voices in the data.

Five Questions to Ask Before Every Cleaning Decision

  1. Who is missing? When you drop rows with missing values, whose data are you losing? Are certain demographics more likely to have incomplete records? If so, your "clean" dataset may systematically under-represent already marginalized groups.

  2. What am I assuming? When you impute a missing value with the mean, you're assuming the missing value is typical. When you standardize categories by merging them, you're assuming the distinction doesn't matter. Are those assumptions justified for your specific question?

  3. Could my cleaning change the answer? Run your analysis on the uncleaned data (or with different cleaning strategies) and see if the conclusions differ. If they do, your cleaning choices are shaping your results — and you need to be transparent about that.

  4. Am I erasing real variation? Outliers, unusual categories, and edge cases sometimes reflect real phenomena — especially phenomena that affect small or marginalized groups. Removing them for "cleanliness" can erase evidence of disparities.

  5. Is my cleaning reproducible? If someone else cleaned this data, would they make the same choices? If not, would different choices lead to different conclusions? Document your decisions so others can evaluate them.
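Question 3 can be answered empirically: run the same summary statistic under several cleaning strategies and compare the results (a sketch with invented scores):

```python
import pandas as pd

# Hypothetical test scores with some missing values
scores = pd.Series([85, 90, None, 70, None, 95, 60])

strategies = {
    'drop missing':     scores.dropna(),
    'fill with 0':      scores.fillna(0),
    'fill with median': scores.fillna(scores.median()),
}

for name, cleaned in strategies.items():
    print(f"{name:>16}: mean = {cleaned.mean():.1f}")
```

Here the three means differ substantially, which is a signal that the cleaning choice is shaping the answer and must be reported alongside it.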

Ethical Analysis

Consider a dataset of student test scores where some students have missing scores. The school discovers that students who are absent on test day disproportionately come from low-income families (transportation issues, health problems, family obligations). Three different analysts handle this differently:

  • Analyst A drops all rows with missing scores. The resulting dataset over-represents higher-income students, and the school's average score looks higher than it really is.
  • Analyst B fills missing scores with zeros. The resulting dataset dramatically lowers the average for students from low-income families, making the achievement gap look worse than it may actually be.
  • Analyst C fills missing scores with the median score for the student's school, then creates a flag column indicating which scores were imputed. The analysis is run both with and without the imputed values, and both results are reported.

Analyst C's approach isn't the only right answer, but it's the most honest. It acknowledges uncertainty, preserves information about the imputation, and lets the reader evaluate how much the cleaning decisions affected the results.
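Analyst C's pattern (impute, flag, and report both ways) can be sketched like this (the schools and scores are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'school': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'score':  [80, 90, 100, None, 60, 62, 70, None],
})

# Flag BEFORE imputing, so the information isn't lost
df['score_imputed'] = df['score'].isna()

# Fill with the median score for the student's school
df['score'] = df.groupby('school')['score'].transform(
    lambda x: x.fillna(x.median())
)

# Report both ways
print(f"Mean (all rows, imputed included): {df['score'].mean():.1f}")
print(f"Mean (observed scores only):       "
      f"{df.loc[~df['score_imputed'], 'score'].mean():.1f}")
```

The flag column costs one line of code but preserves the ability to re-run any later analysis with and without the imputed values.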


8.13 Debugging Common Cleaning Errors

Even when you know the right approach, things go wrong. Here are the most common errors you'll hit during data cleaning, and how to fix them.

Debugging Walkthrough

Error 1: SettingWithCopyWarning

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

What happened: You filtered a DataFrame and then tried to modify it. pandas isn't sure whether you're modifying the original or a copy.

The fix: Use .loc for assignment, or make an explicit copy:

# Bad (triggers the warning)
subset = df[df['country'] == 'Brazil']
subset['region'] = 'South America'

# Good (explicit .loc assignment)
df.loc[df['country'] == 'Brazil', 'region'] = 'South America'

# Also good (explicit copy)
subset = df[df['country'] == 'Brazil'].copy()
subset['region'] = 'South America'


Error 2: fillna not working

df['age'].fillna(df['age'].median())
print(df['age'].isnull().sum())  # Still shows missing values!

What happened: fillna() returns a new Series by default — it doesn't modify the original.

The fix: Assign the result back:

df['age'] = df['age'].fillna(df['age'].median())  # Assign it back!


Error 3: Unexpected NaN after merge

You merge two DataFrames and suddenly have NaN values that weren't there before.

What happened: The join keys didn't match exactly. Maybe one DataFrame has "United States" and the other has "United States " (trailing space).

The fix: Strip whitespace from join keys before merging:

df1['country'] = df1['country'].str.strip()
df2['country'] = df2['country'].str.strip()
merged = df1.merge(df2, on='country')


Error 4: astype(int) fails on NaN

df['age'] = df['age'].astype(int)
# IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

What happened: Regular NumPy integer columns can't hold NaN, so the cast fails if any values are missing.

The fix: Use pandas' nullable integer type, or fill NaN first:

df['age'] = df['age'].astype('Int64')         # Nullable integer (capital I)
# or
df['age'] = df['age'].fillna(0).astype(int)   # Fill first, then cast


8.14 Beyond the Basics: What Comes Next

This chapter covers the fundamentals of data cleaning — the techniques you'll use on every single project. But data cleaning is a deep field, and there's more to learn as you progress:

Text cleaning (Chapter 10): Regular expressions, fuzzy matching, and natural language processing for messy text data.

Date and time cleaning (Chapter 11): Time zones, daylight saving time, inconsistent date formats across international datasets.

Automated data validation (Chapter 30): Building validation rules into your data pipeline so problems are caught automatically.

Statistical approaches to imputation (Chapter 26): Using regression models to predict missing values from other variables.

For now, the tools in this chapter will handle the vast majority of cleaning tasks you'll encounter. Master these before moving to the advanced techniques.


Chapter Summary

Let's come back to where we started. Data cleaning is not the boring part of data science. It is data science — or at least 80% of it. The skills you've learned in this chapter are the skills you'll use every day in professional practice:

  • Missing values are everywhere, and how you handle them shapes your results. isnull(), dropna(), and fillna() are your core tools, but the decisions behind them matter more than the code.
  • Duplicates sneak in from multiple data sources, data entry errors, and system glitches. duplicated() and drop_duplicates() find and remove them, but domain knowledge determines what counts as a "duplicate."
  • Type errors happen when numbers contain non-numeric characters, dates aren't parsed, or identifiers lose leading zeros. pd.to_numeric(), pd.to_datetime(), and astype() fix them, but you need to understand why the type is wrong before you can fix it correctly.
  • Inconsistent categories are the bane of categorical analysis. String methods (.str.lower(), .str.strip()) and explicit mapping (replace(), map()) tame the chaos.
  • Outliers demand judgment, not automation. Some are errors to remove; others are the most important data points in your dataset.
  • Documentation turns ad hoc cleaning into reproducible analysis. Log every decision, note every assumption, record every number of affected rows.

And above all: cleaning IS analysis. Every dropna() is a hypothesis about who can be excluded. Every fillna() is a hypothesis about what the missing value might have been. Every replace() is a hypothesis about which distinctions matter. Treat your cleaning decisions with the same rigor you'd bring to a statistical test, because they affect your results just as much.

The messy data isn't the enemy. It's the reality. And now you know how to work with it.


What's Next

In Chapter 9, you'll learn to reshape data — merging multiple DataFrames, pivoting between wide and long formats, and using groupby for aggregation. These operations build directly on the clean data you'll produce with the skills from this chapter. After all, you can't merge two datasets if the country names don't match between them. Cleaning comes first. Everything else builds on it.

In Chapter 10, you'll go deeper into text data — learning regular expressions and string extraction techniques that handle the messiest text fields. And in Chapter 11, you'll tackle dates and times, with all their wonderful complications (time zones, daylight saving time, leap years).

But first, make sure you've completed the project checkpoint in Section 8.10. Cleaning your vaccination dataset is the foundation for everything that follows in Part II.