
Learning Objectives

  • Identify and handle missing data using appropriate strategies
  • Detect and address data quality issues (duplicates, inconsistencies, errors)
  • Transform variables (recoding, binning, creating new variables)
  • Use pandas for common data cleaning tasks
  • Document data cleaning decisions for reproducibility

Chapter 7: Data Wrangling: Cleaning and Preparing Real Data

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts." — Arthur Conan Doyle, A Scandal in Bohemia

Chapter Overview

I need to tell you something that every data scientist knows but no one wants to admit in public: roughly 80% of any data project is cleaning the data.

Not building models. Not running tests. Not making beautiful graphs. Cleaning. Finding the missing values. Fixing the typos. Figuring out why someone typed "Male" in one row and "M" in the next and "male" in the third, and whether the row that says "Mal" was a typo for "Male" or refers to someone named Mal. Discovering that a column supposedly containing ages includes the value 999, and deciding whether that's a coding error, a placeholder for "unknown," or the world's oldest person.

If that sounds tedious — well, sometimes it is. But here's what I've learned: data cleaning is also where you learn the most about your data. It's where you catch errors that would destroy your analysis. It's where you discover that your "complete" dataset has 2,000 missing values in the column you care about most. And it's where you make decisions — sometimes difficult, sometimes ethical — that quietly shape every conclusion you'll draw later.

In Chapter 3, you loaded data and peeked at it with .head(), .info(), and .describe(). In Chapter 5, you visualized distributions. In Chapter 6, you computed summary statistics. But all of that assumed the data was ready for analysis. This chapter is about what happens when it's not — which is to say, always.

In this chapter, you will learn to:

  • Identify and handle missing data using appropriate strategies
  • Detect and address data quality issues (duplicates, inconsistencies, errors)
  • Transform variables (recoding, binning, creating new variables)
  • Use pandas for common data cleaning tasks
  • Document data cleaning decisions for reproducibility

Fast Track: If you've cleaned real datasets before and can explain the difference between listwise deletion and imputation, skim Sections 7.1-7.3 and jump to Section 7.8 (Creating New Variables). Complete quiz questions 1, 8, and 15 to verify your foundation.

Deep Dive: After this chapter, read Case Study 1 (how a health department's cleaning decisions affected COVID-19 statistics) for a powerful example of when data cleaning has life-or-death consequences, then Case Study 2 (a student's step-by-step cleaning log) for a practical template you can follow.


7.1 Why Data Cleaning Matters: The 80% Problem

Let's start with Dr. Maya Chen.

Maya has been collecting data on flu cases for months. Her county health department sends her spreadsheets from twelve different clinics. She merges them into one big dataset and runs .describe(). The results look... wrong.

The average patient age is 847 years. The column for "temperature_f" ranges from 0 to 9999. One clinic apparently treated negative-three patients last Tuesday.

Maya doesn't have a statistics problem. She has a data quality problem. And if she runs any analysis on this data without fixing it first, every conclusion she draws will be garbage. The statisticians have a classic phrase for this: garbage in, garbage out (GIGO).

This scenario isn't unusual — it's the norm. Here's why real data is almost always messy:

  • Human data entry is imperfect. People make typos, skip fields, use inconsistent formats, and occasionally enter creative placeholder values like 999 or -1 when they don't know the answer.
  • Systems have inconsistencies. One clinic records temperatures in Fahrenheit, another in Celsius. One uses "Y/N" for vaccination status, another uses "Yes/No," and a third uses 1/0.
  • Data gets merged imperfectly. When you combine datasets from different sources (as Maya does), the column names don't match, the ID formats are different, and some patients appear in multiple files.
  • Missing data is everywhere. Surveys have unanswered questions. Sensors fail. Records get lost. People refuse to answer sensitive questions.

Key Concept: Data Wrangling

Data wrangling (also called data munging or data preparation) is the process of transforming raw data into a format suitable for analysis. It includes cleaning errors, handling missing values, standardizing formats, and restructuring data. The term "wrangling" isn't accidental — it evokes wrestling something unruly into shape, which is exactly what it feels like.

The Cost of Skipping This Step

You might be tempted to skip data cleaning and jump straight to the "interesting" analysis. Don't.

Here's what happens when you analyze dirty data:

  1. Wrong answers. A single data entry error — like an age of 847 — can distort a mean, inflate a standard deviation, and create outliers that don't exist. Remember from Chapter 6: the mean is not resistant to extreme values. One bad data point can pull it anywhere.

  2. Wasted time. You'll get nonsensical results, spend hours debugging your code, and eventually realize the problem isn't your code — it's your data.

  3. Dangerous conclusions. In healthcare, criminal justice, and policy, wrong conclusions from dirty data can hurt real people. If Maya's flu data includes duplicate records, she might overcount cases, leading her county to misallocate resources.

Theme Connection: Human Stories Behind the Data (Theme 2)

Every missing value represents something that didn't get recorded — and often, someone who didn't get counted. If a clinic's intake form is only available in English, non-English-speaking patients may have incomplete records. If a survey asks about income and many respondents skip the question, the "cleaned" dataset may systematically exclude people who are uncomfortable reporting low incomes. Data cleaning isn't just technical work — it's work that determines whose stories get told.


7.2 Missing Data: The Invisible Problem

Missing data is the most common data quality issue you'll face, and handling it correctly is one of the most important skills in applied statistics.

Let's look at what missing data actually looks like. Here's a small slice of Maya's flu dataset:

patient_id  age  temperature_f  vaccinated  zip_code
001         34   101.2          Yes         48201
002         67   99.8           No          48204
003         45                  Yes         48201
004              102.1          Yes
005         28   100.5                      48207
006         71   103.4          No          48201

Patients 003, 004, and 005 have missing values — blank cells where data should be. In pandas, these show up as NaN (Not a Number) or None. In Excel, they appear as blank cells. In statistical notation, they're often written as NA (Not Available).

Before you do anything with missing data, you need to ask: why is this data missing?

This question matters enormously, because the answer determines which strategies for handling missing data are appropriate. Statisticians describe three mechanisms of missingness, and while the formal theory gets technical, the intuition is straightforward.

Three Types of Missing Data

Missing Completely at Random (MCAR)

Data is MCAR when the reason it's missing has nothing to do with the data itself. A lab technician spilled coffee on a form, destroying a few records. A sensor had a momentary power glitch and missed some readings. The missingness is pure bad luck.

Here's the intuition: if you looked at the records with missing values and the records with complete values, they'd look the same in every way. The "missers" are a random subset of the full sample.

MCAR is the best-case scenario. It's also the rarest in practice.

Missing at Random (MAR)

Confusingly named — "Missing at Random" doesn't mean the missingness is random. It means the missingness can be explained by other variables in the dataset.

Here's an example: older patients might be less likely to have their vaccination status recorded (maybe they were vaccinated decades ago and the records are harder to find). If you looked at just the older patients, the missing vaccination records would be random among them — but the rate of missingness depends on age.

The key point: if you can predict which records have missing values using other variables you've measured, the data is MAR. This is the most common type in practice.
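You can probe for MAR by checking whether missingness rates differ across groups of another observed variable. A minimal sketch with made-up data (the values here are hypothetical, not Maya's actual file):

```python
import pandas as pd

# Toy data (hypothetical): vaccination status goes unrecorded
# more often for older patients
toy = pd.DataFrame({
    'age': [25, 30, 35, 68, 72, 75, 80, 85],
    'vaccinated': ['Yes', 'No', 'Yes', None, 'Yes', None, None, 'No'],
})

# Compare missingness rates across an observed variable.
# A large gap suggests (but does not prove) MAR.
toy['vax_missing'] = toy['vaccinated'].isna()
rates = toy.groupby(toy['age'] > 60)['vax_missing'].mean()
print(rates)
```

If the rates are similar across groups (and across every other variable you check), MCAR becomes more plausible; no check of this kind can rule out MNAR, because MNAR depends on values you never observed.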

Missing Not at Random (MNAR)

This is the troublesome one. Data is MNAR when the reason it's missing is related to the missing value itself. People with very high incomes don't report their income. Patients with very high blood pressure skip their follow-up appointments. Students who failed the exam don't fill out the course evaluation.

MNAR creates bias that you can't fix just by looking at the remaining data, because the missingness is directly linked to the thing you're trying to measure.

Making It Stick: The Party Analogy

Imagine you're throwing a party and keeping track of who shows up.

  • MCAR: Some invitations got lost in the mail at random. The people who didn't show up are a random mix — there's nothing special about them.
  • MAR: People who live farther away were less likely to come. If you know the distance, you can predict who's missing — but among people at the same distance, attendance was random.
  • MNAR: The people who didn't come are specifically the ones who don't like parties. The reason they're missing is the thing you'd want to know about them. You can't figure out who the party-haters are by looking only at the people who showed up.

Detecting Missing Data with Pandas

Let's see how to find missing values in Python. We'll use the tools you learned in Chapter 3 — pd.read_csv(), .info(), .describe() — and add some new ones.

import pandas as pd

# Load the dataset (recall pd.read_csv from Chapter 3)
df = pd.read_csv('flu_data.csv')

# First look — .info() shows non-null counts per column
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   patient_id     500 non-null    object
 1   age            478 non-null    float64
 2   temperature_f  489 non-null    float64
 3   vaccinated     452 non-null    object
 4   zip_code       465 non-null    object
 5   clinic         500 non-null    object
 6   diagnosis_date 500 non-null    object
 7   hospitalized   491 non-null    object
dtypes: float64(2), object(6)

Notice how .info() shows "Non-Null Count" for each column. If the count is less than 500 (our total rows), we have missing data. The age column has 478 non-null values — meaning 22 values are missing (500 - 478 = 22).

Now let's use .isna() — a pandas method that returns True for every missing value:

# Count missing values per column
print(df.isna().sum())

Output:

patient_id        0
age              22
temperature_f    11
vaccinated       48
zip_code         35
clinic            0
diagnosis_date    0
hospitalized      9

The vaccinated column has the most missing values: 48 out of 500, or about 9.6%. That's not unusual for a question that depends on someone remembering (or proving) their vaccination history.

# What percentage of data is missing per column?
print((df.isna().sum() / len(df) * 100).round(1))

Output:

patient_id        0.0
age               4.4
temperature_f     2.2
vaccinated        9.6
zip_code          7.0
clinic            0.0
diagnosis_date    0.0
hospitalized      1.8

Rule of Thumb for Missing Data

  • < 5% missing: Usually safe to handle with simple methods (deletion or basic imputation).
  • 5-20% missing: Requires more careful thought about the missing data mechanism. Simple methods may introduce bias.
  • > 20% missing: Major concern. The variable may not be usable without advanced techniques (like multiple imputation, which is beyond our scope).

These thresholds are guidelines, not hard rules. A variable with 3% missing data that's MNAR could be more problematic than one with 15% missing that's MCAR.

Detecting Missing Data in Excel

In Excel, missing data often appears as blank cells or as placeholder values like "N/A", "999", or "-". Here are useful techniques:

  • Conditional Formatting for Blanks: Select your data range, go to Home > Conditional Formatting > New Rule > "Format only cells that contain" > "Blanks." Choose a fill color (like red) so blank cells stand out visually.
  • COUNTBLANK(): =COUNTBLANK(A2:A501) counts the number of blank cells in a range.
  • COUNTA() vs. COUNT(): COUNTA() counts non-blank cells (any type), while COUNT() counts only numeric cells. The difference reveals text in a numeric column — a common data quality issue.

7.3 Strategies for Handling Missing Data

You've found the missing values. Now what? You have three main strategies, and choosing the right one depends on the type and amount of missingness.

Strategy 1: Listwise Deletion (Drop the Rows)

The simplest approach: delete any row that has a missing value.

# Drop rows with ANY missing value
df_clean = df.dropna()
print(f"Original: {len(df)} rows")
print(f"After dropping: {len(df_clean)} rows")
print(f"Rows lost: {len(df) - len(df_clean)}")

Output:

Original: 500 rows
After dropping: 391 rows
Rows lost: 109

Wait — we lost 109 rows? That's nearly 22% of our data! Even though no single column had more than 9.6% missing, the missing values are scattered across different rows. Drop a row for any missing value and you lose a lot of data fast.

When to use listwise deletion:

  • The data is MCAR (so the remaining rows are still representative)
  • A small percentage of rows are affected (< 5%)
  • You have plenty of data to spare

When NOT to use it:

  • The data is MAR or MNAR (dropping rows introduces bias)
  • You'd lose too much data
  • The remaining data would no longer represent the population

You can also drop rows only when specific columns have missing values:

# Drop rows only where 'age' is missing
df_clean = df.dropna(subset=['age'])
print(f"Rows remaining: {len(df_clean)}")  # 478 rows

Or drop an entire column if it's mostly missing:

# Drop columns where more than 30% of values are missing
threshold = 0.30
cols_to_drop = df.columns[df.isna().mean() > threshold]
df_clean = df.drop(columns=cols_to_drop)

Strategy 2: Imputation (Fill In the Blanks)

Imputation means replacing missing values with estimated values. The idea is to preserve the sample size while making a reasonable guess about what the missing value might have been.

Mean Imputation

Replace each missing value with the column's mean.

# Fill missing ages with the mean age
mean_age = df['age'].mean()
df['age_filled'] = df['age'].fillna(mean_age)
print(f"Mean age used for imputation: {mean_age:.1f}")

The problem with mean imputation: It artificially reduces the spread of the data. Every imputed value is exactly at the mean, which makes the distribution narrower than it really is. This underestimates variance and standard deviation — remember from Chapter 6 how important spread is for understanding data.
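You can see the shrinkage directly. A small sketch with invented ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
ages = pd.Series([22, 35, np.nan, 41, 58, np.nan, 67])
filled = ages.fillna(ages.mean())

# Both imputed values sit exactly at the mean, so the spread shrinks
print(f"Std before imputation: {ages.std():.2f}")
print(f"Std after imputation:  {filled.std():.2f}")
```

The mean itself is unchanged (that's the point of mean imputation), but the standard deviation after filling is smaller than before, because every imputed point contributes zero deviation.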

Spaced Review (From Chapter 5 — Distribution Shapes)

SR.1: In Chapter 5, you learned to describe distributions as symmetric, skewed left, or skewed right. Here's why that matters for imputation: if the distribution of ages is skewed right (as age data often is for certain conditions that affect the elderly), the mean is pulled to the right by the tail. Imputing with the mean would put those guessed values higher than the typical patient's age. In a skewed distribution, the median is often a better choice for imputation. The shape of the distribution directly affects which imputation strategy is appropriate.

Median Imputation

Replace each missing value with the column's median. Preferred when the distribution is skewed.

# Fill missing ages with the median age
median_age = df['age'].median()
df['age_filled'] = df['age'].fillna(median_age)
print(f"Median age used for imputation: {median_age:.1f}")

Mode Imputation (for Categorical Variables)

For categorical variables like "vaccinated" (Yes/No), you can't use mean or median. Use the mode — the most frequent value.

# Fill missing vaccination status with the mode
mode_vax = df['vaccinated'].mode()[0]  # mode() returns a Series
df['vaccinated_filled'] = df['vaccinated'].fillna(mode_vax)
print(f"Mode used for imputation: {mode_vax}")

Conditional / Group-Based Imputation

A smarter approach: impute missing values based on groups. If older patients tend to have different temperatures, fill in missing temperatures using the mean temperature for that age group, not the overall mean.

# Fill missing temperature with the median temperature for each clinic
df['temperature_filled'] = df.groupby('clinic')['temperature_f'].transform(
    lambda x: x.fillna(x.median())
)

This is better because it respects the structure in your data. If Clinic A tends to see sicker patients (higher temperatures), using Clinic A's own median is more accurate than the overall median.

Strategy 3: Flagging (Keep Track of What Was Missing)

Sometimes the best approach is to impute the value and create a new column that records whether the original value was missing.

# Create a flag for missing age, then impute
df['age_was_missing'] = df['age'].isna().astype(int)  # 1 if missing, 0 if not
df['age_filled'] = df['age'].fillna(df['age'].median())

# Now you can see:
print(df[['patient_id', 'age', 'age_filled', 'age_was_missing']].head(10))

Why bother? Because the fact that a value was missing might itself be informative. If patients with missing ages turn out to have worse outcomes, that tells you something — maybe those patients arrived in emergencies where nobody had time to record their age. The "missingness flag" preserves that information even after you've filled in the blanks.

A Decision Framework

Missing data detected
│
├── How much is missing?
│   ├── < 5% → Simple methods probably fine
│   ├── 5-20% → Think carefully about mechanism
│   └── > 20% → Consider dropping the column (or seek expert help)
│
├── What type of variable?
│   ├── Numerical → Mean (symmetric) or Median (skewed) imputation
│   └── Categorical → Mode imputation
│
├── What's the mechanism?
│   ├── MCAR → Deletion or imputation both OK
│   ├── MAR → Imputation preferred (group-based if possible)
│   └── MNAR → No simple fix — document the limitation
│
└── Always → Create a missing-data flag and document your decision

Theme Connection: Ethical Cleaning Decisions (Theme 6)

Here's the part nobody puts in the formula sheet: every strategy for handling missing data involves a judgment call, and that judgment call affects your conclusions.

If you delete all rows with missing vaccination status, you might systematically exclude patients from clinics that don't keep good records — clinics that disproportionately serve low-income communities. Your "clean" dataset now underrepresents an entire population.

If you impute vaccination status with the mode ("Yes"), you're assuming the missing patients were vaccinated — which might overstate vaccination rates and lead policymakers to think the community is better protected than it is.

There is no "neutral" option. Every choice changes the data. The ethical move is to make your choice transparently, document it, and test whether your conclusions change under different choices (a practice called sensitivity analysis).


7.4 Detecting and Handling Duplicates

After missing data, duplicates are the second most common data quality headache. They're surprisingly easy to create — a patient checks in twice, a file gets uploaded twice, a merge operation doubles some records — and surprisingly dangerous if undetected.

Finding Duplicates

# How many duplicate rows exist?
print(f"Duplicate rows: {df.duplicated().sum()}")

# Show the duplicated rows
print(df[df.duplicated(keep=False)].sort_values('patient_id'))

The .duplicated() method returns True for rows that are exact copies of earlier rows. The keep=False parameter marks all copies (not just the second one), which helps you see the full picture.

You can also check for duplicates based on specific columns:

# Check for duplicate patient IDs (even if other columns differ)
print(f"Duplicate patient IDs: {df.duplicated(subset=['patient_id']).sum()}")

# Show patients who appear more than once
dup_ids = df[df.duplicated(subset=['patient_id'], keep=False)]
print(dup_ids.sort_values('patient_id'))

Why Duplicates Matter

Duplicates inflate your sample size, which distorts everything:

  • Means and medians get weighted toward the duplicated values
  • Standard deviations change
  • Proportions shift
  • Statistical tests become overly confident (because they think you have more data than you do)

If Maya has 20 duplicate flu records, she's overcounting by 20 cases. In a county-level analysis, that could trigger or prevent a public health response.
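The distortion is easy to demonstrate with a toy example (invented values):

```python
import pandas as pd

# Patient 003 was entered twice -- an exact duplicate row
cases = pd.DataFrame({
    'patient_id': ['001', '002', '003', '003'],
    'temperature_f': [99.1, 100.2, 103.5, 103.5],
})

# The duplicate drags the mean toward the repeated (high) value
print(f"Mean with duplicate:  {cases['temperature_f'].mean():.2f}")
print(f"Mean after de-duping: {cases.drop_duplicates()['temperature_f'].mean():.2f}")
```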

Removing Duplicates

# Remove exact duplicate rows (keep the first occurrence)
df_deduped = df.drop_duplicates()
print(f"Before: {len(df)} rows")
print(f"After: {len(df_deduped)} rows")

# Remove duplicates based on patient_id (keep most recent entry)
df_deduped = df.sort_values('diagnosis_date').drop_duplicates(
    subset=['patient_id'], keep='last'
)

The keep parameter matters:

  • keep='first' — keeps the first occurrence, drops later ones
  • keep='last' — keeps the last occurrence (useful when later entries are more recent)
  • keep=False — drops ALL copies (use cautiously)

Duplicates in Excel

In Excel:

  • Conditional Formatting > Highlight Duplicates: Select your ID column, go to Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values. Duplicates light up immediately.
  • Remove Duplicates: Go to Data > Remove Duplicates. Select the columns to check, and Excel removes duplicates, telling you how many were found.
  • COUNTIF for specific checks: =COUNTIF(A:A, A2) returns how many times the value in A2 appears in column A. Anything > 1 is a duplicate.


7.5 Handling Inconsistent Data

This one drives data scientists crazy. You've got a column called "gender" and it contains:

Male, male, MALE, M, m, Male , male., F, Female, female, FEMALE, f, Femal

These aren't missing values — they're inconsistent values. "Male" and "M" and "male" all mean the same thing, but pandas treats them as different categories. If you run .value_counts():

print(df['gender'].value_counts())
Male      87
Female    82
male      74
female    69
M         43
F         38
MALE      22
FEMALE    18
m         14
f         11
Male       9  # note the trailing space!
Femal      3
male.      2

Instead of two categories, you have thirteen. Your bar chart would be absurd, and any analysis that groups by gender would split the data into tiny, meaningless subgroups.

Fixing Inconsistent Text Data

Step 1: Standardize case and strip whitespace.

# Convert to lowercase and remove leading/trailing spaces
df['gender_clean'] = df['gender'].str.lower().str.strip()
print(df['gender_clean'].value_counts())
male      192
female    169
m          57
f          49
femal       3
male.       2

Better — we went from 13 categories to 6. Now let's use .replace() to map abbreviations and typos to standard values:

# Map variations to standard values
gender_map = {
    'm': 'male',
    'f': 'female',
    'male.': 'male',
    'femal': 'female'
}
df['gender_clean'] = df['gender_clean'].replace(gender_map)
print(df['gender_clean'].value_counts())
male      251
female    221

Two clean categories. That's what we wanted.

A General Cleaning Pattern

For any text column with inconsistencies, follow this sequence:

# The cleaning pipeline for text columns
df['col_clean'] = (
    df['col']
    .str.lower()           # standardize case
    .str.strip()           # remove leading/trailing whitespace
    .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
    .replace(your_mapping_dict)  # map known variations
)

Fixing Inconsistent Numeric Data

Sometimes numeric columns have problems too. Common issues:

  • Negative values where they shouldn't exist (negative age)
  • Impossibly large values (age = 847, temperature = 9999)
  • Values in wrong units (Celsius mixed with Fahrenheit)
  • Placeholder values masquerading as data (999, -1, 0)

# Check for suspicious values in age
print(f"Age range: {df['age'].min()} to {df['age'].max()}")
print(f"Negative ages: {(df['age'] < 0).sum()}")
print(f"Ages > 120: {(df['age'] > 120).sum()}")

# Replace impossible values with NaN
import numpy as np
df.loc[df['age'] < 0, 'age'] = np.nan
df.loc[df['age'] > 120, 'age'] = np.nan

Spaced Review (From Chapter 3 — Loading Data with Pandas)

SR.2: In Chapter 3, you learned to use .info() and .describe() for a first look at your data. Those tools are your first line of defense against inconsistent data. If .describe() shows that the minimum age is -3 or the maximum temperature is 9999, you've caught a data quality issue before it could corrupt your analysis. If .info() shows a column you expected to be numeric is listed as object (text), that's often a clue that inconsistent values are hiding in it — like someone typing "unknown" in a numeric field.
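On the pandas side, one quick way to flush out the text hiding in a supposedly numeric column is pd.to_numeric with errors='coerce'. A sketch with hypothetical values:

```python
import pandas as pd

# An 'age' column that loaded as object because of stray text
age_raw = pd.Series(['34', '67', 'unknown', '45', 'N/A'])

# coerce: anything that can't be parsed as a number becomes NaN
age_num = pd.to_numeric(age_raw, errors='coerce')

# The rows that became NaN are exactly the non-numeric entries
print(age_raw[age_num.isna()])
```

Comparing the NaN count before and after coercion tells you how many cells held text, and inspecting those cells tells you what the offending values were.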

Handling Inconsistent Data in Excel

Excel has several built-in tools for inconsistent text:

  • TRIM(): =TRIM(A2) removes extra spaces (leading, trailing, and multiple internal spaces).
  • CLEAN(): =CLEAN(A2) removes non-printable characters that sometimes sneak in from imports.
  • UPPER(), LOWER(), PROPER(): Standardize capitalization. =LOWER(A2) converts to lowercase; =PROPER(A2) capitalizes the first letter of each word.
  • Find & Replace (Ctrl+H): Search for "M" and replace with "Male" (use "Match entire cell contents" to avoid replacing "M" inside "Male" or "Female").
  • Data Validation: After cleaning, set up data validation (Data > Data Validation) to restrict future entries to an approved list. This prevents inconsistencies from creeping back in.

7.6 Recoding Variables

Sometimes you don't need to fix a variable — you need to transform it. Recoding means creating a new version of a variable with different values or categories.

Recoding Categorical Variables

A common recoding task is collapsing many categories into fewer ones. Suppose Maya's dataset records "diagnosis" with dozens of specific flu strains:

print(df['diagnosis'].value_counts())
Influenza A (H1N1)     145
Influenza A (H3N2)     102
Influenza B (Yamagata)  89
Influenza B (Victoria)  67
Influenza A (other)     43
Unknown                 54

For her analysis, Maya only needs to know "Influenza A" vs. "Influenza B" vs. "Unknown":

# Recode detailed diagnosis to broad categories
def recode_flu(diagnosis):
    if 'Influenza A' in diagnosis:
        return 'Influenza A'
    elif 'Influenza B' in diagnosis:
        return 'Influenza B'
    else:
        return 'Unknown'

df['flu_type'] = df['diagnosis'].apply(recode_flu)
print(df['flu_type'].value_counts())
Influenza A    290
Influenza B    156
Unknown         54

Recoding Numeric Responses

Survey data often needs recoding. A satisfaction survey might use 1-5, and you want to collapse it into "Low" (1-2), "Medium" (3), and "High" (4-5):

# Recode satisfaction scores to categories
df['satisfaction_level'] = df['satisfaction'].map({
    1: 'Low', 2: 'Low',
    3: 'Medium',
    4: 'High', 5: 'High'
})

You can also create binary (yes/no) variables from numeric ones — sometimes called dummy coding:

# Create a binary "hospitalized" variable from length_of_stay
# 0 days = not hospitalized, any positive value = hospitalized
df['was_hospitalized'] = (df['length_of_stay'] > 0).astype(int)

7.7 Binning: Creating Categories from Continuous Data

Binning (also called discretization) means dividing a continuous variable into intervals and assigning each observation to a bin. You've already seen this idea — histograms use bins to group continuous data into bars. Now you're doing it explicitly to create a new categorical variable.

Why Bin?

  • Simplify analysis: Instead of analyzing age as a continuous variable with 80+ possible values, you analyze age groups (0-17, 18-34, 35-54, 55+).
  • Handle nonlinear relationships: Sometimes the effect of a variable isn't smooth. The relationship between age and flu risk might jump at certain thresholds rather than increase gradually.
  • Communication: "35% of young adults (18-34) were vaccinated" is clearer than a regression coefficient.

Binning with pandas: pd.cut()

# Create age groups from continuous age
bins = [0, 17, 34, 54, 120]
labels = ['0-17', '18-34', '35-54', '55+']

df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
print(df['age_group'].value_counts().sort_index())
0-17     67
18-34    134
35-54    158
55+      119

How pd.cut() works:

  • bins defines the edges of each interval. A value of 25 falls in the interval (17, 34], so it gets the label "18-34".
  • labels assigns human-readable names to each bin.
  • By default, pd.cut() creates intervals that are open on the left and closed on the right: (0, 17], (17, 34], etc. The value 17 goes in the "0-17" bin, while 18 goes in "18-34."

Equal-Width vs. Equal-Frequency Bins

There are two main approaches to choosing bin boundaries:

Equal-width bins: Each bin spans the same range (like 0-20, 20-40, 40-60, etc.). This is what pd.cut() does by default if you specify the number of bins:

# Create 4 equal-width bins
df['age_bin_equal'] = pd.cut(df['age'], bins=4)
print(df['age_bin_equal'].value_counts().sort_index())

Equal-frequency bins: Each bin contains approximately the same number of observations. Use pd.qcut():

# Create 4 bins with approximately equal numbers of observations
df['age_quartile'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df['age_quartile'].value_counts().sort_index())

When to use each:

  • Equal-width when the intervals have natural meaning (decades of age, income brackets, time periods)
  • Equal-frequency when you want balanced groups for comparison (quartiles, percentiles)

The Trade-Off of Binning

Binning is useful, but it comes with a cost: you're throwing away information. A 19-year-old and a 34-year-old in the "18-34" bin are treated identically, even though their flu risk might be quite different. The finer your bins, the more information you preserve — but the more categories you create, and the fewer observations in each bin.

There's no universal "right" number of bins. Use domain knowledge. Age groups of 0-17, 18-34, 35-54, 55+ make sense for public health because vaccination schedules, risk factors, and health policies often follow these boundaries.


7.8 Creating New Variables: Feature Engineering

Sometimes the most useful variable isn't in your dataset — you need to create it. Feature engineering is the process of creating new variables from existing ones to better capture the patterns in your data.

Arithmetic Combinations

# BMI from height and weight
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)

# Fever flag: temperature above 100.4°F
df['has_fever'] = (df['temperature_f'] > 100.4).astype(int)

# Days between symptom onset and diagnosis
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['symptom_date'] = pd.to_datetime(df['symptom_date'])
df['days_to_diagnosis'] = (df['diagnosis_date'] - df['symptom_date']).dt.days

Text Extraction

# Extract year from a date column
df['diagnosis_year'] = df['diagnosis_date'].dt.year

# Extract month for seasonal analysis
df['diagnosis_month'] = df['diagnosis_date'].dt.month

# Create a "season" variable
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['diagnosis_month'].apply(get_season)

Combining Variables

# Create a risk score combining multiple factors
df['risk_score'] = (
    (df['age'] > 65).astype(int) +
    (df['has_fever']).astype(int) +
    (df['vaccinated_clean'] == 'no').astype(int)
)
# risk_score ranges from 0 (low risk) to 3 (high risk)

Key Concept: Feature Engineering

Feature engineering is the art of creating new variables ("features") that better capture the information in your data. It's one of the most creative and impactful parts of data science. A well-engineered feature can reveal patterns that the raw variables hide. For example, knowing someone's height and weight separately is less useful than knowing their BMI, which combines both into a single health-relevant measure.


7.9 Tidy Data: The Goal of Data Wrangling

All of this cleaning, recoding, and transforming is heading somewhere: you want tidy data. This is a concept formalized by statistician Hadley Wickham, and it's the gold standard for data organization.

Tidy data follows three rules:

  1. Each variable has its own column. Don't store two pieces of information in one column (like "120/80" for blood pressure — split it into systolic and diastolic).
  2. Each observation has its own row. Don't put multiple observations in one row (like a row for "January" with columns for each patient).
  3. Each value has its own cell. Don't put multiple values in one cell (like "flu, pneumonia" in a diagnosis field — use separate rows or columns).
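Rule 1's blood-pressure example can be fixed with pandas string splitting. A minimal sketch, using a hypothetical bp column:

```python
import pandas as pd

# Hypothetical column storing two values ("systolic/diastolic") in one cell
df = pd.DataFrame({'bp': ['120/80', '135/90', '110/70']})

# Split on '/' into two proper numeric columns, one variable per column
df[['systolic', 'diastolic']] = df['bp'].str.split('/', expand=True).astype(int)
print(df[['systolic', 'diastolic']])
```

After the split, each variable has its own column and you can compute with them directly (e.g., flag rows where systolic exceeds 130).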

Untidy Example

| Patient | Jan_Temp | Feb_Temp | Mar_Temp |
|---------|----------|----------|----------|
| Smith   | 98.6     | 99.1     | 101.2    |
| Jones   | 100.4    | 99.8     | 98.9     |

This is "wide" format — months are stored as columns. It violates rule 1 (temperature is one variable scattered across three columns) and makes it hard to analyze trends or filter by month.

Tidy Version

| Patient | Month | Temperature |
|---------|-------|-------------|
| Smith   | Jan   | 98.6        |
| Smith   | Feb   | 99.1        |
| Smith   | Mar   | 101.2       |
| Jones   | Jan   | 100.4       |
| Jones   | Feb   | 99.8        |
| Jones   | Mar   | 98.9        |

This is "long" format — each row is one observation (one patient's temperature on one date). Now you can easily filter, group, and visualize.

Reshaping in Pandas

# Wide to long (melt)
df_tidy = pd.melt(
    df_wide,
    id_vars=['Patient'],
    value_vars=['Jan_Temp', 'Feb_Temp', 'Mar_Temp'],
    var_name='Month',
    value_name='Temperature'
)

# Long to wide (pivot)
df_wide = df_tidy.pivot(
    index='Patient',
    columns='Month',
    values='Temperature'
)

Most analysis tools in Python (and most statistical procedures) expect tidy data. Getting your data into tidy format is often the single biggest step in data wrangling.
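The generic melt call becomes concrete with the Smith/Jones table. A runnable sketch (the Month-label cleanup at the end is an extra step, not part of melt itself):

```python
import pandas as pd

# The untidy wide table from the example above
df_wide = pd.DataFrame({
    'Patient': ['Smith', 'Jones'],
    'Jan_Temp': [98.6, 100.4],
    'Feb_Temp': [99.1, 99.8],
    'Mar_Temp': [101.2, 98.9],
})

# Wide to long: one row per patient-month observation
df_tidy = pd.melt(
    df_wide,
    id_vars=['Patient'],
    value_vars=['Jan_Temp', 'Feb_Temp', 'Mar_Temp'],
    var_name='Month',
    value_name='Temperature',
)

# Clean up the month labels ('Jan_Temp' -> 'Jan')
df_tidy['Month'] = df_tidy['Month'].str.replace('_Temp', '')
print(df_tidy)  # 6 rows: each patient x each month
```

From here, filtering by month (`df_tidy[df_tidy['Month'] == 'Feb']`) or grouping by patient is a one-liner, which is exactly the point of tidy data.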


7.10 The Cleaning Log: Documenting Your Decisions

Here's something I see students (and professionals) skip all the time, and it always comes back to haunt them: documenting what you did to the data.

Every decision you've made — which rows to drop, how to impute missing values, which categories to merge, which outliers to remove — changes the data. If someone can't see those decisions, they can't evaluate your analysis. They can't reproduce your work. They can't challenge your choices.

Key Concept: Reproducibility

Reproducibility means that someone else (or your future self) can take your raw data, follow your documented steps, and arrive at exactly the same clean dataset. It's a foundational principle of good science. Without it, your analysis is a black box — and black boxes don't earn trust.

A Cleaning Log Template

Here's what a cleaning log looks like in practice. Every data project should have one:

# Data Cleaning Log
## Dataset: County Flu Surveillance Data (2024)
## Analyst: Dr. Maya Chen
## Date: 2024-11-15
## Raw dataset: flu_raw_2024.csv (500 rows, 8 columns)

### Step 1: Initial Assessment
- 500 rows, 8 columns
- Missing data: age (22, 4.4%), temperature_f (11, 2.2%),
  vaccinated (48, 9.6%), zip_code (35, 7.0%), hospitalized (9, 1.8%)
- No complete duplicates found
- 3 duplicate patient_ids found (patients 089, 234, 401)

### Step 2: Remove Duplicates
- Removed 3 duplicate patient_ids, keeping most recent entry
- Rows: 500 → 497
- Justification: Duplicate entries from re-registration at same clinic

### Step 3: Fix Inconsistent Values
- gender column: 13 unique values → 2 after standardization
  - Lowercased, stripped whitespace
  - Mapped: 'm' → 'male', 'f' → 'female', 'male.' → 'male', 'femal' → 'female'
- vaccinated column: standardized 'Y'/'N' to 'Yes'/'No'

### Step 4: Handle Impossible Values
- age: 2 values > 120 (ages 847 and 999) → set to NaN
  - Likely placeholder values for unknown age
- temperature_f: 1 value of 9999 → set to NaN
  - Likely data entry error or placeholder
- temperature_f: 1 value of 0 → set to NaN
  - Likely equipment malfunction or missing reading

### Step 5: Handle Missing Data
- age (24 missing after Step 4, 4.8%):
  - Imputed with median age (47.0) within each clinic
  - Created age_was_missing flag
  - Justification: < 5% missing, likely MCAR (ages missing from
    different clinics, demographics)
- temperature_f (13 missing after Step 4, 2.6%):
  - Imputed with median temperature within each clinic
  - Justification: < 5% missing, MAR (more missing from one clinic
    with older equipment)
- vaccinated (48 missing, 9.7%):
  - NOT imputed — left as missing with flag
  - Justification: Likely MNAR (unvaccinated patients may be less
    likely to know/report status). Imputing would bias vaccination rates.
- zip_code (35 missing, 7.0%):
  - NOT imputed — left as missing
  - Justification: No reasonable basis for guessing location
- hospitalized (9 missing, 1.8%):
  - Imputed with mode ('No') within each clinic
  - Justification: < 2% missing, likely MCAR

### Step 6: Create New Variables
- age_group: binned age into 0-17, 18-34, 35-54, 55+
- has_fever: 1 if temperature_f > 100.4, 0 otherwise
- flu_type: recoded detailed diagnosis to Influenza A / B / Unknown
- diagnosis_month: extracted from diagnosis_date

### Final Dataset
- 497 rows, 15 columns (8 original + 5 new + 2 flags)
- Saved as: flu_cleaned_2024.csv

Writing the Cleaning Log in Code

The best cleaning logs are the code itself — when you write your cleaning steps in a Jupyter notebook with Markdown explanations, the notebook is the log:

# =============================================================
# STEP 1: Load and assess raw data
# =============================================================
import pandas as pd
import numpy as np

df = pd.read_csv('flu_raw_2024.csv')
print(f"Raw data: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nMissing values:\n{df.isna().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")

# =============================================================
# STEP 2: Remove duplicate patient_ids (keep most recent)
# Decision: 3 patients appeared twice due to re-registration.
#           Keeping the most recent entry.
# =============================================================
df = df.sort_values('diagnosis_date').drop_duplicates(
    subset=['patient_id'], keep='last'
)
print(f"After deduplication: {len(df)} rows")

# ... (and so on for each step)

Spaced Review (From Chapter 4 — Bias in Data Collection)

SR.3: In Chapter 4, you learned about selection bias, response bias, and nonresponse bias. Data cleaning decisions can introduce these same biases — or make existing ones worse.

If you delete all rows with missing vaccination status, you may create nonresponse bias: patients who didn't report vaccination status might be systematically different from those who did. If the "missing" group skews toward unvaccinated patients (MNAR), your cleaned dataset will overestimate the vaccination rate.

If a survey about income has missing values concentrated among low-income respondents, imputing with the mean income creates selection bias: you're essentially replacing real low-income responses with the average, making inequality look smaller than it is.

Every cleaning decision is a study design decision in disguise. The biases from Chapter 4 don't just come from data collection — they can be created (or amplified) during data cleaning.
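A tiny simulation (entirely synthetic numbers) makes the income point concrete: when the lowest incomes are the ones that go missing, mean imputation fills every gap with the average of the remaining higher incomes, which biases the mean upward and shrinks the apparent spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed incomes (lognormal), in thousands of dollars
income = rng.lognormal(mean=4.0, sigma=0.5, size=1000)

# MNAR-style missingness: the bottom 30% of earners don't answer
observed = income.copy()
missing = income < np.quantile(income, 0.3)
observed[missing] = np.nan

# Mean imputation fills every gap with the average of the observed
# (systematically higher) incomes
imputed = observed.copy()
imputed[missing] = np.nanmean(observed)

print(f"True mean:    {income.mean():.1f}")
print(f"Imputed mean: {imputed.mean():.1f}")  # biased upward
print(f"True std:     {income.std():.1f}")
print(f"Imputed std:  {imputed.std():.1f}")   # spread artificially shrunk
```

The imputed dataset looks richer and more equal than the real population, precisely the distortion the paragraph above describes.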


7.11 Debugging Spotlight: Common Pandas Cleaning Errors

Data cleaning in pandas involves a lot of method chaining, filtering, and assignment — which means a lot of opportunities for bugs. Here are the most common ones and how to fix them.

Error 1: The SettingWithCopyWarning

# This might trigger a warning:
df_subset = df[df['age'] > 50]
df_subset['risk_level'] = 'high'  # ⚠️ SettingWithCopyWarning

What's happening: When you filter a DataFrame, pandas sometimes returns a view (a window into the original) and sometimes returns a copy. If it's a view, modifying it might or might not change the original — pandas can't tell, so it warns you.

The fix: Use .copy() when creating a subset you plan to modify:

# Safe version
df_subset = df[df['age'] > 50].copy()
df_subset['risk_level'] = 'high'  # No warning

Or use .loc for assignment on the original:

# Modify original DataFrame safely
df.loc[df['age'] > 50, 'risk_level'] = 'high'

Error 2: Replacing Values That Don't Match

# This won't work if there are extra spaces:
df['gender'] = df['gender'].replace('Male', 'male')
# "Male " (with a trailing space) won't be replaced!

The fix: Always strip whitespace first:

df['gender'] = df['gender'].str.strip().replace('Male', 'male')

Error 3: fillna() Not Saving the Result

# This does NOTHING to df:
df['age'].fillna(df['age'].median())

# You need to assign the result back:
df['age'] = df['age'].fillna(df['age'].median())

# Avoid the inplace=True alternative on a selected column:
#   df['age'].fillna(df['age'].median(), inplace=True)
# Under copy-on-write (the default in recent pandas versions), this
# modifies a temporary object, so the change can be silently lost.

Most pandas methods return a new object and don't modify the original. This is intentional (it prevents accidental data destruction), but it trips up beginners constantly. Always check: did you assign the result?

Error 4: dtype Surprises After Cleaning

# After filling NaN in an integer column, it might become float:
df['age'] = df['age'].fillna(0)
print(df['age'].dtype)  # float64, not int!

Why: pandas uses NaN (a float value) to represent missing data. Any integer column with NaN values gets promoted to float64. After filling the NaN, the column stays float.

The fix:

df['age'] = df['age'].fillna(0).astype(int)

Or use pandas' nullable integer type from the start:

df['age'] = df['age'].astype('Int64')  # Capital I — nullable integer

Error 5: Forgetting That String Operations Need .str

# This FAILS:
df['name'] = df['name'].lower()  # AttributeError!

# Correct:
df['name'] = df['name'].str.lower()

When working with text columns in pandas, string methods must be accessed through the .str accessor. It's one extra word, but forgetting it is one of the most common pandas errors.


7.12 Putting It All Together: A Complete Cleaning Workflow

Let's walk through a complete cleaning workflow, from raw data to analysis-ready dataset. This is the real deal — the kind of script you'll write for every data project.

import pandas as pd
import numpy as np

# =============================================================
# LOAD RAW DATA
# =============================================================
df = pd.read_csv('flu_raw_2024.csv')
print(f"Raw data: {df.shape[0]} rows, {df.shape[1]} columns")

# =============================================================
# INITIAL ASSESSMENT
# =============================================================
print("\n--- Data Types ---")
print(df.dtypes)
print("\n--- Missing Values ---")
print(df.isna().sum())
print("\n--- Duplicates ---")
print(f"Exact duplicates: {df.duplicated().sum()}")
print(f"Duplicate IDs: {df.duplicated(subset=['patient_id']).sum()}")
print("\n--- Descriptive Stats ---")
print(df.describe())

# =============================================================
# STEP 1: REMOVE DUPLICATES
# =============================================================
df = df.sort_values('diagnosis_date').drop_duplicates(
    subset=['patient_id'], keep='last'
)
print(f"\nAfter removing duplicates: {len(df)} rows")

# =============================================================
# STEP 2: FIX DATA TYPES
# =============================================================
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['zip_code'] = df['zip_code'].astype(str)  # zip codes aren't numbers!

# =============================================================
# STEP 3: STANDARDIZE TEXT COLUMNS
# =============================================================
for col in ['gender', 'vaccinated', 'diagnosis']:
    df[col] = df[col].str.lower().str.strip()

# Map inconsistent values
df['gender'] = df['gender'].replace({
    'm': 'male', 'f': 'female', 'male.': 'male', 'femal': 'female'
})
df['vaccinated'] = df['vaccinated'].replace({
    'y': 'yes', 'n': 'no', '1': 'yes', '0': 'no'
})

# =============================================================
# STEP 4: FIX IMPOSSIBLE VALUES
# =============================================================
df.loc[df['age'] < 0, 'age'] = np.nan
df.loc[df['age'] > 120, 'age'] = np.nan
df.loc[df['temperature_f'] > 115, 'temperature_f'] = np.nan
df.loc[df['temperature_f'] < 90, 'temperature_f'] = np.nan

# =============================================================
# STEP 5: HANDLE MISSING DATA
# =============================================================
# Create missing-data flags before imputing
df['age_missing_flag'] = df['age'].isna().astype(int)
df['temp_missing_flag'] = df['temperature_f'].isna().astype(int)

# Impute age with clinic-specific median
df['age'] = df.groupby('clinic')['age'].transform(
    lambda x: x.fillna(x.median())
)

# Impute temperature with clinic-specific median
df['temperature_f'] = df.groupby('clinic')['temperature_f'].transform(
    lambda x: x.fillna(x.median())
)

# DO NOT impute vaccination status (likely MNAR)

# =============================================================
# STEP 6: CREATE NEW VARIABLES
# =============================================================
# Age groups
bins = [0, 17, 34, 54, 120]
labels = ['0-17', '18-34', '35-54', '55+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels,
                         include_lowest=True)  # so age 0 lands in '0-17'

# Fever flag
df['has_fever'] = (df['temperature_f'] > 100.4).astype(int)

# Season
df['month'] = df['diagnosis_date'].dt.month
df['season'] = df['month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})

# =============================================================
# FINAL CHECK
# =============================================================
print(f"\nFinal dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Missing values remaining:\n{df.isna().sum()}")
print(f"\nColumn types:\n{df.dtypes}")

# =============================================================
# SAVE
# =============================================================
df.to_csv('flu_cleaned_2024.csv', index=False)
print("\nCleaned data saved to flu_cleaned_2024.csv")

This script does everything we've covered: removes duplicates, fixes data types, standardizes text, handles impossible values, creates missing-data flags, imputes carefully, engineers new features, and saves the result. And because it's code, it's automatically reproducible — anyone can run it on the same raw data and get the same cleaned dataset.


7.13 All Functions at a Glance: Python and Excel

Python Reference

import pandas as pd
import numpy as np

# ---- Missing Data ----
df.isna()                          # True/False for every cell
df.isna().sum()                    # Count missing per column
df.isna().mean() * 100             # Percentage missing per column
df.dropna()                        # Drop rows with any NaN
df.dropna(subset=['col'])          # Drop rows where 'col' is NaN
df['col'].fillna(value)            # Fill NaN with a specific value
df['col'].fillna(df['col'].mean()) # Fill with mean
df['col'].fillna(df['col'].median()) # Fill with median
df['col'].ffill()                  # Forward fill (fillna(method='ffill') is deprecated)

# ---- Duplicates ----
df.duplicated()                    # True for duplicate rows
df.duplicated().sum()              # Count duplicates
df.duplicated(subset=['col'])      # Check duplicates by specific column
df.drop_duplicates()               # Remove exact duplicate rows
df.drop_duplicates(subset=['col']) # Remove duplicates by column
df.drop_duplicates(keep='last')    # Keep last occurrence

# ---- Text Cleaning ----
df['col'].str.lower()              # Convert to lowercase
df['col'].str.upper()              # Convert to uppercase
df['col'].str.strip()              # Remove leading/trailing whitespace
df['col'].str.replace('old', 'new')# Replace substring
df['col'].replace({'old': 'new'})  # Replace exact values

# ---- Type Conversion ----
df['col'].astype(int)              # Convert to integer
df['col'].astype(float)            # Convert to float
df['col'].astype(str)              # Convert to string
df['col'].astype('category')       # Convert to categorical
pd.to_datetime(df['col'])          # Convert to datetime

# ---- Binning ----
pd.cut(df['col'], bins=4)                      # Equal-width bins
pd.cut(df['col'], bins=[0,18,65,120],
       labels=['Young','Adult','Senior'])       # Custom bins
pd.qcut(df['col'], q=4, labels=['Q1','Q2','Q3','Q4'])  # Equal-frequency

# ---- Reshaping ----
pd.melt(df, id_vars=['id'], value_vars=['a','b'])  # Wide to long
df.pivot(index='id', columns='var', values='val')  # Long to wide

# ---- Conditional Assignment ----
df.loc[condition, 'col'] = new_value               # Assign where condition is True
np.where(condition, value_if_true, value_if_false)  # Vectorized if/else

Excel Reference

| Task | Excel Method |
|------|--------------|
| Count blanks | =COUNTBLANK(A1:A500) |
| Count non-blanks | =COUNTA(A1:A500) |
| Fill blank with value | =IF(A2="", "Unknown", A2) |
| Remove extra spaces | =TRIM(A2) |
| Remove non-printable characters | =CLEAN(A2) |
| Convert to lowercase | =LOWER(A2) |
| Convert to uppercase | =UPPER(A2) |
| Proper capitalization | =PROPER(A2) |
| Find & Replace | Ctrl+H (check "Match entire cell contents" for exact match) |
| Highlight duplicates | Conditional Formatting > Highlight Cells Rules > Duplicate Values |
| Remove duplicates | Data > Remove Duplicates |
| Highlight blanks | Conditional Formatting > New Rule > "Blanks" |
| Data validation (restrict entries) | Data > Data Validation > List |
| Replace errors with blank | =IFERROR(A2, "") |
| Check if cell is blank | =ISBLANK(A2) |

7.14 Project Checkpoint: Your Turn

DATA DETECTIVE PORTFOLIO — Chapter 7

This is one of the most important portfolio checkpoints. You're going to clean your real dataset and document every decision you make. The cleaning log you produce here will be referenced in every future chapter — it's the foundation for all your subsequent analysis.

Your tasks:

  1. Assess missing data in your dataset:
     - How many missing values does each column have? What percentage?
     - For each column with missing data, hypothesize whether the missingness is MCAR, MAR, or MNAR. Explain your reasoning.

  2. Handle missing data using appropriate strategies:
     - For each column, decide: delete, impute, or leave as missing?
     - If imputing, choose mean, median, or mode (and justify your choice based on the distribution shape from Chapter 5)
     - Create missing-data flags for any imputed columns

  3. Check for and remove duplicates:
     - How many duplicate rows exist?
     - How many duplicate IDs (or equivalent) exist?
     - Remove duplicates and document your keep strategy

  4. Fix inconsistencies in categorical columns:
     - Use .value_counts() on each categorical column
     - Standardize any inconsistent values

  5. Create at least one new variable through recoding, binning, or calculation.

  6. Write a cleaning log documenting every decision. Follow the template from Section 7.10.

Code template to get started:

```python
import pandas as pd
import numpy as np

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# --- ASSESSMENT ---
print("=== INITIAL ASSESSMENT ===")
print(f"Shape: {df.shape}")
print(f"\nMissing values:\n{df.isna().sum()}")
print(f"\nMissing percentages:\n{(df.isna().mean() * 100).round(1)}")
print(f"\nDuplicates: {df.duplicated().sum()}")

# --- CLEANING ---

# Step 1: Remove duplicates
df = df.drop_duplicates()

# Step 2: Fix inconsistent values (customize for your dataset)
df['col'] = df['col'].str.lower().str.strip()

# Step 3: Handle impossible values
df.loc[df['col'] < 0, 'col'] = np.nan

# Step 4: Create missing-data flags
df['col_missing'] = df['col'].isna().astype(int)

# Step 5: Impute missing values
df['col'] = df['col'].fillna(df['col'].median())

# Step 6: Create new variables
df['new_col'] = pd.cut(df['col'], bins=..., labels=...)

# --- FINAL CHECK ---
print("\n=== AFTER CLEANING ===")
print(f"Shape: {df.shape}")
print(f"\nMissing values:\n{df.isna().sum()}")

# Save cleaned data
df.to_csv('your_dataset_cleaned.csv', index=False)
```

Suggested focus by dataset:

  - CDC BRFSS: Check _BMI5 for impossible values (BMI > 100?), standardize SEX coding, handle missing INCOME2 (likely MNAR — people skip income questions)
  - Gapminder: Check for missing gdpPercap in developing nations (likely MAR — depends on country development level), create income_group from GDP per capita
  - College Scorecard: Many columns have > 50% missing data — decide which to keep. Standardize institution names. Create selectivity bins from ADM_RATE.
  - World Happiness Report: Relatively clean, but check for missing values in sub-scores. Create happiness_level categories (Low/Medium/High) from happiness score.
  - NOAA Climate: Handle TMAX/TMIN missing readings (likely MCAR from sensor failure), create temp_range = TMAX - TMIN, create season from date.


7.15 Spaced Review: Strengthening Previous Learning

These questions revisit concepts from earlier chapters at expanding intervals, helping you build long-term retention.

SR.1 (From Chapter 5 — Distribution Shapes and Imputation): In Section 7.3, you learned that the shape of a distribution (skewed vs. symmetric) should guide your choice of imputation method. Explain why imputing with the mean is problematic for a right-skewed variable like income. What would you use instead, and why?

Check your thinking

In a right-skewed distribution, the mean is pulled to the right by the long tail of high values (as you learned in Chapter 5, Section 5.7, and again in Chapter 6, Section 6.2). The mean income might be $105,000 when the median is $75,000 — meaning the mean doesn't represent the "typical" value. If you impute missing income values with the mean ($105,000), you're inserting values that are *higher* than what most people actually earn. This artificially shifts the distribution toward the right, making income look higher than it is.

The **median** ($75,000) is a better choice for imputation because it represents the actual center of the data. This is exactly why we distinguish between resistant and non-resistant measures (Chapter 6, Section 6.1) — the median resists the pull of extreme values, making it the better imputation choice for skewed data.

SR.2 (From Chapter 3 — Loading and Exploring Data with Pandas): You've now used several pandas methods for data cleaning (.isna(), .fillna(), .dropna(), .replace(), .duplicated(), .drop_duplicates()). Back in Chapter 3, you learned .info(), .describe(), .head(), and .value_counts(). Explain how these Chapter 3 tools serve as your first line of defense against dirty data — what can each one reveal about data quality issues?

Check your thinking

Each Chapter 3 tool catches different quality issues:

  - **`.info()`** reveals missing data (non-null counts less than total rows), wrong data types (a column of ages showing as `object` instead of `int64`, which means text is hiding in a numeric column), and memory usage.
  - **`.describe()`** reveals impossible values (min age of -3, max temperature of 9999), suspicious spread (an unreasonably large standard deviation), and potential outliers.
  - **`.head()`** gives you a visual sanity check — you can spot formatting issues, trailing spaces, mixed formats, and placeholder values by just looking at the first few rows.
  - **`.value_counts()`** reveals inconsistent categories immediately — if you see "Male", "male", "M", and "MALE" in the same column, you know you have a standardization problem.

These Chapter 3 tools are diagnostic — they help you *find* the problems. The Chapter 7 tools (`.fillna()`, `.replace()`, `.drop_duplicates()`) are the treatment. Diagnosis always comes before treatment.

SR.3 (From Chapter 4 — Bias in Data Collection): A researcher is cleaning a survey dataset about job satisfaction. She notices that 30% of respondents didn't answer the question "How many hours of overtime do you work per week?" She decides to delete those rows. Identify which type of bias from Chapter 4 this might introduce, explain why the missingness might be MNAR, and suggest a better approach.

Check your thinking

Deleting rows where overtime hours are missing could introduce **nonresponse bias** (Chapter 4, Section 4.3). Employees who work the most overtime might avoid reporting it — perhaps out of fear of employer retaliation, or because they're so overworked they skip questions. If so, the data is **MNAR**: the reason the value is missing is related to the value itself (high overtime → more likely to skip the question).

Deleting these rows would systematically remove the most overworked employees, making the remaining dataset look like employees work *less* overtime than they actually do. Any analysis of job satisfaction as a function of overtime would be biased — you'd be missing the very people for whom overtime is the biggest problem.

A better approach: keep all rows, create a missing-data flag (`overtime_missing = 1`), and analyze whether the people with missing overtime data differ from those without (do they have lower satisfaction scores? different job titles?). Document the limitation honestly. If you must impute, use group-based imputation (e.g., median overtime by job role) rather than the overall mean.

Chapter Summary

The Big Ideas

  1. Data cleaning is the foundation of analysis. Roughly 80% of data work is cleaning — finding and fixing errors, handling missing values, and standardizing formats. Skip this step and every conclusion you draw is unreliable.

  2. Missing data has mechanisms. Data can be Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). The mechanism determines which handling strategies are appropriate. MNAR is the most dangerous because the missingness is related to the missing value itself.

  3. Every cleaning decision is a judgment call. Deleting rows, imputing values, and recoding categories all change the data — and they change it in ways that affect conclusions. The ethical move is to make your choices transparently, document them, and consider whether different choices would lead to different conclusions.

  4. Reproducibility requires documentation. A cleaning log that records every decision, justification, and transformation is essential. Without it, your analysis is a black box that no one — including your future self — can evaluate or replicate.

Key Terms

| Term | Definition |
|------|------------|
| Missing data (NA/NaN) | Values that are absent from the dataset — blank cells, null values |
| Imputation | Replacing missing values with estimated values (mean, median, mode, or model-based) |
| Data cleaning | The process of detecting and correcting errors, inconsistencies, and quality issues in a dataset |
| Data wrangling | The broader process of transforming raw data into a format suitable for analysis |
| Tidy data | Data organized so each variable has its own column, each observation its own row, and each value its own cell |
| Recoding | Creating a new version of a variable with different values or categories |
| Binning | Dividing a continuous variable into discrete intervals (categories) |
| Feature engineering | Creating new variables from existing ones to better capture patterns |
| Data quality | The degree to which data is accurate, complete, consistent, and suitable for its intended purpose |
| Reproducibility | The ability for someone else to follow your documented steps and arrive at the same results |

Check Your Understanding — Final Retrieval Practice (try to answer without scrolling up)

  1. What are the three types of missing data mechanisms? Give a real-world example of each.
  2. Why is mean imputation problematic? What does it do to the distribution?
  3. What does .duplicated() return in pandas?
  4. Name three steps for cleaning inconsistent text data in pandas.
  5. What are the three rules of tidy data?

Check your thinking

  1. MCAR (Missing Completely at Random): missingness unrelated to any variable — e.g., a lab spills coffee on random forms. MAR (Missing at Random): missingness related to other observed variables — e.g., older patients more likely to have missing vaccination records. MNAR (Missing Not at Random): missingness related to the missing value itself — e.g., high-income people don't report their income.
  2. Mean imputation replaces every missing value with the same number (the mean), which artificially reduces the spread of the data. It makes the standard deviation too small, underestimates variance, and can bias the distribution — especially if the data is skewed.
  3. .duplicated() returns a Boolean Series — True for rows that are duplicates of earlier rows, False otherwise. Use .duplicated().sum() to count them.
  4. (1) Convert to lowercase with .str.lower(), (2) strip whitespace with .str.strip(), (3) map variations to standard values with .replace().
  5. (1) Each variable has its own column. (2) Each observation has its own row. (3) Each value has its own cell.

What's Next

You've done the hard, unglamorous, absolutely essential work of data wrangling. Your data is now clean, documented, and ready for analysis. You've learned to handle missing values without pretending they never existed, fix inconsistencies without losing information, and create new variables that capture meaningful patterns.

In Chapter 8: Probability, you'll take the leap from describing data to reasoning about uncertainty. You'll learn to quantify the chances of events, and you'll see how probability provides the mathematical foundation for every inferential procedure that follows. The clean, well-structured dataset you've built in this chapter is exactly what you'll need.

In Chapter 10: Probability Distributions and the Normal Curve, the distribution shapes you used to guide imputation decisions (symmetric vs. skewed) will become formal mathematical models. And in Chapter 13: Hypothesis Testing, you'll learn why reproducibility — the principle you practiced in your cleaning log — is so critical to scientific claims.

The bridge from "here's what my data looks like" (Chapters 5-7) to "here's what my data means about the world" (Chapters 11+) runs through probability. You've built the strongest possible foundation. Now let's cross it.