Case Study: Alex's StreamVibe Cleaning Log — A Step-by-Step Template
The Setup
Alex Rivera, the marketing analyst at StreamVibe from Chapter 1, has been asked to analyze user engagement data for a new recommendation algorithm pilot. His manager wants to know whether users who received the new algorithm ("treatment" group) watch more content than those who didn't ("control" group).
Alex exports the data and takes his first look:
import pandas as pd
import numpy as np
df = pd.read_csv('streamvibe_pilot_data.csv')
print(f"Shape: {df.shape}")
print(f"\nColumn names:\n{list(df.columns)}")
print(f"\nFirst 5 rows:")
print(df.head())
Shape: (8247, 12)
Column names:
['user_id', 'signup_date', 'group', 'watch_time_min', 'sessions',
'device', 'plan_type', 'age', 'country', 'gender', 'last_active',
'satisfaction_score']
First 5 rows:
user_id signup_date group watch_time_min sessions device ...
0 U00001 2024-01-05 control 45.2 3.0 mobile ...
1 U00002 2024-01-05 treatment 72.8 5.0 desktop ...
2 U00003 2024-01-06 control NaN NaN mobile ...
3 U00004 2024-01-06 treatment 312.5 18.0 Mobile ...
4 U00005 2024-01-07 control 28.1 2.0 smart_tv ...
Already Alex can see problems. But instead of diving in randomly, he follows the structured cleaning workflow from Section 7.12. Here is his complete cleaning log — a template you can follow for your own projects.
Alex's Data Cleaning Log
Metadata
Dataset: StreamVibe Recommendation Algorithm Pilot
Analyst: Alex Rivera
Date: 2024-03-22
Raw file: streamvibe_pilot_data.csv
Rows: 8,247
Columns: 12
Purpose: Compare watch time between treatment and control groups
Step 1: Initial Assessment
# --- DATA TYPES ---
print(df.dtypes)
user_id object
signup_date object # Should be datetime
group object
watch_time_min float64
sessions float64 # Should be int, but NaN forces float
device object
plan_type object
age float64 # Should be int, but NaN forces float
country object
gender object
last_active object # Should be datetime
satisfaction_score float64
Issues found:
- signup_date and last_active are stored as text, not dates
- sessions is float instead of integer (NaN contamination)
- age is float instead of integer (NaN contamination)
# --- MISSING VALUES ---
print(df.isna().sum())
print(f"\n--- Percentages ---")
print((df.isna().mean() * 100).round(1))
user_id 0 (0.0%)
signup_date 0 (0.0%)
group 0 (0.0%)
watch_time_min 743 (9.0%)
sessions 743 (9.0%)
device 0 (0.0%)
plan_type 12 (0.1%)
age 156 (1.9%)
country 0 (0.0%)
gender 289 (3.5%)
last_active 412 (5.0%)
satisfaction_score 1847 (22.4%)
Key observations:
- watch_time_min and sessions have identical missing counts (743) — they're missing together, which makes sense (if a user has no watch data, sessions would also be missing)
- satisfaction_score has 22.4% missing — above the 20% threshold. This variable may not be reliable enough for primary analysis.
- last_active missing for 5% of users — possibly inactive users who haven't logged in during the study period
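The "missing together" claim is worth verifying directly rather than inferring it from matching counts. A minimal sketch on a toy frame (the real check would run on Alex's `df`):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df, with the same column names as the log
df = pd.DataFrame({
    'watch_time_min': [45.2, np.nan, 72.8, np.nan],
    'sessions':       [3.0,  np.nan, 5.0,  np.nan],
})

# Cross-tabulate the two missingness indicators: if the off-diagonal
# cells are zero, the columns are always missing (or present) together
co_missing = pd.crosstab(df['watch_time_min'].isna(), df['sessions'].isna())
print(co_missing)

# Equivalent one-line check
print((df['watch_time_min'].isna() == df['sessions'].isna()).all())
```

If the one-line check prints False, the 743/743 match was a coincidence and the two columns need separate missing-data decisions.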
# --- DUPLICATES ---
print(f"Exact duplicate rows: {df.duplicated().sum()}")
print(f"Duplicate user_ids: {df.duplicated(subset=['user_id']).sum()}")
Exact duplicate rows: 23
Duplicate user_ids: 31
Issues found:
- 23 exact duplicate rows (likely from a data export error)
- 31 duplicate user_ids total — 8 more than the exact duplicates, meaning some users appear twice with slightly different data (possibly from different time windows)
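Before choosing which copy of a duplicated user to keep, it helps to inspect all copies side by side. A sketch with toy data; `keep=False` marks every row in a duplicated group, not just the extras:

```python
import pandas as pd

# Toy frame: U2 appears twice with different watch times
df = pd.DataFrame({
    'user_id':        ['U1', 'U2', 'U2', 'U3'],
    'watch_time_min': [45.2, 60.0, 75.5, 28.1],
})

# keep=False flags all rows in each duplicated group, so the
# conflicting copies can be compared before deciding which to keep
dupes = df[df.duplicated(subset=['user_id'], keep=False)]
print(dupes.sort_values('user_id'))
```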
# --- DESCRIPTIVE STATS ---
print(df.describe())
watch_time_min sessions age satisfaction_score
count 7504.0000 7504.0000 8091.0000 6400.0000
mean 127.3412 7.8234 34.5621 3.4210
std 298.7891 9.1245 12.0345 1.1234
min -5.2000 0.0000 -99.0000 1.0000
25% 22.1000 2.0000 25.0000 3.0000
50% 65.3000 5.0000 33.0000 3.0000
75% 142.8000 10.0000 43.0000 4.0000
max 15840.0000 312.0000 999.0000 5.0000
Red flags from .describe():
- watch_time_min has a minimum of -5.2 (impossible — can't watch negative minutes)
- watch_time_min maximum is 15,840 minutes = 264 hours = 11 days straight. Possible bot or data error.
- age minimum is -99 (impossible — likely a placeholder for "unknown")
- age maximum is 999 (impossible — likely a placeholder for "unknown")
- sessions maximum is 312 — about 10 sessions per day for a month. Extreme but not necessarily impossible. Flag for review.
# --- CATEGORICAL VALUE COUNTS ---
print("--- Group ---")
print(df['group'].value_counts())
print("\n--- Device ---")
print(df['device'].value_counts())
print("\n--- Plan Type ---")
print(df['plan_type'].value_counts())
print("\n--- Gender ---")
print(df['gender'].value_counts())
--- Group ---
control 4156
treatment 4091
--- Device ---
mobile 3412
desktop 2198
smart_tv 1456
Mobile 623
tablet 412
Desktop 102
Smart TV 44
--- Plan Type ---
premium 3245
basic 2890
free 2100
--- Gender ---
Male 2456
Female 2312
male 1234
female 789
M 502
F 398
Other 156
other 21
m 18
f 14
MALE 12
FEMALE 8
Inconsistencies found:
- device: "mobile" vs. "Mobile", "desktop" vs. "Desktop", "smart_tv" vs. "Smart TV" — 7 categories should be 4
- gender: 12 unique values should be 3 (male, female, other), plus missing
Step 2: Remove Duplicates
# Remove exact duplicate rows (keep first occurrence)
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicate rows")
# Handle duplicate user_ids: keep the entry with the most recent last_active date
df = df.sort_values('last_active', na_position='first').drop_duplicates(
    subset=['user_id'], keep='last'
)
print(f"After user_id dedup: {len(df)} rows")
Removed 23 exact duplicate rows
After user_id dedup: 8193 rows
Decision: For duplicate user_ids with different data, kept the most recent entry (by last_active). Rationale: later entries are more likely to reflect the user's current engagement.
Rows removed: 54 total (23 exact duplicates + 31 rows dropped in the user_id dedup; 8,247 → 8,193)
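A uniqueness check after any dedup step is cheap insurance against a `keep=` argument doing the wrong thing. A sketch on toy data:

```python
import pandas as pd

# Toy frame: U2 has two entries with different last_active dates
df = pd.DataFrame({
    'user_id':     ['U1', 'U2', 'U2'],
    'last_active': ['2024-03-01', '2024-03-02', '2024-03-10'],
})

# Same dedup as the log: sort so the most recent row comes last, keep it
df = df.sort_values('last_active', na_position='first').drop_duplicates(
    subset=['user_id'], keep='last'
)

assert df['user_id'].is_unique                     # every user appears once
assert '2024-03-10' in df['last_active'].values    # the newer U2 row survived
print(f"{len(df)} unique users")
```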
Step 3: Fix Data Types
# Convert dates
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['last_active'] = pd.to_datetime(df['last_active'], errors='coerce')
# Note: sessions and age will remain float until NaN is handled
Decision: Used errors='coerce' for last_active because some entries might have invalid date formats. Any unparseable dates become NaN.
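Because errors='coerce' silently converts unparseable strings to NaT, a quick audit of how many values were actually coerced (beyond the pre-existing missing ones) is worth adding to the log. A sketch with made-up date strings:

```python
import pandas as pd

raw = pd.Series(['2024-03-01', 'not a date', '2024-03-15', None])

# Missing before conversion: true None/NaN, not bad formats
missing_before = raw.isna().sum()

parsed = pd.to_datetime(raw, errors='coerce')

# Anything newly NaT was an unparseable string, worth recording in the log
coerced = parsed.isna().sum() - missing_before
print(f"Values coerced to NaT: {coerced}")
```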
Step 4: Standardize Text Columns
# Device
df['device'] = df['device'].str.lower().str.strip().str.replace(' ', '_')
print(f"Device categories: {df['device'].nunique()}")
# Result: 4 categories (mobile, desktop, smart_tv, tablet)
# Gender
df['gender'] = df['gender'].str.lower().str.strip()
df['gender'] = df['gender'].replace({
'm': 'male', 'f': 'female'
})
print(f"Gender categories: {df['gender'].nunique()}")
# Result: 3 categories (male, female, other)
# Plan type (already consistent but verify)
df['plan_type'] = df['plan_type'].str.lower().str.strip()
Step 5: Fix Impossible Values
# Age: replace placeholder values with NaN
impossible_ages = (df['age'] < 0) | (df['age'] > 120)
print(f"Impossible ages: {impossible_ages.sum()}")
df.loc[impossible_ages, 'age'] = np.nan
# Found: 8 impossible ages (-99 and 999 values)
# Watch time: replace negative values with NaN
negative_watch = df['watch_time_min'] < 0
print(f"Negative watch times: {negative_watch.sum()}")
df.loc[negative_watch, 'watch_time_min'] = np.nan
# Found: 3 negative watch time values
# Watch time: flag extreme values (> 1440 min = 24 hours in a day)
extreme_watch = df['watch_time_min'] > 1440
print(f"Extreme watch times (>24 hrs): {extreme_watch.sum()}")
# Found: 12 extreme values
# Decision: flag but don't remove — could be legitimate binge-watchers
# over a multi-day period, or could be bot accounts
df['watch_time_extreme_flag'] = extreme_watch.astype(int)
Decisions:
- Age -99 and 999: Set to NaN (clearly placeholder values, not real ages)
- Negative watch times: Set to NaN (impossible values)
- Extreme watch times (>24 hours): Flagged but NOT removed. Rationale: the data might represent cumulative watch time over the study period, not a single day. Will investigate further before deciding.
Step 6: Handle Missing Data
# --- Mechanism analysis ---
# watch_time_min: Check if missingness differs by group
missing_by_group = df.groupby('group')['watch_time_min'].apply(
    lambda x: x.isna().mean() * 100
)
print(f"Missing watch time by group:\n{missing_by_group.round(1)}")
Missing watch time by group:
control 4.2%
treatment 14.1%
Critical finding: Watch time is missing at a MUCH higher rate in the treatment group (14.1%) than the control group (4.2%). This is NOT MCAR. Possible explanations:
- The new algorithm might have caused app crashes for some users (technical failure → no data)
- Users in the treatment group who didn't like the new recommendations might have churned (MNAR: users who would have had LOW watch times are the ones missing)
Decision: Do NOT impute watch_time_min. Imputing would mask a potentially systematic difference between groups. Instead, flag the missingness and analyze in three ways:
1. Complete cases only (with caveat)
2. Assume all missing users had watch_time = 0 (worst case for the algorithm)
3. Assume all missing users watched the control group median (conservative estimate)
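A two-proportion z-test can confirm that the gap in missingness rates is far beyond chance. The counts below are approximations derived from the printed rates (174 ≈ 4.2% of 4,132; 573 ≈ 14.1% of 4,061), not figures from Alex's log:

```python
import math

# Approximate counts: rows per group after dedup, and rows with
# missing watch_time_min in each (derived from the printed rates)
n_control, miss_control = 4132, 174
n_treat, miss_treat = 4061, 573

p1 = miss_control / n_control
p2 = miss_treat / n_treat

# Pooled proportion and standard error under the null of equal rates
p_pool = (miss_control + miss_treat) / (n_control + n_treat)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treat))
z = (p2 - p1) / se

# Two-sided p-value from the normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.1f}, p = {p_value:.2g}")
```

With a z-statistic this large, the difference in missingness rates is essentially impossible under MCAR, which supports Alex's decision not to impute.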
# Create missing data flags
df['watch_time_missing'] = df['watch_time_min'].isna().astype(int)
df['sessions_missing'] = df['sessions'].isna().astype(int)
df['age_missing'] = df['age'].isna().astype(int)
df['gender_missing'] = df['gender'].isna().astype(int)
df['satisfaction_missing'] = df['satisfaction_score'].isna().astype(int)
# plan_type: impute with mode first (only 12 missing, 0.1%, likely MCAR);
# doing this before the age imputation so every row has a plan_type group
df['plan_type'] = df['plan_type'].fillna(df['plan_type'].mode()[0])
# Age: impute with median by plan_type (MAR: plan type likely related to age)
df['age'] = df.groupby('plan_type')['age'].transform(
    lambda x: x.fillna(x.median())
)
# Gender: leave as missing (3.5% — analysis can proceed without it)
# satisfaction_score: leave as missing (22.4% — too much to impute reliably)
# Will analyze as a secondary variable with caveat about missingness
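After imputing, a short assertion block catches columns that were supposed to be fully filled but aren't. A sketch on toy data using the same scheme as the log (plan_type by mode, age by plan-type median):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing plan_type and one missing age
df = pd.DataFrame({
    'plan_type': ['basic', 'premium', None, 'basic'],
    'age':       [25.0, 40.0, 31.0, np.nan],
})

# Impute plan_type first so the age group-median sees every row
df['plan_type'] = df['plan_type'].fillna(df['plan_type'].mode()[0])
df['age'] = df.groupby('plan_type')['age'].transform(
    lambda x: x.fillna(x.median())
)

# Imputed columns should now be complete; deliberately-missing
# columns (gender, satisfaction_score) are not checked
assert df['plan_type'].notna().all()
assert df['age'].notna().all()
print("Imputation check passed")
```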
Step 7: Create New Variables
# Days in study: signup_date to last_active
df['days_in_study'] = (df['last_active'] - df['signup_date']).dt.days
# Active user flag: had at least one session
# (note: NaN sessions compare False, so users with missing data are coded 0)
df['is_active'] = (df['sessions'] > 0).astype(int)
# Engagement level: based on watch time quartiles (among non-missing);
# index alignment leaves rows with missing watch time as NaN
df['engagement_level'] = pd.qcut(
    df['watch_time_min'].dropna(),
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High']
)
# Age group
bins = [0, 25, 35, 50, 120]
# first bin is (0, 25], so label it '25 and under' rather than '18-25'
labels = ['25 and under', '26-35', '36-50', '50+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
# Weekend signup flag (could affect initial engagement)
df['weekend_signup'] = df['signup_date'].dt.dayofweek.isin([5, 6]).astype(int)
Step 8: Final Verification
print("=== FINAL DATASET ===")
print(f"Rows: {len(df)}")
print(f"Columns: {df.shape[1]}")
print(f"\nRemaining missing values:")
print(df.isna().sum())
print(f"\nGroup balance:")
print(df['group'].value_counts())
print(f"\nDevice categories:")
print(df['device'].value_counts())
print(f"\nGender categories (including NaN):")
print(df['gender'].value_counts(dropna=False))
=== FINAL DATASET ===
Rows: 8193
Columns: 23 (12 original + 6 flags + 5 new variables)
Remaining missing values:
watch_time_min 749 (intentionally not imputed)
sessions 749 (intentionally not imputed)
gender 283 (intentionally not imputed)
satisfaction_score 1835 (intentionally not imputed)
last_active 408 (can't create days_in_study for these)
...
Group balance:
control 4132
treatment 4061
Summary of Decisions
| Issue | Action | Justification |
|---|---|---|
| Exact duplicates (23) | Removed | Data export error |
| Duplicate user_ids (31) | Kept most recent | Later entry more current |
| Device inconsistencies | Standardized to 4 categories | Case/whitespace differences only |
| Gender inconsistencies | Standardized to 3 categories | Case/abbreviation differences |
| Age = -99, 999 | Set to NaN | Placeholder values |
| Negative watch time | Set to NaN | Impossible values |
| Extreme watch time (>24h) | Flagged, not removed | May be legitimate; needs investigation |
| Missing watch time | NOT imputed | Differential missingness by group (MNAR suspected) |
| Missing age | Imputed with plan-type median | 1.9% missing, likely MAR |
| Missing gender | Left as NaN | 3.5% missing, not critical for primary analysis |
| Missing satisfaction | Left as NaN | 22.4% missing, unreliable for primary analysis |
| Missing plan_type | Imputed with mode | 0.1% missing, likely MCAR |
The Key Finding: Missing Data Was the Story
Here's the part that surprised Alex — and that makes this cleaning log more than a technical exercise.
When he first ran a simple comparison of average watch time between groups, the treatment group looked great:
Control mean watch time: 118.4 minutes
Treatment mean watch time: 156.2 minutes
The new algorithm appeared to increase watch time by 32%. His manager would be thrilled.
But then he looked at his cleaning log. Treatment-group users had a 14.1% missing rate for watch time, compared to 4.2% for control. If the missing treatment users had low watch time (which makes sense — disengaged users are less likely to generate data), the true treatment mean could be much lower.
His sensitivity analysis:
Scenario 1 (complete cases only):
Control: 118.4 min | Treatment: 156.2 min → +32.0%
Scenario 2 (missing treatment = 0 min):
Control: 115.3 min | Treatment: 134.1 min → +16.3%
Scenario 3 (missing treatment = control median):
Control: 116.8 min | Treatment: 143.5 min → +22.8%
The algorithm probably did help — but the 32% improvement was inflated by missing data. The real effect was somewhere between 16% and 32%, depending on assumptions about the missing users.
If Alex had skipped the cleaning log and just reported the 32% number, his company might have rolled out the algorithm to all users based on an overestimate. The cleaning log didn't just document his work — it revealed the most important finding in his analysis.
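The three scenarios can be generated mechanically: impute the missing values under each assumption and recompute the group means. A sketch on toy numbers, not Alex's actual data:

```python
import pandas as pd
import numpy as np

# Toy frame: treatment has proportionally more missing watch time
df = pd.DataFrame({
    'group':          ['control'] * 4 + ['treatment'] * 4,
    'watch_time_min': [100, 120, np.nan, 110, 150, 160, np.nan, np.nan],
})

control_median = df.loc[df['group'] == 'control', 'watch_time_min'].median()

scenarios = {
    'complete cases':           df['watch_time_min'],  # NaN ignored by mean()
    'missing = 0':              df['watch_time_min'].fillna(0),
    'missing = control median': df['watch_time_min'].fillna(control_median),
}

for name, col in scenarios.items():
    means = col.groupby(df['group']).mean()
    lift = (means['treatment'] / means['control'] - 1) * 100
    print(f"{name}: control {means['control']:.1f} | "
          f"treatment {means['treatment']:.1f} -> {lift:+.1f}%")
```

Because the treatment group has more missing rows, the three scenarios diverge sharply on the toy data too, which is exactly the signature Alex saw.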
What You Can Learn From Alex's Approach
- Structure your cleaning. Follow a consistent order: assess, deduplicate, fix types, standardize text, fix impossible values, handle missing data, create new variables, verify.
- Write the log as you go. Don't try to reconstruct your decisions after the fact. Write each step as you do it.
- Check missingness across groups. The most dangerous missing data patterns are the ones that differ between the groups you want to compare.
- Run sensitivity analyses. If your conclusions depend on how you handle missing data, say so. Show the range of possible results.
- Let the cleaning process inform your analysis. Alex's most important finding came from his cleaning log, not from a statistical test. The pattern of missing data was the story.
Discussion Questions
- If Alex reports only the "complete cases" result (32% improvement), why would this be misleading? What would be a more honest way to present his findings?
- The treatment group has 14.1% missing watch time. Brainstorm at least three possible explanations for why the missingness rate is higher in the treatment group.
- Alex decided not to impute watch time. Do you agree with this decision? Under what circumstances might imputation have been appropriate here?
- Alex's satisfaction score variable has 22.4% missing data. He decides not to use it as a primary analysis variable. Is this the right call? What would you do with it?
- Look at Alex's cleaning log and identify three places where a different analyst might have made a different decision. For each, explain both choices and their consequences.