Case Study: Alex's StreamVibe Cleaning Log — A Step-by-Step Template

The Setup

Alex Rivera, the marketing analyst at StreamVibe from Chapter 1, has been asked to analyze user engagement data for a new recommendation algorithm pilot. His manager wants to know whether users who received the new algorithm ("treatment" group) watch more content than those who didn't ("control" group).

Alex exports the data and takes his first look:

import pandas as pd
import numpy as np

df = pd.read_csv('streamvibe_pilot_data.csv')
print(f"Shape: {df.shape}")
print(f"\nColumn names:\n{list(df.columns)}")
print(f"\nFirst 5 rows:")
print(df.head())
Shape: (8247, 12)

Column names:
['user_id', 'signup_date', 'group', 'watch_time_min', 'sessions',
 'device', 'plan_type', 'age', 'country', 'gender', 'last_active',
 'satisfaction_score']

First 5 rows:
   user_id signup_date    group  watch_time_min  sessions     device  ...
0  U00001  2024-01-05  control            45.2       3.0     mobile  ...
1  U00002  2024-01-05  treatment          72.8       5.0    desktop  ...
2  U00003  2024-01-06  control             NaN       NaN     mobile  ...
3  U00004  2024-01-06  treatment         312.5      18.0    Mobile   ...
4  U00005  2024-01-07  control            28.1       2.0  smart_tv  ...

Already Alex can see problems. But instead of diving in randomly, he follows the structured cleaning workflow from Section 7.12. Here is his complete cleaning log — a template you can follow for your own projects.

Alex's Data Cleaning Log


Metadata

Dataset: StreamVibe Recommendation Algorithm Pilot
Analyst: Alex Rivera
Date: 2024-03-22
Raw file: streamvibe_pilot_data.csv
Rows: 8,247
Columns: 12
Purpose: Compare watch time between treatment and control groups

Step 1: Initial Assessment

# --- DATA TYPES ---
print(df.dtypes)
user_id              object
signup_date          object    # Should be datetime
group                object
watch_time_min      float64
sessions            float64   # Should be int, but NaN forces float
device               object
plan_type            object
age                 float64   # Should be int, but NaN forces float
country              object
gender               object
last_active          object   # Should be datetime
satisfaction_score  float64

Issues found:
- signup_date and last_active are stored as text, not dates
- sessions is float instead of integer (NaN contamination)
- age is float instead of integer (NaN contamination)

# --- MISSING VALUES ---
print(df.isna().sum())
print(f"\n--- Percentages ---")
print((df.isna().mean() * 100).round(1))
user_id                 0     (0.0%)
signup_date             0     (0.0%)
group                   0     (0.0%)
watch_time_min        743     (9.0%)
sessions              743     (9.0%)
device                  0     (0.0%)
plan_type              12     (0.1%)
age                   156     (1.9%)
country                 0     (0.0%)
gender                289     (3.5%)
last_active           412     (5.0%)
satisfaction_score   1847    (22.4%)

Key observations:
- watch_time_min and sessions have identical missing counts (743) — they're missing together, which makes sense (if a user has no watch data, sessions would also be missing)
- satisfaction_score has 22.4% missing — above the 20% threshold. This variable may not be reliable enough for primary analysis.
- last_active missing for 5% of users — possibly inactive users who haven't logged in during the study period
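A quick way to confirm that two columns are missing together, as Alex suspects here, is to compare their missingness indicators directly. A minimal sketch on a toy frame (the values are invented for illustration, not taken from the StreamVibe data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pilot data (hypothetical values)
df = pd.DataFrame({
    'watch_time_min': [45.2, np.nan, 72.8, np.nan],
    'sessions':       [3.0,  np.nan, 5.0,  np.nan],
})

# Cross-tabulate the two missingness indicators; if the columns are always
# missing together, the off-diagonal cells of this table are zero
tab = pd.crosstab(df['watch_time_min'].isna(), df['sessions'].isna())
print(tab)

# Equivalent one-liner: do the missingness patterns coincide row by row?
print((df['watch_time_min'].isna() == df['sessions'].isna()).all())
```

If the off-diagonal cells are nonzero, the columns are sometimes missing separately, which would point to two different data-collection failures rather than one.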

# --- DUPLICATES ---
print(f"Exact duplicate rows: {df.duplicated().sum()}")
print(f"Duplicate user_ids: {df.duplicated(subset=['user_id']).sum()}")
Exact duplicate rows: 23
Duplicate user_ids: 54

Issues found: 
- 23 exact duplicate rows (likely from a data export error)
- 54 duplicate user_ids in total — 31 more than the exact duplicates, meaning 31 users appear twice with slightly different data (possibly from different time windows)
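Before deciding how to resolve them, it helps to eyeball the user_id duplicates that are not exact copies. One way to pull them out, sketched on a toy frame (hypothetical values, not Alex's export):

```python
import pandas as pd

# Toy frame: U1 appears twice as an exact copy, U2 twice with different data
df = pd.DataFrame({
    'user_id':  ['U1', 'U1', 'U2', 'U2', 'U3'],
    'sessions': [3,    3,    5,    7,    2],
})

# Rows whose user_id repeats but which are NOT exact copies of another row:
# these are the "same user, different data" cases that need a tie-break rule
conflicting = df[df.duplicated(subset=['user_id'], keep=False)
                 & ~df.duplicated(keep=False)]
print(conflicting)
```

Here keep=False marks every member of a duplicate group, so the mask keeps both conflicting rows for each affected user and you can compare them side by side.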

# --- DESCRIPTIVE STATS ---
print(df.describe())
       watch_time_min    sessions         age  satisfaction_score
count      7504.0000   7504.0000   8091.0000          6400.0000
mean        127.3412      7.8234     34.5621             3.4210
std         298.7891      9.1245     12.0345             1.1234
min          -5.2000      0.0000    -99.0000             1.0000
25%          22.1000      2.0000     25.0000             3.0000
50%          65.3000      5.0000     33.0000             3.0000
75%         142.8000     10.0000     43.0000             4.0000
max       15840.0000    312.0000    999.0000             5.0000

Red flags from .describe():
- watch_time_min has a minimum of -5.2 (impossible — can't watch negative minutes)
- watch_time_min maximum is 15,840 minutes = 264 hours = 11 days straight. Possible bot or data error.
- age minimum is -99 (impossible — likely a placeholder for "unknown")
- age maximum is 999 (impossible — likely a placeholder for "unknown")
- sessions maximum is 312 — about 10 sessions per day for a month. Extreme but not necessarily impossible. Flag for review.
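Before overwriting anything, it is worth pulling the offending rows themselves, since placeholder codes often cluster by import batch or signup cohort. A small sketch with hypothetical values:

```python
import pandas as pd

# Toy frame with placeholder-coded ages (hypothetical values)
df = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U3', 'U4'],
    'age':     [34,   -99,  999,  27],
})

# Inspect impossible ages before replacing them; shared patterns here
# (same signup date, same country) would point to a single bad import job
bad_age = df[(df['age'] < 0) | (df['age'] > 120)]
print(bad_age)
```

The same pattern (boolean mask, then inspect) works for the negative and extreme watch times.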

# --- CATEGORICAL VALUE COUNTS ---
print("--- Group ---")
print(df['group'].value_counts())
print("\n--- Device ---")
print(df['device'].value_counts())
print("\n--- Plan Type ---")
print(df['plan_type'].value_counts())
print("\n--- Gender ---")
print(df['gender'].value_counts())
--- Group ---
control      4156
treatment    4091

--- Device ---
mobile       3412
desktop      2198
smart_tv     1456
Mobile        623
tablet        412
Desktop       102
Smart TV       44

--- Plan Type ---
premium      3245
basic        2890
free         2100

--- Gender ---
Male         2456
Female       2312
male         1234
female        789
M             502
F             398
Other         156
other          21
m              18
f              14
MALE           12
FEMALE          8

Inconsistencies found:
- device: "mobile" vs. "Mobile", "desktop" vs. "Desktop", "smart_tv" vs. "Smart TV" — 7 categories should be 4
- gender: 12 distinct spellings should collapse to 3 categories (male, female, other), plus missing


Step 2: Remove Duplicates

# Remove exact duplicate rows (keep first occurrence)
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} exact duplicate rows")

# Handle duplicate user_ids: keep the entry with the most recent last_active date
df = df.sort_values('last_active', na_position='first').drop_duplicates(
    subset=['user_id'], keep='last'
)
print(f"After user_id dedup: {len(df)} rows")
Removed 23 exact duplicate rows
After user_id dedup: 8193 rows

Decision: For duplicate user_ids with different data, kept the most recent entry (by last_active). Rationale: later entries are more likely to reflect the user's current engagement.

Rows removed: 54 (23 exact duplicates + 31 remaining user_id duplicates; 8,247 → 8,193)


Step 3: Fix Data Types

# Convert dates
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['last_active'] = pd.to_datetime(df['last_active'], errors='coerce')

# Note: sessions and age will remain float until NaN is handled

Decision: Used errors='coerce' for last_active because some entries might have invalid date formats. Any unparseable dates become NaT (pandas' missing-value marker for datetimes).
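With errors='coerce' it is good practice to count how many values actually failed to parse, so silent data loss doesn't go unnoticed. A sketch on toy values:

```python
import pandas as pd

# Toy date column: one invalid calendar date, one non-date string
raw = pd.Series(['2024-01-05', '2024-02-30', 'not a date'])

parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present as text but came back NaT failed to parse
n_failed = (parsed.isna() & raw.notna()).sum()
print(f"Unparseable dates: {n_failed}")
```

Comparing against raw.notna() matters: it separates genuinely missing inputs from values that coercion silently destroyed.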


Step 4: Standardize Text Columns

# Device
df['device'] = df['device'].str.lower().str.strip().str.replace(' ', '_')
print(f"Device categories: {df['device'].nunique()}")
# Result: 4 categories (mobile, desktop, smart_tv, tablet)

# Gender
df['gender'] = df['gender'].str.lower().str.strip()
df['gender'] = df['gender'].replace({
    'm': 'male', 'f': 'female'
})
print(f"Gender categories: {df['gender'].nunique()}")
# Result: 3 categories (male, female, other)

# Plan type (already consistent but verify)
df['plan_type'] = df['plan_type'].str.lower().str.strip()

Step 5: Fix Impossible Values

# Age: replace placeholder values with NaN
impossible_ages = (df['age'] < 0) | (df['age'] > 120)
print(f"Impossible ages: {impossible_ages.sum()}")
df.loc[impossible_ages, 'age'] = np.nan
# Found: 8 impossible ages (-99 and 999 values)

# Watch time: replace negative values with NaN
negative_watch = df['watch_time_min'] < 0
print(f"Negative watch times: {negative_watch.sum()}")
df.loc[negative_watch, 'watch_time_min'] = np.nan
# Found: 3 negative watch time values

# Watch time: flag extreme values (> 1440 min = 24 hours in a day)
extreme_watch = df['watch_time_min'] > 1440
print(f"Extreme watch times (>24 hrs): {extreme_watch.sum()}")
# Found: 12 extreme values
# Decision: flag but don't remove — could be legitimate binge-watchers
# over a multi-day period, or could be bot accounts
df['watch_time_extreme_flag'] = extreme_watch.astype(int)

Decisions:
- Age -99 and 999: Set to NaN (clearly placeholder values, not real ages)
- Negative watch times: Set to NaN (impossible values)
- Extreme watch times (>24 hours): Flagged but NOT removed. Rationale: the data might represent cumulative watch time over the study period, not a single day. Will investigate further before deciding.
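One way to investigate the flagged rows later is to relate watch time to session counts: many short sessions suggest cumulative logging over the study period, while a handful of enormous "sessions" looks more like a bot. A sketch on hypothetical values (not the real flagged users):

```python
import pandas as pd

# Toy frame standing in for the flagged users (hypothetical values)
df = pd.DataFrame({
    'watch_time_min': [65.0, 15840.0, 2100.0, 40.0],
    'sessions':       [5,    310,     2,      3],
})
df['watch_time_extreme_flag'] = (df['watch_time_min'] > 1440).astype(int)

# Minutes per session for the flagged rows: a 1,000-minute "session" is a
# much stronger bot signal than hundreds of short ones
flagged = df[df['watch_time_extreme_flag'] == 1].copy()
flagged['min_per_session'] = flagged['watch_time_min'] / flagged['sessions']
print(flagged)
```

In this toy frame the 15,840-minute user averages about 51 minutes per session (plausible heavy use over weeks), while the 2,100-minute user averages 1,050 minutes per session (suspicious).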


Step 6: Handle Missing Data

# --- Mechanism analysis ---
# watch_time_min: Check if missingness differs by group
missing_by_group = df.groupby('group')['watch_time_min'].apply(
    lambda x: x.isna().mean() * 100
)
print(f"Missing watch time by group:\n{missing_by_group.round(1)}")
Missing watch time by group:
control       4.2%
treatment    14.1%

Critical finding: Watch time is missing at a MUCH higher rate in the treatment group (14.1%) than the control group (4.2%). This is NOT MCAR. Possible explanations:
- The new algorithm might have caused app crashes for some users (technical failure → no data)
- Users in the treatment group who didn't like the new recommendations might have churned (MNAR: users who would have had LOW watch times are the ones missing)

Decision: Do NOT impute watch_time_min. Imputing would mask a potentially systematic difference between groups. Instead, flag the missingness and analyze in three ways:
1. Complete cases only (with caveat)
2. Assume missing treatment users had watch_time = 0 (worst case for the algorithm)
3. Assume missing treatment users watched the control group median (conservative estimate)
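The three-scenario plan can be sketched in a few lines. This is a toy reconstruction, not Alex's actual code: the frame and numbers are invented, and missing values in both groups are filled the same way in each scenario.

```python
import numpy as np
import pandas as pd

# Toy pilot frame (hypothetical): 10 users per group, watch time in minutes
df = pd.DataFrame({
    'group': ['control'] * 10 + ['treatment'] * 10,
    'watch_time_min': [100.0] * 10 + [160.0] * 10,
})
# Mimic differential missingness: 1 control and 3 treatment values missing
df.loc[0, 'watch_time_min'] = np.nan
df.loc[[10, 11, 12], 'watch_time_min'] = np.nan

def lift(frame):
    """Treatment-over-control lift in mean watch time."""
    means = frame.groupby('group')['watch_time_min'].mean()
    return means['treatment'] / means['control'] - 1

# Scenario 1: complete cases only (pandas skips NaN when averaging)
s1 = lift(df)

# Scenario 2: missing watch times counted as 0 minutes (worst case)
s2 = lift(df.fillna({'watch_time_min': 0}))

# Scenario 3: missing watch times set to the control-group median
ctrl_median = df.loc[df['group'] == 'control', 'watch_time_min'].median()
s3 = lift(df.fillna({'watch_time_min': ctrl_median}))

print(f"complete cases: {s1:+.1%}, missing=0: {s2:+.1%}, "
      f"missing=control median: {s3:+.1%}")
```

Because the treatment group has more missing values, the worst-case fill shrinks the apparent lift the most, which is exactly the pattern Alex's real sensitivity analysis shows below.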

# Create missing data flags
df['watch_time_missing'] = df['watch_time_min'].isna().astype(int)
df['sessions_missing'] = df['sessions'].isna().astype(int)
df['age_missing'] = df['age'].isna().astype(int)
df['gender_missing'] = df['gender'].isna().astype(int)
df['satisfaction_missing'] = df['satisfaction_score'].isna().astype(int)

# plan_type: impute with mode first (only 12 missing, 0.1%, likely MCAR).
# Doing this before the age imputation ensures every row has a plan_type
# group for the median lookup below.
df['plan_type'] = df['plan_type'].fillna(df['plan_type'].mode()[0])

# Age: impute with median by plan_type (MAR: plan type likely related to age)
df['age'] = df.groupby('plan_type')['age'].transform(
    lambda x: x.fillna(x.median())
)

# Gender: leave as missing (3.5% — analysis can proceed without it)

# satisfaction_score: leave as missing (22.4% — too much to impute reliably)
# Will analyze as a secondary variable with caveat about missingness

Step 7: Create New Variables

# Days in study: signup_date to last_active
df['days_in_study'] = (df['last_active'] - df['signup_date']).dt.days

# Active user flag: had at least one session
# (note: NaN sessions compare as False, so users with missing session data
# are counted as inactive here)
df['is_active'] = (df['sessions'] > 0).astype(int)

# Engagement level: based on watch time quartiles (among non-missing)
df['engagement_level'] = pd.qcut(
    df['watch_time_min'].dropna(),
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High']
)

# Age group
bins = [0, 25, 35, 50, 120]
labels = ['25 and under', '26-35', '36-50', '51+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

# Weekend signup flag (could affect initial engagement)
df['weekend_signup'] = df['signup_date'].dt.dayofweek.isin([5, 6]).astype(int)

Step 8: Final Verification

print("=== FINAL DATASET ===")
print(f"Rows: {len(df)}")
print(f"Columns: {df.shape[1]}")
print(f"\nRemaining missing values:")
print(df.isna().sum())
print(f"\nGroup balance:")
print(df['group'].value_counts())
print(f"\nDevice categories:")
print(df['device'].value_counts())
print(f"\nGender categories (including NaN):")
print(df['gender'].value_counts(dropna=False))
=== FINAL DATASET ===
Rows: 8193
Columns: 23 (12 original + 6 flags + 5 new variables)

Remaining missing values:
watch_time_min       749   (intentionally not imputed)
sessions             749   (intentionally not imputed)
gender               283   (intentionally not imputed)
satisfaction_score  1835   (intentionally not imputed)
last_active          408   (can't create days_in_study for these)
...

Group balance:
control      4132
treatment    4061

Summary of Decisions

Issue                      | Action                        | Justification
Exact duplicates (23)      | Removed                       | Data export error
Duplicate user_ids (31)    | Kept most recent entry        | Later entry more current
Device inconsistencies     | Standardized to 4 categories  | Case/whitespace differences only
Gender inconsistencies     | Standardized to 3 categories  | Case/abbreviation differences
Age = -99, 999             | Set to NaN                    | Placeholder values
Negative watch time        | Set to NaN                    | Impossible values
Extreme watch time (>24h)  | Flagged, not removed          | May be legitimate; needs investigation
Missing watch time         | NOT imputed                   | Differential missingness by group (MNAR suspected)
Missing age                | Imputed with plan-type median | 1.9% missing, likely MAR
Missing gender             | Left as NaN                   | 3.5% missing, not critical for primary analysis
Missing satisfaction       | Left as NaN                   | 22.4% missing, unreliable for primary analysis
Missing plan_type          | Imputed with mode             | 0.1% missing, likely MCAR

The Key Finding: Missing Data Was the Story

Here's the part that surprised Alex — and that makes this cleaning log more than a technical exercise.

When he first ran a simple comparison of average watch time between groups, the treatment group looked great:

Control mean watch time:      118.4 minutes
Treatment mean watch time:    156.2 minutes

The new algorithm appeared to increase watch time by 32%. His manager would be thrilled.

But then he looked at his cleaning log. Treatment-group users had a 14.1% missing rate for watch time, compared to 4.2% for control. If the missing treatment users had low watch time (which makes sense — disengaged users are less likely to generate data), the true treatment mean could be much lower.

His sensitivity analysis:

Scenario 1 (complete cases only):
  Control: 118.4 min | Treatment: 156.2 min → +32.0%

Scenario 2 (missing watch times set to 0 min):
  Control: 115.3 min | Treatment: 134.1 min → +16.3%

Scenario 3 (missing watch times set to the control-group median):
  Control: 116.8 min | Treatment: 143.5 min → +22.8%

The algorithm probably did help — but the 32% improvement was inflated by missing data. The real effect was somewhere between 16% and 32%, depending on assumptions about the missing users.

If Alex had skipped the cleaning log and just reported the 32% number, his company might have rolled out the algorithm to all users based on an overestimate. The cleaning log didn't just document his work — it revealed the most important finding in his analysis.

What You Can Learn From Alex's Approach

  1. Structure your cleaning. Follow a consistent order: assess, deduplicate, fix types, standardize text, fix impossible values, handle missing data, create new variables, verify.

  2. Write the log as you go. Don't try to reconstruct your decisions after the fact. Write each step as you do it.

  3. Check missingness across groups. The most dangerous missing data patterns are the ones that differ between the groups you want to compare.

  4. Run sensitivity analyses. If your conclusions depend on how you handle missing data, say so. Show the range of possible results.

  5. Let the cleaning process inform your analysis. Alex's most important finding came from his cleaning log, not from a statistical test. The pattern of missing data was the story.

Discussion Questions

  1. If Alex reports only the "complete cases" result (32% improvement), why would this be misleading? What would be a more honest way to present his findings?

  2. The treatment group has 14.1% missing watch time. Brainstorm at least three possible explanations for why the missingness rate is higher in the treatment group.

  3. Alex decided not to impute watch time. Do you agree with this decision? Under what circumstances might imputation have been appropriate here?

  4. Alex's satisfaction score variable has 22.4% missing data. He decides not to use it as a primary analysis variable. Is this the right call? What would you do with it?

  5. Look at Alex's cleaning log and identify three places where a different analyst might have made a different decision. For each, explain both choices and their consequences.