Case Study 1: StreamFlow --- When Missing Usage Data Is the Best Feature You Have
Background
StreamFlow's churn prediction model is performing well on paper. AUC of 0.82 on the holdout set. The retention team is targeting the top 5% highest-risk subscribers with personalized offers. Monthly churn has dropped from 8.2% to 7.6%.
Then someone runs the numbers on the model's blind spots.
Of the subscribers who churned last month, 34% were not in the model's top 20% risk tier. They churned without warning --- or rather, without the model detecting the warning. When the team investigates these "surprise churners," a pattern emerges: almost all of them had missing usage data in the final week before cancellation.
The model never saw the signal because the signal was in the gaps.
The Data
StreamFlow tracks subscriber behavior through event logging. Every time a user opens the app, watches content, creates a dashboard, or interacts with any feature, an event is recorded with a timestamp, user ID, event type, and metadata.
The feature engineering pipeline (from Chapter 6) aggregates these events into subscriber-level features:
| Feature | Source | Missingness Rate | Likely Mechanism |
|---|---|---|---|
| total_hours_last_7d | Event log aggregation | 12.3% | Mixed (MAR + MNAR) |
| total_hours_last_30d | Event log aggregation | 8.1% | Mixed (MAR + MNAR) |
| sessions_last_7d | Event log aggregation | 12.3% | Mixed (MAR + MNAR) |
| avg_session_duration | Derived from sessions | 15.2% | Mixed (MAR + MNAR) |
| genre_diversity_score | Derived from watch history | 18.4% | MAR (requires history) |
| email_open_rate | Email tracking pixel | 22.0% | MAR (email client blocks) |
| nps_score | Quarterly survey | 65.3% | MAR + selection bias |
| days_since_last_login | Login event | 4.8% | MCAR (tracking bug) |
The "Mixed (MAR + MNAR)" classification for usage features is the crux of the problem. There are two distinct populations of subscribers with missing usage data:
Population A (MCAR/MAR): The usage tracking pipeline occasionally drops events due to data engineering issues --- a queue overflow, a failed batch job, a timezone parsing error. These failures are approximately random with respect to the subscriber's actual usage level. This accounts for roughly 2-3 percentage points of the 12.3% missingness.
Population B (MNAR): The subscriber simply did not use the product. No events were logged because there were no events to log. The usage value is not missing --- it is implicitly zero. This accounts for the remaining 9-10 percentage points of missingness, and this population has a churn rate of 43%, compared to the overall rate of 8.2%.
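Why the distinction matters for modeling can be seen in a toy simulation (purely illustrative synthetic data, not StreamFlow's): when missingness means "the user did not use the product," filling with zero preserves the usage-churn signal, while a blanket median fill pushes exactly the highest-risk users toward the middle of the distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic world: churners tend to have near-zero weekly usage
churned = rng.random(n) < 0.08
hours = np.where(churned, rng.exponential(0.5, n), rng.exponential(5.0, n))

# MNAR: users with near-zero usage log no events, so the value arrives as NaN
observed = pd.Series(np.where(hours < 0.2, np.nan, hours))

zero_fill = observed.fillna(0.0)                  # "missing means did not use"
median_fill = observed.fillna(observed.median())  # "missing means typical user"

# Zero-fill keeps the (negative) usage-churn correlation strong;
# median-fill dilutes it by relabeling inactive users as average ones
print(f"corr(zero_fill, churned):   {np.corrcoef(zero_fill, churned)[0, 1]:.3f}")
print(f"corr(median_fill, churned): {np.corrcoef(median_fill, churned)[0, 1]:.3f}")
```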
The Investigation
The data science lead, suspecting the missingness was informative, ran a stratified analysis:
```python
import pandas as pd
import numpy as np

# Load subscriber-month data
df = pd.read_csv('streamflow_subscriber_months.csv')

# Create missingness indicator
df['usage_7d_missing'] = df['total_hours_last_7d'].isna().astype(int)

# Churn rate by missingness status
print("Churn rate by usage data availability:")
print("=" * 55)
churn_by_missing = df.groupby('usage_7d_missing').agg(
    n_subscribers=('churned_within_30_days', 'count'),
    n_churned=('churned_within_30_days', 'sum'),
    churn_rate=('churned_within_30_days', 'mean')
).round(4)
churn_by_missing.index = ['Usage present', 'Usage missing']
print(churn_by_missing)
print()

# Further stratification: among those with missing usage,
# how does days_since_last_login segment them?
missing_only = df[df['usage_7d_missing'] == 1].copy()
missing_only['login_recency_group'] = pd.cut(
    missing_only['days_since_last_login'],
    bins=[0, 3, 7, 14, 30, float('inf')],
    labels=['0-3 days', '4-7 days', '8-14 days', '15-30 days', '30+ days']
)

print("Among subscribers with missing usage data:")
print("Churn rate by login recency:")
print("=" * 55)
recency_analysis = missing_only.groupby('login_recency_group').agg(
    n_subscribers=('churned_within_30_days', 'count'),
    churn_rate=('churned_within_30_days', 'mean')
).round(4)
print(recency_analysis)
```
```
Churn rate by usage data availability:
=======================================================
               n_subscribers  n_churned  churn_rate
Usage present          43850       2761      0.0630
Usage missing           6150       1339      0.2177

Among subscribers with missing usage data:
Churn rate by login recency:
=======================================================
            n_subscribers  churn_rate
0-3 days              980      0.0612
4-7 days              720      0.1250
8-14 days            1340      0.2388
15-30 days           1850      0.3189
30+ days             1260      0.4365
```
The numbers are striking. Subscribers with missing usage data churn at 3.5x the rate of those with present data. But the story deepens: among those with missing usage, login recency further segments the risk. Users who last logged in over 30 days ago and have no usage data this week churn at 43.6% --- more than 5x the overall rate.
The model that dropped these rows was, quite literally, dropping the subscribers most likely to churn.
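The lift figures quoted above follow directly from the printed tables; restated as simple arithmetic:

```python
# Rates taken from the stratified analysis output above
overall_rate = 0.082      # baseline monthly churn
present_rate = 0.0630     # churn when usage data is present
missing_rate = 0.2177     # churn when usage data is missing
silent_30d_rate = 0.4365  # missing usage AND no login for 30+ days

print(f"Missing vs present lift:    {missing_rate / present_rate:.1f}x")
print(f"30+ day silent vs overall:  {silent_30d_rate / overall_rate:.1f}x")
```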
The Three-Layer Strategy
The team implemented a three-layer approach to missing usage data:
Layer 1: Classify the Missingness Mechanism
```python
def classify_usage_missingness(row):
    """
    Classify why usage data is missing for a given subscriber.

    Returns:
        - 'observed': Data is present (not missing)
        - 'pipeline_failure': Likely MCAR (tracking bug)
        - 'inactive': Likely MNAR (user did not use the product)
        - 'ambiguous': Cannot determine mechanism
    """
    if pd.notna(row['total_hours_last_7d']):
        return 'observed'

    # If the user logged in recently but has no usage events,
    # this is likely a pipeline failure (MCAR)
    if (pd.notna(row['days_since_last_login'])
            and row['days_since_last_login'] <= 2):
        return 'pipeline_failure'

    # If the user has not logged in for 7+ days AND has no usage,
    # they almost certainly did not use the product (MNAR)
    if (pd.notna(row['days_since_last_login'])
            and row['days_since_last_login'] > 7):
        return 'inactive'

    return 'ambiguous'

df['usage_missing_reason'] = df.apply(classify_usage_missingness, axis=1)
print(df['usage_missing_reason'].value_counts())
```

```
observed            43850
inactive             4320
pipeline_failure     1050
ambiguous             780
Name: usage_missing_reason, dtype: int64
```
Layer 2: Impute Based on Mechanism
```python
def impute_by_mechanism(df):
    """
    Impute missing usage data based on the classified mechanism.
    """
    df = df.copy()
    usage_cols = ['total_hours_last_7d', 'sessions_last_7d']

    # Snapshot the originally observed values BEFORE any imputation,
    # so the zeros written for inactive users do not drag down the
    # median and percentile used for the other groups.
    observed = {col: df.loc[df[col].notna(), col] for col in usage_cols}

    # MNAR (inactive): impute with 0 --- the user did not use the product
    inactive_mask = df['usage_missing_reason'] == 'inactive'
    for col in usage_cols:
        df.loc[inactive_mask, col] = 0.0

    # MCAR (pipeline failure): impute with the median of observed values
    pipeline_mask = df['usage_missing_reason'] == 'pipeline_failure'
    for col in usage_cols:
        df.loc[pipeline_mask, col] = observed[col].median()

    # Ambiguous: impute with a conservative estimate (25th percentile)
    ambiguous_mask = df['usage_missing_reason'] == 'ambiguous'
    for col in usage_cols:
        df.loc[ambiguous_mask, col] = observed[col].quantile(0.25)

    return df
```
Layer 3: Add Missing Indicators and Derived Features
```python
def add_missingness_features(df):
    """
    Create features from the missingness pattern itself.
    """
    df = df.copy()

    # Binary indicators
    df['usage_7d_missing'] = (
        df['usage_missing_reason'].isin(['inactive', 'pipeline_failure', 'ambiguous'])
    ).astype(int)
    df['usage_inactive'] = (
        df['usage_missing_reason'] == 'inactive'
    ).astype(int)

    # Interaction: inactive AND high tenure (long-time subscriber going silent)
    df['longtime_inactive'] = (
        (df['usage_inactive'] == 1) &
        (df['tenure_months'] > 6)
    ).astype(int)

    # Interaction: inactive AND premium plan (high-value subscriber going silent)
    df['premium_inactive'] = (
        (df['usage_inactive'] == 1) &
        (df['plan_type'].isin(['pro', 'enterprise']))
    ).astype(int)

    # Count of missing features (overall engagement signal).
    # NOTE: run this before imputation, or rely on precomputed
    # <col>_missing indicator columns --- after imputation,
    # df[col].isna() no longer reflects the original missingness.
    usage_features = [
        'total_hours_last_7d', 'total_hours_last_30d',
        'sessions_last_7d', 'avg_session_duration'
    ]
    df['n_missing_usage_features'] = sum(
        df[f'{col}_missing'] if f'{col}_missing' in df.columns
        else df[col].isna().astype(int)
        for col in usage_features
        if col in df.columns
    )
    return df
```
The Results
The team evaluated three configurations:
```python
# Hardcoded summary of the three configurations; each was trained with
# the same gradient boosting classifier and evaluation protocol.
configs = {
    'V1: Drop missing rows': {
        'description': 'Original approach: df.dropna()',
        'auc': 0.819,
        'rows_used': 12840,
        'pct_churners_captured_top5': 0.38,
    },
    'V2: Median imputation': {
        'description': 'Median imputation for all missing values',
        'auc': 0.841,
        'rows_used': 50000,
        'pct_churners_captured_top5': 0.44,
    },
    'V3: Mechanism-aware + indicators': {
        'description': 'Classify mechanism, impute accordingly, add indicators',
        'auc': 0.878,
        'rows_used': 50000,
        'pct_churners_captured_top5': 0.61,
    },
}

print(f"{'Config':45s} {'AUC':>6s} {'Rows':>7s} {'Recall@5%':>10s}")
print("-" * 72)
for name, cfg in configs.items():
    print(f"{name:45s} {cfg['auc']:6.3f} {cfg['rows_used']:7,d} "
          f"{cfg['pct_churners_captured_top5']:10.0%}")
```

```
Config                                           AUC    Rows  Recall@5%
------------------------------------------------------------------------
V1: Drop missing rows                          0.819  12,840        38%
V2: Median imputation                          0.841  50,000        44%
V3: Mechanism-aware + indicators               0.878  50,000        61%
```
The AUC improvement from V1 to V3 (+0.059, i.e., 5.9 points) is substantial. But the operational metric --- Recall@5%, the percentage of actual churners captured in the top 5% of risk scores --- tells the more important story. V3 captures 61% of churners in the retention team's target list, compared to 38% for V1. That is 23 percentage points of additional reach, translating to roughly 5,500 additional churners identified per month.
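Recall@5% is straightforward to compute from model scores. A minimal sketch (recall_at_top_pct is an illustrative helper, not StreamFlow's evaluation code; it assumes a binary y_true and a continuous y_score):

```python
import numpy as np

def recall_at_top_pct(y_true, y_score, pct=0.05):
    """Fraction of all positives that land in the top pct of scores."""
    y_true = np.asarray(y_true)
    k = max(1, int(len(y_true) * pct))
    top_k = np.argsort(np.asarray(y_score))[::-1][:k]  # indices of the k highest scores
    return float(y_true[top_k].sum() / y_true.sum())

# Toy check: when the 5 positives carry the 5 highest scores,
# the top 5% of 100 subscribers captures all of them
y_true = np.zeros(100)
y_true[:5] = 1
y_score = np.linspace(1.0, 0.0, 100)
print(recall_at_top_pct(y_true, y_score))  # 1.0
```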
At StreamFlow's average revenue per user of $18.40/month and a 30% save rate on targeted churners, that translates to:
```python
additional_churners_identified = 5500
save_rate = 0.30
monthly_revenue_per_user = 18.40
avg_retained_months = 8  # Average additional lifetime after a save

incremental_monthly_saves = additional_churners_identified * save_rate
incremental_annual_revenue = (
    incremental_monthly_saves * monthly_revenue_per_user * avg_retained_months
)

print(f"Additional churners identified per month: {additional_churners_identified:,}")
print(f"Additional saves per month (30% rate): {incremental_monthly_saves:,.0f}")
print(f"Incremental annual revenue: ${incremental_annual_revenue:,.0f}")
```

```
Additional churners identified per month: 5,500
Additional saves per month (30% rate): 1,650
Incremental annual revenue: $242,880
```
Nearly $243K in annual incremental revenue from treating missing data as a feature rather than a nuisance.
Feature Importance: The Proof
The feature importance rankings for the V3 model reveal the mechanism:
```
Feature Importance (V3 model, top 20):

 1. days_since_last_login         0.128
 2. total_hours_last_7d           0.094
 3. usage_inactive                0.082  *** Missingness feature
 4. usage_7d_missing              0.071  *** Missingness feature
 5. tenure_months                 0.064
 6. n_missing_usage_features      0.058  *** Missingness feature
 7. support_tickets_last_90d      0.055
 8. sessions_last_7d              0.049
 9. longtime_inactive             0.042  *** Missingness feature
10. avg_session_duration          0.039
11. monthly_charge                0.037
12. premium_inactive              0.031  *** Missingness feature
13. plan_type_encoded             0.028
14. genre_diversity_score         0.024
15. total_hours_last_30d          0.022
16. email_open_rate               0.019
17. devices_used                  0.017
18. avg_session_duration_missing  0.015  *** Missingness feature
19. email_open_rate_missing       0.013  *** Missingness feature
20. sessions_last_30d             0.012
```
Seven of the top 20 features are derived from missingness patterns. The usage_inactive indicator --- a binary flag that says "this subscriber's usage data is missing because they stopped using the product" --- is the third most important feature in the entire model.
The Interaction Effect
The missingness features are powerful on their own, but they become even more powerful in combination with observed features. The longtime_inactive interaction (inactive AND tenure > 6 months) captures a specific churn archetype: the long-tenured subscriber who gradually disengages. This is a different population from the new subscriber who never activated --- and the model needs different features to identify them.
```python
# Empirical churn rate by tenure, split by activity status
# (a stratified rate plot, not a model-based partial dependence plot)
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
for label, group in df.groupby('usage_inactive'):
    tenure_bins = pd.cut(group['tenure_months'], bins=12)
    churn_by_tenure = group.groupby(tenure_bins)['churned_within_30_days'].mean()
    status = 'Inactive (missing usage)' if label == 1 else 'Active (usage present)'
    ax.plot(range(len(churn_by_tenure)), churn_by_tenure.values,
            marker='o', label=status, linewidth=2)
ax.set_xlabel('Tenure (months, binned)')
ax.set_ylabel('Churn Rate')
ax.set_title('Churn Rate by Tenure and Activity Status')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('churn_by_tenure_activity.png', dpi=150, bbox_inches='tight')
plt.show()
```
The plot reveals a clear interaction: among active subscribers, churn rate decreases monotonically with tenure (long-tenured users are loyal). Among inactive subscribers, churn rate is high and relatively flat across tenure levels --- once a subscriber goes silent, their historical loyalty provides little protection.
Lessons Learned
1. "Missing" is not a single category. StreamFlow's missing usage data contained at least three distinct populations (pipeline failure, inactive user, ambiguous), each requiring a different imputation strategy. Treating them identically --- whether by dropping, mean imputation, or even KNN imputation --- blurs a signal that the model needs.
2. Missing indicators are cheap insurance. Adding binary flags for missingness costs nothing computationally and lets the model discover informative patterns that you might not have anticipated. Even if you are unsure whether missingness is informative, add the indicator and let the model decide.
3. Domain knowledge unlocks mechanism classification. No statistical test can tell you that missing usage data means the user stopped using the product. That insight comes from understanding the data-generating process --- how events are logged, what causes gaps, and what user behavior looks like from the system's perspective.
4. The operational metric matters more than AUC. The jump from AUC 0.841 to 0.878 is meaningful. But the jump from 44% to 61% Recall@5% is transformative --- it means the retention team's fixed capacity (targeting the top 5% of subscribers) now reaches far more actual churners.
5. Missingness features interact with other features. The longtime_inactive and premium_inactive interactions capture churn archetypes that neither the missingness indicator nor the tenure/plan feature could capture alone.
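Lesson 2 can be a one-liner in practice. A minimal sketch (with_missing_indicators is a hypothetical helper, not StreamFlow's actual code); scikit-learn users get the same effect from SimpleImputer(add_indicator=True):

```python
import numpy as np
import pandas as pd

def with_missing_indicators(df, cols):
    """Append a <col>_missing flag for each column, before any imputation."""
    df = df.copy()
    for col in cols:
        df[f'{col}_missing'] = df[col].isna().astype(int)
    return df

# Tiny illustration with made-up values
demo = pd.DataFrame({'total_hours_last_7d': [2.5, np.nan, 0.8]})
demo = with_missing_indicators(demo, ['total_hours_last_7d'])
print(demo['total_hours_last_7d_missing'].tolist())  # [0, 1, 0]
```

The flags must be created from the raw data; once a column has been imputed, the original missingness pattern is gone.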
Discussion Questions
1. Imputation order. The three-layer strategy classifies the missingness mechanism before imputing. What would happen if you imputed first (with the median) and then tried to classify the mechanism? Would the classification still be meaningful?
2. Model stability. The usage_inactive feature depends on a threshold (days_since_last_login > 7). How sensitive is the model's performance to this threshold? Design an experiment to test thresholds of 3, 5, 7, 10, and 14 days.
3. Feedback loops. If the retention team successfully re-engages inactive subscribers (generating new usage events), the usage_inactive indicator will flip from 1 to 0 in subsequent months. How does this affect the model's training data over time? Is this a problem?
4. Data pipeline reliability. The team classified 1,050 rows as "pipeline failure" (MCAR). Should the data engineering team invest in fixing these pipeline issues? What is the argument for and against, from a modeling perspective?
5. Generalization. Would this missingness-as-feature approach transfer to StreamFlow's other models (e.g., upsell prediction, content recommendation)? For which models would you expect it to be most useful, and why?
This case study supports Chapter 8: Missing Data Strategies. Return to the chapter or continue to Case Study 2: TurbineTech Sensor Dropout.