Case Study 1: StreamFlow --- When Missing Usage Data Is the Best Feature You Have


Background

StreamFlow's churn prediction model is performing well on paper. AUC of 0.82 on the holdout set. The retention team is targeting the top 5% highest-risk subscribers with personalized offers. Monthly churn has dropped from 8.2% to 7.6%.

Then someone runs the numbers on the model's blind spots.

Of the subscribers who churned last month, 34% were not in the model's top 20% risk tier. They churned without warning --- or rather, without the model detecting the warning. When the team investigates these "surprise churners," a pattern emerges: almost all of them had missing usage data in the final week before cancellation.

The model never saw the signal because the signal was in the gaps.


The Data

StreamFlow tracks subscriber behavior through event logging. Every time a user opens the app, watches content, creates a dashboard, or interacts with any feature, an event is recorded with a timestamp, user ID, event type, and metadata.

The feature engineering pipeline (from Chapter 6) aggregates these events into subscriber-level features:

Feature                   Source                        Missingness Rate   Likely Mechanism
total_hours_last_7d       Event log aggregation         12.3%              Mixed (MAR + MNAR)
total_hours_last_30d      Event log aggregation          8.1%              Mixed (MAR + MNAR)
sessions_last_7d          Event log aggregation         12.3%              Mixed (MAR + MNAR)
avg_session_duration      Derived from sessions         15.2%              Mixed (MAR + MNAR)
genre_diversity_score     Derived from watch history    18.4%              MAR (requires history)
email_open_rate           Email tracking pixel          22.0%              MAR (email client blocks)
nps_score                 Quarterly survey              65.3%              MAR + selection bias
days_since_last_login     Login event                    4.8%              MCAR (tracking bug)
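An audit like this starts with per-column missingness rates, which fall directly out of the feature frame. A minimal sketch on a toy frame (hypothetical values; column names as in the table above):

```python
import numpy as np
import pandas as pd

# Toy frame with two of the audited columns (hypothetical values)
df = pd.DataFrame({
    'total_hours_last_7d': [1.5, np.nan, 3.0, np.nan],
    'email_open_rate': [0.4, 0.2, np.nan, 0.1],
})

# Per-column missingness rate, highest first
missing_rates = df.isna().mean().sort_values(ascending=False)
print(missing_rates)
```

The likely-mechanism column, by contrast, cannot be computed; it requires knowledge of how each source generates (and fails to generate) data.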

The "Mixed (MAR + MNAR)" classification for usage features is the crux of the problem. There are two distinct populations of subscribers with missing usage data:

Population A (MCAR/MAR): The usage tracking pipeline occasionally drops events due to data engineering issues --- a queue overflow, a failed batch job, a timezone parsing error. These failures are approximately random with respect to the subscriber's actual usage level, and they account for roughly 2-3 percentage points of the 12.3% missingness rate.

Population B (MNAR): The subscriber simply did not use the product. No events were logged because there were no events to log. The usage value is not missing --- it is implicitly zero. This population accounts for the remaining 9-10 percentage points, and it churns at 43%, compared to the overall rate of 8.2%.
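Whether missingness is informative about the outcome can be confirmed with a simple contingency-table test. A sketch using scipy, plugging in the subscriber counts from the stratified analysis later in this case study:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: usage present / usage missing; columns: retained / churned
table = np.array([[43850 - 2761, 2761],
                  [6150 - 1339, 1339]])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

With a difference this large, the p-value is effectively zero: missingness and churn are not independent. The test says nothing about the mechanism, though; that still requires the domain reasoning above.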


The Investigation

The data science lead, suspecting the missingness was informative, ran a stratified analysis:

import pandas as pd
import numpy as np

# Load subscriber-month data
df = pd.read_csv('streamflow_subscriber_months.csv')

# Create missingness indicator
df['usage_7d_missing'] = df['total_hours_last_7d'].isna().astype(int)

# Churn rate by missingness status
print("Churn rate by usage data availability:")
print("=" * 55)
churn_by_missing = df.groupby('usage_7d_missing').agg(
    n_subscribers=('churned_within_30_days', 'count'),
    n_churned=('churned_within_30_days', 'sum'),
    churn_rate=('churned_within_30_days', 'mean')
).round(4)
churn_by_missing.index = ['Usage present', 'Usage missing']
print(churn_by_missing)
print()

# Further stratification: among those with missing usage,
# how does days_since_last_login segment them?
missing_only = df[df['usage_7d_missing'] == 1].copy()
missing_only['login_recency_group'] = pd.cut(
    missing_only['days_since_last_login'],
    bins=[0, 3, 7, 14, 30, float('inf')],
    labels=['0-3 days', '4-7 days', '8-14 days', '15-30 days', '30+ days']
)

print("Among subscribers with missing usage data:")
print("Churn rate by login recency:")
print("=" * 55)
recency_analysis = missing_only.groupby('login_recency_group').agg(
    n_subscribers=('churned_within_30_days', 'count'),
    churn_rate=('churned_within_30_days', 'mean')
).round(4)
print(recency_analysis)
Churn rate by usage data availability:
=======================================================
                n_subscribers  n_churned  churn_rate
Usage present          43850       2761      0.0630
Usage missing           6150       1339      0.2177

Among subscribers with missing usage data:
Churn rate by login recency:
=======================================================
                n_subscribers  churn_rate
0-3 days                 980      0.0612
4-7 days                 720      0.1250
8-14 days               1340      0.2388
15-30 days              1850      0.3189
30+ days                1260      0.4365

The numbers are striking. Subscribers with missing usage data churn at 3.5x the rate of those with present data. But the story deepens: among those with missing usage, login recency further segments the risk. Users who last logged in over 30 days ago and have no usage data this week churn at 43.6% --- more than 5x the overall rate.

The model that dropped these rows was, quite literally, dropping the subscribers most likely to churn.
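The bias is easy to demonstrate on a toy frame (hypothetical values): dropping rows with missing usage removes exactly the high-risk subscribers, so the training data understates churn.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'total_hours_last_7d': [2.0, np.nan, 5.0, np.nan],
    'churned_within_30_days': [0, 1, 0, 1],
})

kept = df.dropna()  # what a naive pipeline trains on

print(f"Churn rate, full data:  {df['churned_within_30_days'].mean():.0%}")
print(f"Churn rate, after drop: {kept['churned_within_30_days'].mean():.0%}")
```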


The Three-Layer Strategy

The team implemented a three-layer approach to missing usage data:

Layer 1: Classify the Missingness Mechanism

def classify_usage_missingness(row):
    """
    Classify why usage data is missing for a given subscriber.

    Returns:
    - 'observed': Data is present (not missing)
    - 'pipeline_failure': Likely MCAR (tracking bug)
    - 'inactive': Likely MNAR (user did not use the product)
    - 'ambiguous': Cannot determine mechanism
    """
    if pd.notna(row['total_hours_last_7d']):
        return 'observed'

    # If the user logged in recently but has no usage events,
    # this is likely a pipeline failure (MCAR)
    if (pd.notna(row['days_since_last_login'])
            and row['days_since_last_login'] <= 2):
        return 'pipeline_failure'

    # If the user has not logged in for 7+ days AND has no usage,
    # they almost certainly did not use the product (MNAR)
    if (pd.notna(row['days_since_last_login'])
            and row['days_since_last_login'] > 7):
        return 'inactive'

    return 'ambiguous'

df['usage_missing_reason'] = df.apply(classify_usage_missingness, axis=1)

print(df['usage_missing_reason'].value_counts())
observed            43850
inactive             4320
pipeline_failure     1050
ambiguous             780
Name: usage_missing_reason, dtype: int64
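The row-wise apply above is clear but slow on large frames. The same rules can be vectorized with numpy.select, which picks the first matching condition per row; a sketch reusing the thresholds from classify_usage_missingness:

```python
import numpy as np
import pandas as pd

# Toy frame covering all four outcomes (hypothetical values)
df = pd.DataFrame({
    'total_hours_last_7d': [5.0, np.nan, np.nan, np.nan],
    'days_since_last_login': [1.0, 1.0, 10.0, 5.0],
})

login = df['days_since_last_login']
conditions = [
    df['total_hours_last_7d'].notna(),   # observed
    login.notna() & (login <= 2),        # recent login, no usage events
    login.notna() & (login > 7),         # long silence, no usage events
]
choices = ['observed', 'pipeline_failure', 'inactive']

# First matching condition wins; anything left over is ambiguous
df['usage_missing_reason'] = np.select(conditions, choices, default='ambiguous')
print(df['usage_missing_reason'].tolist())
```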

Layer 2: Impute Based on Mechanism

def impute_by_mechanism(df):
    """
    Impute missing usage data based on the classified mechanism.
    """
    df = df.copy()

    usage_cols = ['total_hours_last_7d', 'sessions_last_7d']

    # Compute imputation statistics from the originally observed values,
    # before any filling, so the zeros imputed below cannot skew them
    medians = {col: df.loc[df[col].notna(), col].median() for col in usage_cols}
    p25s = {col: df.loc[df[col].notna(), col].quantile(0.25) for col in usage_cols}

    # MNAR (inactive): Impute with 0 --- the user did not use the product
    inactive_mask = df['usage_missing_reason'] == 'inactive'
    for col in usage_cols:
        df.loc[inactive_mask, col] = 0.0

    # MCAR (pipeline failure): Impute with the median of observed values
    pipeline_mask = df['usage_missing_reason'] == 'pipeline_failure'
    for col in usage_cols:
        df.loc[pipeline_mask, col] = medians[col]

    # Ambiguous: Impute with a conservative estimate (25th percentile)
    ambiguous_mask = df['usage_missing_reason'] == 'ambiguous'
    for col in usage_cols:
        df.loc[ambiguous_mask, col] = p25s[col]

    return df

Layer 3: Add Missing Indicators and Derived Features

def add_missingness_features(df):
    """
    Create features from the missingness pattern itself.
    """
    df = df.copy()

    # Binary indicators
    df['usage_7d_missing'] = (
        df['usage_missing_reason'].isin(['inactive', 'pipeline_failure', 'ambiguous'])
    ).astype(int)

    df['usage_inactive'] = (
        df['usage_missing_reason'] == 'inactive'
    ).astype(int)

    # Interaction: inactive AND high tenure (long-time subscriber going silent)
    df['longtime_inactive'] = (
        (df['usage_inactive'] == 1) &
        (df['tenure_months'] > 6)
    ).astype(int)

    # Interaction: inactive AND premium plan (high-value subscriber going silent)
    df['premium_inactive'] = (
        (df['usage_inactive'] == 1) &
        (df['plan_type'].isin(['pro', 'enterprise']))
    ).astype(int)

    # Count of missing features (overall engagement signal)
    usage_features = [
        'total_hours_last_7d', 'total_hours_last_30d',
        'sessions_last_7d', 'avg_session_duration'
    ]
    # Count how many usage features were originally missing. Uses the
    # pre-computed <col>_missing indicators when available; the isna()
    # fallback is only valid if this runs before imputation.
    df['n_missing_usage_features'] = sum(
        df[f'{col}_missing'] if f'{col}_missing' in df.columns
        else df[col].isna().astype(int)
        for col in usage_features
        if col in df.columns
    )

    return df
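One subtlety worth making explicit: the indicator features must be computed before imputation, because filling a column erases the isna() signal. A minimal sketch of the pitfall:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hours': [2.0, np.nan, 5.0]})

# Wrong order: impute first, then flag --- the indicator is all zeros
wrong = df.copy()
wrong['hours'] = wrong['hours'].fillna(wrong['hours'].median())
wrong['hours_missing'] = wrong['hours'].isna().astype(int)

# Right order: flag first, then impute
right = df.copy()
right['hours_missing'] = right['hours'].isna().astype(int)
right['hours'] = right['hours'].fillna(right['hours'].median())

print(wrong['hours_missing'].sum(), right['hours_missing'].sum())  # 0 1
```

This is why Layer 3 classifies missingness from usage_missing_reason, which is derived before Layer 2 fills anything in.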

The Results

The team evaluated three configurations:

configs = {
    'V1: Drop missing rows': {
        'description': 'Original approach: df.dropna()',
        'auc': 0.819,
        'rows_used': 12840,
        'pct_churners_captured_top5': 0.38,
    },
    'V2: Median imputation': {
        'description': 'Median imputation for all missing values',
        'auc': 0.841,
        'rows_used': 50000,
        'pct_churners_captured_top5': 0.44,
    },
    'V3: Mechanism-aware + indicators': {
        'description': 'Classify mechanism, impute accordingly, add indicators',
        'auc': 0.878,
        'rows_used': 50000,
        'pct_churners_captured_top5': 0.61,
    },
}

print(f"{'Config':45s} {'AUC':>6s} {'Rows':>7s} {'Recall@5%':>10s}")
print("-" * 72)
for name, cfg in configs.items():
    print(f"{name:45s} {cfg['auc']:6.3f} {cfg['rows_used']:7,d} "
          f"{cfg['pct_churners_captured_top5']:10.0%}")
Config                                           AUC    Rows  Recall@5%
------------------------------------------------------------------------
V1: Drop missing rows                          0.819  12,840        38%
V2: Median imputation                          0.841  50,000        44%
V3: Mechanism-aware + indicators                0.878  50,000        61%

The AUC improvement from V1 to V3 (+5.9 points) is substantial. But the operational metric --- Recall@5%, the percentage of actual churners captured in the top 5% of risk scores --- tells the more important story. V3 captures 61% of churners in the retention team's target list, compared to 38% for V1. That is 23 percentage points of additional reach, translating to roughly 5,500 additional churners identified per month.
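Recall@5% is not a stock scikit-learn metric, but it takes only a few lines to compute. A sketch (the function name is ours):

```python
import numpy as np

def recall_at_top_pct(y_true, scores, pct=0.05):
    """Fraction of actual positives captured in the top-pct by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * pct))
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return y_true[top_idx].sum() / y_true.sum()

# Toy check: both churners sit at the top of the score ranking
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
s = np.array([.1, .2, .1, .3, .2, .1, .4, .3, .9, .8])
print(recall_at_top_pct(y, s, pct=0.2))  # 1.0
```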

At StreamFlow's average revenue per user of $18.40/month and a 30% save rate on targeted churners, that translates to:

additional_churners_identified = 5500
save_rate = 0.30
monthly_revenue_per_user = 18.40
avg_retained_months = 8  # Average additional lifetime after a save

incremental_monthly_saves = additional_churners_identified * save_rate
incremental_cohort_revenue = (
    incremental_monthly_saves * monthly_revenue_per_user * avg_retained_months
)
print(f"Additional churners identified per month: {additional_churners_identified:,}")
print(f"Additional saves per month (30% rate): {incremental_monthly_saves:,.0f}")
print(f"Incremental revenue per monthly cohort of saves: ${incremental_cohort_revenue:,.0f}")
Additional churners identified per month: 5,500
Additional saves per month (30% rate): 1,650
Incremental revenue per monthly cohort of saves: $242,880

Nearly $243K in lifetime revenue from each monthly cohort of saved subscribers --- all from treating missing data as a feature rather than a nuisance.


Feature Importance: The Proof

The feature importance rankings for the V3 model reveal the mechanism:

Feature Importance (V3 model, top 20):
  1. days_since_last_login              0.128
  2. total_hours_last_7d                0.094
  3. usage_inactive                     0.082  *** Missingness feature
  4. usage_7d_missing                   0.071  *** Missingness feature
  5. tenure_months                      0.064
  6. n_missing_usage_features           0.058  *** Missingness feature
  7. support_tickets_last_90d           0.055
  8. sessions_last_7d                   0.049
  9. longtime_inactive                  0.042  *** Missingness feature
 10. avg_session_duration               0.039
 11. monthly_charge                     0.037
 12. premium_inactive                   0.031  *** Missingness feature
 13. plan_type_encoded                  0.028
 14. genre_diversity_score              0.024
 15. total_hours_last_30d               0.022
 16. email_open_rate                    0.019
 17. devices_used                       0.017
 18. avg_session_duration_missing       0.015  *** Missingness feature
 19. email_open_rate_missing            0.013  *** Missingness feature
 20. sessions_last_30d                  0.012

Seven of the top 20 features are derived from missingness patterns. The usage_inactive indicator --- a binary flag that says "this subscriber's usage data is missing because they stopped using the product" --- is the third most important feature in the entire model.
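Rankings like these come straight from a fitted tree ensemble. A self-contained sketch on synthetic data (feature names match the model above, but the effect sizes are invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    'days_since_last_login': rng.exponential(10.0, n),
    'usage_inactive': rng.integers(0, 2, n).astype(float),
    'noise': rng.normal(0.0, 1.0, n),
})
# Synthetic label: inactivity and login recency drive churn
y = (0.5 * X['usage_inactive'] + 0.03 * X['days_since_last_login']
     + rng.normal(0.0, 0.2, n)) > 0.6

model = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, model.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:25s} {imp:.3f}")
```

Impurity-based importances sum to 1.0 across features, which is why a single feature at 0.128 is a meaningful share of the model's total splitting gain.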


The Interaction Effect

The missingness features are powerful on their own, but they become even more powerful in combination with observed features. The longtime_inactive interaction (inactive AND tenure > 6 months) captures a specific churn archetype: the long-tenured subscriber who gradually disengages. This is a different population from the new subscriber who never activated --- and the model needs different features to identify them.

# Partial dependence: churn probability by tenure, split by activity status
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Compute bin edges on the full frame so both curves share the same x-axis;
# per-group binning would put the two lines on incomparable scales
_, bin_edges = pd.cut(df['tenure_months'], bins=12, retbins=True)

for label, group in df.groupby('usage_inactive'):
    tenure_bins = pd.cut(group['tenure_months'], bins=bin_edges)
    churn_by_tenure = (group.groupby(tenure_bins, observed=False)
                       ['churned_within_30_days'].mean())
    status = 'Inactive (missing usage)' if label == 1 else 'Active (usage present)'
    ax.plot(range(len(churn_by_tenure)), churn_by_tenure.values,
            marker='o', label=status, linewidth=2)

ax.set_xlabel('Tenure (months, binned)')
ax.set_ylabel('Churn Rate')
ax.set_title('Churn Rate by Tenure and Activity Status')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('churn_by_tenure_activity.png', dpi=150, bbox_inches='tight')
plt.show()

The plot reveals a clear interaction: among active subscribers, churn rate decreases monotonically with tenure (long-tenured users are loyal). Among inactive subscribers, churn rate is high and relatively flat across tenure levels --- once a subscriber goes silent, their historical loyalty provides little protection.


Lessons Learned

1. "Missing" is not a single category. StreamFlow's missing usage data contained at least three distinct populations (pipeline failure, inactive user, ambiguous), each requiring a different imputation strategy. Treating them identically --- whether by dropping, mean imputation, or even KNN imputation --- blurs a signal that the model needs.

2. Missing indicators are cheap insurance. Adding binary flags for missingness costs nothing computationally and lets the model discover informative patterns that you might not have anticipated. Even if you are unsure whether missingness is informative, add the indicator and let the model decide.

3. Domain knowledge unlocks mechanism classification. No statistical test can tell you that missing usage data means the user stopped using the product. That insight comes from understanding the data-generating process --- how events are logged, what causes gaps, and what user behavior looks like from the system's perspective.

4. The operational metric matters more than AUC. The jump from AUC 0.841 to 0.878 is meaningful. But the jump from 44% to 61% Recall@5% is transformative --- it means the retention team's fixed capacity (targeting the top 5% of subscribers) now reaches far more actual churners.

5. Missingness features interact with other features. The longtime_inactive and premium_inactive interactions capture churn archetypes that neither the missingness indicator nor the tenure/plan feature could capture alone.
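On lesson 2, scikit-learn bakes the indicator pattern in: SimpleImputer(add_indicator=True) appends a binary missingness column for each feature that had missing values, so the flags travel with the imputation step inside a pipeline. A minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan]])

imp = SimpleImputer(strategy='median', add_indicator=True)
Xt = imp.fit_transform(X)

# Two imputed columns plus one indicator per column that had missing values
print(Xt.shape)  # (3, 4)
```

Note that this applies a single strategy to every column; the mechanism-aware imputation in Layer 2 still requires custom code, but the indicators come for free.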


Discussion Questions

  1. Imputation order. The three-layer strategy classifies the missingness mechanism before imputing. What would happen if you imputed first (with median) and then tried to classify the mechanism? Would the classification still be meaningful?

  2. Model stability. The usage_inactive feature depends on a threshold (days_since_last_login > 7). How sensitive is the model's performance to this threshold? Design an experiment to test thresholds of 3, 5, 7, 10, and 14 days.

  3. Feedback loops. If the retention team successfully re-engages inactive subscribers (generating new usage events), the usage_inactive indicator will flip from 1 to 0 in subsequent months. How does this affect the model's training data over time? Is this a problem?

  4. Data pipeline reliability. The team classified 1,050 rows as "pipeline failure" (MCAR). Should the data engineering team invest in fixing these pipeline issues? What is the argument for and against, from a modeling perspective?

  5. Generalization. Would this missingness-as-feature approach transfer to StreamFlow's other models (e.g., upsell prediction, content recommendation)? For which models would you expect it to be most useful, and why?


This case study supports Chapter 8: Missing Data Strategies. Return to the chapter or continue to Case Study 2: TurbineTech Sensor Dropout.