Case Study 1: StreamFlow Drift Detection and Monitoring


Background

StreamFlow's churn prediction model was deployed in January following the work in Chapters 29--31. The model --- a LightGBM classifier trained on 50,000 subscribers with 19 features --- had an AUC of 0.8834 on the holdout set. The customer success team uses it to generate a weekly "high-risk" list: subscribers with predicted churn probability above 0.20 receive a retention intervention (personalized content recommendations, loyalty offer, or a direct outreach call).

For the first six weeks, the model performs well. The high-risk list identifies genuine at-risk subscribers. Interventions reduce churn by an estimated 15% among contacted subscribers. The VP of Customer Success calls the model "the best tool this team has ever had."

Then, in mid-March, StreamFlow launches a major product redesign. The home screen now features a "Continue Watching" row, an algorithmic "For You" feed, and a new short-form content category. User engagement patterns change overnight. Session frequency increases. Average session duration decreases. Content completion rates shift as users sample more titles but finish fewer.

Nobody tells the data science team about the product launch.

By early April, the customer success team notices something strange: the high-risk list is shrinking. In January, it contained about 600 subscribers per week. Now it has 180. But actual churn has not decreased --- if anything, the team suspects it has increased slightly due to the product transition's friction. The model is not catching the risk.

This case study walks through how a monitoring system would have caught the problem, what the drift looks like in the data, and how to respond.


Phase 1: Setting Up the Monitoring Baseline

Before deployment, the team should have saved the reference distributions. Here is what that looks like:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
n_ref = 50000

# Reference data: the training distribution at deployment time
reference = pd.DataFrame({
    "sessions_last_30d": np.random.poisson(14, n_ref),
    "avg_session_minutes": np.random.exponential(28, n_ref).round(1),
    "unique_titles_watched": np.random.poisson(8, n_ref),
    "content_completion_rate": np.random.beta(3, 2, n_ref).round(3),
    "binge_sessions_30d": np.random.poisson(2, n_ref),
    "weekend_ratio": np.random.beta(2.5, 3, n_ref).round(3),
    "hours_change_pct": np.random.normal(0, 30, n_ref).round(1),
    "sessions_change_pct": np.random.normal(0, 25, n_ref).round(1),
    "months_active": np.random.randint(1, 60, n_ref),
    "plan_price": np.random.choice(
        [9.99, 14.99, 24.99, 29.99], n_ref, p=[0.35, 0.35, 0.20, 0.10]
    ),
    "devices_used": np.random.randint(1, 6, n_ref),
    "profiles_active": np.random.randint(1, 5, n_ref),
    "payment_failures_6m": np.random.poisson(0.3, n_ref),
    "support_tickets_90d": np.random.poisson(1.2, n_ref),
    "genre_diversity": np.random.uniform(0.1, 1.0, n_ref).round(3),
})

# Save reference statistics
ref_stats = reference.describe()
print("Reference feature summary statistics:")
print(ref_stats.to_string())

The reference prediction distribution is also saved:

# Reference predictions (from validation set at deployment time)
ref_predictions = np.random.beta(2, 15, n_ref)
print(f"\nReference prediction distribution:")
print(f"  Mean:   {ref_predictions.mean():.4f}")
print(f"  Median: {np.median(ref_predictions):.4f}")
print(f"  Std:    {ref_predictions.std():.4f}")
print(f"  P(churn > 0.20): {(ref_predictions > 0.20).mean():.4f}")
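Persisting the baseline is what makes every later check possible. Here is a minimal sketch of what "saving the reference distributions" could look like on disk, assuming a local artifacts/ directory and JSON files (the file names and the small sample size are illustrative). Storing per-feature decile edges is enough to compute PSI later without retaining the raw training rows:

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_ref = 1000  # small stand-in for the 50,000-row reference set

reference = pd.DataFrame({
    "sessions_last_30d": rng.poisson(14, n_ref),
    "avg_session_minutes": rng.exponential(28, n_ref).round(1),
})
ref_predictions = rng.beta(2, 15, n_ref)

artifact_dir = Path("artifacts")
artifact_dir.mkdir(exist_ok=True)

# Per-feature summary statistics, for dashboards and eyeballing
reference.describe().to_json(artifact_dir / "ref_stats.json")

# Decile edges per feature: everything a PSI check needs later,
# without keeping the raw training rows around
baseline = {
    col: np.quantile(reference[col], np.linspace(0, 1, 11)).tolist()
    for col in reference.columns
}
baseline["__prediction_deciles__"] = np.quantile(
    ref_predictions, np.linspace(0, 1, 11)
).tolist()

with open(artifact_dir / "psi_baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)

with open(artifact_dir / "psi_baseline.json") as f:
    loaded = json.load(f)
print(f"Stored decile baselines for {len(loaded)} keys")
```

Versioning these artifacts alongside the model binary means any future alert can be audited against the exact baseline that triggered it.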

Phase 2: The First Six Weeks (Stable)

For the first six weeks, production data looks like training data. Here is the Week 4 check:

n_prod = 5000

# Week 4: stable production data
prod_week4 = pd.DataFrame({
    "sessions_last_30d": np.random.poisson(14, n_prod),
    "avg_session_minutes": np.random.exponential(28, n_prod).round(1),
    "unique_titles_watched": np.random.poisson(8, n_prod),
    "content_completion_rate": np.random.beta(3, 2, n_prod).round(3),
    "binge_sessions_30d": np.random.poisson(2, n_prod),
    "weekend_ratio": np.random.beta(2.5, 3, n_prod).round(3),
    "hours_change_pct": np.random.normal(0, 30, n_prod).round(1),
    "sessions_change_pct": np.random.normal(0, 25, n_prod).round(1),
    "months_active": np.random.randint(1, 60, n_prod),
    "plan_price": np.random.choice(
        [9.99, 14.99, 24.99, 29.99], n_prod, p=[0.35, 0.35, 0.20, 0.10]
    ),
    "devices_used": np.random.randint(1, 6, n_prod),
    "profiles_active": np.random.randint(1, 5, n_prod),
    "payment_failures_6m": np.random.poisson(0.3, n_prod),
    "support_tickets_90d": np.random.poisson(1.2, n_prod),
    "genre_diversity": np.random.uniform(0.1, 1.0, n_prod).round(3),
})

# Compute PSI for each feature
def compute_psi(reference, production, n_bins=10, eps=1e-4):
    """Population Stability Index between a reference and a production sample.

    Bins are reference deciles; eps smooths empty bins so the log is defined.
    """
    bin_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    # Discrete features (e.g., Poisson counts) can produce duplicate quantile
    # edges; deduplicate so every bin has positive width.
    bin_edges = np.unique(bin_edges)
    bin_edges[0] = -np.inf
    bin_edges[-1] = np.inf
    ref_counts = np.histogram(reference, bins=bin_edges)[0]
    prod_counts = np.histogram(production, bins=bin_edges)[0]
    ref_proportions = ref_counts / len(reference) + eps
    prod_proportions = prod_counts / len(production) + eps
    return np.sum(
        (prod_proportions - ref_proportions)
        * np.log(prod_proportions / ref_proportions)
    )

print("Week 4 PSI Report (stable):")
print("-" * 45)
for col in reference.columns:
    psi = compute_psi(reference[col].values, prod_week4[col].values)
    status = "stable" if psi < 0.10 else "investigate" if psi < 0.25 else "RETRAIN"
    print(f"  {col:30s}  PSI={psi:.4f}  [{status}]")

All PSI values are well below 0.10. The monitoring dashboard shows green across the board. The team is confident.
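PSI depends on a binning choice; a two-sample Kolmogorov--Smirnov test (available through the scipy.stats import above) is a useful binning-free cross-check. Here is a sketch on synthetic stand-ins for one stable and one shifted feature. Note that at these sample sizes almost any shift is statistically "significant," so the KS statistic (the effect size) matters more than the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ref_col = rng.poisson(14, 50_000)  # stands in for reference sessions_last_30d
samples = {
    "stable": rng.poisson(14, 5_000),   # Week 4-style production sample
    "shifted": rng.poisson(22, 5_000),  # Week 8-style production sample
}

results = {}
for label, prod_col in samples.items():
    results[label] = stats.ks_2samp(ref_col, prod_col)
    print(f"{label:8s} KS statistic={results[label].statistic:.4f}  "
          f"p-value={results[label].pvalue:.2e}")
```

Running PSI and KS side by side catches cases where one metric's assumptions hide a shift the other sees.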


Phase 3: The Product Redesign (Week 7)

StreamFlow launches its redesign in the middle of Week 7. By Week 8, the production data looks different:

# Week 8: post-redesign production data
prod_week8 = pd.DataFrame({
    "sessions_last_30d": np.random.poisson(22, n_prod),        # UP: new UI encourages more sessions
    "avg_session_minutes": np.random.exponential(18, n_prod).round(1),  # DOWN: shorter sessions
    "unique_titles_watched": np.random.poisson(14, n_prod),     # UP: algorithmic feed increases sampling
    "content_completion_rate": np.random.beta(2, 3, n_prod).round(3),  # DOWN: more sampling, less finishing
    "binge_sessions_30d": np.random.poisson(1, n_prod),         # DOWN: short-form content reduces binges
    "weekend_ratio": np.random.beta(2.5, 3, n_prod).round(3),  # Stable
    "hours_change_pct": np.random.normal(15, 35, n_prod).round(1),  # UP: hours increasing overall
    "sessions_change_pct": np.random.normal(20, 30, n_prod).round(1),  # UP: sessions increasing
    "months_active": np.random.randint(1, 60, n_prod),          # Stable
    "plan_price": np.random.choice(
        [9.99, 14.99, 24.99, 29.99], n_prod, p=[0.35, 0.35, 0.20, 0.10]
    ),  # Stable
    "devices_used": np.random.randint(1, 6, n_prod),            # Stable
    "profiles_active": np.random.randint(1, 5, n_prod),         # Stable
    "payment_failures_6m": np.random.poisson(0.3, n_prod),      # Stable
    "support_tickets_90d": np.random.poisson(1.8, n_prod),      # UP: redesign confusion
    "genre_diversity": np.random.uniform(0.2, 1.0, n_prod).round(3),  # Slight shift
})

print("Week 8 PSI Report (post-redesign):")
print("-" * 55)
drifted_features = []
for col in reference.columns:
    psi = compute_psi(reference[col].values, prod_week8[col].values)
    status = "stable" if psi < 0.10 else "investigate" if psi < 0.25 else "RETRAIN"
    print(f"  {col:30s}  PSI={psi:.4f}  [{status}]")
    if psi >= 0.25:
        drifted_features.append(col)

print(f"\nFeatures flagged for retraining: {len(drifted_features)}")
for f in drifted_features:
    print(f"  - {f}")

The monitoring system fires multiple critical alerts:

  • sessions_last_30d: PSI well above 0.25 (mean shifted from 14 to 22)
  • avg_session_minutes: PSI above 0.25 (mean shifted from 28 to 18)
  • unique_titles_watched: PSI above 0.25 (mean shifted from 8 to 14)
  • content_completion_rate: PSI above 0.25 (distribution shape changed)
  • sessions_change_pct: PSI above 0.25 (mean shifted from 0 to 20)

Five features are in the "retrain" zone. Two more are in the "investigate" zone. This is not subtle drift --- this is a structural shift in user behavior.


Phase 4: Understanding the Impact

The drift explains why the high-risk list is shrinking. The model learned that high sessions and high title diversity were signals of engagement (low churn risk). After the redesign, these features are elevated for nearly everyone, making the model believe the entire subscriber base is highly engaged. It pushes predicted churn probabilities down across the board.

# Simulated prediction distributions
pred_week4 = np.random.beta(2, 15, n_prod)       # Normal: centered low
pred_week8 = np.random.beta(1.5, 20, n_prod)     # After drift: pushed even lower

pred_psi = compute_psi(pred_week4, pred_week8)
print(f"Prediction distribution PSI: {pred_psi:.4f}")
print(f"\nWeek 4 predictions:")
print(f"  Mean: {pred_week4.mean():.4f}")
print(f"  P(churn > 0.20): {(pred_week4 > 0.20).mean():.4f}")
print(f"  Flagged for intervention (of {n_prod}): {(pred_week4 > 0.20).sum()}")

print(f"\nWeek 8 predictions (post-drift):")
print(f"  Mean: {pred_week8.mean():.4f}")
print(f"  P(churn > 0.20): {(pred_week8 > 0.20).mean():.4f}")
print(f"  Flagged for intervention (of {n_prod}): {(pred_week8 > 0.20).sum()}")

The number of subscribers flagged for intervention drops dramatically. The model is not "wrong" in the traditional sense --- it is applying the patterns it learned to data that no longer represents the same behavioral context. It is correctly computing that these users look engaged by the standards of the old product. But the old product no longer exists.


Phase 5: The Response

The data science team has several options. Here is what they actually do:

Step 1: Root Cause Identification

The monitoring alert fires on Tuesday morning. The team sees five features with PSI > 0.25 and investigates. They check:

  • Data pipeline issues: No. The data pipeline is healthy. The values are real.
  • Product changes: Yes. The PM confirms the redesign launched 10 days ago.
  • Temporary vs. permanent: Permanent. The redesign is not being rolled back.

Step 2: Short-Term Mitigation

While the retrained model is being built, the team adjusts the threshold:

# Temporary threshold adjustment
# Old threshold: 0.20 (calibrated for pre-redesign distribution)
# Observation: the model's predicted probabilities are compressed downward.
# Use a percentile-based threshold instead of an absolute one.

# Instead of "probability > 0.20", use "top 12% of predicted probabilities"
# (12% was the approximate churn rate at deployment time)
percentile_threshold = np.percentile(pred_week8, 88)
print(f"Percentile-based threshold: {percentile_threshold:.4f}")
print(f"Subscribers flagged: {(pred_week8 > percentile_threshold).sum()}")

Practical Note --- Percentile-based thresholds are more robust to distributional shifts than absolute thresholds. The model's rank-ordering may still be reasonable even when its probabilities are miscalibrated. Using "flag the top N%" instead of "flag anyone above probability P" is a useful interim measure while you retrain.

Step 3: Retraining

The team retrains on data that includes two weeks of post-redesign behavior. Because churn labels have a 90-day delay, they use a shorter proxy: "did the subscriber show disengagement signals in the two weeks following the redesign?" This proxy is imperfect but better than nothing.
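One way to build such a short-horizon proxy label is to compare each subscriber's activity in the two weeks before and after the launch. The disengagement rule below (a sharp activity drop, or going silent while filing a support ticket) is a hypothetical stand-in for whatever definition the team actually agreed on:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000
post = pd.DataFrame({
    "sessions_pre_2w": rng.poisson(7, n),    # sessions in 2 weeks before launch
    "sessions_post_2w": rng.poisson(7, n),   # sessions in 2 weeks after launch
    "support_tickets_post_2w": rng.poisson(0.3, n),
})

# Proxy: activity dropped by more than half, or went to zero with a ticket filed
drop_ratio = post["sessions_post_2w"] / post["sessions_pre_2w"].clip(lower=1)
proxy_disengaged = (
    (drop_ratio < 0.5)
    | ((post["sessions_post_2w"] == 0) & (post["support_tickets_post_2w"] > 0))
).astype(int)

print(f"Proxy positive rate: {proxy_disengaged.mean():.3f}")
```

Whatever rule is chosen, it should be validated against real churn labels once the 90-day window closes, so the proxy's error rate is known rather than assumed.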

# Retraining pipeline (simplified)
# GradientBoostingClassifier stands in here for the production LightGBM model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Combine pre-redesign and post-redesign data
# In practice, this comes from the data warehouse
np.random.seed(42)
n_retrain = 60000

# Simulate retrained dataset (includes new behavioral patterns)
retrain_data = pd.DataFrame({
    "sessions_last_30d": np.concatenate([
        np.random.poisson(14, n_retrain // 2),
        np.random.poisson(22, n_retrain // 2),
    ]),
    "avg_session_minutes": np.concatenate([
        np.random.exponential(28, n_retrain // 2).round(1),
        np.random.exponential(18, n_retrain // 2).round(1),
    ]),
    "unique_titles_watched": np.concatenate([
        np.random.poisson(8, n_retrain // 2),
        np.random.poisson(14, n_retrain // 2),
    ]),
    "content_completion_rate": np.concatenate([
        np.random.beta(3, 2, n_retrain // 2).round(3),
        np.random.beta(2, 3, n_retrain // 2).round(3),
    ]),
    "support_tickets_90d": np.concatenate([
        np.random.poisson(1.2, n_retrain // 2),
        np.random.poisson(1.8, n_retrain // 2),
    ]),
})

# Simulate churn labels (simplified)
churn_score = (
    -0.02 * retrain_data["sessions_last_30d"]
    - 0.01 * retrain_data["avg_session_minutes"]  # longer sessions => lower risk
    - 0.03 * retrain_data["content_completion_rate"]
    + 0.08 * retrain_data["support_tickets_90d"]
    + np.random.normal(0, 0.5, n_retrain)
)
retrain_labels = (churn_score > np.percentile(churn_score, 85)).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    retrain_data, retrain_labels, test_size=0.2,
    stratify=retrain_labels, random_state=42,
)

model_v2 = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=5, random_state=42
)
model_v2.fit(X_train, y_train)
val_pred = model_v2.predict_proba(X_val)[:, 1]

print(f"Retrained model AUC: {roc_auc_score(y_val, val_pred):.4f}")

Step 4: Shadow Deployment

The retrained model is deployed in shadow mode for one week. The champion (v1) continues to serve predictions. The challenger (v2) predictions are logged for comparison.

# Shadow comparison after one week
# (Simulated: the challenger's predictions on post-redesign data)
v1_predictions = np.random.beta(1.5, 20, n_prod)   # v1: compressed, miscalibrated
v2_predictions = np.random.beta(2, 12, n_prod)      # v2: better calibrated

print("Shadow deployment comparison:")
print(f"  v1 (champion) mean prediction: {v1_predictions.mean():.4f}")
print(f"  v2 (challenger) mean prediction: {v2_predictions.mean():.4f}")
print(f"  v1 subscribers flagged (>0.20): {(v1_predictions > 0.20).sum()}")
print(f"  v2 subscribers flagged (>0.20): {(v2_predictions > 0.20).sum()}")

The challenger flags significantly more subscribers, which aligns with the customer success team's intuition that churn risk has not actually decreased.
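Beyond comparing flag counts, a shadow week should also check whether champion and challenger rank subscribers similarly, since the intervention list is rank-based. Here is a sketch on synthetic scores, assuming the shadow logs pair both models' predictions for the same subscribers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000
latent_risk = rng.normal(size=n)                          # true (unobserved) risk
v1_scores = latent_risk + rng.normal(scale=0.8, size=n)   # champion: noisier
v2_scores = latent_risk + rng.normal(scale=0.4, size=n)   # challenger: sharper

rho, _ = stats.spearmanr(v1_scores, v2_scores)
print(f"Champion/challenger rank correlation: {rho:.3f}")

# Overlap of the two top-12% lists: how much the intervention population changes
k = int(0.12 * n)
top_v1 = set(np.argsort(v1_scores)[-k:])
top_v2 = set(np.argsort(v2_scores)[-k:])
overlap = len(top_v1 & top_v2) / k
print(f"Top-12% overlap: {overlap:.3f}")
```

A high rank correlation with a modest overlap shift is reassuring; a low correlation means the challenger is a different model in kind, and the promotion decision deserves more scrutiny.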

Step 5: Promotion

After the shadow period, the team promotes v2 to champion. The reference distributions are updated to reflect the post-redesign world. The monitoring system resets its baseline.
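Resetting the baseline is mechanically the same as creating it: recompute the reference deciles from a post-redesign window and write a new, versioned artifact so old alerts can still be audited. A sketch with a hypothetical file naming scheme:

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Post-redesign window that becomes the new reference
new_reference = pd.DataFrame({
    "sessions_last_30d": rng.poisson(22, 5_000),
    "avg_session_minutes": rng.exponential(18, 5_000).round(1),
})

baseline_v2 = {
    col: np.quantile(new_reference[col], np.linspace(0, 1, 11)).tolist()
    for col in new_reference.columns
}

artifact_dir = Path("artifacts")
artifact_dir.mkdir(exist_ok=True)
# Versioned file name; the v1 baseline is kept for audits, not overwritten
out_path = artifact_dir / "psi_baseline_v2.json"
with open(out_path, "w") as f:
    json.dump(baseline_v2, f, indent=2)
print(f"New baseline written to {out_path}")
```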


Lessons Learned

  1. Automated monitoring catches drift within one week of a product launch; humans take three. In this case, with no monitoring in place, the problem went unnoticed for three weeks --- and only because a human happened to look at the size of the high-risk list. A weekly PSI check would have fired an alert on the first run after the redesign.

  2. Product changes are the most common cause of sudden data drift in SaaS models. Feature launches, UI redesigns, pricing changes, and onboarding flow modifications all change user behavior patterns. The data science team should be on the product release notification list.

  3. Percentile-based thresholds are more robust than absolute thresholds. When prediction distributions shift, rank-ordering may be preserved even when calibration is lost. Using "top N%" instead of "above probability P" is a simple and effective interim measure.

  4. Retraining on mixed data (old + new) is better than retraining on new data alone. The post-redesign data is only two weeks old --- not enough to capture the full range of churn patterns. Including pre-redesign data provides more training examples, while the new data teaches the model what the shifted features mean.

  5. Shadow deployment is non-negotiable for production model updates. The retrained model might be worse than the current model in unexpected ways. A one-week shadow period costs almost nothing and prevents a potentially costly bad deployment.
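The mixed-data retraining in lesson 4 can be pushed further with sample weights that upweight post-redesign rows, so the new regime is not treated as a minority quirk. A minimal sketch; the 3x weight and the single synthetic feature are illustrative choices, not recommendations from the case study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n_old, n_new = 4_000, 1_000
X = np.vstack([
    rng.normal(14, 4, size=(n_old, 1)),  # pre-redesign sessions feature
    rng.normal(22, 4, size=(n_new, 1)),  # post-redesign sessions feature
])
y = (rng.random(n_old + n_new) < 0.15).astype(int)  # synthetic churn labels

# Upweight the recent regime relative to the historical bulk
weights = np.concatenate([np.ones(n_old), np.full(n_new, 3.0)])

model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, y, sample_weight=weights)
print(f"Fitted on {len(X)} rows, recent rows weighted {weights[-1]:.0f}x")
```

The right weight is an empirical question; sweeping it and comparing validation metrics on a post-redesign holdout is the honest way to choose.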


Key Metrics Summary

  Metric                           Week 4 (Stable)   Week 8 (Post-Redesign)   After Retraining
  -------------------------------  ----------------  -----------------------  ---------------------
  Max feature PSI                  < 0.05            > 0.50                   < 0.05 (new baseline)
  Features in "retrain" zone       0                 5                        0
  Mean predicted probability       0.117             0.068                    0.142
  Subscribers flagged (of 5,000)   ~600              ~180                     ~680
  Monitoring status                Green             Red                      Green

This case study supports Chapter 32: Monitoring Models in Production. Return to the chapter for the full framework.