Case Study 1: The StreamFlow Workflow in Practice

Walking the ML Lifecycle — Including the Wrong Turns


Background

StreamFlow is a subscription streaming analytics platform with $180 million in annual recurring revenue, 2.4 million subscribers, and an 8.2% monthly churn rate. That churn rate means approximately 196,800 subscribers leave every month. At an average monthly revenue of $12.50 per subscriber, that is $2.46 million in lost revenue per month — $29.5 million per year.

The VP of Customer Success, Dana Park, wants a model to identify at-risk subscribers so her retention team can intervene. Her team can handle 15,000 outbound contacts per month. She needs to know: which 15,000 subscribers are most likely to leave?

This case study walks through how the ML workflow actually played out — including the dead ends, the mistakes, and the lessons learned.


Stage 1: Problem Framing (Week 1)

The data science team — lead data scientist Priya, junior data scientist Marcus, and ML engineer Leo — met with Dana and the product analytics team for a problem framing session.

The first 30 minutes were unproductive. Dana wanted "a model that predicts churn." Priya pressed for specifics.

The critical questions that surfaced:

What counts as churn? Dana initially said "anyone who leaves." Priya asked about subscribers whose payments fail and are never recovered (involuntary churn). Dana had not considered this. After discussion, they agreed: the model should predict voluntary cancellation only. Involuntary churn (payment failure) would be handled by the billing team with a separate process.

What is the prediction window? Dana wanted to know "as early as possible." Priya explained the tradeoff: a 90-day prediction window gives more lead time but is harder to predict accurately. A 7-day window is more accurate but gives almost no time to intervene. They settled on 30 days — enough time for an email campaign, a discount offer, or a personal call.

What about free trial users? StreamFlow offers a 14-day free trial. Free trial conversions are managed by the growth team with separate tooling. The churn model would only cover paying subscribers.

What action changes based on the prediction? This question nearly derailed the project. It turned out Dana's retention team had no standard intervention playbook. They had been randomly selecting subscribers from a list and calling them. Priya insisted that the model would be useless without a consistent intervention: "We need to know what you will do differently for high-risk vs. low-risk subscribers." Dana committed to implementing a tiered retention strategy: high-risk subscribers receive a 20% discount offer via email, very-high-risk subscribers receive a personal call.

The problem framing document was completed by the end of week 1. It took three meetings and approximately 8 hours of discussion. This felt slow to the team. In retrospect, it was the most productive week of the project.


Stage 2: Success Metrics (Weeks 1-2)

The team defined two metric layers:

Offline (model performance):
  - Primary: AUC-ROC (ranking quality)
  - Secondary: Precision@15000 (operational relevance — of the top 15,000 flagged subscribers, how many actually churn?)
  - Reporting: Calibration plot, F1, confusion matrix at chosen threshold

Online (business impact):
  - Primary: Monthly churn rate reduction in treatment group vs. control (A/B test)
  - Secondary: Net revenue impact (revenue retained minus discount cost)
  - Guardrail: Customer satisfaction score (do not annoy loyal subscribers with unnecessary offers)

The CFO initially pushed for accuracy as the metric. Priya prepared a quick analysis showing that a model predicting "no churn" for everyone achieves 91.8% accuracy. The CFO stopped asking about accuracy.
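Priya's demonstration, and the Precision@k metric the team adopted instead, can be sketched in a few lines. The labels and scores below are simulated at an 8.2% churn rate; nothing here is StreamFlow data.

```python
import numpy as np

# Simulated churn labels at an 8.2% base rate (illustrative, not real data).
rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.082).astype(int)

# The accuracy trap: predict "no churn" for everyone.
y_pred = np.zeros(n, dtype=int)
accuracy = (y_true == y_pred).mean()  # ~0.918, despite flagging nobody

def precision_at_k(y_true, scores, k):
    """Of the k highest-scored subscribers, what fraction actually churned?"""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].mean()

# A random ranking lands near the 8.2% base rate at k = 15,000.
p_at_k = precision_at_k(y_true, rng.random(n), k=15_000)
```

Precision@15000 is the metric that maps directly onto Dana's constraint: her team can contact exactly 15,000 subscribers per month, so only the quality of the top 15,000 matters.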


Stage 3: Data Collection — The First Dead End (Weeks 2-4)

Marcus was responsible for pulling the data. He discovered several problems:

Problem 1: Missing historical data. The data warehouse migration 18 months ago lost granular watch event data from before the migration. The team had only 18 months of behavioral data instead of the 3+ years they had hoped for.

Problem 2: Schema changes. The engineering team had changed how "sessions" were defined 8 months ago. Before the change, a session was a continuous watch period. After the change, a session included all activity within a 30-minute window, even if the user took breaks. This meant sessions_last_30d had a distributional shift mid-dataset.

Problem 3: Missing support data. Support ticket data was stored in Zendesk, not the data warehouse. Extracting it required an API integration that took Marcus a full week.

Problem 4: Target variable construction. Defining who churned when was harder than expected. Some subscribers canceled but their subscription remained active through the end of their billing period. Others canceled and were immediately terminated. Some "canceled" subscribers reactivated within days — were they really churners? Marcus had to trace subscription state transitions, not just look at current status. He ended up writing a 200-line SQL query with six CTEs to correctly identify the churn date for each subscriber-month.
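The state-transition logic can be illustrated in miniature — here in pandas rather than Marcus's SQL. The table layout, column names, state labels, and the 7-day reactivation grace window are all assumptions for the sketch, not his actual schema.

```python
import pandas as pd

# Hypothetical subscription state-transition log (invented schema and dates).
events = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2024-01-01", "2024-03-10", "2024-03-31",  # cancels, active through period end
        "2024-01-05", "2024-02-01",                # terminated with no cancel request
        "2024-01-02", "2024-02-15", "2024-02-20",  # cancels, then reactivates within days
    ]),
    "new_state": ["active", "cancel_requested", "terminated",
                  "active", "terminated",
                  "active", "cancel_requested", "active"],
})

REACTIVATION_GRACE_DAYS = 7  # assumed window; quick reactivations don't count as churn

def churn_date(group):
    """Voluntary churn = a cancel request not followed by reactivation in the grace window."""
    group = group.sort_values("event_date")
    cancels = group[group["new_state"] == "cancel_requested"]
    for _, cancel in cancels.iterrows():
        later = group[group["event_date"] > cancel["event_date"]]
        reactivated = (
            (later["new_state"] == "active")
            & (later["event_date"] <= cancel["event_date"]
               + pd.Timedelta(days=REACTIVATION_GRACE_DAYS))
        ).any()
        if not reactivated:
            return cancel["event_date"]
    return None  # no voluntary churn (includes involuntary terminations)

churn_dates = events.groupby("subscriber_id")[["event_date", "new_state"]].apply(churn_date)
```

Note that subscriber 2 (terminated without a cancel request) comes back empty: involuntary churn is excluded, matching the stage 1 framing decision.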

Problem 5: Data quality. The first data pull contained 2.4 million rows for the most recent month but only 1.8 million for months 6-12. Investigation revealed that the data warehouse's retention policy had archived some event-level data, making aggregated features (like hours watched) incomplete for older months. The team had to choose: use only the last 6 months of clean data (smaller dataset but correct) or use 12 months with known quality issues in the older data. They chose 12 months but added a data_quality_flag to each row.

The data collection phase was budgeted for one week. It took three. This is not unusual — it is the norm.

Lesson: Data collection always takes longer than you expect. The data you need rarely exists in the form you need it, in the systems you expect, with the quality you assume. Budget at least twice the time you think you need.


Stage 4: The Baseline That Embarrassed Everyone (Week 5)

Before building features, Priya insisted on a baseline. Marcus protested — "We already know the baseline will be bad. Let's just build the real model." Priya overruled him.

Baseline 1: Majority class. Predict "no churn" for everyone. Accuracy: 91.8%. AUC-ROC: 0.500. Precision@15000: 8.2% (random chance). Useless, but it established the floor.

Baseline 2: Single-rule heuristic. The retention team had been using a simple rule: flag subscribers who had not watched anything in the last 14 days. Priya computed the metrics for this rule. AUC-ROC: 0.68. Precision@15000: 18.4%.

This second baseline was the important one. It represented what the business was already doing without ML. The ML model needed to beat 0.68 AUC-ROC and 18.4% precision to justify its existence. If the team had skipped baselines and built a model with AUC-ROC of 0.72, they might have been satisfied. Against the heuristic baseline, 0.72 would be only a marginal improvement — and possibly not worth the engineering investment.
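Evaluating a binary rule like the 14-day flag is a useful exercise in itself: a yes/no rule has only one operating point, so its AUC-ROC is simply (sensitivity + specificity) / 2. The sketch below reproduces the computation on simulated inactivity distributions — the distributions are invented, so the resulting numbers differ from StreamFlow's.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
churned = rng.random(n) < 0.082

# Invented assumption: churners tend to have longer gaps since their last watch.
days_since_watch = np.where(
    churned,
    rng.exponential(20, n),  # churners: longer average gaps
    rng.exponential(6, n),   # retained subscribers: shorter gaps
)
flagged = days_since_watch >= 14  # the retention team's existing rule

# AUC-ROC of a binary rule = (sensitivity + specificity) / 2.
tpr = flagged[churned].mean()
tnr = (~flagged[~churned]).mean()
auc = (tpr + tnr) / 2

# Precision of the rule: of flagged subscribers, how many actually churn?
precision = churned[flagged].mean()
```

This is also why the heuristic's Precision@15000 beat random chance by more than 2x: even a crude behavioral signal concentrates churners well above the 8.2% base rate.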


Stage 5: Feature Engineering and Iteration (Weeks 5-9)

The team tracked every iteration in a shared spreadsheet (they would later migrate to MLflow, but the spreadsheet was enough to start). Each row recorded: date, feature set description, model type, key hyperparameters, cross-validated AUC-ROC, Precision@15000, and a short note on what was tried and why.

Iteration 1 (Week 5): Basic features only — tenure, plan type, hours watched in the last 30 days. Logistic regression. AUC-ROC: 0.71. Precision@15000: 19.2%. Barely above the heuristic baseline. Disappointing.

Iteration 2 (Week 6): Added temporal features — trend in hours watched over 3 months (using LAG comparisons), days since last watch, session frequency trend. Same logistic regression. AUC-ROC: 0.77. Precision@15000: 24.1%. Significant jump. The trend features mattered more than the absolute values.
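The trend features from iteration 2 can be sketched in pandas, mirroring the SQL LAG pattern. The snapshot table and column names below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical monthly engagement snapshots (invented values).
monthly = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2, 2],
    "month": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"] * 2),
    "hours_watched": [40.0, 30.0, 10.0,   # subscriber 1: declining engagement
                      12.0, 14.0, 15.0],  # subscriber 2: stable/growing
}).sort_values(["subscriber_id", "month"])

g = monthly.groupby("subscriber_id")["hours_watched"]
monthly["hours_lag_1m"] = g.shift(1)                                # LAG(hours_watched, 1)
monthly["hours_trend_3m"] = monthly["hours_watched"] - g.shift(2)   # change vs. 2 months ago
monthly["hours_pct_change"] = (
    (monthly["hours_watched"] - monthly["hours_lag_1m"]) / monthly["hours_lag_1m"]
)
```

The key point from the iteration: subscriber 1 and subscriber 2 can have similar absolute hours in a given month, but the trend columns separate "declining" from "stable" — which is exactly the signal that moved AUC from 0.71 to 0.77.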

Iteration 3 (Week 7): Added engagement features — genre diversity, completion rate, device count, watchlist activity. And support ticket features — count, recency, whether any tickets were billing-related. AUC-ROC: 0.81. Precision@15000: 28.6%.

Iteration 4 (Week 7): Switched from logistic regression to LightGBM with default parameters. AUC-ROC: 0.84. Precision@15000: 31.3%. The algorithm switch helped, but less than the feature engineering in iterations 2 and 3.

Iteration 5 (Week 8): Added billing features — failed payments, plan changes, promo usage. Basic hyperparameter tuning on LightGBM. AUC-ROC: 0.86. Precision@15000: 33.8%.

The dead end (Week 9): Marcus spent an entire week engineering "advanced" features — interaction terms between plan type and usage, polynomial features of tenure, complex rolling-window aggregations over 7/14/30/60/90-day windows for every behavioral metric. The feature count grew from 28 to 94. AUC-ROC moved from 0.86 to 0.865. The effort was not worth it. Worse, the 94-feature model was harder to explain to Dana and slower to compute. Priya called it: they would proceed with the iteration 5 model (28 features). She documented the dead end in the iteration log with the note: "Diminishing returns. Feature expansion from 28 to 94 features yielded 0.005 AUC improvement. Not worth the complexity."

This is a critical decision point in any ML project: knowing when to stop iterating. The temptation is always "one more feature, one more experiment." The discipline is recognizing when you have crossed the point of diminishing returns.


Stage 6: Offline Evaluation — The Leakage Scare (Week 10)

The team evaluated on the held-out test set (the most recent 3 months of data). AUC-ROC: 0.84. Precision@15000: 30.1%.

Note the drop from 0.86 (validation) to 0.84 (test). This is normal — some optimism bias from repeated validation-set evaluation is expected. But Priya flagged a concern: the days_since_last_watch feature was the model's top feature by importance. She asked Marcus to verify how it was computed.

Marcus went pale. He had computed days_since_last_watch using the subscriber's entire watch history — including events that occurred after the prediction date for some training rows. A temporal leakage bug in the SQL query.

The fix took two days. After re-computing the feature correctly (using only events before the prediction date), AUC-ROC dropped from 0.84 to 0.82. The team lost two weeks of perceived progress, but Priya pointed out they had gained something more valuable: a correct model.

Lesson: Always verify how your most important features are computed. Trace them back to the raw data and the exact SQL or pandas query that produced them.
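The bug and its fix can be shown in miniature. The real query was SQL; this pandas sketch with an invented event log captures the mechanic: the leaky version aggregates over all events, while the correct version filters to events strictly before the prediction date.

```python
import pandas as pd

# Toy watch-event log (invented dates).
watch_events = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2],
    "watched_at": pd.to_datetime([
        "2024-02-20", "2024-03-05", "2024-03-25",  # subscriber 1 watches AFTER the cutoff
        "2024-02-01", "2024-02-10",
    ]),
})
prediction_date = pd.Timestamp("2024-03-15")

# Leaky version: uses the entire history, so the feature "knows" subscriber 1 came back.
leaky = watch_events.groupby("subscriber_id")["watched_at"].max()
leaky_days = (prediction_date - leaky).dt.days       # subscriber 1: -10 (impossible!)

# Correct point-in-time version: only events before the prediction date.
past = watch_events[watch_events["watched_at"] < prediction_date]
correct = past.groupby("subscriber_id")["watched_at"].max()
correct_days = (prediction_date - correct).dt.days   # subscriber 1: 10
```

A negative "days since last watch" is exactly the kind of impossible value a sanity check on feature distributions would have caught before evaluation.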


Stage 7: Deployment and A/B Test (Weeks 11-14)

Leo deployed the model as a batch scoring job. Every month, the pipeline runs: extract data, compute features, score all active subscribers, write predictions to a database table. The retention team queries the table for subscribers above the intervention threshold.
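The shape of that batch job, reduced to a sketch — every name here is an assumption, and the real pipeline runs against the warehouse and a model registry rather than injected callables:

```python
from datetime import date

def run_monthly_scoring(as_of, model, extract, compute_features, write_predictions):
    """Extract -> featurize -> score -> persist, all pinned to one snapshot date."""
    raw = extract(as_of)                      # active subscribers as of the run date
    features = compute_features(raw, as_of)   # point-in-time features only, no future events
    scores = model.predict_proba(features)    # churn probability per subscriber
    write_predictions(as_of, scores)          # table the retention team queries
    return scores

# Toy usage with stand-in components:
class StubModel:
    def predict_proba(self, X):
        return [0.1 for _ in X]

scores = run_monthly_scoring(
    date(2024, 4, 1),
    StubModel(),
    extract=lambda d: ["sub_1", "sub_2"],
    compute_features=lambda raw, d: [[len(s)] for s in raw],
    write_predictions=lambda d, s: None,
)
```

Passing the `as_of` date through every step is the design choice that matters: it makes backfills reproducible and guards against the same temporal leakage that bit the team in week 10.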

The A/B test design:
  - Control (50%): Retention team uses their existing heuristic (14-day inactivity rule)
  - Treatment (50%): Retention team uses the model's top 15,000 predictions
  - Duration: 3 months
  - Primary metric: Monthly churn rate in each group

Results after 3 months:
  - Control group churn rate: 8.1%
  - Treatment group churn rate: 7.2%
  - Relative reduction: 11.1%
  - Estimated monthly revenue saved: approximately $270,000 (projected at full rollout across the subscriber base)
  - Annualized: approximately $3.2 million
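A back-of-envelope check that the observed difference is not noise can be done with a pooled two-proportion z-test. The group sizes below (a 50/50 split of 2.4 million subscribers) and the single-month comparison are simplifying assumptions for illustration:

```python
import math

# Illustrative group sizes: 2.4M subscribers split evenly.
n_control, n_treatment = 1_200_000, 1_200_000
p_control, p_treatment = 0.081, 0.072

churn_c = p_control * n_control
churn_t = p_treatment * n_treatment

# Pooled two-proportion z-test.
p_pool = (churn_c + churn_t) / (n_control + n_treatment)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
z = (p_control - p_treatment) / se  # far beyond any conventional significance threshold

relative_reduction = (p_control - p_treatment) / p_control  # ~11.1%
```

At these sample sizes, a 0.9-percentage-point difference is unambiguous; the harder statistical questions are the ones about attribution raised below.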

The model was approved for full deployment.

It is worth pausing to note what was not measured by the A/B test: whether the model-identified subscribers were the right subscribers to target. Some of those subscribers might have stayed regardless of the offer (the model correctly predicted they were at risk, but the offer was unnecessary). Others might have been so determined to leave that no offer would have changed their mind (the model was right, but the intervention was futile). The ideal target is the "persuadable" segment — subscribers who would churn without an offer but will stay with one. Measuring this requires a more sophisticated experimental design that the team planned for phase 2.


Stage 8: The Ongoing Maintenance Reality (Month 4+)

Within two months of full deployment, two monitoring alerts fired:

Alert 1: Feature drift. StreamFlow launched a "Family Plan" that had no precedent in the training data. The model had never seen plan_type = "Family" and was defaulting to the mode of the other plan types. The feature encoding needed updating.

Alert 2: Prediction distribution shift. The average predicted churn probability dropped from 0.082 to 0.061, even though actual churn had not decreased. Investigation revealed that the engineering team had changed the session logging format (again), inflating sessions_last_30d. The model interpreted more sessions as lower churn risk.

Both issues required model retraining. The team established a quarterly retraining cadence with automated drift detection. Leo built a monitoring dashboard that tracked feature distributions, prediction distributions, and monthly accuracy once ground truth became available.
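One common way to automate the kind of feature-drift check Leo's dashboard performs is the Population Stability Index (PSI). The sketch below uses simulated continuous distributions standing in for sessions_last_30d before and after the logging change; the distributions and the 0.2 alert threshold (a widely used rule of thumb, not StreamFlow's actual setting) are assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature sample and a production sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover out-of-range production values
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
train_sessions = rng.normal(8, 2, 50_000)   # stand-in for sessions_last_30d at training time
prod_sessions = rng.normal(12, 2, 50_000)   # inflated after the logging-format change

psi = population_stability_index(train_sessions, prod_sessions)
psi_same = population_stability_index(train_sessions, rng.normal(8, 2, 50_000))
# Rule of thumb: PSI > 0.2 signals drift worth investigating; near 0 means stable.
```

A check like this, run per feature after each scoring batch, is what turns "the engineering team changed session logging again" from a silent model degradation into an alert.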


Timeline Summary

Week  | Activity                | Key Outcome
1     | Problem framing         | Formal problem definition; voluntary churn only; 30-day window
1-2   | Success metrics         | AUC-ROC + Precision@15000 offline; churn rate reduction online
2-4   | Data collection         | Three weeks (budgeted one); schema issues, missing data
5     | Baseline                | Heuristic baseline at 0.68 AUC; established the bar to beat
5-9   | Feature/model iteration | Five iterations; 0.71 to 0.86 AUC; features > algorithms
10    | Offline evaluation      | Caught temporal leakage; corrected AUC: 0.82
11-14 | Deployment + A/B test   | 11% churn reduction; $3.2M annualized savings
15+   | Monitoring              | Feature drift from new plan type; retraining cadence established

Lessons Learned

  1. Problem framing took 8 hours of meetings. It saved the team from building the wrong model. The distinction between voluntary and involuntary churn, and the requirement for a defined intervention strategy, would have been discovered eventually — but at a much higher cost.

  2. Data collection was the bottleneck, not modeling. Three weeks on data, five weeks on features and models. This ratio is typical.

  3. The heuristic baseline set an honest bar. Without it, the team might have accepted a model that was barely better than what the retention team was already doing.

  4. Feature engineering contributed more than algorithm selection. The jump from logistic regression with good features (AUC 0.81) to LightGBM with those same features (AUC 0.84) was smaller than the jump from basic features (AUC 0.71) to well-engineered features (AUC 0.81).

  5. The leakage bug was caught because someone checked. If Priya had not questioned the top feature, the model would have been deployed with inflated expectations and would have underperformed in production.

  6. Deployment was the beginning, not the end. Two monitoring alerts in two months. Quarterly retraining became necessary. The model is a system, not a deliverable.


Discussion Questions

  1. The problem framing phase excluded free trial users. Under what circumstances might it make sense to include them? What would change in the workflow?

  2. Fixing Marcus's temporal leakage bug reduced measured AUC from 0.84 to 0.82. How would you estimate the business impact of that 0.02 difference? Is it significant?

  3. The A/B test showed an 11.1% relative reduction in churn. What confounders might affect this result? How would you address them?

  4. The "advanced features" from the week-9 dead end barely moved the needle. How would you decide when to stop iterating on features? What signals indicate diminishing returns?