Case Study 1: StreamFlow Complete System Architecture Walkthrough


Background

It is Wednesday morning. The StreamFlow churn prediction system has been in production for four months. The system was built over the course of this textbook: SQL extraction (Chapter 5), feature engineering (Chapter 6), model training (Chapters 11--18), interpretation (Chapter 19), experiment tracking (Chapter 30), deployment (Chapter 31), monitoring (Chapter 32), fairness audit (Chapter 33), and ROI analysis (Chapter 34).

Today, three things happen simultaneously:

  1. The monitoring dashboard fires a PSI alert for sessions_last_30d (PSI = 0.31, above the 0.25 threshold).
  2. The customer success team reports that the weekly high-risk list "feels off" --- subscribers they expected to see are missing, and subscribers who seem perfectly happy are flagged.
  3. The VP of Engineering asks whether the churn model should be integrated into the mobile app to trigger in-app retention offers in real time.

Each of these events tests a different part of the architecture. This case study walks through all three.


Event 1: The Drift Alert

The Signal

The monitoring service runs daily at 2:00 AM. It computes PSI for each numeric feature by comparing the last 7 days of production predictions against the training data distribution. This morning's report:

import numpy as np
import pandas as pd

np.random.seed(42)

# Simulated PSI report from the monitoring dashboard
psi_report = {
    "date": "2025-04-16",
    "features": {
        "sessions_last_30d":          {"psi": 0.31, "status": "ALERT"},
        "avg_session_minutes":        {"psi": 0.18, "status": "WARNING"},
        "days_since_last_activity":   {"psi": 0.07, "status": "OK"},
        "monthly_charges":            {"psi": 0.03, "status": "OK"},
        "support_tickets_last_90d":   {"psi": 0.04, "status": "OK"},
        "engagement_intensity":       {"psi": 0.28, "status": "ALERT"},
        "content_diversity_ratio":    {"psi": 0.12, "status": "WARNING"},
        "recency_score":              {"psi": 0.06, "status": "OK"},
        "support_burden":             {"psi": 0.05, "status": "OK"},
        "charges_per_session":        {"psi": 0.22, "status": "WARNING"},
    },
}

alert_features = [
    f for f, v in psi_report["features"].items() if v["status"] == "ALERT"
]
warning_features = [
    f for f, v in psi_report["features"].items() if v["status"] == "WARNING"
]
print(f"ALERT features:   {alert_features}")
print(f"WARNING features: {warning_features}")

Two features in ALERT. Three in WARNING. The alert features --- sessions_last_30d and engagement_intensity --- are related (engagement intensity is derived from sessions). This is a single underlying shift, not two independent ones.
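The PSI computation behind these numbers can be sketched directly. This is a minimal illustrative version, not the monitoring service's actual code: decile bins are taken from training quantiles, production values are clipped into the training range, and bin fractions are epsilon-clipped to avoid log(0). The simulated data below is stand-in data, so the resulting magnitude will not match the dashboard's 0.31 exactly.

```python
import numpy as np

def population_stability_index(train: np.ndarray, prod: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a training distribution and a production sample.

    Bin edges come from training quantiles; fractions are clipped by a
    small epsilon so empty bins do not produce log(0).
    """
    edges = np.quantile(train, np.linspace(0, 1, n_bins + 1))
    prod_clipped = np.clip(prod, edges[0], edges[-1])  # keep out-of-range values in the edge bins
    train_frac = np.histogram(train, bins=edges)[0] / len(train)
    prod_frac = np.histogram(prod_clipped, bins=edges)[0] / len(prod)
    eps = 1e-6
    train_frac = np.clip(train_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

rng = np.random.default_rng(42)
train_sessions = rng.normal(14.2, 6.1, 50_000)   # training distribution from the report
prod_sessions = rng.normal(21.8, 7.3, 5_000)     # shifted production sample
psi = population_stability_index(train_sessions, prod_sessions)
print(f"PSI for sessions_last_30d: {psi:.2f}")   # well above the 0.25 alert threshold
```

A mean shift of more than one training standard deviation pushes PSI far past the 0.25 alert line, which is exactly the regime the dashboard flagged.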

The Diagnosis

Step one: determine why sessions increased.

# Simulated: compare session distributions (training vs. last 7 days)
train_sessions_mean = 14.2
train_sessions_std = 6.1

prod_sessions_mean = 21.8
prod_sessions_std = 7.3

print(f"Training: mean={train_sessions_mean}, std={train_sessions_std}")
print(f"Production (last 7d): mean={prod_sessions_mean}, std={prod_sessions_std}")
print(f"Shift: +{prod_sessions_mean - train_sessions_mean:.1f} sessions ({((prod_sessions_mean - train_sessions_mean) / train_sessions_mean) * 100:.0f}% increase)")

A 53% increase in average sessions. Before you retrain, you need to understand the cause. Three hypotheses:

  1. Product change: StreamFlow launched a "Continue Watching" feature on the home screen two weeks ago. This feature autoplays the next episode, increasing session counts without increasing engagement depth.
  2. Seasonal effect: Spring break. More streaming across all demographics.
  3. Data pipeline bug: A code change in the session tracking service is double-counting certain events.

Investigation

# Check whether the shift is uniform across plan types
plan_session_shifts = {
    "basic":    {"train_mean": 10.1, "prod_mean": 17.2, "pct_change": 70.3},
    "standard": {"train_mean": 14.8, "prod_mean": 22.4, "pct_change": 51.4},
    "premium":  {"train_mean": 18.9, "prod_mean": 25.1, "pct_change": 32.8},
    "family":   {"train_mean": 16.5, "prod_mean": 23.8, "pct_change": 44.2},
}

for plan, shifts in plan_session_shifts.items():
    print(f"  {plan:>10}: {shifts['train_mean']:.1f} -> {shifts['prod_mean']:.1f} ({shifts['pct_change']:+.1f}%)")

The shift is largest for basic subscribers (70% increase) and smallest for premium subscribers (33% increase). This pattern is consistent with the "Continue Watching" feature hypothesis: basic subscribers, who previously had shorter sessions and less engagement, are most affected by a feature that reduces friction for the next episode.

A data pipeline bug would produce a roughly uniform shift across all plans. A seasonal effect would scale in rough proportion to existing engagement levels. The non-uniform, plan-tier-correlated shift points to the product change.
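That reasoning can be made quantitative. One way, sketched here over the four plan-level numbers above (too few points for a formal test, but enough to codify the heuristic), is to correlate each plan's baseline session level with its percent change. The product-change hypothesis predicts a strongly negative correlation, because lower-engagement tiers have the most friction for autoplay to remove.

```python
import numpy as np

# Plan-level numbers from the drift investigation above
# (order: basic, standard, premium, family)
train_means = np.array([10.1, 14.8, 18.9, 16.5])
pct_changes = np.array([70.3, 51.4, 32.8, 44.2])

r = np.corrcoef(train_means, pct_changes)[0, 1]
print(f"Correlation between baseline sessions and % change: {r:.2f}")

# A correlation near -1 says: the lower a plan's baseline engagement,
# the larger its relative session increase -- the autoplay signature.
# A pipeline bug (uniform shift) or pure seasonality (proportional
# shift) would not produce this pattern.
```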

Decision

# Decision tree for drift response
decision = {
    "cause": "Product change (Continue Watching feature)",
    "cause_confidence": "High (non-uniform shift correlated with plan tier)",
    "is_data_pipeline_bug": False,
    "is_concept_drift": "Unknown (need labeled data from post-launch period)",
    "immediate_action": "No retrain yet. Wait for 60-day labels.",
    "monitoring_action": "Add Continue Watching feature flag as a model input in next retrain.",
    "timeline": "Retrain in 6-8 weeks when post-launch labeled data is available.",
    "risk": "Model may underestimate churn for basic subscribers whose increased "
            "sessions are driven by autoplay, not genuine re-engagement.",
}

for key, value in decision.items():
    print(f"  {key}: {value}")

Practitioner Note --- The temptation is to retrain immediately. Resist it. Retraining on drifted data before understanding the cause can make things worse. If the new session counts reflect genuine engagement (subscribers are actually watching more), the model should learn the new relationship. If the new session counts are inflated by autoplay (subscribers are not actually more engaged), retraining will teach the model to ignore session frequency as a churn signal --- which would be catastrophic. You need labeled data from the post-launch period to distinguish these scenarios.


Event 2: The "List Feels Off" Problem

The Complaint

Rachel Torres (VP of Customer Success) messages the data science team: "The high-risk list this week has 340 subscribers, up from the usual 280. But three of my team's top-concern accounts are not on the list, and there are at least a dozen names I do not recognize as at-risk. What changed?"

Connecting the Dots

The drift alert and the "list feels off" complaint are the same event observed from two perspectives. The monitoring dashboard sees a distributional shift. The customer success team sees unexpected predictions.

# Simulated: compare this week's predictions to last month's
this_week_flagged = 340
last_month_avg_flagged = 278

# The increase in flagged subscribers
print(f"This week: {this_week_flagged} flagged (threshold 0.20)")
print(f"Last month average: {last_month_avg_flagged} flagged")
print(f"Increase: {this_week_flagged - last_month_avg_flagged} ({((this_week_flagged - last_month_avg_flagged) / last_month_avg_flagged) * 100:.0f}%)")

# Why are more subscribers flagged?
# The "Continue Watching" feature increased sessions for most subscribers.
# But the model was trained when higher sessions = lower churn risk.
# Now, some subscribers with genuinely declining engagement look "fine"
# because their session count is inflated by autoplay.
# Meanwhile, subscribers with truly high engagement have even higher session
# counts, pushing their predicted churn probability down too far.

Explaining It to the Stakeholder

The data science team sends this response to Rachel:

Subject: Re: High-risk list concerns

Rachel,

Short answer: the "Continue Watching" feature launch two weeks ago
changed how sessions are counted. The model was trained before this
feature existed, so it is misinterpreting the new session patterns.

What's happening:
- Subscribers who would have had low sessions now have moderate
  sessions (autoplay). The model thinks they re-engaged. They didn't.
  These are the three accounts your team flagged that the model missed.
- Some subscribers with genuinely stable engagement now have very
  high session counts. The model is extra-confident they will stay.
  These are the unfamiliar names on the list --- the model is
  flagging subscribers on other risk factors (support tickets,
  inactivity between sessions) that it now weights more heavily
  because session count is less discriminative.

What we're doing:
1. Adding the "Continue Watching" opt-in flag as a feature in the
   next model retrain (6-8 weeks, when we have post-launch labels).
2. In the interim, we recommend your team use the SHAP explanations
   (the "top reasons" column) to prioritize. If a subscriber's top
   reason is "low session frequency," trust it. If the top reason
   is "support tickets" for a subscriber who seems happy, it may
   be a false alarm driven by the distributional shift.
3. We will provide a supplementary list of subscribers whose session
   counts increased less than expected after the feature launch ---
   these are the most likely missed churners.

Timeline: Interim supplementary list by Friday. Full retrain by
end of Q2.

-- Data Science Team
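The interim triage rule in point 2 of the email can be expressed directly over the weekly list. The sketch below assumes the export has `churn_prob` and `top_reason` columns; those names, and the sample rows, are illustrative, not the actual schema.

```python
import pandas as pd

# Hypothetical slice of the weekly high-risk export
risk_list = pd.DataFrame({
    "subscriber_id": ["SUB-00017", "SUB-00203", "SUB-00411", "SUB-00592"],
    "churn_prob": [0.61, 0.44, 0.38, 0.29],
    "top_reason": [
        "low session frequency",
        "support tickets",
        "low session frequency",
        "support tickets",
    ],
})

# Reasons still trustworthy under the session-count shift vs. those
# that may be artifacts of it, per the interim guidance above
trusted = risk_list[risk_list["top_reason"] == "low session frequency"]
review = risk_list[risk_list["top_reason"] != "low session frequency"]

print(f"Prioritize ({len(trusted)}):")
print(trusted.to_string(index=False))
print(f"Manual review ({len(review)}):")
print(review.to_string(index=False))
```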

The Supplementary Analysis

# Identify subscribers whose session increase was lower than expected
# given their plan type (potential "true" disengagement masked by autoplay)

def identify_masked_churners(
    subscribers_df: pd.DataFrame,
    plan_expected_increase: dict,
    threshold_pct: float = 0.5,  # flag if increase is < 50% of plan average
) -> pd.DataFrame:
    """
    Find subscribers whose session increase is below the plan-level average,
    suggesting they did not benefit from the Continue Watching feature
    and may be genuinely disengaging.
    """
    flagged = []
    for _, row in subscribers_df.iterrows():
        expected_increase = plan_expected_increase.get(row["plan_type"], 0)
        actual_increase = row["sessions_current"] - row["sessions_previous"]
        pct_of_expected = actual_increase / max(expected_increase, 1)

        if pct_of_expected < threshold_pct:
            flagged.append({
                "subscriber_id": row["subscriber_id"],
                "plan_type": row["plan_type"],
                "sessions_previous": row["sessions_previous"],
                "sessions_current": row["sessions_current"],
                "expected_increase": expected_increase,
                "actual_increase": actual_increase,
                "pct_of_expected": round(pct_of_expected, 2),
            })

    if not flagged:
        return pd.DataFrame()  # no subscribers below threshold
    return pd.DataFrame(flagged).sort_values("pct_of_expected")

# Plan-level expected increases from the drift analysis
plan_expected_increase = {
    "basic": 7.1,
    "standard": 7.6,
    "premium": 6.2,
    "family": 7.3,
}

# Simulated subscriber data
n_subs = 200
np.random.seed(42)
subs_df = pd.DataFrame({
    "subscriber_id": [f"SUB-{i:05d}" for i in range(n_subs)],
    "plan_type": np.random.choice(["basic", "standard", "premium", "family"], n_subs, p=[0.35, 0.35, 0.20, 0.10]),
    "sessions_previous": np.random.poisson(14, n_subs),
})
# Most subscribers increased; a subset did not
increase = np.where(
    np.random.random(n_subs) > 0.15,  # 85% got the expected increase
    np.random.poisson(7, n_subs),
    np.random.poisson(1, n_subs),      # 15% barely increased
)
subs_df["sessions_current"] = subs_df["sessions_previous"] + increase

masked = identify_masked_churners(subs_df, plan_expected_increase)
print(f"Subscribers with below-expected session increase: {len(masked)}")
print("These should be added to the supplementary high-risk list.")
print(masked.head(10).to_string(index=False))

Event 3: Mobile App Integration Request

The Ask

The VP of Engineering wants to trigger in-app retention offers (discounted upgrade, personalized content recommendations) when the churn model predicts a subscriber is at risk. This would happen in real time, inside the mobile app, as the subscriber is browsing.

The Architecture Gap

The current system was not designed for this. Here is why:

# Current system: designed for batch + dashboard use
current_architecture = {
    "serving_mode": "Batch nightly + real-time API for dashboard",
    "latency_requirement": "< 500ms (acceptable for dashboard lookup)",
    "throughput": "~100 requests/day (CS team lookups)",
    "consumer": "Customer success team (human decision-makers)",
    "action": "Manual outreach (email, call)",
    "decision_latency": "Hours to days (team reviews list, then acts)",
}

# Proposed system: real-time in-app triggers
proposed_architecture = {
    "serving_mode": "Real-time, embedded in app request flow",
    "latency_requirement": "< 50ms (user is waiting for the page to load)",
    "throughput": "~50,000 requests/hour (every app session)",
    "consumer": "Automated system (no human in the loop)",
    "action": "Automated offer (discount, recommendation)",
    "decision_latency": "Milliseconds (user sees offer on current screen)",
}

print("Architecture Comparison:")
print(f"{'Dimension':<25} {'Current':<40} {'Proposed'}")
print("-" * 100)
for key in current_architecture:
    print(f"{key:<25} {current_architecture[key]:<40} {proposed_architecture[key]}")

Five Problems to Solve

1. Latency. The current API computes SHAP values for every prediction, which takes 50--200ms depending on tree depth. The mobile app cannot wait 200ms for a churn score on every screen load. Solution: pre-compute scores in batch and cache them; or deploy a lightweight model (logistic regression) for real-time scoring and use the full model for the nightly batch.

2. Throughput. The current API handles 100 requests per day. The mobile app would generate 50,000 per hour. The single Docker container on ECS will not scale. Solution: horizontal auto-scaling with a load balancer, or a model-as-a-service platform that handles scaling automatically.

3. Human-in-the-loop. The current system puts a human between the prediction and the action. A customer success representative reviews the prediction, checks the SHAP explanations, and decides whether and how to intervene. The mobile app removes the human. An automated system decides whether to show a discount offer based solely on the model's output. This changes the risk profile: a false positive means an unnecessary discount (revenue loss), and a systematic bias means a subgroup receives more or fewer offers than they should.

4. Fairness implications. If the model is less accurate for certain demographics (e.g., younger subscribers, as shown in Chapter 33), automated offers could systematically over-discount or under-discount those groups. This is a fairness problem that is tolerable when a human reviews the list (the human can catch obvious errors) but unacceptable when the system acts autonomously at scale.

5. A/B testing infrastructure. If the app shows retention offers to predicted churners, how do you measure whether the offers work? You need an A/B testing framework: randomly assign some predicted churners to receive the offer and some to not, then compare retention rates. This requires integration with the mobile app's experimentation platform, which is a separate engineering effort.
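The randomization in problem 5 has to be deterministic and sticky: a subscriber must land in the same arm on every session, across app restarts. A common pattern, sketched here as an assumption about how it might work rather than the experimentation platform's actual code, is hash-based assignment on the subscriber ID.

```python
import hashlib

def assign_arm(subscriber_id: str, experiment: str = "retention-offer-v1") -> str:
    """Deterministically assign a subscriber to treatment or control.

    Hashing (experiment, subscriber_id) gives a stable 50/50 split that
    survives restarts and is independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{subscriber_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "treatment" if bucket < 50 else "control"

# Same subscriber always lands in the same arm
assert assign_arm("SUB-00042") == assign_arm("SUB-00042")

arms = [assign_arm(f"SUB-{i:05d}") for i in range(10_000)]
treated = arms.count("treatment")
print(f"Treatment share: {treated / len(arms):.3f}")  # close to 0.50
```

Seeding the hash with the experiment name means a subscriber's arm in this experiment tells you nothing about their arm in the next one, which keeps experiments from contaminating each other.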

The Response

# Recommendation to the VP of Engineering
recommendation = {
    "short_answer": "Yes, but not with the current architecture.",
    "phase_1": {
        "description": "Pre-compute churn scores nightly and cache in a low-latency store (Redis).",
        "latency": "< 5ms (cache lookup, no model inference)",
        "effort": "2-3 weeks engineering",
        "limitation": "Scores are up to 24 hours stale.",
    },
    "phase_2": {
        "description": "Build an A/B test: 50% of high-risk subscribers see an offer, 50% do not.",
        "purpose": "Measure the causal effect of the offer on retention.",
        "effort": "4-6 weeks (requires experimentation platform integration)",
        "prerequisite": "Phase 1 must be running and stable.",
    },
    "phase_3": {
        "description": "If A/B test shows positive ROI, deploy a real-time scoring service.",
        "latency": "< 50ms (lightweight model, no SHAP)",
        "effort": "6-8 weeks engineering",
        "prerequisite": "Phase 2 must show statistically significant positive ROI.",
    },
    "fairness_requirement": "Fairness audit must pass for the real-time model before Phase 3 launch. "
                            "Automated decisions at scale require stricter fairness guarantees than "
                            "human-reviewed lists.",
}

for phase, details in recommendation.items():
    if isinstance(details, dict):
        print(f"\n{phase}:")
        for k, v in details.items():
            print(f"  {k}: {v}")
    else:
        print(f"{phase}: {details}")
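Phase 1's cache-first pattern is the piece worth sketching: the nightly batch writes scores into a key-value store, and the app path is a pure lookup with a conservative fallback for missing subscribers. A plain dict stands in for Redis below, and the function names and fail-safe policy are illustrative assumptions, not the team's actual implementation.

```python
from typing import Optional

# Stands in for Redis: subscriber_id -> churn score written by the nightly batch
score_cache: dict[str, float] = {}

def nightly_batch_write(scores: dict[str, float]) -> None:
    """Nightly batch job: bulk-load the latest churn scores into the cache."""
    score_cache.clear()
    score_cache.update(scores)

def get_cached_score(subscriber_id: str) -> Optional[float]:
    """App-path lookup: O(1), no model inference, no SHAP.

    Returns None on a cache miss (new subscriber, failed batch run);
    the app should then show no offer rather than guess.
    """
    return score_cache.get(subscriber_id)

nightly_batch_write({"SUB-00001": 0.84, "SUB-00002": 0.07})
print(get_cached_score("SUB-00001"))  # 0.84
print(get_cached_score("SUB-99999"))  # None -> no offer, fail safe
```

The fail-safe choice on a miss is the design decision that matters: an automated system with no human in the loop should default to doing nothing, not to a guessed score.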

Lessons from This Day

Three events in one day. The first two share a single underlying cause (a product change that shifted feature distributions); the third is an independent strategic request. Three different responses were required:

  1. Technical response (drift alert): Investigate, diagnose, plan a retrain with appropriate timing.
  2. Operational response (list complaints): Communicate the issue honestly, provide an interim workaround, set expectations for resolution.
  3. Strategic response (mobile integration): Map the gap between current architecture and the new requirement, propose a phased plan with clear prerequisites.

The data scientist who built the model needed to operate across all three dimensions in a single day. The technical skills from Chapters 1--34 enabled the diagnosis. The system architecture from this capstone chapter enabled the response. The communication skills from Chapter 34 enabled the stakeholder interactions.

That is the job.


Questions for Discussion

  1. The team decided to wait 6--8 weeks for post-launch labeled data before retraining. What is the risk of this decision? Under what circumstances should they retrain sooner?

  2. The supplementary list of "masked churners" is a heuristic, not a model prediction. How would you validate whether this heuristic is accurate? What data would you need?

  3. The phased approach to mobile integration (cache first, A/B test second, real-time model third) is conservative. A product manager argues: "We are losing churners every day we wait. Ship the real-time model now." Write a 3--4 sentence response.

  4. The fairness requirement for Phase 3 is described as "stricter than for human-reviewed lists." Why? What specific fairness metric would you use, and what threshold would you set?


This case study is part of Chapter 35: Capstone --- End-to-End ML System. Return to the chapter for the full architecture.