In This Chapter
- 2.1 The ML Lifecycle: An Honest Map
- 2.2 Stage 1: Problem Framing
- 2.3 Stage 2: Defining Success Metrics
- 2.4 Stage 3: Data Collection and Validation
- 2.5 Stage 4: The Stupid Baseline
- 2.6 Stage 5: Feature Engineering and Model Iteration
- 2.7 The Train/Validation/Test Split
- 2.8 Stage 6: Offline Evaluation
- 2.9 Data Leakage: The Silent Killer
- 2.10 Stage 7: Deployment and Online Evaluation
- 2.11 Stage 8: Monitoring and Maintenance
- 2.12 The ShopSmart Contrast: A Different Workflow for Recommendations
- 2.13 Putting It All Together: The Iterative Nature of ML
- 2.14 Progressive Project: Framing the StreamFlow Churn Problem
- 2.15 Summary
- Bridge to Chapter 3
- Key Terms Introduced
Chapter 2: The Machine Learning Workflow
Problem Framing, Data Pipeline, Modeling, Evaluation, Deployment
The Two Workflows
Here is how most online tutorials teach machine learning:
- Load dataset.
- Split into train and test.
- Train a model.
- Print accuracy.
- Celebrate.
Five steps. Clean data. Guaranteed convergence. And almost nothing in common with how machine learning actually works.
Here is what actually happens when a data science team at a company like StreamFlow — our subscription streaming analytics platform with 2.4 million subscribers and an 8.2% monthly churn rate — builds an ML system:
- Someone in the business says "we need AI."
- Weeks of meetings to figure out what that means.
- Defining a metric that the business cares about (not accuracy).
- Discovering the data you need does not exist, or lives in five different systems, or is full of nulls.
- Building a data pipeline that breaks three times before it works.
- Establishing a baseline so stupid it embarrasses you — and then discovering your first "real" model barely beats it.
- Iterating on features for weeks.
- Evaluating with the wrong metric, realizing it, starting over.
- Finally getting a model that works offline.
- Deploying it and watching it fail in ways you did not anticipate.
- Monitoring, retraining, maintaining — forever.
The tutorial workflow takes an afternoon. The real workflow takes months. And the gap between them is where most ML projects die.
This chapter maps the real workflow end-to-end. By the time you finish, you will understand every stage, where each one can go wrong, and why problem framing — the step that gets the least attention — is the one that matters most.
2.1 The ML Lifecycle: An Honest Map
Let us lay out the full lifecycle before we dive into each stage. There are eight stages, and they are emphatically not linear:
- Problem Framing — What are we actually predicting? Why? For whom?
- Success Metric Definition — How will we know the model is working? Both offline and in production.
- Data Collection and Validation — Getting the data, verifying it, understanding its limitations.
- Baseline Establishment — The simplest possible model that sets the floor.
- Feature Engineering and Model Iteration — The cycle of building features, training models, evaluating, and repeating.
- Offline Evaluation — Rigorous testing on held-out data before anything goes live.
- Deployment and Online Evaluation — Putting the model in production and running A/B tests.
- Monitoring and Maintenance — Watching for drift, retraining, and managing technical debt.
Notice that "train a model" is step 5 of 8, and it shares that step with feature engineering. In practice, modeling is maybe 20% of the work. The other 80% is everything around it.
War Story — A team at a mid-size SaaS company spent three months building a sophisticated deep learning model for customer churn. When they finally evaluated it against a logistic regression with hand-crafted features, the logistic regression won. Three months of compute costs, architecture experiments, and GPU time — beaten by a model a junior analyst could have built in a week. The problem was not the algorithm. The problem was that they started at step 5 without doing steps 1 through 4 properly.
2.2 Stage 1: Problem Framing
Problem framing is the most important step in the ML workflow. It is also the one most frequently skipped, shortchanged, or delegated to someone who does not understand the implications.
Why is it neglected? Because it looks like meetings, not engineering. It produces documents, not models. It requires talking to business stakeholders who do not speak Python and who often cannot articulate what they actually need. The junior data scientist itching to train a model sees problem framing as bureaucratic overhead. The senior data scientist who has shipped three failed models sees it as the only thing that prevents the fourth from failing.
Problem framing answers five questions:
- What are we predicting? — The target variable. This must be precise.
- What is the observation unit? — What does one row in the training data represent?
- When do we make the prediction? — The prediction point in time.
- What information is available at prediction time? — This is where data leakage lives.
- What action will be taken based on the prediction? — If no action changes, the model is useless.
Problem Framing at StreamFlow
StreamFlow's VP of Customer Success walks into a meeting and says: "We need a model to predict churn." That is not a problem frame. It is a wish. Let us turn it into something we can actually build.
What are we predicting?
"Churn" is vague. Does it mean:
- A subscriber cancels their subscription?
- A subscriber stops watching content but keeps paying?
- A subscriber downgrades from Premium to Basic?
- A subscriber's payment fails and is not recovered?
Each definition leads to a different model, different training data, and different business actions. For StreamFlow, we will define churn as: the subscriber's subscription status transitions to "canceled" within 30 days of the prediction date. That gives us a binary target: canceled_within_30_days (0 or 1).
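The chosen definition translates directly into code. Here is a sketch of the target construction, using a hypothetical table with a `canceled_at` timestamp (NaT for still-active subscribers) — the column names are illustrative, not StreamFlow's actual schema:

```python
import pandas as pd

# Hypothetical subscribers table: `canceled_at` is NaT while active
subscribers = pd.DataFrame({
    "subscriber_id": [101, 102, 103],
    "canceled_at": pd.to_datetime(["2024-03-12", pd.NaT, "2024-05-02"]),
})

prediction_date = pd.Timestamp("2024-03-01")
window_end = prediction_date + pd.Timedelta(days=30)

# 1 if the subscription transitions to "canceled" within 30 days of the
# prediction date, else 0 (NaT comparisons are False, so active -> 0)
subscribers["canceled_within_30_days"] = (
    subscribers["canceled_at"].between(prediction_date, window_end)
).astype(int)

print(subscribers["canceled_within_30_days"].tolist())  # [1, 0, 0]
```

Subscriber 103 cancels in May — outside the 30-day window — so the March prediction row labels them 0, not 1. Getting this boundary right is exactly what the precise definition buys you.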
What is the observation unit?
One row in our dataset is a subscriber-month: a snapshot of one subscriber's state at the beginning of a calendar month. A subscriber who has been active for 18 months contributes 18 rows to the dataset (though we will only use recent ones for training to avoid concept drift).
When do we make the prediction?
On the first day of each month, we score every active subscriber. The prediction covers the next 30 days. This timing matters for two reasons: it determines what features are available, and it determines when the retention team can act.
What information is available at prediction time?
Only data that exists before the prediction date. If we are predicting on March 1, we can use February data. We cannot use March data. This sounds obvious. It is the single most common source of data leakage in production ML.
What action will the business take?
Subscribers predicted to churn with probability > 0.7 will receive a retention offer: a 20% discount for 3 months. The retention team has capacity to contact 15,000 subscribers per month. This capacity constraint means the model's ranking (who is most likely to churn) matters more than its exact probabilities.
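The capacity constraint can be made concrete. Since the team can contact only 15,000 subscribers, what matters is selecting the K highest-risk subscribers, not everyone above a threshold. A minimal sketch, with synthetic scores standing in for model output:

```python
import numpy as np

# Synthetic stand-ins for model-predicted churn probabilities
rng = np.random.default_rng(0)
churn_probs = rng.random(100_000)
capacity = 15_000

# Indices of the `capacity` largest scores (argpartition avoids a full sort)
top_k = np.argpartition(churn_probs, -capacity)[-capacity:]
contact_list = np.sort(top_k)  # subscriber indices for the retention team

print(len(contact_list))  # 15000
```

Note the design consequence: because selection is top-K, the model's ranking quality (AUC-ROC, Precision@K) matters more than where any fixed probability threshold falls.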
Common Mistake — Skipping the "what action will be taken?" question. A model that predicts churn but does not connect to any intervention is a reporting tool, not a decision system. Reporting tools do not need ML — a SQL query and a dashboard would suffice.
The Problem Framing Document
Before writing any code, write a one-page problem framing document. It does not need to be formal. It does need to be written down. Here is the template:
PROBLEM FRAMING: StreamFlow Churn Prediction
=============================================
Business Question:
Which active subscribers will cancel within the next 30 days?
Target Variable:
canceled_within_30_days (binary: 0 = retained, 1 = canceled)
Observation Unit:
subscriber-month (one row per active subscriber per month)
Prediction Timing:
First day of each month, covering the next 30 days
Available Features (at prediction time):
- Subscription metadata: plan type, tenure, price, payment method
- Usage behavior: hours watched (last 7/14/30/60/90 days), genres,
devices, sessions, completion rates
- Engagement signals: ratings given, watchlist additions, search queries
- Support history: tickets filed, resolution time, complaint topics
- Billing: failed payments, plan changes, promo history
Excluded (available but cannot use):
- Anything after the prediction date
- Cancellation reason (target leakage — only exists after churn)
- Retention offer outcomes (would leak the intervention itself)
Business Action:
Subscribers with P(churn) > 0.7 receive retention offer (20% discount
for 3 months). Team capacity: 15,000/month.
Success Metrics:
- Offline: Precision@15000, AUC-ROC, calibration
- Online: Reduction in monthly churn rate (A/B test)
- Business: Net revenue impact (retention revenue - discount cost)
Baseline:
Previous month's churn rate applied uniformly (8.2%)
That document took 20 minutes to write. It will save you weeks of wasted work.
The "What Action?" Test
There is a simple test for whether your problem frame is complete: describe the action that changes based on the model's output. If you cannot, the model has no business value.
Good answers to "what action?":
- "Subscribers with P(churn) > 0.7 receive a retention offer." (StreamFlow)
- "Patients with P(readmission) > 0.4 receive a post-discharge follow-up call." (Hospital)
- "Products predicted to sell out within 7 days trigger automatic reorder." (Retail)
- "Transactions with P(fraud) > 0.9 are held for manual review." (Payments)
Bad answers to "what action?":
- "We will have a dashboard that shows churn risk." (A dashboard is not an action.)
- "The marketing team will know which customers are at risk." (Knowing is not acting.)
- "We will report monthly churn predictions to leadership." (Reporting is not intervening.)
If the best answer to "what action?" is "we will look at it," you do not need ML. You need a SQL query and a visualization.
Common Mistake — Building a model before the intervention strategy exists. A churn model is worthless if the retention team has no protocol for high-risk subscribers. A readmission model is worthless if the hospital has no post-discharge program. Build the intervention strategy in parallel with the model, or the model will sit in a notebook and gather dust.
2.3 Stage 2: Defining Success Metrics
You need two kinds of success metrics: offline metrics that you can compute before deployment, and online metrics that you measure in production.
Offline Metrics
Offline metrics evaluate the model on held-out historical data. For StreamFlow's churn problem, the metrics that matter are:
- AUC-ROC — How well does the model rank churners above non-churners? This is the primary offline metric because the business cares about ranking (who should get the retention offer first).
- Precision@K — Of the top K subscribers the model flags, how many actually churned? With K = 15,000 (the team's capacity), this directly measures operational relevance.
- Calibration — When the model says P(churn) = 0.3, do roughly 30% of those subscribers actually churn? Calibration matters because the threshold (0.7) is a business decision, and it only makes sense if the probabilities are well-calibrated.
What about accuracy? Accuracy is almost never the right metric for production ML. StreamFlow's churn rate is 8.2%. A model that predicts "no churn" for everyone achieves 91.8% accuracy. That model is useless.
Online Metrics
Online metrics are measured in production, typically through A/B tests:
- Churn rate reduction — The primary business metric. Does the churn rate decrease for subscribers who receive model-guided interventions compared to a control group?
- Net revenue impact — Revenue retained from prevented churn minus the cost of retention offers. This is what the CFO cares about.
- Intervention efficiency — Of the subscribers who received offers, what fraction would have churned without intervention? This measures how well the model targets the "persuadable" segment.
Production Tip — Always define a "guardrail metric" — something that must not get worse even if the primary metric improves. For StreamFlow, a guardrail might be customer satisfaction score. If the model identifies high-churn-risk subscribers and bombards them with desperate discount offers, it might reduce churn but annoy loyal customers who were never going to leave.
The Metric Hierarchy
Put your metrics in order:
- Primary: AUC-ROC (offline), churn rate reduction (online)
- Secondary: Precision@15000, calibration
- Guardrail: Customer satisfaction, false positive rate among long-tenure subscribers
This hierarchy prevents arguments later. When someone asks "is the model good?" you point to the primary metric. When someone asks "but what about X?" you check the guardrails.
2.4 Stage 3: Data Collection and Validation
You have a framed problem and defined metrics. Now you need data. This is where reality intrudes.
The Data You Want vs. The Data You Have
StreamFlow's ideal dataset would include:
- Complete usage history for every subscriber since they signed up
- Detailed demographic information
- Every support interaction with full transcripts
- External data: credit scores, competitor usage, household income
The data you actually have:
- Usage events going back 18 months (the data warehouse was migrated and everything before that was lost)
- Demographics limited to age range and country (collected at signup, never updated)
- Support tickets with category tags but no transcripts (GDPR-compliant retention policy deleted older records)
- No external data (legal says the data sharing agreement will take 6 months to finalize)
This gap between ideal and actual is normal. Do not wait for perfect data. Start with what you have.
Data Validation
Before you use the data, validate it. Data validation catches problems early — before they silently corrupt your model.
import pandas as pd
import numpy as np

def validate_churn_dataset(df: pd.DataFrame) -> dict:
    """
    Validate the StreamFlow churn dataset.

    Returns a dictionary of validation results with pass/fail status.
    """
    results = {}

    # 1. Check for duplicate observation units
    dupes = df.duplicated(subset=["subscriber_id", "observation_month"]).sum()
    results["no_duplicate_rows"] = {
        "passed": dupes == 0,
        "detail": f"{dupes} duplicate subscriber-month pairs found",
    }

    # 2. Check target variable is binary
    target_values = df["canceled_within_30_days"].unique()
    results["binary_target"] = {
        "passed": set(target_values).issubset({0, 1}),
        "detail": f"Unique target values: {sorted(target_values)}",
    }

    # 3. Check for future data leakage in column names
    suspicious_columns = [
        col for col in df.columns
        if any(
            leak_word in col.lower()
            for leak_word in ["cancel_reason", "churn_date", "retention_offer_outcome"]
        )
    ]
    results["no_obvious_leakage_columns"] = {
        "passed": len(suspicious_columns) == 0,
        "detail": f"Suspicious columns: {suspicious_columns}",
    }

    # 4. Check churn rate is plausible
    churn_rate = df["canceled_within_30_days"].mean()
    results["plausible_churn_rate"] = {
        "passed": 0.02 < churn_rate < 0.25,
        "detail": f"Churn rate: {churn_rate:.3f}",
    }

    # 5. Check for nulls in the target
    target_nulls = df["canceled_within_30_days"].isna().sum()
    results["no_null_targets"] = {
        "passed": target_nulls == 0,
        "detail": f"{target_nulls} null target values",
    }

    # 6. Check temporal ordering
    if "observation_month" in df.columns:
        date_range = (
            df["observation_month"].min(),
            df["observation_month"].max(),
        )
        results["date_range"] = {
            "passed": True,
            "detail": f"Data spans {date_range[0]} to {date_range[1]}",
        }

    # Print summary
    for check, result in results.items():
        status = "PASS" if result["passed"] else "FAIL"
        print(f"[{status}] {check}: {result['detail']}")

    return results
Try It — Add two more validation checks to the function above: (a) verify that tenure_months is never negative, and (b) verify that hours_watched_last_30d is never greater than 720 (24 hours x 30 days).
The Data Pipeline
Raw data rarely arrives in one clean table. At StreamFlow, the data lives in multiple source systems:
- Subscription service (PostgreSQL) — plan type, signup date, billing status
- Event stream (Kafka -> data warehouse) — watch events, search events, device logs
- Support platform (Zendesk API) — ticket metadata
- Feature store (if one exists) — pre-computed features from other teams
A data pipeline is the code that extracts data from these sources, transforms it into the format your model needs, and loads it into a training dataset. This is the ETL (Extract-Transform-Load) that underpins every ML system.
For now, think of the pipeline as three phases:
- Extract: Pull raw data from source systems.
- Transform: Clean, join, aggregate, and compute features.
- Load: Write the final feature matrix and target vector to a training-ready format.
We will build the actual SQL extraction in Chapter 5 and the preprocessing pipeline in Chapter 10. For now, the important point is: the data pipeline is code, it should be version-controlled, and it must be reproducible. If you cannot re-run the pipeline and get the same training data, your results are not reproducible.
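The three phases above can be sketched as a skeleton — function and table names here are hypothetical stand-ins (the real SQL extraction comes in Chapter 5). The point is the shape: each phase is a function, so re-running the pipeline on the same inputs yields the same training data.

```python
import pandas as pd

def extract(snapshot_month: str) -> dict[str, pd.DataFrame]:
    """Pull raw tables for one observation month. Stubbed with inline data."""
    return {
        "subscriptions": pd.DataFrame(
            {"subscriber_id": [1, 2], "plan": ["basic", "premium"]}
        ),
        "usage": pd.DataFrame(
            {"subscriber_id": [1, 2], "hours_watched_30d": [1.5, 40.0]}
        ),
    }

def transform(raw: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Join sources and compute features."""
    return raw["subscriptions"].merge(raw["usage"], on="subscriber_id", how="left")

def load(features: pd.DataFrame, path: str) -> None:
    """Write the training-ready feature matrix."""
    features.to_parquet(path)  # parquet keeps dtypes stable across runs

features = transform(extract("2024-03"))
print(features.shape)  # (2, 3)
```

Because extract, transform, and load are ordinary functions in version control, the pipeline can be unit-tested and re-run — which is what "reproducible" means in practice.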
Production Tip — A feature store is a centralized system for storing, managing, and serving ML features. Companies like Uber, Airbnb, and Netflix use feature stores to avoid having every data scientist re-compute the same features independently. If your organization has one, use it. If it does not, and you have more than two ML models in production, advocate for building one. The alternative is feature drift: every model computes "days since last login" slightly differently, and nobody knows which version is correct.
2.5 Stage 4: The Stupid Baseline
Before you train any real model, establish a baseline. And I mean a truly stupid baseline — something so simple it would be embarrassing to present in a meeting. Every model you build must beat this baseline. If it does not, your model is not learning anything useful.
Why Baselines Matter
Baselines serve three purposes:
- Floor-setting: They establish the minimum performance any useful model must exceed.
- Sanity-checking: If your fancy model only barely beats a trivial baseline, something is wrong — probably with your features, not your algorithm.
- Debugging: When a model behaves unexpectedly, comparing to the baseline helps isolate whether the issue is in the data, the features, or the model.
Types of Stupid Baselines
For classification problems:
- Majority class: Always predict the most common class. For StreamFlow (8.2% churn), this predicts "no churn" for everyone. Accuracy: 91.8%. AUC-ROC: 0.5.
- Random proportional: Predict churn with probability 0.082 for every subscriber. AUC-ROC: 0.5.
- Single-feature heuristic: Use the single best feature with a threshold. "If hours_watched_last_30d < 2, predict churn." Cheap, interpretable, and often surprisingly decent.
For regression problems:
- Predict the mean: Always predict the training set mean. MSE equals the variance.
- Predict the median: Always predict the training set median. Minimizes MAE.
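The mean-baseline claim can be verified directly on synthetic data — predicting the training mean everywhere gives an MSE equal to the target's variance, by the definition of variance:

```python
import numpy as np

# Synthetic regression target, purely for illustration
rng = np.random.default_rng(42)
y = rng.normal(loc=50.0, scale=10.0, size=10_000)

# MSE of the constant mean predictor = mean squared deviation from the mean,
# which is exactly the (population) variance of y
mse_of_mean_baseline = np.mean((y - y.mean()) ** 2)
print(np.isclose(mse_of_mean_baseline, y.var()))  # True
```

So a regression model that does not beat the variance of its target has learned nothing.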
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score,
classification_report,
)
# Simulate StreamFlow data for demonstration
# In practice, this comes from your data pipeline
np.random.seed(42)
n_subscribers = 50000
churn_rate = 0.082
# Generate the target first, then condition the features on it so that
# churners genuinely watch less and have shorter tenure. (Drawing separate
# random masks for each feature would leave them independent of the target.)
target = (np.random.rand(n_subscribers) < churn_rate).astype(int)
hours_watched = np.where(
    target == 1,
    np.random.exponential(5, n_subscribers),   # churners watch less
    np.random.exponential(25, n_subscribers),  # retained subscribers watch more
)
tenure_months = np.where(
    target == 1,
    np.random.exponential(6, n_subscribers),   # churners tend to be newer
    np.random.exponential(18, n_subscribers),
)
X = np.column_stack([hours_watched, tenure_months])
y = target
# Train/test split (more on this in Section 2.7)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Baseline 1: Majority class
majority_baseline = DummyClassifier(strategy="most_frequent", random_state=42)
majority_baseline.fit(X_train, y_train)
y_pred_majority = majority_baseline.predict(X_test)
print("=== Majority Class Baseline ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_majority):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_majority, zero_division=0):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_majority, zero_division=0):.3f}")
print(f"F1: {f1_score(y_test, y_pred_majority, zero_division=0):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, majority_baseline.predict_proba(X_test)[:, 1]):.3f}")
# Baseline 2: Stratified random
stratified_baseline = DummyClassifier(strategy="stratified", random_state=42)
stratified_baseline.fit(X_train, y_train)
y_prob_strat = stratified_baseline.predict_proba(X_test)[:, 1]
print("\n=== Stratified Random Baseline ===")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob_strat):.3f}")
=== Majority Class Baseline ===
Accuracy: 0.918
Precision: 0.000
Recall: 0.000
F1: 0.000
AUC-ROC: 0.500
=== Stratified Random Baseline ===
AUC-ROC: 0.500
Look at the majority class baseline: 91.8% accuracy, but zero precision, zero recall. It learned nothing. AUC-ROC is 0.5, which is random coin-flipping. This is your floor. Any model worth deploying must substantially exceed an AUC-ROC of 0.5.
Common Mistake — Reporting accuracy for imbalanced classification problems. When your positive class is 8.2% of the data, accuracy is dominated by the negative class. A model that is 93% accurate might still be missing half the churners. Use AUC-ROC, precision-recall curves, or F1 — metrics that account for both classes.
The Single-Feature Baseline
Now let us try something slightly less stupid: using a single feature with a threshold.
from sklearn.metrics import roc_auc_score
# Single-feature baseline: hours_watched_last_30d
# Lower hours -> higher churn risk
# Use negative hours as the "score" so lower hours = higher score
hours_feature_scores = -X_test[:, 0]
auc_single_feature = roc_auc_score(y_test, hours_feature_scores)
print(f"Single-feature baseline (hours watched) AUC-ROC: {auc_single_feature:.3f}")
If this single-feature baseline achieves AUC-ROC of 0.72, and your carefully engineered gradient boosting model achieves 0.74, you have a problem. Not a modeling problem — a feature problem. The model is barely extracting more signal than a single column lookup.
The Business Heuristic Baseline
There is one more baseline that matters, and it is the one most teams forget: what is the business already doing?
Before your model existed, how was the retention team deciding which subscribers to contact? At StreamFlow, the retention team used a simple heuristic: flag subscribers who had not watched anything in the last 14 days. This rule was not based on ML. It was based on a product manager's intuition from three years ago.
Compute the metrics for this heuristic. Suppose it achieves AUC-ROC of 0.68 and Precision@15000 of 18.4%. This is your operational baseline — the bar your model must clear to justify its existence. Beating the majority-class baseline (AUC-ROC 0.50) is necessary but not sufficient. Beating the business heuristic is what justifies the investment.
If your gradient boosting model achieves 0.72 against the heuristic's 0.68, you need to ask honestly: is a 0.04 AUC improvement worth the engineering cost of deploying and maintaining an ML system? Sometimes the answer is yes (when the business impact is large enough). Sometimes it is no (when the heuristic is cheap and "good enough"). This is a business decision, not a technical one.
Production Tip — Always benchmark against the current business process, not just a statistical baseline. The question is not "is the model better than random?" The question is "is the model better than what we are already doing, by enough to justify the cost?"
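Evaluating the heuristic uses the same machinery as evaluating a model. A sketch on synthetic data (the feature name and numbers are illustrative): a binary rule still induces a coarse ranking, so AUC-ROC and Precision@K are both computable.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data: churners skew toward longer inactivity
rng = np.random.default_rng(7)
n = 20_000
churned = rng.random(n) < 0.082
days_since_last_watch = np.where(
    churned,
    rng.integers(0, 40, n),   # churners: often inactive for weeks
    rng.integers(0, 20, n),   # retained: mostly recent activity
)

# The retention team's rule: flag anyone inactive for 14+ days
heuristic_flag = (days_since_last_watch >= 14).astype(int)
auc = roc_auc_score(churned, heuristic_flag)

# Precision@K under a capacity constraint, ranking by inactivity
k = 1_500  # pretend capacity for this synthetic sample
top_k = np.argsort(-days_since_last_watch)[:k]
precision_at_k = churned[top_k].mean()
print(f"Heuristic AUC-ROC: {auc:.3f}, Precision@{k}: {precision_at_k:.3f}")
```

Run your model through exactly the same code path, and the "is it worth it?" comparison becomes a one-line diff instead of an argument.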
2.6 Stage 5: Feature Engineering and Model Iteration
This is where most tutorials start and most production ML projects spend the majority of their time. It is a cycle:
- Engineer features (Chapter 6 covers this in depth).
- Train a model.
- Evaluate on validation data.
- Analyze errors.
- Go back to step 1.
The key insight: improving features almost always matters more than improving algorithms. A logistic regression with brilliant features will outperform a neural network with bad features. We will return to this repeatedly throughout the book.
For now, here is the high-level iteration loop for StreamFlow:
Iteration 1: Basic features (tenure, plan type, hours watched last 30 days). Logistic regression. AUC-ROC: 0.71.
Iteration 2: Add temporal features (trend in hours watched over 3 months, days since last login). Same model. AUC-ROC: 0.76.
Iteration 3: Add engagement features (genre diversity, completion rate, device count). AUC-ROC: 0.79.
Iteration 4: Switch to gradient boosting. AUC-ROC: 0.82.
Iteration 5: Add support ticket features and billing history. AUC-ROC: 0.85.
Notice: the jump from iteration 1 to iteration 3 (adding features) was larger than the jump from iteration 3 to iteration 4 (changing the algorithm). This pattern is typical, and it points to one of the most underappreciated truths in applied ML: the algorithm matters far less than the features. The features took the model from 0.71 to 0.79 — a gain of 0.08. Switching from logistic regression to gradient boosting took it from 0.79 to 0.82 — a gain of 0.03. The features did more work.
Error Analysis Between Iterations
Between iterations, do not just look at aggregate metrics. Look at the errors. Which subscribers did the model get wrong? Are there patterns?
import pandas as pd
import numpy as np

# After training a model, examine false negatives (subscribers who churned
# but the model predicted they would stay)

def analyze_errors(y_true, y_pred, X, feature_names, threshold=0.5):
    """
    Compare false negatives vs. true positives to find
    patterns the model is missing.
    """
    y_pred_label = (y_pred >= threshold).astype(int)
    false_negatives = (y_true == 1) & (y_pred_label == 0)
    true_positives = (y_true == 1) & (y_pred_label == 1)

    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X, columns=feature_names)

    fn_stats = X[false_negatives].describe().T[["mean", "std"]]
    tp_stats = X[true_positives].describe().T[["mean", "std"]]

    comparison = pd.DataFrame({
        "FN_mean": fn_stats["mean"],
        "TP_mean": tp_stats["mean"],
        "difference": fn_stats["mean"] - tp_stats["mean"],
    })

    print("Feature comparison: False Negatives vs True Positives")
    print(comparison.round(3))
    return comparison
If the false negatives are mostly long-tenure subscribers (tenure > 24 months) who recently stopped watching, you know the model needs features that capture behavioral change relative to a subscriber's own history — not just absolute usage levels. This error analysis is what drives the next iteration.
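A "change relative to own history" feature can be sketched in a few lines. The column names here are hypothetical; the idea is to normalize recent usage by the subscriber's own trailing baseline:

```python
import pandas as pd

# Hypothetical pre-aggregated usage columns
df = pd.DataFrame({
    "subscriber_id": [1, 2, 3],
    "hours_watched_last_30d": [2.0, 30.0, 28.0],
    "hours_watched_last_90d": [60.0, 95.0, 84.0],
})

# The subscriber's own trailing monthly average over 90 days
monthly_avg_90d = df["hours_watched_last_90d"] / 3

# Ratio well below 1 means the subscriber is watching far less than their own
# norm — a subscriber dropping from 20 h/month to 2 h/month is a much stronger
# signal than the absolute 2 h figure alone. Clip guards against division by zero.
df["usage_vs_own_norm"] = df["hours_watched_last_30d"] / monthly_avg_90d.clip(lower=1e-6)
print(df["usage_vs_own_norm"].round(2).tolist())  # [0.1, 0.95, 1.0]
```

Subscriber 1's absolute usage (2 hours) and relative usage (10% of their norm) tell two different stories; the relative feature is what catches the long-tenure subscriber who is quietly disengaging.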
Production Tip — Keep a log of every iteration. Record the features used, the model type, the hyperparameters, and the evaluation metrics. This log is invaluable for debugging, for reporting to stakeholders, and for the next data scientist who inherits your project. Tools like MLflow (Chapter 30) formalize this, but even a spreadsheet works.
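Even without MLflow, the log can be as simple as a CSV with one row per experiment. A minimal sketch (the field names and file path are illustrative; the metric value matches iteration 4 above):

```python
import csv
import os

# One record per experiment: what you ran and what it scored
record = {
    "run_date": "2024-06-01",
    "iteration": 4,
    "model": "gradient_boosting",
    "features": "basic + temporal + engagement",
    "val_auc_roc": 0.82,
    "notes": "switched algorithm, same features as iteration 3",
}

log_path = "experiment_log.csv"
write_header = not os.path.exists(log_path)  # header only for a new file
with open(log_path, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    if write_header:
        writer.writeheader()
    writer.writerow(record)
```

Appending rather than overwriting preserves the full history, which is precisely what makes the log useful six months later.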
2.7 The Train/Validation/Test Split
You already know about train/test splits from your introductory course. In production ML, we need three partitions, not two.
Why Three Partitions?
- Training set (60-70%): Fit the model.
- Validation set (15-20%): Tune hyperparameters, select features, compare models. You look at this repeatedly.
- Test set (15-20%): Final evaluation. You look at this once. Once. Not twice. Once.
The validation set is the model selection set. You use it to decide which features, hyperparameters, and algorithms to keep. Because you make decisions based on validation performance, the validation set is "consumed" — it is no longer an unbiased estimate of generalization.
The test set is your honest estimate of how the model will perform on truly unseen data. It stays sealed until the very end. If you peek at the test set, tweak your model, and re-evaluate, it is no longer a test set — it is a second validation set, and your reported performance is optimistic.
from sklearn.model_selection import train_test_split
# Step 1: Split off the test set (hold out 20%)
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=y
)
# Step 2: Split training into train and validation (hold out 20% of remaining)
X_train, X_val, y_train, y_val = train_test_split(
X_trainval, y_trainval, test_size=0.20, random_state=42, stratify=y_trainval
)
print(f"Training: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Churn rate - Train: {y_train.mean():.3f}, Val: {y_val.mean():.3f}, Test: {y_test.mean():.3f}")
Training: 32,000 samples (64.0%)
Validation: 8,000 samples (16.0%)
Test: 10,000 samples (20.0%)
Churn rate - Train: 0.082, Val: 0.082, Test: 0.082
Notice the stratify=y parameter. This ensures each partition has the same churn rate as the original dataset. Without stratification, random chance might put 10% churners in the training set and 5% in the test set, making evaluation misleading.
Time-Based Splits
For problems with temporal structure — which includes most real-world ML — random splits are wrong. If your data spans January through December, a random split might put June data in the training set and March data in the test set. The model would be training on future data and testing on past data. That is information leakage.
For StreamFlow, the correct approach is a temporal split:
import pandas as pd
# Assume df is sorted by observation_month
# Data spans 2023-01 through 2024-06 (18 months)
# Training: 2023-01 through 2023-12 (12 months)
train_mask = df["observation_month"] < "2024-01"
# Validation: 2024-01 through 2024-03 (3 months)
val_mask = (df["observation_month"] >= "2024-01") & (df["observation_month"] < "2024-04")
# Test: 2024-04 through 2024-06 (3 months)
test_mask = df["observation_month"] >= "2024-04"
df_train = df[train_mask].copy()
df_val = df[val_mask].copy()
df_test = df[test_mask].copy()
print(f"Training: {len(df_train):,} rows ({df_train['observation_month'].min()} to {df_train['observation_month'].max()})")
print(f"Validation: {len(df_val):,} rows ({df_val['observation_month'].min()} to {df_val['observation_month'].max()})")
print(f"Test: {len(df_test):,} rows ({df_test['observation_month'].min()} to {df_test['observation_month'].max()})")
Common Mistake — Using random train/test splits on time-series data. If customer behavior changed in 2024 (maybe StreamFlow raised prices), a model trained on randomly shuffled 2023-2024 data will see 2024 patterns in its training set and appear to predict 2024 data well. In production, it will only ever predict the future — data it has never seen. A temporal split mirrors this reality.
Cross-Validation
In practice, a single train/validation split is unstable — the performance estimate depends on which samples ended up in which partition. Cross-validation averages over multiple splits:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Build a simple pipeline
pipeline = make_pipeline(
StandardScaler(),
LogisticRegression(random_state=42, max_iter=1000),
)
# 5-fold stratified cross-validation on the training+validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
pipeline, X_trainval, y_trainval, cv=cv, scoring="roc_auc"
)
print(f"Cross-validated AUC-ROC: {scores.mean():.3f} (+/- {scores.std():.3f})")
print(f"Per-fold scores: {[f'{s:.3f}' for s in scores]}")
Cross-validated AUC-ROC: 0.731 (+/- 0.012)
Per-fold scores: ['0.725', '0.742', '0.718', '0.736', '0.733']
Low standard deviation across folds (0.012) means the estimate is stable. High standard deviation (> 0.05) means you may not have enough data, or your data is highly variable.
For temporal data, use TimeSeriesSplit instead of StratifiedKFold:
from sklearn.model_selection import TimeSeriesSplit
# TimeSeriesSplit: each fold uses earlier data for training, later data for validation
ts_cv = TimeSeriesSplit(n_splits=5)
# Note: TimeSeriesSplit expects data sorted chronologically
# Each successive fold uses more training data and a later validation window
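Under the same assumptions as the earlier snippet (a scikit-learn pipeline and chronologically sorted features), the temporal folds can be scored just like the stratified ones. Here synthetic data stands in for the StreamFlow training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for chronologically sorted StreamFlow training data
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each fold trains on an expanding window of past data, validates on the next block
ts_cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=ts_cv, scoring="roc_auc")
print(f"Temporal CV AUC-ROC: {scores.mean():.3f} (+/- {scores.std():.3f})")

for i, (train_idx, val_idx) in enumerate(ts_cv.split(X)):
    print(f"Fold {i}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Note the asymmetry with `StratifiedKFold`: later folds have more training data than earlier ones, so per-fold scores are not directly comparable to each other, only in aggregate.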
2.8 Stage 6: Offline Evaluation
Offline evaluation is your last checkpoint before the model touches real users. It happens on the test set, and it happens once.
The Evaluation Checklist
Before declaring a model ready for deployment, verify:
- Performance exceeds the baseline. By a meaningful margin. If AUC-ROC improved from 0.50 (baseline) to 0.52, that is noise, not signal.
- Performance is stable across subgroups. Does the model predict equally well for new subscribers (tenure < 3 months) and long-tenure subscribers? For users on different plans? Different countries? If performance varies wildly by subgroup, you have a fairness and reliability problem.
- The model is calibrated. When it says P(churn) = 0.4, do approximately 40% of those subscribers actually churn? Use a calibration curve to check.
- There is no leakage. This is so important it gets its own section (2.9).
- Performance on the test set matches the validation set. A large gap (test performance significantly worse) suggests overfitting to the validation set through repeated tuning.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# Assume we have a trained model and test predictions
# y_test: true labels, y_prob: predicted probabilities
def plot_evaluation_dashboard(y_test, y_prob, model_name="Model"):
    """
    Plot a concise evaluation dashboard for binary classification.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. ROC Curve
    from sklearn.metrics import roc_curve, auc
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    axes[0].plot(fpr, tpr, label=f"{model_name} (AUC={roc_auc:.3f})")
    axes[0].plot([0, 1], [0, 1], "k--", label="Random (AUC=0.500)")
    axes[0].set_xlabel("False Positive Rate")
    axes[0].set_ylabel("True Positive Rate")
    axes[0].set_title("ROC Curve")
    axes[0].legend()

    # 2. Precision-Recall Curve
    from sklearn.metrics import precision_recall_curve, average_precision_score
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    ap = average_precision_score(y_test, y_prob)
    axes[1].plot(recall, precision, label=f"{model_name} (AP={ap:.3f})")
    baseline_rate = y_test.mean()
    axes[1].axhline(y=baseline_rate, color="k", linestyle="--", label=f"Baseline ({baseline_rate:.3f})")
    axes[1].set_xlabel("Recall")
    axes[1].set_ylabel("Precision")
    axes[1].set_title("Precision-Recall Curve")
    axes[1].legend()

    # 3. Calibration Curve
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
    axes[2].plot(prob_pred, prob_true, "s-", label=model_name)
    axes[2].plot([0, 1], [0, 1], "k--", label="Perfectly Calibrated")
    axes[2].set_xlabel("Mean Predicted Probability")
    axes[2].set_ylabel("Fraction of Positives")
    axes[2].set_title("Calibration Curve")
    axes[2].legend()

    plt.tight_layout()
    plt.savefig("evaluation_dashboard.png", dpi=150, bbox_inches="tight")
    plt.show()
Try It — Extend the evaluation dashboard with a fourth panel: a histogram of predicted probabilities, split by actual class (churned vs. retained). Well-separated distributions indicate strong discriminative power.
2.9 Data Leakage: The Silent Killer
Data leakage is the single most dangerous pitfall in machine learning. It occurs when your training data contains information that would not be available at prediction time. Leaky models perform spectacularly on your test set and catastrophically in production.
Why Leakage Is So Common
Data leakage is not a rare edge case. It is staggeringly common. Kapoor and Narayanan surveyed machine-learning-based research across more than a dozen scientific fields and found evidence of data leakage in hundreds of published papers. If peer-reviewed academic papers — with multiple rounds of review — contain leakage, imagine how often it occurs in internal corporate projects with no external review.
The reason is structural: data leakage is easy to introduce and hard to detect. The model will train happily on leaky data. It will report excellent metrics. It will pass code review (unless the reviewer understands the data deeply). The only reliable detector is a human who asks, "Could I actually compute this feature at the moment I need to make the prediction?"
Types of Data Leakage
Target leakage — Features that are derived from or correlated with the target variable in ways that would not exist at prediction time.
StreamFlow example: Including cancellation_reason as a feature. This field is only populated after a subscriber cancels — it is literally a consequence of the target event. A model that sees this feature learns the trivial rule "if cancellation_reason is not null, predict churn." AUC-ROC: 0.99. Production value: zero.
Temporal leakage — Using future data to predict past events.
StreamFlow example: When building training data for January predictions, accidentally including February usage data as features. The model learns that subscribers who watched 0 hours in February (because they had already canceled) are likely to churn. True, but useless.
Train/test contamination — Information from the test set leaking into the training process.
Example: Fitting a StandardScaler on the entire dataset before splitting. The scaler's mean and standard deviation incorporate test set statistics, giving the model a small but real advantage on the test set.
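The contamination is easy to demonstrate: a scaler fit before the split quietly memorizes test-set statistics. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# WRONG: the scaler sees test-set statistics before any split is honored
leaky_scaler = StandardScaler().fit(X)

# RIGHT: fit on training data only, then transform both partitions
clean_scaler = StandardScaler().fit(X_train)
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)

# The two scalers learn slightly different statistics; that difference
# is exactly the test-set information the leaky version absorbed
print("Mean learned from all data:  ", leaky_scaler.mean_.round(3))
print("Mean learned from train only:", clean_scaler.mean_.round(3))
```

The difference is small here, which is precisely why this kind of leakage survives code review: it inflates test performance by a little, invisibly, rather than by an amount anyone would notice.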
War Story: The Model That Was Too Good
War Story — A team at a healthcare analytics company built a readmission prediction model that achieved AUC-ROC of 0.97. The team was ecstatic. Their VP scheduled a board presentation. Then a junior data scientist asked a quiet question: "Why is discharge_disposition our most important feature?"
They investigated. discharge_disposition included categories like "discharged to home" and "discharged to skilled nursing facility." It also included "expired" — the patient died during the hospitalization. Dead patients are never readmitted. The model had learned: if discharge_disposition == expired, predict no readmission. Technically correct. Clinically useless.
But the leakage was subtler than that. Patients discharged to skilled nursing facilities had different readmission patterns, and the discharge disposition was determined partly based on the same clinical factors the model was supposed to predict. The feature was a proxy for the outcome, not a predictor of it.
After removing the feature and auditing the remaining variables, the model's AUC-ROC dropped to 0.73. Still useful. But the board presentation was postponed.
How to Detect Leakage
Leakage announces itself through suspiciously good performance. Here are the warning signs:
- AUC-ROC > 0.95 on your first model. Real-world prediction problems rarely achieve this without leakage. If your first attempt hits 0.97, be suspicious, not excited.
- Feature importance dominated by one or two features. If a single feature explains 80% of the prediction, ask: is this feature available at prediction time? Is it a proxy for the target?
- Large gap between cross-validation and production performance. If offline AUC is 0.92 but the deployed model performs like random chance, you almost certainly have leakage.
def check_for_leakage_signals(model, X_test, y_test, feature_names):
    """
    Check for common leakage signals in a trained model.
    """
    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Signal 1: Suspiciously high AUC
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    if auc > 0.95:
        print(f"WARNING: AUC-ROC = {auc:.3f}. This is suspiciously high.")
        print("  -> Investigate top features for target leakage.")

    # Signal 2: Single feature dominance
    if hasattr(model, "feature_importances_"):
        importances = model.feature_importances_
    elif hasattr(model, "coef_"):
        importances = np.abs(model.coef_[0])
    else:
        print("Cannot extract feature importances from this model type.")
        return

    top_idx = np.argsort(importances)[::-1]
    top_importance = importances[top_idx[0]] / importances.sum()
    if top_importance > 0.5:
        print(f"WARNING: Top feature '{feature_names[top_idx[0]]}' accounts "
              f"for {top_importance:.1%} of total importance.")
        print("  -> Verify this feature is available at prediction time.")
        print("  -> Check if it is a proxy for the target variable.")

    # Signal 3: Print top 5 features for manual review
    print("\nTop 5 features by importance:")
    for i in range(min(5, len(feature_names))):
        idx = top_idx[i]
        print(f"  {i+1}. {feature_names[idx]}: {importances[idx]:.4f}")
How to Prevent Leakage
- Define the prediction point explicitly. For every feature, ask: "Would I have this value at the moment I need to make the prediction?" If no, exclude it.
- Compute features only from data before the prediction date. This is the temporal boundary. No exceptions.
- Fit preprocessors on training data only. Scalers, encoders, imputers — all of them must be fit on the training set and then applied (transform only) to validation and test sets. scikit-learn Pipelines enforce this automatically, which is one of the main reasons to use them.
- Be paranoid about derived features. If a feature was computed by another team or system, trace its lineage. How was it computed? What data went into it? A "customer_risk_score" from the finance team might incorporate information about whether the customer eventually defaulted.
- Test with a temporal holdout. Hold out the most recent data as your test set. If the model performs well on historical data but poorly on the most recent data, leakage is a likely culprit.
Common Mistake — Performing feature selection before splitting. If you compute mutual information between features and the target on the entire dataset, then split, the test set was involved in selecting which features to keep. The selection step must happen inside the cross-validation loop or on the training set only.
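One safe pattern, sketched here on pure-noise synthetic data, is to put the selection step inside a scikit-learn Pipeline so that each cross-validation fold selects features from its own training portion only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 50))
y = rng.integers(0, 2, size=400)  # pure noise: there is no real signal

# The selection step lives INSIDE the pipeline, so each CV fold selects
# features using only that fold's training portion
pipeline = make_pipeline(
    SelectKBest(mutual_info_classif, k=5),
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")

# On noise, honest evaluation hovers near 0.5; selecting features on the
# full dataset first would push this estimate optimistically upward
print(f"AUC on pure noise with in-fold selection: {scores.mean():.3f}")
```

Running the same selection on the full dataset before splitting lets the selector cherry-pick features that correlate with the target by chance, including in the held-out folds, and the estimated AUC climbs above 0.5 on data that contains no signal at all.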
2.10 Stage 7: Deployment and Online Evaluation
Your model passed offline evaluation. Now it needs to face real users. Deployment is where data science meets software engineering, and the culture clash is real.
The Deployment Spectrum
Models can be deployed in several ways, from simple to complex:
- Batch scoring: Run the model on a schedule (nightly, weekly). Score all subscribers, write predictions to a database table, and let downstream systems consume them. Simplest. Most common. Often good enough.
- Real-time API: Wrap the model in a REST API. Downstream systems send a request with features, the API returns a prediction. Required when predictions must be made at the moment of interaction (e.g., fraud detection at checkout).
- Embedded model: The model runs inside another application, loaded from a serialized file. Common in mobile apps and edge devices.
For StreamFlow's churn model, batch scoring is appropriate. We score all active subscribers on the first of each month. The retention team has days, not milliseconds, to act on the predictions.
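To make the batch option concrete, here is a hypothetical sketch. DummyChurnModel, score_batch, and the output columns are illustrative placeholders, not StreamFlow's actual pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the trained churn model
class DummyChurnModel:
    def predict_proba(self, X):
        p = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))  # toy scores in (0, 1)
        return np.column_stack([1 - p, p])

def score_batch(model, features: pd.DataFrame, score_date: str) -> pd.DataFrame:
    """Score every subscriber once and return a table for downstream systems."""
    probs = model.predict_proba(features.values)[:, 1]
    return pd.DataFrame({
        "subscriber_id": features.index,
        "score_date": score_date,
        "p_churn": probs,
    })

features = pd.DataFrame(
    np.random.default_rng(3).normal(size=(5, 4)),
    index=[f"sub_{i}" for i in range(5)],
)
predictions = score_batch(DummyChurnModel(), features, "2024-07-01")
print(predictions)
# Downstream consumption would be a scheduled table write, e.g. via DataFrame.to_sql
```

The output table, not a live API, is the contract: the retention team queries it, and the scoring job can be rerun, backfilled, and audited like any other batch job.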
Online Evaluation: A/B Testing the Model
Offline metrics tell you how the model performs on historical data. Online evaluation — typically an A/B test — tells you whether the model improves actual business outcomes.
For StreamFlow, the A/B test design:
- Control group: Subscribers receive the existing retention strategy (random selection of at-risk accounts based on tenure heuristics).
- Treatment group: Subscribers receive model-guided retention offers.
- Primary metric: Monthly churn rate.
- Duration: 3 months (to capture long-term effects and seasonal patterns).
- Sample size: Power analysis determines this (Chapter 3 covers A/B testing in depth).
Production Tip — Never deploy a model to 100% of users on day one. Start with a shadow deployment (model scores users but no action is taken — compare predictions to outcomes). Then ramp to 5%, 10%, 25%, 50%, 100%, monitoring metrics at each step. This is called a canary deployment, and it catches problems before they affect all users.
The Model Registry
When you have a model in production, you need to track which version is running, what data it was trained on, and when it was last retrained. A model registry solves this:
# Conceptual example using MLflow (covered in detail in Chapter 30)
import mlflow
# Log the model with metadata
with mlflow.start_run(run_name="churn_v2.1_2024Q1"):
    mlflow.log_param("model_type", "LightGBM")
    mlflow.log_param("training_data", "2023-01 to 2023-12")
    mlflow.log_param("n_features", 28)
    mlflow.log_metric("auc_roc_test", 0.847)
    mlflow.log_metric("precision_at_15k", 0.312)
    mlflow.sklearn.log_model(pipeline, "churn_model")
Without a model registry, you end up with the dreaded "which version of the model is running in production?" question that nobody can answer. We will build a full experiment tracking and model registry workflow in Chapters 29 and 30.
Shadow Deployment
Before switching to full deployment, run a shadow deployment: the model scores every subscriber on the same schedule, but no action is taken on the predictions. Instead, you compare predictions against actual outcomes after the prediction window closes.
Shadow deployment catches problems that offline evaluation misses:
- Feature computation failures (a join that silently drops 5% of rows)
- Feature distribution differences between training data and live data
- Latency issues (the pipeline takes longer than expected on full production data)
- Infrastructure failures (the database query times out on the production cluster)
Two to four weeks of shadow deployment is standard before any model goes live.
2.11 Stage 8: Monitoring and Maintenance
Deployment is not the finish line. It is the starting line for a new set of problems.
Why Models Decay
Models assume that the future looks like the past. When the world changes, the model's assumptions break. This is called model decay or concept drift.
For StreamFlow, drift can come from:
- Product changes: StreamFlow launches a new "Family Plan." The model has never seen this plan type. Feature distributions shift.
- Competitive changes: A rival streaming service launches with a free trial. Churn spikes for reasons the model cannot capture because it has no features about competitors.
- Seasonal patterns: Churn increases every January (New Year's resolution to cut subscriptions) and decreases every November (holiday content releases). If the model was trained on March–September data, it has not seen these patterns.
- Data pipeline changes: The engineering team changes how watch events are logged. What used to be one event per session is now one event per episode. The hours_watched feature doubles overnight — not because behavior changed, but because the measurement changed.
What to Monitor
- Prediction distribution: Is the model outputting the same distribution of probabilities as it did at training time? A sudden shift (e.g., the model starts predicting < 0.1 for everyone) indicates a problem.
- Feature distributions: Have the input features drifted? Compare current feature statistics to training-time statistics using metrics like the Population Stability Index (PSI) or Kolmogorov-Smirnov tests.
- Actuals vs. predictions: Once ground truth becomes available (did the subscriber actually churn?), compare it to the predictions. This is delayed — you have to wait 30 days — but it is the definitive check.
- Business metrics: Is the churn rate actually decreasing? Are retention offers being accepted? Is net revenue improving?
import numpy as np

def monitor_prediction_drift(
    training_predictions: np.ndarray,
    current_predictions: np.ndarray,
    threshold: float = 0.1,
) -> dict:
    """
    Compare the distribution of predictions between training time
    and current scoring. Returns drift metrics.
    """
    from scipy.stats import ks_2samp

    # Kolmogorov-Smirnov test
    ks_stat, ks_pvalue = ks_2samp(training_predictions, current_predictions)

    # Summary statistics comparison
    train_mean = training_predictions.mean()
    current_mean = current_predictions.mean()
    mean_shift = abs(current_mean - train_mean)

    drift_detected = ks_stat > threshold
    result = {
        "ks_statistic": ks_stat,
        "ks_pvalue": ks_pvalue,
        "train_mean": train_mean,
        "current_mean": current_mean,
        "mean_shift": mean_shift,
        "drift_detected": drift_detected,
    }
    if drift_detected:
        print(f"ALERT: Prediction drift detected (KS={ks_stat:.3f}, p={ks_pvalue:.4f})")
        print(f"  Training mean: {train_mean:.3f}, Current mean: {current_mean:.3f}")
    else:
        print(f"No significant drift (KS={ks_stat:.3f})")
    return result
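The Population Stability Index mentioned above can be computed in a few lines of NumPy. This is one common formulation (baseline-quantile bins), not the only one:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (training-time) sample and a current sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    # Bin edges from baseline quantiles (robust to skewed distributions)
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0) on empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(7)
baseline = rng.normal(0, 1, 10_000)
psi_same = population_stability_index(baseline, rng.normal(0, 1, 10_000))
psi_shift = population_stability_index(baseline, rng.normal(0.5, 1, 10_000))
print(f"Same distribution:    PSI = {psi_same:.3f}")
print(f"Shifted distribution: PSI = {psi_shift:.3f}")
```

Unlike the KS test, PSI works the same way on features and on predictions, which makes it convenient for a single monitoring dashboard covering both.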
War Story — A recommendation model at an e-commerce company (similar to ShopSmart) was deployed and monitored on click-through rate. CTR looked great for two months. Then a product manager noticed that revenue per session was declining. The model was recommending items that users clicked on but did not buy — optimizing for engagement rather than conversion. The team had monitored the wrong metric. When they added revenue-per-session as a guardrail metric, they caught the problem immediately.
Technical Debt in ML Systems
Deploying a model creates technical debt — the ongoing cost of maintaining the system. Sculley et al. (2015) famously argued that the ML code in a production system is a tiny fraction of the total system. The rest is:
- Data collection and validation pipelines
- Feature extraction and storage
- Model serving infrastructure
- Monitoring and alerting
- Configuration management
- Process management and orchestration
+------------------------------------------------------+
| ML SYSTEM IN PRODUCTION |
| |
| +--------+ +--------+ +---------+ +----------+ |
| | Data | |Feature | | Model | | Serving | |
| |Pipeline | | Store | |Training | | API | |
| +--------+ +--------+ +---------+ +----------+ |
| |
| +--------+ +--------+ +---------+ +----------+ |
| |Monitor-| | Alert | | Retrain | | Config | |
| | ing | | System | |Schedule | | Manage. | |
| +--------+ +--------+ +---------+ +----------+ |
| |
| +-------------------------------------------+ |
| | The actual ML model code | |
| | (5% of the total system) | |
| +-------------------------------------------+ |
+------------------------------------------------------+
Every additional model in production multiplies this infrastructure. This is why experienced teams resist deploying new models unless the expected value clearly exceeds the maintenance cost.
2.12 The ShopSmart Contrast: A Different Workflow for Recommendations
Not every ML problem follows the same workflow pattern. StreamFlow's churn model is a standard binary classification problem with batch scoring. Let us briefly contrast it with a different kind of ML system.
ShopSmart, the e-commerce marketplace with 14 million monthly users, needs a product recommendation model. Here is how the workflow differs:
| Aspect | StreamFlow Churn | ShopSmart Recommendations |
|---|---|---|
| Target | Binary: churned or not | Implicit: clicked, purchased, or added to cart |
| Scoring | Monthly batch | Real-time (< 100ms per request) |
| Evaluation | Offline AUC-ROC, then A/B test | Offline recall@K, then A/B test on CTR and revenue |
| Cold start | Not an issue (all subscribers have history) | Major issue (new users, new products) |
| Scale | 2.4M predictions per month | 14M users x 50 page views = 700M predictions per month |
| Feedback loop | 30-day delay for ground truth | Immediate (did the user click?) |
The recommendation workflow has the same eight stages, but the emphasis shifts: data collection involves implicit signals (clicks, not labels), evaluation must handle the cold-start problem, and deployment requires real-time serving infrastructure. The same principles apply — framing matters, baselines matter, leakage detection matters — but the implementation looks very different.
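For reference, the recall@K metric in the table above is simple to compute. A minimal sketch (recall_at_k is an illustrative helper, not a library function):

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations.
    Illustrative helper, not a library function."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

# Toy example: the user actually interacted with items 7, 42, and 99
recs = [42, 3, 7, 18, 55, 99, 2]
print(recall_at_k(recs, {7, 42, 99}, k=5))  # 2 of the 3 relevant items are in the top 5
```

In practice this is averaged over many users, and the "relevant" set comes from implicit signals (clicks, purchases) rather than explicit labels, which is exactly the data-collection difference noted in the table.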
2.13 Putting It All Together: The Iterative Nature of ML
The ML workflow is not a waterfall. It is an iterative loop with feedback at every stage.
Business Question
|
v
Problem Framing <---+
| |
v |
Data Collection <---+|
| ||
v ||
Baseline Model ||
| ||
v ||
Feature / Model ----+| (iterate: features, algorithms,
Iteration | hyperparameters)
| |
v |
Offline Evaluation ---+ (if not good enough, go back)
|
v
Deploy + A/B Test
|
v
Monitor ----------> Retrain (go back to data collection
| or feature engineering)
v
Maintain (forever)
In practice, you may cycle through the middle stages dozens of times. You may also jump backwards: monitoring reveals drift, which triggers retraining, which requires new data collection, which reveals that the data source changed, which forces you to re-frame the problem. This is normal. This is the job.
Production Tip — Budget your time accordingly. A common split for an ML project:
- Problem framing and data understanding: 20%
- Data collection and pipeline building: 25%
- Feature engineering: 20%
- Modeling and evaluation: 15%
- Deployment and monitoring: 20%
Note that modeling — the step most people think of as "doing data science" — is 15%. The other 85% is what separates a Kaggle notebook from a production system.
2.14 Progressive Project: Framing the StreamFlow Churn Problem
This section is your first hands-on milestone for the progressive project. By the end, you will have a formal problem frame for the StreamFlow churn prediction system.
Task
Create a Jupyter notebook (or add to your existing one from Chapter 1) with the following sections.
1. Define the target variable
# Target variable definition
target_config = {
"name": "canceled_within_30_days",
"type": "binary",
"positive_class": 1, # subscriber canceled
"negative_class": 0, # subscriber retained
"definition": (
"1 if the subscriber's subscription status transitions to "
"'canceled' within 30 calendar days of the observation date; "
"0 otherwise."
),
"exclusions": [
"Subscribers on free trials (not yet paying customers)",
"Subscribers who were involuntarily churned due to payment failure "
"(separate model for payment recovery)",
"Subscribers in their first 7 days (too early to predict meaningfully)",
],
}
2. Define the observation unit
# Observation unit definition
observation_unit = {
"grain": "subscriber-month",
"description": (
"One row per active subscriber per calendar month. "
"Snapshot taken on the first day of the month."
),
"subscriber_id": "Unique subscriber identifier",
"observation_month": "First day of the calendar month (e.g., 2024-01-01)",
"expected_rows_per_month": "~2,400,000 (total active subscribers)",
"expected_training_rows": "~28,800,000 (12 months x 2.4M subscribers)",
}
3. Define feature categories (no data yet — just the categories)
# Feature categories for the churn model
feature_categories = {
"subscription_metadata": {
"description": "Static and slowly-changing subscription attributes",
"examples": [
"plan_type (Basic/Standard/Premium)",
"tenure_months",
"monthly_price",
"payment_method (credit_card/paypal/bank_transfer)",
"has_annual_plan",
],
},
"usage_behavior": {
"description": "Content consumption patterns",
"examples": [
"hours_watched_last_7d / 14d / 30d / 60d / 90d",
"sessions_last_30d",
"unique_titles_watched_last_30d",
"completion_rate_last_30d",
"device_count",
],
},
"engagement_signals": {
"description": "Non-consumption interactions with the platform",
"examples": [
"ratings_given_last_30d",
"watchlist_additions_last_30d",
"search_queries_last_30d",
"profile_updates_last_90d",
],
},
"usage_trends": {
"description": "Temporal patterns in usage",
"examples": [
"hours_watched_trend_3m (slope of last 3 months)",
"session_frequency_change_1m_vs_3m",
"days_since_last_watch",
"genre_diversity_change_3m",
],
},
"support_history": {
"description": "Customer support interactions",
"examples": [
"tickets_filed_last_90d",
"avg_resolution_time_hours",
"has_billing_complaint_last_90d",
"has_content_complaint_last_90d",
],
},
"billing_history": {
"description": "Payment and plan change events",
"examples": [
"failed_payments_last_90d",
"plan_downgrades_last_12m",
"plan_upgrades_last_12m",
"promo_discount_active",
"days_until_next_billing",
],
},
}
# Print summary
print("Feature categories for StreamFlow churn prediction:")
print(f"{'Category':<25} {'Example count':<15} {'First example'}")
print("-" * 70)
for cat_name, cat_info in feature_categories.items():
    print(f"{cat_name:<25} {len(cat_info['examples']):<15} {cat_info['examples'][0]}")
Feature categories for StreamFlow churn prediction:
Category Example count First example
----------------------------------------------------------------------
subscription_metadata 5 plan_type (Basic/Standard/Premium)
usage_behavior 5 hours_watched_last_7d / 14d / 30d / 60d / 90d
engagement_signals 4 ratings_given_last_30d
usage_trends 4 hours_watched_trend_3m (slope of last 3 months)
support_history 4 tickets_filed_last_90d
billing_history 5 failed_payments_last_90d
4. Document what is excluded and why
# Columns that must NOT be used as features (leakage risk)
excluded_features = {
"cancellation_reason": "Only populated after churn event (target leakage)",
"cancellation_date": "Directly encodes the target (target leakage)",
"retention_offer_sent": "Result of previous model; creates feedback loop",
"retention_offer_accepted": "Only exists for subscribers who received offers",
"subscriber_status": "Contains 'canceled' — is the target variable itself",
"last_active_date": "If after observation date, constitutes temporal leakage",
}
print("EXCLUDED FEATURES (leakage risk):")
for feature, reason in excluded_features.items():
    print(f"  {feature}: {reason}")
This problem framing document will guide every decision you make through the rest of the book. When you are in Chapter 14 debating whether to add another feature to your gradient boosting model, you will come back here to verify it does not violate the temporal boundary. When you are in Chapter 31 deploying the model, this document will define the API contract.
2.15 Summary
The ML workflow is eight stages, not three. Problem framing comes first and matters most. Every model must beat a stupid baseline. Data leakage is the most common and most dangerous pitfall. Offline evaluation happens once, on a sealed test set. Online evaluation (A/B testing) is where you prove the model creates business value. And monitoring never ends.
The single biggest takeaway from this chapter: most ML project failures are not modeling failures. They are framing failures, data failures, or evaluation failures. The algorithm is rarely the problem. The workflow around the algorithm is where projects live or die.
Bridge to Chapter 3
You now have the map. You know what the ML workflow looks like, where it can go wrong, and why each stage matters. But there is a gap in the workflow we mentioned but did not fill: how do you prove that a deployed model actually improves outcomes?
The answer is experimentation. Chapter 3 covers experimental design and A/B testing — the statistical framework for answering "did this change actually work, or did we just get lucky?" For the StreamFlow project, you will design the A/B test that will validate whether your churn model's retention offers actually reduce churn. It is the bridge between "the model works offline" and "the model creates value."
Key Terms Introduced
| Term | Definition |
|---|---|
| Problem framing | The process of translating a business question into a precise ML task with defined target, features, timing, and actions |
| Data pipeline | Code that extracts raw data from source systems, transforms it, and loads it into a format ready for modeling |
| Feature store | A centralized system for storing, managing, and serving pre-computed ML features |
| Train/validation/test split | Three-way partition of data: training (fit model), validation (tune and select), test (final unbiased evaluation) |
| Data leakage | When training data contains information not available at prediction time, producing overly optimistic offline performance |
| Baseline model | The simplest possible model that sets the performance floor; every real model must beat it |
| Offline evaluation | Assessing model performance on held-out historical data before deployment |
| Online evaluation (A/B test) | Measuring model impact in production by comparing treated vs. control groups |
| Model registry | A system for versioning, tracking, and managing deployed models |
| ML lifecycle | The full end-to-end process from problem framing through deployment to monitoring and maintenance |
| Technical debt | The ongoing maintenance cost of an ML system, including infrastructure, monitoring, and retraining |