Case Study 34-01: Priya Builds the Acme Churn Predictor

Characters: Priya Okonkwo (Senior Analyst, Acme Corp); Sandra Chen (VP of Sales)
Continuing from: Chapters 32 and 33 — Priya has already built the supply chain dashboard and the Streamlit operations portal
Setting: Acme Corp's Q1 planning cycle. Sandra has a new question.


The Question That Started Everything

Six weeks after Priya presented the supply chain dashboard to Acme's leadership team, Sandra Chen appeared in the doorway of the analytics office with a printout in her hand.

"Priya, I've been looking at our churn numbers. We lost eleven accounts last quarter. That's about $180,000 in annual recurring revenue. Some of them I saw coming — the signs were there. Others blindsided us completely." She set the printout down: a list of churned accounts with their revenue figures highlighted in red. "I want to know if you can build me a model that predicts which accounts are going to leave before they do. Give me enough warning that the team can actually do something about it."

Priya looked at the list. Eleven accounts. Four had been flagged as at-risk in weekly pipeline reviews, but seven had not. "Let me look at what data we have," she said.

What she found was encouraging. Acme's CRM system held two years of customer records. The transaction database she had already connected in Chapter 32 contained every order. There was also a support ticket export Marcus Webb had generated for the operations analysis. Three data sources — exactly the raw material a churn model needs.


Step 1: Building the Feature Dataset

Churn models require a specific type of data structure: a snapshot of each customer at some point in time, plus a label indicating whether they churned in the following period. You cannot train on raw transaction records directly because each customer appears many times. The job is to collapse all of that activity into a single row per customer.

Priya defined her observation window: she would look at each customer's behavior in the 90 days ending September 30, 2023, and use whether the account churned in Q4 (October through December) as the label.

import pandas as pd
import numpy as np
import sqlite3

def build_churn_dataset(
    db_path: str,
    sales_csv: str,
    tickets_csv: str,
    observation_end: str = "2023-09-30",
    label_end: str = "2023-12-31",
) -> pd.DataFrame:
    """
    Build a customer-level churn prediction dataset.

    Each row represents one customer. Features describe behavior during the
    90-day observation window. The target label (churned) is 1 if the customer
    placed no orders during the label period.

    Args:
        db_path: Path to the Acme SQLite database.
        sales_csv: Path to the sales transaction CSV export.
        tickets_csv: Path to the support tickets CSV export.
        observation_end: Last date of the feature window (inclusive).
        label_end: Last date of the label window (inclusive).

    Returns:
        DataFrame with one row per customer, ready for train/test split.
    """
    obs_end = pd.Timestamp(observation_end)
    obs_start = obs_end - pd.Timedelta(days=90)
    lbl_end = pd.Timestamp(label_end)

    # Load sales data
    sales = pd.read_csv(sales_csv, parse_dates=["order_date"])

    # Feature engineering from sales activity
    obs_sales = sales[
        (sales["order_date"] >= obs_start) & (sales["order_date"] <= obs_end)
    ]

    customer_features = obs_sales.groupby("customer_id").agg(
        total_orders=("order_id", "count"),
        total_spend=("order_value", "sum"),
        avg_order_value=("order_value", "mean"),
        unique_products=("product_id", "nunique"),
        last_order_date=("order_date", "max"),
    ).reset_index()

    # Recency: days since last order within the window
    customer_features["recency_days"] = (
        obs_end - customer_features["last_order_date"]
    ).dt.days

    # Load all customers from the database
    conn = sqlite3.connect(db_path)
    customers = pd.read_sql_query(
        "SELECT customer_id, segment, account_age_days, mrr FROM customers",
        conn,
    )
    conn.close()

    # Load support tickets
    tickets = pd.read_csv(tickets_csv, parse_dates=["created_date"])
    obs_tickets = tickets[
        (tickets["created_date"] >= obs_start) & (tickets["created_date"] <= obs_end)
    ]
    ticket_features = obs_tickets.groupby("customer_id").agg(
        support_tickets=("ticket_id", "count"),
        escalated_tickets=("escalated", "sum"),
    ).reset_index()

    # Build label: churned = 1 if no activity in label period
    label_sales = sales[
        (sales["order_date"] > obs_end) & (sales["order_date"] <= lbl_end)
    ]
    active_in_label = set(label_sales["customer_id"].unique())
    customers["churned"] = (
        ~customers["customer_id"].isin(active_in_label)
    ).astype(int)

    # Join all features
    dataset = (
        customers
        .merge(customer_features, on="customer_id", how="left")
        .merge(ticket_features, on="customer_id", how="left")
    )

    # Customers with no obs-window activity get zero-filled activity counts.
    # Recency is filled with the full window length: no order observed means
    # maximally stale, not "ordered today" (which fillna(0) would imply).
    fill_cols = [
        "total_orders", "total_spend", "avg_order_value",
        "unique_products", "support_tickets", "escalated_tickets",
    ]
    dataset[fill_cols] = dataset[fill_cols].fillna(0)
    dataset["recency_days"] = dataset["recency_days"].fillna(90)

    # Encode segment
    dataset = pd.get_dummies(dataset, columns=["segment"], drop_first=True)
    dataset.drop(columns=["last_order_date"], errors="ignore", inplace=True)

    return dataset

Priya ran this function and examined the result: 847 customer rows, a churn rate of 10.3%. The class imbalance was real but not extreme. Eighty-seven accounts had churned; 760 had not.
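The collapse-to-one-row-per-customer step is easy to verify in isolation. A minimal sketch with two made-up customers, using the same named-aggregation pattern as build_churn_dataset:

```python
import pandas as pd

# Two made-up customers: A orders twice, B once
sales = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "order_id": [1, 2, 3],
    "order_value": [100.0, 50.0, 200.0],
    "order_date": pd.to_datetime(["2023-07-15", "2023-09-20", "2023-08-01"]),
})
obs_end = pd.Timestamp("2023-09-30")

# Collapse the transaction log into one feature row per customer
features = sales.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    total_spend=("order_value", "sum"),
    last_order_date=("order_date", "max"),
).reset_index()
features["recency_days"] = (obs_end - features["last_order_date"]).dt.days

# A: 2 orders, recency 10 days; B: 1 order, recency 60 days
print(features[["customer_id", "total_orders", "recency_days"]])
```

Checking a toy frame like this by hand is the fastest way to catch off-by-one errors in the window boundaries before training on real data.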


Step 2: Training and Evaluating the Model

With the dataset assembled, Priya trained three models and compared them using cross-validation before choosing one for production.

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    f1_score, precision_score, recall_score,
)
import pandas as pd
import numpy as np


def train_and_compare_models(
    dataset: pd.DataFrame,
    target_col: str = "churned",
    id_col: str = "customer_id",
    random_state: int = 42,
) -> dict:
    """
    Train three candidate models and return cross-validation results.

    Args:
        dataset: Feature dataset from build_churn_dataset().
        target_col: Name of the binary target column.
        id_col: Column to exclude from features (identifier).
        random_state: Random seed for reproducibility.

    Returns:
        Dictionary mapping model name to mean and standard deviation of
        cross-validated F1 and ROC AUC scores.
    """
    feature_cols = [
        c for c in dataset.columns if c not in (target_col, id_col)
    ]
    X = dataset[feature_cols]
    y = dataset[target_col]

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    models = {
        "Logistic Regression": Pipeline([
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(
                class_weight="balanced",
                max_iter=1000,
                random_state=random_state,
            )),
        ]),
        "Decision Tree": DecisionTreeClassifier(
            max_depth=5,
            min_samples_leaf=10,
            class_weight="balanced",
            random_state=random_state,
        ),
        "Random Forest": RandomForestClassifier(
            n_estimators=100,
            max_depth=6,
            min_samples_leaf=5,
            class_weight="balanced",
            random_state=random_state,
        ),
    }

    results = {}
    for name, model in models.items():
        f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
        auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        results[name] = {
            "f1_mean": f1_scores.mean(),
            "f1_std": f1_scores.std(),
            "auc_mean": auc_scores.mean(),
            "auc_std": auc_scores.std(),
        }

    # Print comparison table
    print(f"\n{'Model':<25} {'F1 Mean':>10} {'F1 Std':>10} {'AUC Mean':>10} {'AUC Std':>10}")
    print("-" * 65)
    for name, metrics in results.items():
        print(
            f"{name:<25}"
            f" {metrics['f1_mean']:>10.3f}"
            f" {metrics['f1_std']:>10.3f}"
            f" {metrics['auc_mean']:>10.3f}"
            f" {metrics['auc_std']:>10.3f}"
        )

    return results

The cross-validation results came back:

Model                     F1 Mean     F1 Std   AUC Mean    AUC Std
-----------------------------------------------------------------
Logistic Regression         0.612      0.048      0.841      0.031
Decision Tree               0.588      0.067      0.804      0.044
Random Forest               0.649      0.039      0.873      0.025

Random Forest won on both metrics with lower variance — the better choice. But Priya knew Sandra's team would ask "why" for specific accounts. She chose to train the Random Forest for scoring but keep the logistic regression coefficients as an explanation tool, running both in the final pipeline.
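Pulling standardized coefficients out of a fitted pipeline takes only a few lines. A sketch on synthetic data (the features and the churn rule below are invented purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "recency_days": rng.integers(0, 90, 500),
    "total_orders": rng.integers(0, 20, 500),
})
# Synthetic label: churn is driven by high recency plus noise
y = (X["recency_days"] + rng.normal(0, 15, 500) > 60).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(X, y)

# Because features are standardized, coefficient signs and magnitudes
# are directly comparable across features
coefs = pd.Series(pipe.named_steps["clf"].coef_[0], index=X.columns)
print(coefs.sort_values(ascending=False))
```

On the real dataset the same two lines at the end produce a ranked, signed list Priya could read out in a meeting: positive coefficients push accounts toward churn, negative ones away.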


Step 3: Generating the At-Risk Customer Report

With the Random Forest selected, Priya wrote a single function that trains the production model, reports its held-out performance, and returns a ranked list of at-risk accounts:

def generate_churn_risk_report(
    dataset: pd.DataFrame,
    target_col: str = "churned",
    id_col: str = "customer_id",
    top_n: int = 20,
    random_state: int = 42,
) -> pd.DataFrame:
    """
    Train the production model and return top at-risk customers.

    Trains a Random Forest on 80% of data, evaluates on held-out 20%,
    then rescores the full dataset to identify highest-risk accounts.

    Args:
        dataset: Feature dataset from build_churn_dataset().
        target_col: Name of the binary target column.
        id_col: Customer identifier column.
        top_n: Number of at-risk accounts to return.
        random_state: Random seed.

    Returns:
        DataFrame of top at-risk customers with probability scores.
    """
    feature_cols = [
        c for c in dataset.columns if c not in (target_col, id_col)
    ]
    X = dataset[feature_cols]
    y = dataset[target_col]
    customer_ids = dataset[id_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=random_state, stratify=y
    )

    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=6,
        min_samples_leaf=5,
        class_weight="balanced",
        random_state=random_state,
    )
    model.fit(X_train, y_train)

    # Evaluate on held-out test set
    y_pred = model.predict(X_test)
    y_prob_test = model.predict_proba(X_test)[:, 1]

    print("\n=== Model Performance on Held-Out Test Set ===")
    print(classification_report(y_test, y_pred, target_names=["Active", "Churned"]))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob_test):.3f}")

    # Score all customers
    all_probs = model.predict_proba(X)[:, 1]

    risk_df = pd.DataFrame({
        "customer_id": customer_ids.values,
        "churn_probability": all_probs,
        "actual_churned": y.values,
    })

    # Feature importance summary
    importance_df = pd.DataFrame({
        "feature": feature_cols,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False)

    print("\n=== Top 10 Predictive Features ===")
    for _, row in importance_df.head(10).iterrows():
        bar = "#" * int(row["importance"] * 100)
        print(f"  {row['feature']:<30} {row['importance']:.3f}  {bar}")

    # Return top-risk accounts (excluding known churners — focus on prevention)
    top_risk = (
        risk_df[risk_df["actual_churned"] == 0]
        .sort_values("churn_probability", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )
    top_risk.index += 1  # 1-based ranking

    return top_risk


Step 4: The Results

When Priya ran the full pipeline, the model's performance on the held-out test set was:

=== Model Performance on Held-Out Test Set ===
              precision    recall  f1-score   support

      Active       0.97      0.92      0.94       152
     Churned       0.61      0.82      0.70        17

    accuracy                           0.91       169
   macro avg       0.79      0.87      0.83       169
weighted avg       0.93      0.91      0.92       169

ROC AUC: 0.931

The model caught 82% of the churners in the held-out test set — fourteen of seventeen accounts that actually left. Two of the three it missed had no support tickets, no declining order activity, and high recent spend: genuinely difficult to predict from behavioral signals alone.

The top predictive features:

=== Top 10 Predictive Features ===
  recency_days                   0.187  ##################
  total_orders                   0.164  ################
  mrr                            0.143  ##############
  total_spend                    0.118  ###########
  avg_order_value                0.097  #########
  support_tickets                0.082  ########
  escalated_tickets              0.071  #######
  account_age_days               0.058  #####
  unique_products                0.044  ####
  segment_Mid-Market             0.036  ###

Recency was the strongest single signal. Customers who had not placed an order in 60+ days during the observation window were dramatically more likely to churn — even if their historical spend was high. Total order frequency was next. Both findings aligned with what Sandra's most experienced account managers had said anecdotally for years. The model quantified the intuition.

The top 20 at-risk accounts the model flagged included four that Sandra's team had already noted as "worrying" — a useful validation. The other sixteen were accounts that had not appeared in any weekly at-risk discussion.


Step 5: Presenting to Sandra

Priya prepared a one-page summary for Sandra. She led with three numbers:

  • $180,000 in ARR was lost last quarter to churn
  • The model identifies 82% of churning accounts in advance (based on held-out validation)
  • The top 20 at-risk accounts represent $310,000 in ARR — accounts your team can reach before Q2 ends

Sandra studied the list. "Meridian Supply. They're on here. We just extended a contract with them last month." She looked up. "How confident are you in this?"

"The model isn't certain about any individual account," Priya said. "Meridian has a 71% predicted churn probability. The features driving it are a 49-day gap in orders and two escalated support tickets. That doesn't mean they'll definitely leave — but it means something changed, and it's worth a conversation."

Sandra forwarded the list to her four account managers that afternoon with a note: "This is Priya's model output. Treat these as priorities for Q1 QBR outreach. Ask me if you have questions."


What This Case Study Demonstrates

The churn model arc covers the complete workflow:

  1. Problem framing — defining the observation window, label period, and business objective before touching code

  2. Feature engineering — collapsing transaction-level data into customer-level features (recency, frequency, monetary, support volume)

  3. Class imbalance handling — using class_weight="balanced" and stratify=y in the split

  4. Model comparison with cross-validation — not choosing a model based on a single run

  5. Actionable output — the final deliverable is a prioritized list with probabilities, not a metric summary

  6. Honest communication — explaining what the model can and cannot say about individual accounts

The gap between the model's recall (82%) and precision (61%) is worth noting: for every ten accounts the model flags as high-risk, roughly four will not actually churn. Sandra's team will make some unnecessary calls. But making those calls costs far less than missing the six accounts that do churn. That asymmetry — not model elegance — is the reason churn models are worth building.
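The arithmetic behind that trade-off is worth making explicit, using the test-set figures from Step 4:

```python
precision = 0.61  # of accounts flagged as churn-risk, fraction that churn
recall = 0.82     # of accounts that churn, fraction the model flags
churners = 17     # churned accounts in the held-out test set

flagged_true = recall * churners              # churners the model catches
flagged_total = flagged_true / precision      # total accounts flagged
flagged_false = flagged_total - flagged_true  # unnecessary outreach calls

print(f"flagged: {flagged_total:.0f}, of which {flagged_false:.0f} won't churn")
# → flagged: 23, of which 9 won't churn
```

Roughly 23 flagged accounts, 9 of them false alarms — the same 6-in-10 precision the classification report shows, just expressed in calls the account managers will actually make.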


Key Numbers from This Case Study

Metric                                    Value
-------------------------------------------------------------
Training dataset size                     847 customers
Churn rate in dataset                     10.3%
Cross-validated F1 (Random Forest)        0.649 ± 0.039
Cross-validated AUC (Random Forest)       0.873 ± 0.025
Test-set recall (churn class)             82%
Test-set precision (churn class)          61%
Test-set AUC                              0.931
Top features                              Recency, order frequency, MRR
ARR at stake in top 20 flagged accounts   $310,000