Case Study 34-01: Priya Builds the Acme Churn Predictor
Characters: Priya Okonkwo (Senior Analyst, Acme Corp), Sandra Chen (VP of Sales)
Continuing from: Chapters 32 and 33 — Priya has already built the supply chain dashboard and the Streamlit operations portal
Setting: Acme Corp's Q1 planning cycle. Sandra has a new question.
The Question That Started Everything
Six weeks after Priya presented the supply chain dashboard to Acme's leadership team, Sandra Chen appeared in the doorway of the analytics office with a printout in her hand.
"Priya, I've been looking at our churn numbers. We lost eleven accounts last quarter. That's about $180,000 in annual recurring revenue. Some of them I saw coming — the signs were there. Others blindsided us completely." She set the printout down: a list of churned accounts with their revenue figures highlighted in red. "I want to know if you can build me a model that predicts which accounts are going to leave before they do. Give me enough warning that the team can actually do something about it."
Priya looked at the list. Eleven accounts. Four had been flagged as at-risk in weekly pipeline reviews, but seven had not. "Let me look at what data we have," she said.
What she found was encouraging. Acme's CRM system held two years of customer records. The transaction database she had already connected in Chapter 32 contained every order. There was also a support ticket export Marcus Webb had generated for the operations analysis. Three data sources covering purchases, account attributes, and support interactions — the raw material a churn model needs.
Step 1: Building the Feature Dataset
Churn models require a specific type of data structure: a snapshot of each customer at some point in time, plus a label indicating whether they churned in the following period. You cannot train on raw transaction records directly because each customer appears many times. The job is to collapse all of that activity into a single row per customer.
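In miniature, the collapse from transactions to one row per customer looks like this (toy data; the column names are hypothetical but mirror the real exports):

```python
import pandas as pd

# Three raw transactions across two customers...
tx = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "order_id": [101, 102, 103],
    "order_value": [100.0, 50.0, 200.0],
})

# ...collapsed into a single feature row per customer.
per_customer = tx.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    total_spend=("order_value", "sum"),
).reset_index()
print(per_customer)
```

Customer A's two orders become one row with `total_orders=2` and `total_spend=150.0`. The real function below does the same thing with more features and a time window.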
Priya defined her observation window: she would look at each customer's behavior in the 90 days ending September 30, 2023, and use whether the account churned in Q4 (October through December) as the label.
```python
import sqlite3

import pandas as pd


def build_churn_dataset(
    db_path: str,
    sales_csv: str,
    tickets_csv: str,
    observation_end: str = "2023-09-30",
    label_end: str = "2023-12-31",
) -> pd.DataFrame:
    """
    Build a customer-level churn prediction dataset.

    Each row represents one customer. Features describe behavior during the
    90-day observation window. The target label (churned) is 1 if the customer
    had no purchase activity in the label period.

    Args:
        db_path: Path to the Acme SQLite database.
        sales_csv: Path to the sales transaction CSV export.
        tickets_csv: Path to the support tickets CSV export.
        observation_end: Last date of the feature window (inclusive).
        label_end: Last date of the label window (inclusive).

    Returns:
        DataFrame with one row per customer, ready for train/test split.
    """
    obs_end = pd.Timestamp(observation_end)
    obs_start = obs_end - pd.Timedelta(days=90)
    lbl_end = pd.Timestamp(label_end)

    # Load sales data
    sales = pd.read_csv(sales_csv, parse_dates=["order_date"])

    # Feature engineering from sales activity in the observation window
    obs_sales = sales[
        (sales["order_date"] >= obs_start) & (sales["order_date"] <= obs_end)
    ]
    customer_features = obs_sales.groupby("customer_id").agg(
        total_orders=("order_id", "count"),
        total_spend=("order_value", "sum"),
        avg_order_value=("order_value", "mean"),
        unique_products=("product_id", "nunique"),
        last_order_date=("order_date", "max"),
    ).reset_index()

    # Recency: days since last order within the window
    customer_features["recency_days"] = (
        obs_end - customer_features["last_order_date"]
    ).dt.days

    # Load all customers from the database
    conn = sqlite3.connect(db_path)
    customers = pd.read_sql_query(
        "SELECT customer_id, segment, account_age_days, mrr FROM customers",
        conn,
    )
    conn.close()

    # Load support tickets from the observation window
    tickets = pd.read_csv(tickets_csv, parse_dates=["created_date"])
    obs_tickets = tickets[
        (tickets["created_date"] >= obs_start) & (tickets["created_date"] <= obs_end)
    ]
    ticket_features = obs_tickets.groupby("customer_id").agg(
        support_tickets=("ticket_id", "count"),
        escalated_tickets=("escalated", "sum"),
    ).reset_index()

    # Build label: churned = 1 if no purchase activity in the label period
    label_sales = sales[
        (sales["order_date"] > obs_end) & (sales["order_date"] <= lbl_end)
    ]
    active_in_label = set(label_sales["customer_id"].unique())
    customers["churned"] = (
        ~customers["customer_id"].isin(active_in_label)
    ).astype(int)

    # Join all features onto the customer master list
    dataset = (
        customers
        .merge(customer_features, on="customer_id", how="left")
        .merge(ticket_features, on="customer_id", how="left")
    )

    # Customers with no obs-window activity get zero orders, spend, and
    # tickets -- but maximum recency (90 days). Filling recency with 0
    # would make an inactive customer look like they ordered today.
    dataset["recency_days"] = dataset["recency_days"].fillna(90)
    zero_fill_cols = [
        "total_orders", "total_spend", "avg_order_value",
        "unique_products", "support_tickets", "escalated_tickets",
    ]
    dataset[zero_fill_cols] = dataset[zero_fill_cols].fillna(0)

    # One-hot encode segment and drop the raw date column
    dataset = pd.get_dummies(dataset, columns=["segment"], drop_first=True)
    dataset.drop(columns=["last_order_date"], errors="ignore", inplace=True)
    return dataset
```
Priya ran this function and examined the result: 847 customer rows, a churn rate of 10.3%. The class imbalance was real but not extreme. Eighty-seven accounts had churned; 760 had not.
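A class-balance check like the following is worth running before any modeling. This is a minimal sketch using a stand-in frame with the same label counts; in the real pipeline, `dataset` is the output of `build_churn_dataset()`:

```python
import pandas as pd

# Stand-in with the same label counts as Acme's data: 87 churned, 760 active.
dataset = pd.DataFrame({"churned": [1] * 87 + [0] * 760})

counts = dataset["churned"].value_counts()
churn_rate = dataset["churned"].mean()
print(f"Customers: {len(dataset)}  churned: {counts[1]}  rate: {churn_rate:.1%}")
```

A rate in the 5-20% range, as here, usually calls for class weighting or stratified splits rather than aggressive resampling.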
Step 2: Training and Evaluating the Model
With the dataset assembled, Priya trained three models and compared them using cross-validation before choosing one for production.
```python
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
import pandas as pd


def train_and_compare_models(
    dataset: pd.DataFrame,
    target_col: str = "churned",
    id_col: str = "customer_id",
    random_state: int = 42,
) -> dict:
    """
    Train three candidate models and return cross-validation results.

    Args:
        dataset: Feature dataset from build_churn_dataset().
        target_col: Name of the binary target column.
        id_col: Column to exclude from features (identifier).
        random_state: Random seed for reproducibility.

    Returns:
        Dictionary mapping model name to cross-validated F1 and AUC scores.
    """
    feature_cols = [
        c for c in dataset.columns if c not in (target_col, id_col)
    ]
    X = dataset[feature_cols]
    y = dataset[target_col]

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    models = {
        "Logistic Regression": Pipeline([
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(
                class_weight="balanced",
                max_iter=1000,
                random_state=random_state,
            )),
        ]),
        "Decision Tree": DecisionTreeClassifier(
            max_depth=5,
            min_samples_leaf=10,
            class_weight="balanced",
            random_state=random_state,
        ),
        "Random Forest": RandomForestClassifier(
            n_estimators=100,
            max_depth=6,
            min_samples_leaf=5,
            class_weight="balanced",
            random_state=random_state,
        ),
    }

    results = {}
    for name, model in models.items():
        f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
        auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        results[name] = {
            "f1_mean": f1_scores.mean(),
            "f1_std": f1_scores.std(),
            "auc_mean": auc_scores.mean(),
            "auc_std": auc_scores.std(),
        }

    # Print comparison table
    print(f"\n{'Model':<25} {'F1 Mean':>10} {'F1 Std':>10} {'AUC Mean':>10} {'AUC Std':>10}")
    print("-" * 65)
    for name, metrics in results.items():
        print(
            f"{name:<25}"
            f" {metrics['f1_mean']:>10.3f}"
            f" {metrics['f1_std']:>10.3f}"
            f" {metrics['auc_mean']:>10.3f}"
            f" {metrics['auc_std']:>10.3f}"
        )
    return results
```
The cross-validation results came back:
```
Model                        F1 Mean     F1 Std   AUC Mean    AUC Std
-----------------------------------------------------------------
Logistic Regression            0.612      0.048      0.841      0.031
Decision Tree                  0.588      0.067      0.804      0.044
Random Forest                  0.649      0.039      0.873      0.025
```
Random Forest won on both metrics with lower variance — the better choice. But Priya knew Sandra's team would ask "why" for specific accounts. She chose to train the Random Forest for scoring but keep the logistic regression coefficients as an explanation tool, running both in the final pipeline.
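The explanation half of that pairing can be sketched as follows. This is a minimal sketch on synthetic stand-in data (the label is deliberately driven by recency so the readout has something to find); in the real pipeline, `X` and `y` come from the feature dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features with a recency-driven label.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "recency_days": rng.integers(0, 90, 500),
    "total_orders": rng.integers(0, 20, 500),
    "support_tickets": rng.integers(0, 5, 500),
})
y = (X["recency_days"] > 60).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(X, y)

# Because the features are standardized, coefficient magnitudes are directly
# comparable: positive pushes toward churn, negative pushes away from it.
coefs = pd.Series(pipe.named_steps["clf"].coef_[0], index=X.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

Account managers can read the sorted coefficients as a ranked list of risk drivers, which is the "why" the Random Forest's probabilities alone do not provide.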
Step 3: Generating the At-Risk Customer Report
```python
def generate_churn_risk_report(
    dataset: pd.DataFrame,
    target_col: str = "churned",
    id_col: str = "customer_id",
    top_n: int = 20,
    random_state: int = 42,
) -> pd.DataFrame:
    """
    Train the production model and return top at-risk customers.

    Trains a Random Forest on 80% of the data, evaluates on the held-out 20%,
    then rescores the full dataset to identify the highest-risk accounts.

    Args:
        dataset: Feature dataset from build_churn_dataset().
        target_col: Name of the binary target column.
        id_col: Customer identifier column.
        top_n: Number of at-risk accounts to return.
        random_state: Random seed.

    Returns:
        DataFrame of top at-risk customers with probability scores.
    """
    feature_cols = [
        c for c in dataset.columns if c not in (target_col, id_col)
    ]
    X = dataset[feature_cols]
    y = dataset[target_col]
    customer_ids = dataset[id_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=random_state, stratify=y
    )

    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=6,
        min_samples_leaf=5,
        class_weight="balanced",
        random_state=random_state,
    )
    model.fit(X_train, y_train)

    # Evaluate on the held-out test set
    y_pred = model.predict(X_test)
    y_prob_test = model.predict_proba(X_test)[:, 1]
    print("\n=== Model Performance on Held-Out Test Set ===")
    print(classification_report(y_test, y_pred, target_names=["Active", "Churned"]))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob_test):.3f}")

    # Score all customers. Note: training rows are rescored here, so their
    # probabilities are optimistic -- acceptable for ranking accounts, but
    # never report metrics computed on these scores.
    all_probs = model.predict_proba(X)[:, 1]
    risk_df = pd.DataFrame({
        "customer_id": customer_ids.values,
        "churn_probability": all_probs,
        "actual_churned": y.values,
    })

    # Feature importance summary
    importance_df = pd.DataFrame({
        "feature": feature_cols,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False)
    print("\n=== Top 10 Predictive Features ===")
    for _, row in importance_df.head(10).iterrows():
        bar = "#" * int(row["importance"] * 100)
        print(f"  {row['feature']:<30} {row['importance']:.3f} {bar}")

    # Return top-risk accounts (excluding known churners -- focus on prevention)
    top_risk = (
        risk_df[risk_df["actual_churned"] == 0]
        .sort_values("churn_probability", ascending=False)
        .head(top_n)
        .reset_index(drop=True)
    )
    top_risk.index += 1  # 1-based ranking
    return top_risk
```
Step 4: The Results
When Priya ran the full pipeline, the model's performance on the held-out test set was:
```
=== Model Performance on Held-Out Test Set ===
              precision    recall  f1-score   support

      Active       0.97      0.92      0.94       152
     Churned       0.61      0.82      0.70        17

    accuracy                           0.91       169
   macro avg       0.79      0.87      0.83       169
weighted avg       0.93      0.91      0.92       169

ROC AUC: 0.931
```
The model caught 82% of the churners in the held-out test set — fourteen of seventeen accounts that actually left. Two of the three it missed had no support tickets, no declining order activity, and high recent spend: genuinely difficult to predict from behavioral signals alone.
The top predictive features:
```
=== Top 10 Predictive Features ===
  recency_days                   0.187 ##################
  total_orders                   0.164 ################
  mrr                            0.143 ##############
  total_spend                    0.118 ###########
  avg_order_value                0.097 #########
  support_tickets                0.082 ########
  escalated_tickets              0.071 #######
  account_age_days               0.058 #####
  unique_products                0.044 ####
  segment_Mid-Market             0.036 ###
```
Recency was the strongest single signal. Customers who had not placed an order in 60+ days during the observation window were dramatically more likely to churn — even if their historical spend was high. Total order frequency was next. Both findings aligned with what Sandra's most experienced account managers had said anecdotally for years. The model quantified the intuition.
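That threshold effect is easy to eyeball with a bucket table. A minimal sketch with toy numbers (not Acme's data; the real frame has the same two columns):

```python
import pandas as pd

# Toy customers: recency in the observation window and their churn label.
dataset = pd.DataFrame({
    "recency_days": [5, 12, 30, 45, 61, 70, 80, 88],
    "churned":      [0,  0,  0,  0,  1,  0,  1,  1],
})

# Churn rate by recency bucket. Bins are right-inclusive, so 30 falls in 0-30.
buckets = pd.cut(
    dataset["recency_days"], bins=[0, 30, 60, 90], labels=["0-30", "31-60", "61-90"]
)
rate_by_bucket = dataset.groupby(buckets, observed=True)["churned"].mean()
print(rate_by_bucket)
```

A sharp jump in the last bucket is the tabular version of what the feature importances are saying.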
The top 20 at-risk accounts the model flagged included four that Sandra's team had already noted as "worrying" — a useful validation. The other sixteen were accounts that had not appeared in any weekly at-risk discussion.
Step 5: Presenting to Sandra
Priya prepared a one-page summary for Sandra. She led with three numbers:
- $180,000 in ARR was lost last quarter to churn
- The model identifies 82% of churning accounts in advance (based on held-out validation)
- The top 20 at-risk accounts represent $310,000 in ARR — accounts your team can reach before Q2 ends
Sandra studied the list. "Meridian Supply. They're on here. We just extended a contract with them last month." She looked up. "How confident are you in this?"
"The model isn't certain about any individual account," Priya said. "Meridian has a 71% predicted churn probability. The features driving it are a 49-day gap in orders and two escalated support tickets. That doesn't mean they'll definitely leave — but it means something changed, and it's worth a conversation."
Sandra forwarded the list to her four account managers that afternoon with a note: "This is Priya's model output. Treat these as priorities for Q1 QBR outreach. Ask me if you have questions."
What This Case Study Demonstrates
The churn model arc covers the complete workflow:
- Problem framing — defining the observation window, label period, and business objective before touching code
- Feature engineering — collapsing transaction-level data into customer-level features (recency, frequency, monetary, support volume)
- Class imbalance handling — using `class_weight="balanced"` and `stratify=y` in the split
- Model comparison with cross-validation — not choosing a model based on a single run
- Actionable output — the final deliverable is a prioritized list with probabilities, not a metric summary
- Honest communication — explaining what the model can and cannot say about individual accounts
The gap between the model's recall (82%) and precision (61%) is worth noting: for every ten accounts the model flags as high-risk, roughly four will not actually churn. Sandra's team will make some unnecessary calls. But making those calls costs far less than missing the six accounts that do churn. That asymmetry — not model elegance — is the reason churn models are worth building.
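The asymmetry is easy to put numbers on. Illustrative arithmetic only: the per-call cost is an assumption, not a figure from the case study:

```python
precision = 0.61          # test-set precision on the churn class
flagged = 20              # accounts on the at-risk list
arr_at_stake = 310_000    # ARR across the flagged accounts
call_cost = 500           # assumed fully-loaded cost of one retention outreach

true_churners = precision * flagged      # ~12 accounts genuinely at risk
wasted_calls = flagged - true_churners   # ~8 unnecessary conversations

print(f"Wasted outreach spend: ${wasted_calls * call_cost:,.0f}")
print(f"ARR the outreach protects: ${arr_at_stake:,.0f}")
```

A few thousand dollars of unnecessary calls against six figures of revenue at risk: that ratio, not the AUC, is the business case.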
Key Numbers from This Case Study
| Metric | Value |
|---|---|
| Training dataset size | 847 customers |
| Churn rate in dataset | 10.3% |
| Cross-validated F1 (Random Forest) | 0.649 ± 0.039 |
| Cross-validated AUC (Random Forest) | 0.873 ± 0.025 |
| Test-set recall (churn class) | 82% |
| Test-set precision (churn class) | 61% |
| Test-set AUC | 0.931 |
| Top features | Recency, order frequency, MRR |
| ARR at stake in top 20 flagged accounts | $310,000 |