Case Study 33-01: Priya Frames the Churn Prediction Problem
Character: Priya, Senior Analyst at Acme Corp

Setting: Acme Corp has a subscription product used by roughly 3,000 business customers. Priya has been asked by the VP of Sales to "use AI to predict which customers are about to cancel."
The Meeting That Starts Everything
It is 3:15 on a Tuesday afternoon and Priya is at her desk when Marcus Webb, the VP of Sales, stops by.
"Priya, I need you to build a churn prediction model. The board is asking about customer retention and I told them we're getting ahead of it with machine learning." He pauses. "Can you have something by end of next week?"
Priya has heard this kind of request before. The instinct, she knows, is to fire up a Jupyter notebook and start loading data. That is exactly the wrong instinct.
"I can probably have a working model in a week," she says, "but I want to spend an hour with you first to make sure we're building the right thing. Can we do Friday morning?"
Marcus nods. Priya starts a new document titled: Churn Prediction — Problem Framing.
The Problem Framing Document
Before writing a line of code, Priya works through five questions. She has learned from experience that skipping this step results in technically correct solutions to the wrong problem.
Question 1: What exactly are we predicting?
"Customer churn" means different things at different companies. At Acme, it could mean:
- A customer canceling their subscription entirely
- A customer downgrading to a lower plan
- A customer who has not logged in for 60 days (risk indicator, not actual churn)
- A customer who has stated an intention to cancel in a support ticket
Priya pulls up the subscription database and checks what's actually recorded. She finds a subscription_status field that goes from active to canceled, and a canceled_date timestamp.
Decision: The label is binary. churned = 1 means the customer's subscription moved to canceled status. churned = 0 means it remained active. Everything else (downgrade, disengagement) is a separate problem for another day.
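The label definition is simple enough to state as code. A minimal sketch (the function name and record shape are illustrative; the real logic would read Acme's subscription_status and canceled_date columns):

```python
from datetime import date
from typing import Optional

def churn_label(subscription_status: str, canceled_date: Optional[date]) -> int:
    """Binary label: 1 only if the subscription moved to 'canceled' status.
    Downgrades and disengagement are deliberately NOT counted as churn."""
    return int(subscription_status == "canceled" and canceled_date is not None)
```

Writing the rule down this explicitly makes the scope decision visible: a downgraded account gets label 0, full stop.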
Question 2: What time window are we predicting over?
This is subtler than it looks. The window defines both the label and the available features.
If Priya predicts churn "in the next 30 days," the model needs to be run 30 days before any intervention is possible. If account managers need two weeks to work a save conversation, the usable prediction window is only two weeks out of the 30. That may not give enough lead time.
She consults with the sales team. Account managers typically run quarterly business reviews with at-risk accounts. They need at least 6 weeks of lead time to schedule a call and prepare.
Decision: Predict churn in the next 90 days. This gives the sales team enough time to act. It also gives a large enough positive sample — customers who cancel within 3 months — to train a reasonable model.
Question 3: What data do we have?
Priya spends most of Friday morning on this question. She pulls together the data inventory:
| Data Source | What It Contains | Available From |
|---|---|---|
| Subscription database | Plan type, start date, status, cancellation date | Always |
| Application logs | Logins, feature usage, session duration | 2 years of history |
| Support ticketing system | Contact frequency, ticket categories, resolution time | 3 years |
| Payment processor | Payment method, failures, declined transactions | 2 years |
| CRM | Account manager notes, NPS scores, contract value | 18 months |
She notes what is missing: no customer survey data (other than sparse NPS scores), no product engagement depth beyond feature counts, no competitive intelligence.
Critical check: Do we have labeled examples?
For a supervised learning model, Priya needs historical customers with known outcomes. She queries the database: how many customers subscribed at some point in the past two years? How many subsequently churned?
- Total historical customers (last 2 years): 3,847
- Churned: 612 (15.9%)
- Still active: 3,235 (84.1%)
612 churned examples is workable — not abundant, but enough to train a basic model. She flags that the class imbalance (16% positive) means she will need to be careful about her evaluation metric. Accuracy will be misleading.
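A quick back-of-the-envelope check with Priya's own numbers shows why accuracy misleads here:

```python
# Priya's counts: 612 churned out of 3,847 historical customers.
total, churned = 3847, 612

# A degenerate "model" that always predicts no churn:
accuracy = (total - churned) / total  # looks impressive...
recall = 0 / churned                  # ...but it catches zero actual churners

print(f"accuracy = {accuracy:.1%}, recall = {recall:.0%}")
# → accuracy = 84.1%, recall = 0%
```

Any metric worth reporting has to beat this do-nothing baseline on the churners themselves.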
Question 4: How will predictions be used?
This determines the operational requirements for the model.
Priya talks to two account managers. Their workflow:

- Each manager covers roughly 150 accounts
- They run weekly team meetings where at-risk accounts are discussed
- They have a CRM field for "risk level" that they currently fill in manually based on gut feel
- They want a weekly updated list of their top 10–15 highest-risk accounts
This tells Priya several things:
- The model needs to produce a probability score (0 to 1), not just a binary prediction. Account managers want to rank their accounts by risk, not just split them into churned/not-churned.
- The output needs to be visible in the CRM, not buried in a Jupyter notebook.
- The model needs to run weekly, which means automation eventually.
- The prediction must be explainable at the account level. An account manager cannot act on "the model says 78% churn probability." They need to know: why? Is it the payment failures? The drop in logins?
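One hedged way to meet the explainability requirement, assuming the final model is linear (e.g. logistic regression; the function and feature names below are illustrative): each feature's contribution to the log-odds is its coefficient times how far the account deviates from the population average, and the largest-magnitude contributions are the account-level "why."

```python
def top_drivers(coefficients, account_values, population_means, k=3):
    """Rank features by |coefficient x (account value - population mean)|.
    For a logistic regression, this is each feature's log-odds contribution
    relative to an average account."""
    contributions = {
        name: coefficients[name] * (account_values[name] - population_means[name])
        for name in coefficients
    }
    return sorted(contributions.items(), key=lambda kv: -abs(kv[1]))[:k]
```

If the model ends up being nonlinear, the same requirement points toward per-prediction attribution methods instead, but the linear version is the cheapest thing that could work.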
Question 5: What does success look like?
Priya writes this out explicitly before touching any code:
Minimum viable performance:

- ROC AUC > 0.80 on a held-out test set (versus random baseline of 0.50)
- Recall > 0.70 for the top 20% of scored customers (we need to catch most churners in the high-risk tier)
- Precision > 0.40 at recall 0.70 (we can tolerate some false alarms but not an overwhelming flood)
Business success (harder to measure, but the real goal):

- At least a 10% reduction in churn rate within 90 days of model deployment
- Account managers report that the risk scores are useful and actionable
Failure conditions:

- The model performs no better than account manager intuition
- The model is built but never integrated into the workflow
- The model degrades after 6 months due to product changes and no one notices
The Data Snapshot Problem (and How to Solve It)
This is the part most business ML projects get wrong.
The naive approach: take all current customers, label them (churned in last 90 days vs. not), train a model.
The problem: this conflates customers at different stages of their lifecycle. A customer who signed up yesterday has almost no features — no logins, no support contacts, no usage patterns. Including them in training introduces noise. More seriously, they are not yet "at risk" in the same way as a customer six months in.
Priya's approach: Create a training set using a snapshot date approach.
For each month going back 18 months, she takes a "snapshot" of each active customer's feature values at that moment in time. She then looks forward 90 days from the snapshot date to determine the label: did this customer churn?
This produces many observations per customer (one per monthly snapshot) and correctly captures the temporal structure of the prediction problem. Features are measured before the label is determined. There is no leakage.
Snapshot date: 2023-06-01
Feature values: measured as of 2023-06-01
Label: did the customer cancel between 2023-06-01 and 2023-08-30 (90 days later)?
The snapshot approach is more work than the naive approach, but it is the correct approach. Priya blocks out a day to write the SQL queries that build this dataset.
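The snapshot labeling logic fits in a few lines. A minimal sketch in pure Python (helper names are illustrative; Priya's real version lives in SQL):

```python
from datetime import date, timedelta

def snapshot_label(snapshot_date: date, canceled_date, horizon_days: int = 90) -> int:
    """churned = 1 if the cancellation falls within horizon_days AFTER the
    snapshot date. Features must be measured on or before the snapshot date,
    so the label always lies strictly in the future — no leakage."""
    if canceled_date is None:
        return 0
    return int(snapshot_date < canceled_date <= snapshot_date + timedelta(days=horizon_days))

def monthly_snapshot_dates(start: date, months: int) -> list:
    """First-of-month snapshot dates, one per month going forward."""
    dates = []
    year, month = start.year, start.month
    for _ in range(months):
        dates.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return dates
```

Each (customer, snapshot date) pair becomes one training row, which is where the "many observations per customer" structure comes from.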
The Feature List
After the framing session and data exploration, Priya settles on an initial feature set:
Behavioral features (strongest signal):
- logins_last_7_days — recent engagement
- logins_last_30_days — medium-term engagement
- logins_last_90_days — long-term engagement
- login_trend — ratio of last 30 days to 90-day average (is engagement increasing or declining?)
- features_used_last_30_days — depth of product adoption
- session_duration_avg_last_30_days — quality of engagement, not just quantity
Support features (friction signals):
- support_contacts_last_90_days — unresolved product issues
- days_since_last_support_contact — recency of friction
- unresolved_tickets — open issues
Account features (context):
- account_age_days — tenure (longer = more loyal on average)
- plan_type — enterprise customers churn less than basic
- contract_value — high-value accounts may get more proactive attention
Payment features (financial signals):
- payment_failures_last_year — billing issues are a leading indicator
- days_since_last_payment_failure — recency of payment issues
- has_valid_payment_method — critical binary flag
Satisfaction features:
- nps_score_last_survey — self-reported satisfaction (use carefully — sparse)
What Priya deliberately excludes:

- Future-dated features (anything measured after the snapshot date)
- Features with more than 30% missing values at this stage
- Account manager subjective ratings (too inconsistent across reps)
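Of the features above, login_trend is the only derived one. A minimal sketch of one plausible definition (the exact formula is Priya's to choose; this version compares daily rates so the 30- and 90-day counts are on the same scale):

```python
def login_trend(logins_last_30_days: int, logins_last_90_days: int) -> float:
    """Ratio of the recent 30-day login rate to the 90-day average rate.
    > 1.0 means engagement is increasing; < 1.0 means it is declining;
    0.0 means no baseline activity at all."""
    if logins_last_90_days == 0:
        return 0.0
    rate_30 = logins_last_30_days / 30
    rate_90 = logins_last_90_days / 90
    return rate_30 / rate_90
```

A customer with 30 logins in the last 30 days and 90 in the last 90 scores exactly 1.0 (steady); one with 5 recent logins against the same baseline scores well below 1.0 (declining).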
The Evaluation Framework
Before any model is trained, Priya documents what she will measure and how.
Primary metric: ROC AUC (less sensitive to class imbalance than accuracy)
Secondary metrics:

- Precision and recall at several probability thresholds
- The confusion matrix at the threshold used for operational decisions
- Performance broken down by plan type (the model should work for enterprise customers, not just basic)
Baseline comparisons:

1. Naive: always predict no churn (accuracy = 84.1%, recall = 0%)
2. Simple rule: flag customers with payment failures or zero logins in 30 days
3. Account manager accuracy on historical cases (requires a sample survey)
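The threshold-dependent metrics are simple to compute from scored examples. A minimal sketch, assuming scores are (score, true_label) pairs:

```python
def precision_recall_at(scored, threshold):
    """Precision and recall when flagging every account scored >= threshold.
    `scored` is a list of (score, true_label) pairs with labels in {0, 1}."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping the threshold over this function produces the precision/recall trade-off curve Priya needs to verify the "precision > 0.40 at recall 0.70" target.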
Temporal holdout: Priya will train on snapshots from months 1–12 of her 18-month window and test on months 13–18. This respects the temporal ordering of the data and avoids the optimistic bias of random splitting on time-series data.
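The split itself is just a date comparison on each snapshot row. A sketch (the snapshot_date field name is illustrative):

```python
from datetime import date

def temporal_split(rows, cutoff: date):
    """Train on snapshots strictly before `cutoff`, test on the rest.
    No shuffling: a random split would let the model train on later
    snapshots and be tested on earlier ones — effectively seeing the future."""
    train = [r for r in rows if r["snapshot_date"] < cutoff]
    test = [r for r in rows if r["snapshot_date"] >= cutoff]
    return train, test
```

Because the same customer appears in many monthly snapshots, the date-based split also keeps a given customer's later snapshots out of the training set used to score their earlier ones.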
Priya's Checklist Before Writing Any Code
Looking at her completed framing document, Priya checks off each item:
- [x] Clear, specific prediction target (subscription cancellation, 90-day window)
- [x] Known data sources with sufficient labeled examples (612 churn events)
- [x] Temporal snapshot approach to avoid leakage
- [x] Defined features with business rationale for each
- [x] Clear evaluation metrics with numeric success thresholds
- [x] Defined baseline comparisons
- [x] Understood how output will be used (CRM integration, weekly refresh)
- [x] Identified explainability requirement (account-level drivers)
- [x] Identified class imbalance issue and mitigation approach
Only now does Priya open a Jupyter notebook.
What She Tells Marcus
At the Friday meeting, Priya presents a one-page summary:
"We can build a churn prediction model. We have enough labeled data. The strongest signals are payment failures, declining login trends, and high support contact rates — which is consistent with what your best account managers already watch.
The model will produce a weekly risk score for every customer, visible in the CRM. I'd target a first version in three weeks: one week for data preparation, one week for model training and evaluation, one week to wire it into the CRM.
One thing I want to be clear about upfront: the model will make mistakes. It will flag some customers as high-risk who weren't going to leave, and it will miss some who do. That's normal. The goal is to catch more at-risk customers earlier than you do today, not to achieve perfect prediction. We'll measure whether it's actually improving retention at the 90-day mark."
Marcus nods. "Three weeks works. And I appreciate the realistic expectation-setting."
Priya opens her laptop and starts writing the SQL.
Key Lessons from This Case Study
Frame before you code. The framing document takes a few hours and prevents weeks of work on the wrong problem.
Be specific about the prediction. "Churn prediction" is not specific enough. "Binary classification: will this active customer cancel their subscription in the next 90 days?" is specific enough to build.
The label construction is as important as the model. Using the snapshot approach rather than the naive approach makes the model correct by construction.
Know what success looks like before you start. Pre-committing to success criteria prevents post-hoc rationalization ("well, the accuracy is only 65%, but the precision is pretty good if you squint...").
Interpretability is not optional here. Account managers cannot act on a number without understanding why. The choice of model will need to support feature-level explanations.
The code that follows this framing — the actual model training — appears in code/ml_workflow.py. Notice that the code is simple. The hard work happened here, before the notebook was opened.