Case Study 1: Will This Customer Churn? Marcus Predicts Loyalty


Tier 1 — Verified Concepts: This case study explores customer churn prediction, one of the most common applications of classification in business analytics. The general patterns described — that usage behavior, tenure, and support interactions predict churn, and that class imbalance complicates evaluation — are well-documented in the business analytics and marketing literature. The specific data is simulated for pedagogical purposes, but the modeling workflow and business considerations reflect standard industry practice.


Marcus Gets an Assignment

Marcus works as a data analyst at a mid-sized streaming service. His manager walks in with a question that sounds simple but turns out to be anything but: "Can we predict which customers will cancel their subscriptions next month? If we can identify them early, we can offer them incentives to stay."

This is a classification problem — binary classification, specifically. The target is: will the customer cancel (churn) or not? The answer is yes or no, 1 or 0.

Marcus has seen enough of this textbook to know where to start.

The Data

Marcus pulls three months of customer data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix,
    classification_report, precision_score, recall_score)

np.random.seed(42)
n = 2000

customers = pd.DataFrame({
    'months_active': np.random.exponential(24, n).clip(1, 120),
    'monthly_hours': np.random.lognormal(2, 0.8, n).clip(0.5, 100),
    'support_tickets': np.random.poisson(1.5, n),
    'plan_price': np.random.choice([9.99, 14.99, 19.99], n,
                   p=[0.5, 0.35, 0.15]),
    'devices_used': np.random.choice([1, 2, 3, 4], n,
                     p=[0.3, 0.35, 0.25, 0.1]),
})

# Churn probability depends on features
logit = (
    -2.0 +
    -0.02 * customers['months_active'] +
    -0.05 * customers['monthly_hours'] +
    0.4 * customers['support_tickets'] +
    0.05 * customers['plan_price'] +
    -0.3 * customers['devices_used'] +
    np.random.normal(0, 0.5, n)
)

customers['churned'] = (np.random.random(n) <
    1 / (1 + np.exp(-logit))).astype(int)

print(f"Total customers: {n}")
print(f"Churned: {customers['churned'].sum()} "
      f"({customers['churned'].mean():.1%})")
print("\nFeature summary:")
print(customers.describe().round(2))

The Class Imbalance Problem

Marcus's first observation: only about 20-25% of customers churn. The rest stay. This is a class imbalance problem.

print("Class distribution:")
print(customers['churned'].value_counts())
print(f"\nBaseline accuracy (always predict 'stays'): "
      f"{1 - customers['churned'].mean():.1%}")

If Marcus builds a model that always predicts "stays," he gets about 75-80% accuracy with zero machine learning. His model needs to beat that baseline — and specifically, it needs to actually catch the churners. Accuracy alone won't tell the story.
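The "always predict stays" baseline can be made explicit with scikit-learn's DummyClassifier. This sketch uses a small toy label array with the same rough imbalance, not Marcus's full dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: ~20% churners, mimicking the class imbalance
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# The dummy always predicts the majority class ("stays")
print(f"Baseline accuracy: {baseline.score(X, y):.0%}")   # 80%
print(f"Churners caught: {baseline.predict(X).sum()}")    # 0
```

High accuracy, zero churners caught — exactly the trap the baseline comparison is meant to expose.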

Building the First Model

features = ['months_active', 'monthly_hours',
            'support_tickets', 'plan_price', 'devices_used']
X = customers[features]
y = customers['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserve the churn rate in both splits
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
print(f"  FN={cm[1,0]}  TP={cm[1,1]}")
print(f"\n{classification_report(y_test, y_pred, target_names=['Stays', 'Churns'])}")

Marcus Reads the Confusion Matrix

The confusion matrix breaks the headline accuracy into its components. The model's overall accuracy is good, but Marcus focuses on two specific numbers:

False Negatives (FN): Customers the model predicts will stay, but who actually churn. These are missed opportunities — customers who left without anyone reaching out to save them.

False Positives (FP): Customers the model predicts will churn, but who actually stay. These would receive unnecessary retention offers — a wasted cost, but not catastrophic.
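Precision and recall put numbers on exactly these two error types. A minimal sketch on a hand-made set of predictions (the labels here are illustrative, not from Marcus's test set):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = churns, 0 = stays
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # 2 TP, 2 FN, 1 FP

# Precision: of the customers flagged as churners, how many really churn?
# 2 TP / (2 TP + 1 FP) = 0.67 -- penalized by wasted offers (FP)
print(f"Precision: {precision_score(y_true, y_pred):.2f}")

# Recall: of the real churners, how many did the model catch?
# 2 TP / (2 TP + 2 FN) = 0.50 -- penalized by missed saves (FN)
print(f"Recall: {recall_score(y_true, y_pred):.2f}")
```

For a retention campaign, recall on the churn class usually matters most: a missed churner costs far more than a wasted offer.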

Marcus calculates the business impact:

# Business impact analysis
cost_per_retention_offer = 15  # discount or incentive
revenue_per_saved_customer = 120  # average 10 months retained
cost_per_lost_customer = 120  # lifetime value lost

# What does each error cost?
fp_cost = cm[0, 1] * cost_per_retention_offer  # wasted offers
fn_cost = cm[1, 0] * cost_per_lost_customer    # missed saves
tp_benefit = cm[1, 1] * (revenue_per_saved_customer -
    cost_per_retention_offer)  # optimistic: assumes every offer succeeds

print(f"False positive cost (wasted offers): ${fp_cost:,.0f}")
print(f"False negative cost (lost customers): ${fn_cost:,.0f}")
print(f"True positive benefit (saved customers): ${tp_benefit:,.0f}")
print(f"Net value: ${tp_benefit - fp_cost - fn_cost:,.0f}")

The Probability Advantage

Marcus's manager asks a sharp question: "Can you rank the customers by risk, so we focus on the ones most likely to leave?"

This is exactly what predict_proba does:

# Risk ranking
test_customers = X_test.copy()
test_customers['churn_probability'] = y_proba
test_customers['actual_churned'] = y_test.values

# Top 20 highest risk
high_risk = test_customers.nlargest(20, 'churn_probability')
print("Top 20 highest-risk customers:")
print(high_risk[['churn_probability', 'actual_churned',
    'monthly_hours', 'support_tickets']].to_string())

# How many of the top 20 actually churned?
top_20_hit_rate = high_risk['actual_churned'].mean()
print(f"\nHit rate in top 20: {top_20_hit_rate:.0%}")

The probability ranking lets Marcus target retention efforts efficiently. Instead of contacting all flagged customers (some of whom are barely above the 0.5 threshold), the team can focus on the 50 or 100 customers with the highest churn probability.
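The 0.5 threshold itself is a dial, not a constant. This sketch sweeps the threshold over hypothetical probability scores (the data is made up for illustration) to show the precision-recall trade-off that Discussion Question 3 asks about:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Hypothetical scores: churners tend to score higher than stayers
y_true = np.array([0] * 80 + [1] * 20)
scores = np.clip(rng.normal(0.3, 0.15, 100) + 0.35 * y_true, 0, 1)

# Lowering the threshold flags more customers: recall rises,
# precision generally falls
for thresh in (0.5, 0.4, 0.3):
    y_pred = (scores >= thresh).astype(int)
    print(f"threshold={thresh}: "
          f"precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
```

Picking the best threshold is a business decision: with the per-error costs from the impact analysis above, Marcus could choose the threshold that maximizes net value rather than any single metric.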

What the Model Learned

print("Feature coefficients:")
for feat, coef in sorted(
    zip(features, model.coef_[0]),
    key=lambda x: abs(x[1]), reverse=True
):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {feat}: {coef:.4f} "
          f"(higher value {direction} churn risk)")

The model's story makes intuitive sense:

  • Support tickets (positive coefficient): More complaints = higher churn risk. Frustrated customers leave.
  • Monthly hours (negative coefficient): Customers who use the service more are less likely to leave. Engagement protects against churn.
  • Months active (negative coefficient): Long-tenured customers are more loyal. Switching costs and habit keep them around.
  • Devices used (negative coefficient): Customers using the service on multiple devices are more invested in the ecosystem.
  • Plan price (positive coefficient): Higher-priced plans have more churn — customers are more sensitive to value at higher price points.
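Raw logistic regression coefficients are in log-odds units, which are hard to narrate. Exponentiating each one gives an odds ratio: the multiplicative change in churn odds per one-unit increase in the feature. A sketch using the simulation's true coefficients (the fitted values will differ slightly):

```python
import numpy as np

# Coefficients in log-odds units (taken from the simulation above)
coefs = {
    'months_active': -0.02,
    'monthly_hours': -0.05,
    'support_tickets': 0.4,
    'plan_price': 0.05,
    'devices_used': -0.3,
}

# exp(coef) = multiplicative change in churn odds per one-unit increase
for feat, coef in coefs.items():
    odds_ratio = np.exp(coef)
    print(f"{feat}: odds ratio {odds_ratio:.2f} "
          f"({odds_ratio - 1:+.0%} churn odds per unit)")
```

For example, exp(0.4) ≈ 1.49: each additional support ticket multiplies a customer's churn odds by about 1.5, a framing stakeholders grasp immediately.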

Marcus's Recommendations

Marcus presents his findings to the team with three recommendations:

Recommendation 1: Target the top 100 highest-probability churners each month with a personalized retention offer. Based on the model's probability rankings, these customers are most likely to leave. The probability output allows for calibrated intervention — the highest-risk customers get the most aggressive offers.

Recommendation 2: Investigate the support ticket pattern. The model shows that support tickets are a strong predictor of churn. This suggests that improving the support experience — not just predicting churn — could reduce it. Prevention is better than prediction.

Recommendation 3: Monitor low-usage customers. Declining monthly hours is an early warning sign. Rather than waiting for customers to reach the churn threshold, proactively engage customers whose usage drops significantly.

The Follow-Up Question

Marcus's manager asks one more question: "Can you tell me why each customer is likely to churn?"

Marcus pauses. With logistic regression, he can point to the features and their coefficients: "This customer has high churn risk because they've submitted 8 support tickets and only watched 3 hours last month." The coefficients tell a story that stakeholders can understand and act on.

If Marcus had used a more complex model (a neural network, say), he might get slightly better predictions, but he couldn't answer the "why" question as clearly. For a business audience, the interpretability of logistic regression is a significant advantage.
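The per-customer "why" can be made concrete: multiply each feature value by its coefficient to get that feature's contribution to the log-odds, then convert the total to a probability. A hand-worked sketch using the simulation's true coefficients and a hypothetical high-risk customer:

```python
import numpy as np

# Coefficients and intercept taken from the simulation above
coefs = {'months_active': -0.02, 'monthly_hours': -0.05,
         'support_tickets': 0.4, 'plan_price': 0.05,
         'devices_used': -0.3}
intercept = -2.0

# Hypothetical customer: many tickets, very little viewing
customer = {'months_active': 4, 'monthly_hours': 3,
            'support_tickets': 8, 'plan_price': 14.99,
            'devices_used': 1}

logit = intercept
for feat, value in customer.items():
    contribution = coefs[feat] * value
    logit += contribution
    print(f"{feat} = {value}: contributes {contribution:+.2f} to log-odds")

prob = 1 / (1 + np.exp(-logit))  # sigmoid of the total log-odds
print(f"\nEstimated churn probability: {prob:.0%}")
```

The breakdown makes the dominant driver obvious: the eight support tickets contribute +3.2 to the log-odds, far outweighing every protective factor, which is precisely the kind of explanation a complex model cannot hand over so directly.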


Discussion Questions

  1. Marcus's model treats all false negatives (missed churners) as equally costly. In reality, some churners might be more valuable than others. How would you modify the approach to account for customer lifetime value?

  2. The model shows that support tickets predict churn. Does this mean filing a support ticket causes churn? What's the more likely explanation, and what does it suggest about intervention strategy?

  3. If Marcus lowers the churn threshold from 0.5 to 0.3, he'll catch more churners but also flag more non-churners. How should he decide the optimal threshold? What information does he need?

  4. The model was trained on the last 3 months of data. What could go wrong if customer behavior changes (e.g., a competitor launches a new service, the company raises prices)?


Key Takeaways from This Case Study

  • Classification problems in business require understanding the cost of each type of error
  • Probability outputs are more useful than binary predictions — they enable risk ranking and targeted intervention
  • Class imbalance (most customers don't churn) makes accuracy misleading — use precision, recall, and the confusion matrix
  • Interpretable models like logistic regression allow stakeholders to understand why the model makes its predictions
  • Prediction is a means to an end — the real goal is effective intervention, not a high accuracy number
  • The model's insights (support tickets matter, engagement protects against churn) are as valuable as the predictions themselves