In This Chapter
- 7.1 What Is Classification?
- 7.2 The Classification Workflow
- 7.3 Logistic Regression -- The Starting Point
- 7.4 Decision Trees -- Interpretable, Intuitive, Fragile
- 7.5 Random Forests -- The Wisdom of Crowds
- 7.6 Gradient Boosting -- Sequential Correction
- 7.7 Feature Engineering for Classification
- 7.8 The ChurnClassifier -- Building Athena's First ML Model
- 7.9 Interpreting Classification Results
- 7.10 From Model to Decision -- Translating Predictions into Actions
- 7.11 Algorithms at a Glance -- A Comparison Guide
- 7.12 Chapter Summary and the Road Ahead
Chapter 7: Supervised Learning -- Classification
"The model doesn't make the decision. It informs the decision. Never forget the difference." -- Professor Diane Okonkwo
Professor Okonkwo stands at the whiteboard, marker uncapped, and writes a single number:
$2,000,000
"You run a subscription business," she says, turning to face the class. "One hundred thousand customers. Your monthly churn rate is five percent. That means every month, five thousand customers leave. Your retention team has developed a campaign -- personalized outreach, a loyalty offer, a dedicated account manager for thirty days. It costs twenty dollars per customer to execute. And it works: forty percent of the would-be churners it reaches are saved."
She pauses. "Each saved customer is worth five hundred dollars in annual revenue. Question for the class: should you target all one hundred thousand customers with this retention campaign?"
Tom Kowalski's hand shoots up. "Quick math -- one hundred thousand customers times twenty dollars each is two million dollars in campaign costs."
"Correct. And the revenue saved?"
NK Adeyemi pulls out her notebook and starts writing. "Five percent churn rate means five thousand actual churners. Forty percent save rate on those five thousand is two thousand saved customers. Two thousand times five hundred dollars is one million dollars in saved revenue." She looks up. "We spend two million to save one million. That's a net loss of a million dollars. Targeting everyone is a terrible idea."
"So what do we do?" Professor Okonkwo asks. "Just let customers churn?"
Tom leans forward. "What if we could identify which customers are most likely to churn? If we only target the top five thousand most at-risk customers -- and let's say the model is reasonably accurate, so most of those five thousand are actually going to churn -- we spend five thousand times twenty dollars: a hundred thousand dollars. We save forty percent of them, which is two thousand customers. Two thousand times five hundred is still a million dollars in saved revenue. But now we spent a hundred thousand instead of two million. The net gain is nine hundred thousand dollars."
The room goes quiet as the economics land.
"That," Professor Okonkwo says, writing the word on the whiteboard, "is why classification matters."
She draws a line under the word.
Classification: the task of predicting which category an observation belongs to.
"In this case, we're predicting a binary category: will churn or will not churn. And the difference between getting that prediction right and treating everyone the same is, in this scenario, nearly two million dollars in economic value."
She sets down the marker. "Today we begin building your first machine learning model. We'll start with the theory, but we'll end with code -- a working churn classifier for Athena Retail Group's customer base. By the time we're done, you will understand not just how classification algorithms work, but when to use each one and how to translate a probability score into a business action."
"Welcome to Part 2."
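The classroom arithmetic is worth verifying. A few lines of Python reproduce both scenarios, using the figures from Professor Okonkwo's example:

```python
# Back-of-the-envelope economics of the retention campaign.
# All figures come from Professor Okonkwo's scenario above.
CUSTOMERS = 100_000
CHURN_RATE = 0.05          # 5% monthly churn
COST_PER_OFFER = 20        # dollars per targeted customer
SAVE_RATE = 0.40           # 40% of targeted churners are saved
VALUE_PER_SAVE = 500       # annual revenue per saved customer

churners = int(CUSTOMERS * CHURN_RATE)            # 5,000 actual churners

# Scenario 1: target everyone
cost_all = CUSTOMERS * COST_PER_OFFER             # $2,000,000
saved = int(churners * SAVE_RATE)                 # 2,000 customers saved
net_all = saved * VALUE_PER_SAVE - cost_all       # -$1,000,000

# Scenario 2: target only the 5,000 most at-risk customers
# (idealized: assumes the model ranks all actual churners at the top)
cost_targeted = churners * COST_PER_OFFER         # $100,000
net_targeted = saved * VALUE_PER_SAVE - cost_targeted  # +$900,000

print(f"Target everyone:  net {net_all:+,}")
print(f"Target top 5,000: net {net_targeted:+,}")
```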
7.1 What Is Classification?
Classification is the supervised learning task of assigning observations to discrete categories based on their features. The model learns a mapping from inputs (features) to outputs (categories) by studying labeled historical examples, then applies that learned mapping to make predictions on new, unseen observations.
Definition. Classification is a type of supervised learning where the model predicts a categorical outcome -- a class label -- for each observation. The model learns from historical data where the correct labels are known (the training set) and then generalizes to new data where the labels are unknown.
Binary vs. Multiclass Classification
The simplest form is binary classification: two possible outcomes. Will the customer churn or not? Is this transaction fraudulent or legitimate? Should we approve or deny this loan?
Multiclass classification extends to three or more categories. What product category will this customer purchase next? Which customer segment does this lead belong to? What is the priority level of this support ticket -- low, medium, high, or critical?
| Type | Number of Classes | Business Examples |
|---|---|---|
| Binary | 2 | Churn/retain, fraud/not fraud, approve/deny, spam/not spam, click/no click |
| Multiclass | 3+ | Customer segment (Gold/Silver/Bronze), product category, sentiment (positive/neutral/negative), risk tier |
| Multi-label | Multiple simultaneous | A customer can be tagged with multiple interest categories; a product can belong to multiple departments |
For this chapter, we focus primarily on binary classification -- the most common classification task in business -- with multiclass extensions noted where relevant.
The Classification Problem in Business Context
What makes classification powerful in business is not the algorithm. It is the decision that follows the prediction.
Every classification problem in business implies an action:
| Prediction | Business Action |
|---|---|
| Customer will churn | Trigger retention campaign |
| Transaction is fraudulent | Block transaction, flag for review |
| Lead will convert | Prioritize for sales team |
| Loan applicant will default | Deny or adjust terms |
| Email is spam | Route to spam folder |
| Patient is high-risk | Schedule early intervention |
Business Insight. If you cannot identify the action that will follow a classification prediction, you do not have a classification problem -- you have a classification curiosity. This connects directly to the problem framing lesson from Chapter 6: a prediction without a decision is trivia.
The Relationship to Chapter 6
In Chapter 6, we introduced the ML project lifecycle, the ML Canvas, and the critical distinction between model metrics and business metrics. Everything we build in this chapter applies those frameworks. Athena's churn prediction project, which Ravi Mehta's team scoped in Chapter 6 using the ML Canvas, becomes our working example. The conceptual groundwork is complete. Now we write the code.
7.2 The Classification Workflow
Before we examine specific algorithms, let's establish the end-to-end workflow that every classification project follows. This workflow builds on the CRISP-DM lifecycle from Chapter 2 and the ML project lifecycle from Chapter 6, but focuses on the technical execution steps.
Step 1: Define the Target Variable
The target variable (also called the label or dependent variable) is what you are trying to predict. For binary classification, this is typically encoded as 0 or 1.
For Athena's churn prediction:
- 1 = Customer churned (no purchase in 180 days)
- 0 = Customer did not churn (at least one purchase in 180 days)
Caution. The definition of your target variable is a business decision, not a technical one. "No purchase in 180 days" is a choice. Ravi's team debated alternatives -- 90 days, 365 days, revenue decline of 50 percent. The 180-day window was selected because it aligns with Athena's loyalty program renewal cycle and gives the retention team enough lead time to intervene. Different definitions produce different models with different business implications.
Step 2: Gather and Prepare Features
Features (also called independent variables or predictors) are the inputs the model uses to make predictions. For Athena's churn model, the initial feature set includes:
- Purchase frequency -- number of transactions in the last 12 months
- Recency -- days since last purchase
- Average order value -- mean spending per transaction
- Return rate -- percentage of items returned
- Channel mix -- proportion of purchases online vs. in-store
- Tenure -- months since first purchase
- Loyalty tier -- current tier (Bronze, Silver, Gold, Platinum)
- Category diversity -- number of distinct product categories purchased
Feature preparation involves handling missing values, encoding categorical variables (like loyalty tier), and scaling numerical features. We will cover these techniques in detail in Section 7.7.
Step 3: Split the Data
The data must be divided into at least two sets:
- Training set (typically 70-80 percent): The model learns from this data.
- Test set (typically 20-30 percent): Held out entirely during training. Used only once, at the end, to evaluate the model's performance on truly unseen data.
Some workflows add a validation set (carved from the training data) for hyperparameter tuning, or use cross-validation, which we will discuss in Chapter 11.
Definition. Data leakage occurs when information from the test set (or from the future) contaminates the training process, producing a model that appears to perform well during development but fails in production. As we discussed in Chapter 6, this is one of the most common and insidious failure modes in applied ML.
The split must respect temporal order when the data has a time dimension. For Athena's churn data, we train on customers whose 180-day window ended before a cutoff date and test on customers whose window ended after it. This simulates how the model will be used in production -- predicting future churn based on past behavior.
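A minimal sketch of such a temporal split, assuming an illustrative `window_end` column (not Athena's actual schema):

```python
import pandas as pd

# Temporal split sketch: train on customers whose 180-day window ended
# before the cutoff, test on those whose window ended after it.
# Column names (window_end, churned) are illustrative.
df = pd.DataFrame({
    "customer_id": range(6),
    "window_end": pd.to_datetime([
        "2023-01-15", "2023-03-01", "2023-05-20",
        "2023-08-10", "2023-10-05", "2023-12-01",
    ]),
    "churned": [0, 1, 0, 1, 0, 0],
})

cutoff = pd.Timestamp("2023-07-01")
train = df[df["window_end"] < cutoff]    # past: the model learns from these
test = df[df["window_end"] >= cutoff]    # future: the model is evaluated on these

print(len(train), len(test))  # 3 3
```

A random shuffle here would let the model peek at the future; sorting by time keeps the evaluation honest.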
Step 4: Train the Model
Feed the training data to one or more algorithms and let them learn the relationship between features and the target variable. In practice, you almost always train multiple algorithms and compare their performance -- logistic regression, decision trees, random forests, gradient boosting.
Step 5: Evaluate the Model
Measure performance on the held-out test set using appropriate metrics. For classification, the key metrics include accuracy, precision, recall, F1 score, and AUC-ROC. We introduce these metrics in Section 7.9 and explore them in full depth in Chapter 11.
Step 6: Interpret and Deploy
Understanding why the model makes its predictions is as important as the predictions themselves. Feature importance, partial dependence plots, and individual prediction explanations help build stakeholder trust and surface potential issues. Deployment connects the model to the business process it was designed to support.
7.3 Logistic Regression -- The Starting Point
Despite its name, logistic regression is a classification algorithm. It is almost always the first model you should try for binary classification. Not because it is the most powerful, but because it is interpretable, fast, and provides a strong baseline against which to measure more complex models.
The Core Intuition
Logistic regression answers a simple question: given the features of this observation, what is the probability that it belongs to class 1?
It does this using the sigmoid function (also called the logistic function), which maps any real number to a value between 0 and 1:
P(y=1) = 1 / (1 + e^(-z))
where z is a linear combination of the features:
z = b0 + b1*x1 + b2*x2 + ... + bn*xn
The sigmoid function produces the characteristic S-curve:
Probability
1.0 | _______________
| /
| /
0.5 | - - - - - - - - - + - - - - - - - - -
| /
| /
0.0 |_______________/
+----------------------------------------
z (linear score)
When z is very negative (the features suggest "not churn"), the probability approaches 0. When z is very positive (the features scream "churn"), the probability approaches 1. The transition happens around z = 0, where the probability is exactly 0.5.
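The sigmoid's behavior is easy to verify numerically. A short sketch:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Very negative scores -> probability near 0; very positive -> near 1;
# z = 0 lands exactly at 0.5.
scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probs = sigmoid(scores)
print(np.round(probs, 3))  # probabilities rise from ~0.002 up to ~0.998
```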
Business Insight. The beauty of logistic regression is that its output is a probability, not just a yes/no prediction. A customer with a 0.92 churn probability needs a different intervention than a customer at 0.55. The probability gives the business granular information to make nuanced decisions -- tiered retention campaigns, prioritized outreach, or risk-scored customer portfolios.
The Decision Boundary
To convert a probability into a class prediction, we choose a threshold. The default is 0.5: if the predicted probability is above 0.5, predict class 1 (churn); otherwise, predict class 0 (no churn).
But 0.5 is not sacred. As we will explore in Section 7.10, the optimal threshold depends on the relative costs of false positives and false negatives. In Athena's case, where the cost of missing a churner ($340 in lost lifetime value) far exceeds the cost of a wasted retention offer ($20), a lower threshold like 0.3 may be more appropriate -- cast a wider net, catch more churners, accept some wasted offers.
Why Start with Logistic Regression?
- Interpretability. Each coefficient tells you the direction and magnitude of a feature's influence. If the coefficient for "days since last purchase" is positive, longer gaps increase churn probability. Business stakeholders can understand and challenge this.
- Speed. Logistic regression trains in seconds, even on large datasets. This makes rapid experimentation easy.
- Baseline. Any more complex model must beat logistic regression to justify its complexity. If a random forest only improves AUC from 0.81 to 0.83, the added complexity may not be worth it.
- Probability calibration. Logistic regression tends to produce well-calibrated probabilities out of the box -- when it says "70 percent chance of churn," roughly 70 percent of those customers actually churn. This matters for business applications where probability scores drive tiered actions.
Definition. Calibration refers to the degree to which a model's predicted probabilities match actual observed frequencies. A well-calibrated model's predicted probabilities can be taken at face value. A poorly calibrated model might predict "80 percent chance of churn" for a group where only 40 percent actually churn, leading to misallocation of resources.
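One way to check calibration is to bin predictions and compare predicted probabilities against observed frequencies. A sketch using scikit-learn's `calibration_curve`, with simulated data that is perfectly calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)

# Simulated predicted probabilities, with outcomes drawn to match them --
# i.e., a perfectly calibrated "model" by construction.
probs = rng.uniform(0, 1, size=5000)
outcomes = (rng.uniform(0, 1, size=5000) < probs).astype(int)

# Bin the predictions and compare predicted vs. observed churn frequency.
observed, predicted = calibration_curve(outcomes, probs, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted ~{p:.2f} -> observed {o:.2f}")
```

For a real model you would pass its `predict_proba` output instead; large gaps between the two columns signal miscalibration.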
Limitations
Logistic regression assumes a roughly linear relationship between features and the log-odds of the outcome. It cannot natively capture complex interactions or non-linear patterns. If the relationship between purchase frequency and churn is U-shaped (very low and very high frequencies both indicate risk), logistic regression will miss this unless you manually create interaction or polynomial features.
For many business problems, this limitation matters, which is why we need the next set of tools.
7.4 Decision Trees -- Interpretable, Intuitive, Fragile
Decision trees take a completely different approach to classification. Instead of fitting a mathematical function, they learn a series of if-then rules by recursively splitting the data on feature values.
The Core Intuition
Imagine you are a retention manager at Athena, and you have to manually identify at-risk customers using only three pieces of information: days since last purchase, number of purchases in the last year, and average order value. You might reason like this:
- Has the customer purchased in the last 60 days?
- Yes --> Probably not churning. But check further.
- Have they made more than 3 purchases this year?
- Yes --> Low risk.
- No --> Medium risk.
- No --> Possibly churning.
- Is their average order value above $75?
- Yes --> Medium risk (high-value customer going quiet -- investigate).
- No --> High risk.
That reasoning process is a decision tree. The algorithm automates it by finding the optimal splits -- the questions that best separate churners from non-churners at each step.
How Decision Trees Split
At each node, the algorithm considers every feature and every possible split point. It chooses the split that maximally separates the classes, measured by a criterion like Gini impurity or information gain.
Definition. Gini impurity measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the subset. A Gini impurity of 0 means perfect purity -- all elements in the node belong to the same class. A Gini impurity of 0.5 (for binary classification) means maximum impurity -- a 50/50 split.
Think of it this way: the tree asks, "Which question, applied to this group of customers, creates two subgroups that are each as pure as possible?" A perfect split puts all churners in one group and all non-churners in the other. In practice, splits are imperfect, so the tree keeps splitting until it reaches a stopping criterion (maximum depth, minimum samples per leaf, or no further improvement).
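Gini impurity and the weighted score of a candidate split are simple to compute by hand. A small sketch with illustrative labels:

```python
def gini(labels):
    """Gini impurity of a set of binary class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n              # fraction of class 1 (churners)
    return 1.0 - p**2 - (1 - p)**2

print(gini([1, 1, 1, 1]))            # 0.0 -> perfectly pure node
print(gini([1, 1, 0, 0]))            # 0.5 -> maximum impurity (50/50)

# A candidate split is scored by the weighted impurity of its children:
parent = [1, 1, 1, 0, 0, 0]                  # impurity 0.5
left, right = [1, 1, 1, 0], [0, 0]           # an imperfect split
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(weighted)                      # 0.25 -- lower than the parent's 0.5
```

The tree greedily picks whichever split drives this weighted impurity lowest.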
The Interpretability Advantage
Decision trees produce rules that non-technical stakeholders can read and challenge:
IF days_since_last_purchase > 90
AND purchase_count_12m < 3
AND loyalty_tier IN ('Bronze', 'Silver')
THEN churn_probability = 0.78
A marketing manager can look at this rule and say, "That makes sense -- customers who haven't bought in three months, barely shop with us, and aren't in our premium loyalty tiers are exactly the ones we'd expect to leave." Or they might say, "Wait -- we just launched a new program for Bronze members. Can we check if that changes things?" Either way, the tree enables a conversation that a neural network cannot.
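scikit-learn can print a fitted tree's rules in exactly this if-then form. A sketch on synthetic stand-in data (the feature names are illustrative, not Athena's real schema):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data stands in for Athena's customer features.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
feature_names = ["days_since_last_purchase", "purchase_count_12m",
                 "avg_order_value"]

# A shallow tree keeps the printed rules readable for stakeholders.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```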
Business Insight. In regulated industries -- financial services, healthcare, insurance -- model interpretability is not optional. It is a compliance requirement. Decision trees and their derivatives (random forests with feature importance) are often preferred for regulatory reasons even when more complex models achieve marginally better performance. We will explore this tension between performance and interpretability in Chapter 26 (Fairness, Explainability, and Transparency).
The Overfitting Problem
Left unconstrained, a decision tree will keep splitting until every training example is perfectly classified. The result is a deeply branched, highly specific tree that memorizes the training data -- including its noise and idiosyncrasies -- rather than learning generalizable patterns.
Definition. Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations rather than true underlying patterns. An overfit model performs excellently on training data but poorly on new, unseen data. It has memorized rather than learned.
Signs of an overfit decision tree:
- Training accuracy of 99 percent, test accuracy of 72 percent
- Very deep tree with many leaves, some containing only a handful of examples
- Brittle predictions: small changes in input data produce large changes in output

Remedies include:
- Pruning: Limiting tree depth or requiring a minimum number of samples per leaf
- Setting constraints: Maximum depth, minimum samples to split, maximum leaf nodes
- Ensemble methods: Combining many trees, which brings us to random forests
7.5 Random Forests -- The Wisdom of Crowds
Random forests address the overfitting problem of individual decision trees through a simple, powerful idea: build many trees, each slightly different, and let them vote.
The Core Intuition
The principle is borrowed from the "wisdom of crowds" phenomenon. If you ask one person to estimate the number of jelly beans in a jar, the answer may be wildly off. If you ask five hundred people and average their responses, the average is typically much closer to the truth than any individual guess.
Random forests apply this principle to decision trees:
- Create many training datasets by randomly sampling from the original training data with replacement (a technique called bootstrapping). Each sample is roughly the same size as the original but contains some duplicated rows and omits others.
- Build a decision tree on each bootstrap sample. At each split, consider only a random subset of features (not all features). This ensures that the trees are different from each other -- they see different data and consider different features.
- Aggregate predictions. For classification, each tree votes. The final prediction is the majority vote.
Bootstrapping plus aggregation is called bagging (bootstrap aggregating); the additional random feature selection at each split is what distinguishes a random forest from plain bagged trees. The randomness injects diversity, and the aggregation smooths out individual errors.
Definition. Ensemble learning is the strategy of combining multiple models to produce a prediction that is more robust and accurate than any single model. Random forests are the canonical example, combining hundreds or thousands of decision trees into a single ensemble.
Why Random Forests Work So Well
- Reduced overfitting. Individual trees may overfit, but their errors tend to be different. Averaging many noisy-but-diverse predictions produces a smoother, more generalizable result.
- Feature importance. Random forests naturally measure how much each feature contributes to prediction accuracy. This is invaluable for business insights: "Purchase recency is the single strongest predictor of churn" is actionable information.
- Robustness. Random forests handle missing values, outliers, and mixed feature types (numerical and categorical) with minimal preprocessing. They are hard to break.
- Few hyperparameters. The most important tuning parameters are the number of trees (more is generally better, with diminishing returns) and the maximum depth of each tree.
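A minimal sketch of training a random forest and reading its feature importances, on synthetic stand-in data with illustrative feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are illustrative.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
names = ["recency", "frequency", "monetary", "return_rate", "tenure"]

forest = RandomForestClassifier(n_estimators=200, max_depth=10,
                                random_state=42)
forest.fit(X, y)

# Importances sum to 1; higher means a larger contribution to the splits.
for name, imp in sorted(zip(names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:12s} {imp:.3f}")
```

On real churn data, this ranked list is often the first deliverable stakeholders ask for.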
Key Hyperparameters
| Parameter | What It Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of trees in the forest | 100-500 (more trees = better, up to a point) |
| `max_depth` | Maximum depth of each tree | 10-30, or None (unlimited) |
| `max_features` | Number of features considered at each split | sqrt(n_features) for classification |
| `min_samples_split` | Minimum samples required to split a node | 2-20 |
| `min_samples_leaf` | Minimum samples required in a leaf node | 1-10 |
NK raises her hand. "Professor, you said random forests are 'hard to break.' Are there situations where they don't work well?"
"Excellent question," Professor Okonkwo replies. "Random forests can struggle with very high-dimensional, sparse data -- like text data with thousands of word features. They can also be computationally expensive on very large datasets. And while they tell you which features are important, they don't give you the direction of the effect as cleanly as logistic regression does. For that, you need partial dependence plots or SHAP values, which we'll cover in Chapter 26."
7.6 Gradient Boosting -- Sequential Correction
If random forests are "wisdom of crowds," gradient boosting is "learning from mistakes." While random forests build trees independently and average them, gradient boosted models build trees sequentially, with each new tree explicitly correcting the errors of the previous ones.
The Core Intuition
Imagine a student taking a practice exam. After grading, the student doesn't start over from scratch. Instead, they focus their studying on the questions they got wrong. The next practice attempt concentrates effort where performance was weakest. Over many rounds, the student improves by systematically addressing their specific gaps.
Gradient boosting works the same way:
- Build a simple tree (often called a "stump" -- just one or two splits).
- Calculate the residuals -- the errors between the model's predictions and the actual labels.
- Build the next tree to predict these residuals (i.e., to correct the first tree's mistakes).
- Add this correction tree to the ensemble.
- Repeat: calculate new residuals, build a new tree, add it.
- The final prediction is the sum of all trees' contributions.
Each tree is individually weak -- a "weak learner." But the sequential correction process produces a powerful combined model.
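The sequential-correction loop above can be sketched by hand. This bare-bones version fits residuals directly (the regression form; real gradient boosting for classification works on gradients of the log-loss), with all data and settings illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # noisy target

learning_rate = 0.1
prediction = np.zeros_like(y)        # start from a trivial model
for _ in range(100):
    residuals = y - prediction                       # current errors
    stump = DecisionTreeRegressor(max_depth=2,       # a weak learner
                                  random_state=0)
    stump.fit(X, residuals)                          # fit the errors, not y
    prediction += learning_rate * stump.predict(X)   # add the correction

print(float(np.mean((y - prediction) ** 2)))  # small residual MSE
```

Each round nudges the ensemble toward the remaining errors; the learning rate keeps any single tree from overcorrecting.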
XGBoost and LightGBM
Two implementations of gradient boosting dominate practical machine learning:
XGBoost (eXtreme Gradient Boosting) was released in 2014 and quickly became the most popular machine learning algorithm for structured/tabular data. It won a remarkable string of Kaggle competitions and became the default choice for many applied ML teams. XGBoost includes built-in regularization (to prevent overfitting), efficient handling of missing values, and parallel processing for speed.
LightGBM (Light Gradient Boosting Machine), released by Microsoft in 2016, uses techniques called "gradient-based one-side sampling" and "exclusive feature bundling" to train significantly faster than XGBoost on large datasets, often with comparable or better accuracy. It has become the preferred choice for very large datasets.
Business Insight. In practice, XGBoost and LightGBM are the algorithms behind many of the "AI-powered" features you encounter as a consumer -- credit scoring at banks, fraud detection at payment processors, customer lifetime value predictions at subscription companies, and pricing optimization at airlines. When a company says "we use machine learning for X" and X involves structured tabular data, there is a high probability that gradient boosting is under the hood.
When to Use Gradient Boosting
| Situation | Recommendation |
|---|---|
| Structured/tabular business data | Gradient boosting is often the top performer |
| Need maximum predictive accuracy | Gradient boosting typically beats random forests |
| Data is small to medium (<1M rows) | XGBoost is excellent |
| Data is large (>1M rows) | LightGBM is often faster |
| Interpretability is paramount | Consider random forests or logistic regression first |
| Rapid prototyping | Random forests are faster to tune |
Key Hyperparameters
| Parameter | What It Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of boosting rounds (trees) | 100-1000 |
| `learning_rate` | How much each tree's correction is weighted | 0.01-0.3 (lower = slower but often better) |
| `max_depth` | Maximum depth of each tree | 3-8 (shallower than random forests) |
| `subsample` | Fraction of training data used per tree | 0.7-1.0 |
| `colsample_bytree` | Fraction of features used per tree | 0.5-1.0 |
| `reg_alpha` / `reg_lambda` | L1/L2 regularization strength | 0-10 |
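Putting these hyperparameters together, a hedged sketch that uses XGBoost when available and falls back to scikit-learn's gradient boosting otherwise (dataset and parameter values are illustrative starting points, not tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=4,
        subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
    )
except ImportError:
    # Fallback: scikit-learn's gradient boosting (fewer knobs, same idea).
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=4, subsample=0.8,
    )

model.fit(X_tr, y_tr)
print(f"Test accuracy: {model.score(X_te, y_te):.3f}")
```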
"Here's what I want you to remember," Professor Okonkwo says. "For most classification problems with structured business data, you will try logistic regression as a baseline, random forest as a robust mid-range option, and XGBoost or LightGBM for maximum accuracy. In my consulting experience, the final model deployed in production is a gradient boosting model about sixty percent of the time. But you should always start simpler."
7.7 Feature Engineering for Classification
The algorithms we've covered are tools. Feature engineering is the craft that determines how well those tools work. In business applications, good feature engineering often contributes more to model performance than algorithm selection.
Encoding Categorical Variables
Machine learning algorithms operate on numbers. Categorical features -- like loyalty tier (Bronze, Silver, Gold, Platinum) or channel (Online, In-Store) -- must be converted to numerical representations.
One-Hot Encoding creates a binary column for each category:
import pandas as pd
# Before encoding
# loyalty_tier: ['Bronze', 'Silver', 'Gold', 'Platinum']
# After one-hot encoding
# loyalty_tier_Bronze: [1, 0, 0, 0]
# loyalty_tier_Silver: [0, 1, 0, 0]
# loyalty_tier_Gold: [0, 0, 1, 0]
# loyalty_tier_Platinum: [0, 0, 0, 1]
df_encoded = pd.get_dummies(df, columns=['loyalty_tier'], drop_first=True)
# drop_first=True drops one category's column (here Bronze) to avoid
# redundancy: the dropped category is implied when the others are all 0.
Ordinal Encoding assigns integers that preserve an ordering:
tier_mapping = {'Bronze': 1, 'Silver': 2, 'Gold': 3, 'Platinum': 4}
df['loyalty_tier_encoded'] = df['loyalty_tier'].map(tier_mapping)
Use ordinal encoding when the categories have a natural order (loyalty tiers, education level). Use one-hot encoding when they do not (product category, geographic region).
Caution. Be careful with one-hot encoding when a feature has many categories (e.g., zip codes with hundreds of values). This creates hundreds of sparse columns that can slow training and cause overfitting. In such cases, consider target encoding (mapping each category to the average target value for that category) or grouping rare categories into an "Other" category.
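A minimal target-encoding sketch (illustrative data; in practice the encoding must be computed on the training fold only, ideally with smoothing or out-of-fold estimates, to avoid target leakage):

```python
import pandas as pd

# Each zip code is replaced by its average churn rate -- one numeric
# column instead of hundreds of one-hot columns.
df = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "churned":  [1,        0,       1,       1,       0,       0],
})

means = df.groupby("zip_code")["churned"].mean()   # per-category target mean
df["zip_code_encoded"] = df["zip_code"].map(means)
print(df)
```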
Scaling Numerical Features
Some algorithms -- particularly logistic regression -- are sensitive to feature scales. If "annual income" ranges from 20,000 to 200,000 while "number of purchases" ranges from 1 to 50, the larger-scaled feature can dominate the model.
Standard Scaling transforms each feature to have mean 0 and standard deviation 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
Min-Max Scaling maps features to a 0-1 range:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
Caution. Fit the scaler on the training data only, then apply it to both training and test data. If you fit on the full dataset (including test data), you introduce subtle data leakage.
Tree-based models (decision trees, random forests, XGBoost) are generally not affected by feature scaling because they make decisions based on threshold comparisons, not magnitude. But it is good practice to scale anyway -- it doesn't hurt and it makes switching between models easier.
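The fit-on-train-only rule looks like this in practice (toy income values for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20_000.0], [60_000.0], [100_000.0]])
X_test = np.array([[80_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse them -- never refit on test

print(scaler.mean_)     # [60000.] -- computed from training data alone
print(X_test_scaled)    # the test point expressed in training-set units
```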
Creating Domain-Informed Features
The most powerful features are often derived from domain knowledge, not raw data. For Athena's churn model, Ravi's team creates several engineered features:
# Recency-Frequency-Monetary (RFM) features
df['recency'] = (reference_date - df['last_purchase_date']).dt.days
df['frequency'] = df['purchase_count_12m']
df['monetary'] = df['avg_order_value'] * df['purchase_count_12m']
# Behavioral change features (trends matter more than snapshots)
df['purchase_trend'] = (
df['purchases_last_3m'] / (df['purchases_prior_3m'] + 1)
)
# Engagement features
df['return_rate'] = df['items_returned'] / (df['items_purchased'] + 1)
df['online_share'] = df['online_purchases'] / (df['total_purchases'] + 1)
# Loyalty engagement
df['loyalty_points_used_ratio'] = (
df['loyalty_points_redeemed'] / (df['loyalty_points_earned'] + 1)
)
Try It. Look at the features above. Can you think of additional features that might predict churn for a retailer? Consider: seasonal buying patterns, response to marketing emails, customer service interactions, product category diversity, price sensitivity (proportion of purchases made during sales). Feature engineering is where business knowledge meets data science.
Handling Class Imbalance
In many business classification problems, the classes are unequal. For Athena's data, roughly 18 percent of customers churn and 82 percent do not. This class imbalance can cause models to be biased toward the majority class -- predicting "no churn" for everyone achieves 82 percent accuracy but catches zero churners.
Strategies for handling imbalance:
1. Class Weights. Most scikit-learn classifiers accept a class_weight='balanced' parameter that automatically adjusts the learning algorithm to pay more attention to the minority class.
2. Oversampling the Minority Class (SMOTE). Synthetic Minority Over-sampling Technique creates synthetic examples of the minority class by interpolating between existing examples:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
3. Undersampling the Majority Class. Remove examples from the majority class to balance the dataset. Simple but discards potentially useful data.
4. Threshold Adjustment. Rather than rebalancing the data, adjust the classification threshold. Instead of predicting "churn" at 0.5 probability, predict at 0.3 or 0.2. This is often the most practical approach for business applications.
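Threshold adjustment requires no retraining at all; it is a post-processing step on the model's predicted probabilities. A sketch with illustrative scores:

```python
import numpy as np

# Predicted churn probabilities for ten customers (illustrative values,
# e.g. from a classifier's predict_proba output).
probs = np.array([0.05, 0.12, 0.22, 0.28, 0.35,
                  0.41, 0.55, 0.63, 0.78, 0.91])

flag_default = (probs >= 0.5).astype(int)   # default threshold
flag_lowered = (probs >= 0.3).astype(int)   # wider net for the retention team

print(flag_default.sum(), "flagged at 0.5")  # 4 flagged at 0.5
print(flag_lowered.sum(), "flagged at 0.3")  # 6 flagged at 0.3
```

Lowering the threshold trades more wasted offers for fewer missed churners, which is exactly the trade-off Section 7.10 prices out.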
Business Insight. The best strategy depends on your data and your business context. In Athena's case, Ravi's team uses a combination of class weights during training and threshold adjustment during deployment. SMOTE is powerful but can introduce artifacts -- synthetic examples that don't represent real customer behaviors. As Professor Okonkwo says: "Resample with caution. The synthetic data knows nothing about your business."
7.8 The ChurnClassifier -- Building Athena's First ML Model
Athena Update. Ravi Mehta's team has spent three weeks preparing data for Athena's churn prediction pilot. They've unified customer records across the POS system and e-commerce platform (a painful data engineering effort that Chapter 4 warned them about). They've defined churn as "no purchase in 180 days" after extensive debate with the merchandising, marketing, and finance teams. The executive team is watching. This is the first ML model Athena will build, and if it fails, it will set back the AI initiative by a year.
Now we build. The following ChurnClassifier class encapsulates the full classification workflow -- data preparation, model training, evaluation, and business interpretation. We build it step by step, explaining each section.
Setting Up
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve
)
import warnings
warnings.filterwarnings('ignore')
# For gradient boosting -- install with: pip install xgboost
try:
from xgboost import XGBClassifier
XGBOOST_AVAILABLE = True
except ImportError:
XGBOOST_AVAILABLE = False
print("XGBoost not installed. Install with: pip install xgboost")
Code Explanation. We import scikit-learn's core classification tools along with XGBoost. The try/except block handles the case where XGBoost isn't installed -- the classifier will still work with logistic regression and random forests. If you completed the Python environment setup in Chapter 3, you should have scikit-learn installed. Install XGBoost separately with pip install xgboost.
Generating Synthetic Athena Data
In production, Ravi's team works with real customer data. For this textbook, we generate realistic synthetic data that mirrors Athena's customer base:
def generate_athena_churn_data(n_customers=5000, random_state=42):
"""
Generate synthetic customer data for Athena Retail Group's
churn prediction model.
Features mirror Athena's actual loyalty program data:
- Behavioral features (purchase patterns)
- Engagement features (channel usage, returns)
- Demographic features (tenure, loyalty tier)
"""
np.random.seed(random_state)
# --- Customer tenure (months since first purchase) ---
tenure = np.random.exponential(scale=24, size=n_customers).clip(1, 120)
# --- Loyalty tier (influenced by tenure) ---
# Broadcast each customer's tenure condition against the 4 tier probabilities
tier_probs = np.where(
    (tenure > 60)[:, None], [0.1, 0.2, 0.3, 0.4],
    np.where((tenure > 24)[:, None], [0.2, 0.3, 0.3, 0.2],
             [0.5, 0.3, 0.15, 0.05])
)
# Normalize probabilities
tier_probs = tier_probs / tier_probs.sum(axis=1, keepdims=True)
tiers = np.array([
np.random.choice(
['Bronze', 'Silver', 'Gold', 'Platinum'], p=p
) for p in tier_probs
])
# --- Purchase behavior ---
# Base frequency influenced by tier
tier_freq_base = {'Bronze': 3, 'Silver': 6, 'Gold': 10, 'Platinum': 15}
base_freq = np.array([tier_freq_base[t] for t in tiers], dtype=float)
purchase_count_12m = np.random.poisson(base_freq).clip(0, 50)
# Recency (days since last purchase) -- lower tier = higher recency
tier_recency_base = {'Bronze': 90, 'Silver': 45, 'Gold': 25, 'Platinum': 14}
base_recency = np.array([tier_recency_base[t] for t in tiers], dtype=float)
days_since_last_purchase = np.random.exponential(base_recency).clip(0, 365)
# Average order value
avg_order_value = np.random.lognormal(mean=3.8, sigma=0.5, size=n_customers)
avg_order_value = avg_order_value.clip(15, 500)
# Return rate (0 to 1)
return_rate = np.random.beta(2, 10, size=n_customers)
# Online purchase share (0 to 1)
online_share = np.random.beta(3, 3, size=n_customers)
# Category diversity (number of distinct categories, 1-12)
category_diversity = np.random.poisson(
lam=np.where(purchase_count_12m > 5, 5, 2)
).clip(1, 12)
# Purchase trend (ratio of last 3 months to prior 3 months)
purchase_trend = np.random.lognormal(mean=0, sigma=0.5, size=n_customers)
purchase_trend = purchase_trend.clip(0.1, 5.0)
# Email engagement rate
email_engagement = np.random.beta(2, 5, size=n_customers)
# --- Generate churn labels (influenced by features) ---
churn_score = (
0.3 * (days_since_last_purchase / 365) # Higher recency = more churn
- 0.25 * (purchase_count_12m / 50) # More purchases = less churn
- 0.15 * (avg_order_value / 500) # Higher AOV = less churn
+ 0.15 * return_rate # Higher returns = more churn
- 0.1 * (tenure / 120) # Longer tenure = less churn
- 0.2 * purchase_trend.clip(0, 2) / 2 # Increasing trend = less churn
- 0.1 * email_engagement # More engaged = less churn
+ np.random.normal(0, 0.15, size=n_customers) # Noise
)
# Convert to probability and sample
churn_prob = 1 / (1 + np.exp(-5 * (churn_score - 0.05)))
churned = (np.random.random(n_customers) < churn_prob).astype(int)
# Build DataFrame
df = pd.DataFrame({
'customer_id': [f'ATH-{i:05d}' for i in range(n_customers)],
'tenure_months': np.round(tenure, 1),
'loyalty_tier': tiers,
'purchase_count_12m': purchase_count_12m,
'days_since_last_purchase': np.round(days_since_last_purchase, 0),
'avg_order_value': np.round(avg_order_value, 2),
'return_rate': np.round(return_rate, 3),
'online_share': np.round(online_share, 3),
'category_diversity': category_diversity,
'purchase_trend': np.round(purchase_trend, 3),
'email_engagement': np.round(email_engagement, 3),
'churned': churned
})
return df
# Generate and inspect the data
df = generate_athena_churn_data(n_customers=5000)
print(f"Dataset shape: {df.shape}")
print(f"\nChurn distribution:")
print(df['churned'].value_counts(normalize=True).round(3))
print(f"\nFeature summary:")
print(df.describe().round(2))
Code Explanation. The synthetic data generator creates 5,000 customer records with realistic correlations. Churn is not random -- it is influenced by recency (the strongest predictor), purchase frequency, order value, return behavior, tenure, purchasing trends, and email engagement. The correlations are deliberately imperfect (note the noise term) to simulate real-world messiness. The churn rate is approximately 20 percent, consistent with Athena's observed rate.
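The logistic squashing in the generator's churn_prob line is worth isolating. A standalone sketch of how a linear churn score becomes a probability, using the steepness (5) and midpoint (0.05) from the code above -- the function name here is ours, not part of the generator:

```python
import numpy as np

# Logistic conversion used by the generator: steepness 5, midpoint 0.05
def score_to_churn_prob(score):
    return 1 / (1 + np.exp(-5 * (score - 0.05)))

print(round(score_to_churn_prob(0.05), 2))  # 0.5 -- a score at the midpoint
print(round(score_to_churn_prob(0.5), 2))   # 0.9 -- a high-risk score
print(round(score_to_churn_prob(-0.3), 2))  # 0.15 -- a low-risk score
```

The steepness controls how sharply risk separates around the midpoint; the added noise term then keeps the relationship from being deterministic, as real churn never is.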
The Complete ChurnClassifier Class
class ChurnClassifier:
"""
End-to-end churn classification pipeline for Athena Retail Group.
Trains and compares multiple classification models, evaluates
performance using business-relevant metrics, and provides
interpretable results for the retention team.
Usage:
classifier = ChurnClassifier(df)
classifier.prepare_data()
classifier.train_models()
classifier.compare_models()
classifier.analyze_feature_importance()
classifier.optimize_threshold()
classifier.generate_business_report()
"""
def __init__(self, data, target_col='churned', id_col='customer_id',
test_size=0.2, random_state=42):
"""
Initialize the ChurnClassifier.
Parameters:
-----------
data : pd.DataFrame
Customer data with features and churn labels.
target_col : str
Name of the target column (default: 'churned').
id_col : str
Name of the customer ID column (default: 'customer_id').
test_size : float
Fraction of data reserved for testing (default: 0.2).
random_state : int
Random seed for reproducibility (default: 42).
"""
self.data = data.copy()
self.target_col = target_col
self.id_col = id_col
self.test_size = test_size
self.random_state = random_state
self.models = {}
self.results = {}
self.best_model_name = None
self.scaler = StandardScaler()
self.feature_names = None
def prepare_data(self):
"""
Prepare features and split into train/test sets.
Handles:
- Dropping ID column
- Encoding categorical variables
- Scaling numerical features
- Train/test splitting
"""
df = self.data.copy()
# Separate target
y = df[self.target_col]
X = df.drop(columns=[self.target_col, self.id_col])
# Encode categorical features
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
self.feature_names = X.columns.tolist()
# Train/test split
self.X_train, self.X_test, self.y_train, self.y_test = (
train_test_split(X, y, test_size=self.test_size,
random_state=self.random_state, stratify=y)
)
# Scale features (fit on training data only)
self.X_train_scaled = pd.DataFrame(
self.scaler.fit_transform(self.X_train),
columns=self.feature_names,
index=self.X_train.index
)
self.X_test_scaled = pd.DataFrame(
self.scaler.transform(self.X_test),
columns=self.feature_names,
index=self.X_test.index
)
print("Data Preparation Complete")
print(f" Training samples: {len(self.X_train)}")
print(f" Test samples: {len(self.X_test)}")
print(f" Features: {len(self.feature_names)}")
print(f" Churn rate (train): {self.y_train.mean():.1%}")
print(f" Churn rate (test): {self.y_test.mean():.1%}")
def train_models(self):
"""
Train multiple classification models and store results.
Models trained:
1. Logistic Regression (baseline)
2. Random Forest
3. XGBoost (if available)
"""
print("\n" + "=" * 60)
print("MODEL TRAINING")
print("=" * 60)
# --- Model 1: Logistic Regression ---
print("\n[1/3] Training Logistic Regression...")
lr = LogisticRegression(
class_weight='balanced',
max_iter=1000,
random_state=self.random_state
)
lr.fit(self.X_train_scaled, self.y_train)
self.models['Logistic Regression'] = lr
self._evaluate_model('Logistic Regression', lr,
self.X_test_scaled)
# --- Model 2: Random Forest ---
print("[2/3] Training Random Forest...")
rf = RandomForestClassifier(
n_estimators=200,
max_depth=15,
min_samples_split=10,
min_samples_leaf=5,
class_weight='balanced',
random_state=self.random_state,
n_jobs=-1
)
rf.fit(self.X_train, self.y_train) # Trees don't need scaling
self.models['Random Forest'] = rf
self._evaluate_model('Random Forest', rf, self.X_test)
# --- Model 3: XGBoost ---
if XGBOOST_AVAILABLE:
print("[3/3] Training XGBoost...")
# Calculate scale_pos_weight for class imbalance
n_neg = (self.y_train == 0).sum()
n_pos = (self.y_train == 1).sum()
scale_weight = n_neg / n_pos
xgb = XGBClassifier(
n_estimators=200,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
colsample_bytree=0.8,
scale_pos_weight=scale_weight,
random_state=self.random_state,
eval_metric='logloss'  # use_label_encoder is obsolete (removed in XGBoost 2.x)
)
xgb.fit(self.X_train, self.y_train)
self.models['XGBoost'] = xgb
self._evaluate_model('XGBoost', xgb, self.X_test)
else:
print("[3/3] Skipping XGBoost (not installed)")
print("\nAll models trained successfully.")
def _evaluate_model(self, name, model, X_test):
"""Evaluate a single model and store results."""
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
results = {
'accuracy': accuracy_score(self.y_test, y_pred),
'precision': precision_score(self.y_test, y_pred),
'recall': recall_score(self.y_test, y_pred),
'f1': f1_score(self.y_test, y_pred),
'auc_roc': roc_auc_score(self.y_test, y_prob),
'y_pred': y_pred,
'y_prob': y_prob
}
self.results[name] = results
print(f" {name}: AUC={results['auc_roc']:.3f}, "
f"F1={results['f1']:.3f}, "
f"Recall={results['recall']:.3f}")
def compare_models(self):
"""
Display a side-by-side comparison of all trained models.
"""
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
comparison = pd.DataFrame({
name: {
'Accuracy': f"{r['accuracy']:.3f}",
'Precision': f"{r['precision']:.3f}",
'Recall': f"{r['recall']:.3f}",
'F1 Score': f"{r['f1']:.3f}",
'AUC-ROC': f"{r['auc_roc']:.3f}"
}
for name, r in self.results.items()
})
print(comparison.to_string())
# Identify best model by AUC-ROC
self.best_model_name = max(
self.results, key=lambda k: self.results[k]['auc_roc']
)
print(f"\nBest model by AUC-ROC: {self.best_model_name} "
f"({self.results[self.best_model_name]['auc_roc']:.3f})")
return comparison
def analyze_feature_importance(self, top_n=10):
"""
Analyze and display feature importance from the best
tree-based model.
Parameters:
-----------
top_n : int
Number of top features to display (default: 10).
"""
print("\n" + "=" * 60)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 60)
# Use Random Forest or XGBoost for feature importance
if 'XGBoost' in self.models:
model = self.models['XGBoost']
model_name = 'XGBoost'
elif 'Random Forest' in self.models:
model = self.models['Random Forest']
model_name = 'Random Forest'
else:
print("No tree-based model available for feature importance.")
return None
importances = model.feature_importances_
feature_imp = pd.DataFrame({
'Feature': self.feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print(f"\nTop {top_n} Features ({model_name}):")
print("-" * 45)
for i, row in feature_imp.head(top_n).iterrows():
bar = '#' * int(row['Importance'] * 100)
print(f" {row['Feature']:<30} {row['Importance']:.4f} {bar}")
# Also show logistic regression coefficients if available
if 'Logistic Regression' in self.models:
lr = self.models['Logistic Regression']
coef_df = pd.DataFrame({
'Feature': self.feature_names,
'Coefficient': lr.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)
print(f"\nLogistic Regression Coefficients (top {top_n}):")
print("-" * 50)
for i, row in coef_df.head(top_n).iterrows():
direction = "+" if row['Coefficient'] > 0 else "-"
print(f" {direction} {row['Feature']:<30} "
f"{row['Coefficient']:+.4f}")
self.feature_importance = feature_imp
return feature_imp
def optimize_threshold(self, cost_fp=20, cost_fn=340,
revenue_saved=500, save_rate=0.4):
"""
Find the optimal classification threshold based on
business economics.
Parameters:
-----------
cost_fp : float
Cost of a false positive (retention offer to non-churner).
cost_fn : float
Cost of a false negative (lost customer lifetime value).
revenue_saved : float
Annual revenue from a successfully retained customer.
save_rate : float
Probability that a targeted churner is retained.
"""
print("\n" + "=" * 60)
print("THRESHOLD OPTIMIZATION")
print("=" * 60)
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
y_prob = best_model.predict_proba(self.X_test_scaled)[:, 1]
else:
y_prob = best_model.predict_proba(self.X_test)[:, 1]
thresholds = np.arange(0.1, 0.9, 0.05)
best_threshold = 0.5
best_net_value = float('-inf')
threshold_results = []
for thresh in thresholds:
y_pred_thresh = (y_prob >= thresh).astype(int)
tn, fp, fn, tp = confusion_matrix(
self.y_test, y_pred_thresh
).ravel()
# Business value calculation
value_tp = tp * save_rate * revenue_saved # Revenue saved
cost_intervention = (tp + fp) * cost_fp # Intervention cost
cost_missed = fn * cost_fn # Lost customers
net_value = value_tp - cost_intervention - cost_missed
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
threshold_results.append({
'Threshold': round(thresh, 2),
'Precision': round(precision, 3),
'Recall': round(recall, 3),
'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn,
'Net Value ($)': round(net_value, 0),
'Customers Targeted': tp + fp
})
if net_value > best_net_value:
best_net_value = net_value
best_threshold = thresh
results_df = pd.DataFrame(threshold_results)
print(f"\nBusiness Parameters:")
print(f" Cost of false positive (wasted offer): ${cost_fp}")
print(f" Cost of false negative (lost customer): ${cost_fn}")
print(f" Revenue per saved customer: ${revenue_saved}")
print(f" Save rate for targeted customers: {save_rate:.0%}")
print(f"\nThreshold Analysis (showing every other threshold):")
print(results_df.iloc[::2].to_string(index=False))
print(f"\nOptimal threshold: {best_threshold:.2f}")
print(f"Net business value at optimal threshold: "
f"${best_net_value:,.0f}")
self.optimal_threshold = best_threshold
self.threshold_results = results_df
return best_threshold, results_df
def generate_business_report(self, threshold=None):
"""
Generate a business-oriented summary of model results,
translating metrics into actions and dollar values.
Parameters:
-----------
threshold : float
Classification threshold (default: uses optimized threshold).
"""
if threshold is None:
threshold = getattr(self, 'optimal_threshold', 0.5)
print("\n" + "=" * 60)
print("ATHENA RETAIL GROUP -- CHURN PREDICTION REPORT")
print("=" * 60)
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
y_prob = best_model.predict_proba(self.X_test_scaled)[:, 1]
else:
y_prob = best_model.predict_proba(self.X_test)[:, 1]
y_pred = (y_prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(self.y_test, y_pred).ravel()
total = len(self.y_test)
print(f"\nModel: {self.best_model_name}")
print(f"Threshold: {threshold:.2f}")
print(f"Test set size: {total} customers")
print(f"\nConfusion Matrix:")
print(f" Predicted: Stay Predicted: Churn")
print(f" Actual: Stayed {tn:>8,} {fp:>8,}")
print(f" Actual: Churned {fn:>8,} {tp:>8,}")
print(f"\nBusiness Translation:")
print(f" Churners correctly identified: {tp:,} of "
f"{tp + fn:,} ({tp/(tp+fn):.1%})")
print(f" Loyal customers incorrectly targeted: {fp:,} of "
f"{tn + fp:,} ({fp/(tn+fp):.1%})")
print(f" Total customers to target: {tp + fp:,} "
f"({(tp+fp)/total:.1%} of customer base)")
# Risk segmentation
print(f"\nRisk Segmentation:")
high_risk = (y_prob >= 0.7).sum()
medium_risk = ((y_prob >= 0.4) & (y_prob < 0.7)).sum()
low_risk = ((y_prob >= 0.2) & (y_prob < 0.4)).sum()
minimal_risk = (y_prob < 0.2).sum()
print(f" High risk (>=70%): {high_risk:>6,} customers")
print(f" Medium risk (40-70%): {medium_risk:>6,} customers")
print(f" Low risk (20-40%): {low_risk:>6,} customers")
print(f" Minimal risk (<20%): {minimal_risk:>6,} customers")
print(f"\nRecommended Actions:")
print(f" HIGH RISK: Personal outreach + premium retention offer")
print(f" MEDIUM RISK: Targeted email campaign + loyalty bonus")
print(f" LOW RISK: Automated engagement nudge")
print(f" MINIMAL: Standard communication cadence")
def predict_customer(self, customer_data, threshold=None):
"""
Predict churn probability for a single customer and
explain the key risk factors.
Parameters:
-----------
customer_data : dict
Dictionary of feature values for one customer.
threshold : float
Classification threshold.
Returns:
--------
dict with prediction, probability, and risk factors.
"""
if threshold is None:
threshold = getattr(self, 'optimal_threshold', 0.5)
# Create DataFrame from customer data
cust_df = pd.DataFrame([customer_data])
# Encode categorical features
categorical_cols = cust_df.select_dtypes(
include=['object']
).columns.tolist()
# Note: no drop_first here -- with a single customer, get_dummies would
# drop the only dummy column and silently lose the category; the column
# alignment below maps the encoding onto the training features instead
cust_encoded = pd.get_dummies(cust_df, columns=categorical_cols)
# Align columns with training data
for col in self.feature_names:
if col not in cust_encoded.columns:
cust_encoded[col] = 0
cust_encoded = cust_encoded[self.feature_names]
# Predict
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
cust_scaled = self.scaler.transform(cust_encoded)
prob = best_model.predict_proba(cust_scaled)[0, 1]
else:
prob = best_model.predict_proba(cust_encoded)[0, 1]
prediction = 'CHURN RISK' if prob >= threshold else 'LIKELY RETAIN'
# Risk level
if prob >= 0.7:
risk_level = 'HIGH'
elif prob >= 0.4:
risk_level = 'MEDIUM'
elif prob >= 0.2:
risk_level = 'LOW'
else:
risk_level = 'MINIMAL'
result = {
'churn_probability': round(prob, 3),
'prediction': prediction,
'risk_level': risk_level,
'threshold_used': threshold
}
print(f"\nCustomer Churn Assessment")
print(f"-" * 40)
print(f" Churn probability: {prob:.1%}")
print(f" Risk level: {risk_level}")
print(f" Prediction: {prediction}")
return result
Code Explanation. The ChurnClassifier class follows the workflow we outlined in Section 7.2. Key design decisions:
- prepare_data() handles encoding and scaling, fitting the scaler on training data only (avoiding leakage).
- train_models() trains three models using class_weight='balanced' (logistic regression and random forest) and scale_pos_weight (XGBoost) to handle class imbalance.
- compare_models() presents results side by side and identifies the best model by AUC-ROC.
- optimize_threshold() translates business economics into threshold selection -- this is where the model meets the business.
- generate_business_report() produces output that a marketing manager can read and act on.
- predict_customer() demonstrates single-customer prediction for operational use.
Running the Complete Pipeline
# Generate synthetic Athena customer data
df = generate_athena_churn_data(n_customers=5000)
# Initialize and run the classifier
classifier = ChurnClassifier(df)
classifier.prepare_data()
classifier.train_models()
classifier.compare_models()
classifier.analyze_feature_importance()
classifier.optimize_threshold(
cost_fp=20, # Cost of retention offer sent to non-churner
cost_fn=340, # Lifetime value lost when missing a churner
revenue_saved=500, # Annual revenue from a retained customer
save_rate=0.4 # 40% of targeted churners are successfully retained
)
classifier.generate_business_report()
Example Output
When you run the pipeline, you will see output similar to:
Data Preparation Complete
Training samples: 4000
Test samples: 1000
Features: 13
Churn rate (train): 20.3%
Churn rate (test): 19.8%
============================================================
MODEL TRAINING
============================================================
[1/3] Training Logistic Regression...
Logistic Regression: AUC=0.817, F1=0.541, Recall=0.646
[2/3] Training Random Forest...
Random Forest: AUC=0.838, F1=0.565, Recall=0.611
[3/3] Training XGBoost...
XGBoost: AUC=0.851, F1=0.582, Recall=0.631
============================================================
MODEL COMPARISON
============================================================
Logistic Regression Random Forest XGBoost
Accuracy 0.791 0.812 0.821
Precision 0.467 0.524 0.540
Recall 0.646 0.611 0.631
F1 Score 0.541 0.565 0.582
AUC-ROC 0.817 0.838 0.851
Best model by AUC-ROC: XGBoost (0.851)
Athena Update. The numbers tell a story. XGBoost achieves the highest AUC (0.851), meaning it does the best job overall of separating churners from non-churners. But notice: even the best model's precision is only 0.54 -- meaning that of every 10 customers it flags as likely churners, only about 5 actually churn. Is that good enough? It depends on the business economics, which is exactly what threshold optimization answers.
Making a Single Customer Prediction
# Predict churn for a specific customer
sample_customer = {
'tenure_months': 8.5,
'loyalty_tier': 'Bronze',
'purchase_count_12m': 2,
'days_since_last_purchase': 95,
'avg_order_value': 42.50,
'return_rate': 0.15,
'online_share': 0.8,
'category_diversity': 2,
'purchase_trend': 0.5,
'email_engagement': 0.1
}
result = classifier.predict_customer(sample_customer)
Customer Churn Assessment
----------------------------------------
Churn probability: 78.3%
Risk level: HIGH
Prediction: CHURN RISK
NK studies the output. "So this customer -- short tenure, only two purchases, hasn't bought in three months, doesn't engage with emails, and their purchase trend is declining. I don't need a model to tell me this person is at risk."
"Correct," Professor Okonkwo says. "The model confirms your intuition for obvious cases. Its real value is in the ambiguous cases -- the customer with twelve purchases who just had a bad experience, the Gold member whose purchase trend just started declining. Those are the customers the model catches before the human eye does. And at scale -- across one hundred thousand customers -- human intuition simply cannot process the patterns fast enough."
7.9 Interpreting Classification Results
Building a model is half the work. Interpreting its results -- and communicating those results to business stakeholders -- is the other half. This section covers the essential metrics for classification and, more importantly, what they mean for business decisions.
The Confusion Matrix
Every classification model's performance can be summarized in a 2x2 table:
Predicted: No Churn Predicted: Churn
Actual: Did Not Churn TN (True Neg) FP (False Pos)
Actual: Churned FN (False Neg) TP (True Pos)
| Cell | What Happened | Business Meaning (Churn Example) |
|---|---|---|
| True Positive (TP) | Model correctly predicted churn | We caught an at-risk customer and can intervene |
| True Negative (TN) | Model correctly predicted no churn | We correctly left a loyal customer alone |
| False Positive (FP) | Model predicted churn, customer stayed | We wasted a retention offer on a loyal customer |
| False Negative (FN) | Model predicted no churn, customer left | We missed an at-risk customer -- they left without intervention |
The Key Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The proportion of all predictions that were correct. Intuitive but misleading for imbalanced classes. A model that never predicts churn achieves 80 percent accuracy on a dataset where 20 percent of customers churn -- while catching zero churners.
NK's eyes widen. "Wait. Eighty percent accuracy by predicting nobody churns? That's... incredibly misleading."
"Welcome to the accuracy paradox," Professor Okonkwo says. "This is the single most common misunderstanding I see in boardroom presentations about AI. A VP shows a slide that says 'our model is 85 percent accurate' and the room applauds. But nobody asks what the model does with the 15 percent it gets wrong."
Caution. Never evaluate a classification model on accuracy alone, especially when classes are imbalanced. A model with 98 percent accuracy that never predicts the minority class is not a good model -- it is a coin that always lands on the same side.
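The accuracy paradox is easy to reproduce. A sketch using scikit-learn's DummyClassifier, which always predicts the majority class (the 1,000-customer toy dataset is ours):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1,000 customers, 20% churners -- the features carry no signal at all
y = np.array([0] * 800 + [1] * 200)
X = np.zeros((1000, 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))  # 0.8 -- looks respectable on a slide
print(recall_score(y, pred))    # 0.0 -- catches zero churners
```

This is the boardroom slide Professor Okonkwo warns about: 80 percent accuracy, zero business value.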
Precision = TP / (TP + FP)
Of the customers we flagged as churners, what proportion actually churned? High precision means fewer wasted retention offers.
Recall (Sensitivity) = TP / (TP + FN)
Of the customers who actually churned, what proportion did we identify? High recall means fewer missed at-risk customers.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean of precision and recall. Useful as a single-number summary, but the individual precision and recall values tell you more.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
AUC-ROC measures how well the model separates the two classes across all possible thresholds. It ranges from 0.5 (random guessing) to 1.0 (perfect separation).
Definition. AUC-ROC is a threshold-independent metric that measures a classifier's ability to distinguish between classes. An AUC of 0.85 means that if you randomly pick one churner and one non-churner, the model will assign a higher churn probability to the actual churner 85 percent of the time.
| AUC-ROC Range | Interpretation |
|---|---|
| 0.90 - 1.00 | Excellent |
| 0.80 - 0.90 | Good |
| 0.70 - 0.80 | Fair |
| 0.60 - 0.70 | Poor |
| 0.50 - 0.60 | No better than random |
Athena's model achieves an AUC of approximately 0.83 to 0.85, which is solidly in the "good" range. Is that good enough? It depends entirely on the business application. We address this question next.
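The pairwise-ranking interpretation in the definition above can be verified directly: count the fraction of (churner, non-churner) pairs the model's scores rank correctly and compare it to scikit-learn's roc_auc_score. A sketch on synthetic labels and scores of our own construction:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                      # random 0/1 labels
scores = y * 0.6 + rng.normal(0, 0.5, size=1000)  # informative, noisy scores

# Fraction of positive/negative pairs ranked correctly (ties count half)
pos, neg = scores[y == 1], scores[y == 0]
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(np.isclose(pairwise, roc_auc_score(y, scores)))  # True
```

The equivalence is why AUC is threshold-independent: it scores the ranking of customers by risk, not any particular cutoff.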
The Precision-Recall Tradeoff
As we discussed in Chapter 6, precision and recall are inversely related: increasing one typically decreases the other. The tradeoff is controlled by the classification threshold.
| Lower Threshold (e.g., 0.2) | Higher Threshold (e.g., 0.7) |
|---|---|
| More customers flagged as at-risk | Fewer customers flagged |
| Higher recall (catch more churners) | Lower recall (miss more churners) |
| Lower precision (more false alarms) | Higher precision (fewer false alarms) |
| Cast a wider net | Cast a narrower net |
| Best when cost of missing a churner is high | Best when cost of intervention is high |
7.10 From Model to Decision -- Translating Predictions into Actions
Athena Update. The model achieves an AUC of 0.83. Ravi presents the results to Athena's executive team. The CMO says, "Impressive. Ship it." The VP of Operations, Maya Gonzalez, crosses her arms: "Ship it where? What exactly do we do with a list of customers who might churn? We have three marketing specialists on the retention team. They can't personally call twenty thousand people."
Maya's pushback is the most important moment in Athena's ML journey so far. A model that produces predictions without a clear action plan is a model that will be ignored. Ravi learns a lesson that Professor Okonkwo has been teaching all semester: the model is not the product. The decision system is the product.
Designing the Intervention Strategy
Ravi's team works with the retention team to design a tiered intervention strategy based on the model's probability scores:
| Risk Tier | Probability Range | Volume (est.) | Intervention | Cost per Customer |
|---|---|---|---|---|
| Critical | >= 0.80 | ~500/month | Personal phone call + premium offer (free shipping for 6 months + 20% next purchase) | $35 |
| High | 0.60 - 0.79 | ~1,200/month | Personalized email campaign (3-touch series) + loyalty bonus points | $15 |
| Moderate | 0.40 - 0.59 | ~2,000/month | Automated re-engagement email + product recommendations | $3 |
| Low | 0.20 - 0.39 | ~3,500/month | Automated "We miss you" notification (push/email) | $0.50 |
| Minimal | < 0.20 | ~12,800/month | Standard communication cadence (no additional intervention) | $0 |
This tiered approach solves Maya's problem. The three marketing specialists focus exclusively on the Critical tier (~500 customers/month). The High tier gets a semi-automated campaign. The Moderate and Low tiers are fully automated. And the Minimal tier receives no additional treatment -- saving resources for where they matter.
The Business Case
Tom runs the numbers for a monthly cycle:
# Monthly business impact estimation
# Tier volumes and economics
tiers = {
'Critical': {
'volume': 500, 'cost_per': 35, 'actual_churn_rate': 0.75,
'save_rate': 0.50, 'annual_value': 620
},
'High': {
'volume': 1200, 'cost_per': 15, 'actual_churn_rate': 0.55,
'save_rate': 0.40, 'annual_value': 480
},
'Moderate': {
'volume': 2000, 'cost_per': 3, 'actual_churn_rate': 0.35,
'save_rate': 0.25, 'annual_value': 380
},
'Low': {
'volume': 3500, 'cost_per': 0.50, 'actual_churn_rate': 0.18,
'save_rate': 0.10, 'annual_value': 340
},
}
total_cost = 0
total_revenue_saved = 0
print(f"{'Tier':<12} {'Volume':>7} {'Cost':>10} {'Saved':>7} "
f"{'Revenue Saved':>15}")
print("-" * 55)
for tier_name, t in tiers.items():
cost = t['volume'] * t['cost_per']
churners_in_tier = int(t['volume'] * t['actual_churn_rate'])
saved = int(churners_in_tier * t['save_rate'])
revenue = saved * t['annual_value']
total_cost += cost
total_revenue_saved += revenue
print(f"{tier_name:<12} {t['volume']:>7,} ${cost:>9,.0f} "
f"{saved:>7,} ${revenue:>14,.0f}")
print("-" * 55)
print(f"{'TOTAL':<12} {'':<7} ${total_cost:>9,.0f} "
f"{'':>7} ${total_revenue_saved:>14,.0f}")
print(f"\nMonthly net value: ${total_revenue_saved - total_cost:,.0f}")
print(f"Annual net value: ${(total_revenue_saved - total_cost) * 12:,.0f}")
Athena Update. When Ravi presents the tiered intervention plan alongside the model, the VP of Operations drops her objections. "This I can work with," Maya says. "We're not drowning the team in a firehose of names. We're giving them a prioritized list with specific actions." The model enters a three-month pilot. The executive team is cautiously optimistic. Athena's AI journey has its first real deliverable.
The Feedback Loop
Professor Okonkwo adds a final point: "The model's job doesn't end at prediction. Once the retention campaigns run, you collect new data -- which customers responded to the intervention, which ones churned despite it, which ones would have stayed anyway. That feedback becomes training data for the next version of the model. This is the monitoring and maintenance stage from Chapter 6. The model learns, the interventions improve, and the cycle continues."
She draws a circle on the whiteboard:
Predict --> Intervene --> Measure --> Retrain --> Predict
"This feedback loop is what separates a one-time analytics project from a production ML system. We'll cover it in depth in Chapter 12 when we discuss MLOps."
7.11 Algorithms at a Glance -- A Comparison Guide
For reference, here is a summary comparison of the four classification algorithms covered in this chapter:
| Characteristic | Logistic Regression | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|---|
| Intuition | Draw a boundary line | Learn if-then rules | Vote among many trees | Correct mistakes sequentially |
| Interpretability | High (coefficients) | Very high (rules) | Medium (feature importance) | Medium (feature importance) |
| Accuracy on tabular data | Good baseline | Moderate (overfits easily) | Good to excellent | Excellent |
| Training speed | Very fast | Fast | Moderate | Moderate to slow |
| Handles non-linear patterns | No (without engineering) | Yes | Yes | Yes |
| Risk of overfitting | Low | High | Low | Moderate |
| Feature scaling required | Yes | No | No | No |
| When to use | Baseline, regulatory, interpretability | Quick exploration, rule extraction | Robust general-purpose | Maximum accuracy on structured data |
| Key hyperparameters | C (regularization) | max_depth, min_samples | n_estimators, max_depth | n_estimators, learning_rate, max_depth |
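The "key hyperparameters" row maps directly onto scikit-learn constructor arguments. The sketch below assumes scikit-learn is available; the specific values are illustrative starting points, not tuned settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    # Requires feature scaling, hence the pipeline; C controls regularization
    'logistic_regression': make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    # max_depth and min_samples_leaf rein in the overfitting noted above
    'decision_tree': DecisionTreeClassifier(max_depth=5, min_samples_leaf=20),
    # Tree ensembles are scale-invariant; no scaler needed
    'random_forest': RandomForestClassifier(n_estimators=300, max_depth=10),
    'gradient_boosting': GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=3),
}
```

Because all four expose the same `fit` / `predict_proba` interface, looping over this dictionary is a convenient way to benchmark them on a single train/test split before committing to one.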
Business Insight. Algorithm selection in business is rarely about raw accuracy. It is about the intersection of accuracy, interpretability, speed, and organizational trust. A model the business team understands and uses at 0.82 AUC creates more value than a model the business team ignores at 0.87 AUC. Ravi Mehta puts it this way: "The best model is the one that gets deployed."
7.12 Chapter Summary and the Road Ahead
This chapter covered the core concepts and tools of supervised classification -- the most widely applied category of machine learning in business. We moved from the economics of classification (why targeting the right customers is worth millions) through four algorithms (logistic regression, decision trees, random forests, gradient boosting), feature engineering techniques, a complete implementation (ChurnClassifier), evaluation metrics, and the critical step of connecting model outputs to business actions.
Several themes will recur throughout Part 2 and beyond:
- Business framing precedes modeling. The value of a classification model is determined before the first line of code is written -- by the problem definition, the target variable design, and the action plan that follows prediction.
- Metrics must be tied to economics. Accuracy, precision, recall, and AUC are useful summaries, but the question that matters is: "How much is this model worth in dollars?" Chapter 11 will extend this framework to a full model evaluation methodology.
- Feature engineering is where domain expertise meets data science. The algorithms are commoditized. The competitive advantage lies in knowing which features to build and how to encode business knowledge into the model.
- The model is not the product. Athena's churn model created value only when it was embedded in a tiered retention strategy that the operations team could execute. This lesson will be reinforced in Chapter 12 (MLOps) and Chapter 34 (Measuring AI ROI).
- Interpretability is not optional. Understanding why a model makes its predictions is essential for stakeholder trust, regulatory compliance, and continuous improvement. We will explore interpretability tools in depth in Chapter 26 (Fairness, Explainability, and Transparency).
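The "dollars" question can be made concrete: given a cost per contact and a value per saved customer, a model's confusion-matrix counts convert directly into campaign economics. A minimal sketch using the chapter's opening numbers ($20 per contact, $500 per saved customer, 40% save rate); the function name is ours, not a library API:

```python
def campaign_value(tp, fp, cost_per_contact=20, value_per_save=500,
                   save_rate=0.40):
    """Net value of contacting everyone the model flags as a churner.

    tp: flagged customers who would actually have churned (savable)
    fp: flagged customers who would have stayed anyway (wasted spend)
    """
    contacts = tp + fp
    revenue_saved = tp * save_rate * value_per_save  # only real churners save
    return revenue_saved - contacts * cost_per_contact

# "Target everyone": 5,000 true churners, 95,000 loyal customers flagged
print(campaign_value(tp=5_000, fp=95_000))  # -1,000,000: the opening loss
```

Swapping in a model's actual true-positive and false-positive counts turns any classifier comparison into a direct dollar comparison, which is the evaluation lens Chapter 11 develops in full.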
Caution. We have introduced classification metrics in this chapter, but we have only scratched the surface. Cross-validation, hyperparameter tuning, learning curves, calibration curves, and cost-sensitive evaluation are covered in Chapter 11 (Model Evaluation and Selection). Do not skip that chapter -- evaluation is where most real-world ML projects succeed or fail.
Tom closes his laptop. "I built the model in twenty minutes. Understanding what to do with it took the entire class."
"Now you're learning," Professor Okonkwo says.
In Chapter 8, we turn from classification (predicting categories) to regression (predicting continuous values). Athena's next challenge: forecasting demand across 12,000 SKUs. The DemandForecaster awaits.