In This Chapter
- 7.1 What Is Classification?
- 7.2 The Classification Workflow
- 7.3 Logistic Regression -- The Starting Point
- 7.4 Decision Trees -- Interpretable, Intuitive, Fragile
- 7.5 Random Forests -- The Wisdom of Crowds
- 7.6 Gradient Boosting -- Sequential Correction
- 7.7 Feature Engineering for Classification
- 7.8 The ChurnClassifier -- Building Athena's First ML Model
- 7.9 Interpreting Classification Results
- 7.10 From Model to Decision -- Translating Predictions into Actions
- 7.11 Algorithms at a Glance -- A Comparison Guide
- 7.12 Chapter Summary and the Road Ahead
Chapter 7: Supervised Learning -- Classification
"The model doesn't make the decision. It informs the decision. Never forget the difference." -- Professor Diane Okonkwo
Professor Okonkwo stands at the whiteboard, marker uncapped, and writes a single number:
$2,000,000
"You run a subscription business," she says, turning to face the class. "One hundred thousand customers. Your monthly churn rate is five percent. That means every month, five thousand customers leave. Your retention team has developed a campaign -- personalized outreach, a loyalty offer, a dedicated account manager for thirty days. It costs twenty dollars per customer to execute. And it works: forty percent of the would-be churners it reaches are saved."
She pauses. "Each saved customer is worth five hundred dollars in annual revenue. Question for the class: should you target all one hundred thousand customers with this retention campaign?"
Tom Kowalski's hand shoots up. "Quick math -- one hundred thousand customers times twenty dollars each is two million dollars in campaign costs."
"Correct. And the revenue saved?"
NK Adeyemi pulls out her notebook and starts writing. "Five percent churn rate means five thousand actual churners. Forty percent save rate on those five thousand is two thousand saved customers. Two thousand times five hundred dollars is one million dollars in saved revenue." She looks up. "We spend two million to save one million. That's a net loss of a million dollars. Targeting everyone is a terrible idea."
"So what do we do?" Professor Okonkwo asks. "Just let customers churn?"
Tom leans forward. "What if we could identify which customers are most likely to churn? If we only target the top five thousand most at-risk customers -- and let's say the model is reasonably accurate, so most of those five thousand are actually going to churn -- we spend five thousand times twenty dollars: a hundred thousand dollars. We save forty percent of them, which is two thousand customers. Two thousand times five hundred is still a million dollars in saved revenue. But now we spent a hundred thousand instead of two million. The net gain is nine hundred thousand dollars."
The room goes quiet as the economics land.
"That," Professor Okonkwo says, writing the word on the whiteboard, "is why classification matters."
She draws a line under the word.
Classification: the task of predicting which category an observation belongs to.
"In this case, we're predicting a binary category: will churn or will not churn. And the difference between getting that prediction right and treating everyone the same is, in this scenario, nearly two million dollars in economic value."
She sets down the marker. "Today we begin building your first machine learning model. We'll start with the theory, but we'll end with code -- a working churn classifier for Athena Retail Group's customer base. By the time we're done, you will understand not just how classification algorithms work, but when to use each one and how to translate a probability score into a business action."
"Welcome to Part 2."
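The classroom arithmetic is worth verifying. A few lines of Python reproduce both scenarios, using the figures from Professor Okonkwo's example:

```python
# Back-of-the-envelope economics of the retention campaign.
# All figures come from Professor Okonkwo's scenario above.
CUSTOMERS = 100_000
CHURN_RATE = 0.05          # 5% monthly churn
COST_PER_OFFER = 20        # dollars per targeted customer
SAVE_RATE = 0.40           # 40% of targeted churners are saved
VALUE_PER_SAVE = 500       # annual revenue per saved customer

churners = int(CUSTOMERS * CHURN_RATE)            # 5,000 actual churners

# Scenario 1: target everyone
cost_all = CUSTOMERS * COST_PER_OFFER             # $2,000,000
saved = int(churners * SAVE_RATE)                 # 2,000 customers saved
net_all = saved * VALUE_PER_SAVE - cost_all       # -$1,000,000

# Scenario 2: target only the 5,000 most at-risk customers
# (idealized: assumes the model ranks all actual churners at the top)
cost_targeted = churners * COST_PER_OFFER         # $100,000
net_targeted = saved * VALUE_PER_SAVE - cost_targeted  # +$900,000

print(f"Target everyone:  net {net_all:+,}")
print(f"Target top 5,000: net {net_targeted:+,}")
```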
7.1 What Is Classification?
Classification is the supervised learning task of assigning observations to discrete categories based on their features. The model learns a mapping from inputs (features) to outputs (categories) by studying labeled historical examples, then applies that learned mapping to make predictions on new, unseen observations.
Definition. Classification is a type of supervised learning where the model predicts a categorical outcome -- a class label -- for each observation. The model learns from historical data where the correct labels are known (the training set) and then generalizes to new data where the labels are unknown.
Binary vs. Multiclass Classification
The simplest form is binary classification: two possible outcomes. Will the customer churn or not? Is this transaction fraudulent or legitimate? Should we approve or deny this loan?
Multiclass classification extends to three or more categories. What product category will this customer purchase next? Which customer segment does this lead belong to? What is the priority level of this support ticket -- low, medium, high, or critical?
| Type | Number of Classes | Business Examples |
|---|---|---|
| Binary | 2 | Churn/retain, fraud/not fraud, approve/deny, spam/not spam, click/no click |
| Multiclass | 3+ | Customer segment (Gold/Silver/Bronze), product category, sentiment (positive/neutral/negative), risk tier |
| Multi-label | Multiple simultaneous | A customer can be tagged with multiple interest categories; a product can belong to multiple departments |
For this chapter, we focus primarily on binary classification -- the most common classification task in business -- with multiclass extensions noted where relevant.
The Classification Problem in Business Context
What makes classification powerful in business is not the algorithm. It is the decision that follows the prediction.
Every classification problem in business implies an action:
| Prediction | Business Action |
|---|---|
| Customer will churn | Trigger retention campaign |
| Transaction is fraudulent | Block transaction, flag for review |
| Lead will convert | Prioritize for sales team |
| Loan applicant will default | Deny or adjust terms |
| Email is spam | Route to spam folder |
| Patient is high-risk | Schedule early intervention |
Business Insight. If you cannot identify the action that will follow a classification prediction, you do not have a classification problem -- you have a classification curiosity. This connects directly to the problem framing lesson from Chapter 6: a prediction without a decision is trivia.
The Relationship to Chapter 6
In Chapter 6, we introduced the ML project lifecycle, the ML Canvas, and the critical distinction between model metrics and business metrics. Everything we build in this chapter applies those frameworks. Athena's churn prediction project, which Ravi Mehta's team scoped in Chapter 6 using the ML Canvas, becomes our working example. The conceptual groundwork is complete. Now we write the code.
7.2 The Classification Workflow
Before we examine specific algorithms, let's establish the end-to-end workflow that every classification project follows. This workflow builds on the CRISP-DM lifecycle from Chapter 2 and the ML project lifecycle from Chapter 6, but focuses on the technical execution steps.
Step 1: Define the Target Variable
The target variable (also called the label or dependent variable) is what you are trying to predict. For binary classification, this is typically encoded as 0 or 1.
For Athena's churn prediction:
- 1 = Customer churned (no purchase in 180 days)
- 0 = Customer did not churn (at least one purchase in 180 days)
Caution. The definition of your target variable is a business decision, not a technical one. "No purchase in 180 days" is a choice. Ravi's team debated alternatives -- 90 days, 365 days, revenue decline of 50 percent. The 180-day window was selected because it aligns with Athena's loyalty program renewal cycle and gives the retention team enough lead time to intervene. Different definitions produce different models with different business implications.
Step 2: Gather and Prepare Features
Features (also called independent variables or predictors) are the inputs the model uses to make predictions. For Athena's churn model, the initial feature set includes:
- Purchase frequency -- number of transactions in the last 12 months
- Recency -- days since last purchase
- Average order value -- mean spending per transaction
- Return rate -- percentage of items returned
- Channel mix -- proportion of purchases online vs. in-store
- Tenure -- months since first purchase
- Loyalty tier -- current tier (Bronze, Silver, Gold, Platinum)
- Category diversity -- number of distinct product categories purchased
Feature preparation involves handling missing values, encoding categorical variables (like loyalty tier), and scaling numerical features. We will cover these techniques in detail in Section 7.7.
Step 3: Split the Data
The data must be divided into at least two sets:
- Training set (typically 70-80 percent): The model learns from this data.
- Test set (typically 20-30 percent): Held out entirely during training. Used only once, at the end, to evaluate the model's performance on truly unseen data.
Some workflows add a validation set (carved from the training data) for hyperparameter tuning, or use cross-validation, which we will discuss in Chapter 11.
Definition. Data leakage occurs when information from the test set (or from the future) contaminates the training process, producing a model that appears to perform well during development but fails in production. As we discussed in Chapter 6, this is one of the most common and insidious failure modes in applied ML.
The split must respect temporal order when the data has a time dimension. For Athena's churn data, we train on customers whose 180-day window ended before a cutoff date and test on customers whose window ended after it. This simulates how the model will be used in production -- predicting future churn based on past behavior.
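A minimal sketch of such a temporal split, assuming an illustrative `window_end` column (not Athena's actual schema):

```python
import pandas as pd

# Temporal split sketch: train on customers whose 180-day window ended
# before the cutoff, test on those whose window ended after it.
# Column names (window_end, churned) are illustrative.
df = pd.DataFrame({
    "customer_id": range(6),
    "window_end": pd.to_datetime([
        "2023-01-15", "2023-03-01", "2023-05-20",
        "2023-08-10", "2023-10-05", "2023-12-01",
    ]),
    "churned": [0, 1, 0, 1, 0, 0],
})

cutoff = pd.Timestamp("2023-07-01")
train = df[df["window_end"] < cutoff]    # past: the model learns from these
test = df[df["window_end"] >= cutoff]    # future: the model is evaluated on these

print(len(train), len(test))  # 3 3
```

A random shuffle here would let the model peek at the future; sorting by time keeps the evaluation honest.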
Step 4: Train the Model
Feed the training data to one or more algorithms and let them learn the relationship between features and the target variable. In practice, you almost always train multiple algorithms and compare their performance -- logistic regression, decision trees, random forests, gradient boosting.
Step 5: Evaluate the Model
Measure performance on the held-out test set using appropriate metrics. For classification, the key metrics include accuracy, precision, recall, F1 score, and AUC-ROC. We introduce these metrics in Section 7.9 and explore them in full depth in Chapter 11.
Step 6: Interpret and Deploy
Understanding why the model makes its predictions is as important as the predictions themselves. Feature importance, partial dependence plots, and individual prediction explanations help build stakeholder trust and surface potential issues. Deployment connects the model to the business process it was designed to support.
7.3 Logistic Regression -- The Starting Point
Despite its name, logistic regression is a classification algorithm. It is almost always the first model you should try for binary classification. Not because it is the most powerful, but because it is interpretable, fast, and provides a strong baseline against which to measure more complex models.
The Core Intuition
Logistic regression answers a simple question: given the features of this observation, what is the probability that it belongs to class 1?
It does this using the sigmoid function (also called the logistic function), which maps any real number to a value between 0 and 1:
P(y=1) = 1 / (1 + e^(-z))
where z is a linear combination of the features:
z = b0 + b1*x1 + b2*x2 + ... + bn*xn
The sigmoid function produces the characteristic S-curve:
Probability
1.0 | _______________
| /
| /
0.5 | - - - - - - - - - + - - - - - - - - -
| /
| /
0.0 |_______________/
+----------------------------------------
z (linear score)
When z is very negative (the features suggest "not churn"), the probability approaches 0. When z is very positive (the features scream "churn"), the probability approaches 1. The transition happens around z = 0, where the probability is exactly 0.5.
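The sigmoid's behavior is easy to verify numerically. A short sketch:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Very negative scores -> probability near 0; very positive -> near 1;
# z = 0 lands exactly at 0.5.
scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probs = sigmoid(scores)
print(np.round(probs, 3))  # probabilities rise from ~0.002 up to ~0.998
```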
Business Insight. The beauty of logistic regression is that its output is a probability, not just a yes/no prediction. A customer with a 0.92 churn probability needs a different intervention than a customer at 0.55. The probability gives the business granular information to make nuanced decisions -- tiered retention campaigns, prioritized outreach, or risk-scored customer portfolios.
The Decision Boundary
To convert a probability into a class prediction, we choose a threshold. The default is 0.5: if the predicted probability is above 0.5, predict class 1 (churn); otherwise, predict class 0 (no churn).
But 0.5 is not sacred. As we will explore in Section 7.10, the optimal threshold depends on the relative costs of false positives and false negatives. In Athena's case, where the cost of missing a churner ($340 in lost lifetime value) far exceeds the cost of a wasted retention offer ($20), a lower threshold like 0.3 may be more appropriate -- cast a wider net, catch more churners, accept some wasted offers.
Why Start with Logistic Regression?
- Interpretability. Each coefficient tells you the direction and magnitude of a feature's influence. If the coefficient for "days since last purchase" is positive, longer gaps increase churn probability. Business stakeholders can understand and challenge this.
- Speed. Logistic regression trains in seconds, even on large datasets. This makes rapid experimentation easy.
- Baseline. Any more complex model must beat logistic regression to justify its complexity. If a random forest only improves AUC from 0.81 to 0.83, the added complexity may not be worth it.
- Probability calibration. Logistic regression tends to produce well-calibrated probabilities out of the box -- when it says "70 percent chance of churn," roughly 70 percent of those customers actually churn. This matters for business applications where probability scores drive tiered actions.
Definition. Calibration refers to the degree to which a model's predicted probabilities match actual observed frequencies. A well-calibrated model's predicted probabilities can be taken at face value. A poorly calibrated model might predict "80 percent chance of churn" for a group where only 40 percent actually churn, leading to misallocation of resources.
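One way to check calibration is to bin predictions and compare predicted probabilities against observed frequencies. A sketch using scikit-learn's `calibration_curve`, with simulated data that is perfectly calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)

# Simulated predicted probabilities, with outcomes drawn to match them --
# i.e., a perfectly calibrated "model" by construction.
probs = rng.uniform(0, 1, size=5000)
outcomes = (rng.uniform(0, 1, size=5000) < probs).astype(int)

# Bin the predictions and compare predicted vs. observed churn frequency.
observed, predicted = calibration_curve(outcomes, probs, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted ~{p:.2f} -> observed {o:.2f}")
```

For a real model you would pass its `predict_proba` output instead; large gaps between the two columns signal miscalibration.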
Limitations
Logistic regression assumes a roughly linear relationship between features and the log-odds of the outcome. It cannot natively capture complex interactions or non-linear patterns. If the relationship between purchase frequency and churn is U-shaped (very low and very high frequencies both indicate risk), logistic regression will miss this unless you manually create interaction or polynomial features.
For many business problems, this limitation matters, which is why we need the next set of tools.
7.4 Decision Trees -- Interpretable, Intuitive, Fragile
Decision trees take a completely different approach to classification. Instead of fitting a mathematical function, they learn a series of if-then rules by recursively splitting the data on feature values.
The Core Intuition
Imagine you are a retention manager at Athena, and you have to manually identify at-risk customers using only three pieces of information: days since last purchase, number of purchases in the last year, and average order value. You might reason like this:
- Has the customer purchased in the last 60 days?
- Yes --> Probably not churning. But check further.
- Have they made more than 3 purchases this year?
- Yes --> Low risk.
- No --> Medium risk.
- No --> Possibly churning.
- Is their average order value above $75?
- Yes --> Medium risk (high-value customer going quiet -- investigate).
- No --> High risk.
That reasoning process is a decision tree. The algorithm automates it by finding the optimal splits -- the questions that best separate churners from non-churners at each step.
How Decision Trees Split
At each node, the algorithm considers every feature and every possible split point. It chooses the split that maximally separates the classes, measured by a criterion like Gini impurity or information gain.
Definition. Gini impurity measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the subset. A Gini impurity of 0 means perfect purity -- all elements in the node belong to the same class. A Gini impurity of 0.5 (for binary classification) means maximum impurity -- a 50/50 split.
Think of it this way: the tree asks, "Which question, applied to this group of customers, creates two subgroups that are each as pure as possible?" A perfect split puts all churners in one group and all non-churners in the other. In practice, splits are imperfect, so the tree keeps splitting until it reaches a stopping criterion (maximum depth, minimum samples per leaf, or no further improvement).
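Gini impurity and the weighted score of a candidate split are simple to compute by hand. A small sketch with illustrative labels:

```python
def gini(labels):
    """Gini impurity of a set of binary class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n              # fraction of class 1 (churners)
    return 1.0 - p**2 - (1 - p)**2

print(gini([1, 1, 1, 1]))            # 0.0 -> perfectly pure node
print(gini([1, 1, 0, 0]))            # 0.5 -> maximum impurity (50/50)

# A candidate split is scored by the weighted impurity of its children:
parent = [1, 1, 1, 0, 0, 0]                  # impurity 0.5
left, right = [1, 1, 1, 0], [0, 0]           # an imperfect split
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(weighted)                      # 0.25 -- lower than the parent's 0.5
```

The tree greedily picks whichever split drives this weighted impurity lowest.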
The Interpretability Advantage
Decision trees produce rules that non-technical stakeholders can read and challenge:
IF days_since_last_purchase > 90
AND purchase_count_12m < 3
AND loyalty_tier IN ('Bronze', 'Silver')
THEN churn_probability = 0.78
A marketing manager can look at this rule and say, "That makes sense -- customers who haven't bought in three months, barely shop with us, and aren't in our premium loyalty tiers are exactly the ones we'd expect to leave." Or they might say, "Wait -- we just launched a new program for Bronze members. Can we check if that changes things?" Either way, the tree enables a conversation that a neural network cannot.
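scikit-learn can print a fitted tree's rules in exactly this if-then form. A sketch on synthetic stand-in data (the feature names are illustrative, not Athena's real schema):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data stands in for Athena's customer features.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
feature_names = ["days_since_last_purchase", "purchase_count_12m",
                 "avg_order_value"]

# A shallow tree keeps the printed rules readable for stakeholders.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```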
Business Insight. In regulated industries -- financial services, healthcare, insurance -- model interpretability is not optional. It is a compliance requirement. Decision trees and their derivatives (random forests with feature importance) are often preferred for regulatory reasons even when more complex models achieve marginally better performance. We will explore this tension between performance and interpretability in Chapter 26 (Fairness, Explainability, and Transparency).
The Overfitting Problem
Left unconstrained, a decision tree will keep splitting until every training example is perfectly classified. The result is a deeply branched, highly specific tree that memorizes the training data -- including its noise and idiosyncrasies -- rather than learning generalizable patterns.
Definition. Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations rather than true underlying patterns. An overfit model performs excellently on training data but poorly on new, unseen data. It has memorized rather than learned.
Signs of an overfit decision tree:
- Training accuracy of 99 percent, test accuracy of 72 percent
- Very deep tree with many leaves, some containing only a handful of examples
- Brittle predictions: small changes in input data produce large changes in output

Remedies include:
- Pruning: Limiting tree depth or requiring a minimum number of samples per leaf
- Setting constraints: Maximum depth, minimum samples to split, maximum leaf nodes
- Ensemble methods: Combining many trees, which brings us to random forests
7.5 Random Forests -- The Wisdom of Crowds
Random forests address the overfitting problem of individual decision trees through a simple, powerful idea: build many trees, each slightly different, and let them vote.
The Core Intuition
The principle is borrowed from the "wisdom of crowds" phenomenon. If you ask one person to estimate the number of jelly beans in a jar, the answer may be wildly off. If you ask five hundred people and average their responses, the average is typically much closer to the truth than any individual guess.
Random forests apply this principle to decision trees:
- Create many training datasets by randomly sampling from the original training data with replacement (a technique called bootstrapping). Each sample is roughly the same size as the original but contains some duplicated rows and omits others.
- Build a decision tree on each bootstrap sample. At each split, consider only a random subset of features (not all features). This ensures that the trees are different from each other -- they see different data and consider different features.
- Aggregate predictions. For classification, each tree votes. The final prediction is the majority vote.
Bootstrapping plus aggregation is called bagging (bootstrap aggregating); the additional random feature selection at each split is what distinguishes a random forest from plain bagged trees. The randomness injects diversity, and the aggregation smooths out individual errors.
Definition. Ensemble learning is the strategy of combining multiple models to produce a prediction that is more robust and accurate than any single model. Random forests are the canonical example, combining hundreds or thousands of decision trees into a single ensemble.
Why Random Forests Work So Well
- Reduced overfitting. Individual trees may overfit, but their errors tend to be different. Averaging many noisy-but-diverse predictions produces a smoother, more generalizable result.
- Feature importance. Random forests naturally measure how much each feature contributes to prediction accuracy. This is invaluable for business insights: "Purchase recency is the single strongest predictor of churn" is actionable information.
- Robustness. Random forests handle missing values, outliers, and mixed feature types (numerical and categorical) with minimal preprocessing. They are hard to break.
- Few hyperparameters. The most important tuning parameters are the number of trees (more is generally better, with diminishing returns) and the maximum depth of each tree.
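A minimal sketch of training a random forest and reading its feature importances, on synthetic stand-in data with illustrative feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are illustrative.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
names = ["recency", "frequency", "monetary", "return_rate", "tenure"]

forest = RandomForestClassifier(n_estimators=200, max_depth=10,
                                random_state=42)
forest.fit(X, y)

# Importances sum to 1; higher means a larger contribution to the splits.
for name, imp in sorted(zip(names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:12s} {imp:.3f}")
```

On real churn data, this ranked list is often the first deliverable stakeholders ask for.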
Key Hyperparameters
| Parameter | What It Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of trees in the forest | 100-500 (more trees = better, up to a point) |
| `max_depth` | Maximum depth of each tree | 10-30, or None (unlimited) |
| `max_features` | Number of features considered at each split | sqrt(n_features) for classification |
| `min_samples_split` | Minimum samples required to split a node | 2-20 |
| `min_samples_leaf` | Minimum samples required in a leaf node | 1-10 |
NK raises her hand. "Professor, you said random forests are 'hard to break.' Are there situations where they don't work well?"
"Excellent question," Professor Okonkwo replies. "Random forests can struggle with very high-dimensional, sparse data -- like text data with thousands of word features. They can also be computationally expensive on very large datasets. And while they tell you which features are important, they don't give you the direction of the effect as cleanly as logistic regression does. For that, you need partial dependence plots or SHAP values, which we'll cover in Chapter 26."
7.6 Gradient Boosting -- Sequential Correction
If random forests are "wisdom of crowds," gradient boosting is "learning from mistakes." While random forests build trees independently and average them, gradient boosted models build trees sequentially, with each new tree explicitly correcting the errors of the previous ones.
The Core Intuition
Imagine a student taking a practice exam. After grading, the student doesn't start over from scratch. Instead, they focus their studying on the questions they got wrong. The next practice attempt concentrates effort where performance was weakest. Over many rounds, the student improves by systematically addressing their specific gaps.
Gradient boosting works the same way:
- Build a simple tree (often called a "stump" -- just one or two splits).
- Calculate the residuals -- the errors between the model's predictions and the actual labels.
- Build the next tree to predict these residuals (i.e., to correct the first tree's mistakes).
- Add this correction tree to the ensemble.
- Repeat: calculate new residuals, build a new tree, add it.
- The final prediction is the sum of all trees' contributions.
Each tree is individually weak -- a "weak learner." But the sequential correction process produces a powerful combined model.
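The sequential-correction loop above can be sketched by hand. This bare-bones version fits residuals directly (the regression form; real gradient boosting for classification works on gradients of the log-loss), with all data and settings illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # noisy target

learning_rate = 0.1
prediction = np.zeros_like(y)        # start from a trivial model
for _ in range(100):
    residuals = y - prediction                       # current errors
    stump = DecisionTreeRegressor(max_depth=2,       # a weak learner
                                  random_state=0)
    stump.fit(X, residuals)                          # fit the errors, not y
    prediction += learning_rate * stump.predict(X)   # add the correction

print(float(np.mean((y - prediction) ** 2)))  # small residual MSE
```

Each round nudges the ensemble toward the remaining errors; the learning rate keeps any single tree from overcorrecting.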
XGBoost and LightGBM
Two implementations of gradient boosting dominate practical machine learning:
XGBoost (eXtreme Gradient Boosting) was released in 2014 and quickly became the most popular machine learning algorithm for structured/tabular data. It won a remarkable string of Kaggle competitions and became the default choice for many applied ML teams. XGBoost includes built-in regularization (to prevent overfitting), efficient handling of missing values, and parallel processing for speed.
LightGBM (Light Gradient Boosting Machine), released by Microsoft in 2016, uses techniques called "gradient-based one-side sampling" and "exclusive feature bundling" to train significantly faster than XGBoost on large datasets, often with comparable or better accuracy. It has become the preferred choice for very large datasets.
Business Insight. In practice, XGBoost and LightGBM are the algorithms behind many of the "AI-powered" features you encounter as a consumer -- credit scoring at banks, fraud detection at payment processors, customer lifetime value predictions at subscription companies, and pricing optimization at airlines. When a company says "we use machine learning for X" and X involves structured tabular data, there is a high probability that gradient boosting is under the hood.
When to Use Gradient Boosting
| Situation | Recommendation |
|---|---|
| Structured/tabular business data | Gradient boosting is often the top performer |
| Need maximum predictive accuracy | Gradient boosting typically beats random forests |
| Data is small to medium (<1M rows) | XGBoost is excellent |
| Data is large (>1M rows) | LightGBM is often faster |
| Interpretability is paramount | Consider random forests or logistic regression first |
| Rapid prototyping | Random forests are faster to tune |
Key Hyperparameters
| Parameter | What It Controls | Typical Values |
|---|---|---|
| `n_estimators` | Number of boosting rounds (trees) | 100-1000 |
| `learning_rate` | How much each tree's correction is weighted | 0.01-0.3 (lower = slower but often better) |
| `max_depth` | Maximum depth of each tree | 3-8 (shallower than random forests) |
| `subsample` | Fraction of training data used per tree | 0.7-1.0 |
| `colsample_bytree` | Fraction of features used per tree | 0.5-1.0 |
| `reg_alpha` / `reg_lambda` | L1/L2 regularization strength | 0-10 |
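Putting these hyperparameters together, a hedged sketch that uses XGBoost when available and falls back to scikit-learn's gradient boosting otherwise (dataset and parameter values are illustrative starting points, not tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=4,
        subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
    )
except ImportError:
    # Fallback: scikit-learn's gradient boosting (fewer knobs, same idea).
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=4, subsample=0.8,
    )

model.fit(X_tr, y_tr)
print(f"Test accuracy: {model.score(X_te, y_te):.3f}")
```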
"Here's what I want you to remember," Professor Okonkwo says. "For most classification problems with structured business data, you will try logistic regression as a baseline, random forest as a robust mid-range option, and XGBoost or LightGBM for maximum accuracy. In my consulting experience, the final model deployed in production is a gradient boosting model about sixty percent of the time. But you should always start simpler."
7.7 Feature Engineering for Classification
The algorithms we've covered are tools. Feature engineering is the craft that determines how well those tools work. In business applications, good feature engineering often contributes more to model performance than algorithm selection.
Encoding Categorical Variables
Machine learning algorithms operate on numbers. Categorical features -- like loyalty tier (Bronze, Silver, Gold, Platinum) or channel (Online, In-Store) -- must be converted to numerical representations.
One-Hot Encoding creates a binary column for each category:
import pandas as pd
# Before encoding
# loyalty_tier: ['Bronze', 'Silver', 'Gold', 'Platinum']
# After one-hot encoding
# loyalty_tier_Bronze: [1, 0, 0, 0]
# loyalty_tier_Silver: [0, 1, 0, 0]
# loyalty_tier_Gold: [0, 0, 1, 0]
# loyalty_tier_Platinum: [0, 0, 0, 1]
df_encoded = pd.get_dummies(df, columns=['loyalty_tier'], drop_first=True)
# drop_first=True drops one category's column (here Bronze) to avoid
# redundancy: the dropped category is implied when the others are all 0.
Ordinal Encoding assigns integers that preserve an ordering:
tier_mapping = {'Bronze': 1, 'Silver': 2, 'Gold': 3, 'Platinum': 4}
df['loyalty_tier_encoded'] = df['loyalty_tier'].map(tier_mapping)
Use ordinal encoding when the categories have a natural order (loyalty tiers, education level). Use one-hot encoding when they do not (product category, geographic region).
Caution. Be careful with one-hot encoding when a feature has many categories (e.g., zip codes with hundreds of values). This creates hundreds of sparse columns that can slow training and cause overfitting. In such cases, consider target encoding (mapping each category to the average target value for that category) or grouping rare categories into an "Other" category.
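A minimal target-encoding sketch (illustrative data; in practice the encoding must be computed on the training fold only, ideally with smoothing or out-of-fold estimates, to avoid target leakage):

```python
import pandas as pd

# Each zip code is replaced by its average churn rate -- one numeric
# column instead of hundreds of one-hot columns.
df = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "churned":  [1,        0,       1,       1,       0,       0],
})

means = df.groupby("zip_code")["churned"].mean()   # per-category target mean
df["zip_code_encoded"] = df["zip_code"].map(means)
print(df)
```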
Scaling Numerical Features
Some algorithms -- particularly logistic regression -- are sensitive to feature scales. If "annual income" ranges from 20,000 to 200,000 while "number of purchases" ranges from 1 to 50, the larger-scaled feature can dominate the model.
Standard Scaling transforms each feature to have mean 0 and standard deviation 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
Min-Max Scaling maps features to a 0-1 range:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
Caution. Fit the scaler on the training data only, then apply it to both training and test data. If you fit on the full dataset (including test data), you introduce subtle data leakage.
Tree-based models (decision trees, random forests, XGBoost) are generally not affected by feature scaling because they make decisions based on threshold comparisons, not magnitude. But it is good practice to scale anyway -- it doesn't hurt and it makes switching between models easier.
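The fit-on-train-only rule looks like this in practice (toy income values for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20_000.0], [60_000.0], [100_000.0]])
X_test = np.array([[80_000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse them -- never refit on test

print(scaler.mean_)     # [60000.] -- computed from training data alone
print(X_test_scaled)    # the test point expressed in training-set units
```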
Creating Domain-Informed Features
The most powerful features are often derived from domain knowledge, not raw data. For Athena's churn model, Ravi's team creates several engineered features:
# Recency-Frequency-Monetary (RFM) features
df['recency'] = (reference_date - df['last_purchase_date']).dt.days
df['frequency'] = df['purchase_count_12m']
df['monetary'] = df['avg_order_value'] * df['purchase_count_12m']
# Behavioral change features (trends matter more than snapshots)
df['purchase_trend'] = (
df['purchases_last_3m'] / (df['purchases_prior_3m'] + 1)
)
# Engagement features
df['return_rate'] = df['items_returned'] / (df['items_purchased'] + 1)
df['online_share'] = df['online_purchases'] / (df['total_purchases'] + 1)
# Loyalty engagement
df['loyalty_points_used_ratio'] = (
df['loyalty_points_redeemed'] / (df['loyalty_points_earned'] + 1)
)
Try It. Look at the features above. Can you think of additional features that might predict churn for a retailer? Consider: seasonal buying patterns, response to marketing emails, customer service interactions, product category diversity, price sensitivity (proportion of purchases made during sales). Feature engineering is where business knowledge meets data science.
Handling Class Imbalance
In many business classification problems, the classes are unequal. For Athena's data, roughly 18 percent of customers churn and 82 percent do not. This class imbalance can cause models to be biased toward the majority class -- predicting "no churn" for everyone achieves 82 percent accuracy but catches zero churners.
Strategies for handling imbalance:
1. Class Weights. Most scikit-learn classifiers accept a class_weight='balanced' parameter that automatically adjusts the learning algorithm to pay more attention to the minority class.
2. Oversampling the Minority Class (SMOTE). Synthetic Minority Over-sampling Technique creates synthetic examples of the minority class by interpolating between existing examples:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
3. Undersampling the Majority Class. Remove examples from the majority class to balance the dataset. Simple but discards potentially useful data.
4. Threshold Adjustment. Rather than rebalancing the data, adjust the classification threshold. Instead of predicting "churn" at 0.5 probability, predict at 0.3 or 0.2. This is often the most practical approach for business applications.
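Threshold adjustment requires no retraining at all; it is a post-processing step on the model's predicted probabilities. A sketch with illustrative scores:

```python
import numpy as np

# Predicted churn probabilities for ten customers (illustrative values,
# e.g. from a classifier's predict_proba output).
probs = np.array([0.05, 0.12, 0.22, 0.28, 0.35,
                  0.41, 0.55, 0.63, 0.78, 0.91])

flag_default = (probs >= 0.5).astype(int)   # default threshold
flag_lowered = (probs >= 0.3).astype(int)   # wider net for the retention team

print(flag_default.sum(), "flagged at 0.5")  # 4 flagged at 0.5
print(flag_lowered.sum(), "flagged at 0.3")  # 6 flagged at 0.3
```

Lowering the threshold trades more wasted offers for fewer missed churners, which is exactly the trade-off Section 7.10 prices out.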
Business Insight. The best strategy depends on your data and your business context. In Athena's case, Ravi's team uses a combination of class weights during training and threshold adjustment during deployment. SMOTE is powerful but can introduce artifacts -- synthetic examples that don't represent real customer behaviors. As Professor Okonkwo says: "Resample with caution. The synthetic data knows nothing about your business."
7.8 The ChurnClassifier -- Building Athena's First ML Model
Athena Update. Ravi Mehta's team has spent three weeks preparing data for Athena's churn prediction pilot. They've unified customer records across the POS system and e-commerce platform (a painful data engineering effort that Chapter 4 warned them about). They've defined churn as "no purchase in 180 days" after extensive debate with the merchandising, marketing, and finance teams. The executive team is watching. This is the first ML model Athena will build, and if it fails, it will set back the AI initiative by a year.
Now we build. The following ChurnClassifier class encapsulates the full classification workflow -- data preparation, model training, evaluation, and business interpretation. We build it step by step, explaining each section.
Setting Up
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve
)
import warnings
warnings.filterwarnings('ignore')
# For gradient boosting -- install with: pip install xgboost
try:
from xgboost import XGBClassifier
XGBOOST_AVAILABLE = True
except ImportError:
XGBOOST_AVAILABLE = False
print("XGBoost not installed. Install with: pip install xgboost")
Code Explanation. We import scikit-learn's core classification tools along with XGBoost. The try/except block handles the case where XGBoost isn't installed -- the classifier will still work with logistic regression and random forests. If you completed the Python environment setup in Chapter 3, you should have scikit-learn installed. Install XGBoost separately with pip install xgboost.
Generating Synthetic Athena Data
In production, Ravi's team works with real customer data. For this textbook, we generate realistic synthetic data that mirrors Athena's customer base:
def generate_athena_churn_data(n_customers=5000, random_state=42):
"""
Generate synthetic customer data for Athena Retail Group's
churn prediction model.
Features mirror Athena's actual loyalty program data:
- Behavioral features (purchase patterns)
- Engagement features (channel usage, returns)
- Demographic features (tenure, loyalty tier)
"""
np.random.seed(random_state)
# --- Customer tenure (months since first purchase) ---
tenure = np.random.exponential(scale=24, size=n_customers).clip(1, 120)
# --- Loyalty tier (influenced by tenure) ---
# Broadcast each customer's tenure condition against the 4 tier probabilities
tier_probs = np.where(
    (tenure > 60)[:, None], [0.1, 0.2, 0.3, 0.4],
    np.where((tenure > 24)[:, None], [0.2, 0.3, 0.3, 0.2],
             [0.5, 0.3, 0.15, 0.05])
)
# Normalize probabilities
tier_probs = tier_probs / tier_probs.sum(axis=1, keepdims=True)
tiers = np.array([
np.random.choice(
['Bronze', 'Silver', 'Gold', 'Platinum'], p=p
) for p in tier_probs
])
# --- Purchase behavior ---
# Base frequency influenced by tier
tier_freq_base = {'Bronze': 3, 'Silver': 6, 'Gold': 10, 'Platinum': 15}
base_freq = np.array([tier_freq_base[t] for t in tiers], dtype=float)
purchase_count_12m = np.random.poisson(base_freq).clip(0, 50)
# Recency (days since last purchase) -- lower tier = higher recency
tier_recency_base = {'Bronze': 90, 'Silver': 45, 'Gold': 25, 'Platinum': 14}
base_recency = np.array([tier_recency_base[t] for t in tiers], dtype=float)
days_since_last_purchase = np.random.exponential(base_recency).clip(0, 365)
# Average order value
avg_order_value = np.random.lognormal(mean=3.8, sigma=0.5, size=n_customers)
avg_order_value = avg_order_value.clip(15, 500)
# Return rate (0 to 1)
return_rate = np.random.beta(2, 10, size=n_customers)
# Online purchase share (0 to 1)
online_share = np.random.beta(3, 3, size=n_customers)
# Category diversity (number of distinct categories, 1-12)
category_diversity = np.random.poisson(
lam=np.where(purchase_count_12m > 5, 5, 2)
).clip(1, 12)
# Purchase trend (ratio of last 3 months to prior 3 months)
purchase_trend = np.random.lognormal(mean=0, sigma=0.5, size=n_customers)
purchase_trend = purchase_trend.clip(0.1, 5.0)
# Email engagement rate
email_engagement = np.random.beta(2, 5, size=n_customers)
# --- Generate churn labels (influenced by features) ---
churn_score = (
0.3 * (days_since_last_purchase / 365) # Higher recency = more churn
- 0.25 * (purchase_count_12m / 50) # More purchases = less churn
- 0.15 * (avg_order_value / 500) # Higher AOV = less churn
+ 0.15 * return_rate # Higher returns = more churn
- 0.1 * (tenure / 120) # Longer tenure = less churn
- 0.2 * purchase_trend.clip(0, 2) / 2 # Increasing trend = less churn
- 0.1 * email_engagement # More engaged = less churn
+ np.random.normal(0, 0.15, size=n_customers) # Noise
)
# Convert to probability and sample
churn_prob = 1 / (1 + np.exp(-5 * (churn_score - 0.05)))
churned = (np.random.random(n_customers) < churn_prob).astype(int)
# Build DataFrame
df = pd.DataFrame({
'customer_id': [f'ATH-{i:05d}' for i in range(n_customers)],
'tenure_months': np.round(tenure, 1),
'loyalty_tier': tiers,
'purchase_count_12m': purchase_count_12m,
'days_since_last_purchase': np.round(days_since_last_purchase, 0),
'avg_order_value': np.round(avg_order_value, 2),
'return_rate': np.round(return_rate, 3),
'online_share': np.round(online_share, 3),
'category_diversity': category_diversity,
'purchase_trend': np.round(purchase_trend, 3),
'email_engagement': np.round(email_engagement, 3),
'churned': churned
})
return df
# Generate and inspect the data
df = generate_athena_churn_data(n_customers=5000)
print(f"Dataset shape: {df.shape}")
print(f"\nChurn distribution:")
print(df['churned'].value_counts(normalize=True).round(3))
print(f"\nFeature summary:")
print(df.describe().round(2))
Code Explanation. The synthetic data generator creates 5,000 customer records with realistic correlations. Churn is not random -- it is influenced by recency (the strongest predictor), purchase frequency, order value, return behavior, tenure, purchasing trends, and email engagement. The correlations are deliberately imperfect (note the noise term) to simulate real-world messiness. The churn rate is approximately 20 percent, consistent with Athena's observed rate.
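The logistic squashing in the generator's churn_prob line is worth isolating. A standalone sketch of how a linear churn score becomes a probability, using the steepness (5) and midpoint (0.05) from the code above -- the function name here is ours, not part of the generator:

```python
import numpy as np

# Logistic conversion used by the generator: steepness 5, midpoint 0.05
def score_to_churn_prob(score):
    return 1 / (1 + np.exp(-5 * (score - 0.05)))

print(round(score_to_churn_prob(0.05), 2))  # 0.5 -- a score at the midpoint
print(round(score_to_churn_prob(0.5), 2))   # 0.9 -- a high-risk score
print(round(score_to_churn_prob(-0.3), 2))  # 0.15 -- a low-risk score
```

The steepness controls how sharply risk separates around the midpoint; the added noise term then keeps the relationship from being deterministic, as real churn never is.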
The Complete ChurnClassifier Class
class ChurnClassifier:
"""
End-to-end churn classification pipeline for Athena Retail Group.
Trains and compares multiple classification models, evaluates
performance using business-relevant metrics, and provides
interpretable results for the retention team.
Usage:
classifier = ChurnClassifier(df)
classifier.prepare_data()
classifier.train_models()
classifier.compare_models()
classifier.analyze_feature_importance()
classifier.optimize_threshold()
classifier.generate_business_report()
"""
def __init__(self, data, target_col='churned', id_col='customer_id',
test_size=0.2, random_state=42):
"""
Initialize the ChurnClassifier.
Parameters:
-----------
data : pd.DataFrame
Customer data with features and churn labels.
target_col : str
Name of the target column (default: 'churned').
id_col : str
Name of the customer ID column (default: 'customer_id').
test_size : float
Fraction of data reserved for testing (default: 0.2).
random_state : int
Random seed for reproducibility (default: 42).
"""
self.data = data.copy()
self.target_col = target_col
self.id_col = id_col
self.test_size = test_size
self.random_state = random_state
self.models = {}
self.results = {}
self.best_model_name = None
self.scaler = StandardScaler()
self.feature_names = None
def prepare_data(self):
"""
Prepare features and split into train/test sets.
Handles:
- Dropping ID column
- Encoding categorical variables
- Scaling numerical features
- Train/test splitting
"""
df = self.data.copy()
# Separate target
y = df[self.target_col]
X = df.drop(columns=[self.target_col, self.id_col])
# Encode categorical features
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
self.feature_names = X.columns.tolist()
# Train/test split
self.X_train, self.X_test, self.y_train, self.y_test = (
train_test_split(X, y, test_size=self.test_size,
random_state=self.random_state, stratify=y)
)
# Scale features (fit on training data only)
self.X_train_scaled = pd.DataFrame(
self.scaler.fit_transform(self.X_train),
columns=self.feature_names,
index=self.X_train.index
)
self.X_test_scaled = pd.DataFrame(
self.scaler.transform(self.X_test),
columns=self.feature_names,
index=self.X_test.index
)
print("Data Preparation Complete")
print(f" Training samples: {len(self.X_train)}")
print(f" Test samples: {len(self.X_test)}")
print(f" Features: {len(self.feature_names)}")
print(f" Churn rate (train): {self.y_train.mean():.1%}")
print(f" Churn rate (test): {self.y_test.mean():.1%}")
def train_models(self):
"""
Train multiple classification models and store results.
Models trained:
1. Logistic Regression (baseline)
2. Random Forest
3. XGBoost (if available)
"""
print("\n" + "=" * 60)
print("MODEL TRAINING")
print("=" * 60)
# --- Model 1: Logistic Regression ---
print("\n[1/3] Training Logistic Regression...")
lr = LogisticRegression(
class_weight='balanced',
max_iter=1000,
random_state=self.random_state
)
lr.fit(self.X_train_scaled, self.y_train)
self.models['Logistic Regression'] = lr
self._evaluate_model('Logistic Regression', lr,
self.X_test_scaled)
# --- Model 2: Random Forest ---
print("[2/3] Training Random Forest...")
rf = RandomForestClassifier(
n_estimators=200,
max_depth=15,
min_samples_split=10,
min_samples_leaf=5,
class_weight='balanced',
random_state=self.random_state,
n_jobs=-1
)
rf.fit(self.X_train, self.y_train) # Trees don't need scaling
self.models['Random Forest'] = rf
self._evaluate_model('Random Forest', rf, self.X_test)
# --- Model 3: XGBoost ---
if XGBOOST_AVAILABLE:
print("[3/3] Training XGBoost...")
# Calculate scale_pos_weight for class imbalance
n_neg = (self.y_train == 0).sum()
n_pos = (self.y_train == 1).sum()
scale_weight = n_neg / n_pos
xgb = XGBClassifier(
n_estimators=200,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
colsample_bytree=0.8,
scale_pos_weight=scale_weight,
random_state=self.random_state,
eval_metric='logloss'  # use_label_encoder is obsolete (removed in XGBoost 2.x)
)
xgb.fit(self.X_train, self.y_train)
self.models['XGBoost'] = xgb
self._evaluate_model('XGBoost', xgb, self.X_test)
else:
print("[3/3] Skipping XGBoost (not installed)")
print("\nAll models trained successfully.")
def _evaluate_model(self, name, model, X_test):
"""Evaluate a single model and store results."""
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
results = {
'accuracy': accuracy_score(self.y_test, y_pred),
'precision': precision_score(self.y_test, y_pred),
'recall': recall_score(self.y_test, y_pred),
'f1': f1_score(self.y_test, y_pred),
'auc_roc': roc_auc_score(self.y_test, y_prob),
'y_pred': y_pred,
'y_prob': y_prob
}
self.results[name] = results
print(f" {name}: AUC={results['auc_roc']:.3f}, "
f"F1={results['f1']:.3f}, "
f"Recall={results['recall']:.3f}")
def compare_models(self):
"""
Display a side-by-side comparison of all trained models.
"""
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
comparison = pd.DataFrame({
name: {
'Accuracy': f"{r['accuracy']:.3f}",
'Precision': f"{r['precision']:.3f}",
'Recall': f"{r['recall']:.3f}",
'F1 Score': f"{r['f1']:.3f}",
'AUC-ROC': f"{r['auc_roc']:.3f}"
}
for name, r in self.results.items()
})
print(comparison.to_string())
# Identify best model by AUC-ROC
self.best_model_name = max(
self.results, key=lambda k: self.results[k]['auc_roc']
)
print(f"\nBest model by AUC-ROC: {self.best_model_name} "
f"({self.results[self.best_model_name]['auc_roc']:.3f})")
return comparison
def analyze_feature_importance(self, top_n=10):
"""
Analyze and display feature importance from the best
tree-based model.
Parameters:
-----------
top_n : int
Number of top features to display (default: 10).
"""
print("\n" + "=" * 60)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 60)
# Use Random Forest or XGBoost for feature importance
if 'XGBoost' in self.models:
model = self.models['XGBoost']
model_name = 'XGBoost'
elif 'Random Forest' in self.models:
model = self.models['Random Forest']
model_name = 'Random Forest'
else:
print("No tree-based model available for feature importance.")
return None
importances = model.feature_importances_
feature_imp = pd.DataFrame({
'Feature': self.feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print(f"\nTop {top_n} Features ({model_name}):")
print("-" * 45)
for i, row in feature_imp.head(top_n).iterrows():
bar = '#' * int(row['Importance'] * 100)
print(f" {row['Feature']:<30} {row['Importance']:.4f} {bar}")
# Also show logistic regression coefficients if available
if 'Logistic Regression' in self.models:
lr = self.models['Logistic Regression']
coef_df = pd.DataFrame({
'Feature': self.feature_names,
'Coefficient': lr.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)
print(f"\nLogistic Regression Coefficients (top {top_n}):")
print("-" * 50)
for i, row in coef_df.head(top_n).iterrows():
direction = "+" if row['Coefficient'] > 0 else "-"
print(f" {direction} {row['Feature']:<30} "
f"{row['Coefficient']:+.4f}")
self.feature_importance = feature_imp
return feature_imp
def optimize_threshold(self, cost_fp=20, cost_fn=340,
revenue_saved=500, save_rate=0.4):
"""
Find the optimal classification threshold based on
business economics.
Parameters:
-----------
cost_fp : float
Cost of a false positive (retention offer to non-churner).
cost_fn : float
Cost of a false negative (lost customer lifetime value).
revenue_saved : float
Annual revenue from a successfully retained customer.
save_rate : float
Probability that a targeted churner is retained.
"""
print("\n" + "=" * 60)
print("THRESHOLD OPTIMIZATION")
print("=" * 60)
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
y_prob = best_model.predict_proba(self.X_test_scaled)[:, 1]
else:
y_prob = best_model.predict_proba(self.X_test)[:, 1]
thresholds = np.arange(0.1, 0.9, 0.05)
best_threshold = 0.5
best_net_value = float('-inf')
threshold_results = []
for thresh in thresholds:
y_pred_thresh = (y_prob >= thresh).astype(int)
tn, fp, fn, tp = confusion_matrix(
self.y_test, y_pred_thresh
).ravel()
# Business value calculation
value_tp = tp * save_rate * revenue_saved # Revenue saved
cost_intervention = (tp + fp) * cost_fp # Intervention cost
cost_missed = fn * cost_fn # Lost customers
net_value = value_tp - cost_intervention - cost_missed
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
threshold_results.append({
'Threshold': round(thresh, 2),
'Precision': round(precision, 3),
'Recall': round(recall, 3),
'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn,
'Net Value ($)': round(net_value, 0),
'Customers Targeted': tp + fp
})
if net_value > best_net_value:
best_net_value = net_value
best_threshold = thresh
results_df = pd.DataFrame(threshold_results)
print(f"\nBusiness Parameters:")
print(f" Cost of false positive (wasted offer): ${cost_fp}")
print(f" Cost of false negative (lost customer): ${cost_fn}")
print(f" Revenue per saved customer: ${revenue_saved}")
print(f" Save rate for targeted customers: {save_rate:.0%}")
print(f"\nThreshold Analysis (showing every other threshold):")
print(results_df.iloc[::2].to_string(index=False))
print(f"\nOptimal threshold: {best_threshold:.2f}")
print(f"Net business value at optimal threshold: "
f"${best_net_value:,.0f}")
self.optimal_threshold = best_threshold
self.threshold_results = results_df
return best_threshold, results_df
def generate_business_report(self, threshold=None):
"""
Generate a business-oriented summary of model results,
translating metrics into actions and dollar values.
Parameters:
-----------
threshold : float
Classification threshold (default: uses optimized threshold).
"""
if threshold is None:
threshold = getattr(self, 'optimal_threshold', 0.5)
print("\n" + "=" * 60)
print("ATHENA RETAIL GROUP -- CHURN PREDICTION REPORT")
print("=" * 60)
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
y_prob = best_model.predict_proba(self.X_test_scaled)[:, 1]
else:
y_prob = best_model.predict_proba(self.X_test)[:, 1]
y_pred = (y_prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(self.y_test, y_pred).ravel()
total = len(self.y_test)
print(f"\nModel: {self.best_model_name}")
print(f"Threshold: {threshold:.2f}")
print(f"Test set size: {total} customers")
print(f"\nConfusion Matrix:")
print(f" Predicted: Stay Predicted: Churn")
print(f" Actual: Stayed {tn:>8,} {fp:>8,}")
print(f" Actual: Churned {fn:>8,} {tp:>8,}")
print(f"\nBusiness Translation:")
print(f" Churners correctly identified: {tp:,} of "
f"{tp + fn:,} ({tp/(tp+fn):.1%})")
print(f" Loyal customers incorrectly targeted: {fp:,} of "
f"{tn + fp:,} ({fp/(tn+fp):.1%})")
print(f" Total customers to target: {tp + fp:,} "
f"({(tp+fp)/total:.1%} of customer base)")
# Risk segmentation
print(f"\nRisk Segmentation:")
high_risk = (y_prob >= 0.7).sum()
medium_risk = ((y_prob >= 0.4) & (y_prob < 0.7)).sum()
low_risk = ((y_prob >= 0.2) & (y_prob < 0.4)).sum()
minimal_risk = (y_prob < 0.2).sum()
print(f" High risk (>=70%): {high_risk:>6,} customers")
print(f" Medium risk (40-70%): {medium_risk:>6,} customers")
print(f" Low risk (20-40%): {low_risk:>6,} customers")
print(f" Minimal risk (<20%): {minimal_risk:>6,} customers")
print(f"\nRecommended Actions:")
print(f" HIGH RISK: Personal outreach + premium retention offer")
print(f" MEDIUM RISK: Targeted email campaign + loyalty bonus")
print(f" LOW RISK: Automated engagement nudge")
print(f" MINIMAL: Standard communication cadence")
def predict_customer(self, customer_data, threshold=None):
"""
Predict churn probability for a single customer and
explain the key risk factors.
Parameters:
-----------
customer_data : dict
Dictionary of feature values for one customer.
threshold : float
Classification threshold.
Returns:
--------
dict with prediction, probability, and risk factors.
"""
if threshold is None:
threshold = getattr(self, 'optimal_threshold', 0.5)
# Create DataFrame from customer data
cust_df = pd.DataFrame([customer_data])
# Encode categorical features
categorical_cols = cust_df.select_dtypes(
include=['object']
).columns.tolist()
# Note: no drop_first here -- with a single customer, get_dummies would
# drop the only dummy column and silently lose the category; the column
# alignment below maps the encoding onto the training features instead
cust_encoded = pd.get_dummies(cust_df, columns=categorical_cols)
# Align columns with training data
for col in self.feature_names:
if col not in cust_encoded.columns:
cust_encoded[col] = 0
cust_encoded = cust_encoded[self.feature_names]
# Predict
best_model = self.models[self.best_model_name]
if self.best_model_name == 'Logistic Regression':
cust_scaled = self.scaler.transform(cust_encoded)
prob = best_model.predict_proba(cust_scaled)[0, 1]
else:
prob = best_model.predict_proba(cust_encoded)[0, 1]
prediction = 'CHURN RISK' if prob >= threshold else 'LIKELY RETAIN'
# Risk level
if prob >= 0.7:
risk_level = 'HIGH'
elif prob >= 0.4:
risk_level = 'MEDIUM'
elif prob >= 0.2:
risk_level = 'LOW'
else:
risk_level = 'MINIMAL'
result = {
'churn_probability': round(prob, 3),
'prediction': prediction,
'risk_level': risk_level,
'threshold_used': threshold
}
print(f"\nCustomer Churn Assessment")
print(f"-" * 40)
print(f" Churn probability: {prob:.1%}")
print(f" Risk level: {risk_level}")
print(f" Prediction: {prediction}")
return result
Code Explanation. The ChurnClassifier class follows the workflow we outlined in Section 7.2. Key design decisions:
- prepare_data() handles encoding and scaling, fitting the scaler on training data only (avoiding leakage).
- train_models() trains three models using class_weight='balanced' (logistic regression and random forest) and scale_pos_weight (XGBoost) to handle class imbalance.
- compare_models() presents results side by side and identifies the best model by AUC-ROC.
- optimize_threshold() translates business economics into threshold selection -- this is where the model meets the business.
- generate_business_report() produces output that a marketing manager can read and act on.
- predict_customer() demonstrates single-customer prediction for operational use.
Running the Complete Pipeline
# Generate synthetic Athena customer data
df = generate_athena_churn_data(n_customers=5000)
# Initialize and run the classifier
classifier = ChurnClassifier(df)
classifier.prepare_data()
classifier.train_models()
classifier.compare_models()
classifier.analyze_feature_importance()
classifier.optimize_threshold(
cost_fp=20, # Cost of retention offer sent to non-churner
cost_fn=340, # Lifetime value lost when missing a churner
revenue_saved=500, # Annual revenue from a retained customer
save_rate=0.4 # 40% of targeted churners are successfully retained
)
classifier.generate_business_report()
Example Output
When you run the pipeline, you will see output similar to:
Data Preparation Complete
Training samples: 4000
Test samples: 1000
Features: 13
Churn rate (train): 20.3%
Churn rate (test): 19.8%
============================================================
MODEL TRAINING
============================================================
[1/3] Training Logistic Regression...
Logistic Regression: AUC=0.817, F1=0.541, Recall=0.646
[2/3] Training Random Forest...
Random Forest: AUC=0.838, F1=0.565, Recall=0.611
[3/3] Training XGBoost...
XGBoost: AUC=0.851, F1=0.582, Recall=0.631
============================================================
MODEL COMPARISON
============================================================
Logistic Regression Random Forest XGBoost
Accuracy 0.791 0.812 0.821
Precision 0.467 0.524 0.540
Recall 0.646 0.611 0.631
F1 Score 0.541 0.565 0.582
AUC-ROC 0.817 0.838 0.851
Best model by AUC-ROC: XGBoost (0.851)
Athena Update. The numbers tell a story. XGBoost achieves the highest AUC (0.851), meaning it does the best job overall of separating churners from non-churners. But notice: even the best model's precision is only 0.54 -- meaning that of every 10 customers it flags as likely churners, only about 5 actually churn. Is that good enough? It depends on the business economics, which is exactly what threshold optimization answers.
Making a Single Customer Prediction
# Predict churn for a specific customer
sample_customer = {
'tenure_months': 8.5,
'loyalty_tier': 'Bronze',
'purchase_count_12m': 2,
'days_since_last_purchase': 95,
'avg_order_value': 42.50,
'return_rate': 0.15,
'online_share': 0.8,
'category_diversity': 2,
'purchase_trend': 0.5,
'email_engagement': 0.1
}
result = classifier.predict_customer(sample_customer)
Customer Churn Assessment
----------------------------------------
Churn probability: 78.3%
Risk level: HIGH
Prediction: CHURN RISK
NK studies the output. "So this customer -- short tenure, only two purchases, hasn't bought in three months, doesn't engage with emails, and their purchase trend is declining. I don't need a model to tell me this person is at risk."
"Correct," Professor Okonkwo says. "The model confirms your intuition for obvious cases. Its real value is in the ambiguous cases -- the customer with twelve purchases who just had a bad experience, the Gold member whose purchase trend just started declining. Those are the customers the model catches before the human eye does. And at scale -- across one hundred thousand customers -- human intuition simply cannot process the patterns fast enough."
7.9 Interpreting Classification Results
Building a model is half the work. Interpreting its results -- and communicating those results to business stakeholders -- is the other half. This section covers the essential metrics for classification and, more importantly, what they mean for business decisions.
The Confusion Matrix
Every classification model's performance can be summarized in a 2x2 table:
Predicted: No Churn Predicted: Churn
Actual: Did Not Churn TN (True Neg) FP (False Pos)
Actual: Churned FN (False Neg) TP (True Pos)
| Cell | What Happened | Business Meaning (Churn Example) |
|---|---|---|
| True Positive (TP) | Model correctly predicted churn | We caught an at-risk customer and can intervene |
| True Negative (TN) | Model correctly predicted no churn | We correctly left a loyal customer alone |
| False Positive (FP) | Model predicted churn, customer stayed | We wasted a retention offer on a loyal customer |
| False Negative (FN) | Model predicted no churn, customer left | We missed an at-risk customer -- they left without intervention |
The Key Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The proportion of all predictions that were correct. Intuitive but misleading for imbalanced classes. A model that never predicts churn achieves 80 percent accuracy on a dataset where 20 percent of customers churn -- while catching zero churners.
NK's eyes widen. "Wait. Eighty percent accuracy by predicting nobody churns? That's... incredibly misleading."
"Welcome to the accuracy paradox," Professor Okonkwo says. "This is the single most common misunderstanding I see in boardroom presentations about AI. A VP shows a slide that says 'our model is 85 percent accurate' and the room applauds. But nobody asks what the model does with the 15 percent it gets wrong."
Caution. Never evaluate a classification model on accuracy alone, especially when classes are imbalanced. A model with 98 percent accuracy that never predicts the minority class is not a good model -- it is a coin that always lands on the same side.
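The accuracy paradox is easy to reproduce. A sketch using scikit-learn's DummyClassifier, which always predicts the majority class (the 1,000-customer toy dataset is ours):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1,000 customers, 20% churners -- the features carry no signal at all
y = np.array([0] * 800 + [1] * 200)
X = np.zeros((1000, 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))  # 0.8 -- looks respectable on a slide
print(recall_score(y, pred))    # 0.0 -- catches zero churners
```

This is the boardroom slide Professor Okonkwo warns about: 80 percent accuracy, zero business value.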
Precision = TP / (TP + FP)
Of the customers we flagged as churners, what proportion actually churned? High precision means fewer wasted retention offers.
Recall (Sensitivity) = TP / (TP + FN)
Of the customers who actually churned, what proportion did we identify? High recall means fewer missed at-risk customers.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean of precision and recall. Useful as a single-number summary, but the individual precision and recall values tell you more.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
AUC-ROC measures how well the model separates the two classes across all possible thresholds. It ranges from 0.5 (random guessing) to 1.0 (perfect separation).
Definition. AUC-ROC is a threshold-independent metric that measures a classifier's ability to distinguish between classes. An AUC of 0.85 means that if you randomly pick one churner and one non-churner, the model will assign a higher churn probability to the actual churner 85 percent of the time.
| AUC-ROC Range | Interpretation |
|---|---|
| 0.90 - 1.00 | Excellent |
| 0.80 - 0.90 | Good |
| 0.70 - 0.80 | Fair |
| 0.60 - 0.70 | Poor |
| 0.50 - 0.60 | No better than random |
Athena's model achieves an AUC of approximately 0.83 to 0.85, which is solidly in the "good" range. Is that good enough? It depends entirely on the business application. We address this question next.
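The pairwise-ranking interpretation in the definition above can be verified directly: count the fraction of (churner, non-churner) pairs the model's scores rank correctly and compare it to scikit-learn's roc_auc_score. A sketch on synthetic labels and scores of our own construction:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                      # random 0/1 labels
scores = y * 0.6 + rng.normal(0, 0.5, size=1000)  # informative, noisy scores

# Fraction of positive/negative pairs ranked correctly (ties count half)
pos, neg = scores[y == 1], scores[y == 0]
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(np.isclose(pairwise, roc_auc_score(y, scores)))  # True
```

The equivalence is why AUC is threshold-independent: it scores the ranking of customers by risk, not any particular cutoff.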
The Precision-Recall Tradeoff
As we discussed in Chapter 6, precision and recall are inversely related: increasing one typically decreases the other. The tradeoff is controlled by the classification threshold.
| Lower Threshold (e.g., 0.2) | Higher Threshold (e.g., 0.7) |
|---|---|
| More customers flagged as at-risk | Fewer customers flagged |
| Higher recall (catch more churners) | Lower recall (miss more churners) |
| Lower precision (more false alarms) | Higher precision (fewer false alarms) |
| Cast a wider net | Cast a narrower net |
| Best when cost of missing a churner is high | Best when cost of intervention is high |
7.10 From Model to Decision -- Translating Predictions into Actions
Athena Update. The model achieves an AUC of 0.83. Ravi presents the results to Athena's executive team. The CMO says, "Impressive. Ship it." The VP of Operations, Maya Gonzalez, crosses her arms: "Ship it where? What exactly do we do with a list of customers who might churn? We have three marketing specialists on the retention team. They can't personally call twenty thousand people."
Maya's pushback is the most important moment in Athena's ML journey so far. A model that produces predictions without a clear action plan is a model that will be ignored. Ravi learns a lesson that Professor Okonkwo has been teaching all semester: the model is not the product. The decision system is the product.
Designing the Intervention Strategy
Ravi's team works with the retention team to design a tiered intervention strategy based on the model's probability scores:
| Risk Tier | Probability Range | Volume (est.) | Intervention | Cost per Customer |
|---|---|---|---|---|
| Critical | >= 0.80 | ~500/month | Personal phone call + premium offer (free shipping for 6 months + 20% next purchase) | $35 |
| High | 0.60 - 0.79 | ~1,200/month | Personalized email campaign (3-touch series) + loyalty bonus points | $15 |
| Moderate | 0.40 - 0.59 | ~2,000/month | Automated re-engagement email + product recommendations | $3 |
| Low | 0.20 - 0.39 | ~3,500/month | Automated "We miss you" notification (push/email) | $0.50 |
| Minimal | < 0.20 | ~12,800/month | Standard communication cadence (no additional intervention) | $0 |
This tiered approach solves Maya's problem. The three marketing specialists focus exclusively on the Critical tier (~500 customers/month). The High tier gets a semi-automated campaign. The Moderate and Low tiers are fully automated. And the Minimal tier receives no additional treatment -- saving resources for where they matter.
The Business Case
Tom runs the numbers for a monthly cycle:
# Monthly business impact estimation
# Tier volumes and economics
tiers = {
'Critical': {
'volume': 500, 'cost_per': 35, 'actual_churn_rate': 0.75,
'save_rate': 0.50, 'annual_value': 620
},
'High': {
'volume': 1200, 'cost_per': 15, 'actual_churn_rate': 0.55,
'save_rate': 0.40, 'annual_value': 480
},
'Moderate': {
'volume': 2000, 'cost_per': 3, 'actual_churn_rate': 0.35,
'save_rate': 0.25, 'annual_value': 380
},
'Low': {
'volume': 3500, 'cost_per': 0.50, 'actual_churn_rate': 0.18,
'save_rate': 0.10, 'annual_value': 340
},
}
total_cost = 0
total_revenue_saved = 0
print(f"{'Tier':<12} {'Volume':>7} {'Cost':>10} {'Saved':>7} "
f"{'Revenue Saved':>15}")
print("-" * 55)
for tier_name, t in tiers.items():
cost = t['volume'] * t['cost_per']
churners_in_tier = int(t['volume'] * t['actual_churn_rate'])
saved = int(churners_in_tier * t['save_rate'])
revenue = saved * t['annual_value']
total_cost += cost
total_revenue_saved += revenue
print(f"{tier_name:<12} {t['volume']:>7,} ${cost:>9,.0f} "
f"{saved:>7,} ${revenue:>14,.0f}")
print("-" * 55)
print(f"{'TOTAL':<12} {'':<7} ${total_cost:>9,.0f} "
f"{'':>7} ${total_revenue_saved:>14,.0f}")
print(f"\nMonthly net value: ${total_revenue_saved - total_cost:,.0f}")
print(f"Annual net value: ${(total_revenue_saved - total_cost) * 12:,.0f}")
Athena Update. When Ravi presents the tiered intervention plan alongside the model, the VP of Operations drops her objections. "This I can work with," Maya says. "We're not drowning the team in a firehose of names. We're giving them a prioritized list with specific actions." The model enters a three-month pilot. The executive team is cautiously optimistic. Athena's AI journey has its first real deliverable.
The Feedback Loop
Professor Okonkwo adds a final point: "The model's job doesn't end at prediction. Once the retention campaigns run, you collect new data -- which customers responded to the intervention, which ones churned despite it, which ones would have stayed anyway. That feedback becomes training data for the next version of the model. This is the monitoring and maintenance stage from Chapter 6. The model learns, the interventions improve, and the cycle continues."
She draws a circle on the whiteboard:
Predict --> Intervene --> Measure --> Retrain --> Predict
"This feedback loop is what separates a one-time analytics project from a production ML system. We'll cover it in depth in Chapter 12 when we discuss MLOps."
7.11 Algorithms at a Glance -- A Comparison Guide
For reference, here is a summary comparison of the four classification algorithms covered in this chapter:
| Characteristic | Logistic Regression | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|---|
| Intuition | Draw a boundary line | Learn if-then rules | Vote among many trees | Correct mistakes sequentially |
| Interpretability | High (coefficients) | Very high (rules) | Medium (feature importance) | Medium (feature importance) |
| Accuracy on tabular data | Good baseline | Moderate (overfits easily) | Good to excellent | Excellent |
| Training speed | Very fast | Fast | Moderate | Moderate to slow |
| Handles non-linear patterns | No (without engineering) | Yes | Yes | Yes |
| Risk of overfitting | Low | High | Low | Moderate |
| Feature scaling required | Yes | No | No | No |
| When to use | Baseline, regulatory, interpretability | Quick exploration, rule extraction | Robust general-purpose | Maximum accuracy on structured data |
| Key hyperparameters | C (regularization) | max_depth, min_samples | n_estimators, max_depth | n_estimators, learning_rate, max_depth |
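The "key hyperparameters" row maps directly onto scikit-learn constructor arguments. The sketch below assumes scikit-learn is available; the specific values are illustrative starting points, not tuned settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    # Requires feature scaling, hence the pipeline; C controls regularization
    'logistic_regression': make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    # max_depth and min_samples_leaf rein in the overfitting noted above
    'decision_tree': DecisionTreeClassifier(max_depth=5, min_samples_leaf=20),
    # Tree ensembles are scale-invariant; no scaler needed
    'random_forest': RandomForestClassifier(n_estimators=300, max_depth=10),
    'gradient_boosting': GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=3),
}
```

Because all four expose the same `fit` / `predict_proba` interface, looping over this dictionary is a convenient way to benchmark them on a single train/test split before committing to one.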
Business Insight. Algorithm selection in business is rarely about raw accuracy. It is about the intersection of accuracy, interpretability, speed, and organizational trust. A model the business team understands and uses at 0.82 AUC creates more value than a model the business team ignores at 0.87 AUC. Ravi Mehta puts it this way: "The best model is the one that gets deployed."
7.12 Chapter Summary and the Road Ahead
This chapter covered the core concepts and tools of supervised classification -- the most widely applied category of machine learning in business. We moved from the economics of classification (why targeting the right customers is worth millions) through four algorithms (logistic regression, decision trees, random forests, gradient boosting), feature engineering techniques, a complete implementation (ChurnClassifier), evaluation metrics, and the critical step of connecting model outputs to business actions.
Several themes will recur throughout Part 2 and beyond:
- Business framing precedes modeling. The value of a classification model is determined before the first line of code is written -- by the problem definition, the target variable design, and the action plan that follows prediction.
- Metrics must be tied to economics. Accuracy, precision, recall, and AUC are useful summaries, but the question that matters is: "How much is this model worth in dollars?" Chapter 11 will extend this framework to a full model evaluation methodology.
- Feature engineering is where domain expertise meets data science. The algorithms are commoditized. The competitive advantage lies in knowing which features to build and how to encode business knowledge into the model.
- The model is not the product. Athena's churn model created value only when it was embedded in a tiered retention strategy that the operations team could execute. This lesson will be reinforced in Chapter 12 (MLOps) and Chapter 34 (Measuring AI ROI).
- Interpretability is not optional. Understanding why a model makes its predictions is essential for stakeholder trust, regulatory compliance, and continuous improvement. We will explore interpretability tools in depth in Chapter 26 (Fairness, Explainability, and Transparency).
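The "dollars" question can be made concrete: given a cost per contact and a value per saved customer, a model's confusion-matrix counts convert directly into campaign economics. A minimal sketch using the chapter's opening numbers ($20 per contact, $500 per saved customer, 40% save rate); the function name is ours, not a library API:

```python
def campaign_value(tp, fp, cost_per_contact=20, value_per_save=500,
                   save_rate=0.40):
    """Net value of contacting everyone the model flags as a churner.

    tp: flagged customers who would actually have churned (savable)
    fp: flagged customers who would have stayed anyway (wasted spend)
    """
    contacts = tp + fp
    revenue_saved = tp * save_rate * value_per_save  # only real churners save
    return revenue_saved - contacts * cost_per_contact

# "Target everyone": 5,000 true churners, 95,000 loyal customers flagged
print(campaign_value(tp=5_000, fp=95_000))  # -1,000,000: the opening loss
```

Swapping in a model's actual true-positive and false-positive counts turns any classifier comparison into a direct dollar comparison, which is the evaluation lens Chapter 11 develops in full.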
Caution. We have introduced classification metrics in this chapter, but we have only scratched the surface. Cross-validation, hyperparameter tuning, learning curves, calibration curves, and cost-sensitive evaluation are covered in Chapter 11 (Model Evaluation and Selection). Do not skip that chapter -- evaluation is where most real-world ML projects succeed or fail.
Tom closes his laptop. "I built the model in twenty minutes. Understanding what to do with it took the entire class."
"Now you're learning," Professor Okonkwo says.
In Chapter 8, we turn from classification (predicting categories) to regression (predicting continuous values). Athena's next challenge: forecasting demand across 12,000 SKUs. The DemandForecaster awaits.