In This Chapter
- From Describing the Past to Predicting the Future
- 34.1 Regression vs. Classification: Two Fundamental Questions
- 34.2 The scikit-learn Workflow
- 34.3 Linear Regression: Predicting Numbers
- 34.4 Logistic Regression: Predicting Probabilities
- 34.5 Understanding Classification Metrics
- 34.6 Decision Trees: Human-Readable Classification
- 34.7 Random Forests: Ensembles Win
- 34.8 Cross-Validation: Honest Evaluation
- 34.9 Feature Engineering: What Goes Into the Model
- 34.10 Practical Limits: When Not to Use ML
- 34.11 The Acme Churn Model: Putting It All Together
- Chapter Summary
- Key Terms
Chapter 34: Predictive Models — Regression and Classification
From Describing the Past to Predicting the Future
In Chapter 11, you analyzed what Acme had sold. In Chapter 26, you built models to forecast what they would sell next. In Chapter 32, you used those forecasts to manage inventory. Every one of those analyses looked at historical data to understand patterns.
This chapter is different. Here, you will learn to build models that make predictions about specific, individual outcomes: not "revenue will be about $2.4 million next quarter" but "this customer specifically has a 73% probability of canceling their account within the next 90 days."
That precision — prediction at the individual level — is the distinctive capability of machine learning. And with scikit-learn, the most widely used machine learning library in Python, it is accessible to any business analyst who has followed the path to this chapter.
By the time you finish, you will know:
- When to use regression versus classification
- The standard scikit-learn workflow that applies to every model you will ever build
- How to evaluate models honestly — including understanding what the metrics actually mean
- How to communicate results to non-technical stakeholders in ways that drive action
- How Priya built Acme's customer churn predictor, identified the top 20 at-risk accounts, and put them in front of the sales team before those customers left
34.1 Regression vs. Classification: Two Fundamental Questions
Every predictive model answers one of two fundamental types of questions:
Regression: "How much?"
- How much revenue will this campaign generate?
- What price should we set for this product?
- How many units will we sell next quarter?
The output is a number on a continuous scale. There is no natural cutoff — $450,000 is not categorically different from $451,000.
Classification: "Which one?"
- Will this customer cancel their subscription? (yes/no)
- Will this loan default? (yes/no)
- Which product category does this support ticket belong to? (category A/B/C/D)
The output is a category. Binary classification (yes/no) is the most common case in business — the question almost always has the form "will this thing happen or not?"
The Borderline Cases
In practice, the line blurs. A logistic regression model outputs a probability (a number between 0 and 1) rather than a hard yes/no. Whether you call that regression or classification depends on how you use it:
- If you apply a threshold (probability > 0.5 = "will churn") and act on the binary outcome, it is classification
- If you use the probability directly to rank customers by risk and prioritize outreach, you are treating it more like a regression
This blurring is intentional. In most business applications, the raw probability is more useful than the binary prediction — it lets you prioritize, set thresholds based on business costs, and communicate uncertainty.
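The two uses sit side by side in code. A minimal sketch, using synthetic data from scikit-learn's make_classification as a stand-in for customer features and churn labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer features and churn labels
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]  # probability of the positive class

# Classification use: threshold the probability and act on the binary label
flagged_for_outreach = probs > 0.5

# Regression-like use: rank customers by risk and take the riskiest 20
top_20_riskiest = np.argsort(probs)[::-1][:20]
```

Same model, same probabilities — the difference is entirely in how you consume the output.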
34.2 The scikit-learn Workflow
scikit-learn enforces a consistent interface across hundreds of different models. Once you understand the pattern, every new model you learn follows the same steps.
Step 1: Prepare your data
  ├── Load and clean the raw data
  ├── Engineer features (create the columns the model needs)
  └── Handle categorical variables and missing values

Step 2: Split into training and test sets
  └── train_test_split(X, y, test_size=0.2)

Step 3: Build a pipeline
  └── Pipeline([("scaler", StandardScaler()),
                ("model", YourModelHere())])

Step 4: Fit the model
  └── pipeline.fit(X_train, y_train)

Step 5: Predict
  ├── pipeline.predict(X_test) — class labels
  └── pipeline.predict_proba(X_test) — probabilities (classifiers only)

Step 6: Evaluate
  ├── For regression: R², MAE, RMSE
  └── For classification: accuracy, precision, recall, F1, AUC

Step 7: Iterate and communicate
  └── Tune, explain, and deploy
This workflow is the same whether you are using logistic regression, a decision tree, a random forest, or any other scikit-learn estimator. Learning it once means you can apply it everywhere.
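The steps above can be sketched end to end in a few lines. The data here is synthetic (make_classification standing in for a prepared Acme feature table), but the shape of the code is the same for any estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: prepared data (synthetic here) and the train/test split
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Steps 3-4: build the pipeline and fit it on training data only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Steps 5-6: predict on held-out data and evaluate
y_prob = pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"Test AUC: {auc:.3f}")
```

Swapping LogisticRegression for any other classifier changes exactly one line.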
The Importance of the Train/Test Split
The most important habit you need to develop is evaluating your model on data it has never seen during training. If you train and evaluate on the same data, you will always get inflated performance metrics — the model has "memorized" the answers rather than "learned" the pattern.
from sklearn.model_selection import train_test_split

# 80% for training, 20% for testing
# random_state ensures reproducibility
# stratify=y ensures the class ratio is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(f"Training set: {len(X_train):,} rows")
print(f"Test set:     {len(X_test):,} rows")
print(f"Churn rate (train): {y_train.mean():.1%}")
print(f"Churn rate (test):  {y_test.mean():.1%}")
The stratify=y argument is important for classification problems with imbalanced classes. Without it, your test set might happen to have very few examples of the minority class, making evaluation unreliable.
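You can see the effect directly on an invented imbalanced dataset — with stratify=y, the minority-class rate survives the split almost exactly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                 # invented features
y = (rng.random(1000) < 0.10).astype(int)      # ~10% minority class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(f"Minority rate (train): {y_train.mean():.1%}")
print(f"Minority rate (test):  {y_test.mean():.1%}")
```

Rerun the split without stratify=y and the test-set rate can drift noticeably, especially on small datasets.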
34.3 Linear Regression: Predicting Numbers
Linear regression is the oldest and most widely used predictive model in business analytics. Its core question: how does a measurable outcome change as the inputs change?
The Intuition
Imagine plotting marketing spend on the X axis and monthly revenue on the Y axis for the past 36 months. Each point is one month. You can see a rough upward pattern — more spending tends to correlate with more revenue — but with scatter around the trend.
Linear regression finds the straight line that best fits this cloud of points. Specifically, it finds the line that minimizes the sum of squared distances between each actual point and the line's prediction. That line becomes your model.
Once you have the line:
- You can predict revenue for any level of marketing spend you are considering
- You can quantify how much revenue increases per additional dollar of spend
- You can add more features (seasonality, region, product mix) and understand each one's independent contribution
The scikit-learn Implementation
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_revenue_regression_model(
    features_df: pd.DataFrame,
    target_series: pd.Series,
    feature_names: list[str],
    test_size: float = 0.20,
) -> dict:
    """
    Build and evaluate a linear regression model for revenue prediction.

    Args:
        features_df: DataFrame of features (X).
        target_series: Series of revenue values (y).
        feature_names: List of feature column names (for coefficient display).
        test_size: Fraction of data to hold out for evaluation.

    Returns:
        Dictionary with model, metrics, and coefficient interpretation.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        features_df, target_series,
        test_size=test_size,
        random_state=42,
    )

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

    # Extract coefficients in the original feature scale. The scaler divides
    # each feature by its standard deviation, so dividing the fitted
    # coefficients by those same scales recovers the per-original-unit effect.
    model = pipeline.named_steps["model"]
    scaler = pipeline.named_steps["scaler"]
    unscaled_coefficients = model.coef_ / scaler.scale_
    # The intercept must be adjusted too: the fitted intercept applies at the
    # scaled feature means, not at zero in the original units.
    unscaled_intercept = float(
        model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)
    )

    coef_df = pd.DataFrame({
        "feature": feature_names,
        "coefficient": unscaled_coefficients,
        "abs_importance": np.abs(unscaled_coefficients),
    }).sort_values("abs_importance", ascending=False)

    return {
        "pipeline": pipeline,
        "r2": round(r2, 4),
        "mae": round(mae, 2),
        "rmse": round(rmse, 2),
        "intercept": round(unscaled_intercept, 2),
        "coefficients": coef_df,
        "y_test": y_test,
        "y_pred": y_pred,
    }
Evaluating Regression: R², MAE, and RMSE in Plain English
R² (R-squared): Tells you what fraction of the variance in the target the model explains. An R² of 0.81 means the model accounts for 81% of why revenue varies from month to month. The remaining 19% is driven by factors not in the model.
R² has no universal threshold for "good" — it depends entirely on the domain. For forecasting revenue from a handful of business drivers, R² of 0.70-0.85 is typically considered strong. For predicting individual human behavior (will a specific customer churn?), R² of 0.20 might be quite useful. Compare your model to a naive baseline, not to an abstract standard.
MAE (Mean Absolute Error): The average magnitude of the model's errors, in the same units as the target. If MAE = $8,500, the model is off by about $8,500 on average. This is the most intuitive metric because it is directly interpretable.
RMSE (Root Mean Squared Error): Like MAE, but penalizes large errors more heavily. If your model is mostly close but occasionally very wrong, RMSE will be significantly larger than MAE. This makes RMSE a useful indicator of catastrophic failures.
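A toy calculation makes the contrast concrete. Two prediction sets against the same actuals (the numbers here are invented): one with uniformly small errors, one that is exact except for a single large miss:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 100, 100, 100, 100])
y_small = np.array([101.0, 99, 102, 98, 100])   # small errors everywhere
y_spiky = np.array([100.0, 100, 100, 100, 110])  # exact, except one big miss

for name, y_pred in [("small errors", y_small), ("one large error", y_spiky)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

The first model gives MAE 1.20 and RMSE about 1.41 (close together); the second gives MAE 2.00 but RMSE about 4.47. When RMSE is far above MAE, suspect occasional large failures.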
def interpret_regression_results(results: dict) -> None:
    """Print a human-readable summary of regression model performance."""
    print("=== REGRESSION MODEL RESULTS ===")
    print("\nModel Performance:")
    print(f"  R² score: {results['r2']:.4f} (explains {results['r2']*100:.1f}% of variance)")
    print(f"  MAE:  ${results['mae']:,.0f} (typical prediction error)")
    print(f"  RMSE: ${results['rmse']:,.0f} (penalizes large errors more)")

    print("\nModel Equation (top 5 features):")
    print(f"  Revenue = {results['intercept']:,.0f}")
    for _, row in results["coefficients"].head(5).iterrows():
        direction = "+" if row["coefficient"] >= 0 else ""
        print(f"    {direction}{row['coefficient']:,.1f} × {row['feature']}")
Interpreting Coefficients
The coefficients are the most valuable output of linear regression for business purposes. Each coefficient answers: "Holding all other variables constant, how much does the outcome change for a one-unit increase in this feature?"
For example, in a model predicting monthly revenue:
# Hypothetical coefficient output:
#   feature                      coefficient
#   marketing_spend_thousands       +4,250   (each $1K of spend = +$4,250 revenue)
#   is_december                    +18,000   (December earns $18K more, all else equal)
#   is_summer_month                 -3,200   (summer months earn $3.2K less)
#   sales_headcount                 +6,800   (each additional salesperson = +$6.8K revenue)
These numbers tell a story. The marketing return ($4.25 per $1 of spend) is something the CMO will want to see. The seasonality effects are operationally useful for planning. This is why linear regression is valuable even when more complex models might predict slightly better — the coefficients are directly actionable.
34.4 Logistic Regression: Predicting Probabilities
Despite its name, logistic regression is a classification model. It predicts the probability that an observation belongs to the positive class — typically the event you care about, like churning, defaulting, or converting.
Why Not Just Use Linear Regression for Classification?
Linear regression can predict values outside the range [0, 1]. A model that predicts a 1.3 probability of churn or -0.2 probability of default is not interpretable as a probability.
Logistic regression solves this by applying the sigmoid (logistic) function to the linear combination of features. The sigmoid maps any real number to a value between 0 and 1, producing a valid probability. The result is a model that:
- Always outputs a probability between 0 and 1
- Is monotonically related to the linear combination of features (more of a risky predictor = higher probability)
- Can be converted to a binary prediction by applying a threshold
The Business Case for Logistic Regression as Your Starting Point
Logistic regression is often the right first choice for binary classification in business because:
Interpretability: Each coefficient has a clear interpretation. A positive coefficient means that feature increases the log-odds of the event; negative means it decreases them. In odds ratio terms: if payment_failures has coefficient 0.85, each additional payment failure multiplies the odds of churn by e^0.85 ≈ 2.3.
Calibration: Logistic regression probabilities are well-calibrated — a 70% prediction really does mean the model has seen about 70% of customers with that profile churn in the training data. Many more complex models produce poorly calibrated probabilities.
Speed: Trains in milliseconds on typical business datasets.
Regularization by default: scikit-learn's LogisticRegression applies L2 regularization by default, controlled by the C parameter (C is the inverse regularization strength, so smaller values mean stronger regularization). This reduces overfitting without any additional effort.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
)


def build_logistic_regression_model(
    X: pd.DataFrame,
    y: pd.Series,
    feature_names: list[str],
    positive_class_label: str = "churned",
    negative_class_label: str = "retained",
    test_size: float = 0.20,
    regularization_c: float = 1.0,
) -> dict:
    """
    Build and evaluate a logistic regression classification model.

    Args:
        X: Feature DataFrame.
        y: Binary target Series (1 = positive class, 0 = negative class).
        feature_names: Column names, used for coefficient display.
        positive_class_label: Name for the positive class (for reporting).
        negative_class_label: Name for the negative class.
        test_size: Fraction to hold out for evaluation.
        regularization_c: Inverse regularization strength. Smaller = more
            regularization. Default 1.0 (moderate regularization).

    Returns:
        Dictionary with model, metrics, predictions, and feature interpretation.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(
            C=regularization_c,
            random_state=42,
            max_iter=1000,
            class_weight="balanced",
        )),
    ])
    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]

    # Note: the pipeline standardizes features, so each coefficient is the
    # change in log-odds per one standard deviation of that feature.
    model = pipeline.named_steps["model"]
    coef_df = pd.DataFrame({
        "feature": feature_names,
        "coefficient": model.coef_[0],
        "odds_ratio": np.exp(model.coef_[0]),
    }).sort_values("coefficient", key=abs, ascending=False)

    return {
        "pipeline": pipeline,
        "auc": round(float(roc_auc_score(y_test, y_prob)), 4),
        "classification_report": classification_report(
            y_test, y_pred,
            target_names=[negative_class_label, positive_class_label],
        ),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "coefficients": coef_df,
        "y_test": y_test,
        "y_pred": y_pred,
        "y_prob": y_prob,
    }
34.5 Understanding Classification Metrics
Classification models have a richer set of evaluation metrics than regression models, because there are multiple ways to be right and wrong.
The Confusion Matrix
For a binary classifier, every prediction falls into one of four categories:
                    Predicted: RETAIN    Predicted: CHURN
  Actual: RETAIN           TN                   FP
  Actual: CHURN            FN                   TP
- True Positives (TP): Correctly identified churners. These are the wins.
- True Negatives (TN): Correctly identified retained customers. No action needed.
- False Positives (FP): Customers predicted to churn who actually stayed. The cost: unnecessary retention outreach — expensive but manageable.
- False Negatives (FN): Churners the model missed. These customers leave without any intervention. Often the most costly error type.
Precision, Recall, and the Business Tradeoff
Precision = TP / (TP + FP) — Of all customers the model flagged as at-risk, what fraction actually churned?
A high-precision model does not waste your team's time on false alarms. But if precision is achieved by only flagging the most obvious cases, you miss many churners.
Recall = TP / (TP + FN) — Of all customers who actually churned, what fraction did the model catch?
A high-recall model catches most churners but may generate many false alarms.
The business tradeoff: if your retention team can handle 50 calls per week, a high-precision model that gives you 50 confident predictions is better than a high-recall model that flags 200 customers with many false positives. If missing a churner costs $10,000 in lost annual contract value but making a retention call costs $50, you want to maximize recall even at the cost of precision.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) — The harmonic mean of precision and recall. Useful as a single summary metric when both matter.
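These formulas are simple enough to check by hand. With hypothetical counts of 40 true positives, 10 false positives, and 20 false negatives:

```python
tp, fp, fn = 40, 10, 20  # hypothetical counts from a confusion matrix

precision = tp / (tp + fp)  # 40 / 50 = 0.80
recall = tp / (tp + fn)     # 40 / 60 ≈ 0.67
f1 = 2 * (precision * recall) / (precision + recall)  # ≈ 0.73

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two numbers — a model cannot buy a high F1 by excelling at only one of them.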
ROC AUC: The Overall Discrimination Metric
ROC AUC (Area Under the Receiver Operating Characteristic Curve) measures how well the model separates the two classes across all possible threshold settings. An AUC of 0.5 is no better than random; 1.0 is perfect; 0.8 is generally considered good for business applications.
AUC is particularly useful for comparing models because it is threshold-independent — it measures the model's fundamental discriminating ability, not its performance at any specific threshold.
def print_classification_results(
    results: dict,
    positive_class_label: str = "churn",
) -> None:
    """Print a business-readable classification model summary."""
    print("=== CLASSIFICATION MODEL RESULTS ===")
    print(f"\nROC AUC: {results['auc']:.4f}")
    print("  (A score of 1.0 = perfect; 0.5 = no better than guessing)")

    cm = results["confusion_matrix"]
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    print("\nAt the default 0.5 threshold:")
    print(f"  Correctly identified {positive_class_label} cases: {tp} ({recall:.1%} recall)")
    print(f"  Missed {positive_class_label} cases: {fn}")
    print(f"  False alarms: {fp} ({precision:.1%} precision)")
    print(f"  Correctly flagged as retained: {tn}")

    print("\nIn plain language:")
    print(
        f"  The model catches {recall:.0%} of customers who will {positive_class_label}, "
        f"and when it flags someone, it is right {precision:.0%} of the time."
    )

    print("\nFull Classification Report:")
    print(results["classification_report"])
Choosing the Right Threshold
The default threshold of 0.5 is not always optimal. The right threshold depends on the relative costs of false positives and false negatives in your specific context.
import numpy as np
import pandas as pd


def evaluate_thresholds(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    false_negative_cost: float,
    false_positive_cost: float,
    thresholds: list[float] | None = None,
) -> pd.DataFrame:
    """
    Evaluate classification performance at multiple thresholds.

    Calculates the total business cost at each threshold based on the
    specified costs of false negatives and false positives.

    Args:
        y_true: True binary labels.
        y_prob: Predicted probabilities for the positive class.
        false_negative_cost: Cost of missing one positive case
            (e.g., $10,000 for a churned customer).
        false_positive_cost: Cost of a false alarm
            (e.g., $50 for an unnecessary retention call).
        thresholds: List of thresholds to evaluate. Defaults to
            0.10 through 0.90 in steps of 0.05.

    Returns:
        DataFrame with metrics at each threshold, sorted by total cost.
    """
    if thresholds is None:
        thresholds = [round(t, 2) for t in np.arange(0.10, 0.91, 0.05)]

    rows = []
    for threshold in thresholds:
        y_pred = (y_prob >= threshold).astype(int)
        tn = int(((y_pred == 0) & (y_true == 0)).sum())
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        total_cost = fn * false_negative_cost + fp * false_positive_cost
        rows.append({
            "threshold": threshold,
            "true_positives": tp,
            "false_positives": fp,
            "false_negatives": fn,
            "precision": round(precision, 3),
            "recall": round(recall, 3),
            "total_cost": round(total_cost, 0),
        })

    return pd.DataFrame(rows).sort_values("total_cost")
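To see why the threshold matters when costs are asymmetric, the same cost logic can be run standalone on synthetic data — the labels, scores, and costs below are all invented:

```python
import numpy as np

rng = np.random.default_rng(7)
y_true = (rng.random(1000) < 0.20).astype(int)     # ~20% positive class
y_prob = 0.3 * y_true + 0.5 * rng.random(1000)     # scores that overlap between classes

fn_cost, fp_cost = 10_000, 50  # a missed churner vs. an unnecessary retention call
costs = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    costs[threshold] = fn * fn_cost + fp * fp_cost
    print(f"threshold {threshold}: {fn} missed, {fp} false alarms, cost ${costs[threshold]:,}")
```

With a missed churner costing 200 times a retention call, the cheapest threshold here is the lowest one: casting a wide net is the rational choice, and the "default" 0.5 would be a costly mistake.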
34.6 Decision Trees: Human-Readable Classification
Decision trees work by asking a series of yes/no questions about the data. They are often the first model business stakeholders can actually understand without explanation.
The Structure
A decision tree splits the training data recursively. At each node, the algorithm asks: "Which feature and threshold most cleanly separates the examples into different outcomes?" It chooses the split that maximizes the separation (using metrics like Gini impurity or information gain), then applies the same process to each resulting subset.
The result is a tree of if-then rules that can be printed and audited:
Is payment_failures_last_year > 1?
  YES --> Is logins_last_30_days < 3?
            YES --> PREDICT CHURN  [probability: 0.87, n=134]
            NO  --> PREDICT RETAIN [probability: 0.28 churn, n=89]
  NO  --> Is account_age_days < 180?
            YES --> Is support_contacts_last_90_days > 3?
                      YES --> PREDICT CHURN  [probability: 0.61, n=44]
                      NO  --> PREDICT RETAIN [probability: 0.09 churn, n=198]
            NO  --> PREDICT RETAIN [probability: 0.04 churn, n=412]
Every prediction is explainable: you can trace the path from root to leaf and describe in plain language why the model predicted what it did.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import roc_auc_score, classification_report


def build_decision_tree(
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
    feature_names: list[str],
    max_depth: int = 4,
    min_samples_leaf: int = 20,
) -> dict:
    """
    Train and evaluate a decision tree classifier.

    Args:
        X_train, X_test: Feature DataFrames for training and evaluation.
        y_train, y_test: Target Series for training and evaluation.
        feature_names: Column names for the tree text representation.
        max_depth: Maximum tree depth. Start with 4, increase only if
            cross-validation confirms improvement.
        min_samples_leaf: Minimum samples required in a leaf node.
            Prevents the tree from fitting noise in small subsets.

    Returns:
        Dictionary with trained model, metrics, and text representation.
    """
    tree = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        class_weight="balanced",
        random_state=42,
    )
    tree.fit(X_train, y_train)

    y_pred = tree.predict(X_test)
    y_prob = tree.predict_proba(X_test)[:, 1]
    tree_text = export_text(tree, feature_names=feature_names)

    return {
        "model": tree,
        "auc": round(float(roc_auc_score(y_test, y_prob)), 4),
        "classification_report": classification_report(y_test, y_pred),
        "tree_text": tree_text,
        "y_pred": y_pred,
        "y_prob": y_prob,
    }
Overfitting: The Central Decision Tree Problem
A decision tree with no constraints grows until it has a unique leaf for every training example — perfect training accuracy, terrible test accuracy. This is the classic overfitting problem.
The practical controls:
- max_depth=4: Rarely does a deeper tree generalize better to unseen data
- min_samples_leaf=20: Each leaf must represent at least 20 training customers — prevents fitting noise
- class_weight="balanced": Adjusts for imbalanced classes automatically
Decision trees are excellent for explaining predictions but usually not the best choice for maximum predictive accuracy. That is where random forests come in.
34.7 Random Forests: Ensembles Win
A random forest trains hundreds of decision trees, each on a random sample of the data and a random subset of features, then aggregates their predictions. The wisdom-of-crowds effect produces a model that is substantially more accurate and stable than any single tree.
Why Random Forests Are Better Than Single Trees
A single decision tree is unstable: change a few training examples and the tree may look completely different. It also tends to overfit deeply.
Random forests address both problems:
- Bootstrap sampling: Each tree is trained on a random sample (with replacement) of the training data, so different trees see different examples
- Feature randomness: At each split, only a random subset of features is considered, forcing the trees to use different predictors
- Aggregation: Errors in individual trees are uncorrelated and tend to cancel out when averaged
The result: better accuracy, lower variance, more stable feature importance estimates.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report


def build_random_forest_model(
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
    feature_names: list[str],
    n_estimators: int = 200,
    max_depth: int = 6,
) -> dict:
    """
    Train and evaluate a random forest classifier.

    Args:
        X_train, X_test: Feature DataFrames.
        y_train, y_test: Target Series.
        feature_names: Column names for feature importance display.
        n_estimators: Number of trees. More trees = more stable, slower.
            200 is a good default; rarely improves beyond 500.
        max_depth: Maximum depth of each tree. 6 is a reasonable default.

    Returns:
        Dictionary with model, metrics, and feature importance.
    """
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=10,
        class_weight="balanced",
        random_state=42,
        n_jobs=-1,  # use all available CPU cores
    )
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)
    y_prob = rf.predict_proba(X_test)[:, 1]

    feature_importance = pd.DataFrame({
        "feature": feature_names,
        "importance": rf.feature_importances_,
    }).sort_values("importance", ascending=False)

    return {
        "model": rf,
        "auc": round(float(roc_auc_score(y_test, y_prob)), 4),
        "classification_report": classification_report(y_test, y_pred),
        "feature_importance": feature_importance,
        "y_pred": y_pred,
        "y_prob": y_prob,
    }
Feature Importance: What the Forest Tells You About Your Business
Feature importance (the .feature_importances_ attribute) tells you which features the forest relied on most when making predictions. This is independently valuable business intelligence.
In a churn model for Acme, feature importance might reveal:
- Payment failures are the single strongest predictor (2.3× more important than the next feature)
- Login frequency trend matters more than absolute login count
- NPS score has essentially no predictive power (despite being a widely tracked metric)
This kind of finding often surprises business stakeholders — and leads to productive conversations about which metrics actually predict outcomes.
Important caveat: Feature importance tells you what the model uses, not what causes the outcome. A feature that correlates with the true cause will show high importance even if it is not causal itself. Do not confuse predictive importance with causal importance.
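One way to sanity-check the built-in importances is scikit-learn's permutation_importance, which measures how much the held-out score drops when each feature's values are shuffled. A sketch on synthetic data (this complements, rather than replaces, the impurity-based importances above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```

Because it is computed on held-out data, permutation importance also reveals features the model relies on during training that do not actually help generalization. Either way, the causal caveat still applies.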
34.8 Cross-Validation: Honest Evaluation
A single train/test split leaves your evaluation vulnerable to the luck of how the data was divided. Cross-validation gives you a more reliable estimate.
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold


def compare_models_with_cross_validation(
    models: dict,
    X: pd.DataFrame,
    y: pd.Series,
    n_splits: int = 5,
    scoring: str = "roc_auc",
) -> pd.DataFrame:
    """
    Compare multiple models using stratified k-fold cross-validation.

    Args:
        models: Dictionary mapping model names to unfitted model or
            pipeline objects (cross_val_score fits a fresh clone per fold).
        X: Full feature DataFrame (before any split).
        y: Full target Series.
        n_splits: Number of cross-validation folds.
        scoring: scikit-learn scoring metric name.

    Returns:
        DataFrame with mean and standard deviation of CV scores per model.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    results = []
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
        results.append({
            "model": name,
            f"mean_{scoring}": round(scores.mean(), 4),
            f"std_{scoring}": round(scores.std(), 4),
            "n_folds": n_splits,
        })
        print(
            f"  {name:35s}: "
            f"{scores.mean():.4f} "
            f"(+/- {scores.std() * 2:.4f})"
        )
    return pd.DataFrame(results).sort_values(f"mean_{scoring}", ascending=False)
Five-fold cross-validation trains and evaluates the model five times on different 80%/20% splits of the data. The mean AUC across these five evaluations is a much more reliable estimate of the model's true performance than any single split.
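Stripped of the comparison loop, the underlying call is just cross_val_score with a StratifiedKFold splitter. A minimal standalone example on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(f"AUC per fold: {np.round(scores, 3)}")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

The spread across folds is as informative as the mean: a model whose fold scores swing widely is telling you its performance estimate is unstable.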
34.9 Feature Engineering: What Goes Into the Model
The quality of your features matters more than the choice of algorithm. A mediocre algorithm with excellent features will usually outperform an excellent algorithm with mediocre features.
Business Variables That Predict Customer Behavior
For a B2B customer churn model, useful features typically fall into these categories:
Engagement signals: Login frequency, feature usage depth, time since last activity, declining usage trend
Financial signals: Payment history, invoice amount trends, contract renewal timing, outstanding balance
Support signals: Support ticket volume, response time satisfaction, unresolved issues
Relationship signals: Account age, contract tier, number of users, implementation health score
Comparison signals: Changes over time (this month vs. last month, this quarter vs. same quarter last year)
The comparison signals — rate-of-change features — are often the most predictive. A customer who logged in 10 times last month but only 3 times this month is more at risk than one who consistently logs in 3 times per month. The trend matters more than the level.
def engineer_churn_features(
customers_df: pd.DataFrame,
sales_df: pd.DataFrame,
analysis_date: pd.Timestamp | None = None,
) -> pd.DataFrame:
"""
Create churn prediction features from customer and transaction data.
Args:
customers_df: Customer master data from acme_customers.xlsx.
sales_df: Sales transaction history from acme_sales_2023.csv.
analysis_date: Reference date for feature calculation. Defaults
to the maximum sale date in the sales data.
Returns:
DataFrame with one row per customer and engineered features ready
for model training.
"""
if analysis_date is None:
analysis_date = pd.to_datetime(sales_df["sale_date"].max())
sales_df = sales_df.copy()
sales_df["sale_date"] = pd.to_datetime(sales_df["sale_date"])
# Recent purchase windows
cutoff_90d = analysis_date - pd.Timedelta(days=90)
cutoff_180d = analysis_date - pd.Timedelta(days=180)
cutoff_365d = analysis_date - pd.Timedelta(days=365)
recent_90d = sales_df[sales_df["sale_date"] >= cutoff_90d]
recent_180d = sales_df[sales_df["sale_date"] >= cutoff_180d]
prior_90d = sales_df[
(sales_df["sale_date"] >= cutoff_180d) &
(sales_df["sale_date"] < cutoff_90d)
]
recent_365d = sales_df[sales_df["sale_date"] >= cutoff_365d]
def revenue_by_customer(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
return (
df.groupby("customer_id")["revenue"]
.sum()
.reset_index()
.rename(columns={"revenue": col_name})
)
revenue_90d = revenue_by_customer(recent_90d, "revenue_90d")
revenue_prior_90d = revenue_by_customer(prior_90d, "revenue_prior_90d")
revenue_365d = revenue_by_customer(recent_365d, "revenue_365d")
# Order count metrics
order_counts = (
recent_90d.groupby("customer_id")["order_id"]
.nunique()
.reset_index()
.rename(columns={"order_id": "orders_90d"})
)
# Last purchase date
last_purchase = (
sales_df.groupby("customer_id")["sale_date"]
.max()
.reset_index()
.rename(columns={"sale_date": "last_purchase_date"})
)
# Assemble features
features = customers_df.merge(revenue_90d, on="customer_id", how="left")
features = features.merge(revenue_prior_90d, on="customer_id", how="left")
features = features.merge(revenue_365d, on="customer_id", how="left")
features = features.merge(order_counts, on="customer_id", how="left")
features = features.merge(last_purchase, on="customer_id", how="left")
features["revenue_90d"] = features["revenue_90d"].fillna(0)
features["revenue_prior_90d"] = features["revenue_prior_90d"].fillna(0)
features["revenue_365d"] = features["revenue_365d"].fillna(0)
features["orders_90d"] = features["orders_90d"].fillna(0)
# Revenue trend (positive = growing, negative = declining)
features["revenue_trend_ratio"] = (
(features["revenue_90d"] - features["revenue_prior_90d"]) /
(features["revenue_prior_90d"] + 1)
)
    # Days since last purchase; customers with no purchases get a 999 sentinel
    features["days_since_last_purchase"] = (
        analysis_date - pd.to_datetime(features["last_purchase_date"])
    ).dt.days.fillna(999)
# Account age in days
if "first_order_date" in features.columns:
features["account_age_days"] = (
analysis_date - pd.to_datetime(features["first_order_date"])
).dt.days
return features
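The left-merge-then-fillna pattern in engineer_churn_features deserves a closer look: customers with no transactions in a window drop out of the groupby entirely, and the left merge restores them with NaN, which fillna(0) converts into the meaningful value "no recent revenue." A minimal sketch with invented toy data (the customer IDs and amounts are illustrative, not from Acme's files):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": ["C1", "C2", "C3"]})
sales_90d = pd.DataFrame({
    "customer_id": ["C1", "C1", "C3"],
    "revenue": [500.0, 250.0, 120.0],
})

# Aggregate recent revenue; C2 has no rows, so it vanishes from the groupby
revenue_90d = (
    sales_90d.groupby("customer_id")["revenue"]
    .sum()
    .reset_index()
    .rename(columns={"revenue": "revenue_90d"})
)

# Left merge restores C2 with NaN; fillna(0) records "no recent revenue"
features = customers.merge(revenue_90d, on="customer_id", how="left")
features["revenue_90d"] = features["revenue_90d"].fillna(0)
print(features)
```

An inner merge here would silently drop C2 — exactly the customers a churn model most needs to see.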
34.10 Practical Limits: When Not to Use ML
Machine learning is powerful but not universal. Before building a model, ask yourself these questions:
Do you have enough data? Logistic regression can work with a few hundred examples. A random forest benefits from thousands. Deep learning needs tens of thousands or more. If you have 80 customers and 12 of them churned, you do not have enough signal for a reliable ML model. Use simpler heuristics instead.
Is there a clear signal? If subject matter experts cannot identify any features that should predict the outcome, a model trained on those features will find noise rather than signal. The model's apparent accuracy on training data will not generalize.
Is the relationship stable over time? If the drivers of churn changed completely when you changed your product or pricing last year, last year's data will mislead the model. Models trained on historical data can become obsolete quickly in dynamic business environments.
Do you need to explain individual predictions? For regulated industries (credit, insurance, employment), you may be legally required to explain why a specific individual received a specific prediction. Some models (logistic regression, decision trees) support this; others (random forests, neural networks) do so only approximately.
Is the cost of errors understood and acceptable? If you cannot quantify the costs of false positives and false negatives, you cannot evaluate whether the model is actually useful in your business context.
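The last question can be made concrete with simple expected-value arithmetic. Suppose — the dollar figures here are purely illustrative — a false positive costs a $50 retention call that was not needed, while a false negative costs $2,000 in lost annual revenue. Then comparing models (or thresholds) reduces to comparing total expected cost:

```python
# Hypothetical error costs for a churn model (numbers are illustrative)
COST_FALSE_POSITIVE = 50      # an unnecessary retention call
COST_FALSE_NEGATIVE = 2_000   # a churned customer the model missed


def expected_error_cost(false_positives: int, false_negatives: int) -> int:
    """Total expected cost of a model's mistakes at one decision threshold."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# A "chatty" model with many false alarms can still be far cheaper
# than a quiet model that misses real churners
print(expected_error_cost(false_positives=40, false_negatives=2))   # 6000
print(expected_error_cost(false_positives=5, false_negatives=15))   # 30250
```

If you cannot fill in those two cost constants with defensible numbers, no accuracy metric can tell you whether the model is worth deploying.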
34.11 The Acme Churn Model: Putting It All Together
This is where the chapter's techniques converge. Priya has one clear business question: which of Acme's customers are most at risk of reducing or eliminating their purchases in the next 90 days?
Sandra Chen has been watching three accounts with warning signs and wants a systematic approach. "I don't want to find out about at-risk accounts when they call to cancel," she told Priya. "I want to call them first."
The full implementation is in code/acme_churn_predictor.py. Here is the narrative.
The Data
Priya joins acme_customers.xlsx (customer master data: account age, tier, region, account manager) with acme_sales_2023.csv (transaction history: revenue, order count, recency).
She engineers features for each customer:
- Revenue in the past 90 days vs. the prior 90 days (trend signal)
- Orders in the past 90 days vs. annualized rate (decline signal)
- Days since last purchase (recency signal)
- Account age in days (tenure signal)
- Total revenue in past 365 days (size/value signal)
- Revenue tier (segment signal)
For the target variable, she defines "at-risk" as any customer whose revenue in the past 90 days was less than 50% of their average quarterly revenue in the prior 12 months. This produces a binary label for 127 customers.
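The labeling rule can be sketched directly in pandas. This simplified version approximates "average quarterly revenue in the prior 12 months" as revenue_365d / 4, using the engineered columns from earlier in the chapter; the toy values below are invented, and the full script in code/acme_churn_predictor.py may exclude the most recent quarter from the average:

```python
import pandas as pd

# Toy engineered features (values are illustrative, not Acme data)
features = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "revenue_90d": [100.0, 900.0, 0.0],
    "revenue_365d": [4000.0, 3600.0, 1200.0],
})

# At-risk: past-90-day revenue under 50% of the average quarterly run rate
avg_quarterly = features["revenue_365d"] / 4
features["at_risk"] = (features["revenue_90d"] < 0.5 * avg_quarterly).astype(int)
print(features[["customer_id", "at_risk"]])
```

C1 and C3 are labeled at-risk (100 < 500 and 0 < 150); C2's 900 comfortably exceeds half its 900 quarterly average, so it is labeled 0.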
The Model Selection
Priya runs all three classifiers through 5-fold cross-validation:
Model Comparison (5-fold stratified CV, ROC AUC):
Logistic Regression: 0.8214 (+/- 0.0312)
Decision Tree (depth=4): 0.7681 (+/- 0.0489)
Random Forest (200 trees): 0.8537 (+/- 0.0244)
The random forest edges out logistic regression by about 0.03 AUC, with lower variance across folds. Priya selects the random forest for production but keeps the logistic regression model as an interpretability reference.
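A comparison table like the one above comes from a loop over the three classifiers. The sketch below uses scikit-learn's make_classification to stand in for Acme's engineered features, so the scores will not match Priya's numbers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered churn features
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree (depth=4)": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest (200 trees)": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Stratified folds keep the at-risk rate consistent across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Note that the logistic regression is wrapped in a pipeline with scaling, while the tree-based models are not — trees are insensitive to feature scale, so scaling them would add a step without changing the result.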
The Top 20 At-Risk Accounts
Priya scores all 847 active customers through the trained random forest and ranks them by predicted churn probability. The top 20 are exported to a CSV and loaded into the CRM:
# Score all active customers
churn_probabilities = rf_pipeline.predict_proba(X_all_customers)[:, 1]
customers_scored = customers_df.copy()
customers_scored["churn_probability"] = churn_probabilities
top_20_at_risk = (
customers_scored
.sort_values("churn_probability", ascending=False)
.head(20)[
["customer_id", "company_name", "account_manager",
"revenue_365d", "revenue_trend_ratio",
"days_since_last_purchase", "churn_probability"]
]
)
Sandra Chen receives the list on a Monday morning. By Wednesday, her team has called 14 of the 20. Two of those customers mention a pricing concern that was never escalated through normal channels. One mentions a product availability issue that connects directly to the supply chain work from Chapter 32.
The model did not save those customers. The account managers' calls saved them. The model is what made the calls happen.
Chapter Summary
Predictive models answer specific, individual-level questions that descriptive analytics cannot.
Linear regression quantifies relationships and produces interpretable coefficients that translate directly into business insights: marketing ROI, seasonal effects, the revenue impact of adding a salesperson.
Logistic regression predicts probabilities for binary outcomes with well-calibrated, interpretable results. It is almost always the right starting point for classification problems.
Decision trees are interpretable non-linear classifiers that produce auditable if-then rules. Excellent for explanation; usually outperformed by ensembles for accuracy.
Random forests aggregate hundreds of trees for higher accuracy and stability. Feature importance is a valuable bonus output.
The scikit-learn workflow — split, pipeline, fit, predict, evaluate — is the same for all of them. Learning it once means you can apply it everywhere.
The metrics — R², MAE, RMSE, AUC, precision, recall, F1 — only matter when you translate them into business terms. What does this error rate mean for our customers? What does this recall rate mean for our sales team? That translation is your job.
Key Terms
| Term | Definition |
|---|---|
| Regression | Predicting a continuous numerical outcome |
| Classification | Predicting which category an observation belongs to |
| Train/Test Split | Dividing data into training (model fitting) and test (unbiased evaluation) sets |
| Cross-Validation | Evaluating model performance by training and testing on multiple data subsets |
| R² | Fraction of outcome variance explained by the model (0 = no better than baseline, 1 = perfect) |
| MAE | Mean Absolute Error — average magnitude of prediction errors |
| RMSE | Root Mean Squared Error — like MAE but penalizes large errors more |
| ROC AUC | Area under the ROC curve; measures classifier discrimination ability regardless of threshold |
| Precision | Of all positive predictions, what fraction were correct? |
| Recall | Of all actual positives, what fraction did the model find? |
| F1 Score | Harmonic mean of precision and recall |
| Confusion Matrix | 2x2 table of TP, FP, TN, FN for a binary classifier |
| Feature Importance | A model's assessment of how much each feature contributes to predictions |
| Overfitting | When a model learns the training data too specifically and fails to generalize |
| Regularization | Adding a penalty for model complexity to reduce overfitting |
| Pipeline | scikit-learn object chaining preprocessing and model steps |