Chapter 11: Model Evaluation and Selection

"If you can't translate your model's performance into a business impact statement, you haven't finished evaluating it."

— Professor Diane Okonkwo, MBA 7620: AI for Business Strategy


The 92% Accuracy Trap

Tom Kowalski is having a good day. He has spent the past two weeks building a customer churn prediction model for the Athena Retail Group case — applying the techniques from Chapter 7, engineering features from transaction histories and customer service logs, and tuning a gradient boosting classifier until the numbers look impressive. He stands at the front of the lecture hall, laptop connected to the projector, and delivers the headline with the quiet confidence of someone who knows he has done good work.

"Ninety-two percent accuracy."

He lets the number sit. A few classmates nod appreciatively. NK Adeyemi, three rows back, writes it down.

Professor Okonkwo does not nod. She stands to the side, arms folded, with an expression that Tom has learned to recognize over the past ten weeks — the expression that precedes a question designed to dismantle whatever you just said.

"That's a strong number, Tom. Let me ask you something. What's the base rate?"

Tom pauses. "The base rate?"

"What percentage of Athena's customers churn in any given quarter?"

Tom pulls up his data summary. "About five percent."

The room is quiet for a moment. Then NK's hand goes up slowly, as if she is working something out in real time.

"Wait," she says. "If only five percent of customers churn, then a model that just predicts 'no churn' for everyone — a model that does literally nothing — would be ninety-five percent accurate."

Tom stares at his screen. The color drains from his confidence.

"So your model," NK continues, with the particular satisfaction of someone who spent six weeks feeling inferior to Tom's technical skills, "is actually worse than guessing?"

"It's not that simple," Tom starts, but Professor Okonkwo is already at the whiteboard.

"Actually, NK has identified one of the most important lessons in applied machine learning. Accuracy, by itself, is almost never the right metric. It is seductive because it is simple — one number, easy to report, easy to compare. But in the real world, where class distributions are imbalanced and different types of errors carry different costs, accuracy can be actively misleading. It can make a terrible model look good and a good model look mediocre."

She writes on the whiteboard: ACCURACY IS A VANITY METRIC.

"Today," she says, "we fix that."


Why Evaluation Matters

Model evaluation is the discipline of answering a deceptively simple question: How good is this model? The deception lies in the word "good." Good for what? Good by whose standards? Good compared to what alternative?

In Chapter 6, we introduced the distinction between model metrics (the numbers your algorithm optimizes) and success metrics (the business outcomes you actually care about). In Chapter 7, we built a churn classifier and reported its accuracy. In Chapter 8, we built a demand forecaster and reported its R-squared. Those numbers were starting points, not destinations. This chapter is about the destination.

Model evaluation matters for three reasons that every business leader must internalize:

1. Different Errors Have Different Costs

When your spam filter incorrectly flags a legitimate email as spam (a false positive), you might miss a message from a colleague. Annoying, but manageable. When your spam filter lets a phishing email through to your CEO's inbox (a false negative), the company might lose millions in a wire fraud attack. Same model, same "accuracy" — radically different business consequences.

When a fraud detection system flags a legitimate transaction (false positive), the customer is inconvenienced and may call your support center. When the system fails to catch actual fraud (false negative), the bank absorbs the financial loss and the customer loses trust. A model with 99% accuracy might still be catastrophically expensive if its errors cluster in the wrong category.

Business Insight: The first question a business leader should ask about any model is not "How accurate is it?" but "What happens when it's wrong — and which kind of wrong is more expensive?" The answer to that question determines which evaluation metrics matter.

2. The Best Model Is Not Always the Best Business Decision

A deep neural network might achieve the highest predictive score on your test set, but if it takes three seconds to return a prediction (too slow for real-time bidding), requires a GPU cluster to run (too expensive for your budget), and produces outputs that no one on your operations team can explain to a customer ("Why was my loan denied?"), then it may not be the best business choice.

Model selection is a multi-dimensional optimization problem. Predictive performance is one dimension. Latency, cost, interpretability, fairness, maintainability, and organizational trust are others. The "best" model depends on the weights you assign to each dimension — and those weights are business decisions, not statistical ones.

3. Offline Metrics and Online Performance Diverge

Your model performed beautifully on your test set. Then you deployed it and it performed terribly. This is not an edge case — it is the norm. Test sets are static snapshots of the past. The real world is dynamic, adversarial, and full of distribution shifts. A model evaluated only on a test set is like a pilot evaluated only in a simulator. The simulator is necessary, but it is not sufficient.

This chapter covers both offline evaluation (the metrics you compute before deployment) and online evaluation (the A/B tests and monitoring you run after deployment). Both are essential. Neither alone is adequate.


The Confusion Matrix: A Business Accounting System

Every evaluation metric for classification models derives from a single, elegant structure: the confusion matrix. If you understand the confusion matrix — truly understand it, in business terms — every other metric in this chapter will feel like a natural consequence.

The Four Outcomes

When a binary classification model makes a prediction, exactly one of four things happens:

|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

Definition: A confusion matrix is a table that summarizes the performance of a classification model by comparing its predictions against the actual outcomes. For a binary classifier, it produces four counts: true positives (correctly identified positives), false positives (incorrectly flagged as positive), true negatives (correctly identified negatives), and false negatives (missed positives).

Let us make this concrete with three business scenarios that will recur throughout the chapter:

Scenario 1: Churn Prediction (Athena Retail Group)

  • True Positive: Model predicts customer will churn, and she does. Athena sends a retention offer. Good outcome — the offer costs $20 but saves a customer worth $500/year.
  • False Positive: Model predicts customer will churn, but she was never going to leave. Athena sends a $20 retention offer to someone who would have stayed anyway. Wasted money, but not catastrophic.
  • True Negative: Model predicts customer will stay, and she does. No action needed. The most common and least interesting outcome.
  • False Negative: Model predicts customer will stay, but she churns. Athena takes no action. The customer leaves. Cost: $500 in lost annual revenue.

Scenario 2: Fraud Detection (Financial Services)

  • True Positive: Model flags a fraudulent transaction. The bank blocks it. Loss prevented.
  • False Positive: Model flags a legitimate transaction. The customer's card is declined. Customer calls in frustration. Cost: customer service time ($15) plus potential customer dissatisfaction.
  • True Negative: Model approves a legitimate transaction. Normal operations.
  • False Negative: Model approves a fraudulent transaction. The bank absorbs the loss. Cost: the full transaction amount, plus investigation costs, plus potential regulatory penalties.

Scenario 3: Medical Screening

  • True Positive: Model identifies a disease. Patient receives treatment early.
  • False Positive: Model says a healthy patient has the disease. Patient undergoes unnecessary additional tests and experiences anxiety.
  • True Negative: Model correctly identifies a healthy patient. No unnecessary intervention.
  • False Negative: Model misses a disease. Patient goes untreated. Potentially fatal.

NK is staring at the whiteboard, where Professor Okonkwo has drawn all three scenarios.

"It's an accounting system," NK says. "Each cell in the matrix has a dollar value — or a human value. You're not just counting right and wrong. You're counting the cost of right and wrong."

Okonkwo smiles. "Now you can explain this to your CMO."

NK writes in her notebook: Confusion matrix = error accounting system. Each cell has a price tag.

Business Insight: The confusion matrix is not a statistical abstraction. It is a business accounting framework. When you assign a dollar value to each cell — the cost of a false positive, the value of a true positive, the loss from a false negative — you transform model evaluation from a technical exercise into a financial analysis. This is what separates data scientists who build models from business leaders who deploy them.

Visualizing the Confusion Matrix in Python

Let us build a confusion matrix from Tom's churn model, using the tools from Chapter 7:

import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Simulated predictions from Tom's churn model
# In practice, these come from model.predict(X_test)
y_true = np.array([0]*950 + [1]*50)  # 5% churn rate, 1000 customers
y_pred = np.array([0]*900 + [1]*50 + [0]*30 + [1]*20)  # Model predictions

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTrue Negatives:  {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives:  {cm[1,1]}")

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=["Stay", "Churn"])
disp.plot(ax=ax, cmap="Blues", values_format="d")
ax.set_title("Athena Churn Model — Confusion Matrix", fontsize=14)
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

Code Explanation: We create a confusion matrix from the model's predictions (y_pred) versus the actual outcomes (y_true). The ConfusionMatrixDisplay from scikit-learn produces a clean heatmap visualization. The key insight: of the 50 actual churners, the model caught 20 (true positives) and missed 30 (false negatives). Of the 950 non-churners, the model correctly identified 900 but incorrectly flagged 50 as churners (false positives).
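As a sanity check, Tom's headline number falls straight out of these four counts. A small sketch recomputing accuracy from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0]*950 + [1]*50)
y_pred = np.array([0]*900 + [1]*50 + [0]*30 + [1]*20)

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # scikit-learn's cell order for labels [0, 1]

accuracy = (tp + tn) / cm.sum()
print(f"Accuracy: {accuracy:.1%}")   # 92.0% -- Tom's headline number
```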

Now let us add the dollar values:

# Business cost matrix for Athena churn model
cost_tp = -20      # Cost of retention offer (invested to save customer)
value_tp = 500     # Revenue saved from retained customer
cost_fp = -20      # Wasted retention offer
cost_fn = -500     # Lost customer revenue
cost_tn = 0        # No action, no cost

net_tp = value_tp + cost_tp  # $480 net value per true positive
net_fp = cost_fp              # -$20 per false positive
net_fn = cost_fn              # -$500 per false negative
net_tn = cost_tn              # $0 per true negative

# Calculate total business impact
total_impact = (cm[1,1] * net_tp +   # True positives
                cm[0,1] * net_fp +    # False positives
                cm[1,0] * net_fn +    # False negatives
                cm[0,0] * net_tn)     # True negatives

print(f"True Positives:  {cm[1,1]:>4} x ${net_tp:>6} = ${cm[1,1] * net_tp:>10,}")
print(f"False Positives: {cm[0,1]:>4} x ${net_fp:>6} = ${cm[0,1] * net_fp:>10,}")
print(f"False Negatives: {cm[1,0]:>4} x ${net_fn:>6} = ${cm[1,0] * net_fn:>10,}")
print(f"True Negatives:  {cm[0,0]:>4} x ${net_tn:>6} = ${cm[0,0] * net_tn:>10,}")
print(f"{'':>40}----------")
print(f"Total Business Impact:{total_impact:>19,}")

Code Explanation: By assigning dollar values to each cell of the confusion matrix, we transform the abstract counts into a concrete business impact statement. Each true positive represents a customer saved ($480 net after the retention offer). Each false negative represents a lost customer ($500). Each false positive is a wasted $20 offer. This is the "Business Translation Test" that Ravi will later formalize at Athena.


Precision and Recall: The Fundamental Tradeoff

Accuracy tells you the overall percentage of correct predictions. But as Tom's churn model demonstrated, overall correctness can mask critical failures. We need metrics that focus specifically on the model's behavior with respect to the positive class — the class we actually care about.

Precision: "When the Model Says Yes, How Often Is It Right?"

Precision answers the question: Of all the instances the model predicted as positive, what fraction were actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

In Tom's churn model: Precision = 20 / (20 + 50) = 0.286, or 28.6%.

This means that when the model predicts a customer will churn, it is right only 28.6% of the time. The other 71.4% of "churn" predictions are false alarms.

When precision matters most: When the cost of acting on a false positive is high. If your retention offer costs $500 per customer (a premium loyalty package, say, rather than a $20 coupon), you cannot afford to send it to five non-churners for every two actual churners — the ratio implied by 28.6% precision. Spam email filtering is another precision-critical application — marking a legitimate email as spam (false positive) is more disruptive than letting an occasional spam email through (false negative).

Recall: "Of All the Actual Positives, How Many Did the Model Catch?"

Recall (also called sensitivity or true positive rate) answers the question: Of all the instances that were actually positive, what fraction did the model correctly identify?

$$\text{Recall} = \frac{TP}{TP + FN}$$

In Tom's churn model: Recall = 20 / (20 + 30) = 0.40, or 40%.

The model catches only 40% of actual churners. Sixty percent of customers who are going to leave slip through undetected. For a business trying to proactively save customers, this is a serious problem.

When recall matters most: When the cost of missing a positive is high. Fraud detection demands high recall — every fraudulent transaction the model misses is money lost. Medical screening demands high recall — missing a cancer diagnosis can be fatal. In Athena's churn case, each missed churner represents $500 in lost revenue, making recall the more business-critical metric.

Definition: Precision measures the accuracy of positive predictions (how many predicted positives are correct). Recall measures the completeness of positive detection (how many actual positives are found). They represent fundamentally different priorities: precision protects you from acting on false alarms; recall protects you from missing real events.
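Both definitions can be checked directly against Tom's confusion matrix counts (TP = 20, FP = 50, FN = 30 from the earlier example):

```python
tp, fp, fn = 20, 50, 30   # counts from Tom's churn model above

precision = tp / (tp + fp)   # 20 / 70
recall    = tp / (tp + fn)   # 20 / 50

print(f"Precision: {precision:.1%}")  # 28.6%
print(f"Recall:    {recall:.1%}")     # 40.0%
```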

The Tradeoff

Precision and recall are in tension. You can almost always increase one at the expense of the other by adjusting the model's classification threshold.

Consider a model that outputs a probability score between 0 and 1 for each customer. The default threshold is 0.5: predict "churn" if the score exceeds 0.5, "stay" otherwise. But you can move that threshold:

  • Lower the threshold (e.g., 0.3): More customers are flagged as potential churners. You catch more actual churners (recall goes up), but you also flag more non-churners (precision goes down).
  • Raise the threshold (e.g., 0.7): Fewer customers are flagged. The ones you do flag are more likely to be genuine churners (precision goes up), but you miss more actual churners (recall goes down).
We can watch this tradeoff play out by sweeping the threshold in code:

from sklearn.metrics import precision_score, recall_score
import numpy as np

# Simulated probability scores from a churn model
np.random.seed(42)
y_true = np.array([0]*950 + [1]*50)   # 1,000 customers, 5% churn rate

# Higher scores for actual churners, with overlap
scores_negative = np.random.beta(2, 5, 950)   # Non-churners: skewed low
scores_positive = np.random.beta(5, 2, 50)     # Churners: skewed high
scores = np.concatenate([scores_negative, scores_positive])

# Evaluate at different thresholds
thresholds = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>10} {'Flagged':>10}")
print("-" * 45)

for t in thresholds:
    y_pred = (scores >= t).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    flagged = y_pred.sum()
    print(f"{t:>10.1f} {p:>10.3f} {r:>10.3f} {flagged:>10}")

Code Explanation: By sweeping the classification threshold from 0.2 to 0.8, we see precision and recall move in opposite directions. At a low threshold (0.2), recall is high — we catch most churners — but precision is low because we also flag many non-churners. At a high threshold (0.8), precision is high — the few customers we flag are very likely to churn — but recall drops because we miss churners whose scores fall below 0.8. The "right" threshold depends on the business context.

Business Insight: Choosing between precision and recall is not a statistical question — it is a business strategy question. A company launching a new loyalty program with a limited budget needs high precision (target the right customers). A company with a high customer lifetime value and low retention costs needs high recall (miss no one who might leave). The model does not make this decision. The business leader does.


F1 Score and F-Beta: Finding the Balance

When you need a single number that balances precision and recall, the F1 score is the standard choice. It is the harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Definition: The F1 score is the harmonic mean of precision and recall, ranging from 0 to 1. It penalizes extreme imbalance — a model with 100% precision but 0% recall (or vice versa) scores 0. Only a model with both high precision and high recall achieves a high F1.

Why the harmonic mean rather than the arithmetic mean? Because the harmonic mean punishes imbalance more severely. A model with 100% precision and 1% recall has an arithmetic mean of 50.5% — which sounds reasonable — but an F1 of 0.02 — which correctly signals that the model is useless.

For Tom's churn model: F1 = 2 x (20/70 x 20/50) / (20/70 + 20/50) = 0.333. Not good.
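The difference between the two means is easy to see numerically. A quick sketch, using Tom's precision and recall plus the degenerate case above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Tom's model: works out to exactly 1/3
print(f"F1 (Tom's model): {f1(20/70, 20/50):.3f}")

# Degenerate case: perfect precision, near-zero recall
p, r = 1.0, 0.01
print(f"Arithmetic mean: {(p + r) / 2:.3f}")   # 0.505 -- looks acceptable
print(f"Harmonic mean:   {f1(p, r):.3f}")      # 0.020 -- correctly signals failure
```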

When F1 Isn't Enough: The F-Beta Score

The F1 score weights precision and recall equally. But in many business contexts, you care about one more than the other. The F-beta score generalizes F1 by introducing a parameter beta that controls the weighting:

$$F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}$$

  • Beta = 1: F1 score. Equal weight.
  • Beta = 2: F2 score. Recall is weighted twice as heavily as precision. Use this when missing positives is expensive (fraud detection, medical screening, churn prevention with high customer lifetime values).
  • Beta = 0.5: F0.5 score. Precision is weighted twice as heavily as recall. Use this when false positives are expensive (spam filtering, content moderation, high-cost interventions).
Computing all three variants with scikit-learn:

from sklearn.metrics import fbeta_score

# Toy example (not Tom's full model): 5 actual positives, 5 actual negatives
y_true_example = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

f1  = fbeta_score(y_true_example, y_pred_example, beta=1.0)
f2  = fbeta_score(y_true_example, y_pred_example, beta=2.0)
f05 = fbeta_score(y_true_example, y_pred_example, beta=0.5)

print(f"F1  (balanced):         {f1:.3f}")
print(f"F2  (recall-focused):   {f2:.3f}")
print(f"F0.5 (precision-focused): {f05:.3f}")

Business Insight: Choosing the right beta value is a business decision, not a technical one. Ask: "If I had to choose between catching one more true positive and avoiding one more false positive, which would I pick?" If you would always pick catching the true positive, use F2. If you would always pick avoiding the false positive, use F0.5. If you genuinely do not know, F1 is a reasonable default — but "I don't know" is a signal that you need to understand the cost structure better, not a signal that F1 is correct.


ROC Curves and AUC: The Tradeoff Visualized

The tradeoff between catching more positives and raising more false alarms plays out across every threshold, and it can be traced as a curve. The most widely used such visualization is the Receiver Operating Characteristic (ROC) curve.

How the ROC Curve Works

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at every possible classification threshold:

  • True Positive Rate (TPR): Same as recall. TP / (TP + FN). Of all actual positives, how many did we catch?
  • False Positive Rate (FPR): FP / (FP + TN). Of all actual negatives, how many did we incorrectly flag?

As you lower the threshold, both TPR and FPR increase (you catch more of everything). The ROC curve traces this relationship.

Definition: The ROC curve (Receiver Operating Characteristic curve) visualizes a classifier's performance across all possible thresholds by plotting true positive rate against false positive rate. The AUC (Area Under the Curve) summarizes the ROC curve as a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing.
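At any single threshold, the two rates can be read straight off Tom's confusion matrix (TP = 20, FN = 30, FP = 50, TN = 900):

```python
tp, fn, fp, tn = 20, 30, 50, 900   # Tom's churn model at its default threshold

tpr = tp / (tp + fn)   # 20 / 50  -> 0.400 (same as recall)
fpr = fp / (fp + tn)   # 50 / 950 -> 0.053

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.3f}")
```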

Interpreting AUC Values

| AUC Range   | Interpretation | Business Implication                                 |
|-------------|----------------|------------------------------------------------------|
| 0.90 - 1.00 | Excellent      | Strong signal; likely deployable after cost analysis |
| 0.80 - 0.90 | Good           | Useful for most business applications                |
| 0.70 - 0.80 | Fair           | May be useful depending on cost structure            |
| 0.60 - 0.70 | Poor           | Marginal; typically not worth deploying              |
| 0.50 - 0.60 | Fail           | No better than random guessing                       |

Caution

These ranges are rules of thumb, not commandments. An AUC of 0.75 in a domain where the best-known models achieve 0.76 is impressive. An AUC of 0.85 in a domain where 0.95 is achievable is underwhelming. Always compare your model's AUC against a relevant baseline, not against an arbitrary scale.

Plotting ROC Curves in Python

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np

# Three competing models for Athena's churn prediction
np.random.seed(42)
n = 1000
y_true = np.array([0]*950 + [1]*50)

# Model A: Complex gradient boosting (high AUC)
scores_a_neg = np.random.beta(2, 8, 950)
scores_a_pos = np.random.beta(8, 2, 50)
scores_a = np.concatenate([scores_a_neg, scores_a_pos])

# Model B: Logistic regression (slightly lower AUC)
scores_b_neg = np.random.beta(2, 6, 950)
scores_b_pos = np.random.beta(6, 2, 50)
scores_b = np.concatenate([scores_b_neg, scores_b_pos])

# Model C: Simple decision tree (lower AUC)
scores_c_neg = np.random.beta(2, 4, 950)
scores_c_pos = np.random.beta(4, 2, 50)
scores_c = np.concatenate([scores_c_neg, scores_c_pos])

fig, ax = plt.subplots(figsize=(8, 8))

for scores, label, color in [
    (scores_a, "Model A: Gradient Boosting", "navy"),
    (scores_b, "Model B: Logistic Regression", "darkorange"),
    (scores_c, "Model C: Decision Tree", "forestgreen"),
]:
    fpr, tpr, _ = roc_curve(y_true, scores)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=color, lw=2,
            label=f"{label} (AUC = {roc_auc:.3f})")

ax.plot([0, 1], [0, 1], "k--", lw=1, label="Random Guess (AUC = 0.500)")
ax.set_xlabel("False Positive Rate", fontsize=12)
ax.set_ylabel("True Positive Rate (Recall)", fontsize=12)
ax.set_title("ROC Curves — Athena Churn Models", fontsize=14)
ax.legend(loc="lower right", fontsize=11)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.02])
plt.tight_layout()
plt.savefig("roc_curves.png", dpi=150)
plt.show()

Code Explanation: We generate simulated probability scores for three models and plot their ROC curves. Model A (gradient boosting) hugs the upper-left corner most closely, achieving the highest AUC. Model B (logistic regression) is slightly lower but still strong. Model C (decision tree) is the weakest. The diagonal dashed line represents a model with no discriminative power — random guessing. Any useful model should curve above this line.

Comparing Models with ROC Curves

The ROC curve is powerful precisely because it separates model evaluation from threshold selection. Two models can be compared by their AUC without committing to a specific threshold. The model with the higher AUC has better discriminative power across all possible operating points.

However, AUC has an important limitation that Tom discovers during the Athena case analysis.


Precision-Recall Curves: When ROC Lies

"Professor," Tom says, scrolling through his results. "Model A has the highest AUC, but when I look at the precision at different recall levels, it's not as impressive. Model B actually has better precision at the recall levels we care about."

"Why might that be?" Okonkwo asks.

Tom works through it. "Because the classes are imbalanced. Only five percent of customers churn. The ROC curve uses the false positive rate, which is dominated by the large negative class. Even a lot of false positives barely move the FPR when there are 950 negatives."

"Exactly. When would you prefer a precision-recall curve?"

"Whenever the positive class is rare and the business cost concentrates on the positive class."

Caution

ROC curves can be misleadingly optimistic when classes are heavily imbalanced. A model can achieve a high AUC while having mediocre precision at the recall levels you actually need. In domains like fraud detection (0.1% fraud rate), disease screening (1% prevalence), or churn prediction (5% churn rate), precision-recall curves often provide a more honest picture of model performance.
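This divergence is easy to reproduce. A small synthetic sketch (illustrative score distributions, not Athena's data) with a 1% positive rate shows AUC and Average Precision telling different stories about the same scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 9900 + [1] * 100)          # 1% prevalence

# Overlapping score distributions: good but imperfect separation
scores = np.concatenate([rng.beta(2, 5, 9900),      # negatives skew low
                         rng.beta(5, 2, 100)])      # positives skew high

auc_val = roc_auc_score(y_true, scores)
ap_val  = average_precision_score(y_true, scores)
print(f"ROC AUC:           {auc_val:.3f}")   # looks strong
print(f"Average Precision: {ap_val:.3f}")    # far less flattering
```

The false positive rate's denominator (9,900 negatives) absorbs large numbers of false alarms, while precision's denominator does not — which is exactly the caution above.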

The Precision-Recall Curve

The precision-recall (PR) curve plots precision against recall at every threshold. Unlike the ROC curve, where the baseline (random guess) is a diagonal line, the baseline for a PR curve is a horizontal line at the prevalence of the positive class (e.g., 0.05 for 5% churn).

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))

for scores, label, color in [
    (scores_a, "Model A: Gradient Boosting", "navy"),
    (scores_b, "Model B: Logistic Regression", "darkorange"),
    (scores_c, "Model C: Decision Tree", "forestgreen"),
]:
    precision_vals, recall_vals, _ = precision_recall_curve(y_true, scores)
    ap = average_precision_score(y_true, scores)
    ax.plot(recall_vals, precision_vals, color=color, lw=2,
            label=f"{label} (AP = {ap:.3f})")

# Baseline: prevalence of positive class
prevalence = y_true.mean()
ax.axhline(y=prevalence, color="gray", linestyle="--", lw=1,
           label=f"Random Baseline ({prevalence:.3f})")

ax.set_xlabel("Recall", fontsize=12)
ax.set_ylabel("Precision", fontsize=12)
ax.set_title("Precision-Recall Curves — Athena Churn Models", fontsize=14)
ax.legend(loc="upper right", fontsize=11)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.02])
plt.tight_layout()
plt.savefig("pr_curves.png", dpi=150)
plt.show()

Code Explanation: The precision-recall curve tells a different story from the ROC curve. Here, the gaps between models are often more visible, especially at the high-recall operating points that matter for churn prevention. Average Precision (AP) is the summary statistic for PR curves, analogous to AUC for ROC curves. For imbalanced datasets, AP is often a more informative single-number summary than AUC.

Business Insight: When presenting model performance to stakeholders, ask yourself: "What does my audience care about?" If they care about the model's ability to rank-order predictions (which customer is most likely to churn?), the ROC curve is appropriate. If they care about the model's actionable precision (of the customers we contact, how many actually needed intervention?), the PR curve is more relevant. Use the visualization that answers the business question.


Cost-Sensitive Evaluation: Putting Dollars on the Matrix

NK has been unusually quiet for the past twenty minutes — always a sign that she is building toward something. When she speaks, it is with the energy of someone who has just connected a concept to something she actually does for a living.

"Professor, I spent six years in marketing. I know how to calculate customer lifetime value, cost per acquisition, campaign ROI. This confusion matrix — it's just a campaign performance framework with different labels. I can build a business case out of this."

"Show us," Okonkwo says.

NK stands up and walks to the whiteboard. She draws a cost matrix:

|                 | Predicted Churn          | Predicted Stay       |
|-----------------|--------------------------|----------------------|
| Actually Churns | Save customer: +$480 net | Lose customer: -$500 |
| Actually Stays  | Wasted offer: -$20       | No action: $0        |

"Every model gives us a confusion matrix. Every confusion matrix, multiplied by this cost matrix, gives us an expected profit. The model with the highest expected profit wins. Period."

Tom objects: "But what about AUC? What about—"

"AUC tells you which model discriminates best," NK interrupts. "Expected profit tells you which model makes the most money. I know which one my CMO wants to hear about."

Professor Okonkwo nods. "NK has just independently derived what the literature calls cost-sensitive evaluation. And she is right — for most business applications, the expected profit framework is the most actionable evaluation approach."

Expected Profit Calculation

The expected profit of a model at a given threshold is:

$$\text{Expected Profit} = (TP \times V_{TP}) + (FP \times V_{FP}) + (FN \times V_{FN}) + (TN \times V_{TN})$$

where $V_{TP}$, $V_{FP}$, $V_{FN}$, and $V_{TN}$ are the business values (positive or negative) associated with each outcome.

import numpy as np
from sklearn.metrics import confusion_matrix

def expected_profit(y_true, y_scores, threshold, cost_matrix):
    """
    Calculate expected profit at a given threshold.

    Parameters
    ----------
    y_true : array-like
        True labels (0 or 1).
    y_scores : array-like
        Predicted probability scores.
    threshold : float
        Classification threshold.
    cost_matrix : dict
        Keys: 'tp_value', 'fp_value', 'fn_value', 'tn_value'.

    Returns
    -------
    float
        Total expected profit.
    """
    y_pred = (np.array(y_scores) >= threshold).astype(int)
    cm = confusion_matrix(np.array(y_true), y_pred)
    tn, fp, fn, tp = cm.ravel()

    profit = (tp * cost_matrix['tp_value'] +
              fp * cost_matrix['fp_value'] +
              fn * cost_matrix['fn_value'] +
              tn * cost_matrix['tn_value'])
    return profit

# Athena cost matrix
athena_costs = {
    'tp_value': 480,    # Save a churner: $500 revenue - $20 offer
    'fp_value': -20,    # Wasted offer on non-churner
    'fn_value': -500,   # Lost customer
    'tn_value': 0       # No action, no cost
}

# Find optimal threshold for Model B (logistic regression)
thresholds = np.arange(0.05, 0.95, 0.01)
profits = [expected_profit(y_true, scores_b, t, athena_costs)
           for t in thresholds]

optimal_idx = np.argmax(profits)
optimal_threshold = thresholds[optimal_idx]
max_profit = profits[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"Maximum expected profit: ${max_profit:,.0f}")

Code Explanation: The expected_profit function multiplies each cell of the confusion matrix by its business value. By sweeping across all possible thresholds, we find the threshold that maximizes total profit. This threshold is almost never 0.5 — it depends entirely on the cost structure. When false negatives are expensive (as in churn prevention), the optimal threshold is typically lower than 0.5, because the business would rather accept more false positives (wasted offers) than miss real churners.
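Because the optimal threshold is a function of the cost structure, it is worth watching it move. A self-contained sketch with ten hand-picked scores and two illustrative cost structures (all numbers here are invented for the demonstration, not Athena's):

```python
import numpy as np

y_true = np.array([0]*8 + [1]*2)
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.75, 0.85,  # non-churners
                   0.55, 0.90])                                      # churners

def best_threshold(costs):
    """Sweep a threshold grid; return the profit-maximizing threshold."""
    grid = [i / 100 for i in range(5, 100, 5)]
    profits = []
    for t in grid:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        profits.append(tp * costs['tp'] + fp * costs['fp'] + fn * costs['fn'])
    return grid[int(np.argmax(profits))]

# Expensive false negatives (lost customers): flag aggressively, lower threshold
print(best_threshold({'tp': 480, 'fp': -20, 'fn': -500}))   # 0.55
# Expensive false positives (costly intervention): flag cautiously, higher threshold
print(best_threshold({'tp': 480, 'fp': -300, 'fn': -20}))   # 0.9
```

Same model, same scores — only the cost matrix changed, and the profit-maximizing operating point moved.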

The Profit Curve

Plotting expected profit against threshold produces a profit curve — one of the most underused and most powerful evaluation visualizations in applied ML.

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 6))

for scores, label, color in [
    (scores_a, "Model A: Gradient Boosting", "navy"),
    (scores_b, "Model B: Logistic Regression", "darkorange"),
    (scores_c, "Model C: Decision Tree", "forestgreen"),
]:
    profits = [expected_profit(y_true, scores, t, athena_costs)
               for t in thresholds]
    ax.plot(thresholds, profits, color=color, lw=2, label=label)

    best_idx = np.argmax(profits)
    ax.scatter(thresholds[best_idx], profits[best_idx],
               color=color, s=100, zorder=5, edgecolors='black')

ax.set_xlabel("Classification Threshold", fontsize=12)
ax.set_ylabel("Expected Profit ($)", fontsize=12)
ax.set_title("Profit Curves — Which Model Makes the Most Money?", fontsize=14)
ax.legend(fontsize=11)
ax.axhline(y=0, color="gray", linestyle="--", lw=0.8)
plt.tight_layout()
plt.savefig("profit_curves.png", dpi=150)
plt.show()

Code Explanation: The profit curve reveals something that AUC alone cannot: Model B (logistic regression), despite having a lower AUC than Model A (gradient boosting), may deliver higher maximum expected profit at its optimal threshold. This happens because the cost structure penalizes false negatives more than false positives, and Model B may have a better precision-recall tradeoff in the relevant operating region. The star on each curve marks the optimal threshold for that model.

Athena Update: When Ravi Mehta sees the profit curve analysis, he makes it a requirement for all model evaluations at Athena. "I don't care about your AUC," he tells the data science team. "Show me the profit curve. Show me how much money each model makes at its best threshold. That's the number that goes to the executive team." This becomes a cornerstone of Athena's Model Evaluation Board process, which we will see formalized later in this chapter.


Regression Metrics: Measuring Continuous Predictions

Not all models produce binary predictions. The demand forecaster from Chapter 8 predicts a continuous number — units of inventory to stock, revenue for next quarter, customer spend for the next year. For regression models, we need a different set of evaluation metrics.

R-Squared (Coefficient of Determination)

R-squared measures the proportion of variance in the target variable that the model explains:

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

An R-squared of 0.85 means the model explains 85% of the variance in the target. The remaining 15% is unexplained — due to noise, missing features, or model limitations.

Caution

R-squared can be misleading. It always increases (or stays the same) when you add more features, even irrelevant ones. Adjusted R-squared accounts for this by penalizing model complexity. Additionally, R-squared does not tell you whether the errors are large enough to matter for your business. A model might explain 95% of variance but still produce errors of plus or minus $50,000 on a revenue forecast — which could be either acceptable or catastrophic depending on your context.
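Adjusted R-squared is a direct function of R-squared, the sample size n, and the number of features p. A minimal sketch (the helper name and the simulated data are ours, purely for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R-squared: discounts R2 for the number of features used."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Illustrative example: predictions from a hypothetical 10-feature model
rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)

print(f"R-squared:          {r2_score(y_true, y_pred):.4f}")
print(f"Adjusted R-squared: {adjusted_r2(y_true, y_pred, n_features=10):.4f}")
```

The adjusted value is always at or below the raw value, and the gap widens as you add features without adding explanatory power — which is exactly the behavior the caution above warns about.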

Mean Absolute Error (MAE)

MAE is the average of the absolute differences between predictions and actual values:

$$\text{MAE} = \frac{1}{n}\sum|y_i - \hat{y}_i|$$

MAE is interpretable in the same units as the target variable. If you are predicting daily demand in units, an MAE of 12 means your model is off by about 12 units on average. Business leaders find MAE intuitive because it answers the question: "How far off is the model, on a typical day?"

Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$$

RMSE is similar to MAE but penalizes large errors more heavily, because errors are squared before averaging: a single 100-unit miss contributes as much to the squared sum as one hundred 10-unit misses. If occasional large errors are disproportionately costly — a demand forecast that is off by 100 units causes far more damage than ten forecasts off by 10 — RMSE is a better metric than MAE.
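A quick numerical comparison makes the squaring penalty concrete. The two error profiles below (illustrative numbers) have identical total absolute error, but one concentrates it in a single large miss:

```python
import numpy as np

# Two error profiles with the same total absolute error (100 units)
spread_errors = np.full(10, 10.0)              # ten misses of 10 units each
concentrated = np.array([100.0] + [0.0] * 9)   # one miss of 100 units

for name, errors in [("Spread out", spread_errors),
                     ("Concentrated", concentrated)]:
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    print(f"{name:>12}: MAE = {mae:5.1f}, RMSE = {rmse:5.2f}")
```

MAE is 10 units in both cases; RMSE stays at 10 for the spread-out profile but rises to roughly 31.6 for the concentrated one. If your business experiences those two scenarios very differently, RMSE is the metric that notices.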

Mean Absolute Percentage Error (MAPE)

$$\text{MAPE} = \frac{1}{n}\sum\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100$$

MAPE expresses error as a percentage, making it easy to communicate across contexts. "Our forecast is off by 8% on average" is more intuitive to most executives than "Our MAE is 47 units."

Caution

MAPE is undefined when actual values are zero and becomes unstable when actual values are close to zero. For forecasting tasks where zero demand is common (slow-moving inventory items), use weighted MAPE (WMAPE) or symmetric MAPE (SMAPE) instead.
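Both alternatives are easy to compute by hand. A sketch with illustrative slow-moving-item numbers (the helper functions are ours — scikit-learn does not ship WMAPE or SMAPE):

```python
import numpy as np

def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100

def smape(y_true, y_pred):
    """Symmetric MAPE: bounded at 200% and defined even when y_true is 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Guard the division, then zero out terms where both values are 0
    ratio = np.where(denom == 0, 0.0,
                     2 * np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom))
    return np.mean(ratio) * 100

# Slow-moving item: several zero-demand days make plain MAPE unusable
y_actual = np.array([0, 0, 3, 1, 0, 5, 2])
y_forecast = np.array([1, 0, 2, 1, 1, 4, 2])

print(f"WMAPE: {wmape(y_actual, y_forecast):.1f}%")
print(f"SMAPE: {smape(y_actual, y_forecast):.1f}%")
```

WMAPE weights each day by its actual volume, so zero-demand days contribute error without blowing up the denominator; SMAPE caps every day's contribution at 200%.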

Choosing the Right Regression Metric

Metric              Best For                                             Limitation
------------------  ---------------------------------------------------  ----------------------------------------
R-squared           Overall model explanatory power                      Doesn't reflect absolute error magnitude
Adjusted R-squared  Comparing models with different numbers of features  Same limitation as R-squared
MAE                 When all errors are equally costly                   Does not penalize outlier predictions
RMSE                When large errors are disproportionately costly      Harder to interpret than MAE
MAPE                Communicating to non-technical stakeholders          Breaks down near zero values

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Athena demand forecasting results (from Chapter 8)
np.random.seed(42)
y_actual = np.random.poisson(lam=100, size=365)  # Daily demand, ~100 units
y_predicted = y_actual + np.random.normal(0, 12, size=365)  # Model predictions

r2 = r2_score(y_actual, y_predicted)
mae = mean_absolute_error(y_actual, y_predicted)
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
mape = np.mean(np.abs((y_actual - y_predicted) / y_actual)) * 100

print("Athena Demand Forecaster — Evaluation Metrics")
print("=" * 50)
print(f"R-squared:  {r2:.4f}  ({r2*100:.1f}% of variance explained)")
print(f"MAE:        {mae:.2f} units")
print(f"RMSE:       {rmse:.2f} units")
print(f"MAPE:       {mape:.2f}%")
print()
print("Business Translation:")
print(f"  On average, the forecast is off by ~{mae:.0f} units per day.")
print(f"  In percentage terms, ~{mape:.0f}% error on a typical day.")

Business Insight: The best regression metric is the one that most directly maps to your business cost function. If holding excess inventory costs the same per unit regardless of quantity (linear cost), use MAE. If stockouts cause cascading problems — lost sales, expedited shipping, damaged customer relationships — use RMSE to penalize large misses. If you need to compare forecast quality across product categories with very different sales volumes, use MAPE. The metric should reflect how the business experiences the error.


Cross-Validation: Trusting Your Numbers

All the metrics we have discussed so far assume you have a reliable estimate of how the model performs on data it has not seen. In Chapter 7, we used a simple train-test split: train the model on 80% of the data, evaluate on the remaining 20%. This approach has a fundamental weakness.

The Problem with a Single Split

A single train-test split is like evaluating a restaurant based on one visit. Maybe the chef was having a good night. Maybe the test set happened to contain the "easy" customers — the ones whose behavior is most predictable. A different random split might produce very different results.

Tom experienced this firsthand: "I ran the same model five times with different random seeds for the train-test split. I got accuracy ranging from 89% to 94%. Which one do I report?"

"All of them," Professor Okonkwo says. "Or rather, their average and their variation. That is what cross-validation gives you."

K-Fold Cross-Validation

K-fold cross-validation systematically addresses the instability of a single split:

  1. Divide the data into K equal-sized folds (typically K = 5 or K = 10).
  2. For each fold: use that fold as the test set, train on the remaining K-1 folds.
  3. Compute the evaluation metric for each fold.
  4. Report the mean and standard deviation across all K folds.

Definition: K-fold cross-validation is a technique for obtaining a more robust estimate of model performance by training and evaluating the model K times, each time using a different fold as the test set and the remaining folds for training. The resulting K performance scores are averaged to produce a single estimate, with their standard deviation indicating the estimate's reliability.
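Before reaching for the automated helper, it is worth running the four steps by hand once. A minimal sketch (the synthetic data is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset: label driven by two features plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # Step 1: divide into folds
scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), 1):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])           # Step 2: train on K-1 folds
    score = model.score(X[test_idx], y[test_idx])   # Step 3: score the held-out fold
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

# Step 4: report mean and spread across folds
print(f"Mean: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

The cross_val_score function in the next block wraps exactly this loop — fold construction, repeated fitting, and per-fold scoring — into a single call.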

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Simulated Athena customer data
np.random.seed(42)
n = 2000
X = np.random.randn(n, 10)  # 10 features
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2] +
     np.random.randn(n) * 0.5 > 0.5).astype(int)

# Build a pipeline (scaling + logistic regression)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# 5-fold cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')

print("5-Fold Cross-Validation Results (AUC)")
print("=" * 40)
for i, score in enumerate(cv_scores, 1):
    print(f"  Fold {i}: {score:.4f}")
print(f"\n  Mean AUC:  {cv_scores.mean():.4f}")
print(f"  Std Dev:   {cv_scores.std():.4f}")
# Approximate CI from the standard error of the mean across folds.
# (Folds share training data, so treat this as a rough guide, not exact.)
se = cv_scores.std() / np.sqrt(len(cv_scores))
print(f"  95% CI:    [{cv_scores.mean() - 1.96*se:.4f}, "
      f"{cv_scores.mean() + 1.96*se:.4f}]")

Code Explanation: The cross_val_score function automates K-fold cross-validation. It returns an array of K scores — one per fold. The mean gives our best estimate of model performance; the standard deviation tells us how stable that estimate is. A small standard deviation (e.g., 0.01-0.02) suggests the model performs consistently. A large standard deviation (e.g., 0.05+) suggests the results are sensitive to which data points end up in the test set — a warning sign.

Stratified K-Fold

When classes are imbalanced (as in churn prediction, where only 5% of customers churn), a plain random K-fold split might produce folds with very few — or zero — churners, making the evaluation meaningless. Stratified K-fold preserves the class distribution in each fold. (Scikit-learn's cross_val_score already uses stratified folds by default for classifiers, but specifying StratifiedKFold explicitly documents your intent and lets you control shuffling and the random seed.)

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Use stratified K-fold for imbalanced data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_stratified = cross_val_score(pipeline, X, y, cv=skf,
                                        scoring='f1')

print("Stratified 5-Fold CV (F1 Score)")
print(f"  Mean F1:   {cv_scores_stratified.mean():.4f}")
print(f"  Std Dev:   {cv_scores_stratified.std():.4f}")

Time Series Cross-Validation

For time-dependent data — demand forecasting, stock prices, quarterly revenue — standard K-fold cross-validation violates a critical assumption: it uses future data to predict the past.

If you are building a demand forecaster (Chapter 8), training on January-December data and testing on randomly selected days throughout the year means your model has "seen" data from after the test period. In production, you never have this luxury.

Time series cross-validation (also called expanding window or walk-forward validation) respects temporal ordering:

  1. Train on months 1-3, test on month 4.
  2. Train on months 1-4, test on month 5.
  3. Train on months 1-5, test on month 6.
  4. And so on.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
print("Time Series Cross-Validation Splits:")
print("=" * 45)
for i, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    print(f"  Split {i}: Train [{train_idx[0]:>4}-{train_idx[-1]:>4}] "
          f"({len(train_idx):>4} samples) | "
          f"Test [{test_idx[0]:>4}-{test_idx[-1]:>4}] "
          f"({len(test_idx):>4} samples)")

Business Insight: If your model will make predictions about the future (and most business models do), always use time-series cross-validation. Standard K-fold will give you an overoptimistic estimate of performance because it allows information leakage from the future into the training set. The model will appear to work better in evaluation than it does in production — and the gap between offline metrics and online performance is already one of the biggest challenges in applied ML.


Hyperparameter Tuning: Finding the Best Settings

Every machine learning algorithm has hyperparameters — settings that are not learned from data but must be chosen by the practitioner. A random forest has the number of trees, the maximum depth of each tree, the minimum number of samples per leaf. A logistic regression has the regularization strength. A neural network has the learning rate, the number of layers, and the number of neurons per layer.

Definition: Hyperparameters are the configuration settings of a machine learning algorithm that are set before training begins, as opposed to parameters (like weights and coefficients) that are learned during training. Hyperparameter choices can dramatically affect model performance.

Grid Search: Exhaustive but Expensive

Grid search evaluates every combination of hyperparameter values you specify:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Define the parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None],
    'classifier__min_samples_leaf': [1, 5, 10]
}

# Total combinations: 3 x 4 x 3 = 36
# With 5-fold CV: 36 x 5 = 180 model fits

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,        # Use all CPU cores
    verbose=1
)

grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best AUC:        {grid_search.best_score_:.4f}")

Code Explanation: Grid search tries every combination in the parameter grid (36 in this case) and evaluates each with 5-fold cross-validation, resulting in 180 model fits. The n_jobs=-1 parameter parallelizes the computation across all available CPU cores. Grid search is thorough but computationally expensive — the number of fits grows multiplicatively with each new hyperparameter.

Random Search: Smarter and Faster

Random search samples hyperparameter combinations randomly rather than exhaustively. Research by Bergstra and Bengio (2012) demonstrated that random search is often more efficient than grid search because it explores a wider range of each hyperparameter.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions instead of fixed values
param_distributions = {
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(3, 30),
    'classifier__min_samples_leaf': randint(1, 20)
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=50,       # Try 50 random combinations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best AUC:        {random_search.best_score_:.4f}")

Code Explanation: Instead of a fixed grid, random search samples from probability distributions. randint(50, 300) means "pick a random integer between 50 and 300." With 50 iterations and 5-fold CV, we fit 250 models — compared to 180 for grid search — but we explore a much wider hyperparameter space and often find better combinations.

Bayesian optimization goes further by using the results of previous evaluations to intelligently choose the next set of hyperparameters to try. Think of it as a guided search: instead of randomly sampling, it builds a probabilistic model of the relationship between hyperparameters and performance, and uses that model to focus on the most promising regions of the search space.

Libraries like optuna and hyperopt implement Bayesian optimization for hyperparameter tuning. A full implementation is beyond our scope here, but the intuition matters for business leaders: Bayesian optimization finds good hyperparameters in fewer iterations than grid or random search, which translates to lower compute costs and faster time-to-deployment.

Business Insight: Hyperparameter tuning is a real cost center. For complex models on large datasets, a grid search might take hours or days on cloud computing infrastructure. At Athena, Ravi estimates that the data science team spends roughly $2,000/month on hyperparameter tuning compute alone. Random search can typically cut that cost by 50-70% with comparable results. Bayesian optimization can cut it further. These are not academic distinctions — they directly affect the ROI of your ML operations. We will revisit compute economics in Chapter 12 (MLOps).


A/B Testing for Models: The Final Exam

Every metric we have discussed so far is an offline evaluation — computed on historical data before the model is deployed. Offline evaluation is necessary, but it is not sufficient. The only way to know whether a model creates business value is to test it in the real world, with real customers, making real decisions.

This is what A/B testing provides.

The Structure of a Model A/B Test

A model A/B test (sometimes called an online experiment or champion-challenger test) works as follows:

  1. Control group (A): Continues to receive the current experience (the existing model, or no model at all).
  2. Treatment group (B): Receives the new model's predictions.
  3. Random assignment: Customers are randomly assigned to A or B to ensure the groups are comparable.
  4. Measurement period: The test runs long enough to accumulate statistically significant results.
  5. Decision: Based on the results, the new model is either deployed fully, iterated upon, or discarded.

Definition: An A/B test (also called a randomized controlled experiment) randomly assigns subjects to two or more groups to measure the causal effect of a treatment. In model evaluation, the "treatment" is the new model's predictions, and the outcome is a business metric (revenue, churn rate, conversion rate).

Key Design Decisions

What to measure (the primary metric): Choose the business metric that most directly reflects the model's intended impact. For Athena's churn model, the primary metric is the retention rate — the percentage of at-risk customers who are still active 90 days after receiving a retention intervention.

How long to run: The test must run long enough to detect a meaningful difference with statistical significance. Running too short risks a false negative (concluding the model does not work when it does). Running too long delays value capture and wastes resources.

Sample size: The required sample size depends on the baseline rate, the minimum detectable effect, and the desired statistical power. For rare events (like churn), you need larger samples.

Guardrail metrics: Metrics that must not deteriorate, even if the primary metric improves. For example, a churn model might improve retention but accidentally annoy loyal customers with unnecessary offers, reducing their satisfaction scores. Track customer satisfaction, revenue per customer, and support call volume alongside the primary metric.
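The sample-size question above has a standard closed form for a two-proportion test. A hedged sketch — the baseline and effect figures below are illustrative, not Athena's actual numbers:

```python
import numpy as np
from scipy import stats

def required_sample_size(p_baseline, min_detectable_effect,
                         alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion z-test.

    p_baseline: control rate (e.g., 0.084 retention).
    min_detectable_effect: smallest absolute lift you must be able to detect.
    """
    p1 = p_baseline
    p2 = p_baseline + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-tailed significance
    z_beta = stats.norm.ppf(power)            # desired power
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p2 - p1) ** 2))

# Detect a 4-point absolute lift on an 8.4% baseline retention rate
n = required_sample_size(p_baseline=0.084, min_detectable_effect=0.04)
print(f"Required per group: {n:,} customers")
```

Note how the required n explodes as the minimum detectable effect shrinks (it scales with 1 over the effect squared) — which is why detecting small improvements on rare events like churn demands such large test populations.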

import numpy as np
from scipy import stats

def ab_test_significance(control_conversions, control_total,
                          treatment_conversions, treatment_total,
                          confidence=0.95):
    """
    Two-proportion z-test for A/B test significance.

    Parameters
    ----------
    control_conversions : int
        Number of successes in control group.
    control_total : int
        Total observations in control group.
    treatment_conversions : int
        Number of successes in treatment group.
    treatment_total : int
        Total observations in treatment group.
    confidence : float
        Confidence level (default 0.95).

    Returns
    -------
    dict
        Test results including p-value and significance.
    """
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total

    # Pooled proportion
    p_pool = (control_conversions + treatment_conversions) / \
             (control_total + treatment_total)

    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) *
                 (1/control_total + 1/treatment_total))

    # Z-statistic
    z = (p_treatment - p_control) / se

    # Two-tailed p-value
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Lift
    lift = (p_treatment - p_control) / p_control * 100

    alpha = 1 - confidence
    significant = p_value < alpha

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': lift,
        'z_statistic': z,
        'p_value': p_value,
        'significant': significant,
        'confidence_level': confidence
    }

# Athena churn model A/B test results (hypothetical)
results = ab_test_significance(
    control_conversions=42,    # 42 of 500 at-risk customers retained (control)
    control_total=500,
    treatment_conversions=63,  # 63 of 500 at-risk customers retained (treatment)
    treatment_total=500,
    confidence=0.95
)

print("Athena Churn Model — A/B Test Results")
print("=" * 50)
print(f"Control retention rate:   {results['control_rate']:.1%}")
print(f"Treatment retention rate: {results['treatment_rate']:.1%}")
print(f"Lift:                     {results['lift']:+.1f}%")
print(f"p-value:                  {results['p_value']:.4f}")
print(f"Statistically significant at {results['confidence_level']:.0%}: "
      f"{'Yes' if results['significant'] else 'No'}")

Code Explanation: This function performs a two-proportion z-test — the standard statistical test for comparing conversion rates between two groups. The p-value indicates the probability of observing results this extreme if the model had no real effect. A p-value below 0.05 (at the 95% confidence level) suggests the improvement is real, not due to chance. The lift percentage quantifies the improvement: a 50% lift in retention rate, for example, means the model-driven intervention retains 50% more at-risk customers.

When to Deploy

NK asks the practical question: "So we have a statistically significant improvement. Does that mean we deploy?"

"Not necessarily," Okonkwo says. "Statistical significance tells you the effect is real. It does not tell you the effect is big enough to matter. A statistically significant 0.1% improvement in retention might not justify the cost of running the model in production. You also need to check your guardrail metrics — if retention went up but customer satisfaction went down, you may have won the battle and lost the war."

Business Insight: The decision to deploy a model requires three conditions: (1) the improvement is statistically significant (the effect is real), (2) the improvement is practically significant (the effect is large enough to justify the costs), and (3) guardrail metrics have not degraded (the model is not creating value in one area while destroying it in another). All three conditions must be met. This is the standard Athena applies to every model deployment, and it should be yours.
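Condition (2) can be made operational with a confidence interval on the rate difference: if even the lower bound of the interval clears your break-even lift, the effect is both real and large enough. A sketch using the same hypothetical numbers as the A/B test above (the break-even figure is illustrative):

```python
import numpy as np
from scipy import stats

def diff_confidence_interval(x1, n1, x2, n2, confidence=0.95):
    """Wald CI for the difference in proportions (treatment minus control)."""
    p1, p2 = x1 / n1, x2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

# Same hypothetical figures: 42/500 retained (control), 63/500 (treatment)
lo, hi = diff_confidence_interval(42, 500, 63, 500)
print(f"95% CI for the retention lift: [{lo:+.1%}, {hi:+.1%}]")

# Practical significance: require the lower bound to clear break-even
break_even_lift = 0.01   # illustrative: 1 point of retention pays for the model
print("Clears break-even:", lo > break_even_lift)
```

With these numbers the lift is statistically significant, yet the interval's lower bound sits below a one-point break-even — exactly the gap between "the effect is real" and "the effect is big enough" that Okonkwo describes.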


The ModelEvaluator: A Unified Evaluation Framework

Throughout this chapter, we have introduced numerous metrics, visualizations, and analytical frameworks. In practice, you need a systematic way to compute them all at once, translate them into business terms, and generate the kind of summary that an executive can read in five minutes.

That is what the ModelEvaluator class provides.

Athena Update: Ravi Mehta presents the Model Evaluation Board concept to Athena's executive team. "We have three churn models built by three different team members," he explains. "Model A has the highest AUC. Model C has the highest precision. Model B — a simpler logistic regression — has a lower AUC than Model A but is interpretable, fast, and the operations team can explain it to customers who ask why they received a retention offer." The board's decision framework: run all three through the ModelEvaluator with Athena's cost matrix, and let the expected profit analysis decide. Model B wins — not because it is the most sophisticated, but because it delivers the highest expected profit while meeting interpretability and latency requirements. Every model at Athena must now pass a "Business Translation Test" before deployment: can you express the model's value in a single sentence a non-technical executive would understand?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve,
    average_precision_score, f1_score, fbeta_score,
    precision_score, recall_score, accuracy_score
)


class ModelEvaluator:
    """
    Unified model evaluation framework that connects ML metrics
    to business outcomes.

    Parameters
    ----------
    model : sklearn estimator
        A trained classifier with a predict_proba method.
    X_test : array-like
        Test features.
    y_test : array-like
        True test labels (binary: 0 or 1).
    cost_matrix : dict, optional
        Business values for each outcome:
        {'tp_value': float, 'fp_value': float,
         'fn_value': float, 'tn_value': float}
    class_labels : list of str, optional
        Human-readable labels for the classes (e.g., ["Stay", "Churn"]).
    """

    def __init__(self, model, X_test, y_test, cost_matrix=None,
                 class_labels=None):
        self.model = model
        self.X_test = np.array(X_test)
        self.y_test = np.array(y_test)
        self.class_labels = class_labels or ["Negative", "Positive"]

        # Default cost matrix: symmetric costs
        self.cost_matrix = cost_matrix or {
            'tp_value': 1, 'fp_value': -1,
            'fn_value': -1, 'tn_value': 0
        }

        # Generate predictions and probabilities
        self.y_pred = model.predict(X_test)
        self.y_proba = model.predict_proba(X_test)[:, 1]

        # Compute confusion matrix
        self.cm = confusion_matrix(self.y_test, self.y_pred)

    def classification_summary(self):
        """Print a comprehensive classification report."""
        print("=" * 60)
        print("CLASSIFICATION SUMMARY")
        print("=" * 60)

        tn, fp, fn, tp = self.cm.ravel()

        print(f"\nConfusion Matrix:")
        print(f"  True Positives:  {tp:>6}")
        print(f"  False Positives: {fp:>6}")
        print(f"  False Negatives: {fn:>6}")
        print(f"  True Negatives:  {tn:>6}")

        print(f"\nKey Metrics:")
        print(f"  Accuracy:    {accuracy_score(self.y_test, self.y_pred):.4f}")
        print(f"  Precision:   {precision_score(self.y_test, self.y_pred):.4f}")
        print(f"  Recall:      {recall_score(self.y_test, self.y_pred):.4f}")
        print(f"  F1 Score:    {f1_score(self.y_test, self.y_pred):.4f}")
        print(f"  F2 Score:    "
              f"{fbeta_score(self.y_test, self.y_pred, beta=2):.4f}")

        # ROC AUC
        fpr, tpr, _ = roc_curve(self.y_test, self.y_proba)
        roc_auc = auc(fpr, tpr)
        print(f"  ROC AUC:     {roc_auc:.4f}")

        # Average Precision
        ap = average_precision_score(self.y_test, self.y_proba)
        print(f"  Avg Precision: {ap:.4f}")

        print(f"\nFull Classification Report:")
        print(classification_report(self.y_test, self.y_pred,
                                     target_names=self.class_labels))

    def plot_roc_curve(self, ax=None):
        """Plot the ROC curve with AUC."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(8, 6))

        fpr, tpr, _ = roc_curve(self.y_test, self.y_proba)
        roc_auc = auc(fpr, tpr)

        ax.plot(fpr, tpr, color='navy', lw=2,
                label=f'Model (AUC = {roc_auc:.3f})')
        ax.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Guess')
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC Curve')
        ax.legend(loc='lower right')
        return ax

    def plot_precision_recall_curve(self, ax=None):
        """Plot the precision-recall curve with average precision."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(8, 6))

        precision_vals, recall_vals, _ = precision_recall_curve(
            self.y_test, self.y_proba
        )
        ap = average_precision_score(self.y_test, self.y_proba)

        ax.plot(recall_vals, precision_vals, color='darkorange', lw=2,
                label=f'Model (AP = {ap:.3f})')
        prevalence = self.y_test.mean()
        ax.axhline(y=prevalence, color='gray', linestyle='--',
                   label=f'Baseline ({prevalence:.3f})')
        ax.set_xlabel('Recall')
        ax.set_ylabel('Precision')
        ax.set_title('Precision-Recall Curve')
        ax.legend(loc='upper right')
        return ax

    def plot_confusion_matrix(self, ax=None):
        """Plot a labeled confusion matrix heatmap."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(7, 6))

        im = ax.imshow(self.cm, interpolation='nearest', cmap='Blues')
        ax.figure.colorbar(im, ax=ax)

        ax.set(xticks=[0, 1], yticks=[0, 1],
               xticklabels=self.class_labels,
               yticklabels=self.class_labels,
               ylabel='Actual', xlabel='Predicted',
               title='Confusion Matrix')

        # Annotate cells
        for i in range(2):
            for j in range(2):
                ax.text(j, i, format(self.cm[i, j], 'd'),
                        ha='center', va='center',
                        color='white' if self.cm[i, j] > self.cm.max()/2
                        else 'black', fontsize=16)
        return ax

    def compute_expected_profit(self, threshold=None):
        """
        Compute expected profit at a given threshold.

        If threshold is None, uses the default 0.5 threshold.
        """
        if threshold is not None:
            y_pred_t = (self.y_proba >= threshold).astype(int)
        else:
            y_pred_t = self.y_pred

        # labels=[0, 1] keeps the matrix 2x2 even at extreme thresholds
        cm_t = confusion_matrix(self.y_test, y_pred_t, labels=[0, 1])
        tn, fp, fn, tp = cm_t.ravel()

        profit = (tp * self.cost_matrix['tp_value'] +
                  fp * self.cost_matrix['fp_value'] +
                  fn * self.cost_matrix['fn_value'] +
                  tn * self.cost_matrix['tn_value'])
        return profit

    def find_optimal_threshold(self, n_thresholds=200):
        """
        Find the threshold that maximizes expected profit.

        Returns
        -------
        tuple
            (optimal_threshold, max_profit, all_thresholds, all_profits)
        """
        thresholds = np.linspace(0.01, 0.99, n_thresholds)
        profits = [self.compute_expected_profit(t) for t in thresholds]

        optimal_idx = np.argmax(profits)
        return (thresholds[optimal_idx], profits[optimal_idx],
                thresholds, profits)

    def plot_profit_curve(self, ax=None):
        """Plot expected profit across all thresholds."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(10, 6))

        opt_threshold, max_profit, thresholds, profits = \
            self.find_optimal_threshold()

        ax.plot(thresholds, profits, color='navy', lw=2)
        ax.scatter(opt_threshold, max_profit, color='red', s=120,
                   zorder=5, label=f'Optimal: t={opt_threshold:.2f}, '
                   f'profit=${max_profit:,.0f}')
        ax.axhline(y=0, color='gray', linestyle='--', lw=0.8)
        ax.set_xlabel('Classification Threshold')
        ax.set_ylabel('Expected Profit ($)')
        ax.set_title('Profit Curve — Expected Profit by Threshold')
        ax.legend(fontsize=11)
        return ax

    def plot_dashboard(self):
        """Generate a 2x2 evaluation dashboard."""
        fig, axes = plt.subplots(2, 2, figsize=(14, 12))
        fig.suptitle('Model Evaluation Dashboard', fontsize=16, y=1.01)

        self.plot_confusion_matrix(ax=axes[0, 0])
        self.plot_roc_curve(ax=axes[0, 1])
        self.plot_precision_recall_curve(ax=axes[1, 0])
        self.plot_profit_curve(ax=axes[1, 1])

        plt.tight_layout()
        plt.savefig("model_dashboard.png", dpi=150, bbox_inches='tight')
        plt.show()
        return fig

    def executive_summary(self):
        """
        Generate a plain-English executive summary of model performance.
        """
        tn, fp, fn, tp = self.cm.ravel()
        total = tn + fp + fn + tp
        prevalence = (tp + fn) / total

        acc = accuracy_score(self.y_test, self.y_pred)
        prec = precision_score(self.y_test, self.y_pred)
        rec = recall_score(self.y_test, self.y_pred)
        f1 = f1_score(self.y_test, self.y_pred)

        fpr_vals, tpr_vals, _ = roc_curve(self.y_test, self.y_proba)
        roc_auc = auc(fpr_vals, tpr_vals)

        opt_threshold, max_profit, _, _ = self.find_optimal_threshold()
        default_profit = self.compute_expected_profit()

        print("=" * 60)
        print("EXECUTIVE SUMMARY — MODEL PERFORMANCE")
        print("=" * 60)
        print()
        print(f"Dataset: {total:,} observations "
              f"({tp+fn:,} positive, {tn+fp:,} negative)")
        print(f"Positive class prevalence: {prevalence:.1%}")
        print()
        print("HEADLINE METRICS")
        print(f"  ROC AUC: {roc_auc:.3f} — ", end="")
        if roc_auc >= 0.9:
            print("Excellent discriminative ability.")
        elif roc_auc >= 0.8:
            print("Good discriminative ability.")
        elif roc_auc >= 0.7:
            print("Fair discriminative ability.")
        else:
            print("Poor discriminative ability. "
                  "Consider additional features or alternative models.")

        print(f"  Precision: {prec:.1%} — Of {tp+fp:,} predicted positives, "
              f"{tp:,} were correct.")
        print(f"  Recall: {rec:.1%} — Of {tp+fn:,} actual positives, "
              f"the model identified {tp:,}.")
        print(f"  F1 Score: {f1:.3f}")
        print()
        print("BUSINESS IMPACT")
        print(f"  At default threshold (0.5):")
        print(f"    Expected profit: ${default_profit:,.0f}")
        print(f"  At optimal threshold ({opt_threshold:.2f}):")
        print(f"    Expected profit: ${max_profit:,.0f}")
        improvement = max_profit - default_profit
        if improvement > 0:
            print(f"    Improvement from threshold optimization: "
                  f"${improvement:,.0f}")
        print()
        print("KEY FINDINGS")
        if rec < 0.5:
            print(f"  WARNING: The model misses {fn:,} of {tp+fn:,} "
                  f"actual positives ({1-rec:.0%}). Consider lowering "
                  f"the threshold or improving recall.")
        if prec < 0.3:
            print(f"  WARNING: Only {prec:.0%} of positive predictions "
                  f"are correct. High false positive rate may erode "
                  f"trust in the system.")
        if roc_auc >= 0.8 and max_profit > 0:
            print(f"  The model demonstrates strong predictive power and "
                  f"positive expected ROI. Recommend proceeding to A/B test.")
        print()
        print("RECOMMENDATION")
        if max_profit > 0 and roc_auc >= 0.7:
            print(f"  Proceed to A/B test at threshold {opt_threshold:.2f}.")
            print(f"  Monitor: primary metric (retention rate), guardrail "
                  f"metrics (customer satisfaction, support volume).")
        elif max_profit > 0:
            print(f"  Model shows positive ROI but weak discrimination. "
                  f"Consider feature engineering or alternative algorithms "
                  f"before deployment.")
        else:
            print(f"  Model does not generate positive expected profit "
                  f"under the current cost structure. Do not deploy.")
        print("=" * 60)

Code Explanation: The ModelEvaluator class encapsulates every evaluation technique from this chapter into a single, reusable tool. It takes a trained model, test data, and a business cost matrix as inputs. Key methods: classification_summary() produces all standard metrics; plot_dashboard() generates the four-panel visualization; find_optimal_threshold() identifies the profit-maximizing threshold; and executive_summary() produces a plain-English report suitable for sharing with non-technical stakeholders. The executive_summary() method is the "Business Translation Test" in code form — it forces the evaluator to express model performance in business terms.

Using the ModelEvaluator

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Simulated Athena churn data
np.random.seed(42)
n = 5000
X = np.random.randn(n, 8)
y = (0.8*X[:,0] + 0.5*X[:,1] - 0.3*X[:,2] + 0.2*X[:,3] +
     np.random.randn(n)*1.5 > 2.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Model B: Logistic Regression
model_b = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
model_b.fit(X_train, y_train)

# Athena cost matrix
athena_costs = {
    'tp_value': 480,
    'fp_value': -20,
    'fn_value': -500,
    'tn_value': 0
}

# Evaluate
evaluator = ModelEvaluator(
    model=model_b,
    X_test=X_test,
    y_test=y_test,
    cost_matrix=athena_costs,
    class_labels=["Stay", "Churn"]
)

evaluator.classification_summary()
evaluator.executive_summary()
evaluator.plot_dashboard()
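The same cost matrix can also be used to compare candidate models head to head before building a full dashboard. The sketch below is self-contained (it regenerates the simulated Athena data rather than reusing the ModelEvaluator class), and the helper `best_profit` is illustrative, not part of the chapter's class:

```python
# Sketch: compare two candidate models on expected profit at each model's own
# profit-maximizing threshold. Data, seed, and cost values mirror the Athena
# example above; best_profit is an illustrative helper, not ModelEvaluator.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.standard_normal((5000, 8))
y = (0.8*X[:, 0] + 0.5*X[:, 1] - 0.3*X[:, 2] + 0.2*X[:, 3]
     + rng.standard_normal(5000)*1.5 > 2.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

costs = {'tp_value': 480, 'fp_value': -20, 'fn_value': -500, 'tn_value': 0}

def best_profit(model, X_te, y_te, costs):
    """Return (profit-maximizing threshold, profit at that threshold)."""
    proba = model.predict_proba(X_te)[:, 1]
    best_t, best_p = 0.5, -np.inf
    for t in np.linspace(0.01, 0.99, 99):
        pred = (proba >= t).astype(int)
        # labels=[0, 1] guarantees a 2x2 matrix even at extreme thresholds
        tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
        profit = int(tp*costs['tp_value'] + fp*costs['fp_value']
                     + fn*costs['fn_value'] + tn*costs['tn_value'])
        if profit > best_p:
            best_t, best_p = t, profit
    return best_t, best_p

candidates = {
    'Model A (gradient boosting)': GradientBoostingClassifier(random_state=42),
    'Model B (logistic regression)': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(random_state=42))]),
}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    t, p = best_profit(model, X_te, y_te, costs)
    print(f"{name}: optimal threshold {t:.2f}, expected profit ${p:,.0f}")
```

A comparison like this answers the quantitative half of model selection; the next section turns to the criteria that profit alone does not capture.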

Model Selection: Beyond the Numbers

We have spent most of this chapter on evaluation metrics — the quantitative tools for measuring model performance. But model selection — choosing which model to deploy — requires weighing factors that no metric captures on its own.

Ravi Mehta makes this point explicitly when he presents Athena's Model Evaluation Board process to the class.

Athena Update: "We had three models," Ravi tells the class during a guest lecture. "Model A — a gradient boosting ensemble with the highest AUC. Model B — a logistic regression with a slightly lower AUC but much faster inference time and coefficients that the operations team can inspect and explain. Model C — a model tuned for maximum precision that catches fewer churners but never sends a retention offer to the wrong customer." He pauses. "We chose Model B. Not because it was the best model. Because it was the best business decision."

The Model Selection Scorecard

Ravi walks the class through Athena's selection criteria:

| Criterion | Model A (Gradient Boosting) | Model B (Logistic Regression) | Model C (Precision-Tuned) |
|---|---|---|---|
| AUC | 0.91 | 0.87 | 0.84 |
| Expected Profit | $12,400 | $13,800 | $8,200 |
| Inference Latency | 150ms | 5ms | 8ms |
| Interpretability | Low (black box) | High (coefficients) | Medium |
| Operations Trust | Low ("we can't explain it") | High ("we can see why") | Medium |
| Fairness Audit | Pending | Passed | Passed |
| Maintenance Cost | High (complex pipeline) | Low (simple model) | Medium |
| Monthly Compute | $800 | $50 | $65 |

"Model A has the best discrimination," Ravi explains. "But our operations team pushes back. When a customer calls and asks, 'Why did you send me this offer?' they need to explain. With logistic regression, they can say, 'Based on your purchase frequency dropping and your last return, our system flagged that you might not be satisfied.' With gradient boosting, they say, 'The algorithm said so.' That's not acceptable for our brand."

"Model C has the best precision — when it flags someone, it's almost always right. But it catches so few churners that the total expected profit is much lower. It's a great model for a different problem — one where the cost of a false positive is much higher."

"Model B wins because it maximizes expected profit, the operations team trusts it, the compute cost is 16 times lower than Model A, and it passed our fairness audit. That's the business case."

Tom, sitting in the front row, nods slowly. He had been the one advocating for Model A. "I was optimizing for the wrong thing," he admits. "I was trying to build the best model. I should have been trying to build the most valuable model."

Okonkwo smiles. "That distinction is the thesis of this entire course."

The Five Dimensions of Model Selection

When choosing between competing models, evaluate each along five dimensions:

1. Predictive Performance — AUC, F1, expected profit at optimal threshold. This is necessary but not sufficient. A model that does not predict well cannot create business value. But a model that predicts well while failing on the other dimensions can still destroy value.

2. Interpretability — Can the model's decisions be explained to stakeholders, customers, and regulators? Logistic regression coefficients are directly interpretable. Decision tree rules are inspectable. Deep neural network activations are opaque. When Chapter 25 introduces fairness metrics, interpretability becomes a regulatory requirement, not just a preference. As the EU AI Act (Chapter 28) takes effect, "the algorithm decided" will no longer be an acceptable explanation for high-risk decisions.

3. Latency and Scalability — How fast does the model produce predictions, and how does that speed change at scale? A model that takes 200ms per prediction is fine for batch processing (score all customers overnight) but too slow for real-time applications (approve a credit card transaction in under 100ms). At Athena's scale — 2 million customers, scored monthly — the difference between 5ms and 150ms inference time translates to roughly 2.8 hours versus 83 hours of single-threaded compute per scoring run.

4. Fairness and Compliance — Does the model treat different demographic groups equitably? Does it comply with relevant regulations? We will explore fairness metrics in depth in Chapter 25, but the evaluation starts here. A model that achieves high AUC overall but performs significantly worse for certain customer segments is a legal and ethical liability.

5. Organizational Fit — Does the team have the skills to maintain the model? Does the infrastructure support it? Will the business unit actually use the predictions? The most sophisticated model in the world creates zero value if the operations team does not trust it, the engineering team cannot deploy it, or the executive team does not understand it.
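The latency arithmetic in dimension 3 is worth checking directly. This back-of-the-envelope calculation assumes single-threaded, one-at-a-time scoring; real deployments batch and parallelize, which shrinks the wall-clock time but not the ratio between the models:

```python
# Back-of-the-envelope: single-threaded scoring time for Athena's
# 2 million customers at the two per-prediction latencies from the scorecard.
n_customers = 2_000_000
for name, latency_ms in [("Model B (logistic regression)", 5),
                         ("Model A (gradient boosting)", 150)]:
    total_hours = n_customers * latency_ms / 1000 / 3600  # ms -> s -> hours
    print(f"{name}: {latency_ms}ms/prediction -> "
          f"{total_hours:,.1f} hours per scoring run")
```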

Business Insight: Model selection is an exercise in multi-criteria decision-making. Create a scorecard with weighted dimensions — and make sure the weights reflect your organization's strategic priorities. A startup optimizing for speed might weight predictive performance and latency heavily. A regulated financial institution might weight fairness and interpretability heavily. A resource-constrained team might weight maintenance cost and organizational fit heavily. There is no universal answer. There is only the answer that is right for your organization, at this moment, for this problem.
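A weighted scorecard like the one described above can be sketched in a few lines. The criterion scores below come from Athena's table (with judgment scores for the qualitative rows); the weights are illustrative assumptions, not Athena's actual priorities:

```python
# Sketch of a weighted model-selection scorecard. Each criterion is scaled
# to [0, 1] with 1 = best; the weights are assumed for illustration.
scores = {  # criterion: (Model A, Model B, Model C)
    'predictive (AUC)':        (0.91, 0.87, 0.84),
    'expected profit':         (12400/13800, 13800/13800, 8200/13800),
    'latency (5ms = best)':    (5/150, 5/5, 5/8),
    'interpretability':        (0.2, 1.0, 0.6),   # judgment scores
    'fairness audit passed':   (0.0, 1.0, 1.0),   # A is still pending
    'monthly compute ($50 = best)': (50/800, 50/50, 50/65),
}
weights = {  # assumed strategic weights; must sum to 1
    'predictive (AUC)': 0.25, 'expected profit': 0.30,
    'latency (5ms = best)': 0.10, 'interpretability': 0.15,
    'fairness audit passed': 0.10, 'monthly compute ($50 = best)': 0.10,
}
totals = [sum(weights[c] * scores[c][m] for c in scores) for m in range(3)]
for name, total in zip(['Model A', 'Model B', 'Model C'], totals):
    print(f"{name}: weighted score {total:.3f}")
```

Under these weights Model B comes out on top, consistent with Athena's decision; shifting weight toward raw AUC and away from interpretability and compute cost would favor Model A, which is exactly why the weights must be debated before the scores are computed.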


The Business Translation Test

Professor Okonkwo closes the lecture with a challenge that will become a recurring assignment for the rest of the course.

"Before any model leaves this classroom — before it goes into a presentation, a proposal, or a deployment plan — it must pass the Business Translation Test. You must be able to complete this sentence:"

She writes on the whiteboard:

"This model identifies [what] with [precision/recall at threshold], which enables us to [business action], resulting in an estimated [business impact], at a cost of [implementation cost], for a net ROI of [return]."

NK tries it: "This model identifies at-risk customers with 65% recall at a 0.3 threshold, which enables us to send targeted retention offers, resulting in an estimated $13,800 in saved annual revenue per 1,000 customers scored, at a cost of $600 per month in compute and operations, for a net ROI of 23:1 on a quarterly basis."

"Now I can take that to my CMO," she says.

Tom tries it: "This model identifies at-risk customers with 91% AUC and a gradient boosting architecture using 200 trees with max depth 10 and—"

"Stop," Okonkwo says. "That's a technical specification, not a business translation. Try again."

Tom pauses, then rewrites: "This model identifies at-risk customers with similar accuracy to the logistic regression but at sixteen times the monthly compute cost, without the interpretability that the operations team needs. It's the wrong model for this deployment."

Okonkwo nods. "That is also a valid business translation — even though the answer is 'don't deploy this one.' Knowing which model not to deploy is just as valuable as knowing which one to deploy. Maybe more so."
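The template itself can even be captured as a small helper, which makes the test mechanical: if you cannot supply every argument, the evaluation is not finished. The function and argument names here are illustrative, not part of the chapter's ModelEvaluator:

```python
# Hypothetical helper that fills in Okonkwo's Business Translation Test
# template; every blank in the sentence must be supplied explicitly.
def business_translation(what, performance, action, impact, cost, roi):
    return (f"This model identifies {what} with {performance}, "
            f"which enables us to {action}, resulting in an estimated "
            f"{impact}, at a cost of {cost}, for a net ROI of {roi}.")

# NK's version from the lecture:
print(business_translation(
    what="at-risk customers",
    performance="65% recall at a 0.3 threshold",
    action="send targeted retention offers",
    impact="$13,800 in saved annual revenue per 1,000 customers scored",
    cost="$600 per month in compute and operations",
    roi="23:1 on a quarterly basis"))
```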


Chapter Summary

This chapter has taken you from the seductive simplicity of "92% accuracy" to the disciplined complexity of cost-sensitive, multi-dimensional model evaluation. Here is what we covered:

The confusion matrix is the foundation of all classification evaluation. Each cell represents a business outcome with an associated cost or value. Understanding the confusion matrix in business terms — not just statistical terms — is the first step toward translating model performance into business impact.

Precision and recall capture different dimensions of model quality. Precision measures the accuracy of positive predictions; recall measures the completeness of positive detection. They trade off against each other, and the right balance depends on the business cost of false positives versus false negatives.

The F1 score and F-beta provide single-number summaries that balance precision and recall. F-beta allows you to weight the balance toward precision (beta < 1) or recall (beta > 1) based on your business priorities.

ROC curves and AUC visualize the tradeoff between true positive rate and false positive rate across all thresholds. AUC provides a threshold-independent summary of discriminative power. But ROC curves can be misleading for imbalanced classes — use precision-recall curves when the positive class is rare.

Cost-sensitive evaluation assigns dollar values to each cell of the confusion matrix and computes expected profit. The profit curve shows expected profit at every threshold, revealing the threshold that maximizes business value. This threshold is almost never the default 0.5.

Regression metrics — R-squared, MAE, RMSE, MAPE — each capture different aspects of continuous prediction quality. Choose the metric that best reflects how the business experiences prediction errors.

Cross-validation provides robust performance estimates by training and evaluating the model multiple times on different data splits. Use stratified K-fold for imbalanced classes and time series CV for temporal data.

Hyperparameter tuning — grid search, random search, and Bayesian optimization — systematically optimizes model configuration. Random search is usually more efficient than grid search; Bayesian optimization is more efficient still.

A/B testing is the gold standard for evaluating models in the real world. Statistical significance confirms the effect is real; practical significance confirms it is large enough to matter; guardrail metrics confirm it does not cause collateral damage.

Model selection is a multi-dimensional decision. Predictive performance, interpretability, latency, fairness, and organizational fit all matter. The model with the highest AUC is not always the best business choice.

The Business Translation Test is the final checkpoint. If you cannot express your model's value in a single sentence that a non-technical executive would understand, you have not finished evaluating it.

In Chapter 12, we will take the model that survives this evaluation gauntlet and deploy it to production — where an entirely new set of challenges awaits. The gap between "the model works on my laptop" and "the model works in production" is the subject of MLOps, and it is wider than most data scientists expect.


"The goal of evaluation is not to produce a number. The goal of evaluation is to produce a decision."

— Professor Diane Okonkwo