Chapter 11 Key Takeaways: Model Evaluation and Selection


The Accuracy Trap

  1. Accuracy is a vanity metric on imbalanced datasets. A model that predicts the majority class for every observation achieves accuracy equal to the majority class proportion. Tom's 92%-accuracy churn model was outperformed by a naive baseline that always predicted "no churn" (95% accuracy). Always compare model accuracy against the naive baseline before reporting it as a success.

  2. Different errors have different costs, and evaluation metrics must reflect this asymmetry. A false negative in fraud detection (missed fraud) costs the bank the full transaction amount. A false positive (flagged legitimate transaction) costs a customer service call. A model evaluated only by accuracy treats these errors as equivalent — which they are not. The first question to ask about any model is: "What happens when it's wrong, and which kind of wrong is more expensive?"
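The baseline comparison can be reproduced in a few lines of plain Python. The labels and predictions below are toy numbers arranged to match the 92%-vs-95% figures in the takeaway, not data from the chapter:

```python
def naive_baseline_accuracy(y_true):
    """Accuracy of always predicting the majority class."""
    counts = {}
    for label in y_true:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(y_true)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy churn data: 5% churners (label 1), 95% non-churners (label 0).
y_true = [1] * 5 + [0] * 95
# A model that catches 2 of 5 churners but also raises 5 false alarms.
y_model = [1, 1, 0, 0, 0] + [1] * 5 + [0] * 90

baseline = naive_baseline_accuracy(y_true)   # 0.95: always "no churn"
model_acc = accuracy(y_true, y_model)        # 0.92: loses to the naive baseline
print(baseline, model_acc)
```

Any model whose accuracy does not clear this baseline has, by that metric, learned nothing worth deploying.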


Core Metrics

  1. Precision measures the reliability of positive predictions; recall measures the completeness of positive detection. Precision answers "When the model flags something, how often is it right?" Recall answers "Of all the things that should have been flagged, how many did the model catch?" They trade off against each other through the classification threshold, and the right balance is a business decision, not a statistical one.

  2. The F-beta score allows you to weight precision and recall according to business priorities. F1 weights them equally. F2 emphasizes recall (use when missing positives is expensive: fraud, disease, churn with high customer lifetime value). F0.5 emphasizes precision (use when false positives are expensive: spam filtering, high-cost interventions). Choosing beta is a business decision that requires understanding the cost structure.

  3. ROC curves and AUC measure discriminative ability across all thresholds, but can be misleading for imbalanced datasets. When the positive class is rare (fraud at 0.1%, churn at 5%), precision-recall curves and Average Precision (AP) provide a more honest picture of model performance. Use the visualization that answers the business question your stakeholders are actually asking.
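A minimal sketch of precision, recall, and the F-beta trade-off, on a hypothetical set of labels and predictions. In practice scikit-learn's `precision_score`, `recall_score`, and `fbeta_score` are the usual tools; the hand-rolled versions here just make the arithmetic visible:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall from raw labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # reliability of flags
    recall = tp / (tp + fn) if tp + fn else 0.0      # completeness of flags
    return precision, recall

def f_beta(precision, recall, beta):
    """Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical flags: 4 true positives in the data, the model catches 2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)                  # precision ~0.67, recall 0.5
print(f_beta(p, r, 1.0))     # F1 ~0.571
print(f_beta(p, r, 2.0))     # F2 ~0.526, pulled toward the weaker recall
```

Note how F2 sits closer to the lower recall than F1 does: raising beta makes missed positives hurt the score more.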


Business-Driven Evaluation

  1. Cost-sensitive evaluation transforms the confusion matrix into a profit-and-loss statement. By assigning dollar values to each cell of the confusion matrix — the value of catching a churner, the cost of a wasted offer, the loss from a missed churner — you convert model metrics into expected profit. The profit curve across all thresholds reveals the operating point that maximizes business value, and that threshold is almost never the default 0.5.

  2. The optimal classification threshold depends on the cost structure, not the algorithm. When false negatives are much more expensive than false positives, the optimal threshold is below 0.5 (cast a wider net, accept more false alarms). When false positives are expensive, the optimal threshold is above 0.5 (be more selective). Threshold optimization is one of the highest-leverage activities in applied ML — it requires zero model retraining and can dramatically improve business outcomes.
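The profit-curve idea can be sketched directly: assign a dollar value to each confusion-matrix cell and sweep the threshold. The dollar figures and model scores below are hypothetical, chosen so that a false negative costs far more than a false positive:

```python
def expected_profit(y_true, scores, threshold, value_tp, cost_fp, cost_fn):
    """Profit of operating the classifier at a given threshold."""
    profit = 0.0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            profit += value_tp        # e.g. a churner saved by the offer
        elif actual == 0 and predicted == 1:
            profit -= cost_fp         # a wasted offer
        elif actual == 1 and predicted == 0:
            profit -= cost_fn         # a missed churner
    return profit                     # true negatives cost nothing here

def best_threshold(y_true, scores, value_tp, cost_fp, cost_fn):
    """Sweep candidate thresholds and keep the most profitable one."""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda t: expected_profit(
        y_true, scores, t, value_tp, cost_fp, cost_fn))

# Toy setup: missing a churner (200) costs ten times a wasted offer (20).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
th = best_threshold(y_true, scores, value_tp=200, cost_fp=20, cost_fn=200)
print(th, expected_profit(y_true, scores, th, 200, 20, 200))
```

With this cost structure the sweep lands near 0.2 rather than 0.5: when false negatives dominate, casting a wider net pays.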


Robust Evaluation

  1. Cross-validation provides reliable performance estimates; a single train-test split does not. K-fold cross-validation trains and evaluates the model K times on different data splits, producing a mean and standard deviation that indicate how robust the performance estimate is. Use stratified K-fold for imbalanced classes. Use time-series cross-validation for temporal data — standard K-fold allows future data to leak into the training set.

  2. Hyperparameter tuning is a cost center that should be managed strategically. Grid search is exhaustive but expensive: the number of combinations grows multiplicatively with every hyperparameter added. Random search is typically more efficient — it explores a wider range of each hyperparameter and often finds comparably good results in fewer iterations. Bayesian optimization goes further by learning from previous evaluations. Match your tuning strategy to your compute budget.
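Stratification is the key detail for imbalanced classes. A minimal sketch of stratified K-fold splitting, assuming a simple round-robin assignment within each class (in practice scikit-learn's `StratifiedKFold` does this; the point here is to show how stratification preserves class proportions in every fold):

```python
import random

def stratified_kfold_indices(y, k, seed=0):
    """Yield (train, test) index pairs with class proportions preserved per fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                  # randomize within each class
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)          # deal the class round-robin
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Toy imbalanced labels: 20% positive. Every test fold keeps that 20% rate.
y = [1] * 10 + [0] * 40
for train, test in stratified_kfold_indices(y, k=5):
    print(len(train), len(test), sum(y[i] for i in test) / len(test))
```

Plain K-fold on these labels could easily produce a test fold with zero positives, making precision and recall undefined for that fold; stratification rules that out.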


Online Evaluation

  1. A/B testing is the gold standard for validating models in the real world. Offline metrics tell you how the model performs on historical data. Online experiments tell you how it performs with real users making real decisions. Deployment requires three conditions: statistical significance (the effect is real), practical significance (the effect is large enough to matter), and guardrail metric stability (the model is not improving one metric while degrading others).
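The statistical-significance condition can be sketched as a two-proportion z-test on conversion rates. The sample sizes and conversion counts below are hypothetical:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2x normal tail
    return z, p_value

# Hypothetical experiment: control offers vs. model-targeted offers.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
lift = 260 / 10_000 - 200 / 10_000
print(f"z={z:.2f}, p={p:.4f}, lift={lift:.3f}")
# Statistical significance says the lift is probably real; practical
# significance asks whether 0.6 percentage points is worth deploying.
```

Guardrail metrics get the same treatment in reverse: the experiment passes only if they show no significant degradation.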


Model Selection

  1. The best model is not always the most accurate model. Model selection is a multi-dimensional decision across predictive performance, interpretability, latency, fairness, cost, and organizational fit. Athena chose a logistic regression over a gradient boosting model — not because it was more accurate (it wasn't), but because it delivered higher expected profit, the operations team could explain it to customers, and it cost one-sixteenth as much to run.

  2. Interpretability is a business requirement, not a technical luxury. When customers ask "Why did I receive this offer?" or regulators ask "Why was this applicant denied?", the team needs answers that go beyond "the algorithm decided." As the EU AI Act and other regulations take effect (Chapter 28), model interpretability will become a compliance requirement for high-risk applications. Evaluate interpretability alongside accuracy from the start.


The Business Translation Test

  1. If you cannot express your model's value in a single sentence a non-technical executive would understand, you have not finished evaluating it. The Business Translation Test — "This model identifies [what] with [metric], enabling [action], resulting in [impact], at [cost], for [ROI]" — forces the evaluator to connect technical performance to business value. Every model should pass this test before it enters a deployment conversation.

  2. Knowing which model not to deploy is as valuable as knowing which model to deploy. Tom's realization that his high-accuracy, high-AUC gradient boosting model was "the wrong model for this deployment" was a more valuable business insight than any metric he computed. The goal of evaluation is not to produce a number. The goal is to produce a decision.