Key Takeaways: Chapter 16
Model Evaluation Deep Dive
- How you evaluate your model is more important than which model you choose. A mediocre model with honest evaluation delivers exactly what you expect in production. A brilliant model with broken evaluation will surprise you in the worst possible way. Every bad production model I have encountered got there because of evaluation failure, not model selection failure. Invest more time in evaluation than in hyperparameter tuning.
- For classification, always use StratifiedKFold. When entities repeat across rows, use StratifiedGroupKFold. When data is time-ordered, use TimeSeriesSplit. Standard K-fold does not preserve class balance. Stratified K-fold does not prevent entity leakage. Neither prevents temporal leakage. Choosing the wrong cross-validation strategy inflates your performance estimates and produces models that underperform in production. Ask two questions before choosing: "Can the same entity appear in both train and test?" and "Does the order of observations matter?"
- Data leakage is the most dangerous evaluation error because it is invisible. A leaked model looks perfect in development and collapses in production. The two most common forms are target leakage (a feature contains information about the outcome that would not be available at prediction time) and train-test contamination (preprocessing fitted on the full dataset including test data). Detect leakage by auditing feature importances (any feature above 0.25-0.30 warrants investigation) and by asking the temporal question for every feature: "Would I know this value at the time I need to make a prediction?"
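The importance audit can be sketched as follows. This is a hypothetical simulation: feature 4 is deliberately constructed as a leaked near-copy of the label, and the 0.30 cutoff mirrors the investigation threshold above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)
X[:, 4] = y + rng.normal(scale=0.05, size=n)  # simulated target leakage

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Flag any feature whose importance exceeds the investigation threshold.
suspicious = [i for i, imp in enumerate(model.feature_importances_) if imp > 0.30]
print(suspicious)
```

The leaked feature absorbs nearly all of the importance mass, which is exactly the "too good to be true" signature the audit is looking for.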
- Accuracy is almost never the right metric for imbalanced problems. A model that predicts the majority class for every example achieves accuracy equal to 1 minus the minority rate. For an 8% churn problem, that is 92% accuracy with zero predictive value. Use precision, recall, F1, AUC-PR, or a business-cost-weighted metric instead. If someone reports only accuracy on an imbalanced problem, the number is meaningless.
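A quick sketch of the majority-class baseline on hypothetical 8% churn data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 8% churn problem: 920 stayers, 80 churners.
y_true = np.array([0] * 920 + [1] * 80)
y_pred = np.zeros_like(y_true)  # always predict the majority class ("no churn")

acc = accuracy_score(y_true, y_pred)            # 92% accuracy with zero skill
f1 = f1_score(y_true, y_pred, zero_division=0)  # F1 of 0.0 exposes the failure
print(acc, f1)
```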
- The precision-recall tradeoff must be resolved by the business problem, not by convention. The default threshold of 0.50 is optimal only when false positives and false negatives have equal cost, which almost never happens in practice. When retention offers cost $5 and saving a churner is worth $180, the optimal threshold can be as low as 0.05-0.10. When a missed hospital readmission costs $15,000 and an unnecessary follow-up costs $850, the optimal threshold is even lower. Calculate the break-even precision and set your threshold accordingly.
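Under a simple expected-value model (a sketch, not the chapter's full cost analysis), intervening on a customer is worthwhile when p × benefit exceeds cost, which gives a break-even probability floor directly:

```python
def break_even_threshold(intervention_cost: float, benefit_if_positive: float) -> float:
    """Act on a prediction when p * benefit > cost, i.e. when p > cost / benefit."""
    return intervention_cost / benefit_if_positive

# Retention offer: $5 cost against $180 of value from a saved churner.
print(break_even_threshold(5, 180))  # ~0.028, far below the default 0.50
```

This naive floor ignores offer acceptance rates and calibration error, so deployed thresholds are tuned on a validation set rather than taken from this formula alone; the point is that the cost structure, not convention, sets the threshold.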
- AUC-PR is more informative than AUC-ROC for imbalanced classification. AUC-ROC measures ranking quality across all thresholds and is dominated by the easy-to-classify majority class. It can look optimistically high even when the model performs poorly on the minority class you care about. AUC-PR focuses exclusively on the positive class, with a baseline equal to the positive rate. For churn, fraud, medical screening, and any imbalanced prediction task, AUC-PR tells you what you need to know.
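The gap between the two metrics is easy to demonstrate on simulated data (hypothetical scores with a weak signal and a 2% positive rate):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
n = 10_000
y = (rng.random(n) < 0.02).astype(int)  # ~2% positive rate
scores = rng.normal(size=n) + y         # positives score one unit higher on average

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)  # AUC-PR; baseline equals the positive rate
print(round(roc, 2), round(pr, 2))
# AUC-ROC looks healthy while AUC-PR stays low,
# reflecting weak performance on the rare positive class.
```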
- Learning curves, validation curves, and calibration curves answer questions that raw metrics cannot. Learning curves tell you whether more data would help (validation score still climbing) or whether you need a better model (validation score plateaued). Validation curves show where a hyperparameter transitions from underfitting to overfitting. Calibration curves verify that predicted probabilities match observed frequencies, which matters whenever downstream decisions depend on the probability estimates, not just the ranking.
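The calibration check can be sketched with scikit-learn's `calibration_curve` (simulated, perfectly calibrated probabilities, so observed frequencies should track predictions closely):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p = rng.random(20_000)                    # hypothetical predicted probabilities
y = (rng.random(20_000) < p).astype(int)  # outcomes drawn at exactly those rates

# Bin predictions and compare predicted probability to observed frequency per bin.
frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
max_gap = float(np.abs(frac_pos - mean_pred).max())
print(round(max_gap, 3))  # small gap: predictions match observed frequencies
```

A miscalibrated model shows the opposite pattern: bins where the observed frequency drifts well away from the diagonal.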
- Use statistical tests to compare models, not just point estimates. A 0.005 AUC difference between two models is meaningless if the standard deviation across cross-validation folds is 0.015. Use paired t-tests on fold-by-fold scores to determine whether differences are statistically significant, and use Cohen's d to determine whether significant differences are practically meaningful. A statistically significant but negligible difference should not drive model selection --- choose the simpler, faster, or more maintainable model instead.
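The fold-level comparison can be sketched as follows (hypothetical 10-fold AUC scores; model B's per-fold edge averages 0.005, echoing the scenario above):

```python
import numpy as np
from scipy import stats

# Hypothetical fold-by-fold AUC scores for model A, and model B's per-fold edge.
model_a = np.array([0.812, 0.798, 0.805, 0.820, 0.801,
                    0.815, 0.809, 0.796, 0.811, 0.803])
diff = np.array([0.004, -0.010, 0.018, 0.001, 0.012,
                 -0.008, 0.015, 0.003, -0.005, 0.020])  # mean +0.005
model_b = model_a + diff

t_stat, p_value = stats.ttest_rel(model_b, model_a)  # paired t-test on folds
cohens_d = diff.mean() / diff.std(ddof=1)            # effect size of the difference
print(round(p_value, 3), round(cohens_d, 2))
# p > 0.05: the 0.005 edge is not distinguishable from fold-to-fold noise.
```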
- When offline metrics disagree with A/B test results, the A/B test is right. Offline evaluation is a proxy for real-world impact; A/B testing measures real-world impact directly. Disagreements arise from metric mismatch (optimizing AUC offline but caring about conversion in production), distribution shift (test set does not represent production traffic), calibration differences, and feedback loops. Always run an A/B test before full deployment. No amount of offline evaluation can replace a production experiment.
- Translate ML metrics into business language before presenting results. "AUC-ROC of 0.71" means nothing to a VP of Product or a chief medical officer. "The model identifies 89% of future churners, enabling a retention program worth $292,000 annually" starts a useful conversation. "The model catches 21 out of 24 heart failure patients who will be readmitted" is a metric a clinician can act on. The evaluation is not complete until the results are expressed in terms the decision-maker understands.
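The translation is just arithmetic once the business inputs are pinned down. Every input below is an illustrative assumption, not a figure from the chapter:

```python
# Illustrative inputs (all hypothetical): annual churners, model quality,
# offer economics, and how often a contacted churner is actually retained.
churners_per_year = 2000
recall = 0.89          # fraction of churners the model identifies
precision = 0.25       # fraction of contacted customers who are true churners
save_rate = 0.30       # contacted true churners who are actually retained
value_per_save = 180   # dollars retained per saved churner
offer_cost = 5         # dollars per retention offer sent

identified = churners_per_year * recall  # true churners the model flags
contacted = identified / precision       # total offers sent to hit that recall
saves = identified * save_rate
net_value = saves * value_per_save - contacted * offer_cost
print(f"${net_value:,.0f} per year")
```

The point of the exercise is the sentence it produces ("a retention program worth $X annually"), not the model metrics that went into it.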
If You Remember One Thing
The metric you choose determines the model you build, the threshold you deploy, and the outcomes you produce. Accuracy on an imbalanced problem hides failure. AUC-ROC on a minority-class prediction problem inflates success. A default threshold of 0.50 optimizes for a cost structure that does not match your business. The single most valuable skill in model evaluation is choosing the metric that measures what the business actually cares about, then optimizing for that metric end to end. Everything else --- cross-validation, leakage detection, statistical tests --- exists to ensure that the metric you report is an honest estimate of the metric you will observe in production.
These takeaways summarize Chapter 16: Model Evaluation Deep Dive. Return to the chapter for full context.