Key Takeaways: Evaluating Models
This is your reference card for Chapter 29. The core lesson: "how good is your model?" is never a yes/no question. It's always "good at what, for whom, and by which measure?"
Key Concepts
- **Accuracy is necessary but not sufficient.** Accuracy tells you the percentage of correct predictions, but it's blind to how the model is wrong. With imbalanced classes, accuracy is dominated by the majority class and can be dangerously misleading: a model that predicts "no fraud" for every transaction is 99.8% accurate and 100% useless.
- **The confusion matrix shows you the full picture.** Four outcomes: True Positives (caught!), False Positives (false alarm), True Negatives (correctly cleared), and False Negatives (missed!). Each has different real-world costs. You can't evaluate a model without knowing which types of errors it makes.
- **Precision and recall capture different kinds of "good."** Precision = "when I flag something, am I right?" Recall = "of all the things I should have caught, how many did I find?" They trade off against each other, and the right balance depends on the cost of each type of error.
- **F1 balances precision and recall.** The harmonic mean of precision and recall, F1 punishes extreme imbalance between the two. Use it when both types of errors matter roughly equally. Don't use it when one type of error is far more costly than the other.
- **ROC curves and AUC compare models across all thresholds.** The ROC curve shows the trade-off between true positive rate and false positive rate at every possible classification threshold. AUC (Area Under the Curve) collapses this into a single number representing overall ranking quality.
- **Cross-validation gives reliable performance estimates.** A single train/test split is a single dice roll. K-fold cross-validation averages over k splits, reducing the impact of random variation. Always report the standard deviation alongside the mean.
- **The right metric is a business decision.** Before evaluating any model, ask: "What is the cost of a false positive? What is the cost of a false negative?" The answer determines which metric to optimize.
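The first takeaway is easy to verify directly: on heavily imbalanced data, a classifier that never flags anything scores near-perfect accuracy but zero recall. A minimal sketch (the 1,000-transaction fraud counts are invented for illustration):

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical dataset: 998 legitimate transactions, 2 fraudulent
y_true = [0] * 998 + [1] * 2

# A "model" that predicts "no fraud" for every single transaction
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.998 -- looks great
print(recall_score(y_true, y_pred))    # 0.0   -- catches no fraud at all
```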
The Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | True Positive (TP): "correctly caught" | False Negative (FN): "missed it" |
| **Actual Negative** | False Positive (FP): "false alarm" | True Negative (TN): "correctly cleared" |
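In scikit-learn, `confusion_matrix` returns this grid with actual labels as rows and predicted labels as columns. A common idiom for pulling out the four counts (the toy labels here are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# ravel() flattens the 2x2 grid in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 2 caught, 1 false alarm, 1 missed, 4 cleared
```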
Metrics Cheat Sheet
| Metric | Formula | What It Answers | Use When |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | How often is the model correct overall? | Classes are balanced |
| Precision | TP / (TP+FP) | When the model says "yes," how often is it right? | False alarms are costly |
| Recall | TP / (TP+FN) | Of all actual positives, how many did we find? | Missed cases are costly |
| F1 Score | 2(P*R)/(P+R) | Balance between precision and recall | Both types of errors matter |
| AUC | Area under ROC curve | How well does the model rank positives above negatives? | Comparing models overall |
| MAE | mean(|actual-predicted|) | Average error magnitude (regression) | All errors equally costly |
| RMSE | sqrt(mean((actual-predicted)²)) | Average error with penalty for large errors | Big errors are costly |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained | Summarizing regression fit |
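The precision, recall, and F1 formulas in the table can be checked against scikit-learn directly; a quick sanity check on invented labels that give TP = 3, FP = 2, FN = 1, TN = 4:

```python
from math import isclose
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels chosen so that TP = 3, FP = 2, FN = 1, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = 3 / (3 + 2)   # TP / (TP + FP) = 0.60
recall = 3 / (3 + 1)      # TP / (TP + FN) = 0.75
f1 = 2 * precision * recall / (precision + recall)

# Hand-computed values match the library implementations
assert isclose(precision, precision_score(y_true, y_pred))
assert isclose(recall, recall_score(y_true, y_pred))
assert isclose(f1, f1_score(y_true, y_pred))
```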
When to Prioritize Which Metric
What's more costly?

- **Missing a true positive** (missed disease, missed fraud) → prioritize **recall**: "catch as many as possible."
- **Raising a false alarm** (blocking a legit transaction, sending an important email to spam) → prioritize **precision**: "be sure before you flag."
- **Both matter equally** → use **F1 score**: "balance precision and recall."
- **Comparing models at all thresholds** → use **AUC**: "which model ranks better overall?"
- **Don't know the threshold yet** → use **AUC**: "evaluate ranking, set the threshold later."
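The precision/recall choice is ultimately a threshold choice: lowering the classification threshold catches more positives (higher recall) at the cost of more false alarms (lower precision). A sketch on synthetic data, with the dataset and model chosen purely as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~90% negative), for illustration only
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Sweeping the threshold trades precision against recall
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}: precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```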
Cross-Validation Quick Reference
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Basic 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Stratified K-Fold (preserves class distribution in each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Multiple metrics at once
from sklearn.model_selection import cross_validate
results = cross_validate(model, X, y, cv=skf,
                         scoring=['accuracy', 'f1', 'roc_auc'])
```
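The snippets above assume `model`, `X`, and `y` already exist. A self-contained end-to-end run, using a synthetic dataset and logistic regression purely as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data and model, for illustration only
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Report the mean AND the standard deviation, as the takeaways advise
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```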
Scikit-learn Evaluation Quick Reference
```python
from sklearn.metrics import (
    accuracy_score,
    precision_score, recall_score, f1_score,
    classification_report,
    confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score,
)

# Classification
print(classification_report(y_test, y_pred))

# Confusion matrix (visual)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

# ROC curve
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

# Regression
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
```
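For the regression metrics, it's worth computing the formulas by hand once to see that they match the library; the four actual/predicted values below are invented for the check:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny made-up regression example
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted              # [0.5, 0.0, -2.0, -1.0]
mae = np.mean(np.abs(errors))            # 0.875
rmse = np.sqrt(np.mean(errors ** 2))     # ~1.146 (penalizes the large error)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

# Hand-computed values match the library implementations
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(rmse, mean_squared_error(actual, predicted) ** 0.5)
assert np.isclose(r2, r2_score(actual, predicted))
```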
Common Mistakes to Avoid
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Evaluating only on training data | Shows how well the model memorizes, not how well it generalizes | Always use test set or cross-validation |
| Using accuracy with imbalanced classes | Accuracy is dominated by the majority class | Use F1, precision, recall, or AUC |
| Comparing models on a single split | Results depend on which samples are in which set | Use k-fold cross-validation |
| Ignoring the standard deviation of CV scores | "0.85 +/- 0.10" is very different from "0.85 +/- 0.01" | Always report both mean and std |
| Optimizing the wrong metric | Maximizing accuracy when recall matters leads to poor models | Choose the metric that reflects real-world costs |
| Peeking at the test set during development | Contaminates your final evaluation | Touch the test set exactly once, at the end |
What You Should Be Able to Do Now
- [ ] Explain why accuracy is misleading with imbalanced classes
- [ ] Construct and interpret a confusion matrix
- [ ] Compute precision, recall, and F1 and explain when each matters
- [ ] Generate a ROC curve and interpret AUC
- [ ] Implement k-fold cross-validation and report mean +/- std
- [ ] Compare multiple models using the right metric for the problem
- [ ] Apply regression metrics (MAE, RMSE, R²) and explain what each measures
- [ ] Choose the right evaluation metric based on the costs of different errors
- [ ] Identify common evaluation mistakes (leakage, peeking, wrong metric)
If you checked every box, you're ready for Chapter 30 — where you'll tie everything together into clean, reproducible scikit-learn pipelines. It's the capstone of Part V, and it will change how you build models forever.