Key Takeaways: Evaluating Models
This is your reference card for Chapter 29. The core lesson: "how good is your model?" is never a yes/no question. It's always "good at what, for whom, and by which measure?"
Key Concepts
- **Accuracy is necessary but not sufficient.** Accuracy tells you the percentage of correct predictions, but it's blind to how the model is wrong. With imbalanced classes, accuracy is dominated by the majority class and can be dangerously misleading: a model that predicts "no fraud" for every transaction is 99.8% accurate and 100% useless.
- **The confusion matrix shows you the full picture.** Four outcomes: True Positives (caught!), False Positives (false alarm), True Negatives (correctly cleared), and False Negatives (missed!). Each has different real-world costs. You can't evaluate a model without knowing which types of errors it makes.
- **Precision and recall capture different kinds of "good."** Precision = "when I flag something, am I right?" Recall = "of all the things I should have caught, how many did I find?" They trade off against each other, and the right balance depends on the cost of each type of error.
- **F1 balances precision and recall.** The harmonic mean of precision and recall, F1 punishes extreme imbalance between the two. Use it when both types of errors matter roughly equally. Don't use it when one type of error is far more costly than the other.
- **ROC curves and AUC compare models across all thresholds.** The ROC curve shows the trade-off between true positive rate and false positive rate at every possible classification threshold. AUC (Area Under the Curve) collapses this into a single number representing overall ranking quality.
- **Cross-validation gives reliable performance estimates.** A single train/test split is a single dice roll. K-fold cross-validation averages over k splits, reducing the impact of random variation. Always report the standard deviation alongside the mean.
- **The right metric is a business decision.** Before evaluating any model, ask: "What is the cost of a false positive? What is the cost of a false negative?" The answer determines which metric to optimize.
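The first takeaway is easy to verify directly: on heavily imbalanced data, a classifier that never flags anything scores near-perfect accuracy but zero recall. A minimal sketch (the 1,000-transaction fraud counts are invented for illustration):

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical dataset: 998 legitimate transactions, 2 fraudulent
y_true = [0] * 998 + [1] * 2

# A "model" that predicts "no fraud" for every single transaction
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.998 -- looks great
print(recall_score(y_true, y_pred))    # 0.0   -- catches no fraud at all
```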
The Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | True Positive (TP): "correctly caught" | False Negative (FN): "missed it" |
| **Actual Negative** | False Positive (FP): "false alarm" | True Negative (TN): "correctly cleared" |
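In scikit-learn, `confusion_matrix` returns this grid with actual labels as rows and predicted labels as columns. A common idiom for pulling out the four counts (the toy labels here are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# ravel() flattens the 2x2 grid in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 2 caught, 1 false alarm, 1 missed, 4 cleared
```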
Metrics Cheat Sheet
| Metric | Formula | What It Answers | Use When |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | How often is the model correct overall? | Classes are balanced |
| Precision | TP / (TP+FP) | When the model says "yes," how often is it right? | False alarms are costly |
| Recall | TP / (TP+FN) | Of all actual positives, how many did we find? | Missed cases are costly |
| F1 Score | 2(P*R)/(P+R) | Balance between precision and recall | Both types of errors matter |
| AUC | Area under ROC curve | How well does the model rank positives above negatives? | Comparing models overall |
| MAE | mean(|actual-predicted|) | Average error magnitude (regression) | All errors equally costly |
| RMSE | sqrt(mean((actual-predicted)²)) | Average error with penalty for large errors | Big errors are costly |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained | Summarizing regression fit |
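The precision, recall, and F1 formulas in the table can be checked against scikit-learn directly; a quick sanity check on invented labels that give TP = 3, FP = 2, FN = 1, TN = 4:

```python
from math import isclose
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels chosen so that TP = 3, FP = 2, FN = 1, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = 3 / (3 + 2)   # TP / (TP + FP) = 0.60
recall = 3 / (3 + 1)      # TP / (TP + FN) = 0.75
f1 = 2 * precision * recall / (precision + recall)

# Hand-computed values match the library implementations
assert isclose(precision, precision_score(y_true, y_pred))
assert isclose(recall, recall_score(y_true, y_pred))
assert isclose(f1, f1_score(y_true, y_pred))
```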
When to Prioritize Which Metric
What's more costly?

- **Missing a true positive** (missed disease, missed fraud) → prioritize **recall**: "catch as many as possible."
- **Raising a false alarm** (blocking a legit transaction, sending an important email to spam) → prioritize **precision**: "be sure before you flag."
- **Both matter equally** → use **F1 score**: "balance precision and recall."
- **Comparing models at all thresholds** → use **AUC**: "which model ranks better overall?"
- **Don't know the threshold yet** → use **AUC**: "evaluate ranking, set the threshold later."
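The precision/recall choice is ultimately a threshold choice: lowering the classification threshold catches more positives (higher recall) at the cost of more false alarms (lower precision). A sketch on synthetic data, with the dataset and model chosen purely as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~90% negative), for illustration only
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Sweeping the threshold trades precision against recall
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}: precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```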
Cross-Validation Quick Reference
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Basic 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Stratified K-Fold (preserves class distribution in each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Multiple metrics at once
from sklearn.model_selection import cross_validate
results = cross_validate(model, X, y, cv=skf,
                         scoring=['accuracy', 'f1', 'roc_auc'])
```
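The snippets above assume `model`, `X`, and `y` already exist. A self-contained end-to-end run, using a synthetic dataset and logistic regression purely as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data and model, for illustration only
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

# Report the mean AND the standard deviation, as the takeaways advise
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```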
Scikit-learn Evaluation Quick Reference
```python
from sklearn.metrics import (
    accuracy_score,
    precision_score, recall_score, f1_score,
    classification_report,
    confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score,
)

# Classification
print(classification_report(y_test, y_pred))

# Confusion matrix (visual)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

# ROC curve
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

# Regression
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
```
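For the regression metrics, it's worth computing the formulas by hand once to see that they match the library; the four actual/predicted values below are invented for the check:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny made-up regression example
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted              # [0.5, 0.0, -2.0, -1.0]
mae = np.mean(np.abs(errors))            # 0.875
rmse = np.sqrt(np.mean(errors ** 2))     # ~1.146 (penalizes the large error)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

# Hand-computed values match the library implementations
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(rmse, mean_squared_error(actual, predicted) ** 0.5)
assert np.isclose(r2, r2_score(actual, predicted))
```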
Common Mistakes to Avoid
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Evaluating only on training data | Shows how well the model memorizes, not how well it generalizes | Always use test set or cross-validation |
| Using accuracy with imbalanced classes | Accuracy is dominated by the majority class | Use F1, precision, recall, or AUC |
| Comparing models on a single split | Results depend on which samples are in which set | Use k-fold cross-validation |
| Ignoring the standard deviation of CV scores | "0.85 +/- 0.10" is very different from "0.85 +/- 0.01" | Always report both mean and std |
| Optimizing the wrong metric | Maximizing accuracy when recall matters leads to poor models | Choose the metric that reflects real-world costs |
| Peeking at the test set during development | Contaminates your final evaluation | Touch the test set exactly once, at the end |
What You Should Be Able to Do Now
- [ ] Explain why accuracy is misleading with imbalanced classes
- [ ] Construct and interpret a confusion matrix
- [ ] Compute precision, recall, and F1 and explain when each matters
- [ ] Generate a ROC curve and interpret AUC
- [ ] Implement k-fold cross-validation and report mean +/- std
- [ ] Compare multiple models using the right metric for the problem
- [ ] Apply regression metrics (MAE, RMSE, R²) and explain what each measures
- [ ] Choose the right evaluation metric based on the costs of different errors
- [ ] Identify common evaluation mistakes (leakage, peeking, wrong metric)
If you checked every box, you're ready for Chapter 30 — where you'll tie everything together into clean, reproducible scikit-learn pipelines. It's the capstone of Part V, and it will change how you build models forever.