Chapter 29 Exercises: Evaluating Models
How to use these exercises: Start with the conceptual questions to make sure you understand why different metrics exist. Then move to applied exercises where you compute and interpret these metrics in code. The real-world and synthesis sections push you to think about which metrics matter in specific contexts — the skill that separates beginners from practitioners.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
Exercise 29.1 — The accuracy trap
A model for predicting whether a customer will churn achieves 92% accuracy on a dataset where 92% of customers do NOT churn. Without running any code, answer:
- What is the simplest model that achieves 92% accuracy on this dataset?
- What is the recall of this simple model for the "churn" class?
- Why is this model useless despite its high accuracy?
Guidance
1. The simplest model is one that predicts "no churn" for every customer. It correctly classifies all 92% of non-churners.
2. Recall for the churn class = 0/churn_total = 0.0. It catches zero churners.
3. The whole point of the model is to identify customers who are about to churn so you can intervene. A model that never predicts churn identifies zero of these customers. It's 92% accurate and 100% useless for its intended purpose. This is why accuracy alone is misleading with imbalanced classes.
Exercise 29.2 — Confusion matrix from scratch
A model makes the following predictions for 20 patients tested for a disease:
| Patient | Actual | Predicted |
|---|---|---|
| 1-5 | Positive | Positive |
| 6-8 | Positive | Negative |
| 9-10 | Negative | Positive |
| 11-20 | Negative | Negative |
Construct the confusion matrix. Then compute accuracy, precision, recall, and F1 score by hand.
Guidance
- TP = 5 (patients 1-5: actually positive, predicted positive)
- FN = 3 (patients 6-8: actually positive, predicted negative)
- FP = 2 (patients 9-10: actually negative, predicted positive)
- TN = 10 (patients 11-20: actually negative, predicted negative)

Confusion matrix:

| | Predicted + | Predicted − |
|---|---|---|
| Actual + | 5 (TP) | 3 (FN) |
| Actual − | 2 (FP) | 10 (TN) |
- Accuracy = (5+10)/20 = 15/20 = 0.75
- Precision = 5/(5+2) = 5/7 = 0.714
- Recall = 5/(5+3) = 5/8 = 0.625
- F1 = 2 × (0.714 × 0.625)/(0.714 + 0.625) = 2 × 0.446/1.339 = 0.667
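These hand computations can be verified in code. A minimal sketch that reconstructs the 20 patients from the table as label arrays (1 = positive, 0 = negative) and checks the metrics with scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Patients 1-8 are actually positive, 9-20 actually negative
y_true = [1]*8 + [0]*12
# Predictions: 1-5 positive, 6-8 negative, 9-10 positive, 11-20 negative
y_pred = [1]*5 + [0]*3 + [1]*2 + [0]*10

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)                              # 5 3 2 10
print(round(precision_score(y_true, y_pred), 3))   # 0.714
print(round(recall_score(y_true, y_pred), 3))      # 0.625
print(round(f1_score(y_true, y_pred), 3))          # 0.667
```

Note that `confusion_matrix(...).ravel()` returns the counts in the order `tn, fp, fn, tp` for binary labels, which is easy to get backwards.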
Exercise 29.3 — Precision vs. recall scenarios
For each of the following scenarios, state whether you would prioritize precision or recall, and explain why:
- A model that flags potentially defective airplane engine parts for manual inspection
- A model that recommends movies to Netflix users
- A model that identifies potential spam emails for the spam folder
- A model that screens blood donations for HIV
- A model that identifies candidates for a marketing promotion
Guidance
1. **Recall** — Missing a defective engine part (false negative) could cause a crash. Better to flag too many parts for inspection than to miss a defective one.
2. **Precision** — Recommending a bad movie isn't dangerous, but too many bad recommendations erode user trust. You want your recommendations to be mostly good (precision), even if you miss some good movies (lower recall).
3. **Precision** — Putting a legitimate email in spam (false positive) could mean missing an important message. Better to let some spam through than lose real emails.
4. **Recall** — Missing contaminated blood (false negative) could infect a patient. You must catch virtually every positive case, even if it means testing some false positives again.
5. **Precision** — Each promotion has a cost. Sending promotions to people who won't respond (false positives) wastes money. You want to target people who are genuinely likely to respond.
Exercise 29.4 — F1 score intuition
Why does the F1 score use the harmonic mean of precision and recall instead of the arithmetic mean? Demonstrate with an example where precision = 1.0 and recall = 0.01.
Guidance
Arithmetic mean: (1.0 + 0.01)/2 = 0.505 — suggests a mediocre model.
Harmonic mean (F1): 2 × (1.0 × 0.01)/(1.0 + 0.01) = 0.02/1.01 = 0.0198 — reveals the model is nearly useless.
The harmonic mean penalizes extreme imbalances. A model with precision 1.0 and recall 0.01 is one that almost never makes a positive prediction — and when it does, it's right, but it misses 99% of actual positives. This model is terrible, and the harmonic mean correctly reflects that. The arithmetic mean would hide the problem.
Exercise 29.5 — ROC curve reading
Describe what each of the following ROC curves would look like and what kind of model would produce each:
- A curve that goes straight up to (0, 1) and then across to (1, 1)
- A curve that follows the diagonal from (0, 0) to (1, 1)
- A curve that bows toward the upper-left corner
- A curve that bows toward the lower-right corner
Guidance
1. A **perfect classifier** — it achieves 100% true positive rate with 0% false positive rate. AUC = 1.0. This rarely happens in practice.
2. A **random classifier** — no better than flipping a coin. AUC = 0.5. The model provides no useful information.
3. A **good classifier** — it achieves high true positive rates with relatively low false positive rates. AUC > 0.5. The more it bows toward the upper-left, the better the model.
4. A **worse-than-random classifier** — AUC < 0.5. This model is systematically wrong. If you flipped its predictions, it would be better than random. Something is likely wrong with the labels or the features.
Exercise 29.6 — Cross-validation purpose
Explain in your own words why 5-fold cross-validation gives a more reliable estimate of model performance than a single 80/20 train/test split. What specific problem does it address?
Guidance
A single train/test split is sensitive to *which specific samples* happen to land in each set. You might get a lucky split (easy test samples) or an unlucky one (hard test samples). Cross-validation addresses this by training and testing on multiple different splits, then averaging the results. This averages out the "luck of the draw" effect. Additionally, every sample is used for testing exactly once, so you use all your data for evaluation without wasting any. The standard deviation across folds tells you how stable the model's performance is.
Exercise 29.7 — Regression metrics comparison
A model predicts house prices (in thousands of dollars) with the following actual vs. predicted values:
| Actual | Predicted | Error |
|---|---|---|
| 200 | 210 | 10 |
| 300 | 290 | 10 |
| 150 | 180 | 30 |
| 500 | 490 | 10 |
| 250 | 240 | 10 |
Compute MAE and RMSE by hand. Why is RMSE larger than MAE? What does that tell you about the errors?
Guidance
MAE = (10 + 10 + 30 + 10 + 10)/5 = 70/5 = 14.0
MSE = (100 + 100 + 900 + 100 + 100)/5 = 1300/5 = 260.0
RMSE = sqrt(260) ≈ 16.12
RMSE > MAE because RMSE squares the errors before averaging, which gives extra weight to the large error (30). The gap between RMSE and MAE tells you that errors are not uniform — there's at least one outlier prediction (the $150K house predicted as $180K). If all errors were equal (all 14.0), RMSE would equal MAE.
Part B: Applied Exercises ⭐⭐
Exercise 29.8 — Confusion matrix in code
Load the Breast Cancer dataset from scikit-learn (from sklearn.datasets import load_breast_cancer). Train a logistic regression model with a 70/30 split. Generate and display the confusion matrix using ConfusionMatrixDisplay. How many false negatives are there? In medical terms, what does a false negative mean?
Guidance
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Note: in this dataset, class 0 is malignant and class 1 is benign
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names
)
```
A false negative in this context means a malignant tumor (cancer) was classified as benign. The patient would not receive treatment — a potentially fatal mistake.
Exercise 29.9 — Classification report interpretation
Using the same breast cancer model from Exercise 29.8, generate a classification_report. Write a one-paragraph interpretation for a non-technical audience (e.g., a hospital administrator) explaining: what the model does, how well it performs, and what its biggest weakness is.
Guidance
Focus on making the numbers meaningful. Instead of "precision is 0.97," say "when the model flags a tumor as malignant, it's correct about 97% of the time." Instead of "recall is 0.93," say "of all tumors that actually are malignant, the model catches 93% — but misses about 7%, which would need to be caught through other screening methods."
Exercise 29.10 — Threshold tuning
Using the breast cancer logistic regression model, experiment with different classification thresholds. For thresholds of [0.3, 0.4, 0.5, 0.6, 0.7], compute the precision and recall for the malignant class. Plot precision and recall vs. threshold on the same chart. At which threshold would you deploy this model for medical screening? Why?
Guidance
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Note: in load_breast_cancer, class 0 is malignant, so take column 0
y_proba = model.predict_proba(X_test)[:, 0]
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
    y_pred_t = np.where(y_proba >= t, 0, 1)  # predict malignant (0) above threshold
    p = precision_score(y_test, y_pred_t, pos_label=0)
    r = recall_score(y_test, y_pred_t, pos_label=0)
    print(f"threshold {t}: precision {p:.3f}, recall {r:.3f}")
```
For medical screening, you'd likely choose a lower threshold (e.g., 0.3) to maximize recall — catching as many cancers as possible, even at the cost of more false positives (which lead to additional testing, not missed diagnoses).
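The sweep-and-plot workflow can be sketched end to end. This is one possible solution, not the only one; the 70/30 split and `random_state=42` are assumptions for reproducibility, and it uses the probability of class 0 because that class is malignant in this dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Class 0 is malignant in this dataset, so sweep the malignant probability
proba_malignant = model.predict_proba(X_test)[:, 0]
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
precisions, recalls = [], []
for t in thresholds:
    y_pred_t = np.where(proba_malignant >= t, 0, 1)  # 0 = malignant
    precisions.append(precision_score(y_test, y_pred_t, pos_label=0))
    recalls.append(recall_score(y_test, y_pred_t, pos_label=0))

plt.plot(thresholds, precisions, marker="o", label="Precision (malignant)")
plt.plot(thresholds, recalls, marker="s", label="Recall (malignant)")
plt.xlabel("Threshold")
plt.legend()
plt.show()
```

As the threshold rises, recall can only fall or stay flat, because raising the bar for predicting malignant never adds predicted positives — the plot should make that trade-off visible.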
Exercise 29.11 — ROC curves for model comparison
Train three models on the breast cancer data (logistic regression, decision tree with max_depth=5, and random forest with 200 trees). Plot all three ROC curves on the same chart with their AUC values in the legend. Which model has the best AUC? Does the ranking match what you expected?
Guidance
```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# For each model: fit, get predict_proba, compute roc_curve and AUC
# Plot all three on the same axes
```
The random forest and logistic regression typically have similar AUC on this dataset. The decision tree often has slightly lower AUC because of its rigid axis-aligned splits.
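The skeleton above can be filled in as follows — a sketch under assumed settings (70/30 split, `random_state=42`, the hyperparameters named in the exercise):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
models = {
    "Logistic regression": LogisticRegression(max_iter=5000),
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Random (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```

Putting the AUC into each legend label keeps the ranking readable directly off the chart.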
Exercise 29.12 — Cross-validation comparison
Using the breast cancer dataset, compare all three models using 5-fold stratified cross-validation with four different scoring metrics: accuracy, precision, recall, and F1. Create a summary table showing mean and standard deviation for each model and metric. Which model would you choose, and does the choice depend on which metric you prioritize?
Guidance
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Assumes models = {"logreg": ..., "tree": ..., "forest": ...} from Exercise 29.11
# and the full feature matrix X and labels y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=skf, scoring=metric)
        print(f"{name} - {metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
Pay attention to cases where the model rankings change depending on the metric. This is common and illustrates why metric choice matters.
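The summary table the exercise asks for can be assembled with pandas. A self-contained sketch using two of the three models for brevity (the model names and hyperparameters here are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rows = []
for name, model in models.items():
    for metric in ["accuracy", "precision", "recall", "f1"]:
        scores = cross_val_score(model, X, y, cv=skf, scoring=metric)
        rows.append({"model": name, "metric": metric,
                     "mean": scores.mean(), "std": scores.std()})
# Pivot to one row per model, one column per metric
summary = pd.DataFrame(rows).pivot(index="model", columns="metric", values="mean")
print(summary.round(3))
```

Collecting results as a list of dicts and pivoting at the end keeps the loop simple and makes it easy to add more models or metrics later.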
Exercise 29.13 — Learning curve diagnosis
Plot the learning curve for a decision tree with max_depth=15 on the breast cancer dataset. Is the model overfitting, underfitting, or well-fitted? Then plot the learning curve for a decision tree with max_depth=2. Compare the two curves and explain what each pattern tells you.
Guidance
```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=15),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)
# Plot train_scores.mean(axis=1) and test_scores.mean(axis=1) vs train_sizes
```
The depth-15 tree should show overfitting (high train score, lower test score with a gap). The depth-2 tree should show underfitting (both scores relatively low and close together).
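One way to put the two learning curves side by side — a self-contained sketch, with `random_state=42` added only for reproducibility:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, depth in zip(axes, [15, 2]):
    sizes, train_scores, val_scores = learning_curve(
        DecisionTreeClassifier(max_depth=depth, random_state=42),
        X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
    )
    # Average across the 5 CV folds at each training-set size
    ax.plot(sizes, train_scores.mean(axis=1), marker="o", label="Train")
    ax.plot(sizes, val_scores.mean(axis=1), marker="s", label="Validation")
    ax.set_title(f"max_depth={depth}")
    ax.set_xlabel("Training set size")
    ax.legend()
plt.show()
```

`learning_curve` returns scores with shape (n_sizes, n_folds), so averaging over `axis=1` gives one curve point per training-set size.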
Exercise 29.14 — Regression evaluation
Load the California Housing dataset (from sklearn.datasets import fetch_california_housing). Train a random forest regressor. Compute MAE, RMSE, and R² on the test set. Write a sentence interpreting each metric for someone who wants to know "how good is this model at predicting house prices?"
Guidance
```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

housing = fetch_california_housing()
# Split, train, predict, then:
# mae = mean_absolute_error(y_test, y_pred)
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# r2 = r2_score(y_test, y_pred)
```
Interpret: "The model's predictions are typically off by about $X (MAE). When it makes a big mistake, errors can be around $Y (RMSE). Overall, the model explains Z% of the variation in house prices (R²)."
Part C: Real-World Applications ⭐⭐⭐
Exercise 29.15 — Elena's vaccination model evaluation
Elena built a model to predict which communities have low vaccination coverage. Her classification report shows:
| | precision | recall | f1-score |
|---|---|---|---|
| Low coverage | 0.72 | 0.85 | 0.78 |
| High coverage | 0.91 | 0.82 | 0.86 |
- What type of error is most dangerous for Elena's use case?
- Which class should she focus on improving?
- If she could improve either precision or recall for the "Low coverage" class, which should she prioritize and why?
Guidance
1. A false negative for "Low coverage" — missing a community that actually has low coverage. These communities won't receive outreach interventions.
2. The "Low coverage" class — that's where the recall (0.85) matters most. She's missing 15% of low-coverage communities.
3. She should prioritize recall for "Low coverage." Missing a low-coverage community (false negative) means that community doesn't receive the outreach it needs. A false positive (predicting low coverage when it's actually high) just means sending outreach to a community that doesn't need it as urgently — not ideal, but not harmful.
Exercise 29.16 — Marcus's sales prediction
Marcus built a regression model predicting daily bakery sales. His model has:
- MAE = $145
- RMSE = $312
- R² = 0.64
Interpret these three numbers in plain English for Marcus (who doesn't know statistics). Also: the big gap between MAE and RMSE suggests something about the model's errors. What is it, and what might Marcus want to investigate?
Guidance
"Marcus, on a typical day, the model's prediction is off by about $145 — so if it predicts $1,200 in sales, actual sales might be around $1,055 to $1,345. The model explains about 64% of the day-to-day variation in your sales, meaning there's real predictive power here, but about a third of the variation comes from things the model doesn't capture — maybe events, weather, or other factors we're not tracking." The big gap between MAE ($145) and RMSE ($312) means there are some days where the model is way off — probably special events, holidays, or unusual circumstances. Marcus should look at those outlier days and consider whether adding features (like `is_holiday` or `nearby_event`) might help.Exercise 29.17 — The metric mismatch
A data science team optimized their email spam filter for the highest F1 score and achieved F1 = 0.92. When they deployed it, users complained loudly. Investigating, they found the model had precision = 0.87 and recall = 0.97. Why were users unhappy despite the excellent F1 score? What metric should they have optimized instead?
Guidance
Users were unhappy because precision was 0.87 — meaning 13% of emails the model flagged as spam were actually legitimate. With hundreds of emails per day, users were missing important messages regularly. The high recall (97%) means almost all spam was caught, but users care more about not losing real email. They should have optimized for precision, accepting slightly more spam in the inbox in exchange for far fewer lost legitimate emails. This illustrates that F1, while balanced, may not reflect the asymmetric costs of different errors in a specific application.
Exercise 29.18 — Jordan's grading fairness
Jordan built a model predicting whether a student gets an A, using features like study_hours, attendance, prior_gpa, and professor_id. The overall accuracy is 0.79 and F1 is 0.72. But when Jordan checks performance by demographic group, they find:
- Group A: Recall = 0.82
- Group B: Recall = 0.61
What does this disparity suggest? Is the model "fair"? What should Jordan investigate next?
Guidance
The disparity in recall means the model is significantly worse at identifying A students from Group B compared to Group A. Students in Group B who deserve an A are being missed 39% of the time versus 18% for Group A. This suggests the model may be biased — possibly because the training data contains existing biases (e.g., if Group B historically received lower grades despite similar performance). Jordan should investigate: (1) whether the training data itself contains the bias, (2) which features are driving the disparity, (3) whether the `professor_id` feature encodes implicit bias in grading. This is a fairness issue that overall metrics would completely hide — demonstrating why disaggregated evaluation by subgroup is essential.
Exercise 29.19 — Threshold selection in practice
You've built a model that predicts whether a patient should be referred for additional diagnostic testing. Your hospital has limited testing capacity — they can handle about 200 referrals per month, but your model would flag 350 patients at the default threshold of 0.5. How would you use the ROC curve and threshold adjustment to solve this practical constraint? What trade-off are you making?
Guidance
Raise the classification threshold above 0.5 until the model flags approximately 200 patients. At threshold 0.5, the model flags 350; at some higher threshold (e.g., 0.65 or 0.70), it would flag closer to 200. You'd need to check the precision and recall at the new threshold. The trade-off: higher threshold means higher precision (more of the flagged patients truly need testing) but lower recall (you'll miss some patients who need testing but whose risk score fell just below the threshold). You're making a resource allocation decision: with limited capacity, you prioritize the highest-risk patients, accepting that some moderate-risk patients won't be flagged.
Part D: Synthesis and Critical Thinking ⭐⭐⭐
Exercise 29.20 — Metric selection matrix
Create a table with the following domains as rows: medical diagnosis, spam filtering, credit card fraud, movie recommendations, self-driving car obstacle detection, job applicant screening. For each domain, specify: (a) which metric matters most, (b) whether false positives or false negatives are more costly, and (c) why.
Guidance
A strong answer recognizes that metric choice is domain-driven. Medical diagnosis and obstacle detection prioritize recall (missing something is dangerous). Spam filtering and movie recommendations prioritize precision (bad recommendations or lost emails are frustrating). Fraud detection and job screening are more nuanced — both types of errors have significant costs, and the balance depends on specific organizational priorities.
Exercise 29.21 — Cross-validation edge cases
Explain what would go wrong in each of the following scenarios:
- Using regular k-fold (not stratified) cross-validation on a dataset with 2% positive class
- Using 50-fold cross-validation on a dataset with 100 samples
- Using 2-fold cross-validation when comparing two complex models
- Computing cross-validation scores and then using those same scores to report final model performance
Guidance
1. Some folds might contain zero or very few positive examples, making evaluation on those folds meaningless. Stratified k-fold ensures each fold has approximately the same positive rate.
2. Each fold would have only 2 test samples — far too few for reliable evaluation. The variance across folds would be enormous. Use 5-fold or 10-fold instead.
3. Two folds means each fold uses only 50% of data for training. For complex models that need lots of data, this may result in underfitting. Also, only two estimates to average is not very stable.
4. This is a form of data leakage. If you used cross-validation to select the best model (comparing architectures, tuning hyperparameters), those cross-validation scores are optimistically biased. You should report performance on a completely held-out test set that was never used during model selection.
Exercise 29.22 — The full evaluation workflow
Write out, step by step, a complete evaluation workflow for a binary classification problem. Include: data splitting strategy, metric selection, model comparison, final reporting. For each step, explain why it's done that way.
Guidance
A strong answer includes: (1) Hold out a final test set (20-30%) that you never touch during development. (2) On the remaining data, use stratified k-fold cross-validation to compare models. (3) Choose the metric based on the problem context (costs of FP vs. FN). (4) Select the best model based on cross-validated scores. (5) Train the selected model on all training data. (6) Evaluate once on the held-out test set and report those numbers. (7) Report multiple metrics (accuracy, precision, recall, F1, AUC) and confusion matrix, not just a single number. Each step prevents a specific type of evaluation bias.
Exercise 29.23 — Designing an evaluation plan
You're tasked with building a model that predicts which students are at risk of failing a course, so the university can offer tutoring. Design a complete evaluation plan: which metrics you'd use, how you'd validate the model, how you'd check for fairness across demographic groups, and how you'd communicate results to university administrators who don't know machine learning.
Guidance
Key elements: (1) Prioritize recall for the "at-risk" class — missing a student who needs help is worse than offering tutoring to someone who doesn't need it. (2) Use stratified 5-fold cross-validation. (3) Check recall separately by race, gender, income, and first-generation status. (4) For administrators, present the model as: "Of the students who would have failed, we identified X% in time for intervention" (recall in plain language). (5) Present the cost: "For every 10 students we flagged, Y needed help and Z didn't" (precision in plain language). (6) Be transparent about limitations and the need for ongoing monitoring.
Part E: Extension Challenges ⭐⭐⭐⭐
Exercise 29.24 — Precision-recall curves
Research precision-recall (PR) curves as an alternative to ROC curves. Using the breast cancer dataset, plot the PR curve for a logistic regression and a random forest. When are PR curves more informative than ROC curves? (Hint: think about class imbalance.) Write a paragraph comparing the two approaches.
Guidance
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# y_proba: predicted probability of the positive class,
# e.g. model.predict_proba(X_test)[:, 1] from a fitted classifier
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
```
PR curves are more informative than ROC curves when classes are highly imbalanced. ROC curves can look optimistic with imbalanced data because true negative rate is dominated by the large negative class. PR curves focus entirely on the positive class and are more sensitive to changes in the model's ability to find and correctly classify rare positives.
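A self-contained sketch of the two-model PR comparison (the 70/30 split and `random_state=42` are assumptions, not requirements):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
for name, model in [
    ("Logistic regression", LogisticRegression(max_iter=5000)),
    ("Random forest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```

Average precision (AP) summarizes the PR curve in a single number, playing the role that AUC plays for ROC curves.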
Exercise 29.25 — Custom scoring functions
Sometimes standard metrics don't capture the business cost of errors. Suppose each false negative costs $500 (missed at-risk student who fails) and each false positive costs $100 (unnecessary tutoring offered). Write a custom scoring function that computes the total cost of a model's predictions and use it with cross_val_score. Compare how this custom metric ranks your three models versus how F1 ranks them.
Guidance
```python
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.model_selection import cross_val_score

def total_cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = fn * 500 + fp * 100
    return -cost  # Negative because sklearn maximizes scores

cost_scorer = make_scorer(total_cost)
scores = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)
```
The custom metric may rank models differently from F1 because it encodes the specific asymmetric costs of your problem. A model with lower F1 might have lower total cost if it catches more true positives (even at the expense of more false positives, which are cheap).
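One way to compare the two rankings side by side. This sketch uses the breast cancer dataset as a stand-in for the at-risk-student data (which isn't provided) and two models for brevity; the model names and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.model_selection import cross_val_score

def total_cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(fn * 500 + fp * 100)  # negative: sklearn maximizes scores

X, y = load_breast_cancer(return_X_y=True)
cost_scorer = make_scorer(total_cost)
models = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}
for name, model in models.items():
    cost = cross_val_score(model, X, y, cv=5, scoring=cost_scorer).mean()
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: mean cost = ${-cost:.0f} per fold, F1 = {f1:.3f}")
```

If the model that ranks first by F1 differs from the one with the lowest cost, the custom scorer has surfaced exactly the asymmetry the exercise describes.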