Chapter 11 Exercises: Model Evaluation and Selection


Section A: Recall and Comprehension

Exercise 11.1 Define the following terms in your own words, using no more than two sentences each: (a) confusion matrix, (b) precision, (c) recall, (d) F1 score, (e) AUC, (f) cross-validation.

Exercise 11.2 A model achieves 97% accuracy on a dataset where 97% of observations belong to the negative class. Explain why this accuracy figure is misleading, and identify at least two metrics that would provide a more honest assessment of the model's performance.

Exercise 11.3 Explain the difference between a model's parameters and its hyperparameters. Give two examples of each for a random forest classifier.

Exercise 11.4 Describe the three conditions that must be met before deploying a model based on A/B test results. Why is statistical significance alone insufficient?

Exercise 11.5 Explain why standard K-fold cross-validation is inappropriate for time series data. What alternative approach should be used, and why?

Exercise 11.6 List the five dimensions of model selection presented in this chapter. For each, write one sentence explaining why it matters beyond predictive performance.

Exercise 11.7 In the Athena churn model scenario, a false negative costs $500 (lost customer) and a false positive costs $20 (wasted retention offer). Without performing any calculations, predict whether the optimal classification threshold will be above or below 0.5. Explain your reasoning.


Section B: Metric Calculation

Exercise 11.8: Confusion Matrix Interpretation
A credit card fraud detection model produces the following confusion matrix on a test set of 10,000 transactions:

                        Predicted Fraud    Predicted Legitimate
  Actually Fraud                     45                       5
  Actually Legitimate               150                   9,800

Calculate the following metrics:
  • (a) Accuracy
  • (b) Precision
  • (c) Recall
  • (d) F1 score
  • (e) False positive rate
  • (f) Specificity (true negative rate)
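After working these by hand, you can verify your answers with a short script (pure Python; the counts are taken directly from the table above):

```python
# Confusion-matrix counts from the table above.
tp, fn = 45, 5        # actually fraud
fp, tn = 150, 9_800   # actually legitimate

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # a.k.a. sensitivity, TPR
f1          = 2 * precision * recall / (precision + recall)
fpr         = fp / (fp + tn)
specificity = tn / (tn + fp)          # equals 1 - FPR

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} fpr={fpr:.4f} specificity={specificity:.4f}")
```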

Exercise 11.9: Cost-Sensitive Analysis
Using the confusion matrix from Exercise 11.8, calculate the expected profit given the following cost structure:
  • True Positive (caught fraud): Saves $2,000 average transaction
  • False Positive (flagged legitimate): Customer service call costs $15, plus $50 estimated goodwill loss
  • False Negative (missed fraud): Bank absorbs $2,000 loss, plus $500 investigation and remediation
  • True Negative: No cost

  • (a) Calculate the total expected profit.
  • (b) If the model were replaced with a "flag everything" approach (predict fraud for all 10,000 transactions), what would the expected profit be? Compare.
  • (c) If the model were replaced with a "flag nothing" approach (predict legitimate for all 10,000 transactions), what would the expected profit be? Compare.
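One way to organize this kind of calculation is a small helper that maps each confusion-matrix cell to its per-instance dollar value. This sketch is illustrative (the name `cell_profit` is ours, not the chapter's `expected_profit`), and the demo numbers are deliberately not the exercise's:

```python
def cell_profit(tp, fp, fn, tn, values):
    """Total dollar outcome for a confusion matrix.

    `values` maps each cell ("tp", "fp", "fn", "tn") to its
    per-instance dollar value (positive = gain, negative = cost).
    """
    return (tp * values["tp"] + fp * values["fp"]
            + fn * values["fn"] + tn * values["tn"])

# Toy example (NOT the exercise's numbers): 10 caught frauds at +$100,
# 5 false alarms at -$10, 2 misses at -$200, true negatives free.
print(cell_profit(10, 5, 2, 100, {"tp": 100, "fp": -10, "fn": -200, "tn": 0}))
# 10*100 - 5*10 - 2*200 + 0 = 550
```

The same helper also answers parts (b) and (c): "flag everything" and "flag nothing" are just confusion matrices with empty columns.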

Exercise 11.10: F-Beta Selection
For each of the following business scenarios, recommend whether to use F0.5, F1, or F2, and explain your reasoning:
  • (a) An email spam filter for a corporate executive
  • (b) A cancer screening tool for initial patient triage
  • (c) A recommendation system suggesting products on an e-commerce site
  • (d) An automated loan approval system
  • (e) A system detecting defective products on a manufacturing line before shipment
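As a reminder before you answer, the general F-beta formula can be sketched in a few lines; note how the same model scores differently as beta shifts weight between precision and recall (the precision/recall values here are arbitrary):

```python
def f_beta(precision, recall, beta):
    # beta > 1 weights recall more heavily; beta < 1 weights precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The same model (precision 0.5, recall 0.9) scored three ways:
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta}: {f_beta(0.5, 0.9, beta):.3f}")
```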

Exercise 11.11: Threshold Optimization
A churn model produces the following results at three different thresholds:

  Threshold    TP    FP    FN    TN
  0.3          85   200    15   700
  0.5          60    80    40   820
  0.7          30    20    70   880

Using the Athena cost matrix (TP value: $480, FP cost: -$20, FN cost: -$500, TN value: $0):
  • (a) Calculate the expected profit at each threshold.
  • (b) Which threshold maximizes profit?
  • (c) Calculate precision, recall, and F1 at each threshold.
  • (d) Does the threshold with the highest F1 also maximize profit? If not, explain why.
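Once you have worked parts (a) through (c) by hand, a sweep over the table is a convenient way to check your numbers (a sketch; it simply re-applies the formulas above to each row):

```python
rows = [  # (threshold, TP, FP, FN, TN), copied from the table above
    (0.3, 85, 200, 15, 700),
    (0.5, 60,  80, 40, 820),
    (0.7, 30,  20, 70, 880),
]
VALUES = {"tp": 480, "fp": -20, "fn": -500, "tn": 0}  # Athena cost matrix

for t, tp, fp, fn, tn in rows:
    profit = (tp * VALUES["tp"] + fp * VALUES["fp"]
              + fn * VALUES["fn"] + tn * VALUES["tn"])
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"t={t}: profit={profit:,}  precision={precision:.3f}  "
          f"recall={recall:.3f}  f1={f1:.3f}")
```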

Exercise 11.12: Regression Metrics
A demand forecasting model for a retail chain produces the following predictions for five stores:

  Store    Actual Demand    Predicted Demand
  A                1,000               1,050
  B                  200                 280
  C                  500                 520
  D                   50                 120
  E                  800                 760

Calculate:
  • (a) MAE
  • (b) RMSE
  • (c) MAPE
  • (d) Which store has the highest absolute error? Which has the highest percentage error?
  • (e) If the company penalizes under-forecasting more heavily than over-forecasting (stockouts are worse than excess inventory), which metric — MAE or RMSE — is more appropriate? What alternative approach might be even better?
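The three metrics can be written as plain functions you can reuse against the table above (a pure-Python sketch; the toy check at the end deliberately avoids the exercise's data):

```python
import math

def mae(actual, predicted):
    # Mean absolute error: average size of the miss, in original units.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error: like MAE, but penalizes large misses more.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    # Mean absolute percentage error: scale-free, but explodes when actuals are small.
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Toy check (not the exercise data): a perfect forecast scores zero on all three.
print(mae([10, 20], [10, 20]), rmse([10, 20], [10, 20]), mape([10, 20], [10, 20]))
```

Note how `mape` divides by the actual value: that division is exactly why a small-volume store can dominate the percentage error even when its absolute error is modest, which is worth keeping in mind for part (d).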


Section C: Application

Exercise 11.13: ROC Curve Interpretation
You are evaluating three models for predicting customer default on a personal loan:

  • Model X: AUC = 0.92, inference time = 500ms
  • Model Y: AUC = 0.88, inference time = 10ms, fully interpretable
  • Model Z: AUC = 0.94, inference time = 2,000ms, requires GPU

For each of the following deployment contexts, recommend a model and explain your reasoning:
  • (a) A mobile banking app that provides instant loan pre-approval as customers browse
  • (b) A batch scoring system that evaluates all existing borrowers monthly for portfolio risk monitoring
  • (c) A consumer lending application subject to the Equal Credit Opportunity Act, where rejected applicants have a legal right to know why they were denied

Exercise 11.14: Cross-Validation Design
You are building a model to predict monthly sales for a subscription box company. The company has 3 years of monthly data (36 observations) and strong seasonal patterns (holiday spikes in November-December).
  • (a) Explain why a standard 5-fold cross-validation would produce misleading results.
  • (b) Design an appropriate cross-validation strategy. Specify the number of splits, the training window, and the test window.
  • (c) Should the validation strategy account for seasonality? If so, how?
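For part (b), one common shape is an expanding-window split, sketched below in pure Python (the window sizes here are illustrative defaults, not the "right" answer to the exercise):

```python
def expanding_window_splits(n_obs, initial_train, test_size):
    """Yield (train_indices, test_indices) pairs where the training
    window grows over time and the test window always follows it."""
    start = initial_train
    while start + test_size <= n_obs:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size

# 36 monthly observations: train on the first 24 months, then test 3 months at a time.
for train, test in expanding_window_splits(36, initial_train=24, test_size=3):
    print(f"train months 0-{train[-1]:>2}  ->  test months {test[0]}-{test[-1]}")
```

The key property, unlike standard K-fold, is that the model never sees data from after its test window, which mirrors how the forecaster will actually be used.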

Exercise 11.15: Model Evaluation Board
You lead the data science team at a mid-size insurance company. Your team has developed three models for predicting which policyholders will file a claim in the next 12 months:

  • Model Alpha: XGBoost ensemble, AUC 0.82, inference time 80ms, no interpretability
  • Model Beta: Logistic regression with engineered features, AUC 0.78, inference time 3ms, fully interpretable, passed fairness audit
  • Model Gamma: Neural network, AUC 0.84, inference time 300ms, requires dedicated GPU infrastructure ($2,000/month)

The claims team will use the model to prioritize proactive outreach calls. Each call costs $25. Each prevented claim saves an average of $4,000. About 8% of policyholders file claims.

  • (a) Construct a cost matrix for this scenario.
  • (b) Which model would you recommend for deployment? Create a scorecard similar to Athena's and justify your choice.
  • (c) Draft the Business Translation Test sentence for your recommended model.
  • (d) What guardrail metrics would you monitor during the A/B test?

Exercise 11.16: Hyperparameter Tuning Strategy
Your team has a budget of $500 for cloud compute to tune a gradient boosting classifier. The model has four hyperparameters, each with 5 reasonable values to try. The dataset has 100,000 rows, and a single model fit takes approximately 2 minutes on a cloud instance that costs $0.50/hour.
  • (a) How many total combinations does a full grid search require?
  • (b) With 5-fold cross-validation, how many model fits does grid search require?
  • (c) How long would the grid search take, and what would it cost?
  • (d) Propose an alternative tuning strategy that stays within budget while still exploring the hyperparameter space effectively. Estimate its cost and duration.
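The arithmetic for (a) through (c) is mechanical; a sketch you can adapt to check your numbers (it assumes a single instance with no parallelism):

```python
values_per_param, n_params = 5, 4
folds = 5
minutes_per_fit = 2
dollars_per_hour = 0.50

combos = values_per_param ** n_params   # full grid size
fits = combos * folds                   # one fit per fold per combination
hours = fits * minutes_per_fit / 60
cost = hours * dollars_per_hour
print(f"{combos} combos, {fits} fits, {hours:.1f} hours, ${cost:.2f}")
```

When you answer (d), notice that dollar cost and wall-clock duration are separate constraints: a strategy can be affordable yet still take too long on one machine.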


Section D: Analysis and Critical Thinking

Exercise 11.17: The Accuracy Paradox in Practice
A hospital deploys a model to predict which emergency room patients will be readmitted within 30 days. The model achieves 92% accuracy. The readmission rate is 12%.
  • (a) Is 92% accuracy better than the naive baseline? Calculate the naive baseline accuracy.
  • (b) The hospital's goal is to intervene with high-risk patients before discharge (extra follow-up calls, earlier outpatient appointments). Which is more costly to the hospital: a false positive or a false negative? Explain.
  • (c) Given your answer to (b), which metric — precision or recall — should the hospital prioritize?
  • (d) The hospital administrator says, "92% accuracy sounds great — let's deploy." Write a one-paragraph response explaining why this conclusion may be premature and what additional analysis you would recommend.

Exercise 11.18: When AUC Misleads
Two fraud detection models are evaluated on a dataset with 0.1% fraud rate (1 in 1,000 transactions):

  • Model P: AUC = 0.95, at its optimal threshold it achieves 80% recall and 10% precision
  • Model Q: AUC = 0.92, at its optimal threshold it achieves 60% recall and 40% precision

  • (a) At its optimal threshold, Model P flags many more legitimate transactions than Model Q. Explain why using the concept of base rate and false positive rate.
  • (b) A bank executive says, "Model P has a higher AUC, so it's the better model." Do you agree? Under what cost structure might Model Q be the better business choice?
  • (c) Would a precision-recall curve have been more informative than a ROC curve for comparing these models? Why?
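A back-of-envelope sketch for checking your reasoning on part (a), scaled to 1,000,000 transactions (an assumed population size, chosen only to keep the counts whole):

```python
N = 1_000_000
fraud = int(N * 0.001)          # base rate: 0.1% -> 1,000 fraud cases
legit = N - fraud

def fp_count(recall, precision):
    # flagged = TP / precision, so FP = flagged - TP
    tp = recall * fraud
    flagged = tp / precision
    return flagged - tp

for name, r, p in [("P", 0.80, 0.10), ("Q", 0.60, 0.40)]:
    fp = fp_count(r, p)
    print(f"Model {name}: FP = {fp:,.0f}  FPR = {fp / legit:.4%}")
```

Both models have tiny false positive rates, which is exactly why the ROC curve (plotted against FPR) makes them look similar while their precision differs fourfold.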

Exercise 11.19: The Full Evaluation Pipeline
You are the lead data scientist at a food delivery company. You have built a model to predict delivery time for each order (a regression problem). Write a complete evaluation plan that covers:
  • (a) Which regression metric(s) you would use as the primary evaluation metric, and why
  • (b) Your cross-validation strategy (consider temporal patterns in delivery times)
  • (c) How you would handle hyperparameter tuning given limited compute budget
  • (d) Your A/B test design: what is the control, what is the treatment, what is the primary metric, and what are the guardrail metrics
  • (e) The model selection criteria you would present to the product team if you had multiple candidate models
  • (f) The Business Translation Test sentence for your model

Exercise 11.20: Athena Model Evaluation Board Simulation
Role-play exercise. Divide into three teams. Each team advocates for one of Athena's churn models (A, B, or C). Using the model selection scorecard from the chapter:
  • (a) Each team prepares a 3-minute pitch for their model, including the Business Translation Test sentence.
  • (b) Each team prepares a 2-minute critique of one competing model.
  • (c) The class votes on which model to deploy, then discusses whether the "winning" model was the right business choice for Athena's specific context.


Section E: Python Implementation

Exercise 11.21: Build Your Own Confusion Matrix
Without using scikit-learn's confusion_matrix function, write a Python function my_confusion_matrix(y_true, y_pred) that:
  • (a) Takes two lists of binary labels as input
  • (b) Returns a 2x2 numpy array containing TN, FP, FN, TP
  • (c) Validates that both inputs have the same length
  • (d) Test your function against scikit-learn's implementation on a sample dataset
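If you get stuck, here is one possible shape for the solution. It returns nested lists rather than the required numpy array, so you still need to adapt it; the [[TN, FP], [FN, TP]] layout matches scikit-learn's convention of rows = actual class, columns = predicted class:

```python
def my_confusion_matrix(y_true, y_pred):
    if len(y_true) != len(y_pred):                  # part (c): length check
        raise ValueError("y_true and y_pred must be the same length")
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1                         # tally (actual, predicted) pairs
    return [[counts[(0, 0)], counts[(0, 1)]],       # [TN, FP]
            [counts[(1, 0)], counts[(1, 1)]]]       # [FN, TP]

print(my_confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```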

Exercise 11.22: Expected Profit Optimizer
Extend the expected_profit function from the chapter to:
  • (a) Accept a multi-class cost matrix (not just binary)
  • (b) Plot the profit curve across all thresholds
  • (c) Annotate the plot with the optimal threshold, maximum profit, and the profit at the default threshold (0.5)
  • (d) Return a dictionary containing the optimal threshold, max profit, and the confusion matrix at the optimal threshold

Exercise 11.23: ModelEvaluator Extension Extend the ModelEvaluator class from the chapter with the following new method:

compare_models(self, other_evaluator) — Takes another ModelEvaluator instance and produces a side-by-side comparison report including:
  • All key metrics for both models
  • Overlaid ROC curves
  • Overlaid profit curves
  • A recommendation based on the cost matrix

Test your extension by training two different models (e.g., logistic regression and random forest) on the same dataset and comparing them.

Exercise 11.24: Automated Model Selection
Write a Python function select_best_model(models, X_test, y_test, cost_matrix, weights) that:
  • (a) Evaluates each model using the ModelEvaluator
  • (b) Scores each model on multiple dimensions (AUC, expected profit, inference latency)
  • (c) Applies user-specified weights to compute a composite score
  • (d) Returns the recommended model with a justification string
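The trickiest part of (c) is that the dimensions live on different scales (AUC near 1, latency in milliseconds, profit in dollars), so they must be normalized before weighting. One possible sketch of that inner step, with illustrative names and example metrics loosely based on Exercise 11.15's models:

```python
def composite_score(metrics, weights, higher_is_better):
    """Weighted composite of min-max-normalized metrics.

    metrics: {model_name: {metric_name: value}}
    weights: {metric_name: weight}; weights should sum to 1
    higher_is_better: {metric_name: bool}, e.g. False for latency
    """
    names = list(weights)
    lo = {m: min(v[m] for v in metrics.values()) for m in names}
    hi = {m: max(v[m] for v in metrics.values()) for m in names}
    scores = {}
    for model, vals in metrics.items():
        total = 0.0
        for m in names:
            span = hi[m] - lo[m] or 1.0          # avoid divide-by-zero on ties
            norm = (vals[m] - lo[m]) / span
            if not higher_is_better[m]:
                norm = 1.0 - norm                # flip "lower is better" metrics
            total += weights[m] * norm
        scores[model] = total
    return scores

scores = composite_score(
    {"alpha": {"auc": 0.82, "latency_ms": 80},
     "beta":  {"auc": 0.78, "latency_ms": 3}},
    weights={"auc": 0.7, "latency_ms": 0.3},
    higher_is_better={"auc": True, "latency_ms": False},
)
print(max(scores, key=scores.get), scores)
```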

Exercise 11.25: Regression Evaluator
Adapt the ModelEvaluator class to handle regression models. Create a RegressionEvaluator class that:
  • (a) Computes R-squared, MAE, RMSE, and MAPE
  • (b) Plots actual vs. predicted scatter plot with a 45-degree reference line
  • (c) Plots residual distribution
  • (d) Generates a business-language executive summary that translates regression metrics into statements like "The forecast is off by an average of X units per day, which translates to approximately $Y in excess inventory costs."


Section F: Capstone Exercise

Exercise 11.26: End-to-End Model Evaluation Report Using any publicly available classification dataset (e.g., the UCI Adult Income dataset, the Kaggle Titanic dataset, or the UCI Credit Card Default dataset):

  1. Train at least three different classification models (e.g., logistic regression, random forest, gradient boosting).
  2. Use the ModelEvaluator class to evaluate each model.
  3. Design a realistic cost matrix for the business context of your chosen dataset.
  4. Produce a complete evaluation report that includes:
     • (a) The confusion matrix, ROC curve, PR curve, and profit curve for each model
     • (b) A cross-validation analysis with 5-fold stratified CV
     • (c) Hyperparameter tuning results (random search with at least 30 iterations)
     • (d) A model selection scorecard comparing all three models
     • (e) The optimal threshold and expected profit for the recommended model
     • (f) A Business Translation Test sentence
     • (g) An A/B test design for the recommended model, including sample size, duration, primary metric, and guardrail metrics

Present your findings as a 5-page executive briefing suitable for a non-technical audience.