Chapter 11 Quiz: Model Evaluation and Selection

Question 1. A binary classification model achieves 96% accuracy on a dataset where 96% of observations belong to the negative class. What can you conclude about this model?

  • (a) The model is excellent — 96% accuracy is very high.
  • (b) The model may be no better than always predicting the majority class.
  • (c) The model has high recall because it identifies most observations correctly.
  • (d) The model has high precision because 96% of its predictions are correct.

Question 2. In a confusion matrix for a binary classifier, a False Negative represents:

  • (a) A negative instance that the model correctly identified as negative.
  • (b) A positive instance that the model incorrectly classified as negative.
  • (c) A negative instance that the model incorrectly classified as positive.
  • (d) A positive instance that the model correctly identified as positive.

Question 3. A fraud detection model identifies 180 of 200 actual fraudulent transactions, but also flags 1,000 legitimate transactions as fraudulent. What is the model's recall?

  • (a) 180 / 1,180 = 15.3%
  • (b) 180 / 200 = 90.0%
  • (c) 180 / 1,000 = 18.0%
  • (d) 200 / 1,180 = 16.9%

Question 4. Using the same scenario as Question 3, what is the model's precision?

  • (a) 180 / 1,180 = 15.3%
  • (b) 180 / 200 = 90.0%
  • (c) 180 / 1,000 = 18.0%
  • (d) 1,000 / 1,180 = 84.7%

Question 5. You are building a medical screening tool to detect a rare but serious disease. Which metric should be prioritized?

  • (a) Precision, because false positives waste medical resources.
  • (b) Accuracy, because it provides the most comprehensive measure.
  • (c) Recall, because missing a true case of the disease (false negative) could be fatal.
  • (d) Specificity, because most patients do not have the disease.

Question 6. The F2 score weights recall more heavily than precision. In which scenario would F2 be more appropriate than F1?

  • (a) Email spam filtering, where marking a legitimate email as spam is very costly.
  • (b) Product recommendation, where irrelevant recommendations are mildly annoying.
  • (c) Airport security screening, where missing a genuine threat is far worse than a false alarm.
  • (d) Content moderation on social media, where over-censoring is as harmful as under-censoring.

Question 7. What does an AUC (Area Under the ROC Curve) of 0.50 indicate?

  • (a) The model achieves perfect classification.
  • (b) The model performs no better than random guessing.
  • (c) The model correctly classifies exactly half of all instances.
  • (d) The model has equal precision and recall.

Question 8. When is a precision-recall curve more informative than a ROC curve?

  • (a) When the dataset has balanced classes.
  • (b) When the positive class is very rare (highly imbalanced dataset).
  • (c) When you are evaluating a regression model.
  • (d) When you need to compare more than two models simultaneously.

Question 9. In cost-sensitive evaluation, the "expected profit" at a given threshold depends on:

  • (a) The confusion matrix counts and the business value assigned to each outcome type.
  • (b) The AUC score and the base rate of the positive class.
  • (c) The F1 score and the total number of observations.
  • (d) The precision at that threshold and the prevalence of the positive class.

Question 10. A churn model has the following cost structure: saving a churner generates $480 net value, a wasted retention offer costs $20, and a missed churner costs $500 in lost revenue. The optimal classification threshold is most likely to be:

  • (a) Exactly 0.5, because that is the standard default.
  • (b) Above 0.5, because false positives should be minimized.
  • (c) Below 0.5, because missing churners (false negatives) is much more expensive than wasting retention offers (false positives).
  • (d) Impossible to determine without knowing the AUC.

Question 11. What is the primary advantage of K-fold cross-validation over a single train-test split?

  • (a) It trains the model on more data, resulting in a better final model.
  • (b) It provides a more robust and reliable estimate of model performance by reducing the variance of the evaluation.
  • (c) It eliminates the need for a test set entirely.
  • (d) It automatically performs hyperparameter tuning.

Question 12. For a time-series forecasting model predicting monthly sales, which cross-validation strategy is appropriate?

  • (a) Standard 5-fold cross-validation with random shuffling.
  • (b) Stratified K-fold cross-validation.
  • (c) Time-series cross-validation (expanding window / walk-forward validation).
  • (d) Leave-one-out cross-validation.

Question 13. A random search over hyperparameters with 50 iterations is often more efficient than a grid search with 50 combinations because:

  • (a) Random search uses fewer computational resources per iteration.
  • (b) Random search explores a wider range of each individual hyperparameter, making it more likely to find good values.
  • (c) Random search automatically selects the best hyperparameters without evaluation.
  • (d) Random search applies Bayesian optimization to guide the search.

Question 14. An A/B test comparing a new churn model against no model shows a statistically significant improvement in retention rate (p-value = 0.03). However, customer satisfaction scores in the treatment group dropped by 8%. What should you do?

  • (a) Deploy the model immediately — the retention improvement is statistically significant.
  • (b) Ignore the satisfaction drop — it is not the primary metric.
  • (c) Investigate the satisfaction decline as a guardrail metric violation before proceeding with deployment.
  • (d) Re-run the test with a larger sample size to confirm the satisfaction drop.

Question 15. Athena's Model Evaluation Board chose Model B (logistic regression) over Model A (gradient boosting) despite Model A having a higher AUC. Which of the following was NOT cited as a reason?

  • (a) Model B delivered higher expected profit when evaluated with the business cost matrix.
  • (b) Model B was interpretable, allowing the operations team to explain decisions to customers.
  • (c) Model B had lower compute costs ($50/month vs. $800/month).
  • (d) Model B had a higher F1 score than Model A.

Question 16. Which regression metric is most appropriate when occasional large forecast errors are disproportionately costly?

  • (a) R-squared, because it measures overall explanatory power.
  • (b) MAE, because it treats all errors equally.
  • (c) RMSE, because squaring the errors penalizes large deviations more heavily.
  • (d) MAPE, because it normalizes errors by actual values.

Question 17. MAPE (Mean Absolute Percentage Error) is problematic when:

  • (a) The dataset is very large.
  • (b) Actual values are close to zero or equal to zero.
  • (c) The model has high bias.
  • (d) The target variable is normally distributed.

Question 18. The "Business Translation Test" requires that before deployment, a model's value must be expressed as:

  • (a) A technical specification including the algorithm, hyperparameters, and feature set.
  • (b) A single sentence connecting what the model identifies, the business action it enables, and the estimated business impact.
  • (c) A confusion matrix and ROC curve suitable for a data science audience.
  • (d) A cost-benefit analysis with at least five years of projected savings.

Question 19. Professor Okonkwo states: "If you can't translate your model's performance into a business impact statement, you haven't finished evaluating it." This statement reflects which principle?

  • (a) Model metrics are more important than business metrics.
  • (b) Evaluation is only complete when technical performance is connected to business value.
  • (c) Business leaders should not be involved in model evaluation.
  • (d) The highest AUC always indicates the best model.

Question 20. In the model selection scorecard, "organizational fit" refers to:

  • (a) Whether the model's AUC exceeds a minimum threshold.
  • (b) Whether the team has the skills to maintain the model, the infrastructure supports it, and the business unit will actually use the predictions.
  • (c) Whether the model complies with regulatory requirements.
  • (d) Whether the model was developed using the organization's preferred programming language.

Question 21. Tom's initial churn model achieved 92% accuracy on a dataset where 95% of customers do not churn. The naive baseline (predicting "no churn" for everyone) achieves:

  • (a) 50% accuracy
  • (b) 88% accuracy
  • (c) 92% accuracy
  • (d) 95% accuracy

Question 22. Grid search with 4 hyperparameters, each with 5 values, and 5-fold cross-validation requires how many total model fits?

  • (a) 20
  • (b) 100
  • (c) 625
  • (d) 3,125

Question 23. The key difference between Bayesian optimization and random search for hyperparameter tuning is:

  • (a) Bayesian optimization uses the results of previous evaluations to intelligently choose the next hyperparameter combination, while random search samples independently.
  • (b) Bayesian optimization is always faster than random search.
  • (c) Random search can only evaluate integer hyperparameters, while Bayesian optimization handles continuous values.
  • (d) Bayesian optimization requires labeled data, while random search does not.

Question 24. Which of the following is an example of a guardrail metric in an A/B test for a churn prevention model?

  • (a) The churn rate in the treatment group.
  • (b) The retention rate improvement in the treatment group.
  • (c) Customer satisfaction scores, which must not decline even if retention improves.
  • (d) The model's AUC on the test set.

Question 25. A model with perfect precision but very low recall would:

  • (a) Miss many actual positive cases but rarely make false positive predictions.
  • (b) Catch all actual positive cases but generate many false alarms.
  • (c) Achieve the maximum possible F1 score.
  • (d) Have an AUC of 1.0.

Answer Key

  1. (b) — The model may simply be predicting the majority class for all instances, which would also achieve 96% accuracy. Additional metrics (precision, recall, F1, AUC) are needed.

  2. (b) — A false negative is a positive instance (e.g., an actual churner) that the model incorrectly classifies as negative (predicts "no churn").

  3. (b) — Recall = TP / (TP + FN) = 180 / (180 + 20) = 180 / 200 = 90.0%.

  4. (a) — Precision = TP / (TP + FP) = 180 / (180 + 1,000) = 180 / 1,180 = 15.3%.
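The recall and precision calculations in answers 3 and 4 can be verified in a few lines of Python:

```python
# Fraud scenario from Questions 3-4: 180 of 200 actual frauds caught
# (TP = 180, FN = 20), plus 1,000 legitimate transactions flagged (FP = 1,000).
tp, fn, fp = 180, 20, 1000

recall = tp / (tp + fn)        # share of actual frauds that were caught
precision = tp / (tp + fp)     # share of fraud flags that were correct

print(f"recall = {recall:.1%}")        # 90.0%
print(f"precision = {precision:.1%}")  # 15.3%
```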

  5. (c) — In medical screening for serious diseases, missing a true case (false negative) can be life-threatening. High recall ensures most cases are detected, even at the cost of some false positives that can be resolved through follow-up testing.

  6. (c) — Airport security screening prioritizes catching threats (recall) over minimizing false alarms (precision). Missing a genuine threat has catastrophic consequences, while false alarms cause inconvenience but not danger.

  7. (b) — An AUC of 0.50 means the model's ROC curve follows the diagonal, indicating no discriminative ability beyond random chance.

  8. (b) — When the positive class is rare, the false positive rate (used in ROC) can appear deceptively low because it is divided by a very large number of negatives. The PR curve uses precision, which is more sensitive to false positives in imbalanced settings.

  9. (a) — Expected profit is calculated by multiplying each confusion matrix cell count by its corresponding business value and summing: (TP x V_TP) + (FP x V_FP) + (FN x V_FN) + (TN x V_TN).
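A minimal sketch of this calculation in Python, using the cost structure from Question 10 and hypothetical confusion-matrix counts:

```python
def expected_profit(tp, fp, fn, tn, v_tp, v_fp, v_fn, v_tn):
    """Sum each confusion-matrix count times its business value."""
    return tp * v_tp + fp * v_fp + fn * v_fn + tn * v_tn

# Illustrative counts with the Question 10 values:
# saved churner +$480, wasted offer -$20, missed churner -$500, TN $0.
profit = expected_profit(tp=100, fp=300, fn=50, tn=1000,
                         v_tp=480, v_fp=-20, v_fn=-500, v_tn=0)
print(profit)  # 17000
```

Sweeping the classification threshold changes the counts, so the threshold that maximizes this sum is the cost-sensitive optimum.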

  10. (c) — When false negatives are far more expensive than false positives ($500 vs. $20), the optimal strategy is to lower the threshold to catch more positives, accepting additional false positives as a reasonable cost.

  11. (b) — K-fold cross-validation uses all the data for both training and testing across K iterations, producing K performance estimates whose average and standard deviation give a more reliable assessment than any single split.
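A bare-bones sketch of the K-fold splitting logic (in practice a library routine such as scikit-learn's `KFold` handles shuffling and uneven fold sizes):

```python
n, k = 20, 5
indices = list(range(n))
fold_size = n // k

folds = []
for i in range(k):
    test_idx = indices[i * fold_size:(i + 1) * fold_size]
    train_idx = [j for j in indices if j not in set(test_idx)]
    folds.append((train_idx, test_idx))
    # In practice: fit the model on train_idx, score it on test_idx,
    # then report the mean and standard deviation of the K scores.

# Every observation serves as test data exactly once.
all_test = [j for _, test in folds for j in test]
print(sorted(all_test) == indices)  # True
```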

  12. (c) — Time-series CV respects temporal ordering by always training on past data and testing on future data, preventing information leakage from the future into the training set.
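The expanding-window idea can be sketched with index lists (the month count and minimum training size below are illustrative):

```python
# Walk-forward splits for 12 months of data: the model always trains
# on the past and is tested on the next unseen month.
n_months = 12
min_train = 6

splits = []
for cutoff in range(min_train, n_months):
    train = list(range(cutoff))   # months 0 .. cutoff-1
    test = [cutoff]               # the following month
    splits.append((train, test))

# No future data ever leaks into a training window.
print(all(max(tr) < min(te) for tr, te in splits))  # True
```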

  13. (b) — Grid search evaluates a predetermined set of values for each hyperparameter. Random search samples from continuous distributions, so it explores a wider range of each hyperparameter and is more likely to discover good values that fall between grid points.
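A small illustration of the coverage difference, with hypothetical learning-rate and depth hyperparameters:

```python
import itertools
import random

random.seed(0)

# Grid: 7 x 7 = 49 combinations, but only 7 distinct values per axis.
grid_lr = [10 ** -i for i in range(7)]
grid_depth = list(range(1, 8))
grid = list(itertools.product(grid_lr, grid_depth))
print(len(grid), len({lr for lr, _ in grid}))  # 49 7

# Random search: 49 draws, each with a fresh learning-rate value,
# so all 49 distinct learning rates get explored.
draws = [(10 ** random.uniform(-6, 0), random.randint(1, 7))
         for _ in range(49)]
print(len({lr for lr, _ in draws}))  # 49
```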

  14. (c) — Customer satisfaction is a guardrail metric. A statistically significant primary metric improvement does not justify deployment if a guardrail metric has degraded. The satisfaction drop should be investigated and addressed before deployment.

  15. (d) — The chapter discusses AUC, expected profit, interpretability, compute costs, and operations team trust. F1 score comparison was not cited as a reason for choosing Model B.

  16. (c) — RMSE squares errors before averaging, which amplifies the impact of large errors. This makes RMSE appropriate when large deviations are disproportionately costly (e.g., severe inventory shortages).
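A quick numeric illustration: two error sets with identical MAE but different RMSE, because one contains a single large deviation:

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

steady = [2, 2, 2, 2]   # four moderate errors
spiky = [0, 0, 0, 8]    # one large error, same total magnitude

print(mae(steady), mae(spiky))    # 2.0 2.0
print(rmse(steady), rmse(spiky))  # 2.0 4.0
```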

  17. (b) — MAPE divides by the actual value. When actual values are zero, MAPE is undefined (division by zero). When actual values are near zero, MAPE produces extremely large percentage errors that distort the overall average.
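A short demonstration of the failure mode (the helper function and values are illustrative):

```python
def mape(actuals, preds):
    # Undefined when any actual is zero; inflated when actuals are near zero.
    return 100 * sum(abs((a - p) / a)
                     for a, p in zip(actuals, preds)) / len(actuals)

print(mape([100, 100], [110, 90]))   # ~10.0  -- reasonable
print(mape([0.1, 100], [1.1, 100]))  # ~500.0 -- one near-zero actual dominates
```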

  18. (b) — The Business Translation Test requires a single sentence: "This model identifies [what] with [metric], which enables us to [action], resulting in [impact], at a cost of [cost], for a net ROI of [return]."

  19. (b) — The statement emphasizes that technical evaluation (computing metrics) is a means to an end. The end is understanding business impact. Evaluation is incomplete until metrics are translated into business language.

  20. (b) — Organizational fit encompasses the practical realities of deploying and maintaining a model: team skills, infrastructure support, and whether the predictions will actually be used by the intended business unit.

  21. (d) — If 95% of customers do not churn, a model that always predicts "no churn" achieves 95% accuracy by correctly classifying all non-churners while missing all churners.

  22. (d) — Total combinations = 5^4 = 625. With 5-fold CV: 625 x 5 = 3,125 model fits.
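The arithmetic can be checked directly (the hyperparameter names are illustrative):

```python
# 4 hyperparameters x 5 candidate values each.
param_grid = {"max_depth": 5, "learning_rate": 5,
              "n_estimators": 5, "subsample": 5}

combinations = 1
for n_values in param_grid.values():
    combinations *= n_values          # 5^4 = 625

k_folds = 5
total_fits = combinations * k_folds   # 625 x 5 = 3,125
print(total_fits)  # 3125
```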

  23. (a) — Bayesian optimization builds a surrogate model of the objective function and uses it to prioritize promising regions of the hyperparameter space. Random search treats each evaluation independently, without learning from previous results.

  24. (c) — Guardrail metrics are secondary metrics that must not deteriorate during an experiment, even if the primary metric improves. Customer satisfaction, support call volume, and revenue per customer are common guardrail metrics.

  25. (a) — Perfect precision means every positive prediction is correct (no false positives). Very low recall means the model misses most actual positives (many false negatives). The model is very conservative — it only predicts positive when it is highly confident.