Quiz: Chapter 16

Model Evaluation Deep Dive


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A churn prediction dataset has 50,000 subscriber-months from 8,000 unique subscribers. Each subscriber appears in 3-12 monthly rows. The correct cross-validation strategy is:

  • A) KFold(n_splits=5)
  • B) StratifiedKFold(n_splits=5)
  • C) StratifiedGroupKFold(n_splits=5) with subscriber_id as the group
  • D) LeaveOneOut()

Answer: C) StratifiedGroupKFold(n_splits=5) with subscriber_id as the group. Because each subscriber appears in multiple rows, standard K-fold or stratified K-fold could place different months of the same subscriber in both the training and test sets. The model would then be "predicting" churn for subscribers whose behavioral patterns it has already seen, inflating performance estimates. Group K-fold ensures all rows for a given subscriber stay in the same fold. Stratification preserves the class distribution across folds.


Question 2 (Multiple Choice)

A model achieves 96% accuracy on a dataset where the positive class is 3% of the data. What can you conclude?

  • A) The model is excellent and should be deployed
  • B) The model may simply be predicting the majority class for nearly every example
  • C) The model has perfect precision
  • D) Accuracy is always a valid metric regardless of class imbalance

Answer: B) The model may simply be predicting the majority class for nearly every example. A model that predicts "negative" for every sample would achieve 97% accuracy on this dataset, so at 96% the evaluated model actually underperforms the trivial majority-class baseline. The accuracy number tells you almost nothing about the model's ability to identify the positive class. For this level of imbalance, metrics like AUC-PR, precision, recall, and F1 are far more informative.
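A quick illustration with made-up labels at the question's 3% imbalance, showing how the trivial majority-class predictor scores on each metric:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 3% positive class, as in the question
y_true = np.array([1] * 30 + [0] * 970)
y_pred = np.zeros_like(y_true)  # predicts "negative" for everyone

print(accuracy_score(y_true, y_pred))                  # 0.97 -- looks great
print(recall_score(y_true, y_pred))                    # 0.0  -- catches no positives
print(f1_score(y_true, y_pred, zero_division=0))       # 0.0
```

Recall and F1 immediately expose what accuracy hides: the model never identifies a single positive case.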


Question 3 (Short Answer)

You train a churn model and a feature called last_payment_method_change_reason has the highest feature importance (0.42). The feature contains values like "downgrade," "cancellation_pending," "billing_issue," and "none." Should you be concerned? Explain.

Answer: Yes, this is almost certainly target leakage. The value "cancellation_pending" directly encodes that the subscriber is already in the process of churning, which means this information would not be available at the time you need to make a prediction (before the churn event). Similarly, "downgrade" may only be recorded after a churn-related action. Any single feature that dominates the importance ranking warrants investigation, and a feature whose values include states that only exist post-churn is a clear leak.


Question 4 (Multiple Choice)

Data leakage from fitting a StandardScaler on the entire dataset before splitting into train and test sets is called:

  • A) Target leakage
  • B) Train-test contamination
  • C) Feature leakage
  • D) Temporal leakage

Answer: B) Train-test contamination. When you fit the scaler on all data, the mean and standard deviation used for scaling include information from the test set. This means the training data has been influenced by test set statistics, contaminating the separation between train and test. The fix is to use a scikit-learn Pipeline that fits the scaler inside each cross-validation fold, ensuring the test fold's statistics never influence the training process.
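A minimal sketch of the fix described above, on synthetic data: the unfitted pipeline is passed to cross_val_score, so the scaler is re-fit on each fold's training portion and the test fold's statistics never influence training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# scaler + model are treated as one estimator; both are fit fold-by-fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The contaminated version would instead call `StandardScaler().fit_transform(X)` on all 500 rows first, then cross-validate only the classifier.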


Question 5 (Multiple Choice)

For a hospital readmission model where a missed readmission (false negative) can lead to patient harm, the most appropriate primary metric is:

  • A) Accuracy
  • B) Precision
  • C) Recall
  • D) Specificity

Answer: C) Recall. Recall measures the fraction of actual readmissions that the model catches. In a medical context where false negatives have severe consequences (patient harm, emergency room visits, potential death), maximizing recall ensures that as many at-risk patients as possible receive follow-up intervention. The cost of false positives (unnecessary follow-up calls) is much lower than the cost of false negatives.


Question 6 (Short Answer)

Explain the difference between AUC-ROC and AUC-PR. When is AUC-PR a better choice than AUC-ROC?

Answer: AUC-ROC measures the model's ability to rank positive above negative examples across all thresholds, plotting true positive rate against false positive rate. AUC-PR plots precision against recall. AUC-PR is the better choice for imbalanced datasets because it focuses on the positive (minority) class. AUC-ROC can appear high even when the model performs poorly on the minority class, because the large number of true negatives keeps the false positive rate low even when false positives outnumber true positives. AUC-PR exposes this weakness: precision depends directly on the false positive count, and neither axis involves true negatives.
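A sketch of the divergence on a synthetic, heavily imbalanced dataset (sizes and the classifier choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# roughly 2% positives
X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# average_precision_score is the standard summary of the PR curve
print(roc_auc_score(y_te, proba))            # typically looks strong
print(average_precision_score(y_te, proba))  # usually noticeably lower here
```

On balanced data the two scores tend to tell a similar story; the gap opens up as the positive class shrinks.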


Question 7 (Multiple Choice)

A model predicts "80% probability of churn" for a group of 100 subscribers. Only 55 of them actually churn. This model is:

  • A) Well calibrated
  • B) Overconfident (predicted probability is too high)
  • C) Underconfident (predicted probability is too low)
  • D) Impossible to determine without more information

Answer: B) Overconfident (predicted probability is too high). The model assigned 80% probability but only 55% actually churned. The predicted probabilities are systematically higher than the observed frequencies. Post-hoc calibration (Platt scaling or isotonic regression) can correct this miscalibration by learning a mapping from raw predicted probabilities to calibrated ones.
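A small sketch of how a reliability (calibration) curve would surface this, mirroring the question's numbers with simulated outcomes (the 0.55 churn rate is drawn randomly, so the observed fraction is only approximately 0.55):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# the model predicts ~0.8 for 100 subscribers whose true churn rate is 0.55
y_prob = np.full(100, 0.8)
y_true = (rng.random(100) < 0.55).astype(int)

# returns (observed fraction of positives, mean predicted prob) per bin
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(mean_pred[0], frac_pos[0])  # predicted ~0.8 vs observed well below it
```

For a well-calibrated model the two columns track each other; here the gap between mean_pred and frac_pos is exactly the overconfidence that Platt scaling or isotonic regression would correct.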


Question 8 (Multiple Choice)

You compare two models using 10-fold cross-validation. Model A has mean AUC = 0.823 (std = 0.014) and Model B has mean AUC = 0.829 (std = 0.016). A paired t-test gives p = 0.34. The correct interpretation is:

  • A) Model B is significantly better than Model A
  • B) The difference is not statistically significant; you cannot conclude B is better
  • C) Model A is better because it has lower variance
  • D) The test is invalid because the AUC difference is small

Answer: B) The difference is not statistically significant; you cannot conclude B is better. A p-value of 0.34 means there is a 34% chance of observing this AUC difference (or larger) if the models were truly equal. This far exceeds the conventional alpha threshold of 0.05. The 0.006 AUC gap is within the noise range given the standard deviations. Model selection should be based on practical factors (speed, interpretability, maintenance) rather than this non-significant performance difference.
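A sketch of the paired test itself, with hypothetical per-fold AUC values whose means match the question's (0.823 and 0.829); the p-value from these made-up numbers will differ from the question's 0.34:

```python
import numpy as np
from scipy import stats

# hypothetical per-fold AUC scores for two models on the SAME 10 folds
auc_a = np.array([0.81, 0.83, 0.82, 0.84, 0.80, 0.82, 0.83, 0.81, 0.84, 0.83])
auc_b = np.array([0.82, 0.84, 0.81, 0.85, 0.81, 0.83, 0.82, 0.83, 0.84, 0.84])

# paired test: each fold yields one (A, B) pair, so fold difficulty cancels out
t_stat, p_value = stats.ttest_rel(auc_b, auc_a)
print(p_value)  # compare against alpha = 0.05
```

The pairing is the important part: an unpaired t-test would ignore that both models were scored on identical folds and waste statistical power.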


Question 9 (Short Answer)

A colleague trained a model on the full dataset, then used cross-validation to evaluate it. They report "5-fold CV AUC = 0.91." What is wrong with this approach?

Answer: If the colleague fit any preprocessing steps (scaling, feature selection, encoding, imputation) on the full dataset before running cross-validation, information from each fold's test set has leaked into the training process through those preprocessing steps. Even if they only called cross_val_score on an already-fitted pipeline, cross-validation must encompass the entire training process, not just the model fitting step. The correct approach is to put all preprocessing into a scikit-learn Pipeline and pass the unfitted pipeline to cross_val_score, so each fold independently fits preprocessing on its training portion.


Question 10 (Multiple Choice)

In a learning curve, the training score starts high and decreases as training set size increases, while the validation score starts low and increases. At the maximum training set size, the training score is 0.92 and the validation score is 0.78. This suggests:

  • A) The model is underfitting
  • B) The model is overfitting and more data may help
  • C) The model is perfectly fitted
  • D) The data has no learnable signal

Answer: B) The model is overfitting and more data may help. The 0.14 gap between training (0.92) and validation (0.78) scores indicates overfitting --- the model memorizes training data better than it generalizes. The fact that the validation curve is still climbing at the maximum training size suggests that adding more data would continue to improve generalization. Alternatively, reducing model complexity (fewer trees, lower depth, more regularization) would narrow the gap.


Question 11 (Short Answer)

Explain why comparing two models based on a single train-test split is unreliable, even if the test set is large (e.g., 10,000 samples).

Answer: A single split produces one performance estimate that depends on which specific samples ended up in the test set. Different random splits will produce different estimates, and the variance of these estimates can be substantial even with 10,000 test samples. Without multiple splits, you cannot distinguish between genuine model differences and lucky/unlucky test set draws. Cross-validation provides multiple estimates, a mean that is more stable than any single split, and a standard deviation that quantifies the uncertainty.


Question 12 (Multiple Choice)

The Brier score measures:

  • A) The area under the ROC curve
  • B) The mean squared difference between predicted probabilities and actual binary outcomes
  • C) The harmonic mean of precision and recall
  • D) The correlation between predicted and actual values

Answer: B) The mean squared difference between predicted probabilities and actual binary outcomes. Brier score = (1/n) * sum((p_i - y_i)^2), where p_i is the predicted probability and y_i is the actual outcome (0 or 1). It ranges from 0 (perfect) to 1 (worst). Unlike AUC, the Brier score evaluates the quality of the probability estimates themselves, not just the ranking. A well-calibrated model with moderate discrimination can have a better Brier score than a poorly calibrated model with high AUC.
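The formula in the answer maps directly onto scikit-learn's brier_score_loss; a tiny worked example with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1])
p_pred = np.array([0.1, 0.9, 0.8, 0.3, 0.6])

# mean squared difference between predicted probabilities and outcomes
score = brier_score_loss(y_true, p_pred)
print(score)  # ~0.062

# identical to the formula written out by hand
assert np.isclose(score, np.mean((p_pred - y_true) ** 2))
```

Working it by hand: the squared errors are 0.01, 0.01, 0.04, 0.09, 0.16, which sum to 0.31, and 0.31 / 5 = 0.062.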


Question 13 (Multiple Choice)

A churn model produces the following at threshold 0.10: precision = 0.15, recall = 0.72. A retention offer costs $5 and saving a churner is worth $180. The expected value of sending a retention offer to every subscriber flagged at this threshold is:

  • A) Positive --- send the offers
  • B) Negative --- do not send the offers
  • C) Zero --- it does not matter
  • D) Cannot be determined without more information

Answer: A) Positive --- send the offers. At precision = 0.15, 15% of flagged subscribers are actual churners. The expected value per flagged subscriber = (0.15 * $180) - $5 = $27 - $5 = $22. Each retention offer has a positive expected value of $22, so the model creates value even at this low precision. The break-even precision is $5/$180 = 0.028 (2.8%). As long as precision exceeds 2.8%, the retention offers are profitable.
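The arithmetic from the answer, written out (all dollar figures come from the question):

```python
offer_cost = 5.0    # cost of one retention offer
save_value = 180.0  # value of retaining a churner
precision = 0.15    # fraction of flagged subscribers who are true churners

# expected value of sending one offer to a flagged subscriber
ev_per_offer = precision * save_value - offer_cost
print(ev_per_offer)  # ~22.0 dollars

# precision at which the offer exactly breaks even
break_even = offer_cost / save_value
print(round(break_even, 3))  # ~0.028, i.e. 2.8%
```

Note that recall (0.72) does not enter the per-offer calculation; it determines how many churners are reachable at all, not whether each individual offer is worth sending.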


Question 14 (Short Answer)

Your model achieves AUC-ROC = 0.91 in offline evaluation but an A/B test shows it performs worse than the current model. List three possible reasons for this disconnect.

Answer: (1) The offline metric does not match the business objective --- AUC-ROC measures ranking quality, but the business may care about calibrated probabilities or decisions at a specific threshold. (2) Distribution shift between the offline test set and live production traffic due to seasonal changes, user behavior shifts, or demographic differences. (3) Feedback loops in production where the model's predictions change user behavior (e.g., showing offers based on predictions changes purchase timing), an effect that offline evaluation cannot capture.


Question 15 (Multiple Choice)

McNemar's test for comparing two classifiers uses:

  • A) The mean AUC scores from cross-validation
  • B) The number of samples where one model is correct and the other is wrong
  • C) The overall accuracy of each model
  • D) The feature importance rankings of each model

Answer: B) The number of samples where one model is correct and the other is wrong. McNemar's test builds a 2x2 contingency table: both correct, A correct / B wrong, A wrong / B correct, both wrong. It tests whether the off-diagonal counts (A right B wrong vs. A wrong B right) are significantly different. If model A gets 80 samples right that B misses, but B only gets 40 right that A misses, McNemar's test determines whether this asymmetry is statistically significant.


This quiz covers Chapter 16: Model Evaluation Deep Dive. Return to the chapter to review concepts.