Chapter 8: Quiz

Test your understanding of model evaluation, selection, and validation. Each question has one best answer unless stated otherwise.


Question 1

What is the primary purpose of the test set in a train/validation/test split?

  • A) To tune hyperparameters
  • B) To select the best model from several candidates
  • C) To provide an unbiased estimate of generalization performance
  • D) To provide additional training data when the model underfits
Answer **C) To provide an unbiased estimate of generalization performance.** The test set is reserved for a single, final evaluation after all model selection and hyperparameter tuning decisions have been made. Using it for tuning (A) or model selection (B) would introduce optimistic bias into the performance estimate. It is never used for training (D).

Question 2

In 5-fold cross-validation, how much of the data is used for training in each fold?

  • A) 50%
  • B) 60%
  • C) 80%
  • D) 95%
Answer **C) 80%.** In k-fold cross-validation, each fold uses $(k-1)/k$ of the data for training and $1/k$ for validation. With $k=5$: $4/5 = 80\%$ for training and $1/5 = 20\%$ for validation.
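
A minimal sketch (assuming scikit-learn; the dataset size of 1,000 samples is illustrative) that prints the per-fold split sizes:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # illustrative dataset of 1,000 samples

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each fold trains on 4/5 of the data (800 samples) and validates on 1/5 (200)
    print(f"Fold {i}: train={len(train_idx)}, validation={len(val_idx)}")
```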

Question 3

A dataset contains 10,000 samples, of which 50 are positive (0.5%). A classifier predicts all samples as negative. What is its accuracy?

  • A) 0.5%
  • B) 50%
  • C) 95%
  • D) 99.5%
Answer **D) 99.5%.** If all 10,000 samples are predicted as negative, the model correctly classifies all 9,950 negatives and misclassifies all 50 positives: accuracy = $9950 / 10000 = 99.5\%$. This demonstrates why accuracy is misleading for imbalanced datasets -- a trivial model achieves near-perfect accuracy.

Question 4

What is the recall of the all-negative classifier described in Question 3?

  • A) 0.0
  • B) 0.5
  • C) 0.995
  • D) 1.0
Answer **A) 0.0.** Recall = TP / (TP + FN). The all-negative classifier has TP = 0 and FN = 50 (all positives are missed), so recall = 0/50 = 0.0. The model catches none of the positive cases.
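
A quick sketch (assuming scikit-learn) that reproduces the accuracy and recall figures from Questions 3 and 4 for the all-negative classifier:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 samples: 50 positives (1), 9,950 negatives (0)
y_true = np.array([1] * 50 + [0] * 9950)
y_pred = np.zeros_like(y_true)  # the trivial all-negative classifier

print(accuracy_score(y_true, y_pred))  # 0.995 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0   -- catches none of the positives
```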

Question 5

Which of the following best describes the F1 score?

  • A) The arithmetic mean of precision and recall
  • B) The geometric mean of precision and recall
  • C) The harmonic mean of precision and recall
  • D) The weighted average of precision, recall, and accuracy
Answer **C) The harmonic mean of precision and recall.** $F_1 = 2 \cdot \frac{P \cdot R}{P + R}$. The harmonic mean is always less than or equal to the arithmetic mean, and it penalizes extreme imbalance between precision and recall more heavily.
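
A short worked example (with hypothetical precision and recall values) showing how the harmonic mean penalizes imbalance more than the arithmetic mean:

```python
# Hypothetical values: high precision, very low recall
p, r = 0.9, 0.1

arithmetic_mean = (p + r) / 2  # 0.50 -- hides the collapse in recall
f1 = 2 * p * r / (p + r)       # 0.18 -- the harmonic mean is dragged toward the weaker score
print(arithmetic_mean, round(f1, 2))
```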

Question 6

When is AUC-PR (Average Precision) preferred over AUC-ROC?

  • A) When classes are balanced
  • B) When classes are highly imbalanced
  • C) When the model outputs hard labels instead of probabilities
  • D) When comparing more than two models
Answer **B) When classes are highly imbalanced.** AUC-ROC can be misleadingly optimistic for imbalanced datasets because the False Positive Rate denominator is dominated by the large negative class. AUC-PR focuses entirely on the positive class, making it more sensitive to performance changes that matter in imbalanced settings.
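
A minimal sketch (assuming scikit-learn; the synthetic data and logistic regression model are placeholders) contrasting the two metrics on a heavily imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (roughly 1% positives)
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("AUC-ROC:", roc_auc_score(y_te, scores))            # tends to look reassuringly high
print("AUC-PR: ", average_precision_score(y_te, scores))  # typically much lower and more informative here
```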

Question 7

A model has training MSE of 0.02 and validation MSE of 0.85. This is most likely a case of:

  • A) High bias (underfitting)
  • B) High variance (overfitting)
  • C) Good fit
  • D) Data leakage
Answer **B) High variance (overfitting).** The large gap between training error (very low) and validation error (much higher) is the hallmark of overfitting. The model has memorized the training data but fails to generalize. High bias would show high error on both sets. A good fit would show similar, low errors on both sets.

Question 8

Which cross-validation strategy is appropriate for time series data?

  • A) Standard k-fold
  • B) Stratified k-fold
  • C) TimeSeriesSplit (forward-chaining)
  • D) Leave-one-out
Answer **C) TimeSeriesSplit (forward-chaining).** Time series data has temporal ordering that must be respected. Standard k-fold and stratified k-fold would allow training on future data to predict the past (temporal leakage). TimeSeriesSplit always trains on past data and tests on future data, respecting chronological order.
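
A minimal sketch (assuming scikit-learn; 12 samples chosen just to keep the output readable) showing how the forward-chaining splits respect time order:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # The training window always precedes the test window -- no temporal leakage
    print("train:", train_idx, "test:", test_idx)
```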

Question 9

In McNemar's test, which cells of the contingency table are used?

  • A) All four cells
  • B) Only the diagonal cells (both correct, both wrong)
  • C) Only the off-diagonal cells (discordant pairs)
  • D) Only the cells where Model A is correct
Answer **C) Only the off-diagonal cells (discordant pairs).** McNemar's test uses only the cases where the two models disagree -- the discordant counts $n_{01}$ and $n_{10}$, one counting cases where A is right and B is wrong, the other cases where A is wrong and B is right. Cases where both models agree (both right or both wrong) provide no information about which model is better.
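
A minimal sketch (assuming SciPy; the discordant counts are made-up numbers) of the chi-squared form of McNemar's test, which uses only the disagreement cells:

```python
from scipy.stats import chi2

# Hypothetical discordant counts: each counts cases where exactly one model is correct
b = 30  # model A correct, model B wrong
c = 12  # model A wrong, model B correct

# McNemar's chi-squared statistic with continuity correction (1 degree of freedom)
stat = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(stat, p_value)  # the concordant cells never enter the calculation
```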

Question 10

What is the main advantage of random search over grid search for hyperparameter tuning?

  • A) Random search always finds the global optimum
  • B) Random search is more efficient because it explores more unique values per hyperparameter
  • C) Random search has lower variance in its results
  • D) Random search does not require specifying parameter ranges
Answer **B) Random search is more efficient because it explores more unique values per hyperparameter.** As shown by Bergstra and Bengio (2012), not all hyperparameters are equally important. Grid search repeatedly re-evaluates the same few values of the important hyperparameters while spending most of its budget varying unimportant ones. Random search samples each hyperparameter independently on every trial, so the same budget covers many more distinct values of the dimensions that matter.
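
A brief sketch (assuming scikit-learn and SciPy; the estimator and parameter range are illustrative) of random search drawing from a distribution rather than a fixed grid:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Every one of the 20 trials draws a fresh value of C, so 20 distinct values
# of this (potentially important) hyperparameter get explored.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```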

Question 11

Which of the following is an example of data leakage?

  • A) Using 5-fold cross-validation instead of a single train/test split
  • B) Fitting a StandardScaler on the full dataset before splitting
  • C) Using a Pipeline that includes preprocessing and model training
  • D) Reporting cross-validation scores with standard deviations
Answer **B) Fitting a StandardScaler on the full dataset before splitting.** When you fit the scaler on the full dataset, the test set statistics (mean, standard deviation) leak into the training process. The scaler should be fit only on the training set and then used to transform the validation and test sets. Using a Pipeline (C) actually prevents this type of leakage.
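
A minimal sketch (assuming scikit-learn; the data is synthetic) of the leakage-free pattern, where the scaler is refit inside each cross-validation fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# The Pipeline refits the scaler on the training folds only, so no
# validation-fold statistics leak into preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```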

Question 12

What does an $R^2$ value of 0.75 mean?

  • A) The model is 75% accurate
  • B) The model explains 75% of the variance in the target
  • C) The model's predictions are within 75% of the true values
  • D) The model has a 75% correlation with the target
Answer **B) The model explains 75% of the variance in the target.** $R^2 = 1 - SS_{res}/SS_{tot}$ measures the proportion of the total variance in the target variable that is explained by the model. An $R^2$ of 0.75 means 75% of the variance is explained, and 25% remains unexplained.
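
A tiny worked example (made-up numbers) of the $R^2$ definition from the answer:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.95: the model explains 95% of the target variance
```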

Question 13

In the bias-variance decomposition, what does the "irreducible error" represent?

  • A) Error due to model complexity
  • B) Error due to insufficient training data
  • C) Noise inherent in the data that no model can eliminate
  • D) Error from choosing the wrong model family
Answer **C) Noise inherent in the data that no model can eliminate.** The irreducible error ($\sigma^2$) represents the inherent randomness in the relationship between features and target. Even a perfect model that knows the true data-generating function cannot predict the noise. Model complexity (A) relates to the variance term, and wrong model family (D) relates to the bias term.
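
For reference (notation assumed here: $f$ is the true function, $\hat{f}$ the learned model, $\sigma^2$ the noise variance), the standard decomposition of expected squared error at a point $x$ is:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$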

Question 14

You have a dataset of 200 samples. Which evaluation strategy is most appropriate?

  • A) A single 80/20 train/test split
  • B) 10-fold cross-validation
  • C) A single 60/20/20 train/validation/test split
  • D) Hold out 50% for testing
Answer **B) 10-fold cross-validation.** With only 200 samples, a single split would leave very few samples in each partition, leading to high variance in the performance estimate. Cross-validation uses all data for both training and evaluation, providing more robust estimates. 10-fold CV gives 180 training and 20 test samples per fold, a better balance than a single split.

Question 15

What is the purpose of nested cross-validation?

  • A) To speed up cross-validation by parallelizing folds
  • B) To handle imbalanced classes
  • C) To provide an unbiased estimate of performance when hyperparameters are tuned via CV
  • D) To combine predictions from multiple folds
Answer **C) To provide an unbiased estimate of performance when hyperparameters are tuned via CV.** When you tune hyperparameters using cross-validation and then report the best CV score, that score is optimistically biased. Nested CV uses an outer loop for performance estimation and an inner loop for hyperparameter tuning, preventing the tuning process from biasing the performance estimate.
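
A minimal sketch (assuming scikit-learn; the estimator and grid are illustrative) of nested CV, with tuning in the inner loop and performance estimation in the outer loop:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner 3-fold CV tunes C; outer 5-fold CV estimates generalization performance.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())  # not biased by the tuning done in the inner loop
```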

Question 16

MAE is preferred over RMSE when:

  • A) You want to penalize large errors more heavily
  • B) Outliers are present and should not dominate the metric
  • C) You need a metric in the original units of the target
  • D) The data follows a normal distribution
Answer **B) Outliers are present and should not dominate the metric.** MAE uses absolute differences (linear penalty), while RMSE uses squared differences (quadratic penalty). A single outlier with a large error can dominate RMSE but has a proportional effect on MAE. Note that both MAE and RMSE (C) are in the original units -- this is a property they share, not one that distinguishes them.
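
A small worked example (made-up error values) showing how a single outlier inflates RMSE far more than MAE:

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0, 20.0])  # one large outlier error

mae = np.mean(np.abs(errors))         # 4.8  -- grows linearly with the outlier
rmse = np.sqrt(np.mean(errors ** 2))  # ~9.0 -- dominated by the squared outlier
print(mae, rmse)
```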

Question 17

A model's AUC-ROC is 0.50. What does this indicate?

  • A) The model is perfectly calibrated
  • B) The model performs no better than random guessing
  • C) The model achieves 50% accuracy
  • D) Half of the predictions are correct
Answer **B) The model performs no better than random guessing.** An AUC-ROC of 0.50 corresponds to the diagonal line on the ROC plot, which represents a classifier that assigns scores randomly without regard to the true labels. It means a randomly chosen positive example is equally likely to receive a higher or lower score than a randomly chosen negative example.

Question 18

Which of the following is NOT a remedy for high variance (overfitting)?

  • A) Adding more training data
  • B) Increasing model complexity
  • C) Adding regularization
  • D) Using ensemble methods like bagging
Answer **B) Increasing model complexity.** Increasing model complexity would make overfitting worse, not better. All other options help reduce variance: more data provides more evidence (A), regularization constrains the model (C), and bagging (Chapter 7) averages multiple models to reduce variance (D).

Question 19

When using the Bonferroni correction for comparing 5 models pairwise at $\alpha = 0.05$, what is the corrected significance threshold?

  • A) 0.01
  • B) 0.005
  • C) 0.025
  • D) 0.0025
Answer **B) 0.005.** With 5 models, there are $\binom{5}{2} = 10$ pairwise comparisons. The Bonferroni correction divides the significance level by the number of comparisons: $0.05 / 10 = 0.005$. This controls the family-wise error rate at 0.05.

Question 20

Bayesian optimization is most advantageous when:

  • A) The model trains in milliseconds
  • B) The search space has only 2 hyperparameters
  • C) Each model evaluation is computationally expensive
  • D) You need to compare many model families
Answer **C) Each model evaluation is computationally expensive.** Bayesian optimization's main advantage is sample efficiency -- it uses a surrogate model to intelligently choose the next evaluation point. This matters most when each evaluation is costly (e.g., training a large neural network for hours). When evaluations are cheap (A), random search or grid search may be sufficient and simpler to implement.

Question 21

What is the probabilistic interpretation of AUC-ROC?

  • A) The probability that the model is correct
  • B) The probability that a random positive example is ranked higher than a random negative example
  • C) The probability of a true positive
  • D) The probability that precision exceeds recall
Answer **B) The probability that a random positive example is ranked higher than a random negative example.** This is the Wilcoxon-Mann-Whitney interpretation of AUC. An AUC of 0.85 means that if you randomly pick one positive and one negative example, there is an 85% chance the model assigns a higher score to the positive example.
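
A brief sketch (made-up scores) verifying the pairwise-ranking interpretation against scikit-learn's roc_auc_score:

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.30])

# Fraction of (positive, negative) pairs where the positive is ranked higher
# (ties count as half)
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = [(p > n) + 0.5 * (p == n) for p, n in product(pos, neg)]
print(np.mean(pairwise), roc_auc_score(y_true, scores))  # both print the same value
```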

Question 22

Which metric averaging strategy for multi-class classification treats all classes equally regardless of their size?

  • A) Micro averaging
  • B) Macro averaging
  • C) Weighted averaging
  • D) Sample averaging
Answer **B) Macro averaging.** Macro averaging computes the metric independently for each class and then takes the unweighted mean, giving equal importance to all classes regardless of their frequency. Micro averaging (A) aggregates counts and is dominated by large classes. Weighted averaging (C) explicitly weights by class frequency.
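
A quick sketch (made-up multi-class labels, with class 2 deliberately rare) contrasting the averaging strategies:

```python
from sklearn.metrics import f1_score

# Three classes; class 2 is rare
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="micro"))     # aggregates TP/FP/FN over all samples
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by class frequency
```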

Question 23

A learning curve shows that both training and validation errors are high and converge to similar values as training set size increases. This suggests:

  • A) The model is overfitting
  • B) The model is underfitting (high bias)
  • C) The model needs more regularization
  • D) There is data leakage
Answer **B) The model is underfitting (high bias).** When both errors are high and the gap between them is small, the model lacks the capacity to capture the underlying pattern. Adding more data will not help because the model is already limited by its simplicity. The remedy is to increase model complexity, add features, or reduce regularization.

Question 24

You standardize your features using the entire dataset (train + test), then split into train and test. Compared to correct preprocessing (fit on train only), your reported test accuracy will most likely be:

  • A) Lower (pessimistically biased)
  • B) The same
  • C) Higher (optimistically biased)
  • D) More variable but unbiased
Answer **C) Higher (optimistically biased).** By fitting the scaler on the full dataset, test set statistics leak into the training process. The model indirectly gains information about the test distribution, leading to artificially inflated test performance. The effect is usually small for large datasets but can be significant for small ones.

Question 25

In the F-beta score with $\beta = 2$, which is weighted more heavily?

  • A) Precision is weighted 2x more than recall
  • B) Recall is weighted 2x more than precision
  • C) Precision is weighted 4x more than recall
  • D) Recall is weighted 4x more than precision
Answer **B) Recall is weighted 2x more than precision.** In the F-beta score, $F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$, the parameter $\beta$ controls the relative importance of recall versus precision: with $\beta = 2$, recall is treated as twice as important as precision. This makes $F_2$ useful in applications where missing positives (false negatives) is more costly than false alarms (false positives), such as medical screening.
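
A tiny sketch (assuming scikit-learn, with made-up labels) comparing $F_1$ and $F_2$ on the same predictions:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # precision = 2/3, recall = 1/2

print(f1_score(y_true, y_pred))             # ~0.571: precision and recall weighted equally
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.526: lower, because recall is the weak spot
```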