Chapter 8: Exercises

Section A: Conceptual Questions

Exercise 8.1: The Role of the Test Set

Explain why the test set should only be used once at the very end of model development. What happens to the reliability of your performance estimate if you use the test set repeatedly during hyperparameter tuning? Relate your answer to the concept of overfitting to the validation set discussed in Section 8.10.2.

Exercise 8.2: Cross-Validation Tradeoffs

A colleague argues that Leave-One-Out Cross-Validation (LOOCV) is always better than 5-fold cross-validation because each training set is larger. Critique this argument. Under what circumstances might 5-fold CV actually give a better estimate of generalization performance? Reference the bias-variance tradeoff of the estimator itself (not the model).

Exercise 8.3: Metric Selection for Imbalanced Data

You are building a fraud detection system where only 0.5% of transactions are fraudulent. The cost of missing a fraudulent transaction (false negative) is 100 times the cost of incorrectly flagging a legitimate transaction (false positive).

a) Explain why accuracy is a poor metric for this problem. b) Which metric or combination of metrics would you use? Justify your choice. c) Would you use AUC-ROC or AUC-PR? Explain the difference in this context. d) Calculate the appropriate $\beta$ for an F-beta score that reflects the 100:1 cost ratio.

Exercise 8.4: Bias-Variance Diagnosis

You train a decision tree on a dataset of 10,000 samples. The training accuracy is 99.8% and the validation accuracy is 72.3%.

a) Diagnose the problem using the bias-variance framework. b) List three specific actions you could take to address the problem. c) Which of the ensemble methods discussed in Chapter 7 would be most helpful here, and why?

Exercise 8.5: Time Series Splitting

Explain why standard k-fold cross-validation is invalid for time series data. Draw a diagram showing the first three folds of a time series cross-validation with 1000 data points and 5 splits. What is the minimum training set size in your scheme?

Exercise 8.6: Data Leakage

For each scenario, identify whether data leakage is present and explain why:

a) You standardize features (subtract mean, divide by standard deviation) using the entire dataset before splitting into train and test sets. b) You perform feature selection using mutual information on the training set only, then apply the selected features to the test set. c) You use tomorrow's stock price as a feature to predict today's stock movement direction. d) You run k-fold cross-validation on a Pipeline that includes a StandardScaler, so the pipeline is refit on each training fold.

Exercise 8.7: Statistical Testing

Models A and B are compared on a 1000-sample test set. Their predictions disagree on 100 samples: A is correct on 65 of those and B is correct on 35.

a) Set up McNemar's test for this comparison. b) Compute the chi-squared statistic. c) At $\alpha = 0.05$, is the difference statistically significant? d) What would you conclude if the discordant pairs were 55 vs. 45 instead?

Exercise 8.8: Nested Cross-Validation

Explain the difference between standard cross-validation and nested cross-validation. When is nested CV necessary? Draw a diagram showing the structure of nested CV with 5 outer folds and 3 inner folds. How many total model fits are performed if the inner loop evaluates 20 hyperparameter configurations?

Exercise 8.9: Precision-Recall Tradeoff

A medical diagnostic model outputs a probability between 0 and 1. At a threshold of 0.5, precision is 0.85 and recall is 0.60. At a threshold of 0.3, precision is 0.70 and recall is 0.90.

a) Explain the tradeoff intuitively. b) For cancer screening, which threshold would you recommend and why? c) Compute the F1 score at each threshold. Does F1 agree with your recommendation? d) What $\beta$ in the F-beta score would you need for F-beta to agree with your domain-driven recommendation?

Exercise 8.10: Calibration

A weather forecasting model predicts a 90% chance of rain on 100 different days. It actually rains on 60 of those days.

a) Is this model well-calibrated? Explain. b) What would a perfectly calibrated model predict? c) Name two methods to improve the calibration of a poorly calibrated model.


Section B: Mathematical Exercises

Exercise 8.11: Deriving F1 from Precision and Recall

Prove that $F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$ by starting from the harmonic mean definition $F_1 = 2 \cdot \frac{P \cdot R}{P + R}$ and substituting the definitions of precision and recall.

Exercise 8.12: R-Squared Bounds

a) Prove that $R^2 \leq 1$ for any model. b) Show by example that $R^2$ can be negative on a test set. c) Prove that $R^2 = 0$ on the training set when the model always predicts $\bar{y}$, the mean of the training targets.

Exercise 8.13: Bias-Variance Decomposition

For the squared loss, derive the bias-variance decomposition:

$$\mathbb{E}_{D,\epsilon}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$$

where the expectation is taken over training sets $D$ and the noise $\epsilon$, with $y = f(x) + \epsilon$, $\mathbb{E}[\epsilon] = 0$, and $\text{Var}(\epsilon) = \sigma^2$.

Hint: Add and subtract $\mathbb{E}_D[\hat{f}(x)]$ inside the square.

Exercise 8.14: Cross-Validation Variance

Show that the variance of the k-fold cross-validation estimator is approximately:

$$\text{Var}(\hat{E}_{CV}) \approx \frac{1}{k}\sigma^2 + \frac{k-1}{k}\rho\sigma^2$$

where $\sigma^2$ is the variance of a single fold's estimate and $\rho$ is the correlation between estimates from different folds. Explain why this means LOOCV ($k=n$) can have high variance when $\rho$ is large.

Exercise 8.15: ROC Curve Properties

a) Prove that the ROC curve always passes through the points $(0, 0)$ and $(1, 1)$. b) Prove that a perfect classifier has AUC = 1. c) Prove that a random classifier (one that assigns scores independently of the true labels) has an expected AUC of 0.5. d) Can a classifier have AUC < 0.5? What does this mean practically?


Section C: Programming Exercises

Exercise 8.16: Implementing K-Fold from Scratch

Implement k-fold cross-validation from scratch without using scikit-learn's KFold or cross_val_score. Your implementation should:

a) Shuffle the data. b) Split into k folds. c) For each fold, train a model and compute the validation score. d) Return the mean and standard deviation of scores.

Test your implementation on the Iris dataset and compare with scikit-learn's result.

import numpy as np

def kfold_from_scratch(
    X: np.ndarray,
    y: np.ndarray,
    model,
    k: int = 5,
    random_state: int = 42
) -> tuple[float, float]:
    """Implement k-fold cross-validation from scratch.

    Args:
        X: Feature matrix.
        y: Target vector.
        model: Scikit-learn estimator (must support fit/score).
        k: Number of folds.
        random_state: Random seed.

    Returns:
        Tuple of (mean_accuracy, std_accuracy).
    """
    # YOUR CODE HERE
    pass

Exercise 8.17: ROC Curve from Scratch

Implement ROC curve computation from scratch. Given true labels and predicted probabilities, compute the TPR and FPR at every unique threshold. Then compute the AUC using the trapezoidal rule.

import numpy as np

def roc_curve_from_scratch(
    y_true: np.ndarray,
    y_scores: np.ndarray
) -> tuple[np.ndarray, np.ndarray, float]:
    """Compute ROC curve and AUC from scratch.

    Args:
        y_true: Ground truth binary labels (0 or 1).
        y_scores: Predicted scores or probabilities.

    Returns:
        Tuple of (fpr_array, tpr_array, auc_value).
    """
    # YOUR CODE HERE
    pass

Exercise 8.18: Learning Curve Visualization

Using the Breast Cancer Wisconsin dataset from scikit-learn:

a) Train a LogisticRegression model and generate a learning curve with training set sizes from 10% to 100%. b) Train a DecisionTreeClassifier (no depth limit) and generate its learning curve. c) Plot both learning curves on the same figure. d) Diagnose whether each model exhibits high bias, high variance, or a good fit.
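
A possible starting sketch, assuming matplotlib for plotting; the 5-fold CV and the ten training-set sizes below are illustrative choices, not requirements of the exercise:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "DecisionTree (no depth limit)": DecisionTreeClassifier(random_state=0),
}

plt.figure()
for name, model in models.items():
    # learning_curve refits the model at each training-set size with 5-fold CV.
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
        scoring="accuracy")
    plt.plot(sizes, train_scores.mean(axis=1), label=f"{name} (train)")
    plt.plot(sizes, val_scores.mean(axis=1), "--", label=f"{name} (validation)")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()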

Exercise 8.19: Metric Sensitivity Analysis

Create a synthetic binary classification dataset with 1000 samples and 10% positive class. Train a logistic regression model and:

a) Compute accuracy, precision, recall, F1, AUC-ROC, and AUC-PR. b) Vary the classification threshold from 0.1 to 0.9 in steps of 0.1. c) Plot how each metric changes with the threshold. d) Identify the threshold that maximizes F1. e) Compare the F1-optimal threshold with the default 0.5 threshold.
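
One way to scaffold the threshold sweep; the make_classification settings and the stratified split are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Threshold-free metrics are computed once from the scores.
print("AUC-ROC:", roc_auc_score(y_test, proba))
print("AUC-PR :", average_precision_score(y_test, proba))

# Threshold-dependent metrics are recomputed as the cutoff moves.
for t in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= t).astype(int)
    print(f"t={t:.1f}  acc={accuracy_score(y_test, pred):.3f}  "
          f"P={precision_score(y_test, pred, zero_division=0):.3f}  "
          f"R={recall_score(y_test, pred):.3f}  "
          f"F1={f1_score(y_test, pred):.3f}")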

Exercise 8.20: Grid Search vs. Random Search

Using the Wine dataset from scikit-learn and a RandomForestClassifier:

a) Define a parameter grid with at least 4 hyperparameters. b) Run grid search and record the best score and the number of model fits. c) Run random search with the same total number of model fits. d) Compare the best scores. Repeat the experiment 10 times with different random seeds and plot the distribution of best scores for each method.
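
A sketch of parts a)-c); the particular grid, the scipy distributions used for random search, and the 5-fold CV are example choices:

from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_wine(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
n_candidates = len(grid.cv_results_["params"])
print("Grid search best:", grid.best_score_, "CV fits:", n_candidates * 5)

# Random search drawing the same number of candidate configurations.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 10),
    "max_features": ["sqrt", "log2"],
}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                          n_iter=n_candidates, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best:", rand.best_score_)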

Exercise 8.21: Hyperparameter Sensitivity

Using any dataset and model of your choice:

a) Select the two most important hyperparameters. b) Create a 2D heatmap showing cross-validation performance as a function of both hyperparameters. c) Identify the region of the hyperparameter space that yields the best performance. d) Discuss whether the optimal region is robust (broad) or fragile (narrow).
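
If you have no dataset in mind, a minimal sketch along these lines can serve as a template; the random forest, the Breast Cancer data, and the two swept hyperparameters (max_depth and n_estimators) are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
depths = [2, 4, 6, 8, 10]
n_trees = [10, 50, 100, 200]

# Mean 5-fold CV accuracy for every (max_depth, n_estimators) pair.
scores = np.zeros((len(depths), len(n_trees)))
for i, d in enumerate(depths):
    for j, n in enumerate(n_trees):
        model = RandomForestClassifier(max_depth=d, n_estimators=n, random_state=0)
        scores[i, j] = cross_val_score(model, X, y, cv=5).mean()

plt.imshow(scores, origin="lower", aspect="auto")
plt.xticks(range(len(n_trees)), n_trees)
plt.yticks(range(len(depths)), depths)
plt.xlabel("n_estimators")
plt.ylabel("max_depth")
plt.colorbar(label="Mean CV accuracy")
plt.show()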

Exercise 8.22: Implementing McNemar's Test

Implement McNemar's test from scratch and apply it to compare two classifiers:

a) Train a logistic regression and a random forest on the same dataset. b) Compute the contingency table of their predictions on the test set. c) Compute the chi-squared statistic and p-value. d) Determine whether the difference is statistically significant at $\alpha = 0.05$. e) Verify your result using scipy.stats.
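
A minimal from-scratch sketch of parts a)-d), using the continuity-corrected statistic; the dataset and the split are placeholders:

import numpy as np
from scipy.stats import chi2
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred_a = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict(X_te)
pred_b = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

a_right, b_right = pred_a == y_te, pred_b == y_te
n01 = int(np.sum(a_right & ~b_right))   # A correct, B wrong
n10 = int(np.sum(~a_right & b_right))   # A wrong, B correct

# McNemar statistic on the discordant pairs, with continuity correction.
stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_value = chi2.sf(stat, df=1)
print(f"n01={n01}, n10={n10}, chi2={stat:.3f}, p={p_value:.4f}")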

Exercise 8.23: Stratified vs. Standard K-Fold

Create a highly imbalanced dataset (95% negative, 5% positive) with 200 samples:

a) Run standard 5-fold cross-validation and print the class distribution in each fold. b) Run stratified 5-fold cross-validation and print the class distribution in each fold. c) Compare the variance of the F1 scores across folds for each method. d) Explain why stratified folds produce more reliable estimates for imbalanced data.
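
A possible scaffold; the logistic-regression classifier and the make_classification settings are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.95, 0.05],
                           n_informative=4, random_state=0)

for splitter in (KFold(n_splits=5, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=5, shuffle=True, random_state=0)):
    print(type(splitter).__name__)
    scores = []
    for train_idx, val_idx in splitter.split(X, y):
        print("  positives in validation fold:", int(y[val_idx].sum()))
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[val_idx], clf.predict(X[val_idx]), zero_division=0))
    print("  F1 std across folds:", round(float(np.std(scores)), 3))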

Exercise 8.24: Time Series Cross-Validation Comparison

Generate a synthetic time series with a trend and seasonal component:

a) Evaluate a model using standard 5-fold CV and report the mean score. b) Evaluate the same model using TimeSeriesSplit with 5 splits and report the mean score. c) Which estimate is more reliable? Why? d) Introduce a "future leak" feature (the target shifted by -1) and show how standard CV fails to detect the leak while time series CV does.
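
A sketch of parts a)-b), assuming simple lag features and a ridge regressor as the model; both are arbitrary choices:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
t = np.arange(500)
series = 0.05 * t + 10 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 2, size=t.size)

# Lag features: predict series[t] from the previous three observations.
X = np.column_stack([np.roll(series, k) for k in (1, 2, 3)])[3:]
y = series[3:]

for cv in (KFold(n_splits=5, shuffle=True, random_state=0),
           TimeSeriesSplit(n_splits=5)):
    scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
    print(type(cv).__name__, round(float(scores.mean()), 3))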

Exercise 8.25: Complete Model Selection Pipeline

Build a complete model selection pipeline for the Digits dataset from scikit-learn:

a) Split data into train (70%), validation (15%), and test (15%). b) Evaluate at least three model families (e.g., logistic regression, SVM, random forest). c) For each model, tune hyperparameters using 5-fold CV on the training set. d) Select the best model based on validation performance. e) Report final performance on the test set with a classification report. f) Perform McNemar's test between the top two models. g) Generate a confusion matrix for the final model.
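
A possible scaffold for parts a)-d); the three model families, their small grids, and the scaling steps are examples only:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# 70% train; split the remaining 30% in half for validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

candidates = {
    "logreg": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
               {"logisticregression__C": [0.01, 0.1, 1, 10]}),
    "svm": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 300], "max_depth": [None, 10]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X_train, y_train)
    results[name] = (search.best_estimator_,
                     search.best_estimator_.score(X_val, y_val))
    print(name, "best CV:", search.best_score_, "validation:", results[name][1])

Parts e)-g) then follow from the selected estimator, e.g., via classification_report and confusion_matrix on the test set and a McNemar comparison of the top two models.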


Section D: Applied Exercises

Exercise 8.26: Real-World Metric Selection

For each of the following applications, recommend the most appropriate primary evaluation metric and justify your choice:

a) Email spam detection for a corporate email system. b) Autonomous vehicle pedestrian detection. c) Movie recommendation system rating prediction. d) Credit scoring for loan approval. e) Medical image classification (tumor vs. no tumor).

Exercise 8.27: Evaluation Report

Download or generate a dataset of your choice. Train at least two models and write a 1-page evaluation report that includes:

a) Dataset description and class distribution. b) Evaluation methodology (splitting strategy, metrics chosen, and why). c) Results table with mean and standard deviation from cross-validation. d) Statistical comparison of models. e) Final test set results. f) Limitations and recommendations.

Exercise 8.28: Debugging a Suspicious Model

A colleague presents a model that achieves 99.7% accuracy on a medical diagnosis task. The dataset has 50% positive and 50% negative cases. They used the following code:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Scale all data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # 0.997

a) Identify all potential problems with this evaluation. b) Rewrite the code to fix the issues. c) Explain what impact each fix might have on the reported accuracy.

Exercise 8.29: Confidence Intervals via Bootstrapping

Implement a bootstrap procedure to compute 95% confidence intervals for the test accuracy of a model:

a) Train a model on the training set. b) Generate 1000 bootstrap samples from the test set. c) Compute accuracy on each bootstrap sample. d) Report the 2.5th and 97.5th percentiles as the confidence interval. e) Compare with the normal approximation CI: $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$.
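
A minimal sketch; the model and dataset are placeholders:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
correct = (model.predict(X_te) == y_te).astype(float)  # per-sample 0/1 correctness

rng = np.random.default_rng(0)
n = len(correct)
boot_acc = np.array([correct[rng.integers(0, n, size=n)].mean()
                     for _ in range(1000)])
lo, hi = np.percentile(boot_acc, [2.5, 97.5])

p_hat = correct.mean()
margin = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"Bootstrap 95% CI:  [{lo:.3f}, {hi:.3f}]")
print(f"Normal approx. CI: [{p_hat - margin:.3f}, {p_hat + margin:.3f}]")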

Exercise 8.30: Multi-Metric Optimization

Sometimes you need to optimize multiple metrics simultaneously. Using the Breast Cancer dataset:

a) Compute precision and recall at 20 different thresholds. b) Plot the precision-recall curve. c) Find the threshold that maximizes F1. d) Find the threshold that achieves at least 95% recall with maximum precision. e) Discuss the tradeoff and which operating point you would choose for clinical deployment.
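
A sketch of parts c)-d) that uses every threshold returned by precision_recall_curve rather than exactly 20; the model choice is illustrative:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)
# precision and recall have one more entry than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best_f1_threshold = thresholds[np.argmax(f1)]
high_recall = recall[:-1] >= 0.95
best_precision_threshold = thresholds[high_recall][np.argmax(precision[:-1][high_recall])]
print("F1-optimal threshold:", best_f1_threshold)
print("Max-precision threshold with recall >= 0.95:", best_precision_threshold)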


Section E: Challenge Exercises

Exercise 8.31: Implementing Bayesian Optimization

Implement a simplified version of Bayesian optimization for hyperparameter tuning:

a) Define an objective function that takes hyperparameters and returns a cross-validation score. b) Implement a Gaussian Process surrogate model (you may use sklearn.gaussian_process). c) Implement the Expected Improvement acquisition function. d) Run the optimization loop for 30 iterations. e) Compare the convergence speed with random search on the same budget.
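
A sketch of part c), assuming a maximization objective; the Matern kernel, the toy one-dimensional objective, and the log-spaced candidate grid are illustrative stand-ins for a real cross-validation score over hyperparameters:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, best_so_far, xi=0.01):
    """EI(x) = (mu - best - xi) * Phi(z) + sigma * phi(z), for maximization."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    improvement = mu - best_so_far - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# Toy 1-D objective standing in for a CV score, searched in log10 space.
def objective(log10_c):
    return -(log10_c - 0.5) ** 2

observed_x = np.array([[-2.0], [0.0], [2.0]])      # log10 of three tried values
observed_y = objective(observed_x.ravel())

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(observed_x, observed_y)

grid = np.linspace(-3, 3, 200).reshape(-1, 1)      # candidate log10 values
ei = expected_improvement(grid, gp, observed_y.max())
print("Next log10(C) to evaluate:", grid[np.argmax(ei), 0])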

Exercise 8.32: Double Descent

Reproduce the double descent phenomenon:

a) Generate a polynomial regression dataset with noise. b) Fit polynomial models of degree 1 through 50 (some will be overparameterized). c) Plot training error and test error as a function of model complexity. d) Identify the interpolation threshold and the double descent region. e) Discuss the implications for the classical bias-variance tradeoff.

Exercise 8.33: Fairness Evaluation

Using the Adult Census Income dataset:

a) Train a classifier to predict income (>50K vs. <=50K). b) Compute accuracy, precision, and recall separately for male and female subgroups. c) Compute demographic parity difference and equalized odds difference. d) Discuss whether the model is "fair" and what tradeoffs exist between fairness criteria.
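
A sketch of the subgroup metrics in parts b)-c). Here y_true, y_pred, and the binary group array (e.g., sex) are placeholders for the Adult-dataset predictions; loading and encoding the data is left as part of the exercise:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def subgroup_report(y_true, y_pred, group):
    """Accuracy, precision, and recall computed separately per subgroup."""
    for g in np.unique(group):
        m = group == g
        print(f"group={g}  acc={accuracy_score(y_true[m], y_pred[m]):.3f}  "
              f"P={precision_score(y_true[m], y_pred[m], zero_division=0):.3f}  "
              f"R={recall_score(y_true[m], y_pred[m], zero_division=0):.3f}")

def demographic_parity_difference(y_pred, group):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, group):
    """Largest gap in TPR (label 1) or FPR (label 0) between any two groups."""
    gaps = []
    for label in (1, 0):
        rates = [y_pred[(group == g) & (y_true == label)].mean()
                 for g in np.unique(group)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)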

Exercise 8.34: Cross-Validation Estimator Comparison

Empirically compare the bias and variance of different cross-validation strategies:

a) Generate a dataset with a known ground truth (e.g., from a known function with noise). b) Compute the true generalization error by evaluating on a large independent test set. c) Run 5-fold CV, 10-fold CV, and LOOCV 100 times each (with different random seeds for shuffling). d) For each strategy, compute the bias (mean estimate minus true error) and variance (standard deviation of estimates). e) Plot the results and discuss the bias-variance tradeoff of the CV estimator.

Exercise 8.35: Production Evaluation Simulation

Simulate an A/B test for model deployment:

a) Create a "production" data stream of 10,000 samples. b) Deploy Model A (logistic regression) and Model B (random forest). c) Randomly assign each sample to Model A or Model B. d) Compute the conversion rate (or accuracy) for each model. e) Perform a two-proportion z-test to determine if the difference is statistically significant. f) Calculate the minimum sample size needed to detect a 2% improvement with 80% power.
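
A sketch of parts e)-f), using the pooled two-proportion z-test and the standard sample-size approximation; the counts passed in at the bottom are hypothetical:

import numpy as np
from scipy.stats import norm

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate samples per arm to detect p1 vs. p2 at the given alpha/power."""
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))

z, p = two_proportion_z_test(840, 5000, 790, 5000)   # hypothetical counts
print(f"z={z:.3f}, p={p:.4f}")
print("n per arm for 80% -> 82%:", sample_size_per_arm(0.80, 0.82))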