Exercises: Chapter 16

Model Evaluation Deep Dive


Exercise 1: Cross-Validation Strategies (Conceptual)

Consider a dataset of 50,000 medical records from 8,000 patients. Each patient has between 2 and 15 visits recorded. The target is whether the patient was readmitted within 30 days of discharge. The dataset is imbalanced: 12% readmission rate.

a) A colleague uses StratifiedKFold(n_splits=5) for cross-validation and reports an AUC of 0.84. Explain what is wrong with this approach and why the reported AUC is likely inflated.

b) What cross-validation strategy should be used instead? Implement it using scikit-learn and explain why it is correct.

c) Suppose the hospital records span 3 years and you want to deploy the model to predict readmissions next month. Should you use StratifiedGroupKFold or TimeSeriesSplit? Argue for your choice. Is there a way to combine both grouping and temporal ordering?

d) After switching to the correct cross-validation, the AUC drops from 0.84 to 0.76. Your colleague says "the new evaluation is worse, so the old approach was better." Write a 3-sentence response explaining why this reasoning is backwards.


Exercise 2: Detecting Data Leakage (Code + Detective Work)

Run the following code and find the leaky features.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 20000

data = pd.DataFrame({
    'age': np.random.randint(18, 80, n),
    'income': np.random.normal(55000, 20000, n).round(0),
    'credit_score': np.random.normal(680, 80, n).round(0),
    'num_accounts': np.random.poisson(3, n),
    'months_since_last_delinquency': np.random.exponential(24, n).round(0),
    'debt_to_income': np.random.beta(2, 5, n).round(3),
    'num_inquiries_6m': np.random.poisson(1.5, n),
    'employment_years': np.random.exponential(7, n).round(1),
})

# Generate default probability
default_prob = 1 / (1 + np.exp(-(
    -4.0
    + 0.02 * (data['debt_to_income'] > 0.4).astype(float)
    + 0.3 * (data['credit_score'] < 600).astype(float)
    + 0.2 * (data['num_inquiries_6m'] > 3).astype(float)
    + 0.15 * (data['employment_years'] < 1).astype(float)
)))
data['defaulted'] = np.random.binomial(1, default_prob)

# LEAKED FEATURES (find them)
data['collection_agency_contacted'] = np.where(
    data['defaulted'] == 1,
    np.random.binomial(1, 0.85, n),
    np.random.binomial(1, 0.01, n)
)
data['current_balance_status'] = np.where(
    data['defaulted'] == 1,
    np.random.choice(['delinquent', 'charged_off', 'in_collections'], n),
    np.random.choice(['current', 'current', 'current', 'late_30'], n)
)
data['current_balance_status'] = data['current_balance_status'].astype('category').cat.codes

X = data.drop('defaulted', axis=1)
y = data['defaulted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.4f}")

a) Run the code and examine feature importances. Which features are suspicious and why?

b) For each suspicious feature, explain the temporal logic: why would this information not be available at the time you need to predict default?

c) Remove the leaky features, retrain, and report the new AUC. What is the AUC drop?

d) Write a general-purpose function called leakage_detector that takes a trained model and a feature list, flags any feature with importance above a threshold (default 0.25), and prints a warning. Test it on the leaky and clean models.
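
As a starting point for parts (a) and (d), sorting the fitted model's feature importances is a quick way to surface suspects. The snippet below demonstrates the signature on a tiny synthetic stand-in (two made-up features, not the exercise's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({'honest': rng.normal(size=n), 'leaky': np.zeros(n)})
y = (X['honest'] + rng.normal(scale=2.0, size=n) > 0).astype(int)
X['leaky'] = y + rng.normal(scale=0.01, size=n)  # near-copy of the target

clf = RandomForestClassifier(n_estimators=50, max_features=None,
                             random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
# A single feature soaking up almost all of the importance is the classic
# leakage signature the exercise asks you to detect.
```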


Exercise 3: Precision-Recall Tradeoff for Churn (Code)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    precision_recall_curve, average_precision_score
)

X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=10,
    n_redundant=5, weights=[0.92, 0.08],
    flip_y=0.02, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

a) Calculate precision, recall, and F1 at thresholds: 0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50. Create a table showing all three metrics at each threshold.

b) The business context: a retention offer costs $5 per subscriber, and saving a churner is worth $180. Calculate the expected profit per subscriber for each threshold in part (a). Which threshold maximizes expected profit?

c) Plot the precision-recall curve and mark the threshold that maximizes expected profit. Is this the same threshold that maximizes F1?

d) Your manager says "we should use a threshold of 0.5 because that is the standard." Using the table from part (a), calculate how much money is left on the table by using 0.5 instead of the profit-maximizing threshold. Express this as a percentage of the maximum profit.
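
For parts (a) and (b), a helper like the one below can sweep the thresholds and add the profit column. The scores here are simulated, and the profit formula makes a simplifying assumption you should state in your own answer as well: every targeted churner is saved.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(42)
n = 5000
y_true = rng.binomial(1, 0.08, n)  # ~8% churn, matching the exercise setup
# Stand-in scores: churners tend to receive higher probabilities.
y_proba = np.clip(rng.normal(0.10 + 0.35 * y_true, 0.15), 0, 1)

def threshold_table(y_true, y_proba, thresholds,
                    offer_cost=5.0, save_value=180.0):
    """Metrics and expected profit per subscriber at each threshold.

    Assumes every targeted churner is saved, so
    profit = save_value * TP - offer_cost * (TP + FP), divided by n.
    """
    rows = []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        targeted = int(y_pred.sum())
        rows.append({
            'threshold': t,
            'precision': precision_score(y_true, y_pred, zero_division=0),
            'recall': recall_score(y_true, y_pred, zero_division=0),
            'f1': f1_score(y_true, y_pred, zero_division=0),
            'profit_per_sub': (save_value * tp - offer_cost * targeted) / len(y_true),
        })
    return rows

table = threshold_table(y_true, y_proba, [0.02, 0.05, 0.10, 0.20, 0.50])
for row in table:
    print(row)
```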


Exercise 4: AUC-ROC vs. AUC-PR on Imbalanced Data (Code)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, make_scorer

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)

imbalance_ratios = [0.50, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]

a) For each imbalance ratio, generate a dataset with make_classification (n_samples=10000, n_informative=10) and evaluate the same Random Forest using both AUC-ROC and AUC-PR. Create a table showing both metrics for each ratio.

b) At which imbalance ratio does AUC-ROC start to become misleading (i.e., it looks good while AUC-PR shows poor minority-class identification)?

c) Explain in 3-4 sentences why AUC-ROC is robust to class imbalance (it changes little as imbalance increases) while AUC-PR is sensitive to it.

d) A colleague argues: "AUC-ROC is more reliable because it is more stable across different imbalance levels." Explain why stability is not the same as informativeness, and when AUC-PR's sensitivity to imbalance is a feature, not a bug.
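
A slimmed-down version of the sweep in part (a), using a single train/test split and fewer trees so it runs quickly (the full exercise asks for cross-validation over all seven ratios):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

for pos_rate in [0.50, 0.10, 0.02]:
    X, y = make_classification(
        n_samples=4000, n_informative=10,
        weights=[1 - pos_rate, pos_rate], random_state=42,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )
    rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    proba = rf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # Watch the two columns diverge as the positive class shrinks.
    print(f"pos_rate={pos_rate:.2f}  "
          f"AUC-ROC={roc_auc_score(y_te, proba):.3f}  "
          f"AUC-PR={average_precision_score(y_te, proba):.3f}")
```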


Exercise 5: Learning Curves and Data Collection Decisions (Code)

Build learning curves for three models of different complexity on the same dataset.

import numpy as np
from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=15000, n_features=25, n_informative=15,
    n_redundant=5, flip_y=0.1, random_state=42
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=42, n_jobs=-1
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
    ),
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes = np.linspace(0.05, 1.0, 15)

a) Generate learning curves for all three models. For each, report: (1) training AUC at full data, (2) validation AUC at full data, (3) the gap between training and validation AUC.

b) Which model shows the most overfitting? Which shows the most underfitting? How can you tell from the learning curve?

c) For each model, determine whether collecting 50% more data (22,500 instead of 15,000 samples) would likely improve validation AUC. Explain your reasoning based on the shape of the learning curve.

d) A stakeholder asks: "Should we spend $50,000 to collect 10,000 more labeled samples?" Using the learning curves, write a 3-sentence recommendation for each model.
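
The call shape for part (a), shown with a smaller dataset and coarser grid so it finishes quickly; the exercise's setup uses 15,000 samples, three models, and 15 train sizes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, StratifiedKFold

X, y = make_classification(n_samples=3000, n_features=25, n_informative=15,
                           flip_y=0.1, random_state=42)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc', n_jobs=-1,
)
train_auc = train_scores.mean(axis=1)
val_auc = val_scores.mean(axis=1)
gap = train_auc - val_auc
print("final train AUC:", round(train_auc[-1], 3))
print("final val AUC:  ", round(val_auc[-1], 3))
print("final gap:      ", round(gap[-1], 3))
# A gap that stays wide as data grows suggests overfitting; a validation
# curve still climbing at the right edge suggests more data would help.
```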


Exercise 6: Calibration Analysis (Code)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=12,
    n_redundant=4, flip_y=0.1, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

a) Train Logistic Regression, Random Forest, Gradient Boosting, and SVM (with probability=True) on the training set. For each, compute the calibration curve using 10 bins and report the mean absolute calibration error (average absolute difference between predicted probability and observed frequency across bins).

b) Which model is best calibrated out of the box? Which is worst? Does this match the chapter's claims about model calibration?

c) Apply Platt scaling (sigmoid calibration) and isotonic calibration to the worst-calibrated model using CalibratedClassifierCV. Report the calibration error before and after each calibration method. Which works better?

d) In what business scenario would poor calibration matter even if AUC-ROC is high? Give a concrete example where a well-calibrated 0.78 AUC model is more useful than a poorly calibrated 0.82 AUC model.
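
For part (a), calibration_curve plus a mean absolute gap gives the error metric; a minimal sketch with one model on synthetic data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=12,
                           flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# calibration_curve returns the observed positive frequency and the mean
# predicted probability per bin; the mean absolute gap between the two
# is the calibration error defined in part (a).
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
mace = np.abs(frac_pos - mean_pred).mean()
print(f"mean absolute calibration error: {mace:.4f}")
```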


Exercise 7: Statistical Model Comparison (Code)

Train five models and compare them using proper statistical tests.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    AdaBoostClassifier
)
from sklearn.svm import SVC
from scipy import stats

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=12,
    n_redundant=4, flip_y=0.08, random_state=42
)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    'LR': LogisticRegression(max_iter=1000, random_state=42),
    'RF': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    'GB': GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
    ),
    'Ada': AdaBoostClassifier(n_estimators=200, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
}

a) Run 10-fold stratified cross-validation for all five models. Report the mean AUC and standard deviation for each.

b) Perform all 10 pairwise paired t-tests and report the p-values in a matrix. How many comparisons are significant at alpha=0.05?

c) Apply the Bonferroni correction (divide alpha by the number of comparisons). How many comparisons remain significant at the corrected alpha? Why is Bonferroni important when making multiple comparisons?

d) Calculate Cohen's d for every significant pairwise comparison. Are any of the statistically significant differences practically negligible (|d| < 0.2)?

e) Based on all of the above, write a recommendation: which model or models would you advance to A/B testing, and why?
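
The core of parts (b) and (d) is a paired test on per-fold scores obtained from identical folds; a two-model sketch (the exercise compares all ten pairs):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=3000, n_features=20, n_informative=12,
                           flip_y=0.08, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Same CV object (and random_state) for both models, so fold i's score for
# one model is paired with fold i's score for the other.
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=skf, scoring='roc_auc')
scores_rf = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=skf, scoring='roc_auc')

t_stat, p_value = stats.ttest_rel(scores_rf, scores_lr)
diff = scores_rf - scores_lr
cohens_d = diff.mean() / diff.std(ddof=1)  # paired-samples effect size (d_z)
print(f"p-value: {p_value:.4f}, Cohen's d: {cohens_d:.2f}")
```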


Exercise 8: The Full Evaluation Pipeline (Code)

Build a complete evaluation function that takes a model, data, and groups, and produces a comprehensive evaluation report.

def full_evaluation_report(model, X, y, groups=None, cv_splits=5,
                           scoring='roc_auc', random_state=42):
    """
    Produce a full evaluation report including:
    1. Cross-validation scores (with appropriate strategy)
    2. Learning curve summary
    3. Calibration analysis
    4. Feature importance (if available)
    5. Leakage warning flags

    Parameters
    ----------
    model : sklearn estimator
    X : features
    y : target
    groups : group labels (optional)
    cv_splits : number of CV folds
    scoring : metric name
    random_state : random seed

    Returns
    -------
    dict with all evaluation results
    """
    pass  # Your implementation here

a) Implement the function. It should automatically choose StratifiedGroupKFold if groups are provided, and StratifiedKFold otherwise.

b) The function should flag any feature with importance above 0.25 as a potential leak and print a warning.

c) The function should compute the calibration curve and report mean absolute calibration error.

d) Test the function on the StreamFlow churn dataset (or a simulated version) with and without leaky features. Verify that it correctly identifies the leak.

e) Add a method that runs a paired t-test against a baseline model (logistic regression with default parameters) and reports whether the model significantly outperforms the baseline.


Exercise 9: Time Series Cross-Validation Mistakes (Conceptual + Code)

A colleague is building a model to predict daily sales for a retail chain. The dataset spans 3 years of daily data (1,095 days) with features including day_of_week, month, promotions_running, temperature, and lag features (sales_yesterday, sales_last_week).

a) The colleague uses StratifiedKFold for cross-validation. List three specific problems with this approach for time series data.

b) The colleague fixes the issue by using TimeSeriesSplit(n_splits=5). They include sales_yesterday and sales_last_week as features. Is there still a leakage issue? If so, explain exactly how the leakage occurs and propose a fix.

c) Implement a custom TimeSeriesGroupSplit that respects both temporal ordering and store-level grouping (the chain has 50 stores). Each fold should train on all stores up to month M and test on all stores for months M+1 to M+3.

d) The model's cross-validation AUC is 0.87 on early folds and 0.72 on the last fold (most recent data). What does this pattern suggest? Is 0.87 or 0.72 a more realistic estimate of production performance?
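
One possible skeleton for part (c); time_series_group_split is a hypothetical name, and the integer month index per row is an assumption about how you encode dates:

```python
import numpy as np

def time_series_group_split(months, train_through, horizon=3):
    """Yield (train_idx, test_idx) pairs: train on every row (all stores)
    with month <= M, test on months M+1 through M+horizon, for each M in
    train_through. `months` is an integer month index per row."""
    months = np.asarray(months)
    for m in train_through:
        train_idx = np.where(months <= m)[0]
        test_idx = np.where((months > m) & (months <= m + horizon))[0]
        yield train_idx, test_idx

# Toy check: 12 months x 2 stores, one row per store-month.
months = np.repeat(np.arange(1, 13), 2)
for tr, te in time_series_group_split(months, train_through=[6, 9]):
    assert months[tr].max() < months[te].min()  # no future rows in train
    print(len(tr), len(te))
```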


Exercise 10: StreamFlow M6 --- Complete Evaluation Overhaul (Progressive Project)

This is the progressive project milestone for Chapter 16. Complete all tasks using your StreamFlow churn pipeline from earlier milestones.

a) Replace all prior evaluations with StratifiedGroupKFold using subscriber_id as the group. Re-report all model scores from M4 and M5. Create a comparison table showing old scores (standard CV) vs. new scores (group CV). Which models lost the most performance?

b) Build learning curves for your top 3 models. For each, answer: would more data help? Is the model underfitting or overfitting?

c) Find the planted leakage feature in your StreamFlow pipeline. Document your detective process: which clue led you to the leak? Remove it and report the AUC-PR change.

d) Calculate the optimal decision threshold based on the cost structure: retention offer = $5, saved churner value = $180. What threshold maximizes expected profit per subscriber?

e) Run paired t-tests comparing all your models. Which differences are statistically significant? Create a final recommendation table:

Model | AUC-PR | F1 | Precision@optimal | Recall@optimal | vs. Baseline p-value | Recommend?

f) Write a one-paragraph executive summary of your evaluation findings. The audience is a VP of Product who understands percentages but not machine learning.


These exercises support Chapter 16: Model Evaluation Deep Dive. Return to the chapter for reference.