Case Study 2: Comparing Three Models — Which Predicts Vaccination Best?


Tier 3 — Illustrative/Composite Example: This case study extends Elena's vaccination coverage analysis from Chapter 1. The dataset, models, and results are constructed for pedagogical purposes, drawing on publicly known patterns in global health data. All specific data points, model outputs, and findings are fictional. No specific countries, organizations, or datasets are represented.


The Setting

Elena has been building toward this moment for months. Across Chapters 26, 27, and 28, she built three different models to predict whether a country achieves "high" vaccination coverage (above the global median). Now her supervisor, Dr. Reyes, asks the obvious question:

"You keep showing me new models. Which one is actually the best?"

Elena has learned enough from Chapter 29 to know that "best" isn't a simple question. Best at what? Best for whom? Best by which metric? She decides to conduct a rigorous model comparison — the kind of analysis she'd want to see if she were reading someone else's work.

The Setup

Elena has data on 185 countries with five features: GDP per capita, health spending as a percentage of government budget, physicians per 1,000 people, adult literacy rate, and urban population percentage. The target variable is binary: high coverage (1) or low coverage (0). The split is roughly 50/50, so this is a relatively balanced classification problem.

Her three models:

  1. Logistic Regression — the linear classifier from Chapter 27
  2. Decision Tree — the interpretable model from Chapter 28 (max_depth=4)
  3. Random Forest — the ensemble model from Chapter 28 (200 trees)

The Evaluation Plan

Elena's evaluation strategy reflects everything she learned in this chapter:

  1. Hold out 30% as a final test set (never touched during model development)
  2. Use 5-fold stratified cross-validation on the remaining 70% to compare models
  3. Evaluate multiple metrics: accuracy, precision, recall, F1, and AUC
  4. Generate confusion matrices to understand the types of errors each model makes
  5. Plot ROC curves for visual comparison
  6. Check consistency — do the cross-validation results match the final test set results?

She starts by preparing the data and carving off the held-out test set:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

# Prepare data
health = pd.read_csv('global_health_indicators.csv')
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)

features = ['gdp_per_capita', 'health_spending_pct', 'physicians_per_1000',
            'literacy_rate', 'urban_population_pct']
X = health[features].dropna()
y = health.loc[X.index, 'high_coverage']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
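A quick sanity check is worth the two lines: with `stratify=y`, the training and test portions keep nearly identical class proportions, which matters for a median-based target. A minimal sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical placeholders for Elena's 185-country dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 185 "countries", 5 features, roughly balanced binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(185, 5))
y_demo = (rng.random(185) < 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, the two class proportions stay nearly identical
print(f"train positives: {y_tr.mean():.3f}")
print(f"test positives:  {y_te.mean():.3f}")
```

Without `stratify`, a small dataset like this one can easily end up with a noticeably skewed test set by chance.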

Step 1: Cross-Validated Comparison

Elena uses a pipeline for the logistic regression so that the scaler is fit only on each fold's training portion (avoiding the data leakage trap from Section 29.6):

models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000))
    ]),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42)
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

results = []
for name, model in models.items():
    row = {'Model': name}
    for metric in metrics:
        scores = cross_val_score(model, X_train, y_train, cv=skf, scoring=metric)
        row[f'{metric}_mean'] = scores.mean()
        row[f'{metric}_std'] = scores.std()
    results.append(row)

results_df = pd.DataFrame(results)
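One note on efficiency: calling `cross_val_score` once per metric refits every model five times per metric. `cross_validate` accepts a list of scorers and computes all of them from a single set of fits. A sketch on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical placeholders for Elena's training set):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Stand-in data with a real signal in the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(130, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=130) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# One call fits each fold once and scores all five metrics on those fits
cv = cross_validate(LogisticRegression(max_iter=1000),
                    X_demo, y_demo, cv=skf, scoring=metrics)
summary = {m: f"{cv[f'test_{m}'].mean():.3f} +/- {cv[f'test_{m}'].std():.2f}"
           for m in metrics}
print(pd.Series(summary))
```

The per-fold scores come back under keys like `test_accuracy`, so the mean-and-std table can be built exactly as before.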

The cross-validated results:

Model                 Accuracy       Precision      Recall         F1             AUC
Logistic Regression   0.831 ± 0.04   0.843 ± 0.05   0.812 ± 0.06   0.826 ± 0.04   0.903 ± 0.03
Decision Tree         0.808 ± 0.05   0.815 ± 0.06   0.793 ± 0.07   0.802 ± 0.06   0.827 ± 0.05
Random Forest         0.862 ± 0.03   0.871 ± 0.04   0.848 ± 0.05   0.858 ± 0.03   0.932 ± 0.02

Elena notes several things:

  1. Random Forest wins on every metric. It has the highest accuracy, precision, recall, F1, and AUC.
  2. Logistic Regression is a close second. It's surprisingly competitive, especially on AUC (0.903 vs. 0.932).
  3. Decision Tree lags behind. The constraint of max_depth=4 limits its capacity, and the small dataset (185 countries) means there isn't enough data for the tree to learn complex patterns.
  4. The standard deviations matter. The random forest has the smallest standard deviations, meaning its performance is most consistent across folds. The decision tree has the largest standard deviations — its performance varies more depending on which countries are in each fold.
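The fold-to-fold spread behind those ± figures can be inspected directly: `cross_val_score` returns the raw per-fold scores, not just a summary. A sketch with stand-in data (variable names hypothetical):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with signal in the first two features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(130, 5))
y_demo = (X_demo[:, :2].sum(axis=1)
          + rng.normal(scale=0.7, size=130) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for clf in (DecisionTreeClassifier(max_depth=4, random_state=42),
            RandomForestClassifier(n_estimators=200, random_state=42)):
    scores = cross_val_score(clf, X_demo, y_demo, cv=skf, scoring='accuracy')
    # The five individual fold scores show how much each model's
    # performance swings depending on which rows land in the fold
    print(type(clf).__name__, np.round(scores, 3),
          f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Printing the raw arrays is a good habit on small datasets: a single unlucky fold can drag a mean down and a summary table alone won't show it.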

Step 2: Final Test Set Evaluation

Elena trains each model on the full training set and evaluates on the held-out test set:

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{'='*50}")
    print(f"{name}")
    print(f"{'='*50}")
    print(classification_report(y_test, y_pred,
                                target_names=['Low Coverage', 'High Coverage']))

The test set results align with the cross-validation results, confirming that Elena's evaluation is reliable and the models aren't just getting lucky on a particular split.

Step 3: Confusion Matrices

Elena generates confusion matrices to understand the specific errors:
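Tables like the ones below come from sklearn's `confusion_matrix`, where rows are actual classes and columns are predicted classes. A self-contained sketch on stand-in data (Elena's real code would reuse the fitted models and `X_test`):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stand-in data mirroring the 185-row, 30%-test setup
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(185, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=185) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))  # rows: actual, cols: predicted
print(pd.DataFrame(cm, index=['Act:L', 'Act:H'], columns=['Pred:L', 'Pred:H']))
```

Wrapping the array in a labeled DataFrame avoids the classic mistake of reading false positives and false negatives from the wrong cells.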

LOGISTIC REGRESSION          DECISION TREE              RANDOM FOREST

        Pred:L  Pred:H              Pred:L  Pred:H           Pred:L  Pred:H
Act:L    24       4          Act:L    22       6       Act:L    25       3
Act:H     6      22          Act:H     5      23       Act:H     4      24

Key observation: All three models make roughly similar numbers of errors, but the types of errors differ:

  • Logistic Regression: 4 false positives (predicted high coverage when it was actually low) and 6 false negatives (predicted low coverage when it was actually high)
  • Decision Tree: 6 false positives and 5 false negatives — more false positives than the other models
  • Random Forest: 3 false positives and 4 false negatives — the fewest errors of both types

Step 4: ROC Curves

Elena plots all three ROC curves on the same chart:

plt.figure(figsize=(8, 6))
for name, model in models.items():
    model.fit(X_train, y_train)
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        y_proba = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', alpha=0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Model Comparison: ROC Curves')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

The ROC curves visually confirm the ranking: the random forest's curve is closest to the upper-left corner, followed by logistic regression, then the decision tree.

The Recommendation

Elena writes up her findings for Dr. Reyes:

Model Recommendation: Vaccination Coverage Prediction

After comparing three models across five evaluation metrics using 5-fold cross-validation and a held-out test set, I recommend the Random Forest as the primary prediction model.

Why the Random Forest:

  • Highest accuracy (86.2%), F1 score (0.858), and AUC (0.932)
  • Most consistent performance across cross-validation folds (lowest standard deviation)
  • Fewest errors of both types (false positives and false negatives)

The trade-off:

  • The random forest is harder to explain than the decision tree. If stakeholders need to understand why a specific country is flagged as low-coverage, the decision tree's rules are clearer.

My suggestion:

  • Use the Random Forest for predictions — when we need to identify which countries need intervention
  • Use the Decision Tree for communication — when we need to explain the key risk factors to policymakers
  • The two models agree on the most important features (GDP per capita and physicians per 1,000), which gives us confidence in both

Important caveats:

  • With only 185 countries, the dataset is small. Performance estimates may change with additional data.
  • The model predicts coverage relative to the current median. As global coverage improves, the median shifts, and the model should be retrained.
  • Correlation is not causation: the model tells us that GDP and physician density predict coverage, not that increasing them would cause higher coverage.

What Elena Learned

1. Cross-validation revealed the truth. On a single train/test split, the logistic regression happened to beat the decision tree by five percentage points. On a different split, the decision tree won. Cross-validation showed that the random forest consistently outperforms both — a conclusion Elena couldn't have reached from a single split.

2. The "best" model depends on the metric, but not always. In this case, the random forest won on every metric. That's unusual. More commonly, different models win on different metrics, and you have to decide which metric matters most for your problem.

3. The confusion matrix told stories the summary metrics didn't. Seeing that the decision tree had more false positives than the other models suggested it was over-predicting high coverage — potentially because its rigid splits couldn't capture nuances that the other models could.

4. Presentation matters as much as analysis. Elena's recommendation wasn't "use the random forest because AUC = 0.932." It was "use the random forest for predictions and the decision tree for communication" — a practical recommendation that acknowledges the needs of different stakeholders.

5. Small datasets require extra caution. With only 185 countries, the cross-validation folds are small (about 26 countries per fold for the test portion). Individual fold scores can be noisy. Elena reported standard deviations alongside means to acknowledge this uncertainty.

Discussion Questions

  1. Elena's dataset has roughly balanced classes (50/50 high vs. low coverage). How would the model comparison change if she defined "high coverage" as the top 20% of countries instead of above the median? Which metric would become most important?

  2. The logistic regression's AUC (0.903) is only slightly lower than the random forest's (0.932). Given that logistic regression is simpler and more interpretable, could you argue that logistic regression is the "better" model? Under what circumstances?

  3. Elena noted that the model should be retrained as global coverage changes. How often should she retrain? What signs would indicate that the model is no longer performing well?

  4. If Elena wanted to use the model to prioritize countries for a limited intervention budget (say, the top 20 countries most at risk), would she care more about recall, precision, or the model's probability estimates? Why?

  5. Elena used the same features for all three models. Is it possible that different models would benefit from different features? For example, might logistic regression prefer different features than a random forest? How would you test this?