Exercises: Chapter 17

Class Imbalance and Cost-Sensitive Learning


Exercise 1: Identifying and Quantifying Imbalance (Conceptual)

You are given three datasets. For each, calculate the imbalance ratio, classify the severity, and state whether accuracy is a meaningful metric.

a) An insurance claims dataset with 200,000 records, of which 3,400 are fraudulent claims. Calculate the imbalance ratio. If a model predicts "not fraud" for every record, what accuracy would it achieve? Argue why this accuracy is meaningless.

b) A customer support ticket dataset with 50,000 tickets, 12,000 of which were escalated to a supervisor. Is this dataset imbalanced? Would you expect standard models to struggle with this positive rate?

c) A cybersecurity dataset with 10 million network events, of which 847 are confirmed intrusion attempts. Calculate the imbalance ratio. A colleague says "we need at least 95% accuracy." Explain why this target is wrong and propose a better metric with a concrete target.
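Before writing your answers for parts (a) and (c), you may want to sanity-check the arithmetic. The few lines below compute the imbalance ratios and the majority-baseline accuracies directly from the counts given in the exercise; they are a checking aid, not a substitute for the written argument.

```python
# Sanity check for the imbalance arithmetic in parts (a) and (c).
# All counts are taken directly from the exercise text.

# (a) Insurance claims
total_a, fraud_a = 200_000, 3_400
ratio_a = (total_a - fraud_a) / fraud_a          # majority : minority
baseline_acc_a = (total_a - fraud_a) / total_a   # accuracy of "never fraud"
print(f"(a) imbalance ratio ~ {ratio_a:.1f}:1, baseline accuracy = {baseline_acc_a:.1%}")

# (c) Network intrusions
total_c, intrusions_c = 10_000_000, 847
ratio_c = (total_c - intrusions_c) / intrusions_c
baseline_acc_c = (total_c - intrusions_c) / total_c
print(f"(c) imbalance ratio ~ {ratio_c:,.0f}:1, baseline accuracy = {baseline_acc_c:.3%}")
```

Note how the all-negative baseline beats the colleague's 95% accuracy target in part (c) without detecting a single intrusion.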


Exercise 2: SMOTE Mechanics (Code + Conceptual)

a) Implement a simplified version of SMOTE by hand for a 2D dataset. Given the following three minority-class points, generate 3 synthetic examples using k=2 nearest neighbors.

import numpy as np

# Three minority-class points
minority_points = np.array([
    [2.0, 3.0],
    [2.5, 3.8],
    [3.1, 3.2]
])

# Your task:
# 1. For each point, find its 2 nearest neighbors among the other minority points
# 2. Pick one neighbor at random
# 3. Generate a synthetic point on the line segment between the original and
#    the chosen neighbor (use a random lambda between 0 and 1)
# 4. Print the original point, the chosen neighbor, lambda, and the synthetic point

np.random.seed(42)
# Your code here
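If the neighbor-search step is the sticking point, here is a minimal sketch of steps 1-4 for a single seed point. It uses a brute-force distance sort (a reasonable choice at this scale, not the only valid one) and numpy's Generator API rather than the legacy np.random.seed; looping it over all three points and collecting the synthetic examples is left to you.

```python
import numpy as np

# Same three minority-class points as in the exercise
minority_points = np.array([
    [2.0, 3.0],
    [2.5, 3.8],
    [3.1, 3.2]
])
rng = np.random.default_rng(42)

i = 0                          # index of the seed point being oversampled
point = minority_points[i]

# Step 1: brute-force k=2 nearest neighbors among the *other* points
dists = np.linalg.norm(minority_points - point, axis=1)
dists[i] = np.inf              # exclude the point itself
neighbor_idx = np.argsort(dists)[:2]

# Step 2: pick one of the k neighbors at random
j = rng.choice(neighbor_idx)
neighbor = minority_points[j]

# Steps 3-4: interpolate with a random lambda in [0, 1)
lam = rng.random()
synthetic = point + lam * (neighbor - point)
print(f"seed={point}, neighbor={neighbor}, lambda={lam:.3f}, synthetic={synthetic}")
```

Because lambda lies in [0, 1), every synthetic point falls on the segment between the seed and the chosen neighbor, which is exactly what part (b) asks you to reason about.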

b) After generating the 3 synthetic points, explain in 2-3 sentences why these interpolated points are better than simply duplicating the original 3 points. Under what circumstances might SMOTE-generated points be harmful?

c) A colleague applies SMOTE to the entire training set before running 5-fold cross-validation. The cross-validated AUC-PR is 0.72. After you fix the pipeline to apply SMOTE inside each fold, the AUC-PR drops to 0.64. Explain why the first estimate was inflated.


Exercise 3: Cost-Sensitive Learning (Code)

A credit card company wants to detect fraudulent transactions. The dataset has 50,000 transactions with a 1.2% fraud rate. The costs are:

  • False negative (missed fraud): average loss of $2,800 per transaction
  • False positive (flagged legitimate transaction): customer friction cost estimated at $15 per transaction
  • True positive (caught fraud): saves $2,800 minus $50 investigation cost = net benefit of $2,750

a) Calculate the cost ratio (FN:FP). Calculate the break-even precision.
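One common way to frame break-even precision is expected value: flagging a transaction pays off when p × net_benefit ≥ (1 − p) × fp_cost, where p is the probability the flag is correct. The sketch below uses that framing with the net $2,750 benefit; if the chapter defines break-even using the gross $2,800 instead, the answer shifts slightly.

```python
# Quick check for part (a), using an expected-value framing:
# flag when  p * net_benefit >= (1 - p) * fp_cost,
# where p is the precision of the flags.
fn_cost, fp_cost, net_benefit = 2800, 15, 2750

cost_ratio = fn_cost / fp_cost
break_even_precision = fp_cost / (fp_cost + net_benefit)
print(f"cost ratio FN:FP ~ {cost_ratio:.0f}:1")
print(f"break-even precision ~ {break_even_precision:.2%}")
```

A break-even precision well under 1% means even a model that is wrong on the vast majority of its flags can still be worth deploying here.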

b) Implement three models: (1) default Gradient Boosting, (2) Gradient Boosting with sample_weight reflecting the cost ratio, and (3) default Gradient Boosting with threshold tuning.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    average_precision_score, precision_recall_curve
)

# Generate dataset with 1.2% fraud rate
X, y = make_classification(
    n_samples=50000, n_features=20, n_informative=10,
    n_redundant=4, weights=[0.988, 0.012],
    flip_y=0.005, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model 1: Default
# Model 2: Cost-weighted (use sample_weight in fit())
# Model 3: Default + threshold tuning (find profit-optimal threshold)
# Your code here

c) For each model, compute the total cost on the test set using the cost structure above. Which model has the lowest total cost? Is it the one with the highest F1?

d) A product manager asks: "Can we keep false positive rate below 2%?" Find the threshold that achieves this constraint while maximizing recall. What recall do you achieve? Compute the total cost under this constraint.
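The mechanics of the constraint search in part (d) are a threshold sweep that keeps only thresholds with FPR ≤ 2% and, among those, picks the one with the highest recall. The sketch below demonstrates this on toy scores (an assumption on our part, so it runs standalone); in the exercise, substitute your test labels and your model's predicted probabilities.

```python
import numpy as np

# Toy stand-ins for y_test and the model's predict_proba output:
# positives tend to score higher than negatives.
rng = np.random.default_rng(0)
n = 5000
y_true = (rng.random(n) < 0.012).astype(int)
proba = np.clip(rng.normal(0.1 + 0.5 * y_true, 0.15), 0, 1)

best = None
for t in np.linspace(0.01, 0.99, 99):
    pred = (proba >= t).astype(int)
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fpr = fp / (fp + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Keep the highest-recall threshold that satisfies the FPR constraint
    if fpr <= 0.02 and (best is None or recall > best[1]):
        best = (t, recall, fpr)

t_star, recall_star, fpr_star = best
print(f"threshold={t_star:.2f}  recall={recall_star:.3f}  FPR={fpr_star:.4f}")
```

Because recall falls as the threshold rises, the winner is the lowest threshold that still meets the constraint; once you have it, plug the resulting confusion counts into the cost structure from the exercise.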


Exercise 4: Threshold Tuning Deep Dive (Code)

Using the StreamFlow churn scenario (8.2% positive rate, FN=$180, FP=$5):

a) Train a Gradient Boosting model and plot the full precision-recall curve. On the same plot, mark the following thresholds: 0.50 (default), F1-optimal, profit-optimal, and 80%-recall.

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score

X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=10,
    n_redundant=4, weights=[0.918, 0.082],
    flip_y=0.03, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_train, y_train)
proba = gb.predict_proba(X_test)[:, 1]

# Your code: plot PR curve and mark the four thresholds

b) Create a profit-vs-threshold plot. Sweep thresholds from 0.01 to 0.99 and plot the expected profit at each. Where is the peak? How flat or sharp is the profit curve around the optimum? What does the shape tell you about threshold sensitivity?
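The sweep itself can be sketched in a few lines. The version below runs on toy scores so it is self-contained, and it treats "profit" as cost avoided relative to never intervening (FN = $180, FP = $5, per the scenario); use the `proba` array from your trained model and whatever profit definition the chapter gives.

```python
import numpy as np

# Toy stand-ins for y_test and the model's scores (8.2% positive rate).
rng = np.random.default_rng(1)
n = 4000
y_true = (rng.random(n) < 0.082).astype(int)
proba = np.clip(rng.normal(0.15 + 0.45 * y_true, 0.18), 0, 1)

FN_COST, FP_COST = 180, 5
baseline_cost = y_true.sum() * FN_COST   # cost of predicting all-negative

thresholds = np.linspace(0.01, 0.99, 99)
profits = []
for t in thresholds:
    pred = (proba >= t).astype(int)
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    profits.append(baseline_cost - (fn * FN_COST + fp * FP_COST))

profits = np.array(profits)
t_star = thresholds[np.argmax(profits)]
print(f"profit-optimal threshold ~ {t_star:.2f}, profit = ${profits.max():,.0f}")
```

Plot `profits` against `thresholds` to answer the shape question: a broad plateau means threshold choice is forgiving, while a sharp peak means small calibration drift can cost real money.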

c) Suppose the VP of Marketing caps the retention-offer budget at 2,000 offers per month. StreamFlow has 60,000 subscribers. What threshold produces a predicted-positive rate that, scaled to the full 60,000-subscriber base, yields approximately 2,000 offers? What recall and precision does this "budget-constrained" threshold produce?


Exercise 5: Resampling Strategy Comparison (Code)

Using the manufacturing failure dataset (0.4% failure rate, FN=$500K, FP=$5K):

a) Implement a full pipeline comparison with proper cross-validation:

from imblearn.over_sampling import SMOTE, RandomOverSampler, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(
    n_samples=100000, n_features=25, n_informative=8,
    n_redundant=5, weights=[0.996, 0.004],
    flip_y=0.01, random_state=42
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gb = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
)

strategies = {
    'Default': None,
    'ROS': RandomOverSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'RUS': RandomUnderSampler(random_state=42),
}

# For each strategy, build an imblearn Pipeline and compute
# cross-validated AUC-PR. Print results as a comparison table.
# Your code here

b) Which resampling strategy produces the highest cross-validated AUC-PR? Is the difference statistically significant (paired t-test on fold scores)?

c) Now add threshold tuning: for the default (un-resampled) model, tune the threshold using the $500K/$5K cost structure. Compare the business cost of the threshold-tuned default model to the best resampling strategy at threshold 0.50. Which is cheaper?

d) Write a 3-4 sentence recommendation for the manufacturing team. Should they use SMOTE, class weights, threshold tuning, or a combination? Justify your recommendation.


Exercise 6: The Fairness Dimension (Conceptual + Code)

A hospital readmission model is trained on 10,000 patients. The overall readmission rate is 18%, but it varies by demographic group:

  Group     Patients   Readmission Rate
  Group A      6,000        15%
  Group B      2,500        22%
  Group C      1,500        28%

a) If you apply SMOTE to the full dataset, which group's readmission patterns will be most underrepresented in the synthetic examples? Explain why.

b) A single threshold of 0.20 achieves 75% recall overall. Compute the approximate recall for each group if the model's probability estimates are well-calibrated. (Hint: if the model outputs p=0.20 for a patient in Group A, what is the likelihood that patient was actually readmitted?)

c) Propose a strategy for achieving approximately equal recall across all three groups. Would you use group-specific thresholds, group-aware resampling, or a different approach? Defend your choice in 3-4 sentences.

d) A colleague argues that "we should just oversample Group C to match Group A's size, then the model will treat them equally." Explain why this reasoning is flawed. What would oversampling Group C actually do to the model?


Exercise 7: End-to-End Imbalance Pipeline (Integration)

Build a complete, production-ready pipeline for an imbalanced classification problem. Use the following scenario: a SaaS company with 100,000 subscriber records, 6.5% churn rate, FN=$250, FP=$8.

Your pipeline must include:

  1. Train/validation/test split (60/20/20)
  2. Feature scaling inside the pipeline
  3. Two model variants: (a) default with threshold tuning, (b) SMOTE plus cost-reflecting sample_weight (GradientBoostingClassifier has no class_weight parameter), with threshold tuning
  4. Threshold optimization on the validation set using expected profit
  5. Final evaluation on the test set at the optimized threshold
  6. A summary table comparing both variants

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    average_precision_score
)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

X, y = make_classification(
    n_samples=100000, n_features=25, n_informative=12,
    n_redundant=5, weights=[0.935, 0.065],
    flip_y=0.02, random_state=42
)

# Your code: build the full pipeline, tune thresholds, report results
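The discipline that step 4 enforces is worth isolating: the threshold is chosen on validation scores and then applied unchanged to the test set. The sketch below shows just that step on toy scores (stand-ins for your fitted pipeline's predict_proba output), treating expected profit as cost avoided versus never intervening with FN = $250 and FP = $8; adapt the profit definition if the chapter gives a different one.

```python
import numpy as np

rng = np.random.default_rng(7)

def toy_scores(n):
    """Toy labels (6.5% positive) and scores standing in for predict_proba."""
    y = (rng.random(n) < 0.065).astype(int)
    p = np.clip(rng.normal(0.2 + 0.4 * y, 0.15), 0, 1)
    return y, p

def profit(y, p, t, fn_cost=250, fp_cost=8):
    """Expected profit at threshold t: cost avoided vs. predicting all-negative."""
    pred = (p >= t).astype(int)
    fn = np.sum((pred == 0) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    return y.sum() * fn_cost - (fn * fn_cost + fp * fp_cost)

y_val, p_val = toy_scores(20_000)
y_test, p_test = toy_scores(20_000)

# Step 4: optimize the threshold on the VALIDATION set only...
thresholds = np.linspace(0.01, 0.99, 99)
t_star = max(thresholds, key=lambda t: profit(y_val, p_val, t))
print(f"validation-optimal threshold: {t_star:.2f}")

# Step 5: ...then evaluate once on the TEST set at that frozen threshold.
print(f"test-set profit at that threshold: ${profit(y_test, p_test, t_star):,.0f}")
```

Run this tuning once per variant; if the test-set profit differs sharply from the validation profit at the same threshold, that itself is worth a sentence in your summary table discussion.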

Report: which variant produces higher expected profit on the test set? What is the optimal threshold for each? Does adding SMOTE improve ranking quality (AUC-PR)?


These exercises support Chapter 17: Class Imbalance and Cost-Sensitive Learning. Return to the chapter for full context.