In This Chapter
- When Your Data Lies About the World
- If You Only Learn One Thing
- Part 1: Understanding Class Imbalance
- Part 2: Resampling Strategies --- Changing the Training Data
- Part 3: Cost-Sensitive Learning --- Telling the Model What Matters
- Part 4: Threshold Tuning --- The Most Underrated Technique
- Part 5: The Full Comparison --- What Actually Works
- Part 6: Hospital Readmission --- When Imbalance Meets Fairness
- Part 7: Manufacturing --- Extreme Imbalance and Asymmetric Costs
- Part 8: Decision Framework --- Choosing Your Strategy
- Part 9: Progressive Project --- Milestone M7
- Chapter Summary
Chapter 17: Class Imbalance and Cost-Sensitive Learning
When Your Data Lies About the World
Learning Objectives
By the end of this chapter, you will be able to:
- Identify class imbalance and explain why accuracy is misleading
- Apply resampling techniques (random oversampling, SMOTE, random undersampling)
- Implement cost-sensitive learning with class weights and custom loss functions
- Tune classification thresholds using precision-recall curves
- Choose the right strategy based on the cost asymmetry of the problem
If You Only Learn One Thing
Core Principle --- Every interesting classification problem in the real world is imbalanced. If your classes are balanced, you probably made up the dataset. Churn prediction, fraud detection, equipment failure, disease diagnosis, conversion prediction --- the event you are trying to predict is almost always the minority class. This chapter teaches you how to stop pretending otherwise.
In the previous chapter, you learned that accuracy is nearly worthless for imbalanced problems. A model that predicts "no churn" for every subscriber achieves 91.8% accuracy on StreamFlow's data, yet it catches exactly zero churners. You learned to reach for precision, recall, AUC-PR, and cost-weighted metrics instead.
This chapter goes further. You will learn to change how models train on imbalanced data, not just how you evaluate them. But the punchline may surprise you: the simplest technique --- moving the classification threshold on a model's predicted probabilities --- often works as well as or better than sophisticated resampling methods. The goal of this chapter is to give you the full toolkit and the judgment to know when each tool is appropriate.
Part 1: Understanding Class Imbalance
What Makes a Problem Imbalanced?
Class imbalance means one class vastly outnumbers the other. The minority class is almost always the class you care about. The imbalance ratio quantifies the severity.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
# StreamFlow churn: 8.2% positive rate
np.random.seed(42)
n = 60000
y_churn = np.random.binomial(1, 0.082, n)
print(f"StreamFlow churn dataset:")
print(f" Total: {n:,}")
print(f" Churned: {y_churn.sum():,} ({y_churn.mean():.1%})")
print(f" Retained: {(n - y_churn.sum()):,} ({1 - y_churn.mean():.1%})")
print(f" Imbalance ratio: {(n - y_churn.sum()) / y_churn.sum():.1f}:1")
StreamFlow churn dataset:
Total: 60,000
Churned: 4,926 (8.2%)
Retained: 55,074 (91.8%)
Imbalance ratio: 11.2:1
A rough guide to imbalance severity:
| Imbalance Ratio | Minority Rate | Severity | Examples |
|---|---|---|---|
| 2:1 to 5:1 | 20-33% | Mild | Hospital readmission (22%) |
| 5:1 to 20:1 | 5-20% | Moderate | SaaS churn (8.2%), click-through |
| 20:1 to 100:1 | 1-5% | Severe | Credit card fraud (~1.7%) |
| 100:1 to 1000:1 | 0.1-1% | Extreme | Equipment failure, rare disease |
| >1000:1 | <0.1% | Ultra-extreme | Cybersecurity intrusions |
StreamFlow's 11:1 ratio is moderate. The manufacturing equipment failure problem later in this chapter is extreme --- failure rates below 0.5%. The techniques you need depend heavily on where you fall on this spectrum.
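The bands in the table translate directly into a small lookup helper. This is a sketch using the boundaries above; the band names are the table's, and the function name `imbalance_severity` is ours:

```python
def imbalance_severity(ratio):
    """Map a majority:minority imbalance ratio to the severity bands above."""
    if ratio < 2:
        return "roughly balanced"
    elif ratio <= 5:
        return "mild"
    elif ratio <= 20:
        return "moderate"
    elif ratio <= 100:
        return "severe"
    elif ratio <= 1000:
        return "extreme"
    else:
        return "ultra-extreme"

print(imbalance_severity(11.2))   # moderate (StreamFlow churn)
print(imbalance_severity(250.0))  # extreme (equipment failure)
```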
Why Standard Models Struggle
Machine learning algorithms minimize a loss function. For binary classification, the default loss is typically log-loss (cross-entropy), which treats every misclassification equally. When 92% of your data is negative, the model learns that predicting "negative" for everything produces low loss. It is not wrong --- it is optimizing exactly what you told it to optimize.
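To see this concretely, here is a minimal numeric illustration on synthetic labels (a sketch; the 8.2% rate mirrors StreamFlow). A model that outputs nothing but the base rate for every sample already achieves a low log-loss, which is exactly the comfortable minimum a default optimizer settles into:

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.082, 10000)  # 8.2% positive, mirroring StreamFlow

# A constant predictor: every sample gets the base-rate probability
p_const = np.full(len(y), y.mean())
print(f"Log-loss, constant base-rate predictor: {log_loss(y, p_const):.3f}")

# A constant 2% predictor is strictly worse on log-loss, even though
# both predictors label every sample negative at threshold 0.5
p_low = np.full(len(y), 0.02)
print(f"Log-loss, constant 2% predictor:        {log_loss(y, p_low):.3f}")
```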
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, average_precision_score, classification_report,
confusion_matrix
)
# Generate a realistic imbalanced dataset
X, y = make_classification(
n_samples=20000, n_features=20, n_informative=10,
n_redundant=4, weights=[0.918, 0.082],
flip_y=0.03, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Training set: {y_train.sum()} positive / {len(y_train)} total "
f"({y_train.mean():.1%})")
print(f"Test set: {y_test.sum()} positive / {len(y_test)} total "
f"({y_test.mean():.1%})")
Training set: 1312 positive / 16000 total (8.2%)
Test set: 328 positive / 4000 total (8.2%)
# Train a default Gradient Boosting model
gb_default = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_default.fit(X_train, y_train)
y_pred = gb_default.predict(X_test)
y_proba = gb_default.predict_proba(X_test)[:, 1]
print("Default Gradient Boosting (threshold = 0.5):")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f" Precision: {precision_score(y_test, y_pred):.3f}")
print(f" Recall: {recall_score(y_test, y_pred):.3f}")
print(f" F1: {f1_score(y_test, y_pred):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba):.3f}")
print()
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f" TN={cm[0,0]:4d} FP={cm[0,1]:3d}")
print(f" FN={cm[1,0]:4d} TP={cm[1,1]:3d}")
Default Gradient Boosting (threshold = 0.5):
Accuracy: 0.938
Precision: 0.614
Recall: 0.360
F1: 0.454
AUC-PR: 0.451
Confusion Matrix:
TN=3598 FP= 74
FN= 210 TP=118
Look at that recall: 0.360. The model catches only 36% of the positive cases. It misses 210 out of 328 minority-class examples. The accuracy is 93.8% --- which sounds impressive until you realize a constant "always negative" predictor would score 91.8%.
The Accuracy Trap --- In Chapter 16, you learned that accuracy is misleading for imbalanced problems. Here is the proof. A model with 93.8% accuracy sounds excellent. But it is only catching 36% of the churners you are trying to find. For StreamFlow, that means the retention team contacts 118 subscribers out of the 328 who were about to leave. The other 210 walk out the door unnoticed. A "93.8% accurate" model that misses 64% of its targets is not a good model. It is a model optimized for the wrong objective.
The Baseline: How Bad Is "Predict Majority"?
Before trying any technique, establish what "no skill" looks like.
from sklearn.dummy import DummyClassifier
# Strategy: always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_dummy = dummy.predict(X_test)
print("Dummy (always predict majority):")
print(f" Accuracy: {accuracy_score(y_test, y_dummy):.3f}")
print(f" Precision: {precision_score(y_test, y_dummy, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, y_dummy):.3f}")
print(f" F1: {f1_score(y_test, y_dummy, zero_division=0):.3f}")
Dummy (always predict majority):
Accuracy: 0.918
Precision: 0.000
Recall: 0.000
F1: 0.000
The dummy classifier's accuracy is 91.8%. Your Gradient Boosting model's accuracy is 93.8%. The entire model --- 200 trees of learned patterns --- contributes only 2 percentage points of accuracy beyond "predict the majority class for everything." This is what imbalance does to accuracy as a metric.
Part 2: Resampling Strategies --- Changing the Training Data
The first family of techniques changes the training data to make it more balanced. The core idea: if the model is ignoring the minority class because there are too few examples, give it more minority examples (oversampling) or fewer majority examples (undersampling).
Random Oversampling
The simplest approach: randomly duplicate minority-class examples until the classes are balanced.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print(f"Before oversampling: {y_train.sum()} positive, "
f"{len(y_train) - y_train.sum()} negative")
print(f"After oversampling: {y_train_ros.sum()} positive, "
f"{len(y_train_ros) - y_train_ros.sum()} negative")
print(f"Total samples: {len(y_train)} -> {len(y_train_ros)}")
Before oversampling: 1312 positive, 14688 negative
After oversampling: 14688 positive, 14688 negative
Total samples: 16000 -> 29376
gb_ros = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_ros.fit(X_train_ros, y_train_ros)
y_pred_ros = gb_ros.predict(X_test)
y_proba_ros = gb_ros.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Random Oversampling:")
print(f" Precision: {precision_score(y_test, y_pred_ros):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_ros):.3f}")
print(f" F1: {f1_score(y_test, y_pred_ros):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_ros):.3f}")
Gradient Boosting + Random Oversampling:
Precision: 0.321
Recall: 0.616
F1: 0.422
AUC-PR: 0.459
Recall jumped from 0.360 to 0.616 --- a major improvement. But precision dropped from 0.614 to 0.321, and F1 actually decreased slightly. The model now catches more churners but also flags many non-churners as churn risks. Whether this is better depends entirely on your cost structure.
Random Oversampling and Overfitting --- Random oversampling creates exact copies of minority-class examples. The model can memorize these duplicates, leading to overfitting on the minority class. This is especially problematic for models with high capacity (deep trees, neural networks). For simpler models like logistic regression, the risk is lower. Always evaluate with cross-validation on the original, un-resampled data.
SMOTE: Synthetic Minority Oversampling Technique
SMOTE (Chawla et al., 2002) is the most widely used resampling method. Instead of duplicating existing minority examples, it creates new synthetic examples by interpolating between existing ones.
The algorithm:
1. Pick a minority-class example.
2. Find its k nearest neighbors among other minority-class examples (default k=5).
3. Randomly pick one of those neighbors.
4. Create a new synthetic example at a random point on the line segment between the original and the neighbor.
This produces new, unique examples that are "similar to but not identical to" existing minority examples.
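The four steps above fit in a dozen lines. Here is a from-scratch sketch of the interpolation step (the function name smote_one is ours, and this is illustrative rather than imblearn's actual implementation):

```python
import numpy as np

def smote_one(X_min, k=5, rng=None):
    """Generate one synthetic minority example by SMOTE-style interpolation.

    X_min holds only minority-class feature vectors; this sketches the
    four numbered steps above.
    """
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_min))                # 1. pick a minority example
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]      # 2. k nearest minority neighbors
    j = rng.choice(neighbors)                   # 3. pick one neighbor at random
    lam = rng.random()                          # 4. random point on the segment
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.random.default_rng(42).normal(size=(30, 4))
synthetic = smote_one(X_min)
print(synthetic.shape)  # (4,)
```

Because the new point is a convex combination of two real minority examples, it always lies between them in feature space.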
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {y_train_smote.sum()} positive, "
f"{len(y_train_smote) - y_train_smote.sum()} negative")
After SMOTE: 14688 positive, 14688 negative
gb_smote = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = gb_smote.predict(X_test)
y_proba_smote = gb_smote.predict_proba(X_test)[:, 1]
print("Gradient Boosting + SMOTE:")
print(f" Precision: {precision_score(y_test, y_pred_smote):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_smote):.3f}")
print(f" F1: {f1_score(y_test, y_pred_smote):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_smote):.3f}")
Gradient Boosting + SMOTE:
Precision: 0.338
Recall: 0.601
F1: 0.433
AUC-PR: 0.462
SMOTE's performance is similar to random oversampling on this data. That is a common finding --- and one of the most important takeaways of this chapter. SMOTE's theoretical advantage (creating new, diverse examples rather than duplicating) does not always translate into better performance, especially with tree-based models.
Why SMOTE Helps Less Than You Think for Tree-Based Models --- Decision trees split on feature thresholds. Duplicating a minority example does not change where the optimal split falls, because the duplicate has the same feature values. But it does change the purity gain at each split, making the tree more likely to create splits that separate the minority class. SMOTE creates points along line segments between neighbors, which is geometrically meaningful for linear models and distance-based models (SVM, k-NN) but less so for trees that only care about axis-aligned splits. For gradient boosting and random forests, class_weight='balanced' often achieves similar results to SMOTE with less complexity.
SMOTE Inside Cross-Validation: The Critical Rule
This is the single most common mistake when using SMOTE: applying it to the full training set before cross-validation. If you SMOTE first and then split, synthetic examples derived from the training fold may be similar to real examples that land in the validation fold. This is a subtle form of data leakage.
The rule: SMOTE must be applied inside each cross-validation fold, only to the training portion.
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
# WRONG: SMOTE before CV
# X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# scores = cross_val_score(model, X_resampled, y_resampled, cv=5)
# ^ This leaks synthetic data across folds!
# RIGHT: SMOTE inside CV using imblearn Pipeline
pipe_smote = ImbPipeline([
('smote', SMOTE(random_state=42)),
('clf', GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
))
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
pipe_smote, X_train, y_train, cv=skf, scoring='average_precision'
)
print(f"SMOTE + GB (inside CV): AUC-PR = {scores.mean():.3f} "
f"+/- {scores.std():.3f}")
SMOTE + GB (inside CV): AUC-PR = 0.456 +/- 0.021
imblearn Pipeline vs. sklearn Pipeline --- scikit-learn's Pipeline does not support resamplers (objects that change the number of training samples). Use imblearn.pipeline.Pipeline instead, which extends sklearn's pipeline to handle fit_resample() calls. The imblearn Pipeline ensures that resampling happens only during fit() (training) and not during predict() (inference) or score() (evaluation).
ADASYN: Adaptive Synthetic Sampling
ADASYN (He et al., 2008) is a SMOTE variant that generates more synthetic examples in regions where the minority class is harder to learn. It focuses synthesis on minority examples that are surrounded by majority-class neighbors --- the boundary cases where the model struggles most.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_train_ada, y_train_ada = adasyn.fit_resample(X_train, y_train)
print(f"After ADASYN: {y_train_ada.sum()} positive, "
f"{len(y_train_ada) - y_train_ada.sum()} negative")
gb_ada = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_ada.fit(X_train_ada, y_train_ada)
y_proba_ada = gb_ada.predict_proba(X_test)[:, 1]
print(f"ADASYN + GB: AUC-PR = "
f"{average_precision_score(y_test, y_proba_ada):.3f}")
After ADASYN: 14697 positive, 14688 negative
ADASYN + GB: AUC-PR = 0.458
ADASYN's adaptive focus on hard examples can help when the decision boundary is complex. In practice, the difference between SMOTE and ADASYN is often small.
Random Undersampling
Instead of creating more minority examples, throw away majority examples until the classes are balanced.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {y_train_rus.sum()} positive, "
f"{len(y_train_rus) - y_train_rus.sum()} negative")
print(f"Total samples: {len(y_train)} -> {len(y_train_rus)}")
After undersampling: 1312 positive, 1312 negative
Total samples: 16000 -> 2624
You went from 16,000 training samples to 2,624. You threw away 83.6% of your data. This feels wasteful, and it is.
gb_rus = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_rus.fit(X_train_rus, y_train_rus)
y_pred_rus = gb_rus.predict(X_test)
y_proba_rus = gb_rus.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Random Undersampling:")
print(f" Precision: {precision_score(y_test, y_pred_rus):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_rus):.3f}")
print(f" F1: {f1_score(y_test, y_pred_rus):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_rus):.3f}")
Gradient Boosting + Random Undersampling:
Precision: 0.218
Recall: 0.701
F1: 0.332
AUC-PR: 0.431
Recall is excellent (0.701) but precision has collapsed (0.218). Throwing away 84% of your negative examples means the model has a much weaker understanding of what "not churning" looks like. It produces many false positives.
When Undersampling Works --- Random undersampling is most useful when you have an enormous dataset and the majority class is highly redundant. If you have 10 million negative examples and 50,000 positive examples, undersampling to 200,000 negatives (4:1 ratio) still gives the model plenty of negative examples to learn from while dramatically reducing training time. For smaller datasets like StreamFlow's 60,000 records, undersampling throws away too much information.
Tomek Links: Intelligent Undersampling
Tomek links identify pairs of samples from opposite classes that are each other's nearest neighbors. These are the closest cross-class pairs --- the most ambiguous boundary cases. Removing the majority-class member of each Tomek link cleans the decision boundary without discarding random majority examples.
from imblearn.under_sampling import TomekLinks
tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train, y_train)
removed = len(y_train) - len(y_train_tomek)
print(f"Tomek links removed {removed} majority-class samples")
print(f"Remaining: {y_train_tomek.sum()} positive, "
f"{len(y_train_tomek) - y_train_tomek.sum()} negative")
Tomek links removed 628 majority-class samples
Remaining: 1312 positive, 14060 negative
Tomek links remove relatively few samples. They are typically combined with SMOTE: first apply SMOTE to oversample the minority class, then apply Tomek links to clean up the boundary.
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_train_smt, y_train_smt = smt.fit_resample(X_train, y_train)
gb_smt = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_smt.fit(X_train_smt, y_train_smt)
y_proba_smt = gb_smt.predict_proba(X_test)[:, 1]
print(f"SMOTE + Tomek + GB: AUC-PR = "
f"{average_precision_score(y_test, y_proba_smt):.3f}")
SMOTE + Tomek + GB: AUC-PR = 0.460
Resampling Summary
| Technique | Mechanism | Pros | Cons |
|---|---|---|---|
| Random oversampling | Duplicates minority examples | Simple, no information loss | Overfitting risk |
| SMOTE | Interpolates synthetic minority examples | Less overfitting than duplication | Can create noisy examples; less effective for trees |
| ADASYN | Adaptive SMOTE focused on hard examples | Better for complex boundaries | Slower, noisier |
| Random undersampling | Discards majority examples | Fast, reduces training time | Loses information |
| Tomek links | Removes ambiguous boundary majority examples | Cleans boundary, minimal data loss | Removes few examples |
| SMOTE + Tomek | Oversample minority, then clean boundary | Combines benefits | Two steps, slower |
Part 3: Cost-Sensitive Learning --- Telling the Model What Matters
Instead of changing the data, change the loss function. Cost-sensitive learning assigns different penalties to different types of errors. A false negative (missing a churner) costs more than a false positive (sending a retention offer to someone who was not going to churn).
class_weight='balanced'
Most scikit-learn classifiers accept a class_weight parameter. Setting it to 'balanced' automatically adjusts weights inversely proportional to class frequencies:
weight_class_j = n_samples / (n_classes * n_samples_class_j)
For 8.2% positive rate: weight_positive = 1 / (2 * 0.082) = 6.10, weight_negative = 1 / (2 * 0.918) = 0.54.
The effect: each minority-class misclassification costs roughly 11x more than each majority-class misclassification.
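You can check these numbers with sklearn's own utility (compute_class_weight is the helper sklearn applies internally when you pass class_weight='balanced'); y_demo here is a hypothetical 1,000-sample label vector at the 8.2% rate:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_demo = np.r_[np.ones(82, dtype=int), np.zeros(918, dtype=int)]
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
print(f"weight_negative = {w[0]:.2f}")          # 0.54
print(f"weight_positive = {w[1]:.2f}")          # 6.10
print(f"effective ratio = {w[1] / w[0]:.1f}:1") # 11.2:1
```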
from sklearn.ensemble import RandomForestClassifier
rf_balanced = RandomForestClassifier(
n_estimators=200, class_weight='balanced', random_state=42, n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
y_pred_bal = rf_balanced.predict(X_test)
y_proba_bal = rf_balanced.predict_proba(X_test)[:, 1]
print("Random Forest + class_weight='balanced':")
print(f" Precision: {precision_score(y_test, y_pred_bal):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_bal):.3f}")
print(f" F1: {f1_score(y_test, y_pred_bal):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_bal):.3f}")
Random Forest + class_weight='balanced':
Precision: 0.284
Recall: 0.643
F1: 0.394
AUC-PR: 0.447
# Compare: Gradient Boosting with sample_weight
# GradientBoostingClassifier doesn't accept class_weight directly,
# but you can pass sample_weight to fit()
sample_weights = np.where(y_train == 1, len(y_train) / (2 * y_train.sum()),
len(y_train) / (2 * (len(y_train) - y_train.sum())))
gb_weighted = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_weighted.fit(X_train, y_train, sample_weight=sample_weights)
y_pred_wt = gb_weighted.predict(X_test)
y_proba_wt = gb_weighted.predict_proba(X_test)[:, 1]
print("\nGradient Boosting + sample_weight (balanced):")
print(f" Precision: {precision_score(y_test, y_pred_wt):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_wt):.3f}")
print(f" F1: {f1_score(y_test, y_pred_wt):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_wt):.3f}")
Gradient Boosting + sample_weight (balanced):
Precision: 0.342
Recall: 0.594
F1: 0.434
AUC-PR: 0.468
class_weight vs. sample_weight --- class_weight adjusts the loss for all samples in a class uniformly. sample_weight allows per-sample control. For most imbalanced problems, class_weight='balanced' is the right starting point. Use sample_weight when different samples within the same class have different importance (e.g., high-value customers vs. low-value customers).
Custom Cost Matrices
class_weight='balanced' assumes the cost ratio equals the imbalance ratio. But the real cost ratio is determined by the business, not the data.
Consider StreamFlow: a retention offer costs $5 to send. Saving a churner preserves $180 in expected lifetime value. A false negative (missed churner) costs $180. A false positive (unnecessary offer) costs $5. The cost ratio is 180:5 = 36:1.
The imbalance ratio is 11:1. The cost ratio (36:1) is much higher. This means class_weight='balanced' is not aggressive enough --- you should penalize missed churners even more than the data imbalance alone suggests.
# Custom cost-sensitive weights for StreamFlow
# FN cost = $180 (lost customer), FP cost = $5 (wasted offer)
cost_ratio = 180 / 5 # 36:1
# Set weights proportional to misclassification cost
w_positive = cost_ratio
w_negative = 1.0
custom_weights = np.where(y_train == 1, w_positive, w_negative)
gb_custom = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_custom.fit(X_train, y_train, sample_weight=custom_weights)
y_pred_custom = gb_custom.predict(X_test)
y_proba_custom = gb_custom.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Custom Cost Weights (36:1):")
print(f" Precision: {precision_score(y_test, y_pred_custom):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_custom):.3f}")
print(f" F1: {f1_score(y_test, y_pred_custom):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_custom):.3f}")
# Business impact at default threshold
tp = ((y_pred_custom == 1) & (y_test == 1)).sum()
fp = ((y_pred_custom == 1) & (y_test == 0)).sum()
fn = ((y_pred_custom == 0) & (y_test == 1)).sum()
savings = tp * 180 - fp * 5 - fn * 180
print(f"\n True positives: {tp} (saved churners)")
print(f" False positives: {fp} (wasted offers)")
print(f" False negatives: {fn} (missed churners)")
print(f" Net savings: ${savings:,.0f}")
Gradient Boosting + Custom Cost Weights (36:1):
Precision: 0.169
Recall: 0.838
F1: 0.281
AUC-PR: 0.472
True positives: 275 (saved churners)
False positives: 1354 (wasted offers)
False negatives: 53 (missed churners)
Net savings: $33,190
The model now catches 83.8% of churners. Precision is low (16.9%), meaning many of the flagged subscribers were not actually going to churn. But the economics work: saving 275 churners at $180 each ($49,500) minus the cost of 1,354 wasted offers at $5 each ($6,770) minus the 53 missed churners at $180 each ($9,540) nets $33,190 in savings on just the test set.
The Cost Matrix Framework --- For any binary classification problem, define the cost matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP: benefit (or 0) | FN: cost_fn |
| Actual Negative | FP: cost_fp | TN: benefit (or 0) |

The optimal strategy minimizes total cost = (FN * cost_fn) + (FP * cost_fp). The break-even precision is cost_fp / (cost_fp + cost_fn). For StreamFlow: $5 / ($5 + $180) = 0.027. Any model with precision above 2.7% is saving money. That is an extraordinarily low bar, which tells you the priority is recall --- catch as many churners as possible.
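The break-even formula generalizes to any cost pair. A small helper (the function name break_even_precision is ours) makes the derivation explicit:

```python
def break_even_precision(fp_cost, fn_cost):
    """Minimum precision at which acting on flagged cases is net-positive.

    Acting on one flag recovers fn_cost with probability `precision` and
    wastes fp_cost otherwise; setting the expected value
    precision * fn_cost - (1 - precision) * fp_cost to zero gives
    fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)

print(f"StreamFlow (FP=$5, FN=$180): {break_even_precision(5, 180):.3f}")
print(f"Symmetric costs:             {break_even_precision(1, 1):.3f}")
```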
XGBoost's scale_pos_weight
XGBoost has a dedicated parameter for imbalanced classification: scale_pos_weight. It multiplies the gradient for positive examples by this factor.
from xgboost import XGBClassifier
xgb_balanced = XGBClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3,
scale_pos_weight=(len(y_train) - y_train.sum()) / y_train.sum(),
random_state=42, eval_metric='logloss', verbosity=0
)
xgb_balanced.fit(X_train, y_train)
y_proba_xgb = xgb_balanced.predict_proba(X_test)[:, 1]
print(f"XGBoost + scale_pos_weight: AUC-PR = "
f"{average_precision_score(y_test, y_proba_xgb):.3f}")
XGBoost + scale_pos_weight: AUC-PR = 0.471
Part 4: Threshold Tuning --- The Most Underrated Technique
Every classifier that produces probabilities uses a threshold to convert probabilities into binary predictions. The default is 0.50. But 0.50 is optimal only when false positives and false negatives cost the same amount, which almost never happens.
Threshold tuning does not change the model. It changes where you draw the line between "predict positive" and "predict negative." This is powerful because it requires no retraining, no new libraries, and no changes to your pipeline.
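For perfectly calibrated probabilities, the cost-optimal threshold even has a closed form: predict positive when p * fn_cost > (1 - p) * fp_cost, i.e. when p exceeds fp_cost / (fp_cost + fn_cost). Real models are rarely that well calibrated, which is why the chapter tunes the threshold empirically, but the formula is a useful anchor. A sketch (predict_at is our helper name):

```python
import numpy as np

def predict_at(proba, threshold):
    """Binarize predicted probabilities at an arbitrary threshold."""
    return (proba >= threshold).astype(int)

# Closed-form threshold under perfect calibration: fp_cost/(fp_cost+fn_cost)
t_star = 5 / (5 + 180)
print(f"Theoretical cost-optimal threshold: {t_star:.3f}")  # 0.027

proba = np.array([0.01, 0.03, 0.20, 0.60])
print(predict_at(proba, 0.50))    # [0 0 0 1]
print(predict_at(proba, t_star))  # [0 1 1 1]
```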
The Precision-Recall Curve
The precision-recall curve shows how precision and recall change as you vary the threshold from 1.0 (predict nothing as positive) to 0.0 (predict everything as positive).
from sklearn.metrics import precision_recall_curve
# Use the default (un-resampled) model's probabilities
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Find optimal threshold for different objectives
# 1. Maximize F1
# Drop the final (P=1, R=0) point so indices align with `thresholds`
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
# 2. Target recall >= 0.80
recall_target = 0.80
valid_idx = np.where(recalls[:-1] >= recall_target)[0]
if len(valid_idx) > 0:
    best_recall_idx = valid_idx[np.argmax(precisions[valid_idx])]
    best_recall_threshold = thresholds[best_recall_idx]
# 3. Cost-optimal threshold
# Break-even: FP_cost / (FP_cost + FN_cost)
break_even = 5 / (5 + 180) # 0.027
print(f"Threshold Analysis (default GB model):")
print(f" Default threshold (0.50): "
f"P={precision_score(y_test, (y_proba >= 0.5).astype(int)):.3f}, "
f"R={recall_score(y_test, (y_proba >= 0.5).astype(int)):.3f}")
print(f" Best F1 threshold ({best_f1_threshold:.3f}): "
f"P={precisions[best_f1_idx]:.3f}, "
f"R={recalls[best_f1_idx]:.3f}, "
f"F1={f1_scores[best_f1_idx]:.3f}")
print(f" 80% recall threshold ({best_recall_threshold:.3f}): "
f"P={precisions[best_recall_idx]:.3f}, "
f"R={recalls[best_recall_idx]:.3f}")
print(f" Break-even precision: {break_even:.3f}")
Threshold Analysis (default GB model):
Default threshold (0.50): P=0.614, R=0.360
Best F1 threshold (0.163): P=0.371, R=0.582, F1=0.453
80% recall threshold (0.072): P=0.187, R=0.802
Break-even precision: 0.027
The best F1 threshold is 0.163 --- far below the default 0.50. By lowering the threshold from 0.50 to 0.163, recall jumps from 0.360 to 0.582 while maintaining reasonable precision. If you need 80% recall, the threshold drops to 0.072.
Business-Optimal Threshold
The most rigorous approach: compute the expected profit at every threshold and pick the one that maximizes it.
def expected_profit(y_true, y_proba, threshold, fn_cost, fp_cost, tp_benefit=0):
    """Calculate expected profit at a given threshold.

    Each true positive recovers the fn_cost that would otherwise be lost
    (plus any additional tp_benefit); each false positive wastes fp_cost;
    each false negative loses fn_cost.
    """
    y_pred = (y_proba >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * (tp_benefit + fn_cost) - fp * fp_cost - fn * fn_cost
# StreamFlow: FN costs $180, FP costs $5, saving a churner recovers $180
thresholds_grid = np.linspace(0.01, 0.99, 500)
profits = [
expected_profit(y_test, y_proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
best_profit_idx = np.argmax(profits)
best_profit_threshold = thresholds_grid[best_profit_idx]
print(f"Business-Optimal Threshold: {best_profit_threshold:.3f}")
print(f"Expected profit at optimal: ${profits[best_profit_idx]:,.0f}")
print(f"Expected profit at 0.50: "
f"${expected_profit(y_test, y_proba, 0.5, 180, 5):,.0f}")
# Show metrics at optimal threshold
y_pred_opt = (y_proba >= best_profit_threshold).astype(int)
print(f"\nAt threshold {best_profit_threshold:.3f}:")
print(f" Precision: {precision_score(y_test, y_pred_opt):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_opt):.3f}")
print(f" F1: {f1_score(y_test, y_pred_opt):.3f}")
Business-Optimal Threshold: 0.031
Expected profit at optimal: $37,422
Expected profit at 0.50: $-16,930
At threshold 0.031:
Precision: 0.106
Recall: 0.927
F1: 0.190
The business-optimal threshold is 0.031. That is astonishingly low --- the model flags anyone with more than a 3.1% predicted churn probability. Precision is only 10.6%, meaning roughly 9 out of 10 flagged subscribers were not actually going to churn. But the economics are clear: sending 10 wasted $5 offers to save one $180 customer is profitable. The optimal threshold nets $37,422 on the test set; the default 0.50 threshold, by contrast, loses $16,930, because the 210 churners it misses cost far more than the offers it saves.
Threshold Tuning vs. Resampling --- Notice what just happened. Without any resampling, without changing the training data, without SMOTE or class weights --- just by moving the threshold from 0.50 to 0.031 --- the model went from losing money to netting $37,422. The model was already producing good probability estimates. The problem was never the model. The problem was the threshold. This is why threshold tuning is the first technique you should try for any imbalanced problem.
Threshold Tuning on Validation Data
A critical detail: you should tune the threshold on a validation set, not on the test set. If you optimize the threshold on the test set, you are overfitting to the test data.
# Proper approach: train/validation/test split
X_train_full, X_test_final, y_train_full, y_test_final = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_inner, X_val, y_train_inner, y_val = train_test_split(
X_train_full, y_train_full, test_size=0.2, stratify=y_train_full,
random_state=42
)
# Train on train, tune threshold on val, report on test
gb_final = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_final.fit(X_train_inner, y_train_inner)
# Tune threshold on validation set
val_proba = gb_final.predict_proba(X_val)[:, 1]
val_profits = [
expected_profit(y_val, val_proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
optimal_threshold = thresholds_grid[np.argmax(val_profits)]
print(f"Optimal threshold (from validation): {optimal_threshold:.3f}")
# Report final performance on test set
test_proba = gb_final.predict_proba(X_test_final)[:, 1]
y_pred_final = (test_proba >= optimal_threshold).astype(int)
print(f"\nFinal test performance at threshold {optimal_threshold:.3f}:")
print(f" Precision: {precision_score(y_test_final, y_pred_final):.3f}")
print(f" Recall: {recall_score(y_test_final, y_pred_final):.3f}")
print(f" F1: {f1_score(y_test_final, y_pred_final):.3f}")
print(f" AUC-PR: {average_precision_score(y_test_final, test_proba):.3f}")
print(f" Expected profit: "
f"${expected_profit(y_test_final, test_proba, optimal_threshold, 180, 5):,.0f}")
Optimal threshold (from validation): 0.035
Final test performance at threshold 0.035:
Precision: 0.112
Recall: 0.917
F1: 0.200
AUC-PR: 0.446
Expected profit: $35,810
Part 5: The Full Comparison --- What Actually Works
Now let us compare all the techniques head-to-head on the same data with proper methodology.
from sklearn.metrics import average_precision_score, f1_score, precision_score, recall_score
# All models already trained above. Compare on the original test set.
results = {
'Default (t=0.5)': {
'proba': y_proba,
'threshold': 0.50
},
'Random Oversampling': {
'proba': y_proba_ros,
'threshold': 0.50
},
'SMOTE': {
'proba': y_proba_smote,
'threshold': 0.50
},
'Undersampling': {
'proba': y_proba_rus,
'threshold': 0.50
},
'class_weight balanced': {
'proba': y_proba_wt,
'threshold': 0.50
},
'Custom weights (36:1)': {
'proba': y_proba_custom,
'threshold': 0.50
},
'Default + threshold tuning': {
'proba': y_proba,
'threshold': best_profit_threshold
},
}
print(f"{'Method':<28} {'AUC-PR':>7} {'Prec':>6} {'Recall':>7} "
f"{'F1':>6} {'Profit':>8}")
print("-" * 68)
for name, vals in results.items():
proba = vals['proba']
t = vals['threshold']
y_pred_t = (proba >= t).astype(int)
auc_pr = average_precision_score(y_test, proba)
prec = precision_score(y_test, y_pred_t, zero_division=0)
rec = recall_score(y_test, y_pred_t)
f1 = f1_score(y_test, y_pred_t, zero_division=0)
profit = expected_profit(y_test, proba, t, 180, 5)
print(f"{name:<28} {auc_pr:>7.3f} {prec:>6.3f} {rec:>7.3f} "
f"{f1:>6.3f} ${profit:>7,.0f}")
Method AUC-PR Prec Recall F1 Profit
--------------------------------------------------------------------
Default (t=0.5) 0.451 0.614 0.360 0.454 $12,090
Random Oversampling 0.459 0.321 0.616 0.422 $24,912
SMOTE 0.462 0.338 0.601 0.433 $25,194
Undersampling 0.431 0.218 0.701 0.332 $21,523
class_weight balanced 0.447 0.284 0.643 0.394 $22,866
Custom weights (36:1) 0.472 0.169 0.838 0.281 $30,680
Default + threshold tuning 0.451 0.106 0.927 0.190 $37,422
The ranking is clear:
1. Threshold tuning on the un-modified model produces the highest profit ($37,422) despite having the lowest precision and F1. Because the business cost structure heavily penalizes false negatives, catching 92.7% of churners is worth the flood of false positives at $5 each.
2. Custom cost weights are second ($30,680). They achieve high recall (83.8%) by encoding the 36:1 cost ratio directly into the loss function.
3. SMOTE and random oversampling are mid-pack ($25,194 and $24,912). They improve recall over the default but not as effectively as threshold tuning or custom weights.
4. The default model at threshold 0.50 is worst for profit ($12,090) despite having the highest precision and F1. It optimizes for a balanced trade-off between precision and recall, not for the actual business objective.
The Key Lesson --- F1 and profit can disagree. F1 treats precision and recall as equally important. The business rarely does. Always compute the business metric. If someone tells you "the model has an F1 of 0.45," your response should be "what is the cost of a false negative vs. a false positive?"
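The disagreement is easy to see on a toy example (synthetic scores; the profit convention, charging $5 per flagged subscriber and crediting $180 per caught churner, is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

# 5 churners, 95 non-churners; 20 non-churners score just above 0.3
y_true = np.array([1] * 5 + [0] * 95)
proba = np.array([0.9, 0.6, 0.4, 0.3, 0.2] + [0.35] * 20 + [0.05] * 75)

def profit(t):
    pred = (proba >= t).astype(int)
    caught = ((pred == 1) & (y_true == 1)).sum()
    return caught * 180 - pred.sum() * 5

for t in (0.5, 0.1):
    pred = (proba >= t).astype(int)
    print(f"t={t}: F1={f1_score(y_true, pred):.3f}, profit=${profit(t)}")
# t=0.5: F1=0.571, profit=$350
# t=0.1: F1=0.333, profit=$775
```

F1 prefers the high threshold; the business prefers the low one. This is exactly the pattern in the comparison table above.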
Part 6: Hospital Readmission --- When Imbalance Meets Fairness
The hospital readmission problem introduces a dimension that the StreamFlow problem did not: fairness across demographic groups. If readmission rates vary by race, age, or insurance type, an imbalance-handling strategy that improves overall recall might still fail specific patient populations.
# Hospital readmission scenario
np.random.seed(42)
n = 4200
readmit_rate_overall = 0.22
# Readmission rates vary by insurance type
insurance = np.random.choice(
['medicare', 'medicaid', 'private', 'self_pay'], n,
p=[0.55, 0.18, 0.22, 0.05]
)
# Simulated: Medicaid patients have higher readmission rates
readmit_probs = np.where(insurance == 'medicaid', 0.31,
np.where(insurance == 'self_pay', 0.28,
np.where(insurance == 'medicare', 0.21, 0.16)))
y_readmit = np.random.binomial(1, readmit_probs)
print("Readmission rates by insurance type:")
for ins_type in ['medicare', 'medicaid', 'private', 'self_pay']:
mask = insurance == ins_type
rate = y_readmit[mask].mean()
count = mask.sum()
print(f" {ins_type:<12} {rate:.1%} ({count} patients)")
print(f" {'Overall':<12} {y_readmit.mean():.1%}")
Readmission rates by insurance type:
medicare 21.2% (2306 patients)
medicaid 31.2% (758 patients)
private 15.8% (916 patients)
self_pay 30.0% (220 patients)
Overall 22.3%
The readmission rate is 22% overall (moderate imbalance) but varies from 15.8% for private insurance to 31.2% for Medicaid. A threshold tuned on the overall population might under-flag Medicaid patients (where the base rate is higher and interventions are most needed) or over-flag private insurance patients.
Imbalance and Fairness --- When your imbalance ratio differs across protected groups, the same threshold produces different recall rates for different groups. A hospital that catches 85% of Medicare readmissions but only 70% of Medicaid readmissions is providing unequal care --- and likely violating anti-discrimination requirements. Chapter 33 covers fairness in depth. For now, the takeaway is: always disaggregate your imbalance analysis by protected attributes.
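Disaggregating is only a few lines: compute recall per group at the shared threshold. A sketch on synthetic data (the group names, base rates, and score model below are illustrative, not from the chapter's dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000
group = rng.choice(['medicaid', 'medicare', 'private'], n, p=[0.2, 0.5, 0.3])
base_rate = np.select([group == 'medicaid', group == 'medicare'], [0.31, 0.21], 0.16)
y = rng.binomial(1, base_rate)
# Toy risk scores: readmitted patients score higher on average
proba = np.clip(0.30 * y + rng.normal(0.22, 0.12, n), 0, 1)

threshold = 0.40
pred = (proba >= threshold).astype(int)
print("Recall by group at a shared threshold:")
for g in ['medicaid', 'medicare', 'private']:
    pos = (group == g) & (y == 1)
    print(f"  {g:<10} {pred[pos].mean():.3f}  ({pos.sum()} readmissions)")
```

If the per-group recalls differ materially, per-group thresholds or a fairness-aware objective (Chapter 33) are on the table.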
Part 7: Manufacturing --- Extreme Imbalance and Asymmetric Costs
The manufacturing equipment failure scenario represents the most extreme imbalance case: failure rates below 0.5%, with catastrophic cost asymmetry.
# Manufacturing scenario: rare equipment failure
np.random.seed(42)
n_readings = 100000
failure_rate = 0.004 # 0.4%
X_mfg, y_mfg = make_classification(
n_samples=n_readings, n_features=25, n_informative=8,
n_redundant=5, weights=[1 - failure_rate, failure_rate],
flip_y=0.01, random_state=42
)
print(f"Manufacturing dataset:")
print(f" Total readings: {n_readings:,}")
print(f" Failures: {y_mfg.sum()} ({y_mfg.mean():.2%})")
print(f" Normal: {n_readings - y_mfg.sum():,}")
print(f" Imbalance ratio: {(n_readings - y_mfg.sum()) / y_mfg.sum():.0f}:1")
Manufacturing dataset:
Total readings: 100,000
Failures: 429 (0.43%)
Normal: 99,571
Imbalance ratio: 232:1
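At 232:1, even the train/test split deserves care: stratify=y guarantees the test fold gets its proportional share of the 429 failures, while a plain random split leaves the count to chance. A quick check (dummy features; the counts mirror the output above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = np.zeros(100_000, dtype=int)
y[rng.choice(100_000, size=429, replace=False)] = 1   # 429 failures
X = rng.normal(size=(100_000, 3))                      # placeholder features

_, _, _, y_te_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_te_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print("failures in test, plain split:     ", int(y_te_plain.sum()))
print("failures in test, stratified split:", int(y_te_strat.sum()))
```

The stratified split always lands at 20% of 429 (85 or 86 failures); the plain split wanders, and with rarer events it can leave the test set nearly empty of positives.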
# Cost structure: FN = $500K (unplanned downtime), FP = $5K (unnecessary inspection)
fn_cost_mfg = 500000
fp_cost_mfg = 5000
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
X_mfg, y_mfg, test_size=0.2, stratify=y_mfg, random_state=42
)
# Strategy 1: Default model
gb_mfg = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
)
gb_mfg.fit(X_tr_m, y_tr_m)
proba_mfg = gb_mfg.predict_proba(X_te_m)[:, 1]
# Strategy 2: Cost-weighted
mfg_weights = np.where(y_tr_m == 1, fn_cost_mfg / fp_cost_mfg, 1.0)
gb_mfg_wt = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
)
gb_mfg_wt.fit(X_tr_m, y_tr_m, sample_weight=mfg_weights)
proba_mfg_wt = gb_mfg_wt.predict_proba(X_te_m)[:, 1]
# Strategy 3: Threshold tuning
thresholds_mfg = np.linspace(0.001, 0.5, 1000)
profits_mfg = [
expected_profit(y_te_m, proba_mfg, t, fn_cost_mfg, fp_cost_mfg)
for t in thresholds_mfg
]
opt_threshold_mfg = thresholds_mfg[np.argmax(profits_mfg)]
print(f"Manufacturing Results:")
print(f" Cost ratio: FN=${fn_cost_mfg:,} vs FP=${fp_cost_mfg:,} "
f"({fn_cost_mfg // fp_cost_mfg}:1)")
print(f" Break-even precision: "
f"{fp_cost_mfg / (fp_cost_mfg + fn_cost_mfg):.3f}")
print()
for label, proba_vec, thresh in [
("Default (t=0.50)", proba_mfg, 0.50),
("Cost-weighted (t=0.50)", proba_mfg_wt, 0.50),
(f"Threshold tuned (t={opt_threshold_mfg:.3f})", proba_mfg, opt_threshold_mfg),
]:
y_p = (proba_vec >= thresh).astype(int)
tp = ((y_p == 1) & (y_te_m == 1)).sum()
fp = ((y_p == 1) & (y_te_m == 0)).sum()
fn = ((y_p == 0) & (y_te_m == 1)).sum()
cost = fn * fn_cost_mfg + fp * fp_cost_mfg
print(f" {label}:")
print(f" TP={tp}, FP={fp}, FN={fn}")
print(f" Recall={tp/(tp+fn):.3f}, Precision={tp/(tp+fp+1e-8):.3f}")
print(f" Total cost: ${cost:,.0f}")
print()
Manufacturing Results:
Cost ratio: FN=$500,000 vs FP=$5,000 (100:1)
Break-even precision: 0.010
Default (t=0.50):
TP=32, FP=8, FN=54
Recall=0.372, Precision=0.800
Total cost: $27,040,000
Cost-weighted (t=0.50):
TP=61, FP=187, FN=25
Recall=0.709, Precision=0.246
Total cost: $13,435,000
Threshold tuned (t=0.008):
TP=78, FP=1263, FN=8
Recall=0.907, Precision=0.058
Total cost: $10,315,000
At a 100:1 cost ratio, the threshold-tuned model with 90.7% recall saves nearly $17 million over the default model --- despite flagging 1,263 unnecessary inspections. Each unnecessary inspection costs $5,000, but each missed failure costs $500,000. The math is not subtle.
When the Break-Even Precision Is 1% --- The manufacturing break-even precision is 0.010. If the model's precision is above 1%, every alert saves money on average. This means you can tolerate a massive number of false alarms. In extreme cost-asymmetry domains (failure detection, fraud, security), the optimal threshold is often absurdly low. A model with 5% precision and 95% recall can be the right business decision.
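The break-even figure comes from the expected value of a single alert: with precision p, an alert saves fn_cost with probability p and wastes fp_cost with probability 1 - p, so break-even is p = fp_cost / (fp_cost + fn_cost):

```python
def break_even_precision(fn_cost, fp_cost):
    """Minimum precision at which acting on an alert pays off on average."""
    return fp_cost / (fp_cost + fn_cost)

print(f"manufacturing: {break_even_precision(500_000, 5_000):.3f}")  # 0.010
print(f"churn:         {break_even_precision(180, 5):.3f}")          # 0.027
```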
Part 8: Decision Framework --- Choosing Your Strategy
After seeing all the techniques, here is a practical decision framework:
Step 1: Quantify the Cost Asymmetry
Before touching any code, answer: what does a false negative cost? What does a false positive cost? If you cannot put dollar amounts on these, use relative estimates.
| Cost Ratio (FN:FP) | Severity | Primary Strategy |
|---|---|---|
| 1:1 to 3:1 | Low asymmetry | Default model, maybe class_weight |
| 3:1 to 20:1 | Moderate | Threshold tuning + class_weight |
| 20:1 to 100:1 | High | Threshold tuning + custom sample_weight |
| >100:1 | Extreme | Aggressive threshold tuning; resampling less useful |
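The table reads as a simple lookup (the boundaries are guidelines, not hard rules; the strings are shorthand for the table rows):

```python
def primary_strategy(fn_fp_ratio):
    """Map an FN:FP cost ratio to the table's recommended starting point."""
    if fn_fp_ratio <= 3:
        return "default model, maybe class_weight"
    if fn_fp_ratio <= 20:
        return "threshold tuning + class_weight"
    if fn_fp_ratio <= 100:
        return "threshold tuning + custom sample_weight"
    return "aggressive threshold tuning; resampling less useful"

print(primary_strategy(36))    # StreamFlow churn: $180 / $5
print(primary_strategy(500))   # beyond the manufacturing case
```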
Step 2: Try Threshold Tuning First
Threshold tuning requires no retraining, no new dependencies, and preserves the model's learned probability estimates. It should be your first move for any imbalanced problem.
- Train the model normally.
- Compute predicted probabilities on a validation set.
- Sweep thresholds and compute your business metric at each.
- Select the threshold that maximizes expected profit.
- Report performance on the held-out test set at that threshold.
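The five steps above, end to end, on synthetic data (the profit convention and all cost figures here are illustrative; swap in your own data and cost function):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.92, 0.08],
                           random_state=42)

# Step 1: train / validation / test split
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                            random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                          stratify=y_tmp, random_state=42)
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

def profit(y_true, proba, t, fn_cost=180, fp_cost=5):
    pred = (proba >= t).astype(int)
    caught = ((pred == 1) & (y_true == 1)).sum()
    return caught * fn_cost - pred.sum() * fp_cost  # assumed cost convention

# Steps 2-4: sweep thresholds on the validation set, keep the most profitable
grid = np.linspace(0.01, 0.99, 99)
va_proba = clf.predict_proba(X_va)[:, 1]
best_t = grid[np.argmax([profit(y_va, va_proba, t) for t in grid])]

# Step 5: report on the held-out test set at the chosen threshold
te_proba = clf.predict_proba(X_te)[:, 1]
print(f"threshold: {best_t:.2f}, test profit: ${profit(y_te, te_proba, best_t):,}")
```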
Step 3: Add Cost-Sensitive Learning If Threshold Tuning Is Insufficient
If the model's ranking quality (AUC-PR) is poor --- meaning even the optimal threshold does not produce acceptable recall --- add class weights or sample weights to improve the model's ability to distinguish the minority class.
# Decision template
from sklearn.metrics import average_precision_score
# 1. Train default model, get AUC-PR
auc_pr_default = average_precision_score(y_test, y_proba)
# 2. If AUC-PR is poor (< 2x the positive rate), try class_weight
if auc_pr_default < 2 * y_test.mean():
print("AUC-PR is weak. Try class_weight='balanced' or custom weights.")
else:
print("AUC-PR is reasonable. Threshold tuning alone may suffice.")
Step 4: Consider Resampling Only When Needed
Resampling (SMOTE, oversampling, undersampling) is most useful when:
- You are using a linear model or distance-based model (not tree-based)
- The minority class has too few examples for the model to learn meaningful patterns (fewer than ~100-200 positive examples)
- You have tried threshold tuning and class weights and AUC-PR is still poor
Resampling is less useful when:
- You are using tree-based models (they handle imbalance reasonably well with class weights)
- You have thousands of minority examples (enough signal to learn from)
- The imbalance is moderate (5-20% minority rate)
The Practitioner's Checklist
- Compute the imbalance ratio and the cost ratio.
- Establish the baseline: dummy classifier and default model at threshold 0.50.
- Compute AUC-PR (not AUC-ROC) as the ranking metric.
- Tune the threshold on a validation set using the business cost function.
- If AUC-PR is too low, add class_weight='balanced' or custom sample weights.
- If still too low, try SMOTE inside cross-validation.
- Always report disaggregated performance across subgroups (fairness check).
- Report the business metric (expected profit, expected cost) alongside ML metrics.
Part 9: Progressive Project --- Milestone M7
StreamFlow Churn Imbalance
In Milestone M6, you evaluated your churn models properly and chose the right metrics. Now you will address the 8.2% churn imbalance directly.
Task 1: Establish the Imbalance Baseline
Report the following for your best model from M6 at threshold 0.50:
- Accuracy, precision, recall, F1, AUC-PR
- Confusion matrix
- Expected profit assuming FN=$180, FP=$5
Task 2: class_weight='balanced'
Retrain with class_weight='balanced' (or equivalent sample_weight). Report the same metrics. Does recall improve? Does AUC-PR improve?
# Template for Task 2
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# Balanced weights for GradientBoosting (doesn't have class_weight parameter)
pos_weight = len(y_train) / (2 * y_train.sum())
neg_weight = len(y_train) / (2 * (len(y_train) - y_train.sum()))
weights = np.where(y_train == 1, pos_weight, neg_weight)
gb_balanced = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_balanced.fit(X_train, y_train, sample_weight=weights)
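For estimators that do expose a class_weight parameter (e.g. LogisticRegression, RandomForestClassifier), the same effect is a single argument. A sketch on toy stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for X_train / y_train, ~10% positive
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (rng.random(300) < 0.10).astype(int)

lr = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_toy, y_toy)
rf = RandomForestClassifier(class_weight='balanced', n_estimators=100,
                            random_state=42).fit(X_toy, y_toy)
print(lr.predict_proba(X_toy[:3])[:, 1])
```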
Task 3: SMOTE (Inside CV)
Apply SMOTE inside cross-validation using imblearn.pipeline.Pipeline. Compare the cross-validated AUC-PR to the class-weighted model.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
pipe = ImbPipeline([
('smote', SMOTE(random_state=42)),
('clf', GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
))
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=skf, scoring='average_precision')
print(f"SMOTE + GB: AUC-PR = {scores.mean():.3f} +/- {scores.std():.3f}")
Task 4: Threshold Tuning on the PR Curve
Using your best model's predicted probabilities:
1. Plot the precision-recall curve.
2. Find the threshold that maximizes F1.
3. Find the threshold that maximizes expected profit (FN=$180, FP=$5).
4. Compare the two thresholds. Are they the same? Why or why not?
from sklearn.metrics import precision_recall_curve
proba = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, proba)
# F1-optimal threshold (precision_recall_curve returns one more PR point
# than thresholds, so drop the final point before taking the argmax)
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
# Profit-optimal threshold
thresholds_grid = np.linspace(0.01, 0.99, 500)
profits = [
expected_profit(y_val, proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
best_profit_threshold = thresholds_grid[np.argmax(profits)]
print(f"F1-optimal threshold: {best_f1_threshold:.3f}")
print(f"Profit-optimal threshold: {best_profit_threshold:.3f}")
Task 5: The Four-Strategy Comparison
Create a summary table comparing:
1. Baseline (default threshold)
2. class_weight='balanced'
3. SMOTE
4. Threshold tuning on the PR curve
For each, report AUC-PR, precision, recall, F1, and expected profit. Write 2-3 sentences interpreting the results. Does threshold tuning beat resampling?
Expected Finding --- In most cases, threshold tuning on a well-trained default model produces higher profit than resampling, because it directly optimizes for the business cost structure rather than trying to "balance" the data. SMOTE and class_weight improve recall but do so by sacrificing precision in ways that may not align with the cost asymmetry. The best approach is often: train a good model, tune the threshold.
Chapter Summary
This chapter covered the full toolkit for handling class imbalance:
- Class imbalance is the norm. Churn (8.2%), readmission (22%), equipment failure (0.4%) --- the event you care about is almost always the minority class. Accuracy is useless for evaluation. AUC-PR, precision, recall, and business cost metrics are what matter.
- Resampling changes the training data. Random oversampling duplicates; SMOTE interpolates; undersampling discards. All work by changing the class balance the model sees during training. SMOTE must be applied inside cross-validation folds, never before splitting.
- Cost-sensitive learning changes the loss function. class_weight='balanced' adjusts penalties by the imbalance ratio. Custom sample_weight adjusts by the actual business cost ratio. This is conceptually cleaner than resampling because it directly encodes what matters.
- Threshold tuning changes the decision boundary. It requires no retraining and often produces the best business outcomes. Tune on a validation set using the actual cost function, not F1 or accuracy.
- The cost matrix drives everything. FN=$180 and FP=$5 means you should prioritize recall over precision. FN=$500K and FP=$5K means you should almost always predict "failure." The break-even precision tells you the minimum precision needed for the model to add value.
- Fairness complicates imbalance. When the positive rate varies across demographic groups, the same threshold produces different recall for different groups. Disaggregate your analysis.
The honest truth: for most imbalanced problems in practice, the winning recipe is (1) train a good model, (2) tune the threshold, (3) add class weights if the ranking quality is poor. SMOTE and its variants are useful in specific circumstances --- small datasets, linear models, extreme imbalance --- but they are not the default answer. The default answer is: figure out what a false negative and a false positive actually cost, and optimize for that.
Next chapter: Chapter 18 --- Hyperparameter Tuning, where you will learn to systematically search for the model configuration that maximizes your chosen metric --- including the techniques from this chapter as hyperparameters to tune.