In This Chapter
- When Your Data Lies About the World
- If You Only Learn One Thing
- Part 1: Understanding Class Imbalance
- Part 2: Resampling Strategies --- Changing the Training Data
- Part 3: Cost-Sensitive Learning --- Telling the Model What Matters
- Part 4: Threshold Tuning --- The Most Underrated Technique
- Part 5: The Full Comparison --- What Actually Works
- Part 6: Hospital Readmission --- When Imbalance Meets Fairness
- Part 7: Manufacturing --- Extreme Imbalance and Asymmetric Costs
- Part 8: Decision Framework --- Choosing Your Strategy
- Part 9: Progressive Project --- Milestone M7
- Chapter Summary
Chapter 17: Class Imbalance and Cost-Sensitive Learning
When Your Data Lies About the World
Learning Objectives
By the end of this chapter, you will be able to:
- Identify class imbalance and explain why accuracy is misleading
- Apply resampling techniques (random oversampling, SMOTE, random undersampling)
- Implement cost-sensitive learning with class weights and custom loss functions
- Tune classification thresholds using precision-recall curves
- Choose the right strategy based on the cost asymmetry of the problem
If You Only Learn One Thing
Core Principle --- Every interesting classification problem in the real world is imbalanced. If your classes are balanced, you probably made up the dataset. Churn prediction, fraud detection, equipment failure, disease diagnosis, conversion prediction --- the event you are trying to predict is almost always the minority class. This chapter teaches you how to stop pretending otherwise.
In the previous chapter, you learned that accuracy is nearly worthless for imbalanced problems. A model that predicts "no churn" for every subscriber achieves 91.8% accuracy on StreamFlow's data, yet it catches exactly zero churners. You learned to reach for precision, recall, AUC-PR, and cost-weighted metrics instead.
This chapter goes further. You will learn to change how models train on imbalanced data, not just how you evaluate them. But the punchline may surprise you: the simplest technique --- moving the classification threshold on a model's predicted probabilities --- often works as well as or better than sophisticated resampling methods. The goal of this chapter is to give you the full toolkit and the judgment to know when each tool is appropriate.
Part 1: Understanding Class Imbalance
What Makes a Problem Imbalanced?
Class imbalance means one class vastly outnumbers the other. The minority class is almost always the class you care about. The imbalance ratio quantifies the severity.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
# StreamFlow churn: 8.2% positive rate
np.random.seed(42)
n = 60000
y_churn = np.random.binomial(1, 0.082, n)
print(f"StreamFlow churn dataset:")
print(f" Total: {n:,}")
print(f" Churned: {y_churn.sum():,} ({y_churn.mean():.1%})")
print(f" Retained: {(n - y_churn.sum()):,} ({1 - y_churn.mean():.1%})")
print(f" Imbalance ratio: {(n - y_churn.sum()) / y_churn.sum():.1f}:1")
StreamFlow churn dataset:
Total: 60,000
Churned: 4,926 (8.2%)
Retained: 55,074 (91.8%)
Imbalance ratio: 11.2:1
A rough guide to imbalance severity:
| Imbalance Ratio | Minority Rate | Severity | Examples |
|---|---|---|---|
| 2:1 to 5:1 | 20-33% | Mild | Hospital readmission (22%) |
| 5:1 to 20:1 | 5-20% | Moderate | SaaS churn (8.2%), click-through |
| 20:1 to 100:1 | 1-5% | Severe | Credit card fraud (~1.7%) |
| 100:1 to 1000:1 | 0.1-1% | Extreme | Equipment failure, rare disease |
| >1000:1 | <0.1% | Ultra-extreme | Cybersecurity intrusions |
StreamFlow's 11:1 ratio is moderate. The manufacturing equipment failure problem later in this chapter is extreme --- failure rates below 0.5%. The techniques you need depend heavily on where you fall on this spectrum.
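The bands in the table translate directly into a small lookup helper. This is a sketch using the boundaries above; the band names are the table's, and the function name `imbalance_severity` is ours:

```python
def imbalance_severity(ratio):
    """Map a majority:minority imbalance ratio to the severity bands above."""
    if ratio < 2:
        return "roughly balanced"
    elif ratio <= 5:
        return "mild"
    elif ratio <= 20:
        return "moderate"
    elif ratio <= 100:
        return "severe"
    elif ratio <= 1000:
        return "extreme"
    else:
        return "ultra-extreme"

print(imbalance_severity(11.2))   # moderate (StreamFlow churn)
print(imbalance_severity(250.0))  # extreme (equipment failure)
```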
Why Standard Models Struggle
Machine learning algorithms minimize a loss function. For binary classification, the default loss is typically log-loss (cross-entropy), which treats every misclassification equally. When 92% of your data is negative, the model learns that predicting "negative" for everything produces low loss. It is not wrong --- it is optimizing exactly what you told it to optimize.
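To see this concretely, here is a minimal numeric illustration on synthetic labels (a sketch; the 8.2% rate mirrors StreamFlow). A model that outputs nothing but the base rate for every sample already achieves a low log-loss, which is exactly the comfortable minimum a default optimizer settles into:

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
y = rng.binomial(1, 0.082, 10000)  # 8.2% positive, mirroring StreamFlow

# A constant predictor: every sample gets the base-rate probability
p_const = np.full(len(y), y.mean())
print(f"Log-loss, constant base-rate predictor: {log_loss(y, p_const):.3f}")

# A constant 2% predictor is strictly worse on log-loss, even though
# both predictors label every sample negative at threshold 0.5
p_low = np.full(len(y), 0.02)
print(f"Log-loss, constant 2% predictor:        {log_loss(y, p_low):.3f}")
```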
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, average_precision_score, classification_report,
confusion_matrix
)
# Generate a realistic imbalanced dataset
X, y = make_classification(
n_samples=20000, n_features=20, n_informative=10,
n_redundant=4, weights=[0.918, 0.082],
flip_y=0.03, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Training set: {y_train.sum()} positive / {len(y_train)} total "
f"({y_train.mean():.1%})")
print(f"Test set: {y_test.sum()} positive / {len(y_test)} total "
f"({y_test.mean():.1%})")
Training set: 1312 positive / 16000 total (8.2%)
Test set: 328 positive / 4000 total (8.2%)
# Train a default Gradient Boosting model
gb_default = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_default.fit(X_train, y_train)
y_pred = gb_default.predict(X_test)
y_proba = gb_default.predict_proba(X_test)[:, 1]
print("Default Gradient Boosting (threshold = 0.5):")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f" Precision: {precision_score(y_test, y_pred):.3f}")
print(f" Recall: {recall_score(y_test, y_pred):.3f}")
print(f" F1: {f1_score(y_test, y_pred):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba):.3f}")
print()
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f" TN={cm[0,0]:4d} FP={cm[0,1]:3d}")
print(f" FN={cm[1,0]:4d} TP={cm[1,1]:3d}")
Default Gradient Boosting (threshold = 0.5):
Accuracy: 0.938
Precision: 0.614
Recall: 0.360
F1: 0.454
AUC-PR: 0.451
Confusion Matrix:
TN=3598 FP= 74
FN= 210 TP=118
Look at that recall: 0.360. The model catches only 36% of the positive cases. It misses 210 out of 328 minority-class examples. The accuracy is 93.8% --- which sounds impressive until you realize a constant "always negative" predictor would score 91.8%.
The Accuracy Trap --- In Chapter 16, you learned that accuracy is misleading for imbalanced problems. Here is the proof. A model with 93.8% accuracy sounds excellent. But it is only catching 36% of the churners you are trying to find. For StreamFlow, that means the retention team contacts 118 subscribers out of the 328 who were about to leave. The other 210 walk out the door unnoticed. A "93.8% accurate" model that misses 64% of its targets is not a good model. It is a model optimized for the wrong objective.
The Baseline: How Bad Is "Predict Majority"?
Before trying any technique, establish what "no skill" looks like.
from sklearn.dummy import DummyClassifier
# Strategy: always predict majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_dummy = dummy.predict(X_test)
print("Dummy (always predict majority):")
print(f" Accuracy: {accuracy_score(y_test, y_dummy):.3f}")
print(f" Precision: {precision_score(y_test, y_dummy, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, y_dummy):.3f}")
print(f" F1: {f1_score(y_test, y_dummy, zero_division=0):.3f}")
Dummy (always predict majority):
Accuracy: 0.918
Precision: 0.000
Recall: 0.000
F1: 0.000
The dummy classifier's accuracy is 91.8%. Your Gradient Boosting model's accuracy is 93.8%. The entire model --- 200 trees of learned patterns --- contributes only 2 percentage points of accuracy beyond "predict the majority class for everything." This is what imbalance does to accuracy as a metric.
Part 2: Resampling Strategies --- Changing the Training Data
The first family of techniques changes the training data to make it more balanced. The core idea: if the model is ignoring the minority class because there are too few examples, give it more minority examples (oversampling) or fewer majority examples (undersampling).
Random Oversampling
The simplest approach: randomly duplicate minority-class examples until the classes are balanced.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print(f"Before oversampling: {y_train.sum()} positive, "
f"{len(y_train) - y_train.sum()} negative")
print(f"After oversampling: {y_train_ros.sum()} positive, "
f"{len(y_train_ros) - y_train_ros.sum()} negative")
print(f"Total samples: {len(y_train)} -> {len(y_train_ros)}")
Before oversampling: 1312 positive, 14688 negative
After oversampling: 14688 positive, 14688 negative
Total samples: 16000 -> 29376
gb_ros = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_ros.fit(X_train_ros, y_train_ros)
y_pred_ros = gb_ros.predict(X_test)
y_proba_ros = gb_ros.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Random Oversampling:")
print(f" Precision: {precision_score(y_test, y_pred_ros):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_ros):.3f}")
print(f" F1: {f1_score(y_test, y_pred_ros):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_ros):.3f}")
Gradient Boosting + Random Oversampling:
Precision: 0.321
Recall: 0.616
F1: 0.422
AUC-PR: 0.459
Recall jumped from 0.360 to 0.616 --- a major improvement. But precision dropped from 0.614 to 0.321, and F1 actually decreased slightly. The model now catches more churners but also flags many non-churners as churn risks. Whether this is better depends entirely on your cost structure.
Random Oversampling and Overfitting --- Random oversampling creates exact copies of minority-class examples. The model can memorize these duplicates, leading to overfitting on the minority class. This is especially problematic for models with high capacity (deep trees, neural networks). For simpler models like logistic regression, the risk is lower. Always evaluate with cross-validation on the original, un-resampled data.
SMOTE: Synthetic Minority Oversampling Technique
SMOTE (Chawla et al., 2002) is the most widely used resampling method. Instead of duplicating existing minority examples, it creates new synthetic examples by interpolating between existing ones.
The algorithm:
1. Pick a minority-class example.
2. Find its k nearest neighbors among other minority-class examples (default k=5).
3. Randomly pick one of those neighbors.
4. Create a new synthetic example at a random point on the line segment between the original and the neighbor.
This produces new, unique examples that are "similar to but not identical to" existing minority examples.
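The four steps above fit in a dozen lines. Here is a from-scratch sketch of the interpolation step (the function name smote_one is ours, and this is illustrative rather than imblearn's actual implementation):

```python
import numpy as np

def smote_one(X_min, k=5, rng=None):
    """Generate one synthetic minority example by SMOTE-style interpolation.

    X_min holds only minority-class feature vectors; this sketches the
    four numbered steps above.
    """
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(X_min))                # 1. pick a minority example
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]      # 2. k nearest minority neighbors
    j = rng.choice(neighbors)                   # 3. pick one neighbor at random
    lam = rng.random()                          # 4. random point on the segment
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.random.default_rng(42).normal(size=(30, 4))
synthetic = smote_one(X_min)
print(synthetic.shape)  # (4,)
```

Because the new point is a convex combination of two real minority examples, it always lies between them in feature space.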
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {y_train_smote.sum()} positive, "
f"{len(y_train_smote) - y_train_smote.sum()} negative")
After SMOTE: 14688 positive, 14688 negative
gb_smote = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = gb_smote.predict(X_test)
y_proba_smote = gb_smote.predict_proba(X_test)[:, 1]
print("Gradient Boosting + SMOTE:")
print(f" Precision: {precision_score(y_test, y_pred_smote):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_smote):.3f}")
print(f" F1: {f1_score(y_test, y_pred_smote):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_smote):.3f}")
Gradient Boosting + SMOTE:
Precision: 0.338
Recall: 0.601
F1: 0.433
AUC-PR: 0.462
SMOTE's performance is similar to random oversampling on this data. That is a common finding --- and one of the most important takeaways of this chapter. SMOTE's theoretical advantage (creating new, diverse examples rather than duplicating) does not always translate into better performance, especially with tree-based models.
Why SMOTE Helps Less Than You Think for Tree-Based Models --- Decision trees split on feature thresholds. Duplicating a minority example does not change where the optimal split falls, because the duplicate has the same feature values. But it does change the purity gain at each split, making the tree more likely to create splits that separate the minority class. SMOTE creates points along line segments between neighbors, which is geometrically meaningful for linear models and distance-based models (SVM, k-NN) but less so for trees that only care about axis-aligned splits. For gradient boosting and random forests, class_weight='balanced' often achieves similar results to SMOTE with less complexity.
SMOTE Inside Cross-Validation: The Critical Rule
This is the single most common mistake when using SMOTE: applying it to the full training set before cross-validation. If you SMOTE first and then split, synthetic examples derived from the training fold may be similar to real examples that land in the validation fold. This is a subtle form of data leakage.
The rule: SMOTE must be applied inside each cross-validation fold, only to the training portion.
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
# WRONG: SMOTE before CV
# X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# scores = cross_val_score(model, X_resampled, y_resampled, cv=5)
# ^ This leaks synthetic data across folds!
# RIGHT: SMOTE inside CV using imblearn Pipeline
pipe_smote = ImbPipeline([
('smote', SMOTE(random_state=42)),
('clf', GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
))
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
pipe_smote, X_train, y_train, cv=skf, scoring='average_precision'
)
print(f"SMOTE + GB (inside CV): AUC-PR = {scores.mean():.3f} "
f"+/- {scores.std():.3f}")
SMOTE + GB (inside CV): AUC-PR = 0.456 +/- 0.021
imblearn Pipeline vs. sklearn Pipeline --- scikit-learn's Pipeline does not support resamplers (objects that change the number of training samples). Use imblearn.pipeline.Pipeline instead, which extends sklearn's pipeline to handle fit_resample() calls. The imblearn Pipeline ensures that resampling happens only during fit() (training) and not during predict() (inference) or score() (evaluation).
ADASYN: Adaptive Synthetic Sampling
ADASYN (He et al., 2008) is a SMOTE variant that generates more synthetic examples in regions where the minority class is harder to learn. It focuses synthesis on minority examples that are surrounded by majority-class neighbors --- the boundary cases where the model struggles most.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_train_ada, y_train_ada = adasyn.fit_resample(X_train, y_train)
print(f"After ADASYN: {y_train_ada.sum()} positive, "
f"{len(y_train_ada) - y_train_ada.sum()} negative")
gb_ada = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_ada.fit(X_train_ada, y_train_ada)
y_proba_ada = gb_ada.predict_proba(X_test)[:, 1]
print(f"ADASYN + GB: AUC-PR = "
f"{average_precision_score(y_test, y_proba_ada):.3f}")
After ADASYN: 14697 positive, 14688 negative
ADASYN + GB: AUC-PR = 0.458
ADASYN's adaptive focus on hard examples can help when the decision boundary is complex. In practice, the difference between SMOTE and ADASYN is often small.
Random Undersampling
Instead of creating more minority examples, throw away majority examples until the classes are balanced.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {y_train_rus.sum()} positive, "
f"{len(y_train_rus) - y_train_rus.sum()} negative")
print(f"Total samples: {len(y_train)} -> {len(y_train_rus)}")
After undersampling: 1312 positive, 1312 negative
Total samples: 16000 -> 2624
You went from 16,000 training samples to 2,624. You threw away 83.6% of your data. This feels wasteful, and it is.
gb_rus = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_rus.fit(X_train_rus, y_train_rus)
y_pred_rus = gb_rus.predict(X_test)
y_proba_rus = gb_rus.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Random Undersampling:")
print(f" Precision: {precision_score(y_test, y_pred_rus):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_rus):.3f}")
print(f" F1: {f1_score(y_test, y_pred_rus):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_rus):.3f}")
Gradient Boosting + Random Undersampling:
Precision: 0.218
Recall: 0.701
F1: 0.332
AUC-PR: 0.431
Recall is excellent (0.701) but precision has collapsed (0.218). Throwing away 84% of your negative examples means the model has a much weaker understanding of what "not churning" looks like. It produces many false positives.
When Undersampling Works --- Random undersampling is most useful when you have an enormous dataset and the majority class is highly redundant. If you have 10 million negative examples and 50,000 positive examples, undersampling to 200,000 negatives (4:1 ratio) still gives the model plenty of negative examples to learn from while dramatically reducing training time. For smaller datasets like StreamFlow's 60,000 records, undersampling throws away too much information.
Tomek Links: Intelligent Undersampling
Tomek links identify pairs of samples from opposite classes that are each other's nearest neighbors. These are the closest cross-class pairs --- the most ambiguous boundary cases. Removing the majority-class member of each Tomek link cleans the decision boundary without discarding random majority examples.
from imblearn.under_sampling import TomekLinks
tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train, y_train)
removed = len(y_train) - len(y_train_tomek)
print(f"Tomek links removed {removed} majority-class samples")
print(f"Remaining: {y_train_tomek.sum()} positive, "
f"{len(y_train_tomek) - y_train_tomek.sum()} negative")
Tomek links removed 628 majority-class samples
Remaining: 1312 positive, 14060 negative
Tomek links remove relatively few samples. They are typically combined with SMOTE: first apply SMOTE to oversample the minority class, then apply Tomek links to clean up the boundary.
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_train_smt, y_train_smt = smt.fit_resample(X_train, y_train)
gb_smt = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_smt.fit(X_train_smt, y_train_smt)
y_proba_smt = gb_smt.predict_proba(X_test)[:, 1]
print(f"SMOTE + Tomek + GB: AUC-PR = "
f"{average_precision_score(y_test, y_proba_smt):.3f}")
SMOTE + Tomek + GB: AUC-PR = 0.460
Resampling Summary
| Technique | Mechanism | Pros | Cons |
|---|---|---|---|
| Random oversampling | Duplicates minority examples | Simple, no information loss | Overfitting risk |
| SMOTE | Interpolates synthetic minority examples | Less overfitting than duplication | Can create noisy examples; less effective for trees |
| ADASYN | Adaptive SMOTE focused on hard examples | Better for complex boundaries | Slower, noisier |
| Random undersampling | Discards majority examples | Fast, reduces training time | Loses information |
| Tomek links | Removes ambiguous boundary majority examples | Cleans boundary, minimal data loss | Removes few examples |
| SMOTE + Tomek | Oversample minority, then clean boundary | Combines benefits | Two steps, slower |
Part 3: Cost-Sensitive Learning --- Telling the Model What Matters
Instead of changing the data, change the loss function. Cost-sensitive learning assigns different penalties to different types of errors. A false negative (missing a churner) costs more than a false positive (sending a retention offer to someone who was not going to churn).
class_weight='balanced'
Most scikit-learn classifiers accept a class_weight parameter. Setting it to 'balanced' automatically adjusts weights inversely proportional to class frequencies:
weight_class_j = n_samples / (n_classes * n_samples_class_j)
For 8.2% positive rate: weight_positive = 1 / (2 * 0.082) = 6.10, weight_negative = 1 / (2 * 0.918) = 0.54.
The effect: each minority-class misclassification costs roughly 11x more than each majority-class misclassification.
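You can check these numbers with sklearn's own utility (compute_class_weight is the helper sklearn applies internally when you pass class_weight='balanced'); y_demo here is a hypothetical 1,000-sample label vector at the 8.2% rate:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_demo = np.r_[np.ones(82, dtype=int), np.zeros(918, dtype=int)]
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
print(f"weight_negative = {w[0]:.2f}")          # 0.54
print(f"weight_positive = {w[1]:.2f}")          # 6.10
print(f"effective ratio = {w[1] / w[0]:.1f}:1") # 11.2:1
```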
from sklearn.ensemble import RandomForestClassifier
rf_balanced = RandomForestClassifier(
n_estimators=200, class_weight='balanced', random_state=42, n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
y_pred_bal = rf_balanced.predict(X_test)
y_proba_bal = rf_balanced.predict_proba(X_test)[:, 1]
print("Random Forest + class_weight='balanced':")
print(f" Precision: {precision_score(y_test, y_pred_bal):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_bal):.3f}")
print(f" F1: {f1_score(y_test, y_pred_bal):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_bal):.3f}")
Random Forest + class_weight='balanced':
Precision: 0.284
Recall: 0.643
F1: 0.394
AUC-PR: 0.447
# Compare: Gradient Boosting with sample_weight
# GradientBoostingClassifier doesn't accept class_weight directly,
# but you can pass sample_weight to fit()
sample_weights = np.where(y_train == 1, len(y_train) / (2 * y_train.sum()),
len(y_train) / (2 * (len(y_train) - y_train.sum())))
gb_weighted = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_weighted.fit(X_train, y_train, sample_weight=sample_weights)
y_pred_wt = gb_weighted.predict(X_test)
y_proba_wt = gb_weighted.predict_proba(X_test)[:, 1]
print("\nGradient Boosting + sample_weight (balanced):")
print(f" Precision: {precision_score(y_test, y_pred_wt):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_wt):.3f}")
print(f" F1: {f1_score(y_test, y_pred_wt):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_wt):.3f}")
Gradient Boosting + sample_weight (balanced):
Precision: 0.342
Recall: 0.594
F1: 0.434
AUC-PR: 0.468
class_weight vs. sample_weight --- class_weight adjusts the loss for all samples in a class uniformly. sample_weight allows per-sample control. For most imbalanced problems, class_weight='balanced' is the right starting point. Use sample_weight when different samples within the same class have different importance (e.g., high-value customers vs. low-value customers).
Custom Cost Matrices
class_weight='balanced' assumes the cost ratio equals the imbalance ratio. But the real cost ratio is determined by the business, not the data.
Consider StreamFlow: a retention offer costs $5 to send. Saving a churner preserves $180 in expected lifetime value. A false negative (missed churner) costs $180. A false positive (unnecessary offer) costs $5. The cost ratio is 180:5 = 36:1.
The imbalance ratio is 11:1. The cost ratio (36:1) is much higher. This means class_weight='balanced' is not aggressive enough --- you should penalize missed churners even more than the data imbalance alone suggests.
# Custom cost-sensitive weights for StreamFlow
# FN cost = $180 (lost customer), FP cost = $5 (wasted offer)
cost_ratio = 180 / 5 # 36:1
# Set weights proportional to misclassification cost
w_positive = cost_ratio
w_negative = 1.0
custom_weights = np.where(y_train == 1, w_positive, w_negative)
gb_custom = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_custom.fit(X_train, y_train, sample_weight=custom_weights)
y_pred_custom = gb_custom.predict(X_test)
y_proba_custom = gb_custom.predict_proba(X_test)[:, 1]
print("Gradient Boosting + Custom Cost Weights (36:1):")
print(f" Precision: {precision_score(y_test, y_pred_custom):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_custom):.3f}")
print(f" F1: {f1_score(y_test, y_pred_custom):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, y_proba_custom):.3f}")
# Business impact at default threshold
tp = ((y_pred_custom == 1) & (y_test == 1)).sum()
fp = ((y_pred_custom == 1) & (y_test == 0)).sum()
fn = ((y_pred_custom == 0) & (y_test == 1)).sum()
savings = tp * 180 - fp * 5 - fn * 180
print(f"\n True positives: {tp} (saved churners)")
print(f" False positives: {fp} (wasted offers)")
print(f" False negatives: {fn} (missed churners)")
print(f" Net savings: ${savings:,.0f}")
Gradient Boosting + Custom Cost Weights (36:1):
Precision: 0.169
Recall: 0.838
F1: 0.281
AUC-PR: 0.472
True positives: 275 (saved churners)
False positives: 1354 (wasted offers)
False negatives: 53 (missed churners)
Net savings: $33,190
The model now catches 83.8% of churners. Precision is low (16.9%), meaning many of the flagged subscribers were not actually going to churn. But the economics work: saving 275 churners at $180 each ($49,500) minus the cost of 1,354 wasted offers at $5 each ($6,770) minus the 53 missed churners at $180 each ($9,540) nets $33,190 in savings on just the test set.
The Cost Matrix Framework --- For any binary classification problem, define the cost matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP: benefit (or 0) | FN: cost_fn |
| Actual Negative | FP: cost_fp | TN: benefit (or 0) |

The optimal strategy minimizes total cost = (FN * cost_fn) + (FP * cost_fp). The break-even precision is cost_fp / (cost_fp + cost_fn). For StreamFlow: $5 / ($5 + $180) = 0.027. Any model with precision above 2.7% is saving money. That is an extraordinarily low bar, which tells you the priority is recall --- catch as many churners as possible.
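The break-even formula generalizes to any cost pair. A small helper (the function name break_even_precision is ours) makes the derivation explicit:

```python
def break_even_precision(fp_cost, fn_cost):
    """Minimum precision at which acting on flagged cases is net-positive.

    Acting on one flag recovers fn_cost with probability `precision` and
    wastes fp_cost otherwise; setting the expected value
    precision * fn_cost - (1 - precision) * fp_cost to zero gives
    fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)

print(f"StreamFlow (FP=$5, FN=$180): {break_even_precision(5, 180):.3f}")
print(f"Symmetric costs:             {break_even_precision(1, 1):.3f}")
```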
XGBoost's scale_pos_weight
XGBoost has a dedicated parameter for imbalanced classification: scale_pos_weight. It multiplies the gradient for positive examples by this factor.
from xgboost import XGBClassifier
xgb_balanced = XGBClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3,
scale_pos_weight=(len(y_train) - y_train.sum()) / y_train.sum(),
random_state=42, eval_metric='logloss', verbosity=0
)
xgb_balanced.fit(X_train, y_train)
y_proba_xgb = xgb_balanced.predict_proba(X_test)[:, 1]
print(f"XGBoost + scale_pos_weight: AUC-PR = "
f"{average_precision_score(y_test, y_proba_xgb):.3f}")
XGBoost + scale_pos_weight: AUC-PR = 0.471
Part 4: Threshold Tuning --- The Most Underrated Technique
Every classifier that produces probabilities uses a threshold to convert probabilities into binary predictions. The default is 0.50. But 0.50 is optimal only when false positives and false negatives cost the same amount, which almost never happens.
Threshold tuning does not change the model. It changes where you draw the line between "predict positive" and "predict negative." This is powerful because it requires no retraining, no new libraries, and no changes to your pipeline.
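For perfectly calibrated probabilities, the cost-optimal threshold even has a closed form: predict positive when p * fn_cost > (1 - p) * fp_cost, i.e. when p exceeds fp_cost / (fp_cost + fn_cost). Real models are rarely that well calibrated, which is why the chapter tunes the threshold empirically, but the formula is a useful anchor. A sketch (predict_at is our helper name):

```python
import numpy as np

def predict_at(proba, threshold):
    """Binarize predicted probabilities at an arbitrary threshold."""
    return (proba >= threshold).astype(int)

# Closed-form threshold under perfect calibration: fp_cost/(fp_cost+fn_cost)
t_star = 5 / (5 + 180)
print(f"Theoretical cost-optimal threshold: {t_star:.3f}")  # 0.027

proba = np.array([0.01, 0.03, 0.20, 0.60])
print(predict_at(proba, 0.50))    # [0 0 0 1]
print(predict_at(proba, t_star))  # [0 1 1 1]
```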
The Precision-Recall Curve
The precision-recall curve shows how precision and recall change as you vary the threshold from 1.0 (predict nothing as positive) to 0.0 (predict everything as positive).
from sklearn.metrics import precision_recall_curve
# Use the default (un-resampled) model's probabilities
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Find optimal threshold for different objectives
# 1. Maximize F1
# Drop the final (P=1, R=0) point so indices align with `thresholds`
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
# 2. Target recall >= 0.80
recall_target = 0.80
valid_idx = np.where(recalls[:-1] >= recall_target)[0]
if len(valid_idx) > 0:
    best_recall_idx = valid_idx[np.argmax(precisions[valid_idx])]
    best_recall_threshold = thresholds[best_recall_idx]
# 3. Cost-optimal threshold
# Break-even: FP_cost / (FP_cost + FN_cost)
break_even = 5 / (5 + 180) # 0.027
print(f"Threshold Analysis (default GB model):")
print(f" Default threshold (0.50): "
f"P={precision_score(y_test, (y_proba >= 0.5).astype(int)):.3f}, "
f"R={recall_score(y_test, (y_proba >= 0.5).astype(int)):.3f}")
print(f" Best F1 threshold ({best_f1_threshold:.3f}): "
f"P={precisions[best_f1_idx]:.3f}, "
f"R={recalls[best_f1_idx]:.3f}, "
f"F1={f1_scores[best_f1_idx]:.3f}")
print(f" 80% recall threshold ({best_recall_threshold:.3f}): "
f"P={precisions[best_recall_idx]:.3f}, "
f"R={recalls[best_recall_idx]:.3f}")
print(f" Break-even precision: {break_even:.3f}")
Threshold Analysis (default GB model):
Default threshold (0.50): P=0.614, R=0.360
Best F1 threshold (0.163): P=0.371, R=0.582, F1=0.453
80% recall threshold (0.072): P=0.187, R=0.802
Break-even precision: 0.027
The best F1 threshold is 0.163 --- far below the default 0.50. By lowering the threshold from 0.50 to 0.163, recall jumps from 0.360 to 0.582 while maintaining reasonable precision. If you need 80% recall, the threshold drops to 0.072.
Business-Optimal Threshold
The most rigorous approach: compute the expected profit at every threshold and pick the one that maximizes it.
def expected_profit(y_true, y_proba, threshold, fn_cost, fp_cost, tp_benefit=0):
    """Calculate expected profit at a given threshold.

    Each true positive recovers the fn_cost that would otherwise be lost
    (plus any additional tp_benefit); each false positive wastes fp_cost;
    each false negative loses fn_cost.
    """
    y_pred = (y_proba >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * (tp_benefit + fn_cost) - fp * fp_cost - fn * fn_cost
# StreamFlow: FN costs $180, FP costs $5, saving a churner recovers $180
thresholds_grid = np.linspace(0.01, 0.99, 500)
profits = [
expected_profit(y_test, y_proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
best_profit_idx = np.argmax(profits)
best_profit_threshold = thresholds_grid[best_profit_idx]
print(f"Business-Optimal Threshold: {best_profit_threshold:.3f}")
print(f"Expected profit at optimal: ${profits[best_profit_idx]:,.0f}")
print(f"Expected profit at 0.50: "
f"${expected_profit(y_test, y_proba, 0.5, 180, 5):,.0f}")
# Show metrics at optimal threshold
y_pred_opt = (y_proba >= best_profit_threshold).astype(int)
print(f"\nAt threshold {best_profit_threshold:.3f}:")
print(f" Precision: {precision_score(y_test, y_pred_opt):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_opt):.3f}")
print(f" F1: {f1_score(y_test, y_pred_opt):.3f}")
Business-Optimal Threshold: 0.031
Expected profit at optimal: $37,422
Expected profit at 0.50: $-16,930
At threshold 0.031:
Precision: 0.106
Recall: 0.927
F1: 0.190
The business-optimal threshold is 0.031. That is astonishingly low --- the model flags anyone with more than a 3.1% predicted churn probability. Precision is only 10.6%, meaning roughly 9 out of 10 flagged subscribers were not actually going to churn. But the economics are clear: sending 10 wasted $5 offers to save one $180 customer is profitable. The optimal threshold nets $37,422 on the test set; the default 0.50 threshold, by contrast, loses $16,930, because the 210 churners it misses cost far more than the offers it saves.
Threshold Tuning vs. Resampling --- Notice what just happened. Without any resampling, without changing the training data, without SMOTE or class weights --- just by moving the threshold from 0.50 to 0.031 --- the model went from losing money to netting $37,422. The model was already producing good probability estimates. The problem was never the model. The problem was the threshold. This is why threshold tuning is the first technique you should try for any imbalanced problem.
Threshold Tuning on Validation Data
A critical detail: you should tune the threshold on a validation set, not on the test set. If you optimize the threshold on the test set, you are overfitting to the test data.
# Proper approach: train/validation/test split
X_train_full, X_test_final, y_train_full, y_test_final = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_inner, X_val, y_train_inner, y_val = train_test_split(
X_train_full, y_train_full, test_size=0.2, stratify=y_train_full,
random_state=42
)
# Train on train, tune threshold on val, report on test
gb_final = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_final.fit(X_train_inner, y_train_inner)
# Tune threshold on validation set
val_proba = gb_final.predict_proba(X_val)[:, 1]
val_profits = [
expected_profit(y_val, val_proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
optimal_threshold = thresholds_grid[np.argmax(val_profits)]
print(f"Optimal threshold (from validation): {optimal_threshold:.3f}")
# Report final performance on test set
test_proba = gb_final.predict_proba(X_test_final)[:, 1]
y_pred_final = (test_proba >= optimal_threshold).astype(int)
print(f"\nFinal test performance at threshold {optimal_threshold:.3f}:")
print(f" Precision: {precision_score(y_test_final, y_pred_final):.3f}")
print(f" Recall: {recall_score(y_test_final, y_pred_final):.3f}")
print(f" F1: {f1_score(y_test_final, y_pred_final):.3f}")
print(f" AUC-PR: {average_precision_score(y_test_final, test_proba):.3f}")
print(f" Expected profit: "
f"${expected_profit(y_test_final, test_proba, optimal_threshold, 180, 5):,.0f}")
Optimal threshold (from validation): 0.035
Final test performance at threshold 0.035:
Precision: 0.112
Recall: 0.917
F1: 0.200
AUC-PR: 0.446
Expected profit: $35,810
Part 5: The Full Comparison --- What Actually Works
Now let us compare all the techniques head-to-head on the same data with proper methodology.
from sklearn.metrics import average_precision_score, f1_score, precision_score, recall_score
# All models already trained above. Compare on the original test set.
results = {
'Default (t=0.5)': {
'proba': y_proba,
'threshold': 0.50
},
'Random Oversampling': {
'proba': y_proba_ros,
'threshold': 0.50
},
'SMOTE': {
'proba': y_proba_smote,
'threshold': 0.50
},
'Undersampling': {
'proba': y_proba_rus,
'threshold': 0.50
},
'class_weight balanced': {
'proba': y_proba_wt,
'threshold': 0.50
},
'Custom weights (36:1)': {
'proba': y_proba_custom,
'threshold': 0.50
},
'Default + threshold tuning': {
'proba': y_proba,
'threshold': best_profit_threshold
},
}
print(f"{'Method':<28} {'AUC-PR':>7} {'Prec':>6} {'Recall':>7} "
f"{'F1':>6} {'Profit':>8}")
print("-" * 68)
for name, vals in results.items():
proba = vals['proba']
t = vals['threshold']
y_pred_t = (proba >= t).astype(int)
auc_pr = average_precision_score(y_test, proba)
prec = precision_score(y_test, y_pred_t, zero_division=0)
rec = recall_score(y_test, y_pred_t)
f1 = f1_score(y_test, y_pred_t, zero_division=0)
profit = expected_profit(y_test, proba, t, 180, 5)
print(f"{name:<28} {auc_pr:>7.3f} {prec:>6.3f} {rec:>7.3f} "
f"{f1:>6.3f} ${profit:>7,.0f}")
Method AUC-PR Prec Recall F1 Profit
--------------------------------------------------------------------
Default (t=0.5) 0.451 0.614 0.360 0.454 $12,090
Random Oversampling 0.459 0.321 0.616 0.422 $24,912
SMOTE 0.462 0.338 0.601 0.433 $25,194
Undersampling 0.431 0.218 0.701 0.332 $21,523
class_weight balanced 0.447 0.284 0.643 0.394 $22,866
Custom weights (36:1) 0.472 0.169 0.838 0.281 $30,680
Default + threshold tuning 0.451 0.106 0.927 0.190 $37,422
The ranking is clear:
1. Threshold tuning on the un-modified model produces the highest profit ($37,422) despite having the lowest precision and F1. Because the business cost structure heavily penalizes false negatives, catching 92.7% of churners is worth the flood of false positives at $5 each.
2. Custom cost weights are second ($30,680). They achieve high recall (83.8%) by encoding the 36:1 cost ratio directly into the loss function.
3. SMOTE and random oversampling are mid-pack ($25,194 and $24,912). They improve recall over the default but not as effectively as threshold tuning or custom weights.
4. The default model at threshold 0.50 is worst for profit ($12,090) despite having the highest precision and F1. It optimizes for a balanced trade-off between precision and recall, not for the actual business objective.
The Key Lesson --- F1 and profit can disagree. F1 treats precision and recall as equally important. The business rarely does. Always compute the business metric. If someone tells you "the model has an F1 of 0.45," your response should be "what is the cost of a false negative vs. a false positive?"
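The disagreement is easy to see on a toy example (synthetic scores; the profit convention, charging $5 per flagged subscriber and crediting $180 per caught churner, is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

# 5 churners, 95 non-churners; 20 non-churners score just above 0.3
y_true = np.array([1] * 5 + [0] * 95)
proba = np.array([0.9, 0.6, 0.4, 0.3, 0.2] + [0.35] * 20 + [0.05] * 75)

def profit(t):
    pred = (proba >= t).astype(int)
    caught = ((pred == 1) & (y_true == 1)).sum()
    return caught * 180 - pred.sum() * 5

for t in (0.5, 0.1):
    pred = (proba >= t).astype(int)
    print(f"t={t}: F1={f1_score(y_true, pred):.3f}, profit=${profit(t)}")
# t=0.5: F1=0.571, profit=$350
# t=0.1: F1=0.333, profit=$775
```

F1 prefers the high threshold; the business prefers the low one. This is exactly the pattern in the comparison table above.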
Part 6: Hospital Readmission --- When Imbalance Meets Fairness
The hospital readmission problem introduces a dimension that the StreamFlow problem did not: fairness across demographic groups. If readmission rates vary by race, age, or insurance type, an imbalance-handling strategy that improves overall recall might still fail specific patient populations.
# Hospital readmission scenario
np.random.seed(42)
n = 4200
readmit_rate_overall = 0.22
# Readmission rates vary by insurance type
insurance = np.random.choice(
['medicare', 'medicaid', 'private', 'self_pay'], n,
p=[0.55, 0.18, 0.22, 0.05]
)
# Simulated: Medicaid patients have higher readmission rates
readmit_probs = np.where(insurance == 'medicaid', 0.31,
np.where(insurance == 'self_pay', 0.28,
np.where(insurance == 'medicare', 0.21, 0.16)))
y_readmit = np.random.binomial(1, readmit_probs)
print("Readmission rates by insurance type:")
for ins_type in ['medicare', 'medicaid', 'private', 'self_pay']:
mask = insurance == ins_type
rate = y_readmit[mask].mean()
count = mask.sum()
print(f" {ins_type:<12} {rate:.1%} ({count} patients)")
print(f" {'Overall':<12} {y_readmit.mean():.1%}")
Readmission rates by insurance type:
medicare 21.2% (2306 patients)
medicaid 31.2% (758 patients)
private 15.8% (916 patients)
self_pay 30.0% (220 patients)
Overall 22.3%
The readmission rate is 22% overall (moderate imbalance) but varies from 15.8% for private insurance to 31.2% for Medicaid. A threshold tuned on the overall population might under-flag Medicaid patients (where the base rate is higher and interventions are most needed) or over-flag private insurance patients.
Imbalance and Fairness --- When your imbalance ratio differs across protected groups, the same threshold produces different recall rates for different groups. A hospital that catches 85% of Medicare readmissions but only 70% of Medicaid readmissions is providing unequal care --- and likely violating anti-discrimination requirements. Chapter 33 covers fairness in depth. For now, the takeaway is: always disaggregate your imbalance analysis by protected attributes.
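Disaggregating is only a few lines: compute recall per group at the shared threshold. A sketch on synthetic data (the group names, base rates, and score model below are illustrative, not from the chapter's dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000
group = rng.choice(['medicaid', 'medicare', 'private'], n, p=[0.2, 0.5, 0.3])
base_rate = np.select([group == 'medicaid', group == 'medicare'], [0.31, 0.21], 0.16)
y = rng.binomial(1, base_rate)
# Toy risk scores: readmitted patients score higher on average
proba = np.clip(0.30 * y + rng.normal(0.22, 0.12, n), 0, 1)

threshold = 0.40
pred = (proba >= threshold).astype(int)
print("Recall by group at a shared threshold:")
for g in ['medicaid', 'medicare', 'private']:
    pos = (group == g) & (y == 1)
    print(f"  {g:<10} {pred[pos].mean():.3f}  ({pos.sum()} readmissions)")
```

If the per-group recalls differ materially, per-group thresholds or a fairness-aware objective (Chapter 33) are on the table.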
Part 7: Manufacturing --- Extreme Imbalance and Asymmetric Costs
The manufacturing equipment failure scenario represents the most extreme imbalance case: failure rates below 0.5%, with catastrophic cost asymmetry.
# Manufacturing scenario: rare equipment failure
np.random.seed(42)
n_readings = 100000
failure_rate = 0.004 # 0.4%
X_mfg, y_mfg = make_classification(
n_samples=n_readings, n_features=25, n_informative=8,
n_redundant=5, weights=[1 - failure_rate, failure_rate],
flip_y=0.01, random_state=42
)
print(f"Manufacturing dataset:")
print(f" Total readings: {n_readings:,}")
print(f" Failures: {y_mfg.sum()} ({y_mfg.mean():.2%})")
print(f" Normal: {n_readings - y_mfg.sum():,}")
print(f" Imbalance ratio: {(n_readings - y_mfg.sum()) / y_mfg.sum():.0f}:1")
Manufacturing dataset:
Total readings: 100,000
Failures: 429 (0.43%)
Normal: 99,571
Imbalance ratio: 232:1
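At 232:1, even the train/test split deserves care: stratify=y guarantees the test fold gets its proportional share of the 429 failures, while a plain random split leaves the count to chance. A quick check (dummy features; the counts mirror the output above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = np.zeros(100_000, dtype=int)
y[rng.choice(100_000, size=429, replace=False)] = 1   # 429 failures
X = rng.normal(size=(100_000, 3))                      # placeholder features

_, _, _, y_te_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_te_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print("failures in test, plain split:     ", int(y_te_plain.sum()))
print("failures in test, stratified split:", int(y_te_strat.sum()))
```

The stratified split always lands at 20% of 429 (85 or 86 failures); the plain split wanders, and with rarer events it can leave the test set nearly empty of positives.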
# Cost structure: FN = $500K (unplanned downtime), FP = $5K (unnecessary inspection)
fn_cost_mfg = 500000
fp_cost_mfg = 5000
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(
X_mfg, y_mfg, test_size=0.2, stratify=y_mfg, random_state=42
)
# Strategy 1: Default model
gb_mfg = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
)
gb_mfg.fit(X_tr_m, y_tr_m)
proba_mfg = gb_mfg.predict_proba(X_te_m)[:, 1]
# Strategy 2: Cost-weighted
mfg_weights = np.where(y_tr_m == 1, fn_cost_mfg / fp_cost_mfg, 1.0)
gb_mfg_wt = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
)
gb_mfg_wt.fit(X_tr_m, y_tr_m, sample_weight=mfg_weights)
proba_mfg_wt = gb_mfg_wt.predict_proba(X_te_m)[:, 1]
# Strategy 3: Threshold tuning
thresholds_mfg = np.linspace(0.001, 0.5, 1000)
profits_mfg = [
expected_profit(y_te_m, proba_mfg, t, fn_cost_mfg, fp_cost_mfg)
for t in thresholds_mfg
]
opt_threshold_mfg = thresholds_mfg[np.argmax(profits_mfg)]
print(f"Manufacturing Results:")
print(f" Cost ratio: FN=${fn_cost_mfg:,} vs FP=${fp_cost_mfg:,} "
f"({fn_cost_mfg // fp_cost_mfg}:1)")
print(f" Break-even precision: "
f"{fp_cost_mfg / (fp_cost_mfg + fn_cost_mfg):.3f}")
print()
for label, proba_vec, thresh in [
("Default (t=0.50)", proba_mfg, 0.50),
("Cost-weighted (t=0.50)", proba_mfg_wt, 0.50),
(f"Threshold tuned (t={opt_threshold_mfg:.3f})", proba_mfg, opt_threshold_mfg),
]:
y_p = (proba_vec >= thresh).astype(int)
tp = ((y_p == 1) & (y_te_m == 1)).sum()
fp = ((y_p == 1) & (y_te_m == 0)).sum()
fn = ((y_p == 0) & (y_te_m == 1)).sum()
cost = fn * fn_cost_mfg + fp * fp_cost_mfg
print(f" {label}:")
print(f" TP={tp}, FP={fp}, FN={fn}")
print(f" Recall={tp/(tp+fn):.3f}, Precision={tp/(tp+fp+1e-8):.3f}")
print(f" Total cost: ${cost:,.0f}")
print()
Manufacturing Results:
Cost ratio: FN=$500,000 vs FP=$5,000 (100:1)
Break-even precision: 0.010
Default (t=0.50):
TP=32, FP=8, FN=54
Recall=0.372, Precision=0.800
Total cost: $27,040,000
Cost-weighted (t=0.50):
TP=61, FP=187, FN=25
Recall=0.709, Precision=0.246
Total cost: $13,435,000
Threshold tuned (t=0.008):
TP=78, FP=1263, FN=8
Recall=0.907, Precision=0.058
Total cost: $10,315,000
At a 100:1 cost ratio, the threshold-tuned model with 90.7% recall saves nearly $17 million over the default model --- despite flagging 1,263 unnecessary inspections. Each unnecessary inspection costs $5,000, but each missed failure costs $500,000. The math is not subtle.
When the Break-Even Precision Is 1% --- The manufacturing break-even precision is 0.010. If the model's precision is above 1%, every alert saves money on average. This means you can tolerate a massive number of false alarms. In extreme cost-asymmetry domains (failure detection, fraud, security), the optimal threshold is often absurdly low. A model with 5% precision and 95% recall can be the right business decision.
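The break-even figure comes from the expected value of a single alert: with precision p, an alert saves fn_cost with probability p and wastes fp_cost with probability 1 - p, so break-even is p = fp_cost / (fp_cost + fn_cost):

```python
def break_even_precision(fn_cost, fp_cost):
    """Minimum precision at which acting on an alert pays off on average."""
    return fp_cost / (fp_cost + fn_cost)

print(f"manufacturing: {break_even_precision(500_000, 5_000):.3f}")  # 0.010
print(f"churn:         {break_even_precision(180, 5):.3f}")          # 0.027
```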
Part 8: Decision Framework --- Choosing Your Strategy
After seeing all the techniques, here is a practical decision framework:
Step 1: Quantify the Cost Asymmetry
Before touching any code, answer: what does a false negative cost? What does a false positive cost? If you cannot put dollar amounts on these, use relative estimates.
| Cost Ratio (FN:FP) | Severity | Primary Strategy |
|---|---|---|
| 1:1 to 3:1 | Low asymmetry | Default model, maybe class_weight |
| 3:1 to 20:1 | Moderate | Threshold tuning + class_weight |
| 20:1 to 100:1 | High | Threshold tuning + custom sample_weight |
| >100:1 | Extreme | Aggressive threshold tuning; resampling less useful |
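The table reads as a simple lookup (the boundaries are guidelines, not hard rules; the strings are shorthand for the table rows):

```python
def primary_strategy(fn_fp_ratio):
    """Map an FN:FP cost ratio to the table's recommended starting point."""
    if fn_fp_ratio <= 3:
        return "default model, maybe class_weight"
    if fn_fp_ratio <= 20:
        return "threshold tuning + class_weight"
    if fn_fp_ratio <= 100:
        return "threshold tuning + custom sample_weight"
    return "aggressive threshold tuning; resampling less useful"

print(primary_strategy(36))    # StreamFlow churn: $180 / $5
print(primary_strategy(500))   # beyond the manufacturing case
```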
Step 2: Try Threshold Tuning First
Threshold tuning requires no retraining, no new dependencies, and preserves the model's learned probability estimates. It should be your first move for any imbalanced problem.
- Train the model normally.
- Compute predicted probabilities on a validation set.
- Sweep thresholds and compute your business metric at each.
- Select the threshold that maximizes expected profit.
- Report performance on the held-out test set at that threshold.
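The five steps above, end to end, on synthetic data (the profit convention and all cost figures here are illustrative; swap in your own data and cost function):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.92, 0.08],
                           random_state=42)

# Step 1: train / validation / test split
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                            random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                          stratify=y_tmp, random_state=42)
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

def profit(y_true, proba, t, fn_cost=180, fp_cost=5):
    pred = (proba >= t).astype(int)
    caught = ((pred == 1) & (y_true == 1)).sum()
    return caught * fn_cost - pred.sum() * fp_cost  # assumed cost convention

# Steps 2-4: sweep thresholds on the validation set, keep the most profitable
grid = np.linspace(0.01, 0.99, 99)
va_proba = clf.predict_proba(X_va)[:, 1]
best_t = grid[np.argmax([profit(y_va, va_proba, t) for t in grid])]

# Step 5: report on the held-out test set at the chosen threshold
te_proba = clf.predict_proba(X_te)[:, 1]
print(f"threshold: {best_t:.2f}, test profit: ${profit(y_te, te_proba, best_t):,}")
```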
Step 3: Add Cost-Sensitive Learning If Threshold Tuning Is Insufficient
If the model's ranking quality (AUC-PR) is poor --- meaning even the optimal threshold does not produce acceptable recall --- add class weights or sample weights to improve the model's ability to distinguish the minority class.
# Decision template
from sklearn.metrics import average_precision_score
# 1. Train default model, get AUC-PR
auc_pr_default = average_precision_score(y_test, y_proba)
# 2. If AUC-PR is poor (< 2x the positive rate), try class_weight
if auc_pr_default < 2 * y_test.mean():
print("AUC-PR is weak. Try class_weight='balanced' or custom weights.")
else:
print("AUC-PR is reasonable. Threshold tuning alone may suffice.")
Step 4: Consider Resampling Only When Needed
Resampling (SMOTE, oversampling, undersampling) is most useful when:
- You are using a linear model or distance-based model (not tree-based)
- The minority class has too few examples for the model to learn meaningful patterns (fewer than ~100-200 positive examples)
- You have tried threshold tuning and class weights and AUC-PR is still poor
Resampling is less useful when:
- You are using tree-based models (they handle imbalance reasonably well with class weights)
- You have thousands of minority examples (enough signal to learn from)
- The imbalance is moderate (5-20% minority rate)
The Practitioner's Checklist
- Compute the imbalance ratio and the cost ratio.
- Establish the baseline: dummy classifier and default model at threshold 0.50.
- Compute AUC-PR (not AUC-ROC) as the ranking metric.
- Tune the threshold on a validation set using the business cost function.
- If AUC-PR is too low, add class_weight='balanced' or custom sample weights.
- If still too low, try SMOTE inside cross-validation.
- Always report disaggregated performance across subgroups (fairness check).
- Report the business metric (expected profit, expected cost) alongside ML metrics.
Part 9: Progressive Project --- Milestone M7
StreamFlow Churn Imbalance
In Milestone M6, you evaluated your churn models properly and chose the right metrics. Now you will address the 8.2% churn imbalance directly.
Task 1: Establish the Imbalance Baseline
Report the following for your best model from M6 at threshold 0.50:
- Accuracy, precision, recall, F1, AUC-PR
- Confusion matrix
- Expected profit assuming FN=$180, FP=$5
Task 2: class_weight='balanced'
Retrain with class_weight='balanced' (or equivalent sample_weight). Report the same metrics. Does recall improve? Does AUC-PR improve?
# Template for Task 2
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# Balanced weights for GradientBoosting (doesn't have class_weight parameter)
pos_weight = len(y_train) / (2 * y_train.sum())
neg_weight = len(y_train) / (2 * (len(y_train) - y_train.sum()))
weights = np.where(y_train == 1, pos_weight, neg_weight)
gb_balanced = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_balanced.fit(X_train, y_train, sample_weight=weights)
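For estimators that do expose a class_weight parameter (e.g. LogisticRegression, RandomForestClassifier), the same effect is a single argument. A sketch on toy stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for X_train / y_train, ~10% positive
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (rng.random(300) < 0.10).astype(int)

lr = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_toy, y_toy)
rf = RandomForestClassifier(class_weight='balanced', n_estimators=100,
                            random_state=42).fit(X_toy, y_toy)
print(lr.predict_proba(X_toy[:3])[:, 1])
```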
Task 3: SMOTE (Inside CV)
Apply SMOTE inside cross-validation using imblearn.pipeline.Pipeline. Compare the cross-validated AUC-PR to the class-weighted model.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
pipe = ImbPipeline([
('smote', SMOTE(random_state=42)),
('clf', GradientBoostingClassifier(
n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
))
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=skf, scoring='average_precision')
print(f"SMOTE + GB: AUC-PR = {scores.mean():.3f} +/- {scores.std():.3f}")
Task 4: Threshold Tuning on the PR Curve
Using your best model's predicted probabilities:
1. Plot the precision-recall curve.
2. Find the threshold that maximizes F1.
3. Find the threshold that maximizes expected profit (FN=$180, FP=$5).
4. Compare the two thresholds. Are they the same? Why or why not?
from sklearn.metrics import precision_recall_curve
proba = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, proba)
# F1-optimal threshold (precision_recall_curve returns one more PR point
# than thresholds, so drop the final point before taking the argmax)
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
# Profit-optimal threshold
thresholds_grid = np.linspace(0.01, 0.99, 500)
profits = [
expected_profit(y_val, proba, t, fn_cost=180, fp_cost=5)
for t in thresholds_grid
]
best_profit_threshold = thresholds_grid[np.argmax(profits)]
print(f"F1-optimal threshold: {best_f1_threshold:.3f}")
print(f"Profit-optimal threshold: {best_profit_threshold:.3f}")
Task 5: The Four-Strategy Comparison
Create a summary table comparing:
1. Baseline (default threshold)
2. class_weight='balanced'
3. SMOTE
4. Threshold tuning on the PR curve
For each, report AUC-PR, precision, recall, F1, and expected profit. Write 2-3 sentences interpreting the results. Does threshold tuning beat resampling?
Expected Finding --- In most cases, threshold tuning on a well-trained default model produces higher profit than resampling, because it directly optimizes for the business cost structure rather than trying to "balance" the data. SMOTE and class_weight improve recall but do so by sacrificing precision in ways that may not align with the cost asymmetry. The best approach is often: train a good model, tune the threshold.
Chapter Summary
This chapter covered the full toolkit for handling class imbalance:
- Class imbalance is the norm. Churn (8.2%), readmission (22%), equipment failure (0.4%) --- the event you care about is almost always the minority class. Accuracy is useless for evaluation. AUC-PR, precision, recall, and business cost metrics are what matter.
- Resampling changes the training data. Random oversampling duplicates; SMOTE interpolates; undersampling discards. All work by changing the class balance the model sees during training. SMOTE must be applied inside cross-validation folds, never before splitting.
- Cost-sensitive learning changes the loss function. class_weight='balanced' adjusts penalties by the imbalance ratio. Custom sample_weight adjusts by the actual business cost ratio. This is conceptually cleaner than resampling because it directly encodes what matters.
- Threshold tuning changes the decision boundary. It requires no retraining and often produces the best business outcomes. Tune on a validation set using the actual cost function, not F1 or accuracy.
- The cost matrix drives everything. FN=$180 and FP=$5 means you should prioritize recall over precision. FN=$500K and FP=$5K means you should almost always predict "failure." The break-even precision tells you the minimum precision needed for the model to add value.
- Fairness complicates imbalance. When the positive rate varies across demographic groups, the same threshold produces different recall for different groups. Disaggregate your analysis.
The honest truth: for most imbalanced problems in practice, the winning recipe is (1) train a good model, (2) tune the threshold, (3) add class weights if the ranking quality is poor. SMOTE and its variants are useful in specific circumstances --- small datasets, linear models, extreme imbalance --- but they are not the default answer. The default answer is: figure out what a false negative and a false positive actually cost, and optimize for that.
Next chapter: Chapter 18 --- Hyperparameter Tuning, where you will learn to systematically search for the model configuration that maximizes your chosen metric --- including the techniques from this chapter as hyperparameters to tune.