Case Study 2: TurbineTech --- Cost-Asymmetric Failure Prediction
Background
TurbineTech manufactures industrial gas turbines for power generation. Each turbine costs $12 million and operates 8,000 hours per year. When a turbine fails unexpectedly, the consequences are severe:
- Unplanned downtime: 3-5 days to source parts and repair, at $100,000 per day in lost generation revenue.
- Collateral damage: An uncontained failure can damage adjacent components, adding $100,000-$200,000 in repair costs.
- Total cost of unexpected failure: approximately $500,000.
TurbineTech offers a predictive maintenance service. Sensors on each turbine record 50+ measurements every minute: vibration spectra, temperature differentials, oil debris particle counts, pressure ratios, and exhaust gas temperatures. The data science team's job is to predict failures 48-72 hours in advance, giving operators time to schedule a controlled shutdown and repair.
A planned (predicted) shutdown and inspection costs approximately $5,000 in labor and lost generation time. The cost asymmetry is extreme:
| Outcome | Cost |
|---|---|
| True Positive (predicted failure, actually fails) | $5,000 inspection prevents $500,000 failure. Net benefit: $495,000 |
| False Positive (predicted failure, runs fine) | $5,000 unnecessary inspection |
| False Negative (missed failure) | $500,000 unplanned failure |
| True Negative (predicted healthy, is healthy) | $0 |
Cost ratio: FN:FP = 100:1. Break-even precision: $5K / ($5K + $500K) = 0.99%. Any model with precision above 1% saves money.
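The break-even arithmetic can be verified in a couple of lines:

```python
# Costs from the table above
COST_FN = 500_000  # unplanned failure
COST_FP = 5_000    # unnecessary inspection

# Break-even precision: the hit rate at which inspection spending
# exactly offsets avoided failure costs
break_even = COST_FP / (COST_FP + COST_FN)
print(f"Break-even precision: {break_even:.2%}")  # 0.99%
```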
The Data
TurbineTech has historical data from 340 turbines over 5 years. Sensor readings are aggregated into daily feature summaries (max, min, mean, trend slope for each sensor). Failures are labeled by maintenance engineers with root cause analysis.
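The daily aggregation step might look like the following sketch (the sensor name and `trend_slope` helper are illustrative; the simulation below skips this step by generating daily features directly):

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level readings for one sensor on one turbine
rng = np.random.default_rng(0)
minutes = pd.date_range("2024-01-01", periods=3 * 1440, freq="min")
readings = pd.DataFrame(
    {"vibration": rng.normal(1.0, 0.1, len(minutes))}, index=minutes
)

def trend_slope(s: pd.Series) -> float:
    """Least-squares slope of the reading against time (per minute)."""
    x = np.arange(len(s))
    return float(np.polyfit(x, s.to_numpy(), 1)[0])

# One row per day: max, min, mean, and trend slope, as described above
daily = readings["vibration"].resample("D").agg(["max", "min", "mean", trend_slope])
print(daily.shape)  # (3, 4)
```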
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
precision_score, recall_score, f1_score,
average_precision_score, precision_recall_curve,
confusion_matrix, classification_report
)
import warnings
warnings.filterwarnings('ignore')
# Simulate extreme imbalance: 0.4% failure rate
np.random.seed(42)
n_days = 200000 # 340 turbines x ~590 operating days each over the 5 years
X, y = make_classification(
n_samples=n_days, n_features=30, n_informative=10,
n_redundant=8, weights=[0.996, 0.004],
flip_y=0.005, class_sep=0.8, random_state=42
)
print(f"TurbineTech Dataset:")
print(f" Total daily records: {n_days:,}")
print(f" Failure days: {y.sum()} ({y.mean():.2%})")
print(f" Normal days: {n_days - y.sum():,}")
print(f" Imbalance ratio: {(n_days - y.sum()) / y.sum():.0f}:1")
TurbineTech Dataset:
Total daily records: 200,000
Failure days: 853 (0.43%)
Normal days: 199,147
Imbalance ratio: 233:1
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Training: {len(y_train):,} ({y_train.sum()} failures)")
print(f"Test: {len(y_test):,} ({y_test.sum()} failures)")
Training: 160,000 (682 failures)
Test: 40,000 (171 failures)
Phase 1: The Default Model Misses Everything That Matters
gb_default = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_default.fit(X_train, y_train)
proba_default = gb_default.predict_proba(X_test)[:, 1]
pred_default = (proba_default >= 0.50).astype(int)
cm = confusion_matrix(y_test, pred_default)
tn, fp, fn, tp = cm.ravel()
print("Phase 1: Default Model (threshold = 0.50)")
print(f" Accuracy: {(tp + tn) / len(y_test):.4f}")
print(f" Precision: {precision_score(y_test, pred_default, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, pred_default):.3f}")
print(f" F1: {f1_score(y_test, pred_default, zero_division=0):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_default):.3f}")
print(f"\n Confusion matrix:")
print(f" TN={tn:>6} FP={fp:>4}")
print(f" FN={fn:>6} TP={tp:>4}")
total_cost_default = fn * 500000 + fp * 5000
print(f"\n Failures caught: {tp} / {tp + fn}")
print(f" Failures missed: {fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_default:>12,}")
Phase 1: Default Model (threshold = 0.50)
Accuracy: 0.9966
Precision: 0.723
Recall: 0.199
F1: 0.312
AUC-PR: 0.355
Confusion matrix:
TN= 39816 FP= 13
FN= 137 TP= 34
Failures caught: 34 / 171
Failures missed: 137
False alarms: 13
Total cost: $ 68,565,000
The model achieves 99.66% accuracy --- and misses 137 out of 171 failures. Each missed failure costs $500,000. Total cost: $68.6 million. The 34 catches and 13 false alarms are irrelevant next to the scale of the misses.
99.66% Accuracy, $68.6 Million in Damage --- This is the most dramatic demonstration of the accuracy trap in this textbook. The model is "right" 99.66% of the time, but it is wrong about the only thing that matters. In a domain where the minority class represents catastrophic outcomes, a high-accuracy model can be the most expensive model possible.
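To see how hollow that accuracy figure is, compare it against a strawman that never predicts failure at all (a hypothetical baseline, not part of TurbineTech's pipeline):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Test-set class counts from above: 39,829 healthy days, 171 failure days
y_true = np.array([0] * 39829 + [1] * 171)
always_healthy = np.zeros_like(y_true)  # never predicts a failure

print(f"Accuracy: {accuracy_score(y_true, always_healthy):.4f}")  # 0.9957
print(f"Recall:   {recall_score(y_true, always_healthy):.4f}")    # 0.0000
# Cost: 171 * $500,000 = $85.5M -- only modestly worse than the
# default model's $68.6M, despite doing literally nothing
```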
Phase 2: Cost-Weighted Training
The team adds sample weights reflecting the 100:1 cost ratio.
# Weights based on cost ratio: FN=$500K, FP=$5K
cost_weights = np.where(y_train == 1, 100.0, 1.0)
gb_weighted = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_weighted.fit(X_train, y_train, sample_weight=cost_weights)
proba_weighted = gb_weighted.predict_proba(X_test)[:, 1]
pred_weighted = (proba_weighted >= 0.50).astype(int)
cm = confusion_matrix(y_test, pred_weighted)
tn, fp, fn, tp = cm.ravel()
total_cost_weighted = fn * 500000 + fp * 5000
print("Phase 2: Cost-Weighted Model (100:1 weights, threshold = 0.50)")
print(f" Precision: {precision_score(y_test, pred_weighted, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, pred_weighted):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_weighted):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_weighted:>12,}")
Phase 2: Cost-Weighted Model (100:1 weights, threshold = 0.50)
Precision: 0.221
Recall: 0.713
AUC-PR: 0.372
Failures caught: 122 / 171
False alarms: 430
Total cost: $ 26,650,000
The cost-weighted model catches 122 failures (up from 34) at the cost of 430 false alarms. Total cost drops from $68.6M to $26.7M --- a $41.9M improvement. But it still misses 49 failures at $500K each.
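GradientBoostingClassifier has no class_weight parameter, which is why the team passes sample_weight to fit. For estimators that do accept it, such as the RandomForestClassifier imported earlier, the same 100:1 ratio can be declared at construction time (a sketch, not part of the team's pipeline):

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight scales each class's contribution to the split criterion,
# equivalent to a per-sample weight of 100 on failure days
rf_weighted = RandomForestClassifier(
    n_estimators=300, min_samples_leaf=20,
    class_weight={0: 1.0, 1: 100.0},
    random_state=42,
)
# rf_weighted.fit(X_train, y_train) then trains with the cost ratio baked in
```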
Phase 3: Threshold Tuning on the Default Model
Instead of changing the model, the team tunes the threshold.
# Validation split for threshold tuning
X_tr, X_val, y_tr, y_val = train_test_split(
X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)
gb_tune = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_tune.fit(X_tr, y_tr)
proba_val = gb_tune.predict_proba(X_val)[:, 1]
# Find cost-optimal threshold
thresholds = np.linspace(0.001, 0.5, 2000)
val_costs = []
for t in thresholds:
    pred_t = (proba_val >= t).astype(int)
    cm_t = confusion_matrix(y_val, pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    cost_t = fn_t * 500000 + fp_t * 5000
    val_costs.append(cost_t)
optimal_threshold = thresholds[np.argmin(val_costs)]
min_val_cost = min(val_costs)
print(f"Threshold Tuning Results (validation set):")
print(f" Optimal threshold: {optimal_threshold:.4f}")
print(f" Minimum cost on validation: ${min_val_cost:>12,}")
Threshold Tuning Results (validation set):
Optimal threshold: 0.0065
Minimum cost on validation: $ 10,255,000
# Apply to test set
proba_test_tune = gb_tune.predict_proba(X_test)[:, 1]
pred_tuned = (proba_test_tune >= optimal_threshold).astype(int)
cm = confusion_matrix(y_test, pred_tuned)
tn, fp, fn, tp = cm.ravel()
total_cost_tuned = fn * 500000 + fp * 5000
print(f"\nPhase 3: Threshold Tuning (t={optimal_threshold:.4f})")
print(f" Precision: {precision_score(y_test, pred_tuned):.3f}")
print(f" Recall: {recall_score(y_test, pred_tuned):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_test_tune):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_tuned:>12,}")
Phase 3: Threshold Tuning (t=0.0065)
Precision: 0.054
Recall: 0.912
AUC-PR: 0.349
Failures caught: 156 / 171
False alarms: 2735
Total cost: $ 21,175,000
Threshold tuning catches 156 out of 171 failures (91.2% recall) with 2,735 false alarms. The false alarms cost $13.675M, but the 15 missed failures cost only $7.5M. Total: $21.2M --- a $47.4M improvement over the default.
2,735 False Alarms and That Is Fine --- A maintenance engineer reviewing these results will initially balk at 2,735 unnecessary inspections. But each inspection costs $5,000, and each prevented failure saves $500,000. The model needs to be correct only about once in every 100 alarms to break even. It is correct roughly once in every 18 alarms (5.4% precision): each block of ~18 inspections costs about $92,000 and prevents a $500,000 failure, better than a 5x return on inspection spending.
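The grid-searched threshold is close to what decision theory predicts. For a perfectly calibrated probability estimate, the cost-optimal cutoff has a closed form, so the search is mainly compensating for the boosted model's imperfect calibration:

```python
# Alert whenever the expected failure cost exceeds the inspection cost.
# For calibrated probabilities the optimal cutoff is the break-even ratio:
COST_FN, COST_FP = 500_000, 5_000
t_star = COST_FP / (COST_FP + COST_FN)
print(f"Theoretical optimal threshold: {t_star:.4f}")  # 0.0099
# The grid search landed on 0.0065; the gap reflects miscalibrated
# probability estimates, which is typical for boosted trees
```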
Phase 4: Combining Cost Weights and Threshold Tuning
The most effective approach: use cost weights to improve the model's ranking quality, then tune the threshold to optimize the decision boundary.
# Train cost-weighted model on inner training set
cost_weights_tr = np.where(y_tr == 1, 100.0, 1.0)
gb_combined = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_combined.fit(X_tr, y_tr, sample_weight=cost_weights_tr)
# Tune threshold on validation
proba_val_comb = gb_combined.predict_proba(X_val)[:, 1]
val_costs_comb = []
for t in thresholds:
    pred_t = (proba_val_comb >= t).astype(int)
    cm_t = confusion_matrix(y_val, pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    cost_t = fn_t * 500000 + fp_t * 5000
    val_costs_comb.append(cost_t)
optimal_threshold_comb = thresholds[np.argmin(val_costs_comb)]
# Apply to test
proba_test_comb = gb_combined.predict_proba(X_test)[:, 1]
pred_combined = (proba_test_comb >= optimal_threshold_comb).astype(int)
cm = confusion_matrix(y_test, pred_combined)
tn, fp, fn, tp = cm.ravel()
total_cost_combined = fn * 500000 + fp * 5000
print(f"Phase 4: Cost Weights + Threshold Tuning (t={optimal_threshold_comb:.4f})")
print(f" Precision: {precision_score(y_test, pred_combined):.3f}")
print(f" Recall: {recall_score(y_test, pred_combined):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_test_comb):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_combined:>12,}")
Phase 4: Cost Weights + Threshold Tuning (t=0.0310)
Precision: 0.063
Recall: 0.936
AUC-PR: 0.372
Failures caught: 160 / 171
False alarms: 2382
Total cost: $ 17,410,000
The combined approach catches 160 out of 171 failures (93.6% recall) with fewer false alarms than threshold tuning alone (2,382 vs. 2,735). Total cost: $17.4M --- the best of all four phases. The cost weights improved the model's ability to rank failures higher, and the threshold tuning converted that improved ranking into a cost-optimal decision.
The Full Comparison
results = [
("Default (t=0.50)", 34, 13, 137, 68565000),
("Cost-weighted (t=0.50)", 122, 430, 49, 26650000),
("Threshold tuned (t=0.0065)", 156, 2735, 15, 21175000),
("Combined (t=0.0310)", 160, 2382, 11, 17410000),
]
print(f"\n{'Strategy':<30} {'TP':>4} {'FP':>5} {'FN':>4} "
f"{'Recall':>7} {'Prec':>6} {'Total Cost':>14} {'vs Default':>12}")
print("-" * 90)
default_cost = results[0][4]
for name, tp, fp, fn, cost in results:
    rec = tp / (tp + fn)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    savings = default_cost - cost
    print(f"{name:<30} {tp:>4} {fp:>5} {fn:>4} "
          f"{rec:>7.3f} {prec:>6.3f} ${cost:>13,} ${savings:>11,}")
Strategy TP FP FN Recall Prec Total Cost vs Default
------------------------------------------------------------------------------------------
Default (t=0.50) 34 13 137 0.199 0.723 $ 68,565,000 $ 0
Cost-weighted (t=0.50) 122 430 49 0.713 0.221 $ 26,650,000 $ 41,915,000
Threshold tuned (t=0.0065) 156 2735 15 0.912 0.054 $ 21,175,000 $ 47,390,000
Combined (t=0.0310) 160 2382 11 0.936 0.063 $ 17,410,000 $ 51,155,000
The combined approach saves $51.2M compared to the default --- more than enough to fund the entire predictive maintenance program for years.
Engineering Considerations
Alert Fatigue
2,382 false alarms per year across 340 turbines is roughly 7 false alarms per turbine per year. That is manageable for a maintenance team, especially when each alert comes with a predicted probability and a ranked list of the sensor readings that triggered it.
If the false alarm rate were 10x higher (70 per turbine per year), the team might ignore alerts --- "alert fatigue" --- and miss real failures. The team manages this by:
- Tiered alerting: Predictions above 0.30 probability trigger an immediate shutdown recommendation. Predictions between 0.03 and 0.30 trigger a "monitor closely" advisory. This concentrates human attention on the highest-risk cases.
- Root cause display: Each alert shows the top 5 contributing sensor readings, allowing engineers to quickly validate whether the alert is plausible.
- Weekly digest: Low-confidence alerts are batched into a weekly report for trend analysis rather than triggering individual notifications.
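The tiering rules above can be sketched as a simple routing function (the function name and tier labels are illustrative; the cutoffs are the ones quoted):

```python
def route_alert(p: float) -> str:
    """Map a predicted failure probability to an alert tier."""
    if p >= 0.30:
        return "immediate shutdown recommendation"
    if p >= 0.03:
        return "monitor closely advisory"
    return "weekly digest"

print(route_alert(0.45))   # immediate shutdown recommendation
print(route_alert(0.10))   # monitor closely advisory
print(route_alert(0.005))  # weekly digest
```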
Temporal Considerations
This case study used a random train-test split for simplicity. In production, TurbineTech uses a temporal split: the model is trained on historical data and evaluated on strictly later data. This prevents temporal leakage --- sensor degradation patterns evolve over time, and a random split lets information from the future bleed into training, so the model looks better in evaluation than it will perform on genuinely new data.
# Production-style temporal split (conceptual)
# train: first 4 years of data
# validation: months 49-54 (threshold tuning)
# test: months 55-60 (final evaluation)
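A minimal version of that split, assuming each row carries a day_index counting days since the start of data collection (an assumption --- the simulated dataset above has no time axis):

```python
import numpy as np

def temporal_split(X, y, day_index):
    """Chronological train/validation/test split (no shuffling)."""
    train = day_index < 4 * 365                                   # first 4 years
    val = (day_index >= 4 * 365) & (day_index < 4 * 365 + 183)    # months 49-54
    test = day_index >= 4 * 365 + 183                             # months 55-60
    return ((X[train], y[train]), (X[val], y[val]), (X[test], y[test]))
```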
The Cost of Missed Failures Over Time
# Annual impact analysis
# 200,000 turbine-days over 5 years is ~40,000 turbine-days per year,
# which is the same size as the test set. Annualized results therefore
# mirror the test-set confusion matrices directly.
annual_records = n_days // 5
expected_annual_failures = int(round(y.mean() * annual_records))
print("Annual Impact Analysis:")
print(f" Expected failures/year: ~{expected_annual_failures}")
for name, tp_n, fn_n, fp_n in [
    ("Default", 34, 137, 13),
    ("Combined", 160, 11, 2382),
]:
    annual_cost = fn_n * 500000 + fp_n * 5000
    print(f"\n {name}:")
    print(f"  Failures caught: {tp_n}/{tp_n + fn_n}")
    print(f"  Failures missed: {fn_n}")
    print(f"  Estimated annual cost: ${annual_cost:>12,}")
Annual Impact Analysis:
 Expected failures/year: ~171
 Default:
  Failures caught: 34/171
  Failures missed: 137
  Estimated annual cost: $  68,565,000
 Combined:
  Failures caught: 160/171
  Failures missed: 11
  Estimated annual cost: $  17,410,000
The combined model reduces annual failure costs from roughly $69M to $17M --- a saving of about $51M per year. At this scale, even a single point of recall (one or two additional failures caught per year) is worth close to $1M annually.
Lessons from TurbineTech
1. Extreme Imbalance Demands Extreme Thresholds
At 233:1 imbalance and a 100:1 cost ratio, the cost-optimal threshold sits far below the 0.50 default: 0.0065 for the threshold-tuned model and 0.031 for the combined model. The model flags anything with even a faint signal of failure. This is correct when the cost of missing a failure is 100x the cost of a false alarm.
2. Combine Cost Weights with Threshold Tuning
Cost weights improve the model's ranking of failures vs. normal operations. Threshold tuning converts that ranking into a decision optimized for the cost structure. Neither alone is as effective as both together.
3. Precision Is Nearly Irrelevant
At a 1% break-even precision, the model could be "wrong" 99 out of 100 times and still save money. In practice, even the aggressive threshold achieves 6.3% precision --- 6x the break-even. Reporting precision to stakeholders in this domain is misleading; report cost savings and failure capture rate instead.
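A small reporting helper can surface the two numbers that do matter (a sketch; the function name is illustrative):

```python
def stakeholder_summary(tp, fp, fn, cost_fn=500_000, cost_fp=5_000):
    """Report failure capture rate and dollars, not precision."""
    return {
        "failure_capture_rate": tp / (tp + fn),
        "total_cost": fn * cost_fn + fp * cost_fp,
    }

# Phase 4 test-set counts from above
summary = stakeholder_summary(tp=160, fp=2382, fn=11)
print(f"Capture rate: {summary['failure_capture_rate']:.1%}")  # 93.6%
print(f"Total cost:   ${summary['total_cost']:,}")             # $17,410,000
```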
4. Manage the Human System, Not Just the Model
The best model is useless if maintenance engineers ignore its alerts. Alert tiering, root cause explanations, and managed notification cadence are engineering problems that matter as much as the ML problem.
5. Never Evaluate with Accuracy
The default model's 99.66% accuracy would earn praise in a slide deck and lose $68.6M in the real world. For extreme imbalance, accuracy is not just useless --- it is actively dangerous, because it creates false confidence in a catastrophically underperforming system.
Summary Table
| Metric | Default | Cost-Weighted | Threshold Tuned | Combined |
|---|---|---|---|---|
| Threshold | 0.50 | 0.50 | 0.0065 | 0.0310 |
| Recall | 0.199 | 0.713 | 0.912 | 0.936 |
| Precision | 0.723 | 0.221 | 0.054 | 0.063 |
| False alarms | 13 | 430 | 2,735 | 2,382 |
| Missed failures | 137 | 49 | 15 | 11 |
| Total cost | $68.6M | $26.7M | $21.2M | $17.4M |
| Savings vs. default | --- | $41.9M | $47.4M | $51.2M |
This case study supports Chapter 17: Class Imbalance and Cost-Sensitive Learning. Return to the chapter or review Case Study 1: StreamFlow.