Case Study 2: TurbineTech --- Cost-Asymmetric Failure Prediction
Background
TurbineTech manufactures industrial gas turbines for power generation. Each turbine costs $12 million and operates 8,000 hours per year. When a turbine fails unexpectedly, the consequences are severe:
- Unplanned downtime: 3-5 days to source parts and repair, at $100,000 per day in lost generation revenue.
- Collateral damage: An uncontained failure can damage adjacent components, adding $100,000-$200,000 in repair costs.
- Total cost of unexpected failure: approximately $500,000.
TurbineTech offers a predictive maintenance service. Sensors on each turbine record 50+ measurements every minute: vibration spectra, temperature differentials, oil debris particle counts, pressure ratios, and exhaust gas temperatures. The data science team's job is to predict failures 48-72 hours in advance, giving operators time to schedule a controlled shutdown and repair.
A planned (predicted) shutdown and inspection costs approximately $5,000 in labor and lost generation time. The cost asymmetry is extreme:
| Outcome | Cost |
|---|---|
| True Positive (predicted failure, actually fails) | $5,000 inspection prevents $500,000 failure. Net benefit: $495,000 |
| False Positive (predicted failure, runs fine) | $5,000 unnecessary inspection |
| False Negative (missed failure) | $500,000 unplanned failure |
| True Negative (predicted healthy, is healthy) | $0 |
Cost ratio: FN:FP = 100:1. Break-even precision: $5K / ($5K + $500K) = 0.99%. Any model with precision above 1% saves money.
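The break-even arithmetic can be verified in a couple of lines:

```python
# Costs from the table above
COST_FN = 500_000  # unplanned failure
COST_FP = 5_000    # unnecessary inspection

# Break-even precision: the hit rate at which inspection spending
# exactly offsets avoided failure costs
break_even = COST_FP / (COST_FP + COST_FN)
print(f"Break-even precision: {break_even:.2%}")  # 0.99%
```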
The Data
TurbineTech has historical data from 340 turbines over 5 years. Sensor readings are aggregated into daily feature summaries (max, min, mean, trend slope for each sensor). Failures are labeled by maintenance engineers with root cause analysis.
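The daily aggregation step might look like the following sketch (the sensor name and `trend_slope` helper are illustrative; the simulation below skips this step by generating daily features directly):

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level readings for one sensor on one turbine
rng = np.random.default_rng(0)
minutes = pd.date_range("2024-01-01", periods=3 * 1440, freq="min")
readings = pd.DataFrame(
    {"vibration": rng.normal(1.0, 0.1, len(minutes))}, index=minutes
)

def trend_slope(s: pd.Series) -> float:
    """Least-squares slope of the reading against time (per minute)."""
    x = np.arange(len(s))
    return float(np.polyfit(x, s.to_numpy(), 1)[0])

# One row per day: max, min, mean, and trend slope, as described above
daily = readings["vibration"].resample("D").agg(["max", "min", "mean", trend_slope])
print(daily.shape)  # (3, 4)
```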
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
precision_score, recall_score, f1_score,
average_precision_score, precision_recall_curve,
confusion_matrix, classification_report
)
import warnings
warnings.filterwarnings('ignore')
# Simulate extreme imbalance: 0.4% failure rate
np.random.seed(42)
n_days = 200000 # 340 turbines x ~590 operating days each over the 5 years
X, y = make_classification(
n_samples=n_days, n_features=30, n_informative=10,
n_redundant=8, weights=[0.996, 0.004],
flip_y=0.005, class_sep=0.8, random_state=42
)
print(f"TurbineTech Dataset:")
print(f" Total daily records: {n_days:,}")
print(f" Failure days: {y.sum()} ({y.mean():.2%})")
print(f" Normal days: {n_days - y.sum():,}")
print(f" Imbalance ratio: {(n_days - y.sum()) / y.sum():.0f}:1")
TurbineTech Dataset:
Total daily records: 200,000
Failure days: 853 (0.43%)
Normal days: 199,147
Imbalance ratio: 233:1
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Training: {len(y_train):,} ({y_train.sum()} failures)")
print(f"Test: {len(y_test):,} ({y_test.sum()} failures)")
Training: 160,000 (682 failures)
Test: 40,000 (171 failures)
Phase 1: The Default Model Misses Everything That Matters
gb_default = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_default.fit(X_train, y_train)
proba_default = gb_default.predict_proba(X_test)[:, 1]
pred_default = (proba_default >= 0.50).astype(int)
cm = confusion_matrix(y_test, pred_default)
tn, fp, fn, tp = cm.ravel()
print("Phase 1: Default Model (threshold = 0.50)")
print(f" Accuracy: {(tp + tn) / len(y_test):.4f}")
print(f" Precision: {precision_score(y_test, pred_default, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, pred_default):.3f}")
print(f" F1: {f1_score(y_test, pred_default, zero_division=0):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_default):.3f}")
print(f"\n Confusion matrix:")
print(f" TN={tn:>6} FP={fp:>4}")
print(f" FN={fn:>6} TP={tp:>4}")
total_cost_default = fn * 500000 + fp * 5000
print(f"\n Failures caught: {tp} / {tp + fn}")
print(f" Failures missed: {fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_default:>12,}")
Phase 1: Default Model (threshold = 0.50)
Accuracy: 0.9966
Precision: 0.723
Recall: 0.199
F1: 0.312
AUC-PR: 0.355
Confusion matrix:
TN= 39816 FP= 13
FN= 137 TP= 34
Failures caught: 34 / 171
Failures missed: 137
False alarms: 13
Total cost: $ 68,565,000
The model achieves 99.66% accuracy --- and misses 137 out of 171 failures. Each missed failure costs $500,000. Total cost: $68.6 million. The 34 catches and 13 false alarms are irrelevant next to the scale of the misses.
99.66% Accuracy, $68.6 Million in Damage --- This is the most dramatic demonstration of the accuracy trap in this textbook. The model is "right" 99.66% of the time, but it is wrong about the only thing that matters. In a domain where the minority class represents catastrophic outcomes, a high-accuracy model can be the most expensive model possible.
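To see how hollow that accuracy figure is, compare it against a strawman that never predicts failure at all (a hypothetical baseline, not part of TurbineTech's pipeline):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Test-set class counts from above: 39,829 healthy days, 171 failure days
y_true = np.array([0] * 39829 + [1] * 171)
always_healthy = np.zeros_like(y_true)  # never predicts a failure

print(f"Accuracy: {accuracy_score(y_true, always_healthy):.4f}")  # 0.9957
print(f"Recall:   {recall_score(y_true, always_healthy):.4f}")    # 0.0000
# Cost: 171 * $500,000 = $85.5M -- only modestly worse than the
# default model's $68.6M, despite doing literally nothing
```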
Phase 2: Cost-Weighted Training
The team adds sample weights reflecting the 100:1 cost ratio.
# Weights based on cost ratio: FN=$500K, FP=$5K
cost_weights = np.where(y_train == 1, 100.0, 1.0)
gb_weighted = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_weighted.fit(X_train, y_train, sample_weight=cost_weights)
proba_weighted = gb_weighted.predict_proba(X_test)[:, 1]
pred_weighted = (proba_weighted >= 0.50).astype(int)
cm = confusion_matrix(y_test, pred_weighted)
tn, fp, fn, tp = cm.ravel()
total_cost_weighted = fn * 500000 + fp * 5000
print("Phase 2: Cost-Weighted Model (100:1 weights, threshold = 0.50)")
print(f" Precision: {precision_score(y_test, pred_weighted, zero_division=0):.3f}")
print(f" Recall: {recall_score(y_test, pred_weighted):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_weighted):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_weighted:>12,}")
Phase 2: Cost-Weighted Model (100:1 weights, threshold = 0.50)
Precision: 0.221
Recall: 0.713
AUC-PR: 0.372
Failures caught: 122 / 171
False alarms: 430
Total cost: $ 26,650,000
The cost-weighted model catches 122 failures (up from 34) at the cost of 430 false alarms. Total cost drops from $68.6M to $26.7M --- a $41.9M improvement. But it still misses 49 failures at $500K each.
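GradientBoostingClassifier has no class_weight parameter, which is why the team passes sample_weight to fit. For estimators that do accept it, such as the RandomForestClassifier imported earlier, the same 100:1 ratio can be declared at construction time (a sketch, not part of the team's pipeline):

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight scales each class's contribution to the split criterion,
# equivalent to a per-sample weight of 100 on failure days
rf_weighted = RandomForestClassifier(
    n_estimators=300, min_samples_leaf=20,
    class_weight={0: 1.0, 1: 100.0},
    random_state=42,
)
# rf_weighted.fit(X_train, y_train) then trains with the cost ratio baked in
```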
Phase 3: Threshold Tuning on the Default Model
Instead of changing the model, the team tunes the threshold.
# Validation split for threshold tuning
X_tr, X_val, y_tr, y_val = train_test_split(
X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)
gb_tune = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_tune.fit(X_tr, y_tr)
proba_val = gb_tune.predict_proba(X_val)[:, 1]
# Find cost-optimal threshold
thresholds = np.linspace(0.001, 0.5, 2000)
val_costs = []
for t in thresholds:
    pred_t = (proba_val >= t).astype(int)
    cm_t = confusion_matrix(y_val, pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    cost_t = fn_t * 500000 + fp_t * 5000
    val_costs.append(cost_t)
optimal_threshold = thresholds[np.argmin(val_costs)]
min_val_cost = min(val_costs)
print(f"Threshold Tuning Results (validation set):")
print(f" Optimal threshold: {optimal_threshold:.4f}")
print(f" Minimum cost on validation: ${min_val_cost:>12,}")
Threshold Tuning Results (validation set):
Optimal threshold: 0.0065
Minimum cost on validation: $ 10,255,000
# Apply to test set
proba_test_tune = gb_tune.predict_proba(X_test)[:, 1]
pred_tuned = (proba_test_tune >= optimal_threshold).astype(int)
cm = confusion_matrix(y_test, pred_tuned)
tn, fp, fn, tp = cm.ravel()
total_cost_tuned = fn * 500000 + fp * 5000
print(f"\nPhase 3: Threshold Tuning (t={optimal_threshold:.4f})")
print(f" Precision: {precision_score(y_test, pred_tuned):.3f}")
print(f" Recall: {recall_score(y_test, pred_tuned):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_test_tune):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_tuned:>12,}")
Phase 3: Threshold Tuning (t=0.0065)
Precision: 0.054
Recall: 0.912
AUC-PR: 0.349
Failures caught: 156 / 171
False alarms: 2735
Total cost: $ 21,175,000
Threshold tuning catches 156 out of 171 failures (91.2% recall) with 2,735 false alarms. The false alarms cost $13.675M, but the 15 missed failures cost only $7.5M. Total: $21.2M --- a $47.4M improvement over the default.
2,735 False Alarms and That Is Fine --- A maintenance engineer reviewing these results will initially balk at 2,735 unnecessary inspections. But each inspection costs $5,000, and each prevented failure saves $500,000. The model needs to be correct only about once in every 100 alarms to break even. It is correct roughly once in every 18 alarms (5.4% precision): each block of ~18 inspections costs about $92,000 and prevents a $500,000 failure, better than a 5x return on inspection spending.
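The grid-searched threshold is close to what decision theory predicts. For a perfectly calibrated probability estimate, the cost-optimal cutoff has a closed form, so the search is mainly compensating for the boosted model's imperfect calibration:

```python
# Alert whenever the expected failure cost exceeds the inspection cost.
# For calibrated probabilities the optimal cutoff is the break-even ratio:
COST_FN, COST_FP = 500_000, 5_000
t_star = COST_FP / (COST_FP + COST_FN)
print(f"Theoretical optimal threshold: {t_star:.4f}")  # 0.0099
# The grid search landed on 0.0065; the gap reflects miscalibrated
# probability estimates, which is typical for boosted trees
```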
Phase 4: Combining Cost Weights and Threshold Tuning
The most effective approach: use cost weights to improve the model's ranking quality, then tune the threshold to optimize the decision boundary.
# Train cost-weighted model on inner training set
cost_weights_tr = np.where(y_tr == 1, 100.0, 1.0)
gb_combined = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=4,
min_samples_leaf=20, random_state=42
)
gb_combined.fit(X_tr, y_tr, sample_weight=cost_weights_tr)
# Tune threshold on validation
proba_val_comb = gb_combined.predict_proba(X_val)[:, 1]
val_costs_comb = []
for t in thresholds:
    pred_t = (proba_val_comb >= t).astype(int)
    cm_t = confusion_matrix(y_val, pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    cost_t = fn_t * 500000 + fp_t * 5000
    val_costs_comb.append(cost_t)
optimal_threshold_comb = thresholds[np.argmin(val_costs_comb)]
# Apply to test
proba_test_comb = gb_combined.predict_proba(X_test)[:, 1]
pred_combined = (proba_test_comb >= optimal_threshold_comb).astype(int)
cm = confusion_matrix(y_test, pred_combined)
tn, fp, fn, tp = cm.ravel()
total_cost_combined = fn * 500000 + fp * 5000
print(f"Phase 4: Cost Weights + Threshold Tuning (t={optimal_threshold_comb:.4f})")
print(f" Precision: {precision_score(y_test, pred_combined):.3f}")
print(f" Recall: {recall_score(y_test, pred_combined):.3f}")
print(f" AUC-PR: {average_precision_score(y_test, proba_test_comb):.3f}")
print(f" Failures caught: {tp} / {tp + fn}")
print(f" False alarms: {fp}")
print(f" Total cost: ${total_cost_combined:>12,}")
Phase 4: Cost Weights + Threshold Tuning (t=0.0310)
Precision: 0.063
Recall: 0.936
AUC-PR: 0.372
Failures caught: 160 / 171
False alarms: 2382
Total cost: $ 17,410,000
The combined approach catches 160 out of 171 failures (93.6% recall) with fewer false alarms than threshold tuning alone (2,382 vs. 2,735). Total cost: $17.4M --- the best of all four phases. The cost weights improved the model's ability to rank failures higher, and the threshold tuning converted that improved ranking into a cost-optimal decision.
The Full Comparison
results = [
("Default (t=0.50)", 34, 13, 137, 68565000),
("Cost-weighted (t=0.50)", 122, 430, 49, 26650000),
("Threshold tuned (t=0.0065)", 156, 2735, 15, 21175000),
("Combined (t=0.0310)", 160, 2382, 11, 17410000),
]
print(f"\n{'Strategy':<30} {'TP':>4} {'FP':>5} {'FN':>4} "
f"{'Recall':>7} {'Prec':>6} {'Total Cost':>14} {'vs Default':>12}")
print("-" * 90)
default_cost = results[0][4]
for name, tp, fp, fn, cost in results:
    rec = tp / (tp + fn)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    savings = default_cost - cost
    print(f"{name:<30} {tp:>4} {fp:>5} {fn:>4} "
          f"{rec:>7.3f} {prec:>6.3f} ${cost:>13,} ${savings:>11,}")
Strategy TP FP FN Recall Prec Total Cost vs Default
------------------------------------------------------------------------------------------
Default (t=0.50) 34 13 137 0.199 0.723 $ 68,565,000 $ 0
Cost-weighted (t=0.50) 122 430 49 0.713 0.221 $ 26,650,000 $ 41,915,000
Threshold tuned (t=0.0065) 156 2735 15 0.912 0.054 $ 21,175,000 $ 47,390,000
Combined (t=0.0310) 160 2382 11 0.936 0.063 $ 17,410,000 $ 51,155,000
The combined approach saves $51.2M compared to the default --- more than enough to fund the entire predictive maintenance program for years.
Engineering Considerations
Alert Fatigue
2,382 false alarms per year across 340 turbines is roughly 7 false alarms per turbine per year. That is manageable for a maintenance team, especially when each alert comes with a predicted probability and a ranked list of the sensor readings that triggered it.
If the false alarm rate were 10x higher (70 per turbine per year), the team might ignore alerts --- "alert fatigue" --- and miss real failures. The team manages this by:
- Tiered alerting: Predictions above 0.30 probability trigger an immediate shutdown recommendation. Predictions between 0.03 and 0.30 trigger a "monitor closely" advisory. This concentrates human attention on the highest-risk cases.
- Root cause display: Each alert shows the top 5 contributing sensor readings, allowing engineers to quickly validate whether the alert is plausible.
- Weekly digest: Low-confidence alerts are batched into a weekly report for trend analysis rather than triggering individual notifications.
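The tiering rules above can be sketched as a simple routing function (the function name and tier labels are illustrative; the cutoffs are the ones quoted):

```python
def route_alert(p: float) -> str:
    """Map a predicted failure probability to an alert tier."""
    if p >= 0.30:
        return "immediate shutdown recommendation"
    if p >= 0.03:
        return "monitor closely advisory"
    return "weekly digest"

print(route_alert(0.45))   # immediate shutdown recommendation
print(route_alert(0.10))   # monitor closely advisory
print(route_alert(0.005))  # weekly digest
```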
Temporal Considerations
This case study used a random train-test split for simplicity. In production, TurbineTech uses a temporal split: the model is trained on historical data and evaluated on strictly later data. This prevents temporal leakage --- sensor degradation patterns evolve over time, and a random split lets information from the future bleed into training, so the model looks better in evaluation than it will perform on genuinely new data.
# Production-style temporal split (conceptual)
# train: first 4 years of data
# validation: months 49-54 (threshold tuning)
# test: months 55-60 (final evaluation)
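A minimal version of that split, assuming each row carries a day_index counting days since the start of data collection (an assumption --- the simulated dataset above has no time axis):

```python
import numpy as np

def temporal_split(X, y, day_index):
    """Chronological train/validation/test split (no shuffling)."""
    train = day_index < 4 * 365                                   # first 4 years
    val = (day_index >= 4 * 365) & (day_index < 4 * 365 + 183)    # months 49-54
    test = day_index >= 4 * 365 + 183                             # months 55-60
    return ((X[train], y[train]), (X[val], y[val]), (X[test], y[test]))
```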
The Cost of Missed Failures Over Time
# Annual impact analysis
# 200,000 turbine-days over 5 years is ~40,000 turbine-days per year,
# which is the same size as the test set. Annualized results therefore
# mirror the test-set confusion matrices directly.
annual_records = n_days // 5
expected_annual_failures = int(round(y.mean() * annual_records))
print("Annual Impact Analysis:")
print(f" Expected failures/year: ~{expected_annual_failures}")
for name, tp_n, fn_n, fp_n in [
    ("Default", 34, 137, 13),
    ("Combined", 160, 11, 2382),
]:
    annual_cost = fn_n * 500000 + fp_n * 5000
    print(f"\n {name}:")
    print(f"  Failures caught: {tp_n}/{tp_n + fn_n}")
    print(f"  Failures missed: {fn_n}")
    print(f"  Estimated annual cost: ${annual_cost:>12,}")
Annual Impact Analysis:
 Expected failures/year: ~171
 Default:
  Failures caught: 34/171
  Failures missed: 137
  Estimated annual cost: $  68,565,000
 Combined:
  Failures caught: 160/171
  Failures missed: 11
  Estimated annual cost: $  17,410,000
The combined model reduces annual failure costs from roughly $69M to $17M --- a saving of about $51M per year. At this scale, even a single point of recall (one or two additional failures caught per year) is worth close to $1M annually.
Lessons from TurbineTech
1. Extreme Imbalance Demands Extreme Thresholds
At 233:1 imbalance and a 100:1 cost ratio, the cost-optimal threshold sits far below the 0.50 default: 0.0065 for the threshold-tuned model and 0.031 for the combined model. The model flags anything with even a faint signal of failure. This is correct when the cost of missing a failure is 100x the cost of a false alarm.
2. Combine Cost Weights with Threshold Tuning
Cost weights improve the model's ranking of failures vs. normal operations. Threshold tuning converts that ranking into a decision optimized for the cost structure. Neither alone is as effective as both together.
3. Precision Is Nearly Irrelevant
At a 1% break-even precision, the model could be "wrong" 99 out of 100 times and still save money. In practice, even the aggressive threshold achieves 6.3% precision --- 6x the break-even. Reporting precision to stakeholders in this domain is misleading; report cost savings and failure capture rate instead.
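A small reporting helper can surface the two numbers that do matter (a sketch; the function name is illustrative):

```python
def stakeholder_summary(tp, fp, fn, cost_fn=500_000, cost_fp=5_000):
    """Report failure capture rate and dollars, not precision."""
    return {
        "failure_capture_rate": tp / (tp + fn),
        "total_cost": fn * cost_fn + fp * cost_fp,
    }

# Phase 4 test-set counts from above
summary = stakeholder_summary(tp=160, fp=2382, fn=11)
print(f"Capture rate: {summary['failure_capture_rate']:.1%}")  # 93.6%
print(f"Total cost:   ${summary['total_cost']:,}")             # $17,410,000
```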
4. Manage the Human System, Not Just the Model
The best model is useless if maintenance engineers ignore its alerts. Alert tiering, root cause explanations, and managed notification cadence are engineering problems that matter as much as the ML problem.
5. Never Evaluate with Accuracy
The default model's 99.66% accuracy would earn praise in a slide deck and lose $68.6M in the real world. For extreme imbalance, accuracy is not just useless --- it is actively dangerous, because it creates false confidence in a catastrophically underperforming system.
Summary Table
| Metric | Default | Cost-Weighted | Threshold Tuned | Combined |
|---|---|---|---|---|
| Threshold | 0.50 | 0.50 | 0.0065 | 0.0310 |
| Recall | 0.199 | 0.713 | 0.912 | 0.936 |
| Precision | 0.723 | 0.221 | 0.054 | 0.063 |
| False alarms | 13 | 430 | 2,735 | 2,382 |
| Missed failures | 137 | 49 | 15 | 11 |
| Total cost | $68.6M | $26.7M | $21.2M | $17.4M |
| Savings vs. default | --- | $41.9M | $47.4M | $51.2M |
This case study supports Chapter 17: Class Imbalance and Cost-Sensitive Learning. Return to the chapter or review Case Study 1: StreamFlow.