Case Study 2: TurbineTech Predictive Maintenance --- End-to-End
Background
TurbineTech operates 240 wind turbines across three sites in the American Midwest. Each turbine has 14 sensors monitoring vibration, temperature, rotor speed, oil pressure, generator output, and ambient conditions. Sensor readings are collected every 10 minutes. A single turbine failure costs an average of $185,000 in unplanned downtime, emergency crew deployment, and replacement parts. Planned maintenance during scheduled downtime costs $12,000.
The head of maintenance, Carlos Gutierrez, wants a predictive maintenance system: a model that predicts which turbines are likely to fail within the next 14 days, so his team can schedule preventive repairs during planned downtime windows (every turbine has a 48-hour maintenance window every 6 weeks).
This case study walks through the complete system --- from business question to deployed model with monitoring --- highlighting how each capstone component differs from the StreamFlow churn system.
Component 1: Business Question
StreamFlow: "Which subscribers will churn in 60 days?"
TurbineTech: "Which turbines will fail in 14 days?"
The structure is the same. The constraints are different:
| Dimension | StreamFlow | TurbineTech |
|---|---|---|
| Prediction target | Binary (churn/retain) | Binary (failure/no failure) |
| Prediction window | 60 days | 14 days |
| Prediction consumer | Customer success team | Maintenance scheduling system |
| Cost of false negative | ~$180 (lost subscriber) | ~$185,000 (unplanned downtime) |
| Cost of false positive | ~$12 (unnecessary outreach) | ~$12,000 (unnecessary maintenance) |
| Class balance | 12% positive (churn) | 2% positive (failure) |
| Decision cadence | Weekly (batch list) | Daily (maintenance schedule) |
The cost asymmetry is the defining characteristic. A missed failure is 15 times more expensive than a false alarm. This drives every downstream decision: threshold, evaluation metric, monitoring priority.
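The asymmetry can be made concrete with the classical decision-theoretic threshold: for a well-calibrated classifier, intervening is worthwhile once p × fn_cost exceeds fp_cost, i.e. above p = fp_cost / (fp_cost + fn_cost). A minimal sketch using the dollar figures from the table (the calibration assumption is an idealization; in practice the threshold is tuned on validation data):

```python
# Cost-based break-even probability for a calibrated classifier.
# Intervene when p * fn_cost > fp_cost, i.e. p > fp_cost / (fp_cost + fn_cost).
fn_cost = 185_000  # missed failure: unplanned downtime
fp_cost = 12_000   # false alarm: unnecessary planned maintenance

break_even_p = fp_cost / (fp_cost + fn_cost)
print(f"Cost ratio (FN/FP): {fn_cost / fp_cost:.1f}x")       # ~15.4x
print(f"Break-even probability threshold: {break_even_p:.3f}")  # ~0.061
```

Even this back-of-the-envelope number shows the default 0.50 threshold is far too conservative for this cost structure.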
Component 2: Data Extraction
StreamFlow extracted tabular subscriber data from a SQL database. TurbineTech extracts time-series sensor data from a time-series database (InfluxDB, TimescaleDB, or similar).
import numpy as np
import pandas as pd
np.random.seed(42)
# Simulated sensor data for one turbine over 90 days
# In production, this query hits the time-series database.
n_readings = 90 * 24 * 6 # 10-minute intervals for 90 days
timestamps = pd.date_range("2025-01-01", periods=n_readings, freq="10min")
sensor_data = pd.DataFrame({
"timestamp": timestamps,
"turbine_id": "T-042",
"vibration_mm_s": np.random.normal(2.8, 0.4, n_readings),
"bearing_temp_c": np.random.normal(65, 5, n_readings),
"rotor_speed_rpm": np.random.normal(1800, 100, n_readings),
"oil_pressure_bar": np.random.normal(4.2, 0.3, n_readings),
"generator_output_kw": np.random.normal(1200, 150, n_readings),
"ambient_temp_c": np.random.normal(5, 10, n_readings), # winter in the Midwest
"wind_speed_m_s": np.abs(np.random.normal(8, 3, n_readings)),
"humidity_pct": np.random.uniform(30, 90, n_readings),
})
# Simulate a degradation pattern: vibration increases in the 30 days before failure
failure_date = pd.Timestamp("2025-03-15")
days_to_failure = (failure_date - sensor_data["timestamp"]).dt.total_seconds() / 86400
degradation_mask = (days_to_failure > 0) & (days_to_failure < 30)
degradation_signal = 0.05 * (30 - days_to_failure[degradation_mask])  # linear ramp toward failure
sensor_data.loc[degradation_mask, "vibration_mm_s"] += degradation_signal  # index-aligned add
print(f"Sensor readings: {len(sensor_data):,}")
print(f"Date range: {sensor_data['timestamp'].min()} to {sensor_data['timestamp'].max()}")
print(f"Columns: {list(sensor_data.columns)}")
The Key Difference: Time-Series to Tabular
The model cannot consume raw 10-minute sensor readings. It needs tabular features that summarize the recent history of each turbine. This is the critical feature engineering step that has no equivalent in the StreamFlow system.
Component 3: Feature Engineering
StreamFlow's features were scalar: sessions_last_30d, monthly_charges. TurbineTech's features must capture temporal patterns: trends, variability, change points.
def engineer_turbine_features(
sensor_df: pd.DataFrame,
reference_date: pd.Timestamp,
windows: tuple = (1, 3, 7, 14),  # days
) -> dict:
"""
Convert time-series sensor data into tabular features for a single turbine.
Three feature families:
1. Rolling statistics (mean, std, min, max) over multiple windows
2. Trend features (slope of linear fit over the window)
3. Change-point features (ratio of recent window to historical baseline)
"""
features = {"turbine_id": sensor_df["turbine_id"].iloc[0]}
sensor_cols = [
"vibration_mm_s", "bearing_temp_c", "rotor_speed_rpm",
"oil_pressure_bar", "generator_output_kw",
]
# Historical baseline: 60-90 days ago
baseline_mask = (
(reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400
).between(60, 90)
baseline = sensor_df[baseline_mask]
for window in windows:
window_mask = (
(reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400
).between(0, window)
window_data = sensor_df[window_mask]
for col in sensor_cols:
prefix = f"{col}_{window}d"
if len(window_data) > 0:
features[f"{prefix}_mean"] = window_data[col].mean()
features[f"{prefix}_std"] = window_data[col].std()
features[f"{prefix}_min"] = window_data[col].min()
features[f"{prefix}_max"] = window_data[col].max()
# Trend: slope of linear regression over the window
if len(window_data) > 10:
x = np.arange(len(window_data))
slope = np.polyfit(x, window_data[col].values, 1)[0]
features[f"{prefix}_slope"] = slope
else:
features[f"{prefix}_slope"] = 0.0
# Change point: ratio of window mean to baseline mean
if len(baseline) > 0 and baseline[col].mean() != 0:
features[f"{prefix}_vs_baseline"] = (
window_data[col].mean() / baseline[col].mean()
)
else:
features[f"{prefix}_vs_baseline"] = 1.0
else:
for suffix in ["mean", "std", "min", "max", "slope", "vs_baseline"]:
features[f"{prefix}_{suffix}"] = np.nan
# Ambient features (not sensor health, but operating conditions)
recent = sensor_df[
((reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400).between(0, 7)
]
if len(recent) > 0:
features["ambient_temp_7d_mean"] = recent["ambient_temp_c"].mean()
features["wind_speed_7d_mean"] = recent["wind_speed_m_s"].mean()
features["humidity_7d_mean"] = recent["humidity_pct"].mean()
return features
# Engineer features for T-042 as of March 1
reference = pd.Timestamp("2025-03-01")
features = engineer_turbine_features(sensor_data, reference)
print(f"Engineered {len(features)} features for turbine {features['turbine_id']}")
print(f"Sample features:")
for key in list(features.keys())[:10]:
print(f" {key}: {features[key]:.4f}" if isinstance(features[key], float) else f" {key}: {features[key]}")
Domain Knowledge --- The three feature families (rolling statistics, trends, change points) come from conversations with the maintenance team. Carlos Gutierrez told the data science team: "When a bearing is going bad, vibration does not jump overnight. It creeps up over two to three weeks, and the standard deviation increases because the vibration becomes irregular." That single piece of domain knowledge motivated the trend and variability features. Without it, the data scientist would have used only rolling means and missed the most predictive signal.
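The feature function above produces one row per (turbine, reference date); what it does not show is where the labels come from. Each snapshot is labeled 1 if that turbine failed within the next 14 days. A minimal sketch of that labeling step, assuming a hypothetical failure_log table with turbine_id and failure_date columns:

```python
import pandas as pd

def build_snapshot_labels(failure_log: pd.DataFrame,
                          turbine_ids: list,
                          snapshot_dates: pd.DatetimeIndex,
                          horizon_days: int = 14) -> pd.DataFrame:
    """Label each (turbine, snapshot date) pair: 1 if the turbine fails
    within `horizon_days` after the snapshot date, else 0."""
    rows = []
    for tid in turbine_ids:
        failures = failure_log.loc[failure_log["turbine_id"] == tid, "failure_date"]
        for ref in snapshot_dates:
            horizon_end = ref + pd.Timedelta(days=horizon_days)
            # Strictly after the snapshot date, up to and including the horizon
            label = int(((failures > ref) & (failures <= horizon_end)).any())
            rows.append({"turbine_id": tid, "snapshot_date": ref, "label": label})
    return pd.DataFrame(rows)

# Example: one turbine failing on March 15, daily snapshots March 1-20
failure_log = pd.DataFrame({
    "turbine_id": ["T-042"],
    "failure_date": [pd.Timestamp("2025-03-15")],
})
snapshots = pd.date_range("2025-03-01", "2025-03-20", freq="D")
labels = build_snapshot_labels(failure_log, ["T-042"], snapshots)
print(labels["label"].sum())  # prints 14: March 1 through March 14 are positives
```

Note the leakage boundary: features use data up to the snapshot date, while the label looks strictly after it.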
Component 4: Model Training
The extreme class imbalance (2% failure rate) and extreme cost asymmetry ($185K vs. $12K) dominate the training strategy.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
roc_auc_score, precision_recall_curve, average_precision_score,
)
np.random.seed(42)
# Simulated training data. The production table is 240 turbines x 365 daily
# snapshots = 87,600 rows at a ~2% failure rate; a 10,000-row sample is
# simulated here to keep the example fast.
n_samples = 10000
n_features = 120  # 5 sensors x 4 windows x 6 stats; the 3 ambient features are omitted here
X_train_sim = np.random.randn(n_samples, n_features)
y_train_sim = np.random.binomial(1, 0.02, n_samples)
# Inject signal: increase vibration features for positive class
# Inject signal: increase vibration-related features for the positive class
vibration_cols = slice(0, 24)  # first 24 features are vibration-related
X_train_sim[y_train_sim == 1, vibration_cols] += np.random.normal(1.5, 0.5, (y_train_sim.sum(), 24))
X_tr, X_val, y_tr, y_val = train_test_split(
X_train_sim, y_train_sim, test_size=0.2, random_state=42, stratify=y_train_sim
)
# scale_pos_weight handles class imbalance
# The ratio should approximate negative/positive count
scale_pos_weight = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)
model = lgb.LGBMClassifier(
objective="binary",
metric="average_precision",
learning_rate=0.03,
num_leaves=31,
max_depth=6,
min_child_samples=20,
subsample=0.8,
colsample_bytree=0.6,
reg_alpha=0.1,
reg_lambda=1.0,
n_estimators=3000,
scale_pos_weight=scale_pos_weight,
random_state=42,
verbose=-1,
)
model.fit(
X_tr, y_tr,
eval_set=[(X_val, y_val)],
callbacks=[
lgb.early_stopping(stopping_rounds=100),
lgb.log_evaluation(period=500),
],
)
y_val_proba = model.predict_proba(X_val)[:, 1]
print(f"Validation AUC: {roc_auc_score(y_val, y_val_proba):.4f}")
print(f"Average Precision: {average_precision_score(y_val, y_val_proba):.4f}")
print(f"Positive class in validation: {y_val.sum()} / {len(y_val)} ({y_val.mean():.1%})")
Threshold Selection: Cost-Driven
# Cost-driven threshold optimization
def compute_cost(y_true, y_proba, threshold, fn_cost=185000, fp_cost=12000):
"""Total cost at a given threshold."""
y_pred = (y_proba >= threshold).astype(int)
fn = ((y_true == 1) & (y_pred == 0)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
return fn * fn_cost + fp * fp_cost
thresholds = np.arange(0.01, 0.50, 0.01)
costs = [compute_cost(y_val, y_val_proba, t) for t in thresholds]
optimal_threshold = thresholds[np.argmin(costs)]
print(f"Optimal threshold (cost-minimizing): {optimal_threshold:.2f}")
print(f"Total cost at optimal threshold: ${min(costs):,.0f}")
print(f"Total cost at threshold 0.50: ${compute_cost(y_val, y_val_proba, 0.50):,.0f}")
# Break-even precision
# At break-even, the cost of intervening on a false positive equals the
# expected cost of missing a true failure:
# break_even_precision = fp_cost / (fn_cost_saved + fp_cost)
# where fn_cost_saved = fn_cost - planned_maintenance_cost
fn_cost_saved = 185000 - 12000 # savings from catching a failure early
break_even_precision = 12000 / (fn_cost_saved + 12000)
print(f"Break-even precision: {break_even_precision:.4f}")
print("The model is profitable as long as at least "
f"{break_even_precision:.1%} of flagged turbines actually need maintenance.")
The optimal threshold is much lower than 0.50 because false negatives are roughly 15 times more expensive than false positives. StreamFlow faced a similar cost ratio ($180 vs. $12), but at three orders of magnitude lower absolute stakes, so a poorly chosen threshold there was far cheaper.
Component 5: Interpretation
The maintenance team needs to know why the model flagged a turbine, and the explanation must map to physical intuition.
# Feature importance (simulated top features)
important_features = [
("vibration_mm_s_3d_slope", 0.23, "Vibration trend over 3 days"),
("vibration_mm_s_7d_std", 0.18, "Vibration variability over 7 days"),
("vibration_mm_s_14d_vs_baseline", 0.15, "Vibration vs. 60-90 day baseline"),
("bearing_temp_c_3d_slope", 0.11, "Bearing temperature trend"),
("oil_pressure_bar_7d_min", 0.08, "Minimum oil pressure over 7 days"),
("rotor_speed_rpm_7d_std", 0.06, "Rotor speed variability"),
("generator_output_kw_14d_slope", 0.05, "Power output trend"),
]
print("Top features by SHAP importance:")
print(f"{'Feature':<40} {'Importance':>10} {'Physical Meaning'}")
print("-" * 85)
for feat, imp, meaning in important_features:
print(f"{feat:<40} {imp:>10.2f} {meaning}")
Per-Turbine Explanation
def explain_turbine_prediction(turbine_id, features_dict, top_k=3):
"""
Generate a maintenance-team-friendly explanation.
Instead of: "vibration_mm_s_3d_slope has SHAP value 0.82"
Say: "Vibration is trending upward over the last 3 days (+0.4 mm/s per day)"
"""
explanations = {
"vibration_mm_s_3d_slope": (
f"Vibration is trending {'upward' if features_dict.get('vibration_mm_s_3d_slope', 0) > 0 else 'stable'} "
f"over the last 3 days"
),
"vibration_mm_s_7d_std": (
f"Vibration variability is {'elevated' if features_dict.get('vibration_mm_s_7d_std', 0) > 0.5 else 'normal'} "
f"(std = {features_dict.get('vibration_mm_s_7d_std', 0):.2f} mm/s)"
),
"bearing_temp_c_3d_slope": (
f"Bearing temperature is {'rising' if features_dict.get('bearing_temp_c_3d_slope', 0) > 0 else 'stable'}"
),
"oil_pressure_bar_7d_min": (
f"Oil pressure dropped to {features_dict.get('oil_pressure_bar_7d_min', 0):.1f} bar "
f"in the last 7 days"
),
}
print(f"\nTurbine {turbine_id} --- FLAGGED for maintenance")
print(f" Predicted failure probability: [computed from model]")
print(f" Top reasons:")
for i, (feat, _, _) in enumerate(important_features[:top_k], 1):
explanation = explanations.get(feat, f"{feat} is outside normal range")
print(f" {i}. {explanation}")
print(f" Recommendation: Schedule inspection in next maintenance window.")
# Example
explain_turbine_prediction("T-042", features)
Practitioner Note --- Carlos's team does not care about SHAP values. They care about "is the vibration getting worse, and should I send a crew?" The data scientist's job is to translate SHAP into physical language. This translation requires domain knowledge --- you need to know that rising vibration with increasing variability is a classic bearing degradation signature, while a sudden step change in oil pressure suggests a seal failure. The model provides the probability. The explanation provides the action.
Component 6: Deployment and Monitoring
The deployment architecture differs from StreamFlow in one critical way: the maintenance scheduling system is the consumer, not a human team.
graph LR
subgraph "Data Ingestion"
SENSORS[240 Turbines<br/>14 Sensors Each] -->|10-min readings| TSDB[(TimescaleDB)]
end
subgraph "Daily Batch Pipeline"
TSDB --> FE[Feature Engineering<br/>Time-series to tabular]
FE --> MODEL[LightGBM<br/>Failure Prediction]
MODEL --> SCORES[(Score Table)]
end
subgraph "Scheduling System"
SCORES --> SCHED[Maintenance Scheduler]
SCHED --> CALENDAR[Maintenance Calendar]
SCHED --> ALERTS[Crew Notifications]
end
subgraph "Monitoring"
SCORES --> MON[Drift + Performance Monitor]
CALENDAR --> OUTCOMES[Actual Outcomes<br/>14-day label delay]
OUTCOMES --> MON
end
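The "Daily Batch Pipeline" stage of the diagram reduces to a small scoring job. This is an illustrative sketch, not TurbineTech's actual code: run_daily_scoring, feature_fn, and the 0.08 threshold are all stand-ins, and the model is assumed to expose a scikit-learn-style predict_proba.

```python
import pandas as pd

def run_daily_scoring(model, feature_fn, turbine_ids, reference_date, threshold=0.08):
    """Score every turbine as of `reference_date` and mark the ones the
    scheduler should act on. `feature_fn(turbine_id, reference_date)` is a
    stand-in for the feature-engineering query against the time-series DB."""
    rows = []
    for tid in turbine_ids:
        feats = feature_fn(tid, reference_date)  # dict of tabular features
        x = pd.DataFrame([feats]).drop(columns=["turbine_id"], errors="ignore")
        proba = float(model.predict_proba(x)[0, 1])
        rows.append({"turbine_id": tid, "score_date": reference_date,
                     "failure_proba": proba, "flagged": proba >= threshold})
    # In production, this frame is written to the score table that both the
    # maintenance scheduler and the drift monitor read from.
    return pd.DataFrame(rows)
```

Because the consumer is the scheduling system rather than a human, the output contract (column names, flag semantics, score-table write) matters as much as the model itself.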
Monitoring Differences
| Dimension | StreamFlow | TurbineTech |
|---|---|---|
| Label delay | 60 days | 14 days (shorter prediction window) |
| Drift concern | User behavior shifts | Sensor calibration drift, seasonal temperature |
| Performance metric | AUC, precision, recall | Recall at fixed precision, cost-weighted F-score |
| Retraining trigger | PSI > 0.25 | Performance drop OR seasonal recalibration |
| Business metric | Revenue retained | Unplanned downtime avoided |
The shorter label delay (14 days vs. 60 days) is a significant advantage. TurbineTech can detect performance degradation much faster than StreamFlow, which means the monitoring loop is tighter and the risk of silent model decay is lower.
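The monitoring table above names recall at fixed precision as TurbineTech's headline metric. Once the delayed 14-day labels arrive, it can be computed from the precision-recall curve; a minimal sketch (the 10% precision floor is an illustrative choice, not a figure from the case study):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_proba, min_precision=0.10):
    """Highest recall achievable while keeping precision >= min_precision.
    Returns (recall, threshold); (0.0, None) if the floor is unreachable."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision/recall have len(thresholds) + 1 entries; drop the final
    # (precision=1, recall=0) point, which has no associated threshold.
    ok = precision[:-1] >= min_precision
    if not ok.any():
        return 0.0, None
    qual_recall = np.where(ok, recall[:-1], -1.0)  # disqualify low-precision points
    best = int(np.argmax(qual_recall))
    return float(recall[best]), float(thresholds[best])

# Toy example at a 2% base rate
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.02, 5000)
p = np.clip(rng.normal(0.05 + 0.4 * y, 0.1), 0, 1)
r, t = recall_at_precision(y, p, min_precision=0.10)
print(f"Recall at precision >= 10%: {r:.2f} (threshold {t:.3f})")
```

Tracking this number weekly against the validation baseline is what turns the short label delay into an early-warning signal for model decay.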
The Full Retrospective
What Worked
- Domain-driven feature engineering. The three feature families (rolling statistics, trends, change points) came from conversations with the maintenance team. The most predictive feature --- vibration_mm_s_3d_slope --- was suggested by Carlos, not discovered by automated feature search.
- Cost-driven threshold optimization. Using the default threshold of 0.50 would have missed 60% of failures. The cost-optimized threshold catches 92% of failures at the expense of more false alarms --- a tradeoff the maintenance team explicitly accepted.
- Physical-language explanations. The maintenance team adopted the system because the explanations make physical sense. "Vibration trending upward" is something a technician can verify with a handheld sensor during an inspection.
What Didn't Work
- Seasonal effects were not captured in the initial training data. The model was trained on 12 months of data, but the first winter in production revealed that ambient temperature affects bearing viscosity, which affects vibration baselines. The model generated excessive false alarms in January because it interpreted normal cold-weather vibration increases as degradation signals.
- Sensor calibration drift was invisible to the model. Two turbines had sensors that drifted out of calibration over 6 months, producing readings that looked like degradation but were measurement artifacts. The model correctly predicted "failure" based on the readings, but the turbines were fine. The maintenance team lost trust in those two turbines' predictions until the sensors were recalibrated.
- The 14-day prediction window is too long for some failure modes. Sudden failures (electrical faults, blade damage from storms) happen in hours, not weeks. The model is good at predicting gradual mechanical degradation and bad at predicting sudden failures. This limitation was not communicated clearly enough to the maintenance team in the initial rollout.
What I Would Do Differently
- Include ambient temperature as an interaction feature. Instead of a standalone ambient_temp_7d_mean, create interaction features such as vibration_residual_after_temp_correction. This would reduce false alarms in cold weather by accounting for the known relationship between temperature and vibration baselines.
- Add a sensor health monitor upstream of the model. Before the model runs, check whether each sensor's readings are within calibration bounds. Flag sensors with suspicious drift patterns. Do not feed uncalibrated sensor data to the model.
- Build two models. One for gradual degradation (the current model, 14-day window). One for rapid anomaly detection (a different model architecture, possibly an isolation forest, with a 24-hour window). Present both to the maintenance team as complementary tools, not competing models.
- Run a controlled experiment. During the first three months, randomly assign half the flagged turbines to receive preventive maintenance and half to be monitored without intervention. This would produce a causal estimate of the model's value, not a correlational one.
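The sensor health monitor proposed above can be sketched as a bounds-plus-drift check that runs before feature engineering. The calibration bounds and the 30-day drift tolerance below are assumed values for illustration, not real sensor specs:

```python
import numpy as np
import pandas as pd

# Illustrative calibration bounds per sensor (assumptions, not real specs)
CALIBRATION_BOUNDS = {
    "vibration_mm_s": (0.0, 10.0),
    "bearing_temp_c": (-20.0, 120.0),
    "oil_pressure_bar": (0.0, 8.0),
}

def sensor_health_check(readings: pd.DataFrame, col: str,
                        drift_window_days: int = 30,
                        max_abs_drift: float = 0.5) -> dict:
    """Flag a sensor if readings leave calibration bounds, or if its rolling
    daily median drifts more than `max_abs_drift` over the series."""
    lo, hi = CALIBRATION_BOUNDS[col]
    out_of_bounds = float(((readings[col] < lo) | (readings[col] > hi)).mean())
    daily = readings.set_index("timestamp")[col].resample("1D").median()
    rolling = daily.rolling(drift_window_days, min_periods=7).median().dropna()
    drift = float(rolling.iloc[-1] - rolling.iloc[0])
    return {
        "sensor": col,
        "pct_out_of_bounds": out_of_bounds,
        "median_drift": drift,
        "healthy": out_of_bounds < 0.01 and abs(drift) <= max_abs_drift,
    }
```

A slow, monotone drift in the rolling median with no matching change in a physically coupled sensor (for example, vibration creeping up while bearing temperature stays flat) is the signature of calibration drift rather than degradation.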
Questions for Discussion
- The break-even precision for TurbineTech is approximately 6.5%. This means the model is profitable even if only 1 in 15 flagged turbines actually needs maintenance. Is this too low? At what point does the maintenance team lose trust in the model, regardless of the cost math?
- The case study describes two failure modes: gradual degradation and sudden faults. Should these be modeled separately, or should a single model handle both? What are the tradeoffs?
- Sensor calibration drift is a form of data drift that is specific to physical systems. How would you detect it? Propose a monitoring approach that distinguishes real degradation from sensor drift.
- Compare the stakeholder communication challenges of StreamFlow (VP of Customer Success, non-technical) and TurbineTech (head of maintenance, domain expert but non-ML). How does the explanation strategy need to differ?
This case study is part of Chapter 35: Capstone --- End-to-End ML System. Return to the chapter for the full architecture.