Case Study 2: TurbineTech Predictive Maintenance --- End-to-End


Background

TurbineTech operates 240 wind turbines across three sites in the American Midwest. Each turbine has 14 sensors monitoring vibration, temperature, rotor speed, oil pressure, generator output, and ambient conditions. Sensor readings are collected every 10 minutes. A single turbine failure costs an average of $185,000 in unplanned downtime, emergency crew deployment, and replacement parts. Planned maintenance during scheduled downtime costs $12,000.

The head of maintenance, Carlos Gutierrez, wants a predictive maintenance system: a model that predicts which turbines are likely to fail within the next 14 days, so his team can schedule preventive repairs during planned downtime windows (every turbine has a 48-hour maintenance window every 6 weeks).

This case study walks through the complete system --- from business question to deployed model with monitoring --- highlighting how each capstone component differs from the StreamFlow churn system.


Component 1: Business Question

StreamFlow:  "Which subscribers will churn in 60 days?"
TurbineTech: "Which turbines will fail in 14 days?"

The structure is the same. The constraints are different:

Dimension StreamFlow TurbineTech
Prediction target Binary (churn/retain) Binary (failure/no failure)
Prediction window 60 days 14 days
Prediction consumer Customer success team Maintenance scheduling system
Cost of false negative ~$180 (lost subscriber) ~$185,000 (unplanned downtime)
Cost of false positive ~$12 (unnecessary outreach) ~$12,000 (unnecessary maintenance)
Class balance 12% positive (churn) 2% positive (failure)
Decision cadence Weekly (batch list) Daily (maintenance schedule)

The cost asymmetry is the defining characteristic. A missed failure is 15 times more expensive than a false alarm. This drives every downstream decision: threshold, evaluation metric, monitoring priority.
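The arithmetic behind the 15x figure, using the cost values from the table above:

```python
# Cost figures from the comparison table (USD)
FN_COST = 185_000  # missed failure: unplanned downtime, crew, parts
FP_COST = 12_000   # false alarm: unnecessary planned maintenance

ratio = FN_COST / FP_COST
print(f"A missed failure costs {ratio:.1f}x a false alarm")  # ~15.4x
```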


Component 2: Data Extraction

StreamFlow extracted tabular subscriber data from a SQL database. TurbineTech extracts time-series sensor data from a time-series database (InfluxDB, TimescaleDB, or similar).

import numpy as np
import pandas as pd

np.random.seed(42)

# Simulated sensor data for one turbine over 90 days
# In production, this query hits the time-series database.
n_readings = 90 * 24 * 6  # 10-minute intervals for 90 days

timestamps = pd.date_range("2025-01-01", periods=n_readings, freq="10min")

sensor_data = pd.DataFrame({
    "timestamp": timestamps,
    "turbine_id": "T-042",
    "vibration_mm_s": np.random.normal(2.8, 0.4, n_readings),
    "bearing_temp_c": np.random.normal(65, 5, n_readings),
    "rotor_speed_rpm": np.random.normal(1800, 100, n_readings),
    "oil_pressure_bar": np.random.normal(4.2, 0.3, n_readings),
    "generator_output_kw": np.random.normal(1200, 150, n_readings),
    "ambient_temp_c": np.random.normal(5, 10, n_readings),  # winter in the Midwest
    "wind_speed_m_s": np.abs(np.random.normal(8, 3, n_readings)),
    "humidity_pct": np.random.uniform(30, 90, n_readings),
})

# Simulate a degradation pattern: vibration increases in the 30 days before failure
failure_date = pd.Timestamp("2025-03-15")
days_to_failure = (failure_date - sensor_data["timestamp"]).dt.total_seconds() / 86400
degradation_mask = (days_to_failure > 0) & (days_to_failure < 30)
# Index-aligned Series: 0.05 mm/s per day of linear ramp toward failure
degradation_signal = 0.05 * (30 - days_to_failure[degradation_mask])
sensor_data.loc[degradation_mask, "vibration_mm_s"] += degradation_signal

print(f"Sensor readings: {len(sensor_data):,}")
print(f"Date range: {sensor_data['timestamp'].min()} to {sensor_data['timestamp'].max()}")
print(f"Columns: {list(sensor_data.columns)}")

The Key Difference: Time-Series to Tabular

The model cannot consume raw 10-minute sensor readings. It needs tabular features that summarize the recent history of each turbine. This is the critical feature engineering step that has no equivalent in the StreamFlow system.


Component 3: Feature Engineering

StreamFlow's features were scalar: sessions_last_30d, monthly_charges. TurbineTech's features must capture temporal patterns: trends, variability, change points.

def engineer_turbine_features(
    sensor_df: pd.DataFrame,
    reference_date: pd.Timestamp,
    windows: tuple = (1, 3, 7, 14),  # days
) -> dict:
    """
    Convert time-series sensor data into tabular features for a single turbine.

    Three feature families:
    1. Rolling statistics (mean, std, min, max) over multiple windows
    2. Trend features (slope of linear fit over the window)
    3. Change-point features (ratio of recent window to historical baseline)
    """
    features = {"turbine_id": sensor_df["turbine_id"].iloc[0]}
    sensor_cols = [
        "vibration_mm_s", "bearing_temp_c", "rotor_speed_rpm",
        "oil_pressure_bar", "generator_output_kw",
    ]

    # Historical baseline: 60-90 days ago
    baseline_mask = (
        (reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400
    ).between(60, 90)
    baseline = sensor_df[baseline_mask]

    for window in windows:
        window_mask = (
            (reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400
        ).between(0, window)
        window_data = sensor_df[window_mask]

        for col in sensor_cols:
            prefix = f"{col}_{window}d"
            if len(window_data) > 0:
                features[f"{prefix}_mean"] = window_data[col].mean()
                features[f"{prefix}_std"] = window_data[col].std()
                features[f"{prefix}_min"] = window_data[col].min()
                features[f"{prefix}_max"] = window_data[col].max()

                # Trend: slope of linear regression over the window
                if len(window_data) > 10:
                    x = np.arange(len(window_data))
                    slope = np.polyfit(x, window_data[col].values, 1)[0]
                    features[f"{prefix}_slope"] = slope
                else:
                    features[f"{prefix}_slope"] = 0.0

                # Change point: ratio of window mean to baseline mean
                if len(baseline) > 0 and baseline[col].mean() != 0:
                    features[f"{prefix}_vs_baseline"] = (
                        window_data[col].mean() / baseline[col].mean()
                    )
                else:
                    features[f"{prefix}_vs_baseline"] = 1.0
            else:
                for suffix in ["mean", "std", "min", "max", "slope", "vs_baseline"]:
                    features[f"{prefix}_{suffix}"] = np.nan

    # Ambient features (not sensor health, but operating conditions)
    recent = sensor_df[
        ((reference_date - sensor_df["timestamp"]).dt.total_seconds() / 86400).between(0, 7)
    ]
    if len(recent) > 0:
        features["ambient_temp_7d_mean"] = recent["ambient_temp_c"].mean()
        features["wind_speed_7d_mean"] = recent["wind_speed_m_s"].mean()
        features["humidity_7d_mean"] = recent["humidity_pct"].mean()

    return features

# Engineer features for T-042 as of March 1
reference = pd.Timestamp("2025-03-01")
features = engineer_turbine_features(sensor_data, reference)
print(f"Engineered {len(features)} features for turbine {features['turbine_id']}")
print("Sample features:")
for key in list(features)[:10]:
    value = features[key]
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

Domain Knowledge --- The three feature families (rolling statistics, trends, change points) come from conversations with the maintenance team. Carlos Gutierrez told the data science team: "When a bearing is going bad, vibration does not jump overnight. It creeps up over two to three weeks, and the standard deviation increases because the vibration becomes irregular." That single piece of domain knowledge motivated the trend and variability features. Without it, the data scientist would have used only rolling means and missed the most predictive signal.


Component 4: Model Training

The extreme class imbalance (2% failure rate) and extreme cost asymmetry ($185K vs. $12K) dominate the training strategy.

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, precision_recall_curve, average_precision_score,
)

np.random.seed(42)

# Simulated training data. In production this would be 240 turbines x 365
# daily snapshots = 87,600 rows at a 2% failure rate; we simulate 10,000 rows.
n_samples = 10000
n_features = 123  # 5 sensors x 4 windows x 6 stats = 120, plus 3 ambient

X_train_sim = np.random.randn(n_samples, n_features)
y_train_sim = np.random.binomial(1, 0.02, n_samples)

# Inject signal: increase the vibration-related features for the positive class
n_vibration = 24  # first 24 features (1 sensor x 4 windows x 6 stats)
X_train_sim[y_train_sim == 1, :n_vibration] += np.random.normal(
    1.5, 0.5, (y_train_sim.sum(), n_vibration)
)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_sim, y_train_sim, test_size=0.2, random_state=42, stratify=y_train_sim
)

# scale_pos_weight handles class imbalance
# The ratio should approximate negative/positive count
scale_pos_weight = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)

model = lgb.LGBMClassifier(
    objective="binary",
    metric="average_precision",
    learning_rate=0.03,
    num_leaves=31,
    max_depth=6,
    min_child_samples=20,
    subsample=0.8,
    colsample_bytree=0.6,
    reg_alpha=0.1,
    reg_lambda=1.0,
    n_estimators=3000,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    verbose=-1,
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=500),
    ],
)

y_val_proba = model.predict_proba(X_val)[:, 1]

print(f"Validation AUC: {roc_auc_score(y_val, y_val_proba):.4f}")
print(f"Average Precision: {average_precision_score(y_val, y_val_proba):.4f}")
print(f"Positive class in validation: {y_val.sum()} / {len(y_val)} ({y_val.mean():.1%})")

Threshold Selection: Cost-Driven

# Cost-driven threshold optimization
def compute_cost(y_true, y_proba, threshold, fn_cost=185000, fp_cost=12000):
    """Total cost at a given threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    return fn * fn_cost + fp * fp_cost

thresholds = np.arange(0.01, 0.50, 0.01)
costs = [compute_cost(y_val, y_val_proba, t) for t in thresholds]
optimal_threshold = thresholds[np.argmin(costs)]

print(f"Optimal threshold (cost-minimizing): {optimal_threshold:.2f}")
print(f"Total cost at optimal threshold: ${min(costs):,.0f}")
print(f"Total cost at threshold 0.50:    ${compute_cost(y_val, y_val_proba, 0.50):,.0f}")

# Break-even precision
# At break-even, the cost of intervening on a false positive equals the
# expected cost of missing a true failure:
# break_even_precision = fp_cost / (fn_cost_saved + fp_cost)
# where fn_cost_saved = fn_cost - planned_maintenance_cost
fn_cost_saved = 185000 - 12000  # savings from catching a failure early
break_even_precision = 12000 / (fn_cost_saved + 12000)
print(f"Break-even precision: {break_even_precision:.4f}")
print("The model is profitable as long as at least "
      f"{break_even_precision:.1%} of flagged turbines actually need maintenance.")

The optimal threshold is much lower than 0.50 because false negatives are roughly 15 times more expensive than false positives. StreamFlow's cost ratio was similar (~$180 vs. ~$12 per error), but the absolute stakes here are three orders of magnitude larger, so getting the threshold right matters far more.


Component 5: Interpretation

The maintenance team needs to know why the model flagged a turbine, and the explanation must map to physical intuition.

# Feature importance (simulated top features)
important_features = [
    ("vibration_mm_s_3d_slope", 0.23, "Vibration trend over 3 days"),
    ("vibration_mm_s_7d_std", 0.18, "Vibration variability over 7 days"),
    ("vibration_mm_s_14d_vs_baseline", 0.15, "Vibration vs. 60-90 day baseline"),
    ("bearing_temp_c_3d_slope", 0.11, "Bearing temperature trend"),
    ("oil_pressure_bar_7d_min", 0.08, "Minimum oil pressure over 7 days"),
    ("rotor_speed_rpm_7d_std", 0.06, "Rotor speed variability"),
    ("generator_output_kw_14d_slope", 0.05, "Power output trend"),
]

print("Top features by SHAP importance:")
print(f"{'Feature':<40} {'Importance':>10}  {'Physical Meaning'}")
print("-" * 85)
for feat, imp, meaning in important_features:
    print(f"{feat:<40} {imp:>10.2f}  {meaning}")

Per-Turbine Explanation

def explain_turbine_prediction(turbine_id, features_dict, top_k=3):
    """
    Generate a maintenance-team-friendly explanation.

    Instead of: "vibration_mm_s_3d_slope has SHAP value 0.82"
    Say:        "Vibration is trending upward over the last 3 days (+0.4 mm/s per day)"
    """
    explanations = {
        "vibration_mm_s_3d_slope": (
            f"Vibration is trending {'upward' if features_dict.get('vibration_mm_s_3d_slope', 0) > 0 else 'stable'} "
            f"over the last 3 days"
        ),
        "vibration_mm_s_7d_std": (
            f"Vibration variability is {'elevated' if features_dict.get('vibration_mm_s_7d_std', 0) > 0.5 else 'normal'} "
            f"(std = {features_dict.get('vibration_mm_s_7d_std', 0):.2f} mm/s)"
        ),
        "bearing_temp_c_3d_slope": (
            f"Bearing temperature is {'rising' if features_dict.get('bearing_temp_c_3d_slope', 0) > 0 else 'stable'}"
        ),
        "oil_pressure_bar_7d_min": (
            f"Oil pressure dropped to {features_dict.get('oil_pressure_bar_7d_min', 0):.1f} bar "
            f"in the last 7 days"
        ),
    }

    print(f"\nTurbine {turbine_id} --- FLAGGED for maintenance")
    print(f"  Predicted failure probability: [computed from model]")
    print(f"  Top reasons:")
    for i, (feat, _, _) in enumerate(important_features[:top_k], 1):
        explanation = explanations.get(feat, f"{feat} is outside normal range")
        print(f"    {i}. {explanation}")
    print(f"  Recommendation: Schedule inspection in next maintenance window.")

# Example
explain_turbine_prediction("T-042", features)

Practitioner Note --- Carlos's team does not care about SHAP values. They care about "is the vibration getting worse, and should I send a crew?" The data scientist's job is to translate SHAP into physical language. This translation requires domain knowledge --- you need to know that rising vibration with increasing variability is a classic bearing degradation signature, while a sudden step change in oil pressure suggests a seal failure. The model provides the probability. The explanation provides the action.


Component 6: Deployment and Monitoring

The deployment architecture differs from StreamFlow in one critical way: the maintenance scheduling system is the consumer, not a human team.

graph LR
    subgraph "Data Ingestion"
        SENSORS[240 Turbines<br/>14 Sensors Each] -->|10-min readings| TSDB[(TimescaleDB)]
    end

    subgraph "Daily Batch Pipeline"
        TSDB --> FE[Feature Engineering<br/>Time-series to tabular]
        FE --> MODEL[LightGBM<br/>Failure Prediction]
        MODEL --> SCORES[(Score Table)]
    end

    subgraph "Scheduling System"
        SCORES --> SCHED[Maintenance Scheduler]
        SCHED --> CALENDAR[Maintenance Calendar]
        SCHED --> ALERTS[Crew Notifications]
    end

    subgraph "Monitoring"
        SCORES --> MON[Drift + Performance Monitor]
        CALENDAR --> OUTCOMES[Actual Outcomes<br/>14-day label delay]
        OUTCOMES --> MON
    end

Monitoring Differences

Dimension StreamFlow TurbineTech
Label delay 60 days 14 days (shorter prediction window)
Drift concern User behavior shifts Sensor calibration drift, seasonal temperature
Performance metric AUC, precision, recall Recall at fixed precision, cost-weighted F-score
Retraining trigger PSI > 0.25 Performance drop OR seasonal recalibration
Business metric Revenue retained Unplanned downtime avoided

The shorter label delay (14 days vs. 60 days) is a significant advantage. TurbineTech can detect performance degradation much faster than StreamFlow, which means the monitoring loop is tighter and the risk of silent model decay is lower.
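For reference, the PSI retraining trigger in the table above can be sketched in a few lines. This is a generic implementation of the population stability index; the 10-bin split and the 0.25 cutoff are the usual rules of thumb, not TurbineTech-specific values:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline feature distribution and a recent window.

    Rule of thumb: < 0.10 stable, 0.10-0.25 moderate shift, > 0.25 retrain.
    """
    # Bin edges from the baseline distribution's quantiles
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the fractions to avoid log(0)
    exp_frac = np.clip(exp_frac, 1e-6, None)
    act_frac = np.clip(act_frac, 1e-6, None)
    return np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac))

rng = np.random.default_rng(0)
baseline = rng.normal(2.8, 0.4, 5000)  # training-period vibration readings
drifted = rng.normal(3.2, 0.5, 5000)   # recent window with a mean shift
print(f"PSI: {population_stability_index(baseline, drifted):.3f}")
```

Running the same check per sensor channel also helps with the calibration-drift problem discussed below: a PSI spike on one turbine's sensor, with no spike fleet-wide, points at the sensor rather than the machine.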


The Full Retrospective

What Worked

  1. Domain-driven feature engineering. The three feature families (rolling statistics, trends, change points) came from conversations with the maintenance team. The most predictive feature --- vibration_mm_s_3d_slope --- was suggested by Carlos, not discovered by automated feature search.

  2. Cost-driven threshold optimization. Using the default threshold of 0.50 would have missed 60% of failures. The cost-optimized threshold catches 92% of failures at the expense of more false alarms --- a tradeoff the maintenance team explicitly accepted.

  3. Physical-language explanations. The maintenance team adopted the system because the explanations make physical sense. "Vibration trending upward" is something a technician can verify with a handheld sensor during an inspection.

What Didn't Work

  1. Seasonal effects were not captured in the initial training data. The model was trained on 12 months of data, but the first winter in production revealed that ambient temperature affects bearing viscosity, which affects vibration baselines. The model generated excessive false alarms in January because it interpreted normal cold-weather vibration increases as degradation signals.

  2. Sensor calibration drift was invisible to the model. Two turbines had sensors that drifted out of calibration over 6 months, producing readings that looked like degradation but were measurement artifacts. The model correctly predicted "failure" based on the readings, but the turbines were fine. The maintenance team lost trust in those two turbines' predictions until the sensors were recalibrated.

  3. The 14-day prediction window is too long for some failure modes. Sudden failures (electrical faults, blade damage from storms) happen in hours, not weeks. The model is good at predicting gradual mechanical degradation and bad at predicting sudden failures. This limitation was not communicated clearly enough to the maintenance team in the initial rollout.

What I Would Do Differently

  1. Include ambient temperature as an interaction feature. Instead of a standalone ambient_temp_7d_mean, create interaction features: vibration_residual_after_temp_correction. This would reduce false alarms in cold weather by accounting for the known relationship between temperature and vibration baselines.

  2. Add a sensor health monitor upstream of the model. Before the model runs, check whether each sensor's readings are within calibration bounds. Flag sensors with suspicious drift patterns. Do not feed uncalibrated sensor data to the model.

  3. Build two models. One for gradual degradation (the current model, 14-day window). One for rapid anomaly detection (a different model architecture, possibly isolation forest, with a 24-hour window). Present both to the maintenance team as complementary tools, not competing models.

  4. Run a controlled experiment. During the first three months, randomly assign half the flagged turbines to receive preventive maintenance and half to be monitored without intervention. This would produce a causal estimate of the model's value, not a correlational one.
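Item 1 could start as a simple linear correction: fit vibration against ambient temperature and use the residual as the feature. A sketch under the assumption that the relationship is roughly linear; the column names match the simulated data earlier, and the coefficients are illustrative:

```python
import numpy as np
import pandas as pd

def vibration_residual(df: pd.DataFrame) -> pd.Series:
    """Vibration after removing the linear ambient-temperature effect.

    Fit on the whole frame here for brevity; in practice, fit the
    coefficients on known-healthy periods only, then apply everywhere.
    """
    slope, intercept = np.polyfit(df["ambient_temp_c"], df["vibration_mm_s"], 1)
    expected = intercept + slope * df["ambient_temp_c"]
    return df["vibration_mm_s"] - expected

# Illustrative data: colder air -> slightly higher baseline vibration
rng = np.random.default_rng(42)
temp = rng.normal(5, 10, 1000)
vib = 2.8 - 0.02 * temp + rng.normal(0, 0.1, 1000)
df = pd.DataFrame({"ambient_temp_c": temp, "vibration_mm_s": vib})

resid = vibration_residual(df)
print(f"Residual mean: {resid.mean():.4f}, std: {resid.std():.4f}")
```

The residual then feeds the same rolling-statistic machinery as the raw channel, so cold-weather baseline shifts no longer masquerade as degradation.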
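Item 2's sensor health monitor could begin as a bounds-plus-drift check run before feature engineering. The calibration bounds and the drift threshold below are illustrative placeholders, not real spec values:

```python
import numpy as np
import pandas as pd

# Illustrative calibration bounds per sensor (placeholder values, not real specs)
CALIBRATION_BOUNDS = {
    "vibration_mm_s": (0.0, 10.0),
    "bearing_temp_c": (-20.0, 120.0),
    "oil_pressure_bar": (0.5, 8.0),
}

def sensor_health_report(df: pd.DataFrame, drift_z: float = 4.0) -> dict:
    """Flag sensors that are out of calibration bounds or slowly drifting.

    A sensor is suspect if readings fall outside its physical bounds, or if
    the mean of the last 1,000 readings sits more than `drift_z` standard
    errors from the mean of the first 1,000 (a crude slow-drift check).
    """
    report = {}
    for col, (lo, hi) in CALIBRATION_BOUNDS.items():
        if col not in df.columns:
            continue
        out_of_bounds = float(((df[col] < lo) | (df[col] > hi)).mean())
        first, last = df[col].iloc[:1000], df[col].iloc[-1000:]
        se = first.std() / np.sqrt(len(first))
        drifted = abs(last.mean() - first.mean()) / max(se, 1e-9) > drift_z
        report[col] = {"pct_out_of_bounds": out_of_bounds,
                       "drift_suspected": bool(drifted)}
    return report

rng = np.random.default_rng(0)
readings = pd.DataFrame({
    "vibration_mm_s": rng.normal(2.8, 0.4, 5000),
    "bearing_temp_c": np.concatenate([rng.normal(65, 5, 2500),
                                      rng.normal(75, 5, 2500)]),  # drifting sensor
    "oil_pressure_bar": rng.normal(4.2, 0.3, 5000),
})
for sensor, status in sensor_health_report(readings).items():
    print(sensor, status)
```

Any sensor flagged here would be excluded from model input and routed to a recalibration ticket instead of being scored as "failure risk."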
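Item 3's rapid-anomaly model could be prototyped with scikit-learn's IsolationForest over short-window features. A sketch with simulated 24-hour snapshots; the feature dimensionality and contamination rate are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated 24-hour feature snapshots: per-turbine summary stats over the last day
normal_days = rng.normal(0, 1, (2000, 10))
sudden_faults = rng.normal(4, 1, (20, 10))  # sudden faults look very different

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_days)  # train on normal operation only

# predict: +1 = normal, -1 = anomaly
flags = detector.predict(sudden_faults)
print(f"Flagged {np.sum(flags == -1)} of {len(sudden_faults)} simulated sudden faults")
```

The two models answer different questions: the LightGBM model says "schedule this turbine in the next maintenance window," while the anomaly detector says "send someone today."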


Questions for Discussion

  1. The break-even precision for TurbineTech is approximately 6.5%. This means the model is profitable even if only 1 in 15 flagged turbines actually needs maintenance. Is this too low? At what point does the maintenance team lose trust in the model, regardless of the cost math?

  2. The case study describes two failure modes: gradual degradation and sudden faults. Should these be modeled separately, or should a single model handle both? What are the tradeoffs?

  3. Sensor calibration drift is a form of data drift that is specific to physical systems. How would you detect it? Propose a monitoring approach that distinguishes real degradation from sensor drift.

  4. Compare the stakeholder communication challenges of StreamFlow (VP of Customer Success, non-technical) and TurbineTech (head of maintenance, domain expert but non-ML). How does the explanation strategy need to differ?


This case study is part of Chapter 35: Capstone --- End-to-End ML System. Return to the chapter for the full architecture.