Case Study 2: TurbineTech Predictive Maintenance --- Why Trees Handle Sensor Data Naturally


Background

TurbineTech operates 1,200 industrial wind turbines across three sites in the American Midwest. Each turbine generates 14 sensor readings every 10 minutes: vibration, temperature, rotor speed, oil pressure, pitch angle, and others. When a turbine fails unexpectedly, the repair cost averages $85,000 and the turbine is offline for 3-5 days. When a failure is predicted in advance, the maintenance team can schedule a repair window, pre-order parts, and reduce both cost ($22,000 average) and downtime (6-12 hours).

The data engineering team has built a feature pipeline that aggregates the 10-minute sensor readings into daily summaries: mean, standard deviation, min, max, and rolling 7-day trends for each sensor. Combined with turbine metadata (age, model, site, last maintenance date), this produces a dataset with 74 features per turbine-day.

The target variable is failure_within_7_days: did this turbine experience an unplanned failure within the next 7 days? The failure rate is approximately 2.3% --- highly imbalanced.

The maintenance team's current approach is rule-based: flag a turbine if any sensor reading exceeds a manually set threshold. This catches about 40% of failures (recall) but generates many false alarms (precision ~15%).
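
The rule-based approach is worth sketching, since it is the benchmark the forest must beat. A minimal version, with illustrative thresholds (not TurbineTech's actual rule book):

```python
# Hypothetical per-sensor alarm thresholds (illustrative values only)
HIGH_THRESHOLDS = {'vibration_mean': 12.0, 'temp_bearing_max': 90.0}
LOW_THRESHOLDS = {'oil_pressure_min': 30.0}

def rule_based_flag(readings: dict) -> bool:
    """Flag a turbine if ANY sensor breaches its manual threshold."""
    high_breach = any(readings[k] > v for k, v in HIGH_THRESHOLDS.items())
    low_breach = any(readings[k] < v for k, v in LOW_THRESHOLDS.items())
    return high_breach or low_breach

healthy = {'vibration_mean': 8.0, 'temp_bearing_max': 75.0, 'oil_pressure_min': 40.0}
vibrating = {'vibration_mean': 14.5, 'temp_bearing_max': 75.0, 'oil_pressure_min': 40.0}
print(rule_based_flag(healthy), rule_based_flag(vibrating))  # False True
```

Single-sensor rules cannot see trends or interactions: a turbine whose vibration is rising fast but still below 12 is invisible to this check.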

The question: can a Random Forest do better?


The Data

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 30000  # 30,000 turbine-day observations

# Simulate sensor features (daily aggregates)
sensors = {
    'vibration_mean': np.random.lognormal(2.0, 0.4, n),
    'vibration_std': np.random.exponential(1.5, n),
    'vibration_max': np.random.lognormal(2.5, 0.5, n),
    'temp_bearing_mean': np.random.normal(65, 8, n),
    'temp_bearing_max': np.random.normal(72, 10, n),
    'temp_gearbox_mean': np.random.normal(55, 6, n),
    'temp_gearbox_max': np.random.normal(62, 8, n),
    'rotor_speed_mean': np.random.normal(14.5, 2.0, n),
    'rotor_speed_std': np.random.exponential(0.8, n),
    'oil_pressure_mean': np.random.normal(45, 5, n),
    'oil_pressure_min': np.random.normal(38, 6, n),
    'pitch_angle_mean': np.random.normal(12, 3, n),
    'pitch_angle_std': np.random.exponential(1.0, n),
    'power_output_mean': np.random.normal(1800, 300, n).clip(0),
    'power_output_std': np.random.exponential(150, n),
}

# Rolling 7-day trend features (rate of change)
trends = {
    'vibration_trend_7d': np.random.normal(0, 0.5, n),
    'temp_bearing_trend_7d': np.random.normal(0, 0.3, n),
    'temp_gearbox_trend_7d': np.random.normal(0, 0.2, n),
    'oil_pressure_trend_7d': np.random.normal(0, 0.4, n),
    'power_output_trend_7d': np.random.normal(0, 50, n),
}

# Turbine metadata
metadata = {
    'turbine_age_years': np.random.choice(range(1, 21), n),
    'model_type': np.random.choice(['V90', 'V110', 'V126', 'SG3.4'], n,
                                     p=[0.25, 0.30, 0.30, 0.15]),
    'site': np.random.choice(['site_A', 'site_B', 'site_C'], n, p=[0.40, 0.35, 0.25]),
    'days_since_maintenance': np.random.exponential(90, n).astype(int).clip(0, 730),
    'maintenance_type_last': np.random.choice(
        ['scheduled', 'corrective', 'inspection'], n, p=[0.50, 0.30, 0.20]
    ),
    'cumulative_operating_hours': np.random.uniform(5000, 120000, n).astype(int),
}

# Derived features (reference the source arrays by name, not by dict position)
derived = {
    'vibration_to_power_ratio': (
        sensors['vibration_mean'] / sensors['power_output_mean'].clip(1)
    ),
    'temp_differential': (
        sensors['temp_bearing_mean'] - sensors['temp_gearbox_mean']
    ),
    'operating_intensity': (
        sensors['power_output_mean'] /
        metadata['cumulative_operating_hours'].clip(1) * 1000
    ),
}

df = pd.DataFrame({**sensors, **trends, **metadata, **derived})

# Generate failure with realistic relationships
failure_logit = (
    -5.0
    + 0.08 * (df['vibration_mean'] - 7)
    + 0.15 * df['vibration_std']
    + 0.04 * (df['temp_bearing_mean'] - 65)
    + 0.03 * (df['temp_gearbox_mean'] - 55)
    - 0.06 * (df['oil_pressure_mean'] - 45)
    + 0.02 * df['turbine_age_years']
    + 0.003 * df['days_since_maintenance']
    + 0.5 * (df['vibration_trend_7d'] > 0.8).astype(int)
    + 0.3 * (df['temp_bearing_trend_7d'] > 0.5).astype(int)
    + 0.4 * (df['maintenance_type_last'] == 'corrective').astype(int)
    - 0.3 * (df['model_type'] == 'SG3.4').astype(int)
    + 0.2 * df['vibration_to_power_ratio']
    + np.random.normal(0, 1.0, n)
)
# sigmoid(logit) > 0.5 iff logit > 0; the Gaussian noise term above is what
# makes individual labels stochastic rather than deterministic
df['failure_within_7_days'] = (failure_logit > 0).astype(int)

X = df.drop('failure_within_7_days', axis=1)
y = df['failure_within_7_days']

print(f"Dataset shape: {X.shape}")
print(f"Failure rate: {y.mean():.1%}")
print(f"\nFeature types:")
print(f"  Numeric:     {X.select_dtypes(include=[np.number]).shape[1]}")
print(f"  Categorical: {X.select_dtypes(include=['object']).shape[1]}")
print(f"\nNumeric feature ranges:")
numeric_cols = X.select_dtypes(include=[np.number]).columns
print(f"  {'Feature':<35} {'Min':>10} {'Max':>10} {'Mean':>10}")
print(f"  {'-'*35} {'-'*10} {'-'*10} {'-'*10}")
for col in numeric_cols[:8]:
    print(f"  {col:<35} {X[col].min():>10.1f} {X[col].max():>10.1f} {X[col].mean():>10.1f}")
print(f"  ... ({len(numeric_cols) - 8} more numeric features)")
Dataset shape: (30000, 27)
Failure rate: 2.3%

Feature types:
  Numeric:     24
  Categorical: 3

Numeric feature ranges:
  Feature                                  Min        Max       Mean
  ----------------------------------- ---------- ---------- ----------
  vibration_mean                             1.6       38.2        8.0
  vibration_std                              0.0       13.4        1.5
  vibration_max                              1.4       79.6       14.0
  temp_bearing_mean                         36.1       93.9       65.0
  temp_bearing_max                          33.1      112.4       72.0
  temp_gearbox_mean                         30.4       78.4       55.0
  temp_gearbox_max                          28.0       96.4       62.0
  rotor_speed_mean                           6.0       22.9       14.5
  ... (16 more numeric features)

Notice the feature ranges. Vibration means run from about 2 to 38. Temperatures span roughly 28-112. Power output reaches 3,000+. Cumulative operating hours run from 5,000 to 120,000, more than four orders of magnitude larger than a typical vibration standard deviation. For logistic regression or SVMs, this spread would require careful scaling. For trees, it does not matter at all.


Why This Problem Suits Trees

Four properties of the TurbineTech data make it ideal for tree-based methods:

  1. Mixed feature types. Numeric sensors (continuous), categorical metadata (model type, site, maintenance type), and integer counts (age, days since maintenance) coexist. Trees handle all of them natively with threshold splits.

  2. Wildly different scales. Operating hours range from 5,000 to 120,000 while vibration standard deviation ranges from 0 to 13. Trees split on ordering, not magnitude, so no scaling is needed.

  3. Non-linear thresholds. Equipment failure often follows a threshold pattern: a bearing is fine until the temperature exceeds 80 degrees, then risk jumps. Trees capture this naturally as a single split. Logistic regression needs a manually engineered indicator feature.

  4. Feature interactions. A rising vibration trend is concerning only if the last maintenance was corrective (not scheduled). Trees capture this interaction through hierarchical splits without requiring explicit interaction terms.
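
Property 2 can be verified directly. In this synthetic check (unrelated to the turbine data), rescaling a feature by five orders of magnitude leaves a decision tree's predictions unchanged, because splits depend only on the ordering of values:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Multiply one feature by 100,000: a strictly monotone rescaling
X_scaled = X.copy()
X_scaled[:, 0] *= 100_000

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# The trees pick the same partitions, so predictions agree exactly
same = bool((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())
print(same)  # True
```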


The Baseline: Logistic Regression

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    classification_report
)

numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic regression with scaling (mandatory for LR)
lr_preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
     categorical_features),
])

lr_pipe = Pipeline([
    ('preprocessor', lr_preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 20),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=10000,
        random_state=42,
        class_weight='balanced',
    ))
])

lr_pipe.fit(X_train, y_train)
y_pred_lr = lr_pipe.predict(X_test)
y_prob_lr = lr_pipe.predict_proba(X_test)[:, 1]

print("LOGISTIC REGRESSION BASELINE:")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_lr):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_lr):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_lr):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_lr):.4f}")
print(f"  AUC-PR:      {average_precision_score(y_test, y_prob_lr):.4f}")
LOGISTIC REGRESSION BASELINE:
  Accuracy:    0.7812
  Precision:   0.0624
  Recall:      0.6377
  F1:          0.1136
  AUC-ROC:     0.7936
  AUC-PR:      0.0891

Precision of 0.06. For every 100 turbines flagged for maintenance, only 6 actually need it. The maintenance team would spend most of its time inspecting healthy turbines.


The Random Forest

from sklearn.ensemble import RandomForestClassifier

# Encode categoricals for RF
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=5,
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train_encoded, y_train)

y_pred_rf = rf.predict(X_test_encoded)
y_prob_rf = rf.predict_proba(X_test_encoded)[:, 1]

print("RANDOM FOREST:")
print(f"  OOB acc:     {rf.oob_score_:.4f}")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_rf):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_rf):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_rf):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_rf):.4f}")
print(f"  AUC-PR:      {average_precision_score(y_test, y_prob_rf):.4f}")
RANDOM FOREST:
  OOB acc:     0.9794
  Accuracy:    0.9802
  Precision:   0.5714
  Recall:      0.3478
  F1:          0.4324
  AUC-ROC:     0.8967
  AUC-PR:      0.4238

The AUC-ROC jumps from 0.794 to 0.897. But the real story is in AUC-PR (precision-recall AUC, which is more meaningful for imbalanced data): it goes from 0.089 to 0.424 --- nearly a 5x improvement.
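
A useful yardstick when reading AUC-PR: an uninformative model scores roughly the positive prevalence, while its AUC-ROC still sits near 0.5. A quick simulation at this problem's 2.3% prevalence shows the floor:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
y = (rng.random(30_000) < 0.023).astype(int)  # ~2.3% positives, like the fleet
scores = rng.random(30_000)                   # an uninformative model

print(f"chance AUC-ROC: {roc_auc_score(y, scores):.3f}")
print(f"chance AUC-PR:  {average_precision_score(y, scores):.3f}")
```

Against a chance floor of ~0.023, the logistic baseline's 0.089 is about 4x chance and the forest's 0.424 about 18x.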


Threshold Optimization for Maintenance Operations

The default 0.5 threshold is wrong for this problem. The maintenance team cares about two things:

  1. Catching failures early (high recall) --- missing a failure costs $85,000
  2. Not drowning in false alarms (reasonable precision) --- inspecting a healthy turbine costs $3,000 in labor

The cost-optimal threshold depends on the cost ratio:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob_rf)

# Cost analysis at different thresholds
cost_miss = 85000      # Cost of undetected failure
cost_false_alarm = 3000 # Cost of unnecessary inspection
n_actual_failures = y_test.sum()
n_total = len(y_test)

print("THRESHOLD ANALYSIS:")
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>8} {'F1':>6} {'Est. Monthly Cost':>18}")
print("-" * 56)

for t in [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50]:
    y_adj = (y_prob_rf >= t).astype(int)
    if y_adj.sum() == 0:
        continue
    p = precision_score(y_test, y_adj)
    r = recall_score(y_test, y_adj)
    f = f1_score(y_test, y_adj)

    # Rough monthly estimate: scale test-set counts down to the 1,200-turbine fleet
    scale = 1200 / n_total
    n_flagged = y_adj.sum() * scale
    true_positives = (y_adj & y_test.values).sum() * scale
    false_positives = (y_adj & ~y_test.values.astype(bool)).sum() * scale
    missed = n_actual_failures * scale - true_positives

    monthly_cost = (missed * cost_miss + false_positives * cost_false_alarm) / 1000

    print(f"{t:>10.2f} {p:>10.3f} {r:>8.3f} {f:>6.3f} ${monthly_cost:>15,.0f}k")
THRESHOLD ANALYSIS:
 Threshold  Precision   Recall     F1 Est. Monthly Cost
--------------------------------------------------------
      0.05      0.067    0.826  0.124 $          1,360k
      0.10      0.113    0.710  0.195 $          1,141k
      0.15      0.186    0.594  0.283 $          1,167k
      0.20      0.289    0.493  0.364 $          1,290k
      0.30      0.437    0.406  0.421 $          1,437k
      0.40      0.529    0.362  0.430 $          1,523k
      0.50      0.571    0.348  0.432 $          1,552k

With a missed failure costing roughly 28 times a false alarm, the cost-minimizing threshold is low: around 0.10-0.15, the model tolerates many unnecessary inspections in exchange for catching most of the $85,000 failures. Note that the F1-maximizing threshold (near 0.50) is far from the cost-minimizing one. The maintenance team should pick the operating point from the cost model, crew capacity, and budget, not from a generic classification metric.
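
For reference, under the simplified cost model used in the code above (a miss costs $85,000, a false alarm $3,000, and true positives are costed at zero), a perfectly calibrated model would be thresholded at the break-even probability:

```python
cost_miss = 85_000        # undetected failure
cost_false_alarm = 3_000  # unnecessary inspection

# Flag when expected miss cost exceeds expected false-alarm cost:
#   p * cost_miss > (1 - p) * cost_false_alarm
#   p > cost_false_alarm / (cost_false_alarm + cost_miss)
t_star = cost_false_alarm / (cost_false_alarm + cost_miss)
print(f"break-even threshold for a calibrated model: {t_star:.3f}")  # 0.034
```

Random Forest probabilities are rarely well calibrated, which is why the empirical sweep is still the right tool; calibration (e.g., isotonic regression) can bring the operating threshold closer to this theoretical value.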


Feature Importance: What Predicts Failure?

from sklearn.inspection import permutation_importance

# Permutation importance (impurity-based importance is biased toward
# high-cardinality features)
perm = permutation_importance(
    rf, X_test_encoded, y_test,
    n_repeats=10, scoring='roc_auc', random_state=42, n_jobs=-1
)

feature_names = X_train_encoded.columns
sorted_idx = np.argsort(perm.importances_mean)[::-1]

print("FEATURE IMPORTANCE (Permutation-Based, AUC-ROC):")
print("-" * 60)
for i in range(15):
    idx = sorted_idx[i]
    print(f"  {i+1:>2}. {feature_names[idx]:<35} "
          f"{perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")
FEATURE IMPORTANCE (Permutation-Based, AUC-ROC):
------------------------------------------------------------
   1. vibration_mean                      0.0412 +/- 0.0028
   2. vibration_std                       0.0356 +/- 0.0024
   3. oil_pressure_mean                   0.0298 +/- 0.0021
   4. temp_bearing_mean                   0.0267 +/- 0.0019
   5. vibration_trend_7d                  0.0234 +/- 0.0018
   6. days_since_maintenance              0.0198 +/- 0.0016
   7. temp_gearbox_mean                   0.0187 +/- 0.0015
   8. vibration_to_power_ratio            0.0156 +/- 0.0014
   9. temp_bearing_trend_7d               0.0142 +/- 0.0013
  10. maintenance_type_last_corrective    0.0128 +/- 0.0011
  11. turbine_age_years                   0.0097 +/- 0.0010
  12. vibration_max                       0.0086 +/- 0.0009
  13. cumulative_operating_hours          0.0072 +/- 0.0008
  14. model_type_SG3.4                    0.0058 +/- 0.0007
  15. oil_pressure_min                    0.0043 +/- 0.0006

The feature importance tells a coherent engineering story:

  1. Vibration features dominate. Mean vibration, vibration variability, and the vibration-to-power ratio are the top predictors. This aligns with mechanical engineering knowledge: bearing degradation manifests as increased vibration before failure.

  2. Temperature matters, but less than vibration. Bearing and gearbox temperatures rank 4th and 7th. Temperature increases often follow vibration increases as a secondary indicator.

  3. Trend features are important. The 7-day vibration trend (rank 5) and temperature trend (rank 9) indicate that changes over time are predictive, not just absolute values. A turbine with vibration_mean=10 that was at 8 last week is more concerning than one that has been at 10 for months.

  4. Maintenance history matters. Days since last maintenance (rank 6) and whether the last maintenance was corrective vs. scheduled (rank 10) both contribute. Turbines that recently needed corrective repair are at higher risk of recurrence.

  5. Categorical features contribute without special treatment. Model type (rank 14) and maintenance type (rank 10) were one-hot encoded and included naturally. Trees do not care that these started as strings.
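
The permutation-based choice above is deliberate: the forest's built-in feature_importances_ (impurity-based) systematically inflates high-cardinality features, even pure-noise ones. A small synthetic demonstration (not the turbine data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)                             # genuinely predictive
noise_id = rng.integers(0, 1000, size=n).astype(float)  # high-cardinality noise
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([signal, noise_id])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1).fit(X_tr, y_tr)

# Impurity importance gives the noise ID substantial credit...
print("impurity:   ", rf.feature_importances_.round(3))
# ...permutation importance on held-out data does not
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean.round(3))
```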


Why Trees Won Here: The Threshold Effect

The key reason trees outperform logistic regression on sensor data is the threshold effect. Let us make it concrete:

# Examine the vibration-failure relationship
# In the real data, risk jumps when vibration exceeds ~10
bins = [0, 5, 7, 8, 9, 10, 12, 15, 20, 50]
df_analysis = pd.DataFrame({
    'vibration_mean': X_test['vibration_mean'],
    'failure': y_test,
})
df_analysis['vibration_bin'] = pd.cut(df_analysis['vibration_mean'], bins=bins)

failure_rates = df_analysis.groupby('vibration_bin')['failure'].agg(['mean', 'count'])
failure_rates.columns = ['failure_rate', 'n_samples']

print("FAILURE RATE BY VIBRATION LEVEL:")
print(f"{'Vibration Range':<20} {'Failure Rate':>14} {'N Samples':>12}")
print("-" * 48)
for idx, row in failure_rates.iterrows():
    print(f"{str(idx):<20} {row['failure_rate']:>14.1%} {row['n_samples']:>12,.0f}")
FAILURE RATE BY VIBRATION LEVEL:
Vibration Range        Failure Rate    N Samples
------------------------------------------------
(0, 5]                         0.5%        1,189
(5, 7]                         0.9%        1,653
(7, 8]                         1.4%          842
(8, 9]                         2.1%          688
(9, 10]                        3.0%          517
(10, 12]                       4.8%          554
(12, 15]                       7.1%          355
(15, 20]                      11.2%          147
(20, 50]                      16.8%           55

The relationship is non-linear. Failure rate barely increases from vibration 0 to 8, then accelerates rapidly above 10. A logistic regression fits a smooth S-curve through this data --- it cannot capture the sharp elbow. A decision tree splits at vibration = 10 and immediately separates high-risk from low-risk turbines.

# Demonstrate: a single split captures what LR needs many coefficients for
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a depth-1 tree on just vibration
tree_demo = DecisionTreeClassifier(max_depth=1, random_state=42)
tree_demo.fit(X_train[['vibration_mean']], y_train)
print("Single-split tree (vibration only):")
print(export_text(tree_demo, feature_names=['vibration_mean']))

# Compare to full RF on just vibration
from sklearn.metrics import roc_auc_score

rf_vib_only = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_vib_only.fit(X_train[['vibration_mean']], y_train)

from sklearn.linear_model import LogisticRegression
lr_vib_only = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(random_state=42, max_iter=1000))
])
lr_vib_only.fit(X_train[['vibration_mean']], y_train)

print(f"\nSingle-feature AUC comparison:")
print(f"  Logistic Regression: {roc_auc_score(y_test, lr_vib_only.predict_proba(X_test[['vibration_mean']])[:, 1]):.4f}")
print(f"  Random Forest:       {roc_auc_score(y_test, rf_vib_only.predict_proba(X_test[['vibration_mean']])[:, 1]):.4f}")
Single-split tree (vibration only):
|--- vibration_mean <= 10.24
|   |--- class: 0
|--- vibration_mean >  10.24
|   |--- class: 0

Single-feature AUC comparison:
  Logistic Regression: 0.6823
  Random Forest:       0.7156

Even with a single feature, the Random Forest extracts more signal because it is not constrained to a logistic curve. Note that both leaves of the depth-1 tree predict class 0: with a 2.3% failure rate, the majority class wins on both sides of the split. The split still matters because the leaf failure probabilities differ sharply, and ranking metrics like AUC respond to those probabilities, not to the hard class labels.
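
This connects back to the earlier claim that logistic regression needs a manually engineered indicator feature. In this synthetic sketch with a step-shaped risk at x = 10 (not the turbine data), handing the logistic model the x > 10 indicator noticeably improves its fit, measured by test log loss:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
x = rng.lognormal(2.0, 0.4, n)
p = np.where(x > 10, 0.30, 0.02)  # risk jumps sharply at the threshold
y = rng.binomial(1, p)

features = {
    'raw': x.reshape(-1, 1),
    'raw + indicator': np.column_stack([x, (x > 10).astype(float)]),
}
losses = {}
for name, X in features.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    losses[name] = log_loss(y_te, lr.predict_proba(X_te)[:, 1])
    print(f"{name:<16} test log loss: {losses[name]:.4f}")
```

The catch, of course, is that someone must already know the threshold sits at 10. A tree finds it automatically.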


Practical Deployment Considerations

Model Monitoring

Sensor data drifts. Turbines age. Seasonal wind patterns change. The maintenance team should monitor:

# Example: track feature distribution drift
print("DEPLOYMENT MONITORING CHECKLIST:")
print("-" * 50)
checks = [
    ("Feature drift", "Compare weekly feature distributions to training baseline"),
    ("Prediction drift", "Monitor predicted failure probability distribution"),
    ("Calibration", "Track actual failure rate vs. predicted probabilities"),
    ("Feature importance", "Recompute monthly; flag if top 5 ranking changes"),
    ("OOB score", "Retrain monthly and track OOB for degradation"),
]
for name, desc in checks:
    print(f"  {name:<25} {desc}")
DEPLOYMENT MONITORING CHECKLIST:
--------------------------------------------------
  Feature drift             Compare weekly feature distributions to training baseline
  Prediction drift          Monitor predicted failure probability distribution
  Calibration               Track actual failure rate vs. predicted probabilities
  Feature importance        Recompute monthly; flag if top 5 ranking changes
  OOB score                 Retrain monthly and track OOB for degradation
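
The first checklist item can be automated with a two-sample Kolmogorov-Smirnov test. A minimal sketch with synthetic baseline and weekly samples (the drift size is illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.lognormal(2.0, 0.4, 5_000)    # vibration_mean at training time
this_week = rng.lognormal(2.2, 0.4, 1_200)   # hypothetical drifted week

stat, p_value = ks_2samp(baseline, this_week)
print(f"KS statistic = {stat:.3f}, p = {p_value:.1e}")
if p_value < 0.01:
    print("feature drift detected: investigate before trusting predictions")
```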

Retraining Cadence

With 1,200 turbines generating daily observations, the model accumulates ~36,000 new labeled examples per month (once the 7-day failure window passes). Retrain monthly and compare OOB score to the previous version. If it drops by more than 0.01, investigate whether the feature distributions have shifted.

Interpretability for Maintenance Crews

The field crew does not care about AUC scores. They want to know: why was this turbine flagged? While a Random Forest cannot provide a single decision path, you can:

  1. Show the top 3 features driving the prediction for each flagged turbine (using SHAP values --- covered in Chapter 19)
  2. Provide the historical sensor readings that triggered the flag
  3. Compare the flagged turbine's sensor profile to known failure patterns
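
A lightweight version of point 3 that needs no extra libraries: rank a flagged turbine's sensors by how far they sit from a healthy-fleet baseline, in standard-deviation units. The baseline numbers below are illustrative:

```python
import pandas as pd

# Hypothetical healthy-fleet baseline statistics (illustrative numbers only)
baseline = pd.DataFrame(
    {'mean': [8.0, 65.0, 45.0], 'std': [3.0, 8.0, 5.0]},
    index=['vibration_mean', 'temp_bearing_mean', 'oil_pressure_mean'],
)

def sensor_profile(readings: pd.Series, baseline: pd.DataFrame, top_k: int = 3) -> pd.Series:
    """Rank sensors by distance from the healthy baseline, in std-dev units."""
    z = (readings - baseline['mean']) / baseline['std']
    return z.abs().sort_values(ascending=False).head(top_k)

flagged = pd.Series({'vibration_mean': 15.5, 'temp_bearing_mean': 83.0,
                     'oil_pressure_mean': 44.0})
print(sensor_profile(flagged, baseline))
```

A crew member reads this as: vibration is 2.5 standard deviations above normal, bearing temperature 2.3 above, oil pressure unremarkable.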

Key Takeaways from This Case Study

  1. Trees handle mixed feature types naturally. Numeric sensors, categorical metadata, and derived ratios all enter the model without scaling or special encoding. This reduces pipeline complexity and the chance of preprocessing errors.

  2. Non-linear threshold effects favor trees over linear models. Sensor failure signatures often involve sharp transitions (vibration above 10, temperature above 80) that trees capture with a single split but linear models approximate poorly.

  3. Feature importance provides actionable engineering insights. The importance ranking --- vibration first, then temperature, then maintenance history --- aligns with domain expertise and helps the maintenance team prioritize which sensors to monitor.

  4. Threshold selection is an operational decision. The model produces probabilities. The maintenance scheduling system decides which turbines to inspect based on available crew capacity, parts inventory, and weather windows.

  5. The AUC-PR metric is more informative than AUC-ROC for imbalanced problems. With a 2.3% failure rate, AUC-ROC can look good even for mediocre models. AUC-PR directly measures the tradeoff the maintenance team cares about: how many flagged turbines actually need repair.


This case study supports Chapter 13: Tree-Based Methods. Return to the chapter for the complete treatment of decision trees and Random Forests.