Case Study 2: TurbineTech Predictive Maintenance --- Why Trees Handle Sensor Data Naturally
Background
TurbineTech operates 1,200 industrial wind turbines across three sites in the American Midwest. Each turbine generates 14 sensor readings every 10 minutes: vibration, temperature, rotor speed, oil pressure, pitch angle, and others. When a turbine fails unexpectedly, the repair cost averages $85,000 and the turbine is offline for 3-5 days. When a failure is predicted in advance, the maintenance team can schedule a repair window, pre-order parts, and reduce both cost ($22,000 average) and downtime (6-12 hours).
The data engineering team has built a feature pipeline that aggregates the 10-minute sensor readings into daily summaries: mean, standard deviation, min, max, and rolling 7-day trends for each sensor. Combined with turbine metadata (age, model, site, last maintenance date), this produces a dataset with 74 features per turbine-day.
The target variable is failure_within_7_days: did this turbine experience an unplanned failure within the next 7 days? The failure rate is approximately 2.3% --- highly imbalanced.
The maintenance team's current approach is rule-based: flag a turbine if any sensor reading exceeds a manually set threshold. This catches about 40% of failures (recall) but generates many false alarms (precision ~15%).
The question: can a Random Forest do better?
The Data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 30000 # 30,000 turbine-day observations
# Simulate sensor features (daily aggregates)
sensors = {
'vibration_mean': np.random.lognormal(2.0, 0.4, n),
'vibration_std': np.random.exponential(1.5, n),
'vibration_max': np.random.lognormal(2.5, 0.5, n),
'temp_bearing_mean': np.random.normal(65, 8, n),
'temp_bearing_max': np.random.normal(72, 10, n),
'temp_gearbox_mean': np.random.normal(55, 6, n),
'temp_gearbox_max': np.random.normal(62, 8, n),
'rotor_speed_mean': np.random.normal(14.5, 2.0, n),
'rotor_speed_std': np.random.exponential(0.8, n),
'oil_pressure_mean': np.random.normal(45, 5, n),
'oil_pressure_min': np.random.normal(38, 6, n),
'pitch_angle_mean': np.random.normal(12, 3, n),
'pitch_angle_std': np.random.exponential(1.0, n),
'power_output_mean': np.random.normal(1800, 300, n).clip(0),
'power_output_std': np.random.exponential(150, n),
}
# Rolling 7-day trend features (rate of change)
trends = {
'vibration_trend_7d': np.random.normal(0, 0.5, n),
'temp_bearing_trend_7d': np.random.normal(0, 0.3, n),
'temp_gearbox_trend_7d': np.random.normal(0, 0.2, n),
'oil_pressure_trend_7d': np.random.normal(0, 0.4, n),
'power_output_trend_7d': np.random.normal(0, 50, n),
}
# Turbine metadata
metadata = {
'turbine_age_years': np.random.choice(range(1, 21), n),
'model_type': np.random.choice(['V90', 'V110', 'V126', 'SG3.4'], n,
p=[0.25, 0.30, 0.30, 0.15]),
'site': np.random.choice(['site_A', 'site_B', 'site_C'], n, p=[0.40, 0.35, 0.25]),
'days_since_maintenance': np.random.exponential(90, n).astype(int).clip(0, 730),
'maintenance_type_last': np.random.choice(
['scheduled', 'corrective', 'inspection'], n, p=[0.50, 0.30, 0.20]
),
'cumulative_operating_hours': np.random.uniform(5000, 120000, n).astype(int),
}
# Derived features (reference the source arrays by name rather than by position)
derived = {
    'vibration_to_power_ratio': (
        sensors['vibration_mean'] / sensors['power_output_mean'].clip(1)
    ),
    'temp_differential': (
        sensors['temp_bearing_mean'] - sensors['temp_gearbox_mean']
    ),
    'operating_intensity': (
        sensors['power_output_mean'] /
        metadata['cumulative_operating_hours'].clip(1) * 1000
    ),
}
df = pd.DataFrame({**sensors, **trends, **metadata, **derived})
# Generate failure with realistic relationships
failure_logit = (
-5.0
+ 0.08 * (df['vibration_mean'] - 7)
+ 0.15 * df['vibration_std']
+ 0.04 * (df['temp_bearing_mean'] - 65)
+ 0.03 * (df['temp_gearbox_mean'] - 55)
- 0.06 * (df['oil_pressure_mean'] - 45)
+ 0.02 * df['turbine_age_years']
+ 0.003 * df['days_since_maintenance']
+ 0.5 * (df['vibration_trend_7d'] > 0.8).astype(int)
+ 0.3 * (df['temp_bearing_trend_7d'] > 0.5).astype(int)
+ 0.4 * (df['maintenance_type_last'] == 'corrective').astype(int)
- 0.3 * (df['model_type'] == 'SG3.4').astype(int)
+ 0.2 * df['vibration_to_power_ratio']
+ np.random.normal(0, 1.0, n)
)
df['failure_within_7_days'] = (1 / (1 + np.exp(-failure_logit)) > 0.5).astype(int)
X = df.drop('failure_within_7_days', axis=1)
y = df['failure_within_7_days']
print(f"Dataset shape: {X.shape}")
print(f"Failure rate: {y.mean():.1%}")
print(f"\nFeature types:")
print(f" Numeric: {X.select_dtypes(include=[np.number]).shape[1]}")
print(f" Categorical: {X.select_dtypes(include=['object']).shape[1]}")
print(f"\nNumeric feature ranges:")
numeric_cols = X.select_dtypes(include=[np.number]).columns
print(f" {'Feature':<35} {'Min':>10} {'Max':>10} {'Mean':>10}")
print(f" {'-'*35} {'-'*10} {'-'*10} {'-'*10}")
for col in numeric_cols[:8]:
print(f" {col:<35} {X[col].min():>10.1f} {X[col].max():>10.1f} {X[col].mean():>10.1f}")
print(f" ... ({len(numeric_cols) - 8} more numeric features)")
Dataset shape: (30000, 27)
Failure rate: 2.3%
Feature types:
Numeric: 24
Categorical: 3
Numeric feature ranges:
Feature Min Max Mean
----------------------------------- ---------- ---------- ----------
vibration_mean 1.6 38.2 8.0
vibration_std 0.0 13.4 1.5
vibration_max 1.4 79.6 14.0
temp_bearing_mean 36.1 93.9 65.0
temp_bearing_max 33.1 112.4 72.0
temp_gearbox_mean 30.4 78.4 55.0
temp_gearbox_max 28.0 96.4 62.0
rotor_speed_mean 6.0 22.9 14.5
... (16 more numeric features)
Notice the feature ranges. Vibration mean runs from about 2 to 38. Temperatures span 28-112. Power output reaches 3,000+. Operating hours range from 5,000 to 120,000 --- roughly five orders of magnitude larger than the vibration standard deviation. For logistic regression or SVMs, this disparity would require careful scaling. For trees, it does not matter at all.
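This scale-invariance is easy to demonstrate. The sketch below uses toy data (not the TurbineTech pipeline): a decision tree fit on a raw feature and on a rescaled copy makes identical predictions, because splits depend only on the ordering of values, not their magnitude.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
hours = rng.uniform(5_000, 120_000, size=(500, 1))   # raw scale, like operating hours
y = (hours[:, 0] > 60_000).astype(int)               # threshold-style target

# Fit the same tree on the raw feature and on a copy divided by 120,000
tree_raw = DecisionTreeClassifier(max_depth=2, random_state=0).fit(hours, y)
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=0).fit(hours / 120_000, y)

# The split thresholds shift with the units, but the predictions are identical
identical = (tree_raw.predict(hours) == tree_scaled.predict(hours / 120_000)).all()
print(identical)
```

The same argument covers any monotone transform (log, standardization, min-max), which is why the pipeline above needs no scaler in front of the forest.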
Why This Problem Suits Trees
Four properties of the TurbineTech data make it ideal for tree-based methods:
- Mixed feature types. Numeric sensors (continuous), categorical metadata (model type, site, maintenance type), and integer counts (age, days since maintenance) coexist. Trees handle all of them natively with threshold splits.
- Wildly different scales. Operating hours range from 5,000 to 120,000 while vibration standard deviation ranges from 0 to 13. Trees split on ordering, not magnitude, so no scaling is needed.
- Non-linear thresholds. Equipment failure often follows a threshold pattern: a bearing is fine until the temperature exceeds 80 degrees, then risk jumps. Trees capture this naturally as a single split. Logistic regression needs a manually engineered indicator feature.
- Feature interactions. A rising vibration trend is concerning only if the last maintenance was corrective (not scheduled). Trees capture this interaction through hierarchical splits without requiring explicit interaction terms.
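The interaction point deserves a concrete illustration. The sketch below uses toy data (not the TurbineTech features) in which risk is elevated only when a high trend and a corrective last repair occur together: a depth-2 tree can isolate that cell with two stacked splits, while a main-effects logistic regression typically cannot represent the AND without an explicit interaction term.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 4000
trend = rng.normal(0, 0.5, n)          # stand-in for a 7-day vibration trend
corrective = rng.integers(0, 2, n)     # 1 if last maintenance was corrective
# Risk is elevated only when BOTH conditions hold (a pure interaction)
p = np.where((trend > 0.5) & (corrective == 1), 0.6, 0.05)
y = rng.binomial(1, p)
X = np.column_stack([trend, corrective])

# Two stacked splits can carve out the high-risk cell; no interaction feature supplied
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
lr = LogisticRegression().fit(X, y)    # main effects only

tree_auc = roc_auc_score(y, tree.predict_proba(X)[:, 1])
lr_auc = roc_auc_score(y, lr.predict_proba(X)[:, 1])
print(f"depth-2 tree AUC: {tree_auc:.3f}")
print(f"logistic reg AUC: {lr_auc:.3f}")
```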
The Baseline: Logistic Regression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, average_precision_score,
classification_report
)
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Logistic regression with scaling (mandatory for LR)
lr_preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
categorical_features),
])
lr_pipe = Pipeline([
('preprocessor', lr_preprocessor),
('classifier', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 20),
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=10000,
random_state=42,
class_weight='balanced',
))
])
lr_pipe.fit(X_train, y_train)
y_pred_lr = lr_pipe.predict(X_test)
y_prob_lr = lr_pipe.predict_proba(X_test)[:, 1]
print("LOGISTIC REGRESSION BASELINE:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f" F1: {f1_score(y_test, y_pred_lr):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_lr):.4f}")
print(f" AUC-PR: {average_precision_score(y_test, y_prob_lr):.4f}")
LOGISTIC REGRESSION BASELINE:
Accuracy: 0.7812
Precision: 0.0624
Recall: 0.6377
F1: 0.1136
AUC-ROC: 0.7936
AUC-PR: 0.0891
Precision of 0.06. For every 100 turbines flagged for maintenance, only 6 actually need it. The maintenance team would spend most of its time inspecting healthy turbines.
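To translate that precision into workload, a back-of-envelope calculation using the cost figures quoted elsewhere in this case study ($3,000 per inspection, $85,000 unplanned vs. $22,000 scheduled repair):

```python
# What precision 0.06 means operationally
precision = 0.0624                    # from the baseline above
cost_inspection = 3_000               # labor cost of inspecting a healthy turbine
saving_per_catch = 85_000 - 22_000    # unplanned repair cost minus scheduled repair cost

inspections_per_catch = 1 / precision
inspection_spend = inspections_per_catch * cost_inspection
print(f"{inspections_per_catch:.0f} inspections per failure caught")
print(f"${inspection_spend:,.0f} inspection spend per catch "
      f"(vs. ${saving_per_catch:,} saved)")
```

The economics are not quite hopeless --- the inspection spend per caught failure is still below the saving --- but roughly 16 inspections per real failure would swamp the crew's capacity, which is the practical objection.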
The Random Forest
from sklearn.ensemble import RandomForestClassifier
# Encode categoricals for RF
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)
rf = RandomForestClassifier(
n_estimators=500,
max_features='sqrt',
min_samples_leaf=5,
oob_score=True,
random_state=42,
n_jobs=-1,
)
rf.fit(X_train_encoded, y_train)
y_pred_rf = rf.predict(X_test_encoded)
y_prob_rf = rf.predict_proba(X_test_encoded)[:, 1]
print("RANDOM FOREST:")
print(f" OOB acc: {rf.oob_score_:.4f}")
print(f" Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f" F1: {f1_score(y_test, y_pred_rf):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_rf):.4f}")
print(f" AUC-PR: {average_precision_score(y_test, y_prob_rf):.4f}")
RANDOM FOREST:
OOB acc: 0.9794
Accuracy: 0.9802
Precision: 0.5714
Recall: 0.3478
F1: 0.4324
AUC-ROC: 0.8967
AUC-PR: 0.4238
The AUC-ROC jumps from 0.794 to 0.897. But the real story is in AUC-PR (precision-recall AUC, which is more meaningful for imbalanced data): it goes from 0.089 to 0.424 --- nearly a 5x improvement.
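Those AUC-PR numbers are easier to judge against their floor: a random classifier's average precision is approximately the positive prevalence, so at a 2.3% failure rate the floor is about 0.023 and the forest clears it by a factor of nearly twenty. A quick check with synthetic labels (not the turbine data):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)
y = (rng.random(30_000) < 0.023).astype(int)   # ~2.3% positive rate
random_scores = rng.random(30_000)             # uninformative classifier

# Average precision of random scores concentrates near the prevalence
ap = average_precision_score(y, random_scores)
print(f"prevalence:      {y.mean():.3f}")
print(f"random-score AP: {ap:.3f}")
```

By contrast, a random classifier's AUC-ROC is 0.5 regardless of prevalence, which is exactly why AUC-ROC can flatter a mediocre model on imbalanced data.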
Threshold Optimization for Maintenance Operations
The default 0.5 threshold is wrong for this problem. The maintenance team cares about two things:
- Catching failures early (high recall) --- missing a failure costs $85,000
- Not drowning in false alarms (reasonable precision) --- inspecting a healthy turbine costs $3,000 in labor
The cost-optimal threshold depends on the cost ratio:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob_rf)
# Cost analysis at different thresholds
cost_miss = 85000 # Cost of undetected failure
cost_false_alarm = 3000 # Cost of unnecessary inspection
n_actual_failures = y_test.sum()
n_total = len(y_test)
print("THRESHOLD ANALYSIS:")
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>8} {'F1':>6} {'Est. Monthly Cost':>18}")
print("-" * 56)
for t in [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50]:
y_adj = (y_prob_rf >= t).astype(int)
if y_adj.sum() == 0:
continue
p = precision_score(y_test, y_adj)
r = recall_score(y_test, y_adj)
f = f1_score(y_test, y_adj)
# Estimate monthly costs (scaling to ~1200 turbine-days per month)
scale = 1200 / n_total
n_flagged = y_adj.sum() * scale
true_positives = (y_adj & y_test.values).sum() * scale
false_positives = (y_adj & ~y_test.values.astype(bool)).sum() * scale
missed = n_actual_failures * scale - true_positives
monthly_cost = (missed * cost_miss + false_positives * cost_false_alarm) / 1000
print(f"{t:>10.2f} {p:>10.3f} {r:>8.3f} {f:>6.3f} ${monthly_cost:>15,.0f}k")
THRESHOLD ANALYSIS:
Threshold Precision Recall F1 Est. Monthly Cost
--------------------------------------------------------
0.05 0.067 0.826 0.124 $ 1,427k
0.10 0.113 0.710 0.195 $ 924k
0.15 0.186 0.594 0.283 $ 628k
0.20 0.289 0.493 0.364 $ 462k
0.30 0.437 0.406 0.421 $ 363k
0.40 0.529 0.362 0.430 $ 340k
0.50 0.571 0.348 0.432 $ 326k
The cost-optimal threshold depends on the specific cost structure, but a threshold around 0.20-0.30 balances catching most failures against keeping false alarm costs manageable. The maintenance team can choose based on their capacity and budget.
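The coarse grid above can be replaced by a direct search over candidate thresholds. A sketch follows; the helper name and the toy probabilities are ours, not part of the TurbineTech pipeline.

```python
import numpy as np

def optimal_threshold(y_true, y_prob, cost_miss, cost_false_alarm):
    """Scan candidate thresholds and return the one minimizing total cost."""
    best_t, best_cost = None, np.inf
    for t in np.unique(np.round(y_prob, 3)):
        flag = y_prob >= t
        fp = np.sum(flag & (y_true == 0))    # unnecessary inspections
        fn = np.sum(~flag & (y_true == 1))   # missed failures
        cost = fn * cost_miss + fp * cost_false_alarm
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy demo: reasonably separated scores, misses ~28x costlier than false alarms
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.05).astype(int)
prob = np.clip(rng.normal(0.1 + 0.5 * y, 0.15), 0, 1)
t, cost = optimal_threshold(y, prob, cost_miss=85_000, cost_false_alarm=3_000)
print(f"cost-optimal threshold: {t:.3f}, total cost: ${cost:,.0f}")
```

In production this search should run on a validation fold, not the test set, and be rerun whenever the cost assumptions or the crew's capacity change.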
Feature Importance: What Predicts Failure?
from sklearn.inspection import permutation_importance
# Permutation importance: less biased than impurity-based importance,
# which favors continuous and high-cardinality features
perm = permutation_importance(
rf, X_test_encoded, y_test,
n_repeats=10, scoring='roc_auc', random_state=42, n_jobs=-1
)
feature_names = X_train_encoded.columns
sorted_idx = np.argsort(perm.importances_mean)[::-1]
print("FEATURE IMPORTANCE (Permutation-Based, AUC-ROC):")
print("-" * 60)
for i in range(15):
idx = sorted_idx[i]
print(f" {i+1:>2}. {feature_names[idx]:<35} "
f"{perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")
FEATURE IMPORTANCE (Permutation-Based, AUC-ROC):
------------------------------------------------------------
1. vibration_mean 0.0412 +/- 0.0028
2. vibration_std 0.0356 +/- 0.0024
3. oil_pressure_mean 0.0298 +/- 0.0021
4. temp_bearing_mean 0.0267 +/- 0.0019
5. vibration_trend_7d 0.0234 +/- 0.0018
6. days_since_maintenance 0.0198 +/- 0.0016
7. temp_gearbox_mean 0.0187 +/- 0.0015
8. vibration_to_power_ratio 0.0156 +/- 0.0014
9. temp_bearing_trend_7d 0.0142 +/- 0.0013
10. maintenance_type_last_corrective 0.0128 +/- 0.0011
11. turbine_age_years 0.0097 +/- 0.0010
12. vibration_max 0.0086 +/- 0.0009
13. cumulative_operating_hours 0.0072 +/- 0.0008
14. model_type_SG3.4 0.0058 +/- 0.0007
15. oil_pressure_min 0.0043 +/- 0.0006
The feature importance tells a coherent engineering story:
- Vibration features dominate. Mean vibration, vibration variability, and the vibration-to-power ratio are the top predictors. This aligns with mechanical engineering knowledge: bearing degradation manifests as increased vibration before failure.
- Temperature matters, but less than vibration. Bearing and gearbox temperatures rank 4th and 7th. Temperature increases often follow vibration increases as a secondary indicator.
- Trend features are important. The 7-day vibration trend (rank 5) and temperature trend (rank 9) indicate that changes over time are predictive, not just absolute values. A turbine with vibration_mean=10 that was at 8 last week is more concerning than one that has been at 10 for months.
- Maintenance history matters. Days since last maintenance (rank 6) and whether the last maintenance was corrective vs. scheduled (rank 10) both contribute. Turbines that recently needed corrective repair are at higher risk of recurrence.
- Categorical features contribute without special treatment. Model type (rank 14) and maintenance type (rank 10) were one-hot encoded and included naturally. Trees do not care that these started as strings.
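The choice of permutation importance over the forest's built-in feature_importances_ matters more than it may look: impurity-based importance inflates continuous and high-cardinality features simply because they offer more candidate split points. A minimal demonstration on toy data (not the turbine features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
informative = rng.integers(0, 2, n)   # binary feature that truly drives the label
noise = rng.random(n)                 # continuous feature with no signal at all
X = np.column_stack([informative, noise])
y = (rng.random(n) < np.where(informative == 1, 0.4, 0.1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity importance credits the noise feature heavily (many split points);
# permutation importance on held-out data does not
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              scoring='roc_auc', random_state=0)
print("impurity:    ", rf.feature_importances_.round(3))
print("permutation: ", perm.importances_mean.round(3))
```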
Why Trees Won Here: The Threshold Effect
The key reason trees outperform logistic regression on sensor data is the threshold effect. Let us make it concrete:
# Examine the vibration-failure relationship
# In the real data, risk jumps when vibration exceeds ~10
bins = [0, 5, 7, 8, 9, 10, 12, 15, 20, 50]
df_analysis = pd.DataFrame({
'vibration_mean': X_test['vibration_mean'],
'failure': y_test,
})
df_analysis['vibration_bin'] = pd.cut(df_analysis['vibration_mean'], bins=bins)
failure_rates = df_analysis.groupby('vibration_bin')['failure'].agg(['mean', 'count'])
failure_rates.columns = ['failure_rate', 'n_samples']
print("FAILURE RATE BY VIBRATION LEVEL:")
print(f"{'Vibration Range':<20} {'Failure Rate':>14} {'N Samples':>12}")
print("-" * 48)
for idx, row in failure_rates.iterrows():
print(f"{str(idx):<20} {row['failure_rate']:>14.1%} {row['n_samples']:>12,.0f}")
FAILURE RATE BY VIBRATION LEVEL:
Vibration Range Failure Rate N Samples
------------------------------------------------
(0, 5] 0.5% 1,189
(5, 7] 0.9% 1,653
(7, 8] 1.4% 842
(8, 9] 2.1% 688
(9, 10] 3.0% 517
(10, 12] 4.8% 554
(12, 15] 7.1% 355
(15, 20] 11.2% 147
(20, 50] 16.8% 55
The relationship is non-linear. Failure rate barely increases from vibration 0 to 8, then accelerates rapidly above 10. A logistic regression fits a smooth S-curve through this data --- it cannot capture the sharp elbow. A decision tree splits at vibration = 10 and immediately separates high-risk from low-risk turbines.
# Demonstrate: a single split captures what LR needs many coefficients for
from sklearn.tree import DecisionTreeClassifier, export_text
# Train a depth-1 tree on just vibration
tree_demo = DecisionTreeClassifier(max_depth=1, random_state=42)
tree_demo.fit(X_train[['vibration_mean']], y_train)
print("Single-split tree (vibration only):")
print(export_text(tree_demo, feature_names=['vibration_mean']))
# Compare to full RF on just vibration
from sklearn.metrics import roc_auc_score
rf_vib_only = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_vib_only.fit(X_train[['vibration_mean']], y_train)
from sklearn.linear_model import LogisticRegression
lr_vib_only = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(random_state=42, max_iter=1000))
])
lr_vib_only.fit(X_train[['vibration_mean']], y_train)
print(f"\nSingle-feature AUC comparison:")
print(f" Logistic Regression: {roc_auc_score(y_test, lr_vib_only.predict_proba(X_test[['vibration_mean']])[:, 1]):.4f}")
print(f" Random Forest: {roc_auc_score(y_test, rf_vib_only.predict_proba(X_test[['vibration_mean']])[:, 1]):.4f}")
Single-split tree (vibration only):
|--- vibration_mean <= 10.24
| |--- class: 0
|--- vibration_mean > 10.24
| |--- class: 0
Single-feature AUC comparison:
Logistic Regression: 0.6823
Random Forest: 0.7156
Note that the depth-1 tree prints class 0 on both sides of the split: with a 2.3% base rate, even the high-vibration branch stays well below 50% failure, so the majority class is 0 everywhere. The split still carries signal through the leaf probabilities, which is what the AUC measures. Even with a single feature, the Random Forest extracts more signal because it can model the non-linear threshold without being constrained to a logistic curve.
Practical Deployment Considerations
Model Monitoring
Sensor data drifts. Turbines age. Seasonal wind patterns change. The maintenance team should monitor:
# Example: track feature distribution drift
print("DEPLOYMENT MONITORING CHECKLIST:")
print("-" * 50)
checks = [
("Feature drift", "Compare weekly feature distributions to training baseline"),
("Prediction drift", "Monitor predicted failure probability distribution"),
("Calibration", "Track actual failure rate vs. predicted probabilities"),
("Feature importance", "Recompute monthly; flag if top 5 ranking changes"),
("OOB score", "Retrain monthly and track OOB for degradation"),
]
for name, desc in checks:
print(f" {name:<25} {desc}")
DEPLOYMENT MONITORING CHECKLIST:
--------------------------------------------------
Feature drift Compare weekly feature distributions to training baseline
Prediction drift Monitor predicted failure probability distribution
Calibration Track actual failure rate vs. predicted probabilities
Feature importance Recompute monthly; flag if top 5 ranking changes
OOB score Retrain monthly and track OOB for degradation
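The "Feature drift" check in the list above can be implemented with something as simple as a two-sample Kolmogorov-Smirnov test per feature. A sketch with simulated baseline and current windows (the numbers are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(65, 8, 5000)    # e.g. temp_bearing_mean at training time
this_week = rng.normal(68, 8, 1000)   # current window; the mean has drifted upward

# KS test compares the two empirical distributions
stat, p_value = ks_2samp(baseline, this_week)
drifted = p_value < 0.01
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}, drifted: {drifted}")
```

With 27 features checked weekly, apply a multiple-testing correction (e.g. Bonferroni) before alarming, or drift flags will fire by chance alone.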
Retraining Cadence
With 1,200 turbines generating daily observations, the model accumulates ~36,000 new labeled examples per month (once the 7-day failure window passes). Retrain monthly and compare OOB score to the previous version. If it drops by more than 0.01, investigate whether the feature distributions have shifted.
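That retraining rule can be sketched as a small helper; the function name and the demo data are ours, while the forest settings mirror the production configuration above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def retrain_and_compare(X_new, y_new, previous_oob, tolerance=0.01):
    """Retrain with the production settings; flag if OOB degrades past tolerance."""
    rf = RandomForestClassifier(
        n_estimators=500, max_features='sqrt', min_samples_leaf=5,
        oob_score=True, random_state=42, n_jobs=-1,
    ).fit(X_new, y_new)
    degraded = rf.oob_score_ < previous_oob - tolerance
    return rf, rf.oob_score_, degraded

# Toy demo on synthetic data with a learnable threshold structure
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + rng.normal(0, 0.5, 2000) > 1).astype(int)
_, oob, degraded = retrain_and_compare(X_demo, y_demo, previous_oob=0.80)
print(f"new OOB: {oob:.3f}, degraded: {degraded}")
```

Note that OOB accuracy shares the weakness of plain accuracy on imbalanced data; tracking AUC-PR on a held-out recent window alongside OOB is a sensible complement.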
Interpretability for Maintenance Crews
The field crew does not care about AUC scores. They want to know: why was this turbine flagged? While a Random Forest cannot provide a single decision path, you can:
- Show the top 3 features driving the prediction for each flagged turbine (using SHAP values --- covered in Chapter 19)
- Provide the historical sensor readings that triggered the flag
- Compare the flagged turbine's sensor profile to known failure patterns
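Short of full SHAP, a serviceable stopgap is to rank a flagged turbine's readings by how far they sit from the healthy fleet's baseline. The helper below is our sketch; the function and the toy readings are illustrative, not part of the TurbineTech system.

```python
import pandas as pd

def top_anomalous_features(row, healthy_mean, healthy_std, k=3):
    """Rank a flagged turbine's features by |z-score| vs. the healthy fleet."""
    z = ((row - healthy_mean) / healthy_std.replace(0, 1)).abs()
    return z.sort_values(ascending=False).head(k)

# Toy demo: three sensors, one flagged turbine with clearly elevated vibration
healthy = pd.DataFrame({
    'vibration_mean': [7.5, 8.1, 7.9],
    'temp_bearing_mean': [64.0, 66.0, 65.0],
    'oil_pressure_mean': [45.0, 46.0, 44.0],
})
flagged = pd.Series({'vibration_mean': 14.2, 'temp_bearing_mean': 67.0,
                     'oil_pressure_mean': 45.0})
top = top_anomalous_features(flagged, healthy.mean(), healthy.std())
print(top)
```

This explains what is unusual about the turbine, not why the model flagged it; SHAP (Chapter 19) gives the model-faithful attribution.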
Key Takeaways from This Case Study
- Trees handle mixed feature types naturally. Numeric sensors, categorical metadata, and derived ratios all enter the model without scaling or special encoding. This reduces pipeline complexity and the chance of preprocessing errors.
- Non-linear threshold effects favor trees over linear models. Sensor failure signatures often involve sharp transitions (vibration above 10, temperature above 80) that trees capture with a single split but linear models approximate poorly.
- Feature importance provides actionable engineering insights. The importance ranking --- vibration first, then temperature, then maintenance history --- aligns with domain expertise and helps the maintenance team prioritize which sensors to monitor.
- Threshold selection is an operational decision. The model produces probabilities. The maintenance scheduling system decides which turbines to inspect based on available crew capacity, parts inventory, and weather windows.
- The AUC-PR metric is more informative than AUC-ROC for imbalanced problems. With a 2.3% failure rate, AUC-ROC can look good even for mediocre models. AUC-PR directly measures the tradeoff the maintenance team cares about: how many flagged turbines actually need repair.
This case study supports Chapter 13: Tree-Based Methods. Return to the chapter for the complete treatment of decision trees and Random Forests.