Case Study 1: TurbineTech Sensor Anomaly Detection
Predicting Bearing Failure from Vibration Sensor Data
Background
TurbineTech operates 1,200 wind turbines across three regions. Each turbine has 847 sensors monitoring vibration, temperature, pressure, rotational speed, acoustic emissions, and electrical characteristics. A bearing failure costs $180,000 in emergency repair and lost generation --- plus the risk of cascading damage to the gearbox, which raises the cost to $600,000.
The maintenance team currently relies on scheduled inspections every 90 days. Between inspections, they are blind. In the past 18 months, 23 unplanned bearing failures occurred. Forensic analysis showed that 19 of those 23 failures exhibited detectable vibration anomalies 5-14 days before failure. The maintenance team wants a system that flags those anomalies automatically.
The challenge: they have 18 months of sensor data (approximately 47 million readings across the fleet) and only 23 confirmed failure events. The labeled data is insufficient for supervised learning. This is an anomaly detection problem.
The Data
For this case study, we work with a single turbine's bearing sensor array: 8 features sampled every 10 minutes.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate 90 days of 10-minute readings (12,960 readings)
n_normal = 12800
n_degrading = 130 # gradual degradation over ~22 hours before failure
n_failure = 30 # acute failure signature
# Normal operation: correlated sensor readings with diurnal variation
hours = np.tile(np.arange(0, 24, 10/60), n_normal // 144 + 1)[:n_normal]
diurnal_temp = 3 * np.sin(2 * np.pi * hours / 24) # temperature varies with time of day
normal = pd.DataFrame({
    'vibration_rms_mm_s': np.random.normal(4.2, 0.8, n_normal) + 0.1 * diurnal_temp,
    'vibration_peak_mm_s': np.random.normal(12.5, 2.5, n_normal) + 0.3 * diurnal_temp,
    'temperature_bearing_c': np.random.normal(62, 5, n_normal) + diurnal_temp,
    'temperature_ambient_c': np.random.normal(18, 8, n_normal) + diurnal_temp * 2,
    'pressure_hydraulic_bar': np.random.normal(32.0, 1.5, n_normal),
    'rotation_rpm': np.random.normal(1800, 40, n_normal),
    'acoustic_emission_db': np.random.normal(72, 3, n_normal),
    'current_draw_amp': np.random.normal(15.2, 1.1, n_normal),
})
# Gradual degradation: vibration and temperature slowly increase
# Simulates bearing wear developing over ~22 hours
degradation_progress = np.linspace(0, 1, n_degrading)
degrading = pd.DataFrame({
    'vibration_rms_mm_s': 4.2 + degradation_progress * 4.5 + np.random.normal(0, 0.5, n_degrading),
    'vibration_peak_mm_s': 12.5 + degradation_progress * 12.0 + np.random.normal(0, 1.5, n_degrading),
    'temperature_bearing_c': 62 + degradation_progress * 18 + np.random.normal(0, 2, n_degrading),
    'temperature_ambient_c': np.random.normal(18, 8, n_degrading),
    'pressure_hydraulic_bar': 32.0 - degradation_progress * 3.0 + np.random.normal(0, 0.8, n_degrading),
    'rotation_rpm': 1800 - degradation_progress * 120 + np.random.normal(0, 20, n_degrading),
    'acoustic_emission_db': 72 + degradation_progress * 15 + np.random.normal(0, 2, n_degrading),
    'current_draw_amp': 15.2 + degradation_progress * 4.0 + np.random.normal(0, 0.6, n_degrading),
})
# Acute failure: extreme readings
failure = pd.DataFrame({
    'vibration_rms_mm_s': np.random.normal(11.0, 2.0, n_failure),
    'vibration_peak_mm_s': np.random.normal(35.0, 5.0, n_failure),
    'temperature_bearing_c': np.random.normal(95, 8, n_failure),
    'temperature_ambient_c': np.random.normal(18, 8, n_failure),
    'pressure_hydraulic_bar': np.random.normal(27, 3, n_failure),
    'rotation_rpm': np.random.normal(1650, 80, n_failure),
    'acoustic_emission_db': np.random.normal(92, 5, n_failure),
    'current_draw_amp': np.random.normal(21, 2.5, n_failure),
})
data = pd.concat([normal, degrading, failure], ignore_index=True)
labels = np.array(
    [0] * n_normal +
    [1] * n_degrading +  # degradation is anomalous
    [1] * n_failure      # failure is anomalous
)
print(f"Dataset: {len(data)} readings")
print(f"Normal: {n_normal} ({n_normal/len(data)*100:.1f}%)")
print(f"Degrading: {n_degrading} ({n_degrading/len(data)*100:.1f}%)")
print(f"Failure: {n_failure} ({n_failure/len(data)*100:.1f}%)")
print(f"Total anomalous: {labels.sum()} ({labels.mean()*100:.2f}%)")
Step 1: Statistical Baseline
Start simple. Compute per-feature z-scores and flag readings where any feature exceeds 3 standard deviations.
from scipy import stats
z_scores = np.abs(stats.zscore(data.to_numpy()))  # ndarray, so positional indexing below works
z_flags_per_feature = z_scores > 3
# Flag if ANY feature exceeds threshold
z_any_flag = z_flags_per_feature.any(axis=1)
print("Z-score baseline (threshold = 3):")
print(f" Readings flagged: {z_any_flag.sum()}")
print(f" True anomalies caught: {(z_any_flag & (labels == 1)).sum()} / {labels.sum()}")
print(f" False positives: {(z_any_flag & (labels == 0)).sum()}")
print(f" Precision: {precision_score(labels, z_any_flag):.3f}")
print(f" Recall: {recall_score(labels, z_any_flag):.3f}")
# Which features trigger the most flags?
print("\nFlags per feature:")
for i, col in enumerate(data.columns):
    n_flags = z_flags_per_feature[:, i].sum()
    true_flags = (z_flags_per_feature[:, i] & (labels == 1)).sum()
    print(f"  {col:<30s}: {n_flags:>4d} flags ({true_flags} true anomalies)")
The z-score baseline catches the acute failure readings (where sensor values are extreme) but misses the early degradation phase (where values are elevated but not yet extreme). This is the critical gap: by the time z-scores flag a problem, the bearing may be hours from failure. We need methods that detect subtle multivariate shifts earlier.
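One standard way to catch such multivariate shifts is the Mahalanobis distance, which measures deviation relative to the correlation structure of normal data. A minimal self-contained sketch (synthetic two-feature data, not the turbine schema): a reading can sit under 3 sigma on every individual sensor yet lie far outside the joint normal cloud.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features, loosely analogous to RMS and peak vibration.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
normal = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# A reading that is moderate on each axis but violates the correlation:
# one feature high while the other is low.
reading = np.array([2.0, -2.0])

mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis(x):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Per-feature z-scores stay below the usual 3-sigma threshold...
z = np.abs((reading - mean) / normal.std(axis=0))
print(f"max per-feature |z|: {z.max():.2f}")        # roughly 2

# ...but the Mahalanobis distance flags the reading clearly.
print(f"Mahalanobis distance: {mahalanobis(reading):.2f}")
```

Isolation Forest and autoencoders generalize this idea: both score a reading by how poorly it fits the joint distribution of normal operation, not each sensor in isolation.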
Step 2: Isolation Forest
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)
# We do not know the exact anomaly rate, so start with a reasonable estimate
iso = IsolationForest(
    n_estimators=200,
    max_samples=256,
    contamination=0.02,
    random_state=42,
)
iso_preds = iso.fit_predict(X_scaled)
iso_anomalies = iso_preds == -1
iso_scores = -iso.decision_function(X_scaled)
print("Isolation Forest (contamination=0.02):")
print(f" Readings flagged: {iso_anomalies.sum()}")
print(f" True anomalies caught: {(iso_anomalies & (labels == 1)).sum()} / {labels.sum()}")
print(f" False positives: {(iso_anomalies & (labels == 0)).sum()}")
print(f" Precision: {precision_score(labels, iso_anomalies):.3f}")
print(f" Recall: {recall_score(labels, iso_anomalies):.3f}")
print(f" AUC-ROC: {roc_auc_score(labels, iso_scores):.3f}")
print(f" Avg Precision: {average_precision_score(labels, iso_scores):.3f}")
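The contamination value is a guess, so it is worth checking how sensitive the flag count is to it. A self-contained sketch (synthetic 4-feature data with planted outliers, not the turbine readings) sweeping a few settings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 1,000 normal points plus 20 planted outliers far from the cluster.
X = np.vstack([rng.normal(0, 1, size=(1000, 4)),
               rng.normal(6, 1, size=(20, 4))])

for contamination in [0.005, 0.01, 0.02, 0.05]:
    iso = IsolationForest(n_estimators=100, contamination=contamination,
                          random_state=0)
    flags = iso.fit_predict(X) == -1
    print(f"contamination={contamination:.3f}: {flags.sum():3d} flagged, "
          f"{flags[-20:].sum()} of 20 planted outliers caught")
```

The contamination setting fixes how many points get flagged, not which ones rank highest; sweeping it shows at what alert volume the obvious anomalies are all captured.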
Can Isolation Forest Detect Early Degradation?
The critical question is whether the model flags the degradation phase --- the period where intervention can prevent catastrophic failure.
# Examine anomaly scores across the three phases
normal_scores = iso_scores[:n_normal]
degrading_scores = iso_scores[n_normal:n_normal + n_degrading]
failure_scores = iso_scores[n_normal + n_degrading:]
print("Anomaly score distribution by phase:")
print(f"  Normal:    mean={normal_scores.mean():.3f}, "
      f"std={normal_scores.std():.3f}, "
      f"max={normal_scores.max():.3f}")
print(f"  Degrading: mean={degrading_scores.mean():.3f}, "
      f"std={degrading_scores.std():.3f}, "
      f"max={degrading_scores.max():.3f}")
print(f"  Failure:   mean={failure_scores.mean():.3f}, "
      f"std={failure_scores.std():.3f}, "
      f"max={failure_scores.max():.3f}")
# When in the degradation sequence does the score exceed the threshold?
iso_threshold = np.percentile(iso_scores, 98)  # 2% contamination threshold
first_detection_idx = None
for i, score in enumerate(degrading_scores):
    if score > iso_threshold:
        first_detection_idx = i
        break
if first_detection_idx is not None:
    # Count only the degradation readings remaining before acute failure onset
    readings_before_failure = n_degrading - first_detection_idx
    hours_warning = readings_before_failure * 10 / 60
    print(f"\nFirst degradation detection: reading {first_detection_idx} of {n_degrading}")
    print(f"Warning time before failure onset: {hours_warning:.1f} hours")
    print(f"Degradation progress at detection: {first_detection_idx/n_degrading*100:.0f}%")
else:
    print("\nDegradation not detected before failure phase")
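As an aside, the detection-latency scan above can be written without an explicit loop: on a boolean array, np.argmax returns the index of the first True. A small sketch with placeholder scores:

```python
import numpy as np

scores = np.array([0.10, 0.22, 0.18, 0.61, 0.74, 0.93])  # placeholder scores
threshold = 0.5

exceeds = scores > threshold
# np.argmax also returns 0 for an all-False array, so guard with .any()
first_idx = int(np.argmax(exceeds)) if exceeds.any() else None
print(first_idx)  # 3
```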
Step 3: Autoencoder Approach
For TurbineTech, we have extensive historical data from normal operation. This makes a semi-supervised autoencoder a natural fit: train on normal data, detect deviations.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Train the autoencoder on normal data only. Note: this refits the shared
# scaler on the normal subset; because normal readings dominate the dataset
# the scaling barely changes, but a dedicated StandardScaler per model would
# be cleaner in production.
X_normal_scaled = scaler.fit_transform(normal)
X_all_scaled = scaler.transform(data)
input_dim = X_normal_scaled.shape[1]
class SensorAutoencoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 3),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 8),
            nn.ReLU(),
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
torch.manual_seed(42)
model = SensorAutoencoder(input_dim)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
X_tensor = torch.FloatTensor(X_normal_scaled)
dataset = TensorDataset(X_tensor, X_tensor)
loader = DataLoader(dataset, batch_size=256, shuffle=True)
# Training
model.train()
for epoch in range(80):
    total_loss = 0
    for batch_x, batch_target in loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/80, Loss: {total_loss/len(loader):.6f}")
# Compute reconstruction error
model.eval()
with torch.no_grad():
    X_all_tensor = torch.FloatTensor(X_all_scaled)
    reconstructed = model(X_all_tensor)
    recon_error = torch.mean((X_all_tensor - reconstructed) ** 2, dim=1).numpy()
# Threshold from normal training data
with torch.no_grad():
    train_reconstructed = model(X_tensor)
    train_error = torch.mean((X_tensor - train_reconstructed) ** 2, dim=1).numpy()
ae_threshold = np.percentile(train_error, 97)
ae_anomalies = recon_error > ae_threshold
print("\nAutoencoder (semi-supervised):")
print(f" Readings flagged: {ae_anomalies.sum()}")
print(f" True anomalies caught: {(ae_anomalies & (labels == 1)).sum()} / {labels.sum()}")
print(f" Precision: {precision_score(labels, ae_anomalies):.3f}")
print(f" Recall: {recall_score(labels, ae_anomalies):.3f}")
print(f" AUC-ROC: {roc_auc_score(labels, recon_error):.3f}")
Feature-Level Reconstruction Error
One advantage of autoencoders: you can decompose the anomaly score into per-feature contributions.
# Per-feature reconstruction error for anomalous readings
with torch.no_grad():
    per_feature_error = (X_all_tensor - reconstructed) ** 2
# Average per-feature error for each phase
feature_names = data.columns.tolist()
normal_fe = per_feature_error[:n_normal].mean(dim=0).numpy()
degrading_fe = per_feature_error[n_normal:n_normal + n_degrading].mean(dim=0).numpy()
failure_fe = per_feature_error[n_normal + n_degrading:].mean(dim=0).numpy()
print("Per-feature reconstruction error (mean):")
print(f"{'Feature':<30s} {'Normal':>8s} {'Degrading':>10s} {'Failure':>8s} {'Ratio':>8s}")
print("-" * 68)
for i, name in enumerate(feature_names):
    ratio = failure_fe[i] / max(normal_fe[i], 1e-6)
    print(f"{name:<30s} {normal_fe[i]:>8.4f} {degrading_fe[i]:>10.4f} "
          f"{failure_fe[i]:>8.4f} {ratio:>7.1f}x")
This tells the maintenance team which sensors are driving the anomaly --- critical for diagnosis. "Bearing B7 flagged: vibration_rms 4.2x normal reconstruction error, acoustic_emission 3.8x normal" is actionable. "Bearing B7 flagged: anomaly score 0.87" is not.
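A small helper can render per-feature error ratios into exactly that kind of message. This is an illustrative sketch with hypothetical inputs (the function name and the numbers are made up for the example, not produced by the pipeline above):

```python
def explain_anomaly(bearing_id, feature_errors, baseline_errors, top_k=2):
    """Render the top-k feature error ratios as a maintenance message."""
    ratios = {name: feature_errors[name] / max(baseline_errors[name], 1e-6)
              for name in feature_errors}
    top = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    parts = ", ".join(f"{name} {ratio:.1f}x normal" for name, ratio in top)
    return f"Bearing {bearing_id} flagged: {parts} reconstruction error"

# Hypothetical per-feature errors vs. the normal baseline
msg = explain_anomaly(
    "B7",
    {"vibration_rms": 0.84, "acoustic_emission": 0.76, "rotation_rpm": 0.11},
    {"vibration_rms": 0.20, "acoustic_emission": 0.20, "rotation_rpm": 0.10},
)
print(msg)
# Bearing B7 flagged: vibration_rms 4.2x normal, acoustic_emission 3.8x normal reconstruction error
```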
Step 4: Method Comparison
# Compare all methods head-to-head
results = pd.DataFrame({
    'Method': ['Z-score (any feature)', 'Isolation Forest', 'Autoencoder'],
    'AUC-ROC': [
        roc_auc_score(labels, z_scores.max(axis=1)),
        roc_auc_score(labels, iso_scores),
        roc_auc_score(labels, recon_error),
    ],
    'Avg Precision': [
        average_precision_score(labels, z_scores.max(axis=1)),
        average_precision_score(labels, iso_scores),
        average_precision_score(labels, recon_error),
    ],
})
print(results.to_string(index=False))
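Note that precision_recall_curve is imported at the top but not yet used; it exposes the individual operating points behind the average-precision summary. A self-contained sketch with synthetic scores (not the model outputs above) showing how to pick a threshold for a target precision:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(7)
# 1% anomalies, imperfectly separated from normal readings.
labels = np.zeros(2000, dtype=int)
labels[:20] = 1
scores = np.where(labels == 1,
                  rng.normal(2.0, 1.0, size=2000),
                  rng.normal(0.0, 1.0, size=2000))

precision, recall, thresholds = precision_recall_curve(labels, scores)
print(f"Average precision: {average_precision_score(labels, scores):.3f}")

# First (lowest) threshold whose operating point reaches 80% precision.
ok = precision[:-1] >= 0.8
if ok.any():
    idx = int(np.argmax(ok))
    print(f"score threshold {thresholds[idx]:.2f} -> "
          f"precision {precision[idx]:.2f}, recall {recall[idx]:.2f}")
```

With anomalies under 2% of readings, these precision-recall operating points are far more informative than AUC-ROC, which is dominated by the majority class.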
Degradation Timeline Analysis
The most important evaluation for TurbineTech is not aggregate AUC --- it is how early the model detects degradation.
# Track anomaly scores through the degradation sequence
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)
readings = np.arange(n_degrading + n_failure)
hours_elapsed = readings * 10 / 60
# Isolation Forest scores
combined_scores_iso = np.concatenate([degrading_scores, failure_scores])
axes[0].plot(hours_elapsed, combined_scores_iso, 'b-', alpha=0.7, label='IF score')
axes[0].axhline(y=iso_threshold, color='r', linestyle='--', label=f'Threshold (p98)')
axes[0].axvline(x=n_degrading * 10 / 60, color='gray', linestyle=':', label='Failure onset')
axes[0].set_ylabel('Isolation Forest Score')
axes[0].set_title('Anomaly Scores During Bearing Degradation')
axes[0].legend()
# Autoencoder reconstruction error
combined_scores_ae = recon_error[n_normal:]
axes[1].plot(hours_elapsed, combined_scores_ae, 'g-', alpha=0.7, label='AE recon error')
axes[1].axhline(y=ae_threshold, color='r', linestyle='--', label=f'Threshold (p97)')
axes[1].axvline(x=n_degrading * 10 / 60, color='gray', linestyle=':', label='Failure onset')
axes[1].set_ylabel('Reconstruction Error')
axes[1].set_xlabel('Hours Since Degradation Onset')
axes[1].legend()
plt.tight_layout()
plt.savefig('degradation_timeline.png', dpi=150, bbox_inches='tight')
plt.show()
Step 5: Production Alert System Design
The final deliverable is not a model --- it is an alert system that the maintenance team can act on.
def generate_turbine_alert(reading, iso_model, scaler, autoencoder, feature_names,
                           iso_threshold, ae_threshold):
    """
    Generate an actionable alert for a single turbine reading.
    Returns None if the reading is normal, or an alert dict if anomalous.
    """
    reading_scaled = scaler.transform(reading.values.reshape(1, -1))
    # Isolation Forest score (higher = more anomalous)
    iso_score = -iso_model.decision_function(reading_scaled)[0]
    # Autoencoder reconstruction error
    with torch.no_grad():
        tensor = torch.FloatTensor(reading_scaled)
        reconstructed = autoencoder(tensor)
        recon_err = torch.mean((tensor - reconstructed) ** 2, dim=1).item()
        per_feature_err = ((tensor - reconstructed) ** 2).squeeze().numpy()
    # Dual-threshold: alert if either method flags
    is_anomalous = (iso_score > iso_threshold) or (recon_err > ae_threshold)
    if not is_anomalous:
        return None
    # Identify top contributing features
    feature_contributions = pd.Series(per_feature_err, index=feature_names)
    top_features = feature_contributions.nlargest(3)
    # Determine severity: CRITICAL requires both methods to agree strongly
    if iso_score > iso_threshold * 1.5 and recon_err > ae_threshold * 2:
        severity = 'CRITICAL'
    elif iso_score > iso_threshold * 1.2 or recon_err > ae_threshold * 1.5:
        severity = 'WARNING'
    else:
        severity = 'WATCH'
    return {
        'severity': severity,
        'iso_score': round(iso_score, 4),
        'recon_error': round(recon_err, 4),
        'top_features': {
            name: round(float(reading[name]), 2)
            for name in top_features.index
        },
        'top_feature_errors': {
            name: round(err, 4)
            for name, err in top_features.items()
        },
    }
# Test on a degrading reading (75% through degradation)
test_reading = degrading.iloc[int(n_degrading * 0.75)]
alert = generate_turbine_alert(
    test_reading, iso, scaler, model,
    feature_names, iso_threshold, ae_threshold
)
if alert:
    print(f"ALERT [{alert['severity']}]")
    print(f"  Isolation Forest score: {alert['iso_score']}")
    print(f"  Reconstruction error: {alert['recon_error']}")
    print("  Top contributing sensors:")
    for sensor, value in alert['top_features'].items():
        error = alert['top_feature_errors'][sensor]
        print(f"    {sensor}: {value} (error: {error})")
Results and Recommendations
What Worked
- Isolation Forest detected degradation before failure. With a 2% contamination threshold, the model flagged anomalous readings during the degradation phase --- providing hours of warning before acute failure.
- The autoencoder provided feature-level explanations. Per-feature reconstruction error told the maintenance team which sensors were driving the anomaly, enabling targeted diagnosis.
- The dual-method design (Isolation Forest + autoencoder) balanced sensitivity and precision. Either method can raise an alert, so a single model's blind spots are covered, but escalation to CRITICAL requires strong agreement from both --- keeping high-severity false positives low.
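The union-versus-agreement tradeoff can be seen directly with two simulated detectors whose false positives are independent. This sketch is illustrative (the recall and false-positive rates are made up, not measured TurbineTech numbers); at the reading level, agreement trades some recall for a large precision gain, which is why it suits high-severity escalation rather than the initial alert.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_anom = 5000, 50
labels = np.zeros(n, dtype=bool)
labels[:n_anom] = True

def simulated_detector(hit_rate, fpr):
    """Flags true anomalies with prob hit_rate, normal readings with prob fpr."""
    return np.where(labels,
                    rng.random(n) < hit_rate,
                    rng.random(n) < fpr)

det_a = simulated_detector(hit_rate=0.9, fpr=0.03)
det_b = simulated_detector(hit_rate=0.9, fpr=0.03)

for name, flags in [("union (A or B)", det_a | det_b),
                    ("intersection (A and B)", det_a & det_b)]:
    tp = (flags & labels).sum()
    fp = (flags & ~labels).sum()
    print(f"{name}: precision={tp / max(tp + fp, 1):.2f}, "
          f"recall={tp / n_anom:.2f}")
```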
What Requires Caution
- The threshold determines the tradeoff. A lower threshold catches degradation earlier but generates more false alarms. TurbineTech must decide: is a 10-hour warning with 5 false alarms per turbine per month acceptable? Or do they prefer a 4-hour warning with 1 false alarm per month?
- Seasonal and operational variation matters. Wind speed, ambient temperature, and load conditions affect sensor readings. A reading that is anomalous in summer may be normal in winter. The model should be trained on data spanning all operating conditions, or separate models should be built for different regimes.
- Concept drift is real. Bearings age. The "normal" baseline for a 2-year-old turbine is different from a new one. The model should be retrained periodically or designed to adapt to gradual baseline shifts.
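The threshold tradeoff can be put in front of TurbineTech as a concrete table. A self-contained sketch with simulated score streams (a month of normal scores plus a linear degradation ramp, standing in for real model outputs):

```python
import numpy as np

rng = np.random.default_rng(11)
# One month of 10-minute readings for a healthy turbine ...
normal_scores = rng.normal(0.0, 1.0, size=4320)
# ... and a degradation episode: scores ramp up over 130 readings (~22 h).
degrading_scores = np.linspace(0.0, 6.0, 130) + rng.normal(0.0, 0.5, size=130)

for pct in [99.9, 99.5, 99.0, 98.0]:
    threshold = np.percentile(normal_scores, pct)
    false_alarms = (normal_scores > threshold).sum()  # per turbine-month
    exceeds = degrading_scores > threshold
    if exceeds.any():
        first = int(np.argmax(exceeds))
        warning_h = (len(degrading_scores) - first) * 10 / 60
    else:
        warning_h = 0.0
    print(f"p{pct}: ~{false_alarms} false alarms/month, "
          f"{warning_h:.1f} h warning")
```

Lower percentiles buy earlier warning at the cost of more false alarms; the table makes that an explicit business decision rather than a hidden modeling default.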
Deployment Recommendation
Deploy the dual-method system with three severity levels (WATCH, WARNING, CRITICAL). Route CRITICAL alerts to the on-call maintenance team immediately. Aggregate WARNING alerts into a daily digest for scheduled inspection planning. Track WATCH-level readings over time to identify gradual trends. Retrain models quarterly using the most recent 6 months of data.
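The routing policy reads naturally as a small dispatch table keyed on severity. A sketch of the idea; the channel names and the alert shape are hypothetical, not an existing TurbineTech interface:

```python
from collections import defaultdict

# Hypothetical routing policy: severity level -> destination channel.
ROUTING = {
    'CRITICAL': 'page_oncall',   # immediate page to on-call maintenance
    'WARNING': 'daily_digest',   # batched for inspection planning
    'WATCH': 'trend_log',        # stored for long-term trend analysis
}

def route_alerts(alerts):
    """Group alerts by destination channel according to severity."""
    channels = defaultdict(list)
    for alert in alerts:
        channels[ROUTING[alert['severity']]].append(alert)
    return dict(channels)

alerts = [
    {'severity': 'WATCH', 'turbine': 'T-014'},
    {'severity': 'CRITICAL', 'turbine': 'T-102'},
    {'severity': 'WARNING', 'turbine': 'T-077'},
    {'severity': 'WATCH', 'turbine': 'T-014'},
]
routed = route_alerts(alerts)
for channel, items in sorted(routed.items()):
    print(f"{channel}: {len(items)} alert(s)")
```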
This case study uses the TurbineTech manufacturing scenario introduced in Chapter 22. Return to the chapter for algorithm details.