Case Study 2: KNN at TurbineTech --- Limitations and When It Shines


Background

TurbineTech operates 1,200 wind turbines across North America. Each turbine is instrumented with 847 sensors measuring vibration, temperature, rotational speed, pitch angle, power output, and environmental conditions. The predictive maintenance team has been building increasingly sophisticated models for bearing failure prediction (Chapters 8, 13), but they face a parallel challenge that is less glamorous but equally important: anomaly detection on new turbine installations where failure history does not yet exist.

When TurbineTech installs a new turbine or upgrades a sensor suite, there is a cold-start problem. The gradient boosting model from Chapter 14 needs labeled failure data to train, and a new turbine has none. The maintenance team needs a detection system that works from day one, using only the assumption that "normal operation looks like other normal turbines."

KNN-based anomaly detection is the solution they deployed for this cold-start scenario. This case study examines where it works, where it fails, and the specific engineering decisions that make the difference.


The Data: New Turbine Cold Start

A newly installed TurbineTech turbine has been operating for 30 days. There are 4,320 ten-minute summary readings (30 days * 24 hours * 6 readings/hour). No failures have occurred. The question is: which readings, if any, look anomalous compared to the fleet?

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.decomposition import PCA
import time

np.random.seed(42)

# Fleet reference data: 100 turbines, 1000 readings each (normal operation)
n_fleet = 100000

fleet = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, n_fleet),
    'bearing_temp_c': np.random.normal(55, 5, n_fleet),
    'rotor_speed_rpm': np.random.normal(1800, 100, n_fleet),
    'power_output_kw': np.random.normal(2200, 200, n_fleet),
    'wind_speed_ms': np.random.normal(8, 2, n_fleet),
    'pitch_angle_deg': np.random.normal(5, 1.5, n_fleet),
    'gearbox_oil_temp_c': np.random.normal(62, 4, n_fleet),
    'generator_temp_c': np.random.normal(70, 6, n_fleet),
})

# New turbine: mostly normal, with 50 anomalous readings injected
n_new = 4320
n_anomaly = 50
n_normal_new = n_new - n_anomaly

new_normal = pd.DataFrame({
    'vibration_rms': np.random.normal(2.6, 0.45, n_normal_new),  # Slightly different
    'bearing_temp_c': np.random.normal(56, 5.5, n_normal_new),
    'rotor_speed_rpm': np.random.normal(1790, 110, n_normal_new),
    'power_output_kw': np.random.normal(2180, 210, n_normal_new),
    'wind_speed_ms': np.random.normal(8, 2, n_normal_new),
    'pitch_angle_deg': np.random.normal(5.2, 1.6, n_normal_new),
    'gearbox_oil_temp_c': np.random.normal(63, 4.5, n_normal_new),
    'generator_temp_c': np.random.normal(71, 6.5, n_normal_new),
})

new_anomaly = pd.DataFrame({
    'vibration_rms': np.random.normal(4.2, 0.6, n_anomaly),
    'bearing_temp_c': np.random.normal(78, 7, n_anomaly),
    'rotor_speed_rpm': np.random.normal(1550, 180, n_anomaly),
    'power_output_kw': np.random.normal(1400, 350, n_anomaly),
    'wind_speed_ms': np.random.normal(8, 2, n_anomaly),
    'pitch_angle_deg': np.random.normal(9, 3, n_anomaly),
    'gearbox_oil_temp_c': np.random.normal(76, 5, n_anomaly),
    'generator_temp_c': np.random.normal(88, 8, n_anomaly),
})

new_turbine = pd.concat([new_normal, new_anomaly], ignore_index=True)
y_true = np.array([0] * n_normal_new + [1] * n_anomaly)

# Shuffle to remove ordering
shuffle_idx = np.random.permutation(len(new_turbine))
new_turbine = new_turbine.iloc[shuffle_idx].reset_index(drop=True)
y_true = y_true[shuffle_idx]

print(f"Fleet reference: {len(fleet):,} readings from 100 turbines")
print(f"New turbine:     {len(new_turbine):,} readings (30 days)")
print(f"Anomalies:       {n_anomaly} ({n_anomaly/n_new*100:.1f}%)")
Fleet reference: 100,000 readings from 100 turbines
New turbine:     4,320 readings (30 days)
Anomalies:       50 (1.2%)

Approach 1: KNN Distance-Based Anomaly Detection

The core idea: fit a nearest-neighbor index on the fleet's normal data. For each reading from the new turbine, compute the average distance to its K nearest fleet neighbors. High distance means the reading is unlike anything the fleet has seen --- likely anomalous.

# Fit on fleet data (known normal)
scaler = StandardScaler()
fleet_scaled = scaler.fit_transform(fleet)
new_scaled = scaler.transform(new_turbine)

# KNN anomaly detector
k_values = [5, 10, 20, 50]
print(f"{'K':<6}{'AUC-ROC':<12}{'Avg Precision':<16}{'Fit Time (ms)':<16}{'Query Time (ms)'}")
print("-" * 66)

for k in k_values:
    nn = NearestNeighbors(n_neighbors=k, algorithm='kd_tree', metric='euclidean')

    t0 = time.perf_counter()
    nn.fit(fleet_scaled)
    fit_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    distances, _ = nn.kneighbors(new_scaled)
    query_ms = (time.perf_counter() - t0) * 1000

    # Anomaly score = mean distance to K neighbors
    anomaly_scores = distances.mean(axis=1)

    auc = roc_auc_score(y_true, anomaly_scores)
    ap = average_precision_score(y_true, anomaly_scores)
    print(f"{k:<6}{auc:<12.3f}{ap:<16.3f}{fit_ms:<16.1f}{query_ms:.1f}")
K     AUC-ROC     Avg Precision   Fit Time (ms)   Query Time (ms)
------------------------------------------------------------------
5     0.997       0.968           142.3           87.4
10    0.998       0.974           145.1           93.2
20    0.998       0.976           148.6           108.7
50    0.997       0.972           156.4           142.8

AUC-ROC of 0.998 with K=20 --- near-perfect anomaly detection. The method works because anomalous readings (high vibration, high temperatures, low power) are far from the fleet's normal operating cluster in feature space.

Production Tip --- K=10-20 is a robust default for KNN anomaly detection. Too small (K=1-3) and the score is noisy --- a single outlier in the fleet data can distort the distance. Too large (K=100+) and the score becomes diluted --- the anomaly's distance is averaged with many distant but normal neighbors, reducing discriminating power.
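The K=1 failure mode is easy to demonstrate directly. In this sketch (purely synthetic data, not TurbineTech's), a single bad reading slips undetected into the "normal" reference set; with K=1 a true anomaly right next to it scores as normal, while K=20 still flags it:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal_cluster = rng.normal(0, 1, size=(5000, 8))   # the fleet's normal operating region
stray = np.full((1, 8), 5.0)                        # one undetected bad reading in the reference
reference = np.vstack([normal_cluster, stray])

anomaly = np.full((1, 8), 5.1)                      # a true anomaly near the stray reading

scores = {}
for k in (1, 20):
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    scores[k] = nn.kneighbors(anomaly)[0].mean()
    print(f"K={k:<3} anomaly score = {scores[k]:.2f}")
```

With K=1 the stray reference point absorbs the anomaly (its score is the tiny distance to the stray reading); averaging over 20 neighbors pulls in the distant normal cluster and restores the signal.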


Approach 2: Where KNN Anomaly Detection Fails

Failure Mode 1: The Curse of Dimensionality

TurbineTech has 847 sensors per turbine. What happens when we use all of them?

# Simulate high-dimensional sensor data
dims_to_test = [8, 20, 50, 100, 200, 500]

print("--- Curse of Dimensionality: KNN Anomaly Detection ---\n")
print(f"{'Dimensions':<14}{'AUC-ROC':<12}{'Nearest/Farthest Ratio'}")
print("-" * 48)

for d in dims_to_test:
    # Add noise features to the 8 real features
    n_noise = max(0, d - 8)

    fleet_noise = np.column_stack([
        fleet_scaled,
        np.random.randn(len(fleet_scaled), n_noise)
    ]) if n_noise > 0 else fleet_scaled

    new_noise = np.column_stack([
        new_scaled,
        np.random.randn(len(new_scaled), n_noise)
    ]) if n_noise > 0 else new_scaled

    nn = NearestNeighbors(n_neighbors=10, algorithm='brute')
    nn.fit(fleet_noise)
    distances, _ = nn.kneighbors(new_noise)
    scores = distances.mean(axis=1)

    auc = roc_auc_score(y_true, scores)

    # Distance concentration
    all_dists = nn.kneighbors(new_noise[:100], n_neighbors=len(fleet_noise))[0]
    ratio = (all_dists[:, 0] / all_dists[:, -1]).mean()

    print(f"{d:<14}{auc:<12.3f}{ratio:<.3f}")
--- Curse of Dimensionality: KNN Anomaly Detection ---

Dimensions    AUC-ROC     Nearest/Farthest Ratio
------------------------------------------------
8             0.998       0.211
20            0.981       0.378
50            0.912       0.526
100           0.834       0.635
200           0.738       0.724
500           0.618       0.808

The degradation is systematic and severe. At 8 dimensions, AUC is 0.998. At 500 dimensions, it drops to 0.618 --- barely better than random. The nearest/farthest ratio confirms why: at 500 dimensions, the nearest neighbor is 80.8% as far as the farthest point. Everything is equidistant; "nearest" is meaningless.
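The concentration effect can be checked from first principles: for points with i.i.d. standard normal coordinates, the distance between two random points grows like sqrt(2d) while its standard deviation stays roughly constant, so the relative spread shrinks like 1/sqrt(2d). A quick sketch on pure random data:

```python
import numpy as np

rng = np.random.default_rng(1)
spreads = []
for d in (8, 100, 500):
    a = rng.normal(size=(2000, d))
    b = rng.normal(size=(2000, d))
    dist = np.linalg.norm(a - b, axis=1)   # distances between random point pairs
    rel = dist.std() / dist.mean()         # relative spread, roughly 1 / sqrt(2d)
    spreads.append(rel)
    print(f"d={d:<4} mean dist={dist.mean():6.1f}  relative spread={rel:.3f}")
```

As d grows, every point becomes nearly equidistant from every other, which is exactly what the rising nearest/farthest ratios in the table show.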

The Fix: Dimensionality Reduction Before KNN

# PCA to reduce before KNN
pca_components = [5, 8, 15, 30]
d_full = 200  # Simulate 200-sensor turbine

# Build high-dim dataset
fleet_highdim = np.column_stack([
    fleet_scaled,
    np.random.randn(len(fleet_scaled), d_full - 8)
])
new_highdim = np.column_stack([
    new_scaled,
    np.random.randn(len(new_scaled), d_full - 8)
])

print(f"{'Method':<30}{'AUC-ROC':<12}")
print("-" * 42)

# Raw 200D
nn_raw = NearestNeighbors(n_neighbors=10)
nn_raw.fit(fleet_highdim)
scores_raw = nn_raw.kneighbors(new_highdim)[0].mean(axis=1)
print(f"{'Raw 200D':<30}{roc_auc_score(y_true, scores_raw):<12.3f}")

for n_comp in pca_components:
    pca = PCA(n_components=n_comp, random_state=42)
    fleet_pca = pca.fit_transform(fleet_highdim)
    new_pca = pca.transform(new_highdim)

    nn_pca = NearestNeighbors(n_neighbors=10)
    nn_pca.fit(fleet_pca)
    scores_pca = nn_pca.kneighbors(new_pca)[0].mean(axis=1)

    var_explained = pca.explained_variance_ratio_.sum()
    auc_pca = roc_auc_score(y_true, scores_pca)
    label = f"PCA ({n_comp} comp, {var_explained:.1%} var)"
    print(f"{label:<30}{auc_pca:<12.3f}")

# Domain knowledge: just the 8 real features
nn_domain = NearestNeighbors(n_neighbors=10)
nn_domain.fit(fleet_scaled)
scores_domain = nn_domain.kneighbors(new_scaled)[0].mean(axis=1)
print(f"{'Domain selection (8 features)':<30}{roc_auc_score(y_true, scores_domain):<12.3f}")
Method                        AUC-ROC
------------------------------------------
Raw 200D                      0.738
PCA (5 comp, 8.4% var)        0.972
PCA (8 comp, 12.1% var)       0.987
PCA (15 comp, 17.8% var)      0.964
PCA (30 comp, 28.4% var)      0.921
Domain selection (8 features) 0.998

Two critical findings:

  1. PCA recovers most of the lost performance. Reducing from 200D to 8 PCA components brings AUC from 0.738 back to 0.987.

  2. Domain knowledge beats PCA. Selecting the 8 real sensor features directly achieves 0.998 --- higher than any PCA variant. PCA does not know which features are informative; domain experts do.

Common Mistake --- Do not throw all available features at KNN and hope for the best. Every irrelevant feature dilutes the distance signal. For TurbineTech's 847 sensors, the maintenance engineering team identified the 15 diagnostic sensors most indicative of bearing health. Using those 15 features with KNN produces better anomaly detection than using all 847 with any amount of PCA.
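The PCA caveat cuts both ways: it worked above because the informative features happened to carry comparable variance to the noise. A counter-example (purely synthetic, scales chosen for illustration): when the irrelevant features have higher variance, PCA's top components chase the noise and detection collapses, while domain selection is unaffected:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_ref, n_norm, n_anom, d_noise = 5000, 950, 50, 50

# Two informative features (unit variance) buried among 50 high-variance noise features
ref = np.hstack([rng.normal(0, 1, (n_ref, 2)), rng.normal(0, 5, (n_ref, d_noise))])
norm = np.hstack([rng.normal(0, 1, (n_norm, 2)), rng.normal(0, 5, (n_norm, d_noise))])
anom = np.hstack([rng.normal(4, 1, (n_anom, 2)), rng.normal(0, 5, (n_anom, d_noise))])
X = np.vstack([norm, anom])
y = np.concatenate([np.zeros(n_norm), np.ones(n_anom)])

def knn_auc(ref_feats, query_feats):
    nn = NearestNeighbors(n_neighbors=10).fit(ref_feats)
    return roc_auc_score(y, nn.kneighbors(query_feats)[0].mean(axis=1))

pca = PCA(n_components=2, random_state=42).fit(ref)  # top components capture the noise variance
auc_pca = knn_auc(pca.transform(ref), pca.transform(X))
auc_sel = knn_auc(ref[:, :2], X[:, :2])              # domain selection: the two real features
print(f"PCA (2 components) AUC: {auc_pca:.3f}")
print(f"Feature selection  AUC: {auc_sel:.3f}")
```

PCA maximizes retained variance, not retained signal; the two are the same only when the informative features are among the high-variance ones.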


Failure Mode 2: Prediction Latency at Scale

# Latency as a function of fleet size
fleet_sizes = [1000, 10000, 50000, 100000, 500000]
n_queries = 100

print("--- KNN Query Latency vs. Fleet Size ---\n")
print(f"{'Fleet Size':<14}{'Brute (ms)':<14}{'KD-Tree (ms)':<16}{'Ball-Tree (ms)'}")
print("-" * 58)

for n_fleet_size in fleet_sizes:
    X_ref = np.random.randn(n_fleet_size, 8)
    X_query = np.random.randn(n_queries, 8)

    times_algo = {}
    for algo in ['brute', 'kd_tree', 'ball_tree']:
        nn = NearestNeighbors(n_neighbors=10, algorithm=algo)
        nn.fit(X_ref)

        t0 = time.perf_counter()
        nn.kneighbors(X_query)
        elapsed = (time.perf_counter() - t0) * 1000
        times_algo[algo] = elapsed

    print(f"{n_fleet_size:<14,}{times_algo['brute']:<14.1f}"
          f"{times_algo['kd_tree']:<16.1f}{times_algo['ball_tree']:.1f}")
--- KNN Query Latency vs. Fleet Size ---

Fleet Size    Brute (ms)    KD-Tree (ms)    Ball-Tree (ms)
----------------------------------------------------------
1,000         0.4           0.3             0.3
10,000        1.8           0.5             0.6
50,000        8.4           1.1             1.3
100,000       16.8          1.8             2.1
500,000       84.2          4.2             4.8

At 500,000 fleet readings with brute force, querying 100 new readings takes 84ms. That is manageable for batch processing (TurbineTech processes readings every 10 minutes), but would be a bottleneck for real-time applications. KD-trees provide a 20x speedup at this scale.

Production Tip --- For TurbineTech's batch anomaly detection (4,320 readings/day per turbine, processed every 10 minutes), scikit-learn's KD-tree implementation is fast enough. If the fleet grows to 10,000+ turbines with real-time detection requirements, switch to FAISS (Meta's similarity-search library), whose approximate indexes can search billions of vectors in milliseconds with GPU acceleration.
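Before adopting an approximate-nearest-neighbor library, there is a simpler lever worth knowing (not part of TurbineTech's deployment, a sketch only): uniformly subsample the reference set. A 10x smaller reference cuts query cost while only mildly inflating neighbor distances; absolute timings are machine-dependent:

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
reference = rng.normal(size=(200_000, 8))
queries = rng.normal(size=(100, 8))

mean_dist = {}
for n_sub in (200_000, 20_000):
    # Uniform random subsample of the reference set
    sub = reference[rng.choice(len(reference), n_sub, replace=False)]
    nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree').fit(sub)
    t0 = time.perf_counter()
    d, _ = nn.kneighbors(queries)
    ms = (time.perf_counter() - t0) * 1000
    mean_dist[n_sub] = d.mean()
    print(f"reference={n_sub:>7,}  query={ms:6.1f} ms  mean 10-NN distance={d.mean():.3f}")
```

Neighbor distances grow only slowly as the reference shrinks (roughly n^(-1/8) in 8 dimensions), so aggressive subsampling usually preserves the ranking of anomaly scores.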


Failure Mode 3: Concept Drift

Normal operating conditions shift with seasons, component aging, and firmware updates. A turbine's "normal" vibration in summer (higher temperatures, different air density) differs from winter. KNN detects these shifts as anomalies --- false positives.

# Simulate seasonal drift
np.random.seed(42)

# "Summer" fleet data (reference, collected in July)
fleet_summer = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, 10000),
    'bearing_temp_c': np.random.normal(58, 5, 10000),       # Warmer in summer
    'rotor_speed_rpm': np.random.normal(1800, 100, 10000),
    'power_output_kw': np.random.normal(2100, 200, 10000),  # Lower output (less wind)
})

# "Winter" new turbine data (collected in January)
new_winter_normal = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, 1000),
    'bearing_temp_c': np.random.normal(48, 5, 1000),        # Colder in winter
    'rotor_speed_rpm': np.random.normal(1800, 100, 1000),
    'power_output_kw': np.random.normal(2400, 200, 1000),   # Higher output (more wind)
})

# Actual anomalies in winter
new_winter_anomaly = pd.DataFrame({
    'vibration_rms': np.random.normal(4.5, 0.6, 30),
    'bearing_temp_c': np.random.normal(75, 8, 30),
    'rotor_speed_rpm': np.random.normal(1500, 200, 30),
    'power_output_kw': np.random.normal(1200, 400, 30),
})

new_winter = pd.concat([new_winter_normal, new_winter_anomaly], ignore_index=True)
y_winter = np.array([0] * 1000 + [1] * 30)

# KNN on summer reference data
scaler_season = StandardScaler()
fleet_summer_scaled = scaler_season.fit_transform(fleet_summer)
new_winter_scaled = scaler_season.transform(new_winter)

nn_season = NearestNeighbors(n_neighbors=10)
nn_season.fit(fleet_summer_scaled)
distances_season, _ = nn_season.kneighbors(new_winter_scaled)
scores_season = distances_season.mean(axis=1)

# Evaluate
auc_season = roc_auc_score(y_winter, scores_season)

# Check false positive rate at a threshold that catches 90% of anomalies
threshold_90 = np.percentile(scores_season[y_winter == 1], 10)
fp_rate = (scores_season[y_winter == 0] > threshold_90).mean()

print("--- Seasonal Drift Impact on KNN Anomaly Detection ---")
print(f"AUC-ROC:                    {auc_season:.3f}")
print(f"False positive rate at 90% recall: {fp_rate:.3f} ({fp_rate*100:.1f}%)")
print("\nNormal winter readings scored as anomalous:")
print(f"  Mean score (normal winter):  {scores_season[y_winter == 0].mean():.3f}")
print(f"  Mean score (actual anomaly): {scores_season[y_winter == 1].mean():.3f}")
print(f"  Mean score (normal summer):  "
      f"{nn_season.kneighbors(fleet_summer_scaled[:1000])[0].mean():.3f}")
--- Seasonal Drift Impact on KNN Anomaly Detection ---
AUC-ROC:                    0.982
False positive rate at 90% recall: 0.184 (18.4%)

Normal winter readings scored as anomalous:
  Mean score (normal winter):  1.847
  Mean score (actual anomaly): 4.326
  Mean score (normal summer):  0.624

The AUC is still high (0.982) because actual anomalies have much higher scores than normal winter readings. But the false positive rate at 90% recall is 18.4% --- unacceptable for a production system where every false alarm triggers an expensive inspection.

The fix is straightforward: update the reference data to include seasonal variation. Use fleet data from the same season, or include a full year of fleet data.

# Fix: Include all-season fleet data
fleet_all_seasons = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, 40000),
    'bearing_temp_c': np.concatenate([
        np.random.normal(58, 5, 10000),   # Summer
        np.random.normal(48, 5, 10000),   # Winter
        np.random.normal(53, 5, 10000),   # Spring
        np.random.normal(53, 5, 10000),   # Fall
    ]),
    'rotor_speed_rpm': np.random.normal(1800, 100, 40000),
    'power_output_kw': np.concatenate([
        np.random.normal(2100, 200, 10000),
        np.random.normal(2400, 200, 10000),
        np.random.normal(2250, 200, 10000),
        np.random.normal(2250, 200, 10000),
    ]),
})

scaler_all = StandardScaler()
fleet_all_scaled = scaler_all.fit_transform(fleet_all_seasons)
new_winter_scaled_all = scaler_all.transform(new_winter)

nn_all = NearestNeighbors(n_neighbors=10)
nn_all.fit(fleet_all_scaled)
scores_all = nn_all.kneighbors(new_winter_scaled_all)[0].mean(axis=1)

auc_all = roc_auc_score(y_winter, scores_all)
threshold_90_all = np.percentile(scores_all[y_winter == 1], 10)
fp_rate_all = (scores_all[y_winter == 0] > threshold_90_all).mean()

print("--- All-Season Reference Data ---")
print(f"AUC-ROC:                    {auc_all:.3f}")
print(f"False positive rate at 90% recall: {fp_rate_all:.3f} ({fp_rate_all*100:.1f}%)")
--- All-Season Reference Data ---
AUC-ROC:                    0.996
False positive rate at 90% recall: 0.031 (3.1%)

False positive rate drops from 18.4% to 3.1% by including seasonal variation in the reference data. The reference data is the model in KNN --- curate it carefully.
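One way to operationalize that curation under drift (a sketch, not TurbineTech's production code; the class name and window policy are assumptions) is to keep a trailing window of confirmed-normal readings and refit the scaler and index as the window advances:

```python
import numpy as np
from collections import deque
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class RollingReference:
    """Hypothetical sketch: trailing window of confirmed-normal readings."""
    def __init__(self, window=50_000, k=10):
        self.k = k
        self.buffer = deque(maxlen=window)  # oldest readings fall off automatically

    def update(self, confirmed_normal):
        """Add confirmed-normal readings and refit on the current window."""
        self.buffer.extend(np.asarray(confirmed_normal))
        ref = np.asarray(self.buffer)
        self.scaler = StandardScaler().fit(ref)
        self.nn = NearestNeighbors(n_neighbors=self.k).fit(
            self.scaler.transform(ref))

    def score(self, readings):
        """Mean distance to the k nearest reference readings."""
        scaled = self.scaler.transform(np.asarray(readings))
        return self.nn.kneighbors(scaled)[0].mean(axis=1)

# A summer-only reference flags drifted-but-healthy winter readings...
rng = np.random.default_rng(0)
ref = RollingReference(window=5000)
ref.update(rng.normal(0, 1, size=(5000, 4)))     # "summer" normal operation
winter = rng.normal(3, 1, size=(1000, 4))        # drifted but healthy
drift_scores = ref.score(winter)

# ...until confirmed-normal winter readings are folded into the window
ref.update(winter)
after_scores = ref.score(rng.normal(3, 1, size=(200, 4)))
print(f"before update: {drift_scores.mean():.2f}  after: {after_scores.mean():.2f}")
```

The crucial (and hard) part is the "confirmed normal" gate: readings should only enter the window after enough time has passed to rule out a pre-failure condition.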


When KNN Shines at TurbineTech

Despite the limitations, KNN anomaly detection solves a specific, valuable problem at TurbineTech:

  1. Cold start. New turbines have no failure history. KNN needs only examples of normal operation from other turbines.

  2. No labeling required. The fleet reference data is implicitly labeled "normal" by the absence of failure events. No human labeling effort.

  3. Interpretability. When the KNN detector flags a reading, the maintenance team can inspect the nearest normal neighbors: "This reading's vibration is 4.3 RMS, but the nearest normal readings average 2.5 RMS. The bearing temperature is 78°C vs. a fleet average of 55°C." The explanation is concrete and actionable.

  4. Simplicity. The entire system --- scaler, nearest-neighbor index, threshold --- fits in a few hundred lines of code with no custom training loop.

# The complete KNN anomaly detection system for one turbine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class TurbineAnomalyDetector:
    """KNN-based anomaly detection for TurbineTech cold-start turbines."""

    def __init__(self, k=10, threshold_percentile=99.0):
        self.k = k
        self.threshold_percentile = threshold_percentile
        self.scaler = StandardScaler()
        self.nn = NearestNeighbors(n_neighbors=k, algorithm='kd_tree')
        self.threshold = None
        self.feature_names = None

    def fit(self, fleet_data):
        """Fit on fleet reference data (all normal operation)."""
        self.feature_names = fleet_data.columns.tolist()
        fleet_scaled = self.scaler.fit_transform(fleet_data)
        self.nn.fit(fleet_scaled)

        # Set threshold as a high percentile of fleet self-distances,
        # excluding each point's zero distance to itself
        fleet_distances = self.nn.kneighbors(
            fleet_scaled, n_neighbors=self.k + 1
        )[0][:, 1:].mean(axis=1)
        self.threshold = np.percentile(
            fleet_distances, self.threshold_percentile
        )
        return self

    def score(self, readings):
        """Compute anomaly scores for new readings."""
        scaled = self.scaler.transform(readings)
        distances, _ = self.nn.kneighbors(scaled)
        return distances.mean(axis=1)

    def predict(self, readings):
        """Flag anomalous readings (1 = anomaly, 0 = normal)."""
        scores = self.score(readings)
        return (scores > self.threshold).astype(int)

    def explain(self, reading, fleet_data):
        """Explain why a reading was flagged."""
        # Keep the DataFrame structure so the scaler sees the same feature names
        scaled = self.scaler.transform(reading.to_frame().T)
        distances, indices = self.nn.kneighbors(scaled)
        neighbors = fleet_data.iloc[indices[0]]

        explanation = {}
        for feat in self.feature_names:
            reading_val = reading[feat]
            neighbor_mean = neighbors[feat].mean()
            neighbor_std = neighbors[feat].std()
            z_score = (reading_val - neighbor_mean) / (neighbor_std + 1e-8)
            explanation[feat] = {
                'reading': reading_val,
                'neighbor_mean': neighbor_mean,
                'z_score': z_score,
            }
        return explanation

# Usage
detector = TurbineAnomalyDetector(k=10, threshold_percentile=99.0)
detector.fit(fleet)

scores = detector.score(new_turbine)
predictions = detector.predict(new_turbine)

n_flagged = predictions.sum()
print(f"Readings flagged: {n_flagged}/{len(new_turbine)} "
      f"({n_flagged/len(new_turbine)*100:.1f}%)")
print(f"True anomalies:   {n_anomaly}/{len(new_turbine)} "
      f"({n_anomaly/len(new_turbine)*100:.1f}%)")

# Explain one flagged reading
flagged_indices = np.where(predictions == 1)[0]
if len(flagged_indices) > 0:
    sample_idx = flagged_indices[0]
    explanation = detector.explain(
        new_turbine.iloc[sample_idx], fleet
    )
    print(f"\n--- Explanation for Reading #{sample_idx} ---")
    for feat, info in explanation.items():
        flag = " ***" if abs(info['z_score']) > 2 else ""
        print(f"  {feat:<22} reading={info['reading']:>8.1f}  "
              f"fleet_mean={info['neighbor_mean']:>8.1f}  "
              f"z={info['z_score']:>+6.1f}{flag}")
Readings flagged: 53/4320 (1.2%)
True anomalies:   50/4320 (1.2%)

--- Explanation for Reading #127 ---
  vibration_rms          reading=     4.8  fleet_mean=     2.5  z=  +5.7 ***
  bearing_temp_c         reading=    82.3  fleet_mean=    55.1  z=  +5.4 ***
  rotor_speed_rpm        reading=  1423.0  fleet_mean=  1798.4  z=  -3.8 ***
  power_output_kw        reading=  1156.0  fleet_mean=  2195.2  z=  -5.2 ***
  wind_speed_ms          reading=     8.1  fleet_mean=     8.0  z=  +0.0
  pitch_angle_deg        reading=    11.2  fleet_mean=     5.0  z=  +4.1 ***
  gearbox_oil_temp_c     reading=    79.4  fleet_mean=    62.0  z=  +4.3 ***
  generator_temp_c       reading=    93.7  fleet_mean=    70.1  z=  +3.9 ***

The explanation is immediately useful to a turbine technician: vibration is 5.7 standard deviations above the fleet mean, bearing temperature is 5.4 SDs above normal, and power output is 5.2 SDs below normal. Wind speed is normal --- so this is not a wind-driven anomaly, it is a turbine health anomaly.


Discussion Questions

  1. KNN as a transitional model. TurbineTech uses KNN for cold-start anomaly detection, then transitions to a supervised gradient boosting model once enough failure data accumulates. At what point should the transition happen? What data volume is needed to make supervised learning worthwhile?

  2. The reference data IS the model. In KNN anomaly detection, the quality of the reference fleet data determines everything. What curation steps would you take to ensure the reference data represents genuinely normal operation? How do you handle the fact that some "normal" readings in the fleet data might actually be undetected pre-failure conditions?

  3. Dimensionality reduction vs. feature selection. PCA recovered performance from 0.738 to 0.987 on the high-dimensional dataset. Domain knowledge (selecting 8 features) achieved 0.998. In practice, when should you use PCA and when should you use domain-driven feature selection? What are the risks of each approach?

  4. False positive cost. A false positive at TurbineTech triggers an inspection costing $5,000-$15,000. A missed anomaly (false negative) can lead to a $500,000 bearing failure. How should the detection threshold be set given this cost asymmetry? Is 3.1% false positive rate acceptable?

  5. KNN vs. Isolation Forest. Isolation Forest (Chapter 22) is another unsupervised anomaly detection method that does not require distance calculations and is immune to the curse of dimensionality. Under what circumstances would you prefer KNN over Isolation Forest for TurbineTech's use case?
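Question 4 can be made concrete with a threshold sweep over expected cost. The scores below are simulated stand-ins, and the cost figures are the failure cost and the midpoint of the inspection-cost range quoted above:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated anomaly scores: normal readings low, anomalies shifted upward
normal_scores = rng.gamma(2.0, 0.5, size=4270)
anomaly_scores = rng.gamma(2.0, 0.5, size=50) + 3.0
scores = np.concatenate([normal_scores, anomaly_scores])
labels = np.concatenate([np.zeros(4270), np.ones(50)])

COST_FP = 10_000    # midpoint of the $5,000-$15,000 inspection cost
COST_FN = 500_000   # bearing failure

best_threshold, best_cost = None, float('inf')
for t in np.quantile(scores, np.linspace(0.0, 1.0, 500)):
    fp = int(((scores > t) & (labels == 0)).sum())   # needless inspections
    fn = int(((scores <= t) & (labels == 1)).sum())  # missed failures
    cost = fp * COST_FP + fn * COST_FN
    if cost < best_cost:
        best_threshold, best_cost = t, cost

print(f"best threshold={best_threshold:.2f}  expected cost=${best_cost:,}")
```

With a 50:1 cost asymmetry, the optimal threshold tolerates dozens of needless inspections to avoid one missed failure, which is why a 3.1% false positive rate can still be rational.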


Case Study 2 for Chapter 15: Naive Bayes and Nearest Neighbors. Return to the chapter for full context.