In This Chapter
- Isolation Forests, Autoencoders, and Finding Needles in Haystacks
- The Hardest Part Is Not the Algorithm
- The Three Paradigms
- Part 1: Statistical Baselines
- Part 2: Isolation Forest
- Part 3: Local Outlier Factor (LOF)
- Part 4: One-Class SVM
- Part 5: Autoencoders for Anomaly Detection
- Part 6: Evaluating Anomaly Detection
- Part 7: The Threshold Selection Problem
- Part 8: Putting It Together --- The Anomaly Detection Workflow
- StreamFlow Application: Detecting Unusual Usage Patterns
- Chapter Summary
Chapter 22: Anomaly Detection
Isolation Forests, Autoencoders, and Finding Needles in Haystacks
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between supervised, semi-supervised, and unsupervised anomaly detection
- Implement Isolation Forest and understand why anomalies are "easy to isolate"
- Build a simple autoencoder for reconstruction-based anomaly detection
- Apply statistical methods (z-score, IQR, Mahalanobis distance) as baselines
- Evaluate anomaly detection without (or with few) labeled anomalies
The Hardest Part Is Not the Algorithm
Anomaly detection is one of the most practically valuable and least well-taught tasks in machine learning. Every organization needs it: fraudulent transactions, failing equipment, compromised accounts, network intrusions, data quality errors. Yet most ML courses spend 95% of their time on supervised classification and treat anomaly detection as a footnote.
The reason it is under-taught is that it is genuinely harder than supervised learning --- not algorithmically, but conceptually. In supervised classification, the problem is well-defined: you have labeled examples of both classes, you optimize a loss function, and you evaluate with metrics like AUC and F1. In anomaly detection, the problem itself is ambiguous.
Consider the threshold selection problem. You run an Isolation Forest on your data and it assigns every observation an anomaly score between 0 and 1. Score 0.95 is probably anomalous. Score 0.05 is probably normal. But what about 0.62? Or 0.51? Where do you draw the line?
There is no mathematically correct answer. The threshold depends on the cost of false positives versus false negatives, which depends on the business context. A manufacturing defect that causes a turbine fire has a different cost structure than a SaaS account that shows unusual but legitimate usage. The algorithm gives you a score. The business decides the threshold. And that means anomaly detection is as much a decision-making problem as it is a modeling problem.
This chapter teaches you both: the algorithms that produce anomaly scores and the frameworks for turning scores into decisions.
The Three Paradigms
Anomaly detection methods fall into three paradigms, and the distinction matters because it determines what data you need, what assumptions you make, and how you evaluate.
Supervised Anomaly Detection
You have labeled examples of both normal and anomalous observations. This is just binary classification with an extreme class imbalance. You already have the tools: logistic regression, gradient boosting, SMOTE, class weights (Chapters 11-17). If you have enough labeled anomalies (hundreds or more), supervised methods will almost always outperform unsupervised ones.
The problem: labeled anomalies are rare and expensive to obtain. In manufacturing, you might have millions of normal sensor readings and 47 confirmed bearing failures. In fraud detection, you have millions of legitimate transactions and a few thousand flagged cases. And the labeled anomalies may not represent all types of anomalies --- a fraud detection model trained on known fraud patterns will miss novel fraud schemes.
Semi-Supervised Anomaly Detection
You have a large dataset of verified normal observations and no labeled anomalies. You build a model of "what normal looks like" and flag anything that deviates. This is sometimes called novelty detection. One-class SVM and autoencoders trained only on normal data fall into this category.
The assumption: your training data is clean. If anomalies contaminate your "normal" training set, the model learns that anomalies are normal. This assumption is surprisingly hard to satisfy in practice --- most real datasets contain unlabeled anomalies.
Unsupervised Anomaly Detection
You have unlabeled data that contains both normal and anomalous observations, and you do not know which is which. You assume that anomalies are rare and structurally different from the majority. The algorithm identifies observations that do not fit the dominant pattern.
The assumption: anomalies are a small minority. If 30% of your data is anomalous, unsupervised methods will not distinguish anomalies from a second cluster of normal behavior. The standard assumption is that anomalies constitute less than 5% of the data, and many methods work best when it is less than 1%.
| Paradigm | Training Data | Assumption | Best When |
|---|---|---|---|
| Supervised | Labeled normal + anomalous | Labels are accurate and representative | Hundreds+ labeled anomalies available |
| Semi-supervised | Verified normal only | Training data is truly clean | Clean normal data can be obtained |
| Unsupervised | Unlabeled (mixed) | Anomalies are rare (<5%) | No labels available |
Practical Note --- Most real-world anomaly detection starts unsupervised (because labels do not exist yet), produces initial detections, sends them to a human for review, and gradually accumulates enough labels to move to supervised. The lifecycle is: unsupervised bootstrapping, then semi-supervised refinement, then supervised where possible. You need all three paradigms, not just one.
Part 1: Statistical Baselines
Before reaching for Isolation Forest or autoencoders, start with statistical methods. They are simple, interpretable, fast, and often surprisingly effective. They also give you a baseline to beat.
Z-Score Method
The z-score measures how many standard deviations a value is from the mean. For a feature $x$ with mean $\mu$ and standard deviation $\sigma$:
$$z = \frac{x - \mu}{\sigma}$$
Flag any observation where $|z| > 3$ (more than 3 standard deviations from the mean). This works for univariate anomaly detection on roughly Gaussian features.
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
n = 10000
# Simulate vibration sensor data from TurbineTech
# Normal operation: mean 4.2 mm/s RMS, std 0.8
# Anomalous bearings: elevated vibration
vibration = np.concatenate([
np.random.normal(4.2, 0.8, n - 50), # normal
np.random.normal(8.5, 1.2, 50) # bearing degradation
])
np.random.shuffle(vibration)
z_scores = np.abs(stats.zscore(vibration))
z_anomalies = z_scores > 3
print(f"Z-score anomalies detected: {z_anomalies.sum()}")
print(f"Mean vibration (flagged): {vibration[z_anomalies].mean():.2f} mm/s")
print(f"Mean vibration (normal): {vibration[~z_anomalies].mean():.2f} mm/s")
The z-score method has two critical limitations. First, it assumes a roughly Gaussian distribution. For skewed features (common in sensor data, transaction amounts, session durations), the z-score will flag observations in the long tail that are unusual but not anomalous. Second, it is univariate --- it checks each feature independently and cannot detect anomalies that only appear as unusual combinations of features.
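The first limitation is easy to demonstrate. In this sketch (using made-up lognormal session durations, not the turbine data above), the z-score flags part of the long right tail even though every observation is legitimate; a log transform restores roughly Gaussian behavior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical session durations: lognormal, heavily right-skewed, all legitimate
durations = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# Z-score on the raw values flags a chunk of the (legitimate) long tail
raw_flags = np.abs(stats.zscore(durations)) > 3

# After a log transform the data is roughly Gaussian and |z| > 3 behaves as intended
log_flags = np.abs(stats.zscore(np.log(durations))) > 3

print(f"Flagged on raw values: {raw_flags.sum()}")   # well above the nominal 0.3%
print(f"Flagged after log:     {log_flags.sum()}")   # close to the nominal 0.3%
```

None of the raw-value flags here are real anomalies; they are ordinary draws from a skewed distribution. This is exactly the failure mode to check for before trusting a z-score rule.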
IQR Method
The interquartile range (IQR) method is more robust to non-Gaussian distributions. Compute Q1 (25th percentile), Q3 (75th percentile), and IQR = Q3 - Q1. Flag any observation below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The 1.5 multiplier is Tukey's classic choice; use 3.0 for a more conservative threshold.
def iqr_anomalies(series, multiplier=1.5):
    """Flag anomalies using the IQR method."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return (series < lower) | (series > upper)
vibration_series = pd.Series(vibration)
iqr_flags = iqr_anomalies(vibration_series, multiplier=1.5)
print(f"IQR anomalies (1.5x): {iqr_flags.sum()}")
iqr_flags_3 = iqr_anomalies(vibration_series, multiplier=3.0)
print(f"IQR anomalies (3.0x): {iqr_flags_3.sum()}")
The IQR method handles skewed distributions better because percentiles are robust to outliers. But it is still univariate.
Mahalanobis Distance
The Mahalanobis distance is the multivariate generalization of the z-score. Instead of measuring distance from the mean in each feature independently, it measures distance from the centroid of the data, accounting for correlations between features.
For an observation $\mathbf{x}$, the mean vector $\boldsymbol{\mu}$, and the covariance matrix $\Sigma$:
$$D_M = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
The key insight: Mahalanobis distance accounts for correlation. An observation that is 2 standard deviations high on Feature A and 2 standard deviations high on Feature B may be perfectly normal if A and B are positively correlated --- but deeply anomalous if they are negatively correlated. The z-score misses this; Mahalanobis distance catches it.
from scipy.spatial.distance import mahalanobis
np.random.seed(42)
n = 10000
# Simulate multivariate sensor data: temperature + vibration + pressure
# Normal: correlated (hot bearings vibrate more)
normal_mean = [85.0, 4.2, 32.0]
normal_cov = [
    [25.0, 3.0, -2.0],   # temperature
    [3.0, 0.64, -0.3],   # vibration
    [-2.0, -0.3, 4.0],   # pressure
]  # positive definite; temperature-vibration correlation ~0.75
normal_data = np.random.multivariate_normal(normal_mean, normal_cov, n - 50)
# Anomalous: unusual combination (high vibration but low temperature)
anomaly_data = np.column_stack([
np.random.normal(78, 3, 50), # low temperature (unusual given high vibration)
np.random.normal(8.0, 0.8, 50), # high vibration
np.random.normal(28, 2, 50), # low pressure
])
sensor_data = np.vstack([normal_data, anomaly_data])
labels = np.array([0] * (n - 50) + [1] * 50)
# Compute Mahalanobis distance
mu = sensor_data.mean(axis=0)
cov = np.cov(sensor_data, rowvar=False)
cov_inv = np.linalg.inv(cov)
maha_distances = np.array([
mahalanobis(obs, mu, cov_inv) for obs in sensor_data
])
# Flag observations with high Mahalanobis distance
threshold = np.percentile(maha_distances, 99)
maha_anomalies = maha_distances > threshold
print(f"Mahalanobis anomalies: {maha_anomalies.sum()}")
print(f"True anomalies caught: {(maha_anomalies & (labels == 1)).sum()} / {labels.sum()}")
Practical Note --- Mahalanobis distance assumes the normal data follows a multivariate Gaussian distribution. If the normal data has multiple clusters or non-linear structure, the covariance matrix does not describe the shape of the data well, and the distance becomes unreliable. Use a robust covariance estimator (`sklearn.covariance.EllipticEnvelope`) to reduce the influence of outliers on the covariance estimate.
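A minimal sketch of that note, on made-up two-feature sensor data (the means and covariances here are invented for illustration). `EllipticEnvelope` fits a robust Minimum Covariance Determinant estimate, so a handful of contaminating points does not inflate the covariance:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)
# Made-up correlated sensor data: temperature and vibration move together
X_normal = rng.multivariate_normal([85.0, 4.2], [[25.0, 3.0], [3.0, 0.64]], 2000)
# A few contaminating outliers: high vibration at low temperature
X_outliers = rng.multivariate_normal([78.0, 8.0], [[4.0, 0.0], [0.0, 0.25]], 20)
X = np.vstack([X_normal, X_outliers])

# The robust fit down-weights the contaminating points when estimating covariance
ee = EllipticEnvelope(contamination=0.01, random_state=42)
flags = ee.fit_predict(X) == -1  # -1 marks points outside the fitted ellipse

print(f"Flagged: {flags.sum()} of {len(X)}")
print(f"Contaminating outliers caught: {flags[-20:].sum()} / 20")
```

The same `contamination` caveat applies here as everywhere else in this chapter: it sets the decision threshold, not the fit.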
When Statistical Baselines Are Enough
Statistical methods are sufficient when:
- The anomalies are extreme in one or two features (univariate methods work)
- The data is roughly Gaussian or can be transformed to be (log, Box-Cox)
- You need interpretability ("flagged because vibration was 4.7 standard deviations above normal")
- You need a quick first pass before investing in more complex models
They are insufficient when:
- Anomalies are subtle combinations of features that look normal individually
- The data has complex, non-linear structure
- The feature space is high-dimensional (covariance estimation degrades)
- The normal data has multiple modes or non-Gaussian structure
Part 2: Isolation Forest
Isolation Forest is the single most important algorithm in this chapter. It is fast, effective, requires minimal tuning, works well in high dimensions, and is built into scikit-learn. If you learn one anomaly detection method, learn this one.
The Intuition: Anomalies Are Easy to Isolate
Most anomaly detection methods model "what normal looks like" and then flag anything that deviates. Isolation Forest flips this: instead of profiling normal observations, it directly measures how easy each observation is to isolate.
The key insight is geometric. Normal observations are clustered together in dense regions. To isolate a normal observation --- to separate it from all other observations --- you need many random splits. Anomalies, by contrast, live in sparse regions far from the bulk of the data. A single random split can separate an anomaly from everything else.
Here is the procedure:
1. Select a random feature.
2. Select a random split value between the feature's minimum and maximum in the current subset.
3. Split the data.
4. Repeat until every observation is isolated (alone in its partition) or the tree reaches maximum depth.
An observation that gets isolated in 3 splits is more anomalous than one that takes 15 splits. The anomaly score is based on the average path length across many random trees.
Why It Works: Path Length and Anomaly Scores
In an Isolation Tree, the path length for an observation is the number of splits needed to isolate it. Build many trees (an Isolation Forest) and average the path lengths. Anomalies have short average path lengths; normal observations have long ones.
The anomaly score $s(x, n)$ for an observation $x$ in a dataset of size $n$ is:
$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$
where $E[h(x)]$ is the average path length for observation $x$ across all trees, and $c(n)$ is a normalization constant (the average path length of an unsuccessful search in a binary search tree of $n$ items).
- $s \approx 1$: anomaly (very short path length)
- $s \approx 0.5$: normal (average path length)
- $s \approx 0$: extremely normal (very long path length)
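The path-length intuition can be checked with a toy one-dimensional simulation. This is not the real algorithm (no trees, no subsampling, and the data is invented), just repeated random splits that keep whichever side contains a target point:

```python
import numpy as np

rng = np.random.default_rng(42)

def isolation_path_length(data, target_idx, rng):
    """Count random splits needed until the target point is alone (1-D toy)."""
    subset = data.copy()
    target = data[target_idx]
    splits = 0
    while len(subset) > 1:
        split = rng.uniform(subset.min(), subset.max())
        # Keep only the side of the split that contains the target
        if target <= split:
            subset = subset[subset <= split]
        else:
            subset = subset[subset > split]
        splits += 1
    return splits

# 500 normal vibration-like readings plus one outlier at 12.0
data = np.append(rng.normal(4.2, 0.8, 500), 12.0)

# Average path length over many random runs
normal_paths = [isolation_path_length(data, 0, rng) for _ in range(200)]
outlier_paths = [isolation_path_length(data, len(data) - 1, rng) for _ in range(200)]
print(f"Avg splits to isolate a normal point: {np.mean(normal_paths):.1f}")
print(f"Avg splits to isolate the outlier:    {np.mean(outlier_paths):.1f}")
```

The outlier sits alone in a wide empty interval, so a single random split lands beyond the normal cluster roughly half the time, and its average path length is a fraction of a normal point's.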
Implementation
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
np.random.seed(42)
n = 10000
# TurbineTech sensor data: 6 features from a single turbine bearing
# Normal operation: correlated sensor readings
normal = pd.DataFrame({
'vibration_rms': np.random.normal(4.2, 0.8, n - 80),
'temperature_c': np.random.normal(85, 5, n - 80),
'pressure_bar': np.random.normal(32, 2, n - 80),
'rotation_rpm': np.random.normal(3600, 50, n - 80),
'acoustic_db': np.random.normal(72, 3, n - 80),
'current_amp': np.random.normal(15.2, 1.1, n - 80),
})
# Anomalous: early-stage bearing degradation
# Vibration increases, temperature rises, acoustic emission changes
anomalous = pd.DataFrame({
'vibration_rms': np.random.normal(7.8, 1.5, 80),
'temperature_c': np.random.normal(98, 6, 80),
'pressure_bar': np.random.normal(29, 3, 80),
'rotation_rpm': np.random.normal(3580, 80, 80),
'acoustic_db': np.random.normal(81, 4, 80),
'current_amp': np.random.normal(17.8, 1.5, 80),
})
data = pd.concat([normal, anomalous], ignore_index=True)
true_labels = np.array([0] * (n - 80) + [1] * 80) # 0=normal, 1=anomaly
# Fit Isolation Forest
iso_forest = IsolationForest(
n_estimators=200,
max_samples='auto', # subsample size (default: min(256, n))
contamination=0.01, # expected fraction of anomalies
random_state=42,
)
predictions = iso_forest.fit_predict(data)
# IsolationForest uses -1 for anomalies, 1 for normal
pred_anomalies = predictions == -1
print(f"Anomalies detected: {pred_anomalies.sum()}")
print(f"True anomalies caught: {(pred_anomalies & (true_labels == 1)).sum()} / {true_labels.sum()}")
print(f"False positives: {(pred_anomalies & (true_labels == 0)).sum()}")
print(f"Precision: {precision_score(true_labels, pred_anomalies):.3f}")
print(f"Recall: {recall_score(true_labels, pred_anomalies):.3f}")
# Get anomaly scores (lower = more anomalous in sklearn's convention)
scores = iso_forest.decision_function(data)
# Negate so higher = more anomalous (more intuitive)
anomaly_scores = -scores
print(f"\nScore distribution:")
print(f" Normal mean: {anomaly_scores[true_labels == 0].mean():.3f}")
print(f" Anomaly mean: {anomaly_scores[true_labels == 1].mean():.3f}")
The Contamination Parameter
The contamination parameter is the most important hyperparameter in Isolation Forest, and it is frequently misunderstood. It does not control how the model is trained --- the trees are built the same way regardless of contamination. It controls the decision threshold: what fraction of observations the model labels as anomalies.
If you set contamination=0.01, the model marks the top 1% most anomalous observations as anomalies. If you set it to 0.05, it marks the top 5%. The underlying anomaly scores are identical.
# Same model, different contamination thresholds
for cont in [0.005, 0.01, 0.02, 0.05]:
    iso = IsolationForest(
        n_estimators=200, contamination=cont, random_state=42
    )
    preds = iso.fit_predict(data)
    n_flagged = (preds == -1).sum()
    true_caught = ((preds == -1) & (true_labels == 1)).sum()
    print(f"contamination={cont:.3f}: flagged={n_flagged:4d}, "
          f"true anomalies caught={true_caught}/{true_labels.sum()}")
Practical Note --- If you know the approximate anomaly rate, set `contamination` accordingly. If you do not (the common case), set `contamination='auto'` and work with the raw anomaly scores from `decision_function` or `score_samples`. Apply your own threshold based on business logic, or use the scores to rank observations and review the top N.
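A minimal sketch of that manual-threshold workflow, on synthetic two-feature readings (the values here are invented): fit with `contamination='auto'`, take raw scores from `score_samples`, and apply your own percentile cutoff:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([4.2, 85.0], [0.8, 5.0], size=(9950, 2)),  # normal readings
    rng.normal([8.0, 98.0], [1.2, 6.0], size=(50, 2)),    # degraded readings
])

iso = IsolationForest(n_estimators=200, contamination='auto', random_state=42)
iso.fit(X)

# score_samples: higher = more normal; negate so higher = more anomalous
scores = -iso.score_samples(X)

# Business-driven threshold: review roughly the top 1%, whatever 'auto' decided
threshold = np.percentile(scores, 99)
flagged = scores > threshold
print(f"Flagged for review: {flagged.sum()}")
```

The model's own `predict` output is ignored entirely; the scores are the product, and the cutoff is yours.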
Tuning Isolation Forest
| Parameter | Default | Guidance |
|---|---|---|
| `n_estimators` | 100 | 200 is a safe choice. Diminishing returns beyond 300. |
| `max_samples` | 256 | Small subsamples increase diversity. 256 works well up to ~100K rows. Increase for very large datasets. |
| `contamination` | 'auto' | Set to expected anomaly rate if known. Use 'auto' and threshold manually otherwise. |
| `max_features` | 1.0 | Reduce to 0.5-0.8 for high-dimensional data (adds feature subsampling like Random Forest). |
| `bootstrap` | False | Set True to sample with replacement. Rarely needed. |
Visualizing Isolation Forest Decisions
# Create a 2D projection to visualize the anomaly boundary
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
data_2d = pca.fit_transform(StandardScaler().fit_transform(data))
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: colored by true labels
scatter1 = axes[0].scatter(
data_2d[:, 0], data_2d[:, 1],
c=true_labels, cmap='coolwarm', alpha=0.3, s=5
)
axes[0].set_title('True Labels (0=Normal, 1=Anomaly)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
# Right: colored by anomaly score
scatter2 = axes[1].scatter(
data_2d[:, 0], data_2d[:, 1],
c=anomaly_scores, cmap='YlOrRd', alpha=0.3, s=5
)
axes[1].set_title('Isolation Forest Anomaly Scores')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
plt.colorbar(scatter2, ax=axes[1], label='Anomaly Score')
plt.tight_layout()
plt.savefig('isolation_forest_visualization.png', dpi=150, bbox_inches='tight')
plt.show()
Part 3: Local Outlier Factor (LOF)
Local Outlier Factor takes a different approach from Isolation Forest. Instead of measuring how easy an observation is to isolate, LOF measures how dense an observation's neighborhood is compared to the neighborhoods of its neighbors. An observation in a sparse region surrounded by dense regions is a local outlier.
The Intuition
Imagine a city. Downtown is dense --- everyone has many neighbors nearby. The suburbs are sparse --- people are spread out. LOF does not flag all suburban residents as outliers because they are consistently sparse. It flags the lone house in the middle of a field next to a dense neighborhood, because that house's local density is much lower than its neighbors' densities.
This local perspective is powerful for data with clusters of different densities. Isolation Forest treats density globally --- a point in a sparse but legitimate cluster might get a high anomaly score. LOF compares each point to its neighbors, so a sparse-but-consistent cluster is left alone.
Implementation
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.01,
novelty=False, # True for semi-supervised (predict on new data)
)
lof_predictions = lof.fit_predict(data)
lof_anomalies = lof_predictions == -1
lof_scores = -lof.negative_outlier_factor_ # higher = more anomalous
print(f"LOF anomalies detected: {lof_anomalies.sum()}")
print(f"True anomalies caught: {(lof_anomalies & (true_labels == 1)).sum()} / {true_labels.sum()}")
print(f"Precision: {precision_score(true_labels, lof_anomalies):.3f}")
print(f"Recall: {recall_score(true_labels, lof_anomalies):.3f}")
Practical Note --- LOF with `novelty=False` cannot predict on new data. It only labels the training data. For novelty detection (semi-supervised), set `novelty=True`, fit on clean training data, and then call `predict` on new observations. This is a different mode of operation and assumes your training set is anomaly-free.
Isolation Forest vs. LOF
| Aspect | Isolation Forest | LOF |
|---|---|---|
| Approach | How easy to isolate (random splits) | How dense vs. neighbors |
| Speed | Fast (O(n * t * log(subsample))) | Slower (O(n^2) for neighbor search) |
| Scalability | Handles 1M+ rows easily | Struggles above 100K rows |
| Density variation | Global --- misses multi-density data | Local --- handles multi-density well |
| New data | Can predict on new data | Only with novelty=True |
| Default choice | Yes, for most problems | When density varies across clusters |
Part 4: One-Class SVM
One-class SVM is the classic semi-supervised anomaly detection method. You train it on normal data only, and it learns a boundary that encloses the normal observations. Anything outside the boundary is flagged as anomalous.
How It Works
One-class SVM maps the data into a high-dimensional feature space (using a kernel, typically RBF) and finds the hyperplane that separates the data from the origin with maximum margin. The fraction of training points that fall outside the boundary is controlled by the nu parameter, which is analogous to contamination.
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
# Semi-supervised: train on normal data only
scaler = StandardScaler()
X_train_normal = scaler.fit_transform(normal)
oc_svm = OneClassSVM(
kernel='rbf',
gamma='scale',
nu=0.01, # upper bound on fraction of outliers in training data
)
oc_svm.fit(X_train_normal)
# Predict on all data (normal + anomalous)
X_all = scaler.transform(data)
svm_predictions = oc_svm.predict(X_all)
svm_anomalies = svm_predictions == -1
svm_scores = -oc_svm.decision_function(X_all)
print(f"One-Class SVM anomalies: {svm_anomalies.sum()}")
print(f"True anomalies caught: {(svm_anomalies & (true_labels == 1)).sum()} / {true_labels.sum()}")
print(f"Precision: {precision_score(true_labels, svm_anomalies):.3f}")
print(f"Recall: {recall_score(true_labels, svm_anomalies):.3f}")
Practical Note --- One-class SVM is slow. Training is O(n^2) to O(n^3) depending on the kernel, making it impractical for datasets larger than ~50,000 observations. It also requires careful feature scaling. Use it when you have clean normal training data and a moderate-sized dataset. For large datasets, prefer Isolation Forest.
Part 5: Autoencoders for Anomaly Detection
Autoencoders take a fundamentally different approach. Instead of geometry (Isolation Forest), density (LOF), or boundary (One-Class SVM), autoencoders learn to reconstruct their input. The idea: if you train an autoencoder on normal data, it learns to reconstruct normal patterns well. When you feed it an anomaly, the reconstruction is poor because the anomaly does not match the patterns the autoencoder learned. The reconstruction error is the anomaly score.
Architecture
An autoencoder is a neural network with two parts:
- Encoder: Compresses the input from $d$ dimensions to a smaller bottleneck of $k$ dimensions (the latent representation).
- Decoder: Reconstructs the input from the $k$-dimensional bottleneck back to $d$ dimensions.
The network is trained to minimize the difference between input and output (reconstruction error). The bottleneck forces the network to learn a compressed representation of the dominant patterns in the data --- which, if the training data is mostly normal, means it learns to represent normal patterns.
Building an Autoencoder with PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Prepare data
scaler = StandardScaler()
X_normal_scaled = scaler.fit_transform(normal) # train on normal only
X_all_scaled = scaler.transform(data) # evaluate on all data
input_dim = X_normal_scaled.shape[1] # 6 features
encoding_dim = 3 # bottleneck size
class Autoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, encoding_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 16),
            nn.ReLU(),
            nn.Linear(16, input_dim),
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
# Training
torch.manual_seed(42)
model = Autoencoder(input_dim, encoding_dim)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
X_tensor = torch.FloatTensor(X_normal_scaled)
dataset = TensorDataset(X_tensor, X_tensor) # input = target
loader = DataLoader(dataset, batch_size=256, shuffle=True)
model.train()
for epoch in range(50):
    total_loss = 0
    for batch_x, batch_target in loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/50, Loss: {total_loss/len(loader):.6f}")
# Compute reconstruction error on all data
model.eval()
with torch.no_grad():
    X_all_tensor = torch.FloatTensor(X_all_scaled)
    reconstructed = model(X_all_tensor)
    reconstruction_error = torch.mean((X_all_tensor - reconstructed) ** 2, dim=1).numpy()
# Threshold: 95th percentile of reconstruction error on normal training data
with torch.no_grad():
    train_reconstructed = model(X_tensor)
    train_error = torch.mean((X_tensor - train_reconstructed) ** 2, dim=1).numpy()
threshold = np.percentile(train_error, 95)
ae_anomalies = reconstruction_error > threshold
print(f"\nAutoencoder results:")
print(f"Anomalies detected: {ae_anomalies.sum()}")
print(f"True anomalies caught: {(ae_anomalies & (true_labels == 1)).sum()} / {true_labels.sum()}")
print(f"Precision: {precision_score(true_labels, ae_anomalies):.3f}")
print(f"Recall: {recall_score(true_labels, ae_anomalies):.3f}")
Why Reconstruction Error Works
The autoencoder's bottleneck forces it to learn a compressed representation. With only 3 latent dimensions for 6 input features, the network cannot memorize the data --- it must learn the dominant patterns. If the training data is mostly normal, the dominant patterns are "normal sensor readings." When a degraded bearing produces unusual readings, the autoencoder tries to reconstruct those readings using its learned "normal" patterns and fails. The reconstruction error spikes.
This is conceptually similar to PCA reconstruction error (Chapter 21). In fact, a linear autoencoder with no activation functions learns the same subspace that PCA recovers. The advantage of neural autoencoders is that they can capture non-linear patterns.
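That connection suggests a cheap baseline worth running before any neural network: PCA reconstruction error. A sketch on synthetic correlated data (the numbers are invented); the anomalies break the correlation, so they reconstruct poorly from the single retained component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Normal data lies along a strong positive correlation; anomalies break it
X_normal = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], 5000)
X_anom = rng.multivariate_normal([2.0, -2.0], [[0.1, 0.0], [0.0, 0.1]], 50)
X = np.vstack([X_normal, X_anom])

X_scaled = StandardScaler().fit_transform(X)

# Keep 1 of 2 components; points off the principal axis reconstruct poorly
pca = PCA(n_components=1).fit(X_scaled)
X_recon = pca.inverse_transform(pca.transform(X_scaled))
recon_error = ((X_scaled - X_recon) ** 2).mean(axis=1)

threshold = np.percentile(recon_error, 99)
flags = recon_error > threshold
print(f"Flagged: {flags.sum()}, anomalies caught: {flags[-50:].sum()} / 50")
```

If this linear baseline already separates your anomalies, a neural autoencoder is unlikely to add much.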
Autoencoder Design Decisions
| Decision | Guidance |
|---|---|
| Bottleneck size | Start with input_dim / 2, reduce if overfitting. Too large = memorizes anomalies. Too small = high error on everything. |
| Depth | 1-2 hidden layers per side is usually sufficient for tabular data. Deeper for images/sequences. |
| Activation | ReLU for hidden layers. Linear (no activation) for the output layer if inputs are standardized. |
| Training data | Semi-supervised (normal only) is preferred. If unavailable, train on all data and assume anomalies are rare enough that the model learns normal patterns. |
| Threshold | 95th or 99th percentile of training reconstruction error. Or use a validation set with labeled anomalies to tune. |
| Loss function | MSE for continuous features. Binary cross-entropy if inputs are in [0, 1]. |
Practical Note --- For tabular data with fewer than ~50 features, Isolation Forest often matches or outperforms autoencoders with far less effort. Autoencoders shine on high-dimensional data (hundreds of sensors, image data, sequence data) where the relationships between features are complex and non-linear. Do not use an autoencoder just because it sounds more sophisticated.
Part 6: Evaluating Anomaly Detection
Evaluation is where anomaly detection gets genuinely difficult. In supervised learning, you have labels and can compute precision, recall, F1, and AUC. In anomaly detection, you often have no labels --- or a small, biased set of labels.
When You Have Labels
If you have labeled anomalies (even a small test set), use them. But the extreme class imbalance (99%+ normal) means some standard metrics are misleading.
from sklearn.metrics import (
precision_recall_curve, average_precision_score,
roc_auc_score, confusion_matrix
)
# Compare methods using anomaly scores
methods = {
'Isolation Forest': anomaly_scores,
'LOF': lof_scores,
'One-Class SVM': svm_scores,
'Autoencoder': reconstruction_error,
}
print("Method Comparison (with labels):")
print(f"{'Method':<20s} {'AUC-ROC':>8s} {'AP':>8s}")
print("-" * 38)
for name, scores_arr in methods.items():
    auc = roc_auc_score(true_labels, scores_arr)
    ap = average_precision_score(true_labels, scores_arr)
    print(f"{name:<20s} {auc:>8.3f} {ap:>8.3f}")
Metrics that work for anomaly detection:
| Metric | What It Measures | When to Use |
|---|---|---|
| AUC-ROC | Ranking quality across all thresholds | General comparison between methods |
| Average Precision (AP) | Area under precision-recall curve | Better than AUC-ROC for extreme imbalance |
| Precision@k | Precision among top k scored observations | When you review a fixed number of alerts per day |
| Recall@k | Fraction of true anomalies in the top k | When you must catch a minimum fraction of anomalies |
Metrics that mislead:
- Accuracy: 99.2% accuracy sounds great, but a model that predicts "normal" for everything achieves 99.2% accuracy when 0.8% of observations are anomalous.
- F1 at a single threshold: F1 depends on the threshold, which is a business decision. Report F1 at the threshold you actually use, not the threshold that maximizes F1.
Precision at k
In practice, your anomaly detection system produces a ranked list, and a human reviews the top k items. The business question is: "If we review 50 alerts per day, how many of them are real?"
def precision_at_k(true_labels, scores, k):
    """Compute precision among the top-k highest-scored observations."""
    top_k_idx = np.argsort(scores)[-k:]
    return true_labels[top_k_idx].mean()

for k in [20, 50, 100, 200]:
    p_at_k = precision_at_k(true_labels, anomaly_scores, k)
    print(f"Precision@{k:>3d}: {p_at_k:.3f} "
          f"(true anomalies in top {k}: {int(p_at_k * k)})")
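Recall@k from the metrics table is the natural companion: instead of asking how clean the top k is, it asks what fraction of all true anomalies made it into the top k. A minimal self-contained version with a toy score vector (the labels and scores are invented):

```python
import numpy as np

def recall_at_k(true_labels, scores, k):
    """Fraction of all true anomalies captured in the top-k scored observations."""
    top_k_idx = np.argsort(scores)[-k:]
    return true_labels[top_k_idx].sum() / true_labels.sum()

# Toy check: 5 anomalies, four of which score above every normal observation
rng = np.random.default_rng(0)
labels = np.array([0] * 95 + [1] * 5)
scores = np.concatenate([rng.uniform(0, 0.5, 95), [0.9, 0.8, 0.7, 0.6, 0.3]])

print(f"Recall@5:  {recall_at_k(labels, scores, 5):.2f}")   # 4 of 5 anomalies in top 5
print(f"Recall@10: {recall_at_k(labels, scores, 10):.2f}")
```

The anomaly scored 0.3 hides inside the normal range, so no reasonable k catches it without also flooding the review queue; that trade-off is exactly what these curves make visible.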
When You Have No Labels
This is the harder and more common case. Without labels, you cannot compute precision, recall, or AUC. You can:
- Inspect the top anomalies manually. Sort by anomaly score, examine the top 50-100, and assess whether they look genuinely anomalous based on domain knowledge. This is subjective but necessary.
- Check consistency across methods. If Isolation Forest, LOF, and the autoencoder all flag the same observations, those are probably genuine anomalies. Disagreements point to borderline cases or method-specific artifacts.
- Validate with domain proxies. In manufacturing, check whether flagged observations correspond to known maintenance events. In SaaS, check whether flagged accounts subsequently churned. These are not labels, but they provide indirect validation.
- Run a stability analysis. Rerun the algorithm with different random seeds, subsample sizes, or hyperparameters. Observations that are consistently flagged across settings are robust anomalies. Observations that appear and disappear are borderline.
# Consistency check: which observations are flagged by all methods?
all_flags = np.column_stack([
    pred_anomalies,   # Isolation Forest
    lof_anomalies,    # LOF
    svm_anomalies,    # One-Class SVM
    ae_anomalies,     # Autoencoder
])
consensus = all_flags.sum(axis=1)

print(f"Flagged by all 4 methods:      {(consensus == 4).sum()}")
print(f"Flagged by at least 3 methods: {(consensus >= 3).sum()}")
print(f"Flagged by at least 2 methods: {(consensus >= 2).sum()}")
print(f"Flagged by at least 1 method:  {(consensus >= 1).sum()}")

# Among consensus anomalies, what fraction are true anomalies?
if (consensus == 4).sum() > 0:
    consensus_precision = true_labels[consensus == 4].mean()
    print(f"\nPrecision of 4/4 consensus: {consensus_precision:.3f}")
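The stability analysis described above can be sketched the same way: re-fit with several seeds and count how often each observation is flagged. The data here is a stand-in (a Gaussian blob with five planted outliers); in practice you would use your own feature matrix.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))   # stand-in data
X[:5] += 6                      # five planted outliers

# Re-fit with several seeds and count how often each point is flagged
seeds = [0, 1, 2, 3, 4]
flags = np.zeros(len(X), dtype=int)
for seed in seeds:
    iso = IsolationForest(n_estimators=100, contamination=0.02,
                          random_state=seed)
    flags += (iso.fit_predict(X) == -1).astype(int)

stable = np.where(flags == len(seeds))[0]                # flagged every time
borderline = np.where((flags > 0) & (flags < len(seeds)))[0]
print(f"Stable anomalies:      {len(stable)}")
print(f"Borderline (unstable): {len(borderline)}")
```

Observations in `stable` survive every seed; those in `borderline` are artifacts of a particular fit and deserve skepticism.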
Part 7: The Threshold Selection Problem
We have deferred the hardest question long enough. You have anomaly scores. You need a threshold. How do you choose it?
Option 1: Contamination-Based
If you know (or estimate) the anomaly rate, set the threshold to flag that percentage. If you believe 1% of turbine readings indicate degradation, flag the top 1%.
This is simple but circular --- you are assuming the answer to the question you are trying to answer.
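Mechanically, the rule is one percentile call. This sketch assumes scores where higher means more anomalous and an estimated 1% anomaly rate; the score array is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(size=10_000)   # stand-in anomaly scores

contamination = 0.01               # assumed anomaly rate
threshold = np.percentile(scores, 100 * (1 - contamination))
flagged = scores > threshold

print(f"Threshold: {threshold:.3f}")
print(f"Flagged:   {flagged.sum()} of {len(scores)}")
```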
Option 2: Statistical
Fit a distribution to the anomaly scores and flag observations in the extreme tail. For example, if anomaly scores are roughly Gaussian, flag anything beyond the 99th percentile.
# Fit normal distribution to Isolation Forest scores
from scipy.stats import norm
mu_score, std_score = norm.fit(anomaly_scores)
threshold_99 = norm.ppf(0.99, mu_score, std_score)
print(f"Score distribution: mean={mu_score:.3f}, std={std_score:.3f}")
print(f"99th percentile threshold: {threshold_99:.3f}")
print(f"Anomalies above threshold: {(anomaly_scores > threshold_99).sum()}")
Option 3: Business-Driven
Define the threshold by operational capacity. "Our team can investigate 20 alerts per day. Flag the top 20." This is the most honest approach because it acknowledges that the threshold is a resource allocation decision, not a statistical one.
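The capacity rule ranks and takes the top k, regardless of score values. A sketch, assuming a review budget of 20 alerts and stand-in scores where higher means more anomalous:

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.exponential(size=1_000)   # stand-in daily anomaly scores
daily_capacity = 20                    # alerts the team can review per day

# Flag exactly the top `daily_capacity` scores, highest first
alert_idx = np.argsort(scores)[-daily_capacity:][::-1]
print(f"Alerts raised: {len(alert_idx)}")
print(f"Lowest score sent for review: {scores[alert_idx[-1]]:.3f}")
```

Note that the implied threshold (the lowest reviewed score) floats from day to day; what stays fixed is the workload.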
Option 4: Labeled Validation Set
If you have even a small set of labeled anomalies (50-100), use them to tune the threshold. Plot precision and recall as a function of the threshold and choose the operating point that matches your cost structure.
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(
    true_labels, anomaly_scores
)

# Find the first threshold where precision >= 0.8
for i, (p, r) in enumerate(zip(precisions, recalls)):
    if p >= 0.8 and r > 0:
        print(f"Threshold for precision >= 0.8: {thresholds[i]:.3f}")
        print(f"  Precision: {p:.3f}, Recall: {r:.3f}")
        break
Practical Note --- In production systems, the threshold is not fixed. It evolves as you accumulate feedback. Start with a loose threshold (more alerts, lower precision), have humans review the alerts, use the labels to tighten the threshold, and iterate. The first deployment is a bootstrapping phase, not a finished product.
Part 8: Putting It Together --- The Anomaly Detection Workflow
Here is the workflow for a real anomaly detection project:
Step 1: Understand the Domain
Before writing any code, answer:
- What constitutes an anomaly in this domain?
- What is the cost of missing an anomaly (false negative)?
- What is the cost of a false alarm (false positive)?
- How many alerts can the team review per day?
- Do you have any labeled anomalies, even a handful?
Step 2: Start with Statistical Baselines
Compute z-scores and IQR flags on individual features. This gives you a lower bound on performance and often catches the most extreme anomalies. If the baselines solve 80% of the problem, you may not need anything fancier.
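Both baselines fit in a few lines. This sketch uses stand-in data; in practice you would run it over your real feature table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'hours': rng.exponential(18, 1000),
    'sessions': rng.poisson(14, 1000).astype(float),
})

# Z-score flag: more than 3 standard deviations from the feature mean
z = (df - df.mean()) / df.std()
z_flag = (z.abs() > 3).any(axis=1)

# IQR flag: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] on any feature
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
iqr_flag = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)

print(f"Z-score flags: {z_flag.sum()}")
print(f"IQR flags:     {iqr_flag.sum()}")
```

On skewed features like exponential usage hours, the two rules disagree noticeably, which is itself useful diagnostic information.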
Step 3: Apply Isolation Forest
Isolation Forest is the default starting point for unsupervised anomaly detection. It requires minimal tuning, scales well, and handles high-dimensional data.
Step 4: Consider Alternatives Based on Data Structure
- Multi-density clusters: Add LOF.
- Clean normal training data available: Try One-Class SVM or an autoencoder trained on normal data only.
- High-dimensional / complex patterns: Try an autoencoder.
- Enough labeled anomalies: Switch to supervised classification (gradient boosting).
Step 5: Evaluate and Select Threshold
Use labeled data if available (AP, AUC-ROC, precision@k). If not, use consensus across methods, manual inspection, and domain proxies. Set the threshold based on operational capacity.
Step 6: Deploy and Monitor
Deploy the model, send alerts to reviewers, collect feedback labels, and use them to improve the model. Monitor for concept drift --- the definition of "normal" changes over time (a turbine ages, a SaaS product adds features, user behavior shifts).
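One simple drift monitor, sketched here with stand-in score arrays, compares the current window's anomaly-score distribution against a reference window using a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference_scores = rng.normal(0.0, 1.0, 5_000)  # scores at deployment
current_scores = rng.normal(0.4, 1.0, 5_000)    # scores this week (shifted)

# KS test: has the score distribution moved since deployment?
stat, p_value = ks_2samp(reference_scores, current_scores)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
if p_value < 0.01:
    print("Score distribution has shifted: revisit what 'normal' means.")
```

A shifted score distribution does not say which direction the world moved, only that the model's notion of normal no longer matches the data it sees.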
StreamFlow Application: Detecting Unusual Usage Patterns
The StreamFlow SaaS churn dataset from previous chapters has a natural anomaly detection application: identifying accounts with unusual usage patterns that may indicate early churn, account compromise, or data quality issues.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 50000

streamflow = pd.DataFrame({
    'monthly_hours_watched': np.random.exponential(18, n).round(1),
    'sessions_last_30d': np.random.poisson(14, n),
    'avg_session_minutes': np.random.exponential(28, n).round(1),
    'unique_titles_watched': np.random.poisson(8, n),
    'content_completion_rate': np.random.beta(3, 2, n).round(3),
    'binge_sessions_30d': np.random.poisson(2, n),
    'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
    'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
    'hours_change_pct': np.random.normal(0, 30, n).round(1),
    'sessions_change_pct': np.random.normal(0, 25, n).round(1),
    'months_active': np.random.randint(1, 60, n),
    'plan_price': np.random.choice(
        [9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]
    ),
    'devices_used': np.random.randint(1, 6, n),
    'profiles_active': np.random.randint(1, 5, n),
    'payment_failures_6m': np.random.poisson(0.3, n),
    'support_tickets_90d': np.random.poisson(0.8, n),
    'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
    'recommendation_click_rate': np.random.beta(2, 8, n).round(3),
    'search_frequency_30d': np.random.poisson(6, n),
    'download_count_30d': np.random.poisson(3, n),
    'share_count_30d': np.random.poisson(1, n),
    'rating_count_30d': np.random.poisson(2, n),
    'free_trial_convert': np.random.binomial(1, 0.65, n),
    'referral_source': np.random.choice(
        [0, 1, 2, 3], n, p=[0.50, 0.25, 0.15, 0.10]
    ),
})

# Generate churn labels
churn_logit = (
    -3.0
    + 0.08 * streamflow['days_since_last_session']
    - 0.02 * streamflow['monthly_hours_watched']
    - 0.04 * streamflow['sessions_last_30d']
    + 0.15 * streamflow['payment_failures_6m']
    + 0.10 * streamflow['support_tickets_90d']
    - 0.01 * streamflow['months_active']
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
streamflow['churned'] = np.random.binomial(1, churn_prob)

# Anomaly detection on usage features (excluding the churn label)
usage_features = [
    'monthly_hours_watched', 'sessions_last_30d', 'avg_session_minutes',
    'unique_titles_watched', 'content_completion_rate', 'binge_sessions_30d',
    'hours_change_pct', 'sessions_change_pct', 'days_since_last_session',
    'devices_used', 'payment_failures_6m', 'support_tickets_90d',
]
scaler = StandardScaler()
X_usage = scaler.fit_transform(streamflow[usage_features])

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
streamflow['anomaly_flag'] = iso.fit_predict(X_usage) == -1
streamflow['anomaly_score'] = -iso.decision_function(X_usage)

# How do anomalous accounts differ?
print("Anomalous vs. Normal accounts:")
comparison = streamflow.groupby('anomaly_flag')[
    usage_features + ['churned']
].mean().T
comparison.columns = ['Normal', 'Anomalous']
comparison['Ratio'] = comparison['Anomalous'] / comparison['Normal']
print(comparison.round(2))

# Churn rate among anomalous accounts
churn_anomalous = streamflow.loc[streamflow['anomaly_flag'], 'churned'].mean()
churn_normal = streamflow.loc[~streamflow['anomaly_flag'], 'churned'].mean()
print(f"\nChurn rate (anomalous): {churn_anomalous:.3f}")
print(f"Churn rate (normal):    {churn_normal:.3f}")
print(f"Lift: {churn_anomalous / churn_normal:.1f}x")
This is the bridge between anomaly detection and the churn prediction work from earlier chapters. Anomaly scores can serve as features in a supervised churn model, or anomaly flags can trigger proactive retention outreach.
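As a sketch of that bridge, the anomaly score can be appended as one more column before fitting a supervised model. Everything here is a stand-in (synthetic features and a synthetic churn label), not the StreamFlow data above; the point is only the mechanics of augmenting the feature matrix.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 6))                             # stand-in usage features
y = (X[:, 0] + rng.normal(0, 1, 2_000) > 1.5).astype(int)   # stand-in churn label

# Unsupervised anomaly score becomes one more supervised feature
iso = IsolationForest(n_estimators=100, random_state=42)
anomaly_score = -iso.fit(X).decision_function(X)
X_aug = np.column_stack([X, anomaly_score])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_aug, y, test_size=0.25, random_state=42, stratify=y
)
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print(f"Test accuracy with anomaly-score feature: {clf.score(X_te, y_te):.3f}")
```

Whether the extra column helps is an empirical question; compare against the same model without it before keeping the feature.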
Chapter Summary
Anomaly detection is not a single algorithm --- it is a workflow that spans statistical baselines, tree-based isolation, density estimation, and reconstruction-based methods. The algorithm that assigns the score is the easy part. The hard part is choosing the threshold, evaluating without labels, and integrating the system into a human review process.
The methods you now have:
| Method | Type | Strengths | Weaknesses |
|---|---|---|---|
| Z-score / IQR | Statistical baseline | Simple, interpretable, fast | Univariate, assumes distribution |
| Mahalanobis | Statistical multivariate | Catches correlation-based anomalies | Assumes Gaussian, fragile in high-d |
| Isolation Forest | Unsupervised | Fast, scalable, minimal tuning | Global density (misses local patterns) |
| LOF | Unsupervised | Local density comparison | Slow, does not scale well |
| One-Class SVM | Semi-supervised | Flexible boundary | Slow, needs clean training data |
| Autoencoder | Semi-supervised | Captures non-linear patterns | More complex, needs more data |
Start with the simplest method that could work. Graduate to more complex methods when the simpler ones demonstrably fail. And always remember: the threshold is a business decision, not a statistical one.
Next chapter: Chapter 23 --- Association Rules, where we move from finding unusual observations to finding unusual co-occurrences.