Exercises: Chapter 22
Anomaly Detection: Isolation Forests, Autoencoders, and Finding Needles in Haystacks
Exercise 1: Statistical Baselines (Conceptual + Code)
a) Explain the difference between z-score and IQR methods for anomaly detection. Under what data distribution does the z-score flag too many legitimate observations as anomalous? Give a concrete example using a feature from the TurbineTech dataset.
b) A sensor produces readings that follow a log-normal distribution (common for vibration data). The raw z-score method flags 12% of readings as anomalous, far more than the true rate of ~0.5%. Explain why this happens and propose a fix. Implement your fix below:
import numpy as np
from scipy import stats
np.random.seed(42)
# Log-normal vibration data with a few true anomalies
normal_readings = np.random.lognormal(mean=1.5, sigma=0.4, size=9950)
anomalous_readings = np.random.lognormal(mean=2.8, sigma=0.3, size=50)
readings = np.concatenate([normal_readings, anomalous_readings])
labels = np.array([0] * 9950 + [1] * 50)
# Raw z-score
z_raw = np.abs(stats.zscore(readings))
raw_flagged = (z_raw > 3).sum()
print(f"Raw z-score flags: {raw_flagged} ({raw_flagged / len(readings) * 100:.1f}%)")
# Your fix here: transform and re-apply
# Your code here
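If you get stuck, one standard fix is sketched below: the log of a log-normal variable is normally distributed, so applying the z-score in log space restores the Gaussian assumption behind the 3-sigma rule. The data-generating lines repeat the snippet above so the sketch runs on its own.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
readings = np.concatenate([
    np.random.lognormal(mean=1.5, sigma=0.4, size=9950),  # normal operation
    np.random.lognormal(mean=2.8, sigma=0.3, size=50),    # true anomalies
])

# log of a log-normal variable is normal, so the 3-sigma rule applies in log space
z_log = np.abs(stats.zscore(np.log(readings)))
log_flagged = (z_log > 3).sum()
print(f"Log-space z-score flags: {log_flagged} ({log_flagged / len(readings) * 100:.2f}%)")
```

The flag rate should drop from double digits to well under 1%, much closer to the true anomaly rate.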
c) Compute the Mahalanobis distance for the following 3-feature dataset. Manually verify for one observation that it is far more extreme under the Mahalanobis distance than under the Euclidean distance from the mean (the two distances are on different scales, so compare the observation's percentile rank among the normal points under each metric). This gap is possible only because the features are correlated.
import numpy as np
from scipy.spatial.distance import mahalanobis
np.random.seed(42)
# Correlated features
mean = [10, 20, 30]
cov = [[4, 3, 1], [3, 9, 2], [1, 2, 4]]
data = np.random.multivariate_normal(mean, cov, 500)
# Add one anomaly: unusual COMBINATION (each feature looks normal individually)
anomaly = np.array([[12, 15, 35]]) # high on feature 1, low on feature 2
# Your code: compute Mahalanobis distance for the anomaly
# Compare the anomaly's percentile rank under Mahalanobis vs. Euclidean distance
# Explain why the anomaly stands out under Mahalanobis but barely under Euclidean
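A sketch of one way to run the comparison. Since the two distances are on different scales, the meaningful comparison is the anomaly's percentile rank among the normal points under each metric; the data-generating lines repeat the snippet above.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

np.random.seed(42)
mean = [10, 20, 30]
cov = [[4, 3, 1], [3, 9, 2], [1, 2, 4]]
data = np.random.multivariate_normal(mean, cov, 500)
anomaly = np.array([12, 15, 35])  # unusual combination, not unusual values

center = data.mean(axis=0)
VI = np.linalg.inv(np.cov(data, rowvar=False))  # inverse sample covariance

d_mahal = mahalanobis(anomaly, center, VI)
d_eucl = np.linalg.norm(anomaly - center)

# Percentile rank of the anomaly among the normal points, per metric
mahal_all = np.array([mahalanobis(x, center, VI) for x in data])
eucl_all = np.linalg.norm(data - center, axis=1)
print(f"Mahalanobis {d_mahal:.2f}: beats {np.mean(mahal_all < d_mahal):.1%} of normal points")
print(f"Euclidean   {d_eucl:.2f}: beats {np.mean(eucl_all < d_eucl):.1%} of normal points")
```

The anomaly moves against the positive correlation between features 1 and 2, which the covariance-aware metric penalizes heavily while the Euclidean metric cannot.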
d) Under what conditions does the Mahalanobis distance fail as an anomaly detection method? Name two failure modes and describe a dataset characteristic that triggers each.
Exercise 2: Isolation Forest Deep Dive (Code)
a) Generate a 2D dataset with two clusters (one dense, one sparse) and 20 point anomalies scattered in empty space. Fit an Isolation Forest and visualize the decision boundary using a contour plot of anomaly scores.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
np.random.seed(42)
# Dense cluster
cluster1 = np.random.normal(loc=[0, 0], scale=[0.5, 0.5], size=(500, 2))
# Sparse cluster
cluster2 = np.random.normal(loc=[6, 6], scale=[1.5, 1.5], size=(200, 2))
# Anomalies
anomalies = np.random.uniform(low=-4, high=12, size=(20, 2))
data = np.vstack([cluster1, cluster2, anomalies])
labels = np.array([0]*500 + [0]*200 + [1]*20)
# Fit Isolation Forest
# Your code here
# Create contour plot of anomaly scores across the 2D space
# Your code here: create meshgrid, predict scores, plot contours
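A possible starting sketch for the fit and the contour plot. The grid bounds and n_estimators=200 are arbitrary choices, the Agg backend is used so the script runs headless, and the data lines repeat the snippet above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

np.random.seed(42)
cluster1 = np.random.normal(loc=[0, 0], scale=[0.5, 0.5], size=(500, 2))
cluster2 = np.random.normal(loc=[6, 6], scale=[1.5, 1.5], size=(200, 2))
anomalies = np.random.uniform(low=-4, high=12, size=(20, 2))
data = np.vstack([cluster1, cluster2, anomalies])

iso = IsolationForest(n_estimators=200, random_state=42).fit(data)

# Score a grid covering the data range; score_samples is HIGHER for inliers
xx, yy = np.meshgrid(np.linspace(-5, 13, 200), np.linspace(-5, 13, 200))
grid_scores = iso.score_samples(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, grid_scores, levels=20, cmap="viridis")
plt.colorbar(label="score_samples (higher = more normal)")
plt.scatter(data[:, 0], data[:, 1], s=5, c="white", edgecolors="k", linewidths=0.2)
plt.savefig("iforest_contour.png")
```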
b) Using the same dataset, vary n_estimators from 10 to 500 (in steps: 10, 50, 100, 200, 500). For each, compute the AUC-ROC. At what point do additional trees stop improving performance? Plot the learning curve.
c) Vary max_samples from 32 to 512 (in steps: 32, 64, 128, 256, 512) with n_estimators=200. How does subsample size affect AUC-ROC? Which setting provides the best tradeoff between performance and training time? Measure wall-clock time using time.time().
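The sweep-and-score pattern for parts (b) and (c) can be sketched as below for n_estimators; adapting it to max_samples is a one-line change. The dataset lines repeat the snippet from part (a).

```python
import time
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

np.random.seed(42)
cluster1 = np.random.normal(loc=[0, 0], scale=[0.5, 0.5], size=(500, 2))
cluster2 = np.random.normal(loc=[6, 6], scale=[1.5, 1.5], size=(200, 2))
anoms = np.random.uniform(low=-4, high=12, size=(20, 2))
data = np.vstack([cluster1, cluster2, anoms])
labels = np.array([0] * 700 + [1] * 20)

for n in [10, 50, 100, 200, 500]:
    t0 = time.time()
    iso = IsolationForest(n_estimators=n, random_state=42).fit(data)
    # score_samples is high for inliers, so negate it to score anomalies (label=1)
    auc = roc_auc_score(labels, -iso.score_samples(data))
    print(f"n_estimators={n:3d}  AUC={auc:.3f}  fit time={time.time() - t0:.2f}s")
```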
d) Explain in your own words why the contamination parameter does not affect the model's anomaly scores, only the binary prediction threshold. Write code that demonstrates this by showing the score_samples output is identical for two models with different contamination values.
# Your code here: fit two IsolationForest models with different contamination
# Show that score_samples() produces identical values
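A minimal demonstration on stand-in Gaussian data: with the same random_state the trees are identical, so the raw scores match exactly; contamination only moves the decision offset that predict() compares against.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(42)
X = np.random.normal(size=(500, 2))

m1 = IsolationForest(contamination=0.01, random_state=7).fit(X)
m2 = IsolationForest(contamination=0.20, random_state=7).fit(X)

# Same trees (same random_state) -> identical raw scores; only the
# decision offset differs, and predict() is just score vs. offset
same = np.allclose(m1.score_samples(X), m2.score_samples(X))
print(f"scores identical: {same}, offsets: {m1.offset_:.3f} vs {m2.offset_:.3f}")
print(f"flagged: {(m1.predict(X) == -1).sum()} vs {(m2.predict(X) == -1).sum()}")
```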
Exercise 3: Isolation Forest vs. LOF on Multi-Density Data (Code + Analysis)
This exercise demonstrates the local vs. global density distinction.
a) Generate a dataset with three clusters of different densities:
import numpy as np
np.random.seed(42)
# Tight cluster
c1 = np.random.normal(loc=[0, 0], scale=[0.3, 0.3], size=(1000, 2))
# Medium cluster
c2 = np.random.normal(loc=[5, 5], scale=[1.0, 1.0], size=(500, 2))
# Sparse cluster
c3 = np.random.normal(loc=[-5, 5], scale=[2.0, 2.0], size=(300, 2))
# True anomalies: points between clusters
anomalies = np.array([
    [2.5, 2.5], [3.0, 1.0], [-2.0, 3.0], [-1.0, 7.0],
    [7.0, 2.0], [-6.0, 0.0], [0.0, 8.0], [4.0, -2.0],
])
data = np.vstack([c1, c2, c3, anomalies])
labels = np.array([0]*1000 + [0]*500 + [0]*300 + [1]*8)
b) Fit both Isolation Forest and LOF with contamination=0.01. Compare their predictions: which observations does each method flag? Create a side-by-side scatter plot showing the flagged observations.
c) Does Isolation Forest incorrectly flag points from the sparse cluster (c3) as anomalies? Does LOF? Explain the difference in terms of global vs. local density.
d) Propose a strategy for combining Isolation Forest and LOF predictions when you suspect multi-density data. Implement an ensemble scoring approach and evaluate it against the individual methods.
Exercise 4: Autoencoder Architecture Exploration (Code)
a) Build an autoencoder for the TurbineTech 6-feature sensor dataset from the chapter. Train three variants:
- Shallow: Input(6) -> 4 -> 6 (a single 4-unit bottleneck layer)
- Medium: Input(6) -> 16 -> 3 -> 16 -> 6 (from the chapter)
- Deep: Input(6) -> 32 -> 16 -> 3 -> 16 -> 32 -> 6
Train each for 50 epochs on normal data only. Compare reconstruction error distributions on normal vs. anomalous test data. Which architecture best separates the two distributions?
import torch
import torch.nn as nn
# Define three autoencoder architectures
# Your code here
# Train each on normal data
# Compute reconstruction error on all data
# Plot reconstruction error histograms for normal vs. anomalous, for each architecture
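A minimal sketch of the Medium architecture and a full-batch training loop, using synthetic correlated data as a stand-in for the TurbineTech sensors (the learning rate and the stand-in data are arbitrary assumptions; the other two variants differ only in layer sizes).

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

class Autoencoder(nn.Module):
    """The chapter's Medium shape: 6 -> 16 -> 3 -> 16 -> 6."""
    def __init__(self, d_in=6, d_hid=16, d_bot=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                     nn.Linear(d_hid, d_bot))
        self.decoder = nn.Sequential(nn.Linear(d_bot, d_hid), nn.ReLU(),
                                     nn.Linear(d_hid, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Stand-in for the TurbineTech sensors: 6 correlated features
normal = torch.randn(2000, 6) @ torch.randn(6, 6) * 0.5

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(50):  # full-batch training on normal data only
    opt.zero_grad()
    loss = loss_fn(model(normal), normal)
    loss.backward()
    opt.step()

# Per-observation reconstruction error is the anomaly score
with torch.no_grad():
    errors = ((model(normal) - normal) ** 2).mean(dim=1)
print(f"final training loss {loss.item():.4f}")
```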
b) Vary the bottleneck size from 1 to 6 (with the Medium architecture structure). Plot the average reconstruction error on normal data and on anomalous data as a function of bottleneck size. What is the optimal bottleneck size? What happens when bottleneck = input_dim?
c) Train an autoencoder on the full dataset (normal + anomalous mixed together) instead of normal data only. Compare its anomaly detection performance to the semi-supervised version (trained on normal only). How much does contamination of the training data degrade performance? At what contamination rate does the mixed-training autoencoder become useless?
d) Implement early stopping based on validation reconstruction error. Split the normal training data 80/20. Stop training when validation loss has not improved for 10 epochs. Compare the final model to one trained for a fixed 100 epochs. Which generalizes better to the anomalous test data?
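The patience logic for part (d) is independent of any particular model. Here is a model-agnostic sketch: train_step and val_loss_fn are placeholders you would wire to your autoencoder's epoch loop and validation split, and the toy loss curve merely exercises the logic.

```python
import copy

def early_stop_train(train_step, val_loss_fn, max_epochs=100, patience=10):
    """Patience-based early stopping; returns the best state seen."""
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = train_step(epoch)      # run one epoch, return model state
        loss = val_loss_fn(state)      # validation reconstruction error
        if loss < best_loss - 1e-6:
            best_loss, best_state, stale = loss, copy.deepcopy(state), 0
        else:
            stale += 1
            if stale >= patience:
                print(f"stopped at epoch {epoch}, best val loss {best_loss:.4f}")
                break
    return best_state, best_loss

# Toy demo: validation loss bottoms out at epoch 30, then "overfits"
losses = [1.0 / (e + 1) + 0.002 * max(0, e - 30) for e in range(100)]
state, best = early_stop_train(lambda e: e, lambda e: losses[e])
```

Restoring the best state (rather than the last one) is what makes the fixed-100-epoch comparison in part (d) meaningful.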
Exercise 5: Evaluation Without Labels (Analysis + Code)
a) You deploy an Isolation Forest for TurbineTech bearing anomaly detection. After one month, you have anomaly scores for 100,000 sensor readings but no labels. Describe three concrete strategies for evaluating whether the model is working. For each strategy, specify what data you would need and what a positive signal looks like.
b) Implement the consensus approach. Run Isolation Forest (3 random seeds), LOF, and an autoencoder on the TurbineTech data. For each observation, count how many methods flag it. Create a table showing: for each consensus level (1/5, 2/5, 3/5, 4/5, 5/5), how many observations are flagged, and what fraction are true anomalies.
# Your code here: run 5 anomaly detection models
# Count consensus levels
# Report precision at each level
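A sketch of the consensus bookkeeping on stand-in data. For brevity it substitutes a distance-from-the-mean baseline for the autoencoder; swap in your reconstruction-error flags once you have them.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)
# Stand-in for the TurbineTech features: 6-d normal bulk plus 15 scattered anomalies
normal = np.random.normal(0, 1, (2000, 6))
anoms = np.random.uniform(3, 8, (15, 6))
X = np.vstack([normal, anoms])
true = np.array([0] * 2000 + [1] * 15)

flags = []
for seed in (0, 1, 2):  # three Isolation Forests, different seeds
    pred = IsolationForest(contamination=0.01, random_state=seed).fit_predict(X)
    flags.append(pred == -1)
flags.append(LocalOutlierFactor(n_neighbors=50, contamination=0.01).fit_predict(X) == -1)
# Fifth "detector": distance from the mean, standing in for the autoencoder
dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
flags.append(dist > np.quantile(dist, 0.99))

consensus = np.sum(flags, axis=0)  # votes per observation, 0..5
for level in range(1, 6):
    mask = consensus >= level
    prec = true[mask].mean() if mask.any() else float("nan")
    print(f">= {level}/5 votes: {mask.sum():4d} flagged, precision {prec:.2f}")
```

Precision should rise with the consensus level: observations flagged by all five methods are the safest bets for human review.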
c) Implement a stability analysis. Run Isolation Forest 20 times with different random seeds (random_state=0 through random_state=19). For each observation, compute the fraction of runs that flag it. Plot the distribution of this "flag frequency." Are there observations that are always flagged (20/20)? Always missed (0/20)? What does the bimodal vs. uniform shape of this distribution tell you about the model's confidence?
d) A colleague suggests evaluating anomaly detection by training a supervised model on the anomaly flags (treat flags as labels) and measuring its cross-validated accuracy. Explain why this is circular reasoning and does not validate the anomaly detector.
Exercise 6: Threshold Selection (Analysis + Code)
a) Using the TurbineTech dataset with known labels, create a precision-recall curve for the Isolation Forest anomaly scores. Mark three operating points on the curve:
- Conservative: Precision >= 0.90 (few false alarms, some misses)
- Balanced: F1-maximizing threshold
- Aggressive: Recall >= 0.95 (catch almost everything, more false alarms)
For each operating point, report the threshold, precision, recall, F1, and the number of daily alerts (assuming 10,000 readings per day).
from sklearn.metrics import precision_recall_curve, f1_score
# Your code here
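A sketch of locating the operating points from the precision_recall_curve output, on stand-in data. Note that the precision and recall arrays are one element longer than the thresholds array, which is why the final point is dropped before indexing.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_curve

np.random.seed(42)
# Stand-in for the labeled TurbineTech data
X = np.vstack([np.random.normal(0, 1, (5000, 6)),
               np.random.normal(2.5, 1, (40, 6))])
y = np.array([0] * 5000 + [1] * 40)

# Higher score = more anomalous, so negate score_samples
scores = -IsolationForest(random_state=42).fit(X).score_samples(X)
prec, rec, thresh = precision_recall_curve(y, scores)

f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = int(np.argmax(f1[:-1]))  # the final (prec, rec) point has no threshold
print(f"Balanced: t={thresh[best]:.3f} P={prec[best]:.2f} R={rec[best]:.2f} F1={f1[best]:.2f}")

hi_prec = np.where(prec[:-1] >= 0.90)[0]
if hi_prec.size:  # Conservative point: first threshold reaching 90% precision
    i = hi_prec[0]
    print(f"Conservative: t={thresh[i]:.3f} P={prec[i]:.2f} R={rec[i]:.2f}")
```

The aggressive point follows the same pattern with the condition rec[:-1] >= 0.95; daily alert counts come from applying each threshold to a day's worth of scores.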
b) The TurbineTech operations team can investigate 15 alerts per day. Using the precision@k framework, determine: if you send the top 15 anomaly scores each day, what fraction are true bearing issues? What fraction of true bearing issues do you catch?
c) Implement a dynamic threshold that adapts over time. Simulate 30 days of data (10,000 readings per day, 0.8% anomaly rate). For the first 5 days, use the 99th percentile as the threshold. Starting on day 6, use the labeled feedback from reviewed alerts to adjust the threshold using precision-recall analysis. Track how precision and recall change over the 30 days.
d) A manufacturing plant has two types of anomalies: bearing degradation (slow, develops over days) and sudden catastrophic failure (immediate, rare). The cost of missing a catastrophic failure is 100x the cost of missing a gradual degradation. Design a dual-threshold system: a low threshold for gradual issues (reviewed weekly) and a high threshold for critical issues (reviewed immediately). Implement this using the anomaly scores.
Exercise 7: End-to-End Anomaly Detection Pipeline (Project)
Build a complete anomaly detection pipeline for the StreamFlow SaaS churn dataset:
a) Data preparation: Select relevant usage features. Handle any preprocessing (scaling, encoding). Justify your feature selection.
b) Baseline: Compute z-score and IQR anomalies per feature. How many accounts are flagged by at least one feature?
c) Model: Fit an Isolation Forest. Tune contamination using the known churn labels as a proxy (anomalous usage should correlate with churn, but they are not the same thing).
d) Evaluation: Use churn as a proxy label. Compute AUC-ROC and precision@50 (the top 50 anomalous accounts). Do anomalous accounts churn at a higher rate?
e) Profiling: For the top 100 anomalous accounts, compute the mean of each usage feature and compare to the population mean. What characterizes anomalous StreamFlow accounts?
f) Actionable output: Design a daily alert email for the Customer Success team. What information should each alert contain? Write the code that generates a summary DataFrame with account ID, anomaly score, anomaly rank, top 3 contributing features, and recommended action.
# Your complete pipeline here
# Expected output: a DataFrame of the top 50 accounts with anomaly details
Exercise 8: Feature Contribution for Anomaly Explanations (Code)
One limitation of Isolation Forest is that it provides an anomaly score but does not explain which features made the observation anomalous. This exercise addresses that.
a) Implement a simple feature contribution method: for each flagged anomaly, compute the z-score of each feature (relative to the normal population). The features with the highest absolute z-scores are the most "surprising" features for that observation.
def explain_anomaly(observation, normal_data, feature_names):
    """Return a DataFrame of feature contributions for a single anomaly."""
    # Your code here: compute z-scores relative to normal_data
    # Return sorted by absolute z-score
    pass
# Test on the top 5 Isolation Forest anomalies
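One way to implement part (a), with a tiny synthetic check: the 4.0 planted in feature f2 should surface as the most surprising feature.

```python
import numpy as np
import pandas as pd

def explain_anomaly(observation, normal_data, feature_names):
    """Per-feature z-scores of one observation vs. the normal population,
    sorted so the most surprising features come first."""
    mu = normal_data.mean(axis=0)
    sigma = normal_data.std(axis=0)
    out = pd.DataFrame({"feature": feature_names,
                        "z_score": (observation - mu) / sigma})
    order = out["z_score"].abs().sort_values(ascending=False).index
    return out.loc[order].reset_index(drop=True)

np.random.seed(42)
normal = np.random.normal(0, 1, (1000, 3))
obs = np.array([0.1, 4.0, -0.2])  # the 4.0 in f2 is the planted surprise
report = explain_anomaly(obs, normal, ["f1", "f2", "f3"])
print(report)
```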
b) Implement a permutation-based feature importance for Isolation Forest: for each feature, permute its values and measure the change in anomaly score. Features where permutation reduces the anomaly score are important for the detection.
c) Compare your explanation method from (a) with the permutation method from (b) on the top 10 anomalies. Do they agree on the most important features? Discuss cases where they disagree and why.
d) Why is explainability particularly important for anomaly detection compared to supervised classification? Give two reasons specific to the anomaly detection workflow (hint: think about the human review step and the threshold selection step).
Solutions to selected exercises are available in the Appendix.