Quiz: Chapter 22

Anomaly Detection: Isolation Forests, Autoencoders, and Finding Needles in Haystacks


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

What is the key insight behind Isolation Forest?

  • A) Anomalies are harder to classify than normal observations
  • B) Anomalies are easier to isolate because they lie in sparse regions
  • C) Anomalies have higher reconstruction error than normal observations
  • D) Anomalies have lower local density than their neighbors

Answer: B) Anomalies are easier to isolate because they lie in sparse regions. Isolation Forest builds random trees by selecting random features and random split values. Normal observations are clustered in dense regions and require many splits to isolate. Anomalies are in sparse regions and can be separated with fewer splits. The anomaly score is based on the average path length --- shorter paths indicate anomalies. Option C describes autoencoders, and option D describes Local Outlier Factor.
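To see the "fewer splits" intuition concretely, here is a minimal sketch using scikit-learn's `IsolationForest` on synthetic data (the cluster, the outlier location, and the seed are all made up for illustration). Note the `score_samples()` convention: higher scores mean more normal, lower scores mean more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# A dense "normal" cluster, plus one point far out in a sparse region
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])  # obvious outlier, appended last

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)  # higher = more normal, lower = more anomalous

# The far-out point is isolated with few splits, so it gets the lowest score
print(scores.argmin() == len(X) - 1)
```

The outlier's short average path length across the trees translates directly into the lowest (most anomalous) score in the batch.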


Question 2 (Multiple Choice)

In Isolation Forest, what does the contamination parameter control?

  • A) The fraction of data used to build each tree
  • B) The number of trees in the forest
  • C) The decision threshold that converts anomaly scores into binary predictions
  • D) The depth limit of each isolation tree

Answer: C) The decision threshold that converts anomaly scores into binary predictions. The contamination parameter tells the model what fraction of observations to label as anomalies. It does not change how the trees are built or the anomaly scores themselves --- only the threshold applied to those scores. Two models with different contamination values produce identical score_samples() output but different predict() output. Option A describes max_samples, option B describes n_estimators, and option D describes the per-tree depth cap, which scikit-learn derives internally from max_samples rather than exposing as a parameter.
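The score-versus-threshold distinction can be checked directly. In this sketch (synthetic data, arbitrary seed), two forests share a `random_state`, so they build identical trees; only the fraction of points labeled anomalous differs:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.uniform(-6, 6, size=(15, 2))])

# Same random_state => identical trees; only contamination differs
low = IsolationForest(contamination=0.01, random_state=0).fit(X)
high = IsolationForest(contamination=0.10, random_state=0).fit(X)

# Identical scores: contamination does not affect how the trees are built
print(np.allclose(low.score_samples(X), high.score_samples(X)))

# Different binary labels: contamination only moves the decision threshold
n_low = (low.predict(X) == -1).sum()
n_high = (high.predict(X) == -1).sum()
print(n_low, n_high)  # the high-contamination model flags more points
```

If the threshold implied by contamination is wrong, you can keep the fitted model and re-threshold `score_samples()` output yourself rather than retraining.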


Question 3 (Multiple Choice)

Which anomaly detection paradigm requires a dataset of verified normal observations with no anomalies?

  • A) Supervised anomaly detection
  • B) Semi-supervised anomaly detection (novelty detection)
  • C) Unsupervised anomaly detection
  • D) All of the above

Answer: B) Semi-supervised anomaly detection (novelty detection). In semi-supervised anomaly detection, you train a model on verified clean data to learn what "normal" looks like, then flag anything that deviates as novel or anomalous. One-class SVM and autoencoders trained on normal data only are examples. Supervised detection requires both normal and anomalous labels. Unsupervised detection works on unlabeled data that contains both normal and anomalous observations.


Question 4 (Short Answer)

Explain why accuracy is a misleading metric for anomaly detection evaluation. Give a concrete numeric example.

Answer: Accuracy is misleading because of the extreme class imbalance inherent in anomaly detection. If 0.8% of observations are truly anomalous, a model that predicts "normal" for every observation achieves 99.2% accuracy while catching zero anomalies. In a dataset of 10,000 observations with 80 true anomalies, the trivial "always normal" classifier has accuracy 9,920 / 10,000 = 99.2%. This is useless for the actual task. Average precision (area under the precision-recall curve) and precision@k are more informative because they focus on the model's performance in the anomalous tail.
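The numeric example above is easy to reproduce; this sketch builds the 10,000-observation dataset with 80 anomalies and scores the trivial "always normal" classifier:

```python
import numpy as np

# 10,000 observations, 80 true anomalies (0.8% prevalence)
y_true = np.zeros(10_000, dtype=int)
y_true[:80] = 1

# Trivial model: predict "normal" (0) for every observation
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()
print(accuracy)  # 0.992 -- looks excellent
print(recall)    # 0.0   -- catches zero anomalies
```

High accuracy with zero recall is exactly why threshold-free ranking metrics (average precision) or budget-aware metrics (precision@k) are preferred here.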


Question 5 (Multiple Choice)

How does an autoencoder detect anomalies?

  • A) It learns to classify observations as normal or anomalous
  • B) It isolates observations using random splits in feature space
  • C) It learns to reconstruct normal patterns, and anomalies have high reconstruction error
  • D) It measures the distance from each observation to the nearest cluster center

Answer: C) It learns to reconstruct normal patterns, and anomalies have high reconstruction error. An autoencoder compresses input through a bottleneck and reconstructs it. When trained on normal data, it learns to reconstruct normal patterns well. Anomalies do not match the learned patterns, so the reconstruction is poor and the error is high. The reconstruction error serves as the anomaly score.
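The chapter's autoencoders are neural networks; as a dependency-light stand-in for the same compress-then-reconstruct idea, this sketch uses PCA as a linear "autoencoder" (the data layout, dimensions, and the off-pattern test point are all made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# "Normal" data lies near a 2-D plane inside 3-D space: x3 ~ x1 - x2
latent = rng.normal(0, 1, size=(500, 2))
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
X_normal = latent @ W.T + rng.normal(0, 0.05, size=(500, 3))

# PCA as a linear autoencoder: compress 3-D -> 2-D, then reconstruct
pca = PCA(n_components=2).fit(X_normal)

def recon_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))
    return ((X - X_hat) ** 2).mean(axis=1)

# A point far off the learned plane reconstructs poorly
anomaly = np.array([[0.0, 0.0, 5.0]])
print(recon_error(anomaly)[0] > recon_error(X_normal).max())
```

A neural autoencoder adds nonlinearity, but the anomaly-scoring logic is identical: project through a bottleneck, reconstruct, and rank observations by reconstruction error.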


Question 6 (Multiple Choice)

What advantage does Mahalanobis distance have over computing z-scores on individual features?

  • A) It is faster to compute
  • B) It accounts for correlations between features
  • C) It does not require the data to be approximately Gaussian
  • D) It can handle categorical features

Answer: B) It accounts for correlations between features. The z-score checks each feature independently. An observation that is +2 standard deviations on Feature A and +2 on Feature B might be normal if A and B are positively correlated, but deeply anomalous if they are negatively correlated. Mahalanobis distance incorporates the covariance matrix, detecting anomalies that are unusual combinations of features even when each feature is individually within normal range. Mahalanobis distance is slower than z-scores (A), still assumes approximately Gaussian data (C), and does not handle categorical features (D).
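The "+2 and +2 versus +2 and -2" scenario can be demonstrated directly. In this sketch (synthetic strongly correlated features, arbitrary seed), both test points are about 2 standard deviations out on each feature individually, yet their Mahalanobis distances differ sharply:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly positively correlated features (rho ~ 0.95)
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal([0, 0], cov, size=1000)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ cov_inv @ d)

# Both points are ~2 standard deviations out on each feature individually...
usual = np.array([2.0, 2.0])     # follows the positive correlation
unusual = np.array([2.0, -2.0])  # violates the correlation

print(mahalanobis(usual), mahalanobis(unusual))
# ...but the correlation-violating point is far more distant
```

Per-feature z-scores treat the two points identically; the covariance term in the Mahalanobis computation is what separates them.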


Question 7 (Short Answer)

Describe the threshold selection problem in anomaly detection. Why is there no single "correct" threshold?

Answer: The threshold selection problem asks: given a continuous anomaly score, how anomalous must an observation be to trigger an alert? There is no mathematically correct answer because the threshold depends on the business context --- specifically the relative costs of false positives (unnecessary investigations) and false negatives (missed anomalies). A manufacturing plant where a missed bearing failure causes a fire needs a low threshold (catch everything, tolerate false alarms). A SaaS company triaging usage anomalies for customer success outreach can afford a high threshold (only flag the most obvious cases). The threshold is a resource allocation decision driven by operational capacity and cost asymmetry, not a statistical property of the model.


Question 8 (Multiple Choice)

What is the primary advantage of Local Outlier Factor (LOF) over Isolation Forest?

  • A) LOF is faster and more scalable
  • B) LOF handles data with clusters of varying density
  • C) LOF can predict on new, unseen data
  • D) LOF requires no hyperparameter tuning

Answer: B) LOF handles data with clusters of varying density. LOF compares each observation's local density to the density of its neighbors. A point in a sparse-but-consistent cluster is not flagged, because its neighbors are equally sparse. Isolation Forest uses a global density perspective and may incorrectly flag points in sparse but legitimate clusters. LOF is slower than Isolation Forest (A), cannot score new, unseen data unless it is constructed with novelty=True (C), and requires tuning n_neighbors (D).
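A short sketch of the varying-density behavior, using scikit-learn's `LocalOutlierFactor` on synthetic data (cluster positions, spreads, and the isolated point are all made up for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A tight cluster and a legitimately sparse (but internally consistent) cluster
tight = rng.normal(0, 0.1, size=(100, 2))
sparse = rng.normal(10, 2.0, size=(100, 2))
X = np.vstack([tight, sparse, [[4.0, -4.0]]])  # one truly isolated point, last

# fit_predict scores the training data; use novelty=True to score new data
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# The isolated point is flagged...
print(labels[-1])
# ...while sparse-cluster members mostly are not, despite their low global density
print((labels[100:200] == -1).mean())
```

Because sparse-cluster members have neighbors that are equally sparse, their local density ratio stays near 1 and they survive; the isolated point's neighbors are far denser than it is, so its LOF is large.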


Question 9 (Multiple Choice)

You train an autoencoder on a dataset that is 95% normal and 5% anomalous (unlabeled). Compared to training on 100% normal data, what is the most likely effect?

  • A) The autoencoder will perform identically because 5% contamination is negligible
  • B) The autoencoder may learn to partially reconstruct anomalous patterns, reducing its ability to detect them
  • C) The autoencoder will overfit to the anomalies and flag all normal observations
  • D) The autoencoder will fail to converge during training

Answer: B) The autoencoder may learn to partially reconstruct anomalous patterns, reducing its ability to detect them. If anomalies are present in the training data, the autoencoder's loss function includes reconstruction of those anomalies. With 5% contamination, the model will not fully learn to reconstruct anomalies (they are too rare), but it will partially accommodate them, producing lower reconstruction errors for anomalies than a clean-trained model would. This narrows the gap between normal and anomalous reconstruction errors, degrading detection performance. The effect is proportional to the contamination rate.


Question 10 (Multiple Choice)

Which evaluation metric is most appropriate when your operations team can review exactly 50 alerts per day?

  • A) AUC-ROC
  • B) F1 score
  • C) Precision at k (where k = 50)
  • D) Recall at the default threshold

Answer: C) Precision at k (where k = 50). Precision@k directly answers the operational question: "Of the 50 alerts we review each day, how many are genuine anomalies?" AUC-ROC evaluates ranking quality across all thresholds but does not tell you what happens at k=50 specifically. F1 depends on an arbitrary threshold. Recall at a default threshold does not account for the fixed review budget. Precision@k aligns the metric with the operational constraint.
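Precision@k is simple to compute from raw anomaly scores. This sketch uses synthetic scores and labels (prevalence, score separation, and seed are arbitrary):

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    # Take the k highest-scoring observations, count how many are true anomalies
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].mean()

rng = np.random.default_rng(0)
y_true = np.zeros(1000, dtype=int)
y_true[:30] = 1
# A decent but imperfect scorer: anomalies score higher on average
scores = rng.normal(0, 1, size=1000) + 2.5 * y_true

p_at_50 = precision_at_k(scores, y_true, k=50)
print(p_at_50)  # fraction of the day's 50 alerts that are genuine
```

The metric answers exactly the operations team's question: of the 50 alerts reviewed, how many were worth reviewing.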


Question 11 (Short Answer)

A colleague proposes using Isolation Forest anomaly flags as features in a supervised churn prediction model. Is this a good idea? What are the benefits and risks?

Answer: This can be useful but requires care. The benefit is that the anomaly score captures multivariate unusualness that individual features may not --- a customer whose combination of features is unusual may be at risk even if no single feature is extreme. The risk is information leakage if the anomaly model was fit on the same data used to train and evaluate the churn model. The Isolation Forest should be fit on training data only and then applied to produce scores for both training and test sets. The anomaly score should be treated as a derived feature, not a label. A further risk is that the anomaly score may be redundant with existing features, adding noise without improving the churn model.
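A leakage-free version of the workflow looks like the following sketch (toy data and a made-up "churn" label, purely for illustration): fit the anomaly model on the training split only, then append its score as a derived feature to both splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 4))
y = (X[:, 0] + rng.normal(0, 0.5, size=500) > 1).astype(int)  # toy churn label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the anomaly model on TRAINING data only to avoid leakage...
iso = IsolationForest(random_state=0).fit(X_train)

# ...then score both splits and append the score as a derived feature
X_train_aug = np.column_stack([X_train, iso.score_samples(X_train)])
X_test_aug = np.column_stack([X_test, iso.score_samples(X_test)])

print(X_train_aug.shape, X_test_aug.shape)  # one extra feature column each
```

The downstream churn model (and any cross-validation around it) then sees the anomaly score as an ordinary numeric feature; wrapping both steps in a scikit-learn Pipeline keeps the fit-on-train-only discipline automatic.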


Question 12 (Multiple Choice)

In the anomaly detection lifecycle, what is the typical progression of approaches?

  • A) Supervised, then semi-supervised, then unsupervised
  • B) Unsupervised bootstrapping, then semi-supervised refinement, then supervised where possible
  • C) Semi-supervised, then unsupervised, then supervised
  • D) The approach is chosen once and does not change

Answer: B) Unsupervised bootstrapping, then semi-supervised refinement, then supervised where possible. Most real-world anomaly detection starts with no labels. Unsupervised methods (Isolation Forest) produce initial detections, which are reviewed by humans, generating labels. As clean normal data accumulates, semi-supervised methods (autoencoders, One-Class SVM) can be trained. Eventually, enough labeled anomalies accumulate to support supervised classification (gradient boosting with class imbalance handling). The approach evolves as data and labels accumulate.


Question 13 (Short Answer)

Explain why the bottleneck size of an autoencoder matters for anomaly detection. What happens if the bottleneck is too large? Too small?

Answer: The bottleneck forces the autoencoder to learn a compressed representation of the dominant patterns. If the bottleneck is too large (close to or equal to the input dimension), the network can learn a near-identity mapping and reconstruct everything well, including anomalies. The reconstruction error gap between normal and anomalous observations shrinks, and detection fails. If the bottleneck is too small, the network cannot represent even normal patterns adequately, producing high reconstruction error on everything. The optimal bottleneck is large enough to represent normal variation but small enough to force lossy compression that fails on anomalous patterns.


Question 14 (Multiple Choice)

A TurbineTech engineer says: "Our Isolation Forest flags 2% of readings as anomalous, but we know the true bearing failure rate is 0.1%. The model is broken." What is wrong with this reasoning?

  • A) The Isolation Forest is indeed broken and should be retrained
  • B) The model's contamination parameter is set too high, but the underlying scores may be correct
  • C) Anomalies include more than just bearing failures --- they include any unusual readings
  • D) Both B and C

Answer: D) Both B and C. First, the contamination parameter may be set to 0.02 when it should be closer to 0.001, which only affects the threshold, not the model itself. Adjusting contamination or manually thresholding the scores could resolve this. Second, the model detects all types of unusual readings, not just bearing failures. Sensor drift, calibration errors, unusual operating conditions, and other non-failure anomalies are also flagged. A 2% anomaly rate is plausible even with a 0.1% failure rate because many anomalies are not failures. The engineer needs to distinguish between "unusual" and "failure" --- anomaly detection finds the former, and domain knowledge filters for the latter.


Question 15 (Short Answer)

You are deploying an anomaly detection model for StreamFlow SaaS churn prevention. After three months, the model's precision@50 drops from 0.72 to 0.41, even though no code has changed. What are two likely causes, and how would you diagnose each?

Answer: The two most likely causes are concept drift and data drift. Concept drift occurs when the relationship between features and "anomalousness" changes --- for example, StreamFlow launches a new feature that changes normal usage patterns, making previously unusual behavior common. Diagnose by reviewing the outcomes of flagged accounts: if the alerts resemble the old true positives but no longer correspond to churn, the meaning of the signal has changed. Data drift occurs when the input feature distributions shift --- for example, a marketing campaign brings in a new user segment with different usage patterns. Diagnose by monitoring input feature statistics (mean, variance, quantiles) over time and comparing them to the training period. Both typically require retraining the model on recent data to restore performance.
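A minimal data-drift check compares summary statistics of a feature between the training period and a recent window. This sketch uses synthetic data with an injected mean shift (the feature name, magnitudes, and windows are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
train_feature = rng.normal(50, 10, size=5000)   # e.g. sessions/week at training time
recent_feature = rng.normal(65, 10, size=1000)  # same feature three months later

def drift_report(train, recent):
    # Compare simple summary statistics against the training period
    return {
        "mean_shift": recent.mean() - train.mean(),
        "std_ratio": recent.std() / train.std(),
        "p95_shift": np.percentile(recent, 95) - np.percentile(train, 95),
    }

report = drift_report(train_feature, recent_feature)
print(report)  # a mean shift of ~1.5 training std devs is a strong drift signal
```

In practice you would run a check like this per feature on a schedule; a two-sample test (e.g. Kolmogorov-Smirnov) or a population stability index gives a more formal alarm than raw statistic deltas.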


This quiz covers Chapter 22: Anomaly Detection. Review the chapter and key takeaways for concepts you found difficult.