Chapter 25: Machine Learning in Fraud Detection
Key Takeaways
1. Fraud detection is a classification problem with distinctive challenges. The combination of extreme class imbalance (0.1–1% fraud rate), temporal dynamics (patterns change constantly), adversarial adaptation (fraudsters study and defeat the model), label noise (not all fraud is reported), and millisecond latency requirements makes fraud detection one of the hardest machine learning applications in financial services. Standard accuracy metrics are misleading — precision and recall are the operative measures.
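A toy calculation (illustrative numbers, not from the chapter) shows why accuracy misleads at these fraud rates: a "model" that flags nothing at all scores 99.9% accuracy while catching zero fraud.

```python
# Illustrative only: at a 0.1% fraud rate, predicting "legitimate" for
# every transaction yields 99.9% accuracy but 0% recall.
total = 100_000
fraud = 100              # 0.1% fraud rate
legit = total - fraud

# The do-nothing model: no true positives, no false positives.
tp, fp, fn, tn = 0, 0, fraud, legit

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3%}")   # 99.900%
print(f"recall   = {recall:.3%}")     # 0.000%
```

This is why precision and recall, not accuracy, are the operative measures.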
2. Feature engineering is more important than model selection. Behavioral baseline features — amount z-score against account history, velocity over rolling windows, new country flag, new device flag — consistently outperform sophisticated models with basic features. The feature engineering process requires deep collaboration between data scientists and fraud operations experts.
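A minimal sketch of the behavioral baseline features named above (amount z-score against account history, rolling-window velocity, new-country flag). The field names, window representation, and values are illustrative assumptions, not the chapter's schema.

```python
import statistics
from collections import deque

def behavioral_features(txn, history, seen_countries, window):
    """Compute baseline behavioral features for one transaction.

    txn: dict with 'amount' and 'country' (hypothetical schema).
    history: past transaction amounts for this account.
    seen_countries: countries previously used by the account.
    window: deque of recent-transaction timestamps inside the
            rolling velocity window (contents elided here).
    """
    mean = statistics.mean(history)
    std = statistics.pstdev(history) or 1.0   # guard against zero variance
    return {
        "amount_zscore": (txn["amount"] - mean) / std,
        "velocity_1h": len(window),            # txns already in the window
        "new_country": txn["country"] not in seen_countries,
    }

history = [20.0, 25.0, 22.0, 30.0, 24.0]      # typical small purchases
window = deque([1, 2, 3])                      # three txns in the last hour
feats = behavioral_features(
    {"amount": 480.0, "country": "RO"}, history, {"GB", "FR"}, window
)
print(feats)   # large z-score + new country: a strong combined signal
```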
Core Machine Learning Techniques Summary
| Technique | Type | Best For | Limitation |
|---|---|---|---|
| Logistic Regression | Supervised | Baseline; interpretable | Cannot capture non-linear fraud patterns |
| Gradient Boosted Trees (XGBoost, LightGBM) | Supervised | Tabular fraud data; production workhorse | Limited out-of-the-box interpretability |
| Neural Networks / RNNs | Supervised | Sequential pattern detection | Higher data requirements; harder to interpret |
| Isolation Forest | Unsupervised | Anomaly / novel fraud detection | High false positive rate on its own |
| Autoencoder | Unsupervised | Complex behavioral anomalies | Requires careful architecture tuning |
3. Supervised learning catches known fraud; unsupervised learning catches novel patterns. Supervised models learn from labeled historical data — they will not recognize attack vectors that have never appeared in training data. Unsupervised anomaly detection learns what "normal" looks like and flags deviations — providing a first-line defense against new fraud types. Production systems combine both.
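One simple way production systems combine the two is a routing rule: block on a high supervised score, escalate to human review on a high anomaly score. The thresholds and decision labels below are illustrative assumptions, not calibrated values from the chapter.

```python
def route_transaction(supervised_score, anomaly_score,
                      sup_threshold=0.85, anom_threshold=0.95):
    """Combine a supervised fraud score (trained on known fraud) with an
    unsupervised anomaly score (deviation from 'normal' behavior).
    Thresholds here are illustrative, not production values."""
    if supervised_score >= sup_threshold:
        return "block"      # matches a known fraud pattern
    if anomaly_score >= anom_threshold:
        return "review"     # novel behavior: route to an analyst
    return "approve"

print(route_transaction(0.91, 0.20))   # block
print(route_transaction(0.30, 0.97))   # review
print(route_transaction(0.10, 0.40))   # approve
```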
4. The feedback loop must be protected. Model performance degrades when investigation teams produce inaccurate labels — particularly when "customer confirmed legitimate" dispositions are accepted without verification. Fraudsters using social engineering (calling the bank posing as the customer) can corrupt the training data if verification is weak. Label quality is as important as model quality.
5. Sample selection bias is the hidden danger. When the model doesn't flag a transaction, that transaction typically doesn't get reviewed — and therefore doesn't get labeled. If the model systematically misses a fraud type, the training data will underrepresent that type, and the next model will miss it too. Explore-exploit sampling (reviewing a random sample of low-scored transactions) and customer-reported dispute labels help mitigate this.
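Explore-exploit sampling can be sketched as a review-queue builder that adds a small random slice of low-scored traffic, so unflagged transactions still generate labels. The threshold and explore rate are illustrative parameters.

```python
import random

def select_for_review(transactions, scores, threshold=0.8,
                      explore_rate=0.01, seed=7):
    """Queue all high-scored transactions plus a random 1% sample of
    low-scored ones (explore-exploit sampling, illustrative parameters)."""
    rng = random.Random(seed)
    queue = []
    for txn, score in zip(transactions, scores):
        if score >= threshold:
            queue.append((txn, "flagged"))
        elif rng.random() < explore_rate:
            queue.append((txn, "explore"))   # random audit of the tail
    return queue

txns = list(range(1000))
scores = [0.9 if i % 100 == 0 else 0.1 for i in txns]   # toy score pattern
queue = select_for_review(txns, scores)
flagged = [t for t, kind in queue if kind == "flagged"]
print(len(flagged))   # 10 high-scored transactions
```

The "explore" reviews are what surface the fraud types the current model systematically misses.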
6. Threshold calibration is a business decision, not a technical one. The decision threshold determines the precision-recall tradeoff. The right threshold depends on: the cost of fraud losses, the acceptable rate of customer false-positive friction, the investigation team's capacity, and the value of the customer segment. A premium card program accepts lower false positive rates than a prepaid program.
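The business tradeoff can be made explicit by minimizing expected cost over candidate thresholds. The unit costs and scored examples below are illustrative assumptions; in practice they come from actual loss and friction data.

```python
def expected_cost(threshold, scored, fraud_loss=500.0, friction_cost=15.0):
    """Total cost of operating at a threshold: a missed fraud costs
    fraud_loss, each false positive costs friction_cost (toy unit costs).
    scored: list of (score, is_fraud) pairs."""
    cost = 0.0
    for score, is_fraud in scored:
        flagged = score >= threshold
        if is_fraud and not flagged:
            cost += fraud_loss        # missed fraud: full loss
        elif flagged and not is_fraud:
            cost += friction_cost     # blocked a good customer
    return cost

scored = [(0.95, True), (0.70, True), (0.60, False), (0.10, False)]
best = min((t / 100 for t in range(0, 101, 5)),
           key=lambda t: expected_cost(t, scored))
print(best)   # 0.65 separates both frauds from both legits here
```

Changing `fraud_loss` or `friction_cost` shifts the optimal threshold, which is exactly why a premium card program (high friction cost) and a prepaid program land on different operating points.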
7. Real-time architecture requires a feature store. Card payment fraud decisions require sub-200ms latency. Behavioral features cannot be computed from scratch on each transaction. A low-latency feature store (Redis-based) maintains pre-computed behavioral profiles, updated asynchronously as transactions complete. The slight lag in feature store updates is an acceptable tradeoff for the latency requirement.
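The read/write split of a feature store can be sketched as below, with a plain dict standing in for Redis (an assumption for self-containment; a real deployment would use a Redis client and asynchronous workers).

```python
class FeatureStore:
    """In-memory stand-in for a Redis feature store (illustrative sketch).
    Reads serve pre-computed profiles; writes happen after the transaction
    completes, so profiles may lag slightly behind reality."""

    def __init__(self):
        self._profiles = {}   # account_id -> behavioral profile

    def get_profile(self, account_id):
        # Read path: must stay well inside the sub-200ms decision budget.
        return self._profiles.get(account_id, {"txn_count": 0, "total": 0.0})

    def update_profile(self, account_id, amount):
        # Write path: applied asynchronously after the transaction settles.
        p = self.get_profile(account_id)
        self._profiles[account_id] = {
            "txn_count": p["txn_count"] + 1,
            "total": p["total"] + amount,
        }

store = FeatureStore()
store.update_profile("acct-1", 42.0)
profile = store.get_profile("acct-1")
print(profile)   # {'txn_count': 1, 'total': 42.0}
```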
Class Imbalance Handling Approaches
| Approach | Mechanism | When to Use |
|---|---|---|
| Class weights | Penalize fraud misclassification more heavily during training | Default starting point; simple and effective |
| SMOTE (oversampling) | Generate synthetic fraud examples in feature space | When imbalance is extreme (fraud rate < 0.1%) |
| Undersampling | Reduce legitimate training examples | When training data is very large and time is a constraint |
| Threshold adjustment | Set decision threshold below 0.5 | When model is calibrated but operating point needs adjustment |
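The core SMOTE mechanism, interpolating between a minority example and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration, not the full SMOTE algorithm; the toy 2-D points are made up.

```python
import random

def smote_like(fraud_points, n_new, k=2, seed=0):
    """Generate synthetic minority examples by interpolating between a
    fraud point and one of its k nearest fraud neighbors
    (simplified SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(fraud_points)
        neighbors = sorted(
            (p for p in fraud_points if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nbr = rng.choice(neighbors)
        lam = rng.random()   # position along the segment base -> nbr
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, nbr)))
    return synthetic

fraud = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (5.0, 5.0)]   # toy 2-D features
new_points = smote_like(fraud, n_new=3)
print(len(new_points))   # 3 synthetic fraud examples
```

Because each synthetic point lies on a segment between two real fraud examples, it stays inside the region of feature space where fraud was actually observed.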
8. Explainability in fraud detection has an adversarial constraint. SHAP values should be used internally for analyst review of alerts — helping analysts assess false positives and confirm true positives. Detailed feature-level explanations should not be exposed externally (to fraudsters who could use them to defeat the model). Customer-facing explanations should use general human-readable language without revealing model structure.
9. Continuous monitoring and retraining are essential. Fraud patterns evolve faster than almost any other financial risk. PSI (Population Stability Index) monitors whether the scored population has shifted from the training population. Model performance metrics (precision, recall, F1) should be tracked monthly against labeled dispositions. Models should be retrained at least annually; quarterly or more frequent retraining is better practice.
10. Regulatory governance applies to fraud detection models. GDPR Recital 47 supports fraud prevention as a legitimate interest for processing personal data. GDPR Article 22 requires a human review process for challenged automated decisions. The FCA's Consumer Duty requires monitoring of false positive harm. SR 11-7 (US) requires model documentation, independent validation, and ongoing monitoring for all models, including fraud models.
Performance Metrics Reference
| Metric | Formula | What It Measures |
|---|---|---|
| Precision | TP / (TP + FP) | Of flagged transactions, what % are actually fraud? |
| Recall | TP / (TP + FN) | Of all fraud, what % was caught? |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
| False Positive Rate | FP / (FP + TN) | Rate of legitimate transactions incorrectly blocked |
| AUC-ROC | Area under ROC curve | Model's ability to discriminate fraud vs. legitimate |
| PSI | Σ (actual% − expected%) × ln(actual% / expected%) | Population shift from training to current |
PSI interpretation: < 0.1 = stable; 0.1–0.25 = minor shift (monitor); > 0.25 = significant shift (retrain required)
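The PSI formula and interpretation bands above translate directly into code. The bin shares below are illustrative; in practice they come from bucketing the training-time and current score distributions.

```python
import math

def psi(expected_pct, actual_pct):
    """Population Stability Index: sum over bins of
    (actual% - expected%) * ln(actual% / expected%).
    Inputs are per-bin population shares; each list should sum to 1."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_pct, actual_pct)
    )

# Illustrative bins: training-time vs. current score distribution.
expected = [0.25, 0.25, 0.25, 0.25]
actual   = [0.30, 0.27, 0.23, 0.20]

value = psi(expected, actual)
status = ("stable" if value < 0.1
          else "monitor" if value <= 0.25
          else "retrain")
print(f"PSI = {value:.4f} -> {status}")
```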