Chapter 25: Machine Learning in Fraud Detection
Key Takeaways
1. Fraud detection is a classification problem with distinctive challenges. The combination of extreme class imbalance (0.1–1% fraud rate), temporal dynamics (patterns change constantly), adversarial adaptation (fraudsters study and defeat the model), label noise (not all fraud is reported), and millisecond latency requirements makes fraud detection one of the hardest machine learning applications in financial services. Standard accuracy metrics are misleading — precision and recall are the operative measures.
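A toy calculation (illustrative numbers, not from the chapter) shows why accuracy misleads at these fraud rates: a "model" that flags nothing at all scores 99.9% accuracy while catching zero fraud.

```python
# Illustrative only: at a 0.1% fraud rate, predicting "legitimate" for
# every transaction yields 99.9% accuracy but 0% recall.
total = 100_000
fraud = 100              # 0.1% fraud rate
legit = total - fraud

# The do-nothing model: no true positives, no false positives.
tp, fp, fn, tn = 0, 0, fraud, legit

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3%}")   # 99.900%
print(f"recall   = {recall:.3%}")     # 0.000%
```

This is why precision and recall, not accuracy, are the operative measures.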
2. Feature engineering is more important than model selection. Behavioral baseline features — amount z-score against account history, velocity over rolling windows, new country flag, new device flag — consistently outperform sophisticated models with basic features. The feature engineering process requires deep collaboration between data scientists and fraud operations experts.
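A minimal sketch of the behavioral baseline features named above (amount z-score against account history, rolling-window velocity, new-country flag). The field names, window representation, and values are illustrative assumptions, not the chapter's schema.

```python
import statistics
from collections import deque

def behavioral_features(txn, history, seen_countries, window):
    """Compute baseline behavioral features for one transaction.

    txn: dict with 'amount' and 'country' (hypothetical schema).
    history: past transaction amounts for this account.
    seen_countries: countries previously used by the account.
    window: deque of recent-transaction timestamps inside the
            rolling velocity window (contents elided here).
    """
    mean = statistics.mean(history)
    std = statistics.pstdev(history) or 1.0   # guard against zero variance
    return {
        "amount_zscore": (txn["amount"] - mean) / std,
        "velocity_1h": len(window),            # txns already in the window
        "new_country": txn["country"] not in seen_countries,
    }

history = [20.0, 25.0, 22.0, 30.0, 24.0]      # typical small purchases
window = deque([1, 2, 3])                      # three txns in the last hour
feats = behavioral_features(
    {"amount": 480.0, "country": "RO"}, history, {"GB", "FR"}, window
)
print(feats)   # large z-score + new country: a strong combined signal
```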
Core Machine Learning Techniques Summary
| Technique | Type | Best For | Limitation |
|---|---|---|---|
| Logistic Regression | Supervised | Baseline; interpretable | Cannot capture non-linear fraud patterns |
| Gradient Boosted Trees (XGBoost, LightGBM) | Supervised | Tabular fraud data; production workhorse | Limited out-of-the-box interpretability |
| Neural Networks / RNNs | Supervised | Sequential pattern detection | Higher data requirements; harder to interpret |
| Isolation Forest | Unsupervised | Anomaly / novel fraud detection | High false positive rate on its own |
| Autoencoder | Unsupervised | Complex behavioral anomalies | Requires careful architecture tuning |
3. Supervised learning catches known fraud; unsupervised learning catches novel patterns. Supervised models learn from labeled historical data — they will not recognize attack vectors that have never appeared in training data. Unsupervised anomaly detection learns what "normal" looks like and flags deviations — providing a first-line defense against new fraud types. Production systems combine both.
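One simple way production systems combine the two is a routing rule: block on a high supervised score, escalate to human review on a high anomaly score. The thresholds and decision labels below are illustrative assumptions, not calibrated values from the chapter.

```python
def route_transaction(supervised_score, anomaly_score,
                      sup_threshold=0.85, anom_threshold=0.95):
    """Combine a supervised fraud score (trained on known fraud) with an
    unsupervised anomaly score (deviation from 'normal' behavior).
    Thresholds here are illustrative, not production values."""
    if supervised_score >= sup_threshold:
        return "block"      # matches a known fraud pattern
    if anomaly_score >= anom_threshold:
        return "review"     # novel behavior: route to an analyst
    return "approve"

print(route_transaction(0.91, 0.20))   # block
print(route_transaction(0.30, 0.97))   # review
print(route_transaction(0.10, 0.40))   # approve
```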
4. The feedback loop must be protected. Model performance degrades when investigation teams produce inaccurate labels — particularly when "customer confirmed legitimate" dispositions are accepted without verification. Fraudsters using social engineering (calling the bank posing as the customer) can corrupt the training data if verification is weak. Label quality is as important as model quality.
5. Sample selection bias is the hidden danger. When the model doesn't flag a transaction, that transaction typically doesn't get reviewed — and therefore doesn't get labeled. If the model systematically misses a fraud type, the training data will underrepresent that type, and the next model will miss it too. Explore-exploit sampling (reviewing a random sample of low-scored transactions) and customer-reported dispute labels help mitigate this.
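Explore-exploit sampling can be sketched as a review-queue builder that adds a small random slice of low-scored traffic, so unflagged transactions still generate labels. The threshold and explore rate are illustrative parameters.

```python
import random

def select_for_review(transactions, scores, threshold=0.8,
                      explore_rate=0.01, seed=7):
    """Queue all high-scored transactions plus a random 1% sample of
    low-scored ones (explore-exploit sampling, illustrative parameters)."""
    rng = random.Random(seed)
    queue = []
    for txn, score in zip(transactions, scores):
        if score >= threshold:
            queue.append((txn, "flagged"))
        elif rng.random() < explore_rate:
            queue.append((txn, "explore"))   # random audit of the tail
    return queue

txns = list(range(1000))
scores = [0.9 if i % 100 == 0 else 0.1 for i in txns]   # toy score pattern
queue = select_for_review(txns, scores)
flagged = [t for t, kind in queue if kind == "flagged"]
print(len(flagged))   # 10 high-scored transactions
```

The "explore" reviews are what surface the fraud types the current model systematically misses.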
6. Threshold calibration is a business decision, not a technical one. The decision threshold determines the precision-recall tradeoff. The right threshold depends on: the cost of fraud losses, the acceptable rate of customer false-positive friction, the investigation team's capacity, and the value of the customer segment. A premium card program accepts lower false positive rates than a prepaid program.
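The business tradeoff can be made explicit by minimizing expected cost over candidate thresholds. The unit costs and scored examples below are illustrative assumptions; in practice they come from actual loss and friction data.

```python
def expected_cost(threshold, scored, fraud_loss=500.0, friction_cost=15.0):
    """Total cost of operating at a threshold: a missed fraud costs
    fraud_loss, each false positive costs friction_cost (toy unit costs).
    scored: list of (score, is_fraud) pairs."""
    cost = 0.0
    for score, is_fraud in scored:
        flagged = score >= threshold
        if is_fraud and not flagged:
            cost += fraud_loss        # missed fraud: full loss
        elif flagged and not is_fraud:
            cost += friction_cost     # blocked a good customer
    return cost

scored = [(0.95, True), (0.70, True), (0.60, False), (0.10, False)]
best = min((t / 100 for t in range(0, 101, 5)),
           key=lambda t: expected_cost(t, scored))
print(best)   # 0.65 separates both frauds from both legits here
```

Changing `fraud_loss` or `friction_cost` shifts the optimal threshold, which is exactly why a premium card program (high friction cost) and a prepaid program land on different operating points.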
7. Real-time architecture requires a feature store. Card payment fraud decisions require sub-200ms latency. Behavioral features cannot be computed from scratch on each transaction. A low-latency feature store (Redis-based) maintains pre-computed behavioral profiles, updated asynchronously as transactions complete. The slight lag in feature store updates is an acceptable tradeoff for the latency requirement.
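The read/write split of a feature store can be sketched as below, with a plain dict standing in for Redis (an assumption for self-containment; a real deployment would use a Redis client and asynchronous workers).

```python
class FeatureStore:
    """In-memory stand-in for a Redis feature store (illustrative sketch).
    Reads serve pre-computed profiles; writes happen after the transaction
    completes, so profiles may lag slightly behind reality."""

    def __init__(self):
        self._profiles = {}   # account_id -> behavioral profile

    def get_profile(self, account_id):
        # Read path: must stay well inside the sub-200ms decision budget.
        return self._profiles.get(account_id, {"txn_count": 0, "total": 0.0})

    def update_profile(self, account_id, amount):
        # Write path: applied asynchronously after the transaction settles.
        p = self.get_profile(account_id)
        self._profiles[account_id] = {
            "txn_count": p["txn_count"] + 1,
            "total": p["total"] + amount,
        }

store = FeatureStore()
store.update_profile("acct-1", 42.0)
profile = store.get_profile("acct-1")
print(profile)   # {'txn_count': 1, 'total': 42.0}
```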
Class Imbalance Handling Approaches
| Approach | Mechanism | When to Use |
|---|---|---|
| Class weights | Penalize fraud misclassification more heavily during training | Default starting point; simple and effective |
| SMOTE (oversampling) | Generate synthetic fraud examples in feature space | When imbalance is extreme (fraud rate < 0.1%) |
| Undersampling | Reduce legitimate training examples | When training data is very large and time is a constraint |
| Threshold adjustment | Set decision threshold below 0.5 | When model is calibrated but operating point needs adjustment |
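The core SMOTE mechanism, interpolating between a minority example and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration, not the full SMOTE algorithm; the toy 2-D points are made up.

```python
import random

def smote_like(fraud_points, n_new, k=2, seed=0):
    """Generate synthetic minority examples by interpolating between a
    fraud point and one of its k nearest fraud neighbors
    (simplified SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(fraud_points)
        neighbors = sorted(
            (p for p in fraud_points if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nbr = rng.choice(neighbors)
        lam = rng.random()   # position along the segment base -> nbr
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, nbr)))
    return synthetic

fraud = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (5.0, 5.0)]   # toy 2-D features
new_points = smote_like(fraud, n_new=3)
print(len(new_points))   # 3 synthetic fraud examples
```

Because each synthetic point lies on a segment between two real fraud examples, it stays inside the region of feature space where fraud was actually observed.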
8. Explainability in fraud detection has an adversarial constraint. SHAP values should be used internally for analyst review of alerts — helping analysts assess false positives and confirm true positives. Detailed feature-level explanations should not be exposed externally (to fraudsters who could use them to defeat the model). Customer-facing explanations should use general human-readable language without revealing model structure.
9. Continuous monitoring and retraining are essential. Fraud patterns evolve faster than almost any other financial risk. PSI (Population Stability Index) monitors whether the scored population has shifted from the training population. Model performance metrics (precision, recall, F1) should be tracked monthly against labeled dispositions. Models should be retrained at least annually; quarterly or more frequent retraining is better practice.
10. Regulatory governance applies to fraud detection models. GDPR Recital 47 supports fraud prevention as a legitimate interest for processing personal data. GDPR Article 22 requires a human review process for challenged automated decisions. The FCA's Consumer Duty requires monitoring of false positive harm. SR 11-7 (US) requires model documentation, independent validation, and ongoing monitoring for all models, including fraud models.
Performance Metrics Reference
| Metric | Formula | What It Measures |
|---|---|---|
| Precision | TP / (TP + FP) | Of flagged transactions, what % are actually fraud? |
| Recall | TP / (TP + FN) | Of all fraud, what % was caught? |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
| False Positive Rate | FP / (FP + TN) | Rate of legitimate transactions incorrectly blocked |
| AUC-ROC | Area under ROC curve | Model's ability to discriminate fraud vs. legitimate |
| PSI | Σ (actual% − expected%) × ln(actual% / expected%) | Population shift from training to current |
PSI interpretation: < 0.1 = stable; 0.1–0.25 = minor shift (monitor); > 0.25 = significant shift (retrain required)
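The PSI formula and interpretation bands above translate directly into code. The bin shares below are illustrative; in practice they come from bucketing the training-time and current score distributions.

```python
import math

def psi(expected_pct, actual_pct):
    """Population Stability Index: sum over bins of
    (actual% - expected%) * ln(actual% / expected%).
    Inputs are per-bin population shares; each list should sum to 1."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_pct, actual_pct)
    )

# Illustrative bins: training-time vs. current score distribution.
expected = [0.25, 0.25, 0.25, 0.25]
actual   = [0.30, 0.27, 0.23, 0.20]

value = psi(expected, actual)
status = ("stable" if value < 0.1
          else "monitor" if value <= 0.25
          else "retrain")
print(f"PSI = {value:.4f} -> {status}")
```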