# Chapter 25 Exercises: Machine Learning in Fraud Detection
## Exercise 25.1: Precision-Recall Tradeoff Analysis
Difficulty: Introductory
A fraud detection model operating at a payment processor produces the following results at three different thresholds over 100,000 transactions (actual fraud rate: 0.8%, i.e., 800 fraudulent transactions):
| Threshold | True Positives | False Positives | False Negatives | True Negatives |
|---|---|---|---|---|
| 0.20 | 720 | 4,200 | 80 | 95,000 |
| 0.35 | 640 | 1,800 | 160 | 97,400 |
| 0.55 | 480 | 380 | 320 | 98,820 |
a) For each threshold, calculate: Precision, Recall, F1 Score, and False Positive Rate.
b) Plot (or describe in words) the precision-recall curve implied by these three operating points. Which threshold achieves the best F1 score?
c) The fraud operations team can review 2,500 alerts per day (during normal business hours). Assuming the 100,000 transactions above represent one day's volume, at which threshold(s) is the alert volume manageable? What happens to recall at the highest-precision threshold?
d) The payment processor's merchants bear 60% of fraud losses above a £500 threshold (fraud amounts are uniformly distributed between £50 and £710, giving an average fraud amount of £380). The processor bears the remaining 40%. Using the True Positive and False Negative counts above:
- How much total fraud value is caught vs. missed at each threshold?
- How much does the processor bear vs. the merchants at each threshold?
- If the processor chooses a threshold purely in its own financial interest, which would it choose? If choosing in the merchants' interest, which?
e) An analyst suggests setting a "two-tier" threshold: transactions above 0.55 are automatically blocked; transactions between 0.35 and 0.55 are queued for manual review; transactions below 0.35 are automatically approved. Evaluate this approach. What are the operational requirements for the middle tier?
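A quick way to check your part (a) answers is to compute each metric directly from the confusion-matrix counts. This sketch hard-codes the three operating points from the table above:

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1, and false positive rate from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# The three operating points from the table above: (TP, FP, FN, TN)
operating_points = {
    0.20: (720, 4_200, 80, 95_000),
    0.35: (640, 1_800, 160, 97_400),
    0.55: (480, 380, 320, 98_820),
}

for threshold, counts in operating_points.items():
    m = confusion_metrics(*counts)
    print(f"{threshold:.2f}: precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} f1={m['f1']:.3f} fpr={m['fpr']:.4f}")
```

Note how precision and recall move in opposite directions as the threshold rises; the F1 comparison in part (b) falls directly out of this loop.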
## Exercise 25.2: Feature Engineering Design
Difficulty: Intermediate
You are designing the feature engineering layer for a new credit card fraud detection system for a UK challenger bank. The bank issues contactless, chip-and-PIN, and virtual cards. Customers use cards primarily for: grocery shopping (frequent, small amounts), restaurant dining (infrequent, medium amounts), online retail (variable), and occasional travel.
a) Design 10 behavioral features that would help distinguish fraudulent from legitimate transactions. For each feature, specify:
- Feature name and description
- How it is computed (what historical data is required)
- Why it is expected to be predictive of fraud
- Any limitations or edge cases where the feature may generate false signals
b) Three of your 10 features require customer-level behavioral profiles maintained across all transactions. Describe the data architecture requirements: what must be stored, how often updated, and what the latency requirements are for a card payment authorization system targeting <150ms total processing time.
c) A customer moves to a new country for work. Their behavioral profile is based on UK transactions. For the first 3 months, many of their legitimate transactions will trigger "new country" and "unusual merchant" features. How should the feature engineering system handle this known limitation? What information sources could accelerate profile updating?
d) Fraud attacks often involve purchasing gift cards or crypto at high velocity — a pattern called "cash-out fraud." Design 2–3 additional features specifically targeting this attack type. Consider MCC codes (merchant category codes) for gift card retailers (5945, 5999) and cryptocurrency exchanges.
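As a starting point for part (a), here is one behavioral feature (amount z-score against account history) and one velocity feature sketched in code. The flat-list representations of history are assumptions for illustration, not the chapter's framework; note how the cold-start branch is exactly the kind of limitation part (a) asks you to document:

```python
from statistics import mean, stdev

def amount_zscore(amount: float, history: list[float]) -> float:
    """How many standard deviations this amount sits from the
    account's historical spending. Needs >= 2 past transactions."""
    if len(history) < 2:
        return 0.0  # cold start: no reliable profile yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0  # perfectly uniform history, e.g. fixed subscriptions
    return (amount - mu) / sigma

def velocity_1h(timestamps: list[float], now: float) -> int:
    """Number of transactions in the trailing hour (timestamps in seconds)."""
    return sum(1 for t in timestamps if now - t <= 3600)
```

For part (b), both functions show why customer-level profiles matter: the aggregates (mean, standard deviation, recent timestamps) must be precomputed and cached, since recomputing them from raw history inside a <150ms authorization path is not viable.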
## Exercise 25.3: Feedback Loop Audit
Difficulty: Intermediate
You are conducting a model risk audit of a fraud detection system at a regional bank. The fraud investigation team closes approximately 1,200 alerts per month. You are examining the label quality in the training data.
a) Design an audit methodology to assess label quality in the bank's training dataset. Specify: the sample you would examine, the data sources you would cross-reference, the criteria for flagging a label as potentially incorrect, and what a "materially corrupted" training dataset would look like in quantitative terms.
b) The bank's investigators resolve approximately 35% of alerts as "Customer Confirmed Legitimate" (CCL) based on inbound customer calls. The bank uses knowledge-based authentication: the customer provides date of birth, postcode, and the last 4 digits of their card. Evaluate the fraud risk of this authentication method. What percentage of CCL dispositions might you expect to be fraudulently confirmed?
c) You find that 8.5% of CCL-dispositioned transactions in the past 18 months subsequently generated chargebacks. In a training dataset of 12,000 samples (1.2% labeled fraud rate), approximately how many transactions labeled legitimate are actually fraud? What is the estimated impact on model recall if those examples remain mislabeled as legitimate in the training data?
d) Draft a recommendation memo (500 words) to the bank's Head of Fraud Risk, describing: the label quality problem you identified, its root cause, its impact on model performance, and your recommended remediation (investigation process changes, label correction, retraining schedule).
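One way to set up the arithmetic for part (c) — a scaffold only; how chargeback-contradicted dispositions map onto the 12,000-sample training set is for you to justify in your audit:

```python
# Known quantities from the exercise
alerts_per_month = 1_200
months = 18
ccl_rate = 0.35          # share of alerts dispositioned CCL
chargeback_rate = 0.085  # share of CCL transactions later charged back

ccl_alerts = alerts_per_month * months * ccl_rate
suspect_mislabels = ccl_alerts * chargeback_rate  # CCL labels contradicted by chargebacks

training_samples = 12_000
labeled_fraud = round(training_samples * 0.012)   # 1.2% labeled fraud rate

print(f"CCL dispositions over 18 months: {ccl_alerts:.0f}")
print(f"Contradicted by chargebacks:     {suspect_mislabels:.0f}")
print(f"Fraud labels in training data:   {labeled_fraud}")
```

Comparing the contradicted-label count against the number of fraud labels the model actually trains on is the core of the "materially corrupted" judgment in part (a).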
## Coding Exercise 25.4: Extend the Fraud Detection System
Difficulty: Coding — Intermediate
Using the FraudDetectionSystem framework from the chapter:
1. Add an AnomalyDetector class that implements a simplified isolation forest concept:
```python
class AnomalyDetector:
    """
    Simplified anomaly detection: flags transactions where multiple
    features simultaneously deviate significantly from account history.
    Simulates isolation forest behavior.
    """

    def __init__(self, anomaly_threshold: float = 0.70):
        self.anomaly_threshold = anomaly_threshold

    def compute_anomaly_score(self, txn: Transaction) -> float:
        """
        Compute anomaly score (0-1). Higher = more anomalous.
        Production systems would score with a fitted
        sklearn.ensemble.IsolationForest and explain alerts
        with shap.TreeExplainer.
        """
        # Your implementation: score based on number and magnitude
        # of behavioral deviations
        ...

    def is_anomalous(self, txn: Transaction) -> bool:
        return self.compute_anomaly_score(txn) >= self.anomaly_threshold
```
The anomaly score should increase when multiple of these conditions are true simultaneously:
- amount_zscore > 2.5 (very unusual amount)
- velocity_1h > 5 (high recent velocity)
- new_country is True
- new_device is True
- unusual_hour is True
2. Add an EnsembleScorer class that combines the GBT score and the anomaly score:
```python
class EnsembleScorer:
    def __init__(self, gbt_scorer: GBTFraudScorer,
                 anomaly_detector: AnomalyDetector,
                 gbt_weight: float = 0.70,
                 anomaly_weight: float = 0.30):
        ...

    def ensemble_score(self, txn: Transaction,
                       gbt_score: float,
                       anomaly_score: float) -> float:
        """Weighted combination of GBT score and anomaly score."""
        ...
```
3. Add a ThresholdOptimizer class with a method find_optimal_threshold(transactions, target_recall=0.80) that finds the threshold value that achieves at least target_recall while maximizing precision, given a list of labeled transactions.
4. Write unit tests (using unittest or pytest) for:
- A transaction that should score high anomaly (high amount + new country + velocity > 5)
- A transaction that should score low anomaly (normal amount, known country, low velocity)
- The ensemble scorer correctly weights GBT and anomaly scores
- The threshold optimizer returns a threshold where recall ≥ 0.80
## Exercise 25.5: Regulatory Compliance Assessment
Difficulty: Applied
A UK payment institution has deployed a machine learning fraud detection system with the following characteristics:
- GBT model retrained annually; last retrained 14 months ago
- PSI (Population Stability Index): 0.19 (computed monthly; stable over the last 4 months)
- Fraud rate in scored population: increased from 0.15% to 0.22% over 3 months
- Current model recall: 78% (down from 84% at deployment)
- Investigation process: alerts reviewed by 4 analysts; average case closure time: 2.8 days
- SHAP values computed for all alerts above threshold 0.65; not computed for lower-scoring alerts
- Model inventory: the fraud model is registered; no other ML models are formally registered
- GDPR records: the firm maintains behavioral profiles on all cardholders; retention: 18 months
a) Against each of the following regulatory/governance requirements, assess whether the firm is compliant, partially compliant, or non-compliant. Provide justification:
| Requirement | Status | Justification |
|---|---|---|
| Model monitoring frequency (SR 11-7 equivalent) | | |
| PSI-triggered retraining policy | | |
| SHAP explainability for all alerts | | |
| Model inventory completeness | | |
| GDPR data retention (proportionality) | | |
| Consumer Duty (customer outcome monitoring) | | |
b) The firm's data scientist argues that PSI 0.19 is "within the acceptable range" (< 0.25) and no retraining is required. However, recall has dropped from 84% to 78% in 14 months. Construct the argument that the model should be retrained despite PSI being below the 0.25 threshold.
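When building the part (b) argument, it helps to remember what PSI actually measures: drift in the score distribution, not model accuracy. A minimal PSI computation over pre-binned population shares (the bin shares below are illustrative, not from the exercise):

```python
from math import log

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.
    Inputs are per-bin population shares that each sum to 1.
    Rule of thumb: < 0.10 stable, 0.10-0.25 moderate shift, > 0.25 major shift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * log(a / e)
    return total

# Identical distributions give PSI = 0: the index sees no drift even if the
# relationship between score and fraud outcome (and hence recall) has changed.
baseline = [0.25, 0.35, 0.25, 0.15]
print(psi(baseline, baseline))  # 0.0
```

The comment above is the crux of the counter-argument: a stable PSI is consistent with concept drift, where the input distribution holds steady while the score-to-outcome mapping degrades, so recall can fall from 84% to 78% without PSI ever approaching 0.25.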
c) Draft a model governance remediation plan with a 90-day timeline identifying the 5 most urgent actions from your assessment.
d) The firm is subject to a routine FCA review. Draft a 1-page "Model Risk Summary" that the compliance team could use to brief the FCA on the firm's fraud detection model governance framework, including the current performance status and remediation actions underway.