
Chapter 25: Machine Learning in Fraud Detection


The fraud analyst's world changed twice in a generation. The first change was the shift from paper to digital: when payments moved online, fraud moved with them, and the volume of transactions requiring assessment grew beyond any human team's capacity to handle individually. Rules-based systems filled the gap — if the transaction matches certain conditions (unusual amount, unusual location, unusual time), flag it. This worked until fraudsters discovered the rules.

The second change was machine learning. Not the displacement of rules entirely, but the arrival of systems that could learn from patterns in millions of historical transactions, identify subtle correlations that no analyst would have spotted, and adapt as fraud patterns evolved — without waiting for a human analyst to update a ruleset. The results were striking: false positive rates that had plagued rules-based systems dropped; fraud detection rates climbed; and the fraud losses that had been growing with e-commerce volumes began to be contained.

This chapter examines how machine learning fraud detection works — the techniques, the architecture, the data, and the governance. It also examines what distinguishes machine learning fraud detection from its simpler predecessors, and what it still cannot do. Machine learning is powerful but also limited, and understanding the limits is as important as understanding the capabilities.


The Fraud Detection Challenge

Fraud detection is, at its core, a classification problem: given a transaction (or account opening request, or loan application, or insurance claim), is it fraudulent or legitimate? But the problem has several features that make it harder than generic classification.

Class imbalance. In card payment fraud, fraudulent transactions typically constitute 0.1% to 1% of all transactions. A classifier that labels everything as legitimate will be 99% accurate — and will catch no fraud. Standard machine learning metrics like accuracy are misleading in this context; what matters is precision (of the transactions flagged, how many are actually fraudulent?) and recall (of all fraudulent transactions, how many did we catch?). The tradeoff between these two is the central engineering challenge of fraud detection.
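
A small, self-contained sketch (with illustrative numbers, not data from this chapter) makes the point concrete — accuracy looks excellent for a classifier that catches nothing, while precision and recall expose what is actually happening:

```python
# Illustrative only: 10,000 transactions at a 0.5% fraud rate.
n, fraud_rate = 10_000, 0.005
n_fraud = int(n * fraud_rate)          # 50 fraudulent
n_legit = n - n_fraud                  # 9,950 legitimate

# A degenerate classifier that labels everything "legitimate":
accuracy = n_legit / n                 # 0.995 — looks excellent
recall_degenerate = 0 / n_fraud        # 0.0 — catches no fraud
print(f"accuracy={accuracy:.3f}, recall={recall_degenerate:.3f}")

# A classifier that flags 100 transactions and catches 40 of the 50 frauds:
tp, fp = 40, 60
precision = tp / (tp + fp)             # of flagged, how many were fraud? 0.40
recall = tp / n_fraud                  # of fraud, how much was caught? 0.80
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the alert threshold raises recall and lowers precision; every fraud detection system lives somewhere on that curve.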

Temporal dynamics. Fraud patterns evolve faster than almost any other phenomenon in finance. A new attack vector (a vulnerability in a payment processor, a new social engineering scheme, a compromised batch of card numbers) can be exploited and abandoned within days. A model trained on last year's fraud may not recognize this year's attacks. This is why fraud detection is not a one-time model deployment but an ongoing system that must be continuously monitored and retrained.

Adversarial adaptation. Unlike credit risk or operational risk, fraud involves an intelligent adversary who is actively trying to defeat the detection system. When a new detection pattern is deployed and starts catching a certain fraud type, fraudsters notice — their success rate drops — and they adapt. This adversarial dynamic is unique to fraud and creates a continuous arms race that no model can win permanently.

Label noise. In most machine learning applications, training labels are reliable: a transaction either was or was not fraudulent. In fraud detection, this is harder than it appears. Some fraud is never reported by the customer. Some legitimate transactions are reported as fraud (customers disputing charges they actually authorized). The training data contains noise, and the model learns from that noise.

Latency requirements. Card payment fraud decisions must be made in milliseconds — before the merchant receives the authorization response. Other fraud types (application fraud, account takeover) have slightly more time but still require near-real-time decisions. This constrains which models are practical: a model that requires 10 seconds of inference time is not deployable for card payment authorization.


Feature Engineering: The Foundation of Fraud Detection

Before any model is trained, the most important work in fraud detection happens: feature engineering. Features are the inputs that the model uses to make its prediction. The raw transaction record — amount, merchant, location, timestamp — contains useful information, but the most powerful fraud signals come from derived features that capture behavioral context.

Transaction-level features. The raw transaction attributes: amount, merchant category code (MCC), currency, payment channel (contactless, chip-and-PIN, online, MOTO — mail order/telephone order). By themselves these are limited signals. A £5,000 transaction is not inherently suspicious — it depends on the account holder's usual behavior.

Velocity features. Computed over rolling time windows: how many transactions in the last 1 hour? 24 hours? 7 days? How many distinct merchants in the last hour? How many distinct ATMs in the last 24 hours? Velocity features are among the most powerful fraud signals — fraud attacks often involve rapid sequential transactions before the account holder notices.

Behavioral baseline features. How does this transaction compare to the account holder's historical behavior? Examples: the Z-score of the transaction amount relative to the 90-day mean; whether this merchant category is new for the account; whether the transaction occurs at an unusual time (night vs. day, based on account history); whether this is the first transaction in this country. These features require maintaining a statistical profile of each account — computationally intensive at scale, but highly effective.

Device and network features. For card-not-present (online) transactions: device fingerprint (is this a previously seen device?), IP address reputation, geolocation match to stated address, browser fingerprint. These signals are particularly important for e-commerce fraud where the physical card is not present.

Graph features. Fraud often involves networks: the same account used by multiple compromised cards; the same IP address appearing across multiple fraudulent accounts; merchant terminals associated with multiple fraud reports. Graph analytics can surface these connections — features like "number of fraud reports involving the same IP address in the last 30 days."
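
The simplest graph feature of this kind is a grouped count over a lookback window. A minimal sketch, using hypothetical report data and a function name invented for illustration:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical fraud reports: (ip_address, account_id, report_time)
reports = [
    ("203.0.113.7", "ACC-101", datetime(2024, 3, 10)),
    ("203.0.113.7", "ACC-214", datetime(2024, 3, 12)),
    ("203.0.113.7", "ACC-377", datetime(2024, 3, 14)),
    ("198.51.100.2", "ACC-555", datetime(2024, 1, 2)),   # outside window
]

def fraud_reports_per_ip(reports, as_of, window_days=30):
    """Count fraud reports sharing each IP within the lookback window."""
    cutoff = as_of - timedelta(days=window_days)
    counts = defaultdict(int)
    for ip, _account, ts in reports:
        if ts >= cutoff:
            counts[ip] += 1
    return dict(counts)

counts = fraud_reports_per_ip(reports, as_of=datetime(2024, 3, 15))
print(counts)  # {"203.0.113.7": 3}
```

At scoring time, the transaction's IP is looked up in this table; a count above 1 is itself a strong feature. Production graph analytics go much further (connected components, shared-device clusters), but the windowed-count pattern is the same.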

Aggregation across the portfolio. What is happening across all accounts at the same moment? A sudden spike in declined transactions at a specific merchant may indicate a compromised POS terminal. These portfolio-level features require real-time aggregation across the entire card base.

Feature engineering is where domain knowledge matters most. A data scientist without deep knowledge of payment fraud may build technically correct features that miss the behavioral patterns that fraud analysts have learned from years of investigation. The best fraud detection systems combine machine learning expertise with fraud expertise — typically through close collaboration between data science teams and fraud operations teams.


Supervised Learning: Classifying Known Fraud Patterns

When labeled training data is available — historical transactions tagged as fraudulent or legitimate — supervised learning applies standard classification algorithms to learn the relationship between features and fraud labels.

Logistic Regression. The simplest classifier: a linear combination of features passed through a sigmoid function to produce a probability. Logistic regression is interpretable (each feature has a coefficient), fast to train and score, and a useful baseline. For fraud detection, it is often insufficient because fraud-legitimate boundaries are rarely linear. But its interpretability makes it useful for regulatory explanations and for identifying feature importance.
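
The mechanics fit in a few lines — a linear combination of features pushed through a sigmoid. The coefficients below are hypothetical, chosen only to illustrate the shape of the model:

```python
import math

def logistic_score(features, coefficients, intercept):
    """Linear combination of features passed through a sigmoid."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients for [amount_zscore, velocity_1h, new_country]
coef, intercept = [0.9, 0.4, 1.2], -4.0

low_risk = logistic_score([0.2, 1, 0], coef, intercept)    # typical transaction
high_risk = logistic_score([3.5, 6, 1], coef, intercept)   # burst, new country
print(f"{low_risk:.3f} vs {high_risk:.3f}")
```

The interpretability claim is visible here: each coefficient states directly how much a unit change in its feature moves the log-odds of fraud.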

Gradient Boosted Trees (GBT). The workhorse of tabular data fraud detection. Algorithms including XGBoost, LightGBM, and CatBoost build an ensemble of decision trees sequentially, each correcting the errors of the previous. GBT models typically outperform other approaches on tabular fraud data, are relatively fast to score, and are robust to missing features and outliers. The main limitation: limited interpretability (why did this transaction score 0.87?). SHAP values (Chapter 26) are typically used to explain GBT predictions.
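
The sequential error-correction idea can be shown with a toy boosting loop over depth-1 trees (decision stumps) — a pedagogical sketch of the principle, not how XGBoost or LightGBM is actually implemented:

```python
import numpy as np

def fit_stump(x, residual):
    """Find the split on x that best fits the residuals with two constants."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = residual[x <= s], residual[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return s, lv, rv

def boost(x, y, n_rounds=20, lr=0.3):
    """Each stump fits the residual left by the ensemble so far."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        s, lv, rv = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= s, lv, rv)
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # step pattern
pred = boost(x, y)
print(np.round(pred, 2))  # ≈ [0, 0, 0, 1, 1, 1]
```

Each round shrinks the remaining error by a factor controlled by the learning rate; the production libraries add regularization, histogram-based split finding, and classification losses, but the additive-correction loop is the same.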

Neural Networks / Deep Learning. Multi-layer perceptrons and their variants can capture non-linear relationships that GBTs miss. For certain fraud types — particularly those involving sequential pattern recognition, like transaction sequence models — recurrent neural networks (RNNs) and transformer architectures can outperform GBTs. The tradeoff: higher computational cost, more training data required, harder to interpret.

Handling class imbalance. With a 0.5% fraud rate, standard training would produce a model biased toward predicting legitimate. Standard approaches:

- Oversampling (SMOTE — Synthetic Minority Over-sampling Technique): generate synthetic fraud examples in feature space to balance the training set
- Undersampling: reduce the number of legitimate examples (loses information but improves balance)
- Class weights: train with a higher misclassification cost for fraud examples than for legitimate examples — the simplest approach and often sufficient
- Threshold adjustment: the model outputs a probability; the decision threshold (above which a transaction is flagged) is set below 0.5 to catch more fraud at the cost of more false positives
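
Threshold adjustment in particular is easy to see numerically. A minimal sketch with hypothetical scored transactions, showing recall rising and precision falling as the threshold drops:

```python
def precision_recall_at(scores_and_labels, threshold):
    """Precision and recall when flagging everything at or above threshold."""
    tp = sum(1 for s, y in scores_and_labels if s >= threshold and y == 1)
    fp = sum(1 for s, y in scores_and_labels if s >= threshold and y == 0)
    fn = sum(1 for s, y in scores_and_labels if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scored transactions: (model score, true label; 1 = fraud)
data = [(0.92, 1), (0.81, 1), (0.74, 0), (0.55, 1), (0.40, 0),
        (0.35, 1), (0.22, 0), (0.15, 0), (0.08, 0), (0.05, 0)]

p_hi, r_hi = precision_recall_at(data, 0.5)
p_lo, r_lo = precision_recall_at(data, 0.3)
print(f"threshold=0.5: precision={p_hi:.2f}, recall={r_hi:.2f}")
print(f"threshold=0.3: precision={p_lo:.2f}, recall={r_lo:.2f}")
```

In practice the threshold is set against operational constraints — how many alerts the investigation team can work per day — as much as against model metrics.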


Unsupervised Learning: Detecting Novel Fraud Patterns

Supervised learning requires labeled data — it learns to recognize fraud patterns that occurred historically. The problem: genuinely novel fraud attacks have no historical label. The first time a new attack vector is used, no model trained on past data will recognize it.

Unsupervised anomaly detection addresses this by learning what "normal" looks like and flagging deviations — without requiring fraud labels.

Isolation Forest. Randomly partitions the feature space; anomalous points, lying in sparse regions, require fewer splits to isolate than normal points. Produces an anomaly score for each transaction. Fast, scales well, effective for tabular data. The limitation: anomalies are not necessarily fraud — they include legitimate but unusual transactions (a customer making their first large purchase abroad).
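
The isolation principle can be demonstrated with a stripped-down version of the algorithm (in production one would use a library implementation such as scikit-learn's IsolationForest; this sketch only shows why fewer splits isolate outliers):

```python
import random

def isolation_depth(point, data, depth=0, max_depth=12):
    """Depth at which random axis-aligned splits isolate the point."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    vals = [row[dim] for row in data]
    lo, hi = min(vals), max(vals)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Recurse into the side of the split that contains the point
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, side, depth + 1, max_depth)

def anomaly_score(point, data, n_trees=100):
    """Lower average isolation depth => more anomalous."""
    return sum(isolation_depth(point, data) for _ in range(n_trees)) / n_trees

random.seed(7)
normal = [(random.gauss(100, 10), random.gauss(5, 1)) for _ in range(200)]
outlier = (950.0, 40.0)

out_score = anomaly_score(outlier, normal)
norm_score = anomaly_score(normal[0], normal)
print(f"outlier avg depth {out_score:.1f} vs normal {norm_score:.1f}")
```

The outlier ends up on the sparse side of most splits and is isolated quickly; a point in the dense cluster keeps landing in large subsets and takes many more splits.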

Autoencoder anomaly detection. A neural network trained to compress and reconstruct transactions. The model learns to reconstruct normal transactions accurately. When presented with a fraudulent transaction (unusual pattern not in training data), reconstruction error is high — flagging it as anomalous. Particularly useful for high-cardinality categorical features (merchant codes, geographic codes) that benefit from the autoencoder's embedding representation.

One-Class SVM. Learns a boundary around the normal region of the feature space; points outside the boundary are anomalous. Less commonly used in production fraud detection due to scaling challenges.

In practice, fraud detection systems combine supervised and unsupervised approaches: the supervised model handles known fraud patterns effectively; the unsupervised system catches behavioral anomalies that may indicate novel attacks. Triggered anomalies feed back to the fraud investigation team, who label them — generating training data for the next supervised model iteration.


The Feedback Loop: From Detection to Label to Model

A critical architectural feature of fraud detection systems is the feedback loop: model predictions generate cases for investigation; investigations produce labels; labels improve the model.

The feedback loop contains a dangerous trap: sample selection bias. If the model scores a transaction 0.05 (low fraud probability) and it is not reviewed, there is no label — it remains in the unlabeled majority. If 50% of genuinely fraudulent transactions score below the review threshold, the training data will systematically underrepresent that fraud pattern. The model will never learn what it cannot see — and what it cannot see grows with its own blind spots.

Mitigations include:

- Explore-exploit sampling: periodically sending a random sample of low-scored transactions to review (accepting some efficiency loss in exchange for unbiased sampling)
- Customer-reported fraud labels: integrating dispute reports as ground truth for transactions that were not reviewed proactively
- Champion-challenger testing: running a new model alongside the existing model, routing a portion of traffic to the challenger to compare performance
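
Explore-exploit sampling is a one-branch change to the alert routing logic. A minimal sketch (function and parameter names invented for illustration):

```python
import random

def select_for_review(scored_txns, threshold=0.30, explore_rate=0.02, rng=None):
    """Route all high-scored transactions to review, plus a small random
    sample of low-scored ones so labels remain unbiased below threshold."""
    rng = rng or random.Random()
    review, sampled_low = [], []
    for txn_id, score in scored_txns:
        if score >= threshold:
            review.append(txn_id)
        elif rng.random() < explore_rate:
            sampled_low.append(txn_id)
    return review, sampled_low

# 10,000 hypothetical transactions, all scoring below the threshold
rng = random.Random(42)
scored = [(f"TXN-{i}", rng.random() * 0.25) for i in range(10_000)]
review, low_sample = select_for_review(scored, rng=random.Random(1))
print(len(review), len(low_sample))  # 0 alerts, roughly 2% sampled
```

The sampled transactions cost analyst time with low yield, but they are the only unbiased window into what the model is not flagging — and the resulting labels feed the next training iteration.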


Python Implementation: Fraud Detection Pipeline

The following implementation demonstrates a complete fraud detection system: feature engineering, model training and evaluation, scoring, and alert generation.

from __future__ import annotations
import numpy as np
import pandas as pd
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional
import json
import hashlib
import random


class FraudStatus(Enum):
    LEGITIMATE = "Legitimate"
    FRAUD = "Fraud"
    SUSPECTED = "Suspected — Under Review"
    UNKNOWN = "Unknown"


class AlertDisposition(Enum):
    OPEN = "Open"
    TRUE_POSITIVE = "True Positive — Confirmed Fraud"
    FALSE_POSITIVE = "False Positive — Legitimate"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class Transaction:
    """A card payment transaction with raw and derived features."""
    transaction_id: str
    account_id: str
    timestamp: datetime
    amount: float
    currency: str
    merchant_id: str
    merchant_category_code: str
    country: str
    channel: str  # "chip", "contactless", "online", "atm"

    # Derived features (populated by FeatureEngineer)
    amount_zscore: float = 0.0          # Amount vs. account 90-day mean
    velocity_1h: int = 0                # Transactions in last 1 hour
    velocity_24h: int = 0               # Transactions in last 24 hours
    distinct_merchants_24h: int = 0     # Distinct merchants last 24 hours
    new_merchant: bool = False          # First time at this merchant
    new_country: bool = False           # First transaction in this country
    unusual_hour: bool = False          # Outside account's normal hours
    new_device: bool = False            # For online: new device fingerprint

    # Labels and scores
    fraud_score: float = 0.0           # Model output probability (0–1)
    true_label: FraudStatus = FraudStatus.UNKNOWN

    def to_feature_vector(self) -> np.ndarray:
        """Convert to numeric feature vector for model input."""
        return np.array([
            self.amount,
            self.amount_zscore,
            float(self.velocity_1h),
            float(self.velocity_24h),
            float(self.distinct_merchants_24h),
            float(self.new_merchant),
            float(self.new_country),
            float(self.unusual_hour),
            float(self.new_device),
            # Channel encoding
            float(self.channel == "online"),
            float(self.channel == "atm"),
        ])


@dataclass
class AccountProfile:
    """Behavioral baseline for an account — updated with each transaction."""
    account_id: str
    mean_amount: float = 100.0
    std_amount: float = 50.0
    usual_hours: set[int] = field(default_factory=lambda: set(range(7, 22)))
    known_countries: set[str] = field(default_factory=lambda: {"GBR"})
    known_merchants: set[str] = field(default_factory=set)
    transaction_history: list[datetime] = field(default_factory=list)

    def amount_zscore(self, amount: float) -> float:
        if self.std_amount == 0:
            return 0.0
        return (amount - self.mean_amount) / self.std_amount

    def velocity_in_window(self, timestamp: datetime, hours: int) -> int:
        cutoff = timestamp - timedelta(hours=hours)
        return sum(1 for t in self.transaction_history if t >= cutoff)

    def distinct_merchants_in_window(self, timestamp: datetime, hours: int,
                                      recent_merchants: list[tuple[datetime, str]]) -> int:
        """Count distinct merchant IDs in the last N hours."""
        cutoff = timestamp - timedelta(hours=hours)
        return len({m for ts, m in recent_merchants if ts >= cutoff})

    def is_unusual_hour(self, timestamp: datetime) -> bool:
        return timestamp.hour not in self.usual_hours

    def update(self, transaction: Transaction) -> None:
        """Update profile after a legitimate transaction."""
        # Incremental mean update. (std_amount is left at its seeded value
        # here; a full implementation would track variance with Welford's
        # online algorithm.)
        n = len(self.transaction_history) + 1
        delta = transaction.amount - self.mean_amount
        self.mean_amount += delta / n
        self.known_countries.add(transaction.country)
        self.known_merchants.add(transaction.merchant_id)
        self.transaction_history.append(transaction.timestamp)
        # Keep only last 90 days
        cutoff = transaction.timestamp - timedelta(days=90)
        self.transaction_history = [t for t in self.transaction_history if t >= cutoff]


class FeatureEngineer:
    """
    Computes behavioral context features for each transaction.
    In production: uses streaming feature store (Redis, Feast) for real-time lookups.
    """

    def __init__(self):
        self._profiles: dict[str, AccountProfile] = {}
        self._recent_merchants: dict[str, list[tuple[datetime, str]]] = {}

    def get_or_create_profile(self, account_id: str) -> AccountProfile:
        if account_id not in self._profiles:
            self._profiles[account_id] = AccountProfile(account_id=account_id)
        return self._profiles[account_id]

    def engineer_features(self, txn: Transaction) -> Transaction:
        """Populate derived features for a transaction."""
        profile = self.get_or_create_profile(txn.account_id)

        txn.amount_zscore = profile.amount_zscore(txn.amount)
        txn.velocity_1h = profile.velocity_in_window(txn.timestamp, hours=1)
        txn.velocity_24h = profile.velocity_in_window(txn.timestamp, hours=24)
        txn.new_merchant = txn.merchant_id not in profile.known_merchants
        txn.new_country = txn.country not in profile.known_countries
        txn.unusual_hour = profile.is_unusual_hour(txn.timestamp)

        # Count distinct merchants in last 24 hours
        acct_merchants = self._recent_merchants.get(txn.account_id, [])
        cutoff = txn.timestamp - timedelta(hours=24)
        recent = [m for ts, m in acct_merchants if ts >= cutoff]
        txn.distinct_merchants_24h = len(set(recent))

        # Update merchant tracker
        acct_merchants.append((txn.timestamp, txn.merchant_id))
        acct_merchants = [(ts, m) for ts, m in acct_merchants if ts >= cutoff]
        self._recent_merchants[txn.account_id] = acct_merchants

        return txn


class GBTFraudScorer:
    """
    Gradient Boosted Tree fraud scorer.
    In production: uses XGBoost / LightGBM model loaded from registry.
    This implementation uses a simplified rule-based approximation for demonstration.
    """

    # Weights simulate a GBT model's feature importance
    FEATURE_WEIGHTS = {
        "amount": 0.05,
        "amount_zscore": 0.25,
        "velocity_1h": 0.20,
        "velocity_24h": 0.10,
        "distinct_merchants_24h": 0.08,
        "new_merchant": 0.07,
        "new_country": 0.12,
        "unusual_hour": 0.05,
        "new_device": 0.08,
    }

    def __init__(self, threshold: float = 0.30):
        self.threshold = threshold
        self._scores_log: list[dict] = []

    def score(self, txn: Transaction) -> float:
        """
        Compute fraud score (0–1 probability).
        Production: model.predict_proba(feature_vector)[1]
        """
        score = 0.0

        # Amount z-score contribution
        zscore_contrib = min(abs(txn.amount_zscore) / 5.0, 1.0) * self.FEATURE_WEIGHTS["amount_zscore"]
        score += zscore_contrib

        # Velocity contributions
        v1h_contrib = min(txn.velocity_1h / 10.0, 1.0) * self.FEATURE_WEIGHTS["velocity_1h"]
        v24h_contrib = min(txn.velocity_24h / 20.0, 1.0) * self.FEATURE_WEIGHTS["velocity_24h"]
        score += v1h_contrib + v24h_contrib

        # Binary feature contributions
        if txn.new_country:
            score += self.FEATURE_WEIGHTS["new_country"]
        if txn.new_merchant:
            score += self.FEATURE_WEIGHTS["new_merchant"]
        if txn.unusual_hour:
            score += self.FEATURE_WEIGHTS["unusual_hour"]
        if txn.new_device and txn.channel == "online":
            score += self.FEATURE_WEIGHTS["new_device"]

        # Distinct merchants
        merch_contrib = min(txn.distinct_merchants_24h / 8.0, 1.0) * self.FEATURE_WEIGHTS["distinct_merchants_24h"]
        score += merch_contrib

        # Clamp to [0, 1]
        score = max(0.0, min(1.0, score))

        self._scores_log.append({
            "transaction_id": txn.transaction_id,
            "account_id": txn.account_id,
            "score": score,
            "threshold": self.threshold,
            "flagged": score >= self.threshold,
            "timestamp": txn.timestamp.isoformat(),
        })

        return score

    def batch_score(self, transactions: list[Transaction]) -> list[Transaction]:
        for txn in transactions:
            txn.fraud_score = self.score(txn)
        return transactions

    def performance_metrics(self, scored_transactions: list[Transaction]) -> dict:
        """
        Calculate model performance metrics for labeled transactions.
        Requires true_label to be set on each transaction.
        """
        labeled = [t for t in scored_transactions if t.true_label != FraudStatus.UNKNOWN]
        if not labeled:
            return {"error": "No labeled transactions available"}

        y_true = [1 if t.true_label == FraudStatus.FRAUD else 0 for t in labeled]
        y_pred = [1 if t.fraud_score >= self.threshold else 0 for t in labeled]
        y_score = [t.fraud_score for t in labeled]

        tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
        tn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

        return {
            "labeled_transactions": len(labeled),
            "actual_fraud_count": sum(y_true),
            "flagged_count": sum(y_pred),
            "true_positives": tp,
            "false_positives": fp,
            "false_negatives": fn,
            "true_negatives": tn,
            "precision": round(precision, 4),
            "recall": round(recall, 4),
            "f1_score": round(f1, 4),
            "false_positive_rate": round(fp / (fp + tn) if (fp + tn) > 0 else 0.0, 4),
        }


@dataclass
class FraudAlert:
    """A fraud alert generated when a transaction exceeds the score threshold."""
    alert_id: str
    transaction: Transaction
    score: float
    alert_timestamp: datetime
    disposition: AlertDisposition = AlertDisposition.OPEN
    analyst_notes: str = ""
    resolved_at: Optional[datetime] = None

    @classmethod
    def create(cls, transaction: Transaction) -> "FraudAlert":
        return cls(
            alert_id=f"ALERT-{transaction.transaction_id[:8]}",
            transaction=transaction,
            score=transaction.fraud_score,
            alert_timestamp=datetime.utcnow(),
        )

    def resolve(self, disposition: AlertDisposition, notes: str = "") -> None:
        self.disposition = disposition
        self.analyst_notes = notes
        self.resolved_at = datetime.utcnow()


class FraudDetectionSystem:
    """End-to-end fraud detection system."""

    def __init__(self, threshold: float = 0.30):
        self.feature_engineer = FeatureEngineer()
        self.scorer = GBTFraudScorer(threshold=threshold)
        self.alerts: list[FraudAlert] = []
        self.threshold = threshold

    def process_transaction(self, txn: Transaction) -> tuple[Transaction, Optional[FraudAlert]]:
        """Score a transaction and generate alert if above threshold."""
        txn = self.feature_engineer.engineer_features(txn)
        txn.fraud_score = self.scorer.score(txn)

        alert = None
        if txn.fraud_score >= self.threshold:
            alert = FraudAlert.create(txn)
            self.alerts.append(alert)

        return txn, alert

    def process_batch(self, transactions: list[Transaction]) -> dict:
        """Process a batch of transactions."""
        alerts_generated = []
        for txn in transactions:
            _, alert = self.process_transaction(txn)
            if alert:
                alerts_generated.append(alert)

        total = len(transactions)
        return {
            "total_processed": total,
            "alerts_generated": len(alerts_generated),
            "alert_rate": round(len(alerts_generated) / total * 100, 2) if total else 0.0,
            "alert_ids": [a.alert_id for a in alerts_generated],
        }

    def alert_queue(self) -> list[dict]:
        """Return open alerts sorted by score (highest first)."""
        open_alerts = [a for a in self.alerts if a.disposition == AlertDisposition.OPEN]
        open_alerts.sort(key=lambda a: a.score, reverse=True)
        return [
            {
                "alert_id": a.alert_id,
                "account_id": a.transaction.account_id,
                "amount": a.transaction.amount,
                "score": round(a.score, 4),
                "country": a.transaction.country,
                "new_country": a.transaction.new_country,
                "velocity_1h": a.transaction.velocity_1h,
                "timestamp": a.transaction.timestamp.isoformat(),
            }
            for a in open_alerts
        ]

    def performance_report(self) -> dict:
        # Note: metrics are computed over alerted transactions only, so fraud
        # that scored below the threshold never appears here — an instance of
        # the sample selection bias discussed earlier.
        all_txns = [a.transaction for a in self.alerts]
        metrics = self.scorer.performance_metrics(all_txns)
        open_count = sum(1 for a in self.alerts if a.disposition == AlertDisposition.OPEN)
        confirmed_fraud = sum(1 for a in self.alerts if a.disposition == AlertDisposition.TRUE_POSITIVE)
        false_positives = sum(1 for a in self.alerts if a.disposition == AlertDisposition.FALSE_POSITIVE)

        return {
            "total_alerts": len(self.alerts),
            "open_alerts": open_count,
            "confirmed_fraud": confirmed_fraud,
            "false_positives": false_positives,
            "model_metrics": metrics,
        }


def _make_transaction(account_id: str, amount: float, country: str = "GBR",
                      channel: str = "chip", hours_offset: int = 0,
                      is_fraud: bool = False) -> Transaction:
    """Helper: build a sample transaction."""
    txn = Transaction(
        transaction_id=hashlib.md5(
            f"{account_id}{amount}{hours_offset}{random.random()}".encode()
        ).hexdigest()[:12],
        account_id=account_id,
        timestamp=datetime(2024, 3, 15, 14, 0) - timedelta(hours=hours_offset),
        amount=amount,
        currency="GBP",
        merchant_id=f"MERCH-{random.randint(1000, 9999)}",
        merchant_category_code="5411",
        country=country,
        channel=channel,
    )
    txn.true_label = FraudStatus.FRAUD if is_fraud else FraudStatus.LEGITIMATE
    return txn


if __name__ == "__main__":
    random.seed(42)

    system = FraudDetectionSystem(threshold=0.30)

    # Simulate account ACC-001: typical customer
    # Build profile with 20 historical transactions
    historical = [
        _make_transaction("ACC-001", amount=random.uniform(20, 200),
                         hours_offset=i*4)
        for i in range(20, 0, -1)
    ]
    for txn in historical:
        txn = system.feature_engineer.engineer_features(txn)
        system.feature_engineer.get_or_create_profile("ACC-001").update(txn)

    # Now process current transactions
    transactions = [
        # Normal transactions
        _make_transaction("ACC-001", amount=45.00, hours_offset=0),
        _make_transaction("ACC-001", amount=120.00, hours_offset=0),
        # Suspicious: new country, high amount
        _make_transaction("ACC-001", amount=850.00, country="ROU",
                         hours_offset=0, is_fraud=True),
        # High velocity burst (3 in 1 hour)
        _make_transaction("ACC-001", amount=99.99, hours_offset=0, is_fraud=True),
        _make_transaction("ACC-001", amount=89.99, hours_offset=0, is_fraud=True),
        # New account
        _make_transaction("ACC-002", amount=25.00),
        _make_transaction("ACC-002", amount=3500.00, country="NGA",
                         channel="online", is_fraud=True),
    ]

    result = system.process_batch(transactions)
    print("=== BATCH PROCESSING RESULT ===")
    print(json.dumps(result, indent=2))

    print("\n=== OPEN ALERT QUEUE ===")
    for alert in system.alert_queue():
        print(f"  {alert['alert_id']}: Score {alert['score']:.3f} | "
              f"£{alert['amount']:.2f} | {alert['country']} | "
              f"Velocity 1h: {alert['velocity_1h']} | "
              f"New country: {alert['new_country']}")

    # Simulate analyst dispositions
    for alert in system.alerts:
        if alert.transaction.true_label == FraudStatus.FRAUD:
            alert.resolve(AlertDisposition.TRUE_POSITIVE, "Confirmed fraud — card blocked")
        else:
            alert.resolve(AlertDisposition.FALSE_POSITIVE, "Customer confirmed transaction")

    print("\n=== PERFORMANCE REPORT ===")
    report = system.performance_report()
    for k, v in report.items():
        if k != "model_metrics":
            print(f"  {k}: {v}")
    metrics = report.get("model_metrics", {})
    print(f"  Precision: {metrics.get('precision', 'N/A')}")
    print(f"  Recall: {metrics.get('recall', 'N/A')}")
    print(f"  F1 Score: {metrics.get('f1_score', 'N/A')}")



Real-Time Architecture: Scoring at Millisecond Latency

For card payment fraud, the scoring decision must complete before the authorization response — typically within 200 milliseconds end-to-end (network to network). This constrains the architecture significantly.

The production architecture for real-time fraud scoring typically follows a streaming pattern. Payment events arrive at a message broker (Apache Kafka is the standard). A feature computation service reads from Kafka, fetches pre-computed behavioral features from a low-latency feature store (Redis is typical — sub-millisecond reads), and writes enriched events to the scoring service. The scoring service runs the model (a loaded XGBoost or LightGBM model, typically in-process rather than via API call) and writes the score to a results store within 20–50 milliseconds. The decision service reads the score and applies business rules (block, review, approve) before returning the authorization response.

The feature store is architecturally critical. Behavioral features — velocity, profile baselines, recent merchant history — cannot be computed from scratch on each transaction without exceeding latency requirements. They must be pre-computed and maintained in the feature store, updated asynchronously as transactions complete. This creates an eventual-consistency challenge: the feature store reflects state as of the last update, which may be milliseconds or seconds behind real-time. For most fraud detection use cases this lag is acceptable; for very high-velocity fraud attacks it can matter.
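
The read/write split and the eventual-consistency behavior can be sketched with an in-memory stand-in for the feature store (Redis or Feast in production; class and field names here are illustrative):

```python
class InMemoryFeatureStore:
    """Stand-in for a low-latency feature store. Reads return the last
    materialized value — eventually consistent, because updates are applied
    after the transaction completes, not on the scoring path."""

    def __init__(self):
        self._features: dict[str, dict] = {}

    def get_features(self, account_id: str) -> dict:
        # Read path: must be fast; may be slightly stale.
        return self._features.get(
            account_id, {"txn_count": 0, "velocity_1h": 0, "mean_amount": 0.0})

    def update_features(self, account_id: str, txn_amount: float) -> None:
        # Write path: applied asynchronously after authorization.
        f = self.get_features(account_id)
        f["txn_count"] += 1
        f["mean_amount"] += (txn_amount - f["mean_amount"]) / f["txn_count"]
        f["velocity_1h"] += 1
        self._features[account_id] = f

store = InMemoryFeatureStore()
# Scoring reads features BEFORE the current transaction is written back:
before = store.get_features("ACC-001")
store.update_features("ACC-001", 120.0)
after = store.get_features("ACC-001")
print(before["velocity_1h"], after["velocity_1h"])  # 0 1
```

The gap between `before` and `after` is exactly the consistency lag described above: a second transaction arriving before the first one's write-back would be scored against the stale `before` state.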

Model deployment uses a model registry (MLflow, AWS SageMaker) to version models, track training runs and performance metrics, and manage the promotion from development to staging to production. A champion-challenger framework typically runs the new model on a portion of traffic before full deployment, with automatic rollback if performance degrades.


Explainability in Fraud Detection: A Distinctive Challenge

Chapter 26 covers model explainability broadly. Fraud detection presents a distinctive challenge that deserves attention here: the adversarial constraint on explanation.

When a fraudster is caught — their transaction is declined — they do not generally receive an explanation. But when a legitimate customer is declined (a false positive), they often call their bank. And the bank's fraud analyst must explain the decline in terms the customer can understand. "Our model gave your transaction a high fraud score" is not sufficient — the customer wants to know why, and the analyst needs to know whether the alert was correct.

SHAP values (Chapter 26) are the standard tool: for each alert, compute the feature contributions. A SHAP waterfall chart might show that for a declined transaction, the primary contributors were: new country (+0.18 to score), high amount z-score (+0.14), and three transactions in the last hour (+0.12). The analyst can assess this: the customer is indeed traveling, the amount is consistent with their usual range once travel is accounted for, and the three transactions were all at the same restaurant — a dinner bill, not a burst of small transactions. False positive; unblock.
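The additive structure of a SHAP explanation is easy to render once the contributions are computed. A sketch of the waterfall using the example's numbers, with a hypothetical base value standing in for the model's output on an average transaction (this renders stored contributions; the shap library itself computes them):

```python
# Illustrative waterfall rendering from per-feature contributions.
# The base value and feature names are hypothetical.
base_score = 0.05
contributions = [
    ("new_country", 0.18),
    ("amount_zscore", 0.14),
    ("txn_velocity_1h", 0.12),
]

def waterfall(base, contribs):
    """Order contributions by magnitude and show the running score."""
    rows, running = [], base
    for name, value in sorted(contribs, key=lambda c: -abs(c[1])):
        running += value
        rows.append(f"{name:>18s}  {value:+.2f}  -> {running:.2f}")
    return rows, running

rows, final_score = waterfall(base_score, contributions)
```

The additivity is the analyst's check: base value plus the listed contributions must reproduce the model's score, so nothing in the explanation is hidden.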

The adversarial constraint: if fraudsters can access the model's explanation for their declined transactions, they can use that information to defeat the model. The explanation tells them which features triggered the alert — and they can adjust their behavior accordingly. This is why fraud explanation outputs are typically shown only to analysts, not to the customer or to any external API. The customer receives a generic decline reason; the analyst sees the full SHAP-based explanation.

This tension between explainability requirements (GDPR Article 22; ECOA and its implementing Regulation B) and adversarial robustness is real and unresolved. The practical industry approach: provide explanations in human terms (not model-readable feature names), focus on actionable steps the customer can take, and never expose model architecture or feature importance rankings externally.
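One way to implement that separation is a mapping from internal feature names to customer-safe, actionable messages, with a generic fallback so that unmapped features never leak. A sketch with invented reason text:

```python
# Hypothetical mapping from internal feature names to customer-facing
# messages. The analyst view keeps the full SHAP detail; customers see these.
CUSTOMER_REASONS = {
    "new_country": "This purchase came from an unusual location. "
                   "Let us know before you travel.",
    "amount_zscore": "This purchase was unusually large for this card.",
}
GENERIC_REASON = "This transaction could not be completed. Please contact us."

def customer_facing_reason(top_feature):
    # Fall back to a generic message rather than exposing model internals.
    return CUSTOMER_REASONS.get(top_feature, GENERIC_REASON)
```

The fallback is the adversarial safeguard: any feature not deliberately mapped to a safe message yields no information about the model.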


Regulatory Dimensions of Fraud Detection

Fraud detection systems sit at the intersection of several regulatory frameworks.

Data protection. The behavioral profiles maintained for fraud detection contain personal data. Under GDPR, this data must be: collected lawfully (usually legitimate interests — fraud prevention is a recognized legitimate interest); limited to what is necessary; accurate; retained for no longer than necessary; and secured appropriately. GDPR Recital 47 specifically names fraud prevention as a legitimate interest. The Data Protection Act 2018 in the UK contains similar provisions.

Automated decision-making. GDPR Article 22 gives individuals the right not to be subject to solely automated decisions with significant effects. A card transaction decline is arguably such a decision. The practical exemption: Article 22(2)(b) permits automated decisions "necessary for entering into, or performance of, a contract" — which applies to payment authorization. But firms must have a process for human review upon request.

ECOA and adverse action (US). The Equal Credit Opportunity Act requires creditors to provide specific reasons for adverse action. For fraud-based declines, the interaction with ECOA is complex: declining a payment due to suspected fraud is not a credit decision and typically does not trigger ECOA adverse action requirements. But denying a credit application because of fraud indicators might — and the reasons must be explainable.

Consumer Duty (UK). The FCA's Consumer Duty (effective 2023) requires firms to deliver good outcomes for retail customers. A fraud detection system with a high false positive rate — blocking many legitimate transactions — may cause customer harm. Firms are expected to monitor false positive rates and optimize systems for customer outcomes, not just fraud prevention.


The Feedback Mechanism: From Investigation to Retraining

A fraud model is not a completed product — it is a living system. Model performance degrades as fraud patterns evolve. The time between model training cutoff and model deployment, plus the time the model remains in production, creates an observation gap: events after the training cutoff are invisible to the model.

The industry standard is a model review cycle: monthly monitoring of precision, recall, and PSI (Population Stability Index — measuring whether the scoring population has shifted from the training population); quarterly evaluation against holdout data; annual retraining for major models, more frequently when monitoring alerts to degradation.

The most sophisticated fraud detection organizations run near-continuous retraining: model performance is monitored daily; when performance drops below a threshold or PSI exceeds 0.1, an automated retraining pipeline runs, the challenger model is validated against the champion on holdout data, and if the challenger wins, it replaces the champion automatically. This requires substantial MLOps infrastructure (model registries, automated pipelines, monitoring dashboards) but achieves the fastest possible adaptation to evolving fraud patterns.
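PSI itself is simple to compute over matched bins of the score distribution at training time versus in production. A sketch with illustrative bin proportions, using the 0.1 retraining trigger mentioned above:

```python
import math

def psi(expected, actual):
    """Population Stability Index over matched histogram bins (proportions).
    PSI > 0.1 is a common retraining trigger; > 0.25 signals major shift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, 1e-6), max(a, 1e-6)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

# Score distribution at training time vs. the last month, in 5 bins
# (illustrative proportions, each summing to 1.0):
train = [0.40, 0.25, 0.15, 0.12, 0.08]
recent = [0.25, 0.20, 0.18, 0.18, 0.19]
drift = psi(train, recent)
needs_retrain = drift > 0.1
```

An identical distribution gives PSI of zero; the drifted example above exceeds the 0.1 trigger and would start the automated retraining pipeline.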


The Practitioner's Reality: What Works and What Doesn't

Fraud detection teams at major banks have learned lessons through years of production experience that do not always appear in academic literature.

Feature engineering beats model selection. In a competition between a sophisticated model with basic features and a simple model with excellent features, the simple model usually wins. The behavioral baseline features — amount z-score, velocity, new country — consistently outperform novel architectural choices applied to raw features.

Threshold calibration is as important as model training. The decision threshold determines the precision-recall tradeoff. A threshold of 0.5 will generate very different outcomes from 0.2 or 0.7. The right threshold is a business decision: how much fraud loss is acceptable? How much false positive friction is acceptable? This tradeoff is different for a premium card (low false positive tolerance) than for a prepaid card (higher false positive tolerance, because fraud losses are more impactful relative to card value).
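The tradeoff is easy to see by sweeping the threshold over a scored sample. A toy sketch with invented scores and labels:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall at a decision threshold.
    scores: model fraud probabilities; labels: 1 = confirmed fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Eight scored transactions, four of them confirmed fraud:
scores = [0.95, 0.80, 0.65, 0.60, 0.40, 0.30, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0]
sweep = {t: precision_recall_at(t, scores, labels) for t in (0.2, 0.5, 0.7)}
```

On this sample, 0.2 catches every fraud but with more false positives, while 0.7 alerts only when certain and misses half the fraud — the business decision is which point on that curve to operate at.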

Ensemble approaches outperform single models. Most production fraud systems combine a supervised GBT model (for known patterns), an unsupervised anomaly detector (for novel patterns), and rule-based filters (for specific known attack signatures). The rule-based layer is particularly important for blocking known bad actors — a compromised card number identified by the card network can be blocked instantly by a rule without waiting for the model.
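The layered decision logic can be sketched as a short cascade, with rules checked first, then the supervised score, then the anomaly detector as a catch-all. Thresholds and the blocked-card feed are illustrative:

```python
# Sketch of the layered decision: rules first (known bad actors), then the
# supervised model, with the anomaly detector covering novel patterns.
BLOCKED_CARDS = {"4000-compromised"}  # e.g. a card-network compromise feed

def ensemble_decision(event, supervised_score, anomaly_score):
    if event["card_id"] in BLOCKED_CARDS:
        return "block"           # rule layer: instant, no model needed
    if supervised_score > 0.8:
        return "block"           # known fraud pattern, high confidence
    if supervised_score > 0.5 or anomaly_score > 0.95:
        return "review"          # uncertain, or novel-looking behavior
    return "approve"
```

Note the ordering: the rule layer fires before any model runs, which is what lets a network-reported compromised card be blocked the moment the feed updates.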

Investigation quality is as important as model quality. A model that identifies fraud with 90% recall is useless if the investigation team cannot process alerts faster than they accumulate. Alert quality (precision) and alert volume must be calibrated to the investigation team's capacity. A fraud alert queue with 10,000 open items is not a detection system — it is a liability.
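The capacity calibration is back-of-envelope arithmetic, worth doing before any threshold is set. A sketch with invented volumes:

```python
# Does the alert rate match the investigation team? All numbers illustrative.
daily_transactions = 2_000_000
alert_rate = 0.001                                  # fraction of txns alerted
alerts_per_day = daily_transactions * alert_rate    # 2,000 alerts/day

analysts = 20
alerts_per_analyst_day = 60
capacity = analysts * alerts_per_analyst_day        # 1,200 alerts/day

backlog_growth = alerts_per_day - capacity          # 800/day: unsustainable
```

With these numbers the queue grows by 800 alerts every day; either the threshold rises, precision improves, or the team grows — ignoring the arithmetic produces the 10,000-item queue.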


Priya's Engagement: Assessing a Challenger Bank's Fraud System

When Priya Nair was asked to review Verdant Bank's fraud detection capabilities, she expected a discussion about machine learning models. What she found was more fundamental.

Verdant had deployed a card fraud detection system from a vendor eighteen months earlier. The vendor's model was a gradient-boosted ensemble, commercially competitive, regularly retrained. On paper, the system was sound. But Verdant's fraud loss rate had been trending upward for six months while the industry average was flat.

Priya's diagnosis: the feedback loop was broken. Verdant's investigation team was clearing alerts by classifying suspicious transactions as "customer confirmed legitimate" based on customer service calls — without verifying that it was the customer, not a social-engineering impersonator, who actually made the call. This meant genuinely fraudulent transactions were being labeled as legitimate in the training data. The next retraining incorporated these false labels. The model learned not to flag the patterns the fraudsters were using — because those patterns were labeled as legitimate.

The fix was not a better model. It was a better investigation process. Verdant implemented call-back verification for all "customer confirmed" dispositions before they were added to the training label set. Fraud labels required either chargeback confirmation or a structured verification process. The next model retrain, with clean labels, recovered the performance gap.

Priya's observation for the project debrief: "The model is only as good as the feedback it receives. Garbage in, garbage out. For fraud detection systems, the garbage usually enters through the investigation process, not the data pipeline. That's where you start."


Conclusion: The Arms Race Continues

Machine learning fraud detection represents a genuine advance over rules-based systems. The evidence is in the data: across the industry, the adoption of ML fraud detection has correlated with meaningful reductions in fraud loss rates even as transaction volumes grew dramatically. The models find patterns that no human analyst would identify; they score millions of transactions in milliseconds; they adapt to new patterns through continuous retraining.

But the arms race continues. Fraudsters study what triggers detections and adjust. Social engineering attacks bypass technical controls by manipulating customers rather than exploiting system vulnerabilities. First-party fraud (where customers themselves commit fraud) is difficult to detect with models trained on third-party fraud patterns. Synthetic identity fraud creates personas that look legitimate until they default. Each new fraud vector requires a new detection response.

The compliance professional's role in this system is governance: ensuring that models are documented, validated, monitored, and fairly designed; that the investigation process generates clean training labels; that customer outcomes are monitored and false positive harm is taken seriously; and that the system as a whole meets the regulatory expectations for automated decision-making. The data scientists build the model. The compliance function ensures the model serves the customer as well as the firm.


Chapter 26 extends these themes to explainable AI and model governance more broadly — examining the regulatory frameworks, technical tools, and organizational practices that constitute responsible use of machine learning in regulated financial services.