Case Study: Auditing a Fraud Detector — When 99.9% Accurate Means Mostly Wrong

DataField.Dev


"It is the mark of an instructed mind to rest satisfied with the degree of precision which the nature of the subject admits, and not to seek exactness where only an approximation of the truth is possible." — commonly attributed to Aristotle

Executive Summary

A payments company has shipped a machine-learning fraud detector. The model card proudly reports "99.9% accuracy." The on-call risk team is drowning: more than half of the transactions the model flags turn out, on review, to be legitimate purchases by annoyed customers. Leadership wants to know whether the model is broken. Your job is not to retrain anything; it is to audit the numbers with conditional probability and Bayes' theorem, and explain, in language an executive can act on, why a genuinely accurate detector can still be wrong about most of its alerts.

This is the base-rate fallacy from §21.4, dressed in a production setting. The arithmetic is the same as the rare-disease test; the labels change from "disease/positive" to "fraud/flag," and the stakes change from a worried patient to a blocked card and a churned customer. You will compute the detector's true precision (the probability a flag is real), watch a single false-positive rate destroy it, and derive the threshold change that would actually fix the problem.

Skills applied

Translating a vague "accuracy" claim into the precise conditionals it conflates ($P(\text{flag} \mid \text{fraud})$ vs. $P(\text{fraud} \mid \text{flag})$).
Applying Bayes' theorem with the law of total probability to compute a posterior from a base rate and two likelihoods.
Confirming the Bayes result with a natural-frequencies count over a concrete population.
Quantifying how the answer moves as the base rate and the false-positive rate change.
Turning the analysis into an engineering recommendation (target false-positive rate / threshold).

Background

The scenario

The payments company, NorthPay (a hypothetical example), processes about 1,000,000 transactions per day. Genuine fraud is rare: from labeled chargeback history, roughly 0.1% of transactions are fraudulent. The model team reports the following from their held-out test set:

Sensitivity (recall, true-positive rate): $P(\text{flag} \mid \text{fraud}) = 0.95$. The model catches 95% of fraud.
Specificity: $P(\text{no flag} \mid \text{legit}) = 0.999$, so the false-positive rate is $P(\text{flag} \mid \text{legit}) = 0.001$.
Reported "accuracy": the fraction of all transactions classified correctly, which the model card computed as $0.999$.

The risk team's complaint is empirical: of the transactions the model flags each day, more than half are confirmed legitimate after a manual review or a customer phone call. Leadership's instinct ("99.9% accurate, so 99.9% of flags should be fraud") is exactly the instinct §21.4 warns about.

💡 Intuition: "Accuracy" mixes two very different error types into one headline number, and when one class is rare, accuracy is dominated by the easy majority. A model that flags nothing would itself be 99.9% accurate here — it would be right on every one of the 99.9% legitimate transactions. Accuracy alone cannot distinguish that useless model from a good one. We need conditional probabilities that keep the two error types separate.

Why this matters

Every rare-event detector in production lives on this knife edge: spam filters, intrusion-detection systems, medical screening, content-moderation classifiers, and fraud models all face the same arithmetic. The cost of a false positive (a blocked legitimate card, a quarantined real email, a falsely accused user) is paid in customer trust, and the base-rate fallacy systematically hides how often that cost is incurred. An engineer who can run this audit on a napkin is worth a great deal to any team shipping a classifier.

Phase 1: Name the conditionals the "accuracy" claim conflates

Define the events precisely. Let

$F$ = "the transaction is fraudulent," with base rate $P(F) = 0.001$ (so $P(\overline{F}) = 0.999$),
$G$ = "the model flags the transaction."

The two numbers the model card gives are likelihoods — probabilities of the evidence given the truth: $$P(G \mid F) = 0.95 \quad (\text{sensitivity}), \qquad P(G \mid \overline{F}) = 0.001 \quad (\text{false-positive rate}).$$ The number leadership wants is the reverse conditional — the precision, or the posterior probability that a flagged transaction is truly fraud: $$P(F \mid G) = \ ?$$ These are not the same number, and confusing them is precisely the base-rate fallacy (§21.4). The whole audit is the computation of $P(F \mid G)$ from the three quantities we have.

⚠️ Common Pitfall: The model card's "accuracy" is neither of the conditionals that matter for the risk team. Accuracy is $P(\text{model correct}) = P(G \mid F)P(F) + P(\overline{G} \mid \overline{F})P(\overline{F})$, a blend dominated by the $0.999$ legitimate majority. You can have excellent accuracy and terrible precision simultaneously. Always ask which conditional a metric reports.
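To make the pitfall concrete, the sketch below evaluates the accuracy expression for two detectors: the audited model and a hypothetical one that never flags anything. The function name `accuracy` and its argument order are illustrative choices, not part of the chapter's Toolkit.

```python
def accuracy(base_rate, sensitivity, specificity):
    """P(model correct) = P(G|F)P(F) + P(not G|not F)P(not F)."""
    return sensitivity * base_rate + specificity * (1 - base_rate)

# Audited detector: sensitivity 0.95, specificity 0.999.
audited = accuracy(0.001, 0.95, 0.999)

# Hypothetical "never flag" detector: catches no fraud, never errs on legit.
never_flag = accuracy(0.001, 0.0, 1.0)

print(round(audited, 6), round(never_flag, 6))
# Prints: 0.998951 0.999
```

Both values round to 0.999 at three decimals, and the detector that catches no fraud at all scores marginally higher. Accuracy cannot separate the two.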

Phase 2: Compute the evidence with the law of total probability

We need $P(G)$, the unconditional probability that the model flags a transaction. A transaction is flagged either because it is fraud-and-caught or legit-and-falsely-flagged; those two routes are disjoint and cover every flag, so the law of total probability (§21.3) gives $$P(G) = P(G \mid F)\,P(F) + P(G \mid \overline{F})\,P(\overline{F}).$$ Substituting the numbers: $$P(G) = (0.95)(0.001) + (0.001)(0.999).$$ Compute each term by hand: $$(0.95)(0.001) = 0.00095, \qquad (0.001)(0.999) = 0.000999,$$ $$P(G) = 0.00095 + 0.000999 = 0.001949.$$ So about $0.1949\%$ of all transactions get flagged: on a million transactions, roughly 1,949 alerts a day. Note that the term $0.00095$ (the true-positive mass) is smaller than $0.000999$ (the false-positive mass); the false positives already outweigh the true positives before we even divide.

Phase 3: Apply Bayes' theorem for the precision

Now the posterior. Bayes' theorem (§21.3): $$P(F \mid G) = \frac{P(G \mid F)\,P(F)}{P(G)} = \frac{0.00095}{0.001949} \approx 0.4874.$$ The precision is about 49%. Fewer than half of all flags are real fraud — more than half are legitimate customers getting blocked. Leadership's "99.9%" is off by a factor of roughly two; the model is not broken, but the headline metric was the wrong one.

Here is the computation in code, reusing the chapter's Toolkit functions exactly:

from dmtoolkit.probability import bayes_update

base_rate    = 0.001    # P(fraud)
sensitivity  = 0.95     # P(flag | fraud)
false_pos    = 0.001    # P(flag | legit)

precision = bayes_update(base_rate, sensitivity, false_pos)
print(round(precision, 4))
# Expected output:
# 0.4874

We hand-trace `bayes_update`: it first calls `total_probability(0.001, 0.95, 0.001)` $= 0.95(0.001) + 0.001(0.999) = 0.00095 + 0.000999 = 0.001949$, then `bayes(0.001, 0.95, 0.001949)` $= 0.95 \times 0.001 / 0.001949 = 0.00095 / 0.001949 \approx 0.4874$, matching the hand calculation.
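For readers without the Toolkit at hand, the hand-trace suggests a simple stand-in. The sketch below is an assumption about the behavior described in the trace, not the Toolkit's actual source; the `_sketch` suffix on each name keeps that distinction visible.

```python
def total_probability_sketch(prior, likelihood, false_pos):
    """P(E) = P(E|H)P(H) + P(E|not H)P(not H)."""
    return likelihood * prior + false_pos * (1 - prior)

def bayes_sketch(prior, likelihood, evidence):
    """P(H|E) = P(E|H)P(H) / P(E)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence

def bayes_update_sketch(prior, likelihood, false_pos):
    """Posterior P(H|E) from a prior and the two likelihoods."""
    evidence = total_probability_sketch(prior, likelihood, false_pos)
    return bayes_sketch(prior, likelihood, evidence)

p_flag = total_probability_sketch(0.001, 0.95, 0.001)
precision = bayes_update_sketch(0.001, 0.95, 0.001)
print(round(p_flag, 6), round(precision, 4))
# Prints: 0.001949 0.4874
```

The guard on `evidence` is a design choice of this sketch: a flag that can never occur has no defined posterior, so failing loudly seems safer than returning a number.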

Phase 4: Confirm with natural frequencies

The formula is correct, but executives trust counts more than they trust a posterior. Re-derive the same 49% by imagining one full day of 1,000,000 transactions (§21.4's intuition pump):

$0.1\%$ are fraud: that is $1{,}000$ fraudulent transactions. The model catches $95\%$: $950$ true positives.
$999{,}000$ are legitimate. The model falsely flags $0.1\%$ of them: $999$ false positives.
Total alerts: $950 + 999 = 1{,}949$ (matching the $P(G)$ count from Phase 2). Of those, only $950$ are real fraud.

So the precision is $$P(F \mid G) = \frac{950}{1949} \approx 0.4874,$$ the identical 49%. The story is now visceral: the risk team works 1,949 alerts a day, and 999 of them — more than half — are real customers whose cards were wrongly blocked.

def fraud_frequencies(total, base_rate, sensitivity, false_pos):
    """Return (true_positives, false_positives, precision) over a population."""
    fraud = total * base_rate
    legit = total - fraud
    tp = fraud * sensitivity
    fp = legit * false_pos
    return tp, fp, tp / (tp + fp)

tp, fp, prec = fraud_frequencies(1_000_000, 0.001, 0.95, 0.001)
print(round(tp), round(fp), round(prec, 4))  # round(), not int(): int() truncates, so float error could yield 949
# Expected output:
# 950 999 0.4874

The counts $950$ and $999$ are computed directly: $1{,}000{,}000 \times 0.001 \times 0.95 = 950$ and $1{,}000{,}000 \times 0.999 \times 0.001 = 999$, and $950/(950+999) = 950/1949 \approx 0.4874$.

🔗 Connection: This is precisely the precision in the precision/recall tradeoff you may have met in machine learning — and it is a Bayes computation wearing different notation (§21.4). Recall is the sensitivity $P(G \mid F)$ we were given; precision is the posterior $P(F \mid G)$ we just computed. The base rate is what ties them together, and it is exactly what the "accuracy" headline threw away.
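The three metrics can be read off one confusion matrix. The sketch below fills the four cells from the Phase 4 day of 1,000,000 transactions; the variable names are illustrative.

```python
# One day of 1,000,000 transactions, using the counts from Phase 4.
tp = 950                      # fraud, flagged
fn = 1_000 - tp               # fraud, missed
fp = 999                      # legitimate, flagged
tn = 999_000 - fp             # legitimate, not flagged

recall = tp / (tp + fn)                       # P(G | F), given on the model card
precision = tp / (tp + fp)                    # P(F | G), the Bayes posterior
accuracy = (tp + tn) / (tp + fn + fp + tn)    # the headline number

print(round(recall, 4), round(precision, 4), round(accuracy, 4))
# Prints: 0.95 0.4874 0.999
```

The same four counts yield a 0.999 headline and a 0.4874 precision; the difference is only in which cells each metric divides by.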

Phase 5: A second opinion — cascading detectors

Before recommending a threshold change, the risk team floats a cheaper-sounding idea: keep the model as is, but send every flagged transaction to a second, independent review model, and only act when both models flag it. Does stacking detectors fix the precision problem? This is a Bayes question, and the key move is the one from §21.4's repeated-test discussion: the posterior after the first model becomes the prior for the second.

After the first model flags a transaction, our belief that it is fraud is no longer the base rate $0.001$ — it is the posterior we just computed, $P(F \mid G_1) \approx 0.4874$. Feed that in as the new prior and apply Bayes again with the second model's likelihoods (assume, for the sake of the estimate, the same sensitivity $0.95$ and false-positive rate $0.001$, and that the second model's errors are independent of the first's given the true label): $$P(F \mid G_1, G_2) = \frac{P(G_2 \mid F)\,P(F \mid G_1)}{P(G_2 \mid F)\,P(F \mid G_1) + P(G_2 \mid \overline{F})\,P(\overline{F} \mid G_1)}.$$ Plug in $P(F \mid G_1) = 0.4874$ (so $P(\overline{F} \mid G_1) = 0.5126$): $$P(F \mid G_1, G_2) = \frac{(0.95)(0.4874)}{(0.95)(0.4874) + (0.001)(0.5126)} = \frac{0.463030}{0.463030 + 0.000513} = \frac{0.463030}{0.463543} \approx 0.9989.$$

first  = bayes_update(0.001, 0.95, 0.001)   # after model 1 flags: 0.4874
second = bayes_update(first, 0.95, 0.001)    # model 2 also flags, prior = first
print(round(first, 4), round(second, 4))
# Expected output:
# 0.4874 0.9989

Two independent flags lift precision from 49% to 99.9%: a transaction flagged by both models is almost certainly fraud. This is why layered detectors work, and it is nothing more than Bayes applied twice. The catch is hidden in the assumption: the two models must err independently given the truth. In practice, two models trained on the same features tend to make correlated mistakes (they get fooled by the same unusual-but-legitimate transactions). That violates the independence we imported, so the 99.9% figure overstates the two-flag precision a real cascade would achieve. This is exactly the "watch where independence was imported" lesson of §21.2.

🚪 Threshold Concept: the posterior is the next prior. Belief updating composes. Every new piece of evidence is absorbed by treating your current posterior as the prior for the next Bayes step. This single principle — chain the updates — is the entire engine behind sequential testing, online learning, Kalman filters, and the multi-stage detection pipelines that real fraud and spam systems use. Once you see updating as composable, "how do I combine many noisy signals?" always has the same answer: multiply the likelihood ratios, or equivalently, fold them in one at a time.
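The "multiply the likelihood ratios" form can be sketched directly in odds. The helper name `posterior_from_odds` is hypothetical, and the code assumes, as Phase 5 does, that the flags are conditionally independent given the true label.

```python
def posterior_from_odds(prior, likelihood_ratios):
    """Fold likelihood ratios into the prior odds, then convert back to a probability."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

lr_flag = 0.95 / 0.001   # P(G|F) / P(G|not F) = 950 for one flag

one_flag = posterior_from_odds(0.001, [lr_flag])
two_flags = posterior_from_odds(0.001, [lr_flag, lr_flag])
print(round(one_flag, 4), round(two_flags, 4))
# Prints: 0.4874 0.9989
```

Each flag multiplies the odds by 950. Prior odds of 1 to 999 become roughly 0.95 to 1 after one flag and roughly 903 to 1 after two, which reproduces the sequential Bayes updates.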

⚠️ Common Pitfall: Cascading only helps if each stage's evidence is genuinely new. Running the same model twice on the same transaction tells you nothing the first run didn't: the second "observation" is fully determined by the first, so $P(G_2 \mid F, G_1) = P(G_2 \mid \overline{F}, G_1) = 1$, the likelihood ratio is 1, and the update does nothing. Independence of the evidence (given the label) is what makes a second flag informative; without it, you are double-counting one signal, which is also precisely the error the "naive" in naive Bayes risks (§21.5).
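How much correlation hurts can be explored with a toy dependence model. In the sketch below, the second model repeats the first model's decision with probability `copy_prob` and otherwise errs independently. Both the mixing model and the `copy_prob` knob are assumptions made for illustration; nothing on the model card measures them.

```python
def posterior(prior, p_flag_if_fraud, p_flag_if_legit):
    """Bayes update for a binary hypothesis."""
    num = p_flag_if_fraud * prior
    return num / (num + p_flag_if_legit * (1 - prior))

def two_flag_precision(copy_prob, base=0.001, sens=0.95, fpr=0.001):
    """Precision after both models flag, under a toy 'copy' dependence model.

    copy_prob is a hypothetical knob: 0.0 means fully independent errors,
    1.0 means the second model always repeats the first.
    """
    first = posterior(base, sens, fpr)
    # Flag probabilities for model 2, given that model 1 already flagged.
    p2_if_fraud = copy_prob + (1 - copy_prob) * sens
    p2_if_legit = copy_prob + (1 - copy_prob) * fpr
    return posterior(first, p2_if_fraud, p2_if_legit)

for c in (0.0, 0.5, 1.0):
    print(c, round(two_flag_precision(c), 4))
```

At `copy_prob = 0.0` the sketch recovers the independent-case 0.9989, and at `1.0` it falls back to the single-flag 0.4874. Under this toy model, a half-correlated second opinion lands at roughly 65%, far short of the independent figure.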

Phase 6: What would actually fix it?

The audit's payoff is a recommendation. The precision is crushed by the false-positive mass $P(G \mid \overline{F})\,P(\overline{F})$, and we cannot change the base rate $P(F) = 0.001$, which fixes $P(\overline{F}) = 0.999$: the world has the fraud rate it has. The only lever is the false-positive rate. Suppose the team retunes the decision threshold to cut the false-positive rate tenfold, to $0.0001$, while keeping sensitivity at $0.95$. Recompute the precision:

print(round(bayes_update(0.001, 0.95, 0.0001), 4))   # 10x lower false-positive rate
print(round(bayes_update(0.001, 0.95, 0.00001), 4))  # 100x lower
# Expected output:
# 0.9048
# 0.9896

Hand-trace the first: $P(G) = 0.95(0.001) + 0.0001(0.999) = 0.00095 + 0.0000999 = 0.0010499$, so $P(F \mid G) = 0.00095/0.0010499 \approx 0.9048$. Cutting the false-positive rate by $10\times$ lifts precision from 49% to about 90%. Sensitivity never changed. The fix lives entirely in the false-positive rate, exactly as the base-rate analysis predicts: a detector for a rare event must drive its false-positive rate far below the event's base rate to be useful.

💡 Intuition: The governing comparison is between the two masses $P(G \mid F)P(F)$ (true positives) and $P(G \mid \overline{F})P(\overline{F})$ (false positives). Because $P(\overline{F})$ is huge, even a tiny per-item false-positive rate produces a large false-positive mass. Precision crosses 50% only once the true-positive mass overtakes the false-positive mass — which, with sensitivity near 1, happens roughly when the false-positive rate drops below the base rate. Here base rate $= 0.001$, and precision crossed 50% as the false-positive rate fell through $\approx 0.001$; pushing it to $0.0001$ buys comfortable headroom.
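The break-even comparison can be turned into a target. Writing $s$ for sensitivity, $p$ for the base rate, $f$ for the false-positive rate, and $t$ for the desired precision, the condition $\frac{sp}{sp + f(1-p)} \ge t$ rearranges to $f \le \frac{s\,p\,(1-t)}{t\,(1-p)}$. The sketch below evaluates that bound; the function name is hypothetical.

```python
def max_false_pos_rate(base_rate, sensitivity, target_precision):
    """Largest false-positive rate that still achieves the target precision.

    Solves s*p / (s*p + f*(1 - p)) >= t for f, giving
    f <= s*p*(1 - t) / (t*(1 - p)).
    """
    if not 0 < target_precision < 1:
        raise ValueError("target_precision must be strictly between 0 and 1")
    s, p, t = sensitivity, base_rate, target_precision
    return s * p * (1 - t) / (t * (1 - p))

for target in (0.5, 0.9, 0.99):
    print(target, f"{max_false_pos_rate(0.001, 0.95, target):.2e}")
```

With sensitivity 0.95 and base rate 0.001, the bound is about $9.5 \times 10^{-4}$ for 50% precision, about $1.06 \times 10^{-4}$ for 90%, and about $9.6 \times 10^{-6}$ for 99%. Each extra "nine" of precision costs roughly another tenfold cut in the false-positive rate.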

A classifier doesn't have one false-positive rate — it has a whole curve of operating points as you move the decision threshold, trading sensitivity against the false-positive rate. Tabulating a few candidate operating points (all at base rate $0.001$) turns the audit into a menu leadership can choose from:

Operating point	Sensitivity $P(G\mid F)$	False-pos rate $P(G\mid\overline{F})$	Precision $P(F\mid G)$	Daily alerts (of 1M)
Current (aggressive)	0.95	0.001	≈ 0.49	≈ 1,949
Balanced	0.90	0.0001	≈ 0.90	≈ 1,000
Conservative	0.80	0.00001	≈ 0.99	≈ 810

Reading down the precision column shows the real engineering choice: each tightening of the threshold sacrifices a little recall (some fraud slips through) to win a large jump in precision (far fewer blocked customers). Which row is "best" is not a math question — it depends on the relative cost of a missed fraud versus an angry customer, which is the cost-weighted decision in the Your-Turn extensions. The math's job is to hand leadership the honest precision for each option; the base-rate analysis is what makes those numbers trustworthy.
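The rows of the operating-point table can be regenerated from the three inputs per row. The sketch below does so and also reports the expected number of fraud cases missed per day, a derived column that is not in the table but follows from the same numbers. The helper name `audit_row` is illustrative.

```python
# (label, sensitivity, false-positive rate) for each candidate operating point.
operating_points = [
    ("Current (aggressive)", 0.95, 0.001),
    ("Balanced", 0.90, 0.0001),
    ("Conservative", 0.80, 0.00001),
]

def audit_row(base_rate, sensitivity, false_pos, daily_volume=1_000_000):
    """Return (precision, expected daily alerts, expected missed fraud per day)."""
    p_flag = sensitivity * base_rate + false_pos * (1 - base_rate)
    precision = sensitivity * base_rate / p_flag
    alerts = p_flag * daily_volume
    missed = (1 - sensitivity) * base_rate * daily_volume
    return precision, alerts, missed

for label, sens, fpr in operating_points:
    precision, alerts, missed = audit_row(0.001, sens, fpr)
    print(f"{label:22s} precision={precision:.2f} alerts={alerts:.0f} missed={missed:.0f}")
```

Missed fraud rises from about 50 to 100 to 200 cases a day as the threshold tightens, which is the recall cost that leadership weighs against the precision gain.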

Phase 7: The same mistake in a courtroom — the prosecutor's fallacy

The audit you just performed is structurally identical to a notorious legal error, and seeing the parallel cements why the distinction $P(A \mid B) \ne P(B \mid A)$ is not academic (§21.4). Imagine a fraud case that reaches a dispute: an analyst testifies, "This transaction pattern matches known fraud; only 0.1% of legitimate transactions look like this." A naive listener hears, "So there's a 99.9% chance this was fraud." That inference is the prosecutor's fallacy — and it is exactly leadership's original mistake, now with a person's reputation attached.

Line up the two conditionals explicitly:

What the analyst measured: $P(\text{pattern} \mid \text{legitimate}) = 0.001$ — the false-positive rate, a likelihood.
What the listener heard: $P(\text{legitimate} \mid \text{pattern}) = 0.001$, i.e. $P(\text{fraud} \mid \text{pattern}) = 0.999$ — a posterior.

These are reverse conditionals, and Bayes shows they differ by the base rate. With our numbers ($P(\text{fraud}) = 0.001$, sensitivity $0.95$, false-positive rate $0.001$), the actual posterior is the precision we already computed: $P(\text{fraud} \mid \text{pattern}) \approx 0.49$, not $0.999$. The "only 0.1% of legitimate transactions look like this" statistic is real and unimpeachable; the inference drawn from it is off by a factor that depends entirely on how rare fraud is in the relevant pool.

🔗 Connection: This is the same arithmetic that has produced real wrongful convictions from DNA and rare-disease evidence (§21.4). Swap "transaction pattern" for "DNA match" and "fraud" for "guilt": "the chance of this DNA matching at random is 1 in a million" ($P(\text{match} \mid \text{innocent})$) is routinely, disastrously, reported as "the chance the defendant is innocent is 1 in a million" ($P(\text{innocent} \mid \text{match})$). The gap between them is the base rate of guilt in the suspect pool. In a payments context the cost is a frozen account and a lost customer; in a courtroom it is a person's freedom. Same theorem, very different stakes.

The remedy in both settings is the same one this whole case study has applied: refuse to read a likelihood as a posterior, and always supply the base rate before stating how much a piece of evidence actually means.

Discussion Questions

The model card reported a single "accuracy" of 0.999. Write out the full expression for accuracy in terms of the two likelihoods and the base rate, and show numerically that a model flagging nothing would also score 0.999. What does this reveal about accuracy as a metric for rare events?
Phase 6 changed only the false-positive rate. Suppose instead the team could only raise sensitivity (from 0.95 toward 1.0) without touching the false-positive rate. Argue from the Bayes formula why this barely helps precision, and quantify the ceiling precision as sensitivity $\to 1$.
Phase 5 cascaded two models assuming independent errors given the label. Suppose instead the two models' false positives are perfectly correlated (they get fooled by exactly the same legitimate transactions). Argue why, in that extreme, the second flag adds no information, and explain what this means for teams that build a "second opinion" model on the same features as the first.
Phase 7 reframed the audit as the prosecutor's fallacy. Now turn it around: a customer whose card was wrongly blocked complains that the bank "accused" them of fraud. Using the precision number, explain to that customer — in plain language, no jargon — what a flag actually does and does not imply about them, and why a flag is a request for a closer look, not a verdict.

Your Turn: Extensions

Option A (sweep the base rate). Write code that computes precision as the fraud base rate sweeps over $0.0001, 0.001, 0.01, 0.05$ with sensitivity 0.95 and false-positive rate 0.001. At which base rate does precision first exceed 90%? Explain why detectors are easier to make precise on higher-base-rate populations (e.g., transactions already pre-filtered as "high risk").
Option B (the cost-weighted threshold). Suppose a false positive costs \$5 (a support call and an annoyed customer) and a missed fraud costs \$200 (the chargeback). Write an expression for the expected cost per transaction as a function of the threshold's $(\text{sensitivity}, \text{false-positive rate})$ operating point, and discuss how Bayes' posterior feeds a decision rule rather than just a number.
Option C (two-stage cascade, fully). Implement the cascade from Phase 5 as a function cascade_precision(base, sens, fpr, stages) that applies the update stages times, feeding each posterior in as the next prior. Tabulate precision for 1, 2, and 3 stages and comment on diminishing returns.

Key Takeaways

"Accuracy" hides the base-rate problem. For a rare event, accuracy is dominated by the easy majority and can be near-perfect for a useless model. Always demand the conditional that matters — here, precision $P(F \mid G)$.
Precision is a Bayes posterior. $P(\text{fraud} \mid \text{flag}) = \frac{P(\text{flag} \mid \text{fraud})P(\text{fraud})}{P(\text{flag})}$, with the denominator built by the law of total probability. The natural-frequencies count gives the identical answer and is the version to put in front of leadership.
False positives from a huge majority swamp true positives from a rare class. A 0.1% false-positive rate on 999,000 legitimate transactions produced more false alarms (999) than the detector caught real fraud cases (950).
The fix is the false-positive rate, not sensitivity. To make a rare-event detector precise, drive its false-positive rate below the base rate. Bayes tells you exactly how much headroom each tenfold reduction buys.