Case Study 1: The 99% Accurate Model That Was Useless — Fraud Detection
Tier 3 — Illustrative/Composite Example: The financial institution, data, and model results in this case study are entirely fictional. However, the patterns described — the accuracy paradox, class imbalance challenges, and the critical importance of metric selection in fraud detection — reflect widely documented experiences in the financial industry. All names, figures, and scenarios are invented for pedagogical purposes.
The Setting
River Valley Financial is a mid-sized bank processing about 8 million credit card transactions per month. Their existing fraud detection system is rule-based: it flags transactions above $5,000 from new merchants, transactions from foreign countries without travel notification, and a handful of other hard-coded patterns. It catches about 40% of fraud and generates a lot of false alarms.
The bank's new analytics manager, Kenji Tanaka, is tasked with building a machine-learning-based fraud detection system. The expectation from leadership is clear: "Catch more fraud, generate fewer false alarms, and give us a number we can put in the quarterly report."
Kenji assembles three months of transaction data: 24 million transactions, of which about 4,800 are confirmed fraud. That's a fraud rate of 0.02% — one fraudulent transaction for every 5,000 legitimate ones.
The Question
Kenji's primary question is: Can we build a classifier that identifies fraudulent transactions in real time, so we can block or flag them before they clear?
Secondary questions:

- What transaction features are most predictive of fraud?
- How do we balance catching fraud vs. blocking legitimate transactions?
- What metrics should we use to evaluate the model, given that fraud is extremely rare?
The First Model: A Cautionary Tale
Kenji trains a logistic regression model with features including transaction amount, merchant category, time of day, distance from the cardholder's home ZIP code, and the number of transactions in the past hour.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model_v1 = LogisticRegression(max_iter=1000)
model_v1.fit(X_train, y_train)

y_pred = model_v1.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
```

```
Accuracy: 0.9998
```
99.98% accuracy. Kenji almost sends the email to leadership. Then he checks the confusion matrix.
```python
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
```

```
[[5999997       3]
 [   1195       5]]
```
The model predicted "not fraud" for almost everything. It correctly classified 5,999,997 legitimate transactions and caught exactly 5 of the 1,200 fraudulent ones. Five. Out of twelve hundred.
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))
```

```
              precision    recall  f1-score   support

       Legit       1.00      1.00      1.00   6000000
       Fraud       0.63      0.00      0.01      1200

    accuracy                           1.00   6001200
```
Recall for fraud: 0.004 (0.4%). The model catches fewer than 1 in 200 fraudulent transactions. The 99.98% accuracy is a mirage — the model has essentially learned to say "not fraud" for everything, and because fraud is so rare, that strategy is almost always correct.
Kenji does NOT send the email.
The Problem: Extreme Class Imbalance
This is the accuracy paradox at its most extreme. When one class represents 99.98% of the data, a model can achieve near-perfect accuracy by ignoring the rare class entirely. The model's loss function (which optimizes for overall accuracy) treats every misclassification equally — so missing 1,200 fraud cases costs the same as getting 1,200 legitimate cases wrong. From the model's perspective, why bother learning the complex patterns of fraud when you can get 99.98% right by doing nothing?
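The paradox is easy to reproduce with a deliberately trivial baseline that always predicts the majority class. A minimal sketch on simulated data with the same 0.02% fraud rate (the data here is synthetic, not the bank's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 100_000

# Simulated labels with a 0.02% positive (fraud) rate, as in the case study.
y = (rng.random(n) < 0.0002).astype(int)
X = rng.normal(size=(n, 3))  # features are irrelevant to this baseline

# "Always predict the majority class": no learning at all.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy:     {accuracy_score(y, y_pred):.4f}")  # near-perfect
print(f"Fraud recall: {recall_score(y, y_pred):.4f}")    # zero
```

Any model that collapses to this behavior looks identical on an accuracy dashboard, which is exactly what happened to model_v1.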
Kenji realizes he needs to attack this problem from three angles: the data, the model, and the metric.
The Solution: A Multi-Pronged Approach
Change 1: Address the Imbalance
Kenji uses class_weight='balanced' to tell the model that misclassifying a fraud case is far more costly than misclassifying a legitimate one:
```python
model_v2 = LogisticRegression(max_iter=1000, class_weight='balanced')
model_v2.fit(X_train, y_train)
```
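Under the hood, 'balanced' sets each class weight to n_samples / (n_classes * n_c), inversely proportional to class frequency. A quick sketch of that arithmetic at the case study's 1-in-5,000 imbalance (counts scaled down for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# The 1-in-5,000 legit-to-fraud ratio from the case study, scaled down.
y = np.array([0] * 5000 + [1] * 1)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# w_c = n_samples / (n_classes * n_c):
#   legit: 5001 / (2 * 5000) ~= 0.5
#   fraud: 5001 / (2 * 1)    =  2500.5
print(dict(zip([0, 1], weights)))
```

So each fraud example contributes roughly 5,000 times more to the loss than a legitimate one, which is what forces the optimizer to stop ignoring the rare class.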
Change 2: Use the Right Metric
Kenji switches from accuracy to the metrics that matter:
```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred_v2 = model_v2.predict(X_test)
print(classification_report(y_test, y_pred_v2, target_names=['Legit', 'Fraud']))

y_proba = model_v2.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_proba):.4f}")
```

```
              precision    recall  f1-score   support

       Legit       1.00      0.97      0.99   6000000
       Fraud       0.01      0.82      0.02      1200

    accuracy                           0.97   6001200

AUC: 0.9512
```
Now the model catches 82% of fraud (recall jumped from 0.004 to 0.82). But precision for fraud is 0.01: for every actual fraud it catches, it falsely flags about 99 legitimate transactions. That's a lot of angry phone calls from customers whose cards were blocked.
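That tension is easier to reason about on a precision-recall curve than at any single threshold. A hedged sketch on synthetic scores (model_v2's real probabilities aren't reproduced here), showing how little precision is available near 80% recall under extreme imbalance:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Synthetic stand-in: ~0.02% positives whose scores are only partly separable.
y_true = (rng.random(200_000) < 0.0002).astype(int)
scores = rng.normal(loc=y_true * 2.0, scale=1.0)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Precision achievable near 80% recall, the regime model_v2 operates in.
idx = np.argmin(np.abs(recall - 0.80))
print(f"At recall {recall[idx]:.2f}, precision is {precision[idx]:.4f}")
```

With one positive per 5,000 negatives, even decent score separation leaves precision at a tiny fraction of a percent at high recall; that is a property of the base rate, not a bug in the model.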
Change 3: Bring in a Stronger Model
Kenji trains a random forest with balanced weights:
```python
from sklearn.ensemble import RandomForestClassifier

model_v3 = RandomForestClassifier(
    n_estimators=300, max_depth=12,
    class_weight='balanced', random_state=42, n_jobs=-1
)
model_v3.fit(X_train, y_train)

y_pred_v3 = model_v3.predict(X_test)
print(classification_report(y_test, y_pred_v3, target_names=['Legit', 'Fraud']))

y_proba_v3 = model_v3.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_proba_v3):.4f}")
```

```
              precision    recall  f1-score   support

       Legit       1.00      0.99      1.00   6000000
       Fraud       0.04      0.78      0.07      1200

AUC: 0.9687
```
Better: precision for fraud improved from 0.01 to 0.04 (still low, but significantly fewer false alarms), while recall is 0.78. The model catches 78% of fraud, and for every 25 flagged transactions, about 1 is actually fraudulent.
Change 4: Tune the Threshold
Kenji uses the ROC curve to find the optimal threshold for the bank's specific cost structure. Blocking a legitimate transaction costs about $15 in customer service time and goodwill. A successful fraud averages $850 in losses. So the cost ratio is roughly 1:57 — each missed fraud costs 57 times more than a false alarm.
```python
import numpy as np
from sklearn.metrics import roc_curve

y_proba = model_v3.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Find the threshold that maximizes:
# benefit of catching fraud - cost of false alarms
costs = []
for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    tp = ((y_pred_t == 1) & (y_test == 1)).sum()
    fp = ((y_pred_t == 1) & (y_test == 0)).sum()
    fn = ((y_pred_t == 0) & (y_test == 1)).sum()
    net_benefit = tp * 850 - fp * 15 - fn * 850
    costs.append({'threshold': t, 'net_benefit': net_benefit})

best = max(costs, key=lambda x: x['net_benefit'])
print(f"Optimal threshold: {best['threshold']:.4f}")
```
At the optimized threshold, the model achieves a different precision-recall balance that minimizes total cost to the bank.
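Looping over every candidate threshold against 6 million predictions is slow. The same sweep can be vectorized, since roc_curve already yields the true-positive and false-positive rates at every threshold, and the tp/fp/fn counts follow from the class totals. A sketch on synthetic scores (model_v3's real probabilities aren't reproduced here), under the same $850/$15 cost assumptions:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic stand-in: rare positives with partially separable scores.
y_test = (rng.random(100_000) < 0.0002).astype(int)
y_proba = np.clip(rng.normal(loc=y_test * 0.6, scale=0.2), 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, y_proba)

n_pos = y_test.sum()           # total fraud cases
n_neg = len(y_test) - n_pos    # total legit transactions

tp = tpr * n_pos               # fraud caught at each threshold
fp = fpr * n_neg               # legit blocked at each threshold
fn = n_pos - tp                # fraud missed at each threshold

# Net benefit with the case study's costs: $850 per fraud, $15 per false alarm.
net_benefit = tp * 850 - fp * 15 - fn * 850
best = int(np.argmax(net_benefit))
print(f"Optimal threshold: {thresholds[best]:.4f}")
```

This computes the identical quantity as the loop version in O(number of thresholds) array operations instead of a full pass over the test set per threshold.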
The Results
Kenji presents the results to leadership, but instead of accuracy, he shows a table with the metrics that matter. The volumes are monthly: roughly 8 million transactions and about 1,600 fraud cases (0.02%).

| Model | Fraud Recall | Fraud Precision | Legit Blocked (FP) | Est. Monthly Cost |
|---|---|---|---|---|
| Existing rules | 40% | ~1% | ~45,000 | $675K blocked + $816K missed fraud |
| ML (optimized) | 78% | ~5% | ~23,400 | $351K blocked + $299K missed fraud |

The ML model catches nearly twice as much fraud as the rule-based system AND cuts false alarms roughly in half. The estimated monthly cost savings: over $800,000.
But Kenji is careful in his presentation. He doesn't mention accuracy. Not once.
Lessons Learned
1. Accuracy is meaningless with extreme class imbalance. 99.98% accuracy sounds impressive until you realize it means "predict the majority class for everything." In fraud detection, spam filtering, disease screening, and many other real-world problems, accuracy is the wrong metric.
2. The right metric depends on the cost structure. For fraud detection, the costs of false negatives (missed fraud = $850) and false positives (blocked legitimate transaction = $15) are asymmetric. The optimal model balances these specific costs, not some abstract notion of "accuracy."
3. Class imbalance requires explicit handling. Without class_weight='balanced' or similar techniques, the model's optimizer has no reason to learn about the rare class. You must tell the algorithm that rare-class mistakes are more expensive.
4. Threshold tuning is a business decision, not just a technical one. The default threshold of 0.5 is almost never optimal for imbalanced problems. The right threshold depends on the specific costs of false positives and false negatives — and those costs are determined by the business, not the data scientist.
5. Small improvements in the right metric can have enormous business value. Going from 40% to 78% fraud recall doesn't sound dramatic compared to "99.98% accuracy." But it translates to hundreds of thousands of dollars in savings per month. Kenji's model wasn't evaluated on accuracy — it was evaluated on money saved.
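The threshold point in lesson 4 is mechanically simple to act on: instead of calling predict (which hard-codes 0.5), threshold predict_proba yourself. A minimal sketch on synthetic data, with 0.12 as a purely hypothetical stand-in for whatever value the cost analysis selects:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
y = (X[:, 0] + rng.normal(scale=1.5, size=10_000) > 2.5).astype(int)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

THRESHOLD = 0.12  # hypothetical: set by the business cost analysis, not by default
proba = model.predict_proba(X)[:, 1]
flagged = proba >= THRESHOLD

# A lower threshold flags at least as many transactions as the 0.5 default.
print(flagged.sum(), model.predict(X).sum())
```

Keeping the threshold as an explicit, documented constant also makes it something the business can revisit when the cost of a false alarm or a missed fraud changes.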
Discussion Questions
1. Why did Kenji choose NOT to mention accuracy in his presentation to leadership? What would have happened if he had led with "99.98% accuracy"?

2. The optimized model still misses 22% of fraud. For a bank processing 8 million transactions per month, how many fraud cases is that? What are the real-world consequences?

3. Fraud patterns evolve: criminals adapt to detection systems. How might the model's performance change over time, and what should Kenji do about it?

4. The model's precision for fraud is 0.04, meaning 96% of flagged transactions are legitimate. How should the bank handle flagged transactions? Block them? Send a text to the cardholder? Review them manually?

5. Could optimizing purely for recall (catching the most fraud possible) ever be the wrong strategy? What are the downsides of very high recall with very low precision in this context?