Case Study 1: The 99% Accurate Model That Was Useless — Fraud Detection
Tier 3 — Illustrative/Composite Example: The financial institution, data, and model results in this case study are entirely fictional. However, the patterns described — the accuracy paradox, class imbalance challenges, and the critical importance of metric selection in fraud detection — reflect widely documented experiences in the financial industry. All names, figures, and scenarios are invented for pedagogical purposes.
The Setting
River Valley Financial is a mid-sized bank processing about 8 million credit card transactions per month. Their existing fraud detection system is rule-based: it flags transactions above $5,000 from new merchants, transactions from foreign countries without travel notification, and a handful of other hard-coded patterns. It catches about 40% of fraud and generates a lot of false alarms.
The bank's new analytics manager, Kenji Tanaka, is tasked with building a machine-learning-based fraud detection system. The expectation from leadership is clear: "Catch more fraud, generate fewer false alarms, and give us a number we can put in the quarterly report."
Kenji assembles three months of transaction data: 24 million transactions, of which about 4,800 are confirmed fraud. That's a fraud rate of 0.02% — one fraudulent transaction for every 5,000 legitimate ones.
The Question
Kenji's primary question is: Can we build a classifier that identifies fraudulent transactions in real time, so we can block or flag them before they clear?
Secondary questions:

- What transaction features are most predictive of fraud?
- How do we balance catching fraud vs. blocking legitimate transactions?
- What metrics should we use to evaluate the model, given that fraud is extremely rare?
The First Model: A Cautionary Tale
Kenji trains a logistic regression model with features including transaction amount, merchant category, time of day, distance from the cardholder's home ZIP code, and the number of transactions in the past hour.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model_v1 = LogisticRegression(max_iter=1000)
model_v1.fit(X_train, y_train)

y_pred = model_v1.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
```

```
Accuracy: 0.9998
```
99.98% accuracy. Kenji almost sends the email to leadership. Then he checks the confusion matrix.
```python
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
```

```
[[5999997       3]
 [   1195       5]]
```
The model predicted "not fraud" for almost everything. It correctly classified 5,999,997 legitimate transactions and caught exactly 5 of the 1,200 fraudulent ones. Five. Out of twelve hundred.
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))
```

```
              precision    recall  f1-score   support

       Legit       1.00      1.00      1.00   6000000
       Fraud       0.63      0.00      0.01      1200

    accuracy                           1.00   6001200
```
Recall for fraud: 0.004 (0.4%). The model catches fewer than 1 in 200 fraudulent transactions. The 99.98% accuracy is a mirage — the model has essentially learned to say "not fraud" for everything, and because fraud is so rare, that strategy is almost always correct.
Kenji does NOT send the email.
The Problem: Extreme Class Imbalance
This is the accuracy paradox at its most extreme. When one class represents 99.98% of the data, a model can achieve near-perfect accuracy by ignoring the rare class entirely. The model's loss function (which optimizes for overall accuracy) treats every misclassification equally — so missing 1,200 fraud cases costs the same as getting 1,200 legitimate cases wrong. From the model's perspective, why bother learning the complex patterns of fraud when you can get 99.98% right by doing nothing?
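The paradox is easy to reproduce with a deliberately trivial baseline that always predicts the majority class. A minimal sketch on simulated data with the same 0.02% fraud rate (the data here is synthetic, not the bank's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 100_000

# Simulated labels with a 0.02% positive (fraud) rate, as in the case study.
y = (rng.random(n) < 0.0002).astype(int)
X = rng.normal(size=(n, 3))  # features are irrelevant to this baseline

# "Always predict the majority class": no learning at all.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy:     {accuracy_score(y, y_pred):.4f}")  # near-perfect
print(f"Fraud recall: {recall_score(y, y_pred):.4f}")    # zero
```

Any model that collapses to this behavior looks identical on an accuracy dashboard, which is exactly what happened to model_v1.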
Kenji realizes he needs to attack this problem from three angles: the data, the model, and the metric.
The Solution: A Multi-Pronged Approach
Change 1: Address the Imbalance
Kenji uses class_weight='balanced' to tell the model that misclassifying a fraud case is far more costly than misclassifying a legitimate one:
```python
model_v2 = LogisticRegression(max_iter=1000, class_weight='balanced')
model_v2.fit(X_train, y_train)
```
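Under the hood, 'balanced' sets each class weight to n_samples / (n_classes * n_c), inversely proportional to class frequency. A quick sketch of that arithmetic at the case study's 1-in-5,000 imbalance (counts scaled down for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# The 1-in-5,000 legit-to-fraud ratio from the case study, scaled down.
y = np.array([0] * 5000 + [1] * 1)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# w_c = n_samples / (n_classes * n_c):
#   legit: 5001 / (2 * 5000) ~= 0.5
#   fraud: 5001 / (2 * 1)    =  2500.5
print(dict(zip([0, 1], weights)))
```

So each fraud example contributes roughly 5,000 times more to the loss than a legitimate one, which is what forces the optimizer to stop ignoring the rare class.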
Change 2: Use the Right Metric
Kenji switches from accuracy to the metrics that matter:
```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred_v2 = model_v2.predict(X_test)
print(classification_report(y_test, y_pred_v2, target_names=['Legit', 'Fraud']))

y_proba = model_v2.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_proba):.4f}")
```

```
              precision    recall  f1-score   support

       Legit       1.00      0.97      0.99   6000000
       Fraud       0.01      0.82      0.02      1200

    accuracy                           0.97   6001200

AUC: 0.9512
```
Now the model catches 82% of fraud (recall jumped from 0.004 to 0.82). But precision for fraud is 0.01: for every actual fraud it catches, it falsely flags about 99 legitimate transactions. That's a lot of angry phone calls from customers whose cards were blocked.
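That tension is easier to reason about on a precision-recall curve than at any single threshold. A hedged sketch on synthetic scores (model_v2's real probabilities aren't reproduced here), showing how little precision is available near 80% recall under extreme imbalance:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Synthetic stand-in: ~0.02% positives whose scores are only partly separable.
y_true = (rng.random(200_000) < 0.0002).astype(int)
scores = rng.normal(loc=y_true * 2.0, scale=1.0)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Precision achievable near 80% recall, the regime model_v2 operates in.
idx = np.argmin(np.abs(recall - 0.80))
print(f"At recall {recall[idx]:.2f}, precision is {precision[idx]:.4f}")
```

With one positive per 5,000 negatives, even decent score separation leaves precision at a tiny fraction of a percent at high recall; that is a property of the base rate, not a bug in the model.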
Change 3: Bring in a Stronger Model
Kenji trains a random forest with balanced weights:
```python
from sklearn.ensemble import RandomForestClassifier

model_v3 = RandomForestClassifier(
    n_estimators=300, max_depth=12,
    class_weight='balanced', random_state=42, n_jobs=-1
)
model_v3.fit(X_train, y_train)

y_pred_v3 = model_v3.predict(X_test)
print(classification_report(y_test, y_pred_v3, target_names=['Legit', 'Fraud']))

y_proba_v3 = model_v3.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_proba_v3):.4f}")
```

```
              precision    recall  f1-score   support

       Legit       1.00      0.99      1.00   6000000
       Fraud       0.04      0.78      0.07      1200

AUC: 0.9687
```
Better: precision for fraud improved from 0.01 to 0.04 (still low, but significantly fewer false alarms), while recall is 0.78. The model catches 78% of fraud, and for every 25 flagged transactions, about 1 is actually fraudulent.
Change 4: Tune the Threshold
Kenji uses the ROC curve to find the optimal threshold for the bank's specific cost structure. Blocking a legitimate transaction costs about $15 in customer service time and goodwill. A successful fraud averages $850 in losses. So the cost ratio is roughly 1:57 — each missed fraud costs 57 times more than a false alarm.
```python
import numpy as np
from sklearn.metrics import roc_curve

y_proba = model_v3.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Find the threshold that maximizes:
# benefit of catching fraud - cost of false alarms
costs = []
for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    tp = ((y_pred_t == 1) & (y_test == 1)).sum()
    fp = ((y_pred_t == 1) & (y_test == 0)).sum()
    fn = ((y_pred_t == 0) & (y_test == 1)).sum()
    net_benefit = tp * 850 - fp * 15 - fn * 850
    costs.append({'threshold': t, 'net_benefit': net_benefit})

best = max(costs, key=lambda x: x['net_benefit'])
print(f"Optimal threshold: {best['threshold']:.4f}")
```
At the optimized threshold, the model achieves a different precision-recall balance that minimizes total cost to the bank.
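Looping over every candidate threshold against 6 million predictions is slow. The same sweep can be vectorized, since roc_curve already yields the true-positive and false-positive rates at every threshold, and the tp/fp/fn counts follow from the class totals. A sketch on synthetic scores (model_v3's real probabilities aren't reproduced here), under the same $850/$15 cost assumptions:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic stand-in: rare positives with partially separable scores.
y_test = (rng.random(100_000) < 0.0002).astype(int)
y_proba = np.clip(rng.normal(loc=y_test * 0.6, scale=0.2), 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, y_proba)

n_pos = y_test.sum()           # total fraud cases
n_neg = len(y_test) - n_pos    # total legit transactions

tp = tpr * n_pos               # fraud caught at each threshold
fp = fpr * n_neg               # legit blocked at each threshold
fn = n_pos - tp                # fraud missed at each threshold

# Net benefit with the case study's costs: $850 per fraud, $15 per false alarm.
net_benefit = tp * 850 - fp * 15 - fn * 850
best = int(np.argmax(net_benefit))
print(f"Optimal threshold: {thresholds[best]:.4f}")
```

This computes the identical quantity as the loop version in O(number of thresholds) array operations instead of a full pass over the test set per threshold.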
The Results
Kenji presents the results to leadership, but instead of accuracy, he shows a table with the metrics that matter. The volumes are monthly: roughly 8 million transactions and about 1,600 fraud cases (0.02%).

| Model | Fraud Recall | Fraud Precision | Legit Blocked (FP) | Est. Monthly Cost |
|---|---|---|---|---|
| Existing rules | 40% | ~1% | ~45,000 | $675K blocked + $816K missed fraud |
| ML (optimized) | 78% | ~5% | ~23,400 | $351K blocked + $299K missed fraud |

The ML model catches nearly twice as much fraud as the rule-based system AND cuts false alarms roughly in half. The estimated monthly cost savings: over $800,000.
But Kenji is careful in his presentation. He doesn't mention accuracy. Not once.
Lessons Learned
1. Accuracy is meaningless with extreme class imbalance. 99.98% accuracy sounds impressive until you realize it means "predict the majority class for everything." In fraud detection, spam filtering, disease screening, and many other real-world problems, accuracy is the wrong metric.
2. The right metric depends on the cost structure. For fraud detection, the costs of false negatives (missed fraud = $850) and false positives (blocked legitimate transaction = $15) are asymmetric. The optimal model balances these specific costs, not some abstract notion of "accuracy."
3. Class imbalance requires explicit handling. Without class_weight='balanced' or similar techniques, the model's optimizer has no reason to learn about the rare class. You must tell the algorithm that rare-class mistakes are more expensive.
4. Threshold tuning is a business decision, not just a technical one. The default threshold of 0.5 is almost never optimal for imbalanced problems. The right threshold depends on the specific costs of false positives and false negatives — and those costs are determined by the business, not the data scientist.
5. Small improvements in the right metric can have enormous business value. Going from 40% to 78% fraud recall doesn't sound dramatic compared to "99.98% accuracy." But it translates to hundreds of thousands of dollars in savings per month. Kenji's model wasn't evaluated on accuracy — it was evaluated on money saved.
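The threshold point in lesson 4 is mechanically simple to act on: instead of calling predict (which hard-codes 0.5), threshold predict_proba yourself. A minimal sketch on synthetic data, with 0.12 as a purely hypothetical stand-in for whatever value the cost analysis selects:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
y = (X[:, 0] + rng.normal(scale=1.5, size=10_000) > 2.5).astype(int)

model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

THRESHOLD = 0.12  # hypothetical: set by the business cost analysis, not by default
proba = model.predict_proba(X)[:, 1]
flagged = proba >= THRESHOLD

# A lower threshold flags at least as many transactions as the 0.5 default.
print(flagged.sum(), model.predict(X).sum())
```

Keeping the threshold as an explicit, documented constant also makes it something the business can revisit when the cost of a false alarm or a missed fraud changes.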
Discussion Questions
1. Why did Kenji choose NOT to mention accuracy in his presentation to leadership? What would have happened if he had led with "99.98% accuracy"?

2. The optimized model still misses 22% of fraud. For a bank processing 8 million transactions per month, how many fraud cases is that? What are the real-world consequences?

3. Fraud patterns evolve: criminals adapt to detection systems. How might the model's performance change over time, and what should Kenji do about it?

4. The model's precision for fraud is 0.04, meaning 96% of flagged transactions are legitimate. How should the bank handle flagged transactions? Block them? Send a text to the cardholder? Review them manually?

5. Could optimizing purely for recall (catching the most fraud possible) ever be the wrong strategy? What are the downsides of very high recall with very low precision in this context?