Case Study 4.1: Maya's Technology Assessment — From Marketing Claims to Technical Reality
The Situation
Organization: Verdant Bank (fictional)
Context: Maya Osei is evaluating three ML-based AML vendors as part of the transaction monitoring overhaul
Challenge: Translating vendor claims into assessments she can defend to the FCA
Maya's background: Law degree, former FCA supervisor; not a data scientist
The Challenge
By mid-2021, Maya had closed the KYC backlog and stabilized the transaction monitoring process with a scenario refresh. But the underlying platform — with its rules-based architecture and 82% false positive rate — remained a long-term problem. She had begun a formal evaluation of ML-based alternatives.
Three vendors had made it to the shortlist, each presenting compelling statistics:
| Vendor | Claimed FP Reduction | Technology | Claim Basis |
|---|---|---|---|
| AlertIQ | 60% reduction | Gradient boosting + rules | Tested on "500K transactions across 12 banks" |
| ClearPath AI | 45% reduction | Deep learning (LSTM) | Tested on their "proprietary benchmark dataset" |
| RiskSense | 55% reduction | Ensemble (RF + rules) | Tested on "similar UK challenger bank" (unnamed) |
Maya's problem: she did not know enough about machine learning to evaluate these claims, and she did not trust herself to be adequately skeptical of what she did not understand.
Getting to Substance
Maya did three things before requesting a demo.
First, she called two former FCA colleagues who had supervised AML technology reviews. Their shared advice: "Ask them to run on your data. Not their benchmark dataset. Your actual historical transactions."
Second, she attended a half-day training session on ML fundamentals run by a data science consulting firm. It was not enough to make her an expert. It was enough to give her a vocabulary and a list of questions.
Third, she hired a data science contractor — a PhD in statistics who had previously worked in banking — for four weeks specifically to support the vendor evaluation. She was not buying four weeks of data science; she was buying the ability to ask the right questions and evaluate the answers.
The Technical Evaluation
Maya's team sent each vendor the same dataset: 90 days of historical transaction data from Verdant Bank, with 847 transactions labeled as true positives (cases where SARs had ultimately been filed with the NCA, the UK's equivalent of FinCEN, including consent-regime cases). The vendors were asked to:
1. Run their model on the historical data
2. Report performance metrics against the labeled dataset
3. Document their feature engineering approach
4. Explain (in plain English) why the model flagged the three specific transactions Maya selected as test cases
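The scoring itself is simple once the labels exist. Below is a minimal sketch of the comparison Maya's team could run on each vendor's output; the function and field names are illustrative, not from any vendor's API:

```python
# Score one vendor's flags against Verdant's labeled transaction history.
# All names here are illustrative, not from any vendor's actual interface.

def score_vendor(flagged: list[bool], sar_filed: list[bool]) -> dict:
    """Score a vendor's alerts against labeled historical transactions.

    flagged[i]   -- did the vendor's model raise an alert on transaction i?
    sar_filed[i] -- did transaction i ultimately lead to a SAR (true positive)?
    """
    alerts = [(f, s) for f, s in zip(flagged, sar_filed) if f]
    false_pos = sum(1 for _, s in alerts if not s)
    true_pos = len(alerts) - false_pos
    missed = sum(1 for f, s in zip(flagged, sar_filed) if s and not f)
    return {
        # Share of alerts that waste an analyst's time.
        "fp_rate": false_pos / len(alerts) if alerts else 0.0,
        # Share of known-bad transactions the model still catches.
        "recall": true_pos / (true_pos + missed) if (true_pos + missed) else 0.0,
        "alert_volume": len(alerts),
    }
```

Running the same function over every vendor's output on the same labeled data is what makes the three results in the next section directly comparable.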
The results were illuminating:
AlertIQ: False positive rate on Verdant's data: 74%, against Verdant's current 82%. A real improvement, though short of what the headline "60% reduction" implies: applied to an 82% baseline at constant detection, a 60% cut in false-positive volume would put the rate near 65%. AlertIQ came closest partly because their training data resembled Verdant's transaction profile. Their explanation of the three test cases was clear and specific. Recommended.
ClearPath AI: False positive rate on Verdant's data: 89% — significantly worse than claimed. Their benchmark dataset, it emerged, had been drawn from large US retail banks with very different transaction profiles than a UK challenger bank. The deep learning model had not generalized. Not recommended.
RiskSense: False positive rate on Verdant's data: 81%, barely better than Verdant's current 82% and far short of the claimed 55% reduction. But their explanation of the test cases was the most detailed and the most convincing. The contractor's assessment: the model was technically sound, and the underperformance on Verdant's data was likely a calibration issue that could be corrected during implementation. Conditionally recommended pending calibration discussion.
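One source of confusion in comparisons like these is that vendors quote reductions in false-positive volume, while operations teams track the false-positive rate among alerts. Assuming detection of true positives is held constant (an assumption, and an optimistic one), the two can be translated with a few lines of arithmetic:

```python
# Translate between a vendor's "X% false positive reduction" (a volume claim)
# and the false positive *rate* an operations team would actually observe.
# Assumes true-positive detections are unchanged -- an assumption to verify.

def rate_after_volume_cut(baseline_fp_rate: float, fp_cut: float) -> float:
    """FP rate expected if FP volume drops by fp_cut (0..1), TPs unchanged."""
    fp = baseline_fp_rate * (1.0 - fp_cut)  # surviving false positives
    tp = 1.0 - baseline_fp_rate             # true positives, unchanged
    return fp / (fp + tp)

def implied_volume_cut(baseline_fp_rate: float, new_fp_rate: float) -> float:
    """FP volume reduction implied by a measured new FP rate."""
    tp = 1.0 - baseline_fp_rate
    # Solve new_rate = fp / (fp + tp) for the surviving FP fraction.
    fp = new_fp_rate * tp / (1.0 - new_fp_rate)
    return 1.0 - fp / baseline_fp_rate

# A claimed 60% volume cut against Verdant's 82% baseline:
print(round(rate_after_volume_cut(0.82, 0.60), 3))  # -> 0.646
# AlertIQ's measured 74% rate implies roughly a 38% volume cut instead:
print(round(implied_volume_cut(0.82, 0.74), 3))     # -> 0.375
```

This is the kind of translation a contractor can do in an afternoon, and it turns a marketing number into something directly comparable against a measured result.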
What Maya Learned
The experience taught Maya three things she has carried through her career:
Lesson 1: Test on your data. No benchmark dataset is your data. "AI-powered" claims are only meaningful if tested against the specific transaction population, customer base, and jurisdiction you operate in. A model trained on US retail bank data will not generalize to a UK challenger bank.
Lesson 2: Explainability is a regulatory requirement, not a nice-to-have. When she asked ClearPath AI to explain why the model had flagged one of the test transactions, they gave her a SHAP waterfall chart that required a PhD to interpret. AlertIQ gave her: "This transaction was flagged primarily because the destination country has a high risk score, the amount was in the range associated with structuring in your customer segment, and this customer's transaction velocity in the 48 hours prior was in the 98th percentile for similar-profile customers." That was something she could present to the FCA.
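The gap between the two vendors' answers is largely presentation. Here is a hypothetical sketch of how per-feature attributions (such as SHAP values) can be rendered into the kind of sentence AlertIQ produced; the feature names, attribution values, and templates are all invented for illustration:

```python
# Render model feature attributions as a plain-English alert explanation.
# Feature names, templates, and values are hypothetical; a real system would
# draw attributions from its explainability layer (e.g. SHAP) and templates
# written by compliance domain experts.

TEMPLATES = {
    "dest_country_risk": "the destination country has a high risk score",
    "amount_structuring_band": (
        "the amount falls in a range associated with structuring "
        "in this customer segment"
    ),
    "velocity_48h_pct": (
        "the customer's 48-hour transaction velocity is in the "
        "{value:.0f}th percentile for similar-profile customers"
    ),
}

def explain(attributions: dict[str, float],
            values: dict[str, float],
            top_n: int = 3) -> str:
    """Render the top_n positive contributions as one reviewable sentence."""
    top = sorted(
        (f for f in attributions if attributions[f] > 0),
        key=lambda f: attributions[f],
        reverse=True,
    )[:top_n]
    reasons = [TEMPLATES[f].format(value=values.get(f, 0.0)) for f in top]
    return "This transaction was flagged primarily because " + "; ".join(reasons) + "."

print(explain(
    {"dest_country_risk": 0.41, "amount_structuring_band": 0.27,
     "velocity_48h_pct": 0.22},
    {"velocity_48h_pct": 98},
))
```

The underlying numbers are the same ones a SHAP waterfall chart would display; the contractual requirement is really about who writes the templates and who signs off on them.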
Lesson 3: You don't need to be a data scientist to govern ML systems. Maya was never going to retrain a gradient boosting model. But she could evaluate whether a vendor's claims were methodologically sound, whether the model's outputs were explainable, and whether governance expectations in line with SR 11-7 (the US model risk management guidance, whose principles UK supervisors broadly share) were being met. That was enough.
The Outcome
Maya selected AlertIQ, with a contractual requirement for quarterly model performance reviews, a documented escalation process if false positive rates increased beyond specified thresholds, and a transition plan if Verdant needed to switch vendors.
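An escalation clause like this only works if the quarterly review is mechanical rather than discretionary. A sketch of the check, with threshold values that are illustrative rather than taken from the actual contract:

```python
# Quarterly model-performance check against contractual escalation thresholds.
# The threshold values are illustrative, not from the actual AlertIQ contract.

WARN_FP_RATE = 0.76      # above this: investigate with the vendor
ESCALATE_FP_RATE = 0.80  # above this: invoke the formal escalation clause

def review_quarter(fp_rate: float) -> str:
    """Map a measured quarterly FP rate to a contractually defined action."""
    if fp_rate >= ESCALATE_FP_RATE:
        # Formal remediation; sustained breaches activate the transition plan.
        return "escalate"
    if fp_rate >= WARN_FP_RATE:
        return "investigate"
    return "ok"

print(review_quarter(0.71))  # -> ok
```

Fixing the thresholds and actions in the contract up front is what makes the later FCA conversation straightforward: the governance document can simply describe the rule and show the quarterly measurements.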
Implementation took seven months (longer than the vendor's quoted four months — the core banking integration was more complex than expected). Post-calibration false positive rate: 71%. The team's alert review productivity improved by approximately 35%.
The FCA's next supervisory contact noted the improvement and asked for documentation of the model governance approach. Maya was able to provide a clear, defensible model governance document. No findings.
Discussion Questions
1. Maya hired a data science contractor specifically for the vendor evaluation. Under what circumstances is this a cost-effective approach vs. building in-house data science capability vs. relying entirely on vendor representations?
2. ClearPath AI's deep learning model underperformed on Verdant's data because it had been trained on different types of banks. What does this reveal about the limitations of vendor-provided performance benchmarks?
3. Maya required AlertIQ to explain individual transaction flags in plain language as a condition of the contract. What are the technical constraints on this requirement? (Hint: think about the difference between simple and complex models.)
4. The FCA asked for documentation of the model governance approach. What would you expect that documentation to include? Reference the AI readiness framework from Section 4.7.
5. If Verdant's false positive rate increases from 71% back to 82% six months after implementation, what are the possible causes, and what is the appropriate response?