Case Study 2: Meridian Financial — Regulated Model Deployment for Credit Scoring
Context
Meridian Financial, the mid-size consumer lending institution introduced in Chapter 24 (Case Study 2) and Chapter 28 (Case Study 2), operates a production credit scoring model that determines lending decisions for 2.4 million applicants annually. The model — an XGBoost gradient-boosted tree ensemble with 200 features — scores applicants on a 0-1 probability of default scale. Applications scoring below 0.12 are auto-approved; applications scoring above 0.35 are auto-declined; applications between 0.12 and 0.35 are routed to human underwriters.
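The three-way routing described above can be sketched in a few lines. This is a minimal illustration, not Meridian's code; only the two thresholds (0.12 and 0.35) come from the case study, and the function name is ours.

```python
# Illustrative sketch of Meridian's score routing. Thresholds are from the
# case study; the function and constant names are hypothetical.
AUTO_APPROVE_MAX = 0.12   # scores below this are auto-approved
AUTO_DECLINE_MIN = 0.35   # scores above this are auto-declined

def route_application(default_probability: float) -> str:
    """Route an applicant based on the model's probability of default (0-1)."""
    if default_probability < AUTO_APPROVE_MAX:
        return "auto_approve"
    if default_probability > AUTO_DECLINE_MIN:
        return "auto_decline"
    # Everything in between goes to a human underwriter.
    return "human_underwriter"
```

Note that scores exactly at a threshold fall into the human-review band, a conservative choice for borderline cases.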
The model operates under the Federal Reserve's SR 11-7 guidance on model risk management and the OCC's companion document (OCC 2011-12). These regulations require independent validation, auditable documentation, and formal change management for every model deployed in a credit decision capacity.
The data science team has been asked to design a deployment pipeline that satisfies three constraints simultaneously: (1) keep the model fresh enough to reflect current economic conditions, (2) comply with all regulatory requirements for model change management, and (3) minimize the risk of deploying a model that produces unfair or inaccurate credit decisions.
The Regulatory Challenge
Traditional ML deployment optimizes for speed: get the new model to production as fast as safely possible. Regulated deployment adds constraints that slow the pipeline but reduce a different kind of risk — regulatory risk:
| Constraint | StreamRec Approach | Meridian Requirement |
|---|---|---|
| Who approves deployment? | Automated (validation gate) | MRM team + model risk committee |
| Documentation | Automated metrics logging | 47-page model change document |
| Validation independence | Same team trains and validates | Independent MRM team validates |
| Rollback authority | Automated on metric violation | Automated for technical issues; MRM approval for model issues |
| Audit trail retention | 90 days | 7 years |
| Deployment timeline | 12 days | 17-28 days |
The Deployment Pipeline
Tier Classification
Meridian classifies model changes into three tiers, each with a different pipeline:
Tier 1 — Material Change. New model architecture, new feature sources, new target variable definition, or model replacement. Requires full MRM review (5-10 business days), model risk committee approval, and updated model documentation. Example: replacing XGBoost with a neural network, or adding a new credit bureau data source.
Tier 2 — Non-Material Change. Retraining on fresh data with the same architecture, features, and hyperparameters. Requires abbreviated MRM review (2-3 business days) and MRM analyst approval. Example: quarterly retraining on the most recent 3 years of application data.
Tier 3 — Recalibration. Score recalibration without retraining (e.g., adjusting the intercept to account for population drift). Requires automated validation and MRM notification (no approval gate). Example: monthly recalibration of the score-to-probability mapping.
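The tier rules above are mechanical enough to encode. The following is a hedged sketch of a tier classifier; the boolean inputs paraphrase the tier definitions in the text, and none of the names are Meridian's.

```python
# Hypothetical tier classifier mirroring the three tier definitions above.
def classify_change(new_architecture: bool,
                    new_feature_source: bool,
                    new_target_definition: bool,
                    retrained_same_config: bool,
                    recalibration_only: bool) -> int:
    """Map a proposed model change to its review tier (1, 2, or 3)."""
    if new_architecture or new_feature_source or new_target_definition:
        return 1  # Material: full MRM review + model risk committee approval
    if retrained_same_config:
        return 2  # Non-material: abbreviated MRM review, analyst approval
    if recalibration_only:
        return 3  # Recalibration: automated validation, MRM notification only
    raise ValueError("change does not match any defined tier")
```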
Tier 2 Pipeline (Quarterly Retraining)
The quarterly retraining pipeline is the most common deployment scenario. The timeline:
| Day | Activity | Owner |
|---|---|---|
| 1 | Dagster pipeline: extract 3 years of data, validate (47 expectations), compute features | Automated |
| 1-2 | Train XGBoost on validated data, evaluate on holdout | Automated |
| 2 | Validation gate: behavioral tests (18 tests), champion-challenger comparison, fairness checks | Automated |
| 2 | Register model in MLflow, generate model change document | Automated |
| 3-5 | MRM analyst reviews validation results, model change document, fairness analysis | MRM team |
| 5 | MRM analyst approves or requests changes | MRM team |
| 5-12 | Shadow mode: model receives production traffic in parallel (7 days) | Automated |
| 12 | Shadow evaluation report generated and sent to MRM analyst | Automated |
| 13 | MRM analyst reviews shadow results, approves canary | MRM team |
| 13-16 | Canary deployment: 5% of scoring traffic routed to new model (3 days) | Automated |
| 16 | Canary evaluation: statistical comparison, fairness metrics by demographic group | Automated |
| 16-17 | Business owner reviews canary results, approves full rollout | Business |
| 17-19 | Progressive rollout: 10% → 25% → 50% → 100% (1 day per stage) | Automated |
| 19 | New model promoted to champion; previous model enters warm standby | Automated |
Total timeline: 17-19 business days for a routine quarterly retraining.
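The Day-2 validation gate in the table combines three checks: behavioral tests, a champion-challenger comparison, and fairness. A minimal sketch of that gate logic, under the assumption that the challenger must pass all 18 behavioral tests and not underperform the champion (the `min_auc_delta` tolerance is our invention):

```python
from dataclasses import dataclass

# Hedged sketch of the automated validation gate; field and function names
# are illustrative, not Meridian's pipeline code.
@dataclass
class GateResult:
    behavioral_tests_passed: int   # out of 18
    challenger_auc: float
    champion_auc: float
    fairness_ok: bool              # all fairness checks passed

def validation_gate(result: GateResult, min_auc_delta: float = -0.001) -> bool:
    """Return True only if the challenger may advance to MRM review."""
    if result.behavioral_tests_passed < 18:
        return False
    if result.challenger_auc - result.champion_auc < min_auc_delta:
        return False
    return result.fairness_ok
```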
Regulatory-Specific Deployment Stages
Three deployment stages are unique to the regulated pipeline:
1. Automated Model Change Document. When the training pipeline completes and the model passes the validation gate, the system automatically generates a 47-page model change document:
| Section | Contents | Source |
|---|---|---|
| Executive summary | Model name, change type, key metrics | Template + metrics |
| Data description | Training data period, row count, feature count, data quality report | Lineage + GE results |
| Methodology | Architecture, hyperparameters, training procedure | Lineage + config |
| Performance | Holdout metrics, champion comparison, segment analysis | Validation gate |
| Stability | PSI analysis, score distribution comparison | Monitoring |
| Fair lending | Disparate impact analysis by race, gender, age | Fairness module |
| Limitations | Known weaknesses, out-of-scope populations | Template + behavioral tests |
| Approval | Signature blocks for MRM analyst, MRM director, business owner | Empty (filled by reviewers) |
Automating the document does not eliminate the human review — it eliminates the 8-12 hours of analyst time previously spent compiling the document manually, allowing the MRM analyst to focus on reviewing the results rather than assembling them.
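The assembly step can be pictured as a mapping from pipeline artifacts to document sections, with the Approval section deliberately left blank for human sign-off. The section names mirror the table above; the artifact keys are assumptions.

```python
# Sketch of assembling the model change document from pipeline artifacts.
# Section names follow the table above; artifact keys are hypothetical.
def build_change_document(artifacts: dict) -> dict:
    """Map pipeline outputs to document sections; Approval stays empty."""
    return {
        "Executive summary": artifacts.get("metrics_summary", ""),
        "Data description": artifacts.get("lineage_report", ""),
        "Methodology": artifacts.get("training_config", ""),
        "Performance": artifacts.get("validation_results", ""),
        "Stability": artifacts.get("psi_analysis", ""),
        "Fair lending": artifacts.get("fairness_report", ""),
        "Limitations": artifacts.get("behavioral_test_notes", ""),
        "Approval": "",  # signature blocks filled by reviewers, never automated
    }
```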
2. Fairness Validation at Every Stage. Credit scoring models must comply with the Equal Credit Opportunity Act (ECOA) and fair lending regulations. Meridian's pipeline checks fairness at three stages:
- Offline validation (Day 2): Disparate impact ratio for auto-approve and auto-decline decisions across ECOA-protected classes (race, gender, marital status, age, national origin). Threshold: DI ratio ≥ 0.80 for all groups (the four-fifths rule).
- Shadow mode (Day 12): Score distribution comparison across demographic groups between champion and challenger. Threshold: no group's median score should shift by more than 0.02 (on the 0-1 scale).
- Canary (Day 16): Observed approval rate by demographic group in the canary population vs. the champion population. Threshold: no group's approval rate should change by more than 1 percentage point.
If any fairness check fails, the pipeline pauses and the MRM team investigates. A fairness failure does not automatically trigger rollback — it triggers a review to determine whether the change is legitimate (e.g., a new feature that improves accuracy uniformly) or problematic (e.g., a proxy variable that introduces disparate impact).
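The first check above, the four-fifths rule, can be sketched directly. Group names and the choice of the highest-rate group as reference are illustrative; the 0.80 threshold is from the text.

```python
# Minimal sketch of the Day-2 disparate impact check (four-fifths rule).
# The reference group here is the group with the highest approval rate,
# a common convention; Meridian's exact convention is not specified.
def disparate_impact_ratios(approval_rates):
    """Ratio of each group's approval rate to the highest group's rate."""
    reference = max(approval_rates.values())
    return {group: rate / reference for group, rate in approval_rates.items()}

def passes_four_fifths(approval_rates, threshold=0.80):
    """True if every group's ratio meets the four-fifths threshold."""
    ratios = disparate_impact_ratios(approval_rates)
    return all(r >= threshold for r in ratios.values())
```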
3. Regulatory Kill Switch. In addition to the standard metric-based rollback triggers, Meridian's pipeline includes a regulatory kill switch — a feature flag that any MRM analyst can activate to immediately disable the model and route all scoring to the previous approved model. The kill switch is tested monthly and has been activated twice in 2 years:
- Once during a regulatory examination, when the examiner requested a model be temporarily disabled while they reviewed its documentation (precautionary, no issue found).
- Once when a data vendor announced a retroactive correction to credit bureau data that had been used in training. The kill switch disabled the model within 30 seconds; the previous model served traffic for 6 days while the data science team retrained on corrected data.
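Mechanically, the kill switch is a feature flag consulted on every scoring request. A hedged sketch, with hypothetical names and an in-memory flag standing in for whatever flag store Meridian actually uses:

```python
# Illustrative kill switch: a flag any MRM analyst can flip, with every
# flip audit-logged. Storage and names are assumptions, not Meridian's code.
class KillSwitch:
    def __init__(self):
        self._active = False
        self.audit_log = []  # entries retained 7 years in the real system

    def activate(self, analyst_id, reason):
        self._active = True
        self.audit_log.append(("activate", analyst_id, reason))

    @property
    def active(self):
        return self._active

def score_request(application, champion_model, previous_approved_model, kill_switch):
    """Route scoring to the previous approved model while the switch is active."""
    model = previous_approved_model if kill_switch.active else champion_model
    return model(application)
```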
Retraining Triggers in a Regulated Environment
Meridian's retraining triggers are configured more conservatively than StreamRec's:
| Trigger | StreamRec | Meridian |
|---|---|---|
| Scheduled | Weekly | Quarterly |
| Drift PSI threshold | 0.25 | 0.20 (more sensitive) |
| Performance floor | NDCG@20 < 0.15 | AUC < 0.78 |
| Performance decline | >10% relative | >5% relative (more sensitive) |
| Drift-triggered pipeline | Automated training, human approval before canary | Automated training, MRM review before shadow |
| Performance-triggered pipeline | Automated training, human approval | Automated training, MRM + business review |
The quarterly retraining schedule reflects the credit scoring domain: default outcomes take 12-24 months to materialize, so the model's training data must include a maturation window. More frequent retraining would include immature labels (applicants whose default status is not yet known), introducing label noise.
Drift-triggered retraining requires MRM review even before shadow mode (not just before canary) because the drift may indicate a change in the applicant population that affects the model's regulatory compliance — for example, a marketing campaign that attracts a new demographic, changing the population composition in ways that affect fair lending analysis.
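The PSI drift trigger in the table can be made concrete. This uses the standard PSI formula over binned distributions; the 0.20 trigger threshold is from the table, while the binning and epsilon handling are our assumptions.

```python
import math

# Population Stability Index over two binned distributions. The 0.20
# trigger matches Meridian's table; implementation details are a sketch.
def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)  # guard against empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def drift_triggers_retrain(expected_fracs, actual_fracs, threshold=0.20):
    return psi(expected_fracs, actual_fracs) >= threshold
```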
Results
Deployment Metrics
| Metric | Before Pipeline | After Pipeline |
|---|---|---|
| Quarterly deployment time | 35-45 business days | 17-19 business days |
| MRM analyst hours per deployment | 40 hours (document + review) | 12 hours (review only) |
| Deployment failures | 1 per year (insufficient testing) | 0 in 2 years |
| Regulatory findings related to deployment | 2 findings in 3 years | 0 findings in 2 years |
| Kill switch activation time | 15 minutes (manual) | 30 seconds (feature flag) |
| Rollback time | 2 hours (manual) | 4 seconds (warm standby) |
| Model freshness (days since training) | 90-135 days | 90-109 days |
Regulatory Examination Outcomes
The automated documentation and lineage system has transformed regulatory examinations. In the most recent examination, the MRM examiner requested documentation for 4 models. The previous examination required 3 weeks of preparation (compiling documents, gathering evidence, preparing walkthroughs). The current examination required 2 days: the team generated the model change documents from the lineage system, pulled the validation gate results from MLflow, and showed the shadow and canary evaluation reports from the deployment pipeline. The examiner noted the automated fairness validation as a "best practice" — the first time Meridian had received positive examiner feedback on model risk management.
The Recalibration Exception
Six months into the new pipeline, the data science team proposed monthly Tier 3 recalibrations — adjusting the score-to-probability mapping without retraining the model. The score distribution had drifted (PSI = 0.18 on the score distribution, below the retrain threshold but above the monitoring alert threshold), causing the auto-approve threshold (0.12) to approve fewer applicants than intended.
The MRM team approved monthly recalibration as a Tier 3 change with the following pipeline: compute recalibration parameters (Platt scaling on the most recent 90 days of data with known outcomes) → validate that the recalibrated score distribution matches the expected approval rate → deploy directly to production (no shadow or canary, because the underlying model does not change). The recalibration pipeline runs in 4 hours and requires no human approval, only MRM notification.
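The Platt-scaling step above fits a logistic mapping from raw scores to calibrated probabilities on recent matured outcomes, leaving the XGBoost model itself untouched. A self-contained sketch, with plain gradient descent standing in for whatever solver Meridian actually uses:

```python
import math

# Hedged sketch of Tier-3 recalibration via Platt scaling: fit (a, b) in
# p = sigmoid(a * score + b) on recent outcomes. The optimizer and
# hyperparameters are illustrative, not Meridian's implementation.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, outcomes, lr=0.1, steps=2000):
    """Fit Platt scaling parameters by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def recalibrate(score, a, b):
    """Map a raw model score to a recalibrated probability of default."""
    return sigmoid(a * score + b)
```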
This monthly recalibration — a capability that did not exist before the automated pipeline — reduced score distribution drift between quarterly retraining cycles from PSI 0.15-0.22 to PSI 0.03-0.08, maintaining scoring consistency for applicants and examiners.
Key Insight
The fundamental insight from Meridian's experience is that regulatory requirements and deployment automation are not in tension — they are complementary. The automated pipeline produces better regulatory documentation (complete, consistent, and timestamped) faster (17 days vs. 45 days) than the manual process, while also reducing deployment risk (zero failures vs. one per year). The regulatory constraints shaped the pipeline design (longer shadow periods, human approval gates, fairness checks at every stage), and the pipeline automation made those constraints manageable rather than burdensome.
The MRM team, initially skeptical of automated deployment ("how can you deploy a model without a human reviewing every step?"), became the pipeline's strongest advocates after the first regulatory examination — because the pipeline produced the evidence that examiners required, automatically, every time.