Case Study 2: Meridian Financial — Regulated Model Deployment for Credit Scoring

Context

Meridian Financial, the mid-size consumer lending institution introduced in Chapter 24 (Case Study 2) and Chapter 28 (Case Study 2), operates a production credit scoring model that determines lending decisions for 2.4 million applicants annually. The model — an XGBoost gradient-boosted tree ensemble with 200 features — scores applicants on a 0-1 probability of default scale. Applications scoring below 0.12 are auto-approved; applications scoring above 0.35 are auto-declined; applications between 0.12 and 0.35 are routed to human underwriters.
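The three-way routing rule can be sketched as a small function. This is an illustrative sketch only — the constant names and function signature are assumptions, not Meridian's actual code; the thresholds (0.12 and 0.35) come from the text above.

```python
# Hypothetical sketch of Meridian's score-based routing; thresholds are from
# the case study, names and structure are illustrative.

AUTO_APPROVE_MAX = 0.12   # scores below this are auto-approved
AUTO_DECLINE_MIN = 0.35   # scores above this are auto-declined

def route_application(pd_score: float) -> str:
    """Route an application by its probability-of-default score.

    Scores in [AUTO_APPROVE_MAX, AUTO_DECLINE_MIN] go to human underwriters.
    """
    if pd_score < AUTO_APPROVE_MAX:
        return "auto_approve"
    if pd_score > AUTO_DECLINE_MIN:
        return "auto_decline"
    return "manual_review"
```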

The model operates under the Federal Reserve's SR 11-7 guidance on model risk management and the OCC's companion document (OCC 2011-12). These regulations require independent validation, auditable documentation, and formal change management for every model deployed in a credit decision capacity.

The data science team has been asked to design a deployment pipeline that satisfies three constraints simultaneously: (1) keep the model fresh enough to reflect current economic conditions, (2) comply with all regulatory requirements for model change management, and (3) minimize the risk of deploying a model that produces unfair or inaccurate credit decisions.

The Regulatory Challenge

Traditional ML deployment optimizes for speed: get the new model to production as fast as safely possible. Regulated deployment adds constraints that slow the pipeline but reduce a different kind of risk — regulatory risk:

| Constraint | StreamRec Approach | Meridian Requirement |
|---|---|---|
| Who approves deployment? | Automated (validation gate) | MRM team + model risk committee |
| Documentation | Automated metrics logging | 47-page model change document |
| Validation independence | Same team trains and validates | Independent MRM team validates |
| Rollback authority | Automated on metric violation | Automated for technical issues; MRM approval for model issues |
| Audit trail retention | 90 days | 7 years |
| Deployment timeline | 12 days | 17-28 days |

The Deployment Pipeline

Tier Classification

Meridian classifies model changes into three tiers, each with a different pipeline:

Tier 1 — Material Change. New model architecture, new feature sources, new target variable definition, or model replacement. Requires full MRM review (5-10 business days), model risk committee approval, and updated model documentation. Example: replacing XGBoost with a neural network, or adding a new credit bureau data source.

Tier 2 — Non-Material Change. Retraining on fresh data with the same architecture, features, and hyperparameters. Requires abbreviated MRM review (2-3 business days) and MRM analyst approval. Example: quarterly retraining on the most recent 3 years of application data.

Tier 3 — Recalibration. Score recalibration without retraining (e.g., adjusting the intercept to account for population drift). Requires automated validation and MRM notification (no approval gate). Example: monthly recalibration of the score-to-probability mapping.
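The three-tier taxonomy lends itself to a simple dispatch rule. The enum, the review-requirement table, and the classifier below paraphrase the text and are illustrative assumptions, not Meridian's actual API:

```python
# Illustrative sketch of the three-tier change taxonomy described above.
from enum import Enum

class ChangeTier(Enum):
    MATERIAL = 1        # new architecture, features, target, or model replacement
    NON_MATERIAL = 2    # retrain on fresh data; same architecture/features/hyperparams
    RECALIBRATION = 3   # score recalibration only, no retraining

REVIEW_REQUIREMENTS = {
    ChangeTier.MATERIAL: {"review_days": (5, 10), "approver": "model risk committee"},
    ChangeTier.NON_MATERIAL: {"review_days": (2, 3), "approver": "MRM analyst"},
    ChangeTier.RECALIBRATION: {"review_days": (0, 0), "approver": None},  # notify only
}

def classify_change(architecture_changed: bool, features_changed: bool,
                    retrained: bool) -> ChangeTier:
    """Classify a model change into a review tier per the rules above."""
    if architecture_changed or features_changed:
        return ChangeTier.MATERIAL
    if retrained:
        return ChangeTier.NON_MATERIAL
    return ChangeTier.RECALIBRATION
```

A quarterly retrain (same architecture and features, new data) classifies as Tier 2; a monthly recalibration (no retraining at all) classifies as Tier 3.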

Tier 2 Pipeline (Quarterly Retraining)

The quarterly retraining pipeline is the most common deployment scenario. The timeline:

| Day | Activity | Owner |
|---|---|---|
| 1 | Dagster pipeline: extract 3 years of data, validate (47 expectations), compute features | Automated |
| 1-2 | Train XGBoost on validated data, evaluate on holdout | Automated |
| 2 | Validation gate: behavioral tests (18 tests), champion-challenger comparison, fairness checks | Automated |
| 2 | Register model in MLflow, generate model change document | Automated |
| 3-5 | MRM analyst reviews validation results, model change document, fairness analysis | MRM team |
| 5 | MRM analyst approves or requests changes | MRM team |
| 5-12 | Shadow mode: model receives production traffic in parallel (7 days) | Automated |
| 12 | Shadow evaluation report generated and sent to MRM analyst | Automated |
| 13 | MRM analyst reviews shadow results, approves canary | MRM team |
| 13-16 | Canary deployment: 5% of scoring traffic routed to new model (3 days) | Automated |
| 16 | Canary evaluation: statistical comparison, fairness metrics by demographic group | Automated |
| 16-17 | Business owner reviews canary results, approves full rollout | Business |
| 17-19 | Progressive rollout: 10% → 25% → 50% → 100% (1 day per stage) | Automated |
| 19 | New model promoted to champion; previous model enters warm standby | Automated |

Total timeline: 17-19 business days for a routine quarterly retraining.
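The canary and progressive-rollout stages need a deterministic way to split traffic. Hash-based routing is one common approach — an assumption here, since the text specifies the percentages (5%, then 10/25/50/100) but not the mechanism:

```python
# Minimal sketch of deterministic traffic splitting for canary/rollout stages.
# Hash-based bucketing is an implementation assumption, not from the text.
import hashlib

def _bucket(application_id: str) -> float:
    """Map an application ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(application_id.encode()).hexdigest()
    return int(digest[:8], 16) / 2**32

def route_to_challenger(application_id: str, rollout_pct: float) -> bool:
    """True if this application should be scored by the challenger model."""
    return _bucket(application_id) < rollout_pct / 100.0
```

Because the bucket is deterministic, an application routed to the challenger at 5% stays on the challenger as the rollout widens to 10%, 25%, and 50%, which keeps the evaluation populations nested and comparable.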

Regulatory-Specific Deployment Stages

Three deployment stages are unique to the regulated pipeline:

1. Automated Model Change Document. When the training pipeline completes and the model passes the validation gate, the system automatically generates a 47-page model change document:

| Section | Contents | Source |
|---|---|---|
| Executive summary | Model name, change type, key metrics | Template + metrics |
| Data description | Training data period, row count, feature count, data quality report | Lineage + GE results |
| Methodology | Architecture, hyperparameters, training procedure | Lineage + config |
| Performance | Holdout metrics, champion comparison, segment analysis | Validation gate |
| Stability | PSI analysis, score distribution comparison | Monitoring |
| Fair lending | Disparate impact analysis by race, gender, age | Fairness module |
| Limitations | Known weaknesses, out-of-scope populations | Template + behavioral tests |
| Approval | Signature blocks for MRM analyst, MRM director, business owner | Empty (filled by reviewers) |

Automating the document does not eliminate the human review — it eliminates the 8-12 hours of analyst time previously spent compiling the document manually, allowing the MRM analyst to focus on reviewing the results rather than assembling them.
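Document assembly can be sketched as a mapping from sections to pipeline artifacts. The section names follow the table above; the artifact keys and render function are hypothetical stand-ins for the lineage, MLflow, and Great Expectations outputs named in the text:

```python
# Hedged sketch of assembling the model change document from pipeline
# artifacts; keys and structure are illustrative assumptions.

DOC_SECTIONS = [
    ("Executive summary", "template_plus_metrics"),
    ("Data description", "lineage_and_ge_results"),
    ("Methodology", "lineage_and_config"),
    ("Performance", "validation_gate"),
    ("Stability", "monitoring"),
    ("Fair lending", "fairness_module"),
    ("Limitations", "template_plus_behavioral_tests"),
    ("Approval", "reviewer_signatures"),  # left empty for reviewers
]

def assemble_change_document(artifacts: dict) -> str:
    """Concatenate rendered sections. A missing artifact fails loudly, so an
    incomplete document can never reach the MRM analyst."""
    parts = []
    for title, source_key in DOC_SECTIONS:
        if source_key == "reviewer_signatures":
            body = "[signature blocks - completed by reviewers]"
        else:
            body = artifacts[source_key]  # KeyError if the pipeline skipped a stage
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```

The fail-loudly lookup mirrors the design goal: automation compiles the evidence, and the analyst's time goes to reviewing it.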

2. Fairness Validation at Every Stage. Credit scoring models must comply with the Equal Credit Opportunity Act (ECOA) and fair lending regulations. Meridian's pipeline checks fairness at three stages:

  • Offline validation (Day 2): Disparate impact ratio for auto-approve and auto-decline decisions across ECOA-protected classes (race, gender, marital status, age, national origin). Threshold: DI ratio ≥ 0.80 for every group (the four-fifths rule).
  • Shadow mode (Day 12): Score distribution comparison across demographic groups between champion and challenger. Threshold: no group's median score should shift by more than 0.02 (on the 0-1 scale).
  • Canary (Day 16): Observed approval rate by demographic group in the canary population vs. the champion population. Threshold: no group's approval rate should change by more than 1 percentage point.
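The Day-2 check is the standard four-fifths computation: each group's approval rate divided by the most-favored group's rate, compared against 0.80. A minimal sketch, with illustrative group labels:

```python
# Sketch of the four-fifths (disparate impact) check; the 0.80 threshold is
# from the text, group names in the usage example are illustrative.

def disparate_impact_ratios(approval_rates: dict) -> dict:
    """Ratio of each group's approval rate to the most-favored group's rate."""
    reference = max(approval_rates.values())
    return {group: rate / reference for group, rate in approval_rates.items()}

def passes_four_fifths(approval_rates: dict, threshold: float = 0.80) -> bool:
    """True if every group's DI ratio meets the threshold."""
    return all(r >= threshold for r in disparate_impact_ratios(approval_rates).values())
```

For example, approval rates of 0.50 and 0.42 give ratios of 1.0 and 0.84, which passes; rates of 0.50 and 0.30 give 1.0 and 0.60, which fails and pauses the pipeline for MRM review.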

If any fairness check fails, the pipeline pauses and the MRM team investigates. A fairness failure does not automatically trigger rollback — it triggers a review to determine whether the change is legitimate (e.g., a new feature that improves accuracy uniformly) or problematic (e.g., a proxy variable that introduces disparate impact).

3. Regulatory Kill Switch. In addition to the standard metric-based rollback triggers, Meridian's pipeline includes a regulatory kill switch — a feature flag that any MRM analyst can activate to immediately disable the model and route all scoring to the previous approved model. The kill switch is tested monthly and has been activated twice in 2 years:

  • Once during a regulatory examination, when the examiner requested a model be temporarily disabled while they reviewed its documentation (precautionary, no issue found).
  • Once when a data vendor announced a retroactive correction to credit bureau data that had been used in training. The kill switch disabled the model within 30 seconds; the previous model served traffic for 6 days while the data science team retrained on corrected data.
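The kill switch's essential behavior — an analyst-controlled flag that reroutes all scoring to the previously approved model, with an auditable record — can be sketched in-process. This is a stand-in for whatever flag service Meridian actually uses; the class and its audit log are illustrative assumptions:

```python
# Illustrative kill-switch sketch; the audit log supports the 7-year
# retention requirement mentioned earlier. Not Meridian's actual code.
from datetime import datetime, timezone

class KillSwitch:
    def __init__(self):
        self.active = False
        self.audit_log = []  # (timestamp, actor, action) tuples

    def activate(self, actor: str, reason: str) -> None:
        self.active = True
        self.audit_log.append((datetime.now(timezone.utc), actor, f"ACTIVATE: {reason}"))

    def deactivate(self, actor: str) -> None:
        self.active = False
        self.audit_log.append((datetime.now(timezone.utc), actor, "DEACTIVATE"))

def score(application, champion, previous_approved, kill_switch: KillSwitch):
    """Route all scoring to the previously approved model while the switch is on."""
    model = previous_approved if kill_switch.active else champion
    return model(application)
```

Because activation is a flag flip rather than a deployment, the switchover time is bounded by flag propagation — consistent with the 30-second activation in the vendor-correction incident.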

Retraining Triggers in a Regulated Environment

Meridian's retraining triggers are configured more conservatively than StreamRec's:

| Trigger | StreamRec | Meridian |
|---|---|---|
| Scheduled | Weekly | Quarterly |
| Drift (PSI threshold) | 0.25 | 0.20 (more sensitive) |
| Performance floor | NDCG@20 < 0.15 | AUC < 0.78 |
| Performance decline | >10% relative | >5% relative (more sensitive) |
| Drift-triggered pipeline | Automated training, human approval before canary | Automated training, MRM review before shadow |
| Performance-triggered pipeline | Automated training, human approval | Automated training, MRM + business review |
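The drift trigger rests on the Population Stability Index, computed over binned distributions of a feature or score. A minimal sketch — the binning and epsilon smoothing are implementation assumptions; the 0.20 trigger threshold is from the table above:

```python
# Minimal PSI sketch for the drift trigger; smoothing epsilon is an assumption.
import math

def psi(expected_props, actual_props, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

DRIFT_TRIGGER = 0.20  # Meridian's retrain threshold

def drift_retrain_needed(expected_props, actual_props) -> bool:
    return psi(expected_props, actual_props) > DRIFT_TRIGGER
```

Identical distributions give PSI of 0; the common rule of thumb reads values under 0.10 as stable and values above 0.25 as major shift, so Meridian's 0.20 trigger sits deliberately on the conservative side.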

The quarterly retraining schedule reflects the credit scoring domain: default outcomes take 12-24 months to materialize, so the model's training data must include a maturation window. More frequent retraining would include immature labels (applicants whose default status is not yet known), introducing label noise.
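The maturation window amounts to a date filter on the training set: only applications old enough for their default outcome to be observable are included. A sketch, assuming the lower end (12 months) of the 12-24 month range stated above:

```python
# Sketch of label-maturation filtering; the 365-day window is an assumption
# within the 12-24 month range given in the text.
from datetime import date, timedelta

MATURATION_WINDOW = timedelta(days=365)

def mature_training_rows(rows, as_of: date):
    """Keep rows whose application date is at least one maturation window old,
    so their default labels are observable rather than immature."""
    cutoff = as_of - MATURATION_WINDOW
    return [r for r in rows if r["application_date"] <= cutoff]
```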

Drift-triggered retraining requires MRM review even before shadow mode (not just before canary) because the drift may indicate a change in the applicant population that affects the model's regulatory compliance — for example, a marketing campaign that attracts a new demographic, changing the population composition in ways that affect fair lending analysis.

Results

Deployment Metrics

| Metric | Before Pipeline | After Pipeline |
|---|---|---|
| Quarterly deployment time | 35-45 business days | 17-19 business days |
| MRM analyst hours per deployment | 40 hours (document + review) | 12 hours (review only) |
| Deployment failures | 1 per year (insufficient testing) | 0 in 2 years |
| Regulatory findings related to deployment | 2 findings in 3 years | 0 findings in 2 years |
| Kill switch activation time | 15 minutes (manual) | 30 seconds (feature flag) |
| Rollback time | 2 hours (manual) | 4 seconds (warm standby) |
| Model freshness (days since training) | 90-135 days | 90-109 days |

Regulatory Examination Outcomes

The automated documentation and lineage system has transformed regulatory examinations. In the most recent examination, the MRM examiner requested documentation for 4 models. The previous examination required 3 weeks of preparation (compiling documents, gathering evidence, preparing walkthroughs). The current examination required 2 days: the team generated the model change documents from the lineage system, pulled the validation gate results from MLflow, and showed the shadow and canary evaluation reports from the deployment pipeline. The examiner noted the automated fairness validation as a "best practice" — the first time Meridian had received positive examiner feedback on model risk management.

The Recalibration Exception

Six months into the new pipeline, the data science team proposed monthly Tier 3 recalibrations — adjusting the score-to-probability mapping without retraining the model. The score distribution had drifted (PSI = 0.18 on the score distribution, below the retrain threshold but above the monitoring alert threshold), causing the auto-approve threshold (0.12) to approve fewer applicants than intended.

The MRM team approved monthly recalibration as a Tier 3 change with the following pipeline: compute recalibration parameters (Platt scaling on the most recent 90 days of data with known outcomes) → validate that the recalibrated score distribution matches the expected approval rate → deploy directly to production (no shadow or canary, because the underlying model does not change). The recalibration pipeline runs in 4 hours and requires no human approval, only MRM notification.
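Platt scaling, as named in the recalibration step, fits a logistic curve mapping the raw model score to an observed default probability. A sketch using scikit-learn — the library choice and function names are assumptions, though the technique matches the text:

```python
# Sketch of the Tier 3 Platt-scaling recalibration: fit sigmoid(a*score + b)
# on recent matured outcomes, then remap production scores through it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibrator(raw_scores: np.ndarray, outcomes: np.ndarray) -> LogisticRegression:
    """Fit a logistic mapping from raw score to observed default probability."""
    model = LogisticRegression()
    model.fit(raw_scores.reshape(-1, 1), outcomes)
    return model

def recalibrate(model: LogisticRegression, raw_scores: np.ndarray) -> np.ndarray:
    """Recalibrated probability of default for each raw score."""
    return model.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
```

Only the intercept and slope of the score-to-probability mapping change; the tree ensemble and its feature attributions are untouched, which is what justifies skipping shadow and canary for this tier.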

This monthly recalibration — a capability that did not exist before the automated pipeline — reduced score distribution drift between quarterly retraining cycles from PSI 0.15-0.22 to PSI 0.03-0.08, maintaining scoring consistency for applicants and examiners.

Key Insight

The fundamental insight from Meridian's experience is that regulatory requirements and deployment automation are not in tension — they are complementary. The automated pipeline produces better regulatory documentation (complete, consistent, and timestamped) faster (17 days vs. 45 days) than the manual process, while also reducing deployment risk (zero failures vs. one per year). The regulatory constraints shaped the pipeline design (longer shadow periods, human approval gates, fairness checks at every stage), and the pipeline automation made those constraints manageable rather than burdensome.

The MRM team, initially skeptical of automated deployment ("how can you deploy a model without a human reviewing every step?"), became the pipeline's strongest advocates after the first regulatory examination — because the pipeline produced the evidence that examiners required, automatically, every time.