Case Study 2: Meridian Financial Real-Time Credit Scoring — Latency Budgets and Model Compression
Context
Meridian Financial processes 12,000 credit applications per hour through its real-time decisioning platform. When a consumer applies for credit — at a point of sale, through a mobile app, or via a partner lending platform — the system must return an approve/decline decision within a strict latency budget. The budget exists for two reasons: the consumer experience (no one waits 5 seconds for a credit decision at checkout) and the partner SLA (API timeout at 500ms, with quality-of-service penalties above 200ms).
The current production model is a gradient-boosted tree ensemble (LightGBM, 500 trees, max depth 8, 47 features) that achieves an AUC of 0.821 on the holdout validation set and an inference latency of 2ms at p50, 8ms at p99. The model has served reliably for 14 months.
The model risk team has developed a transformer-based credit scoring model that achieves AUC 0.833 — an improvement of 0.012 that, according to their analysis, would reduce annual charge-offs by an estimated $4.2 million if deployed. The transformer processes the applicant's 47 tabular features plus a sequence of the most recent 64 credit bureau trade lines (each with 12 attributes), treating the trade line history as a temporal sequence.
The challenge: the transformer's inference latency is 85ms at p50 and 340ms at p99, nearly 7x the 50ms p99 inference budget.
This case study examines whether the transformer can be made production-ready through model compression, and what the tradeoffs are.
The Latency Budget
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """End-to-end latency budget for credit decisioning."""

    network_ingress_ms: float = 5.0
    request_parsing_ms: float = 1.0
    feature_retrieval_ms: float = 15.0    # Credit bureau API + cache
    feature_engineering_ms: float = 3.0
    model_inference_ms: float = 50.0      # THE CONSTRAINT
    business_rules_ms: float = 5.0        # Fraud checks, policy rules
    response_serialization_ms: float = 2.0
    network_egress_ms: float = 5.0

    @property
    def total_budget_ms(self) -> float:
        return (
            self.network_ingress_ms + self.request_parsing_ms +
            self.feature_retrieval_ms + self.feature_engineering_ms +
            self.model_inference_ms + self.business_rules_ms +
            self.response_serialization_ms + self.network_egress_ms
        )

    def report(self) -> None:
        """Print the latency budget breakdown."""
        components = {
            "Network ingress": self.network_ingress_ms,
            "Request parsing": self.request_parsing_ms,
            "Feature retrieval": self.feature_retrieval_ms,
            "Feature engineering": self.feature_engineering_ms,
            "Model inference": self.model_inference_ms,
            "Business rules": self.business_rules_ms,
            "Response serialization": self.response_serialization_ms,
            "Network egress": self.network_egress_ms,
        }
        total = self.total_budget_ms
        print(f"{'Component':<25s} {'Budget (ms)':>12s} {'% of Total':>10s}")
        print("-" * 50)
        for name, ms in components.items():
            print(f"{name:<25s} {ms:>12.1f} {ms/total:>9.1%}")
        print(f"{'TOTAL':<25s} {total:>12.1f}")


budget = LatencyBudget()
budget.report()
Component                  Budget (ms) % of Total
--------------------------------------------------
Network ingress                    5.0      5.8%
Request parsing                    1.0      1.2%
Feature retrieval                 15.0     17.4%
Feature engineering                3.0      3.5%
Model inference                   50.0     58.1%
Business rules                     5.0      5.8%
Response serialization             2.0      2.3%
Network egress                     5.0      5.8%
TOTAL                             86.0
Model inference consumes 58% of the total budget. The 50ms allocation is already generous — feature retrieval (the credit bureau API call) is the next largest component at 15ms, and that latency is largely out of Meridian's control.
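The per-component numbers above are budget targets; production enforces them against measured tail latencies. A minimal sketch of how p50/p99 figures like those quoted for the two models are computed from raw request timings (synthetic lognormal data, not Meridian's telemetry):

```python
import random
import statistics

def p99_ms(samples: list[float]) -> float:
    """99th-percentile latency from per-request timings in milliseconds."""
    # quantiles(n=100) returns the 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples, n=100)[98]

# Synthetic long-tailed timings, roughly lognormal like real service latency.
random.seed(7)
timings = [random.lognormvariate(4.4, 0.5) for _ in range(10_000)]
print(f"p50 = {statistics.median(timings):.0f}ms, p99 = {p99_ms(timings):.0f}ms")
```

Tail percentiles, not means, are what SLAs bind on: a model whose mean latency fits the budget can still blow the p99.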
Compression Strategy Evaluation
The team evaluates four compression strategies, measuring the impact on both AUC and inference latency.
from dataclasses import dataclass
from typing import List


@dataclass
class CompressionResult:
    """Result of applying a compression technique to the transformer model."""

    technique: str
    original_auc: float
    compressed_auc: float
    auc_delta: float
    original_p99_ms: float
    compressed_p99_ms: float
    speedup: float
    model_size_mb: float
    meets_sla: bool
    notes: str


def evaluate_compression_strategies() -> List[CompressionResult]:
    """Evaluate model compression strategies for the credit scoring transformer.

    Returns simulated but realistic results based on published compression
    benchmarks for similar model architectures.

    Returns:
        List of CompressionResult for each strategy.
    """
    original_auc = 0.833
    original_p99 = 340.0   # ms
    original_size = 450.0  # MB
    strategies = [
        CompressionResult(
            technique="No compression (baseline)",
            original_auc=original_auc,
            compressed_auc=original_auc,
            auc_delta=0.0,
            original_p99_ms=original_p99,
            compressed_p99_ms=original_p99,
            speedup=1.0,
            model_size_mb=original_size,
            meets_sla=original_p99 <= 50,
            notes="Baseline transformer: 6 layers, d=512, 8 heads, 64 trade lines"
        ),
        CompressionResult(
            technique="INT8 quantization (static)",
            original_auc=original_auc,
            compressed_auc=0.832,
            auc_delta=-0.001,
            original_p99_ms=original_p99,
            compressed_p99_ms=95.0,
            speedup=3.6,
            model_size_mb=115.0,
            meets_sla=95.0 <= 50,
            notes="Post-training quantization via ONNX Runtime INT8. "
                  "3.6x speedup typical for attention-heavy models on CPU."
        ),
        CompressionResult(
            technique="Knowledge distillation (3-layer)",
            original_auc=original_auc,
            compressed_auc=0.829,
            auc_delta=-0.004,
            original_p99_ms=original_p99,
            compressed_p99_ms=42.0,
            speedup=8.1,
            model_size_mb=52.0,
            meets_sla=42.0 <= 50,
            notes="Student: 3 layers, d=256, 4 heads. Trained with soft labels "
                  "from teacher + hard labels. 3 epochs of distillation training."
        ),
        CompressionResult(
            technique="Distillation + INT8 quantization",
            original_auc=original_auc,
            compressed_auc=0.828,
            auc_delta=-0.005,
            original_p99_ms=original_p99,
            compressed_p99_ms=15.0,
            speedup=22.7,
            model_size_mb=14.0,
            meets_sla=15.0 <= 50,
            notes="Distilled 3-layer model quantized to INT8. Combined effect "
                  "of architecture reduction and precision reduction."
        ),
        CompressionResult(
            technique="Sequence truncation (64→16)",
            original_auc=original_auc,
            compressed_auc=0.830,
            auc_delta=-0.003,
            original_p99_ms=original_p99,
            compressed_p99_ms=28.0,
            speedup=12.1,
            model_size_mb=450.0,
            meets_sla=28.0 <= 50,
            notes="Reduce input sequence from 64 to 16 most recent trade lines. "
                  "Attention cost drops by 16x (quadratic in sequence length). "
                  "Model weights unchanged."
        ),
    ]
    return strategies
results = evaluate_compression_strategies()
print("Meridian Financial: Transformer Compression Evaluation")
print(f"{'':=<100}")
print(f"{'Technique':<35s} {'AUC':>6s} {'ΔAUC':>7s} {'p99 (ms)':>10s} "
      f"{'Speedup':>8s} {'Size (MB)':>10s} {'SLA?':>6s}")
print(f"{'':->100}")
for r in results:
    delta_str = f"{r.auc_delta:+.3f}" if r.auc_delta != 0 else " ---"
    sla_str = "YES" if r.meets_sla else "NO"
    print(f"{r.technique:<35s} {r.compressed_auc:>6.3f} {delta_str:>7s} "
          f"{r.compressed_p99_ms:>10.1f} {r.speedup:>7.1f}x "
          f"{r.model_size_mb:>10.0f} {sla_str:>6s}")
Meridian Financial: Transformer Compression Evaluation
====================================================================================================
Technique                              AUC    ΔAUC   p99 (ms)  Speedup  Size (MB)   SLA?
----------------------------------------------------------------------------------------------------
No compression (baseline)            0.833     ---      340.0     1.0x        450     NO
INT8 quantization (static)           0.832  -0.001       95.0     3.6x        115     NO
Knowledge distillation (3-layer)     0.829  -0.004       42.0     8.1x         52    YES
Distillation + INT8 quantization     0.828  -0.005       15.0    22.7x         14    YES
Sequence truncation (64→16)          0.830  -0.003       28.0    12.1x        450    YES
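For intuition on why static INT8 quantization costs only 0.001 AUC: each weight tensor is mapped to 256 integer levels and rescaled, so the round-off error per weight is bounded by half the quantization step. A toy sketch of symmetric per-tensor quantization (illustrative numpy only, not the ONNX Runtime implementation):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale, q in [-127, 127]."""
    scale = float(np.abs(w).max() / 127.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# A weight matrix shaped like one of the transformer's d=512 projections.
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=(512, 512)).astype(np.float32)
q, scale = quantize_int8(w)
err = float(np.abs(q.astype(np.float32) * scale - w).max())
print(f"storage: 4x smaller, max per-weight round-off error: {err:.2e}")
```

The speedup comes from integer matmul kernels and 4x less memory traffic, not from the (tiny) change in the weights themselves.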
The Decision Analysis
Three of the five strategies meet the 50ms SLA. The question becomes: which offers the best AUC-for-latency tradeoff?
# Cost-benefit analysis
print("Cost-Benefit Analysis")
print(f"{'':=<90}")

# Parameters from Meridian's financial model
annual_loan_volume = 2_400_000   # Applications per year
approval_rate = 0.62             # Current approval rate
avg_loan_size = 8_500            # Average loan amount
current_chargeoff_rate = 0.034   # Current charge-off rate on approved loans
baseline_auc = 0.821             # Current LightGBM model

baseline_chargeoffs = (
    annual_loan_volume * approval_rate * avg_loan_size * current_chargeoff_rate
)

viable_options = [
    ("LightGBM (current)", 0.821, 2.0, 8.0, 0),
    ("Distilled transformer (3L)", 0.829, 12.0, 42.0, 350_000),
    ("Distilled + INT8", 0.828, 10.0, 15.0, 350_000),
    ("Seq truncation (16 lines)", 0.830, 10.0, 28.0, 100_000),
]

print(f"{'Option':<30s} {'AUC':>6s} {'ΔAUC':>7s} {'Est. Chargeoff':>16s} "
      f"{'Annual Savings':>15s} {'Eng. Cost':>10s}")
print(f"{'':->90}")

# Industry benchmark: 0.01 AUC improvement → ~0.5% reduction in charge-off rate
# (Source: approximate relationship for well-calibrated models in consumer credit)
for name, auc, p50, p99, eng_cost in viable_options:
    auc_improvement = auc - baseline_auc
    # Conservative: 0.5% charge-off reduction per 0.01 AUC
    chargeoff_reduction = auc_improvement / 0.01 * 0.005
    new_chargeoff_rate = current_chargeoff_rate * (1 - chargeoff_reduction)
    new_chargeoffs = (
        annual_loan_volume * approval_rate * avg_loan_size * new_chargeoff_rate
    )
    savings = baseline_chargeoffs - new_chargeoffs
    auc_delta = f"{auc_improvement:+.3f}" if auc != baseline_auc else " ---"
    savings_str = f"${savings:>12,.0f}" if savings > 0 else " ---"
    print(f"{name:<30s} {auc:>6.3f} {auc_delta:>7s} "
          f"${new_chargeoffs:>15,.0f} {savings_str:>15s} "
          f"{'$' + f'{eng_cost:,}':>10s}")
Cost-Benefit Analysis
==========================================================================================
Option                            AUC    ΔAUC   Est. Chargeoff  Annual Savings  Eng. Cost
------------------------------------------------------------------------------------------
LightGBM (current)              0.821     --- $    430,032,000             ---         $0
Distilled transformer (3L)      0.829  +0.008 $    428,311,872   $   1,720,128   $350,000
Distilled + INT8                0.828  +0.007 $    428,526,888   $   1,505,112   $350,000
Seq truncation (16 lines)       0.830  +0.009 $    428,096,856   $   1,935,144   $100,000
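The savings column depends entirely on the hedged 0.5%-per-0.01-AUC benchmark, so it is worth checking how the conclusion moves with that assumption. A quick sensitivity sketch (the `annual_savings` helper and the elasticity grid are illustrative; the baseline is recomputed from the parameters above):

```python
def annual_savings(auc_gain: float, reduction_per_001_auc: float) -> float:
    """Charge-off savings for a given AUC-to-charge-off elasticity.

    reduction_per_001_auc is the fractional charge-off reduction per
    0.01 AUC; 0.005 is the conservative benchmark used in the table above.
    """
    base_chargeoffs = 2_400_000 * 0.62 * 8_500 * 0.034  # baseline, ~$430M
    return base_chargeoffs * (auc_gain / 0.01) * reduction_per_001_auc

# Sensitivity of the sequence-truncation option (+0.009 AUC vs. LightGBM):
for elasticity in (0.0025, 0.005, 0.010):
    print(f"{elasticity:.2%} per 0.01 AUC -> ${annual_savings(0.009, elasticity):,.0f}")
```

Even at half the benchmark elasticity the savings dwarf the $100K engineering cost, so the ranking of options is robust to this assumption.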
The Team's Recommendation
The analysis yields a clear recommendation: sequence truncation (64 to 16 trade lines) with the full 6-layer transformer. This option:
- Achieves the highest AUC among SLA-compliant options (0.830)
- Has the lowest engineering cost ($100K vs. $350K for distillation)
- Meets the SLA with substantial headroom (28ms p99 vs. 50ms budget)
- Does not require maintaining a separate distillation pipeline
- Can be further compressed later (INT8 quantization if future features push latency up)
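In serving terms the recommendation is a preprocessing cap, not a new model. A hypothetical sketch of what the one-parameter change looks like (the `ScoringConfig` class and its field names are invented for illustration, not Meridian's actual configuration):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ScoringConfig:
    """Hypothetical serving config for the credit scoring transformer."""
    n_layers: int = 6
    d_model: int = 512
    n_heads: int = 8
    max_trade_lines: int = 64  # cap on input sequence length

def truncate_trade_lines(trade_lines: list, cfg: ScoringConfig) -> list:
    """Keep only the most recent entries (input assumed newest-first)."""
    return trade_lines[: cfg.max_trade_lines]

# The entire "compression" change: one field; model weights untouched.
prod = replace(ScoringConfig(), max_trade_lines=16)
print(prod.max_trade_lines)
```

Because the weights are unchanged, rollback is equally cheap: restoring `max_trade_lines=64` reverts to the full-sequence model.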
The 0.003 AUC loss from 0.833 (full sequence) to 0.830 (truncated sequence) is a direct consequence of computational constraints. The team's investigation reveals that most predictive signal in the trade line history comes from the most recent 16 entries — the older entries are less relevant because credit behavior is non-stationary. The 64-trade-line model captures subtle long-term patterns (e.g., a bankruptcy from 8 years ago affecting recovery trajectory), but the marginal predictive value of entries 17-64 is small.
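The truncation-length investigation above amounts to repeated AUC evaluation on holdout scores at different sequence lengths. For reference, a minimal rank-based AUC sketch (a generic Mann-Whitney implementation, not Meridian's evaluation code; assumes binary labels and no tied scores):

```python
import numpy as np

def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC via the Mann-Whitney U statistic: the probability that a random
    positive is scored above a random negative (assumes no tied scores)."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # ascending 1-based ranks
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

y = np.array([1, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.1])
print(auc(y, s))  # 0.75: three of the four positive/negative pairs are ordered correctly
```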
Common Misconception: "Model compression always involves a significant accuracy loss." In this case, the accuracy loss from sequence truncation (0.003 AUC) is much smaller than the gap between the transformer and the original LightGBM model (0.012 AUC). The compressed transformer still outperforms the production model by 0.009 AUC. The key insight: model capacity and input representation are separate concerns. The model may be larger than necessary for the truncated input, but the input truncation, not model size reduction, is the primary source of speedup.
Implementation Considerations
# Monitoring plan for the latency-optimized model
monitoring_metrics = {
    "Model performance": [
        "AUC on daily holdout sample (alert if < 0.825)",
        "Score distribution drift (KS test, alert if p < 0.01)",
        "Feature contribution stability (SHAP value drift)",
    ],
    "Inference latency": [
        "p50, p95, p99 latency per hour (alert if p99 > 40ms)",
        "Latency by request source (POS vs. mobile vs. partner API)",
        "Latency vs. input trade line count (detect edge cases)",
    ],
    "Compression quality": [
        "Weekly comparison: truncated vs. full-sequence scores on sample",
        "Disagree rate: cases where truncation changes the approve/decline decision",
        "Monthly re-evaluation of optimal truncation length",
    ],
    "Business metrics": [
        "Charge-off rate by approval month cohort",
        "Revenue per approved application",
        "False negative rate (good applicants declined)",
    ],
}

print("Monitoring Plan for Latency-Optimized Credit Scoring Model")
print(f"{'':=<65}")
for category, metrics in monitoring_metrics.items():
    print(f"\n{category}:")
    for metric in metrics:
        print(f"  - {metric}")
Monitoring Plan for Latency-Optimized Credit Scoring Model
=================================================================

Model performance:
  - AUC on daily holdout sample (alert if < 0.825)
  - Score distribution drift (KS test, alert if p < 0.01)
  - Feature contribution stability (SHAP value drift)

Inference latency:
  - p50, p95, p99 latency per hour (alert if p99 > 40ms)
  - Latency by request source (POS vs. mobile vs. partner API)
  - Latency vs. input trade line count (detect edge cases)

Compression quality:
  - Weekly comparison: truncated vs. full-sequence scores on sample
  - Disagree rate: cases where truncation changes the approve/decline decision
  - Monthly re-evaluation of optimal truncation length

Business metrics:
  - Charge-off rate by approval month cohort
  - Revenue per approved application
  - False negative rate (good applicants declined)
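The "disagree rate" check under Compression quality can be computed by shadow-scoring the same sample with both the full-sequence and truncated models. A minimal sketch (synthetic scores and an illustrative 0.38 approval threshold, not Meridian's cutoff):

```python
import numpy as np

def decision_disagree_rate(full_scores: np.ndarray,
                           truncated_scores: np.ndarray,
                           threshold: float) -> float:
    """Fraction of applications where truncation flips the approve/decline call."""
    return float(np.mean((full_scores >= threshold)
                         != (truncated_scores >= threshold)))

# Synthetic shadow sample: truncated scores deviate slightly from full scores.
rng = np.random.default_rng(0)
full = rng.uniform(0.0, 1.0, size=50_000)
truncated = np.clip(full + rng.normal(0.0, 0.02, size=50_000), 0.0, 1.0)
print(f"disagree rate: {decision_disagree_rate(full, truncated, 0.38):.2%}")
```

Score-level drift (the weekly comparison) can look small while decision-level disagreement concentrates near the threshold, which is why the plan tracks both.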
Lessons for Practice
- The latency budget is not negotiable. Unlike model accuracy (where a small regression might be acceptable), SLA violations trigger immediate consequences: degraded user experience, partner API timeouts, and contractual penalties. The 50ms constraint is a hard boundary, not a guideline.
- Model compression is a toolbox, not a single technique. Quantization, distillation, pruning, and input reduction address different bottlenecks. In this case, the primary bottleneck was the $O(n^2 d)$ attention computation, which is best addressed by reducing $n$ (sequence truncation) rather than reducing the model size.
- Cost-benefit analysis must include engineering effort. The distillation approaches achieve comparable AUC but require significantly more engineering investment (distillation pipeline, student architecture search, additional training infrastructure). Sequence truncation achieves similar AUC improvement with a single hyperparameter change.
- The right comparison is against the current production model, not against the uncompressed research model. The team's initial framing — "we lose 0.003 AUC by truncating" — was pessimistic. The correct framing: "we gain 0.009 AUC over the production model while meeting the SLA." Both framings describe the same model, but they lead to different decisions.
Simplest Model That Works: The final recommendation illustrates this theme. The team considered distillation pipelines, quantization-aware training, and hybrid architectures; the solution that shipped was changing a single configuration parameter (maximum trade line count, 64 → 16). The engineering complexity was negligible, and the AUC improvement over the production model was substantial. Sometimes the simplest intervention is also the most effective.