Case Study 2: Meridian Financial Real-Time Credit Scoring — Latency Budgets and Model Compression

Context

Meridian Financial processes 12,000 credit applications per hour through its real-time decisioning platform. When a consumer applies for credit — at a point of sale, through a mobile app, or via a partner lending platform — the system must return an approve/decline decision within a strict latency budget. The budget exists for two reasons: the consumer experience (no one waits 5 seconds for a credit decision at checkout) and the partner SLA (API timeout at 500ms, with quality-of-service penalties above 200ms).

The current production model is a gradient-boosted tree ensemble (LightGBM, 500 trees, max depth 8, 47 features) that achieves an AUC of 0.821 on the holdout validation set and an inference latency of 2ms at p50, 8ms at p99. The model has served reliably for 14 months.

The model risk team has developed a transformer-based credit scoring model that achieves AUC 0.833 — a 1.2 percentage point improvement that, according to their analysis, would reduce annual charge-off losses by $4.2 million if deployed. The transformer processes the applicant's 47 tabular features plus a sequence of the most recent 64 credit bureau trade lines (each with 12 attributes), treating the trade line history as a temporal sequence.

The challenge: the transformer's inference latency is 85ms at p50 and 340ms at p99, nearly 7x the 50ms p99 inference budget.

This case study examines whether the transformer can be made production-ready through model compression, and what the tradeoffs are.

The Latency Budget

from dataclasses import dataclass

@dataclass
class LatencyBudget:
    """End-to-end latency budget for credit decisioning."""
    network_ingress_ms: float = 5.0
    request_parsing_ms: float = 1.0
    feature_retrieval_ms: float = 15.0  # Credit bureau API + cache
    feature_engineering_ms: float = 3.0
    model_inference_ms: float = 50.0    # THE CONSTRAINT
    business_rules_ms: float = 5.0      # Fraud checks, policy rules
    response_serialization_ms: float = 2.0
    network_egress_ms: float = 5.0

    @property
    def total_budget_ms(self) -> float:
        return (
            self.network_ingress_ms + self.request_parsing_ms +
            self.feature_retrieval_ms + self.feature_engineering_ms +
            self.model_inference_ms + self.business_rules_ms +
            self.response_serialization_ms + self.network_egress_ms
        )

    def report(self) -> None:
        """Print the latency budget breakdown."""
        components = {
            "Network ingress": self.network_ingress_ms,
            "Request parsing": self.request_parsing_ms,
            "Feature retrieval": self.feature_retrieval_ms,
            "Feature engineering": self.feature_engineering_ms,
            "Model inference": self.model_inference_ms,
            "Business rules": self.business_rules_ms,
            "Response serialization": self.response_serialization_ms,
            "Network egress": self.network_egress_ms,
        }
        total = self.total_budget_ms
        print(f"{'Component':<25s} {'Budget (ms)':>12s} {'% of Total':>10s}")
        print("-" * 50)
        for name, ms in components.items():
            print(f"{name:<25s} {ms:>12.1f} {ms/total:>9.1%}")
        print(f"{'TOTAL':<25s} {total:>12.1f}")

budget = LatencyBudget()
budget.report()
Component                  Budget (ms)  % of Total
--------------------------------------------------
Network ingress                    5.0       5.8%
Request parsing                    1.0       1.2%
Feature retrieval                 15.0      17.4%
Feature engineering                3.0       3.5%
Model inference                   50.0      58.1%
Business rules                     5.0       5.8%
Response serialization             2.0       2.3%
Network egress                     5.0       5.8%
TOTAL                             86.0

Model inference consumes 58% of the total budget. The 50ms allocation is already generous — feature retrieval (the credit bureau API call) is the next largest component at 15ms, and that latency is largely out of Meridian's control.
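A budget like this is only meaningful if inference latency is measured the same way it is reported. A minimal percentile-measurement harness is sketched below; the lambda is a stand-in workload, not Meridian's model call:

```python
import time
import numpy as np

def measure_latency_percentiles(fn, n_calls: int = 1000) -> dict:
    """Time n_calls invocations of fn and report latency percentiles in ms."""
    samples_ms = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50": float(np.percentile(samples_ms, 50)),
        "p95": float(np.percentile(samples_ms, 95)),
        "p99": float(np.percentile(samples_ms, 99)),
    }

# Stand-in for a model call: a small fixed computation.
stats = measure_latency_percentiles(lambda: sum(i * i for i in range(1000)))
print(f"p50={stats['p50']:.3f}ms  p99={stats['p99']:.3f}ms")
```

Note that p99 over a single measurement window can be noisy; production monitoring would aggregate over many windows.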

Compression Strategy Evaluation

The team evaluates four compression strategies, measuring the impact on both AUC and inference latency.

from dataclasses import dataclass
from typing import List

@dataclass
class CompressionResult:
    """Result of applying a compression technique to the transformer model."""
    technique: str
    original_auc: float
    compressed_auc: float
    auc_delta: float
    original_p99_ms: float
    compressed_p99_ms: float
    speedup: float
    model_size_mb: float
    meets_sla: bool
    notes: str

def evaluate_compression_strategies() -> List[CompressionResult]:
    """Evaluate model compression strategies for the credit scoring transformer.

    Returns simulated but realistic results based on published compression
    benchmarks for similar model architectures.

    Returns:
        List of CompressionResult for each strategy.
    """
    original_auc = 0.833
    original_p99 = 340.0  # ms
    original_size = 450.0  # MB

    strategies = [
        CompressionResult(
            technique="No compression (baseline)",
            original_auc=original_auc,
            compressed_auc=original_auc,
            auc_delta=0.0,
            original_p99_ms=original_p99,
            compressed_p99_ms=original_p99,
            speedup=1.0,
            model_size_mb=original_size,
            meets_sla=original_p99 <= 50,
            notes="Baseline transformer: 6 layers, d=512, 8 heads, 64 trade lines"
        ),
        CompressionResult(
            technique="INT8 quantization (static)",
            original_auc=original_auc,
            compressed_auc=0.832,
            auc_delta=-0.001,
            original_p99_ms=original_p99,
            compressed_p99_ms=95.0,
            speedup=3.6,
            model_size_mb=115.0,
            meets_sla=95.0 <= 50,
            notes="Post-training quantization via ONNX Runtime INT8. "
                  "3.6x speedup typical for attention-heavy models on CPU."
        ),
        CompressionResult(
            technique="Knowledge distillation (3-layer)",
            original_auc=original_auc,
            compressed_auc=0.829,
            auc_delta=-0.004,
            original_p99_ms=original_p99,
            compressed_p99_ms=42.0,
            speedup=8.1,
            model_size_mb=52.0,
            meets_sla=42.0 <= 50,
            notes="Student: 3 layers, d=256, 4 heads. Trained with soft labels "
                  "from teacher + hard labels. 3 epochs of distillation training."
        ),
        CompressionResult(
            technique="Distillation + INT8 quantization",
            original_auc=original_auc,
            compressed_auc=0.828,
            auc_delta=-0.005,
            original_p99_ms=original_p99,
            compressed_p99_ms=15.0,
            speedup=22.7,
            model_size_mb=14.0,
            meets_sla=15.0 <= 50,
            notes="Distilled 3-layer model quantized to INT8. Combined effect "
                  "of architecture reduction and precision reduction."
        ),
        CompressionResult(
            technique="Sequence truncation (64→16)",
            original_auc=original_auc,
            compressed_auc=0.830,
            auc_delta=-0.003,
            original_p99_ms=original_p99,
            compressed_p99_ms=28.0,
            speedup=12.1,
            model_size_mb=450.0,
            meets_sla=28.0 <= 50,
            notes="Reduce input sequence from 64 to 16 most recent trade lines. "
                  "Attention cost drops by 16x (quadratic in sequence length). "
                  "Model weights unchanged."
        ),
    ]
    return strategies


results = evaluate_compression_strategies()

print("Meridian Financial: Transformer Compression Evaluation")
print(f"{'':=<100}")
print(f"{'Technique':<35s} {'AUC':>6s} {'ΔAUC':>7s} {'p99 (ms)':>10s} "
      f"{'Speedup':>8s} {'Size (MB)':>10s} {'SLA?':>6s}")
print(f"{'':->100}")

for r in results:
    delta_str = f"{r.auc_delta:+.3f}" if r.auc_delta != 0 else "  ---"
    sla_str = "YES" if r.meets_sla else "NO"
    print(f"{r.technique:<35s} {r.compressed_auc:>6.3f} {delta_str:>7s} "
          f"{r.compressed_p99_ms:>10.1f} {r.speedup:>7.1f}x "
          f"{r.model_size_mb:>9.0f} {sla_str:>6s}")
Meridian Financial: Transformer Compression Evaluation
====================================================================================================
Technique                           AUC    ΔAUC   p99 (ms)  Speedup  Size (MB)   SLA?
----------------------------------------------------------------------------------------------------
No compression (baseline)         0.833     ---      340.0     1.0x       450    NO
INT8 quantization (static)        0.832  -0.001       95.0     3.6x       115    NO
Knowledge distillation (3-layer)  0.829  -0.004       42.0     8.1x        52   YES
Distillation + INT8 quantization  0.828  -0.005       15.0    22.7x        14   YES
Sequence truncation (64→16)       0.830  -0.003       28.0    12.1x       450   YES
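The INT8 rows shrink model size roughly 4x because each float32 weight becomes one byte plus a shared scale factor. A numpy sketch of symmetric per-tensor post-training quantization, illustrative only (the production path would go through ONNX Runtime, as the notes say):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# One hypothetical 512x512 attention weight matrix at transformer scale.
w = rng.normal(0.0, 0.02, size=(512, 512)).astype(np.float32)
q, scale = quantize_int8(w)
err = float(np.max(np.abs(w - dequantize(q, scale))))
print(f"size: {w.nbytes} -> {q.nbytes} bytes (4x), max abs error: {err:.2e}")
```

The reconstruction error is bounded by half the quantization step, which is why well-conditioned models lose so little AUC from weight quantization alone.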

The Decision Analysis

Three of the five strategies meet the 50ms SLA. The question becomes: which offers the best AUC-for-latency tradeoff?

# Cost-benefit analysis
print("Cost-Benefit Analysis")
print(f"{'':=<80}")

# Parameters from Meridian's financial model
annual_loan_volume = 2_400_000  # Applications per year
approval_rate = 0.62  # Current approval rate
avg_loan_size = 8_500  # Average loan amount
current_chargeoff_rate = 0.034  # Current charge-off rate on approved loans

baseline_auc = 0.821  # Current LightGBM model
baseline_chargeoffs = (
    annual_loan_volume * approval_rate * avg_loan_size * current_chargeoff_rate
)

viable_options = [
    ("LightGBM (current)",           0.821,  2.0,   8.0, 0),
    ("Distilled transformer (3L)",   0.829, 12.0,  42.0, 350_000),
    ("Distilled + INT8",             0.828, 10.0,  15.0, 350_000),
    ("Seq truncation (16 lines)",    0.830, 10.0,  28.0, 100_000),
]

print(f"{'Option':<30s} {'AUC':>6s} {'ΔAUC':>7s} {'Est. Chargeoff':>16s} "
      f"{'Annual Savings':>15s} {'Eng. Cost':>10s}")
print(f"{'':->90}")

# Industry benchmark: 0.01 AUC improvement → ~0.5% reduction in charge-off rate
# (Source: approximate relationship for well-calibrated models in consumer credit)
for name, auc, p50, p99, eng_cost in viable_options:
    auc_improvement = auc - baseline_auc
    # Conservative: 0.5% charge-off reduction per 0.01 AUC
    chargeoff_reduction = auc_improvement / 0.01 * 0.005
    new_chargeoff_rate = current_chargeoff_rate * (1 - chargeoff_reduction)
    new_chargeoffs = (
        annual_loan_volume * approval_rate * avg_loan_size * new_chargeoff_rate
    )
    savings = baseline_chargeoffs - new_chargeoffs

    auc_delta = f"{auc_improvement:+.3f}" if auc != baseline_auc else "  ---"
    savings_str = f"${savings:>12,.0f}" if savings > 0 else "       ---"

    print(f"{name:<30s} {auc:>6.3f} {auc_delta:>7s} "
          f"${new_chargeoffs:>13,.0f} {savings_str:>15s} "
          f"{'$' + f'{eng_cost:,}':>10s}")
Cost-Benefit Analysis
================================================================================
Option                            AUC    ΔAUC   Est. Chargeoff  Annual Savings  Eng. Cost
------------------------------------------------------------------------------------------
LightGBM (current)              0.821     ---   $  430,032,000             ---         $0
Distilled transformer (3L)      0.829  +0.008   $  428,311,872   $   1,720,128   $350,000
Distilled + INT8                0.828  +0.007   $  428,526,888   $   1,505,112   $350,000
Seq truncation (16 lines)       0.830  +0.009   $  428,096,856   $   1,935,144   $100,000

The Team's Recommendation

The analysis yields a clear recommendation: sequence truncation (64 to 16 trade lines) with the full 6-layer transformer. This option:

  • Achieves the highest AUC among SLA-compliant options (0.830)
  • Has the lowest engineering cost ($100K vs. $350K for distillation)
  • Meets the SLA with substantial headroom (28ms p99 vs. 50ms budget)
  • Does not require maintaining a separate distillation pipeline
  • Can be further compressed later (INT8 quantization if future features push latency up)

The 0.003 AUC loss from 0.833 (full sequence) to 0.830 (truncated sequence) is a direct consequence of computational constraints. The team's investigation reveals that most predictive signal in the trade line history comes from the most recent 16 entries — the older entries are less relevant because credit behavior is non-stationary. The 64-trade-line model captures subtle long-term patterns (e.g., a bankruptcy from 8 years ago affecting recovery trajectory), but the marginal predictive value of entries 17-64 is small.
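The diminishing-returns finding can be illustrated on synthetic data. The block below assumes a recency-decaying signal (an assumption for illustration, not Meridian's data) and sweeps the truncation length k; the AUC plateaus well before the full 64-line sequence:

```python
import numpy as np

rng = np.random.default_rng(7)
n, seq_len = 20_000, 64

# Synthetic applicants: one risk attribute per trade line, most recent first.
# ASSUMED signal structure for illustration: predictive power decays with age.
x = rng.normal(size=(n, seq_len))
decay = np.exp(-np.arange(seq_len) / 8.0)
labels = (x @ decay + rng.normal(size=n) > 0).astype(int)  # noisy default labels

def auc(scores: np.ndarray, y: np.ndarray) -> float:
    """Rank-based AUC: probability a positive case outranks a negative one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Sweep truncation length k: score using only the k most recent trade lines.
for k in (4, 8, 16, 32, 64):
    print(f"k={k:>2d}  AUC={auc(x[:, :k] @ decay[:k], labels):.3f}")
```

Under this assumed decay, entries beyond the 16 most recent carry only a few percent of the signal variance, mirroring the team's finding.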

Common Misconception: "Model compression always involves a significant accuracy loss." In this case, the accuracy loss from sequence truncation (0.003 AUC) is much smaller than the gap between the transformer and the original LightGBM model (0.012 AUC). The compressed transformer still outperforms the production model by 0.009 AUC (0.830 vs. 0.821). The key insight: model capacity and input representation are separate concerns. The model may be larger than necessary for the truncated input, but the input truncation is the primary source of speedup, not model size reduction.
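The separation is visible in a back-of-envelope count of the quadratic attention-map term alone (the QKᵀ scores plus the weighted value sum, each roughly n²·d multiply-adds per layer; projections and constant factors are deliberately ignored):

```python
def attention_map_cost(n: int, d: int) -> int:
    """Multiply-adds in the attention map per layer: QK^T scores plus the
    weighted value sum, both roughly n^2 * d (projections excluded)."""
    return 2 * n * n * d

d_model = 512
full = attention_map_cost(64, d_model)
truncated = attention_map_cost(16, d_model)
print(f"64 lines: {full:,} ops, 16 lines: {truncated:,} ops -> {full // truncated}x")
```

Truncation cuts this term 16x while the parameter count stays exactly the same; the weight matrices never change.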

Implementation Considerations

# Monitoring plan for the latency-optimized model
monitoring_metrics = {
    "Model performance": [
        "AUC on daily holdout sample (alert if < 0.825)",
        "Score distribution drift (KS test, alert if p < 0.01)",
        "Feature contribution stability (SHAP value drift)",
    ],
    "Inference latency": [
        "p50, p95, p99 latency per hour (alert if p99 > 40ms)",
        "Latency by request source (POS vs. mobile vs. partner API)",
        "Latency vs. input trade line count (detect edge cases)",
    ],
    "Compression quality": [
        "Weekly comparison: truncated vs. full-sequence scores on sample",
        "Disagree rate: cases where truncation changes the approve/decline decision",
        "Monthly re-evaluation of optimal truncation length",
    ],
    "Business metrics": [
        "Charge-off rate by approval month cohort",
        "Revenue per approved application",
        "False negative rate (good applicants declined)",
    ],
}

print("Monitoring Plan for Latency-Optimized Credit Scoring Model")
print(f"{'':=<65}")
for category, metrics in monitoring_metrics.items():
    print(f"\n{category}:")
    for metric in metrics:
        print(f"  - {metric}")
Monitoring Plan for Latency-Optimized Credit Scoring Model
=================================================================

Model performance:
  - AUC on daily holdout sample (alert if < 0.825)
  - Score distribution drift (KS test, alert if p < 0.01)
  - Feature contribution stability (SHAP value drift)

Inference latency:
  - p50, p95, p99 latency per hour (alert if p99 > 40ms)
  - Latency by request source (POS vs. mobile vs. partner API)
  - Latency vs. input trade line count (detect edge cases)

Compression quality:
  - Weekly comparison: truncated vs. full-sequence scores on sample
  - Disagree rate: cases where truncation changes the approve/decline decision
  - Monthly re-evaluation of optimal truncation length

Business metrics:
  - Charge-off rate by approval month cohort
  - Revenue per approved application
  - False negative rate (good applicants declined)
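The "disagree rate" metric in the compression-quality section can be computed directly from paired scores: score a sample with both the truncated and full-sequence models and count decision flips at the approval threshold. A sketch with hypothetical score arrays and an assumed threshold of 0.55:

```python
import numpy as np

def decision_disagree_rate(scores_full, scores_trunc, threshold: float) -> float:
    """Fraction of applications where truncation flips the approve/decline decision."""
    approve_full = np.asarray(scores_full) >= threshold
    approve_trunc = np.asarray(scores_trunc) >= threshold
    return float(np.mean(approve_full != approve_trunc))

# Hypothetical sample: truncated scores as a small perturbation of full scores.
rng = np.random.default_rng(0)
full = rng.uniform(0, 1, size=10_000)
trunc = np.clip(full + rng.normal(0, 0.01, size=10_000), 0, 1)
rate = decision_disagree_rate(full, trunc, threshold=0.55)
print(f"disagree rate: {rate:.2%}")
```

Flips concentrate near the threshold, so the disagree rate is far more operationally meaningful than the raw score correlation.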

Lessons for Practice

  1. The latency budget is not negotiable. Unlike model accuracy (where a small regression might be acceptable), SLA violations trigger immediate consequences: degraded user experience, partner API timeouts, and contractual penalties. The 50ms constraint is a hard boundary, not a guideline.

  2. Model compression is a toolbox, not a single technique. Quantization, distillation, pruning, and input reduction address different bottlenecks. In this case, the primary bottleneck was the O(n²·d) attention computation, which is best addressed by reducing n (sequence truncation) rather than reducing the model size.

  3. Cost-benefit analysis must include engineering effort. The distillation approaches achieve comparable AUC but require significantly more engineering investment (distillation pipeline, student architecture search, additional training infrastructure). Sequence truncation achieves similar AUC improvement with a single hyperparameter change.

  4. The right comparison is against the current production model, not against the uncompressed research model. The team's initial framing — "we lose 0.003 AUC by truncating" — was pessimistic. The correct framing: "we gain 0.009 AUC over the production model while meeting the SLA." Both framings describe the same number, but they lead to different decisions.
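For reference, the distillation objective behind the student models in the evaluation (soft teacher targets at a temperature T, combined with hard labels) follows the standard recipe; the weighting and temperature below are illustrative assumptions, not the team's actual hyperparameters:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T: float = 2.0, alpha: float = 0.5) -> float:
    """alpha * CE(student, hard labels) + (1-alpha) * T^2 * KL(teacher_T || student_T)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    ce = -np.log(softmax(student_logits)[np.arange(len(hard_labels)), hard_labels]).mean()
    return float(alpha * ce + (1 - alpha) * T * T * kl)

rng = np.random.default_rng(1)
student = rng.normal(size=(8, 2))   # approve/decline logits
teacher = rng.normal(size=(8, 2))
labels = rng.integers(0, 2, size=8)
print(f"loss: {distillation_loss(student, teacher, labels):.3f}")
```

The T² factor keeps the soft-label gradient magnitude comparable across temperatures; it is part of the standard recipe rather than a tuning knob.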

Simplest Model That Works: The final recommendation illustrates this theme: the team considered distillation pipelines, quantization-aware training, and hybrid architectures. The solution that shipped was changing a single configuration parameter (maximum trade line count: 64 to 16). The engineering complexity was negligible. The AUC improvement over the production model was substantial. Sometimes the simplest intervention is also the most effective.
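That single configuration parameter amounts to a preprocessing step like the following (hypothetical names; Meridian's real feature pipeline is not shown):

```python
MAX_TRADE_LINES = 16  # was 64: the one-parameter change that shipped

def prepare_trade_lines(trade_lines: list) -> list:
    """Keep only the MAX_TRADE_LINES most recent entries (input is newest-first)."""
    return trade_lines[:MAX_TRADE_LINES]

history = [{"age_months": m} for m in range(64)]  # newest first
print(len(prepare_trade_lines(history)))  # prints 16
```

Keeping the cap in configuration rather than in model code is what makes the monthly re-evaluation of the truncation length in the monitoring plan cheap to act on.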