Case Study 2: Pinnacle Health's Monthly Claims Processing Pipeline

"Fifty million claims. Thirty-one days. One shot at getting the numbers right."

— Diane Okoye, Systems Architect, Pinnacle Health Insurance


The Context

Pinnacle Health Insurance processes 50 million medical claims per month across commercial, Medicare Advantage, and Medicaid managed care lines of business. Unlike CNB's nightly batch window — which runs every day and must complete by morning — Pinnacle's biggest batch challenge is the monthly claims cycle: a multi-day batch processing event that runs during the last three days of each month and produces the financial, actuarial, and regulatory outputs that drive the entire business.

The monthly cycle isn't just a bigger version of a nightly batch. It's a different beast entirely — one where the batch window spans 72 hours, the dependencies cross business domains, regulatory deadlines are measured in calendar days (not hours), and a failure on Day 1 can cascade into a missed CMS (Centers for Medicare & Medicaid Services) filing deadline that triggers federal audit action.

Diane Okoye has architected Pinnacle's batch processing for nine years. She inherited a monthly cycle that was designed for 20 million claims and has nursed it through to 50 million without a complete re-architecture — yet. This case study examines how she manages the monthly cycle, the crisis that forced a partial re-architecture in 2025, and the lessons that apply to any large-scale batch processing environment.

The Monthly Cycle Architecture

The Three-Day Window

Pinnacle's monthly cycle runs on the 28th, 29th, and 30th of each month (adjusted for short months and weekends):

Day 1 (28th): Claim Finalization and Adjudication
  Window: 8:00 PM – 6:00 AM (10 hours)
  Jobs: 312
  Critical path: 8.4 hours

Day 2 (29th): Financial Processing and Provider Payments
  Window: 8:00 PM – 6:00 AM (10 hours)
  Jobs: 247
  Critical path: 7.3 hours

Day 3 (30th): Regulatory Reporting and Reconciliation
  Window: 8:00 PM – 6:00 AM (10 hours)
  Jobs: 198
  Critical path: 6.8 hours

Each day's batch has its own DAG, but Day 2 depends on Day 1's successful completion, and Day 3 depends on Day 2. The three-day window is itself a DAG with three macro-nodes chained serially.

Day 1: Claim Finalization

The Day 1 batch is the most complex. It processes all claims in "pending" status, applies final adjudication rules, reprices services, and produces the adjudicated claims master file that drives everything else.

Day 1 Critical Path:

CLAIM-EXTRACT (extract 50M claims from DB2) ────── 95 min
    │
CLAIM-SORT (sort by provider/service date) ─────── 40 min
    │
ADJUD-PHASE1 (primary adjudication rules) ──────── 110 min ★
    │
ADJUD-PHASE2 (secondary rules/overrides) ───────── 65 min
    │
REPRICE-01 (apply fee schedule repricing) ──────── 85 min ★
    │
MERGE-ADJ (merge adjudication results) ─────────── 25 min
    │
CLAIM-POST (post to claims master DB2) ─────────── 70 min ★
    │
CLAIM-VERFY (verification/reconciliation) ──────── 15 min
    │
                                        Total: 505 min (8.4 hours)

★ marks the three jobs with the highest volume elasticity — the ones that grow fastest as claims volume increases.

Parallel Streams on Day 1

Not everything is serial. Day 1 has five parallel streams:

Stream A: Commercial claims (22M claims)
Stream B: Medicare Advantage (18M claims)
Stream C: Medicaid managed care (10M claims)
Stream D: Dental/Vision (auxiliary, 3M claims)
Stream E: Pharmacy (separate adjudication engine, 8M claims)

Streams A, B, and C share the same adjudication engine but process different claim populations. They can run in parallel if — and this is a critical constraint — they don't update the same provider records simultaneously.

Parallel execution plan:

8:00 PM  ─ Stream A (Commercial): Extract → Sort → Adjud → Reprice
           Stream D (Dental):     Extract → Sort → Adjud → Reprice
           Stream E (Pharmacy):   Extract → Adjud (different engine)

9:30 PM  ─ Stream B (Medicare):   Extract → Sort → Adjud → Reprice
           (delayed to avoid provider table contention with Stream A)

10:15 PM ─ Stream C (Medicaid):   Extract → Sort → Adjud → Reprice
           (delayed further — Medicaid repricing uses same fee tables)

1:00 AM  ─ MERGE-ALL: Combine all stream outputs
           POST:      Update claims master
           VERIFY:    Reconcile counts and amounts

The critical path is Stream A (the largest at 22M claims), followed by the merge and post steps that require all streams to complete.
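This plan can be checked with a back-of-envelope model. In the sketch below, Stream A's stage times (extract, sort, adjudicate, reprice) come from the critical-path table above; the Stream B and C durations are assumptions scaled roughly by claim volume, purely for illustration:

```python
# Sketch: elapsed time of the Day 1 parallel plan. Stream A's stage times
# are from the critical-path table; Stream B and C times are ASSUMED,
# scaled roughly by claim volume.
STAGES_MIN = {
    "A": [95, 40, 110, 85],   # Commercial, 22M claims (from the text)
    "B": [70, 30, 85, 50],    # Medicare, 18M claims (assumed)
    "C": [55, 25, 60, 45],    # Medicaid, 10M claims (assumed)
}
START_OFFSET_MIN = {"A": 0, "B": 90, "C": 135}  # 8:00, 9:30, 10:15 PM
MERGE_POST_VERIFY_MIN = 25 + 70 + 15  # serial tail once all streams finish

def day1_elapsed_minutes():
    # Each stream is serial internally; the streams run in parallel,
    # so the merge waits only for the slowest stream to finish.
    finish = {s: START_OFFSET_MIN[s] + sum(t) for s, t in STAGES_MIN.items()}
    return max(finish.values()) + MERGE_POST_VERIFY_MIN
```

With these assumed numbers, Stream A (330 minutes) is the last stream to finish despite the staggered starts, consistent with the text's claim that it sets the critical path.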

Day 2: Financial Processing

Day 2 takes the adjudicated claims from Day 1 and produces:

  • Provider payment files (EFT and check)
  • Member liability statements
  • Accounts payable journal entries
  • Reserve adjustments
  • Reinsurance calculations

The critical path runs through the provider payment stream:

PAY-EXTRACT (claims by provider) ────────── 45 min
    │
PAY-CALC (calculate payment amounts) ────── 75 min
    │
PAY-DEDUCT (apply withholdings) ─────────── 35 min
    │
PAY-EFT (generate EFT files) ────────────── 30 min
    │
PAY-REMIT (generate remittance advice) ──── 55 min
    │
PAY-POST (post to AP ledger) ────────────── 40 min
    │
PAY-RECON (reconcile payments) ──────────── 20 min
    │
RESERVE (calculate reserves) ────────────── 65 min
    │
REINS (reinsurance calculations) ────────── 40 min
    │
FIN-CLOSE (financial period close) ──────── 35 min
    │
                                Total: 440 min (7.3 hours)

Day 3: Regulatory Reporting

Day 3 produces federally mandated reports:

CMS-835 (Medicare remittance) ───────────── 40 min
CMS-837 (claim submission to CMS) ───────── 55 min
MLR-CALC (Medical Loss Ratio) ───────────── 85 min
STATE-RPT (state regulatory reports) ────── 45 min
AUDIT-EXT (audit trail extraction) ──────── 60 min
RECON-FINAL (full cycle reconciliation) ─── 35 min
ARCHIVE (cycle archive to tape/cloud) ───── 90 min

Many Day 3 jobs can run in parallel — the CMS files, state reports, and audit extracts are independent. The critical path runs through the MLR calculation (which needs all financial data) and the final reconciliation.

The Crisis: February 2025

What Happened

Pinnacle acquired a regional health plan in January 2025, adding 8 million members and approximately 12 million claims per month. The integration plan called for a 6-month migration, but regulatory pressure from the state insurance commissioner accelerated the timeline — all claims had to be processed on Pinnacle's platform by the February month-end cycle.

February 28th. The monthly cycle began at 8:00 PM.

Day 1 (February 28th):

Claim extract ran normally — 62 million claims instead of the expected 50 million (the acquisition's 12M claims were already loaded into the pending queue). The extract took 118 minutes instead of 95.

ADJUD-PHASE1, the primary adjudication engine, started processing. The acquired plan's claims used different coding conventions (ICD-10 mapping variations, non-standard place-of-service codes), triggering the "manual review" exception path at 3x the normal rate. The program's exception handling wrote each exception to a DB2 audit table — and the high exception rate caused DB2 lock escalation on the audit table, which blocked other adjudication streams.

By 1:30 AM, Stream A (Commercial) was at 65% complete instead of the expected 90%. Streams B and C hadn't started because the DB2 lock escalation was blocking the shared provider tables.

Diane got the call at 1:45 AM.

The Decision Tree

Diane's assessment at 1:45 AM:

Day 1 status:
  Stream A: 65% complete, running 90 min behind
  Stream B: Queued (DB2 contention)
  Stream C: Queued (DB2 contention)
  Stream D: Complete (dental claims unaffected)
  Stream E: 80% complete (pharmacy uses different engine)

Projected Day 1 completion: 9:30 AM (3.5 hours late)
Day 2 cannot start until Day 1 completes.
Day 3 must complete by 11:59 PM March 2nd (CMS filing deadline).

Time budget:
  Available: 52 hours (Feb 28 8PM to Mar 2 midnight)
  Day 1 projected: 13.5 hours (3.5 late)
  Day 2 expected: 7.5 hours
  Day 3 expected: 6.8 hours
  Inter-day buffers: 2 hours (two transitions)
  Total projected: 29.8 hours
  Remaining: 22.2 hours

The math said they'd finish with 22 hours of margin. But Diane didn't trust the math — the Day 1 projection assumed the DB2 contention would be resolved, and it hadn't been yet.
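This kind of time-budget arithmetic is worth doing in code, because off-by-a-day errors are easy to make under pressure. A quick check with Python's datetime, using the projected durations from the assessment above:

```python
# Calendar check of the 1:45 AM time budget. The projected durations are
# the figures from Diane's assessment; the deadline is treated as the
# midnight ending March 2nd (11:59 PM rounds up for this estimate).
from datetime import datetime

window_open = datetime(2025, 2, 28, 20, 0)   # cycle start, Feb 28 8:00 PM
cms_deadline = datetime(2025, 3, 3, 0, 0)    # end of March 2nd

available_h = (cms_deadline - window_open).total_seconds() / 3600
projected_h = 13.5 + 7.5 + 6.8 + 2.0   # Day 1 (late) + Day 2 + Day 3 + buffers
margin_h = available_h - projected_h

print(f"available {available_h:.1f} h, margin {margin_h:.1f} h")
```

February 2025 has 28 days, so the window from Feb 28 8:00 PM to the end of March 2nd is 52 hours, leaving roughly 22 hours of nominal margin.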

The Recovery

Diane's team executed a three-part recovery:

Part 1 (Immediate): Fix the DB2 lock escalation.

Ahmad Rashidi (compliance) confirmed that the audit table writes weren't needed in real time — they could be deferred to a post-processing step. Pinnacle's lead DB2 administrator — the counterpart of CNB's Lisa Tran — added a secondary audit table to absorb the batch insert load, bypassing the contended table.

-- Redirect audit writes to staging table (no contention)
INSERT INTO CLAIM_AUDIT_STAGING
  (CLAIM_ID, EXCEPTION_CODE, EXCEPTION_DESC, AUDIT_TS)
VALUES
  (:WS-CLAIM-ID, :WS-EXCEPT-CD, :WS-EXCEPT-DESC, CURRENT TIMESTAMP);

-- After batch completes, merge staging to main audit table
INSERT INTO CLAIM_AUDIT_MASTER
  SELECT * FROM CLAIM_AUDIT_STAGING
  WHERE AUDIT_TS >= :WS-CYCLE-START;

This required an emergency code change — a 4-line COBOL modification to redirect the audit INSERT to the staging table. The change was tested on a parallel LPAR at 2:15 AM, promoted to production at 2:30 AM, and the adjudication jobs were restarted from their last checkpoint.

Part 2 (Parallel recovery): Start Streams B and C immediately.

With the audit table contention resolved, Streams B and C could start while Stream A continued. Stream A had already processed 65% of its claims — the remaining 35% would take approximately 40 minutes. Streams B and C would take their normal 90 minutes each (running in parallel = 90 minutes elapsed).

Part 3 (Schedule compression): Overlap Day 1 tail with Day 2 start.

Normally, Day 2 waits for ALL of Day 1 to complete, including the merge and post steps. But Diane realized that the provider payment calculation (Day 2's biggest job) only needed the adjudicated claims — it didn't need the merge to be complete. She could start Day 2's payment extraction as soon as each stream's adjudication finished, rather than waiting for the final merge.

Normal flow:
  Stream A adjud → Stream B adjud → Stream C adjud → MERGE → Day 2 start

Compressed flow:
  Stream A adjud ──→ Day 2 payment calc (Stream A claims) ──→
  Stream B adjud ──→ Day 2 payment calc (Stream B claims) ──→ MERGE payments
  Stream C adjud ──→ Day 2 payment calc (Stream C claims) ──→

  Savings: ~2 hours (overlap of Day 1 merge with Day 2 processing)
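A toy model makes the compression concrete. The 75-minute PAY-CALC figure is taken from the Day 2 critical-path table; the adjudication finish times and the payment-merge duration are assumptions for illustration:

```python
# Toy model of the Day 1/Day 2 overlap. Adjudication finish times (minutes
# after 8:00 PM) and PAY_MERGE_MIN are ASSUMED; PAY_CALC_MIN is the Day 2
# table's figure for the payment calculation step.
ADJUD_FINISH = {"A": 330, "B": 325, "C": 320}  # assumed per-stream finish
MERGE_MIN = 25 + 70 + 15    # Day 1 merge/post/verify tail
PAY_CALC_MIN = 75           # per-stream payment calc (from the Day 2 table)
PAY_MERGE_MIN = 30          # merging the three payment outputs (assumed)

def normal_flow():
    # Day 2 waits for ALL of Day 1, including the merge/post/verify tail.
    day1_done = max(ADJUD_FINISH.values()) + MERGE_MIN
    return day1_done + PAY_CALC_MIN + PAY_MERGE_MIN

def compressed_flow():
    # Payment calc starts per stream as soon as that stream's adjudication
    # finishes; the Day 1 merge runs in parallel with the payment calcs.
    pay_done = max(t + PAY_CALC_MIN for t in ADJUD_FINISH.values())
    day1_merge_done = max(ADJUD_FINISH.values()) + MERGE_MIN
    return max(pay_done, day1_merge_done) + PAY_MERGE_MIN
```

Under these assumed numbers the overlap hides the entire per-stream payment calculation behind the Day 1 merge; with the real job mix (many Day 2 jobs per stream, not one), the hidden work is correspondingly larger.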

The Outcome

Actual timing:
  Day 1 completion:  8:45 AM Mar 1 (2.75 hours late)
  Day 2 completion:  11:00 PM Mar 1 (started at 2:00 PM using overlap)
  Day 3 completion:  7:30 AM Mar 2 (well within deadline)

  CMS filing: 9:15 AM Mar 2 (deadline: 11:59 PM Mar 2)
  Margin: 14 hours 45 minutes

Claims processed: 62 million (all adjudicated, posted, and reported)
Errors: 0 (after the audit table fix)

The Post-Crisis Re-architecture

Diane didn't wait for the next crisis. She launched a batch architecture review the following week.

Finding 1: The Adjudication Engine Wasn't Designed for Exceptions

The adjudication COBOL program was written when exception rates were 2–3%. At 9% (the acquired plan's rate), the exception handling path — especially the synchronous DB2 audit writes — became a bottleneck. The fix:

  • Asynchronous audit logging: Exceptions are written to a sequential file during adjudication. A separate post-processing job loads them to DB2 in bulk. This eliminated all DB2 contention from the adjudication path.

  • Exception rate monitoring: A threshold check after every 100,000 claims. If the exception rate exceeds 5%, the job writes a warning message and (optionally) activates a "high-exception" processing mode that batches audit writes more aggressively.

       2500-CHECK-EXCEPTION-RATE.
           DIVIDE WS-EXCEPTION-COUNT BY WS-RECORD-COUNT
               GIVING WS-EXCEPTION-RATE ROUNDED
           IF WS-EXCEPTION-RATE > 0.05
               DISPLAY 'WARNING: Exception rate '
                       WS-EXCEPTION-RATE
                       ' exceeds 5% threshold at record '
                       WS-RECORD-COUNT
               SET WS-HIGH-EXCEPTION-MODE TO TRUE
           END-IF.
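The asynchronous audit pattern itself can be sketched outside COBOL. The Python stand-in below buffers exceptions locally (standing in for the sequential file) and flushes them in bulk (standing in for the post-processing DB2 load); the class name and threshold are illustrative, not Pinnacle's implementation:

```python
# Sketch of Finding 1's asynchronous audit logging: exceptions go to a
# local buffer, not a per-claim database INSERT, and are bulk-loaded later.
# Names and the flush threshold are ASSUMED for illustration.
class AuditLog:
    def __init__(self, flush_threshold=10_000):
        self.buffer = []              # stands in for the sequential file
        self.flushed = []             # stands in for the bulk DB2 load
        self.flush_threshold = flush_threshold

    def record(self, claim_id, code, desc):
        # Hot path: an in-memory append, no database contention.
        self.buffer.append((claim_id, code, desc))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One bulk write instead of one INSERT per exception.
        self.flushed.extend(self.buffer)
        self.buffer.clear()
```

The design choice is the same one Diane made at 2:15 AM: move the audit write off the adjudication critical path and pay for it once, in bulk, after the fact.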

Finding 2: The Three-Day Model Was Fragile

A single bad night on Day 1 cascaded through the entire cycle. Diane re-architected to allow partial overlap between days:

Old model (serial):
  Day 1 ────────────→ Day 2 ────────────→ Day 3

New model (pipelined):
  Day 1 Stream A ──→ Day 2 payment A ──→ Day 3 CMS A
  Day 1 Stream B ──→ Day 2 payment B ──→ Day 3 CMS B
  Day 1 Stream C ──→ Day 2 payment C ──→ Day 3 CMS C
                     ↓                   ↓
                  Day 2 merge ──→ Day 3 final recon

Each stream flows through all three days independently, converging only for cross-stream operations (GL posting, final reconciliation, MLR calculation). This reduced the end-to-end critical path from 22.5 hours (serial) to 15.2 hours (pipelined).
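The serial-versus-pipelined comparison reduces to a longest-path computation over the job DAG. A minimal helper, sketched in Python rather than the mainframe toolchain, with the three macro-nodes as the serial example (durations from the text; the real DAGs have hundreds of jobs):

```python
# Minimal longest-path ("critical path") calculation over a job DAG.
# The dag maps job -> (duration_hours, [predecessor jobs]).
from functools import lru_cache

def critical_path_hours(dag):
    @lru_cache(maxsize=None)
    def finish(job):
        # Earliest finish = own duration + latest predecessor finish.
        duration, deps = dag[job]
        return duration + max((finish(d) for d in deps), default=0.0)
    return max(finish(job) for job in dag)

# Old model: the three days chained serially (durations from the text).
SERIAL = {
    "day1": (8.4, []),
    "day2": (7.3, ["day1"]),
    "day3": (6.8, ["day2"]),
}
```

Scoring the pipelined model is the same call on a DAG with per-stream nodes (Day 1 Stream A → Day 2 payment A → Day 3 CMS A, and so on), converging only at the cross-stream jobs such as the final reconciliation.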

Finding 3: Acquisition Integration Needs Batch Impact Assessment

The 12 million new claims were loaded into the production pending queue with no batch capacity analysis. The coding convention differences weren't flagged because nobody ran the new claims through a test adjudication cycle before month-end.

Diane implemented a mandatory "batch integration test" for any data migration that adds more than 5% to existing volume:

  1. Extract a representative sample (10%) of new data
  2. Run it through the batch pipeline on a test LPAR
  3. Measure throughput, exception rates, and DB2 resource consumption
  4. Project elapsed time impact on the production critical path
  5. If projected impact exceeds 10% of margin, require architecture review before go-live
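Step 4's projection is simple enough to sketch as a helper. The function name and the linear-scaling assumption are mine, not Pinnacle's tooling — and linear scaling is a deliberate simplification, since lock contention (as February showed) does not scale linearly:

```python
# Sketch of step 4 of the batch integration test: project the production
# elapsed-time impact from a sample run. Names and the linear-scaling
# assumption are ILLUSTRATIVE, not Pinnacle's actual tooling.
def project_impact(sample_elapsed_min, sample_claims,
                   new_claims_total, window_margin_min):
    # Assume elapsed time scales linearly with claim volume.
    per_claim_min = sample_elapsed_min / sample_claims
    added_min = per_claim_min * new_claims_total
    # Step 5's rule: architecture review if impact exceeds 10% of margin.
    needs_review = added_min > 0.10 * window_margin_min
    return added_min, needs_review
```

For example, if a 10% sample (1.2M of 12M new claims) added 30 minutes on the test LPAR, the projected full-volume impact is about 300 minutes — far past the 10%-of-margin threshold for any realistic window.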

Finding 4: Checkpoint Granularity Was Too Coarse

The adjudication program took checkpoints every 500,000 claims. When the job was restarted at 2:30 AM, it had to re-process 340,000 claims that had already been adjudicated since the last checkpoint. That's 25 minutes of wasted re-processing.

New checkpoint frequency: every 50,000 claims. The checkpoint overhead increased by approximately 2 minutes per run (additional I/O and DB2 commits), but the maximum re-processing on restart dropped from 500,000 claims to 50,000 — saving up to 45 minutes in recovery scenarios.

Cost-benefit analysis:
  Checkpoint overhead increase: +2 min per run (every month-end)
  Recovery time savings: up to 45 min (when failures occur)
  Break-even: if failures occur more than once every 22 months
  Actual failure frequency: ~3 times per year
  ROI: clearly positive
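The arithmetic behind that break-even, spelled out with the figures above:

```python
# Checkpoint cost-benefit arithmetic (all figures from the analysis above).
OVERHEAD_MIN_PER_RUN = 2       # extra checkpoint I/O, one run per month-end
SAVINGS_MIN_PER_FAILURE = 45   # worst-case re-processing avoided on restart
FAILURES_PER_YEAR = 3

annual_cost_min = OVERHEAD_MIN_PER_RUN * 12
annual_saving_min = SAVINGS_MIN_PER_FAILURE * FAILURES_PER_YEAR
# One failure pays for ~22.5 months of overhead — hence "once every 22 months".
break_even_months = SAVINGS_MIN_PER_FAILURE / OVERHEAD_MIN_PER_RUN
```

At roughly three failures a year, the annual saving (135 minutes) is more than five times the annual overhead (24 minutes), which is why the ROI is clearly positive.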

Regulatory Dimensions

Ahmad Rashidi flagged a critical compliance consideration during the post-crisis review: the CMS filing deadline isn't just a business target — it's a contractual obligation under Pinnacle's Medicare Advantage contract. Missing it triggers:

  1. Corrective Action Plan (CAP) requirement from CMS
  2. Financial penalties up to $25,000 per day of late filing
  3. Audit risk — late filers get flagged for enhanced compliance review
  4. Enrollment sanctions — repeat offenders can be barred from enrolling new members

The February crisis finished with 14 hours of margin. But if the DB2 fix had taken longer, or if the overlap technique hadn't worked, the margin would have been thin enough to trigger Ahmad's escalation protocol — which includes pre-positioning a regulatory notification letter and engaging outside counsel.

"The batch window isn't just an IT problem," Ahmad told the post-crisis review meeting. "It's a regulatory compliance obligation. When the window breaks, my phone rings next."

Key Metrics: Before and After Re-architecture

Metric                       Before        After        Change
──────────────────────────────────────────────────────────────
Monthly cycle elapsed        22.5 hours    15.2 hours   -32%
Day 1 critical path          8.4 hours     6.1 hours    -27%
Day 2 critical path          7.3 hours     5.4 hours    -26%
Day 3 critical path          6.8 hours     5.8 hours    -15%
Max claims capacity          55M           80M          +45%
Recovery time (worst case)   4.2 hours     1.8 hours    -57%
Checkpoint frequency         500K claims   50K claims   10x
Pipeline overlap             None          Full         N/A
Batch integration tests      None          Mandatory    N/A

Discussion Questions

  1. Diane's emergency code change at 2:30 AM bypassed normal change management. Under what circumstances is this acceptable? What controls should exist for emergency batch changes?

  2. The pipeline model (streams flowing independently through all three days) introduces complexity — more jobs, more dependencies, more potential failure points. How would you manage this increased complexity?

  3. Ahmad Rashidi's compliance concerns add a dimension that pure technical analysis misses. How should regulatory deadlines be represented in the batch DAG? Should they be time dependencies, or something else?

  4. Pinnacle's adjudication engine couldn't handle a 3x increase in exception rates. What design principles would make batch programs more resilient to unexpected data characteristics?

  5. The "batch integration test" process adds time to data migration projects. Business stakeholders may push back. How would you justify the additional lead time?

  6. Compare CNB's crisis (Case Study 1) with Pinnacle's. Both involved volume growth that broke the batch window. What are the common patterns? What's different about a monthly cycle versus a nightly cycle?

  7. If Pinnacle's claims volume reaches 100M per month (double current), is the re-architected pipeline sufficient? What would need to change?