Case Study 2: Pinnacle Health's Claims Pipeline Checkpoint Coordination
Background
Pinnacle Health processes 2.8 million medical claims daily through a batch pipeline that runs every night between 10:00 PM and 4:00 AM. The pipeline feeds into the adjudication system, which determines payment amounts and generates remittance advice for healthcare providers. If the pipeline doesn't complete by 4:00 AM, the adjudication system cannot run, and provider payments are delayed by 24 hours.
Diane Torres is the senior systems architect who designed the original pipeline in 2011. Ahmad Patel joined the team in 2018 as the batch operations lead. Together, they faced a problem that tested the limits of checkpoint/restart design: coordinating checkpoints across a six-step pipeline where each step depended on the outputs of previous steps, and three different resource types (DB2, VSAM, sequential) were interleaved across steps.
The Pipeline
The Pinnacle claims pipeline consists of six steps in a single JCL job (PNCLMPRC):
| Step | Program | Input | Output | Duration |
|---|---|---|---|---|
| STEP010 | PNCEXTRT | DB2: CLAIMS_STAGING | SEQ: CLAIMS.EXTRACT | 25 min |
| STEP020 | PNCVALID | SEQ: CLAIMS.EXTRACT, VSAM: PROVIDER.MASTER | SEQ: VALID.CLAIMS, SEQ: REJECT.CLAIMS | 40 min |
| STEP030 | PNCENRCH | SEQ: VALID.CLAIMS, DB2: MEMBER_BENEFITS, VSAM: PROCEDURE.CODES | SEQ: ENRICHED.CLAIMS | 55 min |
| STEP040 | PNCPRICE | SEQ: ENRICHED.CLAIMS, DB2: FEE_SCHEDULE, VSAM: CONTRACT.TERMS | DB2: PRICED_CLAIMS, SEQ: PRICING.AUDIT | 70 min |
| STEP050 | PNCADJ | DB2: PRICED_CLAIMS, VSAM: ACCUMULATORS.MASTER | DB2: ADJUDICATED_CLAIMS, VSAM: ACCUMULATORS.MASTER (update) | 90 min |
| STEP060 | PNCREMT | DB2: ADJUDICATED_CLAIMS | SEQ: REMITTANCE.FILE, Report (SYSOUT) | 20 min |
Total pipeline time on a clean run: approximately 5 hours. The 6-hour batch window provided about 1 hour of slack.
The Problem
For its first twelve years, the pipeline had no checkpoint/restart logic in any step. When a step failed, the standard procedure was:
- Fix the problem (disk space, DB2 deadlock, data error, etc.)
- Delete all output datasets from the failed step and all subsequent steps
- Restart the job from the failed step
This worked adequately when failures were rare. But as claim volume grew from 1.8 million to 2.8 million daily, the pipeline's runtime stretched from 3.5 hours to 5 hours, and failures became more consequential.
The breaking point came on a Tuesday night in October 2023. STEP040 (PNCPRICE, the pricing step) failed at the 45-minute mark due to a DB2 deadlock with an ad-hoc query that a business analyst had running against the FEE_SCHEDULE table. The deadlock caused SQLCODE -911 and the step abended.
Ahmad assessed the situation:
- STEP010–STEP030 had completed successfully (2 hours elapsed).
- STEP040 had been running for 45 minutes when it failed.
- Remaining time in the batch window: 3 hours 15 minutes.
- STEP040 requires 70 minutes on a clean run.
- STEP050 requires 90 minutes.
- STEP060 requires 20 minutes.
- Total remaining: 70 + 90 + 20 = 180 minutes = 3 hours.
The math was tight. Ahmad restarted from STEP040. The restart required cleaning up STEP040's partial outputs: DB2 had already rolled back the partial PRICED_CLAIMS rows, but the partial PRICING.AUDIT sequential file had to be manually deleted and reallocated. STEP040 reran from record 1 — all 2.6 million valid claims. It finished in 72 minutes. STEP050 and STEP060 ran normally. The pipeline completed at 3:58 AM — two minutes before the deadline.
"Two minutes," Ahmad told Diane the next morning. "We made it by two minutes. If STEP040 had run three minutes slower — different night, heavier I/O load, anything — we'd have missed the window and delayed 14,000 provider payments."
Diane's response: "We're adding checkpoint/restart. All six steps. And we're going to coordinate them properly."
The Design Challenge
Adding checkpoint/restart to individual steps was straightforward — Diane had done it many times. The challenge was coordination across steps. The pipeline had several properties that made coordination complex:
Property 1: Cascading dependencies. Each step consumed the output of the previous step. If STEP040 restarted and regenerated its sequential output, STEP050 and STEP060 had to reprocess from scratch even though they hadn't started yet. But what if STEP050 had started (in a future parallel-step design) — how would it know that its input had changed?
Property 2: Mixed resource types. STEP050 (PNCADJ) both read from DB2 and updated a VSAM file (ACCUMULATORS.MASTER). The DB2 updates would be rolled back on failure, but the VSAM updates would not. The VSAM file contained year-to-date accumulators for each member's benefits — deductibles met, out-of-pocket maximums, visit counts. These numbers had to be exactly right. An incorrect accumulator could cause a $50,000 surgery to be adjudicated as "deductible not met" or, worse, as "fully covered" when the member had exhausted their benefits.
Property 3: The idempotency problem. Several steps were not naturally idempotent. STEP040 inserted rows into PRICED_CLAIMS. If the step was restarted and reprocessed records that had already been priced, it would create duplicate rows. STEP050 added amounts to VSAM accumulators. If a claim was processed twice, the accumulator would be incremented twice, giving an incorrect year-to-date total.
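Property 3 can be made concrete with a small sketch. This is an illustrative simulation (all names and amounts invented, not Pinnacle's code): a naive accumulator update double-counts on replay, while a guard that remembers applied claim IDs makes the replay harmless. (Pinnacle's eventual design, described below, instead reverses applied updates via a before-image log.)

```python
# Hypothetical sketch: why accumulator updates are not naturally idempotent,
# and one generic guard against replay after a restart.

def apply_claims_naive(accumulator, claims):
    """Blindly add each claim amount -- reprocessing double-counts."""
    for claim_id, amount in claims:
        accumulator += amount
    return accumulator

def apply_claims_guarded(accumulator, claims, applied_ids):
    """Skip claims already applied, so a restart replay is a no-op."""
    for claim_id, amount in claims:
        if claim_id not in applied_ids:
            accumulator += amount
            applied_ids.add(claim_id)
    return accumulator

claims = [("C1", 100.0), ("C2", 250.0)]

# Naive: running the same batch twice doubles the year-to-date total.
ytd = apply_claims_naive(0.0, claims)
ytd = apply_claims_naive(ytd, claims)   # restart replays everything
print(ytd)                              # 700.0, not 350.0

# Guarded: the replay changes nothing.
applied = set()
ytd2 = apply_claims_guarded(0.0, claims, applied)
ytd2 = apply_claims_guarded(ytd2, claims, applied)
print(ytd2)                             # 350.0
```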
Property 4: Cross-step state. STEP060 (PNCREMT) generated a remittance file with header records containing the total count and total dollar amount from STEP050. If STEP050 was restarted and the counts changed (e.g., because of timing differences in which records were committed before the failure), the remittance header would be incorrect.
Diane's Design
Diane spent two weeks designing the checkpoint/restart coordination. She established four principles:
Principle 1: Each Step Owns Its Restart Logic
Each program manages its own restart table row. The RESTART_CONTROL table uses the composite key (PROGRAM_NAME, JOB_NAME, STEP_NAME). No step reads another step's restart state.
Principle 2: The Pipeline Controller
Diane introduced a new artifact: the pipeline control table.
CREATE TABLE PIPELINE_CONTROL (
PIPELINE_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
RUN_DATE DATE NOT NULL,
STEP_NAME CHAR(8) NOT NULL,
STEP_SEQUENCE SMALLINT NOT NULL,
STEP_STATUS CHAR(1) NOT NULL,
START_TS TIMESTAMP,
END_TS TIMESTAMP,
RECORDS_IN INTEGER NOT NULL WITH DEFAULT 0,
RECORDS_OUT INTEGER NOT NULL WITH DEFAULT 0,
RESTART_COUNT SMALLINT NOT NULL WITH DEFAULT 0,
PRIMARY KEY (PIPELINE_NAME, JOB_NAME, RUN_DATE, STEP_NAME)
);
Every step, at startup, registers itself in the pipeline control table with STEP_STATUS = 'R' (running). At completion, it sets STEP_STATUS = 'C' (complete). On restart, it increments RESTART_COUNT.
Before a step begins, it checks the pipeline control table to verify that all predecessor steps have STEP_STATUS = 'C'. If a predecessor's RESTART_COUNT is higher than the value recorded when the current step first registered, the step knows its input may have changed and must reinitialize.
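The two checks above can be sketched in a few lines. This is an illustrative simulation with invented helper names — the real checks would be SQL queries against PIPELINE_CONTROL, not in-memory dictionaries:

```python
# Sketch of the pipeline-controller protocol: a step refuses to start until
# every predecessor is complete, and detects a predecessor restart by
# comparing RESTART_COUNT against the value seen at its own start.

def predecessors_ready(control_rows, my_sequence):
    """control_rows: {step_name: (step_sequence, step_status, restart_count)}."""
    return all(status == "C"
               for seq, status, _ in control_rows.values()
               if seq < my_sequence)

def input_changed(control_rows, my_sequence, counts_at_my_start):
    """True if any predecessor restarted since this step first registered,
    meaning this step's input datasets may have been regenerated."""
    return any(rc > counts_at_my_start[name]
               for name, (seq, _, rc) in control_rows.items()
               if seq < my_sequence)

rows = {"STEP030": (3, "C", 0), "STEP040": (4, "C", 1)}
print(predecessors_ready(rows, 5))                           # True: both complete
print(input_changed(rows, 5, {"STEP030": 0, "STEP040": 0}))  # True: STEP040 restarted
```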
Principle 3: VSAM Accumulator Protection
For STEP050's VSAM accumulator problem, Diane designed a before-image log:
CREATE TABLE ACCUM_BEFORE_IMAGE (
PROGRAM_NAME CHAR(8) NOT NULL,
JOB_NAME CHAR(8) NOT NULL,
RUN_DATE DATE NOT NULL,
MEMBER_ID CHAR(12) NOT NULL,
FIELD_NAME CHAR(20) NOT NULL,
BEFORE_VALUE DECIMAL(15,2) NOT NULL,
AFTER_VALUE DECIMAL(15,2) NOT NULL,
CHECKPOINT_NUM INTEGER NOT NULL,
PRIMARY KEY (PROGRAM_NAME, JOB_NAME, RUN_DATE,
MEMBER_ID, FIELD_NAME, CHECKPOINT_NUM)
);
Every time STEP050 updated a VSAM accumulator field, it first inserted a before-image row into this DB2 table and hardened that insert independently of the main unit of recovery, before applying the VSAM update. Committing the before-image ahead of the update guaranteed that the log row survived even when the main unit of recovery was later rolled back.
On restart:
- DB2 automatically rolled back all uncommitted PRICED_CLAIMS and ADJUDICATED_CLAIMS changes, returning the DB2 data to the last checkpoint.
- The VSAM accumulator updates made after the last checkpoint were not rolled back — VSAM is not coordinated with DB2 commit — but their before-images had already been hardened in ACCUM_BEFORE_IMAGE.
- A restart cleanup routine read the before-image rows with CHECKPOINT_NUM greater than the last committed checkpoint and reversed the corresponding VSAM updates, restoring the accumulators to their checkpoint-consistent state.
2300-RESTORE-VSAM-ACCUMULATORS.
EXEC SQL
DECLARE CSR-BEFORE-IMG CURSOR FOR
SELECT MEMBER_ID, FIELD_NAME, BEFORE_VALUE
FROM ACCUM_BEFORE_IMAGE
WHERE PROGRAM_NAME = :WS-PGM-NAME
AND JOB_NAME = :WS-JOB-NAME
AND RUN_DATE = :WS-RUN-DATE
AND CHECKPOINT_NUM > :WS-LAST-CHKPT-NUM
ORDER BY CHECKPOINT_NUM DESC,
MEMBER_ID, FIELD_NAME
END-EXEC
EXEC SQL OPEN CSR-BEFORE-IMG END-EXEC
PERFORM UNTIL SQLCODE = +100
EXEC SQL
FETCH CSR-BEFORE-IMG
INTO :WS-MEMBER-ID,
:WS-FIELD-NAME,
:WS-BEFORE-VALUE
END-EXEC
IF SQLCODE = 0
PERFORM 2310-REVERSE-VSAM-UPDATE
END-IF
END-PERFORM
EXEC SQL CLOSE CSR-BEFORE-IMG END-EXEC
DISPLAY 'VSAM ACCUMULATORS RESTORED TO '
'CHECKPOINT ' WS-LAST-CHKPT-NUM
.
This approach was more complex than the "delete orphans" strategy used for simple VSAM output files, because the accumulators could not simply be deleted — they had to be restored to their values as of the last checkpoint. The before-image log provided the information needed to do this precisely.
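The restore mechanics can be simulated in miniature. This sketch uses invented member and field names and assumes, as in the design above, that the logged before-images survive the failure; reversing the surviving entries newest-first returns the accumulator to its checkpoint value:

```python
# Minimal simulation of before-image reversal for a VSAM-style accumulator.

accumulators = {("M001", "DEDUCT_MET"): 500.00}
before_log = []   # stands in for ACCUM_BEFORE_IMAGE rows after the last checkpoint

def update_accumulator(key, new_value):
    before_log.append((key, accumulators[key]))  # log the before-image first
    accumulators[key] = new_value

# Two post-checkpoint updates to the same field, then a simulated abend.
update_accumulator(("M001", "DEDUCT_MET"), 650.00)
update_accumulator(("M001", "DEDUCT_MET"), 800.00)

# Restart cleanup: replay before-images in strict reverse insert order.
for key, before_value in reversed(before_log):
    accumulators[key] = before_value

print(accumulators[("M001", "DEDUCT_MET")])   # 500.0 -- back to checkpoint state
```

Note that the reverse-order replay matters: applying the two before-images in insert order would leave the field at 650.00 instead of 500.00.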
Principle 4: Sequential File Regeneration with Verification
For sequential output files (CLAIMS.EXTRACT, VALID.CLAIMS, ENRICHED.CLAIMS, PRICING.AUDIT, REMITTANCE.FILE), Diane used the regeneration strategy with an added verification step:
- On restart, the sequential output is deleted and the step rewrites it from committed data.
- After the step completes (whether on first run or after restart), a verification step compares the record count and hash total of the sequential output against the values stored in the pipeline control table.
- If the counts don't match, the step fails with a clear diagnostic.
This caught a subtle bug during testing: the PNCENRCH program, on restart, was not correctly skipping past enrichment records that had already been written. The verification step detected that the output had 2,587,342 records when the pipeline control table recorded 2,584,919 from the previous step's output. The 2,423-record discrepancy led to the discovery of a boundary condition in the restart key handling for enrichment records that spanned multiple procedure codes.
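The verification step can be sketched as follows. This is an illustrative version (function and field names invented): the hash total here is the sum of claim amounts, as in Ahmad's baseline captures, compared against the values the pipeline control table recorded for the predecessor's output:

```python
# Sketch of Principle 4's verification: fail loudly if a sequential output's
# record count or hash total disagrees with the pipeline control table.

def verify_output(records, expected_count, expected_hash_total):
    actual_count = len(records)
    actual_hash = round(sum(amount for _, amount in records), 2)
    if actual_count != expected_count:
        raise RuntimeError(f"record count {actual_count} != control {expected_count}")
    if actual_hash != expected_hash_total:
        raise RuntimeError(f"hash total {actual_hash} != control {expected_hash_total}")
    return True

claims_out = [("C1", 120.50), ("C2", 79.50)]
print(verify_output(claims_out, 2, 200.00))   # True on a clean run

try:
    # A bad restart that duplicated or dropped records trips the check.
    verify_output(claims_out, 3, 200.00)
except RuntimeError as err:
    print(err)
```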
The Commit Frequency Decision
Each step required a different commit frequency based on its processing characteristics:
| Step | Program | Commit Freq | Rationale |
|---|---|---|---|
| STEP010 | PNCEXTRT | 10,000 | Read-only from DB2, no lock concerns |
| STEP020 | PNCVALID | 5,000 | Read-only from VSAM, checkpoint for restart positioning |
| STEP030 | PNCENRCH | 5,000 | Multiple DB2 reads per record, moderate complexity |
| STEP040 | PNCPRICE | 3,000 | DB2 inserts (PRICED_CLAIMS), concurrent read potential from ad-hoc queries |
| STEP050 | PNCADJ | 2,000 | DB2 updates + VSAM updates, before-image logging, highest complexity |
| STEP060 | PNCREMT | 10,000 | Read-only from DB2, sequential output only |
STEP050's lower commit frequency (2,000) reflected its higher per-record complexity: each claim required reading PRICED_CLAIMS, calculating deductibles and copays against VSAM accumulators, updating the accumulators, inserting ADJUDICATED_CLAIMS, and writing a before-image log record. The 2,000-record commit interval balanced commit overhead against lock duration (accumulators were also accessed by the real-time claims portal during daytime hours, though not during the batch window) and restart cost (more frequent checkpoints meant less VSAM restoration work on restart).
STEP040 used 3,000 because the original failure was caused by a deadlock against FEE_SCHEDULE. Shorter commit intervals reduced the window for deadlocks with any concurrent access to the pricing tables.
Implementation and Testing
Ahmad led the testing effort. He created a test harness that automated the restart validation:
Test Dataset: A scaled-down version of production data — 280,000 claims (10% of production volume) — loaded into a test DB2 subsystem and VSAM files.
Baseline Run: The pipeline ran cleanly end-to-end. Ahmad captured:
- Record counts at every step boundary
- Hash totals (sum of claim amounts) at every step boundary
- Final ADJUDICATED_CLAIMS table contents
- Final VSAM accumulator values for all test members
- Final REMITTANCE.FILE contents
Failure Injection: Ahmad tested failure at multiple points in each step:
| Test | Failure Point | Recovery Time | Data Verification |
|---|---|---|---|
| T01 | STEP010, record 50,000 | 2 min | Pass |
| T02 | STEP010, record 250,000 | 1 min | Pass |
| T03 | STEP020, record 100,000 | 4 min | Pass |
| T04 | STEP020, record 250,000 | 3 min | Pass |
| T05 | STEP030, record 75,000 | 5 min | Pass |
| T06 | STEP030, record 200,000 | 3 min | Pass |
| T07 | STEP040, record 50,000 | 6 min | Pass |
| T08 | STEP040, record 200,000 | 4 min | Pass |
| T09 | STEP050, record 30,000 | 8 min | Pass — VSAM accumulators verified |
| T10 | STEP050, record 150,000 | 5 min | Pass — VSAM accumulators verified |
| T11 | STEP050, record 250,000 | 3 min | FAIL — accumulator mismatch |
| T12 | STEP060, record 100,000 | 1 min | Pass |
Test T11 failure analysis: STEP050 failed at record 250,000 (near the end of processing). On restart, the before-image restoration routine reversed the VSAM accumulator updates logged since the last checkpoint. One member — who had 47 claims in the test dataset — had two updates to the same accumulator field within that final checkpoint interval. The restoration routine processed rows in reverse checkpoint order (highest checkpoint number first), but within a checkpoint the order was only by MEMBER_ID and FIELD_NAME, so the two updates were reversed in the wrong order, leaving the accumulator off by $127.43.
Fix: Ahmad modified the before-image restoration to process records in strict reverse order of insert sequence (using an additional SEQUENCE_NUM column added to the ACCUM_BEFORE_IMAGE table), rather than relying on CHECKPOINT_NUM + MEMBER_ID + FIELD_NAME ordering.
The fix was verified by re-running T11 and two additional edge case tests (multiple updates to the same accumulator within the same checkpoint). All passed.
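The T11 bug is easy to reproduce in miniature. This reconstruction uses invented values chosen to match the reported $127.43 discrepancy: two before-images for the same field in one checkpoint interval are ambiguous when ordered only by MEMBER_ID and FIELD_NAME, while a strict reverse SEQUENCE_NUM replay recovers the true pre-interval value:

```python
# (seq_num, member, field, before_value):
# update 1 saw 500.00 before it ran, update 2 saw 627.43.
log = [(1, "M001", "OOP_MAX", 500.00),
       (2, "M001", "OOP_MAX", 627.43)]

def restore(entries):
    """Replay before-images in the given order; the last write wins."""
    acc = {}
    for _, member, field, before in entries:
        acc[(member, field)] = before
    return acc[("M001", "OOP_MAX")]

buggy   = restore(sorted(log, key=lambda e: (e[1], e[2])))        # ties keep insert order
correct = restore(sorted(log, key=lambda e: e[0], reverse=True))  # reverse SEQUENCE_NUM

print(buggy)    # 627.43 -- off by 127.43
print(correct)  # 500.0  -- true checkpoint value
```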
Production Results
The redesigned pipeline went into production on November 15, 2023. Results through March 2026:
| Metric | Before Redesign | After Redesign |
|---|---|---|
| Pipeline failures requiring restart | 11 per year | 12 per year (slightly higher claim volume) |
| Average recovery time | 47 minutes | 7 minutes |
| Maximum recovery time | 2h 15m (STEP050 failure) | 19 minutes |
| SLA misses due to pipeline failures | 3 (in 29 months) | 0 (in 29 months) |
| Provider payment delays | 3 incidents | 0 incidents |
| Batch window utilization (avg) | 83% | 85% (slight increase due to checkpoint overhead) |
The two-percentage-point increase in average batch window utilization was due to checkpoint overhead — the additional COMMIT operations, restart table updates, and before-image logging. Diane considered this an acceptable tradeoff for the elimination of SLA misses.
The most dramatic test of the new system came on February 8, 2025, when a DB2 buffer pool shortage caused STEP050 to abend at the 62-minute mark (out of its 90-minute typical runtime). Before the redesign, this would have required rerunning STEP050 from scratch — 90 minutes — plus the 20-minute STEP060, totaling 110 minutes. With checkpoint/restart, STEP050 recovered in 11 minutes, and the pipeline completed with 48 minutes of batch window remaining.
Key Technical Insights
Insight 1: VSAM Before-Image Logging Is Expensive but Necessary
The before-image log added approximately 8% overhead to STEP050's runtime — an additional INSERT per VSAM update, plus the restoration processing on restart. Diane considered alternatives:
- Snapshot the VSAM file before the run: Requires copying a 4 GB VSAM file nightly. Time: ~15 minutes. Simpler but wastes storage and I/O bandwidth.
- Use CICS recoverable files: Pinnacle's batch system doesn't run under CICS. Would require infrastructure changes.
- Accept the risk of incorrect accumulators on restart: Unacceptable. An incorrect accumulator could cause a claim to adjudicate incorrectly, potentially resulting in a member being billed $50,000 for a procedure that should be covered.
The before-image approach was the only option that provided record-level accuracy without infrastructure changes.
Insight 2: Pipeline-Level Coordination Adds Overhead but Prevents Silent Data Corruption
The pipeline control table and cross-step verification added complexity. A simpler design — each step independently checkpointing with no awareness of other steps — would have worked most of the time. But Diane identified a scenario where it would fail silently:
If STEP040 was restarted and processed claims in a slightly different order (due to DB2 access path differences under different system load), the PRICED_CLAIMS table could have the same records but in a different physical order. STEP050, reading the same table, would produce the same adjudication results — except for edge cases involving claims that interacted (e.g., a member with two claims where the second claim's copay depends on whether the first claim applied to the deductible). Without the pipeline control table's verification, this inconsistency would go undetected.
Insight 3: Commit Frequency Must Account for the Most Complex Step
Diane initially set STEP050's commit frequency to 5,000, matching the other steps. Testing revealed that at 5,000 records per interval, the VSAM before-image processing generated so many rows that restoration took 15 minutes on a failure near the end of the step. Reducing the interval to 2,000 cut restoration time to under 6 minutes, at the cost of 840 additional COMMITs per run (5,000 meant 560 commits; 2,000 meant 1,400). The additional commit overhead was 2 minutes — a worthwhile tradeoff for 9 minutes of faster recovery.
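The arithmetic behind Insight 3 follows the simple model the text implies — commits scale with volume divided by interval, while worst-case restoration work scales with the interval itself:

```python
# Commit counts for STEP050 at the two candidate intervals,
# using the production volume of 2.8 million claims.
volume = 2_800_000

for interval in (5_000, 2_000):
    commits = volume // interval
    print(interval, commits)   # 5000 560, then 2000 1400

extra_commits = volume // 2_000 - volume // 5_000
print(extra_commits)           # 840 additional COMMITs per run
```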
Discussion Questions
- Diane chose a before-image log in DB2 to protect VSAM accumulators. An alternative is to store VSAM before-images in a separate VSAM file. Compare these approaches in terms of atomicity guarantees, performance, complexity, and storage requirements.
- Ahmad's test T11 revealed a bug in the before-image restoration ordering. This bug would only manifest when a member had multiple claims processed within the same commit interval — a relatively rare occurrence. How would you design a test dataset to maximize the likelihood of finding such edge cases? What characteristics should the test data have?
- The pipeline control table adds cross-step awareness to what would otherwise be independent step-level checkpoint/restart. Is this additional complexity justified? Under what circumstances would you omit it and rely purely on step-level independence?
- STEP040's commit frequency was reduced to 3,000 specifically because the original failure was a deadlock. Is it appropriate to set commit frequency based on past failure modes? What other factors should influence the decision?
- Diane's design accepts an 8% runtime overhead for VSAM before-image logging. At what overhead percentage would you consider the approach too expensive? What alternative strategies would you explore at that threshold?
- The pipeline processes 2.8 million claims in a 6-hour window. Claim volume is growing at approximately 12% per year. At what volume does the current checkpoint/restart design become insufficient? What architectural changes would you recommend at that point?