Case Study 2: Pinnacle Health's Claims Pipeline Checkpoint Coordination

Background

Pinnacle Health processes 2.8 million medical claims daily through a batch pipeline that runs every night between 10:00 PM and 4:00 AM. The pipeline feeds into the adjudication system, which determines payment amounts and generates remittance advice for healthcare providers. If the pipeline doesn't complete by 4:00 AM, the adjudication system cannot run, and provider payments are delayed by 24 hours.

Diane Torres is the senior systems architect who designed the original pipeline in 2011. Ahmad Patel joined the team in 2018 as the batch operations lead. Together, they faced a problem that tested the limits of checkpoint/restart design: coordinating checkpoints across a six-step pipeline where each step depended on the outputs of previous steps, and three different resource types (DB2, VSAM, sequential) were interleaved across steps.

The Pipeline

The Pinnacle claims pipeline consists of six steps in a single JCL job (PNCLMPRC):

| Step    | Program  | Input                                                          | Output                                                      | Duration |
|---------|----------|----------------------------------------------------------------|-------------------------------------------------------------|----------|
| STEP010 | PNCEXTRT | DB2: CLAIMS_STAGING                                            | SEQ: CLAIMS.EXTRACT                                         | 25 min   |
| STEP020 | PNCVALID | SEQ: CLAIMS.EXTRACT; VSAM: PROVIDER.MASTER                     | SEQ: VALID.CLAIMS; SEQ: REJECT.CLAIMS                       | 40 min   |
| STEP030 | PNCENRCH | SEQ: VALID.CLAIMS; DB2: MEMBER_BENEFITS; VSAM: PROCEDURE.CODES | SEQ: ENRICHED.CLAIMS                                        | 55 min   |
| STEP040 | PNCPRICE | SEQ: ENRICHED.CLAIMS; DB2: FEE_SCHEDULE; VSAM: CONTRACT.TERMS  | DB2: PRICED_CLAIMS; SEQ: PRICING.AUDIT                      | 70 min   |
| STEP050 | PNCADJ   | DB2: PRICED_CLAIMS; VSAM: ACCUMULATORS.MASTER                  | DB2: ADJUDICATED_CLAIMS; VSAM: ACCUMULATORS.MASTER (update) | 90 min   |
| STEP060 | PNCREMT  | DB2: ADJUDICATED_CLAIMS                                        | SEQ: REMITTANCE.FILE; Report (SYSOUT)                       | 20 min   |

Total pipeline time on a clean run: approximately 5 hours. The 6-hour batch window provided about 1 hour of slack.

The Problem

For the first seven years, the pipeline had no checkpoint/restart logic in any step. When a step failed, the standard procedure was:

  1. Fix the problem (disk space, DB2 deadlock, data error, etc.)
  2. Delete all output datasets from the failed step and all subsequent steps
  3. Restart the job from the failed step

This worked adequately when failures were rare. But as claim volume grew from 1.8 million to 2.8 million daily, the pipeline's runtime stretched from 3.5 hours to 5 hours, and failures became more consequential.

The breaking point came on a Tuesday night in October 2023. STEP040 (PNCPRICE, the pricing step) failed at the 45-minute mark due to a DB2 deadlock with an ad-hoc query that a business analyst had running against the FEE_SCHEDULE table. The deadlock caused SQLCODE -911 and the step abended.

Ahmad assessed the situation:

  • STEP010–STEP030 had completed successfully (2 hours elapsed).
  • STEP040 had been running for 45 minutes when it failed.
  • Remaining time in the batch window: 3 hours 15 minutes.
  • STEP040 requires 70 minutes on a clean run.
  • STEP050 requires 90 minutes.
  • STEP060 requires 20 minutes.
  • Total remaining: 70 + 90 + 20 = 180 minutes = 3 hours.

The math was tight. Ahmad restarted from STEP040. The restart required confirming that the partial PRICED_CLAIMS rows were gone (DB2 had rolled them back automatically) and manually deleting and reallocating the partial PRICING.AUDIT sequential file. STEP040 reran from record 1 — all 2.6 million valid claims. It finished in 72 minutes. STEP050 and STEP060 ran normally. The pipeline completed at 3:58 AM — two minutes before the deadline.

"Two minutes," Ahmad told Diane the next morning. "We made it by two minutes. If STEP040 had run three minutes slower — different night, heavier I/O load, anything — we'd have missed the window and delayed 14,000 provider payments."

Diane's response: "We're adding checkpoint/restart. All six steps. And we're going to coordinate them properly."

The Design Challenge

Adding checkpoint/restart to individual steps was straightforward — Diane had done it many times. The challenge was coordination across steps. The pipeline had several properties that made coordination complex:

Property 1: Cascading dependencies. Each step consumed the output of the previous step. If STEP040 restarted and regenerated its sequential output, STEP050 and STEP060 had to reprocess from scratch even though they hadn't started yet. But what if STEP050 had started (in a future parallel-step design) — how would it know that its input had changed?

Property 2: Mixed resource types. STEP050 (PNCADJ) both read from DB2 and updated a VSAM file (ACCUMULATORS.MASTER). The DB2 updates would be rolled back on failure, but the VSAM updates would not. The VSAM file contained year-to-date accumulators for each member's benefits — deductibles met, out-of-pocket maximums, visit counts. These numbers had to be exactly right. An incorrect accumulator could cause a $50,000 surgery to be adjudicated as "deductible not met" or, worse, as "fully covered" when the member had exhausted their benefits.

Property 3: The idempotency problem. Several steps were not naturally idempotent. STEP040 inserted rows into PRICED_CLAIMS. If the step was restarted and reprocessed records that had already been priced, it would create duplicate rows. STEP050 added amounts to VSAM accumulators. If a claim was processed twice, the accumulator would be incremented twice, giving an incorrect year-to-date total.
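
The accumulator half of this property fits in a few lines. A hypothetical Python sketch (the accumulator dict and claim amounts are illustrative, not from the case) of why reprocessing a claim double-counts, and why restoring an absolute before-image first makes the rerun safe:

```python
# Hypothetical illustration of Property 3: accumulator updates are not
# idempotent. A claim's amount is *added* to a year-to-date accumulator,
# so reprocessing the same claim after a restart increments it twice.

def apply_claim(accumulators, member_id, amount):
    """Add a claim amount to the member's year-to-date accumulator."""
    accumulators[member_id] = accumulators.get(member_id, 0.0) + amount

accum = {"M001": 500.00}

apply_claim(accum, "M001", 125.00)   # first (correct) processing
apply_claim(accum, "M001", 125.00)   # naive reprocessing after a restart
print(accum["M001"])                 # 750.0 -- double-counted; should be 625.0

# Restoring from an absolute before-image, then reprocessing once, is safe:
before_image = 500.00                # value captured before the update
accum["M001"] = before_image         # reverse to the pre-update state
apply_claim(accum, "M001", 125.00)   # reprocess exactly once
print(accum["M001"])                 # 625.0
```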

Property 4: Cross-step state. STEP060 (PNCREMT) generated a remittance file with header records containing the total count and total dollar amount from STEP050. If STEP050 was restarted and the counts changed (e.g., because of timing differences in which records were committed before the failure), the remittance header would be incorrect.

Diane's Design

Diane spent two weeks designing the checkpoint/restart coordination. She established four principles:

Principle 1: Each Step Owns Its Restart Logic

Each program manages its own restart table row. The RESTART_CONTROL table uses the composite key (PROGRAM_NAME, JOB_NAME, STEP_NAME). No step reads another step's restart state.

Principle 2: The Pipeline Controller

Diane introduced a new artifact: the pipeline control table.

CREATE TABLE PIPELINE_CONTROL (
    PIPELINE_NAME   CHAR(8)       NOT NULL,
    JOB_NAME        CHAR(8)       NOT NULL,
    RUN_DATE        DATE          NOT NULL,
    STEP_NAME       CHAR(8)       NOT NULL,
    STEP_SEQUENCE   SMALLINT      NOT NULL,
    STEP_STATUS     CHAR(1)       NOT NULL,
    START_TS        TIMESTAMP,
    END_TS          TIMESTAMP,
    RECORDS_IN      INTEGER       NOT NULL WITH DEFAULT 0,
    RECORDS_OUT     INTEGER       NOT NULL WITH DEFAULT 0,
    RESTART_COUNT   SMALLINT      NOT NULL WITH DEFAULT 0,
    PRIMARY KEY (PIPELINE_NAME, JOB_NAME, RUN_DATE, STEP_NAME)
);

Every step, at startup, registers itself in the pipeline control table with STEP_STATUS = 'R' (running). At completion, it sets STEP_STATUS = 'C' (complete). On restart, it increments RESTART_COUNT.

Before a step begins, it checks the pipeline control table to verify that all predecessor steps have STEP_STATUS = 'C'. If a predecessor has been restarted (RESTART_COUNT > what was recorded when the current step started), the current step knows its input may have changed and must reinitialize.
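
The registration-and-check protocol can be sketched outside DB2. A minimal simulation (Python, with an in-memory SQLite table standing in for PIPELINE_CONTROL; the helper function names and sample rows are hypothetical, and the PIPELINE_NAME/JOB_NAME filters are omitted for brevity):

```python
# Sketch of the Principle 2 checks: a step may start only when every
# predecessor is complete, and must reinitialize if a predecessor's
# RESTART_COUNT has moved past the value it previously recorded.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PIPELINE_CONTROL (
        PIPELINE_NAME TEXT, JOB_NAME TEXT, RUN_DATE TEXT,
        STEP_NAME TEXT, STEP_SEQUENCE INTEGER,
        STEP_STATUS TEXT, RESTART_COUNT INTEGER,
        PRIMARY KEY (PIPELINE_NAME, JOB_NAME, RUN_DATE, STEP_NAME))""")

rows = [("CLAIMS", "PNCLMPRC", "2023-11-15", name, seq, status, rc)
        for name, seq, status, rc in [
            ("STEP010", 10, "C", 0), ("STEP020", 20, "C", 0),
            ("STEP030", 30, "C", 0), ("STEP040", 40, "C", 1),  # restarted once
            ("STEP050", 50, "R", 0)]]
conn.executemany("INSERT INTO PIPELINE_CONTROL VALUES (?,?,?,?,?,?,?)", rows)

def predecessors_complete(conn, run_date, step_sequence):
    """True only if every lower-sequence step has STEP_STATUS = 'C'."""
    (pending,) = conn.execute(
        """SELECT COUNT(*) FROM PIPELINE_CONTROL
           WHERE RUN_DATE = ? AND STEP_SEQUENCE < ? AND STEP_STATUS <> 'C'""",
        (run_date, step_sequence)).fetchone()
    return pending == 0

def input_changed(conn, run_date, pred_step, restart_count_seen):
    """True if the predecessor restarted since our step recorded its count."""
    (rc,) = conn.execute(
        """SELECT RESTART_COUNT FROM PIPELINE_CONTROL
           WHERE RUN_DATE = ? AND STEP_NAME = ?""",
        (run_date, pred_step)).fetchone()
    return rc > restart_count_seen

print(predecessors_complete(conn, "2023-11-15", 50))    # True
print(input_changed(conn, "2023-11-15", "STEP040", 0))  # True -> reinitialize
```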

Principle 3: VSAM Accumulator Protection

For STEP050's VSAM accumulator problem, Diane designed a before-image log:

CREATE TABLE ACCUM_BEFORE_IMAGE (
    PROGRAM_NAME    CHAR(8)       NOT NULL,
    JOB_NAME        CHAR(8)       NOT NULL,
    RUN_DATE        DATE          NOT NULL,
    MEMBER_ID       CHAR(12)      NOT NULL,
    FIELD_NAME      CHAR(20)      NOT NULL,
    BEFORE_VALUE    DECIMAL(15,2) NOT NULL,
    AFTER_VALUE     DECIMAL(15,2) NOT NULL,
    CHECKPOINT_NUM  INTEGER       NOT NULL,
    PRIMARY KEY (PROGRAM_NAME, JOB_NAME, RUN_DATE,
                 MEMBER_ID, FIELD_NAME, CHECKPOINT_NUM)
);

Every time STEP050 updated a VSAM accumulator field, it first inserted a before-image row into this DB2 table. The insert was part of the same DB2 unit of recovery as the other DB2 updates, and the physical VSAM writes for a checkpoint interval were applied only after that interval's COMMIT had hardened its before-images — so a committed before-image row always existed before the corresponding accumulator changed on disk.

On restart:

  1. DB2 automatically rolled back all uncommitted ADJUDICATED_CLAIMS changes and the before-image inserts from the interval in flight at the failure.
  2. The ACCUM_BEFORE_IMAGE rows committed at or before the last checkpoint survived — including the rows for any interval whose VSAM writes had not been confirmed complete in the step's RESTART_CONTROL row.
  3. The rows with CHECKPOINT_NUM greater than the last confirmed checkpoint identified exactly the VSAM accumulator updates that might have been partially applied when the step failed.
  4. A restart cleanup routine read these before-images and reversed those VSAM accumulator updates, restoring the accumulators to their checkpoint-consistent state before processing resumed.
       2300-RESTORE-VSAM-ACCUMULATORS.
           EXEC SQL
             DECLARE CSR-BEFORE-IMG CURSOR FOR
             SELECT MEMBER_ID, FIELD_NAME, BEFORE_VALUE
             FROM   ACCUM_BEFORE_IMAGE
             WHERE  PROGRAM_NAME = :WS-PGM-NAME
               AND  JOB_NAME     = :WS-JOB-NAME
               AND  RUN_DATE     = :WS-RUN-DATE
               AND  CHECKPOINT_NUM > :WS-LAST-CHKPT-NUM
             ORDER BY CHECKPOINT_NUM DESC,
                      MEMBER_ID, FIELD_NAME
           END-EXEC

           EXEC SQL OPEN CSR-BEFORE-IMG END-EXEC

           PERFORM UNTIL SQLCODE NOT = 0
             EXEC SQL
               FETCH CSR-BEFORE-IMG
               INTO :WS-MEMBER-ID,
                    :WS-FIELD-NAME,
                    :WS-BEFORE-VALUE
             END-EXEC
             IF SQLCODE = 0
               PERFORM 2310-REVERSE-VSAM-UPDATE
             END-IF
           END-PERFORM

           EXEC SQL CLOSE CSR-BEFORE-IMG END-EXEC

           DISPLAY 'VSAM ACCUMULATORS RESTORED TO '
                   'CHECKPOINT ' WS-LAST-CHKPT-NUM
           .

This approach was more complex than the "delete orphans" strategy used for simple VSAM output files, because the accumulators could not simply be deleted — each one had to be restored to its exact pre-update value. The before-image log provided the information needed to do this precisely.
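
Stripped of DB2 and VSAM specifics, the mechanism can be sketched in a few lines. A hypothetical Python model — a dict standing in for ACCUMULATORS.MASTER, a list for ACCUM_BEFORE_IMAGE, values invented — showing how restoring absolute before-values rewinds the accumulators to the last confirmed checkpoint (the sketch reverses in strict reverse insert order, a simplification of the cursor ordering shown above):

```python
# Before-image logging and restoration, in miniature. Every accumulator
# update records its BEFORE value first; restoration rewinds every update
# whose checkpoint was never confirmed as fully applied.

accumulators = {("M001", "DEDUCTIBLE_YTD"): 500.00,
                ("M002", "DEDUCTIBLE_YTD"): 0.00}
before_image_log = []   # stands in for the ACCUM_BEFORE_IMAGE table

def update_accumulator(member_id, field, delta, checkpoint_num):
    key = (member_id, field)
    before = accumulators[key]
    before_image_log.append(                 # log the image ahead of the write
        {"member": member_id, "field": field, "before": before,
         "after": before + delta, "checkpoint": checkpoint_num})
    accumulators[key] = before + delta

def restore_to_checkpoint(last_confirmed_checkpoint):
    """Reverse, newest first, every update past the confirmed checkpoint."""
    for row in reversed(before_image_log):
        if row["checkpoint"] > last_confirmed_checkpoint:
            accumulators[(row["member"], row["field"])] = row["before"]

update_accumulator("M001", "DEDUCTIBLE_YTD", 100.00, checkpoint_num=1)  # confirmed
update_accumulator("M001", "DEDUCTIBLE_YTD", 50.00, checkpoint_num=2)   # in flight
update_accumulator("M002", "DEDUCTIBLE_YTD", 75.00, checkpoint_num=2)   # in flight

restore_to_checkpoint(last_confirmed_checkpoint=1)
print(accumulators[("M001", "DEDUCTIBLE_YTD")])   # 600.0 -- checkpoint-1 state
print(accumulators[("M002", "DEDUCTIBLE_YTD")])   # 0.0
```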

Principle 4: Sequential File Regeneration with Verification

For sequential output files (CLAIMS.EXTRACT, VALID.CLAIMS, ENRICHED.CLAIMS, PRICING.AUDIT, REMITTANCE.FILE), Diane used the regeneration strategy with an added verification step:

  • On restart, the sequential output is deleted and the step rewrites it from committed data.
  • After the step completes (whether on first run or after restart), a verification step compares the record count and hash total of the sequential output against the values stored in the pipeline control table.
  • If the counts don't match, the step fails with a clear diagnostic.

This caught a subtle bug during testing: the PNCENRCH program, on restart, was not correctly skipping past enrichment records that had already been written. The verification step detected that the output had 2,587,342 records when the pipeline control table recorded 2,584,919 from the previous step's output. The 2,423-record discrepancy led to the discovery of a boundary condition in the restart key handling for enrichment records that spanned multiple procedure codes.
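
The count-and-hash verification amounts to a simple comparison. A sketch (Python; the record layout, sample values, and function name are illustrative) of the kind of check that caught the PNCENRCH discrepancy:

```python
# Principle 4 verification: after a step (re)writes a sequential output,
# compare its record count and hash total (sum of claim amounts) against
# the values recorded in the pipeline control table, and fail loudly on
# any mismatch instead of letting bad data flow downstream.
from decimal import Decimal

def verify_output(records, expected_count, expected_hash_total):
    actual_count = len(records)
    actual_hash = sum((amount for _, amount in records), Decimal("0.00"))
    if actual_count != expected_count:
        raise RuntimeError(
            f"RECORD COUNT MISMATCH: expected {expected_count}, got {actual_count}")
    if actual_hash != expected_hash_total:
        raise RuntimeError(
            f"HASH TOTAL MISMATCH: expected {expected_hash_total}, got {actual_hash}")
    return True

claims = [("C0001", Decimal("120.50")), ("C0002", Decimal("89.99"))]
print(verify_output(claims, 2, Decimal("210.49")))   # True

try:  # a duplicated record, as in the PNCENRCH restart bug, trips the check
    verify_output(claims + [("C0002", Decimal("89.99"))], 2, Decimal("210.49"))
except RuntimeError as exc:
    print(exc)
```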

The Commit Frequency Decision

Each step required a different commit frequency based on its processing characteristics:

| Step    | Program  | Commit Freq | Rationale |
|---------|----------|-------------|-----------|
| STEP010 | PNCEXTRT | 10,000 | Read-only from DB2, no lock concerns |
| STEP020 | PNCVALID | 5,000  | Read-only from VSAM, checkpoint for restart positioning |
| STEP030 | PNCENRCH | 5,000  | Multiple DB2 reads per record, moderate complexity |
| STEP040 | PNCPRICE | 3,000  | DB2 inserts (PRICED_CLAIMS), concurrent read potential from ad-hoc queries |
| STEP050 | PNCADJ   | 2,000  | DB2 updates + VSAM updates, before-image logging, highest complexity |
| STEP060 | PNCREMT  | 10,000 | Read-only from DB2, sequential output only |

STEP050's lower commit frequency (2,000) reflected its higher per-record complexity: each claim required reading PRICED_CLAIMS, calculating deductibles and copays against VSAM accumulators, updating the accumulators, inserting ADJUDICATED_CLAIMS, and writing a before-image log record. The 2,000-record commit interval balanced lock duration (accumulators were also accessed by the real-time claims portal during daytime hours, though not during the batch window) against restart complexity (more checkpoints meant less VSAM restoration work on restart).

STEP040 used 3,000 because the original failure was caused by a deadlock against FEE_SCHEDULE. Shorter commit intervals reduced the window for deadlocks with any concurrent access to the pricing tables.

Implementation and Testing

Ahmad led the testing effort. He created a test harness that automated the restart validation:

Test Dataset: A scaled-down version of production data — 280,000 claims (10% of production volume) — loaded into a test DB2 subsystem and VSAM files.

Baseline Run: The pipeline ran cleanly end-to-end. Ahmad captured:

  • Record counts at every step boundary
  • Hash totals (sum of claim amounts) at every step boundary
  • Final ADJUDICATED_CLAIMS table contents
  • Final VSAM accumulator values for all test members
  • Final REMITTANCE.FILE contents

Failure Injection: Ahmad tested failure at multiple points in each step:

| Test | Failure Point            | Recovery Time | Data Verification |
|------|--------------------------|---------------|-------------------|
| T01  | STEP010, record 50,000   | 2 min | Pass |
| T02  | STEP010, record 250,000  | 1 min | Pass |
| T03  | STEP020, record 100,000  | 4 min | Pass |
| T04  | STEP020, record 250,000  | 3 min | Pass |
| T05  | STEP030, record 75,000   | 5 min | Pass |
| T06  | STEP030, record 200,000  | 3 min | Pass |
| T07  | STEP040, record 50,000   | 6 min | Pass |
| T08  | STEP040, record 200,000  | 4 min | Pass |
| T09  | STEP050, record 30,000   | 8 min | Pass — VSAM accumulators verified |
| T10  | STEP050, record 150,000  | 5 min | Pass — VSAM accumulators verified |
| T11  | STEP050, record 250,000  | 3 min | FAIL — accumulator mismatch |
| T12  | STEP060, record 100,000  | 1 min | Pass |

Test T11 failure analysis: STEP050 failed at record 250,000 (near the end of processing), while the VSAM writes for the just-committed checkpoint interval were still being applied. On restart, the before-image restoration routine reversed the rows for that unconfirmed interval. One member had 47 accumulator updates within it (the member had 47 claims in the test dataset, adjacent in the member-ordered input). The restoration routine processed rows in reverse checkpoint order (highest checkpoint number first), but within the same checkpoint, the order was by MEMBER_ID and FIELD_NAME. For this specific member, two updates to the same accumulator field occurred within the same checkpoint interval, and the restoration reversed them in the wrong order, leaving the accumulator off by $127.43.

Fix: Ahmad modified the before-image restoration to process records in strict reverse order of insert sequence (using an additional SEQUENCE_NUM column added to the ACCUM_BEFORE_IMAGE table), rather than relying on CHECKPOINT_NUM + MEMBER_ID + FIELD_NAME ordering.

The fix was verified by re-running T11 and two additional edge case tests (multiple updates to the same accumulator within the same checkpoint). All passed.
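
The T11 bug is easy to reproduce in miniature. A hypothetical two-update example (dollar values invented, but chosen so the error matches the $127.43 gap) showing why reversal must follow strict reverse insert sequence rather than key order, which cannot distinguish two updates to the same member and field:

```python
# Two before-images for the SAME member and field, inside one checkpoint
# interval, in insert order. Key order (MEMBER_ID, FIELD_NAME) ties, so
# the rows may come back in either order -- and the wrong order loses money.
log = [
    {"seq": 1, "before": 1000.00, "after": 1127.43},   # first claim's update
    {"seq": 2, "before": 1127.43, "after": 1254.86},   # second claim's update
]

def reverse_updates(rows):
    """Apply each BEFORE value in turn; the final assignment wins."""
    value = None
    for row in rows:
        value = row["before"]
    return value

# Correct: strict reverse insert sequence lands on the oldest before-value.
print(reverse_updates(sorted(log, key=lambda r: r["seq"], reverse=True)))  # 1000.0

# Buggy: if tied keys come back in insert order, the reversal ends on the
# NEWER before-value -- off by exactly 127.43.
print(reverse_updates(sorted(log, key=lambda r: r["seq"])))  # 1127.43
```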

Production Results

The redesigned pipeline went into production on November 15, 2023. Results through March 2026:

| Metric | Before Redesign | After Redesign |
|--------|-----------------|----------------|
| Pipeline failures requiring restart | 11 per year | 12 per year (slightly higher claim volume) |
| Average recovery time | 47 minutes | 7 minutes |
| Maximum recovery time | 2h 15m (STEP050 failure) | 19 minutes |
| SLA misses due to pipeline failures | 3 (in 29 months) | 0 (in 29 months) |
| Provider payment delays | 3 incidents | 0 incidents |
| Batch window utilization (avg) | 83% | 85% (slight increase due to checkpoint overhead) |

The two-percentage-point increase in average batch window utilization was due to checkpoint overhead — the additional COMMIT operations, restart table updates, and before-image logging. Diane considered this an acceptable tradeoff for the elimination of SLA misses.

The most dramatic test of the new system came on February 8, 2025, when a DB2 buffer pool shortage caused STEP050 to abend at the 62-minute mark (out of its 90-minute typical runtime). Before the redesign, this would have required rerunning STEP050 from scratch — 90 minutes — plus the 20-minute STEP060, totaling 110 minutes. With checkpoint/restart, STEP050 recovered in 11 minutes, and the pipeline completed with 48 minutes of batch window remaining.

Key Technical Insights

Insight 1: VSAM Before-Image Logging Is Expensive but Necessary

The before-image log added approximately 8% overhead to STEP050's runtime — an additional INSERT per VSAM update, plus the restoration processing on restart. Diane considered alternatives:

  • Snapshot the VSAM file before the run: Requires copying a 4 GB VSAM file nightly. Time: ~15 minutes. Simpler but wastes storage and I/O bandwidth.
  • Use CICS recoverable files: Pinnacle's batch system doesn't run under CICS. Would require infrastructure changes.
  • Accept the risk of incorrect accumulators on restart: Unacceptable. An incorrect accumulator could cause a claim to adjudicate incorrectly, potentially resulting in a member being billed $50,000 for a procedure that should be covered.

The before-image approach was the only option that provided record-level accuracy without infrastructure changes.

Insight 2: Pipeline-Level Coordination Adds Overhead but Prevents Silent Data Corruption

The pipeline control table and cross-step verification added complexity. A simpler design — each step independently checkpointing with no awareness of other steps — would have worked most of the time. But Diane identified a scenario where it would fail silently:

If STEP040 was restarted and processed claims in a slightly different order (due to DB2 access path differences under different system load), the PRICED_CLAIMS table could have the same records but in a different physical order. STEP050, reading the same table, would produce the same adjudication results — except for edge cases involving claims that interacted (e.g., a member with two claims where the second claim's copay depends on whether the first claim applied to the deductible). Without the pipeline control table's verification, this inconsistency would go undetected.

Insight 3: Commit Frequency Must Account for the Most Complex Step

Diane initially set STEP050's commit frequency to 5,000, matching the other steps. Testing revealed that the VSAM before-image processing at 5,000 records generated so many before-image rows per interval that restoration took 15 minutes on failure near the end of the step. Reducing to 2,000 cut restoration time to under 6 minutes, at the cost of 840 additional COMMITs (a 5,000-record interval meant 560 commits; 2,000 meant 1,400). The additional commit overhead was 2 minutes — a worthwhile tradeoff for 9 minutes of faster recovery.
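
The commit-count side of this tradeoff can be checked directly (assuming, as the rest of the case study does, roughly 2.8 million records flowing through STEP050):

```python
# Commit-count arithmetic from Insight 3. Integer division; a final short
# interval would add at most one more COMMIT at each frequency.
records = 2_800_000

commits_at_5000 = records // 5_000
commits_at_2000 = records // 2_000

print(commits_at_5000)                    # 560
print(commits_at_2000)                    # 1400
print(commits_at_2000 - commits_at_5000)  # 840 additional COMMITs
```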

Discussion Questions

  1. Diane chose a before-image log in DB2 to protect VSAM accumulators. An alternative is to store VSAM before-images in a separate VSAM file. Compare these approaches in terms of atomicity guarantees, performance, complexity, and storage requirements.

  2. Ahmad's test T11 revealed a bug in the before-image restoration ordering. This bug would only manifest when a member had multiple claims processed within the same commit interval — a relatively rare occurrence. How would you design a test dataset to maximize the likelihood of finding such edge cases? What characteristics should the test data have?

  3. The pipeline control table adds cross-step awareness to what would otherwise be independent step-level checkpoint/restart. Is this additional complexity justified? Under what circumstances would you omit it and rely purely on step-level independence?

  4. STEP040's commit frequency was reduced to 3,000 specifically because the original failure was a deadlock. Is it appropriate to set commit frequency based on past failure modes? What other factors should influence the decision?

  5. Diane's design accepts an 8% runtime overhead for VSAM before-image logging. At what overhead percentage would you consider the approach too expensive? What alternative strategies would you explore at that threshold?

  6. The pipeline processes 2.8 million claims in a 6-hour window. Claim volume is growing at approximately 12% per year. At what volume does the current checkpoint/restart design become insufficient? What architectural changes would you recommend at that point?