Case Study 2: Pinnacle Health's Automated Claims Pipeline Recovery
Background
Pinnacle Health Systems processes 340,000 medical claims per night through a batch pipeline that feeds adjudication, payment, and regulatory reporting systems. The pipeline has a hard deadline: adjudicated claims must be available to the provider portal by 07:00 Eastern, and payment files must be transmitted to the clearinghouse by 08:00. Missing the 08:00 deadline triggers a regulatory reporting requirement and exposes Pinnacle to per-claim penalties under their provider contracts.
Diane Chen, Pinnacle's batch operations manager, and Ahmad Patel, the senior systems programmer, had spent three years building what they considered a robust batch environment. Job scheduling was tight, checkpoints were frequent, and the overnight team was experienced. But in September, they experienced what Ahmad calls "the cascade" — a single failure that propagated through the entire pipeline and nearly caused a regulatory breach.
The Cascade
At 01:22am, the claims intake job (CLMINTKE) abended with an E37 — primary allocation failure for the sorted work file. The volume designated for temporary datasets was fragmented, and DFSMS couldn't find a contiguous extent large enough for the primary allocation.
The overnight operator restarted the job. It failed again — same E37, same volume. The operator increased the SPACE parameter and restarted. It failed a third time because the volume simply didn't have enough free space.
By now it was 02:15am. The operator called Ahmad. Ahmad identified the root cause: a reorganization job that ran earlier in the evening had created large temporary datasets on the same volume and hadn't cleaned them up (the cleanup step had been bypassed due to a JCL error introduced in a change three days earlier). Ahmad manually deleted the orphaned datasets, and CLMINTKE ran successfully starting at 02:35am.
But the cascade had begun. CLMINTKE's 73-minute delay pushed its completion to 03:48am (it normally completes at 02:35). This delayed CLMVALID (claims validation), which normally starts at 02:40 and runs for 50 minutes. CLMVALID didn't start until 03:52. Then CLMADJUD (adjudication) — the longest job in the pipeline at 120 minutes — didn't start until 04:45, which meant it wouldn't complete until 06:45 at best.
The payment file generation (CLMPAYMT) and provider portal refresh (CLMPORTAL) follow adjudication. With a combined 65 minutes of processing, the earliest possible completion was 07:50 — well past the 07:00 portal deadline and dangerously close to the 08:00 payment deadline.
At 03:30am, Ahmad made the call: invoke the emergency parallel processing procedure. Pinnacle's claims pipeline was designed with a fallback mode that splits the claims file into four segments and processes them concurrently across four initiators. The fallback takes more system resources but reduces adjudication time from 120 minutes to approximately 45 minutes.
The parallel processing completed at 05:32am. Payment files transmitted at 06:47am. Portal refreshed at 06:58am — two minutes before the deadline.
Total incident duration: 5 hours 36 minutes. Direct cost: $12,000 in overtime and emergency processing charges. Indirect cost: Ahmad and Diane spent the next two days on the post-incident review, the regulatory near-miss report, and the remediation plan.
The Root Cause Analysis
Diane led the post-incident review with a rigorous Five Whys analysis:
- Why did CLMINTKE fail? E37 — insufficient contiguous space on the designated volume.
- Why was there insufficient space? Orphaned temporary datasets from the earlier reorganization job consumed the available space.
- Why were the temporary datasets orphaned? The cleanup step in the reorganization job was bypassed due to a JCL error.
- Why was there a JCL error? A change made three days earlier to add a new parameter inadvertently altered the COND parameter on the cleanup step, causing it to be skipped when the preceding step returned RC=4 (which it always does — RC=4 is the normal completion code for that step).
- Why wasn't the JCL error caught? The change was tested in the development environment where the preceding step returns RC=0 (different data), so the conditional execution path that skips the cleanup was never exercised. There was no pre-production validation of JCL conditional logic.
Six layers of defense had gaps:
| Layer | Gap Identified | Remediation |
|---|---|---|
| Change management | JCL change not tested with production-equivalent data | Require production-equivalent test data for JCL changes |
| Code review | COND parameter change not flagged in review | Add JCL conditional logic to code review checklist |
| Monitoring | No monitoring of temporary dataset cleanup | Add space monitoring for temp volumes |
| Pre-flight | No space availability check before CLMINTKE | Implement pre-flight space validation |
| Recovery | No automated restart for space abends | Implement auto-restart with space reclamation |
| Fallback | Emergency parallel processing was manual | Automate the fallback decision and invocation |
The Redesign
Ahmad and Diane spent the next eight weeks implementing a comprehensive automated recovery framework for the claims pipeline. The design had four pillars.
Pillar 1: Predictive Space Management
Ahmad wrote a COBOL program (SPACEPRD) that runs every 30 minutes during the batch window and performs three functions:
- Volume scanning: Checks space utilization on all batch-critical volumes. If any volume exceeds 85%, it triggers SMS space reclamation and sends a Tier 2 alert.
- Orphan detection: Scans for temporary datasets older than 4 hours that should have been deleted by their creating job. If found, generates a Tier 2 alert with the owning job name and creation time.
- Predictive calculation: Based on the remaining batch workload (from the scheduler) and historical space utilization patterns (from SMF type 42 data), estimates whether any volume will reach critical capacity before the batch window ends. If the predicted peak exceeds 90%, triggers a Tier 1 alert.
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SPACEPRD.
      *================================================================*
      * PREDICTIVE SPACE MANAGEMENT                                    *
      * Monitors batch-critical volumes and predicts space exhaustion. *
      *================================================================*
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-VOLUME-TABLE.
           05  WS-VOL-COUNT         PIC S9(4)    COMP VALUE 0.
           05  WS-VOL-ENTRY         OCCURS 50 TIMES
                                    INDEXED BY VOL-IDX.
               10  WS-VOL-SER       PIC X(6).
               10  WS-VOL-TOTAL     PIC S9(9)    COMP.
               10  WS-VOL-FREE      PIC S9(9)    COMP.
               10  WS-VOL-PCT-USED  PIC S9(3)V99 COMP.
               10  WS-VOL-TREND     PIC S9(3)V99 COMP.
               10  WS-VOL-EST-PEAK  PIC S9(3)V99 COMP.
               10  WS-VOL-STATUS    PIC X(1).
                   88  VOL-GREEN    VALUE 'G'.
                   88  VOL-YELLOW   VALUE 'Y'.
                   88  VOL-RED      VALUE 'R'.
       01  WS-REMAINING-WORKLOAD.
           05  WS-REM-JOB-COUNT     PIC S9(4)    COMP.
           05  WS-REM-EST-SPACE     PIC S9(9)    COMP.
           05  WS-REM-EST-TEMP      PIC S9(9)    COMP.
       01  WS-ALERT-INFO.
           05  WS-ALERT-TIER        PIC 9.
           05  WS-ALERT-MSG         PIC X(200).
           05  WS-ALERT-COUNT       PIC S9(4)    COMP VALUE 0.
      *    Edited (DISPLAY-usage) copies of the percentages, because
      *    STRING operands cannot be binary COMP fields.
       01  WS-EDITED-FIELDS.
           05  WS-PCT-EDITED        PIC ZZ9.99.
           05  WS-PEAK-EDITED       PIC ZZ9.99.
       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-SCAN-VOLUMES
           PERFORM 2000-CHECK-ORPHANS
           PERFORM 3000-PREDICT-USAGE
           PERFORM 4000-GENERATE-ALERTS
           STOP RUN.

       1000-SCAN-VOLUMES.
           PERFORM VARYING VOL-IDX FROM 1 BY 1
                   UNTIL VOL-IDX > WS-VOL-COUNT
      *        Query LSPACE for each volume, then calculate the
      *        utilization percentage
               COMPUTE WS-VOL-PCT-USED(VOL-IDX) =
                   (WS-VOL-TOTAL(VOL-IDX) - WS-VOL-FREE(VOL-IDX))
                   * 100 / WS-VOL-TOTAL(VOL-IDX)
               EVALUATE TRUE
                   WHEN WS-VOL-PCT-USED(VOL-IDX) > 90
                       SET VOL-RED(VOL-IDX) TO TRUE
                   WHEN WS-VOL-PCT-USED(VOL-IDX) > 85
                       SET VOL-YELLOW(VOL-IDX) TO TRUE
                   WHEN OTHER
                       SET VOL-GREEN(VOL-IDX) TO TRUE
               END-EVALUATE
           END-PERFORM.

       2000-CHECK-ORPHANS.
      *    Scan the catalog for temp datasets on batch volumes,
      *    compare creation time against the 4-hour threshold, and
      *    flag any orphaned datasets for alert and potential
      *    automated cleanup.
           CONTINUE.

       3000-PREDICT-USAGE.
      *    Load the remaining workload from the scheduler, estimate
      *    space requirements from historical data, and project
      *    peak utilization for each volume.
           PERFORM VARYING VOL-IDX FROM 1 BY 1
                   UNTIL VOL-IDX > WS-VOL-COUNT
               COMPUTE WS-VOL-EST-PEAK(VOL-IDX) =
                   WS-VOL-PCT-USED(VOL-IDX) +
                   (WS-REM-EST-TEMP * 100 / WS-VOL-TOTAL(VOL-IDX))
               IF WS-VOL-EST-PEAK(VOL-IDX) > 90
                   SET VOL-RED(VOL-IDX) TO TRUE
               END-IF
           END-PERFORM.

       4000-GENERATE-ALERTS.
           PERFORM VARYING VOL-IDX FROM 1 BY 1
                   UNTIL VOL-IDX > WS-VOL-COUNT
               MOVE WS-VOL-PCT-USED(VOL-IDX) TO WS-PCT-EDITED
               MOVE WS-VOL-EST-PEAK(VOL-IDX) TO WS-PEAK-EDITED
               EVALUATE TRUE
                   WHEN VOL-RED(VOL-IDX)
                       MOVE 1 TO WS-ALERT-TIER
                       MOVE SPACES TO WS-ALERT-MSG
                       STRING 'SPACEPRD TIER1: Volume '
                              WS-VOL-SER(VOL-IDX)
                              ' at ' WS-PCT-EDITED
                              '% (predicted peak '
                              WS-PEAK-EDITED '%)'
                              DELIMITED BY SIZE
                              INTO WS-ALERT-MSG
                       PERFORM 9000-SEND-ALERT
                   WHEN VOL-YELLOW(VOL-IDX)
                       MOVE 2 TO WS-ALERT-TIER
                       MOVE SPACES TO WS-ALERT-MSG
                       STRING 'SPACEPRD TIER2: Volume '
                              WS-VOL-SER(VOL-IDX)
                              ' at ' WS-PCT-EDITED
                              '% - approaching critical'
                              DELIMITED BY SIZE
                              INTO WS-ALERT-MSG
                       PERFORM 9000-SEND-ALERT
               END-EVALUATE
           END-PERFORM.

       9000-SEND-ALERT.
           ADD 1 TO WS-ALERT-COUNT
      *    Write the alert to the notification queue; the transport
      *    depends on the site's notification infrastructure.
           DISPLAY WS-ALERT-MSG.
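The orphan-detection logic that 2000-CHECK-ORPHANS stubs out can be sketched in language-neutral terms. This is an illustrative Python sketch, not part of SPACEPRD: the tuple layout and dataset names are invented, and in practice the input would come from a catalog scan of the batch volumes; only the 4-hour threshold comes from the text.

```python
from datetime import datetime, timedelta

ORPHAN_AGE = timedelta(hours=4)  # threshold from the text

def find_orphans(temp_datasets, now):
    """temp_datasets: (dsname, owning_job, created) tuples, as a
    catalog scan of the batch volumes might return them. Returns the
    entries older than the orphan threshold, for alerting with the
    owning job name and creation time."""
    return [(dsn, job, created)
            for dsn, job, created in temp_datasets
            if now - created > ORPHAN_AGE]

# Hypothetical example: one dataset created 6 hours ago, one fresh
now = datetime(2024, 9, 15, 1, 0)
datasets = [
    ("SYS1.TEMP.REORG1", "DBREORG", now - timedelta(hours=6)),
    ("SYS1.TEMP.SORT01", "CLMINTKE", now - timedelta(minutes=30)),
]
print(find_orphans(datasets, now))  # only the 6-hour-old dataset
```

A real implementation would also carry the volume serial so the alert ties back to the space scan.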
Pillar 2: Automated Recovery with Intelligent Escalation
Ahmad designed a three-level automated recovery framework:
Level 1 — Simple Restart (Automatic, No Human Involvement):
- Eligible abend codes: E37, B37, S878, U0100-U0149
- Maximum attempts: 2
- Delay: 3 minutes (to allow SMS space reclamation or storage pressure to decrease)
- If successful: log the event, send Tier 3 informational alert, continue batch stream
- If unsuccessful after 2 attempts: escalate to Level 2

Level 2 — Intelligent Recovery (Automatic, With Notification):
- Triggered when Level 1 fails or for certain abend codes (D37, S0C7 with known data patterns)
- Actions depend on the specific job and failure:
  - For space abends: attempt to reallocate on an alternate volume, compress the target volume, or reroute to the overflow storage group
  - For data exceptions in CLMINTKE/CLMVALID: invoke the data cleansing utility to quarantine bad records, then restart with clean data (bad records routed to the exception queue)
  - For performance degradation: invoke parallel processing mode (see Pillar 3)
- Send Tier 2 alert with the automated action taken
- If successful: continue batch stream, flag exception records for morning review
- If unsuccessful: escalate to Level 3

Level 3 — Human Intervention (Manual, With Full Context):
- Triggered when Level 2 fails or for unrecoverable errors (S0C4, S806, S913)
- The alert includes: complete job history, all automated actions attempted, current batch window status, critical path impact assessment, and the relevant playbook link
- The goal is to give the human responder maximum context so they can make an informed decision quickly
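The escalation flow of the three levels can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions: the abend-code sets come from the text, but the callables (`restart`, `level2_action`, `alert`) stand in for framework actions, and the 3-minute restart delay is elided.

```python
# Level 1 eligibility and unrecoverable codes, as listed in the text
LEVEL1_ABENDS = {"E37", "B37", "S878"} | {f"U{n:04d}" for n in range(100, 150)}
UNRECOVERABLE = {"S0C4", "S806", "S913"}  # straight to Level 3

def recover(abend, restart, level2_action, alert, max_attempts=2):
    """Return the level at which the failure was resolved (1 or 2),
    or 3 if human intervention is required. restart/level2_action
    return True on success; alert(tier, msg) sends a notification."""
    if abend in UNRECOVERABLE:
        alert(1, f"{abend}: unrecoverable, human intervention required")
        return 3
    if abend in LEVEL1_ABENDS:
        for _ in range(max_attempts):
            if restart():  # real framework waits 3 minutes first
                alert(3, f"{abend}: auto-restart succeeded")
                return 1
    # Level 2: intelligent recovery (reallocate, cleanse, parallelize)
    if level2_action():
        alert(2, f"{abend}: intelligent recovery succeeded")
        return 2
    alert(1, f"{abend}: all automated recovery failed")
    return 3

msgs = []
level = recover("E37", lambda: True, lambda: True,
                lambda tier, msg: msgs.append((tier, msg)))
print(level)  # → 1
```

Note that codes outside the Level 1 set (such as D37) skip straight to the Level 2 actions, matching the trigger list above.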
Pillar 3: Adaptive Parallel Processing
The emergency parallel processing that Ahmad invoked manually during "the cascade" was formalized and automated. The claims pipeline now includes a decision point after each major job:
AFTER EACH CRITICAL JOB:
Calculate remaining_time = SLA_deadline - current_time
Calculate required_time = sum(baseline of remaining jobs)
Calculate buffer = remaining_time - required_time
IF buffer < 30 minutes THEN
INVOKE parallel processing mode
SEND Tier 2 alert: "Pipeline accelerated due to
SLA risk. Buffer = {buffer} min"
ELSE IF buffer < 60 minutes THEN
SEND Tier 3 alert: "Pipeline buffer below 60 min.
Monitoring closely."
ENDIF
Parallel processing mode splits the claims file at a natural boundary (provider group code) and runs four adjudication instances concurrently. The four output streams are merged before payment file generation. The split/merge logic was already in production for disaster recovery; Ahmad simply exposed it as a configurable runtime option that the automation framework can invoke.
The critical insight: the decision to invoke parallel processing is based on projected SLA compliance, not on any single job's performance. A job might be running 40% over baseline, but if the overall pipeline still has adequate buffer, there's no need to incur the overhead and complexity of parallel processing. Conversely, even if every job runs at baseline, an accumulation of small delays (each individually within threshold) might erode the buffer to the point where acceleration is prudent.
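The decision rule in the pseudocode above can be made concrete as a runnable sketch. The 30- and 60-minute thresholds come from the text; the timestamps and baseline runtimes in the example are taken from the cascade narrative (04:45 start, 120-minute adjudication, 65 minutes of downstream processing against the 08:00 deadline), with times expressed as minutes since midnight for simplicity.

```python
def pipeline_action(deadline_min, now_min, remaining_baselines):
    """Decide, after each critical job, whether to accelerate.
    remaining_baselines lists baseline runtimes (minutes) of the
    jobs still to run; returns the action and the current buffer."""
    remaining_time = deadline_min - now_min
    required_time = sum(remaining_baselines)
    buffer = remaining_time - required_time
    if buffer < 30:
        return "INVOKE_PARALLEL", buffer   # Tier 2 alert in practice
    if buffer < 60:
        return "TIER3_ALERT", buffer       # monitoring closely
    return "CONTINUE", buffer

# The cascade night, at the 04:45 adjudication start (285 min), with
# adjudication (120) and payment+portal (65) still to run before the
# 08:00 deadline (480 min): buffer is only 10 minutes.
print(pipeline_action(480, 285, [120, 65]))  # → ('INVOKE_PARALLEL', 10)
```

This captures the section's point: the trigger is projected SLA compliance for the whole pipeline, not any single job's deviation from baseline.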
Pillar 4: Continuous Validation
Every stage of the claims pipeline now includes validation checks that verify data integrity before and after processing:
Pre-Processing Validation (before each major job):
- Input file record count within tolerance band
- Control totals match trailer record
- Sample validation of key fields (provider ID format, claim amount range, date validity)
- Output space availability confirmed
- All required reference files present and current

Post-Processing Validation (after each major job):
- Output record count within expected range of input
- Output control totals reconcile with input (claims in = claims adjudicated + claims rejected + claims pended)
- No orphaned records (every input record accounted for in output)
- Output file characteristics match expected format

Cross-Job Validation (at pipeline milestones):
- Running totals from CLMINTKE through CLMPAYMT must reconcile
- Claim count at each stage must be traceable to the original intake count
- Financial totals must balance: payments + adjustments + denials = total claims received
If any validation fails, the pipeline enters a controlled hold state. The current job completes (no mid-stream abort), the output is preserved for analysis, and the downstream jobs are held. The validation failure alert includes the specific discrepancy and the relevant reconciliation data.
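The core post-processing reconciliation (claims in = claims adjudicated + claims rejected + claims pended) reduces to a simple accounting check. This illustrative Python sketch uses invented counts; the `tolerance` parameter is an assumption standing in for the tolerance band the text mentions for record counts.

```python
def post_process_check(input_count, adjudicated, rejected, pended,
                       tolerance=0):
    """Post-processing reconciliation: every input claim must land in
    exactly one output disposition. Returns (ok, discrepancy) so a
    failure alert can include the specific shortfall or surplus."""
    output_count = adjudicated + rejected + pended
    discrepancy = input_count - output_count
    return abs(discrepancy) <= tolerance, discrepancy

# Hypothetical balanced night: 340,000 claims fully accounted for
ok, diff = post_process_check(340_000, 331_200, 6_300, 2_500)
print(ok, diff)  # → True 0
```

A nonzero discrepancy is exactly the condition that puts the pipeline into the controlled hold state described above, with the discrepancy carried into the validation failure alert.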
Results After One Year
| Metric | Before Redesign | After Redesign | Improvement |
|---|---|---|---|
| Pipeline failures/month | 6.2 | 1.8 | -71% |
| Failures requiring human intervention | 6.2 | 0.6 | -90% |
| Average MTTR (all incidents) | 73 min | 11 min | -85% |
| SLA violations (08:00 payment deadline) | 2.1/quarter | 0.25/quarter | -88% |
| Regulatory near-misses | 3/year | 0/year | -100% |
| Data integrity exceptions | 12/month | 14/month | +17%* |
| Ahmad's 3am phone calls | 4.5/month | 0.3/month | -93% |
*The increase in data integrity exceptions is a feature, not a bug. The continuous validation catches data quality issues that previously went undetected and propagated to downstream systems. The exceptions are now quarantined and resolved during business hours instead of causing processing failures at 3am.
The Ongoing Challenge
Diane identifies two remaining challenges:
1. Complexity Management. The automated recovery framework is itself a complex system. It has its own failure modes. During a system maintenance window, the SPACEPRD monitoring program abended because the catalog it was scanning was being reorganized. The monitoring system needed its own monitoring — a recursive problem that Diane calls "the watchman problem." Ahmad's solution: a separate, minimal heartbeat process that verifies the monitoring infrastructure is running. If the heartbeat fails, a console message alerts the operator directly, bypassing the automated notification system entirely.
2. Institutional Inertia. The claims processing application team has grown accustomed to the automated recovery catching their errors. In the first six months after deployment, the number of JCL errors and data quality issues actually increased slightly, because the application team knew the automation would handle them. Diane addressed this by publishing monthly "exception reports" that attributed each automated recovery to its root cause and the responsible team. "The automation is a safety net, not a license to be sloppy," she told the application team leads. The exception rates returned to baseline within two quarters.
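Ahmad's answer to the watchman problem is deliberately minimal, and that minimalism is the point: the heartbeat check must be simple enough to need no monitoring of its own. As an illustrative Python sketch (the 120-second threshold and the file-timestamp mechanism are assumptions, not details from the text):

```python
HEARTBEAT_MAX_AGE = 120  # seconds without a heartbeat before alerting

def check_heartbeat(last_beat_ts, now_ts, console_alert):
    """Watchman for the watchman: if the monitoring process has not
    written a heartbeat recently, alert the operator's console
    directly, bypassing the automated notification system."""
    age = now_ts - last_beat_ts
    if age > HEARTBEAT_MAX_AGE:
        console_alert(f"MONITOR DOWN: last heartbeat {age:.0f}s ago")
        return False
    return True
```

The console alert path intentionally shares nothing with the notification infrastructure it is watching, so a failure in that infrastructure cannot silence the alarm.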
Discussion Questions
1. The cascade originated from a JCL change three days before the incident. Why wasn't the problem detected sooner? Design a monitoring rule that would have caught the orphaned temporary datasets within hours of their creation.
2. Ahmad's parallel processing fallback reduces adjudication time from 120 minutes to 45 minutes but consumes four times the initiator resources. In a shared LPAR environment, what impact might this have on other workloads? How would you mitigate that impact?
3. Diane's "watchman problem" — monitoring the monitoring system — is a fundamental challenge in complex systems. What are the limits of recursive monitoring? At what point do you accept that the lowest layer must be simple enough to be reliable without monitoring?
4. The data integrity exceptions increased by 17% after implementing continuous validation. This means 17% more data quality issues were being detected that previously went unnoticed. What does this imply about the claims data quality in the period before validation was implemented? How would you assess the downstream impact of those previously-undetected issues?
5. Compare Rob's approach at CNB (Section 27.1 and Case Study 1) with Ahmad's approach at Pinnacle Health. Both shops improved dramatically, but they emphasized different aspects. What environmental factors (industry, regulatory pressure, team size, technical skill) might explain the different emphases?