Case Study 2: Pinnacle Health's Batch/Online Lock Contention
Background
Pinnacle Health Systems processes approximately 50 million medical claims per year across a DB2 for z/OS environment running on a z15. The system operates 24/7 with no batch window — claims submission is continuous (hospitals, pharmacies, and physicians submit claims around the clock), and batch processing for adjudication, payment calculation, and regulatory reporting runs throughout the day.
The claims processing team:
- Diane Kowalski, Claims Systems Architect, 18 years at Pinnacle
- Ahmad Rashidi, Senior DBA, 11 years at Pinnacle
The core tables:
- CLAIMS — 800 million rows, partitioned by submission month (24 partitions, 2 years of history)
- MEMBER — 45 million rows, partitioned by state code (50 partitions)
- PROVIDER — 2 million rows, not partitioned
- PAYMENT — 600 million rows, partitioned by payment date (24 partitions)
The Problem
Pinnacle's online claims submission system had been experiencing intermittent performance degradation for months. The symptoms:
- Claims submission response times averaged 340ms normally but spiked to 2-5 seconds three to four times daily
- Each spike lasted 15-25 minutes
- Spikes correlated roughly with batch job schedules but not exactly
- No deadlocks — only timeouts (SQLCODE -913)
- Timeouts affected approximately 0.5% of claims during spike periods
Diane had been tracking the pattern for three months. The spikes were not causing SLA violations yet, but the trend was worsening as claim volume grew. She estimated they had six months before the spikes would start causing contractual penalties with hospital partners.
Ahmad's monitoring data showed:
| Metric | Normal (no batch) | Spike (batch + online) |
|---|---|---|
| Lock suspensions/min | 120 | 8,400 |
| Lock timeouts/min | 0 | 45 |
| Lock escalations | 0 | 2-4 |
| Avg online response | 340ms | 2,100ms |
Investigation
Phase 1: Identifying the Batch Programs
Ahmad correlated the spike times with the batch schedule:
| Batch Job | Schedule | Table | Operation | Duration |
|---|---|---|---|---|
| CLMADJ01 | 06:00, 12:00, 18:00, 00:00 | CLAIMS | Update (adjudication) | 45-60 min |
| CLMPAY01 | 08:00, 14:00, 20:00, 02:00 | CLAIMS, PAYMENT | Read Claims, Insert Payment | 30-40 min |
| CLMRPT01 | 07:00 daily | CLAIMS, MEMBER, PROVIDER | Read-only reporting | 90 min |
The spikes aligned with CLMADJ01 runs. This batch program was the adjudication engine — it read unadjudicated claims and applied pricing rules, updating each claim's status and payment amount.
Phase 2: Analyzing CLMADJ01's Locking Behavior
Diane pulled the CLMADJ01 source code. Key characteristics:
EXEC SQL
DECLARE CSR-CLAIMS CURSOR FOR
SELECT CLAIM_ID, MEMBER_ID, PROVIDER_ID,
CLAIM_AMT, CLAIM_STATUS, SUBMIT_DATE
FROM CLAIMS
WHERE CLAIM_STATUS = 'SUBMITTED'
AND SUBMIT_DATE >= :WS-CUTOFF-DATE
ORDER BY CLAIM_ID
FOR UPDATE OF CLAIM_STATUS, ADJ_AMT, ADJ_DATE
END-EXEC
Properties of CLMADJ01:
- Cursor declared FOR UPDATE OF (correct — prevents conversion deadlocks)
- Processes claims in CLAIM_ID order (correct — consistent ordering)
- Commit frequency: every 10,000 claims (problem)
- No WITH HOLD on cursor (problem — cursor closes on COMMIT, requires expensive repositioning)
- Bound with ISOLATION(RS) (problem — overkill for this access pattern)
Phase 3: Quantifying the Lock Impact
Ahmad ran the numbers:
- CLAIMS tablespace: LOCKSIZE ROW, LOCKMAX 8,000
- CLMADJ01 processes ~120,000 claims per run
- Commit frequency: 10,000 claims
- With RS isolation and FOR UPDATE: acquires U lock on each row fetched, converts to X on update
- Between commits: holds up to 10,000 X locks (updated rows) + U lock on current fetch position
Lock escalation was occurring. 10,000 X locks exceeded LOCKMAX (8,000). Every commit interval triggered escalation, converting 8,000+ row locks to a partition-level X lock.
During escalation, the partition held by CLMADJ01 was exclusively locked. Online claims submissions to that partition waited. After RESOURCE TIMEOUT (30 seconds), they timed out with -913.
But there was a subtlety. The escalation didn't lock the entire tablespace — only the partition being processed (Pinnacle used partition-level locking). Claims submitted to other months' partitions were unaffected. The spike occurred because roughly 60% of claims were for the current month, and CLMADJ01 was processing the current month's partition.
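The arithmetic Ahmad worked through can be sketched in a few lines of Python (a simplified model for illustration, not Pinnacle's code: it assumes the worst case, in which every row in a commit interval is updated and holds an X lock until the COMMIT, and it ignores the U lock on the current fetch position):

```python
# Simplified model of the lock escalation check.
LOCKMAX = 8_000  # from the CLAIMS tablespace definition

def peak_row_locks(commit_interval: int) -> int:
    """X locks held just before a COMMIT (worst case: one per updated row)."""
    return commit_interval

def will_escalate(commit_interval: int) -> bool:
    """Escalation fires once the lock count for the object exceeds LOCKMAX."""
    return peak_row_locks(commit_interval) > LOCKMAX

print(will_escalate(10_000))  # True: the original design escalated every interval
print(will_escalate(500))     # False: what the later redesign achieved
```

Escalation converts those row locks into a single partition-level X lock, which is exactly the event that stalled online submitters hitting the current month's partition.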
Phase 4: The RS Isolation Surprise
Diane dug deeper into why RS was specified. She found a code comment from 2019:
*---------------------------------------------------------
* USE RS ISOLATION TO PREVENT PHANTOM READS
* DURING ADJUDICATION. WE NEED A STABLE SET OF
* CLAIMS TO PROCESS.
*---------------------------------------------------------
The original developer was worried about new claims being inserted while adjudication was running. Under CS, a new claim matching the WHERE clause could theoretically be inserted and then missed by the cursor (if the insert happened on a page the cursor had already passed).
Diane's analysis found two flaws in this reasoning. First, RS does not actually prevent phantoms: it guarantees only that rows already read remain stable until commit, while blocking phantom inserts requires RR. Second, even if it had worked, the concern was irrelevant in practice. CLMADJ01 ran four times daily, so any claim missed in one run would be picked up in the next run, six hours later. The adjudication SLA was 24 hours. Using RS to prevent a six-hour delay was causing timeouts that violated the 5-second submission SLA.
The Redesign
Diane and Ahmad designed a comprehensive fix with five changes:
Change 1: Reduce Commit Frequency
From 10,000 to 500 rows. This kept the maximum lock count well below LOCKMAX (8,000).
* BEFORE
01 WS-COMMIT-FREQ PIC S9(8) COMP VALUE 10000.
* AFTER
01 WS-COMMIT-FREQ PIC S9(8) COMP VALUE 500.
Impact analysis: COMMIT overhead increased. Diane benchmarked: elapsed time went from 48 minutes to 51 minutes per run. Acceptable.
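As a sanity check on that benchmark, the case study's figures imply a per-commit cost (a derived back-of-envelope estimate, not a measurement; the variable names are illustrative):

```python
# Back-of-envelope check of Diane's benchmark numbers.
claims_per_run = 120_000

commits_before = claims_per_run // 10_000  # 12 commits per run
commits_after = claims_per_run // 500      # 240 commits per run

elapsed_before_min = 48
elapsed_after_min = 51

extra_commits = commits_after - commits_before               # 228 more commits
extra_seconds = (elapsed_after_min - elapsed_before_min) * 60  # 180 s of growth

print(f"~{extra_seconds / extra_commits:.2f} s per extra COMMIT")
```

About 0.8 seconds of extra elapsed time per additional COMMIT, a cost Diane judged acceptable against the escalations it eliminated.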
Change 2: WITH HOLD Cursor
* BEFORE
EXEC SQL
DECLARE CSR-CLAIMS CURSOR FOR
...
END-EXEC
* AFTER
EXEC SQL
DECLARE CSR-CLAIMS CURSOR WITH HOLD FOR
...
END-EXEC
Without WITH HOLD, every COMMIT closed the cursor. CLMADJ01 had been reopening the cursor and repositioning using:
EXEC SQL
OPEN CSR-CLAIMS
END-EXEC
PERFORM UNTIL WS-CLAIM-ID >= WS-LAST-PROCESSED
EXEC SQL FETCH CSR-CLAIMS INTO ... END-EXEC
END-PERFORM
This repositioning loop re-read (and re-locked) all previously processed claims. Under RS, those re-read locks were held until the next COMMIT. This meant CLMADJ01 was accumulating locks on both the current batch AND previously processed claims during repositioning.
With WITH HOLD, the cursor stays open across COMMIT. Position is retained. No repositioning needed. No phantom locks from re-reads.
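A minimal Python sketch (an assumed cost model, not a trace from Pinnacle) shows why the repositioning loop hurt: without WITH HOLD, each reopen re-fetches every previously processed row to regain position, so total fetch work grows quadratically with run size.

```python
# Assumed cost model: each reopen after a COMMIT must re-FETCH all
# previously processed rows before reaching new work.

def total_fetches(total_rows: int, commit_interval: int, with_hold: bool) -> int:
    if with_hold:
        return total_rows  # position survives COMMIT; each row fetched once
    fetches = 0
    processed = 0
    while processed < total_rows:
        fetches += processed  # repositioning loop re-reads past rows
        batch = min(commit_interval, total_rows - processed)
        fetches += batch      # fetch this commit interval's rows
        processed += batch
    return fetches

print(total_fetches(120_000, 10_000, with_hold=False))  # 780000
print(total_fetches(120_000, 10_000, with_hold=True))   # 120000
```

For a 120,000-claim run committing every 10,000 rows, repositioning multiplies fetch work by 6.5x — and under RS, each re-fetch also re-acquired a lock held until the next COMMIT.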
Change 3: Change Isolation to CS
* BEFORE: BIND PLAN(CLMADJ01) ISOLATION(RS)
* AFTER: BIND PLAN(CLMADJ01) ISOLATION(CS)
Diane's justification: the FOR UPDATE OF clause already acquires U locks on fetched rows, preventing other programs from modifying them before CLMADJ01's update. CS is sufficient because:
- The cursor is ordered by CLAIM_ID, so it moves forward only
- Missed claims (due to concurrent inserts) are caught in the next run
- The U/X lock on the current row prevents concurrent modification
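Diane's "missed claims are caught next run" argument can be illustrated with a toy simulation (hypothetical data; the dict stands in for the CLAIMS table and the sorted loop for the forward-only cursor):

```python
# Toy illustration: a claim inserted behind the cursor's position is missed
# in run 1, but it is still SUBMITTED, so run 2 picks it up.

def adjudication_run(claims: dict, insert_behind_cursor=None) -> list:
    """Adjudicate SUBMITTED claims in CLAIM_ID order; optionally simulate a
    concurrent insert landing at a position the cursor already passed."""
    adjudicated = []
    for claim_id in sorted(claims):  # snapshot of ids at OPEN, forward only
        if claims[claim_id] == 'SUBMITTED':
            claims[claim_id] = 'ADJUDICATED'
            adjudicated.append(claim_id)
        if insert_behind_cursor and claim_id == insert_behind_cursor[1]:
            new_id, _ = insert_behind_cursor
            claims[new_id] = 'SUBMITTED'  # phantom insert behind the cursor
            insert_behind_cursor = None
    return adjudicated

claims = {2: 'SUBMITTED', 4: 'SUBMITTED', 6: 'SUBMITTED'}
run1 = adjudication_run(claims, insert_behind_cursor=(3, 4))
run2 = adjudication_run(claims)  # the next scheduled run, six hours later
print(run1)  # [2, 4, 6]  claim 3 was missed
print(run2)  # [3]        claim 3 is adjudicated on the next run
```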
Change 4: Partition-Aware Processing
Instead of processing all unadjudicated claims across all partitions in a single cursor, Diane redesigned CLMADJ01 to process one partition at a time:
PERFORM VARYING WS-PARTITION FROM 1 BY 1
UNTIL WS-PARTITION > 24
PERFORM PROCESS-PARTITION
END-PERFORM
PROCESS-PARTITION.
EXEC SQL
OPEN CSR-CLAIMS
END-EXEC
PERFORM PROCESS-CLAIMS-IN-PARTITION
EXEC SQL
CLOSE CSR-CLAIMS
END-EXEC
EXEC SQL COMMIT END-EXEC.
The cursor was modified to include a partition key predicate:
EXEC SQL
DECLARE CSR-CLAIMS CURSOR WITH HOLD FOR
SELECT CLAIM_ID, MEMBER_ID, PROVIDER_ID,
CLAIM_AMT, CLAIM_STATUS, SUBMIT_DATE
FROM CLAIMS
WHERE CLAIM_STATUS = 'SUBMITTED'
AND SUBMIT_MONTH = :WS-PARTITION-MONTH
ORDER BY CLAIM_ID
FOR UPDATE OF CLAIM_STATUS, ADJ_AMT, ADJ_DATE
END-EXEC
Critical scheduling decision: The current month's partition (the one with 60% of online submissions) is processed during the lowest-activity window (the 00:00 run). Previous months' partitions — which have minimal online activity — are processed during daytime runs.
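The routing rule can be sketched as a small scheduling helper (entirely illustrative; the function and its parameters are assumptions, with run times taken from the batch schedule table):

```python
# Illustrative routing rule: send the hot current-month partition to the
# overnight run, and spread the cold partitions over the remaining runs.

def assign_partitions(partitions, hot_partition, run_times, quiet_run):
    """Map each partition to a run; the hot partition gets the quiet run."""
    schedule = {quiet_run: [hot_partition]}
    day_runs = [t for t in run_times if t != quiet_run]
    cold = [p for p in partitions if p != hot_partition]
    for i, part in enumerate(cold):
        schedule.setdefault(day_runs[i % len(day_runs)], []).append(part)
    return schedule

sched = assign_partitions(partitions=range(1, 25), hot_partition=24,
                          run_times=['00:00', '06:00', '12:00', '18:00'],
                          quiet_run='00:00')
print(sched['00:00'])  # [24]: only the hot partition runs in the quiet window
```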
Change 5: Currently Committed for Read-Only Online Queries
Pinnacle's online system included read-only claim status queries. These were bound with CS and would wait on CLMADJ01's X locks. Ahmad enabled currently committed semantics for the claim inquiry plan:
BIND PLAN(CLMINQ01) ISOLATION(CS) CONCURRENTACCESSRESOLUTION(USECURRENTLYCOMMITTED)
This allowed claim status queries to return the last committed value instead of waiting on CLMADJ01's uncommitted changes. (One caveat: on Db2 for z/OS, currently committed semantics apply chiefly to uncommitted inserts and deletes; a row held by an in-flight UPDATE can still cause a wait, so the shortened lock durations from Changes 1-3 did much of the work here.)
Implementation and Results
The changes were implemented over two weeks, deployed during a maintenance window, and monitored for 30 days.
Before vs. After Metrics
| Metric | Before | After | Change |
|---|---|---|---|
| Lock suspensions/min (during batch) | 8,400 | 310 | -96% |
| Lock timeouts/min (during batch) | 45 | 0.1 | -99.8% |
| Lock escalations per day | 8-16 | 0 | -100% |
| Online response during batch | 2,100ms avg | 380ms avg | -82% |
| Online response normal | 340ms | 330ms | -3% |
| Batch elapsed time | 48 min | 51 min | +6% |
| Claims adjudicated per run | ~120,000 | ~120,000 | No change |
Unexpected Benefits
- Batch restartability improved. With COMMIT every 500 rows, the maximum work lost on an abend dropped from 10,000 claims to 500 claims. Restart time decreased from ~15 minutes (repositioning past 10,000 claims) to near-instant (WITH HOLD cursor, no repositioning needed, COMMIT point is at most 500 claims back).
- IRLM storage decreased. Peak IRLM lock structure utilization dropped from 72% to 31%. This provided headroom for future volume growth.
- Reporting batch (CLMRPT01) also improved. CLMRPT01 was a read-only batch that ran with RS isolation. Since CLMADJ01 was no longer escalating and holding partition-level X locks, CLMRPT01's lock suspensions dropped to near zero. Ahmad subsequently changed CLMRPT01 to UR isolation (the report could tolerate dirty reads), further reducing lock interactions.
Lessons Learned
Lesson 1: LOCKSIZE Changes Require Holistic Analysis
Pinnacle's CLAIMS tablespace was originally LOCKSIZE PAGE. At some point, it was changed to LOCKSIZE ROW to improve concurrency for online transactions. Like CNB's incident, the change improved online concurrency for individual row access but dramatically increased the batch lock count, triggering escalation.
Diane's rule: "Any LOCKSIZE change requires analysis of every program that accesses the table, not just the programs you're trying to help."
Lesson 2: RS Isolation Is Rarely Needed for OLTP
The original developer's concern about phantom reads was understandable but misapplied. RS adds significant lock overhead. In the vast majority of OLTP and batch processing, CS provides adequate consistency. RS should be reserved for:
- Regulatory reporting where consistency is legally mandated
- Financial reconciliation where totals must balance exactly
- Any process where re-reading a row and getting a different result would cause a program logic error
Lesson 3: Commit Frequency Is Not a Tuning Parameter — It's a Design Decision
Commit frequency should be determined during design, not adjusted in production as a performance fix. The commit frequency must account for:
- LOCKMAX (never exceed it)
- Recovery time (how much work can you afford to lose?)
- Lock duration (how long will online transactions wait?)
- Checkpoint overhead (each COMMIT has a cost)
Diane's guideline for Pinnacle: batch programs must commit at least every 500 rows when running concurrently with online. Programs that run in an exclusive window may use larger commit intervals.
Lesson 4: WITH HOLD Is Not Optional for Concurrent Batch
Any batch program that commits periodically and runs concurrently with online MUST use WITH HOLD cursors. The alternative — reopening and repositioning the cursor — introduces unnecessary I/O, re-acquires locks on already-processed data, and increases elapsed time.
Lesson 5: Currently Committed Is a Free Lunch (Almost)
Enabling currently committed for read-only transactions eliminated an entire category of lock contention (readers waiting on writers) with no observable downside. The only cost is occasional log reads to retrieve the committed version of an in-flight row. For Pinnacle's workload, this cost was unmeasurable.
Epilogue: The Volume Growth Challenge
Eighteen months after the redesign, Pinnacle's claims volume grew 40% due to a major hospital network acquisition. Ahmad ran the lock analysis again:
- CLMADJ01 now processed ~170,000 claims per run
- With COMMIT every 500 rows, maximum lock count remained 500 — no risk of escalation
- Batch elapsed time grew to 72 minutes per run (from 51 minutes) — proportional to volume
- Online response during batch: 395ms — essentially unchanged
The locking strategy scaled linearly with volume. Diane's design was volume-independent: the commit frequency controlled the maximum lock count regardless of total claims processed.
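The invariant behind that design is easy to state in code (the same simplified worst-case model as before: one X lock per row between commits):

```python
# Peak lock count is bounded by the commit interval, regardless of how many
# claims the run processes in total.

def peak_locks(claims_per_run: int, commit_interval: int) -> int:
    return min(claims_per_run, commit_interval)

# Original volume, post-acquisition volume, and a future growth scenario:
for volume in (120_000, 170_000, 500_000):
    print(volume, peak_locks(volume, 500))  # the peak stays at 500 every time
```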
Ahmad's observation: "The best locking strategies are the ones that don't change when the data grows. If your lock behavior depends on volume, you have a time bomb."
Discussion Questions
- CLMADJ01 was changed from committing every 10,000 rows to every 500 rows. If you were Diane, how would you choose between 200, 500, 1,000, or 2,000? What factors influence this decision?
- The partition-aware processing change moved current-month adjudication to the 00:00 run. What happens if claim volume grows to the point where the 00:00 run can't finish the current month's partition before the 06:00 run starts? How would you redesign?
- Ahmad enabled currently committed for read-only claim inquiries. Should it also be enabled for the claims submission transaction (which reads existing claims to check for duplicates before inserting a new one)? What are the risks?
- Diane changed CLMADJ01 from RS to CS isolation. Under what specific scenario could this cause a problem? How likely is this scenario, and what would the impact be?
- The case study mentions that CLMRPT01 (the reporting batch) was subsequently changed to UR isolation. What types of reports would make UR inappropriate, even for batch reporting?
- Pinnacle's CLAIMS table has 800 million rows across 24 partitions. Estimate the IRLM storage savings from reducing CLMADJ01's peak lock count from 10,000+ (with escalation) to 500 row locks.
- Design a locking strategy for a new Pinnacle requirement: a real-time claims dashboard that shows claim counts by status, updated every 30 seconds. The dashboard query scans the current month's partition. How do you prevent this query from interfering with adjudication batch processing?