Case Study 2: Pinnacle Health's Claims Processing Performance Crisis
The Crisis
At 6:47 AM on a Tuesday in March, Ahmad Rashid's phone buzzes with a priority-one alert: the nightly claims adjudication batch has not completed. It started at 22:00, as usual. It normally finishes by 03:30. It is still running.
By 7:15 AM, Ahmad is at his desk reviewing the situation. The claims adjudication cycle is Pinnacle Health's most critical batch process. It processes every claim received in the prior 24 hours — verifying eligibility, applying fee schedules, checking medical necessity rules, computing member liability, and posting adjudication results. Downstream systems depend on it: provider payments cannot be calculated, member portals cannot show claim status, and regulatory reports cannot be generated until adjudication completes.
Today, there are 847,000 claims in the queue. The system had processed 312,000 before apparently stalling.
Initial Diagnosis
Ahmad follows the methodology. First, the precise problem statement:
"Job PHCLM010 (claims adjudication) has been running for 8 hours 47 minutes as of 06:47. Normal elapsed time for comparable volume is 4 hours 30 minutes. The job appears to be making very slow progress — 312,000 claims processed in 8+ hours versus the normal rate of approximately 190,000 claims per hour."
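A quick arithmetic check on that problem statement, using only the figures above:

```python
# Back-of-the-envelope throughput check from the problem statement:
# 22:00 start, 06:47 alert, 312,000 claims processed so far.
elapsed_hours = 8 + 47 / 60          # 8h 47m running so far
claims_done = 312_000
normal_rate = 190_000                # claims per hour, typical

current_rate = claims_done / elapsed_hours
print(round(current_rate))                    # ~35,522 claims/hour
print(round(normal_rate / current_rate, 1))   # ~5.3x slower than normal
```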
He pulls the partial accounting data (in-flight thread statistics from OMEGAMON):
PROGRAM: PHCLM010 PLAN: PHCLMPLN STATUS: ACTIVE
RUNNING SINCE: 22:00:03
CURRENT TIME: 07:15:22 (ELAPSED SO FAR: 33,319 SEC / 9.25 HOURS)
IN-DB2 TIME: 31,847 SEC (95.6%)
CLASS 2 CPU: 2,412 SEC ( 7.2%)
CLASS 3 WAITS:
SYNC I/O: 1,847 SEC ( 5.5%)
LOCK WAIT: 127 SEC ( 0.4%)
LOG WRITE: 84 SEC ( 0.3%)
OTHER: 312 SEC ( 0.9%)
(UNACCOUNTED): 27,065 SEC (81.2%)
SQL ACTIVITY:
SELECT: 142,847,000
INSERT: 14,312,000
UPDATE: 28,624,000
FETCH: 285,694,000
COMMITS: 3,120
GETPAGE: 4,847,000,000
BUFFER POOL:
BP0 HIT: 91.2%
BP1 HIT: 99.4%
BP2 HIT: 74.8%
Ahmad immediately sees something unusual. Lock wait is only 127 seconds — that is not the problem. CPU is 2,412 seconds over 9 hours — about 7% of elapsed. The "unaccounted" time of 27,065 seconds is enormous. And the GETPAGE count is staggering — nearly 5 billion.
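The arithmetic behind that unaccounted figure falls straight out of the accounting record (a sketch using the OMEGAMON numbers above):

```python
# Class 2 elapsed (in-DB2) minus Class 2 CPU and all Class 3 waits
# leaves the time DB2 spent that no tracked category explains.
elapsed = 33_319                     # seconds since 22:00:03
in_db2 = 31_847                      # Class 2 elapsed (95.6%)
cpu = 2_412                          # Class 2 CPU
waits = 1_847 + 127 + 84 + 312       # sync I/O + lock + log write + other

unaccounted = in_db2 - cpu - waits
print(unaccounted)                            # 27065
print(round(unaccounted / elapsed * 100, 1))  # 81.2 (% of elapsed)
```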
The Unaccounted Time Mystery
"Unaccounted" time in DB2 accounting usually means the time is being spent outside DB2's tracked categories. The most common causes:
- Application processing time between SQL calls — but the program is 95.6% in DB2, so this can account for at most 4.4% of elapsed
- Synchronous I/O not being properly categorized — possible with buffer pool contention
- Page latch contention — waiting for access to in-memory buffer pages (not lock contention — latch contention)
- Prefetch suspension — DB2 suspending prefetch because the buffer pool is full
Ahmad looks at BP2's hit ratio: 74.8%. That is terrible. And the question is: what is in BP2?
He queries the DB2 catalog:
SELECT DBNAME, TSNAME, BPOOL, NACTIVE, NPAGES
FROM SYSIBM.SYSTABLESPACE
WHERE BPOOL = 'BP2'
ORDER BY NPAGES DESC;
DBNAME TSNAME BPOOL NACTIVE NPAGES
-------- ----------- ----- ------- ------
PHCLMDB TSCLM_DETAIL BP2 847 4.2M
PHCLMDB TSCLM_PROC BP2 312 1.8M
PHCLMDB TSCLM_HISTORY BP2 124 12.4M
BP2 is assigned 5,000 pages (about 20 MB for a 4K page size). It is serving three tablespaces with a combined 18.4 million pages. The working set vastly exceeds the buffer pool. DB2 is constantly stealing pages and re-reading them from disk — and the steal/re-read cycle is what accounts for the "unaccounted" time. It appears as neither traditional sync I/O wait nor lock wait because the latch waits and internal buffer management overhead fall into an accounting gap.
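The mismatch is easy to quantify from the catalog output (a sketch; the NPAGES figures are taken from the query result above):

```python
PAGE_BYTES = 4 * 1024                # 4K page size
bp2_pages = 5_000                    # BP2 allocation
data_pages = 4_200_000 + 1_800_000 + 12_400_000  # three tablespaces on BP2

pool_mb = bp2_pages * PAGE_BYTES / 1024**2
data_gb = data_pages * PAGE_BYTES / 1024**3
print(round(pool_mb, 1))             # 19.5 MB of buffer pool
print(round(data_gb, 1))             # ~70.2 GB of data behind it
print(data_pages // bp2_pages)       # 3680 data pages competing per buffer page
```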
But Why Now?
Ahmad checks: BP2 has been 5,000 pages for years. Why is this suddenly a problem?
He pulls GETPAGE trends for the CLAIMS_DETAIL table:
Month Rows Added Total Rows GETPAGEs (PHCLM010)
--------- ---------- ---------- -------------------
October 612,000 34.2M 1.1B
November 647,000 34.8M 1.2B
December 723,000 35.5M 1.3B
January 781,000 36.3M 1.5B
February 812,000 37.1M 1.7B
March (MTD) 847,000 38.0M 4.8B (still running)
The GETPAGE count has been growing, but March shows a massive jump. 847,000 claims is only 4% more than February's 812,000 — but GETPAGEs nearly tripled.
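Normalizing GETPAGEs per claim makes the discontinuity obvious (a sketch with the trend table's figures):

```python
# (claims processed, GETPAGEs) per month, from the trend data
volume = {"February": (812_000, 1.7e9),
          "March":    (847_000, 4.8e9)}

per_claim = {m: g / c for m, (c, g) in volume.items()}
print(round(per_claim["February"]))   # ~2094 getpages per claim
print(round(per_claim["March"]))      # ~5667 getpages per claim
growth = 847_000 / 812_000 - 1
print(f"{growth:.1%}")                # 4.3% more claims, nearly 3x the getpages
```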
Ahmad checks EXPLAIN for the main claims processing query:
QUERYNO PLANNO METHOD TNAME ACCESSTYPE MATCHCOLS ACCESSNAME INDEXONLY PREFETCH
------- ------ ------ -------------- ---------- --------- --------------- -------- --------
8 1 0 CLAIMS I 2 XCLM_STATUS_DT N S
8 2 1 CLAIMS_DETAIL I 1 XCLMD_CLM_ID N S
8 3 1 CLAIMS_PROC I 2 XCLMP_CLM_PROC N L
Then he checks the same EXPLAIN from two months ago:
QUERYNO PLANNO METHOD TNAME ACCESSTYPE MATCHCOLS ACCESSNAME INDEXONLY PREFETCH
------- ------ ------ -------------- ---------- --------- --------------- -------- --------
8 1 0 CLAIMS I 2 XCLM_STATUS_DT N S
8 2 1 CLAIMS_DETAIL I 2 XCLMD_CLM_DATE N S
8 3 1 CLAIMS_PROC I 2 XCLMP_CLM_PROC N L
The access path for CLAIMS_DETAIL changed. Two months ago, it used XCLMD_CLM_DATE with MATCHCOLS=2. Now it uses XCLMD_CLM_ID with MATCHCOLS=1.
Ahmad checks: when was the last REBIND of PHCLMPLN?
BIND DATE: 2025-02-15
LAST RUNSTATS ON CLAIMS_DETAIL: 2025-02-14
The REBIND happened one day after RUNSTATS. He checks the RUNSTATS timing: it ran at 14:00 on February 14, when CLAIMS_DETAIL held roughly 37 million rows. The February RUNSTATS captured a different column distribution than the January statistics the optimizer had been using.
Ahmad digs deeper. He looks at the column distribution for CLAIM_ID in CLAIMS_DETAIL:
SELECT COLVALUE, FREQUENCY
FROM SYSIBM.SYSCOLDIST
WHERE TBNAME = 'CLAIMS_DETAIL'
AND NAME = 'CLAIM_ID'
ORDER BY FREQUENCY DESC
FETCH FIRST 10 ROWS ONLY;
The result shows that CLAIM_ID has uniform distribution — each claim ID maps to 3-8 detail rows. The MATCHCOLS=1 on CLAIM_ID means DB2 is doing a nested loop for each claim, reading 3-8 detail rows per probe. That is fine for individual lookups.
But the old access path used XCLMD_CLM_DATE with MATCHCOLS=2 — it was doing a range scan on CLAIMS_DETAIL by date range, reading all detail rows for the batch date in one sequential sweep. Much more efficient for batch processing that processes an entire day's claims.
The optimizer switched because the new statistics made the per-claim nested loop look cheaper (3-8 rows per probe). What the optimizer did not account for: when you process 847,000 claims, that is 847,000 probes into the index, each reading a different part of the tablespace. The random I/O pattern is catastrophic for a buffer pool that can only hold 5,000 pages.
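A rough residency model shows why the random probes are catastrophic here (a sketch assuming probes land uniformly across the TSCLM_DETAIL data pages, and ignoring index pages, which tend to stay cached):

```python
pool_pages = 5_000                   # BP2 allocation
detail_pages = 4_200_000             # TSCLM_DETAIL NPAGES from the catalog query
probes = 847_000                     # one index probe per claim

# Chance that any given randomly probed data page is already in the pool
residency = pool_pages / detail_pages
print(f"{residency:.2%}")            # ~0.12%: essentially every probe misses
# Each miss is a synchronous read plus a page steal, whereas the old
# date-ordered sweep read each page once with sequential prefetch.
```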
The Root Cause Chain
Ahmad documents the root cause chain:
- Proximate cause: Buffer pool BP2 exhaustion causing massive page steal/re-read cycles
- Contributing cause: Access path change from sequential date-range scan to per-claim random probe
- Trigger: REBIND on 2025-02-15 picked up new statistics from 2025-02-14 RUNSTATS
- Underlying cause: BP2 sizing (5,000 pages) was always marginal — it worked only because the sequential access pattern had good prefetch behavior
- Root cause: No EXPLAIN comparison was performed after the REBIND, and no performance regression test was run
The Fix
Ahmad implements fixes in priority order:
Immediate Fix (Tuesday Morning)
Cancel the running job. Modify the batch cursor and rebind the package to steer the optimizer back to the XCLMD_CLM_DATE index:
-- In the COBOL program, modify the cursor declaration:
EXEC SQL
    DECLARE CSR_CLAIM_DETAIL CURSOR FOR
    SELECT CD.DETAIL_ID, CD.PROC_CODE, CD.DIAG_CODE,
           CD.SERVICE_DATE, CD.BILLED_AMT
      FROM CLAIMS_DETAIL CD
     WHERE CD.CLAIM_DATE BETWEEN :WS-BATCH-START AND :WS-BATCH-END
       AND CD.CLAIM_STATUS = :WS-PROC-STATUS
     ORDER BY CD.CLAIM_DATE, CD.CLAIM_ID
     OPTIMIZE FOR 100000 ROWS
END-EXEC
The ORDER BY on CLAIM_DATE makes the date-based index attractive to the optimizer (using it lets DB2 avoid a sort), and the OPTIMIZE FOR 100000 ROWS clause signals that this is a high-volume batch cursor, not a single-row lookup.
He rebinds, runs EXPLAIN, confirms MATCHCOLS=2 on XCLMD_CLM_DATE, and restarts the job. The remaining 535,000 claims process in 2 hours 48 minutes.
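The restart numbers confirm that throughput returned to normal (a quick check against the figures above):

```python
remaining = 847_000 - 312_000        # claims left after the cancel
hours = 2 + 48 / 60                  # restarted job ran 2h 48m
rate = remaining / hours
print(round(rate))                   # ~191,071 claims/hour vs ~190,000 normal
```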
Short-Term Fix (Within One Week)
Increase BP2 from 5,000 pages to 50,000 pages. Even with the sequential access path restored, BP2 is undersized for the growing data volumes. Ahmad requests the additional 180 MB of memory from the z/OS systems team. It is approved the same day.
He also adds CLAIMS_DETAIL, CLAIMS_PROC, and CLAIMS_HISTORY to the RUNSTATS schedule with FREQVAL and distribution statistics. The default RUNSTATS was capturing only basic cardinality — not enough for the optimizer to make good decisions about batch vs. single-row access paths.
Medium-Term Fix (Within One Month)
Implement REBIND governance:
- Pre-REBIND EXPLAIN comparison — mandatory for all packages. Diane Chen writes a COBOL comparison program (similar to Pinnacle's existing report programs) that reads PLAN_TABLE rows for the current and proposed package and produces a difference report.
- Performance regression test — mandatory for all packages that show access path changes. Run the affected program in the test environment with production-volume data and compare accounting metrics.
- REBIND approval workflow — no REBIND moves to production without sign-off from both the DBA (Ahmad) and the application lead (Diane).
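The comparison Diane's program performs can be sketched in Python (a hypothetical stand-in for the COBOL report program; in practice the rows would come from PLAN_TABLE queries, shown inline here for illustration):

```python
# Compare the access-path columns that matter for each plan step
# (QUERYNO, PLANNO) and flag any change between current and proposed.
KEYS = ("ACCESSTYPE", "MATCHCOLS", "ACCESSNAME", "PREFETCH")

def diff_plans(current, proposed):
    changes = []
    for step, new in sorted(proposed.items()):
        old = current.get(step)
        if old and any(old[k] != new[k] for k in KEYS):
            changes.append((step, old["ACCESSNAME"], new["ACCESSNAME"]))
    return changes

# The CLAIMS_DETAIL step from this incident, before and after the REBIND:
current = {(8, 2): {"ACCESSTYPE": "I", "MATCHCOLS": 2,
                    "ACCESSNAME": "XCLMD_CLM_DATE", "PREFETCH": "S"}}
proposed = {(8, 2): {"ACCESSTYPE": "I", "MATCHCOLS": 1,
                     "ACCESSNAME": "XCLMD_CLM_ID", "PREFETCH": "S"}}

print(diff_plans(current, proposed))
# [((8, 2), 'XCLMD_CLM_DATE', 'XCLMD_CLM_ID')]
```

A difference report like this, reviewed before the REBIND is promoted, would have flagged the CLAIMS_DETAIL change that caused the incident.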
Long-Term Fix (Within One Quarter)
Partition CLAIMS_DETAIL by month. The table is approaching 40 million rows and growing at 800K+ per month. Partitioning provides:
- Partition pruning — batch processing only scans the relevant monthly partition
- Partition-level RUNSTATS — statistics are more accurate per partition
- Partition-level REORG — can reorganize current month without affecting history
- Partition-level archival — drop old partitions when retention period expires
The Aftermath
Diane Chen conducts a review meeting on Thursday. She opens with the timeline:
22:00 Monday: PHCLM010 starts (normal)
03:30 Tuesday: Expected completion time — still running
06:47 Tuesday: P1 alert generated
07:15 Tuesday: Ahmad begins diagnosis
07:45 Tuesday: Root cause identified (30 minutes)
08:00 Tuesday: Fix applied (rebind with hint)
08:15 Tuesday: Job restarted
11:03 Tuesday: Job completes
11:15 Tuesday: Downstream processing begins (8 hours late)
19:00 Tuesday: All downstream processing complete
Impact: Provider payment file delayed 8 hours. 147 providers did not receive expected payment notification by their contractual deadline. Member portal showed stale claim status for 847,000 claims until 19:00. Three regulatory reports filed 4 hours late (within the grace period, but documented).
Diane's after-action summary:
"The technical root cause was an access path change that went undetected because we have no REBIND governance process. The access path change exposed a buffer pool sizing weakness that had been latent for years. The fix was straightforward once diagnosed — Ahmad identified it in 30 minutes. The prevention is also straightforward: compare EXPLAIN before and after every REBIND, test with production volumes, and size buffer pools for the worst-case access pattern, not the best-case."
Ahmad adds: "I also want to flag that we have been treating RUNSTATS and REBIND as routine maintenance activities. They are not routine. Every RUNSTATS can change what the optimizer sees. Every REBIND can change the access path. These are change events and should be treated with the same rigor as code changes."
Lisa agrees. The REBIND governance process is now a mandatory standard at Pinnacle Health.
Quantitative Summary
| Metric | Normal | During Crisis | After Fix |
|---|---|---|---|
| Elapsed Time | 4h 30m | 9h 15m+ (cancelled) | 4h 12m |
| CPU Time | 2,400 sec | 2,412 sec | 2,380 sec |
| GETPAGE | 1.7B | 4.8B (partial) | 1.6B |
| BP2 Hit Ratio | 91% | 74.8% | 97.2% (after resize) |
| MATCHCOLS (Q8, CLAIMS_DETAIL) | 2 | 1 | 2 |
| Sync I/O | 1,200 sec | 1,847 sec (partial) | 890 sec |
| Downstream Delay | 0 | 8 hours | 0 |
Note that CPU time is nearly identical across all three scenarios. This was never a CPU problem. It was an I/O access pattern problem caused by an access path change, amplified by an undersized buffer pool.
Discussion Questions
- Ahmad identified the root cause in 30 minutes. What specific skills and preparation made this possible? How long would it take without the systematic methodology?
- The buffer pool sizing of 5,000 pages had been "fine" for years. What changed, and how would you build a monitoring system that detects when a previously-adequate resource is becoming insufficient?
- The optimizer's decision to switch from date-range scan to per-claim probe was technically correct — the per-claim probe is cheaper for single-claim lookups. How do you communicate to the optimizer that this SQL is batch processing, not single-row lookup?
- Diane's after-action review treated this as a process failure, not a technical failure. Do you agree? What is the relationship between technical competence and process discipline in performance management?
- Ahmad said RUNSTATS and REBIND should be treated as "change events." What are the implications of this statement for a shop that runs RUNSTATS weekly and REBIND monthly? How would you balance the overhead of governance with the frequency of these operations?
- The long-term fix (partitioning) was recommended even though the short-term fixes resolved the immediate problem. Under what circumstances would you prioritize the long-term architectural fix over the quick tactical fixes?