Case Study 2: The API That Brought Down Batch

Background

Pacific Health Partners had successfully modernized their claims system through Phases 1-4, closely following the approach described in this chapter. The crown jewel of the modernization was a new CICS web service that gave partner hospitals real-time claim status inquiries. The API launched on a Monday and was immediately popular — within a week, partners were making 15,000 API calls per day.

On Friday of the launch week, the nightly batch processing run exceeded the batch window for the first time in the system's 14-year history. Jobs that normally completed by 4 AM were still running at 7 AM when the online CICS region was scheduled to start.

The operations team faced a difficult choice: cancel the remaining batch jobs (leaving the day's processing incomplete) or delay the online system startup (preventing customer service representatives from working). They chose to delay online startup, which caused a cascade of complaints from the call center.

The Investigation

The batch jobs that overran were the claim adjudication and payment processing steps — the same programs that the API also accessed through DB2.

The DBA, analyzing DB2 performance metrics, found the smoking gun: lock contention. The batch programs updated claim records using DB2 UPDATE statements with COMMIT every 500 records. The API read claim records using DB2 SELECT statements. Under DB2's default isolation level (Cursor Stability), the SELECT statements were waiting for UPDATE locks to be released.

Normally, this was not a problem because the API did not exist during batch processing hours. But partner hospitals on the West Coast were making API calls at 10 PM Pacific time — which was 1 AM Eastern, right in the middle of the batch window.

The DB2 statistics told the story:

Metric Normal Night Problem Night
Batch elapsed time 3.5 hours 7.2 hours
DB2 lock waits 142 48,720
Average lock wait time 8 ms 340 ms
API response time N/A 2.3 seconds (target: 200ms)
API timeout errors N/A 1,247

The batch programs were waiting for API-held locks. The API programs were waiting for batch-held locks. Neither could make progress efficiently.

The Solution

The team implemented three changes:

1. Uncommitted Read for the API. The claim status API only reads data — it never updates. Changing the DB2 isolation level to UR (Uncommitted Read) for the API program eliminated lock contention entirely. The API would occasionally see in-flight data (a claim mid-update), but for a status inquiry, this was acceptable.

SELECT CLAIM_STATUS, PAID_AMT
INTO :WS-STATUS, :WS-PAID
FROM MEDCLAIM.CLAIM
WHERE CLAIM_ID = :WS-CLAIM-ID
WITH UR

2. Reduced COMMIT frequency in batch. The batch programs committed every 500 records, which meant locks were held across large groups of records. Reducing the COMMIT interval to every 50 records released locks more frequently, reducing contention for the API.

3. Scheduled batch awareness. The API gateway was configured to return cached results during the batch window (10 PM - 4 AM Eastern) instead of querying DB2 directly. The cache was populated by a pre-batch extract and was at most a few hours stale — acceptable for status inquiries.

Results

After implementing all three changes, the batch window returned to its normal 3.5 hours, and API response times dropped to under 100ms even during batch processing.

Discussion Questions

  1. Why does Uncommitted Read (UR) solve the lock contention problem? What are the risks of using UR?
  2. Could this problem have been anticipated during the API design phase? What design review question would have caught it?
  3. The caching solution introduces "stale data" during the batch window. How would you communicate this to API consumers? Under what circumstances would stale data be unacceptable?
  4. How would you monitor for lock contention proactively, so that the next time batch and online processing conflict, you detect it before the batch window is exceeded?
  5. What does this case study teach about the relationship between batch and online processing in a modernized system?