Case Study 1: GlobalBank Quick Lookup Table Performance Optimization

Background

GlobalBank's nightly batch cycle processes 2.3 million transactions against the ACCT-MASTER VSAM KSDS file. The batch window — the time between the close of business and the morning's online systems coming up — is 6 hours. By late 2023, the transaction processing job (TXN-PROC) consumed 3 hours and 47 minutes of that window, leaving uncomfortably little margin for the downstream report generation, reconciliation, and backup jobs.

Maria Chen analyzed the I/O profile and discovered that 60% of transactions (approximately 1.38 million) targeted only 50,000 accounts — the "hot accounts." These were corporate accounts with multiple daily transactions, high-activity checking accounts, and accounts receiving automated payments. Every lookup against the KSDS required 3-4 I/O operations for index traversal plus data read.

"We're doing roughly 5 million I/Os just for the hot account lookups," Maria reported to the architecture review board. "If we can get those down to 1 I/O each, we save 4 million I/Os and cut the job by at least 20 minutes."

Design Phase

Option Analysis

Option                    Estimated Improvement    Complexity  Risk
Add VSAM LSR buffering    10-15% (index caching)   Low         Low
Increase BUFND/BUFNI      5-10%                    Low         Low
RRDS quick lookup cache   25-30% for hot accounts  Medium      Medium
In-memory table (OCCURS)  40%+                     High        High (memory)

Priya Kapoor recommended the RRDS approach: "LSR buffering helps but won't solve the fundamental problem of 3-4 I/Os per lookup. An in-memory table for 50,000 records would consume 17 MB of working storage — too much for our region size. The RRDS gives us single-I/O access without the memory pressure."

Hash Function Selection

Derek Washington was assigned to prototype hash functions. He tested three approaches against the actual hot-account key distribution:

Hash Function               Collisions (of 50,000)  Max Chain  Avg Probes
Division by 64,997 (prime)   8,247                  7          1.17
Folding (2-digit groups)    12,891                  12         1.26
Mid-square                   9,103                  9          1.19

Division by prime produced the fewest collisions and shortest chains. "The account numbers have patterns — branches assign them in blocks," Derek explained. "The prime divisor breaks up those patterns better than folding or mid-square."
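As an illustration (in Python rather than the shop's COBOL), the division/remainder hash can be sketched as follows; the branch block numbers are hypothetical:

```python
TABLE_SIZE = 64_997  # prime number of RRDS slots (from the sizing section)

def division_hash(account_number: int) -> int:
    """Map an account number to a relative record number, 0..64996."""
    return account_number % TABLE_SIZE

# Branch-assigned account numbers share leading digits, but the prime
# divisor still spreads them across the table (block values hypothetical):
block = [40_010_000 + i * 10 for i in range(5)]
slots = [division_hash(a) for a in block]   # five distinct slots
```

Because 64,997 shares no factors with the block increments branches use, sequentially assigned keys land in distinct slots instead of piling onto a few chains.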

File Sizing

With 50,000 records and a target load factor of 77%:

Slots needed = 50,000 / 0.77 = 64,935
Table size chosen = 64,997 (a prime at or above 64,935; actual load factor 50,000 / 64,997 ≈ 76.9%)

Record size: 100 bytes (account number, name, balance, status, last-access date). Total file size: 64,997 * 100 = ~6.5 MB — trivial for DASD allocation.
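The sizing arithmetic can be checked with a few lines of Python (a sketch; the `is_prime` helper is ours):

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for numbers this small."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

records, target_load = 50_000, 0.77
slots_needed = int(records / target_load)   # 64,935
table_size = 64_997                         # prime chosen for the RRDS
assert is_prime(table_size) and table_size >= slots_needed
actual_load = records / table_size          # ~0.769
file_mb = table_size * 100 / 1_000_000      # 100-byte records -> ~6.5 MB
```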

Implementation

The implementation followed three stages:

Stage 1: Hot Account Identification

A new batch step ran before TXN-PROC, analyzing the previous 30 days of transaction history to identify the 50,000 most-accessed accounts:

//HOTIDENT EXEC PGM=HOTACCTS
//TXNHIST  DD DSN=GLOBANK.TXN.HIST.VSAM,DISP=SHR
//HOTLIST  DD DSN=GLOBANK.HOT.ACCT.LIST,DISP=(,CATLG,...),
//            SPACE=(CYL,(1,1)),DCB=(RECFM=FB,LRECL=55)
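The selection logic inside HOTACCTS amounts to a top-N frequency count over the history file. A Python sketch of the idea (the real program is COBOL; the names below are ours):

```python
from collections import Counter

def hot_accounts(history, n=50_000):
    """Return the n most-accessed account numbers from a transaction log.

    history: iterable of account numbers, one per historical transaction.
    """
    counts = Counter(history)
    return [acct for acct, _ in counts.most_common(n)]

# Tiny example; in production the result list is what gets written
# to GLOBANK.HOT.ACCT.LIST.
sample = [101, 202, 202, 303, 303, 303]
top2 = hot_accounts(sample, n=2)   # [303, 202]
```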

Stage 2: RRDS Load

The QUICK-LOAD program (presented in Section 13.6) loaded the hot accounts into the RRDS using the division/remainder hash with linear probing.
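The load algorithm can be sketched in Python, with a fixed slot array standing in for the RRDS (the actual QUICK-LOAD is COBOL):

```python
TABLE_SIZE = 64_997  # prime slot count from the sizing section

def load_cache(records):
    """Load (account_number, payload) pairs; return slots and collision count."""
    slots = [None] * TABLE_SIZE
    collisions = 0
    for acct, payload in records:
        rrn = acct % TABLE_SIZE            # home slot from the division hash
        while slots[rrn] is not None:      # occupied: linear probe
            collisions += 1
            rrn = (rrn + 1) % TABLE_SIZE   # wrap past the last slot
        slots[rrn] = (acct, payload)
    return slots, collisions

# Two accounts with the same home slot illustrate one probe step:
slots, collisions = load_cache([(1, "A"), (64_998, "B")])  # both hash to 1
```

The collision counter is the figure that feeds the load-statistics report shown later.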

Stage 3: TXN-PROC Modification

The existing TXN-PROC program was modified to check the RRDS first:

For each transaction:
  1. Compute hash of account number
  2. Probe RRDS (1-2 I/Os typical)
  3. If found → use cached data
  4. If not found → fall back to KSDS (3-4 I/Os)
  5. Process transaction
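The five steps above can be sketched in Python (the production code is COBOL; `ksds_read` stands in for the 3-4 I/O fallback lookup):

```python
TABLE_SIZE = 64_997

def lookup(acct, slots, ksds_read):
    """Probe the cache along its linear-probe chain; fall back on a miss."""
    rrn = acct % TABLE_SIZE                 # step 1: compute hash
    probes = 0
    while probes < TABLE_SIZE and slots[rrn] is not None:
        if slots[rrn][0] == acct:
            return slots[rrn][1]            # step 3: cache hit
        rrn = (rrn + 1) % TABLE_SIZE        # step 2: keep probing the chain
        probes += 1
    return ksds_read(acct)                  # step 4: fall back to the KSDS

# Usage with a tiny table:
slots = [None] * TABLE_SIZE
slots[5] = (5, "cached-balance")
hit = lookup(5, slots, ksds_read=lambda a: "from-ksds")
miss = lookup(6, slots, ksds_read=lambda a: "from-ksds")
```

Hitting an empty slot ends the probe chain early, which is why keeping the load factor under ~77% matters: full tables make misses expensive.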

Testing Results

Pilot Run (10% of transactions)

Metric                  Before (KSDS only)  After (RRDS + KSDS)
I/Os for hot accounts      838,000             247,000
I/Os for cold accounts     368,000             368,000
Total I/Os               1,206,000             615,000
Elapsed time                23 min              17 min

Full Production Run

Metric              Before     After      Improvement
Hot account I/Os    4,830,000  1,590,000  -67%
Cold account I/Os   3,680,000  3,680,000  No change
Total I/Os          8,510,000  5,270,000  -38%
TXN-PROC elapsed    3h 47min   3h 25min   -22 min
QUICK-LOAD elapsed  N/A        3 min      New step
Net time savings    --         --         19 min

Cache Statistics

QUICK LOOKUP LOAD COMPLETE
Records loaded:  0050000
Collisions:      0008247
Overflows:       0000000
Load factor:     0000076%

TRANSACTION PROCESSING COMPLETE
Cache hits:      1,241,837
Cache misses:      138,163 (accounts probed but not present in the cache)
KSDS fallbacks:    920,000 (cold accounts)
Cache hit ratio: 54.0% (1,241,837 of 2,300,000 transactions)

Production Issues

Issue 1: Stale Cache Data

During the first week of production, the team discovered that accounts whose balances changed during batch processing had stale data in the RRDS cache. The TXN-PROC program read the cached balance, processed a transaction, then needed to write the updated balance back to the KSDS master — but the cache was now out of sync.

Fix: After each KSDS REWRITE (balance update), also REWRITE the RRDS cache record with the new balance. This added minimal overhead (one extra I/O per update) but kept the cache consistent throughout the batch run.
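A sketch of the write-through fix (the two rewrite callables stand in for the actual REWRITE operations; this is not the production COBOL):

```python
def update_balance(acct, new_balance, rewrite_ksds, rewrite_rrds, in_cache):
    """Write-through update: the KSDS master stays authoritative, and the
    RRDS cache record is rewritten only when the account is cached."""
    rewrite_ksds(acct, new_balance)      # master update (as before the fix)
    if in_cache:
        rewrite_rrds(acct, new_balance)  # one extra I/O keeps cache in sync

# Usage: record which files get written for a hot (cached) account.
writes = []
update_balance(101, 250.00,
               rewrite_ksds=lambda a, b: writes.append(("ksds", a, b)),
               rewrite_rrds=lambda a, b: writes.append(("rrds", a, b)),
               in_cache=True)
```

Ordering matters: writing the master first means a failure between the two REWRITEs leaves the cache stale (safe, since the KSDS fallback still works) rather than the master wrong.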

Issue 2: Hash Function Drift

After six months, the hot account profile shifted. New corporate accounts had different number patterns, causing more collisions:

Month    Collisions  Avg Probes
Month 1   8,247      1.17
Month 3   9,412      1.19
Month 6  14,823      1.31

Fix: The QUICK-LOAD program was enhanced to log collision statistics. A monitoring job alerts operations when average probes exceed 1.5, triggering a review of the hash function parameters.
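The monitoring check reduces to comparing average probes per load against the threshold; a sketch (the function name and probe totals below are illustrative, not from the actual monitoring job):

```python
def probe_alert(records_loaded, total_probes, threshold=1.5):
    """Return (alert?, average probes); 1 probe = a direct home-slot hit."""
    avg_probes = total_probes / records_loaded
    return avg_probes > threshold, round(avg_probes, 2)

# Month 6 from the table above: 50,000 records at an average of 1.31 probes
# (roughly 65,500 total probes) is still under the 1.5 alert threshold.
alert, avg = probe_alert(50_000, 65_500)
```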

Lessons Learned

  1. Profile before optimizing: The 60/15 ratio (60% of transactions hitting 15% of accounts) made the cache viable. Without that skewed distribution, the approach would not have been worthwhile.

  2. Cache coherence matters: Even in batch processing, cache consistency must be managed. The stale data issue could have caused balance discrepancies.

  3. Monitor hash performance over time: Account number patterns change as new accounts are opened. The hash function's effectiveness can degrade without anyone noticing.

  4. Keep the fallback path: The KSDS lookup must always work, even if the cache fails completely. The cache is an optimization, not a replacement.

Discussion Questions

  1. What would happen if the hot account profile changed dramatically — say, from 50,000 accounts to 200,000? How would you adjust the design?

  2. Instead of rebuilding the cache nightly, could the cache be maintained continuously (updated as accounts change)? What are the trade-offs?

  3. The team chose linear probing for collision handling. Would quadratic probing or double hashing have been significantly better for this use case? Why or why not?

  4. At what point should GlobalBank consider replacing the RRDS cache with an in-memory approach (larger COBOL tables, or a data-grid product such as IBM WebSphere eXtreme Scale)?