Case Study 1: GlobalBank Quick Lookup Table Performance Optimization
Background
GlobalBank's nightly batch cycle processes 2.3 million transactions against the ACCT-MASTER VSAM KSDS file. The batch window — the time between the close of business and the morning's online systems coming up — is 6 hours. By late 2023, the transaction processing job (TXN-PROC) consumed 3 hours and 47 minutes of that window, leaving uncomfortably little margin for the downstream report generation, reconciliation, and backup jobs.
Maria Chen analyzed the I/O profile and discovered that 60% of transactions (approximately 1.38 million) targeted only 50,000 accounts — the "hot accounts." These were corporate accounts with multiple daily transactions, high-activity checking accounts, and accounts receiving automated payments. Every lookup against the KSDS required 3-4 I/O operations for index traversal plus data read.
"We're doing roughly 5 million I/Os just for the hot account lookups," Maria reported to the architecture review board. "If we can get those down to 1 I/O each, we save 4 million I/Os and cut the job by at least 20 minutes."
Design Phase
Option Analysis
| Option | Estimated Improvement | Complexity | Risk |
|---|---|---|---|
| Add VSAM LSR buffering | 10-15% (index caching) | Low | Low |
| Increase BUFND/BUFNI | 5-10% | Low | Low |
| RRDS quick lookup cache | 25-30% for hot accounts | Medium | Medium |
| In-memory table (OCCURS) | 40%+ | High | High (memory) |
Priya Kapoor recommended the RRDS approach: "LSR buffering helps but won't solve the fundamental problem of 3-4 I/Os per lookup. An in-memory table for 50,000 records would consume 17 MB of working storage — too much for our region size. The RRDS gives us single-I/O access without the memory pressure."
Hash Function Selection
Derek Washington was assigned to prototype hash functions. He tested three approaches against the actual hot-account key distribution:
| Hash Function | Collisions (out of 50,000) | Max Chain | Avg Probes |
|---|---|---|---|
| Division by 64,997 (prime) | 8,247 | 7 | 1.17 |
| Folding (2-digit groups) | 12,891 | 12 | 1.26 |
| Mid-square | 9,103 | 9 | 1.19 |
Division by prime produced the fewest collisions and shortest chains. "The account numbers have patterns — branches assign them in blocks," Derek explained. "The prime divisor breaks up those patterns better than folding or mid-square."
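Derek's winning hash reduces to a single remainder operation. A minimal Python sketch of the division/remainder approach (the production code is COBOL; the names here are illustrative):

```python
# Sketch of the division/remainder hash tested in the case study.
# Account numbers are treated as integers; 64,997 is the prime
# slot count chosen during file sizing.
TABLE_SIZE = 64_997

def hash_slot(account_number: int) -> int:
    """Map an account number to a 0-based RRDS slot number."""
    return account_number % TABLE_SIZE

# Block-assigned account numbers still scatter across slots:
print(hash_slot(10_000_001))  # → 55460
```

Because the divisor is prime, keys assigned in consecutive branch blocks land in well-separated slots rather than clustering, which is exactly the weakness of folding and mid-square on patterned keys.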
File Sizing
With 50,000 records and a target load factor of 77%:
Slots needed = 50,000 / 0.77 ≈ 64,935
Table size = 64,997, a prime at or above 64,935 (actual load factor ≈ 76.9%)
Record size: 100 bytes (account number, name, balance, status, last-access date). Total file size: 64,997 * 100 = ~6.5 MB — trivial for DASD allocation.
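The sizing arithmetic can be checked with a short script. This is an illustrative sketch, not part of the production tooling; `is_prime` is a throwaway trial-division helper:

```python
import math

def is_prime(n: int) -> bool:
    """Trial division; fast enough for table sizes in this range."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

RECORDS = 50_000
TABLE_SIZE = 64_997        # prime slot count from the case study
RECORD_BYTES = 100

assert is_prime(TABLE_SIZE)
load_factor = RECORDS / TABLE_SIZE         # ~0.769, i.e. ~77%
file_bytes = TABLE_SIZE * RECORD_BYTES     # 6,499,700 (~6.5 MB)
print(f"{load_factor:.3f} {file_bytes}")
```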
Implementation
The implementation followed three stages:
Stage 1: Hot Account Identification
A new batch step ran before TXN-PROC, analyzing the previous 30 days of transaction history to identify the 50,000 most-accessed accounts:
//HOTIDENT EXEC PGM=HOTACCTS
//TXNHIST  DD DSN=GLOBANK.TXN.HIST.VSAM,DISP=SHR
//HOTLIST DD DSN=GLOBANK.HOT.ACCT.LIST,DISP=(,CATLG,...),
// SPACE=(CYL,(1,1)),DCB=(RECFM=FB,LRECL=55)
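The HOTACCTS step boils down to a frequency count over the 30-day history. A sketch of the idea in Python, assuming the history reduces to a stream of account numbers (the real program reads VSAM records; `top_accounts` is a hypothetical name):

```python
from collections import Counter

def top_accounts(txn_history, n=50_000):
    """Return the n most-frequently accessed account numbers.

    txn_history: iterable of account numbers, one per historical
    transaction (a stand-in for the 30-day history file).
    """
    counts = Counter(txn_history)
    return [acct for acct, _ in counts.most_common(n)]

# Toy history: account 1001 is "hot", 2002 less so.
history = [1001, 1001, 1001, 2002, 2002, 3003]
print(top_accounts(history, n=2))  # → [1001, 2002]
```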
Stage 2: RRDS Load
The QUICK-LOAD program (presented in Section 13.6) loaded the hot accounts into the RRDS using the division/remainder hash with linear probing.
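The hashing and collision handling in the load step can be sketched as follows (illustrative Python, not the actual COBOL; `load_cache` is a hypothetical name, and collisions are counted the way the case study reports them, once per record whose home slot is already occupied):

```python
TABLE_SIZE = 64_997  # prime RRDS slot count from the case study

def load_cache(records, table_size=TABLE_SIZE):
    """Load (account, data) pairs into a slot table using
    division hashing with linear probing."""
    slots = [None] * table_size
    collisions = 0
    for acct, data in records:
        slot = acct % table_size          # home slot
        if slots[slot] is not None:
            collisions += 1               # home slot taken
        while slots[slot] is not None:
            slot = (slot + 1) % table_size  # probe forward
        slots[slot] = (acct, data)
    return slots, collisions
```

With a tiny 7-slot table, accounts 1 and 8 both hash to slot 1; the second is placed in slot 2 and counted as one collision.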
Stage 3: TXN-PROC Modification
The existing TXN-PROC program was modified to check the RRDS first:
For each transaction:
1. Compute hash of account number
2. Probe RRDS (1-2 I/Os typical)
3. If found → use cached data
4. If not found → fall back to KSDS (3-4 I/Os)
5. Process transaction
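The lookup steps above can be sketched as follows (illustrative Python; a dict stands in for the KSDS master, and an empty slot ends the probe sequence because the loader never leaves gaps inside a chain):

```python
def lookup(acct, slots, ksds, table_size):
    """Probe the cache; fall back to the KSDS master on a miss.

    slots: the loaded slot table; ksds: dict standing in for
    the VSAM master file. Returns (data, source).
    """
    slot = acct % table_size
    for _ in range(table_size):          # at most one full sweep
        entry = slots[slot]
        if entry is None:                # empty slot: not cached
            break
        if entry[0] == acct:             # cache hit (1-2 I/Os)
            return entry[1], "cache"
        slot = (slot + 1) % table_size   # probe next slot
    return ksds[acct], "ksds"            # fallback (3-4 I/Os)
```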
Testing Results
Pilot Run (10% of transactions)
| Metric | Before (KSDS only) | After (RRDS + KSDS) |
|---|---|---|
| I/Os for hot accounts | 838,000 | 247,000 |
| I/Os for cold accounts | 368,000 | 368,000 |
| Total I/Os | 1,206,000 | 615,000 |
| Elapsed time | 23 min | 17 min |
Full Production Run
| Metric | Before | After | Improvement |
|---|---|---|---|
| Hot account I/Os | 4,830,000 | 1,590,000 | -67% |
| Cold account I/Os | 3,680,000 | 3,680,000 | No change |
| Total I/Os | 8,510,000 | 5,270,000 | -38% |
| TXN-PROC elapsed | 3h 47min | 3h 25min | -22 min |
| QUICK-LOAD elapsed | N/A | 3 min | New step |
| Net time savings | — | — | 19 min |
Cache Statistics
QUICK LOOKUP LOAD COMPLETE
Records loaded: 0050000
Collisions: 0008247
Overflows: 0000000
Load factor: 0000076%
TRANSACTION PROCESSING COMPLETE
Cache hits: 1,241,837
Cache misses: 138,163 (expected-hot accounts not found in the cache)
KSDS fallbacks: 920,000 (cold accounts)
Cache hit ratio: 57.4% (hits as a share of hits plus KSDS fallbacks)
Production Issues
Issue 1: Stale Cache Data
During the first week of production, the team discovered that accounts whose balances changed during batch processing had stale data in the RRDS cache. The TXN-PROC program read the cached balance, processed a transaction, then needed to write the updated balance back to the KSDS master — but the cache was now out of sync.
Fix: After each KSDS REWRITE (balance update), also REWRITE the RRDS cache record with the new balance. This added minimal overhead (one extra I/O per update) but kept the cache consistent throughout the batch run.
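The write-through fix amounts to: update the master, then rewrite the matching cache slot if one exists. An illustrative sketch, with Python stand-ins for the KSDS REWRITE and RRDS REWRITE (`apply_balance_update` is a hypothetical name):

```python
def apply_balance_update(acct, new_balance, ksds, slots, table_size):
    """Write-through update: after updating the KSDS master,
    rewrite the matching cache record so the two stay in sync."""
    ksds[acct] = new_balance                   # KSDS REWRITE
    slot = acct % table_size
    while slots[slot] is not None:             # walk the probe chain
        if slots[slot][0] == acct:
            slots[slot] = (acct, new_balance)  # RRDS REWRITE
            return True                        # cache updated too
        slot = (slot + 1) % table_size
    return False                               # cold account: KSDS only
```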
Issue 2: Hash Function Drift
After six months, the hot account profile shifted. New corporate accounts had different number patterns, causing more collisions:
| Month | Collisions | Avg Probes |
|---|---|---|
| Month 1 | 8,247 | 1.17 |
| Month 3 | 9,412 | 1.19 |
| Month 6 | 14,823 | 1.31 |
Fix: The QUICK-LOAD program was enhanced to log collision statistics. A monitoring job alerts operations when average probes exceed 1.5, triggering a review of the hash function parameters.
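The monitoring check can be computed from the loaded slot table itself: each record's probe count is its distance from its home slot, plus one. A sketch of the alert logic (illustrative Python; the 1.5 threshold comes from the case study):

```python
ALERT_THRESHOLD = 1.5  # avg probes that triggers a hash review

def avg_probes(slots, table_size):
    """Average probes needed to find each loaded record."""
    total = records = 0
    for actual, entry in enumerate(slots):
        if entry is None:
            continue
        home = entry[0] % table_size
        total += (actual - home) % table_size + 1  # chain distance + 1
        records += 1
    return total / records if records else 0.0

def needs_review(slots, table_size):
    return avg_probes(slots, table_size) > ALERT_THRESHOLD
```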
Lessons Learned
- Profile before optimizing: The 60/15 ratio (60% of transactions hitting 15% of accounts) made the cache viable. Without that skewed distribution, the approach would not have been worthwhile.
- Cache coherence matters: Even in batch processing, cache consistency must be managed. The stale data issue could have caused balance discrepancies.
- Monitor hash performance over time: Account number patterns change as new accounts are opened. The hash function's effectiveness can degrade without anyone noticing.
- Keep the fallback path: The KSDS lookup must always work, even if the cache fails completely. The cache is an optimization, not a replacement.
Discussion Questions
1. What would happen if the hot account profile changed dramatically — say, from 50,000 accounts to 200,000? How would you adjust the design?
2. Instead of rebuilding the cache nightly, could the cache be maintained continuously (updated as accounts change)? What are the trade-offs?
3. The team chose linear probing for collision handling. Would quadratic probing or double hashing have been significantly better for this use case? Why or why not?
4. At what point should GlobalBank consider replacing the RRDS cache with an in-memory approach (larger COBOL tables or a data-grid technology such as IBM z/OS Connect)?