Case Study 2: Hash-Based Account Lookup
Background
Pacific Federal Credit Union serves 450,000 members with accounts ranging from basic savings to complex commercial lending facilities. The credit union's core banking system processes approximately 800,000 transactions daily during the overnight batch cycle. Each transaction references an account number, and the posting program must look up the account's master record to validate the transaction and apply the update.
The existing account lookup uses a VSAM KSDS (Key-Sequenced Data Set) with the 10-digit account number as the primary key. While VSAM KSDS provides excellent performance for most workloads, the batch posting program's access pattern creates a specific problem: the transactions arrive in chronological order (not sorted by account), causing the VSAM index to thrash between widely separated areas of the file. During peak batch windows, the index buffer pool overflows and VSAM must perform physical disk I/O for nearly every random read. The batch window is tightening, and the DBA team has asked for an alternative access method.
The solution is a hash-based account lookup using a COBOL relative file. Instead of maintaining an index structure, the program computes the record's slot number directly from the account number using a hashing function. This eliminates the index entirely -- every lookup requires exactly one disk I/O (or two, in the case of a collision). The relative file acts as a hash table stored on disk.
Hash Table Design
The Hashing Concept
A hash function converts a key value (the account number) into a slot number within a fixed range. The ideal hash function distributes keys uniformly across all available slots, minimizing collisions (cases where two different keys map to the same slot).
For the credit union's 450,000 accounts, the team allocates a relative file with 600,000 slots -- approximately 1.33 times the number of accounts. This load factor of 0.75 (450,000 / 600,000) provides a good balance between space efficiency and collision avoidance. A lower load factor reduces collisions but wastes disk space; a higher load factor saves space but increases collision chains.
The Hash Function
The hash function uses a division-remainder method, which is simple and effective for numeric keys:
WORKING-STORAGE SECTION.
01 WS-HASH-WORK.
05 WS-HASH-INPUT PIC 9(10).
05 WS-HASH-PRIME PIC 9(6) VALUE 599993.
05 WS-HASH-RESULT PIC 9(6).
05 WS-TABLE-SIZE PIC 9(6) VALUE 600000.
01 WS-REL-KEY PIC 9(6).
5000-COMPUTE-HASH.
* Division-remainder hashing with a prime divisor
* The prime number reduces clustering for sequential keys
DIVIDE WS-HASH-INPUT BY WS-HASH-PRIME
GIVING WS-HASH-WORK
REMAINDER WS-HASH-RESULT
* Ensure slot is between 1 and TABLE-SIZE
* (relative files start at slot 1, not slot 0)
ADD 1 TO WS-HASH-RESULT
MOVE WS-HASH-RESULT TO WS-REL-KEY
.
The prime number 599,993 (the largest prime less than 600,000) is used as the divisor because prime divisors produce better distribution than composite numbers, especially when account numbers contain patterns (such as sequential assignment or branch-prefix schemes).
Collision Resolution: Linear Probing
When two account numbers hash to the same slot, a collision occurs. The program resolves collisions using linear probing: it examines consecutive slots starting from the hash position until it finds either the target account or an empty slot.
01 WS-COLLISION-FIELDS.
05 WS-PROBE-COUNT PIC 9(4).
05 WS-MAX-PROBES PIC 9(4) VALUE 100.
05 WS-ORIGINAL-HASH PIC 9(6).
05 WS-PROBE-SLOT PIC 9(6).
05 WS-PROBE-RESULT PIC X(1).
88 PROBE-FOUND VALUE "F".
88 PROBE-EMPTY VALUE "E".
88 PROBE-COLLISION VALUE "C".
88 PROBE-EXHAUSTED VALUE "X".
File Definitions
The Hash Table File (Relative Organization)
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT HASH-TABLE-FILE
ASSIGN TO HASHTBL
ORGANIZATION IS RELATIVE
ACCESS MODE IS RANDOM
RELATIVE KEY IS WS-REL-KEY
FILE STATUS IS WS-HT-STATUS.
DATA DIVISION.
FILE SECTION.
FD HASH-TABLE-FILE
RECORD CONTAINS 150 CHARACTERS.
01 HASH-TABLE-RECORD.
05 HTR-SLOT-STATUS PIC X(1).
88 HTR-OCCUPIED VALUE "O".
88 HTR-EMPTY VALUE "E".
88 HTR-DELETED VALUE "D".
05 HTR-ACCOUNT-NUMBER PIC X(10).
05 HTR-MEMBER-NAME PIC X(30).
05 HTR-ACCOUNT-TYPE PIC X(2).
88 HTR-SAVINGS VALUE "SV".
88 HTR-CHECKING VALUE "CK".
88 HTR-MONEY-MARKET VALUE "MM".
88 HTR-CERTIFICATE VALUE "CD".
88 HTR-LOAN VALUE "LN".
88 HTR-CREDIT-LINE VALUE "CL".
05 HTR-BRANCH-CODE PIC X(4).
05 HTR-CURRENT-BALANCE PIC S9(11)V99.
05 HTR-AVAILABLE-BALANCE PIC S9(11)V99.
05 HTR-LAST-TXN-DATE PIC 9(8).
05 HTR-OPEN-DATE PIC 9(8).
05 HTR-STATUS-CODE PIC X(1).
88 HTR-ACTIVE VALUE "A".
88 HTR-FROZEN VALUE "F".
88 HTR-CLOSED VALUE "C".
88 HTR-DORMANT VALUE "D".
05 HTR-FILLER PIC X(48).
The HTR-SLOT-STATUS field is critical for the hash table algorithm:
- "O" (Occupied): The slot contains a valid account record.
- "E" (Empty): The slot has never been used. During a lookup probe, an empty slot means the target account does not exist in the table.
- "D" (Deleted): The slot previously held a record that has been logically deleted. During a lookup, the probe must continue past deleted slots (they might have been part of a collision chain). During an insert, deleted slots can be reused.
The Source Account Master File
SELECT ACCOUNT-MASTER-FILE
ASSIGN TO ACCTMAST
ORGANIZATION IS INDEXED
ACCESS MODE IS SEQUENTIAL
RECORD KEY IS AMR-ACCOUNT-NUMBER
FILE STATUS IS WS-AM-STATUS.
FD ACCOUNT-MASTER-FILE
RECORD CONTAINS 150 CHARACTERS.
01 ACCOUNT-MASTER-RECORD.
05 AMR-ACCOUNT-NUMBER PIC X(10).
05 AMR-MEMBER-NAME PIC X(30).
05 AMR-ACCOUNT-TYPE PIC X(2).
05 AMR-BRANCH-CODE PIC X(4).
05 AMR-CURRENT-BALANCE PIC S9(11)V99.
05 AMR-AVAILABLE-BALANCE PIC S9(11)V99.
05 AMR-LAST-TXN-DATE PIC 9(8).
05 AMR-OPEN-DATE PIC 9(8).
05 AMR-STATUS-CODE PIC X(1).
05 AMR-FILLER PIC X(48).
Building the Hash Table
The hash table is built from the account master file during a nightly initialization step. This program reads every account record, computes its hash, and inserts it into the relative file:
IDENTIFICATION DIVISION.
PROGRAM-ID. HTBUILD.
PROCEDURE DIVISION.
0000-MAIN.
PERFORM 1000-INITIALIZE
PERFORM 2000-PROCESS-ACCOUNTS
UNTIL WS-AM-EOF
PERFORM 3000-REPORT-STATISTICS
PERFORM 9000-TERMINATE
STOP RUN
.
1000-INITIALIZE.
OPEN INPUT ACCOUNT-MASTER-FILE
OPEN OUTPUT HASH-TABLE-FILE
OPEN OUTPUT BUILD-REPORT-FILE
* Pre-allocate all slots as empty
INITIALIZE HASH-TABLE-RECORD
SET HTR-EMPTY TO TRUE
PERFORM VARYING WS-REL-KEY
FROM 1 BY 1
UNTIL WS-REL-KEY > WS-TABLE-SIZE
WRITE HASH-TABLE-RECORD
INVALID KEY
DISPLAY "INIT ERROR AT SLOT " WS-REL-KEY
STOP RUN
END-WRITE
END-PERFORM
MOVE 0 TO WS-RECORDS-LOADED
MOVE 0 TO WS-TOTAL-COLLISIONS
MOVE 0 TO WS-MAX-PROBE-DEPTH
MOVE 0 TO WS-INSERT-FAILURES
PERFORM 1100-READ-ACCOUNT-MASTER
.
2000-PROCESS-ACCOUNTS.
PERFORM 5000-COMPUTE-HASH
PERFORM 5100-INSERT-WITH-PROBING
ADD 1 TO WS-RECORDS-LOADED
PERFORM 1100-READ-ACCOUNT-MASTER
.
The Insert Logic with Linear Probing
5100-INSERT-WITH-PROBING.
MOVE WS-REL-KEY TO WS-ORIGINAL-HASH
MOVE 0 TO WS-PROBE-COUNT
PERFORM 5110-TRY-INSERT
UNTIL PROBE-FOUND
OR PROBE-EMPTY
OR PROBE-EXHAUSTED
IF PROBE-EXHAUSTED
ADD 1 TO WS-INSERT-FAILURES
DISPLAY "HASH TABLE FULL: CANNOT INSERT "
AMR-ACCOUNT-NUMBER
END-IF
.
5110-TRY-INSERT.
READ HASH-TABLE-FILE INTO HASH-TABLE-RECORD
INVALID KEY
SET PROBE-EMPTY TO TRUE
PERFORM 5120-WRITE-RECORD
END-READ
IF WS-HT-STATUS = "00"
EVALUATE TRUE
WHEN HTR-EMPTY
WHEN HTR-DELETED
* Slot is available -- insert here
SET PROBE-EMPTY TO TRUE
PERFORM 5120-WRITE-RECORD
WHEN HTR-OCCUPIED
* Slot is taken -- check if same account
IF HTR-ACCOUNT-NUMBER =
AMR-ACCOUNT-NUMBER
SET PROBE-FOUND TO TRUE
DISPLAY "DUPLICATE ACCOUNT: "
AMR-ACCOUNT-NUMBER
ELSE
* Collision -- try next slot
ADD 1 TO WS-PROBE-COUNT
ADD 1 TO WS-TOTAL-COLLISIONS
IF WS-PROBE-COUNT > WS-MAX-PROBES
SET PROBE-EXHAUSTED TO TRUE
ELSE
ADD 1 TO WS-REL-KEY
IF WS-REL-KEY > WS-TABLE-SIZE
MOVE 1 TO WS-REL-KEY
END-IF
END-IF
END-IF
END-EVALUATE
END-IF
.
5120-WRITE-RECORD.
* Transfer account data into hash table record
SET HTR-OCCUPIED TO TRUE
MOVE AMR-ACCOUNT-NUMBER TO HTR-ACCOUNT-NUMBER
MOVE AMR-MEMBER-NAME TO HTR-MEMBER-NAME
MOVE AMR-ACCOUNT-TYPE TO HTR-ACCOUNT-TYPE
MOVE AMR-BRANCH-CODE TO HTR-BRANCH-CODE
MOVE AMR-CURRENT-BALANCE TO HTR-CURRENT-BALANCE
MOVE AMR-AVAILABLE-BALANCE
TO HTR-AVAILABLE-BALANCE
MOVE AMR-LAST-TXN-DATE TO HTR-LAST-TXN-DATE
MOVE AMR-OPEN-DATE TO HTR-OPEN-DATE
MOVE AMR-STATUS-CODE TO HTR-STATUS-CODE
REWRITE HASH-TABLE-RECORD
INVALID KEY
DISPLAY "REWRITE ERROR AT SLOT "
WS-REL-KEY
END-REWRITE
IF WS-PROBE-COUNT > WS-MAX-PROBE-DEPTH
MOVE WS-PROBE-COUNT TO WS-MAX-PROBE-DEPTH
END-IF
.
Looking Up Accounts During Transaction Processing
The batch posting program uses the hash table for fast account lookups:
6000-LOOKUP-ACCOUNT.
MOVE TXN-ACCOUNT-NUMBER TO WS-HASH-INPUT
PERFORM 5000-COMPUTE-HASH
MOVE WS-REL-KEY TO WS-ORIGINAL-HASH
MOVE 0 TO WS-PROBE-COUNT
SET PROBE-COLLISION TO TRUE
PERFORM 6100-PROBE-FOR-ACCOUNT
UNTIL PROBE-FOUND
OR PROBE-EMPTY
OR PROBE-EXHAUSTED
IF PROBE-FOUND
MOVE HTR-CURRENT-BALANCE
TO WS-ACCOUNT-BALANCE
MOVE HTR-AVAILABLE-BALANCE
TO WS-AVAILABLE-BALANCE
MOVE HTR-STATUS-CODE TO WS-ACCOUNT-STATUS
SET WS-LOOKUP-OK TO TRUE
ELSE
SET WS-LOOKUP-NOT-FOUND TO TRUE
ADD 1 TO WS-LOOKUP-MISS-COUNT
END-IF
.
6100-PROBE-FOR-ACCOUNT.
READ HASH-TABLE-FILE INTO HASH-TABLE-RECORD
INVALID KEY
SET PROBE-EMPTY TO TRUE
END-READ
IF WS-HT-STATUS = "00"
EVALUATE TRUE
WHEN HTR-EMPTY
* Empty slot reached -- account not found
SET PROBE-EMPTY TO TRUE
WHEN HTR-DELETED
* Deleted slot -- must continue probing
ADD 1 TO WS-PROBE-COUNT
IF WS-PROBE-COUNT > WS-MAX-PROBES
SET PROBE-EXHAUSTED TO TRUE
ELSE
ADD 1 TO WS-REL-KEY
IF WS-REL-KEY > WS-TABLE-SIZE
MOVE 1 TO WS-REL-KEY
END-IF
END-IF
WHEN HTR-OCCUPIED
IF HTR-ACCOUNT-NUMBER =
TXN-ACCOUNT-NUMBER
SET PROBE-FOUND TO TRUE
ELSE
ADD 1 TO WS-PROBE-COUNT
IF WS-PROBE-COUNT > WS-MAX-PROBES
SET PROBE-EXHAUSTED TO TRUE
ELSE
ADD 1 TO WS-REL-KEY
IF WS-REL-KEY > WS-TABLE-SIZE
MOVE 1 TO WS-REL-KEY
END-IF
END-IF
END-IF
END-EVALUATE
END-IF
.
Updating Balances After Posting
After a transaction is validated and applied, the program writes the updated balance back to the hash table:
7000-UPDATE-ACCOUNT-BALANCE.
* WS-REL-KEY still points to the record from lookup
MOVE WS-NEW-BALANCE TO HTR-CURRENT-BALANCE
MOVE WS-NEW-AVAILABLE TO HTR-AVAILABLE-BALANCE
MOVE WS-CURRENT-DATE TO HTR-LAST-TXN-DATE
REWRITE HASH-TABLE-RECORD
INVALID KEY
PERFORM 9300-REWRITE-ERROR
END-REWRITE
.
JCL for the Hash Table Build and Posting Jobs
//HTBUILD JOB (ACCT),'HASH TABLE BUILD',
// CLASS=A,MSGCLASS=X,NOTIFY=&SYSUID
//*
//* STEP 1: BUILD HASH TABLE FROM ACCOUNT MASTER
//*
//BUILD EXEC PGM=HTBUILD
//STEPLIB DD DSN=PROD.LOADLIB,DISP=SHR
//ACCTMAST DD DSN=CREDIT.ACCOUNT.MASTER,DISP=SHR
//HASHTBL DD DSN=CREDIT.HASH.TABLE,DISP=OLD
//BUILDRPT DD SYSOUT=*
//SYSOUT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//*
//* STEP 2: POST TRANSACTIONS USING HASH LOOKUPS
//*
//POST EXEC PGM=HTPOST,COND=(0,NE)
//STEPLIB DD DSN=PROD.LOADLIB,DISP=SHR
//HASHTBL DD DSN=CREDIT.HASH.TABLE,DISP=OLD
//TXNFILE DD DSN=CREDIT.DAILY.TRANSACTIONS(0),DISP=SHR
//POSTRPT DD SYSOUT=*
//REJECTS DD DSN=CREDIT.POST.REJECTS(+1),
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(5,2)),
// DCB=(RECFM=FB,LRECL=200,BLKSIZE=0)
//SYSOUT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
Build Statistics Report
After building the hash table, the program produces a report that helps the DBA team monitor the table's health:
3000-REPORT-STATISTICS.
DISPLAY "========================================"
DISPLAY " HASH TABLE BUILD STATISTICS"
DISPLAY "========================================"
DISPLAY " TABLE SIZE: " WS-TABLE-SIZE
DISPLAY " RECORDS LOADED: " WS-RECORDS-LOADED
COMPUTE WS-LOAD-FACTOR ROUNDED =
WS-RECORDS-LOADED / WS-TABLE-SIZE * 100
DISPLAY " LOAD FACTOR: " WS-LOAD-FACTOR "%"
DISPLAY " TOTAL COLLISIONS: " WS-TOTAL-COLLISIONS
COMPUTE WS-AVG-PROBES ROUNDED =
WS-TOTAL-COLLISIONS / WS-RECORDS-LOADED
DISPLAY " AVG PROBES/INSERT: " WS-AVG-PROBES
DISPLAY " MAX PROBE DEPTH: " WS-MAX-PROBE-DEPTH
DISPLAY " INSERT FAILURES: " WS-INSERT-FAILURES
DISPLAY "========================================"
.
A healthy hash table will show an average probe depth below 2.0 and a maximum probe depth below 15. If the statistics reveal clustering problems, the DBA team can adjust the table size or switch to a different hash function.
Performance Comparison
The team measured lookup performance over a batch run of 800,000 transactions:
| Metric | VSAM KSDS (Indexed) | VSAM RRDS (Hash) |
|---|---|---|
| Avg. I/Os per lookup | 3.7 | 1.2 |
| Max I/Os per lookup | 6 | 4 |
| Batch elapsed time | 47 min | 19 min |
| CPU time | 12 min | 11 min |
| Buffer pool hit ratio | 68% | N/A (no index) |
The hash-based approach reduced elapsed time by 60 percent. The CPU time was similar because the hash computation is trivial compared to disk I/O. The primary savings came from eliminating index I/O operations.
Limitations and Trade-Offs
What Hash Tables Cannot Do
-
Sequential access by key: A hash table scatters accounts randomly across the file. You cannot read accounts in account-number order without sorting the output. For reports that need sorted output, the VSAM KSDS remains necessary.
-
Range queries: Finding all accounts between 5000000000 and 5000001000 requires probing every possible account number. An indexed file handles this with a START and sequential read.
-
Growth: When the file reaches 80 percent capacity, collision chains lengthen rapidly. The table must be rebuilt with a larger size, which requires an outage.
-
Deletion complexity: The three-state slot status (Occupied, Empty, Deleted) is necessary because simply emptying a slot could break collision chains for other records that probed past that slot during insertion.
When to Choose Hash-Based Relative Files
Hash-based relative files are most appropriate when: - The workload is predominantly random read/write by a single key - Sequential access by key order is not required during the batch window - The number of records is predictable and does not grow rapidly - Minimizing I/O response time is critical
Discussion Questions
-
Why is a prime number used as the hash divisor rather than the table size itself? What distribution problem would arise if the divisor were 600,000 (a highly composite number)?
-
The three-state slot status (Occupied, Empty, Deleted) introduces complexity. Explain why simply marking a deleted slot as Empty would corrupt the hash table. Provide a concrete example with two account numbers that hash to the same slot.
-
At what load factor would you recommend rebuilding the hash table with more slots? What observable metrics would trigger this decision?
-
The current hash function uses simple division-remainder. Research and describe one alternative hash function (such as multiplicative hashing or folding) and explain whether it would be better suited to 10-digit account numbers.
-
How could the design be extended to support a secondary lookup key (such as member Social Security Number) without duplicating the account data? Consider what additional files or structures would be needed.