Case Study 1: CNB's Batch Window Crisis and the Great Optimization
Background
City National Bank's nightly batch window had been comfortable for a decade. From 10:00 PM to 6:00 AM, eight hours of uncontested processing time handled interest calculations, statement generation, regulatory reporting, and account maintenance. The systems team, led by VP Kwame Osei, rarely thought about it.
Then the bank acquired Lakeshore Federal Credit Union, adding 4.2 million accounts to CNB's existing 12 million. Simultaneously, the business demanded that online banking be available by 5:00 AM instead of 6:00 AM for the growing West Coast customer base. The batch window shrank from eight hours to seven — while the workload grew by 35%.
The first night after the Lakeshore migration, the batch didn't finish until 5:47 AM. The second night, 6:12 AM. By Thursday, Kwame called an emergency meeting.
The Diagnosis
Lisa Hwang, lead DB2 programmer, spent the weekend analyzing the batch stream. She identified three critical bottlenecks:
Bottleneck 1: The Interest Calculation Program (INTCALC)
INTCALC processed all savings and checking accounts for daily interest accrual. Post-merger, it processed 16.2 million accounts. The program had been written in 2008 and never substantially modified.
Lisa found the first problem in the cursor declaration:
EXEC SQL
DECLARE INT-CURSOR CURSOR FOR
SELECT ACCOUNT_NUM, BALANCE, INT_RATE,
LAST_CALC_DATE, ACCOUNT_TYPE
FROM ACCOUNTS
WHERE ACCOUNT_TYPE IN ('SAV', 'CHK', 'MMA', 'CD')
AND BALANCE > 0
ORDER BY ACCOUNT_NUM
END-EXEC
No WITH HOLD. No commit interval. The program opened the cursor, processed all 16.2 million qualifying rows, and committed once at the end. It held a table lock (escalated from row locks around row 2,000) for the entire 3.5-hour run, blocking the statement generation program from starting. It consumed 1.2 GB of active log space, forcing two archive switches that stalled the entire DB2 subsystem for 90 seconds each.
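The arithmetic behind that failure is worth making explicit. Below is a rough Python model using the case study's own figures (16.2 million rows, ~1.2 GB of log); the lock-escalation threshold is an assumed installation value (DB2's NUMLKTS-style limits vary by site), not a number from the case study.

```python
# Rough model of the no-commit INTCALC run. ROWS and LOG_BYTES_TOTAL come
# from the case study; LOCK_ESCALATION_THRESHOLD is an assumed site value.

ROWS = 16_200_000
LOG_BYTES_TOTAL = 1.2 * 1024**3        # active log consumed by the full run
LOCK_ESCALATION_THRESHOLD = 2_000      # assumed row-lock limit per unit of work

def locks_held(commit_interval):
    """Row locks accumulated inside one unit of work."""
    return min(commit_interval, ROWS)

def log_bytes_per_commit(commit_interval):
    """Log data per unit of work, assuming uniform bytes per row."""
    return LOG_BYTES_TOTAL * commit_interval / ROWS

# One commit at the end: the whole table is one unit of work, so
# escalation to a table lock is guaranteed.
print(locks_held(ROWS) > LOCK_ESCALATION_THRESHOLD)    # → True
# A 2,000-row commit interval keeps each unit of work tiny (~155 KB of log).
print(round(log_bytes_per_commit(2_000) / 1024))       # → 155
```

The same model explains why the eventual fix chose a 2,000-row interval: each unit of work stays under the escalation threshold and its log footprint is trivial.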
Bottleneck 2: The Statement Generation Program (STMTGEN)
STMTGEN used a single "universal cursor" for all account types:
EXEC SQL
DECLARE STMT-CURSOR CURSOR FOR
SELECT A.ACCOUNT_NUM, A.BALANCE,
C.CUSTOMER_NAME, C.ADDRESS_LINE1,
C.ADDRESS_LINE2, C.CITY, C.STATE, C.ZIP
FROM ACCOUNTS A
JOIN CUSTOMERS C ON A.CUSTOMER_ID = C.CUSTOMER_ID
WHERE (:WS-ACCT-TYPE = 'ALL'
OR A.ACCOUNT_TYPE = :WS-ACCT-TYPE)
AND (:WS-REGION = 'ALL'
OR C.REGION = :WS-REGION)
ORDER BY C.CUSTOMER_NAME
END-EXEC
The optimizer, unable to predict which predicates would be active, chose a tablespace scan on ACCOUNTS joined with a nested loop into CUSTOMERS. For 16.2 million accounts, this meant 16.2 million random I/Os into the CUSTOMERS table — a four-hour operation.
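The four-hour figure is consistent with simple latency arithmetic. In the sketch below, the ~0.9 ms average synchronous read time is an assumed value typical of DASD of the era, not a measurement from the case study.

```python
# Sanity check on the four-hour figure: one synchronous random read into
# CUSTOMERS per account. The per-I/O latency is an assumption.

accounts = 16_200_000
ms_per_random_io = 0.9

elapsed_hours = accounts * ms_per_random_io / 1000 / 3600
print(f"{elapsed_hours:.2f} hours")    # → 4.05 hours
```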
Bottleneck 3: The Regulatory Extract Program (REGEXT)
REGEXT ran a dynamic SQL query built from a configuration table. The SQL was constructed by concatenating column values directly into the query string — no parameter markers. While this wasn't causing a performance problem, Lisa flagged it to Kwame as a security risk that would fail their next audit.
* From the 2008-vintage REGEXT program
MOVE SPACES TO WS-DYN-SQL
STRING 'SELECT * FROM TRANSACTIONS WHERE '
WS-CONFIG-WHERE-CLAUSE
' ORDER BY ' WS-CONFIG-ORDER-CLAUSE
DELIMITED BY SIZE INTO WS-DYN-SQL
END-STRING
The WS-CONFIG-WHERE-CLAUSE was loaded from a CONFIG_RULES table that was maintained by business analysts through a simple ISPF interface. No input validation, no audit trail.
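To see why Lisa flagged this, consider what raw concatenation permits. The sketch below restates the REGEXT STRING logic in Python for brevity; the table and column names mirror the case study, and the hostile clause is hypothetical.

```python
# Illustration of the injection risk in REGEXT-style concatenation.
# Mirrors the COBOL STRING logic: raw concatenation, no validation.

def build_query(config_where_clause, config_order_clause):
    return ("SELECT * FROM TRANSACTIONS WHERE "
            + config_where_clause
            + " ORDER BY " + config_order_clause)

# A well-meaning analyst's rule:
print(build_query("TRANS_DATE = CURRENT DATE", "TRANS_DATE"))

# A hostile (or merely careless) rule exfiltrates a different table:
print(build_query("1=1 UNION ALL SELECT * FROM CUSTOMERS -- ", "1"))
```

Anything a business analyst can type into CONFIG_RULES runs with the batch job's authority, which is exactly the audit finding Lisa anticipated.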
The Solution
Lisa assembled a team of three (herself, Rob Chen, and a junior programmer) for a two-week intensive rewrite. Kwame cleared their schedules of all other work.
Fix 1: INTCALC Redesign
Lisa redesigned INTCALC around the commit-checkpoint pattern with partition-aware processing:
Step 1: The ACCOUNTS table was already partitioned into 8 ranges by account number. Lisa created 8 JCL job steps, each processing one partition, running in parallel:
//INTCALC1 EXEC PGM=INTCALC,PARM='PART=1,LO=00000001,HI=19999999'
//INTCALC2 EXEC PGM=INTCALC,PARM='PART=2,LO=20000000,HI=39999999'
...
//INTCALC8 EXEC PGM=INTCALC,PARM='PART=8,LO=E0000000,HI=ZZZZZZZZ'
Step 2: Each instance used a WITH HOLD cursor with a 2,000-row commit interval:
EXEC SQL
DECLARE INT-CURSOR CURSOR WITH HOLD FOR
SELECT ACCOUNT_NUM, BALANCE, INT_RATE,
LAST_CALC_DATE, ACCOUNT_TYPE
FROM ACCOUNTS
WHERE ACCOUNT_NUM >= :WS-PART-LOW
AND ACCOUNT_NUM <= :WS-PART-HIGH
AND ACCOUNT_TYPE IN ('SAV', 'CHK', 'MMA', 'CD')
AND BALANCE > 0
AND ACCOUNT_NUM > :WS-RESTART-KEY
ORDER BY ACCOUNT_NUM
FOR FETCH ONLY
END-EXEC
Lisa chose FOR FETCH ONLY with a searched UPDATE (by primary key) rather than FOR UPDATE, because 100% of rows were updated and the read-only cursor allowed list prefetch:
3000-CALC-AND-UPDATE.
COMPUTE WS-ACCRUED = WS-BALANCE *
(WS-INT-RATE / 36500) * WS-DAYS-SINCE-LAST
EXEC SQL
UPDATE ACCOUNTS
SET BALANCE = BALANCE + :WS-ACCRUED,
LAST_CALC_DATE = CURRENT DATE,
LAST_CALC_AMT = :WS-ACCRUED
WHERE ACCOUNT_NUM = :WS-CURRENT-ACCT
END-EXEC.
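The accrual formula translates directly. Dividing the annual rate by 36,500 yields a daily decimal rate (365-day year, divided by a further 100 because INT_RATE appears to be stored as a percentage; that storage convention is an inference from the COBOL, not stated in the case study).

```python
# The INTCALC accrual formula from 3000-CALC-AND-UPDATE, in Python.

def daily_accrual(balance, annual_rate_pct, days_since_last_calc):
    # rate% / 36500 = per-day decimal rate on a 365-day basis
    return balance * (annual_rate_pct / 36500) * days_since_last_calc

# $10,000 at 3.65% for one day accrues exactly $1.00:
print(round(daily_accrual(10_000, 3.65, 1), 2))    # → 1.0
```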
Step 3: Commit-checkpoint with restart capability using the BATCH_RESTART table:
5000-COMMIT-CHECKPOINT.
EXEC SQL
MERGE INTO BATCH_RESTART R
USING (VALUES ('INTCALC', :WS-PARTITION-NUM,
CURRENT DATE))
AS S(PROG, PART, RUNDT)
ON R.PROGRAM_ID = S.PROG
AND R.PARTITION_NUM = S.PART
AND R.RUN_DATE = S.RUNDT
WHEN MATCHED THEN
UPDATE SET LAST_KEY = :WS-CURRENT-ACCT,
ROWS_COMMITTED = R.ROWS_COMMITTED
+ :WS-COMMIT-CTR,
COMMIT_TS = CURRENT TIMESTAMP
WHEN NOT MATCHED THEN
INSERT VALUES
(S.PROG, S.PART, S.RUNDT, :WS-CURRENT-ACCT,
:WS-COMMIT-CTR, CURRENT TIMESTAMP)
END-EXEC
EXEC SQL COMMIT END-EXEC
ADD WS-COMMIT-CTR TO WS-TOTAL-COMMITTED
MOVE 0 TO WS-COMMIT-CTR.
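The control flow of this checkpoint/restart pair can be simulated in a few lines of Python. The dict below stands in for the BATCH_RESTART row and a list for the committed updates; none of this is DB2 API, only the unit-of-work semantics: uncommitted work vanishes on an abend, committed work and the checkpoint survive.

```python
# Simulation of the commit-checkpoint pattern with restart.

COMMIT_INTERVAL = 2_000

def run_partition(rows, checkpoint, abend_after=None):
    """Process sorted keys above checkpoint['last_key'], committing
    every COMMIT_INTERVAL rows. An abend discards uncommitted work."""
    uncommitted = []
    for key in rows:
        if key <= checkpoint["last_key"]:
            continue                    # AND ACCOUNT_NUM > :WS-RESTART-KEY
        if abend_after is not None and key > abend_after:
            return "ABEND"              # uncommitted list is discarded
        uncommitted.append(key)
        if len(uncommitted) == COMMIT_INTERVAL:
            checkpoint["done"].extend(uncommitted)   # the COMMIT
            checkpoint["last_key"] = key             # the MERGE
            uncommitted = []
    checkpoint["done"].extend(uncommitted)           # final partial batch
    checkpoint["last_key"] = rows[-1]
    return "OK"

rows = list(range(1, 10_001))
ckpt = {"last_key": 0, "done": []}

run_partition(rows, ckpt, abend_after=5_500)   # abends mid-run at key 5,501
run_partition(rows, ckpt)                      # restart resumes from last commit
print(len(ckpt["done"]), len(set(ckpt["done"])))   # → 10000 10000
```

Note that the abended run had processed keys 4,001-5,500 without committing; the restart reprocesses them, which is why every update must be idempotent-safe under rollback, exactly as the searched UPDATE by primary key is here.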
Results: INTCALC dropped from 3.5 hours (serial, no commits) to 28 minutes (8-way parallel, 2,000-row commits). Lock escalation was eliminated. Log consumption per commit was under 5 MB. Any partition could be restarted independently.
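Those numbers cross-check against the parallelism arithmetic; the efficiency figure below is derived from the case study's elapsed times, not separately measured.

```python
# 8-way parallelism against the 3.5-hour serial run predicts roughly
# the observed 28 minutes.

serial_minutes = 210       # 3.5 h serial INTCALC run
partitions = 8
observed_minutes = 28

ideal_minutes = serial_minutes / partitions
print(round(ideal_minutes, 2))                      # → 26.25
print(round(ideal_minutes / observed_minutes, 2))   # → 0.94 parallel efficiency
```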
Fix 2: STMTGEN Redesign
Rob Chen replaced the universal cursor with a cursor pool. Since STMTGEN was always called with a specific account type and region (the "ALL" case was actually never used in production — Rob checked three years of job logs), he created targeted cursors:
* Cursor for specific account type + specific region
EXEC SQL
DECLARE STMT-C1 CURSOR WITH HOLD FOR
SELECT A.ACCOUNT_NUM, A.BALANCE,
C.CUSTOMER_NAME, C.ADDRESS_LINE1,
C.ADDRESS_LINE2, C.CITY, C.STATE, C.ZIP
FROM ACCOUNTS A
JOIN CUSTOMERS C ON A.CUSTOMER_ID = C.CUSTOMER_ID
WHERE A.ACCOUNT_TYPE = :WS-ACCT-TYPE
AND C.REGION = :WS-REGION
ORDER BY A.ACCOUNT_NUM
FOR FETCH ONLY
END-EXEC
He also added a composite index on ACCOUNTS (ACCOUNT_TYPE, CUSTOMER_ID) that gave the optimizer a matching index scan instead of a tablespace scan.
Results: STMTGEN dropped from 4 hours to 45 minutes. The new index added 200 MB of DASD but saved 3 hours of elapsed time — a trade-off Kwame approved in 10 seconds.
Fix 3: REGEXT Security Remediation
Lisa rewrote the dynamic SQL construction to use parameter markers for all value-based predicates and whitelisted column and table names:
2000-BUILD-DYNAMIC-SQL.
* Column whitelist validation
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-CONFIG-COLUMN-COUNT
EVALUATE WS-CONFIG-COLUMN(WS-IDX)
WHEN 'TRANSACTION_ID'
WHEN 'ACCOUNT_NUM'
WHEN 'TRANS_DATE'
WHEN 'AMOUNT'
WHEN 'TRANS_TYPE'
WHEN 'BRANCH_CODE'
CONTINUE
WHEN OTHER
DISPLAY 'INVALID COLUMN: '
WS-CONFIG-COLUMN(WS-IDX)
MOVE 'Y' TO WS-ERROR-FLAG
END-EVALUATE
END-PERFORM
* Build with parameter markers for values
MOVE 1 TO WS-SQL-PTR
STRING 'SELECT ' WS-VALIDATED-COLUMNS
' FROM TRANSACTIONS'
' WHERE TRANS_DATE BETWEEN ? AND ?'
DELIMITED BY SIZE INTO WS-DYN-SQL
WITH POINTER WS-SQL-PTR
END-STRING
* Append validated optional predicates with markers;
* WITH POINTER continues where the first STRING stopped
IF WS-HAS-AMOUNT-FILTER
STRING ' AND AMOUNT > ?'
DELIMITED BY SIZE INTO WS-DYN-SQL
WITH POINTER WS-SQL-PTR
END-STRING
END-IF.
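The same pattern, restated in Python to make the two rules visible at a glance: identifiers are checked against a fixed whitelist, and values only ever travel as parameter markers. The column names come from the case study; the helper itself is illustrative.

```python
# The Fix 3 pattern: whitelist identifiers, parameterize values.

ALLOWED_COLUMNS = {"TRANSACTION_ID", "ACCOUNT_NUM", "TRANS_DATE",
                   "AMOUNT", "TRANS_TYPE", "BRANCH_CODE"}

def build_extract_sql(columns, has_amount_filter=False):
    bad = [c for c in columns if c not in ALLOWED_COLUMNS]
    if bad:
        raise ValueError(f"INVALID COLUMN: {bad}")
    sql = ("SELECT " + ", ".join(columns)
           + " FROM TRANSACTIONS WHERE TRANS_DATE BETWEEN ? AND ?")
    if has_amount_filter:
        sql += " AND AMOUNT > ?"     # value arrives later, as a parameter
    return sql

print(build_extract_sql(["ACCOUNT_NUM", "AMOUNT"], has_amount_filter=True))
# → SELECT ACCOUNT_NUM, AMOUNT FROM TRANSACTIONS
#   WHERE TRANS_DATE BETWEEN ? AND ? AND AMOUNT > ?
```

Because only whitelisted tokens ever reach the SQL text, a config row containing hostile SQL fails validation instead of executing.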
She also implemented the full audit trail — every dynamic SQL execution was logged to the SQL_AUDIT_LOG table with the SQL text, parameters, executing user, and timestamp.
Outcome
After two weeks of development and one week of testing:
| Metric | Before | After | Improvement |
|---|---|---|---|
| INTCALC elapsed | 3h 30m | 28m | 86.7% |
| STMTGEN elapsed | 4h 00m | 45m | 81.3% |
| Total batch window | 7h 45m | 4h 10m | 46.2% |
| Lock escalations/night | 12-15 | 0 | 100% |
| Log archive switches | 2-3 | 0 | 100% |
| Dynamic SQL injection vectors | 3 programs | 0 | 100% |
The batch window now finished by 2:15 AM, leaving nearly three hours of margin before the 5:00 AM online availability deadline. Kwame described the project as "the best two weeks of engineering this team has done in five years."
Lessons Learned
- Serial processing is the enemy of the batch window. Partition-aware parallelism delivered the single largest improvement.
- The commit-checkpoint pattern is not optional at scale. The no-commit INTCALC program had been a ticking time bomb for years — it just hadn't exploded because the volume hadn't crossed the threshold.
- Universal cursors are performance time bombs. They work fine in development with 10,000 rows. They work fine in production with 5 million rows (if slowly). They fail catastrophically at 16 million rows.
- Security debt compounds. The dynamic SQL concatenation in REGEXT had been there since 2008. Fixing it took two days. Fixing it after a breach would have taken months — plus the regulatory consequences.
- Check your assumptions. Rob's discovery that the "ALL" case was never actually used in production saved weeks of development time. Don't optimize for code paths that don't execute.
Discussion Questions
- Lisa chose a commit interval of 2,000 rows for INTCALC. What factors would cause you to increase or decrease this interval? At what interval would lock escalation become a risk on CNB's system?
- Rob found that the "ALL" case in STMTGEN was never used. What's the risk of removing it? How would you verify this more rigorously than checking job logs?
- The REGEXT security fix used a column whitelist. What happens when the business needs to add a new column to the report? How would you design the whitelist to be maintainable without code changes?
- If INTCALC partition 5 abends at 2:00 AM while partitions 1-4 have completed and 6-8 are still running, what happens? How does the restart logic handle this partial failure?
- The new composite index on ACCOUNTS (ACCOUNT_TYPE, CUSTOMER_ID) improved STMTGEN dramatically. What is the maintenance cost of this index for INSERT and UPDATE operations on the ACCOUNTS table? When does that cost matter?