Case Study 1: CNB's Migration from Cursor Loops to Multi-Row FETCH
Background
City National Bank processes approximately 500 million transactions daily across its core banking platform. The nightly batch cycle — which posts transactions, calculates running balances, updates account summaries, generates regulatory reports, and produces customer statements — must complete within a 4-hour batch window between 11:00 PM and 3:00 AM EST. By Q2 2025, the batch cycle was consistently running 3 hours and 47 minutes, leaving only 13 minutes of slack. Volume was growing at 8% annually. Without intervention, the batch window would be breached within six months.
Kwame Asante, Senior Systems Programmer, was tasked with leading the optimization effort. His team: Lisa Chen (DB2 DBA, 18 years experience) and Rob Martinez (COBOL application developer, 12 years experience).
The Problem in Detail
Kwame's first step was instrumentation. He added DB2 accounting trace data collection to the six largest batch jobs, which together consumed 82% of the batch window. The results were damning:
| Job | Rows Processed | SQL Calls | Thread Switch Time | SQL Execution Time | COBOL Logic Time |
|---|---|---|---|---|---|
| TXNPOST | 500M | 1.5B | 62 min | 89 min | 17 min |
| BALCALC | 500M | 1.0B | 42 min | 71 min | 24 min |
| ACCTSUM | 12M | 36M | 1.5 min | 8 min | 3 min |
| REGFEED | 50M | 150M | 6.3 min | 22 min | 9 min |
| CUSTSTMT | 8M | 32M | 1.3 min | 11 min | 14 min |
| AUDITARC | 200M | 400M | 16.7 min | 34 min | 5 min |
The numbers told a clear story: across the six jobs, thread-switching overhead consumed 130 minutes — more than two hours of the batch window spent doing nothing but crossing the address space boundary.
"I looked at those numbers and realized we'd been optimizing the wrong thing for years," Kwame said. "We'd spend weeks tuning SQL access paths to save 3% on execution time. Meanwhile, a third of our batch window was pure overhead from the single-row programming model."
The Migration Plan
Lisa designed a three-phase migration plan:
Phase 1: Multi-Row FETCH (4 weeks)
Convert all six jobs from single-row FETCH to multi-row FETCH with a rowset size of 500. This was the lowest-risk change — the COBOL business logic didn't change at all. Only the data retrieval mechanism changed.
Phase 2: Multi-Row INSERT (3 weeks)
Convert the three jobs that produce output records (TXNPOST, AUDITARC, REGFEED) to use multi-row INSERT with the accumulate-then-flush pattern.
Phase 3: MERGE and OLAP (6 weeks)
Replace the SELECT/INSERT/UPDATE patterns in ACCTSUM with MERGE, and replace the cursor-based running balance logic in BALCALC with OLAP functions.
Phase 1: The Multi-Row FETCH Conversion
Rob started with TXNPOST, the largest and most critical job. The original program had a straightforward structure:
OPEN cursor.
LOOP: FETCH one row.
      Validate transaction.
      Determine posting action.
      UPDATE transaction status.
      INSERT audit record.
END-LOOP.
CLOSE cursor.
The multi-row conversion required:
1. WORKING-STORAGE changes: Rob declared host variable arrays with OCCURS 500 for all FETCH columns, plus matching indicator arrays. Total additional WORKING-STORAGE: 94 KB (500 rows * 187 bytes per row + indicator overhead).
2. Cursor declaration: Added WITH ROWSET POSITIONING to the existing cursor declaration. The SELECT statement itself didn't change.
3. FETCH logic: Replaced the single-row FETCH with FETCH NEXT ROWSET FROM... FOR :WS-ROWSET-SIZE ROWS. Added SQLERRD(3) checking to determine actual row count.
4. Processing loop: Added an inner loop (PERFORM VARYING WS-IDX FROM 1 BY 1 UNTIL WS-IDX > WS-ROWS-FETCHED) to process each row in the rowset. The business logic in PROCESS-TRANSACTION was unchanged — it now referenced WS-TXN-ACCT(WS-IDX) instead of WS-TXN-ACCT.
5. End-of-data handling: The outer loop now checked for SQLCODE 100 after processing all rows in the final rowset, using SQLERRD(3) to process only valid rows.
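Put together, the five changes yield a loop of roughly the following shape. This is a minimal sketch, not CNB's actual program: the cursor, table, and column names are invented, while the data names (WS-ROWSET-SIZE, WS-ROWS-FETCHED, WS-IDX, WS-TXN-ACCT) follow the ones mentioned above.

```cobol
       01  WS-ROWSET-SIZE        PIC S9(4) COMP  VALUE 500.
       01  WS-ROWS-FETCHED       PIC S9(9) COMP  VALUE 0.
       01  WS-IDX                PIC S9(4) COMP.
       01  WS-TXN-ARRAYS.
           05  WS-TXN-ACCT       PIC X(10)            OCCURS 500.
           05  WS-TXN-AMT        PIC S9(13)V99 COMP-3 OCCURS 500.
           05  WS-TXN-MEMO       PIC X(40)            OCCURS 500.
           05  WS-IND-MEMO       PIC S9(4) COMP       OCCURS 500.

      *    Cursor gains WITH ROWSET POSITIONING; SELECT unchanged.
           EXEC SQL DECLARE TXNCSR CURSOR
                WITH ROWSET POSITIONING FOR
                SELECT ACCT_NUM, TXN_AMT, MEMO
                  FROM TXN_DETAIL
                 ORDER BY ACCT_NUM, TXN_SEQ
           END-EXEC.

       FETCH-NEXT-ROWSET.
           EXEC SQL
                FETCH NEXT ROWSET FROM TXNCSR
                  FOR :WS-ROWSET-SIZE ROWS
                 INTO :WS-TXN-ACCT,
                      :WS-TXN-AMT,
                      :WS-TXN-MEMO :WS-IND-MEMO
           END-EXEC
      *    SQLERRD(3) carries the rows actually returned; it can be
      *    less than 500 on the final rowset, where SQLCODE is +100.
           MOVE SQLERRD(3) TO WS-ROWS-FETCHED
           PERFORM VARYING WS-IDX FROM 1 BY 1
                   UNTIL WS-IDX > WS-ROWS-FETCHED
               PERFORM PROCESS-TRANSACTION
           END-PERFORM
      *    Clear the count so later logic cannot reuse a stale rowset.
           MOVE 0 TO WS-ROWS-FETCHED.
```

The outer driver keeps performing FETCH-NEXT-ROWSET until SQLCODE 100, processing the partial final rowset before exiting.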
The First Bug
During unit testing, Rob noticed that the last batch of transactions in every test run was being processed twice. The problem: when the final FETCH returned SQLCODE 100 with SQLERRD(3) = 283, the outer loop detected the SQLCODE 100 and exited — but then the "final processing" paragraph at the end of the program re-processed the last rowset because WS-ROWS-FETCHED still contained 283.
The fix was to set WS-ROWS-FETCHED to 0 after processing each rowset. Simple, but it wouldn't have been caught without thorough testing with known test data and reconciliation of output counts.
The Second Bug
Integration testing revealed a more subtle problem. TXNPOST processes transactions in ACCT_NUM order and maintains a control break to detect account changes. With single-row FETCH, the control break occurred between two FETCHes. With multi-row FETCH, the control break could occur within a rowset — at any position.
Rob's original control-break logic assumed that WS-PREV-ACCT was set from the most recent FETCH. With multi-row FETCH, WS-PREV-ACCT needed to be compared against each row in the array individually. He restructured the inner loop:
           PERFORM VARYING WS-IDX FROM 1 BY 1
                   UNTIL WS-IDX > WS-ROWS-FETCHED
               IF WS-TXN-ACCT(WS-IDX) NOT = WS-PREV-ACCT
                   IF WS-PREV-ACCT NOT = SPACES
                       PERFORM ACCOUNT-BREAK-PROCESSING
                   END-IF
                   MOVE WS-TXN-ACCT(WS-IDX) TO WS-PREV-ACCT
               END-IF
               PERFORM PROCESS-TRANSACTION
           END-PERFORM.
The Third Bug
Production parallel testing (running the new program alongside the old one and comparing results) revealed that nullable columns were causing mismatches. The original single-row program used indicator variables for three nullable columns. Rob had declared indicator arrays, but in the inner processing loop, he was checking WS-IND-MEMO instead of WS-IND-MEMO(WS-IDX). Rows where MEMO was NULL in the first rowset position were treated as NULL for all 500 rows; rows where MEMO was not NULL in position 1 masked NULLs in later positions.
"That one scared me," Rob admitted. "It would have corrupted audit records for any transaction where the memo field was NULL if it happened to fall after a non-NULL row in the same rowset. We caught it in parallel testing because the audit record counts didn't match."
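The defect comes down to one missing subscript. A before/after sketch, using the data names from the narrative (WS-AUDIT-MEMO is an invented target field, and the sketch assumes a leftover single-row indicator variable let the unsubscripted reference compile cleanly):

```cobol
      *    Wrong: a single indicator value stands in for the whole
      *    rowset, so one row's NULLness is applied to all 500 rows.
           IF WS-IND-MEMO < 0
               MOVE SPACES TO WS-AUDIT-MEMO
           ELSE
               MOVE WS-TXN-MEMO(WS-IDX) TO WS-AUDIT-MEMO
           END-IF

      *    Right: subscript the indicator array in step with the
      *    data array it describes.
           IF WS-IND-MEMO(WS-IDX) < 0
               MOVE SPACES TO WS-AUDIT-MEMO
           ELSE
               MOVE WS-TXN-MEMO(WS-IDX) TO WS-AUDIT-MEMO
           END-IF
```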
Phase 1 Results
After four weeks of development, testing, and parallel production validation, Phase 1 went live:
| Job | Before (Thread Switch) | After (Thread Switch) | Savings |
|---|---|---|---|
| TXNPOST | 62 min | 0.12 min | 61.88 min |
| BALCALC | 42 min | 0.08 min | 41.92 min |
| ACCTSUM | 1.5 min | 0.003 min | 1.50 min |
| REGFEED | 6.3 min | 0.013 min | 6.29 min |
| CUSTSTMT | 1.3 min | 0.003 min | 1.30 min |
| AUDITARC | 16.7 min | 0.033 min | 16.67 min |
| Total | 130 min | 0.25 min | 129.75 min |
Thread-switching overhead dropped from 130 minutes to 15 seconds. The total batch window dropped from 3:47 to 1:37.
Phase 2: Multi-Row INSERT
With the FETCH side optimized, Rob converted the INSERT operations in TXNPOST, AUDITARC, and REGFEED to multi-row INSERT with the accumulate-then-flush pattern.
The key design decisions:
- TXNPOST (posting transactions to ledger): ATOMIC mode. A partial post is worse than no post — financial integrity requires all-or-nothing within each batch.
- AUDITARC (audit archival): NOT ATOMIC. Losing a few audit records is acceptable; stopping the archive job is not.
- REGFEED (regulatory feed): ATOMIC mode. Regulatory submissions must be complete.
Rob implemented a reusable COPY member (MRINSERT.cpy) that encapsulated the accumulate-then-flush pattern, parameterized by table name, column list, and rowset size. This allowed all three jobs to share the same insert logic with different configurations.
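The COPY member itself is not reproduced in the case study, but the accumulate-then-flush pattern it encapsulates looks roughly like the sketch below, shown here configured for AUDITARC. All table, column, and data names are invented; the NOT ATOMIC clause matches the design choice described above (the TXNPOST and REGFEED configurations would specify ATOMIC instead).

```cobol
       01  WS-INS-COUNT          PIC S9(4) COMP  VALUE 0.
       01  WS-INS-MAX            PIC S9(4) COMP  VALUE 500.
       01  WS-AUDIT-ARRAYS.
           05  WS-AUD-KEY        PIC X(20)   OCCURS 500.
           05  WS-AUD-TEXT       PIC X(100)  OCCURS 500.

       ACCUMULATE-AUDIT-ROW.
      *    Called once per output record: buffer it, flush when full.
           ADD 1 TO WS-INS-COUNT
           MOVE TXN-KEY    TO WS-AUD-KEY  (WS-INS-COUNT)
           MOVE AUDIT-TEXT TO WS-AUD-TEXT (WS-INS-COUNT)
           IF WS-INS-COUNT = WS-INS-MAX
               PERFORM FLUSH-AUDIT-ROWS
           END-IF.

       FLUSH-AUDIT-ROWS.
      *    One SQL call inserts the whole buffer. Must also be
      *    performed at end-of-job to drain a partial buffer.
           IF WS-INS-COUNT > 0
               EXEC SQL
                    INSERT INTO AUDIT_ARCHIVE (AUD_KEY, AUD_TEXT)
                    VALUES (:WS-AUD-KEY, :WS-AUD-TEXT)
                      FOR :WS-INS-COUNT ROWS
                    NOT ATOMIC CONTINUE ON SQLEXCEPTION
               END-EXEC
               MOVE 0 TO WS-INS-COUNT
           END-IF.
```

Forgetting the end-of-job flush is the classic failure mode of this pattern: the last partial buffer silently never reaches the table, which is exactly the kind of discrepancy parallel-run reconciliation exists to catch.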
Phase 2 Results
| Job | Before (INSERT overhead) | After | Savings |
|---|---|---|---|
| TXNPOST | 34 min | 0.07 min | 33.93 min |
| AUDITARC | 18 min | 0.04 min | 17.96 min |
| REGFEED | 4 min | 0.01 min | 3.99 min |
| Total | 56 min | 0.12 min | 55.88 min |
Cumulative batch window: 1:37 → 0:41.
Phase 3: MERGE and OLAP
Lisa took the lead on Phase 3, which required SQL redesign rather than mechanical conversion.
ACCTSUM: The account summary job used the classic SELECT-then-INSERT-or-UPDATE pattern. Lisa replaced it with a MERGE that used a global temporary table as the source, loaded via multi-row INSERT. The MERGE reduced the SQL complexity from 47 lines of COBOL SQL logic to 22 lines, eliminated the concurrency retry logic (15 lines of SQLCODE -803 handling), and cut the job's elapsed time from 12.5 minutes to 4.2 minutes.
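In outline, Lisa's replacement statement has the shape below. Every table and column name is invented for illustration; SESSION.TXN_SUMM stands in for the global temporary table loaded via multi-row INSERT earlier in the job, and the table-reference form of USING shown here assumes the extended MERGE available in DB2 12 and later.

```sql
-- Illustrative names only. SESSION.TXN_SUMM is a declared global
-- temporary table, loaded earlier in the job via multi-row INSERT.
MERGE INTO ACCT_SUMMARY AS T
USING (SELECT ACCT_NUM, TXN_COUNT, TXN_TOTAL
         FROM SESSION.TXN_SUMM) AS S
   ON T.ACCT_NUM = S.ACCT_NUM
WHEN MATCHED THEN
   UPDATE SET T.TXN_COUNT = T.TXN_COUNT + S.TXN_COUNT,
              T.TXN_TOTAL = T.TXN_TOTAL + S.TXN_TOTAL
WHEN NOT MATCHED THEN
   INSERT (ACCT_NUM, TXN_COUNT, TXN_TOTAL)
   VALUES (S.ACCT_NUM, S.TXN_COUNT, S.TXN_TOTAL)
```

A single MERGE also removes the race between the SELECT and the INSERT, which is why the SQLCODE -803 retry logic could be deleted outright.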
BALCALC: This was the most ambitious change. The running balance calculation used a cursor loop that maintained running totals across 12 million accounts. Lisa replaced the entire calculation with a single SQL statement using SUM(TXN_AMT) OVER (PARTITION BY ACCT_NUM ORDER BY TXN_DATE, TXN_SEQ ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
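Written out, the replacement query has roughly this shape. The table and non-OLAP column names are invented; the OVER clause is the one quoted above.

```sql
-- One pass computes every account's running balance; the cursor
-- loop and its COBOL accumulator variables disappear entirely.
SELECT ACCT_NUM,
       TXN_DATE,
       TXN_SEQ,
       TXN_AMT,
       SUM(TXN_AMT) OVER (PARTITION BY ACCT_NUM
                              ORDER BY TXN_DATE, TXN_SEQ
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND CURRENT ROW) AS RUN_BAL
  FROM TXN_LEDGER
 ORDER BY ACCT_NUM, TXN_DATE, TXN_SEQ
```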
The challenge was TEMP space. With 500 million transactions partitioned by account, the OLAP sort required 180 GB of TEMP space. Lisa worked with the storage team to allocate dedicated TEMP tablespaces for the BALCALC job, backed by high-speed flash storage.
Phase 3 Results
| Job | Before | After | Savings |
|---|---|---|---|
| ACCTSUM | 12.5 min | 4.2 min | 8.3 min |
| BALCALC | 137 min (was 95 after Phase 1) | 38 min | 57 min |
Final Results
| Metric | Before Migration | After Migration | Improvement |
|---|---|---|---|
| Total batch elapsed | 3:47 | 0:34 | 85% reduction |
| Thread switches | ~3.1 billion | ~6.2 million | 99.8% reduction |
| Batch window slack | 13 minutes | 3 hours 26 minutes | 15x improvement |
| Runway before window breach | 6 months | 8+ years | At 8% annual volume growth |
Lessons Learned
1. Measure before you optimize. Without the DB2 accounting trace data showing that thread-switching overhead consumed 130 minutes, Kwame's team would have continued tuning SQL access paths for marginal gains.
2. Multi-row FETCH is a mechanical conversion — but test the edge cases. The rowset boundary bugs (stale data, control breaks, indicator arrays) were all caught in testing. But they would all have caused production data corruption if missed.
3. The accumulate-then-flush pattern needs a reusable implementation. Rob's COPY member prevented the same bugs from being independently introduced in three different programs.
4. OLAP functions change your TEMP sizing requirements. The 180 GB TEMP requirement for BALCALC was not anticipated. Capacity planning for TEMP must be part of any OLAP migration.
5. Parallel production testing is non-negotiable. Running old and new programs simultaneously against production data, then reconciling every output record, caught bugs that unit testing and integration testing missed. The indicator array bug (Bug #3) would not have been found any other way.
Discussion Questions
- Rob's indicator array bug went undetected through unit and integration testing. What testing approach would have caught it earlier?
- The BALCALC job's OLAP conversion required 180 GB of TEMP space. What alternatives could Lisa have considered if that storage was not available?
- CNB chose ATOMIC mode for TXNPOST inserts and NOT ATOMIC for AUDITARC. Under what circumstances might you reverse these choices?
- The three-phase migration took 13 weeks. If you had to compress it to 6 weeks, which phases would you combine, and what additional risks would you accept?
- Kwame's team's trace data implies a thread-switching cost of roughly 2.5 microseconds per SQL call (130 minutes spread across ~3.1 billion calls). How would you validate this measurement on your own system? What factors could make it significantly higher or lower?