Case Study 1: CNB's Migration from Cursor Loops to Multi-Row FETCH

Background

City National Bank processes approximately 500 million transactions daily across its core banking platform. The nightly batch cycle — which posts transactions, calculates running balances, updates account summaries, generates regulatory reports, and produces customer statements — must complete within a 4-hour batch window between 11:00 PM and 3:00 AM EST. By Q2 2025, the batch cycle was consistently running 3 hours and 47 minutes, leaving only 13 minutes of slack. Volume was growing at 8% annually. Without intervention, the batch window would be breached within six months.

Kwame Asante, Senior Systems Programmer, was tasked with leading the optimization effort. His team: Lisa Chen (DB2 DBA, 18 years experience) and Rob Martinez (COBOL application developer, 12 years experience).

The Problem in Detail

Kwame's first step was instrumentation. He added DB2 accounting trace data collection to the six largest batch jobs, which together consumed 82% of the batch window. The results were damning:

Job        Rows Processed   SQL Calls   Thread Switch Time   SQL Execution Time   COBOL Logic Time
TXNPOST    500M             1.5B        62 min               89 min               17 min
BALCALC    500M             1.0B        42 min               71 min               24 min
ACCTSUM    12M              36M         1.5 min              8 min                3 min
REGFEED    50M              150M        6.3 min              22 min               9 min
CUSTSTMT   8M               32M         1.3 min              11 min               14 min
AUDITARC   200M             400M        16.7 min             34 min               5 min

The numbers told a clear story: across the six jobs, thread-switching overhead consumed 130 minutes — more than two hours of the batch window spent doing nothing but crossing the address space boundary.

"I looked at those numbers and realized we'd been optimizing the wrong thing for years," Kwame said. "We'd spend weeks tuning SQL access paths to save 3% on execution time. Meanwhile, over half of our batch window was pure overhead from the single-row programming model."

The Migration Plan

Lisa designed a three-phase migration plan:

Phase 1: Multi-Row FETCH (4 weeks)

Convert all six jobs from single-row FETCH to multi-row FETCH with a rowset size of 500. This was the lowest-risk change — the COBOL business logic didn't change at all. Only the data retrieval mechanism changed.

Phase 2: Multi-Row INSERT (3 weeks)

Convert the three jobs that produce output records (TXNPOST, AUDITARC, REGFEED) to use multi-row INSERT with the accumulate-then-flush pattern.

Phase 3: MERGE and OLAP (6 weeks)

Replace the SELECT/INSERT/UPDATE patterns in ACCTSUM with MERGE, and replace the cursor-based running balance logic in BALCALC with OLAP functions.

Phase 1: The Multi-Row FETCH Conversion

Rob started with TXNPOST, the largest and most critical job. The original program had a straightforward structure:

OPEN cursor.
LOOP: FETCH one row.
      Validate transaction.
      Determine posting action.
      UPDATE transaction status.
      INSERT audit record.
END-LOOP.
CLOSE cursor.

The multi-row conversion required:

1. WORKING-STORAGE changes: Rob declared host variable arrays with OCCURS 500 for all FETCH columns, plus matching indicator arrays. Total additional WORKING-STORAGE: 94 KB (500 rows * 187 bytes per row + indicator overhead).

2. Cursor declaration: Added WITH ROWSET POSITIONING to the existing cursor declaration. The SELECT statement itself didn't change.

3. FETCH logic: Replaced the single-row FETCH with FETCH NEXT ROWSET FROM... FOR :WS-ROWSET-SIZE ROWS. Added SQLERRD(3) checking to determine actual row count.

4. Processing loop: Added an inner loop (PERFORM VARYING WS-IDX FROM 1 BY 1 UNTIL WS-IDX > WS-ROWS-FETCHED) to process each row in the rowset. The business logic in PROCESS-TRANSACTION was unchanged — it now referenced WS-TXN-ACCT(WS-IDX) instead of WS-TXN-ACCT.

5. End-of-data handling: The outer loop now checked for SQLCODE 100 after processing all rows in the final rowset, using SQLERRD(3) to process only valid rows.
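In outline, the five changes fit together as sketched below. This is an illustrative reconstruction, not CNB's actual code: the data names (WS-TXN-*, CSR-TXN), the table TXN_QUEUE, and the error paragraph are assumptions, and error handling is reduced to a stub.

```cobol
       WORKING-STORAGE SECTION.
       01  WS-ROWSET-SIZE        PIC S9(4) COMP VALUE 500.
       01  WS-ROWS-FETCHED       PIC S9(9) COMP VALUE 0.
       01  WS-IDX                PIC S9(4) COMP.
       01  WS-DONE               PIC X     VALUE 'N'.
      *    Step 1: host variable arrays, one entry per rowset position
       01  WS-TXN-ARRAYS.
           05  WS-TXN-ACCT       PIC X(10)            OCCURS 500 TIMES.
           05  WS-TXN-AMT        PIC S9(13)V99 COMP-3 OCCURS 500 TIMES.
           05  WS-TXN-MEMO       PIC X(60)            OCCURS 500 TIMES.
      *    Matching indicator array for the nullable MEMO column
           05  WS-IND-MEMO       PIC S9(4) COMP       OCCURS 500 TIMES.

       PROCEDURE DIVISION.
      *    Step 2: same SELECT, cursor now rowset-positioned
           EXEC SQL
               DECLARE CSR-TXN CURSOR WITH ROWSET POSITIONING FOR
                   SELECT ACCT_NUM, TXN_AMT, MEMO
                     FROM TXN_QUEUE
                    ORDER BY ACCT_NUM
           END-EXEC.

           EXEC SQL OPEN CSR-TXN END-EXEC
           PERFORM UNTIL WS-DONE = 'Y'
      *        Step 3: one SQL call retrieves up to 500 rows
               EXEC SQL
                   FETCH NEXT ROWSET FROM CSR-TXN
                       FOR :WS-ROWSET-SIZE ROWS
                       INTO :WS-TXN-ACCT,
                            :WS-TXN-AMT,
                            :WS-TXN-MEMO :WS-IND-MEMO
               END-EXEC
               EVALUATE TRUE
                   WHEN SQLCODE = 0
                       MOVE SQLERRD(3) TO WS-ROWS-FETCHED
                       PERFORM PROCESS-ROWSET
                   WHEN SQLCODE = 100
      *                Step 5: the final rowset may be partial;
      *                SQLERRD(3) says how many rows are valid
                       MOVE SQLERRD(3) TO WS-ROWS-FETCHED
                       PERFORM PROCESS-ROWSET
                       MOVE 'Y' TO WS-DONE
                   WHEN OTHER
                       PERFORM SQL-ERROR-ABEND
               END-EVALUATE
           END-PERFORM
           EXEC SQL CLOSE CSR-TXN END-EXEC
           GOBACK.

       PROCESS-ROWSET.
      *    Step 4: inner loop over valid rows only
           PERFORM VARYING WS-IDX FROM 1 BY 1
               UNTIL WS-IDX > WS-ROWS-FETCHED
               PERFORM PROCESS-TRANSACTION
           END-PERFORM.
```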

The First Bug

During unit testing, Rob noticed that the last batch of transactions in every test run was being processed twice. The problem: when the final FETCH returned SQLCODE 100 with SQLERRD(3) = 283, the outer loop detected the SQLCODE 100 and exited — but then the "final processing" paragraph at the end of the program re-processed the last rowset because WS-ROWS-FETCHED still contained 283.

The fix was to set WS-ROWS-FETCHED to 0 after processing each rowset. Simple, but it wouldn't have been caught without thorough testing with known test data and reconciliation of output counts.
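A minimal sketch of the fix (paragraph and data names illustrative): clear the count the moment the rowset has been consumed, so no stale state survives into later paragraphs.

```cobol
       PROCESS-ROWSET.
           PERFORM VARYING WS-IDX FROM 1 BY 1
               UNTIL WS-IDX > WS-ROWS-FETCHED
               PERFORM PROCESS-TRANSACTION
           END-PERFORM
      *    Bug fix: zero the count so the final-processing
      *    paragraph cannot re-process the last rowset
           MOVE ZERO TO WS-ROWS-FETCHED.
```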

The Second Bug

Integration testing revealed a more subtle problem. TXNPOST processes transactions in ACCT_NUM order and maintains a control break to detect account changes. With single-row FETCH, the control break occurred between two FETCHes. With multi-row FETCH, the control break could occur within a rowset — at any position.

Rob's original control-break logic assumed that WS-PREV-ACCT was set from the most recent FETCH. With multi-row FETCH, WS-PREV-ACCT needed to be compared against each row in the array individually. He restructured the inner loop:

           PERFORM VARYING WS-IDX FROM 1 BY 1
               UNTIL WS-IDX > WS-ROWS-FETCHED

               IF WS-TXN-ACCT(WS-IDX) NOT = WS-PREV-ACCT
                   IF WS-PREV-ACCT NOT = SPACES
                       PERFORM ACCOUNT-BREAK-PROCESSING
                   END-IF
                   MOVE WS-TXN-ACCT(WS-IDX) TO WS-PREV-ACCT
               END-IF

               PERFORM PROCESS-TRANSACTION

           END-PERFORM.

The Third Bug

Production parallel testing (running the new program alongside the old one and comparing results) revealed that nullable columns were causing mismatches. The original single-row program used indicator variables for three nullable columns. Rob had declared indicator arrays, but in the inner processing loop, he was checking WS-IND-MEMO instead of WS-IND-MEMO(WS-IDX). Rows where MEMO was NULL in the first rowset position were treated as NULL for all 500 rows; rows where MEMO was not NULL in position 1 masked NULLs in later positions.

"That one scared me," Rob admitted. "It would have corrupted audit records for any transaction where the memo field was NULL if it happened to fall after a non-NULL row in the same rowset. We caught it in parallel testing because the audit record counts didn't match."
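The difference between the two checks, in outline (names illustrative; a negative indicator value signals NULL):

```cobol
      *    Wrong: tests the leftover single-row indicator (or only
      *    position 1), so one non-NULL row masks every later NULL:
      *        IF WS-IND-MEMO < 0 ...
      *
      *    Right: each rowset position has its own indicator entry
           IF WS-IND-MEMO(WS-IDX) < 0
               MOVE SPACES               TO WS-AUDIT-MEMO
           ELSE
               MOVE WS-TXN-MEMO(WS-IDX) TO WS-AUDIT-MEMO
           END-IF
```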

Phase 1 Results

After four weeks of development, testing, and parallel production validation, Phase 1 went live:

Job        Before (Thread Switch)   After (Thread Switch)   Savings
TXNPOST    62 min                   0.12 min                61.88 min
BALCALC    42 min                   0.08 min                41.92 min
ACCTSUM    1.5 min                  0.003 min               1.50 min
REGFEED    6.3 min                  0.013 min               6.29 min
CUSTSTMT   1.3 min                  0.003 min               1.30 min
AUDITARC   16.7 min                 0.033 min               16.67 min
Total      130 min                  0.25 min                129.75 min

Thread-switching overhead dropped from 130 minutes to 15 seconds. Total batch elapsed time dropped from 3:47 to 1:37.

Phase 2: Multi-Row INSERT

With the FETCH side optimized, Rob converted the INSERT operations in TXNPOST, AUDITARC, and REGFEED to multi-row INSERT with the accumulate-then-flush pattern.

The key design decisions:

  • TXNPOST (posting transactions to ledger): ATOMIC mode. A partial post is worse than no post — financial integrity requires all-or-nothing within each batch.
  • AUDITARC (audit archival): NOT ATOMIC. Losing a few audit records is acceptable; stopping the archive job is not.
  • REGFEED (regulatory feed): ATOMIC mode. Regulatory submissions must be complete.

Rob implemented a reusable COPY member (MRINSERT.cpy) that encapsulated the accumulate-then-flush pattern, parameterized by table name, column list, and rowset size. This allowed all three jobs to share the same insert logic with different configurations.
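In outline, the pattern the COPY member encapsulates might look like the sketch below. The table, column, and data names are assumptions for illustration, and only the AUDITARC-style NOT ATOMIC variant is shown.

```cobol
       01  WS-INS-COUNT       PIC S9(4) COMP VALUE 0.
       01  WS-INS-MAX         PIC S9(4) COMP VALUE 500.
       01  WS-AUD-ARRAYS.
           05  WS-AUD-ACCT    PIC X(10) OCCURS 500 TIMES.
           05  WS-AUD-TS      PIC X(26) OCCURS 500 TIMES.

      *    Accumulate: called once per output record
       ACCUMULATE-AUDIT-ROW.
           ADD 1 TO WS-INS-COUNT
           MOVE WS-TXN-ACCT(WS-IDX) TO WS-AUD-ACCT(WS-INS-COUNT)
           MOVE WS-CURR-TS          TO WS-AUD-TS(WS-INS-COUNT)
           IF WS-INS-COUNT = WS-INS-MAX
               PERFORM FLUSH-AUDIT-ROWS
           END-IF.

      *    Flush: one SQL call inserts the whole buffer.
      *    The caller must PERFORM this once more at end of job
      *    to write any partially filled final buffer.
       FLUSH-AUDIT-ROWS.
           IF WS-INS-COUNT > 0
               EXEC SQL
                   INSERT INTO AUDIT_ARCHIVE (ACCT_NUM, AUDIT_TS)
                       VALUES (:WS-AUD-ACCT, :WS-AUD-TS)
                       FOR :WS-INS-COUNT ROWS
                       NOT ATOMIC CONTINUE ON SQLEXCEPTION
               END-EXEC
               MOVE ZERO TO WS-INS-COUNT
           END-IF.
```

For the ATOMIC jobs (TXNPOST, REGFEED), the NOT ATOMIC clause is dropped and any failure rolls back the entire rowset.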

Phase 2 Results

Job        Before (INSERT overhead)   After      Savings
TXNPOST    34 min                     0.07 min   33.93 min
AUDITARC   18 min                     0.04 min   17.96 min
REGFEED    4 min                      0.01 min   3.99 min
Total      56 min                     0.12 min   55.88 min

Cumulative batch window: 1:37 → 0:41.

Phase 3: MERGE and OLAP

Lisa took the lead on Phase 3, which required SQL redesign rather than mechanical conversion.

ACCTSUM: The account summary job used the classic SELECT-then-INSERT-or-UPDATE pattern. Lisa replaced it with a MERGE that used a global temporary table as the source, loaded via multi-row INSERT. The MERGE reduced the SQL complexity from 47 lines of COBOL SQL logic to 22 lines, eliminated the concurrency retry logic (15 lines of SQLCODE -803 handling), and cut the job's elapsed time from 12.5 minutes to 4.2 minutes.
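The reworked statement might take roughly this shape. The temporary table and column names are illustrative, not CNB's schema, and the table-reference source assumes a DB2 12-style MERGE.

```cobol
           EXEC SQL
               MERGE INTO ACCT_SUMMARY S
               USING (SELECT ACCT_NUM, DAY_AMT, DAY_CNT
                        FROM SESSION.GT_ACCT_WORK) W
                  ON (S.ACCT_NUM = W.ACCT_NUM)
               WHEN MATCHED THEN
                   UPDATE SET S.TOT_AMT = S.TOT_AMT + W.DAY_AMT,
                              S.TXN_CNT = S.TXN_CNT + W.DAY_CNT
               WHEN NOT MATCHED THEN
                   INSERT (ACCT_NUM, TOT_AMT, TXN_CNT)
                   VALUES (W.ACCT_NUM, W.DAY_AMT, W.DAY_CNT)
           END-EXEC
```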

BALCALC: This was the most ambitious change. The running balance calculation used a cursor loop that maintained running totals across 12 million accounts. Lisa replaced the entire calculation with a single SQL statement using SUM(TXN_AMT) OVER (PARTITION BY ACCT_NUM ORDER BY TXN_DATE, TXN_SEQ ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
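Applied to a hypothetical ledger table, the statement has roughly this shape; only the OVER clause is quoted from the actual change, and the table and column names around it are assumptions.

```cobol
           EXEC SQL
               INSERT INTO ACCT_BAL_HIST
                      (ACCT_NUM, TXN_DATE, TXN_SEQ, RUN_BAL)
               SELECT ACCT_NUM, TXN_DATE, TXN_SEQ,
                      SUM(TXN_AMT) OVER
                          (PARTITION BY ACCT_NUM
                           ORDER BY TXN_DATE, TXN_SEQ
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW)
                 FROM TXN_LEDGER
           END-EXEC
```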

The challenge was TEMP space. With 500 million transactions partitioned by account, the OLAP sort required 180 GB of TEMP space. Lisa worked with the storage team to allocate dedicated TEMP tablespaces for the BALCALC job, backed by high-speed flash storage.

Phase 3 Results

Job        Before                        After     Savings
ACCTSUM    12.5 min                      4.2 min   8.3 min
BALCALC    95 min (137 before Phase 1)   38 min    57 min

Because the batch jobs overlap within the window, the wall-clock gain was smaller than the sum of the per-job savings: total batch elapsed time fell from 0:41 to 0:34.

Final Results

Metric                   Before Migration   After Migration      Improvement
Total batch elapsed      3:47               0:34                 85% reduction
Thread switches          ~3.1 billion       ~6.2 million         99.8% reduction
Batch window slack       13 minutes         3 hours 26 minutes   15x improvement
Annual growth capacity   6 months           8+ years             at 8% annual volume growth

Lessons Learned

1. Measure before you optimize. Without the DB2 accounting trace data showing that thread-switching overhead consumed 130 minutes, Kwame's team would have continued tuning SQL access paths for marginal gains.

2. Multi-row FETCH is a mechanical conversion — but test the edge cases. The rowset boundary bugs (stale data, control breaks, indicator arrays) were all caught in testing. But they would all have caused production data corruption if missed.

3. The accumulate-then-flush pattern needs a reusable implementation. Rob's COPY member prevented the same bugs from being independently introduced in three different programs.

4. OLAP functions change your TEMP sizing requirements. The 180 GB TEMP requirement for BALCALC was not anticipated. Capacity planning for TEMP must be part of any OLAP migration.

5. Parallel production testing is non-negotiable. Running old and new programs simultaneously against production data, then reconciling every output record, caught bugs that unit testing and integration testing missed. The indicator array bug (Bug #3) would not have been found any other way.

Discussion Questions

  1. Rob's indicator array bug went undetected through unit and integration testing. What testing approach would have caught it earlier?

  2. The BALCALC job's OLAP conversion required 180 GB of TEMP space. What alternatives could Lisa have considered if that storage was not available?

  3. CNB chose ATOMIC mode for TXNPOST inserts and NOT ATOMIC for AUDITARC. Under what circumstances might you reverse these choices?

  4. The three-phase migration took 13 weeks. If you had to compress it to 6 weeks, which phases would you combine and what additional risks would you accept?

  5. Kwame's team's trace data implies a per-switch cost of roughly 2.5 microseconds (130 minutes across ~3.1 billion switches). How would you validate this measurement on your own system? What factors could make it significantly higher or lower?