Case Study 1: CNB's Batch Window Crisis and the Great Optimization
Background
City National Bank's nightly batch window had been comfortable for a decade. From 10:00 PM to 6:00 AM, eight hours of uncontested processing time handled interest calculations, statement generation, regulatory reporting, and account maintenance. The systems team, led by VP Kwame Osei, rarely thought about it.
Then the bank acquired Lakeshore Federal Credit Union, adding 4.2 million accounts to CNB's existing 12 million. Simultaneously, the business demanded that online banking be available by 5:00 AM instead of 6:00 AM for the growing West Coast customer base. The batch window shrank from eight hours to seven — while the workload grew by 35%.
The first night after the Lakeshore migration, the batch didn't finish until 5:47 AM. The second night, 6:12 AM. By Thursday, Kwame called an emergency meeting.
The Diagnosis
Lisa Hwang, lead DB2 programmer, spent the weekend analyzing the batch stream. She identified three critical bottlenecks:
Bottleneck 1: The Interest Calculation Program (INTCALC)
INTCALC processed all savings and checking accounts for daily interest accrual. Post-merger, it processed 16.2 million accounts. The program had been written in 2008 and never substantially modified.
Lisa found the first problem in the cursor declaration:
EXEC SQL
DECLARE INT-CURSOR CURSOR FOR
SELECT ACCOUNT_NUM, BALANCE, INT_RATE,
LAST_CALC_DATE, ACCOUNT_TYPE
FROM ACCOUNTS
WHERE ACCOUNT_TYPE IN ('SAV', 'CHK', 'MMA', 'CD')
AND BALANCE > 0
ORDER BY ACCOUNT_NUM
END-EXEC
No WITH HOLD. No commit interval. The program opened the cursor, processed all 16.2 million qualifying rows, and committed once at the end. It held a table lock (escalated from row locks around row 2,000) for the entire 3.5-hour run, blocking the statement generation program from starting. It consumed 1.2 GB of active log space, forcing two archive switches that stalled the entire DB2 subsystem for 90 seconds each.
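The arithmetic behind that failure is worth making explicit. Below is a rough Python model using the case study's own figures (16.2 million rows, ~1.2 GB of log); the lock-escalation threshold is an assumed installation value (DB2's NUMLKTS-style limits vary by site), not a number from the case study.

```python
# Rough model of the no-commit INTCALC run. ROWS and LOG_BYTES_TOTAL come
# from the case study; LOCK_ESCALATION_THRESHOLD is an assumed site value.

ROWS = 16_200_000
LOG_BYTES_TOTAL = 1.2 * 1024**3        # active log consumed by the full run
LOCK_ESCALATION_THRESHOLD = 2_000      # assumed row-lock limit per unit of work

def locks_held(commit_interval):
    """Row locks accumulated inside one unit of work."""
    return min(commit_interval, ROWS)

def log_bytes_per_commit(commit_interval):
    """Log data per unit of work, assuming uniform bytes per row."""
    return LOG_BYTES_TOTAL * commit_interval / ROWS

# One commit at the end: the whole table is one unit of work, so
# escalation to a table lock is guaranteed.
print(locks_held(ROWS) > LOCK_ESCALATION_THRESHOLD)    # → True
# A 2,000-row commit interval keeps each unit of work tiny (~155 KB of log).
print(round(log_bytes_per_commit(2_000) / 1024))       # → 155
```

The same model explains why the eventual fix chose a 2,000-row interval: each unit of work stays under the escalation threshold and its log footprint is trivial.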
Bottleneck 2: The Statement Generation Program (STMTGEN)
STMTGEN used a single "universal cursor" for all account types:
EXEC SQL
DECLARE STMT-CURSOR CURSOR FOR
SELECT A.ACCOUNT_NUM, A.BALANCE,
C.CUSTOMER_NAME, C.ADDRESS_LINE1,
C.ADDRESS_LINE2, C.CITY, C.STATE, C.ZIP
FROM ACCOUNTS A
JOIN CUSTOMERS C ON A.CUSTOMER_ID = C.CUSTOMER_ID
WHERE (:WS-ACCT-TYPE = 'ALL'
OR A.ACCOUNT_TYPE = :WS-ACCT-TYPE)
AND (:WS-REGION = 'ALL'
OR C.REGION = :WS-REGION)
ORDER BY C.CUSTOMER_NAME
END-EXEC
The optimizer, unable to predict which predicates would be active, chose a tablespace scan on ACCOUNTS joined with a nested loop into CUSTOMERS. For 16.2 million accounts, this meant 16.2 million random I/Os into the CUSTOMERS table — a four-hour operation.
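The four-hour figure is consistent with simple latency arithmetic. In the sketch below, the ~0.9 ms average synchronous read time is an assumed value typical of DASD of the era, not a measurement from the case study.

```python
# Sanity check on the four-hour figure: one synchronous random read into
# CUSTOMERS per account. The per-I/O latency is an assumption.

accounts = 16_200_000
ms_per_random_io = 0.9

elapsed_hours = accounts * ms_per_random_io / 1000 / 3600
print(f"{elapsed_hours:.2f} hours")    # → 4.05 hours
```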
Bottleneck 3: The Regulatory Extract Program (REGEXT)
REGEXT ran a dynamic SQL query built from a configuration table. The SQL was constructed by concatenating column values directly into the query string — no parameter markers. While this wasn't causing a performance problem, Lisa flagged it to Kwame as a security risk that would fail their next audit.
* From the 2008-vintage REGEXT program
MOVE SPACES TO WS-DYN-SQL
STRING 'SELECT * FROM TRANSACTIONS WHERE '
WS-CONFIG-WHERE-CLAUSE
' ORDER BY ' WS-CONFIG-ORDER-CLAUSE
DELIMITED BY SIZE INTO WS-DYN-SQL
END-STRING
The WS-CONFIG-WHERE-CLAUSE was loaded from a CONFIG_RULES table that was maintained by business analysts through a simple ISPF interface. No input validation, no audit trail.
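To see why Lisa flagged this, consider what raw concatenation permits. The sketch below restates the REGEXT STRING logic in Python for brevity; the table and column names mirror the case study, and the hostile clause is hypothetical.

```python
# Illustration of the injection risk in REGEXT-style concatenation.
# Mirrors the COBOL STRING logic: raw concatenation, no validation.

def build_query(config_where_clause, config_order_clause):
    return ("SELECT * FROM TRANSACTIONS WHERE "
            + config_where_clause
            + " ORDER BY " + config_order_clause)

# A well-meaning analyst's rule:
print(build_query("TRANS_DATE = CURRENT DATE", "TRANS_DATE"))

# A hostile (or merely careless) rule exfiltrates a different table:
print(build_query("1=1 UNION ALL SELECT * FROM CUSTOMERS -- ", "1"))
```

Anything a business analyst can type into CONFIG_RULES runs with the batch job's authority, which is exactly the audit finding Lisa anticipated.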
The Solution
Lisa assembled a team of three (herself, Rob Chen, and a junior programmer) for a two-week intensive rewrite. Kwame cleared their schedules of all other work.
Fix 1: INTCALC Redesign
Lisa redesigned INTCALC around the commit-checkpoint pattern with partition-aware processing:
Step 1: The ACCOUNTS table was already partitioned into 8 ranges by account number. Lisa created 8 JCL job steps, each processing one partition, running in parallel:
//INTCALC1 EXEC PGM=INTCALC,PARM='PART=1,LO=00000001,HI=19999999'
//INTCALC2 EXEC PGM=INTCALC,PARM='PART=2,LO=20000000,HI=39999999'
...
//INTCALC8 EXEC PGM=INTCALC,PARM='PART=8,LO=E0000000,HI=ZZZZZZZZ'
Step 2: Each instance used a WITH HOLD cursor with a 2,000-row commit interval:
EXEC SQL
DECLARE INT-CURSOR CURSOR WITH HOLD FOR
SELECT ACCOUNT_NUM, BALANCE, INT_RATE,
LAST_CALC_DATE, ACCOUNT_TYPE
FROM ACCOUNTS
WHERE ACCOUNT_NUM >= :WS-PART-LOW
AND ACCOUNT_NUM <= :WS-PART-HIGH
AND ACCOUNT_TYPE IN ('SAV', 'CHK', 'MMA', 'CD')
AND BALANCE > 0
AND ACCOUNT_NUM > :WS-RESTART-KEY
ORDER BY ACCOUNT_NUM
FOR FETCH ONLY
END-EXEC
Lisa chose FOR FETCH ONLY with a searched UPDATE (by primary key) rather than FOR UPDATE, because 100% of rows were updated and the read-only cursor allowed list prefetch:
3000-CALC-AND-UPDATE.
COMPUTE WS-ACCRUED = WS-BALANCE *
(WS-INT-RATE / 36500) * WS-DAYS-SINCE-LAST
EXEC SQL
UPDATE ACCOUNTS
SET BALANCE = BALANCE + :WS-ACCRUED,
LAST_CALC_DATE = CURRENT DATE,
LAST_CALC_AMT = :WS-ACCRUED
WHERE ACCOUNT_NUM = :WS-CURRENT-ACCT
END-EXEC.
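The accrual formula translates directly. Dividing the annual rate by 36,500 yields a daily decimal rate (365-day year, divided by a further 100 because INT_RATE appears to be stored as a percentage; that storage convention is an inference from the COBOL, not stated in the case study).

```python
# The INTCALC accrual formula from 3000-CALC-AND-UPDATE, in Python.

def daily_accrual(balance, annual_rate_pct, days_since_last_calc):
    # rate% / 36500 = per-day decimal rate on a 365-day basis
    return balance * (annual_rate_pct / 36500) * days_since_last_calc

# $10,000 at 3.65% for one day accrues exactly $1.00:
print(round(daily_accrual(10_000, 3.65, 1), 2))    # → 1.0
```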
Step 3: Commit-checkpoint with restart capability using the BATCH_RESTART table:
5000-COMMIT-CHECKPOINT.
EXEC SQL
MERGE INTO BATCH_RESTART R
USING (VALUES ('INTCALC', :WS-PARTITION-NUM,
CURRENT DATE))
AS S(PROG, PART, RUNDT)
ON R.PROGRAM_ID = S.PROG
AND R.PARTITION_NUM = S.PART
AND R.RUN_DATE = S.RUNDT
WHEN MATCHED THEN
UPDATE SET LAST_KEY = :WS-CURRENT-ACCT,
ROWS_COMMITTED = R.ROWS_COMMITTED
+ :WS-COMMIT-CTR,
COMMIT_TS = CURRENT TIMESTAMP
WHEN NOT MATCHED THEN
INSERT VALUES
(S.PROG, S.PART, S.RUNDT, :WS-CURRENT-ACCT,
:WS-COMMIT-CTR, CURRENT TIMESTAMP)
END-EXEC
EXEC SQL COMMIT END-EXEC
ADD WS-COMMIT-CTR TO WS-TOTAL-COMMITTED
MOVE 0 TO WS-COMMIT-CTR.
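The control flow of this checkpoint/restart pair can be simulated in a few lines of Python. The dict below stands in for the BATCH_RESTART row and a list for the committed updates; none of this is DB2 API, only the unit-of-work semantics: uncommitted work vanishes on an abend, committed work and the checkpoint survive.

```python
# Simulation of the commit-checkpoint pattern with restart.

COMMIT_INTERVAL = 2_000

def run_partition(rows, checkpoint, abend_after=None):
    """Process sorted keys above checkpoint['last_key'], committing
    every COMMIT_INTERVAL rows. An abend discards uncommitted work."""
    uncommitted = []
    for key in rows:
        if key <= checkpoint["last_key"]:
            continue                    # AND ACCOUNT_NUM > :WS-RESTART-KEY
        if abend_after is not None and key > abend_after:
            return "ABEND"              # uncommitted list is discarded
        uncommitted.append(key)
        if len(uncommitted) == COMMIT_INTERVAL:
            checkpoint["done"].extend(uncommitted)   # the COMMIT
            checkpoint["last_key"] = key             # the MERGE
            uncommitted = []
    checkpoint["done"].extend(uncommitted)           # final partial batch
    checkpoint["last_key"] = rows[-1]
    return "OK"

rows = list(range(1, 10_001))
ckpt = {"last_key": 0, "done": []}

run_partition(rows, ckpt, abend_after=5_500)   # abends mid-run at key 5,501
run_partition(rows, ckpt)                      # restart resumes from last commit
print(len(ckpt["done"]), len(set(ckpt["done"])))   # → 10000 10000
```

Note that the abended run had processed keys 4,001-5,500 without committing; the restart reprocesses them, which is why every update must be idempotent-safe under rollback, exactly as the searched UPDATE by primary key is here.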
Results: INTCALC dropped from 3.5 hours (serial, no commits) to 28 minutes (8-way parallel, 2,000-row commits). Lock escalation was eliminated. Log consumption per commit was under 5 MB. Any partition could be restarted independently.
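Those numbers cross-check against the parallelism arithmetic; the efficiency figure below is derived from the case study's elapsed times, not separately measured.

```python
# 8-way parallelism against the 3.5-hour serial run predicts roughly
# the observed 28 minutes.

serial_minutes = 210       # 3.5 h serial INTCALC run
partitions = 8
observed_minutes = 28

ideal_minutes = serial_minutes / partitions
print(round(ideal_minutes, 2))                      # → 26.25
print(round(ideal_minutes / observed_minutes, 2))   # → 0.94 parallel efficiency
```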
Fix 2: STMTGEN Redesign
Rob Chen replaced the universal cursor with a cursor pool. Since STMTGEN was always called with a specific account type and region (the "ALL" case was actually never used in production — Rob checked three years of job logs), he created targeted cursors:
* Cursor for specific account type + specific region
EXEC SQL
DECLARE STMT-C1 CURSOR WITH HOLD FOR
SELECT A.ACCOUNT_NUM, A.BALANCE,
C.CUSTOMER_NAME, C.ADDRESS_LINE1,
C.ADDRESS_LINE2, C.CITY, C.STATE, C.ZIP
FROM ACCOUNTS A
JOIN CUSTOMERS C ON A.CUSTOMER_ID = C.CUSTOMER_ID
WHERE A.ACCOUNT_TYPE = :WS-ACCT-TYPE
AND C.REGION = :WS-REGION
ORDER BY A.ACCOUNT_NUM
FOR FETCH ONLY
END-EXEC
He also added a composite index on ACCOUNTS (ACCOUNT_TYPE, CUSTOMER_ID) that gave the optimizer a matching index scan instead of a tablespace scan.
Results: STMTGEN dropped from 4 hours to 45 minutes. The new index added 200 MB of DASD but saved 3 hours of elapsed time — a trade-off Kwame approved in 10 seconds.
Fix 3: REGEXT Security Remediation
Lisa rewrote the dynamic SQL construction to use parameter markers for all value-based predicates and whitelisted column and table names:
2000-BUILD-DYNAMIC-SQL.
* Column whitelist validation
PERFORM VARYING WS-IDX FROM 1 BY 1
UNTIL WS-IDX > WS-CONFIG-COLUMN-COUNT
EVALUATE WS-CONFIG-COLUMN(WS-IDX)
WHEN 'TRANSACTION_ID'
WHEN 'ACCOUNT_NUM'
WHEN 'TRANS_DATE'
WHEN 'AMOUNT'
WHEN 'TRANS_TYPE'
WHEN 'BRANCH_CODE'
CONTINUE
WHEN OTHER
DISPLAY 'INVALID COLUMN: '
WS-CONFIG-COLUMN(WS-IDX)
MOVE 'Y' TO WS-ERROR-FLAG
END-EVALUATE
END-PERFORM
* Build with parameter markers for values
MOVE 1 TO WS-SQL-PTR
STRING 'SELECT ' WS-VALIDATED-COLUMNS
' FROM TRANSACTIONS'
' WHERE TRANS_DATE BETWEEN ? AND ?'
DELIMITED BY SIZE INTO WS-DYN-SQL
WITH POINTER WS-SQL-PTR
END-STRING
* Append validated optional predicates with markers;
* WITH POINTER continues where the first STRING stopped
IF WS-HAS-AMOUNT-FILTER
STRING ' AND AMOUNT > ?'
DELIMITED BY SIZE INTO WS-DYN-SQL
WITH POINTER WS-SQL-PTR
END-STRING
END-IF.
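The same pattern, restated in Python to make the two rules visible at a glance: identifiers are checked against a fixed whitelist, and values only ever travel as parameter markers. The column names come from the case study; the helper itself is illustrative.

```python
# The Fix 3 pattern: whitelist identifiers, parameterize values.

ALLOWED_COLUMNS = {"TRANSACTION_ID", "ACCOUNT_NUM", "TRANS_DATE",
                   "AMOUNT", "TRANS_TYPE", "BRANCH_CODE"}

def build_extract_sql(columns, has_amount_filter=False):
    bad = [c for c in columns if c not in ALLOWED_COLUMNS]
    if bad:
        raise ValueError(f"INVALID COLUMN: {bad}")
    sql = ("SELECT " + ", ".join(columns)
           + " FROM TRANSACTIONS WHERE TRANS_DATE BETWEEN ? AND ?")
    if has_amount_filter:
        sql += " AND AMOUNT > ?"     # value arrives later, as a parameter
    return sql

print(build_extract_sql(["ACCOUNT_NUM", "AMOUNT"], has_amount_filter=True))
# → SELECT ACCOUNT_NUM, AMOUNT FROM TRANSACTIONS
#   WHERE TRANS_DATE BETWEEN ? AND ? AND AMOUNT > ?
```

Because only whitelisted tokens ever reach the SQL text, a config row containing hostile SQL fails validation instead of executing.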
She also implemented the full audit trail — every dynamic SQL execution was logged to the SQL_AUDIT_LOG table with the SQL text, parameters, executing user, and timestamp.
Outcome
After two weeks of development and one week of testing:
| Metric | Before | After | Improvement |
|---|---|---|---|
| INTCALC elapsed | 3h 30m | 28m | 86.7% |
| STMTGEN elapsed | 4h 00m | 45m | 81.3% |
| Total batch window | 7h 45m | 4h 10m | 46.2% |
| Lock escalations/night | 12-15 | 0 | 100% |
| Log archive switches | 2-3 | 0 | 100% |
| Dynamic SQL injection vectors | 3 programs | 0 | 100% |
The batch window now finished by 2:15 AM, leaving nearly three hours of margin before the 5:00 AM online availability deadline. Kwame described the project as "the best two weeks of engineering this team has done in five years."
Lessons Learned
- Serial processing is the enemy of the batch window. Partition-aware parallelism delivered the single largest improvement.
- The commit-checkpoint pattern is not optional at scale. The no-commit INTCALC program had been a ticking time bomb for years — it just hadn't exploded because the volume hadn't crossed the threshold.
- Universal cursors are performance time bombs. They work fine in development with 10,000 rows. They work fine in production with 5 million rows (if slowly). They fail catastrophically at 16 million rows.
- Security debt compounds. The dynamic SQL concatenation in REGEXT had been there since 2008. Fixing it took two days. Fixing it after a breach would have taken months — plus the regulatory consequences.
- Check your assumptions. Rob's discovery that the "ALL" case was never actually used in production saved weeks of development time. Don't optimize for code paths that don't execute.
Discussion Questions
- Lisa chose a commit interval of 2,000 rows for INTCALC. What factors would cause you to increase or decrease this interval? At what interval would lock escalation become a risk on CNB's system?
- Rob found that the "ALL" case in STMTGEN was never used. What's the risk of removing it? How would you verify this more rigorously than checking job logs?
- The REGEXT security fix used a column whitelist. What happens when the business needs to add a new column to the report? How would you design the whitelist to be maintainable without code changes?
- If INTCALC partition 5 abends at 2:00 AM while partitions 1-4 have completed and 6-8 are still running, what happens? How does the restart logic handle this partial failure?
- The new composite index on ACCOUNTS (ACCOUNT_TYPE, CUSTOMER_ID) improved STMTGEN dramatically. What is the maintenance cost of this index for INSERT and UPDATE operations on the ACCOUNTS table? When does that cost matter?