Case Study 1: CNB's Batch Window Crisis and Re-engineering

"We had 17 years of knowledge in that batch window. Turns out, about 4 years of it was still relevant."

— Rob Calloway, Batch Operations Manager, Continental National Bank


The Setup

Continental National Bank's end-of-day batch window had been running reliably for years. The window opened at 11:00 PM when CICS regions quiesced and closed at 6:00 AM when online services resumed. Within that 7-hour window, 847 batch jobs processed the day's transactions, calculated balances, generated regulatory files, produced statements, and prepared the bank for the next business day.

Rob Calloway managed it. Kwame Mensah architected it. Lisa Tran kept DB2 healthy under it. Together, they owned the 847-job dependency graph that was the circulatory system of a $45 billion bank.

The batch window finished most nights between 4:15 and 4:45 AM, leaving at least 75 minutes of margin. Comfortable. Reliable. Nobody worried about it.

Then came Q4.

The Trigger

In September, CNB launched a co-branded mobile banking partnership with a national fintech platform. The partnership was a business success — 1.2 million new mobile accounts in the first quarter. Transaction volumes climbed from a daily average of 385 million to over 500 million by mid-November.

A 30% increase in transaction volume.

Nobody told batch operations. Nobody re-analyzed the critical path. Nobody ran capacity projections. The business celebrated new customer acquisition numbers while the batch window silently consumed its margin.

The Escalation

Week 1 (early November): Batch completion time drifted from 4:30 AM to 4:50 AM. Rob noticed but attributed it to seasonal pre-holiday volume. "Normal Q4 stuff," he noted in his weekly ops report.

Week 2: Completion time hit 5:15 AM. Rob flagged it to Kwame: "We're trending later. Might want to take a look." Kwame added it to his backlog.

Week 3 (Tuesday): Batch completed at 5:52 AM. Eight minutes of margin. Rob sent an urgent email: "We will blow the window this week if nothing changes."

Week 3 (Thursday, 5:47 AM): Rob's phone rang. The automated monitoring system had already paged the on-call team, but Rob always picked up when the batch was in trouble. The message was the one he'd dreaded: "CICS startup blocked. Batch jobs CNBEOD-GL03 and CNBEOD-STMT02 still executing. Projected completion: 6:23 AM."

Online banking would be 23 minutes late.

Rob called the CIO at 5:51 AM. "Online won't come up at 6. We're looking at 6:25 at best. Do you want us to bring CICS up without GL posting complete?" The CIO's answer: "How wrong will the balances be?" Rob's answer: "I don't know." The CIO's decision: "Wait for batch to finish."

At 6:22 AM, CICS regions started accepting transactions. Twenty-two minutes late. 340,000 early-morning mobile banking users got "system unavailable" messages. The app store reviews started pouring in by 7:00 AM. The board's risk committee was notified by 8:00 AM.

The Analysis

Kwame Mensah cleared his calendar for the next two weeks. The first task was understanding why the window broke — not at the individual job level, but at the architectural level.

Step 1: Measure Everything

Kwame pulled 90 days of SMF Type 30 records (job-level accounting) and SMF Type 14/15 records (dataset activity) for every batch job. He built a complete timing profile:

Job Timing Analysis (90-day average vs. November average):

                          90-Day Avg    Nov Avg    Change
──────────────────────────────────────────────────────────
Critical path elapsed     310 min       385 min    +24%
Total CPU consumed        142 min       198 min    +39%
Total I/O wait time       89 min        124 min    +39%
Total DB2 wait time       203 min       289 min    +42%
Jobs on critical path     14            14         --
Non-critical jobs         833           833        --
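A timing profile like the one above is distilled from thousands of raw SMF records. A minimal sketch of that aggregation in Python, with the caveat that real SMF Type 30 records are binary and must first be unloaded with a reporting tool; the dictionary layout and field names here are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, pre-decoded job accounting records. Real SMF Type 30 data
# is binary and far richer; these fields are illustrative only.
records = [
    {"job": "CNBEOD-POST1", "date": "2024-11-01", "elapsed_min": 50.2, "cpu_min": 9.1},
    {"job": "CNBEOD-POST1", "date": "2024-11-02", "elapsed_min": 53.8, "cpu_min": 9.4},
    {"job": "CNBEOD-VAL01", "date": "2024-11-01", "elapsed_min": 45.0, "cpu_min": 12.2},
    {"job": "CNBEOD-VAL01", "date": "2024-11-02", "elapsed_min": 47.1, "cpu_min": 12.9},
]

def timing_profile(records):
    """Average elapsed and CPU minutes per job across all observed runs."""
    by_job = defaultdict(list)
    for r in records:
        by_job[r["job"]].append(r)
    return {
        job: {
            "runs": len(rs),
            "avg_elapsed_min": round(mean(r["elapsed_min"] for r in rs), 1),
            "avg_cpu_min": round(mean(r["cpu_min"] for r in rs), 1),
        }
        for job, rs in by_job.items()
    }

profile = timing_profile(records)
```

Run over 90 days of records, the same aggregation yields the per-job averages in the table.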

The critical path grew by 24%, but individual component growth was 39–42%. Why the discrepancy? Because not all critical-path jobs were equally affected. Some were volume-dependent (growing with transactions), while others were fixed-overhead (running the same duration regardless of volume).
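The discrepancy can be checked with back-of-envelope arithmetic: model the 310-minute path as an elastic portion that grew with the ~39% component growth plus a fixed portion that did not.

```python
# Decompose the critical path into elastic (E) and fixed (F) minutes:
#   E + F = 310          (90-day average)
#   1.39 * E + F = 385   (November average; elastic part grew 39%)
# Subtracting the first equation from the second: 0.39 * E = 75.
path_before, path_after = 310, 385
component_growth = 0.39

elastic_min = (path_after - path_before) / component_growth  # ~192 min scale with volume
fixed_min = path_before - elastic_min                        # ~118 min do not
```

Roughly 118 of the 310 minutes were volume-insensitive, which is why the path as a whole grew more slowly than its elastic components.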

Step 2: Map the Dependency Graph

Kwame exported the TWS dependency definitions and built the complete DAG. 847 nodes, 2,341 edges. He then cross-referenced with actual SMF data to find implicit dependencies.

Findings:

Discovery                                    Count   Impact
─────────────────────────────────────────────────────────────────────────────
Unnecessary explicit dependencies              127   Jobs serialized with no data dependency
Duplicate/transitive dependencies               43   Direct edge B→A redundant given path B→C→A
Phantom dependencies (decommissioned jobs)       8   Edges pointing to non-existent jobs (scheduler silently ignored them)
Implicit dataset contention                     31   No scheduler dependency, but DISP=OLD on shared datasets
Implicit DB2 contention                         17   No scheduler dependency, but lock conflicts on the same tables

The 127 unnecessary dependencies were the most impactful. Many dated back years — jobs that were once related but had since been refactored. Nobody removed the dependency because nobody questioned why it was there.
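The duplicate/transitive findings can be reproduced mechanically. A small sketch in Python using a plain depth-first search over the exported edge list; a production version would run against all 2,341 edges from the TWS export:

```python
from collections import defaultdict

def transitive_edges(edges):
    """Return direct edges (u, v) that are redundant because a longer
    path from u to v already exists through other edges."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable(u, v, skip):
        # Can we reach v from u without using the direct edge `skip`?
        stack, seen = [u], set()
        while stack:
            n = stack.pop()
            for m in succ[n]:
                if (n, m) == skip or m in seen:
                    continue
                if m == v:
                    return True
                seen.add(m)
                stack.append(m)
        return False

    return [(u, v) for u, v in edges if reachable(u, v, skip=(u, v))]

# Toy graph with one redundant edge: B→A duplicates the path B→C→A.
deps = [("B", "A"), ("B", "C"), ("C", "A")]
redundant = transitive_edges(deps)
```

Each edge the function flags is a candidate for removal, not an automatic deletion; the four-question validation described later still applies.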

Step 3: Identify Critical Path Bottlenecks

The 14 critical-path jobs, in dependency order:

Job ID          Description                Duration(Nov)  Volume-Elastic?
────────────────────────────────────────────────────────────────────────
CNBEOD-EXT01    Transaction extract          28 min       Yes (0.95)
CNBEOD-SORT01   Transaction sort             18 min       Yes (0.90)
CNBEOD-VAL01    Transaction validation       46 min       Yes (0.92)
CNBEOD-FRDSC    Fraud detection scan         32 min       Yes (0.88)
CNBEOD-MRGV     Merge validated txns         12 min       Yes (0.85)
CNBEOD-POST1    Account posting (retail)     52 min       Yes (0.94)
CNBEOD-BALS     Balance calculation          44 min       Yes (0.91)
CNBEOD-INTCL    Interest accrual             38 min       Partial (0.70)
CNBEOD-GL01     GL journal entries           24 min       Yes (0.88)
CNBEOD-GL02     GL posting to DB2            18 min       Yes (0.85)
CNBEOD-GL03     GL reconciliation            12 min       No (fixed)
CNBEOD-REG01    Regulatory extract           15 min       No (fixed)
CNBEOD-STMT01   Statement generation (pt 1)  28 min       Yes (0.92)
CNBEOD-STMT02   Statement generation (pt 2)  18 min       Yes (0.90)
────────────────────────────────────────────────────────────────────────
Total critical path: 385 minutes (November average)

The three biggest contributors to critical-path growth:

  1. CNBEOD-POST1 (account posting): 52 minutes, up from 38. Heavily volume-elastic, DB2-bound.
  2. CNBEOD-VAL01 (validation): 46 minutes, up from 34. CPU- and DB2-bound.
  3. CNBEOD-BALS (balance calculation): 44 minutes, up from 33. DB2-bound.

These three jobs accounted for 37 minutes of the 75-minute critical path growth.
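The "Volume-Elastic?" coefficients make simple capacity projections possible. A hedged sketch, reading each coefficient as the fraction of a job's runtime that scales with transaction volume (an interpretation for illustration, not something the scheduler reports):

```python
def project_duration(duration_min: float, elasticity: float, volume_growth: float) -> float:
    """Linear projection: only the elastic fraction of the runtime scales
    with volume; the rest is treated as fixed overhead."""
    return duration_min * (1 + elasticity * volume_growth)

# CNBEOD-POST1 before the surge: 38 minutes, elasticity 0.94, volume +30%.
projected = project_duration(38, 0.94, 0.30)   # ~48.7 minutes
```

The observed November figure was 52 minutes, above the linear projection, which is consistent with DB2 lock contention compounding beyond linear scaling as volume grows.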

The Solution

Kwame presented a three-phase batch window re-engineering plan.

Phase 1: Dependency Cleanup (Week 1–2)

Action: Remove 127 unnecessary dependencies after validating each with SMF data and application analysis.

Validation process: For each dependency, Kwame checked:

  1. Does the predecessor produce output that the successor consumes?
  2. Do they share any datasets with conflicting DISP?
  3. Do they update the same DB2 tables?
  4. Is there any business logic reason for ordering?

If all four answers were "no," the dependency was removed — but not before running a 3-day parallel test where the dependency was kept in the scheduler but both jobs were eligible to run simultaneously on a test LPAR.
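The four-question filter amounts to a simple predicate. A sketch in Python, where the per-pair facts (consumed output, DISP conflicts, shared tables, business rules) would in practice come from SMF 14/15 data and application analysis; here they are just fields on a record:

```python
from dataclasses import dataclass

@dataclass
class DependencyFacts:
    """Answers to Kwame's four validation questions for one dependency."""
    consumes_predecessor_output: bool
    conflicting_disp_on_shared_dataset: bool
    updates_same_db2_tables: bool
    business_ordering_required: bool

def removable(f: DependencyFacts) -> bool:
    """A dependency is a removal candidate only if all four answers are no.
    (Actual removal still required the 3-day parallel test on a test LPAR.)"""
    return not any([
        f.consumes_predecessor_output,
        f.conflicting_disp_on_shared_dataset,
        f.updates_same_db2_tables,
        f.business_ordering_required,
    ])
```

Encoding the checklist as code makes the review repeatable: the quarterly dependency review recommended later can rerun the same predicate against fresh SMF data.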

Result: The critical path restructured. Jobs that were previously serialized behind unnecessary predecessors could now start earlier. Several parallel streams emerged that had been artificially sequential.

Critical path reduction: 47 minutes (from 385 to 338 minutes).

Phase 2: Job Splitting (Week 3–4)

Action: Split the three biggest critical-path jobs into parallel key-range jobs.

CNBEOD-POST1 (Account Posting): Split into 4 parallel jobs by account number range. Each processes approximately 125,000 accounts. A merge step reconciles cross-boundary transactions.

Before: CNBEOD-POST1 → 52 minutes (serial)

After:
  CNBEOD-POST1A (accounts 000-249) → 14 min ─┐
  CNBEOD-POST1B (accounts 250-499) → 14 min ─┤
  CNBEOD-POST1C (accounts 500-749) → 13 min ─┼→ CNBEOD-POSTM (merge) → 4 min
  CNBEOD-POST1D (accounts 750-999) → 13 min ─┘

  Elapsed: 14 + 4 = 18 minutes
  Savings: 34 minutes

CNBEOD-VAL01 (Validation): Split into 3 parallel jobs by transaction type (retail, commercial, card). Each type has independent validation rules.

Before: CNBEOD-VAL01 → 46 minutes (serial)

After:
  CNBEOD-VALRT (retail txns)     → 22 min ─┐
  CNBEOD-VALCM (commercial txns) → 16 min ─┼→ CNBEOD-VALMG (merge) → 3 min
  CNBEOD-VALCD (card txns)       → 18 min ─┘

  Elapsed: 22 + 3 = 25 minutes
  Savings: 21 minutes

CNBEOD-BALS (Balance Calculation): Split into 2 parallel jobs (checking/savings vs. commercial/CD).

Before: CNBEOD-BALS → 44 minutes (serial)

After:
  CNBEOD-BALS1 (checking/savings) → 26 min ─┐
  CNBEOD-BALS2 (commercial/CD)    → 20 min ─┴→ CNBEOD-BALSM (merge) → 3 min

  Elapsed: 26 + 3 = 29 minutes
  Savings: 15 minutes

COBOL changes required: Each program needed to accept key-range or type-range parameters via the JCL PARM field. The merge steps were new programs — simple sequential file merge logic with running total reconciliation.

Critical path reduction: 62 minutes (from 338 to 276 minutes). The three splits saved 70 minutes in isolation, but path restructuring made some of those savings overlap, so the measured reduction was 62 minutes.
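All three splits follow the same fan-out/fan-in pattern: the parallel legs overlap in time, so the path cost is the slowest leg plus the merge step, not the sum of the legs. A sketch with the numbers above:

```python
def split_elapsed(leg_minutes, merge_minutes):
    """Elapsed time of a fan-out/fan-in split: slowest parallel leg + merge."""
    return max(leg_minutes) + merge_minutes

post1 = split_elapsed([14, 14, 13, 13], merge_minutes=4)  # 18, was 52 serial
val01 = split_elapsed([22, 16, 18], merge_minutes=3)      # 25, was 46
bals  = split_elapsed([26, 20], merge_minutes=3)          # 29, was 44
```

The per-job savings (34 + 21 + 15 = 70 minutes) overstate the path-level effect, which is why the measured critical-path reduction was 62 minutes rather than 70.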

Phase 3: I/O Optimization (Week 5–6)

Action: Optimize I/O for all critical-path jobs.

Lisa Tran led this phase, focusing on:

  1. BUFNO increases: Changed from default (5) to 30 for all sequential datasets on critical-path jobs. Required JCL changes only.

  2. BLKSIZE optimization: Reformatted 12 datasets from BLKSIZE=8000 to BLKSIZE=27998 (optimal half-track blocking). Required dataset recreation and reload.

  3. DB2 buffer pool tuning: Increased VPSIZE for the batch buffer pools from 50,000 to 150,000 pages. Required DB2 parameter change and restart (done during a weekend maintenance window).

  4. Dataset placement: Moved the 8 most-accessed batch datasets to dedicated volumes with no other allocation during the batch window. Eliminated I/O contention from non-batch work.
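The BLKSIZE change pays off because 3390-geometry DASD stores at most two 27,998-byte blocks per track; smaller blocks waste track capacity on inter-block gaps. A rough utilization comparison in Python, where the 200-byte record length is an assumption and the blocks-per-track division is a simplification of the real 3390 device capacity tables:

```python
HALF_TRACK = 27998  # optimal half-track BLKSIZE on 3390-geometry DASD

def records_per_track(blksize: int, lrecl: int) -> int:
    """Approximate fixed-length records stored per track at a given BLKSIZE.
    Simplification: blocks per track estimated by byte count alone; real
    3390 block counts follow device capacity tables, not simple division."""
    recs_per_block = blksize // lrecl
    blocks_per_track = (2 * HALF_TRACK) // blksize
    return recs_per_block * blocks_per_track

small = records_per_track(8000, 200)    # ~240 records per track
large = records_per_track(27998, 200)   # ~278 records per track
```

For a multi-million-record sequential dataset, that difference in track utilization translates directly into fewer I/O operations per pass, which is where the critical-path minutes came from.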

Critical path reduction: 23 minutes (from 276 to 253 minutes).

The Results

Metric                    Before    After     Improvement
───────────────────────────────────────────────────────────
Critical path (Nov avg)   385 min   253 min   132 min (34%)
Window margin             -10 min   122 min   +132 min
Projected safe duration   Broken    14 months at current growth
Jobs on critical path     14        11 (restructured)
Parallel streams          3         7
Project duration          --        6 weeks
COBOL programs changed    --        6 (3 split + 3 merge)
JCL members changed       --        47
Scheduler definitions     --        189 modified
Cost                      --        ~$180K (staff time)

The batch window that was broken now had 122 minutes of margin — more than it ever had, even before the mobile banking partnership.

Lessons Learned

1. Monitor trends, not thresholds. The window was drifting for weeks before it broke. A simple trend chart of nightly completion times would have flagged the issue a month earlier.

2. Dependencies are the first place to look. 127 unnecessary dependencies — over 5% of the total graph — had accumulated over years. Nobody questioned them because the window had margin. Implement a quarterly dependency review.

3. Volume planning requires communication. The business team knew about the mobile banking launch months in advance. Nobody communicated the volume impact to batch operations. Kwame now requires a batch impact assessment for any business initiative expected to change transaction volume by more than 10%.

4. Job splitting is powerful but requires application discipline. The COBOL programs needed to accept range parameters cleanly. This only worked because the original programs were well-structured with clear PERFORM sections. Had they been monolithic procedural code, the split would have required months of refactoring.

5. The batch window is infrastructure. It needs the same architectural attention as the CICS regions, the DB2 subsystems, and the network. CNB now treats the batch window as a first-class architectural component with quarterly reviews, capacity projections, and formal change management.
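The trend monitoring that lesson 1 calls for needs nothing more than a least-squares fit over nightly completion times. A sketch in Python (the sample data is invented; 420 minutes is the hard 11:00 PM-to-6:00 AM window):

```python
def days_until_breach(completions_min, deadline_min=420):
    """Fit a straight line to nightly completion times (minutes after the
    11:00 PM window open) and project nights remaining until the trend
    crosses the deadline. Returns None if there is no upward trend."""
    n = len(completions_min)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(completions_min) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, completions_min))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    if slope <= 0:
        return None
    # Nights from the last observation until the fitted line hits the deadline.
    return (deadline_min - intercept) / slope - (n - 1)

# Completion drifting ~5 minutes per night from a 390-minute baseline:
sample = [390, 395, 400, 405, 410]
remaining = days_until_breach(sample)
```

Even this crude projection, run nightly against the completion-time history, would have flagged the November drift weeks before the window actually broke.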

Discussion Questions

  1. Kwame's dependency validation process checked four criteria before removing a dependency. Can you think of additional criteria that should be checked?

  2. The job-splitting approach assumed relatively even data distribution across key ranges. What would happen if 60% of accounts fell in one range? How would you handle it?

  3. Rob Calloway's recovery decision on the night the window broke was to wait for batch to finish. Under what circumstances would bringing CICS up before batch completes be the right call?

  4. CNB's re-engineering project took 6 weeks and changed only 6 COBOL programs. If the COBOL programs had been poorly structured (no clear section boundaries, heavy use of GO TO, no PARM handling), how would the project timeline and approach differ?

  5. The project cost $180K in staff time and recovered 132 minutes of margin. How would you calculate the ROI of this investment? What's the cost of a blown batch window?

  6. After re-engineering, CNB has 7 parallel streams instead of 3. Does increased parallelism introduce any new risks? What monitoring changes are needed?