In This Chapter
- 23.1 The 6am Deadline — Why Batch Window Engineering Is Architecture
- 23.2 The Batch Window as a Graph — Jobs, Dependencies, and Critical Path
- 23.3 The Math — Throughput Calculations, I/O Analysis, and Capacity
- 23.4 Job Scheduling — TWS, CA-7, Control-M, and the Art of Dependencies
- 23.5 Parallel Streams — Running Jobs Simultaneously Without Stepping on Each Other
- 23.6 Window Compression — When You Need to Finish Faster
- 23.7 When the Window Breaks — Batch Failure Analysis and Recovery
- Production Considerations
- Project Checkpoint — HA Banking System End-of-Day Batch Window
- Summary
- Spaced Review
Chapter 23: Batch Window Engineering
Job Scheduling, Critical Path Analysis, and the Math of Getting It All Done by 6am
"The batch window is a scheduling problem, not a performance problem."
23.1 The 6am Deadline — Why Batch Window Engineering Is Architecture
Rob Calloway has been running batch operations at Continental National Bank for seventeen years. He's seen the batch window from every angle — the nights it finished at 4:15am with room to spare, and the nights he was on the phone at 5:47am with the CIO explaining why online wouldn't be up by 6:00. He'll tell you the same thing every time: "People think batch is about making programs run fast. It's not. It's about making the whole thing finish on time."
That distinction — individual job performance versus end-to-end window completion — is the threshold concept for this entire chapter. And it's the concept that separates batch operators from batch architects.
The Batch Window Defined
The batch window is the period between the close of online processing and the required resumption of online services. At CNB, that window is:
- Online close: 11:00 PM Eastern (CICS regions quiesced by 11:15 PM)
- Online open: 6:00 AM Eastern (CICS regions must be accepting transactions)
- Available window: 6 hours 45 minutes (405 minutes, from the 11:15 PM quiesce to 6:00 AM)
- Required buffer: 30 minutes (for CICS startup, cache warming, verification)
- Effective window: 375 minutes
That's 375 minutes to process 500 million transactions worth of end-of-day activity. Every night. Including the night after Black Friday. Including the night the quarterly interest calculation runs. Including the night DB2 decided to reorganize the customer master table.
🔄 ANCHOR — CNB's Q4 Crisis: In Q4 2024, CNB's transaction volume grew 30% due to a new mobile banking partnership. Nobody changed the batch jobs. Nobody re-analyzed the critical path. The first sign of trouble was a Tuesday night when the window finished at 5:52 AM — eight minutes of margin. By Thursday, it blew the window entirely. Rob Calloway got the 5:47 AM call. "Online can't come up. Batch is still running." Those are the eight words no batch operations manager ever wants to hear.
The root cause wasn't any single slow job. Every individual job was performing within its historical norms. The problem was that the aggregate throughput of the serial dependency chain had exceeded the window capacity. The critical path — the longest chain of jobs that must execute sequentially — had grown from 310 minutes to 420 minutes. Individual job optimization couldn't fix it. The architecture had to change.
Why This Is Architecture, Not Operations
Batch window engineering sits at the intersection of:
- Data architecture — What data flows where, and in what order?
- Systems architecture — How many initiators, LPARs, DB2 subsystems?
- Application architecture — How are programs structured for restartability?
- Capacity architecture — What throughput does the hardware support?
- Organizational architecture — Who owns which jobs, and who can change them?
💡 KEY INSIGHT: A batch window that works today but has no margin is a batch window that's already broken — you just don't know it yet. Volume grows. New requirements appear. Regulatory jobs get added. If you're using 95% of your window today, you'll blow it within two quarters.
The Batch Window Across the Industry
CNB's window is typical for a Tier-1 bank. But batch windows vary dramatically across industries and shop sizes:
Organization Type     Typical Window   Jobs      Critical Path
──────────────────────────────────────────────────────────────
Large bank (Tier 1)   6-8 hours        500+      4-6 hours
Mid-size bank         8-10 hours       200-400   3-5 hours
Insurance company     8-12 hours       300-600   4-8 hours
Federal agency        10-14 hours      100-300   3-6 hours
Retail chain          6-8 hours        150-300   3-5 hours
Federal Benefits Administration, where Sandra Chen is modernizing a 40-year-old codebase, has a 12-hour window — generous by banking standards. But their critical path is still 8 hours because the legacy code was never parallelized. Marcus Whitfield, the retiring SME, remembers when the window was 18 hours and nobody worried about it. "We had a mainframe to ourselves back then," he says. "No online to fight with."
SecureFirst Retail Bank, where Yuki Nakamura runs DevOps, faces the opposite problem: their mobile-first strategy means online processing runs nearly 24/7. The batch window has been compressed to 4 hours — and they're moving toward continuous batch processing that eliminates the window concept entirely. That's the future for many shops, but it requires a fundamentally different architecture that most COBOL applications weren't designed for.
🚪 GATEWAY CONCEPT: This chapter is the entry point for Part V (Batch Architecture at Scale). Every subsequent chapter in this part — individual batch program design, parallel processing, and disaster recovery — builds on the DAG model and critical path concepts introduced here. If you don't internalize the idea that the batch window is a graph problem, the rest of Part V will feel like disconnected optimization tips rather than a coherent architectural framework.
The rest of this chapter teaches you to think about batch windows the way Rob Calloway learned to think about them after that Q4 crisis: as an engineering discipline with mathematical foundations, not as a hope-and-pray operational exercise.
23.2 The Batch Window as a Graph — Jobs, Dependencies, and Critical Path
Modeling Jobs as a DAG
Every batch window can be modeled as a directed acyclic graph (DAG). Each node is a job. Each directed edge represents a dependency — "this job must complete before that job can start."
Consider a simplified version of CNB's end-of-day processing:
Job ID    Description                    Duration   Predecessors
─────────────────────────────────────────────────────────────────────
EOD-001   Transaction extract            25 min     (none)
EOD-002   ATM settlement extract         15 min     (none)
EOD-003   Wire transfer reconciliation   20 min     (none)
EOD-004   Transaction validation         35 min     EOD-001
EOD-005   ATM posting                    20 min     EOD-002
EOD-006   Wire posting                   30 min     EOD-003
EOD-007   Combined posting               45 min     EOD-004, EOD-005, EOD-006
EOD-008   Balance calculation            40 min     EOD-007
EOD-009   Interest accrual               50 min     EOD-008
EOD-010   GL posting                     30 min     EOD-008
EOD-011   Regulatory extract             25 min     EOD-009, EOD-010
EOD-012   Statement generation           60 min     EOD-009
EOD-013   End-of-day report              15 min     EOD-011, EOD-012
This DAG has three independent entry points (EOD-001, EOD-002, EOD-003) that can run in parallel, a convergence point (EOD-007), and multiple paths to the terminal node (EOD-013).
Critical Path Analysis
The critical path is the longest path through the DAG measured by total elapsed time. It determines the minimum possible batch window duration — you cannot finish faster than the critical path, no matter how many other jobs you parallelize.
Let's trace every path from start to finish:
Path A: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
25 + 35 + 45 + 40 + 50 + 60 + 15 = 270 minutes
Path B: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-009 → EOD-011 → EOD-013
25 + 35 + 45 + 40 + 50 + 25 + 15 = 235 minutes
Path C: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-010 → EOD-011 → EOD-013
25 + 35 + 45 + 40 + 30 + 25 + 15 = 215 minutes
Path D: EOD-002 → EOD-005 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
15 + 20 + 45 + 40 + 50 + 60 + 15 = 245 minutes
Path E: EOD-003 → EOD-006 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
20 + 30 + 45 + 40 + 50 + 60 + 15 = 260 minutes
The critical path is Path A at 270 minutes. That's 4 hours and 30 minutes — within the 375-minute effective window, but with only 105 minutes of margin.
🔍 ANALYSIS — What the Critical Path Tells You:
- Optimizing EOD-002 (ATM settlement) does nothing for the window. It's not on the critical path.
- Optimizing EOD-012 (statement generation, 60 minutes) would reduce the critical path by however many minutes you save — it is on the critical path.
- Adding a new 20-minute job after EOD-009 but before EOD-012 would increase the critical path to 290 minutes.
- The path through EOD-010/EOD-011 has 55 minutes of slack (270 - 215 = 55). You could delay EOD-010 by up to 55 minutes without affecting the window.
Slack and Float
Every job not on the critical path has slack (also called float) — the amount of time its start can be delayed without affecting the overall window completion.
Job       Earliest   Latest    Slack   On Critical
          Start      Start     (min)   Path?
──────────────────────────────────────────────────
EOD-001   0:00       0:00        0     YES
EOD-002   0:00       0:25       25     no
EOD-003   0:00       0:10       10     no
EOD-004   0:25       0:25        0     YES
EOD-005   0:15       0:40       25     no
EOD-006   0:20       0:30       10     no
EOD-007   1:00       1:00        0     YES
EOD-008   1:45       1:45        0     YES
EOD-009   2:25       2:25        0     YES
EOD-010   2:25       3:20       55     no
EOD-011   3:15       3:50       35     no
EOD-012   3:15       3:15        0     YES
EOD-013   4:15       4:15        0     YES
⚠️ WARNING — Slack Is Fragile: That 10 minutes of slack on EOD-003's path assumes everything runs at normal duration. If EOD-006 runs 15 minutes longer than expected (a common occurrence during high-volume periods), EOD-003's path suddenly has -5 minutes of slack — meaning it is now the new critical path. Monitor slack trends, not just the current critical path.
Hidden Dependencies
The DAG you draw on paper isn't always the DAG that exists in reality. Hidden dependencies include:
Dataset contention: Two jobs that have no logical dependency may both need exclusive access to the same dataset. If JOBABC allocates DISP=OLD on CUST.MASTER and JOBXYZ also needs DISP=OLD on the same dataset, they serialize — even though neither is a predecessor of the other.
DB2 lock conflicts: Two batch DB2 programs updating the same tablespace will experience lock contention even without formal job dependencies. One will wait. The effective throughput drops.
Initiator starvation: If you have 15 jobs eligible to run but only 8 batch initiators in the right class, 7 jobs queue. This creates implicit serialization.
Tape drive allocation: Yes, in 2026, some shops still have tape. Two jobs needing 4 tape drives each on a system with 6 drives will serialize.
GDG contention: If JOBA creates a new generation of a GDG and JOBB reads the current generation, there's an implicit ordering — but the scheduler may not know about it unless you tell it.
🧩 PATTERN — Dependency Discovery: To find hidden dependencies, don't just read the scheduler. Run a week of SMF data (type 30 records) and correlate job start/end times with dataset allocation records (type 14/15). Jobs that never overlap aren't necessarily independent — they may be implicitly serialized by resource contention. This is where operations knowledge meets architecture knowledge, and it's why Rob Calloway's seventeen years of experience matter.
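The correlation idea can be sketched once the job start/end intervals have been extracted from SMF. A minimal illustration in Python — the job names and intervals below are hypothetical, and a real analysis would read actual SMF type 30 extracts rather than literals:

```python
# Sketch: flag job pairs that never overlap across a week of runs.
# Jobs that are "independent" on paper but never run together are
# candidates for hidden serialization (dataset ENQ, initiators, locks).

def overlaps(a, b):
    """True if intervals (start, end) a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def never_overlapping_pairs(runs):
    """runs: job name -> list of (start_min, end_min), one per night.
    Compares same-night intervals, since contention happens per night."""
    jobs = sorted(runs)
    suspects = []
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            nights = zip(runs[a], runs[b])
            if not any(overlaps(ia, ib) for ia, ib in nights):
                suspects.append((a, b))
    return suspects

week = {
    "JOBABC": [(0, 30), (0, 32), (0, 29)],
    "JOBXYZ": [(31, 60), (33, 61), (30, 58)],  # always waits for JOBABC
    "JOBPAR": [(5, 40), (4, 42), (6, 39)],     # genuinely concurrent
}
print(never_overlapping_pairs(week))  # [('JOBABC', 'JOBXYZ')]
```

A pair on this list isn't proof of a hidden dependency — only a prompt to ask why the two jobs never coexist.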
The Real-World DAG: Complexity at Scale
CNB's simplified example has 13 jobs. The real batch window has 847 jobs and 2,341 dependency edges. At that scale, manual path analysis is impossible. You need tools.
Scheduler-native analysis: TWS/OPC provides a critical path analysis feature (the "Plan" view) that calculates the longest path through the current plan. CA-7 has similar reporting through SASSHIS7. Control-M's Planning domain provides visual DAG rendering with critical path highlighting.
Custom analysis: Many shops extract scheduler dependency data to a flat file and process it with custom programs. The algorithm for finding the critical path in a DAG is a topological sort followed by a forward pass (calculating earliest start/finish for each node) and a backward pass (calculating latest start/finish). Jobs where earliest finish equals latest finish are on the critical path.
* SIMPLIFIED CRITICAL PATH FORWARD PASS
* (Pseudocode — a real implementation needs a graph data structure.
*  Assumes the job table is already in topological order, so every
*  predecessor's earliest start is computed before its successors'.)
PERFORM VARYING WS-NODE-IDX FROM 1 BY 1
        UNTIL WS-NODE-IDX > WS-TOTAL-JOBS
    MOVE 0 TO WS-EARLIEST-START(WS-NODE-IDX)
    PERFORM VARYING WS-PRED-IDX FROM 1 BY 1
            UNTIL WS-PRED-IDX > WS-PRED-COUNT(WS-NODE-IDX)
        COMPUTE WS-PRED-FINISH =
            WS-EARLIEST-START(WS-PRED-NODE(WS-PRED-IDX))
          + WS-DURATION(WS-PRED-NODE(WS-PRED-IDX))
        IF WS-PRED-FINISH > WS-EARLIEST-START(WS-NODE-IDX)
            MOVE WS-PRED-FINISH
              TO WS-EARLIEST-START(WS-NODE-IDX)
        END-IF
    END-PERFORM
END-PERFORM.
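For comparison, here is the full forward-and-backward pass sketched in Python, applied to the simplified EOD table from this section. Python is used purely to illustrate the algorithm; the result agrees with the path trace earlier in the section:

```python
# Critical path via forward/backward pass over the simplified EOD DAG.
jobs = {  # job: (duration_min, [predecessors])
    "EOD-001": (25, []), "EOD-002": (15, []), "EOD-003": (20, []),
    "EOD-004": (35, ["EOD-001"]), "EOD-005": (20, ["EOD-002"]),
    "EOD-006": (30, ["EOD-003"]),
    "EOD-007": (45, ["EOD-004", "EOD-005", "EOD-006"]),
    "EOD-008": (40, ["EOD-007"]), "EOD-009": (50, ["EOD-008"]),
    "EOD-010": (30, ["EOD-008"]),
    "EOD-011": (25, ["EOD-009", "EOD-010"]),
    "EOD-012": (60, ["EOD-009"]),
    "EOD-013": (15, ["EOD-011", "EOD-012"]),
}

succs = {j: [] for j in jobs}
for j, (_, preds) in jobs.items():
    for p in preds:
        succs[p].append(j)

order = list(jobs)  # the table is already in topological order

# Forward pass: earliest start = max(earliest finish of predecessors)
es = {}
for j in order:
    es[j] = max((es[p] + jobs[p][0] for p in jobs[j][1]), default=0)

window = max(es[j] + jobs[j][0] for j in jobs)  # critical path length

# Backward pass: latest start that does not delay the window
ls = {}
for j in reversed(order):
    ls[j] = min((ls[s] for s in succs[j]), default=window) - jobs[j][0]

slack = {j: ls[j] - es[j] for j in jobs}
critical = [j for j in order if slack[j] == 0]

print(window)    # 270
print(critical)  # EOD-001, 004, 007, 008, 009, 012, 013 (Path A)
```

Jobs where earliest start equals latest start (zero slack) are exactly the critical path.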
⚠️ WARNING — DAG Integrity: If your dependency graph has a cycle, it's not a DAG and no valid schedule exists. Schedulers reject cycles at definition time, but you can create logical cycles through conditional dependencies or cross-system references that the scheduler doesn't detect. Always validate DAG integrity after making dependency changes.
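Validating DAG integrity is a standard topological-sort check: if Kahn's algorithm cannot drain every node, a cycle exists. A sketch, assuming dependencies have been exported from the scheduler into a job-to-predecessors map (the job names are hypothetical):

```python
# Sketch: detect cycles with Kahn's algorithm. Any job whose in-degree
# never reaches zero is in a cycle, or blocked behind one.
from collections import deque

def find_cycle_members(deps):
    """deps: job -> list of predecessor jobs."""
    indeg = {j: len(p) for j, p in deps.items()}
    succs = {j: [] for j in deps}
    for j, preds in deps.items():
        for p in preds:
            succs[p].append(j)
    ready = deque(j for j, d in indeg.items() if d == 0)
    while ready:
        j = ready.popleft()
        for s in succs[j]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return {j for j, d in indeg.items() if d > 0}

good = {"A": [], "B": ["A"], "C": ["B"]}
bad = {"A": [], "B": ["A", "D"], "C": ["B"], "D": ["C"]}  # B -> C -> D -> B
print(find_cycle_members(good))  # set()
print(find_cycle_members(bad))   # {'B', 'C', 'D'} (in some order)
```

Run a check like this after every dependency change, including the cross-system references your scheduler may not validate itself.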
23.3 The Math — Throughput Calculations, I/O Analysis, and Capacity
Why the Math Matters
Most batch operations teams run on instinct. "EOD-007 usually takes about 45 minutes." "If we add a job here, it'll probably push us 10 minutes later." These estimates are often wrong — not because the people are bad at their jobs, but because human intuition about cumulative effects in a dependency graph is unreliable.
The math in this section gives you something better than intuition: predictive models that tell you, with reasonable accuracy, how long a job will take at a given volume, how much the critical path will grow with volume increases, and when the window will break. Rob Calloway started tracking these numbers after the Q4 crisis and now publishes a monthly "Batch Window Health Report" to the architecture team. It contains two numbers: current critical path duration and projected months to exhaustion. Those two numbers drive more architecture decisions than any other metric in his organization.
Records Per Second — The Fundamental Unit
Every batch job's elapsed time is determined by how fast it processes records. The fundamental equation:
Elapsed Time = Total Records / Processing Rate (records/second)
But "processing rate" isn't a single number. It's the result of an interaction between CPU processing, I/O operations, and DB2 access:
Time per record = CPU time + I/O wait time + DB2 wait time + other wait time
Processing rate = 1 / Time per record
For a typical COBOL batch program reading a sequential file and updating DB2:
Component              Time per record   Percentage
───────────────────────────────────────────────────
CPU (COBOL logic)      0.015 ms            3%
Sequential read I/O    0.050 ms           10%
DB2 SQL execution      0.350 ms           70%
DB2 lock/latch wait    0.060 ms           12%
Other (catalog, ENQ)   0.025 ms            5%
───────────────────────────────────────────────────
Total                  0.500 ms          100%
Processing rate        2,000 records/sec
With 10 million records to process:
Elapsed time = 10,000,000 / 2,000 = 5,000 seconds = 83.3 minutes
💡 KEY INSIGHT: In this example, 70% of elapsed time is DB2 SQL execution. Optimizing the COBOL logic (3% of time) would save approximately 2.5 minutes on an 83-minute job. Re-indexing the DB2 table to cut SQL time by 30% would save 17.5 minutes. Know where the time goes before you optimize.
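This budget arithmetic is worth automating so it can be rerun whenever volumes change. A small sketch of the calculation above:

```python
# Sketch of the records-per-second budget from the component table.
components_ms = {          # time per record, milliseconds
    "cpu": 0.015,
    "seq_read_io": 0.050,
    "db2_sql": 0.350,
    "db2_lock_wait": 0.060,
    "other": 0.025,
}
total_ms = sum(components_ms.values())    # 0.500 ms per record
rate = 1000.0 / total_ms                  # 2,000 records/sec

records = 10_000_000
elapsed_min = records / rate / 60         # ~83.3 minutes

# Removing a component entirely saves at most its share of elapsed time.
cobol_ceiling = elapsed_min * components_ms["cpu"] / total_ms             # ~2.5 min
sql_30pct_cut = elapsed_min * components_ms["db2_sql"] / total_ms * 0.30  # ~17.5 min
print(round(elapsed_min, 1), round(cobol_ceiling, 1), round(sql_30pct_cut, 1))
```

The same model answers "what if" questions instantly: double the record count, halve the SQL time, and the elapsed projection updates with it.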
I/O Throughput Analysis
For sequential file processing (QSAM/BSAM), I/O throughput depends on:
Block size × Blocks per track × Tracks per seek = Data per I/O operation
A well-tuned sequential read on modern DASD:
Configuration:
BLKSIZE = 27,998 (optimal for 3390 half-track)
BUFNO = 30 (30 I/O buffers for read-ahead)
Channel speed: FICON 16 Gbps
Cache hit ratio: 95% (sequential detect activated)
Throughput:
Cached reads: ~200 MB/sec
Non-cached reads: ~40 MB/sec
Effective (95% cache): ~192 MB/sec
Record size: 500 bytes
Records per block: 55
Blocks per second: ~6,860 (at the 192 MB/sec effective rate)
Records per second: ~377,000 (I/O only, no processing)
The I/O subsystem can deliver nearly 400,000 records per second for sequential reads. Your COBOL program processes 2,000 records per second. The bottleneck is never sequential I/O for a well-tuned dataset — it's processing time.
⚠️ WARNING — Random I/O Is Different: The numbers above are for sequential access with caching. Random I/O (VSAM KSDS random reads, DB2 index lookups) drops to 5,000–50,000 I/O operations per second depending on cache hit ratio. Random I/O can be the bottleneck, especially for DB2 batch programs doing singleton SELECTs with index access.
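The sequential-read arithmetic above can be sketched directly — the MB/sec figures below are the illustrative ones from this section, not universal constants:

```python
# Sketch: effective sequential-read throughput blended across cache
# hits and misses, then converted to blocks and records per second.
blksize = 27_998            # half-track blocking on 3390 geometry
lrecl = 500
cached_mb_s, uncached_mb_s, hit_ratio = 200, 40, 0.95

effective_mb_s = hit_ratio * cached_mb_s + (1 - hit_ratio) * uncached_mb_s  # 192
blocks_per_sec = effective_mb_s * 1_000_000 / blksize                       # ~6,860
recs_per_block = blksize // lrecl                                           # 55
io_recs_per_sec = blocks_per_sec * recs_per_block                           # ~377,000

print(round(effective_mb_s), round(blocks_per_sec), round(io_recs_per_sec))
```

Even this rough model makes the key point: I/O can feed records two orders of magnitude faster than the program consumes them.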
CPU vs. I/O Bound Analysis
Classify every critical-path job:
CPU Bound (CPU time > 60% of elapsed):
- Complex calculations (interest accrual, actuarial)
- Data transformation with heavy COMPUTE
- Sorting (internal SORT, not DFSORT)
- Compression/decompression
Fix: zIIP offload (for DB2/XML), faster processor, algorithm optimization
I/O Bound (I/O wait > 60% of elapsed):
- Large sequential file scans
- Random VSAM access
- Tape processing
- Cross-system dataset access
Fix: Better block sizes, more buffers, parallel I/O, data placement
DB2 Bound (DB2 wait > 60% of elapsed):
- Heavy SQL batch processing
- Lock contention with other batch jobs
- Tablespace scans instead of index access
- Commit frequency too low (lock escalation)
Fix: SQL tuning, index optimization, commit frequency, parallel DB2 threads
Worked Example — CNB Transaction Validation (EOD-004)
EOD-004: Transaction Validation
Input: 12.5M transactions (daily volume before the Q4 spike)
Processing: Validate each transaction against business rules,
check fraud flags, verify account status via DB2
Measured rates (from SMF Type 30):
CPU time per invocation: 18.2 minutes
Elapsed time per invocation: 35.0 minutes
CPU/Elapsed ratio: 0.52 (mixed CPU/DB2 bound)
DB2 accounting (IFCID 3):
SQL calls: 37.5M (3 per transaction)
Class 2 elapsed: 14.8 minutes
Class 2 CPU: 4.1 minutes
SQL DB2 wait: 10.7 minutes
Throughput:
12,500,000 records / (35 × 60 seconds) = 5,952 records/sec
Q4 projection at 30% growth:
16,250,000 records / 5,952 rps = 2,730 seconds = 45.5 minutes
Impact: EOD-004 grows from 35 to 45.5 minutes.
Critical path impact: +10.5 minutes
New critical path: 280.5 minutes (was 270)
Remaining margin: 94.5 minutes (was 105)
🔍 ANALYSIS: The 30% volume growth costs 10.5 minutes on the critical path. That's manageable for this one job. But when every job on the critical path grows by a similar proportion, the cumulative effect is what broke CNB's window. Seven critical-path jobs each growing 10–15 minutes added up to 85 minutes of growth — and the window only had 105 minutes of margin.
Capacity Planning Formula
For any batch window, the capacity equation is:
Window Capacity = Available Time - Critical Path Length - Buffer
If Window Capacity < 0, the window is broken.
If Window Capacity < Growth Margin, the window will break soon.
Growth Margin = (Monthly Volume Growth Rate × Months to Next Review)
× Critical Path Sensitivity Factor
Critical Path Sensitivity Factor =
Sum of (job_duration × volume_elasticity) for all critical path jobs
÷ Sum of (job_duration) for all critical path jobs
Volume elasticity measures how much a job's duration changes per unit of volume growth. A purely sequential file processor has elasticity of 1.0 (linear). A job with significant fixed overhead (JCL setup, sort initialization, DB2 thread allocation) has elasticity less than 1.0.
For CNB's batch window at Q4:
Available Time: 375 minutes
Critical Path Length: 270 minutes
Buffer: 30 minutes (Rob's minimum)
Window Capacity: 75 minutes
Monthly Volume Growth Rate: 2.5%
Months to Next Review: 6
Avg Volume Elasticity: 0.85
Critical Path Duration: 270 minutes
Growth Margin Needed: 2.5% × 6 × 0.85 × 270 = 34.4 minutes
Verdict: 75 > 34.4 → Safe for 6 months (pre-Q4 projection)
After the Q4 spike:
Critical Path Length: 420 minutes (actual, after 30% growth)
Window Capacity: 375 - 420 - 30 = -75 minutes
Verdict: BROKEN
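The capacity check is simple enough to script and rerun monthly, the way Rob's health report does. A sketch using the CNB figures above:

```python
# Sketch of the batch window capacity check.
def window_capacity(available_min, critical_path_min, buffer_min):
    """Minutes of margin; negative means the window is broken."""
    return available_min - critical_path_min - buffer_min

def growth_margin(monthly_growth, months, elasticity, critical_path_min):
    """Expected critical-path growth (minutes) before the next review."""
    return monthly_growth * months * elasticity * critical_path_min

cap = window_capacity(375, 270, 30)        # 75 minutes of margin
need = growth_margin(0.025, 6, 0.85, 270)  # ~34.4 minutes of growth
print(cap, round(need, 1), "SAFE" if cap > need else "AT RISK")

# After the Q4 spike, the same check fails outright:
print(window_capacity(375, 420, 30))       # -75 -> BROKEN
```

The two printed numbers are exactly the two metrics in the monthly health report: current margin and projected growth against it.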
23.4 Job Scheduling — TWS, CA-7, Control-M, and the Art of Dependencies
The Big Three Schedulers
Every mainframe shop runs one of three enterprise job schedulers. The concepts are identical; the syntax differs.
IBM Tivoli Workload Scheduler (TWS/OPC):
//********************************************
//* TWS APPLICATION DEFINITION — EOD-004 *
//********************************************
ADID(CNBEOD004)
OWNER(BATCHOPS)
PRIORITY(5)
WSNAME(CNBSYSA)
RUN DAILY
CALENDAR(CNB-BUSINESS-DAYS)
DEADLINE(0430)
PREDECESSOR(CNBEOD001)
TYPE(SUCCESSOR)
CONDITION(RC <= 4)
RESOURCE(DB2BATCH)
QUANTITY(1)
RESOURCE(BATCHINIT-A)
QUANTITY(1)
CA-7 (Broadcom):
CA-7 JOB DEFINITION
JOB: CNBEOD04
SYSTEM: SYSA
JCLID: CNBEOD04
REQUIREMENT:
JOB CNBEOD01 - COND CODE LE 4
RESOURCE:
RES DB2BATCH QTY 1
SCHEDULE:
SCHID 001
SCAL CNB-BUS-DAYS
LEADTM 0015
DEADTM 0430
BMC Control-M:
{
"CNBEOD004": {
"Type": "Job:zOS",
"Application": "CNB-EOD",
"SubApplication": "VALIDATION",
"RunAs": "BATCHOPS",
"When": {
"RuleBasedCalendar": {
"Calendar": "CNB-BUSINESS-DAYS"
}
},
"InCondition": [
{"Name": "CNBEOD001-ENDED-OK", "Date": "ODAT"}
],
"Resource": {
"DB2BATCH": {"Quantity": 1}
}
}
}
Dependency Types
Regardless of scheduler, dependencies come in several flavors:
Job-to-Job (hard dependency): Job B cannot start until Job A completes successfully. This is the most common and the most significant for critical path analysis.
EOD-007 depends on EOD-004, EOD-005, EOD-006
// EOD-007 will not start until ALL THREE predecessors complete with RC ≤ 4
Conditional dependency: Job B runs only if Job A ends with a specific condition code.
// If EOD-004 ends RC=0, run EOD-004A (normal path)
// If EOD-004 ends RC=4, run EOD-004B (warning path — some records rejected)
// If EOD-004 ends RC>4, trigger alert, do NOT run EOD-007
Time dependency: Job starts at a specific time regardless of predecessor completion.
// STMT-GEN must not start before 02:00 AM (tape library staffing)
// REGULATORY must complete by 04:30 AM (federal filing deadline)
⚠️ WARNING — Time Dependencies Are Critical Path Killers: If you have a time-based dependency that says "don't start before 02:00" and the job's predecessors finish at 01:15, you've just added 45 minutes of dead time to the critical path. Review every time dependency quarterly. Many exist because of constraints that no longer apply.
Resource dependency: Job waits until a shared resource is available.
// Only 4 batch DB2 threads allowed simultaneously
// Only 2 jobs can run in INITCLASS-H at once
// Only 1 job can hold CUST-MASTER dataset at a time
Cross-system dependency: Job on SYSA waits for a job on SYSB to complete.
// SYSB-EOD-EXTRACT must complete before SYSA-EOD-LOAD can start
// Requires XCF signaling or scheduler cross-system communication
The Dependency Explosion Problem
CNB's batch window has 847 jobs. The dependency graph has 2,341 edges. Nobody fully understands it.
🔄 ANCHOR — Dependency Archaeology: When Rob Calloway's team analyzed the dependency graph after the Q4 crisis, they found:
- 127 unnecessary dependencies — jobs that were predecessors only because "they were always run in that order" with no actual data dependency
- 43 duplicate dependencies — Job C depended on Job A both directly and through Job B (if B already depends on A, C doesn't need to depend on A)
- 8 phantom dependencies — references to jobs that had been decommissioned years ago but whose dependency entries remained in the scheduler
- 3 circular dependency risks — not actual cycles (the scheduler would reject those) but near-cycles that made the graph nearly impossible to modify
Removing the 127 unnecessary dependencies shortened the critical path by 47 minutes — without changing a single COBOL program.
💡 KEY INSIGHT: Dependency cleanup is the single highest-ROI batch window optimization. It costs nothing, risks little (if you analyze carefully), and can recover tens of minutes from the critical path. Before you tune a single program, clean your dependency graph.
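Finding transitively implied edges like those 43 duplicates can be automated. A sketch, assuming the dependency graph has been exported as a job-to-predecessors map (the job names are hypothetical):

```python
# Sketch: find dependency edges that are already implied transitively,
# e.g. C depends on A both directly and through B.

def redundant_edges(deps):
    """deps: job -> set of direct predecessors. Edge p -> j is
    redundant if p is still reachable from j when that edge is ignored."""
    def ancestors(j, skip_edge):
        seen, stack = set(), [j]
        while stack:
            cur = stack.pop()
            for p in deps.get(cur, ()):
                if (cur, p) == skip_edge or p in seen:
                    continue
                seen.add(p)
                stack.append(p)
        return seen

    out = []
    for j, preds in deps.items():
        for p in preds:
            if p in ancestors(j, skip_edge=(j, p)):
                out.append((p, j))
    return out

deps = {
    "JOBA": set(),
    "JOBB": {"JOBA"},
    "JOBC": {"JOBA", "JOBB"},   # JOBA -> JOBC is implied via JOBB
}
print(redundant_edges(deps))  # [('JOBA', 'JOBC')]
```

A flagged edge is safe to remove only after confirming there is no timing or data reason it was made explicit — automation finds candidates; people approve deletions.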
Operational Calendar Management
Schedulers don't just manage dependencies — they manage time. Every job has a calendar that determines when it runs:
Calendar Types:
BUSINESS-DAYS: Monday-Friday, excluding bank holidays
MONTH-END: Last business day of each month
QUARTER-END: Last business day of March, June, Sept, Dec
YEAR-END: December 31 (or last business day)
DAILY: Every day including weekends
CUSTOM: Application-specific (e.g., "third Wednesday")
Calendar interactions create batch window variation. On a normal Tuesday, CNB runs 847 jobs. On a month-end Tuesday, it runs 1,023 jobs — the extra 176 are month-end-only jobs (account reconciliation, management reporting, regulatory filings). On a quarter-end that falls on month-end, it's 1,187 jobs. On December 31st, it can exceed 1,400.
🔄 ANCHOR — The Month-End/Quarter-End Problem: Rob Calloway's critical path analysis must account for these calendar variations. The critical path on a normal night is 253 minutes. On month-end, additional jobs insert into the dependency chain, extending the critical path to approximately 310 minutes. On quarter-end, it reaches 345 minutes. On year-end, 380 minutes — within 5 minutes of the effective window.
This is why Rob runs a "rehearsal" batch two weeks before every year-end: he simulates the year-end job stream on a test LPAR to verify the timing. If the rehearsal exceeds 350 minutes, he activates the pre-planned compression strategies (additional job splits, temporary dependency bypasses for non-critical reports, deferred archival).
Scheduler Resource Management
Modern schedulers manage resources as countable tokens:
Resource Definition:
RESOURCE(DB2-BATCH-THREADS) QUANTITY(8)
RESOURCE(BATCH-INITIATORS-A) QUANTITY(12)
RESOURCE(TAPE-DRIVES) QUANTITY(6)
RESOURCE(CUST-MASTER-EXCL) QUANTITY(1)
Job Requirements:
EOD-004: DB2-BATCH-THREADS(2), BATCH-INITIATORS-A(1)
EOD-007: DB2-BATCH-THREADS(3), BATCH-INITIATORS-A(1)
EOD-009: DB2-BATCH-THREADS(2), BATCH-INITIATORS-A(1), CUST-MASTER-EXCL(1)
When EOD-004 and EOD-007 need to run simultaneously, they require 5 DB2 batch threads total. If only 4 are available, one waits. This implicit serialization doesn't appear in the dependency graph but affects actual elapsed time.
🧩 PATTERN — Resource Modeling: Add resource constraints to your DAG model. For each time slot, calculate total resource demand. Where demand exceeds supply, jobs queue — and queueing time adds to elapsed time even though it's not processing time. The best batch architects model resource contention as variable edge weights in their DAG.
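Modeling contention can start very simply: simulate the eligible jobs against a resource cap and compare the makespan with the dependency-only critical path. A greedy sketch with hypothetical jobs and a DB2-thread cap:

```python
# Sketch: greedy list scheduling under a resource cap, showing how a
# thread shortage stretches elapsed time beyond the critical path.
import heapq

def simulate(jobs, threads):
    """jobs: list of (name, duration_min, threads_needed), no deps.
    Each job starts as soon as enough threads are free."""
    free = threads
    running = []            # min-heap of (finish_time, threads_held)
    clock = 0
    for name, dur, need in jobs:
        while free < need:  # wait for the earliest finisher
            clock, held = heapq.heappop(running)
            free += held
        heapq.heappush(running, (clock + dur, need))
        free -= need
    return max(t for t, _ in running)   # makespan in minutes

jobs = [("POST-A", 30, 2), ("POST-B", 30, 2), ("POST-C", 30, 2)]
print(simulate(jobs, threads=6))   # 30 — all three run at once
print(simulate(jobs, threads=4))   # 60 — POST-C queues behind a finisher
```

Three independent 30-minute jobs have a 30-minute critical path, yet with only 4 threads the window pays 60 minutes — queueing time that no dependency diagram shows.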
23.5 Parallel Streams — Running Jobs Simultaneously Without Stepping on Each Other
The Parallelization Imperative
If your batch window critical path is 300 minutes and you need it to be 200 minutes, you have exactly two options: make the critical-path jobs run faster (optimization), or rearrange the graph so fewer jobs are on the critical path (parallelization). Optimization has limits — you can't make a program faster than its I/O or DB2 constraints allow. Parallelization is theoretically limited only by the true data dependencies between jobs. In practice, the limit is resource availability (initiators, DB2 threads, I/O bandwidth) and the contention effects that arise when multiple jobs compete for shared resources.
The rest of this section focuses on finding and exploiting parallelism — the most powerful lever you have for batch window compression.
Identifying Parallelizable Work
Two jobs can run in parallel if and only if:
- No data dependency: Neither job's output is the other's input
- No dataset contention: They don't both need exclusive (DISP=OLD) access to the same dataset
- No DB2 lock conflict: They don't update overlapping rows in the same table
- Sufficient resources: Enough initiators, DB2 threads, and I/O bandwidth for both
CNB's batch window has three natural parallel streams:
Stream 1 (Retail Banking):
Customer transactions → validation → posting → balance calc
Stream 2 (Commercial Banking):
Wire transfers → reconciliation → posting → GL update
Stream 3 (Card Processing):
ATM/debit transactions → settlement → posting → interchange calc
These streams share no data until the convergence point (combined GL posting). They can run fully in parallel, and the batch window's critical path is the longest stream — not the sum of all three.
Serial execution: Stream 1 (120 min) + Stream 2 (95 min) + Stream 3 (80 min) = 295 min
Parallel execution: max(120, 95, 80) = 120 min
Savings: 175 minutes (59% reduction)
Dataset Contention Resolution
The most common barrier to parallelization is dataset contention. Solutions:
1. Convert DISP=OLD to DISP=SHR where possible:
//* BEFORE — serializes against any other user of CUST.MASTER
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=OLD
//*
//* AFTER — allows concurrent read access
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=SHR
This works only if the job reads the dataset rather than writing it. A great deal of old JCL specifies DISP=OLD out of habit when DISP=SHR would suffice.
2. Use GDG generations to decouple readers from writers:
//* Writer job creates new generation
//TRANOUT DD DSN=CNB.PROD.TRANS.DAILY(+1),
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(500,100)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27998)
//*
//* Reader job reads current generation (written by previous run)
//TRANIN DD DSN=CNB.PROD.TRANS.DAILY(0),DISP=SHR
⚠️ WARNING — GDG Catalog Serialization: Even with GDGs, the ICF catalog serializes during OPEN for GDG base updates. If 10 jobs all reference the same GDG base simultaneously, catalog contention can cause seconds to minutes of delay. For high-concurrency GDGs, consider spreading jobs' start times by 15–30 seconds.
3. Split files by key range:
//* Instead of one job processing all customers:
//* Job A processes customers 000000000–249999999
//* Job B processes customers 250000000–499999999
//* Job C processes customers 500000000–749999999
//* Job D processes customers 750000000–999999999
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=SHR
//SYSIN DD *
RANGE-START=000000000
RANGE-END=249999999
/*
This requires application-level changes to support key-range processing — but it's the single most powerful parallelization technique for CPU-bound batch jobs.
DB2 Concurrency in Batch
DB2 batch programs running simultaneously face lock contention. The severity depends on:
Lock Level         Contention Risk   Throughput Impact
──────────────────────────────────────────────────────
Row-level          Low               Minimal (unless hot rows)
Page-level         Medium            10-30% degradation
Table-level        High              Full serialization
Tablespace-level   Critical          Full serialization
🔄 ANCHOR — CNB's DB2 Batch Strategy: After the Q4 crisis, Lisa Tran (DBA) implemented three DB2 changes that recovered 35 minutes from the critical path:
1. Changed LOCKSIZE from PAGE to ROW on the TRANSACTION table — allowed three posting jobs to run concurrently instead of serially. Saved 40 minutes of elapsed time at the cost of 15% more CPU (row-level lock management overhead).
2. Increased COMMIT frequency from every 10,000 records to every 1,000 records — reduced lock hold time and eliminated the lock escalation events that had been causing tablespace-level locks. Slight CPU increase but a dramatic reduction in lock-wait time.
3. Implemented ISOLATION(UR) (uncommitted read) for read-only batch reporting jobs — these no longer take any locks at all, eliminating all contention with update jobs. Only safe because the reports don't require transactional consistency — they run after all updates complete.
EXEC SQL
SELECT ACCT_BALANCE
INTO :WS-BALANCE
FROM CUSTOMER_ACCOUNTS
WHERE ACCT_NUMBER = :WS-ACCT-NUM
WITH UR
END-EXEC.
The Initiator Class Strategy
z/OS batch initiators are grouped into classes. Each initiator processes jobs from one or more classes. The class assignment determines which initiators can run which jobs:
Initiator Configuration at CNB:
Init 1-8: Class A,B (general batch — any standard job)
Init 9-12: Class B,H (high-priority batch + long-running)
Init 13-14: Class H (long-running jobs only)
Init 15-16: Class S (STC/special — DB2 utilities, sorts)
Job Class Assignment:
Short batch jobs (< 15 min): Class A
Standard batch jobs (15-60 min): Class B
Long-running batch (> 60 min): Class H
DB2 utilities and sorts: Class S
This class structure prevents long-running jobs from consuming all initiators and starving short jobs. If the statement generation job (60 minutes) runs in Class H, it uses one of initiators 9-14 and leaves initiators 1-8 free for the shorter validation and posting jobs.
💡 KEY INSIGHT: Initiator class assignment is a resource allocation decision that affects the critical path. If you put a critical-path job in a class with only 2 initiators, and both initiators are occupied when the job becomes eligible, it queues. Rob Calloway reviews initiator utilization monthly and adjusts class definitions quarterly. The goal: zero queue time for critical-path jobs.
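The queue-time effect described in the key insight can be illustrated with a minimal Python sketch; the job timings are assumed for illustration, not CNB's schedule:

```python
import heapq

def queue_delays(jobs, initiators):
    """Greedy simulation: each job (ready_time, duration) takes the
    first initiator free at or after its ready time. Returns each
    job's queue wait in minutes."""
    free = [0.0] * initiators        # min-heap of initiator free times
    heapq.heapify(free)
    waits = []
    for ready, duration in sorted(jobs):
        t = heapq.heappop(free)
        start = max(t, ready)
        waits.append(start - ready)
        heapq.heappush(free, start + duration)
    return waits

# Three 60-minute Class H jobs all eligible at t=0, but only
# two Class H initiators: the third job queues for a full hour.
print(queue_delays([(0, 60), (0, 60), (0, 60)], initiators=2))
```

If that third job is on the critical path, the class definition just added 60 minutes to the window.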
Parallel Utility Execution
DFSORT, IDCAMS REPRO, and DB2 utilities (REORG, RUNSTATS, COPY) consume significant batch window time. Parallelizing utilities:
//* Run RUNSTATS on 4 tablespaces simultaneously
//* Each in its own job so all four can run at once
//*
//* Job DBUTIL1: RUNSTATS on CUSTOMER tablespace
//* Job DBUTIL2: RUNSTATS on TRANSACTION tablespace (parallel)
//* Job DBUTIL3: RUNSTATS on ACCOUNT tablespace (parallel)
//* Job DBUTIL4: RUNSTATS on GL_ENTRY tablespace (parallel)
//*
//* Convergence job DBUTIL9 depends on all four
DB2 REORG is particularly important — it's often the longest-running utility in the batch window, and it takes an exclusive lock on the tablespace. Schedule it on the least-critical tablespaces during the batch window and save the critical tablespaces for weekend maintenance windows.
Lisa Tran manages CNB's DB2 utility schedule with a simple rule: "If it locks a tablespace that a critical-path job touches, it runs on Saturday night. Period." This means some tablespaces go a full week between REORGs during heavy periods — not ideal for performance, but far better than extending the critical path by 30 minutes for a REORG that could have waited two days.
Cross-LPAR Batch Distribution
In a Parallel Sysplex with DB2 data sharing, batch work can be distributed across multiple LPARs. At CNB, SYSA runs the primary batch stream while SYSB handles utility processing and non-critical reporting:
SYSA (Primary Batch):
- All critical-path jobs (EOD-001 through EOD-013)
- All DB2 update jobs
- Checkpoint files on SYSA local DASD
SYSB (Auxiliary Batch):
- DB2 RUNSTATS and COPY utilities
- Management reports (read-only DB2 access)
- Archive processing (tape operations)
- Regulatory file transmission (FTP/Connect:Direct)
Cross-system dependencies:
- SYSB-RUNSTATS depends on SYSA-EOD-007 (posting complete before stats)
- SYSB-REPORTS depends on SYSA-EOD-008 (balances final before reporting)
- SYSA-EOD-013 depends on SYSB-TRANSMIT (regulatory files sent before close)
This distribution offloads approximately 15% of the batch window's total work from SYSA, freeing CPU and I/O resources for critical-path jobs. The cross-system dependencies are managed through TWS's inter-system communication using XCF (Cross-System Coupling Facility) signaling.
23.6 Window Compression — When You Need to Finish Faster
When the math says the window won't fit, you have six strategies — listed in order of increasing cost and risk:
Strategy 1: Eliminate Unnecessary Dependencies (Cost: Low, Risk: Low)
Already discussed in Section 23.4. This is always the first thing to try.
Before cleanup: Critical path = 420 minutes
After cleanup: Critical path = 373 minutes
Savings: 47 minutes
Cost: 40 hours of analysis
Risk: Low (no code changes)
Strategy 2: Split Large Serial Jobs into Parallel Jobs (Cost: Medium, Risk: Medium)
If a single job takes 50 minutes and processes 10 million records sequentially, split it into 4 parallel jobs processing 2.5 million records each:
Before:
EOD-009 (Interest Accrual): 50 minutes, serial, 10M accounts
After:
EOD-009A (Interest Accrual, range 0-2.5M): 13 min ──┐
EOD-009B (Interest Accrual, range 2.5M-5M): 13 min ──┤
EOD-009C (Interest Accrual, range 5M-7.5M): 13 min ──├── EOD-009Z (Merge): 3 min
EOD-009D (Interest Accrual, range 7.5M-10M): 12 min ──┘
Elapsed: 13 + 3 = 16 minutes (vs. 50 minutes)
Savings: 34 minutes on critical path
This requires application changes — the COBOL program must accept key-range parameters and the merge step must handle any cross-boundary reconciliation.
* ACCEPT KEY RANGE FROM PARM
IDENTIFICATION DIVISION.
PROGRAM-ID. INTACRL.
DATA DIVISION.
WORKING-STORAGE SECTION.
    EXEC SQL INCLUDE SQLCA END-EXEC.
01  WS-PARM-DATA.
    05 WS-RANGE-START     PIC 9(10).
    05 WS-RANGE-END       PIC 9(10).
LINKAGE SECTION.
01  LS-PARM.
    05 LS-PARM-LEN        PIC S9(4) COMP.
*      21 BYTES: TWO 10-DIGIT KEYS PLUS THE COMMA
    05 LS-PARM-DATA       PIC X(21).
PROCEDURE DIVISION USING LS-PARM.
0000-MAIN.
    UNSTRING LS-PARM-DATA DELIMITED BY ','
        INTO WS-RANGE-START WS-RANGE-END
    END-UNSTRING
    EXEC SQL
        DECLARE ACCT-CURSOR CURSOR FOR
        SELECT ACCT_NUMBER, ACCT_BALANCE,
               INTEREST_RATE, LAST_CALC_DATE
          FROM CUSTOMER_ACCOUNTS
         WHERE ACCT_NUMBER >= :WS-RANGE-START
           AND ACCT_NUMBER <= :WS-RANGE-END
         ORDER BY ACCT_NUMBER
           FOR UPDATE OF ACCT_BALANCE
    END-EXEC
    PERFORM 1000-PROCESS-RANGE
    STOP RUN.
//EOD009A EXEC PGM=INTACRL,PARM='0000000000,0002500000'
//EOD009B EXEC PGM=INTACRL,PARM='0002500001,0005000000'
//EOD009C EXEC PGM=INTACRL,PARM='0005000001,0007500000'
//EOD009D EXEC PGM=INTACRL,PARM='0007500001,0010000000'
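The split-elapsed arithmetic from the EOD-009 example can be reproduced in a couple of lines of Python:

```python
import math

def split_elapsed(total_minutes, ways, merge_minutes):
    """Elapsed time after splitting a serial job into equal key
    ranges: the slowest parallel stream gates the merge step."""
    per_stream = math.ceil(total_minutes / ways)
    return per_stream + merge_minutes

before = 50   # EOD-009 serial elapsed, minutes
after = split_elapsed(before, ways=4, merge_minutes=3)
print(f"{before} min serial -> {after} min split (saves {before - after})")
# 50 min serial -> 16 min split (saves 34)
```

Note the diminishing returns: going from 4 to 8 streams saves only another 6 minutes, while the merge overhead and DB2 contention costs stay constant or grow.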
Strategy 3: Optimize I/O Configuration (Cost: Low-Medium, Risk: Low)
Optimization Typical Savings Effort
─────────────────────────────────────────────────────────
Increase BUFNO (5→30) 5-15% JCL change
Optimize BLKSIZE (half-track) 5-20% Reformat dataset
Enable sequential detect 10-25% Storage admin
Use HyperPAV alias volumes 15-30% Storage config
Spread datasets across CUs 10-40% Storage placement
For the biggest critical-path jobs, every percentage point matters:
EOD-007 (Combined Posting): 45 minutes
Current: BUFNO=5, BLKSIZE=8000
Optimized: BUFNO=30, BLKSIZE=27998
New elapsed: 38 minutes
Savings: 7 minutes on critical path
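A rough way to estimate the combined effect of several I/O optimizations, under the optimistic assumption that each saving applies independently and multiplicatively (the per-optimization percentages here are assumed values from within the table's ranges):

```python
def combined_elapsed(elapsed_min, savings):
    """Apply several I/O optimizations, assuming each fractional
    saving compounds multiplicatively and independently."""
    for s in savings:
        elapsed_min *= (1 - s)
    return elapsed_min

# Assumed ~8% each for the BUFNO bump and half-track BLKSIZE
est = combined_elapsed(45, [0.08, 0.08])
print(f"estimated elapsed: {est:.1f} min")   # close to the 38 min above
```

Treat the result as a planning estimate only — real savings interact (a larger BLKSIZE reduces the benefit of more buffers), so always validate with a measured run.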
Strategy 4: Use zIIP Offload for DB2-Heavy Jobs (Cost: Medium, Risk: Low)
DB2 SQL processing is eligible for zIIP (System z Integrated Information Processor) offload. zIIP cycles don't consume general-purpose CPU capacity and are priced differently.
More relevant for batch window engineering: zIIP offload can effectively increase the CPU capacity available for DB2 batch processing without competing with other batch jobs for general-purpose CPU.
Before zIIP offload:
EOD-004 CPU time: 18.2 min (all GP)
EOD-004 elapsed: 35.0 min
GP CPU available: 8 processors shared across all batch
After zIIP configuration:
EOD-004 GP CPU: 12.1 min
EOD-004 zIIP CPU: 7.8 min (DB2 SQL offloaded)
EOD-004 elapsed: 31.5 min (3.5 min savings from reduced GP contention)
Strategy 5: Extend the Batch Window (Cost: High, Risk: High)
Sometimes the answer is: negotiate a later online start time or an earlier online close.
Current: 11:00 PM – 6:00 AM = 7 hours
Proposed: 10:00 PM – 6:30 AM = 8.5 hours
Impact:
- 1 hour earlier close: affects West Coast online users
- 30 min later open: affects early East Coast mobile banking
- Business impact assessment required
- Often requires C-level approval
⚠️ WARNING — The Window Extension Trap: Extending the window is a one-time fix that doesn't address the underlying growth problem. If your volume is growing 2.5% per month, a 90-minute extension buys you about 18 months. And now you've given up that margin permanently. It's the batch equivalent of treating a fever with ice instead of antibiotics.
Strategy 6: Re-architect the Batch Processing Model (Cost: Very High, Risk: High)
When strategies 1–5 aren't enough, it's time to rethink what "batch" means:
Near-real-time processing: Move some batch work to CICS or IMS online processing. Instead of accumulating transactions and processing them in batch, process each transaction as it arrives. At SecureFirst, Carlos Vega has moved the fraud detection scan (previously a 30-minute batch job) into a CICS transaction that evaluates each transaction at the point of entry. The batch window no longer needs to include fraud scanning at all — and the bank gets real-time fraud detection as a business benefit.
Continuous batch: Run batch jobs throughout the day, not just in a defined window. This requires careful design to avoid conflicts with online transactions. The key challenge is data consistency: if a balance calculation runs at 2:00 PM while transactions are still posting, the balance is a moving target. Solutions include snapshot isolation (DB2's CURRENTLY COMMITTED) and designated "batch partitions" that are locked from online access during processing.
Parallel sysplex batch: Distribute batch work across multiple LPARs in a Parallel Sysplex. Each LPAR processes a portion of the work, converging at the end. DB2 data sharing makes this feasible — both LPARs can access the same data simultaneously with cross-system lock management. However, inter-system lock negotiation adds latency, and coupling facility contention can negate some of the parallelism benefit.
Hybrid batch/online: The most common modernization pattern. Move time-sensitive work (fraud detection, real-time balance updates) to online processing. Keep complex, data-intensive work (interest accrual, GL posting, regulatory reporting) in batch. This reduces the batch window without requiring a complete application re-architecture.
🔍 ANALYSIS — When to Re-architect vs. When to Optimize: Re-architecture is warranted when: (a) the critical path exceeds 80% of the window even after optimization, (b) volume growth rate exceeds 5% monthly, (c) the business requires 24/7 online availability that eliminates the traditional batch window, or (d) the regulatory environment demands real-time processing (e.g., instant payment schemes). If none of these conditions apply, optimization strategies 1–5 are almost always more cost-effective.
🔄 ANCHOR — CNB's Re-architecture: After the Q4 crisis, Kwame Mensah (architect) led a batch modernization that combined strategies 1, 2, and 3:
- Dependency cleanup: -47 minutes
- Job splitting (3 critical-path jobs): -62 minutes
- I/O optimization (all critical-path jobs): -23 minutes
- Total savings: 132 minutes
- New critical path: 288 minutes (down from 420)
- New margin: 57 minutes
- Projected safe window: 12+ months at current growth rates
The project took 6 weeks of analysis and 4 weeks of implementation. No COBOL business logic changed — it was purely a batch architecture project.
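The "projected safe window" figure can be reproduced with a simple capacity model. The 1.5% monthly growth rate below is an assumption for illustration — the chapter doesn't state CNB's exact rate — and the model assumes elapsed time scales linearly with volume:

```python
import math

def months_until_exhaustion(critical_path_min, window_min, monthly_growth):
    """Months until the critical path grows to fill the effective
    window, assuming elapsed time scales linearly with volume."""
    if critical_path_min >= window_min:
        return 0.0
    return math.log(window_min / critical_path_min) / math.log(1 + monthly_growth)

# CNB after re-architecture: 288-min path, 57-min margin
# (i.e., a 345-min effective window), assumed 1.5%/month growth
m = months_until_exhaustion(288, 288 + 57, monthly_growth=0.015)
print(f"window exhausted in ~{m:.0f} months")
# window exhausted in ~12 months
```

This is the predictive power the chapter keeps returning to: the same three inputs tell you the exhaustion date for any proposed design before you build it.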
23.7 When the Window Breaks — Batch Failure Analysis and Recovery
Failure Modes
Batch jobs fail. The question isn't whether, but how you recover. Common failure modes:
Failure Type Frequency Severity Recovery Complexity
─────────────────────────────────────────────────────────────────
JCL error Weekly Low Fix and resubmit
ABEND S0C7 (data) Weekly Medium Fix data, restart
ABEND S0C4 (storage) Monthly Medium Fix program, restart
DB2 -904 (unavail) Monthly High Wait/restart
DB2 deadlock Weekly Low-Med Auto-retry
Dataset not found Monthly Medium Correct catalog/JCL
Space abend (B37) Monthly Medium Allocate more space
Tape mount timeout Weekly Low Operator intervention
System abend (S*22) Rare Critical IPL may be needed
CICS didn't close Quarterly Critical Manual intervention
Rob Calloway's Incident Playbook
🔄 ANCHOR — The CNB Batch Recovery Framework:
Rob Calloway's team operates on a tiered response model:
Tier 1 — Automatic Recovery (no human intervention):
//* AUTOMATIC RETRY FOR DB2 DEADLOCK (-911)
//STEP01 EXEC PGM=IKJEFT01,COND=(4,LT)
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD *
 DSN SYSTEM(DB2P)
 RUN PROGRAM(VALTRXN) PLAN(CNBPLAN1) -
     PARMS('RETRY=3,COMMIT=1000')
 END
/*
The COBOL program itself handles deadlock retry:
01  WS-SQLCODE-SAVE      PIC S9(9) COMP.
    88 DB2-DEADLOCK      VALUE -911.
    88 DB2-TIMEOUT       VALUE -913.
    88 DB2-OK            VALUE 0.
01  WS-UPDATE-DONE       PIC X VALUE 'N'.
    88 UPDATE-DONE       VALUE 'Y'.
01  WS-WAIT-SECONDS      PIC S9(9) COMP VALUE 2.
01  WS-FEEDBACK          PIC X(12).

2000-PROCESS-RECORD.
    MOVE 0 TO WS-RETRY-COUNT
    MOVE 'N' TO WS-UPDATE-DONE
    PERFORM 2100-ATTEMPT-UPDATE
        UNTIL UPDATE-DONE
           OR WS-RETRY-COUNT > 3
    IF WS-RETRY-COUNT > 3
        PERFORM 9000-WRITE-ERROR-RECORD
    END-IF.

2100-ATTEMPT-UPDATE.
    EXEC SQL
        UPDATE TRANSACTION_MASTER
           SET STATUS = :WS-NEW-STATUS,
               PROC_DATE = CURRENT DATE
         WHERE TXN_ID = :WS-TXN-ID
    END-EXEC
    MOVE SQLCODE TO WS-SQLCODE-SAVE
    EVALUATE TRUE
        WHEN DB2-DEADLOCK
        WHEN DB2-TIMEOUT
            ADD 1 TO WS-RETRY-COUNT
            EXEC SQL ROLLBACK END-EXEC
*           CEE3DLY: LE SERVICE, SUSPEND FOR N SECONDS
            CALL 'CEE3DLY' USING WS-WAIT-SECONDS WS-FEEDBACK
        WHEN DB2-OK
            SET UPDATE-DONE TO TRUE
            ADD 1 TO WS-COMMIT-COUNTER
            IF WS-COMMIT-COUNTER >= 1000
                EXEC SQL COMMIT END-EXEC
                MOVE 0 TO WS-COMMIT-COUNTER
            END-IF
        WHEN OTHER
            PERFORM 9100-SQL-ERROR-HANDLER
    END-EVALUATE.
Tier 2 — Operator Recovery (restart from checkpoint):
When a job fails and can't auto-recover, the goal is to restart from the last checkpoint — not from the beginning.
* CHECKPOINT/RESTART LOGIC
01 WS-CHECKPOINT-DATA.
05 WS-CHKPT-RECORD-COUNT PIC 9(10).
05 WS-CHKPT-LAST-KEY PIC X(20).
05 WS-CHKPT-ACCUMULATORS.
10 WS-CHKPT-TOTAL-AMT PIC S9(15)V99 COMP-3.
10 WS-CHKPT-ERROR-CT PIC 9(7).
05 WS-CHKPT-TIMESTAMP PIC X(26).
2000-TAKE-CHECKPOINT.
    MOVE WS-RECORD-COUNT TO WS-CHKPT-RECORD-COUNT
    MOVE WS-CURRENT-KEY TO WS-CHKPT-LAST-KEY
    MOVE WS-TOTAL-AMT TO WS-CHKPT-TOTAL-AMT
    MOVE WS-ERROR-COUNT TO WS-CHKPT-ERROR-CT
    MOVE FUNCTION CURRENT-DATE TO WS-CHKPT-TIMESTAMP
*   WRITE THE CHECKPOINT BEFORE THE DB2 COMMIT —
*   SEE "COMMIT SYNCHRONIZATION" LATER IN THIS CHAPTER
    WRITE CHECKPOINT-RECORD FROM WS-CHECKPOINT-DATA
    EXEC SQL COMMIT END-EXEC
    DISPLAY 'CHECKPOINT: RECORDS=' WS-CHKPT-RECORD-COUNT
            ' KEY=' WS-CHKPT-LAST-KEY
            ' TIME=' WS-CHKPT-TIMESTAMP
    MOVE 0 TO WS-COMMIT-COUNTER.
0100-CHECK-RESTART.
OPEN INPUT CHECKPOINT-FILE
READ CHECKPOINT-FILE INTO WS-CHECKPOINT-DATA
AT END
SET WS-FRESH-START TO TRUE
NOT AT END
SET WS-RESTART TO TRUE
MOVE WS-CHKPT-RECORD-COUNT TO WS-RECORD-COUNT
MOVE WS-CHKPT-LAST-KEY TO WS-RESTART-KEY
MOVE WS-CHKPT-TOTAL-AMT TO WS-TOTAL-AMT
MOVE WS-CHKPT-ERROR-CT TO WS-ERROR-COUNT
DISPLAY 'RESTART FROM KEY=' WS-RESTART-KEY
' RECORDS=' WS-RECORD-COUNT
END-READ
CLOSE CHECKPOINT-FILE.
Tier 3 — Architect Recovery (critical path rerouting):
When a critical-path job fails and restart will take too long to meet the window:
- Assess remaining work: How many records left? What's the projected finish time?
- Split and parallel: Can the remaining work be split across multiple parallel jobs?
- Defer non-critical: Can downstream jobs that aren't legally required be deferred to a supplemental batch run?
- Partial online: Can CICS come up for a subset of functions while batch completes for others?
Incident: EOD-009 (Interest Accrual) failed at record 4.2M of 10M
after 21 minutes. S0C7 on corrupted account record.
Time remaining in window: 180 minutes
Time to restart from scratch: 50 minutes
Time to complete from checkpoint: 29 minutes (5.8M records)
Decision tree:
  If remaining_window > (checkpoint_restart_time + downstream_path_time):
      RESTART FROM CHECKPOINT
      Remaining path: 29 + 30 + 60 + 15 = 134 minutes
      Margin: 180 - 134 = 46 minutes → SAFE, restart from checkpoint
  Else if remaining_window > (split_time + downstream_path_time):
      SPLIT remaining work into parallel streams
  Else:
      OPEN CICS FOR NON-INTEREST FUNCTIONS
      RUN SUPPLEMENTAL INTEREST BATCH AT MIDDAY
💡 KEY INSIGHT: The recovery decision is always a math problem. Calculate the remaining critical path time for each recovery option, compare to the remaining window, and choose the option with the most margin. Rob Calloway keeps a laminated card in the operations center with the decision tree and the current critical path timing for the top 10 failure scenarios.
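The laminated-card decision tree can be coded directly. The 12-minute split estimate below is a hypothetical figure for illustration, not from the incident record:

```python
def recovery_decision(window_left, ckpt_restart, split_time, downstream):
    """Walk the decision tree in preference order: the lowest-risk
    option that still fits the remaining window wins."""
    if ckpt_restart + downstream <= window_left:
        return "RESTART FROM CHECKPOINT", window_left - (ckpt_restart + downstream)
    if split_time + downstream <= window_left:
        return "SPLIT REMAINING WORK", window_left - (split_time + downstream)
    return "OPEN CICS PARTIALLY, SUPPLEMENTAL BATCH AT MIDDAY", 0

# The EOD-009 incident: 180 min left, 29 min from checkpoint,
# downstream path 30 + 60 + 15 = 105 min, assumed 12-min split setup
action, margin = recovery_decision(180, ckpt_restart=29,
                                   split_time=12, downstream=105)
print(action, margin)
# RESTART FROM CHECKPOINT 46
```

Preference order matters: splitting might leave more margin on paper, but it carries higher operational risk, so it's only chosen when the checkpoint restart doesn't fit.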
The Recovery Hierarchy
Level Recovery Action Time Cost Risk
─────────────────────────────────────────────────────────
1 Auto-retry (deadlock/timeout) Seconds None
2 Restart from checkpoint Minutes Low
3 Restart from beginning Tens of min Medium
4 Fix and resubmit Variable Medium
5 Split remaining work Minutes Medium-High
6 Bypass job, manual correction Minutes High
7 Defer to supplemental batch 0 min High (regulatory)
8 Extend window (delay online) N/A Very High
Every level up the hierarchy increases business risk. Levels 1–3 are operational decisions. Levels 4–6 require application knowledge. Levels 7–8 require management approval.
Designing for Restartability from Day One
The difference between a batch program that recovers gracefully and one that requires a full rerun is checkpoint design. Every critical-path COBOL batch program must implement these four elements:
1. Checkpoint records that capture complete processing state:
The checkpoint must include not just the current position in the input file, but all accumulators, counters, flags, and state variables needed to resume processing as if the interruption never happened. Missing a single accumulator means the final totals will be wrong after restart.
2. Idempotent processing logic:
If a record is processed twice (because the checkpoint was taken before the commit), the result must be the same as processing it once. For database updates, this typically means using "upsert" logic — UPDATE if the record exists, INSERT if it doesn't. For file output, it means repositioning the output file to the checkpoint position and overwriting.
3. Commit synchronization:
The DB2 commit and the checkpoint write must be synchronized. If you commit to DB2 but crash before writing the checkpoint, the restart will re-process records that have already been committed — producing duplicate updates unless your processing is idempotent. The safest pattern: take the checkpoint first (to a sequential file), then commit DB2.
2000-TAKE-SYNCHRONIZED-CHECKPOINT.
* Write checkpoint BEFORE DB2 commit
* If crash after checkpoint but before commit,
* restart will re-process — but uncommitted DB2
* changes will be rolled back, so no duplicates.
WRITE CHECKPOINT-RECORD FROM WS-CHECKPOINT-DATA
EXEC SQL COMMIT END-EXEC
MOVE 0 TO WS-COMMIT-COUNTER.
4. Restart detection:
The program must detect whether it's a fresh start or a restart. A simple approach: check for the existence of a non-empty checkpoint file. If present, read the checkpoint and resume; if absent or empty, start from the beginning.
🔴 CRITICAL: At CNB, every COBOL batch program submitted for production must pass a "restart test" — the program is deliberately killed at 50% completion and then restarted. If the final outputs don't match a clean run, the program is rejected. This testing requirement was instituted after a 2019 incident where a restart produced $14.3 million in duplicate interest credits that weren't detected until the next month's reconciliation.
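The restart test can be mimicked with a toy Python harness: process records with periodic checkpoints, kill the run midway, restart from the surviving checkpoint, and compare against a clean run. Everything here is a simplified model, not CNB's actual test tooling:

```python
def run(records, checkpoint_every, kill_at=None, state=None):
    """Toy batch run that sums records, checkpointing (position, total).
    Returns ('done', total) on completion or ('killed', checkpoint)
    on a simulated crash — only the last checkpoint survives."""
    pos, total = state if state else (0, 0)
    last_ckpt = (pos, total)
    for i in range(pos, len(records)):
        if kill_at is not None and i == kill_at:
            return "killed", last_ckpt
        total += records[i]
        if (i + 1) % checkpoint_every == 0:
            last_ckpt = (i + 1, total)   # checkpoint captures ALL state
    return "done", total

data = list(range(1, 101))                                  # 100 records
_, clean = run(data, checkpoint_every=10)                   # clean run

_, ckpt = run(data, checkpoint_every=10, kill_at=50)        # kill at 50%
_, restarted = run(data, checkpoint_every=10, state=ckpt)   # resume

print(clean == restarted)   # True — the restart test passes
```

Note that the checkpoint carries the accumulator (`total`) as well as the position — drop the accumulator from the checkpoint record and the restarted run's final total would be wrong, which is exactly the failure mode behind the $14.3 million incident.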
Batch Monitoring — Knowing You're in Trouble Before It's Too Late
Don't wait until 5:47 AM to discover the window is breaking. Implement milestone monitoring:
Milestone Expected Time Alert Threshold
─────────────────────────────────────────────────────
Extracts complete 11:45 PM +15 min (12:00 AM)
Validation done 12:55 AM +20 min (01:15 AM)
Posting complete 01:45 AM +20 min (02:05 AM)
Balance calc done 02:25 AM +15 min (02:40 AM)
Interest done 03:15 AM +20 min (03:35 AM)
GL posting done 03:45 AM +15 min (04:00 AM)
Statements done 04:15 AM +20 min (04:35 AM)
Window complete 04:30 AM +15 min (04:45 AM)
//* MILESTONE NOTIFICATION STEP
//MILSTN EXEC PGM=CNBNOTFY,
// PARM='MILESTONE=POSTING-COMPLETE'
//SYSOUT DD SYSOUT=*
//*
//* CNBNOTFY checks current time against expected time
//* If late, sends alert to operations page group
//* Writes record to milestone tracking dataset
⚠️ WARNING — Trend Monitoring: A single night finishing 5 minutes late isn't a crisis. Three consecutive nights each 2 minutes later than the last is a trend that, if unaddressed, will blow the window within weeks. Monitor batch window trends weekly, not just nightly alerts.
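A minimal trend check in Python, comparing night-over-night slip rather than absolute lateness (a sketch — real monitoring would pull the history from the milestone tracking dataset):

```python
def window_trend(finish_minutes_past_target):
    """Average night-over-night slip across recent history,
    in minutes per night. Positive values mean the window
    is drifting later even if no single night alarms."""
    deltas = [b - a for a, b in zip(finish_minutes_past_target,
                                    finish_minutes_past_target[1:])]
    return sum(deltas) / len(deltas)

# Each night 2 minutes later than the last — no nightly alert fires,
# but the margin is being eaten at 2 minutes per night:
history = [0, 2, 4, 6, 8]
print(f"average slip: {window_trend(history):.1f} min/night")
# average slip: 2.0 min/night
```

With a 46-minute margin and a 2 min/night slip, the window breaks in about three weeks — which is why the trend review is weekly, not quarterly.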
The Human Factor: Knowledge Transfer and the Batch Window
🔄 ANCHOR — The Marcus Whitfield Problem: At Federal Benefits Administration, Marcus Whitfield is retiring. He's the only person who fully understands the 600-job monthly cycle that processes 40 million benefit payments. The dependency graph exists in TWS, but the logic behind the dependencies — why job BENPAY-047 must run before BENPAY-052, even though there's no obvious data dependency — lives entirely in Marcus's head.
Sandra Chen's modernization effort includes a batch window documentation initiative. For every dependency in the graph, she requires a documented justification in one of four categories:
- Data dependency: "BENPAY-047 writes the ELIGIBLE-BENEFICIARY file that BENPAY-052 reads."
- Resource dependency: "Both jobs need exclusive access to the BENEFITS-MASTER VSAM cluster."
- Temporal dependency: "BENPAY-052 must not start before 02:00 AM due to downstream system availability."
- Unknown/historical: "Dependency exists but justification cannot be determined."
Category 4 currently covers 23% of all dependencies. Sandra's goal is to reduce that to zero before Marcus retires — because every unknown dependency is either a necessary constraint that will cause a production failure if removed, or an unnecessary constraint that's artificially extending the critical path. There's no way to know which without investigation, and Marcus is the only person who can investigate.
This is the knowledge retirement problem applied to batch architecture. And it's happening at shops across the industry. If your batch window depends on knowledge that lives in one person's head, you have a single point of failure that no amount of redundant hardware can address.
✅ BEST PRACTICE: Every dependency in the scheduler should have a comment explaining why it exists. When a new dependency is added, the change request must include the justification category. When an employee who owns batch knowledge announces retirement, a batch dependency audit should be initiated immediately — not during their last two weeks.
Production Considerations
Regulatory and Compliance Constraints
Some batch jobs have legal deadlines that don't care about your technical problems:
Requirement Deadline Penalty
─────────────────────────────────────────────────────────────────
ACH origination file to Fed 06:00 AM ET Regulatory action
Wire transfer confirmations 07:00 AM ET Customer/regulatory
OCC Call Report (quarterly) Midnight filing Regulatory fine
BSA/AML daily scan 09:00 AM ET Criminal liability
FDIC assessment data Quarterly Regulatory action
🔴 CRITICAL: The ACH file must be transmitted to the Federal Reserve by 06:00 AM. If your batch window runs late and the ACH file isn't generated, millions of dollars in payroll direct deposits don't arrive in customer accounts. The reputational and regulatory consequences are severe. This is why Rob Calloway's minimum buffer isn't negotiable.
Seasonal Volume Planning
Period Volume Change Planning Action
─────────────────────────────────────────────────────
Month-end +15-20% Pre-split GL jobs
Quarter-end +25-35% Full parallel plan
Year-end +50-80% Rehearsal runs, extra LPARs
Tax season +40% Add processing capacity
Black Friday week +100-200% Special batch schedule
Regulatory filing +varies Dedicated job streams
Plan for the worst case, not the average case. The batch window that works on a normal Tuesday in March will fail on December 31st if you haven't planned for year-end volume.
🔄 ANCHOR — Pinnacle Health's Seasonal Challenge: Diane Okoye at Pinnacle Health Insurance faces a different seasonal pattern. January is their peak — open enrollment processing adds 40% to claims volume, and new-year deductible resets trigger a wave of "accumulator zeroing" jobs that don't run any other month. Diane builds her capacity model around January volume, not annual average. If the window can survive January, it can survive anything — with the possible exception of a mid-year acquisition that adds millions of claims overnight, which is exactly what happened in the case study for this chapter.
Ahmad Rashidi (Pinnacle's compliance architect) adds another dimension: regulatory filing deadlines shift with the calendar. The CMS EDGE Server submission is due by the 15th of each month, but when the 15th falls on a weekend, the effective deadline moves to the preceding Friday — when the 15th is a Sunday, that's Friday the 13th, and the batch window on Thursday the 12th must produce the files. These calendar edge cases catch operations teams off guard because they occur only a few times a year, and the jobs involved may not have been tested since the last occurrence.
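The weekend-shift rule is easy to encode in a scheduler calendar exit or a pre-submission check; this Python sketch (function name hypothetical) shows the logic:

```python
from datetime import date, timedelta

def effective_deadline(year, month, day=15):
    """If the nominal deadline falls on a weekend, pull it back
    to the preceding Friday (the rule described for the EDGE
    Server submission)."""
    d = date(year, month, day)
    while d.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

# September 2024: the 15th is a Sunday, so the files are due
# Friday the 13th — and the window of the 12th must produce them
print(effective_deadline(2024, 9))
# 2024-09-13
```

Enterprise schedulers handle this with special-day calendars, but a standalone check like this is useful for validating that the calendar definitions actually match the regulation.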
Change Management for the Batch Window
Every change to the batch window — new jobs, modified dependencies, changed calendars, updated resources — should go through formal change management. At CNB, the batch window change process requires:
- Impact analysis: What is the effect on the critical path? Does the change add, remove, or modify any critical-path job?
- Resource assessment: Does the new/changed job require additional initiators, DB2 threads, or dataset access?
- Recovery review: Is the new/changed job restartable? Has the recovery procedure been documented and tested?
- Calendar review: On which days does this change affect the window? Does it create a new worst-case scenario?
- Approval: Changes that affect the critical path require architecture team approval. Changes that don't affect the critical path require operations team approval.
Rob Calloway estimates that 30% of batch window incidents are caused by changes that weren't properly impact-assessed. A new job added without checking resource contention. A dependency removed because "it seemed unnecessary." A calendar change that created an unexpected collision on month-end. Change management isn't bureaucracy — it's the batch window's immune system.
Documentation Requirements
Every batch window should have:
- DAG diagram — updated monthly, showing all jobs and dependencies
- Critical path documentation — which jobs are on it, what the expected times are
- Recovery runbook — for the top 20 failure scenarios, step-by-step recovery procedures
- Capacity model — current utilization, growth rate, projected window exhaustion date
- Change log — every dependency change, every new job added, every job removed
🧩 PATTERN — The Batch Window Dashboard: CNB maintains a real-time dashboard that shows:
- Current batch progress (jobs completed/remaining)
- Critical path status (on time / minutes ahead / minutes behind)
- Resource utilization (DB2 threads, initiators, I/O bandwidth)
- Milestone tracking (expected vs. actual completion times)
- Projected window completion time (updated every 5 minutes)
Rob Calloway checks it exactly once before bed (at midnight) and trusts the alerting system for everything else. If the dashboard shows green at midnight, the window will be fine. If it shows yellow, he sets his alarm for 3:00 AM. If it shows red at midnight, he's not going to bed.
Project Checkpoint — HA Banking System End-of-Day Batch Window
🔧 Progressive Project: HA Banking Transaction Processing System
Apply the batch window engineering concepts from this chapter to design the end-of-day processing for the HA banking system you've been building throughout this book.
Your Design Task
Design a complete end-of-day batch window for the HA banking system with the following characteristics:
Volume: 50 million transactions per day across 5 million active accounts.
Available window: 11:00 PM – 6:00 AM (7 hours; 390 minutes effective after a 30-minute buffer).
Required processing:
1. Transaction extraction from CICS journal
2. Transaction validation and enrichment
3. Account posting (debit/credit application)
4. Balance recalculation and interest accrual
5. Fraud detection daily scan
6. General ledger posting
7. Regulatory reporting (AML/BSA daily file)
8. Statement generation (for accounts with cycle date = today)
9. ACH origination file generation
10. End-of-day reconciliation
Deliverables:
- DAG diagram (text-based): Show all jobs with dependencies
- Critical path analysis: Identify the critical path and calculate total elapsed time
- Parallel stream design: Identify which jobs can run concurrently
- Throughput calculations: For the three longest jobs, estimate elapsed time based on record counts and processing rates
- Recovery strategy: For the two most critical failure points, document the recovery procedure
- Capacity projection: At 3% monthly volume growth, when will this design exhaust the window?
See code/project-checkpoint.md for the full project specification and worked guidance.
Summary
The batch window is a scheduling problem. That one sentence, truly understood, changes how you approach every aspect of batch processing architecture.
Individual job performance matters — but only for jobs on the critical path. The most impactful optimization is often dependency cleanup, which changes no code at all. The math of throughput and elapsed time gives you predictive power: you can calculate when the window will break before it actually breaks.
Job schedulers are the control plane of batch processing. They manage dependencies, allocate resources, and (when properly configured) route around failures. Understanding your scheduler's capabilities — resource management, conditional execution, cross-system dependencies — is essential for batch architecture.
Parallelization is the primary mechanism for window compression. Identifying independent work streams, resolving dataset and DB2 contention, and splitting large serial jobs into parallel components can reduce the critical path by 50% or more.
Recovery is architecture. Programs must be designed for restartability from the first line of code, not bolted on after a production failure. Checkpoint/restart logic, commit frequency, and idempotent processing aren't optional features — they're requirements for production batch systems.
And the 6 AM deadline doesn't negotiate. Whatever math you do, whatever architecture you design, the answer to "when does online come up?" must always be "on time."
Spaced Review
From Chapter 1 — z/OS Lifecycle
Connection: Chapter 1 introduced the z/OS job lifecycle — JCL submission, initiator allocation, step execution, and completion. That lifecycle is the fundamental unit of the batch window. Every node in your DAG is one execution of that lifecycle. The scheduler manages thousands of these lifecycles in the correct order.
Review Question: How does the z/OS initiator class system relate to the resource constraints discussed in this chapter's DAG model?
From Chapter 4 — Dataset Management
Connection: Chapter 4 covered dataset allocation, GDG management, and catalog operations. In batch window engineering, dataset contention is one of the primary barriers to parallelization, and GDGs are the primary mechanism for decoupling sequential producers from consumers.
Review Question: Why does GDG catalog serialization matter more during the batch window than during online processing? How would you mitigate it?
From Chapter 5 — Workload Manager
Connection: Chapter 5 discussed WLM service classes and how z/OS allocates resources. Batch jobs run in WLM service classes that determine their CPU dispatching priority and I/O priority. A critical-path batch job running in a low-priority service class will be preempted by online work that's still draining — understanding WLM is essential for ensuring batch jobs get the resources the throughput math assumes.
Review Question: If Rob Calloway's critical-path jobs are running in the default batch service class but CICS is still draining online transactions, what WLM change might help the batch window?
Next Chapter: Chapter 24 dives into the individual job level — how to design COBOL batch programs that process millions of records efficiently with proper restart/recovery, commit strategies, and error handling. Where this chapter gave you the forest, Chapter 24 gives you the trees.
Related Reading
Explore this topic in other books
- Advanced COBOL: Checkpoint and Restart
- Learning COBOL: Batch Processing
- Intermediate COBOL: Batch Processing Patterns