In This Chapter
- 23.1 The 6am Deadline — Why Batch Window Engineering Is Architecture
- 23.2 The Batch Window as a Graph — Jobs, Dependencies, and Critical Path
- 23.3 The Math — Throughput Calculations, I/O Analysis, and Capacity
- 23.4 Job Scheduling — TWS, CA-7, Control-M, and the Art of Dependencies
- 23.5 Parallel Streams — Running Jobs Simultaneously Without Stepping on Each Other
- 23.6 Window Compression — When You Need to Finish Faster
- 23.7 When the Window Breaks — Batch Failure Analysis and Recovery
- Production Considerations
- Project Checkpoint — HA Banking System End-of-Day Batch Window
- Summary
- Spaced Review
Chapter 23: Batch Window Engineering
Job Scheduling, Critical Path Analysis, and the Math of Getting It All Done by 6am
"The batch window is a scheduling problem, not a performance problem."
23.1 The 6am Deadline — Why Batch Window Engineering Is Architecture
Rob Calloway has been running batch operations at Continental National Bank for seventeen years. He's seen the batch window from every angle — the nights it finished at 4:15am with room to spare, and the nights he was on the phone at 5:47am with the CIO explaining why online wouldn't be up by 6:00. He'll tell you the same thing every time: "People think batch is about making programs run fast. It's not. It's about making the whole thing finish on time."
That distinction — individual job performance versus end-to-end window completion — is the threshold concept for this entire chapter. And it's the concept that separates batch operators from batch architects.
The Batch Window Defined
The batch window is the period between the close of online processing and the required resumption of online services. At CNB, that window is:
- Online close: 11:00 PM Eastern (CICS regions quiesced by 11:15 PM)
- Online open: 6:00 AM Eastern (CICS regions must be accepting transactions)
- Available window: 6 hours 45 minutes (405 minutes, from the 11:15 PM quiesce to 6:00 AM)
- Required buffer: 30 minutes (for CICS startup, cache warming, verification)
- Effective window: 375 minutes
That's 375 minutes to process 500 million transactions worth of end-of-day activity. Every night. Including the night after Black Friday. Including the night the quarterly interest calculation runs. Including the night DB2 decided to reorganize the customer master table.
🔄 ANCHOR — CNB's Q4 Crisis: In Q4 2024, CNB's transaction volume grew 30% due to a new mobile banking partnership. Nobody changed the batch jobs. Nobody re-analyzed the critical path. The first sign of trouble was a Tuesday night when the window finished at 5:52 AM — eight minutes of margin. By Thursday, it blew the window entirely. Rob Calloway got the 5:47 AM call. "Online can't come up. Batch is still running." Those are the eight words no batch operations manager ever wants to hear.
The root cause wasn't any single slow job. Every individual job was performing within its historical norms. The problem was that the aggregate throughput of the serial dependency chain had exceeded the window capacity. The critical path — the longest chain of jobs that must execute sequentially — had grown from 310 minutes to 420 minutes. Individual job optimization couldn't fix it. The architecture had to change.
Why This Is Architecture, Not Operations
Batch window engineering sits at the intersection of:
- Data architecture — What data flows where, and in what order?
- Systems architecture — How many initiators, LPARs, DB2 subsystems?
- Application architecture — How are programs structured for restartability?
- Capacity architecture — What throughput does the hardware support?
- Organizational architecture — Who owns which jobs, and who can change them?
💡 KEY INSIGHT: A batch window that works today but has no margin is a batch window that's already broken — you just don't know it yet. Volume grows. New requirements appear. Regulatory jobs get added. If you're using 95% of your window today, you'll blow it within two quarters.
The Batch Window Across the Industry
CNB's window is typical for a Tier-1 bank. But batch windows vary dramatically across industries and shop sizes:
Organization Type     Typical Window   Jobs      Critical Path
──────────────────────────────────────────────────────────────
Large bank (Tier 1)   6-8 hours        500+      4-6 hours
Mid-size bank         8-10 hours       200-400   3-5 hours
Insurance company     8-12 hours       300-600   4-8 hours
Federal agency        10-14 hours      100-300   3-6 hours
Retail chain          6-8 hours        150-300   3-5 hours
Federal Benefits Administration, where Sandra Chen is modernizing a 40-year-old codebase, has a 12-hour window — generous by banking standards. But their critical path is still 8 hours because the legacy code was never parallelized. Marcus Whitfield, the retiring SME, remembers when the window was 18 hours and nobody worried about it. "We had a mainframe to ourselves back then," he says. "No online to fight with."
SecureFirst Retail Bank, where Yuki Nakamura runs DevOps, faces the opposite problem: their mobile-first strategy means online processing runs nearly 24/7. The batch window has been compressed to 4 hours — and they're moving toward continuous batch processing that eliminates the window concept entirely. That's the future for many shops, but it requires a fundamentally different architecture that most COBOL applications weren't designed for.
🚪 GATEWAY CONCEPT: This chapter is the entry point for Part V (Batch Architecture at Scale). Every subsequent chapter in this part — individual batch program design, parallel processing, and disaster recovery — builds on the DAG model and critical path concepts introduced here. If you don't internalize the idea that the batch window is a graph problem, the rest of Part V will feel like disconnected optimization tips rather than a coherent architectural framework.
The rest of this chapter teaches you to think about batch windows the way Rob Calloway learned to think about them after that Q4 crisis: as an engineering discipline with mathematical foundations, not as a hope-and-pray operational exercise.
23.2 The Batch Window as a Graph — Jobs, Dependencies, and Critical Path
Modeling Jobs as a DAG
Every batch window can be modeled as a directed acyclic graph (DAG). Each node is a job. Each directed edge represents a dependency — "this job must complete before that job can start."
Consider a simplified version of CNB's end-of-day processing:
Job ID    Description                    Duration   Predecessors
─────────────────────────────────────────────────────────────────────
EOD-001   Transaction extract            25 min     (none)
EOD-002   ATM settlement extract         15 min     (none)
EOD-003   Wire transfer reconciliation   20 min     (none)
EOD-004   Transaction validation         35 min     EOD-001
EOD-005   ATM posting                    20 min     EOD-002
EOD-006   Wire posting                   30 min     EOD-003
EOD-007   Combined posting               45 min     EOD-004, EOD-005, EOD-006
EOD-008   Balance calculation            40 min     EOD-007
EOD-009   Interest accrual               50 min     EOD-008
EOD-010   GL posting                     30 min     EOD-008
EOD-011   Regulatory extract             25 min     EOD-009, EOD-010
EOD-012   Statement generation           60 min     EOD-009
EOD-013   End-of-day report              15 min     EOD-011, EOD-012
This DAG has three independent entry points (EOD-001, EOD-002, EOD-003) that can run in parallel, a convergence point (EOD-007), and multiple paths to the terminal node (EOD-013).
Critical Path Analysis
The critical path is the longest path through the DAG measured by total elapsed time. It determines the minimum possible batch window duration — you cannot finish faster than the critical path, no matter how many other jobs you parallelize.
Let's trace every path from start to finish:
Path A: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
25 + 35 + 45 + 40 + 50 + 60 + 15 = 270 minutes
Path B: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-009 → EOD-011 → EOD-013
25 + 35 + 45 + 40 + 50 + 25 + 15 = 235 minutes
Path C: EOD-001 → EOD-004 → EOD-007 → EOD-008 → EOD-010 → EOD-011 → EOD-013
25 + 35 + 45 + 40 + 30 + 25 + 15 = 215 minutes
Path D: EOD-002 → EOD-005 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
15 + 20 + 45 + 40 + 50 + 60 + 15 = 245 minutes
Path E: EOD-003 → EOD-006 → EOD-007 → EOD-008 → EOD-009 → EOD-012 → EOD-013
20 + 30 + 45 + 40 + 50 + 60 + 15 = 260 minutes
The critical path is Path A at 270 minutes. That's 4 hours and 30 minutes — within the 375-minute effective window, but with only 105 minutes of margin.
🔍 ANALYSIS — What the Critical Path Tells You:
- Optimizing EOD-002 (ATM settlement) does nothing for the window. It's not on the critical path.
- Optimizing EOD-012 (statement generation, 60 minutes) would reduce the critical path by however many minutes you save — it is on the critical path.
- Adding a new 20-minute job after EOD-009 but before EOD-012 would increase the critical path to 290 minutes.
- The path through EOD-010/EOD-011 has 55 minutes of slack (270 - 215 = 55). You could delay EOD-010 by up to 55 minutes without affecting the window.
Slack and Float
Every job not on the critical path has slack (also called float) — the amount of time its start can be delayed without affecting the overall window completion.
Job       Earliest   Latest    Slack   On Critical
          Start      Start     (min)   Path?
──────────────────────────────────────────────────
EOD-001   0:00       0:00        0     YES
EOD-002   0:00       0:25       25     no
EOD-003   0:00       0:10       10     no
EOD-004   0:25       0:25        0     YES
EOD-005   0:15       0:40       25     no
EOD-006   0:20       0:30       10     no
EOD-007   1:00       1:00        0     YES
EOD-008   1:45       1:45        0     YES
EOD-009   2:25       2:25        0     YES
EOD-010   2:25       3:20       55     no
EOD-011   3:15       3:50       35     no
EOD-012   3:15       3:15        0     YES
EOD-013   4:15       4:15        0     YES
⚠️ WARNING — Slack Is Fragile: That 10 minutes of slack on EOD-003's path assumes everything runs at normal duration. If EOD-006 runs 15 minutes longer than expected (a common occurrence during high-volume periods), EOD-003's path suddenly has -5 minutes of slack — meaning it is now the new critical path. Monitor slack trends, not just the current critical path.
Hidden Dependencies
The DAG you draw on paper isn't always the DAG that exists in reality. Hidden dependencies include:
Dataset contention: Two jobs that have no logical dependency may both need exclusive access to the same dataset. If JOBABC allocates DISP=OLD on CUST.MASTER and JOBXYZ also needs DISP=OLD on the same dataset, they serialize — even though neither is a predecessor of the other.
DB2 lock conflicts: Two batch DB2 programs updating the same tablespace will experience lock contention even without formal job dependencies. One will wait. The effective throughput drops.
Initiator starvation: If you have 15 jobs eligible to run but only 8 batch initiators in the right class, 7 jobs queue. This creates implicit serialization.
Tape drive allocation: Yes, in 2026, some shops still have tape. Two jobs needing 4 tape drives each on a system with 6 drives will serialize.
GDG contention: If JOBA creates a new generation of a GDG and JOBB reads the current generation, there's an implicit ordering — but the scheduler may not know about it unless you tell it.
🧩 PATTERN — Dependency Discovery: To find hidden dependencies, don't just read the scheduler. Run a week of SMF data (type 30 records) and correlate job start/end times with dataset allocation records (type 14/15). Jobs that never overlap aren't necessarily independent — they may be implicitly serialized by resource contention. This is where operations knowledge meets architecture knowledge, and it's why Rob Calloway's seventeen years of experience matter.
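The correlation idea can be sketched once the job start/end intervals have been extracted from SMF. A minimal illustration in Python — the job names and intervals below are hypothetical, and a real analysis would read actual SMF type 30 extracts rather than literals:

```python
# Sketch: flag job pairs that never overlap across a week of runs.
# Jobs that are "independent" on paper but never run together are
# candidates for hidden serialization (dataset ENQ, initiators, locks).

def overlaps(a, b):
    """True if intervals (start, end) a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def never_overlapping_pairs(runs):
    """runs: job name -> list of (start_min, end_min), one per night.
    Compares same-night intervals, since contention happens per night."""
    jobs = sorted(runs)
    suspects = []
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            nights = zip(runs[a], runs[b])
            if not any(overlaps(ia, ib) for ia, ib in nights):
                suspects.append((a, b))
    return suspects

week = {
    "JOBABC": [(0, 30), (0, 32), (0, 29)],
    "JOBXYZ": [(31, 60), (33, 61), (30, 58)],  # always waits for JOBABC
    "JOBPAR": [(5, 40), (4, 42), (6, 39)],     # genuinely concurrent
}
print(never_overlapping_pairs(week))  # [('JOBABC', 'JOBXYZ')]
```

A pair on this list isn't proof of a hidden dependency — only a prompt to ask why the two jobs never coexist.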
The Real-World DAG: Complexity at Scale
CNB's simplified example has 13 jobs. The real batch window has 847 jobs and 2,341 dependency edges. At that scale, manual path analysis is impossible. You need tools.
Scheduler-native analysis: TWS/OPC provides a critical path analysis feature (the "Plan" view) that calculates the longest path through the current plan. CA-7 has similar reporting through SASSHIS7. Control-M's Planning domain provides visual DAG rendering with critical path highlighting.
Custom analysis: Many shops extract scheduler dependency data to a flat file and process it with custom programs. The algorithm for finding the critical path in a DAG is a topological sort followed by a forward pass (calculating earliest start/finish for each node) and a backward pass (calculating latest start/finish). Jobs where earliest finish equals latest finish are on the critical path.
* SIMPLIFIED CRITICAL PATH FORWARD PASS
* (Pseudocode — a real implementation needs a graph data structure.
*  Assumes the job table is already in topological order, so every
*  predecessor's earliest start is computed before its successors'.)
PERFORM VARYING WS-NODE-IDX FROM 1 BY 1
        UNTIL WS-NODE-IDX > WS-TOTAL-JOBS
    MOVE 0 TO WS-EARLIEST-START(WS-NODE-IDX)
    PERFORM VARYING WS-PRED-IDX FROM 1 BY 1
            UNTIL WS-PRED-IDX > WS-PRED-COUNT(WS-NODE-IDX)
        COMPUTE WS-PRED-FINISH =
            WS-EARLIEST-START(WS-PRED-NODE(WS-PRED-IDX))
          + WS-DURATION(WS-PRED-NODE(WS-PRED-IDX))
        IF WS-PRED-FINISH > WS-EARLIEST-START(WS-NODE-IDX)
            MOVE WS-PRED-FINISH
              TO WS-EARLIEST-START(WS-NODE-IDX)
        END-IF
    END-PERFORM
END-PERFORM.
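For comparison, here is the full forward-and-backward pass sketched in Python, applied to the simplified EOD table from this section. Python is used purely to illustrate the algorithm; the result agrees with the path trace earlier in the section:

```python
# Critical path via forward/backward pass over the simplified EOD DAG.
jobs = {  # job: (duration_min, [predecessors])
    "EOD-001": (25, []), "EOD-002": (15, []), "EOD-003": (20, []),
    "EOD-004": (35, ["EOD-001"]), "EOD-005": (20, ["EOD-002"]),
    "EOD-006": (30, ["EOD-003"]),
    "EOD-007": (45, ["EOD-004", "EOD-005", "EOD-006"]),
    "EOD-008": (40, ["EOD-007"]), "EOD-009": (50, ["EOD-008"]),
    "EOD-010": (30, ["EOD-008"]),
    "EOD-011": (25, ["EOD-009", "EOD-010"]),
    "EOD-012": (60, ["EOD-009"]),
    "EOD-013": (15, ["EOD-011", "EOD-012"]),
}

succs = {j: [] for j in jobs}
for j, (_, preds) in jobs.items():
    for p in preds:
        succs[p].append(j)

order = list(jobs)  # the table is already in topological order

# Forward pass: earliest start = max(earliest finish of predecessors)
es = {}
for j in order:
    es[j] = max((es[p] + jobs[p][0] for p in jobs[j][1]), default=0)

window = max(es[j] + jobs[j][0] for j in jobs)  # critical path length

# Backward pass: latest start that does not delay the window
ls = {}
for j in reversed(order):
    ls[j] = min((ls[s] for s in succs[j]), default=window) - jobs[j][0]

slack = {j: ls[j] - es[j] for j in jobs}
critical = [j for j in order if slack[j] == 0]

print(window)    # 270
print(critical)  # EOD-001, 004, 007, 008, 009, 012, 013 (Path A)
```

Jobs where earliest start equals latest start (zero slack) are exactly the critical path.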
⚠️ WARNING — DAG Integrity: If your dependency graph has a cycle, it's not a DAG and no valid schedule exists. Schedulers reject cycles at definition time, but you can create logical cycles through conditional dependencies or cross-system references that the scheduler doesn't detect. Always validate DAG integrity after making dependency changes.
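Validating DAG integrity is a standard topological-sort check: if Kahn's algorithm cannot drain every node, a cycle exists. A sketch, assuming dependencies have been exported from the scheduler into a job-to-predecessors map (the job names are hypothetical):

```python
# Sketch: detect cycles with Kahn's algorithm. Any job whose in-degree
# never reaches zero is in a cycle, or blocked behind one.
from collections import deque

def find_cycle_members(deps):
    """deps: job -> list of predecessor jobs."""
    indeg = {j: len(p) for j, p in deps.items()}
    succs = {j: [] for j in deps}
    for j, preds in deps.items():
        for p in preds:
            succs[p].append(j)
    ready = deque(j for j, d in indeg.items() if d == 0)
    while ready:
        j = ready.popleft()
        for s in succs[j]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return {j for j, d in indeg.items() if d > 0}

good = {"A": [], "B": ["A"], "C": ["B"]}
bad = {"A": [], "B": ["A", "D"], "C": ["B"], "D": ["C"]}  # B -> C -> D -> B
print(find_cycle_members(good))  # set()
print(find_cycle_members(bad))   # {'B', 'C', 'D'} (in some order)
```

Run a check like this after every dependency change, including the cross-system references your scheduler may not validate itself.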
23.3 The Math — Throughput Calculations, I/O Analysis, and Capacity
Why the Math Matters
Most batch operations teams run on instinct. "EOD-007 usually takes about 45 minutes." "If we add a job here, it'll probably push us 10 minutes later." These estimates are often wrong — not because the people are bad at their jobs, but because human intuition about cumulative effects in a dependency graph is unreliable.
The math in this section gives you something better than intuition: predictive models that tell you, with reasonable accuracy, how long a job will take at a given volume, how much the critical path will grow with volume increases, and when the window will break. Rob Calloway started tracking these numbers after the Q4 crisis and now publishes a monthly "Batch Window Health Report" to the architecture team. It contains two numbers: current critical path duration and projected months to exhaustion. Those two numbers drive more architecture decisions than any other metric in his organization.
Records Per Second — The Fundamental Unit
Every batch job's elapsed time is determined by how fast it processes records. The fundamental equation:
Elapsed Time = Total Records / Processing Rate (records/second)
But "processing rate" isn't a single number. It's the result of an interaction between CPU processing, I/O operations, and DB2 access:
Time per record = CPU time + I/O wait time + DB2 wait time + other wait time
Processing rate = 1 / Time per record
For a typical COBOL batch program reading a sequential file and updating DB2:
Component              Time per record   Percentage
───────────────────────────────────────────────────
CPU (COBOL logic)      0.015 ms            3%
Sequential read I/O    0.050 ms           10%
DB2 SQL execution      0.350 ms           70%
DB2 lock/latch wait    0.060 ms           12%
Other (catalog, ENQ)   0.025 ms            5%
───────────────────────────────────────────────────
Total                  0.500 ms          100%
Processing rate        2,000 records/sec
With 10 million records to process:
Elapsed time = 10,000,000 / 2,000 = 5,000 seconds = 83.3 minutes
💡 KEY INSIGHT: In this example, 70% of elapsed time is DB2 SQL execution. Optimizing the COBOL logic (3% of time) would save approximately 2.5 minutes on an 83-minute job. Re-indexing the DB2 table to cut SQL time by 30% would save 17.5 minutes. Know where the time goes before you optimize.
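This budget arithmetic is worth automating so it can be rerun whenever volumes change. A small sketch of the calculation above:

```python
# Sketch of the records-per-second budget from the component table.
components_ms = {          # time per record, milliseconds
    "cpu": 0.015,
    "seq_read_io": 0.050,
    "db2_sql": 0.350,
    "db2_lock_wait": 0.060,
    "other": 0.025,
}
total_ms = sum(components_ms.values())    # 0.500 ms per record
rate = 1000.0 / total_ms                  # 2,000 records/sec

records = 10_000_000
elapsed_min = records / rate / 60         # ~83.3 minutes

# Removing a component entirely saves at most its share of elapsed time.
cobol_ceiling = elapsed_min * components_ms["cpu"] / total_ms             # ~2.5 min
sql_30pct_cut = elapsed_min * components_ms["db2_sql"] / total_ms * 0.30  # ~17.5 min
print(round(elapsed_min, 1), round(cobol_ceiling, 1), round(sql_30pct_cut, 1))
```

The same model answers "what if" questions instantly: double the record count, halve the SQL time, and the elapsed projection updates with it.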
I/O Throughput Analysis
For sequential file processing (QSAM/BSAM), I/O throughput depends on:
Block size × Blocks per track × Tracks per seek = Data per I/O operation
A well-tuned sequential read on modern DASD:
Configuration:
BLKSIZE = 27,998 (optimal for 3390 half-track)
BUFNO = 30 (30 I/O buffers for read-ahead)
Channel speed: FICON 16 Gbps
Cache hit ratio: 95% (sequential detect activated)
Throughput:
Cached reads: ~200 MB/sec
Non-cached reads: ~40 MB/sec
Effective (95% cache): ~192 MB/sec
Record size: 500 bytes
Records per block: 55
Blocks per second: ~6,860 (at the 192 MB/sec effective rate)
Records per second: ~377,000 (I/O only, no processing)
The I/O subsystem can deliver nearly 400,000 records per second for sequential reads. Your COBOL program processes 2,000 records per second. The bottleneck is never sequential I/O for a well-tuned dataset — it's processing time.
⚠️ WARNING — Random I/O Is Different: The numbers above are for sequential access with caching. Random I/O (VSAM KSDS random reads, DB2 index lookups) drops to 5,000–50,000 I/O operations per second depending on cache hit ratio. Random I/O can be the bottleneck, especially for DB2 batch programs doing singleton SELECTs with index access.
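The sequential-read arithmetic above can be sketched directly — the MB/sec figures below are the illustrative ones from this section, not universal constants:

```python
# Sketch: effective sequential-read throughput blended across cache
# hits and misses, then converted to blocks and records per second.
blksize = 27_998            # half-track blocking on 3390 geometry
lrecl = 500
cached_mb_s, uncached_mb_s, hit_ratio = 200, 40, 0.95

effective_mb_s = hit_ratio * cached_mb_s + (1 - hit_ratio) * uncached_mb_s  # 192
blocks_per_sec = effective_mb_s * 1_000_000 / blksize                       # ~6,860
recs_per_block = blksize // lrecl                                           # 55
io_recs_per_sec = blocks_per_sec * recs_per_block                           # ~377,000

print(round(effective_mb_s), round(blocks_per_sec), round(io_recs_per_sec))
```

Even this rough model makes the key point: I/O can feed records two orders of magnitude faster than the program consumes them.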
CPU vs. I/O Bound Analysis
Classify every critical-path job:
CPU Bound (CPU time > 60% of elapsed):
- Complex calculations (interest accrual, actuarial)
- Data transformation with heavy COMPUTE
- Sorting (internal SORT, not DFSORT)
- Compression/decompression
Fix: zIIP offload (for DB2/XML), faster processor, algorithm optimization
I/O Bound (I/O wait > 60% of elapsed):
- Large sequential file scans
- Random VSAM access
- Tape processing
- Cross-system dataset access
Fix: Better block sizes, more buffers, parallel I/O, data placement
DB2 Bound (DB2 wait > 60% of elapsed):
- Heavy SQL batch processing
- Lock contention with other batch jobs
- Tablespace scans instead of index access
- Commit frequency too low (lock escalation)
Fix: SQL tuning, index optimization, commit frequency, parallel DB2 threads
Worked Example — CNB Transaction Validation (EOD-004)
EOD-004: Transaction Validation
Input: 12.5M transactions (daily volume before the Q4 spike)
Processing: Validate each transaction against business rules,
check fraud flags, verify account status via DB2
Measured rates (from SMF Type 30):
CPU time per invocation: 18.2 minutes
Elapsed time per invocation: 35.0 minutes
CPU/Elapsed ratio: 0.52 (mixed CPU/DB2 bound)
DB2 accounting (IFCID 3):
SQL calls: 37.5M (3 per transaction)
Class 2 elapsed: 14.8 minutes
Class 2 CPU: 4.1 minutes
SQL DB2 wait: 10.7 minutes
Throughput:
12,500,000 records / (35 × 60 seconds) = 5,952 records/sec
Q4 projection at 30% growth:
16,250,000 records / 5,952 rps = 2,730 seconds = 45.5 minutes
Impact: EOD-004 grows from 35 to 45.5 minutes.
Critical path impact: +10.5 minutes
New critical path: 280.5 minutes (was 270)
Remaining margin: 94.5 minutes (was 105)
🔍 ANALYSIS: The 30% volume growth costs 10.5 minutes on the critical path. That's manageable for this one job. But when every job on the critical path grows by a similar proportion, the cumulative effect is what broke CNB's window. Seven critical-path jobs each growing 10–15 minutes added up to 85 minutes of growth — and the window only had 105 minutes of margin.
Capacity Planning Formula
For any batch window, the capacity equation is:
Window Capacity = Available Time - Critical Path Length - Buffer
If Window Capacity < 0, the window is broken.
If Window Capacity < Growth Margin, the window will break soon.
Growth Margin = (Monthly Volume Growth Rate × Months to Next Review)
× Critical Path Sensitivity Factor
Critical Path Sensitivity Factor =
Sum of (job_duration × volume_elasticity) for all critical path jobs
÷ Sum of (job_duration) for all critical path jobs
Volume elasticity measures how much a job's duration changes per unit of volume growth. A purely sequential file processor has elasticity of 1.0 (linear). A job with significant fixed overhead (JCL setup, sort initialization, DB2 thread allocation) has elasticity less than 1.0.
For CNB's batch window at Q4:
Available Time: 375 minutes
Critical Path Length: 270 minutes
Buffer: 30 minutes (Rob's minimum)
Window Capacity: 75 minutes
Monthly Volume Growth Rate: 2.5%
Months to Next Review: 6
Avg Volume Elasticity: 0.85
Critical Path Duration: 270 minutes
Growth Margin Needed: 2.5% × 6 × 0.85 × 270 = 34.4 minutes
Verdict: 75 > 34.4 → Safe for 6 months (pre-Q4 projection)
After the Q4 spike:
Critical Path Length: 420 minutes (actual, after 30% growth)
Window Capacity: 375 - 420 - 30 = -75 minutes
Verdict: BROKEN
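The capacity check is simple enough to script and rerun monthly, the way Rob's health report does. A sketch using the CNB figures above:

```python
# Sketch of the batch window capacity check.
def window_capacity(available_min, critical_path_min, buffer_min):
    """Minutes of margin; negative means the window is broken."""
    return available_min - critical_path_min - buffer_min

def growth_margin(monthly_growth, months, elasticity, critical_path_min):
    """Expected critical-path growth (minutes) before the next review."""
    return monthly_growth * months * elasticity * critical_path_min

cap = window_capacity(375, 270, 30)        # 75 minutes of margin
need = growth_margin(0.025, 6, 0.85, 270)  # ~34.4 minutes of growth
print(cap, round(need, 1), "SAFE" if cap > need else "AT RISK")

# After the Q4 spike, the same check fails outright:
print(window_capacity(375, 420, 30))       # -75 -> BROKEN
```

The two printed numbers are exactly the two metrics in the monthly health report: current margin and projected growth against it.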
23.4 Job Scheduling — TWS, CA-7, Control-M, and the Art of Dependencies
The Big Three Schedulers
Every mainframe shop runs one of three enterprise job schedulers. The concepts are identical; the syntax differs.
IBM Tivoli Workload Scheduler (TWS/OPC):
//********************************************
//* TWS APPLICATION DEFINITION — EOD-004 *
//********************************************
ADID(CNBEOD004)
OWNER(BATCHOPS)
PRIORITY(5)
WSNAME(CNBSYSA)
RUN DAILY
CALENDAR(CNB-BUSINESS-DAYS)
DEADLINE(0430)
PREDECESSOR(CNBEOD001)
TYPE(SUCCESSOR)
CONDITION(RC <= 4)
RESOURCE(DB2BATCH)
QUANTITY(1)
RESOURCE(BATCHINIT-A)
QUANTITY(1)
CA-7 (Broadcom):
CA-7 JOB DEFINITION
JOB: CNBEOD04
SYSTEM: SYSA
JCLID: CNBEOD04
REQUIREMENT:
JOB CNBEOD01 - COND CODE LE 4
RESOURCE:
RES DB2BATCH QTY 1
SCHEDULE:
SCHID 001
SCAL CNB-BUS-DAYS
LEADTM 0015
DEADTM 0430
BMC Control-M:
{
"CNBEOD004": {
"Type": "Job:zOS",
"Application": "CNB-EOD",
"SubApplication": "VALIDATION",
"RunAs": "BATCHOPS",
"When": {
"RuleBasedCalendar": {
"Calendar": "CNB-BUSINESS-DAYS"
}
},
"InCondition": [
{"Name": "CNBEOD001-ENDED-OK", "Date": "ODAT"}
],
"Resource": {
"DB2BATCH": {"Quantity": 1}
}
}
}
Dependency Types
Regardless of scheduler, dependencies come in several flavors:
Job-to-Job (hard dependency): Job B cannot start until Job A completes successfully. This is the most common and the most significant for critical path analysis.
EOD-007 depends on EOD-004, EOD-005, EOD-006
// EOD-007 will not start until ALL THREE predecessors complete with RC ≤ 4
Conditional dependency: Job B runs only if Job A ends with a specific condition code.
// If EOD-004 ends RC=0, run EOD-004A (normal path)
// If EOD-004 ends RC=4, run EOD-004B (warning path — some records rejected)
// If EOD-004 ends RC>4, trigger alert, do NOT run EOD-007
Time dependency: Job starts at a specific time regardless of predecessor completion.
// STMT-GEN must not start before 02:00 AM (tape library staffing)
// REGULATORY must complete by 04:30 AM (federal filing deadline)
⚠️ WARNING — Time Dependencies Are Critical Path Killers: If you have a time-based dependency that says "don't start before 02:00" and the job's predecessors finish at 01:15, you've just added 45 minutes of dead time to the critical path. Review every time dependency quarterly. Many exist because of constraints that no longer apply.
Resource dependency: Job waits until a shared resource is available.
// Only 4 batch DB2 threads allowed simultaneously
// Only 2 jobs can run in INITCLASS-H at once
// Only 1 job can hold CUST-MASTER dataset at a time
Cross-system dependency: Job on SYSA waits for a job on SYSB to complete.
// SYSB-EOD-EXTRACT must complete before SYSA-EOD-LOAD can start
// Requires XCF signaling or scheduler cross-system communication
The Dependency Explosion Problem
CNB's batch window has 847 jobs. The dependency graph has 2,341 edges. Nobody fully understands it.
🔄 ANCHOR — Dependency Archaeology: When Rob Calloway's team analyzed the dependency graph after the Q4 crisis, they found:
- 127 unnecessary dependencies — jobs that were predecessors only because "they were always run in that order" with no actual data dependency
- 43 duplicate dependencies — Job C depended on Job A both directly and through Job B (if B already depends on A, C doesn't need to depend on A)
- 8 phantom dependencies — references to jobs that had been decommissioned years ago but whose dependency entries remained in the scheduler
- 3 circular dependency risks — not actual cycles (the scheduler would reject those) but near-cycles that made the graph nearly impossible to modify
Removing the 127 unnecessary dependencies shortened the critical path by 47 minutes — without changing a single COBOL program.
💡 KEY INSIGHT: Dependency cleanup is the single highest-ROI batch window optimization. It costs nothing, risks little (if you analyze carefully), and can recover tens of minutes from the critical path. Before you tune a single program, clean your dependency graph.
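Finding transitively implied edges like those 43 duplicates can be automated. A sketch, assuming the dependency graph has been exported as a job-to-predecessors map (the job names are hypothetical):

```python
# Sketch: find dependency edges that are already implied transitively,
# e.g. C depends on A both directly and through B.

def redundant_edges(deps):
    """deps: job -> set of direct predecessors. Edge p -> j is
    redundant if p is still reachable from j when that edge is ignored."""
    def ancestors(j, skip_edge):
        seen, stack = set(), [j]
        while stack:
            cur = stack.pop()
            for p in deps.get(cur, ()):
                if (cur, p) == skip_edge or p in seen:
                    continue
                seen.add(p)
                stack.append(p)
        return seen

    out = []
    for j, preds in deps.items():
        for p in preds:
            if p in ancestors(j, skip_edge=(j, p)):
                out.append((p, j))
    return out

deps = {
    "JOBA": set(),
    "JOBB": {"JOBA"},
    "JOBC": {"JOBA", "JOBB"},   # JOBA -> JOBC is implied via JOBB
}
print(redundant_edges(deps))  # [('JOBA', 'JOBC')]
```

A flagged edge is safe to remove only after confirming there is no timing or data reason it was made explicit — automation finds candidates; people approve deletions.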
Operational Calendar Management
Schedulers don't just manage dependencies — they manage time. Every job has a calendar that determines when it runs:
Calendar Types:
BUSINESS-DAYS: Monday-Friday, excluding bank holidays
MONTH-END: Last business day of each month
QUARTER-END: Last business day of March, June, Sept, Dec
YEAR-END: December 31 (or last business day)
DAILY: Every day including weekends
CUSTOM: Application-specific (e.g., "third Wednesday")
Calendar interactions create batch window variation. On a normal Tuesday, CNB runs 847 jobs. On a month-end Tuesday, it runs 1,023 jobs — the extra 176 are month-end-only jobs (account reconciliation, management reporting, regulatory filings). On a quarter-end that falls on month-end, it's 1,187 jobs. On December 31st, it can exceed 1,400.
🔄 ANCHOR — The Month-End/Quarter-End Problem: Rob Calloway's critical path analysis must account for these calendar variations. The critical path on a normal night is 253 minutes. On month-end, additional jobs insert into the dependency chain, extending the critical path to approximately 310 minutes. On quarter-end, it reaches 345 minutes. On year-end, 380 minutes — within 5 minutes of the effective window.
This is why Rob runs a "rehearsal" batch two weeks before every year-end: he simulates the year-end job stream on a test LPAR to verify the timing. If the rehearsal exceeds 350 minutes, he activates the pre-planned compression strategies (additional job splits, temporary dependency bypasses for non-critical reports, deferred archival).
Scheduler Resource Management
Modern schedulers manage resources as countable tokens:
Resource Definition:
RESOURCE(DB2-BATCH-THREADS) QUANTITY(8)
RESOURCE(BATCH-INITIATORS-A) QUANTITY(12)
RESOURCE(TAPE-DRIVES) QUANTITY(6)
RESOURCE(CUST-MASTER-EXCL) QUANTITY(1)
Job Requirements:
EOD-004: DB2-BATCH-THREADS(2), BATCH-INITIATORS-A(1)
EOD-007: DB2-BATCH-THREADS(3), BATCH-INITIATORS-A(1)
EOD-009: DB2-BATCH-THREADS(2), BATCH-INITIATORS-A(1), CUST-MASTER-EXCL(1)
When EOD-004 and EOD-007 need to run simultaneously, they require 5 DB2 batch threads total. If only 4 are available, one waits. This implicit serialization doesn't appear in the dependency graph but affects actual elapsed time.
🧩 PATTERN — Resource Modeling: Add resource constraints to your DAG model. For each time slot, calculate total resource demand. Where demand exceeds supply, jobs queue — and queueing time adds to elapsed time even though it's not processing time. The best batch architects model resource contention as variable edge weights in their DAG.
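Modeling contention can start very simply: simulate the eligible jobs against a resource cap and compare the makespan with the dependency-only critical path. A greedy sketch with hypothetical jobs and a DB2-thread cap:

```python
# Sketch: greedy list scheduling under a resource cap, showing how a
# thread shortage stretches elapsed time beyond the critical path.
import heapq

def simulate(jobs, threads):
    """jobs: list of (name, duration_min, threads_needed), no deps.
    Each job starts as soon as enough threads are free."""
    free = threads
    running = []            # min-heap of (finish_time, threads_held)
    clock = 0
    for name, dur, need in jobs:
        while free < need:  # wait for the earliest finisher
            clock, held = heapq.heappop(running)
            free += held
        heapq.heappush(running, (clock + dur, need))
        free -= need
    return max(t for t, _ in running)   # makespan in minutes

jobs = [("POST-A", 30, 2), ("POST-B", 30, 2), ("POST-C", 30, 2)]
print(simulate(jobs, threads=6))   # 30 — all three run at once
print(simulate(jobs, threads=4))   # 60 — POST-C queues behind a finisher
```

Three independent 30-minute jobs have a 30-minute critical path, yet with only 4 threads the window pays 60 minutes — queueing time that no dependency diagram shows.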
23.5 Parallel Streams — Running Jobs Simultaneously Without Stepping on Each Other
The Parallelization Imperative
If your batch window critical path is 300 minutes and you need it to be 200 minutes, you have exactly two options: make the critical-path jobs run faster (optimization), or rearrange the graph so fewer jobs are on the critical path (parallelization). Optimization has limits — you can't make a program faster than its I/O or DB2 constraints allow. Parallelization is theoretically limited only by the true data dependencies between jobs. In practice, the limit is resource availability (initiators, DB2 threads, I/O bandwidth) and the contention effects that arise when multiple jobs compete for shared resources.
The rest of this section focuses on finding and exploiting parallelism — the most powerful lever you have for batch window compression.
Identifying Parallelizable Work
Two jobs can run in parallel if and only if:
- No data dependency: Neither job's output is the other's input
- No dataset contention: They don't both need exclusive (DISP=OLD) access to the same dataset
- No DB2 lock conflict: They don't update overlapping rows in the same table
- Sufficient resources: Enough initiators, DB2 threads, and I/O bandwidth for both
CNB's batch window has three natural parallel streams:
Stream 1 (Retail Banking):
Customer transactions → validation → posting → balance calc
Stream 2 (Commercial Banking):
Wire transfers → reconciliation → posting → GL update
Stream 3 (Card Processing):
ATM/debit transactions → settlement → posting → interchange calc
These streams share no data until the convergence point (combined GL posting). They can run fully in parallel, and the batch window's critical path is the longest stream — not the sum of all three.
Serial execution: Stream 1 (120 min) + Stream 2 (95 min) + Stream 3 (80 min) = 295 min
Parallel execution: max(120, 95, 80) = 120 min
Savings: 175 minutes (59% reduction)
Dataset Contention Resolution
The most common barrier to parallelization is dataset contention. Solutions:
1. Convert DISP=OLD to DISP=SHR where possible:
//* BEFORE — serializes against any other user of CUST.MASTER
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=OLD
//*
//* AFTER — allows concurrent read access
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=SHR
This works only if the job reads the dataset rather than writing it. A great deal of old JCL specifies DISP=OLD out of habit when DISP=SHR would suffice.
2. Use GDG generations to decouple readers from writers:
//* Writer job creates new generation
//TRANOUT DD DSN=CNB.PROD.TRANS.DAILY(+1),
// DISP=(NEW,CATLG,DELETE),
// SPACE=(CYL,(500,100)),
// DCB=(RECFM=FB,LRECL=500,BLKSIZE=27998)
//*
//* Reader job reads current generation (written by previous run)
//TRANIN DD DSN=CNB.PROD.TRANS.DAILY(0),DISP=SHR
⚠️ WARNING — GDG Catalog Serialization: Even with GDGs, the ICF catalog serializes during OPEN for GDG base updates. If 10 jobs all reference the same GDG base simultaneously, catalog contention can cause seconds to minutes of delay. For high-concurrency GDGs, consider spreading jobs' start times by 15–30 seconds.
3. Split files by key range:
//* Instead of one job processing all customers:
//* Job A processes customers 000000000–249999999
//* Job B processes customers 250000000–499999999
//* Job C processes customers 500000000–749999999
//* Job D processes customers 750000000–999999999
//CUSTMAST DD DSN=CNB.PROD.CUST.MASTER,DISP=SHR
//SYSIN DD *
RANGE-START=000000000
RANGE-END=249999999
/*
This requires application-level changes to support key-range processing — but it's the single most powerful parallelization technique for CPU-bound batch jobs.
DB2 Concurrency in Batch
DB2 batch programs running simultaneously face lock contention. The severity depends on:
Lock Level         Contention Risk   Throughput Impact
──────────────────────────────────────────────────────
Row-level          Low               Minimal (unless hot rows)
Page-level         Medium            10-30% degradation
Table-level        High              Full serialization
Tablespace-level   Critical          Full serialization
🔄 ANCHOR — CNB's DB2 Batch Strategy: After the Q4 crisis, Lisa Tran (DBA) implemented three DB2 changes that recovered 35 minutes from the critical path:
1. Changed LOCKSIZE from PAGE to ROW on the TRANSACTION table — allowed three posting jobs to run concurrently instead of serially. Saved 40 minutes of elapsed time at the cost of 15% more CPU (row-level lock management overhead).
2. Increased COMMIT frequency from every 10,000 records to every 1,000 records — reduced lock hold time and eliminated the lock escalation events that had been causing tablespace-level locks. Slight CPU increase but a dramatic reduction in lock-wait time.
3. Implemented ISOLATION(UR) (uncommitted read) for read-only batch reporting jobs — these no longer take any locks at all, eliminating all contention with update jobs. Only safe because the reports don't require transactional consistency — they run after all updates complete.
EXEC SQL
SELECT ACCT_BALANCE
INTO :WS-BALANCE
FROM CUSTOMER_ACCOUNTS
WHERE ACCT_NUMBER = :WS-ACCT-NUM
WITH UR
END-EXEC.
The Initiator Class Strategy
z/OS batch initiators are grouped into classes. Each initiator processes jobs from one or more classes. The class assignment determines which initiators can run which jobs:
Initiator Configuration at CNB:
Init 1-8: Class A,B (general batch — any standard job)
Init 9-12: Class B,H (high-priority batch + long-running)
Init 13-14: Class H (long-running jobs only)
Init 15-16: Class S (STC/special — DB2 utilities, sorts)
Job Class Assignment:
Short batch jobs (< 15 min): Class A
Standard batch jobs (15-60 min): Class B
Long-running batch (> 60 min): Class H
DB2 utilities and sorts: Class S
This class structure prevents long-running jobs from consuming all initiators and starving short jobs. If the statement generation job (60 minutes) runs in Class H, it uses one of initiators 9-14 and leaves initiators 1-8 free for the shorter validation and posting jobs.
💡 KEY INSIGHT: Initiator class assignment is a resource allocation decision that affects the critical path. If you put a critical-path job in a class with only 2 initiators, and both initiators are occupied when the job becomes eligible, it queues. Rob Calloway reviews initiator utilization monthly and adjusts class definitions quarterly. The goal: zero queue time for critical-path jobs.
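The queue-time effect described in the key insight can be illustrated with a minimal Python sketch; the job timings are assumed for illustration, not CNB's schedule:

```python
import heapq

def queue_delays(jobs, initiators):
    """Greedy simulation: each job (ready_time, duration) takes the
    first initiator free at or after its ready time. Returns each
    job's queue wait in minutes."""
    free = [0.0] * initiators        # min-heap of initiator free times
    heapq.heapify(free)
    waits = []
    for ready, duration in sorted(jobs):
        t = heapq.heappop(free)
        start = max(t, ready)
        waits.append(start - ready)
        heapq.heappush(free, start + duration)
    return waits

# Three 60-minute Class H jobs all eligible at t=0, but only
# two Class H initiators: the third job queues for a full hour.
print(queue_delays([(0, 60), (0, 60), (0, 60)], initiators=2))
```

If that third job is on the critical path, the class definition just added 60 minutes to the window.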
Parallel Utility Execution
DFSORT, IDCAMS REPRO, and DB2 utilities (REORG, RUNSTATS, COPY) consume significant batch window time. Parallelizing utilities:
//* Run RUNSTATS on 4 tablespaces simultaneously
//* Each in its own job so all four can run at once
//*
//* Job DBUTIL1: RUNSTATS on CUSTOMER tablespace
//* Job DBUTIL2: RUNSTATS on TRANSACTION tablespace (parallel)
//* Job DBUTIL3: RUNSTATS on ACCOUNT tablespace (parallel)
//* Job DBUTIL4: RUNSTATS on GL_ENTRY tablespace (parallel)
//*
//* Convergence job DBUTIL9 depends on all four
DB2 REORG is particularly important — it's often the longest-running utility in the batch window, and it takes an exclusive lock on the tablespace. Schedule it on the least-critical tablespaces during the batch window and save the critical tablespaces for weekend maintenance windows.
Lisa Tran manages CNB's DB2 utility schedule with a simple rule: "If it locks a tablespace that a critical-path job touches, it runs on Saturday night. Period." This means some tablespaces go a full week between REORGs during heavy periods — not ideal for performance, but far better than extending the critical path by 30 minutes for a REORG that could have waited two days.
Cross-LPAR Batch Distribution
In a Parallel Sysplex with DB2 data sharing, batch work can be distributed across multiple LPARs. At CNB, SYSA runs the primary batch stream while SYSB handles utility processing and non-critical reporting:
SYSA (Primary Batch):
- All critical-path jobs (EOD-001 through EOD-013)
- All DB2 update jobs
- Checkpoint files on SYSA local DASD
SYSB (Auxiliary Batch):
- DB2 RUNSTATS and COPY utilities
- Management reports (read-only DB2 access)
- Archive processing (tape operations)
- Regulatory file transmission (FTP/Connect:Direct)
Cross-system dependencies:
- SYSB-RUNSTATS depends on SYSA-EOD-007 (posting complete before stats)
- SYSB-REPORTS depends on SYSA-EOD-008 (balances final before reporting)
- SYSA-EOD-013 depends on SYSB-TRANSMIT (regulatory files sent before close)
This distribution offloads approximately 15% of the batch window's total work from SYSA, freeing CPU and I/O resources for critical-path jobs. The cross-system dependencies are managed through TWS's inter-system communication using XCF (Cross-System Coupling Facility) signaling.
23.6 Window Compression — When You Need to Finish Faster
When the math says the window won't fit, you have six strategies — listed in order of increasing cost and risk:
Strategy 1: Eliminate Unnecessary Dependencies (Cost: Low, Risk: Low)
Already discussed in Section 23.4. This is always the first thing to try.
Before cleanup: Critical path = 420 minutes
After cleanup: Critical path = 373 minutes
Savings: 47 minutes
Cost: 40 hours of analysis
Risk: Low (no code changes)
Strategy 2: Split Large Serial Jobs into Parallel Jobs (Cost: Medium, Risk: Medium)
If a single job takes 50 minutes and processes 10 million records sequentially, split it into 4 parallel jobs processing 2.5 million records each:
Before:
EOD-009 (Interest Accrual): 50 minutes, serial, 10M accounts
After:
EOD-009A (Interest Accrual, range 0-2.5M): 13 min ──┐
EOD-009B (Interest Accrual, range 2.5M-5M): 13 min ──┤
EOD-009C (Interest Accrual, range 5M-7.5M): 13 min ──├── EOD-009Z (Merge): 3 min
EOD-009D (Interest Accrual, range 7.5M-10M): 12 min ──┘
Elapsed: 13 + 3 = 16 minutes (vs. 50 minutes)
Savings: 34 minutes on critical path
This requires application changes — the COBOL program must accept key-range parameters and the merge step must handle any cross-boundary reconciliation.
* ACCEPT KEY RANGE FROM PARM
IDENTIFICATION DIVISION.
PROGRAM-ID. INTACRL.
DATA DIVISION.
WORKING-STORAGE SECTION.
    EXEC SQL INCLUDE SQLCA END-EXEC.
01  WS-PARM-DATA.
    05 WS-RANGE-START     PIC 9(10).
    05 WS-RANGE-END       PIC 9(10).
LINKAGE SECTION.
01  LS-PARM.
    05 LS-PARM-LEN        PIC S9(4) COMP.
*      21 BYTES: TWO 10-DIGIT KEYS PLUS THE COMMA
    05 LS-PARM-DATA       PIC X(21).
PROCEDURE DIVISION USING LS-PARM.
0000-MAIN.
    UNSTRING LS-PARM-DATA DELIMITED BY ','
        INTO WS-RANGE-START WS-RANGE-END
    END-UNSTRING
    EXEC SQL
        DECLARE ACCT-CURSOR CURSOR FOR
        SELECT ACCT_NUMBER, ACCT_BALANCE,
               INTEREST_RATE, LAST_CALC_DATE
          FROM CUSTOMER_ACCOUNTS
         WHERE ACCT_NUMBER >= :WS-RANGE-START
           AND ACCT_NUMBER <= :WS-RANGE-END
         ORDER BY ACCT_NUMBER
           FOR UPDATE OF ACCT_BALANCE
    END-EXEC
    PERFORM 1000-PROCESS-RANGE
    STOP RUN.
//EOD009A EXEC PGM=INTACRL,PARM='0000000000,0002500000'
//EOD009B EXEC PGM=INTACRL,PARM='0002500001,0005000000'
//EOD009C EXEC PGM=INTACRL,PARM='0005000001,0007500000'
//EOD009D EXEC PGM=INTACRL,PARM='0007500001,0010000000'
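The split-elapsed arithmetic from the EOD-009 example can be reproduced in a couple of lines of Python:

```python
import math

def split_elapsed(total_minutes, ways, merge_minutes):
    """Elapsed time after splitting a serial job into equal key
    ranges: the slowest parallel stream gates the merge step."""
    per_stream = math.ceil(total_minutes / ways)
    return per_stream + merge_minutes

before = 50   # EOD-009 serial elapsed, minutes
after = split_elapsed(before, ways=4, merge_minutes=3)
print(f"{before} min serial -> {after} min split (saves {before - after})")
# 50 min serial -> 16 min split (saves 34)
```

Note the diminishing returns: going from 4 to 8 streams saves only another 6 minutes, while the merge overhead and DB2 contention costs stay constant or grow.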
Strategy 3: Optimize I/O Configuration (Cost: Low-Medium, Risk: Low)
Optimization Typical Savings Effort
─────────────────────────────────────────────────────────
Increase BUFNO (5→30) 5-15% JCL change
Optimize BLKSIZE (half-track) 5-20% Reformat dataset
Enable sequential detect 10-25% Storage admin
Use HyperPAV alias volumes 15-30% Storage config
Spread datasets across CUs 10-40% Storage placement
For the biggest critical-path jobs, every percentage point matters:
EOD-007 (Combined Posting): 45 minutes
Current: BUFNO=5, BLKSIZE=8000
Optimized: BUFNO=30, BLKSIZE=27998
New elapsed: 38 minutes
Savings: 7 minutes on critical path
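A rough way to estimate the combined effect of several I/O optimizations, under the optimistic assumption that each saving applies independently and multiplicatively (the per-optimization percentages here are assumed values from within the table's ranges):

```python
def combined_elapsed(elapsed_min, savings):
    """Apply several I/O optimizations, assuming each fractional
    saving compounds multiplicatively and independently."""
    for s in savings:
        elapsed_min *= (1 - s)
    return elapsed_min

# Assumed ~8% each for the BUFNO bump and half-track BLKSIZE
est = combined_elapsed(45, [0.08, 0.08])
print(f"estimated elapsed: {est:.1f} min")   # close to the 38 min above
```

Treat the result as a planning estimate only — real savings interact (a larger BLKSIZE reduces the benefit of more buffers), so always validate with a measured run.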
Strategy 4: Use zIIP Offload for DB2-Heavy Jobs (Cost: Medium, Risk: Low)
DB2 SQL processing is eligible for zIIP (System z Integrated Information Processor) offload. zIIP cycles don't consume general-purpose CPU capacity and are priced differently.
More relevant for batch window engineering: zIIP offload can effectively increase the CPU capacity available for DB2 batch processing without competing with other batch jobs for general-purpose CPU.
Before zIIP offload:
EOD-004 CPU time: 18.2 min (all GP)
EOD-004 elapsed: 35.0 min
GP CPU available: 8 processors shared across all batch
After zIIP configuration:
EOD-004 GP CPU: 12.1 min
EOD-004 zIIP CPU: 7.8 min (DB2 SQL offloaded)
EOD-004 elapsed: 31.5 min (3.5 min savings from reduced GP contention)
Strategy 5: Extend the Batch Window (Cost: High, Risk: High)
Sometimes the answer is: negotiate a later online start time or an earlier online close.
Current: 11:00 PM – 6:00 AM = 7 hours
Proposed: 10:00 PM – 6:30 AM = 8.5 hours
Impact:
- 1 hour earlier close: affects West Coast online users
- 30 min later open: affects early East Coast mobile banking
- Business impact assessment required
- Often requires C-level approval
⚠️ WARNING — The Window Extension Trap: Extending the window is a one-time fix that doesn't address the underlying growth problem. If your volume is growing 2.5% per month, a 90-minute extension buys you about 18 months. And now you've given up that margin permanently. It's the batch equivalent of treating a fever with ice instead of antibiotics.
Strategy 6: Re-architect the Batch Processing Model (Cost: Very High, Risk: High)
When strategies 1–5 aren't enough, it's time to rethink what "batch" means:
Near-real-time processing: Move some batch work to CICS or IMS online processing. Instead of accumulating transactions and processing them in batch, process each transaction as it arrives. At SecureFirst, Carlos Vega has moved the fraud detection scan (previously a 30-minute batch job) into a CICS transaction that evaluates each transaction at the point of entry. The batch window no longer needs to include fraud scanning at all — and the bank gets real-time fraud detection as a business benefit.
Continuous batch: Run batch jobs throughout the day, not just in a defined window. This requires careful design to avoid conflicts with online transactions. The key challenge is data consistency: if a balance calculation runs at 2:00 PM while transactions are still posting, the balance is a moving target. Solutions include snapshot isolation (DB2's CURRENTLY COMMITTED) and designated "batch partitions" that are locked from online access during processing.
Parallel sysplex batch: Distribute batch work across multiple LPARs in a Parallel Sysplex. Each LPAR processes a portion of the work, converging at the end. DB2 data sharing makes this feasible — both LPARs can access the same data simultaneously with cross-system lock management. However, inter-system lock negotiation adds latency, and coupling facility contention can negate some of the parallelism benefit.
Hybrid batch/online: The most common modernization pattern. Move time-sensitive work (fraud detection, real-time balance updates) to online processing. Keep complex, data-intensive work (interest accrual, GL posting, regulatory reporting) in batch. This reduces the batch window without requiring a complete application re-architecture.
🔍 ANALYSIS — When to Re-architect vs. When to Optimize: Re-architecture is warranted when: (a) the critical path exceeds 80% of the window even after optimization, (b) volume growth rate exceeds 5% monthly, (c) the business requires 24/7 online availability that eliminates the traditional batch window, or (d) the regulatory environment demands real-time processing (e.g., instant payment schemes). If none of these conditions apply, optimization strategies 1–5 are almost always more cost-effective.
🔄 ANCHOR — CNB's Re-architecture: After the Q4 crisis, Kwame Mensah (architect) led a batch modernization that combined strategies 1, 2, and 3:
- Dependency cleanup: -47 minutes
- Job splitting (3 critical-path jobs): -62 minutes
- I/O optimization (all critical-path jobs): -23 minutes
- Total savings: 132 minutes
- New critical path: 288 minutes (down from 420)
- New margin: 57 minutes
- Projected safe window: 12+ months at current growth rates
The project took 6 weeks of analysis and 4 weeks of implementation. No COBOL business logic changed — it was purely a batch architecture project.
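The "projected safe window" figure can be reproduced with a simple capacity model. The 1.5% monthly growth rate below is an assumption for illustration — the chapter doesn't state CNB's exact rate — and the model assumes elapsed time scales linearly with volume:

```python
import math

def months_until_exhaustion(critical_path_min, window_min, monthly_growth):
    """Months until the critical path grows to fill the effective
    window, assuming elapsed time scales linearly with volume."""
    if critical_path_min >= window_min:
        return 0.0
    return math.log(window_min / critical_path_min) / math.log(1 + monthly_growth)

# CNB after re-architecture: 288-min path, 57-min margin
# (i.e., a 345-min effective window), assumed 1.5%/month growth
m = months_until_exhaustion(288, 288 + 57, monthly_growth=0.015)
print(f"window exhausted in ~{m:.0f} months")
# window exhausted in ~12 months
```

This is the predictive power the chapter keeps returning to: the same three inputs tell you the exhaustion date for any proposed design before you build it.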
23.7 When the Window Breaks — Batch Failure Analysis and Recovery
Failure Modes
Batch jobs fail. The question isn't whether, but how you recover. Common failure modes:
Failure Type Frequency Severity Recovery Complexity
─────────────────────────────────────────────────────────────────
JCL error Weekly Low Fix and resubmit
ABEND S0C7 (data) Weekly Medium Fix data, restart
ABEND S0C4 (storage) Monthly Medium Fix program, restart
DB2 -904 (unavail) Monthly High Wait/restart
DB2 deadlock Weekly Low-Med Auto-retry
Dataset not found Monthly Medium Correct catalog/JCL
Space abend (B37) Monthly Medium Allocate more space
Tape mount timeout Weekly Low Operator intervention
System abend (S*22) Rare Critical IPL may be needed
CICS didn't close Quarterly Critical Manual intervention
Rob Calloway's Incident Playbook
🔄 ANCHOR — The CNB Batch Recovery Framework:
Rob Calloway's team operates on a tiered response model:
Tier 1 — Automatic Recovery (no human intervention):
//* AUTOMATIC RETRY FOR DB2 DEADLOCK (-911)
//STEP01 EXEC PGM=IKJEFT01,COND=(4,LT)
//SYSTSPRT DD SYSOUT=*
//SYSTSIN DD *
 DSN SYSTEM(DB2P)
 RUN PROGRAM(VALTRXN) PLAN(CNBPLAN1) -
     PARMS('RETRY=3,COMMIT=1000')
 END
/*
The COBOL program itself handles deadlock retry:
01  WS-SQLCODE-SAVE      PIC S9(9) COMP.
    88 DB2-DEADLOCK      VALUE -911.
    88 DB2-TIMEOUT       VALUE -913.
    88 DB2-OK            VALUE 0.
01  WS-UPDATE-DONE       PIC X VALUE 'N'.
    88 UPDATE-DONE       VALUE 'Y'.
01  WS-WAIT-SECONDS      PIC S9(9) COMP VALUE 2.
01  WS-FEEDBACK          PIC X(12).

2000-PROCESS-RECORD.
    MOVE 0 TO WS-RETRY-COUNT
    MOVE 'N' TO WS-UPDATE-DONE
    PERFORM 2100-ATTEMPT-UPDATE
        UNTIL UPDATE-DONE
           OR WS-RETRY-COUNT > 3
    IF WS-RETRY-COUNT > 3
        PERFORM 9000-WRITE-ERROR-RECORD
    END-IF.

2100-ATTEMPT-UPDATE.
    EXEC SQL
        UPDATE TRANSACTION_MASTER
           SET STATUS = :WS-NEW-STATUS,
               PROC_DATE = CURRENT DATE
         WHERE TXN_ID = :WS-TXN-ID
    END-EXEC
    MOVE SQLCODE TO WS-SQLCODE-SAVE
    EVALUATE TRUE
        WHEN DB2-DEADLOCK
        WHEN DB2-TIMEOUT
            ADD 1 TO WS-RETRY-COUNT
            EXEC SQL ROLLBACK END-EXEC
*           CEE3DLY: LE SERVICE, SUSPEND FOR N SECONDS
            CALL 'CEE3DLY' USING WS-WAIT-SECONDS WS-FEEDBACK
        WHEN DB2-OK
            SET UPDATE-DONE TO TRUE
            ADD 1 TO WS-COMMIT-COUNTER
            IF WS-COMMIT-COUNTER >= 1000
                EXEC SQL COMMIT END-EXEC
                MOVE 0 TO WS-COMMIT-COUNTER
            END-IF
        WHEN OTHER
            PERFORM 9100-SQL-ERROR-HANDLER
    END-EVALUATE.
Tier 2 — Operator Recovery (restart from checkpoint):
When a job fails and can't auto-recover, the goal is to restart from the last checkpoint — not from the beginning.
* CHECKPOINT/RESTART LOGIC
01 WS-CHECKPOINT-DATA.
05 WS-CHKPT-RECORD-COUNT PIC 9(10).
05 WS-CHKPT-LAST-KEY PIC X(20).
05 WS-CHKPT-ACCUMULATORS.
10 WS-CHKPT-TOTAL-AMT PIC S9(15)V99 COMP-3.
10 WS-CHKPT-ERROR-CT PIC 9(7).
05 WS-CHKPT-TIMESTAMP PIC X(26).
2000-TAKE-CHECKPOINT.
    MOVE WS-RECORD-COUNT TO WS-CHKPT-RECORD-COUNT
    MOVE WS-CURRENT-KEY TO WS-CHKPT-LAST-KEY
    MOVE WS-TOTAL-AMT TO WS-CHKPT-TOTAL-AMT
    MOVE WS-ERROR-COUNT TO WS-CHKPT-ERROR-CT
    MOVE FUNCTION CURRENT-DATE TO WS-CHKPT-TIMESTAMP
*   WRITE THE CHECKPOINT BEFORE THE DB2 COMMIT —
*   SEE "COMMIT SYNCHRONIZATION" LATER IN THIS CHAPTER
    WRITE CHECKPOINT-RECORD FROM WS-CHECKPOINT-DATA
    EXEC SQL COMMIT END-EXEC
    DISPLAY 'CHECKPOINT: RECORDS=' WS-CHKPT-RECORD-COUNT
            ' KEY=' WS-CHKPT-LAST-KEY
            ' TIME=' WS-CHKPT-TIMESTAMP
    MOVE 0 TO WS-COMMIT-COUNTER.
0100-CHECK-RESTART.
OPEN INPUT CHECKPOINT-FILE
READ CHECKPOINT-FILE INTO WS-CHECKPOINT-DATA
AT END
SET WS-FRESH-START TO TRUE
NOT AT END
SET WS-RESTART TO TRUE
MOVE WS-CHKPT-RECORD-COUNT TO WS-RECORD-COUNT
MOVE WS-CHKPT-LAST-KEY TO WS-RESTART-KEY
MOVE WS-CHKPT-TOTAL-AMT TO WS-TOTAL-AMT
MOVE WS-CHKPT-ERROR-CT TO WS-ERROR-COUNT
DISPLAY 'RESTART FROM KEY=' WS-RESTART-KEY
' RECORDS=' WS-RECORD-COUNT
END-READ
CLOSE CHECKPOINT-FILE.
Tier 3 — Architect Recovery (critical path rerouting):
When a critical-path job fails and restart will take too long to meet the window:
- Assess remaining work: How many records left? What's the projected finish time?
- Split and parallel: Can the remaining work be split across multiple parallel jobs?
- Defer non-critical: Can downstream jobs that aren't legally required be deferred to a supplemental batch run?
- Partial online: Can CICS come up for a subset of functions while batch completes for others?
Incident: EOD-009 (Interest Accrual) failed at record 4.2M of 10M
after 21 minutes. S0C7 on corrupted account record.
Time remaining in window: 180 minutes
Time to restart from scratch: 50 minutes
Time to complete from checkpoint: 29 minutes (5.8M records)
Decision tree:
  If remaining_window > (checkpoint_restart_time + downstream_path_time):
      RESTART FROM CHECKPOINT
      Remaining path: 29 + 30 + 60 + 15 = 134 minutes
      Margin: 180 - 134 = 46 minutes → SAFE, restart from checkpoint
  Else if remaining_window > (split_time + downstream_path_time):
      SPLIT remaining work into parallel streams
  Else:
      OPEN CICS FOR NON-INTEREST FUNCTIONS
      RUN SUPPLEMENTAL INTEREST BATCH AT MIDDAY
💡 KEY INSIGHT: The recovery decision is always a math problem. Calculate the remaining critical path time for each recovery option, compare to the remaining window, and choose the option with the most margin. Rob Calloway keeps a laminated card in the operations center with the decision tree and the current critical path timing for the top 10 failure scenarios.
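The laminated-card decision tree can be coded directly. The 12-minute split estimate below is a hypothetical figure for illustration, not from the incident record:

```python
def recovery_decision(window_left, ckpt_restart, split_time, downstream):
    """Walk the decision tree in preference order: the lowest-risk
    option that still fits the remaining window wins."""
    if ckpt_restart + downstream <= window_left:
        return "RESTART FROM CHECKPOINT", window_left - (ckpt_restart + downstream)
    if split_time + downstream <= window_left:
        return "SPLIT REMAINING WORK", window_left - (split_time + downstream)
    return "OPEN CICS PARTIALLY, SUPPLEMENTAL BATCH AT MIDDAY", 0

# The EOD-009 incident: 180 min left, 29 min from checkpoint,
# downstream path 30 + 60 + 15 = 105 min, assumed 12-min split setup
action, margin = recovery_decision(180, ckpt_restart=29,
                                   split_time=12, downstream=105)
print(action, margin)
# RESTART FROM CHECKPOINT 46
```

Preference order matters: splitting might leave more margin on paper, but it carries higher operational risk, so it's only chosen when the checkpoint restart doesn't fit.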
The Recovery Hierarchy
Level Recovery Action Time Cost Risk
─────────────────────────────────────────────────────────
1 Auto-retry (deadlock/timeout) Seconds None
2 Restart from checkpoint Minutes Low
3 Restart from beginning Tens of min Medium
4 Fix and resubmit Variable Medium
5 Split remaining work Minutes Medium-High
6 Bypass job, manual correction Minutes High
7 Defer to supplemental batch 0 min High (regulatory)
8 Extend window (delay online) N/A Very High
Every level up the hierarchy increases business risk. Levels 1–3 are operational decisions. Levels 4–6 require application knowledge. Levels 7–8 require management approval.
Designing for Restartability from Day One
The difference between a batch program that recovers gracefully and one that requires a full rerun is checkpoint design. Every critical-path COBOL batch program must implement these four elements:
1. Checkpoint records that capture complete processing state:
The checkpoint must include not just the current position in the input file, but all accumulators, counters, flags, and state variables needed to resume processing as if the interruption never happened. Missing a single accumulator means the final totals will be wrong after restart.
2. Idempotent processing logic:
If a record is processed twice (because the checkpoint was taken before the commit), the result must be the same as processing it once. For database updates, this typically means using "upsert" logic — UPDATE if the record exists, INSERT if it doesn't. For file output, it means repositioning the output file to the checkpoint position and overwriting.
3. Commit synchronization:
The DB2 commit and the checkpoint write must be synchronized. If you commit to DB2 but crash before writing the checkpoint, the restart will re-process records that have already been committed — producing duplicate updates unless your processing is idempotent. The safest pattern: take the checkpoint first (to a sequential file), then commit DB2.
2000-TAKE-SYNCHRONIZED-CHECKPOINT.
* Write checkpoint BEFORE DB2 commit
* If crash after checkpoint but before commit,
* restart will re-process — but uncommitted DB2
* changes will be rolled back, so no duplicates.
WRITE CHECKPOINT-RECORD FROM WS-CHECKPOINT-DATA
EXEC SQL COMMIT END-EXEC
MOVE 0 TO WS-COMMIT-COUNTER.
4. Restart detection:
The program must detect whether it's a fresh start or a restart. A simple approach: check for the existence of a non-empty checkpoint file. If present, read the checkpoint and resume; if absent or empty, start from the beginning.
🔴 CRITICAL: At CNB, every COBOL batch program submitted for production must pass a "restart test" — the program is deliberately killed at 50% completion and then restarted. If the final outputs don't match a clean run, the program is rejected. This testing requirement was instituted after a 2019 incident where a restart produced $14.3 million in duplicate interest credits that weren't detected until the next month's reconciliation.
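The restart test can be mimicked with a toy Python harness: process records with periodic checkpoints, kill the run midway, restart from the surviving checkpoint, and compare against a clean run. Everything here is a simplified model, not CNB's actual test tooling:

```python
def run(records, checkpoint_every, kill_at=None, state=None):
    """Toy batch run that sums records, checkpointing (position, total).
    Returns ('done', total) on completion or ('killed', checkpoint)
    on a simulated crash — only the last checkpoint survives."""
    pos, total = state if state else (0, 0)
    last_ckpt = (pos, total)
    for i in range(pos, len(records)):
        if kill_at is not None and i == kill_at:
            return "killed", last_ckpt
        total += records[i]
        if (i + 1) % checkpoint_every == 0:
            last_ckpt = (i + 1, total)   # checkpoint captures ALL state
    return "done", total

data = list(range(1, 101))                                  # 100 records
_, clean = run(data, checkpoint_every=10)                   # clean run

_, ckpt = run(data, checkpoint_every=10, kill_at=50)        # kill at 50%
_, restarted = run(data, checkpoint_every=10, state=ckpt)   # resume

print(clean == restarted)   # True — the restart test passes
```

Note that the checkpoint carries the accumulator (`total`) as well as the position — drop the accumulator from the checkpoint record and the restarted run's final total would be wrong, which is exactly the failure mode behind the $14.3 million incident.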
Batch Monitoring — Knowing You're in Trouble Before It's Too Late
Don't wait until 5:47 AM to discover the window is breaking. Implement milestone monitoring:
Milestone Expected Time Alert Threshold
─────────────────────────────────────────────────────
Extracts complete 11:45 PM +15 min (12:00 AM)
Validation done 12:55 AM +20 min (01:15 AM)
Posting complete 01:45 AM +20 min (02:05 AM)
Balance calc done 02:25 AM +15 min (02:40 AM)
Interest done 03:15 AM +20 min (03:35 AM)
GL posting done 03:45 AM +15 min (04:00 AM)
Statements done 04:15 AM +20 min (04:35 AM)
Window complete 04:30 AM +15 min (04:45 AM)
//* MILESTONE NOTIFICATION STEP
//MILSTN EXEC PGM=CNBNOTFY,
// PARM='MILESTONE=POSTING-COMPLETE'
//SYSOUT DD SYSOUT=*
//*
//* CNBNOTFY checks current time against expected time
//* If late, sends alert to operations page group
//* Writes record to milestone tracking dataset
⚠️ WARNING — Trend Monitoring: A single night finishing 5 minutes late isn't a crisis. Three consecutive nights each 2 minutes later than the last is a trend that, if unaddressed, will blow the window within weeks. Monitor batch window trends weekly, not just nightly alerts.
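A minimal trend check in Python, comparing night-over-night slip rather than absolute lateness (a sketch — real monitoring would pull the history from the milestone tracking dataset):

```python
def window_trend(finish_minutes_past_target):
    """Average night-over-night slip across recent history,
    in minutes per night. Positive values mean the window
    is drifting later even if no single night alarms."""
    deltas = [b - a for a, b in zip(finish_minutes_past_target,
                                    finish_minutes_past_target[1:])]
    return sum(deltas) / len(deltas)

# Each night 2 minutes later than the last — no nightly alert fires,
# but the margin is being eaten at 2 minutes per night:
history = [0, 2, 4, 6, 8]
print(f"average slip: {window_trend(history):.1f} min/night")
# average slip: 2.0 min/night
```

With a 46-minute margin and a 2 min/night slip, the window breaks in about three weeks — which is why the trend review is weekly, not quarterly.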
The Human Factor: Knowledge Transfer and the Batch Window
🔄 ANCHOR — The Marcus Whitfield Problem: At Federal Benefits Administration, Marcus Whitfield is retiring. He's the only person who fully understands the 600-job monthly cycle that processes 40 million benefit payments. The dependency graph exists in TWS, but the logic behind the dependencies — why job BENPAY-047 must run before BENPAY-052, even though there's no obvious data dependency — lives entirely in Marcus's head.
Sandra Chen's modernization effort includes a batch window documentation initiative. For every dependency in the graph, she requires a documented justification in one of four categories:
- Data dependency: "BENPAY-047 writes the ELIGIBLE-BENEFICIARY file that BENPAY-052 reads."
- Resource dependency: "Both jobs need exclusive access to the BENEFITS-MASTER VSAM cluster."
- Temporal dependency: "BENPAY-052 must not start before 02:00 AM due to downstream system availability."
- Unknown/historical: "Dependency exists but justification cannot be determined."
Category 4 currently covers 23% of all dependencies. Sandra's goal is to reduce that to zero before Marcus retires — because every unknown dependency is either a necessary constraint that will cause a production failure if removed, or an unnecessary constraint that's artificially extending the critical path. There's no way to know which without investigation, and Marcus is the only person who can investigate.
This is the knowledge retirement problem applied to batch architecture. And it's happening at shops across the industry. If your batch window depends on knowledge that lives in one person's head, you have a single point of failure that no amount of redundant hardware can address.
✅ BEST PRACTICE: Every dependency in the scheduler should have a comment explaining why it exists. When a new dependency is added, the change request must include the justification category. When an employee who owns batch knowledge announces retirement, a batch dependency audit should be initiated immediately — not during their last two weeks.
Production Considerations
Regulatory and Compliance Constraints
Some batch jobs have legal deadlines that don't care about your technical problems:
Requirement Deadline Penalty
─────────────────────────────────────────────────────────────────
ACH origination file to Fed 06:00 AM ET Regulatory action
Wire transfer confirmations 07:00 AM ET Customer/regulatory
OCC Call Report (quarterly) Midnight filing Regulatory fine
BSA/AML daily scan 09:00 AM ET Criminal liability
FDIC assessment data Quarterly Regulatory action
🔴 CRITICAL: The ACH file must be transmitted to the Federal Reserve by 06:00 AM. If your batch window runs late and the ACH file isn't generated, millions of dollars in payroll direct deposits don't arrive in customer accounts. The reputational and regulatory consequences are severe. This is why Rob Calloway's minimum buffer isn't negotiable.
Seasonal Volume Planning
Period Volume Change Planning Action
─────────────────────────────────────────────────────
Month-end +15-20% Pre-split GL jobs
Quarter-end +25-35% Full parallel plan
Year-end +50-80% Rehearsal runs, extra LPARs
Tax season +40% Add processing capacity
Black Friday week +100-200% Special batch schedule
Regulatory filing +varies Dedicated job streams
Plan for the worst case, not the average case. The batch window that works on a normal Tuesday in March will fail on December 31st if you haven't planned for year-end volume.
🔄 ANCHOR — Pinnacle Health's Seasonal Challenge: Diane Okoye at Pinnacle Health Insurance faces a different seasonal pattern. January is their peak — open enrollment processing adds 40% to claims volume, and new-year deductible resets trigger a wave of "accumulator zeroing" jobs that don't run any other month. Diane builds her capacity model around January volume, not annual average. If the window can survive January, it can survive anything — with the possible exception of a mid-year acquisition that adds millions of claims overnight, which is exactly what happened in the case study for this chapter.
Ahmad Rashidi (Pinnacle's compliance architect) adds another dimension: regulatory filing deadlines shift with the calendar. The CMS EDGE Server submission is due by the 15th of each month, but when the 15th falls on a weekend, the effective deadline moves to the preceding Friday — when the 15th is a Sunday, that's Friday the 13th, and the batch window on Thursday the 12th must produce the files. These calendar edge cases catch operations teams off guard because they occur only a few times a year, and the jobs involved may not have been tested since the last occurrence.
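The weekend-shift rule is easy to encode in a scheduler calendar exit or a pre-submission check; this Python sketch (function name hypothetical) shows the logic:

```python
from datetime import date, timedelta

def effective_deadline(year, month, day=15):
    """If the nominal deadline falls on a weekend, pull it back
    to the preceding Friday (the rule described for the EDGE
    Server submission)."""
    d = date(year, month, day)
    while d.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d

# September 2024: the 15th is a Sunday, so the files are due
# Friday the 13th — and the window of the 12th must produce them
print(effective_deadline(2024, 9))
# 2024-09-13
```

Enterprise schedulers handle this with special-day calendars, but a standalone check like this is useful for validating that the calendar definitions actually match the regulation.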
Change Management for the Batch Window
Every change to the batch window — new jobs, modified dependencies, changed calendars, updated resources — should go through formal change management. At CNB, the batch window change process requires:
- Impact analysis: What is the effect on the critical path? Does the change add, remove, or modify any critical-path job?
- Resource assessment: Does the new/changed job require additional initiators, DB2 threads, or dataset access?
- Recovery review: Is the new/changed job restartable? Has the recovery procedure been documented and tested?
- Calendar review: On which days does this change affect the window? Does it create a new worst-case scenario?
- Approval: Changes that affect the critical path require architecture team approval. Changes that don't affect the critical path require operations team approval.
Rob Calloway estimates that 30% of batch window incidents are caused by changes that weren't properly impact-assessed. A new job added without checking resource contention. A dependency removed because "it seemed unnecessary." A calendar change that created an unexpected collision on month-end. Change management isn't bureaucracy — it's the batch window's immune system.
Documentation Requirements
Every batch window should have:
- DAG diagram — updated monthly, showing all jobs and dependencies
- Critical path documentation — which jobs are on it, what the expected times are
- Recovery runbook — for the top 20 failure scenarios, step-by-step recovery procedures
- Capacity model — current utilization, growth rate, projected window exhaustion date
- Change log — every dependency change, every new job added, every job removed
🧩 PATTERN — The Batch Window Dashboard: CNB maintains a real-time dashboard that shows:
- Current batch progress (jobs completed/remaining)
- Critical path status (on time / minutes ahead / minutes behind)
- Resource utilization (DB2 threads, initiators, I/O bandwidth)
- Milestone tracking (expected vs. actual completion times)
- Projected window completion time (updated every 5 minutes)
Rob Calloway checks it exactly once before bed (at midnight) and trusts the alerting system for everything else. If the dashboard shows green at midnight, the window will be fine. If it shows yellow, he sets his alarm for 3:00 AM. If it shows red at midnight, he's not going to bed.
Project Checkpoint — HA Banking System End-of-Day Batch Window
🔧 Progressive Project: HA Banking Transaction Processing System
Apply the batch window engineering concepts from this chapter to design the end-of-day processing for the HA banking system you've been building throughout this book.
Your Design Task
Design a complete end-of-day batch window for the HA banking system with the following characteristics:
Volume: 50 million transactions per day across 5 million active accounts.
Available window: 11:00 PM – 6:00 AM (7 hours; 390 minutes effective after a 30-minute buffer).
Required processing:
1. Transaction extraction from CICS journal
2. Transaction validation and enrichment
3. Account posting (debit/credit application)
4. Balance recalculation and interest accrual
5. Fraud detection daily scan
6. General ledger posting
7. Regulatory reporting (AML/BSA daily file)
8. Statement generation (for accounts with cycle date = today)
9. ACH origination file generation
10. End-of-day reconciliation
Deliverables:
- DAG diagram (text-based): Show all jobs with dependencies
- Critical path analysis: Identify the critical path and calculate total elapsed time
- Parallel stream design: Identify which jobs can run concurrently
- Throughput calculations: For the three longest jobs, estimate elapsed time based on record counts and processing rates
- Recovery strategy: For the two most critical failure points, document the recovery procedure
- Capacity projection: At 3% monthly volume growth, when will this design exhaust the window?
See code/project-checkpoint.md for the full project specification and worked guidance.
Summary
The batch window is a scheduling problem. That one sentence, truly understood, changes how you approach every aspect of batch processing architecture.
Individual job performance matters — but only for jobs on the critical path. The most impactful optimization is often dependency cleanup, which changes no code at all. The math of throughput and elapsed time gives you predictive power: you can calculate when the window will break before it actually breaks.
Job schedulers are the control plane of batch processing. They manage dependencies, allocate resources, and (when properly configured) route around failures. Understanding your scheduler's capabilities — resource management, conditional execution, cross-system dependencies — is essential for batch architecture.
Parallelization is the primary mechanism for window compression. Identifying independent work streams, resolving dataset and DB2 contention, and splitting large serial jobs into parallel components can reduce the critical path by 50% or more.
Recovery is architecture. Programs must be designed for restartability from the first line of code, not bolted on after a production failure. Checkpoint/restart logic, commit frequency, and idempotent processing aren't optional features — they're requirements for production batch systems.
And the 6 AM deadline doesn't negotiate. Whatever math you do, whatever architecture you design, the answer to "when does online come up?" must always be "on time."
Spaced Review
From Chapter 1 — z/OS Lifecycle
Connection: Chapter 1 introduced the z/OS job lifecycle — JCL submission, initiator allocation, step execution, and completion. That lifecycle is the fundamental unit of the batch window. Every node in your DAG is one execution of that lifecycle. The scheduler manages thousands of these lifecycles in the correct order.
Review Question: How does the z/OS initiator class system relate to the resource constraints discussed in this chapter's DAG model?
From Chapter 4 — Dataset Management
Connection: Chapter 4 covered dataset allocation, GDG management, and catalog operations. In batch window engineering, dataset contention is one of the primary barriers to parallelization, and GDGs are the primary mechanism for decoupling sequential producers from consumers.
Review Question: Why does GDG catalog serialization matter more during the batch window than during online processing? How would you mitigate it?
From Chapter 5 — Workload Manager
Connection: Chapter 5 discussed WLM service classes and how z/OS allocates resources. Batch jobs run in WLM service classes that determine their CPU dispatching priority and I/O priority. A critical-path batch job running in a low-priority service class will be preempted by online work that's still draining — understanding WLM is essential for ensuring batch jobs get the resources the throughput math assumes.
Review Question: If Rob Calloway's critical-path jobs are running in the default batch service class but CICS is still draining online transactions, what WLM change might help the batch window?
Next Chapter: Chapter 24 dives into the individual job level — how to design COBOL batch programs that process millions of records efficiently with proper restart/recovery, commit strategies, and error handling. Where this chapter gave you the forest, Chapter 24 gives you the trees.
Related Reading
Explore this topic in other books
- Advanced COBOL: Checkpoint and Restart
- Learning COBOL: Batch Processing
- Intermediate COBOL: Batch Processing Patterns