Chapter 23: Key Takeaways

Batch Window Engineering: Job Scheduling, Critical Path Analysis, and the Math of Getting It All Done by 6am


Threshold Concept

The batch window is a scheduling problem, not a performance problem. The constraint is the critical path through the dependency graph — the longest chain of jobs that must execute sequentially. Individual job optimization only matters for jobs on that critical path. Understanding this distinction is what separates batch operators from batch architects.


Core Takeaways

  1. Model the batch window as a directed acyclic graph (DAG). Every job is a node. Every dependency is a directed edge. The critical path — the longest path through the DAG — determines the minimum possible batch window duration. You cannot finish faster than the critical path, regardless of how many other jobs you optimize or parallelize.

  2. Slack is your early warning system. Jobs not on the critical path have slack — the amount of time their start can be delayed without affecting the window. When slack on non-critical paths drops below a few minutes, those paths are one bad night away from becoming the new critical path. Monitor slack trends, not just critical path completion.

  3. Hidden dependencies are everywhere. The scheduler's dependency graph isn't the complete picture. Dataset contention (DISP=OLD conflicts), DB2 lock conflicts, initiator starvation, GDG catalog serialization, and tape drive allocation all create implicit serialization that doesn't appear in the DAG. Use SMF data to discover hidden dependencies.

  4. Know where the time goes before you optimize. Every batch job's elapsed time is composed of CPU time, I/O wait, DB2 wait, and other overhead. A job that's 70% DB2-bound won't benefit from COBOL logic optimization. Measure the components before choosing a strategy.

  5. Dependency cleanup is the highest-ROI optimization. Removing unnecessary dependencies — jobs that are serialized by habit rather than by data dependency — costs nothing, risks little, and often recovers tens of minutes from the critical path. Always clean the graph before tuning programs.

  6. Job splitting is the most powerful parallelization technique. Splitting a single large serial job into multiple parallel jobs by key range or data type can reduce critical-path contribution by 50–75%. It requires application changes (parameter handling, merge logic) but the ROI is enormous.

  7. The throughput math is predictive. If you know records per second, volume growth rate, and volume elasticity, you can calculate when the batch window will break — weeks or months before it happens. Capacity planning formulas turn batch window management from reactive to proactive.

  8. Recovery is architecture, not improvisation. Checkpoint/restart logic must be designed into programs from the start. Commit frequency must balance throughput against lock contention. Recovery decision trees should be documented and rehearsed before the 5:47 AM phone call.

  9. The batch window is infrastructure. It deserves the same architectural attention as CICS regions, DB2 subsystems, and network configuration. Quarterly reviews, capacity projections, formal change management for dependencies, and trend monitoring are requirements — not luxuries.

  10. The 6 AM deadline doesn't negotiate. Regulatory filings, ACH transmissions, and online service availability have hard deadlines. Buffer time isn't padding — it's insurance against the certainty that things will go wrong. Rob Calloway's 30-minute minimum buffer is the result of seventeen years of learning the hard way.
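
Takeaways 1 and 2 can be made concrete with a standard CPM-style forward/backward pass over the job DAG. A minimal Python sketch, with hypothetical job names and durations in minutes — in practice the graph would come from the scheduler's dependency export:

```python
from collections import defaultdict, deque

def critical_path_and_slack(durations, edges):
    """Critical path duration and per-job slack for a job DAG.

    durations: {job: elapsed minutes}; edges: [(predecessor, successor), ...].
    Forward pass computes earliest starts; backward pass computes latest starts.
    """
    succs, preds, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
    for a, b in edges:
        succs[a].append(b)
        preds[b].append(a)
        indeg[b] += 1
    # Topological order (Kahn's algorithm).
    order, queue = [], deque(j for j in durations if indeg[j] == 0)
    while queue:
        j = queue.popleft()
        order.append(j)
        for s in succs[j]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    # Forward pass: earliest start = latest earliest-finish of predecessors.
    es = {j: 0.0 for j in durations}
    for j in order:
        es[j] = max((es[p] + durations[p] for p in preds[j]), default=0.0)
    window = max(es[j] + durations[j] for j in durations)  # critical path duration
    # Backward pass: latest start that does not delay the window.
    ls = {j: window - durations[j] for j in durations}
    for j in reversed(order):
        if succs[j]:
            ls[j] = min(ls[s] for s in succs[j]) - durations[j]
    slack = {j: ls[j] - es[j] for j in durations}
    return window, slack

# Hypothetical five-job schedule: EXTRACT -> {POST, FEES} -> MERGE -> REPORT
durations = {"EXTRACT": 40, "POST": 90, "FEES": 25, "MERGE": 30, "REPORT": 20}
edges = [("EXTRACT", "POST"), ("EXTRACT", "FEES"),
         ("POST", "MERGE"), ("FEES", "MERGE"), ("MERGE", "REPORT")]
window, slack = critical_path_and_slack(durations, edges)
# Critical path EXTRACT -> POST -> MERGE -> REPORT = 180 min; FEES has 65 min of slack.
```

Note that the answer to "which job should we tune?" falls straight out of the slack values: optimizing FEES buys nothing until POST shrinks by more than 65 minutes.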


Formulas to Remember

Critical path duration: Sum of elapsed times for all jobs on the longest path through the DAG.

Processing rate (records/second): 1 / (CPU time + I/O time + DB2 time + other time per record, all in seconds)

Elapsed time estimate (seconds): Total records / Processing rate
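
As a quick sanity check on these two formulas, a hypothetical per-record time breakdown (all figures invented for illustration):

```python
def processing_rate(cpu_s, io_s, db2_s, other_s):
    """Records/second from the per-record time components (seconds)."""
    return 1.0 / (cpu_s + io_s + db2_s + other_s)

def elapsed_estimate(total_records, rate):
    """Estimated elapsed seconds for a job at the given processing rate."""
    return total_records / rate

# Hypothetical job: 0.8 ms CPU, 1.2 ms I/O, 2.5 ms DB2, 0.5 ms other per record.
rate = processing_rate(0.0008, 0.0012, 0.0025, 0.0005)  # ~200 records/second
hours = elapsed_estimate(3_600_000, rate) / 3600        # 3.6M records, ~5 hours
```

The breakdown also shows why takeaway 4 matters: this job spends half its per-record time waiting on DB2, so halving the COBOL CPU time saves only 0.4 ms per record, roughly 24 minutes out of 5 hours.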

Window capacity: Available time - Critical path duration - Buffer

Growth margin: Monthly growth rate x Months to review x Avg volume elasticity x Critical path duration

Window safe if: Window capacity > Growth margin
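
Putting the capacity and growth-margin formulas together for a hypothetical shop (all figures invented for illustration):

```python
def window_capacity(available_min, critical_path_min, buffer_min):
    """Minutes of headroom left after the critical path and safety buffer."""
    return available_min - critical_path_min - buffer_min

def growth_margin(monthly_growth, months_to_review, elasticity, critical_path_min):
    """Critical-path minutes that volume growth will consume before the next review."""
    return monthly_growth * months_to_review * elasticity * critical_path_min

# Hypothetical shop: 8-hour window (480 min), 390-min critical path, 30-min buffer,
# 2% monthly volume growth, quarterly review, average volume elasticity of 0.8.
capacity = window_capacity(480, 390, 30)      # 60 minutes of headroom
margin = growth_margin(0.02, 3, 0.8, 390)     # ~18.7 minutes consumed by growth
safe = capacity > margin                       # window holds until the next review
```

Run quarterly, this arithmetic is what turns the window from a nightly surprise into a dated forecast: when the projected margin crosses the capacity line, you know the month the window breaks.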


Red Flags

  • Batch window margin below 60 minutes on a normal night
  • Critical path growing by more than 5 minutes month-over-month
  • More than 10% of scheduler dependencies undocumented or unexplained
  • No checkpoint/restart logic in critical-path jobs
  • Recovery procedures that haven't been tested in the past 6 months
  • No capacity projection for the next 12 months
  • Business initiatives launching without batch impact assessment
  • Time-based dependencies that haven't been reviewed in over a year

Production Checklist

  • [ ] DAG documented and current (updated within 30 days)
  • [ ] Critical path identified with expected elapsed times
  • [ ] Slack calculated for all non-critical-path jobs
  • [ ] Hidden dependencies identified via SMF analysis
  • [ ] Throughput math calculated for all critical-path jobs
  • [ ] Capacity projection current (covers next 12 months minimum)
  • [ ] Recovery runbook for top 20 failure scenarios
  • [ ] Checkpoint/restart tested for all critical-path programs
  • [ ] Milestone monitoring in place with alert thresholds
  • [ ] Trend analysis reviewed weekly
  • [ ] Quarterly dependency review scheduled
  • [ ] Batch impact assessment process documented and enforced