
Chapter 27: Batch Monitoring, Alerting, and Incident Response

SMF Records, Automation, and the On-Call Playbook

"The difference between a shop that runs and a shop that runs well is what happens between midnight and 6am when nobody's watching." — Rob Fielding, CNB Batch Operations Lead


27.1 When Rob's Phone Rings at 3am

Rob Fielding has been running batch operations at City National Bank for nineteen years. He can tell you the approximate elapsed time of every critical job in the nightly cycle without looking it up. He knows which tapes are fragile, which VSAM clusters run hot, and which application teams submit last-minute changes on Thursday afternoons that break the Friday cycle. Rob is, by any measure, a walking encyclopedia of CNB's batch environment.

None of that helped him at 3:17am on a Tuesday in March.

His phone lit up with a text from the overnight operator: "ACCTPOST abended S0C7 step 4. Restarted twice. Same thing." Rob pulled up his laptop, VPN'd into the operations console, and saw the damage. ACCTPOST — the core account posting job — had failed three times. The automated restart logic had dutifully restarted it each time, and each time it had abended with a data exception in the same step. Downstream, fourteen dependent jobs were in a held state. The general ledger extract was stalled. The ATM authorization file hadn't been refreshed. And the batch window was now ninety minutes behind schedule.

Rob spent twenty minutes diagnosing. The S0C7 was caused by a packed decimal field containing hex zeros where it expected a valid amount — a data corruption issue in the input file from the new online system deployed the previous evening. The fix was straightforward: run a data cleansing utility to repair the corrupted records, then restart from the checkpoint. Total resolution time: forty-three minutes from first alert to successful completion.

But here's what kept Rob awake for the rest of the night. The corrupted data had been sitting in the input dataset since 11:47pm — more than three hours before the abend. The online system had written bad records starting at 9:23pm, and the batch extract job that pulled them into the staging file had completed at 11:47pm without error because it doesn't validate data content, it just copies. If Rob had been monitoring dataset characteristics — record counts, control totals, basic field validation — he could have caught the problem before ACCTPOST ever ran.
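What that validation looks like in practice: the sketch below is illustrative Python, not CNB's actual utility. It flags records whose packed-decimal (COMP-3) amount field is malformed, which is exactly the all-hex-zeros condition that produced Rob's S0C7, and it is cheap enough to run between the extract and ACCTPOST. The field offset and length are assumptions you would take from the record layout.

```python
def packed_is_valid(field: bytes) -> bool:
    """True if the bytes form a well-formed COMP-3 packed-decimal value.

    A valid packed field ends in a sign nibble of C, D, or F, and
    every digit nibble is 0-9. All-X'00' bytes fail the sign check,
    which is what raises S0C7 when the field is used in arithmetic.
    """
    if not field:
        return False
    if field[-1] & 0x0F not in (0x0C, 0x0D, 0x0F):  # sign nibble
        return False
    nibbles = []
    for b in field[:-1]:
        nibbles.extend((b >> 4, b & 0x0F))
    nibbles.append(field[-1] >> 4)                   # last digit nibble
    return all(n <= 9 for n in nibbles)

def scan_amounts(records, offset, length):
    """Indices of records whose amount field would abend the posting job."""
    return [i for i, rec in enumerate(records)
            if not packed_is_valid(rec[offset:offset + length])]
```

Run against the staging file right after the extract completes, a scan like this would have flagged the bad records hours before ACCTPOST ever started.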

This chapter is about making sure Rob's phone doesn't ring at 3am. And when it does — because it will — making sure the response is fast, systematic, and driven by data rather than heroics.

The Cost of Reactive Operations

Every mainframe shop has its Rob. The person whose institutional knowledge keeps the lights on. This is simultaneously the shop's greatest asset and its greatest vulnerability. When Rob takes vacation, the overnight team holds its breath. When Rob retires — and he will — the shop loses decades of undocumented knowledge about failure modes, recovery procedures, and the subtle interdependencies that no run book captures.

The solution is not to replace Rob's expertise with automation. The solution is to capture that expertise in monitoring rules, alerting thresholds, playbooks, and automated recovery procedures — and then let Rob focus on the problems that actually require a human brain.

Consider the economics. CNB's batch window processes approximately 2.3 million transactions nightly. A one-hour delay in batch completion delays the morning's online availability, which CNB estimates costs $47,000 per hour in lost transaction revenue and customer impact. Rob's forty-three-minute incident cost roughly $33,000 — but the real cost was the three hours of undetected bad data flowing through the system, which required an additional four hours of reconciliation work the next day.

At Pinnacle Health, the stakes are different but equally high. Diane Chen's claims processing batch runs against regulatory deadlines. A failed batch cycle doesn't just cost money — it triggers compliance reporting requirements and can result in penalties. Ahmad Patel, Pinnacle's systems programmer, learned this the hard way when a storage allocation failure in the claims adjudication job went undetected for two hours because the monitoring was watching for abends, not for jobs running abnormally long.

Federal Benefits Administration faces yet another dimension: Sandra Williams's batch processing handles benefit payments for 4.2 million recipients. A failed payment cycle affects real people — veterans, retirees, disability recipients who depend on those payments arriving on time. Marcus Rivera, her technical lead, designed their monitoring with a single principle: "Every anomaly is guilty until proven innocent."


27.2 SMF Records for Batch Intelligence

System Management Facilities — SMF — is the mainframe's built-in telemetry system. Every z/OS installation produces SMF records continuously. They are, without exaggeration, the richest source of operational intelligence available on any computing platform. Most shops use perhaps ten percent of what SMF provides.

For batch monitoring, three SMF record types are essential, two are highly valuable, and several more are useful in specific contexts.

27.2.1 SMF Type 30: The Job Accounting Record

The type 30 record is the workhorse of batch monitoring. z/OS writes type 30 subtypes at key points in a job's lifecycle:

  • Subtype 1: Job start (written when the initiator selects the job)
  • Subtype 2: Interval record (written at SMF-defined intervals during execution)
  • Subtype 3: Last interval record (written for the final interval when a step ends)
  • Subtype 4: Step termination (written when each job step completes)
  • Subtype 5: Job termination (written when the job ends)

Each subtype contains different sections, but the step termination (subtype 4) and job termination (subtype 5) records are the most valuable for batch monitoring. Here's what they give you:

Identification Section (all subtypes):

  • Job name, job number, JES job ID
  • Step name and program name
  • System ID (critical in sysplex environments)
  • Job class and service class
  • Start time and date, end time and date
  • Completion code (condition code, abend code, or system abend)

Processor Accounting Section:

  • CPU time (TCB and SRB, broken down by step)
  • zIIP and zAAP eligible time (and time actually executed on specialty engines)
  • I/O counts by device type

Storage Section:

  • Region size requested and region size used
  • High-water mark for virtual storage below and above the line
  • Number of getmain/freemain requests

Performance Section:

  • Elapsed time
  • Wait time (I/O wait, ENQ wait, page wait)
  • Number of EXCPs (I/O operations) by DD name
  • DASD I/O counts and connect time

Let's look at how Rob uses type 30 data. Every morning, a COBOL program reads the previous night's SMF type 30 records and produces a batch performance report. The program compares each job's elapsed time, CPU time, and EXCP counts against a baseline stored in a VSAM KSDS:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. BCHPERF.
      *================================================================*
      * BATCH PERFORMANCE ANALYZER                                      *
      * Reads SMF type 30 subtype 4 records and compares against        *
      * baseline performance metrics stored in VSAM KSDS.               *
      *================================================================*
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT SMF-INPUT   ASSIGN TO SMFDATA
                  FILE STATUS IS WS-SMF-STATUS.
           SELECT BASELINE-FILE ASSIGN TO BSELINE
                  ORGANIZATION IS INDEXED
                  ACCESS MODE IS RANDOM
                  RECORD KEY IS BL-JOB-KEY
                  FILE STATUS IS WS-BL-STATUS.
           SELECT ALERT-OUTPUT ASSIGN TO ALERTS
                  FILE STATUS IS WS-ALT-STATUS.

       DATA DIVISION.
       FILE SECTION.
       FD  SMF-INPUT
           RECORDING MODE IS V
           RECORD CONTAINS 18 TO 32760 CHARACTERS.
       01  SMF-RECORD.
           05  SMF-LEN         PIC S9(4) COMP.
           05  SMF-SEG         PIC S9(4) COMP.
           05  SMF-FLAG        PIC X(1).
           05  SMF-RTY         PIC X(1).
              88 SMF-TYPE-30   VALUE X'1E'.
           05  SMF-TME         PIC S9(8) COMP.
           05  SMF-DTE         PIC X(4).
           05  SMF-SID         PIC X(4).
           05  SMF-DATA        PIC X(32740).

       FD  BASELINE-FILE.
       01  BASELINE-REC.
           05  BL-JOB-KEY.
               10  BL-JOB-NAME    PIC X(8).
               10  BL-STEP-NAME   PIC X(8).
           05  BL-AVG-ELAPSED  PIC S9(8) COMP.
           05  BL-STD-ELAPSED  PIC S9(8) COMP.
           05  BL-AVG-CPU      PIC S9(8) COMP.
           05  BL-AVG-EXCP     PIC S9(8) COMP.
           05  BL-LAST-UPDATE  PIC X(8).
           05  BL-SAMPLE-COUNT PIC S9(4) COMP.

       FD  ALERT-OUTPUT.
       01  ALERT-RECORD        PIC X(200).

       WORKING-STORAGE SECTION.
       01  WS-SMF-STATUS       PIC X(2).
       01  WS-BL-STATUS        PIC X(2).
       01  WS-ALT-STATUS       PIC X(2).
       01  WS-EOF-FLAG         PIC X VALUE 'N'.
          88  END-OF-FILE      VALUE 'Y'.
       01  WS-ALERT-LINE       PIC X(200).
       01  WS-ELAPSED-DIFF     PIC S9(8) COMP.
       01  WS-THRESHOLD-PCT    PIC S9(3) COMP VALUE 25.
       01  WS-ALERT-COUNT      PIC S9(5) COMP VALUE 0.
       01  WS-JOBS-ANALYZED    PIC S9(7) COMP VALUE 0.
      *    Display-format copies of binary fields.  STRING moves
      *    raw bytes, so COMP items must be moved to display
      *    items first or the alert text contains binary garbage.
       01  WS-ELAPSED-DISP     PIC 9(8).
       01  WS-BASELINE-DISP    PIC 9(8).
       01  WS-DIFF-DISP        PIC -(7)9.
       01  WS-JOBS-DISP        PIC 9(7).
       01  WS-ALERTS-DISP      PIC 9(5).

       01  WS-SMF30-FIELDS.
           05  WS-JOB-NAME     PIC X(8).
           05  WS-STEP-NAME    PIC X(8).
           05  WS-ELAPSED      PIC S9(8) COMP.
           05  WS-CPU-TIME     PIC S9(8) COMP.
           05  WS-EXCP-COUNT   PIC S9(8) COMP.
           05  WS-COMP-CODE    PIC S9(4) COMP.
           05  WS-ABEND-CODE   PIC S9(4) COMP.

       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-INITIALIZE
           PERFORM 2000-PROCESS-SMF
              UNTIL END-OF-FILE
           PERFORM 9000-FINALIZE
           STOP RUN.

       1000-INITIALIZE.
           OPEN INPUT  SMF-INPUT
           OPEN I-O    BASELINE-FILE
           OPEN OUTPUT ALERT-OUTPUT
           MOVE SPACES TO WS-ALERT-LINE
           STRING 'BCHPERF: Batch Performance Analysis started '
                  FUNCTION CURRENT-DATE(1:8)
                  DELIMITED BY SIZE
                  INTO WS-ALERT-LINE
           WRITE ALERT-RECORD FROM WS-ALERT-LINE.

       2000-PROCESS-SMF.
           READ SMF-INPUT
              AT END SET END-OF-FILE TO TRUE
              NOT AT END
                 IF SMF-TYPE-30
                    PERFORM 3000-PARSE-TYPE30
                    IF WS-COMP-CODE = ZERO
                       PERFORM 4000-CHECK-BASELINE
                    END-IF
                 END-IF
           END-READ.

       3000-PARSE-TYPE30.
      *    Extract relevant fields from SMF type 30 record.
      *    Actual parsing requires mapping the self-defining
      *    section (offsets/lengths) in the record header.
      *    Simplified here for clarity.
           MOVE SMF-DATA(1:8)   TO WS-JOB-NAME
           MOVE SMF-DATA(9:8)   TO WS-STEP-NAME
           MOVE SMF-DATA(17:4)  TO WS-ELAPSED
           MOVE SMF-DATA(21:4)  TO WS-CPU-TIME
           MOVE SMF-DATA(25:4)  TO WS-EXCP-COUNT
           MOVE SMF-DATA(29:2)  TO WS-COMP-CODE
           MOVE SMF-DATA(31:2)  TO WS-ABEND-CODE
           ADD 1 TO WS-JOBS-ANALYZED.

       4000-CHECK-BASELINE.
           MOVE WS-JOB-NAME  TO BL-JOB-NAME
           MOVE WS-STEP-NAME TO BL-STEP-NAME
           READ BASELINE-FILE
              INVALID KEY
                 PERFORM 4100-NEW-JOB-ALERT
              NOT INVALID KEY
                 PERFORM 4200-COMPARE-METRICS
           END-READ.

       4100-NEW-JOB-ALERT.
           MOVE WS-ELAPSED TO WS-ELAPSED-DISP
           MOVE SPACES TO WS-ALERT-LINE
           STRING 'INFO: New job detected - '
                  WS-JOB-NAME '/' WS-STEP-NAME
                  ' Elapsed=' WS-ELAPSED-DISP
                  DELIMITED BY SIZE
                  INTO WS-ALERT-LINE
           WRITE ALERT-RECORD FROM WS-ALERT-LINE.

       4200-COMPARE-METRICS.
      *    Guard against a zero baseline on the first sample
           IF BL-AVG-ELAPSED > 0
              COMPUTE WS-ELAPSED-DIFF =
                 ((WS-ELAPSED - BL-AVG-ELAPSED) * 100)
                 / BL-AVG-ELAPSED
              IF WS-ELAPSED-DIFF > WS-THRESHOLD-PCT
                 PERFORM 5000-GENERATE-ALERT
              END-IF
           END-IF
      *    Update rolling average
           COMPUTE BL-AVG-ELAPSED =
              ((BL-AVG-ELAPSED * BL-SAMPLE-COUNT)
               + WS-ELAPSED)
              / (BL-SAMPLE-COUNT + 1)
           ADD 1 TO BL-SAMPLE-COUNT
           MOVE FUNCTION CURRENT-DATE(1:8)
              TO BL-LAST-UPDATE
           REWRITE BASELINE-REC.

       5000-GENERATE-ALERT.
           ADD 1 TO WS-ALERT-COUNT
           MOVE WS-ELAPSED      TO WS-ELAPSED-DISP
           MOVE BL-AVG-ELAPSED  TO WS-BASELINE-DISP
           MOVE WS-ELAPSED-DIFF TO WS-DIFF-DISP
           MOVE SPACES TO WS-ALERT-LINE
           STRING 'ALERT: ' WS-JOB-NAME '/' WS-STEP-NAME
                  ' Elapsed=' WS-ELAPSED-DISP
                  ' Baseline=' WS-BASELINE-DISP
                  ' Deviation=' WS-DIFF-DISP '%'
                  DELIMITED BY SIZE
                  INTO WS-ALERT-LINE
           WRITE ALERT-RECORD FROM WS-ALERT-LINE.

       9000-FINALIZE.
           MOVE WS-JOBS-ANALYZED TO WS-JOBS-DISP
           MOVE WS-ALERT-COUNT   TO WS-ALERTS-DISP
           MOVE SPACES TO WS-ALERT-LINE
           STRING 'BCHPERF: Analysis complete. Jobs='
                  WS-JOBS-DISP
                  ' Alerts=' WS-ALERTS-DISP
                  DELIMITED BY SIZE
                  INTO WS-ALERT-LINE
           WRITE ALERT-RECORD FROM WS-ALERT-LINE
           CLOSE SMF-INPUT BASELINE-FILE ALERT-OUTPUT.

Note the structure: the program reads raw SMF data, compares against stored baselines, generates alerts for deviations, and updates the rolling averages. In production, Rob's version is considerably more complex — it handles all type 30 subtypes, tracks inter-step timings, and feeds a DB2 table that drives a real-time dashboard. But the principle is identical: compare observed behavior against expected behavior and flag deviations.

27.2.2 SMF Type 14 and Type 15: Dataset Activity

Type 14 records are written when a non-VSAM dataset opened for input (INPUT or RDBACK) is closed. Type 15 records are written when a non-VSAM dataset opened for output (OUTPUT, UPDAT, INOUT, or OUTIN) is closed. Together, they tell you everything about non-VSAM dataset I/O during batch execution.

Key fields in type 14/15 records:

  • Dataset name (44 bytes, the full DSN)
  • Volume serial (6 bytes, the DASD volume)
  • Device type (UCBTYP, identifies the device class)
  • EXCP count (number of I/O operations performed)
  • Block count (number of blocks read or written)
  • Creation date and time
  • LRECL, BLKSIZE, RECFM (dataset characteristics)
  • Job name and step name (what job accessed the dataset)

Why do these matter for monitoring? Consider this scenario at SecureFirst Insurance. Yuki Tanaka noticed that their policy renewal batch job had been gradually slowing over six months. Carlos Mendez, her systems programmer, analyzed SMF type 14/15 records and discovered that the primary input dataset had grown from 2.3 million records to 4.1 million records, but the BLKSIZE hadn't been adjusted. The job was performing 78% more EXCPs than necessary because the block size was optimized for the original dataset size. A simple reblock reduced elapsed time by thirty-one percent.
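The arithmetic behind that finding is worth making explicit: a sequential pass costs roughly one EXCP per block, so the EXCP count is the record count divided by records per block. A sketch with illustrative figures (the LRECL and BLKSIZE values here are assumptions, not SecureFirst's actuals; 27,998 bytes is the standard half-track block size for 3390 DASD):

```python
def excps_for(record_count, lrecl, blksize):
    """Approximate EXCPs for one sequential pass: one I/O per block."""
    recs_per_block = blksize // lrecl
    return -(-record_count // recs_per_block)   # ceiling division

# 4.1M 400-byte records: a 6,400-byte BLKSIZE packs 16 records per
# block; half-track blocking on 3390 (27,998 bytes) packs 69.
undersized = excps_for(4_100_000, 400, 6_400)
half_track = excps_for(4_100_000, 400, 27_998)
```

With these assumed figures the reblock cuts the I/O count by better than a factor of four; the elapsed-time saving is smaller because CPU and other waits don't shrink with it.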

Type 14/15 records also reveal dataset contention patterns. If two jobs in different batch streams are both writing to the same dataset, the ENQ serialization will cause one to wait. SMF records expose this through the timing gaps between I/O operations and the ENQ wait times recorded in the associated type 30 records.

27.2.3 SMF Type 42: Storage Management

Type 42 records track DFSMS storage management activity — dataset allocation, migration, recall, backup, and space management events. For batch monitoring, the critical subtypes are:

  • Subtype 1: Dataset open/close
  • Subtype 5: Space management events (B37, D37, E37 abend conditions)
  • Subtype 6: Dataset allocation/unallocation

When a batch job fails with a space abend (B37 when end-of-volume processing finds no usable space on the current or a subsequent volume, D37 when the primary extent fills and no secondary allocation was specified, E37 when all extents or volumes are exhausted), type 42 records provide the forensic detail: how much space was requested, how much was available, what volumes were tried, and why allocation failed. This is invaluable for post-incident analysis and for building predictive alerts that fire before space problems cause abends.

27.2.4 Other Relevant SMF Types

  Record Type   Description                  Batch Monitoring Use
  Type 4        Step termination             Basic step-level accounting (superseded by type 30)
  Type 5        Job termination              Basic job-level accounting (superseded by type 30)
  Type 6        External writer output       Spool activity monitoring
  Type 26       JES2 job purge               Job lifecycle completion tracking
  Type 42       Storage management           Space monitoring and prediction
  Type 62       VSAM component open          VSAM dataset access patterns
  Type 64       VSAM component status        VSAM CI/CA split counts and I/O statistics at close
  Type 80       RACF events                  Security violation detection in batch
  Type 83       RACF audit for datasets      Privileged access monitoring

Rob's rule of thumb: "If you're not collecting types 30, 14/15, and 42, you're flying blind. If you're collecting those three and actually analyzing them, you're ahead of ninety percent of shops."

27.2.5 SMF Data Management

SMF data is written to the SYS1.MANx datasets (or, on systems using SMF logstream recording, to System Logger log streams) and must be dumped or offloaded regularly before the active datasets wrap and overwrite themselves. The SMF dump process itself is a critical batch function. At CNB, Rob runs the SMF dump every four hours during the batch window and every hour during the online day. The dumped data feeds into a historical database — typically DB2 — where it's available for trend analysis.

A common pitfall: shops that collect SMF data but never analyze it. At Federal Benefits, Sandra Williams's team was writing SMF records to tape and shipping them to an offsite vault for "compliance purposes." Nobody had looked at the data in three years. When Marcus Rivera implemented their batch monitoring system, he discovered that the historical data revealed a pattern of gradual performance degradation that had been invisible to the operations team — a classic boiling-frog problem.

The COBOL programs that process SMF data must handle the self-defining record format. Every SMF record contains a header with offsets and lengths that describe where each data section begins within the record. This design allows IBM to add new sections to existing record types without breaking programs that read older formats. Your SMF processing code must use these offsets rather than hard-coded positions:

      *    SMF Type 30 Self-Defining Section
       01  SMF30-HEADER.
           05  SMF30-LEN       PIC S9(4) COMP.
           05  SMF30-SEG       PIC S9(4) COMP.
           05  SMF30-FLG       PIC X(1).
           05  SMF30-RTY       PIC X(1).
           05  SMF30-TME       PIC S9(8) COMP.
           05  SMF30-DTE       PIC X(4).
           05  SMF30-SID       PIC X(4).
           05  SMF30-SSI       PIC X(4).
           05  SMF30-STY       PIC S9(4) COMP.
      *    Self-defining section - triplets of offset/length/count
           05  SMF30-IDS-OFS   PIC S9(8) COMP.
           05  SMF30-IDS-LEN   PIC S9(4) COMP.
           05  SMF30-IDS-NUM   PIC S9(4) COMP.
           05  SMF30-PAS-OFS   PIC S9(8) COMP.
           05  SMF30-PAS-LEN   PIC S9(4) COMP.
           05  SMF30-PAS-NUM   PIC S9(4) COMP.
      *    ... additional triplets for each section

Each triplet — offset, length, count — tells you where a section starts within the record, how long each instance is, and how many instances exist. This is the mainframe's version of a self-describing data format, predating JSON and XML by decades.
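The triplet mechanism is easy to demonstrate outside COBOL. A minimal sketch in Python, for brevity; the synthetic record in the usage below is not a real SMF record, and a production reader would map the full IBM-supplied record layouts rather than this simplified header:

```python
import struct

def read_triplet(record: bytes, at: int):
    """Decode one offset(4)/length(2)/count(2) triplet, big-endian."""
    return struct.unpack('>IHH', record[at:at + 8])

def sections(record: bytes, at: int):
    """Yield each instance of the section a triplet describes.

    Offsets are taken relative to the start of the record
    (RDW included), which is how type 30 defines them.
    """
    off, length, count = read_triplet(record, at)
    for i in range(count):
        yield record[off + i * length:off + (i + 1) * length]
```

Given a triplet at an assumed header offset pointing at two 8-byte name fields, `sections` walks them without any hard-coded positions, which is exactly why old programs keep working when IBM extends the record.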


27.3 Batch Monitoring Architecture

Effective batch monitoring operates on four levels, each with different latency requirements and different audiences:

Level 1: Real-Time Console Monitoring (Seconds)

The z/OS operator console displays WTO (Write To Operator) messages as they occur. Every job start, step completion, abend, and significant system event generates console messages. The challenge is volume: a busy batch system can generate thousands of messages per hour. Without filtering and automation, console monitoring is like drinking from a fire hose.

Automation products — CA-OPS/MVS (now Broadcom), IBM System Automation for z/OS (SA z/OS), BMC MainView AutoOPERATOR — sit between the console message stream and the operators. They apply rules to filter, suppress, highlight, and act on messages. A typical automation rule at CNB, shown in a generic rule syntax:

)MSG ACCTPOST
)MSGID IEF450I
  IF MSGTEXT CONTAINS 'ABEND'
    THEN DO
      INFORM ONCALL BATCH-CRITICAL
      HOLD JOBSTREAM GL-EXTRACT
      LOG INCIDENT SEV=2
    ENDDO
)ENDMSG

This rule says: when any message containing "ABEND" is issued for job ACCTPOST, notify the on-call team, hold the downstream GL extract job stream, and log a severity-2 incident. The rule fires within seconds of the abend, long before any human would notice.
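The pattern generalizes: match a message against each rule's predicates, then execute the matching rule's action list. A toy sketch in Python (the rule syntax above is product-specific; every name and action string here is invented for illustration):

```python
def make_rule(job, msgid, contains, actions):
    """Build a (predicate, actions) pair mimicking an automation rule."""
    def matches(msg):
        return (msg['job'] == job and msg['msgid'] == msgid
                and contains in msg['text'])
    return matches, actions

def dispatch(msg, rules, fired):
    """Append the actions of every rule the message matches."""
    for matches, actions in rules:
        if matches(msg):
            fired.extend(actions)

# The CNB rule from the text, restated as data:
rules = [make_rule('ACCTPOST', 'IEF450I', 'ABEND',
                   ['INFORM ONCALL BATCH-CRITICAL',
                    'HOLD JOBSTREAM GL-EXTRACT',
                    'LOG INCIDENT SEV=2'])]
```

A real product adds suppression, wildcards, and timing windows, but the core loop is this: predicates over the message stream driving action lists, evaluated within seconds of the WTO.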

Level 2: Job-Level Monitoring (Minutes)

At this level, you're tracking individual jobs against expected behavior:

  • Did the job start on time? If the scheduler submitted it at 01:00 and it hasn't started by 01:15, something is wrong — perhaps it's waiting for a predecessor, or it's sitting in the input queue behind higher-priority work.
  • Is the job running within expected duration? If ACCTPOST normally takes 45 minutes and it's been running for 90, that's an alert.
  • Did the job complete successfully? Not just "did it abend" but "did it return an acceptable condition code."
  • Did the job produce expected output? Record counts, control totals, file sizes.

Most shops implement job-level monitoring through their scheduler. CONTROL-M, CA-7 (now Broadcom), Tivoli Workload Scheduler (now IBM Workload Automation) — all provide job-level monitoring out of the box. The key is configuring it properly, which requires knowing what "normal" looks like for each job. This is where SMF baselines become essential.

Rob maintains a spreadsheet — yes, a spreadsheet — of critical job thresholds. He updates it quarterly based on SMF trend data. Lisa Nakamura, CNB's COBOL development lead, has been trying to get him to migrate it to a DB2 table for years. Rob's response: "It works. Don't fix what works." (Lisa notes that this is precisely the attitude that creates single points of failure, which is precisely the problem this chapter addresses.)

Level 3: Batch Stream Monitoring (Batch Window)

Individual jobs are trees. The batch stream is the forest. Stream-level monitoring answers higher-order questions:

  • Is the overall batch cycle on schedule? If the cycle normally reaches the GL extract by 03:30 and it's 04:00 with the GL extract not started, the entire downstream chain is at risk.
  • Are SLA milestones being met? CNB has four SLA milestones in the nightly cycle: account posting complete by 03:00, GL extract complete by 04:30, ATM file refresh by 05:00, and online availability by 06:00.
  • What is the critical path? In a complex batch network with hundreds of jobs and multiple dependency chains, the critical path — the longest chain of dependent jobs — determines the earliest possible completion time. Monitoring the critical path tells you whether you'll make your SLA even if every remaining job runs at baseline.

Stream monitoring typically requires a combination of scheduler data (which jobs have run, which are waiting) and SMF data (how long completed jobs took versus baseline). At Pinnacle Health, Ahmad Patel built a critical-path calculator that runs every fifteen minutes during the batch window:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. CRITPATH.
      *================================================================*
      * CRITICAL PATH CALCULATOR                                        *
      * Reads scheduler dependency data and SMF actuals to calculate    *
      * estimated batch window completion time.                         *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-JOB-TABLE.
           05  WS-MAX-JOBS     PIC S9(4) COMP VALUE 500.
           05  WS-JOB-COUNT    PIC S9(4) COMP VALUE 0.
           05  WS-JOB-ENTRY OCCURS 500 TIMES
                              INDEXED BY JOB-IDX.
               10  WS-JOB-ID      PIC X(8).
               10  WS-JOB-STATUS  PIC X(1).
                   88  JOB-COMPLETE   VALUE 'C'.
                   88  JOB-RUNNING    VALUE 'R'.
                   88  JOB-WAITING    VALUE 'W'.
                   88  JOB-HELD       VALUE 'H'.
               10  WS-ACTUAL-START PIC S9(8) COMP.
               10  WS-ACTUAL-END   PIC S9(8) COMP.
               10  WS-ACTUAL-ELAPSED PIC S9(8) COMP.
               10  WS-BASELINE-ELAPSED PIC S9(8) COMP.
               10  WS-EST-REMAINING PIC S9(8) COMP.
               10  WS-PRED-COUNT   PIC S9(2) COMP.
               10  WS-PRED-LIST    PIC X(8)
                                   OCCURS 20 TIMES.
               10  WS-CRIT-PATH-FLAG PIC X VALUE 'N'.
                   88  ON-CRITICAL-PATH VALUE 'Y'.

       01  WS-BATCH-WINDOW.
           05  WS-WINDOW-START PIC S9(8) COMP.
           05  WS-WINDOW-END   PIC S9(8) COMP.
           05  WS-CURRENT-TIME PIC S9(8) COMP.
           05  WS-EST-COMPLETION PIC S9(8) COMP.
           05  WS-SLA-TARGET   PIC S9(8) COMP.
           05  WS-SLA-STATUS   PIC X(8).
              88  SLA-GREEN    VALUE 'GREEN'.
              88  SLA-YELLOW   VALUE 'YELLOW'.
              88  SLA-RED      VALUE 'RED'.

       PROCEDURE DIVISION.
      *    Main logic:
      *    1. Load job dependency network from scheduler
      *    2. Load actuals from SMF/scheduler status
      *    3. For waiting/running jobs, estimate remaining
      *       time using baseline
      *    4. Walk dependency graph to find critical path
      *    5. Sum critical path to get estimated completion
      *    6. Compare against SLA target
           PERFORM 1000-LOAD-SCHEDULE
           PERFORM 2000-LOAD-ACTUALS
           PERFORM 3000-ESTIMATE-REMAINING
           PERFORM 4000-FIND-CRITICAL-PATH
           PERFORM 5000-CALCULATE-COMPLETION
           PERFORM 6000-EVALUATE-SLA
           STOP RUN.

The critical-path calculation is a classic directed acyclic graph (DAG) traversal — the same algorithm used in project management (PERT/CPM). For each job that hasn't completed, you estimate its remaining elapsed time (baseline minus actual elapsed if running, full baseline if waiting) and sum along the longest dependency chain. The result tells you the earliest possible completion time assuming no further failures.
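The graph walk itself is compact. A Python sketch of the same estimate (job names and minute values invented for illustration): each job's earliest finish is the latest finish among its predecessors plus its own remaining time, and the critical path is the maximum over all jobs.

```python
from functools import lru_cache

def critical_path(remaining, preds):
    """Length in minutes of the longest chain of remaining work.

    remaining: job -> estimated minutes still to run (0 if complete)
    preds:     job -> list of predecessor job names
    """
    @lru_cache(maxsize=None)
    def finish(job):
        # Earliest start = latest predecessor finish; memoized so the
        # DAG is walked once even with shared predecessors.
        earliest = max((finish(p) for p in preds.get(job, [])), default=0)
        return earliest + remaining[job]
    return max(finish(j) for j in remaining)
```

Adding the result to the current time gives the estimated completion Ahmad's calculator compares against the SLA target.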

Level 4: Historical Trend Analysis (Days/Weeks/Months)

This is where SMF data stored in DB2 pays dividends. Trend analysis reveals:

  • Gradual performance degradation that's invisible on a night-to-night basis but adds up over months. A job that took 30 minutes a year ago and takes 38 minutes today has degraded by 27% — enough to shift the critical path.
  • Cyclical patterns tied to business cycles. End-of-month processing, quarterly closes, and annual cycles all produce predictable load spikes. If you know they're coming, you can plan for them.
  • Correlation between changes and performance. When a new application release is deployed and batch times increase 15% the next night, that's not coincidence. SMF data proves the correlation.
  • Capacity planning. If transaction volumes are growing 12% annually and batch elapsed times are growing proportionally, you can predict when the batch window will exceed its SLA — and justify the hardware upgrade, software optimization, or architecture change needed to prevent it.
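The capacity-planning prediction in the last bullet is simple compound-growth arithmetic: with elapsed time growing at annual rate r, the window breaches its SLA after log(SLA/current)/log(1+r) years. A sketch with illustrative figures:

```python
import math

def years_until_breach(current_hours, sla_hours, annual_growth):
    """Years until compound growth pushes elapsed time past the SLA."""
    if current_hours >= sla_hours:
        return 0.0
    return math.log(sla_hours / current_hours) / math.log(1.0 + annual_growth)

# A 5-hour window growing 12% a year against a 7-hour SLA:
# log(7/5)/log(1.12) is roughly 2.97 years of headroom.
```

Three years sounds comfortable until you subtract the lead time for a hardware upgrade or a re-architecture, which is exactly the argument the trend report is built to make.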

At Federal Benefits, Marcus Rivera built a COBOL reporting suite that produces weekly trend reports from the SMF/DB2 historical database. The reports are automatically distributed to application teams, capacity planners, and management. "The report doesn't lie," Marcus says. "When an application team tells me their batch job hasn't changed, I show them the SMF data. It tells a different story."


27.4 Alerting Strategy: Thresholds, Escalation, and Noise Control

The single most common failure in batch monitoring is not lack of alerts — it's too many alerts. Alert fatigue is real, measurable, and dangerous. When the on-call person receives 200 alerts per night and 195 of them are informational or false positives, the five real problems get lost in the noise.

27.4.1 Threshold Design

Every alert needs a threshold. The threshold must be specific, measurable, and meaningful. Here's Rob's framework for CNB:

Tier 1 — Critical (Immediate Response Required):

  • Any abend in a Tier-1 job (defined list of ~40 critical jobs)
  • Batch window SLA milestone missed or projected to miss within 30 minutes
  • Dataset space abend (B37/D37/E37) in any batch job
  • Security violation (RACF denial) in batch execution
  • JES spool utilization exceeds 85%
  • DASD volume utilization exceeds 95% on a batch-critical volume

Tier 2 — Warning (Response Required Within 30 Minutes):

  • Tier-1 job elapsed time exceeds baseline by 50%
  • Any job elapsed time exceeds baseline by 100%
  • Tier-2 job abend (non-critical but important jobs)
  • Job waiting for predecessor beyond expected time
  • Tape mount pending for more than 10 minutes
  • GDG base approaching maximum generations

Tier 3 — Informational (Review Next Business Day):

  • Any job elapsed time exceeds baseline by 25%
  • Condition code non-zero but within acceptable range
  • Dataset created larger than expected
  • New job detected (not in baseline database)
  • Successful automated restart

Tier 4 — Diagnostic (Logged Only):

  • All job start/stop events
  • All step completion codes
  • Resource utilization readings
  • Scheduler dependency resolution events

The tier structure serves two purposes: it prioritizes human attention, and it determines notification methods. Tier 1 triggers a phone call (or pager, because yes, some shops still use pagers — they work when cell towers don't). Tier 2 sends a text message. Tier 3 goes to an email queue. Tier 4 goes to the log.

27.4.2 Dynamic Thresholds

Static thresholds break on end-of-month nights. If ACCTPOST normally takes 45 minutes but takes 90 minutes on the last business day of the month (because there are twice as many transactions), a static threshold of 50% over baseline will fire every month-end. This is predictable, expected, and not a problem — but it generates a Tier 2 alert that the on-call person must acknowledge and dismiss.

The solution is dynamic thresholds — baselines that account for known patterns:

       01  WS-THRESHOLD-CALC.
           05  WS-BASE-ELAPSED PIC S9(8) COMP.
           05  WS-DAY-OF-WEEK  PIC 9 VALUE 0.
           05  WS-DAY-OF-MONTH PIC 99 VALUE 0.
           05  WS-MONTH-END-FLAG PIC X VALUE 'N'.
              88  IS-MONTH-END VALUE 'Y'.
           05  WS-QUARTER-END-FLAG PIC X VALUE 'N'.
              88  IS-QUARTER-END VALUE 'Y'.
           05  WS-ADJUSTED-BASELINE PIC S9(8) COMP.
           05  WS-ADJUSTMENT-FACTOR PIC S9V99 COMP VALUE 1.00.

       5100-CALCULATE-DYNAMIC-THRESHOLD.
      *    Start with normal baseline
           MOVE WS-BASE-ELAPSED TO WS-ADJUSTED-BASELINE
           MOVE 1.00 TO WS-ADJUSTMENT-FACTOR
      *    Apply day-of-week factor (Mondays tend to be heavier)
           IF WS-DAY-OF-WEEK = 2
              MULTIPLY 1.15 BY WS-ADJUSTMENT-FACTOR
           END-IF
      *    Apply month-end factor
           IF IS-MONTH-END
              MULTIPLY 1.80 BY WS-ADJUSTMENT-FACTOR
           END-IF
      *    Apply quarter-end factor (cumulative with month-end)
           IF IS-QUARTER-END
              MULTIPLY 1.30 BY WS-ADJUSTMENT-FACTOR
           END-IF
      *    Calculate adjusted baseline
           COMPUTE WS-ADJUSTED-BASELINE =
              WS-BASE-ELAPSED * WS-ADJUSTMENT-FACTOR.

At Pinnacle Health, Diane Chen took this further. Her team built a machine-learning model (running on a Linux partition, feeding results back to the mainframe via MQ) that predicts expected batch durations based on fifteen input variables including transaction volume, day of week, month-end flag, recent application changes, and concurrent workload. The model produces per-job expected durations that are far more accurate than static baselines. "We went from sixty false alerts per week to three," Diane reports. "And we caught two real problems that the old static thresholds would have missed because they happened on light-volume nights when the thresholds were too generous."

27.4.3 Escalation Procedures

Every alert tier needs an escalation path:

Time Since Alert | Tier 1 Action                         | Tier 2 Action
-----------------|---------------------------------------|----------------------------------
0 minutes        | Phone call to on-call primary         | Text to on-call primary
15 minutes       | Phone call to on-call secondary       | Text to on-call secondary
30 minutes       | Phone call to batch operations manager| Email to batch operations manager
60 minutes       | Conference bridge activated           | Phone call to batch operations manager
90 minutes       | VP Technology notified                | Conference bridge if SLA at risk

The escalation path must be automated. If the on-call person doesn't acknowledge a Tier 1 alert within 15 minutes, the system automatically escalates. No human decision required. No "I'm sure they saw it" assumptions.
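
An automated ladder of this kind reduces to a small lookup: given how long an alert has gone unacknowledged and the alert tier, return the action that is now due. A minimal sketch (step times and contacts mirror the table above; names are illustrative, not any product's API):

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    after_minutes: int
    tier1_action: str
    tier2_action: str

LADDER = [
    EscalationStep(0,  "call on-call primary",       "text on-call primary"),
    EscalationStep(15, "call on-call secondary",     "text on-call secondary"),
    EscalationStep(30, "call batch ops manager",     "email batch ops manager"),
    EscalationStep(60, "activate conference bridge", "call batch ops manager"),
    EscalationStep(90, "notify VP Technology",       "bridge if SLA at risk"),
]

def next_action(minutes_unacknowledged: int, tier: int) -> str:
    """Return the escalation action now due. No human decision required."""
    due = [s for s in LADDER if s.after_minutes <= minutes_unacknowledged]
    step = due[-1]
    return step.tier1_action if tier == 1 else step.tier2_action
```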

Rob has a rule at CNB: "If you get paged and can't respond within 10 minutes, you call the secondary immediately. Don't try to be a hero on a bad cell connection from your cousin's cabin."

27.4.4 Notification Channels

Modern mainframe monitoring integrates with enterprise notification systems:

  • Phone/pager for Tier 1 (most reliable, hardest to ignore)
  • SMS/text for Tier 2 (fast, unobtrusive)
  • Email for Tier 3 (reviewable, searchable, attachable)
  • Slack/Teams webhook for real-time team awareness (increasingly common)
  • ServiceNow/Remedy for incident tracking and escalation management
  • PagerDuty/OpsGenie for on-call rotation management

The mainframe side typically uses a REXX exec or automation product rule to send notifications. See the REXX example in the code directory (example-02-alert-rexx.rexx) for a complete notification exec that sends alerts via multiple channels.


27.5 On-Call Playbooks: When the Alert Fires

A playbook — sometimes called a runbook — is a documented procedure for responding to a specific type of incident. The goal is to enable a competent operator who may not be an expert on the specific job to diagnose and resolve the problem without calling Rob at 3am.

27.5.1 Playbook Structure

Every playbook entry should follow this format:

  1. Trigger: What alert or condition activates this playbook entry?
  2. Impact Assessment: What is affected? How urgent is this?
  3. Diagnostic Steps: What should the responder check first?
  4. Resolution Options: What are the possible fixes, in order of likelihood?
  5. Escalation Criteria: When should the responder stop trying and escalate?
  6. Post-Resolution: What verification and documentation is required?
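
Treated as data rather than prose, the six-part format is also easy to lint: every entry can be checked for missing sections before it reaches the on-call binder. A hypothetical sketch (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class PlaybookEntry:
    trigger: str
    impact: str
    diagnostic_steps: list
    resolution_options: list
    escalation_criteria: str
    post_resolution: str

    def is_complete(self) -> bool:
        """True only if no section of the six-part format is empty."""
        return all([self.trigger, self.impact, self.diagnostic_steps,
                    self.resolution_options, self.escalation_criteria,
                    self.post_resolution])
```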

27.5.2 Common Batch Failure Playbooks

Playbook: S0C7 (Data Exception)

Trigger: Job abends with system completion code 0C7.

Impact: The job and all downstream dependents are stalled. Assess criticality based on job tier.

Diagnostic Steps:

  1. Check the job log for the failing step and program name.
  2. Look at the PSW (Program Status Word) in the dump — the instruction address tells you where in the program the error occurred.
  3. Check the input data. S0C7 is almost always a data problem, not a program problem. Common causes: non-numeric data in a numeric field, uninitialized working storage, corrupted VSAM record.
  4. Compare input file record counts and characteristics against the previous successful run. Use IDCAMS LISTCAT for VSAM, IEHLIST for sequential.
  5. Check whether the input-producing job ran successfully and whether it was a new version.

Resolution Options:

  1. If data corruption is isolated to identifiable records: run the data cleansing utility to repair or exclude bad records, then restart from the last checkpoint.
  2. If the input file is entirely corrupt: rerun the input-producing job, then restart.
  3. If the problem is in the program (rare for established jobs): contact the application team, provide the dump, and discuss emergency code fix versus data workaround.

Escalation: If not resolved within 30 minutes, or if the cause is unclear, escalate to the application on-call team.

Post-Resolution: Verify downstream jobs complete successfully. Check control totals. Document the root cause and resolution in the incident log.
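
Off-platform, the "is the packed field actually valid?" question from the diagnostic steps can be answered mechanically. A sketch of the hardware rule (every digit nibble must be 0-9; the final sign nibble must be A-F):

```python
def is_valid_packed(field: bytes) -> bool:
    """True if the bytes form a valid packed decimal value."""
    if not field:
        return False
    nibbles = []
    for b in field:
        nibbles.append(b >> 4)       # high nibble
        nibbles.append(b & 0x0F)     # low nibble
    *digits, sign = nibbles
    return all(d <= 9 for d in digits) and sign >= 0x0A
```

The corrupt records in Rob's incident carried hex zeros where an amount belonged; the zero sign nibble is exactly what raises the S0C7 when the field is used in decimal arithmetic.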

Playbook: B37/D37/E37 (Space Abend)

Trigger: Job abends with system completion code x37.

Impact: Output dataset could not be extended. Job and downstream dependents are stalled.

Diagnostic Steps:

  1. Identify the DD name and dataset from the job log (message IEC030I, IEC031I, or IEC032I for B37, D37, or E37 respectively).
  2. Check the space allocation in the JCL. Is primary too small? Is secondary zero? Are secondary extents exhausted (16 per volume for non-VSAM)?
  3. Check the volume. Is it full? Use STORAGE GROUP DISPLAY or equivalent to check available space.
  4. Check whether this is a GDG that has accumulated more generations than expected.
  5. Know what the code tells you. B37: the dataset ran out of space on the current volume and no other volume was available. D37: the primary allocation filled and no secondary space was specified. E37: the dataset exhausted all available extents on all eligible volumes.

Resolution Options:

  1. For B37/D37 on a non-critical volume: compress or delete unnecessary datasets on the volume, then restart the job. The restart will attempt allocation again.
  2. For any x37: increase the SPACE parameter in the JCL (temporary override via operator command or JCL change) and restart.
  3. For E37: the dataset has run out of extents or volumes. Reallocate it with larger primary/secondary allocations, add candidate volumes (a VOL=SER override or a higher volume count), and restart; DEFRAG on the target volume can help if free space is badly fragmented.
  4. For SMS-managed datasets: check the storage group. The ACS routines may be directing allocation to a full storage group. Contact the storage administrator.

Escalation: If the problem is storage-group-wide (no space available across multiple volumes), escalate to the storage administrator immediately.

Playbook: S806 (Load Module Not Found)

Trigger: Job abends with system completion code 806.

Impact: The program cannot be loaded. Almost always caused by a deployment problem.

Diagnostic Steps:

  1. Identify the program name from the job log.
  2. Check the STEPLIB/JOBLIB concatenation. Is the correct load library included?
  3. Verify the load module exists in the expected library: LISTDS 'library.name' MEMBERS.
  4. Check whether a deployment was performed recently. Was the load library updated?
  5. Check the link-edit (bind) output for the program. Was the module link-edited successfully?

Resolution Options:

  1. If the load library is missing from the STEPLIB: add it (temporary JCL override) and restart.
  2. If the module doesn't exist: contact the deployment team. This is a deployment failure.
  3. If the module was accidentally deleted: restore from the previous backup (if available) and restart.

Escalation: Always escalate to the application team. An S806 in a production job is a deployment failure that needs root cause analysis.

Playbook: Performance Degradation (No Abend, But Running Long)

Trigger: Job elapsed time exceeds dynamic threshold. No abend, but significantly slower than expected.

Diagnostic Steps:

  1. Check system-wide resource utilization. Is the LPAR under CPU pressure? Are channels saturated? This is a system problem, not an application problem.
  2. Check for ENQ contention. Is the job waiting for a dataset held by another job? The DISPLAY GRS command (D GRS,C) shows current contention.
  3. Check the specific step that's slow. Compare EXCP counts against baseline. If EXCPs are normal but elapsed time is high, the job is waiting for something. If EXCPs are high, the job is processing more data than expected.
  4. Check input volume. Did the online day produce significantly more transactions than normal?
  5. Check for VSAM CI/CA splits if VSAM datasets are involved. Excessive splits dramatically increase I/O.

Resolution Options:

  1. If system-wide: reduce concurrent batch workload (hold non-critical jobs), request WLM adjustment for the critical job's service class, or work with the capacity team.
  2. If ENQ contention: identify the holder and determine which job should be prioritized.
  3. If data volume: no quick fix. Let it run, adjust downstream schedule expectations, and plan for permanent accommodation.
  4. If VSAM splits: schedule a REPRO/RELOAD or reorganization as soon as the job completes.

Escalation: If the batch window SLA is at risk and no quick resolution is available, escalate for management decision (e.g., extend the batch window, delay online availability, invoke disaster recovery procedures).

27.5.3 Decision Trees

For the overnight operations team, a decision tree is more useful than pages of text. Here's the top-level decision tree for any batch alert:

BATCH ALERT RECEIVED
    |
    +-- Is it an ABEND?
    |     |
    |     +-- YES --> What completion code?
    |     |     |
    |     |     +-- S0C7 --> Data exception playbook
    |     |     +-- S0C4 --> Protection exception playbook
    |     |     +-- x37  --> Space abend playbook
    |     |     +-- S806 --> Module not found playbook
    |     |     +-- S222 --> Job cancelled (operator or time limit)
    |     |     +-- S322 --> CPU time limit exceeded
    |     |     +-- S722 --> SYSOUT limit exceeded
    |     |     +-- S913 --> RACF authorization failure
    |     |     +-- Uxxxx --> User abend (application-specific)
    |     |     +-- Other --> Check system codes manual, escalate
    |     |
    |     +-- NO --> Performance issue
    |           |
    |           +-- Running long? --> Performance playbook
    |           +-- Not started? --> Check predecessors, initiators
    |           +-- Output wrong? --> Check input data, application
    |
    +-- Is it TIER 1?
          |
          +-- YES --> Respond immediately, begin diagnostics
          +-- NO  --> Acknowledge, begin diagnostics per SLA
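
The abend branch of the tree is, in effect, a dispatch table, and an automation exec can maintain it as one. A sketch (playbook names are illustrative):

```python
# Map system completion codes to playbook entries; anything unmapped
# falls through to the manual-lookup path at the bottom of the tree.
PLAYBOOKS = {
    "S0C7": "data-exception",
    "S0C4": "protection-exception",
    "SB37": "space-abend", "SD37": "space-abend", "SE37": "space-abend",
    "S806": "module-not-found",
    "S222": "job-cancelled",
    "S322": "cpu-limit",
    "S722": "sysout-limit",
    "S913": "racf-authorization",
}

def route_abend(code: str) -> str:
    if code.startswith("U"):
        return "user-abend"          # application-specific
    return PLAYBOOKS.get(code, "check-codes-manual-and-escalate")
```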

Rob laminated the decision tree and hung it next to every console in the operations center. "When you're half-awake at 3am, you don't want to search a wiki. You want a laminated card that tells you what to do next."


27.6 Self-Healing Batch: Automated Recovery

The ultimate goal is a batch environment that resolves common problems without human intervention. This doesn't mean eliminating human oversight — it means reserving human attention for problems that genuinely require human judgment.

27.6.1 Automated Restart

Most batch abends fall into a small number of categories, and many of them can be resolved by simple retry. A transient I/O error, a momentary resource shortage, a timing-dependent ENQ conflict — these problems often resolve themselves on the second attempt.

The scheduler (CONTROL-M, CA-7, TWS) provides built-in restart capabilities. The key configuration parameters:

  • Maximum restart attempts (typically 2-3; anything more and you're just hammering a persistent problem)
  • Restart delay (wait 2-5 minutes between attempts to let transient conditions clear)
  • Restart point (beginning of job, beginning of failing step, or from checkpoint — Chapter 24 covered checkpoint/restart in detail)
  • Restart conditions (which abend codes are eligible for automatic restart)

At CNB, Rob configured automatic restart for specific, well-understood abend codes:

JOB: ACCTPOST
  AUTO-RESTART: YES
  MAX-RESTARTS: 2
  RESTART-DELAY: 300 SECONDS
  RESTART-FROM: CHECKPOINT
  RESTART-CODES:
    S0C7: NO    (data problem - won't help)
    SE37: YES   (space - SMS may resolve on retry)
    S878: YES   (virtual storage - may clear)
    S0C4: NO    (protection exception - code bug)
    S806: NO    (module not found - deployment issue)
    S913: NO    (security - won't help)
    U0100-U0199: YES  (application retry-eligible)

The critical design decision is which abend codes are eligible for automatic restart. The rule: retry only if there's a reasonable probability the problem will not recur on the next attempt. An S0C7 (data exception) will fail again because the bad data is still there. An SE37 (space) might succeed if SMS has had time to free space or if allocation goes to a different volume.

Application teams can design their programs to support automated recovery by using user abend codes intentionally. At CNB, the convention is:

  • U0100-U0199: Transient errors eligible for automatic retry
  • U0200-U0299: Data quality errors requiring investigation
  • U0300-U0399: Environmental errors (file not found, etc.)
  • U0900-U0999: Severe errors requiring immediate human attention

This convention lets the automation infrastructure make intelligent restart decisions based on the application's own assessment of the failure.
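
Combining the restart-code table with the user-abend convention, the eligibility decision reduces to a small predicate. A sketch of that rule (not any scheduler's actual configuration syntax):

```python
RETRYABLE = {"SE37", "S878"}                    # transient: space, virtual storage
NEVER_RETRY = {"S0C7", "S0C4", "S806", "S913"}  # data, code, deploy, security

def should_retry(abend_code: str, attempt: int, max_restarts: int = 2) -> bool:
    """Retry only if the problem plausibly won't recur on the next attempt."""
    if attempt > max_restarts:
        return False
    if abend_code in NEVER_RETRY:
        return False
    if abend_code in RETRYABLE:
        return True
    if abend_code.startswith("U"):
        # CNB convention: U0100-U0199 are application-declared retry-eligible
        return 100 <= int(abend_code[1:]) <= 199
    return False                                # unknown codes: hold for a human
```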

27.6.2 Conditional Execution and Error Routing

Beyond simple restart, self-healing batch uses conditional logic to route around failures:

//STEP010  EXEC PGM=MAINPROC
//STEP020  EXEC PGM=NORMAL,COND=(0,NE,STEP010)
//STEP025  EXEC PGM=ERRPROC,COND=(0,EQ,STEP010)
//STEP030  EXEC PGM=CLEANUP
//*
//* If STEP010 ends with a nonzero return code, skip STEP020
//* (normal path), execute STEP025 (error path), then continue
//* to STEP030. Note: COND tests return codes only; if STEP010
//* ABENDs, later steps are skipped unless COND=EVEN or
//* COND=ONLY is coded.

Modern JCL uses IF/THEN/ELSE for clearer logic; unlike COND, the IF statement can also test abend conditions directly (keywords ABEND and ABENDCC):

//STEP010  EXEC PGM=MAINPROC
//*
// IF (STEP010.RC = 0) THEN
//STEP020  EXEC PGM=NORMAL
// ELSE
//STEP025  EXEC PGM=ERRPROC
// ENDIF
//*
//STEP030  EXEC PGM=CLEANUP

At Pinnacle Health, Ahmad Patel built an elaborate error-routing framework for the claims processing batch. Every major processing step has a parallel error-handling step that:

  1. Captures the error details (return code, abend code, step name, time)
  2. Writes the error to an incident dataset
  3. Executes fallback logic if available (e.g., processing claims through an alternative path)
  4. Sends a notification with context
  5. Sets a return code that tells the scheduler whether to continue or hold

       IDENTIFICATION DIVISION.
       PROGRAM-ID. ERRHNDLR.
      *================================================================*
      * BATCH ERROR HANDLER                                            *
      * Generic error processing routine invoked by JCL conditional    *
      * execution when a predecessor step fails.                       *
      *================================================================*

       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT INCIDENT-FILE ASSIGN TO INCIDENT
              ORGANIZATION IS SEQUENTIAL.

       DATA DIVISION.
       FILE SECTION.
       FD  INCIDENT-FILE.
       01  INCIDENT-RECORD     PIC X(154).

       WORKING-STORAGE SECTION.
       01  WS-ERROR-INFO.
           05  WS-FAILING-STEP PIC X(8).
           05  WS-FAILING-PGM  PIC X(8).
           05  WS-ABEND-CODE   PIC X(4).
           05  WS-RETURN-CODE  PIC S9(4) COMP VALUE 0.
           05  WS-ERROR-TIME   PIC X(21).
           05  WS-JOB-NAME     PIC X(8).

       01  WS-RECOVERY-ACTION  PIC X(1).
          88  RECOVER-RETRY    VALUE 'R'.
          88  RECOVER-BYPASS   VALUE 'B'.
          88  RECOVER-HALT     VALUE 'H'.

       01  WS-INCIDENT-RECORD.
           05  IR-TIMESTAMP    PIC X(26).
           05  IR-JOB-NAME     PIC X(8).
           05  IR-STEP-NAME    PIC X(8).
           05  IR-PGM-NAME     PIC X(8).
           05  IR-ERROR-TYPE   PIC X(4).
           05  IR-ACTION-TAKEN PIC X(20).
           05  IR-RESOLUTION   PIC X(80).

       PROCEDURE DIVISION.
       0000-MAIN.
           PERFORM 1000-GATHER-ERROR-INFO
           PERFORM 2000-DETERMINE-ACTION
           PERFORM 3000-LOG-INCIDENT
           PERFORM 4000-EXECUTE-RECOVERY
           PERFORM 5000-SET-RETURN-CODE
           STOP RUN.

       1000-GATHER-ERROR-INFO.
      *    The invoking automation is assumed to export the failure
      *    context as environment variables (names are illustrative)
           ACCEPT WS-JOB-NAME     FROM ENVIRONMENT 'ERRJOB'
           ACCEPT WS-FAILING-STEP FROM ENVIRONMENT 'ERRSTEP'
           ACCEPT WS-FAILING-PGM  FROM ENVIRONMENT 'ERRPGM'
           ACCEPT WS-ABEND-CODE   FROM ENVIRONMENT 'ERRABND'
           MOVE FUNCTION CURRENT-DATE TO WS-ERROR-TIME.

       2000-DETERMINE-ACTION.
      *    Decision logic based on error type and job context
           EVALUATE TRUE
              WHEN WS-ABEND-CODE = 'E37 '
              WHEN WS-ABEND-CODE = 'B37 '
                 SET RECOVER-RETRY TO TRUE
              WHEN WS-ABEND-CODE = '0C7 '
              WHEN WS-ABEND-CODE = '0C4 '
                 SET RECOVER-HALT TO TRUE
              WHEN WS-RETURN-CODE <= 8
                 SET RECOVER-BYPASS TO TRUE
              WHEN OTHER
                 SET RECOVER-HALT TO TRUE
           END-EVALUATE.

       3000-LOG-INCIDENT.
           MOVE FUNCTION CURRENT-DATE TO IR-TIMESTAMP
           MOVE WS-JOB-NAME     TO IR-JOB-NAME
           MOVE WS-FAILING-STEP TO IR-STEP-NAME
           MOVE WS-FAILING-PGM  TO IR-PGM-NAME
           MOVE WS-ABEND-CODE   TO IR-ERROR-TYPE
           EVALUATE TRUE
              WHEN RECOVER-RETRY
                 MOVE 'AUTOMATIC RETRY'  TO IR-ACTION-TAKEN
              WHEN RECOVER-BYPASS
                 MOVE 'BYPASS AND CONT'  TO IR-ACTION-TAKEN
              WHEN RECOVER-HALT
                 MOVE 'HALT FOR REVIEW'  TO IR-ACTION-TAKEN
           END-EVALUATE
           OPEN EXTEND INCIDENT-FILE
           WRITE INCIDENT-RECORD FROM WS-INCIDENT-RECORD
           CLOSE INCIDENT-FILE.

       4000-EXECUTE-RECOVERY.
           EVALUATE TRUE
              WHEN RECOVER-RETRY
                 PERFORM 4100-TRIGGER-RESTART
              WHEN RECOVER-BYPASS
                 PERFORM 4200-EXECUTE-BYPASS
              WHEN RECOVER-HALT
                 PERFORM 4300-HALT-STREAM
           END-EVALUATE.

       4100-TRIGGER-RESTART.
      *    Signal the scheduler to restart the failed step
      *    Implementation depends on scheduler product
           DISPLAY 'ERRHNDLR: Requesting restart of '
                    WS-FAILING-STEP
           MOVE 4 TO RETURN-CODE.

       4200-EXECUTE-BYPASS.
      *    Continue processing without the failed step
      *    May invoke alternative processing logic
           DISPLAY 'ERRHNDLR: Bypassing ' WS-FAILING-STEP
                   ', continuing batch stream'
           MOVE 0 TO RETURN-CODE.

       4300-HALT-STREAM.
      *    Stop the batch stream for human intervention
           DISPLAY 'ERRHNDLR: HALTING batch stream. '
                   'Human intervention required for '
                   WS-FAILING-STEP
           MOVE 16 TO RETURN-CODE.

       5000-SET-RETURN-CODE.
      *    RETURN-CODE was set by the recovery paragraph above;
      *    surface it in the job log for the operators
           DISPLAY 'ERRHNDLR: Exiting with RC=' RETURN-CODE.

27.6.3 Predictive Prevention

The most sophisticated self-healing doesn't wait for failures — it prevents them. This requires continuous monitoring during the batch window with rules that detect pre-failure conditions:

  • Space trending: If a dataset is extending toward its maximum extents, compress or allocate additional space before the B37 occurs.
  • Elapsed time trending: If a job is running at 120% of expected pace at the 50% mark, it will likely exceed the threshold. Alert early.
  • Input validation: Validate input data characteristics (record count, control total, field sampling) before the processing job starts, not after it abends.
  • Resource availability: Before submitting a job that needs 500 cylinders of temp space, verify that 500 cylinders are available.
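
The elapsed-time trending rule is simple enough to state in code. A sketch (assumes work is roughly uniform across the run, which is the same assumption the prose makes):

```python
def projected_elapsed(elapsed_so_far: float, pct_complete: float) -> float:
    """Linear projection of final elapsed time from pace at a checkpoint."""
    return elapsed_so_far / (pct_complete / 100.0)

def alert_early(elapsed_so_far: float, pct_complete: float,
                threshold: float) -> bool:
    """Fire the alert before the threshold is actually breached."""
    return projected_elapsed(elapsed_so_far, pct_complete) > threshold
```

With an expected elapsed time of 100 minutes, a job that has consumed 60 minutes at the halfway mark projects to 120, so the alert fires roughly 40 minutes before a static end-of-job threshold would.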

At Federal Benefits, Marcus Rivera implemented what he calls "pre-flight checks" — a COBOL program that runs before each critical batch step and validates all prerequisites:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. PREFLIGHT.
      *================================================================*
      * BATCH PRE-FLIGHT CHECK                                         *
      * Validates prerequisites before critical batch step execution.  *
      * Returns RC=0 if all checks pass, RC=8 if warnings,             *
      * RC=16 if critical failure (do not proceed).                    *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-CHECK-RESULTS.
           05  WS-OVERALL-RC   PIC S9(4) COMP VALUE 0.
           05  WS-SPACE-OK     PIC X VALUE 'N'.
           05  WS-INPUT-OK     PIC X VALUE 'N'.
           05  WS-PREREQ-OK    PIC X VALUE 'N'.
           05  WS-RESOURCE-OK  PIC X VALUE 'N'.
           05  WS-CHECK-COUNT  PIC S9(4) COMP VALUE 0.
           05  WS-FAIL-COUNT   PIC S9(4) COMP VALUE 0.
           05  WS-WARN-COUNT   PIC S9(4) COMP VALUE 0.

       01  WS-INPUT-VALIDATION.
           05  WS-EXPECTED-RECS  PIC S9(9) COMP.
           05  WS-ACTUAL-RECS    PIC S9(9) COMP.
           05  WS-REC-TOLERANCE  PIC S9(3)V99 COMP VALUE 10.00.
           05  WS-REC-DEVIATION  PIC S9(3)V99 COMP.

       01  WS-SPACE-CHECK.
           05  WS-NEEDED-CYLS   PIC S9(7) COMP.
           05  WS-AVAILABLE-CYLS PIC S9(7) COMP.
           05  WS-SPACE-MARGIN  PIC S9(3)V99 COMP.

       PROCEDURE DIVISION.
       0000-MAIN.
           DISPLAY 'PREFLIGHT: Starting pre-flight checks'
           PERFORM 1000-CHECK-INPUTS
           PERFORM 2000-CHECK-SPACE
           PERFORM 3000-CHECK-PREREQUISITES
           PERFORM 4000-CHECK-RESOURCES
           PERFORM 9000-REPORT-AND-EXIT
           STOP RUN.

       1000-CHECK-INPUTS.
           ADD 1 TO WS-CHECK-COUNT
      *    Read the control file with expected record counts
      *    Count records in actual input file
      *    Compare with tolerance
           IF WS-ACTUAL-RECS = 0
              DISPLAY 'PREFLIGHT FAIL: Input file is empty'
              ADD 1 TO WS-FAIL-COUNT
              MOVE 16 TO WS-OVERALL-RC
           ELSE
              COMPUTE WS-REC-DEVIATION =
                 (FUNCTION ABS(WS-ACTUAL-RECS - WS-EXPECTED-RECS)
                  * 100) / WS-EXPECTED-RECS
              IF WS-REC-DEVIATION > WS-REC-TOLERANCE
                 DISPLAY 'PREFLIGHT WARN: Record count '
                         'deviation ' WS-REC-DEVIATION '%'
                 ADD 1 TO WS-WARN-COUNT
                 IF WS-OVERALL-RC < 8
                    MOVE 8 TO WS-OVERALL-RC
                 END-IF
              ELSE
                 MOVE 'Y' TO WS-INPUT-OK
              END-IF
           END-IF.

       2000-CHECK-SPACE.
           ADD 1 TO WS-CHECK-COUNT
      *    Calculate expected output size based on input size
      *    Check available space on target volumes
           COMPUTE WS-SPACE-MARGIN =
              (WS-AVAILABLE-CYLS - WS-NEEDED-CYLS) * 100
              / WS-NEEDED-CYLS
           IF WS-SPACE-MARGIN < 0
              DISPLAY 'PREFLIGHT FAIL: Insufficient space. '
                      'Need ' WS-NEEDED-CYLS
                      ' Available ' WS-AVAILABLE-CYLS
              ADD 1 TO WS-FAIL-COUNT
              MOVE 16 TO WS-OVERALL-RC
           ELSE
              IF WS-SPACE-MARGIN < 20
                 DISPLAY 'PREFLIGHT WARN: Space margin only '
                         WS-SPACE-MARGIN '%'
                 ADD 1 TO WS-WARN-COUNT
                 IF WS-OVERALL-RC < 8
                    MOVE 8 TO WS-OVERALL-RC
                 END-IF
              ELSE
                 MOVE 'Y' TO WS-SPACE-OK
              END-IF
           END-IF.

       3000-CHECK-PREREQUISITES.
           ADD 1 TO WS-CHECK-COUNT
      *    Verify predecessor jobs completed successfully
      *    Check control dataset for predecessor status flags
           MOVE 'Y' TO WS-PREREQ-OK.

       4000-CHECK-RESOURCES.
           ADD 1 TO WS-CHECK-COUNT
      *    Check DB2 tablespace status
      *    Check CICS region availability (for jobs that interact)
      *    Verify network connectivity for distributed components
           MOVE 'Y' TO WS-RESOURCE-OK.

       9000-REPORT-AND-EXIT.
           DISPLAY 'PREFLIGHT: Checks=' WS-CHECK-COUNT
                   ' Failures=' WS-FAIL-COUNT
                   ' Warnings=' WS-WARN-COUNT
           DISPLAY 'PREFLIGHT: Input=' WS-INPUT-OK
                   ' Space=' WS-SPACE-OK
                   ' Prereq=' WS-PREREQ-OK
                   ' Resources=' WS-RESOURCE-OK
           DISPLAY 'PREFLIGHT: Return code=' WS-OVERALL-RC
           MOVE WS-OVERALL-RC TO RETURN-CODE.

The pre-flight check runs as a job step before the main processing. If it returns RC=16, the JCL conditional execution skips the main step and routes to the error handler. If it returns RC=8 (warnings only), the main step proceeds but the alerts are already in the queue for review. If RC=0, all clear.
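
That RC contract is easy to test off-platform before committing it to COBOL. A Python mirror of the two quantitative checks (the 10% record-count tolerance and 20% space margin from PREFLIGHT; function and parameter names are hypothetical):

```python
def preflight_rc(expected_recs: int, actual_recs: int,
                 needed_cyls: int, available_cyls: int,
                 tolerance_pct: float = 10.0) -> int:
    """Return 0 (all clear), 8 (warnings only), or 16 (do not proceed)."""
    rc = 0
    if actual_recs == 0:
        rc = 16                                      # empty input: hard stop
    else:
        deviation = abs(actual_recs - expected_recs) * 100.0 / expected_recs
        if deviation > tolerance_pct:
            rc = max(rc, 8)                          # warn, but proceed
    margin = (available_cyls - needed_cyls) * 100.0 / needed_cyls
    if margin < 0:
        rc = 16                                      # would x37: do not run
    elif margin < 20:
        rc = max(rc, 8)                              # thin margin: warn
    return rc
```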

"We catch about three problems a week before they become abends," Marcus reports. "That's three incidents that never happen, three 3am phone calls that never occur, three recovery procedures that never need to execute."


27.7 Post-Incident Review: Learning from Failure

Every significant batch incident should trigger a post-incident review (PIR), sometimes called a postmortem. The purpose is not to assign blame — it's to identify systemic improvements that prevent recurrence.

27.7.1 The PIR Framework

Rob uses a structured framework for every Tier 1 and Tier 2 incident at CNB:

Timeline Reconstruction:

  • When did the problem actually begin? (Not when it was detected — when did the underlying condition first manifest?)
  • When was it detected?
  • When was diagnosis complete?
  • When was it resolved?
  • When were all downstream effects cleared?

The gap between "began" and "detected" is the detection latency. The gap between "detected" and "resolved" is the resolution time. Together they constitute the incident duration. The goal is to reduce both.

Root Cause Analysis:

  • What was the immediate cause? (The corrupt data in the input file.)
  • What was the contributing cause? (The input-producing job doesn't validate data.)
  • What was the systemic cause? (There is no pre-flight validation for data quality.)

Five Whys Exercise:

  1. Why did ACCTPOST abend? — Bad data in the packed decimal field.
  2. Why was there bad data? — The extract job copied corrupt records from the online system.
  3. Why did the online system write corrupt records? — A deployment bug in the evening release.
  4. Why wasn't the deployment bug caught? — The deployment test didn't include the specific transaction type that triggers the code path.
  5. Why didn't monitoring catch the bad data before ACCTPOST? — There is no data validation step between extract and processing.

Each "why" reveals a layer of defense that either didn't exist or didn't function. The improvement plan addresses each layer:

Layer              | Current State                       | Improvement
-------------------|-------------------------------------|------------------------------------------------
Deployment testing | Incomplete test coverage            | Add regression test for all transaction types
Data validation    | None between extract and processing | Add pre-flight validation step
Monitoring         | Detects abends, not data anomalies  | Add record count and control total monitoring
Recovery           | Manual diagnosis, manual restart    | Add automated restart from checkpoint for S0C7 with data cleanse
Knowledge          | Rob's expertise, not documented     | Create playbook entry for ACCTPOST S0C7

27.7.2 Mean Time to Recovery (MTTR)

MTTR is the single most important metric for batch incident management. It measures the elapsed time from incident detection to complete resolution (including downstream recovery). The formula is straightforward:

MTTR = Sum of all incident durations / Number of incidents

But the components matter more than the aggregate:

  • Mean Time to Detect (MTTD): How long between the problem occurring and someone (or something) noticing?
  • Mean Time to Diagnose (MTTDx): How long between detection and identifying the root cause?
  • Mean Time to Repair (MTTRp): How long between diagnosis and implementing the fix?
  • Mean Time to Verify (MTTV): How long to confirm the fix worked and downstream effects are cleared?
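
Given incident records that capture the five moments from the timeline reconstruction, the component means fall out of simple subtraction. A sketch (field names are illustrative):

```python
def mttr_components(incidents):
    """Per-component means, in minutes.

    Each incident is a dict with began/detected/diagnosed/repaired/verified
    timestamps expressed as minutes since a common epoch.
    """
    n = len(incidents)
    return {
        "MTTD":  sum(i["detected"] - i["began"]     for i in incidents) / n,
        "MTTDx": sum(i["diagnosed"] - i["detected"] for i in incidents) / n,
        "MTTRp": sum(i["repaired"] - i["diagnosed"] for i in incidents) / n,
        "MTTV":  sum(i["verified"] - i["repaired"]  for i in incidents) / n,
    }
```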

At CNB, Rob tracks these components separately for every Tier 1 incident. His data over the past year shows:

Component  | Average Before Monitoring Upgrade | Average After
-----------|-----------------------------------|------------------------------
MTTD       | 37 minutes                        | 4 minutes
MTTDx      | 28 minutes                        | 12 minutes
MTTRp      | 19 minutes                        | 8 minutes (automated: 2 min)
MTTV       | 14 minutes                        | 9 minutes
Total MTTR | 98 minutes                        | 33 minutes

The biggest improvement was in detection time — from 37 minutes to 4 minutes. That's the value of proper monitoring and alerting. The second biggest was in repair time — from 19 minutes to 8 minutes on average, with automated recoveries averaging 2 minutes. That's the value of playbooks and self-healing automation.

27.7.3 Knowledge Capture

Every PIR should produce or update at least one artifact:

  • A playbook entry for the specific failure mode
  • A monitoring rule to detect the condition earlier
  • An automation rule to recover automatically if possible
  • A baseline update if the incident revealed that baselines were incorrect

Sandra Williams at Federal Benefits requires that every PIR produce a "future self memo" — a one-page document written as if you're explaining the problem and solution to the person who will encounter it next. "Assume the reader is competent but has never seen this specific job before," Sandra tells her team. "Write what you wish someone had written for you before you spent ninety minutes figuring it out."

These memos are indexed by job name, abend code, and symptom keywords, and stored in a searchable repository. When a new incident occurs, the first step in the diagnostic procedure is to search the repository for similar past incidents. More than half the time, the answer is already there.

27.7.4 Blameless Culture

A critical note on organizational dynamics: post-incident reviews only work in environments where the goal is improvement, not punishment. If the developer who introduced the deployment bug knows they'll be blamed in the PIR, they'll resist participation, withhold information, and avoid honest analysis.

Rob's PIR rule at CNB: "We don't ask who caused the problem. We ask what allowed the problem to happen and how we prevent it from happening again. If a single human error can take down the batch cycle, the problem is the system, not the human."

This is not soft management philosophy. It's engineering pragmatism. A system that depends on humans not making mistakes is a fragile system. A system that detects, contains, and recovers from human errors is a resilient system. Monitoring, alerting, playbooks, and automation are the mechanisms that create that resilience.


27.8 Progressive Project: HA Banking System Monitoring Framework

Time to apply everything from this chapter to the High Availability Banking Transaction Processing System you've been building throughout Part 5.

27.8.1 Monitoring Design

Your HA banking system needs monitoring at all four levels:

Level 1 — Console Automation Rules:

  • Capture all abend messages for batch jobs in the HABK* job name prefix
  • Automatically restart eligible jobs (per the restart table)
  • Hold downstream dependents when a critical job fails
  • Alert the on-call team for Tier 1 events

Level 2 — Job-Level Monitoring: For each of the seven batch streams in the HA system (account posting, GL extract, statement generation, regulatory reporting, interest calculation, fee processing, and reconciliation), define:

  • Expected start time
  • Expected elapsed time (with dynamic thresholds for month-end)
  • Maximum acceptable condition code
  • Expected output record counts (within tolerance bands)
  • Checkpoint frequency for restart capability

Level 3 — Stream-Level Monitoring:

  • Define four SLA milestones for the batch window
  • Implement critical path calculation across all seven streams
  • Create a stream-level dashboard showing real-time progress

Level 4 — Historical Trending:

  • Design the DB2 tables to store nightly SMF summaries
  • Define the weekly trend reports
  • Establish the quarterly baseline review process

27.8.2 Alerting Configuration

Design the complete alerting configuration:

  1. Classify every job in the HA system as Tier 1, 2, or 3
  2. Define static and dynamic thresholds for each tier
  3. Design the escalation path (who gets called, in what order, after how long)
  4. Define the notification channels (considering that the HA system runs in a sysplex with jobs on multiple LPARs)

27.8.3 Playbook

Create playbook entries for the five most likely failure scenarios in the HA banking system:

  1. Account posting abend (S0C7, data quality in transaction input)
  2. GL extract space abend (B37, growing transaction volumes)
  3. Statement generation performance degradation (running long due to customer growth)
  4. Regulatory report missed deadline (compliance impact)
  5. Reconciliation out-of-balance (data integrity issue)

For each scenario, document: trigger, impact, diagnostic steps, resolution options, escalation criteria, and post-resolution verification.

27.8.4 Self-Healing Design

Implement automated recovery for the HA banking system:

  1. Design the pre-flight validation program for the account posting stream
  2. Define the restart eligibility table for all jobs
  3. Create the error-routing JCL for the critical path jobs
  4. Design the fallback processing path for statement generation (what happens if the primary path fails — can statements be generated through an alternative method?)

27.8.5 Post-Incident Framework

Design the PIR process for the HA banking system:

  1. Incident severity classification criteria
  2. PIR template specific to banking batch operations
  3. MTTR tracking metrics and targets
  4. Knowledge base structure for incident history
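
MTTR (item 3) is simply the mean of detection-to-resolution intervals across incidents. A sketch of the calculation, using invented timestamps:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to resolve, in minutes.

    incidents: list of (detected, resolved) datetime pairs.
    """
    total = sum((resolved - detected).total_seconds()
                for detected, resolved in incidents)
    return total / 60 / len(incidents)
```

The hard part isn't the arithmetic — it's recording honest detection and resolution timestamps for every incident, which is what the knowledge base structure (item 4) must enforce.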

The project checkpoint in the code directory provides the detailed specification for these deliverables.


Spaced Review Integration

From Chapter 5 (Workload Management): WLM service classes determine how batch jobs are dispatched and prioritized. Your monitoring must be aware of WLM classifications because a job that's "running slow" may actually be performing normally for its assigned service class — it's simply getting fewer resources than a higher-priority service class. When investigating performance issues, always check the WLM service class and whether the job is meeting its velocity goal. If it is, the "problem" is a WLM policy issue, not an application issue.
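
WLM's execution velocity is the percentage of sampled states in which the work was using resources rather than waiting for them:

```python
def execution_velocity(using_samples, delay_samples):
    """WLM execution velocity: using / (using + delay) * 100."""
    return 100 * using_samples / (using_samples + delay_samples)
```

A batch job with a velocity goal of 30 that is measured at 30 is meeting its goal, no matter how slow it feels — which is why checking this number belongs at the top of any "job running slow" diagnostic.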

From Chapter 23 (Batch Window Architecture): The batch window defines the SLA milestones that your monitoring tracks. But the batch window isn't static — it shifts with business growth, seasonal patterns, and application changes. Your monitoring must adapt. If you defined SLA milestones in Chapter 23 and the batch window has grown 15% in six months, your milestones need adjustment. Historical trend data from SMF tells you when to adjust.
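
One simple way to adapt the milestones is to scale them by the measured window growth — a rough sketch, assuming milestones are expressed as minutes from window start (the names and values are invented):

```python
def adjust_milestones(milestones, growth_pct):
    """Scale SLA milestones (minutes from window start) by measured
    batch-window growth, rounding to whole minutes."""
    factor = 1 + growth_pct / 100
    return {name: round(m * factor) for name, m in milestones.items()}
```

Proportional scaling is a starting point, not a rule: if the growth is concentrated in one stream, only the milestones downstream of that stream need to move.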

From Chapter 24 (Checkpoint/Restart): Automated restart is only useful if the job supports checkpoint/restart. A job that must restart from the beginning of a two-hour step is far more expensive to restart than one that can resume from a checkpoint ten minutes before the failure. Your restart eligibility table should account for checkpoint capability — jobs without checkpoints should have a higher escalation priority because automated restart is less effective.


Chapter Summary

Batch monitoring is not a technology problem — it's a discipline. The tools exist: SMF provides the telemetry, automation products provide the action framework, schedulers provide the job management, and COBOL provides the analytical programs that turn raw data into operational intelligence. What separates shops that run smoothly from shops that lurch from crisis to crisis is whether they've invested the time to configure those tools properly, train their staff to use them, and build the feedback loop that turns every incident into a systemic improvement.

Rob's phone still rings occasionally at 3am. But now when it rings, the monitoring system has already diagnosed the problem, the playbook is already open, and frequently the automated recovery has already resolved the issue and the call is just a notification that it happened. Rob's MTTR is down 66%. His blood pressure is down too, though he won't admit that's related.

The next time your batch window completes successfully and nobody notices, remember: that's what good monitoring looks like. The goal isn't to be a hero. The goal is to make heroics unnecessary.


Next chapter: Chapter 28 explores batch scheduling optimization — advanced dependency management, resource leveling, and dynamic scheduling that adapts to actual conditions rather than static timetables.