
Chapter 31: Operational Automation: REXX, JCL Procedures, Automation Products, and Self-Healing Batch Streams

"Every manual step is a future outage waiting for a Friday night." — Kwame Asante, CNB Infrastructure Director


31.1 The Case for Automation

Let me tell you about a Monday morning at CNB in 2019. Kwame Asante walked into the data center at 6:15 AM to find three operators hunched over consoles, manually restarting a batch stream that had failed at 2:47 AM. The nightly general ledger close had aborted because a GDG base had filled to its limit. An operator attempted a manual restart, fat-fingered a dataset name, and cascaded two downstream jobs into S0C7 abends. By the time the dust settled, the branch network opened 47 minutes late. The root cause wasn't the GDG limit — that was a known condition with a known fix. The root cause was that a human being, fatigued at 3 AM, had to remember a twelve-step recovery procedure they'd last executed nine months ago.

That was the last straw. Kwame pulled Lisa Cheng and Rob Mueller into a conference room and said, "We're automating everything that doesn't require judgment. If a runbook step says 'do X,' then a machine does X. If a runbook step says 'decide whether to do X or Y,' then a machine does X or Y based on criteria we define in advance. Humans handle exceptions — that's it."

Three years later, CNB's mean time to recovery for batch failures dropped from 43 minutes to 6 minutes. Operator interventions per night shift dropped from 22 to 3. The branch network hasn't opened late once.

This chapter teaches you how to build that kind of operational automation on z/OS. Not the toy examples you see in training manuals — the real thing: REXX scripts that do actual work, JCL procedures that enforce standards, automation products that respond to events in real time, and self-healing batch streams that fix themselves before anyone's pager goes off.

Why Automation Matters More Than Ever

The mainframe workforce is shrinking. The average z/OS systems programmer in North America is 57 years old. When Sandra Kowalski at Federal Benefits lost two senior operators to retirement in the same quarter, she didn't have the budget to replace them — and even if she had, the candidate pool was thin. Automation wasn't optional; it was survival.

But workforce pressure is only one driver. The others hit harder:

Human error dominance. IBM's own studies consistently show that 60–70% of unplanned outages on z/OS are caused by human error. Not hardware failures. Not software bugs. People making mistakes under pressure. Every manual step you eliminate is a failure mode you eliminate.

MTTR reduction. Mean time to recovery is the metric that matters most for availability. Automated recovery doesn't need to wake up, log in, read a runbook, interpret symptoms, or second-guess itself. It executes a predefined response in seconds. At Pinnacle Health, Ahmad Rashid measured MTTR for automated versus manual recovery of their HL7 batch interface: automated averaged 38 seconds, manual averaged 22 minutes. That's a 35x improvement.

Operational cost. SecureFirst's Yuki Tanaka calculated that their z/OS operations team spent 1,400 hours per year on tasks that could be fully automated — dataset housekeeping, GDG management, job restarts, report distribution, catalog cleanup. At loaded labor cost, that was $280,000 annually. The automation project paid for itself in seven months.

Compliance and auditability. When a machine executes a procedure, it logs every step identically every time. When a human executes a procedure, the logs reflect whatever the human remembered to document. Federal Benefits' auditors specifically asked Sandra, "Can you prove that your recovery procedures were followed as documented?" Automated execution with comprehensive logging was the only credible answer.

Consistency. Rob Mueller puts it bluntly: "I don't care how good your operators are. At 3 AM on the fourth consecutive night of an implementation weekend, they're going to make mistakes. Not because they're bad at their jobs — because they're human."

The Automation Spectrum

Not everything should be automated, and not everything should be automated to the same degree. Think of automation as a spectrum:

| Level | Description | Example |
|-------|-------------|---------|
| 0 — Manual | Human reads runbook, executes steps | Operator restarts failed job after reading procedure |
| 1 — Scripted | Human triggers script that executes steps | Operator runs REXX exec to restart with correct parameters |
| 2 — Triggered | System detects condition, notifies human, suggests action | Automation product detects abend, pages operator with recommended restart command |
| 3 — Automated with approval | System detects condition, prepares action, waits for human approval | Automation product detects abend, prepares restart, operator approves with single keystroke |
| 4 — Fully automated | System detects condition, executes action, logs result | Automation product detects abend, restarts job, verifies success, closes incident |
| 5 — Self-healing | System detects condition, diagnoses root cause, remediates, prevents recurrence | Automation product detects GDG full, extends base, restarts job, adjusts capacity plan |

Most shops operate between levels 1 and 3. The goal of this chapter is to get you to level 4, with elements of level 5 for well-understood failure modes.

Spaced Review: Connections to Earlier Chapters

Before we dive in, let's reconnect with three chapters that directly feed into automation:

Chapter 5 — WLM (Workload Manager). WLM governs how z/OS allocates resources. Your automation must be WLM-aware. If you write a REXX script that spawns batch work, that work runs in whatever service class WLM assigns. If your automation product issues commands that affect system resources, WLM will respond. At CNB, Lisa Cheng learned this the hard way when an automation rule that cancelled and restarted stuck jobs caused WLM to reclassify the restarted work into a lower service class, making it run even slower. The fix was to have the restart procedure issue a RESET jobname,SRVCLASS= command to put the restarted work back in the correct service class.

Chapter 23 — Batch Scheduling. Automation operates within the scheduling framework. Your self-healing batch streams are defined in the scheduler (CA-7, TWS, Control-M). Your automation products interact with the scheduler via APIs. Your REXX scripts may query or update the scheduler. If you haven't internalized the scheduling concepts from Chapter 23 — predecessor/successor dependencies, conditional execution, restart/checkpoint — go back and review. Everything in Section 31.5 assumes you have that foundation.

Chapter 27 — Monitoring. Monitoring provides the events that trigger automation. SMF records, console messages, WTO/WTOR messages, RMF data — these are the inputs to your automation rules. The monitoring infrastructure from Chapter 27 is the nervous system; the automation products in this chapter are the muscles that respond.


31.2 REXX for z/OS

REXX — Restructured Extended Executor — is the Swiss Army knife of z/OS automation. It's been on every z/OS system since MVS/ESA. It runs in TSO/ISPF, in batch, as a NetView CLIST replacement, inside automation products, and as a general-purpose scripting language. If you can only learn one scripting language for z/OS automation, REXX is the one.

I'm not going to teach you REXX syntax from scratch — you should already know the basics from Chapter 1. What I'm going to show you is how to write REXX that does real operational work.

TSO/ISPF REXX for Automation

Most z/OS REXX runs under TSO, either interactively through ISPF or in batch via IKJEFT01. TSO REXX has access to TSO commands, ISPF services, and host command environments that make it extremely powerful for automation.

The OUTTRAP function. This is the single most important REXX function for automation. OUTTRAP captures the output of TSO commands into stem variables, letting you parse and act on results programmatically.

/* REXX - Check dataset allocation status */
x = OUTTRAP('line.')
"LISTDS 'CNB.PROD.GLDATA' STATUS"
x = OUTTRAP('OFF')

do i = 1 to line.0
  if POS('--IN-USE', line.i) > 0 then do
    say 'WARNING: GL dataset is currently allocated'
    say 'Allocated to:' WORD(line.i, 1)
    call notify_operations 'GL dataset in use - hold downstream jobs'
  end
end

LISTDSI for dataset information. LISTDSI retrieves comprehensive dataset attributes directly into predefined REXX variables, without the overhead of trapping and parsing command output:

/* REXX - Check dataset space utilization */
lrc = LISTDSI("'CNB.PROD.TRANLOG'")

if lrc = 0 then do
  /* SYSUSED and SYSALLOC are in the unit reported by SYSUNITS */
  used_pct = (SYSUSED / SYSALLOC) * 100
  if used_pct > 85 then do
    say 'ALERT: TRANLOG at' FORMAT(used_pct,5,1)'% capacity'
    call extend_dataset 'CNB.PROD.TRANLOG'
  end
end
else do
  say 'LISTDSI failed - function code:' lrc 'reason code:' SYSREASON
  say 'SYSMSGLVL1:' SYSMSGLVL1
  say 'SYSMSGLVL2:' SYSMSGLVL2
end

ISPF services in REXX. When running under ISPF, you have access to ISPF services for table management, file tailoring, and panel display. This is how you build sophisticated automation tools with user interfaces:

/* REXX - Generate operations report using ISPF file tailoring */
ADDRESS ISPEXEC

"TBCREATE JOBTBL KEYS(JOBNAME) NAMES(STATUS RETCODE ELAPSED STEPNAME)"

/* Query scheduler for today's job status */
call get_job_status

"FTOPEN TEMP"
"FTINCL RPTHEAD"     /* Report header skeleton */
"TBTOP JOBTBL"

do forever
  "TBSKIP JOBTBL"
  if rc > 0 then leave
  "FTINCL RPTLINE"   /* Report line skeleton */
end

"FTCLOSE"
"BROWSE DATAID("dataid")"
"TBCLOSE JOBTBL"

Batch REXX

Running REXX in batch is essential for scheduled automation. You use the TSO terminal monitor program IKJEFT01, or one of its variants, IKJEFT1A and IKJEFT1B:

//AUTOEXEC EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  %CNBAUTO1
/*
//SYSPROC  DD DISP=SHR,DSN=CNB.PROD.REXX.EXEC

The % prefix tells TSO to search SYSPROC/SYSEXEC for the REXX exec. For production automation, prefer IKJEFT1B: IKJEFT01 continues past a failing command and can end the step with RC=0 even when the exec failed, while IKJEFT1B stops at the first failure and passes the return code back as the step return code, where the scheduler can see it.
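As a concrete sketch (the job, exec, and alert names are illustrative, not a CNB standard), here is a production-style batch REXX job using IKJEFT1B, with a follow-on step that fires only when the exec exits with a return code above 4:

```jcl
//CNBAUTO  JOB (ACCT),'NIGHTLY AUTO',CLASS=A,MSGCLASS=X
//*
//* IKJEFT1B PASSES THE EXEC'S EXIT VALUE BACK AS THE STEP RC
//AUTOEXEC EXEC PGM=IKJEFT1B,DYNAMNBR=50
//SYSEXEC  DD DISP=SHR,DSN=CNB.PROD.REXX.EXEC
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  %CNBAUTO1
/*
//* BYPASSED UNLESS AUTOEXEC ENDED WITH RC GREATER THAN 4
//ALERT    EXEC PGM=IKJEFT1B,COND=(4,GE,AUTOEXEC)
//SYSEXEC  DD DISP=SHR,DSN=CNB.PROD.REXX.EXEC
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  %CNBALERT
/*
```

The COND=(4,GE,AUTOEXEC) test bypasses the alert step whenever 4 is greater than or equal to the exec's return code — in other words, the alert runs only on RC 8 or worse.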

REXX with DB2

REXX can execute SQL through the DB2 DSNREXX interface. This is extraordinarily powerful for automated reporting and data-driven automation:

/* REXX - Check for stale batch cycles in DB2 control table */
ADDRESS TSO
"SUBCOM DSNREXX"
if rc then do
  s_rc = RXSUBCOM('ADD','DSNREXX','DSNREXX')
end

ADDRESS DSNREXX
"CONNECT DB2P"

SQLSTMT = "SELECT CYCLE_NAME, LAST_RUN_TS, STATUS" ,
          "FROM CNB.BATCH_CONTROL" ,
          "WHERE STATUS = 'ACTIVE'" ,
          "AND LAST_RUN_TS < CURRENT TIMESTAMP - 2 HOURS"

ADDRESS DSNREXX "EXECSQL PREPARE S1 FROM :SQLSTMT"
ADDRESS DSNREXX "EXECSQL DECLARE C1 CURSOR FOR S1"
ADDRESS DSNREXX "EXECSQL OPEN C1"

do forever
  ADDRESS DSNREXX "EXECSQL FETCH C1 INTO :CNAME, :LASTRUN, :STAT"
  if SQLCODE \= 0 then leave
  say 'STALE CYCLE:' CNAME 'Last run:' LASTRUN
  call escalate_stale_cycle CNAME
end

ADDRESS DSNREXX "EXECSQL CLOSE C1"
ADDRESS DSNREXX "DISCONNECT"

Common REXX Automation Patterns

After twenty-five years, certain patterns appear in every shop. Here are the ones you'll use most:

Pattern 1: The Monitor Loop. A REXX exec that runs continuously (or periodically via scheduler), checks conditions, and takes action:

/* REXX - Monitor spool utilization via the SDSF REXX API */
signal on halt
interval = 300   /* Check every 5 minutes */

call SYSCALLS 'ON'              /* z/OS UNIX environment, for sleep */
rc = ISFCALLS('ON')             /* SDSF host command environment */

do forever
  isfdelay = 5                  /* Wait up to 5 seconds for responses */
  ADDRESS SDSF "ISFEXEC '/$DSPL'"   /* Issue the JES2 spool display */

  pct = 0
  do i = 1 to isfulog.0         /* Command responses come back in ISFULOG. */
    if POS('PERCENT SPOOL UTILIZATION', isfulog.i) > 0 then
      parse var isfulog.i . '$HASP646' pct .
  end

  if pct > 80 then do
    call purge_aged_output 7    /* Purge output older than 7 days */
    if pct > 90 then
      call escalate 'SPOOL CRITICAL' pct'%'
  end

  ADDRESS SYSCALL 'sleep' interval
end

halt:
  say 'Monitor terminated by operator request'
  exit 0

Pattern 2: The Wrapper Script. A REXX exec that wraps a complex operational procedure, enforcing sequence, validation, and logging:

/* REXX - Month-end close wrapper */
parse arg cycle_date .

if cycle_date = '' then do
  say 'ERROR: Cycle date required (YYYYMMDD)'
  exit 8
end

if \DATATYPE(cycle_date, 'W') | LENGTH(cycle_date) \= 8 then do
  say 'ERROR: Invalid date format'
  exit 8
end

call log 'Month-end close initiated for' cycle_date

/* Pre-flight checks */
call check_prerequisites cycle_date
call verify_input_datasets cycle_date
call confirm_schedule_holds cycle_date

/* Execute close sequence */
call trigger_gl_extract cycle_date
call wait_for_completion 'CNBGLE*' 3600
call trigger_reconciliation cycle_date
call wait_for_completion 'CNBREC*' 1800
call generate_reports cycle_date

call log 'Month-end close completed RC=0'
exit 0

Pattern 3: The Dataset Manager. REXX for automated dataset lifecycle management:

/* REXX - GDG cleanup and management */
parse arg gdg_base max_age .

x = OUTTRAP('cat.')
"LISTCAT ENT('"gdg_base"') GDG ALL"
x = OUTTRAP('OFF')

today = DATE('B')   /* Base date (days since 1/1/0001) */
deleted = 0

do i = 1 to cat.0
  if POS('NONVSAM', cat.i) > 0 then do
    parse var cat.i . '---' dsname
    dsname = STRIP(dsname)
    x = LISTDSI("'"dsname"'")
    if SYSREASON = 0 then do
      /* Parse creation date and calculate age */
      create_date = SYSCREATE
      parse var create_date yyyy'/'ddd
      age = today - DATE('B', yyyy||RIGHT(ddd,3,'0'), 'J')
      if age > max_age then do
        "DELETE '"dsname"' PURGE"
        if rc = 0 then do
          deleted = deleted + 1
          call log 'Deleted' dsname '(age:' age 'days)'
        end
      end
    end
  end
end

say 'GDG cleanup complete:' deleted 'generations deleted'

REXX Best Practices for Production

These aren't suggestions — they're requirements for production REXX:

  1. Always use SIGNAL ON ERROR and SIGNAL ON HALT. Untrapped errors in production REXX are unacceptable.
  2. Log everything. Every action, every decision, every return code. Write to a log dataset, not just SYSTSPRT.
  3. Validate all inputs. Never assume arguments are correct. Validate data types, ranges, existence.
  4. Use meaningful return codes. 0 = success, 4 = warning, 8 = error, 12 = severe, 16 = catastrophic. Be consistent across all your automation execs.
  5. Avoid hardcoded values. Use a configuration dataset or ISPF table for environment-specific values (dataset names, thresholds, contact information).
  6. Test with TRACE. REXX's TRACE instruction is your best debugging tool. Use TRACE R for results, TRACE I for intermediates.
  7. Keep execs under 1000 lines. If it's longer, refactor into called subroutines stored as separate execs.
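Rules 1 through 4 combine naturally into a single skeleton. Here's a minimal sketch of what such a template might look like (the handler labels and the log routine are illustrative, not a CNB standard):

```rexx
/* REXX - Production exec skeleton: trapping, logging, return codes */
signal on error name badcmd       /* Rule 1: trap host command failures */
signal on halt  name shutdown     /* Rule 1: trap operator interrupts   */

parse arg parms                   /* Rule 3: validate before acting */
if parms = '' then do
  call log 'ERROR: required argument missing'
  exit 8                          /* Rule 4: 8 = error */
end

call log 'Started with:' parms
/* ... real work goes here ... */
call log 'Completed RC=0'
exit 0                            /* Rule 4: 0 = success */

badcmd:
  call log 'ERROR: host command failed, RC='rc 'at line' sigl
  exit 12                         /* Rule 4: 12 = severe */

shutdown:
  call log 'Halted by operator'
  exit 4                          /* Rule 4: 4 = warning */

log:                              /* Rule 2: timestamped log records */
  parse arg msg
  say DATE('S') TIME() msg        /* also EXECIO to a log dataset */
  return
```

The SIGL special variable in the error handler records the line number of the failing instruction, which is exactly what you want in the log when a 3 AM failure gets investigated at 9 AM.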

31.3 JCL Procedures (PROCs)

If REXX is the Swiss Army knife, JCL procedures are the assembly line. PROCs enforce standardization — every job that does the same kind of work uses the same procedure, with variations controlled through symbolic parameters. This is how you prevent the "snowflake job" problem where every programmer writes their own JCL and you end up with 200 different ways to run a COBOL-DB2 program.

Anatomy of a Production PROC

Here's a production-grade PROC from CNB for executing COBOL-DB2 batch programs:

//CNBDB2BT PROC PROG=,
//            PLAN=,
//            DBSYS=DB2P,
//            REGION=0M,
//            DSNLOAD='DSN.V13R1.SDSNLOAD',
//            RUNLIB='CNB.PROD.LOADLIB',
//            PARMDSN=,
//            PARMMEM=,
//            UTPRINT='SYSOUT=*',
//            COND='(4,LT)'
//*------------------------------------------------------------*
//* CNB STANDARD DB2 BATCH EXECUTION PROC V3.2                 *
//* LAST MODIFIED: 2024-11-15 BY L.CHENG                       *
//* CHANGE#: CHG0045872                                        *
//*------------------------------------------------------------*
//STEP01   EXEC PGM=IKJEFT01,
//            REGION=&REGION,
//            COND=&COND
//STEPLIB  DD DISP=SHR,DSN=&RUNLIB
//         DD DISP=SHR,DSN=&DSNLOAD
//SYSTSPRT DD &UTPRINT
//SYSPRINT DD &UTPRINT
//SYSUDUMP DD SYSOUT=*
//SYSTSIN  DD *,SYMBOLS=JCLONLY
  DSN SYSTEM(&DBSYS)
  RUN PROGRAM(&PROG) PLAN(&PLAN) -
      LIB('&RUNLIB')
  END
/*
//SYSOUT   DD SYSOUT=*
//PARMFILE DD DISP=SHR,DSN=&PARMDSN(&PARMMEM)

(Two details matter here: JOBLIB is not valid inside a PROC, so the load libraries go on a STEPLIB after the EXEC statement; and SYMBOLS=JCLONLY on SYSTSIN is what makes the &DBSYS, &PROG, and &PLAN symbols resolve inside the in-stream data.)

Key design decisions in this PROC:

  • All environment-specific values are symbolic parameters. The DB2 subsystem (DBSYS), load library (RUNLIB), and DSNLOAD library are parameterized. One PROC serves development, test, QA, and production.
  • Sensible defaults. DBSYS=DB2P and REGION=0M are production defaults. Override them for lower environments.
  • Change tracking in comments. Every PROC modification references a change ticket.
  • COND parameter is symbolic. Callers can override the condition code threshold.

Symbolic Parameters: The Art of Parameterization

The difference between a good PROC and a great one is in the parameterization. Parameterize too little and the PROC is inflexible. Parameterize too much and it becomes incomprehensible. Here's the rule of thumb:

Parameterize anything that varies by environment, by execution context, or by caller preference. Don't parameterize internal logic, step sequencing, or program names that are intrinsic to the procedure's purpose.

Symbolic parameter conventions at CNB:

| Convention | Meaning | Example |
|------------|---------|---------|
| No default | Required — caller must supply | PROG= |
| Default provided | Optional — override when needed | DBSYS=DB2P |
| Single-quoted default | String literal default | DSNLOAD='DSN.V13R1.SDSNLOAD' |
| Keyword style | Self-documenting | UTPRINT='SYSOUT=*' |
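To see the conventions in action, here's a sketch of a job invoking the CNBDB2BT PROC, supplying the required symbols and overriding one default (the job, program, and plan names are illustrative):

```jcl
//CNBARX01 JOB (ACCT),'AR EXTRACT',CLASS=A,MSGCLASS=X
//         JCLLIB ORDER=(CNB.PROD.PROCLIB,CNB.SYSTEM.PROCLIB)
//*
//EXTRACT  EXEC CNBDB2BT,
//            PROG=CNBAR200,              REQUIRED - NO DEFAULT
//            PLAN=CNBARPLN,              REQUIRED - NO DEFAULT
//            DBSYS=DB2T,                 OVERRIDES DB2P DEFAULT
//            PARMDSN='CNB.TEST.PARMLIB',
//            PARMMEM=AR200P01
```

Omitting a required symbol like PROG= fails fast with a JCL error at conversion time, which is exactly the behavior you want — a missing parameter should never reach execution.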

Nested PROCs

z/OS supports PROCs calling other PROCs, up to 15 levels deep. In practice, you should never go beyond 3 levels. CNB uses a two-level pattern:

  • Level 1: Execution PROCs. Like CNBDB2BT above. These know how to run a specific type of program.
  • Level 2: Application PROCs. These call execution PROCs with application-specific parameters:
//CNBGLEXT PROC ENV=PROD,
//            DB2ID=DB2P,
//            CYCLE=,
//            RUNDATE=&LYYMMDD
//*------------------------------------------------------------*
//* CNB GENERAL LEDGER EXTRACT - APPLICATION PROC              *
//*------------------------------------------------------------*
//EXTRACT  EXEC CNBDB2BT,
//            PROG=CNBGL100,
//            PLAN=CNBGLPLN,
//            DBSYS=&DB2ID,
//            PARMDSN='CNB.&ENV..PARMLIB',
//            PARMMEM=GL100P01
//GLEXTRACT DD DISP=(NEW,CATLG,DELETE),
//            DSN=CNB.&ENV..GLEXT.D&RUNDATE(+1),
//            SPACE=(CYL,(500,100),RLSE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=0)
//GLREJECT  DD DISP=(NEW,CATLG,DELETE),
//            DSN=CNB.&ENV..GLREJ.D&RUNDATE,
//            SPACE=(CYL,(10,5),RLSE),
//            DCB=(RECFM=FB,LRECL=400,BLKSIZE=0)

(Note the separate DB2ID symbolic: DB2 subsystem names are limited to four characters, so you cannot build one from ENV=PROD the way you can build dataset qualifiers.)

This pattern gives you reuse at both levels. Twenty different applications can call CNBDB2BT. The GL extract PROC can be called from multiple batch streams.

PROC Libraries and Management

PROCs live in procedure libraries referenced by the JCLLIB statement or the system PROCLIB concatenation. CNB's PROC library strategy:

SYS1.PROCLIB          — IBM-supplied and system PROCs
CNB.SYSTEM.PROCLIB    — Site-wide system PROCs
CNB.PROD.PROCLIB      — Production application PROCs
CNB.TEST.PROCLIB      — Test environment PROCs
CNB.DEV.PROCLIB       — Development PROCs

Jobs reference the appropriate library:

//CNBGL01  JOB (ACCT),'GL EXTRACT',CLASS=A,MSGCLASS=X
//         JCLLIB ORDER=(CNB.PROD.PROCLIB,
//                       CNB.SYSTEM.PROCLIB)

PROC Standardization Rules

Lisa Cheng enforces these rules at CNB. They've prevented more problems than any other single practice:

  1. Every production job uses a PROC. No inline JCL for production batch. Period. If you need to run something once, write a PROC anyway.
  2. PROCs are version-controlled. Every PROC is in the source management system with change history.
  3. Naming conventions are mandatory. CNB prefix, three-character application code, two-character function code. CNBGLEXT = CNB, General Ledger, Extract.
  4. Documentation headers are mandatory. Purpose, parameters, change history, dependencies.
  5. Override DD statements only. Callers can add DD statements and override existing ones but cannot change EXEC statements.
  6. Test before promote. PROCs go through the same promotion path as application code: DEV -> TEST -> QA -> PROD.

31.4 Automation Products

REXX and JCL PROCs are the building blocks. Automation products are the orchestration layer — they monitor the system, detect events, execute responses, and manage complex workflows that span multiple systems and time horizons.

IBM System Automation for z/OS (SA z/OS)

SA z/OS is IBM's flagship automation product. It manages the lifecycle of z/OS subsystems (CICS, DB2, IMS, MQ), automates operator commands, and provides policy-based automation.

Core concepts:

  • Automation Policy. Defines what SA z/OS monitors and how it responds. Policies are built in the Customization Dialog (a set of ISPF panels), stored in the policy database (PDB), and built into the automation control file that SA z/OS loads.
  • Automation Operators. Virtual operators (Extended MCS consoles) that SA z/OS uses to issue commands. Each automation operator has a defined scope and authority.
  • Application Groups. Logical groupings of related subsystems with defined start/stop sequences and dependencies.

Example: Automating CICS region management. At CNB, SA z/OS manages 47 CICS regions. The automation policy defines:

Application: CICSP01 (Production AOR #1)
  Start command:    S CICSP01
  Stop command:     F CICSP01,CEMT P SHUT
  Monitor message:  DFHSI1517 (CICS ready)
  Health check:     CEMT I TASK every 60 seconds
  Restart policy:   Restart up to 3 times within 60 minutes
  Escalation:       After 3 restarts, alert Level 2 support
  Dependencies:     Requires DB2P active, MQP1 active
  Move group:       Can relocate to LPAR2 if LPAR1 fails

This policy means that if CICSP01 abends at 2 AM, SA z/OS will:

  1. Detect the failure via the absence of the CICS address space
  2. Verify that dependencies (DB2, MQ) are still active
  3. Issue the start command
  4. Wait for the DFHSI1517 ready message
  5. If it doesn't restart successfully, try again (up to 3 times)
  6. If all restarts fail, page the on-call support team with full diagnostic information

No human involvement for the first three restart attempts. No human error. No delay.

CA OPS/MVS (Broadcom)

OPS/MVS takes a different approach — rule-based automation driven by an event engine. It monitors system messages, SMF records, and other events, then executes rules written in REXX-like syntax (OPS/REXX).

The OPS/MVS rule structure:

)MSG $HASP395
)PROC
  /* Triggered when a job ends - $HASP395 message */
  if POS('CNBGL', MSG.TEXT) > 0 then do
    jobname = WORD(MSG.TEXT, 1)
    retcode = get_return_code(jobname)
    if retcode > 4 then do
      call escalate jobname retcode
      call hold_successors jobname
    end
    else do
      call release_successors jobname
    end
  end
)END

OPS/MVS event types:

| Event Type | Trigger | Use Case |
|------------|---------|----------|
| MSG | Console message | Job failures, system alerts |
| CMD | Operator command | Command interception/validation |
| TOD | Time of day | Scheduled automation tasks |
| SMF | SMF record written | Performance threshold breaches |
| EOJ | End of job | Post-job processing, notifications |
| SEC | Security violation | RACF event response |
| OMG | OMEGAMON alert | Performance-driven automation |

Why CNB chose OPS/MVS. Kwame's team evaluated both SA z/OS and OPS/MVS. They chose OPS/MVS for day-to-day operational automation because of its rule-based flexibility: "SA z/OS is excellent for subsystem lifecycle management, and we use it for that. But for the hundred little operational automations — the 'when this happens, do that' rules — OPS/MVS is faster to implement and easier to maintain. Our operators can read and understand OPS/MVS rules. SA z/OS policies require a specialist."

Many large shops use both. SA z/OS for subsystem management, OPS/MVS for event-driven operational automation.

Comparing Automation Product Capabilities

The choice between automation products is not binary, and understanding the trade-offs helps architects make informed decisions. Here is a practical comparison based on CNB's evaluation and Federal Benefits' experience:

| Capability | SA z/OS | OPS/MVS | NetView |
|------------|---------|---------|---------|
| Subsystem lifecycle | Excellent — purpose-built | Good — via rules | Fair — via CLISTs |
| Event-driven rules | Limited | Excellent — core strength | Good — automation table |
| Cross-LPAR coordination | Good — via XCF | Good — via shared variables | Excellent — native |
| Rule development speed | Slow — policy DB | Fast — REXX-like syntax | Medium — CLIST/REXX |
| Learning curve | Steep — Customization Dialog | Moderate — familiar syntax | Steep — network heritage |
| Debugging/testing | Difficult — policy simulation | Good — TRACE, test mode | Moderate — session manager |
| Vendor lock-in | High — IBM proprietary | High — Broadcom proprietary | High — IBM proprietary |
| Typical cost | Separate IBM license | Separate license | Separate license |

Sandra Kowalski at Federal Benefits ran a six-month proof-of-concept with all three products before standardizing. Her conclusion: "SA z/OS is non-negotiable for subsystem management — trying to replicate its CICS/DB2/MQ lifecycle management in any other tool is a waste of time. For everything else, we chose OPS/MVS because our operators could write and understand the rules. NetView is powerful, but its heritage as a network management tool shows — the z/OS automation feels bolted on."

One critical factor not captured in feature matrices is community knowledge. OPS/MVS has a larger installed base for z/OS operational automation, which means more online forums, more sample rules, and more consultants with hands-on experience. When a junior operator needs to write a new rule at 11 PM, the availability of examples and documentation matters more than any feature comparison.

Tivoli NetView

NetView is IBM's network and system management platform. While it's primarily known for network automation, its automation table and command lists are widely used for z/OS operational automation.

NetView automation table entries:

IF MSGID = 'IEF404I' &         /* Job ended message */
   TEXT = . 'CNBGL' . &        /* GL batch job */
   JOBNAME ¬= 'CNBGLMNT' THEN /* Not maintenance job */
  EXEC(CMD('CNBAUTO GLRECOV' JOBNAME) ROUTE(ONE AUTO1))

NetView's strength is cross-system automation. It can automate actions across multiple LPARs, sysplex members, and even distributed systems through its GenericAlert and command forwarding capabilities. At Federal Benefits, Sandra's team uses NetView to coordinate automation across their three-LPAR sysplex — when a batch stream fails on one LPAR, NetView can redirect successor jobs to another LPAR.

Automated Actions: Design Principles

Regardless of which product you use, automated actions should follow these principles:

1. Idempotency. Running the same automation action twice should produce the same result as running it once. If your restart automation starts a job, and the job is already running, the automation should detect that and skip the start — not submit a duplicate.
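A minimal sketch of such a guard in TSO/E REXX, using OUTTRAP around the TSO STATUS command (the helper routines are illustrative, and the STATUS message texts vary slightly by release):

```rexx
/* REXX - Idempotency guard: skip the restart if the job is active */
parse arg jobname .

x = OUTTRAP('stat.')
"STATUS" jobname             /* TSO STATUS reports batch job state */
x = OUTTRAP('OFF')

active = 0
do i = 1 to stat.0
  /* Look for the telling keywords in the status messages */
  if POS('EXECUTING', stat.i) > 0 | POS('WAITING', stat.i) > 0 then
    active = 1
end

if active then
  call log jobname 'already active - restart skipped'
else
  call restart_job jobname
```

The guard makes a double-triggered restart rule harmless: the second invocation sees the job executing and logs a skip instead of submitting a duplicate.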

2. Bounded scope. Every automated action should have a clearly defined scope. An automation rule that restarts failed jobs should specify exactly which jobs, under what conditions, with what limits. "Restart any failed job" is not an automation rule — it's a disaster waiting to happen.

3. Escalation paths. Every automated action must have an escalation path for when automation fails. If the automated restart doesn't work after N attempts, what happens? If the answer is "nothing," you've built a system that can fail silently.

4. Audit trail. Every automated action must be logged with timestamp, triggering event, action taken, and result. This isn't just for compliance — it's for debugging. When something goes wrong with automation, the first question is always "what did the automation do, and why?"

5. Kill switch. Every automation rule must be individually disableable without affecting other rules. When an automation rule misbehaves, you need to shut it down immediately without bringing down all automation.


31.5 Self-Healing Batch Streams

Self-healing batch is the holy grail of mainframe operational automation. A self-healing batch stream detects its own failures, diagnoses the cause, applies the appropriate fix, and continues processing — all without human intervention.

This isn't science fiction. CNB has been running self-healing batch for their general ledger processing since 2021. Here's how it works.

The Self-Healing Architecture

A self-healing batch stream has four components:

1. Pre-flight checks. Before a batch job executes, a pre-flight step validates that all prerequisites are met:

//*------------------------------------------------------------*
//* PRE-FLIGHT VALIDATION                                      *
//*------------------------------------------------------------*
//PREFLT   EXEC PGM=IKJEFT1B
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  %CNBPREFLT CNBGL100 PROD 20250115
/*

The pre-flight REXX exec checks:

  • All input datasets exist and are available (not in use by another job)
  • DB2 subsystem is active and accepting connections
  • Sufficient DASD space for output datasets
  • Predecessor jobs completed successfully (query scheduler)
  • Control table entries are in the expected state
  • System resources (CPU, memory, spool) are within acceptable ranges

If any check fails, the pre-flight step sets a return code that skips the main processing and triggers the appropriate remediation.
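The grading convention can be sketched like this — each check routine (illustrative names) returns 1 for pass and 0 for fail, fixable problems set RC=8, and fatal ones set RC=16:

```rexx
/* REXX - Pre-flight grading: fold check results into one return code */
parse arg pgm env cycle .
worst = 0                 /* 0 = proceed */

if \check_input_datasets(env, cycle) then worst = 16    /* fatal   */
if \check_db2_available(env)         then worst = 16    /* fatal   */
if \check_predecessors(pgm, cycle)   then worst = 16    /* fatal   */
if \check_output_space(env) then worst = MAX(worst, 8)  /* fixable */

call log pgm 'pre-flight complete, RC='worst
exit worst                /* 8 routes to a fix step, 16 to escalation */
```

The MAX() guard matters: a fixable space problem must never downgrade a fatal finding that an earlier check already recorded.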

2. Conditional routing. Modern JCL and scheduling tools support conditional execution that routes processing based on conditions:

//PREFLT   EXEC CNBPRFLT,ENV=PROD,CYCLE=20250115
//*
//* Normal path - run only when pre-flight passed (RC=0)
//EXTRACT  EXEC CNBGLEXT,ENV=PROD,CYCLE=20250115,
//            COND=(0,NE,PREFLT)
//*
//* Recovery path - run only on a space issue (RC=8 from pre-flight)
//SPACFIX  EXEC CNBSPCFX,ENV=PROD,
//            COND=(8,NE,PREFLT)
//* Retry after space fix - the extra tests matter: when SPACFIX is
//* bypassed, a COND test against it is ignored, so this step must
//* also bypass itself on the PREFLT codes that skip SPACFIX
//EXTR2    EXEC CNBGLEXT,ENV=PROD,CYCLE=20250115,
//            COND=((0,EQ,PREFLT),(16,EQ,PREFLT),(0,NE,SPACFIX))
//*
//* Escalation path - unfixable issue (RC=16 from pre-flight)
//ESCALAT  EXEC CNBESCAL,ENV=PROD,CYCLE=20250115,
//            COND=(16,NE,PREFLT)

In practice, sophisticated conditional routing is usually handled by the scheduler rather than JCL COND parameters, because schedulers provide richer condition logic. Expressed as scheduler pseudocode in the style of TWS (Tivoli Workload Scheduler), the same routing looks like this:

IF CNBGL100.PREFLT RC = 0 THEN
  RELEASE CNBGL100.EXTRACT
ELSE IF CNBGL100.PREFLT RC = 8 THEN
  RELEASE CNBGL100.SPACFIX
  AFTER CNBGL100.SPACFIX
    RELEASE CNBGL100.EXTRACT
ELSE
  RELEASE CNBGL100.ESCALATE
  HOLD CNBGL100.EXTRACT

3. Automated restart with diagnosis. When a batch job abends, the automation product captures the abend code, consults a knowledge base of known failure modes, and applies the appropriate recovery:

/* REXX - Batch failure recovery engine */
/* Called by OPS/MVS when batch job abends */
parse arg jobname abend_code step_name

/* Look up known recovery actions */
call load_recovery_table
recovery_action = find_recovery(jobname, abend_code, step_name)

select
  when recovery_action = 'RESTART_STEP' then do
    call log jobname 'Restarting from step' step_name
    call restart_job jobname step_name
  end
  when recovery_action = 'EXTEND_SPACE' then do
    call log jobname 'Extending output dataset space'
    call extend_output_datasets jobname step_name
    call restart_job jobname step_name
  end
  when recovery_action = 'WAIT_RESOURCE' then do
    call log jobname 'Resource contention - waiting 5 minutes'
    call schedule_delayed_restart jobname step_name 300
  end
  when recovery_action = 'REROUTE' then do
    call log jobname 'Rerouting to alternate system'
    call reroute_to_alternate jobname
  end
  otherwise do
    call log jobname 'Unknown failure - escalating to operations'
    call escalate jobname abend_code step_name
  end
end

The recovery table is the brain of self-healing. It maps abend codes and contexts to recovery actions:

| Abend Code | Context           | Recovery Action                            | Max Retries |
|------------|-------------------|--------------------------------------------|-------------|
| S0C7       | Any               | ESCALATE (data error, needs human)         | 0           |
| S0C4       | Any               | ESCALATE (program error, needs human)      | 0           |
| S806       | Any               | CHECK_LOADLIB, then ESCALATE               | 1           |
| SB37       | Output dataset    | EXTEND_SPACE, then RESTART_STEP            | 2           |
| SD37       | Output dataset    | EXTEND_SPACE, then RESTART_STEP            | 2           |
| SE37       | Output dataset    | EXTEND_SPACE, then RESTART_STEP            | 2           |
| S822       | Any               | WAIT_RESOURCE (region size), then ESCALATE | 1           |
| U0100      | CNB GL programs   | RESTART_STEP (transient DB2 timeout)       | 3           |
| U0200      | CNB GL programs   | CHECK_DB2, then RESTART_STEP               | 2           |
| JCL ERROR  | Dataset not found | CHECK_PREDECESSORS                         | 1           |
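The lookup that drives this table is simple enough to sketch. The Python below mirrors the rows above; the matching rules and retry bookkeeping are illustrative assumptions, not CNB's actual implementation. The key behaviors: a mapped failure gets its action until retries are exhausted, and anything unmapped escalates.

```python
# Sketch of a recovery-table lookup. Entries and helper names are
# illustrative; a real engine would load these from a dataset or DB2.
RECOVERY_TABLE = [
    # (abend_code, context, action, max_retries); '*' = any context
    ("S0C7",  "*",      "ESCALATE",     0),
    ("SB37",  "OUTPUT", "EXTEND_SPACE", 2),
    ("SD37",  "OUTPUT", "EXTEND_SPACE", 2),
    ("SE37",  "OUTPUT", "EXTEND_SPACE", 2),
    ("U0100", "CNBGL",  "RESTART_STEP", 3),
]

retry_counts: dict = {}  # (jobname, abend) -> attempts so far

def find_recovery(jobname: str, abend_code: str, context: str) -> str:
    """Return the recovery action, or ESCALATE when retries are
    exhausted or no mapping exists."""
    for code, ctx, action, max_retries in RECOVERY_TABLE:
        if code != abend_code:
            continue
        # Context matches on wildcard, job-name prefix, or exact context
        if ctx != "*" and not jobname.startswith(ctx) and ctx != context:
            continue
        key = (jobname, abend_code)
        attempts = retry_counts.get(key, 0)
        if attempts >= max_retries:
            return "ESCALATE"          # retries exhausted -> human
        retry_counts[key] = attempts + 1
        return action
    return "ESCALATE"                  # unknown failure -> human
```

Note that escalation is the default path twice over: for unknown abends and for known abends whose retry budget is spent.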

4. Post-recovery validation. After automated recovery, a validation step confirms that the recovery was successful and the output is correct:

/* REXX - Post-recovery validation */
parse arg jobname cycle_date

/* Check job completed successfully */
job_rc = get_job_retcode(jobname)
if job_rc > 4 then do
  call log 'Recovery FAILED for' jobname '- RC='job_rc
  call escalate jobname 'RECOVERY_FAILED' job_rc
  exit 12
end

/* Validate output */
call validate_output_datasets jobname cycle_date
call validate_record_counts jobname cycle_date
call validate_control_totals jobname cycle_date

call log 'Recovery SUCCESSFUL for' jobname
call update_recovery_metrics jobname 'SUCCESS'
call release_successors jobname
exit 0
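The control-total check deserves a concrete illustration. The sketch below assumes a hypothetical fixed-format convention — detail records prefixed "D" carrying a 15-digit amount, a single trailer prefixed "T" carrying the expected record count and total — and validates the output against its own trailer:

```python
# Sketch of record-count / control-total validation. The record layout
# ("D" + 9-char key + 15-digit amount; "T" + 9-digit count + 15-digit
# total) is an illustrative assumption, not an actual CNB format.
def validate_control_totals(records: list) -> bool:
    """Compare detail count and summed amounts against the trailer."""
    details = [r for r in records if r.startswith("D")]
    trailers = [r for r in records if r.startswith("T")]
    if len(trailers) != 1:
        return False                     # missing or duplicate trailer
    expected_count = int(trailers[0][1:10])
    expected_total = int(trailers[0][10:25])
    actual_total = sum(int(r[10:25]) for r in details)
    return len(details) == expected_count and actual_total == expected_total
```

A failed comparison here is what turns "job ended RC=0" into "recovery actually worked" — the two are not the same thing.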

Real-World Self-Healing at CNB

Here's the actual sequence that executes at CNB when their GL extract job (CNBGL100) fails with an SB37 (out of space) abend at 2:47 AM:

  1. T+0 seconds: Job CNBGL100 abends with SB37 in step EXTRACT.
  2. T+2 seconds: OPS/MVS MSG rule triggers on the IEF450I abend message. Rule invokes the recovery engine REXX exec.
  3. T+3 seconds: Recovery engine looks up SB37 for CNBGL100.EXTRACT in the recovery table. Action: EXTEND_SPACE, then RESTART_STEP.
  4. T+5 seconds: EXTEND_SPACE routine identifies the full dataset (CNB.PROD.GLEXT.G0047V00), deletes the partially written generation, and registers a JCL allocation override so the restarted step allocates it with 500 additional cylinders of secondary space.
  5. T+8 seconds: Recovery engine issues restart command to the scheduler for CNBGL100 from step EXTRACT.
  6. T+12 seconds: Scheduler initiates restart. Job begins re-executing from the EXTRACT step with the extended dataset.
  7. T+15 minutes: Job CNBGL100 completes successfully with RC=0.
  8. T+15 minutes, 3 seconds: Post-recovery validation runs. Record counts match control totals. Output dataset is complete.
  9. T+15 minutes, 5 seconds: Successor jobs are released. Recovery is logged. Incident ticket is auto-generated with full details for morning review.

Total human involvement: zero. The on-call operator's pager never went off. The morning shift reviews the incident log and sees a clean automated recovery. Compare this with the 2019 incident that opened the chapter — 3+ hours of manual recovery with cascading errors.

Graduated Self-Healing: The Maturity Model

Not every shop can implement full self-healing overnight. CNB's automation maturity evolved over three years through four levels, and understanding this progression is useful for shops beginning their automation journey.

Level 1 — Detection and alerting (2020). Automation detects failures and pages the on-call operator with diagnostic information. No automated recovery, but the operator gets precise context instead of a raw console message. Implementation: OPS/MVS MSG rules that parse abend codes and format alert messages with job name, step name, abend code, and the first 10 lines of SYSOUT. Time to implement: 2 weeks. Benefit: reduced mean-time-to-diagnose from 15 minutes to 3 minutes.

Level 2 — Automated recovery for known failures (2021). Automation recovers from a defined set of failure modes (space abends, transient timeouts, GDG limit exceeded). Failures not in the recovery table escalate to a human. Implementation: recovery table with 15 abend code/context mappings, REXX recovery engine, post-recovery validation. Time to implement: 6 weeks. Benefit: 60% of overnight failures recovered automatically.

Level 3 — Predictive prevention (2022). Automation detects conditions that will cause failures before they happen and takes preventive action. Pre-flight checks run before every batch job. Trend analysis identifies DASD volumes approaching capacity, GDG bases approaching their limit, and DB2 tablespaces approaching REORG thresholds. Implementation: pre-flight REXX framework, daily trend analysis REXX scripts, predictive alerting rules. Time to implement: 3 months. Benefit: preventable failures dropped 80%.

Level 4 — Closed-loop learning (2023–present). Automation tracks every failure, every recovery action, and every outcome. Monthly analysis identifies new failure patterns that should be added to the recovery table and existing patterns whose recovery actions need tuning. Rob Mueller reviews the automation metrics dashboard every Monday morning: recovery success rate (target: > 95%), mean-time-to-recover (target: < 60 seconds for known failures), false positive rate (target: < 5%), and new-pattern rate (target: declining quarter over quarter).

/* REXX - Monthly automation metrics extraction */
parse arg report_month  /* YYYYMM */

/* Extract recovery log for the month */
call extract_recovery_log report_month

/* Calculate key metrics */
total_events = count_events(report_month)
auto_recovered = count_auto_recovered(report_month)
escalated = count_escalated(report_month)
false_positives = count_false_positives(report_month)

if total_events = 0 then do
  say 'No recovery events for' report_month
  exit 0
end
success_rate = (auto_recovered / total_events) * 100
say 'Recovery success rate:' FORMAT(success_rate,,1)'%'
say 'Events:' total_events 'Auto:' auto_recovered,
    'Escalated:' escalated 'False positive:' false_positives

/* Identify new failure patterns not in recovery table */
call identify_new_patterns report_month
/* Output: list of abend code/context combinations seen */
/*         more than twice that have no recovery mapping */

This maturity model matters because shops that attempt Level 3 or Level 4 without solid Level 1 and Level 2 foundations build fragile automation. Get detection right first. Then get recovery right. Then — and only then — add prediction and learning.

Limits of Self-Healing

Self-healing is not a silver bullet. It works for known failure modes with known remediation. It does not work for:

  • Data corruption. If a program produces wrong results without abending, no automation will catch it (unless you have comprehensive output validation, which most shops don't for every job).
  • Novel failures. If the failure mode isn't in the recovery table, the system escalates to a human. This is correct behavior — attempting to recover from an unknown failure is worse than doing nothing.
  • Cascading failures. If the root cause is a systemic issue (DASD subsystem failure, network outage, DB2 down), automated recovery of individual jobs will just produce more failures. The automation must detect cascading patterns and escalate rather than attempting individual recoveries. CNB's recovery engine tracks failure rates per minute; if more than 5 jobs fail within a 10-minute window, it stops individual recovery and escalates to a system-level alert.
  • Security incidents. Automated recovery must not override security controls. If a job fails due to a RACF authorization error, the correct response is always escalation, never automated bypass.
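CNB's cascade detection reduces to a sliding-window failure counter: count recent failures, and stop recovering individual jobs once the window overflows. An illustrative sketch of that logic (class and method names are hypothetical):

```python
import time
from collections import deque

# Sketch of a cascade detector: if more than max_failures job failures
# land within window_secs, individual recovery stops and the engine
# raises a system-level alert instead. Thresholds follow the text.
class CascadeDetector:
    def __init__(self, max_failures: int = 5, window_secs: int = 600):
        self.max_failures = max_failures
        self.window_secs = window_secs
        self.failures = deque()            # timestamps of recent failures

    def record_failure(self, now=None) -> bool:
        """Record one failure; return True if individual recovery may
        proceed, False if a cascade is suspected and we must escalate."""
        now = time.time() if now is None else now
        self.failures.append(now)
        # Drop failures that have aged out of the window
        while self.failures and now - self.failures[0] > self.window_secs:
            self.failures.popleft()
        return len(self.failures) <= self.max_failures
```

The design choice worth noting: the detector fails closed. Once the window overflows, every subsequent failure escalates until the window drains, which is exactly the behavior you want during a DASD or DB2 outage.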

31.6 Automating Operational Procedures

Beyond batch recovery, there are dozens of operational procedures that every z/OS shop performs regularly and that should be automated. Here are the most common, with implementation approaches.

Dataset Cleanup and Lifecycle Management

Every z/OS shop generates datasets that need periodic cleanup — temporary datasets from failed jobs, aged GDG generations, obsolete test datasets, orphaned VSAM clusters. Manual cleanup is tedious, error-prone, and never comprehensive.

/* REXX - Comprehensive dataset cleanup */
parse arg hlq max_age_days exclude_list

call log 'Dataset cleanup started for' hlq

x = OUTTRAP('cat.')
"LISTCAT ENT('"hlq"') NONVSAM ALL"
x = OUTTRAP('OFF')

cleaned = 0
skipped = 0
errors = 0

do i = 1 to cat.0
  if WORD(cat.i, 1) \= 'NONVSAM' then iterate
  dsname = WORD(cat.i, WORDS(cat.i))  /* dsname is the last token */

  /* Check exclusion list */
  if is_excluded(dsname, exclude_list) then do
    skipped = skipped + 1
    iterate
  end

  /* Check if dataset is in use */
  x = LISTDSI("'"dsname"'" "NORECALL")
  if SYSREASON = 9 then do  /* Dataset migrated */
    /* Don't recall just to check age - skip migrated datasets */
    skipped = skipped + 1
    iterate
  end

  if SYSREASON \= 0 then do
    call log 'Cannot access' dsname '- reason:' SYSREASON
    errors = errors + 1
    iterate
  end

  /* Check age */
  age = calculate_age(SYSCREATE)
  if age > max_age_days then do
    "DELETE '"dsname"' PURGE"
    if rc = 0 then do
      cleaned = cleaned + 1
      call log 'Deleted:' dsname '(age:' age 'days)'
    end
    else do
      call log 'DELETE failed for' dsname 'RC='rc
      errors = errors + 1
    end
  end
end

call log 'Cleanup complete: deleted='cleaned 'skipped='skipped,
         'errors='errors
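The is_excluded helper referenced above can be sketched simply. This illustration maps dataset masks onto ordinary glob patterns — a deliberate simplification, since real dataset masks distinguish "*" (one qualifier) from "**" (any number of qualifiers), while a plain glob "*" matches any suffix including periods:

```python
from fnmatch import fnmatchcase

# Sketch of an exclusion-list check. Treating every '*' as "any suffix"
# is an illustrative simplification of true dataset-mask semantics.
def is_excluded(dsname: str, exclude_list: str) -> bool:
    """True if dsname matches any space-separated pattern in the list."""
    return any(fnmatchcase(dsname, pat) for pat in exclude_list.split())
```

In practice you want the exclusion check to err on the side of matching too much: a skipped deletion costs a few tracks, a wrong deletion costs a recovery.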

At Federal Benefits, Sandra's team runs automated dataset cleanup nightly. The automation cleaned up 14,000 orphaned datasets in its first run — 2.3 TB of DASD that nobody knew was being wasted.

Catalog Maintenance

ICF catalog health is critical for z/OS operations. Catalog problems cause job failures, and catalog recovery is painful. Automated catalog maintenance prevents problems:

/* REXX - Catalog health check */
catalogs = 'CATALOG.PROD.UCAT01 CATALOG.PROD.UCAT02',
           'CATALOG.PROD.UCAT03'

do i = 1 to WORDS(catalogs)
  catname = WORD(catalogs, i)

  /* Check catalog space */
  x = OUTTRAP('idcams.')
  "IDCAMS LISTCAT CAT('"catname"') ALL"
  x = OUTTRAP('OFF')

  /* Parse HURBA and HARBA for space check */
  call check_catalog_space catname

  /* Run DIAGNOSE for structural integrity */
  call run_catalog_diagnose catname

  /* Check for orphaned entries */
  call check_orphaned_entries catname
end
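The space check parses HI-USED-RBA (HURBA) and HI-ALLOC-RBA (HARBA) out of the LISTCAT output; percent used is simply their ratio. A minimal sketch of that arithmetic, with an illustrative 80% alert threshold:

```python
# Sketch of the catalog space check behind check_catalog_space.
# The 80% threshold is an illustrative convention, not a product default.
def catalog_pct_used(hurba: int, harba: int) -> float:
    """Percent of allocated catalog space in use (HURBA / HARBA)."""
    if harba == 0:
        return 0.0
    return (hurba / harba) * 100.0

def needs_attention(hurba: int, harba: int, threshold: float = 80.0) -> bool:
    return catalog_pct_used(hurba, harba) >= threshold
```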

Performance Reporting

Automated performance reporting transforms RMF and SMF data into actionable reports without human intervention. At Pinnacle Health, Ahmad's team automated their daily performance report:

/* REXX - Automated performance report generation */
/* Runs daily at 06:00 via scheduler */

report_date = DATE('S', DATE('B') - 1, 'B')  /* prior day, YYYYMMDD */

/* Extract SMF data for prior day */
call extract_smf_data report_date

/* Generate CPU utilization summary */
call generate_cpu_report report_date

/* Generate DASD I/O statistics */
call generate_dasd_report report_date

/* Generate batch elapsed time trends */
call generate_batch_report report_date

/* Generate exception report (SLA breaches) */
exceptions = generate_exception_report(report_date)

/* Assemble and distribute */
call assemble_report report_date
call distribute_report report_date

/* If exceptions found, escalate */
if exceptions > 0 then
  call escalate_performance exceptions report_date

SecureFirst's Automated Security Audit

Yuki Tanaka built an automated nightly security audit that checks for common security configuration drift:

/* REXX - Security configuration audit */
findings = 0  /* incremented by audit routines (shared, no PROCEDURE) */

/* Check for datasets with universal access > NONE */
call audit_uacc 'SYS1.**'
call audit_uacc 'CNB.PROD.**'

/* Check for users with excessive privileges */
call audit_special_users
call audit_operations_users

/* Check for password policy compliance */
call audit_password_settings

/* Check the APF list against the approved baseline */
call audit_apf_list

/* Check for PPT entries */
call audit_ppt

/* Check for started task assignments */
call audit_started_tasks

/* Generate findings report */
if findings > 0 then do
  call generate_audit_report
  call notify_security_team findings
end

This catches configuration drift within 24 hours instead of waiting for the next quarterly audit.


31.7 Automation Governance

Automation without governance is a loaded gun. Every automation rule you create is a piece of code that executes with system-level authority in response to system events. Bad automation doesn't just fail — it actively damages your environment, often faster than a human could.

The Runaway Problem

Rob Mueller tells the story of the "Disk Eater" incident at a previous employer. Someone wrote an automation rule that detected when a dataset ran out of space (SB37), deleted the dataset, and restarted the job. The logic was: "If the dataset is full, the data must be bad, so delete it and start fresh." The rule worked perfectly for the specific job it was written for — a temporary work file.

Then another job failed with SB37 on a production master file. The automation deleted the production master file and restarted the job, which abended because its input no longer existed. That abend triggered another automation rule that tried to recover by re-running the predecessor job, which also failed because its output dataset (the one that was just deleted) was supposed to exist.

The cascading automation ran for seven minutes before someone noticed. In those seven minutes, it deleted four production datasets and submitted twenty-three job restarts. Recovery took sixteen hours.

The root cause was a scoping failure. The automation rule applied to all jobs, not just the job it was designed for. This is why governance matters.

Automation Testing Requirements

At CNB, every automation rule goes through this testing process:

1. Unit test in isolation. The rule is tested in a sandboxed environment with simulated events. It must produce the expected action for the triggering event and NO action for non-triggering events.

2. Negative testing. The rule is tested with events that are similar but not identical to the trigger condition. It must NOT fire for similar-but-different events.

3. Stress testing. The rule is tested with rapid-fire triggering events. It must handle the volume without queueing up duplicate actions or consuming excessive resources.

4. Integration test. The rule is activated in a test environment (not production) alongside other active rules. It must not conflict with or interfere with existing rules.

5. Staged rollout. The rule is activated in production in "monitor only" mode — it detects events and logs what it would do, but doesn't execute. After a minimum of one week of clean monitoring, it's promoted to active execution.

6. Post-activation review. After one month of active execution, the rule's activity log is reviewed. False positives, unexpected triggers, and near-misses are analyzed. The rule is tuned or deactivated based on findings.
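The "monitor only" mode in step 5 is worth making concrete: the rule logic is identical in both modes, and only the action dispatch differs. A hedged sketch (function and mode names are hypothetical):

```python
# Sketch of staged rollout: the same rule evaluates in both modes, but
# in MONITOR mode the intended action is logged, never executed.
def run_rule(rule, event, mode="MONITOR", log=None, execute=None):
    """Evaluate a rule against an event; act only in ACTIVE mode."""
    log = log or (lambda msg: None)
    action = rule(event)
    if action is None:
        return None                      # rule did not trigger
    if mode == "MONITOR":
        log(f"WOULD {action} for {event['job']}")   # dry run: log only
        return None
    log(f"EXECUTING {action} for {event['job']}")
    if execute:
        execute(action, event)
    return action
```

Because the evaluation path is shared, a week of clean MONITOR logs is genuine evidence about the rule that will run in ACTIVE mode — not evidence about a lookalike.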

Change Management for Automation

Automation rules are code. They follow the same change management process as application code:

  • Change request with business justification
  • Peer review of the rule logic
  • Test evidence from the testing process above
  • Approval from the automation team lead and the affected application team
  • Implementation window (no automation changes during month-end, quarter-end, or year-end processing)
  • Backout plan (how to deactivate the rule if it misbehaves)

Runaway Prevention

Every automation product should be configured with safety mechanisms:

Rate limiting. No automation rule should execute more than N times per hour. If it does, something is wrong — either the rule is misfiring or there's a systemic issue that the rule can't fix. CNB's standard is: any rule that fires more than 10 times in 30 minutes is automatically suspended and the automation team is paged.
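That rate-limit standard reduces to a per-rule sliding window plus a suspension flag. An illustrative sketch using the thresholds from the text (names are hypothetical):

```python
from collections import defaultdict, deque

# Sketch of per-rule rate limiting: a rule that fires more than
# max_firings times within window_secs is suspended until a human
# reinstates it. Thresholds follow CNB's stated standard.
class RuleGovernor:
    def __init__(self, max_firings=10, window_secs=1800):
        self.max_firings = max_firings
        self.window_secs = window_secs
        self.firings = defaultdict(deque)   # rule -> firing timestamps
        self.suspended = set()

    def permit(self, rule_name, now):
        """Record a firing; False means the rule is (now) suspended."""
        if rule_name in self.suspended:
            return False
        q = self.firings[rule_name]
        q.append(now)
        while q and now - q[0] > self.window_secs:
            q.popleft()
        if len(q) > self.max_firings:
            self.suspended.add(rule_name)   # page the automation team here
            return False
        return True
```

Suspension is sticky by design: a rule that trips the limit stays off until a person reviews it, which is the "something is wrong" judgment the text reserves for humans.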

Mutual exclusion. Automation rules that could conflict must be mutually exclusive. If Rule A restarts job X and Rule B cancels job X, they cannot both be active simultaneously.

Authority limits. Automation rules should run with the minimum authority necessary. A rule that restarts batch jobs doesn't need the authority to cancel system address spaces. OPS/MVS and SA z/OS both support authority profiles for automation.

Circuit breakers. If the automation system itself is malfunctioning, there must be a way to shut it all down instantly. CNB has an "AUTOMATION OFF" command that suspends all automation rules in under five seconds. It's tested quarterly.

Audit logging. Every automation action is logged to a protected dataset that the automation system itself cannot modify. This ensures that even if automation goes rogue, there's an untampered record of what happened.

Documentation Standards

Every automation rule must have documentation that answers:

  1. What does it do? Plain English description of the trigger condition and action.
  2. Why does it exist? Business justification and the incident or requirement that created it.
  3. What is its scope? Exactly which systems, jobs, or resources does it affect?
  4. What are its limits? Maximum execution rate, retry counts, timeout values.
  5. What is its escalation path? What happens when the automation can't resolve the issue?
  6. Who owns it? Contact information for the team responsible.
  7. When was it last reviewed? All automation rules are reviewed annually at minimum.

31.8 Progressive Project: HA Banking System — Operational Automation Framework

Your HA Banking Transaction Processing System needs operational automation. In this checkpoint, you'll design and implement the automation framework.

Requirements

The HA banking system from previous chapters processes 2.4 million transactions daily across two LPARs. Your automation must handle:

  1. Batch stream self-healing. The nightly settlement batch (12 jobs, 45-minute window) must recover from common failures without human intervention.
  2. REXX automation scripts. Dataset management, job monitoring, and reporting for the banking environment.
  3. JCL PROCs. Standardized execution procedures for all batch programs.
  4. Automation product rules. Event-driven automation for the top 10 operational scenarios.
  5. Governance framework. Testing, change management, and runaway prevention for all automation.

Deliverables

Deliverable 1: Pre-flight check REXX exec. Write a REXX exec (HABPREFLT) that validates prerequisites before the nightly settlement batch begins. It must check: DB2 subsystem active, input datasets available, sufficient DASD space, predecessor jobs completed, control table in correct state.

Deliverable 2: JCL PROC for COBOL-DB2 execution. Write a production-grade PROC (HABDB2BT) following the standards from Section 31.3. Include symbolic parameters for environment, DB2 subsystem, load library, and plan name.

Deliverable 3: Recovery table. Define a recovery table (in a format of your choice — CSV, DB2 table DDL, or REXX stem variables) mapping abend codes to recovery actions for the settlement batch jobs.

Deliverable 4: Self-healing batch stream design. Document the end-to-end flow for the nightly settlement batch with pre-flight checks, conditional routing, automated recovery, and post-recovery validation. Include a diagram (text-based is fine).

Deliverable 5: Automation governance document. Define the testing, change management, and monitoring standards for your automation framework.

Success Criteria

  • Pre-flight REXX handles at least 8 distinct checks with appropriate return codes
  • PROC supports at minimum 10 symbolic parameters with sensible defaults
  • Recovery table covers at least 12 abend code/context combinations
  • Self-healing design includes escalation for unknown failures
  • Governance document addresses all items from Section 31.7

Chapter Summary

Operational automation on z/OS isn't about replacing operators — it's about letting operators focus on problems that require judgment instead of wasting their expertise on repetitive procedures. REXX gives you the scripting power to automate individual tasks. JCL PROCs enforce standardization so that automation has a consistent interface. Automation products provide the event-driven orchestration layer that ties everything together. And self-healing batch streams combine all three to create systems that recover from known failures without human intervention.

The key insight — the one that separates shops that "have automation" from shops where automation actually works — is governance. Testing, change management, runaway prevention, and documentation are not overhead. They're what prevents your automation from becoming the biggest threat to your availability. Every automation rule is a piece of code that executes with system-level authority. Treat it with the same rigor you'd apply to any production code.

Kwame's directive from the opening of this chapter — "Humans handle exceptions" — is the right target. But getting there takes discipline, not just tools. Build your automation incrementally. Start with the highest-impact, lowest-risk automations. Prove governance works on simple rules before tackling complex self-healing scenarios. And never, ever deploy automation without a kill switch.


What's Next

Chapter 32 takes automation to the next level with disaster recovery and business continuity — how to keep the entire system running when hardware, software, or an entire data center fails. The automation patterns you learned here are the foundation for DR automation, where the stakes are even higher and the margin for error is even thinner.


Spaced Review Answers

From Chapter 5 (WLM): WLM interacts with automation in two critical ways. First, automated job restarts may cause WLM to reclassify work into different service classes based on the new submission context — always verify that restarted work retains its intended WLM classification. Second, WLM health monitoring data (via RMF) can serve as input to automation rules: if a service class is missing its velocity goal, automation can take corrective action (e.g., cancelling low-priority work, adjusting dispatching priorities via WLM policy switches).

From Chapter 23 (Batch Scheduling): Self-healing batch automation must integrate with the scheduler, not bypass it. Automated restarts should be issued through the scheduler's API (not raw JES commands) to maintain dependency tracking, resource serialization, and audit trails. The scheduler's conditional logic (IF/THEN/ELSE in TWS, condition codes in CA-7) is the primary mechanism for implementing conditional routing in self-healing batch streams.

From Chapter 27 (Monitoring): Monitoring provides the events that trigger automation. SMF records (type 30 for job accounting, type 70-79 for RMF data, type 80 for RACF events) are the raw data. Console messages (WTO/WTOR) are the real-time triggers. The monitoring infrastructure must be highly available — if monitoring fails, automation is blind. This is why CNB runs redundant monitoring: OPS/MVS message rules as primary, NetView automation table as secondary.