Key Takeaways — Chapter 31: Operational Automation

Core Principles

Every manual step is a future failure. 60–70% of z/OS unplanned outages are caused by human error. Automating repetitive procedures eliminates the most common source of production incidents. The goal is not to replace operators but to let them focus on problems that require judgment.
Standardize before you automate. JCL standardization (PROCs) is a prerequisite for effective automation. Without standard procedures, every automation rule is job-specific and unmaintainable. With standard PROCs, automation rules can be generic and reusable.
REXX is the z/OS automation workhorse. OUTTRAP for capturing command output, LISTDSI for dataset interrogation, DSNREXX for DB2 access, and ISPF services for UI integration — these capabilities make REXX the primary scripting language for z/OS operational automation.
Automation products are the orchestration layer. SA z/OS manages subsystem lifecycles. OPS/MVS provides event-driven operational automation. NetView handles cross-system coordination. Most large shops use a combination.
Self-healing batch is achievable for known failure modes. Pre-flight validation, conditional routing, automated diagnosis and recovery, and post-recovery validation can reduce MTTR from 43 minutes to 6 minutes for well-understood failures.
Governance is not overhead — it is the foundation. Testing, change management, runaway prevention, and documentation prevent automation from becoming the biggest threat to availability. Every automation rule executes with system-level authority and must be treated with the same rigor as production code.

Technical Essentials

REXX for z/OS

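Use IKJEFT1B (not IKJEFT01) for batch REXX so that a non-zero return code or abend ends the step immediately and surfaces in the job's completion code; IKJEFT01 carries on with the next command in SYSTSIN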
OUTTRAP captures TSO command output into stem variables for programmatic parsing
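LISTDSI returns dataset attributes (organization, record format, space allocation) in REXX variables such as SYSDSORG and SYSLRECL, with no need to parse LISTCAT or LISTDS output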
DSNREXX provides SQL access from REXX for data-driven automation
Always use SIGNAL ON ERROR and SIGNAL ON HALT in production REXX
Keep execs under 1,000 lines; refactor larger execs into called subroutines
Use consistent return codes: 0=success, 4=warning, 8=error, 12=severe, 16=catastrophic
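
The return-code convention above can be sketched in a few lines. This is an illustration of the logic only, written in Python for brevity; a production version would be a REXX exec, and the names `RC`, `overall_rc`, and `should_continue` are hypothetical, not taken from the chapter.

```python
from enum import IntEnum


class RC(IntEnum):
    """Return-code convention: 0/4/8/12/16."""
    SUCCESS = 0
    WARNING = 4
    ERROR = 8
    SEVERE = 12
    CATASTROPHIC = 16


def overall_rc(step_rcs):
    """Return the worst (highest) return code seen, or SUCCESS if nothing ran.

    A value outside the convention (for example 3) raises ValueError,
    which makes non-conforming execs visible early.
    """
    return RC(max(step_rcs, default=RC.SUCCESS))


def should_continue(rc, limit=RC.WARNING):
    """Continue only while the return code is at or below the tolerated limit."""
    return rc <= limit
```

With a shared convention, a driver can decide whether to proceed from the number alone, for example `should_continue(overall_rc([0, 4, 0]))` is true, while any step ending with 8 or higher stops the chain.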

JCL Procedures

Parameterize environment-specific values; don't parameterize internal logic
Required parameters have no default (PROG=); optional parameters have defaults (DBSYS=DB2P)
Maximum 15 levels of PROC nesting; practical limit is 2–3 levels
Every production job uses a PROC — no inline JCL
PROCs are version-controlled and follow the same promotion path as application code
Naming conventions, documentation headers, and change tracking are mandatory
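
The required-versus-optional parameter rule can be modeled as follows. This is a minimal Python sketch of the idea, not JCL: `PROC_PARAMS`, `resolve_params`, and the program name `PAYCALC` are hypothetical. In real JCL a symbolic coded as `PROG=` resolves to a null value and typically surfaces later as a JCL error; the sketch assumes you would rather reject the job up front.

```python
REQUIRED = object()  # sentinel: the parameter has no default and must be supplied

# Hypothetical PROC signature: PROG= is required, DBSYS= defaults to DB2P.
PROC_PARAMS = {"PROG": REQUIRED, "DBSYS": "DB2P"}


def resolve_params(proc_params, overrides):
    """Merge caller overrides over PROC defaults.

    Rejects parameters the PROC does not define and required
    parameters the caller did not supply.
    """
    unknown = set(overrides) - set(proc_params)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {**proc_params, **overrides}
    missing = sorted(k for k, v in resolved.items() if v is REQUIRED)
    if missing:
        raise ValueError(f"required parameters not supplied: {missing}")
    return resolved
```

Calling `resolve_params(PROC_PARAMS, {"PROG": "PAYCALC"})` yields `PROG=PAYCALC` with `DBSYS` left at its default, while calling it with no overrides fails because `PROG` has no default.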

Automation Products

SA z/OS: Policy-based subsystem lifecycle management (start/stop sequences, health checks, restart policies, move groups)
OPS/MVS: Rule-based event engine (MSG, CMD, TOD, SMF, EOJ, SEC event types with OPS/REXX actions)
NetView: Cross-system automation via automation tables and command forwarding
All automated actions must be idempotent, bounded in scope, auditable, and individually disableable
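
The four properties in the last item (idempotent, bounded, auditable, individually disableable) can be made concrete with a small model. This is a Python sketch under stated assumptions, not SA z/OS, OPS/MVS, or NetView syntax: the `RestartRule` class, its decision strings, and the subsystem names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RestartRule:
    """Hypothetical rule: start a subsystem only if it is in scope and down."""
    name: str
    scope: frozenset                            # bounded: explicit set of subsystems
    enabled: bool = True                        # individually disableable
    audit: list = field(default_factory=list)   # auditable: every decision is recorded

    def fire(self, subsystem, state):
        """`state` maps subsystem name to 'UP' or 'DOWN'; returns the decision taken."""
        if not self.enabled:
            decision = "SKIPPED_DISABLED"
        elif subsystem not in self.scope:
            decision = "SKIPPED_OUT_OF_SCOPE"
        elif state.get(subsystem) == "UP":
            decision = "NOOP_ALREADY_UP"        # idempotent: firing twice changes nothing
        else:
            state[subsystem] = "UP"
            decision = "STARTED"
        self.audit.append((self.name, subsystem, decision))
        return decision
```

Firing the rule twice for a subsystem that is down returns `STARTED` and then `NOOP_ALREADY_UP`, and a subsystem outside `scope` is never touched. Decisions that result in no action are logged as well, so the audit trail also shows what the rule declined to do.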

Self-Healing Batch

Pre-flight checks validate all prerequisites before execution (datasets, DB2, space, predecessors, control tables)
Recovery tables map abend codes and contexts to specific recovery actions with retry limits
Post-recovery validation confirms that recovery produced correct output
Cascading failure detection (e.g., >5 failures in 10 minutes) triggers system-level escalation instead of individual recovery
S0C7 and S0C4 always escalate — data and program errors cannot be automatically recovered
Security-related failures (RACF authorization) always escalate — never bypass security controls automatically
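
A recovery table with retry limits, the always-escalate rules, and cascading-failure detection might be sketched as below. This is illustrative Python, not the chapter's implementation: the table entries, context labels, and action names are hypothetical, and the defaults of 5 failures in 600 seconds simply mirror the example figures above.

```python
ALWAYS_ESCALATE = {"S0C4", "S0C7"}   # program and data errors are never auto-recovered

# Hypothetical recovery table: (abend code, context) -> (action, retry limit).
RECOVERY_TABLE = {
    ("SB37", "OUTPUT"): ("REALLOCATE_AND_RERUN", 1),
    ("SD37", "OUTPUT"): ("REALLOCATE_AND_RERUN", 1),
}


def choose_action(abend, context, attempts, security_failure=False):
    """Pick a recovery action, or 'ESCALATE' when automation must not act."""
    if security_failure or abend in ALWAYS_ESCALATE:
        return "ESCALATE"
    entry = RECOVERY_TABLE.get((abend, context))
    if entry is None:
        return "ESCALATE"            # unknown failure mode: no automatic recovery
    action, retry_limit = entry
    if attempts >= retry_limit:
        return "ESCALATE"            # retry budget exhausted
    return action


def is_cascading(failure_times, now, window_s=600, threshold=5):
    """True when more than `threshold` failures fall in the last `window_s` seconds."""
    recent = [t for t in failure_times if now - window_s <= t <= now]
    return len(recent) > threshold
```

Defaulting to `ESCALATE` for anything not in the table keeps the automation limited to known failure modes. A caller would check `is_cascading` first and skip per-job recovery entirely when it returns true.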

Governance Requirements

Testing: Unit test, negative test, stress test, integration test, monitor-only deployment (1 week minimum), post-activation review (1 month)
Change management: Change request, peer review, test evidence, approval, implementation window, backout plan
Rate limiting: Suspend any rule firing more than the defined threshold (CNB standard: 10 times in 30 minutes)
Mutual exclusion: Conflicting rules cannot be simultaneously active
Authority limits: Automation runs with minimum necessary authority
Circuit breaker: Global kill switch to suspend all automation instantly, tested regularly
Audit logging: All actions logged to a protected, tamper-resistant dataset
Documentation: Trigger, action, scope, limits, escalation path, owner, last review date
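
Rate limiting and the global kill switch can be combined in one guard that every rule consults before acting. The following Python sketch assumes a sliding window and assumes that a suspended rule stays suspended until a person re-enables it; the `RuleGuard` class is hypothetical, and the defaults mirror the 10-in-30-minutes threshold quoted above.

```python
from collections import deque


class RuleGuard:
    """Suspends a rule that fires more than `max_fires` times within `window_s` seconds."""

    def __init__(self, max_fires=10, window_s=30 * 60):
        self.max_fires = max_fires
        self.window_s = window_s
        self.fires = deque()       # timestamps (seconds) of recent firings
        self.suspended = False

    def allow(self, now, kill_switch=False):
        """Return True if the rule may act at time `now`."""
        if kill_switch or self.suspended:
            return False
        # Drop firings that have aged out of the window.
        while self.fires and now - self.fires[0] >= self.window_s:
            self.fires.popleft()
        self.fires.append(now)
        if len(self.fires) > self.max_fires:
            self.suspended = True  # assumption: only a human clears this flag
            return False
        return True
```

With the defaults, ten firings inside 30 minutes are allowed and the eleventh suspends the rule. Passing `kill_switch=True` blocks the action without recording a firing, so a global shutdown does not distort the per-rule counters.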

The Automation Spectrum

Level	Name	Description
0	Manual	Human reads runbook, executes steps
1	Scripted	Human triggers script that executes steps
2	Triggered	System detects condition, notifies human with suggestion
3	Automated with approval	System prepares action, waits for human approval
4	Fully automated	System detects, acts, logs — no human involved
5	Self-healing	System detects, diagnoses root cause, remediates, prevents recurrence

Target: Level 4 for known operational procedures, Level 5 for well-understood failure modes.

Common Pitfalls

Automating without standardizing. If your JCL isn't standardized, your automation will be a collection of one-off scripts as fragile as the manual process it replaces.
Scoping too broadly. The "Disk Eater" pattern: a rule designed for one job applied to all jobs. Always define explicit scope.
No escalation path. Automation that can fail silently is worse than no automation — it gives false confidence.
No kill switch. If you can't shut down all automation in seconds, you can't recover from automation-caused incidents.
Treating automation as a project. Automation is a continuous program. New applications, system changes, and staff turnover require ongoing maintenance.
Ignoring WLM interactions. Automated restarts can cause WLM reclassification. Always verify service class assignment after automated recovery.