Key Takeaways — Chapter 31: Operational Automation
Core Principles
- Every manual step is a future failure. 60–70% of z/OS unplanned outages are caused by human error. Automating repetitive procedures eliminates the most common source of production incidents. The goal is not to replace operators but to let them focus on problems that require judgment.
- Standardize before you automate. JCL standardization (PROCs) is a prerequisite for effective automation. Without standard procedures, every automation rule is job-specific and unmaintainable. With standard PROCs, automation rules can be generic and reusable.
- REXX is the z/OS automation workhorse. OUTTRAP for capturing command output, LISTDSI for dataset interrogation, DSNREXX for DB2 access, and ISPF services for UI integration: these capabilities make REXX the primary scripting language for z/OS operational automation.
- Automation products are the orchestration layer. SA z/OS manages subsystem lifecycles. OPS/MVS provides event-driven operational automation. NetView handles cross-system coordination. Most large shops use a combination.
- Self-healing batch is achievable for known failure modes. Pre-flight validation, conditional routing, automated diagnosis and recovery, and post-recovery validation can reduce MTTR from 43 minutes to 6 minutes for well-understood failures.
- Governance is not overhead; it is the foundation. Testing, change management, runaway prevention, and documentation prevent automation from becoming the biggest threat to availability. Every automation rule executes with system-level authority and must be treated with the same rigor as production code.
Technical Essentials
REXX for z/OS
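- Use IKJEFT1B (not IKJEFT01) for batch REXX so that a failing command or exec ends the step and passes its return code back as the step condition code, instead of TSO moving on to the next command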
- OUTTRAP captures TSO command output into stem variables for programmatic parsing
- LISTDSI retrieves dataset attributes without catalog overhead
- DSNREXX provides SQL access from REXX for data-driven automation
- Always use SIGNAL ON ERROR and SIGNAL ON HALT in production REXX
- Keep execs under 1,000 lines; refactor larger execs into called subroutines
- Use consistent return codes: 0=success, 4=warning, 8=error, 12=severe, 16=catastrophic
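The return-code convention above lends itself to a "worst code wins" roll-up across steps. The following is a minimal sketch, written in Python for readability rather than REXX. The names `Severity`, `worst_return_code`, and `classify` are illustrative assumptions and not part of any z/OS or REXX API.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Return codes following the 0/4/8/12/16 convention listed above."""
    SUCCESS = 0
    WARNING = 4
    ERROR = 8
    SEVERE = 12
    CATASTROPHIC = 16


def worst_return_code(step_codes):
    """Return the highest (most severe) code; an empty list counts as success."""
    return max(step_codes, default=Severity.SUCCESS)


def classify(rc):
    """Map any numeric return code to the first convention level at or above it."""
    for level in Severity:
        if rc <= level:
            return level
    # Anything above 16 is treated as catastrophic.
    return Severity.CATASTROPHIC
```

Rounding a non-standard code up to the next level (for example, 5 becomes an error rather than a warning) is one possible design choice; it errs on the side of escalation.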
JCL Procedures
- Parameterize environment-specific values; don't parameterize internal logic
- Required parameters have no default (PROG=); optional parameters have defaults (DBSYS=DB2P)
- Maximum 15 levels of PROC nesting; practical limit is 2–3 levels
- Every production job uses a PROC — no inline JCL
- PROCs are version-controlled and follow the same promotion path as application code
- Naming conventions, documentation headers, and change tracking are mandatory
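The distinction between required parameters (no default) and optional parameters (with a default) can be modeled as a pre-submission check. This is only a sketch in Python under stated assumptions: `PROC_DEFAULTS` and `resolve_symbolics` are hypothetical names, and the check imitates the intent of the convention rather than how the JCL converter itself substitutes symbolics.

```python
# Hypothetical model of a PROC's symbolic parameters: None means "required, no default".
PROC_DEFAULTS = {
    "PROG": None,      # required: the caller must supply it (coded as PROG= in the PROC)
    "DBSYS": "DB2P",   # optional: falls back to the default DB2 subsystem
}


def resolve_symbolics(overrides, defaults=PROC_DEFAULTS):
    """Merge caller overrides with PROC defaults, rejecting unknown or missing values."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {**defaults, **overrides}
    # A required parameter that is still empty after the merge was never supplied.
    missing = sorted(name for name, value in resolved.items() if not value)
    if missing:
        raise ValueError(f"required parameters not supplied: {missing}")
    return resolved
```

For example, `resolve_symbolics({"PROG": "PAYROLL1"})` keeps the `DB2P` default, while an empty override set is rejected because `PROG` has no default. The program name `PAYROLL1` is made up for illustration.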
Automation Products
- SA z/OS: Policy-based subsystem lifecycle management (start/stop sequences, health checks, restart policies, move groups)
- OPS/MVS: Rule-based event engine (MSG, CMD, TOD, SMF, EOJ, SEC event types with OPS/REXX actions)
- NetView: Cross-system automation via automation tables and command forwarding
- All automated actions must be idempotent, bounded in scope, auditable, and individually disableable
Self-Healing Batch
- Pre-flight checks validate all prerequisites before execution (datasets, DB2, space, predecessors, control tables)
- Recovery tables map abend codes and contexts to specific recovery actions with retry limits
- Post-recovery validation confirms that recovery produced correct output
- Cascading failure detection (e.g., >5 failures in 10 minutes) triggers system-level escalation instead of individual recovery
- S0C7 and S0C4 always escalate — data and program errors cannot be automatically recovered
- Security-related failures (RACF authorization) always escalate — never bypass security controls automatically
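The recovery-table, retry-limit, always-escalate, and cascading-failure rules above can be combined into a single decision function. The sketch below is in Python for readability; a production version would more likely live in REXX or an automation product's rule language. The class `RecoveryPlanner`, the action names, the retry limits, and the choice of SB37 and S322 as recoverable abends are all assumptions made for illustration. Only the S0C7/S0C4 escalation rule, the security escalation rule, and the "more than 5 failures in 10 minutes" threshold come from the list above.

```python
from collections import deque

# Hypothetical recovery table: abend code -> (recovery action, retry limit).
RECOVERY_TABLE = {
    "SB37": ("reallocate_larger_and_rerun", 2),   # out of space on an output dataset
    "S322": ("rerun_with_extended_time", 1),      # CPU time limit exceeded
}
ALWAYS_ESCALATE = {"S0C7", "S0C4"}  # data and program errors: never auto-recover

CASCADE_LIMIT = 5          # more than this many failures...
CASCADE_WINDOW_SEC = 600   # ...within 10 minutes triggers system-level escalation


class RecoveryPlanner:
    def __init__(self):
        self.attempts = {}               # (job, abend) -> recoveries already attempted
        self.recent_failures = deque()   # timestamps of recent failures, oldest first

    def decide(self, job, abend, now, security_failure=False):
        """Return the action for one failure observed at time `now` (in seconds)."""
        # Cascade check first: many failures close together suggest a system problem.
        self.recent_failures.append(now)
        while self.recent_failures[0] <= now - CASCADE_WINDOW_SEC:
            self.recent_failures.popleft()
        if len(self.recent_failures) > CASCADE_LIMIT:
            return "escalate_system"
        # Security failures, S0C7/S0C4, and unknown abends always go to a human.
        if security_failure or abend in ALWAYS_ESCALATE or abend not in RECOVERY_TABLE:
            return "escalate"
        action, retry_limit = RECOVERY_TABLE[abend]
        used = self.attempts.get((job, abend), 0)
        if used >= retry_limit:
            return "escalate"            # retry budget exhausted
        self.attempts[(job, abend)] = used + 1
        return action
```

Checking for a cascade before consulting the recovery table is a deliberate ordering in this sketch: once failures cluster, individual recovery is skipped even for abends that would normally be recoverable. Post-recovery validation is not modeled here.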
Governance Requirements
- Testing: Unit test, negative test, stress test, integration test, monitor-only deployment (1 week minimum), post-activation review (1 month)
- Change management: Change request, peer review, test evidence, approval, implementation window, backout plan
- Rate limiting: Suspend any rule firing more than the defined threshold (CNB standard: 10 times in 30 minutes)
- Mutual exclusion: Conflicting rules cannot be simultaneously active
- Authority limits: Automation runs with minimum necessary authority
- Circuit breaker: Global kill switch to suspend all automation instantly, tested regularly
- Audit logging: All actions logged to a protected, tamper-resistant dataset
- Documentation: Trigger, action, scope, limits, escalation path, owner, last review date
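Rate limiting, the global kill switch, and audit logging can be pictured as one gate that every rule consults before acting. The following Python sketch assumes a hypothetical `AutomationGovernor` class with an in-memory list standing in for the protected audit dataset. The default of 10 firings in 30 minutes mirrors the CNB standard quoted above; everything else is illustrative.

```python
from collections import defaultdict, deque


class AutomationGovernor:
    """Gate that every rule consults before acting (hypothetical design)."""

    def __init__(self, max_firings=10, window_sec=30 * 60):
        self.max_firings = max_firings
        self.window_sec = window_sec
        self.firings = defaultdict(deque)  # rule name -> timestamps of recent firings
        self.suspended = set()             # rules suspended by the rate limiter
        self.kill_switch = False           # global circuit breaker
        self.audit_log = []                # stand-in for a protected audit dataset

    def may_fire(self, rule, now):
        """Return True if `rule` may act at time `now` (seconds); log every decision."""
        allowed = self._check(rule, now)
        self.audit_log.append((now, rule, "fired" if allowed else "blocked"))
        return allowed

    def _check(self, rule, now):
        if self.kill_switch or rule in self.suspended:
            return False
        history = self.firings[rule]
        # Drop firings that have aged out of the sliding window.
        while history and history[0] <= now - self.window_sec:
            history.popleft()
        if len(history) >= self.max_firings:
            self.suspended.add(rule)   # stays suspended until a human re-enables it
            return False
        history.append(now)
        return True
```

In this sketch a suspended rule does not resume on its own when the window expires; requiring a human to re-enable it is an assumption that seems consistent with the escalation emphasis above, not a requirement stated in the source.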
The Automation Spectrum
| Level | Name | Description |
|---|---|---|
| 0 | Manual | Human reads runbook, executes steps |
| 1 | Scripted | Human triggers script that executes steps |
| 2 | Triggered | System detects condition, notifies human with suggestion |
| 3 | Automated with approval | System prepares action, waits for human approval |
| 4 | Fully automated | System detects, acts, logs — no human involved |
| 5 | Self-healing | System detects, diagnoses root cause, remediates, prevents recurrence |
Target: Level 4 for known operational procedures, Level 5 for well-understood failure modes.
Common Pitfalls
- Automating without standardizing. If your JCL isn't standardized, your automation will be a collection of one-off scripts as fragile as the manual process it replaces.
- Scoping too broadly. The "Disk Eater" pattern: a rule designed for one job applied to all jobs. Always define explicit scope.
- No escalation path. Automation that can fail silently is worse than no automation — it gives false confidence.
- No kill switch. If you can't shut down all automation in seconds, you can't recover from automation-caused incidents.
- Treating automation as a project. Automation is a continuous program. New applications, system changes, and staff turnover require ongoing maintenance.
- Ignoring WLM interactions. Automated restarts can cause WLM reclassification. Always verify service class assignment after automated recovery.