Exercises — Chapter 31: Operational Automation
Section 31.2: REXX for z/OS
Exercise 31.1 — OUTTRAP Basics
Write a REXX exec that uses OUTTRAP to capture the output of the LISTDS command for a given dataset name (passed as an argument). Parse the output to determine whether the dataset is a PDS, PDSE, or sequential dataset, and display the result. Include error handling for datasets that don't exist.
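As a starting point, the OUTTRAP capture loop might look like this (the DSORG hint in the comments reflects typical LISTDS output and should be confirmed on your system):

```rexx
/* REXX - starter sketch for 31.1. LISTDS prints DSORG on the line  */
/* following its --RECFM-LRECL-BLKSIZE-DSORG header; PO covers both */
/* PDS and PDSE, so distinguishing them takes more work (hint:      */
/* LISTDSI's SYSDSSMS variable).                                    */
parse upper arg dsn .
call outtrap 'out.'                  /* begin capturing TSO command output */
address tso "LISTDS '"dsn"'"
listrc = rc
call outtrap 'off'                   /* stop capturing */
if listrc <> 0 then do               /* covers "dataset not found" */
   say 'LISTDS failed for' dsn', RC='listrc
   exit 8
end
do i = 1 to out.0                    /* out.0 holds the captured line count */
   say 'captured:' out.i
end
```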
Exercise 31.2 — LISTDSI Space Monitor
Write a REXX exec that accepts a high-level qualifier as input and uses LISTDSI to check space utilization for all datasets under that HLQ. The exec should produce a report listing any dataset above 80% utilization, sorted by utilization percentage (highest first). Include the dataset name, allocated space, used space, and percentage used.
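The per-dataset check is the core of the exec; a minimal sketch, assuming a placeholder dataset name (enumerating all datasets under the HLQ is typically done by trapping LISTCAT output):

```rexx
/* REXX - sketch for 31.2: LISTDSI space figures for one dataset.   */
/* SYSALLOC and SYSUSED are reported in the unit named by SYSUNITS. */
dsn = "'HLQ.SAMPLE.DATA'"            /* hypothetical dataset name */
lrc = listdsi(dsn)
if lrc = 0 & sysalloc > 0 then do
   pct = format(sysused / sysalloc * 100, , 1)
   say sysdsname':' sysused 'of' sysalloc sysunits 'used ('pct'%)'
end
else say 'LISTDSI RC='lrc', reason code' sysreason
```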
Exercise 31.3 — DB2 Health Check REXX
Write a REXX exec that connects to DB2 via DSNREXX and queries the SYSIBM.SYSTABLESPACE catalog table to find tablespaces in COPY-pending or CHECK-pending status. For each tablespace found, the exec should log the database name, tablespace name, and pending status. If no pending statuses are found, the exec should report "All tablespaces healthy."
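The DSNREXX boilerplate is usually the hard part, so a hedged sketch follows. The subsystem id DB2P is an assumption, and note that on many releases pending states are surfaced by -DISPLAY DATABASE rather than the SYSTABLESPACE STATUS column, so the WHERE clause here is purely illustrative:

```rexx
/* REXX - hedged sketch for 31.3: DSNREXX connect/cursor pattern.   */
if rxsubcom('QUERY','DSNREXX','DSNREXX') <> 0 then
   call rxsubcom 'ADD','DSNREXX','DSNREXX'   /* register host command env */
address dsnrexx
'CONNECT DB2P'                                /* assumed subsystem id */
sqlstmt = "SELECT DBNAME, NAME, STATUS" ,
          "FROM SYSIBM.SYSTABLESPACE WHERE STATUS <> 'A'"  /* illustrative */
'EXECSQL DECLARE C1 CURSOR FOR S1'
'EXECSQL PREPARE S1 FROM :sqlstmt'
'EXECSQL OPEN C1'
do forever
   'EXECSQL FETCH C1 INTO :db, :ts, :st'
   if sqlcode <> 0 then leave                 /* +100 = end of rows */
   say left(db,8) left(ts,8) 'status='st
end
'EXECSQL CLOSE C1'
'DISCONNECT'
```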
Exercise 31.4 — REXX Log Rotation
Write a REXX exec that implements log file rotation. Given a base dataset name pattern (e.g., CNB.PROD.AUTOLOG), the exec should:
1. Rename the current log dataset by appending a date suffix
2. Allocate a new empty log dataset with the original name
3. Delete log datasets older than 30 days
4. Handle all error conditions with appropriate return codes
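Steps 1 and 2 reduce to a rename plus a fresh allocation; a sketch with assumed names and attributes (step 3's 30-day sweep would trap LISTCAT output and compare dates):

```rexx
/* REXX - sketch of steps 1-2 for 31.4. The yymmdd suffix keeps the */
/* new qualifier within the 8-character limit.                      */
base = 'CNB.PROD.AUTOLOG'               /* base name from the exercise */
newname = base'.D'substr(date('S'),3)   /* e.g. ...AUTOLOG.D250115 */
address tso
"RENAME '"base"' '"newname"'"           /* step 1: archive current log */
if rc <> 0 then exit 8                  /* step 4: surface the failure */
"ALLOC DA('"base"') NEW CATALOG SPACE(5,5) TRACKS" ,
      "RECFM(F B) LRECL(133) DSORG(PS)" /* step 2: assumed attributes */
allocrc = rc
"FREE DA('"base"')"
exit allocrc
```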
Exercise 31.5 — REXX Configuration Reader
Write a REXX exec that reads a configuration dataset containing key-value pairs (one per record, format KEY=VALUE). The exec should parse all configuration entries into REXX stem variables, validate that all required keys are present (from a predefined list), and report any missing or invalid entries. Demonstrate how this configuration reader would be called by other automation execs.
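The parse-into-a-stem core might look like this; the member name and required-key list are assumptions:

```rexx
/* REXX - starter sketch for 31.5: KEY=VALUE records into cfg. stem. */
cfgdsn = "'CNB.PROD.PARMLIB(AUTOCFG)'"    /* hypothetical config member */
address tso
"ALLOC F(CFGIN) DA("cfgdsn") SHR REUSE"
"EXECIO * DISKR CFGIN (STEM rec. FINIS"
"FREE F(CFGIN)"
cfg. = ''                                  /* unset keys read back as '' */
do i = 1 to rec.0
   parse var rec.i key '=' value
   key = translate(strip(key))             /* uppercase the stem tail */
   if key <> '' & left(key,1) <> '*' then cfg.key = strip(value)
end
required = 'LOGDSN RETPD NOTIFY'           /* assumed required keys */
missing = ''
do k = 1 to words(required)
   key = word(required, k)
   if cfg.key = '' then missing = missing key
end
if missing <> '' then say 'Missing required keys:' missing
```

Other automation execs would typically invoke this as an external routine and receive results through a returned string or a shared global-variable facility, since REXX stems are not shared across external calls.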
Exercise 31.6 — Batch REXX Wrapper
Write the complete JCL and REXX exec for a batch automation job that:
1. Reads a list of dataset names from an input file
2. Checks each dataset's last reference date using LISTDSI
3. Produces a report of datasets not referenced in the last 90 days
4. Optionally migrates unreferenced datasets (controlled by a parameter)
Include both the REXX exec and the JCL to execute it via IKJEFT1B.
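The JCL half follows a standard shape; a sketch with assumed library, dataset, and exec names. IKJEFT1B is commonly chosen because it propagates the exec's return code to the step condition code:

```jcl
//* Sketch for 31.6 - batch TSO wrapper (all names are placeholders)
//DSAGERPT JOB (ACCT),'DS AGE REPORT',CLASS=A,MSGCLASS=X
//TSOBATCH EXEC PGM=IKJEFT1B,DYNAMNBR=50
//SYSEXEC  DD DISP=SHR,DSN=CNB.PROD.REXXLIB
//INDSN    DD DISP=SHR,DSN=CNB.PROD.DSNLIST
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  %DSAGERPT MIGRATE(NO)
/*
```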
Section 31.3: JCL Procedures
Exercise 31.7 — Basic PROC Design
Design a JCL PROC for executing a standard COBOL batch program (non-DB2). Include symbolic parameters for: program name, load library, sort work space, input dataset, output dataset, report class, condition code threshold, and region size. Provide sensible defaults for all optional parameters and require the program name.
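A starting skeleton showing the symbolic-parameter mechanics (names, defaults, and DD statements are placeholders, and most of the required symbolics are omitted). Leaving PGM without a default makes it effectively mandatory, since an unresolved &PGM fails at conversion:

```jcl
//COBBATCH PROC PGM=,                 no default: caller must supply
//             LOADLIB='CNB.PROD.LOADLIB',
//             RPTCLS='*',
//             REGION=0M
//STEP010  EXEC PGM=&PGM,REGION=&REGION
//STEPLIB  DD  DISP=SHR,DSN=&LOADLIB
//SYSOUT   DD  SYSOUT=&RPTCLS
//SYSUDUMP DD  SYSOUT=&RPTCLS
//         PEND
```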
Exercise 31.8 — Nested PROC Architecture
Design a two-level PROC architecture for Pinnacle Health's batch environment:
- Level 1: A base execution PROC (PINHEXEC) that handles COBOL program execution with DB2
- Level 2: Three application PROCs that call PINHEXEC — one for claims processing, one for eligibility verification, and one for HL7 interface processing
Show how symbolic parameters flow from the application PROCs through to the base execution PROC. Document which parameters are set at each level and which are passed through.
Exercise 31.9 — PROC Migration Strategy
You're standardizing JCL at Federal Benefits. Currently, 340 production jobs use inline JCL (no PROCs). Outline a migration strategy to convert these to PROC-based execution. Address: prioritization criteria, testing approach, rollback plan, and how to handle jobs with unique JCL requirements that don't fit standard PROCs. Estimate the effort and timeline.
Exercise 31.10 — PROC Versioning
Design a PROC versioning scheme that allows:
- Multiple versions of a PROC to exist simultaneously
- Jobs to specify which version they want (or default to the current version)
- Easy rollback to a previous version
- Audit trail of which version each job used
Show the naming convention, JCLLIB configuration, and migration procedure.
Exercise 31.11 — PROC Validation Tool
Write a REXX exec that validates a JCL PROC for compliance with organizational standards. The exec should check:
- Documentation header is present and complete
- All symbolic parameters have descriptions
- Naming convention is followed
- No hardcoded dataset names for environment-specific resources
- COND parameters are consistent
List the checks and show sample output for a non-compliant PROC.
Section 31.4: Automation Products
Exercise 31.12 — OPS/MVS Rule Design
Write an OPS/MVS MSG rule that monitors for the dataset space abend messages IEC030I, IEC031I, and IEC032I (SB37, SD37, and SE37 respectively) on any job with the prefix HAB (HA Banking). The rule should:
1. Parse the message to extract the dataset name and job name
2. Check if the dataset is eligible for automated extension (consult an eligibility table)
3. If eligible, extend the dataset by the amount specified in the eligibility table
4. Restart the job from the failing step
5. If not eligible, escalate to operations with full diagnostic information
6. Log all actions taken
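A heavily hedged outline of the rule shell: the )MSG/)PROC markers follow the standard AOF rule layout, but the MSG.TEXT variable and the IEC030I field positions are from memory and must be verified against your OPS/MVS release; steps 2 through 4 are left as comments:

```
)MSG IEC030I
)PROC
/* OPS/REXX sketch for steps 1 and 5. The IEC030I layout           */
/* ("B37-rc,mod,jobname,stepname,...") is indicative, not exact.   */
parse var msg.text . code ',' . ',' jobname ',' stepname ',' .
if left(jobname,3) <> 'HAB' then return       /* out of scope */
/* step 2: consult the eligibility table for this dataset          */
/* steps 3-4: if eligible, extend the dataset and restart the step */
/* step 5: otherwise escalate with full diagnostics                */
say 'HAB space abend' code 'job' jobname 'step' stepname
return
```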
Exercise 31.13 — SA z/OS Application Group
Design an SA z/OS application group definition for CNB's online banking environment, which includes:
- Two CICS AOR regions (CICSP01, CICSP02)
- One CICS TOR region (CICST01)
- DB2 subsystem (DB2P)
- MQ Queue Manager (MQP1)
- VTAM (SNA access)
Define the start sequence, stop sequence, dependencies, health checks, and restart policies for each component. Show the dependency graph.
Exercise 31.14 — NetView Cross-System Automation
Federal Benefits runs a three-LPAR sysplex. Design a NetView automation table entry and associated command list (CLIST or REXX) that detects when the batch scheduler on LPAR1 becomes unresponsive and automatically:
1. Confirms the scheduler is truly down (not just slow)
2. Attempts restart on LPAR1
3. If restart fails, redirects pending batch work to LPAR2
4. Notifies operations of the situation and actions taken
Exercise 31.15 — Automation Product Comparison
You've been asked to recommend an automation strategy for a mid-size z/OS shop (one LPAR, 500 batch jobs/day, 12 CICS regions, DB2 and MQ). They currently have no automation products. Compare three approaches:
1. IBM SA z/OS only
2. Broadcom OPS/MVS only
3. SA z/OS for subsystem management + OPS/MVS for operational automation
For each approach, evaluate: initial cost, implementation effort, ongoing maintenance, flexibility, and coverage. Provide a recommendation with justification.
Section 31.5: Self-Healing Batch
Exercise 31.16 — Pre-Flight Check Design
Design a comprehensive pre-flight check for Pinnacle Health's nightly claims processing batch. The batch processes health insurance claims from 47 provider groups. Pre-flight must validate:
- DB2 availability and response time
- Input file arrival from all 47 provider groups (with tolerance for missing files)
- Output dataset space availability
- CICS regions are quiesced for the batch window
- Previous cycle completed successfully
- Archive datasets from the prior run are available for restart comparison
For each check, specify the pass/fail criteria and the return code. Design the logic for partial failures (e.g., 45 of 47 provider files arrived — proceed or wait?).
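The 45-of-47 question usually comes down to a threshold plus a cutoff time; one way to sketch the decision, with assumed values:

```rexx
/* REXX - sketch of the partial-arrival decision for 31.16.         */
expected = 47
arrived  = 45                 /* would come from the arrival checks */
minpct   = 95                 /* assumed proceed-with-warning floor */
cutoff   = 150                /* 02:30 a.m. as minutes past midnight */
now = time('M')               /* minutes since midnight */
select
   when arrived = expected             then rcode = 0   /* all clear     */
   when arrived/expected*100 >= minpct then rcode = 4   /* warn and go   */
   when now < cutoff                   then rcode = 8   /* wait, recheck */
   otherwise                                rcode = 12  /* escalate      */
end
say 'File-arrival check RC='rcode '('arrived 'of' expected')'
```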
Exercise 31.17 — Recovery Table Construction
Build a recovery table for SecureFirst's transaction processing batch (15 jobs). For each of the following failure scenarios, specify the automated recovery action, maximum retry count, conditions for escalation, and any special handling:
1. SB37 on transaction log dataset
2. DB2 -911 (timeout) during high-volume period
3. MQ connection failure (U0300 abend)
4. Sort work space insufficient (SB37 on SORTWKnn)
5. Input file missing (JCL error)
6. S0C7 in validation step
7. DB2 -904 (resource unavailable)
8. VSAM RLS contention (IEC161I)
9. Tape mount timeout
10. Region size exceeded (S878)
11. WTO buffer shortage
12. Job cancelled by operator (S222)
Exercise 31.18 — Cascading Failure Detection
Design an algorithm (in pseudocode or REXX) that detects cascading failures in a batch stream. The algorithm should:
- Track failure rates across all monitored jobs
- Distinguish between independent failures and cascading failures
- Identify the likely root cause of a cascade
- Switch from individual recovery to system-level escalation when a cascade is detected
Define the thresholds and criteria you'd use. Explain how you'd tune these thresholds over time.
Exercise 31.19 — Self-Healing Metrics Dashboard
Design a report (layout and content) that shows the effectiveness of self-healing automation over time. Include metrics for:
- Automated recovery success rate (by failure type)
- Mean time to automated recovery vs. mean time to manual recovery
- False positive rate (automation attempted recovery but shouldn't have)
- Escalation rate (failures that required human intervention)
- Cost savings (estimated labor hours saved)
Show a sample report with realistic data for a 30-day period.
Section 31.6: Automating Operational Procedures
Exercise 31.20 — Dataset Lifecycle Policy
Design a comprehensive dataset lifecycle automation policy for CNB. For each dataset category below, specify the retention period, cleanup action, notification requirements, and any exceptions:
- Temporary work datasets (HLQ.TEMP.**)
- GDG bases and generations
- Test datasets in the development environment
- SMF dump datasets
- Spool output datasets
- DB2 image copies
- Archive datasets
Implement the policy as a REXX exec that reads the policy from a configuration dataset and executes the appropriate cleanup.
Exercise 31.21 — Automated Capacity Reporting
Write a REXX exec that generates a weekly capacity report including:
- DASD utilization by storage group (total, used, available)
- Top 20 space consumers (dataset name, allocated, used)
- GDG bases approaching their limit (within 10 generations)
- Catalog utilization (entries, space used)
- Trend analysis (comparison with previous 4 weeks)
The report should be written to a dataset and optionally emailed via SMTP.
Exercise 31.22 — Automated Compliance Check
Design an automated compliance checking framework for Federal Benefits that runs nightly and checks:
- RACF dataset protection rules match the documented standard
- APF-authorized libraries match the approved list
- System parameters (IEASYSxx, LNKLSTxx) match the documented baseline
- Started task assignments are correct
- Program Properties Table entries are correct
For each check, the framework should compare current state against a baseline, report deviations, and classify severity (critical, major, minor).
Section 31.7: Automation Governance
Exercise 31.23 — Automation Change Request
Write a complete automation change request document for the following scenario: You want to add a new OPS/MVS rule that automatically extends VSAM datasets that receive an IEC070I (out of extents) message. Include:
- Business justification
- Rule logic (pseudocode)
- Scope definition (which datasets, which systems)
- Risk assessment
- Test plan
- Rollback plan
- Approval requirements
Exercise 31.24 — Runaway Scenario Analysis
For each of the following automation runaway scenarios, describe what went wrong, what governance control would have prevented it, and how to implement that control:
1. A restart rule that keeps restarting a job that immediately abends, creating an infinite restart loop
2. A cleanup rule that deletes datasets matching a pattern, but the pattern is too broad and catches production datasets
3. A notification rule that sends pages for every WTO message during a system restart, flooding the pager with 3,000 messages in 5 minutes
4. A recovery rule that works correctly on LPAR1 but causes problems when it fires on LPAR2 (different configuration)
5. Two automation rules that conflict — one holds a job, the other releases it, creating a hold/release cycle
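Scenario 1 is typically prevented by a retry cap plus a minimum gap between attempts. A sketch of that guard; GETSTATE/SETSTATE are hypothetical stand-ins for real persistence (OPS/MVS global variables, a control dataset):

```rexx
/* REXX - sketch of a restart-loop guard. GETSTATE/SETSTATE are    */
/* hypothetical helpers; counters must persist across firings.     */
parse upper arg job .
maxretry = 3                        /* assumed policy values */
mingap   = 300                      /* seconds between attempts */
parse value getstate(job) with count last
now = time('S')                     /* seconds since midnight */
if count >= maxretry | (now - last) < mingap then do
   say 'Restart suppressed for' job '- escalating to operations'
   exit 8
end
call setstate job, count + 1, now
say 'Restart attempt' count + 1 'of' maxretry 'for' job
exit 0
```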
Exercise 31.25 — Automation Maturity Assessment
Create an automation maturity assessment checklist that a z/OS shop can use to evaluate their current automation practices. Cover:
- Automation coverage (what percentage of operational tasks are automated)
- Governance maturity (testing, change management, documentation)
- Self-healing capability (recovery automation, pre-flight checks)
- Monitoring integration (how well automation is connected to monitoring)
- Continuous improvement (metrics, review cadence, optimization)
For each area, define levels 1 through 5 (ad hoc to optimized) with specific criteria. Provide guidance on how to move from one level to the next.
Challenge Exercises
Exercise 31.26 — Full Automation Design
Design the complete operational automation framework for the HA Banking Transaction Processing System. Include:
1. Complete REXX exec library (list all execs with purpose and pseudocode)
2. Complete PROC library (list all PROCs with parameters)
3. Complete OPS/MVS rule set (list all rules with trigger/action)
4. SA z/OS application group definitions
5. Self-healing batch design for the nightly settlement stream
6. Governance framework documentation
7. Metrics and reporting design
This is a comprehensive design exercise — produce documentation, not necessarily working code for every component.
Exercise 31.27 — Automation Disaster Recovery
Your automation infrastructure itself needs to be highly available. Design a plan for:
- What happens if OPS/MVS fails on the primary LPAR?
- What happens if the automation REXX exec library becomes unavailable?
- What happens if the recovery table (DB2) is inaccessible?
- How do you fail over automation to a secondary system?
- How do you test automation failover?
Exercise 31.28 — Legacy Automation Migration
You've inherited a z/OS environment with 15 years of accumulated automation:
- 340 REXX execs in various libraries (some duplicated, some obsolete)
- 87 OPS/MVS rules (some conflicting, some that never fire)
- 12 NetView CLISTs (poorly documented)
- No governance framework
- No documentation beyond inline comments
Design a project plan to rationalize this automation estate. Address: inventory and assessment, prioritization, cleanup, governance implementation, documentation, testing, and ongoing management. Estimate the effort and propose a phased timeline.