Quiz — Chapter 31: Operational Automation
Question 1
What is the primary benefit of operational automation on z/OS, according to IBM's studies on unplanned outages?
A) Reducing hardware costs B) Eliminating 60–70% of unplanned outages caused by human error C) Improving batch throughput by 50% D) Replacing the need for systems programmers
Answer: B
Explanation: IBM's studies consistently show that 60–70% of unplanned outages on z/OS are caused by human error. Every manual step eliminated is a failure mode eliminated. Automation doesn't replace skilled people; it prevents fatigue-induced mistakes in repetitive procedures.
Question 2
Which REXX function is described as "the single most important REXX function for automation" because it captures TSO command output into stem variables?
A) LISTDSI B) EXECIO C) OUTTRAP D) SYSVAR
Answer: C
Explanation: OUTTRAP captures the output of TSO commands into REXX stem variables, enabling programmatic parsing and action on command results. This is the foundation of most TSO/REXX automation: you issue a command, capture its output, and make decisions based on what it tells you.
Question 3
When running REXX in batch for production automation, which TSO terminal monitor program should you use to prevent TSO READY prompts from hanging the batch job?
A) IKJEFT01 B) IKJEFT1B C) IRXJCL D) IKJEFT1A
Answer: B
Explanation: IKJEFT1B is the "no prompt" version of the TSO terminal monitor program. IKJEFT01 can issue READY prompts that will hang a batch job waiting for input. For production automation where no human is present to respond, IKJEFT1B is required.
Question 4
In a JCL PROC, what does a symbolic parameter defined with no default value (e.g., PROG=) indicate?
A) The parameter is optional and defaults to blanks B) The parameter is required — the caller must supply a value C) The parameter will use the system default D) The parameter is deprecated and should not be used
Answer: B
Explanation: A symbolic parameter with no default (e.g., PROG=) is required. If the calling JCL doesn't supply a value via the EXEC statement's override, the JCL will fail with a substitution error. This is a deliberate design choice to force callers to provide critical values like program names.
Question 5
What is the maximum nesting depth for JCL PROCs on z/OS?
A) 3 levels B) 5 levels C) 10 levels D) 15 levels
Answer: D
Explanation: z/OS supports up to 15 levels of nested PROCs. In practice, however, nesting should rarely exceed 2–3 levels. Deeper nesting makes debugging difficult and symbolic parameter resolution increasingly complex.
Question 6
In the automation level spectrum described in the chapter, what distinguishes Level 4 (Fully Automated) from Level 5 (Self-Healing)?
A) Level 4 requires human approval; Level 5 does not B) Level 4 executes a predefined response; Level 5 also diagnoses root cause and prevents recurrence C) Level 4 only handles batch; Level 5 handles online systems too D) Level 4 uses REXX; Level 5 uses automation products
Answer: B
Explanation: Level 4 automation detects a condition and executes a predefined action (e.g., detect abend, restart job). Level 5 goes further: it diagnoses the root cause, remediates the issue, and takes action to prevent recurrence (e.g., detect GDG full, extend the base, restart the job, and adjust the capacity plan).
Question 7
Which automation product uses a rule-based event engine with OPS/REXX syntax for event-driven operational automation?
A) IBM System Automation (SA z/OS) B) CA OPS/MVS (Broadcom) C) Tivoli NetView D) IBM Health Checker
Answer: B
Explanation: OPS/MVS uses a rule-based event engine. Rules are triggered by events (console messages, SMF records, time-of-day, end-of-job) and execute OPS/REXX code. SA z/OS uses a policy-based approach, and NetView uses automation table entries.
Question 8
What is the first design principle for automated actions listed in the chapter?
A) Audit trail B) Bounded scope C) Idempotency D) Kill switch
Answer: C
Explanation: Idempotency means that running the same automation action twice should produce the same result as running it once. If your restart automation starts a job and the job is already running, the automation should detect that and skip, not submit a duplicate. This prevents cascading problems from automation re-execution.
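The idempotent restart described in this explanation can be sketched in a few lines. This is an illustration in Python rather than the chapter's REXX, and `restart_job`, `running_jobs`, and the job name `PAYROLL1` are hypothetical names chosen for the example, not anything defined in the chapter.

```python
def restart_job(job_name, running_jobs, submit):
    """Submit job_name only if it is not already running.

    running_jobs: set of active job names (hypothetical input).
    submit: callable that actually starts the job (hypothetical).
    Returns "SUBMITTED" or "SKIPPED".
    """
    if job_name in running_jobs:
        # The job is already active: a second submit would create a duplicate.
        return "SKIPPED"
    submit(job_name)
    running_jobs.add(job_name)
    return "SUBMITTED"


# Firing the same action twice has the same effect as firing it once.
submitted = []
active = set()
first = restart_job("PAYROLL1", active, submitted.append)
second = restart_job("PAYROLL1", active, submitted.append)
print(first, second, submitted)
```

The second call detects the running job and skips, so only one submission is ever recorded.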
Question 9
In CNB's self-healing batch architecture, what is the purpose of the "pre-flight check" component?
A) To back up all input datasets before processing B) To validate that all prerequisites are met before a batch job executes C) To notify operators that a batch stream is about to start D) To generate a forecast of expected runtime
Answer: B
Explanation: Pre-flight checks validate prerequisites before execution: input datasets exist and are available, DB2 is active, sufficient DASD space exists, predecessor jobs completed, control tables are correct, and system resources are adequate. Failing pre-flight skips main processing and triggers remediation instead of letting the job fail mid-execution.
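A minimal sketch of the pre-flight pattern, assuming each prerequisite can be expressed as a function returning true or false. The check names and the `run_job` helper are hypothetical; a real implementation would query the catalog, DB2, and the scheduler instead of using stub lambdas.

```python
def run_preflight(checks):
    """Run every check and return the names of the ones that failed."""
    return [name for name, check in checks if not check()]


def run_job(checks, main, remediate):
    """Run main() only when all pre-flight checks pass."""
    failures = run_preflight(checks)
    if failures:
        # Skip main processing and hand the failed checks to remediation.
        remediate(failures)
        return "REMEDIATION"
    main()
    return "EXECUTED"


# Stub checks standing in for real probes (all names are illustrative).
checks = [
    ("input dataset available", lambda: True),
    ("DB2 active", lambda: False),
    ("DASD space sufficient", lambda: True),
]
events = []
result = run_job(
    checks,
    main=lambda: events.append("main"),
    remediate=lambda failed: events.append(("remediate", failed)),
)
print(result, events)
```

All checks run even after the first failure, so remediation sees the complete list of unmet prerequisites rather than only the first one.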
Question 10
In the real-world self-healing sequence at CNB, how long did the automated recovery of an SB37 abend take from initial failure to successor job release?
A) 38 seconds B) Approximately 15 minutes C) 43 minutes D) 3 hours
Answer: B
Explanation: The sequence took approximately 15 minutes from the abend (T+0) to successor job release (T+15 min 5 sec). The automation itself responded within seconds (detection at T+2 sec, space extension at T+5 sec, restart at T+8 sec), but the job needed about 15 minutes to re-execute. Compare this with the 3+ hour manual recovery described at the chapter opening.
Question 11
Which of the following failure types should NOT be handled by self-healing automation, according to the chapter?
A) SB37 (out of space) B) DB2 timeout (transient) C) Data corruption that produces wrong results without abending D) Sort work space insufficient
Answer: C
Explanation: Self-healing works for known failure modes with known remediation. Data corruption that doesn't cause an abend is invisible to automation: the job completes "successfully" with wrong data. This is why comprehensive output validation (record counts, control totals) is critical, though most shops don't have it for every job.
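The output validation mentioned here (record counts and control totals) could look roughly like the following. The record layout with an `amount` field and the `validate_output` function are assumptions made for the example; the chapter does not prescribe a specific format.

```python
from decimal import Decimal


def validate_output(records, expected_count, expected_total):
    """Return a list of discrepancies; an empty list means the output reconciles."""
    problems = []
    if len(records) != expected_count:
        problems.append(
            f"record count {len(records)} != expected {expected_count}"
        )
    # Decimal avoids floating-point drift when summing money amounts.
    actual_total = sum((r["amount"] for r in records), Decimal("0"))
    if actual_total != expected_total:
        problems.append(
            f"control total {actual_total} != expected {expected_total}"
        )
    return problems


records = [{"amount": Decimal("10.00")}, {"amount": Decimal("5.50")}]
clean = validate_output(records, 2, Decimal("15.50"))
corrupt = validate_output(records, 3, Decimal("16.00"))
print(clean, corrupt)
```

A job that ends with return code zero but fails this reconciliation is exactly the silent-corruption case that abend-driven automation cannot see.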
Question 12
What is CNB's threshold for detecting cascading failures and switching from individual recovery to system-level escalation?
A) 2 jobs fail within 5 minutes B) 3 jobs fail within 15 minutes C) 5 jobs fail within a 10-minute window D) 10 jobs fail within 30 minutes
Answer: C
Explanation: CNB's recovery engine tracks failure rates per minute. If more than 5 jobs fail within a 10-minute window, individual recovery stops and the system escalates to a system-level alert. This prevents automation from individually "recovering" jobs when the real problem is systemic (e.g., DASD failure, DB2 down).
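One way to sketch the cascade threshold is a sliding window over failure timestamps. The `CascadeDetector` class and the mode names it returns are hypothetical; only the numbers (more than 5 failures in 10 minutes) come from the explanation above. The sketch assumes timestamps arrive in non-decreasing order, expressed in seconds.

```python
from collections import deque


class CascadeDetector:
    """Track recent job failures and decide between individual and system-level handling."""

    def __init__(self, max_failures=5, window_seconds=600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()

    def record_failure(self, timestamp):
        """Record one failure (timestamp in seconds) and return the handling mode."""
        self._failures.append(timestamp)
        # Discard failures that have aged out of the window.
        while timestamp - self._failures[0] >= self.window_seconds:
            self._failures.popleft()
        if len(self._failures) > self.max_failures:
            return "ESCALATE_SYSTEM"
        return "RECOVER_INDIVIDUAL"


detector = CascadeDetector()
# Six failures one minute apart: the sixth exceeds the threshold.
modes = [detector.record_failure(t) for t in (0, 60, 120, 180, 240, 300)]
print(modes)
```

Failures spread further apart than the window never accumulate, so isolated abends continue to receive individual recovery.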
Question 13
In the recovery table, what is the correct automated response for an S0C7 abend?
A) RESTART_STEP B) EXTEND_SPACE C) WAIT_RESOURCE D) ESCALATE (data error, needs human)
Answer: D
Explanation: S0C7 is a data exception, typically caused by a numeric field containing non-numeric data. This is a data or program logic error that requires human investigation. Automated restart would just reproduce the same abend. The correct response is always escalation. Similarly, S0C4 (protection exception) always escalates.
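A recovery table of this kind is essentially a lookup from abend code to action. In the sketch below, the S0C7 and S0C4 entries follow the explanation above; the SB37 entry and the choice to escalate unknown codes are assumptions made for illustration, based on the chapter's point that self-healing only applies to known failure modes.

```python
# Abend code -> automated response.
RECOVERY_TABLE = {
    "S0C7": "ESCALATE",      # data exception: needs human investigation
    "S0C4": "ESCALATE",      # protection exception: always escalates
    "SB37": "EXTEND_SPACE",  # assumption: out-of-space maps to space extension
}


def recovery_action(abend_code):
    """Look up the response for an abend; unknown codes escalate (assumed default)."""
    return RECOVERY_TABLE.get(abend_code, "ESCALATE")


for code in ("S0C7", "SB37", "S999"):
    print(code, recovery_action(code))
```

Defaulting to escalation keeps the automation from guessing at a remedy for a failure it has never been taught to handle.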
Question 14
What event type in OPS/MVS would you use to trigger automation based on performance threshold breaches recorded in SMF?
A) MSG B) EOJ C) SMF D) TOD
Answer: C
Explanation: OPS/MVS SMF rules trigger when specific SMF record types are written. Since performance data (CPU utilization, I/O rates, response times) is recorded in SMF records by RMF, SMF event rules can detect performance threshold breaches in real time and trigger corrective automation.
Question 15
Which governance control is described as the most critical safety mechanism — the ability to instantly disable all automation?
A) Rate limiting B) Mutual exclusion C) Circuit breaker (kill switch) D) Authority limits
Answer: C
Explanation: The circuit breaker (kill switch) is the ultimate safety mechanism. CNB's "AUTOMATION OFF" command suspends all automation rules in under five seconds. It's tested quarterly. When automation itself is misbehaving, you need a way to shut everything down immediately without surgical precision.
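The kill switch idea reduces to a single global flag that every rule consults before acting. The `AutomationEngine` class below is a hypothetical stand-in for an automation product, not CNB's implementation; it only shows that one switch gates all rules at once.

```python
import threading


class AutomationEngine:
    """Dispatch rule actions, gated by one global on/off switch."""

    def __init__(self):
        # threading.Event gives a flag that is safe to flip from another thread.
        self._enabled = threading.Event()
        self._enabled.set()

    def automation_off(self):
        """Suspend every rule at once."""
        self._enabled.clear()

    def automation_on(self):
        self._enabled.set()

    def fire(self, rule_name, action):
        """Run the action unless automation is globally disabled."""
        if not self._enabled.is_set():
            return f"{rule_name}: SUPPRESSED"
        action()
        return f"{rule_name}: EXECUTED"


engine = AutomationEngine()
log = []
before = engine.fire("RESTART_RULE", lambda: log.append("restart"))
engine.automation_off()
after = engine.fire("RESTART_RULE", lambda: log.append("restart"))
print(before, after, log)
```

Because the check lives in the dispatcher rather than in each rule, a misbehaving rule cannot bypass it.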
Question 16
According to the REXX best practices, what is the maximum recommended length for a single REXX exec?
A) 200 lines B) 500 lines C) 1000 lines D) 2000 lines
Answer: C
Explanation: Keep REXX execs under 1000 lines. If an exec is longer, refactor it into called subroutines stored as separate execs. Long monolithic REXX execs are difficult to test, debug, maintain, and reuse.
Question 17
What is the staged rollout process for a new automation rule at CNB?
A) Write, test in production, activate B) Write, unit test, deploy to production in active mode C) Write, unit test, negative test, stress test, integration test, deploy in monitor-only mode for one week minimum, then activate D) Write, peer review, deploy to production
Answer: C
Explanation: CNB's process requires six stages: unit testing in isolation, negative testing (similar-but-different events), stress testing (rapid-fire events), integration testing (alongside other rules), staged rollout in monitor-only mode for at least one week, and a post-activation review after one month. This rigorous process prevents the kind of runaway scenarios described in Section 31.7.
Question 18
In the "Disk Eater" incident described in Section 31.7, what was the root cause of the automation disaster?
A) The REXX exec had a bug B) The automation rule's scope was too broad — it applied to all jobs instead of the specific job it was designed for C) The automation product malfunctioned D) The operator disabled the kill switch
Answer: B
Explanation: The rule was designed for a specific job's temporary work file but was scoped to apply to all jobs. When a different job's production master file hit SB37, the rule deleted it. This is a scoping failure, the most common cause of automation disasters. Bounded scope is a fundamental design principle for automated actions.
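Bounded scope can be illustrated by a rule that must match both the job name and the dataset name before it is allowed to act. The `ScopedRule` class, the job names, and the dataset names below are all invented for the example; they are not the ones involved in the incident.

```python
from fnmatch import fnmatchcase


class ScopedRule:
    """A rule that applies only to one named job and one dataset name pattern."""

    def __init__(self, job_name, dataset_pattern):
        self.job_name = job_name
        self.dataset_pattern = dataset_pattern

    def applies_to(self, job_name, dataset):
        # Both conditions must hold; matching on the abend alone is too broad.
        return job_name == self.job_name and fnmatchcase(
            dataset, self.dataset_pattern
        )


rule = ScopedRule("WRKJOB01", "WRKJOB01.TEMP.*")
own_work_file = rule.applies_to("WRKJOB01", "WRKJOB01.TEMP.SORTWK")
other_job = rule.applies_to("MSTJOB02", "PROD.MASTER.FILE")
wrong_dataset = rule.applies_to("WRKJOB01", "PROD.MASTER.FILE")
print(own_work_file, other_job, wrong_dataset)
```

With this guard, an SB37 on another job's master file never reaches the delete action, because the rule declines to apply.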
Question 19
What is CNB's standard for rate limiting automation rules?
A) No rule should execute more than 3 times per hour B) No rule should execute more than 5 times per 15 minutes C) Any rule that fires more than 10 times in 30 minutes is automatically suspended D) No rule should execute more than 100 times per day
Answer: C
Explanation: CNB's rate limiting standard: any automation rule that fires more than 10 times in 30 minutes is automatically suspended and the automation team is paged. Excessive firing indicates either a misfiring rule or a systemic issue that the rule can't fix. Either way, human review is needed.
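The rate limit can be sketched as a per-rule counter over a 30-minute window that suspends the rule and pages the team once the limit is exceeded. The `RateLimitedRule` class, the `page_team` callback, and the rule name are hypothetical; only the 10-in-30-minutes threshold comes from the explanation above. Timestamps are assumed to be seconds in non-decreasing order.

```python
from collections import deque


class RateLimitedRule:
    """Suspend a rule that fires more than max_firings times within the window."""

    def __init__(self, name, max_firings=10, window_seconds=1800):
        self.name = name
        self.max_firings = max_firings
        self.window_seconds = window_seconds
        self.suspended = False
        self._firings = deque()

    def fire(self, timestamp, action, page_team):
        """Run action if the rule is healthy; return True when the action ran."""
        if self.suspended:
            return False
        self._firings.append(timestamp)
        # Forget firings older than the window.
        while timestamp - self._firings[0] >= self.window_seconds:
            self._firings.popleft()
        if len(self._firings) > self.max_firings:
            # Too many firings: stop acting and ask for human review.
            self.suspended = True
            page_team(self.name)
            return False
        action()
        return True


rule = RateLimitedRule("SB37_EXTEND")
actions, pages = [], []
# Eleven firings one minute apart, all inside the 30-minute window.
results = [
    rule.fire(t * 60, lambda: actions.append("extend"), pages.append)
    for t in range(11)
]
print(results, rule.suspended, pages)
```

Once suspended, the rule stays off until someone re-enables it, which matches the intent that excessive firing requires human review.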
Question 20
Why does the chapter state that automated restarts should be issued through the scheduler's API rather than raw JES commands?
A) JES commands are slower B) To maintain dependency tracking, resource serialization, and audit trails C) JES commands require operator authority D) The scheduler API is easier to use from REXX
Answer: B
Explanation: Restarting through the scheduler API maintains the integrity of the scheduling environment: predecessor/successor dependencies stay intact, resource serialization (preventing conflicting jobs from running simultaneously) is preserved, and the scheduler's audit trail captures the restart. Raw JES commands bypass all of this, potentially causing downstream jobs to run before their prerequisites are complete.