Chapter 27 Quiz
Question 1
Which SMF record type provides the most comprehensive job-level performance data for batch monitoring, including CPU time, elapsed time, and I/O counts broken down by step?
A) SMF Type 4 (Step Termination) B) SMF Type 14 (Input Dataset Activity) C) SMF Type 30 (Common Address Space Work) D) SMF Type 42 (Storage Management)
Answer: C SMF Type 30 is the comprehensive job accounting record. While Type 4 provides step termination data, Type 30 subtypes 4 (step termination) and 5 (job termination) contain far more detail, including processor accounting, storage usage, and detailed I/O statistics organized by the self-defining section format.
Question 2
In an SMF Type 30 record, the self-defining section contains "triplets." What does each triplet consist of?
A) Record type, record length, record count B) Offset, length, count C) Job name, step name, program name D) Start time, end time, elapsed time
Answer: B Each triplet in the self-defining section contains an offset (where the data section starts within the record), a length (how long each instance of the section is), and a count (how many instances exist). This design allows IBM to add new sections without breaking existing programs.
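As a rough sketch of how a monitoring program walks these triplets, consider the following Python fragment. It assumes the classic triplet layout (4-byte offset, 2-byte length, 2-byte count, big-endian as on z/OS) and uses a synthetic record; in real SMF processing, the triplet positions and the meaning of each section come from the IBM-documented record mappings.

```python
import struct

def read_triplet(record: bytes, pos: int):
    """Decode one self-defining-section triplet at byte position `pos`:
    a 4-byte offset, 2-byte length, and 2-byte count, big-endian."""
    return struct.unpack(">IHH", record[pos:pos + 8])

def sections(record: bytes, triplet_pos: int):
    """Yield each instance of the data section one triplet describes."""
    offset, length, count = read_triplet(record, triplet_pos)
    for i in range(count):
        start = offset + i * length
        yield record[start:start + length]

# Synthetic record: one triplet at position 0 pointing at two
# 4-byte section instances starting at offset 8.
rec = struct.pack(">IHH", 8, 4, 2) + b"AAAABBBB"
```

Because a reader only ever follows offset/length/count, a record that grows a new section (or lengthens an existing one) still parses correctly with old code, which is exactly the compatibility property the answer describes.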
Question 3
SMF Type 14 records are written when:
A) A VSAM dataset is opened for input B) A non-VSAM dataset is closed after input C) Any dataset is allocated by JCL D) A dataset is migrated by HSM
Answer: B SMF Type 14 records are written when a non-VSAM dataset is closed after being used for input. Type 15 records are the equivalent for output datasets. These records provide dataset-level I/O statistics including EXCP counts and block counts.
Question 4
Which of the following is the PRIMARY purpose of dynamic thresholds in batch alerting?
A) To reduce the computational cost of threshold checking B) To eliminate the need for human review of alerts C) To account for predictable variations like month-end processing D) To automatically adjust thresholds after each incident
Answer: C Dynamic thresholds adjust baselines for known patterns — day of week, month-end, quarter-end, seasonal variations — so that expected fluctuations don't trigger false alerts. A job that runs twice as long on month-end is behaving normally for that day type, and the threshold should reflect that.
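A minimal sketch of day-type-aware baselining, in Python. The job name, baseline table, and 50% margin are hypothetical; a real implementation would also classify quarter-end, holidays, and day-of-week patterns.

```python
from datetime import date
import calendar

# Hypothetical per-job elapsed-time baselines in minutes, keyed by day type.
BASELINES = {
    "PAYROLL1": {"normal": 45, "month_end": 90},
}

def day_type(run_date: date) -> str:
    """Classify the calendar context; extend with quarter-end, holidays, etc."""
    last_day = calendar.monthrange(run_date.year, run_date.month)[1]
    return "month_end" if run_date.day == last_day else "normal"

def elapsed_threshold(job: str, run_date: date, pct_over: float = 0.50) -> float:
    """Alert threshold = this day type's baseline plus a fixed margin."""
    baseline = BASELINES[job][day_type(run_date)]
    return baseline * (1 + pct_over)
```

On January 31 the threshold is computed from the 90-minute month-end baseline, so the expected long run stays quiet; on January 15 it falls back to the normal 45-minute baseline.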
Question 5
Rob's Tier 1 alert classification at CNB includes all of the following EXCEPT:
A) Any abend in a Tier-1 critical job B) Dataset space abend (B37/D37/E37) in any batch job C) Any job elapsed time exceeding baseline by 25% D) JES spool utilization exceeding 85%
Answer: C A 25% baseline deviation is classified as Tier 3 (Informational — review next business day). Tier 1 is reserved for conditions requiring immediate response: critical job abends, space abends, SLA milestones at risk, security violations, and critical resource thresholds.
Question 6
What is the correct order of MTTR components from detection to full resolution?
A) MTTD → MTTRp → MTTDx → MTTV B) MTTDx → MTTD → MTTRp → MTTV C) MTTD → MTTDx → MTTRp → MTTV D) MTTRp → MTTD → MTTDx → MTTV
Answer: C The sequence is: Mean Time to Detect (MTTD) — how long until the problem is noticed; Mean Time to Diagnose (MTTDx) — how long to identify the root cause; Mean Time to Repair (MTTRp) — how long to implement the fix; Mean Time to Verify (MTTV) — how long to confirm the fix worked and downstream effects are cleared.
Question 7
In a self-healing batch environment, which abend code is LEAST appropriate for automated restart?
A) SE37 (End of extent — space) B) S878 (Virtual storage shortage) C) S0C4 (Protection exception) D) U0150 (Application-defined transient error)
Answer: C An S0C4 (protection exception) indicates a program bug — the code attempted to access a memory address it shouldn't have. Restarting will produce the same result because the code hasn't changed. SE37 might succeed if SMS frees space, S878 might succeed if storage pressure decreases, and U0150 is explicitly defined by the application as retry-eligible.
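The reasoning above can be sketched as a restart-eligibility check. This is only an illustration built from the codes in this question plus CNB's U0100-U0199 convention; production automation would drive this from a maintained abend table, not hard-coded logic.

```python
# Transient system abends from this question that may succeed on retry.
RETRYABLE_SYSTEM = {
    "SE37",  # space abend; retry may succeed if SMS frees space
    "S878",  # virtual storage shortage; retry may succeed later
}

def restart_eligible(abend: str) -> bool:
    """Decide whether automation should attempt a restart for this abend."""
    if abend in RETRYABLE_SYSTEM:
        return True
    # U0100-U0199: application-defined transient errors (CNB convention).
    if abend.startswith("U") and abend[1:].isdigit():
        return 100 <= int(abend[1:]) <= 199
    # S0C4 and other program errors: restarting reruns the same bug.
    return False
```

The key design point is that only failures whose cause can change between runs (space, storage, application-declared transients) qualify; a protection exception never does.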
Question 8
What is the primary benefit of "pre-flight checks" before critical batch steps?
A) They reduce CPU consumption of the main processing step B) They detect problems before they cause abends, preventing incidents entirely C) They eliminate the need for checkpoint/restart logic D) They replace the need for SMF monitoring
Answer: B Pre-flight checks validate prerequisites — input data quality, space availability, predecessor completion — before the main processing step executes. By catching problems proactively, they prevent abends that would otherwise require reactive incident response, reducing MTTR to zero for those specific failure modes.
Question 9
In the post-incident review process, what is the purpose of the "Five Whys" exercise?
A) To identify the five team members responsible for the incident B) To peel back layers of causation from immediate symptom to systemic root cause C) To rate the incident severity on a five-point scale D) To identify five corrective actions for the action item list
Answer: B The Five Whys exercise asks "why" repeatedly (typically five times, though the actual number varies) to move from the immediate, surface-level cause to the deeper systemic causes. Each answer reveals a layer of defense that failed or didn't exist, guiding the improvement plan to address root causes rather than symptoms.
Question 10
A batch job normally takes 45 minutes. On month-end nights, it takes 90 minutes. The shop uses a static threshold of 50% over baseline. What problem does this create?
A) The job will never trigger an alert because 50% is too high B) The alert will fire every month-end night, creating a predictable false positive C) The threshold will prevent automated restart from functioning D) The job will be automatically cancelled when it exceeds the threshold
Answer: B With a static baseline of 45 minutes and a 50% threshold, the alert fires at 67.5 minutes. On month-end nights when the job legitimately takes 90 minutes, the alert fires every time — a predictable, expected alert that wastes the on-call person's attention. Dynamic thresholds solve this by using a month-end baseline of 90 minutes, so the alert would only fire if the month-end run exceeded 135 minutes.
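The arithmetic in this answer can be checked directly. A quick sketch, using the question's numbers:

```python
def alert_fires(elapsed_min: float, baseline_min: float,
                pct_over: float = 0.50) -> bool:
    """True if elapsed time exceeds the baseline plus the margin."""
    return elapsed_min > baseline_min * (1 + pct_over)

# Static 45-minute baseline: threshold is 67.5 minutes, so the
# normal 90-minute month-end run alerts every time (false positive).
# Month-end baseline of 90: threshold becomes 135 minutes, and the
# same 90-minute run stays quiet.
```

With the dynamic baseline, an alert on month-end night now carries signal: the run is genuinely beyond its expected 90 minutes.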
Question 11
Which batch monitoring level is responsible for tracking SLA milestones and critical path progress?
A) Level 1 — Real-Time Console Monitoring B) Level 2 — Job-Level Monitoring C) Level 3 — Batch Stream Monitoring D) Level 4 — Historical Trend Analysis
Answer: C Level 3 (Batch Stream Monitoring) operates at the forest level rather than the tree level, answering questions about whether the overall batch cycle is on schedule, whether SLA milestones are being met, and what the critical path shows about estimated completion time.
Question 12
In the playbook for an S806 (Load Module Not Found) abend, the recommended action is:
A) Restart the job automatically with a delay, as the module will eventually appear B) Escalate to the application team, as this is a deployment failure requiring root cause analysis C) Increase the region size and restart, as the module couldn't be loaded due to storage D) Check the RACF profiles, as the job likely lacks read authority to the load library
Answer: B An S806 means the program could not be found in the STEPLIB/JOBLIB concatenation or the LNKLST. This is almost always a deployment problem — the module wasn't link-edited, was deployed to the wrong library, or was accidentally deleted. The application or deployment team must be involved to ensure the correct module is in place.
Question 13
What does "alert fatigue" refer to, and why is it dangerous?
A) Physical exhaustion of the on-call person from too many hours on shift B) The tendency to ignore or dismiss alerts when too many are non-actionable, causing real problems to be missed C) The gradual degradation of alerting system performance under high volume D) The burnout experienced by automation systems that process too many rules
Answer: B Alert fatigue occurs when operations staff receive so many alerts — particularly false positives and informational messages — that they begin to dismiss alerts without proper investigation. This is dangerous because genuine critical alerts get lost in the noise, leading to delayed response or missed incidents.
Question 14
In the critical path calculation for a batch dependency network, if the critical path takes 4 hours and a non-critical-path job abends, what is the impact on the batch window?
A) The batch window is guaranteed to be extended B) No impact, as long as the non-critical job is resolved before it becomes a dependency for a critical-path job C) The batch window is shortened because one less job needs to run D) The entire batch window must be restarted from the beginning
Answer: B A non-critical-path job has "slack" — time available beyond its required duration without affecting the overall completion. If the job is resolved within its available slack, there is no impact on the batch window. However, if the delay extends beyond the slack or if the job is a predecessor to a critical-path job, it may shift the critical path and extend the batch window.
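The slack relationship can be written as a one-line calculation. This sketch assumes slack has already been derived from the dependency network (latest allowable start minus earliest start); only the overrun beyond slack pushes the window.

```python
def window_impact(delay_min: float, slack_min: float) -> float:
    """Minutes added to the batch window by a delayed non-critical job.

    A job can slip by up to its slack without touching the critical
    path; only the portion of the delay beyond the slack extends the
    overall completion time.
    """
    return max(0.0, delay_min - slack_min)
```

For example, a job with 30 minutes of slack that loses 20 minutes to an abend and restart has zero window impact, while a 50-minute delay against the same slack extends the window by 20 minutes.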
Question 15
CNB uses the convention U0100-U0199 for application user abend codes. What is the significance of this range?
A) These codes are reserved by IBM for system use B) These codes indicate transient errors eligible for automatic retry C) These codes signal that the job should be immediately escalated to management D) These codes mean the job completed successfully despite an error
Answer: B At CNB, application teams use the U0100-U0199 range specifically to indicate transient errors that are appropriate for automated restart. This convention allows the automation infrastructure to make intelligent restart decisions based on the application's own assessment of the failure — the application knows which of its errors are retry-eligible.
Question 16
What is the PRIMARY difference between a "playbook" and a "decision tree" for incident response?
A) Playbooks are for management; decision trees are for operators B) Playbooks provide detailed procedures for specific scenarios; decision trees provide quick classification and routing to the appropriate procedure C) Playbooks are automated; decision trees are manual D) Playbooks cover only abend codes; decision trees cover all alert types
Answer: B Decision trees are quick-reference tools for classifying an incident and routing the responder to the correct detailed procedure. They answer "what kind of problem is this?" Playbooks provide the complete, step-by-step procedure for handling that specific type of problem once it's identified. The decision tree is the index; the playbook is the content.
Question 17
In Marcus Rivera's pre-flight check at Federal Benefits, what return code signals "warnings detected but proceed with main processing"?
A) RC=0 B) RC=4 C) RC=8 D) RC=16
Answer: C The pre-flight check returns RC=0 for all clear, RC=8 for warnings (proceed but alerts are generated for review), and RC=16 for critical failures (do not proceed, route to error handler). The JCL conditional execution uses these return codes to determine whether to execute the main processing step.
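The routing logic the return codes drive can be sketched as follows. The RC values (0/8/16) are the convention described in this question; the dispatch strings are illustrative stand-ins for the JCL conditional execution that actually gates the main step.

```python
def dispatch(preflight_rc: int) -> str:
    """Route batch flow based on the pre-flight check's return code."""
    if preflight_rc == 0:
        return "run main step"          # all clear
    if preflight_rc == 8:
        return "run main step, raise warning alerts for review"
    return "skip main step, route to error handler"  # critical failure
```

In JCL terms, the main step would be gated by a conditional test on the pre-flight step's return code (for example, proceed only when it is below 16), with the error-handler step executing on the critical value instead.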
Question 18
Which of the following represents the correct approach to SMF data management for batch monitoring?
A) Collect only SMF types 30 and 14/15; all other types are unnecessary overhead B) Write SMF directly to DB2 tables for real-time analysis C) Dump SMF data regularly (e.g., every 4 hours during batch window), store in a historical database, and analyze with dedicated programs D) Disable SMF collection during the batch window to reduce overhead, then enable it during the online day
Answer: C SMF data must be dumped regularly from the active SYS1.MANx datasets to prevent data loss (the datasets wrap). The dumped data is loaded into a historical database (typically DB2) where it's available for both real-time analysis during the current batch window and long-term trend analysis. Disabling SMF during batch would eliminate the very data you need for monitoring.
Question 19
Rob's PIR rule at CNB is: "We don't ask who caused the problem. We ask what allowed the problem to happen." This reflects which principle?
A) Plausible deniability — protect individual team members from accountability B) Blameless culture — focus on systemic improvement rather than individual punishment C) Separation of duties — operators shouldn't know who wrote the code D) Minimal disclosure — limit incident information to those directly involved
Answer: B A blameless post-incident culture focuses on identifying and fixing systemic weaknesses rather than punishing individuals. This approach produces better outcomes because team members participate honestly, share complete information, and focus on prevention rather than self-protection. It recognizes that if a single human error can take down the batch cycle, the system is fragile — the fix is in the system, not the human.
Question 20
The chapter states that Rob's overall MTTR improved from 98 minutes to 33 minutes. Which single improvement contributed the LARGEST reduction?
A) Faster diagnosis through better playbooks (MTTDx: 28→12 min) B) Faster detection through proper monitoring (MTTD: 37→4 min) C) Faster repair through automation (MTTRp: 19→8 min) D) Faster verification through automated checks (MTTV: 14→9 min)
Answer: B Detection time improved by 33 minutes (from 37 to 4), the single largest reduction of any component. This underscores the fundamental value of proactive monitoring — you cannot fix a problem you haven't detected, and every minute of undetected failure is a minute of wasted time, downstream impact, and accumulated damage.
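The component figures quoted across this question can be cross-checked with a few lines of arithmetic:

```python
# MTTR component times in minutes, before and after Rob's improvements
# (figures as quoted in this question).
before = {"MTTD": 37, "MTTDx": 28, "MTTRp": 19, "MTTV": 14}
after  = {"MTTD": 4,  "MTTDx": 12, "MTTRp": 8,  "MTTV": 9}

total_before = sum(before.values())                    # 98 minutes
total_after = sum(after.values())                      # 33 minutes
reductions = {k: before[k] - after[k] for k in before}
biggest = max(reductions, key=reductions.get)          # detection (MTTD)
```

The components sum to the chapter's 98- and 33-minute totals, and detection's 33-minute reduction exceeds diagnosis (16), repair (11), and verification (5) combined margins of improvement per component.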