Chapter 27 Exercises
Section 27.2 — SMF Records for Batch Intelligence
Exercise 27.1: SMF Record Type Identification
For each of the following monitoring requirements, identify the primary SMF record type(s) you would use and specify which fields within those records are relevant:
a) Determine which job consumed the most CPU time during last night's batch window.
b) Identify all datasets written to by the ACCTPOST job.
c) Detect whether a VSAM KSDS experienced CI splits during batch processing.
d) Determine the elapsed time of each step in a multi-step job.
e) Identify which volume a space abend (B37) occurred on and how much space was requested.
Exercise 27.2: SMF Self-Defining Section Parsing
Write a COBOL paragraph that correctly parses the self-defining section of an SMF type 30 record. Your code should:
- Read the triplets (offset, length, count) from the header
- Use those triplets to locate the identification section, the processor accounting section, and the I/O activity section
- Handle the case where a section has zero entries (count = 0)
- Store the extracted data in working storage structures
Do not hard-code offsets — use the self-defining section exclusively.
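Before attempting Exercise 27.2 in COBOL, it can help to prototype the triplet walk off-host. A Python sketch, assuming the usual type 30 triplet format of a 4-byte offset, 2-byte length, and 2-byte count (all big-endian), applied to a synthetic record rather than real SMF data:

```python
import struct

# Each self-defining-section triplet is: offset (4 bytes), length (2 bytes),
# count (2 bytes), big-endian. The offset is taken here as measured from the
# start of the record buffer.
def parse_triplet(record: bytes, triplet_pos: int):
    offset, length, count = struct.unpack_from(">IHH", record, triplet_pos)
    return offset, length, count

def extract_section(record: bytes, triplet_pos: int):
    """Return the list of section entries, or [] when count == 0."""
    offset, length, count = parse_triplet(record, triplet_pos)
    if count == 0:
        return []          # the zero-entry case the exercise calls out
    return [record[offset + i * length : offset + (i + 1) * length]
            for i in range(count)]
```

The COBOL version follows the same shape: move the triplet fields into binary working-storage items, then use reference modification (or pointer arithmetic via `SET ADDRESS OF`) to map each section.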
Exercise 27.3: SMF Data Volume Calculation
CNB processes approximately 2,500 batch jobs per night. Each job averages 4 steps. SMF type 30 subtype 3 records average 1,200 bytes each. Type 30 subtype 4 records average 800 bytes. Type 14/15 records average 400 bytes, and each step generates an average of 6 dataset accesses.
a) Calculate the approximate SMF data volume generated during a single batch window (type 30 and type 14/15 only).
b) If SMF data is retained for 90 days in a DB2 historical database, estimate the total storage required.
c) If the SMF dump runs every 4 hours and must complete within 10 minutes, what throughput rate (MB/second) is required?
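Part a) reduces to straightforward arithmetic. A Python sketch of one defensible reading, assuming (the text does not say) one subtype 3 record per job and one subtype 4 record per step:

```python
# Back-of-envelope SMF volume for one batch window, using the Exercise 27.3
# figures. Assumption: one subtype 3 record per job, one subtype 4 per step.
JOBS = 2_500
STEPS_PER_JOB = 4
steps = JOBS * STEPS_PER_JOB                 # 10,000 steps per night

subtype3_bytes  = JOBS * 1_200               # job-level type 30 records
subtype4_bytes  = steps * 800                # step-level type 30 records
type14_15_bytes = steps * 6 * 400            # 6 dataset accesses per step

total_mb = (subtype3_bytes + subtype4_bytes + type14_15_bytes) / 1_000_000
print(round(total_mb, 1))                    # 35.0 MB per window
```

Parts b) and c) then scale this figure by retention days and dump-window seconds respectively.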
Exercise 27.4: Baseline Construction
You are given 30 days of SMF type 30 data for the GLEXTRACT job. The elapsed times (in seconds) for those 30 nights are:
2710, 2680, 2745, 2690, 2720, 5430, 2700, 2715, 2695, 2730,
2705, 2740, 2680, 2710, 5510, 2725, 2690, 2700, 2715, 2720,
2710, 5480, 2695, 2700, 2730, 2740, 2690, 2705, 5520, 2710
a) Calculate the mean and standard deviation of the full dataset.
b) The outlier values (~5430-5520 seconds) occur every 7 days. What business pattern likely explains this?
c) Calculate separate baselines for "normal" nights and "weekly processing" nights.
d) Design a dynamic threshold that uses the appropriate baseline depending on the day of week. What threshold percentage would you use for each, and why?
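Your answers to a) and c) can be checked mechanically. A Python sketch; the 4000-second cut that separates the two modes is an arbitrary choice that happens to split this dataset cleanly:

```python
import statistics

elapsed = [2710, 2680, 2745, 2690, 2720, 5430, 2700, 2715, 2695, 2730,
           2705, 2740, 2680, 2710, 5510, 2725, 2690, 2700, 2715, 2720,
           2710, 5480, 2695, 2700, 2730, 2740, 2690, 2705, 5520, 2710]

# Split the bimodal data at 4000 s to separate normal from weekly nights.
normal = [t for t in elapsed if t < 4000]
weekly = [t for t in elapsed if t >= 4000]

mean_all    = statistics.mean(elapsed)
mean_normal = statistics.mean(normal)     # baseline for normal nights
mean_weekly = statistics.mean(weekly)     # baseline for weekly-processing nights
sd_normal   = statistics.pstdev(normal)   # spread of the normal population
```

Note how the full-dataset mean sits between the two modes and describes neither; that is the argument for the separate baselines part c) asks for.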
Exercise 27.5: SMF Analysis Program Enhancement
Extend the BCHPERF program from Section 27.2 to:
- Track EXCP counts per DD name (not just total EXCPs)
- Detect when a specific DD's EXCP count increases by more than 50% over baseline
- Generate a separate alert for I/O anomalies that includes the DD name, actual EXCP count, and baseline EXCP count
- Update the baseline with a weighted rolling average (giving 20% weight to the new observation and 80% to the existing baseline)
Write the additional COBOL paragraphs and any necessary working storage modifications.
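The weighted rolling average in the last requirement is an exponential moving average. A minimal Python sketch of the update rule (the COBOL paragraph would mirror the same arithmetic on packed-decimal fields):

```python
# 20% weight to the new observation, 80% to the existing baseline,
# as specified in Exercise 27.5.
NEW_WEIGHT = 0.20

def update_baseline(baseline: float, observation: float) -> float:
    return (1 - NEW_WEIGHT) * baseline + NEW_WEIGHT * observation

# A single 50%-high night moves the baseline only 10%:
# update_baseline(1000.0, 1500.0) -> 1100.0
```

The 20/80 split damps one-night spikes while still tracking genuine drift over a week or two, which is exactly the property you want before comparing a DD's EXCP count against its baseline.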
Section 27.3 — Batch Monitoring Architecture
Exercise 27.6: Monitoring Level Classification
Classify each of the following monitoring activities by level (1=Real-Time Console, 2=Job-Level, 3=Stream-Level, 4=Historical Trend):
a) Detecting that the GL extract batch stream is 40 minutes behind the SLA milestone.
b) Catching an S0C7 abend in the ACCTPOST job.
c) Noticing that average ACCTPOST elapsed time has increased 18% over the past quarter.
d) Alerting that GLEXTRACT hasn't started even though its predecessor completed 20 minutes ago.
e) Identifying that end-of-quarter processing consistently pushes the batch window to 95% of capacity.
f) Suppressing informational messages from the scheduler to reduce console clutter.
g) Tracking that the statement generation job produces approximately 12% more output each year.
Exercise 27.7: Critical Path Calculation
Given the following batch job dependency network (with estimated durations in minutes):
Job A (15 min) --> Job C (45 min) --> Job F (30 min) --> Job H (20 min)
Job A (15 min) --> Job D (60 min) --> Job G (25 min) --> Job H (20 min)
Job B (10 min) --> Job D (60 min)
Job B (10 min) --> Job E (35 min) --> Job G (25 min)
a) Draw the complete dependency graph.
b) Identify the critical path and calculate the minimum batch window duration.
c) If Job D has been running for 30 minutes (half its expected duration), what is the estimated remaining time to completion of Job H?
d) If Job C abends and requires a 20-minute fix plus restart, does this change the critical path? Show your work.
e) Which job, if optimized, would provide the greatest reduction in overall batch window duration? Why?
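The earliest-finish computation behind parts b) and c) can be checked mechanically. A Python sketch that encodes the dependency network above and backtracks the longest chain:

```python
from functools import lru_cache

durations = {'A': 15, 'B': 10, 'C': 45, 'D': 60,
             'E': 35, 'F': 30, 'G': 25, 'H': 20}
predecessors = {
    'A': [], 'B': [],
    'C': ['A'], 'D': ['A', 'B'], 'E': ['B'],
    'F': ['C'], 'G': ['D', 'E'],
    'H': ['F', 'G'],
}

@lru_cache(maxsize=None)
def earliest_finish(job: str) -> int:
    """Longest-path finish time: a job starts when its last predecessor ends."""
    start = max((earliest_finish(p) for p in predecessors[job]), default=0)
    return start + durations[job]

def critical_path(job: str):
    """Backtrack through the latest-finishing predecessor at each node."""
    preds = predecessors[job]
    if not preds:
        return [job]
    return critical_path(max(preds, key=earliest_finish)) + [job]

window = max(earliest_finish(j) for j in durations)   # minimum window length
```

Running this gives a window of 120 minutes along A, D, G, H, which you can use to verify your hand-drawn answer to part b).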
Exercise 27.8: Monitoring Dashboard Design
Design a batch monitoring dashboard for the HA Banking System. For each panel, specify:
- What data source feeds it (SMF type, scheduler API, system command, etc.)
- What refresh interval is appropriate
- What color coding or visual indicators should be used
- What drill-down capability should be available
Your dashboard should include at minimum:
a) A batch window progress bar showing current position vs. SLA milestones
b) A job status summary (running, complete, waiting, held, failed counts)
c) A critical path view showing the current longest dependency chain
d) A resource utilization panel (CPU, I/O, storage)
e) An alert feed showing recent Tier 1 and Tier 2 alerts
Exercise 27.9: Automation Rule Writing
Write automation product rules (using pseudo-code similar to the CA-OPS/MVS syntax shown in Section 27.3) for the following scenarios:
a) When the STMTGEN (statement generation) job has been running for more than 120 minutes, send a Tier 2 alert to the on-call team and update the batch status dashboard.
b) When any job in the GL- job stream abends, hold all remaining jobs in the GL- stream and send a Tier 1 alert.
c) When JES spool utilization exceeds 80%, identify the top three spool consumers and send an informational alert.
d) When the batch window reaches the 04:00 milestone and the ACCTPOST job has not completed, escalate to the batch operations manager.
Exercise 27.10: Sysplex Monitoring Considerations
The HA Banking System runs in a two-LPAR sysplex (SYS1 and SYS2). Batch jobs may execute on either LPAR depending on workload balancing.
a) How does sysplex execution affect SMF data collection? Where are the SMF records written?
b) How would you ensure that a job-level alert fires even if the monitoring program is running on a different LPAR than the job?
c) Design a cross-LPAR batch stream monitoring approach that provides a unified view regardless of which LPAR each job executes on.
d) What additional failure scenarios exist in a sysplex batch environment that don't exist on a single LPAR? How would you monitor for them?
Section 27.4 — Alerting Strategy
Exercise 27.11: Alert Tier Classification
Classify each of the following events as Tier 1 (Critical), Tier 2 (Warning), Tier 3 (Informational), or Tier 4 (Diagnostic). Justify each classification:
a) The ACCTPOST job (Tier 1 critical job) has been running 75% longer than its dynamic baseline.
b) A new job named ADHOCFIX appeared in the batch window that isn't in the baseline database.
c) The regulatory reporting job completed with RC=4 (acceptable but not clean).
d) DASD volume BATCH3 is at 91% capacity.
e) The ATM file refresh job abended S0C4.
f) The reconciliation job completed successfully but processed 0 records.
g) A non-critical utility job abended with SE37.
h) The batch window critical path estimate shows completion 5 minutes before the SLA deadline.
Exercise 27.12: Dynamic Threshold Algorithm
Design a COBOL algorithm for dynamic thresholds that accounts for:
- Day of week (Monday heavy, Sunday light)
- Month-end processing (last 3 business days)
- Quarter-end processing (cumulative with month-end)
- Year-end processing (cumulative with quarter-end)
- Holiday schedules (day after holiday may be heavier)
- Known application change dates (new releases may change performance profile)
Provide the working storage structures, the threshold calculation paragraph, and the logic for determining which adjustment factors apply. Explain how you would initially calibrate the adjustment factors and how you would maintain them over time.
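One way to structure the "cumulative" adjustments is as multiplicative factors applied on top of the day-of-week baseline. A Python sketch of that shape; every factor value below is a placeholder to be calibrated from history, not a figure from the text:

```python
# Multiplicative adjustment factors for Exercise 27.12. Values are
# illustrative starting points, to be recalibrated from observed history.
FACTORS = {
    "monday":       1.15,
    "month_end":    1.40,
    "quarter_end":  1.20,   # applied on top of month_end
    "year_end":     1.25,   # applied on top of quarter_end
    "post_holiday": 1.10,
}

def dynamic_threshold(baseline_secs: float, conditions: set) -> float:
    """Baseline times the product of every factor whose condition applies."""
    threshold = baseline_secs
    for name, factor in FACTORS.items():
        if name in conditions:
            threshold *= factor
    return threshold

# A month-end night that is also quarter-end:
# dynamic_threshold(1000.0, {"month_end", "quarter_end"}) -> roughly 1680 s
```

Making quarter-end a factor applied on top of month-end (rather than a replacement) captures the "cumulative" behavior the exercise specifies; the COBOL version would carry the factors in a working-storage table keyed by condition flags.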
Exercise 27.13: Alert Fatigue Analysis
A shop receives the following alerts during a typical batch window:
| Tier | Count | Actionable | False Positive | Already Known |
|---|---|---|---|---|
| 1 | 3 | 2 | 1 | 0 |
| 2 | 18 | 5 | 8 | 5 |
| 3 | 47 | 12 | 20 | 15 |
| 4 | 230 | N/A | N/A | N/A |
a) Calculate the overall signal-to-noise ratio for actionable alerts (Tier 1 + Tier 2).
b) What is the false positive rate for Tier 1 and Tier 2 separately?
c) What specific steps would you take to reduce false positives in Tier 2?
d) The "Already Known" category represents alerts for conditions that are expected and documented but not suppressed. How would you address this?
e) If the on-call person spends an average of 5 minutes per Tier 1 alert, 3 minutes per Tier 2 alert, and 1 minute per Tier 3 alert to acknowledge and triage, how much time per night is spent on alert management? What percentage of that time is productive (addressing genuinely actionable items)?
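The arithmetic for parts a) and e) can be checked as follows; this treats "productive" time as time spent on actionable alerts, which is one reasonable reading of part e), not the only one:

```python
# Alert fatigue arithmetic for Exercise 27.13.
# tier: (count, actionable, triage_minutes_each) from the table above.
tiers = {
    1: (3, 2, 5),
    2: (18, 5, 3),
    3: (47, 12, 1),
}

total_minutes = sum(count * mins for count, _, mins in tiers.values())
productive    = sum(act * mins for _, act, mins in tiers.values())

# Signal-to-noise for the actionable tiers (Tier 1 + Tier 2).
signal_ratio = (tiers[1][1] + tiers[2][1]) / (tiers[1][0] + tiers[2][0])
```

Only about a third of Tier 1/2 alerts are actionable, and well under half of the triage time is productive; those two numbers frame the fatigue-reduction discussion in parts c) and d).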
Exercise 27.14: Escalation Path Design
Design a complete escalation path for the HA Banking System. Include:
a) On-call rotation structure (who is primary, secondary, tertiary; rotation schedule)
b) Escalation timing for each tier
c) Management notification criteria (when does the VP of Technology need to know?)
d) Cross-team escalation (when batch operations needs to involve application teams, database teams, or storage teams)
e) Customer communication triggers (at what point does the incident warrant notifying internal or external customers of a potential delay?)
Exercise 27.15: Notification Integration
Write a design specification for integrating mainframe batch alerts with the following enterprise systems:
- PagerDuty for on-call management
- ServiceNow for incident tracking
- Slack for team awareness
- Email for management reporting
For each integration, specify:
a) The communication protocol (API, email, SNMP, etc.)
b) The data payload (what information is sent)
c) The triggering criteria (which alert tiers use this channel)
d) Error handling (what happens if the notification channel is unavailable)
Section 27.5 — On-Call Playbooks
Exercise 27.16: Playbook Creation
Write complete playbook entries (following the structure from Section 27.5.2) for the following scenarios:
a) S322 (CPU Time Limit Exceeded): A batch job has consumed its TIME parameter limit. Consider: when is this a legitimate time limit that needs increasing, versus a program loop?
b) S913 (RACF Authorization Failure): A batch job attempted to access a resource it doesn't have authority for. Consider: deployment issues, changed security profiles, and how to resolve without granting excessive authority.
c) SDSF shows job "stuck" — running for hours with no I/O activity: The job is consuming no CPU, performing no I/O, but hasn't ended. Consider: ENQ waits, operator reply waits, and inter-system communication hangs.
Exercise 27.17: Decision Tree Extension
Extend the top-level decision tree from Section 27.5.3 to include:
- A branch for "job not started" conditions (waiting in input queue, held by scheduler, predecessor incomplete)
- A branch for "job completed but output incorrect" (wrong record count, missing output file, control total mismatch)
- A branch for "system-level issue affecting batch" (high paging, channel busy, LPAR performance degradation)
Draw the complete decision tree and identify the diagnostic commands used at each decision point.
Exercise 27.18: Scenario Walkthrough
Walk through the following incident scenario step by step. At each step, state what you would check, what you expect to find, and what your next action would be.
Scenario: It's 02:45 AM. You receive a Tier 2 alert: "STMTGEN running 85% over dynamic baseline." STMTGEN is the statement generation job; it normally takes 90 minutes and has now been running for 166 minutes. It is a month-end night, and the dynamic baseline for month-end STMTGEN is 140 minutes.
Step through your diagnosis and response. Consider: Is this actually a problem? What information do you need? What are the possible causes? What actions are appropriate? When do you escalate?
Exercise 27.19: Playbook Maintenance Process
Design a process for keeping playbooks current. Address:
a) How often should playbooks be reviewed and by whom?
b) What events should trigger an immediate playbook update?
c) How do you handle playbook entries that reference jobs, programs, or datasets that no longer exist?
d) How do you train new on-call staff on the playbook content?
e) How do you measure whether the playbook is actually being used and whether it's effective?
Exercise 27.20: Cross-Training Exercise
You are designing a cross-training exercise for the HA Banking System operations team. Create a tabletop exercise scenario that:
a) Involves a cascading failure (one initial problem triggers multiple downstream issues)
b) Requires the team to use playbook procedures
c) Includes a decision point where automated recovery fails and human judgment is required
d) Tests the escalation path
e) Results in a post-incident review
Write the scenario script, the expected responses at each stage, and the evaluation criteria.
Section 27.6 — Self-Healing Batch
Exercise 27.21: Restart Eligibility Design
For the HA Banking System, create a restart eligibility table that includes:
- Every job in the nightly cycle (at least 15 jobs)
- For each job: whether automated restart is allowed, maximum restart attempts, eligible abend codes, restart method (from beginning, from step, from checkpoint), and restart delay
- Justification for each decision (why is/isn't automatic restart allowed for each abend code)
Present the table and explain your design rationale.
Exercise 27.22: Pre-Flight Check Implementation
Write a complete COBOL pre-flight check program for the ACCTPOST job in the HA Banking System. Your program should validate:
a) Input file exists and is not empty
b) Input record count is within 20% of the baseline for this day type
c) Control total of transaction amounts matches the header record
d) A random sample of 100 records passes basic field validation (account number is numeric, amount is valid packed decimal, transaction code is in the valid set)
e) Sufficient DASD space is available for the output datasets
f) The prerequisite EXTRACT job completed with RC=0
Provide complete COBOL code including file definitions, working storage, and all validation paragraphs.
Exercise 27.23: Error Routing JCL
Write the complete JCL for a three-step batch job with full error routing:
- STEP010: Pre-flight validation (PREFLIGHT program)
- STEP020: Main processing (MAINPROC program) — only if STEP010 RC=0
- STEP030: Post-processing verification (POSTVER program) — only if STEP020 RC<=4
- ERRSTEP: Error handler (ERRHNDLR program) — if any step fails
- CLEANUP: Always runs regardless of prior step results
Use IF/THEN/ELSE constructs. Include appropriate COND parameters, RESTART parameters, and DD statements. Document every JCL parameter with comments.
Exercise 27.24: Self-Healing Architecture Design
Design a complete self-healing architecture for the HA Banking System's interest calculation batch stream. The stream consists of:
- INTEXTRT: Extract account balances (10 min baseline)
- INTCALC: Calculate interest amounts (45 min baseline)
- INTPOST: Post interest to accounts (20 min baseline)
- INTAUDIT: Generate audit trail (5 min baseline)
- INTRECON: Reconcile totals (10 min baseline)
For each job, specify:
a) Pre-flight checks
b) Automated restart configuration
c) Error routing (what happens if this job fails)
d) Impact on downstream jobs
e) Manual intervention criteria
Design the overall stream so that a failure in any single job results in a known, controlled state that can be recovered from without data loss or double-posting.
Exercise 27.25: Predictive Monitoring Rules
Write five predictive monitoring rules for the HA Banking System that detect problems before they cause failures. For each rule:
a) What condition is being monitored?
b) What data source provides the information?
c) What threshold triggers the rule?
d) What automated action is taken?
e) What is the expected reduction in incidents?
Example: "If DASD volume BATCH1 exceeds 90% utilization at any point during the batch window, trigger an SMS space reclamation, alert the storage team, and log the event. Expected to prevent 2-3 E37 abends per month."
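A minimal shape for such rules pairs a condition on a metrics snapshot with an action name, which keeps the rule set data-driven and easy to review. A Python sketch; the metric names, thresholds, and action labels below are illustrative, not from the text:

```python
# Predictive-rule skeleton for Exercise 27.25: (metric name, condition, action).
# All names and thresholds are hypothetical examples.
RULES = [
    ("dasd_batch1_pct", lambda v: v > 90.0, "trigger-space-reclaim"),
    ("spool_pct",       lambda v: v > 80.0, "report-top-spool-users"),
    ("jobs_behind_sla", lambda v: v >= 1,   "alert-batch-ops"),
]

def evaluate(metrics: dict) -> list:
    """Return the actions fired by the current metrics snapshot."""
    return [action for name, test, action in RULES
            if name in metrics and test(metrics[name])]
```

In production the same structure would live in the automation product's rule tables; the point of the sketch is that each of your five rules should be expressible as one (condition, threshold, action) row.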
Section 27.7 — Post-Incident Review
Exercise 27.26: Five Whys Analysis
Perform a Five Whys analysis for the following incident:
Incident: The regulatory reporting job missed its 06:00 deadline by 45 minutes because it was waiting for the reconciliation job, which was waiting for the GL extract, which was restarted twice after B37 abends on volume GLVOL1.
Trace through all five whys, identify the systemic root cause, and propose at minimum three corrective actions at different layers of defense.
Exercise 27.27: MTTR Calculation and Improvement
Given the following incident data for the HA Banking System over a 30-day period:
| Incident | Detection (min) | Diagnosis (min) | Repair (min) | Verify (min) |
|---|---|---|---|---|
| ACCTPOST S0C7 | 2 | 25 | 15 | 10 |
| GLEXTRACT B37 | 5 | 8 | 12 | 8 |
| STMTGEN slow | 15 | 30 | 0 (resolved itself) | 5 |
| INTCALC S0C4 | 3 | 45 | 30 | 15 |
| RECON mismatch | 20 | 40 | 25 | 20 |
| FEECALC S322 | 2 | 5 | 3 | 5 |
| REGIRPT timeout | 8 | 15 | 20 | 10 |
| ACCTPOST S0C7 | 2 | 10 | 15 | 10 |
a) Calculate MTTR and its components (MTTD, MTTDx, MTTRp, MTTV).
b) Which component contributes most to overall MTTR?
c) Identify the three incidents that would benefit most from improved monitoring, playbooks, or automation. Explain what improvement you would make for each.
d) If you could invest in only one improvement (monitoring, playbooks, or automation), which would give the greatest MTTR reduction? Justify with data.
e) Set MTTR targets for 90 days from now. What specific actions would achieve those targets?
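The component means for part a) fall out of the table directly:

```python
# MTTR arithmetic for the Exercise 27.27 incident table.
# Columns per incident: detection, diagnosis, repair, verification (minutes).
incidents = [
    (2, 25, 15, 10),   # ACCTPOST S0C7
    (5, 8, 12, 8),     # GLEXTRACT B37
    (15, 30, 0, 5),    # STMTGEN slow
    (3, 45, 30, 15),   # INTCALC S0C4
    (20, 40, 25, 20),  # RECON mismatch
    (2, 5, 3, 5),      # FEECALC S322
    (8, 15, 20, 10),   # REGIRPT timeout
    (2, 10, 15, 10),   # ACCTPOST S0C7 (repeat)
]
n = len(incidents)
mttd, mttdx, mttrp, mttv = (sum(col) / n for col in zip(*incidents))
mttr = mttd + mttdx + mttrp + mttv
```

Diagnosis dominates the total, which is the quantitative hook for parts b) and d).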
Exercise 27.28: PIR Template Design
Create a comprehensive Post-Incident Review template specific to the HA Banking System. Include:
- Header information (incident ID, date, severity, systems affected)
- Timeline section (detection through resolution with timestamps)
- Impact assessment (financial, customer, regulatory, operational)
- Root cause analysis section (Five Whys, contributing factors)
- Action items section (corrective, preventive, with owners and dates)
- Metrics section (MTTD, MTTDx, MTTRp, MTTV)
- Lessons learned section
- Approval section (who signs off on the PIR)
Exercise 27.29: Knowledge Base Design
Design the knowledge base structure for the HA Banking System incident history. Specify:
a) The database schema (tables, columns, relationships)
b) Search capabilities (what can users search by?)
c) Integration with the on-call playbook (how does a new PIR update the playbook?)
d) Reporting capabilities (what management reports can be generated?)
e) Retention policy (how long is incident data kept, and in what form?)
Exercise 27.30: Comprehensive Monitoring Assessment
You have been asked to assess the overall batch monitoring maturity of the HA Banking System. Design an assessment framework with:
a) Five maturity levels (1=Reactive, 2=Managed, 3=Defined, 4=Measured, 5=Optimized)
b) Assessment criteria for each level across six dimensions: SMF utilization, alerting, playbooks, automation, post-incident review, and knowledge management
c) A scoring rubric
d) Recommendations for advancing from each level to the next
e) Key metrics that indicate maturity level
Apply your framework to assess a hypothetical shop that has: basic SMF collection but no analysis programs, static alert thresholds with significant false positive rate, informal playbooks maintained by one person, no automated restart, occasional post-incident reviews, and no centralized knowledge base. What maturity level is this shop, and what are the top three priorities for improvement?
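One defensible scoring rule, which your rubric may or may not adopt, is to cap overall maturity at the weakest dimension, on the argument that a single gap (say, no playbooks) undermines the rest. A Python sketch, with the hypothetical shop's description mapped to dimension scores as one possible reading:

```python
# Maturity scoring sketch for Exercise 27.30. The min() rule and the
# dimension scores below are one interpretation, not a prescribed answer.
def maturity_level(scores: dict) -> int:
    """Overall maturity is capped by the weakest of the six dimensions."""
    return min(scores.values())

hypothetical_shop = {
    "smf_utilization": 2,   # collection exists, but no analysis programs
    "alerting":        2,   # static thresholds, high false-positive rate
    "playbooks":       1,   # informal, maintained by one person
    "automation":      1,   # no automated restart
    "pir":             2,   # occasional post-incident reviews
    "knowledge_mgmt":  1,   # no centralized knowledge base
}
```

Under this rule the shop sits at Level 1 (Reactive), and the lowest-scoring dimensions point directly at the top improvement priorities the question asks for.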