Case Study 1 — CNB's Automation Journey: From Manual to Self-Healing
Background
City National Bank processes 2.4 million transactions daily across a two-LPAR z/OS sysplex. The nightly batch window — 11:00 PM to 5:30 AM — runs 847 jobs covering general ledger processing, transaction settlement, regulatory reporting, and customer statement generation. In early 2019, CNB's operations team consisted of 14 operators working three shifts, with the overnight shift staffed by three operators responsible for monitoring batch execution, responding to failures, and coordinating with on-call application teams.
Kwame Asante, Infrastructure Director, had been tracking operational metrics for two years and didn't like what he saw. The numbers told a clear story:
| Metric | 2018 Value |
|---|---|
| Average batch failures per night | 4.7 |
| Average manual interventions per night shift | 22 |
| Mean time to recovery (manual) | 43 minutes |
| Operator errors during recovery | 1.3 per week |
| Late batch completions (after 5:30 AM) | 8 per month |
| Branch network late opens | 3 per quarter |
| Hours spent on manual housekeeping per week | 47 |
The Monday morning incident in January 2019 — three operators spending hours recovering a cascading failure that started with a GDG base hitting its generation limit — was the catalyst. But the data had been screaming for attention for months.
Phase 1: Assessment and Quick Wins (Q1 2019)
Kwame assembled Lisa Cheng (Lead Systems Programmer) and Rob Mueller (Senior Operator/Automation Specialist) into what they called the "Automation Tiger Team." Their first task was a comprehensive assessment.
The Runbook Audit
Rob catalogued every operational procedure the overnight shift performed. He found:
- 73 documented runbook procedures for batch recovery
- 41 undocumented tribal knowledge procedures (things operators "just knew" from experience)
- 28 housekeeping procedures (dataset cleanup, spool management, catalog maintenance)
- 15 notification procedures (who to call, when, for what)
Of the 73 documented procedures, Rob estimated that 58 were deterministic — "If X happens, do Y." No judgment required. These were immediate candidates for automation.
Quick Wins: REXX-Based Automation
Lisa wrote the first wave of automation in three weeks — nine REXX execs targeting the highest-frequency manual tasks. The five with the biggest impact:
1. CNBGDGMGR — GDG management exec. Monitored GDG bases nightly, extended any base within 5 generations of its limit, reported extensions to operations. This directly addressed the root cause of the January incident.
2. CNBSPOOLMGR — Spool management exec. Purged spool output older than 7 days (14 days for financial reports), alerted when spool utilization exceeded 75%.
3. CNBDSKRPT — DASD space report. Generated a daily report of DASD utilization by storage group, flagging volumes above 85%.
4. CNBJOBMON — Job monitoring exec. Ran every 5 minutes during the batch window, checking for jobs that had been executing longer than their expected maximum elapsed time.
5. CNBPREFLT — First-generation pre-flight check. Validated input dataset availability before the GL batch stream began.
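The headroom logic behind CNBGDGMGR can be sketched as follows. This is an illustrative Python model of the decision (the real exec was REXX); the headroom threshold of 5 comes from the case study, while the extension increment is an assumption.

```python
# Hypothetical model of the CNBGDGMGR headroom check. The 5-generation
# headroom matches the case study; EXTEND_BY is an assumed increment.

HEADROOM = 5        # extend when within 5 generations of the limit
EXTEND_BY = 50      # assumed extension amount

def check_gdg_base(limit: int, current_generations: int) -> tuple[str, int]:
    """Return the action for one GDG base: ('ok' | 'extend', new_limit)."""
    remaining = limit - current_generations
    if remaining <= HEADROOM:
        return "extend", limit + EXTEND_BY
    return "ok", limit

# A base with limit 100 and 97 cataloged generations gets extended.
print(check_gdg_base(100, 97))   # ('extend', 150)
print(check_gdg_base(100, 50))   # ('ok', 100)
```

Running this check nightly, before the batch window opens, is what removed the January-incident failure mode: the limit is raised days before any job can hit it.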
Results after three months:
| Metric | 2018 Value | Post-Phase 1 |
|---|---|---|
| Manual interventions per night shift | 22 | 14 |
| Hours on manual housekeeping per week | 47 | 31 |
| Late batch completions per month | 8 | 5 |
Not transformative, but enough to justify continued investment.
Phase 2: JCL Standardization (Q2–Q3 2019)
Lisa tackled the JCL problem next. CNB had 847 production jobs, and every one had unique inline JCL. Some had been cloned and modified for fifteen years. There were 23 different ways to execute a COBOL-DB2 program across the batch environment.
The PROC Library
Lisa designed four base execution PROCs:
| PROC Name | Purpose | Key Parameters |
|---|---|---|
| CNBBATCH | Standard COBOL batch execution | PROG, RUNLIB, REGION, COND |
| CNBDB2BT | COBOL-DB2 batch execution | PROG, PLAN, DBSYS, RUNLIB |
| CNBSRTBT | COBOL batch with sort step | PROG, RUNLIB, SORTWK, SRTPARM |
| CNBUTLBT | Utility execution (IDCAMS, IEBGENER, etc.) | UTIL, PARM, SYSIN |
She then created 34 application-level PROCs that called these base PROCs with application-specific parameters and DD statements.
The Migration
Converting 847 jobs wasn't a weekend project. Lisa's team migrated in waves:
- Wave 1 (8 weeks): 127 GL and settlement jobs — the highest-impact batch stream
- Wave 2 (6 weeks): 203 regulatory reporting jobs
- Wave 3 (10 weeks): 312 customer statement and correspondence jobs
- Wave 4 (8 weeks): 205 remaining jobs (maintenance, ad hoc, low frequency)
Each wave followed the same process: convert JCL to PROC-based, test in parallel (run both old and new, compare results), certify, and cut over. They found 14 bugs in the original JCL during conversion — errors that had been silently producing wrong results for months or years.
Impact on Automation
JCL standardization was a prerequisite for advanced automation. With standardized PROCs:
- Automation rules could be written generically ("any job using CNBDB2BT") rather than per-job
- Recovery procedures were consistent — restarting a CNBDB2BT job followed the same steps regardless of which application it was
- New jobs automatically inherited automation coverage by using standard PROCs
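The "generic rules" point can be made concrete with a small sketch. This Python model (illustrative only — the real rules live in OPS/MVS) keys recovery on the base PROC rather than the job name; the action strings are assumptions:

```python
# Illustrative model of PROC-keyed automation: recovery is chosen by the
# base PROC a job uses, not by the individual job. Action names are
# hypothetical placeholders.

RECOVERY_BY_PROC = {
    "CNBBATCH": "restart_from_failed_step",
    "CNBDB2BT": "check_db2_then_restart_from_failed_step",
    "CNBSRTBT": "delete_sortwork_then_restart",
    "CNBUTLBT": "escalate",   # utilities get human review
}

def recovery_action(base_proc: str) -> str:
    # Any job using a known base PROC inherits coverage automatically;
    # anything nonstandard escalates rather than guessing.
    return RECOVERY_BY_PROC.get(base_proc, "escalate")

print(recovery_action("CNBDB2BT"))  # same action for every CNBDB2BT job
```

Four table entries cover 847 jobs — that is the payoff of standardizing before automating.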
Phase 3: OPS/MVS Deployment (Q4 2019 – Q1 2020)
With standardized JCL and proven REXX automation in place, Kwame approved the purchase of Broadcom OPS/MVS. The decision to use OPS/MVS rather than SA z/OS for operational automation was deliberate — Kwame wanted automation that his operators could understand and maintain, not just his systems programmers.
Rule Development
Rob Mueller led the OPS/MVS rule development. He started with the 58 deterministic runbook procedures and converted them to OPS/MVS rules over four months. His approach:
Month 1: The Big Five. The five most frequent failure scenarios:
- SB37/SD37/SE37 (space abends) — automated dataset extension and restart
- DB2 timeout (U0100) — automated retry with backoff
- Input dataset not available — automated wait-and-retry with scheduler hold
- Long-running job detection — automated alerting with diagnostics
- CICS transaction dump threshold — automated dump cleanup and alerting
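The retry-with-backoff pattern behind the DB2 timeout rule can be sketched in a few lines. This is a Python model of the logic only; `submit_job` is a hypothetical stand-in for the scheduler interface, and the delay values are assumptions.

```python
import time

# Sketch of retry-with-backoff for transient failures such as DB2 timeouts.
# submit_job is a hypothetical callable that resubmits the job and returns
# True on success; base_delay and max_attempts are assumed values.

def retry_with_backoff(submit_job, max_attempts: int = 3,
                       base_delay: float = 60.0) -> bool:
    """Retry a failing job, doubling the wait between attempts.
    Returns True on success, False when retries are exhausted (escalate)."""
    for attempt in range(max_attempts):
        if submit_job():
            return True
        time.sleep(base_delay * (2 ** attempt))   # 60s, 120s, 240s, ...
    return False
```

The key design point is the terminal state: when retries run out, the function returns a clear "escalate" signal instead of looping forever against a contended DB2 subsystem.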
Month 2: Batch Stream Recovery. 15 rules covering automated restart for the GL and settlement batch streams, including conditional restart based on abend code analysis.
Month 3: Housekeeping Automation. 20 rules converting the REXX-based housekeeping (GDG management, spool cleanup, DASD monitoring) into event-driven OPS/MVS rules with richer trigger conditions.
Month 4: Notification and Escalation. 18 rules standardizing how and when operations was notified, replacing ad hoc pager calls with structured escalation based on severity and time of day.
The Governance Framework
After the "quick wins" phase, Lisa insisted on a governance framework before OPS/MVS rules went into production. The framework required:
- Every rule documented with trigger, action, scope, and owner
- Peer review by at least one person who didn't write the rule
- Testing in the QA LPAR with simulated events
- One week of monitor-only mode in production before activation
- Monthly review of rule activity logs
Rob initially resisted — "We're adding bureaucracy to something that should be fast." Three months later, after a mis-scoped rule held 40 jobs for 90 minutes during a batch window, he became the governance framework's strongest advocate.
Results After OPS/MVS Deployment
| Metric | 2018 Value | Post-Phase 1 | Post-Phase 3 |
|---|---|---|---|
| Manual interventions per night | 22 | 14 | 7 |
| Mean time to recovery | 43 min | 38 min | 12 min |
| Operator errors during recovery | 1.3/week | 1.1/week | 0.2/week |
| Late batch completions per month | 8 | 5 | 2 |
| Branch network late opens | 3/quarter | 2/quarter | 0 |
Phase 4: Self-Healing Batch (2021)
Phase 4 was the big leap. Kwame wanted the GL settlement batch stream — CNB's most critical nightly process — to be self-healing. Not just automated restart, but end-to-end self-management: pre-flight validation, conditional routing, automated diagnosis and recovery, post-recovery validation, and intelligent escalation.
Architecture Design
Lisa designed the self-healing architecture with four layers:
Layer 1: Pre-flight Validation. The CNBPREFLT exec (evolved from the Phase 1 version) now performed 14 distinct checks before the GL batch stream launched:
- DB2P subsystem active and accepting connections
- DB2P buffer pool hit ratios above threshold (>95%)
- All 12 input datasets available and not in use
- DASD space available for all output datasets (calculated from average sizes + 20% buffer)
- GDG bases have headroom (>5 generations from limit)
- Predecessor batch streams completed successfully
- Control table BATCH_CONTROL in correct state for new cycle
- CICS regions quiesced for batch window
- MQ queue manager active, queues not backed up
- Spool utilization below 70%
- No active system maintenance (checked against maintenance calendar)
- Previous cycle's archive datasets available for restart comparison
- WLM service class for batch is active with correct goals
- Tape drives available (for archive step)
Each check returned a specific return code. The pre-flight exec aggregated results and set an overall return code:
- RC=0: All checks passed, proceed
- RC=4: Non-critical warnings (e.g., spool at 68%), proceed with monitoring
- RC=8: Remediable issue (e.g., space shortage), trigger auto-fix then retry
- RC=12: Serious issue requiring human review (e.g., DB2 down), hold batch and escalate
- RC=16: Critical issue (e.g., predecessor failed), abort cycle and escalate immediately
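The aggregation itself follows standard z/OS return-code convention: the overall result is the worst individual check. A minimal Python sketch of the logic (the real exec is REXX; the action names are paraphrased from the list above):

```python
# Sketch of the pre-flight aggregation: each check yields 0/4/8/12/16,
# and the overall return code is the maximum (worst) of them.

ACTIONS = {0: "proceed",
           4: "proceed_with_monitoring",
           8: "auto_fix_and_retry",
           12: "hold_and_escalate",
           16: "abort_and_escalate"}

def preflight(check_results: list[int]) -> tuple[int, str]:
    overall = max(check_results, default=0)
    return overall, ACTIONS[overall]

# Spool at 68% (warning) plus a space shortage (remediable) -> auto-fix.
print(preflight([0, 4, 8, 0]))   # (8, 'auto_fix_and_retry')
```

Taking the maximum means one critical check can never be masked by thirteen clean ones.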
Layer 2: Recovery Engine. A REXX exec (CNBRECOV) that served as the central recovery brain. It maintained a recovery table in DB2 with 47 entries mapping abend codes and job contexts to recovery actions. When OPS/MVS detected a batch failure, it called CNBRECOV with the job name, abend code, and step name.
Layer 3: Post-Recovery Validation. After every automated recovery, a validation exec (CNBVALID) verified that the recovered job produced correct output. It checked record counts against control totals, verified output dataset attributes, and compared key balancing figures against the input.
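The validation step reduces to a few comparisons. A minimal Python sketch, assuming record counts must match exactly and the balancing figure is carried in cents (the specific fields are illustrative):

```python
# Sketch of CNBVALID-style post-recovery validation. Field names and the
# cents representation are assumptions; the checks mirror the narrative:
# record counts against control totals, balancing figures against input.

def validate_recovery(output_count: int, control_count: int,
                      output_total_cents: int, input_total_cents: int) -> bool:
    """True only when counts match the control totals exactly and the
    balancing figure (integer cents, to avoid float rounding) agrees."""
    return (output_count == control_count
            and output_total_cents == input_total_cents)

print(validate_recovery(1_204_337, 1_204_337, 9_821_407, 9_821_407))  # True
print(validate_recovery(1_204_337, 1_204_336, 9_821_407, 9_821_407))  # False
```

A failed validation sends the incident back to escalation — an automated recovery that produced wrong output is treated as a failure, not a success.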
Layer 4: Escalation Intelligence. The escalation engine didn't just page someone — it provided context. When escalating, it included: the original failure, all automated recovery attempts and their results, current system state, suggested manual actions, and the relevant runbook section.
The First Live Recovery
The self-healing system's first real test came on March 15, 2021, at 1:23 AM. The GL transaction extract job (CNBGL100) abended with SB37 — the same failure type that caused the January 2019 incident.
This time:
- T+0: OPS/MVS detected the IEF450I abend message
- T+2 sec: Recovery engine identified SB37 on CNBGL100.EXTRACT and looked up the recovery action: EXTEND_SPACE + RESTART_STEP
- T+4 sec: Space extension REXX exec added 500 cylinders to the output dataset
- T+7 sec: Restart issued through the TWS API for CNBGL100 from the EXTRACT step
- T+11 sec: Job restarted execution
- T+14 min: Job completed RC=0
- T+14 min 3 sec: Post-recovery validation passed — record counts matched, balancing figures correct
- T+14 min 5 sec: Successor jobs released, incident logged
Rob was the on-call operator that night. His pager never went off. He found the incident in the morning log and said, "That's the first time in my career I've been happy about not getting woken up."
Tuning and Edge Cases
The first year of self-healing operation revealed edge cases that the initial design didn't handle:
The slow DB2 problem. Jobs weren't abending but were running three times longer than normal due to DB2 lock contention. The self-healing system only triggered on abends. Fix: Added an elapsed-time monitor that detected jobs running beyond 150% of their average elapsed time and investigated the cause (DB2 lock waits, resource contention, etc.).
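The elapsed-time trigger is a simple threshold against history. A Python sketch of the detection logic (the data shapes are assumptions; the 150% factor comes from the fix described above):

```python
# Sketch of the elapsed-time monitor: flag any running job past 150% of
# its historical average elapsed time. Input shapes are illustrative.

THRESHOLD = 1.5

def long_runners(running: dict[str, float],
                 averages: dict[str, float]) -> list[str]:
    """running: jobname -> minutes elapsed so far;
    averages: jobname -> historical average minutes.
    Returns jobs whose cause (lock waits, contention) should be checked."""
    return [job for job, elapsed in running.items()
            if job in averages and elapsed > THRESHOLD * averages[job]]

print(long_runners({"CNBGL100": 46, "CNBGL200": 10},
                   {"CNBGL100": 28, "CNBGL200": 30}))  # ['CNBGL100']
```

Jobs with no history (new jobs) are deliberately skipped here rather than flagged against a baseline that does not exist.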
The partial success. A job completed with RC=4 but produced an output dataset with zero records. Technically not a failure, but functionally useless. Fix: Post-completion validation was added for critical jobs, checking output record counts even when the return code was acceptable.
The Friday night deployment. Application changes deployed on Friday nights occasionally introduced bugs that caused Monday morning batch failures. The self-healing system would restart the job, get the same failure, restart again, hit the retry limit, and escalate at 3 AM. Fix: After two identical failures, the system now checks whether application libraries changed in the last 48 hours. If so, it escalates immediately with "possible deployment issue" rather than exhausting retries.
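The deployment-aware escalation can be sketched as a small decision function. This Python model is illustrative; the two-failure trigger and 48-hour window come from the fix above, while the retry limit is an assumed value:

```python
# Sketch of the "possible deployment issue" decision: after two identical
# failures, check whether application libraries changed recently before
# burning further retries. retry_limit is an assumption.

DEPLOY_WINDOW_HOURS = 48

def next_action(identical_failures: int,
                hours_since_library_change: float,
                retry_limit: int = 3) -> str:
    if (identical_failures >= 2
            and hours_since_library_change <= DEPLOY_WINDOW_HOURS):
        return "escalate: possible deployment issue"
    if identical_failures >= retry_limit:
        return "escalate: retry limit reached"
    return "retry"

print(next_action(2, 30))   # escalates immediately with a useful hint
print(next_action(1, 30))   # first failure still retries normally
```

The value of the early escalation is in the message: the on-call application team gets "possible deployment issue" at the first repeat, instead of a generic page at 3 AM after the retry limit.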
Current State (2024)
Five years after Kwame's directive, CNB's operational automation metrics tell the story:
| Metric | 2018 (Pre) | 2024 (Current) | Improvement |
|---|---|---|---|
| Manual interventions per night | 22 | 3 | 86% reduction |
| Mean time to recovery | 43 min | 6 min | 86% reduction |
| Operator errors during recovery | 1.3/week | 0.05/week | 96% reduction |
| Late batch completions per month | 8 | 0.3 | 96% reduction |
| Branch network late opens per year | 12 | 0 | 100% elimination |
| Automated recovery success rate | 0% | 87% | — |
| Operator headcount (overnight) | 3 | 1 | 67% reduction |
The overnight shift went from three operators to one. The remaining operator isn't less busy — they handle the exceptions that automation escalates, monitor the automation itself, and work on automation improvement projects. Their role transformed from "person who executes procedures" to "person who designs and tunes the system that executes procedures."
Lessons Learned
Kwame, Lisa, and Rob distill their experience into five lessons:
1. Start with data, not tools. Before buying any automation product, instrument your operations. Know where the time goes. Know where the failures cluster. The data tells you where automation will have the most impact.
2. Standardize before automating. JCL standardization (Phase 2) was the least exciting work and the most important. Without standard PROCs, every automation rule would have been job-specific. With standard PROCs, automation rules could be generic.
3. Governance isn't optional — it's the foundation. Every automation disaster they narrowly avoided was caught by the governance framework. Testing, review, monitor-only deployment, and activity auditing are not overhead — they're the safety net.
4. Build for the failure you haven't seen yet. The recovery table handles known failures. The escalation path handles everything else. The most important automation design decision is what happens when automation doesn't know what to do — the answer must always be "escalate to a human," never "do nothing" and never "guess."
5. Automation is a program, not a project. It's never done. New applications bring new failure modes. System upgrades change message formats. Staff turnover means knowledge must be captured in automation, not heads. CNB allocates 20% of Rob's time permanently to automation maintenance and improvement.
Discussion Questions
1. CNB migrated 847 jobs to standardized PROCs over 32 weeks. What risks does this migration introduce, and how would you mitigate them?
2. The self-healing system's recovery table has 47 entries. How do you decide when a new failure mode warrants a recovery table entry versus being left to human escalation?
3. CNB reduced overnight operators from three to one. What are the risks of single-operator overnight coverage, even with comprehensive automation? What safeguards would you implement?
4. The "Friday night deployment" edge case required a custom detection rule. What other deployment-related failure patterns should self-healing automation account for?
5. Kwame's "humans handle exceptions" philosophy assumes that exceptions are rare. What happens to this model as the system grows more complex and the definition of "exception" narrows? Is there a practical limit to automation?