Case Study 1 — CNB's Automation Journey: From Manual to Self-Healing


Background

City National Bank processes 2.4 million transactions daily across a two-LPAR z/OS sysplex. The nightly batch window — 11:00 PM to 5:30 AM — runs 847 jobs covering general ledger processing, transaction settlement, regulatory reporting, and customer statement generation. In early 2019, CNB's operations team consisted of 14 operators working three shifts, with the overnight shift staffed by three operators responsible for monitoring batch execution, responding to failures, and coordinating with on-call application teams.

Kwame Asante, Infrastructure Director, had been tracking operational metrics for two years and didn't like what he saw. The numbers told a clear story:

Metric                                         2018 Value
Average batch failures per night               4.7
Average manual interventions per night shift   22
Mean time to recovery (manual)                 43 minutes
Operator errors during recovery                1.3 per week
Late batch completions (after 5:30 AM)         8 per month
Branch network late opens                      3 per quarter
Hours spent on manual housekeeping per week    47

The Monday morning incident in January 2019 — three operators spending hours recovering a cascading failure that started with a GDG limit — was the catalyst. But the data had been screaming for attention for months.


Phase 1: Assessment and Quick Wins (Q1 2019)

Kwame assembled Lisa Cheng (Lead Systems Programmer) and Rob Mueller (Senior Operator/Automation Specialist) into what they called the "Automation Tiger Team." Their first task was a comprehensive assessment.

The Runbook Audit

Rob catalogued every operational procedure the overnight shift performed. He found:

  • 73 documented runbook procedures for batch recovery
  • 41 undocumented tribal knowledge procedures (things operators "just knew" from experience)
  • 28 housekeeping procedures (dataset cleanup, spool management, catalog maintenance)
  • 15 notification procedures (who to call, when, for what)

Of the 73 documented procedures, Rob estimated that 58 were deterministic — "If X happens, do Y." No judgment required. These were immediate candidates for automation.

Quick Wins: REXX-Based Automation

Lisa wrote the first wave of automation in three weeks: nine REXX execs targeting the highest-frequency manual tasks. The five with the greatest impact:

1. CNBGDGMGR — GDG management exec. Monitored GDG bases nightly, extended any base within 5 generations of its limit, reported extensions to operations. This directly addressed the root cause of the January incident.

2. CNBSPOOLMGR — Spool management exec. Purged spool output older than 7 days (14 days for financial reports), alerted when spool utilization exceeded 75%.

3. CNBDSKRPT — DASD space report. Generated a daily report of DASD utilization by storage group, flagging volumes above 85%.

4. CNBJOBMON — Job monitoring exec. Ran every 5 minutes during the batch window, checking for jobs that had been executing longer than their expected maximum elapsed time.

5. CNBPREFLT — First-generation pre-flight check. Validated input dataset availability before the GL batch stream began.
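
The CNBGDGMGR headroom check is the simplest of these to illustrate. The following is a sketch in Python (the actual exec was REXX, and the extension increment shown is a hypothetical value, not CNB's):

```python
# Illustrative sketch of the CNBGDGMGR decision: extend any GDG base that is
# within 5 generations of its limit, and leave the rest alone.
HEADROOM = 5     # minimum free generations before an extension is triggered
EXTEND_BY = 50   # hypothetical extension increment

def gdg_action(current_generations, limit):
    """Return the new limit if an extension is needed, else None."""
    if limit - current_generations < HEADROOM:
        return limit + EXTEND_BY
    return None

# A base at 97 of 100 generations is inside the headroom and gets extended;
# a base at 50 of 100 is left alone.
```

The point of the sketch is the shape of a deterministic rule: one measurable condition, one predetermined action, no judgment required.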

Results after three months:

Metric                                  2018 Value   Post-Phase 1
Manual interventions per night shift    22           14
Hours on manual housekeeping per week   47           31
Late batch completions per month        8            5

Not transformative, but enough to justify continued investment.


Phase 2: JCL Standardization (Q2–Q3 2019)

Lisa tackled the JCL problem next. CNB had 847 production jobs, and every one had unique inline JCL. Some had been cloned and modified for fifteen years. There were 23 different ways to execute a COBOL-DB2 program across the batch environment.

The PROC Library

Lisa designed four base execution PROCs:

PROC Name   Purpose                                      Key Parameters
CNBBATCH    Standard COBOL batch execution               PROG, RUNLIB, REGION, COND
CNBDB2BT    COBOL-DB2 batch execution                    PROG, PLAN, DBSYS, RUNLIB
CNBSRTBT    COBOL batch with sort step                   PROG, RUNLIB, SORTWK, SRTPARM
CNBUTLBT    Utility execution (IDCAMS, IEBGENER, etc.)   UTIL, PARM, SYSIN

She then created 34 application-level PROCs that called these base PROCs with application-specific parameters and DD statements.

The Migration

Converting 847 jobs wasn't a weekend project. Lisa's team migrated in waves:

  • Wave 1 (8 weeks): 127 GL and settlement jobs — the highest-impact batch stream
  • Wave 2 (6 weeks): 203 regulatory reporting jobs
  • Wave 3 (10 weeks): 312 customer statement and correspondence jobs
  • Wave 4 (8 weeks): 205 remaining jobs (maintenance, ad hoc, low frequency)

Each wave followed the same process: convert JCL to PROC-based, test in parallel (run both old and new, compare results), certify, and cut over. They found 14 bugs in the original JCL during conversion — errors that had been silently producing wrong results for months or years.

Impact on Automation

JCL standardization was a prerequisite for advanced automation. With standardized PROCs:

  • Automation rules could be written generically ("any job using CNBDB2BT") rather than per-job
  • Recovery procedures were consistent — restarting a CNBDB2BT job followed the same steps regardless of which application it was
  • New jobs automatically inherited automation coverage by using standard PROCs
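
The mechanics of generic, PROC-scoped coverage can be sketched as follows (Python stand-in; the rule names are hypothetical, not CNB's actual rule set):

```python
# Illustrative sketch of PROC-scoped automation: one generic rule covers every
# job that executes through a given base PROC, so new jobs inherit coverage
# simply by using the standard PROC.
RULES_BY_PROC = {
    "CNBDB2BT": "db2-batch-recovery",
    "CNBBATCH": "standard-batch-recovery",
}

def rule_for_job(base_proc):
    """Return the generic recovery rule for a job's base PROC, if any."""
    return RULES_BY_PROC.get(base_proc)
```

A brand-new job coded against CNBDB2BT is covered the moment it is cataloged, with no per-job rule work.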

Phase 3: OPS/MVS Deployment (Q4 2019 – Q1 2020)

With standardized JCL and proven REXX automation in place, Kwame approved the purchase of Broadcom OPS/MVS. The decision to use OPS/MVS rather than SA z/OS for operational automation was deliberate — Kwame wanted automation that his operators could understand and maintain, not just his systems programmers.

Rule Development

Rob Mueller led the OPS/MVS rule development. He started with the 58 deterministic runbook procedures and converted them to OPS/MVS rules over four months. His approach:

Month 1: The Big Five. The five most frequent failure scenarios:

  1. SB37/SD37/SE37 (space abends) — automated dataset extension and restart
  2. DB2 timeout (U0100) — automated retry with backoff
  3. Input dataset not available — automated wait-and-retry with scheduler hold
  4. Long-running job detection — automated alerting with diagnostics
  5. CICS transaction dump threshold — automated dump cleanup and alerting
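
The second of the Big Five, retry with backoff, follows a standard pattern. A minimal sketch (Python stand-in for the OPS/MVS rule logic; the delay values are assumptions):

```python
import time

def retry_with_backoff(run_job, max_attempts=3, base_delay=30):
    """Sketch of the DB2-timeout retry rule: rerun the failed step with an
    increasing delay between attempts, escalating once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        rc = run_job()
        if rc == 0:
            return "recovered"
        time.sleep(base_delay * 2 ** (attempt - 1))  # e.g. 30s, 60s, 120s
    return "escalate"
```

The backoff matters for timeouts specifically: an immediate retry usually hits the same lock holder, while a delayed retry gives the contending unit of work time to commit.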

Month 2: Batch Stream Recovery. 15 rules covering automated restart for the GL and settlement batch streams, including conditional restart based on abend code analysis.

Month 3: Housekeeping Automation. 20 rules converting the REXX-based housekeeping (GDG management, spool cleanup, DASD monitoring) into event-driven OPS/MVS rules with richer trigger conditions.

Month 4: Notification and Escalation. 18 rules standardizing how and when operations was notified, replacing ad hoc pager calls with structured escalation based on severity and time of day.

The Governance Framework

After the "quick wins" phase, Lisa insisted on a governance framework before OPS/MVS rules went into production. The framework required:

  • Every rule documented with trigger, action, scope, and owner
  • Peer review by at least one person who didn't write the rule
  • Testing in the QA LPAR with simulated events
  • One week of monitor-only mode in production before activation
  • Monthly review of rule activity logs

Rob initially resisted — "We're adding bureaucracy to something that should be fast." Three months later, after an incorrectly scoped rule held 40 jobs for 90 minutes during a batch window, he became the governance framework's strongest advocate.

Results After OPS/MVS Deployment

Metric                             2018 Value   Post-Phase 1   Post-Phase 3
Manual interventions per night     22           14             7
Mean time to recovery              43 min       38 min         12 min
Operator errors during recovery    1.3/week     1.1/week       0.2/week
Late batch completions per month   8            5              2
Branch network late opens          3/quarter    2/quarter      0

Phase 4: Self-Healing Batch (2021)

Phase 4 was the big leap. Kwame wanted the GL settlement batch stream — CNB's most critical nightly process — to be self-healing. Not just automated restart, but end-to-end self-management: pre-flight validation, conditional routing, automated diagnosis and recovery, post-recovery validation, and intelligent escalation.

Architecture Design

Lisa designed the self-healing architecture with four layers:

Layer 1: Pre-flight Validation. The CNBPREFLT exec (evolved from the Phase 1 version) now performed 14 distinct checks before the GL batch stream launched:

  1. DB2P subsystem active and accepting connections
  2. DB2P buffer pool hit ratios above threshold (>95%)
  3. All 12 input datasets available and not in use
  4. DASD space available for all output datasets (calculated from average sizes + 20% buffer)
  5. GDG bases have headroom (>5 generations from limit)
  6. Predecessor batch streams completed successfully
  7. Control table BATCH_CONTROL in correct state for new cycle
  8. CICS regions quiesced for batch window
  9. MQ queue manager active, queues not backed up
  10. Spool utilization below 70%
  11. No active system maintenance (checked against maintenance calendar)
  12. Previous cycle's archive datasets available for restart comparison
  13. WLM service class for batch is active with correct goals
  14. Tape drives available (for archive step)

Each check returned a specific return code. The pre-flight exec aggregated results and set an overall return code:

  • RC=0: All checks passed, proceed
  • RC=4: Non-critical warnings (e.g., spool at 68%), proceed with monitoring
  • RC=8: Remediable issue (e.g., space shortage), trigger auto-fix then retry
  • RC=12: Serious issue requiring human review (e.g., DB2 down), hold batch and escalate
  • RC=16: Critical issue (e.g., predecessor failed), abort cycle and escalate immediately
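
The aggregation rule itself is simple: the overall return code is the worst (highest) code reported by any individual check. A sketch in Python (a stand-in for the REXX exec; the check lambdas and action labels are illustrative):

```python
# Sketch of the pre-flight aggregation: run every check, take the maximum
# return code, and map it to the corresponding disposition.
ACTIONS = {0: "proceed", 4: "proceed-with-monitoring",
           8: "auto-fix-and-retry", 12: "hold-and-escalate",
           16: "abort-and-escalate"}

def preflight(checks):
    """checks: callables, each returning one of 0/4/8/12/16."""
    overall = max(check() for check in checks)
    return overall, ACTIONS[overall]
```

One consequence worth noting: a single RC=16 check dominates any number of clean results, so the batch never launches on a partially valid environment.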

Layer 2: Recovery Engine. A REXX exec (CNBRECOV) that served as the central recovery brain. It maintained a recovery table in DB2 with 47 entries mapping abend codes and job contexts to recovery actions. When OPS/MVS detected a batch failure, it called CNBRECOV with the job name, abend code, and step name.
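
The lookup at the heart of CNBRECOV can be sketched as a table keyed by abend code and job-name pattern (Python stand-in; the real table lived in DB2 with 47 entries, and the entries and action names below are hypothetical):

```python
from fnmatch import fnmatch

# Illustrative recovery table: (abend code, job-name pattern) -> ordered actions.
RECOVERY_TABLE = [
    (("SB37", "CNBGL*"), ["EXTEND_SPACE", "RESTART_STEP"]),
    (("U0100", "CNB*"),  ["RETRY_WITH_BACKOFF"]),
]

def lookup_recovery(abend_code, job_name):
    """Return the recovery actions for the first matching entry, else None.
    None means: no known recovery — escalate to a human."""
    for (code, pattern), actions in RECOVERY_TABLE:
        if code == abend_code and fnmatch(job_name, pattern):
            return actions
    return None
```

Ordering the table from most to least specific pattern keeps narrowly scoped recoveries from being shadowed by broad catch-all entries.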

Layer 3: Post-Recovery Validation. After every automated recovery, a validation exec (CNBVALID) verified that the recovered job produced correct output. It checked record counts against control totals, verified output dataset attributes, and compared key balancing figures against the input.
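
The validation checks described above reduce to a few comparisons. A minimal sketch (Python stand-in for CNBVALID; the parameter names are illustrative, not CNB's actual field names):

```python
def validate_recovery(record_count, control_total, debits, credits,
                      tolerance=0.0):
    """Sketch of post-recovery validation: output record count must match the
    control total, and balancing figures must net within tolerance.
    Returns a list of failure reasons; an empty list means validation passed."""
    failures = []
    if record_count != control_total:
        failures.append("record count mismatch")
    if abs(debits - credits) > tolerance:
        failures.append("balancing figures do not match")
    return failures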

Layer 4: Escalation Intelligence. The escalation engine didn't just page someone — it provided context. When escalating, it included: the original failure, all automated recovery attempts and their results, current system state, suggested manual actions, and the relevant runbook section.

The First Live Recovery

The self-healing system's first real test came on March 15, 2021, at 1:23 AM. The GL transaction extract job (CNBGL100) abended with SB37 — the same failure type that caused the January 2019 incident.

This time:

  • T+0: OPS/MVS detected the IEF450I abend message
  • T+2 sec: Recovery engine identified SB37 on CNBGL100.EXTRACT, looked up the recovery action: EXTEND_SPACE + RESTART_STEP
  • T+4 sec: Space extension REXX exec added 500 cylinders to the output GDG
  • T+7 sec: Restart issued through TWS API for CNBGL100 from EXTRACT step
  • T+11 sec: Job restarted execution
  • T+14 min: Job completed RC=0
  • T+14 min 3 sec: Post-recovery validation passed — record counts matched, balancing figures correct
  • T+14 min 5 sec: Successor jobs released, incident logged

Rob was the on-call operator that night. His pager never went off. He found the incident in the morning log and said, "That's the first time in my career I've been happy about not getting woken up."

Tuning and Edge Cases

The first year of self-healing operation revealed edge cases that the initial design didn't handle:

The slow DB2 problem. Jobs weren't abending but were running three times longer than normal due to DB2 lock contention. The self-healing system only triggered on abends. Fix: Added an elapsed-time monitor that detected jobs running beyond 150% of their average elapsed time and investigated the cause (DB2 lock waits, resource contention, etc.).
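
The elapsed-time check added for this case is a one-line predicate. Sketched in Python (a stand-in for the monitor exec; the 150% threshold is the one stated above):

```python
def elapsed_alert(elapsed_minutes, average_minutes, threshold=1.5):
    """Sketch of the slow-job monitor: flag any job running beyond 150% of
    its historical average elapsed time, even though it has not abended."""
    return elapsed_minutes > threshold * average_minutes
```

The key design point is that the trigger is relative to each job's own history, not a fixed wall-clock limit, so a 10-minute job and a 3-hour job are held to proportionally equivalent standards.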

The partial success. A job completed with RC=4 but produced an output dataset with zero records. Technically not a failure, but functionally useless. Fix: Post-completion validation was added for critical jobs, checking output record counts even when the return code was acceptable.

The Friday night deployment. Application changes deployed on Friday nights occasionally introduced bugs that caused Monday morning batch failures. The self-healing system would restart the job, get the same failure, restart again, hit the retry limit, and escalate at 3 AM. Fix: After two identical failures, the system now checks whether application libraries changed in the last 48 hours. If so, it escalates immediately with "possible deployment issue" rather than exhausting retries.
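
The deployment-aware retry rule can be sketched as follows (Python stand-in; the function and parameter names are illustrative, and the 48-hour window and retry limit are the values stated above):

```python
from datetime import datetime, timedelta

def escalation_reason(identical_failures, library_changed_at, now,
                      retry_limit=3, change_window=timedelta(hours=48)):
    """Sketch of the deployment-aware retry decision: after two identical
    failures, a library change inside the lookback window short-circuits
    further retries and escalates with a deployment hint."""
    if identical_failures >= 2 and library_changed_at is not None:
        if now - library_changed_at <= change_window:
            return "possible deployment issue"
    if identical_failures >= retry_limit:
        return "retry limit reached"
    return "continue retries"
```

The effect is exactly the fix described: a 3 AM page that would have said "retries exhausted" instead arrives an hour earlier and points the on-call straight at the Friday deployment.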


Current State (2024)

Five years after Kwame's directive, CNB's operational automation metrics tell the story:

Metric                               2018 (Pre)   2024 (Current)   Improvement
Manual interventions per night       22           3                86% reduction
Mean time to recovery                43 min       6 min            86% reduction
Operator errors during recovery      1.3/week     0.05/week        96% reduction
Late batch completions per month     8            0.3              96% reduction
Branch network late opens per year   12           0                100% elimination
Automated recovery success rate      0%           87%              n/a (new capability)
Operator headcount (overnight)       3            1                67% reduction

The overnight shift went from three operators to one. The remaining operator isn't less busy — they handle the exceptions that automation escalates, monitor the automation itself, and work on automation improvement projects. Their role transformed from "person who executes procedures" to "person who designs and tunes the system that executes procedures."

Lessons Learned

Kwame, Lisa, and Rob distill their experience into five lessons:

1. Start with data, not tools. Before buying any automation product, instrument your operations. Know where the time goes. Know where the failures cluster. The data tells you where automation will have the most impact.

2. Standardize before automating. JCL standardization (Phase 2) was the least exciting work and the most important. Without standard PROCs, every automation rule would have been job-specific. With standard PROCs, automation rules could be generic.

3. Governance isn't optional — it's the foundation. Every automation disaster they narrowly avoided was caught by the governance framework. Testing, review, monitor-only deployment, and activity auditing are not overhead — they're the safety net.

4. Build for the failure you haven't seen yet. The recovery table handles known failures. The escalation path handles everything else. The most important automation design decision is what happens when automation doesn't know what to do — the answer must always be "escalate to a human," never "do nothing" and never "guess."

5. Automation is a program, not a project. It's never done. New applications bring new failure modes. System upgrades change message formats. Staff turnover means knowledge must be captured in automation, not heads. CNB allocates 20% of Rob's time permanently to automation maintenance and improvement.


Discussion Questions

  1. CNB migrated 847 jobs to standardized PROCs over 32 weeks. What risks does this migration introduce, and how would you mitigate them?

  2. The self-healing system's recovery table has 47 entries. How do you decide when a new failure mode warrants a recovery table entry versus being left to human escalation?

  3. CNB reduced overnight operators from three to one. What are the risks of single-operator overnight coverage, even with comprehensive automation? What safeguards would you implement?

  4. The "Friday night deployment" edge case required a custom detection rule. What other deployment-related failure patterns should self-healing automation account for?

  5. Kwame's "humans handle exceptions" philosophy assumes that exceptions are rare. What happens to this model as the system grows more complex and the definition of "exception" narrows? Is there a practical limit to automation?