Case Study 1: CNB's Batch Operations Center and Rob's Playbook
Background
City National Bank's batch operations center occupies a windowless room on the third floor of their downtown data center. Two operator consoles face a wall of monitors. A laminated decision tree hangs between them. A whiteboard on the far wall tracks the nightly batch cycle with magnetic markers — green for complete, yellow for running, red for failed, white for waiting. Rob Fielding has been moving those markers for nineteen years.
But the batch environment Rob manages today bears little resemblance to the one he inherited. When Rob took over as batch operations lead, CNB's nightly monitoring consisted of an overnight operator watching the console and calling Rob when something looked wrong. "Something looked wrong" was entirely at the operator's discretion. There were no documented thresholds, no playbooks, and no automation. Rob's cell phone was the monitoring system.
After the ACCTPOST incident described at the opening of this chapter — and the $33,000 in direct costs plus four hours of reconciliation work — Rob, Lisa Nakamura, and Kwame Asante undertook a complete redesign of CNB's batch monitoring infrastructure.
The Assessment
Rob started by cataloging every batch incident from the previous twelve months. He pulled the data from three sources: the operator shift logs (handwritten, inconsistent, sometimes illegible), the scheduler's job history (which jobs failed and when), and his own text message history (which told him what he was called about at 3am).
The results were sobering:
| Category | Incidents/Year | Avg Resolution (min) | Total Downtime (hrs) |
|---|---|---|---|
| Data exception (S0C7) | 47 | 42 | 33 |
| Space abend (x37) | 31 | 28 | 14 |
| Module not found (S806) | 12 | 55 | 11 |
| Performance degradation | 23 | 65 | 25 |
| Security failure (S913) | 8 | 35 | 5 |
| Application abend (Uxxxx) | 38 | 48 | 30 |
| Other system abend | 15 | 52 | 13 |
| Total | 174 | 46 (avg) | 131 |
One hundred and seventy-four incidents per year — roughly one every two nights. Average resolution time of 46 minutes. Total batch downtime of 131 hours. Rob had been treating each incident as an independent event. The data told a different story: the same failure modes were recurring, and many were preventable.
Phase 1: SMF Infrastructure
The first investment was in SMF. CNB had been collecting types 30, 14/15, and 42, but dumping them to tape and shipping them offsite. Nobody analyzed them. Rob worked with Kwame (the systems programmer) to build the analysis pipeline:
- SMF Dump: Every 2 hours during the batch window (midnight to 6am), an automated job dumps SMF datasets to a staging area on DASD.
- SMF Load: A COBOL program (SMFLOAD) reads the dumped SMF records, parses the self-defining sections, and loads relevant fields into three DB2 tables:
  - BATCH_JOB_PERF — one row per job step per night (from type 30 subtypes 3 and 4)
  - BATCH_DS_ACTIVITY — one row per dataset access (from types 14 and 15)
  - BATCH_SPACE_EVENTS — one row per space management event (from type 42)
- Baseline Build: A weekly COBOL batch job (BLDBSELN) calculates rolling baselines for every job step: average elapsed time, standard deviation, average CPU, average EXCPs, with separate baselines for day-of-week and month-end.
- Analysis: The BCHPERF program (shown in the chapter) runs at the end of each batch window, comparing the night's actuals against baselines and generating the performance report.
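The baseline calculation in BLDBSELN can be sketched as follows. The production program is COBOL against DB2; this Python version uses hypothetical field names standing in for the BATCH_JOB_PERF columns, and mirrors the day-of-week and month-end separation described above:

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Compute rolling baselines for one job step from nightly samples.

    samples: list of dicts with 'elapsed_sec', 'cpu_sec', 'excps',
    'day_of_week' (0=Mon..6=Sun), and 'is_month_end' -- hypothetical
    field names standing in for the DB2 BATCH_JOB_PERF columns.
    Returns baselines keyed by (day_of_week, is_month_end).
    """
    groups = {}
    for s in samples:
        key = (s["day_of_week"], s["is_month_end"])
        groups.setdefault(key, []).append(s)

    baselines = {}
    for key, grp in groups.items():
        elapsed = [s["elapsed_sec"] for s in grp]
        baselines[key] = {
            "avg_elapsed": mean(elapsed),
            # stdev needs at least two samples
            "sd_elapsed": stdev(elapsed) if len(elapsed) > 1 else 0.0,
            "avg_cpu": mean(s["cpu_sec"] for s in grp),
            "avg_excps": mean(s["excps"] for s in grp),
        }
    return baselines
```

The grouping key is what makes Monday runs and month-end runs compare against their own history rather than against the midweek norm.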
The SMF infrastructure took Kwame three weeks to build and Rob two weeks to calibrate the baselines. The immediate payoff: they could see trends that had been invisible. The GLEXTRACT job had been degrading 2% per month for eight months — a 16% cumulative slowdown that was slowly consuming batch window slack. Nobody had noticed because the change was gradual and the job hadn't abended.
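A simple trend check over monthly average elapsed times would have surfaced the GLEXTRACT drift long before it mattered. A sketch (illustrative, not CNB's actual code) using a least-squares slope expressed as percent per month relative to the first month:

```python
def monthly_drift_pct(monthly_avgs):
    """Estimate percent-per-month drift in a series of monthly average
    elapsed times, via the least-squares slope relative to month one.
    A sketch of the kind of check that would flag GLEXTRACT's gradual
    2%/month slowdown before it consumed the batch window slack.
    """
    n = len(monthly_avgs)
    if n < 2:
        return 0.0
    xs = range(n)
    mx = sum(xs) / n
    my = sum(monthly_avgs) / n
    # standard least-squares slope: cov(x, y) / var(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, monthly_avgs))
             / sum((x - mx) ** 2 for x in xs))
    return 100.0 * slope / monthly_avgs[0]
```

A nightly report could flag any job whose drift exceeds, say, 1% per month, even when every individual run still finishes within its static threshold.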
Phase 2: Alerting Architecture
With data flowing, Rob designed the alerting tiers. He started by classifying every batch job:
Tier 1 Critical (37 jobs):
- Account posting chain: ACCTEXTRT, ACCTVALID, ACCTPOST, ACCTAUDIT
- GL chain: GLEXTRACT, GLTRANSF, GLBALANCE, GLREPORT
- ATM/card chain: ATMEXTRT, ATMREFRESH, CARDBATCH
- Regulatory chain: REGDAILY, REGCOMPLY
- Statement chain: STMTGEN, STMTPRINT, STMTARCH
- All checkpoint/restart-critical jobs

Tier 2 Important (58 jobs):
- Loan processing: LOANPOST, LOANINTCL, LOANAGING
- Customer analytics: CUSTPROF, CUSTSEGM
- Internal reporting: MGMTRPT, RISKREPT
- Utility jobs: VSAMREORG, DBBACKUP, ARCHMGMT

Tier 3 Routine (89 jobs):
- Data extracts for downstream systems
- Housekeeping and cleanup jobs
- Non-time-critical reporting
- Development and test support jobs
For each tier, Rob defined static thresholds and then worked with Lisa to implement dynamic adjustments. The dynamic threshold program considers:
- Day of week: Monday baselines are 12% higher than midweek (weekend transaction accumulation)
- Month-end: Last three business days use separate baselines (typically 60-100% higher for financial posting jobs)
- Quarter-end: Cumulative 30% increase on top of month-end
- Year-end: Cumulative 50% on top of quarter-end
- Post-holiday: The day after a bank holiday uses Monday-equivalent baselines
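These calendar adjustments compose multiplicatively. A sketch of the logic using the percentages above; the 1.8 month-end factor is an illustrative midpoint of the 60-100% range, which in practice is set per job:

```python
def dynamic_multiplier(is_monday, is_post_holiday, is_month_end,
                       is_quarter_end, is_year_end,
                       month_end_factor=1.8):
    """Adjustment factor applied to the midweek baseline.

    month_end_factor: 1.6-2.0 depending on the job (60-100% higher);
    1.8 here is an illustrative midpoint, not a documented CNB value.
    """
    m = 1.0
    if is_monday or is_post_holiday:
        m *= 1.12            # weekend/holiday transaction accumulation
    if is_month_end:
        m *= month_end_factor
        if is_quarter_end:
            m *= 1.30        # cumulative 30% on top of month-end
            if is_year_end:
                m *= 1.50    # cumulative 50% on top of quarter-end
    return m
```

The night's dynamic baseline is then the stored midweek baseline times this factor, and alert thresholds are applied against that product.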
Rob's threshold percentages:
| Job Tier | Alert Level | Threshold Over Dynamic Baseline |
|---|---|---|
| 1 | Tier 1 Alert | 30% over baseline |
| 1 | Tier 2 Alert | 50% over baseline |
| 2 | Tier 1 Alert | 50% over baseline |
| 2 | Tier 2 Alert | 75% over baseline |
| 3 | Tier 3 Alert | 100% over baseline |
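The table reads naturally as a lookup keyed by job tier and alert level. A sketch of that lookup (alert-level labels copied from the table; the function name is hypothetical):

```python
# (job_tier, alert_level) -> percent over dynamic baseline that triggers it
THRESHOLDS = {
    (1, "Tier 1 Alert"): 30,
    (1, "Tier 2 Alert"): 50,
    (2, "Tier 1 Alert"): 50,
    (2, "Tier 2 Alert"): 75,
    (3, "Tier 3 Alert"): 100,
}

def alerts_for(job_tier, pct_over_baseline):
    """Return every alert level triggered for a job at the given
    percent deviation over its dynamic baseline, per Rob's table."""
    return sorted(level
                  for (tier, level), pct in THRESHOLDS.items()
                  if tier == job_tier and pct_over_baseline >= pct)
```

A Tier 3 job running 50% over baseline raises nothing, while the same deviation on a Tier 1 job fires an alert; that asymmetry is the whole point of the classification exercise.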
The notification routing:
- Tier 1 Alert: Phone call to on-call primary (Rob or his backup, Frank). If no acknowledgment in 10 minutes, phone call to secondary (Lisa or Kwame). If no acknowledgment in 20 minutes, phone call to IT Director.
- Tier 2 Alert: Text message to on-call primary. If no acknowledgment in 20 minutes, phone call to on-call primary. If no acknowledgment in 30 minutes, escalation to IT Director.
- Tier 3 Alert: Email to batch operations queue. Review next business day.
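The routing amounts to a per-alert-level escalation schedule. A sketch of how it might be expressed (timings, channels, and targets come from the list above; the data structure and names are illustrative):

```python
# alert level -> list of (minutes since alert, channel, target)
ESCALATION = {
    "Tier 1 Alert": [
        (0,  "phone", "on-call primary"),
        (10, "phone", "on-call secondary"),
        (20, "phone", "IT Director"),
    ],
    "Tier 2 Alert": [
        (0,  "text",  "on-call primary"),
        (20, "phone", "on-call primary"),
        (30, "phone", "IT Director"),
    ],
    "Tier 3 Alert": [
        (0,  "email", "batch operations queue"),
    ],
}

def actions_due(alert_level, minutes_unacknowledged):
    """All escalation steps due for an alert that has gone
    unacknowledged for the given number of minutes."""
    return [(channel, target)
            for t, channel, target in ESCALATION[alert_level]
            if minutes_unacknowledged >= t]
```

Acknowledgment stops the clock; anything still unacknowledged keeps walking down its schedule until someone picks up.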
The notification system itself is a REXX exec running under CA-OPS/MVS that formats alert details and sends them through the enterprise notification platform (which handles phone, text, and email delivery). Rob insisted on phone calls for Tier 1: "Text messages get lost. You can sleep through a text. You can't sleep through a phone that won't stop ringing."
Phase 3: The Playbook
Rob spent two months documenting his knowledge. Not the easy stuff — the hard stuff. The tribal knowledge that existed only in his head. He sat with Lisa and dictated procedures while she typed:
- "When ACCTPOST abends S0C7, the first thing you check is the control total on the extract file. If it doesn't match the online system's total, the problem is in the extract, not the posting."
- "When GLBALANCE shows an out-of-balance condition, don't restart it. Check whether GLTRANSF processed the same number of records that GLEXTRACT produced. If there's a discrepancy, the problem is in GLTRANSF."
- "When STMTGEN runs long, check the customer count. If it's normal, check for VSAM CI splits on the account master. If the CI split count is over 500, you need a reorg, but you can't reorg during the batch window — flag it for Saturday maintenance."
The playbook grew to 47 entries covering specific failure scenarios. Each entry followed the standard format: trigger, impact, diagnostics, resolution, escalation, post-resolution. Rob organized them by job name and by abend code, with cross-references.
Lisa converted the playbook to a searchable wiki — accessible from the operator consoles and from Rob's laptop over VPN. When an alert fires, the wiki link for the matching playbook entry is included in the notification message. The overnight operator can click the link and have the diagnostic procedure on screen within seconds.
Phase 4: Automation and Self-Healing
With monitoring and playbooks in place, Rob and Kwame implemented automated recovery for the most common, well-understood failure modes:
Automated Restart Rules:
- Space abends (x37): Automatic restart with a 5-minute delay, maximum 2 attempts. The delay gives SMS time to attempt space reclamation.
- Application transient errors (U0100-U0199): Automatic restart from checkpoint, 3-minute delay, maximum 2 attempts.
- Virtual storage shortage (S878): Automatic restart with a 10-minute delay, maximum 1 attempt. If it fails again, escalate — this may be a system-wide problem.
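The restart rules map an abend code to a delay, an attempt limit, and whether to restart from checkpoint. A sketch (abend-code spellings follow z/OS conventions; the rule values come from the list above):

```python
def restart_rule(abend_code):
    """Map an abend code to (delay_minutes, max_attempts, from_checkpoint),
    or None when the failure needs a human. Codes follow z/OS conventions:
    Sx37 space abends (SB37/SD37/SE37), Unnnn user abends, S878 storage.
    """
    if abend_code in ("SB37", "SD37", "SE37"):
        # space abends: delay lets SMS attempt space reclamation
        return (5, 2, False)
    if abend_code.startswith("U"):
        try:
            n = int(abend_code[1:])
        except ValueError:
            return None
        if 100 <= n <= 199:
            # transient application errors: restart from checkpoint
            return (3, 2, True)
        return None
    if abend_code == "S878":
        # virtual storage shortage: one retry only; a repeat may be
        # a system-wide problem, so escalate instead of looping
        return (10, 1, False)
    return None
```

Everything outside these three well-understood classes returns None and follows the normal alerting path; a S0C7, for example, is never auto-restarted because rerunning it would just abend on the same bad data.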
Pre-Flight Validation: Rob implemented pre-flight checks for the six most failure-prone jobs:
- ACCTPOST: Validate input record count, control total, and field sampling
- GLEXTRACT: Verify DB2 tablespace availability and DASD space
- STMTGEN: Check customer count and VSAM cluster health
- REGDAILY: Validate all input feeds are present and non-empty
- ATMREFRESH: Verify authorization file format and size
- INTCALC: Confirm interest rate table is current and complete
Conditional Error Routing: For the critical account posting chain, Rob implemented fallback paths:
- If ACCTVALID detects more than 0.1% error records, route the errors to an exception file and continue processing clean records. Alert the data quality team about the exceptions.
- If ACCTPOST fails and automated restart fails, route to ACCTDEFR (deferred posting), which queues the transactions for the next cycle and generates a regulatory notification.
- If ACCTAUDIT detects a control total mismatch, hold the GL extract chain and alert immediately — this is a potential data integrity issue.
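The ACCTVALID branch is the only one with a numeric trigger. A sketch of that decision (the return labels are illustrative, not CNB's actual routing codes):

```python
def route_validation(total_records, error_records, threshold_pct=0.1):
    """Decide the ACCTVALID outcome for a batch of input records.

    Returns "continue" when the error rate is within tolerance, or
    "exception-route" when errors exceed threshold_pct percent: siphon
    the error records to the exception file, post the clean records,
    and alert the data quality team.
    """
    if total_records == 0:
        # an empty input file is itself suspicious; never post nothing silently
        return "exception-route"
    error_pct = 100.0 * error_records / total_records
    return "exception-route" if error_pct > threshold_pct else "continue"
```

Note the asymmetry in the design: a handful of bad records must not hold millions of good ones hostage, but the bad ones are quarantined and reported rather than dropped.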
Results
After six months of operation with the new monitoring infrastructure:
| Metric | Before | After | Change |
|---|---|---|---|
| Incidents per month | 14.5 | 6.2 | -57% |
| Incidents requiring human intervention | 14.5 | 3.1 | -79% |
| Average MTTR (all incidents) | 46 min | 18 min | -61% |
| Average MTTR (human-resolved) | 46 min | 28 min | -39% |
| Average MTTR (auto-resolved) | N/A | 4 min | N/A |
| Batch window SLA violations | 4/month | 0.5/month | -88% |
| 3am phone calls to Rob | 8/month | 1.2/month | -85% |
The numbers tell the story, but the qualitative changes matter too. The overnight operators are more confident — they have procedures to follow instead of relying on guesswork. Lisa's development team gets better data about production issues, which helps them write more resilient code. And Rob took a two-week vacation for the first time in seven years. The batch window ran fine without him.
Lessons Learned
- Data before opinions. Rob's gut said ACCTPOST was the most failure-prone job. The SMF data showed it was actually GLEXTRACT. Measure first.
- Start with detection, not automation. It's tempting to jump straight to automated recovery. But if you automate recovery for a problem you don't fully understand, you'll automate the wrong thing. First understand the failure modes through monitoring, then automate.
- The playbook is never done. Every PIR produces a playbook update. Six months in, ten of the original 47 entries had been revised, eight new entries had been added, and three had been retired (the jobs were decommissioned).
- Dynamic thresholds are essential. Static thresholds produced 23 false alerts per week. Dynamic thresholds reduced that to 2. The on-call team trusts alerts now because almost every alert represents a real issue.
- Culture change is harder than technology. The overnight operators initially resisted the playbook — they felt it was "dumbing down" their job. Rob reframed it: "The playbook handles the routine stuff so you can focus on the interesting problems." After two months, the operators were contributing playbook updates based on their own experience.
Discussion Questions
1. Rob's playbook started as tribal knowledge in one person's head. What risks does this represent, and how does the playbook address them? Are there residual risks that the playbook doesn't address?
2. The automated restart for space abends includes a 5-minute delay. Why? What could go wrong if the restart were immediate? What could go wrong if the delay were longer?
3. CNB's error routing for ACCTPOST includes a deferred posting fallback. What are the business implications of deferred posting? Under what circumstances might deferred posting be worse than a delayed batch window?
4. The results show that human-resolved incidents still take 28 minutes on average. What additional improvements could reduce this further? Is there a practical lower bound on human response time?
5. Rob says "the playbook is never done." Design a process for keeping the playbook current that doesn't depend on Rob's personal attention. Who should own updates? What triggers a review? How do you ensure entries stay accurate as systems change?