Key Takeaways — Chapter 27

1. SMF Records Are Your Batch Telemetry System

SMF type 30 (job accounting), type 14/15 (dataset activity), and type 42 (storage management) provide comprehensive, granular data about every aspect of batch execution. Most shops collect this data but don't analyze it. The shops that analyze it have a decisive advantage in detecting problems early, understanding trends, and making data-driven capacity decisions. Learn the self-defining section format — it's the key to writing SMF analysis programs that survive IBM record format changes.
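The self-defining section idea can be sketched in a few lines: each section of the record is located by a triplet (a 4-byte offset, a 2-byte length, a 2-byte count) rather than by a fixed position, so a parser that follows the triplets keeps working when IBM adds fields. The sketch below is illustrative, not a full SMF parser — the synthetic record and the triplet positions are assumptions; consult the SMF manual for the exact offsets and the base the offsets are measured from.

```python
import struct

def read_triplet(record: bytes, pos: int):
    """Read one self-defining triplet: 4-byte offset, 2-byte length,
    2-byte count. Fields are big-endian, as on z/OS."""
    offset, length, count = struct.unpack_from(">IHH", record, pos)
    return offset, length, count

def sections(record: bytes, triplet_positions):
    """Yield each section's raw bytes by following its triplet instead of
    hard-coding positions, so the parser survives record format changes."""
    for pos in triplet_positions:
        offset, length, count = read_triplet(record, pos)
        for i in range(count):
            start = offset + i * length
            yield record[start:start + length]
```

A program written this way only hard-codes the triplet locations, which IBM keeps stable; everything else is discovered at run time from the record itself.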

2. Four Levels of Monitoring, Four Different Purposes

Real-time console monitoring (seconds) catches immediate failures. Job-level monitoring (minutes) tracks individual job health. Stream-level monitoring (batch window) ensures SLA milestones are met and the critical path is on track. Historical trend analysis (days/weeks/months) reveals gradual degradation, cyclical patterns, and capacity trends. A mature shop operates at all four levels simultaneously.

3. Alert Fatigue Kills

Too many alerts are worse than too few. If the on-call person receives 200 alerts per night and 195 are noise, the five real problems get buried. Design alerts in tiers (Critical, Warning, Informational, Diagnostic) with clear thresholds for each. Use dynamic thresholds that account for day-of-week, month-end, and seasonal patterns. Measure your signal-to-noise ratio and treat false positives as defects to be eliminated.
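Treating false positives as defects starts with measuring them. A minimal sketch, assuming each overnight alert is logged as a (tier, was-actionable) pair — the tier names and logging format are assumptions, not a standard:

```python
from collections import defaultdict

def false_positive_rate(alerts):
    """alerts: iterable of (tier, was_actionable) pairs from one night.
    Returns the per-tier false-positive rate, so the noisiest alert
    rules can be identified and fixed like any other defect."""
    fired = defaultdict(int)
    noise = defaultdict(int)
    for tier, actionable in alerts:
        fired[tier] += 1
        if not actionable:
            noise[tier] += 1
    return {tier: noise[tier] / fired[tier] for tier in fired}
```

Trending this per-tier rate week over week shows whether threshold tuning is actually improving the signal-to-noise ratio or just moving noise between tiers.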

4. Dynamic Thresholds Prevent False Positives

A job that legitimately runs twice as long on month-end nights will trigger a static threshold alert every single month. Dynamic thresholds adjust baselines for known patterns — day of week, month-end, quarter-end, year-end — so that expected variations don't generate alerts. The payoff is dramatic: shops that switch from static to dynamic thresholds typically reduce false alerts by 80% or more.
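One way to implement dynamic thresholds is to bucket history by calendar pattern and compare each run only against its own bucket. This is a minimal sketch — the bucketing rule (day-of-week plus a crude month-end flag), the 3-sigma band, and the stdev floor are all illustrative policy choices, not a prescribed method:

```python
import statistics
from collections import defaultdict

def bucket(run_date):
    """Group runs by day-of-week plus a month-end flag; quarter-end and
    year-end flags could be added the same way."""
    month_end = run_date.day >= 28   # crude month-end window, illustrative only
    return (run_date.weekday(), month_end)

def build_baselines(history):
    """history: iterable of (date, elapsed_minutes). Returns per-bucket
    (mean, stdev), so month-end runs are judged against month-end runs."""
    by_bucket = defaultdict(list)
    for run_date, elapsed in history:
        by_bucket[bucket(run_date)].append(elapsed)
    return {b: (statistics.mean(v), statistics.pstdev(v))
            for b, v in by_bucket.items()}

def is_anomalous(run_date, elapsed, baselines, k=3.0):
    """Alert only when a run exceeds its own bucket's baseline."""
    mean, sd = baselines[bucket(run_date)]
    return elapsed > mean + k * max(sd, 1.0)  # floor sd to avoid zero-width bands
```

With this structure, a job that always doubles on month-end compares against other month-end runs and stays quiet, while the same runtime on an ordinary night still alerts.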

5. Playbooks Capture and Distribute Expertise

Every shop has a Rob — the person whose institutional knowledge keeps the lights on. Playbooks transfer that knowledge from one person's head into a documented, searchable, repeatable procedure that any competent operator can follow. Each playbook entry covers: trigger, impact, diagnostics, resolution options, escalation criteria, and post-resolution verification. The playbook is never finished — every incident should update it.
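The six-part entry structure above lends itself to a simple machine-readable form, which makes playbooks searchable and lets alert notifications link directly to an entry. The class below is one possible shape, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class PlaybookEntry:
    """One playbook entry; fields mirror the six sections listed above."""
    trigger: str            # what fires this entry: alert, abend code, symptom
    impact: str             # who and what is affected, and how badly
    diagnostics: list       # checks to run, in order
    resolutions: list       # options, from least to most invasive
    escalation: str         # when to stop trying and whom to call
    verification: list      # how to confirm the fix actually worked
    revision_notes: list = field(default_factory=list)  # updated after every incident
```

Keeping `revision_notes` as a growing list reinforces the point that the playbook is never finished: every incident appends to it.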

6. Decision Trees for Quick Classification

When the phone rings at 3am, you need fast routing to the right procedure. Decision trees classify the incident type (abend code, performance issue, missing output) and point the responder to the appropriate playbook entry. Keep decision trees simple, visual, and accessible — laminated, on the wall, on the console, linked in alert notifications.
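A decision tree of this kind can also live in code, so the alerting system itself attaches the right playbook link to each notification. A minimal sketch — the playbook entry IDs and the incident-dict fields are invented for illustration:

```python
def route(incident):
    """Classify an incident and return a playbook entry ID.
    incident: dict with optional keys 'abend', 'late', 'missing_output'."""
    code = incident.get("abend")
    if code:
        if code.startswith("S0C"):            # program check / data exception
            return "PB-APP-01"
        if code in ("SB37", "SD37", "SE37"):  # space abends
            return "PB-SPACE-01"
        return "PB-ABEND-GENERIC"
    if incident.get("late"):                  # no abend, but SLA milestone missed
        return "PB-PERF-01"
    if incident.get("missing_output"):
        return "PB-OUTPUT-01"
    return "PB-TRIAGE"                        # unclassified: start at the top
```

The code version and the laminated wall version should be generated from the same source, so the 3am responder and the automated alert always agree on where to look.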

7. Self-Healing Batch Reserves Human Attention for Human Problems

Automated restart for well-understood, transient failures (space abends, temporary resource shortages, application-defined retry-eligible errors) eliminates the most common 3am phone calls. But automation must be selective: only restart when there's a reasonable probability the problem won't recur. Data exceptions (S0C7) and program bugs (S0C4) will fail again — don't waste time restarting them automatically.
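The selectivity rule reduces to a small eligibility check. In this sketch the membership of the two sets and the retry cap are illustrative policy for one shop, not doctrine — each site has to classify its own abend codes:

```python
# Codes treated as transient (worth one automatic retry) vs. deterministic
# (the same input will fail the same way). Illustrative classification only.
TRANSIENT = {"SB37", "SD37", "SE37", "S878"}   # space and storage shortages
DETERMINISTIC = {"S0C7", "S0C4", "S0C1"}       # data exceptions, program checks

def should_auto_restart(abend_code, attempts, max_attempts=2):
    """Restart only when the failure is plausibly transient and we have
    not already retried; a restart loop is its own incident."""
    if attempts >= max_attempts:
        return False
    if abend_code in DETERMINISTIC:
        return False                  # same data, same bug: it will recur
    return abend_code in TRANSIENT    # unknown codes default to a human
```

Note the default: a code in neither set goes to a human. Automation should only handle failures it positively recognizes as transient.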

8. Pre-Flight Checks Prevent Incidents

The most effective recovery is prevention. Pre-flight validation programs that run before critical batch steps can detect space shortages, data quality problems, missing prerequisites, and resource unavailability before the main processing job encounters them. A problem detected by a pre-flight check is a problem that never becomes an incident.
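A pre-flight runner can be as simple as a list of named checks executed before the main step. The sketch below collects every failure rather than stopping at the first, so one report lists all problems; the checks themselves (callables returning an error string, or None on success) are supplied by the caller and are assumptions here:

```python
def preflight(checks):
    """checks: iterable of (name, check_fn) pairs, where check_fn returns
    None on success or an error message on failure. Returns the list of
    failures; an empty list means clear to launch."""
    failures = []
    for name, check in checks:
        msg = check()
        if msg:
            failures.append((name, msg))
    return failures
```

Reporting all failures at once matters at 3am: fixing three problems in one pass beats discovering them one restart at a time.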

9. MTTR Has Four Components — Improve Each One

Mean Time to Recovery breaks down into detection (MTTD), diagnosis (MTTDx), repair (MTTRp), and verification (MTTV). The biggest gains typically come from reducing detection time through proper monitoring — you can't fix a problem you haven't detected. The second biggest comes from reducing diagnosis time through playbooks. The third comes from reducing repair time through automation. Measure all four components separately to focus improvement efforts.
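Measuring the four components separately requires only that each incident record carry five timestamps. A minimal sketch — the field names are an assumed schema, not a standard:

```python
def mttr_components(incident):
    """incident: dict with datetime values for 'failed', 'detected',
    'diagnosed', 'repaired', 'verified'. Returns each MTTR component
    in minutes, so the four can be tracked and improved independently."""
    def mins(a, b):
        return (incident[b] - incident[a]).total_seconds() / 60
    return {
        "MTTD":  mins("failed", "detected"),     # detection
        "MTTDx": mins("detected", "diagnosed"),  # diagnosis
        "MTTRp": mins("diagnosed", "repaired"),  # repair
        "MTTV":  mins("repaired", "verified"),   # verification
    }
```

Averaging each component across incidents shows where the next investment pays off: a large MTTD points at monitoring gaps, a large MTTDx at missing playbook entries, a large MTTRp at missing automation.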

10. Post-Incident Reviews Build Systemic Resilience

Every significant incident should produce a Five Whys analysis, corrective actions at multiple defense layers, and updates to monitoring rules, playbooks, and automation. The review must be blameless — focused on what allowed the problem to happen, not who caused it. The goal is a system where individual human errors cannot cause batch failures, because the system detects, contains, and recovers from them.

11. The Goal Is Making Heroics Unnecessary

Batch monitoring success looks like nothing happening. The best night is the night when nobody notices the batch ran, because everything completed on schedule without intervention. The monitoring, alerting, playbooks, and automation are the infrastructure that makes that silence possible. Rob's phone doesn't ring at 3am — not because problems don't occur, but because the systems handle them before they become incidents.