Chapter 20 Key Takeaways
Core Principles
- Proactive monitoring detects problems before users notice them. Reactive monitoring means you learn about problems from angry phone calls. Proactive monitoring means you learn about problems from dashboards and alerts, often minutes or hours before any user is affected.
- You cannot detect anomalies without baselines. Collect metrics during known-good operations for at least two weeks, segmented by time period (business hours, batch window, month-end). Without baselines, every alert threshold is a guess.
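As an illustrative sketch of the segmentation idea above, the snippet below buckets metric samples into baseline segments and averages each bucket. The segment boundaries (business hours 9-17, batch window 1-5, month-end from day 28) and function names are assumptions, not values from the text:

```python
from datetime import datetime
from statistics import mean

def segment(ts: datetime) -> str:
    """Classify a sample into a baseline segment (boundaries are illustrative)."""
    if ts.day >= 28:                  # month-end processing window
        return "month-end"
    if 9 <= ts.hour < 17:             # business hours
        return "business-hours"
    if 1 <= ts.hour < 5:              # overnight batch window
        return "batch-window"
    return "off-hours"

def build_baselines(samples):
    """samples: iterable of (timestamp, metric_value). Returns per-segment averages."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(segment(ts), []).append(value)
    return {seg: mean(vals) for seg, vals in buckets.items()}

# Hypothetical samples: the same metric means different things at 10:00 and 02:00.
baselines = build_baselines([
    (datetime(2024, 3, 5, 10, 0), 120.0),
    (datetime(2024, 3, 5, 14, 0), 140.0),
    (datetime(2024, 3, 6, 2, 0), 900.0),
])
```

Comparing a new reading against the baseline for its own segment, rather than a single global average, is what makes the batch-window spike above look normal instead of anomalous.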
- Monitor five pillars: Availability, Performance, Capacity, Concurrency, and Integrity. A comprehensive monitoring strategy addresses all five. Neglecting any pillar leaves a blind spot that will eventually cause an outage.
z/OS Monitoring
- DISPLAY commands are your real-time eyes: -DIS DATABASE for object status, -DIS THREAD for connection activity, -DIS BUFFERPOOL for performance, -DIS LOG for log health, and -DIS UTIL for utility progress. Memorize these commands.
- Statistics and accounting traces should always be running. They have low overhead and provide essential data for trend analysis and per-thread performance diagnostics. Never turn them off in production.
- SMF 101 records are the foundation of z/OS performance management. Each record captures a complete thread lifecycle with elapsed time, CPU, I/O, lock waits, and SQL counts. Build historical trend tables from SMF data.
- Performance traces are powerful but dangerous. They generate enormous data volumes and impose measurable overhead. Always filter by plan, auth ID, or connection type. Always set a time limit. Never leave them running indefinitely.
- OMEGAMON or equivalent third-party tools are essential for enterprise z/OS environments. Manual DISPLAY commands cannot provide 24x7 coverage, threshold alerting, or historical trending at scale.
LUW Monitoring
- db2pd is safe to run during any crisis. It reads directly from shared memory without acquiring latches, imposing virtually zero overhead. Use it as your first diagnostic tool in any performance investigation.
- MON_GET table functions are the modern standard for LUW monitoring. They provide SQL-queryable monitoring data that is easy to filter, aggregate, and store. Prefer them over GET SNAPSHOT for all new monitoring development.
- Event monitors capture transient events that snapshots miss. Deadlocks, slow queries, and lock timeouts are instantaneous events. If you are not running event monitors, you will miss them entirely.
- db2diag.log is the first place to look when something goes wrong. Filter it with the db2diag tool by time, severity, database name, and message ID. Set DIAGLEVEL to 3 in production — never to 0.
Key Metrics
- Buffer pool hit ratio is the single most important performance metric. A drop from 99% to 92% represents an eightfold increase in physical I/O. Monitor it continuously with tight thresholds (OLTP: warn at 97%, critical at 93%).
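As a sketch, the ratio and the OLTP thresholds above can be computed like this. On LUW the inputs would typically come from MON_GET_BUFFERPOOL counters such as POOL_DATA_L_READS and POOL_DATA_P_READS; the function names here are illustrative:

```python
def bufferpool_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Hit ratio as a percentage; an idle pool with no reads counts as healthy."""
    if logical_reads == 0:
        return 100.0
    return 100.0 * (logical_reads - physical_reads) / logical_reads

def classify(ratio: float, warn: float = 97.0, critical: float = 93.0) -> str:
    """Apply the OLTP thresholds from the text: warn at 97%, critical at 93%."""
    if ratio < critical:
        return "critical"
    if ratio < warn:
        return "warning"
    return "ok"

ratio = bufferpool_hit_ratio(logical_reads=100_000, physical_reads=800)  # 99.2%
status = classify(ratio)
```

Note why the eightfold claim holds: the miss rate, not the hit rate, drives physical I/O, and a drop from 99% to 92% moves the miss rate from 1% to 8%.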
- Log utilization requires proactive management. Running out of active log space stops all update activity. On z/OS, alert when 4 of 6 active logs are full. On LUW, alert above 50% utilization.
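Both alert rules above can be encoded in a few lines; this is an illustrative sketch, and the function names and input shapes are assumptions rather than any product API:

```python
def log_alert_zos(full_logs: int, threshold: int = 4) -> bool:
    """z/OS rule from the text: alert once 4 of the active log data sets are full."""
    return full_logs >= threshold

def log_alert_luw(used_bytes: int, total_bytes: int, threshold_pct: float = 50.0) -> bool:
    """LUW rule from the text: alert when active log utilization exceeds 50%."""
    return 100.0 * used_bytes / total_bytes > threshold_pct
```

Both thresholds are deliberately early: they leave time to find and cancel the long-running unit of work that is usually pinning the log before updates actually stop.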
- Lock wait time reveals concurrency problems. Average lock waits above 100ms for OLTP transactions indicate contention that needs investigation. Deadlocks should always be captured and analyzed.
- Sort overflows above 5% indicate critical sort heap issues. Each sort that spills to disk may be orders of magnitude slower than an in-memory sort. Increase sort heap or add indexes to provide ordering.
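The 5% rule above is a simple ratio of two counters; on LUW these correspond to monitor elements along the lines of total_sorts and sort_overflows. A minimal sketch (names illustrative):

```python
def sort_overflow_pct(total_sorts: int, sort_overflows: int) -> float:
    """Percentage of sorts that spilled from sort heap to disk."""
    return 0.0 if total_sorts == 0 else 100.0 * sort_overflows / total_sorts

def sort_heap_alert(total_sorts: int, sort_overflows: int,
                    threshold_pct: float = 5.0) -> bool:
    """Apply the 5% threshold from the text."""
    return sort_overflow_pct(total_sorts, sort_overflows) > threshold_pct
```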
- Package cache hit ratio below 60% usually means literal SQL. Applications using string concatenation instead of parameter markers create unique SQL statements that cannot be reused, wasting CPU on repeated optimization.
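The difference between literal SQL and parameter markers looks like this in application code. The example uses Python's built-in sqlite3 purely as a stand-in for a Db2 driver; the table and values are hypothetical, but the caching argument is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 250.0)")

acct_id = 1

# Anti-pattern: concatenation makes every statement text unique, so the
# package cache cannot reuse a compiled plan (and it invites SQL injection).
literal_sql = "SELECT balance FROM accounts WHERE id = " + str(acct_id)
row_literal = conn.execute(literal_sql).fetchone()

# Preferred: a parameter marker keeps one reusable statement in the cache,
# regardless of how many distinct account IDs the application queries.
row = conn.execute("SELECT balance FROM accounts WHERE id = ?", (acct_id,)).fetchone()
```

With markers, a million lookups compile one statement; with literals, they compile a million.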
Dashboards and Alerting
- Organize dashboards in three tiers: glance, performance, and detail. Tier 1 (red/yellow/green) refreshes every 15-30 seconds. Tier 2 (numeric with trends) refreshes every 30-60 seconds. Tier 3 (detail panels) refreshes on demand.
- Alert hysteresis prevents alert storms. Require 3-5 consecutive readings above threshold before alerting, and suppress re-alerts for 15-30 minutes. Include rate-of-change detection for fast-moving metrics.
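A minimal sketch of the hysteresis logic described above, using 3 consecutive readings and a 15-minute suppression window from the ranges in the text (the class name and interface are illustrative):

```python
import time

class HysteresisAlert:
    """Fire only after `consecutive` breaches of `threshold`, then suppress
    re-alerts for `suppress_seconds`. Any good reading resets the count."""

    def __init__(self, threshold, consecutive=3, suppress_seconds=15 * 60):
        self.threshold = threshold
        self.consecutive = consecutive
        self.suppress_seconds = suppress_seconds
        self.breaches = 0
        self.last_alert = None

    def observe(self, value, now=None):
        """Record one reading; return True if an alert should fire now."""
        now = time.monotonic() if now is None else now
        if value <= self.threshold:
            self.breaches = 0                 # reset on any good reading
            return False
        self.breaches += 1
        if self.breaches < self.consecutive:
            return False                      # not yet a sustained breach
        if self.last_alert is not None and now - self.last_alert < self.suppress_seconds:
            return False                      # inside the suppression window
        self.last_alert = now
        return True

alert = HysteresisAlert(threshold=100, consecutive=3, suppress_seconds=900)
fired = [alert.observe(150, now=t) for t in (0, 10, 20, 30, 1000)]
```

The first two breaches stay silent, the third fires, the fourth is suppressed, and the fifth fires again once the 900-second window has expired.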
- The daily health check is non-negotiable. Run it every morning. Review it before business hours begin. Gradual degradation and slow-building problems are visible in daily data long before they trigger real-time alerts.
Meridian Bank Specifics
- Separate buffer pools protect workload isolation. Data, index, LOB, and temp buffer pools should be separate. If possible, separate online and reporting workloads into different pools.
- Workload management controls resource consumption. Define workloads by application name or auth ID. Set thresholds for maximum rows read, concurrent activities, and execution time. Prevent a single query from degrading the entire system.
- Monitor the monitors. Verify that collection scripts, event monitors, and background db2pd processes are actually running. A monitoring system that has silently stopped is worse than no monitoring at all — it provides false confidence.
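One common way to implement this check, sketched here under the assumption that each collection script periodically touches its own heartbeat file (the paths and the 5-minute staleness limit are hypothetical):

```python
import os
import tempfile
import time

def heartbeat_stale(path: str, max_age_seconds: int = 300) -> bool:
    """True if a collector's heartbeat file is missing or older than the limit."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return True        # file missing: collector never started or was cleaned up
    return age > max_age_seconds

# Demo: a just-written heartbeat is fresh; a nonexistent one reads as stale.
with tempfile.NamedTemporaryFile(delete=False) as f:
    hb_path = f.name
fresh = heartbeat_stale(hb_path)
missing = heartbeat_stale(hb_path + ".gone")
os.unlink(hb_path)
```

Crucially, the staleness check itself should run from a different scheduler (or host) than the collectors it watches, so one failure cannot silence both.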