Case Study 1: Production Monitoring Setup for a 24x7 Banking System
Background
Meridian National Bank has grown from a regional bank with a single DB2 LUW instance to a multi-platform operation running DB2 for z/OS (core banking, batch processing) and DB2 LUW (online banking, mobile APIs, data warehouse). The IT department has historically operated in reactive mode — problems were discovered when customers called the help desk or when batch jobs failed with error messages.
After a particularly painful incident in which a transaction log filled during peak hours, causing a 47-minute outage that made local news, the CIO mandated a complete overhaul of the database monitoring strategy. Maria Chen, the senior DBA, was tasked with designing and implementing a monitoring infrastructure that would prevent such incidents from recurring.
The Challenge
Maria faced several constraints:
- Budget: The monitoring solution had to be cost-effective. A full OMEGAMON deployment was approved for z/OS, but the LUW environment had to rely primarily on built-in tools.
- Staffing: The DBA team consisted of Maria (senior, 15 years experience), two mid-level DBAs, and one junior DBA. 24x7 human monitoring was not feasible.
- Platforms: Monitoring had to cover DB2 12 for z/OS (two data sharing members) and DB2 11.5 LUW (three database servers: production, reporting, and development).
- Integration: Alerts had to feed into the existing ServiceNow ticketing system and PagerDuty on-call rotation.
- Timeline: The CIO wanted the monitoring infrastructure operational within 60 days.
Maria's Design
Phase 1: Foundation (Days 1-15)
Maria began by establishing baselines. She enabled the statistics and accounting traces on z/OS (they had been running intermittently) and set them to run continuously:
-START TRACE(STAT) CLASS(1,3,4,5,6) DEST(SMF)
-START TRACE(ACCTG) CLASS(1,2,3) DEST(SMF)
On LUW, she enabled all default monitor switches and created a baseline collection script that ran every 15 minutes via cron:
INSERT INTO MERIDIAN_OPS.BASELINE_METRICS
SELECT CURRENT TIMESTAMP,
BP_NAME,
POOL_DATA_L_READS, POOL_DATA_P_READS,
POOL_INDEX_L_READS, POOL_INDEX_P_READS,
POOL_ASYNC_DATA_READS, POOL_ASYNC_INDEX_READS
FROM TABLE(MON_GET_BUFFERPOOL('', -2)) AS BP;
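These raw counters are what the later alerts key on, once reduced to hit ratios. A minimal sketch in Python using the standard formula, hit ratio = (logical reads − physical reads) / logical reads; the sample row values below are invented for illustration:

```python
def bufferpool_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Percentage of logical reads satisfied from the buffer pool
    rather than from disk."""
    if logical_reads == 0:
        return 100.0  # no activity is treated as a perfect ratio
    return 100.0 * (logical_reads - physical_reads) / logical_reads

# Combined data + index ratio for one sampled row (values are invented)
row = {
    "POOL_DATA_L_READS": 180_000, "POOL_DATA_P_READS": 2_500,
    "POOL_INDEX_L_READS": 90_000, "POOL_INDEX_P_READS": 700,
}
logical = row["POOL_DATA_L_READS"] + row["POOL_INDEX_L_READS"]
physical = row["POOL_DATA_P_READS"] + row["POOL_INDEX_P_READS"]
print(f"hit ratio: {bufferpool_hit_ratio(logical, physical):.1f}%")
```

Whether data and index reads are combined or reported per class is a design choice; the case study's alert thresholds treat each pool as a single ratio.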
She collected two weeks of data, segmented it by time period (business hours, evening, batch window, weekend), and calculated the mean and 95th percentile for each metric in each period.
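The segmentation-and-percentile step can be sketched as follows. The period boundaries (hour ranges and the weekend test) are illustrative assumptions, not Meridian's actual windows:

```python
import statistics
from datetime import datetime

def period_of(ts: datetime) -> str:
    """Classify a timestamp into one of the four baseline periods.
    The hour boundaries here are illustrative assumptions."""
    if ts.weekday() >= 5:          # Saturday or Sunday
        return "weekend"
    if 8 <= ts.hour < 18:
        return "business"
    if 18 <= ts.hour < 22:
        return "evening"
    return "batch"

def baseline(samples):
    """samples: iterable of (timestamp, metric_value) pairs.
    Returns {period: {"mean": ..., "p95": ...}}."""
    by_period = {}
    for ts, value in samples:
        by_period.setdefault(period_of(ts), []).append(value)
    result = {}
    for period, values in by_period.items():
        # quantiles(n=20) yields 19 cut points; the last is the 95th percentile
        p95 = statistics.quantiles(values, n=20)[-1] if len(values) > 1 else values[0]
        result[period] = {"mean": statistics.fmean(values), "p95": p95}
    return result
```

Feeding two weeks of 15-minute samples through `baseline` yields per-period means and 95th percentiles, which become the warning and critical thresholds in the later phases.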
Phase 2: z/OS Monitoring (Days 10-30)
The OMEGAMON for DB2 installation was handled by the systems programming team. Maria configured it with these priorities:
Critical alerts (PagerDuty — immediate page):
- Active logs: 5 of 6 full
- Any object in RECP or GRECP status
- Data sharing coupling facility structure > 90% full
- Buffer pool hit ratio < 90% for any pool during online hours
Warning alerts (ServiceNow ticket — next business day):
- Active logs: 4 of 6 full
- Any object in COPY or CHKP status for more than 4 hours
- Buffer pool hit ratio < 97% during online hours
- Thread elapsed time > 60 seconds for online plans
- Lock timeout rate > 10 per hour
- Any deadlock occurrence
Informational (daily report):
- SMF 101 accounting summaries by plan
- Buffer pool statistics trends
- Utility execution summary
- Thread volume by hour
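One way to hold the two alerting tiers is a small routing table consulted by the evaluation script. A sketch in Python, where the metric names and the direction convention are illustrative, and the limits restate thresholds from the lists above:

```python
# Illustrative routing table: (warning limit, critical limit, direction).
# Metric names are invented; limits restate the case study's thresholds.
THRESHOLDS = {
    "active_logs_full":     (4, 5, "high"),       # out of 6 active logs
    "bp_hit_ratio_pct":     (97.0, 90.0, "low"),  # during online hours
    "lock_timeouts_per_hr": (10, None, "high"),   # warning tier only
}

def route(metric: str, value: float):
    """Return 'pagerduty' (critical), 'servicenow' (warning), or None."""
    warn, crit, direction = THRESHOLDS[metric]

    def breached(limit):
        if limit is None:
            return False
        return value >= limit if direction == "high" else value <= limit

    if breached(crit):
        return "pagerduty"   # critical: immediate page
    if breached(warn):
        return "servicenow"  # warning: next-business-day ticket
    return None
```

A table like this keeps the thresholds in one place, so retuning after a baseline refresh does not require touching the collection scripts.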
Maria also configured OMEGAMON's near-term history to retain 4 hours of 1-minute interval data, enabling the team to investigate problems that occurred within the last few hours without needing to process SMF data.
Phase 3: LUW Monitoring (Days 15-45)
Without OMEGAMON on LUW, Maria built a monitoring framework using native tools:
Always-on event monitors:
-- Locking events (deadlocks, timeouts, escalations)
CREATE EVENT MONITOR OPS_LOCK_MON FOR LOCKING
WRITE TO TABLE AUTOSTART;
-- Activities exceeding 5 seconds
CREATE THRESHOLD OPS_SLOW_QUERY
FOR DATABASE ACTIVITIES ENFORCEMENT DATABASE
WHEN ACTIVITYTOTALTIME > 5 SECONDS
COLLECT ACTIVITY DATA WITH DETAILS AND VALUES
CONTINUE;
CREATE EVENT MONITOR OPS_ACT_MON FOR ACTIVITIES
WRITE TO TABLE AUTOSTART;
Scheduled collection (cron-based):
Every 5 minutes, a shell script ran a comprehensive MON_GET collection query and inserted results into monitoring tables. Every 15 minutes, a second script analyzed the collected data against baseline thresholds and generated alerts via a REST API call to PagerDuty (critical) or ServiceNow (warning).
The alerting logic used hysteresis: three consecutive readings above the critical threshold triggered a page; five consecutive readings above the warning threshold created a ticket. Recovery notifications were sent when three consecutive readings returned below the warning threshold.
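The hysteresis rule amounts to a small state machine per metric. A sketch, with trigger and recovery counts matching the three-consecutive-readings rule described above; the threshold values in a real deployment would come from the baseline data:

```python
class HysteresisAlert:
    """Fire only after `trigger_n` consecutive readings above `threshold`;
    clear only after `clear_n` consecutive readings below `clear_threshold`.
    A single stray reading in either direction resets the streak."""

    def __init__(self, threshold, clear_threshold, trigger_n=3, clear_n=3):
        self.threshold = threshold
        self.clear_threshold = clear_threshold
        self.trigger_n = trigger_n
        self.clear_n = clear_n
        self.above = 0       # consecutive readings above threshold
        self.below = 0       # consecutive readings below clear_threshold
        self.active = False  # is an alert currently open?

    def observe(self, reading):
        """Return 'alert', 'recover', or None for this reading."""
        if reading > self.threshold:
            self.above += 1
            self.below = 0
            if not self.active and self.above >= self.trigger_n:
                self.active = True
                return "alert"
        elif reading < self.clear_threshold:
            self.below += 1
            self.above = 0
            if self.active and self.below >= self.clear_n:
                self.active = False
                return "recover"
        else:
            # in the dead band between the two thresholds: reset both streaks
            self.above = self.below = 0
        return None
```

Keeping the recovery threshold below the alert threshold creates the dead band that suppresses flapping, which is what eliminated the 47 false alerts mentioned in the lessons learned.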
Continuous db2pd collection:
# Runs as a background process, rotated daily
db2pd -db MERIDIAN -bufferpools -locks -transactions -repeat 60 \
-file /db2data/monitoring/db2pd_$(date +%Y%m%d).out
Phase 4: Dashboard and Reporting (Days 30-60)
Maria built a simple dashboard using Grafana connected to the monitoring tables. The dashboard displayed:
- Top row: Green/yellow/red indicators for each database (z/OS member 1, z/OS member 2, LUW production, LUW reporting)
- Second row: Real-time buffer pool hit ratios, log utilization, active connections
- Third row: 4-hour trend charts for transaction volume, response time, lock waits
- Bottom: Expandable panels for top SQL, lock details, and table space utilization
The dashboard was displayed on a large monitor in the DBA team area and was accessible remotely via VPN.
Phase 5: Daily Health Check (Day 45)
Maria implemented the daily health check script described in Section 20.12 of this chapter. The script ran at 5:45 AM and emailed a formatted report to the DBA team. The on-call DBA was required to review the report and acknowledge it by 7:00 AM. If no acknowledgment was received, PagerDuty escalated to the backup DBA.
Results
Within the first month of full operation, the monitoring system proved its value:
- Log space alert: At 1:23 AM on a Tuesday, the z/OS monitoring detected that a batch job had started an uncommitted unit of work that was preventing log offloading. Active logs were 4 of 6 full (warning threshold). The on-call DBA received the alert, identified the batch job, coordinated with the operations team to commit the transaction, and resolved the issue before active logs reached 5 of 6. No user impact.
- Buffer pool degradation: The LUW dashboard showed a gradual decline in the MERIDIAN_BP_DATA hit ratio from 98.2% to 96.8% over five days. Investigation revealed that a new report query was performing full table scans on a large transaction table. The DBA team worked with the development team to add an appropriate index. The hit ratio recovered to 98.5%.
- Connection leak: The daily health check revealed that active connections on the LUW production server had increased from an average of 45 to 78 over two weeks, with no corresponding increase in transaction volume. Investigation found a connection leak in a recently deployed microservice. The development team deployed a fix before the connection limit was reached.
- Deadlock pattern: The locking event monitor captured a recurring deadlock between the funds transfer and statement generation applications. Analysis of the deadlock graph showed that the applications were accessing the ACCOUNTS and TRANSACTIONS tables in opposite order. The development team modified the statement generation application to access tables in the same order as funds transfer, eliminating the deadlock.
Lessons Learned
Maria documented several lessons from the implementation:
- Start with baselines: Without two weeks of baseline data, every threshold would have been a guess. The investment in baseline collection paid for itself immediately.
- Hysteresis is essential: During the first week, before hysteresis was implemented, the team received 47 alerts for transient buffer pool fluctuations. After implementing the "3 consecutive intervals" rule, false alerts dropped to near zero.
- The daily health check is non-negotiable: Two of the four incidents above were caught by the daily health check, not by real-time alerts. The gradual connection leak and buffer pool degradation were slow enough that no real-time threshold was breached, but the trend was visible in daily data.
- Automate everything that can be automated: Manual DISPLAY commands are useful for investigation but useless for detection. If a human has to remember to check something, it will eventually be missed.
- Monitor the monitors: Maria added a heartbeat check that verified the monitoring scripts themselves were running. During a server reboot, this caught a failure to restart the db2pd background process.
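One simple heartbeat scheme: each monitoring script touches a file on every cycle, and an independent checker verifies that the file is fresh. A sketch assuming a file-based heartbeat; the case study does not specify Maria's actual mechanism:

```python
import os
import tempfile
import time

def heartbeat_fresh(path: str, max_age_seconds: float) -> bool:
    """True if the heartbeat file exists and was modified recently enough.
    Each monitored script is assumed to touch its own heartbeat file every
    cycle, so a stale or missing file means that monitor has stopped."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:  # file missing: the script never started or was cleaned up
        return False
    return age <= max_age_seconds

# Demo: a file touched just now is fresh; a path that never existed is not.
with tempfile.NamedTemporaryFile(delete=False) as f:
    hb = f.name
print(heartbeat_fresh(hb, 60), heartbeat_fresh(hb + ".gone", 60))
```

The checker itself should run from a scheduler independent of the scripts it watches (and page on staleness), so a single server reboot cannot silently take out both the monitors and their watchdog.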
Discussion Questions
- Maria's LUW monitoring relies on cron-scheduled scripts rather than a commercial product. What risks does this approach carry, and how would you mitigate them?
- The monitoring system generates approximately 500 MB of data per day across all platforms. Design a retention policy that balances storage cost against analytical value.
- If Meridian Bank acquired another bank with Oracle databases, how would you extend the monitoring framework to include heterogeneous database monitoring?
- Maria's team of four DBAs covers 24x7 operations. Is this sustainable? What staffing model would you recommend?