Case Study 1: Utility Scheduling Strategy for a 24x7 System

Background

Meridian National Bank operates its core banking platform on DB2 for z/OS. The system processes an average of 50,000 transactions per day across 1.2 million customer accounts. The digital banking channel (ATM, mobile, and web) operates 24 hours a day, 7 days a week. Even the teller and branch systems run extended hours — 6 AM to 10 PM Monday through Saturday, with Sunday as the only reduced-activity day.

When Sandra Reeves joined Meridian Bank as the lead DB2 DBA, she inherited a maintenance strategy that could be charitably described as "ad hoc." REORG jobs ran when someone remembered to submit them. RUNSTATS was executed after major batch loads but not on a regular schedule. Full image copies were taken weekly on Sunday mornings, with no incremental copies in between. The last test recovery had been performed eight months earlier and had failed — a fact documented in a ticket that was promptly forgotten.

Sandra's mandate was clear: design and implement a utility scheduling strategy that maintains peak performance and guaranteed recoverability without any unplanned application downtime.

The Assessment

Sandra began with a comprehensive assessment of the current state. She ran REORG with the REPORTONLY option against every production tablespace and queried the SYSIBM.SYSCOPY catalog table to determine backup recency.
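The catalog side of that assessment can be sketched as a query against SYSIBM.SYSCOPY. The query below is illustrative, not Sandra's actual statement; DBNAME, TSNAME, ICTYPE, and TIMESTAMP are documented SYSCOPY columns.

```sql
-- Most recent full ('F') or incremental ('I') image copy per tablespace.
-- Tablespaces absent from this result have never been copied at all.
SELECT DBNAME, TSNAME, MAX(TIMESTAMP) AS LAST_COPY
  FROM SYSIBM.SYSCOPY
 WHERE ICTYPE IN ('F', 'I')
 GROUP BY DBNAME, TSNAME
 ORDER BY LAST_COPY ASC;    -- oldest, i.e. riskiest, first
```

Sorting oldest-first surfaces neglected objects like BRANCHES immediately.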

The findings were alarming:

Fragmentation: The TRANSACTIONS tablespace (TRANS_TS) showed 47% disorganization. The clustering index on TRANS_DATE was only 38% clustered — meaning nearly two-thirds of rows were physically stored out of date order. Sequential range scans on date ranges were performing 2.8 times more I/O than necessary.

Stale statistics: The ACCOUNTS tablespace had not had RUNSTATS in 45 days. During that period, a major acquisition had added 180,000 new accounts. The catalog still showed 1,020,000 rows; the actual count was 1,200,000. Several access paths were suboptimal because the optimizer was working with outdated cardinality estimates.

Backup gaps: The weekly Sunday full image copy left the recovery baseline up to a week old. With no incremental copies during the week, a failure on Saturday would require replaying six days of archive logs. For TRANS_TS, that meant approximately 180 GB of log data and an estimated recovery time of 14 hours, far exceeding the bank's 4-hour RTO.

Archive log risk: Archive logs were retained for 14 days, which seemed adequate. However, the BRANCHES tablespace (a low-priority reference table) had not been backed up in 21 days, so its most recent image copy was older than the oldest retained archive log. If its VSAM dataset failed, the copy could be restored, but the week of log records needed to bring it to currency no longer existed: the tablespace was effectively unrecoverable.

The Design

Sandra designed a three-tier maintenance strategy:

Tier 1: Daily Maintenance (10:30 PM - 12:00 AM)

The daily window targets statistics freshness and backup recency without requiring any application downtime.

RUNSTATS runs on the five highest-churn tablespaces: TRANS_TS, DAILYBAL_TS, SESSION_TS, AUDIT_TS, and PENDING_TS. Sandra chose to collect KEYCARD and FREQVAL COUNT 20 MOST for these tables, capturing the distribution information that the optimizer needs for the bank's most critical queries. She considered using inline RUNSTATS with a daily REORG but determined that daily REORG was unnecessary — fragmentation accumulates over days, not hours, for Meridian's workload profile.
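In utility control statement form, the daily statistics job for one of the five tablespaces might look like the sketch below. The database name BANKDB is assumed, and exact option spellings should be checked against the installed version's Utility Guide.

```
-- Daily statistics refresh; SHRLEVEL CHANGE permits concurrent updates.
RUNSTATS TABLESPACE BANKDB.TRANS_TS
  TABLE(ALL)
  INDEX(ALL
        KEYCARD
        FREQVAL NUMCOLS 1 COUNT 20 MOST)  -- top-20 value distribution
  SHRLEVEL CHANGE
  REPORT YES
  UPDATE ALL
```

The same statement, with only the tablespace name changed, covers DAILYBAL_TS, SESSION_TS, AUDIT_TS, and PENDING_TS.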

Incremental image copies run on all critical tablespaces after RUNSTATS completes. With SHRLEVEL REFERENCE, these copies are consistent; write activity is drained for the duration of the copy, which is short for an incremental because only pages changed since the last copy are read (Sandra measured the impact at typically under 2 seconds during the late-evening window). The incremental copies are written to DASD for fast recovery access.
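A sketch of the incremental copy statement, with an assumed DD name for the output dataset:

```
-- Incremental copy: FULL NO copies only pages changed since the last copy.
COPY TABLESPACE BANKDB.TRANS_TS
  FULL NO
  SHRLEVEL REFERENCE
  COPYDDN(DAILYIC)     -- DASD output dataset; DD name is illustrative
```

FULL NO is what makes the nightly window feasible: the copy size tracks the day's change volume, not the tablespace size.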

This tier addresses the backup gap directly. With daily incrementals, the maximum log replay during recovery is reduced from 6 days to approximately 20 hours of log data — roughly 30 GB for TRANS_TS, bringing estimated recovery time down to 3.5 hours.

Tier 2: Weekly Maintenance (Sunday 1:00 AM - 4:30 AM)

Sunday morning is the lowest-activity period. Sandra scheduled the heavy maintenance here.

REORG with SHRLEVEL CHANGE runs on all OLTP tablespaces. The SHRLEVEL CHANGE approach allows the digital banking channel to continue operating throughout the REORG. Sandra measured the final drain (the brief suspension of write access just before the REORG switches applications over to the reorganized data) at consistently under 8 seconds for TRANS_TS, well within the application's timeout thresholds.

She configured inline statistics (STATISTICS TABLE(ALL) INDEX(ALL KEYCARD FREQVAL NUMCOLS 1 COUNT 20 HISTOGRAM NUMQUANTILES 100)) within the REORG, combining two steps into one and saving approximately 45 minutes of elapsed time.
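Put together, the weekly online REORG for one tablespace might be sketched as follows. The mapping table name and drain tuning values are assumptions, not figures from the case.

```
-- Weekly online REORG with inline statistics: one pass, no RUNSTATS step.
REORG TABLESPACE BANKDB.TRANS_TS
  SHRLEVEL CHANGE
  MAPPINGTABLE DBA01.MAP_TRANS            -- required for SHRLEVEL CHANGE
  DRAIN_WAIT 10 RETRY 6 RETRY_DELAY 30    -- bound the final drain's impact
  STATISTICS
    TABLE(ALL)
    INDEX(ALL
          KEYCARD
          FREQVAL NUMCOLS 1 COUNT 20
          HISTOGRAM NUMQUANTILES 100)
```

DRAIN_WAIT, RETRY, and RETRY_DELAY are what keep the end-of-REORG drain from timing out applications if it happens to collide with a burst of activity.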

Full image copies follow immediately after REORG, written to both DASD (COPY1) and tape (COPY2). The DASD copy provides fast local recovery; the tape copy goes offsite for disaster recovery.

Using LISTDEF and TEMPLATE, Sandra consolidated the entire weekly maintenance into a single JCL job with dynamic dataset naming. This eliminated the 14 separate JCL members that the previous team had maintained (poorly) and ensured consistent treatment of every tablespace.
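The consolidated job's control statements might be sketched like this. The list pattern, template DD names, and dataset-name masks are illustrative; &DB., &TS., &DATE., and &TIME. are TEMPLATE substitution variables.

```
-- One list drives every weekly utility; templates name datasets dynamically.
LISTDEF WEEKLY INCLUDE TABLESPACE BANKDB.*

TEMPLATE LOCALDDN                 -- DASD primary copy (COPY1)
  UNIT SYSDA
  DSN 'DB2P.IC.&DB..&TS..D&DATE..T&TIME.'
TEMPLATE TAPEDDN                  -- offsite tape copy (COPY2)
  UNIT TAPE
  DSN 'DB2P.IC2.&DB..&TS..D&DATE..T&TIME.'

COPY LIST WEEKLY
  FULL YES
  SHRLEVEL REFERENCE
  COPYDDN(LOCALDDN)
  RECOVERYDDN(TAPEDDN)
```

Adding a tablespace to BANKDB automatically brings it under the maintenance umbrella, which is exactly the consistency the 14 hand-edited JCL members could not provide.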

Tier 3: Monthly Maintenance (First Sunday, 12:00 AM - 5:00 AM)

Monthly maintenance extends weekly maintenance with additional activities:

Extended REORG scope includes reference tablespaces (BRANCHES, PRODUCTS, FEE_SCHEDULES) that do not change frequently enough to warrant weekly REORG but do accumulate minor fragmentation over time.

Column group statistics are collected during the monthly window. Sandra identified seven column combinations across four tables where correlated predicates were causing optimizer misestimates. Monthly collection was sufficient because the correlation patterns (e.g., BRANCH_ID with ACCOUNT_TYPE) were structurally stable.
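One of the seven combinations might be collected as below; the ACCOUNTS_TS/ACCOUNTS names follow the case, while the COUNT value is an assumption.

```
-- Column group statistics for a correlated predicate pair, so the
-- optimizer sees the joint distribution rather than assuming independence.
RUNSTATS TABLESPACE BANKDB.ACCOUNTS_TS
  TABLE(ACCOUNTS)
    COLGROUP(BRANCH_ID, ACCOUNT_TYPE)
      FREQVAL COUNT 15
```

Without the COLGROUP, the optimizer multiplies the two single-column filter factors, which understates selectivity whenever branch and account type are correlated.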

Test recovery of one critical tablespace is performed each month, rotating through ACCOUNTS_TS, TRANS_TS, CUSTOMER_TS, and LOAN_TS on a four-month cycle. The test recovery targets a separate test subsystem and verifies the entire backup chain — full copy, incremental copies, and archive log application. Sandra made this a non-negotiable item on the monthly calendar after discovering the previous team's failed test recovery.
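The statements at the core of each monthly test are sketched below. Restoring into a separate test subsystem additionally requires cloning the recovered datasets (for example with DSN1COPY) or a redirected-recovery capability; that plumbing is omitted here.

```
-- Restore the latest full copy, apply incrementals, replay archive logs.
RECOVER TABLESPACE BANKDB.ACCOUNTS_TS

-- Indexes are then rebuilt from the recovered data.
REBUILD INDEX (ALL) TABLESPACE BANKDB.ACCOUNTS_TS
```

The point of the rotation is that every link in the chain (full copy, incrementals, archive logs) gets exercised for every critical tablespace at least three times a year.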

Implementation Challenges

Sort work space. The initial REORG runs on TRANS_TS required more sort work space than Sandra had allocated. The DFSORT step abended with a space error on the first attempt. She increased the SORTWK allocations from 100 cylinders to 200 cylinders per dataset and added a fourth SORTWK DD statement for safety margin. She also implemented a monitoring check on SORTWK utilization to proactively resize before future abends.
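The resized allocation corresponds to DD statements along these lines (DD names follow the DFSORT convention; the space figures are from the case, the CONTIG request is an assumption):

```jcl
//* Four sort work datasets at 200 cylinders each, per Sandra's resize.
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(200),,CONTIG)
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(200),,CONTIG)
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(200),,CONTIG)
//SORTWK04 DD UNIT=SYSDA,SPACE=(CYL,(200),,CONTIG)
```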

Mapping table sizing. The mapping table for the TRANS_TS REORG SHRLEVEL CHANGE was initially placed in a tablespace with insufficient space. Digital banking activity during the Sunday window, although low, was heavier than the allocation assumed, and the mapping table ran out of room mid-REORG. Sandra resized the mapping tablespace to hold 150% of the rows implied by the longest observed REORG duration multiplied by the peak transaction rate.

Job scheduling dependencies. Sandra used the bank's Tivoli Workload Scheduler (TWS) to chain the maintenance jobs. A critical dependency she enforced: the COPY job must not start until the REORG job has ended. If REORG succeeds, COPY backs up the freshly reorganized data; if REORG fails, COPY still runs (to back up the current state) and a standalone RUNSTATS job is inserted so that statistics are current even without reorganization.
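The failure-path logic can be sketched in native JCL, although in practice TWS enforces it across separate jobs. Step names, PARM utility IDs, and the omitted STEPLIB/SYSIN DDs are all assumptions.

```jcl
//REORG    EXEC PGM=DSNUTILB,PARM='DB2P,WKLYREORG'
//* If REORG did not end cleanly, refresh statistics anyway
//         IF (REORG.RC > 4 | REORG.ABEND) THEN
//RUNSTATS EXEC PGM=DSNUTILB,PARM='DB2P,WKLYSTATS'
//         ENDIF
//* COPY runs once REORG has ended, whether or not it succeeded
//COPY     EXEC PGM=DSNUTILB,PARM='DB2P,WKLYCOPY'
```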

Communication. Sandra established a maintenance notification process. The operations team receives a start/end notification for each maintenance window. If any utility fails, an alert is sent to the DBA on-call pager within 5 minutes. Weekly maintenance results are summarized in a Monday morning report distributed to the database services team.

Results

After three months of the new maintenance strategy:

  • TRANS_TS disorganization dropped from 47% to consistently under 5%, measured each Sunday just before the weekly REORG; even a full week's accumulated fragmentation now stays under 5%.
  • Clustering ratio on IX_TRANS_DATE improved from 38% to 97%, reducing I/O for date-range queries by 60%.
  • Estimated recovery time for TRANS_TS dropped from 14 hours to 3.5 hours (daily incrementals) and under 1 hour immediately after weekly full copy.
  • Three query performance complaints from the application team were resolved entirely by the regular RUNSTATS — the queries were fine, but the optimizer was choosing wrong paths due to stale statistics.
  • Monthly test recoveries succeeded every month, with documented recovery times for each critical tablespace. The documentation now serves as the basis for the bank's disaster recovery audit response.

Lessons Learned

  1. Measure before you optimize. REORG REPORTONLY and catalog queries gave Sandra the data to prioritize maintenance tasks and justify the scheduling to management.

  2. SHRLEVEL CHANGE makes maintenance practical for 24x7 systems. The brief drain phase is invisible to well-configured applications with appropriate timeout settings.

  3. Inline statistics save significant time. Combining RUNSTATS within REORG eliminated a separate 45-minute step each week.

  4. Test your recoveries. A backup you cannot restore from is not a backup. Monthly test recoveries are non-negotiable.

  5. Automate and monitor. Manual maintenance is unreliable. Scheduler integration with automated alerting ensures consistency and rapid response to failures.


Discussion questions:

  1. If Meridian Bank's transaction volume tripled, which element of Sandra's maintenance plan would need to change first? Why?
  2. Sandra chose SHRLEVEL CHANGE for weekly REORG. Under what circumstances would SHRLEVEL REFERENCE be a better choice?
  3. How would the maintenance plan change if the bank migrated from z/OS to DB2 LUW?