Case Study 1: Catalog-Driven Automation — Self-Managing DB2
Background
Lakewood Federal Credit Union (LFCU) is a mid-sized financial cooperative headquartered in Portland, Oregon, with $8.4 billion in assets, 620,000 members, and 42 branch locations across the Pacific Northwest. LFCU's core banking system runs on DB2 12 for z/OS, managing approximately 1,800 tables across four databases with a combined uncompressed data volume of 3.2 terabytes.
The DB2 administration team consists of three people:
- Sandra Voss, Senior DBA (22 years of DB2 experience)
- Marcus Chen, Junior DBA (18 months of DB2 experience)
- Priya Ramaswamy, DBA/Developer hybrid (4 years of experience, split between application development and database administration)
This three-person team is responsible for all DB2 administration: RUNSTATS, REORG, COPY, recovery, security management, capacity planning, and change management. They support a 24/7 banking operation with a four-hour nightly batch window.
The Problem
When Sandra joined LFCU in 2004, the environment had 400 tables. Over two decades, the schema has grown to 1,800 tables through organic application growth, regulatory requirements, and data warehouse integration. But the DBA team has not grown with it. The maintenance approach that worked for 400 tables — manually scheduled RUNSTATS, calendar-based REORG, spreadsheet-tracked statistics freshness — has become unsustainable.
The symptoms are severe:
| Issue | Frequency | Impact |
|---|---|---|
| Stale statistics causing poor access paths | 3-4 incidents per month | Batch jobs running 2-5x longer than expected |
| Missed REORG leading to tablespace fragmentation | Weekly discoveries | Gradual performance degradation, emergency REORG during business hours |
| Missed COPY backups | Monthly discoveries | Extended recovery time if failure occurs |
| Over-REORGing stable reference tables | Every weekend | Wasted batch window time, unnecessary I/O |
| Security privilege drift | Quarterly audit findings | Audit exceptions requiring remediation |
Sandra estimates that her team spends 60% of their time on reactive firefighting — fixing problems that could have been prevented with proactive monitoring. The remaining 40% is consumed by manual maintenance scheduling, leaving zero time for optimization, capacity planning, or process improvement.
"We are not managing DB2," Sandra tells her manager. "DB2 is managing us."
The Catalyst
The breaking point comes during the Q3 OCC (Office of the Comptroller of the Currency) examination. The examiners request evidence that RUNSTATS is executed on all critical tables within 7 days of significant data changes. Sandra cannot produce this evidence because the team has no systematic way to track statistics freshness against data volatility. The credit union receives a Matter Requiring Attention (MRA) — a formal regulatory finding that must be addressed within 90 days.
Sandra has 90 days to build a solution. She has no budget for third-party tools (the request was denied in the last budget cycle). She has three people, one of whom has been on the job for only 18 months.
She decides to build the solution from what she already has: the DB2 catalog and Real-Time Statistics.
The Solution: Project Lighthouse
Sandra names her initiative "Project Lighthouse" — a system that illuminates the state of the DB2 environment automatically, without human eyes watching every tablespace.
Phase 1: The Diagnostic Foundation (Weeks 1-3)
Sandra starts by building a comprehensive catalog query library. She assigns Marcus the task of writing and testing queries in four categories:
Category 1: Statistics Freshness
Marcus writes queries against SYSIBM.SYSTABLESPACESTATS that compare STATSINSERTS + STATSDELETES + STATSUPDATES (changes since last RUNSTATS) against TOTALROWS. Tables where the change ratio exceeds 10% are flagged as "statistics stale." Tables where it exceeds 25% are flagged as "statistics critical."
He joins these results with SYSIBM.SYSTABLES to pull the STATSTIME, allowing the system to also flag tables where no RUNSTATS has been run in 30 days regardless of change volume — catching the edge case of tables with very low but steady change rates.
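A minimal sketch of the kind of freshness query Marcus might have written. The join predicates and the 10%/30-day thresholds come from the description above; the output column names (CHANGES, CHANGE_PCT) are this sketch's own, and the NULL handling on the RTS counters is a production detail left to a comment.

```sql
-- Statistics-freshness check (sketch). RTS counters can be NULL after
-- certain utility events; COALESCE would be needed in production.
SELECT TS.DBNAME,
       TS.NAME AS TSNAME,
       T.NAME AS TABNAME,
       T.STATSTIME,
       TS.TOTALROWS,
       TS.STATSINSERTS + TS.STATSDELETES + TS.STATSUPDATES AS CHANGES,
       CASE WHEN TS.TOTALROWS > 0
            THEN DEC((TS.STATSINSERTS + TS.STATSDELETES + TS.STATSUPDATES)
                     * 100.0 / TS.TOTALROWS, 7, 2)
            ELSE 0
       END AS CHANGE_PCT
FROM SYSIBM.SYSTABLESPACESTATS TS
JOIN SYSIBM.SYSTABLES T
  ON T.DBNAME = TS.DBNAME
 AND T.TSNAME = TS.NAME
WHERE T.TYPE = 'T'
  AND (   (TS.TOTALROWS > 0
           AND (TS.STATSINSERTS + TS.STATSDELETES + TS.STATSUPDATES)
               * 100.0 / TS.TOTALROWS > 10)        -- "statistics stale"
       OR T.STATSTIME < CURRENT TIMESTAMP - 30 DAYS)  -- low-churn edge case
ORDER BY CHANGE_PCT DESC;
```

Raising the 10 in the predicate to 25 yields the "statistics critical" tier described above.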
Category 2: REORG Need
Marcus queries SYSIBM.SYSTABLESPACESTATS for the REORGINSERTS, REORGDELETES, and REORGUPDATES columns, comparing the total change volume against TOTALROWS. He adds a join to SYSIBM.SYSINDEXES to check CLUSTERRATIOF for clustering indexes. Tablespaces where either the change ratio exceeds 20% or the cluster ratio has dropped below 0.75 are flagged for REORG.
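The REORG-need check could be sketched like this, using the 20% change ratio and 0.75 clustering threshold stated above. The LEFT JOIN keeps tablespaces whose tables have no clustering index; everything else is illustrative.

```sql
-- REORG-candidate check (sketch): flag on change volume OR clustering decay.
SELECT TS.DBNAME,
       TS.NAME AS TSNAME,
       TS.TOTALROWS,
       TS.REORGINSERTS + TS.REORGDELETES + TS.REORGUPDATES AS CHANGES,
       IX.NAME AS CLUSTERING_IX,
       IX.CLUSTERRATIOF
FROM SYSIBM.SYSTABLESPACESTATS TS
JOIN SYSIBM.SYSTABLES T
  ON T.DBNAME = TS.DBNAME
 AND T.TSNAME = TS.NAME
LEFT JOIN SYSIBM.SYSINDEXES IX
  ON IX.TBCREATOR = T.CREATOR
 AND IX.TBNAME   = T.NAME
 AND IX.CLUSTERING = 'Y'
WHERE (TS.TOTALROWS > 0
       AND (TS.REORGINSERTS + TS.REORGDELETES + TS.REORGUPDATES)
           * 100.0 / TS.TOTALROWS > 20)   -- change ratio threshold
   OR IX.CLUSTERRATIOF < 0.75;            -- clustering decay threshold
```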
Category 3: COPY Currency
Marcus queries COPYLASTTIME and COPYCHANGES from SYSIBM.SYSTABLESPACESTATS. Tablespaces where COPYCHANGES exceeds 50,000 changes (insert, update, and delete operations since the last image copy) or where COPYLASTTIME is older than 7 days are flagged for backup. The thresholds are tiered: critical tables (CUSTOMER, ACCOUNT, TRANSACTION) have tighter thresholds than reference tables.
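A sketch of the COPY-currency check. The tiered per-table thresholds are introduced later in the case study; here the global defaults are hard-coded for illustration, and the IS NULL predicate catches tablespaces that have never been copied.

```sql
-- COPY-currency check (sketch): flag on change volume, age, or no copy at all.
SELECT DBNAME,
       NAME AS TSNAME,
       COPYLASTTIME,
       COPYCHANGES
FROM SYSIBM.SYSTABLESPACESTATS
WHERE COPYCHANGES > 50000                            -- change-volume threshold
   OR COPYLASTTIME IS NULL                           -- never image-copied
   OR COPYLASTTIME < CURRENT TIMESTAMP - 7 DAYS;     -- age threshold
```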
Category 4: Security Baseline
Priya builds a set of privilege audit queries against SYSIBM.SYSTABAUTH. She creates a "privilege baseline" — a snapshot of all current grants — and writes comparison queries that detect deviations from the baseline. Any new GRANT or REVOKE that was not pre-approved through the change management process triggers an alert.
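The baseline comparison might be sketched as follows. The snapshot table PRIV_BASELINE and its schema are hypothetical (the case study names only the technique, not the table); the NOT EXISTS form flags any grant row present in the live catalog but absent from the snapshot.

```sql
-- Privilege-drift check (sketch). PRIV_BASELINE is a hypothetical snapshot
-- populated periodically via INSERT ... SELECT FROM SYSIBM.SYSTABAUTH.
SELECT A.GRANTEE, A.GRANTOR, A.TCREATOR, A.TTNAME,
       A.SELECTAUTH, A.INSERTAUTH, A.UPDATEAUTH, A.DELETEAUTH
FROM SYSIBM.SYSTABAUTH A
WHERE NOT EXISTS (
    SELECT 1
    FROM MERIDIAN_ADMIN.PRIV_BASELINE B
    WHERE B.GRANTEE    = A.GRANTEE
      AND B.TCREATOR   = A.TCREATOR
      AND B.TTNAME     = A.TTNAME
      AND B.SELECTAUTH = A.SELECTAUTH
      AND B.INSERTAUTH = A.INSERTAUTH
      AND B.UPDATEAUTH = A.UPDATEAUTH
      AND B.DELETEAUTH = A.DELETEAUTH
);
```

A symmetric query with the tables swapped would detect un-approved REVOKEs (rows in the baseline that have disappeared from the catalog).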
Phase 2: The Automation Engine (Weeks 4-6)
With the diagnostic queries tested and validated, Sandra builds the automation layer. She creates a stored procedure named LIGHTHOUSE_ENGINE that:
- Executes all diagnostic queries and inserts results into a tracking table (MERIDIAN_ADMIN.LIGHTHOUSE_FINDINGS)
- Compares current findings against previous runs to detect trends
- Generates utility control statements (RUNSTATS, REORG, COPY) for objects that exceed their thresholds
- Writes the generated utility statements to a staging table for review
- Sends a daily summary to the DBA team via email (through a WLM-managed stored procedure that hands the message to the site's SMTP gateway)
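The statement-generation step above might look like the following sketch. The staging table LIGHTHOUSE_STAGING and the FINDINGS column names are assumptions for illustration; only the findings table itself is named in the case study.

```sql
-- Turn the latest STATS_STALE findings into staged RUNSTATS statements
-- (sketch; LIGHTHOUSE_STAGING, FINDING_TYPE, and RUN_TS are illustrative).
INSERT INTO MERIDIAN_ADMIN.LIGHTHOUSE_STAGING
       (OBJ_NAME, UTILITY_STMT, GENERATED_AT)
SELECT DBNAME CONCAT '.' CONCAT OBJ_NAME,
       'RUNSTATS TABLESPACE ' CONCAT DBNAME CONCAT '.' CONCAT OBJ_NAME
         CONCAT ' TABLE(ALL) INDEX(ALL)',
       CURRENT TIMESTAMP
FROM MERIDIAN_ADMIN.LIGHTHOUSE_FINDINGS
WHERE FINDING_TYPE = 'STATS_STALE'
  AND RUN_TS = (SELECT MAX(RUN_TS)
                FROM MERIDIAN_ADMIN.LIGHTHOUSE_FINDINGS);
```

A DBA then reviews the staged statements and submits the approved ones as utility jobs, which is exactly the human-in-the-loop gate the next paragraph describes.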
The critical design decision: Lighthouse does NOT execute utilities automatically. It generates and recommends them. Sandra insists on human review before execution — automation should assist decisions, not replace them.
Phase 3: Tuning and Validation (Weeks 7-9)
The team runs Lighthouse in parallel with their manual processes for three weeks. The results are illuminating:
- Lighthouse identifies 142 tables with stale statistics that the team's manual schedule had missed
- 38 tablespaces are flagged for REORG that had not been on the manual schedule
- 12 tablespaces are removed from the weekly REORG schedule because Lighthouse shows they have had zero changes since the last REORG — the team had been reorganizing them every Sunday for no reason
- 7 privilege grants are detected that were not in the change management system
The team adjusts thresholds based on the parallel run. Some tables (high-volume transaction tables) need tighter thresholds. Some tables (yearly regulatory snapshots) need looser thresholds. Sandra builds a configuration table (LIGHTHOUSE_CONFIG) that stores per-table threshold overrides.
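One plausible shape for the override table, and how a diagnostic query could fall back to the global default when no override exists. The column names and the VARCHAR lengths are this sketch's assumptions; only LIGHTHOUSE_CONFIG is named in the case study.

```sql
-- Per-tablespace threshold overrides (sketch). NULL means "use the default".
CREATE TABLE MERIDIAN_ADMIN.LIGHTHOUSE_CONFIG
( DBNAME     VARCHAR(24)   NOT NULL
, TSNAME     VARCHAR(24)   NOT NULL
, STATS_PCT  DECIMAL(5,2)           -- stale-statistics change ratio
, REORG_PCT  DECIMAL(5,2)           -- REORG change ratio
, COPY_DAYS  SMALLINT               -- max days since last image copy
, PRIMARY KEY (DBNAME, TSNAME)
);

-- Diagnostic queries pick up the override via COALESCE:
SELECT TS.DBNAME,
       TS.NAME AS TSNAME,
       COALESCE(C.STATS_PCT, 10.0) AS STATS_THRESHOLD
FROM SYSIBM.SYSTABLESPACESTATS TS
LEFT JOIN MERIDIAN_ADMIN.LIGHTHOUSE_CONFIG C
  ON C.DBNAME = TS.DBNAME
 AND C.TSNAME = TS.NAME;
```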
Phase 4: Regulatory Evidence (Weeks 10-12)
For the OCC MRA remediation, Sandra builds a reporting layer on top of the Lighthouse findings table. The reports show:
- For every critical table, the date of the last RUNSTATS, the change volume since that RUNSTATS, and whether the RUNSTATS was triggered within the 7-day policy window
- A historical trend showing statistics freshness compliance over time
- Exception reports for any period where the 7-day window was exceeded, with root cause (e.g., "batch window overrun on 2025-11-14 delayed RUNSTATS by 2 days")
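The compliance report at the heart of the MRA evidence could be sketched as a single query over the findings history. The timestamp columns shown here are assumptions about how LIGHTHOUSE_FINDINGS records each finding and its resolution.

```sql
-- 7-day RUNSTATS policy report (sketch; STALE_DETECTED_TS and RUNSTATS_TS
-- are illustrative columns on the findings table).
SELECT F.OBJ_NAME,
       F.STALE_DETECTED_TS,
       F.RUNSTATS_TS,
       CASE WHEN F.RUNSTATS_TS <= F.STALE_DETECTED_TS + 7 DAYS
            THEN 'COMPLIANT'
            ELSE 'EXCEPTION'
       END AS POLICY_STATUS
FROM MERIDIAN_ADMIN.LIGHTHOUSE_FINDINGS F
WHERE F.FINDING_TYPE = 'STATS_STALE'
ORDER BY F.STALE_DETECTED_TS;
```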
The OCC examiners review the Lighthouse system during the follow-up examination and close the MRA. One examiner comments that the level of automation and evidence exceeds what they see at institutions five times LFCU's size.
Results After 6 Months
| Metric | Before Lighthouse | After Lighthouse |
|---|---|---|
| Stale statistics incidents per month | 3-4 | 0 |
| Emergency REORG during business hours | 2-3 per month | 0 |
| Batch window utilization (useful work) | 85% (15% wasted on unnecessary REORG) | 97% |
| Time spent on reactive firefighting | 60% | 15% |
| Time available for optimization | 0% | 35% |
| Regulatory audit findings (DB2-related) | 1 MRA | 0 |
| DBA team stress level (self-reported) | "Drowning" | "Manageable" |
Lessons Learned
1. The catalog is a DBA power tool, not a reference manual. Sandra's team had been querying the catalog occasionally for ad hoc questions. Lighthouse transforms the catalog from a passive reference into an active monitoring system. The same information that was always available becomes transformative when queried systematically and automatically.
2. Real-Time Statistics change the game on z/OS. Without RTS, Lighthouse would require running RUNSTATS on every table just to determine whether RUNSTATS was needed — a circular dependency. RTS provides the change counters that make intelligent scheduling possible without that overhead.
3. Automation should augment, not replace, human judgment. Sandra's decision to have Lighthouse recommend utility jobs rather than execute them automatically is validated when Marcus catches a REORG recommendation that would have conflicted with a special batch process. Automation handles the analysis; humans handle the exceptions.
4. Regulatory compliance is a catalog problem. The OCC MRA was fundamentally a metadata management problem — can you demonstrate that you manage your statistics proactively? The catalog already contained the evidence; the credit union just needed a systematic way to extract and present it.
5. You do not need expensive tools. Every component of Lighthouse — the diagnostic queries, the stored procedure engine, the tracking tables, the configuration overrides — is built with standard DB2 SQL and catalog access. No third-party software was purchased. The entire project cost was approximately 500 person-hours of DBA time over 12 weeks.
Discussion Questions
- If you were building a Lighthouse-equivalent system for DB2 for LUW, which SYSCAT views would replace the SYSIBM tables Sandra used? What would you use instead of Real-Time Statistics for change tracking?
- Sandra chose not to have Lighthouse execute utilities automatically. Under what circumstances might automatic execution be appropriate? What safeguards would you put in place?
- The privilege audit component of Lighthouse uses a baseline comparison approach. What are the limitations of this approach? How would you handle legitimate privilege changes that are approved through the change management system but would otherwise trigger alerts?
- How would you extend Lighthouse to track not just current state but trends over time — for example, detecting tables with accelerating growth rates that might require partition management soon?
- LFCU has three DBAs for 1,800 tables. What team size would justify the investment in a commercial catalog management tool (such as BMC MainView or CA Database Analyzer) instead of a custom solution?