> "The final exam for a DBA is not a written test. It is the morning you walk into the data center, look at every blinking light, every scrolling log, every query in flight — and declare with confidence: this system is ready." — attributed to a...
In This Chapter
- 36.1 The Capstone Challenge
- 36.2 System Architecture Review
- 36.3 Schema and Design Review
- 36.4 Performance Baseline and Stress Test
- 36.5 EXPLAIN Analysis Under Load
- 36.6 Security Audit
- 36.7 Backup and Recovery Drill
- 36.8 Concurrency and Locking Analysis
- 36.9 Utility and Maintenance Review
- 36.10 Capacity Planning Report
- 36.11 Executive Summary Presentation
- 36.12 The Final Meridian Bank Report
- Spaced Review: Comprehensive Topic Recall
- Summary
Chapter 36: Capstone — The Meridian National Bank: Complete System Review, Stress Test, and Disaster Recovery Drill
"The final exam for a DBA is not a written test. It is the morning you walk into the data center, look at every blinking light, every scrolling log, every query in flight — and declare with confidence: this system is ready." — attributed to a senior IBM Systems Engineer, circa 2005
36.1 The Capstone Challenge
You have traveled a long road. Thirty-five chapters have taken you from your first SELECT * FROM SYSIBM.SYSDUMMY1 to the heights of enterprise architecture — data sharing groups on z/OS, HADR failover on Linux, cloud-native analytics on IBM Db2 Warehouse, row and column access control, advanced optimization, and everything in between. Now it is time to bring every skill together in a single, integrated exercise.
Here is the scenario.
Meridian National Bank — our progressive case study throughout this book — is preparing for its annual regulatory examination. The Office of the Comptroller of the Currency (OCC) has notified the bank that examiners will arrive in six weeks. The Chief Information Officer has called an emergency meeting. She turns to you — the lead database administrator — and says:
"I need a complete health assessment of every DB2 system we run. I need proof that our disaster recovery works. I need a capacity plan for the next three years. And I need it all summarized in a report I can hand to the board of directors. You have four weeks."
This chapter walks through every phase of that engagement, step by step. It is not a theoretical exercise. Every SQL statement, every monitoring query, every checklist item reflects what a real DBA would execute in a real bank. By the end of this chapter you will have produced:
- A system architecture verification report covering all three DB2 environments (z/OS data sharing, LUW HADR, cloud analytics).
- A schema and design review with quantified findings.
- A performance baseline and the results of a simulated stress test.
- An EXPLAIN-based analysis of the top queries under load.
- A security audit mapped to PCI-DSS requirements.
- A documented disaster recovery drill with measured Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- A concurrency and locking analysis.
- A utility and maintenance schedule review.
- A three-year capacity plan.
- An executive summary presentation.
Let us begin.
36.1.1 How This Chapter Is Organized
Each section of this chapter follows a consistent pattern. First, we describe what we are reviewing and why it matters. Then we provide the actual SQL statements and commands you would execute — not pseudocode, not summaries, but production-ready queries. Next, we show the results we found in the Meridian environment and how we interpreted them. Finally, we document the findings in a standardized format suitable for the final report.
This structure mirrors how a real assessment engagement works. The client does not want a theoretical framework. They want evidence: what did you check, what did you find, and what do you recommend? Every finding has a severity level (CRITICAL, HIGH, MEDIUM, LOW, or OK), a business impact statement, and a recommended action with estimated timeline and cost.
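The finding format above lends itself to a small data structure for collecting results across all sections of the assessment. A minimal sketch (the field names and helper are illustrative, not part of any IBM tooling):

```python
from dataclasses import dataclass

# Severity levels used throughout the assessment, ordered worst-first
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW", "OK")

@dataclass
class Finding:
    finding_id: str      # e.g. "ARCH-002"
    environment: str     # "z/OS", "LUW", or "Cloud"
    component: str
    severity: str        # one of SEVERITIES
    impact: str          # business impact statement
    recommendation: str  # action, with timeline and cost estimate

def worst_severity(findings):
    """Return the most severe level present, for the report's headline."""
    present = {f.severity for f in findings}
    for level in SEVERITIES:
        if level in present:
            return level
    return "OK"
```

A report containing one MEDIUM and one OK finding, for example, headlines as MEDIUM.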
36.1.2 Prerequisites and Preparation
Before beginning the assessment, the lead DBA should ensure:
- Administrative access to all three DB2 environments (z/OS SYSADM or equivalent, LUW DBADM, cloud instance administrator).
- Monitoring enabled — on LUW, the MON_GET_* table functions require the appropriate monitor switches to be active. Verify with SELECT * FROM TABLE(MON_GET_DATABASE(-2)).
- EXPLAIN tables created — the EXPLAIN facility requires EXPLAIN tables in the schema of the user running EXPLAIN. If they do not exist, create them with the EXPLAIN.DDL script provided by IBM.
- A test environment for the stress test — never run a stress test against production. The test environment should be a recent clone of production, with production-representative data volumes.
- Stakeholder notification — inform the operations team, the application teams, and management that an assessment is underway. Some activities (particularly the DR drill) require coordination with other teams.
- Documentation baseline — collect the current enterprise architecture diagram, the last capacity planning report, the security policy, the DR plan, and the maintenance schedule. These are the documents against which you will compare your findings.
With preparations complete, we proceed to the first phase of the assessment.
36.2 System Architecture Review
36.2.1 The Meridian Bank DB2 Landscape
Meridian National Bank operates three distinct DB2 environments:
| Environment | Platform | Version | Purpose | HA Strategy |
|---|---|---|---|---|
| Production OLTP | z/OS 2.5, DB2 13 for z/OS | V13R1 | Core banking transactions | 3-member data sharing group |
| Reporting / Mid-tier | RHEL 8.8, DB2 11.5.8 | 11.5.8.0 | Branch applications, regulatory reporting | HADR with automatic client reroute |
| Analytics | IBM Db2 Warehouse on Cloud (Dallas region) | SaaS | Business intelligence, data science | Managed HA by IBM |
The architecture review begins by confirming that each environment matches the approved enterprise architecture diagram — a document maintained by the data architecture team and last updated eighteen months ago. Discrepancies between the diagram and reality are findings that must be reported.
36.2.2 z/OS Data Sharing Group Verification
The data sharing group is the crown jewel of the Meridian OLTP platform. Three members — DB2A, DB2B, and DB2C — share a single set of databases through the Coupling Facility. Our verification checklist:
Coupling Facility Health
-- Display data sharing group status: members, release levels,
-- and Coupling Facility structure information.
-- Issue as a DB2 command from any active member (DSN command
-- processor, SDSF, or the console):
-DISPLAY GROUP DETAIL
We verify:
- All three members show STATUS = 'ACTIVE'.
- The group attach name resolves correctly through WLM.
- Lock structure, SCA (Shared Communications Area), and group buffer pools are allocated in the Coupling Facility and show no rebuild-pending status.
Group Buffer Pool Verification
-- Display group buffer pool statistics for all GBPs.
-- Hit ratios and cross-invalidation rates are computed from the
-- reported read, write, and cross-invalidation counts:
-DISPLAY GROUPBUFFERPOOL(*) GDETAIL
A group buffer pool hit ratio below 95% is a warning; below 85% is a critical finding. Cross-invalidation rates are also checked — excessive cross-invalidation indicates that members are frequently updating the same pages, a sign of poor workload distribution.
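The thresholds above are easy to encode when post-processing the displayed counters. A small helper, assuming the hit ratio has already been derived as a percentage:

```python
def classify_gbp_hit_ratio(hit_ratio_pct):
    """Apply the review thresholds from the text: below 95% is a
    warning finding, below 85% is critical, otherwise OK."""
    if hit_ratio_pct < 85.0:
        return "CRITICAL"
    if hit_ratio_pct < 95.0:
        return "WARNING"
    return "OK"
```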
IRLM (Internal Resource Lock Manager) Verification
-- Global lock contention is not exposed through a catalog table.
-- The suspend rate comes from the DB2 statistics trace (global
-- contention counters in the statistics and accounting reports
-- produced by your monitor, e.g. IBM OMEGAMON for Db2).
-- The metric we compute from those counters:
--   suspend_pct = global_lock_suspends * 100.0 / global_lock_requests
Global lock suspend percentages above 2% warrant investigation. Above 5% is a critical finding.
36.2.3 LUW HADR Pair Verification
The LUW environment runs a primary database server and a standby in a separate data center (30 km away). Verification steps:
-- Check HADR status on primary
SELECT HADR_ROLE, HADR_STATE, HADR_SYNCMODE,
HADR_CONNECT_STATUS, HADR_LOG_GAP,
       PEER_WINDOW_END
FROM TABLE(MON_GET_HADR(NULL)) AS T;
Expected results:
- HADR_ROLE = 'PRIMARY'
- HADR_STATE = 'PEER' (for synchronous mode) or 'REMOTE_CATCHUP' (acceptable for async)
- HADR_CONNECT_STATUS = 'CONNECTED'
- HADR_LOG_GAP should be near zero for SYNC/NEARSYNC modes
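These expectations can be checked programmatically. A sketch, assuming the caller has fetched one MON_GET_HADR row into a Python dict keyed by column name:

```python
def check_hadr(row):
    """Flag deviations from the expected HADR posture described above.
    `row` maps MON_GET_HADR column names to values (an assumption
    about how the caller fetched the result)."""
    issues = []
    if row.get("HADR_ROLE") != "PRIMARY":
        issues.append("not primary")
    if row.get("HADR_CONNECT_STATUS") != "CONNECTED":
        issues.append("standby not connected")
    sync = row.get("HADR_SYNCMODE")
    state = row.get("HADR_STATE")
    if sync in ("SYNC", "NEARSYNC"):
        if state != "PEER":
            issues.append(f"state {state}, expected PEER")
        if row.get("HADR_LOG_GAP", 0) > 0:
            issues.append("log gap nonzero in sync mode")
    elif state not in ("PEER", "REMOTE_CATCHUP"):
        issues.append(f"unexpected state {state}")
    return issues
```

An empty list means the pair matches the expected posture; anything else becomes a finding.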
We also verify the HADR configuration parameters on which failover depends; the Automatic Client Reroute (ACR) alternate server itself is set with UPDATE ALTERNATE SERVER FOR DATABASE and confirmed in the database directory:
-- Verify HADR configuration parameters
SELECT DBPARTITIONNUM, NAME, VALUE
FROM SYSIBMADM.DBCFG
WHERE NAME IN ('hadr_local_host', 'hadr_local_svc',
'hadr_remote_host', 'hadr_remote_svc',
'hadr_syncmode', 'hadr_timeout');
36.2.4 Cloud Analytics Environment
For the IBM Db2 Warehouse on Cloud instance, verification is performed through the IBM Cloud dashboard and REST APIs:
- Confirm the instance is running in the Dallas (us-south) region.
- Verify the instance plan (Flex One, Enterprise, or equivalent).
- Check that IP allowlisting is configured to permit connections only from Meridian's corporate network and VPN ranges.
- Verify that encryption at rest and in transit are enabled (TLS 1.2+).
- Confirm that the ETL pipeline from the LUW reporting database to the cloud warehouse completed its last scheduled load within the expected window.
36.2.5 Architecture Findings Template
| Finding ID | Environment | Component | Expected | Actual | Severity | Recommendation |
|---|---|---|---|---|---|---|
| ARCH-001 | z/OS | Data Sharing Members | 3 active | 3 active | OK | None |
| ARCH-002 | LUW | HADR State | PEER | REMOTE_CATCHUP | WARNING | Investigate network latency |
| ARCH-003 | Cloud | IP Allowlist | Corporate + VPN only | Includes deprecated VPN range | MEDIUM | Remove stale IP range |
36.3 Schema and Design Review
The schema review examines the logical and physical data model for correctness, efficiency, and compliance with Meridian's data governance standards.
36.3.1 Data Model Validation
We begin with a catalog-based audit. Every table in the MERIDIAN schema should have:
- A primary key.
- Appropriate foreign keys enforcing referential integrity.
- Column names following the bank's naming convention (prefix with entity abbreviation, no abbreviations in business-critical columns).
- Appropriate data types (no VARCHAR(4000) for a 10-character code field).
-- Find tables without primary keys
SELECT T.TABSCHEMA, T.TABNAME, T.CARD, T.NPAGES
FROM SYSCAT.TABLES T
WHERE T.TABSCHEMA = 'MERIDIAN'
AND T.TYPE = 'T'
AND NOT EXISTS (
SELECT 1 FROM SYSCAT.TABCONST TC
WHERE TC.TABSCHEMA = T.TABSCHEMA
AND TC.TABNAME = T.TABNAME
AND TC.TYPE = 'P'
)
ORDER BY T.CARD DESC;
Tables without primary keys are a critical finding for any banking application. Without a primary key, there is no guaranteed unique row identifier, which complicates replication, auditing, and regulatory data lineage requirements.
-- Find potential data type mismatches (oversized VARCHAR columns)
SELECT TABNAME, COLNAME, TYPENAME, LENGTH,
CASE WHEN TYPENAME = 'VARCHAR' AND LENGTH > 1000
THEN 'REVIEW: Potentially oversized'
ELSE 'OK'
END AS ASSESSMENT
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'MERIDIAN'
AND TYPENAME = 'VARCHAR'
AND LENGTH > 500
ORDER BY LENGTH DESC;
36.3.2 Index Effectiveness Analysis
Indexes are the single most impactful performance lever. Our review checks for:
- Missing indexes — tables with high read counts but only a primary key index.
- Redundant indexes — indexes whose leading columns are a prefix of another index on the same table.
- Unused indexes — indexes that have not been used in any access path since the last statistics reset.
-- Find potentially redundant indexes on LUW
SELECT A.TABSCHEMA, A.TABNAME,
A.INDNAME AS SHORTER_INDEX,
B.INDNAME AS LONGER_INDEX,
A.COLNAMES AS SHORT_COLS,
B.COLNAMES AS LONG_COLS
FROM SYSCAT.INDEXES A
JOIN SYSCAT.INDEXES B
ON A.TABSCHEMA = B.TABSCHEMA
AND A.TABNAME = B.TABNAME
AND A.INDNAME <> B.INDNAME
WHERE A.TABSCHEMA = 'MERIDIAN'
AND B.COLNAMES LIKE A.COLNAMES || '%'
AND A.UNIQUERULE <> 'P'
ORDER BY A.TABNAME, A.INDNAME;
-- Find unused indexes (LUW — requires monitoring enabled)
SELECT I.TABSCHEMA, I.TABNAME, I.INDNAME, I.COLNAMES,
M.INDEX_SCANS, M.INDEX_ONLY_SCANS,
I.NPAGES AS INDEX_PAGES
FROM SYSCAT.INDEXES I
LEFT JOIN TABLE(MON_GET_INDEX('MERIDIAN', '', -2)) AS M
ON I.IID = M.IID AND I.TABSCHEMA = M.TABSCHEMA
AND I.TABNAME = M.TABNAME
WHERE I.TABSCHEMA = 'MERIDIAN'
AND I.UNIQUERULE <> 'P'
AND COALESCE(M.INDEX_SCANS, 0) + COALESCE(M.INDEX_ONLY_SCANS, 0) = 0
ORDER BY I.NPAGES DESC;
Each unused index consumes storage, slows INSERT/UPDATE/DELETE operations, and adds REORG/RUNSTATS overhead. However, before recommending removal, we check whether the index supports a quarterly or annual process that may not have run since the monitoring reset.
36.3.3 Tablespace Health
-- Tablespace utilization and health (LUW)
SELECT TBSP_NAME, TBSP_TYPE, TBSP_STATE,
TBSP_TOTAL_PAGES, TBSP_USED_PAGES, TBSP_FREE_PAGES,
DECIMAL(TBSP_USED_PAGES * 100.0 /
NULLIF(TBSP_TOTAL_PAGES, 0), 5, 2) AS USED_PCT,
TBSP_AUTO_RESIZE_ENABLED
FROM TABLE(MON_GET_TABLESPACE('', -2)) AS T
ORDER BY USED_PCT DESC;
Tablespaces above 85% utilization are flagged as warnings. Above 95% is critical. We also verify that auto-resize is enabled for DMS tablespaces in non-production environments and that production tablespaces follow the approved growth plan.
36.3.4 Foreign Key and Referential Integrity Review
Referential integrity is the backbone of a banking data model. Every ACCOUNT must belong to a CUSTOMER. Every TRANSACTION must belong to an ACCOUNT. Every WIRE_TRANSFER must reference a valid source account. Without enforced foreign keys, orphaned records can accumulate — an account without a customer, a transaction without an account — creating data quality issues that are difficult to detect and expensive to remediate.
-- Find tables that reference other tables but lack foreign keys
-- Identify likely parent-child relationships by naming convention
SELECT CHILD.TABNAME AS POTENTIAL_CHILD,
CHILD.COLNAME AS FK_COLUMN,
PARENT.TABNAME AS LIKELY_PARENT
FROM SYSCAT.COLUMNS CHILD
JOIN SYSCAT.COLUMNS PARENT
ON CHILD.COLNAME = PARENT.COLNAME
AND CHILD.TABNAME <> PARENT.TABNAME
WHERE CHILD.TABSCHEMA = 'MERIDIAN'
AND PARENT.TABSCHEMA = 'MERIDIAN'
AND CHILD.COLNAME LIKE '%_ID'
AND NOT EXISTS (
SELECT 1 FROM SYSCAT.REFERENCES R
WHERE R.TABSCHEMA = CHILD.TABSCHEMA
AND R.TABNAME = CHILD.TABNAME
AND R.REFTABNAME = PARENT.TABNAME
)
AND EXISTS (
SELECT 1 FROM SYSCAT.TABCONST TC
WHERE TC.TABSCHEMA = PARENT.TABSCHEMA
AND TC.TABNAME = PARENT.TABNAME
AND TC.TYPE = 'P'
)
ORDER BY CHILD.TABNAME, CHILD.COLNAME;
This query uses a heuristic approach — looking for columns with matching names ending in _ID — to identify relationships that probably should have foreign keys but do not. Each result must be manually reviewed: some matches are coincidental (not every column named BRANCH_ID is a foreign key to the BRANCH table), but many reveal genuine missing constraints.
For the Meridian schema, this query identified two missing foreign keys:
- WIRE_TRANSFER_QUEUE.SOURCE_ACCOUNT_ID should reference ACCOUNT.ACCOUNT_ID.
- PAYROLL_STAGING.EMPLOYEE_ACCOUNT_ID should reference ACCOUNT.ACCOUNT_ID.
Both staging tables were designed without foreign keys to improve LOAD performance (foreign key checking adds overhead during bulk loads). This is a valid technical decision, but it must be documented, and the application must enforce the relationship in code.
36.3.5 Data Type Consistency Review
In a well-governed schema, the same business concept should use the same data type everywhere. If CUSTOMER_ID is INTEGER in the CUSTOMER table but BIGINT in the TRANSACTION_HISTORY table, implicit data type conversion occurs during joins — potentially preventing index usage and degrading performance.
-- Find columns with the same name but different data types across tables
SELECT A.COLNAME,
A.TABNAME AS TABLE_A, A.TYPENAME AS TYPE_A, A.LENGTH AS LEN_A,
B.TABNAME AS TABLE_B, B.TYPENAME AS TYPE_B, B.LENGTH AS LEN_B
FROM SYSCAT.COLUMNS A
JOIN SYSCAT.COLUMNS B
ON A.COLNAME = B.COLNAME
AND A.TABNAME < B.TABNAME
WHERE A.TABSCHEMA = 'MERIDIAN'
AND B.TABSCHEMA = 'MERIDIAN'
AND (A.TYPENAME <> B.TYPENAME OR A.LENGTH <> B.LENGTH)
AND A.COLNAME LIKE '%_ID'
ORDER BY A.COLNAME, A.TABNAME;
36.3.6 Catalog-Based Audit Summary
The schema review produces a findings table:
| Finding ID | Object | Issue | Impact | Severity | Recommendation |
|---|---|---|---|---|---|
| SCHEMA-001 | MERIDIAN.TRANSACTION_STAGING | No primary key | Data integrity risk | CRITICAL | Add surrogate PK |
| SCHEMA-002 | MERIDIAN.CUSTOMER.ADDR_LINE_3 | VARCHAR(4000) for address | Wasted buffer pool memory | LOW | Alter to VARCHAR(200) |
| SCHEMA-003 | MERIDIAN.IDX_ACCT_TYPE | Redundant (prefix of IDX_ACCT_TYPE_STATUS) | INSERT overhead | MEDIUM | Drop after validation |
| SCHEMA-004 | MERIDIAN.IDX_CUST_LEGACY | Zero scans since reset | Maintenance overhead | MEDIUM | Validate with batch team, then drop |
36.4 Performance Baseline and Stress Test
36.4.1 Establishing the Baseline
Before we can interpret stress test results, we need a baseline — a snapshot of normal system performance under typical weekday load. The baseline captures:
- Throughput: transactions per second (TPS), SQL statements per second.
- Response time: average and 95th percentile elapsed time for key transaction types.
- Resource utilization: CPU, memory, I/O rates, buffer pool hit ratios.
- Locking: lock wait time, deadlock frequency.
-- Capture baseline metrics snapshot (LUW)
SELECT SNAPSHOT_TIMESTAMP,
TOTAL_APP_COMMITS,
TOTAL_APP_ROLLBACKS,
ACT_COMPLETED_TOTAL,
AVG_RQST_CPU_TIME,
TOTAL_WAIT_TIME,
LOCK_WAIT_TIME,
LOCK_WAITS,
DEADLOCKS,
POOL_DATA_L_READS,
POOL_DATA_P_READS,
DECIMAL(
(POOL_DATA_L_READS - POOL_DATA_P_READS) * 100.0 /
NULLIF(POOL_DATA_L_READS, 0), 5, 2
) AS BP_HIT_RATIO
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
We capture this snapshot every 15 minutes for one full business day — typically a Tuesday or Wednesday, avoiding month-end processing. The resulting 40+ data points form our baseline profile.
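The hit-ratio arithmetic in the snapshot query can be reproduced when post-processing the collected samples; a minimal helper:

```python
def bp_hit_ratio(logical_reads, physical_reads):
    """Buffer pool hit ratio, using the same formula as the snapshot
    query: (logical - physical) * 100 / logical. Returns None when
    there were no logical reads (mirroring the NULLIF guard)."""
    if logical_reads == 0:
        return None
    return round((logical_reads - physical_reads) * 100.0 / logical_reads, 2)
```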
36.4.2 Designing the Stress Test Workload
The stress test simulates peak load conditions. For Meridian National Bank, peak load occurs on the last business day of each month when:
- Payroll processing for 14,000 employees runs concurrently with normal banking operations.
- Statement generation for 2.3 million accounts runs in batch.
- Wire transfer volume increases by 340% over the daily average.
- The online banking portal experiences 2.5x normal concurrent users.
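To size the simulated workload, the multipliers above are applied to baseline daily volumes. A sketch of the arithmetic (the baseline figures in the test are hypothetical; a 340% increase means 4.4x the daily average):

```python
def peak_targets(baseline):
    """Scale baseline daily volumes to month-end peak using the
    multipliers above: wire volume +340% (i.e. 4.4x) and 2.5x
    concurrent portal users; other volumes pass through unchanged."""
    targets = dict(baseline)
    targets["wire_transfers"] = round(baseline["wire_transfers"] * 4.4)
    targets["portal_users"] = round(baseline["portal_users"] * 2.5)
    return targets
```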
Our simulated workload (see code/stress-test-workload.sql) includes:
- Account inquiry — read-only, high frequency (simulates online banking).
- Fund transfer — read-write, medium frequency (simulates teller and wire operations).
- Statement generation — batch read, sequential scan of large tables.
- Payroll posting — bulk INSERT/UPDATE, high contention on account balance tables.
- Regulatory reporting — complex aggregation queries against historical data.
36.4.3 Executing the Stress Test
The stress test is orchestrated using a combination of DB2's built-in tools and custom scripts:
# Launch stress test with 200 concurrent connections
# Each connection runs a randomized mix of the 5 transaction types
# Each background subshell needs its own CLP connection — a connection
# made in the parent shell is not shared with backgrounded db2 processes
for i in $(seq 1 200); do
  ( db2 connect to MERIDIANDB user STRESSTEST using $STRESS_PWD >/dev/null
    db2 -tvf stress-test-workload.sql ) &
done
wait
# Monitor in real-time
db2 "SELECT * FROM TABLE(MON_GET_ACTIVITY(NULL, -2))
WHERE ACTIVITY_STATE = 'EXECUTING'
ORDER BY ELAPSED_TIME_SEC DESC
FETCH FIRST 20 ROWS ONLY"
In practice, most organizations use a load testing tool such as IBM Rational Performance Tester, Apache JMeter with a JDBC driver, or a custom harness. The key requirement is that the tool can simulate realistic concurrency — not just sequential statement execution, but genuinely concurrent sessions with think times and transaction boundaries that mimic real user behavior.
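The concurrency pattern such a harness needs — genuinely parallel sessions, a weighted-random transaction mix, think times — can be sketched as follows; the transaction names and weights are placeholders, and a real harness would execute SQL through a database driver rather than the `execute` stub used here:

```python
import random
import threading
import time

# Placeholder transaction mix: (name, relative weight)
MIX = [("account_inquiry", 50), ("fund_transfer", 25),
       ("statement_gen", 5), ("payroll_post", 5), ("reg_report", 15)]

def run_session(session_id, iterations, results, execute):
    """One simulated user session: pick a weighted-random transaction,
    run it, then sleep a short 'think time' before the next one."""
    names = [m[0] for m in MIX]
    weights = [m[1] for m in MIX]
    for _ in range(iterations):
        txn = random.choices(names, weights=weights, k=1)[0]
        execute(session_id, txn)               # stub for the real SQL call
        results.append(txn)                    # list.append is thread-safe
        time.sleep(random.uniform(0.0, 0.01))  # think time (shortened here)

def run_test(sessions=10, iterations=20, execute=lambda s, t: None):
    """Launch the sessions as concurrent threads and collect results."""
    results = []
    threads = [threading.Thread(target=run_session,
                                args=(i, iterations, results, execute))
               for i in range(sessions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The essential point is that the sessions overlap in time, so lock contention and buffer pool pressure emerge as they would in production.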
36.4.4 Monitoring Under Load
During the stress test, we capture metrics at 60-second intervals:
-- Real-time throughput monitoring
SELECT CURRENT TIMESTAMP AS SAMPLE_TIME,
MEMBER,
ACT_COMPLETED_TOTAL AS COMPLETED_ACTS,
APP_RQSTS_COMPLETED_TOTAL AS COMPLETED_RQSTS,
AVG_RQST_CPU_TIME,
TOTAL_CPU_TIME,
TOTAL_WAIT_TIME,
LOCK_WAIT_TIME,
IO_WAIT_TIME,
POOL_DATA_L_READS,
POOL_DATA_P_READS,
ROWS_READ,
ROWS_RETURNED
FROM TABLE(MON_GET_DATABASE(-2)) AS T;
-- Top resource-consuming statements during stress test
SELECT EXECUTABLE_ID,
NUM_EXECUTIONS,
TOTAL_CPU_TIME / NUM_EXECUTIONS AS AVG_CPU_PER_EXEC,
TOTAL_ACT_TIME / NUM_EXECUTIONS AS AVG_ELAPSED_PER_EXEC,
ROWS_READ / NUM_EXECUTIONS AS AVG_ROWS_READ_PER_EXEC,
ROWS_RETURNED / NUM_EXECUTIONS AS AVG_ROWS_RET_PER_EXEC,
DECIMAL(ROWS_RETURNED * 100.0 / NULLIF(ROWS_READ, 0), 5, 2)
AS EFFICIENCY_PCT
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS T
WHERE NUM_EXECUTIONS > 0
ORDER BY TOTAL_CPU_TIME DESC
FETCH FIRST 20 ROWS ONLY;
36.4.5 Identifying Bottlenecks
During the Meridian stress test, we observe three bottlenecks:
Bottleneck 1: Buffer Pool Pressure
The buffer pool hit ratio for the ACCT_DATA tablespace drops from 99.2% at baseline to 87.4% under load. Root cause: the statement generation batch job performs sequential scans of the ACCOUNT_HISTORY table (480 million rows), flushing recently used pages from the buffer pool.
Resolution: Assign the batch workload to a separate buffer pool (BP_BATCH) sized appropriately for sequential prefetch. This isolates the batch I/O from the OLTP buffer pool.
Bottleneck 2: Lock Contention on ACCT_BALANCE
The payroll posting process updates the same account balance rows that teller transactions read. Lock wait time spikes to 340ms average (baseline: 8ms).
Resolution: Change the payroll posting to use SKIP LOCKED DATA for non-critical reads, and restructure the posting to batch updates per account rather than individual row-level updates in a tight loop. Consider Currently Committed semantics for reader isolation.
Bottleneck 3: Sort Overflow
The regulatory reporting query performs a large GROUP BY across 18 months of transaction data. Under concurrent load, the sort heap is exhausted, and sorts spill to disk — a 15x performance degradation.
Resolution: Increase SORTHEAP from 4096 to 16384 pages for the reporting workload class. Alternatively, create a materialized query table (MQT) for the pre-aggregated regulatory report data.
36.4.6 Stress Test Results Summary
| Metric | Baseline | Stress Test | Threshold | Status |
|---|---|---|---|---|
| Throughput (TPS) | 2,400 | 3,800 | >3,500 target | PASS |
| Avg Response Time (ms) | 12 | 48 | <100 | PASS |
| 95th Percentile RT (ms) | 34 | 210 | <500 | PASS |
| BP Hit Ratio (OLTP) | 99.2% | 87.4% | >95% | FAIL |
| Lock Wait Time (avg ms) | 8 | 340 | <50 | FAIL |
| Sort Overflows / hr | 0 | 847 | <10 | FAIL |
| CPU Utilization | 28% | 72% | <85% | PASS |
| Deadlocks / hr | 0 | 3 | <5 | PASS |
Three of eight key metrics fail under stress. Each has an identified resolution. After implementing the fixes, the stress test is repeated. The results after remediation:
| Metric | After Remediation | Threshold | Status |
|---|---|---|---|
| BP Hit Ratio (OLTP) | 98.7% | >95% | PASS |
| Lock Wait Time (avg ms) | 22 | <50 | PASS |
| Sort Overflows / hr | 0 | <10 | PASS |
36.5 EXPLAIN Analysis Under Load
36.5.1 Capturing Access Paths
During the stress test, we identify the top 10 queries by total CPU consumption and capture their access paths using the EXPLAIN facility:
-- Populate EXPLAIN tables for a specific statement
EXPLAIN PLAN SET QUERYNO = 1001 FOR
SELECT C.CUSTOMER_ID, C.CUSTOMER_NAME,
A.ACCOUNT_NUMBER, A.CURRENT_BALANCE,
T.TRANSACTION_DATE, T.TRANSACTION_AMOUNT, T.TRANSACTION_TYPE
FROM MERIDIAN.CUSTOMER C
JOIN MERIDIAN.ACCOUNT A ON C.CUSTOMER_ID = A.CUSTOMER_ID
JOIN MERIDIAN.TRANSACTION_HISTORY T ON A.ACCOUNT_ID = T.ACCOUNT_ID
WHERE C.CUSTOMER_ID = ?
AND T.TRANSACTION_DATE BETWEEN ? AND ?
ORDER BY T.TRANSACTION_DATE DESC
FETCH FIRST 50 ROWS ONLY;
-- Review the access plan
SELECT OPERATOR_ID, TARGET_ID, OPERATOR_TYPE,
OBJECT_SCHEMA, OBJECT_NAME,
TOTAL_COST, IO_COST, CPU_COST,
STREAM_COUNT
FROM EXPLAIN_OPERATOR
WHERE QUERYNO = 1001
ORDER BY OPERATOR_ID;
36.5.2 Common Access Path Issues Found
Issue 1: Table Scan on TRANSACTION_HISTORY
The optimizer chooses a table scan instead of using the index on (ACCOUNT_ID, TRANSACTION_DATE) because statistics are stale — the last RUNSTATS was 45 days ago, and 38 million new rows have been inserted since.
-- Fix: Update statistics
RUNSTATS ON TABLE MERIDIAN.TRANSACTION_HISTORY
ON ALL COLUMNS AND INDEXES ALL
WITH DISTRIBUTION ON ALL COLUMNS
ALLOW WRITE ACCESS;
After RUNSTATS, the optimizer switches to an index scan with a 94% reduction in estimated cost.
Issue 2: Nested Loop Join Where Hash Join Would Be Better
The regulatory summary query joins ACCOUNT (2.3 million rows) with TRANSACTION_HISTORY (480 million rows) using a nested loop because the join predicate uses a function — YEAR(T.TRANSACTION_DATE) = YEAR(CURRENT DATE). The function prevents the optimizer from using range predicates.
-- Rewrite to enable range predicates
-- Before (problematic):
WHERE YEAR(T.TRANSACTION_DATE) = YEAR(CURRENT DATE)
-- After (optimizer-friendly):
WHERE T.TRANSACTION_DATE >= DATE(VARCHAR(YEAR(CURRENT DATE)) || '-01-01')
  AND T.TRANSACTION_DATE < DATE(VARCHAR(YEAR(CURRENT DATE) + 1) || '-01-01')
Issue 3: Index-Only Access Opportunity Missed
The account balance inquiry query accesses ACCOUNT only for ACCOUNT_NUMBER and CURRENT_BALANCE, but the existing index includes only ACCOUNT_NUMBER. Adding CURRENT_BALANCE to the index enables index-only access, eliminating all data page I/O for this high-frequency query.
CREATE INDEX MERIDIAN.IDX_ACCT_BAL_COVER
ON MERIDIAN.ACCOUNT (ACCOUNT_NUMBER)
INCLUDE (CURRENT_BALANCE);
36.5.3 Access Path Regression Detection
We compare access paths from the stress test to the production baseline captured last month. A regression is any case where the optimizer chooses a more expensive plan for the same query. We use the EXPLAIN tables to compare:
-- Compare estimated costs between baseline and current
SELECT B.QUERYNO,
B.TOTAL_COST AS BASELINE_COST,
C.TOTAL_COST AS CURRENT_COST,
DECIMAL((C.TOTAL_COST - B.TOTAL_COST) * 100.0 /
NULLIF(B.TOTAL_COST, 0), 7, 2) AS COST_CHANGE_PCT
FROM EXPLAIN_OPERATOR_BASELINE B
JOIN EXPLAIN_OPERATOR C
ON B.QUERYNO = C.QUERYNO
AND B.OPERATOR_ID = 1 -- root operator
AND C.OPERATOR_ID = 1
WHERE C.TOTAL_COST > B.TOTAL_COST * 1.20 -- >20% regression
ORDER BY COST_CHANGE_PCT DESC;
Any regression greater than 20% is investigated. Common causes include stale statistics, changed data distribution, new indexes that create suboptimal plan choices, and changes to database configuration parameters.
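The same 20% filter can be applied when comparing cost snapshots exported from the EXPLAIN tables; a sketch, assuming the costs have been fetched into dicts keyed by QUERYNO:

```python
def cost_regressions(baseline, current, threshold_pct=20.0):
    """Compare per-query optimizer cost estimates between two runs and
    return {queryno: change_pct} for regressions above the threshold,
    mirroring the 20% filter in the SQL above."""
    flagged = {}
    for qno, base_cost in baseline.items():
        cur = current.get(qno)
        if cur is None or base_cost == 0:
            continue
        change_pct = (cur - base_cost) * 100.0 / base_cost
        if change_pct > threshold_pct:
            flagged[qno] = round(change_pct, 2)
    return flagged
```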
36.5.4 Top 10 Query Optimization Summary
After completing the EXPLAIN analysis of all top 10 queries, we compile the results into a summary table that becomes part of the final report:
| Query # | Description | Before Cost | After Cost | Improvement | Change Made |
|---|---|---|---|---|---|
| Q1 | Customer account inquiry | 124 | 12 | 90% | Covering index added |
| Q2 | Transaction history lookup | 89,400 | 4,200 | 95% | RUNSTATS + index scan enabled |
| Q3 | Account balance check | 34 | 8 | 76% | Covering index (index-only access) |
| Q4 | Wire transfer validation | 18 | 18 | 0% | No change needed — already optimal |
| Q5 | Daily transaction summary | 12,800 | 12,800 | 0% | No change — appropriate table scan |
| Q6 | Regulatory CTR report | 456,000 | 23,400 | 95% | Predicate rewrite (removed function) |
| Q7 | Branch activity report | 34,200 | 8,900 | 74% | RUNSTATS + better join order |
| Q8 | Payroll batch posting | 2,400 | 2,400 | 0% | No change — bulk operation, expected cost |
| Q9 | Statement generation | 890,000 | 890,000 | 0% | Expected cost for sequential scan; isolated to separate BP |
| Q10 | Year-over-year comparison | 234,000 | 18,000 | 92% | MQT created for pre-aggregated data |
Note that not every query needs optimization. Queries Q4, Q5, Q8, and Q9 were already using appropriate access paths. The optimizer chose a table scan for Q5 because the query aggregates the entire day's transactions — an index would not help. Q9 is a sequential scan of a large table that is expected and managed through buffer pool isolation. A common mistake in performance tuning is optimizing queries that are already optimal. The EXPLAIN analysis tells us where to invest effort and where to leave the optimizer's choices alone.
36.5.5 Optimization Profile Considerations
For queries where the optimizer consistently chooses a suboptimal plan despite current statistics, DB2 offers optimization profiles — XML documents that provide hints to the optimizer about preferred access methods. However, optimization profiles should be used sparingly and as a last resort:
-- Example: register an optimization profile to force index usage.
-- Use only when the optimizer consistently makes the wrong choice
-- despite accurate statistics. On LUW there is no CREATE OPTIMIZATION
-- PROFILE statement: the profile is an XML document inserted into
-- SYSTOOLS.OPT_PROFILE and then activated by name.
INSERT INTO SYSTOOLS.OPT_PROFILE (SCHEMA, NAME, PROFILE)
VALUES ('MERIDIAN', 'PROF_CTR_REPORT', BLOB(
'<?xml version="1.0" encoding="UTF-8"?>
<OPTPROFILE>
 <STMTPROFILE ID="CTR report">
  <STMTKEY><![CDATA[...the exact statement text...]]></STMTKEY>
  <OPTGUIDELINES><IXSCAN TABLE="T" INDEX="IX_TXN_DATE_TYPE"/></OPTGUIDELINES>
 </STMTPROFILE>
</OPTPROFILE>'));
SET CURRENT OPTIMIZATION PROFILE = 'MERIDIAN.PROF_CTR_REPORT';
In the Meridian assessment, no optimization profiles were needed — all performance issues were resolved through RUNSTATS, index creation, and query rewriting. If optimization profiles are present in your environment, document each one and verify that it is still needed. Profiles that override the optimizer's decisions based on outdated assumptions can cause performance regressions when data distributions change.
36.6 Security Audit
36.6.1 Privilege Review
The security audit begins with a comprehensive privilege inventory. Every GRANT in the system is examined:
-- Find users with excessive privileges (DBADM or higher)
SELECT GRANTEE, GRANTEETYPE, DBADMAUTH, SECURITYADMAUTH, DATAACCESSAUTH,
       ACCESSCTRLAUTH, SQLADMAUTH, WLMADMAUTH, EXPLAINAUTH
FROM SYSCAT.DBAUTH
WHERE DBADMAUTH = 'Y' OR SECURITYADMAUTH = 'Y' OR DATAACCESSAUTH = 'Y'
ORDER BY GRANTEE;
Findings to look for:
- Application service accounts with DBADM authority (they should have only the minimum privileges needed).
- Individual user IDs with direct GRANT instead of role-based access.
- Orphaned grants — privileges granted to users who have left the organization.
-- Find all table-level privileges
SELECT GRANTOR, GRANTEE, GRANTEETYPE, TABSCHEMA, TABNAME,
CONTROLAUTH, ALTERAUTH, DELETEAUTH, INSERTAUTH,
SELECTAUTH, UPDATEAUTH, REFAUTH, INDEXAUTH
FROM SYSCAT.TABAUTH
WHERE TABSCHEMA = 'MERIDIAN'
AND GRANTEETYPE = 'U' -- Individual users, not roles
ORDER BY GRANTEE, TABNAME;
Individual user grants are a finding. The bank's security policy requires all access to be granted through roles.
36.6.2 RCAC (Row and Column Access Control) Verification
Meridian uses RCAC to enforce data masking and row filtering:
-- Verify row permissions are active
-- (RCAC objects are cataloged in SYSCAT.CONTROLS)
SELECT TABSCHEMA, TABNAME, CONTROLNAME, ENABLE, RULETEXT
FROM SYSCAT.CONTROLS
WHERE TABSCHEMA = 'MERIDIAN'
  AND CONTROLTYPE = 'R'
ORDER BY TABNAME;
-- Verify column masks are active
SELECT TABSCHEMA, TABNAME, CONTROLNAME, COLNAME, ENABLE
FROM SYSCAT.CONTROLS
WHERE TABSCHEMA = 'MERIDIAN'
  AND CONTROLTYPE = 'C'
ORDER BY TABNAME, COLNAME;
We verify:
- Customer SSN is masked for all users except the COMPLIANCE_ROLE.
- Account balances are masked for the MARKETING_ROLE (they can see account existence but not balances).
- Row-level filtering ensures that branch managers can see only accounts belonging to their branch.
- The RCAC policies are ENABLED, not merely defined.
36.6.3 Audit Policy Check
-- Review audit policies
SELECT AUDITPOLICYNAME, AUDITSTATUS,
       CHECKINGSTATUS, CONTEXTSTATUS, EXECUTESTATUS,
       OBJMAINTSTATUS, SECMAINTSTATUS, SYSADMINSTATUS,
       VALIDATESTATUS
FROM SYSCAT.AUDITPOLICIES
ORDER BY AUDITPOLICYNAME;
The bank's compliance requirements mandate:
- All SECMAINT events are audited (privilege changes).
- All VALIDATE events are audited (authentication attempts).
- EXECUTE auditing is enabled for sensitive tables (CUSTOMER, ACCOUNT, TRANSACTION_HISTORY).
- Audit logs are forwarded to the enterprise SIEM (Security Information and Event Management) system within 60 seconds.
36.6.4 PCI-DSS Compliance Mapping
As a financial institution that processes credit card transactions, Meridian must comply with PCI-DSS. The relevant DB2 requirements map as follows:
| PCI-DSS Requirement | DB2 Control | Status | Evidence |
|---|---|---|---|
| 2.2: Develop configuration standards | DB2 hardening guide applied | PASS | Config audit report |
| 3.4: Render PAN unreadable | Column mask on CARD_NUMBER | PASS | RCAC policy MASK_PAN active |
| 7.1: Limit access to need-to-know | Role-based access control | PARTIAL | 3 individual grants found |
| 7.2: Access control system | DB2 LBAC + RCAC | PASS | Policies verified |
| 8.5: Do not use shared accounts | Service accounts reviewed | PASS | Each app has unique ID |
| 10.1: Audit trails | DB2 audit facility | PASS | All categories enabled |
| 10.5: Secure audit trails | Audit logs on WORM storage | PASS | Storage config verified |
| 10.7: Retain audit trail for 1 year | Log retention policy | PASS | 13-month retention confirmed |
The three individual user grants (Requirement 7.1) are a finding that must be remediated before the examination.
36.6.5 Penetration Test Results
The security team conducted a database-focused penetration test. Key findings relevant to DB2:
- SQL Injection Attempt: The web application properly uses parameterized queries. No injection vectors found.
- Privilege Escalation: Attempted to escalate from a restricted application role to DBADM. Failed — no known vulnerabilities in current patch level.
- Network Sniffing: All DB2 connections use TLS 1.2. No cleartext credentials observed.
- Default Credentials: No default or well-known passwords found on any DB2 instance.
36.7 Backup and Recovery Drill
36.7.1 The DR Scenario
The disaster recovery drill simulates the following scenario:
At 2:47 AM on a Tuesday, the primary SAN controller for the LUW production database fails catastrophically. The storage array enters a degraded state, and the DB2 instance crashes with media errors on the ACCT_DATA tablespace. The standby database (HADR) is 30 km away in the alternate data center.
The drill has two phases:
1. HADR Failover: Bring the standby online as the new primary.
2. Point-in-Time Recovery: Restore a specific tablespace from backup to recover data that may have been in-flight at the moment of failure.
36.7.2 HADR Failover Execution
-- On the standby server, execute takeover
-- Force option used because primary is unreachable
TAKEOVER HADR ON DATABASE MERIDIANDB BY FORCE;
The takeover command promotes the standby to primary. In NEARSYNC mode, a commit is acknowledged to the application only after the log data has reached the standby's main memory, so no committed transactions are lost unless both primary and standby fail at the same instant.
Measured Failover Time:
| Phase | Duration |
|---|---|
| Detect primary failure (monitoring alert) | 45 seconds |
| Operator decision and authorization | 3 minutes 20 seconds |
| Execute TAKEOVER BY FORCE | 12 seconds |
| Database becomes accessible | 8 seconds |
| Application reconnect (ACR) | 15 seconds |
| Total RTO | 4 minutes 40 seconds |
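The measured total can be sanity-checked by summing the phase durations from the timeline above; a minimal Python sketch:

```python
# Phase durations from the failover timeline, in seconds.
phases = {
    "detect primary failure": 45,
    "operator decision and authorization": 3 * 60 + 20,
    "execute TAKEOVER BY FORCE": 12,
    "database becomes accessible": 8,
    "application reconnect (ACR)": 15,
}

total = sum(phases.values())
print(f"Total RTO: {total // 60} min {total % 60} sec")  # Total RTO: 4 min 40 sec
```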
The bank's RTO target for the LUW system is 15 minutes. We are well within the target.
36.7.3 Point-in-Time Recovery
After the HADR failover stabilizes the environment, we perform a parallel exercise: tablespace-level point-in-time recovery (PITR) to demonstrate the ability to recover a specific tablespace to a moment just before a simulated data corruption event.
-- Step 1: List available backups
LIST HISTORY BACKUP ALL FOR MERIDIANDB;
-- Step 2: Restore a single tablespace from the most recent backup.
-- The tablespace is left in rollforward-pending state so that
-- Steps 3 and 4 can apply logs up to the chosen point in time.
RESTORE DATABASE MERIDIANDB
TABLESPACE (ACCT_DATA)
FROM /db2backup/meridiandb
TAKEN AT 20260314021500;
-- Step 3: Roll forward to a specific point in time
-- (USING LOCAL TIME avoids the default UTC interpretation of the timestamp)
ROLLFORWARD DATABASE MERIDIANDB
TO 2026-03-14-02.45.00.000000 USING LOCAL TIME
TABLESPACE (ACCT_DATA)
ONLINE;
-- Step 4: Complete the rollforward
ROLLFORWARD DATABASE MERIDIANDB
STOP TABLESPACE (ACCT_DATA);
36.7.4 Data Integrity Verification
After recovery, we verify data integrity:
-- Verify row counts match expected values
SELECT 'CUSTOMER' AS TABLENAME, COUNT(*) AS ROW_COUNT
FROM MERIDIAN.CUSTOMER
UNION ALL
SELECT 'ACCOUNT', COUNT(*) FROM MERIDIAN.ACCOUNT
UNION ALL
SELECT 'TRANSACTION_HISTORY', COUNT(*) FROM MERIDIAN.TRANSACTION_HISTORY;
-- Run SET INTEGRITY to verify referential consistency
-- (invoked via the command line; each checked table needs its own
-- exception table, so TXN_EXCEPTIONS is assumed alongside ACCT_EXCEPTIONS)
# Verify referential integrity
db2 "SET INTEGRITY FOR MERIDIAN.ACCOUNT,
MERIDIAN.TRANSACTION_HISTORY
IMMEDIATE CHECKED
FOR EXCEPTION IN MERIDIAN.ACCOUNT USE MERIDIAN.ACCT_EXCEPTIONS,
IN MERIDIAN.TRANSACTION_HISTORY USE MERIDIAN.TXN_EXCEPTIONS"
36.7.5 DR Drill Results
| Metric | Target | Actual | Status |
|---|---|---|---|
| RTO (HADR Failover) | < 15 minutes | 4 min 40 sec | PASS |
| RPO (Data Loss) | 0 transactions | 0 transactions | PASS |
| PITR Completion | < 60 minutes | 34 minutes | PASS |
| Data Integrity Check | 0 errors | 0 errors | PASS |
| Application Reconnect | < 5 minutes | 15 seconds | PASS |
36.7.6 z/OS Recovery Considerations
On the z/OS platform, the recovery drill takes a different form. DB2 for z/OS uses:
- Image copies (full and incremental) instead of the BACKUP DATABASE command.
- RECOVER utility with the TORBA (To Relative Byte Address) option for point-in-time recovery; in a data sharing group, TOLOGPOINT with an LRSN value is used instead.
- Log Apply through the recovery utility, reading archive logs from tape or disk.
- Data sharing group recovery where individual members can be recovered without affecting other members.
//RECOVER EXEC DSNUPROC,SYSTEM=DB2A,UID='RECOVER'
//SYSIN DD *
RECOVER TABLESPACE MERIDIAN.ACCTDATA
TORBA X'00000C8A23F0'
LOGONLY
/*
The z/OS recovery drill confirms that the bank can recover any single tablespace to any point in time within the log retention window (currently 7 days of online logs plus 90 days of archive logs on tape).
36.8 Concurrency and Locking Analysis
36.8.1 Lock Contention Under Concurrent Load
During the stress test, we capture detailed locking information:
-- Current lock waits (MON_LOCKWAITS administrative view)
SELECT REQ_APPLICATION_HANDLE AS WAITING_APPL,
HLD_APPLICATION_HANDLE AS HOLDING_APPL,
LOCK_OBJECT_TYPE,
LOCK_MODE AS HELD_MODE,
LOCK_MODE_REQUESTED,
LOCK_WAIT_ELAPSED_TIME AS WAIT_SECS,
TABSCHEMA, TABNAME
FROM SYSIBMADM.MON_LOCKWAITS
ORDER BY LOCK_WAIT_ELAPSED_TIME DESC;
36.8.2 Deadlock Analysis
-- Review deadlock history from the locking event monitor
-- (events written by CREATE EVENT MONITOR ... FOR LOCKING and
-- formatted into tables with EVMON_FORMAT_UE_TO_TABLES)
SELECT P.EVENT_ID, P.PARTICIPANT_NO,
P.LOCK_MODE, P.LOCK_OBJECT_TYPE,
P.TABLE_SCHEMA, P.TABLE_NAME,
A.STMT_TEXT
FROM LOCK_PARTICIPANTS P
JOIN LOCK_PARTICIPANT_ACTIVITIES A
ON P.EVENT_ID = A.EVENT_ID
AND P.PARTICIPANT_NO = A.PARTICIPANT_NO
ORDER BY P.EVENT_ID, P.PARTICIPANT_NO;
During the stress test, three deadlocks occurred, all following the same pattern: two streams of the payroll posting process contending for rows in the ACCOUNT table while real-time balance inquiries added read traffic. The deadlock graph reveals a classic cycle:
- Session A (payroll) holds an X lock on Account 10042, waits for an X lock on Account 10087.
- Session B (payroll) holds an X lock on Account 10087, waits for an X lock on Account 10042.
Resolution: The payroll posting batches are restructured to process accounts in ascending ACCOUNT_ID order, eliminating the possibility of deadlock from opposing lock acquisition sequences. This is a fundamental concurrency principle — consistent lock ordering prevents deadlocks.
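The fix can be illustrated outside the database. In this minimal Python sketch (account IDs mirror the drill scenario), both transfers acquire their locks in ascending ID order, so the opposing-direction cycle above cannot form:

```python
import threading

# One lock per account, keyed by account ID.
locks = {10042: threading.Lock(), 10087: threading.Lock()}

def transfer(src, dst):
    # Acquire locks in ascending account-ID order, regardless of the
    # transfer direction. This is the consistent-ordering fix.
    first, second = sorted((src, dst))
    with locks[first]:
        with locks[second]:
            pass  # debit src, credit dst

# Opposing directions: the exact pattern that deadlocked the payroll batch.
t1 = threading.Thread(target=transfer, args=(10042, 10087))
t2 = threading.Thread(target=transfer, args=(10087, 10042))
t1.start(); t2.start()
t1.join(); t2.join()
print("both transfers completed without deadlock")
```

Without the `sorted()` call, each thread could hold its first lock while waiting forever for the other thread's lock, which is precisely the cycle the deadlock graph showed.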
36.8.3 Isolation Level Verification
We verify that each application workload uses the correct isolation level:
| Application | Required Isolation | Configured | Status |
|---|---|---|---|
| Online Banking (reads) | Cursor Stability (CS) | CS | PASS |
| Fund Transfer | Repeatable Read (RR) | RR | PASS |
| Statement Generation | Uncommitted Read (UR) | CS | FINDING |
| Regulatory Reporting | Read Stability (RS) | RS | PASS |
The statement generation process runs under CS when UR would suffice (it reads historical data that will not change during the batch window). Changing to UR eliminates lock acquisition entirely for this workload, reducing contention with OLTP operations.
-- Fix: set the isolation level for the statement generation connection.
-- With the IBM JCC JDBC driver, the defaultIsolationLevel connection
-- property (1 = read uncommitted, i.e. UR) can be set in the URL:
-- jdbc:db2://host:port/MERIDIANDB:defaultIsolationLevel=1;
36.8.4 Lock Escalation Review
Lock escalation — when DB2 promotes many row-level locks to a table-level lock — is a serious performance concern in a concurrent environment:
-- Check lock escalation history
SELECT TABSCHEMA, TABNAME,
LOCK_ESCALS, LOCK_ESCALS_GLOBAL,
LOCK_ESCALS_LOCKLIST, LOCK_ESCALS_MAXLOCKS
FROM TABLE(MON_GET_TABLE('MERIDIAN', '', -2)) AS T
WHERE LOCK_ESCALS > 0
ORDER BY LOCK_ESCALS DESC;
If lock escalation is detected on the ACCOUNT or TRANSACTION_HISTORY tables, we increase the LOCKLIST and MAXLOCKS database configuration parameters. Alternatively, we restructure the application to commit more frequently, releasing locks and reducing the total lock count per transaction.
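The escalation trigger is simple arithmetic: one application may hold roughly MAXLOCKS percent of the LOCKLIST (sized in 4 KB pages) before DB2 escalates. A back-of-envelope Python sketch, assuming about 128 bytes per lock request (the exact size varies by platform and version):

```python
# Estimate how many locks one application may hold before escalation.
# bytes_per_lock = 128 is an assumption; check your platform's value.
def locks_before_escalation(locklist_4k_pages, maxlocks_pct, bytes_per_lock=128):
    locklist_bytes = locklist_4k_pages * 4096
    return int(locklist_bytes * maxlocks_pct / 100 / bytes_per_lock)

# e.g. LOCKLIST=10000 pages, MAXLOCKS=20
print(locks_before_escalation(10000, 20))  # 64000
```

This kind of estimate helps decide whether raising LOCKLIST/MAXLOCKS is realistic for the transaction's lock footprint, or whether more frequent commits are the better fix.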
36.9 Utility and Maintenance Review
36.9.1 REORG Schedule Verification
-- Check tables needing REORG (LUW)
SELECT TABSCHEMA, TABNAME,
REORG_PENDING, NUM_OVERFLOW_ROWS,
PCT_ROWS_COMPRESSED,
STATS_TIME
FROM TABLE(ADMIN_GET_TAB_INFO('MERIDIAN', '')) AS T
WHERE REORG_PENDING = 'Y'
OR NUM_OVERFLOW_ROWS > 1000
ORDER BY NUM_OVERFLOW_ROWS DESC;
Meridian's maintenance schedule calls for:
- Online REORG of high-activity tables every Sunday at 02:00 AM.
- Offline REORG of the largest tables (TRANSACTION_HISTORY) during the quarterly maintenance window.
- RUNSTATS after every REORG and weekly on all tables with significant DML activity.
We verify that the last REORG completed successfully and that no tables have entered REORG-pending state since:
-- Verify REORG history (SNAPTAB_REORG administrative view)
SELECT TABSCHEMA, TABNAME, REORG_TYPE,
REORG_STATUS, REORG_START, REORG_END,
REORG_COMPLETION
FROM SYSIBMADM.SNAPTAB_REORG
WHERE TABSCHEMA = 'MERIDIAN'
ORDER BY REORG_END DESC
FETCH FIRST 20 ROWS ONLY;
36.9.2 RUNSTATS Currency
Stale statistics are one of the top causes of suboptimal access paths. We check how current the statistics are:
-- Find tables with stale statistics
-- (TIMESTAMPDIFF interval code 16 = days)
SELECT TABSCHEMA, TABNAME, CARD, STATS_TIME,
TIMESTAMPDIFF(16, CHAR(CURRENT TIMESTAMP - STATS_TIME)) AS DAYS_SINCE_STATS
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MERIDIAN'
AND TYPE = 'T'
AND (STATS_TIME IS NULL
OR TIMESTAMPDIFF(16, CHAR(CURRENT TIMESTAMP - STATS_TIME)) > 7)
ORDER BY DAYS_SINCE_STATS DESC;
Any table with statistics older than 7 days is a finding for high-activity tables. For low-activity reference tables, 30 days is acceptable.
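The staleness policy above can be expressed as a small helper; a Python sketch with illustrative inputs:

```python
# Classify a table's statistics age against Meridian's policy:
# > 7 days is a finding for high-activity tables, > 30 days for
# low-activity reference tables; never-collected stats are always a finding.
def stats_finding(days_since_stats, high_activity):
    if days_since_stats is None:
        return "FINDING"
    limit = 7 if high_activity else 30
    return "FINDING" if days_since_stats > limit else "OK"

print(stats_finding(12, high_activity=True))   # FINDING
print(stats_finding(12, high_activity=False))  # OK
```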
36.9.3 Backup Schedule Verification
-- Verify backup history
SELECT OPERATION, OBJECTTYPE, OPERATIONTYPE,
START_TIME, END_TIME, SQLCAID,
DEVICETYPE, FIRSTLOG, LASTLOG
FROM TABLE(ADMIN_LIST_HIST()) AS T
WHERE OPERATION = 'B'
ORDER BY START_TIME DESC
FETCH FIRST 10 ROWS ONLY;
The backup schedule requires:
- Full online backup every Saturday at 23:00.
- Incremental delta backup every weeknight at 23:00.
- Transaction log archiving continuously to both local disk and a remote NFS share.
We verify that no backup has failed and that the backup chain is unbroken — an incremental restore depends on a complete chain from the last full backup through every subsequent incremental.
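The chain check can be scripted. A Python sketch with illustrative dates, assuming the nightly delta schedule above (each delta depends on the backup taken the previous night):

```python
from datetime import date, timedelta

def chain_ok(backups):
    """backups: list of (date, kind) tuples, kind in {'full', 'incr'}.
    The chain is unbroken if every incremental after the most recent
    full backup follows its predecessor by exactly one night."""
    fulls = [d for d, kind in backups if kind == "full"]
    if not fulls:
        return False                      # no full backup: nothing to anchor on
    last_full = max(fulls)
    incrs = sorted(d for d, kind in backups if kind == "incr" and d > last_full)
    prev = last_full
    for d in incrs:
        if (d - prev).days > 1:
            return False                  # a nightly backup is missing: chain broken
        prev = d
    return True

sat = date(2026, 3, 7)                    # Saturday full backup
good = [(sat, "full")] + [(sat + timedelta(days=i), "incr") for i in (1, 2, 3)]
gap = [(sat, "full"),
       (sat + timedelta(days=1), "incr"),
       (sat + timedelta(days=3), "incr")]  # Monday's backup failed
print(chain_ok(good), chain_ok(gap))      # True False
```

In practice the (date, kind) pairs would be parsed from the ADMIN_LIST_HIST output above rather than hard-coded.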
36.9.4 Pending States
Pending states are one of the most dangerous situations in a DB2 environment — they can prevent applications from accessing data:
-- Check for any pending states
SELECT TABSCHEMA, TABNAME, STATUS, TYPE,
CONST_CHECKED, ACCESS_MODE
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MERIDIAN'
AND STATUS <> 'N' -- N = Normal
ORDER BY TABNAME;
Common pending states include:
- Check Pending: Referential integrity constraints need verification after a LOAD operation.
- Inoperative: A view or trigger has become inoperative due to a schema change.
- Reorg Pending: A table requires reorganization before certain operations.
Any table in a pending state that serves production applications is a critical finding.
36.9.5 Monitoring Alerts
We verify that the monitoring infrastructure is properly configured:
-- Verify health monitor is active
SELECT ALERT_ID, ALERT_TYPE, ALARM_STATE,
THRESHOLD_VALUE, VALUE
FROM TABLE(HEALTH_GET_ALERT_CFG(-2)) AS T
WHERE ALARM_STATE <> 'NORMAL'
ORDER BY ALERT_TYPE;
The monitoring system should alert on:
- Tablespace utilization > 85%.
- Buffer pool hit ratio < 95%.
- Lock escalation events.
- Deadlocks.
- Backup failures.
- HADR log gap exceeding 5 minutes.
- Sort overflows exceeding 10 per hour.
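The buffer pool threshold uses the standard hit-ratio formula: (logical reads minus physical reads) divided by logical reads. A Python sketch, with parameter names mirroring the monitoring counters:

```python
# Buffer pool hit ratio: the percentage of page requests satisfied
# from memory rather than disk.
def bp_hit_ratio(pool_data_l_reads, pool_data_p_reads):
    if pool_data_l_reads == 0:
        return 100.0  # no activity: treat as a perfect ratio
    return 100.0 * (pool_data_l_reads - pool_data_p_reads) / pool_data_l_reads

ratio = bp_hit_ratio(1_000_000, 50_000)
print(f"{ratio:.1f}%")  # 95.0% -- exactly at the alert threshold
```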
36.10 Capacity Planning Report
36.10.1 Current Utilization
-- Database size and growth (LUW, TBSP_UTILIZATION administrative view)
SELECT TBSP_NAME, TBSP_TYPE,
TBSP_TOTAL_SIZE_KB / 1024 / 1024 AS TOTAL_GB,
TBSP_USED_SIZE_KB / 1024 / 1024 AS USED_GB,
TBSP_FREE_SIZE_KB / 1024 / 1024 AS FREE_GB,
DECIMAL(TBSP_USED_SIZE_KB * 100.0 /
NULLIF(TBSP_TOTAL_SIZE_KB, 0), 5, 2) AS USED_PCT
FROM SYSIBMADM.TBSP_UTILIZATION
ORDER BY TBSP_USED_SIZE_KB DESC;
Current state:
| Resource | Capacity | Current Usage | Utilization |
|---|---|---|---|
| Storage (OLTP) | 12 TB | 7.8 TB | 65% |
| Storage (Archive) | 48 TB | 31.2 TB | 65% |
| CPU (z/OS) | 8 CPs | Avg 2.2 CPs | 28% |
| CPU (LUW) | 32 cores | Avg 9.1 cores | 28% |
| Memory (LUW) | 256 GB | 188 GB allocated | 73% |
| Buffer Pool (OLTP) | 64 GB | 58 GB used | 91% |
36.10.2 Growth Projections
Based on analysis of the past 24 months of data, we project:
| Metric | Current | Year 1 | Year 2 | Year 3 |
|---|---|---|---|---|
| Customer Count | 1.2M | 1.4M (+17%) | 1.7M (+21%) | 2.1M (+24%) |
| Account Count | 2.3M | 2.8M (+22%) | 3.5M (+25%) | 4.4M (+26%) |
| Daily Transactions | 4.2M | 5.5M (+31%) | 7.2M (+31%) | 9.4M (+31%) |
| OLTP Storage | 7.8 TB | 10.4 TB | 14.1 TB | 19.2 TB |
| Archive Storage | 31.2 TB | 42.8 TB | 58.6 TB | 80.3 TB |
The growth is driven by Meridian's acquisition of two smaller banks (scheduled Year 1) and the launch of a mobile banking platform (Year 2).
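The storage rows follow from simple compound growth. A Python sketch; the per-year rates passed in are assumptions loosely derived from the table above (acquisitions in Year 1, the mobile launch in Year 2), not an exact model:

```python
# Project a starting value forward through a list of annual growth rates.
def project(current, annual_rates):
    values, v = [], current
    for r in annual_rates:
        v *= 1 + r
        values.append(round(v, 1))
    return values

# e.g. 7.8 TB of OLTP storage growing roughly a third per year
print(project(7.8, [0.33, 0.35, 0.36]))
```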
36.10.3 Hardware Recommendations
Year 1 (Immediate):
- Expand OLTP storage from 12 TB to 16 TB (add one SAN shelf).
- Increase LUW buffer pool memory from 64 GB to 96 GB (add RAM to both primary and standby).
- No CPU changes needed — headroom is sufficient.
Year 2:
- Upgrade the LUW server from 32 cores to 48 cores (or an equivalent cloud instance).
- Expand OLTP storage to 20 TB.
- Consider moving archive storage to object storage (IBM Cloud Object Storage or equivalent) to reduce SAN costs.
- Evaluate DB2 pureScale for the LUW tier if the bank exceeds the HADR pair's capacity.
Year 3:
- Full cloud migration assessment for the analytics tier.
- z/OS capacity upgrade (additional CP).
- Implement data lifecycle management: compress and archive transactions older than 7 years to cold storage.
36.10.4 Cloud Expansion Plan
The analytics workload is already in the cloud. The capacity plan recommends expanding cloud usage:
- Year 1: Migrate read-only regulatory reporting to a DB2 read replica on IBM Cloud.
- Year 2: Implement a hybrid architecture where the LUW reporting database feeds a cloud-native data lake for advanced analytics.
- Year 3: Evaluate migrating the LUW mid-tier to Db2 on IBM Cloud Pak for Data, retaining z/OS for core OLTP.
36.10.5 Cost Projections
| Item | Year 1 | Year 2 | Year 3 | 3-Year Total |
|---|---|---|---|---|
| Storage Expansion | $180K | $240K | $150K | $570K |
| Server Upgrades | $0 | $320K | $0 | $320K |
| Cloud Services | $48K | $96K | $144K | $288K |
| DB2 Licensing | $0 (current) | $85K (pureScale eval) | $170K (expansion) | $255K |
| Professional Services | $60K | $120K | $90K | $270K |
| **Total** | **$288K** | **$861K** | **$554K** | **$1,703K** |
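A quick cross-check of the table's arithmetic (amounts in $K):

```python
# Each tuple is (Year 1, Year 2, Year 3) for one line item.
rows = {
    "Storage Expansion":     (180, 240, 150),
    "Server Upgrades":       (0,   320, 0),
    "Cloud Services":        (48,  96,  144),
    "DB2 Licensing":         (0,   85,  170),
    "Professional Services": (60,  120, 90),
}

year_totals = [sum(col) for col in zip(*rows.values())]
grand_total = sum(year_totals)
print(year_totals, grand_total)  # [288, 861, 554] 1703
```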
36.11 Executive Summary Presentation
36.11.1 How to Present Technical Findings to Non-Technical Stakeholders
The CIO asked for a report she can hand to the board. Board members do not want to see SQL. They want to know three things:
- Are we safe? (security, compliance, disaster recovery)
- Are we fast? (performance, reliability, uptime)
- Are we ready for growth? (capacity, scalability, cost)
The executive summary follows this structure:
Page 1: Overall System Health Score
Use a simple traffic light dashboard:
- Green: Meets or exceeds all targets.
- Yellow: Minor findings that require attention within 90 days.
- Red: Critical findings that require immediate action.
| Area | Score | Status |
|---|---|---|
| System Availability | 99.97% uptime (last 12 months) | GREEN |
| Disaster Recovery | RTO 4m40s (target 15m) | GREEN |
| Security & Compliance | 3 individual user grants (PCI-DSS 7.1) | YELLOW |
| Performance | 3 stress test failures (remediated) | YELLOW |
| Capacity | 35% growth headroom on storage | GREEN |
Page 2: Key Findings and Actions
| # | Finding | Risk | Action | Timeline | Cost |
|---|---|---|---|---|---|
| 1 | 3 individual user grants violate role-based access policy | Audit finding | Migrate to role grants | 2 weeks | $0 |
| 2 | Buffer pool undersized for peak load | Performance degradation at month-end | Expand buffer pool memory | 30 days | $12K |
| 3 | Stale statistics on 4 tables | Suboptimal query performance | Automate RUNSTATS | 1 week | $0 |
| 4 | Statement generation uses wrong isolation level | Unnecessary lock contention | Application change | 2 weeks | $0 |
Page 3: Three-Year Investment Plan
Present the capacity plan cost projections as a simple bar chart with brief narrative explaining the growth drivers.
36.11.2 The DBA as Business Partner
This capstone exercise illustrates a critical truth: the DBA role is not purely technical. The DBA who can translate database metrics into business impact — who can explain that "a 340ms lock wait time means customers see a 3-second delay at the ATM" — is the DBA who earns a seat at the leadership table.
Throughout this book, we have emphasized understanding not just the how but the why. Why does the optimizer choose a particular access path? Because understanding the why allows you to predict behavior, prevent problems, and communicate effectively with people who will never write a line of SQL.
The capstone report is your proof of competence. It demonstrates that you can:
- Operate complex multi-platform DB2 environments.
- Identify problems before they affect users.
- Quantify risk in terms the business understands.
- Propose solutions with cost and timeline.
- Execute recovery procedures under pressure.
36.12 The Final Meridian Bank Report
Below is the complete system health report template, filled in with the findings from this capstone exercise.
MERIDIAN NATIONAL BANK — DB2 SYSTEM HEALTH ASSESSMENT REPORT
Report Date: March 2026 Prepared By: Lead Database Administrator Classification: Confidential — Internal Use Only
1. Executive Summary
Meridian National Bank operates three DB2 environments supporting core banking, regulatory reporting, and business analytics. This assessment evaluated all three environments across six dimensions: architecture, schema design, performance, security, disaster recovery, and capacity. Overall system health is GOOD with four findings requiring attention before the upcoming regulatory examination.
2. Architecture Review
All three DB2 environments are operational and conform to the approved enterprise architecture with two exceptions:
- The LUW HADR pair is in REMOTE_CATCHUP state instead of PEER due to a recent network configuration change (Finding ARCH-002, severity WARNING).
- The cloud analytics environment includes a deprecated VPN IP range in the allowlist (Finding ARCH-003, severity MEDIUM).
Recommendation: Resolve the HADR synchronization issue within 48 hours. Remove the stale IP range immediately.
3. Schema and Design Review
The data model is well-structured with four findings:
- One staging table lacks a primary key (CRITICAL).
- One column is oversized — VARCHAR(4000) for an address field (LOW).
- One redundant index identified (MEDIUM).
- One unused index identified (MEDIUM).
Recommendation: Add primary key to staging table before examination. Schedule remaining items for next maintenance window.
4. Performance Assessment
Baseline performance is excellent. Under simulated peak load (month-end conditions), three metrics fell below thresholds:
- Buffer pool hit ratio dropped to 87.4% (target >95%).
- Lock wait time increased to 340ms (target <50ms).
- Sort overflows reached 847/hr (target <10/hr).
All three issues were identified, root-caused, and remediated during the assessment. Re-test confirmed all metrics within thresholds.
Recommendation: Implement the buffer pool expansion and isolation level changes before the next month-end processing cycle.
5. Security and Compliance
The security posture is strong with one finding:
- Three individual user grants exist outside the role-based access model (PCI-DSS Requirement 7.1 — PARTIAL compliance).
RCAC policies, audit policies, encryption, and network security all meet or exceed requirements.
Recommendation: Remediate the three individual grants within two weeks. This is the most examination-sensitive finding.
6. Disaster Recovery
The DR drill was successful:
- HADR failover completed in 4 minutes 40 seconds (target: 15 minutes).
- Zero data loss confirmed.
- Point-in-time recovery completed in 34 minutes.
- Data integrity verified post-recovery.
Recommendation: No changes required. Consider reducing the operator decision time through automation for future improvement.
7. Capacity Plan
Current utilization across all environments is healthy (28-73% depending on resource type). Growth projections indicate that storage expansion is needed within 12 months and a server upgrade within 24 months. The three-year infrastructure investment is estimated at $1.7M.
Recommendation: Approve Year 1 storage expansion ($180K) in the next capital budget cycle.
Spaced Review: Comprehensive Topic Recall
This capstone chapter draws on concepts from every part of the book. Use these questions to verify your recall:
1. From Chapter 3 (SQL Foundations): Write a query that joins three tables with appropriate WHERE clause predicates. Did you use sargable predicates?
2. From Chapter 8 (Indexing): Given the stress test results, which index strategy eliminated the most I/O? Why did the covering index help the account balance inquiry?
3. From Chapter 12 (EXPLAIN): What is the difference between an index scan and a table scan in the EXPLAIN output? When would the optimizer legitimately choose a table scan?
4. From Chapter 16 (Concurrency): Explain why consistent lock ordering prevents deadlocks. How would you diagnose a deadlock using the DB2 monitoring interfaces?
5. From Chapter 19 (Security): What is the difference between DBADM and SECADM authority? Why should application service accounts never have DBADM?
6. From Chapter 22 (Backup and Recovery): What is the difference between RTO and RPO? Why does NEARSYNC HADR guarantee zero RPO?
7. From Chapter 26 (Data Sharing): What role does the Coupling Facility play in a z/OS data sharing group? What happens to the group if the Coupling Facility fails?
8. From Chapter 30 (Performance Tuning): Name three causes of sort overflow and three remediation strategies.
9. From Chapter 33 (Cloud): What are the trade-offs between managing your own DB2 instance on a cloud VM versus using a fully managed service?
10. From Chapter 35 (Enterprise Architecture): How does a three-tier architecture (OLTP, reporting, analytics) reduce contention compared to running all workloads on a single database?
Summary
This capstone chapter has walked through a complete, production-realistic assessment of a multi-platform DB2 environment. You established baselines, ran stress tests, diagnosed performance problems, verified security controls, executed a disaster recovery drill, analyzed concurrency behavior, reviewed maintenance schedules, and produced a capacity plan with cost projections — all wrapped in an executive summary suitable for a board of directors.
This is the work of a senior DBA. It requires breadth across every topic in this book, depth to diagnose root causes, and communication skill to translate technical findings into business language. If you can execute this capstone exercise on a real system, you are ready for the most demanding DB2 roles in the industry.
The system is reviewed. The findings are documented. The report is delivered.
Now let us turn to the final chapter — your career.