Case Study 1: CNB's Quarterly DB2 Performance Review
Background
It is the first Monday of April, and Kwame Asante is preparing for CNB's quarterly DB2 performance review. This is not a crisis meeting — it is a structured process that Kwame initiated three years ago after a series of preventable batch window overruns nearly cost the bank a regulatory filing deadline.
The review covers four areas: batch window health, online transaction SLA compliance, capacity trending, and access path stability. Kwame presents to Lisa Martinez (VP of Technology) and Rob Jeffries (lead COBOL developer), with the understanding that action items from the review are tracked to completion.
The Batch Window Analysis
Kwame begins with the batch window. CNB's critical batch cycle runs from 22:00 to 06:00 — eight hours for all nightly processing. The cycle includes general ledger posting, account interest calculation, regulatory reporting, and data warehouse loading.
He presents three months of data:
                     January    February   March
                     -------    --------   -----
Cycle Start:         22:00      22:00      22:00
Cycle End (avg):     04:12      04:35      05:18
Cycle End (max):     04:47      05:22      05:51
Slack Time (avg):    108 min    85 min     42 min
Slack Time (min):    73 min     38 min     9 min
Longest Job:
  Name:              CNBGL410   CNBGL410   CNBGL410
  Elapsed:           2h 14m     2h 38m     3h 47m
  CPU:               18m 22s    19m 01s    19m 47s
  Lock Wait:         1h 12m     1h 34m     2h 41m
  GETPAGE:           247M       261M       274M
Lisa immediately spots the trend. "We lost 23 minutes of average slack in February and another 43 in March. If that keeps accelerating, we miss the window by June — possibly sooner."
Kwame nods. "Correct. And the root cause is clear from the data. CNBGL410's CPU is flat — 18 to 19 minutes over the quarter. GETPAGEs grew about 5% per month, consistent with transaction volume growth. But lock wait went from 72 minutes to 161 minutes. That is the entire degradation."
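The slack trend is easy to check with a back-of-the-envelope projection. A minimal sketch using the numbers from the table above — the averaging of the two monthly deltas is a simplification; since the loss is accelerating, the real deadline is likely earlier:

```python
# Project when average batch slack reaches zero, from the quarterly
# review figures above. Linear extrapolation at the average rate is an
# assumption for illustration; the observed loss is accelerating.

slack_avg = {"Jan": 108, "Feb": 85, "Mar": 42}  # minutes of slack, avg

deltas = [slack_avg["Feb"] - slack_avg["Jan"],   # -23 min in February
          slack_avg["Mar"] - slack_avg["Feb"]]   # -43 min in March
avg_loss = sum(deltas) / len(deltas)             # -33 min/month on average

months_left = slack_avg["Mar"] / -avg_loss       # ~1.3 months at that rate
print(f"avg loss: {avg_loss:.0f} min/month, "
      f"slack exhausted in ~{months_left:.1f} months")
```

Even the averaged rate leaves less than two months of headroom, which is why the review treats this as the top finding.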
Rob asks the obvious question: "What is it contending with?"
The Lock Contention Investigation
Kwame pulls up the lock wait detail from March's worst night:
CNBGL410 Lock Wait Breakdown (March 28):
  Waiting on CNBGL420 (GL Summarization):   97 minutes
  Waiting on CNBGL430 (GL Archival):        38 minutes
  Waiting on CNBRPT10 (Reg Reporting):      26 minutes
  Total Lock Wait:                         161 minutes
The problem is job scheduling interaction. CNBGL420 was moved earlier in the cycle six months ago to accommodate a new regulatory report. It now overlaps with CNBGL410 for approximately 90 minutes, and both programs access GL_TRANS with UPDATE intent.
Kwame shows the accounting data side by side:
CNBGL410:                          CNBGL420:
  SQL UPDATE:    8.4M                SQL UPDATE:    2.1M
  COMMITS:       847                 COMMITS:       210
  Rows/Commit:   ~9,900              Rows/Commit:   ~10,000
  Avg Lock Hold: ~14 sec/commit      Avg Lock Hold: ~18 sec/commit
"Both programs commit approximately every 10,000 rows," Kwame explains. "That means each unit of work holds its page locks for 14-18 seconds before the commit releases them. During that window, the other program waits. They are ping-ponging — CNBGL410 waits for CNBGL420, then CNBGL420 waits for CNBGL410, back and forth for the duration of their overlap."
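The commit cadence and the projected effect of a smaller interval fall straight out of the accounting data. A sketch — the assumption that lock hold time scales linearly with unit-of-work size is mine, for illustration:

```python
# Back out rows-per-commit from the accounting data above, and project
# lock hold time at a 500-row commit interval. Linear scaling of hold
# time with unit-of-work size is an assumed model.

updates  = {"CNBGL410": 8_400_000, "CNBGL420": 2_100_000}  # SQL UPDATEs
commits  = {"CNBGL410": 847,       "CNBGL420": 210}
hold_now = {"CNBGL410": 14.0,      "CNBGL420": 18.0}       # sec per unit of work

for job in updates:
    rows_per_commit = updates[job] / commits[job]
    projected_hold = hold_now[job] * 500 / rows_per_commit
    print(f"{job}: ~{rows_per_commit:,.0f} rows/commit, "
          f"hold ~{hold_now[job]:.0f}s now -> ~{projected_hold:.2f}s at 500 rows")
```

Both projected hold times land under one second, which is the basis for the "near zero lock wait" expectation in Change 1 below.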
The Fix
Kwame proposes three changes, prioritized by impact and effort:
Change 1 — Increase Commit Frequency (High Impact, Low Effort)
Reduce commit interval from 10,000 rows to 500 rows for both CNBGL410 and CNBGL420.
Expected result: Lock hold time drops from 14-18 seconds to under 1 second. Lock wait for both programs should drop to near zero even with overlap.
Cost: 37 seconds of additional commit overhead per program (Kwame calculated this from the log force time per commit).
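The overhead figure can be reproduced with simple arithmetic. In this sketch, the ~2.3 ms log force cost per commit is an assumed value chosen to match the stated 37 seconds; Kwame's actual number came from measured log force time in the accounting data:

```python
# Rough reproduction of the commit-overhead estimate for CNBGL410.
# The per-commit log force cost (~2.3 ms) is an assumption for
# illustration, not a measured value.

rows = 8_400_000          # SQL UPDATEs in CNBGL410
old_interval = 10_000     # rows per commit today
new_interval = 500        # proposed rows per commit
log_force_ms = 2.3        # assumed log force cost per commit

extra_commits = rows // new_interval - rows // old_interval
overhead_s = extra_commits * log_force_ms / 1000
print(f"{extra_commits:,} extra commits, ~{overhead_s:.0f}s added elapsed time")
```

Thirty-odd seconds of commit overhead against hours of lock wait eliminated is why Change 1 leads the list.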
Rob raises a concern: "Both programs have restart logic keyed to commit points. If we change from 10,000 to 500, the checkpoint table will have 20x more rows."
Kwame anticipated this: "We change the checkpoint logic to keep only the last committed key value, not a row per commit. One UPDATE instead of one INSERT per commit. I have the code change spec ready."
Change 2 — Reschedule CNBGL420 (Medium Impact, Low Effort)
Move CNBGL420 to start after CNBGL410 completes, eliminating the overlap entirely.
Risk: This extends the serial chain and reduces parallelism. If CNBGL410 runs long for any reason, CNBGL420 starts late, which cascades to downstream jobs.
Kwame recommends implementing Change 1 first and holding Change 2 in reserve. If commit frequency solves the contention, the programs can continue to overlap safely.
Change 3 — ISOLATION(UR) for CNBGL420 Read Phase (Low Impact, Medium Effort)
CNBGL420 has two phases: a read phase that scans GL_TRANS to build summaries, and a write phase that updates GL_SUMMARY. The read phase does not need transactional consistency — it reads committed data.
Changing the read-phase cursors to ISOLATION(UR) would eliminate lock contention during that phase entirely. However, it requires code review to ensure uncommitted reads do not introduce data integrity issues.
Lisa approves Change 1 for immediate implementation and Change 3 for investigation. Change 2 is documented as a fallback.
Online Transaction SLA Review
Kwame moves to online performance. CNB's SLA requires 95% of transactions to complete within 1 second.
Transaction          Q1 Avg   Q1 P95   Q1 P99   SLA     Status
-----------          ------   ------   ------   -----   --------
CINQ (inquiry)       0.12s    0.28s    0.87s    1.0s    OK
CXFR (transfer)      0.34s    0.72s    1.84s    1.0s    OK (P95)
CPMT (payment)       0.41s    0.89s    2.12s    1.0s    OK (P95)
CSTM (statement)     0.87s    2.14s    4.87s    3.0s    OK
CACL (acct close)    1.23s    3.47s    8.92s    5.0s    OK
All transactions meet P95 SLA. But Kwame flags two concerns:
Concern 1: CXFR P99 is 1.84 seconds. This means 1% of transfer transactions take nearly 2 seconds. "The P99 was 1.2 seconds last quarter," Kwame notes. "That is a 53% increase."
He drills into the data:
CXFR Response Time > 1.5 sec — Analysis:
  Total occurrences: 847 (out of 412,000 executions)
  Lock wait present: 612 (72.3%)
    Avg lock wait: 1.12 sec
    Lock holder: CNBGL410 (batch) in 589 cases
  No lock wait: 235 (27.7%)
    High GETPAGE (>5000): 198 (84.3% of non-lock cases)
    Accounts with >10K txns: 187 (94.4% of high-GETPAGE cases)
Two distinct problems: 72% of slow transfers are caused by lock contention with the batch cycle (which will be addressed by Change 1 above), and 28% are caused by data skew — corporate accounts with many transactions require more GETPAGEs for the transfer validation query.
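The two-bucket split behind that conclusion amounts to a simple classification over the slow-transaction records. A minimal sketch — the record fields and sample values are illustrative, not the actual accounting-record layout:

```python
# Classify slow CXFR transactions the way the analysis above does:
# lock-wait cases vs. data-skew (high-GETPAGE) cases. Field names and
# sample records are hypothetical stand-ins for the accounting data.

slow = [
    {"resp": 1.9, "lock_wait": 1.1, "getpages": 900},    # batch contention
    {"resp": 1.7, "lock_wait": 0.0, "getpages": 7200},   # data skew
    {"resp": 1.6, "lock_wait": 0.0, "getpages": 1100},   # neither bucket
]

lock_cases = [t for t in slow if t["lock_wait"] > 0]
skew_cases = [t for t in slow if t["lock_wait"] == 0 and t["getpages"] > 5000]
other      = [t for t in slow if t["lock_wait"] == 0 and t["getpages"] <= 5000]
print(len(lock_cases), len(skew_cases), len(other))
```

The same thresholds (lock wait present; GETPAGE > 5,000) applied to the full 847 records produce the 612/198 split in the analysis.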
Concern 2: CSTM average is 0.87 seconds and trending upward. Statement generation queries are getting slower as transaction history grows. Kwame recommends partitioning the TRANSACTIONS table by date and implementing partition pruning in the statement generation SQL. This is a larger project — he estimates 3 months of development and testing.
Lisa asks: "When does CSTM hit the SLA wall at current growth rate?"
Kwame has the projection: "August, if growth continues at the current 8% per month compound rate. We need to start the partitioning project by May to have it deployed by July."
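The projection itself is a one-line compound-growth calculation, starting from the Q1 P95 figure in the table above:

```python
import math

# When does CSTM P95 cross its 3.0 s SLA at 8%/month compound growth?
# Starting point is the Q1 (March) P95 from the SLA table.
p95_now, sla, growth = 2.14, 3.0, 0.08

months = math.ceil(math.log(sla / p95_now) / math.log(1 + growth))
print(months)  # 5 months after March, i.e. August
```

Five months from March is August, which is what makes the May project start date a hard constraint rather than a preference.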
Access Path Stability Review
Kwame's third section compares EXPLAIN output from January REBIND to March REBIND for all 247 packages in the CNB production DB2 subsystem.
Access Path Changes Summary:
Total packages: 247
Packages with changes: 12 (4.9%)
Changes by type:
Index change: 5
Join order change: 3
Sort added: 2
Tablespace scan added: 1
MATCHCOLS reduced: 1
Classification:
Beneficial: 7
Neutral: 3
Degraded: 2
The two degraded access paths:
Package CNBTRN05, Query 34: Changed from index XTRN_ACCT_DATE (MATCHCOLS=3) to index XTRN_STATUS (MATCHCOLS=1). Root cause: RUNSTATS ran while a batch job had temporarily set every row's STATUS to 'P' (pending). With 99.9% of rows showing STATUS='P', the collected statistics made the query's predicate on any other status value look extremely selective, so the optimizer switched to the single-column status index. Fix: Re-run RUNSTATS after the batch completes, and add the timing constraint to the RUNSTATS scheduling documentation.
Package CNBRPT08, Query 112: Added a sort for ORDER BY that was previously satisfied by index order. Root cause: A new index was created for a different program, and the optimizer now prefers that index (which does not provide ORDER BY ordering). Fix: Add an explicit OPTIMIZE FOR clause to the query that hints the optimizer toward the original index, or remove the unnecessary new index if it is not providing value elsewhere.
Rob asks about the one tablespace scan addition. Kwame explains: "Package CNBREF01, Query 7. The REFERENCE_CODES table grew from 200 rows to 200,000 rows when someone loaded the international codes. Previously the tablespace scan was fine — 200 rows, 3 pages. Now it is 200,000 rows, 4,800 pages. We need an index on CODE_TYPE, CODE_VALUE. I have the DDL ready."
Capacity Trending
The final section covers growth:
Tablespace Growth (Top 5 by % Growth):
Tablespace         Jan Size   Mar Size   Monthly Growth
----------         --------   --------   --------------
GL_TRANS            47.2 GB    53.1 GB         4.1%
TRANSACTIONS       112.4 GB   128.7 GB         4.6%
AUDIT_LOG           23.8 GB    31.2 GB        10.3%
ACCOUNT_HISTORY     67.1 GB    71.4 GB         2.1%
REGULATORY_RPT       8.3 GB    12.1 GB        14.4%
REGULATORY_RPT growing at 14.4% monthly catches Lisa's attention. "That is the new regulatory data retention requirement," Kwame explains. "We are keeping 7 years instead of 3. I have a partitioning plan for this table — same approach as TRANSACTIONS, partition by quarter, archive oldest partitions to cheaper storage."
AUDIT_LOG at 10.3% is also flagged. This table has no archival strategy. Kwame recommends implementing a 90-day retention policy with monthly archival to sequential files.
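The cost of doing nothing is easy to project. A sketch compounding the March size forward at the observed rate — the assumption that 10.3% monthly growth continues unchanged is mine:

```python
# Project AUDIT_LOG size if 10.3%/month compounding continues with no
# retention policy. March size is from the growth table above; the
# constant-growth assumption is for illustration.

size_gb, monthly = 31.2, 0.103

year_end = size_gb * (1 + monthly) ** 9   # 9 months: March -> December
print(f"~{year_end:.0f} GB by December without archival")
```

Left unchecked, the table would more than double by year end, which is why the 90-day retention policy made the action-item list.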
Action Items
Lisa summarizes the action items:
| # | Action | Owner | Target Date | Priority |
|---|---|---|---|---|
| 1 | Implement 500-row commit frequency for CNBGL410/CNBGL420 | Rob | April 15 | Critical |
| 2 | Investigate ISOLATION(UR) for CNBGL420 read phase | Rob | April 30 | High |
| 3 | Re-run RUNSTATS for CNBTRN05 and verify access path | Kwame | April 10 | High |
| 4 | Create index on REFERENCE_CODES (CODE_TYPE, CODE_VALUE) | Kwame | April 10 | High |
| 5 | Begin TRANSACTIONS table partitioning project | Rob/Kwame | May 1 start | High |
| 6 | Implement AUDIT_LOG 90-day retention | Kwame | May 15 | Medium |
| 7 | Design REGULATORY_RPT partitioning | Kwame | May 30 | Medium |
| 8 | Investigate CXFR data skew (FREQVAL on ACCOUNT_TYPE) | Kwame | April 15 | High |
Lessons
The quarterly review process catches problems that daily monitoring misses. Daily monitoring is excellent for sudden changes — a job that runs 3x longer than yesterday gets flagged immediately. But gradual degradation — 20 minutes of slack lost per month, P99 creeping upward, tablespace growth compounding — only becomes visible when you look at trends over months.
Kwame's review takes approximately 8 hours to prepare and 2 hours to present. Lisa estimates it saves 200+ hours of emergency debugging per year. The math is not close.
Discussion Questions
1. Why did Kwame prioritize the commit frequency change over job rescheduling? Under what circumstances would rescheduling be the better first move?
2. The CNBTRN05 access path change was caused by RUNSTATS running at the wrong time. How would you design a RUNSTATS scheduling policy that avoids this problem?
3. The quarterly review identified that the CSTM transaction will hit its SLA wall in August. The partitioning project takes 3 months. If Lisa had not asked the right question, what would have happened?
4. How would you automate the access path stability comparison that Kwame performs manually? What would the automated alert criteria be?
5. Kwame identified 12 packages with access path changes out of 247. Is a 4.9% change rate per quarter concerning, acceptable, or expected? What would change your assessment?