Case Study 1: CNB's Quarterly DB2 Performance Review

Background

It is the first Monday of April, and Kwame Asante is preparing for CNB's quarterly DB2 performance review. This is not a crisis meeting — it is a structured process that Kwame initiated three years ago after a series of preventable batch window overruns nearly cost the bank a regulatory filing deadline.

The review covers four areas: batch window health, online transaction SLA compliance, capacity trending, and access path stability. Kwame presents to Lisa Martinez (VP of Technology) and Rob Jeffries (lead COBOL developer), with the understanding that action items from the review are tracked to completion.

The Batch Window Analysis

Kwame begins with the batch window. CNB's critical batch cycle runs from 22:00 to 06:00 — eight hours for all nightly processing. The cycle includes general ledger posting, account interest calculation, regulatory reporting, and data warehouse loading.

He presents three months of data:

                    January         February        March
                    -------         --------        -----
Cycle Start:        22:00           22:00           22:00
Cycle End (avg):    04:12           04:35           05:18
Cycle End (max):    04:47           05:22           05:51
Slack Time (avg):   108 min         85 min          42 min
Slack Time (min):    73 min         38 min           9 min

Longest Job:
  Name:             CNBGL410        CNBGL410        CNBGL410
  Elapsed:          2h 14m          2h 38m          3h 47m
  CPU:              18m 22s         19m 01s         19m 47s
  Lock Wait:        1h 12m          1h 34m          2h 41m
  GETPAGE:          247M            261M            274M

Lisa immediately spots the trend. "We lost 23 minutes of average slack in February and 43 in March. The decay is accelerating, and the worst night in March had only 9 minutes to spare. At this rate we miss the window in May, if not April."

Kwame nods. "Correct. And the root cause is clear from the data. CNBGL410's CPU is flat — 18 to 19 minutes over the quarter. GETPAGEs grew about 5% per month, consistent with transaction volume growth. But lock wait went from 72 minutes to 161 minutes. That is the entire degradation."
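The trend Kwame describes can be checked directly against the table (a quick sketch; all figures are the monthly numbers from the quarterly data above):

```python
# Monthly figures for CNBGL410 from the quarterly review (minutes).
lock_wait = {"Jan": 72, "Feb": 94, "Mar": 161}      # 1h12m, 1h34m, 2h41m
slack_avg = {"Jan": 108, "Feb": 85, "Mar": 42}

# Lock wait accounts for essentially all of the elapsed-time growth:
lock_growth = lock_wait["Mar"] - lock_wait["Jan"]   # 89 minutes
elapsed_growth = (3 * 60 + 47) - (2 * 60 + 14)      # 2h14m -> 3h47m = 93 minutes
print(f"lock wait grew {lock_growth} min; elapsed grew {elapsed_growth} min")

# Slack is shrinking at an accelerating, not linear, rate:
deltas = [slack_avg["Jan"] - slack_avg["Feb"],      # 23 min lost in February
          slack_avg["Feb"] - slack_avg["Mar"]]      # 43 min lost in March
print(f"slack lost per month: {deltas}")
```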

Rob asks the obvious question: "What is it contending with?"

The Lock Contention Investigation

Kwame pulls up the lock wait detail from March's worst night:

CNBGL410 Lock Wait Breakdown (March 28):
  Waiting on CNBGL420 (GL Summarization):    97 minutes
  Waiting on CNBGL430 (GL Archival):          38 minutes
  Waiting on CNBRPT10 (Reg Reporting):        26 minutes
  Total Lock Wait:                           161 minutes

The problem is job scheduling interaction. CNBGL420 was moved earlier in the cycle six months ago to accommodate a new regulatory report. It now overlaps with CNBGL410 for approximately 90 minutes, and both programs access GL_TRANS with UPDATE intent.

Kwame shows the accounting data side by side:

CNBGL410:                           CNBGL420:
  SQL UPDATE: 8.4M                    SQL UPDATE: 2.1M
  COMMITS:    847                     COMMITS:    210
  Rows/Commit: ~9,900                 Rows/Commit: ~10,000
  Avg Lock Hold: ~14 sec/commit       Avg Lock Hold: ~18 sec/commit

"Both programs commit approximately every 10,000 rows," Kwame explains. "That means each unit of recovery holds its page locks for 14-18 seconds before the next commit releases them. During that window, the other program waits. They are ping-ponging — CNBGL410 waits for CNBGL420, then CNBGL420 waits for CNBGL410, back and forth for the duration of their overlap."
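Kwame's 14-to-18-second figure falls straight out of the accounting data: with commits spread across the job's elapsed time, locks acquired after a commit can be held for up to elapsed / commits seconds. A sketch with the March CNBGL410 numbers:

```python
# CNBGL410, March: 8.4M updated rows, 847 commits, elapsed 3h47m.
rows, commits = 8_400_000, 847
elapsed_sec = 3 * 3600 + 47 * 60                # 13,620 seconds

rows_per_commit = rows / commits                # ~9,900 rows per unit of recovery
interval_sec = elapsed_sec / commits            # ~16 s between commits -- an upper
                                                # bound on how long page locks are held
print(f"{rows_per_commit:,.0f} rows/commit, ~{interval_sec:.0f} s between commits")

# Dropping the commit interval to 500 rows shrinks the hold time proportionally:
new_hold = interval_sec * 500 / rows_per_commit
print(f"projected hold at 500-row commits: ~{new_hold:.1f} s")
```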

The Fix

Kwame proposes three changes, prioritized by impact and effort:

Change 1 — Increase Commit Frequency (High Impact, Low Effort)

Reduce commit interval from 10,000 rows to 500 rows for both CNBGL410 and CNBGL420.

Expected result: Lock hold time drops from 14-18 seconds to under 1 second. Lock wait for both programs should drop to near zero even with overlap.

Cost: 37 seconds of additional commit overhead per program (Kwame calculated this from the log force time per commit).
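The 37-second figure implies a per-commit log force cost of roughly 2 ms, which is easy to back out (a sketch; only the 37 seconds and the row counts come from the review, the per-commit cost is derived):

```python
rows = 8_400_000
old_commits = rows // 10_000        # 840 (the actual accounting showed 847)
new_commits = rows // 500           # 16,800
extra = new_commits - old_commits   # ~16,000 additional commits

overhead_sec = 37                   # Kwame's measured additional log force time
per_commit_ms = overhead_sec / extra * 1000
print(f"{extra:,} extra commits -> ~{per_commit_ms:.1f} ms log force per commit")
```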

Rob raises a concern: "Both programs have restart logic keyed to commit points. If we change from 10,000 to 500, the checkpoint table will have 20x more rows."

Kwame anticipated this: "We change the checkpoint logic to keep only the last committed key value, not a row per commit. One UPDATE instead of one INSERT per commit. I have the code change spec ready."
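Kwame's checkpoint restructuring can be sketched as follows. SQLite stands in for DB2 here, and the table and column names are hypothetical; the key property is one row per job, updated in place, instead of one inserted row per commit:

```python
import sqlite3

# Hypothetical checkpoint table: one row per job, updated in place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ckpt_restart (job_name TEXT PRIMARY KEY, last_key TEXT)")
conn.execute("INSERT INTO ckpt_restart VALUES ('CNBGL410', '')")

def checkpoint(job: str, last_committed_key: str) -> None:
    # One UPDATE per commit instead of one INSERT: the table stays at one
    # row per job no matter how often the program commits.
    conn.execute("UPDATE ckpt_restart SET last_key = ? WHERE job_name = ?",
                 (last_committed_key, job))

def restart_key(job: str) -> str:
    # On restart, reposition the driving cursor just past this key.
    return conn.execute("SELECT last_key FROM ckpt_restart WHERE job_name = ?",
                        (job,)).fetchone()[0]

for key in ("ACCT0000500", "ACCT0001000", "ACCT0001500"):  # every 500 rows
    checkpoint("CNBGL410", key)

print(restart_key("CNBGL410"))   # the single row now holds ACCT0001500
```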

Change 2 — Reschedule CNBGL420 (Medium Impact, Low Effort)

Move CNBGL420 to start after CNBGL410 completes, eliminating the overlap entirely.

Risk: This extends the serial chain and reduces parallelism. If CNBGL410 runs long for any reason, CNBGL420 starts late, which cascades to downstream jobs.

Kwame recommends implementing Change 1 first and holding Change 2 in reserve. If commit frequency solves the contention, the programs can continue to overlap safely.

Change 3 — ISOLATION(UR) for CNBGL420 Read Phase (Low Impact, Medium Effort)

CNBGL420 has two phases: a read phase that scans GL_TRANS to build summaries, and a write phase that updates GL_SUMMARY. The read phase does not need strict transactional consistency, which makes it a candidate for uncommitted read.

Changing the read-phase cursors to ISOLATION(UR) would eliminate lock contention during that phase entirely. However, it requires code review to ensure uncommitted reads do not introduce data integrity issues.

Lisa approves Change 1 for immediate implementation and Change 3 for investigation. Change 2 is documented as a fallback.

Online Transaction SLA Review

Kwame moves to online performance. CNB's SLA requires 95% of transactions to complete within 1 second.

Transaction        Q1 Avg    Q1 P95    Q1 P99    SLA     Status
-----------        ------    ------    ------    ---     ------
CINQ (inquiry)      0.12s     0.28s     0.87s    1.0s    OK
CXFR (transfer)     0.34s     0.72s     1.84s    1.0s    OK (P95)
CPMT (payment)      0.41s     0.89s     2.12s    1.0s    OK (P95)
CSTM (statement)    0.87s     2.14s     4.87s    3.0s    OK
CACL (acct close)   1.23s     3.47s     8.92s    5.0s    OK

All transactions meet P95 SLA. But Kwame flags two concerns:

Concern 1: CXFR P99 is 1.84 seconds. This means 1% of transfer transactions take longer than 1.84 seconds. "The P99 was 1.2 seconds last quarter," Kwame notes. "That is a 53% increase."

He drills into the data:

CXFR Response Time > 1.5 sec — Analysis:
  Total occurrences: 847 (out of 412,000 executions)

  Lock wait present:       612 (72.3%)
  Avg lock wait:           1.12 sec
  Lock holder:             CNBGL410 (batch) in 589 cases

  No lock wait:            235 (27.7%)
  High GETPAGE (>5000):    198 (84.3% of non-lock cases)
  Accounts with >10K txns: 187 (94.4% of high-GETPAGE cases)

Two distinct problems: 72% of slow transfers are caused by lock contention with the batch cycle (which will be addressed by Change 1 above), and 28% are caused by data skew — corporate accounts with many transactions require more GETPAGEs for the transfer validation query.
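Kwame's decomposition amounts to a two-rule classifier over the slow-transaction monitor records. A sketch (the field names are invented; the 5,000-GETPAGE threshold mirrors the analysis above):

```python
def classify_slow_cxfr(rec: dict) -> str:
    """Bucket a slow CXFR execution the way the March analysis does.
    Field names are hypothetical monitor-record fields."""
    if rec["lock_wait_sec"] > 0:
        return "batch-contention"      # ~72% of cases: waiting on CNBGL410
    if rec["getpages"] > 5000:
        return "data-skew"             # large corporate accounts
    return "other"

sample = [
    {"lock_wait_sec": 1.12, "getpages": 900},
    {"lock_wait_sec": 0.0,  "getpages": 8200},
    {"lock_wait_sec": 0.0,  "getpages": 1200},
]
print([classify_slow_cxfr(r) for r in sample])
# ['batch-contention', 'data-skew', 'other']
```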

Concern 2: CSTM average is 0.87 seconds and trending upward. Statement generation queries are getting slower as transaction history grows. Kwame recommends partitioning the TRANSACTIONS table by date and implementing partition pruning in the statement generation SQL. This is a larger project — he estimates 3 months of development and testing.

Lisa asks: "When does CSTM hit the SLA wall at current growth rate?"

Kwame has the projection: "August, if growth continues at the current 8% per month compound rate. We need to start the partitioning project by May to have it deployed by July."
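The August projection is simple compound growth against the 3.0-second SLA (a sketch; the 2.14 s P95 and 8% monthly rate are the figures from the review):

```python
import math

p95, sla, growth = 2.14, 3.0, 0.08

# Months until p95 * (1 + growth)^n crosses the SLA:
months = math.ceil(math.log(sla / p95) / math.log(1 + growth))
print(months)   # 5 -- five months after the March data, i.e. August
```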

Access Path Stability Review

Kwame's third section compares EXPLAIN output from January REBIND to March REBIND for all 247 packages in the CNB production DB2 subsystem.

Access Path Changes Summary:
  Total packages:           247
  Packages with changes:     12 (4.9%)

  Changes by type:
    Index change:             5
    Join order change:        3
    Sort added:               2
    Tablespace scan added:    1
    MATCHCOLS reduced:        1

  Classification:
    Beneficial:               7
    Neutral:                  3
    Degraded:                 2

The two degraded access paths:

Package CNBTRN05, Query 34: Changed from index XTRN_ACCT_DATE (MATCHCOLS=3) to index XTRN_STATUS (MATCHCOLS=1). Root cause: RUNSTATS ran during a period when the STATUS column had unusually low cardinality due to a batch that temporarily set all rows to 'P' (pending). The optimizer saw 99.9% of rows with STATUS='P' and concluded the index was not selective. Fix: Re-run RUNSTATS after the batch completes, add to the RUNSTATS scheduling documentation.

Package CNBRPT08, Query 112: Added a sort for ORDER BY that was previously satisfied by index order. Root cause: A new index was created for a different program, and the optimizer now prefers that index (which does not provide ORDER BY ordering). Fix: Add an explicit OPTIMIZE FOR clause to the query that hints the optimizer toward the original index, or remove the unnecessary new index if it is not providing value elsewhere.

Rob asks about the one tablespace scan addition. Kwame explains: "Package CNBREF01, Query 7. The REFERENCE_CODES table grew from 200 rows to 200,000 rows when someone loaded the international codes. Previously the tablespace scan was fine — 200 rows, 3 pages. Now it is 200,000 rows, 4,800 pages. We need an index on CODE_TYPE, CODE_VALUE. I have the DDL ready."

Capacity Trending

The final section covers growth:

Tablespace Growth (Top 5 by % Growth):
                        Jan Size    Mar Size    Monthly Growth
GL_TRANS               47.2 GB      53.1 GB        4.1%
TRANSACTIONS          112.4 GB     128.7 GB        4.6%
AUDIT_LOG              23.8 GB      31.2 GB       10.3%
ACCOUNT_HISTORY        67.1 GB      71.4 GB        2.1%
REGULATORY_RPT          8.3 GB      12.1 GB       14.4%

REGULATORY_RPT growing at 14.4% monthly catches Lisa's attention. "That is the new regulatory data retention requirement," Kwame explains. "We are keeping 7 years instead of 3. I have a partitioning plan for this table — same approach as TRANSACTIONS, partition by quarter, archive oldest partitions to cheaper storage."

AUDIT_LOG at 10.3% is also flagged. This table has no archival strategy. Kwame recommends implementing a 90-day retention policy with monthly archival to sequential files.
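At compound rates like these, doubling time, ln 2 / ln(1 + rate), puts the urgency in concrete terms. A quick sketch using the two flagged tables (rates from the growth table above):

```python
import math

def doubling_months(monthly_rate: float) -> float:
    # Months for size * (1 + rate)^n to reach 2 * size.
    return math.log(2) / math.log(1 + monthly_rate)

print(f"AUDIT_LOG (10.3%/mo):      doubles in {doubling_months(0.103):.1f} months")
print(f"REGULATORY_RPT (14.4%/mo): doubles in {doubling_months(0.144):.1f} months")
```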

Action Items

Lisa summarizes the action items:

#   Action                                                     Owner        Target Date    Priority
-   ------                                                     -----        -----------    --------
1   Implement 500-row commit frequency for CNBGL410/CNBGL420   Rob          April 15       Critical
2   Investigate ISOLATION(UR) for CNBGL420 read phase          Rob          April 30       High
3   Re-run RUNSTATS for CNBTRN05 and verify access path        Kwame        April 10       High
4   Create index on REFERENCE_CODES (CODE_TYPE, CODE_VALUE)    Kwame        April 10       High
5   Begin TRANSACTIONS table partitioning project              Rob/Kwame    May 1 (start)  High
6   Implement AUDIT_LOG 90-day retention                       Kwame        May 15         Medium
7   Design REGULATORY_RPT partitioning                         Kwame        May 30         Medium
8   Investigate CXFR data skew (FREQVAL on ACCOUNT_TYPE)       Kwame        April 15       High

Lessons

The quarterly review process catches problems that daily monitoring misses. Daily monitoring is excellent for sudden changes — a job that runs 3x longer than yesterday gets flagged immediately. But gradual degradation — 20 minutes of slack lost per month, P99 creeping upward, tablespace growth compounding — only becomes visible when you look at trends over months.

Kwame's review takes approximately 8 hours to prepare and 2 hours to present. Lisa estimates it saves 200+ hours of emergency debugging per year. The math is not close.

Discussion Questions

  1. Why did Kwame prioritize commit frequency change over job rescheduling? Under what circumstances would rescheduling be the better first move?

  2. The CNBTRN05 access path change was caused by RUNSTATS running at the wrong time. How would you design a RUNSTATS scheduling policy that avoids this problem?

  3. The quarterly review identified that CSTM transaction will hit its SLA wall in August. The partitioning project takes 3 months. If Lisa had not asked the right question, what would have happened?

  4. How would you automate the access path stability comparison that Kwame performs manually? What would the automated alert criteria be?

  5. Kwame identified 12 packages with access path changes out of 247. Is a 4.9% change rate per quarter concerning, acceptable, or expected? What would change your assessment?