Chapter 11: DB2 Performance Diagnosis: Reading EXPLAIN Output, Interpreting DSN_STATEMNT_TABLE, and Solving Real Problems
> "The worst performance tuning I ever saw was a developer who added an index because the query was slow. The query was slow because it was returning 4 million rows to COBOL for filtering. The index made the 4-million-row fetch faster. Nobody asked why the application needed 4 million rows." — Kwame Asante, CNB Senior DBA
Let me be direct about something. Most DB2 performance problems in COBOL shops are not DB2 problems. They are application design problems that manifest in DB2. The database is where the pain shows up, but the disease lives in the COBOL code, the SQL design, or — most often — in assumptions somebody made in 1997 that stopped being true around 2008.
This chapter teaches you to diagnose, not guess. We are going to work through the actual artifacts that DB2 produces — EXPLAIN output, DSN_STATEMNT_TABLE rows, accounting trace records — and learn to read them the way a radiologist reads an X-ray. You will learn what is normal, what is abnormal, and what is "abnormal but we live with it because fixing it would cost more than the problem."
By the end of this chapter, you will be able to pick up a performance problem, work it systematically from symptom to root cause, and either fix it yourself or hand a DBA a diagnosis so precise that they buy you lunch.
Prerequisites: You need Chapter 6's coverage of the DB2 optimizer and access path selection — we build directly on that foundation. Chapter 8's locking material is essential for understanding contention-related performance problems. Chapter 9's utility coverage matters because RUNSTATS timing affects everything the optimizer does. Chapters 7, 9, and 10 provide useful background but are not strictly required.
Learning Objectives:
- Conduct systematic DB2 performance diagnosis using EXPLAIN output, DSN_STATEMNT_TABLE, and DB2 accounting traces
- Identify and resolve the five most common DB2 performance problems in COBOL applications
- Use DB2 performance monitoring tools (OMEGAMON, BMC) and native DB2 monitoring facilities
- Design a DB2 performance testing strategy for COBOL applications
- Establish DB2 performance baselines and monitoring for the HA banking system
11.1 The Performance Investigation Methodology
Here is what happens in most shops when someone says "DB2 is slow":
- A developer says "my batch job ran long last night."
- A DBA runs a few queries against the catalog.
- Someone adds an index.
- The job runs faster next time (or it does not, and they add another index).
- Six months later there are 47 indexes on the table and INSERT performance has cratered.
This is not diagnosis. This is superstition with a technical vocabulary.
The Systematic Approach
Real performance diagnosis follows a methodology. Here is the one we use, and it has not failed me in 25 years:
Step 1: Define the Problem Precisely
"The batch is slow" is not a problem statement. A problem statement looks like this:
- "Job CNBGL410 step SORT0030 elapsed time increased from 12 minutes to 3 hours 47 minutes between March 14 and March 15."
- "Transaction PINQ average response time is 2.3 seconds, SLA requires sub-second."
- "Program FBCL0220 is consuming 847 CPU seconds per execution, up from 190 CPU seconds last quarter."
Notice the pattern: what changed, when did it change, and what is the measurable delta. If you cannot state the problem with numbers, you are not ready to start diagnosing.
Step 2: Establish the Baseline
Before you can say something is broken, you need to know what "working" looks like. For every DB2-heavy COBOL program, you should have:
- Normal elapsed time range
- Normal CPU time range
- Normal GETPAGE count range
- Normal lock/unlock count
- Normal number of SQL calls by type (SELECT, INSERT, UPDATE, DELETE)
- Normal commit frequency
Where do these numbers come from? DB2 accounting traces (IFCID 3), which we will cover in Section 11.3. If you do not have baselines, your first task is to create them — not to tune.
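A baseline only earns its keep if every new run is compared against it. Here is a minimal sketch of that comparison; the metric names and ranges are invented for illustration and do not come from any DB2 tool:

```python
# Hypothetical baseline ranges, captured from IFCID 3 accounting history.
BASELINE = {
    "elapsed_sec": (600, 900),
    "cpu_sec":     (150, 220),
    "getpages":    (40_000_000, 55_000_000),
    "commits":     (4_000, 6_000),
}

def check_run(metrics):
    """Return the names of metrics outside their baseline range."""
    return [name for name, (low, high) in BASELINE.items()
            if not low <= metrics[name] <= high]

# A run whose CPU doubled while everything else held steady:
check_run({"elapsed_sec": 850, "cpu_sec": 470,
           "getpages": 48_000_000, "commits": 5_100})
```

The point of the sketch: the output is a short list of which organs are sick, not a yes/no verdict — exactly the starting point Step 3 needs.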
Step 3: Isolate the Component
A COBOL batch job that runs long might be slow because of:
- DB2 access path changes (optimizer chose a different plan)
- Data volume growth (table went from 10M to 100M rows)
- Lock contention (another job is holding locks)
- CPU constraint (LPAR capping, WLM throttling)
- I/O constraint (channel busy, DASD contention)
- Application logic (someone changed the program and added a nested cursor loop)
- RUNSTATS staleness (statistics no longer reflect reality)
- Catalog changes (someone dropped and recreated an index)
You need to eliminate possibilities systematically. The accounting trace tells you where time was spent. EXPLAIN tells you how DB2 is accessing data. The application tells you what SQL is being executed and how often.
Step 4: Identify Root Cause
This is where the tools in this chapter come in. Once you know which component is slow, you use the appropriate diagnostic:
| Symptom | Primary Diagnostic |
|---|---|
| High CPU, little wait time | EXPLAIN — check access paths |
| High elapsed, low CPU | Lock wait analysis, I/O analysis |
| High GETPAGE count | EXPLAIN — tablespace scan likely |
| High lock/unlock count | Commit frequency, lock avoidance |
| Sudden change, no code change | RUNSTATS, catalog changes, data growth |
| Gradual degradation | Data volume growth, index fragmentation |
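The table above can be turned into a first-pass dispatcher for run reports. This is a sketch only: the 10% lock-wait rule of thumb appears later in this chapter, but the other thresholds are invented assumptions, not DB2 constants:

```python
def first_diagnostic(lock_wait_pct, getpage_per_sql, cpu_pct_of_elapsed):
    """Pick the first diagnostic to run, mirroring the triage table.
    Thresholds other than the 10% lock-wait rule are illustrative."""
    if lock_wait_pct > 10:              # wait-bound: contention first
        return "lock wait / commit frequency analysis"
    if getpage_per_sql > 100:           # page-hungry SQL: likely scan
        return "EXPLAIN -- tablespace scan likely"
    if cpu_pct_of_elapsed > 50:         # CPU-bound: access paths
        return "EXPLAIN -- check access paths"
    return "compare against baseline / check RUNSTATS"
```

Note the ordering: lock wait is checked first because, as Section 11.3 shows, it is the most common batch-window killer.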
Step 5: Implement and Verify
Fix the problem. Measure the result. Compare to baseline. If the numbers are back in range, you are done. If not, go back to Step 3.
📊 Spaced Review — Chapter 6: Recall the access path selection process. The optimizer chooses an access path based on statistics, predicates, and available indexes. Everything in this chapter assumes you understand that the optimizer is making cost-based decisions. When those decisions go wrong, EXPLAIN is how you find out.
When the Methodology Breaks Down
The methodology assumes you have data. In practice, there are situations where data is incomplete or missing:
No baseline exists. The program is new or nobody ever captured accounting data. In this case, your first run IS the baseline. Capture everything, document it, and use it going forward. Do not attempt to tune a program that has no baseline — you will not know if your changes helped or hurt.
Multiple variables changed simultaneously. Someone deployed new code, rebound the package, and ran RUNSTATS all in the same maintenance window. Now performance is different and you cannot isolate which change caused it. This is why change management matters: one change per window, with measurement between changes. At Federal Benefits, Sandra Kowalski enforces a "one variable" rule for performance-sensitive changes.
The problem is intermittent. Performance is fine most of the time but degrades unpredictably. Intermittent problems are the hardest to diagnose because you cannot reproduce them on demand. The solution is continuous monitoring — capture accounting data for every execution so that when the problem occurs, you have the data. At SecureFirst, Yuki Tanaka's monitoring captures every CICS transaction execution, not just the ones that exceed thresholds.
The problem is environmental. DB2 performance can degrade due to factors entirely outside the database: LPAR CPU capping by WLM, channel contention from another subsystem, storage pool exhaustion, or paging. When accounting data shows high "other wait" or "not in DB2" time, look outside DB2. The RMF (Resource Measurement Facility) reports are your friend here.
The Cardinal Rule
Never tune what you have not measured. I cannot emphasize this enough. I have watched senior developers spend three days rewriting SQL that was not the bottleneck. The bottleneck was a SORT step that was spilling to disk because the region size was too small. Three days wasted because nobody looked at the accounting data first.
At CNB, Lisa Martinez enforces what she calls the "measurement tax": before any performance change is approved, you must produce the before-metrics. No metrics, no change approval. It sounds bureaucratic until you realize it has prevented at least a dozen catastrophic "improvements" over the past five years.
Rob Jeffries adds a corollary: "The measurement tax also prevents us from taking credit for improvements we did not actually make. Last year a developer claimed a 40% improvement from rewriting a query. The accounting data showed the improvement was 3%. The other 37% came from a concurrent hardware upgrade that nobody mentioned."
11.2 EXPLAIN in Depth
You were introduced to EXPLAIN in Chapter 6. Now we go deep. EXPLAIN is not a single thing — it is a set of tables that DB2 populates when you ask it to show its work. The three tables we care about most are:
- PLAN_TABLE — the classic access path table, one row per query block per table access
- DSN_STATEMNT_TABLE — statement-level cost estimates (DB2 V8+), the single most useful performance diagnostic table
- DSN_FUNCTION_TABLE — function-level detail for queries with subqueries, UNIONs, and other multi-step operations
PLAN_TABLE Deep Dive
PLAN_TABLE has been around since DB2 V1. Every DB2 programmer has seen it. Most have not read it carefully enough.
Here are the columns that matter most for COBOL application performance:
```
QUERYNO        - Identifies which SQL statement (matches DBRM)
QBLOCKNO       - Query block number (1 for simple queries)
PLANNO         - Step within the query block
METHOD         - How tables are joined (0=first table, 1=NLJ, 2=merge, 3=sort, 4=hybrid)
CREATOR        - Schema of the table
TNAME          - Table name
TABNO          - Table number in FROM clause
ACCESSTYPE     - How the table is accessed (I=index, R=tablespace scan, etc.)
MATCHCOLS      - Number of index columns matched by predicates
ACCESSCREATOR  - Schema of the index used
ACCESSNAME     - Index name used
INDEXONLY      - Y if DB2 can satisfy query from index alone
SORTN_UNIQ     - Sort of the new table to remove duplicates
SORTN_JOIN     - Sort of the new table for join
SORTN_ORDERBY  - Sort of the new table for ORDER BY
SORTN_GROUPBY  - Sort of the new table for GROUP BY
SORTC_UNIQ     - Sort of the composite table for uniqueness
SORTC_JOIN     - Sort of the composite table for join
SORTC_ORDERBY  - Sort of the composite table for ORDER BY
SORTC_GROUPBY  - Sort of the composite table for GROUP BY
TSLOCKMODE     - Lock mode (IS, IX, S, X, etc.)
PREFETCH       - Prefetch type (S=sequential, L=list, D=dynamic)
COLUMN_FN_EVAL - Where column functions are evaluated
MIXOPSEQ       - Sequence for multiple index access
```
Reading PLAN_TABLE Like a Pro
Here is a real PLAN_TABLE output from CNB's general ledger posting program. I have simplified slightly for clarity:
```
QUERYNO PLANNO METHOD TNAME          ACCESSTYPE MATCHCOLS ACCESSNAME      INDEXONLY PREFETCH
------- ------ ------ -------------- ---------- --------- --------------- --------- --------
     15      1      0 GL_TRANS       I                  3 XGLTRN_ACCT_DT  N         S
     15      2      1 GL_ACCOUNTS    I                  1 XGLACT_PK       Y
     15      3      1 GL_PERIODS     I                  2 XGLPER_YR_PRD   N
```
Let me walk you through this:
- QUERYNO 15 — This is the 15th SQL statement in the DBRM. You match this to your COBOL source to find the actual SQL.
- PLANNO 1, METHOD 0 — First table accessed. METHOD 0 always means "this is where we start." DB2 is accessing GL_TRANS using index XGLTRN_ACCT_DT with 3 matching columns. Sequential prefetch is enabled. This is the outer table of a nested loop join.
- PLANNO 2, METHOD 1 — Nested loop join (METHOD 1) to GL_ACCOUNTS using the primary key index with 1 matching column. INDEXONLY=Y means DB2 never touches the data pages — everything it needs is in the index. No prefetch needed because it is single-row access.
- PLANNO 3, METHOD 1 — Another nested loop join to GL_PERIODS using a 2-column match. This accesses data pages (INDEXONLY=N).
This is a well-optimized access path. Three-column match on the driving table, index-only access for the lookup table, nested loop joins throughout. Sequential prefetch on the first table means DB2 expects to read a range of rows.
Now here is what the same query looked like after someone ran RUNSTATS with the wrong options:
```
QUERYNO PLANNO METHOD TNAME          ACCESSTYPE MATCHCOLS ACCESSNAME      INDEXONLY PREFETCH
------- ------ ------ -------------- ---------- --------- --------------- --------- --------
     15      1      0 GL_PERIODS     R                  0                 N         S
     15      2      1 GL_TRANS       I                  1 XGLTRN_ACCT_DT  N         S
     15      3      1 GL_ACCOUNTS    I                  1 XGLACT_PK       Y
```
See the problem? GL_PERIODS is now the outer table with a tablespace scan (ACCESSTYPE=R, no index). The join order reversed. The MATCHCOLS on GL_TRANS dropped from 3 to 1. This query went from 0.3 seconds to 47 seconds. The fix was re-running RUNSTATS with correct options — the bad statistics made the optimizer think GL_PERIODS had 12 rows when it actually had 1.2 million.
⚠️ Critical Pattern: When access paths change suddenly and no code changed, check RUNSTATS. Always check RUNSTATS. In 25 years, stale or incorrect statistics have been the root cause of more sudden performance degradations than all other causes combined.
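The red-flag check itself can be mechanized. Here is a sketch that scans PLAN_TABLE-shaped rows — represented as plain dicts, since this is not tied to any particular extract tool — for the two patterns from the degraded plan above:

```python
def plan_red_flags(plan_rows):
    """Flag tablespace scans (ACCESSTYPE='R') and non-matching index
    scans (ACCESSTYPE='I' with MATCHCOLS=0) in EXPLAIN rows."""
    flags = []
    for row in plan_rows:
        if row["ACCESSTYPE"] == "R":
            flags.append((row["TNAME"], "tablespace scan"))
        elif row["ACCESSTYPE"] == "I" and row["MATCHCOLS"] == 0:
            flags.append((row["TNAME"], "non-matching index scan"))
    return flags

# The degraded access path shown above:
plan_red_flags([
    {"TNAME": "GL_PERIODS",  "ACCESSTYPE": "R", "MATCHCOLS": 0},
    {"TNAME": "GL_TRANS",    "ACCESSTYPE": "I", "MATCHCOLS": 1},
    {"TNAME": "GL_ACCOUNTS", "ACCESSTYPE": "I", "MATCHCOLS": 1},
])
```

Run after every bind, a check like this turns "someone noticed the job ran long" into "the rebind report flagged GL_PERIODS the same day."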
The ACCESSTYPE Values You Must Know
| ACCESSTYPE | Meaning | Performance Implication |
|---|---|---|
| I | Index access | Usually good — check MATCHCOLS |
| I1 | One-fetch index scan | Excellent — MIN/MAX optimization |
| R | Tablespace scan | Red flag on large tables, fine on small ones |
| M | Multiple index access | DB2 combining indexes — often means no single good index |
| MX | Multiple index with intersection | Similar to M, intersection narrows results |
| MU | Multiple index with union | Union of multiple index scans |
| N | Index scan (IN-list) | Check IN-list size |
MATCHCOLS: The Most Important Number
MATCHCOLS tells you how many leading columns of the index are being used as matching predicates. Higher is better. Here is why it matters:
- MATCHCOLS=0 with ACCESSTYPE=I means a non-matching index scan. DB2 reads the entire index and then fetches data pages in whatever order the index entries dictate. This is often worse than a tablespace scan, which can at least exploit sequential prefetch.
- MATCHCOLS=1 on a 4-column index means only the first column is filtering. DB2 reads all index entries for that one value, then scans through them.
- MATCHCOLS=4 on a 4-column index means a surgical strike — DB2 goes directly to the exact rows.
When Rob Jeffries at CNB reviews EXPLAIN output, the first thing he circles is every MATCHCOLS value. If it is lower than expected, he checks the predicate coding. Common culprits:
- Type mismatch — comparing CHAR to VARCHAR, or INTEGER to DECIMAL, forces DB2 to apply a function, which kills index matching
- Expression on the column side — `WHERE YEAR(TRANS_DATE) = 2025` cannot match an index on TRANS_DATE
- OR predicates — `WHERE ACCT_TYPE = 'C' OR ACCT_TYPE = 'S'` may reduce MATCHCOLS; rewrite as `WHERE ACCT_TYPE IN ('C', 'S')`
- Host variable type mismatch — the COBOL host variable declaration does not match the column type
DSN_STATEMNT_TABLE
This is the table most developers never look at, and it is the most valuable one. DSN_STATEMNT_TABLE contains the optimizer's cost estimates for the entire statement.
Key columns:
```
QUERYNO       - Statement identifier
STMT_ID       - Internal statement ID
GROUP_MEMBER  - Data sharing member
PROCMS        - Estimated processor time in milliseconds
PROCSU        - Estimated processor time in service units
REASON        - Why EXPLAIN was performed
STMT_TYPE     - SELECT, INSERT, UPDATE, DELETE
COST_CATEGORY - A (confident) or B (uncertain)
```
The COST_CATEGORY column is gold. If COST_CATEGORY = 'B', DB2 is telling you it is not confident in its cost estimate. This usually means:
- Statistics are missing or stale
- Host variables prevent the optimizer from knowing selectivity
- Correlation between columns is not captured
When you see COST_CATEGORY='B', do not trust the access path. Run RUNSTATS with distribution statistics (KEYCARD, FREQVAL, and HISTOGRAM if your DB2 version supports it).
PROCMS and PROCSU give you the optimizer's estimate of how expensive the statement will be. These are estimates, not measurements — but they are useful for comparison. If PROCMS says 50 and your actual execution takes 50,000, something is very wrong with the statistics.
At Pinnacle Health, Ahmad Rashid built a monitoring query that compares estimated PROCMS to actual CPU from accounting traces:
```sql
SELECT S.QUERYNO,
       S.PROCMS AS ESTIMATED_MS,
       A.ACTUAL_CPU_MS,
       CASE WHEN A.ACTUAL_CPU_MS > S.PROCMS * 10
            THEN 'INVESTIGATE'
            WHEN A.ACTUAL_CPU_MS > S.PROCMS * 3
            THEN 'WATCH'
            ELSE 'OK'
       END AS STATUS
  FROM DSN_STATEMNT_TABLE S
  JOIN ACCOUNTING_SUMMARY A
    ON S.QUERYNO = A.QUERYNO
 WHERE S.COST_CATEGORY = 'B'
    OR A.ACTUAL_CPU_MS > S.PROCMS * 10
 ORDER BY A.ACTUAL_CPU_MS DESC;
```
This query has caught at least a dozen problems before they became incidents.
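The same triage logic, pulled out of SQL so it can be applied to accounting data exported to flat files. The 10x and 3x thresholds mirror Ahmad's query exactly (ACCOUNTING_SUMMARY is this book's illustrative table, not something DB2 supplies):

```python
def cost_triage(estimated_ms, actual_cpu_ms):
    """Classify estimate-vs-actual drift using the 10x / 3x thresholds
    from the monitoring query above."""
    if actual_cpu_ms > estimated_ms * 10:
        return "INVESTIGATE"
    if actual_cpu_ms > estimated_ms * 3:
        return "WATCH"
    return "OK"

cost_triage(50, 50_000)   # the PROCMS=50, actual=50,000 case above
```

The 1000x gap in that example is why the function returns "INVESTIGATE": when actual CPU dwarfs the estimate, the statistics the optimizer used bear no resemblance to the data.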
Interpreting EXPLAIN for Multi-Table Joins
COBOL programs frequently join three, four, or five tables in a single query — account to customer to address to phone, for example. The PLAN_TABLE output for these queries requires careful reading because the join order and join method interact.
Consider this four-table join from SecureFirst's fraud detection program:
```
QNO PLANNO METHOD TNAME           ACCESSTYPE MATCHCOLS ACCESSNAME       IXONLY PREFETCH
--- ------ ------ --------------- ---------- --------- ---------------- ------ --------
  5      1      0 TRANSACTIONS    I                  3 XTRN_ACCT_DT_TYP N      S
  5      2      1 ACCOUNTS        I                  1 XACT_PK          Y
  5      3      1 CUSTOMERS       I                  1 XCUST_PK         Y
  5      4      2 FRAUD_SCORES    I                  2 XFRAUD_TRN_DT    N
```
Reading this: DB2 starts with TRANSACTIONS (METHOD=0), using a 3-column index match with sequential prefetch — this is the driving table. For each transaction row, it does a nested loop join (METHOD=1) to ACCOUNTS using the primary key (index-only, no data page access). Then another nested loop to CUSTOMERS, also index-only. Finally, a merge scan join (METHOD=2) to FRAUD_SCORES.
Why the merge scan for FRAUD_SCORES? Because the data from the TRANSACTIONS driving table and the FRAUD_SCORES data both need to be in the same order for an efficient merge. DB2 determined that sorting and merging was cheaper than 4.8 million random index probes. Yuki Tanaka verified this by running the query both ways — the merge scan was 40% faster than a forced nested loop join.
The key insight: merge scan joins and hybrid joins are not bad. They are bad only when they are unexpected. If you expect a nested loop and get a merge scan, investigate. But if the optimizer chose merge scan for good reasons — large intermediate result sets, matching sort orders — trust it (after verifying with actual measurements).
DSN_FUNCTION_TABLE
DSN_FUNCTION_TABLE provides detail on individual operations within a query — sorts, scans, joins. It is most useful for complex queries with subqueries, UNIONs, or views.
Key columns:
```
QUERYNO        - Statement identifier
QBLOCKNO       - Query block
PLANNO         - Step number
TABLE_SCANNED  - Table involved
COLUMN_FN_EVAL - Where column functions are evaluated (R=during retrieval, S=during sort)
PAGE_RANGE     - Estimated page range scanned
PREFETCH       - Prefetch strategy
```
The PAGE_RANGE column deserves attention. It tells you how many pages DB2 expects to scan. If PAGE_RANGE for a query is 500,000 and the tablespace only has 600,000 pages, you effectively have a tablespace scan even though ACCESSTYPE says 'I'. This is what Ahmad Rashid at Pinnacle Health calls a "stealth scan" — the EXPLAIN output looks like index access, but the actual I/O pattern is indistinguishable from a tablespace scan. Always check PAGE_RANGE relative to the total tablespace size, not just the ACCESSTYPE.
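The stealth-scan check is a one-line ratio. A sketch — the 50% threshold is an assumption for illustration, not a DB2 constant; tune it to your own tolerance:

```python
def is_stealth_scan(page_range, total_pages, threshold=0.5):
    """True when an 'index' access touches enough of the tablespace
    that its I/O pattern is effectively a full scan."""
    return page_range / total_pages >= threshold

is_stealth_scan(500_000, 600_000)   # the case described above
```

At 500,000 of 600,000 pages (83%), the access is flagged regardless of what ACCESSTYPE claims.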
Another important detail in DSN_FUNCTION_TABLE is COLUMN_FN_EVAL. When this shows 'S' (evaluated during sort), it means DB2 is computing column functions like SUM, COUNT, or AVG during a sort operation rather than during data retrieval. This can indicate that an otherwise unnecessary sort was added to accommodate aggregation. If your query does not logically require sorting, a COLUMN_FN_EVAL='S' value is worth investigating — it may indicate a missing index that could provide data in the required order.
📊 Spaced Review — Chapter 9: Remember from Chapter 9 that RUNSTATS timing matters. Running RUNSTATS immediately after a utility reorganization, before new data arrives, gives the optimizer a snapshot of a perfectly organized tablespace. Run it again after a representative data load to capture real-world distributions.
11.3 DB2 Accounting Traces
EXPLAIN tells you what DB2 plans to do. Accounting traces tell you what DB2 actually did. The difference between plan and reality is where performance problems live.
IFCID 3: The Thread Termination Record
IFCID 3 fires when a DB2 thread terminates (or at commit for long-running threads with accounting rollup). This is your primary performance diagnostic for COBOL programs. It contains:
Class 1 — Elapsed Time:
- Total elapsed time for the thread
- Time in DB2 (vs. time in the application)
- Wait times broken down by category

Class 2 — CPU Time:
- TCB CPU time in DB2
- SRB CPU time in DB2
- CPU time for SQL statement processing
- CPU time for sort operations

Class 3 — Wait Time Detail:
- I/O wait (synchronous reads, writes)
- Lock/latch wait
- Global lock wait (data sharing)
- Drain wait
- Claim wait
- Page latch wait
- Log write wait
- Other waits

SQL Activity Counters:
- Number of SELECT, INSERT, UPDATE, DELETE, OPEN, FETCH, CLOSE
- Number of COMMIT and ROLLBACK
- Number of PREPARE (for dynamic SQL)
- Number of GETPAGE requests
- Number of sequential prefetch requests and pages read
- Number of synchronous I/O requests
- Buffer pool hit ratio data
Reading Accounting Data
Here is a real (anonymized) accounting summary from CNB's nightly batch cycle:
```
PROGRAM: CNBGL410          PLAN: CNBGLPLN          CONNTYPE: BATCH
START:   2025-03-14 22:15:03     END: 2025-03-15 01:47:22
ELAPSED: 12,739.4 SEC            IN-DB2: 11,847.2 SEC (93.0%)

CLASS 1 ELAPSED:      12,739.4 SEC
CLASS 2 TCB CPU:       1,247.3 SEC
CLASS 2 SRB CPU:         189.4 SEC
TOTAL DB2 CPU:         1,436.7 SEC

CLASS 3 WAITS (PCT OF IN-DB2 TIME):
  SYNC I/O WAIT:         847.2 SEC  ( 7.2%)
  LOCK WAIT:           7,312.8 SEC  (61.7%)
  LOG WRITE:             104.3 SEC  ( 0.9%)
  OTHER WAIT:            146.2 SEC  ( 1.2%)

SQL ACTIVITY:
  SELECT:            14,847,223
  INSERT:             2,341,008
  UPDATE:             8,472,119
  DELETE:                47,882
  COMMIT:                   847
  OPEN CURSOR:       14,847,223
  FETCH:             29,694,446
  CLOSE CURSOR:      14,847,223
  GETPAGE:          247,881,334
  BUFFER HIT RATIO:       98.7%
  PREFETCH REQUESTS:  1,247,882
  SYNC I/O:           3,247,119
```
Now let me teach you to read this like a practitioner.
First question: Where did the time go?
Total elapsed: 12,739 seconds (~3.5 hours). DB2 CPU: 1,437 seconds (11.3% of elapsed). Lock wait: 7,313 seconds (57.4% of elapsed). This program spent more than half its time waiting for locks. That is your primary target.
Second question: Is the SQL volume reasonable?
14.8 million SELECTs, 29.7 million FETCHes (2 fetches per cursor open — reasonable for single-row result sets plus the end-of-cursor fetch), 8.5 million UPDATEs, 2.3 million INSERTs. Only 847 COMMITs. That means the average commit interval is approximately 30,000 SQL statements. This is a problem. We will discuss commit frequency in Section 11.4.
Third question: Is the buffer pool working?
98.7% hit ratio looks good. But 247 million GETPAGEs with 1.3% misses means 3.2 million physical I/Os. The sync I/O count of 3.2 million confirms this. At 847 seconds of I/O wait, that is about 0.26 milliseconds per synchronous I/O — reasonable for cached DASD.
Fourth question: What is the ratio of SQL calls to useful work?
14.8 million SELECTs for a general ledger posting program that processes approximately 2.3 million transactions (based on INSERT count). That is 6.4 SELECTs per transaction. Is that reasonable? It depends on the program design — but it is worth checking whether some of those SELECTs are redundant lookups that could be cached in COBOL working storage.
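The four questions reduce to a few lines of arithmetic. Here is a sketch using the CNBGL410 numbers verbatim from the trace listing above:

```python
# Figures taken directly from the CNBGL410 accounting summary.
elapsed_sec   = 12_739.4
lock_wait_sec = 7_312.8
io_wait_sec   = 847.2
sync_io       = 3_247_119
selects, inserts, updates, deletes = 14_847_223, 2_341_008, 8_472_119, 47_882
commits       = 847

lock_pct        = lock_wait_sec / elapsed_sec * 100   # share of elapsed in lock wait
sql_per_commit  = (selects + inserts + updates + deletes) // commits
ms_per_sync_io  = io_wait_sec * 1000 / sync_io        # cost of one synchronous read
selects_per_txn = selects / inserts                   # INSERTs approximate transactions
```

Running these yields the chapter's conclusions: roughly 57% of elapsed time in lock wait, a commit interval over 30,000 SQL statements, about 0.26 ms per synchronous I/O, and more than six SELECTs per transaction.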
Thread-Level Analysis
When you have multiple programs running concurrently, you need to correlate accounting data across threads. The classic scenario at CNB:
- CNBGL410 (GL posting) runs from 22:00 to 01:47 with 7,313 seconds lock wait
- CNBGL420 (GL summarization) runs from 22:30 to 02:15 with 5,847 seconds lock wait
- CNBGL430 (GL archival) runs from 23:00 to 03:10 with 9,222 seconds lock wait
All three programs access the same GL_TRANS table. All three are acquiring page locks or row locks and holding them until commit. Because commit frequency is low (every 30,000 statements), the lock duration is long, and all three programs are serializing on each other.
The fix at CNB was threefold:
1. Increased commit frequency to every 500 rows
2. Sequenced the jobs so GL posting completes before summarization starts
3. Added ISOLATION(UR) to the summarization program (which only reads committed data anyway)
Lock wait dropped from 7,313 seconds to 12 seconds. Elapsed time dropped from 3.5 hours to 38 minutes. No SQL was changed. No indexes were added.
💡 Practitioner Insight: When I teach new DBAs, I tell them: "Read the accounting trace like a doctor reading bloodwork. The numbers tell you what organ is sick. CPU = brain. Lock wait = circulatory. I/O = muscles. You do not order an MRI until the bloodwork tells you which organ to scan."
11.4 The Five Common DB2 Problems
In 25 years of mainframe performance work, I have seen hundreds of DB2 performance issues. They almost all fall into five categories. Master these five and you will handle 90% of what comes across your desk.
Problem 1: Tablespace Scans on Large Tables
Symptom: High GETPAGE count, high CPU, ACCESSTYPE='R' in EXPLAIN.
Root Causes:
- No suitable index exists
- Index exists but predicates do not match leading columns
- Statistics are stale — optimizer thinks table is small
- Stage 2 predicates (predicates DB2 cannot push to the index)
Diagnosis at Pinnacle Health:
Diane Chen's claims processing program was scanning the entire CLAIMS table (47 million rows) because the WHERE clause used:
```sql
SELECT CLAIM_ID, PATIENT_ID, PROC_CODE, CLAIM_AMT
  FROM CLAIMS
 WHERE SUBSTR(CLAIM_ID, 1, 3) = :WS-REGION-CODE
   AND CLAIM_STATUS = 'P'
   AND CLAIM_DATE BETWEEN :WS-START-DATE AND :WS-END-DATE
```
The SUBSTR function on CLAIM_ID killed index matching. The optimizer could not use the index on CLAIM_ID because SUBSTR is a function applied to the column.
Fix: Restructured the data model to add a REGION_CODE column (which should have existed from the beginning) and created an appropriate index:
```sql
CREATE INDEX XCLM_REGION_STATUS_DATE
    ON CLAIMS (REGION_CODE, CLAIM_STATUS, CLAIM_DATE)
    USING STOGROUP CLMSG01
    BUFFERPOOL BP2;
```
Then rewrote the query:
```sql
SELECT CLAIM_ID, PATIENT_ID, PROC_CODE, CLAIM_AMT
  FROM CLAIMS
 WHERE REGION_CODE = :WS-REGION-CODE
   AND CLAIM_STATUS = 'P'
   AND CLAIM_DATE BETWEEN :WS-START-DATE AND :WS-END-DATE
```
Result: MATCHCOLS went from 0 to 3. Elapsed time dropped from 14 minutes to 0.8 seconds.
General Rules for Avoiding Tablespace Scans:
1. Never apply functions to indexed columns in WHERE clauses
2. Ensure predicates match the leading columns of available indexes
3. Keep statistics current — run RUNSTATS after significant data changes
4. For range predicates, the range column should be last in the index
5. Check for implicit type conversions (COBOL PIC 9 vs. DB2 INTEGER mismatches)
The Stage 1 vs. Stage 2 Predicate Distinction:
Not all predicates are created equal. DB2 classifies predicates into stages:
- Stage 1 (sargable) predicates can be evaluated by the Data Manager during index or data page scanning. These predicates participate in access path selection and can use indexes.
- Stage 2 predicates can only be evaluated by the Relational Data System after rows are retrieved. They never use indexes.
Common Stage 2 predicates in COBOL programs:
- `WHERE COL1 <> :HV` (not equal — cannot be index-matched)
- `WHERE COL1 LIKE '%value'` (leading wildcard)
- `WHERE VALUE(COL1, 'X') = 'Y'` (scalar function on column)
- `WHERE COL1 || COL2 = :HV` (concatenation)
- `WHERE :HV BETWEEN COL1 AND COL2` (host variable between two columns)
The performance impact is dramatic. A Stage 1 predicate with an index might examine 100 rows. The equivalent Stage 2 predicate forces DB2 to read all rows and filter afterward — potentially millions. At CNB, Rob Jeffries maintains a "Stage 2 watchlist" of queries that use Stage 2 predicates on high-volume tables, reviewing them quarterly for optimization opportunities.
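A watchlist like Rob's can be seeded automatically. This is a heuristic sketch only — predicate staging is decided by DB2, not by pattern matching, so regex hits merely shortlist statements for human review, and the patterns below cover only some of the cases listed above:

```python
import re

# Heuristic textual signatures of common Stage 2 patterns.
STAGE2_PATTERNS = [
    (r"<>\s*:",                   "not-equal against host variable"),
    (r"LIKE\s+'%",                "leading-wildcard LIKE"),
    (r"\|\|",                     "concatenation in predicate"),
    (r"\w+\s*\(\s*\w+\s*\)\s*=",  "scalar function on column side"),
]

def stage2_suspects(sql):
    """Return labels of Stage 2 patterns found in an SQL fragment."""
    return [label for pattern, label in STAGE2_PATTERNS
            if re.search(pattern, sql, re.IGNORECASE)]

stage2_suspects("WHERE YEAR(TRANS_DATE) = 2025")
```

False positives are acceptable here — the output feeds a quarterly review, not an automated rewrite.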
Problem 2: Lock Contention
Symptom: High elapsed time, low CPU, high lock wait in accounting data. IRLM reports lock timeout or deadlock.
Root Causes:
- Infrequent COMMITs holding locks too long
- Lock escalation from row/page to tablespace
- Hot page — many threads updating the same page
- Poor job scheduling — concurrent programs competing for same data
- ISOLATION level too restrictive
Diagnosis:
Lock contention is the most common cause of batch window overruns. The accounting trace tells you immediately — if lock wait time is more than 10% of elapsed time, you have a contention problem.
At Federal Benefits, Sandra Kowalski discovered that the monthly eligibility recalculation job was deadlocking with the real-time eligibility inquiry transaction. The batch job was updating BENEFICIARY rows by SSN order. The online transaction was reading and occasionally updating the same rows in random order determined by which citizen walked into which field office.
Fix: Three changes:
1. Batch job commit frequency increased from every 10,000 rows to every 200 rows
2. Online transaction changed to ISOLATION(CS) WITH USE AND KEEP EXCLUSIVE LOCKS (the narrowest lock needed)
3. Batch job processing order changed to match the clustering index order, reducing page lock contention
Marcus Williams wrote the batch update loop that releases locks every 200 rows:
```cobol
      * CSR-BENEFICIARY is declared WITH HOLD and FOR UPDATE OF
      * ELIG_STATUS, CALC_DATE so the intermediate COMMITs neither
      * close the cursor nor lose position.
           PERFORM 2000-FETCH-NEXT-BENEFICIARY
           PERFORM UNTIL WS-EOF-FLAG = 'Y'
               PERFORM 2100-RECALCULATE-ELIGIBILITY
               EXEC SQL
                   UPDATE BENEFICIARY
                      SET ELIG_STATUS = :WS-NEW-STATUS,
                          CALC_DATE   = CURRENT DATE
                    WHERE CURRENT OF CSR-BENEFICIARY
               END-EXEC
               ADD 1 TO WS-ROW-COUNT
               IF WS-ROW-COUNT >= 200
                   EXEC SQL COMMIT END-EXEC
                   MOVE 0 TO WS-ROW-COUNT
               END-IF
               PERFORM 2000-FETCH-NEXT-BENEFICIARY
           END-PERFORM
```
📊 Spaced Review — Chapter 8: Recall the locking hierarchy from Chapter 8: row locks, page locks, table/tablespace locks. Lock escalation occurs when the number of page locks exceeds LOCKMAX for the tablespace. Set LOCKMAX thoughtfully — too low and you escalate too often (causing contention), too high and you consume IRLM storage.
Problem 3: Thread Reuse Issues
Symptom: First execution of a program is fast, subsequent executions are slow (or vice versa). Thread reuse statistics in accounting look abnormal.
Root Causes:
- CICS thread reuse with residual cursor state
- Plan invalidation between executions
- Connection pool reuse with different SPECIAL REGISTERS
- Package not rebound after catalog changes
Diagnosis at SecureFirst:
Yuki Tanaka found that SecureFirst's fraud detection transaction was showing bimodal response times — either 0.1 seconds or 4.2 seconds, nothing in between. The 4.2-second executions always had a PREPARE in the accounting trace. The 0.1-second executions did not.
Root cause: The transaction used dynamic SQL (it had to — the WHERE clause varied based on fraud pattern). The dynamic SQL cache was sized too small, and the prepared statement was being evicted between executions. Every eviction caused a full PREPARE on the next call.
Fix: Carlos Mendez increased the dynamic statement cache (EDM pool MAXKEEPD parameter) and added KEEPDYNAMIC(YES) to the package bind. The PREPARE only happens once, and subsequent executions reuse the cached statement.
```
BIND PACKAGE(SFCOLL) MEMBER(SFFRAUD1) -
     KEEPDYNAMIC(YES)                 -
     REOPT(ONCE)                      -
     ISOLATION(CS)                    -
     RELEASE(DEALLOCATE)
```
Note RELEASE(DEALLOCATE) — this tells DB2 to keep the package allocated across commits, which preserves the dynamic statement cache. Without it, every COMMIT invalidates the cache.
Thread Reuse in CICS — A Deeper Look:
CICS manages DB2 threads through a pool. When a CICS transaction requests a DB2 connection, CICS either gives it an existing idle thread or creates a new one. The thread creation involves:
1. Allocating DB2 resources (internal control blocks, authorization checking)
2. Loading the package into the EDM pool (if not already there)
3. Preparing any dynamic SQL statements
4. Establishing the working storage environment
Steps 1-3 can take 50-200 milliseconds on the first execution. On subsequent executions with a reused thread, steps 1-2 are skipped — the resources are already allocated. This is why the first execution is slower.
The CICS DB2 connection parameters that matter:
- THREADLIMIT — maximum number of active threads for a transaction type. Too low means thread wait; too high means DB2 resource consumption.
- THREADWAIT — whether CICS waits for a thread or abends when none is available. Set to YES for critical transactions, NO for low-priority ones.
- ACCOUNTREC — when accounting records are produced. TASK produces one per CICS task; TXID accumulates across tasks of the same type; UOW produces one per unit of work.
At CNB, Rob Jeffries configures ACCOUNTREC(TASK) for all production CICS regions. This gives per-transaction accounting data, which is essential for response time SLA monitoring. The overhead is negligible — one IFCID 3 record per transaction.
Problem 4: Commit Frequency
Symptom: Long lock duration, high lock wait, potential for log space issues, long ROLLBACK times if the program abends.
This is the single most common performance problem in COBOL batch applications. I have seen it at every shop I have ever worked in.
The Math:
Consider a batch program that processes 1 million rows and commits once at the end:
- Lock duration: the entire run time (could be hours)
- Log space consumed: all 1 million row changes in a single unit of recovery
- If the program abends at row 999,999: DB2 must roll back all 999,999 changes
- Every other program that needs any of those locked rows: waits
Now consider the same program committing every 500 rows:
- Lock duration: time to process 500 rows (seconds)
- Log space consumed: 500 rows per unit of recovery
- If the program abends: DB2 rolls back at most 499 changes
- Other programs wait at most a few seconds
The Right Commit Frequency:
There is no universal answer, but here are guidelines:
| Application Type | Commit Interval | Rationale |
|---|---|---|
| High-volume batch, shared tables | 200-500 rows | Minimize lock duration |
| High-volume batch, exclusive access | 5,000-10,000 rows | Minimize commit overhead |
| CICS/IMS transaction | Every logical unit of work | Architectural requirement |
| Utility-like programs | 1,000-5,000 rows | Balance throughput and recoverability |
The Commit Overhead Myth:
Developers avoid frequent commits because "commits are expensive." Let us quantify. A DB2 COMMIT involves:
- Force the log buffer to disk (if not already written)
- Release locks
- Signal waiting threads
The log force is the expensive part — approximately 1-3 milliseconds. So 2,000 commits add 2-6 seconds of overhead. Compare that to the minutes or hours of lock wait you eliminate.
At CNB, Kwame Asante calculated that switching from commit-every-50,000 to commit-every-500 added 37 seconds of commit overhead to a job that previously waited 2.4 hours for locks. The net improvement was 2 hours and 23 minutes.
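Kwame's arithmetic generalizes into a back-of-the-envelope model. The sketch below (Python, purely illustrative; the 2 ms log-force cost and row counts are assumptions, not measurements from any subsystem) shows why commit overhead is almost never the thing to worry about:

```python
# Back-of-the-envelope commit cost model. The ~2 ms log force and the
# row counts are illustrative assumptions, not measured values.

def commit_overhead_sec(total_rows, commit_interval, log_force_ms=2.0):
    """Total commit cost for a batch run: one log force per commit."""
    commits = total_rows // commit_interval
    return commits * log_force_ms / 1000.0

# A 1,000,000-row batch job:
coarse = commit_overhead_sec(1_000_000, 50_000)  # 20 commits
fine = commit_overhead_sec(1_000_000, 500)       # 2,000 commits

# Going from commit-every-50,000 to commit-every-500 adds roughly
# 4 seconds of commit overhead, trivial next to hours of lock wait.
added = fine - coarse
```

The model ignores second-order effects (lock release cost, thread signaling), but those are even smaller than the log force.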
Restartability:
Frequent commits also enable restart logic. If your program commits every 500 rows and records its position, it can restart from the last commit point instead of reprocessing everything from the beginning. This is essential for the HA banking system, which we will address in the project checkpoint.
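The commit-and-record-position pattern is worth seeing in skeleton form. Here is an illustrative Python model (the dictionary stands in for a DB2 restart table, `process` for the row logic; all names are hypothetical, not from any real program):

```python
# Sketch of the commit-with-checkpoint restart pattern. The dictionary
# stands in for a DB2 restart table; process() stands in for the row
# logic. Names are illustrative.

def run_batch(rows, checkpoint, process, commit_interval=500):
    """Process (key, row) pairs in key order, recording position at
    every commit point so a restart resumes after the last commit."""
    last_key = checkpoint.get("last_key")
    pending = 0
    key = last_key
    for key, row in rows:
        if last_key is not None and key <= last_key:
            continue                      # committed in a prior run: skip
        process(row)
        pending += 1
        if pending >= commit_interval:
            checkpoint["last_key"] = key  # record position, then COMMIT
            pending = 0
    if pending:
        checkpoint["last_key"] = key      # final partial interval
```

In the COBOL equivalent, the position is written to a restart table in the same unit of work as the data changes, so the checkpoint and the commit are atomic.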
Problem 5: Data Skew
Symptom: EXPLAIN shows a good access path, but performance is poor for specific input values. Some executions are fast, others are slow with the same program.
Root Causes:
- Non-uniform data distribution — the optimizer uses average selectivity, but some values match millions of rows and others match ten
- Correlated columns — the combination of two column values has different selectivity than either column alone
- Temporal skew — recent data is "hot" and represents a disproportionate share of queries
Diagnosis at CNB:
The account inquiry transaction was fast for most accounts but took 12 seconds for corporate accounts. EXPLAIN showed index access with MATCHCOLS=2. The problem: corporate accounts had 500-50,000 transactions per month. Individual accounts had 5-50. The optimizer estimated 25 transactions per account (the average) and chose nested loop join, which was correct for individuals but catastrophic for corporations.
Fix: RUNSTATS with FREQVAL COUNT 50 on the ACCOUNT_TYPE column. This gave the optimizer distribution statistics for the most frequent values. DB2 then used a different access path for corporate accounts (merge scan join instead of nested loop join) when the host variable value fell into the high-frequency bucket.
Additional option: REOPT(ALWAYS) on the package bind, which tells DB2 to re-optimize the SQL at execution time using actual host variable values. This has a CPU cost for the PREPARE but ensures the access path matches the actual data.
BIND PACKAGE(CNBCOLL) MEMBER(CNBACIQ1) -
REOPT(ALWAYS) -
ISOLATION(CS)
Use REOPT(ALWAYS) sparingly — it adds PREPARE overhead to every execution. REOPT(ONCE) is often sufficient, as it optimizes once with actual values and caches the result.
A deeper example of data skew at Federal Benefits:
Sandra Kowalski found that the eligibility verification transaction had a 99th percentile response time of 8.4 seconds while the average was 0.2 seconds. The cause: military veteran beneficiaries. A veteran's eligibility record could have 200+ service entries spanning decades of active duty, reserve duty, and various benefit categories. The typical beneficiary had 3-5 entries.
The verification query joined BENEFICIARY to SERVICE_HISTORY using a nested loop join — perfect for 3-5 rows, catastrophic for 200+ rows. For veterans, DB2 was probing the SERVICE_HISTORY index 200+ times per query, each time doing a random I/O.
Marcus Williams solved this without changing the SQL or the access path. He added application-level caching in the COBOL program:
01  WS-CACHE-TABLE.
    05  WS-CACHE-ENTRY            OCCURS 100 TIMES.
        10  WS-CACHE-BEN-ID       PIC X(10).
        10  WS-CACHE-ELIG-CODE    PIC X(02).
        10  WS-CACHE-EXP-DATE     PIC X(10).
01  WS-CACHE-COUNT                PIC S9(4) COMP VALUE 0.
For the high-volume veteran accounts that were identified through the accounting data, the program cached the eligibility result in working storage. Subsequent lookups for the same beneficiary hit the cache instead of DB2. This reduced the 99th percentile from 8.4 seconds to 1.1 seconds without any DB2 changes at all.
The lesson: sometimes the best DB2 performance fix is not a DB2 fix.
💡 Practitioner Insight: Data skew is the hardest of the five problems to diagnose because EXPLAIN looks fine. The access path is correct for the average case. You find this problem by correlating slow executions with specific input values. When someone says "it is usually fast but sometimes it is slow," think data skew first.
11.5 Performance Monitoring Tools
OMEGAMON for DB2
IBM OMEGAMON for DB2 (previously OMEGAMON XE for DB2 Performance Expert, and before that DB2 Performance Monitor) is the primary real-time monitoring tool in most mainframe shops. Here is what it gives you:
Real-Time Thread Monitoring:
- Active thread display showing current SQL, elapsed time, CPU, lock waits
- Thread detail drilldown showing SQL text, access path, buffer pool usage
- Ability to cancel runaway threads
Exception Monitoring:
- Configurable thresholds (e.g., "alert if any thread exceeds 300 seconds elapsed")
- Lock contention alerts
- Tablespace scan alerts for specified tables
- Dynamic SQL cache overflow alerts
Historical Reporting:
- Accounting trace data collection and reporting
- Trend analysis (performance over time)
- Workload comparison (this month vs. last month)
Near-Term History:
- SQL activity for the last N minutes
- Thread history for analysis after a problem resolves
At Pinnacle Health, Ahmad uses OMEGAMON's "top SQL" feature daily. It shows the top SQL statements by CPU, elapsed time, or GETPAGE count over a configurable interval. He reviews it every morning at 8:00 AM looking for new entries — any SQL statement that appears in the top 20 that was not there yesterday gets investigated.
BMC MainView for DB2
BMC's tool provides similar functionality with some different strengths:
- Stronger historical trending and capacity planning
- Better integration with BMC's broader suite (MainView for CICS, MainView for IMS)
- SQL workload analytics — grouping related SQL statements
- Automated recommendation engine (take these with a grain of salt — they are starting points, not prescriptions)
Native DB2 Monitoring Facilities
You do not always need a commercial tool. DB2 provides built-in monitoring:
DISPLAY THREAD:
-DIS THREAD(*) TYPE(ACTIVE) DETAIL
Shows currently active threads, their status, and what they are waiting for.
DISPLAY BUFFERPOOL:
-DIS BUFFERPOOL(BP0) DETAIL
Shows buffer pool hit ratio, steal count, prefetch activity. If the hit ratio drops below 95%, investigate which objects are consuming pool space.
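The hit ratio itself is a simple derivation from two counters in that display: GETPAGE requests and synchronous reads. A minimal sketch (this simplified form ignores prefetch accounting, which the full DB2 calculation includes):

```python
# Hit ratio from two buffer pool counters: the fraction of GETPAGE
# requests satisfied without a synchronous read from disk. Simplified;
# prefetch reads are ignored here.

def hit_ratio_pct(getpages, sync_reads):
    if getpages == 0:
        return 100.0
    return 100.0 * (getpages - sync_reads) / getpages

# 1,000,000 getpages with 30,000 synchronous reads -> 97.0%
```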
DISPLAY DATABASE:
-DIS DB(CNBGLDB) SPACENAM(*) LIMIT(*)
Shows tablespace and indexspace status, including page counts and extent information. A tablespace that has grown to many extents may benefit from reorganization.
DB2 Statistics Trace (IFCID 1):
The statistics trace fires at regular intervals (default 30 minutes) and provides system-level metrics:
- Buffer pool activity (all pools)
- EDM pool usage
- Log activity
- Locking activity (global)
- SQL statement counts
- Sort activity
- RID pool usage
Use the statistics trace for capacity planning and system-level trending. Use the accounting trace (IFCID 3) for application-level diagnosis.
DB2 Performance Trace:
The performance trace provides detailed timing for individual SQL executions. It is expensive — typically 5-15% CPU overhead — and should only be activated for specific diagnosis, never left on permanently.
When to use it:
- You need per-SQL-statement timing within a program
- You need to identify exactly which FETCH in a cursor loop is slow
- You need to correlate SQL execution with lock wait events
At Federal Benefits, Sandra activates the performance trace only during scheduled maintenance windows when she is investigating a specific problem. She limits it to specific plans and threads to minimize overhead:
-START TRACE(PERFM) CLASS(1,2,3) PLAN(FBELPLN) DEST(GTF)
Building Your Own Monitoring Without Commercial Tools
Not every shop has OMEGAMON or BMC. Smaller shops — and we have all worked in them — often rely on native DB2 facilities and custom monitoring programs. Here is a practical approach:
Custom Accounting Data Collector:
Write a COBOL program (or REXX exec) that runs as a started task, reading the SMF records produced by the DB2 accounting trace. Parse the IFCID 3 records and insert summary data into a DB2 monitoring table:
CREATE TABLE HA_PERF_HISTORY (
CAPTURE_DATE DATE NOT NULL,
PROGRAM_NAME CHAR(8) NOT NULL,
PLAN_NAME CHAR(8) NOT NULL,
CONN_TYPE CHAR(8),
ELAPSED_SEC DECIMAL(15,2),
CPU_SEC DECIMAL(15,2),
LOCK_WAIT_SEC DECIMAL(15,2),
IO_WAIT_SEC DECIMAL(15,2),
GETPAGE_COUNT BIGINT,
SQL_SELECT BIGINT,
SQL_INSERT BIGINT,
SQL_UPDATE BIGINT,
SQL_DELETE BIGINT,
COMMIT_COUNT BIGINT,
BUFFER_HIT_PCT DECIMAL(5,2),
LOCK_ESCAL INTEGER,
DEADLOCK_COUNT INTEGER,
TIMEOUT_COUNT INTEGER,
PRIMARY KEY (CAPTURE_DATE, PROGRAM_NAME, PLAN_NAME)
);
With this table populated daily, you can write trending queries, generate exception reports, and detect gradual degradation — all without a commercial monitoring tool. At SecureFirst, Carlos Mendez built exactly this system using 400 lines of COBOL and a dozen SQL queries. It has been running in production for six years.
Automated Alert Generation:
A simple batch program that runs after each batch cycle can check for exceptions:
SELECT PROGRAM_NAME, ELAPSED_SEC,
(SELECT AVG(ELAPSED_SEC)
FROM HA_PERF_HISTORY H2
WHERE H2.PROGRAM_NAME = H1.PROGRAM_NAME
AND H2.CAPTURE_DATE BETWEEN
CURRENT DATE - 30 DAYS AND CURRENT DATE - 1 DAY
) AS AVG_30DAY,
ELAPSED_SEC /
(SELECT AVG(ELAPSED_SEC)
FROM HA_PERF_HISTORY H2
WHERE H2.PROGRAM_NAME = H1.PROGRAM_NAME
AND H2.CAPTURE_DATE BETWEEN
CURRENT DATE - 30 DAYS AND CURRENT DATE - 1 DAY
) AS RATIO_TO_AVG
FROM HA_PERF_HISTORY H1
WHERE H1.CAPTURE_DATE = CURRENT DATE
AND ELAPSED_SEC >
(SELECT AVG(ELAPSED_SEC) * 1.5
FROM HA_PERF_HISTORY H2
WHERE H2.PROGRAM_NAME = H1.PROGRAM_NAME
AND H2.CAPTURE_DATE BETWEEN
CURRENT DATE - 30 DAYS AND CURRENT DATE - 1 DAY
)
ORDER BY RATIO_TO_AVG DESC;
This query flags any program that ran 50% longer than its 30-day average. Simple, effective, and free.
The Monitoring Stack
A well-instrumented DB2 COBOL environment has layers:
- Always on: Accounting trace (IFCID 3), Statistics trace (IFCID 1), EXPLAIN tables populated for all packages
- Daily review: OMEGAMON/BMC top SQL report (or custom equivalent), buffer pool hit ratios, lock contention summary
- On-demand: Performance trace for specific problems, DISPLAY commands for real-time status
- Periodic: RUNSTATS scheduling review, EXPLAIN comparison (before/after REBIND), capacity trending
The commercial tools are better — they have real-time dashboards, drill-down capability, and automated alerting. But a custom solution built on DB2 accounting data and catalog queries is vastly better than no monitoring at all. If budget is a constraint, build the custom solution first and advocate for commercial tools with the data you collect.
11.6 Performance Testing Strategy
Performance testing for COBOL-DB2 applications is not optional. It is how you find problems before production finds them for you. Here is a testing strategy that works.
Test Environment Requirements
Your performance test environment must have:
- Representative data volumes. Testing a query against 1,000 rows when production has 100 million is useless. You need at least 70% of production volume, ideally 100%. At CNB, the performance test DB2 subsystem has full production data volume (scrubbed of PII).
- Representative data distribution. Uniform synthetic data hides skew problems. Copy production distributions or generate synthetic data that matches production histograms.
- Current statistics. Run RUNSTATS on the test environment with the same options used in production.
- Comparable hardware. You cannot extrapolate from a z14 test LPAR to a z16 production LPAR by applying a CPU factor. Too many variables change — cache sizes, I/O paths, memory. If you must test on different hardware, validate the extrapolation with at least three known workloads.
Test Types
Unit Performance Test (Developer)
For every SQL statement, the developer should:
- Run EXPLAIN and verify the access path is reasonable
- Check MATCHCOLS values — are they as expected?
- Check COST_CATEGORY — is the optimizer confident?
- Run the statement with representative bind values and measure actual elapsed/CPU
- For cursor loops, verify FETCH count matches expectations
This takes 15 minutes per SQL statement. It catches 60% of performance problems before they leave the developer's desk.
Integration Performance Test (Team)
Run the complete program (or job stream) with production-volume data:
- Capture accounting trace data
- Compare GETPAGE, CPU, elapsed to budgets established during design
- Verify commit frequency under load
- Run concurrent programs to test lock contention
- Test with peak-hour data volumes (not average)
Load Test (System)
Simulate production workload patterns:
- Run batch jobs in production sequence with production timing
- Simulate online transaction arrival rates
- Introduce the new program/change and measure impact on existing workload
- Test failure scenarios — what happens to performance when a thread abends and rolls back?
Performance Budgets
Every COBOL-DB2 program should have a performance budget established during design:
PROGRAM: CNBGL450 (GL Daily Close)
BUDGET:
Elapsed: < 45 minutes
CPU: < 180 seconds
GETPAGE: < 50,000,000
SQL Calls: < 10,000,000
Commits: every 500 rows
Lock Wait: < 2% of elapsed
Buffer Hit: > 97%
If the program exceeds any budget during testing, it goes back for optimization before production deployment. This is not negotiable at CNB — Lisa Martinez rejected a program that met all budgets except commit frequency (it was committing every 5,000 rows when the budget said 500).
Performance Regression Testing
When you change a COBOL-DB2 program, you need to verify that performance did not degrade. The minimum regression test:
- Run EXPLAIN on all SQL statements — compare to previous EXPLAIN output
- Run the program with the same input as the baseline run
- Compare accounting data: elapsed, CPU, GETPAGE, SQL counts
- Any metric that degrades by more than 10% requires investigation
- Any new tablespace scan requires investigation regardless of percentage
The Dangers of Testing with Insufficient Data
I want to be explicit about a failure mode I have seen repeatedly. A developer tests a query against a table with 10,000 rows. The query runs in 0.01 seconds. They deploy to production where the table has 100 million rows. The query runs in 47 minutes.
Why? Because with 10,000 rows, the entire table fits in the buffer pool. Every access is a memory access. With 100 million rows, the buffer pool holds maybe 2% of the table. Almost every access is a physical I/O. The difference is three to four orders of magnitude.
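The arithmetic behind "three to four orders of magnitude" is easy to model. A rough sketch (the 1-microsecond memory and 2-millisecond disk times are illustrative assumptions; real numbers vary widely by hardware):

```python
# Rough model of average page-access cost versus buffer pool hit ratio.
# The memory and disk timings are illustrative assumptions.

def avg_access_ms(hit_ratio, mem_ms=0.001, disk_ms=2.0):
    return hit_ratio * mem_ms + (1.0 - hit_ratio) * disk_ms

fully_cached = avg_access_ms(1.00)   # small test table fits in the pool
mostly_disk = avg_access_ms(0.02)    # pool holds ~2% of the big table

# mostly_disk / fully_cached is roughly 2,000x -- several orders of
# magnitude from the hit ratio alone, before any access-path change.
```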
At Federal Benefits, Sandra mandates that performance test tables contain at least 80% of production volume. She learned this the hard way: a program that passed all tests with 5 million rows caused a 4-hour production outage when it encountered 75 million rows.
The corollary: you must also test with representative data distribution. If production has 80% of accounts in three states (which Federal Benefits does — California, Texas, and Florida), your test data must reflect that. Uniform synthetic data hides every skew problem that exists.
Automating Performance Regression Testing
Automated comparison is essential. Manual comparison of EXPLAIN output across 200 SQL statements is error-prone. At Pinnacle Health, they wrote a COBOL program (yes, COBOL) that compares two sets of PLAN_TABLE rows and flags differences:
REPORT: EXPLAIN COMPARISON - PHCLM020
BEFORE: REBIND 2025-02-15 AFTER: REBIND 2025-03-14
QUERYNO COLUMN BEFORE AFTER STATUS
------- ---------- ----------- ----------- -------
47 ACCESSTYPE I R *** DEGRADED ***
47 MATCHCOLS 3 0 *** DEGRADED ***
47 PREFETCH S S OK
112 METHOD 1 2 CHANGED
112 SORTN_JOIN N Y *** ADDED SORT ***
188 ACCESSNAME XCLM_STATUS XCLM_DATE CHANGED
188 MATCHCOLS 2 1 *** DEGRADED ***
Three queries degraded. Query 47 went from index access to tablespace scan. Query 112 added a sort. Query 188 is using a different index with fewer matching columns. Each gets investigated before the REBIND proceeds to production.
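The comparison logic itself is simple enough to sketch in a few lines, here in Python for illustration. Column names follow PLAN_TABLE conventions; the specific degradation rules are assumptions mirroring the report above, and each shop tunes its own:

```python
# Minimal sketch of a before/after PLAN_TABLE comparison. Column names
# follow PLAN_TABLE conventions; the degradation rules are illustrative.

def compare_explain(before, after):
    """before/after: dicts keyed by QUERYNO -> dict of PLAN_TABLE columns."""
    findings = []
    for qno in sorted(before):
        b, a = before[qno], after.get(qno)
        if a is None:
            continue
        if b["ACCESSTYPE"] == "I" and a["ACCESSTYPE"] == "R":
            findings.append((qno, "DEGRADED: index access became a scan"))
        if a["MATCHCOLS"] < b["MATCHCOLS"]:
            findings.append((qno, "DEGRADED: fewer matching columns"))
        if b.get("SORTN_JOIN") == "N" and a.get("SORTN_JOIN") == "Y":
            findings.append((qno, "ADDED SORT: new join sort"))
    return findings
```

Feed it the PLAN_TABLE rows from the baseline REBIND and the candidate REBIND, and the output is the same kind of exception list the Pinnacle Health report produces.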
11.7 Continuous Performance Management
Performance is not a project. It is a practice. Here is how mature shops manage DB2 performance continuously.
Daily Operations
Morning Review (15 minutes):
1. Check overnight batch completion times — did any jobs miss their windows?
2. Review OMEGAMON/BMC top SQL — any new entries?
3. Check buffer pool hit ratios — any drop below threshold?
4. Review lock contention summary — any unexpected waits?
Response Protocol:
- Batch job 10% over normal: note and watch
- Batch job 25% over normal: investigate
- Batch job 50% over normal: escalate
- Online transaction SLA miss: investigate immediately
- Deadlock: investigate immediately
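The batch escalation thresholds are mechanical enough to automate against the history table from section 11.5. A sketch (the action labels are illustrative):

```python
# The batch response protocol as a triage function. Thresholds come
# from the rules above; the action labels are illustrative.

def triage_batch(elapsed_sec, baseline_sec):
    overrun = (elapsed_sec - baseline_sec) / baseline_sec
    if overrun >= 0.50:
        return "escalate"
    if overrun >= 0.25:
        return "investigate"
    if overrun >= 0.10:
        return "note and watch"
    return "normal"
```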
Weekly Operations
RUNSTATS Review:
- Verify RUNSTATS ran on schedule for all high-activity tablespaces
- Check for tablespaces where row count changed more than 20% since last RUNSTATS
- Verify distribution statistics are current for tables with known skew
Capacity Trending:
- Review tablespace growth rates
- Project when tablespaces will need additional extents or reorganization
- Review buffer pool pressure — are pools sized appropriately?
Monthly Operations
REBIND Analysis:
- Review which packages have not been rebound since the last RUNSTATS
- Perform REBIND in test environment and compare EXPLAIN output
- Schedule production REBINDs with rollback plans
Baseline Update:
- Update performance baselines for programs that changed
- Archive historical performance data
- Produce month-over-month trend reports
Quarterly Operations
Access Path Review:
- EXPLAIN all high-impact SQL statements
- Compare to previous quarter's access paths
- Investigate any changes — were they intentional?
Index Review:
- Identify unused indexes (indexes that consume space and INSERT/UPDATE CPU but are never chosen by the optimizer)
- Identify missing indexes (tablespace scans on large tables that could benefit from an index)
- Review index key compression opportunities
At CNB, Kwame Asante conducts a quarterly "DB2 health check" that covers all of the above. He produces a 15-page report that goes to the CTO. The report consistently prevents small problems from becoming large ones. Lisa estimates it saves 200+ hours of emergency debugging per year.
The Role of REORG in Performance Management
A topic that bridges Chapter 9 (utilities) and performance management: tablespace and index reorganization. Over time, as rows are inserted, updated, and deleted, the physical organization of data degrades:
- Data pages develop free space gaps where deleted rows used to be
- Index leaf pages split as new key values are inserted between existing ones
- Clustering order deteriorates — new rows that should be physically near related rows end up on distant pages
The performance impact is measurable. A table with 90% clustering ratio (the percentage of rows in clustering key order) has approximately 10% more random I/O than a table with 100% clustering. For a batch program reading 10 million rows, that 10% translates to hundreds of thousands of additional physical I/Os.
At CNB, Kwame monitors clustering ratio through the catalog:
SELECT I.NAME, I.TBNAME, I.CLUSTERRATIO,
CASE WHEN I.CLUSTERRATIO < 80 THEN '*** REORG NEEDED ***'
WHEN I.CLUSTERRATIO < 90 THEN '* REORG RECOMMENDED *'
ELSE 'OK'
END AS STATUS
FROM SYSIBM.SYSINDEXES I
WHERE I.CLUSTERING = 'Y' -- Clustering indexes only
AND I.CLUSTERRATIO <> -1 -- RUNSTATS has populated the value
AND I.CLUSTERRATIO < 90
AND I.DBNAME = 'CNBGLDB'
ORDER BY I.CLUSTERRATIO ASC;
The rule at CNB: any clustering index below 80% triggers a REORG request. Below 90% goes on the "next maintenance window" list. The REORG is followed by RUNSTATS, which is followed by REBIND with EXPLAIN comparison. This three-step sequence is non-negotiable — REORG without RUNSTATS leaves the optimizer working with stale statistics, and RUNSTATS without REBIND means the package does not benefit from updated statistics until the next REBIND.
Performance Governance
Establish and enforce performance standards:
- Design standards: Maximum acceptable CPU per transaction type, maximum GETPAGE per SQL statement, required commit frequency
- Code review standards: EXPLAIN must be attached to code review for any DB2 SQL change
- Deployment standards: Performance regression test must pass before production migration
- Monitoring standards: Accounting trace must be active. Period. No exceptions. The overhead (typically 1-3%) is the cheapest insurance you will ever buy.
The HA Banking System Performance Framework
For our progressive project, the HA Banking Transaction Processing System needs:
- Per-transaction performance budgets:
  - Account inquiry: < 0.5 seconds elapsed, < 50ms CPU
  - Balance transfer: < 1.0 seconds elapsed, < 100ms CPU
  - Statement generation: < 30 minutes per 100,000 accounts
  - End-of-day settlement: < 2 hours elapsed
- Monitoring integration:
  - OMEGAMON alerts for any transaction exceeding 3x budget
  - Accounting data captured and archived daily
  - Weekly trending reports automated
- EXPLAIN baseline:
  - All SQL statements explained and baselined before deployment
  - Automated comparison on every REBIND
  - Quarterly full review
- Restart/recovery performance:
  - Commit every 200 transactions in batch
  - Checkpoint/restart logic in all batch programs
  - Maximum rollback time: 30 seconds
We will implement this framework in the project checkpoint at the end of this chapter.
Chapter Summary
DB2 performance diagnosis is a skill, not a talent. It is learnable, repeatable, and systematic. The key principles:
- Measure before you tune. Accounting traces tell you where time is spent. EXPLAIN tells you how DB2 accesses data. Start with these, always.
- The five common problems — tablespace scans, lock contention, thread reuse, commit frequency, and data skew — account for 90% of DB2 performance issues in COBOL applications.
- DSN_STATEMNT_TABLE is your best friend. The COST_CATEGORY column alone has caught more problems than any other single diagnostic.
- Commit frequency is the single most impactful tuning lever for COBOL batch programs. Most programs commit too infrequently.
- Performance is continuous. Daily monitoring, weekly reviews, monthly analysis, quarterly deep dives. This is not optional — it is the difference between a stable system and a system that fails in creative ways at 2 AM.
The tools and techniques in this chapter are not theoretical. They are what practitioners use every day at shops like CNB, Pinnacle Health, Federal Benefits, and SecureFirst. Learn them, practice them, and you will be the person everyone calls when "DB2 is slow."
A Final War Story
I will close with a story that ties together everything in this chapter. Five years ago, a large insurance company — not one of our anchor examples — called me in because their monthly billing cycle was exceeding the batch window. It had been degrading for 18 months. Multiple DBAs had attempted fixes: adding indexes (made INSERT slower), increasing buffer pools (marginal improvement), rescheduling jobs (moved the problem to a different window).
I spent the first day doing nothing but reading accounting data. Not fixing. Reading. Here is what I found:
The billing program was executing 47 million SQL statements per run. The FETCH count was 312 million. That is 6.6 fetches per cursor open on average, but the distribution was bimodal — most cursors fetched 1-2 rows, and a handful of cursors fetched 50,000-100,000 rows.
The high-fetch cursors were billing detail cursors for large group accounts. The program opened a cursor, fetched all detail rows for an account, performed calculations in COBOL, and closed the cursor. For group accounts with 100,000 members, this meant fetching 100,000 rows one at a time through a cursor loop.
The fix was not a DB2 fix. It was an application redesign. The calculations the COBOL program was performing row-by-row could be expressed in SQL. A single SELECT with SUM, COUNT, and GROUP BY replaced a 100,000-iteration cursor loop. The SQL executed in the data manager — close to the data, no row-by-row transfer to the application.
The result: SQL statement count dropped from 47 million to 2.3 million. Elapsed time dropped from 9 hours to 47 minutes. CPU dropped by 60%. Buffer pool hit ratio went from 92% to 99.4% because the program was no longer dragging 312 million rows across the cursor interface.
Nobody had looked at the SQL activity ratios in the accounting data. They had all jumped to EXPLAIN, which showed perfectly reasonable access paths. The access paths were fine. The problem was that the application was asking DB2 to do 47 million perfectly reasonable things when 2.3 million would suffice.
Measure first. Always measure first.
Next chapter, we move to DB2 application design patterns — how to structure your COBOL-DB2 programs to avoid the performance problems we diagnosed in this chapter. Prevention is always cheaper than cure.