Chapter 22: Key Takeaways
The Big Picture
The DB2 optimizer is a cost-based query optimizer. It evaluates multiple candidate execution plans, estimates the cost of each using catalog statistics, and selects the plan with the lowest estimated cost. Understanding how it works is the single most valuable skill a DBA can develop.
Ten Things to Remember
1. Cost-Based, Not Rule-Based
DB2 has used cost-based optimization since 1983. There is no fixed rule like "always use an index." Every decision is based on estimated cost, which depends on statistics about the data.
2. Cost = I/O + CPU (Weighted)
The optimizer's cost metric blends I/O cost (sequential and random, weighted differently) and CPU cost. Random I/O is far more expensive than sequential I/O in the cost model — typically 10-50x more.
3. Filter Factors Are the Foundation
A filter factor is the fraction of rows matching a predicate (0 to 1). The default estimate for equality is 1/COLCARD. Accurate filter factors lead to accurate cardinality estimates, which lead to correct plan choices.
4. The Uniform Distribution Assumption Breaks Down
Without detailed statistics, the optimizer assumes values are evenly distributed. Real data is almost always skewed. Collect frequency values for skewed columns and histograms for range-queried columns.
5. Cardinality Errors Cascade
If the optimizer underestimates the output of step 1 by 5x, and step 2 inherits that error, the final estimate can be off by 50x or more. This cascading effect is the primary cause of catastrophically bad plans.
6. The Optimizer Optimizes Against a Model, Not Reality
This is the threshold concept. The optimizer reads catalog statistics at optimization time. It does not sample your data or count actual rows. When the model (statistics) diverges from reality (actual data), plans diverge from optimal.
7. Access Path Selection Is a Cost Comparison
Table scan vs. index scan, nested loop vs. hash join — every choice comes from comparing estimated costs. The optimizer picks the cheapest candidate. Understanding what makes one path cheaper helps you predict and influence optimizer behavior.
8. Stage 1 vs. Stage 2 Matters (z/OS)
On z/OS, predicates evaluated at Stage 1 (by the Data Manager) are far more efficient than those evaluated at Stage 2 (after row materialization). Avoid functions on columns in WHERE clauses: write WHERE TXN_DATE >= '2024-01-01' instead of WHERE YEAR(TXN_DATE) = 2024.
9. Stale Statistics Are the Number One Enemy
Run RUNSTATS after significant data changes, before REBIND, and on a regular schedule. Stale statistics are the most common cause of suboptimal access paths in production.
10. Access Paths Change for Identifiable Reasons
The top five: RUNSTATS, REBIND, DB2 version migration, data volume growth, and configuration changes. When a plan changes, investigate systematically.
Decision Framework: When to Act
| Observation | Action |
|---|---|
| Query suddenly slower | Check if RUNSTATS or REBIND occurred; compare EXPLAIN output with previous plan |
| Optimizer choosing table scan despite good index | Check CLUSTERRATIO, check if statistics are current, check filter factor calculation |
| Optimizer choosing wrong join method | Check cardinality estimates at each step; collect frequency values and column group statistics |
| Stage 2 predicates in EXPLAIN output (z/OS) | Rewrite predicates to avoid functions on columns |
| HAVING clause with unpredictable cardinality | Consider rewriting with LATERAL or CTE; collect detailed statistics |
Statistics Priority Checklist
Collect these in order of importance:
- Table cardinality and page count — via basic RUNSTATS (always collect this)
- COLCARD for all indexed columns — essential for filter factor calculation
- Frequency values for skewed columns — STATUS, TYPE, STATE, and similar categorical columns
- HIGH2KEY/LOW2KEY for range-queried columns — DATE, AMOUNT, BALANCE columns
- Histograms for heavily skewed numeric columns — financial amounts, balances
- Index statistics (NLEVELS, NLEAF, CLUSTERRATIO, FULLKEYCARD) — critical for index access cost
- Column group statistics for correlated columns — (STATE, CITY), (STATUS, ACCOUNT_TYPE)
Connecting to Other Chapters
| Chapter | Connection |
|---|---|
| Chapter 6 (Joins) | SQL join syntax is logical; the optimizer chooses the physical join method |
| Chapter 15 (Indexes) | Indexes are only useful if the optimizer knows about them and estimates they reduce cost |
| Chapter 17 (RUNSTATS) | RUNSTATS feeds the optimizer's statistical model — it is not busywork |
| Chapter 23 (EXPLAIN) | EXPLAIN makes the optimizer's invisible decisions visible — the diagnostic tool |
| Chapter 24 (SQL Tuning) | SQL structure affects which plans the optimizer can consider |