Chapter 22: Key Takeaways

DataField.Dev

Chapter 22: Key Takeaways

The Big Picture

The DB2 optimizer is a cost-based query optimizer. It evaluates multiple candidate execution plans, estimates the cost of each using catalog statistics, and selects the plan with the lowest estimated cost. Understanding how it works is the single most valuable skill a DBA can develop.

Ten Things to Remember

1. Cost-Based, Not Rule-Based

DB2 has used cost-based optimization since 1983. There is no fixed rule like "always use an index." Every decision is based on estimated cost, which depends on statistics about the data.

2. Cost = I/O + CPU (Weighted)

The optimizer's cost metric blends I/O cost (sequential and random, weighted differently) and CPU cost. Random I/O is far more expensive than sequential I/O in the cost model — typically 10-50x more.

3. Filter Factors Are the Foundation

A filter factor is the fraction of rows matching a predicate (0 to 1). The default estimate for equality is 1/COLCARD. Accurate filter factors lead to accurate cardinality estimates, which lead to correct plan choices.

4. The Uniform Distribution Assumption Breaks Down

Without detailed statistics, the optimizer assumes values are evenly distributed. Real data is almost always skewed. Collect frequency values for skewed columns and histograms for range-queried columns.

5. Cardinality Errors Cascade

If the optimizer underestimates the output of step 1 by 5x, and step 2 inherits that error, the final estimate can be off by 50x or more. This cascading effect is the primary cause of catastrophically bad plans.

6. The Optimizer Optimizes Against a Model, Not Reality

This is the threshold concept. The optimizer reads catalog statistics at optimization time. It does not sample your data or count actual rows. When the model (statistics) diverges from reality (actual data), plans diverge from optimal.

7. Access Path Selection Is a Cost Comparison

Table scan vs. index scan, nested loop vs. hash join — every choice comes from comparing estimated costs. The optimizer picks the cheapest candidate. Understanding what makes one path cheaper helps you predict and influence optimizer behavior.

8. Stage 1 vs. Stage 2 Matters (z/OS)

On z/OS, predicates evaluated at Stage 1 (by the Data Manager) are far more efficient than those evaluated at Stage 2 (after row materialization). Avoid functions on columns in WHERE clauses: write WHERE TXN_DATE >= '2024-01-01' instead of WHERE YEAR(TXN_DATE) = 2024.

9. Stale Statistics Are the Number One Enemy

Run RUNSTATS after significant data changes, before REBIND, and on a regular schedule. Stale statistics are the most common cause of suboptimal access paths in production.

10. Access Paths Change for Identifiable Reasons

The top five: RUNSTATS, REBIND, DB2 version migration, data volume growth, and configuration changes. When a plan changes, investigate systematically.

Decision Framework: When to Act

Observation	Action
Query suddenly slower	Check if RUNSTATS or REBIND occurred; compare EXPLAIN output with previous plan
Optimizer choosing table scan despite good index	Check CLUSTERRATIO, check if statistics are current, check filter factor calculation
Optimizer choosing wrong join method	Check cardinality estimates at each step; collect frequency values and column group statistics
Stage 2 predicates in EXPLAIN output (z/OS)	Rewrite predicates to avoid functions on columns
HAVING clause with unpredictable cardinality	Consider rewriting with LATERAL or CTE; collect detailed statistics

Statistics Priority Checklist

Collect these in order of importance:

Table cardinality and page count — via basic RUNSTATS (always collect this)
COLCARD for all indexed columns — essential for filter factor calculation
Frequency values for skewed columns — STATUS, TYPE, STATE, and similar categorical columns
HIGH2KEY/LOW2KEY for range-queried columns — DATE, AMOUNT, BALANCE columns
Histograms for heavily skewed numeric columns — financial amounts, balances
Index statistics (NLEVELS, NLEAF, CLUSTERRATIO, FULLKEYCARD) — critical for index access cost
Column group statistics for correlated columns — (STATE, CITY), (STATUS, ACCOUNT_TYPE)

Connecting to Other Chapters

Chapter	Connection
Chapter 6 (Joins)	SQL join syntax is logical; the optimizer chooses the physical join method
Chapter 15 (Indexes)	Indexes are only useful if the optimizer knows about them and estimates they reduce cost
Chapter 17 (RUNSTATS)	RUNSTATS feeds the optimizer's statistical model — it is not busywork
Chapter 23 (EXPLAIN)	EXPLAIN makes the optimizer's invisible decisions visible — the diagnostic tool
Chapter 24 (SQL Tuning)	SQL structure affects which plans the optimizer can consider