Chapter 22: Exercises — The DB2 Optimizer
These exercises test your understanding of cost-based optimization, filter factor estimation, access path selection, and optimizer behavior. Work through them carefully — the ability to reason about optimizer decisions is one of the most valuable skills a DBA can possess.
Section A: Filter Factor Calculations (Exercises 1-10)
Exercise 1: Basic Equality Filter Factor
The CUSTOMER table has 1,000,000 rows. The STATE column has COLCARD = 50 (50 distinct states). No frequency values are collected.
a) What is the optimizer's estimated filter factor for WHERE STATE = 'TX'?
b) How many rows does the optimizer estimate will qualify?
c) If Texas actually contains 12% of the customers, what is the actual filter factor?
d) By what factor does the optimizer's estimate differ from reality?
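The uniform-distribution arithmetic behind parts (a) and (b) can be sketched in a few lines. The numbers below are illustrative, not the Exercise 1 figures — plug in your own values to check your work:

```python
# Uniform-distribution equality estimate: FF = 1 / COLCARD.
# Hypothetical table: 500,000 rows, column with 25 distinct values.

def equality_ff(colcard: int) -> float:
    """Filter factor for COL = value when no frequency values exist."""
    return 1.0 / colcard

rows = 500_000       # table cardinality (hypothetical)
colcard = 25         # distinct values in the column (hypothetical)
ff = equality_ff(colcard)
print(ff, rows * ff)   # 0.04 20000.0
```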
Exercise 2: Range Predicate Filter Factor
The TRANSACTION table has 80,000,000 rows. The AMOUNT column has:
- LOW2KEY = 0.01
- HIGH2KEY = 50,000.00

a) Calculate the filter factor for WHERE AMOUNT > 10000.
b) Calculate the filter factor for WHERE AMOUNT BETWEEN 500 AND 2000.
c) Calculate the estimated number of rows for each predicate.
d) If amounts follow a heavily right-skewed distribution (most transactions are small), would the optimizer overestimate or underestimate the number of rows for AMOUNT > 10000?
Exercise 3: Compound Predicate Filter Factors
The ACCOUNT table has 2,000,000 rows with:
- ACCOUNT_TYPE COLCARD = 4
- STATUS COLCARD = 5
- BRANCH_ID COLCARD = 50
No frequency values or column group statistics are collected. Calculate the optimizer's estimated filter factor and row count for each query:
a) WHERE ACCOUNT_TYPE = 'SAVINGS' AND STATUS = 'ACTIVE'
b) WHERE ACCOUNT_TYPE = 'CHECKING' OR STATUS = 'FROZEN'
c) WHERE BRANCH_ID = 12 AND ACCOUNT_TYPE = 'CD' AND STATUS = 'ACTIVE'
d) Explain why the estimate in part (c) might be severely inaccurate.
Exercise 4: Frequency Values Impact
The ACCOUNT table has 2,000,000 rows. The STATUS column has COLCARD = 5. After RUNSTATS with frequency values, the catalog contains:
| COLVALUE | FREQUENCY |
|---|---|
| ACTIVE | 1,900,000 |
| CLOSED | 60,000 |
| DORMANT | 30,000 |
| FROZEN | 8,000 |
| PENDING | 2,000 |
a) What filter factor does the optimizer use for STATUS = 'ACTIVE' with frequency values?
b) What filter factor would it use without frequency values (uniform distribution)?
c) What is the impact on the estimated row count for STATUS = 'FROZEN'?
d) For which STATUS value does the uniform distribution assumption cause the largest absolute estimation error?
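The mechanics behind this comparison can be sketched as follows. The table size and frequencies below are illustrative, not the Exercise 4 catalog values:

```python
# Frequency values replace the uniform 1/COLCARD guess with the
# observed fraction FREQUENCY(v) / table rows. Hypothetical numbers.

rows = 1_000_000
colcard = 4
freqs = {"A": 700_000, "B": 200_000, "C": 90_000, "D": 10_000}

def ff(value, freqs=None):
    """Filter factor for COL = value, with or without frequency values."""
    if freqs and value in freqs:
        return freqs[value] / rows
    return 1.0 / colcard        # fall back to the uniform assumption

for v in freqs:
    uniform_est = rows * ff(v)        # uniform-distribution estimate
    freq_est = rows * ff(v, freqs)    # frequency-based estimate
    print(v, uniform_est, freq_est, abs(uniform_est - freq_est))
```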
Exercise 5: Histogram Impact on Range Predicates
The BALANCE column in the ACCOUNT table has LOW2KEY = 100.00 and HIGH2KEY = 500,000.00 with 2,000,000 rows. Without histograms, the optimizer assumes uniform distribution.
A histogram shows the following distribution:
| Bucket Range | Rows |
|---|---|
| 100 - 5,000 | 800,000 |
| 5,000 - 20,000 | 600,000 |
| 20,000 - 50,000 | 350,000 |
| 50,000 - 100,000 | 150,000 |
| 100,000 - 500,000 | 100,000 |
a) Calculate the optimizer's estimate for BALANCE > 100000 without histograms.
b) Calculate the estimate for BALANCE > 100000 using the histogram.
c) What is the ratio of the two estimates? Why does this matter for access path selection?
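A histogram estimate like the one part (b) asks for sums the rows in buckets entirely above the search value, then interpolates linearly within the bucket that straddles it. A minimal sketch with made-up buckets (not the Exercise 5 distribution):

```python
# Histogram-based estimate for COL > value. Buckets are (low, high, rows),
# with high treated as the exclusive upper bound. Hypothetical data.

buckets = [
    (0, 1_000, 500_000),
    (1_000, 10_000, 300_000),
    (10_000, 100_000, 200_000),
]

def est_gt(value):
    total = 0.0
    for low, high, rows in buckets:
        if value <= low:
            total += rows                                   # whole bucket qualifies
        elif value < high:
            total += rows * (high - value) / (high - low)   # straddling bucket
    return total

print(est_gt(10_000))   # 200000.0 -- only the top bucket qualifies
print(est_gt(55_000))   # 100000.0 -- half of the top bucket, by interpolation
```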
Exercise 6: Multi-Column Correlation
The EMPLOYEE table has 5,000 rows. The BRANCH_ID column has COLCARD = 50 and the POSITION column has COLCARD = 8.
a) Under the independence assumption, what is the estimated filter factor for WHERE BRANCH_ID = 7 AND POSITION = 'TELLER'?
b) If Branch 7 is a small branch with only 3 tellers and 12 employees total, what is the actual filter factor?
c) How many rows does the optimizer estimate vs. reality?
d) What type of statistics would correct this estimation error?
Exercise 7: NOT and Negation
Using the ACCOUNT table (2,000,000 rows, STATUS COLCARD = 5):
a) What is the optimizer's filter factor for WHERE STATUS <> 'FROZEN'?
b) If frequency values show FROZEN has a frequency of 8,000, what is the filter factor with frequency values?
c) Why might the optimizer's plan for STATUS <> 'FROZEN' differ from its plan for STATUS = 'FROZEN', even though both use the same column?
Exercise 8: IN-List Filter Factors
The ACCOUNT table has 2,000,000 rows. ACCOUNT_TYPE has COLCARD = 4 with no frequency values.
a) What is the filter factor for WHERE ACCOUNT_TYPE IN ('CHECKING', 'SAVINGS')?
b) What is the estimated number of qualifying rows?
c) How does the optimizer compute this (hint: think of IN as multiple ORs)?
Exercise 9: LIKE Predicate Filter Factors
The CUSTOMER table has 1,000,000 rows. The LAST_NAME column has COLCARD = 85,000.
a) What is the optimizer's estimated filter factor for WHERE LAST_NAME = 'SMITH'?
b) What default filter factor does DB2 typically use for WHERE LAST_NAME LIKE 'SM%' when no detailed statistics are available? (Hint: DB2 uses a platform-specific default, often around 10% for leading-constant LIKE.)
c) Why is the LIKE estimate much less precise than the equality estimate?
Exercise 10: Join Cardinality Estimation
CUSTOMER has 1,000,000 rows. ACCOUNT has 2,000,000 rows. ACCOUNT.CUSTOMER_ID has COLCARD = 850,000.
a) Estimate the cardinality of CUSTOMER INNER JOIN ACCOUNT ON CUSTOMER.CUSTOMER_ID = ACCOUNT.CUSTOMER_ID (no WHERE clause).
b) If a WHERE clause on CUSTOMER reduces it to an estimated 10,000 rows, what is the estimated join cardinality?
c) Why might the join cardinality estimate in (b) be wrong if many of those 10,000 customers have no accounts?
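The estimate part (a) asks for follows the standard join-cardinality formula from the answer key. A sketch with illustrative cardinalities (not the Exercise 10 values):

```python
# Join cardinality ~= (outer rows * inner rows) / max(COLCARD of the
# join column on each side). Hypothetical inputs.

def join_card(outer_rows, inner_rows, colcard_outer, colcard_inner):
    return outer_rows * inner_rows / max(colcard_outer, colcard_inner)

# 10,000 filtered outer rows joined to 40,000 inner rows:
print(join_card(10_000, 40_000, 10_000, 8_000))  # 40000.0
```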
Section B: Access Path Prediction (Exercises 11-18)
Exercise 11: Table Scan vs. Index Scan
The TRANSACTION table has 80,000,000 rows stored in 3,200,000 pages. An index IX_TXN_DATE exists on TXN_DATE with NLEVELS = 4 and NLEAF = 250,000. The CLUSTERRATIO is 5 (very poorly clustered).
For the query SELECT * FROM TRANSACTION WHERE TXN_DATE = '2024-06-15', which returns approximately 44,000 rows:
a) Estimate the cost of a table scan (in page reads).
b) Estimate the cost of an index scan followed by data page reads, assuming each qualifying row requires a random page read (poor clustering).
c) Which access path would the optimizer likely choose? Why?
d) How would the answer change if CLUSTERRATIO were 98?
Exercise 12: Index-Only Access
Given this query and index:
-- Index: IX_ACCT_STATUS_BAL on ACCOUNT(STATUS, BALANCE)
SELECT STATUS, COUNT(*), AVG(BALANCE)
FROM ACCOUNT
GROUP BY STATUS;
a) Can the optimizer use index-only access? Why or why not?
b) What is the advantage of index-only access for this query?
c) If you added the column ACCOUNT_TYPE to the SELECT list, could the optimizer still use index-only access? What change would be needed?
Exercise 13: Composite Index Matching
An index IX_TXN_ACCT_DATE_TYPE exists on TRANSACTION(ACCOUNT_ID, TXN_DATE, TXN_TYPE). Predict the number of matching columns (MATCHCOLS) for each query:
a) WHERE ACCOUNT_ID = 5001
b) WHERE ACCOUNT_ID = 5001 AND TXN_DATE = '2024-06-15'
c) WHERE ACCOUNT_ID = 5001 AND TXN_TYPE = 'DEPOSIT'
d) WHERE TXN_DATE = '2024-06-15'
e) WHERE ACCOUNT_ID = 5001 AND TXN_DATE BETWEEN '2024-01-01' AND '2024-12-31' AND TXN_TYPE = 'DEPOSIT'
f) Explain why the answer to (c) is different from (b), even though both have two predicates.
Exercise 14: Join Method Prediction
Predict the optimizer's likely join method for each scenario:
a) CUSTOMER (1,000,000 rows, filtered to 1 row by WHERE CUSTOMER_ID = 1001) joined to ACCOUNT (2,000,000 rows, index on CUSTOMER_ID).
b) TRANSACTION (80,000,000 rows, filtered to 1,300,000 by date range) joined to ACCOUNT (2,000,000 rows, no useful index on ACCOUNT_ID for the join direction).
c) BRANCH (50 rows) joined to ACCOUNT (2,000,000 rows, index on BRANCH_ID).
d) Two large temporary result sets (500,000 rows each) joined on a column with no index.
Exercise 15: List Prefetch
The ACCOUNT table has 2,000,000 rows in 80,000 pages. The index IX_ACCT_BRANCH on BRANCH_ID has CLUSTERRATIO = 15 (poorly clustered relative to BRANCH_ID). The query is:
SELECT * FROM ACCOUNT WHERE BRANCH_ID = 7;
The filter factor yields an estimated 40,000 qualifying rows.
a) Why would a direct index scan (read index, then read each data page) be expensive?
b) How does list prefetch improve the I/O pattern?
c) Estimate the maximum number of distinct data pages that might need to be read with list prefetch (hint: consider the total pages and the fraction of rows).
d) Under what circumstances would the optimizer choose a table scan instead of list prefetch?
Exercise 16: [z/OS] Stage 1 vs. Stage 2 Classification
Classify each predicate as Stage 1 or Stage 2 on z/OS, and rewrite Stage 2 predicates to be Stage 1 where possible:
a) WHERE BALANCE > 5000
b) WHERE YEAR(TXN_DATE) = 2024
c) WHERE UPPER(LAST_NAME) = 'SMITH'
d) WHERE ACCOUNT_TYPE || '-' || STATUS = 'CHECKING-ACTIVE'
e) WHERE AMOUNT * 1.1 > 1000
f) WHERE SUBSTR(PHONE, 1, 3) = '312'
g) WHERE TXN_DATE >= CURRENT DATE - 30 DAYS
h) WHERE COALESCE(MIDDLE_NAME, 'NONE') = 'NONE'
Exercise 17: Query Rewrite Recognition
Identify which query rewrite optimization would apply to each pair. State the rewrite type and explain the benefit:
a) Original: SELECT * FROM (SELECT * FROM ACCOUNT WHERE STATUS = 'ACTIVE') sub WHERE sub.BALANCE > 50000
Optimized: ?
b) Original: SELECT * FROM CUSTOMER WHERE CUSTOMER_ID IN (SELECT CUSTOMER_ID FROM ACCOUNT WHERE BALANCE > 100000)
Optimized: ?
c) Original: Query against a view that is SELECT * FROM TRANSACTION WHERE TXN_TYPE = 'DEPOSIT', with an additional WHERE clause WHERE AMOUNT > 5000
Optimized: ?
Exercise 18: MQT Routing
An MQT exists:
CREATE TABLE DAILY_ACCOUNT_SUMMARY AS (
SELECT ACCOUNT_ID, TXN_DATE,
SUM(AMOUNT) AS DAILY_TOTAL,
COUNT(*) AS DAILY_COUNT
FROM TRANSACTION
GROUP BY ACCOUNT_ID, TXN_DATE
) DATA INITIALLY DEFERRED REFRESH DEFERRED;
For each query, state whether the optimizer can route to the MQT:
a) SELECT ACCOUNT_ID, SUM(AMOUNT) FROM TRANSACTION WHERE TXN_DATE = '2024-06-15' GROUP BY ACCOUNT_ID
b) SELECT ACCOUNT_ID, SUM(AMOUNT) FROM TRANSACTION WHERE TXN_DATE BETWEEN '2024-06-01' AND '2024-06-30' GROUP BY ACCOUNT_ID
c) SELECT ACCOUNT_ID, TXN_TYPE, SUM(AMOUNT) FROM TRANSACTION WHERE TXN_DATE = '2024-06-15' GROUP BY ACCOUNT_ID, TXN_TYPE
d) Explain your reasoning for (c).
Section C: Optimizer Behavior Analysis (Exercises 19-25)
Exercise 19: Diagnosing an Access Path Change
A critical OLTP query that retrieves transactions for an account has changed from an index scan to a table scan after last night's RUNSTATS job. The TRANSACTION table has 80,000,000 rows.
a) List three catalog statistics that might have changed to cause this.
b) What queries would you run against the catalog to investigate?
c) If CLUSTERRATIO dropped from 95 to 12, explain how this would cause the plan change.
d) What corrective action would you take?
Exercise 20: REBIND Analysis
A production batch job was REBOUND after an index was added. The job's elapsed time increased from 2 hours to 8 hours. The new index was added to the TRANSACTION table.
a) Why might adding an index make a query slower?
b) How would you determine which statement in the package changed its plan?
c) What would you check about the new index's statistics?
d) If you need to revert immediately, what z/OS command would you use (assuming PLANMGMT was EXTENDED)?
Exercise 21: OPTIMIZE FOR n ROWS
Consider this query used to display the first page of transaction history in a web application:
SELECT TXN_ID, TXN_DATE, AMOUNT, DESCRIPTION
FROM TRANSACTION
WHERE ACCOUNT_ID = :acct_id
ORDER BY TXN_DATE DESC
FETCH FIRST 20 ROWS ONLY;
a) Without OPTIMIZE FOR, what plan might the optimizer choose if there are 5,000 transactions for this account?
b) How would adding OPTIMIZE FOR 20 ROWS change the optimizer's approach?
c) What index would best support this query with OPTIMIZE FOR 20 ROWS?
d) Why is FETCH FIRST alone not sufficient to influence the optimizer?
Exercise 22: REOPT Scenarios
A prepared statement uses parameter markers:
SELECT * FROM ACCOUNT WHERE STATUS = ? AND BALANCE > ?
a) Without REOPT, how does the optimizer estimate the filter factor for the STATUS = ? predicate?
b) If the application usually queries STATUS = 'FROZEN' AND BALANCE > 500000, how might the default estimate differ from reality?
c) Explain the difference between REOPT(ALWAYS), REOPT(ONCE), and REOPT(AUTO).
d) For a high-frequency OLTP query executed 10,000 times per second, which REOPT setting is appropriate and why?
Exercise 23: Version Migration Impact
Your organization is migrating from DB2 11 to DB2 12 for z/OS. During testing, you discover that 15 out of 200 critical SQL statements have changed access plans.
a) What are three reasons the optimizer might choose different plans in the new version?
b) Describe a systematic process for evaluating whether the new plans are better or worse.
c) For a plan that regressed, what options do you have besides using OPTHINT?
d) What is the role of PLANMGMT during a version migration?
Exercise 24: Cascading Estimation Errors
A query joins CUSTOMER, ACCOUNT, and TRANSACTION with predicates on all three tables. The optimizer estimates:
- CUSTOMER after filter: 500 rows (actual: 5,000)
- CUSTOMER-ACCOUNT join: 1,200 rows (actual: 12,000)
- Final three-table join: 3,000 rows (actual: 150,000)
a) By what factor is the final estimate off?
b) If the optimizer chose nested loop join for the last join step based on the estimated 3,000 rows, what join method would likely be better for the actual 150,000 rows?
c) What is the likely root cause of the initial underestimate on CUSTOMER?
d) What specific statistics would you collect to fix this?
Exercise 25: Complete Optimization Scenario
The Meridian Bank needs a quarterly report:
SELECT b.BRANCH_NAME, c.STATE,
COUNT(DISTINCT c.CUSTOMER_ID) AS UNIQUE_CUSTOMERS,
SUM(t.AMOUNT) AS TOTAL_VOLUME,
AVG(t.AMOUNT) AS AVG_TXN_SIZE
FROM BRANCH b
JOIN ACCOUNT a ON b.BRANCH_ID = a.BRANCH_ID
JOIN CUSTOMER c ON a.CUSTOMER_ID = c.CUSTOMER_ID
JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID
WHERE t.TXN_DATE BETWEEN '2024-01-01' AND '2024-03-31'
AND a.STATUS = 'ACTIVE'
GROUP BY b.BRANCH_NAME, c.STATE
ORDER BY TOTAL_VOLUME DESC;
a) Identify the tables, their approximate cardinalities, and the join relationships.
b) Predict the optimizer's likely join order and justify your prediction.
c) Predict the join method for each join step.
d) Identify which predicates are most critical for accurate cost estimation.
e) Describe what statistics you would ensure are collected before running this query.
f) Propose an MQT that could accelerate this query.
g) On z/OS, classify each predicate as Stage 1 or Stage 2.
Section D: Advanced Challenges (Exercises 26-30)
Exercise 26: Cost Model Calculation
Given these parameters:
- SEQIOCOST = 1 (normalized unit)
- RANDIOCOST = 20 (random I/O is 20x more expensive)
- CPU_per_row = 0.01
- Table: 500,000 rows, 20,000 pages
- Index: NLEVELS = 3, NLEAF = 1,500, CLUSTERRATIO = 90
- Predicate filter factor: 0.02 (10,000 qualifying rows)
Calculate the approximate cost of:
a) A table scan
b) A matching index scan (assuming random data page reads for each qualifying row)
c) A matching index scan with list prefetch (assuming qualifying rows touch 8,000 unique pages)
d) Which access path has the lowest cost?
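The cost formulas from the answer key can be expressed as a small calculator for checking your arithmetic. The numbers in the example calls are made up, not the Exercise 26 parameters; the data-page read count is a parameter so the same function covers both the random-read and list-prefetch cases:

```python
# Simplified classroom cost model: sequential I/O, random I/O, and a
# per-row CPU charge. Real DB2 costing is far more detailed than this.

SEQIOCOST = 1.0       # normalized sequential page read
RANDIOCOST = 20.0     # random page read, 20x sequential
CPU_PER_ROW = 0.01    # CPU cost per row examined

def table_scan_cost(pages, rows):
    return pages * SEQIOCOST + rows * CPU_PER_ROW

def index_scan_cost(levels, leaf_pages, qual_rows, data_page_reads):
    # Probe the index levels (random), scan qualifying leaves
    # (sequential), then read data pages (random) and charge CPU
    # for each qualifying row.
    return (levels * RANDIOCOST + leaf_pages * SEQIOCOST
            + data_page_reads * RANDIOCOST + qual_rows * CPU_PER_ROW)

# Hypothetical example: 1,000-page table, 50,000 rows, 2,000 qualifying.
print(table_scan_cost(pages=1_000, rows=50_000))              # 1500.0
print(index_scan_cost(3, 100, 2_000, data_page_reads=2_000))  # 40180.0
```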
Exercise 27: Join Order Optimization
Three tables:
- A: 10,000 rows, predicate reduces to 100 rows. Index on join column to B.
- B: 1,000,000 rows, no predicate. Index on join column to A and to C.
- C: 500 rows, predicate reduces to 50 rows. Index on join column to B.
Evaluate these join orders and estimate which is best:
a) A → B → C (start with A, join B, then join C)
b) C → B → A
c) A → C → B (is this even possible directly?)
Exercise 28: Statistics Collection Strategy
Design a RUNSTATS strategy for the Meridian Bank TRANSACTION table, which has:
- 80,000,000 rows
- Columns: TXN_ID, ACCOUNT_ID, TXN_TYPE (5 values, skewed), AMOUNT, TXN_DATE, DESCRIPTION, RELATED_ACCOUNT_ID
- Indexes: IX_TXN_PK (TXN_ID), IX_TXN_ACCT (ACCOUNT_ID), IX_TXN_DATE (TXN_DATE), IX_TXN_ACCT_DATE (ACCOUNT_ID, TXN_DATE)
a) Write the RUNSTATS command for z/OS, collecting frequency values for skewed columns and comprehensive index statistics.
b) Write the equivalent RUNSTATS command for LUW.
c) Justify which columns need frequency values and which do not.
d) How often should RUNSTATS run on this table, given it receives approximately 200,000 new rows per day?
Exercise 29: Optimizer Hint Design (z/OS)
A query is performing poorly because the optimizer is choosing a table scan on the TRANSACTION table when an index scan on IX_TXN_ACCT_DATE would be better. You have verified that statistics are current and the index is appropriate.
a) Write the PLAN_TABLE INSERT statement to create a hint forcing the use of IX_TXN_ACCT_DATE with 2 matching columns.
b) Write the BIND command to apply the hint.
c) List three risks of using this hint in production.
d) Describe a monitoring plan to ensure the hint remains beneficial over time.
Exercise 30: End-to-End Optimization
A bank analyst reports that the following query, which used to run in 5 minutes, now takes 2 hours:
SELECT c.LAST_NAME, c.FIRST_NAME,
a.ACCOUNT_ID, a.ACCOUNT_TYPE,
SUM(t.AMOUNT) AS NET_CHANGE
FROM CUSTOMER c
JOIN ACCOUNT a ON c.CUSTOMER_ID = a.CUSTOMER_ID
JOIN TRANSACTION t ON a.ACCOUNT_ID = t.ACCOUNT_ID
WHERE c.STATE = 'IL'
AND t.TXN_DATE >= '2024-01-01'
GROUP BY c.LAST_NAME, c.FIRST_NAME, a.ACCOUNT_ID, a.ACCOUNT_TYPE
HAVING SUM(t.AMOUNT) > 50000
ORDER BY NET_CHANGE DESC;
Document a complete investigation plan:
a) What catalog queries would you run first?
b) What EXPLAIN analysis would you perform?
c) List five possible root causes in order of likelihood.
d) For each root cause, describe the corrective action.
e) How would you prevent this problem from recurring?
Answer Key Guidance
Detailed solutions are available in the instructor supplement. Key formulas to remember:
- Equality FF = 1 / COLCARD (without frequency values)
- Range FF (COL > value) = (HIGH2KEY - value) / (HIGH2KEY - LOW2KEY)
- Range FF (COL BETWEEN a AND b) = (b - a) / (HIGH2KEY - LOW2KEY)
- AND combination = FF(A) x FF(B) (independence assumed)
- OR combination = FF(A) + FF(B) - FF(A) x FF(B)
- NOT = 1 - FF(A)
- Join cardinality = (outer rows x inner rows) / MAX(COLCARD_outer_join_col, COLCARD_inner_join_col)
- Scan cost = pages x SEQIOCOST + rows x CPU_per_row
- Index cost = levels x RANDIOCOST + leaf_pages x SEQIOCOST + qualifying_rows x RANDIOCOST + qualifying_rows x CPU_per_row
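For self-checking, the formulas above can be collected into a small helper module. This is a classroom approximation of the optimizer's behavior, not the actual DB2 cost model:

```python
# Key estimation formulas from this chapter, as checkable functions.

def eq_ff(colcard):
    """Equality filter factor without frequency values."""
    return 1.0 / colcard

def range_gt_ff(value, low2key, high2key):
    """Filter factor for COL > value, assuming uniform distribution."""
    return (high2key - value) / (high2key - low2key)

def and_ff(a, b):
    """AND combination under the independence assumption."""
    return a * b

def or_ff(a, b):
    """OR combination: inclusion-exclusion of the two filter factors."""
    return a + b - a * b

def not_ff(a):
    """Negation: the complement of the filter factor."""
    return 1.0 - a

def join_card(outer, inner, cc_outer, cc_inner):
    """Join cardinality estimate using the larger join-column COLCARD."""
    return outer * inner / max(cc_outer, cc_inner)

# Sanity checks with round numbers:
assert eq_ff(4) == 0.25
assert or_ff(0.25, 0.2) == 0.25 + 0.2 - 0.25 * 0.2
assert not_ff(0.2) == 0.8
```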