Chapter 6 Quiz: DB2 Optimizer Internals

Instructions: Select the best answer for each question. Questions cover all sections of Chapter 6.


Question 1

A table has 10,000,000 rows. Column STATUS has COLCARDF = 4. No frequency statistics have been collected. What filter factor does the DB2 optimizer use for the predicate WHERE STATUS = 'A'?

A) 0.04 B) 0.25 C) 0.50 D) 1.00

Answer: B

Explanation: Without frequency statistics, DB2 uses the uniform distribution assumption: FF = 1 / COLCARDF = 1/4 = 0.25. This assumes each of the 4 distinct values accounts for 25% of rows — which may be far from reality. Collecting frequency statistics (FREQVAL) would give the optimizer the actual distribution.


Question 2

In a PLAN_TABLE row, ACCESSTYPE = 'I', MATCHCOLS = 3, and INDEXONLY = 'Y'. What does this tell you?

A) DB2 is scanning the entire index without using the B-tree structure B) DB2 is traversing the B-tree with 3 matching predicates and accessing the base table C) DB2 is traversing the B-tree with 3 matching predicates and satisfying the query entirely from the index D) DB2 is using 3 different indexes in parallel

Answer: C

Explanation: ACCESSTYPE = 'I' means index access. MATCHCOLS = 3 means three leading columns of the index have sargable predicates, allowing B-tree navigation. INDEXONLY = 'Y' means all columns needed by the query exist in the index, so no base table access is required. This is the most efficient access path for narrow queries.


Question 3

Which of the following predicates is NOT sargable (cannot be used for index matching)?

A) WHERE ACCOUNT_ID = :WS-ACCT-ID B) WHERE TRAN_DATE BETWEEN :WS-START AND :WS-END C) WHERE SUBSTR(ACCOUNT_ID, 1, 4) = 'CHK-' D) WHERE STATUS IN ('A', 'P', 'H')

Answer: C

Explanation: SUBSTR() applies a function to the column, making it non-sargable. The optimizer cannot use an index on ACCOUNT_ID to evaluate this predicate because the B-tree stores ACCOUNT_ID values, not SUBSTR(ACCOUNT_ID, 1, 4) values. The sargable equivalent would be WHERE ACCOUNT_ID LIKE 'CHK-%' (which can use the literal prefix for range evaluation) or WHERE ACCOUNT_ID >= 'CHK-' AND ACCOUNT_ID < 'CHK.'.


Question 4

After RUNSTATS, a batch job's elapsed time increases from 20 minutes to 90 minutes. EXPLAIN shows the access path changed from a matching index scan (MATCHCOLS=4, nested loop join) to a tablespace scan (merge scan join). What is the MOST LIKELY immediate cause?

A) The index was dropped B) The buffer pool size was reduced C) RUNSTATS captured statistics that changed the optimizer's cost estimates D) A DB2 APAR changed the optimizer's algorithm

Answer: C

Explanation: RUNSTATS updates the catalog statistics that the optimizer uses for cost estimation. Changed statistics (COLCARDF, CLUSTERRATIO, CARDF, etc.) can alter filter factor calculations, causing the optimizer to choose a different access path. This is exactly what happened in Lisa Tran's CNB incident — COLCARDF for TRAN_DATE increased due to temporarily loaded historical data, changing the cost calculation.


Question 5

An index is defined as (ACCOUNT_ID, TRAN_DATE, TRAN_TYPE, AMOUNT). The query predicates are:

WHERE ACCOUNT_ID = :HV1
  AND TRAN_TYPE = 'CRD'
  AND TRAN_DATE > '2025-01-01'

How many matching columns will DB2 report?

A) 1 B) 2 C) 3 D) 4

Answer: B

Explanation: Matching columns must be consecutive leading columns with sargable predicates. ACCOUNT_ID (col 1) has an equality predicate — matching. TRAN_DATE (col 2) has a range predicate — matching, but range ends the matching sequence. TRAN_TYPE (col 3) has an equality predicate, but because it follows a range predicate in the index key, it can only be a screening column, not a matching column. MATCHCOLS = 2.


Question 6

METHOD = 2 in the PLAN_TABLE indicates which join method?

A) Nested loop join B) Merge scan join C) Hash join D) Hybrid join

Answer: B

Explanation: In DB2 PLAN_TABLE, METHOD = 0 means first table accessed (no join), METHOD = 1 is nested loop join, METHOD = 2 is merge scan join, and METHOD = 4 is hash join. Merge scan join sorts both input tables on the join key (if not already ordered) and merges them.


Question 7

COST_CATEGORY = 'B' in the DSN_STATEMNT_TABLE means:

A) The optimizer used default statistics because RUNSTATS has never been run B) The optimizer found the plan to be the second-best option C) The plan was generated during a BIND with EXPLAIN(ONLY) D) The plan uses backward index scan

Answer: A

Explanation: COST_CATEGORY = 'A' means the optimizer had actual catalog statistics (from RUNSTATS) to make its decision. COST_CATEGORY = 'B' means default statistics were used — DB2 guessed. Any plan based on 'B' statistics is unreliable. The immediate action is to run RUNSTATS and REBIND.


Question 8

Which RUNSTATS option addresses the optimizer's independence assumption for correlated columns?

A) FREQVAL COUNT B) HISTOGRAM C) COLGROUP D) SHRLEVEL CHANGE

Answer: C

Explanation: COLGROUP statistics tell the optimizer the combined cardinality (or frequency distribution) of multiple columns together, overriding the independence assumption that multiplies individual filter factors. For correlated columns like (STATE, CITY), COLGROUP statistics give the optimizer the actual joint distribution.


Question 9

What does PREFETCH = 'L' in the PLAN_TABLE indicate?

A) DB2 is using list prefetch — collecting RIDs, sorting by page number, then fetching pages sequentially B) DB2 is using large sequential prefetch (64-page reads) C) DB2 is using local buffer pool prefetch D) DB2 is performing a lateral join with prefetch

Answer: A

Explanation: List prefetch (PREFETCH = 'L') is DB2's technique for converting random I/O into sequential I/O when using an index on data with moderate clustering. DB2 collects RIDs from the index, sorts them by page number, then fetches the base table pages in physical sequence. This is more expensive than sequential prefetch (PREFETCH = 'S') due to the sort step, but much cheaper than random I/O.


Question 10

You want to prevent access path changes during a DB2 maintenance APAR application. Which approach preserves the current access path?

A) REBIND PACKAGE(coll.pkg) APREUSE(WARN) B) REBIND PACKAGE(coll.pkg) EXPLAIN(ONLY) C) REBIND PACKAGE(coll.pkg) PLANMGMT(BASIC) D) REBIND PACKAGE(coll.pkg) OPTHINT('FREEZE')

Answer: A

Explanation: APREUSE(WARN) tells DB2 to reuse the existing access path during REBIND unless the old path is no longer valid (e.g., a referenced index was dropped). It issues a warning if the optimizer would have chosen a different plan. This is ideal for maintenance windows where you want to validate new plans before adopting them. PLANMGMT preserves copies but doesn't prevent changes; it enables rollback after the fact.


Question 11

CLUSTERRATIO for an index is 35%. What does this imply about index access for queries using this index?

A) 35% of queries will use this index B) Only 35% of the index pages are in the buffer pool C) 65% of the base table page fetches via this index will likely be random I/O D) The index has only 35% of its pages allocated

Answer: C

Explanation: CLUSTERRATIO measures how well the physical order of base table rows matches the index key order. A CLUSTERRATIO of 35% means only 35% of consecutive index entries point to the same or adjacent base table pages. For index access that requires base table fetches, 65% of the I/O will be random (non-sequential page access). This significantly increases I/O cost and may cause the optimizer to prefer a tablespace scan with sequential prefetch.


Question 12

A nested loop join (METHOD=1) is generally preferred over a merge scan join (METHOD=2) when:

A) Both tables are very large B) The outer table result set is small and the inner table has a good matching index C) Neither table has a useful index D) The query requires no sorting for ORDER BY

Answer: B

Explanation: Nested loop joins are efficient when the outer table produces few rows and each row can efficiently probe the inner table via an index. The cost is proportional to outer_rows * index_probe_cost. Merge scan joins are preferred when both result sets are large because the single-pass merge is cheaper than millions of index probes.


Question 13

The INCLUDE clause on a CREATE INDEX statement:

A) Includes the specified columns in both leaf and non-leaf index pages B) Includes the specified columns only in index leaf pages, enabling index-only access without increasing B-tree depth C) Forces the optimizer to always include this index in its evaluation D) Includes rows that would normally be excluded by a WHERE clause on the CREATE INDEX

Answer: B

Explanation: INCLUDE columns (available since DB2 10) are stored in index leaf pages but not in non-leaf (internal) pages. This means they don't increase the B-tree depth (NLEVELS) but do enable index-only access for queries that reference those columns. This is the optimal way to create covering indexes without bloating the B-tree navigation structure.


Question 14

You manually update SYSIBM.SYSCOLUMNS to set COLCARDF = 365 for a column. What happens on the next REBIND?

A) DB2 rejects the REBIND because manual catalog updates are not allowed B) The optimizer uses the manually set value of 365 for filter factor calculation C) DB2 automatically runs RUNSTATS to verify the value before using it D) The value is ignored because DB2 only uses RUNSTATS-collected statistics

Answer: B

Explanation: DB2 uses whatever values are in the catalog tables, regardless of how they got there. Manual catalog updates are a supported (though risky) technique for forcing specific filter factor calculations. The optimizer simply reads COLCARDF from SYSIBM.SYSCOLUMNS — it doesn't track the source of the value. This is a last-resort technique that requires careful documentation and regular review.


Question 15

Inline RUNSTATS (using the STATISTICS keyword on REORG) is preferred over separate RUNSTATS after REORG because:

A) It collects more accurate statistics than standalone RUNSTATS B) It adds negligible overhead since all data already passes through memory during REORG's sort phase C) It locks the tablespace for a shorter duration D) It is the only way to collect distribution statistics

Answer: B

Explanation: During REORG, all rows are read, sorted, and rewritten. Inline RUNSTATS piggybacks on this existing data pass to collect statistics, adding minimal additional CPU and zero additional I/O. Running separate RUNSTATS after REORG requires a second full scan of the data — wasteful when the data was just processed. Both methods collect the same statistics; inline RUNSTATS is simply more efficient.


Question 16

A query has two predicates: STATE = 'CA' (COLCARDF = 52) and CITY = 'LOS ANGELES' (COLCARDF = 5000). The optimizer estimates the combined filter factor as:

A) 1/52 + 1/5000 = 0.01942 B) 1/52 * 1/5000 = 0.00000385 C) 1/52 * 5000/52 = 1.8496 D) 1/(52 * 5000) = 0.00000385

Answer: B (same as D)

Explanation: For AND predicates, DB2 multiplies the individual filter factors (independence assumption): FF = (1/52) * (1/5000) = 0.00000385. This is a dramatic underestimate because STATE and CITY are correlated — if CITY = 'LOS ANGELES', STATE is almost certainly 'CA'. The actual filter factor is simply 1/5000 = 0.0002. The optimizer underestimates the result set by a factor of approximately 52x.


Question 17

PLANMGMT(EXTENDED) on BIND preserves how many previous copies of the package?

A) 0 B) 1 C) 2 D) All previous versions indefinitely

Answer: C

Explanation: PLANMGMT(EXTENDED) preserves two previous copies: the PREVIOUS version (the one that was active before the current BIND) and the ORIGINAL version (the first version bound with PLANMGMT). PLANMGMT(BASIC) preserves only one copy (PREVIOUS). The SWITCH keyword on REBIND allows you to activate either the PREVIOUS or ORIGINAL version.


Question 18

A hash join (METHOD=4) requires which of the following?

A) At least one equality predicate on the join columns B) Both tables to be sorted on the join columns C) An index on the join columns of the inner table D) Parallel query mode to be enabled

Answer: A

Explanation: Hash joins work by hashing join key values into a hash table, then probing that table with rows from the other table. This technique requires equality comparison — you can't hash a range. The hash join does NOT require sorted input (that's merge scan) or an index on the inner table (that's where hash join shines — it's optimal when no useful index exists). Parallelism is not required, though hash joins can be parallelized.


Question 19

After creating a new index, you REBIND a package and discover the optimizer is NOT using the new index. Which of the following is the LEAST LIKELY reason?

A) RUNSTATS has not been run on the new index (no statistics available) B) The index columns don't match the query's predicate columns in a useful order C) The optimizer's cost estimate for the existing access path is lower than the new index path D) The new index name is too long for the optimizer to evaluate

Answer: D

Explanation: The optimizer evaluates all available indexes regardless of naming. The most common reasons for a new index being ignored are: (A) missing statistics — without RUNSTATS, DB2 uses defaults that may not favor the index; (B) column order mismatch — the index columns don't align with the query predicates for matching; (C) cost — the optimizer calculated that the existing plan is cheaper, which could happen if CLUSTERRATIO is low or the filter factor doesn't favor the index.


Question 20

Lisa Tran's EXPLAIN Baseline System at CNB catches plan regressions by:

A) Running EXPLAIN before every SQL execution at runtime B) Comparing PLAN_TABLE output from each BIND against a stored baseline of expected access paths C) Monitoring CPU consumption in real-time and alerting when it exceeds thresholds D) Preventing any access path changes by using OPTHINT on all packages

Answer: B

Explanation: The EXPLAIN Baseline System captures PLAN_TABLE output at every production BIND and compares it against a stored baseline. Differences are categorized by severity (CRITICAL for tablespace scan replacing index on large tables, WARNING for join method changes, INFO for minor cost changes). This proactive approach catches regressions at BIND time — before the package runs in production — rather than waiting for a performance incident. It does not prevent changes or add runtime overhead.