Chapter 15 Quiz: Index Design

DataField.Dev

Chapter 15 Quiz: Index Design

Test your understanding of B+ tree internals, index design principles, clustering, covering indexes, and index maintenance. Try to answer each question before revealing the answer.

Question 1

What is the primary advantage of a B+ tree over a simple sorted list for database indexing?

Show Answer

A B+ tree provides **O(log n) lookup, insertion, and deletion** while maintaining sorted order, whereas a sorted list provides O(log n) lookup via binary search but O(n) insertion and deletion (because elements must be shifted). Additionally, the B+ tree's **leaf-page chaining** supports efficient range scans, and its **balanced structure** guarantees consistent performance regardless of the data distribution. The B+ tree is also designed for **page-based I/O** — each node corresponds to a disk page, minimizing the number of physical reads.

Question 2

A B+ tree index on a table with 100 million rows has a height of 4. Approximately how many page reads are required for a single-row equality lookup via this index (assuming no pages are cached in the buffer pool)?

Show Answer

**5 page reads**: 4 index page reads (1 root + 2 non-leaf levels + 1 leaf page) plus 1 data page read to fetch the actual row. In practice, the root page and upper non-leaf levels are almost always cached in the buffer pool, so the actual physical I/O is typically **1-2 reads** (one leaf page read + one data page read).

Question 3

Explain why leaf pages in a B+ tree are linked together in a doubly linked list. What type of query benefits from this structure?

Show Answer

Leaf pages are linked so that DB2 can perform **range scans** efficiently. After using the tree structure to locate the first qualifying leaf entry, DB2 follows the forward pointers through consecutive leaf pages without returning to non-leaf or root pages. The backward pointers enable **reverse scans** for queries with `ORDER BY ... DESC`. Without leaf-page chaining, every range scan entry would require a separate root-to-leaf traversal, which would be far more expensive.

Question 4

What is a page split, and under what circumstances does it occur? What parameter can reduce page split frequency?

Show Answer

A **page split** occurs when DB2 needs to insert a new index entry into a leaf page that is already full. DB2 allocates a new leaf page, moves approximately half the entries from the full page to the new page, inserts the new entry, and updates the parent non-leaf page and the leaf-page chain pointers. The **PCTFREE** parameter reduces page split frequency by reserving free space on each page during index creation or REORG, leaving room for future inserts.

Question 5

A table has an index on (CUSTOMER_ID, ORDER_DATE, STATUS). Which of the following WHERE clauses can use this index for a matching scan? For each, state how many columns are "matching."

a) WHERE CUSTOMER_ID = 100 b) WHERE ORDER_DATE = '2025-03-01' c) WHERE CUSTOMER_ID = 100 AND STATUS = 'ACTIVE' d) WHERE CUSTOMER_ID = 100 AND ORDER_DATE > '2025-01-01' AND STATUS = 'ACTIVE'

Show Answer

a) **Yes, 1 matching column** (CUSTOMER_ID). The leading column is provided. b) **No matching scan**. The leading column (CUSTOMER_ID) is not provided. DB2 cannot navigate the tree efficiently. It would need a non-matching scan (full leaf scan) or a table scan. c) **Yes, 1 matching column** (CUSTOMER_ID only). Although STATUS is in the index, ORDER_DATE (the second column) is not provided, so DB2 cannot match beyond the first column. It uses CUSTOMER_ID to navigate the tree, then filters on STATUS within the matching entries. d) **Yes, 2 matching columns** (CUSTOMER_ID and ORDER_DATE). CUSTOMER_ID is an equality match. ORDER_DATE is a range match (the range terminates matching, so STATUS is applied as a non-matching filter). DB2 navigates to CUSTOMER_ID = 100, then scans ORDER_DATE > '2025-01-01', filtering on STATUS as it reads leaf entries.

Question 6

What is a clustering index? Why is there a limit of one clustering index per table?

Show Answer

A **clustering index** defines the physical order in which DB2 attempts to store rows in the table. When new rows are inserted, DB2 tries to place them on pages near other rows with similar clustering key values. The limit of **one clustering index per table** exists because data rows have only one physical location — they cannot be simultaneously ordered by two different keys. You can sort a deck of cards by suit or by rank, but not both at the same time.

Question 7

What is the cluster ratio, and how does it affect query performance?

Show Answer

The **cluster ratio** measures the degree to which the physical order of rows in the table matches the order of the clustering index, expressed as a percentage (0-100%). A high cluster ratio (close to 100%) means that range scans using the clustering index read mostly contiguous pages, enabling **sequential prefetch** and fast I/O. A low cluster ratio means qualifying rows are scattered across many pages, forcing **random I/O** — which can be orders of magnitude slower. When the cluster ratio drops below a threshold (commonly 80-90%), a REORG is recommended to restore physical order.

Question 8

What is index-only access, and why is it called "the holy grail" of index design?

Show Answer

**Index-only access** occurs when DB2 satisfies a query entirely from the index without reading any data pages from the table. This is possible when every column referenced in the query (SELECT, WHERE, ORDER BY, GROUP BY, HAVING) exists in the index (as key columns or INCLUDE columns). It is called "the holy grail" because it eliminates the most expensive part of an indexed lookup: the **random data page read**. Each qualifying index entry would normally require a separate data page fetch. By avoiding these fetches entirely, index-only access can reduce I/O by 90% or more for many queries.

Question 9

On DB2 for LUW, what is the difference between key columns and INCLUDE columns in a CREATE INDEX statement?

Show Answer

**Key columns** are part of the index sort order. They determine how entries are organized in the B+ tree, are used for searching (matching predicates), enforce uniqueness (in unique indexes), and participate in ORDER BY optimization. **INCLUDE columns** are stored in the leaf pages but are NOT part of the sort order. They exist solely to make additional data available for index-only access. INCLUDE columns do not participate in uniqueness checking, cannot be used as matching predicates for tree navigation, and do not affect the sort order. They add width to leaf entries (reducing entries per page) but serve no search function.

Question 10

Explain why SELECT * FROM CUSTOMER WHERE UPPER(LAST_NAME) = 'SMITH' cannot use an index on LAST_NAME.

Show Answer

The `UPPER()` function **transforms the column value** before comparison. The index is organized by the raw values of LAST_NAME (e.g., 'Smith', 'SMITH', 'smith' are stored differently). To find entries matching `UPPER(LAST_NAME) = 'SMITH'`, DB2 would need to apply the UPPER function to every index entry and check — which is equivalent to a full index scan and defeats the purpose of the tree structure. The optimizer typically chooses a tablespace scan instead. The fix is to create an **expression-based index** on `UPPER(LAST_NAME)` (LUW) or use a **generated column** (z/OS).

Question 11

A table has indexes on (A) and (A, B). Is the index on (A) redundant? Are there any scenarios where keeping both is justified?

Show Answer

**Yes, the index on (A) is generally redundant** because the composite index (A, B) can serve all queries that only need column A — DB2 can use the leading prefix of a composite index. However, there are rare scenarios where keeping both might be justified: 1. The narrower index on (A) has **fewer leaf pages**, making non-matching index scans faster. 2. The narrower index is used for **index-only access** on queries that need only column A (the wider index includes B in every leaf entry, which is unnecessary data). 3. In practice, the storage and write overhead of the redundant index usually outweigh these minor advantages, and **the recommendation is to drop the narrower index**.

Question 12

On z/OS, what is the difference between a DPSI and an NPSI? When would you choose each?

Show Answer

A **DPSI (Data-Partitioned Secondary Index)** is partitioned the same way as the data — each data partition has its own index partition. An **NPSI (Non-Partitioned Secondary Index)** spans the entire table as a single index structure. Choose **DPSI** when: - Most queries using this index also include the partition key in their WHERE clause (so only one DPSI partition needs to be searched). - Partition independence is critical (REORG, REBUILD can operate on one partition without affecting others). Choose **NPSI** when: - Queries frequently use this index WITHOUT the partition key (avoiding the need to search all DPSI partitions). - Cross-partition lookups are common and performance-critical. The trade-off: DPSI provides better partition independence and availability but penalizes queries that lack the partition key predicate. NPSI provides better cross-partition query performance but reduces partition independence for utility operations.

Question 13

An OLTP table receives 10,000 INSERTs per second. It currently has 3 indexes. A developer requests a fourth index to speed up a query. What questions should you ask before creating the index?

Show Answer

1. **How frequently does the query run?** (If it runs once a day, the INSERT overhead may not be justified.) 2. **What is the current query response time, and what is the target?** (Quantify the improvement needed.) 3. **Can an existing index be modified to support the query?** (Adding an INCLUDE column to an existing index might be cheaper than a new index.) 4. **Is the new index redundant with an existing index?** (Check for left-prefix overlap.) 5. **What is the estimated write overhead?** (Each index adds I/O to every INSERT, UPDATE of indexed columns, and DELETE.) 6. **What is the selectivity of the proposed index columns?** (Low-cardinality columns alone are rarely useful.) 7. **Has the query been reviewed for anti-patterns?** (Function on column, implicit conversion, leading wildcard.) 8. **Can the query be rewritten to use existing indexes?** (Sometimes the query is the problem, not the index set.) 9. **What is the estimated storage cost?** 10. **Is there a testing plan to measure actual improvement?**

Question 14

What is Multi-Dimensional Clustering (MDC) in DB2 for LUW, and how does it differ from a traditional clustering index?

Show Answer

**MDC** allows clustering on **multiple dimensions simultaneously** — something a traditional clustering index cannot do (which is limited to one key). MDC organizes data into "blocks" (groups of consecutive pages) based on combinations of dimension column values. DB2 creates **dimension block indexes** that point to blocks rather than individual rows. This means a query filtering on any combination of the dimension columns benefits from clustering. The trade-off: MDC tables may consume more space (blocks for each dimension value combination must be pre-allocated), and the table design is more constrained. MDC is particularly effective for **data warehouse** workloads with queries that filter on multiple dimensions.

Question 15

You observe that a query is using a tablespace scan despite having a suitable index. The index is on a column with good selectivity (0.1% of rows match the predicate). What might cause the optimizer to avoid the index?

Show Answer

Possible causes: 1. **Stale statistics**: The optimizer's cardinality estimates are wrong because RUNSTATS has not been run recently. The optimizer may not know the column is selective. 2. **Low cluster ratio**: Even though the index is selective, if the cluster ratio is very low, the optimizer estimates that the random I/O for data page fetches outweighs the benefit of the index. 3. **Function on column or implicit conversion**: The predicate applies a function to the indexed column (e.g., CAST, UPPER) that prevents index matching. 4. **Parameter markers**: If using parameter markers (`WHERE COL = ?`), the optimizer may use average statistics rather than the actual selective value. 5. **Data type mismatch**: The predicate value's type does not match the column type, causing implicit conversion. 6. **Catalog statistics missing**: The index exists but RUNSTATS has never been run on it, so the optimizer has no cost information. 7. **Buffer pool pressure**: The optimizer's cost model accounts for buffer pool size; if the buffer pool is too small to hold the index pages, the estimated I/O cost increases.

Question 16

In the Meridian Bank schema, the TRANSACTION table's clustering index is on (ACCOUNT_ID, TRANSACTION_DATE DESC). Why was TRANSACTION_DATE alone not chosen as the clustering key?

Show Answer

Because the **dominant query pattern** is "show me recent transactions for a specific account" (customer-facing OLTP), not "show me all transactions for a date range" (batch reporting). Clustering on `(ACCOUNT_ID, TRANSACTION_DATE DESC)` means that all transactions for the same account are stored on consecutive pages, in reverse chronological order — exactly the access pattern for the online banking transaction history screen. Clustering on TRANSACTION_DATE alone would benefit date-range reporting queries but would scatter a single account's transactions across many pages, making the high-frequency OLTP queries much slower. Since Meridian is primarily an OLTP system, the OLTP access pattern takes priority for clustering.

Question 17

What is the left prefix rule, and how does it affect the design of composite indexes?

Show Answer

The **left prefix rule** states that a composite index on columns (A, B, C) can be used for a matching index scan only when the query provides a predicate on the **leading (leftmost) column(s)** in order. The index efficiently supports predicates on A, (A, B), or (A, B, C) — but NOT on B alone, C alone, or (B, C). This is because the B+ tree is sorted by A first, then by B within each A value, then by C within each (A, B) combination. Without knowing A, DB2 cannot navigate the tree. This rule means **column order in composite indexes is critical**: the most frequently filtered column (or the most selective equality column) should be the leading column.

Question 18

A DBA proposes creating a covering index with 12 INCLUDE columns to achieve index-only access for a complex query. The index would be nearly as wide as the table itself. Is this a good idea?

Show Answer

**Generally no.** When an index approaches the width of the table, the advantages of index-only access diminish: 1. The index is almost as large as the table, so scanning it saves little I/O compared to scanning the table. 2. Every INSERT and UPDATE to any INCLUDE column requires updating this wide index, adding significant **write overhead**. 3. The wide leaf entries mean **fewer entries per page**, making the index taller and increasing I/O for all queries using it. 4. The index consumes **buffer pool space** that could be used for other objects. 5. You have essentially **duplicated the table** in index form, doubling storage. The guideline is to cover only the **most critical and most frequent** queries with INCLUDE columns, keeping the total index width reasonable (typically no more than 5-8 columns total). For queries that need many columns, accept the data page access cost.

Question 19

How can you determine whether DB2 is using index-only access for a specific query?

Show Answer

**On z/OS**: Use the EXPLAIN facility. The **PLAN_TABLE** column `INDEXONLY` will show `'Y'` when index-only access is used. Run `EXPLAIN PLAN FOR ` and query the PLAN_TABLE. **On LUW**: Use `EXPLAIN PLAN FOR ` or `db2expln`. In the access plan, look for the absence of a **FETCH** or **TABLE ACCESS** operator after the index scan operator. If the plan shows only IXSCAN (index scan) without a subsequent FETCH, the query is using index-only access. Alternatively, use `db2exfmt` to format the explain output and look for the "Index Only" indicator. In both cases, you can also use runtime monitoring (z/OS IFCID traces or LUW MON_GET_INDEX) to see `INDEX_ONLY_SCANS` counts for specific indexes.

Question 20

You need to support this query efficiently:

SELECT COUNT(*) FROM TRANSACTION
WHERE ACCOUNT_ID = ? AND STATUS = 'POSTED'
AND TRANSACTION_DATE >= CURRENT DATE - 90 DAYS

The existing index is (ACCOUNT_ID, TRANSACTION_DATE DESC). Would you modify the existing index or create a new one? What would the ideal index look like?

Show Answer

The existing index `(ACCOUNT_ID, TRANSACTION_DATE DESC)` can be used with **2 matching columns** (ACCOUNT_ID equality, TRANSACTION_DATE range), but STATUS must be applied as a non-matching filter to each leaf entry — and the query still requires data page reads to check STATUS unless it is in the index. **Option A — Add INCLUDE to existing index**: Modify the clustering index to `(ACCOUNT_ID, TRANSACTION_DATE DESC) INCLUDE (STATUS)`. This enables index-only access (COUNT(*) needs only to verify predicates, all of which are now in the index). However, modifying the clustering index on a 500M-row table requires a full REORG and affects all queries. **Option B — Create a new index**: `CREATE INDEX IX_TXN_ACCT_STATUS ON TRANSACTION (ACCOUNT_ID, STATUS, TRANSACTION_DATE DESC)`. This puts both equality columns first, then the range column. All matching columns. And since COUNT(*) needs no table data, this is index-only for this query. **Recommendation**: If this query is frequent and critical, Option B is safer — it does not disturb the existing clustering index. The trade-off is an additional index on a write-heavy table.

Question 21

Explain why an index on a column with only 3 distinct values (like STATUS with values 'A', 'I', 'C') is usually not helpful as a standalone index.

Show Answer

With only 3 distinct values, each value covers approximately **33% of the table** (assuming even distribution; in practice, the split is often skewed). Using an index to retrieve 33% of the table means fetching roughly one-third of all data pages via **random I/O** — far worse than a tablespace scan with **sequential prefetch** that reads all pages in order. The optimizer's cost model recognizes this: when selectivity is low (many rows match), the random I/O cost of index access exceeds the sequential I/O cost of a scan. The break-even point is typically around 5-20% of the table, depending on cluster ratio and page size. A standalone index on a 3-value column almost never reaches this threshold. However, such a column can be **valuable as a trailing column** in a composite index, where the leading column(s) provide the selectivity.

Question 22

What is the relationship between PCTFREE and index size? When would you set a high PCTFREE?

Show Answer

**PCTFREE** specifies the percentage of each index page to leave empty during CREATE INDEX or REORG. A higher PCTFREE results in a **larger index** (more pages needed because each page holds fewer entries) but **fewer page splits** during subsequent inserts (because pages have room for new entries). Set a **high PCTFREE** (e.g., 20-30%) when: - The table receives frequent inserts with randomly distributed key values (causing splits throughout the index). - Page splits are causing measurable performance problems. - You want to reduce the frequency of index REORG operations. Set a **low PCTFREE** (e.g., 0-5%) when: - The table is read-only or rarely updated. - Inserts are monotonically increasing (splits only at the right edge, which is benign). - Storage efficiency is a priority.

Question 23

A query runs with the predicate WHERE ACCOUNT_NUMBER = 12345 but the ACCOUNT_NUMBER column is defined as CHAR(10). Why might the index on ACCOUNT_NUMBER not be used?

Show Answer

The literal `12345` is a **numeric value**, but the column is **CHAR(10)**. DB2 must perform an **implicit data type conversion** to compare them. Depending on how DB2 resolves this, it may convert the CHAR column to a number (applying a function to the column, which prevents index use) rather than converting the numeric literal to a character string. Even if DB2 converts the literal, the comparison semantics may differ (numeric 12345 vs. character '12345 ' with trailing spaces). The fix is to **match data types**: use `WHERE ACCOUNT_NUMBER = '12345'` (a character literal) or, better yet, `WHERE ACCOUNT_NUMBER = '0000012345'` if the column stores zero-padded values.

Question 24

You are considering dropping an index that has not been used in 6 months according to SYSCAT.INDEXES.LASTUSED. What precautions should you take before dropping it?

Show Answer

1. **Check for seasonal or periodic queries**: The index might support a quarterly, semi-annual, or annual process (e.g., year-end reporting, audit queries) that has not run in the past 6 months. 2. **Verify with application teams**: Confirm that no application depends on this index. Check application SQL in source code repositories. 3. **Check for constraint enforcement**: The index might enforce a UNIQUE or PRIMARY KEY constraint. Dropping it would remove the constraint. 4. **Save the CREATE INDEX DDL**: Record the exact CREATE INDEX statement so you can recreate it if needed. 5. **Drop during a maintenance window**: Minimize the risk of impacting production workloads. 6. **Monitor after dropping**: Watch for performance regressions in the days and weeks following the drop, including during the next occurrence of any periodic batch jobs. 7. **Consider making the index invisible first** (if your DB2 version supports it) to test the impact before actually dropping.

Question 25

True or false: A unique index on (CUSTOMER_ID) guarantees that no two rows in the table have the same CUSTOMER_ID value, even for rows inserted by concurrent transactions that have not yet committed.

Show Answer

**True.** A unique index enforces uniqueness **at the time of INSERT**, not at the time of COMMIT. If transaction T1 inserts a row with CUSTOMER_ID = 100, and transaction T2 attempts to insert another row with CUSTOMER_ID = 100 before T1 commits, T2 will be **blocked** (waiting for T1's lock on the index entry) or will receive a duplicate key error, depending on the isolation level and timing. This is essential for correctness — if uniqueness were only checked at commit time, two concurrent transactions could both insert the same key, and the conflict would only be detected when one tries to commit (by which time significant work may need to be rolled back).

Return to Chapter 15 | Continue to Case Study 1