Chapter 13 Quiz: Logical Database Design

DataField.Dev

Chapter 13 Quiz: Logical Database Design

Test your understanding of logical data modeling, ER diagrams, normalization validation, and design patterns. Try to answer each question before revealing the answer.

Question 1

What is the primary difference between a logical data model and a physical data model?

Show Answer

A **logical data model** describes *what data means* — entities, attributes, relationships, and business rules — independent of any specific database technology. A **physical data model** describes *how data is stored* — tables, columns, data types, tablespaces, indexes, and DB2-specific implementation details. The logical model is a contract between business and technology; the physical model is the implementation of that contract.

Question 2

Name three costs of skipping logical database design and jumping straight to CREATE TABLE statements.

Show Answer

Any three of the following: 1. **Data anomalies** — duplicate, inconsistent, or contradictory data because relationships were not properly modeled. 2. **Performance degradation** — queries require complex workarounds (self-joins, string parsing, UNION ALL across multiple tables) because the schema does not match the query patterns. 3. **Maintenance burden** — every schema change requires understanding implicit relationships that were never documented. 4. **Failed audits** — regulators cannot verify data integrity because the data dictionary does not exist. 5. **Costly re-platforming** — eventually the schema must be redesigned from scratch, at a cost that dwarfs the original design effort.

Question 3

In requirements gathering, why should you treat an existing database schema as "evidence of what the business did, not what it should do"?

Show Answer

Existing schemas accumulate design compromises, workarounds, obsolete columns, and decisions made under constraints that may no longer apply. They reflect historical technical limitations, past developer preferences, and business rules that may have changed. Using the existing schema uncritically means inheriting all its flaws. Instead, use it as a source of information about entities and relationships, but validate every design decision against current business requirements.

Question 4

Classify the following attribute of the EMPLOYEE entity: "Years of service (calculated from hire date and current date)."

Show Answer

This is a **derived attribute**. It can be calculated from HIRE_DATE and CURRENT DATE using: `YEAR(CURRENT DATE) - YEAR(HIRE_DATE)` or more precisely using `TIMESTAMPDIFF`. Derived attributes should generally not be stored in the database because they create update anomalies — the stored value becomes stale as time passes and requires periodic recalculation.

Question 5

What is a weak entity? Give an example from the Meridian Bank model.

Show Answer

A **weak entity** cannot be uniquely identified by its own attributes alone — it depends on a related (parent) entity for identification. Its primary key includes the parent entity's primary key. Example: **LOAN_PAYMENT** is a weak entity. It is identified by the combination of (LOAN_ID, PAYMENT_SEQUENCE). Without LOAN_ID (inherited from the LOAN entity), the PAYMENT_SEQUENCE number alone is not unique. The relationship between LOAN and LOAN_PAYMENT is an identifying relationship.

Question 6

A data model shows: CUSTOMER ──── 1 ──── 0..* ──── ACCOUNT. In plain English, how do you read this from each side?

Show Answer

- **From CUSTOMER to ACCOUNT:** "One customer has zero or more accounts." (A customer can exist without any accounts, or can have many accounts.) - **From ACCOUNT to CUSTOMER:** "One account belongs to exactly one customer." (Every account must be associated with one and only one customer.) The "0..*" on the ACCOUNT side means the participation is optional (zero is allowed) and the maximum is unbounded. The "1" on the CUSTOMER side means the participation is mandatory (exactly one).

Question 7

Why can a many-to-many relationship not be directly implemented in a relational database? What is the standard solution?

Show Answer

A relational table stores data in rows with fixed columns. A foreign key column can hold one value per row, pointing to one parent. To represent "many" parents, you would need either: (a) multiple FK columns (CUSTOMER_ID_1, CUSTOMER_ID_2, ...), which limits the number of relationships and violates 1NF; or (b) multiple rows with duplicated data, which violates normalization. The standard solution is a **junction table** (also called a bridge table or associative entity). The junction table contains foreign keys to both sides of the relationship, and each row represents one instance of the M:N association. For example, ACCOUNT_HOLDER with (ACCOUNT_ID, CUSTOMER_ID) resolves the M:N between CUSTOMER and ACCOUNT.

Question 8

When designing a junction table, why is naming important? What is wrong with naming it CUSTOMER_ACCOUNT_XREF?

Show Answer

A junction table is a first-class entity — it often has its own attributes that describe the relationship (e.g., HOLDER_TYPE, ADDED_DATE, CAN_TRANSACT). Naming it with a generic suffix like "_XREF" or "_LINK" obscures its meaning and treats it as a mechanical artifact rather than a business concept. **ACCOUNT_HOLDER** tells everyone — developers, DBAs, business analysts — what the table represents: a person who holds an account. **CUSTOMER_ACCOUNT_XREF** tells you nothing about the business meaning. Good names reduce the need for documentation and make SQL self-documenting.

Question 9

In the Meridian Bank model, ACCOUNT is a supertype with subtypes CHECKING_DETAIL, SAVINGS_DETAIL, and CD_DETAIL. What does the "discriminator" attribute do, and which column serves as the discriminator?

Show Answer

A **discriminator** attribute (also called a type indicator or subtype discriminator) is a column in the supertype table that identifies which subtype each row belongs to. It determines which subtype table contains the additional attributes for that row. In the Meridian Bank model, **ACCOUNT_TYPE_CODE** in the ACCOUNT table serves as the discriminator. When ACCOUNT_TYPE_CODE = 'CHK', the corresponding subtype-specific attributes are in CHECKING_DETAIL. When ACCOUNT_TYPE_CODE = 'SAV', they are in SAVINGS_DETAIL. And so on.

Question 10

Compare the three supertype/subtype implementation strategies. When would you choose table-per-type over single-table?

Show Answer

| Strategy | Pros | Cons | |----------|------|------| | **Single table** (table-per-hierarchy) | No JOINs needed; simple queries | Many NULL columns; cannot enforce NOT NULL on subtype attributes; table becomes very wide | | **Table-per-type** | Clean separation; subtype NOT NULL constraints enforceable; no wasted NULLs | Requires JOIN to get full record; INSERT requires two statements | | **Table-per-concrete-class** | Each table is self-contained; no JOINs for single-type queries | Common attributes duplicated; UNION ALL needed for cross-type queries; FK references are problematic | Choose **table-per-type** over single-table when: - Subtypes have many distinct attributes (avoiding excessive NULLs) - Subtype-specific NOT NULL constraints are important for data quality - The number of subtypes may grow over time - Queries commonly access one subtype at a time rather than all types together

Question 11

A table has primary key (ORDER_ID, PRODUCT_ID) and includes the attribute PRODUCT_NAME. What normal form is violated, and why?

Show Answer

**Second Normal Form (2NF)** is violated. PRODUCT_NAME depends on PRODUCT_ID alone, not on the full composite key (ORDER_ID, PRODUCT_ID). This is a **partial dependency** — a non-key attribute depends on only part of the composite primary key. The fix: move PRODUCT_NAME to a separate PRODUCT table where it depends on PRODUCT_ID (the full key of that table), and keep only the foreign key PRODUCT_ID in the ORDER_LINE table.

Question 12

A CUSTOMER table has columns CUSTOMER_ID (PK), CUSTOMER_NAME, BRANCH_ID, and BRANCH_NAME. What normal form is violated?

Show Answer

**Third Normal Form (3NF)** is violated. BRANCH_NAME depends on BRANCH_ID, which is a non-key attribute. This is a **transitive dependency**: CUSTOMER_ID -> BRANCH_ID -> BRANCH_NAME. A non-key attribute (BRANCH_NAME) depends on another non-key attribute (BRANCH_ID) rather than directly on the primary key. The fix: move BRANCH_NAME to a BRANCH table where BRANCH_ID is the primary key. Keep only BRANCH_ID as a foreign key in the CUSTOMER table.

Question 13

What is the Entity-Attribute-Value (EAV) anti-pattern? Give two specific reasons it is problematic in DB2.

Show Answer

The **EAV anti-pattern** stores data as (entity_id, attribute_name, attribute_value) triples instead of as columns in properly typed tables. Every attribute becomes a row rather than a column. Two DB2-specific problems: 1. **Loss of type safety.** All values are stored as VARCHAR, so DB2 cannot enforce data types, CHECK constraints, or foreign keys. A date stored as text can contain invalid values like '2024-13-45'. 2. **Optimizer blindness.** DB2's optimizer relies on column-level statistics (histograms, cardinality estimates) to choose access paths. An EAV table has no meaningful per-attribute statistics, causing the optimizer to make poor access path decisions. You also cannot create targeted indexes — an index on ATTRIBUTE_VALUE indexes all attribute types together, diluting its selectivity.

Question 14

What is the OTLT (One True Lookup Table) anti-pattern? Why can you not enforce referential integrity with it?

Show Answer

**OTLT** is the practice of storing all lookup/reference data in a single table with columns like (CODE_TYPE, CODE_VALUE, CODE_DESCRIPTION) instead of using separate lookup tables for each domain. Referential integrity cannot be enforced because a DB2 foreign key references a primary key or unique constraint unconditionally. You cannot create a FK that says "ACCOUNT.ACCOUNT_TYPE must match a row in CODES WHERE CODE_TYPE = 'ACCT_TYPE'." The FK would validate against ALL rows in CODES regardless of CODE_TYPE, meaning an account type of 'USA' (a country code) would pass validation.

Question 15

In the effective dating pattern, why is a half-open interval (EFFECTIVE_DATE inclusive, END_DATE exclusive) preferred over a fully closed interval?

Show Answer

A half-open interval prevents **gaps and overlaps** between consecutive periods. If Rate A is effective [2024-01-01, 2024-04-01) and Rate B is effective [2024-04-01, 2024-07-01), there is no ambiguity: March 31 uses Rate A, April 1 uses Rate B. With a fully closed interval — Rate A [2024-01-01, 2024-03-31] and Rate B [2024-04-01, 2024-06-30] — you must carefully coordinate end and start dates. A one-day error (e.g., Rate A ending 2024-04-01 and Rate B starting 2024-04-01) creates a day where both rates are "valid." The half-open convention eliminates this class of bugs by design.

Question 16

In the Meridian Bank model, why is TRANSACTION_ID defined as BIGINT rather than INTEGER?

Show Answer

INTEGER in DB2 is a 32-bit signed integer with a maximum value of approximately 2.1 billion. With 500 million transactions per year, Meridian Bank would exhaust the INTEGER range in approximately 4 years. **BIGINT** is a 64-bit signed integer with a maximum value of approximately 9.2 x 10^18 (9.2 quintillion), providing effectively unlimited headroom. The additional storage per row (8 bytes vs. 4 bytes) is negligible compared to the catastrophic risk of running out of primary key values.

Question 17

Why did the Meridian Bank model choose to store CUSTOMER and ACCOUNT in a many-to-many relationship (via ACCOUNT_HOLDER) rather than a simple one-to-many with CUSTOMER_ID in the ACCOUNT table?

Show Answer

The M:N relationship via ACCOUNT_HOLDER supports **joint accounts** — a single account held by multiple customers. With a simple FK (CUSTOMER_ID in ACCOUNT), each account can belong to only one customer. Joint checking accounts, authorized signers, beneficiary designations, and trust arrangements all require multiple customers to be associated with a single account. The junction table also carries relationship-specific attributes: HOLDER_TYPE (primary, joint, authorized signer), CAN_TRANSACT, CAN_CLOSE, ADDED_DATE, and REMOVED_DATE. These attributes describe the *relationship* between a customer and an account, not the customer or the account individually.

Question 18

A developer argues: "We should store CUSTOMER_NAME in the TRANSACTION table to avoid JOINs when generating statements." Evaluate this argument.

Show Answer

This is a **denormalization** decision that creates a 3NF violation (CUSTOMER_NAME depends on CUSTOMER_ID, not on TRANSACTION_ID — a transitive dependency). **Arguments for storing it:** - Statement generation queries avoid the JOIN chain TRANSACTION -> ACCOUNT -> ACCOUNT_HOLDER -> CUSTOMER, improving performance for a high-frequency operation. **Arguments against:** - If a customer changes their name (marriage, legal change), every historical transaction must be updated — or the system will show inconsistent names across different views. - Storage is wasted: with 500 million transactions per year, even 50 bytes per name adds ~25 GB of redundant data annually. **The proper approach:** This is a **physical design decision**, not a logical design decision. The logical model should NOT include CUSTOMER_NAME in TRANSACTION. If statement generation performance requires it, the denormalization should be documented as a deliberate physical design choice in [Chapter 14](../chapter-14-physical-design/index.md), with a defined mechanism (trigger or materialized view) to keep the data in sync.

Question 19

What is the difference between an identifying relationship and a non-identifying relationship? Give one example of each from the Meridian Bank model.

Show Answer

An **identifying relationship** is one where the child entity depends on the parent for its identity — the parent's primary key becomes part of the child's primary key. Example: **LOAN to LOAN_PAYMENT** is identifying. LOAN_PAYMENT's primary key is (LOAN_ID, PAYMENT_SEQUENCE). Without LOAN_ID, the payment cannot be uniquely identified. A **non-identifying relationship** is one where the child is independently identifiable — the parent's key appears as a foreign key but not part of the child's primary key. Example: **BRANCH to EMPLOYEE** is non-identifying. EMPLOYEE has its own EMPLOYEE_ID as primary key. BRANCH_ID appears as a foreign key in EMPLOYEE, but it is not part of the primary key.

Question 20

For the Meridian Bank model, trace the JOIN path needed to answer: "What is the interest rate for Margaret Chen's savings account?" List every table in the JOIN chain.

Show Answer

The JOIN path: 1. **INDIVIDUAL_CUSTOMER** — find Margaret Chen by FIRST_NAME = 'Margaret' AND LAST_NAME = 'Chen' to get CUSTOMER_ID 2. **ACCOUNT_HOLDER** — JOIN on CUSTOMER_ID to find all accounts she holds 3. **ACCOUNT** — JOIN on ACCOUNT_ID to filter by ACCOUNT_TYPE_CODE = 'SAV' (savings) 4. **SAVINGS_DETAIL** — JOIN on ACCOUNT_ID to get the INTEREST_RATE

SELECT sd.INTEREST_RATE
FROM INDIVIDUAL_CUSTOMER ic
  JOIN ACCOUNT_HOLDER ah ON ic.CUSTOMER_ID = ah.CUSTOMER_ID
  JOIN ACCOUNT a ON ah.ACCOUNT_ID = a.ACCOUNT_ID
  JOIN SAVINGS_DETAIL sd ON a.ACCOUNT_ID = sd.ACCOUNT_ID
WHERE ic.FIRST_NAME = 'Margaret'
  AND ic.LAST_NAME = 'Chen'
  AND a.ACCOUNT_TYPE_CODE = 'SAV';

Four tables are involved. This is a natural consequence of the normalized, table-per-type design. The optimizer can handle this efficiently with proper indexes on the JOIN columns.

Question 21

Why does the Meridian Bank model separate PENDING_TRANSACTION from TRANSACTION instead of using a single table with a STATUS column?

Show Answer

Two key reasons: 1. **Immutability.** Posted transactions (in TRANSACTION) are **immutable** — they represent completed financial events and must not be modified (a regulatory requirement). Pending transactions (in PENDING_TRANSACTION) are **mutable** — they change status, can be cancelled, and eventually either post (becoming a TRANSACTION row) or expire. Separating them enforces the immutability constraint at the table level — you can grant UPDATE permission on PENDING_TRANSACTION but deny it on TRANSACTION. 2. **Performance isolation.** The TRANSACTION table is massive (500 million rows per year, multi-year retention). Queries against it (statement generation, balance calculation) benefit from being free of pending/expired rows. The PENDING_TRANSACTION table is much smaller and is queried differently (by status, expected post date). Separating them allows different physical design choices (partitioning, compression, buffer pool assignment) for each workload.

Question 22

Name three design patterns discussed in this chapter and briefly describe when each is appropriate.

Show Answer

1. **Audit trail** — Used when the business or regulators require tracking of who changed what data and when. Appropriate for any entity with financial, medical, or personally identifiable information. Implemented via a centralized AUDIT_LOG, shadow tables, or both. 2. **Effective dating** — Used when an attribute changes over time and the business needs to know the value at any historical point. Appropriate for interest rates, fee schedules, addresses, and job assignments. Implemented with EFFECTIVE_DATE and END_DATE columns using half-open intervals. 3. **Supertype/subtype** — Used when entities share common attributes but have type-specific attributes. Appropriate when the subtypes have meaningfully different attributes (not just one or two extra columns) and queries commonly need to access all types together. Implemented via table-per-type in the Meridian Bank model.

Question 23

You are reviewing a data model and notice that a CUSTOMER table has 85 columns, of which 40 are NULL for 90% of the rows. What is the likely design problem, and what solution would you propose?

Show Answer

The likely problem is a **missing supertype/subtype decomposition** or a **single-table implementation** of what should be a table-per-type design. The 40 columns that are NULL for 90% of rows probably belong to specific customer subtypes (e.g., individual vs. business customer, or different customer tiers). **Solution:** Analyze which columns are populated together. Group them into subtypes. Move them to separate subtype tables linked by the customer's primary key. The main CUSTOMER table keeps only the common attributes (those that are NOT NULL or relevant for all customer types). This reduces the row width of the main table (improving buffer pool efficiency), allows proper NOT NULL constraints on subtype-specific attributes, and makes the schema self-documenting.

Question 24

In the adjacency list pattern for tree structures, why does DB2's support for recursive CTEs make this pattern practical?

Show Answer

The adjacency list pattern stores each row's parent reference (PARENT_ID) and nothing else about the hierarchy. The challenge is querying the full tree (all ancestors or all descendants), which requires traversing the parent links recursively. Without recursive CTEs, this requires either: (a) multiple queries in a loop at the application level, or (b) a fixed number of self-JOINs (which limits the tree depth). Neither is practical. DB2's recursive CTE support (the WITH RECURSIVE clause, explored in [Chapter 8](../../part-02-sql-mastery/chapter-08-subqueries-ctes/index.md)) allows a single SQL statement to traverse the entire tree to any depth. The database engine handles the iteration internally, which is both simpler to code and more efficient than application-level loops. This makes the adjacency list pattern — the simplest tree storage pattern — a practical choice for hierarchies with reasonable depth (such as employee management structures or account category trees).

Question 25

A stakeholder says: "We need columns for January sales, February sales, March sales, ... December sales." What is wrong with this requirement, and how should it be modeled?

Show Answer

This is a **repeating group** — the same type of data (monthly sales) repeated as separate columns. It violates First Normal Form because the "month" is encoded in the column name rather than as data. Problems include: - Adding a 13th period (e.g., a fiscal adjustment period) requires ALTER TABLE. - Querying "which month had the highest sales" requires a 12-way CASE expression. - Aggregating across months requires listing all 12 columns. **Proper model:**

MONTHLY_SALES
-------------
ENTITY_ID (FK)
SALES_YEAR (part of PK)
SALES_MONTH (part of PK) -- 1-12
SALES_AMOUNT

Now, queries like "find the month with the highest sales" are simple: `ORDER BY SALES_AMOUNT DESC FETCH FIRST 1 ROW ONLY`. Adding periods requires no schema change. Aggregations use standard SUM/AVG across rows.