26 min read

> Where you are: Part I, Chapter 3 of 40. You have a running database full of data. Now you'll learn the precise vocabulary and rules behind it — the model that makes a "table" far more than a grid.

Chapter 3: The Relational Model — Tables, Rows, Columns, Keys, and the Mathematical Foundation of SQL

Where you are: Part I, Chapter 3 of 40. You have a running database full of data. Now you'll learn the precise vocabulary and rules behind it — the model that makes a "table" far more than a grid.

Learning paths: 💻 📊 🔬 🏗️ — everyone, but 🔬 CS students should study the formal definitions and keys closely; they recur in Chapter 4 (algebra) and Chapter 19 (normalization).


The words you'll use for the rest of your life

You already know, loosely, that a database has "tables" with "rows" and "columns." That intuition got you through Chapter 2. But databases are built on a precise mathematical idea — the relational model — and learning its real vocabulary is not pedantry. It's the difference between someone who pokes at tables hopefully and someone who can reason about whether a design is correct. Every later topic — joins, normalization, indexing, transactions — is expressed in these terms.

The relational model, you'll recall from Chapter 1, was Edgar Codd's 1970 idea: represent all data as relations and let users ask for data by describing what they want. This chapter unpacks what a relation actually is, what makes a row uniquely identifiable (keys), how relations connect to each other (foreign keys), what the troublesome value NULL really means, and the integrity constraints that let the database protect your data from contradiction.

Theory → Practice. The relational model rests on set theory and predicate logic. That foundation is why SQL queries can be optimized automatically and reasoned about rigorously. You don't need to love math to use databases — but knowing the model is mathematical explains why it behaves so consistently, and why some "obvious" things (like how NULL compares) work the way they do.


Relations, tuples, and attributes

Here is the formal vocabulary, paired with the everyday words:

Formal term Everyday term In Mercado
Relation Table customers
Tuple Row one customer
Attribute Column email
Degree Number of columns customers has 7 attributes
Cardinality Number of rows the sample customers has 12 tuples
Domain The set of allowed values for an attribute loyalty_tier

A relation is, mathematically, a set of tuples, where each tuple is a collection of attribute values drawn from their domains. That one-sentence definition has surprising consequences, because the word set is doing real work:

  1. No two tuples are identical. A true relation is a set, and sets don't contain duplicates. (In practice, SQL tables can contain duplicate rows unless you prevent it — a place where SQL departs slightly from the pure theory, and one reason primary keys matter.)
  2. Tuples have no inherent order. A relation is a set, not a list. The rows in a table have no "natural" position. This is why, if you want results in a particular order, you must say so with ORDER BY — the database is under no obligation to return rows in insertion order, or any order at all, unless you ask.
  3. Attributes have no inherent order, and each has a name. You refer to a value by its attribute name (email), not by position. (SQL lets you write SELECT *, which does impose a column order, but good practice names columns explicitly.)
  4. Each attribute value is atomic. In the classic relational model, a cell holds a single value from its domain — not a list, not a nested table. (Modern PostgreSQL relaxes this with arrays and JSONB, which we'll treat carefully in Chapter 16; the default discipline of atomic values is what makes normalization possible.)

Why this matters. "Rows have no order" trips up nearly every beginner. They run a query, see rows come back in what looks like insertion order, and assume that's guaranteed. It isn't. Without ORDER BY, the order can change when the data grows, when an index is added, or when the query plan changes. Treat result order as undefined unless you specify it. (We'll prove this to you in Chapter 5.)


Domains and data types

Every attribute draws its values from a domain — the set of values it's allowed to hold. In practice, you express a domain primarily through a data type, optionally narrowed by constraints.

When you ran \d products in Chapter 2, you saw each column's type. Those types are the domains:

   Column    |     Type      | Nullable | Default
-------------+---------------+----------+--------------------------------
 product_id  | integer       | not null | generated always as identity
 sku         | text          | not null |
 price       | numeric(10,2) | not null |
 attributes  | jsonb         | not null | '{}'::jsonb
 is_active   | boolean       | not null | true
 created_at  | timestamptz   | not null | now()

Choosing the right domain is a real design decision (Chapter 14 covers types in depth; Appendix D is a full reference). A taste of the most important PostgreSQL types:

  • Numbers: integer, bigint (whole numbers), numeric(p,s) (exact decimals — always for money), real/double precision (approximate floats — never for money).
  • Text: text (variable length, no limit — the usual choice), varchar(n) (length-limited), char(n) (fixed, rarely useful).
  • Dates and times: date, time, timestamp, and timestamptz (timestamp with time zone — almost always the right one).
  • Boolean: boolean (true/false/NULL).
  • Others: uuid, jsonb, arrays, and more.

Common mistake. Storing money as a floating-point type. real/double precision can't represent 0.10 exactly, so sums drift by fractions of a cent and eventually a customer is charged the wrong amount. Mercado uses numeric(10,2) for price, unit_price, amount, and salary for exactly this reason. The domain you pick is a correctness decision, not just a storage one.

Dialect Difference. Type names vary across databases: PostgreSQL's text and timestamptz aren't universal; SQL Server uses nvarchar and datetime2, Oracle uses VARCHAR2 and NUMBER, and so on. The concepts (exact vs. approximate numbers, timezone-aware timestamps) transfer; the spellings don't. See Appendix J.


Keys: how the database tells rows apart

If a relation is a set of tuples and no two tuples should be identical, then there must be some attribute (or combination) whose value is unique for every tuple. That's the idea of a key, and it comes in a small family.

Superkey. Any set of attributes that, taken together, uniquely identifies a tuple. {customer_id} is a superkey of customers; so is {customer_id, email} (the extra column is redundant but it's still unique). Superkeys can be wasteful — they may contain more than they need.

Candidate key. A minimal superkey — one with no removable attributes. In customers, both {customer_id} and {email} are candidate keys: each is unique on its own, and neither has anything to trim.

Primary key. The one candidate key you choose to be the official, primary identifier of a tuple. In customers, we chose customer_id. A primary key has two iron rules: its values are unique, and they are never NULL (you can't identify a row by an unknown value). This is called entity integrity.

-- The primary key is declared when the table is created (Chapter 14):
--   customer_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
-- PostgreSQL then refuses any duplicate or NULL customer_id.
SELECT customer_id, email FROM customers WHERE customer_id = 5;

Natural vs. surrogate keys

A natural key is a key that exists in the real-world data — email could identify a customer, an ISBN identifies a book. A surrogate key is an artificial identifier the database generates with no business meaning — Mercado's customer_id, product_id, etc.

Mercado uses surrogate keys (auto-generated integers) as primary keys, while also enforcing the natural keys with UNIQUE constraints (email is unique, sku is unique). This is a common, pragmatic pattern:

  • Surrogate keys are stable (a customer's email might change; their customer_id never does) and simple (one integer column to join on, rather than a long natural string).
  • Natural keys still deserve protection: declaring email UNIQUE stops two customers from registering the same address — an integrity rule the surrogate key alone wouldn't give you.

Why this matters (design, theme #1). The choice of primary key ripples through the whole schema, because every table that references this one will carry a copy of the key as a foreign key. A bad choice (say, a natural key that turns out not to be unique, or that changes) is painful to undo later. We'll return to this in Part III, but the instinct to form now is: prefer a stable surrogate key as the primary key, and protect natural keys with UNIQUE.

Composite keys

A key can span more than one attribute. Mercado's order_items table has the composite primary key (order_id, product_id): a given product appears at most once per order (quantity captures multiples). Its inventory table uses (product_id, warehouse_id) — one row per product per warehouse.

-- A composite key means the COMBINATION is unique, not each column.
-- This is fine: product 5 in two orders, and two products in order 1.
SELECT order_id, product_id, quantity FROM order_items
WHERE order_id IN (1, 6) ORDER BY order_id, product_id;

Composite keys appear naturally in junction tables (the tables that resolve many-to-many relationships), which you'll design deliberately in Chapter 18.


Foreign keys: how relations connect

A relation alone is just a list. The power of the relational model is that relations reference each other, and the database enforces those references. A foreign key is an attribute (or set) in one table whose values must match a primary key value in another table (or be NULL).

In Mercado, orders.customer_id is a foreign key referencing customers.customer_id:

-- Declared at table creation:
--   customer_id integer NOT NULL REFERENCES customers (customer_id)
-- This makes the following IMPOSSIBLE — there is no customer 9999:
INSERT INTO orders (customer_id, order_date, status)
VALUES (9999, now(), 'pending');
-- ERROR:  insert or update on table "orders" violates foreign key constraint
-- DETAIL: Key (customer_id)=(9999) is not present in table "customers".

This guarantee is referential integrity: the database will not let an order reference a customer who doesn't exist. Recall the Chapter 1 spreadsheet, where an order could name a customer who'd been deleted or misspelled. Foreign keys make that class of bug unrepresentable.

   The reference, enforced by the database:

   customers                       orders
   ┌────────────┬─────────┐        ┌──────────┬─────────────┐
   │customer_id │ email   │        │ order_id │ customer_id │
   │   (PK)     │         │        │  (PK)    │   (FK) ──────┼──┐
   ├────────────┼─────────┤        ├──────────┼─────────────┤  │
   │     1      │ alice   │◄───────┼─ 1        │      1      │  │
   │     2      │ bob     │        │  2        │      1      │  │
   └────────────┴─────────┘        └──────────┴─────────────┘  │
        ▲                                                       │
        └───────── every FK value must exist here ◄────────────┘

Foreign keys also control what happens when the referenced row changes or is deleted — ON DELETE CASCADE, ON DELETE SET NULL, ON DELETE RESTRICT. Mercado uses these deliberately: deleting a customer cascades to their addresses (ON DELETE CASCADE) but you can't accidentally orphan orders. We'll cover these referential actions in Chapter 14.

Try this. Ask PostgreSQL to show you a table's foreign keys: sql \d orders Look at the "Foreign-key constraints" section at the bottom. Every line is a rule the database enforces on your behalf — a promise about your data's consistency that you get for free once declared.


NULL: the billion-dollar mistake

Some attribute values are simply unknown or inapplicable. A customer might not have given a phone number; an order might not yet have a shipping address. The relational model represents "no value here" with a special marker: NULL.

NULL is not zero. It is not an empty string. It is not false. It means "unknown or absent." And that meaning has a consequence that surprises everyone at first: comparisons with NULL yield not true or false, but NULL (effectively "unknown").

This is three-valued logictrue, false, and unknown:

   Comparison              Result
   ─────────────────────   ────────
   5 = 5                    true
   5 = 6                    false
   5 = NULL                 NULL    (is 5 equal to an unknown? unknown!)
   NULL = NULL              NULL    (are two unknowns equal? unknown!)
   NULL <> 3                NULL

The practical fallout: you cannot test for NULL with =. This query returns no rows, even though some customers have no phone:

-- WRONG: phone = NULL is never true, so nothing matches.
SELECT * FROM customers WHERE phone = NULL;

You must use the special operators IS NULL / IS NOT NULL:

-- RIGHT: find customers with no phone number on file.
SELECT first_name, last_name FROM customers WHERE phone IS NULL;
 first_name | last_name
------------+-----------
 Frank      | Lee
 Irene      | Costa
(2 rows)

Three-valued logic also affects AND/OR and aggregate functions:

  • In a WHERE clause, only rows where the condition is exactly true are returned; unknown is treated like false (the row is excluded).
  • Aggregates like SUM, AVG, and COUNT(column) skip NULLs. COUNT(*) counts rows; COUNT(phone) counts only non-null phones. This distinction causes real reporting bugs (Chapter 7).
-- These can disagree, and the difference is exactly the NULL phones:
SELECT COUNT(*) AS all_customers, COUNT(phone) AS with_phone
FROM customers;
 all_customers | with_phone
---------------+------------
            12 |         10

Why did Tony Hoare, who introduced the null reference, later call it his "billion-dollar mistake"? Because NULL quietly breaks intuitions and causes bugs in every language and database that has it. You can't avoid NULL entirely — sometimes data really is unknown — but you should use it deliberately: declare columns NOT NULL whenever "missing" makes no sense (Mercado's email, price, order_date are all NOT NULL), and handle the NULLs you do allow with care.

Common mistake. Assuming WHERE status <> 'cancelled' returns every non-cancelled order. If status could be NULL, those rows are excluded, because NULL <> 'cancelled' is unknown, not true. (Mercado's status is NOT NULL, so it's safe there — but in the wild, always consider whether a column can be null before you filter on it.)


Integrity constraints: rules the data must obey

We've now met, piece by piece, the relational model's system of integrity constraints — the declared rules the DBMS enforces so your data can never enter a contradictory state. Gathered in one place:

Constraint Guarantees Mercado example
Domain / type A value belongs to the column's type price is numeric, not text
NOT NULL A value must be present customers.email
UNIQUE No two rows share this value customers.email, products.sku
Primary key Unique and not null (entity integrity) customer_id
Foreign key Reference must exist (referential integrity) orders.customer_idcustomers
CHECK A custom rule must hold price >= 0, rating BETWEEN 1 AND 5

The profound point — the one that connects all of Part I — is this: these rules are enforced by the database, once, in one place, for every client and every program forever. You don't have to remember to validate the rating in your web app, your mobile app, your admin script, and your bulk importer. You declare CHECK (rating BETWEEN 1 AND 5) on the reviews table, and no path into the database — no app, no typo, no rogue script — can ever store a rating of 7. The data defends itself.

-- The database rejects this, no matter who tries it:
INSERT INTO reviews (product_id, customer_id, rating)
VALUES (1, 2, 7);
-- ERROR:  new row for relation "reviews" violates check constraint "reviews_rating_check"

Why this matters (design, theme #1). Constraints are how a good design makes integrity automatic. The skill of database design is, in large part, the skill of identifying every rule your data must obey and expressing it as a constraint — so the rule holds by construction, not by hope. This is the through-line from here to Part III.


Referential actions: what happens when a referenced row changes

A foreign key guarantees that orders.customer_id always points to a real customer. But that raises a question: what should happen to a customer's orders when the customer themselves is deleted, or their customer_id changes? The relational model lets you declare the answer, and the database enforces it automatically. These are referential actions, attached to the foreign key:

Action On delete/update of the parent row… Use when
RESTRICT / NO ACTION refuse if any child rows reference it the reference is essential; deleting the parent should be a deliberate, explicit act
CASCADE also delete/update the child rows the child can't meaningfully exist without the parent
SET NULL …set the child's FK to NULL the relationship is optional; the child survives, just unlinked
SET DEFAULT …set the child's FK to its default value rare; the child should fall back to a default parent

Mercado uses these deliberately, and the choices encode real business rules:

-- An address belongs to its customer; if the customer is deleted,
-- their addresses should go too:
--   customer_id integer NOT NULL REFERENCES customers ON DELETE CASCADE
--
-- An order, by contrast, is a financial record. You must NOT be able to
-- delete a customer who has orders — that would erase history:
--   customer_id integer NOT NULL REFERENCES customers ON DELETE RESTRICT

So this fails, protecting your order history:

DELETE FROM customers WHERE customer_id = 1;   -- customer 1 has orders
-- ERROR: update or delete on table "customers" violates foreign key
--        constraint on table "orders"
-- DETAIL: Key (customer_id)=(1) is still referenced from table "orders".

…while deleting a customer who has only addresses (no orders) succeeds, and their addresses vanish with them via CASCADE.

Why this matters (design, theme #1). The referential action you choose is a policy decision about your data's lifecycle, made once at design time and enforced forever. CASCADE is convenient but dangerous — a single delete can ripple through many tables. RESTRICT is safe but forces you to clean up children explicitly. Choosing well requires understanding the business meaning of each relationship, which is exactly the skill Part III builds. The default in PostgreSQL is NO ACTION (essentially RESTRICT), which is the safe choice when you're unsure.


A first look at functional dependencies

Here is an idea that seems abstract now but becomes the engine of good design in Chapter 19, so it's worth planting the seed while keys are fresh in your mind. A functional dependency is a statement that one attribute (or set) determines another: knowing the first tells you the second, with no ambiguity. We write it A → B ("A determines B").

In customers, customer_id → email: given a customer's id, there is exactly one email. Also customer_id → first_name, customer_id → loyalty_tier, and so on. In fact, the primary key determines every other attribute in the table — that's practically what it means to be a key. We could write customer_id → (everything else).

Functional dependencies that don't involve the whole key are the interesting ones, because they signal trouble. Imagine someone designed a flat orders table that also stored the customer's email directly in every order row. Then within that table, customer_id → email still holds — but customer_id is not the key of orders (the key is order_id). You'd have a fact (customer_id → email) sitting in a table where it doesn't belong, which is precisely why the customer's email gets duplicated across 47 rows and goes stale when it changes — the exact Chapter 1 spreadsheet disaster.

   A well-placed dependency (good):       A misplaced dependency (bad):
   customers                              orders (denormalized, flat)
   customer_id → email                    order_id    → ...
   (the key determines email — fine)      customer_id → email   ← email depends on
                                                                  a NON-key column!
                                                                  duplication & rot

You don't need to do anything with this yet. Just notice the shape of the idea: a fact should live in the one table whose key determines it, and nowhere else. That single principle, made rigorous, is normalization (Chapter 19). The relational model gives you keys; functional dependencies tell you whether you've used them to put each fact in its rightful home.

Theory → Practice. This is the bridge from Part I to Part III. Right now, accept it as intuition: "store each fact once, in the table whose key owns it." In Chapter 19 you'll turn that intuition into a procedure (1NF → 2NF → 3NF) that mechanically detects and fixes misplaced dependencies. The vocabulary you're learning now — keys, dependencies — is the vocabulary that procedure is written in.


NULL in keys and constraints: the subtle cases

NULL's three-valued logic has specific consequences for the constraints you just learned, and these catch even experienced people:

NULL and UNIQUE. A UNIQUE constraint allows multiple NULLs, because two NULLs are not considered equal (NULL = NULL is unknown, not true). So a UNIQUE column that is also nullable can have many rows with no value — which is sometimes what you want and sometimes a surprise.

-- If 'phone' were UNIQUE and nullable, this would be ALLOWED —
-- two NULLs don't violate uniqueness:
--   ... phone text UNIQUE  -- nullable
-- Many customers could have NULL phone, all coexisting.

NULL and primary keys. A primary key is UNIQUE and NOT NULL — the NOT NULL half exists precisely to forbid the ambiguity above. You cannot identify a row by an unknown value, so entity integrity requires the key be present.

NULL and foreign keys. A foreign key column may be NULL (unless separately declared NOT NULL), and a NULL foreign key is considered to satisfy the constraint — it simply means "this row references nothing." That's how SET NULL works above: the child is allowed to point at no parent.

PostgreSQL also gives you tools to work with NULL deliberately rather than be ambushed by it:

-- COALESCE: use the first non-NULL value (a default for display):
SELECT first_name, COALESCE(phone, 'no phone on file') AS phone
FROM customers;

-- IS DISTINCT FROM: a NULL-SAFE comparison that treats NULL as a
-- normal value (NULL IS DISTINCT FROM NULL is FALSE — i.e. "they match"):
SELECT * FROM customers WHERE phone IS DISTINCT FROM '555-0101';
-- includes rows where phone IS NULL, unlike  phone <> '555-0101'

COALESCE (substitute a fallback) and IS DISTINCT FROM (compare without NULL weirdness) are the two tools that turn NULL from a source of silent bugs into something you handle on purpose. You'll reach for both constantly once you start writing real queries in Part II.

Common mistake. Expecting a UNIQUE constraint to prevent duplicate "missing" values. If you truly need "at most one row with no value here," a plain UNIQUE won't do it — you'd use a different tool (a partial unique index, Chapter 23). Knowing why (two NULLs aren't equal) is what lets you reach for the right fix.


The information principle: everything is a value in a relation

One last idea ties the chapter together and explains why the relational model feels so uniform. Codd insisted on what's now called the information principle: all information in a relational database is represented in exactly one way — as values in relations. No hidden pointers, no record positions, no out-of-band links. A customer references an order through a value (customer_id) that appears as a value in both tables — not through a memory address or a physical location on disk.

This sounds almost too simple, but it's the source of the model's two greatest strengths:

  • Data independence. Because relationships are values, not physical pointers, PostgreSQL is free to store and reorganize data however it likes — add an index, change the on-disk format, move to faster hardware — without any of your queries changing. You describe data by what it is, never by where it sits. (Contrast the 1960s network databases from Chapter 1, where the access path was the structure.)
  • Uniformity. Even the database's description of itself — its catalog of tables, columns, and constraints — is stored as values in relations you can query. There's one mechanism, applied all the way down.

Why this matters. Data independence is the reason the performance work in Part IV is possible without rewriting your application. You can add an index in Chapter 23 and a slow query becomes fast — and not a single line of the query changes, because the query never said how to find the data, only what data it wanted. That separation, guaranteed by the information principle, is the relational model's quiet gift to every developer who came after Codd.


Why this model has endured for 50+ years

Three properties, all visible now, explain the relational model's staying power:

  1. It's grounded in mathematics. Relations are sets; queries are logic and algebra (Chapter 4). This rigor is what lets the optimizer transform a query into an equivalent, faster one and still be provably correct.
  2. It separates logical from physical. You describe what your data is (relations, keys, constraints); the DBMS decides how to store and retrieve it. The storage can change — new indexes, new internal formats, new hardware — without your queries changing. This is data independence, and it's why the SQL you write today will still run in twenty years.
  3. It enforces integrity declaratively. You declare the rules; the system enforces them universally. Contrast the document-database story from Chapter 1's case study, where "schemaless" meant every rule had to be re-implemented, imperfectly, in application code.

This is theme #6 made concrete: the relational model is right for the vast majority of data problems — not by fashion, but because these three properties are exactly what most data management needs.


The closed-world assumption

There's a philosophical commitment hiding inside every query you'll write, and naming it resolves a surprising amount of later confusion. The relational model adopts the closed-world assumption: the database is taken to contain all the true facts; anything not present is assumed false (or non-existent).

When you ask "which customers have placed an order?" and customer_id = 7 has no rows in orders, the database concludes customer 7 has never ordered — not "we don't know whether they ordered." The absence of a fact is treated as the negation of that fact. Your data is the whole world, and the world is closed.

This is usually exactly what you want, but it has an edge. Consider a shipments table where a missing row means "not yet shipped." Under the closed-world assumption, "no shipment row" is read as "unshipped" — fine, if your data faithfully records every shipment. If a shipment happened but nobody entered it, the database will confidently tell you it didn't. The model's reasoning is only as good as the completeness of the facts you feed it.

Why this matters. The closed-world assumption is why an anti-join (Chapter 4) or a LEFT JOIN ... IS NULL can answer "which products were never reviewed?" at all — "never" is meaningful precisely because the database assumes its review rows are the complete set. It's also a quiet reminder of theme #1: your queries inherit the integrity of your data. Garbage in, confident-garbage out. Design and constraints are what keep the "world" in your database complete and true enough to reason over.


Designing a domain: types and constraints together

The relational model gives you several tools — types, NOT NULL, UNIQUE, CHECK, foreign keys — and good design uses them together to pin a column down to exactly its legal values. Let's design one column of Mercado's reviews table carefully, to see how the pieces combine, because this is the actual craft you'll practice for the rest of the book.

A product review has a rating. What is its true domain? "A whole number from 1 to 5, always present." Translate each clause into a tool:

  • "a number" → the type integer (or smallint — ratings are tiny)
  • "from 1 to 5" → a CHECK constraint: CHECK (rating BETWEEN 1 AND 5)
  • "always present" → NOT NULL
-- The domain of `rating`, fully expressed:
rating smallint NOT NULL CHECK (rating BETWEEN 1 AND 5)

Now the database itself guarantees that every rating, from every source, forever, is a whole number between 1 and 5. A typo'd 50, a missing value, the text "five" — all rejected at the door. Note how much sharper this is than the type alone: smallint permits −32768 through 32767; the CHECK narrows that to the five values the business actually allows. Type plus constraint equals domain.

The same table also shows constraints working across columns. A customer should review a given product at most once. That's not a fact about a single column — it's a rule about a combination, expressed as a multi-column UNIQUE:

-- One review per (product, customer): a table-level UNIQUE constraint.
UNIQUE (product_id, customer_id)

And both product_id and customer_id are foreign keys, so a review can never reference a product or customer that doesn't exist. Stack it all up and the reviews table becomes almost impossible to corrupt: every rating is in range, every review points at real entities, and no one can review the same product twice. That is what it means to design a domain well — you spend the constraints up front, and the integrity is free forever after.

Why this matters (design, theme #1). Beginners reach only for types and leave the rest to application code ("I'll validate the rating in the app"). But the app is one of many doors into the database — there's also the admin script, the bulk import, the next developer's quick fix. A constraint guards every door at once. The discipline to ask, for each column, "what is its true domain, and which tools express it?" is the single habit that most separates robust schemas from fragile ones.


Finding all the keys of a table

We said a candidate key is a minimal unique identifier, and that a table can have several. Being able to spot them all is a real design skill (and a frequent exam question for 🔬 CS students), because each candidate key is a uniqueness guarantee you might want to enforce, and the one you crown as primary key becomes the anchor for every foreign key pointing in.

Consider Mercado's inventory table, which records how many of each product sit in each warehouse:

   inventory(product_id, warehouse_id, quantity, reorder_level, updated_at)

Work through the candidates methodically:

  • Is {product_id} unique? No — the same product is stocked in many warehouses, so product_id repeats.
  • Is {warehouse_id} unique? No — a warehouse holds many products.
  • Is {product_id, warehouse_id} unique? Yes — there is exactly one inventory row per product per warehouse. And it's minimal (drop either column and uniqueness breaks). So it's a candidate key — in fact the one we chose as the composite primary key.
  • Is {quantity} a key? Of course not — many rows can share a quantity.

So inventory has a single candidate key, the composite (product_id, warehouse_id). Contrast customers, which has two candidate keys — {customer_id} and {email} — because each independently identifies a customer. We made customer_id primary (it's stable and simple) and protected email with UNIQUE (so the second candidate key is still enforced).

The procedure generalizes: for each combination of columns, ask "could two different real-world rows ever share these exact values?" If no, it's a superkey; trim it to minimal and you have a candidate key. Doing this consciously at design time prevents the classic mistake of discovering, months into production, that a column you assumed was unique isn't — and that duplicate data has already crept in.

Theory → Practice. This is why the relational model insists every table should have at least one candidate key (and you should declare one as the primary key). A table with no key is a bag of rows you can't reliably address, update, or reference — the relational equivalent of losing the index of a book. PostgreSQL won't force you to declare a primary key, but you almost always should, and Part III treats the rare exceptions.


A few misconceptions

"A table is just a spreadsheet grid." No — a relation is a set of tuples with enforced types, keys, and constraints, and with no inherent row order. The grid is the appearance; the model is the substance.

"NULL means zero / empty / false." NULL means unknown or absent, and it follows three-valued logic. Test for it with IS NULL, never =.

"The primary key is just for show." It's the enforcement of entity integrity (unique, not null) and the anchor every foreign key points to. It's load-bearing.

"Foreign keys slow things down, so skip them." Foreign keys have a small cost, but they prevent an entire category of corruption. Dropping them to save microseconds is almost always a false economy (we'll discuss the rare exceptions in Part IV).


Progressive project: model your entities and keys

Return to your project domain. For your central entities (the nouns you listed in Chapter 1):

  1. Choose a primary key for each. Prefer a surrogate key (an auto-generated id). Note any natural keys you'll also protect with UNIQUE (e.g., a library member's email, a book's ISBN, a student's ID number).
  2. Identify the foreign keys. For each pair of related entities, which one references the other? (A loan references a member and a book; an enrollment references a student and a section.)
  3. List your constraints. For at least three attributes, write the rule it must obey as a constraint: NOT NULL? UNIQUE? a CHECK (e.g., due_date >= loan_date, grade IN ('A','B','C','D','F'))?
  4. Decide your NULLs deliberately. Which attributes may legitimately be unknown (a member's phone), and which must always be present (a loan's due date)?

Capture this in project-notes.md. You're now doing real data modeling — Chapter 17 will give it formal notation, and Part III will turn it into a polished schema.


Summary

The relational model represents data as relations (tables) — sets of tuples (rows) of attribute (column) values drawn from domains (types). Because relations are sets, rows have no inherent order (use ORDER BY) and should be uniquely identifiable. Keys provide that identity: candidate keys are minimal unique identifiers, the primary key is the chosen one (unique + not null = entity integrity), and foreign keys enforce referential integrity between tables. NULL marks unknown/absent values and obeys three-valued logic — test it with IS NULL. Finally, integrity constraints (NOT NULL, UNIQUE, CHECK, primary and foreign keys) let the database enforce your data's rules once, for everyone, forever. These properties — mathematical grounding, data independence, and declarative integrity — are why the model has lasted half a century.

You can now: - Use the formal vocabulary (relation, tuple, attribute, domain, degree, cardinality) precisely. - Explain why rows have no inherent order and what that implies for queries. - Distinguish superkey, candidate key, and primary key; choose between natural and surrogate keys; design composite keys. - Explain referential integrity and what foreign keys guarantee. - Define NULL, apply three-valued logic, and test for it correctly. - List the integrity constraints and explain how they make integrity automatic.

What's next. Chapter 4 reveals the relational algebra — the handful of operations (selection, projection, join, union, difference) that every SQL query is secretly translated into. Once you can see the algebra underneath the SQL, queries stop being incantations and become compositions you can reason about.


Practice in exercises.md, test yourself with the quiz, apply the ideas in the case studies, review the key takeaways, and go deeper with further reading.