Case Study 2 — The Comma-Separated Tags
The most common ER-modeling mistake in the wild: cramming a many-to-many relationship (or a multi-valued attribute) into a single text column. It looks easy, ships fast, and then makes every subsequent feature painful. The cure is a junction table — and recognizing the M:N relationship in the first place.
Background
A content platform let articles have tags. The original developer, wanting to ship quickly, added a single column:
CREATE TABLE articles (
article_id integer PRIMARY KEY,
title text,
tags text -- e.g., 'postgres, databases, sql'
);
It worked in the demo. Then the feature requests came, and each one was a fight:
- "Show all articles tagged
postgres." You can't justWHERE tags = 'postgres'(the tag is buried in a list). You resort toWHERE tags LIKE '%postgres%'— which also matchespostgresqlandpostgres-tips, and can't use an index (Chapter 5). False matches and full scans. - "Rename the tag
postgrestoPostgreSQLeverywhere." There's no tag entity to rename; you mustUPDATEevery article's text with fragile string surgery (replace), risking partial matches and corruptingpostgres-tipsintoPostgreSQL-tips. - "List all tags, with how many articles each has." There's no list of tags — they exist only as fragments inside article strings. Counting them requires splitting every row's text (
string_to_array/unnest) and aggregating — slow and awkward, and it can't enforce a canonical tag set. - "Don't allow misspelled or duplicate tags." Nothing stops
postgre,Postgres,postgres, andpostgresfrom all coexisting. There's no integrity, because there's no tag entity to constrain or reference. - "A tag has a description and a color." Now you need attributes on the tag — impossible when the tag is just a substring.
Every one of these is trivial with the right model and miserable with the comma-separated column.
What went wrong: a hidden M:N
The relationship between articles and tags is many-to-many: an article has many tags, and a tag applies to many articles. The text column tried to pretend it was a single attribute, hiding the M:N. As Chapter 17 stresses, M:N relationships require a junction table — and tags are also a multi-valued attribute, which likewise becomes its own table. Two signals, same fix.
The correct model
Three entities — articles, tags, and the junction article_tags:
CREATE TABLE tags (
tag_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name text NOT NULL UNIQUE, -- canonical, no duplicates
description text,
color text -- tags can now have attributes
);
CREATE TABLE article_tags ( -- the junction (M:N resolution)
article_id integer NOT NULL REFERENCES articles (article_id) ON DELETE CASCADE,
tag_id integer NOT NULL REFERENCES tags (tag_id) ON DELETE CASCADE,
PRIMARY KEY (article_id, tag_id)
);
Now every painful request becomes a clean query:
-- Articles tagged 'postgres' (exact, indexable):
SELECT a.* FROM articles a
JOIN article_tags at ON at.article_id = a.article_id
JOIN tags t ON t.tag_id = at.tag_id
WHERE t.name = 'postgres';
-- Rename a tag everywhere, in one place:
UPDATE tags SET name = 'PostgreSQL' WHERE name = 'postgres';
-- Tag usage counts:
SELECT t.name, COUNT(*) AS articles
FROM tags t JOIN article_tags at ON at.tag_id = t.tag_id
GROUP BY t.name ORDER BY articles DESC;
The UNIQUE(name) on tags enforces a canonical tag set (no duplicates/misspellings proliferating); the foreign keys guarantee integrity; the junction's composite PK prevents tagging the same article twice with the same tag; and tags can now carry their own attributes.
The analysis
-
A list in a column is an unmodeled M:N (or multi-valued attribute). The comma-separated string is the symptom; the disease is failing to recognize that articles↔tags is many-to-many. Both signals — "many on both sides" and "a multi-valued attribute" — point to a separate table.
-
The junction table is the universal cure.
article_tags(article_id, tag_id)restores querying (joins, exact matches, indexes), integrity (FKs, unique tags), and extensibility (attributes on the tag, and even on the relationship — e.g., who applied the tag, when). -
"It shipped fast" is a false economy. The column saved an hour at creation and cost weeks across every tag-related feature, plus data-quality problems (duplicate/misspelled tags) that are hard to clean up retroactively.
-
String-surgery on packed columns is fragile.
LIKE '%tag%'andreplace()produce false matches and corruption (postgresinsidepostgres-tips). Structured data avoids the whole class of bug. -
This is theme #1, vividly. Design is the most important skill. Recognizing the M:N at modeling time (Chapter 17) and resolving it with a junction (Chapter 18) makes everything downstream easy. The shortcut at design time becomes the bottleneck forever.
Discussion questions
- Identify the relationship between articles and tags. Why is a single column the wrong model for it?
- For each of the five feature requests, explain why it's hard with the column and easy with the junction.
- What does
UNIQUE(name)ontagsbuy you that the text column never could? - The junction's primary key is
(article_id, tag_id). What rule does that enforce? - ⭐ Suppose you later need "who applied this tag and when." Where does that data go, and why is it now trivial to add?