Case Study 2 — The Comma-Separated Tags

DataField.Dev

Case Study 2 — The Comma-Separated Tags

The most common ER-modeling mistake in the wild: cramming a many-to-many relationship (or a multi-valued attribute) into a single text column. It looks easy, ships fast, and then makes every subsequent feature painful. The cure is a junction table — and recognizing the M:N relationship in the first place.

Background

A content platform let articles have tags. The original developer, wanting to ship quickly, added a single column:

CREATE TABLE articles (
    article_id integer PRIMARY KEY,
    title      text,
    tags       text          -- e.g., 'postgres, databases, sql'
);

It worked in the demo. Then the feature requests came, and each one was a fight:

"Show all articles tagged postgres." You can't just WHERE tags = 'postgres' (the tag is buried in a list). You resort to WHERE tags LIKE '%postgres%' — which also matches postgresql and postgres-tips, and can't use an index (Chapter 5). False matches and full scans.
"Rename the tag postgres to PostgreSQL everywhere." There's no tag entity to rename; you must UPDATE every article's text with fragile string surgery (replace), risking partial matches and corrupting postgres-tips into PostgreSQL-tips.
"List all tags, with how many articles each has." There's no list of tags — they exist only as fragments inside article strings. Counting them requires splitting every row's text (string_to_array/unnest) and aggregating — slow and awkward, and it can't enforce a canonical tag set.
"Don't allow misspelled or duplicate tags." Nothing stops postgre, Postgres, postgres, and postgres from all coexisting. There's no integrity, because there's no tag entity to constrain or reference.
"A tag has a description and a color." Now you need attributes on the tag — impossible when the tag is just a substring.

Every one of these is trivial with the right model and miserable with the comma-separated column.

What went wrong: a hidden M:N

The relationship between articles and tags is many-to-many: an article has many tags, and a tag applies to many articles. The text column tried to pretend it was a single attribute, hiding the M:N. As Chapter 17 stresses, M:N relationships require a junction table — and tags are also a multi-valued attribute, which likewise becomes its own table. Two signals, same fix.

The correct model

Three entities — articles, tags, and the junction article_tags:

CREATE TABLE tags (
    tag_id      integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        text NOT NULL UNIQUE,        -- canonical, no duplicates
    description text,
    color       text                          -- tags can now have attributes
);

CREATE TABLE article_tags (                    -- the junction (M:N resolution)
    article_id integer NOT NULL REFERENCES articles (article_id) ON DELETE CASCADE,
    tag_id     integer NOT NULL REFERENCES tags (tag_id)         ON DELETE CASCADE,
    PRIMARY KEY (article_id, tag_id)
);

Now every painful request becomes a clean query:

-- Articles tagged 'postgres' (exact, indexable):
SELECT a.* FROM articles a
JOIN article_tags at ON at.article_id = a.article_id
JOIN tags t ON t.tag_id = at.tag_id
WHERE t.name = 'postgres';

-- Rename a tag everywhere, in one place:
UPDATE tags SET name = 'PostgreSQL' WHERE name = 'postgres';

-- Tag usage counts:
SELECT t.name, COUNT(*) AS articles
FROM tags t JOIN article_tags at ON at.tag_id = t.tag_id
GROUP BY t.name ORDER BY articles DESC;

The UNIQUE(name) on tags enforces a canonical tag set (no duplicates/misspellings proliferating); the foreign keys guarantee integrity; the junction's composite PK prevents tagging the same article twice with the same tag; and tags can now carry their own attributes.

The analysis

A list in a column is an unmodeled M:N (or multi-valued attribute). The comma-separated string is the symptom; the disease is failing to recognize that articles↔tags is many-to-many. Both signals — "many on both sides" and "a multi-valued attribute" — point to a separate table.
The junction table is the universal cure. article_tags(article_id, tag_id) restores querying (joins, exact matches, indexes), integrity (FKs, unique tags), and extensibility (attributes on the tag, and even on the relationship — e.g., who applied the tag, when).
"It shipped fast" is a false economy. The column saved an hour at creation and cost weeks across every tag-related feature, plus data-quality problems (duplicate/misspelled tags) that are hard to clean up retroactively.
String-surgery on packed columns is fragile. LIKE '%tag%' and replace() produce false matches and corruption (postgres inside postgres-tips). Structured data avoids the whole class of bug.
This is theme #1, vividly. Design is the most important skill. Recognizing the M:N at modeling time (Chapter 17) and resolving it with a junction (Chapter 18) makes everything downstream easy. The shortcut at design time becomes the bottleneck forever.

Discussion questions

Identify the relationship between articles and tags. Why is a single column the wrong model for it?
For each of the five feature requests, explain why it's hard with the column and easy with the junction.
What does UNIQUE(name) on tags buy you that the text column never could?
The junction's primary key is (article_id, tag_id). What rule does that enforce?
⭐ Suppose you later need "who applied this tag and when." Where does that data go, and why is it now trivial to add?