Case Study 1 — Partitioning a 2-Billion-Row Events Table

Partitioning earning its keep on a genuinely large table. An events table grew until both queries and the nightly retention purge became unsustainable. Range partitioning by time fixed both — pruning for reads, instant DROP for retention.

Background

An analytics platform logged user events into a single events table: (event_id, user_id, event_type, event_time, payload). It grew by tens of millions of rows per day, reaching ~2 billion rows. Two operations were collapsing:

  1. Queries by time window ("events in the last 7 days," "events on 2024-06-01") — the most common query — were slow. Even with an index on event_time, the index itself spanned two years of data and was enormous; scanning it for a recent window touched a huge structure.
  2. The retention purge. Policy kept 90 days of events. The nightly job ran DELETE FROM events WHERE event_time < now() - interval '90 days', removing tens of millions of rows. This DELETE was slow, generated enormous WAL, bloated the table (dead tuples, Chapter 28), held locks, and competed with ingestion. VACUUM then struggled to keep up. The purge was becoming the system's biggest operational pain.

Both problems are exactly what partitioning addresses.

The redesign: range partitioning by day/month

They converted events to a table partitioned by RANGE (event_time), with daily partitions (given the volume and the day-grained retention/queries):

CREATE TABLE events (
    event_id   bigint GENERATED ALWAYS AS IDENTITY,
    user_id    integer NOT NULL,
    event_type text NOT NULL,
    event_time timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (event_id, event_time)        -- partition key in the PK
) PARTITION BY RANGE (event_time);

-- one partition per day (created ahead of time by an automated job):
CREATE TABLE events_2024_06_01 PARTITION OF events
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');
-- ...
CREATE INDEX ON events (user_id);             -- propagates to all partitions

An automated job pre-creates the next several days' partitions so ingestion never hits a missing partition.

The payoffs

Queries prune. "Events on 2024-06-01" now scans one partition (one day's data) instead of the index over two years:

SELECT * FROM events WHERE event_time >= '2024-06-01' AND event_time < '2024-06-02';
-- EXPLAIN: scans only events_2024_06_01; all other partitions pruned

A "last 7 days" query prunes to 7 partitions. The query planner skips ~99% of the data automatically — far faster than scanning a global index, and each partition's index is small and shallow.

Retention becomes instant. The 90-day purge stopped being a DELETE and became a DROP:

-- drop the day that just aged out of the 90-day window:
DROP TABLE events_2024_03_03;     -- or DETACH then archive

Dropping a partition is a near-instant metadata operation that frees the storage immediately — no row-by-row delete, no WAL flood, no bloat, no lock contention with ingestion, no VACUUM aftermath. The nightly purge went from the system's biggest pain to a one-line, sub-second operation. This alone justified the migration.

Maintenance scales. VACUUM/ANALYZE run per daily partition (small, fast, only on partitions that changed), and ingestion writes to a small "today" partition with a small index rather than appending to one colossal index.

The analysis

  1. Partitioning solves problems indexes can't. An index on event_time made lookups possible, but the index was still global and huge, and it did nothing for the retention DELETE. Partitioning shrank both the data scanned (pruning) and the maintenance unit (per-partition), and turned retention into DROP.

  2. Range-by-time is the canonical fit. The table grew by time, was queried by time, and was retained by time — so the partition key (event_time) matched all three access patterns. Pruning works precisely because the hot queries filter on the partition key.

  3. Retention via DROP is the killer feature for time-series/log data. Replacing a giant recurring DELETE with a partition DROP eliminates WAL floods, bloat, lock contention, and VACUUM pressure in one stroke. For any "keep N days" table at scale, this is reason enough to partition.

  4. Automate partition creation. Pre-creating future partitions (and dropping aged ones) is operational work that must be automated (a scheduled job / pg_partman), or ingestion eventually hits a missing partition.

  5. Granularity matched the workload. Daily partitions suited day-grained retention and queries at this volume; monthly would have made retention coarse and partitions huge. Choose granularity by your query/retention grain and volume.

Discussion questions

  1. Why did even an index on event_time leave both queries and the purge painful?
  2. Explain how pruning makes a "last 7 days" query fast on a 2-billion-row table.
  3. Why is partition DROP so much cheaper than a DELETE for retention? (Tie to WAL/bloat/VACUUM, Ch. 28.)
  4. Why does the partition key (event_time) being in the hot queries' WHERE matter?
  5. ⭐ What must be automated for this design to work in production, and what breaks if it isn't?