Case Study 2 — Partitioned by the Wrong Key

DataField.Dev

Case Study 2 — Partitioned by the Wrong Key

Partitioning isn't automatically a win — it only helps if the partition key matches how you query. A team partitioned a big table by the wrong column, got zero pruning benefit, and inherited all of partitioning's complexity for none of its payoff.

Background

A team had a large orders table (~500M rows) and, having read that "partitioning makes big tables fast," decided to partition it. The most-discussed query was "all orders for a given customer" (the customer-360 page, dashboards, support tools) — overwhelmingly, queries filtered by customer_id. But the team partitioned by order_date (range, monthly), reasoning that dates "felt like the natural thing to partition on."

After the migration, the customer queries were no faster — in some cases slightly slower — and the team now had dozens of partitions to manage. They'd taken on all the complexity of partitioning and gotten none of the speed.

What went wrong

Partition pruning only helps when the query filters on the partition key. The hot query filtered by customer_id:

SELECT * FROM orders_part WHERE customer_id = 4267;

But the table was partitioned by order_date. A customer's orders are spread across every monthly partition (a customer orders over time), so PostgreSQL cannot prune — it must check all partitions for matching customer_id rows. EXPLAIN confirmed it: the plan scanned (or index-scanned) every partition, then combined the results. There was no pruning, because the filter column wasn't the partition key.

Worse, partitioning had subtle downsides for this access pattern:

The query now had to consult an index per partition and merge results (an "Append" over many partition scans), which can be slower than one index on a single unpartitioned table.
The team had taken on partition management (creating/dropping monthly partitions, the partition-key-in-PK constraint) for a query pattern that got no benefit.

The diagnosis was simple in hindsight: the partition key (order_date) didn't match the hot query's filter (customer_id).

The fixes

Two reasonable resolutions, depending on the real requirements:

Option A — Don't partition; just index. For "orders by customer," a single unpartitioned orders table with an index on customer_id (Chapter 23) is exactly right — one index scan, no per-partition overhead. If the table's size is manageable with good indexing and there's no time-based retention need, partitioning was simply unnecessary. They could revert to a plain table + index. (At 500M rows, this is often fine.)

Option B — Partition by what you actually query/retain. If there is a real partitioning need, choose the key by the access pattern: - If most queries filter by customer_id, hash-partition by customer_id so each customer's rows cluster into one partition and customer queries prune to it. - If there's also a strong time-based retention/query need, consider composite/sub-partitioning (e.g., range by date, then hash by customer) — more complex, but matches multiple patterns.

The team had no retention requirement and queried by customer, so Option A (unpartition + index on customer_id) was the right call. Customer queries became fast (single index scan), and the partition-management burden disappeared.

The analysis

Partition by the key your hot queries filter on. Pruning — the whole performance point — only works when the WHERE references the partition key. Partitioning by order_date while querying by customer_id gives pruning on date queries (which they rarely ran) and nothing on the customer queries (which they ran constantly).
"Big table" alone doesn't mean "partition." Partitioning is for specific problems: queries/retention bounded by a key, or maintenance on a colossal table. If a good index on the queried column solves your reads and you have no retention/maintenance pain, don't partition — an index is simpler and often faster for that access pattern.
Partitioning has costs even when it doesn't help. Per-partition index merges, the partition-key-in-PK constraint, and partition management are real overhead. Taken on without a matching benefit, partitioning is a net negative.
Match the strategy to the pattern. Range-by-time suits time-bounded queries/retention; hash-by-customer suits customer-bounded queries; sub-partitioning matches multiple patterns at the cost of complexity. Choose by how you actually query and retain, not by what "feels natural."
Diagnose with EXPLAIN. The plan showed scans across all partitions (no pruning) — the unambiguous sign the partition key was wrong for the workload. As always, the plan tells the truth (Chapter 24).

Discussion questions

Why did partitioning by order_date give no benefit to WHERE customer_id = ... queries?
How would EXPLAIN reveal that pruning isn't happening?
When is "just index it, don't partition" the right answer for a large table?
If the team genuinely needed partitioning, how would the strategy differ for customer-bounded vs. time-bounded access?
⭐ Describe a workload where sub-partitioning (range-then-hash) would be justified, and its added cost.