Case Study 2 — The Analytics Query That Took Down Checkout

DataField.Dev

Case Study 2 — The Analytics Query That Took Down Checkout

Running heavy analytics on the live transactional database isn't just slow — it can take down the application. A monster "executive dashboard" query competed with live orders for resources and locked the OLTP database during peak hours. The fix was to separate the workloads.

Background

An e-commerce company's executives wanted a live dashboard: revenue trends, top products, cohort analysis — all computed from the production orders/order_items tables. The BI tool ran these directly against the live OLTP database (the same one serving checkout). For a while it worked. Then the dashboards grew more ambitious (three-year trends, multi-dimensional breakdowns), and one morning the checkout API slowed to a crawl and started timing out during the daily traffic peak — exactly when executives were also refreshing their dashboards.

What went wrong: mixing OLTP and OLAP on one database

The dashboards' analytical queries (OLAP) and the application's transactions (OLTP) were competing for the same database's resources, and they have opposite profiles:

The dashboard queries scanned and aggregated millions of rows across many joins (the normalized OLTP schema made them join-heavy — Chapter 34). Each consumed huge amounts of CPU, memory (work_mem for sorts/hashes), and I/O, and ran for minutes.
Meanwhile, checkout needed many small, fast transactions with low latency.

When several heavy dashboard queries ran during peak, they saturated the database's resources and buffer cache (Chapter 28), starving the small transactional queries. Checkout transactions queued behind the analytics, latency spiked, and the customer-facing app degraded. The analytics workload was cannibalizing the transactional workload — on the same server, the big slow queries crowded out the small fast ones.

This is the fundamental reason warehouses exist (Chapter 34): OLAP and OLTP have conflicting needs and should not share one database under load.

The fix: separate the workloads

The team separated analytics from transactions, in stages:

1. Immediate relief — a read replica. They set up a streaming read replica (Chapter 28/35) of the production database and pointed the BI tool at the replica, not the primary. Analytical queries now ran against a copy, leaving the primary's resources for checkout. The replica stays nearly current via WAL replay, so dashboards saw fresh-enough data, and the production app was insulated.

2. Proper fix — a data warehouse. For the heavier analytics (multi-year, multi-dimensional), even a replica of the normalized schema was slow (join-heavy). They built a star-schema warehouse (Case Study 1), loaded nightly via ELT (Chapter 31). Dashboards queried the warehouse — denormalized and (in their cloud warehouse) columnar — so the big aggregations were both fast and completely isolated from the production database.

The result: checkout latency returned to normal (analytics no longer competed with it), and the dashboards got faster (the warehouse is built for exactly those queries). Each workload ran on infrastructure designed for it.

The analysis

OLAP and OLTP don't belong on the same database under load. Their resource profiles conflict: huge slow aggregations vs. many small fast transactions. Run together at scale, the analytics starve the transactions — and the customer-facing app suffers. Separation isn't a nicety; it's why warehouses exist.
A read replica is the quick win. Pointing reporting at a replica offloads analytical reads from the primary with minimal effort, insulating the production app. It's often the first step before a full warehouse.
A warehouse is the proper fix for heavy analytics. A replica of the normalized schema is still join-heavy and row-oriented. A star-schema, columnar warehouse makes the analytics fast and isolated — purpose-built for the workload (Chapter 34).
The symptom looked like a database problem; the cause was a workload-mixing problem. Checkout slowness during dashboard-refresh peaks pointed at resource contention between two incompatible workloads — diagnosed by noticing what else was running, not just the slow query itself.
Right tool, right workload — across systems. Just as you pick the right index or the right NoSQL store, you separate analytical and transactional workloads onto systems designed for each. It's the same judgment at the architecture level.

Discussion questions

Why do OLAP and OLTP workloads conflict when sharing one database?
How did the dashboard queries starve the checkout transactions (think resources/cache)?
What does a read replica fix immediately, and what does it not fix for heavy analytics?
Why is a star-schema columnar warehouse better than a replica of the normalized schema for big analytics?
⭐ The slowness only appeared at peak. How would you diagnose "checkout is slow" to discover it was analytics contention?