Case Study 1 — Designing Mercado's Sales Star Schema
A worked example of dimensional modeling: turning Mercado's normalized OLTP schema into a star schema for analytics. The contrast with the transactional design (Part III) makes the "design follows workload" principle concrete.
The need
Mercado's leadership wanted self-service analytics: "revenue by category by month," "top customers by region this quarter," "how do gold-tier customers' baskets compare to standard?" — sliced and diced freely, over three years of history, fast. Running these against the live OLTP schema meant five-table joins (orders → order_items → products → categories, plus customers) aggregated over millions of rows, competing with live transactions. It was slow and risky (Case Study 2). They built a warehouse with a star schema.
Identifying the fact
The central question: what do we measure, and at what grain? Mercado analyzes sales, and the finest useful grain is one row per order line item (a product sold on an order) — fine enough to roll up to anything coarser. So the fact table is fact_sales, one row per order-item:
fact_sales (grain = one order line item)
date_key → dim_date
product_key → dim_product
customer_key → dim_customer
-- measures (numeric, additive):
quantity
gross_revenue (quantity * unit_price)
discount_amount
net_revenue (gross_revenue - discount_amount)
The fact table is huge but narrow: foreign keys to dimensions + a handful of numeric measures. It will have hundreds of millions of rows but few columns.
Identifying the dimensions
The ways leadership slices sales become dimension tables — denormalized, descriptive, small:
dim_date(date_key, full_date, year, quarter, month, month_name, day_of_week, is_weekend, is_holiday)
dim_product(product_key, sku, product_name, category, department, supplier_name, ...) -- denormalized!
dim_customer(customer_key, customer_name, loyalty_tier, region, city, signup_year, ...)
Note the deliberate denormalization (Chapter 20, vindicated): dim_product flattens product → category → department → supplier all into one table, even though Mercado's OLTP schema (rightly) split those into separate normalized tables. In the warehouse, that flattening is correct — it removes joins from analytical queries and makes slicing by category or department a single, fast group-by. Likewise dim_customer flattens customer + region + tier.
The date dimension is the workhorse: one row per calendar day with pre-computed year/quarter/month/weekend/holiday attributes, so any time-based slice ("by quarter," "weekdays only," "excluding holidays") is a simple group-by on a column — no date functions, no missing-date gaps.
The payoff: analytical queries become natural
Now leadership's questions are clean star-schema queries:
-- Net revenue by category by month
SELECT d.year, d.month_name, p.category, SUM(f.net_revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, d.month, d.month_name, p.category
ORDER BY d.year, d.month, revenue DESC;
-- Gold vs standard customers' average basket
SELECT c.loyalty_tier, AVG(f.net_revenue) AS avg_line_revenue
FROM fact_sales f JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY c.loyalty_tier;
Every question follows the same shape — measures from the fact, sliced by dimension attributes, grouped, aggregated — which is exactly how analysts and BI tools (Tableau, Power BI, Looker) think. The warehouse is fed nightly from the OLTP database via ELT (Chapter 31): extract the day's orders, load into the warehouse, transform into fact and dimension rows.
The analysis
-
Design follows workload. The same business data is modeled normalized for OLTP (write integrity, Part III) and denormalized as a star for OLAP (read/aggregation speed). Neither is "more correct" — each fits its job. This is the book's design theme at its broadest.
-
The fact grain is the foundational decision. Choosing "one row per order line item" determines what every analysis can do. Too coarse (one row per order) loses product-level analysis; the right grain (per line item) rolls up to everything coarser. Get the grain right first.
-
Denormalized dimensions are a feature here. Flattening product→category→department into
dim_productremoves joins from analytical queries — the opposite of OLTP advice, and correct, because the warehouse is batch-loaded (no update anomalies to fear) and read-optimized (Chapter 20 at scale). -
The date dimension is indispensable. Pre-computed calendar attributes turn every time-based slice into a trivial group-by and guarantee a complete, gap-free calendar.
-
Star schemas match how people analyze. "This measure, by these dimensions" is the universal analytical question, and the star schema is its physical form — which is why BI tools assume it.
Discussion questions
- Why is "one row per order line item" the right fact grain, and what would too coarse a grain cost?
- Why is flattening product→category→department into one dimension table correct here but wrong in the OLTP schema?
- What does the date dimension provide that querying
order_datedirectly does not? - How does the star schema make leadership's varied questions all the same query shape?
- ⭐ Leadership now wants "as-of" category (the category a product had when sold, even after recategorization). Which SCD type, and how does it change
dim_productand the fact load?