Case Study 1 — Designing Mercado's Sales Star Schema

DataField.Dev

Case Study 1 — Designing Mercado's Sales Star Schema

A worked example of dimensional modeling: turning Mercado's normalized OLTP schema into a star schema for analytics. The contrast with the transactional design (Part III) makes the "design follows workload" principle concrete.

The need

Mercado's leadership wanted self-service analytics: "revenue by category by month," "top customers by region this quarter," "how do gold-tier customers' baskets compare to standard?" — sliced and diced freely, over three years of history, fast. Running these against the live OLTP schema meant five-table joins (orders → order_items → products → categories, plus customers) aggregated over millions of rows, competing with live transactions. It was slow and risky (Case Study 2). They built a warehouse with a star schema.

Identifying the fact

The central question: what do we measure, and at what grain? Mercado analyzes sales, and the finest useful grain is one row per order line item (a product sold on an order) — fine enough to roll up to anything coarser. So the fact table is fact_sales, one row per order-item:

fact_sales (grain = one order line item)
  date_key      → dim_date
  product_key   → dim_product
  customer_key  → dim_customer
  -- measures (numeric, additive):
  quantity
  gross_revenue        (quantity * unit_price)
  discount_amount
  net_revenue          (gross_revenue - discount_amount)

The fact table is huge but narrow: foreign keys to dimensions + a handful of numeric measures. It will have hundreds of millions of rows but few columns.

Identifying the dimensions

The ways leadership slices sales become dimension tables — denormalized, descriptive, small:

dim_date(date_key, full_date, year, quarter, month, month_name, day_of_week, is_weekend, is_holiday)
dim_product(product_key, sku, product_name, category, department, supplier_name, ...)   -- denormalized!
dim_customer(customer_key, customer_name, loyalty_tier, region, city, signup_year, ...)

Note the deliberate denormalization (Chapter 20, vindicated): dim_product flattens product → category → department → supplier all into one table, even though Mercado's OLTP schema (rightly) split those into separate normalized tables. In the warehouse, that flattening is correct — it removes joins from analytical queries and makes slicing by category or department a single, fast group-by. Likewise dim_customer flattens customer + region + tier.

The date dimension is the workhorse: one row per calendar day with pre-computed year/quarter/month/weekend/holiday attributes, so any time-based slice ("by quarter," "weekdays only," "excluding holidays") is a simple group-by on a column — no date functions, no missing-date gaps.

The payoff: analytical queries become natural

Now leadership's questions are clean star-schema queries:

-- Net revenue by category by month
SELECT d.year, d.month_name, p.category, SUM(f.net_revenue) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, d.month, d.month_name, p.category
ORDER BY d.year, d.month, revenue DESC;

-- Gold vs standard customers' average basket
SELECT c.loyalty_tier, AVG(f.net_revenue) AS avg_line_revenue
FROM fact_sales f JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY c.loyalty_tier;

Every question follows the same shape — measures from the fact, sliced by dimension attributes, grouped, aggregated — which is exactly how analysts and BI tools (Tableau, Power BI, Looker) think. The warehouse is fed nightly from the OLTP database via ELT (Chapter 31): extract the day's orders, load into the warehouse, transform into fact and dimension rows.

The analysis

Design follows workload. The same business data is modeled normalized for OLTP (write integrity, Part III) and denormalized as a star for OLAP (read/aggregation speed). Neither is "more correct" — each fits its job. This is the book's design theme at its broadest.
The fact grain is the foundational decision. Choosing "one row per order line item" determines what every analysis can do. Too coarse (one row per order) loses product-level analysis; the right grain (per line item) rolls up to everything coarser. Get the grain right first.
Denormalized dimensions are a feature here. Flattening product→category→department into dim_product removes joins from analytical queries — the opposite of OLTP advice, and correct, because the warehouse is batch-loaded (no update anomalies to fear) and read-optimized (Chapter 20 at scale).
The date dimension is indispensable. Pre-computed calendar attributes turn every time-based slice into a trivial group-by and guarantee a complete, gap-free calendar.
Star schemas match how people analyze. "This measure, by these dimensions" is the universal analytical question, and the star schema is its physical form — which is why BI tools assume it.

Discussion questions

Why is "one row per order line item" the right fact grain, and what would too coarse a grain cost?
Why is flattening product→category→department into one dimension table correct here but wrong in the OLTP schema?
What does the date dimension provide that querying order_date directly does not?
How does the star schema make leadership's varied questions all the same query shape?
⭐ Leadership now wants "as-of" category (the category a product had when sold, even after recategorization). Which SCD type, and how does it change dim_product and the fact load?