Chapter 34: Data Warehousing — Star Schemas, Dimensional Modeling, and Databases for Analytics

DataField.Dev

31 min read

> Where you are: Part VI, Chapter 34 of 40. The databases you've designed are transactional — optimized for running an application. This chapter is about databases optimized for analytics — answering "how is the business doing?" over huge histories...

In This Chapter

Two different jobs
The star schema
Facts and dimensions
OLTP and OLAP: why they're opposites
The star schema and the fact grain
Slowly changing dimensions and columnar storage
Snowflake schema and slowly changing dimensions
Column-oriented storage
Modern analytical databases
Denormalization is the norm here (Chapter 20, vindicated)
Data marts, data lakes, and the lakehouse
Dimensional modeling as a discipline
The two-database reality
A note on real-time analytics
Common mistakes
A worked scenario: a sales star schema for Mercado
The modern data stack and when you need a warehouse
Progressive project: think analytically
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 34: Data Warehousing — Star Schemas, Dimensional Modeling, and Databases for Analytics

Where you are: Part VI, Chapter 34 of 40. The databases you've designed are transactional — optimized for running an application. This chapter is about databases optimized for analytics — answering "how is the business doing?" over huge histories. Different purpose, different design.

Learning paths: 📊 Analyst / data engineer (core); 💻 🏗️ (understanding the analytical side); 🔬 CS students (the OLTP/OLAP and storage distinctions).

Two different jobs

The Mercado database you've built is an OLTP system — Online Transaction Processing. It's optimized for the application's daily work: many small, fast reads and writes, strong consistency, normalized to protect integrity (Part III). "Place an order," "update a profile."

Analytics is a different job: OLAP — Online Analytical Processing. "What were sales by category by region by month for the last three years, and how does this quarter compare?" These queries scan huge amounts of historical data, aggregate heavily, and run relatively infrequently — but they must be fast over billions of rows. Running them against the live OLTP database is a poor fit: they'd compete with transactions, and the normalized schema (great for writes) makes analytical queries join-heavy and slow.

The solution is a data warehouse — a separate database designed for analytics, fed from the OLTP systems (and other sources) via ETL/ELT (Chapter 31). Its design priorities are the opposite of OLTP's: optimized for reads and aggregation over history, deliberately denormalized, often column-oriented.

	OLTP (transactional)	OLAP (analytical / warehouse)
Purpose	run the application	analyze the business
Workload	many small reads/writes	few huge aggregating reads
Schema	normalized (Part III)	denormalized (star schema)
Optimized for	write integrity, fast lookups	read/aggregation over history
Storage	row-oriented	often column-oriented
Example	"place this order"	"revenue by category by month, 3 yrs"

The star schema

The dominant warehouse design is the star schema: a central fact table surrounded by dimension tables (it looks like a star). This is dimensional modeling, and it's deliberately denormalized for analytical reads.

            dim_date          dim_customer
                \                /
                 \              /
   dim_product ── FACT: sales ── dim_store
                 /              \
                /                \
          dim_promotion       dim_channel

Fact table — the measurements/events you analyze, at a fine grain (one row per sale, per order line, per event). It holds numeric measures (quantity, revenue, cost) and foreign keys to the dimensions. Fact tables are huge (billions of rows) but narrow.
Dimension tables — the context by which you slice the facts: who (customer), what (product), when (date), where (store), how (channel). They hold descriptive attributes (product name, category, customer segment, date's month/quarter/year) and are small relative to facts.

A typical analytical query joins the fact to a few dimensions and aggregates:

-- Revenue by product category by month (star-schema query)
SELECT d.year, d.month, p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, d.month, p.category
ORDER BY d.year, d.month, revenue DESC;

The star schema makes this natural: pick measures from the fact, slice by attributes from the dimensions, group, aggregate. Analysts (and BI tools like Tableau, Power BI, Looker) think in exactly these terms — "this measure, by these dimensions."

Facts and dimensions

The dimensional-modeling mindset:

Facts are what you measure — additive numbers you SUM/AVG/COUNT (revenue, units, duration). The fact table's grain (what one row represents — one order line? one daily snapshot?) is the most important design decision.
Dimensions are how you slice — the descriptive context. A date dimension is almost universal: one row per day with columns for year, quarter, month, day-of-week, holiday flags — so you can group by any time attribute without date functions, and join to a clean calendar.

This is the analytical counterpart to ER modeling (Chapter 17): instead of entities and relationships normalized for writes, you model measurements and the dimensions you analyze them by, denormalized for reads.

OLTP and OLAP: why they're opposites

The foundation of data warehousing is the distinction between transactional (OLTP) and analytical (OLAP) workloads, and understanding why they're not just different but opposite — requiring opposite designs — is what makes the whole subject coherent. They optimize for contradictory things, which is precisely why they need separate systems.

OLTP — the Mercado database you've built — supports an application's operations: placing orders, updating profiles, checking inventory. Its workload is many small, concurrent reads and writes, each touching a few rows, requiring strong consistency (the order must be correct) and fast lookups (find this customer's data now). To serve this, OLTP databases are normalized (Part III — each fact in one place, so updates are clean and consistent) and row-oriented (a whole row stored together, so reading or writing one record is efficient). OLAP — analytics — supports understanding the business: revenue by category by region by quarter over three years. Its workload is few, huge, aggregating reads over vast history, scanning millions or billions of rows to compute summaries, run infrequently but needing to be fast over enormous data. Writes are batch loads, not transactional updates. To serve this, analytical databases are denormalized (star schemas — duplicate descriptive data to avoid joins, optimize for reads) and often column-oriented (each column stored together, so aggregating one column over everything reads only that column).

Notice how every design choice is opposite. OLTP normalizes (clean writes); OLAP denormalizes (fast reads). OLTP is row-oriented (whole-row access); OLAP is column-oriented (column aggregation). OLTP prioritizes write integrity and consistency (transactions matter); OLAP prioritizes read and aggregation speed (the data is loaded in controlled batches, so transactional integrity during updates isn't the concern). OLTP serves many small operations; OLAP serves few huge analyses. These aren't minor differences in tuning — they're contradictory optimization targets, which is why you can't serve both well with one design. A schema normalized for OLTP makes OLAP queries join-heavy and slow; a schema denormalized for OLAP makes OLTP updates redundant and inconsistent. Row-oriented storage is wrong for column aggregation; column-oriented storage is wrong for whole-row transactions.

This opposition is why analytics gets a separate database — the data warehouse — rather than running on the OLTP system. Running heavy analytics on the live OLTP database is a poor fit on two counts: the analytical queries compete with transactions for resources (a billion-row aggregation slows down the order-placement the OLTP system exists for), and the normalized OLTP schema makes those analytical queries slow (join-heavy). The warehouse solves both: it's a separate system (so analytics doesn't compete with transactions) designed the opposite way (denormalized, columnar — so analytics is fast). The two databases serve the two workloads, each optimized for its job, fed by a pipeline (ETL/ELT, Chapter 31) that moves and reshapes data from the OLTP source to the OLAP warehouse. Understanding that OLTP and OLAP are opposite — and why (the workload differences cascade into contradictory design choices) — is the key that makes data warehousing make sense: it's not arbitrary that warehouses are denormalized and columnar; it's the necessary consequence of optimizing for the analytical workload, which is the opposite of the transactional one. This is also Chapter 20's denormalization lesson at full scale: the right design depends entirely on the workload, and the analytical workload demands the denormalized, columnar design that would be wrong for transactions.

The star schema and the fact grain

The star schema is the dominant warehouse design, and understanding it deeply — especially the all-important concept of fact grain — is the core skill of dimensional modeling. The star schema's shape is a central fact table surrounded by dimension tables, and the design is deliberately denormalized for analytical query speed and clarity.

The fact table holds the measurements you analyze — the numeric values you sum, average, and count (revenue, quantity, cost, duration) — plus foreign keys to the dimensions. It's huge (one row per event: one sale, one order line, one measurement — billions of rows over time) but narrow (just the measures and the dimension keys). The dimension tables hold the descriptive context by which you slice the facts — who (customer), what (product), when (date), where (location), how (channel) — with descriptive attributes (product name and category, customer segment, the date's year/quarter/month). Dimensions are small relative to the fact table and denormalized (a product dimension includes its category and department as columns, rather than normalizing them into separate tables, so analytical queries don't need extra joins). The star shape — fact in the center, dimensions radiating out — matches exactly how analysts think: "this measure (from the fact), sliced by these dimensions (the descriptive tables), aggregated." A query picks measures from the fact, groups by dimension attributes, and aggregates — natural and fast.

The single most important design decision in the entire star schema is the fact grain — what one row of the fact table represents. Is it one row per order? Per order line? Per daily snapshot of inventory? The grain determines everything: what you can analyze (you can only slice and aggregate at or above the grain — a per-order grain can't analyze individual products within orders), how big the fact table is, and how the dimensions relate. Getting the grain right — usually choosing the finest grain that's practical (one row per order line, not per order, so you can analyze by product) — is the foundational decision, because everything else builds on it and a too-coarse grain permanently limits the analysis. The discipline is to declare the grain explicitly ("one row per order line item") before designing the rest, and ensure every measure and dimension makes sense at that grain. A clear, fine grain enables rich analysis; an unclear or too-coarse grain cripples the warehouse. This is the dimensional-modeling counterpart to ER modeling's "identify the entities" — and getting the grain right is as foundational to a warehouse as getting the entities right is to an OLTP schema.

The denormalization of the star schema is deliberate and correct for this workload, vindicating Chapter 20. Dimensions duplicate descriptive data (the product's category appears in the product dimension rather than being normalized into a separate category table) precisely to avoid joins in analytical queries — slicing by category is then a column in the already-joined dimension, not another join. This redundancy, which would be a normalization violation in OLTP (risking update anomalies), is fine in the warehouse because the warehouse is loaded in controlled batches, not updated transactionally — so the update anomalies normalization prevents don't arise (you reload the dimension, you don't update it piecemeal). The star schema is denormalization applied where it's exactly right: a read-heavy, batch-loaded analytical workload where read speed and query simplicity matter more than the write-integrity that normalization protects. Understanding the star schema — fact table of measures at a defined grain, denormalized dimensions to slice by, the grain as the foundational decision — is dimensional modeling, the analytical counterpart to the relational modeling of Part III.

Slowly changing dimensions and columnar storage

Two more concepts complete the warehouse-design picture: how dimensions handle change over time, and why analytical databases store data by column. Both follow from the analytical workload's needs.

Slowly Changing Dimensions (SCD) address a subtle but crucial analytical problem: dimension attributes change over time (a customer moves to a new region, a product gets recategorized), and analysis often needs the value as it was when each fact occurred, not the current value. If a customer was in the West region when they made a purchase last year and moved to the East this year, "sales by region last year" should count that purchase as West — but if you simply overwrote the customer's region to East, you've lost the historical truth. The SCD strategies handle this trade-off between currency and history. Type 1 overwrites the old value (simple, but loses history — fine when you only ever care about the current value). Type 2 adds a new dimension row with validity dates each time an attribute changes, so the dimension keeps every version, and each fact points to the version current when the fact occurred — preserving full history (the most common strategy for true historical analysis). Type 3 keeps a "previous value" column (limited history). Type 2 is the warehouse's solution to the point-in-time problem you saw with order_items.unit_price in Chapters 19–20: preserving what an attribute was at the time of an event, so historical analysis is accurate. Choosing the SCD strategy per dimension — based on whether you need historical accuracy — is a key warehouse design decision, and getting it wrong (overwriting when you needed history) permanently loses the ability to analyze "as it was."

Columnar storage is why analytical databases are fast at their job, and it follows directly from the analytical access pattern. Analytical queries typically read a few columns across many rows ("sum the revenue column over a billion rows, grouped by category"). Row-oriented storage (PostgreSQL's default heap, Chapter 28 — entire rows stored together) must read whole rows even to aggregate one column, wasting enormous I/O reading columns the query doesn't need. Column-oriented storage stores each column together, so an aggregation reads only the columns it needs — dramatically less I/O — and compresses far better (similar values stored together compress well). For "sum/average/group over huge data," columnar storage is enormously faster, often by orders of magnitude, which is why analytical databases (Snowflake, BigQuery, Redshift, DuckDB, ClickHouse) are columnar. The storage model follows the workload, a deep consequence of the internals (Chapter 28): OLTP is row-oriented because it accesses whole rows (place an order = write a whole row); OLAP is column-oriented because it aggregates columns (analyze revenue = read one column over everything). This is another instance of the OLTP/OLAP opposition — even the physical storage is opposite, because the access patterns are opposite. Understanding columnar storage explains why warehouses are so fast at aggregation (they read only the needed columns, compressed) and completes the picture of how analytical databases are designed top to bottom — denormalized schema, SCD for history, columnar storage — all following from the analytical workload's needs, all opposite to OLTP's.

Snowflake schema and slowly changing dimensions

Snowflake schema — a variant where dimensions are normalized into sub-dimensions (e.g., dim_product → dim_category → dim_department). It saves some space and reduces redundancy but adds joins; the star (denormalized dimensions) is usually preferred for query simplicity and speed. The name: normalized dimensions branch out like a snowflake.
Slowly Changing Dimensions (SCD) — dimensions change over time (a customer moves region; a product is recategorized), and analytics often needs the value as it was at the time of each fact. Strategies:
Type 1 — overwrite (keep only current value; lose history).
Type 2 — add a new dimension row with validity dates (keep full history; the fact points to the version current at its time). The most common for true history.
Type 3 — keep a "previous value" column (limited history).

SCD Type 2 is the warehouse's way of handling the "point-in-time" problem you saw with order_items.unit_price (Chapter 19/20) — preserving what a dimension's attributes were when the fact happened.

Column-oriented storage

Analytical queries typically read a few columns across many rows ("sum revenue over a billion rows"). Row-oriented storage (Chapter 28 — rows stored together) must read whole rows even to sum one column. Column-oriented (columnar) storage stores each column together, so an aggregation reads only the columns it needs — far less I/O — and compresses extremely well (similar values together). This is why analytical databases are usually columnar, and why they're so fast at "sum/average/group over huge data."

OLTP databases (PostgreSQL's default heap) are row-oriented (great for "read/write a whole row"); OLAP databases are columnar (great for "scan a few columns over everything"). The storage model follows the workload — a deep reason rooted in the internals (Chapter 28).

Modern analytical databases

The warehouse landscape today:

Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse — cloud, columnar, massively parallel analytical databases that scale to petabytes and separate storage from compute. The mainstream enterprise warehouses.
DuckDB — an in-process columnar analytical database ("SQLite for analytics"); superb for local/embedded analytics on large files.
ClickHouse — a very fast open-source columnar database for real-time analytics.
PostgreSQL — handles modest warehousing well, and extensions (Citus for scale-out, columnar storage extensions) push it further; for small-to-medium analytics, you may not need a separate warehouse at all (theme #4).

Most are queried with SQL — the relational query language won even in the analytical world. And modern stacks favor ELT (Chapter 31): load raw data into the warehouse, transform with SQL (tools like dbt). The skills you've built transfer directly.

Denormalization is the norm here (Chapter 20, vindicated)

In OLTP, you normalize (Chapter 19) and denormalize cautiously (Chapter 20). In the warehouse, denormalization is the default — star schemas deliberately duplicate descriptive attributes into dimensions, and pre-aggregate, because the priority is read/aggregation speed over write integrity (the warehouse is loaded in controlled batches, not updated transactionally). This is Chapter 20's lesson at full scale: the right design depends on the workload, and for analytics, denormalized wins.

Data marts, data lakes, and the lakehouse

The data-warehousing vocabulary includes several related terms — data mart, data lake, lakehouse — and understanding how they relate to the warehouse clarifies the broader analytical-data landscape you'll encounter. Each addresses a different aspect of organizing analytical data.

A data mart is a subset of a warehouse, focused on a specific business area (a sales mart, a marketing mart) — essentially a smaller, department-scoped warehouse, often a subset of star schemas serving one team's analytical needs. Data marts let a department have analytics tailored to its needs without navigating the whole enterprise warehouse; they're warehouses scoped down. A data lake is a different idea: a repository that stores raw data in its native format (files — CSV, JSON, Parquet, logs, images) at massive scale and low cost, without imposing structure up front ("schema on read" rather than "schema on write"). The data lake holds everything, structured and unstructured, cheaply, and structure is applied when the data is read and analyzed, not when it's stored. Data lakes emerged because organizations wanted to keep all their data (including unstructured and not-yet-modeled data) affordably, deferring the modeling. The trade-off: a data lake without governance can become a "data swamp" — a vast, disorganized store no one can navigate or trust.

The lakehouse is the modern synthesis, combining the data lake's cheap, flexible raw storage with the warehouse's structure, management, and SQL querying — aiming to get the lake's "store everything cheaply" and the warehouse's "query it reliably with SQL." Technologies like Databricks' Delta Lake and Apache Iceberg add warehouse-like features (transactions, schema management, performance) on top of lake storage, so you can run analytical SQL over the lake's data with warehouse-like reliability. The lakehouse reflects the convergence of the lake and warehouse worlds. For this book's purposes, the key points are: these are all analytical data organizations (distinct from OLTP), they're increasingly SQL-queried (your skills transfer), and they represent a spectrum from structured (warehouse, marts) to raw-and-flexible (lake) to converged (lakehouse). You don't need to master each, but knowing the vocabulary — warehouse (structured analytical store), mart (department subset), lake (raw cheap storage), lakehouse (the synthesis) — orients you in the analytical-data landscape and lets you understand the architecture discussions you'll encounter. And the throughline remains: SQL and dimensional thinking apply across all of them, so the skills you've built are the durable core regardless of which analytical storage technology an organization uses.

Dimensional modeling as a discipline

Dimensional modeling — the star-schema, facts-and-dimensions approach — is a discipline with its own established methodology, principally associated with Ralph Kimball, and recognizing it as a coherent body of practice (parallel to the relational modeling of Part III) frames it properly. It's not ad-hoc denormalization but a deliberate, principled approach to designing for analytics.

The dimensional-modeling discipline has its own concepts and best practices, several of which this chapter has touched: declare the grain first (the foundational decision), identify facts (the measures) and dimensions (the slicing context), denormalize dimensions for query simplicity, use conformed dimensions (shared dimension definitions across multiple fact tables, so "customer" means the same thing in sales analysis and support analysis — enabling cross-process analysis), handle change with slowly changing dimensions, and design for the business questions analysts will ask. There's a whole methodology here — a way of approaching analytical design that's as principled as ER modeling is for transactional design. The two are parallel disciplines: ER/relational modeling designs normalized schemas for OLTP (each fact in one place, optimized for transactional integrity), while dimensional modeling designs star schemas for OLAP (facts and dimensions, optimized for analytical query speed and clarity). A practitioner who knows both can design for both workloads appropriately — normalizing for the application, dimensionally modeling for the analytics.

Recognizing dimensional modeling as a discipline matters because it means analytical design, like transactional design, is a learnable skill with established principles, not guesswork. Just as Part III taught you to design OLTP schemas methodically (entities, relationships, normalization), dimensional modeling teaches you to design OLAP schemas methodically (grain, facts, dimensions, conformed dimensions, SCD). And just as good OLTP design makes applications work well, good dimensional design makes analytics work well — clear grain enables rich analysis, conformed dimensions enable cross-process insight, proper SCD enables historical accuracy, denormalized dimensions enable fast simple queries. The discipline has decades of accumulated wisdom (Kimball's dimensional-modeling methodology is a well-developed body of practice), and while this chapter introduces its core (the star schema and its key concepts), it's worth knowing that dimensional modeling is a deep field in its own right — the analytical counterpart to the relational modeling you've learned, and the core skill of the analytics engineer. For anyone moving toward analytics or data engineering, dimensional modeling is as foundational as relational modeling is for application development, and this chapter is its introduction.

The two-database reality

The deepest lesson of this chapter is the two-database reality: most organizations that both run on their data and analyze it end up with (at least) two databases — a transactional OLTP system and an analytical OLAP warehouse — because the two workloads genuinely require opposite designs. Internalizing this reality, and why it's the norm rather than a failure, is the chapter's central takeaway.

It might seem like a failure or inefficiency to have two databases holding versions of the same data — wouldn't one be simpler? But the OLTP/OLAP opposition makes one database serving both worse, not better: a schema optimized for transactions is slow for analytics, and vice versa, so a single database compromises both. The two-database architecture — normalized OLTP for running the application, denormalized OLAP warehouse for analyzing it, connected by an ETL/ELT pipeline — lets each be optimal for its job. This is the standard, mature architecture for any data-driven organization, and it's a feature of recognizing that transactions and analytics are different jobs, not a bug. The transactional database serves the application (fast, consistent operations); the warehouse serves the business (fast, rich analysis); the pipeline keeps the warehouse fed from the transactional source. Each database does what it's good at, and the organization gets both reliable operations and powerful analytics.

This reality connects the whole book together. The OLTP database is everything you learned in Parts I–V — the relational model, SQL, design, performance, integration — applied to running an application. The OLAP warehouse is this chapter's dimensional modeling, fed by Chapter 31's pipelines, applied to analyzing it. The same SQL skills serve both (the warehouse is SQL-queried too). The denormalization that was a careful exception in OLTP (Chapter 20) is the default in OLAP (this chapter) — the same trade-off, resolved oppositely because the workload is opposite. And theme #4 applies even here: for modest analytical needs, PostgreSQL (or DuckDB) handles the OLAP role without a separate warehouse, so the "two databases" might both be PostgreSQL, or the analytics might run on the transactional database's replica — the two-database logic (separate the workloads) applies even when the implementation is lighter. The practitioner who understands this two-database reality designs appropriately: a transactional system for operations, an analytical model for insight, scaled to the actual need. It's the architecture that lets organizations both run on their data and learn from it — which is, increasingly, what every organization needs to do, and what makes the database skills of this entire book central to how modern organizations operate and understand themselves.

The next chapter pushes the scaling question further still: what happens when even a single well-designed server — transactional or analytical — isn't enough, and data must be spread across multiple machines? Distributed databases, replication, sharding, and the CAP theorem are where the relational model meets the hard realities of scale beyond one server, and they complete Part VI's honest exploration of the boundaries of the single-database world you've mastered. Having seen how analytics demands a different design (this chapter), you'll see how scale itself demands distribution — the final frontier of the "beyond relational" landscape.

And it's worth carrying forward the chapter's deepest theme: design follows workload. The warehouse is denormalized and columnar not by preference but because the analytical workload demands it, just as the OLTP database is normalized and row-oriented because the transactional workload demands it. The same data, two workloads, two opposite designs — and recognizing which workload you're serving tells you which design is right. This workload-driven thinking, which has run through the book from normalization (Chapter 19) to denormalization (Chapter 20) to this chapter's star schemas, is the mature designer's lens: not "what's the best design?" in the abstract, but "what's the best design for this workload?" Data warehousing is that question answered for the analytical workload, and the answer — dimensional, denormalized, columnar — is as principled as the relational answer for transactions.

A note on real-time analytics

A modern wrinkle worth knowing is real-time analytics — the desire to analyze data with very low latency, closing the gap between an event happening and its appearing in analytics. The traditional warehouse model is batch: data is loaded periodically (nightly, hourly) via ETL/ELT, so analytics reflects data as of the last load. For many purposes, batch is fine (yesterday's sales figures don't need to be sub-second fresh). But some use cases want analytics on fresh data — live operational dashboards, fraud detection, real-time personalization — where waiting for the next batch is too slow.

Several approaches address this. Streaming pipelines (using tools like Apache Kafka and stream processors) continuously move and transform data as it's generated, rather than in periodic batches, keeping the analytical store near-real-time. Real-time analytical databases (ClickHouse, Apache Druid) are built for fast analytical queries over freshly-ingested data, blending the columnar-analytics strength with low-latency ingestion. And some architectures query the operational data more directly (a read replica, or PostgreSQL with good analytical indexing) for cases where the freshness need outweighs the cost of analyzing closer to the source. The trade-off, as always, is complexity versus need: real-time analytics infrastructure is more complex than batch warehousing, and most analytical needs are genuinely fine with batch freshness (the business reviews yesterday's numbers, not this second's). So the guidance mirrors the rest of the chapter: use batch warehousing by default (simpler, sufficient for most analytics), and add real-time capabilities only when a specific use case genuinely needs sub-batch freshness and justifies the added complexity. Knowing real-time analytics exists — and that it's a step beyond batch warehousing, warranted by specific freshness needs — rounds out the analytical landscape without suggesting everyone needs it. For most, the batch warehouse this chapter described is exactly right; real-time is the specialized extension for the cases that demand it.

Common mistakes

Running heavy analytics on the live OLTP database — competes with transactions, slow on the normalized schema. Use a warehouse (or at least a read replica, Chapter 35).
Over-normalizing the warehouse — applying OLTP normalization to analytics makes queries join-heavy and slow; star schemas are denormalized on purpose.
Getting the fact grain wrong — the most important warehouse design decision; an unclear or too-coarse grain limits every analysis.
Ignoring slowly-changing dimensions — overwriting (Type 1) when you needed history (Type 2) loses the ability to analyze "as it was."
Building a warehouse you don't need — for small-to-medium data, PostgreSQL (or DuckDB) may be plenty; don't over-engineer (theme #4).

A worked scenario: a sales star schema for Mercado

Let's design a sales data warehouse for Mercado, because building a star schema from a familiar transactional source shows how OLTP data is transformed into an analytical model. Mercado's OLTP database (normalized, Part III) records orders, order items, customers, products, categories — and the business wants to analyze sales: revenue and units by product, category, customer segment, region, and time, over years of history. How do we model this analytically?

Start with the fact grain — the foundational decision. The finest practical grain for sales analysis is one row per order line item (one product within one order), because that lets us analyze by product, by order, and by customer, aggregating up as needed. (A per-order grain couldn't break down by product; a per-day grain couldn't analyze individual orders.) So fact_sales has one row per order line. Its measures are the numeric values we'll aggregate: quantity, revenue (quantity × unit price), cost, discount — the additive numbers analysts sum. Its dimension foreign keys connect to the dimensions we slice by. The dimensions are the descriptive context: dim_date (one row per day, with year, quarter, month, day-of-week, holiday flags — the near-universal date dimension), dim_product (denormalized to include category and department as columns, so slicing by category needs no extra join), dim_customer (with segment, region, signup cohort — descriptive attributes for analysis), and perhaps dim_channel (web, store, mobile). The star takes shape: fact_sales in the center, the dimensions radiating out.

Contrast this sharply with Mercado's OLTP schema. The OLTP schema is normalized: orders, order_items, products, categories, customers as separate tables, each fact stored once, related by foreign keys — optimized for transactional integrity. The warehouse's star is denormalized: dim_product includes the category (duplicated from what was a separate categories table) so analytical queries avoid that join; dim_date is a generated calendar (not in the OLTP schema at all); the fact table pre-joins what the OLTP schema kept separate. The same business data is modeled oppositely — normalized for OLTP (clean writes), denormalized into a star for OLAP (fast reads) — which is the OLTP/OLAP opposition made concrete. An ETL/ELT pipeline (Chapter 31) populates the warehouse from the OLTP source: extract orders and items, transform them into fact rows (computing revenue, resolving dimension keys), build the dimensions (with SCD Type 2 for customer region, so historical sales reflect the region a customer was in at the time), and load. Now an analyst asks "revenue by category by quarter for three years" and it's a fast star-schema query — fact joined to a couple of dimensions, grouped, aggregated — over columnar storage, instead of a slow multi-join over the normalized OLTP tables competing with live transactions.

This scenario shows the whole arc: take normalized OLTP data, decide the fact grain (per order line), identify measures (revenue, units) and dimensions (date, product, customer, channel), denormalize the dimensions for query speed, handle changing dimensions with SCD, and feed it via a pipeline — producing an analytical model that's fast for the business questions, separate from and complementary to the transactional system. It's dimensional modeling applied to a familiar domain, and it makes vivid how the same data wears two different shapes for two different jobs. The OLTP schema you designed in Part III runs Mercado; this star schema analyzes it — two databases, two designs, one business, connected by a pipeline. That two-system pattern (normalized OLTP feeding denormalized OLAP) is the standard architecture for any organization that both runs on its data and analyzes it, which is essentially all of them.

The modern data stack and when you need a warehouse

Data warehousing lives within the broader "modern data stack," and understanding where it fits — plus honestly assessing when you actually need a separate warehouse — completes the practical picture. The modern analytical landscape has evolved significantly, and the tools and the decision both matter.

The modern data stack typically flows: operational sources (OLTP databases like Mercado, plus SaaS tools, APIs, files) → a pipeline (increasingly ELT — load raw data into the warehouse, then transform with SQL, Chapter 31) → a cloud data warehouse (Snowflake, BigQuery, Redshift — columnar, massively parallel, separating storage from compute, scaling to petabytes) → transformation tools (dbt, which is organized, tested, version-controlled SQL transformations) → BI and analytics tools (Tableau, Power BI, Looker) that analysts use to explore and visualize. The striking thing for this book's readers is that SQL is central throughout — the warehouses are SQL-queried, the transformation tool (dbt) is SQL, the analysis is SQL-based. The relational query language won even in the analytical world, so the SQL skills of Part II transfer directly to data warehousing and analytics engineering. A data analyst or analytics engineer's core skill is SQL — exactly what this book teaches — applied to the analytical model (star schemas) in the warehouse. The modern data career runs on SQL.

But the honest question — do you need a separate warehouse at all? — deserves a clear answer, because warehouses add cost and complexity and aren't always warranted (theme #4 once more). For small-to-medium analytical needs, you often don't need a separate warehouse: PostgreSQL handles modest analytics well (especially with partitioning for large fact tables, Chapter 25, and good indexing), and for local or embedded analytics on large files, DuckDB ("SQLite for analytics" — an in-process columnar database) is superb without any warehouse infrastructure. A separate cloud warehouse earns its place when analytical data grows large (the data volume or query complexity strains PostgreSQL), when analytics must not compete with a busy transactional system (separation becomes worth the cost), or when you need the warehouse's specific scale-out and concurrency for many analysts. The decision mirrors the whole part's theme: start with what you have (PostgreSQL, perhaps DuckDB for local analytics), and move to a dedicated warehouse when a measured need — scale, separation, concurrency — justifies it. Building an elaborate warehouse for small data is over-engineering; running petabyte analytics on a transactional PostgreSQL is under-provisioning. Matching the analytical infrastructure to the actual analytical need — single database for modest analytics, dedicated warehouse for large-scale — is the judgment, and it's the same measured, tool-to-problem approach that runs through all of Part VI. The good news either way: the SQL and dimensional-modeling skills apply regardless of where the analytics runs, from a PostgreSQL query to a Snowflake warehouse.

Progressive project: think analytically

For your project (a design exercise — you needn't build it):

Identify the facts — what measurements/events would you analyze (sales, loans, appointments, enrollments)? What's the grain (one row = ?)?
Identify the dimensions — how would you slice those facts (by date, customer, product, location, …)? Sketch a star schema.
Include a date dimension with year/quarter/month/day attributes.
Pick an SCD strategy for one dimension that changes over time, and justify Type 1 vs. Type 2.
Contrast this denormalized star with your normalized OLTP schema — why are they different?

Summary

Analytics (OLAP) is a different job from running an application (OLTP): few huge aggregating reads over history vs. many small transactional reads/writes. A data warehouse is a separate, analytics-optimized database fed from OLTP via ETL/ELT, designed the opposite way — denormalized, read/aggregation-optimized, often column-oriented. The dominant design is the star schema: a huge central fact table (numeric measures at a defined grain + dimension foreign keys) surrounded by small descriptive dimension tables you slice by (date, customer, product…). Snowflake schemas normalize dimensions (more joins; star usually preferred); slowly changing dimensions (Type 1 overwrite / Type 2 versioned history / Type 3 previous-value) preserve "as it was." Columnar storage makes aggregating a few columns over billions of rows fast. Modern warehouses (Snowflake, BigQuery, Redshift, DuckDB) are columnar, SQL-queried, and ELT-fed — your SQL skills transfer. And denormalization is the norm here (Chapter 20 at scale): the workload dictates the design.

You can now: - Distinguish OLTP from OLAP and explain why analytics needs a separate, differently-designed database. - Design a star schema (fact + dimensions) and choose the fact grain. - Explain snowflake schemas and slowly-changing-dimension strategies. - Explain why analytical databases are column-oriented. - Name the modern analytical databases and why warehouses are denormalized.

What's next. Chapter 35 — Distributed Databases — scaling across machines: replication, sharding, the CAP theorem, eventual consistency, distributed transactions, and NewSQL (CockroachDB, Spanner). What changes when one server isn't enough.

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.