Appendix F: SQL and Infrastructure Reference
Advanced SQL Patterns, Feature Engineering, and Data Architecture for Production Data Science
This appendix is a practitioner's desk reference for the SQL you actually write in production data science work. Introductory SQL courses teach SELECT, JOIN, GROUP BY. Production data science demands window functions for feature engineering, recursive CTEs for hierarchical data, point-in-time joins for training set construction, and a working understanding of query execution plans. Every query in this appendix is syntactically correct and tested against real schemas. Where PostgreSQL and BigQuery diverge, both dialects are shown.
The examples draw on the same domains used throughout this textbook: StreamRec (content platform recommendations, Chapters 1-36), Meridian Financial (credit scoring and fairness, Chapters 2-35), and MediCore (clinical trial data, Chapters 16-21). If you have been working through the progressive project, the table names and column conventions will be familiar.
F.1 Window Functions
Window functions compute a value for each row based on a "window" of related rows, without collapsing the result set the way GROUP BY does. They are the single most important SQL feature for data science feature engineering.
F.1.1 Ranking Functions
ROW_NUMBER() — Deduplicate and Select Latest Records
When your event table contains multiple records per entity and you need only the most recent, ROW_NUMBER() is the standard pattern:
-- Get the most recent viewing session per user
WITH ranked_sessions AS (
SELECT
user_id,
session_id,
started_at,
ended_at,
device_type,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY started_at DESC
) AS rn
FROM streamrec.sessions
WHERE started_at >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT user_id, session_id, started_at, ended_at, device_type
FROM ranked_sessions
WHERE rn = 1;
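A runnable miniature of this dedup pattern, using Python's bundled SQLite (3.25+ for window functions) as a stand-in engine. The table and values here are invented toy data, not the real StreamRec schema:

```python
import sqlite3

# Toy stand-in for streamrec.sessions; rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (user_id INT, session_id INT, started_at TEXT);
INSERT INTO sessions VALUES
  (1, 10, '2024-06-01'), (1, 11, '2024-06-03'), (1, 12, '2024-06-02'),
  (2, 20, '2024-05-30'), (2, 21, '2024-06-04');
""")

latest = conn.execute("""
WITH ranked AS (
  SELECT user_id, session_id, started_at,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY started_at DESC
         ) AS rn
  FROM sessions
)
SELECT user_id, session_id FROM ranked WHERE rn = 1
ORDER BY user_id;
""").fetchall()

print(latest)  # [(1, 11), (2, 21)] -- one row per user, the latest session
```

The same query shape works unchanged in PostgreSQL and BigQuery; only the table names and date filter differ.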
RANK() and DENSE_RANK() — Handling Ties
ROW_NUMBER() assigns arbitrary ordering among ties. When ties matter — for example, ranking content items by engagement — use RANK() (which leaves gaps) or DENSE_RANK() (which does not):
-- Rank items by total watch time within each genre
SELECT
item_id,
genre,
total_watch_minutes,
RANK() OVER (
PARTITION BY genre
ORDER BY total_watch_minutes DESC
) AS genre_rank,
DENSE_RANK() OVER (
PARTITION BY genre
ORDER BY total_watch_minutes DESC
) AS genre_dense_rank
FROM streamrec.item_daily_stats
WHERE stat_date = CURRENT_DATE - INTERVAL '1 day';
If two items tie at rank 2, RANK() produces 1, 2, 2, 4 while DENSE_RANK() produces 1, 2, 2, 3. For top-N filtering, DENSE_RANK() is usually more intuitive.
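The tie behavior is easy to verify on a toy table. A sketch in Python's bundled SQLite with invented watch-time values, one tie at second place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stats (item_id INT, minutes INT);
INSERT INTO stats VALUES (1, 500), (2, 300), (3, 300), (4, 100);
""")

rows = conn.execute("""
SELECT item_id,
       RANK()       OVER (ORDER BY minutes DESC) AS r,
       DENSE_RANK() OVER (ORDER BY minutes DESC) AS dr
FROM stats
ORDER BY minutes DESC, item_id;
""").fetchall()

ranks = [row[1] for row in rows]  # [1, 2, 2, 4] -- RANK leaves a gap
dense = [row[2] for row in rows]  # [1, 2, 2, 3] -- DENSE_RANK does not
```

With `WHERE dense <= 3`-style filtering you get every item in the top three distinct scores, which is usually what "top N" means in a ranking context.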
F.1.2 Offset Functions: LAG and LEAD
LAG() and LEAD() access rows relative to the current row without a self-join. These are essential for computing sequential features — session gaps, inter-event timing, and change detection.
-- Compute time between consecutive sessions and detect session gaps > 7 days
SELECT
user_id,
session_id,
started_at,
LAG(started_at) OVER (
PARTITION BY user_id
ORDER BY started_at
) AS prev_session_start,
started_at - LAG(started_at) OVER (
PARTITION BY user_id
ORDER BY started_at
) AS gap_interval,
CASE
WHEN started_at - LAG(started_at) OVER (
PARTITION BY user_id ORDER BY started_at
) > INTERVAL '7 days' THEN 1
ELSE 0
END AS is_reactivation
FROM streamrec.sessions
WHERE started_at >= CURRENT_DATE - INTERVAL '180 days';
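The gap-detection logic can be exercised on toy data in SQLite. SQLite has no interval arithmetic, so `julianday()` differences substitute for the `INTERVAL '7 days'` comparison; the sessions below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (user_id INT, started_at TEXT);
INSERT INTO sessions VALUES
  (1, '2024-01-01'), (1, '2024-01-03'), (1, '2024-01-20');
""")

rows = conn.execute("""
SELECT started_at,
       julianday(started_at)
         - julianday(LAG(started_at) OVER (PARTITION BY user_id
                                           ORDER BY started_at)) AS gap_days,
       CASE WHEN julianday(started_at)
                 - julianday(LAG(started_at) OVER (PARTITION BY user_id
                                                   ORDER BY started_at)) > 7
            THEN 1 ELSE 0 END AS is_reactivation
FROM sessions
ORDER BY started_at;
""").fetchall()

# gap_days: None, 2.0, 17.0 -- only the 17-day gap flags a reactivation
flags = [r[2] for r in rows]  # [0, 0, 1]
```

Note that the first session per user has a NULL gap, and `NULL > 7` falls through to the `ELSE 0` branch, matching the production query's semantics.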
Meridian Financial example — month-over-month payment change:
SELECT
account_id,
payment_month,
payment_amount,
LAG(payment_amount) OVER (
PARTITION BY account_id
ORDER BY payment_month
) AS prev_payment,
CASE
WHEN LAG(payment_amount) OVER (
PARTITION BY account_id ORDER BY payment_month
) > 0 THEN
(payment_amount - LAG(payment_amount) OVER (
PARTITION BY account_id ORDER BY payment_month
)) / LAG(payment_amount) OVER (
PARTITION BY account_id ORDER BY payment_month
)
ELSE NULL
END AS payment_change_pct
FROM meridian.monthly_payments
WHERE payment_month >= DATE '2024-01-01';
F.1.3 Running Aggregates
Running sums, averages, and counts are the workhorses of behavioral feature engineering. The ROWS BETWEEN clause defines the window frame.
-- Trailing 7-day and 28-day engagement features per user
SELECT
user_id,
event_date,
daily_watch_minutes,
SUM(daily_watch_minutes) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS watch_minutes_7d,
AVG(daily_watch_minutes) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS avg_watch_minutes_28d,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS active_days_28d
FROM streamrec.user_daily_activity;
ROWS vs. RANGE:
`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` counts exactly 7 physical rows. `RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW` counts calendar days, which handles gaps in daily data correctly but is not supported by all engines. PostgreSQL supports `RANGE` with intervals; BigQuery supports `RANGE` only with numeric order columns. For date-based ranges in BigQuery, pre-fill missing dates with a calendar table join or use `ROWS` over a dense daily table.
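The ROWS pitfall is concrete on a table with a date gap. A sketch in Python's bundled SQLite (which supports `ROWS` frames but not interval `RANGE` frames), using a two-row trailing frame and invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (event_date TEXT, minutes INT);
INSERT INTO activity VALUES
  ('2024-01-01', 10), ('2024-01-02', 20), ('2024-01-05', 30);
""")

rows = conn.execute("""
SELECT event_date,
       SUM(minutes) OVER (
         ORDER BY event_date
         ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS trailing_2_rows
FROM activity ORDER BY event_date;
""").fetchall()

# [10, 30, 50]: the Jan 5 frame still includes Jan 2, even though
# they are three calendar days apart -- ROWS counts physical rows.
sums = [r[1] for r in rows]
```

With a dense daily table (no missing dates), `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` and a 7-day `RANGE` frame coincide, which is why the calendar-spine pattern in Section F.4.1 matters.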
F.2 Recursive CTEs
Recursive CTEs handle hierarchical and graph-structured data — content category trees, organizational hierarchies, sequential user journeys — that would otherwise require application-layer code or fixed-depth self-joins.
-- Expand the StreamRec content category tree from any node to its root
WITH RECURSIVE category_path AS (
-- Base case: start from a specific leaf category
SELECT
category_id,
category_name,
parent_category_id,
1 AS depth,
category_name::TEXT AS full_path
FROM streamrec.categories
WHERE category_id = 1047 -- start from a specific leaf
UNION ALL
-- Recursive step: walk up to parent
SELECT
p.category_id,
p.category_name,
p.parent_category_id,
cp.depth + 1,
p.category_name || ' > ' || cp.full_path
FROM streamrec.categories p
INNER JOIN category_path cp ON p.category_id = cp.parent_category_id
)
SELECT category_id, category_name, depth, full_path
FROM category_path
ORDER BY depth DESC
LIMIT 1;
BigQuery note: BigQuery has supported recursive CTEs since 2023, with no special configuration required. The structure is identical; the differences are `CAST ... AS STRING`, `CONCAT`, and backtick-quoted table names:
WITH RECURSIVE category_path AS (
SELECT
category_id,
category_name,
parent_category_id,
1 AS depth,
CAST(category_name AS STRING) AS full_path
FROM `project.streamrec.categories`
WHERE category_id = 1047
UNION ALL
SELECT
p.category_id,
p.category_name,
p.parent_category_id,
cp.depth + 1,
CONCAT(p.category_name, ' > ', cp.full_path)
FROM `project.streamrec.categories` p
INNER JOIN category_path cp ON p.category_id = cp.parent_category_id
)
SELECT * FROM category_path
ORDER BY depth DESC
LIMIT 1;
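The walk-to-root recursion runs verbatim in SQLite, which shares the `WITH RECURSIVE` syntax. A minimal sketch on an invented three-level category tree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (category_id INT, name TEXT, parent_id INT);
INSERT INTO categories VALUES
  (1, 'All', NULL), (2, 'Drama', 1), (3, 'Legal Drama', 2);
""")

path = conn.execute("""
WITH RECURSIVE category_path AS (
  -- Base case: a specific leaf
  SELECT category_id, name, parent_id, 1 AS depth, name AS full_path
  FROM categories WHERE category_id = 3
  UNION ALL
  -- Recursive step: walk up to the parent
  SELECT p.category_id, p.name, p.parent_id, cp.depth + 1,
         p.name || ' > ' || cp.full_path
  FROM categories p
  JOIN category_path cp ON p.category_id = cp.parent_id
)
SELECT full_path FROM category_path
ORDER BY depth DESC LIMIT 1;
""").fetchone()[0]

print(path)  # All > Drama > Legal Drama
```

Recursion terminates naturally when the root's NULL `parent_id` fails the join condition.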
F.3 Query Optimization
Knowing SQL is necessary. Knowing why your query takes 45 minutes instead of 12 seconds is equally necessary. This section covers the tools and patterns you need to diagnose and fix slow queries.
F.3.1 Reading EXPLAIN Plans
PostgreSQL:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
u.user_id,
u.signup_date,
COUNT(e.event_id) AS total_events,
SUM(e.watch_seconds) AS total_watch_seconds
FROM streamrec.users u
INNER JOIN streamrec.events e ON u.user_id = e.user_id
WHERE e.event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY u.user_id, u.signup_date;
Key things to look for in the output:
| Signal | What It Means | Action |
|---|---|---|
| `Seq Scan` on events | Full table scan | Add an index on the filter column |
| `Hash Join` with large `Batches: 8` | Hash table spills to disk | Increase `work_mem` or reduce join input |
| `Sort Method: external merge` | Sort exceeds memory | Increase `work_mem` or add an index to avoid the sort |
| `Rows Removed by Filter: 12000000` | Index exists but is not used | Check predicate selectivity; the planner may be correct that a scan is cheaper |
| Actual rows >> estimated rows | Statistics are stale | Run `ANALYZE` on the table |
BigQuery:
BigQuery does not expose traditional EXPLAIN plans. Instead, inspect the Query Execution Details in the web console or query the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view:
SELECT
job_id,
total_bytes_processed,
total_slot_ms,
total_bytes_billed,
cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 20;
F.3.2 Indexing Strategies
Create indexes on columns that appear in WHERE, JOIN ON, and ORDER BY clauses — but only when selectivity warrants it.
-- PostgreSQL: composite index for the most common feature query pattern
CREATE INDEX CONCURRENTLY idx_events_user_date
ON streamrec.events (user_id, event_date DESC);
-- Partial index for active users only (saves space and I/O)
CREATE INDEX CONCURRENTLY idx_events_recent
ON streamrec.events (user_id, event_date)
WHERE event_date >= '2024-01-01';
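To watch an index change a plan without standing up a Postgres instance, SQLite's `EXPLAIN QUERY PLAN` is a much simpler analogue of `EXPLAIN`. A sketch on an invented `events` table (output strings vary slightly by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, event_date TEXT, watch_seconds INT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM events WHERE user_id = 42 AND event_date >= '2024-01-01'"
before = plan(q)   # a full-table SCAN
conn.execute("CREATE INDEX idx_events_user_date ON events (user_id, event_date)")
after = plan(q)    # a SEARCH using the composite index
```

The same diagnostic loop applies in PostgreSQL: run `EXPLAIN`, create the index, re-run `EXPLAIN`, and confirm the scan node changed.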
BigQuery does not use traditional B-tree indexes. Instead, optimize with clustering and partitioning:
-- BigQuery: partition by date, cluster by user_id
CREATE OR REPLACE TABLE `project.streamrec.events`
PARTITION BY event_date
CLUSTER BY user_id
AS SELECT * FROM `project.streamrec.events_raw`;
Partitioning prunes entire date ranges from scans. Clustering sorts data within partitions so that filters on user_id read fewer blocks. Together, they can reduce scanned bytes (and cost) by 90% or more on large tables.
F.3.3 Materialized Views
When a complex aggregation is queried repeatedly — dashboards, feature pipelines, monitoring queries — materialized views precompute and store the result.
PostgreSQL:
CREATE MATERIALIZED VIEW streamrec.user_engagement_28d AS
SELECT
user_id,
COUNT(DISTINCT event_date) AS active_days_28d,
SUM(watch_seconds) / 3600.0 AS watch_hours_28d,
COUNT(DISTINCT item_id) AS unique_items_28d,
AVG(completion_rate) AS avg_completion_28d
FROM streamrec.events
WHERE event_date >= CURRENT_DATE - INTERVAL '28 days'
GROUP BY user_id;
-- Refresh on a schedule (e.g., via cron or Dagster)
REFRESH MATERIALIZED VIEW CONCURRENTLY streamrec.user_engagement_28d;
The CONCURRENTLY keyword allows reads during refresh but requires a unique index:
CREATE UNIQUE INDEX idx_user_engagement_28d_uid
ON streamrec.user_engagement_28d (user_id);
BigQuery materialized views refresh automatically when base tables change, but incrementally maintained views support only a subset of SQL: no window functions, no UNION, no exact COUNT(DISTINCT ...) (use APPROX_COUNT_DISTINCT), and no non-deterministic functions such as CURRENT_DATE() in the view definition. A compliant version of the view above approximates the distinct counts and pins the window start:
CREATE MATERIALIZED VIEW `project.streamrec.user_engagement_28d`
AS
SELECT
user_id,
APPROX_COUNT_DISTINCT(event_date) AS active_days_28d,
SUM(watch_seconds) / 3600.0 AS watch_hours_28d,
APPROX_COUNT_DISTINCT(item_id) AS unique_items_28d,
AVG(completion_rate) AS avg_completion_28d
FROM `project.streamrec.events`
WHERE event_date >= DATE '2024-06-01'  -- fixed date: non-deterministic functions are disallowed
GROUP BY user_id;
F.4 Feature Engineering SQL Patterns
The patterns in this section are the ones you will use most often when building training datasets and feature store pipelines. They correspond directly to the feature store design in Chapter 25 and the pipeline orchestration in Chapter 27.
F.4.1 Trailing Window Features
Most behavioral features are aggregations over a trailing window: "number of sessions in the last 7 days," "average transaction amount in the last 90 days." The standard pattern uses a calendar spine joined to the event table.
-- Generate trailing window features for every user on every day they were active
WITH calendar AS (
SELECT generate_series(
DATE '2024-01-01',
CURRENT_DATE,
INTERVAL '1 day'
)::DATE AS feature_date
),
user_dates AS (
SELECT DISTINCT u.user_id, c.feature_date
FROM streamrec.users u
CROSS JOIN calendar c
WHERE c.feature_date >= u.signup_date
)
SELECT
ud.user_id,
ud.feature_date,
COALESCE(SUM(e.watch_seconds), 0) AS watch_seconds_7d,
COALESCE(COUNT(DISTINCT e.session_id), 0) AS sessions_7d,
COALESCE(COUNT(DISTINCT e.item_id), 0) AS unique_items_7d
FROM user_dates ud
LEFT JOIN streamrec.events e
ON ud.user_id = e.user_id
AND e.event_date BETWEEN ud.feature_date - INTERVAL '6 days' AND ud.feature_date
GROUP BY ud.user_id, ud.feature_date;
BigQuery equivalent (using UNNEST(GENERATE_DATE_ARRAY(...)) instead of generate_series):
WITH calendar AS (
SELECT feature_date
FROM UNNEST(
GENERATE_DATE_ARRAY('2024-01-01', CURRENT_DATE(), INTERVAL 1 DAY)
) AS feature_date
),
user_dates AS (
SELECT DISTINCT u.user_id, c.feature_date
FROM `project.streamrec.users` u
CROSS JOIN calendar c
WHERE c.feature_date >= u.signup_date
)
SELECT
ud.user_id,
ud.feature_date,
COALESCE(SUM(e.watch_seconds), 0) AS watch_seconds_7d,
COALESCE(COUNT(DISTINCT e.session_id), 0) AS sessions_7d,
COALESCE(COUNT(DISTINCT e.item_id), 0) AS unique_items_7d
FROM user_dates ud
LEFT JOIN `project.streamrec.events` e
ON ud.user_id = e.user_id
AND e.event_date BETWEEN DATE_SUB(ud.feature_date, INTERVAL 6 DAY) AND ud.feature_date
GROUP BY ud.user_id, ud.feature_date;
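The calendar-spine pattern can be exercised end to end in SQLite, which has neither `generate_series` nor `GENERATE_DATE_ARRAY`; a recursive CTE builds the spine instead. All names and values below are toy stand-ins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, signup_date TEXT);
CREATE TABLE events (user_id INT, event_date TEXT, watch_seconds INT);
INSERT INTO users VALUES (1, '2024-01-01');
INSERT INTO events VALUES (1, '2024-01-01', 100), (1, '2024-01-03', 50);
""")

rows = conn.execute("""
WITH RECURSIVE calendar(feature_date) AS (
  SELECT '2024-01-01'
  UNION ALL
  SELECT date(feature_date, '+1 day') FROM calendar
  WHERE feature_date < '2024-01-04'
),
user_dates AS (
  SELECT u.user_id, c.feature_date
  FROM users u CROSS JOIN calendar c
  WHERE c.feature_date >= u.signup_date
)
SELECT ud.feature_date, COALESCE(SUM(e.watch_seconds), 0) AS watch_7d
FROM user_dates ud
LEFT JOIN events e
  ON e.user_id = ud.user_id
 AND e.event_date BETWEEN date(ud.feature_date, '-6 days') AND ud.feature_date
GROUP BY ud.user_id, ud.feature_date
ORDER BY ud.feature_date;
""").fetchall()

# Every calendar day gets a row, even days with no events:
# [('2024-01-01', 100), ('2024-01-02', 100), ('2024-01-03', 150), ('2024-01-04', 150)]
```

The LEFT JOIN plus COALESCE is what keeps inactive days at 0 rather than dropping them, which a plain GROUP BY over the event table would do.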
F.4.2 Slowly Changing Dimensions (SCD)
User attributes change over time — subscription tier, geographic region, credit score band. A Type-2 SCD table tracks every version with validity dates. Joining to the correct version at feature time prevents label leakage.
-- SCD Type-2 join: get the user's subscription tier at the time of each event
SELECT
e.event_id,
e.user_id,
e.event_date,
e.item_id,
s.subscription_tier,
s.price_monthly
FROM streamrec.events e
INNER JOIN streamrec.user_subscriptions_scd s
ON e.user_id = s.user_id
AND e.event_date >= s.valid_from
AND (e.event_date < s.valid_to OR s.valid_to IS NULL);
The valid_to IS NULL condition captures the current (open-ended) record. This pattern is simple but expensive on large tables because it becomes a range join. For production feature pipelines, consider pre-computing a snapshot table with one row per user per day (see Chapter 25 for the feature store implementation).
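The range-join semantics are worth verifying on a tiny example: each event should pick up the attribute version in force on its date. A sketch in SQLite with an invented two-version subscription history:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INT, user_id INT, event_date TEXT);
CREATE TABLE subs_scd (user_id INT, tier TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO events VALUES (1, 7, '2024-02-10'), (2, 7, '2024-05-01');
INSERT INTO subs_scd VALUES
  (7, 'basic',   '2024-01-01', '2024-03-01'),
  (7, 'premium', '2024-03-01', NULL);          -- current, open-ended record
""")

rows = conn.execute("""
SELECT e.event_id, s.tier
FROM events e
JOIN subs_scd s
  ON e.user_id = s.user_id
 AND e.event_date >= s.valid_from
 AND (e.event_date < s.valid_to OR s.valid_to IS NULL)
ORDER BY e.event_id;
""").fetchall()

print(rows)  # [(1, 'basic'), (2, 'premium')]
```

The half-open `[valid_from, valid_to)` convention guarantees each event matches exactly one version; an inclusive upper bound would double-match events on a boundary date.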
F.4.3 Cross-Entity Joins
Some of the most powerful features combine data across entity types. In StreamRec, a user-item pair might use features from the user table, the item table, the creator table, and the user's interaction history with that creator.
-- Build a user-item feature vector for the ranking model
SELECT
e.user_id,
e.item_id,
e.event_date,
-- User features
u.signup_date,
e.event_date - u.signup_date AS account_age_days,
u.country_code,
-- Item features
i.genre,
i.duration_seconds,
i.publish_date,
e.event_date - i.publish_date AS item_age_days,
-- Creator features
cr.subscriber_count,
cr.avg_upload_frequency_days,
-- User-creator affinity features
COALESCE(uca.items_watched_from_creator, 0) AS items_from_creator_30d,
COALESCE(uca.avg_completion_from_creator, 0.0) AS avg_completion_creator_30d
FROM streamrec.events e
INNER JOIN streamrec.users u ON e.user_id = u.user_id
INNER JOIN streamrec.items i ON e.item_id = i.item_id
INNER JOIN streamrec.creators cr ON i.creator_id = cr.creator_id
LEFT JOIN streamrec.user_creator_affinity_30d uca
ON e.user_id = uca.user_id
AND i.creator_id = uca.creator_id
AND e.event_date = uca.feature_date
WHERE e.event_date = CURRENT_DATE - INTERVAL '1 day';
F.4.4 Point-in-Time Joins
Point-in-time correctness is the single most important property of a training dataset. A point-in-time join ensures that every feature row uses only data that was available at prediction time — never future data. Getting this wrong introduces label leakage, which inflates offline metrics and produces models that degrade in production (see Chapter 25 and Chapter 28 for testing strategies).
The standard pattern uses an ASOF join where the feature timestamp must be strictly less than or equal to the label timestamp.
PostgreSQL (lateral join approach):
-- For each labeled event, get the most recent feature snapshot
-- that was computed BEFORE the event occurred
SELECT
labels.user_id,
labels.event_timestamp,
labels.label,
features.watch_hours_7d,
features.sessions_7d,
features.unique_items_7d,
features.feature_computed_at
FROM streamrec.training_labels labels
CROSS JOIN LATERAL (
SELECT
watch_hours_7d,
sessions_7d,
unique_items_7d,
feature_computed_at
FROM streamrec.user_features_snapshots f
WHERE f.user_id = labels.user_id
AND f.feature_computed_at <= labels.event_timestamp
ORDER BY f.feature_computed_at DESC
LIMIT 1
) features;
BigQuery (window-based approach):
-- BigQuery does not support LATERAL joins; use a window-based approach
WITH joined AS (
SELECT
labels.user_id,
labels.event_timestamp,
labels.label,
f.watch_hours_7d,
f.sessions_7d,
f.unique_items_7d,
f.feature_computed_at,
ROW_NUMBER() OVER (
PARTITION BY labels.user_id, labels.event_timestamp
ORDER BY f.feature_computed_at DESC
) AS rn
FROM `project.streamrec.training_labels` labels
INNER JOIN `project.streamrec.user_features_snapshots` f
ON labels.user_id = f.user_id
AND f.feature_computed_at <= labels.event_timestamp
)
SELECT
user_id,
event_timestamp,
label,
watch_hours_7d,
sessions_7d,
unique_items_7d,
feature_computed_at
FROM joined
WHERE rn = 1;
Why this matters: In the StreamRec progressive project (Chapter 25), naive joins without point-in-time correctness inflated offline Hit@10 by 3.2 percentage points. The model appeared to improve, but the gain was entirely from future leakage. The monitoring pipeline (Chapter 30) caught the training-serving skew within 48 hours of deployment when online metrics did not match offline expectations.
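The window-based point-in-time join can be demonstrated on a three-snapshot toy table in SQLite, including the leakage case it exists to prevent. Table names and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (user_id INT, event_ts TEXT, label INT);
CREATE TABLE snapshots (user_id INT, computed_at TEXT, watch_hours REAL);
INSERT INTO labels VALUES (1, '2024-06-10', 1);
INSERT INTO snapshots VALUES
  (1, '2024-06-01', 2.0),   -- valid: computed before the label
  (1, '2024-06-09', 3.5),   -- valid and most recent: the one we want
  (1, '2024-06-11', 9.9);   -- future leakage: must be excluded
""")

row = conn.execute("""
WITH joined AS (
  SELECT l.user_id, l.label, f.watch_hours,
         ROW_NUMBER() OVER (
           PARTITION BY l.user_id, l.event_ts
           ORDER BY f.computed_at DESC
         ) AS rn
  FROM labels l
  JOIN snapshots f
    ON l.user_id = f.user_id
   AND f.computed_at <= l.event_ts
)
SELECT watch_hours FROM joined WHERE rn = 1;
""").fetchone()

print(row)  # (3.5,) -- the 2024-06-11 snapshot never enters the join
```

Dropping the `f.computed_at <= l.event_ts` predicate would silently select the 9.9 future snapshot, which is exactly the leakage mode described above.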
F.5 Feature Store SQL Patterns
Feature stores (Feast, Tecton, Hopsworks, or custom) abstract the patterns above into a managed service. Under the hood, they still execute SQL. Understanding the underlying SQL helps you debug feature store behavior and write efficient feature definitions.
F.5.1 Offline Feature Retrieval (Training)
-- What Feast generates under the hood for get_historical_features()
SELECT
entity.user_id,
entity.event_timestamp,
entity.label,
f_user.watch_hours_7d,
f_user.sessions_7d,
f_item.genre_popularity_score,
f_item.creator_subscriber_count
FROM training_entities entity
LEFT JOIN LATERAL (
SELECT watch_hours_7d, sessions_7d
FROM user_features_offline f
WHERE f.user_id = entity.user_id
AND f.feature_timestamp <= entity.event_timestamp
ORDER BY f.feature_timestamp DESC
LIMIT 1
) f_user ON TRUE
LEFT JOIN LATERAL (
SELECT genre_popularity_score, creator_subscriber_count
FROM item_features_offline f
WHERE f.item_id = entity.item_id
AND f.feature_timestamp <= entity.event_timestamp
ORDER BY f.feature_timestamp DESC
LIMIT 1
) f_item ON TRUE;
F.5.2 Online Feature Materialization
The batch pipeline that writes to the online store (Redis or DynamoDB) is typically a simple "latest value" query:
-- Materialize latest user features to the online store
INSERT INTO streamrec.user_features_online (user_id, feature_json, updated_at)
SELECT
user_id,
jsonb_build_object(
'watch_hours_7d', watch_hours_7d,
'sessions_7d', sessions_7d,
'unique_items_7d', unique_items_7d,
'active_days_28d', active_days_28d,
'avg_completion_28d', avg_completion_28d
) AS feature_json,
NOW() AS updated_at
FROM streamrec.user_engagement_28d
ON CONFLICT (user_id)
DO UPDATE SET
feature_json = EXCLUDED.feature_json,
updated_at = EXCLUDED.updated_at;
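SQLite adopted PostgreSQL's `ON CONFLICT ... DO UPDATE` syntax, so the upsert half of this pattern is directly testable; Python's `json` module stands in for `jsonb_build_object`. A sketch with invented feature payloads:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE features_online (
  user_id INT PRIMARY KEY, feature_json TEXT, updated_at TEXT)
""")

def materialize(user_id, features, ts):
    # Upsert: insert a new row, or overwrite the existing one in place
    conn.execute("""
    INSERT INTO features_online (user_id, feature_json, updated_at)
    VALUES (?, ?, ?)
    ON CONFLICT (user_id) DO UPDATE SET
      feature_json = excluded.feature_json,
      updated_at   = excluded.updated_at
    """, (user_id, json.dumps(features), ts))

materialize(1, {"watch_hours_7d": 2.0}, "2024-06-01")
materialize(1, {"watch_hours_7d": 3.5}, "2024-06-02")  # overwrites, no duplicate row

rows = conn.execute("SELECT user_id, feature_json FROM features_online").fetchall()
print(rows)  # one row per user, carrying the latest values
```

As in PostgreSQL, the upsert requires a unique constraint on the conflict target (`PRIMARY KEY` here); without it the `ON CONFLICT` clause is an error.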
F.6 BigQuery vs. PostgreSQL Dialect Differences
The following table summarizes the dialect differences that cause the most debugging time when moving queries between PostgreSQL and BigQuery.
| Operation | PostgreSQL | BigQuery |
|---|---|---|
| Date arithmetic | `event_date + INTERVAL '7 days'` | `DATE_ADD(event_date, INTERVAL 7 DAY)` |
| Date subtraction | `event_date - INTERVAL '7 days'` | `DATE_SUB(event_date, INTERVAL 7 DAY)` |
| Date difference (days) | `event_date - signup_date` (returns integer) | `DATE_DIFF(event_date, signup_date, DAY)` |
| Current date | `CURRENT_DATE` | `CURRENT_DATE()` (parentheses optional) |
| Generate date series | `generate_series(start, end, '1 day'::interval)::DATE` | `UNNEST(GENERATE_DATE_ARRAY(start, end, INTERVAL 1 DAY))` |
| Type casting | `value::TEXT`, `CAST(value AS TEXT)` | `CAST(value AS STRING)` (no `::` syntax) |
| String concatenation | `'a' \|\| 'b'` | `CONCAT('a', 'b')` (preferred) or `'a' \|\| 'b'` |
| Array aggregation | `array_agg(x ORDER BY y)` | `ARRAY_AGG(x ORDER BY y)` |
| Struct/row construction | `ROW(a, b, c)` | `STRUCT(a, b, c)` |
| JSON column type | `jsonb` | `JSON` or `STRING` with `JSON_EXTRACT` |
| Upsert | `INSERT ... ON CONFLICT ... DO UPDATE` | `MERGE` statement |
| Lateral join | `CROSS JOIN LATERAL (...)` | Not supported; use window functions |
| Table qualification | `schema.table` | `` `project.dataset.table` `` (backtick-quoted) |
| Boolean aggregation | `bool_or(x)`, `bool_and(x)` | `LOGICAL_OR(x)`, `LOGICAL_AND(x)` |
| Regex match | `column ~ 'pattern'` | `REGEXP_CONTAINS(column, r'pattern')` |
| Approximate count | (not built-in; use the HyperLogLog extension) | `APPROX_COUNT_DISTINCT(column)` |
| Partition pruning | Index + `WHERE` clause | `WHERE` filter on the partitioning column (pruning is automatic) |
F.7 Data Warehouse and Lakehouse Architecture Reference
This section provides a concise reference to the storage architectures discussed in Chapter 24 (ML System Design) and Chapter 25 (Data Infrastructure). The choice of architecture constrains what SQL patterns are available, how features are computed, and how training data is versioned.
F.7.1 Traditional Data Warehouse
Architecture: Structured storage with predefined schemas. Data is loaded through ETL (extract, transform, load) pipelines. Query engines are tightly coupled to storage.
| Component | Examples |
|---|---|
| Storage + compute | BigQuery, Snowflake, Amazon Redshift |
| Schema enforcement | Enforced on write |
| Data format | Proprietary columnar |
| Transaction support | Full ACID |
| Best for | BI dashboards, SQL-first feature engineering, aggregation-heavy workloads |
Data science implications: Strong SQL support makes feature engineering straightforward. Versioning and time-travel are built in (BigQuery snapshots, Snowflake Time Travel). The limitation is that unstructured data (images, text, embeddings) must be stored elsewhere, and direct access from Python training jobs requires an export step.
F.7.2 Data Lake
Architecture: Unstructured and semi-structured storage with schema-on-read. Data is stored in open formats (Parquet, ORC, JSON) on object storage (S3, GCS, ADLS).
| Component | Examples |
|---|---|
| Storage | S3, GCS, Azure Data Lake Storage |
| Compute | Spark, Presto, Trino, Athena, Dask |
| Schema enforcement | On read (or not at all) |
| Data format | Parquet, ORC, JSON, CSV |
| Transaction support | None (without a table format layer) |
| Best for | Raw data landing zone, unstructured data, large-scale Spark jobs |
Data science implications: Direct access from training jobs (PyTorch Dataset reading Parquet from S3). No storage cost for unused columns. The limitation is no ACID transactions, which means concurrent writes can corrupt data and there is no built-in time-travel for reproducibility.
F.7.3 Lakehouse
Architecture: Open table format layers on top of data lake storage that add ACID transactions, schema enforcement, and time-travel. This is the convergent architecture that most modern ML platforms target.
| Component | Examples |
|---|---|
| Table format | Delta Lake, Apache Iceberg, Apache Hudi |
| Storage | S3, GCS, ADLS (same as data lake) |
| Compute | Spark, Trino, DuckDB, Snowflake (external tables), BigQuery (BigLake) |
| Schema enforcement | On write (enforced by the table format) |
| Data format | Parquet (managed by the table format's metadata layer) |
| Transaction support | Full ACID |
| Best for | Unified analytics + ML, feature stores, versioned training datasets |
Data science implications: The lakehouse is the architecture that best supports the full ML lifecycle described in Chapters 24-30. Key capabilities for data science:
- Time-travel for training set reproducibility: `SELECT * FROM events TIMESTAMP AS OF '2024-06-15'` (Delta Lake) or `SELECT * FROM events FOR SYSTEM_TIME AS OF TIMESTAMP '2024-06-15'` (Iceberg, via Spark SQL).
- Schema evolution so that new features can be added without rewriting the entire table.
- ACID writes so that concurrent feature pipeline runs do not produce corrupt data.
- Direct Parquet access from PyTorch/TensorFlow without an export step.
- Partition evolution (Iceberg) allows changing the partitioning scheme without rewriting data.
StreamRec production architecture (from Chapter 25):
| Layer | Technology | Purpose |
|---|---|---|
| Raw landing zone | S3 + JSON/Avro | Kafka consumer writes raw events |
| Bronze (validated) | Delta Lake on S3 | Schema validation, deduplication |
| Silver (curated) | Delta Lake on S3 | Cleaned, joined, SCD-resolved tables |
| Gold (features) | Delta Lake on S3 | Pre-computed feature tables, training datasets |
| Online serving | Redis | Latest feature values for real-time inference |
| Query engine | Spark (batch), Trino (ad hoc), DuckDB (local dev) | Feature computation, EDA, debugging |
This medallion architecture (bronze/silver/gold) is the standard pattern for ML data platforms. The feature engineering SQL in Sections F.1 and F.4 runs against the silver layer to produce the gold layer. The materialization query in Section F.5.2 pushes gold-layer features to the online serving layer.
F.8 Quick Reference: Choosing the Right Pattern
| Problem | Pattern | Section |
|---|---|---|
| Deduplicate events, get latest record | `ROW_NUMBER()` + filter | F.1.1 |
| Rank entities within groups | `RANK()` / `DENSE_RANK()` | F.1.1 |
| Compute sequential gaps, deltas | `LAG()` / `LEAD()` | F.1.2 |
| Trailing N-day aggregations | `SUM/AVG/COUNT ... ROWS BETWEEN` | F.1.3 |
| Navigate hierarchical data | Recursive CTE | F.2 |
| Diagnose slow queries | `EXPLAIN (ANALYZE, BUFFERS)` | F.3.1 |
| Optimize large table scans | Indexes (Postgres), partitioning + clustering (BigQuery) | F.3.2 |
| Cache expensive aggregations | Materialized views | F.3.3 |
| Build time-aware behavioral features | Calendar spine + trailing window join | F.4.1 |
| Join to versioned dimension tables | SCD Type-2 range join | F.4.2 |
| Combine features across entity types | Cross-entity multi-table join | F.4.3 |
| Prevent label leakage in training sets | Point-in-time join (LATERAL or window) | F.4.4 |
| Populate online feature store | Upsert latest features | F.5.2 |
| Port queries between engines | Dialect reference table | F.6 |
| Choose storage architecture | Warehouse vs. lake vs. lakehouse | F.7 |