Appendix F: SQL and Infrastructure Reference

Advanced SQL Patterns, Feature Engineering, and Data Architecture for Production Data Science


This appendix is a practitioner's desk reference for the SQL you actually write in production data science work. Introductory SQL courses teach SELECT, JOIN, GROUP BY. Production data science demands window functions for feature engineering, recursive CTEs for hierarchical data, point-in-time joins for training set construction, and a working understanding of query execution plans. Every query in this appendix is syntactically correct and tested against real schemas. Where PostgreSQL and BigQuery diverge, both dialects are shown.

The examples draw on the same domains used throughout this textbook: StreamRec (content platform recommendations, Chapters 1-36), Meridian Financial (credit scoring and fairness, Chapters 2-35), and MediCore (clinical trial data, Chapters 16-21). If you have been working through the progressive project, the table names and column conventions will be familiar.


F.1 Window Functions

Window functions compute a value for each row based on a "window" of related rows, without collapsing the result set the way GROUP BY does. They are the single most important SQL feature for data science feature engineering.
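The non-collapsing behavior is easy to see in miniature. The following sketch runs both forms against a throwaway table using Python's built-in sqlite3 module (window functions require SQLite 3.25+, bundled with Python 3.8 and later); the table and values are illustrative, not part of the StreamRec schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (user_id INT, minutes INT);
    INSERT INTO sessions VALUES (1, 10), (1, 20), (2, 5);
""")

# GROUP BY collapses the result to one row per user...
grouped = conn.execute(
    "SELECT user_id, SUM(minutes) FROM sessions GROUP BY user_id ORDER BY user_id"
).fetchall()

# ...while the window form keeps all three rows, each carrying its user's total
windowed = conn.execute(
    "SELECT user_id, minutes, SUM(minutes) OVER (PARTITION BY user_id) "
    "FROM sessions ORDER BY user_id, minutes"
).fetchall()

print(grouped)   # [(1, 30), (2, 5)]
print(windowed)  # [(1, 10, 30), (1, 20, 30), (2, 5, 5)]
```

The windowed query is exactly what feature engineering needs: each event row keeps its own identity while gaining an aggregate computed over its peers.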

F.1.1 Ranking Functions

ROW_NUMBER() — Deduplicate and Select Latest Records

When your event table contains multiple records per entity and you need only the most recent, ROW_NUMBER() is the standard pattern:

-- Get the most recent viewing session per user
WITH ranked_sessions AS (
    SELECT
        user_id,
        session_id,
        started_at,
        ended_at,
        device_type,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY started_at DESC
        ) AS rn
    FROM streamrec.sessions
    WHERE started_at >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT user_id, session_id, started_at, ended_at, device_type
FROM ranked_sessions
WHERE rn = 1;

RANK() and DENSE_RANK() — Handling Ties

ROW_NUMBER() assigns arbitrary ordering among ties. When ties matter — for example, ranking content items by engagement — use RANK() (which leaves gaps) or DENSE_RANK() (which does not):

-- Rank items by total watch time within each genre
SELECT
    item_id,
    genre,
    total_watch_minutes,
    RANK() OVER (
        PARTITION BY genre
        ORDER BY total_watch_minutes DESC
    ) AS genre_rank,
    DENSE_RANK() OVER (
        PARTITION BY genre
        ORDER BY total_watch_minutes DESC
    ) AS genre_dense_rank
FROM streamrec.item_daily_stats
WHERE stat_date = CURRENT_DATE - INTERVAL '1 day';

If two items tie at rank 2, RANK() produces 1, 2, 2, 4 while DENSE_RANK() produces 1, 2, 2, 3. For top-N filtering, DENSE_RANK() is usually more intuitive.
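The tie behavior can be verified in miniature with Python's built-in sqlite3 module (SQLite 3.25+); the four-row table below is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (item_id INT, minutes INT)")
conn.executemany("INSERT INTO stats VALUES (?, ?)",
                 [(1, 100), (2, 90), (3, 90), (4, 80)])

rows = conn.execute("""
    SELECT item_id,
           RANK()       OVER (ORDER BY minutes DESC) AS r,
           DENSE_RANK() OVER (ORDER BY minutes DESC) AS dr
    FROM stats
    ORDER BY minutes DESC, item_id
""").fetchall()

# Items 2 and 3 tie at 90 minutes: RANK leaves a gap, DENSE_RANK does not
print([r for (_, r, _) in rows])    # [1, 2, 2, 4]
print([dr for (_, _, dr) in rows])  # [1, 2, 2, 3]
```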

F.1.2 Offset Functions: LAG and LEAD

LAG() and LEAD() access rows relative to the current row without a self-join. These are essential for computing sequential features — session gaps, inter-event timing, and change detection.

-- Compute time between consecutive sessions and detect session gaps > 7 days
SELECT
    user_id,
    session_id,
    started_at,
    LAG(started_at) OVER (
        PARTITION BY user_id
        ORDER BY started_at
    ) AS prev_session_start,
    started_at - LAG(started_at) OVER (
        PARTITION BY user_id
        ORDER BY started_at
    ) AS gap_interval,
    CASE
        WHEN started_at - LAG(started_at) OVER (
            PARTITION BY user_id ORDER BY started_at
        ) > INTERVAL '7 days' THEN 1
        ELSE 0
    END AS is_reactivation
FROM streamrec.sessions
WHERE started_at >= CURRENT_DATE - INTERVAL '180 days';

Meridian Financial example — month-over-month payment change:

SELECT
    account_id,
    payment_month,
    payment_amount,
    LAG(payment_amount) OVER (
        PARTITION BY account_id
        ORDER BY payment_month
    ) AS prev_payment,
    CASE
        WHEN LAG(payment_amount) OVER (
            PARTITION BY account_id ORDER BY payment_month
        ) > 0 THEN
            (payment_amount - LAG(payment_amount) OVER (
                PARTITION BY account_id ORDER BY payment_month
            )) / LAG(payment_amount) OVER (
                PARTITION BY account_id ORDER BY payment_month
            )
        ELSE NULL
    END AS payment_change_pct
FROM meridian.monthly_payments
WHERE payment_month >= DATE '2024-01-01';

F.1.3 Running Aggregates

Running sums, averages, and counts are the workhorses of behavioral feature engineering. The ROWS BETWEEN clause defines the window frame.

-- Trailing 7-day and 28-day engagement features per user
SELECT
    user_id,
    event_date,
    daily_watch_minutes,
    SUM(daily_watch_minutes) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS watch_minutes_7d,
    AVG(daily_watch_minutes) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
    ) AS avg_watch_minutes_28d,
    COUNT(*) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
    ) AS active_days_28d
FROM streamrec.user_daily_activity;

ROWS vs. RANGE: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts exactly 7 physical rows. RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW counts calendar days, which handles gaps in daily data correctly but is not supported by all engines. PostgreSQL supports RANGE with intervals; BigQuery requires RANGE with numeric order columns only. For date-based ranges in BigQuery, pre-fill missing dates with a calendar table join or use ROWS with a dense daily table.
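The difference is easy to reproduce with Python's built-in sqlite3 module (numeric RANGE frames require SQLite 3.28+), using integer day numbers as the order column and a deliberate gap in the data; the table is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (day INT, minutes INT)")
# Days 1-3 are active, then a gap, then day 10
conn.executemany("INSERT INTO activity VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 10), (10, 10)])

rows_frame, range_frame = conn.execute("""
    SELECT
        SUM(minutes) OVER (ORDER BY day
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW),
        SUM(minutes) OVER (ORDER BY day
            RANGE BETWEEN 6 PRECEDING AND CURRENT ROW)
    FROM activity ORDER BY day
""").fetchall()[-1]

# On day 10, the ROWS frame still sees the pre-gap rows (sum 40);
# the RANGE frame covers day values 4-10, so only day 10 itself (sum 10)
print(rows_frame, range_frame)  # 40 10
```

With dense daily data (one row per calendar day) the two frames agree, which is why the pre-filled calendar approach makes ROWS safe on engines without interval RANGE support.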


F.2 Recursive CTEs

Recursive CTEs handle hierarchical and graph-structured data — content category trees, organizational hierarchies, sequential user journeys — that would otherwise require application-layer code or fixed-depth self-joins.

-- Expand the StreamRec content category tree from any node to its root
WITH RECURSIVE category_path AS (
    -- Base case: start from a specific leaf category
    SELECT
        category_id,
        category_name,
        parent_category_id,
        1 AS depth,
        category_name::TEXT AS full_path
    FROM streamrec.categories
    WHERE category_id = 1047  -- start from a specific leaf

    UNION ALL

    -- Recursive step: walk up to parent
    SELECT
        p.category_id,
        p.category_name,
        p.parent_category_id,
        cp.depth + 1,
        p.category_name || ' > ' || cp.full_path
    FROM streamrec.categories p
    INNER JOIN category_path cp ON p.category_id = cp.parent_category_id
)
SELECT category_id, category_name, depth, full_path
FROM category_path
ORDER BY depth DESC
LIMIT 1;

BigQuery note: BigQuery has supported recursive CTEs since 2023. The structure is the same WITH RECURSIVE form; the differences are BigQuery's casting (CAST ... AS STRING) and string concatenation (CONCAT):

WITH RECURSIVE category_path AS (
    SELECT
        category_id,
        category_name,
        parent_category_id,
        1 AS depth,
        CAST(category_name AS STRING) AS full_path
    FROM `project.streamrec.categories`
    WHERE category_id = 1047

    UNION ALL

    SELECT
        p.category_id,
        p.category_name,
        p.parent_category_id,
        cp.depth + 1,
        CONCAT(p.category_name, ' > ', cp.full_path)
    FROM `project.streamrec.categories` p
    INNER JOIN category_path cp ON p.category_id = cp.parent_category_id
)
SELECT * FROM category_path
ORDER BY depth DESC
LIMIT 1;

F.3 Query Optimization

Knowing SQL is necessary. Knowing why your query takes 45 minutes instead of 12 seconds is equally necessary. This section covers the tools and patterns you need to diagnose and fix slow queries.

F.3.1 Reading EXPLAIN Plans

PostgreSQL:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    u.user_id,
    u.signup_date,
    COUNT(e.event_id) AS total_events,
    SUM(e.watch_seconds) AS total_watch_seconds
FROM streamrec.users u
INNER JOIN streamrec.events e ON u.user_id = e.user_id
WHERE e.event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY u.user_id, u.signup_date;

Key things to look for in the output:

Signal What It Means Action
Seq Scan on events Full table scan Add an index on the filter column (or accept the scan if most rows qualify)
Hash Join with Batches: 8 Hash table spills to disk Increase work_mem or reduce the join input
Sort Method: external merge Sort exceeds memory Increase work_mem or add an index to avoid the sort
Rows Removed by Filter: 12000000 Most scanned rows fail the predicate Check predicate selectivity; an index may help, or the planner may be correct that a scan is cheaper
Actual rows >> estimated rows Statistics are stale Run ANALYZE on the table

BigQuery:

BigQuery does not expose traditional EXPLAIN plans. Instead, inspect the Query Execution Details in the web console or query the INFORMATION_SCHEMA.JOBS_BY_PROJECT view:

SELECT
    job_id,
    total_bytes_processed,
    total_slot_ms,
    total_bytes_billed,
    cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 20;

F.3.2 Indexing Strategies

Create indexes on columns that appear in WHERE, JOIN ON, and ORDER BY clauses — but only when selectivity warrants it.

-- PostgreSQL: composite index for the most common feature query pattern
CREATE INDEX CONCURRENTLY idx_events_user_date
    ON streamrec.events (user_id, event_date DESC);

-- Partial index for active users only (saves space and I/O)
CREATE INDEX CONCURRENTLY idx_events_recent
    ON streamrec.events (user_id, event_date)
    WHERE event_date >= '2024-01-01';

BigQuery does not use traditional B-tree indexes. Instead, optimize with clustering and partitioning:

-- BigQuery: partition by date, cluster by user_id
CREATE OR REPLACE TABLE `project.streamrec.events`
PARTITION BY event_date
CLUSTER BY user_id
AS SELECT * FROM `project.streamrec.events_raw`;

Partitioning prunes entire date ranges from scans. Clustering sorts data within partitions so that filters on user_id read fewer blocks. Together, they can reduce scanned bytes (and cost) by 90% or more on large tables.

F.3.3 Materialized Views

When a complex aggregation is queried repeatedly — dashboards, feature pipelines, monitoring queries — materialized views precompute and store the result.

PostgreSQL:

CREATE MATERIALIZED VIEW streamrec.user_engagement_28d AS
SELECT
    user_id,
    COUNT(DISTINCT event_date) AS active_days_28d,
    SUM(watch_seconds) / 3600.0 AS watch_hours_28d,
    COUNT(DISTINCT item_id) AS unique_items_28d,
    AVG(completion_rate) AS avg_completion_28d
FROM streamrec.events
WHERE event_date >= CURRENT_DATE - INTERVAL '28 days'
GROUP BY user_id;

-- Refresh on a schedule (e.g., via cron or Dagster)
REFRESH MATERIALIZED VIEW CONCURRENTLY streamrec.user_engagement_28d;

The CONCURRENTLY keyword allows reads during refresh but requires a unique index:

CREATE UNIQUE INDEX idx_user_engagement_28d_uid
    ON streamrec.user_engagement_28d (user_id);

BigQuery materialized views refresh automatically when base tables change, but they support only a subset of SQL (no window functions, no UNION):

CREATE MATERIALIZED VIEW `project.streamrec.user_engagement_28d`
AS
SELECT
    user_id,
    COUNT(DISTINCT event_date) AS active_days_28d,
    SUM(watch_seconds) / 3600.0 AS watch_hours_28d,
    COUNT(DISTINCT item_id) AS unique_items_28d,
    AVG(completion_rate) AS avg_completion_28d
FROM `project.streamrec.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY user_id;

F.4 Feature Engineering SQL Patterns

The patterns in this section are the ones you will use most often when building training datasets and feature store pipelines. They correspond directly to the feature store design in Chapter 25 and the pipeline orchestration in Chapter 27.

F.4.1 Trailing Window Features

Most behavioral features are aggregations over a trailing window: "number of sessions in the last 7 days," "average transaction amount in the last 90 days." The standard pattern uses a calendar spine joined to the event table.

-- Generate trailing window features for every user on every day since signup
WITH calendar AS (
    SELECT generate_series(
        DATE '2024-01-01',
        CURRENT_DATE,
        INTERVAL '1 day'
    )::DATE AS feature_date
),
user_dates AS (
    SELECT DISTINCT u.user_id, c.feature_date
    FROM streamrec.users u
    CROSS JOIN calendar c
    WHERE c.feature_date >= u.signup_date
)
SELECT
    ud.user_id,
    ud.feature_date,
    COALESCE(SUM(e.watch_seconds), 0) AS watch_seconds_7d,
    COALESCE(COUNT(DISTINCT e.session_id), 0) AS sessions_7d,
    COALESCE(COUNT(DISTINCT e.item_id), 0) AS unique_items_7d
FROM user_dates ud
LEFT JOIN streamrec.events e
    ON ud.user_id = e.user_id
    AND e.event_date BETWEEN ud.feature_date - INTERVAL '6 days' AND ud.feature_date
GROUP BY ud.user_id, ud.feature_date;

BigQuery equivalent (using UNNEST(GENERATE_DATE_ARRAY(...)) instead of generate_series):

WITH calendar AS (
    SELECT feature_date
    FROM UNNEST(
        GENERATE_DATE_ARRAY('2024-01-01', CURRENT_DATE(), INTERVAL 1 DAY)
    ) AS feature_date
),
user_dates AS (
    SELECT DISTINCT u.user_id, c.feature_date
    FROM `project.streamrec.users` u
    CROSS JOIN calendar c
    WHERE c.feature_date >= u.signup_date
)
SELECT
    ud.user_id,
    ud.feature_date,
    COALESCE(SUM(e.watch_seconds), 0) AS watch_seconds_7d,
    COALESCE(COUNT(DISTINCT e.session_id), 0) AS sessions_7d,
    COALESCE(COUNT(DISTINCT e.item_id), 0) AS unique_items_7d
FROM user_dates ud
LEFT JOIN `project.streamrec.events` e
    ON ud.user_id = e.user_id
    AND e.event_date BETWEEN DATE_SUB(ud.feature_date, INTERVAL 6 DAY) AND ud.feature_date
GROUP BY ud.user_id, ud.feature_date;

F.4.2 Slowly Changing Dimensions (SCD)

User attributes change over time — subscription tier, geographic region, credit score band. A Type-2 SCD table tracks every version with validity dates. Joining to the correct version at feature time prevents label leakage.

-- SCD Type-2 join: get the user's subscription tier at the time of each event
SELECT
    e.event_id,
    e.user_id,
    e.event_date,
    e.item_id,
    s.subscription_tier,
    s.price_monthly
FROM streamrec.events e
INNER JOIN streamrec.user_subscriptions_scd s
    ON e.user_id = s.user_id
    AND e.event_date >= s.valid_from
    AND (e.event_date < s.valid_to OR s.valid_to IS NULL);

The valid_to IS NULL condition captures the current (open-ended) record. This pattern is simple but expensive on large tables because it becomes a range join. For production feature pipelines, consider pre-computing a snapshot table with one row per user per day (see Chapter 25 for the feature store implementation).
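A miniature version of this range join, runnable with Python's built-in sqlite3 module, shows each event picking up the version that was valid at its timestamp; the tables and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subs (user_id INT, tier TEXT, valid_from TEXT, valid_to TEXT);
    CREATE TABLE events (event_id INT, user_id INT, event_date TEXT);
    -- User 1 upgraded on 2024-03-01; the current record is open-ended (valid_to NULL)
    INSERT INTO subs VALUES (1, 'free',    '2024-01-01', '2024-03-01'),
                            (1, 'premium', '2024-03-01', NULL);
    INSERT INTO events VALUES (100, 1, '2024-02-15'), (101, 1, '2024-04-01');
""")

rows = conn.execute("""
    SELECT e.event_id, s.tier
    FROM events e
    JOIN subs s ON e.user_id = s.user_id
        AND e.event_date >= s.valid_from
        AND (e.event_date < s.valid_to OR s.valid_to IS NULL)
    ORDER BY e.event_id
""").fetchall()

# Each event joins to exactly one version: the one in effect at its date
print(rows)  # [(100, 'free'), (101, 'premium')]
```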

F.4.3 Cross-Entity Joins

Some of the most powerful features combine data across entity types. In StreamRec, a user-item pair might use features from the user table, the item table, the creator table, and the user's interaction history with that creator.

-- Build a user-item feature vector for the ranking model
SELECT
    e.user_id,
    e.item_id,
    e.event_date,

    -- User features
    u.signup_date,
    e.event_date - u.signup_date AS account_age_days,
    u.country_code,

    -- Item features
    i.genre,
    i.duration_seconds,
    i.publish_date,
    e.event_date - i.publish_date AS item_age_days,

    -- Creator features
    cr.subscriber_count,
    cr.avg_upload_frequency_days,

    -- User-creator affinity features
    COALESCE(uca.items_watched_from_creator, 0) AS items_from_creator_30d,
    COALESCE(uca.avg_completion_from_creator, 0.0) AS avg_completion_creator_30d

FROM streamrec.events e
INNER JOIN streamrec.users u ON e.user_id = u.user_id
INNER JOIN streamrec.items i ON e.item_id = i.item_id
INNER JOIN streamrec.creators cr ON i.creator_id = cr.creator_id
LEFT JOIN streamrec.user_creator_affinity_30d uca
    ON e.user_id = uca.user_id
    AND i.creator_id = uca.creator_id
    AND e.event_date = uca.feature_date
WHERE e.event_date = CURRENT_DATE - INTERVAL '1 day';

F.4.4 Point-in-Time Joins

Point-in-time correctness is the single most important property of a training dataset. A point-in-time join ensures that every feature row uses only data that was available at prediction time — never future data. Getting this wrong introduces label leakage, which inflates offline metrics and produces models that degrade in production (see Chapter 25 and Chapter 28 for testing strategies).

The standard pattern uses an ASOF join where the feature timestamp must be less than or equal to the label timestamp.

PostgreSQL (lateral join approach):

-- For each labeled event, get the most recent feature snapshot
-- that was computed BEFORE the event occurred
SELECT
    labels.user_id,
    labels.event_timestamp,
    labels.label,
    features.watch_hours_7d,
    features.sessions_7d,
    features.unique_items_7d,
    features.feature_computed_at
FROM streamrec.training_labels labels
CROSS JOIN LATERAL (
    SELECT
        watch_hours_7d,
        sessions_7d,
        unique_items_7d,
        feature_computed_at
    FROM streamrec.user_features_snapshots f
    WHERE f.user_id = labels.user_id
        AND f.feature_computed_at <= labels.event_timestamp
    ORDER BY f.feature_computed_at DESC
    LIMIT 1
) features;

BigQuery (window-based approach):

-- BigQuery does not support LATERAL joins; use a window-based approach
WITH joined AS (
    SELECT
        labels.user_id,
        labels.event_timestamp,
        labels.label,
        f.watch_hours_7d,
        f.sessions_7d,
        f.unique_items_7d,
        f.feature_computed_at,
        ROW_NUMBER() OVER (
            PARTITION BY labels.user_id, labels.event_timestamp
            ORDER BY f.feature_computed_at DESC
        ) AS rn
    FROM `project.streamrec.training_labels` labels
    INNER JOIN `project.streamrec.user_features_snapshots` f
        ON labels.user_id = f.user_id
        AND f.feature_computed_at <= labels.event_timestamp
)
SELECT
    user_id,
    event_timestamp,
    label,
    watch_hours_7d,
    sessions_7d,
    unique_items_7d,
    feature_computed_at
FROM joined
WHERE rn = 1;

Why this matters: In the StreamRec progressive project (Chapter 25), naive joins without point-in-time correctness inflated offline Hit@10 by 3.2 percentage points. The model appeared to improve, but the gain was entirely from future leakage. The monitoring pipeline (Chapter 30) caught the training-serving skew within 48 hours of deployment when online metrics did not match offline expectations.


F.5 Feature Store SQL Patterns

Feature stores (Feast, Tecton, Hopsworks, or custom) abstract the patterns above into a managed service. Under the hood, they still execute SQL. Understanding the underlying SQL helps you debug feature store behavior and write efficient feature definitions.

F.5.1 Offline Feature Retrieval (Training)

-- What Feast generates under the hood for get_historical_features()
SELECT
    entity.user_id,
    entity.event_timestamp,
    entity.label,
    f_user.watch_hours_7d,
    f_user.sessions_7d,
    f_item.genre_popularity_score,
    f_item.creator_subscriber_count
FROM training_entities entity
LEFT JOIN LATERAL (
    SELECT watch_hours_7d, sessions_7d
    FROM user_features_offline f
    WHERE f.user_id = entity.user_id
        AND f.feature_timestamp <= entity.event_timestamp
    ORDER BY f.feature_timestamp DESC
    LIMIT 1
) f_user ON TRUE
LEFT JOIN LATERAL (
    SELECT genre_popularity_score, creator_subscriber_count
    FROM item_features_offline f
    WHERE f.item_id = entity.item_id
        AND f.feature_timestamp <= entity.event_timestamp
    ORDER BY f.feature_timestamp DESC
    LIMIT 1
) f_item ON TRUE;

F.5.2 Online Feature Materialization

The batch pipeline that writes to the online store (Redis or DynamoDB) is typically a simple "latest value" query:

-- Materialize latest user features to the online store
INSERT INTO streamrec.user_features_online (user_id, feature_json, updated_at)
SELECT
    user_id,
    jsonb_build_object(
        'active_days_28d', active_days_28d,
        'watch_hours_28d', watch_hours_28d,
        'unique_items_28d', unique_items_28d,
        'avg_completion_28d', avg_completion_28d
    ) AS feature_json,
    NOW() AS updated_at
FROM streamrec.user_engagement_28d
ON CONFLICT (user_id)
DO UPDATE SET
    feature_json = EXCLUDED.feature_json,
    updated_at = EXCLUDED.updated_at;

F.6 BigQuery vs. PostgreSQL Dialect Differences

The following table summarizes the dialect differences that cause the most debugging time when moving queries between PostgreSQL and BigQuery.

Operation PostgreSQL BigQuery
Date arithmetic event_date + INTERVAL '7 days' DATE_ADD(event_date, INTERVAL 7 DAY)
Date subtraction event_date - INTERVAL '7 days' DATE_SUB(event_date, INTERVAL 7 DAY)
Date difference (days) event_date - signup_date (returns integer) DATE_DIFF(event_date, signup_date, DAY)
Current date CURRENT_DATE CURRENT_DATE() (parentheses optional)
Generate date series generate_series(start, end, '1 day'::interval)::DATE UNNEST(GENERATE_DATE_ARRAY(start, end, INTERVAL 1 DAY))
Type casting value::TEXT, CAST(value AS TEXT) CAST(value AS STRING) (no :: syntax)
String concatenation 'a' || 'b' CONCAT('a', 'b') (preferred) or 'a' || 'b'
Array aggregation array_agg(x ORDER BY y) ARRAY_AGG(x ORDER BY y)
Struct/row construction ROW(a, b, c) STRUCT(a, b, c)
JSON column type jsonb JSON or STRING with JSON_EXTRACT
Upsert INSERT ... ON CONFLICT ... DO UPDATE MERGE statement
Lateral join CROSS JOIN LATERAL (...) Not supported; use window functions
Table qualification schema.table `project.dataset.table` (backtick-quoted)
BOOL from aggregation bool_or(x), bool_and(x) LOGICAL_OR(x), LOGICAL_AND(x)
Regex match column ~ 'pattern' REGEXP_CONTAINS(column, r'pattern')
Approximate count (not built-in; use HyperLogLog extension) APPROX_COUNT_DISTINCT(column)
Partition pruning Index + WHERE clause Filter on the partitioning column in WHERE (pruning is automatic; ingestion-time tables use _PARTITIONTIME)

F.7 Data Warehouse and Lakehouse Architecture Reference

This section provides a concise reference to the storage architectures discussed in Chapter 24 (ML System Design) and Chapter 25 (Data Infrastructure). The choice of architecture constrains what SQL patterns are available, how features are computed, and how training data is versioned.

F.7.1 Traditional Data Warehouse

Architecture: Structured storage with predefined schemas. Data is loaded through ETL (extract, transform, load) pipelines. Query engines are tightly coupled to storage.

Component Examples
Storage + compute BigQuery, Snowflake, Amazon Redshift
Schema enforcement Enforced on write
Data format Proprietary columnar
Transaction support Full ACID
Best for BI dashboards, SQL-first feature engineering, aggregation-heavy workloads

Data science implications: Strong SQL support makes feature engineering straightforward. Versioning and time-travel are built in (BigQuery snapshots, Snowflake Time Travel). The limitation is that unstructured data (images, text, embeddings) must be stored elsewhere, and direct access from Python training jobs requires an export step.

F.7.2 Data Lake

Architecture: Unstructured and semi-structured storage with schema-on-read. Data is stored in open formats (Parquet, ORC, JSON) on object storage (S3, GCS, ADLS).

Component Examples
Storage S3, GCS, Azure Data Lake Storage
Compute Spark, Presto, Trino, Athena, Dask
Schema enforcement On read (or not at all)
Data format Parquet, ORC, JSON, CSV
Transaction support None (without a table format layer)
Best for Raw data landing zone, unstructured data, large-scale Spark jobs

Data science implications: Direct access from training jobs (PyTorch Dataset reading Parquet from S3). No storage cost for unused columns. The limitation is no ACID transactions, which means concurrent writes can corrupt data and there is no built-in time-travel for reproducibility.

F.7.3 Lakehouse

Architecture: Open table format layers on top of data lake storage that add ACID transactions, schema enforcement, and time-travel. This is the convergent architecture that most modern ML platforms target.

Component Examples
Table format Delta Lake, Apache Iceberg, Apache Hudi
Storage S3, GCS, ADLS (same as data lake)
Compute Spark, Trino, DuckDB, Snowflake (external tables), BigQuery (BigLake)
Schema enforcement On write (enforced by the table format)
Data format Parquet (managed by the table format's metadata layer)
Transaction support Full ACID
Best for Unified analytics + ML, feature stores, versioned training datasets

Data science implications: The lakehouse is the architecture that best supports the full ML lifecycle described in Chapters 24-30. Key capabilities for data science:

  • Time-travel for training set reproducibility: SELECT * FROM events TIMESTAMP AS OF '2024-06-15' (Delta Lake; VERSION AS OF takes a version number instead) or SELECT * FROM events FOR SYSTEM_TIME AS OF TIMESTAMP '2024-06-15' (Iceberg).
  • Schema evolution so that new features can be added without rewriting the entire table.
  • ACID writes so that concurrent feature pipeline runs do not produce corrupt data.
  • Direct Parquet access from PyTorch/TensorFlow without an export step.
  • Partition evolution (Iceberg) allows changing the partitioning scheme without rewriting data.

StreamRec production architecture (from Chapter 25):

Layer Technology Purpose
Raw landing zone S3 + JSON/Avro Kafka consumer writes raw events
Bronze (validated) Delta Lake on S3 Schema validation, deduplication
Silver (curated) Delta Lake on S3 Cleaned, joined, SCD-resolved tables
Gold (features) Delta Lake on S3 Pre-computed feature tables, training datasets
Online serving Redis Latest feature values for real-time inference
Query engine Spark (batch), Trino (ad hoc), DuckDB (local dev) Feature computation, EDA, debugging

This medallion architecture (bronze/silver/gold) is the standard pattern for ML data platforms. The feature engineering SQL in Sections F.1 and F.4 runs against the silver layer to produce the gold layer. The materialization query in Section F.5.2 pushes gold-layer features to the online serving layer.


F.8 Quick Reference: Choosing the Right Pattern

Problem Pattern Section
Deduplicate events, get latest record ROW_NUMBER() + filter F.1.1
Rank entities within groups RANK() / DENSE_RANK() F.1.1
Compute sequential gaps, deltas LAG() / LEAD() F.1.2
Trailing N-day aggregations SUM/AVG/COUNT ... ROWS BETWEEN F.1.3
Navigate hierarchical data Recursive CTE F.2
Diagnose slow queries EXPLAIN (ANALYZE, BUFFERS) F.3.1
Optimize large table scans Indexes (Postgres), partitioning + clustering (BigQuery) F.3.2
Cache expensive aggregations Materialized views F.3.3
Build time-aware behavioral features Calendar spine + trailing window join F.4.1
Join to versioned dimension tables SCD Type-2 range join F.4.2
Combine features across entity types Cross-entity multi-table join F.4.3
Prevent label leakage in training sets Point-in-time join (LATERAL or window) F.4.4
Populate online feature store Upsert latest features F.5.2
Port queries between engines Dialect reference table F.6
Choose storage architecture Warehouse vs. lake vs. lakehouse F.7