Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization

Data scientists who can't write SQL are data scientists who can't get data.

That statement will irritate some readers. You have pandas. You have Spark DataFrames. You have dbt and Airflow and a hundred tools that promise to abstract away the database. And yet when you start a new job, the first thing you will need to do is write SQL. The features you need live in PostgreSQL, BigQuery, Snowflake, or Redshift. They live in tables designed by software engineers for application needs, not for machine learning. And the person who turns those tables into model-ready features is you.

This chapter is not an introduction to SQL. You have that from Book 1 — you can write SELECT, JOIN, WHERE, GROUP BY, and ORDER BY without thinking. This chapter makes you dangerous. By the end of it, you will write window functions that compute temporal trends, CTEs that turn a 200-line query into something a colleague can read, complex joins that construct features no single table contains, and optimized queries that run in seconds instead of minutes.

We will work primarily in PostgreSQL, with BigQuery dialect notes in callout boxes. Every SQL pattern in this chapter maps to a feature engineering use case. Every query answers a question that a churn model, a recommendation engine, or a conversion funnel analysis actually needs answered.

Learning Objectives

By the end of this chapter, you will be able to:

  1. Write window functions (ROW_NUMBER, RANK, LAG, LEAD, running aggregates) for feature extraction
  2. Structure complex queries using CTEs for readability and debugging
  3. Perform complex joins (self-joins, anti-joins, lateral joins) for feature construction
  4. Read and interpret query execution plans
  5. Optimize slow queries (indexing strategies, partition pruning, avoiding SELECT *)

5.1 The StreamFlow Database Schema

Before we write a single query, we need to understand the data. StreamFlow is a subscription streaming analytics platform with $180M in annual recurring revenue, 2.4 million subscribers, and an 8.2% monthly churn rate. Their relational database was designed to run the application — user authentication, billing, content delivery. It was not designed for feature engineering. That is the normal state of affairs.

Here is the schema we will work with throughout this chapter:

-- Core subscriber table
CREATE TABLE subscribers (
    subscriber_id   BIGINT PRIMARY KEY,
    email           VARCHAR(255),
    signup_date     DATE NOT NULL,
    plan_type       VARCHAR(20) NOT NULL,  -- 'free', 'basic', 'premium', 'enterprise'
    plan_price      NUMERIC(8,2),
    country         VARCHAR(2),
    cancel_date     DATE,                  -- NULL if still active
    cancel_reason   VARCHAR(100)
);

-- Usage events (one row per session)
CREATE TABLE usage_events (
    event_id        BIGINT PRIMARY KEY,
    subscriber_id   BIGINT REFERENCES subscribers(subscriber_id),
    event_date      DATE NOT NULL,
    session_minutes NUMERIC(8,2),
    feature_used    VARCHAR(50),          -- 'dashboard', 'reports', 'api', 'export'
    device_type     VARCHAR(20)           -- 'web', 'mobile', 'tablet', 'api_client'
);

-- Support tickets
CREATE TABLE support_tickets (
    ticket_id       BIGINT PRIMARY KEY,
    subscriber_id   BIGINT REFERENCES subscribers(subscriber_id),
    created_date    DATE NOT NULL,
    resolved_date   DATE,
    category        VARCHAR(50),          -- 'billing', 'technical', 'feature_request', 'complaint'
    priority        VARCHAR(10)           -- 'low', 'medium', 'high', 'critical'
);

-- Plan change history
CREATE TABLE plan_changes (
    change_id       BIGINT PRIMARY KEY,
    subscriber_id   BIGINT REFERENCES subscribers(subscriber_id),
    change_date     DATE NOT NULL,
    old_plan        VARCHAR(20),
    new_plan        VARCHAR(20),
    change_type     VARCHAR(20)           -- 'upgrade', 'downgrade', 'same_tier'
);

-- Billing events
CREATE TABLE billing_events (
    billing_id      BIGINT PRIMARY KEY,
    subscriber_id   BIGINT REFERENCES subscribers(subscriber_id),
    billing_date    DATE NOT NULL,
    amount          NUMERIC(8,2),
    status          VARCHAR(20)           -- 'paid', 'failed', 'refunded'
);

Five tables. Application-normalized. No feature columns. Every feature we need for churn prediction — tenure, usage trend, support ticket count, plan upgrade history, billing failure count — must be computed from these tables. That computation is the subject of this chapter.

Production Tip — In most companies, you will not query production tables directly. You will query a read replica, a data warehouse, or a materialized view. The SQL is the same. The politics of getting access is a separate skill entirely.


5.2 Window Functions: The Feature Engineer's Best Friend

If you learn one thing from this chapter, let it be window functions. A window function performs a calculation across a set of rows that are somehow related to the current row — without collapsing those rows into a single output the way GROUP BY does.

The general syntax:

function_name(expression) OVER (
    PARTITION BY column1, column2
    ORDER BY column3
    ROWS BETWEEN start AND end
)

Three clauses. Three concepts:

  • PARTITION BY defines the groups (like GROUP BY, but without collapsing rows)
  • ORDER BY defines the sequence within each partition
  • ROWS BETWEEN (or RANGE BETWEEN) defines the window frame — which rows the function can see

Let us start with the functions you will use most.

ROW_NUMBER, RANK, and DENSE_RANK

These ranking functions assign a position to each row within its partition.

-- Rank each subscriber's usage sessions by date (most recent first)
SELECT
    subscriber_id,
    event_date,
    session_minutes,
    ROW_NUMBER() OVER (
        PARTITION BY subscriber_id
        ORDER BY event_date DESC
    ) AS session_rank
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days';

ROW_NUMBER assigns unique sequential integers — ties are broken arbitrarily. RANK assigns the same number to ties but leaves gaps (1, 2, 2, 4). DENSE_RANK assigns the same number to ties without gaps (1, 2, 2, 3).
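A toy table makes the three tie behaviors concrete. A quick sketch using Python's built-in sqlite3, whose window functions follow the same standard semantics as PostgreSQL (the table and scores here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (minutes REAL)")
conn.executemany("INSERT INTO scores VALUES (?)",
                 [(50,), (40,), (40,), (30,)])

rows = conn.execute("""
    SELECT minutes,
           ROW_NUMBER() OVER (ORDER BY minutes DESC) AS row_num,
           RANK()       OVER (ORDER BY minutes DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY minutes DESC) AS dense_rnk
    FROM scores
    ORDER BY row_num
""").fetchall()

# ROW_NUMBER breaks the 40/40 tie arbitrarily; RANK gives both rows 2
# and skips 3; DENSE_RANK gives both rows 2 and continues with 3
for r in rows:
    print(r)  # (50.0, 1, 1, 1), (40.0, 2, 2, 2), (40.0, 3, 2, 2), (30.0, 4, 4, 3)
```

The same four-row experiment works unchanged in PostgreSQL.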

Feature engineering use case: "Get the most recent session for each subscriber."

-- Most recent session per subscriber (the ROW_NUMBER + filter pattern)
WITH ranked_sessions AS (
    SELECT
        subscriber_id,
        event_date,
        session_minutes,
        device_type,
        ROW_NUMBER() OVER (
            PARTITION BY subscriber_id
            ORDER BY event_date DESC
        ) AS rn
    FROM usage_events
)
SELECT *
FROM ranked_sessions
WHERE rn = 1;

This pattern — ROW_NUMBER inside a CTE, then filter to rn = 1 — is one of the most common SQL patterns in data science. You will use it dozens of times. It replaces the pandas equivalent:

# The pandas equivalent — functional, but slower at scale.
# Note: groupby(...).first() returns the first non-null value per COLUMN,
# which can silently mix values from different rows; drop_duplicates
# keeps whole rows, matching the ROW_NUMBER pattern exactly.
df.sort_values('event_date', ascending=False).drop_duplicates('subscriber_id')

Common Mistake — Using GROUP BY subscriber_id with MAX(event_date) will give you the most recent date, but not the other columns from that row (session_minutes, device_type). You need the ROW_NUMBER pattern or a correlated subquery to get the full row.

LAG and LEAD: Temporal Feature Extraction

LAG looks backward. LEAD looks forward. Together, they are how you compute trends, changes, and velocity — the temporal features that dominate churn prediction.

-- Monthly usage with previous month's usage (for computing trends)
SELECT
    subscriber_id,
    DATE_TRUNC('month', event_date) AS usage_month,
    SUM(session_minutes) AS total_minutes,
    LAG(SUM(session_minutes), 1) OVER (
        PARTITION BY subscriber_id
        ORDER BY DATE_TRUNC('month', event_date)
    ) AS prev_month_minutes,
    LAG(SUM(session_minutes), 2) OVER (
        PARTITION BY subscriber_id
        ORDER BY DATE_TRUNC('month', event_date)
    ) AS two_months_ago_minutes
FROM usage_events
GROUP BY subscriber_id, DATE_TRUNC('month', event_date);

The result set:

 subscriber_id | usage_month | total_minutes | prev_month_minutes | two_months_ago_minutes
---------------+-------------+---------------+--------------------+------------------------
        100042 | 2024-10-01  |         342.5 |                    |
        100042 | 2024-11-01  |         287.0 |              342.5 |
        100042 | 2024-12-01  |         198.3 |              287.0 |                  342.5
        100042 | 2025-01-01  |         112.7 |              198.3 |                  287.0

Look at subscriber 100042. Usage is declining: 342 to 287 to 198 to 112 minutes. That is a churn signal. And we can compute it directly in SQL:

-- Usage trend: percentage change over last 3 months
WITH monthly_usage AS (
    SELECT
        subscriber_id,
        DATE_TRUNC('month', event_date) AS usage_month,
        SUM(session_minutes) AS total_minutes
    FROM usage_events
    GROUP BY subscriber_id, DATE_TRUNC('month', event_date)
),
with_lag AS (
    SELECT
        subscriber_id,
        usage_month,
        total_minutes,
        LAG(total_minutes, 1) OVER w AS prev_1m,
        LAG(total_minutes, 2) OVER w AS prev_2m,
        LAG(total_minutes, 3) OVER w AS prev_3m
    FROM monthly_usage
    WINDOW w AS (PARTITION BY subscriber_id ORDER BY usage_month)
)
SELECT
    subscriber_id,
    usage_month,
    total_minutes,
    -- Month-over-month change
    CASE WHEN prev_1m > 0
         THEN ROUND((total_minutes - prev_1m) / prev_1m * 100, 1)
         ELSE NULL
    END AS mom_pct_change,
    -- 3-month average (smoothed baseline)
    ROUND((COALESCE(prev_1m, 0) + COALESCE(prev_2m, 0) + COALESCE(prev_3m, 0)) /
          NULLIF(
              (CASE WHEN prev_1m IS NOT NULL THEN 1 ELSE 0 END +
               CASE WHEN prev_2m IS NOT NULL THEN 1 ELSE 0 END +
               CASE WHEN prev_3m IS NOT NULL THEN 1 ELSE 0 END), 0
          ), 1) AS avg_prev_3m
FROM with_lag
WHERE usage_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month');

BigQuery Note — BigQuery uses the same LAG/LEAD syntax. The main difference: DATE_TRUNC(event_date, MONTH) instead of DATE_TRUNC('month', event_date), and DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) instead of CURRENT_DATE - INTERVAL '1 month'.

NTILE: Percentile Bucketing

NTILE divides the ordered partition into N roughly equal groups. Use it for creating bucketed features.

-- Assign each subscriber to a usage decile
SELECT
    subscriber_id,
    total_monthly_minutes,
    NTILE(10) OVER (ORDER BY total_monthly_minutes) AS usage_decile
FROM (
    SELECT
        subscriber_id,
        SUM(session_minutes) AS total_monthly_minutes
    FROM usage_events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY subscriber_id
) monthly;

Feature engineering use case: "Which decile of usage does this subscriber fall into?" This is a robust, outlier-resistant alternative to using raw minutes as a feature.

Running Aggregates: Cumulative Sums and Moving Averages

Window functions with frame specifications let you compute running totals, moving averages, and other cumulative statistics.

-- Running total of session minutes per subscriber (cumulative engagement)
SELECT
    subscriber_id,
    event_date,
    session_minutes,
    SUM(session_minutes) OVER (
        PARTITION BY subscriber_id
        ORDER BY event_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_minutes,

    -- 7-day moving average of session minutes
    AVG(session_minutes) OVER (
        PARTITION BY subscriber_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM usage_events
ORDER BY subscriber_id, event_date;

The frame clause matters:

Frame                                            | Meaning
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | All rows from the start of the partition to here (running total)
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW         | Current row and the 6 before it (7-row window)
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING         | Centered window of 7 rows
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING | Current row to the end (reverse running total)

Common Mistake — Without an explicit frame clause, the default depends on whether ORDER BY is present. With ORDER BY, PostgreSQL defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which can behave unexpectedly with duplicate values. Always specify the frame clause explicitly when using running aggregates. Future-you will thank present-you.
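The duplicate-value surprise is easy to reproduce. A minimal sketch in Python's sqlite3, which uses the same default frame as PostgreSQL; the four-row table is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE e (event_date TEXT, minutes REAL)")
conn.executemany("INSERT INTO e VALUES (?, ?)", [
    ("2025-01-01", 10), ("2025-01-02", 10),
    ("2025-01-02", 10),   # duplicate date: a "peer" of the previous row
    ("2025-01-03", 10),
])

# Default frame with ORDER BY is RANGE ... CURRENT ROW, so peer rows
# (equal event_date values) all receive the same running total
range_sums = [r[0] for r in conn.execute(
    "SELECT SUM(minutes) OVER (ORDER BY event_date) FROM e ORDER BY event_date")]

# Explicit ROWS frame: a true row-by-row running total
rows_sums = [r[0] for r in conn.execute("""
    SELECT SUM(minutes) OVER (
        ORDER BY event_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM e ORDER BY event_date""")]

print(range_sums)  # [10.0, 30.0, 30.0, 40.0] -- both Jan 2 rows see 30
print(rows_sums)   # a strict running total: 10, 20, 30, 40
```

The default frame is not wrong, but it is rarely what you mean when computing cumulative features.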

The SQL-to-Pandas Translation Table

If you think in DataFrames, this table is your Rosetta Stone.

Feature pattern                  | SQL                                                              | pandas
Rank within group                | ROW_NUMBER() OVER (PARTITION BY g ORDER BY x)                    | df.groupby('g')['x'].rank(method='first')
Previous row value               | LAG(x, 1) OVER (PARTITION BY g ORDER BY t)                       | df.groupby('g')['x'].shift(1)
Next row value                   | LEAD(x, 1) OVER (PARTITION BY g ORDER BY t)                      | df.groupby('g')['x'].shift(-1)
Running sum                      | SUM(x) OVER (PARTITION BY g ORDER BY t ROWS UNBOUNDED PRECEDING) | df.groupby('g')['x'].cumsum()
Moving average                   | AVG(x) OVER (PARTITION BY g ORDER BY t ROWS 6 PRECEDING)         | df.groupby('g')['x'].rolling(7).mean()
Percentile bucket                | NTILE(10) OVER (ORDER BY x)                                      | pd.qcut(df['x'], 10, labels=False) + 1
First value in group             | FIRST_VALUE(x) OVER (PARTITION BY g ORDER BY t)                  | df.groupby('g')['x'].transform('first')
Count within group (no collapse) | COUNT(*) OVER (PARTITION BY g)                                   | df.groupby('g')['x'].transform('count')

The SQL versions run inside the database — on indexed, optimized storage, often distributed across multiple nodes. The pandas versions run in memory on your laptop. For 2.4 million subscribers with years of event history, the SQL version finishes while the pandas version is still loading the CSV.

War Story — At a previous company, an analyst inherited a churn feature pipeline written in pandas. It loaded 18 months of usage events into memory (14 GB), computed LAG-equivalent features using groupby().shift(), joined four DataFrames, and wrote the result to a CSV. Runtime: 47 minutes. Memory: peaked at 38 GB, requiring a special EC2 instance. A senior data engineer rewrote it as a single SQL query with three CTEs and four window functions. Runtime: 90 seconds. Infrastructure cost: zero incremental, because the data warehouse was already running. The 200 lines of pandas became 15 lines of SQL that ran roughly 30x faster. The lesson is not that pandas is bad. The lesson is that the database is an extraordinarily powerful computation engine, and ignoring it means doing work that has already been done for you.


5.3 Common Table Expressions (CTEs): Writing SQL That Humans Can Read

A CTE is a named temporary result set that exists for the duration of a single query. Think of it as a variable assignment in SQL.

WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;

Why do CTEs matter? Because SQL without CTEs becomes unreadable fast. Consider the feature extraction query we need for StreamFlow churn prediction: compute tenure, recent usage, usage trend, support tickets, plan changes, and billing failures — all per subscriber. Without CTEs, this is a single monstrous query with five subqueries nested inside each other. With CTEs, it is five named steps followed by a final join.

Building a Feature Extraction Query with CTEs

-- StreamFlow churn features: the CTE approach
WITH tenure AS (
    -- Step 1: Compute subscriber tenure in months
    SELECT
        subscriber_id,
        plan_type,
        country,
        signup_date,
        cancel_date,
        -- DATE minus DATE yields an integer day count in PostgreSQL
        -- (not an interval), so divide by the average month length directly
        (COALESCE(cancel_date, CURRENT_DATE) - signup_date) / 30.44 AS tenure_months
    FROM subscribers
),

recent_usage AS (
    -- Step 2: Aggregate usage over the last 30 and 90 days
    SELECT
        subscriber_id,
        SUM(CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
                 THEN session_minutes ELSE 0 END) AS minutes_last_30d,
        SUM(CASE WHEN event_date >= CURRENT_DATE - INTERVAL '90 days'
                 THEN session_minutes ELSE 0 END) AS minutes_last_90d,
        COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
                             THEN event_date END) AS active_days_last_30d,
        COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
                             THEN feature_used END) AS distinct_features_last_30d,
        COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
                             THEN device_type END) AS distinct_devices_last_30d
    FROM usage_events
    WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY subscriber_id
),

ticket_stats AS (
    -- Step 3: Support ticket features
    SELECT
        subscriber_id,
        COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '90 days')
            AS tickets_last_90d,
        COUNT(*) FILTER (WHERE category = 'complaint'
                         AND created_date >= CURRENT_DATE - INTERVAL '90 days')
            AS complaints_last_90d,
        MAX(created_date) AS last_ticket_date,
        CURRENT_DATE - MAX(created_date) AS days_since_last_ticket
    FROM support_tickets
    GROUP BY subscriber_id
),

plan_history AS (
    -- Step 4: Plan change features
    SELECT
        subscriber_id,
        COUNT(*) AS total_plan_changes,
        COUNT(*) FILTER (WHERE change_type = 'downgrade') AS downgrades,
        COUNT(*) FILTER (WHERE change_type = 'upgrade') AS upgrades,
        MAX(change_date) AS last_change_date
    FROM plan_changes
    GROUP BY subscriber_id
),

billing_stats AS (
    -- Step 5: Billing failure features
    SELECT
        subscriber_id,
        COUNT(*) FILTER (WHERE status = 'failed'
                         AND billing_date >= CURRENT_DATE - INTERVAL '180 days')
            AS failed_payments_6m,
        COUNT(*) FILTER (WHERE status = 'refunded'
                         AND billing_date >= CURRENT_DATE - INTERVAL '180 days')
            AS refunds_6m
    FROM billing_events
    GROUP BY subscriber_id
)

-- Final assembly: join all feature CTEs
SELECT
    t.subscriber_id,
    t.plan_type,
    t.country,
    t.tenure_months,
    COALESCE(u.minutes_last_30d, 0) AS minutes_last_30d,
    COALESCE(u.minutes_last_90d, 0) AS minutes_last_90d,
    COALESCE(u.active_days_last_30d, 0) AS active_days_last_30d,
    COALESCE(u.distinct_features_last_30d, 0) AS distinct_features_last_30d,
    COALESCE(u.distinct_devices_last_30d, 0) AS distinct_devices_last_30d,
    COALESCE(ts.tickets_last_90d, 0) AS tickets_last_90d,
    COALESCE(ts.complaints_last_90d, 0) AS complaints_last_90d,
    ts.days_since_last_ticket,
    COALESCE(ph.total_plan_changes, 0) AS total_plan_changes,
    COALESCE(ph.downgrades, 0) AS downgrades,
    COALESCE(ph.upgrades, 0) AS upgrades,
    COALESCE(bs.failed_payments_6m, 0) AS failed_payments_6m,
    COALESCE(bs.refunds_6m, 0) AS refunds_6m,
    CASE WHEN t.cancel_date IS NOT NULL THEN 1 ELSE 0 END AS churned
FROM tenure t
LEFT JOIN recent_usage u ON t.subscriber_id = u.subscriber_id
LEFT JOIN ticket_stats ts ON t.subscriber_id = ts.subscriber_id
LEFT JOIN plan_history ph ON t.subscriber_id = ph.subscriber_id
LEFT JOIN billing_stats bs ON t.subscriber_id = bs.subscriber_id;

Read that query out loud. "Start with tenure. Then compute recent usage. Then ticket stats. Then plan history. Then billing stats. Then join them all together." Each CTE is independently testable. You can run SELECT * FROM tenure LIMIT 10 to check the first step. You can add a sixth CTE without touching the first five. You can hand this query to a colleague and they will understand it.

Production Tip — Name your CTEs like you name functions: descriptively. cte1, cte2, cte3 is only slightly better than nested subqueries. tenure, recent_usage, ticket_stats tells you what each step computes. Future-you is a colleague too.

The FILTER Clause: PostgreSQL's Secret Weapon

Notice the COUNT(*) FILTER (WHERE ...) syntax in the CTEs above. This is PostgreSQL's FILTER clause, and it is dramatically cleaner than the traditional CASE-WHEN approach:

-- Traditional CASE-WHEN (works everywhere)
COUNT(CASE WHEN category = 'complaint' THEN 1 END) AS complaints

-- PostgreSQL FILTER (cleaner, same result)
COUNT(*) FILTER (WHERE category = 'complaint') AS complaints

Both produce identical results. FILTER is more readable, especially when you have multiple conditional aggregations in the same query.
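The "identical results" claim can be checked side by side. A minimal sketch using Python's bundled sqlite3 (SQLite 3.30+ also implements the FILTER clause; the ticket rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_tickets (category TEXT)")
conn.executemany("INSERT INTO support_tickets VALUES (?)",
                 [("complaint",), ("billing",), ("complaint",), ("technical",)])

# Traditional CASE-WHEN: COUNT skips the NULLs produced by non-matches
case_when = conn.execute(
    "SELECT COUNT(CASE WHEN category = 'complaint' THEN 1 END) FROM support_tickets"
).fetchone()[0]

# FILTER clause: same aggregate, condition stated directly
filter_clause = conn.execute(
    "SELECT COUNT(*) FILTER (WHERE category = 'complaint') FROM support_tickets"
).fetchone()[0]

print(case_when, filter_clause)  # 2 2
```

The readability gap widens as you stack more conditional aggregates in one SELECT, which is exactly what feature queries do.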

BigQuery Note — BigQuery does not support the FILTER clause. Use COUNTIF(condition) instead, which is BigQuery's equivalent: COUNTIF(category = 'complaint') AS complaints.

Recursive CTEs: When You Need Hierarchies

Recursive CTEs solve problems involving hierarchical or iterative data. They are less common in feature engineering but invaluable when you need them.

-- Generate a date spine (one row per day for the last 365 days)
WITH RECURSIVE date_spine AS (
    SELECT CURRENT_DATE - INTERVAL '365 days' AS dt
    UNION ALL
    SELECT dt + INTERVAL '1 day'
    FROM date_spine
    WHERE dt < CURRENT_DATE
)
SELECT dt::DATE AS calendar_date
FROM date_spine;

Feature engineering use case: join this date spine to your usage events with a LEFT JOIN to ensure every subscriber has a row for every day — including days with zero usage. Missing rows are not the same as zero values, and your model needs to know the difference.

-- Daily usage per subscriber, including zero-usage days
WITH RECURSIVE date_spine AS (
    SELECT CURRENT_DATE - INTERVAL '90 days' AS dt
    UNION ALL
    SELECT dt + INTERVAL '1 day'
    FROM date_spine
    WHERE dt < CURRENT_DATE
),
active_subscribers AS (
    SELECT subscriber_id
    FROM subscribers
    WHERE cancel_date IS NULL OR cancel_date >= CURRENT_DATE - INTERVAL '90 days'
),
subscriber_days AS (
    SELECT
        s.subscriber_id,
        d.dt::DATE AS calendar_date
    FROM active_subscribers s
    CROSS JOIN date_spine d
)
SELECT
    sd.subscriber_id,
    sd.calendar_date,
    COALESCE(SUM(ue.session_minutes), 0) AS daily_minutes,
    COUNT(ue.event_id) AS session_count  -- COUNT never returns NULL; no COALESCE needed
FROM subscriber_days sd
LEFT JOIN usage_events ue
    ON sd.subscriber_id = ue.subscriber_id
    AND sd.calendar_date = ue.event_date
GROUP BY sd.subscriber_id, sd.calendar_date;

BigQuery Note — Recent BigQuery releases do support WITH RECURSIVE, but the idiomatic way to build a date spine there is GENERATE_DATE_ARRAY(start_date, end_date) with UNNEST(): SELECT dt FROM UNNEST(GENERATE_DATE_ARRAY('2024-04-01', CURRENT_DATE())) AS dt.

Production Tip — Most data warehouses maintain a pre-built date dimension table (dim_date or calendar). Use it instead of generating dates on the fly. The recursive CTE approach is for ad-hoc analysis; the dimension table is for production pipelines.


5.4 Complex Joins for Feature Construction

Basic joins get you basic features. Complex joins — self-joins, anti-joins, lateral joins — get you the features that actually predict.

Self-Joins: Comparing a Row to Itself

A self-join joins a table to itself. Use it when you need to compare rows within the same table. The canonical case is pairing each row with the next row in a sequence; note that the query below does that pairing with LEAD, which expresses the classic self-join logic in a single pass over the data instead of a join.

-- For each subscriber, find the usage gap: max days between consecutive sessions
WITH sessions AS (
    SELECT
        subscriber_id,
        event_date,
        LEAD(event_date) OVER (
            PARTITION BY subscriber_id
            ORDER BY event_date
        ) AS next_session_date
    FROM (
        SELECT DISTINCT subscriber_id, event_date
        FROM usage_events
    ) daily_sessions
)
SELECT
    subscriber_id,
    MAX(next_session_date - event_date) AS max_gap_days,
    AVG(next_session_date - event_date) AS avg_gap_days
FROM sessions
WHERE next_session_date IS NOT NULL
GROUP BY subscriber_id;

This computes two features: the maximum gap between sessions (a spike in this value suggests disengagement) and the average gap between sessions (a baseline engagement rhythm). Both are strong churn predictors.
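To see the explicit self-join this section is named for, here is a sketch in Python's sqlite3 that computes the same max-gap feature both ways and confirms they agree. The dates are invented, and julianday is SQLite's way to subtract date strings (PostgreSQL subtracts DATEs directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (subscriber_id INT, event_date TEXT)")
conn.executemany("INSERT INTO usage_events VALUES (?, ?)", [
    (1, "2025-01-01"), (1, "2025-01-03"), (1, "2025-01-10"),
    (2, "2025-01-05"), (2, "2025-01-06"),
])

# Window-function version: LEAD pairs each day with the next in one pass
lead_gaps = dict(conn.execute("""
    WITH daily AS (SELECT DISTINCT subscriber_id, event_date FROM usage_events),
    nxt AS (
        SELECT subscriber_id, event_date,
               LEAD(event_date) OVER (
                   PARTITION BY subscriber_id ORDER BY event_date) AS next_date
        FROM daily)
    SELECT subscriber_id, MAX(julianday(next_date) - julianday(event_date))
    FROM nxt WHERE next_date IS NOT NULL
    GROUP BY subscriber_id"""))

# Explicit self-join version: the nearest strictly later day per row
join_gaps = dict(conn.execute("""
    WITH daily AS (SELECT DISTINCT subscriber_id, event_date FROM usage_events),
    nxt AS (
        SELECT a.subscriber_id, a.event_date, MIN(b.event_date) AS next_date
        FROM daily a
        JOIN daily b ON b.subscriber_id = a.subscriber_id
                    AND b.event_date > a.event_date
        GROUP BY a.subscriber_id, a.event_date)
    SELECT subscriber_id, MAX(julianday(next_date) - julianday(event_date))
    FROM nxt GROUP BY subscriber_id"""))

print(lead_gaps)  # {1: 7.0, 2: 1.0}
print(join_gaps)  # {1: 7.0, 2: 1.0}
```

Both agree, but the self-join compares every row to every later row in its partition, which is why the LEAD version is preferred on real event tables.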

Anti-Joins: Finding What Is Missing

An anti-join returns rows from the left table that have no match in the right table. There are two patterns:

-- Pattern 1: LEFT JOIN + IS NULL
-- Find subscribers who have NEVER filed a support ticket
SELECT s.subscriber_id, s.plan_type, s.signup_date
FROM subscribers s
LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id
WHERE st.subscriber_id IS NULL;

-- Pattern 2: NOT EXISTS (often faster)
-- Same result, different execution path
SELECT s.subscriber_id, s.plan_type, s.signup_date
FROM subscribers s
WHERE NOT EXISTS (
    SELECT 1
    FROM support_tickets st
    WHERE st.subscriber_id = s.subscriber_id
);

Feature engineering use case: "Has this subscriber ever used feature X?" is a binary feature built with the same LEFT JOIN + NULL-check machinery, just inverted: a match yields 1 instead of 0. "Has this subscriber logged in from mobile?" Same pattern.

-- Binary features via anti-join pattern
SELECT
    s.subscriber_id,
    CASE WHEN mobile.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_mobile,
    CASE WHEN api.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_api,
    CASE WHEN export.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_export
FROM subscribers s
LEFT JOIN (
    SELECT DISTINCT subscriber_id
    FROM usage_events WHERE device_type = 'mobile'
) mobile ON s.subscriber_id = mobile.subscriber_id
LEFT JOIN (
    SELECT DISTINCT subscriber_id
    FROM usage_events WHERE feature_used = 'api'
) api ON s.subscriber_id = api.subscriber_id
LEFT JOIN (
    SELECT DISTINCT subscriber_id
    FROM usage_events WHERE feature_used = 'export'
) export ON s.subscriber_id = export.subscriber_id;

Lateral Joins: Correlated Subqueries Done Right

A lateral join lets a subquery reference columns from preceding tables in the FROM clause. It is like a correlated subquery, but it can return multiple columns and multiple rows.

-- For each subscriber, get their 3 most recent sessions
SELECT
    s.subscriber_id,
    s.plan_type,
    recent.event_date,
    recent.session_minutes,
    recent.feature_used
FROM subscribers s
CROSS JOIN LATERAL (
    SELECT event_date, session_minutes, feature_used
    FROM usage_events ue
    WHERE ue.subscriber_id = s.subscriber_id
    ORDER BY event_date DESC
    LIMIT 3
) recent
WHERE s.cancel_date IS NULL;

Without LATERAL, you would need the ROW_NUMBER + CTE pattern we saw in Section 5.2. LATERAL is more concise for "top-N per group" queries.

BigQuery Note — BigQuery does not support LATERAL joins. Use ARRAY_AGG() with ORDER BY and subscript notation instead, or stick with the ROW_NUMBER pattern which works everywhere.
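The portable ROW_NUMBER pattern can be sketched end to end in Python's sqlite3, which, like BigQuery, has no LATERAL join (the rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_events (subscriber_id INT, event_date TEXT, session_minutes REAL)")
conn.executemany("INSERT INTO usage_events VALUES (?, ?, ?)", [
    (1, "2025-01-01", 12.0), (1, "2025-01-02", 30.5),
    (1, "2025-01-03", 18.0), (1, "2025-01-04", 44.0),
    (1, "2025-01-05", 7.5),
    (2, "2025-01-01", 5.0),
])

# Top-3 most recent sessions per subscriber: rank inside a CTE, then filter
top3 = conn.execute("""
    WITH ranked AS (
        SELECT subscriber_id, event_date, session_minutes,
               ROW_NUMBER() OVER (
                   PARTITION BY subscriber_id
                   ORDER BY event_date DESC) AS rn
        FROM usage_events)
    SELECT subscriber_id, event_date, session_minutes
    FROM ranked
    WHERE rn <= 3
    ORDER BY subscriber_id, event_date DESC
""").fetchall()

for row in top3:
    print(row)
```

Subscriber 1 keeps only its three newest dates; subscriber 2, with a single session, keeps that one row.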

Common Mistake — CROSS JOIN LATERAL returns no rows for subscribers with zero usage events (because the subquery returns empty). If you want to keep subscribers with no usage, use LEFT JOIN LATERAL ... ON TRUE instead.


5.5 Conditional Aggregation: Feature Engineering Patterns

Many features follow the same SQL pattern: count or sum something, but only when a condition is met. Conditional aggregation is the workhorse of SQL-based feature engineering.

The "Days Since Last X" Pattern

-- Days since last activity of each type
SELECT
    s.subscriber_id,
    CURRENT_DATE - MAX(ue.event_date) AS days_since_last_session,
    CURRENT_DATE - MAX(CASE WHEN ue.feature_used = 'dashboard'
                            THEN ue.event_date END)
        AS days_since_last_dashboard,
    CURRENT_DATE - MAX(CASE WHEN ue.feature_used = 'reports'
                            THEN ue.event_date END)
        AS days_since_last_report,
    CURRENT_DATE - MAX(st.created_date) AS days_since_last_ticket
FROM subscribers s
LEFT JOIN usage_events ue ON s.subscriber_id = ue.subscriber_id
LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id
WHERE s.cancel_date IS NULL
GROUP BY s.subscriber_id;

The "days since last X" pattern produces some of the most predictive features in churn models. A subscriber whose last session was 45 days ago is a different risk profile than one whose last session was yesterday — and this pattern captures that difference in a single integer.

The Event Count Pattern

-- Event counts by time window and category
SELECT
    subscriber_id,
    -- Session counts by time window
    COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '7 days')
        AS sessions_7d,
    COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days')
        AS sessions_30d,
    COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days')
        AS sessions_90d,

    -- Feature usage counts (last 30 days)
    COUNT(*) FILTER (WHERE feature_used = 'dashboard'
                     AND event_date >= CURRENT_DATE - INTERVAL '30 days')
        AS dashboard_sessions_30d,
    COUNT(*) FILTER (WHERE feature_used = 'reports'
                     AND event_date >= CURRENT_DATE - INTERVAL '30 days')
        AS report_sessions_30d,
    COUNT(*) FILTER (WHERE feature_used = 'api'
                     AND event_date >= CURRENT_DATE - INTERVAL '30 days')
        AS api_sessions_30d,
    COUNT(*) FILTER (WHERE feature_used = 'export'
                     AND event_date >= CURRENT_DATE - INTERVAL '30 days')
        AS export_sessions_30d,

    -- Engagement ratio: recent vs. historical
    CASE WHEN COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days') > 0
         THEN COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days')::NUMERIC /
              COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days')
         ELSE NULL
    END AS engagement_ratio_30d_to_90d
FROM usage_events
GROUP BY subscriber_id;

The engagement ratio at the bottom is particularly useful: it captures the trend direction without needing LAG. A ratio near 0.33 means usage is steady across all three months. A ratio near 0.1 means the subscriber is fading. A ratio near 0.8 means they are ramping up.
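Those thresholds are plain arithmetic. A tiny sketch with invented session counts shows where they come from (the helper function is ours, not part of the StreamFlow schema):

```python
def engagement_ratio(sessions_30d, sessions_90d):
    """30d/90d session ratio; the 90-day window includes the last 30 days."""
    return sessions_30d / sessions_90d if sessions_90d else None

steady = engagement_ratio(10, 30)    # 10 sessions/month for 3 months -> ~0.33
fading = engagement_ratio(3, 30)     # most activity is old -> 0.1
ramping = engagement_ratio(24, 30)   # most activity is recent -> 0.8

print(round(steady, 2), fading, ramping)  # 0.33 0.1 0.8
```

Because the 90-day count contains the 30-day count, the ratio is bounded between 0 and 1, which also makes it a well-behaved model input.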

The Ratio Pattern

Ratios normalize features and capture behavior proportions:

-- Behavioral ratios
SELECT
    subscriber_id,
    -- What fraction of sessions use the premium "reports" feature?
    COUNT(*) FILTER (WHERE feature_used = 'reports')::NUMERIC /
        NULLIF(COUNT(*), 0) AS report_usage_ratio,

    -- What fraction of sessions are on mobile?
    COUNT(*) FILTER (WHERE device_type = 'mobile')::NUMERIC /
        NULLIF(COUNT(*), 0) AS mobile_ratio,

    -- Average session length
    AVG(session_minutes) AS avg_session_minutes,

    -- Session length variability (coefficient of variation)
    STDDEV(session_minutes) / NULLIF(AVG(session_minutes), 0) AS session_cv
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY subscriber_id;

Common Mistake — Division by zero. Always use NULLIF(denominator, 0) when computing ratios. A subscriber with zero sessions in the window will cause a division error without this guard. NULLIF(x, 0) returns NULL when x is 0, and NULL propagates through division to produce NULL rather than an error.
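The guard is easy to demonstrate. A two-query sketch in Python's sqlite3 (note that bare division by zero quietly returns NULL in SQLite but raises an error in PostgreSQL, which is exactly why the guard matters in production dialects):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULLIF(0, 0) -> NULL, and NULL propagates through division to NULL
guarded = conn.execute("SELECT 5.0 / NULLIF(0, 0)").fetchone()[0]

# A nonzero denominator passes through NULLIF untouched
nonzero = conn.execute("SELECT 5.0 / NULLIF(2, 0)").fetchone()[0]

print(guarded, nonzero)  # None 2.5
```

A NULL ratio is something your downstream imputation step can handle; a crashed feature query is not.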


5.6 Reading Query Execution Plans

Writing correct SQL is necessary. Writing fast SQL is what separates you from the analyst who can only run small queries. The tool for understanding query performance is EXPLAIN.

EXPLAIN vs. EXPLAIN ANALYZE

-- EXPLAIN shows the plan WITHOUT executing the query
EXPLAIN
SELECT * FROM usage_events
WHERE subscriber_id = 100042
  AND event_date >= '2024-10-01';

-- EXPLAIN ANALYZE shows the plan AND executes the query (with actual timings)
EXPLAIN ANALYZE
SELECT * FROM usage_events
WHERE subscriber_id = 100042
  AND event_date >= '2024-10-01';

The output of EXPLAIN ANALYZE:

Index Scan using idx_usage_subscriber_date on usage_events
    (cost=0.56..124.32 rows=47 width=52)
    (actual time=0.028..0.156 rows=42 loops=1)
  Index Cond: ((subscriber_id = 100042) AND (event_date >= '2024-10-01'::date))
Planning Time: 0.089 ms
Execution Time: 0.198 ms

Key things to read:

  1. Scan type: Index Scan means the database used an index. Good. Seq Scan means it read the entire table. Bad (usually).
  2. cost: Two numbers. The first is startup cost; the second is total estimated cost. Lower is better.
  3. rows: Estimated vs. actual row count. A large discrepancy means stale statistics (ANALYZE your_table).
  4. actual time: Wall-clock time in milliseconds.
  5. Planning Time vs. Execution Time: If planning time is high, your query might be too complex for the optimizer.

Common Scan Types

  • Seq Scan: full table scan; reads every row. You see it when there is no suitable index, or the table is small enough that scanning is cheaper.
  • Index Scan: uses an index to find matching rows. You see it when an index exists and the query is selective (returns roughly <10-15% of rows).
  • Index Only Scan: uses the index without touching the table. You see it when all requested columns are in the index (a covering index).
  • Bitmap Index Scan: builds a bitmap of matching rows, then fetches them. You see it when an index exists but the query returns too many rows for a regular index scan.
  • Hash Join: builds a hash table from one input, probes it with the other. You see it for equi-joins (=) on non-indexed columns.
  • Merge Join: merges two sorted inputs. You see it when both inputs are already sorted (often from index scans).
  • Nested Loop: for each row in the outer table, scans the inner table. You see it when the outer table is small and the inner table is indexed.

Reading a Real Feature Query Plan

Let us see what happens with our CTE-based feature extraction query:

EXPLAIN ANALYZE
WITH ticket_stats AS (
    SELECT
        subscriber_id,
        COUNT(*) AS tickets_last_90d
    FROM support_tickets
    WHERE created_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY subscriber_id
)
SELECT
    s.subscriber_id,
    s.plan_type,
    COALESCE(ts.tickets_last_90d, 0) AS tickets_last_90d
FROM subscribers s
LEFT JOIN ticket_stats ts ON s.subscriber_id = ts.subscriber_id;

The plan output:
Hash Left Join  (cost=1245.67..34521.89 rows=2400000 width=30)
               (actual time=892.45..4521.33 rows=2400000 loops=1)
  Hash Cond: (s.subscriber_id = ts.subscriber_id)
  ->  Seq Scan on subscribers s
        (cost=0.00..28456.00 rows=2400000 width=22)
        (actual time=0.012..1234.56 rows=2400000 loops=1)
  ->  Hash  (cost=1189.42..1189.42 rows=4500 width=16)
           (actual time=891.23..891.23 rows=4500 loops=1)
        Buckets: 8192  Batches: 1  Memory Usage: 240kB
        ->  GroupAggregate  (cost=1089.42..1189.42 rows=4500 width=16)
                           (actual time=856.78..889.45 rows=4500 loops=1)
              Group Key: support_tickets.subscriber_id
              ->  Sort  (cost=1089.42..1102.67 rows=5300 width=8)
                       (actual time=856.12..861.34 rows=5300 loops=1)
                    Sort Key: support_tickets.subscriber_id
                    Sort Method: quicksort  Memory: 412kB
                    ->  Seq Scan on support_tickets
                          (cost=0.00..789.00 rows=5300 width=8)
                          (actual time=0.015..845.23 rows=5300 loops=1)
                          Filter: (created_date >= (CURRENT_DATE - '90 days'::interval))
                          Rows Removed by Filter: 28700
Planning Time: 1.234 ms
Execution Time: 5678.90 ms

Reading this bottom-up (which is how query plans execute):

  1. PostgreSQL scans the support_tickets table sequentially (no index on created_date)
  2. Filters to the last 90 days, removing 28,700 rows and keeping 5,300
  3. Sorts by subscriber_id and groups to compute the count
  4. Builds a hash table from those 4,500 results
  5. Scans all 2.4M subscribers sequentially
  6. Probes the hash table for each subscriber to complete the left join

The bottleneck: the Seq Scan on support_tickets with a date filter. An index would help.
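You can watch an index change a plan without a PostgreSQL instance. This sketch uses SQLite's EXPLAIN QUERY PLAN as a lightweight analogue of EXPLAIN; the table mirrors support_tickets but the data is synthetic, and the plan strings are SQLite's ("SCAN" / "SEARCH"), not PostgreSQL's:

```python
# Before/after an index on the date filter column, via SQLite's query planner.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_tickets (subscriber_id INTEGER, created_date TEXT)")
conn.executemany(
    "INSERT INTO support_tickets VALUES (?, ?)",
    [(i, "2024-0%d-01" % (1 + i % 9)) for i in range(1000)],
)

query = "SELECT subscriber_id FROM support_tickets WHERE created_date >= '2024-08-01'"

# Without an index: full table scan
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][3])  # e.g. "SCAN support_tickets"

# With an index on the filter column: index search
conn.execute("CREATE INDEX idx_tickets_date ON support_tickets (created_date)")
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][3])  # e.g. "SEARCH support_tickets USING INDEX idx_tickets_date ..."
```
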


5.7 Query Optimization

Knowing how to read a plan is step one. Knowing how to improve it is step two.

Strategy 1: Create the Right Indexes

An index is a data structure that lets the database find rows without scanning the entire table. Think of it as the index at the back of a textbook — instead of reading every page to find "window function," you look it up in the index and go directly to page 342.

-- Index for the support_tickets date filter we saw above
CREATE INDEX idx_tickets_date ON support_tickets (created_date);

-- Composite index for the most common query pattern: filter by subscriber + date
CREATE INDEX idx_usage_subscriber_date
    ON usage_events (subscriber_id, event_date);

-- Composite index with an INCLUDE for covering index scans
CREATE INDEX idx_usage_covering
    ON usage_events (subscriber_id, event_date)
    INCLUDE (session_minutes, feature_used, device_type);

The last index is a covering index — it includes all the columns our queries need, so PostgreSQL can satisfy the query entirely from the index without touching the table at all (an "Index Only Scan"). This is the fastest possible read path.

Indexing rules of thumb:

  1. Index columns that appear in WHERE clauses
  2. Index columns that appear in JOIN conditions
  3. Put the most selective column first in composite indexes
  4. Do not index columns with very low cardinality (e.g., a boolean) unless combined with a selective column
  5. Every index slows down writes. Do not index everything.
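Rule 3 is visible directly in a plan. A sketch using SQLite as a stand-in for PostgreSQL (synthetic data): a composite index serves a filter on its leading column, but a filter on only the trailing column falls back to a full scan, because the index is sorted by the leading column first.

```python
# Composite index column order: leading-column filters use the index,
# trailing-column-only filters do not.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_events "
    "(subscriber_id INTEGER, event_date TEXT, session_minutes REAL)"
)
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?, ?)",
    [(i % 50, "2024-10-%02d" % (1 + i % 28), 5.0) for i in range(2000)],
)
conn.execute("CREATE INDEX idx_sub_date ON usage_events (subscriber_id, event_date)")

# Filter on the leading column: the composite index is used
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM usage_events WHERE subscriber_id = 7"
).fetchone()
print(plan[3])  # e.g. "SEARCH usage_events USING INDEX idx_sub_date ..."

# Filter on only the trailing column: full table scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM usage_events WHERE event_date = '2024-10-05'"
).fetchone()
print(plan[3])  # e.g. "SCAN usage_events"
```
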

Strategy 2: Avoid SELECT *

-- Bad: reads all columns from disk, even if you only need two
SELECT * FROM usage_events WHERE subscriber_id = 100042;

-- Good: reads only the columns you need
SELECT event_date, session_minutes
FROM usage_events
WHERE subscriber_id = 100042;

With a covering index on (subscriber_id, event_date) INCLUDE (session_minutes), the second query can use an Index Only Scan. The first query cannot, because SELECT * requests columns that are not in the index.
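SQLite draws the same distinction, labeling an index-only read as a COVERING INDEX (PostgreSQL's Index Only Scan). A sketch with synthetic data; note SQLite has no INCLUDE clause, so the extra column goes into the index key:

```python
# The SELECT list decides whether an index-only read is possible.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_events (subscriber_id INTEGER, event_date TEXT, "
    "session_minutes REAL, feature_used TEXT)"
)
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?, ?, ?)",
    [(i % 20, "2024-10-01", 5.0, "dashboard") for i in range(500)],
)
conn.execute(
    "CREATE INDEX idx_usage_covering "
    "ON usage_events (subscriber_id, event_date, session_minutes)"
)

# Every referenced column is in the index: index-only read
covered = conn.execute(
    "EXPLAIN QUERY PLAN SELECT event_date, session_minutes "
    "FROM usage_events WHERE subscriber_id = 7"
).fetchone()[3]
print(covered)      # e.g. "SEARCH ... USING COVERING INDEX idx_usage_covering ..."

# SELECT * needs feature_used, which is not in the index: table rows are fetched
not_covered = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM usage_events WHERE subscriber_id = 7"
).fetchone()[3]
print(not_covered)  # e.g. "SEARCH ... USING INDEX idx_usage_covering ..."
```
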

Strategy 3: Partition Pruning

For large tables, partitioning splits data across multiple physical storage units. The optimizer can skip partitions entirely if the query filter matches the partition key.

-- Partition usage_events by month
CREATE TABLE usage_events (
    event_id        BIGINT,
    subscriber_id   BIGINT,
    event_date      DATE NOT NULL,
    session_minutes NUMERIC(8,2),
    feature_used    VARCHAR(50),
    device_type     VARCHAR(20)
) PARTITION BY RANGE (event_date);

CREATE TABLE usage_events_2024_10
    PARTITION OF usage_events
    FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');

CREATE TABLE usage_events_2024_11
    PARTITION OF usage_events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
-- ... and so on

Now when you query WHERE event_date >= '2024-10-01' AND event_date < '2025-01-01', PostgreSQL scans only 3 partitions instead of the entire table. For a table with billions of rows, this is the difference between seconds and minutes.
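The pruning logic itself is simple enough to sketch in pure Python (the partition names and bounds below are illustrative, matching the DDL above):

```python
# Partition pruning in miniature: a date-range filter selects which
# month-keyed partitions must be touched at all.
from datetime import date

# Each partition covers [start, end), as in FOR VALUES FROM ... TO ...
partitions = {
    "usage_events_2024_09": (date(2024, 9, 1), date(2024, 10, 1)),
    "usage_events_2024_10": (date(2024, 10, 1), date(2024, 11, 1)),
    "usage_events_2024_11": (date(2024, 11, 1), date(2024, 12, 1)),
    "usage_events_2024_12": (date(2024, 12, 1), date(2025, 1, 1)),
}

def prune(filter_start, filter_end):
    """Return only the partitions whose range overlaps the query filter."""
    return [name for name, (lo, hi) in partitions.items()
            if lo < filter_end and hi > filter_start]

# WHERE event_date >= '2024-10-01' AND event_date < '2025-01-01'
print(prune(date(2024, 10, 1), date(2025, 1, 1)))
# ['usage_events_2024_10', 'usage_events_2024_11', 'usage_events_2024_12']
```
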

Strategy 4: Materialized Views for Reusable Feature Sets

If you run the same feature extraction query repeatedly (daily model retraining, for example), materialize the result:

-- Create a materialized view of subscriber features
CREATE MATERIALIZED VIEW mv_subscriber_features AS
WITH tenure AS (
    -- ... (same CTEs as Section 5.3)
),
recent_usage AS ( ... ),
ticket_stats AS ( ... ),
plan_history AS ( ... ),
billing_stats AS ( ... )
SELECT
    t.subscriber_id,
    t.plan_type,
    t.tenure_months,
    -- ... all features
FROM tenure t
LEFT JOIN recent_usage u ON t.subscriber_id = u.subscriber_id
LEFT JOIN ticket_stats ts ON t.subscriber_id = ts.subscriber_id
LEFT JOIN plan_history ph ON t.subscriber_id = ph.subscriber_id
LEFT JOIN billing_stats bs ON t.subscriber_id = bs.subscriber_id;

-- Create an index on the materialized view
CREATE UNIQUE INDEX idx_mv_features_subscriber
    ON mv_subscriber_features (subscriber_id);

-- Refresh daily (typically in an ETL/orchestration tool)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_subscriber_features;

The materialized view runs the expensive query once and stores the result as a physical table. Subsequent reads are instant. REFRESH MATERIALIZED VIEW CONCURRENTLY updates the view without locking it for reads — essential for production systems.

Production Tip — In modern data stacks, materialized views are often replaced by dbt models that run on a schedule. The concept is the same: compute expensive features once, store the results, and read from the stored results. The difference is tooling and version control, not SQL.
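The mechanic (run the expensive query once, store the result, read it cheaply) can be sketched in SQLite, which lacks native materialized views, by rebuilding a plain table. Refresh-by-rebuild stands in for REFRESH MATERIALIZED VIEW; the data is synthetic:

```python
# Simulate a materialized view: materialize an aggregate into a table,
# index it, and "refresh" by rebuilding.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_tickets (subscriber_id INTEGER)")
conn.executemany("INSERT INTO support_tickets VALUES (?)",
                 [(i % 3,) for i in range(9)])

def refresh_feature_table(conn):
    # Drop-and-rebuild; PostgreSQL's REFRESH ... CONCURRENTLY avoids the lock
    conn.execute("DROP TABLE IF EXISTS mv_ticket_counts")
    conn.execute(
        "CREATE TABLE mv_ticket_counts AS "
        "SELECT subscriber_id, COUNT(*) AS ticket_count "
        "FROM support_tickets GROUP BY subscriber_id"
    )
    conn.execute(
        "CREATE UNIQUE INDEX idx_mv_ticket_sub "
        "ON mv_ticket_counts (subscriber_id)"
    )

refresh_feature_table(conn)
print(conn.execute(
    "SELECT * FROM mv_ticket_counts ORDER BY subscriber_id"
).fetchall())
# [(0, 3), (1, 3), (2, 3)]
```
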

Strategy 5: Optimize JOIN Order and Types

The query optimizer usually chooses the best join order, but you can help it:

-- Let PostgreSQL know how many distinct values exist
ANALYZE subscribers;
ANALYZE usage_events;
ANALYZE support_tickets;

ANALYZE updates the statistics that the query planner uses. If your explain plans show wildly inaccurate row estimates, run ANALYZE.

JOIN type performance hierarchy (general guidance, not absolute rules):

  1. Hash Join — Best for equi-joins between large tables
  2. Merge Join — Best when both inputs are already sorted
  3. Nested Loop — Best when outer table is small and inner table is indexed

When you see a Nested Loop on two large tables with no index, that is your bottleneck. Either add an index or restructure the query.


5.8 Putting It All Together: The StreamFlow Feature Query

Let us combine everything from this chapter into a single, production-grade feature extraction query. This is the query that feeds into the churn model.

-- StreamFlow subscriber features for churn prediction
-- PostgreSQL | Refresh: daily | Owner: data-science-team
-- Dependencies: subscribers, usage_events, support_tickets,
--               plan_changes, billing_events

WITH monthly_usage AS (
    -- Per-subscriber monthly usage, with LAG for trend computation
    SELECT
        subscriber_id,
        DATE_TRUNC('month', event_date) AS usage_month,
        SUM(session_minutes) AS total_minutes,
        COUNT(DISTINCT event_date) AS active_days,
        COUNT(DISTINCT feature_used) AS features_used,
        LAG(SUM(session_minutes), 1) OVER (
            PARTITION BY subscriber_id
            ORDER BY DATE_TRUNC('month', event_date)
        ) AS prev_month_minutes
    FROM usage_events
    WHERE event_date >= CURRENT_DATE - INTERVAL '6 months'
    GROUP BY subscriber_id, DATE_TRUNC('month', event_date)
),

current_month_stats AS (
    -- Most recent complete month's statistics
    SELECT
        subscriber_id,
        total_minutes AS current_month_minutes,
        active_days AS current_month_active_days,
        features_used AS current_month_features,
        CASE WHEN prev_month_minutes > 0
             THEN ROUND(((total_minutes - prev_month_minutes)
                         / prev_month_minutes * 100)::NUMERIC, 1)
             ELSE NULL
        END AS month_over_month_pct_change
    FROM monthly_usage
    WHERE usage_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
),

usage_trend AS (
    -- 3-month trend: slope of usage over last 3 complete months
    -- Positive = growing, negative = declining
    SELECT
        subscriber_id,
        REGR_SLOPE(total_minutes, EXTRACT(EPOCH FROM usage_month)) AS usage_slope
    FROM monthly_usage
    WHERE usage_month >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '3 months')
      AND usage_month < DATE_TRUNC('month', CURRENT_DATE)
    GROUP BY subscriber_id
    HAVING COUNT(*) >= 2  -- Need at least 2 data points for a slope
),

session_gaps AS (
    -- Inter-session gaps: how regularly does the subscriber use the product?
    SELECT
        subscriber_id,
        MAX(next_event - event_date) AS max_gap_days,
        AVG(next_event - event_date)::NUMERIC(8,1) AS avg_gap_days,
        CURRENT_DATE - MAX(event_date) AS days_since_last_session
    FROM (
        SELECT
            subscriber_id,
            event_date,
            LEAD(event_date) OVER (
                PARTITION BY subscriber_id ORDER BY event_date
            ) AS next_event
        FROM (SELECT DISTINCT subscriber_id, event_date FROM usage_events) d
    ) gaps
    GROUP BY subscriber_id
),

ticket_features AS (
    SELECT
        subscriber_id,
        COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '30 days')
            AS tickets_30d,
        COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '90 days')
            AS tickets_90d,
        COUNT(*) FILTER (WHERE category = 'complaint') AS total_complaints,
        BOOL_OR(priority IN ('high', 'critical')
                AND created_date >= CURRENT_DATE - INTERVAL '90 days')
            AS has_recent_critical_ticket,
        CURRENT_DATE - MAX(created_date) AS days_since_last_ticket
    FROM support_tickets
    GROUP BY subscriber_id
),

plan_features AS (
    SELECT
        subscriber_id,
        COUNT(*) AS total_changes,
        COUNT(*) FILTER (WHERE change_type = 'downgrade') AS downgrades,
        COUNT(*) FILTER (WHERE change_type = 'upgrade') AS upgrades,
        -- Net direction: positive = net upgrader, negative = net downgrader
        COUNT(*) FILTER (WHERE change_type = 'upgrade') -
        COUNT(*) FILTER (WHERE change_type = 'downgrade') AS net_plan_direction,
        -- Recent downgrade (strong churn signal)
        BOOL_OR(change_type = 'downgrade'
                AND change_date >= CURRENT_DATE - INTERVAL '60 days')
            AS recent_downgrade
    FROM plan_changes
    GROUP BY subscriber_id
),

billing_features AS (
    SELECT
        subscriber_id,
        COUNT(*) FILTER (WHERE status = 'failed') AS total_failed_payments,
        COUNT(*) FILTER (WHERE status = 'failed'
                         AND billing_date >= CURRENT_DATE - INTERVAL '90 days')
            AS failed_payments_90d,
        COUNT(*) FILTER (WHERE status = 'refunded') AS total_refunds
    FROM billing_events
    GROUP BY subscriber_id
)

-- Final feature assembly
SELECT
    s.subscriber_id,

    -- Demographic / account features
    s.plan_type,
    s.country,
    EXTRACT(EPOCH FROM (COALESCE(s.cancel_date, CURRENT_DATE)::timestamp
                        - s.signup_date::timestamp))
        / (86400 * 30.44) AS tenure_months,

    -- Usage features (current month)
    COALESCE(cms.current_month_minutes, 0) AS current_month_minutes,
    COALESCE(cms.current_month_active_days, 0) AS current_month_active_days,
    COALESCE(cms.current_month_features, 0) AS current_month_features_used,
    cms.month_over_month_pct_change,

    -- Usage trend
    COALESCE(ut.usage_slope, 0) AS usage_trend_slope,

    -- Engagement regularity
    sg.days_since_last_session,
    sg.max_gap_days,
    sg.avg_gap_days,

    -- Support features
    COALESCE(tf.tickets_30d, 0) AS tickets_30d,
    COALESCE(tf.tickets_90d, 0) AS tickets_90d,
    COALESCE(tf.total_complaints, 0) AS total_complaints,
    COALESCE(tf.has_recent_critical_ticket, FALSE) AS has_recent_critical_ticket,
    tf.days_since_last_ticket,

    -- Plan features
    COALESCE(pf.total_changes, 0) AS total_plan_changes,
    COALESCE(pf.downgrades, 0) AS plan_downgrades,
    COALESCE(pf.upgrades, 0) AS plan_upgrades,
    COALESCE(pf.net_plan_direction, 0) AS net_plan_direction,
    COALESCE(pf.recent_downgrade, FALSE) AS recent_downgrade,

    -- Billing features
    COALESCE(bf.total_failed_payments, 0) AS total_failed_payments,
    COALESCE(bf.failed_payments_90d, 0) AS failed_payments_90d,
    COALESCE(bf.total_refunds, 0) AS total_refunds,

    -- Target variable
    CASE WHEN s.cancel_date IS NOT NULL
              AND s.cancel_date <= CURRENT_DATE + INTERVAL '30 days'
         THEN 1 ELSE 0
    END AS churned_within_30d

FROM subscribers s
LEFT JOIN current_month_stats cms ON s.subscriber_id = cms.subscriber_id
LEFT JOIN usage_trend ut ON s.subscriber_id = ut.subscriber_id
LEFT JOIN session_gaps sg ON s.subscriber_id = sg.subscriber_id
LEFT JOIN ticket_features tf ON s.subscriber_id = tf.subscriber_id
LEFT JOIN plan_features pf ON s.subscriber_id = pf.subscriber_id
LEFT JOIN billing_features bf ON s.subscriber_id = bf.subscriber_id
WHERE s.cancel_date IS NULL  -- Active subscribers only
   OR s.cancel_date >= CURRENT_DATE - INTERVAL '30 days';  -- Recently canceled (for training)

That is 7 CTEs, 25+ features, and a final assembly join. It runs in the database, takes advantage of indexes and the query optimizer, and produces a flat table that pandas can read directly via pd.read_sql():

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@host:5432/streamflow')

features_df = pd.read_sql("""
    SELECT * FROM mv_subscriber_features
""", engine)

print(f"Shape: {features_df.shape}")
print(f"Columns: {features_df.columns.tolist()}")
Shape: (2400000, 27)
Columns: ['subscriber_id', 'plan_type', 'country', 'tenure_months',
           'current_month_minutes', 'current_month_active_days', ...]

Try It — Run the feature query CTE by CTE. Start with just the monthly_usage CTE and SELECT * FROM monthly_usage LIMIT 20. Verify each step before adding the next. This is the CTE debugging workflow: build incrementally, validate each layer.


5.9 Advanced Patterns

Three patterns that come up frequently in production feature engineering.

Pattern: Sessionization with Window Functions

Raw event logs often need to be grouped into sessions. A session is a sequence of events with no gap longer than some threshold (commonly 30 minutes). This is a classic window function problem.

-- Sessionize usage events (30-minute inactivity threshold)
-- NOTE: a minute-level threshold needs a timestamp column. event_ts below is
-- assumed to be the event's TIMESTAMP; the DATE-granularity event_date in our
-- schema cannot resolve 30-minute gaps.
WITH events_with_prev AS (
    SELECT
        subscriber_id,
        event_id,
        event_ts,
        session_minutes,
        LAG(event_ts) OVER (
            PARTITION BY subscriber_id
            ORDER BY event_ts
        ) AS prev_event_ts,
        -- Flag: is this the start of a new session?
        CASE WHEN LAG(event_ts) OVER (
                      PARTITION BY subscriber_id ORDER BY event_ts
                  ) IS NULL
                  OR event_ts - LAG(event_ts) OVER (
                      PARTITION BY subscriber_id ORDER BY event_ts
                  ) > INTERVAL '30 minutes'
             THEN 1
             ELSE 0
        END AS is_new_session
    FROM usage_events
),
session_ids AS (
    SELECT
        *,
        SUM(is_new_session) OVER (
            PARTITION BY subscriber_id
            ORDER BY event_ts
        ) AS session_number
    FROM events_with_prev
)
SELECT
    subscriber_id,
    session_number,
    MIN(event_ts) AS session_start,
    MAX(event_ts) AS session_end,
    COUNT(*) AS events_in_session,
    SUM(session_minutes) AS total_session_minutes
FROM session_ids
GROUP BY subscriber_id, session_number;

This is a rolling-sum trick: mark each "new session start" with 1 and non-starts with 0, then take a cumulative sum. The cumulative sum increments at each session boundary, giving you a session identifier.
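The same trick translates directly to pandas. A sketch with synthetic data, assuming a timestamp-granularity event column (a hypothetical event_ts):

```python
# Rolling-sum sessionization in pandas: flag gaps > 30 minutes,
# then cumulative-sum the flags per subscriber to get session numbers.
import pandas as pd

events = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 1, 2, 2],
    "event_ts": pd.to_datetime([
        "2024-10-01 09:00", "2024-10-01 09:10",  # same session
        "2024-10-01 10:30", "2024-10-01 10:45",  # new session (80-minute gap)
        "2024-10-01 09:00", "2024-10-01 09:05",  # subscriber 2, one session
    ]),
}).sort_values(["subscriber_id", "event_ts"])

# Gap to the previous event within each subscriber (NaT for the first event)
gap = events.groupby("subscriber_id")["event_ts"].diff()
is_new = gap.isna() | (gap > pd.Timedelta(minutes=30))
events["session_number"] = (
    is_new.astype(int).groupby(events["subscriber_id"]).cumsum()
)

print(events["session_number"].tolist())  # [1, 1, 2, 2, 1, 1]
```
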

Pattern: Pivoting Event Data into Feature Columns

-- Pivot: event counts by feature_used, one column per feature
SELECT
    subscriber_id,
    COUNT(*) FILTER (WHERE feature_used = 'dashboard') AS dashboard_count,
    COUNT(*) FILTER (WHERE feature_used = 'reports') AS reports_count,
    COUNT(*) FILTER (WHERE feature_used = 'api') AS api_count,
    COUNT(*) FILTER (WHERE feature_used = 'export') AS export_count,
    -- Diversity: how many distinct features used?
    COUNT(DISTINCT feature_used) AS feature_diversity
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY subscriber_id;

BigQuery Note — BigQuery has a native PIVOT operator: SELECT * FROM usage_counts PIVOT(SUM(cnt) FOR feature_used IN ('dashboard', 'reports', 'api', 'export')). PostgreSQL does not have native PIVOT; use the FILTER clause or crosstab() from the tablefunc extension.
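The pandas equivalent of this pivot, for when the events are already in a DataFrame (synthetic data):

```python
# Pivot event counts into one column per feature value with pd.crosstab,
# then add the diversity feature.
import pandas as pd

events = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2],
    "feature_used": ["dashboard", "dashboard", "api", "reports", "api"],
})

counts = pd.crosstab(events["subscriber_id"], events["feature_used"])
# Diversity: how many distinct features did each subscriber touch?
counts["feature_diversity"] = (counts > 0).sum(axis=1)
print(counts)
```
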

Pattern: Existence Flags with Correlated Subqueries

Sometimes you need a simple yes/no feature: "Did this subscriber do X in the last Y days?"

SELECT
    s.subscriber_id,
    EXISTS (
        SELECT 1 FROM usage_events ue
        WHERE ue.subscriber_id = s.subscriber_id
          AND ue.event_date >= CURRENT_DATE - INTERVAL '7 days'
    ) AS active_last_7d,
    EXISTS (
        SELECT 1 FROM support_tickets st
        WHERE st.subscriber_id = s.subscriber_id
          AND st.category = 'complaint'
          AND st.created_date >= CURRENT_DATE - INTERVAL '30 days'
    ) AS complained_last_30d,
    EXISTS (
        SELECT 1 FROM billing_events be
        WHERE be.subscriber_id = s.subscriber_id
          AND be.status = 'failed'
          AND be.billing_date >= CURRENT_DATE - INTERVAL '60 days'
    ) AS had_payment_failure_60d
FROM subscribers s
WHERE s.cancel_date IS NULL;

These binary features are fast (EXISTS stops at the first match) and surprisingly predictive. "Has the subscriber used the product in the last 7 days?" is a feature you should always compute.
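In pandas, the equivalent of an EXISTS flag is a membership test against the relevant event set. A sketch with synthetic data:

```python
# Existence flags via isin(): one boolean per "did X happen for this subscriber".
import pandas as pd

subscribers = pd.DataFrame({"subscriber_id": [1, 2, 3]})
recent_events = pd.DataFrame({"subscriber_id": [1, 1, 3]})   # activity last 7 days
complaints = pd.DataFrame({"subscriber_id": [2]})            # complaints last 30 days

subscribers["active_last_7d"] = subscribers["subscriber_id"].isin(
    recent_events["subscriber_id"])
subscribers["complained_last_30d"] = subscribers["subscriber_id"].isin(
    complaints["subscriber_id"])

print(subscribers)
```
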


5.10 From SQL to Model: The Handoff

You have written the SQL. You have optimized the query. Now you need to get the data into Python. Here is the production pattern.

import os

import pandas as pd
from sqlalchemy import create_engine, text

# Connection (read credentials from environment variables; never hard-code them)
engine = create_engine(
    f"postgresql://readonly_user:{os.environ['DB_PASSWORD']}"
    "@warehouse.streamflow.io:5432/analytics"
)

# Option 1: Read query results directly
query = text("""
    SELECT * FROM mv_subscriber_features
    WHERE tenure_months >= 1  -- Exclude brand-new subscribers
""")
df = pd.read_sql(query, engine)

# Option 2: Read in chunks for large result sets
# (process each chunk as it arrives; concat only if the result fits in memory)
chunks = pd.read_sql(query, engine, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)

# Verify the handoff
print(f"Shape: {df.shape}")
print(f"Null counts:\n{df.isnull().sum()}")
print(f"Target distribution:\n{df['churned_within_30d'].value_counts(normalize=True)}")
Shape: (2387000, 27)
Null counts:
subscriber_id                  0
plan_type                      0
country                        0
tenure_months                  0
current_month_minutes          0
current_month_active_days      0
month_over_month_pct_change    312456
days_since_last_session        8923
days_since_last_ticket         1847234
...
Target distribution:
0    0.918
1    0.082
Name: churned_within_30d, dtype: float64

Note the nulls. month_over_month_pct_change is null for subscribers with less than 2 months of data (no previous month to compare). days_since_last_ticket is null for subscribers who have never filed a ticket. These nulls are meaningful — they carry information. We will handle them properly in Chapter 8 (Missing Data Strategies).

The target variable shows the familiar 8.2% churn rate. This is a class-imbalanced problem. We will address that in Chapter 17 (Class Imbalance).

Production Tip — Store the SQL query that generated your features alongside the model artifact. When the model behaves unexpectedly in production, the first debugging step is re-running the feature query. If you cannot reproduce the features, you cannot reproduce the problem. This is reproducibility (Theme 6) applied to data extraction.


5.11 The Domain Knowledge Connection

The best SQL features come from understanding the business, not from mechanically applying window functions to every column. Here is how domain knowledge (Theme 4) shows up in SQL:

A domain expert at StreamFlow would tell you:

  • Subscribers who downgrade their plan usually cancel within 90 days
  • A spike in support tickets (especially "billing" category) precedes churn
  • Subscribers who use the API are power users and rarely churn
  • Failed payments that are not resolved within 7 days lead to involuntary churn
  • Usage decline is the strongest churn signal, but only if tenure is > 3 months (new subscribers are still onboarding)

Each of those insights maps to a SQL feature:

-- Domain-informed features
SELECT
    s.subscriber_id,

    -- Downgrade-to-cancel pattern
    CASE WHEN pf.recent_downgrade AND s.plan_type IN ('basic', 'free')
         THEN 1 ELSE 0
    END AS downgrade_risk_flag,

    -- Billing-category ticket spike
    CASE WHEN tf_billing.billing_tickets_30d >= 2
         THEN 1 ELSE 0
    END AS billing_complaint_flag,

    -- Power user indicator
    CASE WHEN ue_api.api_sessions_30d > 0
         THEN 1 ELSE 0
    END AS is_api_user,

    -- Unresolved payment failure
    CASE WHEN bf.unresolved_failures > 0
         THEN 1 ELSE 0
    END AS unresolved_payment_flag,

    -- Usage decline (only meaningful after onboarding)
    CASE WHEN tenure_months > 3
              AND cms.month_over_month_pct_change < -30
         THEN 1 ELSE 0
    END AS post_onboarding_decline_flag

FROM subscribers s
-- ... (appropriate joins to pre-computed CTEs)

These domain-informed binary flags often outperform the raw continuous features they are derived from. A churn model does not need to learn that "recent_downgrade AND plan_type = 'free'" is a risk factor if you hand it a pre-computed flag. You are encoding domain knowledge directly into the feature set.
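A sketch of the same flag logic in pandas, using the post-onboarding decline rule (synthetic data; column names match the SQL features above):

```python
# Encode a domain rule as a binary feature: tenure > 3 months AND a >30% drop.
import pandas as pd

df = pd.DataFrame({
    "tenure_months": [1.0, 6.0, 12.0],
    "month_over_month_pct_change": [-50.0, -40.0, 10.0],
})

# Row 1 drops 50% but is still onboarding, so the flag stays 0.
df["post_onboarding_decline_flag"] = (
    (df["tenure_months"] > 3) & (df["month_over_month_pct_change"] < -30)
).astype(int)

print(df["post_onboarding_decline_flag"].tolist())  # [0, 1, 0]
```
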

This is the connection to Chapter 6 (Feature Engineering): the SQL extracts the raw materials, and the domain knowledge shapes them into predictive features. Good feature engineering starts in the SQL layer.


Summary

This chapter covered the SQL that data scientists actually use: window functions for temporal features, CTEs for readable query structure, complex joins for feature construction, and execution plans for query optimization.

The key insight is that SQL is not just a data extraction language — it is a feature engineering language. Every window function, every conditional aggregation, every anti-join pattern corresponds to a feature that a machine learning model can use. The database is not an obstacle between you and pandas. It is a computation engine that handles joins, aggregations, and window operations on billions of rows while your laptop would run out of memory on the first pd.read_csv().

In Chapter 6, we will take the flat table this chapter produced and engineer the remaining features: interaction terms, transformations, and the domain-specific signals that no automated tool can create. But the hardest work — extracting structured features from a normalized relational schema — is done. And it was done in SQL, where it belongs.


Progressive Project Milestone M1: StreamFlow SQL Feature Extraction

Deliverable: A SQL query (or set of queries) that extracts at least 15 customer features from StreamFlow's relational database.

Requirements:

  1. Use at least 3 CTEs for query organization
  2. Use at least 2 window functions (LAG, ROW_NUMBER, or running aggregate)
  3. Compute at least 2 "days since last X" features
  4. Compute at least 2 conditional aggregation features (event counts by category/time window)
  5. Include the target variable (churned_within_30d)
  6. Run EXPLAIN ANALYZE and verify no full table scans on large tables

Stretch goals:

  • Compute a usage trend slope using REGR_SLOPE
  • Create a sessionization query
  • Build the query as a materialized view with a refresh strategy

Save your SQL in your project repository as src/features/extract_features.sql. In Chapter 6, you will load this feature set into pandas and engineer additional features on top of it.