In This Chapter
- 5.1 The StreamFlow Database Schema
- 5.2 Window Functions: The Feature Engineer's Best Friend
- 5.3 Common Table Expressions (CTEs): Writing SQL That Humans Can Read
- 5.4 Complex Joins for Feature Construction
- 5.5 Conditional Aggregation: Feature Engineering Patterns
- 5.6 Reading Query Execution Plans
- 5.7 Query Optimization
- 5.8 Putting It All Together: The StreamFlow Feature Query
- 5.9 Advanced Patterns
- 5.10 From SQL to Model: The Handoff
- 5.11 The Domain Knowledge Connection
- Summary
Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization
Data scientists who can't write SQL are data scientists who can't get data.
That statement will irritate some readers. You have pandas. You have Spark DataFrames. You have dbt and Airflow and a hundred tools that promise to abstract away the database. And yet when you start a new job, the first thing you will need to do is write SQL. The features you need live in PostgreSQL, BigQuery, Snowflake, or Redshift. They live in tables designed by software engineers for application needs, not for machine learning. And the person who turns those tables into model-ready features is you.
This chapter is not an introduction to SQL. You have that from Book 1 — you can write SELECT, JOIN, WHERE, GROUP BY, and ORDER BY without thinking. This chapter makes you dangerous. By the end of it, you will write window functions that compute temporal trends, CTEs that turn a 200-line query into something a colleague can read, complex joins that construct features no single table contains, and optimized queries that run in seconds instead of minutes.
We will work primarily in PostgreSQL, with BigQuery dialect notes in callout boxes. Every SQL pattern in this chapter maps to a feature engineering use case. Every query answers a question that a churn model, a recommendation engine, or a conversion funnel analysis actually needs answered.
Learning Objectives
By the end of this chapter, you will be able to:
- Write window functions (ROW_NUMBER, RANK, LAG, LEAD, running aggregates) for feature extraction
- Structure complex queries using CTEs for readability and debugging
- Perform complex joins (self-joins, anti-joins, lateral joins) for feature construction
- Read and interpret query execution plans
- Optimize slow queries (indexing strategies, partition pruning, avoiding SELECT *)
5.1 The StreamFlow Database Schema
Before we write a single query, we need to understand the data. StreamFlow is a subscription streaming analytics platform with $180M in annual recurring revenue, 2.4 million subscribers, and an 8.2% monthly churn rate. Their relational database was designed to run the application — user authentication, billing, content delivery. It was not designed for feature engineering. That is the normal state of affairs.
Here is the schema we will work with throughout this chapter:
-- Core subscriber table
CREATE TABLE subscribers (
subscriber_id BIGINT PRIMARY KEY,
email VARCHAR(255),
signup_date DATE NOT NULL,
plan_type VARCHAR(20) NOT NULL, -- 'free', 'basic', 'premium', 'enterprise'
plan_price NUMERIC(8,2),
country VARCHAR(2),
cancel_date DATE, -- NULL if still active
cancel_reason VARCHAR(100)
);
-- Usage events (one row per session)
CREATE TABLE usage_events (
event_id BIGINT PRIMARY KEY,
subscriber_id BIGINT REFERENCES subscribers(subscriber_id),
event_date DATE NOT NULL,
session_minutes NUMERIC(8,2),
feature_used VARCHAR(50), -- 'dashboard', 'reports', 'api', 'export'
device_type VARCHAR(20) -- 'web', 'mobile', 'tablet', 'api_client'
);
-- Support tickets
CREATE TABLE support_tickets (
ticket_id BIGINT PRIMARY KEY,
subscriber_id BIGINT REFERENCES subscribers(subscriber_id),
created_date DATE NOT NULL,
resolved_date DATE,
category VARCHAR(50), -- 'billing', 'technical', 'feature_request', 'complaint'
priority VARCHAR(10) -- 'low', 'medium', 'high', 'critical'
);
-- Plan change history
CREATE TABLE plan_changes (
change_id BIGINT PRIMARY KEY,
subscriber_id BIGINT REFERENCES subscribers(subscriber_id),
change_date DATE NOT NULL,
old_plan VARCHAR(20),
new_plan VARCHAR(20),
change_type VARCHAR(20) -- 'upgrade', 'downgrade', 'same_tier'
);
-- Billing events
CREATE TABLE billing_events (
billing_id BIGINT PRIMARY KEY,
subscriber_id BIGINT REFERENCES subscribers(subscriber_id),
billing_date DATE NOT NULL,
amount NUMERIC(8,2),
status VARCHAR(20) -- 'paid', 'failed', 'refunded'
);
Five tables. Application-normalized. No feature columns. Every feature we need for churn prediction — tenure, usage trend, support ticket count, plan upgrade history, billing failure count — must be computed from these tables. That computation is the subject of this chapter.
Production Tip — In most companies, you will not query production tables directly. You will query a read replica, a data warehouse, or a materialized view. The SQL is the same. The politics of getting access is a separate skill entirely.
5.2 Window Functions: The Feature Engineer's Best Friend
If you learn one thing from this chapter, let it be window functions. A window function performs a calculation across a set of rows that are somehow related to the current row — without collapsing those rows into a single output the way GROUP BY does.
The general syntax:
function_name(expression) OVER (
PARTITION BY column1, column2
ORDER BY column3
ROWS BETWEEN start AND end
)
Three clauses. Three concepts:
- PARTITION BY defines the groups (like GROUP BY, but without collapsing rows)
- ORDER BY defines the sequence within each partition
- ROWS BETWEEN (or RANGE BETWEEN) defines the window frame — which rows the function can see
Let us start with the functions you will use most.
ROW_NUMBER, RANK, and DENSE_RANK
These ranking functions assign a position to each row within its partition.
-- Rank each subscriber's usage sessions by date (most recent first)
SELECT
subscriber_id,
event_date,
session_minutes,
ROW_NUMBER() OVER (
PARTITION BY subscriber_id
ORDER BY event_date DESC
) AS session_rank
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days';
ROW_NUMBER assigns unique sequential integers — ties are broken arbitrarily. RANK assigns the same number to ties but leaves gaps (1, 2, 2, 4). DENSE_RANK assigns the same number to ties without gaps (1, 2, 2, 3).
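To see the three ranking functions side by side, here is a minimal sketch using Python's built-in sqlite3 module (SQLite has supported window functions since 3.25, and the SQL is identical in PostgreSQL). The table and values are invented for illustration:

```python
import sqlite3

# Toy data with a tie at 80 points to expose the difference between
# ROW_NUMBER, RANK, and DENSE_RANK.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, pts INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 80), ("c", 80), ("d", 70)])

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY pts DESC) AS row_num,
           RANK()       OVER (ORDER BY pts DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY pts DESC) AS dense_rnk
    FROM scores
    ORDER BY pts DESC, name
""").fetchall()

# ROW_NUMBER: 1, 2, 3, 4 (tie broken arbitrarily)
# RANK:       1, 2, 2, 4 (tie shares a rank, leaves a gap)
# DENSE_RANK: 1, 2, 2, 3 (tie shares a rank, no gap)
for r in rows:
    print(r)
```

The tie at 80 points is the whole story: only RANK skips a position after it.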
Feature engineering use case: "Get the most recent session for each subscriber."
-- Most recent session per subscriber (the ROW_NUMBER + filter pattern)
WITH ranked_sessions AS (
SELECT
subscriber_id,
event_date,
session_minutes,
device_type,
ROW_NUMBER() OVER (
PARTITION BY subscriber_id
ORDER BY event_date DESC
) AS rn
FROM usage_events
)
SELECT *
FROM ranked_sessions
WHERE rn = 1;
This pattern — ROW_NUMBER inside a CTE, then filter to rn = 1 — is one of the most common SQL patterns in data science. You will use it dozens of times. It replaces the pandas equivalent:
# The pandas equivalent — functional, but slower at scale
df.sort_values('event_date', ascending=False).groupby('subscriber_id').first()
Common Mistake — Using GROUP BY subscriber_id with MAX(event_date) will give you the most recent date, but not the other columns from that row (session_minutes, device_type). You need the ROW_NUMBER pattern or a correlated subquery to get the full row.
LAG and LEAD: Temporal Feature Extraction
LAG looks backward. LEAD looks forward. Together, they are how you compute trends, changes, and velocity — the temporal features that dominate churn prediction.
-- Monthly usage with previous month's usage (for computing trends)
SELECT
subscriber_id,
DATE_TRUNC('month', event_date) AS usage_month,
SUM(session_minutes) AS total_minutes,
LAG(SUM(session_minutes), 1) OVER (
PARTITION BY subscriber_id
ORDER BY DATE_TRUNC('month', event_date)
) AS prev_month_minutes,
LAG(SUM(session_minutes), 2) OVER (
PARTITION BY subscriber_id
ORDER BY DATE_TRUNC('month', event_date)
) AS two_months_ago_minutes
FROM usage_events
GROUP BY subscriber_id, DATE_TRUNC('month', event_date);
The result set:
subscriber_id | usage_month | total_minutes | prev_month_minutes | two_months_ago_minutes
---------------+-------------+---------------+--------------------+------------------------
100042 | 2024-10-01 | 342.5 | |
100042 | 2024-11-01 | 287.0 | 342.5 |
100042 | 2024-12-01 | 198.3 | 287.0 | 342.5
100042 | 2025-01-01 | 112.7 | 198.3 | 287.0
Look at subscriber 100042. Usage is declining: 342 to 287 to 198 to 112 minutes. That is a churn signal. And we can compute it directly in SQL:
-- Usage trend: percentage change over last 3 months
WITH monthly_usage AS (
SELECT
subscriber_id,
DATE_TRUNC('month', event_date) AS usage_month,
SUM(session_minutes) AS total_minutes
FROM usage_events
GROUP BY subscriber_id, DATE_TRUNC('month', event_date)
),
with_lag AS (
SELECT
subscriber_id,
usage_month,
total_minutes,
LAG(total_minutes, 1) OVER w AS prev_1m,
LAG(total_minutes, 2) OVER w AS prev_2m,
LAG(total_minutes, 3) OVER w AS prev_3m
FROM monthly_usage
WINDOW w AS (PARTITION BY subscriber_id ORDER BY usage_month)
)
SELECT
subscriber_id,
usage_month,
total_minutes,
-- Month-over-month change
CASE WHEN prev_1m > 0
THEN ROUND((total_minutes - prev_1m) / prev_1m * 100, 1)
ELSE NULL
END AS mom_pct_change,
-- 3-month average (smoothed baseline)
ROUND((COALESCE(prev_1m, 0) + COALESCE(prev_2m, 0) + COALESCE(prev_3m, 0)) /
NULLIF(
(CASE WHEN prev_1m IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN prev_2m IS NOT NULL THEN 1 ELSE 0 END +
CASE WHEN prev_3m IS NOT NULL THEN 1 ELSE 0 END), 0
), 1) AS avg_prev_3m
FROM with_lag
WHERE usage_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month');
BigQuery Note — BigQuery uses the same LAG/LEAD syntax. The main differences: DATE_TRUNC(event_date, MONTH) instead of DATE_TRUNC('month', event_date), and DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) instead of CURRENT_DATE - INTERVAL '1 month'.
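The LAG mechanics can be verified in a few lines with sqlite3 (same window-function syntax as PostgreSQL). The monthly totals below mirror subscriber 100042 from the result set above:

```python
import sqlite3

# LAG pulls the previous month's total onto the current row, which is the
# raw material for month-over-month change features.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE monthly_usage
                (subscriber_id INTEGER, usage_month TEXT, total_minutes REAL)""")
conn.executemany("INSERT INTO monthly_usage VALUES (?, ?, ?)",
                 [(100042, "2024-10-01", 342.5),
                  (100042, "2024-11-01", 287.0),
                  (100042, "2024-12-01", 198.3)])

rows = conn.execute("""
    SELECT usage_month,
           total_minutes,
           LAG(total_minutes, 1) OVER (
               PARTITION BY subscriber_id
               ORDER BY usage_month
           ) AS prev_month_minutes
    FROM monthly_usage
    ORDER BY usage_month
""").fetchall()

# The first month has no predecessor (NULL -> None);
# November carries October's 342.5, December carries November's 287.0.
for r in rows:
    print(r)
```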
NTILE: Percentile Bucketing
NTILE divides the ordered partition into N roughly equal groups. Use it for creating bucketed features.
-- Assign each subscriber to a usage decile
SELECT
subscriber_id,
total_monthly_minutes,
NTILE(10) OVER (ORDER BY total_monthly_minutes) AS usage_decile
FROM (
SELECT
subscriber_id,
SUM(session_minutes) AS total_monthly_minutes
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY subscriber_id
) monthly;
Feature engineering use case: "Which decile of usage does this subscriber fall into?" This is a robust, outlier-resistant alternative to using raw minutes as a feature.
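A quick sanity check of how NTILE distributes rows, sketched with sqlite3 on invented data (NTILE(4) for quartiles here to keep the output small; NTILE(10) works identically for the deciles above):

```python
import sqlite3

# Eight subscribers, four buckets: NTILE assigns two subscribers per quartile,
# in ascending order of minutes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (subscriber_id INTEGER, minutes REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?)",
                 [(i, m) for i, m in enumerate([5, 12, 30, 44, 61, 80, 95, 120])])

rows = conn.execute("""
    SELECT subscriber_id, minutes,
           NTILE(4) OVER (ORDER BY minutes) AS usage_quartile
    FROM monthly
    ORDER BY minutes
""").fetchall()

# Quartile labels come out as 1, 1, 2, 2, 3, 3, 4, 4.
print([r[2] for r in rows])
```

When the row count does not divide evenly, NTILE puts the extra rows in the earlier buckets, so bucket sizes differ by at most one.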
Running Aggregates: Cumulative Sums and Moving Averages
Window functions with frame specifications let you compute running totals, moving averages, and other cumulative statistics.
-- Running total of session minutes per subscriber (cumulative engagement)
SELECT
subscriber_id,
event_date,
session_minutes,
SUM(session_minutes) OVER (
PARTITION BY subscriber_id
ORDER BY event_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_minutes,
-- 7-day moving average of session minutes
AVG(session_minutes) OVER (
PARTITION BY subscriber_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM usage_events
ORDER BY subscriber_id, event_date;
The frame clause matters:
| Frame | Meaning |
|---|---|
| ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | All rows from the start of the partition to here (running total) |
| ROWS BETWEEN 6 PRECEDING AND CURRENT ROW | Current row and 6 before it (7-row window) |
| ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING | Centered window of 7 rows |
| ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING | Current row to the end (reverse running total) |
Common Mistake — Without an explicit frame clause, the default depends on whether ORDER BY is present. With ORDER BY, PostgreSQL defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which can behave unexpectedly with duplicate values. Always specify the frame clause explicitly when using running aggregates. Future-you will thank present-you.
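The duplicate-value surprise is easy to reproduce. SQLite uses the same default frame as PostgreSQL (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so this sketch on invented data shows exactly the pitfall:

```python
import sqlite3

# Two sessions on the same date: with the implicit RANGE frame, both rows
# are "peers" and each sees the sum including the other.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (event_date TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("2025-01-01", 10), ("2025-01-01", 20), ("2025-01-02", 5)])

rows = conn.execute("""
    SELECT event_date, minutes,
           SUM(minutes) OVER (ORDER BY event_date) AS range_sum,
           SUM(minutes) OVER (ORDER BY event_date
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                   AND CURRENT ROW) AS rows_sum
    FROM t
    ORDER BY event_date
""").fetchall()

# range_sum (implicit frame): 30, 30, 35 -- both Jan 1 rows jump together.
# rows_sum  (explicit frame): advances row by row, reaching 30 then 35.
for r in rows:
    print(r)
```

If you wanted a true row-by-row running total, only the explicit ROWS frame delivers it.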
The SQL-to-Pandas Translation Table
If you think in DataFrames, this table is your Rosetta Stone.
| Feature Pattern | SQL | pandas |
|---|---|---|
| Rank within group | ROW_NUMBER() OVER (PARTITION BY g ORDER BY x) | df.groupby('g')['x'].rank(method='first') |
| Previous row value | LAG(x, 1) OVER (PARTITION BY g ORDER BY t) | df.groupby('g')['x'].shift(1) |
| Next row value | LEAD(x, 1) OVER (PARTITION BY g ORDER BY t) | df.groupby('g')['x'].shift(-1) |
| Running sum | SUM(x) OVER (PARTITION BY g ORDER BY t ROWS UNBOUNDED PRECEDING) | df.groupby('g')['x'].cumsum() |
| Moving average | AVG(x) OVER (PARTITION BY g ORDER BY t ROWS 6 PRECEDING) | df.groupby('g')['x'].rolling(7).mean() |
| Percentile bucket | NTILE(10) OVER (ORDER BY x) | pd.qcut(df['x'], 10, labels=False) + 1 |
| First value in group | FIRST_VALUE(x) OVER (PARTITION BY g ORDER BY t) | df.groupby('g')['x'].transform('first') |
| Count within group (no collapse) | COUNT(*) OVER (PARTITION BY g) | df.groupby('g')['x'].transform('count') |
The SQL versions run inside the database — on indexed, optimized storage, often distributed across multiple nodes. The pandas versions run in memory on your laptop. For 2.4 million subscribers with years of event history, the SQL version finishes while the pandas version is still loading the CSV.
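One row of the Rosetta Stone, verified end to end: the sketch below (toy data, hypothetical column names g/t/x matching the table) computes the "previous row value" feature both ways and confirms they agree. It assumes pandas is available, and relies on pandas' built-in support for raw sqlite3 connections:

```python
import sqlite3
import pandas as pd

# Same feature, two engines: groupby().shift(1) in pandas,
# LAG(x, 1) OVER (PARTITION BY g ORDER BY t) in SQL.
df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "t": [1, 2, 3, 1, 2],
    "x": [10.0, 20.0, 30.0, 5.0, 7.0],
})
df["prev_pandas"] = df.groupby("g")["x"].shift(1)

conn = sqlite3.connect(":memory:")
df[["g", "t", "x"]].to_sql("events", conn, index=False)
sql = pd.read_sql_query("""
    SELECT g, t, x,
           LAG(x, 1) OVER (PARTITION BY g ORDER BY t) AS prev_sql
    FROM events
    ORDER BY g, t
""", conn)

# Both columns: NaN/NULL on each group's first row, then the prior x.
print(sql["prev_sql"].tolist())
```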
War Story — At a previous company, an analyst inherited a churn feature pipeline written in pandas. It loaded 18 months of usage events into memory (14 GB), computed LAG-equivalent features using
groupby().shift(), joined four DataFrames, and wrote the result to a CSV. Runtime: 47 minutes. Memory: peaked at 38 GB, requiring a special EC2 instance. A senior data engineer rewrote it as a single SQL query with three CTEs and four window functions. Runtime: 90 seconds. Infrastructure cost: zero incremental, because the data warehouse was already running. The 200 lines of pandas became 15 lines of SQL that ran 100x faster. The lesson is not that pandas is bad. The lesson is that the database is an extraordinarily powerful computation engine, and ignoring it means doing work that has already been done for you.
5.3 Common Table Expressions (CTEs): Writing SQL That Humans Can Read
A CTE is a named temporary result set that exists for the duration of a single query. Think of it as a variable assignment in SQL.
WITH cte_name AS (
SELECT ...
)
SELECT * FROM cte_name;
Why do CTEs matter? Because SQL without CTEs becomes unreadable fast. Consider the feature extraction query we need for StreamFlow churn prediction: compute tenure, recent usage, usage trend, support tickets, plan changes, and billing failures — all per subscriber. Without CTEs, this is a single monstrous query with five subqueries nested inside each other. With CTEs, it is five named steps followed by a final join.
Building a Feature Extraction Query with CTEs
-- StreamFlow churn features: the CTE approach
WITH tenure AS (
-- Step 1: Compute subscriber tenure in months
SELECT
subscriber_id,
plan_type,
country,
signup_date,
cancel_date,
(COALESCE(cancel_date, CURRENT_DATE) - signup_date) / 30.44
AS tenure_months -- DATE - DATE yields integer days in PostgreSQL
FROM subscribers
),
recent_usage AS (
-- Step 2: Aggregate usage over the last 30 and 90 days
SELECT
subscriber_id,
SUM(CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
THEN session_minutes ELSE 0 END) AS minutes_last_30d,
SUM(CASE WHEN event_date >= CURRENT_DATE - INTERVAL '90 days'
THEN session_minutes ELSE 0 END) AS minutes_last_90d,
COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
THEN event_date END) AS active_days_last_30d,
COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
THEN feature_used END) AS distinct_features_last_30d,
COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - INTERVAL '30 days'
THEN device_type END) AS distinct_devices_last_30d
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY subscriber_id
),
ticket_stats AS (
-- Step 3: Support ticket features
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '90 days')
AS tickets_last_90d,
COUNT(*) FILTER (WHERE category = 'complaint'
AND created_date >= CURRENT_DATE - INTERVAL '90 days')
AS complaints_last_90d,
MAX(created_date) AS last_ticket_date,
CURRENT_DATE - MAX(created_date) AS days_since_last_ticket
FROM support_tickets
GROUP BY subscriber_id
),
plan_history AS (
-- Step 4: Plan change features
SELECT
subscriber_id,
COUNT(*) AS total_plan_changes,
COUNT(*) FILTER (WHERE change_type = 'downgrade') AS downgrades,
COUNT(*) FILTER (WHERE change_type = 'upgrade') AS upgrades,
MAX(change_date) AS last_change_date
FROM plan_changes
GROUP BY subscriber_id
),
billing_stats AS (
-- Step 5: Billing failure features
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE status = 'failed'
AND billing_date >= CURRENT_DATE - INTERVAL '180 days')
AS failed_payments_6m,
COUNT(*) FILTER (WHERE status = 'refunded'
AND billing_date >= CURRENT_DATE - INTERVAL '180 days')
AS refunds_6m
FROM billing_events
GROUP BY subscriber_id
)
-- Final assembly: join all feature CTEs
SELECT
t.subscriber_id,
t.plan_type,
t.country,
t.tenure_months,
COALESCE(u.minutes_last_30d, 0) AS minutes_last_30d,
COALESCE(u.minutes_last_90d, 0) AS minutes_last_90d,
COALESCE(u.active_days_last_30d, 0) AS active_days_last_30d,
COALESCE(u.distinct_features_last_30d, 0) AS distinct_features_last_30d,
COALESCE(u.distinct_devices_last_30d, 0) AS distinct_devices_last_30d,
COALESCE(ts.tickets_last_90d, 0) AS tickets_last_90d,
COALESCE(ts.complaints_last_90d, 0) AS complaints_last_90d,
ts.days_since_last_ticket,
COALESCE(ph.total_plan_changes, 0) AS total_plan_changes,
COALESCE(ph.downgrades, 0) AS downgrades,
COALESCE(ph.upgrades, 0) AS upgrades,
COALESCE(bs.failed_payments_6m, 0) AS failed_payments_6m,
COALESCE(bs.refunds_6m, 0) AS refunds_6m,
CASE WHEN t.cancel_date IS NOT NULL THEN 1 ELSE 0 END AS churned
FROM tenure t
LEFT JOIN recent_usage u ON t.subscriber_id = u.subscriber_id
LEFT JOIN ticket_stats ts ON t.subscriber_id = ts.subscriber_id
LEFT JOIN plan_history ph ON t.subscriber_id = ph.subscriber_id
LEFT JOIN billing_stats bs ON t.subscriber_id = bs.subscriber_id;
Read that query out loud. "Start with tenure. Then compute recent usage. Then ticket stats. Then plan history. Then billing stats. Then join them all together." Each CTE is independently testable. You can run SELECT * FROM tenure LIMIT 10 to check the first step. You can add a sixth CTE without touching the first five. You can hand this query to a colleague and they will understand it.
Production Tip — Name your CTEs like you name functions: descriptively. cte1, cte2, cte3 is only slightly better than nested subqueries. tenure, recent_usage, ticket_stats tells you what each step computes. Future-you is a colleague too.
The FILTER Clause: PostgreSQL's Secret Weapon
Notice the COUNT(*) FILTER (WHERE ...) syntax in the CTEs above. This is PostgreSQL's FILTER clause, and it is dramatically cleaner than the traditional CASE-WHEN approach:
-- Traditional CASE-WHEN (works everywhere)
COUNT(CASE WHEN category = 'complaint' THEN 1 END) AS complaints
-- PostgreSQL FILTER (cleaner, same result)
COUNT(*) FILTER (WHERE category = 'complaint') AS complaints
Both produce identical results. FILTER is more readable, especially when you have multiple conditional aggregations in the same query.
BigQuery Note — BigQuery does not support the FILTER clause. Use COUNTIF(condition) instead, which is BigQuery's equivalent: COUNTIF(category = 'complaint') AS complaints.
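The equivalence is easy to check. SQLite also supports FILTER on aggregates (since 3.30), so this sketch on invented ticket data runs both forms side by side:

```python
import sqlite3

# CASE-WHEN and FILTER count the same rows; FILTER just reads better.
# Requires SQLite >= 3.30 for FILTER on aggregates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (category TEXT)")
conn.executemany("INSERT INTO tickets VALUES (?)",
                 [("complaint",), ("billing",), ("complaint",), ("technical",)])

row = conn.execute("""
    SELECT COUNT(CASE WHEN category = 'complaint' THEN 1 END) AS case_when_count,
           COUNT(*) FILTER (WHERE category = 'complaint')     AS filter_count
    FROM tickets
""").fetchone()

# Both counts are 2 -- identical results, different syntax.
print(row)
```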
Recursive CTEs: When You Need Hierarchies
Recursive CTEs solve problems involving hierarchical or iterative data. They are less common in feature engineering but invaluable when you need them.
-- Generate a date spine (one row per day for the last 365 days)
WITH RECURSIVE date_spine AS (
SELECT CURRENT_DATE - INTERVAL '365 days' AS dt
UNION ALL
SELECT dt + INTERVAL '1 day'
FROM date_spine
WHERE dt < CURRENT_DATE
)
SELECT dt::DATE AS calendar_date
FROM date_spine;
Feature engineering use case: join this date spine to your usage events with a LEFT JOIN to ensure every subscriber has a row for every day — including days with zero usage. Missing rows are not the same as zero values, and your model needs to know the difference.
-- Daily usage per subscriber, including zero-usage days
WITH RECURSIVE date_spine AS (
SELECT CURRENT_DATE - INTERVAL '90 days' AS dt
UNION ALL
SELECT dt + INTERVAL '1 day'
FROM date_spine
WHERE dt < CURRENT_DATE
),
active_subscribers AS (
SELECT subscriber_id
FROM subscribers
WHERE cancel_date IS NULL OR cancel_date >= CURRENT_DATE - INTERVAL '90 days'
),
subscriber_days AS (
SELECT
s.subscriber_id,
d.dt::DATE AS calendar_date
FROM active_subscribers s
CROSS JOIN date_spine d
)
SELECT
sd.subscriber_id,
sd.calendar_date,
COALESCE(SUM(ue.session_minutes), 0) AS daily_minutes,
COALESCE(COUNT(ue.event_id), 0) AS session_count
FROM subscriber_days sd
LEFT JOIN usage_events ue
ON sd.subscriber_id = ue.subscriber_id
AND sd.calendar_date = ue.event_date
GROUP BY sd.subscriber_id, sd.calendar_date;
BigQuery Note — BigQuery does not support recursive CTEs. Use GENERATE_DATE_ARRAY(start_date, end_date) with UNNEST() instead: SELECT dt FROM UNNEST(GENERATE_DATE_ARRAY('2024-04-01', CURRENT_DATE())) AS dt.
Production Tip — Most data warehouses maintain a pre-built date dimension table (dim_date or calendar). Use it instead of generating dates on the fly. The recursive CTE approach is for ad-hoc analysis; the dimension table is for production pipelines.
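A compressed, runnable version of the date-spine pattern — SQLite supports WITH RECURSIVE too, so this sketch (five invented days, two sparse events) shows the zero-fill behavior the LEFT JOIN buys you:

```python
import sqlite3

# Generate a 4-day spine, LEFT JOIN sparse events, and COALESCE the
# missing days to 0 -- absent rows become explicit zeros.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (d TEXT, minutes REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("2025-01-01", 30.0), ("2025-01-03", 12.0)])

rows = conn.execute("""
    WITH RECURSIVE spine(d) AS (
        SELECT '2025-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM spine WHERE d < '2025-01-04'
    )
    SELECT spine.d,
           COALESCE(SUM(events.minutes), 0) AS daily_minutes
    FROM spine
    LEFT JOIN events ON events.d = spine.d
    GROUP BY spine.d
    ORDER BY spine.d
""").fetchall()

# Jan 2 and Jan 4 appear with 0 minutes instead of vanishing entirely.
for r in rows:
    print(r)
```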
5.4 Complex Joins for Feature Construction
Basic joins get you basic features. Complex joins — self-joins, anti-joins, lateral joins — get you the features that actually predict.
Self-Joins: Comparing a Row to Itself
A self-join joins a table to itself. Use it when you need to compare rows within the same table — the classic case being consecutive events for the same subscriber. In modern SQL, a window function like LEAD often expresses that comparison more cheaply than an explicit self-join, as the query below does; the literal self-join form appears afterward for contrast.
-- For each subscriber, find the usage gap: max days between consecutive sessions
WITH sessions AS (
SELECT
subscriber_id,
event_date,
LEAD(event_date) OVER (
PARTITION BY subscriber_id
ORDER BY event_date
) AS next_session_date
FROM (
SELECT DISTINCT subscriber_id, event_date
FROM usage_events
) daily_sessions
)
SELECT
subscriber_id,
MAX(next_session_date - event_date) AS max_gap_days,
AVG(next_session_date - event_date) AS avg_gap_days
FROM sessions
WHERE next_session_date IS NOT NULL
GROUP BY subscriber_id;
This computes two features: the maximum gap between sessions (a spike in this value suggests disengagement) and the average gap between sessions (a baseline engagement rhythm). Both are strong churn predictors.
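For contrast, here is the literal self-join that the LEAD query replaces, sketched with sqlite3 on invented session dates. The join condition b.event_date > a.event_date plus MIN reconstructs "next session" — correct, but an inequality join can be expensive at scale, which is why the window-function form is usually preferred:

```python
import sqlite3

# Self-join: pair each session date with the earliest LATER session date
# for the same subscriber.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sessions (subscriber_id INTEGER, event_date TEXT)")
conn.executemany("INSERT INTO daily_sessions VALUES (?, ?)",
                 [(1, "2025-01-01"), (1, "2025-01-04"), (1, "2025-01-05")])

rows = conn.execute("""
    SELECT a.subscriber_id,
           a.event_date,
           MIN(b.event_date) AS next_session_date
    FROM daily_sessions a
    JOIN daily_sessions b
      ON a.subscriber_id = b.subscriber_id
     AND b.event_date > a.event_date   -- the self-join condition
    GROUP BY a.subscriber_id, a.event_date
    ORDER BY a.event_date
""").fetchall()

# Jan 1 pairs with Jan 4; Jan 4 pairs with Jan 5; the last session
# has no successor and drops out (an inner join, like LEAD's NULL row filtered).
for r in rows:
    print(r)
```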
Anti-Joins: Finding What Is Missing
An anti-join returns rows from the left table that have no match in the right table. There are two patterns:
-- Pattern 1: LEFT JOIN + IS NULL
-- Find subscribers who have NEVER filed a support ticket
SELECT s.subscriber_id, s.plan_type, s.signup_date
FROM subscribers s
LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id
WHERE st.subscriber_id IS NULL;
-- Pattern 2: NOT EXISTS (often faster)
-- Same result, different execution path
SELECT s.subscriber_id, s.plan_type, s.signup_date
FROM subscribers s
WHERE NOT EXISTS (
SELECT 1
FROM support_tickets st
WHERE st.subscriber_id = s.subscriber_id
);
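Both anti-join patterns are shown working below — a small sqlite3 sketch with three invented subscribers, only one of whom has a ticket:

```python
import sqlite3

# Pattern 1 (LEFT JOIN + IS NULL) and Pattern 2 (NOT EXISTS) return the
# same set: subscribers with no matching ticket row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (subscriber_id INTEGER)")
conn.execute("CREATE TABLE support_tickets (subscriber_id INTEGER)")
conn.executemany("INSERT INTO subscribers VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO support_tickets VALUES (?)", [(2,)])

left_join = conn.execute("""
    SELECT s.subscriber_id
    FROM subscribers s
    LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id
    WHERE st.subscriber_id IS NULL
    ORDER BY s.subscriber_id
""").fetchall()

not_exists = conn.execute("""
    SELECT s.subscriber_id
    FROM subscribers s
    WHERE NOT EXISTS (SELECT 1 FROM support_tickets st
                      WHERE st.subscriber_id = s.subscriber_id)
    ORDER BY s.subscriber_id
""").fetchall()

# Both queries: subscribers 1 and 3 (never filed a ticket).
print(left_join, not_exists)
```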
Feature engineering use case: "Has this subscriber ever used feature X?" is a binary feature constructed with an anti-join. "Has this subscriber logged in from mobile?" Same pattern.
-- Binary features via anti-join pattern
SELECT
s.subscriber_id,
CASE WHEN mobile.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_mobile,
CASE WHEN api.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_api,
CASE WHEN export.subscriber_id IS NOT NULL THEN 1 ELSE 0 END AS has_used_export
FROM subscribers s
LEFT JOIN (
SELECT DISTINCT subscriber_id
FROM usage_events WHERE device_type = 'mobile'
) mobile ON s.subscriber_id = mobile.subscriber_id
LEFT JOIN (
SELECT DISTINCT subscriber_id
FROM usage_events WHERE feature_used = 'api'
) api ON s.subscriber_id = api.subscriber_id
LEFT JOIN (
SELECT DISTINCT subscriber_id
FROM usage_events WHERE feature_used = 'export'
) export ON s.subscriber_id = export.subscriber_id;
Lateral Joins: Correlated Subqueries Done Right
A lateral join lets a subquery reference columns from preceding tables in the FROM clause. It is like a correlated subquery, but it can return multiple columns and multiple rows.
-- For each subscriber, get their 3 most recent sessions
SELECT
s.subscriber_id,
s.plan_type,
recent.event_date,
recent.session_minutes,
recent.feature_used
FROM subscribers s
CROSS JOIN LATERAL (
SELECT event_date, session_minutes, feature_used
FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
ORDER BY event_date DESC
LIMIT 3
) recent
WHERE s.cancel_date IS NULL;
Without LATERAL, you would need the ROW_NUMBER + CTE pattern we saw in Section 5.2. LATERAL is more concise for "top-N per group" queries.
BigQuery Note — BigQuery does not support LATERAL joins. Use ARRAY_AGG() with ORDER BY and subscript notation instead, or stick with the ROW_NUMBER pattern, which works everywhere.
Common Mistake — CROSS JOIN LATERAL returns no rows for subscribers with zero usage events (because the subquery returns empty). If you want to keep subscribers with no usage, use LEFT JOIN LATERAL ... ON TRUE instead.
5.5 Conditional Aggregation: Feature Engineering Patterns
Many features follow the same SQL pattern: count or sum something, but only when a condition is met. Conditional aggregation is the workhorse of SQL-based feature engineering.
The "Days Since Last X" Pattern
-- Days since last activity of each type
SELECT
s.subscriber_id,
CURRENT_DATE - MAX(ue.event_date) AS days_since_last_session,
CURRENT_DATE - MAX(CASE WHEN ue.feature_used = 'dashboard'
THEN ue.event_date END)
AS days_since_last_dashboard,
CURRENT_DATE - MAX(CASE WHEN ue.feature_used = 'reports'
THEN ue.event_date END)
AS days_since_last_report,
CURRENT_DATE - MAX(st.created_date) AS days_since_last_ticket
FROM subscribers s
LEFT JOIN usage_events ue ON s.subscriber_id = ue.subscriber_id
LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id
WHERE s.cancel_date IS NULL
GROUP BY s.subscriber_id;
The "days since last X" pattern produces some of the most predictive features in churn models. A subscriber whose last session was 45 days ago is a different risk profile than one whose last session was yesterday — and this pattern captures that difference in a single integer.
The Event Count Pattern
-- Event counts by time window and category
SELECT
subscriber_id,
-- Session counts by time window
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '7 days')
AS sessions_7d,
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days')
AS sessions_30d,
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days')
AS sessions_90d,
-- Feature usage counts (last 30 days)
COUNT(*) FILTER (WHERE feature_used = 'dashboard'
AND event_date >= CURRENT_DATE - INTERVAL '30 days')
AS dashboard_sessions_30d,
COUNT(*) FILTER (WHERE feature_used = 'reports'
AND event_date >= CURRENT_DATE - INTERVAL '30 days')
AS report_sessions_30d,
COUNT(*) FILTER (WHERE feature_used = 'api'
AND event_date >= CURRENT_DATE - INTERVAL '30 days')
AS api_sessions_30d,
COUNT(*) FILTER (WHERE feature_used = 'export'
AND event_date >= CURRENT_DATE - INTERVAL '30 days')
AS export_sessions_30d,
-- Engagement ratio: recent vs. historical
CASE WHEN COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days') > 0
THEN COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days')::NUMERIC /
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '90 days')
ELSE NULL
END AS engagement_ratio_30d_to_90d
FROM usage_events
GROUP BY subscriber_id;
The engagement ratio at the bottom is particularly useful: it captures the trend direction without needing LAG. A ratio near 0.33 means usage is steady across all three months. A ratio near 0.1 means the subscriber is fading. A ratio near 0.8 means they are ramping up.
The Ratio Pattern
Ratios normalize features and capture behavior proportions:
-- Behavioral ratios
SELECT
subscriber_id,
-- What fraction of sessions use the premium "reports" feature?
COUNT(*) FILTER (WHERE feature_used = 'reports')::NUMERIC /
NULLIF(COUNT(*), 0) AS report_usage_ratio,
-- What fraction of sessions are on mobile?
COUNT(*) FILTER (WHERE device_type = 'mobile')::NUMERIC /
NULLIF(COUNT(*), 0) AS mobile_ratio,
-- Average session length
AVG(session_minutes) AS avg_session_minutes,
-- Session length variability (coefficient of variation)
STDDEV(session_minutes) / NULLIF(AVG(session_minutes), 0) AS session_cv
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY subscriber_id;
Common Mistake — Division by zero. Always use NULLIF(denominator, 0) when computing ratios. A subscriber with zero sessions in the window will cause a division error without this guard. NULLIF(x, 0) returns NULL when x is 0, and NULL propagates through division to produce NULL rather than an error.
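The NULLIF guard in one line, runnable via sqlite3 (note one dialect difference: SQLite returns NULL for division by zero on its own, whereas PostgreSQL raises an error — NULLIF makes the safe behavior explicit and portable):

```python
import sqlite3

# NULLIF(0, 0) -> NULL, and NULL propagates through the division,
# so a zero denominator yields NULL instead of an error.
conn = sqlite3.connect(":memory:")
row = conn.execute("""
    SELECT 10.0 / NULLIF(0, 0) AS guarded,   -- zero denominator
           10.0 / NULLIF(4, 0) AS normal     -- nonzero denominator
""").fetchone()

# guarded is None (NULL); normal is 2.5.
print(row)
```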
5.6 Reading Query Execution Plans
Writing correct SQL is necessary. Writing fast SQL is what separates you from the analyst who can only run small queries. The tool for understanding query performance is EXPLAIN.
EXPLAIN vs. EXPLAIN ANALYZE
-- EXPLAIN shows the plan WITHOUT executing the query
EXPLAIN
SELECT * FROM usage_events
WHERE subscriber_id = 100042
AND event_date >= '2024-10-01';
-- EXPLAIN ANALYZE shows the plan AND executes the query (with actual timings)
EXPLAIN ANALYZE
SELECT * FROM usage_events
WHERE subscriber_id = 100042
AND event_date >= '2024-10-01';
The output of EXPLAIN ANALYZE:
Index Scan using idx_usage_subscriber_date on usage_events
(cost=0.56..124.32 rows=47 width=52)
(actual time=0.028..0.156 rows=42 loops=1)
Index Cond: ((subscriber_id = 100042) AND (event_date >= '2024-10-01'::date))
Planning Time: 0.089 ms
Execution Time: 0.198 ms
Key things to read:
- Scan type: Index Scan means the database used an index. Good. Seq Scan means it read the entire table. Bad (usually).
- cost: Two numbers. The first is startup cost; the second is total estimated cost. Lower is better.
- rows: Estimated vs. actual row count. A large discrepancy means stale statistics (run ANALYZE your_table).
- actual time: Wall-clock time in milliseconds.
- Planning Time vs. Execution Time: If planning time is high, your query might be too complex for the optimizer.
Common Scan Types
| Scan Type | Meaning | When You See It |
|---|---|---|
| Seq Scan | Full table scan — reads every row | No suitable index, or table is small enough that scanning is cheaper |
| Index Scan | Uses an index to find matching rows | Index exists and query is selective (returns <10-15% of rows) |
| Index Only Scan | Uses index without touching the table | All requested columns are in the index (covering index) |
| Bitmap Index Scan | Builds a bitmap of matching rows, then fetches them | Index exists but query returns too many rows for a regular index scan |
| Hash Join | Builds a hash table from one input, probes with the other | Equi-joins (=) on non-indexed columns |
| Merge Join | Merges two sorted inputs | Both inputs are already sorted (often from index scans) |
| Nested Loop | For each row in outer table, scans inner table | Small outer table, indexed inner table |
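You can watch the scan-vs-index distinction happen with SQLite's EXPLAIN QUERY PLAN — far terser than PostgreSQL's EXPLAIN, but the before/after story is the same. A sketch with a hypothetical index name:

```python
import sqlite3

# Same query, planned twice: before the index (full table SCAN) and
# after (SEARCH using the index).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (subscriber_id INTEGER, event_date TEXT)")

query = "SELECT * FROM usage_events WHERE subscriber_id = 100042"

before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
conn.execute("CREATE INDEX idx_usage_sub ON usage_events (subscriber_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# The plan detail is the fourth column of each row:
# before mentions a SCAN of usage_events; after, a SEARCH via idx_usage_sub.
print(before[0][3])
print(after[0][3])
```

The exact wording varies by SQLite version, but the shift from a scan to an index search is the same signal you hunt for in a PostgreSQL plan.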
Reading a Real Feature Query Plan
Let us see what happens with our CTE-based feature extraction query:
EXPLAIN ANALYZE
WITH ticket_stats AS (
SELECT
subscriber_id,
COUNT(*) AS tickets_last_90d
FROM support_tickets
WHERE created_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY subscriber_id
)
SELECT
s.subscriber_id,
s.plan_type,
COALESCE(ts.tickets_last_90d, 0) AS tickets_last_90d
FROM subscribers s
LEFT JOIN ticket_stats ts ON s.subscriber_id = ts.subscriber_id;
Hash Left Join (cost=1245.67..34521.89 rows=2400000 width=30)
(actual time=892.45..4521.33 rows=2400000 loops=1)
Hash Cond: (s.subscriber_id = ts.subscriber_id)
-> Seq Scan on subscribers s
(cost=0.00..28456.00 rows=2400000 width=22)
(actual time=0.012..1234.56 rows=2400000 loops=1)
-> Hash (cost=1189.42..1189.42 rows=4500 width=16)
(actual time=891.23..891.23 rows=4500 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 240kB
-> GroupAggregate (cost=1089.42..1189.42 rows=4500 width=16)
(actual time=856.78..889.45 rows=4500 loops=1)
Group Key: support_tickets.subscriber_id
-> Sort (cost=1089.42..1102.67 rows=5300 width=8)
(actual time=856.12..861.34 rows=5300 loops=1)
Sort Key: support_tickets.subscriber_id
Sort Method: quicksort Memory: 412kB
-> Seq Scan on support_tickets
(cost=0.00..789.00 rows=5300 width=8)
(actual time=0.015..845.23 rows=5300 loops=1)
Filter: (created_date >= (CURRENT_DATE - '90 days'::interval))
Rows Removed by Filter: 28700
Planning Time: 1.234 ms
Execution Time: 5678.90 ms
Reading this bottom-up (which is how query plans execute):
- PostgreSQL scans the support_tickets table sequentially (no index on created_date)
- Filters to the last 90 days, removing 28,700 rows and keeping 5,300
- Sorts by subscriber_id and groups to compute the count
- Builds a hash table from those 4,500 results
- Scans all 2.4M subscribers sequentially
- Probes the hash table for each subscriber to complete the left join
The bottleneck: the Seq Scan on support_tickets with a date filter. An index would help.
5.7 Query Optimization
Knowing how to read a plan is step one. Knowing how to improve it is step two.
Strategy 1: Create the Right Indexes
An index is a data structure that lets the database find rows without scanning the entire table. Think of it as the index at the back of a textbook — instead of reading every page to find "window function," you look it up in the index and go directly to page 342.
-- Index for the support_tickets date filter we saw above
CREATE INDEX idx_tickets_date ON support_tickets (created_date);
-- Composite index for the most common query pattern: filter by subscriber + date
CREATE INDEX idx_usage_subscriber_date
ON usage_events (subscriber_id, event_date);
-- Composite index with an INCLUDE for covering index scans
CREATE INDEX idx_usage_covering
ON usage_events (subscriber_id, event_date)
INCLUDE (session_minutes, feature_used, device_type);
The last index is a covering index — it includes all the columns our queries need, so PostgreSQL can satisfy the query entirely from the index without touching the table at all (an "Index Only Scan"). This is the fastest possible read path.
Indexing rules of thumb:
- Index columns that appear in WHERE clauses
- Index columns that appear in JOIN conditions
- Put the most selective column first in composite indexes
- Do not index columns with very low cardinality (e.g., a boolean) unless combined with a selective column
- Every index slows down writes. Do not index everything.
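You can watch an index change the plan without a PostgreSQL instance. The sketch below uses Python's built-in sqlite3 — SQLite's EXPLAIN QUERY PLAN is a lightweight analogue of PostgreSQL's EXPLAIN, and the table contents here are made up for illustration, not StreamFlow's real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE support_tickets (
        ticket_id INTEGER PRIMARY KEY,
        subscriber_id INTEGER,
        created_date TEXT
    )
""")
# Synthetic tickets spread across 12 months
conn.executemany(
    "INSERT INTO support_tickets (subscriber_id, created_date) VALUES (?, ?)",
    [(i % 100, f"2024-{(i % 12) + 1:02d}-15") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN returns (id, parent, notused, detail) rows
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM support_tickets WHERE created_date >= '2024-10-01'"
p_before = plan(query)   # no index covers created_date: a full-table scan
conn.execute("CREATE INDEX idx_tickets_date ON support_tickets (created_date)")
p_after = plan(query)    # the range filter can now use the index
print(p_before)
print(p_after)
```

The plan text goes from a SCAN of the whole table to a SEARCH using idx_tickets_date — the same Seq Scan vs. Index Scan transition you hunt for in PostgreSQL plans.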
Strategy 2: Avoid SELECT *
-- Bad: reads all columns from disk, even if you only need two
SELECT * FROM usage_events WHERE subscriber_id = 100042;
-- Good: reads only the columns you need
SELECT event_date, session_minutes
FROM usage_events
WHERE subscriber_id = 100042;
With a covering index on (subscriber_id, event_date) INCLUDE (session_minutes), the second query can use an Index Only Scan. The first query cannot, because SELECT * requests columns that are not in the index.
Strategy 3: Partition Pruning
For large tables, partitioning splits data across multiple physical storage units. The optimizer can skip partitions entirely if the query filter matches the partition key.
-- Partition usage_events by month
CREATE TABLE usage_events (
event_id BIGINT,
subscriber_id BIGINT,
event_date DATE NOT NULL,
session_minutes NUMERIC(8,2),
feature_used VARCHAR(50),
device_type VARCHAR(20)
) PARTITION BY RANGE (event_date);
CREATE TABLE usage_events_2024_10
PARTITION OF usage_events
FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');
CREATE TABLE usage_events_2024_11
PARTITION OF usage_events
FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
-- ... and so on
Now when you query WHERE event_date >= '2024-10-01' AND event_date < '2025-01-01', PostgreSQL scans only 3 partitions instead of the entire table. For a table with billions of rows, this is the difference between seconds and minutes.
Strategy 4: Materialized Views for Reusable Feature Sets
If you run the same feature extraction query repeatedly (daily model retraining, for example), materialize the result:
-- Create a materialized view of subscriber features
CREATE MATERIALIZED VIEW mv_subscriber_features AS
WITH tenure AS (
-- ... (same CTEs as Section 5.3)
),
recent_usage AS ( ... ),
ticket_stats AS ( ... ),
plan_history AS ( ... ),
billing_stats AS ( ... )
SELECT
t.subscriber_id,
t.plan_type,
t.tenure_months,
-- ... all features
FROM tenure t
LEFT JOIN recent_usage u ON t.subscriber_id = u.subscriber_id
LEFT JOIN ticket_stats ts ON t.subscriber_id = ts.subscriber_id
LEFT JOIN plan_history ph ON t.subscriber_id = ph.subscriber_id
LEFT JOIN billing_stats bs ON t.subscriber_id = bs.subscriber_id;
-- Create an index on the materialized view
CREATE UNIQUE INDEX idx_mv_features_subscriber
ON mv_subscriber_features (subscriber_id);
-- Refresh daily (typically in an ETL/orchestration tool)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_subscriber_features;
The materialized view runs the expensive query once and stores the result as a physical table. Subsequent reads are instant. REFRESH MATERIALIZED VIEW CONCURRENTLY updates the view without locking it for reads — essential for production systems.
Production Tip — In modern data stacks, materialized views are often replaced by dbt models that run on a schedule. The concept is the same: compute expensive features once, store the results, and read from the stored results. The difference is tooling and version control, not SQL.
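Engines without materialized views can emulate the same compute-once, read-many pattern with a plain table that a scheduled job rebuilds. A minimal sketch using Python's built-in sqlite3 (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE support_tickets (subscriber_id INTEGER, created_date TEXT);
    INSERT INTO support_tickets VALUES
        (1, '2024-11-01'), (1, '2024-11-15'), (2, '2024-12-02');
""")

def refresh_feature_table(conn):
    # Materialize the aggregation as a physical table; a scheduler reruns
    # this function, playing the role of REFRESH MATERIALIZED VIEW
    conn.executescript("""
        DROP TABLE IF EXISTS ticket_features;
        CREATE TABLE ticket_features AS
        SELECT subscriber_id, COUNT(*) AS ticket_count
        FROM support_tickets
        GROUP BY subscriber_id;
    """)

refresh_feature_table(conn)
# Readers hit the cheap precomputed table, not the expensive aggregation
features = dict(conn.execute(
    "SELECT subscriber_id, ticket_count FROM ticket_features"))
print(features)  # {1: 2, 2: 1}
```

The DROP-and-recreate refresh is the crude version; PostgreSQL's CONCURRENTLY refresh avoids blocking readers during the rebuild.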
Strategy 5: Optimize JOIN Order and Types
The query optimizer usually chooses the best join order, but you can help it:
-- Let PostgreSQL know how many distinct values exist
ANALYZE subscribers;
ANALYZE usage_events;
ANALYZE support_tickets;
ANALYZE updates the statistics that the query planner uses. If your explain plans show wildly inaccurate row estimates, run ANALYZE.
JOIN type performance hierarchy (general guidance, not absolute rules):
- Hash Join — Best for equi-joins between large tables
- Merge Join — Best when both inputs are already sorted
- Nested Loop — Best when outer table is small and inner table is indexed
When you see a Nested Loop on two large tables with no index, that is your bottleneck. Either add an index or restructure the query.
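The intuition behind that hierarchy fits in a few lines of plain Python: a nested loop compares every pair of rows, while a hash join pays one pass to build a lookup table and one pass to probe it. The data here is synthetic:

```python
from collections import defaultdict

left = [(i, f"L{i}") for i in range(1000)]          # 1,000 rows, unique keys
right = [(i % 100, f"R{i}") for i in range(1000)]   # 1,000 rows, 100 distinct keys

# Nested loop join: O(len(left) * len(right)) key comparisons
nested = [(lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

# Hash join: build a hash table from one input, probe with the other —
# one pass over each side instead of a full cross-comparison
buckets = defaultdict(list)
for rk, rv in right:
    buckets[rk].append(rv)
hashed = [(lv, rv) for lk, lv in left for rv in buckets.get(lk, [])]

print(len(nested), len(hashed))  # same join result, far fewer comparisons
```

An indexed nested loop flips the economics: if the inner side has an index, each outer row costs one index lookup instead of a full inner scan, which is why small-outer/indexed-inner is the one case where Nested Loop wins.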
5.8 Putting It All Together: The StreamFlow Feature Query
Let us combine everything from this chapter into a single, production-grade feature extraction query. This is the query that feeds into the churn model.
-- StreamFlow subscriber features for churn prediction
-- PostgreSQL | Refresh: daily | Owner: data-science-team
-- Dependencies: subscribers, usage_events, support_tickets,
-- plan_changes, billing_events
WITH monthly_usage AS (
-- Per-subscriber monthly usage, with LAG for trend computation
SELECT
subscriber_id,
DATE_TRUNC('month', event_date) AS usage_month,
SUM(session_minutes) AS total_minutes,
COUNT(DISTINCT event_date) AS active_days,
COUNT(DISTINCT feature_used) AS features_used,
LAG(SUM(session_minutes), 1) OVER (
PARTITION BY subscriber_id
ORDER BY DATE_TRUNC('month', event_date)
) AS prev_month_minutes
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY subscriber_id, DATE_TRUNC('month', event_date)
),
current_month_stats AS (
-- Most recent complete month's statistics
SELECT
subscriber_id,
total_minutes AS current_month_minutes,
active_days AS current_month_active_days,
features_used AS current_month_features,
CASE WHEN prev_month_minutes > 0
THEN ROUND(((total_minutes - prev_month_minutes)
/ prev_month_minutes * 100)::NUMERIC, 1)
ELSE NULL
END AS month_over_month_pct_change
FROM monthly_usage
WHERE usage_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
),
usage_trend AS (
-- 3-month trend: slope of usage over last 3 complete months
-- Positive = growing, negative = declining
SELECT
subscriber_id,
REGR_SLOPE(total_minutes, EXTRACT(EPOCH FROM usage_month)) AS usage_slope
FROM monthly_usage
WHERE usage_month >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '3 months')
AND usage_month < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY subscriber_id
HAVING COUNT(*) >= 2 -- Need at least 2 data points for a slope
),
session_gaps AS (
-- Inter-session gaps: how regularly does the subscriber use the product?
SELECT
subscriber_id,
MAX(next_event - event_date) AS max_gap_days,
AVG(next_event - event_date)::NUMERIC(8,1) AS avg_gap_days,
CURRENT_DATE - MAX(event_date) AS days_since_last_session
FROM (
SELECT
subscriber_id,
event_date,
LEAD(event_date) OVER (
PARTITION BY subscriber_id ORDER BY event_date
) AS next_event
FROM (SELECT DISTINCT subscriber_id, event_date FROM usage_events) d
) gaps
GROUP BY subscriber_id
),
ticket_features AS (
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '30 days')
AS tickets_30d,
COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '90 days')
AS tickets_90d,
COUNT(*) FILTER (WHERE category = 'complaint') AS total_complaints,
BOOL_OR(priority IN ('high', 'critical')
AND created_date >= CURRENT_DATE - INTERVAL '90 days')
AS has_recent_critical_ticket,
CURRENT_DATE - MAX(created_date) AS days_since_last_ticket
FROM support_tickets
GROUP BY subscriber_id
),
plan_features AS (
SELECT
subscriber_id,
COUNT(*) AS total_changes,
COUNT(*) FILTER (WHERE change_type = 'downgrade') AS downgrades,
COUNT(*) FILTER (WHERE change_type = 'upgrade') AS upgrades,
-- Net direction: positive = net upgrader, negative = net downgrader
COUNT(*) FILTER (WHERE change_type = 'upgrade') -
COUNT(*) FILTER (WHERE change_type = 'downgrade') AS net_plan_direction,
-- Recent downgrade (strong churn signal)
BOOL_OR(change_type = 'downgrade'
AND change_date >= CURRENT_DATE - INTERVAL '60 days')
AS recent_downgrade
FROM plan_changes
GROUP BY subscriber_id
),
billing_features AS (
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE status = 'failed') AS total_failed_payments,
COUNT(*) FILTER (WHERE status = 'failed'
AND billing_date >= CURRENT_DATE - INTERVAL '90 days')
AS failed_payments_90d,
COUNT(*) FILTER (WHERE status = 'refunded') AS total_refunds
FROM billing_events
GROUP BY subscriber_id
)
-- Final feature assembly
SELECT
s.subscriber_id,
-- Demographic / account features
s.plan_type,
s.country,
(COALESCE(s.cancel_date, CURRENT_DATE) - s.signup_date)
/ 30.44 AS tenure_months, -- DATE subtraction yields integer days; ~30.44 days/month
-- Usage features (current month)
COALESCE(cms.current_month_minutes, 0) AS current_month_minutes,
COALESCE(cms.current_month_active_days, 0) AS current_month_active_days,
COALESCE(cms.current_month_features, 0) AS current_month_features_used,
cms.month_over_month_pct_change,
-- Usage trend
COALESCE(ut.usage_slope, 0) AS usage_trend_slope,
-- Engagement regularity
sg.days_since_last_session,
sg.max_gap_days,
sg.avg_gap_days,
-- Support features
COALESCE(tf.tickets_30d, 0) AS tickets_30d,
COALESCE(tf.tickets_90d, 0) AS tickets_90d,
COALESCE(tf.total_complaints, 0) AS total_complaints,
COALESCE(tf.has_recent_critical_ticket, FALSE) AS has_recent_critical_ticket,
tf.days_since_last_ticket,
-- Plan features
COALESCE(pf.total_changes, 0) AS total_plan_changes,
COALESCE(pf.downgrades, 0) AS plan_downgrades,
COALESCE(pf.upgrades, 0) AS plan_upgrades,
COALESCE(pf.net_plan_direction, 0) AS net_plan_direction,
COALESCE(pf.recent_downgrade, FALSE) AS recent_downgrade,
-- Billing features
COALESCE(bf.total_failed_payments, 0) AS total_failed_payments,
COALESCE(bf.failed_payments_90d, 0) AS failed_payments_90d,
COALESCE(bf.total_refunds, 0) AS total_refunds,
-- Target variable
CASE WHEN s.cancel_date IS NOT NULL
AND s.cancel_date <= CURRENT_DATE + INTERVAL '30 days'
THEN 1 ELSE 0
END AS churned_within_30d
FROM subscribers s
LEFT JOIN current_month_stats cms ON s.subscriber_id = cms.subscriber_id
LEFT JOIN usage_trend ut ON s.subscriber_id = ut.subscriber_id
LEFT JOIN session_gaps sg ON s.subscriber_id = sg.subscriber_id
LEFT JOIN ticket_features tf ON s.subscriber_id = tf.subscriber_id
LEFT JOIN plan_features pf ON s.subscriber_id = pf.subscriber_id
LEFT JOIN billing_features bf ON s.subscriber_id = bf.subscriber_id
WHERE s.cancel_date IS NULL -- Active subscribers only
OR s.cancel_date >= CURRENT_DATE - INTERVAL '30 days'; -- Recently canceled (for training)
That is 7 CTEs, 25+ features, and a final assembly join. It runs in the database, takes advantage of indexes and the query optimizer, and produces a flat table that pandas can read directly via pd.read_sql():
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host:5432/streamflow')
features_df = pd.read_sql("""
SELECT * FROM mv_subscriber_features
""", engine)
print(f"Shape: {features_df.shape}")
print(f"Columns: {features_df.columns.tolist()}")
Shape: (2400000, 27)
Columns: ['subscriber_id', 'plan_type', 'country', 'tenure_months',
'current_month_minutes', 'current_month_active_days', ...]
Try It — Run the feature query CTE by CTE. Start with just the monthly_usage CTE and SELECT * FROM monthly_usage LIMIT 20. Verify each step before adding the next. This is the CTE debugging workflow: build incrementally, validate each layer.
5.9 Advanced Patterns
Two additional patterns that come up frequently in production feature engineering.
Pattern: Sessionization with Window Functions
Raw event logs often need to be grouped into sessions. A session is a sequence of events with no gap longer than some threshold (commonly 30 minutes). This is a classic window function problem.
-- Sessionize usage events (30-minute inactivity threshold)
-- Assumes an event timestamp at minute granularity; with a pure DATE
-- column, subtraction yields integer days, not an INTERVAL
WITH events_with_prev AS (
SELECT
subscriber_id,
event_id,
event_date,
session_minutes,
LAG(event_date) OVER (
PARTITION BY subscriber_id
ORDER BY event_id
) AS prev_event_date,
-- Flag: is this the start of a new session?
CASE WHEN LAG(event_date) OVER (
PARTITION BY subscriber_id ORDER BY event_id
) IS NULL
OR event_date - LAG(event_date) OVER (
PARTITION BY subscriber_id ORDER BY event_id
) > INTERVAL '30 minutes'
THEN 1
ELSE 0
END AS is_new_session
FROM usage_events
),
session_ids AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY subscriber_id
ORDER BY event_id
) AS session_number
FROM events_with_prev
)
SELECT
subscriber_id,
session_number,
MIN(event_date) AS session_start,
MAX(event_date) AS session_end,
COUNT(*) AS events_in_session,
SUM(session_minutes) AS total_session_minutes
FROM session_ids
GROUP BY subscriber_id, session_number;
This is a cumulative-sum trick: mark each "new session start" with 1 and non-starts with 0, then take a running sum. The sum increments at each session boundary, giving you a session identifier.
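The mark-and-cumulative-sum trick is easy to verify on toy data. This sketch runs the same pattern through Python's built-in sqlite3 (window functions need SQLite 3.25+); timestamps are simplified to integer minutes and the threshold is 30 minutes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (subscriber_id INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 0), (1, 10), (1, 25),   # close together: one session
     (1, 90), (1, 100),          # 65-minute gap: a new session
     (2, 5), (2, 50)],           # subscriber 2: 45-minute gap, two sessions
)

rows = conn.execute("""
    WITH flagged AS (
        SELECT subscriber_id, ts,
               -- 1 if first event or gap > 30 minutes, else 0
               CASE WHEN ts - LAG(ts) OVER w IS NULL
                         OR ts - LAG(ts) OVER w > 30
                    THEN 1 ELSE 0 END AS is_new_session
        FROM events
        WINDOW w AS (PARTITION BY subscriber_id ORDER BY ts)
    )
    SELECT subscriber_id,
           SUM(is_new_session) OVER (
               PARTITION BY subscriber_id ORDER BY ts
           ) AS session_number,
           ts
    FROM flagged
    ORDER BY subscriber_id, ts
""").fetchall()
print(rows)
```

Subscriber 1's events get session numbers 1, 1, 1, 2, 2 and subscriber 2's get 1, 2 — the running sum only ticks up where the gap flag fired.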
Pattern: Pivoting Event Data into Feature Columns
-- Pivot: event counts by feature_used, one column per feature
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE feature_used = 'dashboard') AS dashboard_count,
COUNT(*) FILTER (WHERE feature_used = 'reports') AS reports_count,
COUNT(*) FILTER (WHERE feature_used = 'api') AS api_count,
COUNT(*) FILTER (WHERE feature_used = 'export') AS export_count,
-- Diversity: how many distinct features used?
COUNT(DISTINCT feature_used) AS feature_diversity
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY subscriber_id;
BigQuery Note — BigQuery has a native PIVOT operator: SELECT * FROM usage_counts PIVOT(SUM(cnt) FOR feature_used IN ('dashboard', 'reports', 'api', 'export')). PostgreSQL does not have native PIVOT; use the FILTER clause or crosstab() from the tablefunc extension.
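The FILTER-clause pivot is portable enough to test locally: SQLite has supported FILTER on aggregates since 3.30, so Python's built-in sqlite3 can run a scaled-down version on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (subscriber_id INTEGER, feature_used TEXT)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [(1, "dashboard"), (1, "dashboard"), (1, "api"), (2, "reports")],
)

# One output column per feature: conditional counts via FILTER
rows = conn.execute("""
    SELECT subscriber_id,
           COUNT(*) FILTER (WHERE feature_used = 'dashboard') AS dashboard_count,
           COUNT(*) FILTER (WHERE feature_used = 'api')       AS api_count,
           COUNT(DISTINCT feature_used)                       AS feature_diversity
    FROM usage_events
    GROUP BY subscriber_id
    ORDER BY subscriber_id
""").fetchall()
print(rows)  # [(1, 2, 1, 2), (2, 0, 0, 1)]
```

Note that unmatched filters yield 0, not NULL — exactly what a model-ready feature column wants.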
Pattern: Existence Flags with Correlated Subqueries
Sometimes you need a simple yes/no feature: "Did this subscriber do X in the last Y days?"
SELECT
s.subscriber_id,
EXISTS (
SELECT 1 FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
AND ue.event_date >= CURRENT_DATE - INTERVAL '7 days'
) AS active_last_7d,
EXISTS (
SELECT 1 FROM support_tickets st
WHERE st.subscriber_id = s.subscriber_id
AND st.category = 'complaint'
AND st.created_date >= CURRENT_DATE - INTERVAL '30 days'
) AS complained_last_30d,
EXISTS (
SELECT 1 FROM billing_events be
WHERE be.subscriber_id = s.subscriber_id
AND be.status = 'failed'
AND be.billing_date >= CURRENT_DATE - INTERVAL '60 days'
) AS had_payment_failure_60d
FROM subscribers s
WHERE s.cancel_date IS NULL;
These binary features are fast (EXISTS stops at the first match) and surprisingly predictive. "Has the subscriber used the product in the last 7 days?" is a feature you should always compute.
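The EXISTS pattern also runs anywhere. A toy version via Python's built-in sqlite3, with a hard-coded cutoff date standing in for CURRENT_DATE arithmetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscribers (subscriber_id INTEGER PRIMARY KEY);
    INSERT INTO subscribers VALUES (1), (2);
    CREATE TABLE usage_events (subscriber_id INTEGER, event_date TEXT);
    INSERT INTO usage_events VALUES (1, '2024-12-01');
""")

# Correlated EXISTS: 1 if any qualifying row exists, else 0;
# the subquery stops at the first match
rows = conn.execute("""
    SELECT s.subscriber_id,
           EXISTS (
               SELECT 1 FROM usage_events ue
               WHERE ue.subscriber_id = s.subscriber_id
                 AND ue.event_date >= '2024-11-25'  -- toy 7-day cutoff
           ) AS active_last_7d
    FROM subscribers s
    ORDER BY s.subscriber_id
""").fetchall()
print(rows)  # [(1, 1), (2, 0)]
```

SQLite surfaces the boolean as 0/1 integers, which is convenient: the flag arrives already encoded for the model.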
5.10 From SQL to Model: The Handoff
You have written the SQL. You have optimized the query. Now you need to get the data into Python. Here is the production pattern.
import os
import pandas as pd
from sqlalchemy import create_engine, text
# Connection (read credentials from environment variables, never hard-code them)
engine = create_engine(
    f"postgresql://readonly_user:{os.environ['DB_PASSWORD']}"
    "@warehouse.streamflow.io:5432/analytics"
)
# Option 1: Read query results directly
query = text("""
SELECT * FROM mv_subscriber_features
WHERE tenure_months >= 1 -- Exclude brand-new subscribers
""")
df = pd.read_sql(query, engine)
# Option 2: Read in chunks to bound memory during transfer
# (pd.concat rebuilds the full frame; process chunks one at a time
# instead if the result set does not fit in memory)
chunks = pd.read_sql(query, engine, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)
# Verify the handoff
print(f"Shape: {df.shape}")
print(f"Null counts:\n{df.isnull().sum()}")
print(f"Target distribution:\n{df['churned_within_30d'].value_counts(normalize=True)}")
Shape: (2387000, 27)
Null counts:
subscriber_id 0
plan_type 0
country 0
tenure_months 0
current_month_minutes 0
current_month_active_days 0
month_over_month_pct_change 312456
days_since_last_session 8923
days_since_last_ticket 1847234
...
Target distribution:
0 0.918
1 0.082
Name: churned_within_30d, dtype: float64
Note the nulls. month_over_month_pct_change is null for subscribers with less than 2 months of data (no previous month to compare). days_since_last_ticket is null for subscribers who have never filed a ticket. These nulls are meaningful — they carry information. We will handle them properly in Chapter 8 (Missing Data Strategies).
The target variable shows the familiar 8.2% churn rate. This is a class-imbalanced problem. We will address that in Chapter 17 (Class Imbalance).
Production Tip — Store the SQL query that generated your features alongside the model artifact. When the model behaves unexpectedly in production, the first debugging step is re-running the feature query. If you cannot reproduce the features, you cannot reproduce the problem. This is reproducibility (Theme 6) applied to data extraction.
5.11 The Domain Knowledge Connection
The best SQL features come from understanding the business, not from mechanically applying window functions to every column. Here is how domain knowledge (Theme 4) shows up in SQL:
A domain expert at StreamFlow would tell you:
- Subscribers who downgrade their plan usually cancel within 90 days
- A spike in support tickets (especially "billing" category) precedes churn
- Subscribers who use the API are power users and rarely churn
- Failed payments that are not resolved within 7 days lead to involuntary churn
- Usage decline is the strongest churn signal, but only if tenure is > 3 months (new subscribers are still onboarding)
Each of those insights maps to a SQL feature:
-- Domain-informed features
SELECT
s.subscriber_id,
-- Downgrade-to-cancel pattern
CASE WHEN pf.recent_downgrade AND s.plan_type IN ('basic', 'free')
THEN 1 ELSE 0
END AS downgrade_risk_flag,
-- Billing-category ticket spike
CASE WHEN tf_billing.billing_tickets_30d >= 2
THEN 1 ELSE 0
END AS billing_complaint_flag,
-- Power user indicator
CASE WHEN ue_api.api_sessions_30d > 0
THEN 1 ELSE 0
END AS is_api_user,
-- Unresolved payment failure
CASE WHEN bf.unresolved_failures > 0
THEN 1 ELSE 0
END AS unresolved_payment_flag,
-- Usage decline (only meaningful after onboarding)
CASE WHEN tenure_months > 3
AND cms.month_over_month_pct_change < -30
THEN 1 ELSE 0
END AS post_onboarding_decline_flag
FROM subscribers s
-- ... (appropriate joins to pre-computed CTEs)
These domain-informed binary flags often outperform the raw continuous features they are derived from. A churn model does not need to learn that "recent_downgrade AND plan_type = 'free'" is a risk factor if you hand it a pre-computed flag. You are encoding domain knowledge directly into the feature set.
This is the connection to Chapter 6 (Feature Engineering): the SQL extracts the raw materials, and the domain knowledge shapes them into predictive features. Good feature engineering starts in the SQL layer.
Summary
This chapter covered the SQL that data scientists actually use: window functions for temporal features, CTEs for readable query structure, complex joins for feature construction, and execution plans for query optimization.
The key insight is that SQL is not just a data extraction language — it is a feature engineering language. Every window function, every conditional aggregation, every anti-join pattern corresponds to a feature that a machine learning model can use. The database is not an obstacle between you and pandas. It is a computation engine that handles joins, aggregations, and window operations on billions of rows while your laptop would run out of memory on the first pd.read_csv().
In Chapter 6, we will take the flat table this chapter produced and engineer the remaining features: interaction terms, transformations, and the domain-specific signals that no automated tool can create. But the hardest work — extracting structured features from a normalized relational schema — is done. And it was done in SQL, where it belongs.
Progressive Project Milestone M1: StreamFlow SQL Feature Extraction
Deliverable: A SQL query (or set of queries) that extracts at least 15 customer features from StreamFlow's relational database.
Requirements:
- Use at least 3 CTEs for query organization
- Use at least 2 window functions (LAG, ROW_NUMBER, or running aggregate)
- Compute at least 2 "days since last X" features
- Compute at least 2 conditional aggregation features (event counts by category/time window)
- Include the target variable (churned_within_30d)
- Run EXPLAIN ANALYZE and verify no full table scans on large tables
Stretch goals:
- Compute a usage trend slope using REGR_SLOPE
- Build the query as a materialized view with a refresh strategy
Save your SQL in your project repository as src/features/extract_features.sql. In Chapter 6, you will load this feature set into pandas and engineer additional features on top of it.