Key Takeaways — Chapter 5: SQL for Data Scientists


1. SQL is a feature engineering language, not just a data extraction language. Every window function, conditional aggregation, and complex join pattern in this chapter corresponds directly to a feature that a machine learning model can consume. The database is not an obstacle between you and pandas. It is a computation engine with decades of optimization for exactly the operations that feature engineering requires: joins, aggregations, sorting, and windowed computations across millions or billions of rows.

2. Window functions are the single most important SQL skill for data scientists. LAG computes temporal trends. LEAD identifies upcoming patterns. ROW_NUMBER selects the most recent row per group. Running aggregates compute cumulative and moving statistics. NTILE creates percentile buckets. These functions operate within partitions without collapsing rows — which means you can compute per-subscriber statistics while retaining every row in your result. If you learn only one thing from this chapter, learn the function() OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...) syntax.
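The LAG and ROW_NUMBER patterns above can be sketched in a runnable form. This uses SQLite's in-memory database purely so the example executes anywhere (window functions need SQLite 3.25+); the monthly_usage table and its columns are hypothetical stand-ins for the chapter's subscriber data:

```python
import sqlite3

# Hypothetical table: one row per subscriber per month.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE monthly_usage (subscriber_id INTEGER, month TEXT, sessions INTEGER);
INSERT INTO monthly_usage VALUES
  (1, '2024-01', 10), (1, '2024-02', 14), (1, '2024-03', 7),
  (2, '2024-01', 3),  (2, '2024-02', 3);
""")

rows = con.execute("""
SELECT subscriber_id,
       month,
       sessions,
       -- previous month's sessions, per subscriber (NULL for the first month)
       LAG(sessions) OVER (PARTITION BY subscriber_id ORDER BY month) AS prev_sessions,
       -- rn = 1 marks the most recent month per subscriber
       ROW_NUMBER() OVER (PARTITION BY subscriber_id ORDER BY month DESC) AS rn
FROM monthly_usage
ORDER BY subscriber_id, month
""").fetchall()
```

Note that every input row survives: the partitions add per-subscriber context without collapsing the result the way GROUP BY would.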

3. CTEs transform unreadable SQL into debuggable, maintainable pipelines. A feature extraction query with five subqueries nested inside each other is write-once, understand-never code. The same query restructured as five named CTEs — tenure, recent_usage, ticket_stats, plan_history, billing_stats — reads like a recipe. Each step is independently testable. A new team member can understand the query in minutes. Name CTEs descriptively; cte1 is barely better than a nested subquery.
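A minimal two-CTE version of that recipe structure, again in SQLite for runnability; the subscribers/sessions schema and the fixed as-of date '2024-04-01' are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subscribers (subscriber_id INTEGER, signup_date TEXT);
CREATE TABLE sessions (subscriber_id INTEGER, session_date TEXT);
INSERT INTO subscribers VALUES (1, '2023-06-01'), (2, '2024-01-15');
INSERT INTO sessions VALUES (1, '2024-03-01'), (1, '2024-03-10'), (2, '2024-02-20');
""")

rows = con.execute("""
WITH tenure AS (          -- days since signup, per subscriber
    SELECT subscriber_id,
           CAST(julianday('2024-04-01') - julianday(signup_date) AS INTEGER) AS tenure_days
    FROM subscribers
),
recent_usage AS (         -- session count per subscriber
    SELECT subscriber_id, COUNT(*) AS n_sessions
    FROM sessions
    GROUP BY subscriber_id
)
SELECT t.subscriber_id, t.tenure_days, COALESCE(r.n_sessions, 0) AS n_sessions
FROM tenure t
LEFT JOIN recent_usage r USING (subscriber_id)
ORDER BY t.subscriber_id
""").fetchall()
```

Each CTE can be run by itself while debugging: replace the final SELECT with `SELECT * FROM tenure` and inspect the intermediate step in isolation.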

4. "Days since last X" and conditional aggregation are the two most common SQL feature engineering patterns. "Days since last session," "days since last support ticket," "days since last payment failure" — these temporal distance features consistently rank among the strongest predictors in churn models, fraud models, and engagement models. Conditional aggregation (COUNT(*) FILTER (WHERE ...) in PostgreSQL, COUNTIF() in BigQuery) lets you compute multiple feature variants in a single pass over the data: sessions in the last 7 days, 30 days, and 90 days from one GROUP BY.
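Both patterns in one pass can be sketched as follows. SQLite supports the same FILTER clause as PostgreSQL (SQLite 3.30+); the sessions table is hypothetical, and the literal date '2024-03-31' stands in for "today" so the example is deterministic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sessions (subscriber_id INTEGER, session_date TEXT);
INSERT INTO sessions VALUES
  (1, '2024-02-10'), (1, '2024-03-05'), (1, '2024-03-28'),
  (2, '2024-01-02');
""")

rows = con.execute("""
SELECT subscriber_id,
       -- two windowed counts from a single GROUP BY pass
       COUNT(*) FILTER (WHERE session_date >= date('2024-03-31', '-7 days'))  AS sessions_7d,
       COUNT(*) FILTER (WHERE session_date >= date('2024-03-31', '-30 days')) AS sessions_30d,
       -- the "days since last X" recency feature
       CAST(julianday('2024-03-31') - julianday(MAX(session_date)) AS INTEGER) AS days_since_last
FROM sessions
GROUP BY subscriber_id
ORDER BY subscriber_id
""").fetchall()
```

In PostgreSQL the date arithmetic would typically be `CURRENT_DATE - MAX(session_date)`; the FILTER clauses are identical.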

5. Anti-joins find what is missing — and what is missing often predicts. Subscribers who have never used the API. Patients who have never been readmitted. Users who added to cart but never purchased. The absence of an event is often as informative as its presence. The LEFT JOIN + IS NULL pattern (or NOT EXISTS) is how you construct these "absence features" in SQL.
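The LEFT JOIN + IS NULL pattern turned into an absence feature, with a hypothetical api_calls table standing in for the event log:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subscribers (subscriber_id INTEGER);
CREATE TABLE api_calls (subscriber_id INTEGER, called_at TEXT);
INSERT INTO subscribers VALUES (1), (2), (3);
INSERT INTO api_calls VALUES (1, '2024-03-01');
""")

rows = con.execute("""
SELECT s.subscriber_id,
       -- no matching row on the right side means the event never happened
       CASE WHEN a.subscriber_id IS NULL THEN 1 ELSE 0 END AS never_used_api
FROM subscribers s
LEFT JOIN (SELECT DISTINCT subscriber_id FROM api_calls) a USING (subscriber_id)
ORDER BY s.subscriber_id
""").fetchall()
```

The equivalent NOT EXISTS form often optimizes identically and reads more declaratively; which one the planner prefers is worth checking with EXPLAIN on your own data.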

6. EXPLAIN ANALYZE is how you understand what the database is actually doing. Read query plans bottom-up. Look for sequential scans on large tables (usually bad), hash joins between large and small tables (usually good), and large gaps between estimated and actual row counts (run ANALYZE on the table). You do not need to become a DBA. You need to recognize when a query is scanning 800 million rows to find 5,000 matches, and know that an index would fix it.
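The scan-versus-index distinction can be seen directly. PostgreSQL's EXPLAIN ANALYZE adds actual timings and row counts; the sketch below uses SQLite's lighter EXPLAIN QUERY PLAN only because it runs anywhere, and the events table is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (subscriber_id INTEGER, event_date TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(i % 1000, '2024-01-01') for i in range(10000)])

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row is the human-readable step
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT COUNT(*) FROM events WHERE subscriber_id = 42"
before = plan(q)   # full table scan: every row is read to find the matches
con.execute("CREATE INDEX idx_events_sub ON events(subscriber_id)")
after = plan(q)    # index search: the database jumps straight to the matches
```

The before/after contrast is the diagnostic habit the takeaway describes: same query, same result, very different access path.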

7. The three most impactful optimizations are indexes, avoiding SELECT *, and materialized views. A composite index on (subscriber_id, event_date) lets the database skip directly to the rows it needs. SELECT column1, column2 instead of SELECT * enables covering index scans. A materialized view runs an expensive query once and stores the result for instant reads. Together, these three techniques can turn a 10-minute query into a sub-second read.
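A sketch of the composite index and the run-once-and-store idea. SQLite has no materialized views, so CREATE TABLE AS is used here as a rough stand-in for PostgreSQL's CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW; the events schema is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (subscriber_id INTEGER, event_date TEXT, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, 'x')",
                [(i % 100, f"2024-01-{i % 28 + 1:02d}") for i in range(5000)])

# Composite index: queries filtering on (subscriber_id, event_date) can be
# answered from the index alone when only those columns are selected.
con.execute("CREATE INDEX idx_sub_date ON events(subscriber_id, event_date)")

# Materialized-view stand-in: run the expensive aggregation once, store it,
# then serve reads from the precomputed table.
con.execute("""
CREATE TABLE daily_counts AS
SELECT subscriber_id, event_date, COUNT(*) AS n_events
FROM events
GROUP BY subscriber_id, event_date
""")

row = con.execute(
    "SELECT n_events FROM daily_counts WHERE subscriber_id = 1 AND event_date = '2024-01-02'"
).fetchone()
```

The trade-off a real materialized view shares with this stand-in: reads are instant, but the stored result is stale until it is refreshed.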

8. Compute features in SQL, transform features in Python. SQL is the right tool for extracting raw features from normalized relational data: joins, aggregations, window functions. Python is the right tool for transformations that require the full feature matrix: scaling, imputation, encoding, interaction terms, model-specific preprocessing. The handoff point is a flat table — one row per observation, one column per feature — loaded via pd.read_sql().
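The handoff point as a concrete sketch: a flat feature table, one row per observation, one column per feature. The query below is exactly what you would pass to pd.read_sql(query, con); plain sqlite3 is used here only to keep the example dependency-free, and the features table is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE features (subscriber_id INTEGER, tenure_days INTEGER, sessions_30d INTEGER);
INSERT INTO features VALUES (1, 305, 2), (2, 77, 1);
""")

con.row_factory = sqlite3.Row  # name-addressable rows, like DataFrame columns
rows = [dict(r) for r in con.execute(
    "SELECT subscriber_id, tenure_days, sessions_30d FROM features ORDER BY subscriber_id"
)]
```

Everything downstream of this point, such as scaling, imputation, and encoding, belongs in Python, where the full feature matrix is available.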

9. Domain knowledge drives feature selection, even in SQL. The best SQL features are not mechanical applications of window functions to every column. They are translations of business intuition into computable queries. "Subscribers who downgrade their plan usually cancel within 90 days" becomes a recent_downgrade flag. "API users are power users" becomes an is_api_user boolean. Talk to the domain experts before you write the SQL.
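The downgrade intuition translated into a computable flag. The plan_changes schema, the 90-day window, and the fixed as-of date '2024-04-01' are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subscribers (subscriber_id INTEGER);
CREATE TABLE plan_changes (subscriber_id INTEGER, change_date TEXT, change_type TEXT);
INSERT INTO subscribers VALUES (1), (2);
INSERT INTO plan_changes VALUES
  (1, '2024-03-15', 'downgrade'),   -- inside the 90-day window
  (2, '2023-06-01', 'downgrade');   -- long ago
""")

rows = con.execute("""
SELECT s.subscriber_id,
       EXISTS (SELECT 1 FROM plan_changes p
               WHERE p.subscriber_id = s.subscriber_id
                 AND p.change_type = 'downgrade'
                 AND p.change_date >= date('2024-04-01', '-90 days')) AS recent_downgrade
FROM subscribers s
ORDER BY s.subscriber_id
""").fetchall()
```

The query is mechanical; the 90-day threshold is the domain knowledge, and it came from a conversation, not from the database.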

10. NULLs in SQL feature output are informative, not just inconvenient. A NULL month_over_month_pct_change means the subscriber is too new for trend computation. A NULL days_since_last_ticket means the subscriber has never contacted support. These NULLs carry signal. Replacing them thoughtlessly with zeros can destroy that signal. Chapter 8 covers missing data strategies in detail; for now, use COALESCE deliberately, and document what each default value assumes.
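A sketch of deliberate NULL handling: keep the raw NULL, expose it as an explicit flag, and apply COALESCE with a documented sentinel rather than a silent zero. The tickets schema and the as-of date '2024-04-01' are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subscribers (subscriber_id INTEGER);
CREATE TABLE tickets (subscriber_id INTEGER, opened_at TEXT);
INSERT INTO subscribers VALUES (1), (2);
INSERT INTO tickets VALUES (1, '2024-03-20');
""")

rows = con.execute("""
SELECT s.subscriber_id,
       CAST(julianday('2024-04-01') - julianday(MAX(t.opened_at)) AS INTEGER)
           AS days_since_last_ticket,                  -- NULL: never contacted support
       CASE WHEN MAX(t.opened_at) IS NULL THEN 1 ELSE 0 END AS never_ticketed,
       COALESCE(CAST(julianday('2024-04-01') - julianday(MAX(t.opened_at)) AS INTEGER),
                9999) AS days_since_filled             -- 9999 = documented "never" sentinel
FROM subscribers s
LEFT JOIN tickets t USING (subscriber_id)
GROUP BY s.subscriber_id
ORDER BY s.subscriber_id
""").fetchall()
```

The flag column preserves the signal even after the sentinel fill, so the model can distinguish "never" from "a very long time ago."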


The One-Sentence Summary

SQL is where feature engineering begins — window functions compute temporal trends, CTEs organize complexity, joins construct relational features, and query optimization ensures the pipeline runs in seconds, not hours — and the data scientist who masters SQL extracts features that the data scientist who only knows pandas cannot.

Quick Reference: SQL Patterns for Feature Engineering

Feature Type      | SQL Pattern                                           | Example
------------------|-------------------------------------------------------|---------------------------------
Temporal trend    | LAG/LEAD + PARTITION BY                               | Month-over-month usage change
Recency           | CURRENT_DATE - MAX(date)                              | Days since last session
Frequency         | COUNT(*) FILTER (WHERE date in range)                 | Sessions in last 30 days
Regularity        | LEAD(date) - date, then AVG                           | Average days between sessions
Existence         | LEFT JOIN + IS NULL, or EXISTS                        | Has ever used the API?
Bucketing         | NTILE(N) OVER (ORDER BY value)                        | Usage decile
Ratio             | COUNT(*) FILTER (...)::NUMERIC / NULLIF(COUNT(*), 0) | Mobile session fraction
Running aggregate | SUM/AVG OVER (ROWS BETWEEN ... AND ...)               | 7-session moving average
Trend slope       | REGR_SLOPE(y, x)                                      | Usage trend over 3 months
Event diversity   | COUNT(DISTINCT column)                                | Number of distinct features used
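Two of the table's patterns that trip people up in practice, bucketing and the safe ratio, sketched together. SQLite is used for runnability (it takes CAST(... AS REAL) where PostgreSQL would use ::NUMERIC, and it lacks REGR_SLOPE); the usage table is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE usage (subscriber_id INTEGER, sessions INTEGER, mobile_sessions INTEGER);
INSERT INTO usage VALUES (1, 40, 10), (2, 0, 0), (3, 20, 20), (4, 5, 1);
""")

rows = con.execute("""
SELECT subscriber_id,
       -- quartile bucket by session volume (1 = lowest, 4 = highest)
       NTILE(4) OVER (ORDER BY sessions) AS usage_quartile,
       -- NULLIF turns divide-by-zero into NULL for zero-session subscribers
       CAST(mobile_sessions AS REAL) / NULLIF(sessions, 0) AS mobile_fraction
FROM usage
ORDER BY subscriber_id
""").fetchall()
```

Note the zero-session subscriber gets a NULL fraction, not an error and not a misleading zero, which connects back to takeaway 10: decide deliberately what that NULL should become.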