Further Reading — Chapter 5: SQL for Data Scientists
Official Documentation
PostgreSQL Documentation — Window Functions The authoritative reference for window function syntax, frame specifications, and behavior. Section 3.5 ("Window Functions") in the tutorial is the best starting point. Section 9.22 in the function reference covers every window function with examples. If you ever wonder "does PostgreSQL support X in a window frame?", this is where you look. postgresql.org/docs/current/tutorial-window.html
PostgreSQL Documentation — EXPLAIN Section 14.1 ("Using EXPLAIN") walks through how to read query plans, what each node type means, and how to interpret cost estimates. The examples progress from simple sequential scans to complex join strategies. Read this before trying to optimize a production query. postgresql.org/docs/current/using-explain.html
BigQuery Documentation — Analytic Functions BigQuery's equivalent of the PostgreSQL window functions reference. The syntax is nearly identical for most functions, with dialect-specific differences for date handling and the FILTER clause (use COUNTIF instead). The "Navigation Functions" section covers LAG, LEAD, FIRST_VALUE, and LAST_VALUE. cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
Books
Molinaro and Graaf — SQL Cookbook, 2nd Edition (O'Reilly, 2020) A pattern-oriented SQL book organized around common tasks: window functions, hierarchical queries, pivoting, and string manipulation. Chapter 11 on advanced searching and Chapter 12 on reporting are particularly relevant for feature engineering. This is the book to keep on your desk when you know what you want to compute but not how to express it in SQL.
Markham, Kevin — Data Science with SQL (O'Reilly, in progress) An emerging reference that bridges the gap between SQL tutorials and data science workflows. Covers the SQL-for-feature-engineering patterns that most SQL books ignore: conditional aggregation for multi-window features, date spine generation, sessionization, and the SQL-to-pandas translation patterns covered in this chapter.
Kleppmann, Martin — Designing Data-Intensive Applications (O'Reilly, 2017) Not a SQL book, but the definitive reference on how databases work internally. Chapter 3 ("Storage and Retrieval") explains B-tree indexes, LSM-trees, and why some queries are fast and others are slow. Chapter 6 ("Partitioning") explains how partition pruning works. If you want to understand why the optimizations in Section 5.7 work, read Kleppmann.
Courses and Tutorials
Mode Analytics — SQL Tutorial for Data Analysis A free, browser-based SQL tutorial with a built-in query editor and real datasets. The "Advanced SQL" section covers window functions, self-joins, and subqueries with interactive exercises. The exercises use analytical datasets (not toy examples), which makes the transition to real feature engineering smoother. mode.com/sql-tutorial/
Window Functions — PostgreSQL Exercises (pgexercises.com) A free interactive site with progressively harder SQL exercises. The "Aggregates" and "Date" sections are directly relevant to this chapter. Each exercise has a built-in PostgreSQL instance, so you can practice without setting up a local database. pgexercises.com
Zachary Thomas — Select Star SQL A free interactive book that teaches SQL through a single dataset (Texas death row records). The chapter on window functions is one of the clearest introductions available. The dataset forces you to write real analytical queries — counts by year, trends over time, demographic breakdowns — that map directly to the feature engineering patterns in this chapter. selectstarsql.com
Articles and Blog Posts
Ergest Xheblati — "SQL for Data Scientists" (Towards Data Science series) A multi-part series covering the specific SQL patterns that data scientists use most: window functions for feature engineering, CTEs for pipeline organization, and query optimization for large datasets. The examples use realistic datasets and focus on the feature-extraction use case rather than general-purpose SQL.
Use The Index, Luke — A Guide to Database Performance A free online book about SQL indexing. Written for developers but essential for data scientists who write queries against production databases. The chapters on composite indexes, partial indexes, and the "leftmost prefix" rule explain why some indexes help your queries and others are ignored by the optimizer. The visual explanations of B-tree traversal are particularly good. use-the-index-luke.com
Haki Benita — "PostgreSQL: How to Write SQL Queries That Scale" A series of blog posts on PostgreSQL performance with practical, production-tested advice. The posts on EXPLAIN output interpretation, index design, and materialized view management are directly applicable to the optimization techniques in Section 5.7. Haki writes from experience with large-scale systems, not from theory. hakibenita.com
Tools
DBeaver (database IDE) A free, open-source database client that supports PostgreSQL, BigQuery, Snowflake, and dozens of other databases. The visual EXPLAIN plan viewer turns the text-based query plan output into a clickable tree diagram — much easier to read than the raw text for complex queries. Use it for interactive SQL development before committing queries to your feature pipeline. dbeaver.io
pgAdmin 4 The official PostgreSQL administration and development tool. Less polished than DBeaver but deeply integrated with PostgreSQL-specific features: visual EXPLAIN, index advisor, and server-side statistics. If you work exclusively with PostgreSQL, pgAdmin is worth installing. pgadmin.org
SQLFluff A SQL linter that enforces consistent formatting, indentation, and style. If your team shares SQL queries for feature extraction (and it should), SQLFluff ensures that everyone's CTEs, window functions, and joins follow the same conventions. Inconsistent formatting is the number one obstacle to SQL code review. sqlfluff.com
Papers and Technical References
Graefe, Goetz — "Query Evaluation Techniques for Large Databases" (ACM Computing Surveys, 1993) The foundational survey on how databases execute queries. Covers every join algorithm (nested loop, hash, merge), every scan type (sequential, index, bitmap), and the cost models that the optimizer uses to choose between them. Academic but readable. If you want to understand what is happening behind the EXPLAIN output, this is the primary source.
Hellerstein, Stonebraker, and Hamilton — "Architecture of a Database System" (Foundations and Trends in Databases, 2007) A comprehensive overview of how relational databases work internally, written for the technically curious non-specialist. The sections on query processing and storage management explain why indexes help, why sequential scans are slow, and how the buffer pool manages memory. Freely available online.
What to Read Next
If you are working through this book in order, the SQL skills from this chapter become the foundation for everything in Part II:
- Chapter 6 (Feature Engineering) takes the flat table this chapter produced and engineers additional features: transformations, interaction terms, and domain-specific signals
- Chapter 7 (Handling Categorical Data) addresses the
plan_typeandcountrycolumns that our SQL extracted but did not encode - Chapter 8 (Missing Data Strategies) tackles the NULLs in
month_over_month_pct_changeanddays_since_last_ticketthat this chapter intentionally left unresolved - Chapter 10 (Reproducible Data Pipelines) wraps the SQL extraction and Python transformation into a single, reproducible scikit-learn Pipeline
- Chapter 29 (Software Engineering for Data Science) returns to SQL in the context of production systems: version-controlled queries, testing, and CI/CD for data pipelines