Quiz — Chapter 5: SQL for Data Scientists

DataField.Dev

Quiz — Chapter 5: SQL for Data Scientists

Test your understanding of window functions, CTEs, joins, and query optimization. Answers are in Appendix B.

Question 1. What is the key difference between a window function and a GROUP BY aggregation?

(a) Window functions are faster than GROUP BY
(b) Window functions do not collapse rows — each row retains its identity while gaining access to aggregate values across the partition
(c) GROUP BY cannot use ORDER BY
(d) Window functions can only be used with numeric columns

Question 2. You need to compute each subscriber's monthly usage and their previous month's usage in the same row, so you can calculate month-over-month change. Which window function do you use?

(a) ROW_NUMBER
(b) NTILE
(c) LAG
(d) FIRST_VALUE

Question 3. Consider this query:

SELECT subscriber_id, event_date,
       ROW_NUMBER() OVER (PARTITION BY subscriber_id ORDER BY event_date DESC) AS rn
FROM usage_events;

For a subscriber with 50 usage events, which row gets rn = 1?

(a) The earliest event (oldest date)
(b) The most recent event (newest date)
(c) A random event (ROW_NUMBER breaks ties arbitrarily)
(d) The event with the longest session

Question 4. What is the pandas equivalent of LAG(session_minutes, 1) OVER (PARTITION BY subscriber_id ORDER BY event_date)?

(a) df.groupby('subscriber_id')['session_minutes'].diff(1)
(b) df.groupby('subscriber_id')['session_minutes'].shift(1)
(c) df.groupby('subscriber_id')['session_minutes'].rolling(1).mean()
(d) df.groupby('subscriber_id')['session_minutes'].rank()

Question 5. What does the following frame specification do?

AVG(session_minutes) OVER (
    PARTITION BY subscriber_id
    ORDER BY event_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

(a) Computes the average of all sessions for the subscriber
(b) Computes the average of the current row and the 6 rows before it (a 7-row moving average)
(c) Computes the average of the 6 rows before the current row (excluding current)
(d) Computes the average of the next 6 rows

Question 6. A CTE (Common Table Expression) is best described as:

(a) A permanent table created in the database
(b) A named temporary result set that exists for the duration of a single query
(c) A stored procedure that returns a table
(d) An index that speeds up subqueries

When would you choose a CTE over a subquery? Give one reason.

Question 7. You want to find subscribers who have NEVER filed a support ticket. Which pattern achieves this?

(a) INNER JOIN support_tickets st ON s.subscriber_id = st.subscriber_id WHERE st.ticket_id IS NULL
(b) LEFT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id WHERE st.subscriber_id IS NULL
(c) RIGHT JOIN support_tickets st ON s.subscriber_id = st.subscriber_id WHERE s.subscriber_id IS NULL
(d) CROSS JOIN support_tickets st WHERE st.subscriber_id != s.subscriber_id

Question 8. You run EXPLAIN ANALYZE on your feature extraction query and see this output for one step:

Seq Scan on usage_events  (cost=0.00..1845672.00 rows=800000000 width=48)
                           (actual time=0.023..234567.89 rows=800000000 loops=1)
  Filter: (event_date >= '2024-10-01'::date)
  Rows Removed by Filter: 650000000

What does this tell you, and what is the most impactful optimization?

(a) The query is already optimized; Seq Scan is the fastest option for large tables
(b) The database is scanning all 800M rows and filtering out 650M; an index on event_date would let it skip those rows entirely
(c) The Rows Removed by Filter count is a bug in the explain output
(d) You should add more WHERE conditions to reduce the row count

Question 9. What is the difference between ROWS BETWEEN and RANGE BETWEEN in a window frame specification?

(a) They are identical
(b) ROWS counts physical rows; RANGE considers the logical value of the ORDER BY column (so duplicate values are included in the same frame)
(c) RANGE is faster than ROWS
(d) ROWS can only be used with numeric ORDER BY columns

Why does this distinction matter when computing a 7-day moving average?

Question 10. The PostgreSQL FILTER clause in COUNT(*) FILTER (WHERE category = 'complaint') is equivalent to which standard SQL expression?

(a) COUNT(category = 'complaint')
(b) COUNT(CASE WHEN category = 'complaint' THEN 1 END)
(c) SUM(category = 'complaint')
(d) COUNT(DISTINCT category) WHERE category = 'complaint'

Question 11. You are deciding whether to compute a feature in SQL or pandas. The feature is: "For each subscriber, compute the 90-day rolling average of daily session minutes, including zero-usage days." The dataset has 2.4M subscribers and 3 years of daily data.

Which approach is likely better and why?

(a) pandas — because rolling averages are easier to write with .rolling()
(b) SQL — because the database can handle the join with a date spine and the window computation without loading all data into memory
(c) Either one — the performance is the same
(d) Neither — you should use Spark for this

Question 12. What is a materialized view, and when should you use one for feature extraction?

(a) A view that is recomputed every time it is queried; use it when data changes frequently
(b) A view whose results are stored physically on disk and refreshed on demand; use it when the same expensive feature query is run repeatedly
(c) A temporary table that is dropped when the session ends
(d) A view that can only be read, never updated

What is the trade-off of using a materialized view instead of running the query live?

Question 13. Consider this query plan node:

Hash Left Join  (cost=1245.67..34521.89 rows=2400000 width=30)
  Hash Cond: (s.subscriber_id = ts.subscriber_id)

What does "Hash Left Join" mean the database is doing?

(a) Building a hash table from the right input (ts), then probing it with each row from the left input (s), returning all left rows even if there is no match
(b) Sorting both inputs by subscriber_id and merging them
(c) For each row in s, scanning all rows in ts to find matches
(d) Building a hash table from s and discarding non-matching rows from ts

Question 14. You create a composite index: CREATE INDEX idx_usage ON usage_events (subscriber_id, event_date). Which of the following queries can use this index efficiently?

(a) WHERE subscriber_id = 100042 AND event_date >= '2024-10-01'
(b) WHERE event_date >= '2024-10-01' (no subscriber_id filter)
(c) WHERE subscriber_id = 100042
(d) Both (a) and (c), but not (b)

Explain the "leftmost prefix" rule for composite indexes.

Question 15. You are writing a feature extraction query with 7 CTEs, each joining back to the subscribers table. A colleague suggests replacing the CTEs with 7 separate queries, loading each result into pandas, and merging them with pd.merge(). Under what circumstances is the colleague's approach better than the single SQL query?

(a) When the database server is resource-constrained and you want to spread the load over time
(b) When you need to debug individual feature groups independently
(c) When some features require Python-specific transformations that cannot be expressed in SQL
(d) All of the above

What is the primary disadvantage of the split approach?