Exercises — Chapter 5: SQL for Data Scientists
Exercise 5.1: Window Function Basics (Conceptual)
For each of the following feature engineering tasks, identify which window function (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, SUM, AVG) and which frame/partition specification you would use. You do not need to write the full query — describe the OVER clause.
a) For each subscriber, find their most recent session and their second-most-recent session.
b) For each subscriber, compute the percentage change in monthly usage between the current month and the previous month.
c) Assign each subscriber to a usage quartile (1-4) based on their total minutes in the last 30 days.
d) For each subscriber, compute the running total of support tickets filed since their signup date.
e) For each subscriber's sessions, compute a 5-session moving average of session duration.
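Before answering, it may help to see one frame specification in action. The sketch below (a toy table with made-up names, not the StreamFlow schema; it uses Python's built-in sqlite3, which supports window functions from SQLite 3.25 on) shows how `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` produces a 3-session moving average:

```python
import sqlite3

# Toy demo of a window frame clause. Table and column names here
# (sessions, sub_id, sess_no, minutes) are placeholders, not the
# chapter's StreamFlow schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (sub_id INT, sess_no INT, minutes REAL)")
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)",
                 [(1, i, m) for i, m in enumerate([10, 20, 30, 40, 50], 1)])

# A 3-session moving average per subscriber.
rows = conn.execute("""
    SELECT sess_no,
           AVG(minutes) OVER (
               PARTITION BY sub_id
               ORDER BY sess_no
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma3
    FROM sessions
""").fetchall()
# Early rows average over a partially filled frame:
print(rows)  # [(1, 10.0), (2, 15.0), (3, 20.0), (4, 30.0), (5, 40.0)]
```

Note how the first two rows average over fewer than three sessions — the frame simply clips at the partition boundary.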
Exercise 5.2: LAG for Trend Detection (Code)
Write a SQL query that computes the following features for each active StreamFlow subscriber, using the usage_events table:
- Total minutes in the most recent complete month
- Total minutes in the month before that
- Total minutes two months before that
- A column usage_direction that is 'declining' if each month is lower than the previous, 'growing' if each month is higher than the previous, 'stable' if the change between any two consecutive months is less than 10%, and 'mixed' otherwise
Use CTEs for clarity. Test your logic: a subscriber with usage of 300, 250, 180 minutes over three months should be classified as 'declining'.
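One way to check that test case quickly is an in-memory sketch. The snippet below (Python's sqlite3; toy table `monthly` and a simplified two-label classifier, both assumptions — the full exercise needs all four labels and the 10% stability band) verifies that 300, 250, 180 comes out 'declining':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (sub_id INT, month TEXT, minutes REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?, ?)",
                 [(1, '2024-10', 300), (1, '2024-11', 250), (1, '2024-12', 180)])

# LAG pulls the previous month's minutes onto each row; if every
# month-over-month difference is negative, the trend is 'declining'.
# (Simplified: only 'declining' vs 'other' — the exercise needs four labels.)
rows = conn.execute("""
    WITH with_prev AS (
        SELECT sub_id, month, minutes,
               LAG(minutes) OVER (PARTITION BY sub_id ORDER BY month) AS prev
        FROM monthly
    )
    SELECT sub_id,
           CASE WHEN MAX(minutes - prev) < 0
                THEN 'declining' ELSE 'other' END AS direction
    FROM with_prev
    WHERE prev IS NOT NULL
    GROUP BY sub_id
""").fetchall()
print(rows)  # [(1, 'declining')]
```

The differences here are -50 and -70, so the maximum difference is still negative and the subscriber is classified 'declining'.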
Exercise 5.3: Anti-Join Feature Construction (Code)
StreamFlow wants to identify subscribers who might be at risk because they have never used certain features. Write a SQL query that produces a table with one row per active subscriber and the following boolean columns:
- has_ever_used_reports — TRUE if the subscriber has at least one session with feature_used = 'reports'
- has_ever_used_api — TRUE if the subscriber has at least one session with feature_used = 'api'
- has_ever_used_export — TRUE if the subscriber has at least one session with feature_used = 'export'
- has_ever_used_mobile — TRUE if the subscriber has at least one session with device_type = 'mobile'
Use the anti-join pattern (LEFT JOIN + IS NULL or NOT EXISTS) for at least one of these features. Then compute a summary: what percentage of active subscribers have never used any of the four features above?
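The two anti-join forms mentioned above can be compared on toy data (a sketch using Python's sqlite3; the `subs`/`events` tables and their columns are placeholders, not StreamFlow's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subs (sub_id INT PRIMARY KEY);
    CREATE TABLE events (sub_id INT, feature TEXT);
    INSERT INTO subs VALUES (1), (2), (3);
    INSERT INTO events VALUES (1, 'reports'), (1, 'api'), (2, 'reports');
""")

# Form 1: LEFT JOIN + IS NULL. The join condition restricts to 'api'
# events; rows where the join found nothing come back NULL.
no_api = conn.execute("""
    SELECT s.sub_id
    FROM subs s
    LEFT JOIN events e
      ON e.sub_id = s.sub_id AND e.feature = 'api'
    WHERE e.sub_id IS NULL
    ORDER BY s.sub_id
""").fetchall()

# Form 2: NOT EXISTS — same result, often easier to read.
no_api2 = conn.execute("""
    SELECT s.sub_id
    FROM subs s
    WHERE NOT EXISTS (
        SELECT 1 FROM events e
        WHERE e.sub_id = s.sub_id AND e.feature = 'api')
    ORDER BY s.sub_id
""").fetchall()
print(no_api, no_api2)  # [(2,), (3,)] [(2,), (3,)]
```

Only subscriber 1 ever used 'api', so both forms return subscribers 2 and 3.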
Exercise 5.4: CTE Refactoring (Code)
The following query works correctly but is nearly impossible to read. Refactor it into a CTE-based query with at least 3 named CTEs. Preserve the exact same output.
SELECT s.subscriber_id, s.plan_type,
(SELECT COUNT(*) FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
AND ue.event_date >= CURRENT_DATE - INTERVAL '30 days') AS sessions_30d,
(SELECT COALESCE(SUM(session_minutes), 0) FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
AND ue.event_date >= CURRENT_DATE - INTERVAL '30 days') AS minutes_30d,
(SELECT COUNT(*) FROM support_tickets st
WHERE st.subscriber_id = s.subscriber_id
AND st.created_date >= CURRENT_DATE - INTERVAL '90 days') AS tickets_90d,
(SELECT COUNT(*) FROM plan_changes pc
WHERE pc.subscriber_id = s.subscriber_id
AND pc.change_type = 'downgrade') AS total_downgrades,
CASE WHEN s.cancel_date IS NOT NULL THEN 1 ELSE 0 END AS churned
FROM subscribers s
WHERE s.cancel_date IS NULL
OR s.cancel_date >= CURRENT_DATE - INTERVAL '60 days';
After refactoring, explain: a) Why is the CTE version likely faster than the correlated-subquery version? b) How would you debug this query step by step?
Exercise 5.5: Conditional Aggregation Patterns (Code)
Write a single SQL query (using CTEs if needed) that computes the following features for each subscriber from the support_tickets table:
- total_tickets — total tickets ever filed
- tickets_last_30d — tickets filed in the last 30 days
- tickets_last_90d — tickets filed in the last 90 days
- billing_tickets_90d — tickets with category = 'billing' in the last 90 days
- technical_tickets_90d — tickets with category = 'technical' in the last 90 days
- complaint_ratio — fraction of all tickets that are complaints
- has_unresolved_ticket — TRUE if any ticket has resolved_date IS NULL
- avg_resolution_days — average days between created_date and resolved_date for resolved tickets
Use the PostgreSQL FILTER clause where appropriate. Handle division by zero for the complaint_ratio column.
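The general shape of conditional aggregation plus a NULLIF division-by-zero guard can be sketched on toy data. (This demo uses Python's sqlite3 with a made-up two-column `tickets` table; it uses the portable `SUM(CASE ...)` form because SQLite's FILTER support depends on version — in PostgreSQL you would write `COUNT(*) FILTER (WHERE ...)` instead, as the exercise asks.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (sub_id INT, category TEXT)")
conn.executemany("INSERT INTO tickets VALUES (?, ?)",
                 [(1, 'complaint'), (1, 'billing'), (2, 'technical')])

# SUM(CASE ...) counts only matching rows; NULLIF turns a zero
# denominator into NULL so the division yields NULL, not an error.
rows = conn.execute("""
    SELECT sub_id,
           COUNT(*) AS total,
           SUM(CASE WHEN category = 'complaint' THEN 1 ELSE 0 END) AS complaints,
           1.0 * SUM(CASE WHEN category = 'complaint' THEN 1 ELSE 0 END)
               / NULLIF(COUNT(*), 0) AS complaint_ratio
    FROM tickets
    GROUP BY sub_id
    ORDER BY sub_id
""").fetchall()
print(rows)  # [(1, 2, 1, 0.5), (2, 1, 0, 0.0)]
```

The `1.0 *` forces floating-point division; without it, integer division would silently truncate the ratio to 0.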
Exercise 5.6: EXPLAIN Plan Interpretation (Conceptual + Code)
Consider the following query on the StreamFlow usage_events table (assume no indexes exist beyond the primary key):
SELECT
subscriber_id,
DATE_TRUNC('month', event_date) AS usage_month,
SUM(session_minutes) AS total_minutes
FROM usage_events
WHERE event_date >= '2024-10-01'
AND event_date < '2025-01-01'
AND subscriber_id IN (
SELECT subscriber_id FROM subscribers WHERE plan_type = 'premium'
)
GROUP BY subscriber_id, DATE_TRUNC('month', event_date);
a) Without running the query, predict the query plan. What scan types will the database use? What join strategy will it choose for the subquery?
b) Write the CREATE INDEX statements that would most improve this query's performance. Explain why you chose each index.
c) Rewrite the subquery as a JOIN. Will this change the execution plan? Why or why not?
d) If usage_events has 500 million rows and subscribers has 2.4 million rows, estimate whether this query will complete in seconds, minutes, or hours — with and without the indexes you proposed.
Exercise 5.7: Self-Join for Session Gap Analysis (Code)
Write a SQL query that computes the following for each active subscriber:
- avg_days_between_sessions — the average number of days between consecutive sessions (using distinct dates only)
- max_days_between_sessions — the maximum gap between consecutive sessions
- current_gap_days — the number of days between the most recent session and today
- gap_increasing — TRUE if the most recent gap is larger than the average gap
Use the LEAD or LAG function to compute inter-session gaps. Include only subscribers with at least 3 distinct session dates in the last 180 days.
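The core LAG-based gap computation can be sanity-checked on toy data (a sketch in Python's sqlite3; table and column names are placeholders, and `julianday` is SQLite's date-difference idiom — PostgreSQL would subtract dates directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (sub_id INT, sess_date TEXT)")
conn.executemany("INSERT INTO sessions VALUES (?, ?)",
                 [(1, '2024-01-01'), (1, '2024-01-04'), (1, '2024-01-11')])

# LAG pairs each session date with the previous one; the difference
# in julianday values is the gap in days. The first session per
# subscriber has no predecessor, so its gap is NULL and is filtered out.
rows = conn.execute("""
    WITH gaps AS (
        SELECT sub_id,
               julianday(sess_date)
               - julianday(LAG(sess_date) OVER (
                     PARTITION BY sub_id ORDER BY sess_date)) AS gap_days
        FROM (SELECT DISTINCT sub_id, sess_date FROM sessions)
    )
    SELECT sub_id, AVG(gap_days) AS avg_gap, MAX(gap_days) AS max_gap
    FROM gaps
    WHERE gap_days IS NOT NULL
    GROUP BY sub_id
""").fetchall()
print(rows)  # gaps of 3 and 7 days -> [(1, 5.0, 7.0)]
```

The DISTINCT in the inner query matters: two sessions on the same day would otherwise produce a spurious zero-day gap.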
Exercise 5.8: Recursive CTE — Date Spine (Code)
A common problem: the usage_events table has no rows for days when a subscriber did not use the product. This means a simple GROUP BY event_date will show only active days, hiding the zeros.
a) Write a recursive CTE that generates every date from 90 days ago to today.
b) CROSS JOIN this date spine with a list of active subscribers to create a "subscriber-day" grid.
c) LEFT JOIN the usage events onto this grid and compute daily usage (zero for days with no events).
d) Using the result from (c), compute a 7-day moving average of daily minutes for each subscriber. Verify that zero-usage days correctly pull down the moving average.
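The recursive-CTE mechanics for part (a) look like this in miniature (a sketch in Python's sqlite3, generating a 7-day spine over fixed dates rather than the exercise's 90 days back from today):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Anchor row starts the spine; the recursive member adds one day at a
# time until the end date is reached.
rows = conn.execute("""
    WITH RECURSIVE spine(d) AS (
        SELECT date('2024-01-01')
        UNION ALL
        SELECT date(d, '+1 day') FROM spine
        WHERE d < date('2024-01-07')
    )
    SELECT d FROM spine
""").fetchall()
print(len(rows), rows[0][0], rows[-1][0])  # 7 2024-01-01 2024-01-07
```

Both endpoints are included, so a 7-day range yields 7 rows. In PostgreSQL the non-recursive alternative is `generate_series`, but the recursive CTE is the portable form the exercise asks for.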
Exercise 5.9: The ShopSmart Conversion Funnel (Code)
ShopSmart tracks user behavior through a page_events table:
CREATE TABLE page_events (
event_id BIGINT PRIMARY KEY,
user_id BIGINT,
event_type VARCHAR(30), -- 'page_view', 'add_to_cart', 'checkout_start', 'purchase'
event_time TIMESTAMP,
page_url VARCHAR(255),
session_id VARCHAR(50)
);
Write a SQL query that computes, for each user who visited the site in the last 30 days:
- total_sessions — number of distinct sessions
- pages_viewed — total page view events
- added_to_cart — number of add-to-cart events
- started_checkout — number of checkout-start events
- completed_purchase — number of purchase events
- cart_to_purchase_rate — fraction of add-to-cart events that resulted in a purchase (within the same session)
- time_to_purchase_minutes — for sessions with a purchase, average minutes from first page view to purchase
Use window functions or CTEs as needed. Think carefully about the cart_to_purchase_rate calculation — does it make sense to compute it per-session or per-user?
Exercise 5.10: Query Optimization Challenge (Code + Analysis)
The following query extracts features but runs slowly on StreamFlow's data (2.4M subscribers, 800M usage events):
SELECT
s.subscriber_id,
s.plan_type,
(SELECT AVG(session_minutes)
FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
AND ue.event_date >= CURRENT_DATE - INTERVAL '30 days') AS avg_minutes_30d,
(SELECT COUNT(DISTINCT event_date)
FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id
AND ue.event_date >= CURRENT_DATE - INTERVAL '30 days') AS active_days_30d,
(SELECT MAX(event_date)
FROM usage_events ue
WHERE ue.subscriber_id = s.subscriber_id) AS last_session_date
FROM subscribers s
WHERE s.cancel_date IS NULL;
a) Explain why this query is slow. How many times does it scan the usage_events table?
b) Rewrite the query using a single scan of usage_events (hint: use a CTE or subquery with GROUP BY, then JOIN).
c) Write the optimal index for the rewritten query.
d) If you needed to run this query daily for model retraining, what storage strategy would you recommend? (Hint: think materialized views or staging tables.)
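The rewrite pattern for part (b) — aggregate once, then join back — can be sketched on toy data (Python's sqlite3; the tiny `subs`/`events` tables stand in for the real 2.4M/800M-row tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subs (sub_id INT PRIMARY KEY);
    CREATE TABLE events (sub_id INT, minutes REAL);
    INSERT INTO subs VALUES (1), (2);
    INSERT INTO events VALUES (1, 30), (1, 50), (2, 20);
""")

# One scan of events in a GROUP BY CTE, then a LEFT JOIN back to
# subscribers — instead of one correlated subquery per feature column.
rows = conn.execute("""
    WITH agg AS (
        SELECT sub_id, AVG(minutes) AS avg_min, COUNT(*) AS n_sessions
        FROM events
        GROUP BY sub_id
    )
    SELECT s.sub_id,
           COALESCE(a.avg_min, 0) AS avg_min,
           COALESCE(a.n_sessions, 0) AS n_sessions
    FROM subs s
    LEFT JOIN agg a ON a.sub_id = s.sub_id
    ORDER BY s.sub_id
""").fetchall()
print(rows)  # [(1, 40.0, 2), (2, 20.0, 1)]
```

The LEFT JOIN plus COALESCE preserves subscribers with no events at all — the correlated subqueries handled that implicitly, and the rewrite must not lose it.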
Exercise 5.11: SQL-to-Pandas Translation (Conceptual)
For each of the following SQL snippets, write the equivalent pandas code. Then explain which version you would prefer for (i) a dataset with 10,000 rows and (ii) a dataset with 100 million rows.
a)
SELECT subscriber_id,
SUM(session_minutes) OVER (
PARTITION BY subscriber_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_minutes
FROM usage_events;
b)
SELECT subscriber_id,
event_date,
session_minutes,
session_minutes - LAG(session_minutes, 1) OVER (
PARTITION BY subscriber_id ORDER BY event_date
) AS session_diff
FROM usage_events;
c)
SELECT subscriber_id,
COUNT(*) FILTER (WHERE feature_used = 'api') AS api_count,
COUNT(*) FILTER (WHERE feature_used = 'dashboard') AS dashboard_count
FROM usage_events
GROUP BY subscriber_id;
Exercise 5.12: Materialized View Design (Conceptual + Code)
You need to build a materialized view for StreamFlow's churn model that refreshes every night at 2 AM.
a) Write the CREATE MATERIALIZED VIEW statement that combines at least 4 of the CTE-based feature queries from this chapter.
b) Create an appropriate index on the materialized view.
c) Write the REFRESH MATERIALIZED VIEW CONCURRENTLY command and explain why CONCURRENTLY matters in production.
d) A colleague asks: "Why not just run the query every time the model needs features?" Explain the tradeoff between freshness and performance, and under what circumstances a materialized view is the wrong choice.
Exercise 5.13: The BigQuery Translation (Code)
Take the following PostgreSQL query and rewrite it for BigQuery. Identify at least 4 syntax differences.
WITH ticket_counts AS (
SELECT
subscriber_id,
COUNT(*) FILTER (WHERE created_date >= CURRENT_DATE - INTERVAL '30 days')
AS recent_tickets,
COUNT(*) FILTER (WHERE category = 'complaint') AS complaints,
CURRENT_DATE - MAX(created_date) AS days_since_last
FROM support_tickets
GROUP BY subscriber_id
)
SELECT
s.subscriber_id,
s.plan_type,
COALESCE(tc.recent_tickets, 0) AS recent_tickets,
COALESCE(tc.complaints, 0) AS complaints,
tc.days_since_last
FROM subscribers s
LEFT JOIN ticket_counts tc ON s.subscriber_id = tc.subscriber_id
WHERE s.cancel_date IS NULL;
Exercise 5.14: End-to-End Feature Extraction (Project)
Combine what you have learned in this chapter to write a complete feature extraction pipeline for StreamFlow. Your query should:
- Produce one row per active subscriber (plus recently canceled subscribers for training data)
- Include at least 20 features across these categories:
  - Tenure and demographic (at least 3 features)
  - Usage behavior (at least 5 features, including at least one computed with LAG)
  - Support interaction (at least 3 features)
  - Plan history (at least 3 features)
  - Billing health (at least 2 features)
  - Derived flags (at least 2 domain-informed binary features)
- Use CTEs for structure (at least 5)
- Include the target variable
- Handle nulls with COALESCE where appropriate
Run EXPLAIN ANALYZE on your query and identify the slowest step. Propose an index that would improve it.
Save this query as src/features/extract_features.sql in your project repository. This is Milestone M1 of the progressive project.