Chapter 2 Exercises: Data Sources and Collection

This collection of exercises progresses from fundamental data retrieval tasks to advanced pipeline design challenges. Difficulty levels are indicated as follows:

  • ★ - Beginner: Basic API calls and data manipulation
  • ★★ - Intermediate: Multi-step processes and data integration
  • ★★★ - Advanced: Complex analysis and system design
  • ★★★★ - Expert: Production-quality implementations

Section A: NBA API Fundamentals

Exercise 2.1 ★

Player Game Log Retrieval

Using the nba_api library, write a function that retrieves the complete game log for Stephen Curry during the 2023-24 regular season. Your function should:

a) Return a pandas DataFrame with all game statistics
b) Add a column calculating points per minute (PPM)
c) Add a column identifying home vs. away games based on the MATCHUP field

Expected output: a DataFrame with 74+ rows, including the PPM and IS_HOME columns
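
A minimal sketch of one possible approach, assuming the current nba_api interface (endpoint parameters and column names can shift between library versions):

import pandas as pd
from nba_api.stats.endpoints import playergamelog
from nba_api.stats.static import players

def curry_game_log(season: str = "2023-24") -> pd.DataFrame:
    # Resolve the player ID from the bundled static index
    curry_id = players.find_players_by_full_name("Stephen Curry")[0]["id"]
    df = playergamelog.PlayerGameLog(
        player_id=curry_id,
        season=season,
        season_type_all_star="Regular Season",
    ).get_data_frames()[0]
    df["PPM"] = df["PTS"] / df["MIN"]                 # points per minute; guard against zero minutes in a full solution
    df["IS_HOME"] = ~df["MATCHUP"].str.contains("@")  # away games contain "@" in MATCHUP
    return df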


Exercise 2.2 ★

Team Season Statistics

Write a function that fetches season-level statistics for all 30 NBA teams for a specified season using leaguedashteamstats. The function should:

a) Accept a season parameter in "YYYY-YY" format
b) Return both regular season and playoff statistics as separate DataFrames
c) Calculate and add offensive and defensive rating columns if not present


Exercise 2.3 ★

Static Data Exploration

Using the static modules in nba_api.stats.static:

a) Find all players with the last name "Williams" who have played in the NBA
b) Determine how many are currently active
c) Retrieve the full roster of any team that has had at least 3 Williams players historically
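
A possible starting point for (a) and (b), assuming nba_api's static helpers (the player dicts carry an is_active flag):

from nba_api.stats.static import players

williams = players.find_players_by_last_name("Williams")
active = [p for p in williams if p["is_active"]]
print(f"{len(williams)} Williamses all-time, {len(active)} currently active")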


Exercise 2.4 ★★

Rate-Limited Batch Collection

Implement a class RateLimitedCollector that:

a) Enforces a configurable maximum request rate (default: 1 request per 2 seconds)
b) Tracks the number of requests made in the current session
c) Provides a method to fetch game logs for a list of player IDs
d) Implements exponential backoff when receiving HTTP 429 errors

class RateLimitedCollector:
    def __init__(self, max_requests_per_second: float = 0.5):
        # Your implementation here
        pass

    def fetch_player_game_logs(self, player_ids: list, season: str) -> dict:
        # Returns {player_id: DataFrame}
        pass
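
Two building blocks you may find useful, sketched here rather than prescribed; the class design and retry policy are up to you:

import random
import time

def wait_for_slot(last_request: float, min_interval: float) -> float:
    """Block until at least min_interval seconds have passed since last_request."""
    elapsed = time.monotonic() - last_request
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    return time.monotonic()

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter for repeated HTTP 429 responses."""
    return min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)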

Exercise 2.5 ★★

Shot Chart Data Analysis

Using the shotchartdetail endpoint:

a) Retrieve all shot attempts for a player of your choice in the 2023-24 season
b) Calculate the following zone-based statistics:
   - Restricted area FG%
   - Mid-range FG% (shots between paint and 3-point line)
   - Three-point FG% (above the break vs. corner 3s separately)
c) Visualize the shot distribution using matplotlib (optional)
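
A sketch of the retrieval and zone aggregation, assuming the current shotchartdetail signature (team_id=0 means any team; context_measure_simple="FGA" returns all attempts rather than just makes):

from nba_api.stats.endpoints import shotchartdetail

shots = shotchartdetail.ShotChartDetail(
    team_id=0,
    player_id=201939,                       # Stephen Curry, as an example
    season_nullable="2023-24",
    season_type_all_star="Regular Season",
    context_measure_simple="FGA",
).get_data_frames()[0]

# SHOT_ZONE_BASIC distinguishes "Restricted Area", "Mid-Range",
# "Above the Break 3", "Left Corner 3", and "Right Corner 3"
zone_fg_pct = shots.groupby("SHOT_ZONE_BASIC")["SHOT_MADE_FLAG"].mean()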


Exercise 2.6 ★★

Comparing Endpoint Outputs

The NBA API provides multiple endpoints for similar data. Compare the outputs of:

  • playergamelog for a specific player
  • boxscoretraditionalv2 for games that player appeared in

a) Identify any discrepancies in statistical values
b) Document which fields are available in one endpoint but not the other
c) Recommend which endpoint to use for different analytical purposes


Exercise 2.7 ★★★

Custom Headers and Session Management

Create a robust API session manager that:

a) Rotates through a pool of user-agent strings
b) Implements proxy support for distributed requests
c) Maintains cookies across requests for session persistence
d) Handles SSL certificate verification appropriately
e) Logs all requests with timestamps for debugging


Section B: Web Scraping and Basketball-Reference

Exercise 2.8 ★

Basic Table Extraction

Write a function that extracts the career statistics table from any Basketball-Reference player page. Test it with at least three different players from different eras:

a) A current player
b) A retired player from the 2000s
c) A Hall of Fame player from before 1990

Document any differences in available statistics across eras.


Exercise 2.9 ★★

Handling Complex HTML Structures

Basketball-Reference uses comment-wrapped tables for some statistics. Write a function that:

a) Detects whether a table is wrapped in HTML comments
b) Extracts the table from comments if necessary
c) Handles tables with multi-level headers (e.g., shooting splits)
d) Returns a clean, properly-indexed DataFrame

Test with the "Shooting" table on a player's page, which contains nested column headers.
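
One workable pattern, sketched under the assumption that BeautifulSoup and pandas are available (the table_id values, such as "shooting", come from the page's HTML):

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

def extract_table(url: str, table_id: str) -> pd.DataFrame:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    table = soup.find("table", id=table_id)        # try the live DOM first
    if table is None:
        # Some Basketball-Reference tables ship inside HTML comments;
        # re-parse each comment and search again
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
            if table is not None:
                break
    if table is None:
        raise ValueError(f"table {table_id!r} not found on {url}")
    # pd.read_html keeps nested column headers as a MultiIndex
    return pd.read_html(StringIO(str(table)))[0]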


Exercise 2.10 ★★

Responsible Scraping Implementation

Implement a ResponsibleScraper class that:

a) Checks robots.txt before making any request
b) Implements polite crawling with configurable delays (minimum 3 seconds)
c) Caches responses locally to avoid redundant requests
d) Respects HTTP 429 (Too Many Requests) responses
e) Logs all scraping activity for audit purposes
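
A minimal sketch for requirement (a), using only the standard library; the delay, caching, and logging layers are left to you:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsChecker:
    def __init__(self, user_agent: str = "*"):
        self.user_agent = user_agent
        self._parsers = {}                  # one cached parser per site root

    def can_fetch(self, url: str) -> bool:
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if root not in self._parsers:
            rp = RobotFileParser(root + "/robots.txt")
            rp.read()
            self._parsers[root] = rp
        return self._parsers[root].can_fetch(self.user_agent, url)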


Exercise 2.11 ★★★

Historical Data Compilation

Using Basketball-Reference, compile a dataset of all NBA MVP award winners including:

a) Player name and season
b) Team(s) played for that season
c) Key statistics: PPG, RPG, APG, PER, WS, VORP
d) Team regular season record

Handle edge cases such as:

  • Players traded mid-season
  • Shortened seasons (1998-99, 2011-12, 2019-20)
  • Missing advanced statistics for older seasons


Exercise 2.12 ★★★

Cross-Source Validation

For the 2022-23 season:

a) Collect team standings from both the NBA API and Basketball-Reference
b) Compare win-loss records, point differentials, and conference rankings
c) Identify and document any discrepancies
d) Hypothesize reasons for any differences found


Section C: Play-by-Play Data

Exercise 2.13 ★

PBP Event Classification

Write a function that takes a play-by-play DataFrame and returns a summary of event type frequencies:

a) Count of each EVENTMSGTYPE
b) For shot events, breakdown by EVENTMSGACTIONTYPE (shot type)
c) Distribution of events by quarter
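
A sketch of part (a), using the community-documented (and unofficial) EVENTMSGTYPE codes; verify the mapping against a few games before trusting it:

import pandas as pd

# Community-documented, unofficial mapping of stats.nba.com event codes
EVENT_TYPES = {
    1: "Made shot", 2: "Missed shot", 3: "Free throw", 4: "Rebound",
    5: "Turnover", 6: "Foul", 7: "Violation", 8: "Substitution",
    9: "Timeout", 10: "Jump ball", 12: "Period start", 13: "Period end",
}

def event_summary(pbp: pd.DataFrame) -> pd.DataFrame:
    counts = pbp["EVENTMSGTYPE"].value_counts().rename("count").to_frame()
    counts["event"] = counts.index.map(EVENT_TYPES)   # unmapped codes become NaN
    return counts

Part (c) is a one-liner once (a) works: pbp.groupby("PERIOD")["EVENTMSGTYPE"].value_counts().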


Exercise 2.14 ★★

Possession Identification

Implement an algorithm to identify distinct possessions from play-by-play data:

a) Define the start and end conditions for a possession
b) Handle edge cases: end-of-quarter, jump balls, and-one situations
c) Validate your possession count against the official pace statistic
d) Calculate possession-level statistics (points, duration, shot attempts)

Your implementation should pass this validation:

def validate_possessions(pbp_possessions: int,
                         official_pace: float,
                         game_minutes: float = 48.0) -> bool:
    """
    Validate possession count against official pace.

    Pace = possessions per 48 minutes, so:
    expected_possessions = pace * (actual_minutes / 48) * 2 (both teams)
    """
    expected = official_pace * (game_minutes / 48.0) * 2
    tolerance = 0.05  # 5% tolerance
    return abs(pbp_possessions - expected) / expected < tolerance
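
As a sanity check on the validator: a game at an official pace of 99.5 over 48 minutes implies about 199 total possessions, so a play-by-play count of 195 falls within the 5% tolerance:

validate_possessions(195, official_pace=99.5)   # True: |195 - 199| / 199 ≈ 2%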

Exercise 2.15 ★★

Lineup Detection

Using play-by-play substitution events:

a) Track which five players are on the court at any moment
b) Identify all unique 5-player lineups used by each team in a game
c) Calculate minutes played by each lineup
d) Handle edge cases where substitution events may be missing or out of order
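
For part (a): in stats.nba.com play-by-play, substitution rows (EVENTMSGTYPE 8) conventionally carry the outgoing player in PLAYER1_ID and the incoming player in PLAYER2_ID. A sketch that assumes this convention (verify it against a box score), with starters seeded from the box score:

import pandas as pd

def lineup_snapshots(team_pbp: pd.DataFrame, starters: set) -> list:
    """Return (event_index, frozen five-player set) after each substitution."""
    on_court = set(starters)
    snapshots = [(-1, frozenset(on_court))]
    for idx, row in team_pbp[team_pbp["EVENTMSGTYPE"] == 8].iterrows():
        on_court.discard(row["PLAYER1_ID"])   # player going out
        on_court.add(row["PLAYER2_ID"])       # player coming in
        snapshots.append((idx, frozenset(on_court)))
    return snapshots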


Exercise 2.16 ★★★

Play-by-Play to Box Score Reconciliation

Write a comprehensive validation system that:

a) Aggregates play-by-play events to calculate box score statistics
b) Compares calculated values against official box scores
c) Identifies specific discrepancies (which player, which statistic)
d) Generates a report documenting data quality issues
e) Proposes corrections for systematic errors


Exercise 2.17 ★★★

Game Flow Reconstruction

From play-by-play data, reconstruct and visualize:

a) Score margin over time (by second)
b) Lead changes and ties
c) Largest lead for each team and when it occurred
d) Win probability over time (using a simple model)

Handle overtime games appropriately.
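
A helper you will likely need, assuming the standard PERIOD and PCTIMESTRING columns (regulation periods are 12 minutes, each overtime is 5):

def elapsed_seconds(period: int, clock: str) -> float:
    """Convert a period number and a 'MM:SS' game clock to seconds since tip-off."""
    minutes, seconds = clock.split(":")
    remaining = int(minutes) * 60 + float(seconds)
    if period <= 4:
        return (period - 1) * 720 + (720 - remaining)
    return 4 * 720 + (period - 5) * 300 + (300 - remaining)

Note that SCOREMARGIN is stored as a string and uses "TIE" for tied scores, so it needs cleaning before plotting.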


Exercise 2.18 ★★★★

Real-Time PBP Processing Simulation

Design a system that simulates processing play-by-play data in real-time:

a) Read events with realistic timing delays
b) Maintain running statistics that update with each event
c) Detect significant game situations (runs, clutch moments)
d) Generate alerts when interesting events occur
e) Benchmark processing latency


Section D: Tracking Data

Exercise 2.19 ★

Tracking Metric Exploration

Using the NBA API's tracking endpoints:

a) Retrieve speed and distance statistics for all players in a season
b) Identify the fastest players (average speed) and highest-mileage players
c) Compare tracking metrics across positions (Guard vs. Forward vs. Center)


Exercise 2.20 ★★

Defensive Tracking Analysis

For a selected team:

a) Retrieve defensive tracking statistics for all players
b) Calculate team-level defensive metrics from individual data
c) Identify which players defend the most shots and their opponent FG%
d) Compare to traditional defensive statistics (steals, blocks)


Exercise 2.21 ★★★

Shot Context Classification

Using tracking-enhanced shot data from playerdashptshots:

a) Build a classification system for shot types:
   - Catch-and-shoot vs. pull-up
   - Open vs. contested (by defender distance)
   - Early clock vs. late clock
b) Calculate league-average efficiency for each shot type
c) Identify players who excel at specific shot types relative to league average


Exercise 2.22 ★★★

Movement Data Parsing (Public Dataset)

Using the publicly available 2015-16 movement data:

a) Parse the JSON format into a structured DataFrame
b) Calculate player speeds at each moment
c) Identify possession changes from ball position
d) Detect passes (ball movement without dribbling)
e) Calculate the number of passes per possession
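
A parsing sketch for part (a), assuming the commonly mirrored layout of the 2015-16 SportVU files: each event holds "moments", and each moment is [period, timestamp_ms, game_clock, shot_clock, None, coordinates], where each coordinate row is [team_id, player_id, x, y, z] and the ball is the row with team_id == -1:

import json
import pandas as pd

def moments_to_frame(path: str) -> pd.DataFrame:
    with open(path) as f:
        game = json.load(f)
    rows = []
    for event in game["events"]:
        for period, _, game_clock, shot_clock, _, coords in event["moments"]:
            for team_id, player_id, x, y, z in coords:
                rows.append({
                    "event_id": event["eventId"],
                    "period": period,
                    "game_clock": game_clock,
                    "shot_clock": shot_clock,
                    "team_id": team_id,       # -1 marks the ball
                    "player_id": player_id,
                    "x": x, "y": y, "z": z,
                })
    return pd.DataFrame(rows)

Speeds then follow from consecutive positions for the same player (the feed samples at roughly 25 Hz).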


Exercise 2.23 ★★★★

Spatial Analysis from Tracking Data

Using available tracking data:

a) Calculate player spacing metrics (average distance between teammates)
b) Identify defensive schemes from player positions
c) Detect screens based on player proximity and movement patterns
d) Calculate court control metrics (area controlled by each team)

Hint: Use Voronoi diagrams or convex hulls for spatial calculations.
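
Following the hint, a sketch of two spacing measures with scipy, for one team's five (x, y) positions at a single moment:

import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

def spacing_metrics(points: np.ndarray) -> dict:
    """points: a (5, 2) array of one team's court coordinates."""
    return {
        "mean_teammate_distance": pdist(points).mean(),  # average pairwise distance
        "hull_area": ConvexHull(points).volume,          # in 2-D, .volume is the area
    }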


Section E: Public Datasets

Exercise 2.24 ★

Kaggle Data Loading and Exploration

Download the "NBA Shot Logs" dataset from Kaggle and:

a) Load the data into a pandas DataFrame
b) Perform basic exploratory analysis (shape, dtypes, missing values)
c) Identify potential data quality issues
d) Calculate basic shooting statistics and compare to known values


Exercise 2.25 ★★

Dataset Quality Assessment

For a public NBA dataset of your choice:

a) Document the data schema and available fields
b) Check for missing values and determine if they are random or systematic
c) Validate numeric ranges (e.g., FG% between 0 and 1)
d) Cross-reference sample records against official sources
e) Write a comprehensive data quality report


Exercise 2.26 ★★

Dataset Integration

Combine data from at least two different public sources:

a) Identify common entities (games, players, teams)
b) Resolve entity matching challenges
c) Handle schema differences between sources
d) Create a unified dataset with properly sourced fields
e) Document any data lost or duplicated in the merge


Section F: Data Quality and Cleaning

Exercise 2.27 ★★

Missing Value Analysis

For a dataset containing at least one season of player statistics:

a) Identify all columns with missing values
b) Classify missing values as MCAR, MAR, or MNAR
c) Implement appropriate imputation strategies for each case
d) Validate that imputation preserves statistical properties


Exercise 2.28 ★★

Outlier Detection

Develop an outlier detection system for NBA statistics:

a) Define reasonable ranges for common statistics (points, rebounds, etc.)
b) Implement statistical outlier detection (z-score, IQR-based)
c) Distinguish between valid extreme performances and data errors
d) Create a flagging system that doesn't automatically remove outliers
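
A sketch of the IQR-based approach for parts (b) and (d); the mask marks rows for review rather than deleting them:

import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)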


Exercise 2.29 ★★★

Era-Adjusted Statistics

Create a pipeline that adjusts historical statistics for era differences:

a) Calculate league-average statistics by season
b) Compute relative statistics (e.g., points above league average)
c) Adjust for pace differences across eras
d) Handle missing statistics in older data
e) Validate adjustments against published historical analyses


Exercise 2.30 ★★★

Comprehensive Data Cleaning Pipeline

Build a production-quality data cleaning pipeline that:

a) Standardizes player and team names across sources
b) Handles missing values with configurable strategies
c) Detects and flags anomalous records
d) Validates referential integrity (e.g., player IDs match)
e) Logs all transformations for reproducibility
f) Outputs a cleaned dataset with a quality report


Section G: System Design and Architecture

Exercise 2.31 ★★★

Caching System Design

Design and implement a caching system for NBA API data:

a) Support both memory and disk-based caching
b) Implement cache invalidation policies (time-based, version-based)
c) Handle cache misses gracefully with fallback to API
d) Provide cache statistics (hit rate, size, age distribution)
e) Support cache warming for anticipated queries


Exercise 2.32 ★★★

Incremental Data Updates

Build a system that maintains a local database of NBA data with incremental updates:

a) Detect new games since last sync
b) Fetch only new or updated records
c) Handle retroactive corrections to historical data
d) Maintain an audit log of all changes
e) Support rollback to previous states


Exercise 2.33 ★★★★

Multi-Source Data Pipeline

Design a complete data pipeline that:

a) Collects data from NBA API, Basketball-Reference, and public datasets
b) Performs entity resolution across all sources
c) Implements source-specific cleaning rules
d) Merges data into a unified schema
e) Validates consistency across sources
f) Schedules regular updates with failure handling
g) Provides data lineage tracking


Exercise 2.34 ★★★★

Scalable Collection Architecture

Design an architecture for collecting NBA data at scale:

a) Handle rate limits across multiple data sources
b) Distribute collection across multiple workers
c) Implement circuit breakers for failing sources
d) Support both batch and streaming collection modes
e) Provide monitoring and alerting for collection failures
f) Document the system with architecture diagrams


Exercise 2.35 ★★★★

Data Versioning System

Implement a data versioning system inspired by tools like DVC:

a) Track changes to datasets over time
b) Support branching for experimental data transformations
c) Enable reproducible data states for analysis
d) Integrate with git for code versioning
e) Handle large datasets efficiently (chunking, compression)


Section H: Advanced Challenges

Exercise 2.36 ★★★★

API Reverse Engineering

The NBA occasionally updates its API without documentation. Design a methodology to:

a) Detect changes to API endpoints or schemas
b) Identify new endpoints or parameters
c) Validate that existing code still works
d) Generate documentation from observed behavior
e) Create alerts when breaking changes are detected


Exercise 2.37 ★★★★

Proprietary Data Simulation

Without access to actual Second Spectrum data, create a simulation:

a) Generate synthetic tracking data that mimics real movement patterns
b) Enforce physical constraints (speed limits, court boundaries)
c) Model realistic basketball behavior (offensive sets, defensive rotations)
d) Produce output in the documented tracking data format
e) Validate that your simulation produces realistic derived metrics


Exercise 2.38 ★★★★

Cross-League Data Integration

Extend your data collection system to include international basketball:

a) Identify data sources for EuroLeague, G-League, and NCAA
b) Map schemas across leagues
c) Handle different rules (shot clock, court dimensions)
d) Create a unified player database spanning leagues
e) Track players across league transitions


Exercise 2.39 ★★★★

Data Quality Monitoring Dashboard

Build a monitoring system that:

a) Tracks data freshness across all sources
b) Monitors data quality metrics over time
c) Detects anomalies in incoming data
d) Visualizes collection pipeline health
e) Sends alerts for significant issues
f) Provides drill-down investigation capabilities


Exercise 2.40 ★★★★

Research Reproduction Pipeline

Select a published basketball analytics paper and:

a) Identify all data requirements
b) Collect the necessary data from available sources
c) Document any data gaps or limitations
d) Implement the paper's methodology
e) Compare your results to published findings
f) Analyze any discrepancies and their likely causes


Bonus Exercises

Exercise 2.B1 ★★★

API Load Testing

Design and execute a load testing strategy for your data collection system:

a) Determine sustainable request rates
b) Identify failure modes under load
c) Measure latency distributions
d) Document optimal batch sizes

Exercise 2.B2 ★★★★

Machine Learning Data Preparation

Create a feature engineering pipeline that produces ML-ready datasets:

a) Calculate rolling statistics with appropriate windows
b) Create lag features for time series modeling
c) Handle categorical encoding for team/player IDs
d) Split data respecting temporal ordering
e) Document feature definitions and transformations
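
A sketch covering parts (a), (b), and (d) together, assuming hypothetical PLAYER_ID, GAME_DATE, and PTS columns; the shift(1) is what keeps the current game's target out of its own features:

import pandas as pd

def add_rolling_feature(df: pd.DataFrame, col: str = "PTS", window: int = 10) -> pd.DataFrame:
    df = df.sort_values(["PLAYER_ID", "GAME_DATE"])
    # Rolling mean over each player's previous games only (no leakage)
    df[f"{col}_ROLL{window}"] = (
        df.groupby("PLAYER_ID")[col]
          .transform(lambda s: s.shift(1).rolling(window, min_periods=3).mean())
    )
    return df

A temporal split then means training on games before a cutoff date and evaluating on games after it, never a random shuffle.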


Exercise Submission Guidelines

For each exercise, your submission should include:

  1. Code: Well-documented Python code following PEP 8 style
  2. Documentation: Explanation of your approach and design decisions
  3. Testing: Unit tests demonstrating correct functionality
  4. Results: Sample output or analysis results
  5. Reflection: Discussion of limitations and potential improvements

Exercises marked with ★★★★ should additionally include:

  • Architecture diagrams
  • Performance benchmarks
  • Production deployment considerations