Case Study 1: Building a Sports Stats Pipeline — Priya Automates NBA Data Collection


Tier 3 — Illustrative/Composite Example: This case study uses a fictional API and fictional sports statistics modeled on the kinds of publicly available basketball data provided by sites like Basketball Reference and the NBA's own data feeds. All player names are real public figures in their professional capacity, but the specific statistics, API responses, and analysis results are invented for pedagogical purposes. The API endpoint used in this case study is fictional.


The Setting

Priya — our sports journalist from Chapter 1 — has been covering the NBA for her college newspaper for two semesters now. She's built a reputation for data-driven analysis, but her workflow has a bottleneck: she manually downloads box scores from a website, copies them into spreadsheets, and runs her analysis in pandas. It works, but it takes hours every week.

"There has to be a better way," Priya mutters during a late-night session, manually copying the third game's stats into her spreadsheet. She remembers Chapter 13 of her data science course: APIs. If she can find an API that provides NBA game statistics, she can automate the entire collection process.

After some research, Priya finds a sports data API that offers game-level statistics. It's free for academic use with an API key (limited to 100 requests per day). She signs up, gets her key, and opens a new Jupyter notebook.

Understanding the API

Before writing any code, Priya reads the API documentation. She learns:

  • Base URL: https://api.sportsdata.example.com/v1/nba
  • Authentication: API key passed as a header
  • Rate limit: 100 requests per day, 10 per minute
  • Endpoints:
      • /games?date=YYYY-MM-DD — returns games for a specific date
      • /boxscore/{game_id} — returns player stats for a game
      • /standings — returns current standings

Priya sketches out her plan:

  1. Get the list of games for a date range
  2. For each game, fetch the box score
  3. Combine everything into a single DataFrame
  4. Save to CSV for analysis

Step 1: Setting Up the Connection

Priya starts by configuring her connection securely:

import requests
import pandas as pd
import time
import os

# Load API key from environment variable
API_KEY = os.environ.get('SPORTS_API_KEY')
if not API_KEY:
    raise ValueError("Set SPORTS_API_KEY env variable")

BASE_URL = 'https://api.sportsdata.example.com/v1/nba'
HEADERS = {
    'Authorization': f'Bearer {API_KEY}',
    'User-Agent': 'PriyaNBAAnalysis/1.0'
}

She stores her API key in an environment variable, not in the code. She also sets a User-Agent header to identify her script — a good practice she learned from Chapter 13.
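One possible refinement (a sketch, not part of Priya's notebook) is to attach the headers to a requests.Session, so every call reuses one connection and sends the auth header automatically:

```python
import os
import requests

BASE_URL = 'https://api.sportsdata.example.com/v1/nba'

# A Session carries the headers on every request and reuses the
# underlying TCP connection across calls.
session = requests.Session()
session.headers.update({
    'Authorization': f"Bearer {os.environ.get('SPORTS_API_KEY', '')}",
    'User-Agent': 'PriyaNBAAnalysis/1.0',
})

# Later calls then shrink to, for example:
# response = session.get(f'{BASE_URL}/games',
#                        params={'date': '2024-01-15'}, timeout=10)
```

This keeps the authentication details in one place instead of threading a HEADERS dict through every function.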

Step 2: Fetching Games for a Date

Priya writes a helper function to fetch games:

def get_games(date_str):
    """Fetch NBA games for a specific date.

    Parameters
    ----------
    date_str : str
        Date in YYYY-MM-DD format

    Returns
    -------
    list of dict, or empty list on error
    """
    response = requests.get(
        f'{BASE_URL}/games',
        params={'date': date_str},
        headers=HEADERS,
        timeout=10
    )

    if response.status_code == 200:
        return response.json()['games']
    elif response.status_code == 429:
        print(f"Rate limited on {date_str}. "
              f"Waiting 60s...")
        time.sleep(60)
        return get_games(date_str)  # Retry after waiting
    else:
        print(f"Error {response.status_code} for "
              f"{date_str}")
        return []

She tests with a single date:

games = get_games('2024-01-15')
print(f"Games on 2024-01-15: {len(games)}")
for g in games:
    print(f"  {g['away_team']} @ {g['home_team']}: "
          f"{g['away_score']}-{g['home_score']}")
Games on 2024-01-15: 6
  LAL @ BOS: 108-115
  MIA @ NYK: 102-99
  GSW @ DEN: 121-118
  PHX @ DAL: 110-114
  SAC @ POR: 105-112
  MIN @ LAC: 119-107

It works. Six games, each with team abbreviations and scores.

Step 3: Fetching Box Scores

Now Priya needs the player-level stats for each game. The API response for a box score is nested:

{
  "game_id": "20240115_LAL_BOS",
  "player_stats": [
    {
      "player": "LeBron James",
      "team": "LAL",
      "stats": {
        "minutes": 36,
        "points": 28,
        "rebounds": 8,
        "assists": 11,
        "steals": 2,
        "blocks": 1
      }
    }
  ]
}

She recognizes this pattern from Chapter 12: nested JSON that json_normalize can flatten. Since player_stats is already a list of records, she can pass it in directly and let sep='_' name the nested stats columns:

def get_boxscore(game_id):
    """Fetch player stats for a specific game."""
    response = requests.get(
        f'{BASE_URL}/boxscore/{game_id}',
        headers=HEADERS,
        timeout=10
    )

    if response.status_code != 200:
        print(f"Error {response.status_code} for "
              f"game {game_id}")
        return pd.DataFrame()

    data = response.json()
    df = pd.json_normalize(
        data['player_stats'],
        sep='_'
    )
    df['game_id'] = game_id
    return df
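To see exactly what the flattening produces, the sample payload shown above can be run through json_normalize on its own:

```python
import pandas as pd

# The sample box-score payload from above, reproduced verbatim
sample = {
    "game_id": "20240115_LAL_BOS",
    "player_stats": [
        {
            "player": "LeBron James",
            "team": "LAL",
            "stats": {"minutes": 36, "points": 28, "rebounds": 8,
                      "assists": 11, "steals": 2, "blocks": 1},
        }
    ],
}

# Each nested 'stats' key becomes its own 'stats_*' column
df = pd.json_normalize(sample["player_stats"], sep="_")
print(df.columns.tolist())
# ['player', 'team', 'stats_minutes', 'stats_points', 'stats_rebounds',
#  'stats_assists', 'stats_steals', 'stats_blocks']
```

One flat row per player, with the nested dictionary unpacked into prefixed columns — exactly the shape Priya needs for concatenation later.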

Step 4: Building the Pipeline

Now Priya combines everything into a pipeline that collects a week's worth of data:

from datetime import datetime, timedelta

def collect_week(start_date_str):
    """Collect all box scores for a 7-day period."""
    start = datetime.strptime(start_date_str, '%Y-%m-%d')
    all_stats = []
    request_count = 0

    for day_offset in range(7):
        date = start + timedelta(days=day_offset)
        date_str = date.strftime('%Y-%m-%d')

        games = get_games(date_str)
        request_count += 1
        print(f"{date_str}: {len(games)} games")

        for game in games:
            game_id = game['game_id']
            boxscore = get_boxscore(game_id)
            if not boxscore.empty:
                boxscore['date'] = date_str
                all_stats.append(boxscore)
            request_count += 1

            # Rate limit: max 10/min, so wait 7s
            time.sleep(7)

        # Check the daily limit once per day, leaving headroom
        # for a full slate of games (up to ~16 more requests)
        if request_count > 80:
            print("Approaching daily limit. Stopping.")
            break

    if all_stats:
        return pd.concat(all_stats, ignore_index=True)
    return pd.DataFrame()

Priya runs it:

week_stats = collect_week('2024-01-15')
print(f"\nCollected {len(week_stats)} player-game records")
print(f"Columns: {week_stats.columns.tolist()}")
week_stats.to_csv('nba_week_20240115.csv', index=False)
2024-01-15: 6 games
2024-01-16: 8 games
2024-01-17: 5 games
2024-01-18: 9 games
2024-01-19: 7 games
2024-01-20: 10 games
2024-01-21: 4 games

Collected 1176 player-game records
Columns: ['player', 'team', 'stats_minutes', 'stats_points',
          'stats_rebounds', 'stats_assists', 'stats_steals',
          'stats_blocks', 'game_id', 'date']

In about 10 minutes of automated collection (with rate-limiting pauses), Priya has 1,176 player-game records. Previously, this would have taken her most of a Saturday.
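Before analyzing, a quick structural check on the collected frame catches silent API changes early. A sketch (the expected column names come from the output above; the helper is illustrative, not part of Priya's notebook):

```python
import pandas as pd

def sanity_check(df):
    """Raise early if the collected frame is missing expected columns."""
    expected = {'player', 'team', 'stats_points', 'stats_rebounds',
                'stats_assists', 'game_id', 'date'}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if df['stats_points'].isna().any():
        print("Warning: some rows have missing point totals")
    return df

# week_stats = sanity_check(week_stats)
```

A check like this turns a subtle downstream analysis bug into an immediate, descriptive error.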

Step 5: Quick Analysis

With the data in hand, Priya does a quick analysis for her column:

# Top scorers of the week
top_scorers = (week_stats
    .groupby('player')['stats_points']
    .agg(['mean', 'max', 'count'])
    .sort_values('mean', ascending=False)
    .head(10)
    .round(1))

print("Top 10 Scorers of the Week (by PPG):")
print(top_scorers)

She also checks for triple-doubles — games where a player reaches double digits in points, rebounds, and assists:

triple_doubles = week_stats[
    (week_stats['stats_points'] >= 10) &
    (week_stats['stats_rebounds'] >= 10) &
    (week_stats['stats_assists'] >= 10)
]
print(f"\nTriple-doubles this week: {len(triple_doubles)}")
print(triple_doubles[['player', 'date', 'stats_points',
                       'stats_rebounds', 'stats_assists']])

What Priya Learned

  1. Read the docs first. Understanding the API's structure, rate limits, and endpoints before writing code saved Priya hours of trial and error.

  2. Build incrementally. She tested each function individually (get_games, get_boxscore) before combining them into the pipeline. When something went wrong, she knew exactly which piece to debug.

  3. Respect the limits. The 7-second delay between requests felt slow, but it kept her safely under the rate limit. One week of data in 10 minutes is still vastly faster than manual collection.

  4. Cache aggressively. Saving to CSV after each collection means Priya never has to re-request data she's already collected. If her script crashes on day 5, she still has days 1-4.

  5. The pipeline pays off. The initial setup took an hour. But now Priya can collect any week's data by changing a single date string. Over a season, this saves hundreds of hours.


Discussion Questions

  1. Priya's API allows 100 requests per day. A full week of NBA games (about 49 games) requires 49 box score requests + 7 date requests = 56 requests. If she wanted to collect an entire month, how would she modify her approach?

  2. The get_games function has a recursive retry for rate limiting. What could go wrong with this approach? How would you improve it?

  3. Priya uses time.sleep(7) between requests. The API allows 10 requests per minute. Is 7 seconds the right delay? Could she go faster? What are the trade-offs?