Chapter 2: Baseball Data Sources and Infrastructure
The foundation of baseball analytics lies in accessing, managing, and analyzing comprehensive datasets. Modern baseball analysis requires understanding various data sources, from historical databases to real-time tracking systems. This chapter explores the primary data sources available to analysts, including public APIs, proprietary databases, and web scraping techniques that power contemporary baseball research and decision-making.
Understanding Baseball Data Ecosystem
Baseball possesses one of the most complete and accessible data ecosystems in professional sports. This rich data environment stems from baseball's long history of statistical record-keeping and the sport's recent embrace of advanced tracking technologies. Data ranges from basic box scores dating back to the 1870s to sophisticated biomechanical measurements captured by modern systems like Statcast and TrackMan.
The modern baseball data infrastructure consists of several layers: historical databases containing seasonal and career statistics, play-by-play data capturing every event in every game, pitch-level data with velocity and movement characteristics, and tracking data measuring the physical location and movement of every player and ball. Each layer serves different analytical purposes and requires different tools and approaches for effective utilization.
Access to baseball data varies from freely available public sources to expensive proprietary systems. Organizations like Baseball Reference and FanGraphs provide extensive free data through their websites and APIs, while companies like MLB Advanced Media, Sportradar, and Baseball Prospectus offer premium data products. Understanding what data is available, where to find it, and how to access it efficiently is essential for any baseball analyst.
Key Components
- Lahman Database: The most comprehensive free historical baseball database, containing complete batting, pitching, and fielding statistics from 1871 to present. Available in SQL, CSV, and R package formats, it includes player demographics, team records, awards, and post-season data.
- Retrosheet: Provides detailed play-by-play accounts of games dating back to 1913, with complete coverage from 1974 onward. Essential for analyzing in-game situations, lineup construction, and sequential events.
- Statcast: MLB's tracking technology, launched in 2015, that measures player movements, ball trajectories, exit velocity, launch angle, sprint speed, defensive positioning, and (since the 2024 rollout of bat tracking) bat speed. Accessible through Baseball Savant's website and search API.
- FanGraphs: Comprehensive baseball statistics website offering traditional stats, advanced metrics like WAR and wOBA, plate discipline data, batted ball profiles, and pitch-type breakdowns. Provides both a web interface and data scraping capabilities.
- Baseball Reference: Another major statistics repository with play index tools, game logs, splits data, and extensive historical information. Known for calculating bWAR (Baseball-Reference WAR).
- PyBaseball & baseballr: Python and R packages respectively that provide programmatic access to FanGraphs, Baseball Reference, Baseball Savant, and other data sources through simple function calls.
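One practical wrinkle when combining these sources is that each uses its own player ID system (MLBAM for Statcast, FanGraphs IDs, Baseball Reference IDs). Tools like `playerid_lookup` in PyBaseball resolve them through the Chadwick Bureau register. The sketch below illustrates the idea offline with a small hand-built excerpt of such a register; the stat values are purely illustrative, not real query results.

```python
import pandas as pd

# Hypothetical excerpt of a Chadwick-style ID register; in practice
# pybaseball.playerid_lookup returns rows shaped like this.
register = pd.DataFrame({
    'name_last': ['judge', 'betts'],
    'name_first': ['aaron', 'mookie'],
    'key_mlbam': [592450, 605141],
    'key_bbref': ['judgeaa01', 'bettsmo01'],
    'key_fangraphs': [15640, 13611],
})

# Stats pulled from two different sources, each keyed by its own ID system
# (values are made up for illustration)
savant_stats = pd.DataFrame({'key_mlbam': [592450], 'avg_ev': [95.9]})
fg_stats = pd.DataFrame({'key_fangraphs': [15640], 'war': [11.2]})

# Resolve both through the register so metrics line up on one row per player
merged = (register
          .merge(savant_stats, on='key_mlbam')
          .merge(fg_stats, on='key_fangraphs'))
print(merged[['name_last', 'avg_ev', 'war']])
```

Without a shared key, joining Statcast output to FanGraphs leaderboards by player name is fragile (name formatting differs between sources), so ID resolution is usually the first step in any multi-source pipeline.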
Data Pipeline Architecture
Data Flow: Raw Sources → Data Acquisition (API/Scraping) → Data Cleaning & Validation → Storage (Database) → Analysis & Modeling → Visualization & Reporting
A robust baseball analytics pipeline begins with systematic data collection from multiple sources, continues through cleaning and standardization processes, stores data in efficient database structures, and culminates in analysis and presentation tools.
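The cleaning-and-validation stage deserves emphasis: tracking systems occasionally emit physically implausible readings, and catching them before storage keeps downstream analysis honest. A minimal sketch of the validate-then-store stages, using synthetic rows in place of an API pull and hypothetical plausibility bounds:

```python
import sqlite3
import pandas as pd

def validate_statcast(df):
    """Drop rows with physically implausible readings (hypothetical bounds)."""
    mask = (df['launch_speed'].between(0, 125)
            & df['launch_angle'].between(-90, 90))
    return df[mask]

def store(df, conn):
    """Persist validated rows to the warehouse table."""
    df.to_sql('batted_balls', conn, if_exists='append', index=False)

# Synthetic 'raw' rows standing in for an acquisition step; one is corrupt
raw = pd.DataFrame({
    'launch_speed': [101.3, 250.0, 88.4],   # 250 mph is a sensor error
    'launch_angle': [24.0, 12.0, -5.0],
})

conn = sqlite3.connect(':memory:')
clean = validate_statcast(raw)
store(clean, conn)
stored = pd.read_sql_query('SELECT COUNT(*) AS n FROM batted_balls', conn)
print(stored['n'][0])  # corrupt row was filtered before storage
```

In a production pipeline the same pattern holds; only the bounds, schema, and storage backend change.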
Python Implementation
```python
import sqlite3
from datetime import datetime, timedelta

import pandas as pd
from pybaseball import statcast, batting_stats, pitching_stats, playerid_lookup
from pybaseball import cache

# Enable caching so repeated requests are served from local disk
cache.enable()


class BaseballDataPipeline:
    """Comprehensive baseball data pipeline for acquisition and storage."""

    def __init__(self, db_path='baseball_analytics.db'):
        """Initialize the database connection and create tables."""
        self.conn = sqlite3.connect(db_path)
        self.setup_database()

    def setup_database(self):
        """Create tables for storing baseball data."""
        cursor = self.conn.cursor()
        # No primary key is declared: these columns do not uniquely identify
        # a Statcast row (a batter can face the same pitcher several times
        # in a game), so deduplication is handled downstream instead.
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS statcast_data (
                game_date TEXT,
                player_name TEXT,
                batter INTEGER,
                pitcher INTEGER,
                events TEXT,
                launch_speed REAL,
                launch_angle REAL,
                hit_distance_sc REAL,
                estimated_ba_using_speedangle REAL,
                estimated_woba_using_speedangle REAL,
                woba_value REAL,
                pitch_type TEXT,
                release_speed REAL,
                release_spin_rate REAL
            )
        ''')
        self.conn.commit()

    def fetch_statcast_data(self, start_date, end_date):
        """
        Fetch Statcast data for a date range.

        Parameters:
            start_date: Start date (YYYY-MM-DD format)
            end_date: End date (YYYY-MM-DD format)

        Returns:
            DataFrame with Statcast data
        """
        print(f"Fetching Statcast data from {start_date} to {end_date}...")
        data = statcast(start_dt=start_date, end_dt=end_date)

        if data is not None and len(data) > 0:
            # Select relevant columns. Note that in Statcast exports,
            # player_name refers to the pitcher; the batter and pitcher
            # columns carry the MLBAM IDs of the two participants.
            columns = ['game_date', 'player_name', 'batter', 'pitcher',
                       'events', 'launch_speed', 'launch_angle',
                       'hit_distance_sc', 'estimated_ba_using_speedangle',
                       'estimated_woba_using_speedangle', 'woba_value',
                       'pitch_type', 'release_speed', 'release_spin_rate']
            data_clean = data[columns].dropna(subset=['events'])

            # Store in database
            data_clean.to_sql('statcast_data', self.conn,
                              if_exists='append', index=False)
            print(f"Stored {len(data_clean)} records")
            return data_clean
        else:
            print("No data found for date range")
            return pd.DataFrame()

    def fetch_season_stats(self, year):
        """
        Fetch comprehensive season statistics.

        Parameters:
            year: Season year

        Returns:
            Dictionary with batting and pitching stats
        """
        batting = batting_stats(year)
        pitching = pitching_stats(year)

        # Store in database, one table per season
        batting.to_sql(f'batting_{year}', self.conn,
                       if_exists='replace', index=False)
        pitching.to_sql(f'pitching_{year}', self.conn,
                        if_exists='replace', index=False)
        return {'batting': batting, 'pitching': pitching}

    def get_player_data(self, last_name, first_name):
        """
        Look up a player's IDs across major data sources.

        Parameters:
            last_name: Player's last name
            first_name: Player's first name

        Returns:
            DataFrame with the player's MLBAM, FanGraphs, Baseball
            Reference, and Retrosheet IDs
        """
        return playerid_lookup(last_name, first_name)

    def query_database(self, sql_query):
        """
        Execute a custom SQL query against the database.

        Parameters:
            sql_query: SQL query string

        Returns:
            Query results as DataFrame
        """
        return pd.read_sql_query(sql_query, self.conn)

    def close(self):
        """Close the database connection."""
        self.conn.close()


# Example usage
pipeline = BaseballDataPipeline()

# Fetch the last week of Statcast data
start = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
end = datetime.now().strftime('%Y-%m-%d')
statcast_data = pipeline.fetch_statcast_data(start, end)

# Fetch 2023 season statistics
season_2023 = pipeline.fetch_season_stats(2023)
print("\nTop 10 hitters by wRC+ in 2023:")
print(season_2023['batting'].nlargest(10, 'wRC+')[['Name', 'Team', 'wRC+', 'WAR']])

# Look up a specific player's IDs
shohei = pipeline.get_player_data('Ohtani', 'Shohei')
print(f"\nShohei Ohtani MLBAM ID: {shohei['key_mlbam'].values[0]}")

pipeline.close()
```
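Once data lands in SQLite, the `query_database` pattern lets an analyst answer questions with plain SQL. The sketch below shows the kind of aggregate query you might run against the `statcast_data` table, using an in-memory database seeded with a few synthetic rows so it runs standalone:

```python
import sqlite3
import pandas as pd

# Seed an in-memory database with synthetic Statcast-style rows
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'player_name': ['Cole, Gerrit', 'Cole, Gerrit', 'Snell, Blake'],
    'release_speed': [97.1, 96.5, 95.2],
}).to_sql('statcast_data', conn, index=False)

# Average fastball velocity per pitcher, hardest throwers first
query = """
    SELECT player_name,
           AVG(release_speed) AS avg_velo,
           COUNT(*)           AS pitches
    FROM statcast_data
    GROUP BY player_name
    ORDER BY avg_velo DESC
"""
result = pd.read_sql_query(query, conn)
print(result)
```

Pushing aggregation into SQL like this keeps memory use flat even when the underlying table holds millions of pitches.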
R Implementation
```r
library(tidyverse)
library(baseballr)
library(Lahman)
library(DBI)
library(RSQLite)
library(R6)

# Create baseball data pipeline class
BaseballDataPipeline <- R6::R6Class(
  "BaseballDataPipeline",
  public = list(
    conn = NULL,

    initialize = function(db_path = "baseball_analytics.db") {
      self$conn <- dbConnect(RSQLite::SQLite(), db_path)
      self$setup_database()
    },

    setup_database = function() {
      # Create tables for storing data. No primary key is declared because
      # these columns do not uniquely identify a Statcast row.
      dbExecute(self$conn, "
        CREATE TABLE IF NOT EXISTS statcast_data (
          game_date TEXT,
          player_name TEXT,
          events TEXT,
          launch_speed REAL,
          launch_angle REAL,
          hit_distance_sc REAL
        )
      ")
    },

    fetch_statcast_data = function(start_date, end_date) {
      # Fetch Statcast data using baseballr
      message(sprintf("Fetching data from %s to %s...", start_date, end_date))
      data <- statcast_search(
        start_date = start_date,
        end_date = end_date,
        playerid = NULL
      )

      if (nrow(data) > 0) {
        # Clean and store data (player_name is the pitcher in Statcast exports)
        data_clean <- data %>%
          select(game_date, player_name, events,
                 launch_speed, launch_angle, hit_distance_sc) %>%
          filter(!is.na(events))
        dbWriteTable(self$conn, "statcast_data", data_clean, append = TRUE)
        message(sprintf("Stored %d records", nrow(data_clean)))
        return(data_clean)
      }
      invisible(NULL)
    },

    fetch_fangraphs_data = function(year) {
      # Fetch FanGraphs leaderboards
      batting <- fg_batter_leaders(startseason = year, endseason = year)
      pitching <- fg_pitcher_leaders(startseason = year, endseason = year)

      # Store in database
      dbWriteTable(self$conn, paste0("batting_", year), batting, overwrite = TRUE)
      dbWriteTable(self$conn, paste0("pitching_", year), pitching, overwrite = TRUE)
      return(list(batting = batting, pitching = pitching))
    },

    get_lahman_data = function() {
      # Access the Lahman package's built-in data frames
      batting_data <- Batting %>% filter(yearID >= 2020)
      pitching_data <- Pitching %>% filter(yearID >= 2020)
      return(list(batting = batting_data, pitching = pitching_data))
    },

    query_database = function(sql_query) {
      # Execute a custom SQL query
      dbGetQuery(self$conn, sql_query)
    },

    finalize = function() {
      dbDisconnect(self$conn)
    }
  )
)

# Example usage
pipeline <- BaseballDataPipeline$new()

# Fetch the last week of Statcast data
start_date <- Sys.Date() - 7
end_date <- Sys.Date()
statcast_data <- pipeline$fetch_statcast_data(
  format(start_date, "%Y-%m-%d"),
  format(end_date, "%Y-%m-%d")
)

# Fetch 2023 season data
season_2023 <- pipeline$fetch_fangraphs_data(2023)
top_hitters <- season_2023$batting %>%
  arrange(desc(WAR)) %>%
  # Column names may differ across baseballr versions
  select(Name, Team, WAR, `wRC+`, wOBA) %>%
  head(10)
print("Top 10 hitters by WAR (2023):")
print(top_hitters)
```
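Rate metrics like the wOBA pulled above arrive precomputed from FanGraphs, but it is worth seeing how they reduce to counting stats. wOBA is a linear-weights average: each on-base event is credited with its approximate run value, divided by plate-appearance opportunities. A Python sketch using illustrative 2023-era weights (FanGraphs publishes the exact per-season values, which drift slightly year to year):

```python
def woba(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr):
    """Approximate wOBA from counting stats.

    Weights are illustrative 2023-era values; FanGraphs publishes the
    exact linear weights for each season.
    """
    unintentional_bb = bb - ibb
    num = (0.696 * unintentional_bb + 0.726 * hbp + 0.883 * singles
           + 1.244 * doubles + 1.569 * triples + 2.004 * hr)
    den = ab + bb - ibb + sf + hbp
    return num / den

# Hypothetical slugger's season line
w = woba(ab=550, bb=70, ibb=5, hbp=5, sf=4,
         singles=95, doubles=30, triples=2, hr=35)
print(round(w, 3))
```

Note that extra-base hits earn progressively larger weights, which is exactly what distinguishes wOBA from on-base percentage, where every time on base counts equally.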
Real-World Application
MLB teams employ dedicated data engineers and infrastructure specialists to manage their data pipelines. The Cleveland Guardians, for instance, built a comprehensive data warehouse that integrates Statcast data, TrackMan information from their minor league affiliates, biomechanical data from motion capture systems, and scouting reports into a unified platform. This allows analysts, coaches, and executives to query any aspect of player performance instantly.
The New York Yankees partnered with Google Cloud to build their analytics infrastructure, using BigQuery for data storage and analysis. Their system processes millions of pitches and batted balls, providing real-time insights during games and supporting long-term player development initiatives. The Pittsburgh Pirates created an internal API that standardizes data access across departments, ensuring consistency in analysis and enabling rapid development of new analytical tools.
Interpreting the Results
| Data Source | Coverage | Access Method | Best Use Cases |
|---|---|---|---|
| Lahman Database | 1871-Present (Season level) | Free download, R package, SQL | Historical analysis, career statistics |
| Retrosheet | 1913-Present (Play-by-play) | Free download, parsing required | Game situations, lineup analysis |
| Statcast | 2015-Present (Pitch/batted ball level) | Baseball Savant, PyBaseball API | Player evaluation, skill assessment |
| FanGraphs | 1871-Present (Advanced metrics) | Web scraping, PyBaseball | Modern analytics, WAR calculations |
| Baseball Reference | 1871-Present (Comprehensive) | Web scraping, play index | Research, splits analysis |
Key Takeaways
- Multiple complementary data sources exist for baseball analytics, each serving different purposes from historical research to real-time player evaluation.
- Python and R packages like PyBaseball and baseballr dramatically simplify data acquisition, largely eliminating the need for hand-rolled web scraping or API management.
- Building a robust data infrastructure with proper database storage and automated pipelines is essential for scalable and reproducible baseball analysis.
- Statcast has revolutionized baseball analytics by providing objective physical measurements of player performance, enabling new forms of evaluation and prediction.
- Understanding data provenance, quality, and limitations is crucial for drawing valid conclusions from baseball analytics.