Baseball Data Sources and Infrastructure

Beginner · 20 min read · Nov 25, 2025

In the modern era of baseball analytics, access to high-quality data is the foundation upon which all meaningful analysis is built. From pitch-tracking systems that capture every rotation of a fastball to historical databases spanning over a century of play, the infrastructure supporting baseball analytics has evolved into a sophisticated ecosystem that enables teams, researchers, and fans to uncover insights that were previously impossible to obtain.

The revolution in baseball data began in earnest with the introduction of PITCHf/x in 2006, which used cameras to track pitch movement and velocity. This system was eventually superseded by Statcast, which debuted in 2015 and expanded tracking capabilities to include batted ball data, player positioning, and sprint speeds. Today, the combination of public databases, proprietary tracking systems, and open-source tools provides analysts with an unprecedented wealth of information to explore.

Understanding Baseball Data Infrastructure

Baseball's data infrastructure spans several layers, from raw data collection at the ballpark to processed statistics served through APIs and databases. At the collection layer, Statcast originally paired Doppler radar (TrackMan) for ball tracking with optical cameras for player tracking; since 2020 it has run on Hawk-Eye optical cameras that capture both ball and player movement at high frame rates. This raw data is then processed, validated, and stored in databases maintained by MLB Advanced Media.
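
The "validated" step above can be pictured with a small plausibility check. This is only a sketch of the idea, not MLB's actual validation logic: the field names mirror Statcast column names, while the bounds and the `validate_record` helper are hypothetical and chosen purely for illustration.

```python
# Hypothetical plausibility bounds, chosen for illustration only
PLAUSIBLE_RANGES = {
    "release_speed": (40.0, 110.0),  # mph
    "launch_speed": (0.0, 125.0),    # mph
    "launch_angle": (-90.0, 90.0),   # degrees
}


def validate_record(record):
    """Return a list of human-readable problems found in one pitch record."""
    problems = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            # Missing values are allowed: not every pitch is put in play
            continue
        if not low <= value <= high:
            problems.append(f"{field}={value} outside [{low}, {high}]")
    return problems


good = {"release_speed": 95.2, "launch_speed": 103.4, "launch_angle": 27.0}
bad = {"release_speed": 195.2, "launch_speed": None, "launch_angle": 27.0}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue in a batch before deciding whether to drop or repair a row.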

For analysts outside of MLB organizations, data access comes primarily through three channels: official MLB APIs and data feeds, third-party data aggregators like Baseball Reference and FanGraphs, and historical databases like Retrosheet that preserve game records reaching back to the 19th century. Each source has its strengths: official sources provide the most current and detailed tracking data, while historical databases offer unmatched depth for longitudinal studies.

Key Components

  • Statcast: MLB's ball and player tracking system providing pitch velocity, spin rate, launch angle, exit velocity, sprint speed, and defensive positioning data since 2015
  • Retrosheet: Volunteer-driven organization maintaining game logs that reach back to 1871 and play-by-play event files for a large share of MLB history, with coverage thinning in the earliest decades
  • Baseball Reference: Comprehensive statistical database with standard and advanced metrics, player pages, team histories, and Stathead (formerly Play Index) search functionality
  • FanGraphs: Advanced analytics platform featuring WAR calculations, pitch-level data, projection systems (ZiPS, Steamer), and leaderboards
  • Lahman Database: Open-source database containing complete batting, pitching, and fielding statistics from 1871 to present, widely used in academic research
  • Baseball Savant: MLB's public-facing Statcast portal with visualizations, leaderboards, and CSV export capabilities for detailed analysis
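
One way to keep these components straight in code is a small coverage registry. The sketch below is an assumption-laden illustration: the `Source` class and `sources_for` helper are hypothetical names, and the first-season values simply restate the years given in the list above, so they should be read as approximate starting points, not guarantees of completeness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    first_season: int
    granularity: str


# Coverage years follow the descriptions in this article; earlier seasons
# are often less complete than the first_season value suggests.
SOURCES = [
    Source("Lahman Database", 1871, "season totals"),
    Source("Retrosheet", 1871, "game logs and event files"),
    Source("Statcast / Baseball Savant", 2015, "pitch and player tracking"),
]


def sources_for(season):
    """Names of the sources whose coverage includes the given season."""
    return [s.name for s in SOURCES if s.first_season <= season]


print(sources_for(1998))
print(sources_for(2021))
```

A lookup like this makes it explicit when a study that spans eras has to fall back from tracking data to event or season-level data.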

Data Collection Methodology

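Statcast data points per game: roughly 2.5 million measurements

Tracking frequency: about 30 frames/second for optical tracking, supplemented by radar updates, in the original radar-plus-camera setup; the Hawk-Eye cameras in use since 2020 capture at higher frame rates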

Each pitch generates more than 20 data points, including release position, velocity, movement profiles, spin axis, and extension. Batted balls add launch angle, exit velocity, spray angle, and expected outcomes. Player tracking captures positioning data for all fielders and baserunners throughout each play.
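
Spray angle is usually derived rather than downloaded: public Statcast exports carry hit coordinates (`hc_x`, `hc_y`), and analysts convert them to an angle. The sketch below uses a commonly circulated approximation for the location of home plate in that coordinate system. Both constants, and the sign convention, are assumptions and not an official specification.

```python
import math

# Assumed position of home plate in the hit-coordinate system, taken from
# a commonly circulated approximation rather than an official source.
HOME_X = 125.42
HOME_Y = 198.27


def spray_angle(hc_x, hc_y):
    """Approximate horizontal spray angle in degrees.

    0 is straight up the middle. Under the assumed orientation (x grows
    toward right field, y shrinks toward the outfield), negative values
    point to the left-field side and positive values to the right-field side.
    """
    return math.degrees(math.atan2(hc_x - HOME_X, HOME_Y - hc_y))


# A ball hit directly over second base, and one hit toward the right side
print(round(spray_angle(125.42, 100.0), 1))
print(round(spray_angle(175.42, 148.27), 1))
```

Using `atan2` instead of a plain arctangent avoids a division by zero when a ball is recorded level with home plate.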

Python Implementation


import pandas as pd
import numpy as np
from pybaseball import (
    statcast,
    statcast_pitcher,
    statcast_batter,
    playerid_lookup,
    team_batting,
    team_pitching,
    retrosheet
)
from datetime import datetime, timedelta

class BaseballDataCollector:
    """
    Comprehensive data collection class for baseball analytics.
    Interfaces with multiple data sources including Statcast,
    FanGraphs, and Retrosheet.
    """

    def __init__(self):
        self.cache = {}

    def get_statcast_data(self, start_date, end_date, team=None):
        """
        Fetch Statcast pitch-level data for a date range.

        Parameters:
        -----------
        start_date : str
            Start date in 'YYYY-MM-DD' format
        end_date : str
            End date in 'YYYY-MM-DD' format
        team : str, optional
            Three-letter team abbreviation to filter results

        Returns:
        --------
        DataFrame with pitch-level Statcast data
        """
        cache_key = f"statcast_{start_date}_{end_date}_{team}"

        if cache_key in self.cache:
            return self.cache[cache_key]

        # Fetch data from Baseball Savant via pybaseball
        data = statcast(start_dt=start_date, end_dt=end_date, team=team)

        # Clean and process the data
        data = self._clean_statcast_data(data)

        self.cache[cache_key] = data
        return data

    def _clean_statcast_data(self, df):
        """Clean and standardize Statcast data."""
        if df is None or df.empty:
            return pd.DataFrame()

        # Convert date columns
        df['game_date'] = pd.to_datetime(df['game_date'])

        # Coerce key metrics to numeric; unparseable values become NaN
        numeric_cols = ['release_speed', 'launch_speed', 'launch_angle',
                       'release_spin_rate', 'pfx_x', 'pfx_z']
        for col in numeric_cols:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors='coerce')

        return df

    def get_player_statcast(self, player_name, year, player_type='batter'):
        """
        Get Statcast data for a specific player.

        Parameters:
        -----------
        player_name : str
            Player name in "First Last" format
        year : int
            Season year
        player_type : str
            Either 'batter' or 'pitcher'

        Returns:
        --------
        DataFrame with player's Statcast data
        """
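        # Look up player ID: the first token is the first name and the rest
        # is the last name, so multi-word surnames stay intact
        first_name, _, last_name = player_name.strip().partition(" ")
        lookup = playerid_lookup(last_name, first_name)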

        if lookup.empty:
            raise ValueError(f"Player not found: {player_name}")

        player_id = lookup.iloc[0]['key_mlbam']

        # Fetch appropriate data
        if player_type == 'batter':
            return statcast_batter(f"{year}-03-01", f"{year}-11-30", player_id)
        else:
            return statcast_pitcher(f"{year}-03-01", f"{year}-11-30", player_id)

    def calculate_barrel_rate(self, df):
        """
        Calculate barrel rate (barrels per batted ball event) from Statcast data.
        A barrel combines high exit velocity with a favorable launch angle.
        """
        # Only balls put in play count toward the denominator
        batted_balls = df[df['type'] == 'X']
        if batted_balls.empty:
            return 0.0

        if 'launch_speed_angle' in batted_balls.columns:
            # Statcast codes barrels as 6 in launch_speed_angle
            return (batted_balls['launch_speed_angle'] == 6).sum() / len(batted_balls)

        # Manual fallback: a widely used approximation of the barrel zone.
        # At 98 mph the window is 26-30 degrees; it widens with each
        # additional mph of exit velocity and is capped at 50 degrees.
        ev = batted_balls['launch_speed']
        la = batted_balls['launch_angle']
        is_barrel = (
            (ev >= 98) & (la <= 50) &
            (ev * 1.5 - la >= 117) &
            (ev + la >= 124)
        )
        return is_barrel.sum() / len(batted_balls)


# Example usage
if __name__ == "__main__":
    collector = BaseballDataCollector()

    # Get recent Statcast data
    yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
    week_ago = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')

    print("Fetching Statcast data...")
    data = collector.get_statcast_data(week_ago, yesterday)

    print(f"Retrieved {len(data)} pitch records")
    if data.empty:
        # No games in the window (e.g., during the offseason)
        print("No pitches found for this date range.")
    else:
        print(f"\nColumns available: {list(data.columns)[:10]}...")
        print(f"\nSample pitch types:\n{data['pitch_type'].value_counts().head()}")
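
To make the barrel zone concrete, the standalone sketch below expresses it as a launch-angle window for a given exit velocity. It relies on a widely circulated linear approximation of the published barrel thresholds, so treat it as an assumption and not as the official classification. The `barrel_window` and `is_barrel` names are hypothetical helpers.

```python
def barrel_window(exit_velocity):
    """Return the (min, max) launch angle in degrees that counts as a barrel,
    or None when the exit velocity is below the 98 mph floor.

    Assumption: the window follows a common linear approximation, with the
    upper bound capped at 50 degrees.
    """
    if exit_velocity < 98:
        return None
    low = 124 - exit_velocity
    high = min(50.0, 1.5 * exit_velocity - 117)
    return (low, high)


def is_barrel(exit_velocity, launch_angle):
    """True when the batted ball falls inside the approximate barrel zone."""
    window = barrel_window(exit_velocity)
    return window is not None and window[0] <= launch_angle <= window[1]


for ev in (98, 100, 116):
    print(ev, barrel_window(ev))
```

Under this approximation the window is 26 to 30 degrees at 98 mph and grows to 8 to 50 degrees at 116 mph, which is why harder-hit balls qualify across a much wider range of angles.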
    

R Implementation


library(baseballr)
library(Lahman)
library(tidyverse)
library(lubridate)

#' Baseball Data Collection Functions
#'
#' Comprehensive suite for accessing baseball data from multiple sources
#' including Statcast, FanGraphs, Baseball Reference, and Lahman database.

# ============================================
# Statcast Data Functions
# ============================================

#' Fetch Statcast data for a date range
#'
#' @param start_date Start date (character, "YYYY-MM-DD")
#' @param end_date End date (character, "YYYY-MM-DD")
#' @return Data frame with pitch-level Statcast data
get_statcast_range <- function(start_date, end_date) {

  # Convert to date objects
  start <- as.Date(start_date)
  end <- as.Date(end_date)

  # Statcast limits queries to ~5 days, so chunk larger requests
  all_data <- data.frame()

  current_start <- start
  while (current_start <= end) {
    current_end <- min(current_start + days(5), end)

    chunk <- scrape_statcast_savant(
      start_date = as.character(current_start),
      end_date = as.character(current_end),
      player_type = "batter"
    )

    all_data <- bind_rows(all_data, chunk)
    current_start <- current_end + days(1)

    # Respect rate limits
    Sys.sleep(2)
  }

  return(all_data)
}

#' Get player Statcast data by name
#'
#' @param player_name Player name ("First Last")
#' @param season Year of data to retrieve
#' @param player_type "batter" or "pitcher"
#' @return Data frame with player's Statcast data
get_player_statcast <- function(player_name, season, player_type = "batter") {

  # Parse name
  name_parts <- str_split(player_name, " ")[[1]]
  first_name <- name_parts[1]
  last_name <- paste(name_parts[-1], collapse = " ")

  # Look up player ID
  player_info <- playerid_lookup(last_name, first_name)

  if (nrow(player_info) == 0) {
    stop(paste("Player not found:", player_name))
  }

  mlbam_id <- player_info$key_mlbam[1]

  # Define season dates
  start_date <- paste0(season, "-03-01")
  end_date <- paste0(season, "-11-30")

  # Fetch data
  if (player_type == "batter") {
    data <- scrape_statcast_savant(
      start_date = start_date,
      end_date = end_date,
      playerid = mlbam_id,
      player_type = "batter"
    )
  } else {
    data <- scrape_statcast_savant(
      start_date = start_date,
      end_date = end_date,
      playerid = mlbam_id,
      player_type = "pitcher"
    )
  }

  return(data)
}

# ============================================
# Lahman Database Functions
# ============================================

#' Get comprehensive batting statistics from Lahman
#'
#' @param min_year Minimum season year
#' @param max_year Maximum season year
#' @param min_pa Minimum plate appearances filter
#' @return Data frame with batting stats
get_lahman_batting <- function(min_year = 1900, max_year = 2023, min_pa = 200) {

  batting <- Lahman::Batting %>%
    filter(yearID >= min_year, yearID <= max_year) %>%
    group_by(playerID, yearID) %>%
    summarize(
      G = sum(G),
      AB = sum(AB),
      R = sum(R),
      H = sum(H),
      X2B = sum(X2B),
      X3B = sum(X3B),
      HR = sum(HR),
      RBI = sum(RBI),
      SB = sum(SB),
      CS = sum(CS),
      BB = sum(BB),
      SO = sum(SO),
      IBB = sum(IBB, na.rm = TRUE),
      HBP = sum(HBP, na.rm = TRUE),
      SH = sum(SH, na.rm = TRUE),
      SF = sum(SF, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(
      PA = AB + BB + HBP + SF + SH,
      AVG = round(H / AB, 3),
      OBP = round((H + BB + HBP) / (AB + BB + HBP + SF), 3),
      SLG = round((H + X2B + 2*X3B + 3*HR) / AB, 3),
      OPS = OBP + SLG
    ) %>%
    filter(PA >= min_pa)

  # Join with player names
  people <- Lahman::People %>%
    select(playerID, nameFirst, nameLast)

  batting <- batting %>%
    left_join(people, by = "playerID") %>%
    mutate(name = paste(nameFirst, nameLast))

  return(batting)
}

# ============================================
# FanGraphs Data Functions
# ============================================

#' Fetch FanGraphs leaderboard data
#'
#' @param season Year
#' @param stat_type "bat" or "pit"
#' @param min_pa Minimum PA (batters) or IP (pitchers)
#' @return Data frame with FanGraphs stats
get_fangraphs_leaders <- function(season, stat_type = "bat", min_pa = 200) {

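  # Note: argument names differ across baseballr releases (newer versions
  # use startseason/endseason in place of x/y); check the installed
  # version's documentation before running.
  if (stat_type == "bat") {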
    data <- fg_batter_leaders(
      x = season,
      y = season,
      qual = min_pa,
      ind = 1
    )
  } else {
    data <- fg_pitcher_leaders(
      x = season,
      y = season,
      qual = min_pa,
      ind = 1
    )
  }

  return(data)
}

# ============================================
# Example Usage
# ============================================

# Fetch Lahman batting data for analysis
batting_data <- get_lahman_batting(min_year = 2010, max_year = 2023)

# View top OPS seasons
top_ops <- batting_data %>%
  arrange(desc(OPS)) %>%
  head(20)

print("Top 20 OPS Seasons (2010-2023):")
print(top_ops %>% select(name, yearID, PA, AVG, OBP, SLG, OPS))

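# Average rate stats among qualifying hitters by year
# (unweighted mean of players above the PA cutoff, not a true league average)
league_avgs <- batting_data %>%
  group_by(yearID) %>%
  summarize(
    avg_OBP = mean(OBP, na.rm = TRUE),
    avg_SLG = mean(SLG, na.rm = TRUE),
    avg_OPS = mean(OPS, na.rm = TRUE)
  )

# cat() interprets "\n" as a newline, whereas print() would show it literally
cat("\nAverage OPS Among Qualifying Hitters by Year:\n")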
print(league_avgs)
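
The rate statistics computed from the Lahman tables can be checked by hand. The Python sketch below applies the same AVG, OBP, and SLG formulas to a made-up stat line; the numbers are hypothetical and do not describe a real player's season, and `rate_stats` is an illustrative helper.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Compute AVG, OBP, SLG and OPS from season counting stats."""
    if ab <= 0:
        raise ValueError("at-bats must be positive")
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {
        "AVG": round(avg, 3),
        "OBP": round(obp, 3),
        "SLG": round(slg, 3),
        "OPS": round(obp + slg, 3),
    }


# Hypothetical season: 500 AB, 150 H (30 2B, 5 3B, 25 HR), 60 BB, 5 HBP, 5 SF
line = rate_stats(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60, hbp=5, sf=5)
print(line)
```

Total bases here are counted as singles plus weighted extra-base hits, which is algebraically the same as the `H + X2B + 2*X3B + 3*HR` form. This sketch rounds OPS after adding unrounded OBP and SLG, so it can differ in the third decimal from an OPS built from already rounded components.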
    

Real-World Application

Professional baseball teams leverage these data sources in fundamentally different ways than casual analysts. The Houston Astros, pioneers in baseball's analytical revolution, built their dynasty partly on proprietary systems that ingest Statcast data in real time to inform in-game decisions. Their analytics department developed algorithms that identify optimal defensive shifts based on spray charts derived from batted ball data, contributing to their sustained success in the late 2010s and early 2020s.

The Los Angeles Dodgers exemplify another approach, investing heavily in player development analytics. By combining minor league tracking data with major league Statcast information, they've consistently identified undervalued players and helped existing talent optimize their approaches. Their acquisition and development of Max Muncy—who transformed from a journeyman into an All-Star through swing mechanics changes informed by launch angle data—demonstrates the practical value of these data systems.

Interpreting Data Quality

Data Source            Accuracy Level            Best Use Cases
Statcast (2020+)       Elite (99%+)              Pitch movement, exit velocity, sprint speed analysis
Statcast (2015-2019)   High (95%+)               General tracking metrics, some position data gaps
PITCHf/x (2008-2014)   Good (90%+)               Pitch velocity, movement, historical pitch analysis
Retrosheet             Excellent for events      Historical play-by-play, game outcomes, situational analysis
Lahman Database        High for counting stats   Career totals, historical comparisons, academic research

Key Takeaways

  • Statcast provides the most detailed pitch and batted ball data since 2015, with continuous improvements in tracking accuracy and additional metrics
  • Historical analysis requires combining multiple sources: Retrosheet for play-by-play events, Lahman for statistics, and Baseball Reference for context
  • Python's pybaseball and R's baseballr packages democratize access to MLB data that was once available only to teams
  • Data quality varies by era—always consider the limitations of historical data when making cross-era comparisons
  • Building a robust data pipeline with caching and error handling is essential for any serious baseball analytics project
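
The last takeaway can be sketched with the standard library alone. The example below is a minimal illustration, not a production pipeline: `cached_fetch` is a hypothetical helper, the cache is a directory of JSON files, and `flaky_fetch` is a stand-in for a real download that fails once before succeeding.

```python
import json
import tempfile
import time
from pathlib import Path


def cached_fetch(key, fetch, cache_dir="cache", retries=3, delay=1.0):
    """Return cached JSON for `key`, or call `fetch()` and store its result."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())

    last_error = None
    for attempt in range(retries):
        try:
            result = fetch()
            break
        except Exception as exc:  # broad on purpose: this is a sketch
            last_error = exc
            if attempt < retries - 1:
                # Exponential backoff between attempts
                time.sleep(delay * (2 ** attempt))
    else:
        raise RuntimeError(f"fetch failed after {retries} attempts") from last_error

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = {"n": 0}


def flaky_fetch():
    """Stand-in for a network request: fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network failure")
    return {"rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("demo", flaky_fetch, cache_dir=tmp, delay=0)
    second = cached_fetch("demo", flaky_fetch, cache_dir=tmp, delay=0)

print(first, second, calls["n"])
```

The second call is served from disk, so the fetch function runs only twice in total (one failure, one success). For pitch-level tables, a columnar format such as Parquet would be a more natural cache than JSON, at the cost of an extra dependency.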
