Chapter 2: Baseball Data Sources and Infrastructure
The foundation of baseball analytics lies in accessing, managing, and analyzing comprehensive datasets. Modern baseball analysis requires understanding various data sources, from historical databases to real-time tracking systems. This chapter explores the primary data sources available to analysts, including public APIs, proprietary databases, and web scraping techniques that power contemporary baseball research and decision-making.
Understanding Baseball Data Ecosystem
Baseball possesses one of the most complete and accessible data ecosystems in professional sports. This rich data environment stems from baseball's long history of statistical record-keeping and the sport's recent embrace of advanced tracking technologies. Data ranges from basic box scores dating back to the 1870s to sophisticated biomechanical measurements captured by modern systems like Statcast and TrackMan.
The modern baseball data infrastructure consists of several layers: historical databases containing seasonal and career statistics, play-by-play data capturing every event in every game, pitch-level data with velocity and movement characteristics, and tracking data measuring the physical location and movement of every player and ball. Each layer serves different analytical purposes and requires different tools and approaches for effective utilization.
Access to baseball data varies from freely available public sources to expensive proprietary systems. Organizations like Baseball Reference and FanGraphs provide extensive free data through their websites and APIs, while companies like MLB Advanced Media, Sportradar, and Baseball Prospectus offer premium data products. Understanding what data is available, where to find it, and how to access it efficiently is essential for any baseball analyst.
Key Components
- Lahman Database: The most comprehensive free historical baseball database, containing complete batting, pitching, and fielding statistics from 1871 to present. Available in SQL, CSV, and R package formats, it includes player demographics, team records, awards, and post-season data.
- Retrosheet: Provides detailed play-by-play accounts of games dating back to 1913, with complete coverage from 1974 onward. Essential for analyzing in-game situations, lineup construction, and sequential events.
- Statcast: MLB's tracking technology, launched in 2015, that measures player movements, ball trajectories, exit velocity, launch angle, sprint speed, defensive positioning, and (since the 2024 rollout of bat tracking) bat speed. Accessible through Baseball Savant's website and search API.
- FanGraphs: Comprehensive baseball statistics website offering traditional stats, advanced metrics like WAR and wOBA, plate discipline data, batted ball profiles, and pitch-type breakdowns. Provides both a web interface and data scraping capabilities.
- Baseball Reference: Another major statistics repository with play index tools, game logs, splits data, and extensive historical information. Known for calculating bWAR (Baseball-Reference WAR).
- PyBaseball & baseballr: Python and R packages respectively that provide programmatic access to FanGraphs, Baseball Reference, Baseball Savant, and other data sources through simple function calls.
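One practical wrinkle when combining these sources is that each uses its own player ID system (MLBAM for Statcast, FanGraphs IDs, Baseball Reference IDs). Tools like `playerid_lookup` in PyBaseball resolve them through the Chadwick Bureau register. The sketch below illustrates the idea offline with a small hand-built excerpt of such a register; the stat values are purely illustrative, not real query results.

```python
import pandas as pd

# Hypothetical excerpt of a Chadwick-style ID register; in practice
# pybaseball.playerid_lookup returns rows shaped like this.
register = pd.DataFrame({
    'name_last': ['judge', 'betts'],
    'name_first': ['aaron', 'mookie'],
    'key_mlbam': [592450, 605141],
    'key_bbref': ['judgeaa01', 'bettsmo01'],
    'key_fangraphs': [15640, 13611],
})

# Stats pulled from two different sources, each keyed by its own ID system
# (values are made up for illustration)
savant_stats = pd.DataFrame({'key_mlbam': [592450], 'avg_ev': [95.9]})
fg_stats = pd.DataFrame({'key_fangraphs': [15640], 'war': [11.2]})

# Resolve both through the register so metrics line up on one row per player
merged = (register
          .merge(savant_stats, on='key_mlbam')
          .merge(fg_stats, on='key_fangraphs'))
print(merged[['name_last', 'avg_ev', 'war']])
```

Without a shared key, joining Statcast output to FanGraphs leaderboards by player name is fragile (name formatting differs between sources), so ID resolution is usually the first step in any multi-source pipeline.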
Data Pipeline Architecture
Data Flow: Raw Sources → Data Acquisition (API/Scraping) → Data Cleaning & Validation → Storage (Database) → Analysis & Modeling → Visualization & Reporting
A robust baseball analytics pipeline begins with systematic data collection from multiple sources, continues through cleaning and standardization processes, stores data in efficient database structures, and culminates in analysis and presentation tools.
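The cleaning-and-validation stage deserves emphasis: tracking systems occasionally emit physically implausible readings, and catching them before storage keeps downstream analysis honest. A minimal sketch of the validate-then-store stages, using synthetic rows in place of an API pull and hypothetical plausibility bounds:

```python
import sqlite3
import pandas as pd

def validate_statcast(df):
    """Drop rows with physically implausible readings (hypothetical bounds)."""
    mask = (df['launch_speed'].between(0, 125)
            & df['launch_angle'].between(-90, 90))
    return df[mask]

def store(df, conn):
    """Persist validated rows to the warehouse table."""
    df.to_sql('batted_balls', conn, if_exists='append', index=False)

# Synthetic 'raw' rows standing in for an acquisition step; one is corrupt
raw = pd.DataFrame({
    'launch_speed': [101.3, 250.0, 88.4],   # 250 mph is a sensor error
    'launch_angle': [24.0, 12.0, -5.0],
})

conn = sqlite3.connect(':memory:')
clean = validate_statcast(raw)
store(clean, conn)
stored = pd.read_sql_query('SELECT COUNT(*) AS n FROM batted_balls', conn)
print(stored['n'][0])  # corrupt row was filtered before storage
```

In a production pipeline the same pattern holds; only the bounds, schema, and storage backend change.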
Python Implementation
```python
import sqlite3
from datetime import datetime, timedelta

import pandas as pd
from pybaseball import statcast, batting_stats, pitching_stats, playerid_lookup
from pybaseball import cache

# Enable caching so repeated requests are served from local disk
cache.enable()


class BaseballDataPipeline:
    """Comprehensive baseball data pipeline for acquisition and storage."""

    def __init__(self, db_path='baseball_analytics.db'):
        """Initialize the database connection and create tables."""
        self.conn = sqlite3.connect(db_path)
        self.setup_database()

    def setup_database(self):
        """Create tables for storing baseball data."""
        cursor = self.conn.cursor()
        # No primary key is declared: these columns do not uniquely identify
        # a Statcast row (a batter can face the same pitcher several times
        # in a game), so deduplication is handled downstream instead.
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS statcast_data (
                game_date TEXT,
                player_name TEXT,
                batter INTEGER,
                pitcher INTEGER,
                events TEXT,
                launch_speed REAL,
                launch_angle REAL,
                hit_distance_sc REAL,
                estimated_ba_using_speedangle REAL,
                estimated_woba_using_speedangle REAL,
                woba_value REAL,
                pitch_type TEXT,
                release_speed REAL,
                release_spin_rate REAL
            )
        ''')
        self.conn.commit()

    def fetch_statcast_data(self, start_date, end_date):
        """
        Fetch Statcast data for a date range.

        Parameters:
            start_date: Start date (YYYY-MM-DD format)
            end_date: End date (YYYY-MM-DD format)

        Returns:
            DataFrame with Statcast data
        """
        print(f"Fetching Statcast data from {start_date} to {end_date}...")
        data = statcast(start_dt=start_date, end_dt=end_date)

        if data is not None and len(data) > 0:
            # Select relevant columns. Note that in Statcast exports,
            # player_name refers to the pitcher; the batter and pitcher
            # columns carry the MLBAM IDs of the two participants.
            columns = ['game_date', 'player_name', 'batter', 'pitcher',
                       'events', 'launch_speed', 'launch_angle',
                       'hit_distance_sc', 'estimated_ba_using_speedangle',
                       'estimated_woba_using_speedangle', 'woba_value',
                       'pitch_type', 'release_speed', 'release_spin_rate']
            data_clean = data[columns].dropna(subset=['events'])

            # Store in database
            data_clean.to_sql('statcast_data', self.conn,
                              if_exists='append', index=False)
            print(f"Stored {len(data_clean)} records")
            return data_clean
        else:
            print("No data found for date range")
            return pd.DataFrame()

    def fetch_season_stats(self, year):
        """
        Fetch comprehensive season statistics.

        Parameters:
            year: Season year

        Returns:
            Dictionary with batting and pitching stats
        """
        batting = batting_stats(year)
        pitching = pitching_stats(year)

        # Store in database, one table per season
        batting.to_sql(f'batting_{year}', self.conn,
                       if_exists='replace', index=False)
        pitching.to_sql(f'pitching_{year}', self.conn,
                        if_exists='replace', index=False)
        return {'batting': batting, 'pitching': pitching}

    def get_player_data(self, last_name, first_name):
        """
        Look up a player's IDs across major data sources.

        Parameters:
            last_name: Player's last name
            first_name: Player's first name

        Returns:
            DataFrame with the player's MLBAM, FanGraphs, Baseball
            Reference, and Retrosheet IDs
        """
        return playerid_lookup(last_name, first_name)

    def query_database(self, sql_query):
        """
        Execute a custom SQL query against the database.

        Parameters:
            sql_query: SQL query string

        Returns:
            Query results as DataFrame
        """
        return pd.read_sql_query(sql_query, self.conn)

    def close(self):
        """Close the database connection."""
        self.conn.close()


# Example usage
pipeline = BaseballDataPipeline()

# Fetch the last week of Statcast data
start = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
end = datetime.now().strftime('%Y-%m-%d')
statcast_data = pipeline.fetch_statcast_data(start, end)

# Fetch 2023 season statistics
season_2023 = pipeline.fetch_season_stats(2023)
print("\nTop 10 hitters by wRC+ in 2023:")
print(season_2023['batting'].nlargest(10, 'wRC+')[['Name', 'Team', 'wRC+', 'WAR']])

# Look up a specific player's IDs
shohei = pipeline.get_player_data('Ohtani', 'Shohei')
print(f"\nShohei Ohtani MLBAM ID: {shohei['key_mlbam'].values[0]}")

pipeline.close()
```
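Once data lands in SQLite, the `query_database` pattern lets an analyst answer questions with plain SQL. The sketch below shows the kind of aggregate query you might run against the `statcast_data` table, using an in-memory database seeded with a few synthetic rows so it runs standalone:

```python
import sqlite3
import pandas as pd

# Seed an in-memory database with synthetic Statcast-style rows
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'player_name': ['Cole, Gerrit', 'Cole, Gerrit', 'Snell, Blake'],
    'release_speed': [97.1, 96.5, 95.2],
}).to_sql('statcast_data', conn, index=False)

# Average fastball velocity per pitcher, hardest throwers first
query = """
    SELECT player_name,
           AVG(release_speed) AS avg_velo,
           COUNT(*)           AS pitches
    FROM statcast_data
    GROUP BY player_name
    ORDER BY avg_velo DESC
"""
result = pd.read_sql_query(query, conn)
print(result)
```

Pushing aggregation into SQL like this keeps memory use flat even when the underlying table holds millions of pitches.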
R Implementation
```r
library(tidyverse)
library(baseballr)
library(Lahman)
library(DBI)
library(RSQLite)
library(R6)

# Create baseball data pipeline class
BaseballDataPipeline <- R6::R6Class(
  "BaseballDataPipeline",
  public = list(
    conn = NULL,

    initialize = function(db_path = "baseball_analytics.db") {
      self$conn <- dbConnect(RSQLite::SQLite(), db_path)
      self$setup_database()
    },

    setup_database = function() {
      # Create tables for storing data. No primary key is declared because
      # these columns do not uniquely identify a Statcast row.
      dbExecute(self$conn, "
        CREATE TABLE IF NOT EXISTS statcast_data (
          game_date TEXT,
          player_name TEXT,
          events TEXT,
          launch_speed REAL,
          launch_angle REAL,
          hit_distance_sc REAL
        )
      ")
    },

    fetch_statcast_data = function(start_date, end_date) {
      # Fetch Statcast data using baseballr
      message(sprintf("Fetching data from %s to %s...", start_date, end_date))
      data <- statcast_search(
        start_date = start_date,
        end_date = end_date,
        playerid = NULL
      )

      if (nrow(data) > 0) {
        # Clean and store data (player_name is the pitcher in Statcast exports)
        data_clean <- data %>%
          select(game_date, player_name, events,
                 launch_speed, launch_angle, hit_distance_sc) %>%
          filter(!is.na(events))
        dbWriteTable(self$conn, "statcast_data", data_clean, append = TRUE)
        message(sprintf("Stored %d records", nrow(data_clean)))
        return(data_clean)
      }
      invisible(NULL)
    },

    fetch_fangraphs_data = function(year) {
      # Fetch FanGraphs leaderboards
      batting <- fg_batter_leaders(startseason = year, endseason = year)
      pitching <- fg_pitcher_leaders(startseason = year, endseason = year)

      # Store in database
      dbWriteTable(self$conn, paste0("batting_", year), batting, overwrite = TRUE)
      dbWriteTable(self$conn, paste0("pitching_", year), pitching, overwrite = TRUE)
      return(list(batting = batting, pitching = pitching))
    },

    get_lahman_data = function() {
      # Access the Lahman package's built-in data frames
      batting_data <- Batting %>% filter(yearID >= 2020)
      pitching_data <- Pitching %>% filter(yearID >= 2020)
      return(list(batting = batting_data, pitching = pitching_data))
    },

    query_database = function(sql_query) {
      # Execute a custom SQL query
      dbGetQuery(self$conn, sql_query)
    },

    finalize = function() {
      dbDisconnect(self$conn)
    }
  )
)

# Example usage
pipeline <- BaseballDataPipeline$new()

# Fetch the last week of Statcast data
start_date <- Sys.Date() - 7
end_date <- Sys.Date()
statcast_data <- pipeline$fetch_statcast_data(
  format(start_date, "%Y-%m-%d"),
  format(end_date, "%Y-%m-%d")
)

# Fetch 2023 season data
season_2023 <- pipeline$fetch_fangraphs_data(2023)
top_hitters <- season_2023$batting %>%
  arrange(desc(WAR)) %>%
  # Column names may differ across baseballr versions
  select(Name, Team, WAR, `wRC+`, wOBA) %>%
  head(10)
print("Top 10 hitters by WAR (2023):")
print(top_hitters)
```
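Rate metrics like the wOBA pulled above arrive precomputed from FanGraphs, but it is worth seeing how they reduce to counting stats. wOBA is a linear-weights average: each on-base event is credited with its approximate run value, divided by plate-appearance opportunities. A Python sketch using illustrative 2023-era weights (FanGraphs publishes the exact per-season values, which drift slightly year to year):

```python
def woba(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr):
    """Approximate wOBA from counting stats.

    Weights are illustrative 2023-era values; FanGraphs publishes the
    exact linear weights for each season.
    """
    unintentional_bb = bb - ibb
    num = (0.696 * unintentional_bb + 0.726 * hbp + 0.883 * singles
           + 1.244 * doubles + 1.569 * triples + 2.004 * hr)
    den = ab + bb - ibb + sf + hbp
    return num / den

# Hypothetical slugger's season line
w = woba(ab=550, bb=70, ibb=5, hbp=5, sf=4,
         singles=95, doubles=30, triples=2, hr=35)
print(round(w, 3))
```

Note that extra-base hits earn progressively larger weights, which is exactly what distinguishes wOBA from on-base percentage, where every time on base counts equally.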
Real-World Application
MLB teams employ dedicated data engineers and infrastructure specialists to manage their data pipelines. The Cleveland Guardians, for instance, built a comprehensive data warehouse that integrates Statcast data, TrackMan information from their minor league affiliates, biomechanical data from motion capture systems, and scouting reports into a unified platform. This allows analysts, coaches, and executives to query any aspect of player performance instantly.
The New York Yankees partnered with Google Cloud to build their analytics infrastructure, using BigQuery for data storage and analysis. Their system processes millions of pitches and batted balls, providing real-time insights during games and supporting long-term player development initiatives. The Pittsburgh Pirates created an internal API that standardizes data access across departments, ensuring consistency in analysis and enabling rapid development of new analytical tools.
Interpreting the Results
| Data Source | Coverage | Access Method | Best Use Cases |
|---|---|---|---|
| Lahman Database | 1871-Present (Season level) | Free download, R package, SQL | Historical analysis, career statistics |
| Retrosheet | 1913-Present (Play-by-play) | Free download, parsing required | Game situations, lineup analysis |
| Statcast | 2015-Present (Pitch/batted ball level) | Baseball Savant, PyBaseball API | Player evaluation, skill assessment |
| FanGraphs | 1871-Present (Advanced metrics) | Web scraping, PyBaseball | Modern analytics, WAR calculations |
| Baseball Reference | 1871-Present (Comprehensive) | Web scraping, play index | Research, splits analysis |
Key Takeaways
- Multiple complementary data sources exist for baseball analytics, each serving different purposes from historical research to real-time player evaluation.
- Python and R packages like PyBaseball and baseballr dramatically simplify data acquisition, largely eliminating the need for hand-rolled web scraping or API management.
- Building a robust data infrastructure with proper database storage and automated pipelines is essential for scalable and reproducible baseball analysis.
- Statcast has revolutionized baseball analytics by providing objective physical measurements of player performance, enabling new forms of evaluation and prediction.
- Understanding data provenance, quality, and limitations is crucial for drawing valid conclusions from baseball analytics.