Using pybaseball for Data Collection

Beginner · 10 min read · Nov 26, 2025

Introduction to pybaseball

pybaseball is a Python library for acquiring and analyzing baseball data. Created and maintained by James LeDoux, it provides access to data from several sources, including Baseball Reference, FanGraphs, and Baseball Savant (Statcast). Whether you're conducting sabermetric research, building predictive models, or simply exploring baseball statistics, pybaseball handles the data collection and returns the results as pandas DataFrames.

Installation and Setup

# Install pybaseball (run in a shell)
pip install pybaseball

# Optionally add the analysis and plotting libraries used below
pip install pybaseball pandas matplotlib seaborn scipy

# Verify the installation (run in Python)
import pybaseball as pyb
print(pyb.__version__)

# Enable data caching for better performance
pyb.cache.enable()

Data Caching and Rate Limits

from pybaseball import cache

# Enable caching
cache.enable()

# Check cache status (the cache module exposes its settings on cache.config)
print(f"Cache enabled: {cache.config.enabled}")

# Clear cache if needed
# cache.purge()

Player Identification with playerid_lookup

from pybaseball import playerid_lookup

# Look up a player by name
player = playerid_lookup('trout', 'mike')
print(player)

# Returns DataFrame with multiple ID types:
# key_mlbam: MLB Advanced Media ID (for Statcast)
# key_retro: Retrosheet ID
# key_bbref: Baseball Reference ID
# key_fangraphs: FanGraphs ID
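
A name can match more than one player, in which case the lookup returns several rows and taking the first one may pick the wrong person. The sketch below shows one way to narrow the matches by season. It assumes the result also carries mlb_played_first and mlb_played_last columns and has been converted with .to_dict("records"). The helper name active_in and the sample rows are made up for illustration.

```python
def active_in(records, season):
    """Return the lookup rows whose MLB career spans the given season."""
    matches = []
    for row in records:
        first = row.get("mlb_played_first")
        last = row.get("mlb_played_last")
        # Players without MLB time have missing values (None or NaN); NaN != NaN
        if first is None or last is None or first != first or last != last:
            continue
        if first <= season <= last:
            matches.append(row)
    return matches


# Made-up rows standing in for playerid_lookup(...).to_dict("records")
sample = [
    {"name_last": "example", "name_first": "pat", "key_mlbam": 1,
     "mlb_played_first": 1988.0, "mlb_played_last": 1995.0},
    {"name_last": "example", "name_first": "pat", "key_mlbam": 2,
     "mlb_played_first": 2019.0, "mlb_played_last": 2024.0},
    {"name_last": "example", "name_first": "pat", "key_mlbam": 3,
     "mlb_played_first": float("nan"), "mlb_played_last": float("nan")},
]

print([row["key_mlbam"] for row in active_in(sample, 2024)])
```

If exactly one row survives the filter, its key_mlbam value is the ID to pass to the Statcast functions. If several remain, inspect them by hand rather than guessing.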

Available Functions Reference

Function             Data Source       Description
batting_stats()      FanGraphs         Season-level batting statistics
pitching_stats()     FanGraphs         Season-level pitching statistics
statcast_batter()    Baseball Savant   Pitch-level data for a specific batter
statcast_pitcher()   Baseball Savant   Pitch-level data for a specific pitcher
statcast()           Baseball Savant   All Statcast data for a date range
playerid_lookup()    Chadwick Bureau   Player identification across systems
team_batting()       FanGraphs         Team batting statistics
team_pitching()      FanGraphs         Team pitching statistics

Batting Statistics

from pybaseball import batting_stats
import pandas as pd

# Get batting stats for 2024 season
batting_2024 = batting_stats(2024)
print(batting_2024.head())

# Get stats for a range of seasons
batting_multi = batting_stats(2020, 2024)
print(f"Retrieved {len(batting_multi)} player-seasons")

# Filter for qualified batters (502 PA minimum)
qualified = batting_2024[batting_2024['PA'] >= 502].copy()

# Find top 10 by batting average
top_avg = qualified.nlargest(10, 'AVG')[['Name', 'Team', 'AVG', 'OBP', 'SLG', 'wRC+']]
print("\nTop 10 Batting Average (2024):")
print(top_avg.to_string(index=False))

# Find top power hitters
top_hr = qualified.nlargest(10, 'HR')[['Name', 'Team', 'HR', 'ISO', 'Barrel%']]
print("\nTop 10 Home Run Hitters:")
print(top_hr.to_string(index=False))
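
The 502 PA cutoff comes from the qualification rule of 3.1 plate appearances per scheduled team game, and the pitching equivalent is 1 inning per team game. The short sketch below works through that arithmetic so the threshold can be adjusted for a season with a different schedule length, such as the 60-game 2020 season. The function names are hypothetical, and rounding to the nearest whole number is an assumption that reproduces the usual 502 figure.

```python
def qualifying_pa(team_games: int = 162) -> int:
    """Plate appearances needed to qualify: 3.1 per scheduled team game."""
    # 31 * games / 10 keeps the multiplication in integers before dividing;
    # rounding to the nearest whole number is an assumption of this sketch
    return round(31 * team_games / 10)


def qualifying_ip(team_games: int = 162) -> int:
    """Innings pitched needed to qualify: 1 per scheduled team game."""
    return team_games


for games in (162, 60):
    print(f"{games} games: {qualifying_pa(games)} PA, {qualifying_ip(games)} IP")
```

A fixed 502 PA filter applied to a shortened season would return no players at all, so a schedule-aware threshold is the safer choice when looping over many years.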

Pitching Statistics

from pybaseball import pitching_stats

# Get pitching stats for 2024
pitching_2024 = pitching_stats(2024)

# Filter for qualified starters (162 IP minimum)
qualified = pitching_2024[pitching_2024['IP'] >= 162].copy()

# Find top pitchers by WAR
top_war = qualified.nlargest(10, 'WAR')[['Name', 'Team', 'IP', 'ERA', 'FIP', 'K/9', 'BB/9', 'WAR']]
print("Top 10 Pitchers by WAR:")
print(top_war.to_string(index=False))

Statcast Data Access

from pybaseball import statcast_batter, statcast_pitcher, playerid_lookup

# Get Mike Trout's Statcast data
trout = playerid_lookup('trout', 'mike')
trout_id = trout.iloc[0]['key_mlbam']

# Retrieve Statcast data for specific date range
statcast_data = statcast_batter('2024-04-01', '2024-10-01', trout_id)
print(f"Retrieved {len(statcast_data)} pitches")

# Filter for balls in play
bip = statcast_data[statcast_data['type'] == 'X'].copy()

# Calculate key metrics
avg_ev = bip['launch_speed'].mean()
avg_la = bip['launch_angle'].mean()
print(f"Average Exit Velocity: {avg_ev:.1f} mph")
print(f"Average Launch Angle: {avg_la:.1f} degrees")

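# Count barrels. Statcast search data has no 'barrel' column;
# launch_speed_angle == 6 is the contact-quality code for a barrel
barrels = (bip['launch_speed_angle'] == 6).sum()
barrel_rate = (barrels / len(bip)) * 100
print(f"Barrel Rate: {barrel_rate:.1f}%")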
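
Baseball Savant search exports classify each batted ball in the launch_speed_angle column with a code from 1 to 6, where 6 is a barrel, according to the Baseball Savant CSV documentation as understood here. The sketch below spells out those codes and computes a barrel rate over classified batted balls only, skipping rows where the code is missing. The names CONTACT_LABELS and barrel_rate are hypothetical, and the sample codes are made up.

```python
# Contact-quality codes for the Statcast `launch_speed_angle` column
# (taken from the Baseball Savant CSV documentation; treat as an assumption)
CONTACT_LABELS = {
    1: "Weak",
    2: "Topped",
    3: "Under",
    4: "Flare/Burner",
    5: "Solid Contact",
    6: "Barrel",
}


def barrel_rate(codes):
    """Percentage of classified batted balls that are barrels (code 6)."""
    # Drop missing values: None, or NaN (the only value not equal to itself)
    classified = [int(c) for c in codes if c is not None and c == c]
    if not classified:
        return float("nan")
    barrels = sum(1 for c in classified if c == 6)
    return 100 * barrels / len(classified)


# Made-up codes standing in for bip['launch_speed_angle']
sample_codes = [6, 3, 4, 6, None, 2, float("nan"), 5, 1, 4]
print(f"Barrel Rate: {barrel_rate(sample_codes):.1f}%")
```

Passing a pandas Series such as bip['launch_speed_angle'] works the same way, since iterating it yields the individual codes. Note that this denominator excludes unclassified batted balls, so the result can differ slightly from a rate computed over every ball in play.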

Creating Visualizations

from pybaseball import statcast_batter
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Get player data
player_id = 592450  # Aaron Judge
data = statcast_batter('2024-04-01', '2024-10-01', player_id)

# Filter for balls in play
bip = data[data['type'] == 'X'].dropna(subset=['launch_speed', 'launch_angle'])

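# Flag barrels. Statcast search data has no 'barrel' column;
# launch_speed_angle == 6 is the contact-quality code for a barrel
is_barrel = (bip['launch_speed_angle'] == 6).astype(int)

# Create scatter plot
plt.figure(figsize=(12, 8))
scatter = plt.scatter(bip['launch_angle'], bip['launch_speed'],
                     c=is_barrel, cmap='RdYlGn',
                     alpha=0.6, s=50)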
plt.xlabel('Launch Angle (degrees)', fontsize=12)
plt.ylabel('Exit Velocity (mph)', fontsize=12)
plt.title('Aaron Judge - Exit Velocity vs Launch Angle (2024)', fontsize=14)
plt.colorbar(scatter, label='Barrel (1=Yes, 0=No)')
plt.axhline(y=95, color='r', linestyle='--', alpha=0.3, label='95 mph threshold')
plt.legend()
plt.tight_layout()
plt.savefig('judge_exitvelo_launchangle.png', dpi=300)
plt.show()

League-Wide Trends Analysis

from pybaseball import batting_stats
import matplotlib.pyplot as plt

# Get multi-year data
years = range(2015, 2025)
hr_by_year = []
k_rate_by_year = []

for year in years:
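    # batting_stats() returns only FanGraphs-qualified hitters by default, so no
    # PA filter is needed (a fixed 502 PA cutoff would empty the 60-game 2020
    # season, whose HR totals are also lower because of the short schedule)
    qualified = batting_stats(year)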
    hr_by_year.append(qualified['HR'].mean())
    # Assumes K% comes back as a fraction (e.g. 0.22), so scale it to percent
    k_rate_by_year.append(qualified['K%'].mean() * 100)

# Create visualization
fig, ax1 = plt.subplots(figsize=(12, 7))

ax1.set_xlabel('Season', fontsize=12)
ax1.set_ylabel('Average Home Runs', color='tab:red', fontsize=12)
ax1.plot(years, hr_by_year, color='tab:red', marker='o', linewidth=2, label='Home Runs')
ax1.tick_params(axis='y', labelcolor='tab:red')

ax2 = ax1.twinx()
ax2.set_ylabel('Strikeout Rate (%)', color='tab:blue', fontsize=12)
ax2.plot(years, k_rate_by_year, color='tab:blue', marker='s', linewidth=2, label='K%')
ax2.tick_params(axis='y', labelcolor='tab:blue')

plt.title('MLB Trends: Home Runs and Strikeout Rates (2015-2024)', fontsize=14)
plt.tight_layout()
plt.savefig('mlb_trends.png', dpi=300)
plt.show()

Best Practices

  • Always enable caching: Use pybaseball.cache.enable() at the start of scripts
  • Save data locally: Store DataFrames to CSV/parquet for repeated analysis
  • Respect rate limits: Add delays between large requests
  • Filter after retrieval: Download broader datasets and filter locally
  • Use the qual parameter: Set a minimum plate-appearance or innings threshold in batting_stats() and pitching_stats() to reduce data size
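
The "save data locally" and "respect rate limits" practices can be combined in one small loop. The sketch below fetches one season at a time, writes each result to CSV, skips seasons that are already on disk, and pauses between real requests. The helper fetch_seasons, the file naming scheme, and the 5-second default delay are all assumptions for illustration, not pybaseball features or documented limits.

```python
import time
from pathlib import Path


def fetch_seasons(fetch, seasons, out_dir, delay_seconds=5.0, sleep=time.sleep):
    """Fetch seasons one by one, saving each to CSV and pausing between requests.

    `fetch` is any callable that takes a season and returns an object with a
    pandas-style to_csv(path, index=False) method, for example batting_stats.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    requests_made = 0
    for season in seasons:
        path = out_dir / f"season_{season}.csv"
        if not path.exists():
            if requests_made:
                sleep(delay_seconds)  # pause only between real requests
            fetch(season).to_csv(path, index=False)
            requests_made += 1
        saved.append(path)
    return saved


# Example call (performs network requests, so it is left commented out):
# from pybaseball import batting_stats
# paths = fetch_seasons(batting_stats, range(2020, 2025), "data/batting")
```

Later runs can load the saved files with pandas.read_csv() instead of requesting the same seasons again. The sleep argument exists only so the pause can be swapped out in tests.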

Key Takeaways

  • Comprehensive data access: pybaseball provides unified access to FanGraphs, Baseball Savant, and Baseball Reference data.
  • Simple API: Functions like batting_stats() and statcast_batter() make data retrieval straightforward.
  • Player ID lookup: The playerid_lookup() function maps names to various ID systems.
  • Caching improves performance: Enable caching to avoid repeated API calls.
  • Integration with pandas: All data returns as pandas DataFrames for easy analysis.
