Using pybaseball for Data Collection
Beginner
10 min read
Nov 26, 2025
Introduction to pybaseball
pybaseball is a Python library for acquiring and analyzing baseball data. Created by James LeDoux, it gives access to several public sources, including Baseball Reference, FanGraphs, and Statcast (via Baseball Savant), and returns the results as pandas DataFrames. Whether you're conducting sabermetric research, building predictive models, or exploring baseball statistics, pybaseball replaces hand-written scrapers with a few function calls.
Installation and Setup
# Install pybaseball (run in a terminal)
pip install pybaseball
# Optionally install the plotting and analysis libraries used in this guide
pip install pybaseball pandas matplotlib seaborn scipy
# Verify installation (run in Python)
import pybaseball as pyb
print(pyb.__version__)
# Enable data caching for better performance
pyb.cache.enable()
Data Caching and Rate Limits
from pybaseball import cache
# Enable caching
cache.enable()
# Check cache status (the flag lives on the cache config object)
print(f"Cache enabled: {cache.config.enabled}")
# Clear cache if needed
# cache.purge()
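pybaseball's built-in cache can be complemented by keeping your own copy of a result on disk. The sketch below is one possible approach, not part of pybaseball: `load_or_fetch` is a hypothetical helper name, and the CSV path in the usage comment is only an example.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(path: str, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the DataFrame stored at `path`; call `fetch` only if the file is missing."""
    csv_path = Path(path)
    if csv_path.exists():
        # A local copy exists, so no network request is made
        return pd.read_csv(csv_path)
    df = fetch()
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df


# Example usage (the first call needs network access):
# from pybaseball import batting_stats
# batting_2024 = load_or_fetch("data/batting_2024.csv", lambda: batting_stats(2024))
```

Note that a CSV round trip can change column dtypes, so parquet may be a better fit if exact types matter.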
Player Identification with playerid_lookup
from pybaseball import playerid_lookup
# Look up a player by name
player = playerid_lookup('trout', 'mike')
print(player)
# Returns DataFrame with multiple ID types:
# key_mlbam: MLB Advanced Media ID (for Statcast)
# key_retro: Retrosheet ID
# key_bbref: Baseball Reference ID
# key_fangraphs: FanGraphs ID
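A lookup can return more than one row when several players share a name. The sketch below shows one way to choose among them by taking the player with the latest final MLB season. `most_recent_player` is a hypothetical helper, the sample rows are hand-built with placeholder IDs rather than real lookup output, and the `mlb_played_last` column name is assumed to match what playerid_lookup returns.

```python
import pandas as pd


def most_recent_player(matches: pd.DataFrame) -> pd.Series:
    """Pick the row with the latest final MLB season from a lookup result."""
    if matches.empty:
        raise ValueError("no players matched the lookup")
    # Drop entries with no recorded MLB playing time
    played = matches.dropna(subset=["mlb_played_last"])
    if played.empty:
        raise ValueError("no matched player has MLB playing time")
    return played.sort_values("mlb_played_last", ascending=False).iloc[0]


# Hand-built stand-in for a playerid_lookup() result; the IDs are placeholders
sample = pd.DataFrame({
    "name_last": ["doe", "doe"],
    "name_first": ["john", "john"],
    "key_mlbam": [111111, 222222],
    "mlb_played_first": [1995.0, 2018.0],
    "mlb_played_last": [2003.0, 2024.0],
})
player = most_recent_player(sample)
print(int(player["key_mlbam"]))
```

Choosing by most recent season is only a heuristic; for historical players you would filter on the debut year instead.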
Available Functions Reference
| Function | Data Source | Description |
|---|---|---|
| batting_stats() | FanGraphs | Season-level batting statistics |
| pitching_stats() | FanGraphs | Season-level pitching statistics |
| statcast_batter() | Baseball Savant | Pitch-level data for specific batter |
| statcast_pitcher() | Baseball Savant | Pitch-level data for specific pitcher |
| statcast() | Baseball Savant | All Statcast data for date range |
| playerid_lookup() | Chadwick Bureau | Player identification across systems |
| team_batting() | FanGraphs | Team batting statistics |
| team_pitching() | FanGraphs | Team pitching statistics |
Batting Statistics
from pybaseball import batting_stats
import pandas as pd
# Get batting stats for 2024 season
batting_2024 = batting_stats(2024)
print(batting_2024.head())
# Get stats for a range of seasons
batting_multi = batting_stats(2020, 2024)
print(f"Retrieved {len(batting_multi)} player-seasons")
# Filter for qualified batters (502 PA minimum)
qualified = batting_2024[batting_2024['PA'] >= 502].copy()
# Find top 10 by batting average
top_avg = qualified.nlargest(10, 'AVG')[['Name', 'Team', 'AVG', 'OBP', 'SLG', 'wRC+']]
print("\nTop 10 Batting Average (2024):")
print(top_avg.to_string(index=False))
# Find top power hitters
top_hr = qualified.nlargest(10, 'HR')[['Name', 'Team', 'HR', 'ISO', 'Barrel%']]
print("\nTop 10 Home Run Hitters:")
print(top_hr.to_string(index=False))
Pitching Statistics
from pybaseball import pitching_stats
# Get pitching stats for 2024
pitching_2024 = pitching_stats(2024)
# Filter for qualified starters (162 IP minimum)
qualified = pitching_2024[pitching_2024['IP'] >= 162].copy()
# Find top pitchers by WAR
top_war = qualified.nlargest(10, 'WAR')[['Name', 'Team', 'IP', 'ERA', 'FIP', 'K/9', 'BB/9', 'WAR']]
print("Top 10 Pitchers by WAR:")
print(top_war.to_string(index=False))
Statcast Data Access
from pybaseball import statcast_batter, statcast_pitcher, playerid_lookup
# Get Mike Trout's Statcast data
trout = playerid_lookup('trout', 'mike')
trout_id = trout.iloc[0]['key_mlbam']
# Retrieve Statcast data for specific date range
statcast_data = statcast_batter('2024-04-01', '2024-10-01', trout_id)
print(f"Retrieved {len(statcast_data)} pitches")
# Filter for balls in play
bip = statcast_data[statcast_data['type'] == 'X'].copy()
# Calculate key metrics
avg_ev = bip['launch_speed'].mean()
avg_la = bip['launch_angle'].mean()
print(f"Average Exit Velocity: {avg_ev:.1f} mph")
print(f"Average Launch Angle: {avg_la:.1f} degrees")
# Count barrels (Statcast has no 'barrel' column; launch_speed_angle == 6 marks a barrel)
barrels = (bip['launch_speed_angle'] == 6).sum()
barrel_rate = (barrels / len(bip)) * 100
print(f"Barrel Rate: {barrel_rate:.1f}%")
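The metrics above can be wrapped in a small function so the same summary works for any batter. This is a minimal sketch run on hand-built sample rows rather than real Statcast output: `contact_summary` is a hypothetical helper, and the hard-hit rate it adds uses the 95 mph exit-velocity cutoff.

```python
import pandas as pd

HARD_HIT_MPH = 95.0  # exit-velocity cutoff for a hard-hit ball


def contact_summary(pitches: pd.DataFrame) -> dict:
    """Summarize quality of contact for balls in play (type == 'X')."""
    bip = pitches[pitches["type"] == "X"].dropna(subset=["launch_speed", "launch_angle"])
    if bip.empty:
        return {"batted_balls": 0, "avg_ev": None, "avg_la": None, "hard_hit_pct": None}
    hard_hit = (bip["launch_speed"] >= HARD_HIT_MPH).mean() * 100
    return {
        "batted_balls": len(bip),
        "avg_ev": round(float(bip["launch_speed"].mean()), 1),
        "avg_la": round(float(bip["launch_angle"].mean()), 1),
        "hard_hit_pct": round(float(hard_hit), 1),
    }


# Hand-built sample: four balls in play (one untracked), one strike, one ball
sample = pd.DataFrame({
    "type": ["X", "X", "S", "X", "B", "X"],
    "launch_speed": [101.2, 88.4, None, 97.0, None, None],
    "launch_angle": [24.0, -5.0, None, 12.0, None, 30.0],
})
summary = contact_summary(sample)
print(summary)
```

Dropping rows with missing launch data before averaging keeps untracked batted balls from skewing the denominator.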
Creating Visualizations
from pybaseball import statcast_batter
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
# Get player data
player_id = 592450 # Aaron Judge
data = statcast_batter('2024-04-01', '2024-10-01', player_id)
# Filter for balls in play
bip = data[data['type'] == 'X'].dropna(subset=['launch_speed', 'launch_angle'])
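# Flag barrels (Statcast has no 'barrel' column; launch_speed_angle == 6 marks a barrel)
is_barrel = (bip['launch_speed_angle'] == 6).astype(int)
# Create scatter plot
plt.figure(figsize=(12, 8))
scatter = plt.scatter(bip['launch_angle'], bip['launch_speed'],
                      c=is_barrel, cmap='RdYlGn',
                      alpha=0.6, s=50)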
plt.xlabel('Launch Angle (degrees)', fontsize=12)
plt.ylabel('Exit Velocity (mph)', fontsize=12)
plt.title('Aaron Judge - Exit Velocity vs Launch Angle (2024)', fontsize=14)
plt.colorbar(scatter, label='Barrel (1=Yes, 0=No)')
plt.axhline(y=95, color='r', linestyle='--', alpha=0.3, label='95 mph threshold')
plt.legend()
plt.tight_layout()
plt.savefig('judge_exitvelo_launchangle.png', dpi=300)
plt.show()
League-Wide Trends Analysis
from pybaseball import batting_stats
import matplotlib.pyplot as plt
# Get multi-year data
years = range(2015, 2025)
hr_by_year = []
k_rate_by_year = []
for year in years:
    # batting_stats() returns only qualified hitters by default; a fixed
    # 502 PA cutoff would leave the shortened 2020 season empty
    data = batting_stats(year)
    hr_by_year.append(data['HR'].mean())
    k_rate_by_year.append(data['K%'].mean())
# Create visualization
fig, ax1 = plt.subplots(figsize=(12, 7))
ax1.set_xlabel('Season', fontsize=12)
ax1.set_ylabel('Average Home Runs', color='tab:red', fontsize=12)
ax1.plot(years, hr_by_year, color='tab:red', marker='o', linewidth=2, label='Home Runs')
ax1.tick_params(axis='y', labelcolor='tab:red')
ax2 = ax1.twinx()
ax2.set_ylabel('Strikeout Rate (%)', color='tab:blue', fontsize=12)
ax2.plot(years, k_rate_by_year, color='tab:blue', marker='s', linewidth=2, label='K%')
ax2.tick_params(axis='y', labelcolor='tab:blue')
plt.title('MLB Trends: Home Runs and Strikeout Rates (2015-2024)', fontsize=14)
plt.tight_layout()
plt.savefig('mlb_trends.png', dpi=300)
plt.show()
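The per-year loop can also be replaced by a single multi-season request, such as batting_stats(2015, 2024), followed by a pandas groupby. The sketch below shows only the aggregation step on hand-built rows: `season_means` is a hypothetical helper, and it assumes the multi-season DataFrame carries a `Season` column.

```python
from typing import List

import pandas as pd


def season_means(stats: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Average the given columns per season, ordered by season."""
    return stats.groupby("Season")[columns].mean().sort_index()


# Hand-built stand-in for a multi-season batting_stats() result
sample = pd.DataFrame({
    "Season": [2023, 2023, 2024, 2024],
    "HR": [20, 30, 25, 35],
    "K%": [0.20, 0.24, 0.22, 0.26],
})
trends = season_means(sample, ["HR", "K%"])
print(trends)
```

The resulting index holds the seasons, so `trends.index` and `trends["HR"]` can be passed straight to `ax1.plot`.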
Best Practices
- Always enable caching: Use pybaseball.cache.enable() at the start of scripts
- Save data locally: Store DataFrames to CSV/parquet for repeated analysis
- Respect rate limits: Add delays between large requests
- Filter after retrieval: Download broader datasets and filter locally
- Use the qual parameter: batting_stats() and pitching_stats() accept qual to set a minimum plate-appearance or innings-pitched threshold, which reduces data size
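The advice on rate limits can be sketched as a helper that splits a long date range into short windows and pauses between requests. Both function names below are hypothetical, the window size and pause length are arbitrary assumptions, and pybaseball may already split large Statcast queries internally, so treat this as an illustration of the pattern rather than a required step.

```python
import time
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple


def date_chunks(start: date, end: date, days: int = 7) -> Iterator[Tuple[str, str]]:
    """Yield (start, end) ISO date strings covering [start, end] in windows of at most `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor.isoformat(), chunk_end.isoformat()
        cursor = chunk_end + timedelta(days=1)


def fetch_in_chunks(fetch: Callable[[str, str], object], start: date, end: date,
                    days: int = 7, pause_seconds: float = 2.0) -> List[object]:
    """Call fetch(start, end) once per window, sleeping between consecutive calls."""
    results = []
    for i, (chunk_start, chunk_end) in enumerate(date_chunks(start, end, days)):
        if i > 0:
            time.sleep(pause_seconds)  # delay between requests
        results.append(fetch(chunk_start, chunk_end))
    return results


# Example usage (requires network access):
# import pandas as pd
# from pybaseball import statcast
# frames = fetch_in_chunks(statcast, date(2024, 4, 1), date(2024, 4, 30))
# april = pd.concat(frames, ignore_index=True)
```

Windows are inclusive on both ends and never overlap, so concatenating the results does not duplicate pitches.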
Key Takeaways
- Comprehensive data access: pybaseball provides unified access to FanGraphs, Baseball Savant, and Baseball Reference data.
- Simple API: Functions like batting_stats() and statcast_batter() make data retrieval straightforward.
- Player ID lookup: The playerid_lookup() function maps names to various ID systems.
- Caching improves performance: Enable caching so identical requests are served from disk instead of being sent again.
- Integration with pandas: All data returns as pandas DataFrames for easy analysis.