Nov 26, 2025
# Retrosheet Historical Data: Complete Guide to Baseball's Play-by-Play Archive
## What is Retrosheet
Retrosheet is a non-profit organization dedicated to collecting, digitizing, and freely distributing play-by-play accounts and related information for every Major League Baseball game ever played. Founded in 1989 by Dave Smith, Retrosheet represents one of the most ambitious volunteer-driven data preservation projects in sports history, providing researchers, analysts, and fans with unprecedented access to baseball's historical record.
### The Mission and Philosophy
Retrosheet's mission is rooted in the belief that detailed baseball data should be freely available to the public. Unlike commercial data providers, Retrosheet operates as an all-volunteer organization supported by donations and the tireless work of hundreds of contributors who digitize box scores, code play-by-play data from scorecards and newspaper accounts, and verify information for accuracy.
The organization's philosophy centers on:
- **Free and Open Access**: All Retrosheet data is available for download at no cost for personal, educational, and research use
- **Historical Completeness**: Capturing play-by-play data for every MLB game from 1871 to present
- **Data Accuracy**: Rigorous verification processes to ensure historical fidelity
- **Community Collaboration**: Relying on volunteers, researchers, and baseball enthusiasts worldwide
- **Preservation**: Digitizing ephemeral materials before they are lost to time
### Historical Significance
Retrosheet has fundamentally transformed baseball research. Before Retrosheet, comprehensive play-by-play data was largely inaccessible to the public. Researchers had to travel to physical archives, manually transcribe box scores, or pay prohibitive fees for limited datasets. Retrosheet democratized this information, enabling:
- Academic research on baseball history and strategy
- Development of advanced metrics like Win Probability Added (WPA) and Leverage Index
- Historical player and team analysis previously impossible
- Validation and correction of official baseball records
- New insights into the evolution of baseball tactics and rules
The project has digitized over 200,000 games and continues to expand coverage of early baseball eras and minor leagues.
## Retrosheet Data Coverage
Understanding what data Retrosheet provides and for which time periods is essential for effective research.
### Historical Timeline
**1871-1913: Early Baseball Era**
- Coverage: Box score data for most games
- Play-by-play: Limited availability
- Data quality: Variable, reconstructed from newspaper accounts
- Notable gaps: Some minor league and Negro League games
**1914-1949: Dead-ball to WWII Era**
- Coverage: Improved box score completeness
- Play-by-play: Approximately 60% of games
- Data quality: Good for major events, gaps in routine plays
- Sources: Newspaper accounts, team records, scorecards
**1950-1983: Post-war Expansion**
- Coverage: Near-complete box scores
- Play-by-play: 85% of games covered
- Data quality: Excellent for most metrics
- Sources: Official scoresheets, broadcaster notes, team archives
**1984-Present: Modern Era**
- Coverage: Complete play-by-play for every game
- Data quality: Comprehensive and highly accurate
- Sources: Official MLB data feeds, team files, digital records
- Real-time: Updates typically within days of games played
### Types of Data Available
**Game Logs**
- Basic game information (date, teams, score, attendance)
- Starting lineups and substitutions
- Umpire assignments
- Game duration and day/night designation
- Weather conditions (when available)
**Box Scores**
- Player batting and pitching statistics
- Defensive positions played
- Scoring by inning
- Team totals
**Event Files (Play-by-Play)**
- Every pitch and its outcome
- Baserunner advances
- Defensive plays and fielder positioning
- Substitutions with timing
- Pitching changes
- Earned/unearned run determination
- Hit location codes
- Detailed game state information
**Roster Files**
- Player names and IDs
- Team affiliations by season
- Positional designations
- Biographical information linkages
**Schedule Files**
- Game dates and times
- Home/away team designations
- Doubleheader information
- Postseason game details
## Event File Format Explained
Retrosheet's event files use a specialized format designed to capture every detail of a baseball game in a compact, parseable structure. Understanding this format is crucial for working with Retrosheet data.
### File Structure
Event files (`.EVA` for games in American League home parks, `.EVN` for National League parks) are plain text files with specific record types:
```
id,NYA201804020
version,2
info,visteam,TOR
info,hometeam,NYA
info,date,2018/04/02
info,number,0
info,starttime,1:05PM
info,daynight,day
info,usedh,true
info,temp,45
info,winddir,fromrf
start,bautj001,"Jose Bautista",0,1,9
start,donaj001,"Josh Donaldson",0,2,5
play,1,0,bautj001,00,X,S7/L
play,1,0,donaj001,12,CBFS,K
```
### Record Types
**ID Records**
- Format: `id,GAMEID`
- Unique identifier for each game
- Structure: `TEAMYYYYMMDD#` where # is game number that day
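The game ID can be decomposed with plain string slicing. A minimal sketch (the helper name `parse_game_id` is ours, not part of any library):

```python
from datetime import date

def parse_game_id(game_id):
    """Split a Retrosheet TEAMYYYYMMDD# game ID into home team, date, and game number."""
    team = game_id[:3]
    game_date = date(int(game_id[3:7]), int(game_id[7:9]), int(game_id[9:11]))
    game_number = int(game_id[11])  # 0 = single game, 1/2 = doubleheader games
    return team, game_date, game_number

print(parse_game_id("NYA201804020"))
# → ('NYA', datetime.date(2018, 4, 2), 0)
```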
**Version Records**
- Indicates event file format version
- Current version: 2
**Info Records**
- Game metadata: `info,FIELD,VALUE`
- Common fields: teams, date, park, temperature, umpires
- Variable completeness based on era
**Start Records**
- Starting lineups: `start,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS`
- Team: 0 (visitor) or 1 (home)
- Field positions use standard numbering (1=P, 2=C, 3=1B, etc.)
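A start record can be decoded with the `csv` module, which handles the quoted player name. This is an illustrative sketch; `decode_start` and the `POSITIONS` table are our own names (position 10 for the DH follows Retrosheet convention):

```python
import csv

# Standard scorekeeping position numbers
POSITIONS = {
    "1": "P", "2": "C", "3": "1B", "4": "2B", "5": "3B",
    "6": "SS", "7": "LF", "8": "CF", "9": "RF", "10": "DH",
}

def decode_start(line):
    # csv.reader preserves the quoted "NAME" field
    _, player_id, name, team, order, pos = next(csv.reader([line]))
    side = "home" if team == "1" else "visitor"
    return f"{name} ({player_id}) bats {order} for the {side} at {POSITIONS.get(pos, pos)}"

print(decode_start('start,bautj001,"Jose Bautista",0,1,9'))
# → Jose Bautista (bautj001) bats 1 for the visitor at RF
```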
**Play Records**
- Core play-by-play data
- Format: `play,INNING,TEAM,PLAYERID,COUNT,PITCHES,EVENT`
- Most complex and information-rich records
**Substitution Records**
- Player changes: `sub,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS`
**Data Records**
- Earned runs: `data,er,PLAYERID,RUNS`
- Additional context for specific plays
### Event Codes
Retrosheet uses a sophisticated coding system to represent every possible play outcome:
**Basic Events**
- `S`: Single
- `D`: Double
- `T`: Triple
- `HR`: Home run
- `K`: Strikeout
- `W`: Walk
- `HP`: Hit by pitch
- `E#`: Error by fielder #
- `FC`: Fielder's choice
**Batted Ball Location**
- Number indicates fielder: `S7` = single to left field
- Letter modifiers: `L` (line drive), `F` (fly ball), `G` (ground ball), `P` (popup)
- Example: `S7/L` = line drive single to left field
**Baserunner Advances**
- Appended after a period: `EVENT.X-Y`, where `X` is the runner's starting base (`B` for the batter) and `Y` is the destination (`H` for home); multiple advances are separated by semicolons
- `S8.2-H` = single to center, runner on second scores
- `1-3` = runner advances from first to third
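Splitting an event string into its base event and its advances reduces to two string splits. A small sketch (`parse_advances` is our own helper name):

```python
def parse_advances(event):
    """Split a Retrosheet event into its base event and runner advances.

    Advances follow the first '.' and are ';'-separated FROM-TO pairs;
    parenthesized details (errors, throws) stay attached to their pair.
    """
    if "." not in event:
        return event, []
    base_event, advance_str = event.split(".", 1)
    return base_event, advance_str.split(";")

print(parse_advances("S8.2-H;1-3"))
# → ('S8', ['2-H', '1-3'])
```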
**Pitch Sequences**
- Letters represent pitch results before event
- `C` = called strike
- `B` = ball
- `F` = foul ball
- `S` = swinging strike
- `X` = ball in play
- Example: `12,CBFS,K` = called strike, ball, foul, swinging strike three (strikeout)
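Tallying a pitch sequence is a one-liner with `collections.Counter`. A minimal sketch, assuming we only care about the five codes listed above; any other characters in a sequence (pickoff throws, annotation markers) are simply skipped here:

```python
from collections import Counter

# Map the common Retrosheet pitch codes to readable labels
PITCH_CODES = {"B": "ball", "C": "called strike", "S": "swinging strike",
               "F": "foul", "X": "in play"}

def tally_pitches(seq):
    # Characters not in the table are ignored for counting purposes
    return Counter(PITCH_CODES[c] for c in seq if c in PITCH_CODES)

print(tally_pitches("CBFX"))
```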
**Modifiers and Special Cases**
- `SH` = sacrifice hit (bunt)
- `SF` = sacrifice fly
- `GDP` = grounded into double play
- `NP` = no pitch (balk, etc.)
- `+` = additional defensive detail
### Parsing Complexity
The event notation handles complex scenarios:
```
play,5,1,ramij001,32,BBCBFFX,D7/L.2-H;1-H(E7/TH);B-3
```
This represents:
- 5th inning, home team batting
- Player: ramij001
- Final count: 3-2
- Pitch sequence: ball, ball, called strike, ball, foul, foul, ball in play
- Event: double to left field
- Runner on 2nd scores
- Runner on 1st also scores, on a throwing error charged to the left fielder (`E7/TH`)
- Batter takes 3rd base on the throw
## Parsing Event Files
Working with Retrosheet data programmatically requires parsing the event file format and transforming it into usable data structures.
### Python Parsing with pybaseball
The `pybaseball` library includes some Retrosheet helpers, but writing a small parser directly gives more control:
```python
import csv
import re

import pandas as pd


class RetrosheetParser:
    """Parse Retrosheet event files into structured DataFrames."""

    def __init__(self):
        self.games = []
        self.plays = []
        self.current_game = {}
        self.current_lineups = {'0': {}, '1': {}}

    def parse_event_file(self, filepath):
        """
        Parse a complete Retrosheet event file.

        Parameters
        ----------
        filepath : str
            Path to .EVN or .EVA event file

        Returns
        -------
        Tuple of (games_df, plays_df) DataFrames
        """
        with open(filepath, 'r') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # The record type is everything before the first comma
                record_type = line.split(',', 1)[0]
                if record_type == 'id':
                    self._start_new_game(line)
                elif record_type == 'info':
                    self._parse_info(line)
                elif record_type == 'start':
                    self._parse_start(line)
                elif record_type == 'play':
                    self._parse_play(line)
                elif record_type == 'sub':
                    self._parse_sub(line)
                elif record_type == 'data':
                    self._parse_data(line)
        # Flush the final game in the file
        if self.current_game:
            self.games.append(self.current_game.copy())
            self.current_game = {}
        return (
            pd.DataFrame(self.games),
            pd.DataFrame(self.plays)
        )

    def _start_new_game(self, line):
        """Initialize a new game record."""
        game_id = line.split(',')[1]
        # Save the previous game if one is in progress
        if self.current_game:
            self.games.append(self.current_game.copy())
        self.current_game = {'game_id': game_id}
        self.current_lineups = {'0': {}, '1': {}}

    def _parse_info(self, line):
        """Parse info records."""
        parts = line.split(',')
        if len(parts) >= 3:
            field = parts[1]
            value = ','.join(parts[2:])  # Handle values containing commas
            self.current_game[field] = value

    def _parse_start(self, line):
        """Parse a starting-lineup record."""
        # Format: start,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS
        parts = self._split_preserving_quotes(line)
        if len(parts) >= 6:
            _, player_id, name, team, batting_order, field_pos = parts[:6]
            self.current_lineups[team][batting_order] = {
                'player_id': player_id,
                'name': name.strip('"'),
                'position': field_pos
            }

    def _parse_play(self, line):
        """Parse a play record - the most complex record type."""
        parts = line.split(',')
        if len(parts) < 7:
            return
        play_data = {
            'game_id': self.current_game['game_id'],
            'inning': int(parts[1]),
            'team': int(parts[2]),
            'player_id': parts[3],
            'count': parts[4],
            'pitches': parts[5],
            'event': ','.join(parts[6:])  # Event text may contain commas
        }
        # Parse the count (balls then strikes; '??' when unknown)
        count = play_data['count']
        play_data['balls'] = int(count[0]) if count and count[0].isdigit() else None
        play_data['strikes'] = int(count[1]) if len(count) > 1 and count[1].isdigit() else None
        # Sequence length is a rough pitch count (marker characters inflate it)
        play_data['pitch_count'] = len(play_data['pitches'])
        # Extract the primary event type
        event_str = play_data['event']
        play_data['event_type'] = self._extract_event_type(event_str)
        # Split off baserunner advances if present
        if '.' in event_str:
            base_event, advances = event_str.split('.', 1)
        else:
            base_event, advances = event_str, None
        play_data['base_event'] = base_event
        play_data['advances'] = advances
        self.plays.append(play_data)

    def _parse_sub(self, line):
        """Parse a substitution record and update the lineup."""
        parts = self._split_preserving_quotes(line)
        if len(parts) >= 6:
            _, player_id, name, team, batting_order, field_pos = parts[:6]
            self.current_lineups[team][batting_order] = {
                'player_id': player_id,
                'name': name.strip('"'),
                'position': field_pos
            }

    def _parse_data(self, line):
        """Parse data records (earned runs, etc.)."""
        parts = line.split(',')
        if len(parts) >= 3 and parts[1] == 'er':
            # Earned runs charged to a pitcher
            self.current_game.setdefault('earned_runs', []).append({
                'player_id': parts[2],
                'er': int(parts[3]) if len(parts) > 3 else 0
            })

    def _extract_event_type(self, event_str):
        """Extract the primary event type from an event string.

        This prefix matching is a simplification: a few codes collide
        (e.g. SB for stolen base begins with 'S', WP for wild pitch
        with 'W'); a production parser should match full event codes.
        """
        base_event = event_str.split('.')[0].split('/')[0]
        if base_event.startswith('HR'):
            return 'Home Run'
        elif base_event.startswith('HP'):
            return 'Hit By Pitch'
        elif base_event.startswith('S'):
            return 'Single'
        elif base_event.startswith('D'):
            return 'Double'
        elif base_event.startswith('T'):
            return 'Triple'
        elif base_event.startswith('K'):
            return 'Strikeout'
        elif base_event.startswith('W'):
            return 'Walk'
        elif base_event.startswith('E'):
            return 'Error'
        elif base_event.startswith('FC'):
            return 'Fielders Choice'
        elif re.match(r'\d+', base_event):
            return 'Out'
        else:
            return 'Other'

    def _split_preserving_quotes(self, line):
        """Split a CSV line while preserving quoted strings."""
        return next(csv.reader([line]))


# Example usage
parser = RetrosheetParser()
games_df, plays_df = parser.parse_event_file('2018NYA.EVA')

print(f"Parsed {len(games_df)} games")
print(f"Parsed {len(plays_df)} plays")
print("\nFirst game info:")
print(games_df.iloc[0])
print("\nFirst 10 plays:")
print(plays_df.head(10))
```
### Downloading Retrosheet Data
Retrosheet data is organized by season and team:
```python
import io
import os
import zipfile

import requests


def download_retrosheet_season(year, output_dir='retrosheet_data'):
    """
    Download all Retrosheet event files for a given season.

    Parameters
    ----------
    year : int
        Season year to download
    output_dir : str
        Directory to save extracted files
    """
    os.makedirs(output_dir, exist_ok=True)

    # Retrosheet URLs
    base_url = 'https://www.retrosheet.org/events/'
    file_suffix = 'eve.zip' if year >= 1950 else 'eba.zip'
    url = f"{base_url}{year}{file_suffix}"
    print(f"Downloading {year} event files from {url}")

    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()

        # Extract the zip archive in memory
        with zipfile.ZipFile(io.BytesIO(response.content)) as z:
            z.extractall(output_dir)

        print(f"Successfully downloaded and extracted {year} data")
        print(f"Files saved to: {output_dir}")

        # List extracted event files
        files = [f for f in os.listdir(output_dir)
                 if f.endswith(('.EVA', '.EVN'))]
        print(f"Extracted {len(files)} event files")
        return files
    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return []


# Download multiple seasons
for season in range(2015, 2024):
    download_retrosheet_season(season, f'retrosheet_data/{season}')
```
### Using the Chadwick Tools
The Chadwick Baseball Bureau provides command-line tools for parsing Retrosheet data more efficiently than pure Python:
```python
import csv
import glob
import io
import subprocess

import pandas as pd


def parse_with_cwevent(event_files, year, output_csv='parsed_events.csv'):
    """
    Use Chadwick's cwevent tool to parse event files.

    Parameters
    ----------
    event_files : list
        List of .EVA or .EVN files to parse
    year : int
        Season year
    output_csv : str
        Output filename for parsed data

    Returns
    -------
    DataFrame with parsed play-by-play data
    """
    # cwevent field numbers to extract (-f); -y gives the season year
    fields = [
        '0',   # game_id
        '1',   # visiting team
        '2',   # inning
        '3',   # batting team
        '4',   # outs
        '5',   # balls
        '6',   # strikes
        '7',   # pitch sequence
        '12',  # batter
        '13',  # pitcher
        '26',  # event text
        '27',  # event type
        '39',  # leadoff flag
        '50',  # RBI on play
        '51',  # runs on play
    ]
    cmd = ['cwevent', '-f', ','.join(fields), '-y', str(year)] + event_files

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)

    # cwevent quotes text fields, so parse its output with the csv
    # module rather than splitting on commas
    columns = [
        'game_id', 'visiting_team', 'inning', 'batting_team',
        'outs', 'balls', 'strikes', 'pitch_sequence',
        'batter', 'pitcher', 'event_text', 'event_type',
        'leadoff_flag', 'rbi', 'runs'
    ]
    rows = list(csv.reader(io.StringIO(result.stdout)))
    df = pd.DataFrame(rows, columns=columns)

    # Convert numeric columns
    numeric_cols = ['inning', 'outs', 'balls', 'strikes', 'leadoff_flag', 'rbi', 'runs']
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    df.to_csv(output_csv, index=False)
    print(f"Parsed {len(df)} plays to {output_csv}")
    return df


# Example: Parse all 2018 Yankees games
event_files = glob.glob('retrosheet_data/2018/*.EVA')
plays_df = parse_with_cwevent(event_files, 2018, 'yankees_2018.csv')
```
## Building Game-Level Summaries
Once you have play-by-play data, you can build comprehensive game summaries:
```python
import pandas as pd


def build_game_summary(plays_df):
    """
    Aggregate play-by-play data into game-level statistics.

    Parameters
    ----------
    plays_df : DataFrame
        Play-by-play data from parsed event files

    Returns
    -------
    DataFrame with one row per game
    """
    summaries = []
    for game_id, plays in plays_df.groupby('game_id'):
        summary = {'game_id': game_id}

        # The visiting team comes from the data; the home team is the
        # first three characters of the Retrosheet game ID
        summary['visiting_team'] = plays.iloc[0]['visiting_team']
        summary['home_team'] = game_id[:3]

        # Count event types for each team (0 = visitor, 1 = home)
        for team_num in [0, 1]:
            team_plays = plays[plays['batting_team'].astype(str) == str(team_num)]
            prefix = 'visiting' if team_num == 0 else 'home'

            # Basic statistics
            hits = team_plays['event_type'].isin(
                ['Single', 'Double', 'Triple', 'Home Run'])
            summary[f'{prefix}_runs'] = team_plays['runs'].sum()
            summary[f'{prefix}_hits'] = hits.sum()
            summary[f'{prefix}_walks'] = (team_plays['event_type'] == 'Walk').sum()
            summary[f'{prefix}_strikeouts'] = (team_plays['event_type'] == 'Strikeout').sum()
            summary[f'{prefix}_errors'] = (team_plays['event_type'] == 'Error').sum()

            # Extra-base hits
            summary[f'{prefix}_doubles'] = (team_plays['event_type'] == 'Double').sum()
            summary[f'{prefix}_triples'] = (team_plays['event_type'] == 'Triple').sum()
            summary[f'{prefix}_home_runs'] = (team_plays['event_type'] == 'Home Run').sum()

        # Game metadata
        summary['total_plays'] = len(plays)
        summary['innings_played'] = plays['inning'].max()
        summary['total_pitches'] = plays['pitch_sequence'].str.len().sum()
        summaries.append(summary)

    return pd.DataFrame(summaries)


# Create game summary
game_summary = build_game_summary(plays_df)
print(game_summary.head())


def calculate_season_stats(game_summary, team_code):
    """
    Calculate season statistics for a specific team.

    Parameters
    ----------
    game_summary : DataFrame
        Game-level summaries
    team_code : str
        Three-letter team code (e.g., 'NYA')

    Returns
    -------
    Dict with season totals
    """
    home_games = game_summary[game_summary['home_team'] == team_code]
    away_games = game_summary[game_summary['visiting_team'] == team_code]

    # Note: 'home_runs'/'visiting_runs' are runs scored by each side;
    # 'home_home_runs'/'visiting_home_runs' are home run counts
    stats = {
        'team': team_code,
        'games': len(home_games) + len(away_games),
        'runs_scored': home_games['home_runs'].sum() + away_games['visiting_runs'].sum(),
        'runs_allowed': home_games['visiting_runs'].sum() + away_games['home_runs'].sum(),
        'hits': home_games['home_hits'].sum() + away_games['visiting_hits'].sum(),
        'walks': home_games['home_walks'].sum() + away_games['visiting_walks'].sum(),
        'strikeouts': home_games['home_strikeouts'].sum() + away_games['visiting_strikeouts'].sum(),
        'home_runs': home_games['home_home_runs'].sum() + away_games['visiting_home_runs'].sum()
    }

    # Derived statistics
    stats['runs_per_game'] = stats['runs_scored'] / stats['games']
    stats['run_differential'] = stats['runs_scored'] - stats['runs_allowed']
    return stats


# Example
yankees_stats = calculate_season_stats(game_summary, 'NYA')
print(yankees_stats)
```
## R Programming with Retrosheet
The `retrosheet` package in R provides excellent tools for working with Retrosheet data:
```r
# Install and load packages
install.packages("retrosheet")
library(retrosheet)
library(dplyr)
library(ggplot2)

# Download Retrosheet data for a season
get_retrosheet_data <- function(year) {
  # Game logs
  retrosheet::getRetrosheet("game", year)
  # Also get roster data
  retrosheet::getRetrosheet("roster", year)
}

# Parse event files that are already downloaded
parse_retrosheet_events <- function(year, event_files_dir) {
  # Point the package at the local Retrosheet directory
  options(retrosheet.dir = event_files_dir)
  # Parse all event files for the year
  games <- retrosheet::parseRetrosheet(year)
  return(games)
}

# Example: Load 2018 season
games_2018 <- parse_retrosheet_events(2018, "retrosheet_data/2018")

# View structure
str(games_2018)

# Create play-by-play dataset
plays_2018 <- games_2018$plays

# Examine first few plays
head(plays_2018)

# Filter to a specific team
yankees_plays <- plays_2018 %>%
  filter(BAT_HOME_ID == "NYA" | BAT_AWAY_ID == "NYA")

# Calculate batting statistics
# Note: the rates below use plate appearances as the denominator, so
# batting_average and slugging are per-PA approximations; the official
# stats divide by at-bats
batting_stats <- plays_2018 %>%
  filter(!is.na(BAT_ID)) %>%
  group_by(BAT_ID) %>%
  summarise(
    games = n_distinct(GAME_ID),
    plate_appearances = n(),
    hits = sum(H_FL > 0, na.rm = TRUE),        # H_FL: 0 = no hit, 1-4 = 1B..HR
    doubles = sum(EVENT_CD == 21, na.rm = TRUE),
    triples = sum(EVENT_CD == 22, na.rm = TRUE),
    home_runs = sum(EVENT_CD == 23, na.rm = TRUE),
    walks = sum(EVENT_CD == 14, na.rm = TRUE),
    strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    batting_average = hits / plate_appearances,
    slugging = (hits + doubles + 2 * triples + 3 * home_runs) / plate_appearances
  )

# Sort by batting average (minimum 100 PA)
batting_leaders <- batting_stats %>%
  filter(plate_appearances >= 100) %>%
  arrange(desc(batting_average)) %>%
  head(20)

print(batting_leaders)

# Pitching statistics
pitching_stats <- plays_2018 %>%
  filter(!is.na(PIT_ID)) %>%
  group_by(PIT_ID) %>%
  summarise(
    batters_faced = n(),
    hits_allowed = sum(H_FL > 0, na.rm = TRUE),
    walks_allowed = sum(EVENT_CD == 14, na.rm = TRUE),
    strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
    home_runs_allowed = sum(EVENT_CD == 23, na.rm = TRUE),
    .groups = "drop"
  )

print(pitching_stats)
```
### Advanced R Analysis
Historical trend analysis across decades:
```r
library(retrosheet)
library(dplyr)
library(ggplot2)
library(tidyr)

# Function to get league-wide season totals
get_season_totals <- function(year) {
  tryCatch({
    games <- parse_retrosheet_events(year, paste0("retrosheet_data/", year))
    plays <- games$plays

    # League totals; rates use plate appearances as the denominator,
    # so avg/obp/slg are approximations of the official statistics
    totals <- plays %>%
      summarise(
        year = year,
        plate_appearances = n(),
        hits = sum(H_FL > 0, na.rm = TRUE),    # H_FL: 0 = no hit, 1-4 = 1B..HR
        doubles = sum(EVENT_CD == 21, na.rm = TRUE),
        triples = sum(EVENT_CD == 22, na.rm = TRUE),
        home_runs = sum(EVENT_CD == 23, na.rm = TRUE),
        walks = sum(EVENT_CD == 14, na.rm = TRUE),
        strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
        hbp = sum(EVENT_CD == 16, na.rm = TRUE)
      ) %>%
      mutate(
        avg = hits / plate_appearances,
        obp = (hits + walks + hbp) / plate_appearances,
        slg = (hits + doubles + 2 * triples + 3 * home_runs) / plate_appearances,
        k_rate = strikeouts / plate_appearances,
        bb_rate = walks / plate_appearances,
        hr_rate = home_runs / plate_appearances
      )
    return(totals)
  }, error = function(e) {
    message(paste("Error processing", year, ":", e$message))
    return(NULL)
  })
}

# Analyze trends across baseball history
years <- seq(1950, 2023, by = 1)
historical_trends <- bind_rows(lapply(years, get_season_totals))

# Visualize strikeout rate over time
ggplot(historical_trends, aes(x = year, y = k_rate)) +
  geom_line(size = 1.2, color = "steelblue") +
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2) +
  labs(
    title = "MLB Strikeout Rate Trends (1950-2023)",
    subtitle = "Based on Retrosheet Play-by-Play Data",
    x = "Season",
    y = "Strikeout Rate (K/PA)",
    caption = "Data: Retrosheet"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12)
  )

# Home run rate trends
ggplot(historical_trends, aes(x = year, y = hr_rate * 100)) +
  geom_line(size = 1.2, color = "darkred") +
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2) +
  annotate("text", x = 1998, y = 3.5, label = "Steroid Era", size = 3) +
  annotate("text", x = 2019, y = 3.8, label = "Launch Angle\nRevolution", size = 3) +
  labs(
    title = "MLB Home Run Rate Trends (1950-2023)",
    x = "Season",
    y = "Home Run Rate (HR per 100 PA)",
    caption = "Data: Retrosheet"
  ) +
  theme_minimal()

# Compare eras
era_comparison <- historical_trends %>%
  mutate(
    era = case_when(
      year >= 1950 & year <= 1967 ~ "1950s-60s",
      year >= 1968 & year <= 1976 ~ "Pitching Era (1968-76)",
      year >= 1977 & year <= 1993 ~ "1970s-80s",
      year >= 1994 & year <= 2005 ~ "Steroid Era (1994-2005)",
      year >= 2006 & year <= 2014 ~ "Post-Steroid (2006-14)",
      year >= 2015 ~ "Launch Angle Era (2015+)"
    )
  ) %>%
  group_by(era) %>%
  summarise(
    avg = mean(avg, na.rm = TRUE),
    obp = mean(obp, na.rm = TRUE),
    slg = mean(slg, na.rm = TRUE),
    k_rate = mean(k_rate, na.rm = TRUE),
    bb_rate = mean(bb_rate, na.rm = TRUE),
    hr_rate = mean(hr_rate, na.rm = TRUE),
    .groups = "drop"
  )

print(era_comparison)

# Create era comparison visualization
era_long <- era_comparison %>%
  select(era, avg, obp, slg) %>%
  pivot_longer(cols = c(avg, obp, slg), names_to = "metric", values_to = "value")

ggplot(era_long, aes(x = era, y = value, fill = metric)) +
  geom_col(position = "dodge") +
  labs(
    title = "Offensive Metrics by Baseball Era",
    x = "Era",
    y = "Rate",
    fill = "Metric"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## Data Quality Considerations
While Retrosheet provides invaluable historical data, researchers must understand data quality variations across eras.
### Completeness by Era
**Pre-1920 Data**
- Many games have only box score-level data
- Play-by-play often reconstructed from newspaper accounts
- Pitch-level data generally unavailable
- Fielding details may be incomplete
**1920-1950**
- Improved completeness but gaps remain
- Some minor discrepancies with official records
- Limited pitch sequence information
- Defensive positioning not captured
**1950-1980**
- High-quality box scores and game logs
- Play-by-play coverage extensive but not universal
- Some substitution timing approximate
- Pitch counts sometimes estimated
**1980-Present**
- Near-complete play-by-play coverage
- Detailed event coding
- Accurate pitch sequences
- Comprehensive substitution records
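When batch-loading many seasons, it can help to tag each one with its expected completeness tier so downstream analyses can filter or weight accordingly. A small sketch using this guide's rough era boundaries (`coverage_tier` is our own helper, not an official Retrosheet rating):

```python
def coverage_tier(year):
    """Rough data-quality tier for a season, per the eras described above."""
    if year < 1920:
        return "box scores, reconstructed play-by-play"
    if year < 1950:
        return "partial play-by-play"
    if year < 1980:
        return "extensive play-by-play, some gaps"
    return "near-complete play-by-play"

print(coverage_tier(1975))
# → extensive play-by-play, some gaps
```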
### Known Limitations
**No Direct Pitch Tracking**
- Retrosheet records pitch sequences (ball, strike, foul) but not pitch type, velocity, or location
- For Statcast-era analysis (2015+), combine with Baseball Savant data
**Subjective Decisions**
- Hit vs. error classification may differ from official scorers
- Earned vs. unearned runs subject to scorer interpretation
- Wild pitch vs. passed ball distinctions
**Missing Context**
- Weather and field conditions spotty before 1990s
- Defensive positioning not captured (except 2015+ via Statcast integration)
- Injury and substitution reasons not always recorded
**Historical Uncertainties**
- Very early games (1870s-1890s) may have conflicting source accounts
- Negro League data coverage incomplete
- Minor league coverage limited
### Validation Approaches
```python
import pandas as pd


def validate_retrosheet_data(plays_df, official_totals):
    """
    Compare Retrosheet aggregations to official MLB totals.

    Parameters
    ----------
    plays_df : DataFrame
        Parsed play-by-play data
    official_totals : DataFrame
        Official MLB statistics for comparison

    Returns
    -------
    DataFrame showing discrepancies
    """
    # Calculate Retrosheet totals per player
    retrosheet_totals = plays_df.groupby('player_id').agg({
        'event_type': lambda x: x.isin(
            ['Single', 'Double', 'Triple', 'Home Run']).sum(),  # hits
        'runs': 'sum',
        'rbi': 'sum'
    }).rename(columns={'event_type': 'hits'})

    # Merge with official totals
    comparison = retrosheet_totals.merge(
        official_totals,
        left_index=True,
        right_on='player_id',
        suffixes=('_retrosheet', '_official')
    )

    # Calculate differences
    for stat in ['hits', 'runs', 'rbi']:
        comparison[f'{stat}_diff'] = (
            comparison[f'{stat}_retrosheet'] - comparison[f'{stat}_official']
        )

    # Flag significant discrepancies
    threshold = 5
    discrepancies = comparison[
        (comparison['hits_diff'].abs() > threshold) |
        (comparison['runs_diff'].abs() > threshold) |
        (comparison['rbi_diff'].abs() > threshold)
    ]
    return discrepancies
```
## Research Applications
Retrosheet data enables research previously impossible, spanning baseball analytics, sports science, economics, and history.
### Win Probability Models
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def build_win_probability_model(plays_df):
    """
    Build a win probability model from historical play-by-play data.

    Uses game state (inning, score, outs, runners) to predict the
    probability of the home team winning.
    """
    # Game-state features
    features = plays_df[['inning', 'outs', 'balls', 'strikes']].copy()
    features['score_diff'] = plays_df['home_score'] - plays_df['visiting_score']

    # Encode baserunner state (0-7 for the eight base-occupancy combinations)
    features['runners_state'] = (
        plays_df['runner_on_1st'].astype(int) * 1 +
        plays_df['runner_on_2nd'].astype(int) * 2 +
        plays_df['runner_on_3rd'].astype(int) * 4
    )

    # Target: did the home team win? Compute one outcome per game,
    # then map it back onto every play of that game so the target
    # aligns row-for-row with the features
    finals = plays_df.groupby('game_id')[['home_score', 'visiting_score']].last()
    home_won = finals['home_score'] > finals['visiting_score']
    target = plays_df['game_id'].map(home_won)

    model = GradientBoostingClassifier(n_estimators=100, max_depth=5)
    model.fit(features, target)
    return model


def calculate_wpa(plays_df, wp_model):
    """
    Calculate Win Probability Added for each play and player.
    """
    # Win probability before each play
    plays_df['wp_before'] = wp_model.predict_proba(
        plays_df[['inning', 'outs', 'balls', 'strikes', 'score_diff', 'runners_state']]
    )[:, 1]

    # Win probability after the play = before the next play in the same game
    plays_df['wp_after'] = plays_df.groupby('game_id')['wp_before'].shift(-1)

    # WPA is the change in win probability (from the home team's view)
    plays_df['wpa'] = plays_df['wp_after'] - plays_df['wp_before']

    # Aggregate by batter
    player_wpa = plays_df.groupby('batter')['wpa'].sum().sort_values(ascending=False)
    return player_wpa
```
### Historical Player Comparison
```python
def compare_players_across_eras(player1_id, player2_id, plays_df):
    """
    Compare two players while adjusting for era differences.

    Assumes helper functions calculate_player_stats() and
    get_era_averages() are defined elsewhere.
    """
    # Each player's raw statistics
    p1_stats = calculate_player_stats(player1_id, plays_df)
    p2_stats = calculate_player_stats(player2_id, plays_df)

    # League averages over each player's seasons
    p1_era_avg = get_era_averages(p1_stats['seasons'])
    p2_era_avg = get_era_averages(p2_stats['seasons'])

    # Index each rate to the league average (100 = league average),
    # in the spirit of OPS+ and ERA+
    def era_adjust(stats, era_avg):
        return {
            'avg_plus': stats['avg'] / era_avg['avg'] * 100,
            'obp_plus': stats['obp'] / era_avg['obp'] * 100,
            'slg_plus': stats['slg'] / era_avg['slg'] * 100
        }

    return era_adjust(p1_stats, p1_era_avg), era_adjust(p2_stats, p2_era_avg)
```
### Strategic Analysis
Retrosheet enables analysis of strategic decisions:
```r
# Analyze sacrifice bunt situations
# SH_FL == 1 marks plays scored as sacrifice hits, so every row here is
# a credited sacrifice; bunt singles appear as EVENT_CD == 20
bunt_analysis <- plays_2018 %>%
  filter(SH_FL == 1) %>%
  group_by(OUTS_CT, START_BASES_CD) %>%
  summarise(
    attempts = n(),
    bunt_singles = sum(EVENT_CD == 20, na.rm = TRUE),
    runs_scored_after = mean(RUNS_CT, na.rm = TRUE),
    .groups = "drop"
  )

# Compare to historical trends
bunt_trends <- lapply(1950:2023, function(yr) {
  plays <- get_season_plays(yr)
  plays %>%
    summarise(
      year = yr,
      bunts = sum(SH_FL == 1, na.rm = TRUE),
      plate_appearances = n()
    ) %>%
    mutate(bunt_rate = bunts / plate_appearances)
}) %>%
  bind_rows()

ggplot(bunt_trends, aes(x = year, y = bunt_rate * 100)) +
  geom_line() +
  labs(title = "Decline of the Sacrifice Bunt",
       y = "Sacrifice Bunts per 100 PA")
```
## Chadwick Baseball Bureau Tools
The Chadwick Tools are command-line utilities for efficiently processing Retrosheet data, named after Henry Chadwick, the inventor of the baseball box score.
### Key Chadwick Tools
**cwevent** - Extract play-by-play events. Note that `.EVA` files cover American League home games and `.EVN` files National League; the `.EV?` glob catches both.
```bash
cwevent -y 2018 -f 0-96 2018*.EV? > events_2018.csv
```
**cwgame** - Generate game-level summaries
```bash
cwgame -y 2018 2018*.EV? > games_2018.csv
```
**cwdaily** - Create daily player statistics
```bash
cwdaily -y 2018 2018*.EV? > daily_2018.csv
```
**cwsub** - Track substitutions
```bash
cwsub -y 2018 2018*.EV? > subs_2018.csv
```
**cwcomment** - Extract game comments and notes
```bash
cwcomment -y 2018 2018*.EV? > comments_2018.csv
```
### Field Selection in cwevent
The `-f` flag specifies which fields to output (0-96 available):
```bash
# Common field selections (Chadwick standard field numbers)
# 0  = GAME_ID
# 1  = AWAY_TEAM_ID (visiting team)
# 2  = INN_CT (inning)
# 3  = BAT_HOME_ID (batting team: 0 = visitor, 1 = home)
# 4  = OUTS_CT (outs)
# 10 = BAT_ID (batter)
# 14 = PIT_ID (pitcher)
# 29 = EVENT_TX (event text)
# 34 = EVENT_CD (event code)
# 43 = RBI_CT (RBI on play)
cwevent -f 0,1,2,3,4,10,14,29,34,43 -y 2018 2018NYA.EVA
```
### Python Integration
```python
import subprocess
import pandas as pd

def run_cwevent(event_files, year, fields, output_file='events.csv'):
    """
    Run cwevent and load the results into a pandas DataFrame.
    The -n flag makes cwevent emit a header row.
    """
    field_str = ','.join(map(str, fields))
    cmd = ['cwevent', '-f', field_str, '-y', str(year), '-n'] + event_files

    # Run the command; check=True raises on a non-zero exit code
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)

    # Save the raw CSV output, then load it
    with open(output_file, 'w') as f:
        f.write(result.stdout)
    return pd.read_csv(output_file)

# Example usage
fields = [0, 1, 2, 3, 4, 10, 14, 29, 34, 43]
events = run_cwevent(['2018NYA.EVA'], 2018, fields)
print(events.head())
```
## Building a Historical Database
Creating a comprehensive baseball database from Retrosheet data:
```python
import glob
import sqlite3
import pandas as pd

def create_retrosheet_database(db_path='baseball_history.db'):
    """
    Create a SQLite database structure for Retrosheet data.
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Games table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS games (
            game_id TEXT PRIMARY KEY,
            game_date DATE,
            visiting_team TEXT,
            home_team TEXT,
            visiting_score INTEGER,
            home_score INTEGER,
            innings INTEGER,
            day_night TEXT,
            park_id TEXT,
            attendance INTEGER
        )
    ''')

    # Plays table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS plays (
            play_id INTEGER PRIMARY KEY AUTOINCREMENT,
            game_id TEXT,
            inning INTEGER,
            batting_team INTEGER,
            outs INTEGER,
            balls INTEGER,
            strikes INTEGER,
            batter_id TEXT,
            pitcher_id TEXT,
            event_text TEXT,
            event_type TEXT,
            rbi INTEGER,
            runs_on_play INTEGER,
            FOREIGN KEY (game_id) REFERENCES games(game_id)
        )
    ''')

    # Players table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS players (
            player_id TEXT PRIMARY KEY,
            last_name TEXT,
            first_name TEXT,
            debut_date DATE,
            final_date DATE
        )
    ''')

    # Indexes for the most common lookups
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_game ON plays(game_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_batter ON plays(batter_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_pitcher ON plays(pitcher_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_games_date ON games(game_date)')
    conn.commit()
    return conn

def load_season_to_database(year, event_files, conn):
    """
    Load a full season of Retrosheet data into the database.
    """
    parser = RetrosheetParser()  # parser class defined earlier
    for event_file in event_files:
        games_df, plays_df = parser.parse_event_file(event_file)
        games_df.to_sql('games', conn, if_exists='append', index=False)
        plays_df.to_sql('plays', conn, if_exists='append', index=False)
    print(f"Loaded {year} season to database")

# Create database and load multiple seasons
conn = create_retrosheet_database()
for year in range(2010, 2024):
    event_files = glob.glob(f'retrosheet_data/{year}/*.EV?')
    load_season_to_database(year, event_files, conn)

# Query the database
# (assumes game_id begins with the year; official Retrosheet game IDs
# begin with the home team code, e.g. NYA201804030, so adjust the
# filter to match your parser's ID convention)
query = '''
SELECT
    batter_id,
    COUNT(*) AS plate_appearances,
    SUM(CASE WHEN event_type IN ('Single', 'Double', 'Triple', 'Home Run')
             THEN 1 ELSE 0 END) AS hits,
    SUM(CASE WHEN event_type = 'Home Run' THEN 1 ELSE 0 END) AS home_runs
FROM plays
WHERE game_id LIKE '2018%'
GROUP BY batter_id
HAVING plate_appearances >= 100
ORDER BY hits DESC
LIMIT 20
'''
batting_leaders = pd.read_sql_query(query, conn)
print(batting_leaders)
```
## Conclusion
Retrosheet represents one of the most valuable resources in baseball analytics, providing comprehensive play-by-play data spanning over 150 years of baseball history. From casual research projects to sophisticated analytical models, Retrosheet enables insights that would be impossible with traditional statistics alone.
### Key Takeaways
- **Free and Comprehensive**: Retrosheet provides play-by-play data spanning MLB history at no cost
- **Multiple Access Methods**: Direct file parsing, Chadwick tools, Python packages, and R libraries
- **Rich Detail**: Event files capture every pitch, play, and substitution with extensive coding
- **Research Applications**: Enables win probability, historical analysis, strategic research, and player evaluation
- **Data Quality Varies**: Coverage and detail improve significantly in recent decades
- **Community-Driven**: Volunteer organization that welcomes contributions
- **Complementary**: Works best when combined with other data sources like Statcast and FanGraphs
### Getting Started
1. **Download data** from Retrosheet.org for your period of interest
2. **Choose parsing method**: Chadwick tools for efficiency, Python/R for flexibility
3. **Build game summaries** by aggregating play-by-play data
4. **Validate results** against known statistics
5. **Explore research questions** enabled by granular historical data
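Step 4 deserves emphasis: before trusting any derived numbers, cross-check aggregates from your database against published totals. A minimal sketch of such a check, using a tiny in-memory fixture in place of a loaded database (the table layout follows the schema shown earlier; the sample rows and player ID are invented for illustration):

```python
import sqlite3

def season_hits_by_batter(conn, prefix):
    """Aggregate hits per batter for games whose IDs start with `prefix`,
    for cross-checking against published leaderboards."""
    cur = conn.execute(
        "SELECT batter_id, COUNT(*) FROM plays "
        "WHERE game_id LIKE ? AND event_type IN "
        "('Single', 'Double', 'Triple', 'Home Run') "
        "GROUP BY batter_id",
        (prefix + "%",),
    )
    return dict(cur.fetchall())

# In-memory fixture standing in for a loaded database (rows are made up)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (game_id TEXT, batter_id TEXT, event_type TEXT)")
conn.executemany("INSERT INTO plays VALUES (?, ?, ?)", [
    ("2018NYA001", "judga001", "Home Run"),
    ("2018NYA001", "judga001", "Single"),
    ("2018NYA001", "judga001", "Strikeout"),
])
conn.commit()
print(season_hits_by_batter(conn, "2018"))  # {'judga001': 2}
```

If a season total disagrees with a published figure, the discrepancy usually points to a parsing bug or a gap in the underlying event files, both worth finding before deeper analysis.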
### Resources
- **Retrosheet.org**: Official site with downloads and documentation
- **Chadwick Bureau**: Tools and parsing utilities
- **pybaseball**: Python package with Retrosheet integration
- **retrosheet (R)**: R package for event file parsing
- **Community forums**: retrosheet-discussion mailing list for technical help
Retrosheet has democratized access to baseball's historical record, making every researcher a potential baseball historian. Whether you're investigating the evolution of pitching strategy, building predictive models, or simply satisfying curiosity about your favorite player's career, Retrosheet provides the foundation for rigorous, data-driven baseball analysis.