Working with Retrosheet Historical Data

Intermediate · 10 min read · Nov 26, 2025
# Retrosheet Historical Data: Complete Guide to Baseball's Play-by-Play Archive

## What is Retrosheet

Retrosheet is a non-profit organization dedicated to collecting, digitizing, and freely distributing play-by-play accounts and related information for every Major League Baseball game ever played. Founded in 1989 by Dave Smith, Retrosheet represents one of the most ambitious volunteer-driven data preservation projects in sports history, providing researchers, analysts, and fans with unprecedented access to baseball's historical record.

### The Mission and Philosophy

Retrosheet's mission is rooted in the belief that detailed baseball data should be freely available to the public. Unlike commercial data providers, Retrosheet operates as an all-volunteer organization supported by donations and the tireless work of hundreds of contributors who digitize box scores, code play-by-play data from scorecards and newspaper accounts, and verify information for accuracy.

The organization's philosophy centers on:

- **Free and Open Access**: All Retrosheet data is available for download at no cost for personal, educational, and research use
- **Historical Completeness**: Capturing play-by-play data for every MLB game from 1871 to the present
- **Data Accuracy**: Rigorous verification processes to ensure historical fidelity
- **Community Collaboration**: Relying on volunteers, researchers, and baseball enthusiasts worldwide
- **Preservation**: Digitizing ephemeral materials before they are lost to time

### Historical Significance

Retrosheet has fundamentally transformed baseball research. Before Retrosheet, comprehensive play-by-play data was largely inaccessible to the public. Researchers had to travel to physical archives, manually transcribe box scores, or pay prohibitive fees for limited datasets.
Retrosheet democratized this information, enabling:

- Academic research on baseball history and strategy
- Development of advanced metrics like Win Probability Added (WPA) and Leverage Index
- Historical player and team analysis that was previously impossible
- Validation and correction of official baseball records
- New insights into the evolution of baseball tactics and rules

The project has digitized over 200,000 games and continues to expand coverage of early baseball eras and minor leagues.

## Retrosheet Data Coverage

Understanding what data Retrosheet provides, and for which time periods, is essential for effective research.

### Historical Timeline

**1871-1913: Early Baseball Era**
- Coverage: Box score data for most games
- Play-by-play: Limited availability
- Data quality: Variable, reconstructed from newspaper accounts
- Notable gaps: Some minor league and Negro League games

**1914-1949: Dead-ball to WWII Era**
- Coverage: Improved box score completeness
- Play-by-play: Approximately 60% of games
- Data quality: Good for major events, gaps in routine plays
- Sources: Newspaper accounts, team records, scorecards

**1950-1983: Post-war Expansion**
- Coverage: Near-complete box scores
- Play-by-play: 85% of games covered
- Data quality: Excellent for most metrics
- Sources: Official scoresheets, broadcaster notes, team archives

**1984-Present: Modern Era**
- Coverage: Complete play-by-play for every game
- Data quality: Comprehensive and highly accurate
- Sources: Official MLB data feeds, team files, digital records
- Real-time: Updates typically within days of games played

### Types of Data Available

**Game Logs**
- Basic game information (date, teams, score, attendance)
- Starting lineups and substitutions
- Umpire assignments
- Game duration and day/night designation
- Weather conditions (when available)

**Box Scores**
- Player batting and pitching statistics
- Defensive positions played
- Scoring by inning
- Team totals

**Event Files (Play-by-Play)**
- Every pitch and its outcome
- Baserunner advances
- Defensive plays and fielder positioning
- Substitutions with timing
- Pitching changes
- Earned/unearned run determination
- Hit location codes
- Detailed game state information

**Roster Files**
- Player names and IDs
- Team affiliations by season
- Positional designations
- Biographical information linkages

**Schedule Files**
- Game dates and times
- Home/away team designations
- Doubleheader information
- Postseason game details

## Event File Format Explained

Retrosheet's event files use a specialized format designed to capture every detail of a baseball game in a compact, parseable structure. Understanding this format is crucial for working with Retrosheet data.

### File Structure

Event files (.EVN or .EVA extensions) are plain text files with specific record types:

```
id,NYA201804020
version,2
info,visteam,TOR
info,hometeam,NYA
info,date,2018/04/02
info,number,0
info,starttime,1:05PM
info,daynight,day
info,usedh,true
info,temp,45
info,winddir,fromrf
start,bautj001,"Jose Bautista",0,1,9
start,donaj001,"Josh Donaldson",0,2,5
play,1,0,bautj001,00,X,S7/L
play,1,0,donaj001,12,CBFS,K
```

### Record Types

**ID Records**
- Format: `id,GAMEID`
- Unique identifier for each game
- Structure: `TEAMYYYYMMDD#` where # is the game number that day

**Version Records**
- Indicates the event file format version
- Current version: 2

**Info Records**
- Game metadata: `info,FIELD,VALUE`
- Common fields: teams, date, park, temperature, umpires
- Completeness varies by era

**Start Records**
- Starting lineups: `start,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS`
- Team: 0 (visitor) or 1 (home)
- Field positions use standard numbering (1=P, 2=C, 3=1B, etc.)
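As a concrete illustration, the sample file above can be decoded with a couple of small helpers. This is a hedged sketch: the function names (`parse_game_id`, `classify_event`) are ours rather than any library's, the event grammar is heavily simplified (the full notation, covered below, has many more cases), and the trailing game-number convention (0 for a single game, 1/2 for doubleheaders) follows Retrosheet's game-ID scheme:

```python
import re
from datetime import date

def parse_game_id(game_id: str) -> dict:
    """Split a Retrosheet game ID (TEAMYYYYMMDD#) into its components."""
    return {
        'home_team': game_id[:3],          # games are filed under the home team
        'date': date(int(game_id[3:7]), int(game_id[7:9]), int(game_id[9:11])),
        'game_number': int(game_id[11]),   # 0 = single game, 1/2 = doubleheader
    }

# Coarse event categories, checked in order (two-letter codes before
# single-letter ones so 'HR' is not misread as a single).
EVENT_PATTERNS = [
    (r'^HR', 'Home Run'), (r'^HP', 'Hit By Pitch'), (r'^FC', 'Fielders Choice'),
    (r'^S', 'Single'), (r'^D', 'Double'), (r'^T', 'Triple'),
    (r'^K', 'Strikeout'), (r'^W', 'Walk'), (r'^E\d', 'Error'),
]

def classify_event(event: str) -> str:
    """Map a Retrosheet event string to a coarse category."""
    base = event.split('.')[0].split('/')[0]  # strip advances and modifiers
    for pattern, label in EVENT_PATTERNS:
        if re.match(pattern, base):
            return label
    return 'Other'

parse_game_id('NYA201804020')   # home team NYA, 2018-04-02, single game
classify_event('S7/L')          # 'Single'
classify_event('K')             # 'Strikeout'
```

A real parser has to handle many more event shapes; the point here is only that each record type decomposes into a few fixed positional fields.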
**Play Records**
- Core play-by-play data
- Format: `play,INNING,TEAM,PLAYERID,COUNT,PITCHES,EVENT`
- The most complex and information-rich records

**Substitution Records**
- Player changes: `sub,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS`

**Data Records**
- Earned runs: `data,er,PLAYERID,RUNS`
- Additional context for specific plays

### Event Codes

Retrosheet uses a sophisticated coding system to represent every possible play outcome:

**Basic Events**
- `S`: Single
- `D`: Double
- `T`: Triple
- `HR`: Home run
- `K`: Strikeout
- `W`: Walk
- `HP`: Hit by pitch
- `E#`: Error by fielder #
- `FC`: Fielder's choice

**Batted Ball Location**
- The number indicates the fielder: `S7` = single to left field
- Letter modifiers: `L` (line drive), `F` (fly ball), `G` (ground ball), `P` (popup)
- Example: `S7/L` = line drive single to left field

**Baserunner Advances**
- Format: `EVENT.B-#` where B is the starting base and # is the destination
- `S8.2-H` = single to center, runner on second scores
- `1-3` = runner advances from first to third

**Pitch Sequences**
- Letters represent the pitch results leading up to the event
- `C` = called strike
- `B` = ball
- `F` = foul ball
- `S` = swinging strike
- `X` = ball in play
- Example: `12,CBFS,K` = called strike, ball, foul, swinging strike (strikeout)

**Modifiers and Special Cases**
- `SH` = sacrifice hit (bunt)
- `SF` = sacrifice fly
- `GDP` = grounded into double play
- `NP` = no pitch (balk, etc.)
- `+` = additional defensive detail

### Parsing Complexity

The event notation handles complex scenarios:

```
play,5,1,ramij001,32,BBCFFX,D7/L.2-H;1-H(E7/TH);B-3
```

This represents:

- 5th inning, home team batting
- Player: ramij001
- Count: 3-2
- Pitch sequence: ball, ball, called strike, foul, foul, ball in play
- Event: line-drive double to left field
- Runner on 2nd scores
- Runner on 1st scores (error by the left fielder on the throw home)
- Batter reaches 3rd base

## Parsing Event Files

Working with Retrosheet data programmatically requires parsing the event file format and transforming it into usable data structures.

### Python Parsing with pybaseball

The `pybaseball` library includes Retrosheet functionality, though direct parsing gives more control:

```python
import csv
import re

import pandas as pd


class RetrosheetParser:
    """Parse Retrosheet event files into structured DataFrames."""

    def __init__(self):
        self.games = []
        self.plays = []
        self.current_game = {}
        self.current_lineups = {'0': {}, '1': {}}

    def parse_event_file(self, filepath):
        """
        Parse a complete Retrosheet event file.

        Parameters
        ----------
        filepath : str
            Path to a .EVN or .EVA event file

        Returns
        -------
        Tuple of (games_df, plays_df) DataFrames
        """
        with open(filepath, 'r') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue

                # Split on the first comma only to get the record type
                record_type = line.split(',', 1)[0]

                if record_type == 'id':
                    self._start_new_game(line)
                elif record_type == 'info':
                    self._parse_info(line)
                elif record_type == 'start':
                    self._parse_start(line)
                elif record_type == 'play':
                    self._parse_play(line)
                elif record_type == 'sub':
                    self._parse_sub(line)
                elif record_type == 'data':
                    self._parse_data(line)

        # Flush the final game in the file (no 'id' record follows it)
        if self.current_game:
            self.games.append(self.current_game.copy())
            self.current_game = {}

        return (
            pd.DataFrame(self.games),
            pd.DataFrame(self.plays)
        )

    def _start_new_game(self, line):
        """Initialize a new game record."""
        game_id = line.split(',')[1]

        # Save the previous game if one exists
        if self.current_game:
            self.games.append(self.current_game.copy())

        # Reset for the new game
        self.current_game = {'game_id': game_id}
        self.current_lineups = {'0': {}, '1': {}}

    def _parse_info(self, line):
        """Parse info records."""
        parts = line.split(',')
        if len(parts) >= 3:
            field = parts[1]
            value = ','.join(parts[2:])  # handle values containing commas
            self.current_game[field] = value

    def _parse_start(self, line):
        """Parse a starting-lineup record."""
        # Format: start,PLAYERID,"NAME",TEAM,BATTINGORDER,FIELDPOS
        parts = self._split_preserving_quotes(line)
        if len(parts) >= 6:
            player_id, name = parts[1], parts[2].strip('"')
            team, batting_order, field_pos = parts[3], parts[4], parts[5]
            self.current_lineups[team][batting_order] = {
                'player_id': player_id,
                'name': name,
                'position': field_pos,
            }

    def _parse_play(self, line):
        """Parse a play record - the most complex record type."""
        parts = line.split(',')
        if len(parts) >= 7:
            play_data = {
                'game_id': self.current_game['game_id'],
                'inning': int(parts[1]),
                'team': int(parts[2]),
                'player_id': parts[3],
                'count': parts[4],
                'pitches': parts[5],
                'event': ','.join(parts[6:]),  # the event may contain commas
            }

            # Parse the count (two digits: balls then strikes)
            count = play_data['count']
            play_data['balls'] = int(count[0]) if count and count[0].isdigit() else None
            play_data['strikes'] = (int(count[1])
                                    if len(count) > 1 and count[1].isdigit() else None)

            # Pitch sequence length
            play_data['pitch_count'] = len(play_data['pitches']) if play_data['pitches'] else 0

            # Extract the basic event type
            event_str = play_data['event']
            play_data['event_type'] = self._extract_event_type(event_str)

            # Split off baserunner advances if present
            if '.' in event_str:
                base_event, advances = event_str.split('.', 1)
                play_data['base_event'] = base_event
                play_data['advances'] = advances
            else:
                play_data['base_event'] = event_str
                play_data['advances'] = None

            self.plays.append(play_data)

    def _parse_sub(self, line):
        """Parse a substitution record."""
        parts = self._split_preserving_quotes(line)
        if len(parts) >= 6:
            player_id, name = parts[1], parts[2].strip('"')
            team, batting_order, field_pos = parts[3], parts[4], parts[5]
            # Update the lineup in place
            self.current_lineups[team][batting_order] = {
                'player_id': player_id,
                'name': name,
                'position': field_pos,
            }

    def _parse_data(self, line):
        """Parse data records (earned runs, etc.)."""
        parts = line.split(',')
        if len(parts) >= 3:
            data_type = parts[1]
            if data_type == 'er':
                # Earned runs charged to a pitcher
                self.current_game.setdefault('earned_runs', []).append({
                    'player_id': parts[2],
                    'er': int(parts[3]) if len(parts) > 3 else 0,
                })

    def _extract_event_type(self, event_str):
        """Extract the primary event type from an event string."""
        # Remove modifiers and baserunning details
        base_event = event_str.split('.')[0].split('/')[0]

        # Check two-letter codes before single-letter ones
        if base_event.startswith('HR'):
            return 'Home Run'
        elif base_event.startswith('HP'):
            return 'Hit By Pitch'
        elif base_event.startswith('FC'):
            return 'Fielders Choice'
        elif base_event.startswith('S'):
            return 'Single'
        elif base_event.startswith('D'):
            return 'Double'
        elif base_event.startswith('T'):
            return 'Triple'
        elif base_event.startswith('K'):
            return 'Strikeout'
        elif base_event.startswith('W'):
            return 'Walk'
        elif base_event.startswith('E'):
            return 'Error'
        elif re.match(r'\d+', base_event):
            return 'Out'
        else:
            return 'Other'

    def _split_preserving_quotes(self, line):
        """Split a CSV line while preserving quoted strings."""
        return list(csv.reader([line]))[0]


# Example usage
parser = RetrosheetParser()
games_df, plays_df = parser.parse_event_file('2018NYA.EVA')

print(f"Parsed {len(games_df)} games")
print(f"Parsed {len(plays_df)} plays")
print("\nFirst game info:")
print(games_df.iloc[0])
print("\nFirst 10 plays:")
print(plays_df.head(10))
```

### Downloading Retrosheet Data

Retrosheet data is organized by season and team:

```python
import io
import os
import zipfile

import requests


def download_retrosheet_season(year, output_dir='retrosheet_data'):
    """
    Download all Retrosheet event files for a given season.

    Parameters
    ----------
    year : int
        Season year to download
    output_dir : str
        Directory to save extracted files
    """
    os.makedirs(output_dir, exist_ok=True)

    # Retrosheet packages each season's event files as a zip archive
    base_url = 'https://www.retrosheet.org/events/'
    file_suffix = 'eve.zip' if year >= 1950 else 'eba.zip'
    url = f"{base_url}{year}{file_suffix}"

    print(f"Downloading {year} event files from {url}")

    try:
        response = requests.get(url)
        response.raise_for_status()

        # Extract the zip archive
        with zipfile.ZipFile(io.BytesIO(response.content)) as z:
            z.extractall(output_dir)

        print(f"Successfully downloaded and extracted {year} data")
        print(f"Files saved to: {output_dir}")

        # List the extracted event files
        files = [f for f in os.listdir(output_dir)
                 if f.endswith('.EVA') or f.endswith('.EVN')]
        print(f"Extracted {len(files)} event files")
        return files

    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return []


# Download multiple seasons
for season in range(2015, 2024):
    download_retrosheet_season(season, f'retrosheet_data/{season}')
```

### Using the Chadwick Tools

The Chadwick Baseball Bureau provides
command-line tools for parsing Retrosheet data more efficiently than pure Python:

```python
import glob
import subprocess

import pandas as pd


def parse_with_cwevent(event_files, year, output_csv='parsed_events.csv'):
    """
    Use Chadwick's cwevent tool to parse event files.

    Parameters
    ----------
    event_files : list
        List of .EVA or .EVN files to parse
    year : int
        Season year
    output_csv : str
        Output filename for parsed data

    Returns
    -------
    DataFrame with parsed play-by-play data
    """
    # cwevent field selections (-f chooses output fields, -y the year)
    fields = [
        '0',   # game_id
        '1',   # visiting team
        '2',   # inning
        '3',   # batting team
        '4',   # outs
        '5',   # balls
        '6',   # strikes
        '7',   # pitch sequence
        '12',  # batter
        '13',  # pitcher
        '26',  # event text
        '27',  # event type
        '39',  # leadoff flag
        '50',  # RBI on play
        '51',  # runs on play
    ]
    field_arg = ','.join(fields)

    # Build and execute the cwevent command
    cmd = ['cwevent', '-f', field_arg, '-y', str(year)]
    cmd.extend(event_files)
    result = subprocess.run(cmd, capture_output=True, text=True)

    # Parse the CSV output into a DataFrame
    lines = result.stdout.strip().split('\n')

    # Column names matching the fields selected above
    columns = [
        'game_id', 'visiting_team', 'inning', 'batting_team', 'outs',
        'balls', 'strikes', 'pitch_sequence', 'batter', 'pitcher',
        'event_text', 'event_type', 'leadoff_flag', 'rbi', 'runs'
    ]
    data = [line.split(',') for line in lines]
    df = pd.DataFrame(data, columns=columns)

    # Convert numeric columns
    numeric_cols = ['inning', 'outs', 'balls', 'strikes',
                    'leadoff_flag', 'rbi', 'runs']
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # Save to CSV
    df.to_csv(output_csv, index=False)
    print(f"Parsed {len(df)} plays to {output_csv}")

    return df


# Example: parse all 2018 Yankees games
event_files = glob.glob('retrosheet_data/2018/*.EVA')
plays_df = parse_with_cwevent(event_files, 2018, 'yankees_2018.csv')
```

## Building Game-Level Summaries

Once you have play-by-play data, you can build comprehensive game
summaries:

```python
import pandas as pd


def build_game_summary(plays_df):
    """
    Aggregate play-by-play data into game-level statistics.

    Parameters
    ----------
    plays_df : DataFrame
        Play-by-play data from parsed event files

    Returns
    -------
    DataFrame with one row per game
    """
    summaries = []

    for game_id, plays in plays_df.groupby('game_id'):
        summary = {'game_id': game_id}

        # Teams: the visitor comes from the data; the home team is the
        # first three characters of the Retrosheet game ID
        summary['visiting_team'] = plays.iloc[0]['visiting_team']
        summary['home_team'] = game_id[:3]

        # Count event types for each team
        # (note: '{prefix}_runs' columns are runs scored, while
        #  '{prefix}_home_runs' columns are home run counts)
        for team_num in [0, 1]:
            team_plays = plays[plays['batting_team'] == str(team_num)]
            prefix = 'visiting' if team_num == 0 else 'home'

            # Basic statistics
            hit_types = ['Single', 'Double', 'Triple', 'Home Run']
            summary[f'{prefix}_runs'] = team_plays['runs'].sum()
            summary[f'{prefix}_hits'] = team_plays[team_plays['event_type'].isin(hit_types)].shape[0]
            summary[f'{prefix}_walks'] = team_plays[team_plays['event_type'] == 'Walk'].shape[0]
            summary[f'{prefix}_strikeouts'] = team_plays[team_plays['event_type'] == 'Strikeout'].shape[0]
            summary[f'{prefix}_errors'] = team_plays[team_plays['event_type'] == 'Error'].shape[0]

            # Extra-base hits
            summary[f'{prefix}_doubles'] = team_plays[team_plays['event_type'] == 'Double'].shape[0]
            summary[f'{prefix}_triples'] = team_plays[team_plays['event_type'] == 'Triple'].shape[0]
            summary[f'{prefix}_home_runs'] = team_plays[team_plays['event_type'] == 'Home Run'].shape[0]

        # Game metadata
        summary['total_plays'] = len(plays)
        summary['innings_played'] = plays['inning'].max()
        summary['total_pitches'] = plays['pitch_sequence'].str.len().sum()

        summaries.append(summary)

    return pd.DataFrame(summaries)


# Create game summaries
game_summary = build_game_summary(plays_df)
print(game_summary.head())


def calculate_season_stats(game_summary, team_code):
    """
    Calculate season statistics for a specific team.

    Parameters
    ----------
    game_summary : DataFrame
        Game-level summaries
    team_code : str
        Three-letter team code (e.g., 'NYA')

    Returns
    -------
    Dict with season totals
    """
    # Games as the home team and as the visiting team
    home_games = game_summary[game_summary['home_team'] == team_code]
    away_games = game_summary[game_summary['visiting_team'] == team_code]

    # Combine statistics
    stats = {
        'team': team_code,
        'games': len(home_games) + len(away_games),
        'runs_scored': home_games['home_runs'].sum() + away_games['visiting_runs'].sum(),
        'runs_allowed': home_games['visiting_runs'].sum() + away_games['home_runs'].sum(),
        'hits': home_games['home_hits'].sum() + away_games['visiting_hits'].sum(),
        'walks': home_games['home_walks'].sum() + away_games['visiting_walks'].sum(),
        'strikeouts': home_games['home_strikeouts'].sum() + away_games['visiting_strikeouts'].sum(),
        'home_runs': home_games['home_home_runs'].sum() + away_games['visiting_home_runs'].sum()
    }

    # Derived statistics
    stats['runs_per_game'] = stats['runs_scored'] / stats['games']
    stats['run_differential'] = stats['runs_scored'] - stats['runs_allowed']

    return stats


# Example
yankees_stats = calculate_season_stats(game_summary, 'NYA')
print(yankees_stats)
```

## R Programming with Retrosheet

The `retrosheet` package in R provides excellent tools for working with Retrosheet data:

```r
# Install and load packages
install.packages("retrosheet")
library(retrosheet)
library(dplyr)
library(ggplot2)

# Get Retrosheet data for a season
# This downloads and parses automatically
get_retrosheet_data <- function(year) {
  # Download game data
  retrosheet::getRetrosheet("game", year)
  # Also get roster data
  retrosheet::getRetrosheet("roster", year)
}

# Parse event files that are already downloaded
parse_retrosheet_events <- function(year, event_files_dir) {
  # Set up the Retrosheet directory
  options(retrosheet.dir = event_files_dir)
  # Parse all event files for the year
  games <- retrosheet::parseRetrosheet(year)
  return(games)
}

# Example: load the 2018 season
games_2018 <- parse_retrosheet_events(2018, "retrosheet_data/2018")

# View structure
str(games_2018)

# Create the play-by-play dataset
plays_2018 <- games_2018$plays

# Examine the first few plays
head(plays_2018)

# Filter to a specific team
yankees_plays <- plays_2018 %>%
  filter(BAT_HOME_ID == "NYA" | BAT_AWAY_ID == "NYA")

# Calculate batting statistics
# (Chadwick EVENT_CD values: 3 = strikeout, 14 = walk,
#  20 = single, 21 = double, 22 = triple, 23 = home run;
#  H_FL > 0 marks any hit)
batting_stats <- plays_2018 %>%
  filter(!is.na(BAT_ID)) %>%
  group_by(BAT_ID) %>%
  summarise(
    games = n_distinct(GAME_ID),
    plate_appearances = n(),
    hits = sum(H_FL > 0, na.rm = TRUE),
    doubles = sum(EVENT_CD == 21, na.rm = TRUE),
    triples = sum(EVENT_CD == 22, na.rm = TRUE),
    home_runs = sum(EVENT_CD == 23, na.rm = TRUE),
    walks = sum(EVENT_CD == 14, na.rm = TRUE),
    strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    # note: true AVG and SLG use at-bats; plate appearances are
    # used here as a simple approximation
    batting_average = hits / plate_appearances,
    slugging = (hits + doubles + 2*triples + 3*home_runs) / plate_appearances
  )

# Sort by batting average (minimum 100 PA)
batting_leaders <- batting_stats %>%
  filter(plate_appearances >= 100) %>%
  arrange(desc(batting_average)) %>%
  head(20)

print(batting_leaders)

# Pitching statistics
pitching_stats <- plays_2018 %>%
  filter(!is.na(PIT_ID)) %>%
  group_by(PIT_ID) %>%
  summarise(
    batters_faced = n(),
    hits_allowed = sum(H_FL > 0, na.rm = TRUE),
    walks_allowed = sum(EVENT_CD == 14, na.rm = TRUE),
    strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
    home_runs_allowed = sum(EVENT_CD == 23, na.rm = TRUE),
    .groups = "drop"
  )

print(pitching_stats)
```

### Advanced R Analysis

Historical trend analysis across decades:

```r
library(retrosheet)
library(dplyr)
library(ggplot2)
library(tidyr)

# Function to get season totals
get_season_totals <- function(year) {
  tryCatch({
    games <- parse_retrosheet_events(year, paste0("retrosheet_data/", year))
    plays <- games$plays

    # League totals for the season
    totals <- plays %>%
      summarise(
        year = year,
        plate_appearances = n(),
        hits = sum(H_FL > 0, na.rm = TRUE),
        doubles = sum(EVENT_CD == 21, na.rm = TRUE),
        triples = sum(EVENT_CD == 22, na.rm = TRUE),
        home_runs = sum(EVENT_CD == 23, na.rm = TRUE),
        walks = sum(EVENT_CD == 14, na.rm = TRUE),
        strikeouts = sum(EVENT_CD == 3, na.rm = TRUE),
        hbp = sum(EVENT_CD == 16, na.rm = TRUE)
      ) %>%
      mutate(
        avg = hits / plate_appearances,
        obp = (hits + walks + hbp) / plate_appearances,
        slg = (hits + doubles + 2*triples + 3*home_runs) / plate_appearances,
        k_rate = strikeouts / plate_appearances,
        bb_rate = walks / plate_appearances,
        hr_rate = home_runs / plate_appearances
      )

    return(totals)
  }, error = function(e) {
    message(paste("Error processing", year, ":", e$message))
    return(NULL)
  })
}

# Analyze trends across baseball history
years <- seq(1950, 2023, by = 1)
historical_trends <- bind_rows(lapply(years, get_season_totals))

# Visualize strikeout rate over time
ggplot(historical_trends, aes(x = year, y = k_rate)) +
  geom_line(size = 1.2, color = "steelblue") +
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2) +
  labs(
    title = "MLB Strikeout Rate Trends (1950-2023)",
    subtitle = "Based on Retrosheet Play-by-Play Data",
    x = "Season",
    y = "Strikeout Rate (K/PA)",
    caption = "Data: Retrosheet"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12)
  )

# Home run rate trends
ggplot(historical_trends, aes(x = year, y = hr_rate * 100)) +
  geom_line(size = 1.2, color = "darkred") +
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2) +
  annotate("text", x = 1998, y = 3.5, label = "Steroid Era", size = 3) +
  annotate("text", x = 2019, y = 3.8, label = "Launch Angle\nRevolution", size = 3) +
  labs(
    title = "MLB Home Run Rate Trends (1950-2023)",
    x = "Season",
    y = "Home Run Rate (HR per 100 PA)",
    caption = "Data: Retrosheet"
  ) +
  theme_minimal()

# Compare eras
era_comparison <- historical_trends %>%
  mutate(
    era = case_when(
      year >= 1950 & year <= 1967 ~ "1950s-60s",
      year >= 1968 & year <= 1976 ~ "Pitching Era (1968-76)",
      year >= 1977 & year <= 1993 ~ "1970s-80s",
      year >= 1994 & year <= 2005 ~ "Steroid Era (1994-2005)",
      year >= 2006 & year <= 2014 ~ "Post-Steroid (2006-14)",
      year >= 2015 ~ "Launch Angle Era (2015+)"
    )
  ) %>%
  group_by(era) %>%
  summarise(
    avg = mean(avg, na.rm = TRUE),
    obp = mean(obp, na.rm = TRUE),
    slg = mean(slg, na.rm = TRUE),
    k_rate = mean(k_rate, na.rm = TRUE),
    bb_rate = mean(bb_rate, na.rm = TRUE),
    hr_rate = mean(hr_rate, na.rm = TRUE),
    .groups = "drop"
  )

print(era_comparison)

# Era comparison visualization
era_long <- era_comparison %>%
  select(era, avg, obp, slg) %>%
  pivot_longer(cols = c(avg, obp, slg), names_to = "metric", values_to = "value")

ggplot(era_long, aes(x = era, y = value, fill = metric)) +
  geom_col(position = "dodge") +
  labs(
    title = "Offensive Metrics by Baseball Era",
    x = "Era",
    y = "Rate",
    fill = "Metric"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Data Quality Considerations

While Retrosheet provides invaluable historical data, researchers must understand data quality variations across eras.
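One practical way to act on this is to screen a parsed season for games whose play counts look implausibly low, a rough signal of box-score-only or partially reconstructed data. This is a sketch under stated assumptions: it expects a `plays_df` with one row per play and a `game_id` column, as produced by the parsing examples earlier, and the threshold of 50 is an arbitrary heuristic, not a Retrosheet rule:

```python
import pandas as pd


def coverage_report(plays_df: pd.DataFrame, min_plays: int = 50) -> pd.DataFrame:
    """Flag games with suspiciously few recorded plays.

    A complete nine-inning game typically has roughly 70-90 plate
    appearances, so far fewer rows usually means partial play-by-play.
    """
    report = plays_df.groupby('game_id').size().rename('n_plays').to_frame()
    report['suspect'] = report['n_plays'] < min_plays
    return report


# Toy example: one complete-looking game and one sparse game
toy = pd.DataFrame({'game_id': ['G1'] * 80 + ['G2'] * 5})
print(coverage_report(toy))
```

Filtering out (or at least annotating) suspect games before fitting era-spanning models keeps reconstruction gaps from masquerading as real historical trends.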
### Completeness by Era

**Pre-1920 Data**
- Many games have only box score-level data
- Play-by-play often reconstructed from newspaper accounts
- Pitch-level data generally unavailable
- Fielding details may be incomplete

**1920-1950**
- Improved completeness, but gaps remain
- Some minor discrepancies with official records
- Limited pitch sequence information
- Defensive positioning not captured

**1950-1980**
- High-quality box scores and game logs
- Play-by-play coverage extensive but not universal
- Some substitution timing approximate
- Pitch counts sometimes estimated

**1980-Present**
- Near-complete play-by-play coverage
- Detailed event coding
- Accurate pitch sequences
- Comprehensive substitution records

### Known Limitations

**No Direct Pitch Tracking**
- Retrosheet records pitch sequences (ball, strike, foul) but not pitch type, velocity, or location
- For Statcast-era analysis (2015+), combine with Baseball Savant data

**Subjective Decisions**
- Hit vs. error classification may differ from official scorers
- Earned vs. unearned runs are subject to scorer interpretation
- Wild pitch vs. passed ball distinctions

**Missing Context**
- Weather and field conditions are spotty before the 1990s
- Defensive positioning not captured (except 2015+ via Statcast integration)
- Injury and substitution reasons not always recorded

**Historical Uncertainties**
- Very early games (1870s-1890s) may have conflicting source accounts
- Negro League data coverage incomplete
- Minor league coverage limited

### Validation Approaches

```python
def validate_retrosheet_data(plays_df, official_totals):
    """
    Compare Retrosheet aggregations to official MLB totals.

    Parameters
    ----------
    plays_df : DataFrame
        Parsed play-by-play data
    official_totals : DataFrame
        Official MLB statistics for comparison

    Returns
    -------
    DataFrame showing discrepancies
    """
    # Calculate Retrosheet totals
    hit_types = ['Single', 'Double', 'Triple', 'Home Run']
    retrosheet_totals = plays_df.groupby('player_id').agg({
        'event_type': lambda x: x.isin(hit_types).sum(),  # hits
        'runs': 'sum',
        'rbi': 'sum'
    }).rename(columns={'event_type': 'hits'})

    # Merge with official totals
    comparison = retrosheet_totals.merge(
        official_totals,
        left_index=True,
        right_on='player_id',
        suffixes=('_retrosheet', '_official')
    )

    # Calculate differences
    for stat in ['hits', 'runs', 'rbi']:
        comparison[f'{stat}_diff'] = (
            comparison[f'{stat}_retrosheet'] - comparison[f'{stat}_official']
        )

    # Flag significant discrepancies
    threshold = 5
    discrepancies = comparison[
        (comparison['hits_diff'].abs() > threshold) |
        (comparison['runs_diff'].abs() > threshold) |
        (comparison['rbi_diff'].abs() > threshold)
    ]

    return discrepancies
```

## Research Applications

Retrosheet data enables research previously impossible, spanning baseball analytics, sports science, economics, and history.

### Win Probability Models

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def build_win_probability_model(plays_df):
    """
    Build a win probability model from historical play-by-play data.

    Uses game state (inning, score, outs, runners) to predict the
    probability of the home team winning.
    """
    # Create features from the game state
    features = plays_df[['inning', 'outs', 'balls', 'strikes']].copy()

    # Score differential
    features['score_diff'] = plays_df['home_score'] - plays_df['visiting_score']

    # Encode the baserunner state (0-7 for all combinations)
    features['runners_state'] = (
        plays_df['runner_on_1st'].astype(int) * 1 +
        plays_df['runner_on_2nd'].astype(int) * 2 +
        plays_df['runner_on_3rd'].astype(int) * 4
    )

    # Target: did the home team win? The per-game outcome is broadcast
    # back to every play so features and target have the same length.
    final = plays_df.groupby('game_id').last()
    home_won = final['home_score'] > final['visiting_score']
    target = plays_df['game_id'].map(home_won)

    # Train a gradient boosting model
    model = GradientBoostingClassifier(n_estimators=100, max_depth=5)
    model.fit(features, target)

    return model


def calculate_wpa(plays_df, wp_model):
    """Calculate Win Probability Added for each play and player."""
    # Win probability before each play
    plays_df['wp_before'] = wp_model.predict_proba(
        plays_df[['inning', 'outs', 'balls', 'strikes',
                  'score_diff', 'runners_state']]
    )[:, 1]

    # Win probability after the play (i.e., before the next play)
    plays_df['wp_after'] = plays_df.groupby('game_id')['wp_before'].shift(-1)

    # WPA is the change in win probability
    plays_df['wpa'] = plays_df['wp_after'] - plays_df['wp_before']

    # Aggregate by player
    player_wpa = plays_df.groupby('batter')['wpa'].sum().sort_values(ascending=False)

    return player_wpa
```

### Historical Player Comparison

```python
def compare_players_across_eras(player1_id, player2_id, plays_df):
    """
    Compare two players while accounting for era differences.

    Note: calculate_player_stats and get_era_averages are assumed
    helper functions, not shown here.
    """
    # Each player's statistics
    p1_stats = calculate_player_stats(player1_id, plays_df)
    p2_stats = calculate_player_stats(player2_id, plays_df)

    # League averages for each player's era
    p1_era_avg = get_era_averages(p1_stats['seasons'])
    p2_era_avg = get_era_averages(p2_stats['seasons'])

    # ERA+-style adjustments (100 = league average)
    p1_adjusted = {
        'avg_plus': (p1_stats['avg'] / p1_era_avg['avg']) * 100,
        'obp_plus': (p1_stats['obp'] / p1_era_avg['obp']) * 100,
        'slg_plus': (p1_stats['slg'] / p1_era_avg['slg']) * 100
    }
    p2_adjusted = {
        'avg_plus': (p2_stats['avg'] / p2_era_avg['avg']) * 100,
        'obp_plus': (p2_stats['obp'] / p2_era_avg['obp']) * 100,
        'slg_plus': (p2_stats['slg'] / p2_era_avg['slg']) * 100
    }

    return p1_adjusted, p2_adjusted
```

### Strategic Analysis

Retrosheet enables analysis of strategic decisions:

```r
# Analyze bunting effectiveness
bunt_analysis <- plays_2018 %>%
  filter(SH_FL == 1) %>%  # sacrifice hits
  group_by(OUTS_CT, START_BASES_CD) %>%
  summarise(
    attempts = n(),
    # bunt single (EVENT_CD 20) or successful sacrifice out (EVENT_CD 2)
    successful = sum(EVENT_CD %in% c(2, 20)),
    runs_scored_after = mean(RUNS_CT, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(success_rate = successful / attempts)

# Compare to historical trends
# (get_season_plays is an assumed helper that loads one season's plays)
bunt_trends <- lapply(1950:2023, function(yr) {
  plays <- get_season_plays(yr)
  plays %>%
    summarise(
      year = yr,
      bunts = sum(SH_FL == 1, na.rm = TRUE),
      plate_appearances = n()
    ) %>%
    mutate(bunt_rate = bunts / plate_appearances)
}) %>%
  bind_rows()

ggplot(bunt_trends, aes(x = year, y = bunt_rate * 100)) +
  geom_line() +
  labs(title = "Decline of the Sacrifice Bunt",
       y = "Sacrifice Bunts per 100 PA")
```

## Chadwick Baseball Bureau Tools

The Chadwick Tools are command-line utilities for efficiently processing Retrosheet data, named after Henry Chadwick, the inventor of the baseball box score.
### Key Chadwick Tools

**cwevent** - Extract play-by-play events

```bash
cwevent -y 2018 -f 0-96 2018*.EVA > events_2018.csv
```

**cwgame** - Generate game-level summaries

```bash
cwgame -y 2018 2018*.EVA > games_2018.csv
```

**cwdaily** - Create daily player statistics

```bash
cwdaily -y 2018 2018*.EVA > daily_2018.csv
```

**cwsub** - Track substitutions

```bash
cwsub -y 2018 2018*.EVA > subs_2018.csv
```

**cwcomment** - Extract game comments and notes

```bash
cwcomment -y 2018 2018*.EVA > comments_2018.csv
```

### Field Selection in cwevent

The `-f` flag specifies which fields to output (0-96 available):

```bash
# Common field selections
# 0 = game_id
# 1 = visiting_team
# 2 = inning
# 3 = batting_team
# 4 = outs
# 6 = batter
# 7 = pitcher
# 10 = event_text
# 26 = hit_location
# 50 = rbi_on_play
cwevent -f 0,1,2,3,4,6,7,10,26,50 -y 2018 2018NYA.EVA
```

### Python Integration

```python
import subprocess

import pandas as pd


def run_cwevent(event_files, year, fields, output_file='events.csv'):
    """Run the cwevent command and load the results into a DataFrame."""
    # Build the command (-n emits a header row)
    field_str = ','.join(map(str, fields))
    cmd = ['cwevent', '-f', field_str, '-y', str(year), '-n'] + event_files

    # Run the command and save its output
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(output_file, 'w') as f:
        f.write(result.stdout)

    # Load into a DataFrame
    df = pd.read_csv(output_file)
    return df


# Example usage
fields = [0, 1, 2, 3, 4, 5, 6, 7, 10, 26, 50]
events = run_cwevent(['2018NYA.EVA'], 2018, fields)
print(events.head())
```

## Building a Historical Database

Creating a comprehensive baseball database from Retrosheet data:

```python
import glob
import sqlite3

import pandas as pd


def create_retrosheet_database(db_path='baseball_history.db'):
    """Create a SQLite database structure for Retrosheet data."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Games table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS games (
            game_id TEXT PRIMARY KEY,
            game_date DATE,
            visiting_team TEXT,
            home_team TEXT,
            visiting_score INTEGER,
            home_score INTEGER,
            innings INTEGER,
            day_night TEXT,
            park_id TEXT,
            attendance INTEGER
        )
    ''')

    # Plays table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS plays (
            play_id INTEGER PRIMARY KEY AUTOINCREMENT,
            game_id TEXT,
            inning INTEGER,
            batting_team INTEGER,
            outs INTEGER,
            balls INTEGER,
            strikes INTEGER,
            batter_id TEXT,
            pitcher_id TEXT,
            event_text TEXT,
            event_type TEXT,
            rbi INTEGER,
            runs_on_play INTEGER,
            FOREIGN KEY (game_id) REFERENCES games(game_id)
        )
    ''')

    # Players table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS players (
            player_id TEXT PRIMARY KEY,
            last_name TEXT,
            first_name TEXT,
            debut_date DATE,
            final_date DATE
        )
    ''')

    # Indexes for common lookups
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_game ON plays(game_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_batter ON plays(batter_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_plays_pitcher ON plays(pitcher_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_games_date ON games(game_date)')

    conn.commit()
    return conn


def load_season_to_database(year, event_files, conn):
    """Load a full season of Retrosheet data into the database.

    Note: in practice the parsed DataFrame columns must be aligned with
    the table schemas before insertion.
    """
    for event_file in event_files:
        # Use a fresh parser per file so previously parsed rows
        # are not appended to the database a second time
        parser = RetrosheetParser()
        games_df, plays_df = parser.parse_event_file(event_file)

        games_df.to_sql('games', conn, if_exists='append', index=False)
        plays_df.to_sql('plays', conn, if_exists='append', index=False)

    print(f"Loaded {year} season to database")


# Create the database and load multiple seasons
conn = create_retrosheet_database()
for year in range(2010, 2024):
    event_files = glob.glob(f'retrosheet_data/{year}/*.EVA')
    load_season_to_database(year, event_files, conn)

# Query the database
# (game IDs are TEAMYYYYMMDD#, so skip the 3-letter team code)
query = '''
SELECT
    batter_id,
    COUNT(*) AS plate_appearances,
    SUM(CASE WHEN event_type IN ('Single', 'Double', 'Triple', 'Home Run')
        THEN 1 ELSE 0 END) AS hits,
    SUM(CASE WHEN event_type = 'Home Run' THEN 1 ELSE 0 END) AS home_runs
FROM plays
WHERE game_id LIKE '___2018%'
GROUP BY batter_id
HAVING plate_appearances >= 100
ORDER BY hits DESC
LIMIT 20
'''
batting_leaders = pd.read_sql_query(query, conn)
print(batting_leaders)
```

## Conclusion

Retrosheet represents one of the most valuable resources in baseball analytics, providing comprehensive play-by-play data spanning over 150 years of baseball history. From casual research projects to sophisticated analytical models, Retrosheet enables insights that would be impossible with traditional statistics alone.
### Key Takeaways

- **Free and Comprehensive**: Retrosheet provides play-by-play data for MLB history at no cost
- **Multiple Access Methods**: Direct file parsing, Chadwick tools, Python packages, and R libraries
- **Rich Detail**: Event files capture every pitch, play, and substitution with extensive coding
- **Research Applications**: Enables win probability, historical analysis, strategic research, and player evaluation
- **Data Quality Varies**: Coverage and detail improve significantly in recent decades
- **Community-Driven**: Volunteer organization that welcomes contributions
- **Complementary**: Works best when combined with other data sources like Statcast and FanGraphs

### Getting Started

1. **Download data** from Retrosheet.org for your period of interest
2. **Choose a parsing method**: Chadwick tools for efficiency, Python/R for flexibility
3. **Build game summaries** by aggregating play-by-play data
4. **Validate results** against known statistics
5. **Explore research questions** enabled by granular historical data

### Resources

- **Retrosheet.org**: Official site with downloads and documentation
- **Chadwick Bureau**: Tools and parsing utilities
- **pybaseball**: Python package with Retrosheet integration
- **retrosheet (R)**: R package for event file parsing
- **Community forums**: the retrosheet-discussion mailing list for technical help

Retrosheet has democratized access to baseball's historical record, making every researcher a potential baseball historian. Whether you're investigating the evolution of pitching strategy, building predictive models, or simply satisfying curiosity about your favorite player's career, Retrosheet provides the foundation for rigorous, data-driven baseball analysis.
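Step 4 of Getting Started (validating results) can be sketched as a simple cross-check of aggregated totals against independently published numbers. The player IDs and counts below are made-up placeholders, not real data:

```python
import pandas as pd

def validate_totals(season_totals, reference, tolerance=0):
    """
    Compare computed per-player home run totals against a reference dict
    {player_id: expected_hr} and return any mismatches as
    {player_id: (computed, expected)}.
    """
    mismatches = {}
    computed = season_totals.set_index('batter_id')['home_runs'].to_dict()
    for player_id, expected in reference.items():
        got = computed.get(player_id, 0)
        if abs(got - expected) > tolerance:
            mismatches[player_id] = (got, expected)
    return mismatches

# Hypothetical example: totals aggregated from event files vs. published numbers
season_totals = pd.DataFrame({
    'batter_id': ['playa001', 'playb001'],
    'home_runs': [40, 27],
})
reference = {'playa001': 40, 'playb001': 28}
print(validate_totals(season_totals, reference))  # → {'playb001': (27, 28)}
```

Small discrepancies are common in early eras where play-by-play was reconstructed from newspaper accounts, so a nonzero `tolerance` may be appropriate for pre-1950 seasons.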
