Data Wrangling and Play-by-Play Analysis
Data Wrangling & Play-by-Play Analysis
Play-by-play data represents the most granular level of baseball analysis, capturing every pitch, swing, and defensive action throughout a game. Transforming this raw sequential data into analytical insights requires sophisticated data wrangling techniques—the process of cleaning, structuring, and enriching data to enable meaningful analysis. For baseball analysts, mastering these skills unlocks the ability to study game situations, evaluate player performance in context, and build predictive models that account for the game's inherent complexity.
The challenge with play-by-play data lies in its structure. Each row typically represents a single event—a pitch, a batted ball, or a baserunning action—but meaningful analysis often requires aggregating across events, calculating running totals, and joining data from multiple sources. A simple question like "How does this batter perform with runners in scoring position?" requires filtering events, identifying game states, calculating statistics, and contextualizing results against league averages.
Understanding Play-by-Play Data Structure
Play-by-play data comes in several formats depending on the source. Retrosheet provides event files with coded notation describing each play (e.g., "S8/L.1-2" indicates a single to center field with the runner advancing from first to second). Statcast data includes pitch-level information with each row representing a single pitch. Understanding these formats and their relationships is essential for proper analysis.
The key to effective data wrangling is recognizing that baseball events are nested within hierarchical contexts: pitches within at-bats, at-bats within innings, innings within games, and games within seasons. Analysts must be able to move fluidly between these levels, aggregating up from pitches to plate appearances or drilling down from season statistics to specific game situations.
Key Components
- Event Parsing: Converting coded play notation into structured fields (hit type, fielders involved, baserunner movements)
- State Tracking: Maintaining current game state (score, outs, base occupancy) throughout event sequences
- Run Expectancy: Calculating expected runs based on base-out situations using historical probability matrices
- Contextual Aggregation: Grouping events by situation (RISP, high leverage, platoon matchups) for meaningful splits
- Data Joining: Linking play-by-play events with player information, park factors, and environmental conditions
- Quality Control: Identifying and handling missing data, inconsistencies, and edge cases in event sequences
Run Expectancy Matrix Formula
Run Expectancy (RE) = Average runs scored from current state to end of inning
RE24 = (RE after event) - (RE before event) + Runs scored on play
The run expectancy matrix provides the foundation for context-neutral offensive value. By comparing the change in expected runs before and after each event, we can assign credit (or blame) to individual players independent of what their teammates subsequently did.
Python Implementation
"""
Play-by-Play Data Wrangling for Baseball Analytics
Complete implementation for parsing, cleaning, and analyzing game events
"""
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import re
@dataclass
class GameState:
"""Represents the current state of a baseball game."""
inning: int
is_top: bool
outs: int
runners: Tuple[bool, bool, bool] # (1B, 2B, 3B)
home_score: int
away_score: int
@property
def base_out_state(self) -> str:
"""Return base-out state code (e.g., '1_0' for runner on 1st, 0 outs)."""
runners_str = ''.join(['1' if r else '_' for r in self.runners])
return f"{runners_str}_{self.outs}"
@property
def score_differential(self) -> int:
"""Return score differential from batting team's perspective."""
if self.is_top:
return self.away_score - self.home_score
return self.home_score - self.away_score
class RetrosheetParser:
"""
Parser for Retrosheet event files.
Converts coded play notation into structured data.
"""
# Hit type patterns
HIT_PATTERNS = {
r'^S(\d)?': 'single',
r'^D(\d)?': 'double',
r'^T(\d)?': 'triple',
r'^HR?(\d)?': 'home_run',
r'^K': 'strikeout',
r'^W': 'walk',
r'^IW': 'intentional_walk',
r'^HP': 'hit_by_pitch',
r'^(\d)+\(B\)': 'out',
r'^(\d)+$': 'out',
}
def __init__(self):
self.run_expectancy = self._load_run_expectancy()
def _load_run_expectancy(self) -> Dict[str, float]:
"""Load historical run expectancy matrix."""
# 2010-2019 average run expectancy by base-out state
return {
'___0': 0.481, '___1': 0.254, '___2': 0.098,
'1__0': 0.859, '1__1': 0.509, '1__2': 0.224,
'_2_0': 1.100, '_2_1': 0.664, '_2_2': 0.319,
'12_0': 1.437, '12_1': 0.884, '12_2': 0.429,
'__30': 1.350, '__31': 0.950, '__32': 0.353,
'1_30': 1.784, '1_31': 1.130, '1_32': 0.478,
'_230': 1.964, '_231': 1.376, '_232': 0.580,
'1230': 2.282, '1231': 1.541, '1232': 0.752,
}
def parse_play(self, play_code: str) -> Dict:
"""
Parse a Retrosheet play code into structured data.
Parameters:
-----------
play_code : str
Retrosheet play notation (e.g., 'S8/L.1-2')
Returns:
--------
dict with parsed play information
"""
result = {
'raw': play_code,
'event_type': None,
'hit_location': None,
'fielders': [],
'modifiers': [],
'advances': []
}
# Split main play from baserunning
parts = play_code.split('.')
main_play = parts[0]
if len(parts) > 1:
result['advances'] = self._parse_advances(parts[1])
# Parse main play
for pattern, event_type in self.HIT_PATTERNS.items():
if re.match(pattern, main_play):
result['event_type'] = event_type
# Extract fielders involved
fielders = re.findall(r'\d', main_play)
result['fielders'] = [int(f) for f in fielders]
break
# Parse modifiers (e.g., /L for line drive)
if '/' in main_play:
modifier_part = main_play.split('/')[1:]
result['modifiers'] = modifier_part
return result
def _parse_advances(self, advance_str: str) -> List[Dict]:
"""Parse baserunner advance notation."""
advances = []
# Pattern: X-Y or X-Y(E#) for advances, XxY for outs
for advance in advance_str.split(';'):
advance = advance.strip()
if '-' in advance:
parts = advance.split('-')
advances.append({
'from': parts[0],
'to': parts[1].rstrip('()EH'),
'out': False,
'error': 'E' in parts[1] if len(parts) > 1 else False
})
elif 'X' in advance.upper():
parts = advance.upper().split('X')
advances.append({
'from': parts[0],
'to': parts[1].rstrip('()'),
'out': True
})
return advances
class PlayByPlayAnalyzer:
"""
Comprehensive analyzer for play-by-play baseball data.
"""
def __init__(self, data: pd.DataFrame):
"""
Initialize with play-by-play data.
Parameters:
-----------
data : DataFrame
Play-by-play data with standard columns
"""
self.data = data.copy()
self.parser = RetrosheetParser()
self._preprocess_data()
def _preprocess_data(self):
"""Clean and standardize the data."""
# Ensure required columns exist
required = ['game_id', 'inning', 'half_inning', 'outs',
'batter', 'pitcher', 'event']
for col in required:
if col not in self.data.columns:
raise ValueError(f"Missing required column: {col}")
# Create base-out state
self.data['base_state'] = self.data.apply(
lambda r: self._get_base_state(r), axis=1
)
# Calculate run expectancy
self.data['re_before'] = self.data['base_state'].map(
self.parser.run_expectancy
)
def _get_base_state(self, row) -> str:
"""Determine base state from row data."""
r1 = '1' if row.get('runner_1b', False) else '_'
r2 = '2' if row.get('runner_2b', False) else '_'
r3 = '3' if row.get('runner_3b', False) else '_'
return f"{r1}{r2}{r3}{row['outs']}"
def calculate_wpa(self) -> pd.DataFrame:
"""
Calculate Win Probability Added for each event.
Returns:
--------
DataFrame with WPA column added
"""
# Load win probability matrix (simplified)
# In production, use full inning/score/base-out matrix
df = self.data.copy()
# Calculate WPA as change in win probability
df['wp_before'] = df.apply(self._calculate_win_prob, axis=1)
df['wp_after'] = df['wp_before'].shift(-1)
df['wpa'] = df['wp_after'] - df['wp_before']
return df
def _calculate_win_prob(self, row) -> float:
"""Calculate win probability for batting team."""
# Simplified model - production would use comprehensive tables
inning = row['inning']
score_diff = row.get('score_diff', 0)
# Base probability from score differential
base_prob = 1 / (1 + np.exp(-0.15 * score_diff))
# Adjust for inning
inning_factor = 1 - (min(inning, 9) - 1) * 0.02
return base_prob * inning_factor
def situational_splits(self, batter_id: str) -> Dict:
"""
Calculate situational batting splits for a player.
Parameters:
-----------
batter_id : str
Player identifier
Returns:
--------
dict with splits by situation
"""
batter_data = self.data[self.data['batter'] == batter_id]
splits = {}
# RISP (Runners in Scoring Position)
risp = batter_data[
batter_data['runner_2b'].fillna(False) |
batter_data['runner_3b'].fillna(False)
]
splits['risp'] = self._calculate_slash_line(risp)
# Two outs
two_out = batter_data[batter_data['outs'] == 2]
splits['two_out'] = self._calculate_slash_line(two_out)
# High leverage (simplified: close game, late innings)
high_lev = batter_data[
(batter_data['inning'] >= 7) &
(batter_data['score_diff'].abs() <= 2)
]
splits['high_leverage'] = self._calculate_slash_line(high_lev)
return splits
def _calculate_slash_line(self, df: pd.DataFrame) -> Dict:
"""Calculate AVG/OBP/SLG from event data."""
if len(df) == 0:
return {'pa': 0, 'avg': 0, 'obp': 0, 'slg': 0}
hits = df['event'].isin(['single', 'double', 'triple', 'home_run']).sum()
at_bats = len(df[~df['event'].isin(['walk', 'hbp', 'sac_fly', 'sac_bunt'])])
total_bases = (
df[df['event'] == 'single'].shape[0] +
df[df['event'] == 'double'].shape[0] * 2 +
df[df['event'] == 'triple'].shape[0] * 3 +
df[df['event'] == 'home_run'].shape[0] * 4
)
on_base = hits + df['event'].isin(['walk', 'hbp']).sum()
pa = len(df)
return {
'pa': pa,
'avg': round(hits / at_bats, 3) if at_bats > 0 else 0,
'obp': round(on_base / pa, 3) if pa > 0 else 0,
'slg': round(total_bases / at_bats, 3) if at_bats > 0 else 0
}
# Example usage
if __name__ == "__main__":
# Create sample play-by-play data
sample_data = pd.DataFrame({
'game_id': ['NYA202304010'] * 10,
'inning': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'half_inning': ['top'] * 4 + ['bottom'] * 3 + ['top'] * 3,
'outs': [0, 0, 1, 2, 0, 1, 2, 0, 1, 1],
'batter': ['player1', 'player2', 'player3', 'player4'] * 2 + ['player1', 'player2'],
'pitcher': ['pitcherA'] * 10,
'event': ['single', 'strikeout', 'double', 'flyout',
'walk', 'single', 'groundout', 'home_run', 'strikeout', 'single'],
'runner_1b': [False, True, False, True, False, True, True, False, False, False],
'runner_2b': [False, False, True, False, False, False, True, False, False, False],
'runner_3b': [False, False, False, True, False, False, False, False, False, False],
'score_diff': [0, 0, 0, 0, -1, -1, -1, -1, 0, 0]
})
# Initialize analyzer
analyzer = PlayByPlayAnalyzer(sample_data)
# Parse a Retrosheet play
parser = RetrosheetParser()
play = parser.parse_play("S8/L.1-3;B-2")
print("Parsed play:", play)
# Get situational splits
splits = analyzer.situational_splits('player1')
print("\nPlayer1 Situational Splits:")
for situation, stats in splits.items():
print(f" {situation}: {stats}")
R Implementation
# ============================================
# Play-by-Play Data Wrangling in R
# ============================================
library(tidyverse)
library(baseballr)
library(lubridate)
# ============================================
# Run Expectancy Matrix
# ============================================
#' Historical run expectancy values by base-out state
run_expectancy_matrix <- tribble(
~base_state, ~outs_0, ~outs_1, ~outs_2,
"___", 0.481, 0.254, 0.098,
"1__", 0.859, 0.509, 0.224,
"_2_", 1.100, 0.664, 0.319,
"12_", 1.437, 0.884, 0.429,
"__3", 1.350, 0.950, 0.353,
"1_3", 1.784, 1.130, 0.478,
"_23", 1.964, 1.376, 0.580,
"123", 2.282, 1.541, 0.752
)
#' Get run expectancy for a base-out state
#'
#' @param runners Character vector of occupied bases (e.g., c("1B", "3B"))
#' @param outs Integer number of outs (0, 1, or 2)
#' @return Numeric run expectancy value
get_run_expectancy <- function(runners, outs) {
# Build base state string
base_state <- paste0(
ifelse("1B" %in% runners, "1", "_"),
ifelse("2B" %in% runners, "2", "_"),
ifelse("3B" %in% runners, "3", "_")
)
# Look up value
col_name <- paste0("outs_", outs)
re_value <- run_expectancy_matrix %>%
filter(base_state == !!base_state) %>%
pull(!!sym(col_name))
return(ifelse(length(re_value) > 0, re_value, NA_real_))
}
# ============================================
# Retrosheet Play Parser
# ============================================
#' Parse Retrosheet play notation
#'
#' @param play_code Character string of play notation
#' @return List with parsed play components
parse_retrosheet_play <- function(play_code) {
result <- list(
raw = play_code,
event_type = NA_character_,
fielders = character(),
advances = list()
)
# Split play from advances
parts <- str_split(play_code, "\\.")[[1]]
main_play <- parts[1]
# Identify event type
event_patterns <- c(
"single" = "^S\\d?",
"double" = "^D\\d?",
"triple" = "^T\\d?",
"home_run" = "^HR?\\d?",
"strikeout" = "^K",
"walk" = "^W$|^W[^P]",
"intentional_walk" = "^IW",
"hit_by_pitch" = "^HP"
)
for (event_name in names(event_patterns)) {
if (str_detect(main_play, event_patterns[event_name])) {
result$event_type <- event_name
break
}
}
# Check for outs if no hit event found
if (is.na(result$event_type)) {
if (str_detect(main_play, "^\\d+\\(B\\)|^\\d+$")) {
result$event_type <- "out"
}
}
# Extract fielders
fielders <- str_extract_all(main_play, "\\d")[[1]]
result$fielders <- fielders
# Parse advances if present
if (length(parts) > 1) {
advance_parts <- str_split(parts[2], ";")[[1]]
result$advances <- map(advance_parts, parse_advance)
}
return(result)
}
#' Parse a single baserunner advance
#'
#' @param advance_str Character advance notation (e.g., "1-3" or "BX2")
#' @return List with from, to, and out fields
parse_advance <- function(advance_str) {
advance_str <- str_trim(advance_str)
if (str_detect(advance_str, "-")) {
parts <- str_split(advance_str, "-")[[1]]
return(list(
from = parts[1],
to = str_extract(parts[2], "^[1-3BH]"),
out = FALSE
))
} else if (str_detect(toupper(advance_str), "X")) {
parts <- str_split(toupper(advance_str), "X")[[1]]
return(list(
from = parts[1],
to = parts[2],
out = TRUE
))
}
return(list(from = NA, to = NA, out = FALSE))
}
# ============================================
# Play-by-Play Analysis Functions
# ============================================
#' Calculate RE24 for each event in play-by-play data
#'
#' @param pbp Data frame with play-by-play events
#' @return Data frame with RE24 column added
calculate_re24 <- function(pbp) {
pbp <- pbp %>%
mutate(
# Build base state string
base_state = paste0(
ifelse(runner_on_1b, "1", "_"),
ifelse(runner_on_2b, "2", "_"),
ifelse(runner_on_3b, "3", "_")
)
)
# Join with run expectancy
re_long <- run_expectancy_matrix %>%
pivot_longer(
cols = starts_with("outs_"),
names_to = "outs",
values_to = "re"
) %>%
mutate(outs = as.integer(str_extract(outs, "\\d")))
pbp <- pbp %>%
left_join(
re_long,
by = c("base_state", "outs")
) %>%
rename(re_before = re)
# Calculate RE after (from next row within inning)
pbp <- pbp %>%
group_by(game_id, inning, half_inning) %>%
mutate(
re_after = lead(re_before, default = 0),
runs_on_play = coalesce(runs_scored, 0),
re24 = re_after - re_before + runs_on_play
) %>%
ungroup()
return(pbp)
}
#' Calculate situational batting splits
#'
#' @param pbp Data frame with play-by-play data
#' @param batter_id Character player identifier
#' @return Data frame with splits by situation
get_situational_splits <- function(pbp, batter_id) {
batter_pa <- pbp %>%
filter(batter == batter_id)
# Helper function for slash line
calc_slash <- function(df) {
tibble(
pa = nrow(df),
ab = sum(!df$event %in% c("walk", "hit_by_pitch", "sac_fly", "sac_bunt")),
hits = sum(df$event %in% c("single", "double", "triple", "home_run")),
tb = sum(df$event == "single") +
sum(df$event == "double") * 2 +
sum(df$event == "triple") * 3 +
sum(df$event == "home_run") * 4,
bb_hbp = sum(df$event %in% c("walk", "hit_by_pitch"))
) %>%
mutate(
avg = round(hits / ab, 3),
obp = round((hits + bb_hbp) / pa, 3),
slg = round(tb / ab, 3),
ops = obp + slg
)
}
# Overall
overall <- calc_slash(batter_pa) %>%
mutate(situation = "overall")
# RISP
risp <- batter_pa %>%
filter(runner_on_2b | runner_on_3b) %>%
calc_slash() %>%
mutate(situation = "risp")
# Two outs
two_out <- batter_pa %>%
filter(outs == 2) %>%
calc_slash() %>%
mutate(situation = "two_out")
# Late & Close (7th inning+, within 2 runs)
late_close <- batter_pa %>%
filter(inning >= 7, abs(score_diff) <= 2) %>%
calc_slash() %>%
mutate(situation = "late_close")
bind_rows(overall, risp, two_out, late_close) %>%
select(situation, pa, avg, obp, slg, ops)
}
# ============================================
# Example Usage
# ============================================
# Create sample play-by-play data
sample_pbp <- tibble(
game_id = rep("NYA202304010", 10),
inning = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
half_inning = c(rep("top", 4), rep("bottom", 3), rep("top", 3)),
outs = c(0, 0, 1, 2, 0, 1, 2, 0, 1, 1),
batter = rep(c("player1", "player2", "player3", "player4"), 2)[1:10],
pitcher = rep("pitcherA", 10),
event = c("single", "strikeout", "double", "field_out",
"walk", "single", "ground_out", "home_run", "strikeout", "single"),
runner_on_1b = c(F, T, F, T, F, T, T, F, F, F),
runner_on_2b = c(F, F, T, F, F, F, T, F, F, F),
runner_on_3b = c(F, F, F, T, F, F, F, F, F, F),
runs_scored = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0),
score_diff = c(0, 0, 0, 0, -1, -1, -1, -1, 0, 0)
)
# Calculate RE24
pbp_with_re24 <- calculate_re24(sample_pbp)
cat("Play-by-Play with RE24:\n")
print(pbp_with_re24 %>% select(inning, outs, batter, event, re_before, re24))
# Get situational splits
splits <- get_situational_splits(sample_pbp, "player1")
cat("\nPlayer1 Situational Splits:\n")
print(splits)
# Parse a Retrosheet play
play_result <- parse_retrosheet_play("D8/L.2-H;1-3")
cat("\nParsed Retrosheet Play:\n")
print(play_result)
Real-World Application
The analytical revolution that transformed the Oakland Athletics—made famous by Moneyball—relied fundamentally on play-by-play analysis. By examining situational data, Billy Beane's front office identified that on-base percentage was undervalued relative to batting average, and that clutch hitting was largely a myth when examined through the lens of sample size. These insights came from rigorous analysis of play-by-play records.
Modern applications extend far beyond simple counting statistics. The Houston Astros used play-by-play pitch data to optimize their pitching staff's usage patterns, identifying which pitchers performed better in high-leverage situations and adjusting bullpen deployment accordingly. Similarly, defensive positioning strategies that have become ubiquitous in modern baseball emerged from spray chart analysis derived from play-by-play batted ball data.
Interpreting Results
| RE24 Range | Event Impact | Typical Events |
|---|---|---|
| > +1.0 | Highly Positive | Home run with runners on, bases-clearing extra-base hit |
| +0.3 to +1.0 | Positive | Single with RISP, walk with bases loaded, most extra-base hits |
| -0.1 to +0.3 | Neutral to Slight Positive | Single with no runners, walk, productive out |
| -0.3 to -0.1 | Slightly Negative | Strikeout with 0-1 outs, routine flyout |
| < -0.3 | Highly Negative | Double play, caught stealing, strikeout with RISP and 2 outs |
Key Takeaways
- Play-by-play data enables context-aware analysis that aggregate statistics cannot provide—situational splits reveal how players perform under specific game conditions
- Run expectancy matrices form the foundation for advanced metrics like RE24 and WPA, allowing credit assignment independent of subsequent events
- Retrosheet's event notation requires careful parsing but provides rich historical data for longitudinal analysis spanning decades
- Proper data wrangling includes state tracking, quality control, and hierarchical aggregation across the nested structure of baseball games
- Combining play-by-play analysis with tracking data (Statcast) enables the most comprehensive player evaluation possible in modern baseball