Data Wrangling and Play-by-Play Analysis

Intermediate 20 min read 234 views Nov 25, 2025

Data Wrangling & Play-by-Play Analysis

Play-by-play data represents the most granular level of baseball analysis, capturing every pitch, swing, and defensive action throughout a game. Transforming this raw sequential data into analytical insights requires sophisticated data wrangling techniques—the process of cleaning, structuring, and enriching data to enable meaningful analysis. For baseball analysts, mastering these skills unlocks the ability to study game situations, evaluate player performance in context, and build predictive models that account for the game's inherent complexity.

The challenge with play-by-play data lies in its structure. Each row typically represents a single event—a pitch, a batted ball, or a baserunning action—but meaningful analysis often requires aggregating across events, calculating running totals, and joining data from multiple sources. A simple question like "How does this batter perform with runners in scoring position?" requires filtering events, identifying game states, calculating statistics, and contextualizing results against league averages.

Understanding Play-by-Play Data Structure

Play-by-play data comes in several formats depending on the source. Retrosheet provides event files with coded notation describing each play (e.g., "S8/L.1-2" indicates a single to center field with the runner advancing from first to second). Statcast data includes pitch-level information with each row representing a single pitch. Understanding these formats and their relationships is essential for proper analysis.

The key to effective data wrangling is recognizing that baseball events are nested within hierarchical contexts: pitches within at-bats, at-bats within innings, innings within games, and games within seasons. Analysts must be able to move fluidly between these levels, aggregating up from pitches to plate appearances or drilling down from season statistics to specific game situations.

Key Components

Event Parsing: Converting coded play notation into structured fields (hit type, fielders involved, baserunner movements)
State Tracking: Maintaining current game state (score, outs, base occupancy) throughout event sequences
Run Expectancy: Calculating expected runs based on base-out situations using historical probability matrices
Contextual Aggregation: Grouping events by situation (RISP, high leverage, platoon matchups) for meaningful splits
Data Joining: Linking play-by-play events with player information, park factors, and environmental conditions
Quality Control: Identifying and handling missing data, inconsistencies, and edge cases in event sequences

Run Expectancy Matrix Formula

Run Expectancy (RE) = Average runs scored from current state to end of inning

RE24 = (RE after event) - (RE before event) + Runs scored on play

The run expectancy matrix provides the foundation for context-neutral offensive value. By comparing the change in expected runs before and after each event, we can assign credit (or blame) to individual players independent of what their teammates subsequently did.

Python Implementation


"""
Play-by-Play Data Wrangling for Baseball Analytics
Complete implementation for parsing, cleaning, and analyzing game events
"""

import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import re

@dataclass
class GameState:
    """Represents the current state of a baseball game."""
    inning: int
    is_top: bool
    outs: int
    runners: Tuple[bool, bool, bool]  # (1B, 2B, 3B)
    home_score: int
    away_score: int

    @property
    def base_out_state(self) -> str:
        """Return base-out state code (e.g., '1_0' for runner on 1st, 0 outs)."""
        runners_str = ''.join(['1' if r else '_' for r in self.runners])
        return f"{runners_str}_{self.outs}"

    @property
    def score_differential(self) -> int:
        """Return score differential from batting team's perspective."""
        if self.is_top:
            return self.away_score - self.home_score
        return self.home_score - self.away_score


class RetrosheetParser:
    """
    Parser for Retrosheet event files.
    Converts coded play notation into structured data.
    """

    # Hit type patterns
    HIT_PATTERNS = {
        r'^S(\d)?': 'single',
        r'^D(\d)?': 'double',
        r'^T(\d)?': 'triple',
        r'^HR?(\d)?': 'home_run',
        r'^K': 'strikeout',
        r'^W': 'walk',
        r'^IW': 'intentional_walk',
        r'^HP': 'hit_by_pitch',
        r'^(\d)+\(B\)': 'out',
        r'^(\d)+$': 'out',
    }

    def __init__(self):
        self.run_expectancy = self._load_run_expectancy()

    def _load_run_expectancy(self) -> Dict[str, float]:
        """Load historical run expectancy matrix."""
        # 2010-2019 average run expectancy by base-out state
        return {
            '___0': 0.481, '___1': 0.254, '___2': 0.098,
            '1__0': 0.859, '1__1': 0.509, '1__2': 0.224,
            '_2_0': 1.100, '_2_1': 0.664, '_2_2': 0.319,
            '12_0': 1.437, '12_1': 0.884, '12_2': 0.429,
            '__30': 1.350, '__31': 0.950, '__32': 0.353,
            '1_30': 1.784, '1_31': 1.130, '1_32': 0.478,
            '_230': 1.964, '_231': 1.376, '_232': 0.580,
            '1230': 2.282, '1231': 1.541, '1232': 0.752,
        }

    def parse_play(self, play_code: str) -> Dict:
        """
        Parse a Retrosheet play code into structured data.

        Parameters:
        -----------
        play_code : str
            Retrosheet play notation (e.g., 'S8/L.1-2')

        Returns:
        --------
        dict with parsed play information
        """
        result = {
            'raw': play_code,
            'event_type': None,
            'hit_location': None,
            'fielders': [],
            'modifiers': [],
            'advances': []
        }

        # Split main play from baserunning
        parts = play_code.split('.')
        main_play = parts[0]

        if len(parts) > 1:
            result['advances'] = self._parse_advances(parts[1])

        # Parse main play
        for pattern, event_type in self.HIT_PATTERNS.items():
            if re.match(pattern, main_play):
                result['event_type'] = event_type
                # Extract fielders involved
                fielders = re.findall(r'\d', main_play)
                result['fielders'] = [int(f) for f in fielders]
                break

        # Parse modifiers (e.g., /L for line drive)
        if '/' in main_play:
            modifier_part = main_play.split('/')[1:]
            result['modifiers'] = modifier_part

        return result

    def _parse_advances(self, advance_str: str) -> List[Dict]:
        """Parse baserunner advance notation."""
        advances = []
        # Pattern: X-Y or X-Y(E#) for advances, XxY for outs
        for advance in advance_str.split(';'):
            advance = advance.strip()
            if '-' in advance:
                parts = advance.split('-')
                advances.append({
                    'from': parts[0],
                    'to': parts[1].rstrip('()EH'),
                    'out': False,
                    'error': 'E' in parts[1] if len(parts) > 1 else False
                })
            elif 'X' in advance.upper():
                parts = advance.upper().split('X')
                advances.append({
                    'from': parts[0],
                    'to': parts[1].rstrip('()'),
                    'out': True
                })
        return advances


class PlayByPlayAnalyzer:
    """
    Comprehensive analyzer for play-by-play baseball data.
    """

    def __init__(self, data: pd.DataFrame):
        """
        Initialize with play-by-play data.

        Parameters:
        -----------
        data : DataFrame
            Play-by-play data with standard columns
        """
        self.data = data.copy()
        self.parser = RetrosheetParser()
        self._preprocess_data()

    def _preprocess_data(self):
        """Clean and standardize the data."""
        # Ensure required columns exist
        required = ['game_id', 'inning', 'half_inning', 'outs',
                   'batter', 'pitcher', 'event']

        for col in required:
            if col not in self.data.columns:
                raise ValueError(f"Missing required column: {col}")

        # Create base-out state
        self.data['base_state'] = self.data.apply(
            lambda r: self._get_base_state(r), axis=1
        )

        # Calculate run expectancy
        self.data['re_before'] = self.data['base_state'].map(
            self.parser.run_expectancy
        )

    def _get_base_state(self, row) -> str:
        """Determine base state from row data."""
        r1 = '1' if row.get('runner_1b', False) else '_'
        r2 = '2' if row.get('runner_2b', False) else '_'
        r3 = '3' if row.get('runner_3b', False) else '_'
        return f"{r1}{r2}{r3}{row['outs']}"

    def calculate_wpa(self) -> pd.DataFrame:
        """
        Calculate Win Probability Added for each event.

        Returns:
        --------
        DataFrame with WPA column added
        """
        # Load win probability matrix (simplified)
        # In production, use full inning/score/base-out matrix

        df = self.data.copy()

        # Calculate WPA as change in win probability
        df['wp_before'] = df.apply(self._calculate_win_prob, axis=1)
        df['wp_after'] = df['wp_before'].shift(-1)
        df['wpa'] = df['wp_after'] - df['wp_before']

        return df

    def _calculate_win_prob(self, row) -> float:
        """Calculate win probability for batting team."""
        # Simplified model - production would use comprehensive tables
        inning = row['inning']
        score_diff = row.get('score_diff', 0)

        # Base probability from score differential
        base_prob = 1 / (1 + np.exp(-0.15 * score_diff))

        # Adjust for inning
        inning_factor = 1 - (min(inning, 9) - 1) * 0.02

        return base_prob * inning_factor

    def situational_splits(self, batter_id: str) -> Dict:
        """
        Calculate situational batting splits for a player.

        Parameters:
        -----------
        batter_id : str
            Player identifier

        Returns:
        --------
        dict with splits by situation
        """
        batter_data = self.data[self.data['batter'] == batter_id]

        splits = {}

        # RISP (Runners in Scoring Position)
        risp = batter_data[
            batter_data['runner_2b'].fillna(False) |
            batter_data['runner_3b'].fillna(False)
        ]
        splits['risp'] = self._calculate_slash_line(risp)

        # Two outs
        two_out = batter_data[batter_data['outs'] == 2]
        splits['two_out'] = self._calculate_slash_line(two_out)

        # High leverage (simplified: close game, late innings)
        high_lev = batter_data[
            (batter_data['inning'] >= 7) &
            (batter_data['score_diff'].abs() <= 2)
        ]
        splits['high_leverage'] = self._calculate_slash_line(high_lev)

        return splits

    def _calculate_slash_line(self, df: pd.DataFrame) -> Dict:
        """Calculate AVG/OBP/SLG from event data."""
        if len(df) == 0:
            return {'pa': 0, 'avg': 0, 'obp': 0, 'slg': 0}

        hits = df['event'].isin(['single', 'double', 'triple', 'home_run']).sum()
        at_bats = len(df[~df['event'].isin(['walk', 'hbp', 'sac_fly', 'sac_bunt'])])

        total_bases = (
            df[df['event'] == 'single'].shape[0] +
            df[df['event'] == 'double'].shape[0] * 2 +
            df[df['event'] == 'triple'].shape[0] * 3 +
            df[df['event'] == 'home_run'].shape[0] * 4
        )

        on_base = hits + df['event'].isin(['walk', 'hbp']).sum()
        pa = len(df)

        return {
            'pa': pa,
            'avg': round(hits / at_bats, 3) if at_bats > 0 else 0,
            'obp': round(on_base / pa, 3) if pa > 0 else 0,
            'slg': round(total_bases / at_bats, 3) if at_bats > 0 else 0
        }


# Example usage
if __name__ == "__main__":
    # Create sample play-by-play data
    sample_data = pd.DataFrame({
        'game_id': ['NYA202304010'] * 10,
        'inning': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        'half_inning': ['top'] * 4 + ['bottom'] * 3 + ['top'] * 3,
        'outs': [0, 0, 1, 2, 0, 1, 2, 0, 1, 1],
        'batter': ['player1', 'player2', 'player3', 'player4'] * 2 + ['player1', 'player2'],
        'pitcher': ['pitcherA'] * 10,
        'event': ['single', 'strikeout', 'double', 'flyout',
                 'walk', 'single', 'groundout', 'home_run', 'strikeout', 'single'],
        'runner_1b': [False, True, False, True, False, True, True, False, False, False],
        'runner_2b': [False, False, True, False, False, False, True, False, False, False],
        'runner_3b': [False, False, False, True, False, False, False, False, False, False],
        'score_diff': [0, 0, 0, 0, -1, -1, -1, -1, 0, 0]
    })

    # Initialize analyzer
    analyzer = PlayByPlayAnalyzer(sample_data)

    # Parse a Retrosheet play
    parser = RetrosheetParser()
    play = parser.parse_play("S8/L.1-3;B-2")
    print("Parsed play:", play)

    # Get situational splits
    splits = analyzer.situational_splits('player1')
    print("\nPlayer1 Situational Splits:")
    for situation, stats in splits.items():
        print(f"  {situation}: {stats}")

R Implementation


# ============================================
# Play-by-Play Data Wrangling in R
# ============================================

library(tidyverse)
library(baseballr)
library(lubridate)

# ============================================
# Run Expectancy Matrix
# ============================================

#' Historical run expectancy values by base-out state
run_expectancy_matrix <- tribble(
  ~base_state, ~outs_0, ~outs_1, ~outs_2,
  "___",       0.481,   0.254,   0.098,
  "1__",       0.859,   0.509,   0.224,
  "_2_",       1.100,   0.664,   0.319,
  "12_",       1.437,   0.884,   0.429,
  "__3",       1.350,   0.950,   0.353,
  "1_3",       1.784,   1.130,   0.478,
  "_23",       1.964,   1.376,   0.580,
  "123",       2.282,   1.541,   0.752
)

#' Get run expectancy for a base-out state
#'
#' @param runners Character vector of occupied bases (e.g., c("1B", "3B"))
#' @param outs Integer number of outs (0, 1, or 2)
#' @return Numeric run expectancy value
get_run_expectancy <- function(runners, outs) {

  # Build base state string
  base_state <- paste0(
    ifelse("1B" %in% runners, "1", "_"),
    ifelse("2B" %in% runners, "2", "_"),
    ifelse("3B" %in% runners, "3", "_")
  )

  # Look up value
  col_name <- paste0("outs_", outs)

  re_value <- run_expectancy_matrix %>%
    filter(base_state == !!base_state) %>%
    pull(!!sym(col_name))

  return(ifelse(length(re_value) > 0, re_value, NA_real_))
}

# ============================================
# Retrosheet Play Parser
# ============================================

#' Parse Retrosheet play notation
#'
#' @param play_code Character string of play notation
#' @return List with parsed play components
parse_retrosheet_play <- function(play_code) {

  result <- list(
    raw = play_code,
    event_type = NA_character_,
    fielders = character(),
    advances = list()
  )

  # Split play from advances
  parts <- str_split(play_code, "\\.")[[1]]
  main_play <- parts[1]

  # Identify event type
  event_patterns <- c(
    "single" = "^S\\d?",
    "double" = "^D\\d?",
    "triple" = "^T\\d?",
    "home_run" = "^HR?\\d?",
    "strikeout" = "^K",
    "walk" = "^W$|^W[^P]",
    "intentional_walk" = "^IW",
    "hit_by_pitch" = "^HP"
  )

  for (event_name in names(event_patterns)) {
    if (str_detect(main_play, event_patterns[event_name])) {
      result$event_type <- event_name
      break
    }
  }

  # Check for outs if no hit event found
  if (is.na(result$event_type)) {
    if (str_detect(main_play, "^\\d+\\(B\\)|^\\d+$")) {
      result$event_type <- "out"
    }
  }

  # Extract fielders
  fielders <- str_extract_all(main_play, "\\d")[[1]]
  result$fielders <- fielders

  # Parse advances if present
  if (length(parts) > 1) {
    advance_parts <- str_split(parts[2], ";")[[1]]
    result$advances <- map(advance_parts, parse_advance)
  }

  return(result)
}

#' Parse a single baserunner advance
#'
#' @param advance_str Character advance notation (e.g., "1-3" or "BX2")
#' @return List with from, to, and out fields
parse_advance <- function(advance_str) {

  advance_str <- str_trim(advance_str)

  if (str_detect(advance_str, "-")) {
    parts <- str_split(advance_str, "-")[[1]]
    return(list(
      from = parts[1],
      to = str_extract(parts[2], "^[1-3BH]"),
      out = FALSE
    ))
  } else if (str_detect(toupper(advance_str), "X")) {
    parts <- str_split(toupper(advance_str), "X")[[1]]
    return(list(
      from = parts[1],
      to = parts[2],
      out = TRUE
    ))
  }

  return(list(from = NA, to = NA, out = FALSE))
}

# ============================================
# Play-by-Play Analysis Functions
# ============================================

#' Calculate RE24 for each event in play-by-play data
#'
#' @param pbp Data frame with play-by-play events
#' @return Data frame with RE24 column added
calculate_re24 <- function(pbp) {

  pbp <- pbp %>%
    mutate(
      # Build base state string
      base_state = paste0(
        ifelse(runner_on_1b, "1", "_"),
        ifelse(runner_on_2b, "2", "_"),
        ifelse(runner_on_3b, "3", "_")
      )
    )

  # Join with run expectancy
  re_long <- run_expectancy_matrix %>%
    pivot_longer(
      cols = starts_with("outs_"),
      names_to = "outs",
      values_to = "re"
    ) %>%
    mutate(outs = as.integer(str_extract(outs, "\\d")))

  pbp <- pbp %>%
    left_join(
      re_long,
      by = c("base_state", "outs")
    ) %>%
    rename(re_before = re)

  # Calculate RE after (from next row within inning)
  pbp <- pbp %>%
    group_by(game_id, inning, half_inning) %>%
    mutate(
      re_after = lead(re_before, default = 0),
      runs_on_play = coalesce(runs_scored, 0),
      re24 = re_after - re_before + runs_on_play
    ) %>%
    ungroup()

  return(pbp)
}

#' Calculate situational batting splits
#'
#' @param pbp Data frame with play-by-play data
#' @param batter_id Character player identifier
#' @return Data frame with splits by situation
get_situational_splits <- function(pbp, batter_id) {

  batter_pa <- pbp %>%
    filter(batter == batter_id)

  # Helper function for slash line
  calc_slash <- function(df) {
    tibble(
      pa = nrow(df),
      ab = sum(!df$event %in% c("walk", "hit_by_pitch", "sac_fly", "sac_bunt")),
      hits = sum(df$event %in% c("single", "double", "triple", "home_run")),
      tb = sum(df$event == "single") +
           sum(df$event == "double") * 2 +
           sum(df$event == "triple") * 3 +
           sum(df$event == "home_run") * 4,
      bb_hbp = sum(df$event %in% c("walk", "hit_by_pitch"))
    ) %>%
    mutate(
      avg = round(hits / ab, 3),
      obp = round((hits + bb_hbp) / pa, 3),
      slg = round(tb / ab, 3),
      ops = obp + slg
    )
  }

  # Overall
  overall <- calc_slash(batter_pa) %>%
    mutate(situation = "overall")

  # RISP
  risp <- batter_pa %>%
    filter(runner_on_2b | runner_on_3b) %>%
    calc_slash() %>%
    mutate(situation = "risp")

  # Two outs
  two_out <- batter_pa %>%
    filter(outs == 2) %>%
    calc_slash() %>%
    mutate(situation = "two_out")

  # Late & Close (7th inning+, within 2 runs)
  late_close <- batter_pa %>%
    filter(inning >= 7, abs(score_diff) <= 2) %>%
    calc_slash() %>%
    mutate(situation = "late_close")

  bind_rows(overall, risp, two_out, late_close) %>%
    select(situation, pa, avg, obp, slg, ops)
}

# ============================================
# Example Usage
# ============================================

# Create sample play-by-play data
sample_pbp <- tibble(
  game_id = rep("NYA202304010", 10),
  inning = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  half_inning = c(rep("top", 4), rep("bottom", 3), rep("top", 3)),
  outs = c(0, 0, 1, 2, 0, 1, 2, 0, 1, 1),
  batter = rep(c("player1", "player2", "player3", "player4"), 2)[1:10],
  pitcher = rep("pitcherA", 10),
  event = c("single", "strikeout", "double", "field_out",
            "walk", "single", "ground_out", "home_run", "strikeout", "single"),
  runner_on_1b = c(F, T, F, T, F, T, T, F, F, F),
  runner_on_2b = c(F, F, T, F, F, F, T, F, F, F),
  runner_on_3b = c(F, F, F, T, F, F, F, F, F, F),
  runs_scored = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  score_diff = c(0, 0, 0, 0, -1, -1, -1, -1, 0, 0)
)

# Calculate RE24
pbp_with_re24 <- calculate_re24(sample_pbp)

cat("Play-by-Play with RE24:\n")
print(pbp_with_re24 %>% select(inning, outs, batter, event, re_before, re24))

# Get situational splits
splits <- get_situational_splits(sample_pbp, "player1")
cat("\nPlayer1 Situational Splits:\n")
print(splits)

# Parse a Retrosheet play
play_result <- parse_retrosheet_play("D8/L.2-H;1-3")
cat("\nParsed Retrosheet Play:\n")
print(play_result)

Real-World Application

The analytical revolution that transformed the Oakland Athletics—made famous by Moneyball—relied fundamentally on play-by-play analysis. By examining situational data, Billy Beane's front office identified that on-base percentage was undervalued relative to batting average, and that clutch hitting was largely a myth when examined through the lens of sample size. These insights came from rigorous analysis of play-by-play records.

Modern applications extend far beyond simple counting statistics. The Houston Astros used play-by-play pitch data to optimize their pitching staff's usage patterns, identifying which pitchers performed better in high-leverage situations and adjusting bullpen deployment accordingly. Similarly, defensive positioning strategies that have become ubiquitous in modern baseball emerged from spray chart analysis derived from play-by-play batted ball data.

Interpreting Results

RE24 Range	Event Impact	Typical Events
> +1.0	Highly Positive	Home run with runners on, bases-clearing extra-base hit
+0.3 to +1.0	Positive	Single with RISP, walk with bases loaded, most extra-base hits
-0.1 to +0.3	Neutral to Slight Positive	Single with no runners, walk, productive out
-0.3 to -0.1	Slightly Negative	Strikeout with 0-1 outs, routine flyout
< -0.3	Highly Negative	Double play, caught stealing, strikeout with RISP and 2 outs

Key Takeaways

Play-by-play data enables context-aware analysis that aggregate statistics cannot provide—situational splits reveal how players perform under specific game conditions
Run expectancy matrices form the foundation for advanced metrics like RE24 and WPA, allowing credit assignment independent of subsequent events
Retrosheet's event notation requires careful parsing but provides rich historical data for longitudinal analysis spanning decades
Proper data wrangling includes state tracking, quality control, and hierarchical aggregation across the nested structure of baseball games
Combining play-by-play analysis with tracking data (Statcast) enables the most comprehensive player evaluation possible in modern baseball

Discussion

Have questions or feedback? Join our community discussion on Discord or GitHub Discussions.

Table of Contents