Chapter 4: Data Wrangling and Play-by-Play Analysis

Intermediate, 10 min read, Nov 25, 2025

Data wrangling is the process of transforming raw baseball data into clean, structured formats suitable for analysis. Play-by-play data records every event in every game, so it has to be parsed, validated, and enriched with game-state variables before it yields useful insights.

Understanding Play-by-Play Analysis

Play-by-play analysis underpins much of modern baseball decision-making. MLB teams employ dedicated analysts who work with event-level data, and the insights they derive inform everything from in-game strategy to long-term roster construction.

Modern analytics combines historical data with cutting-edge tracking technologies like Statcast, which measures exit velocity, launch angle, sprint speed, and defensive positioning with unprecedented precision. This wealth of data enables teams to make more informed decisions and optimize player performance across all aspects of the game.

Key Components

Event Parsing: Converting Retrosheet event files into structured data with consistent field names
State Variables: Calculating base-out states, inning, score differential, and situational variables
Run Expectancy: Determining expected runs from each base-out state based on historical data
Data Validation: Checking for inconsistencies, missing values, and data quality issues
Aggregation: Summarizing play-level data to player, game, or season levels
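The state-variable and validation steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a full pipeline: it assumes Statcast-style column names (on_1b, on_2b, on_3b, outs_when_up, bat_score, fld_score), the runner IDs are placeholders, and the function name add_state_variables is made up for this example. Data parsed from Retrosheet event files would likely use different field names.

```python
import numpy as np
import pandas as pd

def add_state_variables(plays: pd.DataFrame) -> pd.DataFrame:
    """Add base-out state, score differential, and a basic validity flag."""
    out = plays.copy()

    # Encode occupied bases as a three-character string, e.g. '1_3' for corners
    bases = pd.Series('', index=out.index)
    for col, label in [('on_1b', '1'), ('on_2b', '2'), ('on_3b', '3')]:
        bases = bases + out[col].notna().map({True: label, False: '_'})
    out['bases'] = bases

    # Combine bases and outs into a single base-out state label
    out['base_out_state'] = out['bases'] + ' ' + out['outs_when_up'].astype(str)

    # Score differential from the batting team's point of view
    out['score_diff'] = out['bat_score'] - out['fld_score']

    # Simple validation: outs must be 0, 1, or 2 and both scores must be present
    out['valid'] = (
        out['outs_when_up'].isin([0, 1, 2])
        & out[['bat_score', 'fld_score']].notna().all(axis=1)
    )
    return out

# Toy data (assumed layout): runner columns hold a player ID or NaN
plays = pd.DataFrame({
    'on_1b': [np.nan, 101.0, 101.0, np.nan],
    'on_2b': [np.nan, np.nan, 102.0, np.nan],
    'on_3b': [np.nan, np.nan, np.nan, np.nan],
    'outs_when_up': [0, 0, 1, 3],   # the last row is deliberately invalid
    'bat_score': [0, 0, 2, 2],
    'fld_score': [1, 1, 1, 1],
})

states = add_state_variables(plays)
print(states[['bases', 'base_out_state', 'score_diff', 'valid']])
```

Flagging invalid rows rather than dropping them keeps the decision about how to handle bad records with the analyst.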

Mathematical Formula

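Run Expectancy(bases, outs) = Average runs scored from that base-out state until the end of the half-inning

There are 24 base-out states (8 base configurations × 3 out counts). Averaging over many historical half-innings gives each state an expected run value, which lets analysts compare situations objectively and evaluate a play by how much it changes that value.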
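As a rough sketch of how this average can be computed, the example below works on a tiny hand-made play-by-play table. The column names (game_id, inning, half, bases, outs, runs_on_play) and the function name run_expectancy are assumptions made for illustration. Rows are assumed to be in chronological order, with bases and outs describing the state before each play.

```python
import pandas as pd

def run_expectancy(plays: pd.DataFrame) -> pd.Series:
    """Average runs scored from each base-out state to the end of the half-inning."""
    half_inning = ['game_id', 'inning', 'half']
    df = plays.copy()

    # Total runs scored in each half-inning
    total = df.groupby(half_inning)['runs_on_play'].transform('sum')

    # Runs already scored in the half-inning before each play
    before = df.groupby(half_inning)['runs_on_play'].cumsum() - df['runs_on_play']

    # Runs scored from this play through the end of the half-inning
    df['runs_rest_of_inning'] = total - before

    return df.groupby(['bases', 'outs'])['runs_rest_of_inning'].mean()

# Toy data: two half-innings, one with a two-run homer, one scoreless
plays = pd.DataFrame({
    'game_id': ['g1'] * 8,
    'inning': [1] * 8,
    'half': ['top'] * 5 + ['bottom'] * 3,
    'bases': ['___', '1__', '1__', '___', '___', '___', '___', '___'],
    'outs': [0, 0, 1, 1, 2, 0, 1, 2],
    'runs_on_play': [0, 0, 2, 0, 0, 0, 0, 0],
})

re_table = run_expectancy(plays)
print(re_table)
```

With real data, analysts commonly restrict the sample to complete half-innings, since innings cut short by a walk-off understate the runs that could have scored.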

Python Implementation


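import pandas as pd
from pybaseball import statcast

# Statcast defines a hard-hit ball as one with an exit velocity of 95 mph or more
HARD_HIT_MPH = 95

def analyze_baseball_data(start_date, end_date, min_batted_balls=50):
    """
    Summarize batted-ball quality by player from Statcast pitch-level data.

    Parameters:
    start_date: Start date for analysis (YYYY-MM-DD)
    end_date: End date for analysis (YYYY-MM-DD)
    min_batted_balls: Minimum number of balls in play needed to qualify

    Returns:
    DataFrame with one row per qualified player, sorted by xwOBA
    """
    # Fetch Statcast data: one row per pitch, so a full season is a large download
    data = statcast(start_dt=start_date, end_dt=end_date)

    # Keep only balls in play (type 'X') with a measured exit velocity
    batted = data[(data['type'] == 'X') & data['launch_speed'].notna()].copy()
    batted['hard_hit'] = batted['launch_speed'] >= HARD_HIT_MPH

    # Calculate key metrics. In pybaseball pulls, player_name typically refers
    # to the pitcher; group by the 'batter' ID column to rank hitters instead.
    metrics = batted.groupby('player_name').agg(
        avg_ev=('launch_speed', 'mean'),
        max_ev=('launch_speed', 'max'),
        avg_la=('launch_angle', 'mean'),
        xwOBA=('estimated_woba_using_speedangle', 'mean'),  # on contact only
        hard_hit_rate=('hard_hit', 'mean'),
        total_batted_balls=('launch_speed', 'count'),
    ).reset_index().rename(columns={'player_name': 'player'})

    # Express hard-hit rate as a percentage of batted balls
    metrics['hard_hit_rate'] = metrics['hard_hit_rate'] * 100

    # Filter to qualified players
    qualified = metrics[metrics['total_batted_balls'] >= min_batted_balls]

    return qualified.sort_values('xwOBA', ascending=False)

# Example usage
results = analyze_baseball_data('2023-04-01', '2023-10-01')
print("Top 20 performers by xwOBA:")
print(results.head(20))

# Hard-hit rate: share of batted balls hit at 95 mph or harder
print("\nHard hit rate analysis:")
print(results[['player', 'avg_ev', 'hard_hit_rate']].head(10))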

R Implementation


library(tidyverse)
library(baseballr)

# Statcast defines a hard-hit ball as one with an exit velocity of 95 mph or more
hard_hit_mph <- 95

# Summarize batted-ball quality by player from Statcast pitch-level data
analyze_baseball_data <- function(start_date, end_date, min_batted_balls = 50) {
  # Fetch Statcast data (one row per pitch). Baseball Savant limits how many
  # rows a single query returns, so long date ranges may need to be pulled
  # in shorter chunks and combined.
  data <- statcast_search(
    start_date = start_date,
    end_date = end_date
  )

  # Calculate metrics by player, using only balls in play (type "X")
  metrics <- data %>%
    filter(type == "X", !is.na(launch_speed)) %>%
    group_by(player_name) %>%
    summarise(
      avg_ev = mean(launch_speed),
      max_ev = max(launch_speed),
      avg_la = mean(launch_angle, na.rm = TRUE),
      # xwOBA here is on contact only, since non-batted-ball rows are excluded
      xwOBA = mean(estimated_woba_using_speedangle, na.rm = TRUE),
      # Hard-hit rate: percentage of batted balls at 95 mph or harder
      hard_hit_rate = mean(launch_speed >= hard_hit_mph) * 100,
      total_batted_balls = n(),
      .groups = "drop"
    ) %>%
    filter(total_batted_balls >= min_batted_balls) %>%
    arrange(desc(xwOBA))

  return(metrics)
}

# Example usage
results <- analyze_baseball_data("2023-04-01", "2023-10-01")
cat("Top 20 performers by xwOBA:\n")
print(head(results, 20))

cat("\nHard hit rate analysis:\n")
print(results %>% select(player_name, avg_ev, hard_hit_rate) %>% head(10))

Real-World Application

MLB teams use play-by-play analysis for defensive positioning. The Tampa Bay Rays analyze historical data to optimize their shifts, while the Boston Red Sox employ run expectancy analysis to evaluate base-running decisions.

Front offices across baseball have invested heavily in analytics infrastructure, hiring data scientists, statisticians, and engineers to build sophisticated systems for player evaluation. Organizations like the Cleveland Guardians and Houston Astros have become industry leaders, using data-driven approaches to identify undervalued players and optimize their rosters despite financial constraints.

Interpreting the Results

Base-out state	Expected runs
Bases empty, 0 outs	0.481
Runner on 1st, 0 outs	0.859
Runner on 2nd, 0 outs	1.100
Bases loaded, 0 outs	2.254
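One common way to read these values is to take the difference between two states: the change in run expectancy, plus any runs that scored, is the run value of the play. The sketch below uses only the four values from the table above. The dictionary keys and the run_value helper are names invented for this illustration.

```python
# Run expectancy values copied from the table above (all with 0 outs)
RE_0_OUTS = {'empty': 0.481, '1st': 0.859, '2nd': 1.100, 'loaded': 2.254}

def run_value(state_before, state_after, runs_scored=0.0):
    """Change in run expectancy caused by a play, plus runs scored on it."""
    return RE_0_OUTS[state_after] - RE_0_OUTS[state_before] + runs_scored

# A leadoff single moves the inning from bases empty to a runner on 1st
leadoff_single = run_value('empty', '1st')
# A leadoff double moves it from bases empty to a runner on 2nd
leadoff_double = run_value('empty', '2nd')

print(f"Leadoff single: {leadoff_single:+.3f} runs")  # +0.378
print(f"Leadoff double: {leadoff_double:+.3f} runs")  # +0.619
```

Evaluating plays that also change the out count, such as a caught stealing, would need the full 24-state table rather than the 0-out rows shown here.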

Key Takeaways

This analytical approach provides objective, data-driven insights that inform strategic decision-making across MLB organizations.
Modern baseball analytics combines historical statistical analysis with cutting-edge tracking technologies to measure player performance with unprecedented precision.
Understanding these concepts is essential for anyone working in baseball analytics, from entry-level analysts to front office executives.
The practical application of these techniques has led to measurable competitive advantages for teams that effectively implement data-driven strategies.
Continued evolution in data collection and analytical methodologies ensures that baseball analytics remains a dynamic and rapidly advancing field.
