Chapter 4: Data Wrangling and Play-by-Play Analysis
Data wrangling is the process of transforming raw baseball data into clean, structured formats suitable for analysis. Play-by-play data records every event in every game, so it must be parsed, validated, and enriched with game-state variables before it yields meaningful insights.
Understanding Play-by-Play Analysis
Play-by-play analysis has transformed modern baseball decision-making. Teams across MLB now employ dedicated analysts who specialize in these techniques to gain competitive advantages. The insights derived from this analysis inform everything from in-game strategy to long-term roster construction.
Modern analytics combines historical data with cutting-edge tracking technologies like Statcast, which measures exit velocity, launch angle, sprint speed, and defensive positioning with unprecedented precision. This wealth of data enables teams to make more informed decisions and optimize player performance across all aspects of the game.
Key Components
- Event Parsing: Converting Retrosheet event files into structured data with consistent field names
- State Variables: Calculating base-out states, inning, score differential, and situational variables
- Run Expectancy: Determining expected runs from each base-out state based on historical data
- Data Validation: Checking for inconsistencies, missing values, and data quality issues
- Aggregation: Summarizing play-level data to player, game, or season levels
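The state-variable and validation steps in the list above can be sketched with pandas. This is a minimal illustration, not a full pipeline: the column names (`outs`, `on_1b`, `on_2b`, `on_3b`, `bat_score`, `fld_score`) and the helper names are hypothetical stand-ins for whatever the parsed event data provides, and the three rows are invented sample values rather than real game data.

```python
import pandas as pd

# Hypothetical parsed play-by-play rows (invented values for illustration)
plays = pd.DataFrame({
    "game_id": ["G1", "G1", "G1"],
    "inning": [1, 1, 1],
    "outs": [0, 1, 1],
    "on_1b": [False, True, False],
    "on_2b": [False, False, True],
    "on_3b": [False, False, False],
    "bat_score": [0, 0, 0],
    "fld_score": [0, 0, 0],
})


def add_state_variables(df):
    """Add base state, base-out state, and score differential columns."""
    out = df.copy()
    # Encode runners as a three-character string: "100" = runner on first only
    out["bases"] = (
        out["on_1b"].astype(int).astype(str)
        + out["on_2b"].astype(int).astype(str)
        + out["on_3b"].astype(int).astype(str)
    )
    # One of 24 base-out states, e.g. "100_1" = runner on first, one out
    out["base_out_state"] = out["bases"] + "_" + out["outs"].astype(str)
    # Score differential from the batting team's point of view
    out["score_diff"] = out["bat_score"] - out["fld_score"]
    return out


def validate_plays(df):
    """Return a list of data-quality problems (empty if none are found)."""
    problems = []
    if df[["outs", "on_1b", "on_2b", "on_3b"]].isna().any().any():
        problems.append("missing values in state columns")
    if not df["outs"].dropna().isin([0, 1, 2]).all():
        problems.append("outs outside 0-2")
    return problems


states = add_state_variables(plays)
issues = validate_plays(plays)
print(states[["bases", "base_out_state", "score_diff"]])
print(issues)
```

Encoding the state as a single string keeps later grouping simple, since each of the 24 base-out states becomes one group key.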
Mathematical Formula
Run Expectancy = Average runs scored from current state until end of inning
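The average is taken over every historical occurrence of a given base-out state, and there are 24 such states (8 runner configurations times 3 out counts). The resulting values give analysts an objective baseline for comparing situations and evaluating the plays that move a team from one state to another.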
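A minimal sketch of this calculation follows. It assumes one row per plate appearance, in chronological order within each half-inning, with hypothetical columns `half_inning_id`, `base_out_state` (the state before the play), and `runs_on_play`. The sample rows are invented for illustration and are far too few to produce realistic values.

```python
import pandas as pd

# Invented sample: state before each plate appearance and runs scored on it
plays = pd.DataFrame({
    "half_inning_id": ["A"] * 5 + ["B"] * 3,
    "base_out_state": ["000_0", "100_0", "100_1", "100_2", "000_2",
                       "000_0", "000_1", "000_2"],
    "runs_on_play": [0, 0, 0, 2, 0, 0, 0, 0],
})


def run_expectancy(df):
    """Average runs scored from each state until the end of the half-inning."""
    grouped = df.groupby("half_inning_id")["runs_on_play"]
    # Total runs the batting team scored in the half-inning
    inning_total = grouped.transform("sum")
    # Runs already scored in the half-inning before the current play
    runs_before = grouped.cumsum() - df["runs_on_play"]
    # Runs scored from the current play through the end of the half-inning
    runs_to_end = inning_total - runs_before
    return runs_to_end.groupby(df["base_out_state"]).mean()


re_matrix = run_expectancy(plays)
print(re_matrix)
```

In practice the same grouping would be applied to full seasons of play-by-play data, typically restricted to complete half-innings so that shortened final innings do not bias the averages.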
Python Implementation
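import pandas as pd
from pybaseball import statcast

HARD_HIT_MPH = 95  # Statcast's hard-hit threshold: exit velocity of 95+ mph


def analyze_baseball_data(start_date, end_date, min_batted_balls=50):
    """
    Summarize Statcast batted-ball quality by player.

    Parameters:
        start_date: Start date for analysis (YYYY-MM-DD)
        end_date: End date for analysis (YYYY-MM-DD)
        min_batted_balls: Minimum balls in play needed to qualify

    Returns:
        DataFrame with calculated metrics, sorted by xwOBA on contact
    """
    # Fetch pitch-level Statcast data (one row per pitch)
    data = statcast(start_dt=start_date, end_dt=end_date)

    # Keep balls in play that have a tracked exit velocity
    batted = data[(data['type'] == 'X') & data['launch_speed'].notna()].copy()
    batted['hard_hit'] = batted['launch_speed'] >= HARD_HIT_MPH

    # Calculate key metrics with named aggregation.
    # Note: player_name may hold the pitcher's name rather than the batter's,
    # depending on how the query is built; group on the numeric 'batter' ID
    # if hitter-level results are required.
    metrics = batted.groupby('player_name').agg(
        avg_ev=('launch_speed', 'mean'),
        max_ev=('launch_speed', 'max'),
        avg_la=('launch_angle', 'mean'),
        xwOBA=('estimated_woba_using_speedangle', 'mean'),
        hard_hit_rate=('hard_hit', 'mean'),
        total_batted_balls=('launch_speed', 'count'),
    ).reset_index().rename(columns={'player_name': 'player'})

    # Express hard-hit rate as a percentage of batted balls
    metrics['hard_hit_rate'] *= 100

    # Filter to qualified players
    qualified = metrics[metrics['total_batted_balls'] >= min_batted_balls]
    return qualified.sort_values('xwOBA', ascending=False)


# Example usage
results = analyze_baseball_data('2023-04-01', '2023-10-01')
print("Top 20 performers by xwOBA:")
print(results.head(20))

# Hard-hit rate: share of batted balls hit at 95 mph or harder
print("\nHard hit rate analysis:")
print(results[['player', 'avg_ev', 'hard_hit_rate']].head(10))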
R Implementation
library(tidyverse)
library(baseballr)

# Summarize Statcast batted-ball quality by player
analyze_baseball_data <- function(start_date, end_date, min_batted_balls = 50) {
  # Fetch pitch-level Statcast data.
  # Baseball Savant limits the number of rows returned per request, so long
  # date ranges may need to be fetched in smaller chunks and combined.
  data <- statcast_search(
    start_date = start_date,
    end_date = end_date
  )

  # Calculate metrics by player, using balls in play only
  metrics <- data %>%
    filter(type == "X", !is.na(launch_speed)) %>%
    group_by(player_name) %>%
    summarise(
      avg_ev = mean(launch_speed),
      max_ev = max(launch_speed),
      avg_la = mean(launch_angle, na.rm = TRUE),
      xwOBA = mean(estimated_woba_using_speedangle, na.rm = TRUE),
      # Hard-hit rate: percentage of batted balls at 95+ mph exit velocity
      hard_hit_rate = mean(launch_speed >= 95) * 100,
      total_batted_balls = n(),
      .groups = "drop"
    ) %>%
    filter(total_batted_balls >= min_batted_balls) %>%
    arrange(desc(xwOBA))

  return(metrics)
}

# Example usage
results <- analyze_baseball_data("2023-04-01", "2023-10-01")
cat("Top 20 performers by xwOBA:\n")
print(head(results, 20))

# Hard-hit rate: share of batted balls hit at 95 mph or harder
cat("\nHard hit rate analysis:\n")
print(results %>% select(player_name, avg_ev, hard_hit_rate) %>% head(10))
Real-World Application
MLB teams use play-by-play analysis for defensive positioning. The Tampa Bay Rays analyze historical data to optimize their shifts, while the Boston Red Sox employ run expectancy analysis to evaluate base-running decisions.
Front offices across baseball have invested heavily in analytics infrastructure, hiring data scientists, statisticians, and engineers to build sophisticated systems for player evaluation. Organizations like the Cleveland Guardians and Houston Astros have become industry leaders, using data-driven approaches to identify undervalued players and optimize their rosters despite financial constraints.
Interpreting the Results
| Base-out state | Expected runs (to end of inning) |
|---|---|
| Bases empty, 0 outs | 0.481 |
| Runner on 1st, 0 outs | 0.859 |
| Runner on 2nd, 0 outs | 1.100 |
| Bases loaded, 0 outs | 2.254 |
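
One way to read this table is through the differences between states. The run value of a play is commonly computed as the run expectancy after the play, minus the run expectancy before it, plus any runs that scored on the play. The sketch below applies that idea to the 0-out values in the table; the `run_value` helper and the state labels are illustrative names, and only transitions that stay at 0 outs can be evaluated from these four entries.

```python
# Expected runs by base state with 0 outs, copied from the table above
RE_ZERO_OUTS = {
    "empty": 0.481,
    "1st": 0.859,
    "2nd": 1.100,
    "loaded": 2.254,
}


def run_value(state_before, state_after, runs_scored=0):
    """Change in run expectancy plus any runs that scored on the play."""
    return RE_ZERO_OUTS[state_after] - RE_ZERO_OUTS[state_before] + runs_scored


# A leadoff single moves the inning from bases empty to a runner on first
leadoff_single = run_value("empty", "1st")
# A leadoff double moves the inning from bases empty to a runner on second
leadoff_double = run_value("empty", "2nd")

print(round(leadoff_single, 3))  # 0.378
print(round(leadoff_double, 3))  # 0.619
```

By this measure a leadoff double is worth roughly 0.24 runs more than a leadoff single, which is the kind of comparison a full 24-state table makes possible for any play.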
Key Takeaways
- Play-by-play analysis provides objective, data-driven insights that inform strategic decision-making across MLB organizations.
- Modern baseball analytics combines historical statistical analysis with cutting-edge tracking technologies to measure player performance with unprecedented precision.
- Understanding these concepts is essential for anyone working in baseball analytics, from entry-level analysts to front office executives.
- The practical application of these techniques has led to measurable competitive advantages for teams that effectively implement data-driven strategies.
- Continued evolution in data collection and analytical methodologies ensures that baseball analytics remains a dynamic and rapidly advancing field.