Understanding Baseball Data Types
Baseball is one of the most data-rich sports in the world, with detailed statistics tracked for every pitch, play, and game. Modern baseball analytics relies on multiple types of data sources, each offering different levels of granularity and insight. This comprehensive guide will help you understand the various types of baseball data, how they're structured, and how to work with them effectively in your analyses.
Types of Baseball Data
Box Score Data
Box score data represents the traditional summary statistics for each game. This is the most aggregated form of baseball data, showing totals and averages for players and teams in a single game.
Typical fields include:
- Batting: At-bats (AB), runs (R), hits (H), RBIs, home runs (HR), walks (BB), strikeouts (SO)
- Pitching: Innings pitched (IP), hits allowed, runs, earned runs (ER), walks, strikeouts, ERA
- Fielding: Putouts (PO), assists (A), errors (E)
- Game metadata: Date, teams, final score, attendance
Box score data is excellent for high-level analysis, season totals, and historical comparisons. It's available for virtually every major league game, making it ideal for long-term trend analysis.
Play-by-Play (Event) Data
Play-by-play data captures every discrete event that occurs during a game. Each row represents a single plate appearance or baserunning event, with details about what happened, who was involved, and the game situation.
Key information captured:
- Event type (single, double, strikeout, walk, ground out, etc.)
- Players involved (batter, pitcher, fielders)
- Game state (inning, outs, runners on base, score)
- Pitch count and sequence
- Event outcomes (runs scored, RBIs, errors)
- Hit location and trajectory
Play-by-play data enables context-aware analysis, such as win probability calculations, leverage index, and situational performance metrics. Retrosheet provides free play-by-play data dating back to 1918.
Pitch-Level Data
Pitch-level data contains information about every individual pitch thrown in a game. This granularity allows for detailed analysis of pitcher tendencies, batter discipline, and pitch sequencing.
Standard pitch-level fields:
- Pitch type (fastball, curveball, slider, changeup, etc.)
- Pitch velocity
- Pitch location (horizontal and vertical coordinates where the pitch crosses the plate)
- Pitch result (called strike, swinging strike, ball, foul, in play)
- Count when thrown (balls-strikes)
- Pitch sequence number in at-bat
PITCHf/x data (2008-2016) and Statcast data (2015-present) provide pitch-level tracking for MLB games. This data is essential for analyzing pitcher effectiveness, batter approach, and umpire tendencies.
Statcast Data
Statcast represents the most advanced publicly available baseball data, using high-speed cameras and radar to track every movement on the field. Introduced in 2015, Statcast measures both pitches and batted balls with unprecedented precision.
Statcast captures:
- Pitch tracking: Velocity, spin rate, spin axis, movement (break), release point, extension
- Batted ball tracking: Exit velocity, launch angle, spray angle, hit distance, hang time
- Fielding tracking: Fielder positioning, route efficiency, catch probability, arm strength
- Baserunning tracking: Sprint speed, acceleration, jump efficiency, stolen base metrics
Statcast data has revolutionized baseball analysis by providing objective measurements of player skills and performance that were previously impossible to quantify accurately.
Data Granularity Levels
Baseball data exists at multiple levels of aggregation, each suited for different analytical purposes:
| Granularity Level | Description | Use Cases | Typical Row Count (per season) |
|---|---|---|---|
| Pitch Level | One row per pitch thrown | Pitch sequencing, umpire analysis, batter discipline, pitcher stuff | ~750,000 pitches |
| Plate Appearance | One row per batter-pitcher matchup | Outcome analysis, matchup studies, situational stats | ~185,000 PAs |
| Game Level | One row per player per game | Performance tracking, streak analysis, daily fantasy | ~75,000 player-games |
| Player-Season | One row per player per season | Awards voting, contract analysis, historical comparisons | ~1,500 players |
| Team-Season | One row per team per season | Team building, payroll analysis, competitive balance | 30 teams |
Understanding which granularity level you need is crucial for efficient data processing and appropriate analysis. More granular data provides greater flexibility but requires more storage and computational resources.
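To make the levels concrete, the sketch below rolls a tiny pitch-level table up to plate-appearance level and then to player level with pandas. The column names (`game_pk`, `at_bat_number`, `batter`, `events`) follow Statcast conventions, but every row is invented for illustration, and real data would need more careful handling of events.

```python
import pandas as pd

# Hypothetical pitch-level rows: one row per pitch (all values are invented).
pitches = pd.DataFrame({
    "game_pk":       [1, 1, 1, 1, 1, 1],
    "at_bat_number": [1, 1, 1, 2, 2, 3],
    "batter":        [100, 100, 100, 200, 200, 100],
    "events":        [None, None, "single", None, "strikeout", "home_run"],
})

# Plate-appearance level: only the final pitch of a PA carries an event.
plate_appearances = pitches.dropna(subset=["events"])

# Player level: one row per batter, with PA and hit counts.
hit_events = {"single", "double", "triple", "home_run"}
player_level = (
    plate_appearances
    .assign(is_hit=plate_appearances["events"].isin(hit_events))
    .groupby("batter")
    .agg(pa=("events", "size"), hits=("is_hit", "sum"))
    .reset_index()
)

print(len(pitches), len(plate_appearances), len(player_level))
print(player_level)
```

Each step discards detail that cannot be recovered afterward, which is why it usually pays to store the most granular level you expect to need and aggregate on demand.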
Key Identifiers and Coding Systems
Working with baseball data from multiple sources requires understanding the various identification systems used to track players, games, and teams.
Player Identifiers
Different data providers use different player ID systems:
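- Retrosheet ID: Format like "bondb001" (first four letters of the last name + first initial + three-digit number). Used by Retrosheet for historical data.
- Baseball Reference ID (bbref_id): Similar in spirit to Retrosheet, like "bondsba01" (up to five letters of the last name + first two letters of the first name + two-digit number). Used by Baseball-Reference.com.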
- MLBAM ID (mlbam_id): Numeric ID used by MLB Advanced Media (e.g., 111188 for Barry Bonds). Used in Statcast and modern MLB APIs.
- FanGraphs ID (fg_id): Numeric ID used by FanGraphs (e.g., 1109 for Barry Bonds).
- Baseball Prospectus ID (bp_id): Numeric ID used by Baseball Prospectus.
The Chadwick Bureau maintains a comprehensive player ID mapping file that cross-references these different ID systems, making it possible to link data across sources.
Game Identifiers
Games are typically identified using a standardized format:
- Retrosheet Game ID: Format "HHHYYYYMMDDN" (home team code + date YYYYMMDD + game number, where 0 is a single game and 1 or 2 is a game of a doubleheader). Example: "SFN201904100" for a Giants home game on April 10, 2019.
- MLB Game PK: Numeric primary key used by MLB systems (e.g., 566789).
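A Retrosheet game ID can be split into its parts with plain string slicing. The sketch below assumes the 12-character layout described above (three-letter home team, eight-digit date, one-digit game number); the `RetrosheetGame` type and `parse_retrosheet_game_id` function are hypothetical names introduced for this example.

```python
from datetime import date, datetime
from typing import NamedTuple


class RetrosheetGame(NamedTuple):
    home_team: str
    game_date: date
    game_number: int  # 0 = single game, 1 or 2 = game of a doubleheader


def parse_retrosheet_game_id(game_id: str) -> RetrosheetGame:
    """Split a 12-character Retrosheet game ID into team, date, and game number."""
    if len(game_id) != 12:
        raise ValueError(f"Expected 12 characters, got {len(game_id)}: {game_id!r}")
    home_team = game_id[:3]
    game_date = datetime.strptime(game_id[3:11], "%Y%m%d").date()
    game_number = int(game_id[11])
    return RetrosheetGame(home_team, game_date, game_number)


game = parse_retrosheet_game_id("SFN201904100")
print(game.home_team, game.game_date.isoformat(), game.game_number)
```

Parsing the date into a real `date` object, rather than keeping it as text, makes it straightforward to join against sources that store dates in ISO 8601 format.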
Team Codes
Teams are identified using short abbreviation codes, usually three letters. The examples below use Retrosheet codes:
| League | Example Teams | Codes |
|---|---|---|
| American League | New York Yankees, Boston Red Sox, Los Angeles Angels | NYA, BOS, ANA |
| National League | San Francisco Giants, Los Angeles Dodgers, Atlanta Braves | SFN, LAN, ATL |
Note: Team codes can vary between data sources. For example, the Yankees might be "NYY" in some systems and "NYA" in Retrosheet. Always verify the coding system used in your data source.
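One way to handle this is a small crosswalk that converts every code to a single convention before joining. The mapping below is a hypothetical, deliberately incomplete sketch: it covers only the six teams in the table above and treats the MLB-style abbreviations as canonical, which is an assumption made for this example.

```python
# Hypothetical crosswalk: Retrosheet codes -> the abbreviations this sketch
# treats as canonical. Only the six teams from the table are covered.
RETROSHEET_TO_CANONICAL = {
    "NYA": "NYY", "BOS": "BOS", "ANA": "LAA",
    "SFN": "SF", "LAN": "LAD", "ATL": "ATL",
}


def normalize_team_code(code: str) -> str:
    """Return the canonical abbreviation for a Retrosheet or canonical code."""
    key = code.strip().upper()
    if key in RETROSHEET_TO_CANONICAL:
        return RETROSHEET_TO_CANONICAL[key]
    if key in RETROSHEET_TO_CANONICAL.values():
        return key  # already in canonical form
    raise KeyError(f"Unknown team code: {code!r}")


print([normalize_team_code(c) for c in ["NYA", "nyy", " SFN "]])
```

Raising on unknown codes, instead of passing them through, surfaces mismatches at load time rather than as silently dropped rows after a join.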
Data Schemas and Common Fields
Standard Batting Statistics Schema
A typical batting statistics table includes these fields:
player_id VARCHAR # Unique player identifier
season INT # Year (e.g., 2023)
team VARCHAR # Team abbreviation
games INT # Games played (G)
plate_appearances INT # Total plate appearances (PA)
at_bats INT # Official at-bats (AB)
runs INT # Runs scored (R)
hits INT # Hits (H)
doubles INT # Doubles (2B)
triples INT # Triples (3B)
home_runs INT # Home runs (HR)
rbi INT # Runs batted in
walks INT # Walks (BB)
strikeouts INT # Strikeouts (SO)
stolen_bases INT # Stolen bases (SB)
caught_stealing INT # Caught stealing (CS)
batting_avg FLOAT # Batting average (AVG)
on_base_pct FLOAT # On-base percentage (OBP)
slugging_pct FLOAT # Slugging percentage (SLG)
ops FLOAT # On-base plus slugging (OPS)
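The rate fields at the end of this schema are derived from the counting fields, so they can be recomputed as a consistency check. The sketch below uses the standard formulas; note that OBP also needs hit-by-pitch and sacrifice-fly counts, which the schema above does not list, so `hbp` and `sf` are hypothetical extra inputs that default to zero. The function name and sample numbers are invented for illustration.

```python
def batting_rates(ab, h, doubles, triples, hr, bb, hbp=0, sf=0):
    """Recompute AVG, OBP, SLG, and OPS from counting stats."""
    if ab <= 0:
        raise ValueError("at-bats must be positive to compute rate stats")
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {"avg": avg, "obp": obp, "slg": slg, "ops": obp + slg}


# Invented season line: 500 AB, 150 H, 30 2B, 5 3B, 25 HR, 60 BB
rates = batting_rates(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60)
print({name: round(value, 3) for name, value in rates.items()})
```

Comparing recomputed rates against the stored `batting_avg`, `on_base_pct`, and `slugging_pct` columns is a cheap way to catch rows where the counting stats and the rate stats came from different snapshots.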
Standard Pitching Statistics Schema
player_id VARCHAR # Unique player identifier
season INT # Year
team VARCHAR # Team abbreviation
games INT # Games pitched (G)
games_started INT # Games started (GS)
innings_pitched FLOAT # Innings pitched (IP)
hits_allowed INT # Hits allowed (H)
runs_allowed INT # Runs allowed (R)
earned_runs INT # Earned runs (ER)
home_runs_allowed INT # Home runs allowed (HR)
walks INT # Walks issued (BB)
strikeouts INT # Strikeouts (SO)
wins INT # Wins (W)
losses INT # Losses (L)
saves INT # Saves (SV)
era FLOAT # Earned run average
whip FLOAT # Walks + hits per inning pitched
k_per_9 FLOAT # Strikeouts per 9 innings (K/9)
bb_per_9 FLOAT # Walks per 9 innings (BB/9)
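A common pitfall with `innings_pitched` is that many sources store it in box-score notation, where the digit after the decimal point counts outs (6.1 means six and one-third innings, 6.2 means six and two-thirds). The sketch below assumes that notation and converts it to outs before computing the rate stats; some sources store true decimals instead, so check before applying it. The function names and the sample line are hypothetical.

```python
def ip_to_outs(ip_notation: str) -> int:
    """Convert box-score innings notation (e.g. '180.2') to total outs."""
    whole, _, fraction = ip_notation.partition(".")
    fraction = fraction or "0"
    if fraction not in {"0", "1", "2"}:
        raise ValueError(f"Not box-score notation: {ip_notation!r}")
    return int(whole) * 3 + int(fraction)


def pitching_rates(ip_notation: str, er: int, h: int, bb: int, so: int) -> dict:
    """Compute ERA, WHIP, K/9, and BB/9 from counting stats."""
    innings = ip_to_outs(ip_notation) / 3
    if innings == 0:
        raise ValueError("innings pitched must be positive")
    return {
        "era": 9 * er / innings,
        "whip": (bb + h) / innings,
        "k_per_9": 9 * so / innings,
        "bb_per_9": 9 * bb / innings,
    }


# Invented season line: 180.2 IP, 60 ER, 150 H, 45 BB, 200 SO
pitcher_rates = pitching_rates("180.2", er=60, h=150, bb=45, so=200)
print({name: round(value, 2) for name, value in pitcher_rates.items()})
```

The notation is parsed from a string rather than a float so that a value like 180.2 is never subject to floating-point rounding before the outs digit is read.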
Statcast Event Schema
Statcast exports contain one row per pitch, and the batted-ball fields are empty on pitches without tracked contact. Key fields include:
game_date DATE # Date of game
player_name VARCHAR # Name of the player the search is keyed on (batter or pitcher)
batter INT # Batter MLB ID
pitcher INT # Pitcher MLB ID
events VARCHAR # Event result (single, double, out, etc.)
description VARCHAR # Pitch result description
zone INT # Strike zone location (1-9 inside the zone, 11-14 outside)
stand CHAR(1) # Batter side (L/R)
p_throws CHAR(1) # Pitcher throws (L/R)
type CHAR(1) # Pitch result code (B = ball, S = strike, X = in play)
balls INT # Balls in count
strikes INT # Strikes in count
pitch_type VARCHAR # Pitch classification (FF/SL/CU/CH/etc.)
release_speed FLOAT # Pitch velocity (mph)
release_spin_rate INT # Spin rate (rpm)
release_extension FLOAT # Release point extension (ft)
pfx_x FLOAT # Horizontal movement (feet, catcher's perspective)
pfx_z FLOAT # Vertical movement (feet)
plate_x FLOAT # Horizontal location at plate (feet)
plate_z FLOAT # Vertical location at plate (feet)
launch_speed FLOAT # Exit velocity (mph)
launch_angle FLOAT # Launch angle (degrees)
hit_distance_sc INT # Projected hit distance (feet)
hc_x FLOAT # Hit coordinate X
hc_y FLOAT # Hit coordinate Y
estimated_ba_using_speedangle FLOAT # Expected batting average (xBA)
estimated_woba_using_speedangle FLOAT # Expected weighted on-base avg (xwOBA)
Statcast Data Fields Explained
Statcast provides numerous advanced metrics that require explanation:
Pitch Metrics
Release Speed (Velocity): The speed of the pitch measured out of the pitcher's hand, in miles per hour. Fastballs typically range from 90-100 mph for starters, while breaking balls are slower (75-85 mph).
Spin Rate: How fast the ball is spinning, measured in revolutions per minute (RPM). Higher spin rates generally lead to more movement and better performance. Four-seam fastballs average 2200-2400 RPM, while curveballs can exceed 2800 RPM.
Spin Axis: The orientation of the ball's rotation, measured in degrees (0-360). This determines the direction of movement. A spin axis of 180 degrees produces pure backspin (rising fastball), while 0 degrees is pure topspin (12-6 curveball).
Release Extension: How far in front of the pitching rubber the pitcher releases the ball, measured in feet. Greater extension effectively reduces the distance to home plate, giving batters less reaction time.
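Horizontal Movement (pfx_x) and Vertical Movement (pfx_z): How far the pitch deviates, because of spin, from the path it would follow under gravity alone. Statcast exports report these values in feet (multiply by 12 for inches), whereas the older PITCHf/x convention was inches. Positive values indicate movement to the right (from the catcher's perspective) and upward, respectively.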
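The effect of release extension on reaction time can be estimated with simple arithmetic. The sketch below assumes the pitch holds its release speed all the way to the plate (it ignores drag, so real flight times are slightly longer) and measures distance from the rubber to the back of home plate at 60.5 feet. The function name is hypothetical.

```python
RUBBER_TO_PLATE_FT = 60.5  # front of the pitching rubber to the back point of home plate


def approx_flight_time(release_speed_mph: float, extension_ft: float) -> float:
    """Approximate seconds from release to the plate, assuming constant speed."""
    distance_ft = RUBBER_TO_PLATE_FT - extension_ft
    speed_ft_per_s = release_speed_mph * 5280 / 3600
    return distance_ft / speed_ft_per_s


short_extension = approx_flight_time(95.0, 5.5)
long_extension = approx_flight_time(95.0, 7.0)
print(f"5.5 ft extension: {short_extension:.3f} s")
print(f"7.0 ft extension: {long_extension:.3f} s")
print(f"Difference: {(short_extension - long_extension) * 1000:.1f} ms")
```

Under these assumptions, the extra extension trims only a few milliseconds, but at these speeds that is comparable to a modest velocity gain from the batter's point of view.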
Batted Ball Metrics
Exit Velocity (Launch Speed): The speed of the ball off the bat, measured in miles per hour. Exit velocities of 95 mph or higher are considered "hard hit" and correlate strongly with positive outcomes. The MLB average is around 88-89 mph.
Launch Angle: The vertical angle at which the ball leaves the bat, measured in degrees. By MLB's classification, angles below 10 degrees are ground balls, 10-25 degrees are line drives, 25-50 degrees are fly balls, and above 50 degrees are pop-ups. The optimal launch angle for power is typically 25-30 degrees.
Spray Angle: The horizontal direction of the batted ball. This helps analyze a hitter's tendency to pull the ball versus hitting to the opposite field.
Hit Distance: The projected distance the ball would travel based on its trajectory, measured in feet. This accounts for both the carry and the landing point.
Expected Batting Average (xBA): The expected batting average on a batted ball based primarily on its exit velocity and launch angle (and, for certain types of batted balls, the batter's sprint speed), using historical data. This largely removes the effects of luck and defensive positioning.
Expected Weighted On-Base Average (xwOBA): Similar to xBA but weighted by the value of each outcome (single vs. double vs. home run). This provides a more complete picture of expected offensive value.
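Barrel Rate: The percentage of batted balls that achieve the optimal combination of exit velocity and launch angle. A barrel requires an exit velocity of at least 98 mph; at 98 mph the qualifying launch angle is 26-30 degrees, and the window widens as exit velocity increases. MLB defines the barrel zone as contact that has historically produced at least a .500 batting average and a 1.500 slugging percentage.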
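These thresholds translate directly into small classification helpers. The sketch below uses MLB's published launch-angle bands (ground ball below 10 degrees, line drive 10-25, fly ball 25-50, pop-up above 50) and the 95 mph hard-hit cutoff. The barrel test is an approximation, not Statcast's official classifier: its inequalities reproduce the 26-30 degree window at 98 mph and widen it as exit velocity rises. When the export includes Statcast's own `launch_speed_angle` field, where a value of 6 marks a barrel, prefer that. All function names here are hypothetical.

```python
def batted_ball_type(launch_angle: float) -> str:
    """Classify a batted ball by launch angle (degrees)."""
    if launch_angle < 10:
        return "ground_ball"
    if launch_angle <= 25:
        return "line_drive"
    if launch_angle <= 50:
        return "fly_ball"
    return "pop_up"


def is_hard_hit(exit_velocity: float) -> bool:
    """Hard-hit balls leave the bat at 95 mph or more."""
    return exit_velocity >= 95


def is_barrel_approx(exit_velocity: float, launch_angle: float) -> bool:
    """Approximate barrel test: 26-30 degrees at 98 mph, widening with speed."""
    return (
        exit_velocity >= 98
        and launch_angle <= 50
        and 1.5 * exit_velocity - launch_angle >= 117
        and exit_velocity + launch_angle >= 124
    )


for ev, la in [(98, 28), (98, 31), (100, 33), (105, 15)]:
    print(ev, la, batted_ball_type(la), is_hard_hit(ev), is_barrel_approx(ev, la))
```

A fixed 26-30 degree filter applied at every exit velocity would miss the widened part of the window and undercount barrels for the hardest-hit balls.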
Traditional Stats vs. Tracking Data
Understanding the relationship between traditional statistics and modern tracking data is crucial for comprehensive analysis:
| Traditional Stat | Tracking Data Equivalent | Advantage of Tracking Data |
|---|---|---|
| Batting Average (AVG) | Expected Batting Average (xBA) | Removes luck from hits vs. outs, accounts for quality of contact |
| Slugging Percentage (SLG) | Expected Slugging (xSLG) | Predicts future performance better by focusing on process over results |
| Earned Run Average (ERA) | Expected ERA (xERA) or FIP | Isolates pitcher performance from defense and luck |
| Home Runs | Barrel Rate, Average Exit Velocity | Identifies power hitters who may be unlucky with park factors |
| Strikeouts | Whiff Rate, Chase Rate | Reveals underlying skills in pitch recognition and bat control |
| Stolen Bases | Sprint Speed, Jump Efficiency | Quantifies pure speed and baserunning skill independent of opportunity |
Key Insight: Traditional statistics measure outcomes, while tracking data measures process and skills. Expected metrics (xBA, xwOBA, xERA) often predict future performance better than actual results because they remove the noise of random variation and defense.
Data Quality Considerations
When working with baseball data, be aware of these common quality issues:
Missing Data
- Historical gaps: Statcast data only exists from 2015 onward. PITCHf/x covers 2008-2016 but has different fields.
- Incomplete tracking: Some parks had tracking issues in early Statcast years (2015-2016), particularly for pitch spin data.
- Minor league data: Publicly available data is much more limited for minor leagues.
- International leagues: Data quality and availability varies significantly for NPB, KBO, and other international leagues.
Data Accuracy Issues
- Scorer bias: Official scorers have discretion on hits vs. errors, which can introduce bias.
- Measurement error: Statcast measurements are generally accurate but can have noise, especially in edge cases.
- Calibration differences: Tracking systems may be calibrated differently across ballparks, though MLB works to standardize this.
- Weather effects: Temperature, humidity, and altitude affect ball flight but aren't always accounted for in data.
Definitional Inconsistencies
- Pitch classification: Pitch types are algorithmically classified and can be inconsistent, especially for hybrid pitches (cutter vs. slider).
- Position coding: Players who play multiple positions may be coded differently across sources.
- Game vs. appearance: Some sources count games played, others count games started or appearances.
- Season boundaries: Playoff games may or may not be included in season totals depending on the source.
Data Validation Best Practices
- Cross-reference key totals (games, plate appearances) across multiple sources
- Check for impossible values (negative counting stats, pitch velocities over 110 mph, launch angles outside -90 to 90 degrees)
- Validate that aggregated data sums correctly to totals
- Look for suspicious patterns that might indicate data entry errors
- Be aware of small sample size issues, especially early in the season
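The impossible-value check can be sketched as a function that returns one boolean column per rule. The column names follow the Statcast schema, but the thresholds (30-110 mph for pitch speed, -90 to 90 degrees for launch angle) are assumptions chosen for this example, and the sample rows are invented. Missing values are deliberately not flagged, since they are expected on pitches without tracked contact.

```python
import pandas as pd


def out_of_range(series: pd.Series, low: float, high: float) -> pd.Series:
    """True where a value is present but falls outside [low, high]."""
    return series.notna() & ~series.between(low, high)


def flag_impossible_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per validation rule, aligned to df's index."""
    return pd.DataFrame({
        "bad_release_speed": out_of_range(df["release_speed"], 30, 110),
        "bad_launch_angle": out_of_range(df["launch_angle"], -90, 90),
        "negative_count": (df["balls"] < 0) | (df["strikes"] < 0),
    }, index=df.index)


# Invented rows: the second row violates every rule, the third has missing values.
sample = pd.DataFrame({
    "release_speed": [95.1, 140.0, None],
    "launch_angle": [12.0, 95.0, None],
    "balls": [1, -1, 0],
    "strikes": [2, 0, 1],
})

flags = flag_impossible_values(sample)
print(sample[flags.any(axis=1)])
```

Keeping the flags in a separate frame makes it easy to count violations per rule before deciding whether to drop, correct, or investigate the affected rows.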
Joining Multiple Data Sources
Most baseball analyses require combining data from multiple sources. This requires careful attention to identifiers, time periods, and data granularity.
Common Join Scenarios
1. Adding player demographics to statistics: Join player biographical data (name, birthdate, position) to season statistics using player_id.
2. Combining Statcast with traditional stats: Aggregate Statcast event-level data to the player-season level, then join it to season totals on player_id and season.
3. Merging pitching and batting data: Create matchup data by joining pitcher and batter records on game_id and plate appearance sequence.
4. Adding park factors: Join game-level data with ballpark metadata using team code and season.
5. Cross-referencing player IDs: Use the Chadwick ID mapping file to connect players across different data sources (FanGraphs, Baseball Reference, Statcast).
Join Keys and Best Practices
- Use multiple keys when available: Joining on both player_id AND season is more robust than player_id alone.
- Verify cardinality: Ensure your joins produce the expected number of rows (one-to-one, one-to-many, many-to-many).
- Handle missing matches: Decide whether to use inner joins (only matched records) or left/outer joins (keep unmatched records).
- Standardize date formats: Convert all dates to a consistent format (ISO 8601: YYYY-MM-DD is recommended).
- Normalize team codes: Create a mapping table to standardize team abbreviations across data sources.
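Several of these practices map onto options of pandas' `DataFrame.merge`: a composite key via `on`, a cardinality check via `validate`, and match tracking via `indicator`. The sketch below uses two small invented tables; the column names are hypothetical stand-ins for real season and Statcast summaries.

```python
import pandas as pd

# Invented season-level tables keyed by player_id and season.
season_stats = pd.DataFrame({
    "player_id": [1, 2, 3],
    "season": [2023, 2023, 2023],
    "avg": [0.300, 0.250, 0.275],
})
statcast_summary = pd.DataFrame({
    "player_id": [1, 2, 4],
    "season": [2023, 2023, 2023],
    "avg_exit_velo": [92.0, 88.5, 90.1],
})

# Left join on a composite key; validate raises if either side has duplicate
# keys, and indicator adds a _merge column recording where each row matched.
merged = season_stats.merge(
    statcast_summary,
    on=["player_id", "season"],
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows: {len(merged)}, unmatched: {len(unmatched)}")

# A duplicated key on the right side is caught before it can inflate row counts.
with_duplicate = pd.concat([statcast_summary, statcast_summary.iloc[[0]]])
try:
    season_stats.merge(with_duplicate, on=["player_id", "season"], validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(f"Duplicate keys detected: {duplicate_detected}")
```

Letting the merge fail loudly on unexpected duplicates is usually preferable to discovering inflated totals later in the analysis.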
Practical Code Examples
Python: Loading and Understanding Data Schemas
import pandas as pd
import numpy as np
from pybaseball import statcast, playerid_lookup, batting_stats
# Example 1: Loading Statcast data for a date range
print("Loading Statcast data...")
statcast_data = statcast(start_dt='2023-04-01', end_dt='2023-04-07')
# Examine the schema
print(f"\nDataset shape: {statcast_data.shape}")
print(f"Columns: {len(statcast_data.columns)}")
print("\nFirst few column names:")
print(statcast_data.columns[:20].tolist())
# Check data types
print("\nData types for key fields:")
key_fields = ['game_date', 'batter', 'pitcher', 'events', 'launch_speed',
'launch_angle', 'release_speed', 'release_spin_rate']
print(statcast_data[key_fields].dtypes)
# Summary statistics for Statcast metrics
print("\nStatcast metrics summary:")
statcast_metrics = ['launch_speed', 'launch_angle', 'release_speed',
'release_spin_rate', 'hit_distance_sc']
print(statcast_data[statcast_metrics].describe())
# Example 2: Loading season-level batting statistics
print("\n" + "="*60)
print("Loading season batting statistics...")
batting_2023 = batting_stats(2023, qual=100) # Min 100 PA
print(f"\nBatting stats shape: {batting_2023.shape}")
print("\nColumns available:")
print(batting_2023.columns.tolist())
# Show sample of traditional vs. advanced stats
print("\nSample player statistics:")
display_cols = ['Name', 'Team', 'G', 'PA', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+']
print(batting_2023[display_cols].head(10).to_string())
# Example 3: Understanding player IDs
print("\n" + "="*60)
print("Working with player identifiers...")
# Look up a player to see different ID systems
player_ids = playerid_lookup('trout', 'mike')
print("\nMike Trout's IDs across systems:")
print(player_ids[['name_first', 'name_last', 'key_mlbam',
'key_fangraphs', 'key_bbref', 'mlb_played_first']].to_string())
# Example 4: Filtering and understanding data granularity
print("\n" + "="*60)
print("Understanding data granularity...")
# Count events by type
print("\nEvent type distribution (play-by-play level):")
event_counts = statcast_data['events'].value_counts()
print(event_counts.head(15))
# Aggregate to player level
print("\nAggregating to player-level statistics:")
player_summary = statcast_data.groupby('batter').agg({
'launch_speed': ['count', 'mean', 'max'],
'launch_angle': 'mean',
'events': lambda x: (x == 'home_run').sum()
}).round(2)
player_summary.columns = ['batted_balls', 'avg_exit_velo', 'max_exit_velo',
'avg_launch_angle', 'home_runs']
print(player_summary.head(10).to_string())
# Example 5: Data type handling and conversion
print("\n" + "="*60)
print("Handling data types...")
# Record memory use before conversion so the comparison is meaningful
mem_before = statcast_data.memory_usage(deep=True).sum() / 1024**2
# Convert dates properly
statcast_data['game_date'] = pd.to_datetime(statcast_data['game_date'])
# Ensure numeric fields are proper types
numeric_fields = ['launch_speed', 'launch_angle', 'release_speed',
                  'release_spin_rate', 'plate_x', 'plate_z']
for field in numeric_fields:
    statcast_data[field] = pd.to_numeric(statcast_data[field], errors='coerce')
# Categorical fields for efficiency
categorical_fields = ['events', 'pitch_type', 'stand', 'p_throws', 'type']
for field in categorical_fields:
    statcast_data[field] = statcast_data[field].astype('category')
mem_after = statcast_data.memory_usage(deep=True).sum() / 1024**2
print("\nMemory usage before/after optimization:")
print(f"Before: {mem_before:.2f} MB")
print(f"After: {mem_after:.2f} MB")
# Check for missing values
print("\nMissing value percentages for key fields:")
missing_pct = (statcast_data[key_fields].isnull().sum() / len(statcast_data) * 100).round(2)
print(missing_pct)
Python: Merging Multiple Data Sources
import pandas as pd
from pybaseball import statcast, batting_stats, pitching_stats, playerid_lookup
# Example 6: Merging Statcast with season statistics
print("Merging Statcast data with season totals...")
# Load Statcast data for the 2023 regular season (March 30 to October 1);
# a full season is a large download
statcast_2023 = statcast(start_dt='2023-03-30', end_dt='2023-10-01')
# Aggregate Statcast metrics by batter
statcast_agg = statcast_2023.groupby('batter').agg({
'launch_speed': 'mean',
'launch_angle': 'mean',
'release_spin_rate': 'mean',
'events': ['count', lambda x: (x.isin(['single', 'double', 'triple', 'home_run'])).sum()]
}).round(2)
statcast_agg.columns = ['avg_exit_velo', 'avg_launch_angle',
'avg_spin_vs', 'statcast_events', 'statcast_hits']
statcast_agg = statcast_agg.reset_index()
# Load season batting statistics
batting_2023 = batting_stats(2023, qual=50)
# batting_stats() is keyed by FanGraphs ID (IDfg), while Statcast uses MLBAM IDs.
# Bridge the two with pybaseball's reverse lookup, which draws on the
# Chadwick Bureau register.
from pybaseball import playerid_reverse_lookup
id_map = playerid_reverse_lookup(statcast_agg['batter'].tolist(), key_type='mlbam')
print(f"\nStatcast aggregated data: {statcast_agg.shape}")
print(f"Season batting data: {batting_2023.shape}")
# Statcast (MLBAM) -> ID map -> FanGraphs season stats
merged = (
    statcast_agg
    .merge(id_map[['key_mlbam', 'key_fangraphs']],
           left_on='batter', right_on='key_mlbam', how='left')
    .merge(batting_2023, left_on='key_fangraphs', right_on='IDfg', how='inner')
)
print("\nMerge result sample (first 10 rows):")
print(merged[['Name', 'avg_exit_velo', 'avg_launch_angle', 'AVG', 'wOBA']].head(10).to_string())
# Example 7: Joining pitcher and batter data for matchups
print("\n" + "="*60)
print("Creating pitcher-batter matchup data...")
# Keep one row per plate appearance: only the final pitch of a PA has a non-null event
matchups = statcast_2023.loc[statcast_2023['events'].notna(),
                             ['game_date', 'batter', 'pitcher', 'events',
                              'launch_speed', 'launch_angle', 'pitch_type',
                              'release_speed']].copy()
# Load pitcher season stats
pitching_2023 = pitching_stats(2023, qual=50)
# Create batter season context
batting_context = batting_2023[['IDfg', 'Name', 'Team', 'AVG', 'OBP', 'SLG']]
batting_context.columns = ['batter_fg_id', 'batter_name', 'batter_team',
'batter_avg', 'batter_obp', 'batter_slg']
# Create pitcher season context
pitching_context = pitching_2023[['IDfg', 'Name', 'Team', 'ERA', 'WHIP', 'K/9']]
pitching_context.columns = ['pitcher_fg_id', 'pitcher_name', 'pitcher_team',
'pitcher_era', 'pitcher_whip', 'pitcher_k9']
print(f"\nMatchup data shape: {matchups.shape}")
print("Sample matchup record:")
print(matchups.head(3).to_string())
# Example 8: Joining with park factors
print("\n" + "="*60)
print("Adding ballpark context...")
# Create sample park factors data
park_factors = pd.DataFrame({
    'team_code': ['LAA', 'BOS', 'NYY', 'SF', 'COL'],
    'park_name': ['Angel Stadium', 'Fenway Park', 'Yankee Stadium',
                  'Oracle Park', 'Coors Field'],
    'run_factor': [0.98, 1.02, 1.05, 0.92, 1.15],
    'hr_factor': [0.95, 1.08, 1.12, 0.85, 1.25]
})
# Statcast rows carry a home_team column that uses MLB abbreviations (e.g., 'SF'),
# so the sample codes above follow that convention rather than Retrosheet's 'SFN'
statcast_sample = statcast_2023.head(1000).merge(
    park_factors, left_on='home_team', right_on='team_code', how='left')
print("\nPark factors:")
print(park_factors.to_string())
# Example 9: Working with player ID mappings
print("\n" + "="*60)
print("Cross-referencing player IDs across systems...")
# In practice, load the Chadwick Bureau ID mapping file
# For demonstration, we'll show the concept
# Sample rows only; verify IDs against the Chadwick register before relying on them
player_mapping = pd.DataFrame({
    'key_mlbam': [545361, 665487, 592450],
    'key_fangraphs': [10155, 19709, 15640],
    'key_bbref': ['troutmi01', 'tatisfe02', 'judgeaa01'],
    'name_first': ['Mike', 'Fernando', 'Aaron'],
    'name_last': ['Trout', 'Tatis Jr.', 'Judge']
})
print("\nPlayer ID mapping sample:")
print(player_mapping.to_string())
# Use this to join FanGraphs data with Statcast (mlbam) data
print("\nJoining across ID systems:")
print("1. Start with Statcast data (uses key_mlbam)")
print("2. Join player_mapping on key_mlbam")
print("3. Now you have key_fangraphs to join with FanGraphs stats")
# Demonstrate the join (metric values are illustrative)
demo_statcast = pd.DataFrame({
    'key_mlbam': [545361, 665487],
    'avg_exit_velo': [92.3, 94.1]
})
demo_fangraphs = pd.DataFrame({
    'key_fangraphs': [10155, 19709],
    'wOBA': [.398, .415]
})
# Step 1: Join Statcast with mapping
result = demo_statcast.merge(
player_mapping[['key_mlbam', 'key_fangraphs', 'name_first', 'name_last']],
on='key_mlbam',
how='left'
)
# Step 2: Join with FanGraphs data
result = result.merge(demo_fangraphs, on='key_fangraphs', how='left')
print("\nFinal joined result:")
print(result.to_string())
print("\n" + "="*60)
print("Key takeaways:")
print("- Always verify join keys match across datasets")
print("- Use player ID mapping files for cross-system joins")
print("- Check for duplicate keys before merging")
print("- Validate row counts after joins (inner vs left vs outer)")
print("- Handle missing values appropriately for your analysis")
R: Loading and Understanding Data Schemas
# Load required libraries
library(tidyverse)
library(baseballr)
library(lubridate)
# Example 1: Loading Statcast data
cat("Loading Statcast data...\n")
statcast_data <- statcast_search(
  start_date = "2023-04-01",
  end_date = "2023-04-07",
  player_type = "batter"
)
# Examine the schema
cat(sprintf("\nDataset dimensions: %d rows x %d columns\n",
nrow(statcast_data), ncol(statcast_data)))
cat("\nFirst 20 column names:\n")
print(names(statcast_data)[1:20])
# Check data types
cat("\nData types for key fields:\n")
key_fields <- c('game_date', 'batter', 'pitcher', 'events',
'launch_speed', 'launch_angle', 'release_speed',
'release_spin_rate')
str(statcast_data[, key_fields])
# Summary statistics for Statcast metrics
cat("\nStatcast metrics summary:\n")
statcast_metrics <- c('launch_speed', 'launch_angle', 'release_speed',
'release_spin_rate', 'hit_distance_sc')
summary(statcast_data[, statcast_metrics])
# Example 2: Loading season-level batting statistics
cat("\n", rep("=", 60), "\n", sep="")
cat("Loading season batting statistics...\n")
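# Using baseballr to get FanGraphs data
# Note: argument names for fg_batter_leaders() have changed across baseballr
# releases; consult ?fg_batter_leaders if this call errors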
batting_2023 <- fg_batter_leaders(2023, 2023, qual = 100)
cat(sprintf("\nBatting stats dimensions: %d rows x %d columns\n",
nrow(batting_2023), ncol(batting_2023)))
cat("\nColumns available:\n")
print(names(batting_2023))
# Show sample of traditional vs. advanced stats
cat("\nSample player statistics:\n")
batting_2023 %>%
select(Name, Team, G, PA, AVG, OBP, SLG, wOBA, `wRC+`) %>%
head(10) %>%
print()
# Example 3: Understanding data granularity
cat("\n", rep("=", 60), "\n", sep="")
cat("Understanding data granularity...\n")
# Count events by type (play-by-play level)
cat("\nEvent type distribution:\n")
event_counts <- statcast_data %>%
count(events, sort = TRUE) %>%
head(15)
print(event_counts)
# Aggregate to player level
cat("\nAggregating to player-level statistics:\n")
player_summary <- statcast_data %>%
filter(!is.na(launch_speed)) %>%
group_by(batter) %>%
summarise(
batted_balls = n(),
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
max_exit_velo = max(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
home_runs = sum(events == "home_run", na.rm = TRUE)
) %>%
arrange(desc(batted_balls)) %>%
head(10)
print(player_summary)
# Example 4: Data type handling and conversion
cat("\n", rep("=", 60), "\n", sep="")
cat("Handling data types...\n")
# Convert dates properly
statcast_data <- statcast_data %>%
mutate(game_date = ymd(game_date))
# Ensure numeric fields are proper types
numeric_fields <- c('launch_speed', 'launch_angle', 'release_speed',
'release_spin_rate', 'plate_x', 'plate_z')
statcast_data <- statcast_data %>%
mutate(across(all_of(numeric_fields), as.numeric))
# Convert categorical fields to factors
categorical_fields <- c('events', 'pitch_type', 'stand', 'p_throws', 'type')
statcast_data <- statcast_data %>%
mutate(across(all_of(categorical_fields), as.factor))
# Check for missing values
cat("\nMissing value percentages for key fields:\n")
missing_pct <- statcast_data %>%
select(all_of(key_fields)) %>%
summarise(across(everything(), ~sum(is.na(.)) / n() * 100)) %>%
pivot_longer(everything(), names_to = "field", values_to = "pct_missing")
print(missing_pct)
# Example 5: Filtering and subsetting data
cat("\n", rep("=", 60), "\n", sep="")
cat("Filtering Statcast data...\n")
# Filter for hard-hit balls
hard_hit <- statcast_data %>%
filter(launch_speed >= 95, !is.na(launch_speed)) %>%
select(game_date, batter, pitcher, events, launch_speed,
launch_angle, hit_distance_sc)
cat(sprintf("\nHard-hit balls (95+ mph): %d events\n", nrow(hard_hit)))
cat("\nHard-hit ball outcomes:\n")
hard_hit %>%
count(events, sort = TRUE) %>%
print()
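# Barrels (optimal contact)
# Statcast flags barrels as launch_speed_angle == 6. The barrel window widens
# as exit velocity rises, so a fixed 26-30 degree filter would undercount them.
batted_balls <- statcast_data %>%
  filter(!is.na(launch_speed_angle))
barrels <- batted_balls %>%
  filter(launch_speed_angle == 6)
cat(sprintf("\nBarrels (optimal contact): %d events\n", nrow(barrels)))
cat(sprintf("Barrel rate: %.2f%%\n",
            nrow(barrels) / nrow(batted_balls) * 100))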
R: Merging Multiple Data Sources
library(tidyverse)
library(baseballr)
# Example 6: Merging Statcast with season statistics
cat("Merging Statcast data with season totals...\n")
# Load Statcast data for a season (using sample from earlier)
# In practice, you'd load full season data
statcast_2023 <- statcast_search(
  start_date = "2023-04-01",
  end_date = "2023-04-30", # Using one month for demo
  player_type = "batter"
)
# Aggregate Statcast metrics by batter
statcast_agg <- statcast_2023 %>%
filter(!is.na(launch_speed)) %>%
group_by(batter) %>%
summarise(
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
avg_spin_vs = mean(release_spin_rate, na.rm = TRUE),
statcast_events = n(),
statcast_hits = sum(events %in% c("single", "double", "triple", "home_run"),
na.rm = TRUE)
) %>%
rename(key_mlbam = batter) # Rename for clarity
cat(sprintf("\nStatcast aggregated data: %d players\n", nrow(statcast_agg)))
# Load season batting statistics
batting_2023 <- fg_batter_leaders(2023, 2023, qual = 50)
cat(sprintf("Season batting data: %d players\n", nrow(batting_2023)))
# Example 7: Using player ID mappings
cat("\n", rep("=", 60), "\n", sep="")
cat("Cross-referencing player IDs...\n")
# In practice, you would load the Chadwick ID mapping file
# Download from: https://github.com/chadwickbureau/register/blob/master/data/people.csv
# For demonstration, we'll create a sample mapping
# Sample rows only; verify IDs against the Chadwick register before relying on them
player_mapping <- tibble(
  key_mlbam = c(545361, 665487, 592450),
  key_fangraphs = c(10155, 19709, 15640),
  key_bbref = c("troutmi01", "tatisfe02", "judgeaa01"),
  name_first = c("Mike", "Fernando", "Aaron"),
  name_last = c("Trout", "Tatis Jr.", "Judge")
)
cat("\nPlayer ID mapping sample:\n")
print(player_mapping)
# Demonstrate joining Statcast (mlbam IDs) with FanGraphs data
cat("\nJoining Statcast with FanGraphs data using ID mapping...\n")
# Step 1: Join Statcast aggregate with player mapping
statcast_with_ids <- statcast_agg %>%
left_join(
player_mapping %>% select(key_mlbam, key_fangraphs, name_first, name_last),
by = "key_mlbam"
)
# Step 2: Join with FanGraphs data
# Note: FanGraphs uses 'playerid' column (equivalent to key_fangraphs)
combined_data <- statcast_with_ids %>%
left_join(
batting_2023 %>% select(playerid, Name, Team, AVG, OBP, SLG, wOBA, `wRC+`),
by = c("key_fangraphs" = "playerid")
)
cat("\nCombined dataset sample:\n")
combined_data %>%
select(name_first, name_last, avg_exit_velo, avg_launch_angle,
AVG, OBP, wOBA, `wRC+`) %>%
head(10) %>%
print()
# Example 8: Creating pitcher-batter matchups
cat("\n", rep("=", 60), "\n", sep="")
cat("Creating pitcher-batter matchup data...\n")
# Create matchup-level data
matchups <- statcast_2023 %>%
select(game_date, game_pk, batter, pitcher, events,
launch_speed, launch_angle, pitch_type, release_speed,
stand, p_throws) %>%
filter(!is.na(events))
cat(sprintf("\nTotal matchups (plate appearances): %d\n", nrow(matchups)))
# Load pitcher season stats
pitching_2023 <- fg_pitcher_leaders(2023, 2023, qual = 50)
# Create summary by matchup handedness
matchup_summary <- matchups %>%
  mutate(matchup_type = paste(stand, "vs", p_throws)) %>%
  group_by(matchup_type) %>%
  summarise(
    plate_appearances = n(),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_launch_angle = mean(launch_angle, na.rm = TRUE),
    home_runs = sum(events == "home_run", na.rm = TRUE),
    # Hits per plate appearance, not batting average: the denominator includes
    # walks, hit-by-pitches, and sacrifices, which do not count as at-bats
    hits_per_pa = sum(events %in% c("single", "double", "triple", "home_run")) / n()
  )
cat("\nPerformance by batter/pitcher handedness:\n")
print(matchup_summary)
# Example 9: Joining with park factors
cat("\n", rep("=", 60), "\n", sep="")
cat("Adding ballpark context...\n")
# Create sample park factors
park_factors <- tibble(
team_code = c("LAA", "BOS", "NYY", "SF", "COL"),
park_name = c("Angel Stadium", "Fenway Park", "Yankee Stadium",
"Oracle Park", "Coors Field"),
run_factor = c(0.98, 1.02, 1.05, 0.92, 1.15),
hr_factor = c(0.95, 1.08, 1.12, 0.85, 1.25)
)
cat("\nPark factors:\n")
print(park_factors)
# In practice, you would extract home team from game data
# and join with park factors to adjust statistics
# Example 10: Data validation after joins
cat("\n", rep("=", 60), "\n", sep="")
cat("Validating merged data...\n")
# Check for missing matches
cat("\nChecking join quality:\n")
cat(sprintf("Statcast records: %d\n", nrow(statcast_agg)))
cat(sprintf("Records with FanGraphs match: %d\n",
sum(!is.na(combined_data$AVG))))
cat(sprintf("Records missing FanGraphs data: %d\n",
sum(is.na(combined_data$AVG))))
# Verify no duplicate keys
duplicate_check <- statcast_agg %>%
count(key_mlbam) %>%
filter(n > 1)
cat(sprintf("\nDuplicate player IDs in Statcast data: %d\n",
nrow(duplicate_check)))
# Check data ranges make sense
cat("\nData validation - checking reasonable ranges:\n")
validation <- combined_data %>%
summarise(
avg_exit_velo_min = min(avg_exit_velo, na.rm = TRUE),
avg_exit_velo_max = max(avg_exit_velo, na.rm = TRUE),
avg_valid = all(AVG >= 0 & AVG <= 1, na.rm = TRUE),
obp_valid = all(OBP >= 0 & OBP <= 1, na.rm = TRUE)
)
print(validation)
cat("\n", rep("=", 60), "\n", sep="")
cat("Key R workflow takeaways:\n")
cat("- Use dplyr joins (left_join, inner_join) for merging\n")
cat("- Always specify join keys explicitly with 'by' parameter\n")
cat("- Use select() to keep only needed columns before joining\n")
cat("- Validate results with summary statistics and counts\n")
cat("- Handle NA values appropriately for your analysis\n")
cat("- Use player ID mapping files from Chadwick Bureau\n")
Conclusion
Understanding baseball data requires familiarity with multiple data types, granularity levels, identification systems, and merging strategies. Key takeaways:
- Choose the right data type: Use box scores for summaries, play-by-play for context, pitch-level for sequencing, and Statcast for skills-based analysis
- Match granularity to your question: Pitch-level data isn't always necessary; sometimes season aggregates are more appropriate
- Master player identifiers: Use mapping files to connect data across sources (FanGraphs, Baseball Reference, Statcast)
- Validate your data: Check for missing values, impossible values, and join quality before analysis
- Leverage expected metrics: xBA, xwOBA, and other expected stats often predict future performance better than actual results
- Understand data limitations: Be aware of historical gaps, measurement error, and definitional inconsistencies
By mastering these fundamentals, you'll be well-equipped to conduct sophisticated baseball analytics using any combination of data sources.