Understanding Baseball Data Types

Beginner · 10 min read · Nov 26, 2025

Understanding Baseball Data

Baseball is one of the most data-rich sports in the world, with detailed statistics tracked for every pitch, play, and game. Modern baseball analytics relies on multiple types of data sources, each offering different levels of granularity and insight. This comprehensive guide will help you understand the various types of baseball data, how they're structured, and how to work with them effectively in your analyses.

Types of Baseball Data

Box Score Data

Box score data represents the traditional summary statistics for each game. This is the most aggregated form of baseball data, showing totals and averages for players and teams in a single game.

Typical fields include:

  • Batting: At-bats (AB), runs (R), hits (H), RBIs, home runs (HR), walks (BB), strikeouts (SO)
  • Pitching: Innings pitched (IP), hits allowed, runs, earned runs (ER), walks, strikeouts, ERA
  • Fielding: Putouts (PO), assists (A), errors (E)
  • Game metadata: Date, teams, final score, attendance

Box score data is excellent for high-level analysis, season totals, and historical comparisons. It's available for virtually every professional baseball game ever played, making it ideal for long-term trend analysis.

Play-by-Play (Event) Data

Play-by-play data captures every discrete event that occurs during a game. Each row represents a single plate appearance or baserunning event, with details about what happened, who was involved, and the game situation.

Key information captured:

  • Event type (single, double, strikeout, walk, ground out, etc.)
  • Players involved (batter, pitcher, fielders)
  • Game state (inning, outs, runners on base, score)
  • Pitch count and sequence
  • Event outcomes (runs scored, RBIs, errors)
  • Hit location and trajectory

Play-by-play data enables context-aware analysis, such as win probability calculations, leverage index, and situational performance metrics. Retrosheet provides free play-by-play data dating back to 1918.

Pitch-Level Data

Pitch-level data contains information about every individual pitch thrown in a game. This granularity allows for detailed analysis of pitcher tendencies, batter discipline, and pitch sequencing.

Standard pitch-level fields:

  • Pitch type (fastball, curveball, slider, changeup, etc.)
  • Pitch velocity
  • Pitch location (x, y coordinates in strike zone)
  • Pitch result (called strike, swinging strike, ball, foul, in play)
  • Count when thrown (balls-strikes)
  • Pitch sequence number in at-bat

PITCHf/x data (2008-2016) and Statcast data (2015-present) provide pitch-level tracking for MLB games. This data is essential for analyzing pitcher effectiveness, batter approach, and umpire tendencies.

Statcast Data

Statcast represents the most advanced publicly available baseball data, using high-speed cameras and radar to track every movement on the field. Introduced in 2015, Statcast measures both pitches and batted balls with unprecedented precision.

Statcast captures:

  • Pitch tracking: Velocity, spin rate, spin axis, movement (break), release point, extension
  • Batted ball tracking: Exit velocity, launch angle, spray angle, hit distance, hang time
  • Fielding tracking: Fielder positioning, route efficiency, catch probability, arm strength
  • Baserunning tracking: Sprint speed, acceleration, jump efficiency, stolen base metrics

Statcast data has revolutionized baseball analysis by providing objective measurements of player skills and performance that were previously impossible to quantify accurately.

Data Granularity Levels

Baseball data exists at multiple levels of aggregation, each suited for different analytical purposes:

Granularity Level | Description | Use Cases | Typical Row Count (per season)
----------------- | ----------- | --------- | ------------------------------
Pitch Level | One row per pitch thrown | Pitch sequencing, umpire analysis, batter discipline, pitcher stuff | ~750,000 pitches
Plate Appearance | One row per batter-pitcher matchup | Outcome analysis, matchup studies, situational stats | ~185,000 PAs
Game Level | One row per player per game | Performance tracking, streak analysis, daily fantasy | ~75,000 player-games
Player-Season | One row per player per season | Awards voting, contract analysis, historical comparisons | ~1,500 players
Team-Season | One row per team per season | Team building, payroll analysis, competitive balance | 30 teams

Understanding which granularity level you need is crucial for efficient data processing and appropriate analysis. More granular data provides greater flexibility but requires more storage and computational resources.
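To make the levels concrete, here is a minimal sketch of rolling pitch-level rows up to plate-appearance and player-game level. The data frame is hypothetical; the column names (game_pk, at_bat_number, pitch_number, events) follow Statcast conventions, and the approach assumes the outcome is recorded on the last pitch of each plate appearance.

```python
import pandas as pd

# Hypothetical pitch-level rows: two plate appearances in one game.
pitches = pd.DataFrame({
    "game_pk":       [1, 1, 1, 1, 1],
    "at_bat_number": [1, 1, 1, 2, 2],
    "pitch_number":  [1, 2, 3, 1, 2],
    "batter":        [10, 10, 10, 11, 11],
    "events":        [None, None, "single", None, "strikeout"],
})

# Pitch level -> plate appearance level: one row per (game, at-bat).
# "last" picks the last non-null value, i.e. the outcome of the PA.
plate_appearances = (
    pitches.sort_values(["game_pk", "at_bat_number", "pitch_number"])
           .groupby(["game_pk", "at_bat_number"], as_index=False)
           .agg(batter=("batter", "first"),
                pitches_seen=("pitch_number", "count"),
                outcome=("events", "last"))
)

# Plate appearance level -> player-game level: one row per (game, batter).
player_games = (
    plate_appearances.groupby(["game_pk", "batter"], as_index=False)
                     .agg(pa=("at_bat_number", "count"),
                          pitches_seen=("pitches_seen", "sum"))
)

print(len(pitches), len(plate_appearances), len(player_games))
```

Each step shrinks the row count and discards detail, which is why it is usually easier to start from the most granular level you can afford and aggregate upward.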

Key Identifiers and Coding Systems

Working with baseball data from multiple sources requires understanding the various identification systems used to track players, games, and teams.

Player Identifiers

Different data providers use different player ID systems:

  • Retrosheet ID: Format like "bondb001" (last name + first initial + number). Used by Retrosheet for historical data.
  • Baseball Reference ID (bbref_id): Similar to Retrosheet, like "bondsba01". Used by Baseball-Reference.com.
  • MLB AM ID (mlbam_id): Numeric ID used by MLB Advanced Media (e.g., 111188 for Barry Bonds). Used in Statcast and modern MLB APIs.
  • FanGraphs ID (fg_id): Numeric ID used by FanGraphs (e.g., 1109 for Barry Bonds).
  • Baseball Prospectus ID (bp_id): Numeric ID used by Baseball Prospectus.

The Chadwick Bureau maintains a comprehensive player ID mapping file that cross-references these different ID systems, making it possible to link data across sources.

Game Identifiers

Games are typically identified using a standardized format:

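  • Retrosheet Game ID: Format "HHHYYYYMMDDN" (three-letter home team code + date YYYYMMDD + one-digit game number, where 0 is a single game and 1 or 2 marks the first or second game of a doubleheader). Example: "SFN201904100" for a Giants home game on April 10, 2019.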
  • MLB Game PK: Numeric primary key used by MLB systems (e.g., 566789).
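As a small illustration, a Retrosheet-style game ID can be split into its parts with plain string slicing. The helper name below is hypothetical, and the sketch assumes the 12-character layout of team code, date, and game number described above.

```python
from datetime import datetime

def parse_retrosheet_game_id(game_id: str) -> dict:
    """Split a 12-character Retrosheet game ID into its components."""
    if len(game_id) != 12:
        raise ValueError(f"expected 12 characters, got {len(game_id)}: {game_id!r}")
    return {
        "home_team": game_id[:3],                                    # e.g. "SFN"
        "date": datetime.strptime(game_id[3:11], "%Y%m%d").date(),   # YYYYMMDD
        "game_number": int(game_id[11]),                             # 0, 1, or 2
    }

parsed = parse_retrosheet_game_id("SFN201904100")
print(parsed)
```

Parsing the date into a real date object up front makes later joins against other game-level tables less error-prone than comparing raw strings.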

Team Codes

Teams are identified using three-letter abbreviation codes:

League | Example Teams | Codes
------ | ------------- | -----
American League | New York Yankees, Boston Red Sox, Los Angeles Angels | NYA, BOS, ANA
National League | San Francisco Giants, Los Angeles Dodgers, Atlanta Braves | SFN, LAN, ATL

Note: Team codes can vary between data sources. For example, the Yankees might be "NYY" in some systems and "NYA" in Retrosheet. Always verify the coding system used in your data source.
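One way to handle this is a small lookup table applied before any join. The sketch below is illustrative only: the mapping is a partial, hand-written example, and the target abbreviations are assumed to be the ones used in Statcast exports.

```python
import pandas as pd

# Partial, illustrative mapping from Retrosheet codes to the abbreviations
# assumed here for Statcast data. Codes not listed are passed through as-is.
RETROSHEET_TO_MLB = {"NYA": "NYY", "ANA": "LAA", "SFN": "SF", "LAN": "LAD"}

def normalize_team_codes(codes: pd.Series) -> pd.Series:
    """Map known Retrosheet codes to the target system; keep the rest unchanged."""
    return codes.map(RETROSHEET_TO_MLB).fillna(codes)

games = pd.DataFrame({"home_team": ["NYA", "BOS", "SFN", "ATL"]})
games["home_team_std"] = normalize_team_codes(games["home_team"])
print(games)
```

Keeping the original column alongside the standardized one makes it easy to audit which rows were changed.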

Data Schemas and Common Fields

Standard Batting Statistics Schema

A typical batting statistics table includes these fields:

player_id         VARCHAR    # Unique player identifier
season            INT        # Year (e.g., 2023)
team              VARCHAR    # Team abbreviation
games             INT        # Games played (G)
plate_appearances INT        # Total plate appearances (PA)
at_bats           INT        # Official at-bats (AB)
runs              INT        # Runs scored (R)
hits              INT        # Hits (H)
doubles           INT        # Doubles (2B)
triples           INT        # Triples (3B)
home_runs         INT        # Home runs (HR)
rbi               INT        # Runs batted in
walks             INT        # Walks (BB)
strikeouts        INT        # Strikeouts (SO)
stolen_bases      INT        # Stolen bases (SB)
caught_stealing   INT        # Caught stealing (CS)
batting_avg       FLOAT      # Batting average (AVG)
on_base_pct       FLOAT      # On-base percentage (OBP)
slugging_pct      FLOAT      # Slugging percentage (SLG)
ops               FLOAT      # On-base plus slugging (OPS)
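The rate fields at the bottom of this schema are derived from the counting fields above them. The following sketch shows the standard formulas; note that OBP also needs hit-by-pitch and sacrifice-fly counts, which are not in the schema above, so they appear here as optional parameters defaulting to zero (an assumption made for illustration).

```python
def batting_rates(at_bats, hits, doubles, triples, home_runs, walks,
                  hit_by_pitch=0, sacrifice_flies=0):
    """Compute AVG, OBP, SLG, and OPS from counting stats."""
    # Total bases: each hit counts once, plus extra bases for 2B/3B/HR.
    total_bases = hits + doubles + 2 * triples + 3 * home_runs
    avg = hits / at_bats
    obp = (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sacrifice_flies)
    slg = total_bases / at_bats
    return {"avg": avg, "obp": obp, "slg": slg, "ops": obp + slg}

# Hypothetical season line, not a real player.
rates = batting_rates(at_bats=500, hits=150, doubles=30, triples=5,
                      home_runs=20, walks=50)
print({name: round(value, 3) for name, value in rates.items()})
```

Recomputing rate stats from counting stats is also a handy consistency check when a source provides both.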

Standard Pitching Statistics Schema

player_id         VARCHAR    # Unique player identifier
season            INT        # Year
team              VARCHAR    # Team abbreviation
games             INT        # Games pitched (G)
games_started     INT        # Games started (GS)
innings_pitched   FLOAT      # Innings pitched (IP)
hits_allowed      INT        # Hits allowed (H)
runs_allowed      INT        # Runs allowed (R)
earned_runs       INT        # Earned runs (ER)
home_runs_allowed INT        # Home runs allowed (HR)
walks             INT        # Walks issued (BB)
strikeouts        INT        # Strikeouts (SO)
wins              INT        # Wins (W)
losses            INT        # Losses (L)
saves             INT        # Saves (SV)
era               FLOAT      # Earned run average
whip              FLOAT      # Walks + hits per inning pitched
k_per_9           FLOAT      # Strikeouts per 9 innings (K/9)
bb_per_9          FLOAT      # Walks per 9 innings (BB/9)
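The derived pitching fields follow the same pattern. One wrinkle worth sketching: many sources store innings pitched in baseball notation, where 6.1 means six and one-third innings rather than 6.1 decimal innings. The helper names below are hypothetical, and the sketch assumes your source uses that notation.

```python
def innings_to_decimal(ip_notation: float) -> float:
    """Convert baseball IP notation (6.1 = 6 1/3, 6.2 = 6 2/3) to true innings."""
    tenths = round(ip_notation * 10)
    whole, outs = divmod(tenths, 10)
    if outs not in (0, 1, 2):
        raise ValueError(f"invalid innings notation: {ip_notation}")
    return whole + outs / 3

def pitching_rates(ip_notation, hits_allowed, earned_runs, walks, strikeouts):
    """Compute ERA, WHIP, K/9, and BB/9 from counting stats."""
    ip = innings_to_decimal(ip_notation)
    return {
        "era": 9 * earned_runs / ip,
        "whip": (walks + hits_allowed) / ip,
        "k_per_9": 9 * strikeouts / ip,
        "bb_per_9": 9 * walks / ip,
    }

# Hypothetical season line, not a real pitcher.
line = pitching_rates(ip_notation=200.0, hits_allowed=170, earned_runs=70,
                      walks=50, strikeouts=220)
print({name: round(value, 2) for name, value in line.items()})
```

If a source already stores innings as true decimals, skip the conversion; applying it twice would silently distort every rate stat.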

Statcast Event Schema

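Statcast data contains one row per pitch. Pitch-tracking fields are populated on every row, while batted-ball fields (launch speed, launch angle, hit coordinates, expected stats) are populated only when the ball is put in play: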

game_date              DATE       # Date of game
player_name            VARCHAR    # Batter name
batter                 INT        # Batter MLB ID
pitcher                INT        # Pitcher MLB ID
events                 VARCHAR    # Event result (single, double, out, etc.)
description            VARCHAR    # Pitch result description
zone                   INT        # Strike zone location (1-14)
stand                  CHAR(1)    # Batter side (L/R)
p_throws               CHAR(1)    # Pitcher throws (L/R)
type                   CHAR(1)    # Pitch result code (B = ball, S = strike, X = in play)
balls                  INT        # Balls in count
strikes                INT        # Strikes in count
pitch_type             VARCHAR    # Pitch classification (FF/SL/CU/CH/etc.)
release_speed          FLOAT      # Pitch velocity (mph)
release_spin_rate      INT        # Spin rate (rpm)
release_extension      FLOAT      # Release point extension (ft)
pfx_x                  FLOAT      # Horizontal movement (feet in Baseball Savant exports)
pfx_z                  FLOAT      # Vertical movement (feet in Baseball Savant exports)
plate_x                FLOAT      # Horizontal location at plate
plate_z                FLOAT      # Vertical location at plate
launch_speed           FLOAT      # Exit velocity (mph)
launch_angle           FLOAT      # Launch angle (degrees)
hit_distance_sc        INT        # Projected hit distance (feet)
hc_x                   FLOAT      # Hit coordinate X
hc_y                   FLOAT      # Hit coordinate Y
estimated_ba_using_speedangle  FLOAT  # Expected batting average (xBA)
estimated_woba_using_speedangle FLOAT # Expected weighted on-base avg (xwOBA)

Statcast Data Fields Explained

Statcast provides numerous advanced metrics that require explanation:

Pitch Metrics

Release Speed (Velocity): The speed of the pitch measured out of the pitcher's hand, in miles per hour. Fastballs typically range from 90-100 mph for starters, while breaking balls are slower (75-85 mph).

Spin Rate: How fast the ball is spinning, measured in revolutions per minute (RPM). Higher spin rates generally lead to more movement and better performance. Four-seam fastballs average 2200-2400 RPM, while curveballs can exceed 2800 RPM.

Spin Axis: The orientation of the ball's rotation, measured in degrees (0-360). This determines the direction of movement. A spin axis of 180 degrees produces pure backspin (rising fastball), while 0 degrees is pure topspin (12-6 curveball).

Release Extension: How far in front of the pitching rubber the pitcher releases the ball, measured in feet. Greater extension effectively reduces the distance to home plate, giving batters less reaction time.

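Horizontal Movement (pfx_x) and Vertical Movement (pfx_z): The amount the pitch deviates, because of spin, from the path a spinless pitch would take, so the effect of gravity is excluded. Baseball Savant exports report these values in feet; multiply by 12 to get inches. Positive pfx_x indicates movement to the right from the catcher's perspective, and positive pfx_z indicates upward movement.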
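To see why release extension matters, here is a rough back-of-the-envelope sketch of flight time as a function of velocity and extension. It assumes the pitch travels at its release speed the whole way (real pitches slow down in flight) and measures from the rubber to the back point of home plate, 60.5 feet. The function name is hypothetical.

```python
RUBBER_TO_PLATE_FT = 60.5      # pitching rubber to the rear point of home plate
MPH_TO_FPS = 5280 / 3600       # miles per hour -> feet per second

def approx_time_to_plate(release_speed_mph: float, release_extension_ft: float) -> float:
    """Rough flight time in seconds, assuming constant speed after release."""
    distance_ft = RUBBER_TO_PLATE_FT - release_extension_ft
    return distance_ft / (release_speed_mph * MPH_TO_FPS)

short_extension = approx_time_to_plate(95.0, 5.5)
long_extension = approx_time_to_plate(95.0, 7.0)
print(f"5.5 ft extension: {short_extension:.3f} s")
print(f"7.0 ft extension: {long_extension:.3f} s")
```

The absolute numbers are only approximate, but the comparison shows the direction of the effect: at the same velocity, more extension means less time for the batter to react.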

Batted Ball Metrics

Exit Velocity (Launch Speed): The speed of the ball off the bat, measured in miles per hour. Exit velocities above 95 mph are considered "hard hit" and correlate strongly with positive outcomes. The MLB average is around 88-89 mph.

Launch Angle: The vertical angle at which the ball leaves the bat, measured in degrees. Angles below 10 degrees are ground balls, 10-25 degrees are line drives, 25-50 degrees are fly balls, and above 50 degrees are pop-ups. The optimal launch angle for power is typically 25-30 degrees.

Spray Angle: The horizontal direction of the batted ball. This helps analyze a hitter's tendency to pull the ball versus hitting to the opposite field.

Hit Distance: The projected distance the ball would travel based on its trajectory, measured in feet. This accounts for both the carry and the landing point.

Expected Batting Average (xBA): The expected batting average on a batted ball based primarily on its exit velocity and launch angle (with the batter's sprint speed factored in on some weakly hit balls), using historical data. This largely removes the luck and defensive positioning factors.

Expected Weighted On-Base Average (xwOBA): Similar to xBA but weighted by the value of each outcome (single vs. double vs. home run). This provides a more complete picture of expected offensive value.

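Barrel Rate: The percentage of batted balls that achieve the optimal combination of exit velocity and launch angle. A barrel requires an exit velocity of at least 98 mph; at 98 mph the qualifying launch angle is 26-30 degrees, and the window widens as exit velocity increases. MLB defines the barrel zone as contact whose comparable batted balls have historically produced at least a .500 batting average and a 1.500 slugging percentage.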
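The launch-angle bands and the hard-hit threshold translate directly into simple classification rules. This is a minimal sketch with hypothetical function names; it treats everything below 10 degrees as a ground ball, and the handling of exact boundary values (10, 25, 50) is an assumption.

```python
HARD_HIT_MPH = 95.0

def classify_launch_angle(launch_angle_deg: float) -> str:
    """Bucket a batted ball by launch angle (boundary handling is an assumption)."""
    if launch_angle_deg < 10:
        return "ground_ball"
    if launch_angle_deg < 25:
        return "line_drive"
    if launch_angle_deg <= 50:
        return "fly_ball"
    return "popup"

def is_hard_hit(exit_velocity_mph: float) -> bool:
    """Flag batted balls at or above the 95 mph hard-hit threshold."""
    return exit_velocity_mph >= HARD_HIT_MPH

# Hypothetical batted balls: (exit velocity in mph, launch angle in degrees).
for exit_velo, angle in [(102.3, 18.0), (78.0, -12.0), (99.1, 28.0), (85.0, 62.0)]:
    print(exit_velo, angle, classify_launch_angle(angle), is_hard_hit(exit_velo))
```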

Traditional Stats vs. Tracking Data

Understanding the relationship between traditional statistics and modern tracking data is crucial for comprehensive analysis:

Traditional Stat | Tracking Data Equivalent | Advantage of Tracking Data
---------------- | ------------------------ | --------------------------
Batting Average (AVG) | Expected Batting Average (xBA) | Removes luck from hits vs. outs, accounts for quality of contact
Slugging Percentage (SLG) | Expected Slugging (xSLG) | Predicts future performance better by focusing on process over results
Earned Run Average (ERA) | Expected ERA (xERA) or FIP | Isolates pitcher performance from defense and luck
Home Runs | Barrel Rate, Average Exit Velocity | Identifies power hitters who may be unlucky with park factors
Strikeouts | Whiff Rate, Chase Rate | Reveals underlying skills in pitch recognition and bat control
Stolen Bases | Sprint Speed, Jump Efficiency | Quantifies pure speed and baserunning skill independent of opportunity

Key Insight: Traditional statistics measure outcomes, while tracking data measures process and skills. Expected metrics (xBA, xwOBA, xERA) often predict future performance better than actual results because they remove the noise of random variation and defense.
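A simple way to put outcome and process side by side is to compare each batter's actual hit rate on batted balls with the average xBA of those same batted balls. The data below is made up for illustration; the column name estimated_ba_using_speedangle follows the Statcast schema shown earlier.

```python
import pandas as pd

HIT_EVENTS = ["single", "double", "triple", "home_run"]

# Hypothetical batted balls for two batters (values are illustrative only).
batted_balls = pd.DataFrame({
    "batter": [1, 1, 1, 1, 2, 2, 2, 2],
    "events": ["single", "field_out", "field_out", "home_run",
               "field_out", "field_out", "double", "field_out"],
    "estimated_ba_using_speedangle": [0.9, 0.3, 0.2, 0.6, 0.6, 0.5, 0.7, 0.2],
})

batted_balls["is_hit"] = batted_balls["events"].isin(HIT_EVENTS)

summary = batted_balls.groupby("batter").agg(
    ba_on_contact=("is_hit", "mean"),
    xba_on_contact=("estimated_ba_using_speedangle", "mean"),
)
# Negative values: fewer hits than the quality of contact would suggest.
summary["ba_minus_xba"] = summary["ba_on_contact"] - summary["xba_on_contact"]
print(summary)
```

In this toy sample both batters made contact of similar quality, but batter 2 has fewer hits to show for it, which is the kind of gap expected metrics are meant to surface.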

Data Quality Considerations

When working with baseball data, be aware of these common quality issues:

Missing Data

  • Historical gaps: Statcast data only exists from 2015 onward. PITCHf/x covers 2008-2016 but has different fields.
  • Incomplete tracking: Some parks had tracking issues in early Statcast years (2015-2016), particularly for pitch spin data.
  • Minor league data: Publicly available data is much more limited for minor leagues.
  • International leagues: Data quality and availability varies significantly for NPB, KBO, and other international leagues.

Data Accuracy Issues

  • Scorer bias: Official scorers have discretion on hits vs. errors, which can introduce bias.
  • Measurement error: Statcast measurements are generally accurate but can have noise, especially in edge cases.
  • Calibration differences: Tracking systems may be calibrated differently across ballparks, though MLB works to standardize this.
  • Weather effects: Temperature, humidity, and altitude affect ball flight but aren't always accounted for in data.

Definitional Inconsistencies

  • Pitch classification: Pitch types are algorithmically classified and can be inconsistent, especially for hybrid pitches (cutter vs. slider).
  • Position coding: Players who play multiple positions may be coded differently across sources.
  • Game vs. appearance: Some sources count games played, others count games started or appearances.
  • Season boundaries: Playoff games may or may not be included in season totals depending on the source.

Data Validation Best Practices

  • Cross-reference key totals (games, plate appearances) across multiple sources
  • Check for impossible values (negative stats, pitch velocities over 110 mph, launch angles outside -90 to 90 degrees)
  • Validate that aggregated data sums correctly to totals
  • Look for suspicious patterns that might indicate data entry errors
  • Be aware of small sample size issues, especially early in the season
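Range checks like these are easy to automate. The sketch below flags suspect rows in a pitch-level table; the function name and thresholds are illustrative choices rather than an established standard, and missing values (for example, launch angle on pitches not put in play) are deliberately not flagged.

```python
import pandas as pd

def flag_suspect_pitches(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; True marks a suspect row."""
    flags = pd.DataFrame(index=df.index)
    # Pitch velocity must be positive and below a generous upper bound.
    flags["bad_velocity"] = (df["release_speed"] <= 0) | (df["release_speed"] > 110)
    # Launch angle is only defined between -90 and 90 degrees; NaN compares False.
    flags["bad_launch_angle"] = df["launch_angle"].abs() > 90
    # The count before a pitch can have at most 3 balls and 2 strikes.
    flags["bad_count"] = (
        (df["balls"] < 0) | (df["balls"] > 3)
        | (df["strikes"] < 0) | (df["strikes"] > 2)
    )
    return flags

# Hypothetical rows with one deliberate problem in each of rows 1, 2, and 3.
sample = pd.DataFrame({
    "release_speed": [94.5, 131.0, 88.2, 91.0],
    "launch_angle":  [12.0, None, 95.0, None],
    "balls":         [1, 0, 2, 4],
    "strikes":       [2, 0, 1, 1],
})

flags = flag_suspect_pitches(sample)
print(flags.sum())                      # number of rows failing each check
print(sample[flags.any(axis=1)])        # the suspect rows themselves
```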

Joining Multiple Data Sources

Most baseball analyses require combining data from multiple sources. This requires careful attention to identifiers, time periods, and data granularity.

Common Join Scenarios

1. Adding player demographics to statistics: Join player biographical data (name, birthdate, position) to season statistics using player_id.

2. Combining Statcast with traditional stats: Join Statcast event-level data with season totals, linking on player_id and season.

3. Merging pitching and batting data: Create matchup data by joining pitcher and batter records on game_id and plate appearance sequence.

4. Adding park factors: Join game-level data with ballpark metadata using team code and season.

5. Cross-referencing player IDs: Use the Chadwick ID mapping file to connect players across different data sources (FanGraphs, Baseball Reference, Statcast).

Join Keys and Best Practices

  • Use multiple keys when available: Joining on both player_id AND season is more robust than player_id alone.
  • Verify cardinality: Ensure your joins produce the expected number of rows (one-to-one, one-to-many, many-to-many).
  • Handle missing matches: Decide whether to use inner joins (only matched records) or left/outer joins (keep unmatched records).
  • Standardize date formats: Convert all dates to a consistent format (ISO 8601: YYYY-MM-DD is recommended).
  • Normalize team codes: Create a mapping table to standardize team abbreviations across data sources.
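Several of these practices can be enforced directly in pandas. The sketch below joins on two keys, uses the validate argument of DataFrame.merge to assert the expected cardinality, and uses indicator=True to find rows without a match. The tables and the salary_musd column are hypothetical.

```python
import pandas as pd

# Hypothetical season stats and a second table keyed by the same two columns.
season_stats = pd.DataFrame({
    "player_id": ["a", "a", "b", "c"],
    "season":    [2022, 2023, 2023, 2023],
    "home_runs": [20, 25, 10, 5],
})
salaries = pd.DataFrame({
    "player_id":   ["a", "a", "b"],
    "season":      [2022, 2023, 2023],
    "salary_musd": [5.0, 7.5, 1.2],
})

merged = season_stats.merge(
    salaries,
    on=["player_id", "season"],   # joining on player_id alone would duplicate rows
    how="left",                   # keep players with no salary record
    validate="one_to_one",        # raises MergeError if keys are duplicated
    indicator=True,               # adds a _merge column describing each row
)

# A left join should never change the row count when the right keys are unique.
assert len(merged) == len(season_stats)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} row(s) had no salary match:")
print(unmatched[["player_id", "season"]])
```

Failing loudly on unexpected duplicates is usually preferable to discovering inflated totals later in the analysis.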

Practical Code Examples

Python: Loading and Understanding Data Schemas

import pandas as pd
import numpy as np
from pybaseball import statcast, playerid_lookup, batting_stats

# Example 1: Loading Statcast data for a date range
print("Loading Statcast data...")
statcast_data = statcast(start_dt='2023-04-01', end_dt='2023-04-07')

# Examine the schema
print(f"\nDataset shape: {statcast_data.shape}")
print(f"Columns: {len(statcast_data.columns)}")
print("\nFirst few column names:")
print(statcast_data.columns[:20].tolist())

# Check data types
print("\nData types for key fields:")
key_fields = ['game_date', 'batter', 'pitcher', 'events', 'launch_speed',
              'launch_angle', 'release_speed', 'release_spin_rate']
print(statcast_data[key_fields].dtypes)

# Summary statistics for Statcast metrics
print("\nStatcast metrics summary:")
statcast_metrics = ['launch_speed', 'launch_angle', 'release_speed',
                    'release_spin_rate', 'hit_distance_sc']
print(statcast_data[statcast_metrics].describe())

# Example 2: Loading season-level batting statistics
print("\n" + "="*60)
print("Loading season batting statistics...")
batting_2023 = batting_stats(2023, qual=100)  # Min 100 PA

print(f"\nBatting stats shape: {batting_2023.shape}")
print("\nColumns available:")
print(batting_2023.columns.tolist())

# Show sample of traditional vs. advanced stats
print("\nSample player statistics:")
display_cols = ['Name', 'Team', 'G', 'PA', 'AVG', 'OBP', 'SLG', 'wOBA', 'wRC+']
print(batting_2023[display_cols].head(10).to_string())

# Example 3: Understanding player IDs
print("\n" + "="*60)
print("Working with player identifiers...")

# Look up a player to see different ID systems
player_ids = playerid_lookup('trout', 'mike')
print("\nMike Trout's IDs across systems:")
print(player_ids[['name_first', 'name_last', 'key_mlbam',
                  'key_fangraphs', 'key_bbref', 'mlb_played_first']].to_string())

# Example 4: Filtering and understanding data granularity
print("\n" + "="*60)
print("Understanding data granularity...")

# Count events by type
print("\nEvent type distribution (play-by-play level):")
event_counts = statcast_data['events'].value_counts()
print(event_counts.head(15))

# Aggregate to player level
print("\nAggregating to player-level statistics:")
player_summary = statcast_data.groupby('batter').agg({
    'launch_speed': ['count', 'mean', 'max'],
    'launch_angle': 'mean',
    'events': lambda x: (x == 'home_run').sum()
}).round(2)
player_summary.columns = ['batted_balls', 'avg_exit_velo', 'max_exit_velo',
                          'avg_launch_angle', 'home_runs']
print(player_summary.head(10).to_string())

# Example 5: Data type handling and conversion
print("\n" + "="*60)
print("Handling data types...")

# Convert dates properly
statcast_data['game_date'] = pd.to_datetime(statcast_data['game_date'])

# Ensure numeric fields are proper types
numeric_fields = ['launch_speed', 'launch_angle', 'release_speed',
                  'release_spin_rate', 'plate_x', 'plate_z']
for field in numeric_fields:
    statcast_data[field] = pd.to_numeric(statcast_data[field], errors='coerce')

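# Categorical fields for efficiency; record memory before and after conversion
mem_before = statcast_data.memory_usage(deep=True).sum() / 1024**2

categorical_fields = ['events', 'pitch_type', 'stand', 'p_throws', 'type']
for field in categorical_fields:
    statcast_data[field] = statcast_data[field].astype('category')

mem_after = statcast_data.memory_usage(deep=True).sum() / 1024**2

print("\nMemory usage before/after optimization:")
print(f"Before: {mem_before:.2f} MB")
print(f"After:  {mem_after:.2f} MB")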

# Check for missing values
print("\nMissing value percentages for key fields:")
missing_pct = (statcast_data[key_fields].isnull().sum() / len(statcast_data) * 100).round(2)
print(missing_pct)

Python: Merging Multiple Data Sources

import pandas as pd
from pybaseball import statcast, batting_stats, pitching_stats, playerid_lookup

# Example 6: Merging Statcast with season statistics
print("Merging Statcast data with season totals...")

# Load Statcast data for a season
statcast_2023 = statcast(start_dt='2023-04-01', end_dt='2023-09-30')

# Aggregate Statcast metrics by batter
statcast_agg = statcast_2023.groupby('batter').agg({
    'launch_speed': 'mean',
    'launch_angle': 'mean',
    'release_spin_rate': 'mean',
    'events': ['count', lambda x: (x.isin(['single', 'double', 'triple', 'home_run'])).sum()]
}).round(2)

statcast_agg.columns = ['avg_exit_velo', 'avg_launch_angle',
                        'avg_spin_vs', 'statcast_events', 'statcast_hits']
statcast_agg = statcast_agg.reset_index()

# Load season batting statistics
batting_2023 = batting_stats(2023, qual=50)

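# batting_stats is keyed by FanGraphs ID (IDfg), while Statcast uses MLBAM IDs,
# so map MLBAM IDs to FanGraphs IDs before merging
from pybaseball import playerid_reverse_lookup

id_map = playerid_reverse_lookup(statcast_agg['batter'].tolist(), key_type='mlbam')
id_map = id_map[['key_mlbam', 'key_fangraphs']]

print(f"\nStatcast aggregated data: {statcast_agg.shape}")
print(f"Season batting data: {batting_2023.shape}")

# Statcast -> ID map (left join), then -> FanGraphs stats (inner join keeps
# only batters who also meet the qual=50 threshold)
merged = (statcast_agg
          .merge(id_map, left_on='batter', right_on='key_mlbam', how='left')
          .merge(batting_2023, left_on='key_fangraphs', right_on='IDfg', how='inner'))

print("\nMerge result sample (first 10 rows):")
print(merged[['Name', 'Team', 'avg_exit_velo', 'avg_launch_angle',
              'AVG', 'wOBA']].head(10).to_string())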

# Example 7: Joining pitcher and batter data for matchups
print("\n" + "="*60)
print("Creating pitcher-batter matchup data...")

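# Get unique plate appearances: keep only rows where an event was recorded,
# i.e. the final pitch of each plate appearance
matchups = statcast_2023.loc[
    statcast_2023['events'].notna(),
    ['game_date', 'batter', 'pitcher', 'events',
     'launch_speed', 'launch_angle', 'pitch_type', 'release_speed']
].copy()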

# Load pitcher season stats
pitching_2023 = pitching_stats(2023, qual=50)

# Create batter season context (.copy() avoids modifying a view of batting_2023)
batting_context = batting_2023[['IDfg', 'Name', 'Team', 'AVG', 'OBP', 'SLG']].copy()
batting_context.columns = ['batter_fg_id', 'batter_name', 'batter_team',
                           'batter_avg', 'batter_obp', 'batter_slg']

# Create pitcher season context
pitching_context = pitching_2023[['IDfg', 'Name', 'Team', 'ERA', 'WHIP', 'K/9']].copy()
pitching_context.columns = ['pitcher_fg_id', 'pitcher_name', 'pitcher_team',
                            'pitcher_era', 'pitcher_whip', 'pitcher_k9']

print(f"\nMatchup data shape: {matchups.shape}")
print("Sample matchup record:")
print(matchups.head(3).to_string())

# Example 8: Joining with park factors
print("\n" + "="*60)
print("Adding ballpark context...")

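# Create sample park factors data (illustrative values), using the same
# team abbreviations that appear in Statcast's home_team column
park_factors = pd.DataFrame({
    'team_code': ['LAA', 'BOS', 'NYY', 'SF', 'COL'],
    'park_name': ['Angel Stadium', 'Fenway Park', 'Yankee Stadium',
                  'Oracle Park', 'Coors Field'],
    'run_factor': [0.98, 1.02, 1.05, 0.92, 1.15],
    'hr_factor': [0.95, 1.08, 1.12, 0.85, 1.25]
})

# Statcast rows carry the home team abbreviation in the home_team column,
# so the park factors can be attached with a left join
statcast_sample = statcast_2023.head(1000).copy()
statcast_sample = statcast_sample.merge(
    park_factors, left_on='home_team', right_on='team_code', how='left'
)

print("\nPark factors:")
print(park_factors.to_string())
print("\nStatcast rows with park context:")
print(statcast_sample[['game_date', 'home_team', 'park_name',
                       'run_factor']].head().to_string())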

# Example 9: Working with player ID mappings
print("\n" + "="*60)
print("Cross-referencing player IDs across systems...")

# In practice, load the Chadwick Bureau ID mapping file
# For demonstration, we'll show the concept
player_mapping = pd.DataFrame({
    'key_mlbam': [545361, 665487, 592450],
    'key_fangraphs': [10155, 19709, 15640],
    'key_bbref': ['troutmi01', 'tatisfe02', 'judgeaa01'],
    'name_first': ['Mike', 'Fernando', 'Aaron'],
    'name_last': ['Trout', 'Tatis Jr.', 'Judge']
})

print("\nPlayer ID mapping sample:")
print(player_mapping.to_string())

# Use this to join FanGraphs data with Statcast (mlbam) data
print("\nJoining across ID systems:")
print("1. Start with Statcast data (uses key_mlbam)")
print("2. Join player_mapping on key_mlbam")
print("3. Now you have key_fangraphs to join with FanGraphs stats")

# Demonstrate the join (metric values below are made up for the demo)
demo_statcast = pd.DataFrame({
    'key_mlbam': [545361, 665487],
    'avg_exit_velo': [92.3, 94.1]
})

demo_fangraphs = pd.DataFrame({
    'key_fangraphs': [10155, 19709],
    'wOBA': [.398, .415]
})

# Step 1: Join Statcast with mapping
result = demo_statcast.merge(
    player_mapping[['key_mlbam', 'key_fangraphs', 'name_first', 'name_last']],
    on='key_mlbam',
    how='left'
)

# Step 2: Join with FanGraphs data
result = result.merge(demo_fangraphs, on='key_fangraphs', how='left')

print("\nFinal joined result:")
print(result.to_string())

print("\n" + "="*60)
print("Key takeaways:")
print("- Always verify join keys match across datasets")
print("- Use player ID mapping files for cross-system joins")
print("- Check for duplicate keys before merging")
print("- Validate row counts after joins (inner vs left vs outer)")
print("- Handle missing values appropriately for your analysis")

R: Loading and Understanding Data Schemas

# Load required libraries
library(tidyverse)
library(baseballr)
library(lubridate)

# Example 1: Loading Statcast data
cat("Loading Statcast data...\n")
statcast_data <- statcast_search(
  start_date = "2023-04-01",
  end_date = "2023-04-07",
  player_type = "batter"
)

# Examine the schema
cat(sprintf("\nDataset dimensions: %d rows x %d columns\n",
            nrow(statcast_data), ncol(statcast_data)))

cat("\nFirst 20 column names:\n")
print(names(statcast_data)[1:20])

# Check data types
cat("\nData types for key fields:\n")
key_fields <- c('game_date', 'batter', 'pitcher', 'events',
                'launch_speed', 'launch_angle', 'release_speed',
                'release_spin_rate')
str(statcast_data[, key_fields])

# Summary statistics for Statcast metrics
cat("\nStatcast metrics summary:\n")
statcast_metrics <- c('launch_speed', 'launch_angle', 'release_speed',
                      'release_spin_rate', 'hit_distance_sc')
summary(statcast_data[, statcast_metrics])

# Example 2: Loading season-level batting statistics
cat("\n", rep("=", 60), "\n", sep="")
cat("Loading season batting statistics...\n")

# Using baseballr to get FanGraphs data
batting_2023 <- fg_batter_leaders(2023, 2023, qual = 100)

cat(sprintf("\nBatting stats dimensions: %d rows x %d columns\n",
            nrow(batting_2023), ncol(batting_2023)))

cat("\nColumns available:\n")
print(names(batting_2023))

# Show sample of traditional vs. advanced stats
cat("\nSample player statistics:\n")
batting_2023 %>%
  select(Name, Team, G, PA, AVG, OBP, SLG, wOBA, `wRC+`) %>%
  head(10) %>%
  print()

# Example 3: Understanding data granularity
cat("\n", rep("=", 60), "\n", sep="")
cat("Understanding data granularity...\n")

# Count events by type (play-by-play level)
cat("\nEvent type distribution:\n")
event_counts <- statcast_data %>%
  count(events, sort = TRUE) %>%
  head(15)
print(event_counts)

# Aggregate to player level
cat("\nAggregating to player-level statistics:\n")
player_summary <- statcast_data %>%
  filter(!is.na(launch_speed)) %>%
  group_by(batter) %>%
  summarise(
    batted_balls = n(),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    max_exit_velo = max(launch_speed, na.rm = TRUE),
    avg_launch_angle = mean(launch_angle, na.rm = TRUE),
    home_runs = sum(events == "home_run", na.rm = TRUE)
  ) %>%
  arrange(desc(batted_balls)) %>%
  head(10)

print(player_summary)

# Example 4: Data type handling and conversion
cat("\n", rep("=", 60), "\n", sep="")
cat("Handling data types...\n")

# Convert dates properly
statcast_data <- statcast_data %>%
  mutate(game_date = ymd(game_date))

# Ensure numeric fields are proper types
numeric_fields <- c('launch_speed', 'launch_angle', 'release_speed',
                    'release_spin_rate', 'plate_x', 'plate_z')

statcast_data <- statcast_data %>%
  mutate(across(all_of(numeric_fields), as.numeric))

# Convert categorical fields to factors
categorical_fields <- c('events', 'pitch_type', 'stand', 'p_throws', 'type')

statcast_data <- statcast_data %>%
  mutate(across(all_of(categorical_fields), as.factor))

# Check for missing values
cat("\nMissing value percentages for key fields:\n")
missing_pct <- statcast_data %>%
  select(all_of(key_fields)) %>%
  summarise(across(everything(), ~sum(is.na(.)) / n() * 100)) %>%
  pivot_longer(everything(), names_to = "field", values_to = "pct_missing")

print(missing_pct)

# Example 5: Filtering and subsetting data
cat("\n", rep("=", 60), "\n", sep="")
cat("Filtering Statcast data...\n")

# Filter for hard-hit balls
hard_hit <- statcast_data %>%
  filter(launch_speed >= 95, !is.na(launch_speed)) %>%
  select(game_date, batter, pitcher, events, launch_speed,
         launch_angle, hit_distance_sc)

cat(sprintf("\nHard-hit balls (95+ mph): %d events\n", nrow(hard_hit)))
cat("\nHard-hit ball outcomes:\n")
hard_hit %>%
  count(events, sort = TRUE) %>%
  print()

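# Barrels (simplified: uses the fixed 26-30 degree window that applies at 98 mph;
# the official definition widens the window as exit velocity increases)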
barrels <- statcast_data %>%
  filter(
    launch_speed >= 98,
    launch_angle >= 26,
    launch_angle <= 30
  )

cat(sprintf("\nBarrels (optimal contact): %d events\n", nrow(barrels)))
cat(sprintf("Barrel rate: %.2f%%\n",
            nrow(barrels) / nrow(filter(statcast_data, !is.na(launch_speed))) * 100))

R: Merging Multiple Data Sources

library(tidyverse)
library(baseballr)

# Example 6: Merging Statcast with season statistics
cat("Merging Statcast data with season totals...\n")

# Load Statcast data for a season (using sample from earlier)
# In practice, you'd load full season data
statcast_2023 <- statcast_search(
  start_date = "2023-04-01",
  end_date = "2023-04-30",  # Using one month for demo
  player_type = "batter"
)

# Aggregate Statcast metrics by batter
statcast_agg <- statcast_2023 %>%
  filter(!is.na(launch_speed)) %>%
  group_by(batter) %>%
  summarise(
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_launch_angle = mean(launch_angle, na.rm = TRUE),
    avg_spin_vs = mean(release_spin_rate, na.rm = TRUE),
    statcast_events = n(),
    statcast_hits = sum(events %in% c("single", "double", "triple", "home_run"),
                        na.rm = TRUE)
  ) %>%
  rename(key_mlbam = batter)  # Rename for clarity

cat(sprintf("\nStatcast aggregated data: %d players\n", nrow(statcast_agg)))

# Load season batting statistics
batting_2023 <- fg_batter_leaders(2023, 2023, qual = 50)

cat(sprintf("Season batting data: %d players\n", nrow(batting_2023)))

# Example 7: Using player ID mappings
cat("\n", rep("=", 60), "\n", sep="")
cat("Cross-referencing player IDs...\n")

# In practice, you would load the Chadwick ID mapping file
# Download from: https://github.com/chadwickbureau/register/blob/master/data/people.csv
# For demonstration, we'll create a sample mapping

player_mapping <- tibble(
  key_mlbam = c(545361, 665487, 592450),
  key_fangraphs = c(10155, 19709, 15640),
  key_bbref = c("troutmi01", "tatisfe02", "judgeaa01"),
  name_first = c("Mike", "Fernando", "Aaron"),
  name_last = c("Trout", "Tatis Jr.", "Judge")
)

cat("\nPlayer ID mapping sample:\n")
print(player_mapping)

# Demonstrate joining Statcast (mlbam IDs) with FanGraphs data
cat("\nJoining Statcast with FanGraphs data using ID mapping...\n")

# Step 1: Join Statcast aggregate with player mapping
statcast_with_ids <- statcast_agg %>%
  left_join(
    player_mapping %>% select(key_mlbam, key_fangraphs, name_first, name_last),
    by = "key_mlbam"
  )

# Step 2: Join with FanGraphs data
# Note: FanGraphs uses 'playerid' column (equivalent to key_fangraphs)
combined_data <- statcast_with_ids %>%
  left_join(
    batting_2023 %>% select(playerid, Name, Team, AVG, OBP, SLG, wOBA, `wRC+`),
    by = c("key_fangraphs" = "playerid")
  )

cat("\nCombined dataset sample:\n")
combined_data %>%
  select(name_first, name_last, avg_exit_velo, avg_launch_angle,
         AVG, OBP, wOBA, `wRC+`) %>%
  head(10) %>%
  print()

# Example 8: Creating pitcher-batter matchups
cat("\n", rep("=", 60), "\n", sep="")
cat("Creating pitcher-batter matchup data...\n")

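# Create matchup-level data: keep only the pitch that ended each plate appearance
# (events is NA or an empty string on other pitches, depending on how the data was read)
matchups <- statcast_2023 %>%
  select(game_date, game_pk, batter, pitcher, events,
         launch_speed, launch_angle, pitch_type, release_speed,
         stand, p_throws) %>%
  filter(!is.na(events), events != "")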

cat(sprintf("\nTotal matchups (plate appearances): %d\n", nrow(matchups)))

# Load pitcher season stats (available for joining on pitcher IDs; not used below)
pitching_2023 <- fg_pitcher_leaders(2023, 2023, qual = 50)

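# Events that count as plate appearances but not at-bats
# (Statcast event labels; verify against the values in your data)
non_ab_events <- c("walk", "intent_walk", "hit_by_pitch", "sac_fly",
                   "sac_bunt", "sac_fly_double_play", "sac_bunt_double_play",
                   "catcher_interf")
hit_events <- c("single", "double", "triple", "home_run")

# Create summary by matchup handedness (batter side vs pitcher hand)
matchup_summary <- matchups %>%
  mutate(matchup_type = paste(stand, "vs", p_throws)) %>%
  group_by(matchup_type) %>%
  summarise(
    plate_appearances = n(),
    at_bats = sum(!events %in% non_ab_events),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_launch_angle = mean(launch_angle, na.rm = TRUE),
    home_runs = sum(events == "home_run", na.rm = TRUE),
    batting_avg = sum(events %in% hit_events) / at_bats  # hits / at-bats, not hits / PA
  )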

cat("\nPerformance by batter/pitcher handedness:\n")
print(matchup_summary)

# Example 9: Joining with park factors
cat("\n", rep("=", 60), "\n", sep="")
cat("Adding ballpark context...\n")

# Create sample park factors
park_factors <- tibble(
  team_code = c("LAA", "BOS", "NYY", "SF", "COL"),
  park_name = c("Angel Stadium", "Fenway Park", "Yankee Stadium",
                "Oracle Park", "Coors Field"),
  run_factor = c(0.98, 1.02, 1.05, 0.92, 1.15),
  hr_factor = c(0.95, 1.08, 1.12, 0.85, 1.25)
)

cat("\nPark factors:\n")
print(park_factors)

# In practice, you would extract home team from game data
# and join with park factors to adjust statistics

# Example 10: Data validation after joins
cat("\n", rep("=", 60), "\n", sep="")
cat("Validating merged data...\n")

# Check for missing matches
# (with the three-player sample mapping above, most players will be unmatched)
cat("\nChecking join quality:\n")
cat(sprintf("Statcast players: %d\n", nrow(statcast_agg)))
cat(sprintf("Players with FanGraphs match: %d\n",
            sum(!is.na(combined_data$AVG))))
cat(sprintf("Players missing FanGraphs data: %d\n",
            sum(is.na(combined_data$AVG))))

# Verify no duplicate keys
duplicate_check <- statcast_agg %>%
  count(key_mlbam) %>%
  filter(n > 1)

cat(sprintf("\nDuplicate player IDs in Statcast data: %d\n",
            nrow(duplicate_check)))

# Check data ranges make sense
cat("\nData validation - checking reasonable ranges:\n")

validation <- combined_data %>%
  summarise(
    avg_exit_velo_min = min(avg_exit_velo, na.rm = TRUE),
    avg_exit_velo_max = max(avg_exit_velo, na.rm = TRUE),
    avg_valid = all(AVG >= 0 & AVG <= 1, na.rm = TRUE),
    obp_valid = all(OBP >= 0 & OBP <= 1, na.rm = TRUE)
  )

print(validation)

cat("\n", rep("=", 60), "\n", sep="")
cat("Key R workflow takeaways:\n")
cat("- Use dplyr joins (left_join, inner_join) for merging\n")
cat("- Always specify join keys explicitly with 'by' parameter\n")
cat("- Use select() to keep only needed columns before joining\n")
cat("- Validate results with summary statistics and counts\n")
cat("- Handle NA values appropriately for your analysis\n")
cat("- Use player ID mapping files from Chadwick Bureau\n")
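
Example 9 stops short of the actual join. A minimal sketch of that step follows, assuming the Statcast pull includes a `home_team` column holding the home club's abbreviation (check this in your own data). The sample rows and factor values are made up, and team abbreviations in real data may not match the codes used in a given park-factor table.

```r
library(dplyr)
library(tibble)

# Made-up event rows standing in for statcast_2023
statcast_sample <- tibble(
  game_pk   = c(1, 1, 2, 3),
  home_team = c("COL", "COL", "SF", "SEA"),
  events    = c("home_run", "single", "home_run", "double")
)

# Illustrative park factors (same shape as the sample table above)
park_factors <- tibble(
  team_code = c("LAA", "BOS", "NYY", "SF", "COL"),
  hr_factor = c(0.95, 1.08, 1.12, 0.85, 1.25)
)

# Attach the home park's factor to every event
with_park <- statcast_sample %>%
  left_join(park_factors, by = c("home_team" = "team_code"))

# Parks without a factor stay NA; list them so the gap is visible
missing_parks <- with_park %>%
  filter(is.na(hr_factor)) %>%
  distinct(home_team)

print(with_park)
print(missing_parks)
```

A left join keeps every event even when a park has no factor, so the unmatched parks show up as explicit NA values instead of silently dropping rows.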

Conclusion

Understanding baseball data requires familiarity with multiple data types, granularity levels, identification systems, and merging strategies. Key takeaways:

  • Choose the right data type: Use box scores for summaries, play-by-play for context, pitch-level for sequencing, and Statcast for skills-based analysis
  • Match granularity to your question: Pitch-level data isn't always necessary; sometimes season aggregates are more appropriate
  • Master player identifiers: Use mapping files to connect data across sources (FanGraphs, Baseball Reference, Statcast)
  • Validate your data: Check for missing values, impossible values, and join quality before analysis
  • Leverage expected metrics: xBA, xwOBA, and other expected stats often predict future performance better than actual results
  • Understand data limitations: Be aware of historical gaps, measurement error, and definitional inconsistencies
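
The identifier and validation points above can be checked mechanically. The sketch below uses made-up IDs and hypothetical table names to show three checks: `anti_join()` to list keys with no partner, a duplicate-key count to catch one-to-many joins, and a range filter for impossible values.

```r
library(dplyr)
library(tibble)

# Made-up tables; the IDs are not real player IDs
statcast_side <- tibble(
  player_id = c(1, 2, 3, 4),
  avg_exit_velo = c(88.1, 91.4, 86.0, 93.2)
)
season_side <- tibble(
  player_id = c(1, 2, 2, 5),
  AVG = c(0.250, 0.301, 0.301, 1.200)
)

# 1. Keys present on the left but missing on the right
unmatched <- anti_join(statcast_side, season_side, by = "player_id")

# 2. Duplicate keys on the right would multiply rows in a left join
dup_keys <- season_side %>%
  count(player_id) %>%
  filter(n > 1)

# 3. Values outside the possible range (batting average must lie in [0, 1])
impossible <- season_side %>%
  filter(AVG < 0 | AVG > 1)

# Remove exact duplicate rows before joining so the row count is preserved
joined <- left_join(statcast_side, distinct(season_side), by = "player_id")

print(unmatched)
print(dup_keys)
print(impossible)
```

Comparing `nrow(joined)` with `nrow(statcast_side)` after the join is a cheap way to confirm that no rows were duplicated.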

By mastering these fundamentals, you'll be well-equipped to conduct sophisticated baseball analytics using any combination of data sources.
