Player Performance Prediction

Advanced 10 min read 19 views Nov 26, 2025

Player Performance Prediction in Baseball

Player performance prediction represents one of the most challenging and valuable applications of baseball analytics. Unlike evaluating past performance, projection systems must account for aging curves, regression to the mean, injury risk, sample size limitations, and the inherent randomness of baseball. This tutorial explores the methodologies behind modern projection systems, from simple aging-based approaches to sophisticated ensemble methods that power fantasy baseball and front office decision-making.

The stakes are high: teams use projections to evaluate trades, set arbitration cases, and plan roster construction. Fantasy players draft based on projections. Analysts use them as baselines for player evaluation. Understanding how projection systems work—and their limitations—is essential for anyone engaged in serious baseball analysis.

We'll explore the foundational Marcel system, examine professional systems like ZiPS and Steamer, and build our own projection models using modern machine learning techniques. Along the way, we'll discuss uncertainty quantification, cross-validation strategies specific to player projections, and how to compare predictions to actual outcomes.

The Fundamentals: Why Prediction is Hard

Baseball presents unique challenges for prediction:

Small Sample Sizes: Even a full season provides only 600 plate appearances—small for statistical inference
Regression to the Mean: Extreme performances tend to move toward league average
Aging Effects: Players improve until their late 20s, then gradually decline
Injury Risk: Unpredictable injuries can derail projections
Context Changes: Ballpark, league, and lineup context affect outcomes
Skill Changes: Players develop new skills or lose existing ones
Random Variation: Baseball has substantial luck—BABIP, HR/FB rates fluctuate

Marcel the Monkey: The Simplest Projection System

Created by Tom Tango, Marcel the Monkey demonstrates that a simple, transparent system can be surprisingly accurate. Marcel uses only three inputs: the player's previous three seasons of data, weighted by recency, with regression to the mean and an aging adjustment. No complex statistics, no machine learning—just sound statistical principles.

Marcel's Algorithm:

Weight last three seasons: 5 (last year), 4 (two years ago), 3 (three years ago)
Regress to the mean: 1200 PA of league average added
Age adjustment: Peak at 29, decline 0.5% per year after
Playing time projection: Conservative estimate based on past PA

Professional Projection Systems

ZiPS (Szymborski Projection System)

Created by Dan Szymborski, ZiPS uses a more sophisticated approach than Marcel:

Weighted historical data with more complex aging curves
Similarity scores to compare players to historical comparables
Component-based approach (K%, BB%, ISO, BABIP)
Incorporates minor league translations
Accounts for injury history and playing time risk

Steamer Projections

Developed by Dash Davidson and Jared Cross, Steamer emphasizes:

Five-year weighted history
Granular aging curves by player type
Platoon adjustments
Ballpark factors
Extensive regression testing and validation

THE BAT (The BiAS Adjuster for Two-way players)

A modern system by Derek Carty that uses Bayesian statistics:

Bayesian updating of priors based on new information
Advanced treatment of outliers and sustainability
Statcast integration for batted ball data
Sophisticated playing time algorithms

Feature Selection for Projection Models

Selecting the right features is crucial for accurate projections:

Feature Type	Examples	Predictive Value
Rate Stats (Last 3 Years)	AVG, OBP, SLG, ISO	High
Component Stats	K%, BB%, BABIP, HR/FB	Very High
Statcast Metrics	Exit Velocity, Launch Angle, Sprint Speed	High
Age	Current Age, Age²	Very High
Historical Playing Time	PA per season trend	Medium
Batted Ball Profile	GB%, FB%, LD%	Medium
Plate Discipline	O-Swing%, Z-Contact%, SwStr%	High
Position	Defensive position	Low (for offensive stats)

Aging Curves: Modeling Peak Performance

Understanding aging curves is fundamental to projections. Research by Jeff Zimmerman, Mitchel Lichtman, and others has shown:

Power peaks early: HR/FB rates peak around age 26-27
Contact skills age well: K% increases gradually after 30
Speed declines linearly: SB and defense decline steadily
Plate discipline improves: BB% often increases into early 30s
BABIP declines: Speed loss affects infield hits

Aging curves should be applied at the component level rather than to aggregate stats like batting average or OPS, as different skills age differently.

Regression to the Mean

Regression to the mean is the statistical phenomenon where extreme outcomes tend to be followed by more moderate ones. In baseball:

A .350 hitter is unlikely to repeat that average
A player with 5 HR in 50 PA probably won't maintain 162-game pace of 60+ HR
Extreme K% or BB% rates tend to move toward career norms

How to Apply Regression:

The standard approach is to add "dummy" plate appearances of league-average performance. For Marcel, this is 1200 PA. The more extreme the sample, the more regression is needed. Bayesian approaches provide a more sophisticated framework for this.

Ensemble Methods for Player Projections

Modern machine learning enables ensemble approaches that combine multiple models:

Random Forests: Ensemble of decision trees that handles non-linear relationships
Gradient Boosting: Sequential trees that correct previous errors (XGBoost, LightGBM)
Stacking: Combine predictions from Marcel, similarity scores, and ML models
Neural Networks: Deep learning for complex feature interactions

The advantage: ensembles often outperform individual models by capturing different aspects of player performance.

Uncertainty Quantification

Point estimates without uncertainty are dangerous. A projection of 25 HR could mean:

90% confidence interval of 20-30 HR (narrow, reliable)
90% confidence interval of 10-40 HR (wide, uncertain)

Methods for Uncertainty:

Bootstrap resampling: Resample historical data to generate distribution
Monte Carlo simulation: Simulate thousands of seasons
Quantile regression: Predict 10th, 50th, 90th percentiles directly
Bayesian credible intervals: Natural uncertainty from Bayesian models

Comparing Predictions to Actuals

Evaluating projection systems requires careful methodology:

Metric	Description	Interpretation
RMSE	Root Mean Squared Error	Lower is better, penalizes large errors
MAE	Mean Absolute Error	Average magnitude of errors
R²	Correlation coefficient squared	Proportion of variance explained
Bias	Mean (Actual - Predicted)	Systematic over/under prediction
Coverage	% of actuals within prediction interval	Should match confidence level

Important Considerations:

Only evaluate players with sufficient playing time (e.g., 200+ PA)
Account for pro-rated projections vs full-season actuals
Test on out-of-sample data to avoid overfitting
Compare to baseline (Marcel, last year's stats)

Seasonal vs Career Projections

Rest-of-Season Projections:

Update during season with current year's performance
Weight recent data more heavily
Account for platoon splits, matchups
Useful for trades, DFS, in-season decisions

Career Projections:

Longer time horizon (3-5 years)
Higher uncertainty
Aging curves more important
Used for contract negotiations, dynasty leagues

Python Implementation: Building a Projection System


"""
Player Performance Prediction System
Comprehensive projection model with aging curves, regression, and uncertainty
"""

import pybaseball as pyb
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

pyb.cache.enable()

print("Player Performance Prediction System")
print("=" * 60)

# ============================================================================
# 1. DATA PREPARATION
# ============================================================================

print("\n1. Fetching historical player data...")

# Get multiple years of data
years = range(2018, 2024)
all_data = []

for year in years:
    print(f"  Fetching {year}...")
    df = pyb.batting_stats(year, qual=200)
    df['season'] = year
    all_data.append(df)

data = pd.concat(all_data, ignore_index=True)
print(f"\nTotal player-seasons: {len(data)}")

# ============================================================================
# 2. FEATURE ENGINEERING
# ============================================================================

print("\n2. Engineering predictive features...")

# Calculate key metrics if not present
data['ISO'] = data['SLG'] - data['AVG']
data['BABIP'] = (data['H'] - data['HR']) / (data['AB'] - data['SO'] - data['HR'] + data['SF'])
data['HR_per_PA'] = data['HR'] / data['PA']
data['K_pct'] = data['SO'] / data['PA'] * 100
data['BB_pct'] = data['BB'] / data['PA'] * 100

# ============================================================================
# 3. MARCEL-STYLE SIMPLE PROJECTION
# ============================================================================

print("\n3. Building Marcel-style projection system...")

def marcel_projection(player_df, stat_cols, league_avg_df):
    """
    Simple Marcel projection for a single player

    Weights: 5 (last year), 4 (2 years ago), 3 (3 years ago)
    Regression: Add 1200 PA of league average
    Age adjustment: -0.5% per year after 29
    """
    player_df = player_df.sort_values('season')

    if len(player_df) == 0:
        return None

    projections = {}

    for stat in stat_cols:
        # Weight last 3 years
        weights = [3, 4, 5]  # oldest to newest
        weighted_vals = []

        for i, (_, row) in enumerate(player_df.tail(3).iterrows()):
            weight = weights[i] if i < len(weights) else 5
            weighted_vals.append(row[stat] * weight * row['PA'])

        total_weighted_pa = sum([row['PA'] * weights[i] if i < len(weights) else row['PA'] * 5
                                for i, (_, row) in enumerate(player_df.tail(3).iterrows())])

        # Add regression PA
        regression_pa = 1200
        league_avg = league_avg_df[stat].mean()

        weighted_stat = (sum(weighted_vals) + league_avg * regression_pa) / (total_weighted_pa + regression_pa)

        # Age adjustment (assuming peak at 29)
        current_age = player_df.iloc[-1].get('Age', 29)
        if current_age > 29:
            age_decline = 0.005 * (current_age - 29)  # 0.5% per year
            weighted_stat *= (1 - age_decline)

        projections[stat] = weighted_stat

    return projections

# Apply Marcel to 2023 season using 2020-2022 data
train_data = data[data['season'] < 2023]
test_data = data[data['season'] == 2023]

league_avg = train_data.groupby('season').mean()

projection_stats = ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct']
marcel_results = []

print("\nGenerating Marcel projections for 2023...")
for name in test_data['Name'].unique():
    player_history = train_data[train_data['Name'] == name]
    if len(player_history) > 0:
        proj = marcel_projection(player_history, projection_stats, league_avg)
        if proj:
            proj['Name'] = name
            marcel_results.append(proj)

marcel_df = pd.DataFrame(marcel_results)
print(f"Generated {len(marcel_df)} Marcel projections")

# Evaluate Marcel
marcel_eval = marcel_df.merge(test_data[['Name'] + projection_stats],
                              on='Name', suffixes=('_proj', '_actual'))

print("\nMarcel Projection Accuracy:")
for stat in projection_stats:
    if f'{stat}_proj' in marcel_eval.columns and f'{stat}_actual' in marcel_eval.columns:
        actual = marcel_eval[f'{stat}_actual'].dropna()
        pred = marcel_eval[f'{stat}_proj'].dropna()
        if len(actual) > 0 and len(pred) > 0:
            # Align lengths
            min_len = min(len(actual), len(pred))
            rmse = np.sqrt(mean_squared_error(actual[:min_len], pred[:min_len]))
            mae = mean_absolute_error(actual[:min_len], pred[:min_len])
            r2 = r2_score(actual[:min_len], pred[:min_len])
            print(f"  {stat}: RMSE={rmse:.4f}, MAE={mae:.4f}, R²={r2:.3f}")

# ============================================================================
# 4. AGING CURVES
# ============================================================================

print("\n4. Calculating aging curves...")

# Group by age and calculate mean performance
aging_data = data.groupby('Age').agg({
    'HR_per_PA': 'mean',
    'K_pct': 'mean',
    'BB_pct': 'mean',
    'ISO': 'mean',
    'BABIP': 'mean',
    'Name': 'count'
}).rename(columns={'Name': 'player_count'})

aging_data = aging_data[aging_data['player_count'] >= 10]  # Minimum sample size

print("\nAging Curve Summary:")
print(aging_data.round(3))

# Visualize aging curves
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
metrics = ['HR_per_PA', 'K_pct', 'BB_pct', 'ISO', 'BABIP']
titles = ['HR per PA', 'Strikeout %', 'Walk %', 'ISO', 'BABIP']

for i, (metric, title) in enumerate(zip(metrics, titles)):
    ax = axes[i // 3, i % 3]
    ax.plot(aging_data.index, aging_data[metric], marker='o', linewidth=2)
    ax.axvline(29, color='red', linestyle='--', alpha=0.5, label='Peak Age (29)')
    ax.set_xlabel('Age')
    ax.set_ylabel(title)
    ax.set_title(f'Aging Curve: {title}')
    ax.grid(True, alpha=0.3)
    ax.legend()

# Player count
ax = axes[1, 2]
ax.bar(aging_data.index, aging_data['player_count'], color='steelblue', edgecolor='black')
ax.set_xlabel('Age')
ax.set_ylabel('Number of Players')
ax.set_title('Sample Size by Age')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300)
plt.show()

# ============================================================================
# 5. MACHINE LEARNING PROJECTION MODEL
# ============================================================================

print("\n5. Building machine learning projection models...")

# Prepare features for ML model
# Goal: Predict next season stats using previous 3 seasons

def create_ml_features(df):
    """Create lagged features for prediction"""
    features_list = []

    for name in df['Name'].unique():
        player_data = df[df['Name'] == name].sort_values('season')

        for i in range(len(player_data) - 1):
            # Use data up to season i to predict season i+1
            history = player_data.iloc[max(0, i-2):i+1]
            target = player_data.iloc[i+1]

            if len(history) > 0:
                features = {}

                # Weighted historical stats (most recent = higher weight)
                weights = np.array([3, 4, 5])[-len(history):]
                weights = weights / weights.sum()

                for stat in ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct', 'ISO', 'BABIP']:
                    features[f'{stat}_weighted'] = (history[stat] * weights).sum()
                    features[f'{stat}_last'] = history.iloc[-1][stat]
                    if len(history) >= 2:
                        features[f'{stat}_trend'] = history.iloc[-1][stat] - history.iloc[-2][stat]
                    else:
                        features[f'{stat}_trend'] = 0

                # Age
                features['age'] = target['Age']
                features['age_squared'] = target['Age'] ** 2

                # Playing time trend
                features['pa_last'] = history.iloc[-1]['PA']
                features['pa_trend'] = history['PA'].diff().mean() if len(history) >= 2 else 0

                # Target variables
                for stat in ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct']:
                    features[f'target_{stat}'] = target[stat]

                features['target_season'] = target['season']
                features['name'] = name

                features_list.append(features)

    return pd.DataFrame(features_list)

print("\nCreating ML features...")
ml_data = create_ml_features(data)
print(f"ML dataset size: {len(ml_data)} player-seasons")

# Prepare train/test split (use 2023 as test)
train_ml = ml_data[ml_data['target_season'] < 2023]
test_ml = ml_data[ml_data['target_season'] == 2023]

feature_cols = [col for col in ml_data.columns
                if col not in ['name', 'target_season'] and not col.startswith('target_')]
target_stats = ['AVG', 'OBP', 'SLG', 'HR']

# Remove any NaN values
train_ml = train_ml.dropna()
test_ml = test_ml.dropna()

print(f"\nTraining samples: {len(train_ml)}")
print(f"Test samples: {len(test_ml)}")

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(train_ml[feature_cols])
X_test = scaler.transform(test_ml[feature_cols])

# Train models for each target stat
models = {}
predictions = {}

print("\nTraining Random Forest models...")
for target_stat in target_stats:
    print(f"\n  Predicting {target_stat}...")

    y_train = train_ml[f'target_{target_stat}']
    y_test = test_ml[f'target_{target_stat}']

    # Random Forest
    rf_model = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=20,
        random_state=42,
        n_jobs=-1
    )
    rf_model.fit(X_train, y_train)

    # Gradient Boosting
    gb_model = GradientBoostingRegressor(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.1,
        random_state=42
    )
    gb_model.fit(X_train, y_train)

    # Predictions
    rf_pred = rf_model.predict(X_test)
    gb_pred = gb_model.predict(X_test)

    # Ensemble: average RF and GB
    ensemble_pred = (rf_pred + gb_pred) / 2

    # Evaluate
    rmse_rf = np.sqrt(mean_squared_error(y_test, rf_pred))
    rmse_gb = np.sqrt(mean_squared_error(y_test, gb_pred))
    rmse_ens = np.sqrt(mean_squared_error(y_test, ensemble_pred))

    r2_rf = r2_score(y_test, rf_pred)
    r2_gb = r2_score(y_test, gb_pred)
    r2_ens = r2_score(y_test, ensemble_pred)

    print(f"    Random Forest - RMSE: {rmse_rf:.4f}, R²: {r2_rf:.3f}")
    print(f"    Gradient Boost - RMSE: {rmse_gb:.4f}, R²: {r2_gb:.3f}")
    print(f"    Ensemble - RMSE: {rmse_ens:.4f}, R²: {r2_ens:.3f}")

    models[target_stat] = {'rf': rf_model, 'gb': gb_model}
    predictions[target_stat] = {
        'actual': y_test.values,
        'rf': rf_pred,
        'gb': gb_pred,
        'ensemble': ensemble_pred
    }

    # Feature importance (Random Forest)
    importances = pd.DataFrame({
        'feature': feature_cols,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False).head(10)

    print(f"\n    Top 10 Features for {target_stat}:")
    for _, row in importances.iterrows():
        print(f"      {row['feature']}: {row['importance']:.4f}")

# ============================================================================
# 6. UNCERTAINTY QUANTIFICATION
# ============================================================================

print("\n6. Quantifying prediction uncertainty...")

def bootstrap_prediction_interval(model, X, n_bootstraps=100, confidence=0.90):
    """Generate prediction intervals via bootstrap"""
    predictions = []

    for _ in range(n_bootstraps):
        # Resample training data indices
        indices = np.random.choice(len(X), len(X), replace=True)
        X_boot = X[indices]

        # Predict
        pred = model.predict(X_boot)
        predictions.append(pred.mean())

    predictions = np.array(predictions)
    lower = np.percentile(predictions, (1 - confidence) / 2 * 100)
    upper = np.percentile(predictions, (1 + confidence) / 2 * 100)

    return lower, upper

# Calculate prediction intervals for HR
print("\nCalculating 90% prediction intervals for HR...")
target = 'HR'
model = models[target]['ensemble'] if target in models else models[target]['rf']

sample_players = test_ml.sample(min(10, len(test_ml)), random_state=42)

for idx, row in sample_players.iterrows():
    player_name = row['name']
    actual = row[f'target_{target}']

    # Point prediction
    X_player = scaler.transform([row[feature_cols]])
    point_pred = model.predict(X_player)[0]

    # Simple approximation: use RMSE to estimate interval
    rmse = np.sqrt(mean_squared_error(
        test_ml[f'target_{target}'],
        model.predict(X_test)
    ))

    lower = point_pred - 1.645 * rmse  # 90% CI
    upper = point_pred + 1.645 * rmse

    print(f"  {player_name}: Actual={actual:.0f}, Pred={point_pred:.0f}, "
          f"90% CI=[{lower:.0f}, {upper:.0f}]")

# ============================================================================
# 7. VISUALIZATION: PREDICTIONS VS ACTUALS
# ============================================================================

print("\n7. Creating prediction visualizations...")

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stat in enumerate(target_stats):
    ax = axes[i // 2, i % 2]

    actual = predictions[stat]['actual']
    pred = predictions[stat]['ensemble']

    # Scatter plot
    ax.scatter(actual, pred, alpha=0.6, s=50)

    # Perfect prediction line
    min_val = min(actual.min(), pred.min())
    max_val = max(actual.max(), pred.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(actual, pred))
    r2 = r2_score(actual, pred)

    ax.set_xlabel(f'Actual {stat}', fontsize=12)
    ax.set_ylabel(f'Predicted {stat}', fontsize=12)
    ax.set_title(f'{stat} Predictions (2023)\nRMSE: {rmse:.3f}, R²: {r2:.3f}', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('prediction_accuracy.png', dpi=300)
plt.show()

# Residual analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stat in enumerate(target_stats):
    ax = axes[i // 2, i % 2]

    actual = predictions[stat]['actual']
    pred = predictions[stat]['ensemble']
    residuals = actual - pred

    ax.hist(residuals, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
    ax.axvline(0, color='red', linestyle='--', linewidth=2)
    ax.set_xlabel('Residual (Actual - Predicted)', fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title(f'{stat} Residuals\nMean: {residuals.mean():.3f}, Std: {residuals.std():.3f}',
                fontsize=12)
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('prediction_residuals.png', dpi=300)
plt.show()

# ============================================================================
# 8. CROSS-VALIDATION FOR PLAYER PROJECTIONS
# ============================================================================

print("\n8. Performing time-series cross-validation...")

# Time-series CV: train on past, predict next season
cv_results = []

for test_year in range(2021, 2024):
    print(f"\n  Testing on {test_year}...")

    train_cv = ml_data[ml_data['target_season'] < test_year].dropna()
    test_cv = ml_data[ml_data['target_season'] == test_year].dropna()

    if len(train_cv) == 0 or len(test_cv) == 0:
        continue

    X_train_cv = scaler.fit_transform(train_cv[feature_cols])
    X_test_cv = scaler.transform(test_cv[feature_cols])

    for stat in ['AVG', 'OBP', 'SLG', 'HR']:
        y_train_cv = train_cv[f'target_{stat}']
        y_test_cv = test_cv[f'target_{stat}']

        model_cv = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
        model_cv.fit(X_train_cv, y_train_cv)

        pred_cv = model_cv.predict(X_test_cv)

        rmse = np.sqrt(mean_squared_error(y_test_cv, pred_cv))
        r2 = r2_score(y_test_cv, pred_cv)

        cv_results.append({
            'test_year': test_year,
            'stat': stat,
            'rmse': rmse,
            'r2': r2
        })

        print(f"    {stat}: RMSE={rmse:.4f}, R²={r2:.3f}")

cv_df = pd.DataFrame(cv_results)

print("\n9. Cross-Validation Summary:")
summary = cv_df.groupby('stat').agg({'rmse': 'mean', 'r2': 'mean'})
print(summary)

print("\n" + "=" * 60)
print("Player Projection System Complete!")
print("\nKey Findings:")
print("1. Marcel provides a strong baseline with simple, interpretable methods")
print("2. Aging curves show different patterns for power, contact, and speed")
print("3. Machine learning ensembles improve upon Marcel slightly")
print("4. Prediction intervals are wide—uncertainty is inherent")
print("5. Time-series CV is essential to avoid overfitting")

R Implementation: Player Projection System


# Player Performance Prediction System in R
# Comprehensive projection model with aging curves and validation

library(baseballr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(randomForest)
library(caret)
library(gridExtra)

cat("Player Performance Prediction System\n")
cat(rep("=", 60), "\n", sep="")

# ============================================================================
# 1. DATA PREPARATION
# ============================================================================

cat("\n1. Fetching historical player data...\n")

# Get multiple years of data
years <- 2018:2023
all_data <- list()

for (year in years) {
  cat(sprintf("  Fetching %d...\n", year))
  df <- fg_batter_leaders(year, year, qual = 200)
  df$season <- year
  all_data[[as.character(year)]] <- df
}

data <- bind_rows(all_data)
cat(sprintf("\nTotal player-seasons: %d\n", nrow(data)))

# ============================================================================
# 2. FEATURE ENGINEERING
# ============================================================================

cat("\n2. Engineering predictive features...\n")

data <- data %>%
  mutate(
    ISO = SLG - AVG,
    BABIP = (H - HR) / (AB - SO - HR + SF),
    HR_per_PA = HR / PA,
    K_pct = SO / PA * 100,
    BB_pct = BB / PA * 100
  )

# ============================================================================
# 3. MARCEL-STYLE PROJECTION
# ============================================================================

cat("\n3. Building Marcel-style projection system...\n")

marcel_projection <- function(player_df, stat_col, league_avg) {
  # Sort by season
  player_df <- player_df %>% arrange(season)

  # Weight last 3 years: 3, 4, 5 (oldest to newest)
  if (nrow(player_df) == 0) return(NA)

  recent <- tail(player_df, 3)
  weights <- c(3, 4, 5)[1:nrow(recent)]

  # Weighted average
  weighted_stat <- sum(recent[[stat_col]] * weights * recent$PA) /
                   sum(weights * recent$PA)

  # Add regression (1200 PA of league average)
  regression_pa <- 1200
  total_pa <- sum(weights * recent$PA) + regression_pa

  regressed_stat <- (sum(recent[[stat_col]] * weights * recent$PA) +
                     league_avg * regression_pa) / total_pa

  # Age adjustment (assuming peak at 29)
  current_age <- tail(player_df$Age, 1)
  if (current_age > 29) {
    age_decline <- 0.005 * (current_age - 29)
    regressed_stat <- regressed_stat * (1 - age_decline)
  }

  return(regressed_stat)
}

# Apply Marcel to 2023 season using 2020-2022 data
train_data <- data %>% filter(season < 2023)
test_data <- data %>% filter(season == 2023)

league_avg <- train_data %>%
  group_by(season) %>%
  summarise(
    AVG = mean(AVG, na.rm = TRUE),
    OBP = mean(OBP, na.rm = TRUE),
    SLG = mean(SLG, na.rm = TRUE),
    HR = mean(HR, na.rm = TRUE)
  ) %>%
  summarise(across(where(is.numeric), mean))

projection_stats <- c("AVG", "OBP", "SLG", "HR")
marcel_results <- data.frame()

cat("\nGenerating Marcel projections for 2023...\n")

for (player_name in unique(test_data$Name)) {
  player_history <- train_data %>% filter(Name == player_name)

  if (nrow(player_history) > 0) {
    proj_row <- data.frame(Name = player_name)

    for (stat in projection_stats) {
      proj_value <- marcel_projection(player_history, stat, league_avg[[stat]])
      proj_row[[paste0(stat, "_proj")]] <- proj_value
    }

    marcel_results <- bind_rows(marcel_results, proj_row)
  }
}

cat(sprintf("Generated %d Marcel projections\n", nrow(marcel_results)))

# Evaluate Marcel
marcel_eval <- marcel_results %>%
  inner_join(test_data %>% select(Name, all_of(projection_stats)),
            by = "Name", suffix = c("_proj", "_actual"))

cat("\nMarcel Projection Accuracy:\n")
for (stat in projection_stats) {
  proj_col <- paste0(stat, "_proj")
  actual_col <- stat

  if (proj_col %in% names(marcel_eval) && actual_col %in% names(marcel_eval)) {
    eval_subset <- marcel_eval %>%
      filter(!is.na(!!sym(proj_col)), !is.na(!!sym(actual_col)))

    if (nrow(eval_subset) > 0) {
      rmse <- sqrt(mean((eval_subset[[actual_col]] - eval_subset[[proj_col]])^2))
      mae <- mean(abs(eval_subset[[actual_col]] - eval_subset[[proj_col]]))
      r2 <- cor(eval_subset[[actual_col]], eval_subset[[proj_col]])^2

      cat(sprintf("  %s: RMSE=%.4f, MAE=%.4f, R²=%.3f\n", stat, rmse, mae, r2))
    }
  }
}

# ============================================================================
# 4. AGING CURVES
# ============================================================================

cat("\n4. Calculating aging curves...\n")

aging_data <- data %>%
  group_by(Age) %>%
  summarise(
    HR_per_PA = mean(HR_per_PA, na.rm = TRUE),
    K_pct = mean(K_pct, na.rm = TRUE),
    BB_pct = mean(BB_pct, na.rm = TRUE),
    ISO = mean(ISO, na.rm = TRUE),
    BABIP = mean(BABIP, na.rm = TRUE),
    player_count = n()
  ) %>%
  filter(player_count >= 10)

cat("\nAging Curve Summary:\n")
print(aging_data)

# Visualize aging curves
p1 <- ggplot(aging_data, aes(x = Age, y = HR_per_PA)) +
  geom_line(size = 1.5, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: HR per PA", x = "Age", y = "HR per PA") +
  theme_minimal()

p2 <- ggplot(aging_data, aes(x = Age, y = K_pct)) +
  geom_line(size = 1.5, color = "darkred") +
  geom_point(size = 3, color = "darkred") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: Strikeout %", x = "Age", y = "K%") +
  theme_minimal()

p3 <- ggplot(aging_data, aes(x = Age, y = BB_pct)) +
  geom_line(size = 1.5, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: Walk %", x = "Age", y = "BB%") +
  theme_minimal()

p4 <- ggplot(aging_data, aes(x = Age, y = ISO)) +
  geom_line(size = 1.5, color = "purple") +
  geom_point(size = 3, color = "purple") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: ISO", x = "Age", y = "ISO") +
  theme_minimal()

combined <- grid.arrange(p1, p2, p3, p4, ncol = 2)
ggsave("aging_curves_r.png", combined, width = 12, height = 10, dpi = 300)

# ============================================================================
# 5. MACHINE LEARNING PROJECTION MODEL
# ============================================================================

cat("\n5. Building machine learning projection models...\n")

# Create lagged features
create_ml_features <- function(df) {
  features_list <- list()

  for (player_name in unique(df$Name)) {
    player_data <- df %>%
      filter(Name == player_name) %>%
      arrange(season)

    for (i in 1:(nrow(player_data) - 1)) {
      history <- player_data[max(1, i-2):i, ]
      target <- player_data[i + 1, ]

      if (nrow(history) > 0) {
        # Weighted averages
        weights <- c(3, 4, 5)[seq_len(nrow(history))]
        weights <- weights / sum(weights)

        features <- list(
          AVG_weighted = sum(history$AVG * weights),
          OBP_weighted = sum(history$OBP * weights),
          SLG_weighted = sum(history$SLG * weights),
          HR_weighted = sum(history$HR * weights),
          K_pct_weighted = sum(history$K_pct * weights),
          BB_pct_weighted = sum(history$BB_pct * weights),

          AVG_last = tail(history$AVG, 1),
          OBP_last = tail(history$OBP, 1),
          SLG_last = tail(history$SLG, 1),

          age = target$Age,
          age_squared = target$Age^2,

          pa_last = tail(history$PA, 1),

          target_AVG = target$AVG,
          target_OBP = target$OBP,
          target_SLG = target$SLG,
          target_HR = target$HR,

          target_season = target$season,
          name = player_name
        )

        features_list[[length(features_list) + 1]] <- as.data.frame(features)
      }
    }
  }

  bind_rows(features_list)
}

cat("\nCreating ML features...\n")
ml_data <- create_ml_features(data)
cat(sprintf("ML dataset size: %d player-seasons\n", nrow(ml_data)))

# Split data
train_ml <- ml_data %>% filter(target_season < 2023) %>% na.omit()
test_ml <- ml_data %>% filter(target_season == 2023) %>% na.omit()

cat(sprintf("\nTraining samples: %d\n", nrow(train_ml)))
cat(sprintf("Test samples: %d\n", nrow(test_ml)))

# Feature columns
feature_cols <- c("AVG_weighted", "OBP_weighted", "SLG_weighted", "HR_weighted",
                  "K_pct_weighted", "BB_pct_weighted", "AVG_last", "OBP_last",
                  "SLG_last", "age", "age_squared", "pa_last")

target_stats <- c("AVG", "OBP", "SLG", "HR")

# Train Random Forest models
cat("\nTraining Random Forest models...\n")

models <- list()
predictions <- list()

for (stat in target_stats) {
  cat(sprintf("\n  Predicting %s...\n", stat))

  target_col <- paste0("target_", stat)

  # Train Random Forest
  rf_model <- randomForest(
    x = train_ml[, feature_cols],
    y = train_ml[[target_col]],
    ntree = 100,
    mtry = 4,
    importance = TRUE
  )

  # Predictions
  train_pred <- predict(rf_model, train_ml[, feature_cols])
  test_pred <- predict(rf_model, test_ml[, feature_cols])

  # Evaluate
  test_actual <- test_ml[[target_col]]
  rmse <- sqrt(mean((test_actual - test_pred)^2))
  r2 <- cor(test_actual, test_pred)^2

  cat(sprintf("    RMSE: %.4f, R²: %.3f\n", rmse, r2))

  models[[stat]] <- rf_model
  predictions[[stat]] <- list(actual = test_actual, predicted = test_pred)

  # Feature importance
  importance_df <- data.frame(
    feature = rownames(importance(rf_model)),
    importance = importance(rf_model)[, 1]
  ) %>%
    arrange(desc(importance)) %>%
    head(10)

  cat(sprintf("\n    Top 10 Features for %s:\n", stat))
  for (i in 1:nrow(importance_df)) {
    cat(sprintf("      %s: %.4f\n", importance_df$feature[i], importance_df$importance[i]))
  }
}

# ============================================================================
# 6. VISUALIZATION
# ============================================================================

cat("\n6. Creating prediction visualizations...\n")

# Predictions vs Actuals
plots <- list()

for (stat in target_stats) {
  df_plot <- data.frame(
    actual = predictions[[stat]]$actual,
    predicted = predictions[[stat]]$predicted
  )

  rmse <- sqrt(mean((df_plot$actual - df_plot$predicted)^2))
  r2 <- cor(df_plot$actual, df_plot$predicted)^2

  p <- ggplot(df_plot, aes(x = actual, y = predicted)) +
    geom_point(alpha = 0.6, size = 3, color = "steelblue") +
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1) +
    labs(
      title = sprintf("%s Predictions (2023)", stat),
      subtitle = sprintf("RMSE: %.3f, R²: %.3f", rmse, r2),
      x = sprintf("Actual %s", stat),
      y = sprintf("Predicted %s", stat)
    ) +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold"))

  plots[[stat]] <- p
}

combined_preds <- grid.arrange(grobs = plots, ncol = 2)
ggsave("prediction_accuracy_r.png", combined_preds, width = 12, height = 10, dpi = 300)

# ============================================================================
# 7. CROSS-VALIDATION
# ============================================================================

cat("\n7. Performing time-series cross-validation...\n")

cv_results <- data.frame()

for (test_year in 2021:2023) {
  cat(sprintf("\n  Testing on %d...\n", test_year))

  train_cv <- ml_data %>% filter(target_season < test_year) %>% na.omit()
  test_cv <- ml_data %>% filter(target_season == test_year) %>% na.omit()

  if (nrow(train_cv) == 0 || nrow(test_cv) == 0) next

  for (stat in c("AVG", "OBP", "SLG", "HR")) {
    target_col <- paste0("target_", stat)

    model_cv <- randomForest(
      x = train_cv[, feature_cols],
      y = train_cv[[target_col]],
      ntree = 100
    )

    pred_cv <- predict(model_cv, test_cv[, feature_cols])
    actual_cv <- test_cv[[target_col]]

    rmse <- sqrt(mean((actual_cv - pred_cv)^2))
    r2 <- cor(actual_cv, pred_cv)^2

    cv_results <- bind_rows(
      cv_results,
      data.frame(
        test_year = test_year,
        stat = stat,
        rmse = rmse,
        r2 = r2
      )
    )

    cat(sprintf("    %s: RMSE=%.4f, R²=%.3f\n", stat, rmse, r2))
  }
}

cat("\n8. Cross-Validation Summary:\n")
cv_summary <- cv_results %>%
  group_by(stat) %>%
  summarise(
    mean_rmse = mean(rmse),
    mean_r2 = mean(r2)
  )
print(cv_summary)

cat("\n", rep("=", 60), "\n", sep="")
cat("Player Projection System Complete!\n\n")
cat("Key Findings:\n")
cat("1. Marcel provides a strong baseline with simple methods\n")
cat("2. Aging curves differ by skill component\n")
cat("3. Machine learning improves accuracy modestly\n")
cat("4. Cross-validation is essential for validation\n")

Best Practices for Player Projections

Use component stats: Project K%, BB%, ISO, BABIP separately, not just AVG/HR
Weight recent data more: Last season matters more than three years ago
Apply aging curves at component level: Different skills age differently
Regress extreme performances: .400 BABIP will decline; add league-average PA
Account for playing time risk: Injury-prone players need conservative PT projections
Use out-of-sample testing: Always validate on future seasons, not same data
Provide uncertainty estimates: Prediction intervals, not just point estimates
Update in-season: Rest-of-season projections should incorporate current performance
Compare to baselines: Marcel is a strong baseline; beat it before claiming success
Be transparent about limitations: Projections are educated guesses with wide error bars

Common Pitfalls to Avoid

Overfitting to small samples: 50 PA is not predictive of full-season talent
Ignoring aging: Projecting 35-year-olds like they're 27 leads to overestimation
Neglecting regression: Career .250 hitters who hit .320 won't repeat
Training on future data: Accidentally including test season in training causes data leakage
Treating projections as facts: They're probabilistic estimates with substantial uncertainty
Using wrong sample: Only evaluate projections for players with sufficient playing time
Comparing full-season projections to partial seasons: Pro-rate appropriately
Forgetting context: Ballpark and league matter for counting stats

Resources for Further Learning

FanGraphs Projection Systems: FanGraphs Projections
Tom Tango's Marcel: Marcel the Monkey
The Book (Tango, Lichtman, Dolphin): Chapter on aging curves
FiveThirtyEight CARMELO (NBA analogue): Inspiration for Bayesian projection methods
Baseball Prospectus PECOTA: Similarity-based projection system
Steamer Projections: Steamer
ZiPS Projections: ZiPS

Key Takeaways

Start simple: Marcel demonstrates that simple, principled methods work well
Components matter: Project K%, BB%, ISO, BABIP separately for better accuracy
Aging is real: Apply age-specific adjustments, especially after 30
Regression is essential: Small samples and extreme outcomes need heavy regression
Ensembles improve accuracy: Combining models often beats individual approaches
Uncertainty is inherent: Even the best projections have wide confidence intervals
Validation is critical: Use time-series cross-validation on out-of-sample seasons
Compare to actuals systematically: Track RMSE, MAE, R² across multiple seasons
In-season updates matter: Rest-of-season projections should incorporate current data
Transparency builds trust: Document methodology, limitations, and uncertainty

Remember: Player projections are probabilistic forecasts, not certainties. Even the best systems have substantial error rates. Use them as a starting point for evaluation, not an ending point. The goal is to be less wrong than naive baselines like "last year's stats," not to be perfectly accurate.

Introduction to ML in Baseball Previous

Pitch Classification with ML Next

Discussion

Have questions or feedback? Join our community discussion on Discord or GitHub Discussions.

Table of Contents

Player Performance Prediction