Player Performance Prediction

Advanced 10 min read 0 views Nov 26, 2025

Player Performance Prediction in Baseball

Player performance prediction represents one of the most challenging and valuable applications of baseball analytics. Unlike evaluating past performance, projection systems must account for aging curves, regression to the mean, injury risk, sample size limitations, and the inherent randomness of baseball. This tutorial explores the methodologies behind modern projection systems, from simple aging-based approaches to sophisticated ensemble methods that power fantasy baseball and front office decision-making.

The stakes are high: teams use projections to evaluate trades, set arbitration cases, and plan roster construction. Fantasy players draft based on projections. Analysts use them as baselines for player evaluation. Understanding how projection systems work—and their limitations—is essential for anyone engaged in serious baseball analysis.

We'll explore the foundational Marcel system, examine professional systems like ZiPS and Steamer, and build our own projection models using modern machine learning techniques. Along the way, we'll discuss uncertainty quantification, cross-validation strategies specific to player projections, and how to compare predictions to actual outcomes.

The Fundamentals: Why Prediction is Hard

Baseball presents unique challenges for prediction:

  • Small Sample Sizes: Even a full season provides only 600 plate appearances—small for statistical inference
  • Regression to the Mean: Extreme performances tend to move toward league average
  • Aging Effects: Players improve until their late 20s, then gradually decline
  • Injury Risk: Unpredictable injuries can derail projections
  • Context Changes: Ballpark, league, and lineup context affect outcomes
  • Skill Changes: Players develop new skills or lose existing ones
  • Random Variation: Baseball has substantial luck—BABIP, HR/FB rates fluctuate

Marcel the Monkey: The Simplest Projection System

Created by Tom Tango, Marcel the Monkey demonstrates that a simple, transparent system can be surprisingly accurate. Marcel uses only three inputs: the player's previous three seasons of data, weighted by recency, with regression to the mean and an aging adjustment. No complex statistics, no machine learning—just sound statistical principles.

Marcel's Algorithm:

  1. Weight last three seasons: 5 (last year), 4 (two years ago), 3 (three years ago)
  2. Regress to the mean: 1200 PA of league average added
  3. Age adjustment: Peak at 29, decline 0.5% per year after
  4. Playing time projection: Conservative estimate based on past PA

Professional Projection Systems

ZiPS (Szymborski Projection System)

Created by Dan Szymborski, ZiPS uses a more sophisticated approach than Marcel:

  • Weighted historical data with more complex aging curves
  • Similarity scores to compare players to historical comparables
  • Component-based approach (K%, BB%, ISO, BABIP)
  • Incorporates minor league translations
  • Accounts for injury history and playing time risk

Steamer Projections

Developed by Dash Davidson and Jared Cross, Steamer emphasizes:

  • Five-year weighted history
  • Granular aging curves by player type
  • Platoon adjustments
  • Ballpark factors
  • Extensive regression testing and validation

THE BAT (The BiAS Adjuster for Two-way players)

A modern system by Derek Carty that uses Bayesian statistics:

  • Bayesian updating of priors based on new information
  • Advanced treatment of outliers and sustainability
  • Statcast integration for batted ball data
  • Sophisticated playing time algorithms

Feature Selection for Projection Models

Selecting the right features is crucial for accurate projections:

Feature TypeExamplesPredictive Value
Rate Stats (Last 3 Years)AVG, OBP, SLG, ISOHigh
Component StatsK%, BB%, BABIP, HR/FBVery High
Statcast MetricsExit Velocity, Launch Angle, Sprint SpeedHigh
AgeCurrent Age, Age²Very High
Historical Playing TimePA per season trendMedium
Batted Ball ProfileGB%, FB%, LD%Medium
Plate DisciplineO-Swing%, Z-Contact%, SwStr%High
PositionDefensive positionLow (for offensive stats)

Aging Curves: Modeling Peak Performance

Understanding aging curves is fundamental to projections. Research by Jeff Zimmerman, Mitchel Lichtman, and others has shown:

  • Power peaks early: HR/FB rates peak around age 26-27
  • Contact skills age well: K% increases gradually after 30
  • Speed declines linearly: SB and defense decline steadily
  • Plate discipline improves: BB% often increases into early 30s
  • BABIP declines: Speed loss affects infield hits

Aging curves should be applied at the component level rather than to aggregate stats like batting average or OPS, as different skills age differently.

Regression to the Mean

Regression to the mean is the statistical phenomenon where extreme outcomes tend to be followed by more moderate ones. In baseball:

  • A .350 hitter is unlikely to repeat that average
  • A player with 5 HR in 50 PA probably won't maintain 162-game pace of 60+ HR
  • Extreme K% or BB% rates tend to move toward career norms

How to Apply Regression:

The standard approach is to add "dummy" plate appearances of league-average performance. For Marcel, this is 1200 PA. The more extreme the sample, the more regression is needed. Bayesian approaches provide a more sophisticated framework for this.

Ensemble Methods for Player Projections

Modern machine learning enables ensemble approaches that combine multiple models:

  • Random Forests: Ensemble of decision trees that handles non-linear relationships
  • Gradient Boosting: Sequential trees that correct previous errors (XGBoost, LightGBM)
  • Stacking: Combine predictions from Marcel, similarity scores, and ML models
  • Neural Networks: Deep learning for complex feature interactions

The advantage: ensembles often outperform individual models by capturing different aspects of player performance.

Uncertainty Quantification

Point estimates without uncertainty are dangerous. A projection of 25 HR could mean:

  • 90% confidence interval of 20-30 HR (narrow, reliable)
  • 90% confidence interval of 10-40 HR (wide, uncertain)

Methods for Uncertainty:

  • Bootstrap resampling: Resample historical data to generate distribution
  • Monte Carlo simulation: Simulate thousands of seasons
  • Quantile regression: Predict 10th, 50th, 90th percentiles directly
  • Bayesian credible intervals: Natural uncertainty from Bayesian models

Comparing Predictions to Actuals

Evaluating projection systems requires careful methodology:

MetricDescriptionInterpretation
RMSERoot Mean Squared ErrorLower is better, penalizes large errors
MAEMean Absolute ErrorAverage magnitude of errors
Correlation coefficient squaredProportion of variance explained
BiasMean (Actual - Predicted)Systematic over/under prediction
Coverage% of actuals within prediction intervalShould match confidence level

Important Considerations:

  • Only evaluate players with sufficient playing time (e.g., 200+ PA)
  • Account for pro-rated projections vs full-season actuals
  • Test on out-of-sample data to avoid overfitting
  • Compare to baseline (Marcel, last year's stats)

Seasonal vs Career Projections

Rest-of-Season Projections:

  • Update during season with current year's performance
  • Weight recent data more heavily
  • Account for platoon splits, matchups
  • Useful for trades, DFS, in-season decisions

Career Projections:

  • Longer time horizon (3-5 years)
  • Higher uncertainty
  • Aging curves more important
  • Used for contract negotiations, dynasty leagues

Python Implementation: Building a Projection System


"""
Player Performance Prediction System
Comprehensive projection model with aging curves, regression, and uncertainty
"""

import pybaseball as pyb
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

pyb.cache.enable()

print("Player Performance Prediction System")
print("=" * 60)

# ============================================================================
# 1. DATA PREPARATION
# ============================================================================

print("\n1. Fetching historical player data...")

# Get multiple years of data
years = range(2018, 2024)
all_data = []

for year in years:
    print(f"  Fetching {year}...")
    df = pyb.batting_stats(year, qual=200)
    df['season'] = year
    all_data.append(df)

data = pd.concat(all_data, ignore_index=True)
print(f"\nTotal player-seasons: {len(data)}")

# ============================================================================
# 2. FEATURE ENGINEERING
# ============================================================================

print("\n2. Engineering predictive features...")

# Calculate key metrics if not present
data['ISO'] = data['SLG'] - data['AVG']
data['BABIP'] = (data['H'] - data['HR']) / (data['AB'] - data['SO'] - data['HR'] + data['SF'])
data['HR_per_PA'] = data['HR'] / data['PA']
data['K_pct'] = data['SO'] / data['PA'] * 100
data['BB_pct'] = data['BB'] / data['PA'] * 100

# ============================================================================
# 3. MARCEL-STYLE SIMPLE PROJECTION
# ============================================================================

print("\n3. Building Marcel-style projection system...")

def marcel_projection(player_df, stat_cols, league_avg_df):
    """
    Simple Marcel projection for a single player

    Weights: 5 (last year), 4 (2 years ago), 3 (3 years ago)
    Regression: Add 1200 PA of league average
    Age adjustment: -0.5% per year after 29
    """
    player_df = player_df.sort_values('season')

    if len(player_df) == 0:
        return None

    projections = {}

    for stat in stat_cols:
        # Weight last 3 years
        weights = [3, 4, 5]  # oldest to newest
        weighted_vals = []

        for i, (_, row) in enumerate(player_df.tail(3).iterrows()):
            weight = weights[i] if i < len(weights) else 5
            weighted_vals.append(row[stat] * weight * row['PA'])

        total_weighted_pa = sum([row['PA'] * weights[i] if i < len(weights) else row['PA'] * 5
                                for i, (_, row) in enumerate(player_df.tail(3).iterrows())])

        # Add regression PA
        regression_pa = 1200
        league_avg = league_avg_df[stat].mean()

        weighted_stat = (sum(weighted_vals) + league_avg * regression_pa) / (total_weighted_pa + regression_pa)

        # Age adjustment (assuming peak at 29)
        current_age = player_df.iloc[-1].get('Age', 29)
        if current_age > 29:
            age_decline = 0.005 * (current_age - 29)  # 0.5% per year
            weighted_stat *= (1 - age_decline)

        projections[stat] = weighted_stat

    return projections

# Apply Marcel to 2023 season using 2020-2022 data
train_data = data[data['season'] < 2023]
test_data = data[data['season'] == 2023]

league_avg = train_data.groupby('season').mean()

projection_stats = ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct']
marcel_results = []

print("\nGenerating Marcel projections for 2023...")
for name in test_data['Name'].unique():
    player_history = train_data[train_data['Name'] == name]
    if len(player_history) > 0:
        proj = marcel_projection(player_history, projection_stats, league_avg)
        if proj:
            proj['Name'] = name
            marcel_results.append(proj)

marcel_df = pd.DataFrame(marcel_results)
print(f"Generated {len(marcel_df)} Marcel projections")

# Evaluate Marcel
marcel_eval = marcel_df.merge(test_data[['Name'] + projection_stats],
                              on='Name', suffixes=('_proj', '_actual'))

print("\nMarcel Projection Accuracy:")
for stat in projection_stats:
    if f'{stat}_proj' in marcel_eval.columns and f'{stat}_actual' in marcel_eval.columns:
        actual = marcel_eval[f'{stat}_actual'].dropna()
        pred = marcel_eval[f'{stat}_proj'].dropna()
        if len(actual) > 0 and len(pred) > 0:
            # Align lengths
            min_len = min(len(actual), len(pred))
            rmse = np.sqrt(mean_squared_error(actual[:min_len], pred[:min_len]))
            mae = mean_absolute_error(actual[:min_len], pred[:min_len])
            r2 = r2_score(actual[:min_len], pred[:min_len])
            print(f"  {stat}: RMSE={rmse:.4f}, MAE={mae:.4f}, R²={r2:.3f}")

# ============================================================================
# 4. AGING CURVES
# ============================================================================

print("\n4. Calculating aging curves...")

# Group by age and calculate mean performance
aging_data = data.groupby('Age').agg({
    'HR_per_PA': 'mean',
    'K_pct': 'mean',
    'BB_pct': 'mean',
    'ISO': 'mean',
    'BABIP': 'mean',
    'Name': 'count'
}).rename(columns={'Name': 'player_count'})

aging_data = aging_data[aging_data['player_count'] >= 10]  # Minimum sample size

print("\nAging Curve Summary:")
print(aging_data.round(3))

# Visualize aging curves
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
metrics = ['HR_per_PA', 'K_pct', 'BB_pct', 'ISO', 'BABIP']
titles = ['HR per PA', 'Strikeout %', 'Walk %', 'ISO', 'BABIP']

for i, (metric, title) in enumerate(zip(metrics, titles)):
    ax = axes[i // 3, i % 3]
    ax.plot(aging_data.index, aging_data[metric], marker='o', linewidth=2)
    ax.axvline(29, color='red', linestyle='--', alpha=0.5, label='Peak Age (29)')
    ax.set_xlabel('Age')
    ax.set_ylabel(title)
    ax.set_title(f'Aging Curve: {title}')
    ax.grid(True, alpha=0.3)
    ax.legend()

# Player count
ax = axes[1, 2]
ax.bar(aging_data.index, aging_data['player_count'], color='steelblue', edgecolor='black')
ax.set_xlabel('Age')
ax.set_ylabel('Number of Players')
ax.set_title('Sample Size by Age')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300)
plt.show()

# ============================================================================
# 5. MACHINE LEARNING PROJECTION MODEL
# ============================================================================

print("\n5. Building machine learning projection models...")

# Prepare features for ML model
# Goal: Predict next season stats using previous 3 seasons

def create_ml_features(df):
    """Create lagged features for prediction"""
    features_list = []

    for name in df['Name'].unique():
        player_data = df[df['Name'] == name].sort_values('season')

        for i in range(len(player_data) - 1):
            # Use data up to season i to predict season i+1
            history = player_data.iloc[max(0, i-2):i+1]
            target = player_data.iloc[i+1]

            if len(history) > 0:
                features = {}

                # Weighted historical stats (most recent = higher weight)
                weights = np.array([3, 4, 5])[-len(history):]
                weights = weights / weights.sum()

                for stat in ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct', 'ISO', 'BABIP']:
                    features[f'{stat}_weighted'] = (history[stat] * weights).sum()
                    features[f'{stat}_last'] = history.iloc[-1][stat]
                    if len(history) >= 2:
                        features[f'{stat}_trend'] = history.iloc[-1][stat] - history.iloc[-2][stat]
                    else:
                        features[f'{stat}_trend'] = 0

                # Age
                features['age'] = target['Age']
                features['age_squared'] = target['Age'] ** 2

                # Playing time trend
                features['pa_last'] = history.iloc[-1]['PA']
                features['pa_trend'] = history['PA'].diff().mean() if len(history) >= 2 else 0

                # Target variables
                for stat in ['AVG', 'OBP', 'SLG', 'HR', 'K_pct', 'BB_pct']:
                    features[f'target_{stat}'] = target[stat]

                features['target_season'] = target['season']
                features['name'] = name

                features_list.append(features)

    return pd.DataFrame(features_list)

print("\nCreating ML features...")
ml_data = create_ml_features(data)
print(f"ML dataset size: {len(ml_data)} player-seasons")

# Prepare train/test split (use 2023 as test)
train_ml = ml_data[ml_data['target_season'] < 2023]
test_ml = ml_data[ml_data['target_season'] == 2023]

feature_cols = [col for col in ml_data.columns
                if col not in ['name', 'target_season'] and not col.startswith('target_')]
target_stats = ['AVG', 'OBP', 'SLG', 'HR']

# Remove any NaN values
train_ml = train_ml.dropna()
test_ml = test_ml.dropna()

print(f"\nTraining samples: {len(train_ml)}")
print(f"Test samples: {len(test_ml)}")

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(train_ml[feature_cols])
X_test = scaler.transform(test_ml[feature_cols])

# Train models for each target stat
models = {}
predictions = {}

print("\nTraining Random Forest models...")
for target_stat in target_stats:
    print(f"\n  Predicting {target_stat}...")

    y_train = train_ml[f'target_{target_stat}']
    y_test = test_ml[f'target_{target_stat}']

    # Random Forest
    rf_model = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=20,
        random_state=42,
        n_jobs=-1
    )
    rf_model.fit(X_train, y_train)

    # Gradient Boosting
    gb_model = GradientBoostingRegressor(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.1,
        random_state=42
    )
    gb_model.fit(X_train, y_train)

    # Predictions
    rf_pred = rf_model.predict(X_test)
    gb_pred = gb_model.predict(X_test)

    # Ensemble: average RF and GB
    ensemble_pred = (rf_pred + gb_pred) / 2

    # Evaluate
    rmse_rf = np.sqrt(mean_squared_error(y_test, rf_pred))
    rmse_gb = np.sqrt(mean_squared_error(y_test, gb_pred))
    rmse_ens = np.sqrt(mean_squared_error(y_test, ensemble_pred))

    r2_rf = r2_score(y_test, rf_pred)
    r2_gb = r2_score(y_test, gb_pred)
    r2_ens = r2_score(y_test, ensemble_pred)

    print(f"    Random Forest - RMSE: {rmse_rf:.4f}, R²: {r2_rf:.3f}")
    print(f"    Gradient Boost - RMSE: {rmse_gb:.4f}, R²: {r2_gb:.3f}")
    print(f"    Ensemble - RMSE: {rmse_ens:.4f}, R²: {r2_ens:.3f}")

    models[target_stat] = {'rf': rf_model, 'gb': gb_model}
    predictions[target_stat] = {
        'actual': y_test.values,
        'rf': rf_pred,
        'gb': gb_pred,
        'ensemble': ensemble_pred
    }

    # Feature importance (Random Forest)
    importances = pd.DataFrame({
        'feature': feature_cols,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False).head(10)

    print(f"\n    Top 10 Features for {target_stat}:")
    for _, row in importances.iterrows():
        print(f"      {row['feature']}: {row['importance']:.4f}")

# ============================================================================
# 6. UNCERTAINTY QUANTIFICATION
# ============================================================================

print("\n6. Quantifying prediction uncertainty...")

def bootstrap_prediction_interval(model, X, n_bootstraps=100, confidence=0.90):
    """Generate prediction intervals via bootstrap"""
    predictions = []

    for _ in range(n_bootstraps):
        # Resample training data indices
        indices = np.random.choice(len(X), len(X), replace=True)
        X_boot = X[indices]

        # Predict
        pred = model.predict(X_boot)
        predictions.append(pred.mean())

    predictions = np.array(predictions)
    lower = np.percentile(predictions, (1 - confidence) / 2 * 100)
    upper = np.percentile(predictions, (1 + confidence) / 2 * 100)

    return lower, upper

# Calculate prediction intervals for HR
print("\nCalculating 90% prediction intervals for HR...")
target = 'HR'
model = models[target]['ensemble'] if target in models else models[target]['rf']

sample_players = test_ml.sample(min(10, len(test_ml)), random_state=42)

for idx, row in sample_players.iterrows():
    player_name = row['name']
    actual = row[f'target_{target}']

    # Point prediction
    X_player = scaler.transform([row[feature_cols]])
    point_pred = model.predict(X_player)[0]

    # Simple approximation: use RMSE to estimate interval
    rmse = np.sqrt(mean_squared_error(
        test_ml[f'target_{target}'],
        model.predict(X_test)
    ))

    lower = point_pred - 1.645 * rmse  # 90% CI
    upper = point_pred + 1.645 * rmse

    print(f"  {player_name}: Actual={actual:.0f}, Pred={point_pred:.0f}, "
          f"90% CI=[{lower:.0f}, {upper:.0f}]")

# ============================================================================
# 7. VISUALIZATION: PREDICTIONS VS ACTUALS
# ============================================================================

print("\n7. Creating prediction visualizations...")

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stat in enumerate(target_stats):
    ax = axes[i // 2, i % 2]

    actual = predictions[stat]['actual']
    pred = predictions[stat]['ensemble']

    # Scatter plot
    ax.scatter(actual, pred, alpha=0.6, s=50)

    # Perfect prediction line
    min_val = min(actual.min(), pred.min())
    max_val = max(actual.max(), pred.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(actual, pred))
    r2 = r2_score(actual, pred)

    ax.set_xlabel(f'Actual {stat}', fontsize=12)
    ax.set_ylabel(f'Predicted {stat}', fontsize=12)
    ax.set_title(f'{stat} Predictions (2023)\nRMSE: {rmse:.3f}, R²: {r2:.3f}', fontsize=12)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('prediction_accuracy.png', dpi=300)
plt.show()

# Residual analysis
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stat in enumerate(target_stats):
    ax = axes[i // 2, i % 2]

    actual = predictions[stat]['actual']
    pred = predictions[stat]['ensemble']
    residuals = actual - pred

    ax.hist(residuals, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
    ax.axvline(0, color='red', linestyle='--', linewidth=2)
    ax.set_xlabel('Residual (Actual - Predicted)', fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title(f'{stat} Residuals\nMean: {residuals.mean():.3f}, Std: {residuals.std():.3f}',
                fontsize=12)
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('prediction_residuals.png', dpi=300)
plt.show()

# ============================================================================
# 8. CROSS-VALIDATION FOR PLAYER PROJECTIONS
# ============================================================================

print("\n8. Performing time-series cross-validation...")

# Time-series CV: train on past, predict next season
cv_results = []

for test_year in range(2021, 2024):
    print(f"\n  Testing on {test_year}...")

    train_cv = ml_data[ml_data['target_season'] < test_year].dropna()
    test_cv = ml_data[ml_data['target_season'] == test_year].dropna()

    if len(train_cv) == 0 or len(test_cv) == 0:
        continue

    X_train_cv = scaler.fit_transform(train_cv[feature_cols])
    X_test_cv = scaler.transform(test_cv[feature_cols])

    for stat in ['AVG', 'OBP', 'SLG', 'HR']:
        y_train_cv = train_cv[f'target_{stat}']
        y_test_cv = test_cv[f'target_{stat}']

        model_cv = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
        model_cv.fit(X_train_cv, y_train_cv)

        pred_cv = model_cv.predict(X_test_cv)

        rmse = np.sqrt(mean_squared_error(y_test_cv, pred_cv))
        r2 = r2_score(y_test_cv, pred_cv)

        cv_results.append({
            'test_year': test_year,
            'stat': stat,
            'rmse': rmse,
            'r2': r2
        })

        print(f"    {stat}: RMSE={rmse:.4f}, R²={r2:.3f}")

cv_df = pd.DataFrame(cv_results)

print("\n9. Cross-Validation Summary:")
summary = cv_df.groupby('stat').agg({'rmse': 'mean', 'r2': 'mean'})
print(summary)

print("\n" + "=" * 60)
print("Player Projection System Complete!")
print("\nKey Findings:")
print("1. Marcel provides a strong baseline with simple, interpretable methods")
print("2. Aging curves show different patterns for power, contact, and speed")
print("3. Machine learning ensembles improve upon Marcel slightly")
print("4. Prediction intervals are wide—uncertainty is inherent")
print("5. Time-series CV is essential to avoid overfitting")
    

R Implementation: Player Projection System


# Player Performance Prediction System in R
# Comprehensive projection model with aging curves and validation

library(baseballr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(randomForest)
library(caret)
library(gridExtra)

cat("Player Performance Prediction System\n")
cat(rep("=", 60), "\n", sep="")

# ============================================================================
# 1. DATA PREPARATION
# ============================================================================

cat("\n1. Fetching historical player data...\n")

# Get multiple years of data
years <- 2018:2023
all_data <- list()

for (year in years) {
  cat(sprintf("  Fetching %d...\n", year))
  df <- fg_batter_leaders(year, year, qual = 200)
  df$season <- year
  all_data[[as.character(year)]] <- df
}

data <- bind_rows(all_data)
cat(sprintf("\nTotal player-seasons: %d\n", nrow(data)))

# ============================================================================
# 2. FEATURE ENGINEERING
# ============================================================================

cat("\n2. Engineering predictive features...\n")

data <- data %>%
  mutate(
    ISO = SLG - AVG,
    BABIP = (H - HR) / (AB - SO - HR + SF),
    HR_per_PA = HR / PA,
    K_pct = SO / PA * 100,
    BB_pct = BB / PA * 100
  )

# ============================================================================
# 3. MARCEL-STYLE PROJECTION
# ============================================================================

cat("\n3. Building Marcel-style projection system...\n")

marcel_projection <- function(player_df, stat_col, league_avg) {
  # Sort by season
  player_df <- player_df %>% arrange(season)

  # Weight last 3 years: 3, 4, 5 (oldest to newest)
  if (nrow(player_df) == 0) return(NA)

  recent <- tail(player_df, 3)
  weights <- c(3, 4, 5)[1:nrow(recent)]

  # Weighted average
  weighted_stat <- sum(recent[[stat_col]] * weights * recent$PA) /
                   sum(weights * recent$PA)

  # Add regression (1200 PA of league average)
  regression_pa <- 1200
  total_pa <- sum(weights * recent$PA) + regression_pa

  regressed_stat <- (sum(recent[[stat_col]] * weights * recent$PA) +
                     league_avg * regression_pa) / total_pa

  # Age adjustment (assuming peak at 29)
  current_age <- tail(player_df$Age, 1)
  if (current_age > 29) {
    age_decline <- 0.005 * (current_age - 29)
    regressed_stat <- regressed_stat * (1 - age_decline)
  }

  return(regressed_stat)
}

# Apply Marcel to 2023 season using 2020-2022 data
train_data <- data %>% filter(season < 2023)
test_data <- data %>% filter(season == 2023)

league_avg <- train_data %>%
  group_by(season) %>%
  summarise(
    AVG = mean(AVG, na.rm = TRUE),
    OBP = mean(OBP, na.rm = TRUE),
    SLG = mean(SLG, na.rm = TRUE),
    HR = mean(HR, na.rm = TRUE)
  ) %>%
  summarise(across(where(is.numeric), mean))

projection_stats <- c("AVG", "OBP", "SLG", "HR")
marcel_results <- data.frame()

cat("\nGenerating Marcel projections for 2023...\n")

for (player_name in unique(test_data$Name)) {
  player_history <- train_data %>% filter(Name == player_name)

  if (nrow(player_history) > 0) {
    proj_row <- data.frame(Name = player_name)

    for (stat in projection_stats) {
      proj_value <- marcel_projection(player_history, stat, league_avg[[stat]])
      proj_row[[paste0(stat, "_proj")]] <- proj_value
    }

    marcel_results <- bind_rows(marcel_results, proj_row)
  }
}

cat(sprintf("Generated %d Marcel projections\n", nrow(marcel_results)))

# Evaluate Marcel
marcel_eval <- marcel_results %>%
  inner_join(test_data %>% select(Name, all_of(projection_stats)),
            by = "Name", suffix = c("_proj", "_actual"))

cat("\nMarcel Projection Accuracy:\n")
for (stat in projection_stats) {
  proj_col <- paste0(stat, "_proj")
  actual_col <- stat

  if (proj_col %in% names(marcel_eval) && actual_col %in% names(marcel_eval)) {
    eval_subset <- marcel_eval %>%
      filter(!is.na(!!sym(proj_col)), !is.na(!!sym(actual_col)))

    if (nrow(eval_subset) > 0) {
      rmse <- sqrt(mean((eval_subset[[actual_col]] - eval_subset[[proj_col]])^2))
      mae <- mean(abs(eval_subset[[actual_col]] - eval_subset[[proj_col]]))
      r2 <- cor(eval_subset[[actual_col]], eval_subset[[proj_col]])^2

      cat(sprintf("  %s: RMSE=%.4f, MAE=%.4f, R²=%.3f\n", stat, rmse, mae, r2))
    }
  }
}

# ============================================================================
# 4. AGING CURVES
# ============================================================================

cat("\n4. Calculating aging curves...\n")

aging_data <- data %>%
  group_by(Age) %>%
  summarise(
    HR_per_PA = mean(HR_per_PA, na.rm = TRUE),
    K_pct = mean(K_pct, na.rm = TRUE),
    BB_pct = mean(BB_pct, na.rm = TRUE),
    ISO = mean(ISO, na.rm = TRUE),
    BABIP = mean(BABIP, na.rm = TRUE),
    player_count = n()
  ) %>%
  filter(player_count >= 10)

cat("\nAging Curve Summary:\n")
print(aging_data)

# Visualize aging curves
p1 <- ggplot(aging_data, aes(x = Age, y = HR_per_PA)) +
  geom_line(size = 1.5, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: HR per PA", x = "Age", y = "HR per PA") +
  theme_minimal()

p2 <- ggplot(aging_data, aes(x = Age, y = K_pct)) +
  geom_line(size = 1.5, color = "darkred") +
  geom_point(size = 3, color = "darkred") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: Strikeout %", x = "Age", y = "K%") +
  theme_minimal()

p3 <- ggplot(aging_data, aes(x = Age, y = BB_pct)) +
  geom_line(size = 1.5, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: Walk %", x = "Age", y = "BB%") +
  theme_minimal()

p4 <- ggplot(aging_data, aes(x = Age, y = ISO)) +
  geom_line(size = 1.5, color = "purple") +
  geom_point(size = 3, color = "purple") +
  geom_vline(xintercept = 29, color = "red", linetype = "dashed", alpha = 0.5) +
  labs(title = "Aging Curve: ISO", x = "Age", y = "ISO") +
  theme_minimal()

combined <- grid.arrange(p1, p2, p3, p4, ncol = 2)
ggsave("aging_curves_r.png", combined, width = 12, height = 10, dpi = 300)

# ============================================================================
# 5. MACHINE LEARNING PROJECTION MODEL
# ============================================================================

cat("\n5. Building machine learning projection models...\n")

# Create lagged features
create_ml_features <- function(df) {
  features_list <- list()

  for (player_name in unique(df$Name)) {
    player_data <- df %>%
      filter(Name == player_name) %>%
      arrange(season)

    for (i in 1:(nrow(player_data) - 1)) {
      history <- player_data[max(1, i-2):i, ]
      target <- player_data[i + 1, ]

      if (nrow(history) > 0) {
        # Weighted averages
        weights <- c(3, 4, 5)[seq_len(nrow(history))]
        weights <- weights / sum(weights)

        features <- list(
          AVG_weighted = sum(history$AVG * weights),
          OBP_weighted = sum(history$OBP * weights),
          SLG_weighted = sum(history$SLG * weights),
          HR_weighted = sum(history$HR * weights),
          K_pct_weighted = sum(history$K_pct * weights),
          BB_pct_weighted = sum(history$BB_pct * weights),

          AVG_last = tail(history$AVG, 1),
          OBP_last = tail(history$OBP, 1),
          SLG_last = tail(history$SLG, 1),

          age = target$Age,
          age_squared = target$Age^2,

          pa_last = tail(history$PA, 1),

          target_AVG = target$AVG,
          target_OBP = target$OBP,
          target_SLG = target$SLG,
          target_HR = target$HR,

          target_season = target$season,
          name = player_name
        )

        features_list[[length(features_list) + 1]] <- as.data.frame(features)
      }
    }
  }

  bind_rows(features_list)
}

cat("\nCreating ML features...\n")
ml_data <- create_ml_features(data)
cat(sprintf("ML dataset size: %d player-seasons\n", nrow(ml_data)))

# Split data
train_ml <- ml_data %>% filter(target_season < 2023) %>% na.omit()
test_ml <- ml_data %>% filter(target_season == 2023) %>% na.omit()

cat(sprintf("\nTraining samples: %d\n", nrow(train_ml)))
cat(sprintf("Test samples: %d\n", nrow(test_ml)))

# Feature columns
feature_cols <- c("AVG_weighted", "OBP_weighted", "SLG_weighted", "HR_weighted",
                  "K_pct_weighted", "BB_pct_weighted", "AVG_last", "OBP_last",
                  "SLG_last", "age", "age_squared", "pa_last")

target_stats <- c("AVG", "OBP", "SLG", "HR")

# Train Random Forest models
cat("\nTraining Random Forest models...\n")

models <- list()
predictions <- list()

for (stat in target_stats) {
  cat(sprintf("\n  Predicting %s...\n", stat))

  target_col <- paste0("target_", stat)

  # Train Random Forest
  rf_model <- randomForest(
    x = train_ml[, feature_cols],
    y = train_ml[[target_col]],
    ntree = 100,
    mtry = 4,
    importance = TRUE
  )

  # Predictions
  train_pred <- predict(rf_model, train_ml[, feature_cols])
  test_pred <- predict(rf_model, test_ml[, feature_cols])

  # Evaluate
  test_actual <- test_ml[[target_col]]
  rmse <- sqrt(mean((test_actual - test_pred)^2))
  r2 <- cor(test_actual, test_pred)^2

  cat(sprintf("    RMSE: %.4f, R²: %.3f\n", rmse, r2))

  models[[stat]] <- rf_model
  predictions[[stat]] <- list(actual = test_actual, predicted = test_pred)

  # Feature importance
  importance_df <- data.frame(
    feature = rownames(importance(rf_model)),
    importance = importance(rf_model)[, 1]
  ) %>%
    arrange(desc(importance)) %>%
    head(10)

  cat(sprintf("\n    Top 10 Features for %s:\n", stat))
  for (i in 1:nrow(importance_df)) {
    cat(sprintf("      %s: %.4f\n", importance_df$feature[i], importance_df$importance[i]))
  }
}

# ============================================================================
# 6. VISUALIZATION
# ============================================================================

cat("\n6. Creating prediction visualizations...\n")

# Predictions vs Actuals
plots <- list()

for (stat in target_stats) {
  df_plot <- data.frame(
    actual = predictions[[stat]]$actual,
    predicted = predictions[[stat]]$predicted
  )

  rmse <- sqrt(mean((df_plot$actual - df_plot$predicted)^2))
  r2 <- cor(df_plot$actual, df_plot$predicted)^2

  p <- ggplot(df_plot, aes(x = actual, y = predicted)) +
    geom_point(alpha = 0.6, size = 3, color = "steelblue") +
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1) +
    labs(
      title = sprintf("%s Predictions (2023)", stat),
      subtitle = sprintf("RMSE: %.3f, R²: %.3f", rmse, r2),
      x = sprintf("Actual %s", stat),
      y = sprintf("Predicted %s", stat)
    ) +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold"))

  plots[[stat]] <- p
}

combined_preds <- grid.arrange(grobs = plots, ncol = 2)
ggsave("prediction_accuracy_r.png", combined_preds, width = 12, height = 10, dpi = 300)

# ============================================================================
# 7. CROSS-VALIDATION
# ============================================================================

cat("\n7. Performing time-series cross-validation...\n")

cv_results <- data.frame()

for (test_year in 2021:2023) {
  cat(sprintf("\n  Testing on %d...\n", test_year))

  train_cv <- ml_data %>% filter(target_season < test_year) %>% na.omit()
  test_cv <- ml_data %>% filter(target_season == test_year) %>% na.omit()

  if (nrow(train_cv) == 0 || nrow(test_cv) == 0) next

  for (stat in c("AVG", "OBP", "SLG", "HR")) {
    target_col <- paste0("target_", stat)

    model_cv <- randomForest(
      x = train_cv[, feature_cols],
      y = train_cv[[target_col]],
      ntree = 100
    )

    pred_cv <- predict(model_cv, test_cv[, feature_cols])
    actual_cv <- test_cv[[target_col]]

    rmse <- sqrt(mean((actual_cv - pred_cv)^2))
    r2 <- cor(actual_cv, pred_cv)^2

    cv_results <- bind_rows(
      cv_results,
      data.frame(
        test_year = test_year,
        stat = stat,
        rmse = rmse,
        r2 = r2
      )
    )

    cat(sprintf("    %s: RMSE=%.4f, R²=%.3f\n", stat, rmse, r2))
  }
}

cat("\n8. Cross-Validation Summary:\n")
cv_summary <- cv_results %>%
  group_by(stat) %>%
  summarise(
    mean_rmse = mean(rmse),
    mean_r2 = mean(r2)
  )
print(cv_summary)

cat("\n", rep("=", 60), "\n", sep="")
cat("Player Projection System Complete!\n\n")
cat("Key Findings:\n")
cat("1. Marcel provides a strong baseline with simple methods\n")
cat("2. Aging curves differ by skill component\n")
cat("3. Machine learning improves accuracy modestly\n")
cat("4. Cross-validation is essential for validation\n")
    

Best Practices for Player Projections

  • Use component stats: Project K%, BB%, ISO, BABIP separately, not just AVG/HR
  • Weight recent data more: Last season matters more than three years ago
  • Apply aging curves at component level: Different skills age differently
  • Regress extreme performances: .400 BABIP will decline; add league-average PA
  • Account for playing time risk: Injury-prone players need conservative PT projections
  • Use out-of-sample testing: Always validate on future seasons, not same data
  • Provide uncertainty estimates: Prediction intervals, not just point estimates
  • Update in-season: Rest-of-season projections should incorporate current performance
  • Compare to baselines: Marcel is a strong baseline; beat it before claiming success
  • Be transparent about limitations: Projections are educated guesses with wide error bars

Common Pitfalls to Avoid

  • Overfitting to small samples: 50 PA is not predictive of full-season talent
  • Ignoring aging: Projecting 35-year-olds like they're 27 leads to overestimation
  • Neglecting regression: Career .250 hitters who hit .320 won't repeat
  • Training on future data: Accidentally including test season in training causes data leakage
  • Treating projections as facts: They're probabilistic estimates with substantial uncertainty
  • Using wrong sample: Only evaluate projections for players with sufficient playing time
  • Comparing full-season projections to partial seasons: Pro-rate appropriately
  • Forgetting context: Ballpark and league matter for counting stats

Resources for Further Learning

  • FanGraphs Projection Systems: FanGraphs Projections
  • Tom Tango's Marcel: Marcel the Monkey
  • The Book (Tango, Lichtman, Dolphin): Chapter on aging curves
  • FiveThirtyEight CARMELO (NBA analogue): Inspiration for Bayesian projection methods
  • Baseball Prospectus PECOTA: Similarity-based projection system
  • Steamer Projections: Steamer
  • ZiPS Projections: ZiPS

Key Takeaways

  • Start simple: Marcel demonstrates that simple, principled methods work well
  • Components matter: Project K%, BB%, ISO, BABIP separately for better accuracy
  • Aging is real: Apply age-specific adjustments, especially after 30
  • Regression is essential: Small samples and extreme outcomes need heavy regression
  • Ensembles improve accuracy: Combining models often beats individual approaches
  • Uncertainty is inherent: Even the best projections have wide confidence intervals
  • Validation is critical: Use time-series cross-validation on out-of-sample seasons
  • Compare to actuals systematically: Track RMSE, MAE, R² across multiple seasons
  • In-season updates matter: Rest-of-season projections should incorporate current data
  • Transparency builds trust: Document methodology, limitations, and uncertainty
Remember: Player projections are probabilistic forecasts, not certainties. Even the best systems have substantial error rates. Use them as a starting point for evaluation, not an ending point. The goal is to be less wrong than naive baselines like "last year's stats," not to be perfectly accurate.

Discussion

Have questions or feedback? Join our community discussion on Discord or GitHub Discussions.