Performance Trend Analysis
Trend Analysis Charts in Baseball Analytics
Trend analysis is fundamental to understanding player performance trajectories, identifying hot and cold streaks, and making data-driven predictions about future performance. Unlike static season-long statistics, trend analysis reveals how performance changes over time, allowing analysts to detect emerging patterns, assess development trajectories, and time roster moves. This guide covers techniques for analyzing and visualizing baseball performance trends: rolling and weighted moving averages, statistical smoothing, streak and changepoint detection, and regression to the mean.
Baseball performance is inherently variable, with randomness and small sample sizes creating noise that can obscure true talent levels. Trend analysis helps separate signal from noise by aggregating data over time windows, smoothing out random variation while preserving meaningful patterns. Understanding when a player is genuinely improving versus experiencing normal variance is critical for player evaluation, trade decisions, and fantasy baseball strategy. Modern trend analysis combines traditional rolling averages with sophisticated statistical methods to provide robust insights into performance dynamics.
Understanding Time Series Analysis in Baseball
Time series analysis treats baseball statistics as sequences of observations ordered chronologically, enabling the application of powerful statistical techniques designed for temporal data. Baseball performance metrics exhibit several time series characteristics: seasonality (performance changes across the season), autocorrelation (today's performance relates to recent performance), and trends (long-term improvement or decline). Proper time series analysis accounts for these characteristics while identifying meaningful patterns.
Key concepts in baseball time series analysis include stationarity (whether statistical properties remain constant over time), trend components (long-term directional movement), seasonal components (predictable periodic patterns), and irregular components (random fluctuations). Most baseball statistics are non-stationary, showing trends as players develop or age. Understanding these components enables analysts to decompose performance into interpretable elements and make more accurate forecasts of future performance.
The challenge in baseball time series analysis is balancing responsiveness to recent changes against stability from larger sample sizes. Short rolling windows (3-7 days) quickly detect changes but are volatile and prone to overreaction. Long windows (30+ days) provide stability but lag behind genuine performance shifts. Sophisticated approaches use adaptive windows, weighted averages, or smoothing algorithms that balance these competing objectives based on the analytical question being addressed.
Rolling Averages and Moving Windows
Rolling averages, also called moving averages, are the foundational tool for baseball trend analysis. They calculate statistics over a sliding window of recent games or plate appearances, providing a smoothed view of performance that filters out game-to-game volatility. Common window sizes include 7-day (approximately one week of games), 14-day (two weeks), and 30-day (one month) periods. Each window size serves different purposes: shorter windows detect recent changes, while longer windows establish more reliable baseline performance levels.
The choice of window size involves a bias-variance tradeoff. Smaller windows have low bias (they closely follow true performance changes) but high variance (they are susceptible to random fluctuations). Larger windows have low variance (stable estimates) but high bias (they are slow to detect genuine changes). The optimal window depends on the metric's inherent variability and the analytical purpose. Rare-event metrics like home run rate require longer windows, while metrics that accumulate on nearly every plate appearance or swing, such as strikeout rate or contact rate, stabilize faster and tolerate shorter windows.
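The lag side of this tradeoff can be made concrete with a noiseless toy series. The sketch below is illustrative only: the skill levels (.300 and .400), the 40-game change point, and the helper names are assumptions, not values from any real player.

```python
def rolling_mean(values, window):
    """Trailing simple moving average; None until the window is full."""
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[t - window + 1:t + 1]) / window)
    return out

# Hypothetical noiseless skill level: .300 for 40 games, then a true jump to .400.
true_level = [0.300] * 40 + [0.400] * 40

def games_to_catch_up(window, target=0.400, change_at=40):
    """Games after the change before the rolling mean reaches the new level."""
    smoothed = rolling_mean(true_level, window)
    for t in range(change_at, len(smoothed)):
        if smoothed[t] is not None and abs(smoothed[t] - target) < 1e-9:
            return t - change_at
    return None

lag_7 = games_to_catch_up(7)
lag_30 = games_to_catch_up(30)
print(f"7-game window catches up after {lag_7} games")
print(f"30-game window catches up after {lag_30} games")
```

With real, noisy data the short window pays for this responsiveness with much larger game-to-game swings, which is the variance half of the tradeoff.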
Advanced rolling average techniques include weighted moving averages (giving more weight to recent observations), exponentially weighted moving averages (EWMA, where weights decay exponentially), and centered moving averages (averaging around a central point for smoother historical analysis). Weighted approaches better capture recent performance while maintaining some stability from historical data. EWMA is particularly valuable for real-time analysis as it updates efficiently with each new observation.
Simple Moving Average (SMA)
SMA(t) = (1/n) × Σ[x(t-i)] for i = 0 to n-1
Where n is the window size and x(t) is the value at time t
Exponentially Weighted Moving Average (EWMA)
EWMA(t) = α × x(t) + (1-α) × EWMA(t-1)
Where α is the smoothing parameter (typically 0.1 to 0.3)
Weighted Moving Average (WMA)
WMA(t) = Σ[w(i) × x(t-i)] / Σ[w(i)]
Where w(i) are weights assigned to each observation
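The three formulas above translate almost line for line into code. This is a minimal standard-library sketch; the function names and the sample hit totals are made up for illustration, and seeding the EWMA with the first observation is one common convention rather than the only one.

```python
def sma(x, n):
    """SMA(t) = (1/n) * sum of x(t-i) for i = 0..n-1 (None until the window fills)."""
    return [sum(x[t - n + 1:t + 1]) / n if t + 1 >= n else None
            for t in range(len(x))]

def ewma(x, alpha):
    """EWMA(t) = alpha * x(t) + (1 - alpha) * EWMA(t-1), seeded with x(0)."""
    out = [x[0]]
    for value in x[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out

def wma(x, weights):
    """WMA(t) = sum of w(i) * x(t-i) / sum of w(i); weights[0] is the newest value."""
    n = len(weights)
    total = sum(weights)
    out = []
    for t in range(len(x)):
        if t + 1 < n:
            out.append(None)
        else:
            out.append(sum(weights[i] * x[t - i] for i in range(n)) / total)
    return out

hits = [1, 0, 2, 1, 3]  # hypothetical daily hit totals

print(sma(hits, 3))
print(ewma(hits, 0.5))
print(wma(hits, [3, 2, 1]))
```

On the last day the weighted average (13/6) sits above the simple average (2.0) because the most recent three-hit game carries the largest weight.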
Seasonal Trends and Hot/Cold Streaks
Baseball performance exhibits distinct seasonal patterns driven by factors like weather, travel fatigue, league-wide pitching quality changes, and individual player conditioning. Early season performance often differs from mid-summer or late-season statistics due to small samples, rust from the offseason, and strategic adjustments. Identifying these seasonal patterns enables more accurate player evaluation by contextualizing performance within expected temporal fluctuations.
Hot and cold streaks represent temporary deviations from a player's true talent level, driven by a combination of luck, mechanical adjustments, opponent quality, and psychological factors. Statistically detecting genuine streaks versus random variation requires careful analysis. Most apparent streaks reflect normal variance in small samples rather than fundamental performance changes. Advanced methods use changepoint detection algorithms to identify statistically significant shifts in performance levels that exceed random expectations.
Regression to the mean is a critical concept for understanding streaks. Players performing far above or below their career norms tend to revert toward their historical averages over time. This does not mean a hot streak will be followed by a slump: it means that performance after a hot streak is expected to fall back toward the player's established level, and performance after a cold streak is expected to recover toward it. Quantifying the expected regression magnitude helps analysts avoid overreacting to short-term fluctuations and make more rational evaluation and prediction decisions.
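One common way to quantify the expected regression is to blend the observed rate with a baseline, giving the observation more weight as the sample grows. The sketch below assumes that approach; the function name, the 200-PA stabilization constant, and the player numbers are hypothetical and would need to be estimated for each metric.

```python
def regress_to_mean(observed, sample_size, baseline, stabilization):
    """Blend an observed rate with a baseline, weighting by sample size.

    `stabilization` is the sample size at which the observation and the
    baseline receive equal weight. It is a hypothetical tuning constant
    here, not a value taken from the text.
    """
    weight = sample_size / (sample_size + stabilization)
    estimate = weight * observed + (1 - weight) * baseline
    return estimate, 1 - weight

# Hypothetical hitter: .400 over 50 PA against a .270 career baseline
estimate, share_regressed = regress_to_mean(0.400, 50, 0.270, 200)
print(f"expected rate going forward: {estimate:.3f}")
print(f"share of the gap expected to disappear: {share_regressed:.0%}")
```

Under these assumptions the hot start is still informative, but only a fifth of the gap between the streak and the baseline carries forward.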
Year-over-Year Comparisons
Year-over-year (YoY) analysis compares a player's current performance against the same calendar period in previous seasons, controlling for seasonal effects and identifying genuine development or decline. This approach is particularly valuable for evaluating young players' development curves and detecting age-related decline in veterans. YoY comparisons must account for changes in league environment (run-scoring context), playing time, and role changes that affect statistical comparisons.
Aging curves represent the typical performance trajectory across a player's career, showing how skills develop through peak years (typically ages 27-29 for most skills) and decline afterward. Individual players deviate from average aging curves based on skill profile, injury history, and conditioning. Comparing a player's actual performance trajectory against expected aging patterns identifies players aging better or worse than expected, informing long-term contract and roster construction decisions.
Cohort analysis groups players by debut year or age to examine how entire generations of players develop and decline. This population-level perspective reveals whether recent development patterns differ from historical norms, potentially indicating changes in player development practices, performance-enhancing technologies, or league-wide strategic shifts. It also provides more robust aging curve estimates by aggregating data across many players to reduce individual variance.
League-Wide Trends
League-wide trends in metrics like strikeout rates, home run rates, and launch angle reflect evolving offensive and defensive strategies. The modern era has witnessed dramatic increases in strikeout rates (from ~16% in 2000 to ~23% in 2023), home run rates (especially during the "juiced ball" periods), and launch angle optimization (the "fly ball revolution"). These aggregate trends provide essential context for evaluating individual player performance.
Normalizing individual statistics for league context using metrics like wRC+ (weighted runs created plus) and ERA+ ensures fair comparisons across eras. A .300 batting average in a high-offense environment can represent less value than .280 in a low-scoring era. League-adjusted metrics scale performance relative to average, with 100 representing league average and each point above or below representing roughly 1% better or worse than average. This normalization is essential for valid historical comparisons and fair player evaluation.
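The idea behind a "plus" statistic can be illustrated with a bare ratio to league average. This is a deliberately simplified sketch: real wRC+ and ERA+ also apply park adjustments and use different inputs, and the league averages below are hypothetical.

```python
def league_index(player_rate, league_rate, lower_is_better=False):
    """Scale a rate so that 100 equals league average.

    Simplified illustration only: no park factors, and not the actual
    wRC+ or ERA+ formula.
    """
    if lower_is_better:
        return 100 * league_rate / player_rate
    return 100 * player_rate / league_rate

# Hypothetical league batting averages for two run environments
high_offense = league_index(0.300, 0.270)   # .300 hitter in a .270 league
low_offense = league_index(0.280, 0.240)    # .280 hitter in a .240 league
era_style = league_index(3.00, 4.50, lower_is_better=True)

print(round(high_offense), round(low_offense), round(era_style))
```

The .280 hitter grades out higher than the .300 hitter once each is measured against his own league, which is the comparison the paragraph above describes.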
Tracking league-wide trends helps identify market inefficiencies and strategic opportunities. Teams that recognize emerging trends early gain competitive advantages. For example, organizations that adopted defensive shifts before they became universal allowed fewer runs as a result. Similarly, teams emphasizing launch angle optimization before it became mainstream identified undervalued players who could improve with mechanical adjustments. Monitoring trend inflection points reveals when market inefficiencies are emerging or closing.
Statistical Smoothing Techniques
LOESS (Locally Estimated Scatterplot Smoothing), also known as LOWESS, is a powerful non-parametric regression method that fits local polynomial regressions to subsets of data, creating smooth trend curves without assuming a global functional form. LOESS is ideal for baseball trend analysis because it adapts to local patterns, handles non-linear trends naturally, and doesn't require specifying a mathematical model. The smoothing parameter controls the bandwidth of local fitting, with smaller values producing more flexible curves and larger values creating smoother trends.
Savitzky-Golay filters provide an alternative smoothing approach by fitting successive subsets of adjacent data points with low-degree polynomials using least squares. These filters preserve features such as the height and width of peaks and valleys better than simple moving averages, making them valuable for identifying performance inflection points while reducing noise. The filter parameters include window size and polynomial order, which can be tuned to the time series characteristics and analytical objectives.
Kalman filtering represents a sophisticated approach that uses Bayesian updating to estimate the true underlying signal from noisy observations. While more complex to specify than a moving average, Kalman filters excel at handling missing data, incorporating measurement uncertainty, and providing probabilistic confidence intervals around trend estimates. These features make them valuable for rigorous statistical analysis where quantifying uncertainty is essential, such as predictive modeling and hypothesis testing.
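A one-dimensional "local level" model is about the simplest Kalman filter and shows the update cycle described above. This is a sketch under stated assumptions: true talent is modeled as a random walk, and the variances, starting values, and xwOBA observations are hypothetical rather than fitted to data.

```python
def kalman_local_level(observations, process_var, obs_var, init_mean, init_var):
    """Filter noisy observations of a slowly drifting level.

    Returns the filtered mean and variance after each time step.
    A None observation (an off day) skips the update, so uncertainty grows.
    """
    mean, var = init_mean, init_var
    means, variances = [], []
    for y in observations:
        # Predict: the level follows a random walk, so uncertainty increases
        var = var + process_var
        if y is not None:
            # Update: move toward the observation in proportion to the gain
            gain = var / (var + obs_var)
            mean = mean + gain * (y - mean)
            var = (1 - gain) * var
        means.append(mean)
        variances.append(var)
    return means, variances

# Hypothetical daily xwOBA values with one missing day
daily_xwoba = [0.420, 0.380, None, 0.300, 0.450]
means, variances = kalman_local_level(
    daily_xwoba, process_var=0.0004, obs_var=0.01, init_mean=0.350, init_var=0.01
)
for m, v in zip(means, variances):
    half_width = 1.96 * v ** 0.5
    print(f"estimate {m:.3f} +/- {half_width:.3f}")
```

The printed interval widens on the missing day and narrows again once observations resume, which is the uncertainty tracking that simple rolling averages do not provide.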
LOESS Smoothing
For each point x, fit weighted polynomial regression on nearby points
Weights: w(i) = (1 - (|x - x(i)|/d)³)³ for points within distance d
Typically uses a 1st or 2nd degree polynomial and a span (the fraction of points used in each local fit) of 0.2-0.75
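The tricube weight formula above is easy to check numerically. The sketch below computes the weights only, not the local regression that uses them; the function name and sample positions are illustrative.

```python
def tricube_weights(x0, xs, d):
    """w(i) = (1 - (|x0 - x(i)| / d)^3)^3 inside distance d, else 0."""
    weights = []
    for xi in xs:
        u = abs(x0 - xi) / d
        weights.append((1 - u ** 3) ** 3 if u < 1 else 0.0)
    return weights

# Weights for a local fit centered on game 5 with neighborhood radius d = 4
games = [1, 3, 5, 7, 10]
for game, w in zip(games, tricube_weights(5, games, 4)):
    print(f"game {game}: weight {w:.3f}")
```

Points at the center get full weight, points halfway out still get about two thirds, and anything at or beyond distance d drops out of the local fit entirely.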
Savitzky-Golay Filter
Fits polynomial of degree k to window of 2m+1 points
Smoothed value: y*(t) = Σ[c(i) × y(t+i)] for i = -m to m
Common parameters: k=2 or 3, m=5 to 10
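The coefficients c(i) are not arbitrary: they come from the least-squares polynomial fit itself and depend only on m and k. The sketch below derives them with exact fractions from the normal equations. It is a teaching aid with illustrative function names, not a replacement for a library routine such as scipy.signal.savgol_filter, and it smooths interior points only.

```python
from fractions import Fraction

def savgol_coefficients(m, k):
    """Smoothing weights c(-m..m) for a degree-k fit on 2m+1 evenly spaced points."""
    offsets = range(-m, m + 1)
    size = k + 1
    # Normal-equation matrix: entry (a, b) is the sum of i**(a+b) over the window
    normal = [[Fraction(sum(i ** (a + b) for i in offsets)) for b in range(size)]
              for a in range(size)]
    rhs = [Fraction(1)] + [Fraction(0)] * k  # selects the fitted value at i = 0

    # Gauss-Jordan elimination on the system normal * z = rhs
    for col in range(size):
        pivot = next(r for r in range(col, size) if normal[r][col] != 0)
        normal[col], normal[pivot] = normal[pivot], normal[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(size):
            if r != col and normal[r][col] != 0:
                factor = normal[r][col] / normal[col][col]
                normal[r] = [x - factor * y for x, y in zip(normal[r], normal[col])]
                rhs[r] -= factor * rhs[col]
    z = [rhs[r] / normal[r][r] for r in range(size)]

    # c(i) = sum over j of z[j] * i**j
    return [sum(z[j] * (i ** j) for j in range(size)) for i in offsets]

def savgol_smooth(y, m, k):
    """y*(t) = sum of c(i) * y(t+i) for i = -m..m, for interior points only."""
    c = savgol_coefficients(m, k)
    return [sum(c[i + m] * y[t + i] for i in range(-m, m + 1))
            for t in range(m, len(y) - m)]

coeffs = savgol_coefficients(2, 2)
print([str(c) for c in coeffs])
```

For m = 2 and k = 2 this reproduces the classic five-point weights (-3, 12, 17, 12, -3)/35. The negative end weights are what let the filter follow a peak instead of flattening it the way an equal-weight average would.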
Identifying Breakout Candidates
Breakout detection combines trend analysis with underlying skill metrics to identify players showing genuine improvement rather than temporary hot streaks. Key indicators include sustained improvements in quality of contact (exit velocity, barrel rate), plate discipline (walk rate, chase rate), or batted ball profiles (launch angle, hard-hit rate). These "sticky" skills tend to persist more than outcome-based metrics like batting average or ERA, making them more reliable breakout indicators.
Changepoint detection algorithms identify specific time points where statistical properties of a time series shift significantly. In a baseball context, changepoints might represent mechanical adjustments, role changes, or genuine skill development. Methods like the PELT (Pruned Exact Linear Time) algorithm or Bayesian changepoint detection provide statistical tests for whether observed changes exceed random variation expectations, helping separate true breakouts from noise.
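The core idea behind penalized changepoint methods can be shown with a single-split search. Note that this is not PELT: it is a simplified sketch that looks for at most one shift in the mean, and the penalty value, function name, and exit velocity numbers are hypothetical.

```python
def best_single_changepoint(values, penalty, min_segment=3):
    """Return the split index that most reduces squared error, or None.

    A split is accepted only if it lowers the total squared error around
    the segment means by more than `penalty`.
    """
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    total = sse(values)
    best_idx, best_gain = None, penalty
    for k in range(min_segment, len(values) - min_segment + 1):
        gain = total - (sse(values[:k]) + sse(values[k:]))
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx

# Hypothetical weekly average exit velocity (mph) with a jump after week 5
exit_velo = [88, 89, 88, 89, 88, 93, 94, 93, 94, 93]
print(best_single_changepoint(exit_velo, penalty=1.0))

# Ordinary week-to-week wobble should not trigger a changepoint
steady = [88, 89, 88, 89, 88, 89]
print(best_single_changepoint(steady, penalty=5.0))
```

The penalty plays the same role as in the full algorithms: without it, almost any noisy series would "improve" with one more split.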
Machine learning approaches to breakout prediction incorporate multiple features including recent trends, career trajectories, age, underlying skills, and contextual factors like team changes or coaching staff. Random forests, gradient boosting machines, and neural networks can identify complex patterns that simple trend analysis might miss. These models require careful validation to avoid overfitting on small samples and must be recalibrated regularly as the baseball environment evolves.
Regression to the Mean Visualization
Visualizing regression to the mean helps communicate this fundamental statistical concept to stakeholders who might otherwise overreact to recent performance. Funnel plots show individual player performance against sample size, with confidence intervals narrowing for larger samples. Players outside the funnel boundaries represent statistically significant outliers, but even these tend to regress toward their talent level over time.
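The funnel boundaries can be sketched with a normal approximation to the binomial. The baseline rate of .250, the z-value of 1.96 (a 95% band), and the sample player are assumptions for illustration; exact binomial limits would be preferable for very small samples.

```python
import math

def funnel_bounds(baseline, n, z=1.96):
    """Approximate band around a baseline rate for a sample of size n."""
    se = math.sqrt(baseline * (1 - baseline) / n)
    return max(0.0, baseline - z * se), min(1.0, baseline + z * se)

def is_outlier(observed, baseline, n, z=1.96):
    """True when the observed rate falls outside the funnel at this sample size."""
    low, high = funnel_bounds(baseline, n, z)
    return not (low <= observed <= high)

for n in (50, 300, 600):
    low, high = funnel_bounds(0.250, n)
    print(f"n={n}: {low:.3f} to {high:.3f}")

# Hypothetical hitter batting .330: inside the funnel at 50 AB, outside at 300
print(is_outlier(0.330, 0.250, 50), is_outlier(0.330, 0.250, 300))
```

The same .330 average is unremarkable over 50 at-bats and a clear outlier over 300, which is the point the funnel shape makes visually.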
Scatter plots comparing early-season versus late-season performance typically show regression toward the mean, with extreme early performers clustering closer to average in late-season stats. Adding a y=x reference line alongside the fitted regression line visualizes the tendency for extremes to moderate: the flatter the fitted line relative to y=x, the stronger the regression. The correlation between early and late performance quantifies predictive power. Higher correlations indicate more persistent skills, while low correlations suggest high variance and strong regression effects.
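The fitted line and correlation for such a scatter plot need only a few lines of arithmetic. In this sketch the five early- and late-season wOBA pairs are invented for illustration, and the function name is hypothetical.

```python
import math

def early_late_summary(early, late):
    """Least-squares slope, intercept, and Pearson r of late on early values."""
    n = len(early)
    mean_e = sum(early) / n
    mean_l = sum(late) / n
    cov = sum((e - mean_e) * (lt - mean_l) for e, lt in zip(early, late))
    var_e = sum((e - mean_e) ** 2 for e in early)
    var_l = sum((lt - mean_l) ** 2 for lt in late)
    slope = cov / var_e
    intercept = mean_l - slope * mean_e
    r = cov / math.sqrt(var_e * var_l)
    return slope, intercept, r

# Hypothetical early- and late-season wOBA for five hitters
early = [0.250, 0.300, 0.350, 0.400, 0.450]
late = [0.310, 0.320, 0.355, 0.360, 0.380]

slope, intercept, r = early_late_summary(early, late)
print(f"slope={slope:.2f} intercept={intercept:.3f} r={r:.2f}")
```

A slope of 1 would mean early performance carries over in full (the y=x line). Here the slope is well below 1, so each point of early-season advantage translates into only a fraction of a point later on.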
Interactive visualizations enable dynamic exploration of regression effects across different time windows, sample sizes, and statistical categories. Users can adjust parameters to see how regression strength varies, helping build intuition about when to trust recent performance versus career norms. These tools are valuable for fantasy baseball, front office analysis, and broadcasting, making sophisticated statistical concepts accessible to broader audiences.
Python Implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import signal
from scipy.interpolate import make_interp_spline
from sklearn.linear_model import LinearRegression
from statsmodels.nonparametric.smoothers_lowess import lowess
from pybaseball import statcast_batter, playerid_lookup
import warnings
warnings.filterwarnings('ignore')
class BaseballTrendAnalyzer:
    """
    Comprehensive trend analysis toolkit for baseball statistics.
    """

    def __init__(self):
        """Initialize the analyzer."""
        self.data = None

    def load_player_season(self, last_name, first_name, year):
        """
        Load player data for trend analysis.

        Parameters:
            last_name: Player last name
            first_name: Player first name
            year: Season year

        Returns:
            DataFrame with daily aggregated statistics
        """
        # Look up player (takes the first match if several players share the name)
        player = playerid_lookup(last_name, first_name)
        if len(player) == 0:
            raise ValueError(f"Player {first_name} {last_name} not found")
        player_id = player.iloc[0]['key_mlbam']

        # Fetch Statcast data (one row per pitch)
        print(f"Loading {first_name} {last_name} {year} data...")
        raw_data = statcast_batter(f'{year}-03-01', f'{year}-11-30', player_id)
        if raw_data is None or len(raw_data) == 0:
            raise ValueError("No data found")

        # Aggregate by date
        daily_stats = raw_data.groupby('game_date').agg({
            # 'events' is only filled on the pitch that ends a plate appearance
            'events': 'count',
            'launch_speed': 'mean',
            'launch_angle': 'mean',
            'estimated_ba_using_speedangle': 'mean',
            'estimated_woba_using_speedangle': 'mean',
            # Assumption: the raw frame has no separate 'barrel' column, so count
            # batted balls in Statcast launch_speed_angle zone 6 (barrels)
            'launch_speed_angle': lambda x: (x == 6).sum()
        }).reset_index()
        daily_stats.columns = ['date', 'pa', 'avg_exit_velo', 'avg_launch_angle',
                               'xBA', 'xwOBA', 'barrels']
        daily_stats['date'] = pd.to_datetime(daily_stats['date'])
        daily_stats = daily_stats.sort_values('date').reset_index(drop=True)

        # Calculate cumulative stats
        daily_stats['cumulative_pa'] = daily_stats['pa'].cumsum()
        daily_stats['barrel_rate'] = daily_stats['barrels'] / daily_stats['pa']

        self.data = daily_stats
        return daily_stats
    def calculate_rolling_averages(self, column, windows=(7, 14, 30)):
        """
        Calculate multiple rolling averages for a metric.

        Parameters:
            column: Column name to analyze
            windows: Window sizes, counted in rows of the daily data
                (game dates, not calendar days)

        Returns:
            DataFrame with original data and rolling averages
        """
        if self.data is None:
            raise ValueError("Load data first using load_player_season()")
        result = self.data[['date', column]].copy()
        for window in windows:
            result[f'{column}_sma_{window}d'] = (
                result[column].rolling(window=window, min_periods=max(1, window // 2)).mean()
            )
        return result

    def exponential_moving_average(self, column, alpha=0.2):
        """
        Calculate exponentially weighted moving average.

        Parameters:
            column: Column name to analyze
            alpha: Smoothing parameter (0 < alpha < 1, higher = more weight on recent)

        Returns:
            Series with EWMA values
        """
        if self.data is None:
            raise ValueError("Load data first using load_player_season()")
        return self.data[column].ewm(alpha=alpha, adjust=False).mean()
    def detect_streaks(self, column, threshold_std=1.5, min_length=5):
        """
        Detect hot and cold streaks using statistical thresholds.

        Parameters:
            column: Column to analyze for streaks
            threshold_std: Number of standard deviations for streak detection
            min_length: Minimum consecutive game dates to qualify as streak

        Returns:
            DataFrame with streak indicators
        """
        data = self.data.copy()

        # Calculate z-scores relative to the season mean
        mean = data[column].mean()
        std = data[column].std()
        if not std > 0:
            raise ValueError(f"Cannot compute z-scores: '{column}' has no variation")
        data['z_score'] = (data[column] - mean) / std

        # Identify hot and cold periods
        data['hot'] = data['z_score'] > threshold_std
        data['cold'] = data['z_score'] < -threshold_std

        # Find consecutive streaks: True on each date that completes a run of
        # min_length consecutive hot (or cold) dates
        data['hot_streak'] = data['hot'].astype(int).rolling(min_length).sum() == min_length
        data['cold_streak'] = data['cold'].astype(int).rolling(min_length).sum() == min_length

        return data[['date', column, 'z_score', 'hot_streak', 'cold_streak']]
    def apply_loess_smoothing(self, column, frac=0.3):
        """
        Apply LOESS smoothing to time series data.

        Parameters:
            column: Column to smooth
            frac: Fraction of data used for local regression (0 < frac < 1)

        Returns:
            Array with smoothed values
        """
        if self.data is None:
            raise ValueError("Load data first")

        # Prepare data (need numeric x-axis)
        x = np.arange(len(self.data))
        y = self.data[column].values

        # Remove NaN values
        mask = ~np.isnan(y)
        x_clean = x[mask]
        y_clean = y[mask]

        # Apply LOESS
        smoothed = lowess(y_clean, x_clean, frac=frac, return_sorted=False)

        # Map back to original indices
        result = np.full(len(self.data), np.nan)
        result[mask] = smoothed
        return result
    def apply_savgol_filter(self, column, window_length=11, polyorder=2):
        """
        Apply Savitzky-Golay filter for smoothing.

        Parameters:
            column: Column to smooth
            window_length: Length of filter window (must be odd)
            polyorder: Order of polynomial fit

        Returns:
            Array with filtered values
        """
        if window_length % 2 == 0:
            window_length += 1  # Must be odd

        # Fill gaps forward, then backward for any leading NaNs
        y = self.data[column].ffill().bfill().values

        if len(y) < window_length:
            # Shrink to the largest odd window that fits the data
            new_window = len(y) if len(y) % 2 == 1 else len(y) - 1
            print(f"Warning: Data length ({len(y)}) < window ({window_length}), using {new_window}")
            window_length = new_window
        if window_length <= polyorder:
            raise ValueError("window_length must be greater than polyorder")

        smoothed = signal.savgol_filter(y, window_length, polyorder)
        return smoothed
    def identify_changepoints(self, column, penalty=10):
        """
        Identify performance changepoints using simple algorithm.

        Parameters:
            column: Column to analyze
            penalty: Penalty for adding changepoints (higher = fewer changepoints)

        Returns:
            List of changepoint indices
        """
        # Fill gaps forward, then backward for any leading NaNs
        data = self.data[column].ffill().bfill().values

        # Simple changepoint detection: look for large jumps in rolling mean
        rolling_mean = pd.Series(data).rolling(7, min_periods=1).mean().values
        diff = np.abs(np.diff(rolling_mean))

        # Identify points where change exceeds threshold
        threshold = np.std(diff) * penalty / 10
        changepoints = np.where(diff > threshold)[0] + 1
        return changepoints.tolist()
    def visualize_trends(self, column, title=None, save_path=None):
        """
        Create comprehensive trend visualization.

        Parameters:
            column: Column to visualize
            title: Chart title
            save_path: Path to save figure (optional)
        """
        fig, axes = plt.subplots(3, 1, figsize=(14, 12))
        if title is None:
            title = f'{column.replace("_", " ").title()} Trend Analysis'

        data = self.data.copy()
        dates = data['date']
        values = data[column]

        # Plot 1: Raw data with rolling averages
        axes[0].plot(dates, values, 'o-', alpha=0.3, label='Daily', markersize=3)

        # Add rolling averages
        for window in [7, 14, 30]:
            rolling = values.rolling(window, min_periods=max(1, window // 2)).mean()
            axes[0].plot(dates, rolling, linewidth=2, label=f'{window}-day SMA')
        axes[0].set_ylabel(column.replace('_', ' ').title())
        axes[0].set_title(f'{title} - Rolling Averages')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)

        # Plot 2: LOESS and Savitzky-Golay smoothing
        loess_smooth = self.apply_loess_smoothing(column, frac=0.3)
        savgol_smooth = self.apply_savgol_filter(column, window_length=11, polyorder=2)
        axes[1].plot(dates, values, 'o', alpha=0.2, label='Daily', markersize=3)
        axes[1].plot(dates, loess_smooth, linewidth=2.5, label='LOESS (frac=0.3)', color='red')
        axes[1].plot(dates, savgol_smooth, linewidth=2.5, label='Savitzky-Golay',
                     color='green', linestyle='--')
        axes[1].set_ylabel(column.replace('_', ' ').title())
        axes[1].set_title(f'{title} - Advanced Smoothing')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)

        # Plot 3: Streak detection
        streaks = self.detect_streaks(column, threshold_std=1.0, min_length=5)
        axes[2].plot(dates, values, 'o-', alpha=0.5, label='Daily', markersize=4)

        # Highlight streaks
        hot_dates = streaks[streaks['hot_streak']]['date']
        hot_values = values[streaks['hot_streak']]
        cold_dates = streaks[streaks['cold_streak']]['date']
        cold_values = values[streaks['cold_streak']]
        axes[2].scatter(hot_dates, hot_values, color='red', s=100, marker='^',
                        label='Hot Streak', zorder=5, edgecolors='darkred', linewidth=1.5)
        axes[2].scatter(cold_dates, cold_values, color='blue', s=100, marker='v',
                        label='Cold Streak', zorder=5, edgecolors='darkblue', linewidth=1.5)

        # Add mean line
        axes[2].axhline(values.mean(), color='black', linestyle='--',
                        linewidth=1.5, alpha=0.7, label='Season Average')
        axes[2].set_xlabel('Date')
        axes[2].set_ylabel(column.replace('_', ' ').title())
        axes[2].set_title(f'{title} - Streak Detection')
        axes[2].legend()
        axes[2].grid(True, alpha=0.3)

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')
        return fig
    def regression_to_mean_analysis(self, column, split_date=None, baseline=None):
        """
        Analyze regression to the mean by comparing first half vs second half.

        Parameters:
            column: Column to analyze
            split_date: Date to split season (defaults to midpoint)
            baseline: Reference level to regress toward, such as a career or
                league average (defaults to the full-season average)

        Returns:
            DataFrame with regression analysis results
        """
        if split_date is None:
            split_idx = len(self.data) // 2
        else:
            split_idx = (self.data['date'] <= pd.to_datetime(split_date)).sum()

        first_half = self.data.iloc[:split_idx]
        second_half = self.data.iloc[split_idx:]

        def period_average(frame):
            # Weight daily values by PA when available
            valid = frame[column].notna()
            if 'pa' in frame.columns:
                return np.average(frame.loc[valid, column], weights=frame.loc[valid, 'pa'])
            return frame.loc[valid, column].mean()

        first_avg = period_average(first_half)
        second_avg = period_average(second_half)
        season_avg = period_average(self.data)

        # Caution: with the season average as the reference, the two halves sit on
        # opposite sides of it by construction, so regression_percent comes out
        # near 200. Pass a career or league average as `baseline` for a
        # meaningful figure.
        reference = season_avg if baseline is None else baseline

        # Calculate regression coefficient
        deviation_first = first_avg - reference
        deviation_second = second_avg - reference
        if deviation_first != 0:
            regression_pct = (1 - (deviation_second / deviation_first)) * 100
        else:
            regression_pct = 0

        results = {
            'metric': column,
            'first_half_avg': first_avg,
            'second_half_avg': second_avg,
            'season_avg': season_avg,
            'baseline': reference,
            'first_half_deviation': deviation_first,
            'second_half_deviation': deviation_second,
            'regression_percent': regression_pct
        }
        return pd.DataFrame([results])
# Example Usage and Analysis
if __name__ == "__main__":
    # Initialize analyzer
    analyzer = BaseballTrendAnalyzer()

    # Load data for Aaron Judge 2023 season
    print("=" * 60)
    print("BASEBALL TREND ANALYSIS EXAMPLE")
    print("=" * 60)
    try:
        data = analyzer.load_player_season('Judge', 'Aaron', 2023)
        print(f"\nLoaded {len(data)} days of data\n")

        # 1. Rolling Averages Analysis
        print("\n1. ROLLING AVERAGES (xwOBA)")
        print("-" * 60)
        rolling_xwoba = analyzer.calculate_rolling_averages('xwOBA', windows=[7, 14, 30])
        print(rolling_xwoba.tail(10))

        # 2. Exponential Moving Average
        print("\n2. EXPONENTIAL MOVING AVERAGE (Exit Velocity)")
        print("-" * 60)
        data['ewma_exit_velo'] = analyzer.exponential_moving_average('avg_exit_velo', alpha=0.2)
        print(data[['date', 'avg_exit_velo', 'ewma_exit_velo']].tail(10))

        # 3. Streak Detection
        print("\n3. HOT/COLD STREAK DETECTION (xwOBA)")
        print("-" * 60)
        streaks = analyzer.detect_streaks('xwOBA', threshold_std=1.0, min_length=5)
        hot_streaks = streaks[streaks['hot_streak']]
        cold_streaks = streaks[streaks['cold_streak']]
        print(f"Hot streaks detected: {len(hot_streaks)}")
        if len(hot_streaks) > 0:
            print("\nHot streak periods:")
            print(hot_streaks[['date', 'xwOBA', 'z_score']].head())
        print(f"\nCold streaks detected: {len(cold_streaks)}")
        if len(cold_streaks) > 0:
            print("\nCold streak periods:")
            print(cold_streaks[['date', 'xwOBA', 'z_score']].head())

        # 4. Smoothing Techniques Comparison
        print("\n4. SMOOTHING TECHNIQUES COMPARISON")
        print("-" * 60)
        data['loess_xwoba'] = analyzer.apply_loess_smoothing('xwOBA', frac=0.3)
        data['savgol_xwoba'] = analyzer.apply_savgol_filter('xwOBA', window_length=11, polyorder=2)
        comparison = data[['date', 'xwOBA', 'loess_xwoba', 'savgol_xwoba']].tail(10)
        print(comparison)

        # 5. Changepoint Detection
        print("\n5. PERFORMANCE CHANGEPOINTS (Exit Velocity)")
        print("-" * 60)
        changepoints = analyzer.identify_changepoints('avg_exit_velo', penalty=8)
        print(f"Detected {len(changepoints)} changepoints")
        if len(changepoints) > 0:
            print("\nChangepoint dates and values:")
            for idx in changepoints[:5]:  # Show first 5
                if idx < len(data):
                    print(f"  {data.iloc[idx]['date'].strftime('%Y-%m-%d')}: "
                          f"{data.iloc[idx]['avg_exit_velo']:.2f} mph")

        # 6. Regression to Mean Analysis
        print("\n6. REGRESSION TO MEAN ANALYSIS (xwOBA)")
        print("-" * 60)
        regression_results = analyzer.regression_to_mean_analysis('xwOBA')
        print(regression_results.T)

        # 7. Create Comprehensive Visualization
        print("\n7. GENERATING VISUALIZATIONS...")
        print("-" * 60)
        fig = analyzer.visualize_trends(
            'xwOBA',
            title='Aaron Judge 2023 xwOBA Trend Analysis',
            save_path='judge_trend_analysis.png'
        )
        print("Saved comprehensive trend visualization to 'judge_trend_analysis.png'")

        # Summary Statistics
        # Note: the exit velocity and xwOBA figures are averages of daily averages
        print("\n" + "=" * 60)
        print("SUMMARY STATISTICS")
        print("=" * 60)
        print(f"Season dates: {data['date'].min()} to {data['date'].max()}")
        print(f"Game dates: {len(data)}")
        print(f"Total PA: {data['pa'].sum():.0f}")
        print(f"Avg Exit Velocity: {data['avg_exit_velo'].mean():.1f} mph")
        print(f"Avg xwOBA: {data['xwOBA'].mean():.3f}")
        # Season barrel rate: total barrels over total PA, not a mean of daily rates
        print(f"Barrel Rate: {data['barrels'].sum() / data['pa'].sum() * 100:.1f}%")
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
R Implementation
library(tidyverse)
library(baseballr)
library(zoo)
library(forecast)
library(changepoint)
library(TTR)
library(ggplot2)
library(gridExtra)
# Baseball Trend Analyzer Class
BaseballTrendAnalyzer <- R6::R6Class(
"BaseballTrendAnalyzer",
public = list(
data = NULL,
load_player_season = function(last_name, first_name, year) {
# Look up player
player <- playerid_lookup(last_name, first_name)
if (nrow(player) == 0) {
stop(sprintf("Player %s %s not found", first_name, last_name))
}
player_id <- player$key_mlbam[1]
message(sprintf("Loading %s %s %d data...", first_name, last_name, year))
# Fetch Statcast data
raw_data <- statcast_search_batters(
start_date = sprintf("%d-03-01", year),
end_date = sprintf("%d-11-30", year),
batterid = player_id
)
# Aggregate by date
daily_stats <- raw_data %>%
group_by(game_date) %>%
summarise(
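# Count plate appearances (rows with a recorded event), not every pitch
pa = sum(!is.na(events) & events != ""),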
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
xBA = mean(estimated_ba_using_speedangle, na.rm = TRUE),
xwOBA = mean(estimated_woba_using_speedangle, na.rm = TRUE),
barrels = sum(barrel == 1, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
date = as.Date(game_date),
cumulative_pa = cumsum(pa),
barrel_rate = barrels / pa
) %>%
arrange(date)
self$data <- daily_stats
return(daily_stats)
},
calculate_rolling_averages = function(column, windows = c(7, 14, 30)) {
if (is.null(self$data)) {
stop("Load data first using load_player_season()")
}
result <- self$data %>%
select(date, all_of(column))
for (window in windows) {
col_name <- sprintf("%s_sma_%dd", column, window)
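# rollmean() does not handle missing values, so average with na.rm instead
result[[col_name]] <- rollapplyr(
result[[column]],
width = window,
FUN = function(v) mean(v, na.rm = TRUE),
fill = NA
)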
}
return(result)
},
exponential_moving_average = function(column, alpha = 0.2) {
values <- self$data[[column]]
# Calculate EWMA manually, seeding with the first non-missing value
ewma <- rep(NA_real_, length(values))
last <- NA_real_
for (i in seq_along(values)) {
if (!is.na(values[i])) {
last <- if (is.na(last)) values[i] else alpha * values[i] + (1 - alpha) * last
}
# Missing days carry the previous estimate forward
ewma[i] <- last
}
return(ewma)
},
detect_streaks = function(column, threshold_std = 1.5, min_length = 5) {
data <- self$data
values <- data[[column]]
# Calculate z-scores
mean_val <- mean(values, na.rm = TRUE)
sd_val <- sd(values, na.rm = TRUE)
z_scores <- (values - mean_val) / sd_val
# Identify hot and cold periods
hot <- z_scores > threshold_std
cold <- z_scores < -threshold_std
# Find consecutive streaks
hot_streak <- rollsum(hot, k = min_length, fill = FALSE, align = "right") == min_length
cold_streak <- rollsum(cold, k = min_length, fill = FALSE, align = "right") == min_length
result <- data %>%
mutate(
z_score = z_scores,
hot_streak = hot_streak,
cold_streak = cold_streak
) %>%
select(date, all_of(column), z_score, hot_streak, cold_streak)
return(result)
},
apply_loess_smoothing = function(column, span = 0.3) {
if (is.null(self$data)) {
stop("Load data first")
}
values <- self$data[[column]]
x <- seq_along(values)
# Remove NAs
valid_idx <- !is.na(values)
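# Apply LOESS on a data frame so predict() can match the variable name in newdata
loess_df <- data.frame(x = x[valid_idx], y = values[valid_idx])
loess_model <- loess(y ~ x, data = loess_df, span = span)
smoothed <- predict(loess_model, newdata = data.frame(x = x))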
return(smoothed)
},
apply_savgol_filter = function(column, window_length = 11, polyorder = 2) {
# Savitzky-Golay filter via signal::sgolayfilt (assumes the 'signal' package is installed)
values <- self$data[[column]]
# Interpolate interior gaps and carry end values outward so no NAs remain
values_clean <- na.approx(values, na.rm = FALSE, rule = 2)
if (window_length %% 2 == 0) {
window_length <- window_length + 1 # Must be odd
}
smoothed <- signal::sgolayfilt(values_clean, p = polyorder, n = window_length)
return(smoothed)
},
identify_changepoints = function(column, method = "PELT", penalty = 10) {
values <- self$data[[column]]
# rule = 2 extends the end values so leading/trailing NAs do not reach cpt.mean()
values_clean <- na.approx(values, na.rm = FALSE, rule = 2)
# Detect changepoints using changepoint package
cpt_result <- cpt.mean(
values_clean,
method = method,
penalty = "Manual",
pen.value = penalty
)
changepoints <- cpts(cpt_result)
return(changepoints)
},
visualize_trends = function(column, title = NULL, save_path = NULL) {
if (is.null(title)) {
title <- sprintf("%s Trend Analysis", str_to_title(str_replace_all(column, "_", " ")))
}
data <- self$data
dates <- data$date
values <- data[[column]]
# Plot 1: Rolling Averages
rolling_data <- data %>%
mutate(
sma_7 = rollapplyr(!!sym(column), width = 7, FUN = function(v) mean(v, na.rm = TRUE), fill = NA),
sma_14 = rollapplyr(!!sym(column), width = 14, FUN = function(v) mean(v, na.rm = TRUE), fill = NA),
sma_30 = rollapplyr(!!sym(column), width = 30, FUN = function(v) mean(v, na.rm = TRUE), fill = NA)
)
p1 <- ggplot(rolling_data, aes(x = date)) +
geom_point(aes(y = !!sym(column)), alpha = 0.3, size = 1) +
geom_line(aes(y = sma_7, color = "7-day"), linewidth = 1) +
geom_line(aes(y = sma_14, color = "14-day"), linewidth = 1) +
geom_line(aes(y = sma_30, color = "30-day"), linewidth = 1) +
scale_color_manual(
values = c("7-day" = "#E74C3C", "14-day" = "#3498DB", "30-day" = "#2ECC71"),
name = "Rolling Average"
) +
labs(
title = paste(title, "- Rolling Averages"),
x = NULL,
y = str_to_title(str_replace_all(column, "_", " "))
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 12),
legend.position = "right"
)
# Plot 2: Advanced Smoothing
loess_smooth <- self$apply_loess_smoothing(column, span = 0.3)
savgol_smooth <- self$apply_savgol_filter(column)
smooth_data <- data %>%
mutate(
loess = loess_smooth,
savgol = savgol_smooth
)
p2 <- ggplot(smooth_data, aes(x = date)) +
geom_point(aes(y = !!sym(column)), alpha = 0.2, size = 1) +
geom_line(aes(y = loess, color = "LOESS"), linewidth = 1.5) +
geom_line(aes(y = savgol, color = "Savitzky-Golay"), linewidth = 1.5, linetype = "dashed") +
scale_color_manual(
values = c("LOESS" = "#E74C3C", "Savitzky-Golay" = "#2ECC71"),
name = "Smoothing Method"
) +
labs(
title = paste(title, "- Advanced Smoothing"),
x = NULL,
y = str_to_title(str_replace_all(column, "_", " "))
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 12),
legend.position = "right"
)
# Plot 3: Streak Detection
streaks <- self$detect_streaks(column, threshold_std = 1.0, min_length = 5)
p3 <- ggplot(streaks, aes(x = date, y = !!sym(column))) +
geom_line(alpha = 0.5) +
geom_point(alpha = 0.5, size = 2) +
geom_point(
data = filter(streaks, hot_streak),
aes(y = !!sym(column)),
color = "red",
size = 4,
shape = 17
) +
geom_point(
data = filter(streaks, cold_streak),
aes(y = !!sym(column)),
color = "blue",
size = 4,
shape = 25,
fill = "blue"
) +
geom_hline(
yintercept = mean(values, na.rm = TRUE),
linetype = "dashed",
color = "black",
linewidth = 1
) +
labs(
title = paste(title, "- Streak Detection"),
x = "Date",
y = str_to_title(str_replace_all(column, "_", " "))
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 12))
# Combine plots
combined_plot <- grid.arrange(p1, p2, p3, ncol = 1)
if (!is.null(save_path)) {
ggsave(save_path, combined_plot, width = 14, height = 12, dpi = 300)
}
return(combined_plot)
},
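regression_to_mean_analysis = function(column, split_date = NULL, baseline = NULL) {
if (is.null(split_date)) {
split_idx <- nrow(self$data) %/% 2
} else {
split_idx <- sum(self$data$date <= as.Date(split_date))
}
if (split_idx < 1 || split_idx >= nrow(self$data)) {
stop("Split must leave at least one row in each half")
}
first_half <- self$data[1:split_idx, ]
second_half <- self$data[(split_idx + 1):nrow(self$data), ]
# Calculate PA-weighted averages
first_avg <- weighted.mean(
first_half[[column]],
first_half$pa,
na.rm = TRUE
)
second_avg <- weighted.mean(
second_half[[column]],
second_half$pa,
na.rm = TRUE
)
season_avg <- weighted.mean(
self$data[[column]],
self$data$pa,
na.rm = TRUE
)
# Deviations are measured from `baseline` (for example a career or league
# average supplied by the caller). Without one, the season average is used,
# but the season average is the PA-weighted mean of the two halves, so the
# deviations then have opposite signs by construction and regression_percent
# reflects only how the plate appearances are split.
reference <- if (is.null(baseline)) season_avg else baseline
deviation_first <- first_avg - reference
deviation_second <- second_avg - reference
# Share of the first-half deviation that disappeared in the second half
regression_pct <- if (deviation_first != 0) {
(1 - (deviation_second / deviation_first)) * 100
} else {
0
}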
tibble(
metric = column,
first_half_avg = first_avg,
second_half_avg = second_avg,
season_avg = season_avg,
first_half_deviation = deviation_first,
second_half_deviation = deviation_second,
regression_percent = regression_pct
)
}
)
)
# Example Usage
if (interactive()) {
# Helper for the separator lines below: repeat a string n times
`%R%` <- function(s, n) strrep(s, n)
analyzer <- BaseballTrendAnalyzer$new()
cat("=" %R% 60, "\n")
cat("BASEBALL TREND ANALYSIS EXAMPLE\n")
cat("=" %R% 60, "\n\n")
# Load Aaron Judge 2023 data
data <- analyzer$load_player_season("Judge", "Aaron", 2023)
cat(sprintf("Loaded %d days of data\n\n", nrow(data)))
# 1. Rolling Averages
cat("1. ROLLING AVERAGES (xwOBA)\n")
cat("-" %R% 60, "\n")
rolling_xwoba <- analyzer$calculate_rolling_averages("xwOBA", c(7, 14, 30))
print(tail(rolling_xwoba, 10))
# 2. EWMA
cat("\n2. EXPONENTIAL MOVING AVERAGE (Exit Velocity)\n")
cat("-" %R% 60, "\n")
data$ewma_exit_velo <- analyzer$exponential_moving_average("avg_exit_velo", alpha = 0.2)
print(tail(data[, c("date", "avg_exit_velo", "ewma_exit_velo")], 10))
# 3. Streak Detection
cat("\n3. HOT/COLD STREAK DETECTION (xwOBA)\n")
cat("-" %R% 60, "\n")
streaks <- analyzer$detect_streaks("xwOBA", threshold_std = 1.0, min_length = 5)
hot_streaks <- filter(streaks, hot_streak)
cold_streaks <- filter(streaks, cold_streak)
cat(sprintf("Hot streaks detected: %d\n", nrow(hot_streaks)))
if (nrow(hot_streaks) > 0) {
cat("\nHot streak periods:\n")
print(head(hot_streaks[, c("date", "xwOBA", "z_score")]))
}
# 4. Regression to Mean
cat("\n4. REGRESSION TO MEAN ANALYSIS (xwOBA)\n")
cat("-" %R% 60, "\n")
regression_results <- analyzer$regression_to_mean_analysis("xwOBA")
print(t(regression_results))
# 5. Create Visualization
cat("\n5. GENERATING VISUALIZATIONS...\n")
cat("-" %R% 60, "\n")
analyzer$visualize_trends(
"xwOBA",
title = "Aaron Judge 2023 xwOBA Trend Analysis",
save_path = "judge_trend_analysis.png"
)
cat("Saved visualization to 'judge_trend_analysis.png'\n")
}
Real-World Applications
Player Evaluation and Trade Decisions
The Toronto Blue Jays analytics department uses sophisticated trend analysis to identify buy-low and sell-high candidates for trades. When evaluating potential acquisitions, they analyze rolling performance metrics to distinguish players in temporary slumps caused by bad luck (stable underlying metrics such as exit velocity and barrel rate despite poor results) from players in genuine decline (deteriorating peripherals). The club's August 2008 acquisition of José Bautista, whose surface statistics were poor at the time, is the kind of buy-low move this approach is designed to surface.
Trend analysis is crucial for contract timing decisions. Teams monitor aging curves and performance trajectories to identify optimal windows for acquiring or extending players. The Milwaukee Brewers, for example, traded for Christian Yelich before his 2018 MVP season and extended him in 2020; positive trends in launch angle and pull-side power are the kind of signal that can flag such a breakout before it fully materializes in traditional statistics. Conversely, concerning trends such as declining exit velocity or rising chase rates argue against costly extensions.
Fantasy Baseball Strategy
Fantasy analysts leverage trend analysis to identify streaming candidates and waiver wire pickups before mainstream recognition. Examining 7-day and 14-day rolling averages of xwOBA, barrel rate, and hard-hit rate reveals players entering hot streaks while their ownership remains low. The key is identifying indicators that lead outcome metrics: a player whose exit velocity and barrel rate are rising is more likely to see his batting average and home run totals improve in the following weeks.
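The comparison of short and long rolling windows described above can be sketched in base R. This is a minimal illustration on simulated data, not the method of any particular fantasy tool: the `rolling_mean` helper, the simulated `exit_velo` series, and the 1.5 mph margin are all assumptions chosen for the example.

```r
# Trailing (right-aligned) rolling mean using base R only.
# The first k - 1 values are NA because the window is not yet full.
rolling_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

set.seed(42)
# Hypothetical game-by-game average exit velocity (mph):
# 60 games around 89 mph, then 20 games around 93 mph.
exit_velo <- c(rnorm(60, mean = 89, sd = 1), rnorm(20, mean = 93, sd = 1))

short_avg <- rolling_mean(exit_velo, 7)
long_avg  <- rolling_mean(exit_velo, 30)

# Flag games where the short window runs well above the long window.
# The 1.5 mph margin is an arbitrary illustration, not a calibrated threshold.
heating_up <- !is.na(long_avg) & (short_avg - long_avg) > 1.5

# Index of the first flagged game
first_flag <- which(heating_up)[1]
print(first_flag)
```

Because the long window reacts slowly, the gap between the two averages opens a few games after the underlying level changes. A tighter margin flags shifts sooner at the cost of more false alarms.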
Season-long leagues benefit from regression to the mean analysis when evaluating trade proposals. Selling players after extended hot streaks when underlying metrics suggest unsustainability maximizes return. Conversely, buying slumping players whose peripherals remain strong exploits market inefficiencies created by recency bias. Advanced fantasy players create custom dashboards monitoring rolling averages and smoothed trend lines for their entire player pool.
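One common way to put a number on regression to the mean is to blend the observed rate with a baseline, weighting the baseline as if it were a fixed number of extra plate appearances. The sketch below assumes that heuristic; the function name `regress_to_mean`, the `stabilizer` value of 200, and the sample figures are hypothetical and would need calibrating for each metric.

```r
# Shrink an observed rate toward a baseline by adding `stabilizer` plate
# appearances of baseline-level performance:
#   estimate = (observed * pa + baseline * stabilizer) / (pa + stabilizer)
regress_to_mean <- function(observed, pa, baseline, stabilizer) {
  (observed * pa + baseline * stabilizer) / (pa + stabilizer)
}

# Hypothetical hot start: a .450 xwOBA over 100 PA against a .320 baseline.
# The stabilizer of 200 PA is an assumed value for illustration only.
hot_start <- regress_to_mean(observed = 0.450, pa = 100,
                             baseline = 0.320, stabilizer = 200)

# The same .450 over 500 PA is shrunk far less.
sustained <- regress_to_mean(observed = 0.450, pa = 500,
                             baseline = 0.320, stabilizer = 200)

print(round(c(hot_start = hot_start, sustained = sustained), 3))
```

The larger the sample behind a hot streak, the less the estimate is pulled back, which is why short streaks are the ones most worth selling.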
Broadcasting and Media Analysis
Broadcast networks increasingly incorporate trend visualizations to tell compelling stories about player performance. A show such as ESPN's Baseball Tonight can use 30-day rolling average charts to illustrate MVP narratives, showing how candidates have performed throughout the season rather than relying on season totals. These visualizations help casual fans understand performance context and make analysis more engaging than static numbers.
Writers use changepoint detection to structure feature articles around performance inflection points. When a player makes a significant mechanical adjustment or role change, identifying the statistical changepoint provides concrete evidence for narrative claims. Articles discussing breakout seasons gain credibility by showing LOESS smoothed trend lines that reveal gradual improvement preceding the breakout rather than sudden transformation.
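To make the idea of a statistical changepoint concrete, the sketch below finds a single shift in the mean by brute force: it tries every split and keeps the one with the smallest within-segment sum of squares. This is a deliberately simple stand-in, not the PELT method used by `identify_changepoints` above. The function `find_mean_shift`, its `min_seg` argument, and the simulated `xwoba` series are assumptions for illustration.

```r
# Locate one mean shift: return the last index of the first segment for the
# split that minimizes the total within-segment sum of squared errors.
find_mean_shift <- function(x, min_seg = 5) {
  n <- length(x)
  stopifnot(n >= 2 * min_seg)
  candidates <- min_seg:(n - min_seg)
  sse <- sapply(candidates, function(k) {
    left  <- x[1:k]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}

set.seed(1)
# Hypothetical game-level xwOBA: 40 games near .300, then 40 games near .380
xwoba <- c(rnorm(40, mean = 0.300, sd = 0.02),
           rnorm(40, mean = 0.380, sd = 0.02))

cp <- find_mean_shift(xwoba)
print(cp)
```

A detected split only says that the level changed around that game. Tying it to a mechanical adjustment or role change still requires checking the date against what actually happened.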
Advanced Techniques
Multivariate Trend Analysis
Analyzing trends across multiple related metrics simultaneously reveals relationships invisible in univariate analysis. For example, comparing trends in chase rate and strikeout rate helps show whether a rise in strikeouts stems from deteriorating plate discipline or from another cause, such as more swing-and-miss on pitches in the zone. Principal component analysis (PCA) can reduce multiple batted ball metrics (exit velocity, launch angle, barrel rate, hard-hit rate) into composite trend indicators that capture overall quality-of-contact trajectory.
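A composite quality-of-contact indicator of this kind can be sketched with base R's `prcomp`. The data here are simulated from a single hypothetical latent trend, so the column names, coefficients, and noise levels are assumptions rather than real Statcast values.

```r
set.seed(7)
n_games <- 120

# Hypothetical latent contact quality that drifts upward over the season
quality <- seq(-1, 1, length.out = n_games) + rnorm(n_games, sd = 0.3)

# Four simulated batted ball metrics, each a noisy function of that trend
metrics <- data.frame(
  exit_velo    = 89   + 2.0  * quality + rnorm(n_games, sd = 1),
  launch_angle = 12   + 1.5  * quality + rnorm(n_games, sd = 2),
  barrel_rate  = 0.08 + 0.02 * quality + rnorm(n_games, sd = 0.01),
  hard_hit     = 0.40 + 0.05 * quality + rnorm(n_games, sd = 0.03)
)

# Standardize first: the metrics are on very different scales
pca <- prcomp(metrics, center = TRUE, scale. = TRUE)

# The first principal component serves as the composite indicator.
# Its sign is arbitrary, so orient it so that higher means more exit velocity.
composite <- pca$x[, 1]
if (pca$rotation["exit_velo", 1] < 0) composite <- -composite

# Share of total variance captured by the first component
var_explained <- pca$sdev[1]^2 / sum(pca$sdev^2)
print(round(var_explained, 2))
```

The composite can then be smoothed or fed to a rolling average like any single metric. If the first component explains only a small share of the variance, the metrics are not moving together and a single indicator would hide more than it shows.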
Bayesian Structural Time Series
Bayesian structural time series models decompose performance into trend, seasonal, and irregular components while quantifying uncertainty. These models provide probabilistic forecasts with credible intervals, enabling more nuanced decision-making. They also handle missing data gracefully and allow incorporation of external predictors like opponent quality or weather conditions that influence performance trends.
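The structure of such a model can be previewed without a Bayesian package. The sketch below uses base R's `StructTS`, which fits a local level model by maximum likelihood, so it is a non-Bayesian stand-in: the intervals are approximate normal prediction intervals, not posterior credible intervals. A fully Bayesian fit would need a dedicated package such as bsts. The simulated series and its parameters are assumptions.

```r
set.seed(11)
n_games <- 100

# Hypothetical exit velocity (mph): a slowly drifting true level plus noise
level     <- 89 + cumsum(rnorm(n_games, sd = 0.15))
exit_velo <- ts(level + rnorm(n_games, sd = 1.0))

# Local level model: observation = level + noise, level follows a random walk.
# Fitted by maximum likelihood, not by Bayesian sampling.
fit <- StructTS(exit_velo, type = "level")

# Forecast the next 7 games with standard errors
fc <- predict(fit, n.ahead = 7)

# Approximate 95% prediction interval under a normal assumption
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se

print(round(cbind(forecast = fc$pred, lower = lower, upper = upper), 2))
```

The interval is the useful part of the output: a forecast whose band covers both average and elite performance says much less than the point estimate alone suggests.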
Machine Learning for Trend Prediction
Gradient boosting machines and neural networks can learn complex nonlinear relationships between current trends and future performance. Training models on historical player trend patterns enables prediction of which current trends are likely to persist versus revert. Features include recent rolling averages, trend slopes, volatility measures, age, and historical performance. Regular retraining ensures models adapt to evolving baseball environments.
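Before any model is trained, each player's recent history has to be turned into features like those listed above. The sketch below builds three of them (recent mean, trend slope, and volatility) for one series in base R. The function `trend_features`, the 14-game window, and the sample values are hypothetical, and model fitting itself is left out.

```r
# Summarize the last `window` observations of a series as model features.
trend_features <- function(x, window = 14) {
  recent <- tail(x, window)
  t_idx  <- seq_along(recent)
  c(
    recent_mean = mean(recent),
    # Slope of a least-squares line through the window (change per game)
    trend_slope = unname(coef(lm(recent ~ t_idx))[2]),
    # Standard deviation within the window as a simple volatility measure
    volatility  = sd(recent)
  )
}

# Hypothetical game-level xwOBA values for one player
xwoba <- c(0.290, 0.310, 0.300, 0.280, 0.320, 0.330, 0.310,
           0.340, 0.350, 0.330, 0.360, 0.370, 0.350, 0.380,
           0.390, 0.370)

feats <- trend_features(xwoba, window = 14)
print(round(feats, 4))
```

Stacking one such row per player per date, together with age and prior-season performance, gives the training table that a gradient boosting or neural network model would consume.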
Key Takeaways
- Trend analysis reveals temporal performance dynamics invisible in season-long statistics, enabling better player evaluation, trade timing, and fantasy decisions.
- Rolling averages smooth noisy baseball data, and the window size trades responsiveness against stability: 7-day windows detect recent shifts, while 30-day windows establish baseline performance.
- Advanced smoothing techniques like LOESS and Savitzky-Golay filters preserve important performance inflection points while reducing noise more effectively than simple moving averages.
- Statistical streak detection using z-scores and changepoint algorithms separates genuine performance shifts from random variation, preventing overreaction to small samples.
- Regression to the mean is pervasive: extreme early-season performance tends to moderate toward a player's established talent level, so accounting for it is essential for accurate forecasting and valuation.
- Combining multiple analytical approaches (rolling averages, smoothing, changepoint detection, underlying metrics) provides more robust insights than any single method alone.
- Visualization is critical for communicating trend analysis insights to stakeholders, with interactive dashboards enabling exploration of different time windows and metrics.
Conclusion
Trend analysis is one of the most powerful tools in modern baseball analytics, turning static seasonal statistics into dynamic performance narratives. By applying statistical methods to time-ordered data, analysts gain insight into player development, identify optimal decision timing, and make more accurate predictions about future performance. As baseball analytics continues to evolve, trend analysis is likely to draw more heavily on machine learning, Bayesian methods, and real-time data streams to give a more granular and actionable view of how player performance changes over time.