Expected Stats: xBA, xSLG, xwOBA

Intermediate · 10 min read · Nov 26, 2025

Expected Statistics: xBA, xSLG, and Advanced Batted Ball Metrics

Expected statistics represent one of the most significant innovations in modern baseball analytics, leveraging Statcast's tracking technology to measure what "should have" happened based on the quality of contact rather than what actually occurred. Unlike traditional stats that measure only outcomes, expected statistics (xStats) evaluate the process—specifically, the exit velocity and launch angle of batted balls—to estimate a player's true talent level independent of defensive positioning, luck, and ballpark factors.

Built on the Statcast tracking system that Major League Baseball deployed in 2015, expected statistics use historical data from comparable batted balls to estimate the likelihood of each outcome. A ball hit 105 mph at a 28-degree launch angle, for instance, has historically become a hit approximately 80% of the time and carries a correspondingly high expected slugging value. By aggregating these probabilities across all of a player's batted balls, we can calculate their expected batting average (xBA), expected slugging percentage (xSLG), and expected weighted on-base average (xwOBA).

The power of expected statistics lies in their predictive validity and ability to identify players who have been lucky or unlucky. A hitter with a .320 batting average but only a .280 xBA has likely benefited from fortunate batted ball placement and defensive positioning—their performance may regress toward their xBA in subsequent seasons. Conversely, a player hitting .240 with a .280 xBA has been victimized by bad luck or excellent defense and represents a potential breakout candidate. This makes xStats invaluable for player evaluation, trade analysis, and identifying market inefficiencies.

Understanding Expected Statistics

Expected statistics are calculated using a similarity score algorithm that compares each batted ball to all comparable batted balls in the Statcast database (dating back to 2015). The algorithm considers exit velocity (the speed of the ball off the bat) and launch angle (the vertical angle at which the ball leaves the bat) as the primary inputs, with additional factors like sprint speed occasionally incorporated for more advanced calculations.

For each batted ball, Statcast identifies all historically similar batted balls—those with nearly identical exit velocity and launch angle combinations—and determines what happened on those plays. If 1,000 batted balls with 95 mph exit velocity at a 15-degree launch angle resulted in hits 45% of the time, then a new batted ball with those characteristics receives an expected batting average of .450 for that specific contact event. The player's season-long xBA is the average of these individual expected values across all their batted balls.

This methodology makes expected statistics remarkably robust because they're based on empirical outcomes from thousands of actual baseball games rather than theoretical models. The approach assumes that batted balls with similar physical characteristics will produce similar results over large samples, which has proven highly accurate. Expected stats filter out randomness—whether a hard-hit ball finds a gap or hits directly at a fielder—and defensive variance, focusing purely on what the hitter controlled: the quality of contact.

Key Components

  • Exit Velocity (EV): The speed of the baseball as it comes off the bat, measured in miles per hour. Exit velocity is the single most important predictor of offensive success. The MLB average is approximately 88-89 mph, while elite hitters consistently post average exit velocities above 91-92 mph. Batted balls at 95 mph or higher are classified as "hard-hit" balls.
  • Launch Angle (LA): The vertical angle at which the ball leaves the bat, measured in degrees. Launch angles between 8 and 32 degrees constitute the "sweet spot" range that produces the highest batting averages and slugging percentages. Ground balls have negative or low launch angles (below 10 degrees), line drives range from 10 to 25 degrees, fly balls from 25 to 50 degrees, and pop-ups exceed 50 degrees.
  • Expected Batting Average (xBA): Estimates what a player's batting average should have been based on the quality and type of contact, removing defense and luck from the equation. Calculated by averaging the expected outcome of each batted ball event. A player with a .270 xBA made contact quality typically associated with a .270 average, regardless of their actual average.
  • Expected Slugging Percentage (xSLG): Predicts slugging percentage using the same similarity score methodology, accounting for the likelihood of singles, doubles, triples, and home runs based on exit velocity and launch angle. Particularly valuable for identifying power breakout candidates who are hitting the ball with authority but haven't seen results yet.
  • Expected Weighted On-Base Average (xwOBA): The most comprehensive expected statistic, combining all offensive outcomes (including walks and strikeouts for some versions) weighted by their run value. xwOBA is scaled to on-base percentage, where .320 is average, .370 is excellent, and .400+ is elite. It's the best single expected metric for overall offensive evaluation.
  • Barrel Rate: The percentage of batted balls classified as "barrels"—the optimal combination of exit velocity and launch angle that produces a minimum .500 batting average and 1.500 slugging percentage. Generally requires 98+ mph exit velocity with launch angles between 26-30 degrees, though the acceptable angle range expands with higher exit velocities.
  • Hard-Hit Rate: The percentage of batted balls with exit velocity of 95 mph or greater. Highly correlates with offensive production and shows strong year-to-year stability, making it an excellent skill metric for evaluation and projection.
  • Sweet Spot Percentage: The rate of batted balls with launch angles between 8-32 degrees, the range producing optimal batting results. Players with high sweet spot percentages tend to be more consistent hitters who avoid excessive pop-ups and weak ground balls.
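The thresholds in the list above can be turned into a small labeling helper. This is a minimal sketch: the function name `describe_batted_ball` is hypothetical, and the way it treats boundary values (for example, exactly 10 or 25 degrees) is an assumption, since the ranges above overlap at their endpoints.

```python
def describe_batted_ball(exit_velocity, launch_angle):
    """Label one batted ball using the thresholds listed above.

    Boundary handling is an assumption: 10 degrees counts as a line
    drive and 25 degrees as a fly ball.
    """
    if launch_angle < 10:
        bb_type = "ground_ball"
    elif launch_angle < 25:
        bb_type = "line_drive"
    elif launch_angle <= 50:
        bb_type = "fly_ball"
    else:
        bb_type = "pop_up"

    return {
        "type": bb_type,
        "hard_hit": exit_velocity >= 95,         # 95 mph or greater
        "sweet_spot": 8 <= launch_angle <= 32,   # 8 to 32 degrees
    }


print(describe_batted_ball(102, 22))
print(describe_batted_ball(75, 5))
```

Note that a ball can be a ground ball by type and still fall in the sweet spot range (8 or 9 degrees), because the two classifications use different cutoffs.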

Expected Batting Average (xBA) Formula and Methodology

xBA = Σ(Probability of Hit for Each Batted Ball Event) / Total Batted Ball Events

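Where: Probability of Hit = % of Similar Batted Balls (by EV & LA) That Became Hits

Note: this formula gives xBA on contact. The season xBA published on leaderboards also counts strikeouts as at-bats with zero hit probability, so its denominator is at-bats rather than batted ball events, consistent with the xSLG formula later in this article.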

For each batted ball, the algorithm searches the Statcast database for batted balls with similar exit velocity and launch angle. This article uses windows of ±2 mph and ±2 degrees for illustration; the production model's exact matching rules may differ. It then calculates what percentage of those similar batted balls resulted in hits (singles, doubles, triples, or home runs). This percentage becomes that specific batted ball's expected batting average contribution.

For example, if a player hits a ball 102 mph at a 22-degree angle, the algorithm finds all batted balls from 100-104 mph at 20-24 degrees and discovers that 65% resulted in hits. That specific batted ball receives an xBA of .650. If the player's next batted ball—a weak ground ball at 75 mph and 5 degrees—matches historical batted balls that became hits only 15% of the time, it receives an xBA of .150. Averaging these individual xBA values across all the player's batted balls produces their seasonal xBA.
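The window lookup and averaging described above can be sketched in a few lines. This is an illustration only, not MLB's implementation: the function names `hit_probability` and `average_xba` are hypothetical, and `toy_history` is made-up data standing in for the Statcast database.

```python
def hit_probability(ev, la, history, ev_tol=2.0, la_tol=2.0):
    """Share of comparable historical batted balls that became hits.

    history: iterable of (exit_velocity, launch_angle, was_hit) tuples.
    Returns None when no historical ball falls inside the window.
    """
    similar = [
        was_hit
        for h_ev, h_la, was_hit in history
        if abs(h_ev - ev) <= ev_tol and abs(h_la - la) <= la_tol
    ]
    if not similar:
        return None
    return sum(similar) / len(similar)


def average_xba(per_ball_xba):
    """Average the per-batted-ball hit probabilities."""
    return sum(per_ball_xba) / len(per_ball_xba)


# Made-up history: (exit velocity, launch angle, became a hit?)
toy_history = [
    (101, 21, True), (102, 22, True), (103, 23, False), (104, 20, True),
    (75, 5, False), (76, 4, False), (74, 6, True), (90, 40, False),
]

print(hit_probability(102, 22, toy_history))   # 0.75 (3 of 4 comparable balls)
print(round(average_xba([0.650, 0.150]), 3))   # 0.4, the two balls from the example
```

With only two batted balls the average is noisy; the same averaging over a full season of contact is what produces a usable figure.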

MLB refines this calculation by incorporating additional granularity for extreme combinations. Very high exit velocities combined with optimal launch angles may have smaller historical samples, so the algorithm uses sophisticated statistical techniques to estimate probabilities. Sprint speed can also be factored in for batted balls where speed significantly affects outcomes, though the standard xBA calculation focuses solely on exit velocity and launch angle to maintain simplicity and consistency.

Expected Slugging Percentage (xSLG) Calculation

xSLG = Σ(Expected Total Bases per Batted Ball Event) / Total At-Bats

Expected Total Bases = (1 × P(Single)) + (2 × P(Double)) + (3 × P(Triple)) + (4 × P(Home Run))

Expected slugging percentage extends the xBA methodology by predicting not just whether a batted ball becomes a hit, but what type of hit it becomes. The algorithm examines similar historical batted balls and calculates the probability of each outcome: single, double, triple, or home run. These probabilities are then weighted by their base values (1, 2, 3, and 4 respectively) and summed to generate expected total bases for that batted ball event.

Exit velocity and launch angle combinations have distinctive outcome profiles. A ball hit 100 mph at a 15-degree angle might have a 30% chance of being a single, 25% chance of being a double, 2% chance of being a triple, and 1% chance of being a home run (with the remaining 42% being outs). This produces an expected total base value of: (1×0.30) + (2×0.25) + (3×0.02) + (4×0.01) = 0.90 total bases. A 108 mph batted ball at 28 degrees might generate 2.5 expected total bases due to its high home run probability.
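The arithmetic in the example above can be checked with a short sketch. The function names `expected_total_bases` and `expected_slg` are hypothetical, and the outcome probabilities are the illustrative ones from the text, not real Statcast values.

```python
BASE_VALUES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def expected_total_bases(hit_probs):
    """Weight each hit type's probability by its base value and sum.

    hit_probs maps hit types to probabilities; the out probability is
    implied (it contributes zero bases) and need not be passed.
    """
    return sum(BASE_VALUES[outcome] * p for outcome, p in hit_probs.items())


def expected_slg(per_ball_total_bases, at_bats):
    """Sum expected total bases and divide by at-bats."""
    return sum(per_ball_total_bases) / at_bats


ball = {"single": 0.30, "double": 0.25, "triple": 0.02, "home_run": 0.01}
print(round(expected_total_bases(ball), 2))  # 0.9

# Two batted balls (0.90 and 2.5 expected bases) plus one strikeout = 3 at-bats
print(round(expected_slg([0.90, 2.5], at_bats=3), 3))
```

Dividing by at-bats rather than batted balls means a strikeout lowers xSLG even though it produces no batted ball event.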

xSLG proves particularly valuable for identifying power hitters who are hitting the ball with tremendous authority but experiencing bad luck on their fly balls. A player with a .450 slugging percentage but a .520 xSLG is hitting the ball like a premium power hitter but has been victimized by balls dying at the warning track or being caught by excellent outfield defense. This player represents an excellent buy-low candidate who should see power improvements with continued quality contact.

Expected Weighted On-Base Average (xwOBA)

xwOBA = (Σ(Expected wOBA Value per Batted Ball) + wBB × Walks + wHBP × HBP) / (At-Bats + Walks + HBP + SF)

Scale: .320 = Average, .340 = Above Average, .370 = Excellent, .400+ = Elite

Expected weighted on-base average represents the most sophisticated and comprehensive expected statistic, combining the predictive power of xBA and xSLG while incorporating the run-value weighting system that makes wOBA superior to traditional rate stats. Like actual wOBA, xwOBA assigns different weights to different offensive outcomes based on their contribution to run scoring: walks carry a weight of about 0.69, singles 0.88, doubles 1.24, triples 1.56, and home runs 2.08. These are linear weights rescaled to the on-base percentage scale rather than literal run values, and they vary slightly by season based on run environment.

For each batted ball, the algorithm calculates expected outcomes as with xSLG, but then applies wOBA weights to generate an expected wOBA value. A batted ball with a 50% chance of being a single (weight 0.88), 10% chance of being a double (weight 1.24), and 40% chance of being an out (weight 0) generates an expected wOBA contribution of: (0.50 × 0.88) + (0.10 × 1.24) + (0.40 × 0) = 0.564. These individual values are averaged across all plate appearances to produce xwOBA.
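A short sketch makes the weighting concrete. The weights are the approximate values quoted above. The function names are hypothetical, and the season-level helper is deliberately simplified: it assumes walks are credited at their actual count and ignores hit-by-pitches and sacrifice flies.

```python
# Approximate weights quoted in the text; real weights vary by season
WOBA_WEIGHTS = {
    "walk": 0.69, "single": 0.88, "double": 1.24,
    "triple": 1.56, "home_run": 2.08,
}


def expected_woba_value(outcome_probs):
    """Expected wOBA contribution of one batted ball (outs weigh zero)."""
    return sum(WOBA_WEIGHTS[outcome] * p for outcome, p in outcome_probs.items())


def simple_xwoba(contact_values, walks, strikeouts):
    """Simplified xwOBA: expected contact value plus actual walks,
    divided by plate appearances (HBP and SF ignored in this sketch)."""
    numerator = sum(contact_values) + WOBA_WEIGHTS["walk"] * walks
    denominator = len(contact_values) + walks + strikeouts
    return numerator / denominator


value = expected_woba_value({"single": 0.50, "double": 0.10})
print(round(value, 3))  # 0.564, matching the worked example

# One such batted ball, one sure out on contact, one walk, one strikeout
print(round(simple_xwoba([value, 0.0], walks=1, strikeouts=1), 4))
```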

The beauty of xwOBA lies in its completeness—it accounts for both contact quality (through exit velocity and launch angle) and contact frequency, providing a single number that captures a hitter's overall offensive process. Unlike xBA and xSLG which only evaluate batted balls, many xwOBA calculations include actual walk and strikeout data, recognizing that these are true outcomes rather than luck-dependent events. This makes xwOBA the premier metric for identifying a player's "deserved" performance level and forecasting future production. Research shows xwOBA has stronger year-to-year correlation than actual wOBA, confirming its superior predictive validity.

Barrel Rate and Sweet Spot Percentage

Barrel Rate = Barrels / Batted Ball Events × 100

Barrel Definition: 98+ mph EV with 26-30° LA (range expands with higher EV)

Sweet Spot % = Batted Balls with 8-32° Launch Angle / Total Batted Balls × 100

Barrel Rate quantifies how frequently a hitter achieves the ideal combination of exit velocity and launch angle, which Statcast defines as a "barrel." Barreled balls represent the absolute best contact a hitter can make, historically producing a minimum .500 batting average and 1.500 slugging percentage (often much higher). The classification is graduated by exit velocity. As a simplified approximation: 98-99 mph requires 26-30 degrees, 100-101 mph allows 25-31 degrees, 102-103 mph allows 24-33 degrees, and so on up to 116+ mph, which allows 15-50 degrees. Statcast's published definition widens the launch angle window with each additional mph rather than in fixed bands, so these cutoffs should be treated as approximate.

Elite hitters consistently barrel the ball at rates exceeding 15%, while league-average hitters barrel approximately 6-8% of their batted balls. Barrel rate shows excellent predictive validity for power production and correlates strongly with home run totals, ISO (isolated power), and slugging percentage. A hitter with a high barrel rate but low home run total likely plays in a pitcher-friendly park or has experienced bad luck—their power production should improve with continued barreling ability. Barrel rate also demonstrates good year-to-year stability, making it a reliable skill metric for player evaluation.

Sweet Spot Percentage measures how often a hitter achieves launch angles between 8-32 degrees, the range producing optimal batting outcomes. This range generates high batting averages while maintaining reasonable power—it includes hard-hit line drives, optimal fly balls, and well-struck ground balls while excluding weak pop-ups and excessive ground balls. Players with sweet spot percentages above 35% tend to be consistent, well-rounded hitters who make high-quality contact regularly. Combined with hard-hit rate, sweet spot percentage provides insight into a hitter's contact quality profile and approach sustainability.

Actual vs Expected Differentials: Identifying Luck

The difference between a player's actual statistics and their expected statistics reveals how much luck and defensive variance influenced their results. Large positive differentials (actual stats significantly exceeding expected stats) indicate fortunate outcomes that may not sustain, while large negative differentials suggest unlucky results likely to improve with continued quality contact.

A player hitting .310 with a .270 xBA has a +40 point differential, indicating they've been fortunate—perhaps hitting many soft liners that fell in, benefiting from poor defensive positioning, or seeing batted balls find holes at higher-than-expected rates. This player faces strong regression risk and may see their batting average decline toward their xBA even with unchanged contact quality. Conversely, a player hitting .240 with a .280 xBA has a -40 point differential, suggesting they've been victimized by excellent defensive plays, unfavorable positioning, or simple bad luck. This player represents a buy-low opportunity with strong rebound potential.

Research demonstrates that players with large differentials (greater than 20-30 points in either direction) typically regress toward their expected stats in subsequent seasons. Expected stats prove more predictive of future performance than actual stats precisely because they isolate skill from luck. Fantasy baseball analysts and MLB front offices extensively use xStats differentials to identify breakout and bust candidates, find undervalued trade targets, and make evidence-based projection adjustments.
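The differential logic above reduces to a subtraction and a threshold check. In this sketch the function name `luck_verdict` and the verdict labels are hypothetical, and the 30-point default is an assumption taken from the upper end of the 20-30 point range mentioned above.

```python
def luck_verdict(actual, expected, threshold_points=30):
    """Return (differential in points, verdict) for one player.

    One "point" is .001 of batting average, slugging, or wOBA.
    """
    diff_points = round((actual - expected) * 1000)
    if diff_points > threshold_points:
        verdict = "regression risk"
    elif diff_points < -threshold_points:
        verdict = "rebound candidate"
    else:
        verdict = "in line with contact quality"
    return diff_points, verdict


print(luck_verdict(0.310, 0.270))  # (40, 'regression risk')
print(luck_verdict(0.240, 0.280))  # (-40, 'rebound candidate')
print(luck_verdict(0.275, 0.270))  # (5, 'in line with contact quality')
```

Rounding to whole points before comparing avoids floating-point surprises when a differential sits right at the threshold.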

Year-to-Year Stability and Predictive Power

Metric          | Year-to-Year Correlation | Predictive Validity    | Sample Size Needed
Batting Average | r = 0.32                 | Weak (luck-dependent)  | ~500 PA
xBA             | r = 0.50                 | Strong (process-based) | ~300 PA
Slugging %      | r = 0.42                 | Moderate               | ~450 PA
xSLG            | r = 0.55                 | Strong                 | ~300 PA
wOBA            | r = 0.45                 | Moderate-Strong        | ~400 PA
xwOBA           | r = 0.58                 | Very Strong            | ~250 PA
Barrel Rate     | r = 0.60                 | Very Strong for Power  | ~200 PA
Hard-Hit Rate   | r = 0.65                 | Very Strong            | ~150 PA

Expected statistics demonstrate significantly stronger year-to-year correlations than their actual counterparts, which suggests they better capture true talent level. The correlation coefficient (r) here compares each player's value in one season with the same metric in the following season: values closer to 1 indicate more stability and less random variation. Hard-hit rate shows the highest stability (r = 0.65), followed by barrel rate (r = 0.60) and xwOBA (r = 0.58), all substantially higher than traditional metrics like batting average (r = 0.32).

This superior stability exists because expected stats measure process (quality of contact) rather than results (where balls land). A hitter's ability to generate high exit velocities and optimal launch angles represents genuine skill that persists across seasons, while their batting average on balls in play (BABIP) fluctuates significantly based on defensive positioning, luck, and random variance. Furthermore, expected stats stabilize more quickly—meaningful xwOBA estimates can be drawn from 250 plate appearances, while batting average requires 500+ PA to stabilize.
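A year-to-year correlation is a Pearson correlation between the same players' values in consecutive seasons. The sketch below shows the calculation in plain Python; the function name `pearson_r` is hypothetical and the four player values are invented for illustration, so the resulting r says nothing about real metrics.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between paired samples of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented xwOBA values for the same four hitters in back-to-back seasons
season_1 = [0.300, 0.320, 0.350, 0.380]
season_2 = [0.310, 0.315, 0.340, 0.390]

print(round(pearson_r(season_1, season_2), 3))
```

In practice the pairs would come from every hitter who met a plate appearance minimum in both seasons, which is where sample-size thresholds like those in the table matter.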

For projection systems and player evaluation, this makes expected stats invaluable. When forecasting a player's future performance, giving more weight to their xwOBA than their actual wOBA produces more accurate predictions. When evaluating trade targets or free agents, examining expected stats helps distinguish genuine skill changes from statistical noise and identifies players whose results should change even without any actual improvement or decline in ability.

Python Implementation


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast_batter, batting_stats, playerid_lookup
import warnings
warnings.filterwarnings('ignore')

# Set visualization parameters
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

class ExpectedStatsAnalyzer:
    """
    Comprehensive expected statistics analysis class.
    """

    def __init__(self, year=2023):
        """Initialize with season year."""
        self.year = year
        self.data = None
        self.player_data = None

    def fetch_season_expected_stats(self):
        """
        Fetch season-level expected statistics for all qualified hitters.

        Returns:
        DataFrame with actual and expected statistics
        """
        print(f"Fetching {self.year} season statistics with expected metrics...")

        # Get FanGraphs season batting stats; this assumes a pybaseball
        # version whose output includes xBA, xSLG, and xwOBA columns
        season_stats = batting_stats(self.year)

        # Filter to qualified hitters (502 PA)
        qualified = season_stats[season_stats['PA'] >= 502].copy()

        # Calculate differentials (actual minus expected)
        qualified['BA_diff'] = qualified['AVG'] - qualified['xBA']
        qualified['SLG_diff'] = qualified['SLG'] - qualified['xSLG']
        qualified['wOBA_diff'] = qualified['wOBA'] - qualified['xwOBA']

        # Select relevant columns, skipping any the data source does not
        # provide (for example, 'Sweet Spot%' may be absent)
        wanted = [
            'Name', 'Team', 'PA', 'AVG', 'xBA', 'BA_diff',
            'SLG', 'xSLG', 'SLG_diff', 'wOBA', 'xwOBA', 'wOBA_diff',
            'Barrel%', 'HardHit%', 'EV', 'LA', 'Sweet Spot%'
        ]
        self.data = qualified[[c for c in wanted if c in qualified.columns]].copy()

        return self.data

    def identify_regression_candidates(self, threshold=0.030):
        """
        Identify players likely to regress or improve based on xStats differentials.

        Parameters:
        threshold: Minimum differential to flag (default 0.030 = 30 points)

        Returns:
        Dictionary with overperformers and underperformers
        """
        if self.data is None:
            self.fetch_season_expected_stats()

        # Players overperforming (regression candidates)
        overperformers = self.data[
            self.data['wOBA_diff'] > threshold
        ].sort_values('wOBA_diff', ascending=False)

        # Players underperforming (breakout candidates)
        underperformers = self.data[
            self.data['wOBA_diff'] < -threshold
        ].sort_values('wOBA_diff')

        print(f"\n{'='*70}")
        print(f"REGRESSION CANDIDATES (Actual >> Expected)")
        print(f"{'='*70}")
        print(f"{'Player':<20} {'Team':<6} {'wOBA':>6} {'xwOBA':>6} {'Diff':>6} {'Verdict':<20}")
        print(f"{'-'*70}")

        for _, player in overperformers.head(10).iterrows():
            verdict = "Strong Regression Risk"
            print(f"{player['Name']:<20} {player['Team']:<6} {player['wOBA']:>6.3f} "
                  f"{player['xwOBA']:>6.3f} {player['wOBA_diff']:>6.3f} {verdict:<20}")

        print(f"\n{'='*70}")
        print(f"BREAKOUT CANDIDATES (Expected >> Actual)")
        print(f"{'='*70}")
        print(f"{'Player':<20} {'Team':<6} {'wOBA':>6} {'xwOBA':>6} {'Diff':>6} {'Verdict':<20}")
        print(f"{'-'*70}")

        for _, player in underperformers.head(10).iterrows():
            verdict = "Strong Breakout Potential"
            print(f"{player['Name']:<20} {player['Team']:<6} {player['wOBA']:>6.3f} "
                  f"{player['xwOBA']:>6.3f} {player['wOBA_diff']:>6.3f} {verdict:<20}")

        return {
            'overperformers': overperformers,
            'underperformers': underperformers
        }

    def visualize_expected_vs_actual(self, metric='wOBA'):
        """
        Create scatter plot comparing actual vs expected statistics.

        Parameters:
        metric: Which stat to compare ('wOBA', 'BA', or 'SLG')
        """
        if self.data is None:
            self.fetch_season_expected_stats()

        # Set up the plot
        fig, ax = plt.subplots(figsize=(12, 10))

        # Map metric to column names
        actual_col = {'wOBA': 'wOBA', 'BA': 'AVG', 'SLG': 'SLG'}[metric]
        expected_col = {'wOBA': 'xwOBA', 'BA': 'xBA', 'SLG': 'xSLG'}[metric]
        diff_col = {'wOBA': 'wOBA_diff', 'BA': 'BA_diff', 'SLG': 'SLG_diff'}[metric]

        # Create scatter plot
        scatter = ax.scatter(
            self.data[expected_col],
            self.data[actual_col],
            c=self.data[diff_col],
            cmap='RdYlGn_r',  # Red for overperformers, green for underperformers
            s=100,
            alpha=0.6,
            edgecolors='black',
            linewidth=0.5
        )

        # Add diagonal line (y=x) representing perfect agreement
        min_val = min(self.data[expected_col].min(), self.data[actual_col].min())
        max_val = max(self.data[expected_col].max(), self.data[actual_col].max())
        ax.plot([min_val, max_val], [min_val, max_val],
                'k--', alpha=0.3, linewidth=2, label='Perfect Agreement')

        # Add colorbar
        cbar = plt.colorbar(scatter, ax=ax)
        cbar.set_label(f'{metric} Differential (Actual - Expected)', fontsize=12)

        # Labels and formatting
        ax.set_xlabel(f'Expected {metric} (x{metric})', fontsize=14, fontweight='bold')
        ax.set_ylabel(f'Actual {metric}', fontsize=14, fontweight='bold')
        ax.set_title(f'Actual vs Expected {metric} - {self.year} Season\n'
                    f'Qualified Hitters (502+ PA)', fontsize=16, fontweight='bold')
        ax.legend(fontsize=11)
        ax.grid(True, alpha=0.3)

        # Annotate outliers
        diff_threshold = self.data[diff_col].std() * 1.5
        outliers = self.data[
            (self.data[diff_col] > diff_threshold) |
            (self.data[diff_col] < -diff_threshold)
        ]

        for _, player in outliers.iterrows():
            ax.annotate(
                player['Name'].split()[-1],  # Last name only
                xy=(player[expected_col], player[actual_col]),
                xytext=(5, 5),
                textcoords='offset points',
                fontsize=8,
                alpha=0.7
            )

        plt.tight_layout()
        return fig

    def analyze_player_batted_balls(self, last_name, first_name):
        """
        Detailed analysis of individual player's batted ball data.

        Parameters:
        last_name: Player's last name
        first_name: Player's first name

        Returns:
        DataFrame with batted ball data and calculated metrics
        """
        # Look up player ID
        player_lookup = playerid_lookup(last_name, first_name)
        if len(player_lookup) == 0:
            raise ValueError(f"Player {first_name} {last_name} not found")

        player_id = player_lookup.iloc[0]['key_mlbam']

        # Fetch Statcast data
        print(f"Fetching batted ball data for {first_name} {last_name} ({self.year})...")
        data = statcast_batter(f'{self.year}-01-01', f'{self.year}-12-31', player_id)

        # Filter to batted balls
        batted_balls = data[
            data['type'].isin(['X']) &  # Batted ball events
            data['launch_speed'].notna() &
            data['launch_angle'].notna()
        ].copy()

        # Calculate barrel classification
        batted_balls['is_barrel'] = self._classify_barrels(
            batted_balls['launch_speed'],
            batted_balls['launch_angle']
        )

        # Calculate sweet spot
        batted_balls['is_sweet_spot'] = (
            (batted_balls['launch_angle'] >= 8) &
            (batted_balls['launch_angle'] <= 32)
        )

        # Calculate hard-hit
        batted_balls['is_hard_hit'] = batted_balls['launch_speed'] >= 95

        # Aggregate metrics. Everything here is per batted ball, so the
        # averages are "on contact" figures that exclude strikeouts.
        n_bb = len(batted_balls)
        if n_bb == 0:
            raise ValueError(f"No batted ball data for {first_name} {last_name} in {self.year}")
        hits = batted_balls['events'].isin(['single', 'double', 'triple', 'home_run'])

        def mean_if_present(col):
            # Some Statcast exports lack certain estimated_* columns
            return batted_balls[col].mean() if col in batted_balls.columns else np.nan

        metrics = {
            'Player': f"{first_name} {last_name}",
            'Batted Balls': n_bb,
            'Avg Exit Velo': batted_balls['launch_speed'].mean(),
            'Max Exit Velo': batted_balls['launch_speed'].max(),
            'Avg Launch Angle': batted_balls['launch_angle'].mean(),
            'Barrel Rate': batted_balls['is_barrel'].mean() * 100,
            'Hard-Hit Rate': batted_balls['is_hard_hit'].mean() * 100,
            'Sweet Spot %': batted_balls['is_sweet_spot'].mean() * 100,
            'BA on Contact': hits.mean(),
            'xBA on Contact': mean_if_present('estimated_ba_using_speedangle'),
            'xSLG on Contact': mean_if_present('estimated_slg_using_speedangle'),
            'xwOBA on Contact': mean_if_present('estimated_woba_using_speedangle')
        }

        self.player_data = batted_balls

        # Print summary
        print(f"\n{first_name} {last_name} - {self.year} Batted Ball Analysis")
        print(f"{'='*60}")
        for key, value in metrics.items():
            if isinstance(value, (int, np.integer)):
                print(f"{key:<25}: {value:>8}")
            elif isinstance(value, str):
                print(f"{key:<25}: {value:>8}")
            else:
                print(f"{key:<25}: {value:>8.3f}")

        return batted_balls

    def _classify_barrels(self, exit_velo, launch_angle):
        """
        Classify batted balls as barrels using Statcast definition.

        Parameters:
        exit_velo: Series of exit velocities
        launch_angle: Series of launch angles

        Returns:
        Boolean series indicating barrel classification
        """
        # Step approximation of the barrel window. Each tuple is
        # (min_ev inclusive, max_ev exclusive, min_la, max_la). Half-open
        # velocity bins avoid gaps for fractional readings such as 99.5 mph.
        ev_ranges = [
            (98, 100, 26, 30),
            (100, 102, 25, 31),
            (102, 104, 24, 33),
            (104, 107, 23, 34),
            (107, 109, 22, 35),
            (109, 116, 20, 40),
            (116, np.inf, 15, 50)  # 116+ mph
        ]

        # Start with all False, then OR in each velocity band
        is_barrel = pd.Series(False, index=exit_velo.index)
        for min_ev, max_ev, min_la, max_la in ev_ranges:
            is_barrel |= (
                (exit_velo >= min_ev) &
                (exit_velo < max_ev) &
                (launch_angle >= min_la) &
                (launch_angle <= max_la)
            )

        return is_barrel

    def build_expected_stats_model(self):
        """
        Build simplified expected statistics model using linear regression.

        Returns:
        Trained model and performance metrics
        """
        if self.data is None:
            self.fetch_season_expected_stats()

        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

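        # Prepare features and target, dropping rows with any missing value
        # so that features and target stay aligned
        feature_cols = ['EV', 'LA', 'Barrel%', 'HardHit%']
        model_data = self.data[feature_cols + ['wOBA']].dropna()
        features = model_data[feature_cols]
        target = model_data['wOBA']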

        # Train model
        model = LinearRegression()
        model.fit(features, target)

        # Predictions
        predictions = model.predict(features)

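        # Evaluate in-sample: the model is scored on the same rows it was
        # fit on, so these numbers overstate out-of-sample accuracy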
        r2 = r2_score(target, predictions)
        rmse = np.sqrt(mean_squared_error(target, predictions))
        mae = mean_absolute_error(target, predictions)

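        # Coefficients by magnitude. The features are on different scales,
        # so raw coefficients are only a rough guide to importance.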
        importance = pd.DataFrame({
            'Feature': features.columns,
            'Coefficient': model.coef_,
            'Abs_Coefficient': np.abs(model.coef_)
        }).sort_values('Abs_Coefficient', ascending=False)

        print(f"\nSimplified Expected Stats Model Performance:")
        print(f"{'='*50}")
        print(f"R² Score: {r2:.4f}")
        print(f"RMSE: {rmse:.4f}")
        print(f"MAE: {mae:.4f}")
        print(f"\nFeature Importance:")
        print(importance[['Feature', 'Coefficient']])

        return model, {'r2': r2, 'rmse': rmse, 'mae': mae}


# Example Usage
analyzer = ExpectedStatsAnalyzer(year=2023)

# Fetch season data
season_data = analyzer.fetch_season_expected_stats()
print(f"\nFetched data for {len(season_data)} qualified hitters")

# Identify regression candidates
candidates = analyzer.identify_regression_candidates(threshold=0.030)

# Create visualization
fig = analyzer.visualize_expected_vs_actual(metric='wOBA')
plt.savefig('xwoba_vs_woba_2023.png', dpi=300, bbox_inches='tight')
print(f"\nVisualization saved to xwoba_vs_woba_2023.png")

# Analyze specific player
player_bb = analyzer.analyze_player_batted_balls('Judge', 'Aaron')

# Build predictive model
model, performance = analyzer.build_expected_stats_model()

# Reference year-to-year correlations (hardcoded values from the table above,
# not computed from the fetched data)
print(f"\n{'='*60}")
print(f"Year-to-Year Stability Analysis")
print(f"{'='*60}")
print(f"Metric correlations (higher = more stable/predictive):")
print(f"  Hard-Hit Rate: r ≈ 0.65 (Very Stable)")
print(f"  Barrel Rate: r ≈ 0.60 (Very Stable)")
print(f"  xwOBA: r ≈ 0.58 (Strong)")
print(f"  xSLG: r ≈ 0.55 (Strong)")
print(f"  xBA: r ≈ 0.50 (Strong)")
print(f"  wOBA: r ≈ 0.45 (Moderate)")
print(f"  SLG: r ≈ 0.42 (Moderate)")
print(f"  AVG: r ≈ 0.32 (Weak)")

R Implementation


library(tidyverse)
library(baseballr)
library(ggplot2)
library(scales)
library(corrplot)
library(caret)

# Expected Statistics Analyzer Class
ExpectedStatsAnalyzer <- R6::R6Class(
  "ExpectedStatsAnalyzer",

  public = list(
    year = NULL,
    data = NULL,
    player_data = NULL,

    initialize = function(year = 2023) {
      self$year <- year
    },

    fetch_season_expected_stats = function() {
      message(sprintf("Fetching %d season statistics with expected metrics...", self$year))

      # Fetch FanGraphs data with expected stats
      stats <- fg_batter_leaders(
        startseason = self$year,
        endseason = self$year,
        qual = 502  # Qualified hitters
      )

      # Calculate differentials
      self$data <- stats %>%
        mutate(
          BA_diff = AVG - xBA,
          SLG_diff = SLG - xSLG,
          wOBA_diff = wOBA - xwOBA
        ) %>%
        select(
          Name, Team, PA, AVG, xBA, BA_diff,
          SLG, xSLG, SLG_diff, wOBA, xwOBA, wOBA_diff,
          `Barrel%`, `HardHit%`, EV, LA, `Sweet Spot%`
        )

      return(self$data)
    },

    identify_regression_candidates = function(threshold = 0.030) {
      if (is.null(self$data)) {
        self$fetch_season_expected_stats()
      }

      # Overperformers (regression candidates)
      overperformers <- self$data %>%
        filter(wOBA_diff > threshold) %>%
        arrange(desc(wOBA_diff))

      # Underperformers (breakout candidates)
      underperformers <- self$data %>%
        filter(wOBA_diff < -threshold) %>%
        arrange(wOBA_diff)

      cat("\n", strrep("=", 70), "\n")
      cat("REGRESSION CANDIDATES (Actual >> Expected)\n")
      cat(strrep("=", 70), "\n")
      cat(sprintf("%-20s %-6s %6s %6s %6s %-20s\n",
                  "Player", "Team", "wOBA", "xwOBA", "Diff", "Verdict"))
      cat(strrep("-", 70), "\n")

      for (i in seq_len(min(10, nrow(overperformers)))) {
        player <- overperformers[i, ]
        cat(sprintf("%-20s %-6s %6.3f %6.3f %6.3f %-20s\n",
                   player$Name, player$Team, player$wOBA,
                   player$xwOBA, player$wOBA_diff, "Strong Regression Risk"))
      }

      cat("\n", strrep("=", 70), "\n")
      cat("BREAKOUT CANDIDATES (Expected >> Actual)\n")
      cat(strrep("=", 70), "\n")
      cat(sprintf("%-20s %-6s %6s %6s %6s %-20s\n",
                  "Player", "Team", "wOBA", "xwOBA", "Diff", "Verdict"))
      cat(strrep("-", 70), "\n")

      for (i in seq_len(min(10, nrow(underperformers)))) {  # seq_len() is safe when there are 0 rows
        player <- underperformers[i, ]
        cat(sprintf("%-20s %-6s %6.3f %6.3f %6.3f %-20s\n",
                   player$Name, player$Team, player$wOBA,
                   player$xwOBA, player$wOBA_diff, "Strong Breakout Potential"))
      }

      return(list(
        overperformers = overperformers,
        underperformers = underperformers
      ))
    },

    visualize_expected_vs_actual = function(metric = "wOBA") {
      if (is.null(self$data)) {
        self$fetch_season_expected_stats()
      }

      # Map metric to columns
      cols <- list(
        wOBA = list(actual = "wOBA", expected = "xwOBA", diff = "wOBA_diff"),
        BA = list(actual = "AVG", expected = "xBA", diff = "BA_diff"),
        SLG = list(actual = "SLG", expected = "xSLG", diff = "SLG_diff")
      )

      actual_col <- cols[[metric]]$actual
      expected_col <- cols[[metric]]$expected
      diff_col <- cols[[metric]]$diff

      # Create plot
      plot_data <- self$data %>%
        select(Name, all_of(c(actual_col, expected_col, diff_col))) %>%
        rename(
          actual = !!actual_col,
          expected = !!expected_col,
          diff = !!diff_col
        )

      # Identify outliers
      diff_threshold <- sd(plot_data$diff, na.rm = TRUE) * 1.5
      outliers <- plot_data %>%
        filter(abs(diff) > diff_threshold) %>%
        mutate(label = word(Name, -1))  # Last name only

      ggplot(plot_data, aes(x = expected, y = actual, color = diff)) +
        geom_point(size = 3, alpha = 0.6) +
        geom_abline(intercept = 0, slope = 1, linetype = "dashed",
                    color = "black", alpha = 0.3, linewidth = 1) +
        geom_text(data = outliers, aes(label = label),
                 hjust = -0.2, vjust = 0.5, size = 3, show.legend = FALSE) +
        scale_color_gradient2(
          low = "darkgreen", mid = "gray", high = "darkred",
          midpoint = 0,
          name = sprintf("%s Differential\n(Actual - Expected)", metric)
        ) +
        labs(
          title = sprintf("Actual vs Expected %s - %d Season", metric, self$year),
          subtitle = "Qualified Hitters (502+ PA)",
          x = sprintf("Expected %s (x%s)", metric, metric),
          y = sprintf("Actual %s", metric)
        ) +
        theme_minimal() +
        theme(
          plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
          plot.subtitle = element_text(hjust = 0.5, size = 12),
          axis.title = element_text(face = "bold", size = 12),
          legend.position = "right"
        )
    },

    classify_barrels = function(exit_velo, launch_angle) {
      # Approximate barrel classification: the qualifying launch-angle window
      # widens as exit velocity rises. Velocity bands are half-open (>= lower,
      # < upper) so fractional readings such as 99.4 mph are not skipped.
      barrels <-
        (exit_velo >= 98 & exit_velo < 100 & launch_angle >= 26 & launch_angle <= 30) |
        (exit_velo >= 100 & exit_velo < 102 & launch_angle >= 25 & launch_angle <= 31) |
        (exit_velo >= 102 & exit_velo < 104 & launch_angle >= 24 & launch_angle <= 33) |
        (exit_velo >= 104 & exit_velo < 107 & launch_angle >= 23 & launch_angle <= 34) |
        (exit_velo >= 107 & exit_velo < 109 & launch_angle >= 22 & launch_angle <= 35) |
        (exit_velo >= 109 & exit_velo < 117 & launch_angle >= 20 & launch_angle <= 40) |
        (exit_velo >= 117 & launch_angle >= 15 & launch_angle <= 50)

      return(barrels)
    },

    analyze_player_batted_balls = function(last_name, first_name) {
      # Lookup player
      player_id <- playerid_lookup(last_name, first_name)

      if (nrow(player_id) == 0) {
        stop(sprintf("Player %s %s not found", first_name, last_name))
      }

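      # baseballr's playerid_lookup() names the MLBAM column mlbam_id
      # (key_mlbam is the name used in the raw Chadwick register)
      mlbam_id <- player_id$mlbam_id[1]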

      message(sprintf("Fetching batted ball data for %s %s (%d)...",
                     first_name, last_name, self$year))

      # Fetch Statcast data
      data <- statcast_search_batters(
        start_date = sprintf("%d-01-01", self$year),
        end_date = sprintf("%d-12-31", self$year),
        batterid = mlbam_id
      )

      # Filter to batted balls
      batted_balls <- data %>%
        filter(
          type == "X",
          !is.na(launch_speed),
          !is.na(launch_angle)
        ) %>%
        mutate(
          is_barrel = self$classify_barrels(launch_speed, launch_angle),
          is_sweet_spot = launch_angle >= 8 & launch_angle <= 32,
          is_hard_hit = launch_speed >= 95
        )

      # Calculate metrics
      metrics <- batted_balls %>%
        summarise(
          Player = sprintf("%s %s", first_name, last_name),
          `Batted Balls` = n(),
          `Avg Exit Velo` = mean(launch_speed, na.rm = TRUE),
          `Max Exit Velo` = max(launch_speed, na.rm = TRUE),
          `Avg Launch Angle` = mean(launch_angle, na.rm = TRUE),
          `Barrel Rate` = sum(is_barrel) / n() * 100,
          `Hard-Hit Rate` = sum(is_hard_hit) / n() * 100,
          `Sweet Spot %` = sum(is_sweet_spot) / n() * 100,
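          # Hits divided by batted balls: this is average on contact, not batting
          # average, because strikeouts and other non-contact at-bats are excluded
          `BA on Contact` = sum(events %in% c("single", "double", "triple", "home_run")) / n(),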
          xBA = mean(estimated_ba_using_speedangle, na.rm = TRUE),
          xSLG = mean(estimated_slg_using_speedangle, na.rm = TRUE),
          xwOBA = mean(estimated_woba_using_speedangle, na.rm = TRUE)
        )

      self$player_data <- batted_balls

      cat(sprintf("\n%s %s - %d Batted Ball Analysis\n", first_name, last_name, self$year))
      cat(strrep("=", 60), "\n")
      print(metrics)

      return(batted_balls)
    },

    build_expected_stats_model = function() {
      if (is.null(self$data)) {
        self$fetch_season_expected_stats()
      }

      # Prepare data
      model_data <- self$data %>%
        select(EV, LA, `Barrel%`, `HardHit%`, wOBA) %>%
        drop_na()

      # Train model
      model <- lm(wOBA ~ EV + LA + `Barrel%` + `HardHit%`, data = model_data)

      # Predictions
      predictions <- predict(model, newdata = model_data)

      # Evaluate
      r2 <- cor(model_data$wOBA, predictions)^2
      rmse <- sqrt(mean((model_data$wOBA - predictions)^2))
      mae <- mean(abs(model_data$wOBA - predictions))

      cat("\nSimplified Expected Stats Model Performance:\n")
      cat(strrep("=", 50), "\n")
      cat(sprintf("R² Score: %.4f\n", r2))
      cat(sprintf("RMSE: %.4f\n", rmse))
      cat(sprintf("MAE: %.4f\n", mae))
      cat("\nModel Coefficients:\n")
      print(summary(model)$coefficients)

      return(list(model = model, r2 = r2, rmse = rmse, mae = mae))
    }
  )
)

# Example Usage
analyzer <- ExpectedStatsAnalyzer$new(year = 2023)

# Fetch season data
season_data <- analyzer$fetch_season_expected_stats()
cat(sprintf("\nFetched data for %d qualified hitters\n", nrow(season_data)))

# Identify regression candidates
candidates <- analyzer$identify_regression_candidates(threshold = 0.030)

# Create visualization
plot <- analyzer$visualize_expected_vs_actual(metric = "wOBA")
ggsave("xwoba_vs_woba_2023.png", plot, width = 12, height = 10, dpi = 300)
cat("\nVisualization saved to xwoba_vs_woba_2023.png\n")

# Analyze specific player
player_bb <- analyzer$analyze_player_batted_balls("Judge", "Aaron")

# Build predictive model
model_results <- analyzer$build_expected_stats_model()

# Print approximate year-to-year correlations (reference values, not computed by this script)
cat("\n", strrep("=", 60), "\n")
cat("Year-to-Year Stability Analysis\n")
cat(strrep("=", 60), "\n")
cat("Metric correlations (higher = more stable/predictive):\n")
cat("  Hard-Hit Rate: r ≈ 0.65 (Very Stable)\n")
cat("  Barrel Rate: r ≈ 0.60 (Very Stable)\n")
cat("  xwOBA: r ≈ 0.58 (Strong)\n")
cat("  xSLG: r ≈ 0.55 (Strong)\n")
cat("  xBA: r ≈ 0.50 (Strong)\n")
cat("  wOBA: r ≈ 0.45 (Moderate)\n")
cat("  SLG: r ≈ 0.42 (Moderate)\n")
cat("  AVG: r ≈ 0.32 (Weak)\n")

Real-World Application

The Chicago Cubs identified Ian Happ, whom they had drafted in 2015, as an undervalued player in 2022 based on expected statistics analysis. Despite posting a .271/.342/.440 slash line, Happ's expected stats (.280 xBA, .480 xSLG, .350 xwOBA) suggested he had been unlucky and possessed genuine skills superior to his surface numbers. His 11.5% barrel rate and 45.8% hard-hit rate pointed to well-above-average contact quality. After keeping Happ, the Cubs saw him improve to .289/.368/.495 in 2023, consistent with their expected stats-based evaluation.

Fantasy baseball analysts use expected statistics differentials to identify buy-low and sell-high candidates. In 2021, Jesse Winker posted a stellar .305/.394/.556 slash line with a .950 OPS in the first half, earning an All-Star selection. However, his expected stats (.250 xBA, .450 xSLG, .340 xwOBA) indicated substantial overperformance driven by a .370 BABIP and fortunate batted ball placement. Analysts who sold high on Winker avoided his second-half collapse to .217/.333/.350. Conversely, Christian Yelich's 2021 struggles (.248 BA, .362 SLG) masked stronger expected stats (.280 xBA, .450 xSLG), making him an ideal buy-low target who rebounded in 2022.
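
The buy-low and sell-high logic described here comes down to comparing an actual stat with its expected counterpart. The following is a minimal Python sketch using the batting averages and xBA values quoted in this paragraph. The function name `classify_differential` is hypothetical, and the 0.030 cutoff is an assumption borrowed from the wOBA threshold in the code earlier, not an established standard for batting average.

```python
def classify_differential(actual, expected, threshold=0.030):
    """Label a hitter from the gap between an actual and an expected stat."""
    diff = round(actual - expected, 3)
    if diff > threshold:
        return diff, "sell high"   # results are running ahead of contact quality
    if diff < -threshold:
        return diff, "buy low"     # results are lagging behind contact quality
    return diff, "hold"


# Batting averages and xBA values as quoted in the text above.
players = {
    "Jesse Winker (2021, first half)": (0.305, 0.250),
    "Christian Yelich (2021)": (0.248, 0.280),
}

for name, (avg, xba) in players.items():
    diff, verdict = classify_differential(avg, xba)
    print(f"{name}: AVG {avg:.3f}, xBA {xba:.3f}, diff {diff:+.3f}, verdict: {verdict}")
```

A single fixed cutoff is a simplification: in practice the gap should be read alongside sample size and factors xStats do not capture, such as sprint speed.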

MLB teams incorporate expected statistics into their player development programs. The Tampa Bay Rays identified that Yandy Díaz possessed elite contact quality (90th percentile exit velocity, 8% barrel rate) but was producing too many ground balls due to his swing path. By working with Díaz to increase his launch angle from 8 degrees to 14 degrees while maintaining his exceptional bat-to-ball skills, the Rays unlocked his power potential. His transformation from a .270/.340/.420 hitter to a .296/.370/.520 hitter with 20+ home runs validated the expected stats-based development approach.

Interpreting the Results

Expected Stat | What It Measures | League Average | Elite Level
xBA | Expected batting average based on contact quality | .245-.250 | .290+
xSLG | Expected power production based on contact quality | .410-.420 | .500+
xwOBA | Overall expected offensive value | .315-.320 | .370+
Barrel Rate | Frequency of optimal contact (98+ mph; 26-30° at 98 mph, wider at higher velocities) | 6-8% | 15%+
Hard-Hit Rate | Share of batted balls at 95+ mph exit velocity | 35-38% | 50%+
Sweet Spot % | Share of batted balls with 8-32° launch angle | 32-35% | 40%+
Exit Velocity | Average batted ball speed | 87-89 mph | 92+ mph
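
The benchmarks in the table can be turned into a simple lookup for reading an individual hitter's numbers. This is a rough Python sketch: the `BENCHMARKS` dictionary copies the table's ranges, while the `tier` function, its four tier labels, and the sample profile are illustrative assumptions rather than official Statcast categories.

```python
# (league-average low, league-average high, elite threshold), copied from the table.
BENCHMARKS = {
    "xBA": (0.245, 0.250, 0.290),
    "xSLG": (0.410, 0.420, 0.500),
    "xwOBA": (0.315, 0.320, 0.370),
    "Barrel%": (6.0, 8.0, 15.0),
    "HardHit%": (35.0, 38.0, 50.0),
    "SweetSpot%": (32.0, 35.0, 40.0),
    "EV": (87.0, 89.0, 92.0),
}


def tier(metric, value):
    """Place a value in a coarse tier relative to the table's benchmarks."""
    avg_low, avg_high, elite = BENCHMARKS[metric]
    if value >= elite:
        return "elite"
    if value > avg_high:
        return "above average"
    if value >= avg_low:
        return "average"
    return "below average"


# Hypothetical hitter profile, for illustration only.
profile = {"xwOBA": 0.350, "Barrel%": 11.5, "HardHit%": 45.8, "EV": 86.5}
for metric, value in profile.items():
    print(f"{metric}: {value} is {tier(metric, value)}")
```

The hard boundaries are a convenience; a value just under a threshold is not meaningfully different from one just over it.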

Key Takeaways

  • Expected statistics measure the quality of contact rather than outcomes, making them more predictive of future performance than traditional stats like batting average and slugging percentage.
  • xBA, xSLG, and xwOBA use exit velocity and launch angle to estimate what "should have" happened based on historical batted balls with similar characteristics, filtering out luck and defensive variance.
  • Large differentials between actual and expected stats identify regression candidates (overperformers likely to decline) and breakout candidates (underperformers likely to improve), providing actionable insights for player evaluation and trading decisions.
  • Barrel Rate and Hard-Hit Rate show superior year-to-year stability compared to traditional metrics, making them excellent indicators of genuine skill rather than random variation or luck-driven results.
  • Expected statistics stabilize more quickly than traditional stats (requiring 250-300 PA vs 500+ PA), enabling faster identification of true talent level and more confident mid-season evaluations.
  • The predictive power of expected stats makes them invaluable for MLB front offices, fantasy analysts, and betting markets seeking to identify market inefficiencies and make data-driven projections about future performance.
  • Understanding expected statistics requires appreciating that baseball outcomes contain significant randomness: even perfectly struck balls sometimes result in outs, while weak contact occasionally finds holes. These random factors average out over time, which is why process-based metrics are better suited to evaluation than results-based metrics.
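
The last point, that luck averages out as the sample grows, can be illustrated with a small simulation. The Python sketch below assigns each batted ball a hit probability, replays the same set of batted balls many times, and measures how far the observed average on contact strays from the expected one. The probabilities in `PROFILE`, the function name `luck_spread`, and the sample sizes are all made-up assumptions for illustration, not Statcast values.

```python
import random
import statistics

# Hypothetical hit probabilities for four kinds of contact (weak to barreled).
PROFILE = [0.05, 0.20, 0.45, 0.80]


def luck_spread(n_balls, n_trials=2000, seed=42):
    """Return (expected average, mean observed average, spread of observed averages)."""
    rng = random.Random(seed)
    # Each batted ball gets a fixed hit probability; their mean is the expected average.
    hit_probs = [rng.choice(PROFILE) for _ in range(n_balls)]
    expected = statistics.mean(hit_probs)

    observed = []
    for _ in range(n_trials):
        # Replay the same batted balls, letting chance decide each outcome.
        hits = sum(rng.random() < p for p in hit_probs)
        observed.append(hits / n_balls)

    return expected, statistics.mean(observed), statistics.pstdev(observed)


for n_balls in (50, 400):
    expected, mean_observed, spread = luck_spread(n_balls)
    print(f"{n_balls} batted balls: expected {expected:.3f}, "
          f"mean observed {mean_observed:.3f}, spread {spread:.3f}")
```

The spread of observed averages shrinks as the number of batted balls grows, while the mean observed average stays close to the expected value. That is the sense in which the expected figure describes the process and the observed figure adds noise on top of it.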
