Runs, Hits, and RBIs

Beginner | 10 min read | Nov 26, 2025

Understanding Runs, Hits, and RBIs: Baseball's Traditional Offensive Statistics

Runs (R), Hits (H), and Runs Batted In (RBIs) form the cornerstone of traditional baseball statistics. These metrics have been tracked since the earliest days of professional baseball and remain prominently featured in box scores, player evaluations, and Hall of Fame discussions. While modern analytics have introduced more sophisticated measures of offensive contribution, understanding these foundational statistics is essential for any serious baseball analyst.

Definitions and Basic Concepts

Runs (R)

A run is scored when a player advances around all four bases—first, second, third, and home plate—and touches home plate safely. This is the fundamental scoring unit in baseball, and the team with more runs at the end of the game wins. A player can score a run in several ways:

  • Advancing around the bases on hits by teammates
  • Hitting a home run
  • Being driven in by a sacrifice fly or groundout
  • Scoring on a wild pitch, passed ball, or error
  • Being forced home from third base when the batter walks or is hit by a pitch with the bases loaded

The run statistic is credited to the player who crosses home plate, regardless of how they advanced around the bases. This makes runs a measure of both a player's ability to get on base and their teammates' ability to drive them in.

Hits (H)

A hit is credited to a batter when they safely reach base via a batted ball without the benefit of an error or fielder's choice. There are four types of hits:

  • Single (1B): The batter reaches first base safely
  • Double (2B): The batter reaches second base safely
  • Triple (3B): The batter reaches third base safely
  • Home Run (HR): The batter circles all bases and scores

The official scorer determines whether a batter receives a hit or if they reached base due to an error. This judgment call can sometimes be controversial, as the distinction between a hit and an error isn't always clear-cut, particularly on difficult defensive plays.

Runs Batted In (RBI)

An RBI is credited to a batter when their plate appearance results in a run being scored, with certain exceptions. A batter receives an RBI when:

  • Their hit drives in a run
  • Their sacrifice fly scores a runner
  • Their groundout scores a runner (unless they ground into a double play)
  • They are hit by a pitch or walk with the bases loaded
  • They hit a home run (including one RBI for themselves)

A batter does NOT receive an RBI when:

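  • A run scores while they ground into a double play
  • A run scores only because of an error (if the scorer judges the run would have scored anyway, the RBI still counts)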
  • A run scores on a wild pitch or passed ball during their at-bat

The RBI did not become an official statistic in either the American League or the National League until 1920, though it had been tracked informally before then.
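The crediting rules above can be sketched as a small lookup. This is a simplified illustration, not the official scoring procedure: the event labels and the PlateAppearance class are hypothetical names invented for this example, and real scoring involves judgment calls that a lookup cannot capture.

```python
from dataclasses import dataclass

# Event labels are hypothetical names chosen for this sketch,
# not an official scoring vocabulary.
BATTED_BALL_RBI_EVENTS = {"single", "double", "triple", "home_run",
                          "sacrifice_fly", "groundout"}
FORCED_RUN_EVENTS = {"walk", "hit_by_pitch"}


@dataclass
class PlateAppearance:
    event: str                  # how the plate appearance ended
    runs_scored: int            # runs that crossed the plate on the play
    bases_loaded: bool = False  # were the bases loaded beforehand?


def rbis_credited(pa: PlateAppearance) -> int:
    """Return the RBIs credited under the simplified rules listed above."""
    if pa.event in BATTED_BALL_RBI_EVENTS:
        return pa.runs_scored
    if pa.event in FORCED_RUN_EVENTS and pa.bases_loaded:
        return pa.runs_scored
    # Double plays, errors, wild pitches, and passed balls earn no RBI here.
    return 0


examples = [
    PlateAppearance("home_run", runs_scored=4, bases_loaded=True),
    PlateAppearance("walk", runs_scored=1, bases_loaded=True),
    PlateAppearance("double_play", runs_scored=1),
    PlateAppearance("wild_pitch", runs_scored=1),
]
for pa in examples:
    print(f"{pa.event:<14} runs={pa.runs_scored} -> RBIs={rbis_credited(pa)}")
```

A grand slam earns four RBIs, a bases-loaded walk earns one, and the double play and wild pitch earn none even though a run scored on each.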

How Each Statistic is Recorded

Baseball's official scorers are responsible for recording these statistics during games. The process involves careful attention to detail and sometimes subjective judgment:

| Statistic | Recording Method | Scorer Judgment Required |
|---|---|---|
| Runs | Recorded when player touches home plate safely | Low - usually objective |
| Hits | Recorded when batter reaches base on batted ball | High - hit vs. error distinction |
| RBIs | Recorded when batter's action causes run to score | Medium - causation in complex situations |

The official scorer sits in the press box and makes real-time decisions about how to classify each play. These decisions are typically made immediately but can be changed upon review. Scorers use detailed rulebooks and guidelines to ensure consistency, though some variation between scorers is inevitable, particularly in hit/error determinations.

Historical Context and Evolution

The history of these statistics reflects baseball's evolution as a sport and the changing understanding of player value.

Early Baseball Era (1870s-1900s)

In baseball's early days, runs and hits were among the only statistics regularly tracked. The game was lower-scoring due to larger ballparks, dead-ball construction, and different playing styles. Runs were extremely valuable, and the ability to generate them was highly prized. Hit totals were the primary measure of offensive prowess, with 200-hit seasons considered the hallmark of excellence.

Dead Ball Era (1900-1919)

During this period, runs were scarce, and manufacturing runs through bunts, stolen bases, and hit-and-run plays dominated strategy. Ty Cobb exemplified this era, using his hitting ability and aggressive baserunning to score runs without relying on power. RBI totals were generally lower as home runs were rare events.

Live Ball Era (1920-present)

The introduction of a livelier ball in 1920 coincided with Babe Ruth's emergence and transformed baseball into a power game. Runs and RBIs increased dramatically. The 1920s saw the official adoption of the RBI statistic, recognizing the importance of run production. This era established the importance of "run producers" in the middle of lineups.

Integration and Expansion (1947-1960s)

The integration of baseball and league expansion created more offensive opportunities. Players like Willie Mays and Hank Aaron combined high batting averages with power, producing impressive totals in all three categories. The 100-RBI season became a benchmark for elite sluggers.

Modern Era and Analytics Revolution (1990s-present)

The analytics movement, popularized by "Moneyball," began questioning the value of traditional statistics. While runs, hits, and RBIs remained prominently displayed, analysts developed context-neutral metrics like wRC+ and wOBA that better isolate individual contributions. However, these traditional stats remain important for historical comparisons and public understanding.

Strengths of These Traditional Metrics

Despite criticism from the analytics community, runs, hits, and RBIs have enduring value:

Intuitive Understanding

These statistics are immediately comprehensible to casual fans. Everyone understands that scoring runs wins games and that hits are positive outcomes. This accessibility makes baseball more engaging for broader audiences and facilitates discussions about player performance across generations.

Historical Continuity

Because these statistics have been recorded for more than a century (runs and hits from the game's earliest days, RBIs officially since 1920), they allow for meaningful historical comparisons. We can compare Babe Ruth's 1921 season (177 runs, 204 hits, and 171 RBIs by the traditional count) with modern performances in a way that's difficult with newer metrics that lack historical data.

Direct Game Impact

Runs directly determine game outcomes. Unlike derivative statistics, the team that scores more runs wins—always. This fundamental truth gives the run statistic inherent meaning that advanced metrics must approximate.

Cumulative Achievement Recognition

Career totals in these categories recognize sustained excellence over time. Pete Rose's 4,256 career hits and Hank Aaron's 2,297 career RBIs represent decades of consistent performance, which single-season metrics cannot capture.

Limitations and Criticisms

Modern analysts have identified significant limitations in these traditional statistics:

Context Dependency

All three statistics are heavily influenced by factors beyond individual player control. A great hitter batting leadoff for a weak offense will score fewer runs than a lesser hitter batting third for a powerful lineup. Similarly, a player's RBI total depends enormously on their teammates' ability to get on base ahead of them.

Lineup Position Effects

Batting order position dramatically affects these statistics:

  • Leadoff hitters: More plate appearances and chances to score, but fewer RBI opportunities
  • Second hitters: Moderate opportunities in both categories
  • Third-fourth-fifth hitters: Maximum RBI opportunities, moderate run-scoring chances
  • Sixth-ninth hitters: Fewer opportunities overall, especially in RBIs

This means comparing a leadoff hitter's 110 runs with a cleanup hitter's 85 runs doesn't necessarily indicate superior performance—it may simply reflect opportunity.

Team Quality Impact

Players on good offensive teams accumulate runs and RBIs more easily than those on weak teams. A .280 hitter on a division winner might produce 100 RBIs while an identical hitter on a last-place team produces 70 RBIs, simply due to the frequency of baserunners.

Failure to Account for Outs

Hit totals don't account for plate appearances or at-bats. A player with 180 hits in 600 at-bats (.300 average) contributes more value than a player with 185 hits in 650 at-bats (.285 average), yet the latter has "more hits." This makes raw hit totals potentially misleading without context.
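The comparison above is easy to check with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and "at-bats without a hit" is used as a rough stand-in for outs (it ignores cases such as reaching on an error).

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats."""
    return hits / at_bats


def at_bats_without_hit(hits: int, at_bats: int) -> int:
    """Rough proxy for outs made: at-bats that did not end in a hit."""
    return at_bats - hits


# The two hypothetical players from the paragraph above: (hits, at-bats)
players = {"180 hits in 600 AB": (180, 600), "185 hits in 650 AB": (185, 650)}

for label, (hits, at_bats) in players.items():
    avg = batting_average(hits, at_bats)
    misses = at_bats_without_hit(hits, at_bats)
    print(f"{label}: AVG {avg:.3f}, at-bats without a hit: {misses}")
```

The second player gains five hits but uses 45 more hitless at-bats to get them, which is the cost that a raw hit total hides.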

RBI Timing Issues

An RBI in a 10-0 game has the same statistical value as an RBI in a 1-0 game, yet the latter clearly has more impact on winning. RBIs don't capture the leverage or importance of when runs are scored.

Situational Hitting and Context Analysis

Understanding these statistics requires examining situational factors that influence their accumulation:

Runners in Scoring Position (RISP)

A player's performance with runners in scoring position (second or third base) directly affects their RBI totals. Some players excel in these high-pressure situations, while others struggle. Batting average with RISP can vary significantly from overall batting average due to sample size and situational factors like defensive positioning and pitch selection.

Home vs. Road Splits

Park factors significantly influence these statistics. Playing half your games in Coors Field (Colorado) inflates offensive numbers, while playing in Oracle Park (San Francisco) suppresses them. Analysts must account for these environmental factors when evaluating performance.

Clutch Performance

While "clutch hitting" is controversial among statisticians, timing matters for runs and RBIs. A player who consistently produces in close-and-late situations (tied game or within one run in the seventh inning or later) provides value beyond what their raw totals suggest.
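The close-and-late definition used above can be written as a simple filter. This sketch follows the article's simplified wording only; the function name and arguments are hypothetical, and published splits may use a somewhat different definition.

```python
def is_close_and_late(inning: int, run_difference: int) -> bool:
    """True when the game is in the seventh inning or later and the score
    is tied or within one run (simplified definition from the text)."""
    return inning >= 7 and abs(run_difference) <= 1


# (inning, batting team's score minus opponent's score)
situations = [(3, 0), (7, 0), (8, -1), (9, 3)]
for inning, diff in situations:
    print(f"inning {inning}, run difference {diff:+d}: "
          f"close and late = {is_close_and_late(inning, diff)}")
```

Applying a filter like this to play-by-play data is one way to separate a player's late-game RBIs from the rest of their total.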

RBI as a Team-Dependent Statistic

The RBI statistic has received particular criticism for its team dependency. Consider two hypothetical players with identical skills:

| Scenario | Player A | Player B |
|---|---|---|
| Team OBP (teammates) | .340 | .300 |
| Estimated RBI opportunities | 220 | 180 |
| RBIs (25% success rate) | 55 | 45 |

Player A accumulates 10 more RBIs despite identical ability, solely due to better teammates. This team dependency makes RBIs problematic for evaluating individual contribution, though they still measure actual run production that occurred.
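The arithmetic behind this comparison is a single multiplication, sketched below. The opportunity counts and the 25% conversion rate are the hypothetical figures from the scenario above, not measured data, and the function name is invented for this example.

```python
def expected_rbis(opportunities: int, success_rate: float) -> float:
    """RBIs expected when each opportunity is converted at a fixed rate."""
    return opportunities * success_rate


SUCCESS_RATE = 0.25  # identical skill assumed for both players

player_a = expected_rbis(220, SUCCESS_RATE)  # teammates with a .340 OBP
player_b = expected_rbis(180, SUCCESS_RATE)  # teammates with a .300 OBP

print(f"Player A: {player_a:.0f} RBIs")
print(f"Player B: {player_b:.0f} RBIs")
print(f"Gap from teammates alone: {player_a - player_b:.0f} RBIs")
```

Holding the success rate fixed makes it explicit that the entire gap comes from the number of opportunities.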

By some estimates, roughly 85% of the variation in RBI totals can be explained by a combination of individual hitting ability and team context, with team context alone accounting for roughly 40% of the variance. This substantial team component means RBI leaders are not necessarily the best hitters, though they are typically good hitters who also batted in favorable situations.

Modern Alternatives and Advanced Metrics

Contemporary baseball analysis has developed several metrics that address the limitations of traditional statistics:

Weighted Runs Created Plus (wRC+)

This metric quantifies a player's total offensive value in terms of runs created, adjusted for park factors and league environment, and scaled so that 100 represents league average. A wRC+ of 130 means a player created 30% more runs than average. Unlike RBIs, wRC+ isolates individual contribution and allows for fair comparisons across contexts.

Run Expectancy Based Statistics (RE24, REW)

These metrics measure how much a player changes run expectancy through their plate appearances. RE24 shows the total runs added above average in a season, accounting for the context of each plate appearance. This captures the "true" run contribution more accurately than simple run totals.
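As a rough sketch of the idea, the run value of a single play is the run expectancy after the play, minus the run expectancy before it, plus any runs that scored; RE24 is the sum of those values over a season. The run-expectancy numbers below are made-up placeholders for illustration, not a published table, and the state labels are hypothetical.

```python
# Hypothetical run-expectancy values keyed by (runners on base, outs).
# Real tables are estimated from league play-by-play data.
RUN_EXPECTANCY = {
    ("empty", 0): 0.50,
    ("empty", 1): 0.27,
    ("first", 1): 0.52,
    ("second", 1): 0.68,
}


def play_run_value(before, after, runs_scored, table=RUN_EXPECTANCY):
    """Change in run expectancy caused by one play, plus runs scored on it."""
    return table[after] - table[before] + runs_scored


plays = [
    # Single with a runner on second and one out; the runner scores.
    (("second", 1), ("first", 1), 1),
    # Strikeout leading off an inning.
    (("empty", 0), ("empty", 1), 0),
]

total = sum(play_run_value(before, after, runs) for before, after, runs in plays)
for before, after, runs in plays:
    print(before, "->", after, f"value {play_run_value(before, after, runs):+.2f}")
print(f"Total over these plays: {total:+.2f}")
```

Unlike an RBI, the single here is credited for both the run that scored and the base state it left behind, while the strikeout is charged for the out it cost.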

Weighted On-Base Average (wOBA)

wOBA assigns a run-based weight to each way of reaching base (walks, singles, doubles, and so on) to create a rate statistic that correlates strongly with run production. It is scaled to look like on-base percentage, and is best thought of as an on-base percentage that credits each outcome in proportion to its run value rather than counting every time on base the same.
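A minimal sketch of the calculation follows. The weights are approximate, illustrative values only: published weights are recalculated every season, so treat the numbers (and the function and argument names) as assumptions made for this example.

```python
# Illustrative weights only; real wOBA weights change from season to season.
ILLUSTRATIVE_WEIGHTS = {
    "uBB": 0.69,  # unintentional walk
    "HBP": 0.72,  # hit by pitch
    "1B": 0.89,
    "2B": 1.27,
    "3B": 1.62,
    "HR": 2.10,
}


def woba(singles, doubles, triples, home_runs, walks, intentional_walks,
         hit_by_pitch, at_bats, sac_flies, weights=ILLUSTRATIVE_WEIGHTS):
    """Weighted on-base average: weighted times on base per opportunity."""
    unintentional_walks = walks - intentional_walks
    numerator = (weights["uBB"] * unintentional_walks
                 + weights["HBP"] * hit_by_pitch
                 + weights["1B"] * singles
                 + weights["2B"] * doubles
                 + weights["3B"] * triples
                 + weights["HR"] * home_runs)
    denominator = at_bats + unintentional_walks + sac_flies + hit_by_pitch
    return numerator / denominator


# A made-up stat line for demonstration
value = woba(singles=100, doubles=30, triples=3, home_runs=25, walks=60,
             intentional_walks=5, hit_by_pitch=4, at_bats=550, sac_flies=5)
print(f"wOBA: {value:.3f}")
```

Because a home run carries more than twice the weight of a single, two players with the same hit total can end up with very different wOBA values.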

Base Runs and Runs Created

These formulas estimate how many runs a player contributed to their team based on their complete offensive profile. They provide context-neutral measures that improve upon raw run and RBI totals.
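The simplest member of this family is the basic Runs Created formula, which multiplies times on base by total bases and divides by opportunities. The sketch below uses that basic version only; later refinements add stolen bases and other terms, and the function name and sample stat line are invented for this example.

```python
def total_bases(singles: int, doubles: int, triples: int, home_runs: int) -> int:
    """Bases gained on hits: one per single up to four per home run."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def basic_runs_created(hits: int, walks: int, bases: int, at_bats: int) -> float:
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * bases / (at_bats + walks)


# A made-up stat line: 100 singles, 30 doubles, 5 triples, 20 home runs
tb = total_bases(100, 30, 5, 20)
rc = basic_runs_created(hits=155, walks=60, bases=tb, at_bats=560)
print(f"Total bases: {tb}, basic Runs Created: {rc:.1f}")
```

Nothing in the formula depends on teammates or lineup position, which is what makes it more context-neutral than a raw run or RBI total.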

Why Traditional Statistics Still Matter

Despite the availability of advanced metrics, runs, hits, and RBIs retain importance for several reasons:

Public Communication

Sports media and casual fans continue to use these statistics because they're familiar and intuitive. When a broadcaster says a player has "100 RBIs," everyone immediately understands this indicates strong performance. Advanced metrics haven't achieved this level of common understanding.

Historical Records and Milestones

Career achievement milestones remain significant: 3,000 hits, 2,000 RBIs, 2,000 runs scored. These round numbers carry cultural weight and Hall of Fame significance that newer metrics haven't replicated. Players still chase these traditional milestones, and fans celebrate them.

Complement to Advanced Metrics

Traditional and advanced statistics work best together. A player with excellent wRC+ but low RBIs might be a victim of poor lineup construction or teammates. Conversely, a player with high RBIs but mediocre wRC+ might be benefiting from optimal situations. Examining both provides fuller understanding.

Actual Run Production

While context-dependent, runs and RBIs represent actual events that occurred and contributed to team success. A player who drives in 100 runs did, in fact, cause 100 runs to score, regardless of context. This tangible contribution has real value, even if it doesn't perfectly isolate individual skill.

Predictive Value in Combination

When combined with other statistics like on-base percentage and slugging percentage, traditional counting stats help predict future performance and team success. The combination of volume (counting stats) and efficiency (rate stats) provides robust evaluation.

Analyzing Runs, Hits, and RBIs with Code

Modern data science tools make it easy to fetch, analyze, and visualize these traditional baseball statistics. Below are comprehensive examples in both Python and R.

Python Analysis Using PyBaseball

PyBaseball provides easy access to baseball data from multiple sources. Here's a complete workflow for analyzing R/H/RBI statistics:

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import batting_stats
from scipy import stats

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Fetch batting statistics for the 2023 season
# Minimum 400 plate appearances to focus on regular players
batting_2023 = batting_stats(2023, qual=400)

# Select relevant columns
columns_of_interest = ['Name', 'Team', 'G', 'PA', 'AB', 'R', 'H', 'RBI',
                       'AVG', 'OBP', 'SLG', 'wRC+', 'WAR']
df = batting_2023[columns_of_interest].copy()

# Calculate additional rate statistics
df['R_per_PA'] = df['R'] / df['PA']  # Runs per plate appearance
df['H_per_AB'] = df['H'] / df['AB']  # Hit rate (batting average)
df['RBI_per_PA'] = df['RBI'] / df['PA']  # RBIs per plate appearance

# Display top performers in each category
print("Top 10 Run Scorers (2023):")
print(df.nlargest(10, 'R')[['Name', 'Team', 'R', 'R_per_PA', 'wRC+']])
print("\n")

print("Top 10 Hit Leaders (2023):")
print(df.nlargest(10, 'H')[['Name', 'Team', 'H', 'AVG', 'PA']])
print("\n")

print("Top 10 RBI Leaders (2023):")
print(df.nlargest(10, 'RBI')[['Name', 'Team', 'RBI', 'RBI_per_PA', 'wRC+']])
print("\n")

# Correlation analysis between traditional and advanced metrics
correlation_matrix = df[['R', 'H', 'RBI', 'wRC+', 'WAR']].corr()
print("Correlation Matrix:")
print(correlation_matrix)
print("\n")

# Create correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Between Traditional and Advanced Metrics (2023)',
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=300)
plt.show()

# Distribution analysis
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Runs distribution
axes[0].hist(df['R'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].axvline(df['R'].mean(), color='red', linestyle='--',
                linewidth=2, label=f'Mean: {df["R"].mean():.1f}')
axes[0].axvline(df['R'].median(), color='green', linestyle='--',
                linewidth=2, label=f'Median: {df["R"].median():.1f}')
axes[0].set_xlabel('Runs', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Runs Scored (2023)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Hits distribution
axes[1].hist(df['H'], bins=20, color='lightcoral', edgecolor='black', alpha=0.7)
axes[1].axvline(df['H'].mean(), color='red', linestyle='--',
                linewidth=2, label=f'Mean: {df["H"].mean():.1f}')
axes[1].axvline(df['H'].median(), color='green', linestyle='--',
                linewidth=2, label=f'Median: {df["H"].median():.1f}')
axes[1].set_xlabel('Hits', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Hits (2023)', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

# RBIs distribution
axes[2].hist(df['RBI'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7)
axes[2].axvline(df['RBI'].mean(), color='red', linestyle='--',
                linewidth=2, label=f'Mean: {df["RBI"].mean():.1f}')
axes[2].axvline(df['RBI'].median(), color='green', linestyle='--',
                linewidth=2, label=f'Median: {df["RBI"].median():.1f}')
axes[2].set_xlabel('RBIs', fontsize=12)
axes[2].set_ylabel('Frequency', fontsize=12)
axes[2].set_title('Distribution of RBIs (2023)', fontsize=13, fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('distributions.png', dpi=300)
plt.show()

# Scatter plot: RBIs vs wRC+ (showing team dependency)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(df['wRC+'], df['RBI'], alpha=0.6, s=100, c=df['R'],
                      cmap='viridis', edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label='Runs Scored')
plt.xlabel('wRC+ (Context-Neutral Offensive Value)', fontsize=12)
plt.ylabel('RBIs', fontsize=12)
plt.title('RBIs vs wRC+: Understanding Team Dependency (2023)',
          fontsize=14, fontweight='bold')

# Add trend line
z = np.polyfit(df['wRC+'], df['RBI'], 1)
p = np.poly1d(z)
# Use a signed format for the intercept so a negative value does not print as "+-"
plt.plot(df['wRC+'].sort_values(), p(df['wRC+'].sort_values()),
         "r--", linewidth=2, label=f'Trend: y={z[0]:.2f}x{z[1]:+.2f}')

# Calculate and display R-squared
slope, intercept, r_value, p_value, std_err = stats.linregress(df['wRC+'], df['RBI'])
plt.text(0.05, 0.95, f'R² = {r_value**2:.3f}', transform=plt.gca().transAxes,
         fontsize=12, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('rbis_vs_wrcplus.png', dpi=300)
plt.show()

# Player comparison: Similar wRC+ but different RBIs
# This demonstrates team context effects
high_wrc = df[df['wRC+'] > 130].copy()
high_wrc = high_wrc.sort_values('RBI', ascending=False)

print("Players with wRC+ > 130 (Elite Hitters):")
print("Showing variation in RBI despite similar offensive quality:")
print(high_wrc[['Name', 'Team', 'wRC+', 'RBI', 'R', 'PA']].head(15))
print("\n")

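# Fetch multi-year data for trend analysis
# Note: no batter reached 400 PA in the 60-game 2020 season, so that year
# contributes no rows at this threshold
print("Fetching multi-year data for trend analysis...")
years = range(2019, 2024)
multi_year_data = []

for year in years:
    try:
        yearly_data = batting_stats(year, qual=400)
        yearly_data['Season'] = year
        multi_year_data.append(yearly_data)
    except Exception as exc:
        # Catch Exception rather than using a bare except, so Ctrl-C still works
        print(f"Could not fetch data for {year}: {exc}")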

# Combine all years
if multi_year_data:
    df_multi = pd.concat(multi_year_data, ignore_index=True)

    # Calculate yearly averages
    yearly_avg = df_multi.groupby('Season')[['R', 'H', 'RBI', 'AVG', 'wRC+']].mean()

    print("Yearly League Averages (Qualified Batters):")
    print(yearly_avg)
    print("\n")

    # Plot trends over time
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    axes[0, 0].plot(yearly_avg.index, yearly_avg['R'], marker='o',
                    linewidth=2, markersize=8, color='blue')
    axes[0, 0].set_title('Average Runs per Season', fontsize=12, fontweight='bold')
    axes[0, 0].set_xlabel('Season')
    axes[0, 0].set_ylabel('Runs')
    axes[0, 0].grid(alpha=0.3)

    axes[0, 1].plot(yearly_avg.index, yearly_avg['H'], marker='o',
                    linewidth=2, markersize=8, color='red')
    axes[0, 1].set_title('Average Hits per Season', fontsize=12, fontweight='bold')
    axes[0, 1].set_xlabel('Season')
    axes[0, 1].set_ylabel('Hits')
    axes[0, 1].grid(alpha=0.3)

    axes[1, 0].plot(yearly_avg.index, yearly_avg['RBI'], marker='o',
                    linewidth=2, markersize=8, color='green')
    axes[1, 0].set_title('Average RBIs per Season', fontsize=12, fontweight='bold')
    axes[1, 0].set_xlabel('Season')
    axes[1, 0].set_ylabel('RBIs')
    axes[1, 0].grid(alpha=0.3)

    axes[1, 1].plot(yearly_avg.index, yearly_avg['wRC+'], marker='o',
                    linewidth=2, markersize=8, color='purple')
    axes[1, 1].set_title('Average wRC+ per Season', fontsize=12, fontweight='bold')
    axes[1, 1].set_xlabel('Season')
    axes[1, 1].set_ylabel('wRC+')
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig('multi_year_trends.png', dpi=300)
    plt.show()

# Statistical summary
print("Statistical Summary of R/H/RBI (2023):")
summary_stats = df[['R', 'H', 'RBI']].describe()
print(summary_stats)
print("\n")

# Identify outliers using IQR method
Q1 = df[['R', 'H', 'RBI']].quantile(0.25)
Q3 = df[['R', 'H', 'RBI']].quantile(0.75)
IQR = Q3 - Q1

outliers = df[((df[['R', 'H', 'RBI']] < (Q1 - 1.5 * IQR)) |
               (df[['R', 'H', 'RBI']] > (Q3 + 1.5 * IQR))).any(axis=1)]

print(f"Outlier performers (exceptional in R/H/RBI): {len(outliers)} players")
print(outliers[['Name', 'Team', 'R', 'H', 'RBI', 'wRC+']].sort_values('wRC+', ascending=False))

R Analysis Using baseballr

The baseballr package provides similar functionality for R users. Here's a comprehensive analysis:

# Load required libraries
library(baseballr)
library(dplyr)
library(ggplot2)
library(corrplot)
library(gridExtra)

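# Fetch 2023 batting statistics from FanGraphs
# Using fg_batter_leaders with a 400 PA minimum
# Note: argument and column names have changed across baseballr releases;
# run names(batting_2023) and adjust the select() call below if needed
batting_2023 <- fg_batter_leaders(startseason = 2023, endseason = 2023,
                                   qual = 400, ind = 1)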

# Select and rename relevant columns
df <- batting_2023 %>%
  select(Name, Team, G, PA, AB, R, H, RBI, AVG, OBP, SLG,
         wRC_plus = `wRC+`, WAR) %>%
  mutate(
    R_per_PA = R / PA,
    H_per_AB = H / AB,
    RBI_per_PA = RBI / PA
  )

# Display top performers
cat("Top 10 Run Scorers (2023):\n")
df %>%
  arrange(desc(R)) %>%
  head(10) %>%
  select(Name, Team, R, R_per_PA, wRC_plus) %>%
  print()

cat("\nTop 10 Hit Leaders (2023):\n")
df %>%
  arrange(desc(H)) %>%
  head(10) %>%
  select(Name, Team, H, AVG, PA) %>%
  print()

cat("\nTop 10 RBI Leaders (2023):\n")
df %>%
  arrange(desc(RBI)) %>%
  head(10) %>%
  select(Name, Team, RBI, RBI_per_PA, wRC_plus) %>%
  print()

# Correlation analysis
cor_vars <- df %>% select(R, H, RBI, wRC_plus, WAR)
correlation_matrix <- cor(cor_vars, use = "complete.obs")

cat("\nCorrelation Matrix:\n")
print(round(correlation_matrix, 3))

# Create correlation plot
png("correlation_plot.png", width = 800, height = 800, res = 120)
corrplot(correlation_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         title = "Correlation: Traditional vs Advanced Metrics",
         mar = c(0, 0, 2, 0))
dev.off()

# Distribution visualizations
# (linewidth replaces the deprecated size argument for lines in ggplot2 >= 3.4.0)
p1 <- ggplot(df, aes(x = R)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(R)), color = "red",
             linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(R)), color = "green",
             linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Runs Scored (2023)",
       x = "Runs", y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

p2 <- ggplot(df, aes(x = H)) +
  geom_histogram(bins = 20, fill = "lightcoral", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(H)), color = "red",
             linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(H)), color = "green",
             linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Hits (2023)",
       x = "Hits", y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

p3 <- ggplot(df, aes(x = RBI)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(RBI)), color = "red",
             linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(RBI)), color = "green",
             linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of RBIs (2023)",
       x = "RBIs", y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# Combine plots
png("distributions.png", width = 1500, height = 500, res = 100)
grid.arrange(p1, p2, p3, ncol = 3)
dev.off()

# Scatter plot: RBIs vs wRC+
scatter_plot <- ggplot(df, aes(x = wRC_plus, y = RBI, color = R)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", linetype = "dashed", se = TRUE) +
  scale_color_gradient(low = "lightblue", high = "darkblue", name = "Runs") +
  labs(title = "RBIs vs wRC+: Understanding Team Dependency (2023)",
       x = "wRC+ (Context-Neutral Offensive Value)",
       y = "RBIs") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
        legend.position = "right")

# Calculate R-squared
lm_model <- lm(RBI ~ wRC_plus, data = df)
r_squared <- summary(lm_model)$r.squared

scatter_plot <- scatter_plot +
  annotate("text", x = min(df$wRC_plus) + 10, y = max(df$RBI) - 5,
           label = paste0("R² = ", round(r_squared, 3)),
           size = 5, fontface = "bold")

ggsave("rbis_vs_wrcplus.png", scatter_plot, width = 12, height = 8, dpi = 300)

# Player comparison: Elite hitters with varying RBIs
high_wrc <- df %>%
  filter(wRC_plus > 130) %>%
  arrange(desc(RBI))

cat("\nPlayers with wRC+ > 130 (Elite Hitters):\n")
cat("Showing variation in RBI despite similar offensive quality:\n")
high_wrc %>%
  head(15) %>%
  select(Name, Team, wRC_plus, RBI, R, PA) %>%
  print()

# Multi-year trend analysis
cat("\nFetching multi-year data for trend analysis...\n")
years <- 2019:2023
multi_year_data <- list()

for (year in years) {
  tryCatch({
    yearly_data <- fg_batter_leaders(startseason = year, endseason = year,
                                      qual = 400, ind = 1)
    yearly_data$Season <- year
    multi_year_data[[as.character(year)]] <- yearly_data
  }, error = function(e) {
    cat(sprintf("Could not fetch data for %d\n", year))
  })
}

# Combine all years
if (length(multi_year_data) > 0) {
  df_multi <- bind_rows(multi_year_data)

  # Calculate yearly averages
  yearly_avg <- df_multi %>%
    group_by(Season) %>%
    summarise(
      Avg_R = mean(R, na.rm = TRUE),
      Avg_H = mean(H, na.rm = TRUE),
      Avg_RBI = mean(RBI, na.rm = TRUE),
      Avg_AVG = mean(AVG, na.rm = TRUE),
      Avg_wRC_plus = mean(`wRC+`, na.rm = TRUE)
    )

  cat("\nYearly League Averages (Qualified Batters):\n")
  print(yearly_avg)

  # Create trend plots
  p1 <- ggplot(yearly_avg, aes(x = Season, y = Avg_R)) +
    geom_line(linewidth = 1.2, color = "blue") +
    geom_point(size = 3, color = "blue") +
    labs(title = "Average Runs per Season", x = "Season", y = "Runs") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

  p2 <- ggplot(yearly_avg, aes(x = Season, y = Avg_H)) +
    geom_line(linewidth = 1.2, color = "red") +
    geom_point(size = 3, color = "red") +
    labs(title = "Average Hits per Season", x = "Season", y = "Hits") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

  p3 <- ggplot(yearly_avg, aes(x = Season, y = Avg_RBI)) +
    geom_line(linewidth = 1.2, color = "green") +
    geom_point(size = 3, color = "green") +
    labs(title = "Average RBIs per Season", x = "Season", y = "RBIs") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

  p4 <- ggplot(yearly_avg, aes(x = Season, y = Avg_wRC_plus)) +
    geom_line(linewidth = 1.2, color = "purple") +
    geom_point(size = 3, color = "purple") +
    labs(title = "Average wRC+ per Season", x = "Season", y = "wRC+") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))

  png("multi_year_trends.png", width = 1400, height = 1000, res = 100)
  grid.arrange(p1, p2, p3, p4, ncol = 2)
  dev.off()
}

# Statistical summary
cat("\nStatistical Summary of R/H/RBI (2023):\n")
summary_stats <- df %>%
  select(R, H, RBI) %>%
  summary()
print(summary_stats)

# Identify outliers using IQR method
identify_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  # Named iqr (lowercase) to avoid masking the built-in stats::IQR() function
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  return(x < lower | x > upper)
}

outliers <- df %>%
  filter(identify_outliers(R) | identify_outliers(H) | identify_outliers(RBI))

cat(sprintf("\nOutlier performers (exceptional in R/H/RBI): %d players\n",
            nrow(outliers)))
outliers %>%
  arrange(desc(wRC_plus)) %>%
  select(Name, Team, R, H, RBI, wRC_plus) %>%
  print()

# Additional analysis: R/H/RBI rates by team
team_summary <- df %>%
  group_by(Team) %>%
  summarise(
    Players = n(),
    Total_R = sum(R),
    Total_H = sum(H),
    Total_RBI = sum(RBI),
    Avg_wRC_plus = mean(wRC_plus, na.rm = TRUE)
  ) %>%
  arrange(desc(Avg_wRC_plus))

cat("\nTeam Summary (Qualified Batters Only):\n")
print(team_summary)

Interpreting the Code Results

When you run these analysis scripts, you'll gain several important insights:

Correlation Findings

The correlation matrices typically reveal that R, H, and RBI are moderately to strongly correlated with advanced metrics like wRC+ and WAR. However, RBI often shows one of the weaker correlations with context-neutral metrics, which is consistent with its team dependency. Correlation coefficients between RBI and wRC+ typically range from 0.60 to 0.75, indicating a substantial but imperfect relationship.

Distribution Patterns

The distribution plots show that qualified batters' runs, hits, and RBIs typically follow roughly normal distributions with a slight right skew. Most qualified players cluster around the mean, with elite performers forming the right tail. For a typical season with a 400+ PA requirement:

  • Runs: Mean around 75-85, median 75-80
  • Hits: Mean around 140-150, median 140-145
  • RBIs: Mean around 70-80, median 70-75

Team Context Effects

The scatter plots of RBI vs wRC+ demonstrate visible scatter around the trend line, with some elite hitters (high wRC+) posting relatively modest RBI totals while some merely good hitters post high RBIs. This scatter represents the influence of teammates, lineup position, and luck in opportunity distribution.

Multi-Year Trends

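Year-to-year analysis reveals how rule changes, ball characteristics, and playing conditions affect these statistics. Recent years have shown fluctuations due to various factors including the implementation of humidors in all ballparks and crackdowns on foreign substances. Note that the 60-game 2020 season produced no batters with 400 plate appearances, so it drops out of the 2019-2023 trend plots unless the threshold is lowered for that year.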

Conclusion: The Enduring Value of Traditional Statistics

Runs, hits, and RBIs represent baseball's statistical foundation. While modern analytics have exposed their limitations—particularly the context-dependency of runs and RBIs—these metrics retain considerable value. They provide intuitive communication with fans, enable historical comparisons spanning baseball's entire history, and measure actual outcomes that occurred on the field.

The key to proper utilization is understanding both their strengths and weaknesses. Runs scored reflects both individual ability and teammates' production. Hit totals measure volume but not efficiency. RBIs reward production but depend heavily on opportunity. When used alongside modern metrics like wRC+, OPS+, and WAR, traditional statistics contribute to comprehensive player evaluation.

For analysts, the message is clear: don't discard traditional statistics, but don't rely on them exclusively. Use them as part of a holistic evaluation framework that accounts for context, opportunity, and underlying skill. For fans and media, continue celebrating 100-RBI seasons and 200-hit campaigns while recognizing that these achievements reflect both individual excellence and favorable circumstances.

The data science tools demonstrated in this tutorial enable rigorous analysis that respects both baseball's statistical heritage and modern analytical standards. By combining traditional counting statistics with advanced metrics, we can appreciate player achievements in their full context while making more informed evaluations of true talent and value.
