Case Study 2: Forecasting Young Player Development — Will They Make It?

Overview

Every season, thousands of young players in professional academies dream of making it to the first team. But only a fraction will succeed. This case study builds a comprehensive predictive system to forecast which young players (ages 17-21) are most likely to establish themselves as regular professionals by age 25. We combine aging curve models, classification algorithms, and Bayesian updating to produce evolving predictions that improve as we observe more of each player's career.

Motivation

Consider a hypothetical Premier League club evaluating three academy graduates:

  • Player A: 19-year-old central midfielder. Strong passing metrics, moderate physical profile, 800 first-team minutes across two seasons.
  • Player B: 20-year-old striker. Exceptional goal-scoring record in the U21s, limited first-team exposure (200 minutes).
  • Player C: 18-year-old winger. Outstanding dribbling statistics for his age, very few senior minutes.

The sporting director needs to decide which players to offer professional contracts to, which to loan, and which to release. A data-driven forecasting system can inform these decisions by estimating the probability of each player becoming a regular top-flight professional.

Defining "Making It"

We define success as: accumulating at least 5,000 minutes in a top-5 European league by age 25. This threshold corresponds to roughly two full seasons as a squad player or one full season as a regular starter.

This definition has several advantages:

  • It is objective and measurable.
  • It accounts for players who move between clubs (the minutes can be at any top-5 league club).
  • It sets a meaningful bar -- 5,000 minutes requires genuine professional-level contribution.

Alternative definitions (market value, international caps, staying at the original club) are discussed in the appendix.
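The success label itself is easy to compute from season-level records. A minimal sketch, assuming one row per season with `age`, `league_level`, and `minutes` columns (league level 1 = top flight); the helper name `label_made_it` is hypothetical:

```python
import pandas as pd

def label_made_it(seasons: pd.DataFrame, minutes_threshold=5000, age_cutoff=25):
    """Return 1 if the player accrues >= minutes_threshold top-flight
    minutes by age_cutoff, else 0. Assumes one row per season with
    age, league_level, and minutes columns (league_level 1 = top flight)."""
    eligible = seasons[(seasons["age"] <= age_cutoff) &
                       (seasons["league_level"] == 1)]
    return int(eligible["minutes"].sum() >= minutes_threshold)

# Toy career: breaks through at 21, regular by 23
career = pd.DataFrame({
    "age":          [19,   20,   21,   22,   23],
    "league_level": [2,    2,    1,    1,    1],
    "minutes":      [900,  2400, 1500, 2100, 2600],
})
print(label_made_it(career))  # 1 (1500 + 2100 + 2600 = 6200 >= 5000)
```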

Data Description

Our synthetic dataset contains career records for 1,200 players who were part of top-flight club academies at age 17. For each player, we have season-level observations from age 17 to 30 (where available):

Feature                   Description
player_id                 Unique identifier
age                       Age at start of season
position                  Primary position (GK, CB, FB, CM, W, ST)
league_level              League tier (1-4, where 1 = top flight)
minutes                   Minutes played in the season
goals_p90                 Goals per 90 minutes
assists_p90               Assists per 90 minutes
xg_p90                    Expected goals per 90
xa_p90                    Expected assists per 90
progressive_passes_p90    Progressive passes per 90
progressive_carries_p90   Progressive carries per 90
tackles_p90               Tackles per 90
interceptions_p90         Interceptions per 90
aerial_pct                Aerial duel win percentage
pass_pct                  Pass completion percentage
dribble_pct               Dribble success percentage
height_cm                 Height in centimeters
made_it                   Binary target: 1 if the player met the success criteria

Step 1: Base Rate Analysis

Before building models, we establish base rates:

base_rates = df.groupby("position")["made_it"].mean()
print("Success rates by position:")
for pos, rate in base_rates.items():
    print(f"  {pos}: {rate:.1%}")

Typical base rates for academy players reaching the 5,000-minute threshold:

  • Goalkeepers: ~15% (few slots available, but long careers once established)
  • Center-backs: ~18% (physical development is crucial and somewhat predictable)
  • Full-backs: ~20% (modern demands create more opportunities)
  • Central midfielders: ~14% (highest competition for places)
  • Wingers: ~16% (high washout rate due to reliance on pace)
  • Strikers: ~12% (hardest position to predict, most variance in outcomes)

These base rates form our prior. Any prediction model must improve upon simply predicting the base rate for each position.
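To make "must improve upon the base rate" concrete, the positional prior can be scored like any other predictor. A sketch on simulated data (the positions and outcomes below are randomly generated for illustration, not the case-study dataset), scoring the base-rate predictor with the Brier score:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "position": rng.choice(["GK", "CB", "FB", "CM", "W", "ST"], size=1000),
    "made_it": rng.binomial(1, 0.16, size=1000),
})

# Positional base rate used as the prediction for every player in that position
base_rates = df.groupby("position")["made_it"].mean()
prior_pred = df["position"].map(base_rates)

# Brier score of the base-rate predictor: any model must beat this number
brier_baseline = ((prior_pred - df["made_it"]) ** 2).mean()
print(f"Baseline Brier score: {brier_baseline:.3f}")
```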

Step 2: Feature Engineering at Age 19

At age 19, we create a comprehensive feature vector that captures both current ability and trajectory:

def engineer_features(player_data, target_age=19):
    """Create prediction features from data available up to target_age."""
    features = {}

    # Current season metrics (age 19)
    current = player_data[player_data["age"] == target_age]
    if len(current) == 0:
        return None
    current = current.iloc[0]

    # Level of play
    features["current_league_level"] = current["league_level"]
    features["current_minutes"] = current["minutes"]

    # Performance metrics (only if sufficient minutes)
    if current["minutes"] >= 450:
        features["xg_p90"] = current["xg_p90"]
        features["xa_p90"] = current["xa_p90"]
        features["progressive_actions_p90"] = (
            current["progressive_passes_p90"] + current["progressive_carries_p90"]
        )
        features["defensive_actions_p90"] = (
            current["tackles_p90"] + current["interceptions_p90"]
        )
        features["pass_pct"] = current["pass_pct"]

    # Trajectory features (change from age 18 to 19)
    prev = player_data[player_data["age"] == target_age - 1]
    if len(prev) > 0:
        prev = prev.iloc[0]
        features["minutes_change"] = current["minutes"] - prev["minutes"]
        features["league_level_change"] = prev["league_level"] - current["league_level"]
        # Positive = moved to higher level
        if prev["minutes"] >= 450 and current["minutes"] >= 450:
            features["xg_improvement"] = current["xg_p90"] - prev["xg_p90"]

    # Cumulative features
    career = player_data[player_data["age"] <= target_age]
    features["total_career_minutes"] = career["minutes"].sum()
    features["max_league_level_played"] = career["league_level"].min()
    # min because level 1 is highest

    # Physical
    features["height_cm"] = current["height_cm"]
    features["position"] = current["position"]

    return features
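One practical wrinkle when applying a function like this across many players: the performance metrics are only populated for players with 450+ minutes, so the assembled feature table contains NaNs. A sketch of one way to handle this, using hypothetical feature dicts (invented values, not output from the real dataset):

```python
import pandas as pd

# Hypothetical outputs of engineer_features for three players; the
# low-minutes player (p2) lacks the performance metrics entirely.
rows = {
    "p1": {"current_league_level": 1, "current_minutes": 900,  "xg_p90": 0.10},
    "p2": {"current_league_level": 2, "current_minutes": 300},
    "p3": {"current_league_level": 1, "current_minutes": 1800, "xg_p90": 0.22},
}
features_df = pd.DataFrame.from_dict(rows, orient="index")

# Missing performance metrics surface as NaN; flag them before imputing
# so the classifier can still use "insufficient minutes" as a signal.
features_df["low_minutes"] = features_df["xg_p90"].isna().astype(int)
features_df["xg_p90"] = features_df["xg_p90"].fillna(features_df["xg_p90"].median())
print(features_df)
```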

Step 3: Classification Model

We train a Random Forest classifier to predict the binary "made it" outcome:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

def train_prospect_model(features_df, target):
    """Train and evaluate prospect classification model."""
    # Encode position as dummy variables
    features_encoded = pd.get_dummies(features_df, columns=["position"])

    # Stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    auc_scores = []
    ap_scores = []

    for train_idx, val_idx in skf.split(features_encoded, target):
        X_train = features_encoded.iloc[train_idx]
        X_val = features_encoded.iloc[val_idx]
        y_train = target.iloc[train_idx]
        y_val = target.iloc[val_idx]

        model = RandomForestClassifier(
            n_estimators=500,
            max_depth=8,
            min_samples_leaf=10,
            class_weight="balanced",
            random_state=42
        )
        model.fit(X_train, y_train)

        probs = model.predict_proba(X_val)[:, 1]
        auc_scores.append(roc_auc_score(y_val, probs))
        ap_scores.append(average_precision_score(y_val, probs))

    print(f"AUC-ROC: {np.mean(auc_scores):.3f} +/- {np.std(auc_scores):.3f}")
    print(f"Avg Precision: {np.mean(ap_scores):.3f} +/- {np.std(ap_scores):.3f}")

    # Refit on the full dataset so the returned model uses all available
    # players, not just the training portion of the final CV fold
    model.fit(features_encoded, target)
    return model

Feature Importance

The top predictive features, in order of importance:

  1. Total career minutes (at age 19): The single strongest predictor. Players who have already accumulated significant senior minutes have demonstrated they belong at that level.
  2. Current league level: Playing in a top-flight league at age 19 is a very strong signal.
  3. Minutes change (age 18 to 19): An increasing trajectory in playing time suggests the player is earning trust from coaches.
  4. Progressive actions per 90: Measures involvement in dangerous play, indicating quality beyond simple counting stats.
  5. Height (position-dependent): Significant for center-backs and strikers, less so for wingers and full-backs.
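Rankings like this come straight from the fitted forest's `feature_importances_` attribute. A self-contained sketch on simulated data (the features and target below are synthetic, constructed so that career minutes drives the outcome, mirroring the finding above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "total_career_minutes": rng.uniform(0, 6000, n),
    "current_league_level": rng.integers(1, 5, n),
    "minutes_change": rng.normal(0, 500, n),
    "height_cm": rng.normal(182, 6, n),
})
# Synthetic target driven mainly by career minutes (plus noise)
y = (X["total_career_minutes"] + rng.normal(0, 1500, n) > 3500).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```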

Step 4: Aging Curve Integration

We enhance the classification with trajectory-based features by estimating position-specific aging curves:

def project_development(current_metrics, current_age, position,
                        aging_curves, horizon=6):
    """Project player metrics forward using aging curves."""
    projections = {}

    for metric, curve in aging_curves[position].items():
        trajectory = [current_metrics.get(metric, 0)]
        for future_age in range(current_age + 1, current_age + horizon + 1):
            delta = curve.get(future_age - 1, 0)  # expected change at this age
            trajectory.append(trajectory[-1] + delta)
        projections[metric] = trajectory

    return projections

The aging curve projection allows us to estimate what a 19-year-old's metrics might look like at age 25, accounting for the typical development pattern of players in their position.

For example, a 19-year-old central midfielder with xG per 90 of 0.08 might be projected to reach 0.12 by age 25, based on the average development trajectory for central midfielders. But the uncertainty around this projection is wide -- the 90% prediction interval might span from 0.05 to 0.20.
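One simple way to attach uncertainty to such a projection is to simulate many trajectories, adding season-to-season noise around the mean aging-curve deltas. The deltas and noise scale below are illustrative guesses chosen to roughly reproduce the numbers above, not estimates from the dataset:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical mean yearly xG/90 deltas for central midfielders, ages 19->25,
# with an assumed season-to-season noise scale -- illustrative numbers only.
mean_delta = {19: 0.010, 20: 0.010, 21: 0.008, 22: 0.005, 23: 0.004, 24: 0.003}
noise_sd = 0.02

n_sims, start = 5000, 0.08
paths = np.full(n_sims, start)
for age in range(19, 25):
    paths = paths + mean_delta[age] + rng.normal(0, noise_sd, n_sims)

lo, hi = np.percentile(paths, [5, 95])
print(f"Projected xG/90 at 25: median {np.median(paths):.3f}, "
      f"90% PI [{lo:.3f}, {hi:.3f}]")
```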

Step 5: Bayesian Updating

The most powerful aspect of our system is its ability to update predictions as new data arrives. We use a Bayesian framework where:

  • Prior: The initial prediction at age 19 serves as our prior probability of success.
  • Likelihood: Each new season of data provides a likelihood update.
  • Posterior: The updated probability after incorporating new data.

def bayesian_update(prior_prob, new_season_data, likelihood_model):
    """Update success probability after observing a new season.

    Args:
        prior_prob: Current probability of making it.
        new_season_data: Dict with the player's latest season metrics.
        likelihood_model: Model that estimates P(data | success) and P(data | failure).

    Returns:
        Updated (posterior) probability.
    """
    # Likelihood of observing this season given the player will make it
    p_data_given_success = likelihood_model.predict_proba_success(new_season_data)

    # Likelihood of observing this season given the player won't make it
    p_data_given_failure = likelihood_model.predict_proba_failure(new_season_data)

    # Bayes rule
    numerator = p_data_given_success * prior_prob
    denominator = (p_data_given_success * prior_prob +
                   p_data_given_failure * (1 - prior_prob))

    posterior = numerator / denominator
    return posterior

Example: Tracking Player A Over Time

Age   Key Events                                                  Updated P(success)
19    800 minutes in PL, solid passing stats                      35%
20    Loaned to Championship, 2,800 minutes, strong performance   48%
21    Recalled, 1,500 PL minutes, improving metrics               62%
22    Regular starter, 2,800 PL minutes                           78%
23    Established starter, international call-up                  91%

Notice how the probability updates smoothly as evidence accumulates. A single breakout season does not immediately push the probability to near-certainty -- the Bayesian approach is appropriately cautious.
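The same arithmetic can be written compactly in odds form: the posterior odds equal the prior odds times the likelihood ratio of the observed season. The likelihood ratios below are assumed values chosen for illustration, roughly reproducing Player A's trajectory:

```python
def bayesian_update(prior_prob, likelihood_ratio):
    """Posterior odds = prior odds * likelihood ratio,
    where LR = P(season | success) / P(season | failure)."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Assumed likelihood ratios for Player A's seasons at ages 20-23:
# values > 1 mean the season looks more like a player who makes it.
season_lrs = [1.7, 1.8, 2.1, 2.8]
p = 0.35  # age-19 prior from the classifier
for lr in season_lrs:
    p = bayesian_update(p, lr)
    print(f"P(success) -> {p:.2f}")
# Prints roughly 0.48, 0.62, 0.78, 0.91
```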

Step 6: Survival Analysis for Dropout

We complement the classification model with a survival analysis that models the time to establishing oneself (or dropping out):

from lifelines import CoxPHFitter

def fit_establishment_model(career_data):
    """Fit survival model for time-to-establishment."""
    # Event: first season with 2000+ top-flight minutes
    # Censoring: player leaves top-flight system or reaches age 25

    survival_df = career_data.copy()
    survival_df["duration"] = survival_df["age"] - 17  # years since age 17
    survival_df["event"] = (
        (survival_df["minutes"] >= 2000) & (survival_df["league_level"] == 1)
    ).astype(int)

    cph = CoxPHFitter()
    cph.fit(
        survival_df[["duration", "event", "xg_p90", "progressive_actions_p90",
                      "total_career_minutes", "height_cm"]],
        duration_col="duration",
        event_col="event"
    )

    cph.print_summary()
    return cph

The survival model tells us not just whether a player will make it, but when. Some players establish themselves immediately at age 18-19 (early bloomers), while others take the loan route and break through at 22-23 (late developers).
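The "when" question can be illustrated without the Cox model: a Kaplan-Meier estimate of the time-to-establishment distribution shows how the probability of not yet being established falls with each year. A hand-rolled sketch on an invented ten-player cohort (no lifelines dependency):

```python
import numpy as np

# (duration in years since age 17, event observed?) -- invented cohort:
# early bloomers establish after 1-2 years, late developers after 5-6,
# censored players (event=0) leave the system or reach age 25 first.
data = [(1, 1), (2, 1), (2, 1), (3, 0), (4, 1),
        (5, 1), (6, 1), (6, 0), (7, 0), (8, 0)]

durations = np.array([d for d, _ in data])
events = np.array([e for _, e in data])

# Kaplan-Meier: S(t) = product over event times of (1 - deaths_t / at_risk_t)
surv = 1.0
for t in np.unique(durations[events == 1]):
    at_risk = np.sum(durations >= t)
    established = np.sum((durations == t) & (events == 1))
    surv *= 1 - established / at_risk
    print(f"year {t}: P(not yet established) = {surv:.2f}")
```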

Results and Validation

Model Performance

Metric              Value
AUC-ROC             0.79
Average Precision   0.42
Brier Score         0.12
ECE (calibration)   0.04

The model significantly outperforms the base rate (AUC = 0.50) and a simple heuristic based solely on minutes played (AUC = 0.71).

Calibration Analysis

We verify that the model's predicted probabilities match observed success rates:

  • Predicted 10-20%: Observed 14% success rate (well-calibrated)
  • Predicted 20-40%: Observed 28% success rate (well-calibrated)
  • Predicted 40-60%: Observed 51% success rate (well-calibrated)
  • Predicted 60-80%: Observed 65% success rate (slightly overconfident)
  • Predicted 80-100%: Observed 82% success rate (well-calibrated)
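Bucketed comparisons like these can be produced with scikit-learn's `calibration_curve`. A sketch on simulated predictions (outcomes are drawn at exactly the predicted rate, so the model is well-calibrated by construction):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
# Simulated predictions and outcomes: each outcome occurs with
# probability equal to its prediction, i.e. perfect calibration.
probs = rng.uniform(0.05, 0.95, 2000)
outcomes = rng.binomial(1, probs)

# prob_pred: mean predicted probability per bin; prob_true: observed rate
prob_true, prob_pred = calibration_curve(outcomes, probs, n_bins=5)
for pt, pp in zip(prob_true, prob_pred):
    print(f"predicted {pp:.2f} -> observed {pt:.2f} (gap {abs(pt - pp):.2f})")
```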

Case Analysis

True Positives (Correctly identified successes): Players with early senior exposure, strong underlying metrics, and positive development trajectories were correctly flagged as likely successes.

True Negatives (Correctly identified non-successes): Players who accumulated minutes only at lower levels and showed stagnating or declining metrics were correctly assigned low probabilities.

False Positives (Predicted success, did not make it): Some physically gifted players with strong metrics at age 19 failed due to injuries or personal factors that the model cannot capture.

False Negatives (Predicted failure, succeeded): Late bloomers who showed limited output at age 19 but improved dramatically in their early 20s. These are the most interesting cases and represent the limits of early prediction.

Key Insights

  1. Minutes at the top level by age 19 is the strongest predictor. Players who have already proven they can contribute at the highest level are much more likely to continue doing so.

  2. Trajectory matters more than current level. A player whose metrics are improving rapidly is a better prospect than one with higher current metrics but a flat trajectory.

  3. Position matters for prediction reliability. Goalkeeper and center-back success is more predictable (physical attributes are important and measurable) than winger or striker success (which depends more on intangibles).

  4. Prediction accuracy improves dramatically with each additional season of data. At age 19, the model's AUC is 0.79. By age 21 (with two more seasons of updates), it reaches 0.88. Early prediction is inherently uncertain.

  5. The loan pathway is informative. Players who excel during loans to competitive leagues provide strong evidence of their ability to adapt to high-level football.

Limitations

  • Survivorship bias: Our dataset only includes players who were in academies at age 17. Players who enter professional football through non-academy routes are not captured.
  • Unobservable factors: Mentality, work ethic, injury proneness, and personal circumstances are not in the data but significantly affect outcomes.
  • League and era effects: The model is trained on historical data and may not fully capture changes in playing style, tactical evolution, or youth development practices.
  • Small sample sizes: For any specific position-league combination, there may be only a few dozen historical examples, limiting the reliability of the estimates.

Practical Recommendations

  1. Do not use the model as a sole decision-maker. It should inform, not replace, expert judgment.
  2. Pay attention to the uncertainty. A prediction of "45% chance of making it" is very different from "90% chance." Treat borderline cases with caution.
  3. Update predictions regularly. A Bayesian approach that incorporates each new season of data is far more valuable than a one-time prediction.
  4. Consider the cost of errors. Releasing a player who would have made it (false negative) is more costly than retaining one who does not (false positive), so adjust decision thresholds accordingly.
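Point 4 has a standard closed form: if a false negative costs C_fn and a false positive costs C_fp, expected cost is minimized by retaining a player whenever P(success) exceeds C_fp / (C_fp + C_fn). A sketch with assumed costs (the 5:1 ratio is illustrative, not estimated):

```python
# Retain when p * C_fn > (1 - p) * C_fp  =>  p > C_fp / (C_fp + C_fn)
cost_fn = 5.0  # releasing a player who would have made it (assumed cost)
cost_fp = 1.0  # retaining a player who does not make it (assumed cost)

threshold = cost_fp / (cost_fp + cost_fn)
print(f"Retain players with P(success) > {threshold:.3f}")  # 0.167
```

With these costs, even a 1-in-5 prospect is worth keeping; the asymmetry pushes the decision threshold far below 50%.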

Code Reference

The complete implementation is available in code/case-study-code.py and code/example-02-player-forecasting.py.