Case Study 2: Forecasting Young Player Development — Will They Make It?
Overview
Every season, thousands of young players in professional academies dream of making it to the first team. But only a fraction will succeed. This case study builds a comprehensive predictive system to forecast which young players (ages 17-21) are most likely to establish themselves as regular professionals by age 25. We combine aging curve models, classification algorithms, and Bayesian updating to produce evolving predictions that improve as we observe more of each player's career.
Motivation
Consider a hypothetical Premier League club evaluating three academy graduates:
- Player A: 19-year-old central midfielder. Strong passing metrics, moderate physical profile, 800 first-team minutes across two seasons.
- Player B: 20-year-old striker. Exceptional goal-scoring record in the U21s, limited first-team exposure (200 minutes).
- Player C: 18-year-old winger. Outstanding dribbling statistics for his age, very few senior minutes.
The sporting director needs to decide which players to offer professional contracts to, which to loan, and which to release. A data-driven forecasting system can inform these decisions by estimating the probability of each player becoming a regular top-flight professional.
Defining "Making It"
We define success as: accumulating at least 5,000 minutes in a top-5 European league by age 25. This threshold, roughly 55 full matches, corresponds to about two seasons of substantial involvement -- for example, two years as a rotation player or roughly a season and a half as a regular starter.
This definition has several advantages:
- It is objective and measurable.
- It accounts for players who move between clubs (the minutes can be at any top-5 league club).
- It sets a meaningful bar -- 5,000 minutes requires genuine professional-level contribution.
Alternative definitions (market value, international caps, staying at the original club) are discussed in the appendix.
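For concreteness, the success label can be derived directly from the season-level records. The sketch below is a minimal illustration, assuming a dataframe df with the columns described in the next section and treating league_level == 1 seasons as top-5 league football (a simplification of this synthetic dataset):

import pandas as pd

def label_made_it(df, minutes_threshold=5000, age_cutoff=25):
    """Label each player 1 if they accumulate the minutes threshold in the
    top flight by the age cutoff (sketch; treats league_level == 1 as a
    top-5 league season)."""
    eligible = df[(df["age"] <= age_cutoff) & (df["league_level"] == 1)]
    top_flight_minutes = eligible.groupby("player_id")["minutes"].sum()
    labels = (top_flight_minutes >= minutes_threshold).astype(int)
    # Players with no qualifying seasons never appear in the groupby, so fill with 0
    return labels.reindex(df["player_id"].unique(), fill_value=0)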
Data Description
Our synthetic dataset contains career records for 1,200 players who were part of top-flight club academies at age 17. For each player, we have season-level observations from age 17 to 30 (where available):
| Feature | Description |
|---|---|
| player_id | Unique identifier |
| age | Age at start of season |
| position | Primary position (GK, CB, FB, CM, W, ST) |
| league_level | League tier (1-4, where 1 = top flight) |
| minutes | Minutes played in the season |
| goals_p90 | Goals per 90 minutes |
| assists_p90 | Assists per 90 minutes |
| xg_p90 | Expected goals per 90 |
| xa_p90 | Expected assists per 90 |
| progressive_passes_p90 | Progressive passes per 90 |
| progressive_carries_p90 | Progressive carries per 90 |
| tackles_p90 | Tackles per 90 |
| interceptions_p90 | Interceptions per 90 |
| aerial_pct | Aerial duel win percentage |
| pass_pct | Pass completion percentage |
| dribble_pct | Dribble success percentage |
| height_cm | Height in centimeters |
| made_it | Binary target: 1 if met success criteria |
Step 1: Base Rate Analysis
Before building models, we establish base rates:
base_rates = df.groupby("position")["made_it"].mean()
print("Success rates by position:")
for pos, rate in base_rates.items():
    print(f" {pos}: {rate:.1%}")
Typical base rates for academy players reaching the 5,000-minute threshold:
- Goalkeepers: ~15% (few slots available, but long careers once established)
- Center-backs: ~18% (physical development is crucial and somewhat predictable)
- Full-backs: ~20% (modern demands create more opportunities)
- Central midfielders: ~14% (highest competition for places)
- Wingers: ~16% (high washout rate due to reliance on pace)
- Strikers: ~12% (hardest position to predict, most variance in outcomes)
These base rates form our prior. Any prediction model must improve upon simply predicting the base rate for each position.
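To make "improve upon the base rate" measurable, the position-only prior can be scored with a proper scoring rule such as the Brier score; any model worth deploying should beat this number. A minimal sketch, reusing base_rates from above and deduplicating to one row per player so long careers are not over-weighted (in practice the base rates should also come from held-out data to avoid leakage):

from sklearn.metrics import brier_score_loss

# One row per player, scored with the positional base rate as the prediction
players = df.drop_duplicates("player_id")[["player_id", "position", "made_it"]]
baseline_probs = players["position"].map(base_rates)
print(f"Position-only baseline Brier score: "
      f"{brier_score_loss(players['made_it'], baseline_probs):.3f}")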
Step 2: Feature Engineering at Age 19
At age 19, we create a comprehensive feature vector that captures both current ability and trajectory:
def engineer_features(player_data, target_age=19):
    """Create prediction features from data available up to target_age."""
    features = {}

    # Current season metrics (age 19)
    current = player_data[player_data["age"] == target_age]
    if len(current) == 0:
        return None
    current = current.iloc[0]

    # Level of play
    features["current_league_level"] = current["league_level"]
    features["current_minutes"] = current["minutes"]

    # Performance metrics (only if sufficient minutes)
    if current["minutes"] >= 450:
        features["xg_p90"] = current["xg_p90"]
        features["xa_p90"] = current["xa_p90"]
        features["progressive_actions_p90"] = (
            current["progressive_passes_p90"] + current["progressive_carries_p90"]
        )
        features["defensive_actions_p90"] = (
            current["tackles_p90"] + current["interceptions_p90"]
        )
        features["pass_pct"] = current["pass_pct"]

    # Trajectory features (change from age 18 to 19)
    prev = player_data[player_data["age"] == target_age - 1]
    if len(prev) > 0:
        prev = prev.iloc[0]
        features["minutes_change"] = current["minutes"] - prev["minutes"]
        # Positive = moved to a higher level (level 1 is the top flight)
        features["league_level_change"] = prev["league_level"] - current["league_level"]
        if prev["minutes"] >= 450 and current["minutes"] >= 450:
            features["xg_improvement"] = current["xg_p90"] - prev["xg_p90"]

    # Cumulative features
    career = player_data[player_data["age"] <= target_age]
    features["total_career_minutes"] = career["minutes"].sum()
    features["max_league_level_played"] = career["league_level"].min()  # min because level 1 is highest

    # Physical
    features["height_cm"] = current["height_cm"]
    features["position"] = current["position"]

    return features
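To build the modeling table, the function can be applied player by player. The sketch below assumes the season-level dataframe df with a made_it column; the median imputation for low-minute players (whose performance metrics are absent) is a simple choice, not the only option:

import pandas as pd

rows = []
for player_id, player_data in df.groupby("player_id"):
    feats = engineer_features(player_data, target_age=19)
    if feats is None:
        continue  # no age-19 season observed
    feats["player_id"] = player_id
    feats["made_it"] = player_data["made_it"].iloc[0]
    rows.append(feats)

features_df = pd.DataFrame(rows).set_index("player_id")
target = features_df.pop("made_it")
# Players with < 450 minutes have missing performance metrics; impute with medians
features_df = features_df.fillna(features_df.median(numeric_only=True))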
Step 3: Classification Model
We train a Random Forest classifier to predict the binary "made it" outcome:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, precision_recall_curve, average_precision_score
)

def train_prospect_model(features_df, target):
    """Train and evaluate prospect classification model."""
    # Encode position as dummy variables
    features_encoded = pd.get_dummies(features_df, columns=["position"])

    # Stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    auc_scores = []
    ap_scores = []

    for train_idx, val_idx in skf.split(features_encoded, target):
        X_train = features_encoded.iloc[train_idx]
        X_val = features_encoded.iloc[val_idx]
        y_train = target.iloc[train_idx]
        y_val = target.iloc[val_idx]

        model = RandomForestClassifier(
            n_estimators=500,
            max_depth=8,
            min_samples_leaf=10,
            class_weight="balanced",
            random_state=42
        )
        model.fit(X_train, y_train)

        probs = model.predict_proba(X_val)[:, 1]
        auc_scores.append(roc_auc_score(y_val, probs))
        ap_scores.append(average_precision_score(y_val, probs))

    print(f"AUC-ROC: {np.mean(auc_scores):.3f} +/- {np.std(auc_scores):.3f}")
    print(f"Avg Precision: {np.mean(ap_scores):.3f} +/- {np.std(ap_scores):.3f}")
    return model
Feature Importance
The top predictive features, in order of importance:
- Total career minutes (at age 19): The single strongest predictor. Players who have already accumulated significant senior minutes have demonstrated they belong at that level.
- Current league level: Playing in a top-flight league at age 19 is a very strong signal.
- Minutes change (age 18 to 19): An increasing trajectory in playing time suggests the player is earning trust from coaches.
- Progressive actions per 90: Measures involvement in dangerous play, indicating quality beyond simple counting stats.
- Height (position-dependent): Significant for center-backs and strikers, less so for wingers and full-backs.
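The ranking above can be reproduced by inspecting the forest's impurity-based importances. A sketch, refitting on the full dataset (an extra step, since the cross-validation loop above returns only the final fold's model); note that impurity-based importances tend to favor high-cardinality features, so permutation importance is a more robust check:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features_encoded = pd.get_dummies(features_df, columns=["position"])
final_model = RandomForestClassifier(
    n_estimators=500, max_depth=8, min_samples_leaf=10,
    class_weight="balanced", random_state=42
)
final_model.fit(features_encoded, target)

importances = pd.Series(
    final_model.feature_importances_, index=features_encoded.columns
).sort_values(ascending=False)
print(importances.head(10))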
Step 4: Aging Curve Integration
We enhance the classification with trajectory-based features by estimating position-specific aging curves:
def project_development(current_metrics, current_age, position,
                        aging_curves, horizon=6):
    """Project player metrics forward using aging curves."""
    projections = {}
    for metric, curve in aging_curves[position].items():
        trajectory = [current_metrics.get(metric, 0)]
        for future_age in range(current_age + 1, current_age + horizon + 1):
            # Expected change from age (future_age - 1) to future_age
            delta = curve.get(future_age - 1, 0)
            trajectory.append(trajectory[-1] + delta)
        projections[metric] = trajectory
    return projections
The aging curve projection allows us to estimate what a 19-year-old's metrics might look like at age 25, accounting for the typical development pattern of players in their position.
For example, a 19-year-old central midfielder with xG per 90 of 0.08 might be projected to reach 0.12 by age 25, based on the average development trajectory for central midfielders. But the uncertainty around this projection is wide -- the 90% prediction interval might span from 0.05 to 0.20.
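The structure of aging_curves is left implicit above. One simple way to build it, sketched below, is to average year-over-year changes among players observed in consecutive seasons -- a delta method that ignores survivorship in who earns a second season. The metric list and the 'CM' example call are illustrative:

def estimate_aging_curves(df, metrics=("xg_p90", "xa_p90", "pass_pct")):
    """Average year-over-year change per metric, keyed by position and the
    starting age of each transition (to match curve.get(future_age - 1))."""
    curves = {}
    df = df.sort_values(["player_id", "age"])
    grouped = df.groupby("player_id")
    deltas = df.copy()
    for metric in metrics:
        deltas[f"d_{metric}"] = grouped[metric].diff()
    deltas["age_gap"] = grouped["age"].diff()
    consecutive = deltas[deltas["age_gap"] == 1]  # back-to-back seasons only

    for position, pos_rows in consecutive.groupby("position"):
        curves[position] = {}
        for metric in metrics:
            by_age = pos_rows.groupby("age")[f"d_{metric}"].mean()
            curves[position][metric] = {age - 1: delta for age, delta in by_age.items()}
    return curves

aging_curves = estimate_aging_curves(df)
projection = project_development(
    {"xg_p90": 0.08}, current_age=19, position="CM",
    aging_curves=aging_curves, horizon=6
)
print(projection["xg_p90"])  # projected xG per 90 trajectory from age 19 to 25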
Step 5: Bayesian Updating
The most powerful aspect of our system is its ability to update predictions as new data arrives. We use a Bayesian framework where:
- Prior: The initial prediction at age 19 serves as our prior probability of success.
- Likelihood: Each new season of data provides a likelihood update.
- Posterior: The updated probability after incorporating new data.
def bayesian_update(prior_prob, new_season_data, likelihood_model):
    """Update success probability after observing a new season.

    Args:
        prior_prob: Current probability of making it.
        new_season_data: Dict with the player's latest season metrics.
        likelihood_model: Model that estimates P(data | success) and P(data | failure).

    Returns:
        Updated (posterior) probability.
    """
    # Likelihood of observing this season given the player will make it
    p_data_given_success = likelihood_model.predict_proba_success(new_season_data)
    # Likelihood of observing this season given the player won't make it
    p_data_given_failure = likelihood_model.predict_proba_failure(new_season_data)

    # Bayes rule
    numerator = p_data_given_success * prior_prob
    denominator = (p_data_given_success * prior_prob +
                   p_data_given_failure * (1 - prior_prob))
    posterior = numerator / denominator
    return posterior
Example: Tracking Player A Over Time
| Age | Key Events | Updated P(success) |
|---|---|---|
| 19 | 800 minutes in PL, solid passing stats | 35% |
| 20 | Loaned to Championship, 2800 minutes, strong performance | 48% |
| 21 | Recalled, 1500 PL minutes, improving metrics | 62% |
| 22 | Regular starter, 2800 PL minutes | 78% |
| 23 | Established starter, international call-up | 91% |
Notice how the probability updates smoothly as evidence accumulates. A single breakout season does not immediately push the probability to near-certainty -- the Bayesian approach is appropriately cautious.
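The likelihood_model interface in bayesian_update is abstract. The sketch below stands in a deliberately simple, hypothetical version that scores only a season's minutes under two Gaussian densities (the parameters and the 35% starting prior are illustrative, not the numbers behind the table), then chains updates season by season:

from scipy.stats import norm

class MinutesLikelihoodModel:
    """Hypothetical likelihood model for illustration: scores a season's
    minutes under separate Gaussians for eventual successes and failures.
    Parameters are made up, not estimated from the dataset."""

    def __init__(self):
        self.success_dist = norm(loc=1500, scale=900)  # minutes among players who made it
        self.failure_dist = norm(loc=900, scale=900)   # minutes among players who did not

    def predict_proba_success(self, season):
        return self.success_dist.pdf(season["minutes"])

    def predict_proba_failure(self, season):
        return self.failure_dist.pdf(season["minutes"])

likelihood_model = MinutesLikelihoodModel()
prob = 0.35  # the age-19 model output acts as the prior
for season in [{"age": 20, "minutes": 2800},
               {"age": 21, "minutes": 1500},
               {"age": 22, "minutes": 2800}]:
    prob = bayesian_update(prob, season, likelihood_model)
    print(f"After the age-{season['age']} season: P(success) = {prob:.0%}")

Chaining updates this way treats seasons as conditionally independent given the eventual outcome, which is a simplification; a richer likelihood model would use the full season feature vector rather than minutes alone.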
Step 6: Survival Analysis for Dropout
We complement the classification model with a survival analysis that models the time to establishing oneself (or dropping out):
from lifelines import CoxPHFitter

def fit_establishment_model(career_data):
    """Fit survival model for time-to-establishment."""
    # Event: first season with 2000+ top-flight minutes
    # Censoring: player leaves top-flight system or reaches age 25
    survival_df = career_data.copy()
    survival_df["duration"] = survival_df["age"] - 17  # years since age 17
    survival_df["event"] = (survival_df["minutes"] >= 2000) & \
                           (survival_df["league_level"] == 1)

    cph = CoxPHFitter()
    cph.fit(
        survival_df[["duration", "event", "xg_p90", "progressive_actions_p90",
                     "total_career_minutes", "height_cm"]],
        duration_col="duration",
        event_col="event"
    )
    cph.print_summary()
    return cph
The survival model tells us not just whether a player will make it, but when. Some players establish themselves immediately at age 18-19 (early bloomers), while others take the loan route and break through at 22-23 (late developers).
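Once fitted, the Cox model can be queried for the probability of establishment by each age via its predicted survival function. A sketch, assuming career_data is prepared as above and using a hypothetical prospect_row with the same covariates:

import pandas as pd

cph = fit_establishment_model(career_data)

# Hypothetical 19-year-old prospect (covariate values are illustrative)
prospect_row = pd.DataFrame([{
    "xg_p90": 0.10,
    "progressive_actions_p90": 9.5,
    "total_career_minutes": 1400,
    "height_cm": 182,
}])

# Survival function = P(not yet established) as a function of years since age 17
surv = cph.predict_survival_function(prospect_row)
establishment_prob = 1 - surv  # P(established by each duration)
print(establishment_prob)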
Results and Validation
Model Performance
| Metric | Value |
|---|---|
| AUC-ROC | 0.79 |
| Average Precision | 0.42 |
| Brier Score | 0.12 |
| ECE (calibration) | 0.04 |
The model significantly outperforms the base rate (AUC = 0.50) and a simple heuristic based solely on minutes played (AUC = 0.71).
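The minutes-only heuristic can be reproduced by ranking players on a single feature, since AUC depends only on the ranking. A minimal sketch, reusing the features_df and target assembled in the Step 2 sketch:

from sklearn.metrics import roc_auc_score

# Rank players purely by senior minutes accumulated through age 19
minutes_only_auc = roc_auc_score(target, features_df["total_career_minutes"])
print(f"Minutes-only heuristic AUC: {minutes_only_auc:.3f}")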
Calibration Analysis
We verify that the model's predicted probabilities match observed success rates:
- Predicted 10-20%: Observed 14% success rate (well-calibrated)
- Predicted 20-40%: Observed 28% success rate (well-calibrated)
- Predicted 40-60%: Observed 51% success rate (well-calibrated)
- Predicted 60-80%: Observed 65% success rate (slightly overconfident)
- Predicted 80-100%: Observed 82% success rate (well-calibrated)
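Calibration buckets like these should be computed from out-of-fold predictions, since in-sample random forest probabilities are optimistic. A sketch of the bucket table, the Brier score, and a simple ECE estimate, assuming the same model settings and the features_df/target from earlier are reused:

import numpy as np
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import cross_val_predict

X = pd.get_dummies(features_df, columns=["position"])
model = RandomForestClassifier(n_estimators=500, max_depth=8, min_samples_leaf=10,
                               class_weight="balanced", random_state=42)

# Out-of-fold probabilities so calibration is measured on unseen players
oof_probs = cross_val_predict(model, X, target, cv=5, method="predict_proba")[:, 1]

frac_positive, mean_predicted = calibration_curve(target, oof_probs, n_bins=5)
for pred, obs in zip(mean_predicted, frac_positive):
    print(f"Predicted ~{pred:.0%} -> observed {obs:.0%}")
print(f"Brier score: {brier_score_loss(target, oof_probs):.3f}")

# Simple ECE: bin-weighted gap between mean predicted and observed success rates
bins = np.minimum((oof_probs * 5).astype(int), 4)
ece = sum((bins == b).mean() *
          abs(oof_probs[bins == b].mean() - target.values[bins == b].mean())
          for b in range(5) if (bins == b).any())
print(f"ECE: {ece:.3f}")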
Case Analysis
True Positives (Correctly identified successes): Players with early senior exposure, strong underlying metrics, and positive development trajectories were correctly flagged as likely successes.
True Negatives (Correctly identified non-successes): Players who accumulated minutes only at lower levels and showed stagnating or declining metrics were correctly assigned low probabilities.
False Positives (Predicted success, did not make it): Some physically gifted players with strong metrics at age 19 failed due to injuries or personal factors that the model cannot capture.
False Negatives (Predicted failure, succeeded): Late bloomers who showed limited output at age 19 but improved dramatically in their early 20s. These are the most interesting cases and represent the limits of early prediction.
Key Insights
- Minutes at the top level by age 19 is the strongest predictor. Players who have already proven they can contribute at the highest level are much more likely to continue doing so.
- Trajectory matters more than current level. A player whose metrics are improving rapidly is a better prospect than one with higher current metrics but a flat trajectory.
- Position matters for prediction reliability. Goalkeeper and center-back success is more predictable (physical attributes are important and measurable) than winger or striker success (which depends more on intangibles).
- Prediction accuracy improves dramatically with each additional season of data. At age 19, the model's AUC is 0.79. By age 21 (with two more seasons of updates), it reaches 0.88. Early prediction is inherently uncertain.
- The loan pathway is informative. Players who excel during loans to competitive leagues provide strong evidence of their ability to adapt to high-level football.
Limitations
- Survivorship bias: Our dataset only includes players who were in academies at age 17. Players who enter professional football through non-academy routes are not captured.
- Unobservable factors: Mentality, work ethic, injury proneness, and personal circumstances are not in the data but significantly affect outcomes.
- League and era effects: The model is trained on historical data and may not fully capture changes in playing style, tactical evolution, or youth development practices.
- Small sample sizes: For any specific position-league combination, there may be only a few dozen historical examples, limiting the reliability of the estimates.
Practical Recommendations
- Do not use the model as a sole decision-maker. It should inform, not replace, expert judgment.
- Pay attention to the uncertainty. A prediction of "45% chance of making it" is very different from "90% chance." Treat borderline cases with caution.
- Update predictions regularly. A Bayesian approach that incorporates each new season of data is far more valuable than a one-time prediction.
- Consider the cost of errors. Releasing a player who would have made it (false negative) is more costly than retaining one who does not (false positive), so adjust decision thresholds accordingly.
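One way to act on asymmetric error costs is to choose the retain/release threshold that minimizes total cost on historical out-of-fold predictions rather than defaulting to 0.5. A sketch, reusing oof_probs and target from the calibration sketch; the 3:1 cost ratio is an illustrative assumption, not a club-specific figure:

import numpy as np

def choose_threshold(y_true, probs, cost_fn=3.0, cost_fp=1.0):
    """Threshold minimizing total cost, where cost_fn is the cost of releasing
    a player who would have made it and cost_fp the cost of retaining one who
    does not (illustrative 3:1 ratio)."""
    y_true, probs = np.asarray(y_true), np.asarray(probs)
    best_t, best_cost = 0.5, np.inf
    for t in np.linspace(0.05, 0.95, 181):
        retain = probs >= t
        cost = (cost_fn * np.sum(~retain & (y_true == 1)) +
                cost_fp * np.sum(retain & (y_true == 0)))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

print(f"Cost-minimizing retain threshold: {choose_threshold(target, oof_probs):.2f}")

For a well-calibrated model the same logic has a closed form: retain whenever P(success) exceeds cost_fp / (cost_fp + cost_fn), i.e. 0.25 under a 3:1 ratio, which is why the decision threshold should sit well below 0.5.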
Code Reference
The complete implementation is available in code/case-study-code.py and code/example-02-player-forecasting.py.