Learning Objectives
- Build end-to-end analytics pipelines from raw data to actionable insights
- Design and execute a professional scouting campaign using data
- Conduct tactical analysis across an entire season
- Develop injury prevention models and monitoring systems
- Automate match preparation reports for coaching staff
- Track and quantify player development over time
- Integrate techniques from multiple chapters into cohesive analytical projects
In This Chapter
- Introduction
- 29.1 Case Study: Building a Complete xG Pipeline
- 29.2 Case Study: Scouting Campaign for a Striker
- 29.3 Case Study: Tactical Analysis of a Season
- 29.4 Case Study: Injury Prevention Program
- 29.5 Case Study: Match Preparation Report
- 29.6 Case Study: Player Development Tracking
- 29.7 Cross-Case-Study Integration
- Summary
- References
Chapter 29: Comprehensive Case Studies
Introduction
Throughout this textbook, we have built a comprehensive toolkit for soccer analytics---from foundational statistics and data engineering to advanced machine learning, network analysis, and simulation. In this capstone chapter, we bring everything together through six detailed case studies that mirror real-world analytics workflows at professional clubs.
Each case study is designed as a self-contained project that integrates techniques from multiple chapters. They progress from data acquisition and cleaning through modeling, visualization, and communication of results to stakeholders. The emphasis throughout is on practical implementation: the kind of work that analytics departments perform daily.
Note for Practitioners: These case studies are modeled on real workflows but use synthetic or publicly available data. The architectural patterns, however, are drawn directly from professional club analytics departments and consultancies.
The six case studies are:
| Case Study | Domain | Primary Techniques | Chapters Referenced |
|---|---|---|---|
| 29.1 | xG Pipeline | Feature engineering, logistic regression, model deployment | 3, 6, 7, 4 |
| 29.2 | Scouting | Clustering, similarity metrics, multi-criteria decision analysis | 15, 21, 19 |
| 29.3 | Tactical Analysis | Network analysis, passing models, formation detection | 10, 11, 16 |
| 29.4 | Injury Prevention | Survival analysis, workload monitoring, Bayesian inference | 26, 18, 20 |
| 29.5 | Match Preparation | Automated reporting, visualization, opponent profiling | 6, 7, 22 |
| 29.6 | Player Development | Longitudinal analysis, growth curves, radar charts | 15, 19, 21 |
Integration Across Chapters: One of the central lessons of professional analytics work is that no technique operates in isolation. The xG pipeline (29.1) relies on statistical foundations from Chapter 3 and expected goals methodology from Chapter 7, but it also feeds into the match preparation system (29.5) and the scouting framework (29.2). The tactical analysis (29.3) uses the network analysis methods from Chapter 10 and the team analysis from Chapter 16, but the insights it produces inform the injury prevention program (29.4) by identifying which players are exposed to the highest physical loads based on tactical role. Throughout these case studies, pay attention to the connections between techniques---this is how real analytics departments operate.
29.1 Case Study: Building a Complete xG Pipeline
29.1.1 Project Overview
Expected Goals (xG) is the foundational metric of modern soccer analytics. In this case study, we build a complete xG pipeline from raw event data to a deployed model that can score shots in near-real-time. The pipeline encompasses data ingestion, feature engineering, model training, evaluation, calibration, and deployment.
Objective: Build a production-grade xG model that achieves a log-loss below 0.34 and maintains calibration across shot types.
Architecture Overview:
Raw Event Data --> Data Cleaning --> Feature Engineering --> Model Training
                        |                                        |
                        v                                        v
                  Data Validation                         Model Evaluation
                                                                 |
                                                                 v
                                                    Calibration & Deployment
Template Workflow --- xG Pipeline: The workflow for building an xG model can be adapted to almost any expected metric (xA, xT, xGOT). The pattern is: (1) Define the event of interest, (2) Identify the outcome variable, (3) Engineer spatial and contextual features, (4) Train and evaluate candidate models, (5) Calibrate, (6) Deploy with monitoring. Keep this template in mind as you read through the case study.
29.1.2 Data Ingestion and Cleaning
The first stage of any analytics pipeline is reliable data ingestion. We work with event-level shot data that includes spatial coordinates, body part, play pattern, and contextual features.
import pandas as pd
import numpy as np
from typing import Tuple, Optional
def load_shot_data(filepath: str) -> pd.DataFrame:
"""Load and perform initial validation on shot event data.
Args:
filepath: Path to the CSV file containing shot events.
Returns:
Cleaned DataFrame with validated shot records.
"""
df = pd.read_csv(filepath)
# Validate coordinate ranges
df = df[
(df['x'].between(0, 120)) &
(df['y'].between(0, 80))
].copy()
# Standardize categorical variables
df['body_part'] = df['body_part'].str.lower().str.strip()
df['play_pattern'] = df['play_pattern'].str.lower().str.strip()
# Create binary outcome
df['is_goal'] = (df['outcome'] == 'Goal').astype(int)
return df
Data Quality Callout: In professional settings, data providers occasionally introduce coordinate system changes between seasons. Always validate that pitch coordinates are consistent before combining multi-season datasets. A simple sanity check is verifying that the distribution of shot locations forms the expected pattern concentrated around the penalty area.
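One way to automate that sanity check is a per-season summary of shot locations. A minimal sketch, assuming the standardized 120x80 coordinates used throughout this case study, a season column, and an illustrative 0.40 alarm threshold:

import pandas as pd

def check_shot_location_sanity(df: pd.DataFrame) -> pd.DataFrame:
    """Flag seasons whose shot locations look inconsistent.

    Assumes 120x80 coordinates with the attacking goal at x = 120.
    """
    checks = []
    for season, season_df in df.groupby('season'):
        # Share of shots inside the penalty area (x >= 102, 18 <= y <= 62)
        in_box_share = (
            (season_df['x'] >= 102) & (season_df['y'].between(18, 62))
        ).mean()
        checks.append({
            'season': season,
            'median_x': season_df['x'].median(),
            'in_box_share': in_box_share,
            # Most shots cluster in and around the box; a much lower
            # share usually points to a coordinate-system problem
            'suspicious': in_box_share < 0.40,
        })
    return pd.DataFrame(checks)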
Data Validation Pipeline:
Before proceeding to feature engineering, a robust pipeline includes automated data validation checks. These checks catch issues that would otherwise silently corrupt model training.
def validate_shot_data(df: pd.DataFrame) -> dict:
"""Run comprehensive validation checks on shot data.
Args:
df: DataFrame with shot event records.
Returns:
Dictionary with validation results and warnings.
"""
validation_results = {
'total_records': len(df),
'null_counts': df.isnull().sum().to_dict(),
'warnings': [],
'passed': True,
}
    # Check goal rate is within expected bounds (6-14%)
goal_rate = df['is_goal'].mean()
if not 0.06 <= goal_rate <= 0.14:
validation_results['warnings'].append(
f"Goal rate {goal_rate:.3f} outside expected range [0.06, 0.14]"
)
validation_results['passed'] = False
# Check for duplicate shot events
if 'shot_id' in df.columns:
n_dupes = df['shot_id'].duplicated().sum()
if n_dupes > 0:
validation_results['warnings'].append(
f"Found {n_dupes} duplicate shot IDs"
)
# Check body part distribution
if 'body_part' in df.columns:
foot_pct = df['body_part'].isin(['left foot', 'right foot']).mean()
if foot_pct < 0.70:
validation_results['warnings'].append(
f"Foot shot percentage {foot_pct:.2f} lower than expected (>0.70)"
)
# Verify coordinate consistency across seasons
if 'season' in df.columns:
for season in df['season'].unique():
season_data = df[df['season'] == season]
x_max = season_data['x'].max()
if abs(x_max - 120) > 5:
validation_results['warnings'].append(
f"Season {season}: max x={x_max:.1f}, expected ~120"
)
return validation_results
Common Mistake --- Coordinate Systems: One of the most frequent errors in xG modeling is mixing coordinate systems from different data providers. StatsBomb uses a 120x80 yard pitch, Opta uses a 100x100 percentage-based system, and Wyscout uses yet another convention. Always normalize to a common coordinate system before combining data. Failure to do this produces models that appear to work in cross-validation but fail catastrophically on new data.
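A minimal sketch of such a normalization step, assuming an Opta-style 100x100 percentage pitch as the source (the source labels are placeholders, and real conversions must also account for attacking direction and provider-specific event conventions):

def normalize_to_120x80(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rescale provider coordinates onto the 120x80 pitch used here."""
    df = df.copy()
    if source == 'opta_percent':
        # 100x100 percentage pitch: scale both axes linearly
        df['x'] = df['x'] * 120.0 / 100.0
        df['y'] = df['y'] * 80.0 / 100.0
    elif source == 'statsbomb':
        pass  # already on the 120x80 pitch
    else:
        raise ValueError(f'Unknown coordinate system: {source}')
    return df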
Data Sources for xG Modeling:
| Source | Data Type | Coverage | Access | Cost |
|---|---|---|---|---|
| StatsBomb Open Data | Event-level shots | Select competitions | Free (GitHub) | None |
| Opta / Stats Perform | Event-level with freeze frames | All major leagues | Commercial API | High |
| Wyscout | Event-level with qualifiers | 100+ leagues | Commercial subscription | Medium |
| Understat | Aggregated xG values | Top 5 leagues | Free (web scraping) | None |
| FBref | Aggregated shot statistics | Top leagues | Free (web) | None |
29.1.3 Feature Engineering
Feature engineering is where domain expertise meets data science. For xG, the most predictive features relate to the geometry of the shot relative to the goal.
Distance to Goal Center:
$$ d = \sqrt{(x - 120)^2 + (y - 40)^2} $$
where the goal center is at coordinates $(120, 40)$ on a standardized 120 x 80 pitch.
Angle to Goal:
$$ \theta = \arctan\left(\frac{7.32 \cdot (120 - x)}{(120 - x)^2 + (y - 40)^2 - (3.66)^2}\right) $$
This formula computes the angle subtended by the goalposts from the shot location, where 7.32 meters is the goal width and 3.66 meters is its half-width.
def engineer_shot_features(df: pd.DataFrame) -> pd.DataFrame:
"""Create geometric and contextual features for xG modeling.
Args:
df: DataFrame with raw shot coordinates and metadata.
Returns:
DataFrame augmented with engineered features.
"""
# Distance to goal center
df['distance'] = np.sqrt(
(df['x'] - 120) ** 2 + (df['y'] - 40) ** 2
)
# Angle to goal
    numerator = 7.32 * (120 - df['x'])
denominator = (
(120 - df['x']) ** 2 + (df['y'] - 40) ** 2 - 3.66 ** 2
)
df['angle'] = np.arctan2(numerator, denominator)
df['angle_degrees'] = np.degrees(df['angle'])
# Log-distance (captures diminishing returns)
df['log_distance'] = np.log1p(df['distance'])
# Central zone indicator
df['is_central'] = (df['y'].between(30, 50)).astype(int)
# Inside box indicator
df['in_box'] = (
(df['x'] >= 102) & (df['y'].between(18, 62))
).astype(int)
# Interaction features
df['angle_x_distance'] = df['angle'] * df['distance']
df['angle_squared'] = df['angle'] ** 2
return df
Advanced Feature Engineering --- Contextual Features:
The geometric features above form the baseline, but professional xG models incorporate much richer contextual information. When freeze-frame data (the positions of all players at the moment of the shot) is available, the following additional features significantly improve model performance:
def engineer_advanced_features(df: pd.DataFrame) -> pd.DataFrame:
"""Create advanced contextual features when freeze-frame data is available.
Args:
df: DataFrame with shot data including freeze-frame information.
Returns:
DataFrame with additional contextual features.
"""
# Number of defenders between shooter and goal
if 'n_defenders_in_cone' in df.columns:
df['defenders_blocking'] = df['n_defenders_in_cone']
# Goalkeeper position relative to optimal
if 'gk_distance_to_goal_center' in df.columns:
df['gk_out_of_position'] = (
df['gk_distance_to_goal_center'] > 3.0
).astype(int)
# Shot following a cross or through ball
if 'assist_type' in df.columns:
df['from_cross'] = (
df['assist_type'] == 'cross'
).astype(int)
df['from_through_ball'] = (
df['assist_type'] == 'through_ball'
).astype(int)
# Fast break indicator (counter-attack)
if 'play_pattern' in df.columns:
df['is_counter'] = (
df['play_pattern'] == 'from counter'
).astype(int)
df['is_set_piece'] = (
df['play_pattern'].isin([
'from corner', 'from free kick',
'from throw in'
])
).astype(int)
# Game state (winning, drawing, losing)
if 'goal_difference' in df.columns:
df['winning'] = (df['goal_difference'] > 0).astype(int)
df['losing'] = (df['goal_difference'] < 0).astype(int)
# Match minute buckets
if 'minute' in df.columns:
df['late_game'] = (df['minute'] >= 75).astype(int)
df['injury_time'] = (df['minute'] >= 90).astype(int)
return df
Practical Tip --- Feature Importance: After training your model, always examine feature importances. If you find that a contextual feature (like game state or minute) has very high importance, it may indicate data leakage or a confound rather than a genuine predictive signal. For instance, shots in injury time may have higher conversion rates because they are more likely to occur on counter-attacks against teams pushing forward.
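One way to run that audit is permutation importance on a held-out set, which measures how much shuffling each feature degrades log-loss. A sketch using scikit-learn, assuming a validation split already exists:

from sklearn.inspection import permutation_importance

def audit_feature_importance(model, X_val, y_val, feature_names) -> pd.DataFrame:
    """Rank features by how much shuffling them hurts held-out log-loss."""
    result = permutation_importance(
        model, X_val, y_val,
        scoring='neg_log_loss', n_repeats=20, random_state=42
    )
    return pd.DataFrame({
        'feature': feature_names,
        'importance_mean': result.importances_mean,
        'importance_std': result.importances_std,
    }).sort_values('importance_mean', ascending=False)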
29.1.4 Model Training and Selection
We compare three model families and select based on log-loss and calibration:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss, brier_score_loss
FEATURE_COLS = [
'distance', 'angle', 'log_distance', 'is_central',
'in_box', 'angle_x_distance', 'angle_squared'
]
def train_and_evaluate(
df: pd.DataFrame,
feature_cols: list[str],
n_splits: int = 5
) -> dict:
"""Train multiple xG models and compare via cross-validation.
Args:
df: Feature-engineered DataFrame.
feature_cols: List of feature column names.
n_splits: Number of CV folds.
Returns:
        Dictionary mapping model names to the mean and standard deviation of cross-validated log-loss.
"""
X = df[feature_cols].values
y = df['is_goal'].values
models = {
'logistic_regression': LogisticRegression(max_iter=1000),
'gradient_boosting': GradientBoostingClassifier(
n_estimators=200, max_depth=4, learning_rate=0.1
),
'random_forest': RandomForestClassifier(
n_estimators=200, max_depth=6
),
}
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
scores = cross_val_score(
model, X, y, cv=cv, scoring='neg_log_loss'
)
results[name] = {
'mean_log_loss': -scores.mean(),
'std_log_loss': scores.std(),
}
return results
Hyperparameter Tuning:
In a production setting, hyperparameter tuning is essential. We use Bayesian optimization rather than grid search for efficiency:
import optuna
def tune_gradient_boosting(
X: np.ndarray,
y: np.ndarray,
n_trials: int = 50
) -> dict:
"""Tune gradient boosting hyperparameters using Bayesian optimization.
Args:
X: Feature matrix.
y: Target vector.
n_trials: Number of optimization trials.
Returns:
Dictionary with best hyperparameters and score.
"""
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 100, 500),
'max_depth': trial.suggest_int('max_depth', 3, 8),
'learning_rate': trial.suggest_float(
'learning_rate', 0.01, 0.3, log=True
),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 5, 50),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
}
model = GradientBoostingClassifier(**params, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
model, X, y, cv=cv, scoring='neg_log_loss'
)
return -scores.mean()
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=n_trials)
return {
'best_params': study.best_params,
'best_log_loss': study.best_value,
}
Common Mistake --- Temporal Leakage: A critical error in xG model evaluation is failing to respect temporal ordering. If your training data spans multiple seasons, you must evaluate using time-based splits (train on seasons 1-3, test on season 4), not random cross-validation. Random CV will overestimate performance because shots from the same match or season may appear in both training and test sets, and systematic factors (pitch conditions, ball design, tactical trends) correlate within seasons.
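A simple season-based split along these lines (assuming a season column) avoids that leakage:

def temporal_train_test_split(
    df: pd.DataFrame,
    train_seasons: list[str],
    test_seasons: list[str]
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split shot data by season rather than at random."""
    train = df[df['season'].isin(train_seasons)]
    test = df[df['season'].isin(test_seasons)]
    return train, test

# Illustrative usage: train on three seasons, hold out the fourth
# train_df, test_df = temporal_train_test_split(
#     shots, ['2019/20', '2020/21', '2021/22'], ['2022/23']
# )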
29.1.5 Model Calibration
A well-calibrated xG model should produce probabilities that match observed goal rates. For instance, shots assigned $xG = 0.20$ should result in goals approximately 20% of the time.
$$ \text{Calibration Error} = \frac{1}{B} \sum_{b=1}^{B} |p_b - \hat{p}_b| $$
where $B$ is the number of bins, $p_b$ is the observed proportion of goals in bin $b$, and $\hat{p}_b$ is the mean predicted probability in that bin.
from sklearn.calibration import CalibratedClassifierCV
def calibrate_and_evaluate(
model,
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
y_test: np.ndarray,
method: str = 'isotonic'
) -> dict:
"""Calibrate a trained model and evaluate calibration quality.
Args:
model: Trained sklearn classifier.
X_train: Training features.
y_train: Training labels.
X_test: Test features.
y_test: Test labels.
method: Calibration method ('sigmoid' for Platt scaling,
'isotonic' for isotonic regression).
Returns:
Dictionary with calibration metrics before and after calibration.
"""
# Pre-calibration metrics
y_pred_raw = model.predict_proba(X_test)[:, 1]
raw_log_loss = log_loss(y_test, y_pred_raw)
raw_brier = brier_score_loss(y_test, y_pred_raw)
# Calibrate
calibrated = CalibratedClassifierCV(
model, method=method, cv=5
)
calibrated.fit(X_train, y_train)
# Post-calibration metrics
y_pred_cal = calibrated.predict_proba(X_test)[:, 1]
cal_log_loss = log_loss(y_test, y_pred_cal)
cal_brier = brier_score_loss(y_test, y_pred_cal)
# Compute calibration curve
prob_true, prob_pred = calibration_curve(
y_test, y_pred_cal, n_bins=10
)
return {
'raw_log_loss': raw_log_loss,
'raw_brier': raw_brier,
'calibrated_log_loss': cal_log_loss,
'calibrated_brier': cal_brier,
'calibration_curve': {
'observed': prob_true.tolist(),
'predicted': prob_pred.tolist(),
},
}
Key Insight: Gradient boosting models often require post-hoc calibration (e.g., Platt scaling or isotonic regression) because their raw outputs can be poorly calibrated despite achieving low log-loss. Logistic regression, by contrast, tends to be well calibrated out of the box because it directly optimizes log-loss.
Calibration by Shot Type:
A global calibration check is not sufficient. You should also verify calibration within important subgroups:
def check_subgroup_calibration(
df: pd.DataFrame,
pred_col: str = 'xg_pred',
target_col: str = 'is_goal',
subgroup_col: str = 'body_part',
n_bins: int = 5
) -> pd.DataFrame:
"""Check model calibration across subgroups.
Args:
df: DataFrame with predictions and outcomes.
pred_col: Column with predicted probabilities.
target_col: Column with binary outcomes.
subgroup_col: Column defining subgroups.
n_bins: Number of calibration bins per subgroup.
Returns:
DataFrame with calibration error per subgroup.
"""
results = []
for group_name, group_df in df.groupby(subgroup_col):
if len(group_df) < 100:
continue
prob_true, prob_pred = calibration_curve(
group_df[target_col],
group_df[pred_col],
n_bins=n_bins
)
cal_error = np.mean(np.abs(prob_true - prob_pred))
results.append({
'subgroup': group_name,
'n_shots': len(group_df),
'calibration_error': cal_error,
'mean_prediction': group_df[pred_col].mean(),
'observed_rate': group_df[target_col].mean(),
})
return pd.DataFrame(results).sort_values('calibration_error')
29.1.6 Deployment Architecture
The final xG model is serialized using joblib and wrapped in a scoring function that can be called from the club's data platform:
import joblib
from datetime import datetime
def deploy_model(model, scaler, filepath: str) -> None:
"""Serialize trained model and preprocessing objects.
Args:
model: Trained sklearn model.
scaler: Fitted StandardScaler or similar preprocessor.
filepath: Output path for the serialized pipeline.
"""
pipeline_artifact = {
'model': model,
'scaler': scaler,
'feature_names': FEATURE_COLS,
'version': '1.0.0',
'trained_date': datetime.now().isoformat(),
        'n_features': model.n_features_in_,
}
joblib.dump(pipeline_artifact, filepath)
def score_new_shot(
shot_data: dict,
model_path: str
) -> float:
"""Score a single shot using the deployed xG model.
Args:
shot_data: Dictionary with shot features.
model_path: Path to the serialized model artifact.
Returns:
xG probability for the shot.
"""
artifact = joblib.load(model_path)
features = np.array([
[shot_data[f] for f in artifact['feature_names']]
])
if artifact.get('scaler'):
features = artifact['scaler'].transform(features)
prob = artifact['model'].predict_proba(features)[0, 1]
return float(prob)
Deployment Monitoring:
A deployed model requires continuous monitoring to detect drift and degradation:
def monitor_model_performance(
predictions_log: pd.DataFrame,
window_days: int = 30
) -> dict:
"""Monitor deployed xG model performance over a rolling window.
Args:
predictions_log: DataFrame with columns 'date', 'xg_pred', 'is_goal'.
window_days: Rolling window for performance calculation.
Returns:
Dictionary with current performance metrics and drift alerts.
"""
recent = predictions_log[
predictions_log['date'] >= (
predictions_log['date'].max() - pd.Timedelta(days=window_days)
)
]
current_log_loss = log_loss(recent['is_goal'], recent['xg_pred'])
current_brier = brier_score_loss(recent['is_goal'], recent['xg_pred'])
mean_pred = recent['xg_pred'].mean()
observed_rate = recent['is_goal'].mean()
alerts = []
if current_log_loss > 0.36:
alerts.append('Log-loss exceeds threshold (0.36)')
if abs(mean_pred - observed_rate) > 0.03:
alerts.append(
f'Prediction-outcome gap: {abs(mean_pred - observed_rate):.3f}'
)
return {
'log_loss': current_log_loss,
'brier_score': current_brier,
'mean_prediction': mean_pred,
'observed_goal_rate': observed_rate,
'n_predictions': len(recent),
'alerts': alerts,
}
29.1.7 Results and Discussion
A well-built xG pipeline typically achieves the following benchmarks on large datasets (50,000+ shots):
| Model | Log-Loss | Brier Score | Calibration Error |
|---|---|---|---|
| Logistic Regression | 0.338 | 0.082 | 0.012 |
| Gradient Boosting | 0.321 | 0.078 | 0.018 |
| GB + Isotonic Calibration | 0.323 | 0.076 | 0.008 |
The gradient boosting model with isotonic calibration provides the best combination of discrimination and calibration. However, logistic regression remains a strong baseline and offers superior interpretability for communicating with coaching staff.
Lesson Learned --- xG Pipeline: The most common failure mode in xG model development is over-engineering the feature set without investing enough in data quality. A simple model with clean, well-validated data will outperform a complex model built on messy data every time. Start with the geometric baseline (distance and angle), verify that your model beats the naive base rate, and then incrementally add features while monitoring both performance and calibration. This incremental approach also makes it much easier to explain the model to non-technical stakeholders.
29.1.8 Reproducing This Analysis
To reproduce this case study with publicly available data:
- Download StatsBomb open data from GitHub via the statsbombpy library (a fetch sketch follows this list)
- Filter for shot events across available competitions
- Standardize coordinates to the 120x80 system
- Run the feature engineering pipeline above
- Split data temporally (earlier competitions for training, later for testing)
- Train and evaluate using the provided code
- Expected dataset size: approximately 20,000-40,000 shots depending on competitions selected
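A sketch of the first three steps using statsbombpy (API current at the time of writing; competition and season IDs come from sb.competitions()):

from statsbombpy import sb
import pandas as pd

def fetch_open_shots(competition_id: int, season_id: int) -> pd.DataFrame:
    """Pull all shot events for one competition-season of open data."""
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    shot_frames = []
    for match_id in matches['match_id']:
        events = sb.events(match_id=match_id)
        shots = events[events['type'] == 'Shot'].copy()
        # StatsBomb stores coordinates as an [x, y] list in 'location'
        shots[['x', 'y']] = pd.DataFrame(
            shots['location'].tolist(), index=shots.index
        )
        shot_frames.append(shots)
    return pd.concat(shot_frames, ignore_index=True)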
29.2 Case Study: Scouting Campaign for a Striker
29.2.1 Project Overview
A mid-table Premier League club needs to replace a departing striker. The analytics department is tasked with identifying candidates who fit the tactical profile, are within budget, and have a high probability of success in the league. This case study walks through the entire scouting workflow, from defining requirements to producing a shortlist.
Objective: Produce a ranked shortlist of 5 striker candidates from a pool of 500+ forwards across Europe's top leagues.
Template Workflow --- Scouting Campaign: The scouting workflow follows six stages: (1) Define the player profile with coaching staff, (2) Build and normalize the scouting database, (3) Apply quantitative filters and scoring, (4) Perform similarity and cluster analysis, (5) Apply league-level adjustments, (6) Assess squad fit and financial viability. Each stage has clear inputs, outputs, and decision points. Document every decision for transparency and future reference.
29.2.2 Defining the Player Profile
The first step is translating the coaching staff's requirements into quantifiable criteria. Through meetings with the head coach and sporting director, the following profile emerges:
| Attribute | Minimum Threshold | Ideal Target | Weight |
|---|---|---|---|
| Non-penalty xG per 90 | 0.35 | 0.50+ | 0.25 |
| Pressing actions per 90 | 15 | 20+ | 0.15 |
| Aerial duels won % | 45% | 55%+ | 0.10 |
| Progressive carries per 90 | 3.0 | 7.0+ | 0.15 |
| Age | 21-28 | 23-26 | 0.10 |
| Contract years remaining | 1-3 | 1-2 | 0.10 |
| Estimated transfer fee | < €25M | < €15M | 0.15 |
Practical Tip --- Stakeholder Meetings: The player profile definition meeting is arguably the most important step in the entire scouting process. Poorly defined requirements lead to wasted analytical effort and a shortlist that the coaching staff will reject. Prepare for this meeting by bringing data on the departing player's profile, the team's current attacking patterns, and examples of different striker archetypes. Ask open-ended questions ("What does the ideal striker look like in our system?") before presenting quantitative thresholds.
29.2.3 Data Collection and Normalization
We aggregate data from multiple sources: event data providers for on-ball metrics, tracking data for off-ball movement, and financial databases for market values.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
def build_scouting_database(
events_path: str,
tracking_path: str,
market_path: str,
min_minutes: int = 1500
) -> pd.DataFrame:
"""Merge multiple data sources into a unified scouting database.
Args:
events_path: Path to event-level aggregated stats.
tracking_path: Path to tracking-derived metrics.
market_path: Path to market value and contract data.
min_minutes: Minimum minutes played for inclusion.
Returns:
Merged DataFrame with per-90 metrics and market info.
"""
events = pd.read_csv(events_path)
tracking = pd.read_csv(tracking_path)
market = pd.read_csv(market_path)
# Filter by playing time
events = events[events['minutes_played'] >= min_minutes].copy()
# Merge on player ID
df = events.merge(tracking, on='player_id', how='inner')
df = df.merge(market, on='player_id', how='inner')
return df
Per-90 Normalization:
$$ \text{metric}_{p90} = \frac{\text{metric}_{\text{total}}}{\text{minutes played}} \times 90 $$
Scouting Callout: Per-90 normalization is essential for fair comparison across leagues with different schedules, but be wary of players with marginal playing time. A minimum threshold of 1,500 minutes (roughly 17 full matches) is standard practice to ensure statistical stability.
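The normalization itself is a one-line transformation. A small helper that appends per-90 versions of raw season totals, assuming a minutes_played column as in build_scouting_database:

def add_per90_columns(
    df: pd.DataFrame,
    metric_cols: list[str]
) -> pd.DataFrame:
    """Append per-90 versions of raw season-total metrics."""
    df = df.copy()
    for col in metric_cols:
        df[f'{col}_p90'] = df[col] / df['minutes_played'] * 90
    return df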
League-Level Adjustments:
Raw per-90 metrics are not directly comparable across leagues. A striker scoring 0.45 npxG/90 in the Dutch Eredivisie is not equivalent to the same figure in the Premier League. We apply league adjustment factors based on historical transfer performance data:
def apply_league_adjustments(
df: pd.DataFrame,
adjustment_factors: dict,
metrics_to_adjust: list[str]
) -> pd.DataFrame:
"""Adjust per-90 metrics for league strength differences.
Args:
df: Scouting database with per-90 metrics.
adjustment_factors: Dict mapping league names to
multiplicative adjustment factors.
metrics_to_adjust: List of metric columns to adjust.
Returns:
DataFrame with league-adjusted metrics.
"""
df = df.copy()
for metric in metrics_to_adjust:
adjusted_col = f'{metric}_adj'
df[adjusted_col] = df.apply(
lambda row: row[metric] * adjustment_factors.get(
row['league'], 1.0
),
axis=1
)
return df
# Example adjustment factors (illustrative)
LEAGUE_ADJUSTMENTS = {
'Premier League': 1.00,
'La Liga': 0.97,
'Bundesliga': 0.95,
'Serie A': 0.96,
'Ligue 1': 0.92,
'Eredivisie': 0.82,
'Primeira Liga': 0.85,
'Belgian Pro League': 0.80,
'Championship': 0.78,
'Austrian Bundesliga': 0.75,
}
Common Mistake --- League Adjustments: League adjustment factors are inherently uncertain and should be treated as rough guidelines, not precise conversion rates. A player's individual profile matters more than the league average. A technically elite player in a weaker league may translate better than the adjustment factor suggests, while a physically dominant player in the same league may struggle in a more technical environment. Use adjustments to flag candidates for closer inspection, not to make final decisions.
29.2.4 Multi-Criteria Scoring
We implement a weighted scoring system that combines all attributes into a single composite score:
def compute_scouting_score(
df: pd.DataFrame,
criteria: dict[str, dict],
) -> pd.DataFrame:
"""Compute weighted composite scouting scores.
Args:
df: Scouting database with all metrics.
criteria: Dict mapping metric names to dicts with
'weight', 'min_threshold', and 'direction' keys.
Returns:
DataFrame with added 'composite_score' column, sorted descending.
"""
    scaler = MinMaxScaler()
    df = df.copy()
    # Apply all minimum-threshold filters first so that every score
    # component below is computed over the same set of players
    for metric, params in criteria.items():
        if 'min_threshold' in params:
            df = df[df[metric] >= params['min_threshold']]
    score_components = []
    for metric, params in criteria.items():
        # Normalize to [0, 1]
        normalized = scaler.fit_transform(
            df[metric].values.reshape(-1, 1)
        ).flatten()
        # Invert if lower is better (e.g., transfer fee)
        if params.get('direction') == 'lower':
            normalized = 1 - normalized
        score_components.append(normalized * params['weight'])
    df['composite_score'] = np.sum(score_components, axis=0)
    return df.sort_values('composite_score', ascending=False)
29.2.5 Similarity Analysis
Beyond raw scoring, we use player similarity analysis to find candidates who match the playing style of the departing striker or a specified archetype.
Cosine Similarity:
$$ \text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} $$
from sklearn.metrics.pairwise import cosine_similarity
def find_similar_players(
df: pd.DataFrame,
target_player_id: str,
feature_cols: list[str],
top_n: int = 10
) -> pd.DataFrame:
"""Find players most similar to a target player profile.
Args:
df: Scouting database.
target_player_id: ID of the reference player.
feature_cols: Metrics to use for similarity computation.
top_n: Number of similar players to return.
Returns:
DataFrame of the top_n most similar players with similarity scores.
"""
scaler = MinMaxScaler()
X = scaler.fit_transform(df[feature_cols].values)
    # Locate the target row positionally so this works regardless of
    # the DataFrame's index labels
    target_pos = np.where(df['player_id'].values == target_player_id)[0][0]
    target_vec = X[target_pos].reshape(1, -1)
similarities = cosine_similarity(target_vec, X).flatten()
df['similarity'] = similarities
return (
df[df['player_id'] != target_player_id]
.nlargest(top_n, 'similarity')
)
29.2.6 Cluster Analysis for Archetype Discovery
We use $k$-means clustering to discover natural player archetypes within the forward population:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
def discover_archetypes(
df: pd.DataFrame,
feature_cols: list[str],
n_clusters: int = 6
) -> pd.DataFrame:
"""Identify player archetypes using clustering.
Args:
df: Scouting database.
feature_cols: Metrics for clustering.
n_clusters: Number of archetypes to discover.
Returns:
DataFrame with added 'archetype' column.
"""
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df[feature_cols].values)
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
df['archetype'] = kmeans.fit_predict(X_scaled)
# PCA for visualization
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
df['pca_1'] = coords[:, 0]
df['pca_2'] = coords[:, 1]
return df
29.2.7 Squad Fit Analysis
A player who scores well on individual metrics may still be a poor fit for the team. Squad fit analysis evaluates how a potential signing would integrate into the existing squad.
def assess_squad_fit(
candidate: pd.Series,
current_squad: pd.DataFrame,
tactical_profile: dict,
position: str = 'striker'
) -> dict:
"""Evaluate how well a candidate fits the existing squad.
Args:
candidate: Series with the candidate's metrics.
current_squad: DataFrame with current squad members.
tactical_profile: Dict describing the team's tactical approach.
position: Position the candidate would fill.
Returns:
Dictionary with fit scores across multiple dimensions.
"""
    fit_scores = {}
    # Age profile fit: distance from the tactical profile's ideal age,
    # clamped so the score stays within [0, 1]
    fit_scores['age_balance'] = max(0.0, 1.0 - abs(
        candidate['age'] - tactical_profile.get('ideal_age', 25)
    ) / 10.0)
# Wage structure fit
if 'estimated_wages' in candidate.index:
max_wage = current_squad['wages'].quantile(0.9)
fit_scores['wage_fit'] = min(
1.0, max_wage / max(candidate['estimated_wages'], 1)
)
# Tactical complementarity
if tactical_profile.get('needs_aerial_threat', False):
fit_scores['aerial_fit'] = min(
candidate.get('aerial_won_pct', 0) / 60.0, 1.0
)
if tactical_profile.get('needs_pressing', False):
fit_scores['pressing_fit'] = min(
candidate.get('pressing_actions_p90', 0) / 25.0, 1.0
)
# Overall fit score
fit_scores['overall'] = np.mean(list(fit_scores.values()))
return fit_scores
Lesson Learned --- Scouting: The greatest risk in data-driven scouting is not analytical error but organizational misalignment. If the analytics department produces a shortlist that the head coach does not trust or the sporting director cannot negotiate within budget, the entire exercise is wasted. Successful scouting campaigns require continuous communication between analytics, coaching, scouting, and recruitment departments. The data narrows the funnel; human judgment and relationships close the deal.
29.2.8 Financial Modeling
Transfer decisions are fundamentally financial decisions. The analytics department should provide financial context alongside performance analysis:
def estimate_transfer_value(
player: pd.Series,
market_conditions: dict
) -> dict:
"""Estimate a player's transfer value and assess financial viability.
Args:
player: Series with player attributes and market data.
market_conditions: Dict with current market parameters.
Returns:
Dictionary with value estimates and financial projections.
"""
# Base valuation from market comparables
base_value = player.get('market_value_eur', 0)
# Contract adjustment: fewer years remaining = lower fee
contract_factor = min(player.get('contract_years', 3) / 4.0, 1.0)
# Age adjustment: premium for peak years, discount otherwise
age = player.get('age', 25)
if 23 <= age <= 27:
age_factor = 1.1
elif age < 23:
age_factor = 1.05
else:
age_factor = max(0.6, 1.0 - (age - 27) * 0.08)
estimated_fee = base_value * contract_factor * age_factor
# Amortization over expected contract length
contract_length = market_conditions.get('standard_contract_years', 4)
annual_amortization = estimated_fee / contract_length
return {
'estimated_fee': estimated_fee,
'annual_amortization': annual_amortization,
'estimated_wages_annual': player.get('estimated_wages', 0) * 52,
'total_annual_cost': (
annual_amortization + player.get('estimated_wages', 0) * 52
),
'contract_factor': contract_factor,
'age_factor': age_factor,
}
29.2.9 Final Shortlist and Reporting
The final shortlist is produced by combining composite scores, similarity analysis, squad fit assessment, and financial modeling. The output is a structured report for the sporting director that includes radar charts, statistical profiles, and risk assessments for each candidate.
Practical Tip --- The Scouting Report: The final scouting report should be concise (no more than two pages per candidate) and structured consistently. Each candidate profile should include: (1) a radar chart showing their percentile rankings, (2) a one-paragraph statistical summary, (3) a squad fit assessment, (4) a financial summary with estimated total annual cost, (5) key strengths and risk factors, and (6) recommended video clips for the coaching staff to review.
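A minimal percentile radar along the lines described in the tip, using matplotlib's polar axes (the percentile dict is assumed to be precomputed, e.g. via the percentile ranking code in Section 29.6):

import numpy as np
import matplotlib.pyplot as plt

def plot_percentile_radar(percentiles: dict[str, float], title: str):
    """Draw a simple radar of 0-100 percentile ranks for one candidate."""
    labels = list(percentiles.keys())
    values = list(percentiles.values())
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
    # Close the polygon by repeating the first point
    values += values[:1]
    angles += angles[:1]
    fig, ax = plt.subplots(subplot_kw={'polar': True}, figsize=(6, 6))
    ax.plot(angles, values, linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, fontsize=8)
    ax.set_ylim(0, 100)
    ax.set_title(title)
    return fig, ax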
29.3 Case Study: Tactical Analysis of a Season
29.3.1 Project Overview
The analytics department is tasked with producing a comprehensive tactical review of the team's season. This involves analyzing formations, pressing patterns, ball progression methods, and set-piece effectiveness across all 38 league matches.
Objective: Identify the tactical patterns that correlated with wins, and quantify where the team underperformed relative to expected metrics.
29.3.2 Formation Detection
Modern teams rarely play static formations. We use positional data to detect the effective formation in each phase of play.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
def detect_formation(
positions: np.ndarray,
n_outfield: int = 10
) -> str:
"""Detect formation from average player positions.
Args:
positions: Array of shape (n_outfield, 2) with (x, y) positions.
n_outfield: Number of outfield players.
Returns:
String representation of detected formation (e.g., '4-3-3').
"""
    # Cluster depth (x) positions to find the formation lines; per the
    # chapter's convention, x runs along the pitch toward the goal
    depth_positions = positions[:, 0].reshape(-1, 1)
    best_model = None
    best_score = -np.inf
    for n_lines in [3, 4, 5]:
        kmeans = KMeans(n_clusters=n_lines, random_state=42, n_init=10)
        labels = kmeans.fit_predict(depth_positions)
        # Raw inertia always improves with more clusters, so compare
        # candidate line counts with the silhouette score instead
        score = silhouette_score(depth_positions, labels)
        if score > best_score:
            best_score = score
            best_model = kmeans
    # Count players per line, sorted by depth
    labels = best_model.labels_
    centers = best_model.cluster_centers_.flatten()
    sorted_lines = np.argsort(centers)
formation_str = '-'.join(
str(np.sum(labels == line)) for line in sorted_lines
)
return formation_str
29.3.3 Passing Network Analysis
Passing networks reveal the structural backbone of a team's build-up play. We construct weighted directed graphs where nodes are players and edge weights represent pass frequency.
Betweenness Centrality:
$$ C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$
where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$, and $\sigma_{st}(v)$ is the number of those paths passing through $v$.
import networkx as nx
def build_passing_network(
passes: pd.DataFrame,
min_passes: int = 3
) -> nx.DiGraph:
"""Construct a weighted passing network from event data.
Args:
passes: DataFrame with 'passer', 'receiver', and optional weights.
min_passes: Minimum passes between a pair for edge inclusion.
Returns:
NetworkX directed graph with pass counts as edge weights.
"""
# Aggregate pass counts
pass_counts = (
passes.groupby(['passer', 'receiver'])
.size()
.reset_index(name='passes')
)
pass_counts = pass_counts[pass_counts['passes'] >= min_passes]
G = nx.DiGraph()
for _, row in pass_counts.iterrows():
G.add_edge(
row['passer'],
row['receiver'],
weight=row['passes']
)
return G
def analyze_network_metrics(G: nx.DiGraph) -> pd.DataFrame:
"""Compute centrality and flow metrics for a passing network.
Args:
G: Weighted directed passing network.
Returns:
DataFrame with centrality metrics per player.
"""
    # NetworkX shortest-path routines treat edge weights as distances,
    # so invert pass counts: frequent connections = short distances
    for _, _, data in G.edges(data=True):
        data['distance'] = 1.0 / data['weight']
    # Build every column from a player-keyed dict so rows align on
    # player names rather than on a positional index
    metrics = pd.DataFrame({
        'degree_centrality': nx.degree_centrality(G),
        'betweenness_centrality': nx.betweenness_centrality(
            G, weight='distance'
        ),
        'eigenvector_centrality': nx.eigenvector_centrality(
            G, weight='weight', max_iter=1000
        ),
        'in_degree': dict(G.in_degree(weight='weight')),
        'out_degree': dict(G.out_degree(weight='weight')),
    })
    metrics = metrics.rename_axis('player').reset_index()
    return metrics.sort_values('betweenness_centrality', ascending=False)
29.3.4 Pressing Intensity Analysis
Pressing is measured through PPDA (Passes Per Defensive Action) and high turnovers. Lower PPDA indicates more intense pressing.
$$ \text{PPDA} = \frac{\text{opponent passes allowed in their own half}}{\text{our defensive actions in the opponent's half}} $$
def compute_ppda(
events: pd.DataFrame,
team_id: str
) -> pd.DataFrame:
"""Compute PPDA (Passes Per Defensive Action) per match.
Args:
events: Full event data for the season.
team_id: ID of the team to analyze.
Returns:
DataFrame with PPDA per match.
"""
defensive_actions = ['tackle', 'interception', 'foul']
match_ppda = []
for match_id in events['match_id'].unique():
match_events = events[events['match_id'] == match_id]
# Opponent passes in own half
opp_passes = match_events[
(match_events['team_id'] != team_id) &
(match_events['type'] == 'pass') &
(match_events['x'] < 60)
].shape[0]
# Defensive actions in opponent half
def_actions = match_events[
(match_events['team_id'] == team_id) &
(match_events['type'].isin(defensive_actions)) &
(match_events['x'] > 60)
].shape[0]
ppda = opp_passes / max(def_actions, 1)
match_ppda.append({
'match_id': match_id,
'ppda': ppda,
})
return pd.DataFrame(match_ppda)
29.3.5 Tactical Phase Analysis
We segment the season into distinct tactical phases based on formation changes, personnel shifts, and performance trends. This reveals how the coaching staff adapted throughout the campaign.
def segment_tactical_phases(
match_stats: pd.DataFrame,
window: int = 5
) -> pd.DataFrame:
"""Identify tactical phase transitions using rolling metrics.
Args:
match_stats: Per-match tactical metrics (PPDA, possession, etc.).
window: Rolling window size for smoothing.
Returns:
DataFrame with phase labels and transition points.
"""
for col in ['ppda', 'possession', 'field_tilt']:
match_stats[f'{col}_rolling'] = (
match_stats[col].rolling(window=window, min_periods=1).mean()
)
# Detect change points using simple variance method
match_stats['tactical_variance'] = (
match_stats['ppda_rolling'].rolling(window=3).std() +
match_stats['possession_rolling'].rolling(window=3).std()
)
threshold = match_stats['tactical_variance'].quantile(0.85)
match_stats['phase_transition'] = (
match_stats['tactical_variance'] > threshold
).astype(int)
# Label phases
match_stats['phase'] = match_stats['phase_transition'].cumsum()
return match_stats
29.3.6 Expected Points Analysis
We compare actual points earned against expected points based on xG and xGA (expected goals against) to assess whether the team over- or under-performed.
$$ P(\text{win}) = \sum_{g_h > g_a} P(G_h = g_h) \cdot P(G_a = g_a) $$
where $G_h \sim \text{Poisson}(xG)$ and $G_a \sim \text{Poisson}(xGA)$.
from scipy.stats import poisson
def expected_points(xg: float, xga: float) -> float:
"""Compute expected points from xG and xGA using Poisson model.
Args:
xg: Expected goals scored.
xga: Expected goals conceded.
Returns:
Expected points for the match (0-3 scale).
"""
max_goals = 10
p_win = 0.0
p_draw = 0.0
for g_home in range(max_goals):
for g_away in range(max_goals):
p_home = poisson.pmf(g_home, xg)
p_away = poisson.pmf(g_away, xga)
joint_p = p_home * p_away
if g_home > g_away:
p_win += joint_p
elif g_home == g_away:
p_draw += joint_p
return 3 * p_win + 1 * p_draw
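Applied across a season, this yields the cumulative actual-versus-expected comparison used in the review. A sketch, assuming one row per match with xg, xga, and actual points columns:

def season_expected_points(match_xg: pd.DataFrame) -> pd.DataFrame:
    """Add per-match and cumulative expected points columns."""
    match_xg = match_xg.copy()
    match_xg['xpts'] = match_xg.apply(
        lambda row: expected_points(row['xg'], row['xga']), axis=1
    )
    match_xg['cum_points'] = match_xg['points'].cumsum()
    match_xg['cum_xpts'] = match_xg['xpts'].cumsum()
    return match_xg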
Lesson Learned --- Tactical Analysis: The most valuable output of a season-long tactical analysis is not a single insight but a narrative. Coaches and directors want to understand the story of their season: what worked early on, what changed when key players were injured, how the opposition adapted, and where the inflection points were. Structure your analysis chronologically and connect tactical decisions to results. A timeline visualization showing formation changes, PPDA trends, and results in parallel is often the single most impactful deliverable.
29.4 Case Study: Injury Prevention Program
29.4.1 Project Overview
Injuries are the single greatest source of uncontrollable variance in soccer. A comprehensive injury prevention program combines workload monitoring, risk modeling, and individualized management protocols. This case study designs such a system from the ground up.
Objective: Build an injury risk monitoring system that provides daily risk scores for each player, enabling proactive load management.
29.4.2 Workload Monitoring Framework
The acute:chronic workload ratio (ACWR) is a widely used metric for monitoring injury risk. It compares recent training load to a longer baseline.
$$ \text{ACWR} = \frac{\text{Acute Load (7-day rolling mean)}}{\text{Chronic Load (28-day EWMA)}} $$
An ACWR between 0.8 and 1.3 is generally considered the "safe zone," while values above 1.5 indicate significantly elevated risk.
def compute_acwr(
daily_load: pd.Series,
acute_window: int = 7,
chronic_window: int = 28
) -> pd.Series:
"""Compute the Acute:Chronic Workload Ratio.
Args:
daily_load: Time series of daily training/match load values.
acute_window: Window for acute load calculation (days).
chronic_window: Window for chronic load EWMA (days).
Returns:
Series of ACWR values aligned with the input index.
"""
acute = daily_load.rolling(window=acute_window, min_periods=1).mean()
chronic = daily_load.ewm(span=chronic_window, min_periods=1).mean()
# Avoid division by zero
acwr = acute / chronic.replace(0, np.nan)
return acwr
29.4.3 Multi-Factor Risk Model
We build a logistic regression model that incorporates multiple risk factors beyond workload:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
def build_injury_risk_model(
df: pd.DataFrame,
feature_cols: list[str],
target_col: str = 'injury_within_14_days'
) -> tuple:
"""Train a multi-factor injury risk prediction model.
Args:
df: Historical player-day records with features and outcomes.
feature_cols: List of predictor variable names.
target_col: Binary target indicating injury within the prediction window.
Returns:
Tuple of (trained model, fitted scaler, feature importances).
"""
X = df[feature_cols].values
y = df[target_col].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(
class_weight='balanced',
max_iter=1000,
C=0.1
)
model.fit(X_scaled, y)
importances = pd.DataFrame({
'feature': feature_cols,
'coefficient': model.coef_[0],
'abs_coefficient': np.abs(model.coef_[0])
}).sort_values('abs_coefficient', ascending=False)
return model, scaler, importances
Key risk factors include (a daily scoring sketch follows this list):
- Acute:Chronic Workload Ratio (ACWR)
- Total distance covered in training (meters)
- High-speed running distance (> 5.5 m/s)
- Number of accelerations and decelerations
- Days since last rest day
- Previous injury history (binary: injured in prior 6 months)
- Age
- Match congestion (matches in last 14 days)
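A sketch of how these factors might feed the trained model day to day; the column names are illustrative and must match whatever the club's data platform actually records:

RISK_FEATURES = [
    'acwr', 'total_distance', 'hsr_distance', 'accel_decel_count',
    'days_since_rest', 'prior_injury_6m', 'age', 'matches_last_14d',
]

def score_squad_today(model, scaler, today: pd.DataFrame) -> pd.DataFrame:
    """Produce daily injury risk probabilities for the whole squad.

    Assumes one row per player with the RISK_FEATURES columns.
    """
    X = scaler.transform(today[RISK_FEATURES].values)
    today = today.copy()
    today['risk_probability'] = model.predict_proba(X)[:, 1]
    return today.sort_values('risk_probability', ascending=False)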
29.4.4 Survival Analysis for Return-to-Play
For players currently injured, survival analysis models the expected time to return:
$$ S(t) = P(T > t) = \exp\left(-\int_0^t \lambda(u) \, du\right) $$
where $\lambda(t)$ is the hazard function representing the instantaneous risk of returning to play at time $t$.
def kaplan_meier_return_to_play(
injury_data: pd.DataFrame,
injury_type: str
) -> pd.DataFrame:
"""Estimate return-to-play curves by injury type.
Args:
injury_data: Historical injury records with duration and censoring info.
injury_type: Type of injury to filter (e.g., 'hamstring', 'ankle').
Returns:
DataFrame with time points and survival probabilities.
"""
subset = injury_data[injury_data['injury_type'] == injury_type].copy()
subset = subset.sort_values('days_out')
    # Simple Kaplan-Meier estimation: walk through records in order of
    # days out; at each return event, multiply the survival probability
    # by (1 - 1 / number still at risk)
    n_at_risk = len(subset)
    survival_prob = 1.0
    results = [{'time': 0, 'survival': 1.0}]
    for _, row in subset.iterrows():
        if n_at_risk > 0 and row['returned'] == 1:
            survival_prob *= (1 - 1 / n_at_risk)
            results.append({
                'time': row['days_out'],
                'survival': survival_prob,
            })
        # Event or censored, the player leaves the risk set here
        n_at_risk -= 1
return pd.DataFrame(results)
29.4.5 Daily Dashboard and Alert System
The injury prevention system produces a daily dashboard with traffic-light risk indicators:
| Risk Level | ACWR Range | Color | Action |
|---|---|---|---|
| Low | 0.8 - 1.3 | Green | Normal training |
| Moderate | 1.3 - 1.5 or < 0.8 | Amber | Monitor closely |
| High | > 1.5 or < 0.5 | Red | Reduce load / rest day |
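The traffic-light mapping is a direct translation of the table; a small helper using those thresholds:

def classify_acwr_risk(acwr: float) -> tuple[str, str]:
    """Map an ACWR value to a risk level and recommended action."""
    if acwr > 1.5 or acwr < 0.5:
        return 'High', 'Reduce load / rest day'
    if acwr > 1.3 or acwr < 0.8:
        return 'Moderate', 'Monitor closely'
    return 'Low', 'Normal training'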
Medical Staff Callout: The injury risk model should be viewed as a decision support tool, not a replacement for clinical judgment. The daily risk scores provide one input among many---including subjective wellness questionnaires, sleep quality, and clinical assessment---that the medical team uses to make training load decisions.
Common Mistake --- Injury Models: A frequent error is treating injury prediction as a standard classification problem and optimizing for accuracy. Because injuries are rare events (base rate of 2-5% per player-week), a model that always predicts "no injury" achieves very high accuracy. Instead, optimize for sensitivity (recall) at a fixed false positive rate, and use the model's probability outputs rather than binary predictions. The goal is not to predict exactly when injuries will occur but to identify periods of elevated risk that warrant closer monitoring.
29.5 Case Study: Match Preparation Report
29.5.1 Project Overview
Before every match, the analytics department produces a structured opponent analysis report for the coaching staff. This case study automates the production of these reports, covering opponent tendencies, key threats, set-piece patterns, and recommended tactical adjustments.
Objective: Build an automated system that generates a comprehensive match preparation report from event data, requiring minimal manual intervention.
29.5.2 Pre-Match Preparation Workflow
The match preparation process begins well before the automated system runs. The full workflow includes:
- Data Collection (Matchday -5): Gather event data from the opponent's last 5-10 matches. Include data from different competition contexts (home, away, vs. top teams, vs. bottom teams).
- Video Tagging (Matchday -4): The video analyst tags key sequences from the opponent's recent matches. These clips will be linked to the statistical findings in the report.
- Automated Analysis (Matchday -3): Run the automated pipeline to generate the statistical report.
- Manual Review (Matchday -2): The lead analyst reviews the automated output, adds context, and highlights the most important findings.
- Coaching Presentation (Matchday -1): Present the report to the coaching staff, typically in a 20-30 minute meeting supported by video clips and visualizations.
- Player Briefing (Matchday): A simplified version of the key findings is communicated to players, often through short video sessions and pitch diagrams.
29.5.3 Report Architecture
The automated report system follows a modular pipeline:
Event Data --> Opponent Profile Module --> Set Piece Module --> Key Player Module
                        |                        |                      |
                        v                        v                      v
                  Formation &                Corner/FK                Threat
               Build-up Patterns                                    Assessment
                        |                        |                      |
                        +------------+-----------+----------------------+
                                     |
                                     v
                      Report Assembly        Visualization
                                     |
                                     v
                             PDF/HTML Output
29.5.4 Opponent Build-Up Analysis
def analyze_buildup_patterns(
events: pd.DataFrame,
opponent_id: str,
n_recent_matches: int = 5
) -> dict:
"""Analyze opponent's build-up play tendencies.
Args:
events: Event data from opponent's recent matches.
opponent_id: Team ID of the opponent.
n_recent_matches: Number of recent matches to analyze.
Returns:
Dictionary of build-up pattern statistics.
"""
recent_matches = (
events[events['team_id'] == opponent_id]
['match_id'].unique()[-n_recent_matches:]
)
opp_events = events[
(events['team_id'] == opponent_id) &
(events['match_id'].isin(recent_matches))
]
passes = opp_events[opp_events['type'] == 'pass']
analysis = {
'buildup_side': {
'left': passes[passes['y'] < 27].shape[0] / len(passes),
'center': passes[passes['y'].between(27, 53)].shape[0] / len(passes),
'right': passes[passes['y'] > 53].shape[0] / len(passes),
},
'long_ball_pct': (
passes[passes['pass_length'] > 30].shape[0] / len(passes)
),
'avg_pass_length': passes['pass_length'].mean(),
'progressive_pass_rate': (
passes[passes['end_x'] - passes['x'] > 10].shape[0] / len(passes)
),
'buildup_speed': _classify_buildup_speed(passes),
'possession_pct': _compute_possession(opp_events),
}
return analysis
def _classify_buildup_speed(passes: pd.DataFrame) -> str:
"""Classify build-up speed based on pass tempo."""
avg_sequence_length = passes.groupby('possession_id').size().mean()
if avg_sequence_length > 6:
return 'patient'
elif avg_sequence_length > 3:
return 'balanced'
else:
return 'direct'
def _compute_possession(events: pd.DataFrame) -> float:
"""Compute approximate possession percentage."""
total_passes = events[events['type'] == 'pass'].shape[0]
successful_passes = events[
(events['type'] == 'pass') & (events['outcome'] == 'successful')
].shape[0]
return successful_passes / max(total_passes, 1)
29.5.5 Set-Piece Analysis
Set pieces account for approximately 25-30% of all goals in professional soccer. A thorough set-piece analysis is critical for match preparation.
def analyze_set_pieces(
events: pd.DataFrame,
opponent_id: str,
n_matches: int = 10
) -> dict:
"""Analyze opponent's set-piece tendencies and threats.
Args:
events: Event data from opponent's recent matches.
opponent_id: Team ID of the opponent.
n_matches: Number of recent matches to analyze.
Returns:
Dictionary of set-piece patterns and statistics.
"""
set_piece_types = ['corner', 'free_kick', 'throw_in']
recent = events[
(events['team_id'] == opponent_id) &
(events['play_pattern'].isin(set_piece_types))
]
analysis = {}
for sp_type in set_piece_types:
subset = recent[recent['play_pattern'] == sp_type]
if len(subset) == 0:
continue
analysis[sp_type] = {
'total_count': len(subset),
'shots_generated': subset[subset['type'] == 'shot'].shape[0],
'goals_scored': subset[
(subset['type'] == 'shot') &
(subset['outcome'] == 'Goal')
].shape[0],
'xg_generated': subset[
subset['type'] == 'shot'
]['xg'].sum() if 'xg' in subset.columns else None,
}
# Corner kick delivery analysis
corners = recent[recent['play_pattern'] == 'corner']
if len(corners) > 0:
analysis['corner_delivery'] = {
            'inswing_pct': (
                corners[corners['corner_type'] == 'inswing'].shape[0]
                / len(corners)
            ) if 'corner_type' in corners.columns else None,
'short_corner_pct': (
corners[corners['pass_length'] < 10].shape[0] / len(corners)
),
'near_post_pct': _estimate_near_post_delivery(corners),
}
return analysis
def _estimate_near_post_delivery(corners: pd.DataFrame) -> float:
"""Estimate percentage of corners delivered to the near post."""
if 'end_y' not in corners.columns:
return 0.0
near_post = corners[corners['end_y'] < 35].shape[0]
return near_post / max(len(corners), 1)
29.5.6 Key Player Threat Assessment
def assess_key_threats(
events: pd.DataFrame,
opponent_id: str,
n_matches: int = 10,
top_n: int = 3
) -> list[dict]:
"""Identify and profile the opponent's most dangerous players.
Args:
events: Event data from opponent's recent matches.
opponent_id: Team ID of the opponent.
n_matches: Number of recent matches to analyze.
top_n: Number of key threats to return.
Returns:
List of dictionaries with player threat profiles.
"""
opp_events = events[events['team_id'] == opponent_id]
# Compute threat metrics per player
player_threats = []
for player_id in opp_events['player_id'].unique():
player_events = opp_events[opp_events['player_id'] == player_id]
shots = player_events[player_events['type'] == 'shot']
passes = player_events[player_events['type'] == 'pass']
carries = player_events[player_events['type'] == 'carry']
xg_total = shots['xg'].sum() if 'xg' in shots.columns else 0
xa_total = passes['xa'].sum() if 'xa' in passes.columns else 0
player_threats.append({
'player_id': player_id,
'player_name': player_events['player_name'].iloc[0],
'xg': xg_total,
'xa': xa_total,
'threat_score': xg_total + xa_total,
'shots': len(shots),
            'key_passes': int(
                passes['key_pass'].fillna(False).sum()
            ) if 'key_pass' in passes.columns else 0,
'progressive_carries': carries[
carries['end_x'] - carries['x'] > 10
].shape[0] if len(carries) > 0 and 'end_x' in carries.columns else 0,
})
threats_df = pd.DataFrame(player_threats)
return threats_df.nlargest(top_n, 'threat_score').to_dict('records')
29.5.7 In-Game Tracking and Post-Match Analysis
The match preparation report is not the end of the analyst's workflow for a given match. The complete cycle includes in-game tracking and post-match review.
In-Game Tracking:
During the match, the analyst monitors live data feeds and flags significant deviations from the pre-match analysis:
- Is the opponent playing the formation we expected?
- Are they pressing higher or lower than their recent average?
- Is their key threat player receiving the ball in the zones we identified?
- Are our set-piece defensive assignments working?
These observations are communicated to the coaching staff via a standardized messaging system (often a tablet app) that the assistant coaches monitor.
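One lightweight way to support that live monitoring is a deviation check against the pre-match baseline. A sketch in which the metric names and the 25% tolerance are illustrative:

def flag_live_deviations(
    live_stats: dict,
    prematch_baseline: dict,
    tolerance: float = 0.25
) -> list[str]:
    """Flag in-game metrics that deviate sharply from the pre-match profile."""
    flags = []
    for metric, expected in prematch_baseline.items():
        observed = live_stats.get(metric)
        if observed is None or expected == 0:
            continue
        relative_change = (observed - expected) / abs(expected)
        if abs(relative_change) > tolerance:
            flags.append(
                f'{metric}: {observed:.2f} vs expected {expected:.2f} '
                f'({relative_change:+.0%})'
            )
    return flags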
Post-Match Analysis:
After the match, the analyst produces a post-match report that evaluates:
- How accurately the pre-match analysis predicted the opponent's approach
- Which tactical adjustments were made and their impact
- Key moments that were influenced by the pre-match preparation
- Lessons for future match preparation
Coaching Staff Callout: The most effective match preparation reports are concise and action-oriented. Coaches do not need to see every statistic---they need clear, prioritized insights that translate directly into training ground work. Limit the report to 3-4 pages with clear visual summaries.
29.5.8 Visualization Production
Effective match preparation requires clear, intuitive visualizations. The standard visualization package includes:
- Opponent formation map with average player positions
- Build-up heatmaps showing where the opponent progresses the ball
- Pressing trigger zones highlighting where the opponent is vulnerable
- Set-piece diagrams with common routines
- Key player action maps showing where threats operate
Practical Tip --- Visualization for Coaches: When creating visualizations for coaching staff, follow three rules: (a) use the real pitch as the canvas---coaches think in terms of pitch zones, not abstract charts; (b) limit each visualization to one main message; (c) annotate directly on the visualization rather than relying on a legend. A pitch map showing "Opponent's LW receives 73% of crosses here" is more useful than a bar chart of cross distribution.
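As a worked illustration of those three rules, the sketch below draws a bare pitch outline with matplotlib and annotates a single zone with one message. The 105 x 68 metre dimensions, the zone coordinates, and the `annotated_pitch_map` helper are assumptions for illustration; in practice you would typically use a dedicated pitch-plotting library (e.g. mplsoccer).

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def annotated_pitch_map(message: str, zone: tuple, output_path: str) -> None:
    """Draw a bare pitch outline and annotate one highlighted zone.

    `zone` is (x, y, width, height) in metres on a 105 x 68 pitch;
    `message` is the single takeaway written directly on the zone.
    """
    fig, ax = plt.subplots(figsize=(10, 6.5))
    # Pitch outline and halfway line (105 x 68 metre convention)
    ax.add_patch(patches.Rectangle((0, 0), 105, 68, fill=False, lw=1.5))
    ax.plot([52.5, 52.5], [0, 68], color='grey', lw=1)
    # One highlighted zone, one message, annotated directly on the pitch
    x, y, w, h = zone
    ax.add_patch(patches.Rectangle((x, y), w, h, color='red', alpha=0.3))
    ax.annotate(message, xy=(x + w / 2, y + h / 2),
                ha='center', va='center', fontsize=11, fontweight='bold')
    ax.set_xlim(-2, 107)
    ax.set_ylim(-2, 70)
    ax.set_aspect('equal')
    ax.axis('off')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close(fig)

# e.g. annotated_pitch_map("Opponent's LW receives 73% of crosses here",
#                          zone=(80, 46, 20, 20), output_path='lw_zone.png')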
29.5.9 Report Generation
The final report is assembled into a structured format:
def generate_match_report(
events: pd.DataFrame,
opponent_id: str,
our_team_id: str,
output_path: str
) -> str:
"""Generate a complete match preparation report.
Args:
events: Full event dataset.
opponent_id: Opponent team ID.
our_team_id: Our team ID.
output_path: Path for the output report file.
Returns:
Path to the generated report.
"""
buildup = analyze_buildup_patterns(events, opponent_id)
set_pieces = analyze_set_pieces(events, opponent_id)
threats = assess_key_threats(events, opponent_id)
    # Each _format_* helper renders one section to text; a sketch of
    # _format_threat_section follows this listing
    report_sections = [
        _format_header(opponent_id),
        _format_buildup_section(buildup),
        _format_set_piece_section(set_pieces),
        _format_threat_section(threats),
        _format_recommendations(buildup, set_pieces, threats),
    ]
report_text = '\n\n'.join(report_sections)
with open(output_path, 'w') as f:
f.write(report_text)
return output_path
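The `_format_*` helpers are assumed to be defined alongside this pipeline. As an illustration of the pattern, a minimal sketch of `_format_threat_section` might look like the following; the plain-text layout is illustrative, not a club template.

def _format_threat_section(threats: list[dict]) -> str:
    """Render key-threat profiles as a plain-text report section."""
    lines = ['KEY THREATS', '-' * 40]
    for rank, threat in enumerate(threats, start=1):
        lines.append(
            f"{rank}. {threat['player_name']}: "
            f"{threat['xg']:.2f} xG, {threat['xa']:.2f} xA, "
            f"{threat['shots']} shots, {threat['key_passes']} key passes"
        )
    return '\n'.join(lines)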
29.6 Case Study: Player Development Tracking
29.6.1 Project Overview
Tracking player development is essential for academies and first-team environments alike. This case study builds a longitudinal tracking system that measures player progression across technical, physical, tactical, and mental dimensions.
Objective: Create a player development dashboard that visualizes progression over time, benchmarks against age-group peers, and projects future trajectories.
29.6.2 Development Metrics Framework
Player development is tracked across four pillars:
| Pillar | Metrics | Update Frequency |
|---|---|---|
| Technical | Pass completion, dribble success, first touch under pressure | Weekly |
| Physical | Sprint speed, distance covered, high-intensity efforts | Per session |
| Tactical | Positional accuracy, pressing trigger response, defensive positioning | Monthly |
| Mental | Decision-making under pressure, game management, leadership indices | Quarterly |
29.6.3 Percentile Ranking Against Peers
from scipy import stats as scipy_stats
def compute_percentile_ranks(
player_metrics: pd.DataFrame,
peer_metrics: pd.DataFrame,
metric_cols: list[str]
) -> pd.DataFrame:
"""Compute percentile ranks for a player against their peer group.
Args:
player_metrics: Single-row DataFrame with the player's current metrics.
peer_metrics: DataFrame with metrics for the peer comparison group.
metric_cols: List of metric columns to rank.
Returns:
DataFrame with percentile ranks for each metric.
"""
    percentiles = {}
    for col in metric_cols:
        player_val = player_metrics[col].values[0]
        peer_vals = peer_metrics[col].dropna().values
        if len(peer_vals) == 0:
            # No peer observations for this metric; the rank is undefined
            percentiles[col] = float('nan')
            continue
        percentile = scipy_stats.percentileofscore(peer_vals, player_val)
        percentiles[col] = round(percentile, 1)
return pd.DataFrame([percentiles])
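A hypothetical call illustrates the intended use: the peer group is built from players of the same position and age band, excluding the player being ranked. The `all_players` table, `target_id`, and the column names are assumptions about how the metrics store is organized.

# Hypothetical usage, assuming a combined metrics table `all_players`
peers = all_players[
    (all_players['position'] == 'CM')
    & (all_players['age_group'] == 'U19')
    & (all_players['player_id'] != target_id)
]
ranks = compute_percentile_ranks(
    player_metrics=all_players[all_players['player_id'] == target_id],
    peer_metrics=peers,
    metric_cols=['pass_completion', 'dribble_success', 'sprint_speed'],
)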
29.6.4 Growth Curve Modeling
We model player development trajectories using polynomial regression to project future performance levels:
$$ y(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \epsilon $$
where $y(t)$ is the metric value at time $t$ (measured in months since the player's academy entry).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
def fit_growth_curve(
historical: pd.DataFrame,
metric_col: str,
time_col: str = 'months_since_start',
degree: int = 2,
forecast_months: int = 12
) -> dict:
"""Fit a growth curve and project future development.
Args:
historical: Historical metric observations.
metric_col: Column name of the metric to model.
time_col: Column with time values.
degree: Polynomial degree for the growth curve.
forecast_months: Number of months to forecast.
Returns:
Dictionary with model coefficients, fitted values, and forecasts.
"""
X = historical[time_col].values.reshape(-1, 1)
y = historical[metric_col].values
poly = PolynomialFeatures(degree=degree)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
# Fitted values
y_fitted = model.predict(X_poly)
    # Forecast: polynomial extrapolation degrades quickly beyond the
    # observed range, so treat long-range projections as indicative only
    max_time = X.max()
    future_times = np.arange(
        max_time + 1, max_time + forecast_months + 1
    ).reshape(-1, 1)
    future_poly = poly.transform(future_times)
    y_forecast = model.predict(future_poly)
return {
'coefficients': model.coef_.tolist(),
'intercept': model.intercept_,
'r_squared': model.score(X_poly, y),
'fitted_values': y_fitted.tolist(),
'forecast_times': future_times.flatten().tolist(),
'forecast_values': y_forecast.tolist(),
}
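A quick sketch on synthetic data shows the intended workflow. The metric values below are fabricated purely to exercise the function, and, per the caveat in the code, the 12-month projection should be read as indicative rather than predictive.

import numpy as np
import pandas as pd

# Synthetic history: 24 monthly observations of a technical metric
rng = np.random.default_rng(42)
months = np.arange(24)
history = pd.DataFrame({
    'months_since_start': months,
    'pass_completion': (70 + 0.6 * months - 0.01 * months**2
                        + rng.normal(0, 1.5, size=24)),
})
curve = fit_growth_curve(history, 'pass_completion')
print(f"R^2: {curve['r_squared']:.3f}")
print(f"12-month projection: {curve['forecast_values'][-1]:.1f}")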
29.6.5 Radar Chart Visualization
Radar charts (also called spider charts) provide an intuitive visual summary of a player's multi-dimensional profile.
import matplotlib.pyplot as plt
import numpy as np
def create_radar_chart(
player_percentiles: dict[str, float],
player_name: str,
output_path: str
) -> None:
"""Create a radar chart showing a player's percentile profile.
Args:
player_percentiles: Dictionary mapping metric names to percentile values.
player_name: Name for the chart title.
output_path: Path to save the output figure.
"""
categories = list(player_percentiles.keys())
values = list(player_percentiles.values())
# Close the polygon
values += values[:1]
N = len(categories)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
ax.plot(angles, values, 'o-', linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, size=10)
ax.set_ylim(0, 100)
ax.set_title(f'{player_name} - Development Profile', size=14, pad=20)
plt.tight_layout()
plt.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close()
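A hypothetical call wires the percentile output from Section 29.6.3 into the chart. One design note: the visual impression of a radar chart depends on the ordering of its axes, so fix the ordering once and keep it identical across players and reporting periods.

# Hypothetical usage: keys and values come from compute_percentile_ranks
create_radar_chart(
    {'Pass completion': 65.0, 'Dribble success': 82.0,
     'Sprint speed': 71.0, 'Pressing response': 48.0,
     'Decision-making': 59.0},
    player_name='Player A',
    output_path='player_a_radar.png',
)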
29.6.6 Development Report Generation
The player development system produces periodic reports that combine quantitative metrics with qualitative coaching assessments:
def generate_development_report(
player_id: str,
current_metrics: pd.DataFrame,
historical_metrics: pd.DataFrame,
peer_metrics: pd.DataFrame,
metric_cols: list[str]
) -> dict:
"""Generate a comprehensive player development report.
Args:
player_id: Unique player identifier.
current_metrics: Current period metric values.
historical_metrics: All historical metric observations.
peer_metrics: Peer group metrics for benchmarking.
metric_cols: List of metrics to include.
Returns:
Dictionary with all report components.
"""
# Percentile rankings
percentiles = compute_percentile_ranks(
current_metrics, peer_metrics, metric_cols
)
# Growth curves for each metric
growth_curves = {}
for col in metric_cols:
if col in historical_metrics.columns:
growth_curves[col] = fit_growth_curve(
historical_metrics, col
)
# Identify strengths and areas for development
pct_dict = percentiles.iloc[0].to_dict()
strengths = [k for k, v in pct_dict.items() if v >= 75]
development_areas = [k for k, v in pct_dict.items() if v < 40]
report = {
'player_id': player_id,
'percentile_ranks': pct_dict,
'growth_curves': growth_curves,
'strengths': strengths,
'development_areas': development_areas,
'overall_development_index': np.mean(list(pct_dict.values())),
}
return report
29.6.7 Longitudinal Trend Analysis
Tracking development over time requires careful attention to measurement frequency, seasonal effects, and position-specific benchmarks.
Academy Callout: Player development is non-linear. Periods of apparent stagnation are normal and often precede significant breakthroughs. The analytics system should flag concerning trends (sustained decline over 3+ months) without over-reacting to short-term fluctuations.
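One minimal way to implement such a flag, assuming roughly monthly observations, is to fit a straight line to the most recent window and test its slope. The three-month window and zero-slope threshold below are illustrative defaults, not a prescription.

import numpy as np
import pandas as pd

def flag_sustained_decline(
    history: pd.DataFrame,
    metric_col: str,
    time_col: str = 'months_since_start',
    window_months: int = 3,
    slope_threshold: float = 0.0
) -> bool:
    """Flag a sustained decline over the most recent window.

    Fits a straight line to the last `window_months` observations
    (assumed roughly monthly) and flags when the slope falls below
    `slope_threshold`. Short-term single-month dips never trigger it.
    """
    recent = history.sort_values(time_col).tail(window_months)
    if len(recent) < window_months:
        return False  # not enough data to judge a sustained trend
    slope = np.polyfit(recent[time_col], recent[metric_col], deg=1)[0]
    return slope < slope_threshold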
Lesson Learned --- Player Development: The most successful player development tracking systems are those that are embedded in the coaching culture, not imposed from outside. Academy coaches need to see the development reports as tools that support their work, not as surveillance mechanisms. The best approach is to involve coaches in defining the metrics, present results in collaborative review meetings (not as top-down evaluations), and always pair quantitative data with opportunities for coaches to add their qualitative assessments. A development report that says "passing completion improved from 40th to 65th percentile over 6 months" is good. One that adds "coach notes: consistently making better decisions about when to play forward vs. recycle; has responded well to individual tactical sessions on Thursday mornings" is excellent.
29.7 Cross-Case-Study Integration
29.7.1 How the Case Studies Connect
The six case studies in this chapter are not isolated projects. In a professional club, they form an interconnected system:
- The xG model (29.1) feeds predicted values into the match preparation report (29.5), where opponent shot quality is assessed, and into the scouting campaign (29.2), where a candidate's finishing ability is evaluated against the model.
- The scouting campaign (29.2) uses insights from the tactical analysis (29.3) to understand what kind of player the team's system requires, and from the injury prevention program (29.4) to assess candidates' injury risk profiles.
- The tactical analysis (29.3) informs the match preparation report (29.5) by providing baseline metrics for how the team typically plays, enabling the analyst to identify how the opponent's approach differs.
- The injury prevention program (29.4) interacts with the player development tracking (29.6) because managing a young player's workload is essential for long-term development.
- The player development tracking (29.6) feeds back into the scouting campaign (29.2) by helping the club understand which internal academy players might be ready to fill a squad need, potentially avoiding a transfer altogether.
29.7.2 Common Patterns Across Case Studies
Several patterns recur across all six case studies:
- Data Quality First: Every case study begins with data cleaning and validation. This is not optional scaffolding---it is the foundation on which everything else rests.
- Feature Engineering Requires Domain Knowledge: The most impactful features in every model are those informed by deep understanding of soccer, not by automated feature selection.
- Communication Is the Deliverable: The output of every case study is not a model or a database but a communication to a human decision-maker (coach, sporting director, medical staff, academy director).
- Iteration and Feedback: Every system improves through feedback loops. The xG model is retrained when performance degrades. The scouting criteria are refined after each transfer window. The match preparation system is updated based on coaching staff feedback.
- Uncertainty Quantification: Every case study acknowledges uncertainty. xG predictions have confidence intervals. Scouting scores have sensitivity to weight changes. Injury risk scores are probabilities, not certainties. A minimal bootstrap sketch of this idea follows the list.
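As a minimal sketch of that last point, a percentile bootstrap puts an interval around any per-observation metric, for example a player's mean xG per shot. The resample count and confidence level below are conventional defaults, not a prescription.

import numpy as np

def bootstrap_ci(
    values: np.ndarray,
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a mean.

    Resamples the observations with replacement and takes quantiles
    of the resampled means, e.g. for mean xG per shot.
    """
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower = np.quantile(boot_means, alpha / 2)
    upper = np.quantile(boot_means, 1 - alpha / 2)
    return float(lower), float(upper)

# e.g. bootstrap_ci(shots['xg'].values) -> (0.08, 0.14)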
29.7.3 Tools and Data Sources Summary
| Tool / Library | Case Studies Used | Purpose |
|---|---|---|
| pandas / numpy | All | Data manipulation and numerical computation |
| scikit-learn | 29.1, 29.2, 29.4, 29.6 | Model training, clustering, preprocessing |
| NetworkX | 29.3 | Passing network construction and analysis |
| matplotlib | 29.3, 29.5, 29.6 | Visualization and chart production |
| scipy | 29.3, 29.6 | Statistical tests and percentile computation |
| optuna | 29.1 | Hyperparameter optimization |
| joblib | 29.1 | Model serialization for deployment |
| StatsBomb / Wyscout | 29.1, 29.3, 29.5 | Event data source |
| GPS/tracking providers | 29.3, 29.4 | Positional and physical load data |
| Transfermarkt / CIES | 29.2 | Market value and contract data |
Practical Tip --- Reproducing These Analyses: All six case studies can be partially reproduced using freely available data. StatsBomb open data provides event data for the xG pipeline, tactical analysis, and match preparation case studies. For scouting, FBref provides aggregated statistics that can substitute for commercial event data. For injury prevention and player development, you will need to simulate or synthesize data, but the code patterns and analytical frameworks remain the same. The key is to focus on learning the workflow rather than achieving exact numerical results.
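As a starting point, a minimal sketch using the third-party statsbombpy package (one common route to StatsBomb open data) might look like this. The competition and season IDs shown are assumptions that change as the open data grows, so list the available competitions rather than hard-coding them.

# pip install statsbombpy
from statsbombpy import sb

competitions = sb.competitions()  # list available open-data competitions
# e.g. competition_id=43, season_id=3 has pointed to the 2018 World Cup
matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=matches['match_id'].iloc[0])
print(events['type'].value_counts().head())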
Summary
This chapter presented six comprehensive case studies that integrate techniques from across the textbook:
- xG Pipeline (Section 29.1): Demonstrated the full lifecycle of a predictive model, from data cleaning through deployment, emphasizing calibration, monitoring, and interpretability.
- Scouting Campaign (Section 29.2): Showed how multi-criteria decision analysis, similarity metrics, clustering, league adjustments, squad fit analysis, and financial modeling combine to support evidence-based recruitment.
- Tactical Analysis (Section 29.3): Applied network analysis, pressing metrics, and expected points models to evaluate a team's season-long tactical approach, with emphasis on narrative construction.
- Injury Prevention (Section 29.4): Built a workload monitoring system using the acute:chronic workload ratio and multi-factor risk models, with survival analysis for return-to-play estimation.
- Match Preparation (Section 29.5): Automated the production of opponent analysis reports, covering build-up patterns, set pieces, key player threats, and the full pre-match to post-match workflow.
- Player Development (Section 29.6): Created a longitudinal tracking system with percentile benchmarking, growth curve modeling, and radar chart visualization, emphasizing the importance of coaching culture integration.
Each case study followed a common pattern: define the objective, collect and clean data, engineer features, build models, and communicate results to stakeholders. This pattern---the analytics workflow---is the foundation of professional soccer analytics.
The techniques in these case studies are not merely academic exercises. They represent the daily work of analytics departments at clubs across the world. The key to success lies not in any single technique but in the ability to combine them thoughtfully, communicate clearly, and maintain a relentless focus on decisions that improve performance on the pitch.
Final Callout --- The Analytics Workflow: If there is one takeaway from this capstone chapter, it is this: the analytics workflow is more important than any individual technique. A well-structured workflow---with clear data validation, thoughtful feature engineering, appropriate model selection, honest evaluation, and effective communication---will produce valuable insights regardless of the specific tools used. Master the workflow, and you can adapt to any new technique or technology that emerges.
References
- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1-3.
- Gabbett, T. J. (2016). The training-injury prevention paradox. British Journal of Sports Medicine, 50(5), 273-280.
- Rathke, A. (2017). An examination of expected goals and shot efficiency in soccer. Journal of Human Sport and Exercise, 12(2), 514-529.
- Peña, J. L., & Touchette, H. (2012). A network theory analysis of football strategies. arXiv preprint arXiv:1206.6904.
- Caley, M. (2015). Premier League projections and new expected goals. Cartilage Free Captain Blog.
- Impect. (2019). Packing: A new way of analyzing football. Impect GmbH Technical Report.
- Fernandez-Navarro, J., et al. (2016). Evaluating the effectiveness of styles of play in elite soccer. Journal of Sports Sciences, 34(16), 1545-1552.
- Pappalardo, L., et al. (2019). A public data set of spatio-temporal match events in soccer competitions. Scientific Data, 6(1), 236.
- Decroos, T., et al. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. KDD 2019.
- Fernandez, J., & Bornn, L. (2018). Wide Open Spaces: A statistical technique for measuring space creation in professional soccer. MIT Sloan Sports Analytics Conference.