
Learning Objectives

  • Build end-to-end analytics pipelines from raw data to actionable insights
  • Design and execute a professional scouting campaign using data
  • Conduct tactical analysis across an entire season
  • Develop injury prevention models and monitoring systems
  • Automate match preparation reports for coaching staff
  • Track and quantify player development over time
  • Integrate techniques from multiple chapters into cohesive analytical projects

Chapter 29: Comprehensive Case Studies

Introduction

Throughout this textbook, we have built a comprehensive toolkit for soccer analytics---from foundational statistics and data engineering to advanced machine learning, network analysis, and simulation. In this capstone chapter, we bring everything together through six detailed case studies that mirror real-world analytics workflows at professional clubs.

Each case study is designed as a self-contained project that integrates techniques from multiple chapters. They progress from data acquisition and cleaning through modeling, visualization, and communication of results to stakeholders. The emphasis throughout is on practical implementation: the kind of work that analytics departments perform daily.

Note for Practitioners: These case studies are modeled on real workflows but use synthetic or publicly available data. The architectural patterns, however, are drawn directly from professional club analytics departments and consultancies.

The six case studies are:

Case Study | Domain | Primary Techniques | Chapters Referenced
29.1 | xG Pipeline | Feature engineering, logistic regression, model deployment | 3, 4, 6, 7
29.2 | Scouting | Clustering, similarity metrics, multi-criteria decision analysis | 15, 19, 21
29.3 | Tactical Analysis | Network analysis, passing models, formation detection | 10, 11, 16
29.4 | Injury Prevention | Survival analysis, workload monitoring, Bayesian inference | 18, 20, 26
29.5 | Match Preparation | Automated reporting, visualization, opponent profiling | 6, 7, 22
29.6 | Player Development | Longitudinal analysis, growth curves, radar charts | 15, 19, 21

Integration Across Chapters: One of the central lessons of professional analytics work is that no technique operates in isolation. The xG pipeline (29.1) relies on statistical foundations from Chapter 3 and expected goals methodology from Chapter 7, but it also feeds into the match preparation system (29.5) and the scouting framework (29.2). The tactical analysis (29.3) uses the network analysis methods from Chapter 10 and the team analysis from Chapter 16, but the insights it produces inform the injury prevention program (29.4) by identifying which players are exposed to the highest physical loads based on tactical role. Throughout these case studies, pay attention to the connections between techniques---this is how real analytics departments operate.


29.1 Case Study: Building a Complete xG Pipeline

29.1.1 Project Overview

Expected Goals (xG) is the foundational metric of modern soccer analytics. In this case study, we build a complete xG pipeline from raw event data to a deployed model that can score shots in near-real-time. The pipeline encompasses data ingestion, feature engineering, model training, evaluation, calibration, and deployment.

Objective: Build a production-grade xG model that achieves a log-loss below 0.34 and maintains calibration across shot types.

Architecture Overview:

Raw Event Data --> Data Cleaning --> Feature Engineering --> Model Training
       |                                                         |
       v                                                         v
  Data Validation                                        Model Evaluation
                                                                 |
                                                                 v
                                                     Calibration & Deployment

Template Workflow --- xG Pipeline: The workflow for building an xG model can be adapted to almost any expected metric (xA, xT, xGOT). The pattern is: (1) Define the event of interest, (2) Identify the outcome variable, (3) Engineer spatial and contextual features, (4) Train and evaluate candidate models, (5) Calibrate, (6) Deploy with monitoring. Keep this template in mind as you read through the case study.

29.1.2 Data Ingestion and Cleaning

The first stage of any analytics pipeline is reliable data ingestion. We work with event-level shot data that includes spatial coordinates, body part, play pattern, and contextual features.

import pandas as pd
import numpy as np
from typing import Tuple, Optional

def load_shot_data(filepath: str) -> pd.DataFrame:
    """Load and perform initial validation on shot event data.

    Args:
        filepath: Path to the CSV file containing shot events.

    Returns:
        Cleaned DataFrame with validated shot records.
    """
    df = pd.read_csv(filepath)

    # Validate coordinate ranges
    df = df[
        (df['x'].between(0, 120)) &
        (df['y'].between(0, 80))
    ].copy()

    # Standardize categorical variables
    df['body_part'] = df['body_part'].str.lower().str.strip()
    df['play_pattern'] = df['play_pattern'].str.lower().str.strip()

    # Create binary outcome
    df['is_goal'] = (df['outcome'] == 'Goal').astype(int)

    return df

Data Quality Callout: In professional settings, data providers occasionally introduce coordinate system changes between seasons. Always validate that pitch coordinates are consistent before combining multi-season datasets. A simple sanity check is verifying that the distribution of shot locations forms the expected pattern concentrated around the penalty area.

Data Validation Pipeline:

Before proceeding to feature engineering, a robust pipeline includes automated data validation checks. These checks catch issues that would otherwise silently corrupt model training.

def validate_shot_data(df: pd.DataFrame) -> dict:
    """Run comprehensive validation checks on shot data.

    Args:
        df: DataFrame with shot event records.

    Returns:
        Dictionary with validation results and warnings.
    """
    validation_results = {
        'total_records': len(df),
        'null_counts': df.isnull().sum().to_dict(),
        'warnings': [],
        'passed': True,
    }

    # Check goal rate is within expected bounds (6-14%)
    goal_rate = df['is_goal'].mean()
    if not 0.06 <= goal_rate <= 0.14:
        validation_results['warnings'].append(
            f"Goal rate {goal_rate:.3f} outside expected range [0.06, 0.14]"
        )
        validation_results['passed'] = False

    # Check for duplicate shot events
    if 'shot_id' in df.columns:
        n_dupes = df['shot_id'].duplicated().sum()
        if n_dupes > 0:
            validation_results['warnings'].append(
                f"Found {n_dupes} duplicate shot IDs"
            )

    # Check body part distribution
    if 'body_part' in df.columns:
        foot_pct = df['body_part'].isin(['left foot', 'right foot']).mean()
        if foot_pct < 0.70:
            validation_results['warnings'].append(
                f"Foot shot percentage {foot_pct:.2f} lower than expected (>0.70)"
            )

    # Verify coordinate consistency across seasons
    if 'season' in df.columns:
        for season in df['season'].unique():
            season_data = df[df['season'] == season]
            x_max = season_data['x'].max()
            if abs(x_max - 120) > 5:
                validation_results['warnings'].append(
                    f"Season {season}: max x={x_max:.1f}, expected ~120"
                )

    return validation_results

Common Mistake --- Coordinate Systems: One of the most frequent errors in xG modeling is mixing coordinate systems from different data providers. StatsBomb uses a 120x80 yard pitch, Opta uses a 100x100 percentage-based system, and Wyscout uses yet another convention. Always normalize to a common coordinate system before combining data. Failure to do this produces models that appear to work in cross-validation but fail catastrophically on new data.
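
As a concrete illustration, the sketch below rescales Opta-style 100x100 percentage coordinates to the 120x80 system used in this pipeline. This is a minimal linear rescaling under the assumption that both feeds attack in the direction of increasing x; in practice, verify axis orientation against each provider's documentation, since y-axis conventions differ between providers. The function and column names are illustrative.

import pandas as pd

def opta_to_sb_coords(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale Opta-style percentage coordinates to a 120x80 pitch.

    Assumes 'x' and 'y' are percentages in [0, 100] and that both
    systems share the same attacking direction.
    """
    df = df.copy()
    df['x'] = df['x'] * 120.0 / 100.0  # pitch length: percent -> 120 units
    df['y'] = df['y'] * 80.0 / 100.0   # pitch width: percent -> 80 units
    return df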

Data Sources for xG Modeling:

Source | Data Type | Coverage | Access | Cost
StatsBomb Open Data | Event-level shots | Select competitions | Free (GitHub) | None
Opta / Stats Perform | Event-level with freeze frames | All major leagues | Commercial API | High
Wyscout | Event-level with qualifiers | 100+ leagues | Commercial subscription | Medium
Understat | Aggregated xG values | Top 5 leagues | Free (web scraping) | None
FBref | Aggregated shot statistics | Top leagues | Free (web) | None

29.1.3 Feature Engineering

Feature engineering is where domain expertise meets data science. For xG, the most predictive features relate to the geometry of the shot relative to the goal.

Distance to Goal Center:

$$ d = \sqrt{(x - 120)^2 + (y - 40)^2} $$

where the goal center is at coordinates $(120, 40)$ on a standardized 120 x 80 pitch.

Angle to Goal:

$$ \theta = \arctan\left(\frac{7.32 \cdot (120 - x)}{(120 - x)^2 + (y - 40)^2 - (3.66)^2}\right) $$

This formula computes the angle subtended by the goal posts from the shot location, where 7.32 meters is the regulation goal width (half-width 3.66 meters).

def engineer_shot_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create geometric and contextual features for xG modeling.

    Args:
        df: DataFrame with raw shot coordinates and metadata.

    Returns:
        DataFrame augmented with engineered features.
    """
    # Distance to goal center
    df['distance'] = np.sqrt(
        (df['x'] - 120) ** 2 + (df['y'] - 40) ** 2
    )

    # Angle to goal (goal width 7.32 m, half-width 3.66 m)
    numerator = 7.32 * (120 - df['x'])
    denominator = (
        (120 - df['x']) ** 2 + (df['y'] - 40) ** 2 - 3.66 ** 2
    )
    df['angle'] = np.arctan2(numerator, denominator)
    df['angle_degrees'] = np.degrees(df['angle'])

    # Log-distance (captures diminishing returns)
    df['log_distance'] = np.log1p(df['distance'])

    # Central zone indicator
    df['is_central'] = (df['y'].between(30, 50)).astype(int)

    # Inside box indicator
    df['in_box'] = (
        (df['x'] >= 102) & (df['y'].between(18, 62))
    ).astype(int)

    # Interaction features
    df['angle_x_distance'] = df['angle'] * df['distance']
    df['angle_squared'] = df['angle'] ** 2

    return df

Advanced Feature Engineering --- Contextual Features:

The geometric features above form the baseline, but professional xG models incorporate much richer contextual information. When freeze-frame data (the positions of all players at the moment of the shot) is available, the following additional features significantly improve model performance:

def engineer_advanced_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create advanced contextual features when freeze-frame data is available.

    Args:
        df: DataFrame with shot data including freeze-frame information.

    Returns:
        DataFrame with additional contextual features.
    """
    # Number of defenders between shooter and goal
    if 'n_defenders_in_cone' in df.columns:
        df['defenders_blocking'] = df['n_defenders_in_cone']

    # Goalkeeper position relative to optimal
    if 'gk_distance_to_goal_center' in df.columns:
        df['gk_out_of_position'] = (
            df['gk_distance_to_goal_center'] > 3.0
        ).astype(int)

    # Shot following a cross or through ball
    if 'assist_type' in df.columns:
        df['from_cross'] = (
            df['assist_type'] == 'cross'
        ).astype(int)
        df['from_through_ball'] = (
            df['assist_type'] == 'through_ball'
        ).astype(int)

    # Fast break indicator (counter-attack)
    if 'play_pattern' in df.columns:
        df['is_counter'] = (
            df['play_pattern'] == 'from counter'
        ).astype(int)
        df['is_set_piece'] = (
            df['play_pattern'].isin([
                'from corner', 'from free kick',
                'from throw in'
            ])
        ).astype(int)

    # Game state (winning, drawing, losing)
    if 'goal_difference' in df.columns:
        df['winning'] = (df['goal_difference'] > 0).astype(int)
        df['losing'] = (df['goal_difference'] < 0).astype(int)

    # Match minute buckets
    if 'minute' in df.columns:
        df['late_game'] = (df['minute'] >= 75).astype(int)
        df['injury_time'] = (df['minute'] >= 90).astype(int)

    return df

Practical Tip --- Feature Importance: After training your model, always examine feature importances. If you find that a contextual feature (like game state or minute) has very high importance, it may indicate data leakage or a confound rather than a genuine predictive signal. For instance, shots in injury time may have higher conversion rates because they are more likely to occur on counter-attacks against teams pushing forward.
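
A quick way to run this check is permutation importance on held-out data, which measures how much shuffling each feature degrades log-loss. A minimal sketch using scikit-learn's permutation_importance (the wrapper function name is ours):

import pandas as pd
from sklearn.inspection import permutation_importance

def rank_feature_importance(model, X_test, y_test, feature_names):
    """Rank features by mean permutation importance under log-loss."""
    result = permutation_importance(
        model, X_test, y_test,
        scoring='neg_log_loss', n_repeats=10, random_state=42
    )
    return pd.DataFrame({
        'feature': feature_names,
        'importance': result.importances_mean,
    }).sort_values('importance', ascending=False)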

29.1.4 Model Training and Selection

We compare three model families and select based on log-loss and calibration:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss, brier_score_loss

FEATURE_COLS = [
    'distance', 'angle', 'log_distance', 'is_central',
    'in_box', 'angle_x_distance', 'angle_squared'
]

def train_and_evaluate(
    df: pd.DataFrame,
    feature_cols: list[str],
    n_splits: int = 5
) -> dict:
    """Train multiple xG models and compare via cross-validation.

    Args:
        df: Feature-engineered DataFrame.
        feature_cols: List of feature column names.
        n_splits: Number of CV folds.

    Returns:
        Dictionary mapping model names to mean log-loss scores.
    """
    X = df[feature_cols].values
    y = df['is_goal'].values

    models = {
        'logistic_regression': LogisticRegression(max_iter=1000),
        'gradient_boosting': GradientBoostingClassifier(
            n_estimators=200, max_depth=4, learning_rate=0.1
        ),
        'random_forest': RandomForestClassifier(
            n_estimators=200, max_depth=6
        ),
    }

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    results = {}

    for name, model in models.items():
        scores = cross_val_score(
            model, X, y, cv=cv, scoring='neg_log_loss'
        )
        results[name] = {
            'mean_log_loss': -scores.mean(),
            'std_log_loss': scores.std(),
        }

    return results

Hyperparameter Tuning:

In a production setting, hyperparameter tuning is essential. We use Bayesian optimization rather than grid search for efficiency:

from sklearn.model_selection import cross_val_score
import optuna

def tune_gradient_boosting(
    X: np.ndarray,
    y: np.ndarray,
    n_trials: int = 50
) -> dict:
    """Tune gradient boosting hyperparameters using Bayesian optimization.

    Args:
        X: Feature matrix.
        y: Target vector.
        n_trials: Number of optimization trials.

    Returns:
        Dictionary with best hyperparameters and score.
    """
    def objective(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 500),
            'max_depth': trial.suggest_int('max_depth', 3, 8),
            'learning_rate': trial.suggest_float(
                'learning_rate', 0.01, 0.3, log=True
            ),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 5, 50),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        }
        model = GradientBoostingClassifier(**params, random_state=42)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(
            model, X, y, cv=cv, scoring='neg_log_loss'
        )
        return -scores.mean()

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=n_trials)

    return {
        'best_params': study.best_params,
        'best_log_loss': study.best_value,
    }

Common Mistake --- Temporal Leakage: A critical error in xG model evaluation is failing to respect temporal ordering. If your training data spans multiple seasons, you must evaluate using time-based splits (train on seasons 1-3, test on season 4), not random cross-validation. Random CV will overestimate performance because shots from the same match or season may appear in both training and test sets, and systematic factors (pitch conditions, ball design, tactical trends) correlate within seasons.
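
A minimal season-based split looks like the sketch below, assuming the 'season' column checked in the validation pipeline above is present:

def temporal_train_test_split(
    df: pd.DataFrame,
    train_seasons: list,
    test_seasons: list,
    feature_cols: list[str]
) -> tuple:
    """Split shots by season so no test-season data leaks into training."""
    train = df[df['season'].isin(train_seasons)]
    test = df[df['season'].isin(test_seasons)]
    return (
        train[feature_cols].values, train['is_goal'].values,
        test[feature_cols].values, test['is_goal'].values,
    )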

29.1.5 Model Calibration

A well-calibrated xG model should produce probabilities that match observed goal rates. For instance, shots assigned $xG = 0.20$ should result in goals approximately 20% of the time.

$$ \text{Calibration Error} = \frac{1}{B} \sum_{b=1}^{B} |p_b - \hat{p}_b| $$

where $B$ is the number of bins, $p_b$ is the observed proportion of goals in bin $b$, and $\hat{p}_b$ is the mean predicted probability in that bin.

from sklearn.calibration import CalibratedClassifierCV

def calibrate_and_evaluate(
    model,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    method: str = 'isotonic'
) -> dict:
    """Calibrate a trained model and evaluate calibration quality.

    Args:
        model: Trained sklearn classifier.
        X_train: Training features.
        y_train: Training labels.
        X_test: Test features.
        y_test: Test labels.
        method: Calibration method ('sigmoid' for Platt scaling,
                'isotonic' for isotonic regression).

    Returns:
        Dictionary with calibration metrics before and after calibration.
    """
    # Pre-calibration metrics
    y_pred_raw = model.predict_proba(X_test)[:, 1]
    raw_log_loss = log_loss(y_test, y_pred_raw)
    raw_brier = brier_score_loss(y_test, y_pred_raw)

    # Calibrate
    calibrated = CalibratedClassifierCV(
        model, method=method, cv=5
    )
    calibrated.fit(X_train, y_train)

    # Post-calibration metrics
    y_pred_cal = calibrated.predict_proba(X_test)[:, 1]
    cal_log_loss = log_loss(y_test, y_pred_cal)
    cal_brier = brier_score_loss(y_test, y_pred_cal)

    # Compute calibration curve
    prob_true, prob_pred = calibration_curve(
        y_test, y_pred_cal, n_bins=10
    )

    return {
        'raw_log_loss': raw_log_loss,
        'raw_brier': raw_brier,
        'calibrated_log_loss': cal_log_loss,
        'calibrated_brier': cal_brier,
        'calibration_curve': {
            'observed': prob_true.tolist(),
            'predicted': prob_pred.tolist(),
        },
    }

Key Insight: Gradient boosting models often require post-hoc calibration (e.g., Platt scaling or isotonic regression) because their raw outputs can be poorly calibrated despite achieving low log-loss. Logistic regression, by contrast, tends to be well calibrated out of the box because it directly optimizes log-loss.

Calibration by Shot Type:

A global calibration check is not sufficient. You should also verify calibration within important subgroups:

def check_subgroup_calibration(
    df: pd.DataFrame,
    pred_col: str = 'xg_pred',
    target_col: str = 'is_goal',
    subgroup_col: str = 'body_part',
    n_bins: int = 5
) -> pd.DataFrame:
    """Check model calibration across subgroups.

    Args:
        df: DataFrame with predictions and outcomes.
        pred_col: Column with predicted probabilities.
        target_col: Column with binary outcomes.
        subgroup_col: Column defining subgroups.
        n_bins: Number of calibration bins per subgroup.

    Returns:
        DataFrame with calibration error per subgroup.
    """
    results = []
    for group_name, group_df in df.groupby(subgroup_col):
        if len(group_df) < 100:
            continue
        prob_true, prob_pred = calibration_curve(
            group_df[target_col],
            group_df[pred_col],
            n_bins=n_bins
        )
        cal_error = np.mean(np.abs(prob_true - prob_pred))
        results.append({
            'subgroup': group_name,
            'n_shots': len(group_df),
            'calibration_error': cal_error,
            'mean_prediction': group_df[pred_col].mean(),
            'observed_rate': group_df[target_col].mean(),
        })
    return pd.DataFrame(results).sort_values('calibration_error')

29.1.6 Deployment Architecture

The final xG model is serialized using joblib and wrapped in a scoring function that can be called from the club's data platform:

import joblib
from datetime import datetime

def deploy_model(model, scaler, filepath: str) -> None:
    """Serialize trained model and preprocessing objects.

    Args:
        model: Trained sklearn model.
        scaler: Fitted StandardScaler or similar preprocessor.
        filepath: Output path for the serialized pipeline.
    """
    pipeline_artifact = {
        'model': model,
        'scaler': scaler,
        'feature_names': FEATURE_COLS,
        'version': '1.0.0',
        'trained_date': datetime.now().isoformat(),
        'n_features': model.n_features_in_,
    }
    joblib.dump(pipeline_artifact, filepath)


def score_new_shot(
    shot_data: dict,
    model_path: str
) -> float:
    """Score a single shot using the deployed xG model.

    Args:
        shot_data: Dictionary with shot features.
        model_path: Path to the serialized model artifact.

    Returns:
        xG probability for the shot.
    """
    artifact = joblib.load(model_path)
    features = np.array([
        [shot_data[f] for f in artifact['feature_names']]
    ])
    if artifact.get('scaler'):
        features = artifact['scaler'].transform(features)
    prob = artifact['model'].predict_proba(features)[0, 1]
    return float(prob)

Deployment Monitoring:

A deployed model requires continuous monitoring to detect drift and degradation:

def monitor_model_performance(
    predictions_log: pd.DataFrame,
    window_days: int = 30
) -> dict:
    """Monitor deployed xG model performance over a rolling window.

    Args:
        predictions_log: DataFrame with columns 'date', 'xg_pred', 'is_goal'.
        window_days: Rolling window for performance calculation.

    Returns:
        Dictionary with current performance metrics and drift alerts.
    """
    recent = predictions_log[
        predictions_log['date'] >= (
            predictions_log['date'].max() - pd.Timedelta(days=window_days)
        )
    ]

    current_log_loss = log_loss(recent['is_goal'], recent['xg_pred'])
    current_brier = brier_score_loss(recent['is_goal'], recent['xg_pred'])
    mean_pred = recent['xg_pred'].mean()
    observed_rate = recent['is_goal'].mean()

    alerts = []
    if current_log_loss > 0.36:
        alerts.append('Log-loss exceeds threshold (0.36)')
    if abs(mean_pred - observed_rate) > 0.03:
        alerts.append(
            f'Prediction-outcome gap: {abs(mean_pred - observed_rate):.3f}'
        )

    return {
        'log_loss': current_log_loss,
        'brier_score': current_brier,
        'mean_prediction': mean_pred,
        'observed_goal_rate': observed_rate,
        'n_predictions': len(recent),
        'alerts': alerts,
    }

29.1.7 Results and Discussion

A well-built xG pipeline typically achieves the following benchmarks on large datasets (50,000+ shots):

Model | Log-Loss | Brier Score | Calibration Error
Logistic Regression | 0.338 | 0.082 | 0.012
Gradient Boosting | 0.321 | 0.078 | 0.018
GB + Isotonic Calibration | 0.323 | 0.076 | 0.008

The gradient boosting model with isotonic calibration provides the best combination of discrimination and calibration. However, logistic regression remains a strong baseline and offers superior interpretability for communicating with coaching staff.

Lesson Learned --- xG Pipeline: The most common failure mode in xG model development is over-engineering the feature set without investing enough in data quality. A simple model with clean, well-validated data will outperform a complex model built on messy data every time. Start with the geometric baseline (distance and angle), verify that your model beats the naive base rate, and then incrementally add features while monitoring both performance and calibration. This incremental approach also makes it much easier to explain the model to non-technical stakeholders.
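
The base-rate check mentioned above takes only a few lines: a naive "model" predicts the training-set goal rate for every shot, and its log-loss is the floor any real model must beat. A minimal sketch:

import numpy as np
from sklearn.metrics import log_loss

def baseline_log_loss(y_train: np.ndarray, y_test: np.ndarray) -> float:
    """Log-loss of always predicting the training-set goal rate."""
    base_rate = float(np.mean(y_train))
    return log_loss(y_test, np.full(len(y_test), base_rate))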

29.1.8 Reproducing This Analysis

To reproduce this case study with publicly available data:

  1. Download StatsBomb open data from GitHub (statsbombpy library; see the sketch after this list)
  2. Filter for shot events across available competitions
  3. Standardize coordinates to the 120x80 system
  4. Run the feature engineering pipeline above
  5. Split data temporally (earlier competitions for training, later for testing)
  6. Train and evaluate using the provided code
  7. Expected dataset size: approximately 20,000-40,000 shots depending on competitions selected
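
A sketch of steps 1-3 using the statsbombpy library follows. Note that statsbombpy's flattened event columns differ slightly from the schema used earlier in this case study (e.g., 'shot_outcome' rather than 'outcome'), so a small renaming step is needed before running the pipeline; the competition and season IDs you pass in come from sb.competitions().

from statsbombpy import sb
import pandas as pd

def download_open_shots(competition_id: int, season_id: int) -> pd.DataFrame:
    """Pull shot events for one competition-season from StatsBomb open data."""
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    all_shots = []
    for match_id in matches['match_id']:
        events = sb.events(match_id=match_id)
        shots = events[events['type'] == 'Shot'].copy()
        # Coordinates arrive as [x, y] lists in the 'location' column
        shots['x'] = shots['location'].str[0]
        shots['y'] = shots['location'].str[1]
        all_shots.append(shots)
    return pd.concat(all_shots, ignore_index=True)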

29.2 Case Study: Scouting Campaign for a Striker

29.2.1 Project Overview

A mid-table Premier League club needs to replace a departing striker. The analytics department is tasked with identifying candidates who fit the tactical profile, are within budget, and have a high probability of success in the league. This case study walks through the entire scouting workflow, from defining requirements to producing a shortlist.

Objective: Produce a ranked shortlist of 5 striker candidates from a pool of 500+ forwards across Europe's top leagues.

Template Workflow --- Scouting Campaign: The scouting workflow follows six stages: (1) Define the player profile with coaching staff, (2) Build and normalize the scouting database, (3) Apply quantitative filters and scoring, (4) Perform similarity and cluster analysis, (5) Apply league-level adjustments, (6) Assess squad fit and financial viability. Each stage has clear inputs, outputs, and decision points. Document every decision for transparency and future reference.

29.2.2 Defining the Player Profile

The first step is translating the coaching staff's requirements into quantifiable criteria. Through meetings with the head coach and sporting director, the following profile emerges:

Attribute | Minimum Threshold | Ideal Target | Weight
Non-penalty xG per 90 | 0.35 | 0.50+ | 0.25
Pressing actions per 90 | 15 | 20+ | 0.15
Aerial duels won % | 45% | 55%+ | 0.10
Progressive carries per 90 | 3.0 | 7.0+ | 0.15
Age | 21-28 | 23-26 | 0.10
Contract years remaining | 1-3 | 1-2 | 0.10
Estimated transfer fee (EUR) | < 25M | < 15M | 0.15

Practical Tip --- Stakeholder Meetings: The player profile definition meeting is arguably the most important step in the entire scouting process. Poorly defined requirements lead to wasted analytical effort and a shortlist that the coaching staff will reject. Prepare for this meeting by bringing data on the departing player's profile, the team's current attacking patterns, and examples of different striker archetypes. Ask open-ended questions ("What does the ideal striker look like in our system?") before presenting quantitative thresholds.

29.2.3 Data Collection and Normalization

We aggregate data from multiple sources: event data providers for on-ball metrics, tracking data for off-ball movement, and financial databases for market values.

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def build_scouting_database(
    events_path: str,
    tracking_path: str,
    market_path: str,
    min_minutes: int = 1500
) -> pd.DataFrame:
    """Merge multiple data sources into a unified scouting database.

    Args:
        events_path: Path to event-level aggregated stats.
        tracking_path: Path to tracking-derived metrics.
        market_path: Path to market value and contract data.
        min_minutes: Minimum minutes played for inclusion.

    Returns:
        Merged DataFrame with per-90 metrics and market info.
    """
    events = pd.read_csv(events_path)
    tracking = pd.read_csv(tracking_path)
    market = pd.read_csv(market_path)

    # Filter by playing time
    events = events[events['minutes_played'] >= min_minutes].copy()

    # Merge on player ID
    df = events.merge(tracking, on='player_id', how='inner')
    df = df.merge(market, on='player_id', how='inner')

    return df

Per-90 Normalization:

$$ \text{metric}_{p90} = \frac{\text{metric}_{\text{total}}}{\text{minutes played}} \times 90 $$

Scouting Callout: Per-90 normalization is essential for fair comparison across leagues with different schedules, but be wary of players with marginal playing time. A minimum threshold of 1,500 minutes (roughly 17 full matches) is standard practice to ensure statistical stability.
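
A minimal helper applying this normalization, assuming the 'minutes_played' column from the merged database above:

def add_per90_columns(
    df: pd.DataFrame,
    metric_cols: list[str]
) -> pd.DataFrame:
    """Convert raw season totals into per-90 rates."""
    df = df.copy()
    for col in metric_cols:
        df[f'{col}_p90'] = df[col] / df['minutes_played'] * 90
    return df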

League-Level Adjustments:

Raw per-90 metrics are not directly comparable across leagues. A striker scoring 0.45 npxG/90 in the Dutch Eredivisie is not equivalent to the same figure in the Premier League. We apply league adjustment factors based on historical transfer performance data:

def apply_league_adjustments(
    df: pd.DataFrame,
    adjustment_factors: dict,
    metrics_to_adjust: list[str]
) -> pd.DataFrame:
    """Adjust per-90 metrics for league strength differences.

    Args:
        df: Scouting database with per-90 metrics.
        adjustment_factors: Dict mapping league names to
            multiplicative adjustment factors.
        metrics_to_adjust: List of metric columns to adjust.

    Returns:
        DataFrame with league-adjusted metrics.
    """
    df = df.copy()
    for metric in metrics_to_adjust:
        adjusted_col = f'{metric}_adj'
        df[adjusted_col] = df.apply(
            lambda row: row[metric] * adjustment_factors.get(
                row['league'], 1.0
            ),
            axis=1
        )
    return df

# Example adjustment factors (illustrative)
LEAGUE_ADJUSTMENTS = {
    'Premier League': 1.00,
    'La Liga': 0.97,
    'Bundesliga': 0.95,
    'Serie A': 0.96,
    'Ligue 1': 0.92,
    'Eredivisie': 0.82,
    'Primeira Liga': 0.85,
    'Belgian Pro League': 0.80,
    'Championship': 0.78,
    'Austrian Bundesliga': 0.75,
}

Common Mistake --- League Adjustments: League adjustment factors are inherently uncertain and should be treated as rough guidelines, not precise conversion rates. A player's individual profile matters more than the league average. A technically elite player in a weaker league may translate better than the adjustment factor suggests, while a physically dominant player in the same league may struggle in a more technical environment. Use adjustments to flag candidates for closer inspection, not to make final decisions.

29.2.4 Multi-Criteria Scoring

We implement a weighted scoring system that combines all attributes into a single composite score:

def compute_scouting_score(
    df: pd.DataFrame,
    criteria: dict[str, dict],
) -> pd.DataFrame:
    """Compute weighted composite scouting scores.

    Args:
        df: Scouting database with all metrics.
        criteria: Dict mapping metric names to dicts with
            'weight', 'min_threshold', and 'direction' keys.

    Returns:
        DataFrame with added 'composite_score' column, sorted descending.
    """
    # Apply all minimum-threshold filters first so that every score
    # component below is computed over the same set of players
    for metric, params in criteria.items():
        if 'min_threshold' in params:
            df = df[df[metric] >= params['min_threshold']]
    df = df.copy()

    scaler = MinMaxScaler()
    score_components = []

    for metric, params in criteria.items():
        # Normalize to [0, 1]
        normalized = scaler.fit_transform(
            df[metric].values.reshape(-1, 1)
        ).flatten()

        # Invert if lower is better (e.g., transfer fee)
        if params.get('direction') == 'lower':
            normalized = 1 - normalized

        score_components.append(normalized * params['weight'])

    df['composite_score'] = np.sum(score_components, axis=0)
    return df.sort_values('composite_score', ascending=False)
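
To connect this function back to the profile table in Section 29.2.2, the criteria argument might be encoded as below. The column names are assumptions about the scouting database schema built in Section 29.2.3, treating age and contract length as "lower is better" is a simplification of the banded targets in the table, and the fee ceiling would typically be applied as a separate hard filter.

# Illustrative criteria encoding the Section 29.2.2 profile
# (column names are assumed; adjust to your database schema)
STRIKER_CRITERIA = {
    'npxg_p90': {'weight': 0.25, 'min_threshold': 0.35},
    'pressing_actions_p90': {'weight': 0.15, 'min_threshold': 15},
    'aerial_won_pct': {'weight': 0.10, 'min_threshold': 45},
    'progressive_carries_p90': {'weight': 0.15, 'min_threshold': 3.0},
    'age': {'weight': 0.10, 'direction': 'lower'},
    'contract_years': {'weight': 0.10, 'direction': 'lower'},
    'estimated_fee_eur_m': {'weight': 0.15, 'direction': 'lower'},
}

shortlist = compute_scouting_score(scouting_db, STRIKER_CRITERIA)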

29.2.5 Similarity Analysis

Beyond raw scoring, we use player similarity analysis to find candidates who match the playing style of the departing striker or a specified archetype.

Cosine Similarity:

$$ \text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||} $$

from sklearn.metrics.pairwise import cosine_similarity

def find_similar_players(
    df: pd.DataFrame,
    target_player_id: str,
    feature_cols: list[str],
    top_n: int = 10
) -> pd.DataFrame:
    """Find players most similar to a target player profile.

    Args:
        df: Scouting database.
        target_player_id: ID of the reference player.
        feature_cols: Metrics to use for similarity computation.
        top_n: Number of similar players to return.

    Returns:
        DataFrame of the top_n most similar players with similarity scores.
    """
    scaler = MinMaxScaler()
    X = scaler.fit_transform(df[feature_cols].values)

    # Look up the target row positionally so this works even when the
    # DataFrame index is not a default RangeIndex
    target_pos = np.flatnonzero(
        df['player_id'].values == target_player_id
    )[0]
    target_vec = X[target_pos].reshape(1, -1)

    similarities = cosine_similarity(target_vec, X).flatten()
    df = df.copy()
    df['similarity'] = similarities

    return (
        df[df['player_id'] != target_player_id]
        .nlargest(top_n, 'similarity')
    )

29.2.6 Cluster Analysis for Archetype Discovery

We use $k$-means clustering to discover natural player archetypes within the forward population:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def discover_archetypes(
    df: pd.DataFrame,
    feature_cols: list[str],
    n_clusters: int = 6
) -> pd.DataFrame:
    """Identify player archetypes using clustering.

    Args:
        df: Scouting database.
        feature_cols: Metrics for clustering.
        n_clusters: Number of archetypes to discover.

    Returns:
        DataFrame with added 'archetype' column.
    """
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(df[feature_cols].values)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    df['archetype'] = kmeans.fit_predict(X_scaled)

    # PCA for visualization
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X_scaled)
    df['pca_1'] = coords[:, 0]
    df['pca_2'] = coords[:, 1]

    return df
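
The choice of n_clusters = 6 above is a judgment call. One defensible way to select it is the mean silhouette score, as in this sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_n_clusters(
    X_scaled: np.ndarray, k_min: int = 3, k_max: int = 9
) -> int:
    """Pick the cluster count with the highest mean silhouette score."""
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(
            n_clusters=k, random_state=42, n_init=10
        ).fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k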

29.2.7 Squad Fit Analysis

A player who scores well on individual metrics may still be a poor fit for the team. Squad fit analysis evaluates how a potential signing would integrate into the existing squad.

def assess_squad_fit(
    candidate: pd.Series,
    current_squad: pd.DataFrame,
    tactical_profile: dict,
    position: str = 'striker'
) -> dict:
    """Evaluate how well a candidate fits the existing squad.

    Args:
        candidate: Series with the candidate's metrics.
        current_squad: DataFrame with current squad members.
        tactical_profile: Dict describing the team's tactical approach.
        position: Position the candidate would fill.

    Returns:
        Dictionary with fit scores across multiple dimensions.
    """
    fit_scores = {}

    # Age profile fit: distance from the profile's ideal age,
    # clamped to [0, 1]
    fit_scores['age_balance'] = max(
        0.0,
        1.0 - abs(
            candidate['age'] - tactical_profile.get('ideal_age', 25)
        ) / 10.0
    )

    # Wage structure fit
    if 'estimated_wages' in candidate.index:
        max_wage = current_squad['wages'].quantile(0.9)
        fit_scores['wage_fit'] = min(
            1.0, max_wage / max(candidate['estimated_wages'], 1)
        )

    # Tactical complementarity
    if tactical_profile.get('needs_aerial_threat', False):
        fit_scores['aerial_fit'] = min(
            candidate.get('aerial_won_pct', 0) / 60.0, 1.0
        )

    if tactical_profile.get('needs_pressing', False):
        fit_scores['pressing_fit'] = min(
            candidate.get('pressing_actions_p90', 0) / 25.0, 1.0
        )

    # Overall fit score
    fit_scores['overall'] = np.mean(list(fit_scores.values()))

    return fit_scores

Lesson Learned --- Scouting: The greatest risk in data-driven scouting is not analytical error but organizational misalignment. If the analytics department produces a shortlist that the head coach does not trust or the sporting director cannot negotiate within budget, the entire exercise is wasted. Successful scouting campaigns require continuous communication between analytics, coaching, scouting, and recruitment departments. The data narrows the funnel; human judgment and relationships close the deal.

29.2.8 Financial Modeling

Transfer decisions are fundamentally financial decisions. The analytics department should provide financial context alongside performance analysis:

def estimate_transfer_value(
    player: pd.Series,
    market_conditions: dict
) -> dict:
    """Estimate a player's transfer value and assess financial viability.

    Args:
        player: Series with player attributes and market data.
        market_conditions: Dict with current market parameters.

    Returns:
        Dictionary with value estimates and financial projections.
    """
    # Base valuation from market comparables
    base_value = player.get('market_value_eur', 0)

    # Contract adjustment: fewer years remaining = lower fee
    contract_factor = min(player.get('contract_years', 3) / 4.0, 1.0)

    # Age adjustment: premium for peak years, discount otherwise
    age = player.get('age', 25)
    if 23 <= age <= 27:
        age_factor = 1.1
    elif age < 23:
        age_factor = 1.05
    else:
        age_factor = max(0.6, 1.0 - (age - 27) * 0.08)

    estimated_fee = base_value * contract_factor * age_factor

    # Amortization over expected contract length
    contract_length = market_conditions.get('standard_contract_years', 4)
    annual_amortization = estimated_fee / contract_length

    return {
        'estimated_fee': estimated_fee,
        'annual_amortization': annual_amortization,
        'estimated_wages_annual': player.get('estimated_wages', 0) * 52,
        'total_annual_cost': (
            annual_amortization + player.get('estimated_wages', 0) * 52
        ),
        'contract_factor': contract_factor,
        'age_factor': age_factor,
    }

29.2.9 Final Shortlist and Reporting

The final shortlist is produced by combining composite scores, similarity analysis, squad fit assessment, and financial modeling. The output is a structured report for the sporting director that includes radar charts, statistical profiles, and risk assessments for each candidate.

Practical Tip --- The Scouting Report: The final scouting report should be concise (no more than two pages per candidate) and structured consistently. Each candidate profile should include: (1) a radar chart showing their percentile rankings, (2) a one-paragraph statistical summary, (3) a squad fit assessment, (4) a financial summary with estimated total annual cost, (5) key strengths and risk factors, and (6) recommended video clips for the coaching staff to review.
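
A minimal radar chart implementation with matplotlib is sketched below; it assumes metrics have already been converted to percentile rankings on a 0-100 scale.

import numpy as np
import matplotlib.pyplot as plt

def plot_percentile_radar(percentiles: dict, title: str):
    """Draw a radar chart of percentile rankings (0-100 scale)."""
    labels = list(percentiles.keys())
    values = list(percentiles.values())

    # Spread the axes evenly around the circle and close the polygon
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
    values = values + values[:1]
    angles = angles + angles[:1]

    fig, ax = plt.subplots(subplot_kw={'polar': True}, figsize=(6, 6))
    ax.plot(angles, values, linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, fontsize=8)
    ax.set_ylim(0, 100)
    ax.set_title(title)
    return fig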


29.3 Case Study: Tactical Analysis of a Season

29.3.1 Project Overview

The analytics department is tasked with producing a comprehensive tactical review of the team's season. This involves analyzing formations, pressing patterns, ball progression methods, and set-piece effectiveness across all 38 league matches.

Objective: Identify the tactical patterns that correlated with wins, and quantify where the team underperformed relative to expected metrics.

29.3.2 Formation Detection

Modern teams rarely play static formations. We use positional data to detect the effective formation in each phase of play.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

def detect_formation(
    positions: np.ndarray,
    n_outfield: int = 10
) -> str:
    """Detect formation from average player positions.

    Args:
        positions: Array of shape (n_outfield, 2) with (x, y) positions.
        n_outfield: Number of outfield players.

    Returns:
        String representation of detected formation (e.g., '4-3-3').
    """
    # Cluster x-positions (depth) to find the defensive, midfield,
    # and attacking lines
    depth_positions = positions[:, 0].reshape(-1, 1)

    best_formation = None
    best_score = -np.inf

    for n_lines in [3, 4, 5]:
        kmeans = KMeans(n_clusters=n_lines, random_state=42, n_init=10)
        labels = kmeans.fit_predict(depth_positions)
        # Use silhouette rather than inertia: inertia always improves
        # with more clusters, so it would always select five lines
        score = silhouette_score(depth_positions, labels)

        if score > best_score:
            best_score = score
            best_formation = kmeans

    # Count players per line, sorted by depth
    labels = best_formation.labels_
    centers = best_formation.cluster_centers_.flatten()
    sorted_lines = np.argsort(centers)

    formation_str = '-'.join(
        str(np.sum(labels == line)) for line in sorted_lines
    )

    return formation_str

29.3.3 Passing Network Analysis

Passing networks reveal the structural backbone of a team's build-up play. We construct weighted directed graphs where nodes are players and edge weights represent pass frequency.

Betweenness Centrality:

$$ C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$, and $\sigma_{st}(v)$ is the number of those paths passing through $v$.

import networkx as nx

def build_passing_network(
    passes: pd.DataFrame,
    min_passes: int = 3
) -> nx.DiGraph:
    """Construct a weighted passing network from event data.

    Args:
        passes: DataFrame with 'passer', 'receiver', and optional weights.
        min_passes: Minimum passes between a pair for edge inclusion.

    Returns:
        NetworkX directed graph with pass counts as edge weights.
    """
    # Aggregate pass counts
    pass_counts = (
        passes.groupby(['passer', 'receiver'])
        .size()
        .reset_index(name='passes')
    )
    pass_counts = pass_counts[pass_counts['passes'] >= min_passes]

    G = nx.DiGraph()

    for _, row in pass_counts.iterrows():
        G.add_edge(
            row['passer'],
            row['receiver'],
            weight=row['passes']
        )

    return G


def analyze_network_metrics(G: nx.DiGraph) -> pd.DataFrame:
    """Compute centrality and flow metrics for a passing network.

    Args:
        G: Weighted directed passing network.

    Returns:
        DataFrame with centrality metrics per player.
    """
    # NetworkX treats edge weights as distances in shortest-path
    # algorithms, so invert pass counts for betweenness: frequent
    # connections should correspond to short paths
    for _, _, data in G.edges(data=True):
        data['distance'] = 1.0 / data['weight']

    # Build columns from dicts keyed by player so everything aligns
    # by node rather than by position
    metrics = pd.DataFrame({
        'degree_centrality': nx.degree_centrality(G),
        'betweenness_centrality': nx.betweenness_centrality(
            G, weight='distance'
        ),
        'eigenvector_centrality': nx.eigenvector_centrality(
            G, weight='weight', max_iter=1000
        ),
        'in_degree': dict(G.in_degree(weight='weight')),
        'out_degree': dict(G.out_degree(weight='weight')),
    })
    metrics.index.name = 'player'

    return (
        metrics.reset_index()
        .sort_values('betweenness_centrality', ascending=False)
    )

29.3.4 Pressing Intensity Analysis

Pressing is measured through PPDA (Passes Per Defensive Action) and high turnovers. Lower PPDA indicates more intense pressing.

$$ \text{PPDA} = \frac{\text{opponent passes allowed in their own half}}{\text{our defensive actions in the opponent's half}} $$

def compute_ppda(
    events: pd.DataFrame,
    team_id: str
) -> pd.DataFrame:
    """Compute PPDA (Passes Per Defensive Action) per match.

    Args:
        events: Full event data for the season.
        team_id: ID of the team to analyze.

    Returns:
        DataFrame with PPDA per match.
    """
    defensive_actions = ['tackle', 'interception', 'foul']

    match_ppda = []
    for match_id in events['match_id'].unique():
        match_events = events[events['match_id'] == match_id]

        # Opponent passes in own half
        opp_passes = match_events[
            (match_events['team_id'] != team_id) &
            (match_events['type'] == 'pass') &
            (match_events['x'] < 60)
        ].shape[0]

        # Defensive actions in opponent half
        def_actions = match_events[
            (match_events['team_id'] == team_id) &
            (match_events['type'].isin(defensive_actions)) &
            (match_events['x'] > 60)
        ].shape[0]

        ppda = opp_passes / max(def_actions, 1)
        match_ppda.append({
            'match_id': match_id,
            'ppda': ppda,
        })

    return pd.DataFrame(match_ppda)

29.3.5 Tactical Phase Analysis

We segment the season into distinct tactical phases based on formation changes, personnel shifts, and performance trends. This reveals how the coaching staff adapted throughout the campaign.

def segment_tactical_phases(
    match_stats: pd.DataFrame,
    window: int = 5
) -> pd.DataFrame:
    """Identify tactical phase transitions using rolling metrics.

    Args:
        match_stats: Per-match tactical metrics (PPDA, possession, etc.).
        window: Rolling window size for smoothing.

    Returns:
        DataFrame with phase labels and transition points.
    """
    for col in ['ppda', 'possession', 'field_tilt']:
        match_stats[f'{col}_rolling'] = (
            match_stats[col].rolling(window=window, min_periods=1).mean()
        )

    # Detect change points using simple variance method
    match_stats['tactical_variance'] = (
        match_stats['ppda_rolling'].rolling(window=3).std() +
        match_stats['possession_rolling'].rolling(window=3).std()
    )

    threshold = match_stats['tactical_variance'].quantile(0.85)
    match_stats['phase_transition'] = (
        match_stats['tactical_variance'] > threshold
    ).astype(int)

    # Label phases
    match_stats['phase'] = match_stats['phase_transition'].cumsum()

    return match_stats

29.3.6 Expected Points Analysis

We compare actual points earned against expected points based on xG and xGA (expected goals against) to assess whether the team over- or under-performed.

$$ P(\text{win}) = \sum_{g_h > g_a} P(G_h = g_h) \cdot P(G_a = g_a) $$

where $G_h \sim \text{Poisson}(xG)$ and $G_a \sim \text{Poisson}(xGA)$.

from scipy.stats import poisson

def expected_points(xg: float, xga: float) -> float:
    """Compute expected points from xG and xGA using Poisson model.

    Args:
        xg: Expected goals scored.
        xga: Expected goals conceded.

    Returns:
        Expected points for the match (0-3 scale).
    """
    max_goals = 10
    p_win = 0.0
    p_draw = 0.0

    for g_home in range(max_goals):
        for g_away in range(max_goals):
            p_home = poisson.pmf(g_home, xg)
            p_away = poisson.pmf(g_away, xga)
            joint_p = p_home * p_away

            if g_home > g_away:
                p_win += joint_p
            elif g_home == g_away:
                p_draw += joint_p

    return 3 * p_win + 1 * p_draw
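
Applied across a season, the per-match expected points can be accumulated and compared against actual points. The sketch below assumes match_stats carries 'xg', 'xga', and 'points' columns for each match in chronological order.

def season_expected_points(match_stats: pd.DataFrame) -> pd.DataFrame:
    """Accumulate actual vs expected points across a season."""
    match_stats = match_stats.copy()
    match_stats['xpts'] = match_stats.apply(
        lambda row: expected_points(row['xg'], row['xga']), axis=1
    )
    match_stats['cum_points'] = match_stats['points'].cumsum()
    match_stats['cum_xpts'] = match_stats['xpts'].cumsum()
    return match_stats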

Lesson Learned --- Tactical Analysis: The most valuable output of a season-long tactical analysis is not a single insight but a narrative. Coaches and directors want to understand the story of their season: what worked early on, what changed when key players were injured, how the opposition adapted, and where the inflection points were. Structure your analysis chronologically and connect tactical decisions to results. A timeline visualization showing formation changes, PPDA trends, and results in parallel is often the single most impactful deliverable.


29.4 Case Study: Injury Prevention Program

29.4.1 Project Overview

Injuries are the single greatest source of uncontrollable variance in soccer. A comprehensive injury prevention program combines workload monitoring, risk modeling, and individualized management protocols. This case study designs such a system from the ground up.

Objective: Build an injury risk monitoring system that provides daily risk scores for each player, enabling proactive load management.

29.4.2 Workload Monitoring Framework

The acute:chronic workload ratio (ACWR) is a widely used metric for monitoring injury risk. It compares recent training load to a longer baseline.

$$ \text{ACWR} = \frac{\text{Acute Load (7-day rolling mean)}}{\text{Chronic Load (28-day EWMA)}} $$

An ACWR between 0.8 and 1.3 is generally considered the "safe zone," while values above 1.5 indicate significantly elevated risk.

def compute_acwr(
    daily_load: pd.Series,
    acute_window: int = 7,
    chronic_window: int = 28
) -> pd.Series:
    """Compute the Acute:Chronic Workload Ratio.

    Args:
        daily_load: Time series of daily training/match load values.
        acute_window: Window for acute load calculation (days).
        chronic_window: Window for chronic load EWMA (days).

    Returns:
        Series of ACWR values aligned with the input index.
    """
    acute = daily_load.rolling(window=acute_window, min_periods=1).mean()
    chronic = daily_load.ewm(span=chronic_window, min_periods=1).mean()

    # Avoid division by zero
    acwr = acute / chronic.replace(0, np.nan)

    return acwr

29.4.3 Multi-Factor Risk Model

We build a logistic regression model that incorporates multiple risk factors beyond workload:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def build_injury_risk_model(
    df: pd.DataFrame,
    feature_cols: list[str],
    target_col: str = 'injury_within_14_days'
) -> tuple:
    """Train a multi-factor injury risk prediction model.

    Args:
        df: Historical player-day records with features and outcomes.
        feature_cols: List of predictor variable names.
        target_col: Binary target indicating injury within the prediction window.

    Returns:
        Tuple of (trained model, fitted scaler, feature importances).
    """
    X = df[feature_cols].values
    y = df[target_col].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    model = LogisticRegression(
        class_weight='balanced',
        max_iter=1000,
        C=0.1
    )
    model.fit(X_scaled, y)

    importances = pd.DataFrame({
        'feature': feature_cols,
        'coefficient': model.coef_[0],
        'abs_coefficient': np.abs(model.coef_[0])
    }).sort_values('abs_coefficient', ascending=False)

    return model, scaler, importances

Key risk factors include the following (a sketch encoding them as model features appears after the list):

  • Acute:Chronic Workload Ratio (ACWR)
  • Total distance covered in training (meters)
  • High-speed running distance (> 7.5 m/s)
  • Number of accelerations and decelerations
  • Days since last rest day
  • Previous injury history (binary: injured in prior 6 months)
  • Age
  • Match congestion (matches in last 14 days)
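
These factors might be wired into build_injury_risk_model as below. The column names and the player_days DataFrame are hypothetical placeholders for whatever the club's monitoring database actually stores.

# Hypothetical feature columns for the player-day records
INJURY_FEATURES = [
    'acwr', 'total_distance_m', 'hsr_distance_m',
    'accel_decel_count', 'days_since_rest',
    'prior_injury_6m', 'age', 'matches_last_14d',
]

model, scaler, importances = build_injury_risk_model(
    player_days, INJURY_FEATURES
)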

29.4.4 Survival Analysis for Return-to-Play

For players currently injured, survival analysis models the expected time to return:

$$ S(t) = P(T > t) = \exp\left(-\int_0^t \lambda(u) \, du\right) $$

where $\lambda(t)$ is the hazard function representing the instantaneous risk of returning to play at time $t$.

def kaplan_meier_return_to_play(
    injury_data: pd.DataFrame,
    injury_type: str
) -> pd.DataFrame:
    """Estimate return-to-play curves by injury type.

    Args:
        injury_data: Historical injury records with duration and censoring info.
        injury_type: Type of injury to filter (e.g., 'hamstring', 'ankle').

    Returns:
        DataFrame with time points and survival probabilities.
    """
    subset = injury_data[injury_data['injury_type'] == injury_type].copy()
    subset = subset.sort_values('days_out')

    # Kaplan-Meier estimation: step through distinct event times,
    # treating records with returned == 0 as censored
    n_at_risk = len(subset)
    survival_prob = 1.0
    results = [{'time': 0, 'survival': 1.0}]

    for days_out, group in subset.groupby('days_out'):
        n_events = int((group['returned'] == 1).sum())
        if n_events > 0 and n_at_risk > 0:
            survival_prob *= 1 - n_events / n_at_risk
            results.append({
                'time': days_out,
                'survival': survival_prob
            })
        # Both returns and censored records leave the risk set
        n_at_risk -= len(group)

    return pd.DataFrame(results)

29.4.5 Daily Dashboard and Alert System

The injury prevention system produces a daily dashboard with traffic-light risk indicators:

Risk Level | ACWR Range | Color | Action
Low | 0.8 - 1.3 | Green | Normal training
Moderate | 1.3 - 1.5 or 0.5 - 0.8 | Amber | Monitor closely
High | > 1.5 or < 0.5 | Red | Reduce load / rest day

Medical Staff Callout: The injury risk model should be viewed as a decision support tool, not a replacement for clinical judgment. The daily risk scores provide one input among many---including subjective wellness questionnaires, sleep quality, and clinical assessment---that the medical team uses to make training load decisions.
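
The traffic-light thresholds in the table translate directly into a small classification helper, sketched here:

def classify_daily_risk(acwr: float) -> str:
    """Map an ACWR value to the traffic-light levels defined above."""
    if acwr > 1.5 or acwr < 0.5:
        return 'red'
    if acwr > 1.3 or acwr < 0.8:
        return 'amber'
    return 'green'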

Common Mistake --- Injury Models: A frequent error is treating injury prediction as a standard classification problem and optimizing for accuracy. Because injuries are rare events (base rate of 2-5% per player-week), a model that always predicts "no injury" achieves very high accuracy. Instead, optimize for sensitivity (recall) at a fixed false positive rate, and use the model's probability outputs rather than binary predictions. The goal is not to predict exactly when injuries will occur but to identify periods of elevated risk that warrant closer monitoring.
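
One way to operationalize "sensitivity at a fixed false positive rate" is sketched below using scikit-learn's ROC utilities; the 20% FPR budget is an illustrative choice.

import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fixed_fpr(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    max_fpr: float = 0.20
) -> dict:
    """Find the best-recall operating point within an FPR budget."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    valid = np.where(fpr <= max_fpr)[0]
    best = valid[-1]  # TPR is non-decreasing along the ROC curve
    return {
        'threshold': float(thresholds[best]),
        'sensitivity': float(tpr[best]),
        'fpr': float(fpr[best]),
    }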


29.5 Case Study: Match Preparation Report

29.5.1 Project Overview

Before every match, the analytics department produces a structured opponent analysis report for the coaching staff. This case study automates the production of these reports, covering opponent tendencies, key threats, set-piece patterns, and recommended tactical adjustments.

Objective: Build an automated system that generates a comprehensive match preparation report from event data, requiring minimal manual intervention.

29.5.2 Pre-Match Preparation Workflow

The match preparation process begins well before the automated system runs. The full workflow includes:

  1. Data Collection (Matchday -5): Gather event data from the opponent's last 5-10 matches. Include data from different competition contexts (home, away, vs. top teams, vs. bottom teams).
  2. Video Tagging (Matchday -4): The video analyst tags key sequences from the opponent's recent matches. These clips will be linked to the statistical findings in the report.
  3. Automated Analysis (Matchday -3): Run the automated pipeline to generate the statistical report.
  4. Manual Review (Matchday -2): The lead analyst reviews the automated output, adds context, and highlights the most important findings.
  5. Coaching Presentation (Matchday -1): Present the report to the coaching staff, typically in a 20-30 minute meeting supported by video clips and visualizations.
  6. Player Briefing (Matchday): A simplified version of the key findings is communicated to players, often through short video sessions and pitch diagrams.

29.5.3 Report Architecture

The automated report system follows a modular pipeline:

Event Data --> Opponent Profile Module --> Set Piece Module --> Key Player Module
                        |                       |                       |
                        v                       v                       v
                  Formation &            Corner/FK              Threat
                  Build-up              Patterns              Assessment
                        |                       |                       |
                        +----------+------------+-----------+-----------+
                                   |                        |
                                   v                        v
                           Report Assembly           Visualization
                                   |
                                   v
                            PDF/HTML Output

29.5.4 Opponent Build-Up Analysis

def analyze_buildup_patterns(
    events: pd.DataFrame,
    opponent_id: str,
    n_recent_matches: int = 5
) -> dict:
    """Analyze opponent's build-up play tendencies.

    Args:
        events: Event data from opponent's recent matches.
        opponent_id: Team ID of the opponent.
        n_recent_matches: Number of recent matches to analyze.

    Returns:
        Dictionary of build-up pattern statistics.
    """
    recent_matches = (
        events[events['team_id'] == opponent_id]
        ['match_id'].unique()[-n_recent_matches:]
    )
    opp_events = events[
        (events['team_id'] == opponent_id) &
        (events['match_id'].isin(recent_matches))
    ]

    passes = opp_events[opp_events['type'] == 'pass']
    n_passes = max(len(passes), 1)  # guard against division by zero

    analysis = {
        # Pitch thirds assume a StatsBomb-style 80-unit-wide pitch
        # (left/centre/right boundaries at roughly y = 26.7 and y = 53.3).
        'buildup_side': {
            'left': passes[passes['y'] < 27].shape[0] / n_passes,
            'center': passes[passes['y'].between(27, 53)].shape[0] / n_passes,
            'right': passes[passes['y'] > 53].shape[0] / n_passes,
        },
        # Passes longer than 30 pitch units treated as long balls.
        'long_ball_pct': (
            passes[passes['pass_length'] > 30].shape[0] / n_passes
        ),
        'avg_pass_length': passes['pass_length'].mean(),
        # Progressive = moves the ball at least 10 units toward goal.
        'progressive_pass_rate': (
            passes[passes['end_x'] - passes['x'] > 10].shape[0] / n_passes
        ),
        'buildup_speed': _classify_buildup_speed(passes),
        'pass_completion_pct': _compute_pass_completion(opp_events),
    }

    return analysis


def _classify_buildup_speed(passes: pd.DataFrame) -> str:
    """Classify build-up speed based on pass tempo."""
    avg_sequence_length = passes.groupby('possession_id').size().mean()
    if avg_sequence_length > 6:
        return 'patient'
    elif avg_sequence_length > 3:
        return 'balanced'
    else:
        return 'direct'


def _compute_pass_completion(events: pd.DataFrame) -> float:
    """Compute pass completion rate.

    Note: with only the opponent's events available, true possession
    share cannot be computed; completion rate serves as a crude proxy
    for control of the ball.
    """
    total_passes = events[events['type'] == 'pass'].shape[0]
    successful_passes = events[
        (events['type'] == 'pass') & (events['outcome'] == 'successful')
    ].shape[0]
    return successful_passes / max(total_passes, 1)
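
A typical invocation looks like the following; the team ID is illustrative, and the events DataFrame is assumed to contain the columns referenced above (team_id, match_id, type, y, x, end_x, pass_length, possession_id, outcome):

profile = analyze_buildup_patterns(events, opponent_id='team_042')  # hypothetical ID
print(f"Build-up tempo: {profile['buildup_speed']}")
print(f"Long-ball share: {profile['long_ball_pct']:.1%}")
favoured = max(profile['buildup_side'], key=profile['buildup_side'].get)
print(f"Favoured build-up side: {favoured}")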

29.5.5 Set-Piece Analysis

Set pieces account for approximately 25-30% of all goals in professional soccer. A thorough set-piece analysis is critical for match preparation.

def analyze_set_pieces(
    events: pd.DataFrame,
    opponent_id: str,
    n_matches: int = 10
) -> dict:
    """Analyze opponent's set-piece tendencies and threats.

    Args:
        events: Event data from opponent's recent matches.
        opponent_id: Team ID of the opponent.
        n_matches: Number of recent matches to analyze.

    Returns:
        Dictionary of set-piece patterns and statistics.
    """
    set_piece_types = ['corner', 'free_kick', 'throw_in']

    # Restrict to the opponent's n_matches most recent matches.
    recent_matches = (
        events[events['team_id'] == opponent_id]
        ['match_id'].unique()[-n_matches:]
    )
    recent = events[
        (events['team_id'] == opponent_id) &
        (events['match_id'].isin(recent_matches)) &
        (events['play_pattern'].isin(set_piece_types))
    ]

    analysis = {}
    for sp_type in set_piece_types:
        subset = recent[recent['play_pattern'] == sp_type]
        if len(subset) == 0:
            continue

        analysis[sp_type] = {
            'total_count': len(subset),
            'shots_generated': subset[subset['type'] == 'shot'].shape[0],
            'goals_scored': subset[
                (subset['type'] == 'shot') &
                (subset['outcome'] == 'Goal')
            ].shape[0],
            'xg_generated': subset[
                subset['type'] == 'shot'
            ]['xg'].sum() if 'xg' in subset.columns else None,
        }

    # Corner kick delivery analysis
    corners = recent[recent['play_pattern'] == 'corner']
    if len(corners) > 0:
        analysis['corner_delivery'] = {
            'inswing_pct': (
                corners[corners['corner_type'] == 'inswing'].shape[0]
                / len(corners)
            ) if 'corner_type' in corners.columns else None,
            'short_corner_pct': (
                corners[corners['pass_length'] < 10].shape[0] / len(corners)
            ),
            'near_post_pct': _estimate_near_post_delivery(corners),
        }

    return analysis


def _estimate_near_post_delivery(corners: pd.DataFrame) -> float:
    """Estimate the share of corners delivered to the near post.

    Crude heuristic: a fixed end_y cut-off ignores which side the
    corner is taken from; a production version should condition on
    corner side before classifying deliveries.
    """
    if 'end_y' not in corners.columns:
        return 0.0
    near_post = corners[corners['end_y'] < 35].shape[0]
    return near_post / max(len(corners), 1)

29.5.6 Key Player Threat Assessment

def assess_key_threats(
    events: pd.DataFrame,
    opponent_id: str,
    n_matches: int = 10,
    top_n: int = 3
) -> list[dict]:
    """Identify and profile the opponent's most dangerous players.

    Args:
        events: Event data from opponent's recent matches.
        opponent_id: Team ID of the opponent.
        n_matches: Number of recent matches to analyze.
        top_n: Number of key threats to return.

    Returns:
        List of dictionaries with player threat profiles.
    """
    # Restrict to the opponent's n_matches most recent matches.
    recent_matches = (
        events[events['team_id'] == opponent_id]
        ['match_id'].unique()[-n_matches:]
    )
    opp_events = events[
        (events['team_id'] == opponent_id) &
        (events['match_id'].isin(recent_matches))
    ]

    # Compute threat metrics per player
    player_threats = []
    for player_id in opp_events['player_id'].unique():
        player_events = opp_events[opp_events['player_id'] == player_id]

        shots = player_events[player_events['type'] == 'shot']
        passes = player_events[player_events['type'] == 'pass']
        carries = player_events[player_events['type'] == 'carry']

        xg_total = shots['xg'].sum() if 'xg' in shots.columns else 0
        xa_total = passes['xa'].sum() if 'xa' in passes.columns else 0

        player_threats.append({
            'player_id': player_id,
            'player_name': player_events['player_name'].iloc[0],
            'xg': xg_total,
            'xa': xa_total,
            'threat_score': xg_total + xa_total,
            'shots': len(shots),
            'key_passes': (
                int(passes['key_pass'].eq(True).sum())
                if 'key_pass' in passes.columns else 0
            ),
            'progressive_carries': carries[
                carries['end_x'] - carries['x'] > 10
            ].shape[0] if len(carries) > 0 and 'end_x' in carries.columns else 0,
        })

    threats_df = pd.DataFrame(player_threats)
    if threats_df.empty:
        return []
    return threats_df.nlargest(top_n, 'threat_score').to_dict('records')

29.5.7 In-Game Tracking and Post-Match Analysis

The match preparation report is not the end of the analyst's workflow for a given match. The complete cycle includes in-game tracking and post-match review.

In-Game Tracking:

During the match, the analyst monitors live data feeds and flags significant deviations from the pre-match analysis:

  • Is the opponent playing the formation we expected?
  • Are they pressing higher or lower than their recent average?
  • Is their key threat player receiving the ball in the zones we identified?
  • Are our set-piece defensive assignments working?

These observations are communicated to the coaching staff via a standardized messaging system (often a tablet app) that the assistant coaches monitor.
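
As a minimal sketch of one such live check, the function below compares the in-match long-ball share against the pre-match baseline from analyze_buildup_patterns; the sample-size floor and tolerance are illustrative, not calibrated values:

def flag_buildup_deviation(
    live_events: pd.DataFrame,
    prematch_baseline: dict,
    tolerance: float = 0.10
) -> list[str]:
    """Flag live deviations from the pre-match build-up profile.

    Args:
        live_events: Events collected so far in the current match.
        prematch_baseline: Output of analyze_buildup_patterns.
        tolerance: Absolute deviation that triggers a flag (illustrative).

    Returns:
        Human-readable alert strings for the bench.
    """
    alerts = []
    passes = live_events[live_events['type'] == 'pass']
    if len(passes) < 30:  # too few passes for a stable estimate
        return alerts

    live_long_ball = (passes['pass_length'] > 30).mean()
    baseline = prematch_baseline['long_ball_pct']
    if abs(live_long_ball - baseline) > tolerance:
        direction = 'more direct' if live_long_ball > baseline else 'more patient'
        alerts.append(
            f"Opponent is {direction} than expected: "
            f"long-ball share {live_long_ball:.0%} vs {baseline:.0%} pre-match"
        )
    return alerts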

Post-Match Analysis:

After the match, the analyst produces a post-match report that evaluates:

  1. How accurately the pre-match analysis predicted the opponent's approach
  2. Which tactical adjustments were made and their impact
  3. Key moments that were influenced by the pre-match preparation
  4. Lessons for future match preparation

Coaching Staff Callout: The most effective match preparation reports are concise and action-oriented. Coaches do not need to see every statistic---they need clear, prioritized insights that translate directly into training ground work. Limit the report to 3-4 pages with clear visual summaries.

29.5.8 Visualization Production

Effective match preparation requires clear, intuitive visualizations. The standard visualization package includes:

  1. Opponent formation map with average player positions
  2. Build-up heatmaps showing where the opponent progresses the ball
  3. Pressing trigger zones highlighting where the opponent is vulnerable
  4. Set-piece diagrams with common routines
  5. Key player action maps showing where threats operate

Practical Tip --- Visualization for Coaches: When creating visualizations for coaching staff, follow three rules: (a) use the real pitch as the canvas---coaches think in terms of pitch zones, not abstract charts; (b) limit each visualization to one main message; (c) annotate directly on the visualization rather than relying on a legend. A pitch map showing "Opponent's LW receives 73% of crosses here" is more useful than a bar chart of cross distribution.
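
To illustrate rule (c), the sketch below draws a simplified pitch with plain matplotlib and places a single annotated message on it; a dedicated library such as mplsoccer would normally supply the pitch outline, and the coordinates here are illustrative:

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def annotated_pitch_message(message: str, zone: tuple, output_path: str) -> None:
    """Draw a minimal 120x80 pitch and annotate one highlighted zone.

    Args:
        message: The single message the visual should carry.
        zone: (x, y, width, height) of the highlighted zone in pitch units.
        output_path: Where to save the figure.
    """
    fig, ax = plt.subplots(figsize=(10, 7))
    ax.add_patch(Rectangle((0, 0), 120, 80, fill=False, lw=2))   # pitch outline
    ax.plot([60, 60], [0, 80], lw=1, color='grey')               # halfway line
    x, y, w, h = zone
    ax.add_patch(Rectangle((x, y), w, h, alpha=0.3, color='red'))
    ax.annotate(message, xy=(x + w / 2, y + h), xytext=(x + w / 2, y + h + 8),
                ha='center', arrowprops=dict(arrowstyle='->'))
    ax.set_xlim(-5, 125)
    ax.set_ylim(-5, 95)
    ax.axis('off')
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

# e.g. annotated_pitch_message("LW receives 73% of crosses here", (90, 55, 20, 20), 'zone.png')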

29.5.9 Report Generation

The final report is assembled into a structured format:

def generate_match_report(
    events: pd.DataFrame,
    opponent_id: str,
    our_team_id: str,
    output_path: str
) -> str:
    """Generate a complete match preparation report.

    Args:
        events: Full event dataset.
        opponent_id: Opponent team ID.
        our_team_id: Our team ID.
        output_path: Path for the output report file.

    Returns:
        Path to the generated report.
    """
    buildup = analyze_buildup_patterns(events, opponent_id)
    set_pieces = analyze_set_pieces(events, opponent_id)
    threats = assess_key_threats(events, opponent_id)

    report_sections = [
        _format_header(opponent_id),
        _format_buildup_section(buildup),
        _format_set_piece_section(set_pieces),
        _format_threat_section(threats),
        _format_recommendations(buildup, set_pieces, threats),
    ]

    report_text = '\n\n'.join(report_sections)

    with open(output_path, 'w') as f:
        f.write(report_text)

    return output_path
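
The _format_* helpers are assumed rather than shown; as one possible shape, here is a sketch of _format_buildup_section, rendering the dictionary from analyze_buildup_patterns as report-ready text:

def _format_buildup_section(buildup: dict) -> str:
    """Render the build-up analysis as a plain-text report section (sketch)."""
    side = max(buildup['buildup_side'], key=buildup['buildup_side'].get)
    lines = [
        "OPPONENT BUILD-UP",
        f"- Tempo: {buildup['buildup_speed']}",
        f"- Favoured side: {side} "
        f"({buildup['buildup_side'][side]:.0%} of passes)",
        f"- Long-ball share: {buildup['long_ball_pct']:.0%}",
        f"- Progressive pass rate: {buildup['progressive_pass_rate']:.0%}",
    ]
    return '\n'.join(lines)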

29.6 Case Study: Player Development Tracking

29.6.1 Project Overview

Tracking player development is essential for academies and first-team environments alike. This case study builds a longitudinal tracking system that measures player progression across technical, physical, tactical, and mental dimensions.

Objective: Create a player development dashboard that visualizes progression over time, benchmarks against age-group peers, and projects future trajectories.

29.6.2 Development Metrics Framework

Player development is tracked across four pillars:

| Pillar    | Metrics                                                                | Update Frequency |
|-----------|------------------------------------------------------------------------|------------------|
| Technical | Pass completion, dribble success, first touch under pressure          | Weekly           |
| Physical  | Sprint speed, distance covered, high-intensity efforts                | Per session      |
| Tactical  | Positional accuracy, pressing trigger response, defensive positioning | Monthly          |
| Mental    | Decision-making under pressure, game management, leadership indices   | Quarterly        |

29.6.3 Percentile Ranking Against Peers

from scipy import stats as scipy_stats

def compute_percentile_ranks(
    player_metrics: pd.DataFrame,
    peer_metrics: pd.DataFrame,
    metric_cols: list[str]
) -> pd.DataFrame:
    """Compute percentile ranks for a player against their peer group.

    Args:
        player_metrics: Single-row DataFrame with the player's current metrics.
        peer_metrics: DataFrame with metrics for the peer comparison group.
        metric_cols: List of metric columns to rank.

    Returns:
        DataFrame with percentile ranks for each metric.
    """
    percentiles = {}

    for col in metric_cols:
        player_val = player_metrics[col].values[0]
        peer_vals = peer_metrics[col].dropna().values

        percentile = scipy_stats.percentileofscore(peer_vals, player_val)
        percentiles[col] = round(percentile, 1)

    return pd.DataFrame([percentiles])
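
For example, to benchmark a single academy midfielder against an age-group peer sample (variable names illustrative):

metric_cols = ['pass_completion', 'dribble_success', 'sprint_speed']
ranks = compute_percentile_ranks(
    player_metrics=current_metrics,   # single row for the player
    peer_metrics=u19_midfielders,     # peer group, e.g. U19 midfielders
    metric_cols=metric_cols,
)
print(ranks.iloc[0].to_dict())  # e.g. {'pass_completion': 82.5, ...}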

29.6.4 Growth Curve Modeling

We model player development trajectories using polynomial regression to project future performance levels:

$$ y(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \epsilon $$

where $y(t)$ is the metric value at time $t$ (measured in months since the player's academy entry).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_growth_curve(
    historical: pd.DataFrame,
    metric_col: str,
    time_col: str = 'months_since_start',
    degree: int = 2,
    forecast_months: int = 12
) -> dict:
    """Fit a growth curve and project future development.

    Args:
        historical: Historical metric observations.
        metric_col: Column name of the metric to model.
        time_col: Column with time values.
        degree: Polynomial degree for the growth curve.
        forecast_months: Number of months to forecast.

    Returns:
        Dictionary with model coefficients, fitted values, and forecasts.
    """
    X = historical[time_col].values.reshape(-1, 1)
    y = historical[metric_col].values

    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    model = LinearRegression()
    model.fit(X_poly, y)

    # Fitted values
    y_fitted = model.predict(X_poly)

    # Forecast
    max_time = X.max()
    future_times = np.arange(
        max_time + 1, max_time + forecast_months + 1
    ).reshape(-1, 1)
    future_poly = poly.transform(future_times)
    y_forecast = model.predict(future_poly)

    return {
        'coefficients': model.coef_.tolist(),
        'intercept': model.intercept_,
        'r_squared': model.score(X_poly, y),
        'fitted_values': y_fitted.tolist(),
        'forecast_times': future_times.flatten().tolist(),
        'forecast_values': y_forecast.tolist(),
    }
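
A quick usage sketch on synthetic data follows; note that polynomial curves can extrapolate poorly, so treat forecasts beyond a few months with caution:

import numpy as np
import pandas as pd

# Toy history: 24 monthly observations of pass completion drifting upward.
rng = np.random.default_rng(0)
history = pd.DataFrame({
    'months_since_start': np.arange(24),
    'pass_completion': 0.70 + 0.004 * np.arange(24) + rng.normal(0, 0.01, 24),
})

curve = fit_growth_curve(history, metric_col='pass_completion')
print(f"R^2: {curve['r_squared']:.3f}")
print(f"12-month forecast endpoint: {curve['forecast_values'][-1]:.3f}")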

29.6.5 Radar Chart Visualization

Radar charts (also called spider charts) provide an intuitive visual summary of a player's multi-dimensional profile.

import matplotlib.pyplot as plt
import numpy as np

def create_radar_chart(
    player_percentiles: dict[str, float],
    player_name: str,
    output_path: str
) -> None:
    """Create a radar chart showing a player's percentile profile.

    Args:
        player_percentiles: Dictionary mapping metric names to percentile values.
        player_name: Name for the chart title.
        output_path: Path to save the output figure.
    """
    categories = list(player_percentiles.keys())
    values = list(player_percentiles.values())

    # Close the polygon
    values += values[:1]
    N = len(categories)

    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    ax.plot(angles, values, 'o-', linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories, size=10)
    ax.set_ylim(0, 100)
    ax.set_title(f'{player_name} - Development Profile', size=14, pad=20)

    plt.tight_layout()
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()
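
The percentile output from Section 29.6.3 feeds directly into the radar chart (variable and file names illustrative):

percentiles = compute_percentile_ranks(current_metrics, u19_midfielders, metric_cols)
create_radar_chart(
    player_percentiles=percentiles.iloc[0].to_dict(),
    player_name='Player 07',
    output_path='player_07_radar.png',
)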

29.6.6 Development Report Generation

The player development system produces periodic reports that combine quantitative metrics with qualitative coaching assessments:

def generate_development_report(
    player_id: str,
    current_metrics: pd.DataFrame,
    historical_metrics: pd.DataFrame,
    peer_metrics: pd.DataFrame,
    metric_cols: list[str]
) -> dict:
    """Generate a comprehensive player development report.

    Args:
        player_id: Unique player identifier.
        current_metrics: Current period metric values.
        historical_metrics: All historical metric observations.
        peer_metrics: Peer group metrics for benchmarking.
        metric_cols: List of metrics to include.

    Returns:
        Dictionary with all report components.
    """
    # Percentile rankings
    percentiles = compute_percentile_ranks(
        current_metrics, peer_metrics, metric_cols
    )

    # Growth curves for each metric
    growth_curves = {}
    for col in metric_cols:
        if col in historical_metrics.columns:
            growth_curves[col] = fit_growth_curve(
                historical_metrics, col
            )

    # Identify strengths and areas for development
    pct_dict = percentiles.iloc[0].to_dict()
    strengths = [k for k, v in pct_dict.items() if v >= 75]
    development_areas = [k for k, v in pct_dict.items() if v < 40]

    report = {
        'player_id': player_id,
        'percentile_ranks': pct_dict,
        'growth_curves': growth_curves,
        'strengths': strengths,
        'development_areas': development_areas,
        'overall_development_index': np.mean(list(pct_dict.values())),
    }

    return report

29.6.7 Longitudinal Trend Analysis

Tracking development over time requires careful attention to measurement frequency, seasonal effects, and position-specific benchmarks.

Academy Callout: Player development is non-linear. Periods of apparent stagnation are normal and often precede significant breakthroughs. The analytics system should flag concerning trends (sustained decline over 3+ months) without over-reacting to short-term fluctuations.
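
A minimal sketch of such a flag, assuming monthly observations: fit a slope over the most recent three months and alert only when every observation in the window declined, guarding against one-off dips. The window and threshold are illustrative:

import numpy as np
import pandas as pd

def flag_sustained_decline(
    history: pd.DataFrame,
    metric_col: str,
    time_col: str = 'months_since_start',
    window: int = 3,
    slope_threshold: float = 0.0
) -> bool:
    """Flag a metric whose trend has been negative for a full window.

    Returns True only if the least-squares slope over the most recent
    `window` observations is below `slope_threshold` AND each successive
    observation in the window declined -- guarding against one-off dips.
    """
    recent = history.sort_values(time_col).tail(window)
    if len(recent) < window:
        return False
    slope = np.polyfit(recent[time_col], recent[metric_col], deg=1)[0]
    monotone_decline = recent[metric_col].diff().dropna().lt(0).all()
    return slope < slope_threshold and monotone_decline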

Lesson Learned --- Player Development: The most successful player development tracking systems are those that are embedded in the coaching culture, not imposed from outside. Academy coaches need to see the development reports as tools that support their work, not as surveillance mechanisms. The best approach is to involve coaches in defining the metrics, present results in collaborative review meetings (not as top-down evaluations), and always pair quantitative data with opportunities for coaches to add their qualitative assessments. A development report that says "passing completion improved from 40th to 65th percentile over 6 months" is good. One that adds "coach notes: consistently making better decisions about when to play forward vs. recycle; has responded well to individual tactical sessions on Thursday mornings" is excellent.


29.7 Cross-Case-Study Integration

29.7.1 How the Case Studies Connect

The six case studies in this chapter are not isolated projects. In a professional club, they form an interconnected system:

  • The xG model (29.1) feeds predicted values into the match preparation report (29.5), where opponent shot quality is assessed, and into the scouting campaign (29.2), where a candidate's finishing ability is evaluated against the model.
  • The scouting campaign (29.2) uses insights from the tactical analysis (29.3) to understand what kind of player the team's system requires, and from the injury prevention program (29.4) to assess candidates' injury risk profiles.
  • The tactical analysis (29.3) informs the match preparation report (29.5) by providing baseline metrics for how the team typically plays, enabling the analyst to identify how the opponent's approach differs.
  • The injury prevention program (29.4) interacts with the player development tracking (29.6) because managing a young player's workload is essential for long-term development.
  • The player development tracking (29.6) feeds back into the scouting campaign (29.2) by helping the club understand which internal academy players might be ready to fill a squad need, potentially avoiding a transfer altogether.

29.7.2 Common Patterns Across Case Studies

Several patterns recur across all six case studies:

  1. Data Quality First: Every case study begins with data cleaning and validation. This is not optional scaffolding---it is the foundation on which everything else rests.
  2. Feature Engineering Requires Domain Knowledge: The most impactful features in every model are those informed by deep understanding of soccer, not by automated feature selection.
  3. Communication Is the Deliverable: The output of every case study is not a model or a database but a communication to a human decision-maker (coach, sporting director, medical staff, academy director).
  4. Iteration and Feedback: Every system improves through feedback loops. The xG model is retrained when performance degrades. The scouting criteria are refined after each transfer window. The match preparation system is updated based on coaching staff feedback.
  5. Uncertainty Quantification: Every case study acknowledges uncertainty. xG predictions have confidence intervals. Scouting scores have sensitivity to weight changes. Injury risk scores are probabilities, not certainties.

29.7.3 Tools and Data Sources Summary

| Tool / Library         | Case Studies Used      | Purpose                                      |
|------------------------|------------------------|----------------------------------------------|
| pandas / numpy         | All                    | Data manipulation and numerical computation  |
| scikit-learn           | 29.1, 29.2, 29.4, 29.6 | Model training, clustering, preprocessing    |
| NetworkX               | 29.3                   | Passing network construction and analysis    |
| matplotlib             | 29.3, 29.5, 29.6       | Visualization and chart production           |
| scipy                  | 29.3, 29.6             | Statistical tests and percentile computation |
| optuna                 | 29.1                   | Hyperparameter optimization                  |
| joblib                 | 29.1                   | Model serialization for deployment           |
| StatsBomb / Wyscout    | 29.1, 29.3, 29.5       | Event data source                            |
| GPS/tracking providers | 29.3, 29.4             | Positional and physical load data            |
| Transfermarkt / CIES   | 29.2                   | Market value and contract data               |

Practical Tip --- Reproducing These Analyses: All six case studies can be partially reproduced using freely available data. StatsBomb open data provides event data for the xG pipeline, tactical analysis, and match preparation case studies. For scouting, FBref provides aggregated statistics that can substitute for commercial event data. For injury prevention and player development, you will need to simulate or synthesize data, but the code patterns and analytical frameworks remain the same. The key is to focus on learning the workflow rather than achieving exact numerical results.
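
As a concrete starting point, the sketch below pulls open event data with the statsbombpy package (assuming it is installed); the competition and season IDs shown should be verified against the output of sb.competitions():

from statsbombpy import sb

# List available open competitions, then pull one match's events.
comps = sb.competitions()
matches = sb.matches(competition_id=43, season_id=3)   # IDs: verify in comps
events = sb.events(match_id=matches['match_id'].iloc[0])
print(events['type'].value_counts().head())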


Summary

This chapter presented six comprehensive case studies that integrate techniques from across the textbook:

  1. xG Pipeline (Section 29.1): Demonstrated the full lifecycle of a predictive model, from data cleaning through deployment, emphasizing calibration, monitoring, and interpretability.

  2. Scouting Campaign (Section 29.2): Showed how multi-criteria decision analysis, similarity metrics, clustering, league adjustments, squad fit analysis, and financial modeling combine to support evidence-based recruitment.

  3. Tactical Analysis (Section 29.3): Applied network analysis, pressing metrics, and expected points models to evaluate a team's season-long tactical approach, with emphasis on narrative construction.

  4. Injury Prevention (Section 29.4): Built a workload monitoring system using the acute:chronic workload ratio and multi-factor risk models, with survival analysis for return-to-play estimation.

  5. Match Preparation (Section 29.5): Automated the production of opponent analysis reports, covering build-up patterns, set pieces, key player threats, and the full pre-match to post-match workflow.

  6. Player Development (Section 29.6): Created a longitudinal tracking system with percentile benchmarking, growth curve modeling, and radar chart visualization, emphasizing the importance of coaching culture integration.

Each case study followed a common pattern: define the objective, collect and clean data, engineer features, build models, and communicate results to stakeholders. This pattern---the analytics workflow---is the foundation of professional soccer analytics.

The techniques in these case studies are not merely academic exercises. They represent the daily work of analytics departments at clubs across the world. The key to success lies not in any single technique but in the ability to combine them thoughtfully, communicate clearly, and maintain a relentless focus on decisions that improve performance on the pitch.

Final Callout --- The Analytics Workflow: If there is one takeaway from this capstone chapter, it is this: the analytics workflow is more important than any individual technique. A well-structured workflow---with clear data validation, thoughtful feature engineering, appropriate model selection, honest evaluation, and effective communication---will produce valuable insights regardless of the specific tools used. Master the workflow, and you can adapt to any new technique or technology that emerges.


References

  1. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1-3.
  2. Gabbett, T. J. (2016). The training-injury prevention paradox. British Journal of Sports Medicine, 50(5), 273-280.
  3. Rathke, A. (2017). An examination of expected goals and shot efficiency in soccer. Journal of Human Sport and Exercise, 12(2), 514-529.
  4. Pena, J. L., & Touchette, H. (2012). A network theory analysis of football strategies. arXiv preprint arXiv:1206.6904.
  5. Caley, M. (2015). Premier League projections and new expected goals. Cartilage Free Captain Blog.
  6. Impect. (2019). Packing: A new way of analyzing football. Impect GmbH Technical Report.
  7. Fernandez-Navarro, J., et al. (2016). Evaluating the effectiveness of styles of play in elite soccer. Journal of Sports Sciences, 34(16), 1545-1552.
  8. Pappalardo, L., et al. (2019). A public data set of spatio-temporal match events in soccer competitions. Scientific Data, 6(1), 236.
  9. Decroos, T., et al. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. KDD 2019.
  10. Fernandez, J., & Bornn, L. (2018). Wide Open Spaces: A statistical technique for measuring space creation in professional soccer. MIT Sloan Sports Analytics Conference.