Case Study 1: Evaluating Premier League Midfielders — A Multi-Dimensional Approach

Background

A mid-table Premier League club is looking to strengthen their central midfield during the January transfer window. The coaching staff wants a box-to-box midfielder who can contribute both defensively and offensively. The analytics department has been asked to produce a shortlist of the top 5 candidates from within the Premier League, supported by data.

This case study walks through the complete analytical process from data preparation to final recommendation.

Objectives

  1. Define a position-specific metric set for central midfielders
  2. Filter and normalize the data using per-90 rates and minimum-minute thresholds
  3. Standardize metrics using z-scores within the central midfielder peer group
  4. Build radar chart profiles for top candidates
  5. Construct a composite performance index with role-appropriate weights
  6. Present findings in a format that bridges analytics and coaching

The Data

We work with a simulated dataset based on realistic statistical distributions drawn from the 2023-24 Premier League season. The dataset contains 65 players in the central midfielder peer group (central, defensive, and box-to-box midfielders), each of whom played at least 900 minutes.

Key fields include: player name, age, team, minutes played, and 15 per-90 statistical categories.

Step 1: Metric Selection

For a box-to-box midfielder, we select eight metrics spanning four dimensions:

Offensive contribution:

  • Non-penalty expected goals (npxG) per 90
  • Expected assists (xA) per 90

Ball progression:

  • Progressive passes per 90
  • Progressive carries per 90

Defensive contribution:

  • Tackles won per 90
  • Interceptions per 90

Possession quality:

  • Pass completion percentage
  • Pressure success rate (%)

These eight metrics capture the dual-ended nature of the box-to-box role: a player who both disrupts opposition build-up and drives their own team's attacks.
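One way to make this metric set explicit in code is a mapping from dimension to column names. This is only a sketch: the column names are hypothetical, and the case-study dataset may label its fields differently.

```python
# Hypothetical column names; adjust to match the actual dataset.
B2B_METRICS = {
    "Offensive contribution": ["npxg_p90", "xa_p90"],
    "Ball progression": ["prog_passes_p90", "prog_carries_p90"],
    "Defensive contribution": ["tackles_won_p90", "interceptions_p90"],
    "Possession quality": ["pass_completion_pct", "pressure_success_pct"],
}

# Flat list of all eight metrics, in dimension order.
ALL_METRICS = [m for metrics in B2B_METRICS.values() for m in metrics]
print(len(B2B_METRICS), "dimensions,", len(ALL_METRICS), "metrics")
```

Keeping the grouping in one place means later steps (standardization, weighting, radar axes) can all iterate over the same definition.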

Step 2: Data Preparation and Normalization

All raw counts are converted to per-90 rates (count divided by minutes played, multiplied by 90). We apply a minimum threshold of 900 minutes (approximately 10 full matches) to exclude players whose rates rest on small, unreliable samples.

# Filter: central midfielders with at least 900 minutes
cm_data = df[
    (df["position"].isin(["CM", "DM", "CM/DM"])) &
    (df["minutes"] >= 900)
].copy()
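The per-90 conversion itself can be sketched as follows. The numbers and column names here are made up for illustration and are not taken from the case-study dataset.

```python
import pandas as pd

# Illustrative raw season totals (made-up numbers).
raw = pd.DataFrame({
    "player": ["Example A", "Example B"],
    "minutes": [1800, 450],
    "tackles_won": [40, 12],
    "interceptions": [30, 5],
})

count_cols = ["tackles_won", "interceptions"]

# per-90 rate = raw count / minutes played * 90
per90 = raw[count_cols].div(raw["minutes"], axis=0) * 90
per90.columns = [f"{c}_p90" for c in count_cols]
out = pd.concat([raw[["player", "minutes"]], per90], axis=1)

# Keep only players who clear the 900-minute threshold.
reliable = out[out["minutes"] >= 900]
print(reliable)
```

Example B has the higher tackle rate (2.4 vs. 2.0 per 90) but only 450 minutes, which is exactly the kind of small-sample rate the threshold is meant to exclude.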

Step 3: Z-Score Standardization

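Each metric is converted to a z-score relative to the central midfielder peer group, where $x_{ij}$ is player $i$'s value on metric $j$, and $\bar{x}_j$ and $\sigma_j$ are the peer-group mean and standard deviation of that metric: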

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} $$

This allows us to compare metrics measured on different scales. A z-score of +1.0 means the player is one standard deviation above the group average for that metric.
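A minimal sketch of this standardization in pandas is shown below. The helper name `zscore_columns`, the column names, and the sample values are illustrative. The text does not say whether the population or sample standard deviation is used, so `ddof=0` is an assumption.

```python
import pandas as pd

def zscore_columns(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return z-scores of `cols`, computed within the rows of `df` (the peer group)."""
    # ddof=0 is the population standard deviation (an assumption; pandas defaults to ddof=1).
    return (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Illustrative values on two very different scales (made up).
peers = pd.DataFrame({
    "tackles_won_p90": [1.0, 2.0, 3.0],
    "pass_completion_pct": [80.0, 85.0, 90.0],
})
z = zscore_columns(peers, list(peers.columns))
print(z.round(3))
```

Both columns end up with identical z-scores even though one is measured in tackles and the other in percentage points, which is the point of the transformation.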

Step 4: Composite Index Construction

We construct a Box-to-Box Composite Index (B2B-CI) using weights that reflect the balanced nature of the role:

| Metric | Weight | Rationale |
|---|---|---|
| npxG per 90 | 0.10 | Scoring threat (secondary for B2B) |
| xA per 90 | 0.12 | Creative contribution |
| Progressive passes per 90 | 0.15 | Ball advancement via passing |
| Progressive carries per 90 | 0.15 | Ball advancement via carrying |
| Tackles won per 90 | 0.15 | Defensive ball-winning |
| Interceptions per 90 | 0.13 | Anticipatory defending |
| Pass completion % | 0.10 | Technical reliability |
| Pressure success rate | 0.10 | Pressing effectiveness |

$$ \text{B2B-CI}_i = \sum_{j=1}^{8} w_j \cdot z_{ij} $$
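The weighted sum could be computed along the following lines. The weights are the ones listed above; the column names and the function name `b2b_ci` are hypothetical, and the input is assumed to be a DataFrame of z-scores with one row per player.

```python
import pandas as pd

# Weights as listed in the text; column names are hypothetical.
WEIGHTS = {
    "npxg_p90": 0.10,
    "xa_p90": 0.12,
    "prog_passes_p90": 0.15,
    "prog_carries_p90": 0.15,
    "tackles_won_p90": 0.15,
    "interceptions_p90": 0.13,
    "pass_completion_pct": 0.10,
    "pressure_success_pct": 0.10,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # the eight weights sum to 1

def b2b_ci(z: pd.DataFrame) -> pd.Series:
    """Weighted sum of z-scores, one value per player (row)."""
    w = pd.Series(WEIGHTS)
    return z[w.index].mul(w, axis=1).sum(axis=1)

# Sanity check with illustrative rows: +1.0 on every metric, and exactly average.
z = pd.DataFrame(
    [{m: 1.0 for m in WEIGHTS}, {m: 0.0 for m in WEIGHTS}],
    index=["all_plus_one", "average"],
)
print(b2b_ci(z))
```

Because the weights sum to 1, a player who is one standard deviation above average on every metric scores 1.0, and a perfectly average player scores 0.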

Step 5: Results

Sorting players by B2B-CI in descending order produces the ranking. The top 5 candidates in our simulated dataset are:

| Rank | Player | Age | Team | B2B-CI | Strongest Dimension |
|---|---|---|---|---|---|
| 1 | Player Alpha | 25 | Team A | 1.82 | Ball progression |
| 2 | Player Beta | 23 | Team B | 1.65 | Defensive contribution |
| 3 | Player Gamma | 27 | Team C | 1.58 | Possession quality |
| 4 | Player Delta | 24 | Team D | 1.41 | Offensive contribution |
| 5 | Player Epsilon | 26 | Team E | 1.35 | Ball progression |
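A shortlist of this shape could be produced roughly as follows. The text does not say how the "Strongest Dimension" label is derived, so this sketch assumes it is the dimension with the highest mean z-score. The column names, player labels, and all numbers are illustrative.

```python
import pandas as pd

# Hypothetical column names grouped by dimension.
DIMENSIONS = {
    "Offensive contribution": ["npxg_p90", "xa_p90"],
    "Ball progression": ["prog_passes_p90", "prog_carries_p90"],
    "Defensive contribution": ["tackles_won_p90", "interceptions_p90"],
    "Possession quality": ["pass_completion_pct", "pressure_success_pct"],
}
ALL_COLS = [c for cols in DIMENSIONS.values() for c in cols]

def strongest_dimension(z_row: pd.Series) -> str:
    """Assumed rule: the dimension with the highest mean z-score."""
    means = {dim: z_row[cols].mean() for dim, cols in DIMENSIONS.items()}
    return max(means, key=means.get)

def shortlist(z: pd.DataFrame, ci: pd.Series, n: int = 5) -> pd.DataFrame:
    """Top-n players by composite index, with their strongest dimension."""
    out = ci.sort_values(ascending=False).head(n).to_frame("b2b_ci")
    out["strongest_dimension"] = z.loc[out.index].apply(strongest_dimension, axis=1)
    return out

# Made-up z-scores and composite values for three example players.
z = pd.DataFrame(
    [[0.2, 0.1, 1.5, 1.3, 0.5, 0.4, 0.3, 0.2],
     [0.0, 0.1, 0.2, 0.3, 1.6, 1.2, 0.1, 0.0],
     [0.1, 0.0, 0.3, 0.2, 0.1, 0.2, 1.4, 1.1]],
    index=["Example 1", "Example 2", "Example 3"],
    columns=ALL_COLS,
)
ci = pd.Series([0.9, 1.2, 0.5], index=z.index)
result = shortlist(z, ci, n=2)
print(result)
```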

Step 6: Radar Chart Analysis

Radar charts for the top 3 candidates reveal distinct profiles:

  • Player Alpha shows a well-rounded profile with particular spikes in progressive passes (+1.8 SD) and progressive carries (+1.6 SD). Their defensive numbers are solid (+0.8 SD for tackles, +0.7 SD for interceptions). This is the archetypal box-to-box profile.

  • Player Beta is the most defensively oriented of the top 3. Tackles (+1.9 SD) and interceptions (+1.5 SD) are exceptional, but offensive numbers are closer to average. At age 23, there is development upside.

  • Player Gamma is the most technically polished, with elite pass completion (+1.7 SD) and pressure success rate (+1.4 SD), but with more modest progression and defensive numbers.
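The geometry behind such a chart is simple enough to sketch. The example below uses only the four z-scores reported above for Player Alpha; the helper names `radar_polygon` and `plot_radar` are hypothetical, and the plotting function assumes matplotlib is installed.

```python
import math

# z-scores reported in the text for Player Alpha (only these four are given).
alpha = {
    "Progressive passes": 1.8,
    "Progressive carries": 1.6,
    "Tackles won": 0.8,
    "Interceptions": 0.7,
}

def radar_polygon(values):
    """Return (angles, values) with the first point repeated to close the polygon."""
    values = list(values)
    n = len(values)
    angles = [2 * math.pi * i / n for i in range(n)]
    return angles + angles[:1], values + values[:1]

def plot_radar(profile, title):
    """Draw one radar chart on a polar axis (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt
    angles, vals = radar_polygon(profile.values())
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(angles, vals)
    ax.fill(angles, vals, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(list(profile.keys()))
    ax.set_title(title)
    return fig

angles, vals = radar_polygon(alpha.values())
print(len(angles), vals[0] == vals[-1])
```

Calling `plot_radar(alpha, "Player Alpha")` would draw the chart. Since z-scores can be negative, a full eight-metric version would likely need the radial limits set explicitly (for example with `ax.set_ylim`) so that below-average values remain visible.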

Step 7: Age Curve Consideration

Incorporating age curves adds a forward-looking dimension:

  • Player Beta (age 23) is likely still improving. Based on typical age curves for box-to-box midfielders (peak at 26-29), they are about 3 years from entering the peak window and 6 years from its end, so most of their prime lies ahead.
  • Player Alpha (age 25) is approaching peak years, offering immediately high performance.
  • Player Gamma (age 27) is at peak, offering the highest current level but less future improvement.

The choice between these candidates depends on the club's time horizon: immediate impact (Alpha or Gamma) vs. long-term investment (Beta).
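The arithmetic behind these statements can be made explicit with a small helper. This is a sketch that treats the 26-29 peak window stated above as fixed; the function name `peak_outlook` is hypothetical, and real age curves are smoother than a hard window.

```python
# Peak window for box-to-box midfielders as stated in the text.
PEAK_START, PEAK_END = 26, 29

def peak_outlook(age: int) -> dict:
    """Years until the peak window opens and closes, and whether the player is in it."""
    return {
        "years_to_peak": max(PEAK_START - age, 0),
        "years_to_peak_end": max(PEAK_END - age, 0),
        "in_peak": PEAK_START <= age <= PEAK_END,
    }

for name, age in [("Player Beta", 23), ("Player Alpha", 25), ("Player Gamma", 27)]:
    print(name, peak_outlook(age))
```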

Discussion Questions

  1. Weight sensitivity: How would the ranking change if defensive metrics received double the weight? Use the code provided in code/case-study-code.py to experiment.

  2. Transfer feasibility: The composite index ranks players by statistical profile, but ignores contract situation, transfer fee, wages, and willingness to move. How should these non-statistical factors be integrated into the shortlisting process?

  3. System fit: Player Alpha's progressive carrying numbers are exceptional, but the buying club plays a conservative, possession-recirculation style. How might system fit override raw metric quality?

  4. Validation: What additional evidence (video, tracking data, interview) would you want before making a final recommendation?
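As a possible starting point for question 1, the sketch below doubles the two defensive weights and then rescales all eight so they again sum to 1. The rescaling step is an assumption, since the question does not specify whether the other weights should shrink. The column names and the helper `reweight` are hypothetical.

```python
# Baseline weights from Step 4; column names are hypothetical.
BASE_WEIGHTS = {
    "npxg_p90": 0.10,
    "xa_p90": 0.12,
    "prog_passes_p90": 0.15,
    "prog_carries_p90": 0.15,
    "tackles_won_p90": 0.15,
    "interceptions_p90": 0.13,
    "pass_completion_pct": 0.10,
    "pressure_success_pct": 0.10,
}
DEFENSIVE = ["tackles_won_p90", "interceptions_p90"]

def reweight(weights, boosted, factor=2.0):
    """Multiply the weights of `boosted` metrics by `factor`, then rescale to sum to 1."""
    raw = {m: w * (factor if m in boosted else 1.0) for m, w in weights.items()}
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

new_weights = reweight(BASE_WEIGHTS, DEFENSIVE)
for metric, w in new_weights.items():
    print(f"{metric:22s} {BASE_WEIGHTS[metric]:.3f} -> {w:.3f}")
```

Recomputing the composite index with `new_weights` and comparing the rank order against the baseline shows how sensitive the shortlist is to the weighting choice.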

Key Takeaways

  • Multi-dimensional evaluation prevents the trap of evaluating midfielders solely on goals and assists
  • Z-score standardization enables fair cross-metric comparison
  • Composite indices are useful for initial screening but must be supplemented with profile-level (radar chart) analysis
  • Age curves add a temporal dimension that is critical for transfer decisions
  • Statistical shortlists are the beginning, not the end, of the recruitment process

Code Reference

The complete code for this case study is in code/case-study-code.py, Section 1. It includes data generation, metric computation, z-score standardization, composite index construction, and radar chart visualization.