Case Study 1: Evaluating Premier League Midfielders — A Multi-Dimensional Approach
Background
A mid-table Premier League club is looking to strengthen their central midfield during the January transfer window. The coaching staff wants a box-to-box midfielder who can contribute both defensively and offensively. The analytics department has been asked to produce a shortlist of the top 5 candidates from within the Premier League, supported by data.
This case study walks through the complete analytical process from data preparation to final recommendation.
Objectives
- Define a position-specific metric set for central midfielders
- Filter and normalize the data using per-90 rates and minimum-minute thresholds
- Standardize metrics using z-scores within the central midfielder peer group
- Build radar chart profiles for top candidates
- Construct a composite performance index with role-appropriate weights
- Present findings in a format that bridges analytics and coaching
The Data
We work with a simulated dataset based on realistic statistical distributions drawn from the 2023-24 Premier League season. The dataset contains 65 midfielders (central, defensive, and box-to-box types) who each played at least 900 minutes.
Key fields include: player name, age, team, minutes played, and 15 per-90 statistical categories.
Step 1: Metric Selection
For a box-to-box midfielder, we select eight metrics spanning four dimensions:
Offensive contribution:
- Non-penalty expected goals (npxG) per 90
- Expected assists (xA) per 90

Ball progression:
- Progressive passes per 90
- Progressive carries per 90

Defensive contribution:
- Tackles won per 90
- Interceptions per 90

Possession quality:
- Pass completion percentage
- Pressure success rate (%)

These eight metrics capture the dual-ended nature of the box-to-box role: a player who both disrupts opposition build-up and drives their own team's attacks.
Step 2: Data Preparation and Normalization
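Count-based metrics are converted to per-90 rates (raw total divided by minutes played, multiplied by 90); pass completion and pressure success are already percentages and are left as they are. We apply a minimum threshold of 900 minutes (the equivalent of 10 full matches) to exclude players whose rates rest on unreliably small samples.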
# Filter: central midfielders with at least 900 minutes
cm_data = df[
(df["position"].isin(["CM", "DM", "CM/DM"])) &
(df["minutes"] >= 900)
].copy()
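The per-90 conversion described in this step can be sketched as follows. This is a minimal illustration using only the standard library; the function name `per_90` and the example totals are hypothetical and are not taken from the case-study code.

```python
def per_90(raw_count: float, minutes: float) -> float:
    """Convert a raw season total into a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return raw_count * 90.0 / minutes


# Hypothetical player: 38 tackles won in 1,710 minutes (19 full matches)
tackles_p90 = per_90(38, 1710)
print(round(tackles_p90, 2))  # 2.0
```

Dividing by minutes rather than by appearances keeps substitutes and regular starters on the same footing.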
Step 3: Z-Score Standardization
Each metric is converted to a z-score relative to the central midfielder peer group:
$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} $$
This allows us to compare metrics measured on different scales. A z-score of +1.0 means the player is one standard deviation above the group average for that metric.
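A minimal sketch of the standardization for a single metric is shown here. It assumes the population standard deviation (`statistics.pstdev`), which the formula does not specify; the sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return z-scores for one metric across the peer group."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD, not sample SD
    if sigma == 0:
        # No spread in the group: every player sits at the average
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical tackles-won-per-90 values for five midfielders
tackles_p90 = [1.2, 1.8, 2.4, 3.0, 3.6]
z = z_scores(tackles_p90)
print([round(v, 2) for v in z])
```

The same function would be applied to each of the eight metrics in turn, always using the filtered central midfielder group as the reference population.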
Step 4: Composite Index Construction
We construct a Box-to-Box Composite Index (B2B-CI) using weights that reflect the balanced nature of the role:
| Metric | Weight | Rationale |
|---|---|---|
| npxG per 90 | 0.10 | Scoring threat (secondary for B2B) |
| xA per 90 | 0.12 | Creative contribution |
| Progressive passes per 90 | 0.15 | Ball advancement via passing |
| Progressive carries per 90 | 0.15 | Ball advancement via carrying |
| Tackles won per 90 | 0.15 | Defensive ball-winning |
| Interceptions per 90 | 0.13 | Anticipatory defending |
| Pass completion % | 0.10 | Technical reliability |
| Pressure success rate | 0.10 | Pressing effectiveness |
$$ \text{B2B-CI}_i = \sum_{j=1}^{8} w_j \cdot z_{ij} $$
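The weighted sum can be sketched directly from the table. The weights below are the ones listed above; the dictionary keys and the example player's z-scores are hypothetical and chosen only for illustration.

```python
# Weights from the B2B-CI table (keys are illustrative labels)
WEIGHTS = {
    "npxg_p90": 0.10,
    "xa_p90": 0.12,
    "prog_passes_p90": 0.15,
    "prog_carries_p90": 0.15,
    "tackles_won_p90": 0.15,
    "interceptions_p90": 0.13,
    "pass_completion_pct": 0.10,
    "pressure_success_pct": 0.10,
}


def b2b_ci(z_by_metric, weights=WEIGHTS):
    """Weighted sum of z-scores over the eight metrics."""
    return sum(weights[m] * z_by_metric[m] for m in weights)


# Hypothetical z-scores for one player (not from the case-study dataset)
example_z = {
    "npxg_p90": 0.5,
    "xa_p90": 0.4,
    "prog_passes_p90": 1.2,
    "prog_carries_p90": 1.0,
    "tackles_won_p90": 0.8,
    "interceptions_p90": 0.6,
    "pass_completion_pct": 0.3,
    "pressure_success_pct": 0.2,
}

print(round(b2b_ci(example_z), 3))  # 0.676
```

Because the weights sum to 1.0, the index is a weighted average of the player's z-scores and stays on the same standard-deviation scale.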
Step 5: Results
The composite index produces a ranking. The top 5 candidates in our simulated dataset are:
| Rank | Player | Age | Team | B2B-CI | Strongest Dimension |
|---|---|---|---|---|---|
| 1 | Player Alpha | 25 | Team A | 1.82 | Ball progression |
| 2 | Player Beta | 23 | Team B | 1.65 | Defensive contribution |
| 3 | Player Gamma | 27 | Team C | 1.58 | Possession quality |
| 4 | Player Delta | 24 | Team D | 1.41 | Offensive contribution |
| 5 | Player Epsilon | 26 | Team E | 1.35 | Ball progression |
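The case study does not state how the "Strongest Dimension" column is derived. One plausible approach, assumed here, is to average a player's z-scores within each of the four dimensions from Step 1 and report the highest. The metric keys and z-scores are hypothetical.

```python
# Dimension groupings from Step 1 (keys are illustrative labels)
DIMENSIONS = {
    "Offensive contribution": ["npxg_p90", "xa_p90"],
    "Ball progression": ["prog_passes_p90", "prog_carries_p90"],
    "Defensive contribution": ["tackles_won_p90", "interceptions_p90"],
    "Possession quality": ["pass_completion_pct", "pressure_success_pct"],
}


def dimension_scores(z_by_metric):
    """Average z-score per dimension (assumed aggregation rule)."""
    return {
        name: sum(z_by_metric[m] for m in metrics) / len(metrics)
        for name, metrics in DIMENSIONS.items()
    }


def strongest_dimension(z_by_metric):
    scores = dimension_scores(z_by_metric)
    return max(scores, key=scores.get)


# Hypothetical z-scores for one player (not from the case-study dataset)
player_z = {
    "npxg_p90": 0.5, "xa_p90": 0.4,
    "prog_passes_p90": 1.2, "prog_carries_p90": 1.0,
    "tackles_won_p90": 0.8, "interceptions_p90": 0.6,
    "pass_completion_pct": 0.3, "pressure_success_pct": 0.2,
}

print(strongest_dimension(player_z))  # Ball progression
```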
Step 6: Radar Chart Analysis
Radar charts for the top 3 candidates reveal distinct profiles:
- Player Alpha shows a well-rounded profile with particular spikes in progressive passes (+1.8 SD) and progressive carries (+1.6 SD). Their defensive numbers are solid (+0.8 SD for tackles, +0.7 SD for interceptions). This is the archetypal box-to-box profile.
- Player Beta is the most defensively oriented of the top 3. Tackles (+1.9 SD) and interceptions (+1.5 SD) are exceptional, but offensive numbers are closer to average. At age 23, there is development upside.
- Player Gamma is the most technically polished, with elite pass completion (+1.7 SD) and pressure success rate (+1.4 SD), but with more modest progression and defensive numbers.
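A radar chart places each metric on its own axis, spaced evenly around a circle, and joins the player's values into a polygon. The sketch below computes only the polygon vertices, which could then be passed to any plotting library. Mapping z-scores from an assumed range of -3 to +3 onto a radius of 0 to 1 is an illustrative choice, and the input values are hypothetical.

```python
import math


def radar_vertices(z_values, z_min=-3.0, z_max=3.0):
    """Return (x, y) vertices of a closed radar polygon for one player."""
    n = len(z_values)
    points = []
    for k, z in enumerate(z_values):
        clipped = max(z_min, min(z_max, z))           # keep within plot range
        radius = (clipped - z_min) / (z_max - z_min)  # rescale to 0..1
        angle = 2 * math.pi * k / n                   # evenly spaced axes
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    points.append(points[0])  # repeat first vertex to close the polygon
    return points


# Hypothetical z-scores for the eight metrics, in axis order
vertices = radar_vertices([0.5, 0.4, 1.2, 1.0, 0.8, 0.6, 0.3, 0.2])
print(len(vertices))  # 9
```

With this scaling, a player at the group average on every metric traces a regular polygon at half the chart radius, which gives a convenient visual baseline.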
Step 7: Age Curve Consideration
Incorporating age curves adds a forward-looking dimension:
- Player Beta (age 23) is likely still improving. Based on typical age curves for box-to-box midfielders (peak at 26-29), they are roughly 3 years from entering the peak window and 6 years from its end.
- Player Alpha (age 25) is approaching peak years, offering immediately high performance.
- Player Gamma (age 27) is at peak, offering the highest current level but less future improvement.
The choice between these candidates depends on the club's time horizon: immediate impact (Alpha or Gamma) vs. long-term investment (Beta).
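The arithmetic behind these age comparisons can be sketched using the 26-29 peak window cited above. The function name is hypothetical, and a real age curve would be estimated from data rather than treated as a fixed window.

```python
def years_to_peak(age, peak_start=26, peak_end=29):
    """Return (years until peak window starts, years until it ends)."""
    return max(0, peak_start - age), max(0, peak_end - age)


for name, age in [("Player Alpha", 25), ("Player Beta", 23), ("Player Gamma", 27)]:
    until_start, until_end = years_to_peak(age)
    print(f"{name}: {until_start} to peak start, {until_end} to peak end")
```

Under this simplification Player Beta has 3 years before the window opens, Player Alpha has 1, and Player Gamma is already inside it with 2 years remaining.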
Discussion Questions
- Weight sensitivity: How would the ranking change if defensive metrics received double the weight? Use the code provided in code/case-study-code.py to experiment.
- Transfer feasibility: The composite index ranks players by statistical profile, but ignores contract situation, transfer fee, wages, and willingness to move. How should these non-statistical factors be integrated into the shortlisting process?
- System fit: Player Alpha's progressive carrying numbers are exceptional, but the buying club plays a conservative, possession-recirculation style. How might system fit override raw metric quality?
- Validation: What additional evidence (video, tracking data, interview) would you want before making a final recommendation?
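As a starting point for the weight sensitivity question, the sketch below doubles the two defensive weights and rescales all eight so they still sum to 1.0. Renormalizing is an assumption about how "double the weight" should be interpreted, and the metric keys are illustrative labels rather than names from the case-study code.

```python
# Baseline weights from the B2B-CI table (keys are illustrative labels)
BASE_WEIGHTS = {
    "npxg_p90": 0.10,
    "xa_p90": 0.12,
    "prog_passes_p90": 0.15,
    "prog_carries_p90": 0.15,
    "tackles_won_p90": 0.15,
    "interceptions_p90": 0.13,
    "pass_completion_pct": 0.10,
    "pressure_success_pct": 0.10,
}
DEFENSIVE = {"tackles_won_p90", "interceptions_p90"}


def reweight(weights, boosted, factor=2.0):
    """Multiply the boosted metrics' weights by factor, then renormalize."""
    scaled = {m: w * (factor if m in boosted else 1.0) for m, w in weights.items()}
    total = sum(scaled.values())
    return {m: w / total for m, w in scaled.items()}


new_weights = reweight(BASE_WEIGHTS, DEFENSIVE)
for metric, w in new_weights.items():
    print(f"{metric:22s} {w:.3f}")
```

Under this interpretation the combined defensive share rises from 0.28 to about 0.44, and the composite index can be recomputed with the new weights to see how the ranking shifts.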
Key Takeaways
- Multi-dimensional evaluation prevents the trap of evaluating midfielders solely on goals and assists
- Z-score standardization enables fair cross-metric comparison
- Composite indices are useful for initial screening but must be supplemented with profile-level (radar chart) analysis
- Age curves add a temporal dimension that is critical for transfer decisions
- Statistical shortlists are the beginning, not the end, of the recruitment process
Code Reference
The complete code for this case study is in code/case-study-code.py, Section 1. It includes data generation, metric computation, z-score standardization, composite index construction, and radar chart visualization.