Case Study 2: Finding the Next Virgil van Dijk — Player Similarity in Recruitment
Background
Virgil van Dijk is widely regarded as one of the best centre-backs in world football. His combination of aerial dominance, composure on the ball, progressive passing, and defensive solidity makes him an exceptionally rare profile.
Suppose a club wants to identify young centre-backs (under 25) whose statistical profiles most closely resemble van Dijk's. This is a classic player similarity problem, and it illustrates both the power and the limitations of algorithmic recruitment.
This case study builds a complete similarity pipeline using multiple distance metrics, compares the results, and critically examines what the numbers can and cannot tell us.
Objectives
- Define van Dijk's statistical profile using per-90 metrics
- Build a candidate pool of centre-backs from the top five European leagues
- Compute similarity scores using cosine similarity, Euclidean distance, and weighted distance
- Use K-Means clustering to identify van Dijk's archetype and find other members
- Visualize the player landscape using PCA
- Critically evaluate the results and identify limitations
Step 1: Defining the Target Profile
We characterize van Dijk's profile using ten metrics spanning defending, possession, progression, and distribution:
| Metric | Category | van Dijk Value (per 90, or % where shown) | Percentile (among CBs) |
|---|---|---|---|
| Aerial duels won | Defensive | 4.2 | 92nd |
| Tackles won | Defensive | 1.1 | 45th |
| Interceptions | Defensive | 1.4 | 62nd |
| Clearances | Defensive | 3.8 | 70th |
| Pass completion % | Possession | 91.2% | 95th |
| Progressive passes | Progression | 8.8 | 93rd |
| Progressive carries | Progression | 1.9 | 82nd |
| Long pass accuracy | Distribution | 72.5% | 90th |
| Errors leading to shots | Negative | 0.05 | 88th (inverted) |
| Pressure success rate | Defensive | 32.1% | 75th |
Van Dijk's profile is distinctive: elite in aerial duels, passing, and progressive play, but not exceptionally high in tackles --- reflecting his positional intelligence and the fact that well-positioned defenders need to tackle less.
Step 2: Building the Candidate Pool
We assemble a dataset of 180 centre-backs from the Premier League, La Liga, Bundesliga, Serie A, and Ligue 1 who:
- Are under 25 years old
- Have played at least 1,200 minutes in the current season
- Are listed as centre-backs by their clubs
All metrics are computed as per-90 rates and then standardized to z-scores within this candidate pool.
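The per-90 conversion and standardization can be sketched as follows. This is a minimal illustration, not the case study's own code: the function names and the four-player toy pool are hypothetical, and it assumes the target profile is scaled with the pool's mean and standard deviation so that he and the candidates share one scale.

```python
import numpy as np

def per_90(totals: np.ndarray, minutes: np.ndarray) -> np.ndarray:
    """Convert season totals (players x metrics) into per-90-minute rates."""
    return totals / minutes[:, None] * 90.0

def fit_zscore(pool: np.ndarray):
    """Return the per-metric mean and standard deviation of the pool."""
    return pool.mean(axis=0), pool.std(axis=0)

def apply_zscore(values: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize values using statistics fitted on the candidate pool."""
    return (values - mean) / std

# Toy pool (made-up numbers): 4 players x 2 metrics
# columns: aerial duels won, progressive passes (season totals)
totals = np.array([[60.0, 90.0], [40.0, 120.0], [80.0, 60.0], [50.0, 100.0]])
minutes = np.array([1800.0, 1350.0, 2700.0, 1500.0])

pool_p90 = per_90(totals, minutes)
mean, std = fit_zscore(pool_p90)
pool_z = apply_zscore(pool_p90, mean, std)

# Target values taken from the Step 1 table (aerial duels 4.2, progressive passes 8.8),
# standardized with the pool's statistics rather than his own.
target_p90 = np.array([4.2, 8.8])
target_z = apply_zscore(target_p90, mean, std)
```

Fitting the scaling on the pool and then applying it to the target keeps every later distance in units of "standard deviations among young centre-backs".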
Step 3: Similarity Computation
Cosine Similarity Results
Cosine similarity finds players whose shape of strengths and weaknesses matches van Dijk's, regardless of absolute level.
Top 5 by cosine similarity:
| Rank | Player | Age | League | Cosine Sim | Key Similarities |
|---|---|---|---|---|---|
| 1 | CB-Alpha | 22 | Bundesliga | 0.94 | Aerial + progressive passing profile |
| 2 | CB-Beta | 24 | Ligue 1 | 0.91 | Distribution + low error rate |
| 3 | CB-Gamma | 23 | La Liga | 0.89 | Balanced defensive + passing |
| 4 | CB-Delta | 21 | Serie A | 0.87 | Progressive carries + aerial |
| 5 | CB-Epsilon | 24 | Premier League | 0.86 | High pass completion + aerial |
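As a rough sketch of the computation behind a table like this, cosine similarity between each candidate's z-score vector and the target's can be written in a few lines of NumPy. The three-metric vectors below are toy values chosen for illustration, not data from the case study.

```python
import numpy as np

def cosine_similarity(pool_z: np.ndarray, target_z: np.ndarray) -> np.ndarray:
    """Cosine of the angle between each pool row and the target vector."""
    norms = np.linalg.norm(pool_z, axis=1) * np.linalg.norm(target_z)
    return pool_z @ target_z / norms

# Toy z-score vectors (hypothetical)
target_z = np.array([2.0, 1.0, -0.5])
pool_z = np.array([
    [1.0, 0.5, -0.25],   # same shape as the target at half the magnitude
    [2.0, 1.0, -0.5],    # identical to the target
    [-2.0, -1.0, 0.5],   # mirror image of the target
])

scores = cosine_similarity(pool_z, target_z)
ranking = np.argsort(-scores)  # candidate indices, best match first
```

The first two candidates receive the same score even though one sits at half the target's level, which is exactly the "shape, not level" property described above.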
Euclidean Distance Results
Euclidean distance finds players who are similar in both style and level.
Top 5 by Euclidean similarity (1 / (1 + distance)):
| Rank | Player | Age | League | Eucl. Sim | Note |
|---|---|---|---|---|---|
| 1 | CB-Epsilon | 24 | Premier League | 0.41 | Already in top league |
| 2 | CB-Beta | 24 | Ligue 1 | 0.38 | High absolute level |
| 3 | CB-Alpha | 22 | Bundesliga | 0.35 | Similar shape, lower magnitude |
| 4 | CB-Zeta | 23 | Bundesliga | 0.34 | New entrant: strong passer |
| 5 | CB-Gamma | 23 | La Liga | 0.33 | Consistent across metrics |
Notice that the rankings shift. CB-Epsilon, 5th by cosine similarity, rises to 1st by Euclidean distance because their standardized metric values are closest to van Dijk's in absolute terms. CB-Alpha, the best stylistic match, drops to 3rd: the shape of the profile is similar, but its magnitude (in a less competitive league context) falls further short of van Dijk's elite level.
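The rank shift can be reproduced on a toy example. The sketch below uses the 1 / (1 + distance) transform quoted above; the two-metric vectors are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import numpy as np

def euclidean_similarity(pool_z: np.ndarray, target_z: np.ndarray) -> np.ndarray:
    """Map Euclidean distance into (0, 1]; 1.0 means an identical profile."""
    dist = np.linalg.norm(pool_z - target_z, axis=1)
    return 1.0 / (1.0 + dist)

def cosine_similarity(pool_z: np.ndarray, target_z: np.ndarray) -> np.ndarray:
    """Cosine of the angle between each pool row and the target vector."""
    norms = np.linalg.norm(pool_z, axis=1) * np.linalg.norm(target_z)
    return pool_z @ target_z / norms

# Toy z-score vectors (hypothetical)
target_z = np.array([3.0, 4.0])
pool_z = np.array([
    [1.5, 2.0],  # player 0: same shape as the target, half the magnitude
    [3.0, 3.0],  # player 1: slightly different shape, similar magnitude
])

cos = cosine_similarity(pool_z, target_z)
eucl = euclidean_similarity(pool_z, target_z)

best_by_cosine = int(np.argmax(cos))      # player 0 (perfect shape match)
best_by_euclidean = int(np.argmax(eucl))  # player 1 (closer in absolute terms)
```

Player 0 is a perfect stylistic match but sits 2.5 units from the target, while player 1 is only 1 unit away, so the two metrics disagree about who ranks first.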
Weighted Distance Results
We apply weights that emphasize the qualities most unique to van Dijk:
| Metric | Weight | Rationale |
|---|---|---|
| Aerial duels won | 0.15 | Core to van Dijk's game |
| Progressive passes | 0.15 | Distinguishing feature |
| Pass completion % | 0.12 | Ball-playing requirement |
| Long pass accuracy | 0.12 | Distribution range |
| Progressive carries | 0.10 | Comfort on the ball |
| Interceptions | 0.10 | Positional awareness |
| Clearances | 0.08 | Standard defensive duty |
| Tackles won | 0.06 | Less central to this profile |
| Pressure success rate | 0.07 | Pressing ability |
| Errors (inverted) | 0.05 | Reliability |
The weighted distance reranks players to emphasize the specific combination that makes van Dijk unique, rather than treating all metrics equally.
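One common way to apply such weights is a weighted Euclidean distance, d = sqrt(sum_i w_i (x_i - t_i)^2). The text does not spell out the exact formula, so treat the sketch below as an assumption; the weights are copied from the table above, while the metric keys and toy vectors are hypothetical.

```python
import numpy as np

# Weights from the table above, keyed by illustrative metric names
WEIGHTS = {
    "aerial_duels_won": 0.15,
    "progressive_passes": 0.15,
    "pass_completion_pct": 0.12,
    "long_pass_accuracy": 0.12,
    "progressive_carries": 0.10,
    "interceptions": 0.10,
    "clearances": 0.08,
    "tackles_won": 0.06,
    "pressure_success_rate": 0.07,
    "errors_inverted": 0.05,
}
w = np.array(list(WEIGHTS.values()))

def weighted_distance(pool_z: np.ndarray, target_z: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted Euclidean distance from each pool row to the target."""
    return np.sqrt(((pool_z - target_z) ** 2 * w).sum(axis=1))

# Toy check: the target sits at the pool average (all z-scores zero)
target_z = np.zeros(10)
pool_z = np.zeros((3, 10))
pool_z[0, 0] = 1.0   # one standard deviation off in aerial duels only
pool_z[1, 9] = 1.0   # one standard deviation off in errors only
pool_z[2, :] = 1.0   # one standard deviation off in every metric

dist = weighted_distance(pool_z, target_z, w)
```

Because the weights sum to 1, a player who is one standard deviation away on every metric has distance 1, and the same gap costs more in aerial duels (weight 0.15) than in errors (weight 0.05).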
Step 4: Clustering Analysis
We run K-Means clustering on the full pool of 180 centre-backs with k = 5 clusters. The resulting archetypes are:
| Cluster | Archetype Label | Key Characteristics | n Players |
|---|---|---|---|
| 0 | Ball-playing CB | High pass %, progressive passes, lower tackles | 32 |
| 1 | Aggressive defender | High tackles, interceptions, moderate passing | 45 |
| 2 | Aerial specialist | High aerial duels, clearances, lower passing | 28 |
| 3 | Complete CB | Above average across all dimensions | 22 |
| 4 | Athletic modern CB | High progressive carries, pressure success | 53 |
Van Dijk falls into Cluster 3 (Complete CB) --- the rarest archetype, characterized by above-average performance across virtually all dimensions. Only 22 of 180 centre-backs share this cluster.
The young centre-backs in Cluster 3 become our most holistic matches, as they share not just individual metric similarity but an overall profile type.
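The clustering step can be sketched with a small, dependency-light version of Lloyd's algorithm. This is a stand-in for illustration only: in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. The sketch assumes the target is assigned to the nearest centroid after the pool has been clustered, uses a deterministic farthest-point initialization as a simple substitute for k-means++, and runs on synthetic two-dimensional data with k = 3 rather than the ten metrics and k = 5 of the case study.

```python
import numpy as np

def pairwise_dist(X: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Distances between every row of X and every row of C."""
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

def init_farthest(X: np.ndarray, k: int) -> np.ndarray:
    """Start from the first point, then repeatedly add the farthest point."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = pairwise_dist(X, np.array(centroids)).min(axis=1)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids)

def assign(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Index of the nearest centroid for each row of X."""
    return pairwise_dist(X, centroids).argmin(axis=1)

def kmeans(X: np.ndarray, k: int, n_iter: int = 100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centroids = init_farthest(X, k)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign(X, centroids)

# Synthetic pool: three well-separated groups of 20 players each
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
pool_z = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

centroids, labels = kmeans(pool_z, k=3)

# Assign the (hypothetical) target profile to its nearest centroid,
# then list the pool members that share that cluster.
target_z = np.array([4.6, -5.3])
target_label = assign(target_z[None, :], centroids)[0]
members = np.flatnonzero(labels == target_label)
```

The `members` array plays the role of the Cluster 3 shortlist: every pool player whose cluster matches the one assigned to the target.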
Step 5: PCA Visualization
Projecting all 180 players onto two principal components reveals the landscape:
- PC1 (32% variance): Loads on progressive passing, pass completion, and long pass accuracy. This is a "ball-playing" dimension.
- PC2 (21% variance): Loads on aerial duels, clearances, and tackles. This is a "traditional defending" dimension.
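Van Dijk appears in the upper-right quadrant: high on both dimensions. The closest candidates in this 2D space are CB-Epsilon, CB-Beta, and CB-Alpha, which is consistent with the similarity analysis. Because the two components together explain only 53% of the variance, proximity in this plot is indicative rather than conclusive.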
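A projection of this kind can be sketched with a singular value decomposition in NumPy. The data below is random and purely illustrative, so the explained-variance figures will not match the 32% and 21% reported above. The function names are hypothetical, and the sketch assumes the target is projected with components fitted on the pool.

```python
import numpy as np

def fit_pca(X: np.ndarray, n_components: int = 2):
    """Return the pool mean, principal axes, and explained-variance ratios."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, vt[:n_components], explained[:n_components]

def project(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project rows of X (or a single vector) onto the principal axes."""
    return (X - mean) @ components.T

# Synthetic stand-in for the standardized pool: 180 players x 10 metrics
rng = np.random.default_rng(42)
pool_z = rng.normal(size=(180, 10))
target_z = rng.normal(size=10)  # hypothetical target profile

mean, components, explained = fit_pca(pool_z)
pool_xy = project(pool_z, mean, components)
target_xy = project(target_z, mean, components)

# Nearest candidates to the target in the 2D map
dist_2d = np.linalg.norm(pool_xy - target_xy, axis=1)
nearest_2d = np.argsort(dist_2d)[:3]
```

Comparing `nearest_2d` with the neighbours found in the full ten-dimensional space is a quick way to check how much the 2D picture distorts the rankings.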
Step 6: Critical Evaluation
What the model captures well:
- Statistical profile similarity across a broad range of metrics
- Identification of the rarest archetype (complete CB)
- Systematic comparison across multiple leagues
What the model misses:
- Leadership and communication: Van Dijk's vocal organization of the defensive line is invisible in event data.
- Physical profile: Height (6'4"), speed, and strength are not in the per-90 metrics but are central to his dominance.
- Big-match temperament: Performance in high-pressure moments is not captured by season-level per-90 averages.
- League-level adjustment: A centre-back with elite statistics in Ligue 1 may face a significant step up in the Premier League. Raw metric comparison across leagues assumes comparable difficulty.
- Team system effects: Van Dijk's statistics are partly a product of Liverpool's system (high line, possession-dominant). A similar player in a deep-block team would produce different numbers even if their underlying ability were identical.
Recommendations
The final shortlist should combine statistical similarity (the top 3-5 candidates from the analysis above) with:
- Video scouting of at least 5 full matches per candidate
- Physical profiling data (height, sprint speed, acceleration)
- Personality and character assessment
- Financial feasibility analysis (transfer fee, wage demands, contract length)
- League-level adjustment modeling (if available from Chapter 16+)
Discussion Questions
- CB-Alpha (age 22, Bundesliga) has the highest cosine similarity (0.94) but plays in a league that is generally considered less physically demanding than the Premier League. How should this inform the evaluation? What additional data would help?
- Only 22 of 180 centre-backs fall into the "Complete CB" cluster. What does this rarity tell us about the difficulty of replacing a player like van Dijk?
- The weighted distance approach requires the analyst to choose weights subjectively. How could you use data-driven methods (e.g., regression on match outcomes) to set these weights instead?
- If you could add tracking data (speed, distance, positioning) to this analysis, which specific tracking metrics would you include, and how might they change the results?
Key Takeaways
- Multiple similarity metrics (cosine, Euclidean, weighted) answer subtly different questions and may produce different rankings
- Clustering reveals natural archetypes and quantifies how rare a player's profile type is
- PCA visualization provides an intuitive map of the player landscape
- Statistical similarity is necessary but not sufficient for recruitment decisions
- The "holy grail" replacement rarely exists --- most signings involve trade-offs between different aspects of the departing player's profile
Code Reference
The complete code for this case study is in code/case-study-code.py, Section 2. It includes target profile definition, candidate pool generation, three similarity computations, K-Means clustering, PCA visualization, and comparison reporting.