Case Study 2: Finding the Next Virgil van Dijk — Player Similarity in Recruitment

Background

Virgil van Dijk is widely regarded as one of the best centre-backs in world football. His combination of aerial dominance, composure on the ball, progressive passing, and defensive solidity makes him an exceptionally rare profile.

Suppose a club wants to identify young centre-backs (under 25) whose statistical profiles most closely resemble van Dijk's. This is a classic player similarity problem, and it illustrates both the power and the limitations of algorithmic recruitment.

This case study builds a complete similarity pipeline using multiple distance metrics, compares the results, and critically examines what the numbers can and cannot tell us.

Objectives

  1. Define van Dijk's statistical profile using per-90 counts and percentage rates
  2. Build a candidate pool of centre-backs from the top five European leagues
  3. Compute similarity scores using cosine similarity, Euclidean distance, and weighted distance
  4. Use K-Means clustering to identify van Dijk's archetype and find other members
  5. Visualize the player landscape using PCA
  6. Critically evaluate the results and identify limitations

Step 1: Defining the Target Profile

We characterize van Dijk's profile using ten metrics that capture the full scope of elite centre-back play:

| Metric | Category | van Dijk Value (per 90, or % where shown) | Percentile (among CBs) |
|---|---|---|---|
| Aerial duels won | Defensive | 4.2 | 92nd |
| Tackles won | Defensive | 1.1 | 45th |
| Interceptions | Defensive | 1.4 | 62nd |
| Clearances | Defensive | 3.8 | 70th |
| Pass completion % | Possession | 91.2% | 95th |
| Progressive passes | Progression | 8.8 | 93rd |
| Progressive carries | Progression | 1.9 | 82nd |
| Long pass accuracy | Distribution | 72.5% | 90th |
| Errors leading to shots | Negative | 0.05 | 88th (inverted) |
| Pressure success rate | Defensive | 32.1% | 75th |

Van Dijk's profile is distinctive: elite in aerial duels, passing, and progressive play, but only mid-range in tackles won (45th percentile). The low tackle count reflects his positional intelligence, since well-positioned defenders need to tackle less.
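The profile table can be held in code as a simple mapping. The sketch below copies the values from the table; the key names, the `LOWER_IS_BETTER` set, and the two helper functions are illustrative assumptions rather than the names used in the case-study code. It also shows one plausible reading of "inverted": for errors, a lower value should earn a higher percentile.

```python
# Van Dijk's values copied from the table above
# (counts are per 90 minutes; rates are percentages).
VVD_PROFILE = {
    "aerial_duels_won": 4.2,
    "tackles_won": 1.1,
    "interceptions": 1.4,
    "clearances": 3.8,
    "pass_completion_pct": 91.2,
    "progressive_passes": 8.8,
    "progressive_carries": 1.9,
    "long_pass_accuracy_pct": 72.5,
    "errors_leading_to_shots": 0.05,
    "pressure_success_pct": 32.1,
}

# Assumption: metrics in this set are "negative", so lower is better.
LOWER_IS_BETTER = {"errors_leading_to_shots"}


def oriented(profile):
    """Negate lower-is-better metrics so that higher always means better."""
    return {k: (-v if k in LOWER_IS_BETTER else v) for k, v in profile.items()}


def percentile_rank(value, pool, lower_is_better=False):
    """Percentage of pool values that the given value beats."""
    if lower_is_better:
        beaten = sum(1 for x in pool if x > value)
    else:
        beaten = sum(1 for x in pool if x < value)
    return 100.0 * beaten / len(pool)
```

Orienting every metric the same way before computing similarity keeps a low error rate from being read as a weakness.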

Step 2: Building the Candidate Pool

We assemble a dataset of 180 centre-backs from the Premier League, La Liga, Bundesliga, Serie A, and Ligue 1 who:

  • Are under 25 years old
  • Have played at least 1,200 minutes in the current season
  • Are listed as centre-backs by their clubs

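Count metrics are computed as per-90 rates, percentage metrics are kept as rates, and all ten are then standardized to z-scores using the mean and standard deviation of this candidate pool. Van Dijk is not a member of the under-25 pool, so his values are scaled with the same pool statistics to place him on a common scale with the candidates.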
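A minimal sketch of the two preprocessing steps, assuming the data are held in NumPy arrays. The function names `per_90` and `standardize` and the toy numbers are illustrative, not taken from the case-study code. Only count metrics go through `per_90`; percentage metrics are already rates.

```python
import numpy as np


def per_90(totals, minutes):
    """Convert season totals to per-90 rates.

    totals: array of shape (n_players, n_metrics); minutes: shape (n_players,).
    """
    totals = np.asarray(totals, dtype=float)
    minutes = np.asarray(minutes, dtype=float)
    return totals / minutes[:, None] * 90.0


def standardize(pool, target):
    """Z-score the pool per metric and scale the target with the pool's statistics."""
    pool = np.asarray(pool, dtype=float)
    mean = pool.mean(axis=0)
    std = pool.std(axis=0)
    std[std == 0] = 1.0  # guard: a constant column would otherwise divide by zero
    pool_z = (pool - mean) / std
    target_z = (np.asarray(target, dtype=float) - mean) / std
    return pool_z, target_z


# Toy pool: 3 players, 2 count metrics (aerial duels won, progressive passes).
totals = np.array([[60.0, 120.0], [30.0, 90.0], [80.0, 40.0]])
minutes = np.array([1800.0, 1350.0, 2400.0])
rates = per_90(totals, minutes)

# Van Dijk's per-90 values for the same two metrics, from the profile table.
pool_z, vvd_z = standardize(rates, target=[4.2, 8.8])
```

Because the target is scaled with the pool's mean and standard deviation, a z-score of +2 for van Dijk means "two pool standard deviations above the average young centre-back".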

Step 3: Similarity Computation

Cosine Similarity Results

Cosine similarity finds players whose shape of strengths and weaknesses matches van Dijk's, regardless of absolute level.
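As a sketch of the computation, cosine similarity is the dot product of two z-score vectors divided by the product of their lengths. The helper names and the three toy candidates below are hypothetical and serve only to show the ranking mechanics.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two profile vectors (1 = identical shape)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # a zero vector has no direction; treat it as no similarity
    return float(a @ b / denom)


def rank_by_cosine(target_z, pool_z, names, top_n=5):
    """Return the top_n (name, score) pairs, most similar first."""
    scores = [(n, cosine_similarity(target_z, row)) for n, row in zip(names, pool_z)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy z-score vectors over three metrics.
target = [2.0, 1.0, -0.5]
pool = [
    [1.0, 0.5, -0.25],   # same shape, half the magnitude
    [-2.0, -1.0, 0.5],   # mirror-image profile
    [1.0, 1.0, 1.0],     # flat profile
]
ranking = rank_by_cosine(target, pool, names=["A", "B", "C"])
```

Player A scores a cosine of 1 despite sitting at half the target's magnitude, which is exactly the "regardless of absolute level" property described above.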

Top 5 by cosine similarity:

| Rank | Player | Age | League | Cosine Sim | Key Similarities |
|---|---|---|---|---|---|
| 1 | CB-Alpha | 22 | Bundesliga | 0.94 | Aerial + progressive passing profile |
| 2 | CB-Beta | 24 | Ligue 1 | 0.91 | Distribution + low error rate |
| 3 | CB-Gamma | 23 | La Liga | 0.89 | Balanced defensive + passing |
| 4 | CB-Delta | 21 | Serie A | 0.87 | Progressive carries + aerial |
| 5 | CB-Epsilon | 24 | Premier League | 0.86 | High pass completion + aerial |

Euclidean Distance Results

Euclidean distance finds players who are similar in both style and level.

Top 5 by Euclidean similarity (1 / (1 + distance)):

| Rank | Player | Age | League | Eucl. Sim | Note |
|---|---|---|---|---|---|
| 1 | CB-Epsilon | 24 | Premier League | 0.41 | Already in top league |
| 2 | CB-Beta | 24 | Ligue 1 | 0.38 | High absolute level |
| 3 | CB-Alpha | 22 | Bundesliga | 0.35 | Similar shape, lower magnitude |
| 4 | CB-Zeta | 23 | Bundesliga | 0.34 | New entrant: strong passer |
| 5 | CB-Gamma | 23 | La Liga | 0.33 | Consistent across metrics |

Notice that the rankings shift: CB-Epsilon, who was 5th by cosine similarity, rises to 1st by Euclidean distance because their absolute metric values are closest to van Dijk's. CB-Alpha, who was the best stylistic match, drops to 3rd because their raw numbers (in a less competitive league context) are further from van Dijk's elite level.
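The reordering can be reproduced on a toy example. The sketch below uses the 1 / (1 + distance) transform from the table heading; the two candidate vectors are invented to isolate the effect, with one matching the target's shape at half the magnitude and the other sitting close to the target's level with a slightly different shape.

```python
import numpy as np


def cosine_similarity(a, b):
    """Shape match: ignores the overall magnitude of the vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_similarity(a, b):
    """Shape-and-level match: 1 / (1 + Euclidean distance), in (0, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))


target = np.array([2.0, 1.5, 1.0])        # elite on all three metrics
same_shape = 0.5 * target                 # identical proportions, lower level
same_level = np.array([1.8, 1.7, 0.9])    # slightly different shape, similar level

for name, vec in [("same_shape", same_shape), ("same_level", same_level)]:
    print(name,
          round(cosine_similarity(target, vec), 3),
          round(euclidean_similarity(target, vec), 3))
```

Cosine similarity prefers `same_shape`, while Euclidean similarity prefers `same_level`. This is the same pattern as CB-Alpha and CB-Epsilon swapping places between the two tables.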

Weighted Distance Results

We apply weights that emphasize the qualities most unique to van Dijk:

| Metric | Weight | Rationale |
|---|---|---|
| Aerial duels won | 0.15 | Core to van Dijk's game |
| Progressive passes | 0.15 | Distinguishing feature |
| Pass completion % | 0.12 | Ball-playing requirement |
| Long pass accuracy | 0.12 | Distribution range |
| Progressive carries | 0.10 | Comfort on the ball |
| Interceptions | 0.10 | Positional awareness |
| Clearances | 0.08 | Standard defensive duty |
| Tackles won | 0.06 | Less central to this profile |
| Pressure success rate | 0.07 | Pressing ability |
| Errors (inverted) | 0.05 | Reliability |

The weighted distance reranks players to emphasize the specific combination that makes van Dijk unique, rather than treating all metrics equally.
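The text does not spell out the formula, so the sketch below assumes one common choice: a weighted Euclidean distance, the square root of the weighted sum of squared z-score differences. The weights are copied from the table; the key names and the `weighted_distance` helper are illustrative.

```python
import numpy as np

# Weights copied from the table above; they sum to 1.
WEIGHTS = {
    "aerial_duels_won": 0.15,
    "progressive_passes": 0.15,
    "pass_completion_pct": 0.12,
    "long_pass_accuracy_pct": 0.12,
    "progressive_carries": 0.10,
    "interceptions": 0.10,
    "clearances": 0.08,
    "tackles_won": 0.06,
    "pressure_success_pct": 0.07,
    "errors_inverted": 0.05,
}
METRICS = list(WEIGHTS)  # fixes the column order of every profile vector


def weighted_distance(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance (an assumed form) between two z-score vectors."""
    w = np.array([weights[m] for m in METRICS], dtype=float)
    w = w / w.sum()  # renormalize in case the weights are edited
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))


# Two hypothetical candidates, each one z-score away from the target on one metric.
target = np.zeros(len(METRICS))
gap_on_aerial = target.copy()
gap_on_aerial[METRICS.index("aerial_duels_won")] = 1.0
gap_on_errors = target.copy()
gap_on_errors[METRICS.index("errors_inverted")] = 1.0
```

Under these weights, a one-standard-deviation gap in aerial duels pushes a candidate further from van Dijk than the same gap in error rate, which is the intended emphasis.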

Step 4: Clustering Analysis

We run K-Means clustering on the full pool of 180 centre-backs with k = 5 clusters. The resulting archetypes are:

| Cluster | Archetype Label | Key Characteristics | n Players |
|---|---|---|---|
| 0 | Ball-playing CB | High pass %, progressive passes, lower tackles | 32 |
| 1 | Aggressive defender | High tackles, interceptions, moderate passing | 45 |
| 2 | Aerial specialist | High aerial duels, clearances, lower passing | 28 |
| 3 | Complete CB | Above average across all dimensions | 22 |
| 4 | Athletic modern CB | High progressive carries, pressure success | 53 |

Van Dijk falls into Cluster 3 (Complete CB), the rarest archetype, characterized by above-average performance across virtually all dimensions. Only 22 of the 180 centre-backs in the pool belong to this cluster.

The young centre-backs in Cluster 3 are therefore our most holistic matches: they share van Dijk's overall profile type, not just similarity on individual metrics.
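To make the mechanics concrete, here is a minimal NumPy sketch of K-Means (Lloyd's algorithm) with a nearest-centroid lookup for a player outside the pool. It is written from scratch for illustration; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. Assigning van Dijk to the nearest centroid is an assumption about how an out-of-pool target would be placed, and the toy data use k = 2 rather than the case study's k = 5.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct players chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each player joins the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


def nearest_cluster(point, centroids):
    """Index of the centroid closest to a single profile vector."""
    point = np.asarray(point, dtype=float)
    return int(np.linalg.norm(centroids - point, axis=1).argmin())


# Toy pool: two tight, well-separated groups of players in two dimensions.
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 10.0
X = np.vstack([group_a, group_b])
centroids, labels = kmeans(X, k=2)

# A target who was not part of the fit is placed with the nearest archetype.
target_cluster = nearest_cluster([9.9, 10.2], centroids)
```

The cluster sizes reported in the table correspond to counting the entries of `labels` per cluster, for example with `np.bincount(labels)`.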

Step 5: PCA Visualization

Projecting all 180 players onto two principal components reveals the landscape:

  • PC1 (32% variance): Loads on progressive passing, pass completion, and long pass accuracy. This is a "ball-playing" dimension.
  • PC2 (21% variance): Loads on aerial duels, clearances, and tackles. This is a "traditional defending" dimension.

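Van Dijk appears in the upper-right quadrant: high on both dimensions. The closest candidates in this 2D space are CB-Epsilon, CB-Beta, and CB-Alpha, which is consistent with the similarity rankings above. Because the two components together explain only 53% of the variance, proximity on this map is supporting evidence rather than independent confirmation.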
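The projection itself can be sketched with a singular value decomposition of the centred data matrix, which is one standard way to compute PCA. The function name `pca_2d` and the random toy data are assumptions for illustration; a library routine such as scikit-learn's `PCA` would give the same components up to sign.

```python
import numpy as np


def pca_2d(X):
    """Project rows of X onto the first two principal components.

    Returns (coords, components, explained_ratio, mean).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centre each metric
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:2]                            # rows are the PC1 and PC2 loadings
    explained = S ** 2 / np.sum(S ** 2)            # share of variance per component
    return Xc @ components.T, components, explained[:2], mean


# Toy pool: 50 players, 4 standardized metrics with unequal spread.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 0.5, 0.1])
coords, components, explained, mean = pca_2d(X)

# Place an out-of-pool target on the same map using the pool's mean and loadings.
target = np.array([2.0, 1.5, 0.2, 0.0])
target_xy = (target - mean) @ components.T
```

The rows of `components` are the loadings that justify labels such as "ball-playing" and "traditional defending", and `explained` gives the variance shares quoted for each axis.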

Step 6: Critical Evaluation

What the model captures well:

  • Statistical profile similarity across a broad range of metrics
  • Identification of the rarest archetype (complete CB)
  • Systematic comparison across multiple leagues

What the model misses:

  1. Leadership and communication: Van Dijk's vocal organization of the defensive line is invisible in event data.
  2. Physical profile: Height (6'4"), speed, and strength are not in the per-90 metrics but are central to his dominance.
  3. Big-match temperament: Performance in high-pressure moments is not captured by season-level per-90 averages.
  4. League-level adjustment: A centre-back with elite statistics in Ligue 1 may face a significant step up in the Premier League. Raw metric comparison across leagues assumes comparable difficulty.
  5. Team system effects: Van Dijk's statistics are partly a product of Liverpool's system (high line, possession-dominant). A similar player in a deep-block team would produce different numbers even if their underlying ability were identical.

Recommendations

The final shortlist should combine statistical similarity (the top 3-5 candidates from the analysis above) with:

  • Video scouting of at least 5 full matches per candidate
  • Physical profiling data (height, sprint speed, acceleration)
  • Personality and character assessment
  • Financial feasibility analysis (transfer fee, wage demands, contract length)
  • League-level adjustment modeling (if available from Chapter 16+)

Discussion Questions

  1. CB-Alpha (age 22, Bundesliga) has the highest cosine similarity (0.94) but plays in a league that is generally considered less physically demanding than the Premier League. How should this inform the evaluation? What additional data would help?

  2. Only 22 of 180 centre-backs fall into the "Complete CB" cluster. What does this rarity tell us about the difficulty of replacing a player like van Dijk?

  3. The weighted distance approach requires the analyst to choose weights subjectively. How could you use data-driven methods (e.g., regression on match outcomes) to set these weights instead?

  4. If you could add tracking data (speed, distance, positioning) to this analysis, which specific tracking metrics would you include, and how might they change the results?

Key Takeaways

  • Multiple similarity metrics (cosine, Euclidean, weighted) answer subtly different questions and may produce different rankings
  • Clustering reveals natural archetypes and quantifies how rare a player's profile type is
  • PCA visualization provides an intuitive map of the player landscape
  • Statistical similarity is necessary but not sufficient for recruitment decisions
  • The "holy grail" replacement rarely exists: most signings involve trade-offs between different aspects of the departing player's profile

Code Reference

The complete code for this case study is in code/case-study-code.py, Section 2. It includes target profile definition, candidate pool generation, three similarity computations, K-Means clustering, PCA visualization, and comparison reporting.