Case Study 1: Predicting Player Positions Using Unsupervised Learning
Background
In the 2023 summer transfer window, a mid-table club in one of Europe's top five leagues lost their starting right-back to a larger club. The sporting director tasked the analytics department with finding replacement candidates who matched the departing player's statistical profile --- not merely other right-backs, but players from any position who performed a similar on-pitch role.
This case study walks through the analytical pipeline the team used: unsupervised learning to discover data-driven player roles from per-90 performance metrics, followed by a similarity search to identify replacement candidates.
Objectives
- Cluster 500 outfield players into data-driven roles using multiple unsupervised learning algorithms.
- Compare K-means, hierarchical clustering, and Gaussian Mixture Models.
- Validate clusters against traditional positional labels.
- Build a player similarity engine based on cluster membership and feature proximity.
Data Description
We use synthetic data modeled on publicly available per-90 statistics for outfield players in top-five European leagues. Each player has the following features:
| Feature | Description |
|---|---|
| goals_p90 | Non-penalty goals per 90 minutes |
| assists_p90 | Assists per 90 minutes |
| xg_p90 | Non-penalty expected goals per 90 |
| xa_p90 | Expected assists per 90 |
| progressive_passes_p90 | Progressive passes per 90 |
| progressive_carries_p90 | Progressive carries per 90 |
| key_passes_p90 | Key passes per 90 |
| tackles_p90 | Tackles per 90 |
| interceptions_p90 | Interceptions per 90 |
| pressures_p90 | Pressures per 90 |
| aerial_wins_p90 | Aerial duels won per 90 |
| touches_att_pen_p90 | Touches in the attacking penalty area per 90 |
| crosses_p90 | Crosses per 90 |
| dribbles_p90 | Successful dribbles per 90 |
Players must have at least 900 minutes played to be included, filtering out those with insufficient sample sizes.
Step 1: Data Preparation
Loading and Inspecting
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Load synthetic data (see code/case-study-code.py for generation)
df = pd.read_csv("player_stats_p90.csv")
print(f"Dataset shape: {df.shape}")
print(df.describe().round(3))
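Before going further we apply the 900-minute eligibility threshold from the Data Description. A minimal sketch, assuming the raw file carries a minutes column alongside the per-90 metrics (the column name is illustrative):
# Keep only players with a meaningful sample; a 'minutes' column is assumed here
df = df[df["minutes"] >= 900].reset_index(drop=True)
# reset_index keeps row positions aligned with X_scaled later in the pipeline
print(f"Players after eligibility filter: {len(df)}")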
Standardization
Per-90 metrics already control for playing time, but they have different scales. Goals per 90 typically range from 0 to 0.8, while pressures per 90 can range from 5 to 30. We standardize using z-scores:
$$ z_j = \frac{x_j - \mu_j}{\sigma_j} $$
features = [col for col in df.columns if col.endswith("_p90")]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
Correlation Analysis
Before clustering, we check for highly correlated feature pairs that might distort the distance metric:
import seaborn as sns
import matplotlib.pyplot as plt
corr = df[features].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(12, 10))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
center=0, vmin=-1, vmax=1)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.savefig("correlation_matrix.png", dpi=150)
plt.show()
We find that goals_p90 and xg_p90 are highly correlated ($r = 0.91$), as
are assists_p90 and xa_p90 ($r = 0.85$). We retain all features, noting that
correlated pairs double-count their shared signal in the Euclidean distances
used for clustering; projecting onto principal components before clustering
would remove this redundancy, though in this study PCA is used only for
visualization (Step 6).
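For completeness, a small sketch that pulls any strongly correlated pairs programmatically from the matrix computed above (the 0.8 cut-off is illustrative):
# Keep the lower triangle only, then list pairs above the threshold
pairs = corr.abs().where(~mask).stack()
print(pairs[pairs > 0.8].sort_values(ascending=False))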
Step 2: Determining the Number of Clusters
Elbow Method
from sklearn.cluster import KMeans
inertias = []
K_range = range(3, 16)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, "bo-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("Elbow Method for Optimal k")
plt.grid(True, alpha=0.3)
plt.savefig("elbow_plot.png", dpi=150)
plt.show()
Silhouette Analysis
from sklearn.metrics import silhouette_score
silhouettes = []
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
silhouettes.append(silhouette_score(X_scaled, labels))
plt.figure(figsize=(8, 5))
plt.plot(K_range, silhouettes, "rs-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs. k")
plt.grid(True, alpha=0.3)
plt.savefig("silhouette_plot.png", dpi=150)
plt.show()
The elbow plot shows a clear bend at $k = 7$, and the silhouette score peaks at $k = 7$ (score = 0.31). We proceed with $k = 7$.
Step 3: K-Means Clustering
optimal_k = 7
km = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)
df["cluster_kmeans"] = km.fit_predict(X_scaled)
# Examine cluster sizes
print(df["cluster_kmeans"].value_counts().sort_index())
Cluster Profiles
We compute each cluster's mean feature values (in the original per-90 units) to create profiles:
cluster_profiles = df.groupby("cluster_kmeans")[features].mean()
cluster_profiles.index = [f"Cluster {i}" for i in range(optimal_k)]
print(cluster_profiles.round(3))
Interpreting Clusters
After examining the centroids, we assign interpretive labels (a sketch of the inspection follows the table):
| Cluster | Dominant Traits | Interpreted Role |
|---|---|---|
| 0 | High tackles, interceptions, low goals | Defensive Midfielder / Centre-Back |
| 1 | High goals, xG, touches in box | Goal-Scoring Forward |
| 2 | High progressive passes, key passes | Creative Playmaker |
| 3 | High crosses, dribbles, progressive carries | Wide Attacker / Winger |
| 4 | High aerial wins, tackles, moderate passing | Aerial Centre-Back |
| 5 | High pressures, moderate all-round stats | Box-to-Box Midfielder |
| 6 | High crosses, tackles, moderate interceptions | Attacking Full-Back |
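The labels above come from reading the centroids in z-score space. A minimal sketch of that inspection, printing each cluster's most extreme standardized features:
# Standardized centroids: per-cluster means of the z-scored features
z_profiles = (
    pd.DataFrame(X_scaled, columns=features)
    .groupby(df["cluster_kmeans"]).mean()
)
for c, row in z_profiles.iterrows():
    print(f"Cluster {c}: high {row.nlargest(3).index.tolist()}, "
          f"low {row.nsmallest(2).index.tolist()}")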
Step 4: Hierarchical Clustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
Z = linkage(X_scaled, method="ward")
plt.figure(figsize=(16, 8))
dendrogram(Z, truncate_mode="lastp", p=30, leaf_rotation=90,
leaf_font_size=9, show_contracted=True)
plt.title("Hierarchical Clustering Dendrogram (Ward Linkage)")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.tight_layout()
plt.savefig("dendrogram.png", dpi=150)
plt.show()
df["cluster_hierarchical"] = fcluster(Z, t=optimal_k, criterion="maxclust")
The dendrogram reveals a clear two-way split at the top level: defenders vs. attackers/midfielders. At the next level, midfielders separate from attackers, and so on. This hierarchical structure aligns well with soccer intuition.
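To quantify how closely the two partitions agree, and to check them against the traditional position labels (the position column also used in Step 5), a sketch using the adjusted Rand index:
from sklearn.metrics import adjusted_rand_score
# 1.0 = identical partitions; values near 0 = chance-level agreement
ari = adjusted_rand_score(df["cluster_kmeans"], df["cluster_hierarchical"])
print(f"Adjusted Rand index, K-means vs. hierarchical: {ari:.2f}")
# Cross-tabulate data-driven roles against listed positions
print(pd.crosstab(df["cluster_kmeans"], df["position"]))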
Step 5: Gaussian Mixture Models
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=optimal_k, covariance_type="full",
n_init=5, random_state=42)
df["cluster_gmm"] = gmm.fit_predict(X_scaled)
# Soft assignments
proba = gmm.predict_proba(X_scaled)
df["max_cluster_prob"] = proba.max(axis=1)
# Players with uncertain assignments
uncertain = df[df["max_cluster_prob"] < 0.6]
print(f"Players with uncertain role assignment: {len(uncertain)}")
print(uncertain[["player_name", "position", "max_cluster_prob"]].head(10))
We find that roughly 15% of players lack a dominant cluster (maximum posterior probability below 0.6), often corresponding to versatile players who operate in hybrid roles (e.g., wing-backs who attack like wingers but defend like full-backs).
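To see which hybrid each uncertain player straddles, one can inspect the two largest posterior probabilities per player; a minimal sketch:
# Component indices ordered by posterior probability, largest first
top2 = np.argsort(proba, axis=1)[:, -2:][:, ::-1]
df["second_cluster"] = top2[:, 1]
df["second_cluster_prob"] = np.sort(proba, axis=1)[:, -2]
print(df.loc[uncertain.index,
             ["player_name", "cluster_gmm", "max_cluster_prob",
              "second_cluster", "second_cluster_prob"]].head(10))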
BIC for Model Selection
bics = []
for k in range(3, 16):
gmm_k = GaussianMixture(n_components=k, covariance_type="full",
n_init=5, random_state=42)
gmm_k.fit(X_scaled)
bics.append(gmm_k.bic(X_scaled))
plt.figure(figsize=(8, 5))
plt.plot(range(3, 16), bics, "g^-")
plt.xlabel("Number of Components")
plt.ylabel("BIC")
plt.title("BIC for Gaussian Mixture Model")
plt.grid(True, alpha=0.3)
plt.savefig("gmm_bic.png", dpi=150)
plt.show()
The BIC is minimized at $k = 8$, suggesting one additional cluster compared to the K-means elbow method. The extra cluster splits "Creative Playmaker" into deep-lying playmakers and advanced playmakers --- a meaningful distinction.
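A sketch of the refit behind that observation, cross-tabulating the eight-component assignments against the $k = 7$ K-means clusters to see which role splits:
gmm8 = GaussianMixture(n_components=8, covariance_type="full",
                       n_init=5, random_state=42)
labels8 = gmm8.fit_predict(X_scaled)
# The row whose members spread over two columns is the cluster that splits
print(pd.crosstab(df["cluster_kmeans"], labels8))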
Step 6: Visualization
PCA Projection
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
for ax, col, title in zip(
axes,
["cluster_kmeans", "cluster_hierarchical", "cluster_gmm"],
["K-Means", "Hierarchical", "GMM"]
):
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=df[col],
cmap="tab10", alpha=0.6, s=20)
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
ax.set_title(title)
ax.legend(*scatter.legend_elements(), title="Cluster", loc="best",
fontsize=8)
plt.suptitle("Player Role Clusters --- PCA Projection", fontsize=14)
plt.tight_layout()
plt.savefig("pca_comparison.png", dpi=150)
plt.show()
Radar Charts for Cluster Centroids
def radar_chart(centroids: pd.DataFrame, feature_names: list[str],
title: str) -> None:
"""Create a radar chart for cluster centroids."""
angles = np.linspace(0, 2 * np.pi, len(feature_names),
endpoint=False).tolist()
angles += angles[:1] # Close the polygon
fig, ax = plt.subplots(figsize=(10, 10),
subplot_kw=dict(polar=True))
for idx, row in centroids.iterrows():
values = row.tolist()
values += values[:1]
ax.plot(angles, values, "o-", linewidth=2, label=idx)
ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(feature_names, fontsize=8)
ax.set_title(title, fontsize=14, pad=20)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.0), fontsize=9)
plt.tight_layout()
plt.savefig("radar_chart.png", dpi=150)
plt.show()
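The chart is easiest to read with standardized centroids, since raw per-90 features live on very different scales. A usage sketch, recomputing the z-score profiles as in the cluster-labeling step:
# Standardized centroids keep all radar axes on a comparable scale
z_profiles = (
    pd.DataFrame(X_scaled, columns=features)
    .groupby(df["cluster_kmeans"]).mean()
)
z_profiles.index = [f"Cluster {i}" for i in range(optimal_k)]
radar_chart(z_profiles, features, "K-Means Cluster Centroids (z-scores)")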
Step 7: Player Similarity Search
With clusters established, we build a similarity search for the departing right-back:
from sklearn.metrics.pairwise import cosine_similarity
def find_similar_players(
target_player: str,
df: pd.DataFrame,
features: list[str],
X_scaled: np.ndarray,
top_n: int = 10
) -> pd.DataFrame:
"""Find the top-N most similar players by cosine similarity."""
target_idx = df[df["player_name"] == target_player].index[0]
target_vector = X_scaled[target_idx].reshape(1, -1)
similarities = cosine_similarity(target_vector, X_scaled).flatten()
df_result = df.copy()
df_result["similarity"] = similarities
return (
df_result[df_result["player_name"] != target_player]
.nlargest(top_n, "similarity")
[["player_name", "position", "cluster_kmeans", "similarity"] + features]
)
similar_players = find_similar_players("Target Right-Back", df, features,
X_scaled, top_n=10)
print(similar_players.to_string(index=False))
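Since the brief asks for role-alike candidates, one sensible refinement is to restrict the pool to the target's own cluster before ranking; a sketch:
# Keep only candidates assigned to the same K-means role as the target
target_cluster = df.loc[df["player_name"] == "Target Right-Back",
                        "cluster_kmeans"].iloc[0]
same_role = similar_players[similar_players["cluster_kmeans"] == target_cluster]
print(same_role.to_string(index=False))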
Results and Discussion
Key Findings
- Seven clusters emerged as the optimal number of data-driven roles, closely aligning with tactical intuition but offering finer granularity than traditional positions.
- The departing right-back fell into the "Attacking Full-Back" cluster (Cluster 6), characterized by high crossing rates, moderate tackles, and progressive carries.
- The similarity search identified three candidates from the same cluster who played at clubs within the buying club's budget range, including one left-back whose statistical profile was nearly identical, suggesting positional versatility.
- Gaussian Mixture Models revealed that two of the candidates had significant probability mass in the "Wide Attacker" cluster as well, indicating they could also be deployed further forward --- a tactical bonus.
Method Comparison
| Method | Silhouette Score | Interpretability | Soft Assignments |
|---|---|---|---|
| K-Means ($k=7$) | 0.31 | Good | No |
| Hierarchical (Ward, $k=7$) | 0.29 | Excellent (dendrogram) | No |
| GMM ($k=7$) | 0.28 | Good | Yes |
Limitations
- Per-90 metrics do not capture off-ball movement quality, communication, or leadership --- all critical for a right-back role.
- The 900-minute threshold excludes young prospects with limited playing time.
- Clustering is sensitive to the choice of features; a different feature set would produce different roles.
Recommendations
- Use K-means as the primary clustering method for its stability and interpretability.
- Supplement with GMM soft assignments to identify versatile players.
- Always validate cluster assignments with video analysis before making recruitment decisions.
- Update clusters each season to capture evolving tactical trends.
Code Reference
The complete code for this case study is available in
code/case-study-code.py (Section 1).