Case Study 1: Predicting Player Positions Using Unsupervised Learning

Background

In the 2023 summer transfer window, a mid-table club in one of Europe's top five leagues lost their starting right-back to a larger club. The sporting director tasked the analytics department with finding replacement candidates who matched the departing player's statistical profile --- not merely other right-backs, but players from any position who performed a similar on-pitch role.

This case study walks through the analytical pipeline the team used: unsupervised learning to discover data-driven player roles from per-90 performance metrics, followed by a similarity search to identify replacement candidates.

Objectives

  1. Cluster 500 outfield players into data-driven roles using multiple unsupervised learning algorithms.
  2. Compare K-means, hierarchical clustering, and Gaussian Mixture Models.
  3. Validate clusters against traditional positional labels.
  4. Build a player similarity engine based on cluster membership and feature proximity.

Data Description

We use synthetic data modeled on publicly available per-90 statistics for outfield players in top-five European leagues. Each player has the following features:

| Feature | Description |
| --- | --- |
| goals_p90 | Non-penalty goals per 90 minutes |
| assists_p90 | Assists per 90 minutes |
| xg_p90 | Non-penalty expected goals per 90 |
| xa_p90 | Expected assists per 90 |
| progressive_passes_p90 | Progressive passes per 90 |
| progressive_carries_p90 | Progressive carries per 90 |
| key_passes_p90 | Key passes per 90 |
| tackles_p90 | Tackles per 90 |
| interceptions_p90 | Interceptions per 90 |
| pressures_p90 | Pressures per 90 |
| aerial_wins_p90 | Aerial duels won per 90 |
| touches_att_pen_p90 | Touches in the attacking penalty area per 90 |
| crosses_p90 | Crosses per 90 |
| dribbles_p90 | Successful dribbles per 90 |

Players must have at least 900 minutes played to be included, filtering out those with insufficient sample sizes.
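As a concrete sketch of this filter, assuming the raw export carries a minutes column (the column name and the file player_stats_raw.csv are hypothetical here):

import pandas as pd

MIN_MINUTES = 900  # eligibility threshold from the study design

df_raw = pd.read_csv("player_stats_raw.csv")  # hypothetical raw export
df = df_raw[df_raw["minutes"] >= MIN_MINUTES].reset_index(drop=True)
print(f"Players retained: {len(df)}")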

Step 1: Data Preparation

Loading and Inspecting

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load synthetic data (see code/case-study-code.py for generation)
df = pd.read_csv("player_stats_p90.csv")
print(f"Dataset shape: {df.shape}")
print(df.describe().round(3))

Standardization

Per-90 metrics already control for playing time, but they have different scales. Goals per 90 typically range from 0 to 0.8, while pressures per 90 can range from 5 to 30. We standardize using z-scores:

$$ z_j = \frac{x_j - \mu_j}{\sigma_j} $$

features = [col for col in df.columns if col.endswith("_p90")]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])

Correlation Analysis

Before clustering, we check for highly correlated feature pairs that might distort the distance metric:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df[features].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(12, 10))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, vmin=-1, vmax=1)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.savefig("correlation_matrix.png", dpi=150)
plt.show()

We find that goals_p90 and xg_p90 are highly correlated ($r = 0.91$), as are assists_p90 and xa_p90 ($r = 0.85$). We retain all features but note that PCA will naturally handle this redundancy.
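To surface such pairs programmatically rather than by reading the heatmap, a short helper over the corr matrix computed above lists every pair beyond a threshold (the 0.8 cutoff is our illustrative choice):

# List feature pairs whose absolute correlation exceeds a threshold.
# The upper triangle (k=1) ensures each pair appears exactly once;
# the 0.8 cutoff is an illustrative choice.
high_corr = (
    corr.where(np.triu(np.ones_like(corr, dtype=bool), k=1))
    .stack()
    .loc[lambda s: s.abs() > 0.8]
    .sort_values(ascending=False)
)
print(high_corr)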

Step 2: Determining the Number of Clusters

Elbow Method

from sklearn.cluster import KMeans

inertias = []
K_range = range(3, 16)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, "bo-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("Elbow Method for Optimal k")
plt.grid(True, alpha=0.3)
plt.savefig("elbow_plot.png", dpi=150)
plt.show()

Silhouette Analysis

from sklearn.metrics import silhouette_score

silhouettes = []
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    silhouettes.append(silhouette_score(X_scaled, labels))

plt.figure(figsize=(8, 5))
plt.plot(K_range, silhouettes, "rs-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs. k")
plt.grid(True, alpha=0.3)
plt.savefig("silhouette_plot.png", dpi=150)
plt.show()

The elbow plot shows a clear bend at $k = 7$, and the silhouette score peaks at $k = 7$ (score = 0.31). We proceed with $k = 7$.

Step 3: K-Means Clustering

optimal_k = 7
km = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)
df["cluster_kmeans"] = km.fit_predict(X_scaled)

# Examine cluster sizes
print(df["cluster_kmeans"].value_counts().sort_index())

Cluster Profiles

We compute each cluster's mean feature values, expressed on the original per-90 scale for readability:

# Mean raw per-90 values per cluster (equivalent to inverse-transforming
# the mean standardized values, since StandardScaler is affine)
cluster_profiles = (
    df.groupby("cluster_kmeans")[features].mean()
    .rename(index=lambda i: f"Cluster {i}")
)
print(cluster_profiles.round(3))

Interpreting Clusters

After examining the centroids, we assign interpretive labels:

| Cluster | Dominant Traits | Interpreted Role |
| --- | --- | --- |
| 0 | High tackles, interceptions, low goals | Defensive Midfielder / Centre-Back |
| 1 | High goals, xG, touches in box | Goal-Scoring Forward |
| 2 | High progressive passes, key passes | Creative Playmaker |
| 3 | High crosses, dribbles, progressive carries | Wide Attacker / Winger |
| 4 | High aerial wins, tackles, moderate passing | Aerial Centre-Back |
| 5 | High pressures, moderate all-round stats | Box-to-Box Midfielder |
| 6 | High crosses, tackles, moderate interceptions | Attacking Full-Back |
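These labels can be attached to the DataFrame for downstream reporting; a minimal sketch (the role_kmeans column name is our choice, and the strings are our interpretation of the centroids, not algorithm output):

# Map cluster ids to interpreted role labels (labels are analyst
# interpretations, not produced by K-means itself)
role_labels = {
    0: "Defensive Midfielder / Centre-Back",
    1: "Goal-Scoring Forward",
    2: "Creative Playmaker",
    3: "Wide Attacker / Winger",
    4: "Aerial Centre-Back",
    5: "Box-to-Box Midfielder",
    6: "Attacking Full-Back",
}
df["role_kmeans"] = df["cluster_kmeans"].map(role_labels)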

Step 4: Hierarchical Clustering

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

Z = linkage(X_scaled, method="ward")

plt.figure(figsize=(16, 8))
dendrogram(Z, truncate_mode="lastp", p=30, leaf_rotation=90,
           leaf_font_size=9, show_contracted=True)
plt.title("Hierarchical Clustering Dendrogram (Ward Linkage)")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.tight_layout()
plt.savefig("dendrogram.png", dpi=150)
plt.show()

df["cluster_hierarchical"] = fcluster(Z, t=optimal_k, criterion="maxclust")

The dendrogram reveals a clear two-way split at the top level: defenders vs. attackers/midfielders. At the next level, midfielders separate from attackers, and so on. This hierarchical structure aligns well with soccer intuition.
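How closely do the two partitions agree? The adjusted Rand index is invariant to label permutation, so the differing label conventions (K-means uses 0-6, fcluster uses 1-7) do not matter:

from sklearn.metrics import adjusted_rand_score

# Agreement between the K-means and hierarchical partitions
ari = adjusted_rand_score(df["cluster_kmeans"], df["cluster_hierarchical"])
print(f"Adjusted Rand index (K-means vs. hierarchical): {ari:.3f}")

# Contingency table: which roles do the two methods carve up differently?
print(pd.crosstab(df["cluster_kmeans"], df["cluster_hierarchical"]))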

Step 5: Gaussian Mixture Models

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=optimal_k, covariance_type="full",
                       n_init=5, random_state=42)
df["cluster_gmm"] = gmm.fit_predict(X_scaled)

# Soft assignments
proba = gmm.predict_proba(X_scaled)
df["max_cluster_prob"] = proba.max(axis=1)

# Players with uncertain assignments
uncertain = df[df["max_cluster_prob"] < 0.6]
print(f"Players with uncertain role assignment: {len(uncertain)}")
print(uncertain[["player_name", "position", "max_cluster_prob"]].head(10))

We find that approximately 15% of players have no dominant cluster, often corresponding to versatile players who operate in hybrid roles (e.g., wing-backs who attack like wingers but defend like full-backs).
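One way to surface these hybrids is to compare each player's two largest membership probabilities; both thresholds below are illustrative choices:

# Second-largest membership probability per player
sorted_proba = np.sort(proba, axis=1)  # ascending along each row
df["second_cluster_prob"] = sorted_proba[:, -2]

# Hybrid profile: no dominant cluster, but a strong secondary one
# (the 0.6 and 0.3 thresholds are illustrative)
hybrids = df[(df["max_cluster_prob"] < 0.6) & (df["second_cluster_prob"] > 0.3)]
print(hybrids[["player_name", "position",
               "max_cluster_prob", "second_cluster_prob"]].head(10))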

BIC for Model Selection

bics = []
for k in range(3, 16):
    gmm_k = GaussianMixture(n_components=k, covariance_type="full",
                             n_init=5, random_state=42)
    gmm_k.fit(X_scaled)
    bics.append(gmm_k.bic(X_scaled))

plt.figure(figsize=(8, 5))
plt.plot(range(3, 16), bics, "g^-")
plt.xlabel("Number of Components")
plt.ylabel("BIC")
plt.title("BIC for Gaussian Mixture Model")
plt.grid(True, alpha=0.3)
plt.savefig("gmm_bic.png", dpi=150)
plt.show()

The BIC is minimized at $k = 8$, suggesting one additional cluster compared to the K-means elbow method. The extra cluster splits "Creative Playmaker" into deep-lying playmakers and advanced playmakers --- a meaningful distinction.
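To check this interpretation, we can refit at $k = 8$ and cross-tabulate against the $k = 7$ K-means labels; if the reading above is right, the Creative Playmaker row should split across two GMM components (a verification sketch):

# Refit at the BIC-optimal k = 8 and compare with the k = 7 K-means labels
gmm8 = GaussianMixture(n_components=8, covariance_type="full",
                       n_init=5, random_state=42)
df["cluster_gmm8"] = gmm8.fit_predict(X_scaled)

# Rows: K-means roles; columns: the 8 GMM components
print(pd.crosstab(df["cluster_kmeans"], df["cluster_gmm8"]))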

Step 6: Visualization

PCA Projection

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for ax, col, title in zip(
    axes,
    ["cluster_kmeans", "cluster_hierarchical", "cluster_gmm"],
    ["K-Means", "Hierarchical", "GMM"]
):
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=df[col],
                         cmap="tab10", alpha=0.6, s=20)
    ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
    ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
    ax.set_title(title)
    ax.legend(*scatter.legend_elements(), title="Cluster", loc="best",
              fontsize=8)

plt.suptitle("Player Role Clusters --- PCA Projection", fontsize=14)
plt.tight_layout()
plt.savefig("pca_comparison.png", dpi=150)
plt.show()

Radar Charts for Cluster Centroids

def radar_chart(centroids: pd.DataFrame, feature_names: list[str],
                title: str) -> None:
    """Create a radar chart for cluster centroids."""
    angles = np.linspace(0, 2 * np.pi, len(feature_names),
                         endpoint=False).tolist()
    angles += angles[:1]  # Close the polygon

    fig, ax = plt.subplots(figsize=(10, 10),
                           subplot_kw=dict(polar=True))

    for idx, row in centroids.iterrows():
        values = row.tolist()
        values += values[:1]
        ax.plot(angles, values, "o-", linewidth=2, label=idx)
        ax.fill(angles, values, alpha=0.1)

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(feature_names, fontsize=8)
    ax.set_title(title, fontsize=14, pad=20)
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.0), fontsize=9)
    plt.tight_layout()
    plt.savefig("radar_chart.png", dpi=150)
    plt.show()
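The function is not invoked in the listing above; a minimal call computes standardized centroids (so every axis shares a comparable z-score scale) and passes them in:

# Standardized centroids: each axis is in z-score units
centroids_scaled = (
    pd.DataFrame(X_scaled, columns=features)
    .groupby(df["cluster_kmeans"]).mean()
    .rename(index=lambda i: f"Cluster {i}")
)
radar_chart(centroids_scaled, features, "Cluster Centroids (Standardized)")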

Step 7: Player Similarity Engine

With clusters established, we build a similarity search for the departing right-back:

from sklearn.metrics.pairwise import cosine_similarity

def find_similar_players(
    target_player: str,
    df: pd.DataFrame,
    features: list[str],
    X_scaled: np.ndarray,
    top_n: int = 10
) -> pd.DataFrame:
    """Find the top-N most similar players by cosine similarity."""
    # Assumes df keeps a default RangeIndex aligned row-for-row with X_scaled
    target_idx = df[df["player_name"] == target_player].index[0]
    target_vector = X_scaled[target_idx].reshape(1, -1)

    similarities = cosine_similarity(target_vector, X_scaled).flatten()
    df_result = df.copy()
    df_result["similarity"] = similarities

    return (
        df_result[df_result["player_name"] != target_player]
        .nlargest(top_n, "similarity")
        [["player_name", "position", "cluster_kmeans", "similarity"] + features]
    )

similar_players = find_similar_players("Target Right-Back", df, features,
                                        X_scaled, top_n=10)
print(similar_players.to_string(index=False))
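Objective 4 calls for combining feature proximity with cluster membership; a simple follow-up filters the ranked candidates to the target's own cluster (a sketch using the columns defined above):

# Keep only candidates from the target's K-means cluster
target_cluster = df.loc[
    df["player_name"] == "Target Right-Back", "cluster_kmeans"
].iloc[0]
same_cluster = similar_players[similar_players["cluster_kmeans"] == target_cluster]
print(same_cluster[["player_name", "position", "similarity"]])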

Results and Discussion

Key Findings

  1. Seven clusters emerged as the optimal number of data-driven roles, closely aligning with tactical intuition but offering finer granularity than traditional positions.

  2. The departing right-back fell into the "Attacking Full-Back" cluster (Cluster 6), characterized by high crossing rates, moderate tackles, and progressive carries.

  3. The similarity search identified three candidates from the same cluster who played at clubs within the buying club's budget range, including one left-back whose statistical profile was nearly identical, suggesting positional versatility.

  4. Gaussian Mixture Models revealed that two of the candidates had significant probability mass in the "Wide Attacker" cluster as well, indicating they could also be deployed further forward --- a tactical bonus.

Method Comparison

| Method | Silhouette Score | Interpretability | Soft Assignments |
| --- | --- | --- | --- |
| K-Means ($k=7$) | 0.31 | Good | No |
| Hierarchical (Ward, $k=7$) | 0.29 | Excellent (dendrogram) | No |
| GMM ($k=7$) | 0.28 | Good | Yes |
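The silhouette column can be reproduced directly from the three label columns (reusing silhouette_score imported in Step 2):

# Recompute the silhouette score for each method's partition
for col, name in [("cluster_kmeans", "K-Means"),
                  ("cluster_hierarchical", "Hierarchical"),
                  ("cluster_gmm", "GMM")]:
    score = silhouette_score(X_scaled, df[col])
    print(f"{name}: silhouette = {score:.2f}")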

Limitations

  • Per-90 metrics do not capture off-ball movement quality, communication, or leadership --- all critical for a right-back role.
  • The 900-minute threshold excludes young prospects with limited playing time.
  • Clustering is sensitive to the choice of features; a different feature set would produce different roles.

Recommendations

  1. Use K-means as the primary clustering method for its stability and interpretability.
  2. Supplement with GMM soft assignments to identify versatile players.
  3. Always validate cluster assignments with video analysis before making recruitment decisions.
  4. Update clusters each season to capture evolving tactical trends.

Code Reference

The complete code for this case study is available in code/case-study-code.py (Section 1).