Case Study 2: Graph Neural Networks for Formation Recognition
Background
Identifying a team's formation during a match is a deceptively difficult problem. While a television commentator might describe a team as playing "4-3-3," the reality on the pitch is far more fluid. Defensive and attacking shapes differ, formations morph during transitions, and individual players interpret their roles with personal variation. Traditional template-matching approaches --- measuring the Euclidean distance between player positions and canonical formation templates --- perform reasonably well on static snapshots but struggle with the dynamic, noisy reality of professional soccer.
This case study describes how a Bundesliga club's analytics department developed a Graph Neural Network (GNN) to classify formations from optical tracking data, achieving 94.2% accuracy across 14 formation classes and providing insights into how formations evolve within matches.
The Problem
The club's existing formation detection system used a simple approach:
- Average player positions over 5-minute windows.
- Compute the distance from the averaged positions to a library of 10 canonical formation templates.
- Assign the formation label corresponding to the closest template.
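A minimal sketch of this baseline, assuming team-averaged positions are stored as NumPy arrays and the template library is a dict of 10 canonical shapes (the function and variable names are illustrative, not the club's actual code):

```python
import numpy as np

def classify_by_template(avg_positions: np.ndarray, templates: dict) -> str:
    """Assign the formation label of the closest canonical template.

    avg_positions: (10, 2) outfield positions for one team, averaged over a
                   5-minute window and expressed in the same normalized pitch
                   coordinates as the templates.
    templates:     mapping from formation label to a (10, 2) template array.
    """
    best_label, best_dist = None, np.inf
    for label, template in templates.items():
        # Sum of player-wise Euclidean distances. A real system also has to
        # assign players to template slots (e.g., via the Hungarian algorithm);
        # that matching step is glossed over here.
        dist = np.linalg.norm(avg_positions - template, axis=1).sum()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```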
This approach had three significant limitations:
- Averaging destroyed temporal information. Formations that oscillated between two shapes (e.g., a 4-4-2 that compressed into a 4-2-2-2 during pressing) were classified as neither, often defaulting to a catch-all "4-3-3" label.
- Template rigidity. Real formations are not symmetric. A "4-3-3" with a deep-lying midfielder and two advanced central midfielders looks different from one with a flat midfield line. The template library could not capture this variation.
- Missing relational information. The system treated each player's position independently. It did not capture the relationships between players --- the gaps between defenders, the vertical compactness of the midfield, the width of the front line --- that define a formation's tactical character.
Data and Graph Construction
Tracking Data
The team used optical tracking data at 25 Hz from two full Bundesliga seasons (approximately 612 matches). Each frame provided $(x, y)$ coordinates and $(v_x, v_y)$ velocity vectors for all 22 players and the ball.
Graph Representation
Each tracking frame was represented as a graph $G = (V, E, \mathbf{X}, \mathbf{E}_{\text{attr}})$:
Nodes ($V$): 20 nodes, one per outfield player (goalkeepers were excluded, as their positions carry minimal formation information).
Node features ($\mathbf{X} \in \mathbb{R}^{20 \times d_v}$): Each node was assigned a feature vector of dimension $d_v = 8$:
| Feature | Description |
|---|---|
| $x$ (normalized) | Pitch x-coordinate, normalized to [0, 1] |
| $y$ (normalized) | Pitch y-coordinate, normalized to [0, 1] |
| $v_x$ | Velocity in x-direction (m/s) |
| $v_y$ | Velocity in y-direction (m/s) |
| Speed | $\sqrt{v_x^2 + v_y^2}$ |
| Team indicator | 0 for team A, 1 for team B |
| Distance to ball | Euclidean distance to ball (meters) |
| Distance to own goal | Euclidean distance to own goal center |
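A sketch of the per-player feature computation in the table above, assuming positions and velocities arrive as NumPy arrays (shapes and helper names are assumptions):

```python
import numpy as np

def node_features(xy, vel, team_id, ball_xy, own_goal_xy,
                  pitch_dims=(105.0, 68.0)):
    """Build the (20, 8) node feature matrix for the outfield players.

    xy:          (20, 2) positions in meters
    vel:         (20, 2) velocities in m/s
    team_id:     (20,) team indicator, 0 for team A and 1 for team B
    ball_xy:     (2,) ball position in meters
    own_goal_xy: (20, 2) center of each player's own goal
    """
    length, width = pitch_dims
    x_norm = xy[:, 0] / length                      # normalized to [0, 1]
    y_norm = xy[:, 1] / width
    speed = np.linalg.norm(vel, axis=1)             # sqrt(vx^2 + vy^2)
    dist_ball = np.linalg.norm(xy - ball_xy, axis=1)
    dist_goal = np.linalg.norm(xy - own_goal_xy, axis=1)
    return np.column_stack([x_norm, y_norm, vel[:, 0], vel[:, 1],
                            speed, team_id, dist_ball, dist_goal])
```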
Edges ($E$): Edges were constructed using two rules:
- Intra-team edges: Fully connected within each team (each outfield player connected to all 9 teammates), yielding $2 \times \binom{10}{2} = 90$ intra-team edges.
- Inter-team edges: Connected players on opposing teams within a 15-meter radius, capturing the local marking structure.
Edge features ($\mathbf{E}_{\text{attr}} \in \mathbb{R}^{|E| \times d_e}$): Each edge carried $d_e = 4$ features:
| Feature | Description |
|---|---|
| Euclidean distance | Distance between connected players (meters) |
| Relative angle | Angle of the vector connecting the players |
| Same team indicator | 1 if both players on same team, 0 otherwise |
| Relative speed | Difference in speed between connected players |
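These edge rules and attributes can be assembled into a single graph object. The sketch below assumes PyTorch Geometric (consistent with the GAT layers used later) and reuses the node_features output from the previous snippet; edges are stored in both directions, as a directed edge_index expects.

```python
import numpy as np
import torch
from torch_geometric.data import Data

def build_graph(feats, xy, team_id, speed, inter_radius=15.0):
    """feats: (20, 8) node features; xy: (20, 2) positions in meters."""
    src, dst, edge_attr = [], [], []
    for i in range(20):
        for j in range(20):
            if i == j:
                continue
            same_team = bool(team_id[i] == team_id[j])
            dist = float(np.linalg.norm(xy[i] - xy[j]))
            # Intra-team edges: fully connected within each team.
            # Inter-team edges: only opponents within 15 meters.
            if same_team or dist <= inter_radius:
                dx, dy = xy[j] - xy[i]
                src.append(i)
                dst.append(j)
                edge_attr.append([dist,                          # distance (m)
                                  float(np.arctan2(dy, dx)),     # relative angle
                                  float(same_team),              # same-team flag
                                  float(speed[i] - speed[j])])   # relative speed
    return Data(x=torch.tensor(feats, dtype=torch.float),
                edge_index=torch.tensor([src, dst], dtype=torch.long),
                edge_attr=torch.tensor(edge_attr, dtype=torch.float))
```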
Temporal Aggregation
Rather than classifying single frames (which are noisy), the team averaged node features over 30-second windows (750 frames at 25 Hz). This smoothed transient movements while preserving the overall shape. Edge structure was recomputed at each window based on the averaged positions.
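A minimal sketch of that aggregation, assuming the 750 frames of one window are stacked into arrays (the edge set is then rebuilt from the averaged positions, e.g. with the build_graph sketch above):

```python
import numpy as np

def aggregate_window(frame_feats: np.ndarray, frame_xy: np.ndarray):
    """Average one 30-second window (750 frames at 25 Hz).

    frame_feats: (750, 20, 8) per-frame node features
    frame_xy:    (750, 20, 2) per-frame positions in meters
    Returns window-averaged features and positions.
    """
    return frame_feats.mean(axis=0), frame_xy.mean(axis=0)
```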
Labeling
Formation labels were assigned by the club's video analysis team, who reviewed each match and annotated formation changes with timestamps. The labeling taxonomy included 14 formations:
| Code | Formation | Frequency |
|---|---|---|
| 442 | 4-4-2 | 16.2% |
| 433 | 4-3-3 | 20.7% |
| 4231 | 4-2-3-1 | 18.3% |
| 4141 | 4-1-4-1 | 7.8% |
| 352 | 3-5-2 | 10.4% |
| 343 | 3-4-3 | 8.1% |
| 532 | 5-3-2 | 9.2% |
| 541 | 5-4-1 | 4.9% |
| 4411 | 4-4-1-1 | 7.1% |
| 4321 | 4-3-2-1 | 3.8% |
| 4222 | 4-2-2-2 | 3.2% |
| 3421 | 3-4-2-1 | 2.8% |
| 451 | 4-5-1 | 2.3% |
| other | Other/transitional | 1.2% |
Each labeled window yielded one training example. Across two seasons and two teams per match, the dataset contained approximately 73,000 labeled graph instances.
Model Architecture
GNN Design
The team implemented a Graph Attention Network (GAT) with the following architecture:
Input: Graph with 20 nodes, 8 features each
GAT Layer 1: 8 -> 32, 4 attention heads (output: 128-dim per node)
Batch Normalization + ELU
GAT Layer 2: 128 -> 32, 4 attention heads (output: 128-dim per node)
Batch Normalization + ELU
Team-level Pooling: Separate mean pooling for each team (128-dim each)
Concatenation: 256-dim
Fully Connected: 256 -> 64 (ReLU) + Dropout(0.3)
Output: 64 -> 14 (Softmax, one class per formation)
Team-Level Pooling
A key architectural decision was to pool node representations separately for each team before concatenation. This kept the two teams' structures from being conflated in a single graph-level embedding: the classification head produced a 14-class output for the team of interest, with the opposing team's pooled representation serving as context.
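A sketch of this architecture in PyTorch Geometric; GATConv and its heads/edge_dim arguments are real PyG APIs, but the module as a whole is an illustrative reconstruction rather than the club's implementation, and it processes one graph at a time for simplicity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class FormationGAT(nn.Module):
    """Two GAT layers, team-level mean pooling, 14-way formation classifier."""

    def __init__(self, in_dim=8, hidden=32, heads=4, edge_dim=4, n_classes=14):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads, edge_dim=edge_dim)
        self.bn1 = nn.BatchNorm1d(hidden * heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads, edge_dim=edge_dim)
        self.bn2 = nn.BatchNorm1d(hidden * heads)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden * heads, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )

    def forward(self, x, edge_index, edge_attr, team_mask):
        """team_mask: (num_nodes,) bool tensor, True for the team of interest."""
        h = F.elu(self.bn1(self.gat1(x, edge_index, edge_attr)))
        h = F.elu(self.bn2(self.gat2(h, edge_index, edge_attr)))
        # Pool each team's node embeddings separately, team of interest first.
        own = h[team_mask].mean(dim=0)
        opp = h[~team_mask].mean(dim=0)
        # Returns logits; the softmax is applied inside the cross-entropy loss.
        return self.head(torch.cat([own, opp]).unsqueeze(0))
```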
Attention Mechanism
The GAT attention mechanism learned to weight different player relationships:
$$\alpha_{ij} = \frac{\exp(\text{LeakyReLU}(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_j]))}{\sum_{k \in \mathcal{N}(i)} \exp(\text{LeakyReLU}(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_k]))}$$
Multi-head attention allowed different heads to capture different types of relationships (e.g., one head might focus on defensive line compactness while another captures the width of the midfield).
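For a single head, the coefficient $\alpha_{ij}$ can be written out directly; the snippet below is a worked NumPy illustration of the formula above, not the PyG internals:

```python
import numpy as np

def gat_attention(h, W, a, neighbors, i, slope=0.2):
    """Compute alpha_ij for node i over its neighborhood N(i).

    h: (N, d) node features; W: (d_out, d) weight matrix;
    a: (2 * d_out,) attention vector; neighbors: indices in N(i).
    """
    def leaky_relu(z):
        return np.where(z > 0, z, slope * z)

    Wh = h @ W.T                                           # (N, d_out)
    logits = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                       for j in neighbors])
    e = np.exp(logits - logits.max())                      # softmax over N(i)
    return dict(zip(neighbors, e / e.sum()))
```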
Training Details
- Optimizer: AdamW, learning rate $5 \times 10^{-4}$, weight decay $1 \times 10^{-3}$
- Batch size: 64 graphs
- Epochs: 100 with early stopping (patience = 10)
- Loss function: Cross-entropy with inverse-frequency class weighting
- Data augmentation: Horizontal pitch mirroring (formation labels are invariant under left-right mirroring)
- Train/Val/Test split: training (70%) and validation (15%) windows drawn from Season 1, with the test set (15%) drawn from Season 2
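The settings above translate into a fairly standard training loop. The condensed sketch below keeps the single-graph forward pass of the FormationGAT sketch and emulates the batch size of 64 with gradient accumulation; the actual pipeline would more likely use PyG mini-batching, and the validation pass, early stopping, and mirroring augmentation are abbreviated to comments:

```python
import torch
import torch.nn as nn

def train(model, train_set, class_weights, epochs=100, batch_size=64, lr=5e-4):
    """train_set: iterable of (graph, team_mask, label) tuples, where graph is
    a PyG Data object and label a scalar LongTensor class index.
    class_weights: (14,) tensor of inverse-frequency weights.
    Horizontal pitch mirroring would be applied when building train_set."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-3)
    loss_fn = nn.CrossEntropyLoss(weight=class_weights)
    for epoch in range(epochs):
        model.train()
        opt.zero_grad()
        for step, (g, team_mask, label) in enumerate(train_set, start=1):
            logits = model(g.x, g.edge_index, g.edge_attr, team_mask)
            loss = loss_fn(logits, label.view(1)) / batch_size
            loss.backward()                 # accumulate gradients over the batch
            if step % batch_size == 0:
                opt.step()
                opt.zero_grad()
        # Validation pass and early stopping (patience = 10) omitted for brevity.
```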
Results
Classification Performance
| Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Template matching (baseline) | 78.3% | 0.71 | 0.77 |
| MLP on averaged positions | 84.1% | 0.79 | 0.83 |
| GCN (2 layers) | 91.6% | 0.88 | 0.91 |
| GAT (chosen model) | 94.2% | 0.92 | 0.94 |
| GAT + temporal (3 consecutive windows) | 95.1% | 0.93 | 0.95 |
The GAT significantly outperformed both the template-matching baseline and a simple MLP that used concatenated averaged positions. The GCN (which uses fixed, uniform neighbor weighting) achieved strong results but fell short of the GAT, suggesting that learned attention weights captured meaningful tactical relationships.
Per-Formation Analysis
The confusion matrix revealed systematic patterns:
- High accuracy (>95%): 4-4-2, 4-3-3, 3-5-2, 5-4-1. These formations have distinctive spatial signatures.
- Moderate accuracy (85-95%): 4-2-3-1, 4-1-4-1, 3-4-3. These formations are often confused with related shapes (e.g., 4-2-3-1 confused with 4-3-3 when the attacking midfielder pushes forward).
- Lower accuracy (75-85%): 4-4-1-1, 4-3-2-1. These less common formations had fewer training examples and subtle differences from more common alternatives.
Attention Weight Analysis
The most revealing aspect of the GNN model was the interpretability of its attention weights. The team extracted and analyzed the learned attention patterns:
Defensive line detection: One attention head consistently assigned the highest weights to edges between adjacent defenders, effectively learning to identify the defensive line without explicit labeling. The horizontal spread of high-attention edges between defenders directly encoded whether the team was playing with 3, 4, or 5 at the back.
Midfield structure: Another attention head focused on the vertical distances between midfield players, distinguishing flat midfields (4-4-2) from layered midfields (4-2-3-1, 4-1-4-1).
Marking relationships: Inter-team attention weights were highest between spatially proximate opponents, effectively learning marking assignments. This information helped the model distinguish between formations that look similar in isolation but interact differently with the opponent's shape.
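Per-edge attention weights can be pulled out of a trained GATConv layer through its return_attention_weights argument (a real PyG option); the helper below and the way the weights are grouped afterwards are illustrative:

```python
import torch

@torch.no_grad()
def edge_attention(model, g):
    """Return attention weights from the first GAT layer of FormationGAT.

    g: a PyG Data object as built earlier. Returns (edge_index, alpha), where
    alpha has shape (num_edges, num_heads), one column per attention head.
    Note: GATConv adds self-loops internally, so those edges appear as well.
    """
    model.eval()
    _, (edge_index, alpha) = model.gat1(
        g.x, g.edge_index, g.edge_attr, return_attention_weights=True)
    return edge_index, alpha
```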
In-Match Formation Tracking
Transition Detection
By running the model on overlapping 30-second windows with a 10-second stride, the team produced a continuous formation classification timeline for each match. Formation transitions appeared as sustained changes in the predicted class.
A Hidden Markov Model (HMM) was layered on top of the raw GNN predictions to smooth spurious transitions. The transition matrix was calibrated from the training data, encoding the empirical probability of switching from one formation to another. This reduced false transition detections from 14.3 per match (raw GNN) to 2.1 per match (GNN + HMM), while maintaining a median detection delay of 47 seconds from the actual change.
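The smoothing step can be reproduced with a standard Viterbi decode over the per-window class probabilities. The sketch below treats the GNN posteriors as emission scores, a common simplification, and the transition matrix stands in for the one calibrated from the labeled training timelines:

```python
import numpy as np

def viterbi_smooth(probs, trans, prior=None):
    """Smooth a sequence of per-window formation probabilities.

    probs: (T, 14) softmax outputs from the GNN, one row per 30-second window
           taken at a 10-second stride.
    trans: (14, 14) formation-to-formation transition matrix.
    Returns the most likely formation index for each window.
    """
    T, K = probs.shape
    prior = np.full(K, 1.0 / K) if prior is None else prior
    log_p, log_t = np.log(probs + 1e-12), np.log(trans + 1e-12)
    delta = np.log(prior + 1e-12) + log_p[0]
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_t       # scores[i, j]: prev i -> current j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_p[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):            # backtrack the best path
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```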
Case Example: Match Analysis
In a key match, the model detected that the opponent shifted from a 4-2-3-1 to a 3-4-3 at minute 58 (within 40 seconds of the analyst's manual annotation). The shift corresponded to a tactical substitution, and the automated detection enabled the analytics team to quickly pull pre-prepared tactical adjustments for the coaching staff.
The model also detected a subtle within-possession shape change: the opponent's 4-2-3-1 compressed into a 4-4-2 when defending deep but expanded into a 4-3-3 when pressing high. This oscillation, invisible to the template-matching system, was clearly visible in the GNN's window-by-window output.
Practical Applications
Pre-Match Preparation
The formation classification model was integrated into the opponent analysis workflow:
- Formation profiles: For each upcoming opponent, the model generated a match-by-match formation timeline across their season, identifying their primary and secondary formations and the conditions under which they switched.
- Formation tendencies by game state: The model revealed that some teams shifted formation when trailing (e.g., switching from 5-3-2 to 4-3-3 after conceding), providing actionable intelligence for game management.
- Set-piece formations: By applying the model to set-piece phases specifically, the team identified how opponents organized defensively at corner kicks and free kicks.
Recruitment
The formation classification model supported recruitment by embedding each player's contribution within the tactical context of their team's shape. A central midfielder's statistical profile means something different in a 4-2-3-1 (as the "2") versus a 4-3-3 (as one of three midfielders). By conditioning performance metrics on formation context, the scouting team achieved more accurate player comparisons.
Limitations
- Goalkeeper exclusion. Goalkeepers were excluded from the graph, but sweeper-keeper behavior can influence the effective defensive line height and thus the formation classification. Future versions would include the goalkeeper as a node.
- Positional averaging. The 30-second averaging window smoothed short-lived tactical adjustments. An event-triggered approach (re-evaluating formation only during in-possession or out-of-possession phases) might provide more tactically meaningful classifications.
- Label subjectivity. Formation labels were assigned by human analysts, introducing subjectivity. Different analysts might label the same phase as "4-2-3-1" or "4-3-3" depending on their interpretation of one player's role. A consensus labeling protocol would improve ground truth quality.
- Computational cost. While inference was fast enough for post-match use (approximately 15 ms per graph), real-time deployment at 25 Hz would require model optimization or GPU acceleration at the stadium edge.
Discussion Questions
- The model uses a 15-meter radius to determine inter-team edges. How would performance change with a smaller or larger radius? What tactical information might be gained or lost?
- The team chose team-level pooling rather than global graph pooling. What are the tradeoffs? Could a model that considers both teams jointly discover inter-team tactical patterns?
- Formation labels are inherently discrete (e.g., "4-3-3" or "4-4-2"), but real formations exist on a continuum. How might you modify the model to produce a continuous formation embedding rather than a discrete classification?
- The HMM smoothing reduced false transitions but introduced detection delay. How would you optimize this tradeoff for different use cases (live match support vs. post-match analysis)?
- How would you extend this model to detect formation changes at a sub-team level (e.g., one side of the defense shifting while the other holds position)?
Code Implementation
See code/case-study-code.py for the complete Python implementation of the GNN formation classification model, including graph construction, GAT implementation, and HMM smoothing.