Understanding Soccer Data Types
The Three Pillars of Soccer Data
Soccer analytics relies on three main types of data, each offering unique insights into the game. Understanding these data types, their structures, and limitations is crucial for effective analysis.
- Event Data: Discrete actions on the pitch (most common)
- Tracking Data: Continuous player and ball positions (premium)
- Aggregated Stats: Pre-calculated metrics and summaries (widely available)
Event Data Explained
What is Event Data?
Event data captures every significant action during a match with precise information about what happened, when, where, and by whom.
Typical Event Data Includes:
- Location: X,Y coordinates on the pitch
- Timestamp: When the event occurred
- Event Type: Pass, shot, tackle, etc.
- Player & Team: Who performed the action
- Outcome: Success or failure
- Context: Additional attributes (body part, pressure, etc.)
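The fields listed above can be pictured as a single record per action. The sketch below is a minimal illustration only: the `Event` class and its field names are hypothetical and do not correspond to any provider's schema, and treating a missing outcome as "successful" is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    """One on-ball action; field names are illustrative, not a provider schema."""
    event_type: str                  # e.g. "Pass", "Shot", "Tackle"
    timestamp: str                   # time within the period
    x: float                         # location on the pitch
    y: float
    player: str
    team: str
    outcome: Optional[str] = None    # assumption: None means "successful"
    context: dict = field(default_factory=dict)  # body part, pressure, etc.

    def is_successful(self) -> bool:
        return self.outcome is None


# Made-up example record
example = Event(
    event_type="Pass", timestamp="00:12:34.500", x=60.0, y=40.0,
    player="Player A", team="Team A",
    context={"body_part": "Right Foot", "under_pressure": True},
)
print(example.event_type, example.is_successful())
```

Real feeds usually deliver these records as JSON or as rows in a table, with the context attributes nested or flattened into extra columns.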
Event Data Structure
Python: Exploring Event Data Schema
from statsbombpy import sb
import pandas as pd
import json
# Load sample event data
events = sb.events(match_id=8658)
print("Event Data Shape:", events.shape)
print("\nAvailable Columns:")
print(events.columns.tolist())
# Examine a single pass event
pass_event = events[events['type'] == 'Pass'].iloc[0]
print("\n" + "="*60)
print("Sample Pass Event Structure:")
print("="*60)
# Display key fields
key_fields = ['id', 'index', 'period', 'timestamp', 'minute',
              'second', 'type', 'player', 'team', 'location',
              'pass_recipient', 'pass_length', 'pass_angle',
              'pass_outcome', 'under_pressure']
for field in key_fields:
    if field in pass_event.index:
        print(f"{field:20s}: {pass_event[field]}")
# Event types breakdown
print("\n" + "="*60)
print("Event Types in Match:")
print("="*60)
print(events['type'].value_counts())
# Location data format
print("\n" + "="*60)
print("Location Data Format:")
print("="*60)
sample_location = events[events['location'].notna()]['location'].iloc[0]
print(f"Type: {type(sample_location)}")
print(f"Value: {sample_location}")
print(f"Format: [x_coordinate, y_coordinate]")
print(f"Pitch dimensions: 120 x 80 yards (StatsBomb)")
Common Event Types
| Event Type | Description | Key Attributes |
|---|---|---|
| Pass | Ball played from one player to another | recipient, length, angle, height, outcome |
| Shot | Attempt to score | xG, outcome, body part, technique, end_location |
| Carry | Player moving with the ball | end_location, distance, duration |
| Duel | 1v1 contest for the ball | type (aerial/ground), outcome, counterpress |
| Pressure | Defensive pressure applied | duration, counterpress |
| Interception | Ball intercepted during pass | outcome, position |
| Clearance | Defensive clearance | aerial_won, body_part |
| Dribble | Attempt to beat opponent with ball | outcome, nutmeg, overrun |
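In a flattened event table, the attributes in the last column typically appear as extra columns prefixed with the event type, as with `pass_length` and `pass_outcome` in the schema example earlier. The sketch below shows one way to pull out the attribute columns for a given event type. The toy data and the `attribute_columns` helper are hypothetical, and the prefix convention is assumed to hold for every event type, which may not be true for all providers.

```python
import pandas as pd

# Toy frame mimicking flattened event columns (values are made up)
events = pd.DataFrame({
    "type": ["Pass", "Shot", "Pass"],
    "player": ["Player A", "Player B", "Player C"],
    "pass_length": [12.5, None, 30.1],
    "pass_outcome": [None, None, "Incomplete"],
    "shot_outcome": [None, "Saved", None],
})


def attribute_columns(df: pd.DataFrame, event_type: str) -> list[str]:
    """Return columns whose name starts with the lower-cased event type."""
    prefix = event_type.lower().replace(" ", "_") + "_"
    return [col for col in df.columns if col.startswith(prefix)]


pass_columns = attribute_columns(events, "Pass")
passes = events.loc[events["type"] == "Pass", ["player"] + pass_columns]
print(pass_columns)
print(passes)
```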
Working with Event Data
Python: Event Data Analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer import Pitch
from statsbombpy import sb
# Load events
events = sb.events(match_id=8658)
# 1. Filter by event type and analyze
passes = events[events['type'] == 'Pass'].copy()
# Calculate pass success rate
passes['successful'] = passes['pass_outcome'].isna()
success_rate = passes['successful'].mean() * 100
print(f"Overall pass success rate: {success_rate:.1f}%")
# 2. Spatial analysis - where do events occur?
passes_with_location = passes[passes['location'].notna()].copy()
passes_with_location['x'] = passes_with_location['location'].apply(lambda loc: loc[0])
passes_with_location['y'] = passes_with_location['location'].apply(lambda loc: loc[1])
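# 3. Temporal analysis - events over time
events['minute_exact'] = events['minute'] + events['second']/60
# Passes per 5-minute interval; extend the bins to the last recorded minute
# so that stoppage-time events are not silently dropped
last_minute = int(events['minute_exact'].max())
bins = range(0, last_minute + 6, 5)
events['time_bin'] = pd.cut(events['minute_exact'], bins=bins, right=False)
passes_over_time = events[events['type'] == 'Pass'].groupby('time_bin', observed=False).size()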
plt.figure(figsize=(12, 6))
passes_over_time.plot(kind='bar')
plt.title('Pass Frequency Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Time Interval (minutes)')
plt.ylabel('Number of Passes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('passes_over_time.png', dpi=300)
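# 4. Sequence analysis - build-up play
def analyze_possession_sequence(events, min_passes=5):
    """Identify same-team Pass/Carry sequences containing at least min_passes passes"""
    # Sort chronologically; row order from the loader is not guaranteed
    ordered = events.sort_values('index')
    # Keep only on-ball actions. Other event types (e.g. 'Ball Receipt*')
    # occur between passes and would otherwise end every sequence early.
    # Simplification: a sequence ends only when the other team passes or carries.
    on_ball = ordered[ordered['type'].isin(['Pass', 'Carry'])]
    sequences = []
    current_sequence = []
    current_team = None

    def store_if_long_enough(sequence):
        pass_count = sum(e['type'] == 'Pass' for e in sequence)
        if pass_count >= min_passes:
            sequences.append(sequence)

    for idx, event in on_ball.iterrows():
        if event['team'] != current_team:
            store_if_long_enough(current_sequence)
            current_sequence = []
            current_team = event['team']
        current_sequence.append(event)
    store_if_long_enough(current_sequence)  # include the final sequence
    return sequences
# Find long possession sequences
long_sequences = analyze_possession_sequence(events, min_passes=8)
print(f"\nFound {len(long_sequences)} possession sequences with 8+ passes")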
# 5. Progressive actions - moves towards goal
def is_progressive_pass(pass_event):
    """
    Simple heuristic (providers define progressive passes differently):
    - Moves ball significantly closer to opponent's goal
    """
    start = pass_event['location']
    end = pass_event.get('pass_end_location')
    # Locations are [x, y] sequences; pd.isna() on a list returns an array,
    # which cannot be used directly in an if statement, so check the type instead
    valid_types = (list, tuple, np.ndarray)
    if not isinstance(start, valid_types) or not isinstance(end, valid_types):
        return False
    # Progressive if moves ball 10+ yards towards goal (x increases towards it)
    return (end[0] - start[0]) >= 10
passes['progressive'] = passes.apply(is_progressive_pass, axis=1)
progressive_count = passes['progressive'].sum()
print(f"Progressive passes: {progressive_count} ({progressive_count/len(passes)*100:.1f}%)")
Tracking Data Explained
What is Tracking Data?
Tracking data records the x,y coordinates of every player and the ball 10-25 times per second throughout the match, providing far more detail about movement and positioning than event data can.
Tracking Data Characteristics:
- Frequency: 10-25 Hz (frames per second)
- Coverage: All 22 players + ball continuously
- Precision: Sub-meter accuracy
- Volume: Roughly 1-3 million position records per match, depending on frame rate
- Additional: Speed, acceleration, distance metrics
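The volume figure follows from simple arithmetic: frames per second, times seconds played, times the number of tracked objects. The sketch below works this out for the two ends of the frequency range, assuming a 90-minute match with no stoppage time and 23 tracked objects (22 players plus the ball). The `tracking_rows` helper is illustrative only.

```python
def tracking_rows(frame_rate_hz: int, minutes: float = 90, tracked_objects: int = 23) -> int:
    """Position records = frames per second x seconds played x tracked objects."""
    return int(frame_rate_hz * minutes * 60 * tracked_objects)


for hz in (10, 25):
    print(f"{hz} Hz: {tracking_rows(hz):,} position records")
# 10 Hz: 1,242,000 position records
# 25 Hz: 3,105,000 position records
```

Stoppage time, substitutes, and extra columns such as speed push the real storage footprint higher.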
Tracking Data Structure
Python: Simulated Tracking Data Example
import pandas as pd
import numpy as np
# Simulated tracking data structure (real data is similar)
# Each frame contains positions of all players and ball
def create_sample_tracking_frame(frame_id, timestamp):
    """Create a sample tracking data frame"""
    # 11 players per team + ball
    frame_data = []
    # Home team (team_id = 1)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 1,
            'player_id': player_id,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),  # meters
            'y': np.random.uniform(0, 68),   # meters
            'speed': np.random.uniform(0, 8),  # m/s
        })
    # Away team (team_id = 2)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 2,
            'player_id': player_id + 100,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),
            'y': np.random.uniform(0, 68),
            'speed': np.random.uniform(0, 8),
        })
    # Ball
    frame_data.append({
        'frame_id': frame_id,
        'timestamp': timestamp,
        'team_id': 0,
        'player_id': 0,
        'jersey_number': 0,
        'x': np.random.uniform(0, 105),
        'y': np.random.uniform(0, 68),
        'speed': np.random.uniform(0, 20),
    })
    return pd.DataFrame(frame_data)
# Create sample frames (25 fps = 25 frames per second)
frames = []
for frame_id in range(100):  # 4 seconds of data
    timestamp = frame_id / 25.0
    frames.append(create_sample_tracking_frame(frame_id, timestamp))
tracking_data = pd.concat(frames, ignore_index=True)
print("Tracking Data Sample:")
print(tracking_data.head(23))  # First frame: 22 players + ball
print(f"\nData shape: {tracking_data.shape}")
print(f"Frames: {tracking_data['frame_id'].nunique()}")
print(f"Duration: {tracking_data['timestamp'].max():.1f} seconds")
print(f"Data points: {len(tracking_data):,}")
# Calculate distances covered
player_distances = []
for player in tracking_data[tracking_data['player_id'] > 0]['player_id'].unique():
    player_data = tracking_data[tracking_data['player_id'] == player].sort_values('frame_id')
    # Calculate distance between frames
    distances = np.sqrt(
        (player_data['x'].diff())**2 +
        (player_data['y'].diff())**2
    )
    total_distance = distances.sum()
    player_distances.append({
        'player_id': player,
        'team_id': player_data['team_id'].iloc[0],
        'distance_m': total_distance,
        'avg_speed': player_data['speed'].mean()
    })
distance_df = pd.DataFrame(player_distances)
print("\nPlayer Movement Summary (4 seconds):")
print(distance_df.head(10))
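Because the simulated positions above are drawn independently for every frame, the resulting distances are not realistic. The sketch below uses a smooth, hypothetical run instead to show how speed and high-speed running distance can be derived from positions alone. The 25 Hz frame rate and the 5.5 m/s threshold are assumptions for illustration, and providers use different cut-offs.

```python
import numpy as np
import pandas as pd

FRAME_RATE = 25       # Hz, assumed
HSR_THRESHOLD = 5.5   # m/s, assumed cut-off for high-speed running

# One hypothetical player accelerating along the x-axis at 2 m/s^2 for ~4 seconds
t = np.arange(100) / FRAME_RATE
track = pd.DataFrame({"timestamp": t, "x": 10 + t**2, "y": 34.0})

# Distance covered between consecutive frames (first value is NaN)
step = np.hypot(track["x"].diff(), track["y"].diff())
# Speed = distance per frame divided by time per frame
track["speed"] = step / track["timestamp"].diff()

total_distance = step.sum()
hsr_distance = step[track["speed"] >= HSR_THRESHOLD].sum()
print(f"Total distance: {total_distance:.1f} m")
print(f"High-speed running distance: {hsr_distance:.1f} m")
```

Real tracking data is noisy, so speeds are usually smoothed (for example with a rolling mean) before thresholds like this are applied.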
Tracking Data Applications
Space Control
Calculate which team controls each area of the pitch using Voronoi diagrams and dominance models
Physical Metrics
Total distance, high-speed running, sprints, accelerations, and decelerations
Team Shape
Formation analysis, team compactness, width, and defensive line positioning
Pressing Analysis
Pressure intensity, time to pressure, defensive coverage, and counterpressing
Passing Lanes
Available passing options, passing lanes, and space creation through movement
Off-ball Runs
Player movement without the ball, creating space, and tactical positioning
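As an illustration of the space control idea, the sketch below assigns each 1 m grid cell of the pitch to the team whose player is closest, which is a discrete approximation of a Voronoi diagram. The player positions are random and hypothetical, and the model ignores player speed and direction, which proper dominance models take into account.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical single frame: 11 players per team on a 105 x 68 m pitch
home = rng.uniform([0, 0], [105, 68], size=(11, 2))
away = rng.uniform([0, 0], [105, 68], size=(11, 2))

# Centers of 1 m x 1 m grid cells covering the pitch
gx, gy = np.meshgrid(np.arange(0.5, 105, 1.0), np.arange(0.5, 68, 1.0))
cells = np.column_stack([gx.ravel(), gy.ravel()])


def nearest_distance(cells: np.ndarray, players: np.ndarray) -> np.ndarray:
    """Distance from each cell center to its closest player."""
    diffs = cells[:, None, :] - players[None, :, :]   # shape: (cells, players, 2)
    return np.linalg.norm(diffs, axis=2).min(axis=1)


# A cell belongs to the home team if a home player is closest to it
home_controls = nearest_distance(cells, home) < nearest_distance(cells, away)
home_share = home_controls.mean()
print(f"Home team controls {home_share:.1%} of the pitch")
```

Reshaping `home_controls` to `gx.shape` gives a 2D map that can be drawn as a heatmap over a pitch plot.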
R: Tracking Data Analysis Concepts
library(tidyverse)
# Simulated tracking data
create_tracking_frame <- function(frame_id, timestamp) {
# Create positions for 22 players + ball
tibble(
frame_id = frame_id,
timestamp = timestamp,
team_id = c(rep(1, 11), rep(2, 11), 0),
player_id = c(1:11, 101:111, 0),
x = runif(23, 0, 105),
y = runif(23, 0, 68),
speed = c(runif(22, 0, 8), runif(1, 0, 20))
)
}
# Generate 100 frames (4 seconds at 25 fps)
tracking_data <- map_df(0:99, ~create_tracking_frame(.x, .x/25))
cat(sprintf("Generated %d frames of tracking data\n",
n_distinct(tracking_data$frame_id)))
# Calculate team centroids (team center of gravity)
team_centroids <- tracking_data %>%
filter(player_id > 0) %>%
group_by(frame_id, team_id) %>%
summarise(
centroid_x = mean(x),
centroid_y = mean(y),
.groups = 'drop'
)
# Calculate team compactness (average distance to centroid)
compactness <- tracking_data %>%
filter(player_id > 0) %>%
left_join(team_centroids, by = c("frame_id", "team_id")) %>%
mutate(
distance_to_centroid = sqrt((x - centroid_x)^2 + (y - centroid_y)^2)
) %>%
group_by(frame_id, team_id) %>%
summarise(
compactness = mean(distance_to_centroid),
.groups = 'drop'
)
cat("\nTeam Compactness Over Time:\n")
print(compactness %>% head(10))
# Visualize team shapes
library(ggplot2)
library(gganimate)
# Animate roughly one second of play (frames 0-25)
animation_data <- tracking_data %>%
  filter(frame_id <= 25, player_id > 0)
p <- ggplot(animation_data, aes(x = x, y = y, color = factor(team_id))) +
  geom_point(size = 3) +
  geom_point(data = tracking_data %>% filter(frame_id <= 25, player_id == 0),
             aes(x = x, y = y), color = "black", size = 2) +
  coord_fixed(ratio = 1, xlim = c(0, 105), ylim = c(0, 68)) +
  scale_color_manual(values = c("1" = "#FF6B6B", "2" = "#4ECDC4"),
                     labels = c("Home", "Away")) +
  labs(title = "Tracking Data: Frame {current_frame}",
       x = "X Position (m)", y = "Y Position (m)",
       color = "Team") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  # Without a transition the plot stays static and the title is not filled in
  transition_manual(frame_id)
# Render with animate(p, fps = 25)
cat("\nTracking data visualization created\n")
Aggregated Statistics
What are Aggregated Stats?
Aggregated stats are pre-calculated metrics that summarize player or team performance over matches or seasons. They are the most accessible form of soccer data.
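Aggregated stats are ultimately derived from lower-level records. The sketch below shows how a handful of event rows could be rolled up into per-player totals and rates. The players, events, and minutes are made up, and the column names are hypothetical rather than taken from any provider.

```python
import pandas as pd

# Hypothetical event records (made-up players and values)
events = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "A"],
    "type": ["Pass", "Pass", "Shot", "Pass", "Shot", "Shot"],
    "successful": [True, False, False, True, True, True],
})
minutes_played = pd.Series({"A": 90, "B": 45})

# Count events of each type per player
counts = pd.crosstab(events["player"], events["type"])
# Share of successful passes per player, as a percentage
pass_completion = (
    events[events["type"] == "Pass"].groupby("player")["successful"].mean() * 100
)

summary = pd.DataFrame({
    "passes": counts["Pass"],
    "shots": counts["Shot"],
    "pass_completion_pct": pass_completion,
    "shots_per_90": counts["Shot"] / minutes_played * 90,
})
print(summary)
```

The roll-up discards where and when each action happened, which is exactly the context that aggregated stats lose.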
Python: Working with Aggregated Stats
# Example: Player season statistics structure (illustrative values, not official figures)
player_season_stats = {
'player_name': 'Mohamed Salah',
'team': 'Liverpool',
'season': '2023-24',
'matches_played': 38,
'minutes': 3200,
# Scoring
'goals': 25,
'xG': 22.5,
'shots': 120,
'shots_on_target': 58,
# Passing
'passes': 1500,
'pass_completion': 82.5,
'key_passes': 65,
'assists': 12,
'xA': 10.8,
# Dribbling
'dribbles_completed': 95,
'dribbles_attempted': 145,
# Defensive
'tackles': 35,
'interceptions': 15,
# Physical
'distance_km': 380,
'sprints': 450
}
# Convert to DataFrame for analysis
import pandas as pd
stats_df = pd.DataFrame([player_season_stats])
# Calculate derived metrics
stats_df['goals_per_90'] = (stats_df['goals'] / stats_df['minutes']) * 90
stats_df['xG_per_90'] = (stats_df['xG'] / stats_df['minutes']) * 90
stats_df['shot_accuracy'] = (stats_df['shots_on_target'] / stats_df['shots']) * 100
stats_df['dribble_success'] = (stats_df['dribbles_completed'] / stats_df['dribbles_attempted']) * 100
print("Player Performance Metrics:")
print(stats_df[['player_name', 'goals_per_90', 'xG_per_90',
'shot_accuracy', 'dribble_success']])
Data Format Comparison
| Aspect | Event Data | Tracking Data | Aggregated Stats |
|---|---|---|---|
| Granularity | Per action (~2000/match) | Per frame (~2M/match) | Per match/season |
| File Size | 1-5 MB per match | 500 MB - 2 GB per match | KB per player |
| Accessibility | Medium (some free sources) | Low (expensive, limited) | High (widely available) |
| Analysis Difficulty | Medium | High (requires expertise) | Low |
| Best For | Action analysis, xG, tactics | Movement, space, shape | Quick comparisons, trends |
| Processing Speed | Fast | Slow (large datasets) | Very fast |
Choosing the Right Data Type
Use Event Data When:
- Analyzing specific actions (passes, shots, tackles)
- Calculating xG and xA models
- Building passing networks
- Studying individual player contributions
- Creating action-based visualizations
Use Tracking Data When:
- Analyzing team shape and formations
- Studying space creation and control
- Measuring physical performance precisely
- Analyzing off-ball movement
- Advanced tactical analysis
Use Aggregated Stats When:
- Comparing players across seasons
- Quick performance assessments
- Creating league tables and rankings
- Historical trend analysis
- Basic scouting reports
Data Quality Considerations
Important Limitations
Event Data:
- Manual collection introduces human error
- Event definitions vary between providers
- Misses off-ball movement and positioning
- Subjective elements (pressure, body position)
Tracking Data:
- Expensive and limited availability
- Requires significant processing power
- Can have tracking errors in crowded areas
- Difficult to standardize across systems
Aggregated Stats:
- Loses context and detail
- Can be misleading without context
- Varies in calculation methods
- Position and role affect interpretation
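One practical consequence of these differences is that coordinates from different sources rarely line up. The event examples in this guide use a 120 x 80 pitch, while the tracking examples use 105 x 68 meters. The sketch below rescales one onto the other. The `statsbomb_to_meters` helper is hypothetical, the default pitch size is an assumption (real pitches vary), and the y-axis flip assumes the event origin is top-left and the target origin is bottom-left.

```python
def statsbomb_to_meters(x, y, pitch_length=105.0, pitch_width=68.0, flip_y=True):
    """Rescale a 120 x 80 event location onto a pitch measured in meters."""
    x_m = x / 120.0 * pitch_length
    y_scaled = y / 80.0 * pitch_width
    # flip_y assumes a top-left event origin and a bottom-left target origin
    y_m = pitch_width - y_scaled if flip_y else y_scaled
    return x_m, y_m


print(statsbomb_to_meters(60, 40))    # center spot -> (52.5, 34.0)
print(statsbomb_to_meters(120, 0))    # -> (105.0, 68.0)
```

Check each provider's documentation for origin, axis direction, and attacking direction before combining sources, since a silent mismatch mirrors every location.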
Key Takeaways
- Each data type has unique strengths and use cases
- Event data offers the best balance of detail and accessibility
- Tracking data provides unparalleled tactical insights but is expensive
- Aggregated stats are perfect for quick comparisons
- Combine multiple data types for comprehensive analysis
- Always consider data quality and limitations
Ready for Advanced Analysis?
Now that you understand soccer data types, you're prepared to:
- Build custom xG models
- Create advanced tactical visualizations
- Develop player recruitment systems
- Implement machine learning for predictions
- Conduct professional-level analysis