Understanding Soccer Data Types

Beginner | 10 min read | Nov 27, 2025

The Three Pillars of Soccer Data

Soccer analytics relies on three main types of data, each offering different insights into the game. Understanding how each type is structured, and where it falls short, is crucial for effective analysis.

Event Data: Discrete actions on the pitch (most common)
Tracking Data: Continuous player & ball positions (premium)
Aggregated Stats: Pre-calculated metrics & summaries (widely available)

Event Data Explained

What is Event Data?

Event data captures every significant action during a match with precise information about what happened, when, where, and by whom.

Typical Event Data Includes:

Location: X,Y coordinates on the pitch
Timestamp: When the event occurred
Event Type: Pass, shot, tackle, etc.
Player & Team: Who performed the action
Outcome: Success or failure
Context: Additional attributes (body part, pressure, etc.)
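
As a mental model, the fields listed above can be pictured as one record per action. The sketch below is a hypothetical, provider-neutral structure; the class name `MatchEvent` and its field names are assumptions made for illustration and do not match any specific provider's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchEvent:
    """Hypothetical event record, for illustration only."""
    event_type: str                      # e.g. "Pass", "Shot", "Tackle"
    minute: int                          # match clock, whole minutes
    second: int                          # seconds within the minute
    player: str                          # who performed the action
    team: str
    x: float                             # pitch coordinates in the provider's units
    y: float
    successful: Optional[bool] = None    # None when an outcome does not apply
    context: dict = field(default_factory=dict)  # body part, pressure, etc.

    @property
    def clock_seconds(self) -> int:
        """Match clock expressed in seconds."""
        return self.minute * 60 + self.second


example = MatchEvent(
    "Pass", 12, 34, "Player A", "Home", 45.0, 30.0, True,
    {"body_part": "Right Foot", "under_pressure": False},
)
print(example.event_type, example.clock_seconds)
```

Real providers nest or flatten these attributes differently, so treat this only as a way to read the list above.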

Event Data Structure

Python: Exploring Event Data Schema

from statsbombpy import sb
import pandas as pd

# Load sample event data
events = sb.events(match_id=8658)

print("Event Data Shape:", events.shape)
print("\nAvailable Columns:")
print(events.columns.tolist())

# Examine a single pass event
pass_event = events[events['type'] == 'Pass'].iloc[0]

print("\n" + "="*60)
print("Sample Pass Event Structure:")
print("="*60)

# Display key fields
key_fields = ['id', 'index', 'period', 'timestamp', 'minute',
             'second', 'type', 'player', 'team', 'location',
             'pass_recipient', 'pass_length', 'pass_angle',
             'pass_outcome', 'under_pressure']

for field in key_fields:
    if field in pass_event.index:
        print(f"{field:20s}: {pass_event[field]}")

# Event types breakdown
print("\n" + "="*60)
print("Event Types in Match:")
print("="*60)
print(events['type'].value_counts())

# Location data format
print("\n" + "="*60)
print("Location Data Format:")
print("="*60)
sample_location = events[events['location'].notna()]['location'].iloc[0]
print(f"Type: {type(sample_location)}")
print(f"Value: {sample_location}")
print(f"Format: [x_coordinate, y_coordinate]")
print(f"Pitch dimensions: 120 x 80 yards (StatsBomb)")
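
Because the StatsBomb pitch is a fixed 120 x 80 grid while tracking data is usually expressed in meters, it often helps to rescale event locations before combining sources. The helper below is a minimal sketch: the function name `statsbomb_to_meters` is hypothetical, and the 105 x 68 m target is an assumed pitch size (real pitches vary).

```python
def statsbomb_to_meters(location, pitch_length_m=105.0, pitch_width_m=68.0):
    """Linearly rescale a StatsBomb [x, y] location to meters.

    Assumes the 120 x 80 StatsBomb grid and a target pitch of
    pitch_length_m x pitch_width_m (105 x 68 m by default, an assumption).
    """
    x, y = location[0], location[1]
    return (x / 120.0 * pitch_length_m, y / 80.0 * pitch_width_m)


center = statsbomb_to_meters([60.0, 40.0])
corner = statsbomb_to_meters([120.0, 80.0])
print(center, corner)
```

A linear rescale distorts distances slightly when the real pitch is not 105 x 68 m, so use it for visual alignment rather than precise measurement.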

Common Event Types

Event Type	Description	Key Attributes
Pass	Ball played from one player to another	recipient, length, angle, height, outcome
Shot	Attempt to score	xG, outcome, body part, technique, end_location
Carry	Player moving with the ball	end_location, distance, duration
Duel	1v1 contest for the ball	type (aerial/ground), outcome, counterpress
Pressure	Defensive pressure applied	duration, counterpress
Interception	Ball intercepted during pass	outcome, position
Clearance	Defensive clearance	aerial_won, body_part
Dribble	Attempt to beat opponent with ball	outcome, nutmeg, overrun
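
Each event type carries its own attributes, so a common first step is to filter one type and summarize its fields. The sketch below does this for shots on a tiny made-up DataFrame; the column names `shot_outcome` and `shot_statsbomb_xg` follow the flattened naming used by statsbombpy, but the rows are invented for illustration.

```python
import pandas as pd

# Made-up events that mimic a few flattened event columns
events = pd.DataFrame({
    "type": ["Pass", "Shot", "Shot", "Pass", "Shot"],
    "team": ["Home", "Home", "Away", "Away", "Home"],
    "shot_outcome": [None, "Goal", "Saved", None, "Off T"],
    "shot_statsbomb_xg": [None, 0.30, 0.10, None, 0.05],
})

# Keep only shots, then aggregate per team
shots = events[events["type"] == "Shot"]
summary = shots.groupby("team").agg(
    shots=("type", "size"),
    goals=("shot_outcome", lambda s: (s == "Goal").sum()),
    xg=("shot_statsbomb_xg", "sum"),
)
print(summary)
```

The same filter-then-aggregate pattern applies to the other rows of the table, swapping in the attributes that belong to that event type.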

Working with Event Data

Python: Event Data Analysis

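import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer import Pitch
from statsbombpy import sb

# Load events. The returned rows are not guaranteed to be in match order,
# so sort by StatsBomb's running event 'index' before any sequence work.
events = sb.events(match_id=8658)
events = events.sort_values('index').reset_index(drop=True)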

# 1. Filter by event type and analyze
passes = events[events['type'] == 'Pass'].copy()

# Calculate pass success rate
passes['successful'] = passes['pass_outcome'].isna()
success_rate = passes['successful'].mean() * 100

print(f"Overall pass success rate: {success_rate:.1f}%")

# 2. Spatial analysis - where do events occur?
passes_with_location = passes[passes['location'].notna()].copy()
passes_with_location['x'] = passes_with_location['location'].apply(lambda loc: loc[0])
passes_with_location['y'] = passes_with_location['location'].apply(lambda loc: loc[1])

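# 3. Temporal analysis - events over time
events['minute_exact'] = events['minute'] + events['second'] / 60

# Passes per 5-minute interval. Bins are left-closed so events at 0:00 are
# counted, and the last edge extends past the final event (stoppage/extra time).
last_edge = (int(events['minute_exact'].max()) // 5 + 1) * 5
events['time_bin'] = pd.cut(events['minute_exact'],
                            bins=range(0, last_edge + 1, 5), right=False)
passes_over_time = (events[events['type'] == 'Pass']
                    .groupby('time_bin', observed=False).size())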

plt.figure(figsize=(12, 6))
passes_over_time.plot(kind='bar')
plt.title('Pass Frequency Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Time Interval (minutes)')
plt.ylabel('Number of Passes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('passes_over_time.png', dpi=300)

# 4. Sequence analysis - build-up play
def analyze_possession_sequence(events, min_passes=5):
    """Identify possessions in which the team on the ball plays at least min_passes passes"""
    # Work in match order; every pass is followed by other event types
    # (e.g. ball receipts), so consecutive rows cannot be compared directly.
    ordered = events.sort_values('index')

    # Keep on-ball actions by the team in possession
    on_ball = ordered[
        ordered['type'].isin(['Pass', 'Carry'])
        & (ordered['team'] == ordered['possession_team'])
    ]

    sequences = []
    # StatsBomb numbers each spell of possession in the 'possession' column
    for _, sequence in on_ball.groupby('possession'):
        if (sequence['type'] == 'Pass').sum() >= min_passes:
            sequences.append(sequence)

    return sequences

# Find long possession sequences
long_sequences = analyze_possession_sequence(events, min_passes=8)
print(f"\nFound {len(long_sequences)} possession sequences with 8+ passes")

# 5. Progressive actions - moves towards goal
def is_progressive_pass(pass_event):
    """
    Simple heuristic (not an official provider definition):
    a pass is progressive if it moves the ball at least 10 yards
    closer to the opponent's goal along the x-axis.
    """
    start = pass_event.get('location')
    end = pass_event.get('pass_end_location')

    # Locations are [x, y] lists and missing values are NaN floats, so check
    # the type; calling pd.isna() on a list returns an array and breaks `if`.
    if not isinstance(start, (list, tuple)) or not isinstance(end, (list, tuple)):
        return False

    # StatsBomb coordinates are oriented so the attacking goal is at x = 120
    return (end[0] - start[0]) >= 10

passes['progressive'] = passes.apply(is_progressive_pass, axis=1)
progressive_count = passes['progressive'].sum()

print(f"Progressive passes: {progressive_count} ({progressive_count/len(passes)*100:.1f}%)")

Tracking Data Explained

What is Tracking Data?

Tracking data captures the x,y coordinates of every player and the ball 10 to 25 times per second throughout the match, recording movement and positioning in far more detail than event data.

Tracking Data Characteristics:

Frequency: 10-25 Hz (frames per second)
Coverage: All 22 players + ball continuously
Precision: Sub-meter accuracy
Volume: ~2-3 million data points per match
Additional: Speed, acceleration, distance metrics
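
The volume figure above can be sanity-checked with quick arithmetic. This is a rough sketch that assumes 23 tracked objects (22 players plus the ball), one row per object per frame, and exactly 90 minutes of play with no stoppage time.

```python
TRACKED_OBJECTS = 23          # 22 players + ball
MATCH_SECONDS = 90 * 60       # regulation time only (assumption)


def rows_per_match(frame_rate_hz):
    """Rows in a long-format tracking table: one row per object per frame."""
    return frame_rate_hz * MATCH_SECONDS * TRACKED_OBJECTS


rows_at_25hz = rows_per_match(25)   # 25 * 5400 * 23
rows_at_10hz = rows_per_match(10)   # 10 * 5400 * 23
print(f"25 Hz: {rows_at_25hz:,} rows, 10 Hz: {rows_at_10hz:,} rows")
```

Under these assumptions the count ranges from roughly 1.2 million rows at 10 Hz to about 3.1 million at 25 Hz, which brackets the 2-3 million figure quoted above.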

Tracking Data Structure

Python: Simulated Tracking Data Example

import pandas as pd
import numpy as np

# Simulated tracking data structure (real data is similar)
# Each frame contains positions of all players and ball

def create_sample_tracking_frame(frame_id, timestamp):
    """Create a sample tracking data frame"""

    # 11 players per team + ball
    frame_data = []

    # Home team (team_id = 1)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 1,
            'player_id': player_id,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),  # meters
            'y': np.random.uniform(0, 68),   # meters
            'speed': np.random.uniform(0, 8),  # m/s
        })

    # Away team (team_id = 2)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 2,
            'player_id': player_id + 100,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),
            'y': np.random.uniform(0, 68),
            'speed': np.random.uniform(0, 8),
        })

    # Ball
    frame_data.append({
        'frame_id': frame_id,
        'timestamp': timestamp,
        'team_id': 0,
        'player_id': 0,
        'jersey_number': 0,
        'x': np.random.uniform(0, 105),
        'y': np.random.uniform(0, 68),
        'speed': np.random.uniform(0, 20),
    })

    return pd.DataFrame(frame_data)

# Create sample frames (25 fps = 25 frames per second)
frames = []
for frame_id in range(100):  # 4 seconds of data
    timestamp = frame_id / 25.0
    frames.append(create_sample_tracking_frame(frame_id, timestamp))

tracking_data = pd.concat(frames, ignore_index=True)

print("Tracking Data Sample:")
print(tracking_data.head(23))  # First frame: 22 players + ball

print(f"\nData shape: {tracking_data.shape}")
print(f"Frames: {tracking_data['frame_id'].nunique()}")
print(f"Duration: {tracking_data['timestamp'].max():.1f} seconds")
print(f"Data points: {len(tracking_data):,}")

# Calculate distances covered
player_distances = []
for player in tracking_data[tracking_data['player_id'] > 0]['player_id'].unique():
    player_data = tracking_data[tracking_data['player_id'] == player].sort_values('frame_id')

    # Calculate distance between frames
    distances = np.sqrt(
        (player_data['x'].diff())**2 +
        (player_data['y'].diff())**2
    )
    total_distance = distances.sum()

    player_distances.append({
        'player_id': player,
        'team_id': player_data['team_id'].iloc[0],
        'distance_m': total_distance,
        'avg_speed': player_data['speed'].mean()
    })

distance_df = pd.DataFrame(player_distances)
print("\nPlayer Movement Summary (4 seconds):")
print(distance_df.head(10))
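
Speed can also be derived from the positions themselves rather than read from a provider column, which is the basis for metrics such as high-speed running. The sketch below uses a made-up single-player track; the 5.5 m/s threshold is an assumed cut-off, since providers and clubs define speed zones differently.

```python
import numpy as np
import pandas as pd

FPS = 25
HIGH_SPEED_THRESHOLD = 5.5  # m/s; assumed cut-off, definitions vary

# Hypothetical track: 1 second at 2 m/s, then 1 second at 7 m/s, along x
true_speeds = np.repeat([2.0, 7.0], FPS)
x = np.concatenate([[0.0], np.cumsum(true_speeds / FPS)])
track = pd.DataFrame({"frame_id": np.arange(len(x)), "x": x, "y": 34.0})

# Distance covered between consecutive frames (first row is NaN)
step = np.hypot(track["x"].diff(), track["y"].diff())

# Speed = distance per frame * frames per second
track["speed"] = step * FPS

total_distance = step.sum()
high_speed_distance = step[track["speed"] > HIGH_SPEED_THRESHOLD].sum()
print(f"Total: {total_distance:.1f} m, high-speed: {high_speed_distance:.1f} m")
```

Real tracking positions are noisy, so frame-to-frame speeds are usually smoothed before thresholds like this are applied.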

Tracking Data Applications

Space Control

Calculate which team controls each area of the pitch using Voronoi diagrams and dominance models

Physical Metrics

Total distance, high-speed running, sprints, accelerations, and decelerations

Team Shape

Formation analysis, team compactness, width, and defensive line positioning

Pressing Analysis

Pressure intensity, time to pressure, defensive coverage, and counterpressing

Passing Lanes

Available passing options, passing lanes, and space creation through movement

Off-ball Runs

Player movement without the ball, creating space, and tactical positioning
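
As a concrete illustration of the space control idea, the simplest model assigns every point on the pitch to its nearest player, which is a discrete version of a Voronoi diagram. The function name `nearest_player_control` and the 1 m grid are assumptions for this sketch; dominance models used in practice also account for player velocity and ball position.

```python
import numpy as np


def nearest_player_control(home_xy, away_xy, length=105.0, width=68.0, cell=1.0):
    """Share of pitch cells whose nearest player belongs to the home team."""
    # Cell centers of a regular grid covering the pitch
    xs = np.arange(cell / 2, length, cell)
    ys = np.arange(cell / 2, width, cell)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])  # shape (cells, 2)

    def min_distance(players):
        players = np.asarray(players, dtype=float)       # shape (n, 2)
        diff = grid[:, None, :] - players[None, :, :]     # shape (cells, n, 2)
        return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

    home_closer = min_distance(home_xy) < min_distance(away_xy)
    return home_closer.mean()


# One player per team: the boundary is the vertical line x = 60
share = nearest_player_control([(30.0, 34.0)], [(90.0, 34.0)])
print(f"Home controls {share:.1%} of the pitch")
```

With full teams, pass the 11 home and 11 away positions from a single tracking frame.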

R: Tracking Data Analysis Concepts

library(tidyverse)

# Simulated tracking data
create_tracking_frame <- function(frame_id, timestamp) {
  # Create positions for 22 players + ball
  tibble(
    frame_id = frame_id,
    timestamp = timestamp,
    team_id = c(rep(1, 11), rep(2, 11), 0),
    player_id = c(1:11, 101:111, 0),
    x = runif(23, 0, 105),
    y = runif(23, 0, 68),
    speed = c(runif(22, 0, 8), runif(1, 0, 20))
  )
}

# Generate 100 frames (4 seconds at 25 fps)
tracking_data <- map_df(0:99, ~create_tracking_frame(.x, .x/25))

cat(sprintf("Generated %d frames of tracking data\n",
           n_distinct(tracking_data$frame_id)))

# Calculate team centroids (team center of gravity)
team_centroids <- tracking_data %>%
  filter(player_id > 0) %>%
  group_by(frame_id, team_id) %>%
  summarise(
    centroid_x = mean(x),
    centroid_y = mean(y),
    .groups = 'drop'
  )

# Calculate team compactness (average distance to centroid)
compactness <- tracking_data %>%
  filter(player_id > 0) %>%
  left_join(team_centroids, by = c("frame_id", "team_id")) %>%
  mutate(
    distance_to_centroid = sqrt((x - centroid_x)^2 + (y - centroid_y)^2)
  ) %>%
  group_by(frame_id, team_id) %>%
  summarise(
    compactness = mean(distance_to_centroid),
    .groups = 'drop'
  )

cat("\nTeam Compactness Over Time:\n")
print(compactness %>% head(10))

# Visualize team shapes
library(ggplot2)
library(gganimate)

# Animate one second of play (frames 0-24 at 25 fps)
animation_data <- tracking_data %>%
  filter(frame_id < 25, player_id > 0)

p <- ggplot(animation_data, aes(x = x, y = y, color = factor(team_id))) +
  geom_point(size = 3) +
  geom_point(data = tracking_data %>% filter(frame_id < 25, player_id == 0),
            aes(x = x, y = y), color = "black", size = 2) +
  coord_fixed(ratio = 1, xlim = c(0, 105), ylim = c(0, 68)) +
  scale_color_manual(values = c("1" = "#FF6B6B", "2" = "#4ECDC4"),
                    labels = c("Home", "Away")) +
  labs(title = "Tracking Data: Frame {current_frame}",
       x = "X Position (m)", y = "Y Position (m)",
       color = "Team") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  transition_manual(frame_id)  # one tracking frame per animation frame

# Render the animation (needs a renderer such as the gifski package)
# animate(p, fps = 25)

cat("\nTracking data animation object created\n")

Aggregated Statistics

What are Aggregated Stats?

Aggregated stats are pre-calculated metrics that summarize player or team performance over a match or a season. They are the most accessible form of soccer data.

Python: Working with Aggregated Stats

# Example: Player season statistics structure
# (values are illustrative placeholders, not real season figures)
player_season_stats = {
    'player_name': 'Mohamed Salah',
    'team': 'Liverpool',
    'season': '2023-24',
    'matches_played': 38,
    'minutes': 3200,

    # Scoring
    'goals': 25,
    'xG': 22.5,
    'shots': 120,
    'shots_on_target': 58,

    # Passing
    'passes': 1500,
    'pass_completion': 82.5,
    'key_passes': 65,
    'assists': 12,
    'xA': 10.8,

    # Dribbling
    'dribbles_completed': 95,
    'dribbles_attempted': 145,

    # Defensive
    'tackles': 35,
    'interceptions': 15,

    # Physical
    'distance_km': 380,
    'sprints': 450
}

# Convert to DataFrame for analysis
import pandas as pd

stats_df = pd.DataFrame([player_season_stats])

# Calculate derived metrics
stats_df['goals_per_90'] = (stats_df['goals'] / stats_df['minutes']) * 90
stats_df['xG_per_90'] = (stats_df['xG'] / stats_df['minutes']) * 90
stats_df['shot_accuracy'] = (stats_df['shots_on_target'] / stats_df['shots']) * 100
stats_df['dribble_success'] = (stats_df['dribbles_completed'] / stats_df['dribbles_attempted']) * 100

print("Player Performance Metrics:")
print(stats_df[['player_name', 'goals_per_90', 'xG_per_90',
               'shot_accuracy', 'dribble_success']])

Data Format Comparison

Aspect	Event Data	Tracking Data	Aggregated Stats
Granularity	Per action (~2000/match)	Per frame (~2M/match)	Per match/season
File Size	1-5 MB per match	500 MB - 2 GB per match	KB per player
Accessibility	Medium (some free sources)	Low (expensive, limited)	High (widely available)
Analysis Difficulty	Medium	High (requires expertise)	Low
Best For	Action analysis, xG, tactics	Movement, space, shape	Quick comparisons, trends
Processing Speed	Fast	Slow (large datasets)	Very fast

Choosing the Right Data Type

Use Event Data When:

Analyzing specific actions (passes, shots, tackles)
Calculating xG and xA models
Building passing networks
Studying individual player contributions
Creating action-based visualizations
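
To make one of these use cases concrete, a passing network starts from a count of completed passes between each passer and recipient. The sketch below runs on made-up rows; it assumes the flattened columns `player`, `pass_recipient`, and `pass_outcome`, with a missing outcome meaning the pass was completed, as in the analysis earlier.

```python
import pandas as pd

# Made-up passes for one team
passes = pd.DataFrame({
    "team": ["Home"] * 5,
    "player": ["A", "A", "B", "C", "A"],
    "pass_recipient": ["B", "B", "C", "A", "C"],
    "pass_outcome": [None, None, None, "Incomplete", None],
})

# Completed passes have no recorded outcome
completed = passes[passes["pass_outcome"].isna()]

# One row per passer -> recipient pair, with the number of passes as weight
edges = (
    completed.groupby(["player", "pass_recipient"])
    .size()
    .reset_index(name="passes")
    .sort_values("passes", ascending=False)
)
print(edges)
```

Each row is an edge of the network; average pass locations per player would then supply the node positions for plotting.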

Use Tracking Data When:

Analyzing team shape and formations
Studying space creation and control
Measuring physical performance precisely
Analyzing off-ball movement
Advanced tactical analysis

Use Aggregated Stats When:

Comparing players across seasons
Quick performance assessments
Creating league tables and rankings
Historical trend analysis
Basic scouting reports

Data Quality Considerations

Important Limitations

Event Data:

Manual collection introduces human error
Event definitions vary between providers
Misses off-ball movement and positioning
Subjective elements (pressure, body position)

Tracking Data:

Expensive and limited availability
Requires significant processing power
Can have tracking errors in crowded areas
Difficult to standardize across systems

Aggregated Stats:

Loses context and detail
Can be misleading without context
Varies in calculation methods
Position and role affect interpretation
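
One way aggregated stats mislead is through small samples: per-90 rates computed from very few minutes can top a ranking. The sketch below uses invented players, and the 900-minute cut-off is an assumed threshold chosen only for illustration.

```python
import pandas as pd

# Invented players for illustration
players = pd.DataFrame({
    "player": ["Regular starter", "Late substitute", "Rotation player"],
    "minutes": [2700, 90, 1350],
    "goals": [15, 2, 6],
})
players["goals_per_90"] = players["goals"] / players["minutes"] * 90

MIN_MINUTES = 900  # assumed cut-off; pick one that suits the competition

ranked_all = players.sort_values("goals_per_90", ascending=False)
ranked_qualified = ranked_all[ranked_all["minutes"] >= MIN_MINUTES]

top_raw = ranked_all.iloc[0]["player"]
top_qualified = ranked_qualified.iloc[0]["player"]
print(f"Top without filter: {top_raw}; top with filter: {top_qualified}")
```

Without the filter, the substitute with 90 minutes leads the ranking; with it, the comparison is restricted to players with a meaningful sample.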

Key Takeaways

Each data type has unique strengths and use cases
Event data offers the best balance of detail and accessibility
Tracking data provides unparalleled tactical insights but is expensive
Aggregated stats are perfect for quick comparisons
Combine multiple data types for comprehensive analysis
Always consider data quality and limitations
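
Combining data types usually means aligning them in time. The sketch below attaches the nearest tracking frame to each event with `pandas.merge_asof`; the column names `event_time_s` and `frame_time_s`, the 25 fps frame rate, and the assumption that both sources share a synchronized clock are all illustrative.

```python
import pandas as pd

# Made-up events with times in seconds from kick-off
events = pd.DataFrame({
    "event_time_s": [0.49, 1.21, 3.01],
    "type": ["Pass", "Carry", "Shot"],
})

# Made-up tracking frame index at 25 fps
frames = pd.DataFrame({"frame_id": range(100)})
frames["frame_time_s"] = frames["frame_id"] / 25.0

# Match each event to the closest frame, at most one frame (0.04 s) away
matched = pd.merge_asof(
    events.sort_values("event_time_s"),
    frames.sort_values("frame_time_s"),
    left_on="event_time_s",
    right_on="frame_time_s",
    direction="nearest",
    tolerance=0.04,
)
print(matched[["type", "event_time_s", "frame_id"]])
```

In practice event timestamps and tracking clocks often drift apart, so a synchronization step is normally needed before a join like this is trustworthy.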

Ready for Advanced Analysis?

Now that you understand soccer data types, you're prepared to:

Build custom xG models
Create advanced tactical visualizations
Develop player recruitment systems
Implement machine learning for predictions
Conduct professional-level analysis
