Understanding Soccer Data Types

Beginner, 10 min read, Nov 27, 2025

The Three Pillars of Soccer Data

Soccer analytics relies on three main types of data, each offering unique insights into the game. Understanding these data types, their structures, and limitations is crucial for effective analysis.

  • Event Data: Discrete actions on the pitch (most common)
  • Tracking Data: Continuous player and ball positions (premium)
  • Aggregated Stats: Pre-calculated metrics and summaries (widely available)

Event Data Explained

What is Event Data?

Event data records each notable on-ball action during a match, with details of what happened, when, where, and by whom.

Typical Event Data Includes:

  • Location: X,Y coordinates on the pitch
  • Timestamp: When the event occurred
  • Event Type: Pass, shot, tackle, etc.
  • Player & Team: Who performed the action
  • Outcome: Success or failure
  • Context: Additional attributes (body part, pressure, etc.)
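
The fields listed above can be pictured as one record per action. The sketch below is a hypothetical, provider-neutral container: the `MatchEvent` class and its field names are illustrative assumptions, not any provider's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchEvent:
    """Hypothetical, provider-neutral event record (illustrative only)."""
    event_type: str                 # e.g. "Pass", "Shot", "Tackle"
    minute: int                     # match clock minute
    second: int                     # second within the minute
    x: float                        # pitch x-coordinate (provider units)
    y: float                        # pitch y-coordinate (provider units)
    player: str
    team: str
    successful: Optional[bool] = None            # None when no outcome applies
    context: dict = field(default_factory=dict)  # body part, pressure, etc.

    @property
    def clock_seconds(self) -> int:
        """Seconds elapsed on the match clock."""
        return self.minute * 60 + self.second


example = MatchEvent(
    event_type="Pass", minute=12, second=30, x=60.0, y=40.0,
    player="Player A", team="Home", successful=True,
    context={"body_part": "Right Foot", "under_pressure": False},
)
print(example.event_type, example.clock_seconds)
```

Real providers add many more type-specific attributes, which is why the open-ended `context` dictionary is used here instead of fixed columns.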

Event Data Structure

Python: Exploring Event Data Schema

from statsbombpy import sb
import pandas as pd

# Load sample event data
events = sb.events(match_id=8658)

print("Event Data Shape:", events.shape)
print("\nAvailable Columns:")
print(events.columns.tolist())

# Examine a single pass event
pass_event = events[events['type'] == 'Pass'].iloc[0]

print("\n" + "="*60)
print("Sample Pass Event Structure:")
print("="*60)

# Display key fields
key_fields = ['id', 'index', 'period', 'timestamp', 'minute',
             'second', 'type', 'player', 'team', 'location',
             'pass_recipient', 'pass_length', 'pass_angle',
             'pass_outcome', 'under_pressure']

for field in key_fields:
    if field in pass_event.index:
        print(f"{field:20s}: {pass_event[field]}")

# Event types breakdown
print("\n" + "="*60)
print("Event Types in Match:")
print("="*60)
print(events['type'].value_counts())

# Location data format
print("\n" + "="*60)
print("Location Data Format:")
print("="*60)
sample_location = events[events['location'].notna()]['location'].iloc[0]
print(f"Type: {type(sample_location)}")
print(f"Value: {sample_location}")
print(f"Format: [x_coordinate, y_coordinate]")
print(f"Pitch dimensions: 120 x 80 yards (StatsBomb)")
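
Because the locations printed above live on a 120 x 80 grid, they usually need rescaling before they can be combined with data expressed in meters. The helper below is a minimal sketch: `statsbomb_to_meters` is a hypothetical name, and the 105 m x 68 m pitch size is an assumption (real pitches vary).

```python
def statsbomb_to_meters(location, pitch_length_m=105.0, pitch_width_m=68.0):
    """Rescale an [x, y] location from a 120 x 80 grid to meters.

    Assumes a 105 m x 68 m pitch by default; only the first two
    elements of `location` are used, so [x, y, z] inputs also work.
    """
    x, y = location[0], location[1]
    return (x / 120.0 * pitch_length_m, y / 80.0 * pitch_width_m)


# The center spot of the 120 x 80 grid maps to the center of the metric pitch
center_spot = statsbomb_to_meters([60.0, 40.0])
print(center_spot)
```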

Common Event Types

Event Type | Description | Key Attributes
Pass | Ball played from one player to another | recipient, length, angle, height, outcome
Shot | Attempt to score | xG, outcome, body part, technique, end_location
Carry | Player moving with the ball | end_location, distance, duration
Duel | 1v1 contest for the ball | type (aerial/ground), outcome, counterpress
Pressure | Defensive pressure applied | duration, counterpress
Interception | Ball intercepted during pass | outcome, position
Clearance | Defensive clearance | aerial_won, body_part
Dribble | Attempt to beat opponent with ball | outcome, nutmeg, overrun

Working with Event Data

Python: Event Data Analysis

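import matplotlib.pyplot as plt
import pandas as pd
from statsbombpy import sb

# Load events and sort by 'index' (StatsBomb's per-match event sequence
# number) so that rows are guaranteed to be in chronological order
events = sb.events(match_id=8658).sort_values('index').reset_index(drop=True)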

# 1. Filter by event type and analyze
passes = events[events['type'] == 'Pass'].copy()

# Calculate pass success rate
# In StatsBomb data, pass_outcome is empty (NaN) for completed passes
passes['successful'] = passes['pass_outcome'].isna()
success_rate = passes['successful'].mean() * 100

print(f"Overall pass success rate: {success_rate:.1f}%")

# 2. Spatial analysis - where do events occur?
passes_with_location = passes[passes['location'].notna()].copy()
passes_with_location['x'] = passes_with_location['location'].apply(lambda loc: loc[0])
passes_with_location['y'] = passes_with_location['location'].apply(lambda loc: loc[1])

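# 3. Temporal analysis - events over time
events['minute_exact'] = events['minute'] + events['second'] / 60

# Passes per 5-minute interval. Bins are left-closed so minute 0 is
# included, and they extend past 95 so stoppage or extra time is not dropped.
last_minute = int(events['minute_exact'].max())
bin_edges = list(range(0, last_minute + 6, 5))
events['time_bin'] = pd.cut(events['minute_exact'], bins=bin_edges, right=False)
passes_over_time = (
    events[events['type'] == 'Pass']
    .groupby('time_bin', observed=False)
    .size()
)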

plt.figure(figsize=(12, 6))
passes_over_time.plot(kind='bar')
plt.title('Pass Frequency Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Time Interval (minutes)')
plt.ylabel('Number of Passes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('passes_over_time.png', dpi=300)

# 4. Sequence analysis - build-up play
def analyze_possession_sequence(events, min_passes=5):
    """Return possessions in which the team in possession made at least
    `min_passes` passes (one DataFrame of passes per possession).

    Uses StatsBomb's `possession` counter instead of scanning for
    consecutive Pass/Carry rows, because other events (for example
    'Ball Receipt*') sit between passes and would break every chain.
    """
    team_passes = events[(events['type'] == 'Pass') &
                         (events['team'] == events['possession_team'])]
    sequences = []
    for _, group in team_passes.groupby('possession'):
        if len(group) >= min_passes:
            sequences.append(group.sort_values('index'))
    return sequences

# Find long possession sequences
long_sequences = analyze_possession_sequence(events, min_passes=8)
print(f"\nFound {len(long_sequences)} possession sequences with 8+ passes")

# 5. Progressive actions - moves towards goal
def is_progressive_pass(pass_event):
    """
    Simplified progressive-pass check (a tutorial heuristic, not an
    official provider definition): the pass must move the ball at least
    10 yards closer to the opponent's goal line along the x-axis.
    """
    start = pass_event.get('location')
    end = pass_event.get('pass_end_location')

    # Locations are [x, y] lists and missing values are NaN floats.
    # pd.isna() on a list returns an array, which cannot be used in `if`,
    # so check the type instead.
    if not isinstance(start, (list, tuple)) or not isinstance(end, (list, tuple)):
        return False

    # Progressive if moves ball 10+ yards towards goal
    return (end[0] - start[0]) >= 10

passes['progressive'] = passes.apply(is_progressive_pass, axis=1)
progressive_count = passes['progressive'].sum()

print(f"Progressive passes: {progressive_count} ({progressive_count/len(passes)*100:.1f}%)")

Tracking Data Explained

What is Tracking Data?

Tracking data records the x,y coordinates of every player and the ball 10 to 25 times per second throughout the match, capturing movement and positioning that event data cannot.

Tracking Data Characteristics:

  • Frequency: 10-25 Hz (frames per second)
  • Coverage: All 22 players + ball continuously
  • Precision: Sub-meter accuracy
  • Volume: Roughly 1-3 million position samples per match (23 tracked objects at 10-25 Hz over 90 minutes)
  • Additional: Speed, acceleration, distance metrics
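
The data volume follows directly from the frame rate. As a back-of-the-envelope check, assuming 22 players plus the ball and exactly 90 minutes of play (stoppage time ignored), the hypothetical helper `tracking_rows` below multiplies frames by tracked objects.

```python
def tracking_rows(frame_rate_hz, match_minutes=90, tracked_objects=23):
    """Number of position samples: frames per match x tracked objects."""
    frames = frame_rate_hz * match_minutes * 60
    return frames * tracked_objects


for hz in (10, 25):
    print(f"{hz} Hz: {tracking_rows(hz):,} samples")
```

Each sample carries at least an x and a y value, so the number of stored numbers is a multiple of this count.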

Tracking Data Structure

Python: Simulated Tracking Data Example

import pandas as pd
import numpy as np

# Simulated tracking data structure (real data is similar)
# Each frame contains positions of all players and ball

def create_sample_tracking_frame(frame_id, timestamp):
    """Create a sample tracking data frame"""

    # 11 players per team + ball
    frame_data = []

    # Home team (team_id = 1)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 1,
            'player_id': player_id,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),  # meters
            'y': np.random.uniform(0, 68),   # meters
            'speed': np.random.uniform(0, 8),  # m/s
        })

    # Away team (team_id = 2)
    for player_id in range(1, 12):
        frame_data.append({
            'frame_id': frame_id,
            'timestamp': timestamp,
            'team_id': 2,
            'player_id': player_id + 100,
            'jersey_number': player_id,
            'x': np.random.uniform(0, 105),
            'y': np.random.uniform(0, 68),
            'speed': np.random.uniform(0, 8),
        })

    # Ball
    frame_data.append({
        'frame_id': frame_id,
        'timestamp': timestamp,
        'team_id': 0,
        'player_id': 0,
        'jersey_number': 0,
        'x': np.random.uniform(0, 105),
        'y': np.random.uniform(0, 68),
        'speed': np.random.uniform(0, 20),
    })

    return pd.DataFrame(frame_data)

# Create sample frames (25 fps = 25 frames per second)
frames = []
for frame_id in range(100):  # 4 seconds of data
    timestamp = frame_id / 25.0
    frames.append(create_sample_tracking_frame(frame_id, timestamp))

tracking_data = pd.concat(frames, ignore_index=True)

print("Tracking Data Sample:")
print(tracking_data.head(23))  # First frame: 22 players + ball

print(f"\nData shape: {tracking_data.shape}")
print(f"Frames: {tracking_data['frame_id'].nunique()}")
print(f"Duration: {tracking_data['timestamp'].max():.1f} seconds")
print(f"Data points: {len(tracking_data):,}")

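# Calculate distances covered
# (positions here are random, so the totals are not realistic;
# the point is the frame-to-frame distance calculation)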
player_distances = []
for player in tracking_data[tracking_data['player_id'] > 0]['player_id'].unique():
    player_data = tracking_data[tracking_data['player_id'] == player].sort_values('frame_id')

    # Calculate distance between frames
    distances = np.sqrt(
        (player_data['x'].diff())**2 +
        (player_data['y'].diff())**2
    )
    total_distance = distances.sum()

    player_distances.append({
        'player_id': player,
        'team_id': player_data['team_id'].iloc[0],
        'distance_m': total_distance,
        'avg_speed': player_data['speed'].mean()
    })

distance_df = pd.DataFrame(player_distances)
print("\nPlayer Movement Summary (4 seconds):")
print(distance_df.head(10))
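
When a feed does not include speed, it can be derived from positions: frame-to-frame displacement divided by elapsed time. The sketch below uses a hypothetical player running in a straight line at a constant 6 m/s, so the result is easy to sanity-check; the 25 fps frame rate is an assumption.

```python
import numpy as np
import pandas as pd

FPS = 25  # assumed frame rate

# One hypothetical player running in a straight line at 6 m/s for 2 seconds
t = np.arange(0, 2 * FPS) / FPS
run = pd.DataFrame({"t": t, "x": 10.0 + 6.0 * t, "y": np.full(len(t), 34.0)})

# Frame-to-frame displacement divided by elapsed time gives speed (m/s)
step = np.hypot(run["x"].diff(), run["y"].diff())
run["speed"] = step / run["t"].diff()

# Change in speed divided by elapsed time gives acceleration (m/s^2)
run["accel"] = run["speed"].diff() / run["t"].diff()

total_distance = step.sum()        # the NaN in the first row is skipped
mean_speed = run["speed"].mean()
print(f"distance: {total_distance:.2f} m, mean speed: {mean_speed:.2f} m/s")
```

On real data the raw differences are noisy, so speeds are typically smoothed before thresholds such as sprint counts are applied.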

Tracking Data Applications

  • Space Control: Calculate which team controls each area of the pitch using Voronoi diagrams and dominance models
  • Physical Metrics: Total distance, high-speed running, sprints, accelerations, and decelerations
  • Team Shape: Formation analysis, team compactness, width, and defensive line positioning
  • Pressing Analysis: Pressure intensity, time to pressure, defensive coverage, and counterpressing
  • Passing Lanes: Available passing options and space creation through movement
  • Off-ball Runs: Player movement without the ball, creating space, and tactical positioning
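
To make the space control idea concrete, the sketch below assigns every 1 m x 1 m cell of the pitch to the team of its nearest player, which is a discretized Voronoi diagram. It is a simplification under stated assumptions: player positions are random placeholders, and velocity and ball position, which fuller dominance models account for, are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single frame: 11 players per team on a 105 m x 68 m pitch
home = rng.uniform([0, 0], [105, 68], size=(11, 2))
away = rng.uniform([0, 0], [105, 68], size=(11, 2))
players = np.vstack([home, away])
team = np.array([0] * 11 + [1] * 11)   # 0 = home, 1 = away

# Grid of 1 m x 1 m cell centers covering the pitch
gx, gy = np.meshgrid(np.arange(0.5, 105, 1.0), np.arange(0.5, 68, 1.0))
cells = np.column_stack([gx.ravel(), gy.ravel()])

# Distance from every cell to every player, then pick the nearest player
dists = np.linalg.norm(cells[:, None, :] - players[None, :, :], axis=2)
owner_team = team[dists.argmin(axis=1)]

home_share = (owner_team == 0).mean()
print(f"Home controls {home_share:.1%} of the pitch (nearest-player rule)")
```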

R: Tracking Data Analysis Concepts

library(tidyverse)

# Simulated tracking data
create_tracking_frame <- function(frame_id, timestamp) {
  # Create positions for 22 players + ball
  tibble(
    frame_id = frame_id,
    timestamp = timestamp,
    team_id = c(rep(1, 11), rep(2, 11), 0),
    player_id = c(1:11, 101:111, 0),
    x = runif(23, 0, 105),
    y = runif(23, 0, 68),
    speed = c(runif(22, 0, 8), runif(1, 0, 20))
  )
}

# Generate 100 frames (4 seconds at 25 fps)
tracking_data <- map_df(0:99, ~create_tracking_frame(.x, .x/25))

cat(sprintf("Generated %d frames of tracking data\n",
           n_distinct(tracking_data$frame_id)))

# Calculate team centroids (team center of gravity)
team_centroids <- tracking_data %>%
  filter(player_id > 0) %>%
  group_by(frame_id, team_id) %>%
  summarise(
    centroid_x = mean(x),
    centroid_y = mean(y),
    .groups = 'drop'
  )

# Calculate team compactness (average distance to centroid)
compactness <- tracking_data %>%
  filter(player_id > 0) %>%
  left_join(team_centroids, by = c("frame_id", "team_id")) %>%
  mutate(
    distance_to_centroid = sqrt((x - centroid_x)^2 + (y - centroid_y)^2)
  ) %>%
  group_by(frame_id, team_id) %>%
  summarise(
    compactness = mean(distance_to_centroid),
    .groups = 'drop'
  )

cat("\nTeam Compactness Over Time:\n")
print(compactness %>% head(10))

# Visualize team shapes
library(gganimate)  # ggplot2 is already attached by tidyverse

# Animate one second of play (25 frames at 25 fps)
animation_data <- tracking_data %>%
  filter(frame_id < 25, player_id > 0)
ball_data <- tracking_data %>%
  filter(frame_id < 25, player_id == 0)

p <- ggplot(animation_data, aes(x = x, y = y, color = factor(team_id))) +
  geom_point(size = 3) +
  geom_point(data = ball_data,
            aes(x = x, y = y), color = "black", size = 2) +
  coord_fixed(ratio = 1, xlim = c(0, 105), ylim = c(0, 68)) +
  scale_color_manual(values = c("1" = "#FF6B6B", "2" = "#4ECDC4"),
                    labels = c("Home", "Away")) +
  labs(title = "Tracking Data: Frame {current_frame}",
       x = "X Position (m)", y = "Y Position (m)",
       color = "Team") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  transition_manual(frame_id)  # one animation frame per tracking frame

# Rendering needs a gganimate renderer (for example the gifski package):
# animate(p, fps = 25)
cat("\nTracking data animation object created\n")

Aggregated Statistics

What are Aggregated Stats?

Aggregated stats are pre-calculated metrics that summarize player or team performance over matches or seasons. They are the most accessible form of soccer data.

Python: Working with Aggregated Stats

# Example: Player season statistics structure
# (illustrative values only, not real season figures)
player_season_stats = {
    'player_name': 'Mohamed Salah',
    'team': 'Liverpool',
    'season': '2023-24',
    'matches_played': 38,
    'minutes': 3200,

    # Scoring
    'goals': 25,
    'xG': 22.5,
    'shots': 120,
    'shots_on_target': 58,

    # Passing
    'passes': 1500,
    'pass_completion': 82.5,
    'key_passes': 65,
    'assists': 12,
    'xA': 10.8,

    # Dribbling
    'dribbles_completed': 95,
    'dribbles_attempted': 145,

    # Defensive
    'tackles': 35,
    'interceptions': 15,

    # Physical
    'distance_km': 380,
    'sprints': 450
}

# Convert to DataFrame for analysis
import pandas as pd

stats_df = pd.DataFrame([player_season_stats])

# Calculate derived metrics
stats_df['goals_per_90'] = (stats_df['goals'] / stats_df['minutes']) * 90
stats_df['xG_per_90'] = (stats_df['xG'] / stats_df['minutes']) * 90
stats_df['shot_accuracy'] = (stats_df['shots_on_target'] / stats_df['shots']) * 100
stats_df['dribble_success'] = (stats_df['dribbles_completed'] / stats_df['dribbles_attempted']) * 100

print("Player Performance Metrics:")
print(stats_df[['player_name', 'goals_per_90', 'xG_per_90',
               'shot_accuracy', 'dribble_success']])
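
Aggregated stats are ultimately roll-ups of event data. As a rough illustration, the sketch below builds a per-player summary from a tiny invented event log; the `successful` column and all values are assumptions made for the example.

```python
import pandas as pd

# Tiny hypothetical event log (values invented for illustration)
events_small = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "A", "B"],
    "type": ["Pass", "Pass", "Shot", "Pass", "Shot", "Pass", "Pass"],
    "successful": [True, False, True, True, False, True, True],
})

passes_small = events_small[events_small["type"] == "Pass"]
shots_small = events_small[events_small["type"] == "Shot"]

# One row per player: counts and rates computed from the raw events
summary = pd.DataFrame({
    "passes": passes_small.groupby("player").size(),
    "pass_completion": passes_small.groupby("player")["successful"].mean() * 100,
    "shots": shots_small.groupby("player").size(),
}).fillna(0)
print(summary)
```

Once the roll-up is done, the per-action detail (where each pass started, who was pressing) is no longer recoverable from the summary table.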

Data Format Comparison

Aspect | Event Data | Tracking Data | Aggregated Stats
Granularity | Per action (~2000/match) | Per frame (~2M/match) | Per match/season
File Size | 1-5 MB per match | 500 MB - 2 GB per match | KB per player
Accessibility | Medium (some free sources) | Low (expensive, limited) | High (widely available)
Analysis Difficulty | Medium | High (requires expertise) | Low
Best For | Action analysis, xG, tactics | Movement, space, shape | Quick comparisons, trends
Processing Speed | Fast | Slow (large datasets) | Very fast

Choosing the Right Data Type

Use Event Data When:

  • Analyzing specific actions (passes, shots, tackles)
  • Calculating xG and xA models
  • Building passing networks
  • Studying individual player contributions
  • Creating action-based visualizations
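
A passing network, for instance, reduces to counting passes per passer and recipient pair. The sketch below uses an invented list of completed passes; the column names `player` and `pass_recipient` mirror the ones used earlier in this article, and the position labels stand in for player names.

```python
import pandas as pd

# Hypothetical completed passes: passer -> recipient
completed = pd.DataFrame({
    "player": ["GK", "CB", "CB", "CM", "CM", "CB", "CM"],
    "pass_recipient": ["CB", "CM", "CM", "ST", "CB", "CM", "ST"],
})

# Edge weights: number of passes for each passer/recipient pair
edges = (
    completed.groupby(["player", "pass_recipient"])
    .size()
    .reset_index(name="passes")
    .sort_values("passes", ascending=False)
)
print(edges)
```

Node positions for plotting are usually taken as each player's average pass location, which event data also provides.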

Use Tracking Data When:

  • Analyzing team shape and formations
  • Studying space creation and control
  • Measuring physical performance precisely
  • Analyzing off-ball movement
  • Advanced tactical analysis
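
Team shape questions often start from a few simple per-frame measurements. The sketch below computes width, length, and an approximate defensive line for one invented frame; treating the four deepest outfield players as the defensive line is an assumption that will not hold for every formation.

```python
import pandas as pd

# Hypothetical outfield positions (meters) for one team in one frame,
# attacking towards increasing x
frame = pd.DataFrame({
    "x": [20.0, 22.0, 21.0, 23.0, 35.0, 38.0, 36.0, 50.0, 52.0, 55.0],
    "y": [10.0, 25.0, 43.0, 58.0, 20.0, 34.0, 48.0, 15.0, 34.0, 53.0],
})

team_width = frame["y"].max() - frame["y"].min()    # side-to-side spread
team_length = frame["x"].max() - frame["x"].min()   # back-to-front spread
defensive_line = frame["x"].nsmallest(4).mean()     # mean x of the 4 deepest players

print(f"width={team_width} m, length={team_length} m, line at x={defensive_line} m")
```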

Use Aggregated Stats When:

  • Comparing players across seasons
  • Quick performance assessments
  • Creating league tables and rankings
  • Historical trend analysis
  • Basic scouting reports

Data Quality Considerations

Important Limitations

Event Data:
  • Manual collection introduces human error
  • Event definitions vary between providers
  • Misses off-ball movement and positioning
  • Subjective elements (pressure, body position)
Tracking Data:
  • Expensive and limited availability
  • Requires significant processing power
  • Can have tracking errors in crowded areas
  • Difficult to standardize across systems
Aggregated Stats:
  • Loses context and detail
  • Can be misleading without context
  • Varies in calculation methods
  • Position and role affect interpretation
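
Some of these problems can be screened for automatically. As one example, tracking glitches often show up as physically implausible speeds. The sketch below flags them on an invented track; the 25 fps frame rate and the 12 m/s ceiling are assumptions, and real pipelines usually combine such checks with smoothing.

```python
import numpy as np
import pandas as pd

FPS = 25            # assumed frame rate
MAX_SPEED = 12.0    # m/s; assumed upper bound for human sprinting

# Hypothetical track of one player with a single bad position at frame 3
track = pd.DataFrame({
    "frame_id": range(6),
    "x": [50.0, 50.2, 50.4, 58.0, 50.8, 51.0],
    "y": [30.0] * 6,
})

step = np.hypot(track["x"].diff(), track["y"].diff())
track["speed"] = step * FPS                      # meters per frame -> m/s
track["suspect"] = track["speed"] > MAX_SPEED    # NaN compares as False

# Both the jump into the bad position and the jump back out are flagged
print(track[track["suspect"]])
```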

Key Takeaways

  • Each data type has unique strengths and use cases
  • Event data offers the best balance of detail and accessibility
  • Tracking data provides unparalleled tactical insights but is expensive
  • Aggregated stats are well suited to quick comparisons
  • Combine multiple data types for comprehensive analysis
  • Always consider data quality and limitations
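
Combining data types usually starts with aligning them in time. The sketch below attaches the nearest tracking frame to each event using `pandas.merge_asof`; the timestamps are invented, and it assumes both feeds share the same clock, which in practice often requires a separate synchronization step.

```python
import pandas as pd

# Hypothetical event timestamps (seconds from kick-off)
events_t = pd.DataFrame({"event": ["Pass", "Shot"], "t": [1.03, 2.49]})

# Hypothetical 25 fps tracking frames for the first 4 seconds
frames_t = pd.DataFrame({"frame_id": range(100)})
frames_t["t"] = frames_t["frame_id"] / 25.0

# Attach the nearest tracking frame to each event (both sorted by time)
synced = pd.merge_asof(
    events_t.sort_values("t"),
    frames_t.sort_values("t"),
    on="t",
    direction="nearest",
    tolerance=0.04,   # accept matches at most one frame interval away
)
print(synced)
```

With the frame identified, every player's position at the moment of the pass or shot can be looked up in the tracking table.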

Ready for Advanced Analysis?

Now that you understand soccer data types, you're prepared to:

  • Build custom xG models
  • Create advanced tactical visualizations
  • Develop player recruitment systems
  • Implement machine learning for predictions
  • Conduct professional-level analysis
