Learning Objectives

  • Understand the major tracking data collection technologies and their trade-offs
  • Compute velocity, acceleration, and speed profiles from raw positional data
  • Implement sprint detection algorithms and classify high-intensity running events
  • Calculate metabolic power and energy expenditure from tracking data
  • Quantify team shape using centroid, spread, convex hull, and stretch index metrics
  • Analyze collective movement patterns using synchronization and phase coupling methods
  • Build fatigue models and monitor workload using the acute-to-chronic workload ratio
  • Synchronize tracking data with event data for contextual analysis

Chapter 18: Tracking Data Analytics

Introduction

Tracking data represents the most granular level of spatial information available in modern soccer analytics. Unlike event data, which records discrete on-the-ball actions, tracking data captures the continuous positional coordinates of every player and the ball at high frequency --- typically 10 to 25 frames per second. This rich spatiotemporal dataset unlocks analytical dimensions that are simply inaccessible through event data alone: physical performance profiling, collective tactical movement, pressing intensity quantification, and fatigue modeling, among many others.

The proliferation of optical tracking systems (such as TRACAB, Second Spectrum, and Hawk-Eye) and wearable GPS/inertial measurement unit (IMU) devices has made tracking data increasingly available across elite leagues. The Bundesliga's partnership with AWS and the Premier League's adoption of Second Spectrum have brought tracking-derived metrics into mainstream broadcasts, while clubs' internal analytics departments leverage the raw positional feeds for competitive advantage.

This chapter provides a rigorous treatment of tracking data analytics in professional soccer. We begin with the fundamentals of data structure and coordinate systems, progress through physical performance metrics and collective movement analysis, and conclude with methods for integrating tracking data with traditional event data to build holistic analytical frameworks.

Prerequisites: Readers should be comfortable with Python (NumPy, pandas, matplotlib), basic linear algebra (vectors, distances), and introductory calculus (derivatives, integrals). Familiarity with event data concepts from earlier chapters is recommended.


18.1 Understanding Tracking Data

18.1.1 Data Sources and Collection Technologies

Tracking data in professional soccer is collected through three primary technologies:

  1. Optical tracking systems use multiple high-resolution cameras mounted around the stadium perimeter. Computer vision algorithms detect and track each player and the ball across frames. Systems such as TRACAB (ChyronHego), Second Spectrum, and Hawk-Eye Innovations are deployed in major European leagues. These systems typically operate at 25 Hz (25 frames per second), providing sub-meter positional accuracy.

  2. GPS and GNSS devices are wearable units carried by players, usually in a vest worn beneath the jersey. Global Navigation Satellite System (GNSS) receivers combined with local positioning systems (LPS) and inertial measurement units (IMUs) provide positional data at 10--20 Hz. Systems from Catapult, STATSports, and Playermaker are widely used in training and, where regulations permit, in competitive matches.

  3. Radio-frequency and ultra-wideband (UWB) systems use sensors embedded in the field or stadium infrastructure to triangulate player positions via radio signals. These systems, such as Kinexon and Zebra Technologies, offer high accuracy (typically $\pm 10$ cm) and are used in some leagues as alternatives or complements to optical tracking.

Each technology has trade-offs in accuracy, latency, and deployment context. Optical systems are the standard for match-day tracking in top leagues; GPS/IMU devices dominate training ground analytics; UWB systems occupy a growing niche in stadiums with suitable infrastructure.

18.1.2 Tracking Data Providers in Detail

The tracking data market is dominated by a handful of specialized providers, each with distinct technical approaches, league partnerships, and data delivery formats.

Second Spectrum was acquired by Genius Sports in 2021 and serves as the official optical tracking provider for the English Premier League, Major League Soccer, and several other competitions. Second Spectrum's system uses a network of cameras installed in stadium rafters, typically 10--16 cameras per venue, feeding into a centralized computer vision pipeline. Their proprietary algorithms combine pose estimation with trajectory prediction to maintain tracking continuity even during occlusions and crowd scenes. Second Spectrum delivers both raw positional data and a suite of derived metrics, including their "Spatial Control" model, which estimates the probability that a team controls any given point on the pitch. Their data is delivered at 25 Hz with reported positional accuracy of approximately $\pm 20$ cm.

SkillCorner occupies a unique position in the market by deriving tracking data from broadcast video feeds rather than dedicated camera installations. Using deep learning-based computer vision, SkillCorner processes standard television broadcasts to extract player and ball positions. This approach dramatically expands coverage --- SkillCorner can generate tracking data for any match that is televised, regardless of whether the stadium has dedicated tracking infrastructure. The trade-off is reduced accuracy (typically $\pm 50$--$100$ cm) and lower effective frame rates (10--15 Hz) compared to dedicated systems, as well as limitations imposed by camera angles and broadcast production choices. Despite these constraints, SkillCorner's data has proven valuable for scouting across leagues that lack dedicated tracking systems, and their "Off-Ball Runs" and "Physical Performance" products have been adopted by numerous clubs.

Hawk-Eye Innovations (owned by Sony) is best known for its ball-tracking technology used in goal-line decisions and video assistant referee (VAR) systems. Hawk-Eye also provides player tracking in Serie A, the Bundesliga, and several other leagues. Their system uses a dense array of high-speed cameras (typically 12--14 per stadium) operating at up to 100 Hz for the ball and 25 Hz for players. Hawk-Eye's particular strength is three-dimensional ball tracking, including accurate height estimation, which is valuable for analyzing aerial duels, crosses, and shot trajectories.

TRACAB (ChyronHego) is one of the longest-established optical tracking systems, deployed in the Bundesliga since 2011 and used in LaLiga, the Eredivisie, and other competitions. TRACAB uses stereo camera pairs to triangulate player positions, delivering data at 25 Hz. Their system is known for robust performance in outdoor stadiums with variable lighting conditions, and they provide both real-time and post-match data feeds.

Callout: Choosing a Tracking Data Provider

When selecting a tracking data source, consider these factors: (1) Coverage --- does the provider cover the leagues and matches you need? (2) Accuracy --- what positional accuracy is required for your analysis? Sprint detection requires higher accuracy than formation analysis. (3) Latency --- do you need real-time data during matches, or is post-match delivery sufficient? (4) Derived metrics --- does the provider include pre-computed metrics (pitch control, pressing metrics), or will you compute everything from raw positions? (5) Cost --- broadcast-derived tracking (SkillCorner) is generally less expensive than dedicated optical systems.

18.1.3 Data Structure and Coordinate Systems

Tracking data is fundamentally a time series of $(x, y)$ coordinates for each tracked entity. The standard data structure consists of:

  • Frame ID or timestamp: An integer frame counter or a floating-point timestamp in seconds.
  • Player ID: A unique identifier for each player on the pitch.
  • Team ID: Identifies which team the player belongs to (home or away).
  • $x$ coordinate: The horizontal position on the pitch, in meters.
  • $y$ coordinate: The vertical position on the pitch, in meters.
  • Ball coordinates: The $(x, y)$ and sometimes $z$ (height) position of the ball.

The pitch coordinate system is typically centered at the midpoint of the field, with the $x$-axis running along the length (goal line to goal line) and the $y$-axis running along the width (touchline to touchline). Standard pitch dimensions are $105 \times 68$ meters, so coordinates range from $(-52.5, -34)$ to $(52.5, 34)$.

Callout: Coordinate System Conventions

Different data providers use different coordinate origins and axis orientations. Some place the origin at one corner of the pitch; others center it at midfield. Always verify the coordinate system before performing any spatial calculations. A common source of error is applying one provider's algorithms to another provider's data without adjusting for coordinate conventions.
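
As a minimal sketch of such an adjustment, the following shifts coordinates from a hypothetical corner-origin convention (origin at one corner, both axes in meters) to the centered convention used in this chapter. The function name `to_centered` and the input convention are assumptions for illustration; real providers may also flip axes or deliver normalized units, so check their documentation first.

```python
import numpy as np

PITCH_LENGTH = 105.0  # meters, the standard dimensions assumed in this chapter
PITCH_WIDTH = 68.0


def to_centered(xy_corner, length=PITCH_LENGTH, width=PITCH_WIDTH):
    """Shift corner-origin coordinates (meters) so the origin is the center spot."""
    xy_corner = np.asarray(xy_corner, dtype=float)
    return xy_corner - np.array([length / 2.0, width / 2.0])


# Three reference points: one corner, the center spot, the opposite corner
corner_origin = np.array([[0.0, 0.0], [52.5, 34.0], [105.0, 68.0]])
centered = to_centered(corner_origin)
```

Checking a few known landmarks (corners, center spot, penalty spots) after conversion is a cheap way to catch a wrong assumption about the provider's convention.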

A single frame of tracking data might look like:

frame_id  timestamp  player_id  team_id  x       y
1000      40.00      P001       home     -17.32  10.74
1000      40.00      P002       home     -24.10  -3.21
1000      40.00      P012       away     12.55   14.33
1000      40.00      ball       ---      -14.80  9.95
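
The sample frame above can be held as a long-format table and pivoted to a wide layout (one row per frame, one column per entity and coordinate), which is often more convenient for per-frame team calculations. This is a sketch assuming pandas; using `None` for the ball's team is an illustrative choice, not a provider convention.

```python
import pandas as pd

# Long format: one row per entity per frame, as in the sample above
frame = pd.DataFrame({
    "frame_id": [1000, 1000, 1000, 1000],
    "timestamp": [40.0, 40.0, 40.0, 40.0],
    "player_id": ["P001", "P002", "P012", "ball"],
    "team_id": ["home", "home", "away", None],
    "x": [-17.32, -24.10, 12.55, -14.80],
    "y": [10.74, -3.21, 14.33, 9.95],
})

# Wide format: columns become (coordinate, entity) pairs
wide = frame.pivot(index="frame_id", columns="player_id", values=["x", "y"])
ball_x = wide.loc[1000, ("x", "ball")]
```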

At 25 Hz over a 90-minute match (plus stoppage time), the dataset typically contains $25 \times 5700 \approx 142{,}500$ frames, each with rows for 20 outfield players, 2 goalkeepers, and the ball (23 entities), yielding approximately 3.3 million rows of positional data per match.

18.1.4 Data Format and Sampling Rates: 25 Hz vs 10 Hz

The sampling rate of tracking data has significant implications for analysis. The two most common rates are 25 Hz (optical systems) and 10 Hz (GPS/GNSS systems), though some systems operate at intermediate frequencies.

25 Hz tracking produces a frame every 0.04 seconds. At this resolution, rapid movements --- decelerations after a sprint, changes of direction during a dribble, the moment of ball contact during a pass --- are captured with sufficient detail for biomechanical analysis. The higher temporal resolution means that velocity and acceleration computed via finite differences are more accurate and less affected by aliasing.

10 Hz tracking produces a frame every 0.10 seconds. While adequate for most aggregate physical metrics (total distance, time in speed zones), 10 Hz data may miss the peak of very short acceleration bursts (lasting less than 0.5 seconds) and can underestimate peak speeds during brief sprints. The lower frame rate also means that interpolation errors have a larger impact on derived quantities.

The choice of sampling rate affects data volume linearly: a 25 Hz system generates 2.5 times as much data as a 10 Hz system per match. For season-long analyses across multiple teams, this difference in storage and processing requirements is substantial.

Callout: Downsampling and Upsampling

When combining data from different sources (e.g., 25 Hz optical tracking with 10 Hz GPS data from training), you may need to resample to a common frequency. Downsampling from 25 Hz to 10 Hz is not a matter of dropping frames: because the ratio 25/10 = 2.5 is not an integer, every second 10 Hz timestamp falls between two 25 Hz frames, so positions must be interpolated at the new timestamps. Upsampling from 10 Hz to 25 Hz also requires interpolation, which introduces artificial smoothness and can mask genuine high-frequency movement. Always prefer working at the native frequency of the data when possible, and document any resampling in your analysis pipeline.
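
A minimal sketch of resampling by linear interpolation, assuming timestamps in seconds and positions of shape (n_frames, 2). The function name `resample_positions` is hypothetical. When downsampling noisy data it is usually advisable to low-pass filter first, which this sketch omits.

```python
import numpy as np


def resample_positions(timestamps, positions, target_hz):
    """Linearly interpolate positions onto a uniform grid at target_hz."""
    timestamps = np.asarray(timestamps, dtype=float)
    positions = np.asarray(positions, dtype=float)
    span = timestamps[-1] - timestamps[0]
    # Small epsilon guards against floating-point rounding at the last sample
    n_out = int(np.floor(span * target_hz + 1e-9)) + 1
    new_t = timestamps[0] + np.arange(n_out) / target_hz
    new_xy = np.column_stack([
        np.interp(new_t, timestamps, positions[:, 0]),
        np.interp(new_t, timestamps, positions[:, 1]),
    ])
    return new_t, new_xy


# Two seconds of 25 Hz data: a player moving at a constant 3 m/s along x
t25 = np.arange(51) / 25.0
xy25 = np.column_stack([3.0 * t25, np.zeros_like(t25)])
t10, xy10 = resample_positions(t25, xy25, target_hz=10.0)
```

Linear interpolation is exact for the constant-velocity example here; for real trajectories it slightly flattens sharp changes of direction.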

18.1.5 Data Quality and Preprocessing

Raw tracking data requires substantial preprocessing before analysis:

Missing data and occlusions. Optical systems occasionally lose track of players due to occlusions (players blocking each other from camera view), jersey color confusion, or rapid direction changes. Missing positions must be interpolated. Linear interpolation is acceptable for gaps of 1--3 frames; spline interpolation is preferable for longer gaps. For gaps exceeding 1--2 seconds, interpolated data should be flagged as estimated and excluded from acceleration calculations.
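
The gap-handling rule can be sketched as follows, assuming missing positions are stored as NaN in a pandas Series for one coordinate of one player. The function name `fill_gaps` and the 1-second default are assumptions; for brevity the sketch uses linear interpolation for all gaps rather than switching to splines for longer ones.

```python
import numpy as np
import pandas as pd


def fill_gaps(x, fps=25.0, max_gap_s=1.0):
    """Interpolate NaN gaps and flag frames belonging to gaps longer than max_gap_s."""
    is_missing = x.isna()
    # A new run starts wherever the missing/non-missing state changes
    run_id = is_missing.ne(is_missing.shift(fill_value=False)).cumsum()
    # Length of each missing run (non-missing runs sum to zero)
    run_len = is_missing.groupby(run_id).transform("sum")
    estimated = is_missing & (run_len > max_gap_s * fps)
    # limit_area="inside" avoids extrapolating before the first or after the last sample
    filled = x.interpolate(method="linear", limit_area="inside")
    return filled, estimated


# A 2-frame gap (acceptable) followed by a 30-frame gap (1.2 s at 25 Hz, flagged)
x = pd.Series([0.0, 1.0, np.nan, np.nan, 4.0] + [np.nan] * 30 + [35.0])
filled, estimated = fill_gaps(x)
```

Frames where `estimated` is True can then be masked out before computing accelerations.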

Smoothing. Raw positional data contains measurement noise. Applying a Savitzky-Golay filter or a low-pass Butterworth filter reduces high-frequency noise while preserving the underlying signal. The choice of filter parameters (window size, polynomial order, or cutoff frequency) must balance noise reduction against signal distortion. A Butterworth filter with a cutoff frequency of 6--10 Hz is commonly used in the sports science literature. Note that the cutoff must lie below the Nyquist frequency (half the sampling rate), so cutoffs in this range are only applicable to data sampled well above 10 Hz, such as 25 Hz optical tracking.

Coordinate normalization. To enable cross-match and cross-league comparisons, coordinates should be normalized to a standard pitch size. If the actual pitch dimensions differ from $105 \times 68$ meters, positions should be scaled proportionally.

Period segmentation. Tracking data must be segmented by match period (first half, second half, extra time). At halftime, teams switch ends, so the direction of attack reverses. Coordinates should be flipped for the second half (in a centered coordinate system, by negating both $x$ and $y$) to maintain a consistent attacking direction throughout the analysis.
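
Normalization and period flipping can be combined in one step. The sketch below assumes centered coordinates in meters and an integer period label per frame; the function name and the choice of which periods to flip are assumptions, since the correct periods depend on which way the team attacks in the first half and on how extra time is labeled.

```python
import numpy as np


def normalize_and_align(xy, period, actual_length, actual_width,
                        std_length=105.0, std_width=68.0, flip_periods=(2,)):
    """Scale centered coordinates to a standard pitch and flip selected periods."""
    xy = np.array(xy, dtype=float)  # copy so the caller's array is not modified
    xy *= np.array([std_length / actual_length, std_width / actual_width])
    flip = np.isin(period, flip_periods)
    xy[flip] *= -1.0  # 180-degree rotation about the center spot
    return xy


# The same corner of a 100 x 64 m pitch, seen in the first and second half
xy = np.array([[50.0, 32.0], [50.0, 32.0]])
period = np.array([1, 2])
aligned = normalize_and_align(xy, period, actual_length=100.0, actual_width=64.0)
```

Because the two axes are scaled by different factors, distances and speeds computed after normalization are slightly distorted; physical metrics are better computed on the unscaled coordinates.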

Dead ball removal. During stoppages (throw-ins, goal kicks, free kicks, substitutions), player movement patterns differ fundamentally from open play. Depending on the analysis, you may wish to exclude dead-ball periods or analyze them separately. Identifying dead-ball segments typically requires synchronization with event data or analysis of ball velocity (the ball is stationary or moving very slowly during stoppages).
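
When no event data is available, the ball-velocity approach can be sketched as below. The speed threshold (0.5 m/s) and minimum duration (2 s) are illustrative assumptions, not established values, and the function name `flag_dead_ball` is hypothetical. A ball carried by hand during a stoppage would not be caught by this heuristic.

```python
import numpy as np


def flag_dead_ball(ball_xy, fps=25.0, speed_thresh=0.5, min_duration_s=2.0):
    """Flag frames where the ball stays (nearly) stationary for a sustained period."""
    velocity = np.gradient(np.asarray(ball_xy, dtype=float), 1.0 / fps, axis=0)
    slow = np.linalg.norm(velocity, axis=1) < speed_thresh

    # Locate the start and (exclusive) end of each run of slow frames
    padded = np.concatenate([[False], slow, [False]]).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)

    dead = np.zeros(len(slow), dtype=bool)
    for s, e in zip(starts, ends):
        if (e - s) / fps >= min_duration_s:
            dead[s:e] = True
    return dead


# Synthetic ball: rolling at 10 m/s, stationary for 4 s, then rolling again
x = np.concatenate([
    0.4 * np.arange(50),
    np.full(100, 19.6),
    19.6 + 0.4 * np.arange(1, 51),
])
ball_xy = np.column_stack([x, np.zeros_like(x)])
dead = flag_dead_ball(ball_xy)
```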

import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

def smooth_positions(
    positions: np.ndarray,
    window_length: int = 7,
    polyorder: int = 2
) -> np.ndarray:
    """Apply Savitzky-Golay smoothing to positional data.

    Args:
        positions: Array of shape (n_frames, 2) with x, y coordinates.
        window_length: Length of the filter window (must be odd).
        polyorder: Order of the polynomial used for fitting.

    Returns:
        Smoothed positions with the same shape as input.
    """
    smoothed = np.column_stack([
        savgol_filter(positions[:, 0], window_length, polyorder),
        savgol_filter(positions[:, 1], window_length, polyorder),
    ])
    return smoothed
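
The smoothing paragraph also mentions a low-pass Butterworth filter. A comparable sketch using SciPy's `butter` and `filtfilt` is given below; the 6 Hz default cutoff and fourth-order design are illustrative assumptions and should be tuned to the data.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def butterworth_smooth(positions, fps=25.0, cutoff_hz=6.0, order=4):
    """Zero-phase low-pass Butterworth smoothing of (n_frames, 2) positions."""
    nyquist = fps / 2.0
    if not 0.0 < cutoff_hz < nyquist:
        raise ValueError("cutoff_hz must lie between 0 and the Nyquist frequency")
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forward and backward, so it adds no phase lag
    return filtfilt(b, a, np.asarray(positions, dtype=float), axis=0)


# Straight-line run with 10 cm of simulated measurement noise
rng = np.random.default_rng(0)
t = np.arange(500) / 25.0
truth = np.column_stack([4.0 * t, 1.5 * t])
noisy = truth + rng.normal(scale=0.1, size=truth.shape)
smoothed = butterworth_smooth(noisy)
```

The explicit Nyquist check matters in practice: a cutoff that is valid for 25 Hz data can be invalid for 10 Hz data.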

Callout: Validation of Preprocessing Steps

Every preprocessing step (interpolation, smoothing, normalization) introduces assumptions and potential distortions. Validate your pipeline by: (1) computing metrics on both raw and preprocessed data and checking that differences are within expected ranges; (2) visually inspecting trajectories before and after smoothing for a sample of players; (3) comparing your total distance values against published benchmarks for the relevant league and position. If your preprocessing yields total distances that differ from published norms by more than 5%, investigate your smoothing parameters.

18.1.6 Frame Rate and Temporal Resolution

The frame rate of the tracking system determines the temporal resolution of derived metrics. At 25 Hz, each frame represents $\Delta t = 0.04$ seconds. Velocities and accelerations computed from finite differences are sensitive to this time step:

$$ v_x(t) \approx \frac{x(t + \Delta t) - x(t)}{\Delta t}, \quad v_y(t) \approx \frac{y(t + \Delta t) - y(t)}{\Delta t} $$

The instantaneous speed is:

$$ s(t) = \sqrt{v_x(t)^2 + v_y(t)^2} $$

Higher frame rates provide better temporal resolution but also amplify measurement noise in derivative quantities. This is why smoothing is applied before computing velocities and accelerations.


18.2 Physical Performance Metrics

18.2.1 Overview of Physical Demands

Professional soccer places heterogeneous physical demands on players. A typical outfield player covers 10--13 km in a 90-minute match, with approximately 80% at low-to-moderate intensity (walking, jogging) and 20% at high intensity (running, sprinting). Physical performance metrics derived from tracking data quantify these demands with precision, enabling performance staff to monitor workload, assess fitness, and inform tactical decisions.

The standard physical performance metrics include:

  • Total distance covered
  • Distance covered in speed zones (walking, jogging, running, high-speed running, sprinting)
  • Number and distance of sprints
  • Peak speed
  • Number of accelerations and decelerations above thresholds
  • Metabolic power and energy expenditure

18.2.2 Speed Zone Definitions

Speed zones partition the continuous speed distribution into discrete categories. While exact thresholds vary by provider, position, and league, a widely used classification is:

Zone  Category            Speed Range (km/h)  Speed Range (m/s)
1     Walking             0 -- 9.0            0 -- 2.50
2     Jogging             9.0 -- 16.4         2.50 -- 4.56
3     Running             16.4 -- 21.8        4.56 -- 6.06
4     High-speed running  21.8 -- 27.2        6.06 -- 7.56
5     Sprinting           > 27.2              > 7.56

Callout: Individualized Speed Thresholds

Research by Abt and Lovell (2009) and others has demonstrated that fixed speed thresholds may misclassify efforts for players with different physical profiles. A sprint for a central defender may occur at a lower absolute speed than for a winger. Some modern systems use individualized thresholds based on each player's maximum speed, typically defining high-speed running as $>75\%$ of $v_{\max}$ and sprinting as $>90\%$ of $v_{\max}$. This individualization is particularly important when comparing physical output across positions or when using tracking data to inform training load prescriptions. A central defender running at 28 km/h is operating at a much higher fraction of their capacity than a winger at the same speed, and the physiological cost and injury risk associated with that effort are correspondingly greater.
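
The effect of individualized thresholds can be illustrated with a small sketch. The 75% and 90% fractions come from the text above; the function name and the two maximum speeds are hypothetical values chosen for illustration.

```python
def individual_thresholds(v_max, hsr_fraction=0.75, sprint_fraction=0.90):
    """Return per-player speed thresholds in the same unit as v_max."""
    return {"hsr": hsr_fraction * v_max, "sprint": sprint_fraction * v_max}


# Hypothetical maximum speeds in km/h
defender = individual_thresholds(v_max=30.0)
winger = individual_thresholds(v_max=35.0)

effort = 28.0  # km/h, the same absolute speed for both players
defender_is_sprinting = effort > defender["sprint"]  # threshold is 27.0 km/h
winger_is_sprinting = effort > winger["sprint"]      # threshold is 31.5 km/h
```

The same 28 km/h effort counts as a sprint for the defender but only as high-speed running for the winger.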

18.2.3 Sprint Detection and High-Intensity Running Analysis

A sprint is typically defined as a sustained period of movement above the sprinting speed threshold ($> 27.2$ km/h, i.e. $> 7.56$ m/s) for a minimum duration (usually $\geq 1$ second or $\geq 0.5$ seconds, depending on the provider). Sprint detection involves:

  1. Identifying all frames where instantaneous speed exceeds the threshold.
  2. Grouping consecutive above-threshold frames into sprint events.
  3. Filtering out events shorter than the minimum duration.
  4. Recording sprint attributes: start time, end time, duration, peak speed, distance covered.

Beyond simple sprint counting, advanced sprint analysis examines the context and quality of each sprint:

  • Repeated sprint ability (RSA): The capacity to maintain sprint quality across multiple sprints with short recovery intervals. RSA is quantified by examining sequences of sprints with less than 20--30 seconds between them and comparing the peak speed and distance of successive sprints within the sequence. A player whose fifth sprint in a sequence maintains 95% of the peak speed of their first sprint demonstrates superior RSA compared to one who declines to 85%.

  • Sprint initiation speed: How quickly a player reaches sprinting speed from a standing start or from jogging. This is measured by the average acceleration in the 1--2 seconds preceding the sprint threshold crossing. Players with faster sprint initiation are more effective in pressing triggers and counter-attacking situations.

  • Sprint direction classification: Sprints can be classified by direction relative to the goal: forward sprints (toward the opponent's goal), backward sprints (recovery runs), and lateral sprints (covering width). The distribution of sprint directions reveals tactical roles --- wingers make more forward sprints, full-backs more lateral and backward sprints.

  • High-intensity running (HIR) analysis extends beyond sprints to encompass all movement above the high-speed threshold (typically $>21.8$ km/h). HIR distance is a more robust measure of high-intensity effort than sprint distance alone, as it captures the substantial physiological demands of sub-maximal high-speed running. Research consistently shows that HIR distance is a better predictor of match performance quality than sprint distance, as many game-decisive actions occur at high-speed running intensities rather than at full sprint.

def detect_sprints(
    speed: np.ndarray,
    timestamps: np.ndarray,
    speed_threshold: float = 7.56,  # 27.2 km/h expressed in m/s
    min_duration: float = 1.0,
) -> list[dict]:
    """Detect sprint events from a speed time series.

    Args:
        speed: Array of instantaneous speeds in m/s.
        timestamps: Array of corresponding timestamps in seconds.
        speed_threshold: Minimum speed to qualify as sprinting (m/s).
        min_duration: Minimum sprint duration in seconds.

    Returns:
        List of dicts with sprint attributes (start, end, duration,
        peak_speed, distance).
    """
    above_threshold = speed >= speed_threshold
    sprints = []
    in_sprint = False
    start_idx = 0

    for i in range(len(above_threshold)):
        if above_threshold[i] and not in_sprint:
            in_sprint = True
            start_idx = i
        elif not above_threshold[i] and in_sprint:
            in_sprint = False
            duration = timestamps[i - 1] - timestamps[start_idx]
            if duration >= min_duration:
                dt = np.diff(timestamps[start_idx:i])
                avg_speeds = (speed[start_idx:i - 1] + speed[start_idx + 1:i]) / 2
                distance = np.sum(avg_speeds * dt)
                sprints.append({
                    "start_time": timestamps[start_idx],
                    "end_time": timestamps[i - 1],
                    "duration": duration,
                    "peak_speed": float(np.max(speed[start_idx:i])),
                    "distance": float(distance),
                })

    # Handle sprint extending to end of data
    if in_sprint:
        duration = timestamps[-1] - timestamps[start_idx]
        if duration >= min_duration:
            dt = np.diff(timestamps[start_idx:])
            avg_speeds = (speed[start_idx:-1] + speed[start_idx + 1:]) / 2
            distance = np.sum(avg_speeds * dt)
            sprints.append({
                "start_time": timestamps[start_idx],
                "end_time": timestamps[-1],
                "duration": duration,
                "peak_speed": float(np.max(speed[start_idx:])),
                "distance": float(distance),
            })

    return sprints

Callout: The "Sprint Desert" Problem

A common analytical trap is to interpret a lack of sprints as low effort or poor fitness. In reality, tactical context heavily influences sprint frequency. A central midfielder in a possession-dominant team (e.g., Manchester City under Pep Guardiola) may make very few sprints because the team's playing style reduces the need for explosive efforts. Conversely, a winger in a counter-attacking team may accumulate many sprints not because they are fitter, but because the tactical system demands it. Always contextualize sprint data within the team's tactical framework and the player's role within it.

18.2.4 Metabolic Power

Metabolic power, introduced by di Prampero et al. (2005) and refined by Osgnach et al. (2010), provides an energy-based measure of physical effort that accounts for both speed and acceleration. The metabolic power $P_{\text{met}}$ at any instant is:

$$ P_{\text{met}} = EC \cdot s $$

where $s$ is the instantaneous speed and $EC$ is the energy cost of locomotion per unit distance. The energy cost depends on the equivalent slope $ES$, which maps the horizontal acceleration to an equivalent uphill gradient:

$$ ES = \frac{a_h}{g} $$

$$ \alpha_E = \arctan(ES) $$

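$$ EM = \sqrt{1 + ES^2} $$

$$ EC = \left( 155.4 \cdot ES^5 - 30.4 \cdot ES^4 - 43.3 \cdot ES^3 + 46.3 \cdot ES^2 + 19.5 \cdot ES + 3.6 \right) \cdot EM \cdot KT $$

where $a_h$ is the horizontal acceleration, $g = 9.81 \; \text{m/s}^2$ is gravitational acceleration, $EM$ is the equivalent body mass (the factor by which the resultant of gravity and the accelerating force exceeds body weight), and $KT \approx 1.29$ is a terrain constant that accounts for the higher cost of running on grass than on a treadmill. With $EC$ in J/(kg·m) and $s$ in m/s, $P_{\text{met}}$ is expressed in W/kg. At constant speed ($ES = 0$) the energy cost reduces to $3.6 \cdot KT \approx 4.6$ J/(kg·m). This metric enables comparison of efforts that combine acceleration and constant-speed running into a unified energy expenditure framework.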
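
A minimal sketch of the calculation, following the formulation published by Osgnach et al. (2010), in which the polynomial in $ES$ is multiplied by the equivalent mass and a terrain constant of 1.29. The function name is hypothetical, and inputs are assumed to be speed in m/s and forward acceleration in m/s².

```python
import numpy as np

G = 9.81   # gravitational acceleration, m/s^2
KT = 1.29  # terrain constant for grass (Osgnach et al., 2010)


def metabolic_power(speed, accel):
    """Instantaneous metabolic power in W/kg from speed (m/s) and acceleration (m/s^2)."""
    es = np.asarray(accel, dtype=float) / G   # equivalent slope
    em = np.sqrt(1.0 + es ** 2)               # equivalent body mass
    ec = (155.4 * es ** 5 - 30.4 * es ** 4 - 43.3 * es ** 3
          + 46.3 * es ** 2 + 19.5 * es + 3.6) * em * KT  # J/(kg*m)
    return ec * np.asarray(speed, dtype=float)


p_steady = metabolic_power(5.0, 0.0)   # constant-speed running at 5 m/s
p_accel = metabolic_power(5.0, 2.0)    # same speed while accelerating at 2 m/s^2
```

Integrating this quantity over time (for example with `np.trapz` or a simple sum times the frame interval) gives energy expenditure in J/kg.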

Callout: Limitations of Metabolic Power

The metabolic power model has been critiqued for several reasons: (1) it was originally derived from linear running on flat surfaces and may not accurately represent the energy cost of multidirectional movement; (2) it does not account for the metabolic cost of deceleration (eccentric muscle actions); (3) its validity at very high accelerations is uncertain. Despite these limitations, metabolic power remains widely used as a complementary workload metric.


18.3 Speed and Acceleration Analysis

18.3.1 Computing Velocity Vectors

Given smoothed positional data, the velocity vector at time $t$ is computed via central finite differences:

$$ v_x(t) = \frac{x(t + \Delta t) - x(t - \Delta t)}{2 \Delta t}, \quad v_y(t) = \frac{y(t + \Delta t) - y(t - \Delta t)}{2 \Delta t} $$

The central difference scheme is preferred over the forward difference because it reduces the truncation error from $O(\Delta t)$ to $O(\Delta t^2)$.

The speed (scalar) and direction (angle) are:

$$ s(t) = \| \mathbf{v}(t) \| = \sqrt{v_x^2 + v_y^2} $$

$$ \theta(t) = \arctan2(v_y, v_x) $$
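
These formulas map directly onto `np.gradient`, which uses central differences at interior frames and one-sided differences at the two end frames. The sketch assumes uniformly sampled, already smoothed positions; the function name is hypothetical.

```python
import numpy as np


def velocity_from_positions(positions, fps=25.0):
    """Return velocity vectors, speed, and heading from (n_frames, 2) positions."""
    positions = np.asarray(positions, dtype=float)
    velocity = np.gradient(positions, 1.0 / fps, axis=0)
    speed = np.linalg.norm(velocity, axis=1)
    heading = np.arctan2(velocity[:, 1], velocity[:, 0])  # radians
    return velocity, speed, heading


# Constant velocity of (3, 4) m/s, so the speed should be 5 m/s throughout
t = np.arange(100) / 25.0
positions = np.column_stack([3.0 * t, 4.0 * t])
velocity, speed, heading = velocity_from_positions(positions)
```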

18.3.2 Computing Acceleration

Acceleration is the time derivative of velocity. Using central differences on the velocity:

$$ a_x(t) = \frac{v_x(t + \Delta t) - v_x(t - \Delta t)}{2 \Delta t}, \quad a_y(t) = \frac{v_y(t + \Delta t) - v_y(t - \Delta t)}{2 \Delta t} $$

The scalar acceleration magnitude is:

$$ a(t) = \sqrt{a_x^2 + a_y^2} $$

However, it is often more informative to decompose acceleration into tangential and normal (centripetal) components:

  • Tangential acceleration $a_{\parallel}$ reflects changes in speed (speeding up or slowing down).
  • Normal acceleration $a_{\perp}$ reflects changes in direction (turning).

$$ a_{\parallel}(t) = \frac{\mathbf{v}(t) \cdot \mathbf{a}(t)}{\| \mathbf{v}(t) \|} $$

$$ a_{\perp}(t) = \frac{\| \mathbf{v}(t) \times \mathbf{a}(t) \|}{\| \mathbf{v}(t) \|} $$

In two dimensions, the cross product yields a scalar:

$$ \mathbf{v} \times \mathbf{a} = v_x a_y - v_y a_x $$
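
The decomposition can be sketched as follows. Frames where the player is effectively stationary are returned as NaN because the direction of travel is undefined there; the `eps` cutoff is an assumption.

```python
import numpy as np


def decompose_acceleration(velocity, accel, eps=1e-6):
    """Split acceleration into tangential and normal components (m/s^2)."""
    velocity = np.asarray(velocity, dtype=float)
    accel = np.asarray(accel, dtype=float)
    speed = np.linalg.norm(velocity, axis=1)
    safe_speed = np.where(speed > eps, speed, np.nan)
    # Signed: positive when speeding up, negative when slowing down
    a_par = np.sum(velocity * accel, axis=1) / safe_speed
    # Magnitude of the 2D cross product divided by speed
    cross = velocity[:, 0] * accel[:, 1] - velocity[:, 1] * accel[:, 0]
    a_perp = np.abs(cross) / safe_speed
    return a_par, a_perp


velocity = np.array([[2.0, 0.0], [0.0, 5.0], [0.0, 0.0]])
accel = np.array([[1.0, 3.0], [-2.0, 0.0], [1.0, 1.0]])
a_par, a_perp = decompose_acceleration(velocity, accel)
```

In the second row the acceleration is perpendicular to the velocity, so the player is turning at constant speed.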

18.3.3 Acceleration and Deceleration Events

High-intensity accelerations and decelerations are physiologically demanding and tactically significant. An acceleration event is defined as a sustained period where $a_{\parallel}$ exceeds a threshold (typically $> 2 \; \text{m/s}^2$ for moderate intensity, $> 3 \; \text{m/s}^2$ for high intensity). Deceleration events are defined symmetrically with $a_{\parallel} < -2 \; \text{m/s}^2$ or $a_{\parallel} < -3 \; \text{m/s}^2$.

These events are particularly relevant for:

  • Injury risk assessment: High deceleration loads are associated with hamstring injury risk. Eccentric muscle loading during rapid decelerations places substantial strain on the hamstring muscle group, and the cumulative deceleration load across a match or training week is a recognized risk factor for muscle injuries.
  • Pressing analysis: Pressing actions require rapid accelerations toward the ball carrier. The frequency and intensity of pressing accelerations quantify a player's defensive work rate beyond what distance-based metrics can capture.
  • Transition analysis: Counter-attacks involve explosive accelerations from defensive positions. The ability to produce high accelerations from low starting speeds is a key physical quality for players in teams that rely on transitions.

18.3.4 Speed Profiles and Distributions

A player's speed profile summarizes the distribution of time spent at different speeds during a match. This is typically visualized as:

  1. Histograms of time in each speed zone.
  2. Cumulative distribution functions (CDFs) of speed.
  3. Time-series plots of speed with zone boundaries overlaid.
  4. Speed heatmaps combining spatial position with speed information.

import matplotlib.pyplot as plt

def plot_speed_profile(
    speed: np.ndarray,
    timestamps: np.ndarray,
    zones: list[tuple[float, float]] | None = None,
    zone_labels: list[str] | None = None,
    title: str = "Speed Profile",
) -> plt.Figure:
    """Plot a player's speed profile over time with speed zones.

    Args:
        speed: Array of instantaneous speeds in m/s.
        timestamps: Array of corresponding timestamps in seconds.
        zones: List of (min_speed, max_speed) tuples for each zone.
        zone_labels: Labels for each speed zone.
        title: Plot title.

    Returns:
        Matplotlib Figure object.
    """
    if zones is None:
        # Zone boundaries of 9.0, 16.4, 21.8 and 27.2 km/h converted to m/s
        zones = [
            (0, 2.50), (2.50, 4.56), (4.56, 6.06),
            (6.06, 7.56), (7.56, float("inf")),
        ]
    if zone_labels is None:
        zone_labels = [
            "Walking", "Jogging", "Running",
            "High-speed", "Sprint",
        ]

    colors = ["#2ecc71", "#f1c40f", "#e67e22", "#e74c3c", "#8e44ad"]

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8))

    # Time series
    minutes = timestamps / 60
    ax1.plot(minutes, speed * 3.6, color="black", linewidth=0.5, alpha=0.7)
    for (zmin, zmax), color, label in zip(zones, colors, zone_labels):
        ax1.axhspan(zmin * 3.6, min(zmax, 12) * 3.6,
                     alpha=0.15, color=color, label=label)
    ax1.set_xlabel("Time (minutes)")
    ax1.set_ylabel("Speed (km/h)")
    ax1.set_title(title)
    ax1.legend(loc="upper right", fontsize=8)

    # Histogram of time in zones
    dt = np.median(np.diff(timestamps))
    zone_times = []
    for zmin, zmax in zones:
        mask = (speed >= zmin) & (speed < zmax)
        zone_times.append(np.sum(mask) * dt / 60)

    bars = ax2.bar(zone_labels, zone_times, color=colors, edgecolor="black")
    ax2.set_ylabel("Time (minutes)")
    ax2.set_title("Time in Speed Zones")
    for bar, t in zip(bars, zone_times):
        ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.3,
                 f"{t:.1f}", ha="center", fontsize=9)

    fig.tight_layout()
    return fig

18.4 Distance and Work Rate Metrics

18.4.1 Total Distance

Total distance covered is the most basic physical performance metric. It is computed by summing the Euclidean distances between consecutive positions:

$$ D_{\text{total}} = \sum_{i=1}^{N-1} \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2} $$

Equivalently, using speed:

$$ D_{\text{total}} = \sum_{i=1}^{N-1} s_i \cdot \Delta t $$

where $s_i$ is the speed at frame $i$ and $\Delta t$ is the time step between frames.
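
The first formula translates into a few lines of NumPy. The sketch assumes smoothed positions of shape (n_frames, 2) in meters; the function name is hypothetical.

```python
import numpy as np


def total_distance(positions):
    """Sum of Euclidean distances between consecutive positions, in meters."""
    steps = np.linalg.norm(np.diff(np.asarray(positions, dtype=float), axis=0), axis=1)
    return float(steps.sum())


# Two 5 m steps along a 3-4-5 triangle
positions = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
distance = total_distance(positions)
```

Because every frame-to-frame jitter adds to the sum, unsmoothed positions tend to inflate total distance, which is one reason smoothing is applied first.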

Typical values by position in a 90-minute match:

Position             Total Distance (km)
Central midfielders  11.5 -- 13.0
Wide midfielders     10.5 -- 12.0
Full-backs           10.0 -- 11.5
Center-backs         9.0 -- 10.5
Strikers             9.5 -- 11.0
Goalkeepers          5.0 -- 6.5

18.4.2 High-Speed Running Distance (HSRD)

High-speed running distance is the total distance covered at speeds above a high-speed threshold (typically $>21.8$ km/h, i.e. $>6.06$ m/s). HSRD is a key performance indicator that differentiates between positions and correlates with tactical roles:

$$ D_{\text{HSR}} = \sum_{i : s_i > s_{\text{thresh}}} s_i \cdot \Delta t $$

Research consistently shows that HSRD decreases in the second half compared to the first half, reflecting the onset of fatigue. The magnitude of this decrease varies by position, fitness level, and match context.
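
A sketch of the threshold sum, assuming a uniformly sampled speed series in m/s. The function name is hypothetical, and the threshold is passed in explicitly so that fixed or individualized values can be used.

```python
import numpy as np


def distance_above_threshold(speed, dt, threshold):
    """Distance (m) covered in frames where speed exceeds the threshold."""
    speed = np.asarray(speed, dtype=float)
    return float(np.sum(speed[speed > threshold]) * dt)


hsr_threshold = 21.8 / 3.6               # km/h converted to m/s
speed = np.array([5.0, 7.0, 8.0, 6.0])   # m/s, four frames at 25 Hz
hsr_distance = distance_above_threshold(speed, dt=0.04, threshold=hsr_threshold)
```

Applying the same function separately to first-half and second-half frames gives the half-by-half comparison discussed above.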

18.4.3 Sprint Distance

Sprint distance quantifies the total distance covered during detected sprint events (typically $>27.2$ km/h, i.e. $>7.56$ m/s). While related to HSRD, sprint distance focuses on the most intense efforts and is particularly sensitive to:

  • Playing position (wingers and full-backs typically accumulate the highest sprint distances)
  • Match state (losing teams tend to sprint more, especially late in matches)
  • Tactical system (counter-attacking teams accumulate more sprint distance than possession-based teams)

18.4.4 Distance Per Minute (Work Rate)

Work rate normalizes distance by playing time, enabling fair comparison between players with different playing minutes (starters vs. substitutes, players who are substituted off):

$$ W = \frac{D_{\text{total}}}{t_{\text{played}}} $$

where $t_{\text{played}}$ is in minutes. The result is typically expressed in meters per minute (m/min). Average work rates for outfield players range from 100 to 140 m/min.

Work rate can be computed in rolling windows (e.g., 5-minute intervals) to examine temporal patterns within a match. A sustained decline in work rate is a common indicator of fatigue.
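
One way to compute a rolling work rate is with a cumulative sum over per-frame distances. The sketch below assumes a gap-free speed series at a fixed time step; `rolling_work_rate` and its parameters are hypothetical names, and the window is trailing (it ends at the current frame).

```python
import numpy as np


def rolling_work_rate(
    speeds: np.ndarray, dt: float, window_s: float = 300.0
) -> np.ndarray:
    """Trailing-window work rate in m/min; NaN until a full window exists."""
    frames = int(round(window_s / dt))
    if frames < 1:
        raise ValueError("window_s must span at least one frame.")
    step_dist = speeds * dt  # meters covered in each frame
    csum = np.concatenate(([0.0], np.cumsum(step_dist)))
    out = np.full(len(speeds), np.nan)
    # Distance inside each full window, converted to meters per minute.
    out[frames - 1:] = (csum[frames:] - csum[:-frames]) / (window_s / 60.0)
    return out


# A player moving at a constant 2 m/s covers 120 m every minute.
demo = rolling_work_rate(np.full(120, 2.0), dt=1.0, window_s=60.0)
print(demo[59])
```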

18.4.5 Distance in and out of Possession

Partitioning distance by possession state reveals tactical information:

  • In-possession distance: Distance covered while the player's team has the ball.
  • Out-of-possession distance: Distance covered while the opposing team has the ball.
  • Transition distance: Distance covered in the moments immediately following a turnover.

Teams that press aggressively out of possession exhibit higher out-of-possession work rates. Teams that build up slowly in possession may show lower in-possession speeds but cover more lateral distance.

def compute_distance_by_possession(
    positions: np.ndarray,
    timestamps: np.ndarray,
    possession_team: np.ndarray,
    player_team: str,
) -> dict[str, float]:
    """Compute distance covered in and out of possession.

    Args:
        positions: Array of shape (n_frames, 2) with x, y coordinates.
        timestamps: Array of timestamps in seconds.
        possession_team: Array indicating which team has possession at
            each frame ('home', 'away', or 'dead_ball').
        player_team: The team identifier for this player ('home' or 'away').

    Returns:
        Dict with 'in_possession', 'out_of_possession', and
        'dead_ball' distances in meters.
    """
    displacements = np.sqrt(np.sum(np.diff(positions, axis=0) ** 2, axis=1))
    in_poss = possession_team[:-1] == player_team
    out_poss = (possession_team[:-1] != player_team) & (
        possession_team[:-1] != "dead_ball"
    )
    dead = possession_team[:-1] == "dead_ball"

    return {
        "in_possession": float(np.sum(displacements[in_poss])),
        "out_of_possession": float(np.sum(displacements[out_poss])),
        "dead_ball": float(np.sum(displacements[dead])),
    }

18.4.6 Physical Benchmarking Across Positions and Leagues

Physical benchmarking establishes reference ranges for physical performance metrics by position, league, and competition level. This enables clubs to evaluate whether a player's physical output is above, below, or within the expected range for their role, and to make cross-league comparisons when scouting.

Position-specific benchmarks are essential because physical demands vary enormously by role. A central midfielder's value lies in sustained moderate-intensity running and frequent changes of direction, while a winger's value lies in sprint frequency and top-end speed. Comparing a central midfielder's sprint distance to a winger's is meaningless without positional context.

League-level benchmarks reveal systematic differences in physical demands across competitions. The Bundesliga and Premier League consistently show the highest average physical outputs among Europe's top five leagues, with higher total distances, more sprints, and greater high-intensity running distances. Serie A and LaLiga tend to show lower physical outputs, partly due to tactical preferences (more possession-based play) and partly due to climate (higher temperatures in Spain and Italy). These league-level differences are critical when evaluating whether a player's physical profile will translate to a new competitive environment.

| League | Avg Total Distance (km) | Avg HSRD (m) | Avg Sprints per Match |
|---|---|---|---|
| Bundesliga | 13.2 | 1,180 | 42 |
| Premier League | 13.0 | 1,150 | 40 |
| Ligue 1 | 12.7 | 1,050 | 37 |
| LaLiga | 12.5 | 980 | 34 |
| Serie A | 12.4 | 960 | 33 |

Callout: Benchmarking Caveats

Physical benchmarks should always be interpreted with caution. First, they are influenced by match context --- a team that leads 3-0 by halftime will naturally reduce its physical output in the second half. Second, tracking data providers use different smoothing algorithms and speed zone thresholds, making direct comparisons between providers unreliable. Third, effective playing time varies by league (the ball is in play for approximately 58 minutes in the Premier League versus 52 minutes in Serie A), which directly affects physical output. Always normalize for effective playing time when making cross-league comparisons, and ensure you are comparing data from the same provider.


18.5 Team Shape Metrics and Collective Movement

18.5.1 Beyond Individual Metrics

While individual physical metrics are valuable, soccer is fundamentally a team sport. The collective organization of a team --- how players move in coordination, maintain formation shape, and respond as a unit to game events --- is at least as important as individual output. Tracking data enables the quantification of these collective behaviors.

18.5.2 Team Centroid and Spread

The team centroid is the average position of all outfield players on a team at a given moment:

$$ \bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t), \quad \bar{y}(t) = \frac{1}{n} \sum_{i=1}^{n} y_i(t) $$

where $n$ is the number of outfield players (typically 10, excluding the goalkeeper).

The team spread quantifies how dispersed the players are around the centroid. It can be measured as:

  • Length: The range of the team along the $x$-axis (direction of play): $L = x_{\max} - x_{\min}$.
  • Width: The range along the $y$-axis: $W = y_{\max} - y_{\min}$.
  • Surface area: The area of the convex hull enclosing all outfield players.
  • Standard distance: $\sigma_d = \sqrt{\frac{1}{n}\sum_{i=1}^{n} [(x_i - \bar{x})^2 + (y_i - \bar{y})^2]}$

Team length (the distance from the most advanced to the most retreated outfield player along the axis of play) is a key indicator of defensive compactness. Well-organized defenses maintain a short team length (typically 30--40 meters), minimizing the space between defensive and attacking lines. When the team length exceeds 45--50 meters, gaps between lines become exploitable by opponents.

Team width measures lateral spread. In possession, effective teams stretch the opposition by maintaining wide positions (width of 55--65 meters). Out of possession, width contracts as players move centrally to protect the goal (width of 35--45 meters).
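
The centroid and spread measures above can be computed for a single frame in a few lines. This is a sketch under the assumption that `positions` holds outfield players only, in meters, with $x$ along the direction of play; `team_shape` is an illustrative name.

```python
import numpy as np


def team_shape(positions: np.ndarray) -> dict[str, float]:
    """Centroid, length, width and standard distance for one frame.

    positions: array of shape (n_players, 2) with x, y in meters.
    """
    centroid = positions.mean(axis=0)
    length = positions[:, 0].max() - positions[:, 0].min()
    width = positions[:, 1].max() - positions[:, 1].min()
    sq_dist = np.sum((positions - centroid) ** 2, axis=1)
    return {
        "centroid_x": float(centroid[0]),
        "centroid_y": float(centroid[1]),
        "length": float(length),
        "width": float(width),
        "standard_distance": float(np.sqrt(sq_dist.mean())),
    }


# Four players on the corners of a 10 m x 20 m rectangle.
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 20.0], [0.0, 20.0]])
print(team_shape(corners))
```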

18.5.3 Convex Hull and Effective Playing Area

The convex hull of a team's positions defines the smallest convex polygon containing all outfield players. Its area, known as the effective playing area or team surface area, is a measure of how much of the pitch the team occupies:

$$ A_{\text{hull}} = \text{Area}(\text{ConvexHull}(\{(x_i, y_i)\}_{i=1}^{n})) $$

Typical effective playing areas range from 800 to 2,500 m$^2$, depending on game state. Teams tend to expand (increase area) when in possession and compress (decrease area) when defending. The ratio of in-possession area to out-of-possession area is itself informative: teams with high ratios demonstrate strong tactical discipline in alternating between expansive attacking shapes and compact defensive blocks.

from scipy.spatial import ConvexHull

def compute_team_surface_area(
    positions: np.ndarray,
) -> float:
    """Compute the convex hull area of team positions.

    Args:
        positions: Array of shape (n_players, 2) with x, y coordinates
            for outfield players at a single frame.

    Returns:
        Area of the convex hull in square meters.

    Raises:
        ValueError: If fewer than 3 player positions are provided.
    """
    if len(positions) < 3:
        raise ValueError("Need at least 3 points for a convex hull.")
    hull = ConvexHull(positions)
    return float(hull.volume)  # In 2D, 'volume' gives area

18.5.4 Stretch Index

The stretch index (Bourbousson et al., 2010) measures the average distance of all outfield players from the team centroid:

$$ SI(t) = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(x_i(t) - \bar{x}(t))^2 + (y_i(t) - \bar{y}(t))^2} $$

The stretch index captures team compactness: a low stretch index indicates a compact formation, while a high value indicates a spread-out team.
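
A vectorized sketch of the stretch index across a whole sequence, assuming positions are stacked as `(n_frames, n_players, 2)` with no missing players. The function name is illustrative.

```python
import numpy as np


def stretch_index(positions: np.ndarray) -> np.ndarray:
    """Mean player-to-centroid distance per frame.

    positions: array of shape (n_frames, n_players, 2).
    Returns an array of shape (n_frames,).
    """
    centroid = positions.mean(axis=1, keepdims=True)
    return np.linalg.norm(positions - centroid, axis=2).mean(axis=1)


# One frame, four players each 5 m from the centroid at the origin.
frame = np.array([[[3.0, 4.0], [-3.0, -4.0], [3.0, -4.0], [-3.0, 4.0]]])
print(stretch_index(frame))
```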

18.5.5 Dyadic Distances and Inter-Player Coordination

Dyadic distances are the pairwise Euclidean distances between all player pairs. For $n$ players, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ unique pairs. The distribution and temporal evolution of these distances reveal coordination patterns:

  • Within-team dyadic distances capture formation rigidity (low variance = rigid formation) vs. fluidity (high variance = position interchange).
  • Between-team dyadic distances capture marking relationships and pressing structure.
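
Within-team dyadic distances can be sketched with SciPy's `pdist`, which returns the condensed vector of pairwise distances for one frame. The wrapper name `dyadic_distance_series` is hypothetical, and the input is assumed to be a gap-free `(n_frames, n_players, 2)` array. The per-pair variance over time is one possible rigidity indicator, in line with the interpretation above.

```python
import numpy as np
from scipy.spatial.distance import pdist


def dyadic_distance_series(positions: np.ndarray) -> np.ndarray:
    """Pairwise distances per frame, shape (n_frames, n_pairs)."""
    return np.stack([pdist(frame) for frame in positions])


# Two frames, three players; the third player moves 2 m between frames.
frames = np.array([
    [[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]],
    [[0.0, 0.0], [3.0, 4.0], [0.0, 10.0]],
])
distances = dyadic_distance_series(frames)
pair_variance = distances.var(axis=0)  # low variance suggests a rigid pair
print(distances.shape, pair_variance)
```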

18.5.6 Collective Movement Patterns and Team Coordination

Beyond static shape metrics, tracking data enables analysis of how players move collectively over time. Several frameworks capture these dynamics:

Phase synchronization analysis examines whether players' oscillatory movements (forward-backward, side-to-side) are temporally aligned. Using the Hilbert transform to extract the instantaneous phase of each player's movement along the $x$-axis, we can compute the relative phase between any two players. When two players are in-phase (relative phase near $0$), they move forward and backward together. When anti-phase (relative phase near $\pi$), they move in opposite directions --- a pattern observed between attackers making runs and midfielders dropping back to receive.

Principal component analysis of team movement treats the 20 coordinates (10 players $\times$ 2 dimensions) of a team's outfield positions as a high-dimensional time series. PCA identifies the dominant modes of collective movement. The first principal component typically captures the team's forward-backward oscillation as a unit. Higher-order components capture width changes, rotations, and asymmetric movements. The explained variance ratios indicate how much of the team's movement is coordinated versus independent: highly organized teams show more variance explained by the first few components.

Relative phase coupling between teams examines how the two teams' centroids co-evolve. Research by Frencken et al. (2011) showed that, measured in a common pitch coordinate system, the centroids of the attacking and defending teams oscillate in a near-in-phase relationship: as one team pushes forward, the other retreats, so both centroids move in the same direction along the pitch. Disruptions to this coupling --- moments when one team's centroid advances rapidly while the other fails to retreat --- correspond to dangerous attacking situations.

Callout: Coordination Is Not Always Visible

Some of the most important coordination patterns are invisible to the naked eye. Two center-backs who maintain a consistent 12--15 meter lateral distance while shifting laterally in tandem are exhibiting precise coordination, but it manifests as stillness and positioning rather than dramatic movement. Tracking data reveals these subtle coordination patterns that are missed by event data and even by expert video analysis. Coaches and analysts who work with tracking data frequently report that the most valuable insights come not from individual movement analysis but from understanding how groups of players move in relation to each other.

18.5.7 Synchronization Metrics

Team synchronization measures the degree to which players' movements are temporally coordinated. Several metrics have been proposed:

Correlation-based synchronization. For each player pair $(i, j)$, compute the Pearson correlation of their velocity time series along the $x$-axis ($r_{ij}^x$) and $y$-axis ($r_{ij}^y$). Team synchronization is the average correlation across all pairs:

$$ S_x = \frac{2}{n(n-1)} \sum_{i < j} r_{ij}^x $$

Values close to 1 indicate highly synchronized movement (all players moving in the same direction simultaneously); values near 0 indicate independent movement.
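
A sketch of the correlation-based measure for one axis, assuming a `(n_frames, n_players)` array of velocity components with no missing values. The function name is illustrative. Note that a player with a constant velocity series has an undefined correlation, which would propagate NaN into the average.

```python
import numpy as np


def correlation_synchronization(velocities: np.ndarray) -> float:
    """Mean pairwise Pearson correlation across players for one axis.

    velocities: array of shape (n_frames, n_players).
    """
    corr = np.corrcoef(velocities, rowvar=False)  # players are columns
    upper = np.triu_indices(corr.shape[0], k=1)   # each pair counted once
    return float(np.mean(corr[upper]))


# Three players following the same pattern at different amplitudes.
t = np.linspace(0.0, 10.0, 200)
base = np.sin(t)
in_sync = base[:, None] * np.array([1.0, 2.0, 3.0])
print(correlation_synchronization(in_sync))
```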

Cluster phase. Adapted from coupled oscillator theory, the cluster phase $\rho$ quantifies the coherence of player movements:

$$ \rho(t) = \frac{1}{n} \left| \sum_{k=1}^{n} e^{\mathrm{i}\phi_k(t)} \right| $$

where $\phi_k(t)$ is the phase of player $k$'s movement (derived from the Hilbert transform of the position time series) and $\mathrm{i}$ is the imaginary unit. $\rho = 1$ indicates perfect synchronization; $\rho = 0$ indicates no coherence.
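
The cluster phase can be sketched with `scipy.signal.hilbert`, which returns the analytic signal whose angle is the instantaneous phase. The example assumes gap-free $x$-positions of shape `(n_frames, n_players)` and removes each player's mean before the transform; real trajectories are not clean oscillations, so some band-pass filtering or detrending beyond this simple centering is likely needed in practice.

```python
import numpy as np
from scipy.signal import hilbert


def cluster_phase(x_positions: np.ndarray) -> np.ndarray:
    """Cluster phase magnitude per frame.

    x_positions: array of shape (n_frames, n_players).
    Returns an array of shape (n_frames,) with values in [0, 1].
    """
    centered = x_positions - x_positions.mean(axis=0)
    phases = np.angle(hilbert(centered, axis=0))  # instantaneous phase
    return np.abs(np.mean(np.exp(1j * phases), axis=1))


# Three players oscillating together around different pitch locations.
t = np.linspace(0.0, 10.0, 500)
wave = np.sin(2 * np.pi * 0.5 * t)
together = wave[:, None] + np.array([20.0, 40.0, 60.0])
print(cluster_phase(together).mean())
```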

Callout: Practical Interpretation of Synchronization

High synchronization is not always desirable. A team defending a set piece should be highly synchronized (e.g., moving as a unit to play an offside trap). But in attack, some desynchronization is necessary --- if all players move in the same direction, the attack becomes predictable. Effective attacking play requires a combination of coordinated movement (e.g., overlapping runs) and strategic desynchronization (e.g., decoy runs in opposite directions).

18.5.8 Voronoi Tessellation and Dominant Regions

Voronoi tessellation partitions the pitch into regions, one per player, where each point in a player's region is closer to that player than to any other. This defines each player's dominant region --- the area of the pitch they control.

In its simplest form, the Voronoi diagram uses Euclidean distance. More sophisticated variants weight the distance by player speed and direction (Taki and Hasegawa, 2000; Fernandez and Bornn, 2018), producing motion-weighted dominant regions that account for each player's ability to reach a point first given their current velocity.

The area of each player's Voronoi cell provides a measure of their spatial influence. Summing these areas by team gives the team's total pitch control.
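
Because exact Voronoi cells are unbounded at the edges of the point set, a simple way to obtain pitch-bounded areas is to rasterize the pitch and assign each grid cell to its nearest player. The sketch below takes that approach for the plain Euclidean case. It assumes a 105 m by 68 m pitch with the origin at one corner, and `dominant_region_areas` is a hypothetical name.

```python
import numpy as np


def dominant_region_areas(
    positions: np.ndarray,
    pitch_length: float = 105.0,
    pitch_width: float = 68.0,
    cell: float = 1.0,
) -> np.ndarray:
    """Approximate Euclidean Voronoi cell area (m^2) for each player.

    positions: array of shape (n_players, 2), origin at a pitch corner.
    """
    xs = np.arange(cell / 2, pitch_length, cell)  # grid cell centers
    ys = np.arange(cell / 2, pitch_width, cell)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    # Squared distance from every grid cell to every player.
    d2 = ((grid[:, None, :] - positions[None, :, :]) ** 2).sum(axis=2)
    owner = np.argmin(d2, axis=1)
    return np.bincount(owner, minlength=len(positions)) * cell ** 2


players = np.array([[25.0, 34.0], [75.0, 34.0]])
print(dominant_region_areas(players))
```

Summing the returned areas over the indices belonging to one team gives that team's share of the pitch under this approximation.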


18.6 Off-Ball Player Evaluation

18.6.1 The Challenge of Off-Ball Analysis

In a typical match, a player is in possession of the ball for approximately 1--3 minutes out of 90. The remaining 87--89 minutes are spent off the ball, yet these off-ball movements are invisible to event data. Tracking data bridges this gap by providing a complete record of every player's movement throughout the match, enabling evaluation of off-ball contributions.

18.6.2 Off-Ball Run Detection and Classification

Off-ball runs can be detected algorithmically by identifying periods of directional, high-speed movement by players who are not in possession. A run detection algorithm typically:

  1. Identifies frames where a player's speed exceeds a threshold (e.g., $> 5$ m/s) and the player does not have possession.
  2. Filters for sustained directional movement (the player moves consistently toward a target area rather than oscillating).
  3. Classifies the run by type: runs behind the defensive line, runs into the channel, diagonal runs across the defensive line, dropping movements toward the ball, runs to create space for teammates.

Space creation is a particularly valuable off-ball contribution. A forward who makes a run behind the defensive line, drawing a defender out of position, creates space for a teammate even if the forward never receives the ball. Quantifying this contribution requires measuring the change in defensive structure caused by the run.

18.6.3 Pressing Contributions

Off-ball pressing is one of the most important defensive contributions, and tracking data provides the tools to measure it precisely. For each frame, we can identify which players are actively pressing by checking whether they are: (1) within a specified distance of the ball carrier, and (2) moving toward the ball carrier (positive closing speed). The cumulative pressing distance, pressing frequency, and pressing speed for each player quantify their defensive off-ball contribution.
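
The two pressing conditions can be sketched as a per-frame boolean mask. This is a simplification under stated assumptions: closing speed is taken as the component of the defender's own velocity toward the carrier (the carrier's motion is ignored), and the 5 m default radius is an illustrative choice. All names are hypothetical.

```python
import numpy as np


def pressing_mask(
    defender_pos: np.ndarray,
    defender_vel: np.ndarray,
    carrier_pos: np.ndarray,
    max_distance: float = 5.0,
) -> np.ndarray:
    """True for frames where the defender is close and closing in.

    All inputs have shape (n_frames, 2), in meters and m/s.
    """
    to_carrier = carrier_pos - defender_pos
    dist = np.linalg.norm(to_carrier, axis=1)
    safe_dist = np.where(dist > 0, dist, 1.0)  # avoid division by zero
    unit = to_carrier / safe_dist[:, None]
    closing_speed = np.sum(defender_vel * unit, axis=1)
    return (dist <= max_distance) & (closing_speed > 0)


pos = np.zeros((3, 2))
vel = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])
carrier = np.array([[3.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
print(pressing_mask(pos, vel, carrier))
```

Summing the mask and multiplying by the frame interval gives pressing time; counting rising edges gives pressing frequency.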

Callout: Off-Ball Value Is the New Frontier

The ability to evaluate off-ball contributions has fundamentally changed player valuation. Before tracking data, analysts could only assess what players did when they touched the ball. Now, we can quantify the forward who creates space through intelligent runs, the midfielder who presses relentlessly, and the defender who maintains perfect positional discipline. This is where tracking data provides its greatest marginal value over event data, and where the next generation of player evaluation models will be built.


18.7 Fatigue and Load Monitoring

18.7.1 Defining Fatigue in Match Context

Fatigue in soccer manifests as a decline in physical performance output over the course of a match. The most common indicators include:

  • Reduction in total distance covered per 5-minute interval
  • Decrease in high-speed running distance and sprint frequency
  • Decline in peak speed achieved
  • Increase in recovery time between high-intensity efforts
  • Reduction in acceleration and deceleration magnitudes

Fatigue is not uniform --- it is influenced by playing position, fitness level, match intensity, environmental conditions (temperature, altitude), and tactical context.

18.7.2 Temporal Decline Analysis

The simplest approach to fatigue analysis divides the match into equal time intervals (typically 5 or 15 minutes) and computes performance metrics within each interval. A common analysis compares the last 15 minutes to the first 15 minutes:

$$ \Delta_{\text{fatigue}} = \frac{M_{\text{last 15}} - M_{\text{first 15}}}{M_{\text{first 15}}} \times 100\% $$

where $M$ is the metric of interest (e.g., HSRD, sprint count, work rate).
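
As a small worked example of the formula, with made-up numbers: a player who covers 120 m of high-speed running in the first 15 minutes and 96 m in the last 15 shows a 20% decline. The helper name `fatigue_delta` is illustrative.

```python
def fatigue_delta(first_15: float, last_15: float) -> float:
    """Percentage change from the first to the last 15-minute interval."""
    if first_15 <= 0:
        raise ValueError("first_15 must be positive to form a ratio.")
    return (last_15 - first_15) / first_15 * 100.0


# Hypothetical HSRD values in meters; a negative result means a decline.
print(fatigue_delta(first_15=120.0, last_15=96.0))
```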

Research has consistently shown a 5--15% decline in high-speed running distance from the first half to the second half, with the most pronounced decreases occurring in the final 15 minutes.

Callout: Transient Fatigue

Beyond the progressive decline across a match, players also experience transient fatigue --- temporary performance decrements following intense passages of play. After a period of sustained high-intensity effort (e.g., a prolonged pressing sequence), players may show reduced output for the subsequent 2--5 minutes. This transient effect is superimposed on the broader match-long fatigue trend and can be detected by analyzing rolling performance windows.

18.7.3 Fatigue Modeling with Machine Learning

Advanced fatigue modeling goes beyond simple temporal decline analysis by using machine learning to predict when a player's physical output will drop below a critical threshold. A fatigue prediction model takes as inputs the player's cumulative match data up to the current moment and outputs a probability of significant performance decline in the upcoming 5--10 minutes.

Input features for a real-time fatigue model include:

  • Cumulative distance covered so far in the match
  • Cumulative high-intensity distance and sprint count
  • Recent 5-minute work rate relative to match average
  • Number and intensity of recent acceleration/deceleration events
  • Time since last substitution opportunity
  • Environmental conditions (temperature, humidity)
  • Player's baseline fitness profile (from pre-season testing)
  • Historical fatigue patterns for this player in previous matches

A gradient-boosted classifier trained on labeled fatigue events (defined as a sustained drop in rolling work rate below 80% of the first-half average) can provide real-time probability estimates that inform substitution timing.

18.7.4 Workload Distribution Across the Squad

At the squad level, tracking data enables monitoring of workload distribution across the team over multiple matches. The goal is to ensure that no player accumulates excessive load (increasing injury risk) while maintaining competitive readiness across the squad.

Key squad-level metrics include:

  • Weekly total distance per player, compared to their individual baseline
  • High-intensity distance ratio: The fraction of total distance covered at high intensity, tracked over time
  • Rest and recovery patterns: The interval between matches and training sessions for each player
  • Load concentration index: A measure of how evenly workload is distributed across the squad (a Gini coefficient of weekly load across squad members)
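
The load concentration index can be sketched with the standard rank-based formula for the Gini coefficient. The input is assumed to be one non-negative weekly load value per squad member, and `load_concentration_index` is an illustrative name. A value of 0 means perfectly even load; for a squad of $n$ players the maximum is $(n-1)/n$, reached when one player carries the entire load.

```python
import numpy as np


def load_concentration_index(weekly_loads) -> float:
    """Gini coefficient of weekly load across squad members."""
    loads = np.sort(np.asarray(weekly_loads, dtype=float))
    n = loads.size
    total = loads.sum()
    if n == 0 or total <= 0:
        raise ValueError("Need at least one player and a positive total load.")
    ranks = np.arange(1, n + 1)  # 1-based ranks of the sorted loads
    return float(2.0 * np.sum(ranks * loads) / (n * total) - (n + 1) / n)


print(load_concentration_index([1.0, 1.0, 1.0, 1.0]))
print(load_concentration_index([0.0, 0.0, 0.0, 4.0]))
```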

Callout: The Rotation Dilemma

Tracking data creates a tension between performance optimization and squad management. Data may show that a star player is accumulating dangerous levels of fatigue, suggesting they should be rested. But resting that player in a crucial match carries competitive risk. The best sports science departments frame workload data as one input into a multifactorial decision-making process, not as a deterministic prescription. The data informs the conversation between coaching staff, sports scientists, and medical personnel, but the final decision involves competitive context, player psychology, and tactical considerations that models cannot fully capture.

18.7.5 Cumulative Load Metrics

Cumulative load tracking monitors the total physical stress accumulated over a match, a week, or a season. Key metrics include:

PlayerLoad (Catapult). A proprietary metric derived from the triaxial accelerometer in wearable devices:

$$ \text{PlayerLoad} = \sum_{i=1}^{N-1} \sqrt{(a_{x,i+1} - a_{x,i})^2 + (a_{y,i+1} - a_{y,i})^2 + (a_{z,i+1} - a_{z,i})^2} $$

Total High-Intensity Distance (THID). The sum of all distance covered above the high-speed running threshold.

Acceleration Load. The integral of absolute acceleration over time:

$$ L_{\text{acc}} = \int_0^T |a_{\parallel}(t)| \, dt $$

18.7.6 Acute-to-Chronic Workload Ratio (ACWR)

The ACWR compares recent workload (acute, typically 1 week) to longer-term average workload (chronic, typically 4 weeks):

$$ \text{ACWR} = \frac{W_{\text{acute}}}{W_{\text{chronic}}} $$

where $W$ is a chosen workload metric (total distance, HSRD, PlayerLoad, etc.). An ACWR between 0.8 and 1.3 is generally considered the "sweet spot" for balancing performance and injury risk (Gabbett, 2016). Values above 1.5 indicate a spike in workload that may increase injury risk.

The exponentially weighted moving average (EWMA) version of ACWR assigns higher weight to recent sessions:

$$ W_{\text{EWMA}}^{(n)} = W_n \cdot \lambda + (1 - \lambda) \cdot W_{\text{EWMA}}^{(n-1)} $$

where $\lambda = 2 / (N + 1)$ and $N$ is the time window in days.

def compute_acwr(
    daily_loads: np.ndarray,
    acute_window: int = 7,
    chronic_window: int = 28,
    method: str = "rolling",
) -> np.ndarray:
    """Compute the Acute-to-Chronic Workload Ratio.

    Args:
        daily_loads: Array of daily workload values.
        acute_window: Number of days for the acute window.
        chronic_window: Number of days for the chronic window.
        method: 'rolling' for simple rolling average, 'ewma' for
            exponentially weighted moving average.

    Returns:
        Array of ACWR values (NaN where insufficient data).
    """
    n = len(daily_loads)
    acwr = np.full(n, np.nan)

    if method == "rolling":
        for i in range(chronic_window - 1, n):
            acute = np.mean(daily_loads[i - acute_window + 1 : i + 1])
            chronic = np.mean(daily_loads[i - chronic_window + 1 : i + 1])
            if chronic > 0:
                acwr[i] = acute / chronic
    elif method == "ewma":
        lambda_acute = 2 / (acute_window + 1)
        lambda_chronic = 2 / (chronic_window + 1)
        ewma_acute = daily_loads[0]
        ewma_chronic = daily_loads[0]
        for i in range(1, n):
            ewma_acute = daily_loads[i] * lambda_acute + (
                1 - lambda_acute
            ) * ewma_acute
            ewma_chronic = daily_loads[i] * lambda_chronic + (
                1 - lambda_chronic
            ) * ewma_chronic
            if ewma_chronic > 0:
                acwr[i] = ewma_acute / ewma_chronic
    else:
        raise ValueError(f"Unknown method: {method!r}")

    return acwr

18.7.7 Substitution Timing and Fatigue

Tracking data enables data-driven substitution decisions. By monitoring real-time performance metrics relative to a player's baseline, coaches can identify when a player's output has declined to a point where a substitution would be beneficial. Metrics to monitor include:

  • Rolling 5-minute work rate vs. first-half average
  • Sprint count in the last 10 minutes vs. match average
  • Ratio of high-speed running distance to total distance
  • Acceleration count above $3 \; \text{m/s}^2$ per 5-minute interval

A player whose rolling work rate drops below 80% of their first-half average for a sustained period (e.g., 10 minutes) may be a candidate for substitution.
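
The 80% rule can be sketched as a run-length check. The example assumes the rolling work rate has already been resampled to one value per minute with no gaps. The function name, the 45-minute baseline period, and the decision to flag only after half-time are illustrative choices.

```python
import numpy as np


def flag_substitution_candidate(
    work_rate_per_min: np.ndarray,
    first_half_minutes: int = 45,
    ratio: float = 0.8,
    sustain_minutes: int = 10,
):
    """Return the first minute index at which the rule triggers, else None."""
    baseline = np.mean(work_rate_per_min[:first_half_minutes])
    below = work_rate_per_min < ratio * baseline
    run = 0  # consecutive minutes below the threshold
    for minute, is_below in enumerate(below):
        run = run + 1 if is_below else 0
        if minute >= first_half_minutes and run >= sustain_minutes:
            return minute
    return None


# Synthetic profile: steady first half, mild dip, then a sustained drop.
profile = np.concatenate([np.full(45, 120.0), np.full(20, 110.0), np.full(25, 90.0)])
print(flag_substitution_candidate(profile))
```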


18.8 Integrating Tracking with Event Data

18.8.1 The Complementary Nature of the Two Data Types

Event data and tracking data are complementary:

| Dimension | Event Data | Tracking Data |
|---|---|---|
| Granularity | Discrete actions | Continuous positions |
| Temporal res. | ~1,500 events/match | ~142,500 frames/match |
| Spatial res. | Location of the action | All 22 players + ball |
| Off-ball info | None | Full off-ball positioning |
| Physical metrics | None | Speed, acceleration, distance |
| Annotation | Manually labeled or semi-automated | Automatically captured |

Integrating the two creates a dataset that combines the semantic richness of event data (what happened) with the spatial and physical detail of tracking data (how and where everyone was positioned when it happened).

18.8.2 Synchronization of Data Sources

Integrating tracking and event data requires temporal synchronization. Challenges include:

  • Different time references: Event data may use match clock (e.g., 23:15 of the first half), while tracking data uses absolute timestamps or frame IDs.
  • Different frame rates: Events occur at irregular intervals; tracking data is sampled at fixed frequency.
  • Annotation delays: Event timestamps may have 1--2 second uncertainty due to human annotation latency.
  • Provider mismatches: When event data comes from one provider (e.g., Opta) and tracking data from another (e.g., Second Spectrum), player ID mappings, coordinate systems, and timestamp conventions may differ, requiring careful alignment.

The synchronization process involves:

  1. Aligning kick-off and half-time markers in both data sources.
  2. Matching identifiable events (goals, corners, free kicks) to their corresponding tracking frames using ball position and velocity.
  3. Applying temporal offsets to correct for systematic annotation delays.
  4. Validating by checking that ball positions in tracking data match event locations.

def synchronize_events_to_tracking(
    events: pd.DataFrame,
    tracking: pd.DataFrame,
    event_time_col: str = "match_clock",
    tracking_time_col: str = "timestamp",
    tolerance: float = 0.5,
) -> pd.DataFrame:
    """Synchronize event data to tracking data frames.

    For each event, finds the closest tracking frame within the
    specified time tolerance.

    Args:
        events: DataFrame with event data including timestamps.
        tracking: DataFrame with tracking data including timestamps.
        event_time_col: Column name for event timestamps (seconds).
        tracking_time_col: Column name for tracking timestamps (seconds).
        tolerance: Maximum allowed time difference in seconds.

    Returns:
        Events DataFrame augmented with a 'tracking_frame' column
        indicating the matched frame ID.
    """
    unique_frames = tracking.drop_duplicates(
        subset=["frame_id"]
    )[["frame_id", tracking_time_col]].copy()
    frame_times = unique_frames[tracking_time_col].values
    frame_ids = unique_frames["frame_id"].values

    matched_frames = []
    for _, event in events.iterrows():
        event_time = event[event_time_col]
        idx = np.argmin(np.abs(frame_times - event_time))
        if np.abs(frame_times[idx] - event_time) <= tolerance:
            matched_frames.append(frame_ids[idx])
        else:
            matched_frames.append(np.nan)

    result = events.copy()
    result["tracking_frame"] = matched_frames
    return result

18.8.3 Contextualizing Events with Spatial Data

Once synchronized, tracking data provides rich context for each event:

Passing context. For each pass, we can determine:

  • The positions of all teammates and opponents at the moment of the pass.
  • The available passing lanes (via Voronoi analysis).
  • The receiver's movement (speed and direction) at the time of reception.
  • The pressing intensity on the passer (number and proximity of opponents).

Shot context. For each shot, we can assess:

  • The defensive positioning and the goalkeeper's position.
  • The shooter's speed and angle of approach.
  • The degree of pressure from defenders.
  • Whether the shot was taken after a sprint or while relatively stationary.

Defensive context. For tackles, interceptions, and duels:

  • The speed of approach of the defender.
  • The acceleration profile in the moments before the action.
  • The compactness of the defensive block at the moment of the action.

18.8.4 Pitch Control Models

Pitch control models combine tracking and event data to estimate the probability that each team would win possession of the ball if it were played to any point on the pitch. The seminal model by Fernandez and Bornn (2018) considers each player's:

  • Current position $(x_i, y_i)$
  • Current velocity $\mathbf{v}_i$
  • Reaction time $\tau$ (typically 0.7 seconds)
  • Maximum speed $v_{\max}$

For a target point $\mathbf{p}$, the time for player $i$ to reach $\mathbf{p}$ is estimated, and the pitch control value is:

$$ \text{PC}(\mathbf{p}) = \frac{\sum_{i \in \text{team}_A} I_i(\mathbf{p})}{\sum_{i \in \text{team}_A} I_i(\mathbf{p}) + \sum_{j \in \text{team}_B} I_j(\mathbf{p})} $$

where $I_i(\mathbf{p})$ is an influence function that decreases as the time-to-reach increases.
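
To make the ratio concrete, here is a deliberately simplified sketch. It is not the published model: the time-to-reach estimate (the player drifts at current velocity during the reaction time, then runs straight at maximum speed), the exponential influence function, the `decay` parameter, and the 5 m/s maximum speed are all assumptions made for illustration.

```python
import numpy as np


def time_to_reach(
    pos: np.ndarray,
    vel: np.ndarray,
    target: np.ndarray,
    reaction_time: float = 0.7,
    v_max: float = 5.0,  # assumed value, not taken from the text
) -> np.ndarray:
    """Simplified arrival time for each player, arrays of shape (n, 2)."""
    after_reaction = pos + vel * reaction_time
    return reaction_time + np.linalg.norm(target - after_reaction, axis=-1) / v_max


def pitch_control(team_a_pos, team_a_vel, team_b_pos, team_b_vel, target, decay=1.0):
    """Team A's control share at `target` using I = exp(-decay * time)."""
    infl_a = np.exp(-decay * time_to_reach(team_a_pos, team_a_vel, target)).sum()
    infl_b = np.exp(-decay * time_to_reach(team_b_pos, team_b_vel, target)).sum()
    return float(infl_a / (infl_a + infl_b))


a_pos, b_pos = np.array([[0.0, 0.0]]), np.array([[10.0, 0.0]])
still = np.zeros((1, 2))
print(pitch_control(a_pos, still, b_pos, still, np.array([5.0, 0.0])))
print(pitch_control(a_pos, still, b_pos, still, np.array([2.0, 0.0])))
```

Evaluating `pitch_control` over a grid of target points yields a control surface for the whole pitch.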

18.8.5 Expected Threat (xT) Enhancement with Tracking Data

The Expected Threat (xT) framework from Chapter 13 can be significantly enhanced with tracking data. While the standard xT model uses only ball position, tracking-enhanced xT can incorporate:

  • The number of defenders between the ball and the goal.
  • The goalkeeper's position relative to the goal center.
  • The number of attacking players in advanced positions.
  • The defensive compactness (stretch index of the defending team).

These features can be used to build context-aware xT models that assign different threat values to the same pitch zone depending on the defensive configuration.

18.8.6 Pressing Intensity Metrics

Pressing is a defensive strategy where the team without possession actively pressures the ball carrier and nearby opponents. Tracking data enables precise quantification of pressing:

PPDA (Passes Per Defensive Action): The number of passes allowed by the defending team before a defensive action (tackle, interception, foul) occurs. This is an event-based metric but can be enhanced with tracking data by incorporating the defensive team's spatial positioning.

Pressure events: The number of instances where a defending player is within a specified distance ($d < 5$ m) of the ball carrier while moving toward them (closing speed $> 0$).

Counterpressure intensity: The collective speed of the defending team's players within 15 m of the ball in the 5 seconds following a turnover:

$$ \text{CP} = \frac{1}{|S|} \sum_{i \in S} s_i \cdot \cos(\alpha_i) $$

where $S$ is the set of defending players within 15 m of the ball, $s_i$ is their speed, and $\alpha_i$ is the angle between their velocity vector and the direction toward the ball. A high CP value indicates intense counterpressing.
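
A single-frame sketch of the CP formula, using the identity $s_i \cos(\alpha_i) = \mathbf{v}_i \cdot \hat{\mathbf{u}}_i$, where $\hat{\mathbf{u}}_i$ is the unit vector from player $i$ toward the ball. The function name is illustrative, and returning 0 when no defender is within range is a convention chosen here, not something the definition prescribes.

```python
import numpy as np


def counterpressure_intensity(
    defender_pos: np.ndarray,
    defender_vel: np.ndarray,
    ball_pos: np.ndarray,
    radius: float = 15.0,
) -> float:
    """Mean speed component toward the ball for defenders within `radius`.

    defender_pos, defender_vel: shape (n_players, 2); ball_pos: shape (2,).
    """
    to_ball = ball_pos - defender_pos
    dist = np.linalg.norm(to_ball, axis=1)
    near = (dist <= radius) & (dist > 0)
    if not near.any():
        return 0.0
    unit = to_ball[near] / dist[near, None]
    return float(np.mean(np.sum(defender_vel[near] * unit, axis=1)))


pos = np.array([[10.0, 0.0], [0.0, 5.0], [30.0, 0.0]])
vel = np.array([[-4.0, 0.0], [0.0, 2.0], [-6.0, 0.0]])
print(counterpressure_intensity(pos, vel, np.array([0.0, 0.0])))
```

Averaging this value over the frames in the 5 seconds after a turnover gives one CP figure per turnover.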


18.9 Data Quality, Gaps, and Ethical Considerations

18.9.1 Assessing and Reporting Data Quality

Data quality in tracking data is not a binary property --- it exists on a spectrum, and responsible analysis requires quantifying and reporting quality metrics alongside analytical results. Key quality indicators include:

  • Completeness: The percentage of frames in which all 22 players and the ball are tracked. Values below 95% completeness warrant investigation.
  • Positional accuracy: Typically assessed through calibration with known reference positions or by comparing GPS and optical measurements. Sub-meter accuracy is expected for optical systems; $\pm 1$--$2$ meter accuracy is common for broadcast-derived systems.
  • Temporal consistency: Frame drops or irregular inter-frame intervals can distort velocity and acceleration calculations. Check that the actual sampling rate matches the nominal rate.
  • Identity stability: Player identity swaps (where the tracking system confuses two nearby players) are a common error, especially during crowded situations like corners and free kicks. These can be detected by monitoring sudden jumps in player position and validating against known substitution times.
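Three of these indicators can be checked with a few lines of NumPy. The sketch below is illustrative only: it assumes positions are stored as an array of shape (frames, entities, 2) with NaN marking untracked entities, and the function names, the 10% interval tolerance, and the 12 m/s jump threshold are all assumptions rather than established standards.

```python
import numpy as np

def completeness_pct(xy):
    """Percentage of frames in which every entity (players and ball) is tracked."""
    tracked = ~np.isnan(xy).any(axis=2)        # shape (frames, entities)
    return 100.0 * tracked.all(axis=1).mean()

def irregular_intervals(timestamps, nominal_hz=25.0, rel_tol=0.1):
    """Indices i where the gap between frame i and i+1 deviates from nominal."""
    dt = np.diff(np.asarray(timestamps, float))
    expected = 1.0 / nominal_hz
    return np.flatnonzero(np.abs(dt - expected) > rel_tol * expected)

def position_jumps(xy_player, timestamps, max_speed=12.0):
    """Indices i where the move from frame i to i+1 implies speed > max_speed (m/s).

    Sudden jumps are candidate identity swaps and should be reviewed manually.
    """
    step = np.linalg.norm(np.diff(np.asarray(xy_player, float), axis=0), axis=1)
    dt = np.diff(np.asarray(timestamps, float))
    return np.flatnonzero(step / dt > max_speed)
```

Reporting the completeness percentage and the counts of irregular intervals and position jumps alongside any analysis gives readers a basis for judging how far to trust the results.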

Callout: The "Garbage In, Garbage Out" Principle

No amount of analytical sophistication can compensate for poor data quality. Before investing time in complex models, always conduct a thorough quality audit of your tracking data. Check for missing frames, identity swaps, physically impossible speeds (anything above 45 km/h for a sustained period is likely an error), and coordinate system anomalies. A simple sanity check catches many data quality issues: compute the total distance for each player over a full match and verify that it falls within the expected range (10--14 km for outfield players, 5--8 km for goalkeepers).
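The sanity check described in this callout might look like the following sketch. It assumes one player's positions as an array of shape (frames, 2) in meters, sampled at a fixed rate, and covering a full match (the distance ranges do not apply to substitutes). The function name `sanity_check` is hypothetical, and the code counts individual frames above the speed limit rather than testing for a sustained period.

```python
import numpy as np

def sanity_check(xy, hz=25.0, is_goalkeeper=False, max_kmh=45.0):
    """Total-distance and max-speed checks for one player's full-match track."""
    xy = np.asarray(xy, float)
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # meters moved per frame
    total_km = np.nansum(step) / 1000.0                 # missing frames are skipped
    speed_kmh = step * hz * 3.6                         # m/frame -> m/s -> km/h
    low, high = (5.0, 8.0) if is_goalkeeper else (10.0, 14.0)
    return {
        "total_km": float(total_km),
        "distance_ok": bool(low <= total_km <= high),
        "frames_over_max_speed": int(np.count_nonzero(speed_kmh > max_kmh)),
    }
```

A failed distance check or a non-zero count of over-limit frames does not identify the cause by itself; it only signals that the track deserves a closer look before any modeling.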

18.9.2 Handling Gaps and Interpolation Methods

When tracking data contains gaps (missing frames for one or more players), the choice of interpolation method affects downstream analysis:

  • Linear interpolation assumes constant velocity during the gap. It is acceptable for gaps of 1--3 frames (up to 0.12 seconds at 25 Hz), but for longer gaps it introduces unrealistically sharp velocity changes at the gap boundaries.
  • Cubic spline interpolation produces smooth trajectories through missing segments, maintaining continuity in position, velocity, and acceleration. Preferred for gaps of 4--25 frames (0.16--1.0 seconds).
  • Physics-informed interpolation uses kinematic equations to constrain the interpolated trajectory, assuming maximum plausible acceleration and speed limits. This prevents physically impossible trajectories during long gaps.
  • No interpolation (flagging): For gaps exceeding 1--2 seconds, the most conservative approach is to flag the affected segments and exclude them from velocity- and acceleration-dependent analyses.
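The gap-length rules above can be turned into a small dispatcher. The sketch below assumes a 1-D coordinate series sampled at 25 Hz with NaN marking missing frames; the function names are hypothetical. It only performs the linear fill for short gaps. Gaps labeled `"cubic_spline"` would be handed to a spline routine (for example SciPy's `CubicSpline`), and treating everything longer than 25 frames as `"flag"` is a simplifying assumption, since a physics-informed method could be applied there instead.

```python
import numpy as np

def find_gaps(values):
    """Return (start_index, length) for each run of NaN in a 1-D series."""
    missing = np.isnan(np.asarray(values, float))
    edges = np.diff(np.concatenate(([0], missing.astype(int), [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [(int(s), int(e - s)) for s, e in zip(starts, ends)]

def choose_method(n_frames):
    """Map a gap length in frames (at 25 Hz) to an interpolation strategy."""
    if n_frames <= 3:
        return "linear"
    if n_frames <= 25:
        return "cubic_spline"
    return "flag"

def fill_short_gaps(values):
    """Linearly fill interior gaps of 1-3 frames; leave longer gaps as NaN."""
    out = np.asarray(values, float).copy()
    idx = np.arange(out.size)
    known = ~np.isnan(out)
    for start, length in find_gaps(out):
        interior = start > 0 and start + length < out.size  # never extrapolate
        if interior and choose_method(length) == "linear":
            sl = slice(start, start + length)
            out[sl] = np.interp(idx[sl], idx[known], out[known])
    return out
```

Running the same procedure separately on the x and y coordinates keeps the logic simple, and the list returned by `find_gaps` doubles as a record of which segments were altered or flagged.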

18.9.3 Privacy and Ethical Considerations with Player Tracking

The granularity of tracking data raises significant privacy and ethical concerns that analysts must navigate responsibly.

Player consent and data ownership. Professional players' tracking data is typically collected under the terms of their employment contracts and collective bargaining agreements. However, the extent of players' informed consent varies. Players may not fully understand how their positional data will be used, particularly if it is shared with third parties (broadcasters, analytics companies, betting operators) or used in ways that could affect their contract negotiations, transfer valuations, or playing time decisions.

Data protection regulations. In Europe, tracking data constitutes personal data under the General Data Protection Regulation (GDPR), as it can identify specific individuals. Clubs and data providers must comply with GDPR requirements regarding data processing, storage, and sharing. In practice, this means implementing appropriate data security measures, limiting data retention periods, and providing players with access to their own data upon request.

Performance surveillance concerns. Continuous monitoring of player movement during both matches and training sessions can create a surveillance dynamic that affects the player-employer relationship. If tracking data is used punitively (e.g., penalizing players for low physical output without considering tactical context), it can erode trust between players and coaching staff. Best practice dictates that tracking data should be used collaboratively, with players having access to their own metrics and understanding how the data informs training and selection decisions.

Third-party data sharing. When tracking data is shared with broadcasters, analytics companies, or betting operators, players lose control over how their personal performance data is used. The ethical implications of using player tracking data to inform betting markets are particularly contentious, as players do not share in the commercial value that the betting industry generates from their data.

Callout: Responsible Use of Tracking Data

As practitioners, we have an ethical obligation to use tracking data responsibly. This means: (1) being transparent with players about how their data is collected, stored, and used; (2) ensuring that data-driven insights are communicated collaboratively rather than used as surveillance tools; (3) advocating for appropriate data protection policies within our organizations; (4) recognizing that physical output metrics are contextual and should never be used in isolation to judge player effort or commitment. The long-term viability of tracking data analytics in soccer depends on maintaining trust between all stakeholders --- clubs, players, analysts, and data providers.


Summary

This chapter has presented a comprehensive framework for analyzing tracking data in professional soccer. We began with the fundamentals of data collection, coordinate systems, and preprocessing, examining the major tracking data providers (Second Spectrum, SkillCorner, Hawk-Eye, TRACAB) and the implications of different sampling rates. We then progressed through individual physical performance metrics (distance, speed, acceleration, metabolic power) and sprint detection methodologies, including repeated sprint analysis and high-intensity running classification.

The chapter covered collective movement analysis in depth, including team shape metrics (length, width, surface area, centroid), synchronization measures (correlation-based, cluster phase), and Voronoi tessellation for pitch control. We addressed the important topic of off-ball player evaluation, which represents the greatest marginal value of tracking data over event data.

Fatigue monitoring was treated comprehensively, from temporal decline analysis through machine learning-based fatigue prediction and squad-level workload distribution. The integration of tracking data with event data was examined through synchronization methods, pitch control models, enhanced Expected Threat frameworks, and pressing intensity metrics.

Finally, we addressed data quality considerations (gaps, smoothing, interpolation) and the critical topic of privacy and ethics in player tracking, emphasizing the responsibility of analysts to use this powerful data source in ways that respect players' rights and maintain trust within the sport.

Tracking data transforms soccer analytics from a description of what happened to an understanding of why and how it happened, enabling insights into the physical, tactical, and strategic dimensions of the game that are invisible in event data alone. As tracking technology continues to improve in accuracy, accessibility, and temporal resolution, the analytical methods presented in this chapter will remain foundational to advanced soccer analytics practice.


Key Formulas Reference

| Metric | Formula |
|---|---|
| Instantaneous speed | $s(t) = \sqrt{v_x(t)^2 + v_y(t)^2}$ |
| Total distance | $D = \sum_{i} s_i \cdot \Delta t$ |
| Team centroid | $(\bar{x}, \bar{y}) = \frac{1}{n}\sum_{i}(x_i, y_i)$ |
| Stretch index | $SI = \frac{1}{n}\sum_{i}\sqrt{(x_i - \bar{x})^2 + (y_i - \bar{y})^2}$ |
| Metabolic power | $P_{\text{met}} = EC \cdot s$ |
| ACWR | $\text{ACWR} = W_{\text{acute}} / W_{\text{chronic}}$ |
| Pitch control | $\text{PC}(\mathbf{p}) = \frac{\sum_{A} I_i(\mathbf{p})}{\sum_{A} I_i(\mathbf{p}) + \sum_{B} I_j(\mathbf{p})}$ |