Learning Objectives

  • Explain what soccer analytics is and articulate its value proposition for clubs
  • Trace the historical development of soccer analytics from early statistics to modern data science
  • Identify key stakeholders in soccer analytics and understand their needs
  • Describe the typical analytics workflow from question to insight to action
  • Recognize career paths in soccer analytics and skills required for each
  • Apply ethical reasoning to common dilemmas in sports analytics

Chapter 1: Introduction to Soccer Analytics

"Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the data scientists explain why it happened." — Adapted from Gary Lineker

Chapter Overview

In a packed stadium in Liverpool on a crisp October evening in 2019, something remarkable happened that most fans didn't notice. As Sadio Mané collected the ball 25 yards from goal, a computer system had already calculated the probability of a goal from that exact position—0.04, or 4%. As Mané took one touch to set himself, then another to evade a defender, the system updated: 0.07. When he struck the ball, curling it past the goalkeeper's despairing dive and into the far corner, the expected goals (xG) value of that shot was 0.08.

The crowd erupted. The managers paced their technical areas. And somewhere in the stadium's bowels, an analyst quietly logged another data point in the endless pursuit of understanding the beautiful game.

This moment encapsulates modern soccer analytics: invisible to most observers, running continuously in the background, attempting to quantify the unquantifiable while the ancient drama of competition plays out on the pitch. This textbook will teach you how it all works.

But how did we get here? Soccer, a sport steeped in tradition and often resistant to change, has been transformed by data in ways that would have been unimaginable just two decades ago. The journey from Charles Reep's pencil-and-paper tallies in 1950 to Liverpool's multi-million-pound data science operation is one of the great untold stories in sport. It is a story of pioneers ignored, ideas ahead of their time, and a revolution that ultimately proved unstoppable.

In this chapter, you will learn to:

  • Define soccer analytics and explain why clubs invest millions in it
  • Trace the evolution from simple statistics to sophisticated machine learning
  • Understand who uses analytics and what they need from it
  • Follow the journey from raw data to actionable insights
  • Explore career opportunities in this rapidly growing field


1.1 What Is Soccer Analytics?

1.1.1 Defining the Field

Soccer analytics is the systematic application of data analysis and statistical methods to improve decision-making in association football. It encompasses everything from simple counting statistics (goals, assists, clean sheets) to complex machine learning models that estimate the probability of scoring from any position on the pitch.

More formally, we can define soccer analytics as:

Soccer Analytics: The collection, processing, analysis, and communication of soccer-related data to generate insights that inform decisions by players, coaches, scouts, executives, journalists, fans, and other stakeholders.

This definition highlights several important aspects:

  1. Data-centric: Analytics starts with data—without reliable data, there is no analysis
  2. Process-oriented: It involves multiple stages from collection to communication
  3. Insight-focused: The goal is understanding, not just numbers
  4. Decision-supporting: Analytics exists to improve decisions, not as an end in itself
  5. Multi-stakeholder: Different people use analytics for different purposes

The field sits at the intersection of several disciplines: statistics and mathematics provide the quantitative foundations, computer science supplies the tools for processing and modeling, and deep domain expertise in soccer ensures that analytical outputs are meaningful and actionable. A brilliant statistician who does not understand soccer will produce technically sound but contextually meaningless work. A football expert who cannot think quantitatively will miss patterns hidden in the data. The best practitioners combine both capacities, and the best teams bring together specialists from each discipline.

1.1.2 What Analytics Is NOT

To understand soccer analytics fully, it helps to clarify what it is not:

Analytics is not a replacement for expertise. The best analytics complements human judgment rather than replacing it. A statistical model can estimate a player's value, but only a scout who has watched them play can assess their character, adaptability, and fit with a team's culture. When Borussia Dortmund signed Erling Haaland from Red Bull Salzburg in January 2020, the data supported the decision emphatically: his goal-scoring metrics were extraordinary for a teenager. But the club also relied on extensive personal scouting, interviews with coaches who had worked with him, and assessments of his psychological makeup. Data and judgment worked hand in hand.

Analytics is not infallible prediction. Soccer is inherently unpredictable. A team with 3.0 xG can lose to a team with 0.5 xG—and occasionally will. Leicester City's 2015-16 Premier League title, achieved at 5000-to-1 odds, is the most famous example of how soccer can defy probabilistic expectation. Analytics helps us understand probabilities and tendencies, not certainties. Any analyst who presents their findings as guaranteed outcomes is either dishonest or naive.
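To put a rough number on the 3.0 xG versus 0.5 xG example, one common simplification treats each team's goals as independent Poisson draws with the xG total as the mean. That modeling choice is an assumption made here for illustration, not a description of any club's system. A minimal sketch under that assumption:

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k goals when goals follow a Poisson law."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def outcome_probabilities(xg_a: float, xg_b: float, max_goals: int = 15):
    """Return (P(A wins), P(draw), P(B wins)) under independent Poisson scoring.

    Scorelines above max_goals are ignored; for realistic xG totals their
    combined probability is negligible.
    """
    a_wins = draw = b_wins = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, xg_a) * poisson_pmf(goals_b, xg_b)
            if goals_a > goals_b:
                a_wins += p
            elif goals_a == goals_b:
                draw += p
            else:
                b_wins += p
    return a_wins, draw, b_wins

p_favourite, p_draw, p_upset = outcome_probabilities(3.0, 0.5)
print(f"favourite wins: {p_favourite:.3f}, draw: {p_draw:.3f}, upset: {p_upset:.3f}")
```

Under these assumptions the upset probability comes out at a few percent: rare in any single match, but frequent enough to occur several times across a full season of fixtures.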

Analytics is not just about offense. While goals and assists dominate public discussion, some of the most valuable analytical work involves defense, pressing, and the spaces between discrete events. Understanding how a team's defensive shape restricts opponents, how pressing triggers force turnovers in dangerous areas, or how a center-back's positioning prevents shots from ever being taken—these are analytical questions with enormous practical value that receive far less public attention than xG and assists.

Analytics is not only for elite clubs. While top teams have the largest departments, effective analytics can be done at any level with limited resources. Indeed, smaller clubs may benefit more from analytical advantages precisely because they cannot compete on budget alone. FC Midtjylland in Denmark, owned by Matthew Benham, demonstrated this principle by winning their first-ever Danish Superliga title in 2015 through analytically driven recruitment and tactics, competing against clubs with larger budgets and deeper traditions.

Best Practice: When presenting analytical work, always acknowledge what the data cannot tell you. Stakeholders respect honesty about limitations far more than false confidence. Frame your findings as "the data suggests" rather than "the data proves."

1.1.3 The Value Proposition

Why do clubs invest millions of dollars in analytics departments, data providers, and technology infrastructure? The value proposition rests on several pillars:

Competitive Advantage

League soccer is close to a zero-sum game: a place gained in the table is a place lost by a rival. If analytics can provide even a small edge in player recruitment, tactical preparation, or in-game decision-making, that edge accumulates over a season into points, wins, and potentially trophies.

Consider the mathematics: In a 38-game Premier League season, if analytics-informed decisions contribute to one additional goal scored or prevented every five games, that's approximately 7-8 additional goals over the season. Historical data suggests this could translate to 6-8 additional points—often the difference between Champions League qualification and mid-table mediocrity. When Champions League revenue can exceed 100 million euros per season, the return on investment from an analytics department costing a fraction of that is potentially enormous.
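The arithmetic behind this estimate can be laid out explicitly. The sketch below uses only the figures quoted above and backs out the goals-to-points conversion rate that the 6-8 point claim implicitly relies on; the variable names are illustrative.

```python
GAMES_PER_SEASON = 38      # Premier League season length (from the text)
GAMES_PER_EXTRA_GOAL = 5   # one extra goal scored or prevented every five games

extra_goals = GAMES_PER_SEASON / GAMES_PER_EXTRA_GOAL

# The text's claimed range of additional points over the season.
points_low, points_high = 6, 8

# Implied conversion rate the claim depends on (points per goal of swing).
rate_low = points_low / extra_goals
rate_high = points_high / extra_goals

print(f"extra goals per season: {extra_goals:.1f}")
print(f"implied points per goal: {rate_low:.2f} to {rate_high:.2f}")
```

The conversion rate is the assumption doing the work here. A reader with access to historical league tables could test it by regressing season points on goal difference.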

The competitive advantage is most visible in recruitment. When Brighton & Hove Albion consistently identify undervalued talent from lower leagues and foreign markets—players like Moises Caicedo (signed from Independiente del Valle for approximately 4.5 million pounds in 2021 and sold to Chelsea for over 100 million pounds in 2023)—they are demonstrating the financial power of analytical edge.

Market Inefficiencies

The soccer transfer market is notoriously inefficient. Players are regularly over- or under-valued based on reputation, nationality, recent form, or simply being in the right place at the right time. Data analysis can identify undervalued players who fit specific needs—the "Moneyball" approach that transformed baseball and is increasingly applied to soccer.

Liverpool's recruitment of Mohamed Salah, Sadio Mané, and Roberto Firmino—none of whom were superstars when signed—was heavily influenced by analytical profiling. Combined, these three cost approximately 90 million pounds and generated performance worth many times that amount. The club's data science team, led by director of research Ian Graham (a Cambridge-trained physicist), developed models that could project how players would perform within Liverpool's specific tactical system—not just how good they were in general, but how good they would be for Liverpool.

Market inefficiencies exist because the transfer market is driven by imperfect information, cognitive biases, and misaligned incentives. Agents talk up their clients. Scouts develop attachments to players they have watched extensively. Managers fixate on players they have seen perform well against their own teams. Nationality bias means that players from fashionable leagues command premiums while equally talented players in less visible competitions go unnoticed. Analytics, by providing an objective baseline, can cut through these biases—though it introduces its own blind spots if not applied thoughtfully.

Risk Reduction

Major decisions in soccer carry enormous financial risk. A 50 million pound signing who fails is a catastrophe for any club's finances. Analytics cannot eliminate risk, but it can quantify it, identify red flags, and ensure decisions are made with full information rather than gut instinct alone.

Consider the case of a club evaluating two potential signings. Player A is a well-known striker from a top league, commanding a 40 million pound fee. Player B is a lesser-known striker from a mid-tier league, available for 12 million pounds. Traditional scouting might favor Player A based on reputation and visibility. Analytics can ask: what is the probability each player will score 15+ goals per season in our league? What are the confidence intervals around their expected performance? What does the distribution of outcomes look like? If Player B has a 60% chance of reaching the threshold compared to Player A's 70%, but costs less than a third of the price, the risk-adjusted value clearly favors Player B.
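One simple way to make the comparison concrete is to divide each fee by the probability of success, giving the fee paid per expected successful signing. This metric is a simplification chosen for illustration; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    fee_millions: float   # transfer fee in millions of pounds
    p_success: float      # probability of 15+ goals per season

    @property
    def fee_per_expected_success(self) -> float:
        """Fee divided by success probability: lower means better value."""
        return self.fee_millions / self.p_success

player_a = Candidate("Player A", fee_millions=40.0, p_success=0.70)
player_b = Candidate("Player B", fee_millions=12.0, p_success=0.60)

for player in (player_a, player_b):
    print(f"{player.name}: {player.fee_per_expected_success:.1f}m per expected success")
```

The calculation ignores wages, resale value, and how much failure a club can tolerate, all of which a fuller risk assessment would include.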

Operational Efficiency

Beyond on-pitch decisions, analytics improves operational efficiency: optimizing training loads to reduce injuries, pricing tickets to maximize revenue, identifying the right time to sell a player, and countless other business decisions. Arsenal's use of injury prediction models, for example, has been reported to help their medical staff manage player workloads across congested fixture schedules—a problem that grows more severe as competitions expand and rest periods shrink.

Real-World Application: Brentford FC, before their promotion to the Premier League, operated with one of the smallest budgets in the Championship. Their sophisticated analytical approach, developed by owner Matthew Benham's company Smartodds, allowed them to consistently identify undervalued players and compete with much wealthier clubs. Players like Ollie Watkins (bought for 1.8 million pounds, sold to Aston Villa for 28 million pounds), Said Benrahma (bought for 1.5 million pounds, sold to West Ham for 25 million pounds), and Neal Maupay (bought for 1.6 million pounds, sold to Brighton for 20 million pounds) exemplify this approach. The club's eventual promotion in 2021 and successful Premier League tenure vindicated years of data-driven decision-making.


1.2 The Evolution of Soccer Analytics

The history of soccer analytics is not a smooth upward curve. It is marked by long periods of stagnation, the work of isolated pioneers who were often dismissed or ignored, sudden leaps enabled by new technology, and the gradual—sometimes grudging—acceptance of data-driven thinking by a sport deeply attached to tradition and intuition. Understanding this history provides context for where the field stands today and where it might go next.

1.2.1 Prehistoric Analytics (Pre-1950)

Soccer analytics, in a primitive form, has existed since the earliest days of organized football. Managers have always kept mental (and sometimes written) records of which players performed well, which tactics worked against specific opponents, and which positions on the pitch were most dangerous.

Herbert Chapman, the legendary Arsenal manager of the 1920s and 1930s, was an early pioneer of systematic thinking about the game. He introduced the "WM" formation, a tactical innovation based on his analysis of how the 1925 offside rule change created new spaces on the pitch. While not "analytics" in the modern sense, this represented data-informed tactical evolution. Chapman was also an innovator in other respects: he advocated for floodlit matches, numbered shirts, and a clock visible to spectators—all reflecting a mindset that valued information and transparency.

In the 1930s and 1940s, match reports in newspapers began including basic statistics: shots on goal, corners, and possession estimates. These were crude by modern standards, often based on subjective observation rather than systematic recording, but they represented the earliest public appetite for quantitative information about soccer matches.

In Eastern Europe, meanwhile, a different analytical tradition was taking shape. Soviet coaches, influenced by the country's emphasis on scientific approaches to sport, began keeping detailed records of match actions. This tradition would eventually produce one of soccer's most important analytical pioneers, though his contributions would not be widely recognized in the West for decades.

1.2.2 Charles Reep and the Birth of Match Analysis (1950-1970)

Charles Reep, an accountant and RAF Wing Commander, began systematically recording soccer matches in 1950. Sitting in the stands at Swindon Town's County Ground on March 18, 1950, Reep used a shorthand notation system he had developed to record every significant action in the match. He would continue this practice for decades, eventually compiling data from thousands of games.

Reep's methodology was remarkably detailed for its time. He recorded the sequence of passes leading to shots and goals, the areas of the pitch where possession was won and lost, and the outcomes of different styles of play. His notation system, while labor-intensive, allowed him to build a database of match actions that was unprecedented in soccer.

His analysis led him to several conclusions that would prove influential—and controversial:

  1. The "three-pass" finding: Reep concluded that the majority of goals resulted from possessions of three passes or fewer. He interpreted this as evidence that direct play (long balls toward goal) was more effective than elaborate passing sequences.

  2. Pitch zones: He divided the pitch into zones and tracked where possession changes and shots originated, identifying what he called the "scoring zone."

  3. Random chance: Reep argued that much of what happened in soccer was essentially random, with goals following a pattern similar to a Poisson distribution.

Reep's influence was significant but complicated. His work directly influenced several managers and coaches, most notably Charles Hughes, who became the FA's Director of Coaching. Hughes adopted Reep's findings as the basis for English coaching philosophy, promoting direct play and discouraging possession-based football. Critics argue this set English soccer back by decades, contributing to the long-ball culture that dominated the English game through much of the 1980s and 1990s.

Common Pitfall: Reep's story illustrates a danger that persists in modern analytics: drawing strong causal conclusions from observational data. Reep observed that most goals came from short possessions, but this was partly because most possessions are short. The relevant question is not "what proportion of goals come from short possessions?" but "what is the probability of scoring from a short possession versus a long one?" Reep's failure to account for base rates led to conclusions that were technically supported by his data but practically misleading.
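The difference between the two questions is easy to see with numbers. The counts below are invented purely to show the arithmetic and are not a claim about real scoring rates.

```python
# Hypothetical counts, invented for illustration only.
possessions = {
    "short (0-3 passes)": {"total": 9000, "goals": 80},
    "long (4+ passes)":   {"total": 1000, "goals": 20},
}

total_goals = sum(group["goals"] for group in possessions.values())

share_of_goals = {}
goal_rate = {}
for label, group in possessions.items():
    # Reep's question: what fraction of all goals came from this group?
    share_of_goals[label] = group["goals"] / total_goals
    # The relevant question: how often does a possession of this type score?
    goal_rate[label] = group["goals"] / group["total"]
    print(f"{label}: {share_of_goals[label]:.0%} of goals, "
          f"{goal_rate[label]:.2%} scoring rate")
```

In this made-up dataset, short possessions account for most goals simply because there are nine times as many of them, while each individual long possession is more likely to end in a goal.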

Reep's work was methodologically flawed by modern standards, suffering from small samples, selection bias, and correlation-causation confusion. But he deserves recognition as perhaps the first person to apply systematic data collection and analysis to soccer. His legacy is a reminder that the impulse to understand soccer through data has deep roots—and that the quality of analytical conclusions depends entirely on the quality of analytical methods.

1.2.3 Valeriy Lobanovskyi and the Scientific Approach (1960-2002)

While Reep worked in relative isolation in England, a far more sophisticated analytical tradition was developing in the Soviet Union. Valeriy Lobanovskyi, who managed Dynamo Kyiv from 1973 to 2002 (with interruptions), was perhaps the most analytically minded manager in soccer history before the modern era.

Lobanovskyi was trained as an engineer, and he brought an engineer's mindset to football management. Working with his long-time collaborator Anatoliy Zelentsov, a scientist, Lobanovskyi developed a comprehensive system for analyzing soccer matches using mathematical and statistical methods. Their approach included:

  1. Action coding: Every player action in a match was recorded and classified according to a detailed taxonomy. Each action was evaluated as positive, negative, or neutral based on its contribution to attacking or defensive objectives.

  2. Performance ratings: Players received numerical ratings based on the ratio of positive to negative actions, weighted by the importance of each action type. These ratings, calculated after every match and training session, determined team selection.

  3. Tactical modeling: Lobanovskyi and Zelentsov modeled soccer as a dynamic system, analyzing the interactions between players and the spatial relationships that created opportunities. They sought to optimize what they called "universal action"—coordinated team movement that created numerical superiority in key areas of the pitch.

  4. Physical conditioning: The scientific approach extended to physical preparation, with training loads calibrated based on physiological data and match demands.
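The rating idea in point 2 can be sketched in a few lines. The actual taxonomy and weights used by Lobanovskyi and Zelentsov are not given here, so the action types, weights, and function name below are all invented for illustration.

```python
# Hypothetical action types and importance weights (not from the text).
WEIGHTS = {"pass": 1.0, "tackle": 1.5, "shot": 2.0}

def action_rating(actions):
    """Weighted ratio of positive to negative actions.

    `actions` is a list of (action_type, evaluation) pairs where evaluation
    is "positive", "negative", or "neutral". Neutral actions are ignored.
    """
    positive = sum(WEIGHTS[kind] for kind, ev in actions if ev == "positive")
    negative = sum(WEIGHTS[kind] for kind, ev in actions if ev == "negative")
    if negative == 0:
        # No negative actions: the ratio is unbounded (or zero with no actions).
        return float("inf") if positive > 0 else 0.0
    return positive / negative

match_log = [
    ("pass", "positive"), ("pass", "positive"), ("pass", "negative"),
    ("tackle", "positive"), ("shot", "negative"), ("pass", "neutral"),
]
rating = action_rating(match_log)
print(f"rating: {rating:.2f}")
```

A rating above 1 means weighted positive actions outnumber weighted negative ones. Choosing the weights is where the football judgment enters.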

The results were remarkable. Dynamo Kyiv won 13 league titles under Lobanovskyi (eight Soviet and five Ukrainian), plus two European Cup Winners' Cups (1975 and 1986), and reached the Champions League semi-finals in 1999. Lobanovskyi also managed the Soviet Union national team, taking them to the 1988 European Championship final.

Lobanovskyi's approach was ahead of its time by decades. His insistence that soccer could be understood and optimized through data and mathematical modeling anticipated modern analytics by a generation. Yet his methods remained largely unknown in Western football until after his death in 2002. Jonathan Wilson's book Inverting the Pyramid (2008) brought Lobanovskyi's story to a wider English-speaking audience, and his influence is now widely recognized as foundational.

Intuition: Lobanovskyi's great insight was that soccer is not a collection of individual performances but a complex system of interactions. Modern analytics has largely vindicated this view: the most sophisticated current approaches focus on team-level patterns, spatial relationships, and coordinated movement rather than simply aggregating individual statistics.

1.2.4 The Statistical Era (1950-2000)

For most of the second half of the 20th century, soccer statistics remained primitive compared to American sports. While baseball had developed sophisticated metrics like OPS (on-base plus slugging) and later OPS+ and WAR (Wins Above Replacement), soccer made do with basic counts: goals, assists, clean sheets, and little else.

Several factors explain this lag:

  1. Continuous play: Soccer's lack of natural breaks makes data collection more difficult than in baseball or American football, where discrete at-bats or plays create natural units of analysis
  2. Contextual complexity: A pass in soccer depends heavily on context (game state, defensive pressure, positional relationships) in ways that complicate simple counting
  3. Cultural resistance: Soccer's tradition-bound culture was skeptical of "outsiders" applying mathematical analysis to the beautiful game. This was especially pronounced in England, where the idea that football men—not boffins with spreadsheets—should run the game was deeply entrenched
  4. Limited data: Without systematic collection efforts, there simply wasn't enough data to analyze
  5. Low-scoring nature: With an average of roughly 2.5 goals per match, individual match outcomes carry enormous random variation, making it harder to draw reliable conclusions from small samples

Despite these challenges, the seeds of modern analytics were being planted during this period. In the early 1990s, a company called Prozone (later Prozone Sports) developed one of the first computer-based match analysis systems. Founded by a group including former footballer and coach Eddie Thomson, Prozone used multiple camera angles to track player movements and generate performance data. The system was initially used by a handful of English clubs, including Leeds United under manager David O'Leary in the late 1990s, and it represented a significant step toward the tracking data systems that would later become ubiquitous.

The company Opta (founded in 1996 by Aidan Cooney) began changing the data landscape by providing detailed event data—every pass, shot, tackle, and duel recorded with timestamps and coordinates. Opta started by hiring teams of analysts to manually code matches, building a dataset that grew steadily in scope and detail. This data, while expensive and initially limited in scope, laid the foundation for modern analytics. By the early 2000s, Opta data was powering statistical content for media companies across Europe, gradually normalizing the idea that soccer could be described in numbers.

1.2.5 The Moneyball Moment (2002-2012)

The publication of Michael Lewis's Moneyball in 2003 transformed public awareness of sports analytics. While the book focused on baseball—specifically, how the Oakland Athletics used statistical analysis to compete with wealthier teams—its core message resonated across sports: that data analysis could identify market inefficiencies and level the playing field between wealthy and poor teams.

Moneyball had a profound impact on soccer thinking, even though the sport presented different analytical challenges. Several figures in the soccer world read the book and began asking whether similar approaches could work in football. Among the most important was Damien Comolli, a French sporting director who had worked at Arsenal under Arsene Wenger and would later hold roles at Tottenham Hotspur, Liverpool, and Fenerbahce. Comolli was one of the first football executives to explicitly embrace Moneyball-style thinking, attempting to build recruitment strategies around statistical profiling rather than traditional scouting alone.

Soccer's "Moneyball" moment is harder to pinpoint than baseball's, but several developments were crucial:

The betting connection (2004-2010)

Some of the most important early work in soccer analytics happened not at clubs but at betting companies. Firms like Smartodds (founded by Matthew Benham), Starlizard (founded by Tony Bloom), and various others employed mathematicians and statisticians to build models that could predict match outcomes more accurately than the betting market. These models required sophisticated understanding of team and player quality, and the analysts who built them developed many of the foundational techniques later adopted by clubs.

Both Benham and Bloom would go on to purchase football clubs—Benham buying Brentford FC and FC Midtjylland, Bloom buying Brighton & Hove Albion—and apply their analytical approaches to club management. The betting-to-ownership pipeline proved to be one of the most important channels through which analytical thinking entered soccer.

The analytics pioneers at clubs (2008-2012)

  • Arsenal's acquisition of StatDNA in 2012 brought sophisticated analytics in-house. StatDNA, founded by Jaeson Rosenfeld, had built a comprehensive system for evaluating player performance using data, and Arsenal saw enough value in it to acquire the entire company rather than just licensing the product.
  • Manchester City, backed by the Abu Dhabi United Group from 2008, invested heavily in analytics infrastructure. They hired analysts from the finance industry and academia to apply quantitative methods to recruitment and tactics.
  • Liverpool began building what would become one of the sport's most advanced analytics operations. Ian Graham, a Cambridge-trained physicist who had worked in the betting industry, was hired in 2012 to build a data science team. His models would prove instrumental in Liverpool's recruitment strategy under Jurgen Klopp.
  • Brentford FC, under Benham's ownership from 2012, restructured their entire football operation around data. The club eliminated the traditional head of scouting role and replaced it with a model-driven recruitment process that prioritized statistical evidence over conventional wisdom.

Expected Goals (xG) emergence:

The concept of expected goals—estimating the probability that a shot results in a goal based on its characteristics—had existed in various forms since at least 2012 (and arguably earlier in unpublished club work). Sam Green, an analyst working at Opta, developed an early public xG model around 2012. But the concept gained wider public prominence through independent analysts and bloggers: Sander IJtsma's 11tegen11 blog, Michael Caley's analyses on various platforms, and the work of analysts like Daniel Altman and Colin Trainor.

xG represented a conceptual leap: instead of counting binary outcomes (goal or no goal), analysts could estimate the quality of chances created and conceded. A team could now understand whether they "deserved" to win based on the quality of chances rather than the vagaries of finishing. This was revolutionary because it allowed analysts to separate skill from luck—or at least to begin that separation—in a sport where small samples made raw results deeply unreliable indicators of true quality.

The basic idea is intuitive: a shot from six yards out with no defenders between shooter and goal is more likely to result in a goal than a 30-yard effort with three defenders blocking the view. xG formalizes this intuition, using historical data on thousands of shots to estimate the probability of a goal based on shot location, body part, assist type, game state, and other features.
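A location-only version of this idea can be sketched as a logistic model of shot distance and the angle subtended by the goal posts. The coefficients below are set by hand for illustration and are not fitted to any data; a real model would estimate them from historical shots and would include the other features mentioned above.

```python
import math

GOAL_WIDTH_M = 7.32  # width of a regulation goal in metres

# Hypothetical coefficients, chosen by hand for illustration; a real model
# would estimate them from thousands of historical shots.
INTERCEPT = -1.0
COEF_DISTANCE = -0.10   # per metre from the centre of the goal
COEF_ANGLE = 1.5        # per radian of visible goal mouth

def shot_features(x: float, y: float):
    """Distance to goal centre and angle subtended by the posts.

    x is the distance from the goal line and y the lateral offset from the
    centre of the goal, both in metres (x must be positive).
    """
    distance = math.hypot(x, y)
    half = GOAL_WIDTH_M / 2
    angle = math.atan2(GOAL_WIDTH_M * x, x * x + y * y - half * half)
    return distance, angle

def expected_goals(x: float, y: float) -> float:
    """Logistic model mapping shot location to a scoring probability."""
    distance, angle = shot_features(x, y)
    logit = INTERCEPT + COEF_DISTANCE * distance + COEF_ANGLE * angle
    return 1.0 / (1.0 + math.exp(-logit))

close_range = expected_goals(5.5, 0.0)    # roughly six yards out, central
long_range = expected_goals(27.4, 0.0)    # roughly thirty yards out, central
print(f"close range: {close_range:.2f}, long range: {long_range:.2f}")
```

The logistic form keeps every output between 0 and 1, and the two features capture the intuition directly: closer shots and wider views of goal both raise the estimate.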

Real-World Application: One of the most famous early validations of xG came from the 2013-14 season. Several public analysts noted that Sunderland's shot quality metrics were far worse than their results suggested—they were significantly over-performing their xG. Sure enough, the following season Sunderland's results collapsed as regression to the mean took hold. This kind of predictive success gave xG early credibility and demonstrated that analytical models could see things that traditional analysis missed.

1.2.6 The Data Revolution (2012-2020)

The 2010s saw an explosion in both data availability and analytical sophistication. If the previous era was about proving that analytics had value, this era was about building the infrastructure, methods, and organizations to deliver that value at scale.

Tracking data becomes available:

While event data records discrete actions (passes, shots, tackles), tracking data captures the continuous position of every player and the ball, typically at 25 frames per second. This enables entirely new categories of analysis: off-ball movement, pressing patterns, space creation, and more.
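To give a feel for what tracking data makes possible, the sketch below derives distance covered and average speed from a short run of positions sampled at 25 frames per second. The coordinates are invented, and real feeds use provider-specific formats that are not reproduced here.

```python
import math

FRAME_RATE_HZ = 25  # frames per second, as in the text

# Hypothetical (x, y) positions in metres for one player over five frames.
positions = [(50.0, 30.0), (50.2, 30.1), (50.5, 30.2), (50.8, 30.4), (51.1, 30.6)]

def distance_covered(track):
    """Sum of straight-line distances between consecutive frames (metres)."""
    return sum(math.dist(a, b) for a, b in zip(track, track[1:]))

def average_speed(track, frame_rate=FRAME_RATE_HZ):
    """Average speed in metres per second; requires at least two frames."""
    elapsed = (len(track) - 1) / frame_rate
    return distance_covered(track) / elapsed

print(f"distance: {distance_covered(positions):.2f} m")
print(f"average speed: {average_speed(positions):.2f} m/s")
```

The same frame-to-frame differencing underlies most physical metrics (sprints, accelerations, high-intensity distance), usually after smoothing to remove measurement noise.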

Companies like ChyronHego (TRACAB), Second Spectrum, and Metrica Sports developed systems using optical tracking (cameras) or GPS/accelerometer data to capture this information. The cost and complexity initially limited adoption to top clubs, but prices have gradually decreased. By the mid-2010s, the Bundesliga had become the first major league to install tracking systems in every stadium, making comprehensive tracking data available for all matches. Other leagues followed: La Liga partnered with Second Spectrum, and the Premier League installed Hawk-Eye systems across its venues.

The availability of tracking data opened up research areas that had been impossible with event data alone. Analysts could now study questions like: How much space does a team's pressing create? How do off-ball runs create passing lanes? What is the optimal positioning for a defensive line? How does a player's movement off the ball contribute to their team's attacking patterns?

The rise of possession value models:

With tracking data enabling spatial analysis, researchers developed models that assigned value to every location on the pitch based on the probability of scoring from that location within a certain number of actions. Expected Possession Value (EPV), developed by researchers including Javier Fernandez and Luke Bornn, represented a significant advance over xG by valuing not just shots but every action and position in a possession. A pass that moved the ball from a low-value area to a high-value area could now be quantified, even if it didn't directly lead to a shot.
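The core bookkeeping of a possession value model, crediting an action with the change in value between where it starts and where it ends, can be shown with a toy value surface. The zone values and function names below are hypothetical, and this location-only lookup is far simpler than the published EPV approach, which draws on full tracking data.

```python
# Hypothetical value surface: probability of scoring soon after holding the
# ball in each zone. The pitch is split into 4 zones along its length,
# from the team's own goal (index 0) to the opponent's goal (index 3).
ZONE_VALUE = [0.005, 0.010, 0.030, 0.120]
PITCH_LENGTH_M = 105.0

def zone_of(x: float) -> int:
    """Map a position along the pitch (metres from own goal line) to a zone."""
    zone = int(x / PITCH_LENGTH_M * len(ZONE_VALUE))
    return min(max(zone, 0), len(ZONE_VALUE) - 1)

def pass_value_added(start_x: float, end_x: float) -> float:
    """Value of a completed pass: destination value minus origin value."""
    return ZONE_VALUE[zone_of(end_x)] - ZONE_VALUE[zone_of(start_x)]

progressive = pass_value_added(40.0, 85.0)   # midfield into the final quarter
backwards = pass_value_added(60.0, 20.0)     # recycling possession
print(f"progressive pass: {progressive:+.3f}, backward pass: {backwards:+.3f}")
```

A negative value for the backward pass does not make it a bad decision; it only records that the ball moved to a less dangerous area, which is often the right choice for keeping possession.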

StatsBomb's founding and impact:

Ted Knutson founded StatsBomb in 2017, bringing together a team of experienced analysts with backgrounds in both data provision and public analytics. StatsBomb differentiated itself from established providers like Opta through several innovations: more detailed event classification (for example, distinguishing between different types of defensive actions), freeze frame data that captured all player positions at the moment of key events, and a transparent approach to methodology that engaged the public analytics community.

StatsBomb's open data initiative:

In 2019, StatsBomb released detailed event data from several competitions for free public use, including the 2018 FIFA World Cup, selected seasons of La Liga featuring Lionel Messi, the FA Women's Super League, and other competitions. This democratization of data enabled students, researchers, and aspiring analysts to work with professional-grade data, accelerating learning and innovation throughout the community. The open data became the de facto standard dataset for soccer analytics education, tutorials, and research.

Public analytics community:

A vibrant community of public analysts emerged on social media and blogs, sharing methodologies, debating approaches, and pushing the field forward. Platforms like Twitter became hubs for analytics discussion, with practitioners, journalists, and fans engaging in unprecedented public dialogue about methods and findings. Key figures like Tom Worville, Grace Robertson, John Muller, Thom Lawrence, Eliot McKinley, and many others built public profiles through analytical work that demonstrated both technical skill and football insight. Several were subsequently hired by professional clubs, establishing a pipeline from public analytics to professional careers.

The OptaPro Forum and conferences:

Annual conferences became important venues for sharing cutting-edge research. The OptaPro Analytics Forum (later the Stats Perform Analytics Forum), launched in 2012, provided a platform for both industry professionals and academics to present research. The MIT Sloan Sports Analytics Conference, though not soccer-specific, also featured increasing amounts of football research. These events helped professionalize the field and establish shared standards.

1.2.7 The Modern Era (2020-Present)

Today, soccer analytics has matured into a professional discipline with established practices, dedicated departments, and ongoing methodological debates. Key characteristics of the modern era include:

Ubiquity: Every major club has some analytics capability, though sophistication varies enormously. A 2023 survey by the International Centre for Sports Studies (CIES) found that over 90% of clubs in Europe's top five leagues employed at least one dedicated analyst. At the high end, clubs like Liverpool, Manchester City, Barcelona, and Bayern Munich have departments numbering dozens of staff with specializations in data engineering, machine learning, performance analysis, and recruitment analytics.

Integration: Analytics is increasingly integrated into coaching, scouting, and medical departments rather than siloed separately. The era of the analytics department operating in isolation—producing reports that nobody read—is giving way to embedded analysts who work directly alongside coaches and scouts. Brighton & Hove Albion are often cited as exemplary in this regard, with their analytics team deeply integrated into the football operation under successive technical directors.

Specialization: Roles have differentiated into data engineers, data scientists, analysts, and translators who communicate insights to non-technical stakeholders. The days when a single "stats person" handled everything from data management to visualization to communication are largely over at top clubs, replaced by teams where each member contributes specific expertise.

Machine learning: Sophisticated ML techniques including neural networks, gradient boosting, and reinforcement learning are now common in club analytics departments. These methods enable more accurate prediction, more nuanced player evaluation, and the discovery of patterns that would be invisible to simpler analytical methods. Deep learning models, in particular, have shown promise in processing tracking data to identify tactical patterns.

Real-time analytics: Tools for in-game analysis and decision support are increasingly sophisticated. Some clubs now have analysts communicating insights to coaching staff during matches, supported by systems that process live data feeds and highlight tactical opportunities or concerns in real time. The use of tablets on the touchline, now common across major leagues, facilitates this communication.

Broadcast integration: xG and other metrics now appear regularly in television broadcasts and journalism. Sky Sports, ESPN, BT Sport, and other major broadcasters display xG values during and after matches. This mainstreaming of analytical concepts has increased public analytical literacy and created demand for more sophisticated content.

The frontier of computer vision: Machine learning applied to video is opening new analytical frontiers. Companies like SkillCorner use computer vision to derive tracking data from broadcast video, eliminating the need for stadium-installed camera systems. Other research explores automated event detection, pose estimation (analyzing body positioning during actions), and even automated tactical classification from video alone.

Intuition: Think of soccer analytics evolution like the evolution of photography. Early analysts were like pioneer photographers with crude equipment—capturing something meaningful but with severe limitations. Modern analytics is like digital photography: faster, cheaper, more powerful, and accessible to anyone with the right tools and training. And just as photography continues to evolve with computational photography and AI enhancement, soccer analytics is still in the early stages of realizing its potential.


1.3 Key Stakeholders in Soccer Analytics

Understanding who uses analytical insights—and what they need—is essential for effective work in the field. One of the most common mistakes made by aspiring analysts is producing technically brilliant work that answers a question nobody asked, or presenting insights in a format that the intended audience cannot use. Analytics is ultimately a service discipline: it exists to help other people make better decisions.

1.3.1 The Analytics Ecosystem

Soccer analytics serves many masters. The ecosystem encompasses everyone from the coaching staff preparing for Saturday's match to the board making multi-year strategic decisions, from the scout identifying targets in the Brazilian Serie B to the journalist explaining to readers why a team's good results might not be sustainable.

                          ┌─────────────────┐
                          │   Club Owner/   │
                          │   Board         │
                          └────────┬────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
            ┌───────┴───────┐  ┌───┴───┐  ┌───────┴───────┐
            │  Sporting     │  │  Head │  │  Commercial   │
            │  Director     │  │ Coach │  │  Director     │
            └───────┬───────┘  └───┬───┘  └───────────────┘
                    │              │
         ┌──────────┼──────────┐   │
         │          │          │   │
    ┌────┴────┐┌────┴────┐┌────┴───┴──┐
    │ Scouts  ││Analysts ││ Coaching  │
    │         ││         ││ Staff     │
    └─────────┘└─────────┘└───────────┘

Each node in this diagram represents a different set of analytical needs, communication preferences, and time horizons. The analyst's job is to understand all of them and adapt accordingly.

1.3.2 Technical Staff

Head Coach and Assistant Coaches

Coaches use analytics for:

  • Opposition analysis: Understanding opponents' patterns, weaknesses, and tendencies. Before a match against a team that presses intensely, the analytics department might prepare data showing where the press is triggered, where gaps open up, and which players are the weakest pressing links.
  • Self-analysis: Identifying areas for improvement in their own team's play. After a run of poor defensive performances, data can reveal whether the problem is individual errors, structural issues, or simply bad luck.
  • Player development: Tracking individual improvement and identifying training priorities. For a young winger, this might involve tracking progressive carries, successful dribbles, and crossing accuracy over time.
  • Tactical decisions: Informing formation choices, pressing triggers, and game plans. Data showing that an opponent concedes a disproportionate number of chances when pressed high can inform the decision to adopt an aggressive pressing strategy.

Coaches typically need insights presented simply and visually. They have limited time and must translate analysis into practical instructions for players. The most valuable analysis is specific, actionable, and directly relevant to upcoming matches or training sessions. A 20-page statistical report is far less useful than a single slide showing the three key patterns the team should exploit.

The relationship between coaches and analysts varies enormously. Some coaches—Thomas Tuchel, Roberto De Zerbi, Ange Postecoglou—are known for actively engaging with data. Others prefer to receive information verbally or through video, with the analyst pre-filtering the data into a small number of key messages. The effective analyst adapts to the coach's preferences rather than insisting on a particular format.

Real-World Application: When Jurgen Klopp arrived at Liverpool in 2015, the club's analytics team had to develop new ways to present information that suited Klopp's preferences. Rather than sending spreadsheets or detailed statistical reports, they created concise visual summaries—what some in the industry call "one-pagers"—that highlighted three or four key points relevant to each match. Klopp engaged deeply with these summaries, and the collaboration between Klopp's tactical intuition and the analytics team's data-driven insights became one of the foundations of Liverpool's success.

Players

Players increasingly engage with analytics for:

  • Self-improvement: Understanding their own performance trends. A midfielder might track their pass completion rate in different zones, or a striker might study their shot placement patterns to identify tendencies they can improve.
  • Opposition preparation: Learning opponent tendencies, especially for set pieces. Penalty takers, for example, increasingly study data on goalkeeper diving tendencies.
  • Contract negotiations: Quantifying their value to support salary discussions. Player agents now routinely use analytical reports to support contract negotiations, demonstrating that their client outperforms peers on key metrics.

Players vary enormously in their analytical interest and literacy. Some actively engage with data; others prefer to receive information verbally from coaches. Effective analytics departments adapt their communication to individual preferences. Manchester City's Kevin De Bruyne reportedly commissioned independent data analysis to support his contract extension talks, which he conducted without a traditional agent, a sign of increasing player engagement with analytics.

1.3.3 Football Operations

Sporting Director / Director of Football

The sporting director oversees long-term football strategy, including:

  • Recruitment strategy: Identifying positions to strengthen and player profiles to target. A sporting director might use data to determine that the team's biggest need is a ball-progressing center-back who can play out from the back, then task the analytics team with identifying candidates across multiple leagues.
  • Squad planning: Balancing age profiles, contract expirations, and development paths. Data can model how the squad's quality will evolve over two to three seasons under different recruitment scenarios.
  • Performance oversight: Monitoring whether coaches are achieving expected results. If a team is underperforming relative to underlying metrics like xG, the sporting director has data to support conversations about whether tactical adjustments are needed.

Sporting directors need strategic-level analysis that informs multi-year planning. They often have the most sophisticated analytical literacy among non-technical staff. Figures like Michael Edwards (formerly of Liverpool), Paul Mitchell (formerly of Tottenham, AS Monaco, and Newcastle United), and Monchi (of Sevilla fame) are known for their effective integration of analytics into football strategy.

Scouts

Scouts use analytics for:

  • Shortlisting: Narrowing thousands of potential targets to manageable numbers. For a specific position, there might be 500 players in the age and quality range. Data can reduce this to 30-50 candidates worthy of video review.
  • Prioritization: Identifying which players to watch in person. A scout cannot watch everyone; data helps direct limited scouting resources to the most promising candidates.
  • Due diligence: Validating or challenging impressions from video and live scouting. If a scout is enthusiastic about a player whose data profile raises concerns, that tension is worth exploring further.
  • Comparison: Benchmarking targets against current players and alternatives. How does a target compare to the player they would replace? To other available options?

The scout-analyst relationship is particularly important and sometimes fraught. Traditional scouts may view analytics with suspicion, feeling that their expertise is being devalued. Analysts may undervalue qualitative assessments that cannot be captured in data. The best outcomes emerge from genuine collaboration, where data and scouting insight are treated as complementary rather than competing sources of information.

Best Practice: In recruitment workflows, treat data and scouting as sequential filters rather than competing opinions. Data can efficiently narrow a large pool of candidates to a manageable shortlist. Scouting—video first, then live—evaluates the shortlist for qualities that data cannot capture: body language, decision-making under pressure, communication with teammates, and character. The final decision should integrate both perspectives.
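The data-first filter described above can be sketched in a few lines. The fields, thresholds, and player records here are hypothetical and chosen only for illustration; a real shortlist would draw on provider data and many more criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    age: int
    minutes: int                    # league minutes played this season
    progressive_carries_p90: float  # per 90 minutes
    pressures_p90: float            # per 90 minutes


def shortlist(pool: List[Candidate], max_age: int, min_minutes: int,
              min_carries: float, min_pressures: float, top_n: int) -> List[Candidate]:
    """Apply hard filters, then rank the survivors by progressive carries."""
    eligible = [
        p for p in pool
        if p.age <= max_age
        and p.minutes >= min_minutes          # drop small samples
        and p.progressive_carries_p90 >= min_carries
        and p.pressures_p90 >= min_pressures
    ]
    eligible.sort(key=lambda p: p.progressive_carries_p90, reverse=True)
    return eligible[:top_n]


# Hypothetical candidate pool
pool = [
    Candidate("Player A", 22, 2400, 4.1, 18.0),
    Candidate("Player B", 29, 2900, 5.0, 20.5),  # too old for this brief
    Candidate("Player C", 24, 600, 6.2, 22.0),   # too few minutes
    Candidate("Player D", 23, 1900, 3.2, 21.0),
    Candidate("Player E", 21, 2100, 4.8, 17.5),
]

video_queue = shortlist(pool, max_age=25, min_minutes=900,
                        min_carries=3.0, min_pressures=17.0, top_n=3)
names = [p.name for p in video_queue]
print(names)
```

The output of a filter like this is a queue for video review, not a decision: Player C is excluded only because the sample is small, which is exactly the kind of case a scout might choose to revisit.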

Medical and Sports Science Staff

Medical staff use analytics for:

  • Injury prediction: Identifying players at elevated injury risk based on training load, match load, previous injury history, and physical metrics. Models can flag when a player's accumulated workload exceeds thresholds associated with increased injury probability.
  • Load management: Optimizing training intensity and recovery. With tracking data from training sessions and matches, sports scientists can monitor each player's physical output and prescribe individualized recovery protocols.
  • Return-to-play: Tracking recovery progression after injury. Data-driven return-to-play protocols compare an injured player's physical metrics to their pre-injury baseline, reducing the risk of premature return.
  • Long-term health: Monitoring cumulative wear over a career.

This area is growing rapidly as tracking data enables detailed physical monitoring. The relationship between physical data and injury has been studied extensively, though reliable prediction remains challenging due to the complex and multifactorial nature of injuries.
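One simple way to flag a workload spike is to compare a player's recent load with their longer-term load. The sketch below uses an acute:chronic workload ratio as an illustrative heuristic; the text above does not prescribe this method, and the 7-day and 28-day windows, the 1.5 threshold, and the load units are assumptions. Consistent with the caveat above, a flag of this kind is a prompt for review, not a reliable injury prediction.

```python
from typing import List, Optional


def acute_chronic_ratio(daily_loads: List[float],
                        acute_days: int = 7,
                        chronic_days: int = 28) -> Optional[float]:
    """Ratio of mean load over the last `acute_days` to mean load over the
    last `chronic_days`. Returns None if there is too little history or the
    chronic load is zero."""
    if len(daily_loads) < chronic_days:
        return None
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = sum(daily_loads[-chronic_days:]) / chronic_days
    if chronic == 0:
        return None
    return acute / chronic


SPIKE_THRESHOLD = 1.5  # illustrative cut-off, not a validated clinical value

# Hypothetical daily loads in arbitrary units
steady = [400.0] * 28                   # constant load for four weeks
spike = [400.0] * 21 + [800.0] * 7      # load doubles in the final week

for label, loads in (("steady", steady), ("spike", spike)):
    ratio = acute_chronic_ratio(loads)
    flagged = ratio is not None and ratio > SPIKE_THRESHOLD
    print(f"{label}: ratio={ratio:.2f}, flagged={flagged}")
```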

1.3.4 Business Operations

Commercial Teams

Analytics also supports business operations:

  • Ticketing: Optimizing pricing and availability using dynamic pricing models that adjust based on demand, opponent attractiveness, and historical attendance patterns
  • Sponsorship: Quantifying exposure and value. Data can measure how often a sponsor's logo appears in broadcast footage, the social media reach of sponsored content, and the value of brand association with specific players or achievements.
  • Fan engagement: Understanding supporter behavior and preferences through analysis of social media engagement, app usage, and merchandise purchasing patterns
  • Merchandising: Predicting demand and optimizing inventory based on player popularity, recent results, and seasonal trends

Media and Communications

Teams use analytics in media operations for:

  • Social content: Creating engaging statistical content for fans. A well-designed xG graphic or a striking data visualization can generate significant social media engagement.
  • Press briefings: Providing coaches with talking points based on data
  • Brand building: Showcasing analytical sophistication to potential commercial partners and recruits

1.3.5 External Stakeholders

Media and Journalists

Sports journalists increasingly use analytics to:

  • Add depth: Moving beyond basic stats to sophisticated analysis. Publications like The Athletic, ESPN, and The Guardian employ dedicated data journalists who produce analytical content.
  • Identify stories: Finding interesting patterns that merit coverage. A journalist might notice that a player's defensive metrics have declined sharply, suggesting a story about declining form or tactical changes.
  • Verify claims: Checking whether perceived trends are statistically real. When a manager claims their team has been "the best in the league" over a certain period, data can verify or challenge this.
  • Engage audiences: Creating shareable graphics and insights

Betting and Gaming Industry

The betting industry was an early investor in soccer analytics:

  • Odds-making: Setting prices that reflect true probabilities. Bookmakers employ sophisticated models that estimate the probability of different match outcomes.
  • Risk management: Identifying and limiting sharp bettors who consistently beat the market
  • In-play markets: Real-time probability estimation that adjusts odds during matches based on events, xG accumulation, and other factors

Many analysts began their careers in betting before moving to clubs. The betting industry served as an informal training ground for a generation of soccer analysts, and the analytical methods developed in that context—particularly around match prediction and player valuation—directly influenced club analytics.
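The link between prices and probabilities can be shown with a short calculation. The sketch below converts decimal odds into implied probabilities and removes the bookmaker's margin by proportional normalization, which is one simple approach among several; the odds are hypothetical and the function name is an assumption.

```python
from typing import Dict, Tuple


def implied_probabilities(decimal_odds: Dict[str, float]) -> Tuple[Dict[str, float], float]:
    """Return margin-free probabilities and the bookmaker's overround.

    The raw implied probability of an outcome is 1 / decimal odds. Raw
    probabilities sum to more than 1; the excess is the overround. Dividing
    each raw value by the total rescales them to sum to exactly 1.
    """
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    total = sum(raw.values())
    fair = {outcome: p / total for outcome, p in raw.items()}
    return fair, total - 1.0


# Hypothetical 1X2 prices for a single match
odds = {"home": 2.00, "draw": 3.50, "away": 4.00}
fair, overround = implied_probabilities(odds)

for outcome, p in fair.items():
    print(f"{outcome}: {p:.3f}")
print(f"overround: {overround:.3%}")
```

Comparing these market-implied probabilities with a model's own estimates is a common sanity check on a match prediction model.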

Fans

A growing community of analytical fans consumes and creates soccer analysis:

  • Understanding the game: Moving beyond basic statistics to genuinely understand performance
  • Fantasy sports: Optimizing team selection using predictive models and underlying performance data
  • Discussion: Debating player and team quality with data-informed arguments
  • Content creation: Building audiences around analytical content on platforms like Twitter, YouTube, and Substack

Intuition: Different stakeholders need different things from analytics. A coach needs specific, actionable insights before the next match. A sporting director needs strategic analysis supporting multi-year planning. A journalist needs digestible content their audience will understand. A fan wants to better understand what they are watching. Effective analysts tailor their output to their audience—the same underlying analysis might be presented as a technical model specification, a one-page visual summary, a 1,000-word article, or a single number, depending on who needs to use it.


1.4 The Analytics Workflow

1.4.1 From Question to Action

Soccer analytics follows a systematic workflow from initial question to final action. Understanding this workflow helps structure your learning and your eventual work in the field. The workflow is iterative, not linear—insights from later stages frequently require returning to earlier stages for refinement.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           THE ANALYTICS WORKFLOW                            │
│                                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │ Question │──▶│   Data   │──▶│ Analysis │──▶│ Insight  │──▶│  Action  │   │
│  │          │   │          │   │          │   │          │   │          │   │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   │
│       │              │              │              │              │         │
│       │              │              │              │              │         │
│       ▼              ▼              ▼              ▼              ▼         │
│  "What do       "Where does    "How do I      "What does     "So what       │
│   we want        it come        process        it mean?"      should we     │
│   to know?"      from?"         it?"                          do?"          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Each phase of this workflow presents distinct challenges. Many aspiring analysts focus almost exclusively on the analysis phase—the modeling, the coding, the statistical techniques—while neglecting the equally important phases of question definition and communication. In practice, the difference between a good analyst and a great one often lies not in technical sophistication but in the ability to ask the right questions and communicate answers effectively.

1.4.2 Phase 1: Defining the Question

Every analysis begins with a question. Good analytical questions are:

  • Specific: "Which left-back in Ligue 1 best fits our pressing system?" is better than "Who should we sign?"
  • Answerable: The data needed to answer the question must exist or be obtainable
  • Relevant: The answer should inform a real decision
  • Timely: The answer is needed soon enough to matter

Common categories of analytical questions include:

Descriptive questions: What happened?

  • "How many goals did we concede from set pieces last season?"
  • "What was our xG per game at home vs. away?"
  • "How has our pressing intensity changed since the formation switch?"

Diagnostic questions: Why did it happen?

  • "Why did our conversion rate drop in the second half of the season?"
  • "What changed in our pressing when we switched to a 4-3-3?"
  • "Why are we conceding more goals from the right side?"

Predictive questions: What will happen?

  • "How many goals will this striker score in our league?"
  • "What is the probability we finish in the top four?"
  • "If we sign this midfielder, how will our ball progression metrics change?"

Prescriptive questions: What should we do?

  • "Should we sign Player A or Player B?"
  • "What tactical adjustments should we make against this opponent?"
  • "When should we sell this player to maximize transfer fee?"

The quality of the initial question determines the value of everything that follows. A well-defined question focuses analytical effort, guides data selection, and makes communication straightforward. A vague or poorly defined question leads to unfocused analysis that may never reach a useful conclusion.
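A well-specified descriptive question often reduces to a simple aggregation. As a minimal sketch, the home versus away xG question listed above could be answered as follows; the match records and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical match-level records for one team
matches: List[dict] = [
    {"venue": "home", "xg_for": 1.8, "goals_for": 2},
    {"venue": "away", "xg_for": 0.9, "goals_for": 1},
    {"venue": "home", "xg_for": 2.2, "goals_for": 1},
    {"venue": "away", "xg_for": 1.1, "goals_for": 0},
]


def xg_summary(matches: List[dict]) -> Dict[str, Dict[str, float]]:
    """Per-venue xG per game and goals minus xG (finishing over/under-performance)."""
    totals = defaultdict(lambda: {"games": 0, "xg": 0.0, "goals": 0})
    for m in matches:
        t = totals[m["venue"]]
        t["games"] += 1
        t["xg"] += m["xg_for"]
        t["goals"] += m["goals_for"]
    return {
        venue: {
            "xg_per_game": t["xg"] / t["games"],
            "goals_minus_xg": t["goals"] - t["xg"],
        }
        for venue, t in totals.items()
    }


summary = xg_summary(matches)
for venue, stats in summary.items():
    print(venue, {k: round(v, 2) for k, v in stats.items()})
```

Four matches is far too small a sample to draw conclusions from; the point is only that a specific question maps directly onto a specific computation, whereas "How are we doing?" does not.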

Common Pitfall: A frequent failure mode is the "interesting but irrelevant" analysis—technically sound work that answers a question nobody needs answered. Before beginning any analysis, confirm that someone will act differently based on the results. If the answer is "probably not," reconsider whether the analysis is worth doing.

1.4.3 Phase 2: Data Acquisition

Once the question is defined, we need data to answer it. This involves:

Identifying data needs:

  • What variables are required?
  • What time period is relevant?
  • What competitions or teams should be included?
  • What level of granularity is necessary: match-level aggregates, event-level detail, or tracking data?

Sourcing data:

  • Is the data available internally?
  • Do we need to purchase from a provider?
  • Can we collect it ourselves?
  • Are free alternatives adequate for this purpose?

Validating data quality:

  • Is the data complete?
  • Are there obvious errors?
  • Is it consistent across sources?
  • Are there known limitations we need to account for?

Data acquisition often reveals that the original question needs refinement. Perhaps the ideal data doesn't exist, or exists only for certain competitions. Iteration between question definition and data exploration is normal and healthy. A common situation: the analyst wants to compare pressing metrics across leagues, but discovers that tracking data is only available for two of the five leagues of interest. The question must be revised, or alternative data sources identified.
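The completeness and error checks listed above can be automated before any analysis starts. The following is a minimal sketch for event-level records; the field names, the 105 x 68 coordinate system, and the 130-minute ceiling are assumptions, and real providers use their own schemas and pitch coordinates.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("event_id", "match_id", "minute", "x", "y", "type")


def validate_events(events: List[dict],
                    pitch_length: float = 105.0,
                    pitch_width: float = 68.0) -> Dict[str, List[int]]:
    """Return the row indices that fail each basic quality check."""
    issues: Dict[str, List[int]] = {
        "missing_field": [], "duplicate_id": [], "out_of_bounds": [], "bad_minute": [],
    }
    seen_ids = set()
    for i, ev in enumerate(events):
        # Completeness: every required field must be present and non-null
        if any(ev.get(field) is None for field in REQUIRED_FIELDS):
            issues["missing_field"].append(i)
            continue  # remaining checks need the full record
        # Uniqueness: an event id should appear only once
        if ev["event_id"] in seen_ids:
            issues["duplicate_id"].append(i)
        seen_ids.add(ev["event_id"])
        # Plausibility: coordinates on the pitch, minute within a match
        if not (0 <= ev["x"] <= pitch_length and 0 <= ev["y"] <= pitch_width):
            issues["out_of_bounds"].append(i)
        if not (0 <= ev["minute"] <= 130):
            issues["bad_minute"].append(i)
    return issues


# Hypothetical records containing one problem of each kind except bad_minute
events = [
    {"event_id": 1, "match_id": 10, "minute": 12, "x": 50.0, "y": 30.0, "type": "pass"},
    {"event_id": 2, "match_id": 10, "minute": 13, "x": 120.0, "y": 30.0, "type": "shot"},
    {"event_id": 3, "match_id": 10, "minute": 14, "x": 60.0, "y": None, "type": "pass"},
    {"event_id": 1, "match_id": 10, "minute": 15, "x": 70.0, "y": 40.0, "type": "carry"},
]

report = validate_events(events)
print(report)
```

An out-of-bounds coordinate such as x = 120 may simply indicate a different provider's coordinate system rather than a recording error, which is the kind of cross-source inconsistency worth catching early.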

1.4.4 Phase 3: Analysis

With question defined and data in hand, the actual analysis begins. This phase includes:

Exploratory analysis:

  • What patterns exist in the data?
  • What surprises emerge?
  • What hypotheses are suggested?
  • Are there data quality issues that need to be addressed before proceeding?

Exploratory analysis is often undervalued but critically important. Spending time visualizing data, calculating summary statistics, and looking for anomalies before building models can prevent wasted effort and catch problems early. The analyst who jumps straight to modeling without exploring the data first is like the cook who starts cooking without checking whether the ingredients are fresh.

Statistical/quantitative analysis:

  • Calculating metrics and aggregations
  • Building models (regression, machine learning)
  • Testing hypotheses
  • Estimating uncertainty

Validation:

  • Do results make sense?
  • Are they robust to different methodological choices?
  • Do they replicate on holdout data?
  • Would a domain expert find the conclusions reasonable?

The analysis phase is where the techniques in this textbook—xG modeling, passing networks, spatial analysis, machine learning—are applied. But technique is only part of the challenge; judgment about which techniques to apply and how to interpret results is equally important. A simple analysis that answers the right question is far more valuable than a complex analysis that answers the wrong one.
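Estimating uncertainty does not require heavy machinery. As one illustration, a percentile bootstrap can attach an interval to a striker's shot conversion rate; the shot record, resample count, and function name below are assumptions, and the bootstrap is only one of several reasonable methods.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of binary outcomes."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the shots with replacement and recompute the rate
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - lower_idx - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical season: 12 goals from 80 shots (1 = goal, 0 = no goal)
shots = [1] * 12 + [0] * 68
point_estimate = sum(shots) / len(shots)
low, high = bootstrap_ci(shots)

print(f"conversion rate: {point_estimate:.3f}")
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

With only 80 shots the interval is wide, which is a useful reminder that a single season of finishing data supports weaker conclusions than a headline conversion rate suggests.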

1.4.5 Phase 4: Insight Generation

Raw analytical output is not insight. Numbers must be interpreted, contextualized, and synthesized into understanding. This phase involves:

Interpretation:

  • What do the results mean?
  • What story do they tell?
  • What are the key takeaways?

Contextualization:

  • How does this compare to benchmarks? Is a player's 0.35 xG per 90 minutes good or bad? It depends on the player's position, league, role, and the context in which they play.
  • What factors might explain the patterns? If a team's pressing intensity has declined, is it fatigue, tactical adjustment, or personnel change?
  • What are the limitations and uncertainties? Could the results change with more data or a different methodology?

Synthesis:

  • How does this connect to other analyses?
  • What is the bigger picture?
  • What questions remain?

Insight generation is where analytical skill meets domain expertise. Understanding soccer deeply is essential for interpreting results meaningfully. A model might identify a player as statistically exceptional, but without football knowledge, the analyst cannot assess whether that player would fit a particular tactical system, whether their league context inflates their numbers, or whether non-statistical factors (injury history, character, adaptability) should moderate enthusiasm.
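The question of whether 0.35 xG per 90 minutes is good or bad can be made concrete by ranking the value against a relevant peer group. The peer values below are invented for illustration, and the mid-rank percentile definition is one common convention among several.

```python
from typing import List


def percentile_rank(value: float, peers: List[float]) -> float:
    """Percentage of peers below `value`, counting ties as half."""
    below = sum(1 for p in peers if p < value)
    equal = sum(1 for p in peers if p == value)
    return 100.0 * (below + 0.5 * equal) / len(peers)


# Hypothetical xG per 90 values for two positional peer groups
strikers = [0.18, 0.22, 0.27, 0.31, 0.38, 0.41, 0.45, 0.52, 0.60, 0.66]
midfielders = [0.03, 0.05, 0.06, 0.08, 0.10, 0.12, 0.15, 0.19, 0.24, 0.40]

player_xg_p90 = 0.35
as_striker = percentile_rank(player_xg_p90, strikers)
as_midfielder = percentile_rank(player_xg_p90, midfielders)

print(f"vs strikers: {as_striker:.0f}th percentile")
print(f"vs midfielders: {as_midfielder:.0f}th percentile")
```

The same number is unremarkable for a striker and exceptional for a midfielder in this invented sample, which is why the choice of benchmark is part of the insight rather than a technical detail.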

Best Practice: Before presenting any analysis, force yourself to articulate the "so what?" in a single sentence. If you cannot explain why anyone should care about your findings in one sentence, the insight is either not clear enough or not important enough.

1.4.6 Phase 5: Communication and Action

Analysis without communication is worthless. Insights must be translated into decisions and actions:

Communication:

  • Presenting findings clearly and compellingly
  • Adapting message to audience
  • Visualizing data effectively
  • Choosing the right medium: a slide deck for a presentation, a dashboard for ongoing monitoring, a written report for detailed documentation

Recommendation:

  • Converting insights into specific recommendations
  • Acknowledging uncertainty appropriately
  • Providing options when appropriate
  • Making the decision easy for the stakeholder: not "here is some data" but "I recommend we do X because the data shows Y"

Action:

  • Supporting implementation of decisions
  • Monitoring outcomes
  • Learning from results

Common Pitfall: Many analysts focus obsessively on the analysis phase while neglecting communication. But an analysis that isn't understood isn't useful. Invest as much effort in presenting your work as in producing it. The best visualization in the world is useless if it takes ten minutes to explain. The most sophisticated model is worthless if the coach cannot understand what it is telling them to do.

1.4.7 The Feedback Loop

The workflow is not linear. Results from each phase feed back into earlier phases:

  • Analysis may reveal the question needs refinement
  • Insights may suggest new questions
  • Actions produce outcomes that generate new data
  • Unexpected findings may redirect the entire project

The best analysts embrace this iterative nature rather than expecting a clean start-to-finish process. In practice, a recruitment analysis might begin with a broad question ("who should we sign as a right-back?"), narrow through data exploration ("we need someone who excels at progressive carrying and high pressing"), shift further based on availability ("of the five statistical leaders, only three are realistically available"), and ultimately produce a recommendation that integrates statistical, scouting, financial, and practical considerations.

This iterative, messy process is the reality of analytical work. Textbooks (including this one) present methods in clean, sequential order for pedagogical purposes, but real-world application is always more complex.


1.5 Career Paths in Soccer Analytics

1.5.1 The Modern Analytics Department

Major clubs typically structure their analytics capabilities around several roles, though exact titles, responsibilities, and reporting lines vary significantly between organizations:

Data Engineers:

  • Build and maintain data infrastructure (databases, pipelines, APIs)
  • Ensure data quality and accessibility
  • Integrate multiple data sources into coherent systems
  • Manage cloud infrastructure and computing resources
  • Skills: SQL, Python, cloud platforms (AWS, GCP, Azure), databases, ETL tools

Data engineering is often the least glamorous but most essential function in an analytics department. Without reliable data infrastructure, no analysis is possible. Data engineers at clubs like Manchester City and Liverpool manage complex systems that ingest data from multiple providers, process it into usable formats, and serve it to analysts and applications.

Data Scientists:

  • Build statistical models and machine learning systems
  • Develop new metrics and frameworks
  • Conduct research into new methodologies
  • Evaluate and improve existing models
  • Skills: Statistics, ML, Python/R, research methods, mathematical modeling

Data scientists in soccer work on problems ranging from xG models and player valuation to injury prediction and match simulation. The work requires not only technical statistical skill but the ability to formulate football problems as mathematical problems—a translation that demands deep understanding of both domains.

Performance Analysts:

  • Prepare opposition analysis and reports
  • Support coaches with tactical insights
  • Create video compilations linked to data
  • Deliver pre-match and post-match presentations
  • Skills: Video analysis software (Hudl, Wyscout), communication, tactical understanding, presentation skills

Performance analysts are typically the analysts most closely embedded with the coaching staff. They work long, irregular hours tied to the match schedule, and their effectiveness depends heavily on their ability to communicate with coaches and players. Many performance analysts have backgrounds in coaching or playing, giving them credibility and contextual understanding that purely technical analysts may lack.

Research Analysts:

  • Focus on specific long-term research projects
  • Develop club's analytical methodology
  • Often specialized (e.g., set piece analyst, tracking data analyst)
  • Produce internal papers and methodological guides
  • Skills: Deep expertise in specific area, research methodology, academic writing

Research analysts have the luxury of working on longer time horizons than performance analysts, developing new methods and metrics that may take months to produce but provide lasting value. Some clubs, particularly in the Premier League, employ researchers with PhDs in physics, mathematics, or computer science.

Analytics Translators:

  • Bridge technical and non-technical stakeholders
  • Communicate insights to coaches and executives
  • Ensure analytics is actionable and understood
  • Build relationships across departments
  • Skills: Communication, domain expertise, relationship building, data visualization

The analytics translator role is increasingly recognized as critical. The most technically sophisticated analysis is worthless if it cannot be communicated to the people who make decisions. Translators combine enough technical understanding to grasp what models are doing with enough football knowledge to explain findings in terms that coaches and directors find meaningful.

1.5.2 Common Entry Points

There is no single path into soccer analytics. The diversity of entry points reflects the interdisciplinary nature of the field. Common routes include:

Academic route: - Degree in statistics, computer science, sports science, mathematics, physics, or related field - Research projects or thesis focused on soccer analytics - Publications or public portfolio demonstrating capability - Masters or PhD programs with sports analytics specializations (e.g., the University of Liverpool's Football Industries MBA, Birkbeck University's Sport Management and the Business of Football, or various data science programs)

Industry route:

  • Experience in betting industry or related quantitative field (finance, consulting, technology)
  • Transferable skills in data science or engineering
  • Domain expertise from playing or coaching at any level
  • Many successful analysts entered from non-sports backgrounds, bringing analytical methods from other domains

Public analytics route:

  • Building a public profile through blog posts, Twitter threads, or other platforms
  • Contributing to open-source projects like mplsoccer (Python soccer visualization library) or socceraction
  • Participating in competitions (e.g., the Friends of Tracking challenge, Kaggle competitions)
  • Attending and presenting at conferences

The public analytics route has produced a remarkable number of professional careers. Analysts including Tom Worville (hired by multiple clubs and media organizations), Ashwin Raman (hired by Dundee United), and many others built public profiles through analytical work that attracted the attention of clubs and media companies. The message is clear: demonstrating your skills publicly is one of the most effective ways to break into the field.

Internal transition:

  • Moving from another role within a club (e.g., scout, video analyst, sports scientist)
  • Adding analytical skills to existing football expertise
  • This route offers the advantage of existing relationships and football credibility

1.5.3 Building Your Portfolio

Regardless of entry point, a strong portfolio is essential. Hiring managers in soccer analytics consistently report that demonstrated ability matters more than credentials. A candidate with no formal qualifications but an impressive portfolio of public analytical work will often be preferred over a candidate with perfect academic credentials but nothing to show for it.

Components of a strong portfolio might include:

Technical projects:

  • xG model built from scratch, with documentation explaining methodology, validation, and limitations
  • Passing network analysis revealing tactical patterns in specific teams or matches
  • Player similarity tool that identifies comparable players across leagues
  • Match prediction model with calibration analysis showing how well predicted probabilities match observed outcomes
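The calibration analysis mentioned in the last item can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation: the function name `calibration_table`, the choice of five equal-width bins, and the ten made-up predictions and outcomes are all assumptions for the example. The idea is simply to group predictions by their stated probability and check how often the predicted outcome actually occurred in each group.

```python
def calibration_table(predicted, observed, n_bins=5):
    """Compare mean predicted probability with observed frequency per bin.

    predicted: probabilities in [0, 1]; observed: 1 if the outcome happened, else 0.
    Returns rows of (bin_low, bin_high, count, mean_predicted, observed_rate).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than divide by zero
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_pred, obs_rate))
    return rows


# Made-up example: ten home-win probabilities and whether the home side won.
predicted = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
observed = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

rows = calibration_table(predicted, observed)
for low, high, count, mean_pred, obs_rate in rows:
    print(f"{low:.1f}-{high:.1f}: n={count}, predicted={mean_pred:.2f}, observed={obs_rate:.2f}")
```

A well-calibrated model shows observed rates close to the mean predicted probability in every bin. With only two predictions per bin, as here, the comparison is far too noisy to support any conclusion; a real portfolio project would need hundreds or thousands of matches.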

Written analysis:

  • Blog posts explaining methods in accessible language
  • Deep dives on specific questions (e.g., "How does Manchester City's pressing structure change when trailing?")
  • Analysis of current events demonstrating ability to produce timely, relevant work

Visualization:

  • Clear, effective data visualizations that communicate findings without requiring extensive explanation
  • Interactive dashboards using tools like Streamlit, Tableau, or Observable
  • Novel visual formats that present familiar data in new, illuminating ways

Code:

  • Clean, documented code on GitHub demonstrating engineering practices (version control, documentation, testing)
  • Contributions to open-source packages used by the community
  • Tutorial notebooks that teach methods to others

Intuition: Think of your career development like building an xG model: you need good features (skills), training data (experience), and validation (portfolio). Just as models improve iteratively, your career will develop through continuous learning and feedback. And just as the best xG models are transparent about their methodology, the best analysts are transparent about their skills, experience, and learning process.

1.5.4 Beyond Clubs

Soccer analytics careers extend well beyond club data departments:

Data providers: Opta/Stats Perform, StatsBomb, Wyscout, Second Spectrum, SkillCorner, and others employ analysts to develop products, build models, support clients, and conduct research. These roles offer exposure to data from many clubs and competitions, providing a breadth of experience that club roles may not.

Media: The Athletic, ESPN, BBC, The Guardian, Sky Sports, and others employ analytical journalists and data visualization specialists. These roles combine analytical skill with writing and communication ability. Journalists like Michael Cox (Zonal Marking, The Athletic), James Yorke (StatsBomb, The Athletic), and others have built influential careers at the intersection of analytics and media.

Agencies: Player agencies increasingly use analytics in negotiations, player development advice, and career planning. Agencies like CAA Stellar and Wasserman employ analysts to support their agents with data-driven arguments.

Federations: National associations use analytics for international team management, league development, referee performance monitoring, and talent identification. FIFA, UEFA, and national federations all employ analysts.

Technology companies: Companies building analytics tools and platforms (e.g., Metrica Sports, Sportec Solutions, Twenty First Group) hire developers, data scientists, and product managers.

Academia: Research positions at universities with sports analytics programs. Academic researchers contribute foundational methods that practitioners then apply. Universities including KU Leuven, the University of Bath, and various American universities have active soccer analytics research programs.

Consulting: Independent analysts and consulting firms supporting multiple clients, often specializing in recruitment analytics, league analytics, or specific technical areas.


1.6 Ethical Considerations in Soccer Analytics

1.6.1 The Ethical Landscape

As soccer analytics grows in influence, ethical questions become increasingly important. Analytics is not neutral: it affects people's careers, livelihoods, and well-being. A negative analytical evaluation can contribute to a player being dropped, not signed, or undervalued in contract negotiations. A flawed injury prediction model might lead to a player being rested unnecessarily or, worse, not rested when they should be. Analysts must navigate complex terrain involving privacy, fairness, transparency, and responsible use of powerful tools.

The ethical landscape in soccer analytics is still developing. There are few established codes of conduct specific to sports analytics, and practitioners must often rely on general ethical principles and personal judgment. This section raises key questions and provides frameworks for thinking about them, rather than offering definitive answers.

Player monitoring: Tracking data and biometric monitoring generate intimate information about players' bodies and movements. Questions arise:

  • Do players meaningfully consent to this monitoring? In many cases, monitoring is a condition of employment, raising questions about whether consent is truly voluntary.
  • How is this data protected from misuse? Player data could be leaked to media, used inappropriately in contract negotiations, or accessed by unauthorized parties.
  • Can players access data collected about them? GDPR and similar regulations give individuals the right to access their personal data, but the application of these regulations to professional sports data is still being tested.
  • What happens to data when players leave a club? Does the club retain the data indefinitely? Can the player request deletion?

Youth players: Tracking and analyzing young players raises additional concerns:

  • Are youth players and their families fully informed about what data is collected and how it is used?
  • Could negative evaluations harm development or mental health? A young player labeled as "below threshold" by a model might internalize that assessment in harmful ways.
  • How do we protect minors' data, particularly given the long time horizons involved (data collected at age 14 might still exist when the player is 30)?

Real-World Application: In 2019, the European Club Association published guidelines on the ethical use of player data, recommending that clubs establish clear policies on data collection, storage, access, and deletion. However, implementation remains uneven, and many clubs lack formal data governance frameworks for player data.

1.6.3 Fairness and Bias

Algorithmic bias: Models trained on historical data may perpetuate historical biases:

  • Players from underrepresented backgrounds or less visible leagues may be systematically undervalued. If a model is trained primarily on data from the top five European leagues, it may not accurately evaluate players from the Brazilian Serie B, the Egyptian Premier League, or the Japanese J-League.
  • Physical metrics may disadvantage certain body types. Models that reward height, speed, or power may undervalue players whose strengths lie in technical skill, intelligence, or positioning.
  • League adjustments may undervalue players from lower-profile competitions. Converting performance from the Austrian Bundesliga to the Premier League involves substantial uncertainty, and simplistic conversion factors may systematically over- or under-value certain player profiles.

Transparency: Players and agents increasingly want to understand how analytical evaluations work:

  • Should clubs explain their analytical methods to players who are evaluated by them? If a player is not signed because of an analytical assessment, do they have a right to understand what that assessment was based on?
  • Do players have a right to contest data-driven assessments?
  • How transparent should the transfer market be about the role of analytics in valuations?

1.6.4 Competitive Integrity

Data security: Clubs invest heavily in proprietary analytics. Protecting this investment raises questions:

  • What obligations do employees have regarding proprietary methods when they leave for competitors? Non-disclosure agreements are common, but the line between "proprietary model" and "general analytical skill" is blurry.
  • How should clubs handle analysts who leave for competitors? Can they prevent analysts from using methods they developed during their employment?
  • Is there such a thing as "trade secrets" in sports analytics? The legal framework for intellectual property in sports analytics is still evolving.

Match manipulation: Detailed analytical capabilities create new manipulation risks:

  • Could analytics be used to fix matches more effectively by identifying which specific events to manipulate for maximum impact on results or betting markets?
  • How should governing bodies monitor for suspicious patterns? Analytical methods can detect unusual betting patterns and match outcomes, but staying ahead of sophisticated manipulation requires constant vigilance.
  • What responsibilities do analysts have if they detect potential manipulation in their data?

1.6.5 Responsible Communication

Public analytics: Analysts who work publicly face communication responsibilities:

  • Acknowledging uncertainty appropriately. Presenting an xG model's output as definitive truth rather than a probabilistic estimate is irresponsible.
  • Avoiding false precision in predictions. Saying "this team has a 37.2% chance of winning the league" implies a precision that no model can justify.
  • Being clear about limitations. Every model has assumptions and blind spots.
  • Not making definitive judgments about individuals based on limited data. Publicly declaring that a specific player is "not good enough" based solely on statistical evidence ignores the many factors data cannot capture.

Media and fans: How analytics is communicated to general audiences matters:

  • Avoiding misleading visualizations that exaggerate differences or obscure uncertainty
  • Providing context for metrics so that audiences can interpret them correctly
  • Not weaponizing data to attack players or teams in ways that could cause real harm

1.6.6 A Framework for Ethical Decision-Making

When facing ethical dilemmas, consider:

  1. Who is affected? Identify all stakeholders impacted by the decision
  2. What are the potential harms? Consider worst-case scenarios
  3. What are the potential benefits? Consider best-case scenarios
  4. Is consent informed and meaningful? Especially for data collection
  5. Is there transparency? Can you explain and defend your actions?
  6. What would a reasonable person think? The "newspaper test"—would you be comfortable if your actions were reported in the press?
  7. What are the alternatives? Are there less harmful approaches that still achieve the objective?
  8. What precedent does this set? If everyone in the field behaved this way, would the consequences be acceptable?

Intuition: Ethics in soccer analytics is a developing area without established consensus. This section raises questions rather than providing definitive answers. Thoughtful practitioners will develop their own frameworks through experience and reflection. The key principle is to remember that behind every data point is a person—a player, a coach, a scout—whose career and well-being may be affected by analytical work.


1.7 Types of Soccer Data: A First Look

Before diving into data sources in Chapter 2, it is worth establishing a conceptual overview of the types of data that power soccer analytics. Understanding these categories at a high level will help frame the more detailed discussion to come.

1.7.1 Event Data

Event data records discrete actions during a match: every pass, shot, tackle, duel, foul, and substitution, along with metadata describing each action (who did it, where, when, and what happened). Event data is the most widely available and most commonly used type of soccer data. It powers the majority of public analytics, from xG models to passing networks to player ratings.

A typical Premier League match generates approximately 2,000-3,000 tagged events. Each event includes coordinates (where on the pitch it occurred), timestamps (when it occurred), and qualifiers (additional descriptive tags such as whether a pass was headed, played with the left foot, or under pressure).
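As a rough illustration of what a single tagged event might look like in code, the sketch below models an event as a small Python record. The class name `MatchEvent`, its field names, the 0-100 coordinate scale, and the sample values are all assumptions made for illustration; real data providers define their own schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchEvent:
    """One tagged on-ball action (illustrative fields, not a provider schema)."""
    event_type: str   # e.g. "pass", "shot", "tackle"
    player: str       # who performed the action
    team: str
    minute: int       # timestamp: minute of the match
    second: int       # timestamp: second within that minute
    x: float          # pitch location, assumed here to be on a 0-100 scale
    y: float
    outcome: str      # what happened, e.g. "complete", "saved"
    qualifiers: List[str] = field(default_factory=list)  # extra descriptive tags

    def match_seconds(self) -> int:
        """Timestamp expressed as seconds since kick-off."""
        return self.minute * 60 + self.second


# Two hypothetical events from the same attacking move.
events = [
    MatchEvent("pass", "Player A", "Home", 12, 4, 35.0, 60.0, "complete",
               ["left_foot", "under_pressure"]),
    MatchEvent("shot", "Player B", "Home", 12, 9, 88.0, 52.0, "saved",
               ["right_foot"]),
]

# Typical first questions: which events were shots, and which were under pressure?
shots = [e for e in events if e.event_type == "shot"]
pressured = [e for e in events if "under_pressure" in e.qualifiers]
print(len(shots), [e.player for e in pressured])
```

Keeping qualifiers as an open-ended list of tags, rather than one field per tag, mirrors the description above: each event has a fixed core (who, where, when, what) plus a variable set of descriptive extras.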

1.7.2 Tracking Data

Tracking data captures the continuous position of every player and the ball, typically at 25 frames per second. This produces approximately 4-6 million data points per match—orders of magnitude more than event data. Tracking data enables analysis of off-ball movement, pressing patterns, space creation, and other aspects of play that are invisible in event data.
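The scale of tracking data follows from simple arithmetic, which the short sketch below works through. The inputs beyond the 25 frames per second quoted above are assumptions: 23 tracked objects (22 players plus the ball), two coordinates per object, 90 minutes of play with no stoppage time, and 2,500 events per match as the midpoint of the range given for event data. The exact total depends on whether a "data point" means a position or each coordinate value, so treat the result as an order-of-magnitude check.

```python
FRAMES_PER_SECOND = 25     # sampling rate quoted in the text
MATCH_MINUTES = 90         # assumption: regulation time only, no stoppage time
TRACKED_OBJECTS = 23       # assumption: 22 players plus the ball
COORDS_PER_OBJECT = 2      # assumption: x and y only
EVENTS_PER_MATCH = 2_500   # assumption: midpoint of the 2,000-3,000 event range

frames = FRAMES_PER_SECOND * MATCH_MINUTES * 60   # snapshots of the pitch
positions = frames * TRACKED_OBJECTS              # one position per object per frame
values = positions * COORDS_PER_OBJECT            # individual coordinate values
ratio = values / EVENTS_PER_MATCH                 # coordinate values per tagged event

print(f"frames:    {frames:,}")
print(f"positions: {positions:,}")
print(f"values:    {values:,}")
print(f"values per event: {ratio:,.0f}")
```

Under these assumptions a match yields 135,000 frames, about 3.1 million positions, and about 6.2 million coordinate values, which is a few thousand times the number of tagged events.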

1.7.3 Broadcast Tracking Data

A relatively recent innovation, broadcast tracking uses computer vision applied to broadcast television footage to derive approximate tracking data without requiring stadium-installed cameras. Companies like SkillCorner have pioneered this approach, which dramatically expands the coverage of tracking data to any match that is televised. The accuracy is lower than dedicated optical tracking systems, but the coverage is vastly broader, enabling cross-league comparisons and analysis of matches in competitions that lack installed tracking infrastructure.

1.7.4 Physical and Biometric Data

GPS vests, accelerometers, and heart rate monitors worn by players during training and matches generate detailed physical performance data. This data is primarily used by sports science and medical staff for load management, injury prevention, and fitness monitoring, but it also has tactical applications—for example, understanding how a team's pressing intensity changes over the course of a match or a season.


1.8 Looking Ahead: Your Analytics Journey

1.8.1 What You'll Learn

This textbook will take you from foundational concepts to advanced techniques. Here's a preview of the journey ahead:

Part I: Foundations (Chapters 1-6)
You'll learn where data comes from, how to manipulate it with Python, and how to think statistically about soccer problems. These chapters establish the fundamental knowledge and skills upon which everything else is built. Even experienced data scientists may find value in the soccer-specific applications of familiar statistical concepts.

Part II: Core Analytics (Chapters 7-14)
You'll build xG models, analyze passing networks, evaluate defensive performance, and develop comprehensive player and team metrics. These chapters cover the bread-and-butter techniques that form the core of professional soccer analytics work.

Part III: Advanced Analytics (Chapters 15-21)
You'll work with tracking data, apply machine learning, support scouting processes, and analyze tactics. These chapters push into more sophisticated territory, requiring stronger technical skills and deeper domain knowledge.

Part IV: Advanced Topics (Chapters 22-26)
You'll explore deep learning, economic analysis, injury prevention, and real-time analytics. These chapters cover emerging areas where the field is actively evolving.

Part V: Capstone (Chapters 27-28)
You'll integrate everything into comprehensive projects and look toward the field's future.

1.8.2 How to Succeed

Students who succeed in mastering soccer analytics share several characteristics:

Curiosity: They want to understand why things work, not just how to do them. They ask "why does this model work?" rather than just copying code that produces results.

Persistence: They push through confusion and frustration rather than giving up. Learning analytics involves encountering concepts that are initially confusing—probability distributions, gradient descent, Voronoi tessellations—and working through that confusion until understanding emerges.

Practice: They write code, complete exercises, and build projects rather than just reading. Reading about xG models is not the same as building one. The exercises and projects in this textbook are not optional extras; they are where the deepest learning happens.

Skepticism: They question methods and results, including their own. They ask "what could be wrong with this analysis?" before asking "what does this analysis show?"

Collaboration: They engage with the community, ask questions, and share knowledge. Soccer analytics has one of the most generous and collaborative communities in data science. Taking advantage of that community—through social media, conferences, open-source contributions, and informal networks—accelerates learning enormously.

1.8.3 The State of the Art

Soccer analytics is a young field with enormous room for improvement. Current methods have significant limitations:

  • Event data misses most of what happens on a soccer pitch—all the off-ball movement, spatial positioning, and continuous play between discrete actions
  • Tracking data is expensive and not universally available, creating an information asymmetry between wealthy and less wealthy clubs
  • Most public metrics are crude approximations of complex phenomena. Even xG, the field's most established advanced metric, captures only a fraction of what determines shooting outcomes.
  • Uncertainty is rarely communicated appropriately. Metrics are presented as precise numbers when they are actually estimates with substantial error bars.
  • Integration with coaching practice remains challenging. The gap between what data scientists produce and what coaches can use is narrowing but remains significant.
  • Small sample sizes plague many analyses. A player's performance over a 38-game season involves a relatively small number of observations for most event types, making it difficult to distinguish signal from noise.

These limitations represent opportunities. The analysts who develop better methods, better communication, and better integration will shape the field's future. Some of the most impactful work in the coming years will likely come not from more sophisticated models but from better ways to communicate existing insights to coaches and decision-makers.
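To make the small-sample limitation above concrete, the sketch below puts a confidence interval around a shot conversion rate. It uses the Wilson score interval, which is one common choice for proportions and not a method prescribed by this chapter. The function name `wilson_interval` and the example of a striker scoring 12 goals from 80 shots are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical striker: 12 goals from 80 shots in a season (15% conversion).
goals, shots = 12, 80
low, high = wilson_interval(goals, shots)
print(f"conversion: {goals / shots:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

For this hypothetical season the interval runs from roughly 9% to 24%, wide enough to cover both a poor finisher and an excellent one. That is the signal-versus-noise problem in miniature: a single season of shots rarely pins down a player's underlying conversion rate.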


1.9 Chapter Summary

Key Concepts

  1. Soccer analytics is the systematic application of data analysis to improve decision-making in football, serving multiple stakeholders from coaches to executives to fans.

  2. The field has evolved from Charles Reep's pencil-and-paper tallies in the 1950s and Valeriy Lobanovskyi's scientific approach in Kyiv through the Moneyball-inspired era to today's sophisticated data science operations, driven by advances in data availability, computing power, and analytical methods.

  3. Key stakeholders include technical staff (coaches, players), football operations (sporting directors, scouts), business operations, and external parties (media, fans, betting industry). Each stakeholder has different analytical needs and communication preferences.

  4. The analytics workflow proceeds from question to data to analysis to insight to action, with feedback loops connecting all phases. The quality of the question determines the value of everything that follows.

  5. Career paths include data engineering, data science, performance analysis, research analysis, and analytics translation, with entry points from academia, industry, public analytics, or internal transition.

  6. Ethical considerations around privacy, fairness, transparency, and responsible communication are increasingly important as analytics gains influence over people's careers and livelihoods.

Key Formulas

This introductory chapter is conceptual rather than mathematical. Key formulas will begin in Chapter 3.

Key Code Patterns

This chapter is primarily conceptual. Systematic Python implementation begins in Chapter 4.

Decision Framework

When starting an analytics project:

├── Is the question well-defined?
│   ├── No → Refine the question with stakeholders
│   └── Yes → Continue
├── Is relevant data available?
│   ├── No → Can it be collected or purchased?
│   │   ├── No → Modify question to match available data
│   │   └── Yes → Acquire data
│   └── Yes → Continue
├── Is the analysis feasible given time and resources?
│   ├── No → Scope down or request more resources
│   └── Yes → Proceed with analysis
└── Can results be communicated effectively?
    ├── No → Plan communication strategy before starting
    └── Yes → Execute and iterate

What's Next

In Chapter 2: Data Sources and Collection in Soccer, we will explore where soccer data comes from, how it's collected, and how to access it. You'll learn about event data, tracking data, and public data sources, setting the foundation for all the analysis to come.

Before moving on, complete the exercises and quiz to solidify your understanding of the concepts introduced in this chapter.


Chapter 1 Exercises → exercises.md

Chapter 1 Quiz → quiz.md

Case Study: The Liverpool Analytics Revolution → case-study-01.md

Case Study: Brentford's Moneyball Approach → case-study-02.md

