---
title: "Case Study 1: End-to-End Analytics --- From Raw Data to Boardroom Presentation"
chapter: 29
difficulty: advanced
estimated_time: "90 minutes"
data_required: false
---
Case Study 1: End-to-End Analytics --- From Raw Data to Boardroom Presentation
"The goal of analytics is not to produce models. It is to produce better decisions." --- Anonymous Technical Director
Executive Summary
This case study traces the complete journey of a soccer analytics project at a fictional but representative Premier League club, Riverside United. Over the course of a single transfer window, the analytics department was tasked with answering a deceptively simple question from the board: "Are we spending our money wisely?" The project integrated data engineering, statistical modeling, visualization, and stakeholder communication --- the full lifecycle that every professional analytics workflow must traverse.
Skills Applied:

- Data pipeline construction and validation
- Feature engineering for player valuation
- Statistical modeling and uncertainty quantification
- Executive-level communication and visualization
- Decision framing under uncertainty
Background
The Organization
Riverside United finished 11th in the Premier League in the 2023--24 season. Over the previous three transfer windows, the club had invested approximately £180 million in player recruitment. Despite this expenditure, the team's league position had improved only marginally (from 14th to 11th). The board, facing pressure from ownership, demanded a rigorous assessment of transfer value.
The Brief
The sporting director convened a meeting with the head of analytics and the head of recruitment. The mandate was clear:
- Retrospective audit: For each signing in the last three windows, quantify whether the player delivered value relative to their fee.
- Comparative benchmarking: Compare Riverside's recruitment efficiency against peer clubs (those finishing 8th--14th in the same period).
- Forward-looking model: Build a framework for evaluating prospective signings that accounts for on-pitch contribution, injury risk, and resale value.
- Board presentation: Deliver findings in a 20-minute presentation that the board (non-technical audience) can understand and act upon.
Timeline: six weeks from brief to boardroom.
Phase 1: Data Engineering (Weeks 1--2)
Data Source Inventory
The analytics team catalogued available data:
| Source | Data Type | Granularity | Coverage |
|---|---|---|---|
| Event data provider | Passes, shots, carries, pressures | Per event | All PL matches, 3 seasons |
| Tracking data provider | Positional coordinates, 25 fps | Per frame | Home matches only |
| Medical department | Injury records, rehabilitation timelines | Per incident | All squad players |
| Finance department | Transfer fees, wages, amortization | Per player | All signings |
| Scouting platform | Pre-signing reports, video tags | Per player | Signed players + shortlisted |
| Market value API | Estimated market values | Monthly | All PL players |
Pipeline Architecture
Raw Sources --> Ingestion Layer --> Validation Layer --> Feature Store --> Modeling Layer

- Raw Sources: six sources (see inventory above)
- Ingestion Layer: schema checks, type coercion, deduplication
- Validation Layer: range/null checks, cross-source joins, temporal alignment
- Feature Store: per-90 metrics, rolling averages, composite indices
- Modeling Layer: valuation model, risk scoring, output tables
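As a concrete illustration of the validation layer, the sketch below shows what range and null checks might look like for one ingested event table. It assumes pandas and illustrative column names (`match_id`, `player_id`, `x`, `y`, `event_type`); the real pipeline's schemas are not documented in this case study.

```python
import pandas as pd

def validate_events(events: pd.DataFrame) -> pd.DataFrame:
    """Illustrative schema, null, and range checks for one event table."""
    required = ["match_id", "player_id", "x", "y", "event_type"]

    # Schema check: every expected column must be present
    missing = [c for c in required if c not in events.columns]
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")

    # Null check: identifiers and coordinates must be populated
    null_counts = events[required].isna().sum()
    if null_counts.any():
        raise ValueError(f"Null check failed:\n{null_counts[null_counts > 0]}")

    # Range check: coordinates must fall inside the 120x80 pitch grid
    out_of_range = ~events["x"].between(0, 120) | ~events["y"].between(0, 80)
    if out_of_range.any():
        raise ValueError(f"{out_of_range.sum()} events outside the 120x80 grid")

    return events
```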
Data Quality Challenges
Three significant data quality issues emerged during pipeline construction:
- Coordinate system mismatch: The event data provider changed their coordinate system between the 2021--22 and 2022--23 seasons (from a 100x100 to a 120x80 grid). All historical coordinates had to be rescaled (a sketch of the rescaling follows this list).
- Injury record gaps: The medical department's records predated the analytics department by several years and used inconsistent injury classification codes. A mapping table was required.
- Wage data confidentiality: Finance released anonymized wage bands rather than exact figures. The analytics team worked with bucketed categories (e.g., £50k--£75k per week) rather than precise numbers.
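The coordinate fix in the first item is simple arithmetic once both grids are known. A minimal sketch, assuming the event table stores coordinates in `x` and `y` columns:

```python
import pandas as pd

def rescale_to_120x80(events: pd.DataFrame) -> pd.DataFrame:
    """Rescale pre-2022-23 event coordinates from a 100x100 grid to 120x80."""
    out = events.copy()
    out["x"] = out["x"] * (120 / 100)  # length axis: 0-100 becomes 0-120
    out["y"] = out["y"] * (80 / 100)   # width axis: 0-100 becomes 0-80
    return out
```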
Practitioner Insight: Data engineering consistently consumes 50--70% of the total effort in professional analytics projects. The temptation to rush past this phase and get to "the interesting part" (modeling) must be resisted. Models built on unreliable data produce unreliable results, eroding stakeholder trust.
Phase 2: Feature Engineering and Modeling (Weeks 2--4)
Player Contribution Model
The team built a composite Player Contribution Index (PCI) that measured each player's on-pitch value across four dimensions:
$$ \text{PCI} = w_1 \cdot \text{Goal Contribution} + w_2 \cdot \text{Progression} + w_3 \cdot \text{Defensive Impact} + w_4 \cdot \text{Availability} $$
where $w_1 + w_2 + w_3 + w_4 = 1$ and the weights were set through a combination of regression analysis and expert calibration with the coaching staff.
Goal Contribution combined xG, xA, and goal-creating actions per 90 minutes. Progression measured progressive passes, progressive carries, and entries into the final third per 90. Defensive Impact aggregated pressures, tackles, interceptions, and aerial duels. Availability was the fraction of possible minutes actually played, penalizing injury-prone players.
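The regression half of the weight calibration described above could look like the sketch below, which fits a non-negative linear model of an outcome proxy on the four normalized components and rescales the coefficients to sum to one. The outcome proxy, the scikit-learn choice, and the function name are assumptions made for illustration, not the club's documented procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

COMPONENTS = ["goal_contribution", "progression", "defensive_impact", "availability"]

def fit_pci_weights(X: np.ndarray, y: np.ndarray) -> dict:
    """Estimate PCI weights from data.

    X: one column per normalized component, in COMPONENTS order.
    y: an outcome proxy, e.g. minutes-weighted team points while on the pitch.
    """
    model = LinearRegression(positive=True).fit(X, y)
    raw = np.clip(model.coef_, 0, None)   # guard against numerical noise
    weights = raw / raw.sum()             # constrain weights to sum to 1
    return dict(zip(COMPONENTS, weights))
```

In the workflow described here, the fitted weights would then be reviewed and adjusted with the coaching staff before being frozen for the audit.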
Transfer Value Model
The transfer value model compared actual fee paid against an expected fee derived from the player's age, position, contract length, selling club league tier, and pre-signing performance metrics.
$$ \text{Value Ratio} = \frac{\text{PCI per season}}{\text{Amortized annual cost}} $$
where amortized annual cost = (transfer fee / contract length) + annual wages.
A value ratio above 1.0 indicated the player was delivering more value than their cost; below 1.0 suggested overpayment relative to contribution.
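The "expected fee" side of the comparison described above is a supervised model of fee on player attributes. A sketch of one plausible implementation, assuming scikit-learn and illustrative column names; the gradient-boosted regressor is a choice made here for illustration, not stated in the case study.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Column names are assumptions about how the joined transfer table might look.
NUMERIC = ["age", "contract_years", "pre_signing_pci"]
CATEGORICAL = ["position", "selling_league_tier"]

def fit_expected_fee_model(historical_deals: pd.DataFrame) -> Pipeline:
    """Fit an expected-fee model on historical transfers."""
    model = Pipeline([
        ("prep", ColumnTransformer([
            ("num", "passthrough", NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ])),
        ("reg", GradientBoostingRegressor(random_state=0)),
    ])
    model.fit(historical_deals[NUMERIC + CATEGORICAL],
              historical_deals["transfer_fee"])
    return model

def fee_premium(model: Pipeline, signings: pd.DataFrame) -> pd.Series:
    """Actual fee minus expected fee; positive values flag possible overpayment."""
    expected = model.predict(signings[NUMERIC + CATEGORICAL])
    return signings["transfer_fee"] - expected
```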
Peer Benchmarking
The team computed Value Ratios for all signings by clubs finishing 8th--14th over the same three-season window, establishing a league-wide distribution. Riverside's signings were ranked within this distribution.
```python
import pandas as pd
from typing import Dict

def compute_value_ratio(
    player_stats: pd.DataFrame,
    financial_data: pd.DataFrame,
    weights: Dict[str, float]
) -> pd.DataFrame:
    """Compute Player Contribution Index and Value Ratio.

    Args:
        player_stats: Per-90 performance metrics for each player-season.
        financial_data: Transfer fees, wages, and contract lengths.
        weights: Dictionary mapping PCI components to weights (summing to 1).

    Returns:
        DataFrame with PCI, amortized cost, and value ratio per player.
    """
    df = player_stats.merge(financial_data, on="player_id")

    # Normalize each component to 0-1 within position group, so that
    # centre-backs are compared with centre-backs rather than with forwards
    for component in weights:
        grouped = df.groupby("position")[component]
        df[f"{component}_norm"] = grouped.transform(
            lambda x: (x - x.min()) / (x.max() - x.min() + 1e-9)
        )

    # PCI is the weighted sum of the normalized components
    df["pci"] = sum(
        weights[comp] * df[f"{comp}_norm"] for comp in weights
    )

    # Amortized annual cost: fee spread over the contract plus annual wages
    df["annual_cost"] = (
        df["transfer_fee"] / df["contract_years"] + df["annual_wages"]
    )

    # Value ratio: cost is scaled by the median so the ratio is dimensionless
    # and comparable across players and clubs
    df["value_ratio"] = df["pci"] / (
        df["annual_cost"] / df["annual_cost"].median()
    )
    return df
```
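With the function above, peer benchmarking reduces to pooling the value ratios for all 8th--14th-place signings and percentile-ranking each club's deals within that distribution. A usage sketch, assuming a `club` column exists in the merged table:

```python
import pandas as pd

def rank_within_peers(peer_df: pd.DataFrame,
                      club: str = "Riverside United") -> pd.DataFrame:
    """Percentile-rank one club's signings within the pooled peer distribution."""
    ranked = peer_df.copy()
    ranked["vr_percentile"] = ranked["value_ratio"].rank(pct=True)
    return ranked.loc[ranked["club"] == club].sort_values(
        "vr_percentile", ascending=False
    )
```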
Phase 3: Results and Findings (Weeks 4--5)
Retrospective Audit
Of Riverside's 14 signings across three windows:
| Category | Count | Avg. Value Ratio | Assessment |
|---|---|---|---|
| Strong value (VR > 1.5) | 3 | 2.1 | Excellent recruitment |
| Fair value (VR 0.8--1.5) | 5 | 1.1 | Reasonable |
| Poor value (VR 0.4--0.8) | 4 | 0.6 | Below expectations |
| Write-off (VR < 0.4) | 2 | 0.2 | Significant overpayment |
The two "write-off" signings accounted for 45 million of the 180 million spent, with both players suffering prolonged injuries that were predictable from pre-signing medical data.
Peer Comparison
Riverside ranked 5th out of 7 peer clubs in recruitment efficiency (median value ratio). The top-performing peer club, which finished 8th, had a median value ratio of 1.4 compared to Riverside's 0.95.
Key Insight: The Injury Tax
The single largest driver of poor value was player unavailability. When the Availability component was removed from PCI, Riverside's ranking improved to 3rd among peers. The implication was clear: Riverside was identifying talented players but failing to adequately account for injury risk in their decision-making.
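Mechanically, this finding comes from a sensitivity analysis: drop the Availability component, renormalize the remaining weights, and recompute the peer rankings. A sketch reusing `compute_value_ratio` from above (the component key name is an assumption):

```python
def value_ratio_without_availability(player_stats, financial_data, weights):
    """Recompute the Value Ratio with the Availability component removed."""
    reduced = {k: w for k, w in weights.items() if k != "availability"}
    total = sum(reduced.values())
    reduced = {k: w / total for k, w in reduced.items()}  # weights sum to 1 again
    return compute_value_ratio(player_stats, financial_data, reduced)
```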
Phase 4: Communication and Presentation (Weeks 5--6)
Designing the Board Presentation
The analytics team followed five principles for the boardroom presentation:
- Lead with the answer: The first slide stated the headline finding: "Our player identification is above average; our injury risk assessment is costing us £15 million per season in lost value."
- Use analogies: The Value Ratio was presented as "return on investment, like any other business asset." Board members immediately understood the framework.
- Minimize jargon: Terms like "xG" and "per-90 normalization" were replaced with "goal quality" and "rate per match."
- Show uncertainty: Each Value Ratio was presented with a confidence interval, and the team explicitly stated the assumptions and limitations (one way to construct such intervals is sketched after this list).
- End with recommendations: Three actionable recommendations, each with cost and expected impact.
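The case study does not specify how the intervals were computed; one common approach is to bootstrap over a player's match-level PCI values, as in the sketch below (the function and argument names are illustrative assumptions).

```python
import numpy as np

def bootstrap_value_ratio_ci(per_match_pci: np.ndarray,
                             annual_cost_scaled: float,
                             n_boot: int = 2000,
                             alpha: float = 0.05) -> tuple:
    """Bootstrap a confidence interval for one player's value ratio.

    per_match_pci: the player's PCI evaluated match by match.
    annual_cost_scaled: amortized annual cost, already scaled to the peer median.
    """
    rng = np.random.default_rng(0)
    boot = np.empty(n_boot)
    for i in range(n_boot):
        # Resample matches with replacement and recompute the season-level ratio
        sample = rng.choice(per_match_pci, size=len(per_match_pci), replace=True)
        boot[i] = sample.mean() / annual_cost_scaled
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```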
The Three Recommendations
- Integrate medical risk scoring into the transfer decision framework. Cost: two additional sports medicine consultants and a risk model. Expected impact: avoiding 1--2 "write-off" signings per window cycle.
- Establish a minimum Value Ratio threshold of 0.8 for all signings. This would have screened out 4 of the 6 underperforming signings (a sketch of such a screen follows this list).
- Invest in the analytics infrastructure to enable real-time player monitoring. The current system could not flag emerging injury risks during the season.
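Recommendations 1 and 2 amount to a pre-signing decision rule. A minimal sketch of how they might combine, assuming a projected value ratio from the forward-looking model and an injury risk score on a 0--1 scale; the risk ceiling of 0.7 and the field names are assumptions, not part of the board-approved policy.

```python
def screen_candidate(projected_value_ratio: float,
                     injury_risk_score: float,
                     vr_threshold: float = 0.8,
                     risk_ceiling: float = 0.7) -> str:
    """Apply an illustrative pre-signing screen to a prospective signing.

    injury_risk_score: assumed to lie in [0, 1], higher meaning riskier.
    """
    if projected_value_ratio < vr_threshold:
        return "reject: projected value ratio below threshold"
    if injury_risk_score > risk_ceiling:
        return "flag: refer to sports medicine for further assessment"
    return "proceed to negotiation"
```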
Outcome
The board approved all three recommendations. Over the following two transfer windows, Riverside's median Value Ratio improved to 1.3, and the club finished 8th --- their highest placement in seven years.
More importantly, the project established the analytics department as a trusted partner in strategic decision-making, rather than a back-office function that produced reports nobody read.
Discussion Questions
- The Value Ratio framework treats all forms of on-pitch contribution as commensurable (reducible to a single number). What are the limitations of this approach, and how might you address them?
- The board presentation deliberately simplified the analytical methodology. At what point does simplification become misrepresentation? How would you handle a board member who asked for "just the bottom line" on a genuinely uncertain finding?
- The injury risk finding was arguably the most actionable insight. Why might previous decision-makers have underweighted injury risk, and what cognitive biases does this reflect?
- How would this framework need to be adapted for a club with a different financial profile (e.g., a club that relies heavily on player development and resale rather than purchasing established talent)?
- The project took six weeks. In a fast-moving transfer window, how would you balance the need for rigorous analysis with time pressure? What would you cut, and what would you refuse to compromise on?
Connection to Chapter Themes
This case study integrates techniques from across the textbook:
- Data engineering (Chapters 2, 4): Pipeline construction, data validation, multi-source integration.
- Feature engineering (Chapters 5, 8, 12): Per-90 normalization, composite indices, domain-informed feature design.
- Statistical modeling (Chapters 3, 14): Regression, uncertainty quantification, benchmarking.
- Visualization and communication (Chapters 6, 7, 19): Executive-level presentation design, audience-appropriate simplification.
- Scouting analytics (Chapter 17): Player valuation, peer comparison, multi-criteria decision analysis.