Chapter 21: Exercises
Section 21.1 — The Modern Scouting Process
Exercise 21.1. Describe the recruitment funnel and explain why data screening is typically the first quantitative stage. What are the advantages of filtering thousands of players down to a few hundred before any human evaluation takes place?
Exercise 21.2. A club's scouting department has three full-time scouts, each capable of watching two live matches per week during the season (approximately 40 weeks). If the club is tracking 200 potential targets across 15 leagues, calculate the maximum percentage of targets each scout can observe live at least once, assuming each observation requires attending one match. Discuss the implications for data-driven pre-screening.
Exercise 21.3. Compare and contrast event data and tracking data as sources for recruitment analysis. For each of the following player attributes, state which data source is more informative and explain why:
(a) Pressing intensity
(b) Passing accuracy
(c) Off-ball movement quality
(d) Dribbling ability
(e) Defensive positioning
Exercise 21.4. A player has played 540 minutes across 8 appearances this season, recording 3 goals and 2 assists. Calculate his goals per 90 and assists per 90. Then discuss whether these per-90 figures are reliable enough to use in a shortlisting exercise. What minimum minutes threshold would you recommend and why?
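As a starting point for the arithmetic in Exercise 21.4, the per-90 conversion can be sketched as follows. The function name `per_90` is illustrative and not taken from the chapter.

```python
def per_90(count: float, minutes: float) -> float:
    """Scale a raw count to a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count / minutes * 90


# Figures from Exercise 21.4: 540 minutes, 3 goals, 2 assists.
goals_per_90 = per_90(3, 540)    # 0.5
assists_per_90 = per_90(2, 540)  # about 0.333
```

The reliability discussion is the harder part of the exercise: 540 minutes is only six full matches, so a single goal moves the goals-per-90 figure by roughly 0.17.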
Exercise 21.5. You are tasked with building a player database for recruitment purposes. List the key tables you would include in the database schema, the primary fields in each table, and the relationships between them.
Section 21.2 — Identifying Player Profiles and Needs
Exercise 21.6. Define a statistical player profile template for a "ball-playing center back" in a team that plays out from the back. Include at least 8 metrics with minimum thresholds and weights that sum to 1.0.
Exercise 21.7. A club plays a 4-2-3-1 formation and is losing its starting number 10 (attacking midfielder) to a transfer. The departing player's key statistics per 90 are: npxG = 0.25, xA = 0.22, progressive passes = 10.5, key passes = 2.8, successful dribbles = 3.1, pressures = 20.0. Using cosine similarity, rank the following three candidates:
| Metric | Candidate A | Candidate B | Candidate C |
|---|---|---|---|
| npxG/90 | 0.22 | 0.30 | 0.18 |
| xA/90 | 0.25 | 0.15 | 0.28 |
| Prog. passes/90 | 9.8 | 8.2 | 11.1 |
| Key passes/90 | 3.0 | 2.1 | 3.5 |
| Dribbles/90 | 2.8 | 4.0 | 1.9 |
| Pressures/90 | 18.5 | 22.0 | 16.0 |
Show your calculations step by step.
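To check a hand calculation for Exercise 21.7, a minimal cosine similarity sketch in plain Python is given below. It works on the raw per-90 values exactly as tabulated; whether to standardize the metrics first is a modeling choice the exercise leaves open, and the variable names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length metric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Order: npxG, xA, progressive passes, key passes, dribbles, pressures.
departing = [0.25, 0.22, 10.5, 2.8, 3.1, 20.0]
candidates = {
    "A": [0.22, 0.25, 9.8, 3.0, 2.8, 18.5],
    "B": [0.30, 0.15, 8.2, 2.1, 4.0, 22.0],
    "C": [0.18, 0.28, 11.1, 3.5, 1.9, 16.0],
}

similarities = {
    name: cosine_similarity(departing, vec) for name, vec in candidates.items()
}
# Candidate names sorted from most to least similar.
ranking = sorted(similarities, key=similarities.get, reverse=True)
```

On raw values, pressures and progressive passes dominate the dot product because of their larger scale, so it is worth comparing this ranking with one computed on standardized metrics.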
Exercise 21.8. Conduct a hypothetical squad depth analysis for a team with the following center-back situation:
- Player A: Age 33, contract expires in 6 months, 2,700 minutes played, declining aerial duel win rate
- Player B: Age 27, contract until 2027, 2,500 minutes played, consistent performance
- Player C: Age 20, on loan, 1,800 minutes in second division
- Player D: Age 29, contract until 2026, 900 minutes (injury-prone)
Calculate a priority score using the formula from Section 21.2.3 with weights $w_1 = 0.3$, $w_2 = 0.25$, $w_3 = 0.25$, $w_4 = 0.2$. Assign appropriate values for each component and justify your choices.
Exercise 21.9. Explain the "replacement fallacy" in recruitment. Provide a concrete example of a situation where searching for a like-for-like replacement would be suboptimal, and describe what alternative approach would be better.
Exercise 21.10. Write a Python function that computes the Mahalanobis distance between two player metric vectors, given a covariance matrix. Test it with sample data for two midfielders.
Section 21.3 — Data-Driven Shortlisting
Exercise 21.11. Given the following per-90 data for 10 wingers, compute the percentile rank for each player on each metric. Then create a composite score using the weights provided.
| Player | npxG | xA | Dribbles | Prog. Carries | Pressures |
|---|---|---|---|---|---|
| W1 | 0.32 | 0.18 | 4.2 | 7.1 | 24.0 |
| W2 | 0.25 | 0.22 | 3.5 | 4.8 | 20.5 |
| W3 | 0.18 | 0.30 | 2.8 | 8.2 | 17.0 |
| W4 | 0.28 | 0.15 | 7.0 | 3.9 | 22.0 |
| W5 | 0.15 | 0.12 | 2.1 | 3.2 | 26.0 |
| W6 | 0.35 | 0.20 | 3.8 | 4.5 | 18.0 |
| W7 | 0.22 | 0.28 | 3.0 | 7.5 | 21.0 |
| W8 | 0.30 | 0.16 | 4.5 | 4.0 | 23.0 |
| W9 | 0.20 | 0.25 | 3.2 | 7.8 | 19.0 |
| W10 | 0.27 | 0.19 | 3.6 | 4.3 | 22.5 |
Weights: npxG = 0.25, xA = 0.20, Dribbles = 0.20, Progressive Carries = 0.15, Pressures = 0.20.
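Several percentile-rank conventions exist, and the exercise does not fix one. The sketch below assumes the "mean" convention (share of players strictly below, plus half the share tied) and shows it for the npxG column only; the same function applies to the other four metrics. The function names are illustrative.

```python
def percentile_rank(value, sample):
    """Percentile rank of value within sample, counting ties as half."""
    below = sum(1 for x in sample if x < value)
    equal = sum(1 for x in sample if x == value)
    return 100.0 * (below + 0.5 * equal) / len(sample)


def composite_score(percentiles, weights):
    """Weighted sum of one player's percentile ranks; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[m] * percentiles[m] for m in weights)


# npxG column from the table in Exercise 21.11.
npxg = {
    "W1": 0.32, "W2": 0.25, "W3": 0.18, "W4": 0.28, "W5": 0.15,
    "W6": 0.35, "W7": 0.22, "W8": 0.30, "W9": 0.20, "W10": 0.27,
}
npxg_values = list(npxg.values())
npxg_percentiles = {
    player: percentile_rank(value, npxg_values) for player, value in npxg.items()
}

weights = {
    "npxG": 0.25, "xA": 0.20, "Dribbles": 0.20,
    "Prog. Carries": 0.15, "Pressures": 0.20,
}
```

Under this convention the best of ten players scores 95 rather than 100, which keeps the scale symmetric around 50.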
Exercise 21.12. Implement a Python function that creates a radar chart for any given player, displaying their percentile rankings across at least 8 metrics. The function should accept the player's name, metric names, and percentile values as arguments.
Exercise 21.13. Explain the difference between z-score-based composite scoring and percentile-based composite scoring. In what situations would one approach be preferred over the other?
Exercise 21.14. A player has only 450 minutes of playing time this season. His observed npxG per 90 is 0.40, while the league average for his position is 0.22. Using Bayesian shrinkage with a prior strength parameter $n_0 = 900$ minutes, calculate his adjusted npxG per 90. Interpret the result.
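One common form of Bayesian shrinkage is a minutes-weighted average of the observed rate and the prior mean. The sketch below assumes that form; check it against the definition given in the chapter before relying on it. The name `shrink_rate` is illustrative.

```python
def shrink_rate(observed, minutes, prior_mean, prior_minutes):
    """Minutes-weighted blend of an observed rate and a prior mean."""
    if minutes < 0 or prior_minutes <= 0:
        raise ValueError("minutes must be >= 0 and prior_minutes > 0")
    return (minutes * observed + prior_minutes * prior_mean) / (
        minutes + prior_minutes
    )


# Exercise 21.14: 450 minutes at 0.40 npxG/90, league mean 0.22, n0 = 900.
adjusted = shrink_rate(0.40, 450, 0.22, 900)  # 0.28
```

With 450 observed minutes against a prior strength of 900, the observed rate carries one third of the weight, so the estimate moves only a third of the way from 0.22 toward 0.40.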
Exercise 21.15. Design a non-linear scoring function for progressive passes per 90 where:
- Below 3.0 per 90 receives zero credit
- Between 3.0 and 9.0 receives linear credit (0 to 1)
- Above 9.0 receives diminishing returns with $\alpha = 0.3$
Plot this function for values from 0 to 12 per 90.
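The exercise does not restate the functional form of the diminishing-returns segment, so the sketch below assumes a logarithmic form, $1 + \alpha \ln(1 + x - 9)$, purely for illustration; other concave forms would satisfy the same description. It evaluates the function on a grid from 0 to 12, which can then be passed to any plotting library.

```python
import math


def progressive_pass_score(x, low=3.0, high=9.0, alpha=0.3):
    """Piecewise score: zero, then linear, then (assumed) logarithmic."""
    if x <= low:
        return 0.0
    if x <= high:
        return (x - low) / (high - low)
    # Assumed diminishing-returns form; continuous with the linear part at x = high.
    return 1.0 + alpha * math.log1p(x - high)


# Grid of values 0.0, 0.5, ..., 12.0 for plotting.
grid = [i * 0.5 for i in range(25)]
scores = [progressive_pass_score(x) for x in grid]
```

Whatever form is chosen, it is worth checking that the function is continuous at 3.0 and 9.0 and never decreases.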
Exercise 21.16. Write a Python script that implements a complete shortlisting pipeline: load player data, filter by position and age, compute z-scores, calculate a weighted composite score, and output the top 20 candidates with their scores and key metrics.
Section 21.4 — Performance Projection Models
Exercise 21.17. Using the quadratic age curve model $y = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2$, with $\beta_0 = -2.5$, $\beta_1 = 0.28$, and $\beta_2 = -0.005$, calculate:
(a) The peak age for this metric
(b) The expected performance at ages 22, 25, 28, and 31
(c) The percentage decline from peak to age 32
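The peak of a downward-opening quadratic lies where the derivative vanishes, at $\text{age}^* = -\beta_1 / (2\beta_2)$. A short sketch for checking the three parts numerically follows; the function names are illustrative.

```python
def age_curve(age, b0=-2.5, b1=0.28, b2=-0.005):
    """Quadratic age curve with the coefficients from Exercise 21.17."""
    return b0 + b1 * age + b2 * age ** 2


def peak_age(b1, b2):
    """Age at which the quadratic is maximized (requires b2 < 0)."""
    if b2 >= 0:
        raise ValueError("b2 must be negative for the curve to have a peak")
    return -b1 / (2 * b2)


peak = peak_age(0.28, -0.005)                                # (a)
values = {age: age_curve(age) for age in (22, 25, 28, 31)}   # (b)
peak_value = age_curve(peak)
decline_pct = 100 * (peak_value - age_curve(32)) / peak_value  # (c)
```

Because the curve is symmetric about its peak, ages equally far from the peak on either side give the same expected value, which is a quick sanity check on part (b).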
Exercise 21.18. A 23-year-old midfielder recorded npxG per 90 of 0.22 this season, 0.18 last season, and 0.15 two seasons ago. Using the MARCEL projection formula with reliability coefficient $r = 0.55$ and league mean $\mu = 0.20$, and assuming an age-related improvement of +0.01 per year at this age, project his npxG per 90 for next season.
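The sketch below shows the general shape of a MARCEL-style projection: a recency-weighted average, regression toward the league mean, then an age adjustment. The 5/4/3 season weights are an assumption borrowed from the common MARCEL convention and are not stated in the exercise; substitute the chapter's weights if they differ.

```python
def marcel_projection(rates, reliability, league_mean, age_adjustment,
                      season_weights=(5, 4, 3)):
    """MARCEL-style projection.

    rates: per-90 rates ordered most recent season first.
    season_weights: ASSUMED 5/4/3 recency weights (not from the exercise).
    """
    if len(rates) != len(season_weights):
        raise ValueError("need one weight per season")
    weighted = sum(w * r for w, r in zip(season_weights, rates)) / sum(season_weights)
    # Regress the weighted average toward the league mean.
    regressed = reliability * weighted + (1 - reliability) * league_mean
    return regressed + age_adjustment


# Exercise 21.18: this season 0.22, last season 0.18, two seasons ago 0.15.
projection = marcel_projection(
    [0.22, 0.18, 0.15], reliability=0.55, league_mean=0.20, age_adjustment=0.01
)
```

The result is sensitive to the season weights, so state them explicitly in any written answer.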
Exercise 21.19. Explain why survivorship bias is a concern when estimating age curves, and describe the delta method approach to mitigating this bias.
Exercise 21.20. Build a simple gradient boosting model in Python that projects next-season npxG per 90 using current-season npxG, previous-season npxG, age, minutes played, and league level as features. Use synthetic data to demonstrate the model training and evaluation process.
Exercise 21.21. For a 21-year-old forward projected to score 0.28 npxG/90 next season with a prediction standard error of 0.08, calculate the 80% prediction interval. Then discuss how a club with a "buy and develop" strategy might interpret this interval differently from a club seeking immediate impact.
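Assuming the projection error is approximately normal, a central prediction interval is the point projection plus or minus a normal quantile times the standard error. The sketch below uses only the Python standard library; the function name is illustrative.

```python
from statistics import NormalDist


def prediction_interval(mean, std_error, level=0.80):
    """Central prediction interval under a normal error assumption."""
    if not 0 < level < 1:
        raise ValueError("level must be between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.2816 for 80%
    return mean - z * std_error, mean + z * std_error


# Exercise 21.21: projection 0.28 npxG/90, standard error 0.08.
low, high = prediction_interval(0.28, 0.08)
```

For a young player, the normality assumption is itself worth questioning: development outcomes are often skewed, with a long upside tail.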
Exercise 21.22. Compare the advantages and disadvantages of parametric (quadratic) age curves versus non-parametric (delta method) age curves. Under what circumstances would each approach be preferred?
Section 21.5 — League and Style Adjustments
Exercise 21.23. A striker in the Eredivisie scores 0.45 npxG per 90. Using the league difficulty table from Section 21.5.5 (Eredivisie scoring factor = 0.78, Premier League = 1.00), calculate his league-adjusted npxG per 90 for the Premier League. Discuss the limitations of this simple ratio approach.
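Under the simple ratio approach, the source-league rate is multiplied by the ratio of the source factor to the target factor. The sketch below assumes that reading of the Section 21.5.5 table; the function name is illustrative.

```python
def league_adjust(rate, source_factor, target_factor):
    """Scale a per-90 rate by the ratio of league scoring factors."""
    if target_factor <= 0:
        raise ValueError("target_factor must be positive")
    return rate * source_factor / target_factor


# Exercise 21.23: Eredivisie factor 0.78, Premier League factor 1.00.
adjusted_npxg = league_adjust(0.45, 0.78, 1.00)  # 0.351
```

A single multiplicative factor treats every player and every metric identically, which is one of the limitations the exercise asks you to discuss.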
Exercise 21.24. You have data on 30 strikers who transferred from Ligue 1 to the Premier League over the past 5 years. Their average npxG per 90 in Ligue 1 was 0.35, and their average npxG per 90 in the Premier League (after one season of adaptation) was 0.28. Calculate the transfer-based calibration factor. Then discuss what confounding factors might bias this estimate.
Exercise 21.25. A midfielder plays for a team that averages 62% possession. His progressive passes per 90 is 11.2. If the average midfielder on a team with 62% possession makes 10.0 progressive passes per 90, and the average midfielder on a team with 50% possession makes 8.5 progressive passes per 90, estimate his possession-adjusted progressive passes per 90 for a team with 50% possession. State your assumptions.
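The answer to Exercise 21.25 depends on whether the player's edge over his possession baseline is assumed to carry over proportionally or as a fixed difference. The sketch below computes both, to make the effect of the assumption visible; neither is presented as the chapter's method, and the function names are illustrative.

```python
def adjust_ratio(player_rate, baseline_current, baseline_target):
    """Assume the player keeps the same ratio to the positional baseline."""
    return player_rate * baseline_target / baseline_current


def adjust_additive(player_rate, baseline_current, baseline_target):
    """Assume the player keeps the same absolute margin over the baseline."""
    return player_rate - baseline_current + baseline_target


# Exercise 21.25: 11.2 observed; baselines 10.0 (62% possession) and 8.5 (50%).
ratio_estimate = adjust_ratio(11.2, 10.0, 8.5)        # 9.52
additive_estimate = adjust_additive(11.2, 10.0, 8.5)  # 9.7
```

The two estimates differ by under 0.2 passes per 90 here, but the gap widens for players far from the baseline.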
Exercise 21.26. Write a Python function that takes a player's raw statistics, their current league, and a target league, and returns league-adjusted statistics using a provided difficulty matrix. Include at least 6 statistical categories.
Exercise 21.27. Describe the hierarchical model approach to league adjustment. What are the advantages of this approach over simpler methods? What data requirements does it impose?
Section 21.6 — Red Flags and Risk Assessment
Exercise 21.28. A striker scored 18 goals from an xG of 14.5 last season (44 shots on target from 95 total shots). Calculate his goals minus xG, and determine whether this level of overperformance is statistically unusual. Assume the standard deviation of goals minus xG for a player with 95 shots is approximately 3.2 goals.
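One way to judge whether the overperformance is unusual is a z-score against the stated standard deviation, with a two-sided p-value under a normal approximation. The sketch below assumes that approach; the function name is illustrative.

```python
from statistics import NormalDist


def overperformance_z(goals, xg, std_dev):
    """Standardized goals-minus-xG."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (goals - xg) / std_dev


# Exercise 21.28: 18 goals from 14.5 xG, standard deviation 3.2 goals.
z = overperformance_z(18, 14.5, 3.2)
# Two-sided p-value under a normal approximation.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
```

A z-score a little above 1 is well inside the range expected from finishing variance alone, so this level of overperformance is not statistically unusual on its own.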
Exercise 21.29. Build a composite risk score for the following transfer target:
- 26 years old
- 45 days lost to injury in the last 2 seasons
- Goals minus xG of +7.2 over the last 2 seasons
- 2,800 minutes played this season
- 0.18 cards per 90
- Moving from Ligue 1 to the Premier League (adaptation factor: 0.6)
Use the risk scoring function from Section 21.6.3. Interpret each component and the overall score.
Exercise 21.30. Create a "red flag checklist" evaluation for a 22-year-old winger from a mid-table Bundesliga team who has the following profile:
- Breakout season with 8 goals and 6 assists (previous season: 2 goals, 3 assists)
- npxG per 90 of 0.28 with G-xG of +3.5
- Two hamstring injuries in the past 18 months
- 1,900 minutes played
- Low pressing numbers (12th percentile)
- Strong agent pushing for a move
Identify all red flags, categorize them, and recommend what additional investigation is needed for each.
Exercise 21.31. Write a Python function that, given a player's injury history (as a list of injuries with type, duration, and date), calculates an injury risk score based on frequency, severity, recurrence patterns, and recency.
Exercise 21.32. A club is considering two transfer targets for the same position:
- Player X: Age 24, projected 0.30 npxG/90 (80% CI: 0.22-0.38), risk score 3.2/10, transfer fee 25M EUR
- Player Y: Age 28, projected 0.35 npxG/90 (80% CI: 0.30-0.40), risk score 7.5/10, transfer fee 18M EUR
Using expected value analysis, which player represents better value? Consider a 4-year contract horizon and apply age curve adjustments. State all assumptions.
Section 21.7 — Integrating Data with Traditional Scouting
Exercise 21.33. Design a structured scouting report template that integrates both quantitative and qualitative assessments. Your template should include sections for statistical profile, scout assessment (with defined rating scales), key strengths, areas of concern, fit assessment, risk evaluation, and a final recommendation. Justify each section's inclusion.
Exercise 21.34. A data analyst's shortlist ranks a 24-year-old central midfielder as the #1 candidate based on his statistical profile (90th+ percentile in progressive passing, chance creation, and ball recoveries). However, the club's chief scout, who watched the player three times, rates him as "average" and flags concerns about his decision-making speed and tendency to take unnecessary risks in possession. Describe how you would facilitate a productive discussion between the analyst and the scout, and what additional information you would seek to resolve the disagreement.
Exercise 21.35. Write a Python script that generates an automated scouting report for a given player, including:
- A radar chart of percentile rankings
- League-adjusted key metrics
- Risk score breakdown
- A text summary highlighting strengths and concerns
The report should be written to a formatted text file or displayed in the console.
Exercise 21.36. A club has made 20 signings over the past 4 transfer windows. Of those, 12 were primarily data-driven recommendations and 8 were primarily scout-driven recommendations. The data-driven signings had a 58% success rate (7/12 deemed successful after 2 seasons), while the scout-driven signings had a 50% success rate (4/8). Is the difference statistically significant? Perform an appropriate hypothesis test and interpret the result. Discuss what other factors might explain the difference.
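With only 20 signings, an exact test is a safer choice than a chi-square approximation. The sketch below implements a two-sided Fisher's exact test with the standard library, summing the hypergeometric probabilities of all tables no more likely than the observed one. It is one reasonable choice of test, not the only one, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k successes in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


# Rows: data-driven (7 successes, 5 failures), scout-driven (4 successes, 4 failures).
p_value = fisher_exact_two_sided(7, 5, 4, 4)
```

Here the observed table is the single most probable table given the margins, so the two-sided p-value is 1.0 up to floating-point rounding: the data give no evidence of a difference in success rates.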
Exercise 21.37. Explain the concept of "confirmation bias" in scouting and how it can affect both data-driven and traditional scouting approaches. Propose three concrete strategies to mitigate confirmation bias in a recruitment department.
Exercise 21.38. Design a post-transfer review framework that a club could use to evaluate its recruitment decisions after 1 and 2 seasons. What metrics would you track? How would you account for external factors (injuries, managerial changes, tactical shifts) that might affect a player's performance independent of recruitment quality?