Appendix G: Answers to Selected Exercises

This appendix provides brief solutions to odd-numbered exercises from each chapter. These are intended as reference answers after you have attempted the problems yourself. Full worked solutions with code are available in the exercise-solutions.py file within each chapter's code/ directory.

Chapter 1: Introduction to Soccer Analytics

Exercise 1. The three primary stakeholder groups for a match analysis report are: (a) coaching staff, who need tactical insights in visual format; (b) the sporting director, who needs strategic summary with performance trends; (c) players, who need brief, personalized feedback focused on their role.

Exercise 3. The key difference between descriptive and predictive analytics: descriptive analytics summarizes what happened (e.g., "Team A completed 85% of their passes"), while predictive analytics estimates what will happen (e.g., "This shot has a 0.15 probability of being a goal"). Prescriptive analytics recommends what to do (e.g., "Press higher in the first 15 minutes based on opponent tendencies").

Exercise 5. Three reasons analytics cannot replace scouting judgment: (a) data cannot capture intangibles like leadership and dressing room influence; (b) tactical context determines whether a metric reflects skill or system; (c) small sample sizes in individual player data create high uncertainty.

Chapter 2: Data Sources and Collection

Exercise 1. Event data records every on-ball action with coordinates and outcomes; tracking data records all player and ball positions at high frequency (typically 25 Hz). Event data is available for most professional leagues; tracking data is typically proprietary and limited to the data-collecting club's own matches.

Exercise 3. Three data quality checks: (a) coordinate range validation (x in [0, 120], y in [0, 80]); (b) temporal ordering (events within a possession should have increasing timestamps); (c) cross-reference check (goals recorded in event data should match the official match score).

Exercise 5. The StatsBomb 360 freeze-frame data bridges the gap between event and tracking data by providing the positions of all visible players at the moment of key events. This enables context-aware analysis (e.g., number of defenders between the ball and the goal) without full tracking data.

Chapter 3: Statistical Foundations

Exercise 1. For a dataset of 38 match possession values: mean = 52.3%, median = 51.8%, standard deviation = 8.1%. The slight right skew (mean > median) is typical, as occasional high-possession matches pull the mean upward.

Exercise 3. Using a Poisson model with lambda = 1.4 goals per match: P(0 goals) = 0.247, P(1 goal) = 0.345, P(2 goals) = 0.242, P(3+ goals) = 0.166. Since P(2+ goals) = 0.242 + 0.166 = 0.408, the expected number of matches with 2+ goals in a 38-match season is approximately 38 * 0.408 = 15.5.
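
The Poisson figures can be checked with standard-library Python. This is a minimal sketch; the helper name poisson_pmf is illustrative and not taken from the book's code.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) model."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 1.4  # average goals per match, as in the exercise
p0, p1, p2 = (poisson_pmf(k, lam) for k in range(3))
p3_plus = 1.0 - (p0 + p1 + p2)   # complement rule for 3 or more goals
p2_plus = 1.0 - (p0 + p1)        # probability of at least 2 goals
expected_matches = 38 * p2_plus  # expected count over a 38-match season

print(f"P(0)={p0:.3f} P(1)={p1:.3f} P(2)={p2:.3f} P(3+)={p3_plus:.3f}")
print(f"Expected matches with 2+ goals: {expected_matches:.1f}")
```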

Exercise 5. A 95% confidence interval for a pass completion rate of 82% from n=500 passes: SE = sqrt(0.82 * 0.18 / 500) = 0.0172. CI = [0.786, 0.854]. The interval is narrow because the sample size is large.
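
The interval follows the normal-approximation (Wald) formula, which can be sketched as below. The function name wald_interval and the default z = 1.96 are assumptions made for illustration.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Return (standard error, lower bound, upper bound) for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return se, p_hat - z * se, p_hat + z * se

se, lower, upper = wald_interval(0.82, 500)
print(f"SE={se:.4f}, CI=[{lower:.3f}, {upper:.3f}]")
```

Because the standard error shrinks with the square root of n, quadrupling the number of passes only halves the interval width.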

Chapter 4: Python Programming

Exercise 1. The pandas groupby-aggregate pattern: df.groupby('player_id').agg({'goals': 'sum', 'minutes': 'sum'}) followed by per-90 normalization: result['goals_p90'] = result['goals'] / result['minutes'] * 90.

Exercise 3. Key differences between NumPy and pandas for soccer data: NumPy is optimized for homogeneous numerical arrays (tracking data matrices); pandas handles heterogeneous tabular data with labels (event data with mixed types). Use NumPy for vectorized mathematical operations; pandas for data manipulation and merging.

Exercise 5. A matplotlib pitch visualization requires: setting equal aspect ratio, drawing pitch lines with plt.plot(), using plt.scatter() or plt.hexbin() for event locations, and inverting the y-axis if the coordinate system origin is at the top-left.

Chapter 5: Introduction to Soccer Metrics

Exercise 1. Per-90 normalization: goals_p90 = (total_goals / minutes_played) * 90. For a player with 12 goals in 2,400 minutes: goals_p90 = 0.45. A minimum minutes threshold (e.g., 900) prevents unstable rates from small samples.
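
A small helper makes the threshold explicit. This is a sketch only: the name per_90 and the choice to return None below the threshold are assumptions, not the book's implementation.

```python
def per_90(total, minutes, min_minutes=900):
    """Per-90 rate, or None when the player is below the minutes threshold."""
    if minutes < min_minutes:
        return None  # too few minutes for a stable rate
    return total / minutes * 90

goals_p90 = per_90(12, 2400)    # the player from the exercise
small_sample = per_90(2, 300)   # hypothetical player below the 900-minute threshold
print(goals_p90, small_sample)
```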

Exercise 3. Shot conversion rate vs. xG: conversion rate = goals / shots = simple ratio. xG accounts for shot quality (location, angle, body part, context). A player with high conversion but low xG is either an elite finisher or benefiting from variance; sustained overperformance of xG by more than 3-5% suggests genuine finishing skill.

Exercise 5. Progressive actions definition: a progressive pass advances the ball at least 10m toward the opponent's goal line. A progressive carry does the same while dribbling. These metrics capture ball progression better than simple pass completion rates.

Chapter 6: The Soccer Pitch as a Coordinate System

Exercise 1. Distance to goal center from (95, 30): d = sqrt((95-120)^2 + (30-40)^2) = sqrt(625 + 100) = 26.93 yards. Angle to goal: theta = arctan(7.32*(120-95) / ((120-95)^2 + (30-40)^2 - 3.66^2)) = arctan(183/711.60) = 14.42 degrees.
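
The same geometry can be sketched in Python. The function name shot_geometry and its defaults (goal center at (120, 40), goal width 7.32 as used in the formula) are assumptions for illustration; goal_width is a parameter so that it can be supplied in the same unit as the pitch coordinates if preferred.

```python
import math

def shot_geometry(x, y, goal_x=120.0, goal_y=40.0, goal_width=7.32):
    """Return (distance to goal center, angle subtended by the goal in degrees)."""
    dx = goal_x - x
    dy = goal_y - y
    distance = math.hypot(dx, dy)
    half = goal_width / 2.0
    # atan2 keeps the angle positive even when the denominator turns negative,
    # which happens for shots taken from very close to the goal line
    angle = math.atan2(goal_width * dx, dx**2 + dy**2 - half**2)
    return distance, math.degrees(angle)

distance, angle_deg = shot_geometry(95, 30)
print(f"distance={distance:.2f}, angle={angle_deg:.2f} degrees")
```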

Exercise 3. To convert from a 100x100 coordinate system to 120x80: x_new = x_old * 1.2, y_new = y_old * 0.8. Always verify the conversion by checking known reference points (center spot should map to (60, 40), penalty spot to (108, 40)).

Exercise 5. A kernel density estimation (KDE) heatmap of shot locations would show high density inside the penalty area, concentrated centrally, with secondary peaks at the penalty spot and near the six-yard box. Headers from crosses create a lateral spread pattern.

Chapter 7: Expected Goals (xG)

Exercise 1. The top 5 features by coefficient magnitude in a logistic regression xG model are typically: (1) angle to goal, (2) distance to goal, (3) is_header indicator, (4) in_box indicator, (5) angle-distance interaction. The interaction term captures the non-linear relationship between geometry and shot quality.

Exercise 3. A team that finishes a season 8 points above its xG-based expectation sits at approximately the 95th percentile of the historical distribution. Regression toward the mean would predict approximately 4-5 fewer points the following season, though some overperformance may reflect genuine finishing or goalkeeping quality.

Exercise 5. Three principles for xG visualizations for coaching staff: (a) minimize text and maximize visual elements (pitch maps, arrows); (b) use a consistent color scheme (team colors vs. opponent); (c) focus on 2-3 key messages per visualization rather than showing all data. A shot map should encode: position (x, y coordinates), outcome (color: goal = one color, saved = another, off-target = a third), and quality (marker size proportional to xG value).

Chapter 8: Expected Assists (xA)

Exercise 1. Expected assists (xA) assign a probability to each pass that it will lead to a goal, based on the quality of the resulting shot opportunity. A through ball into the box that creates a 0.35 xG chance would carry an xA of approximately 0.35. The key difference from traditional assists is that xA credits the pass based on chance quality rather than binary outcome.

Exercise 3. Key pass identification: a key pass is any pass that directly leads to a shot attempt. To compute xA, fit a model predicting goal probability from the shot characteristics that result from each pass. The passer receives xA equal to the xG of the resulting shot. A player with high xA but low actual assists is creating quality chances that are not being converted.

Chapter 9: Expected Threat (xT)

Exercise 1. Expected Threat (xT) divides the pitch into a grid (typically 12x8 or 16x12) and assigns each cell a goal probability based on historical data. An action's value = xT(end position) - xT(start position). A pass from (50, 40) with xT=0.01 to (90, 35) with xT=0.05 has a value of +0.04.
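
The lookup-and-subtract step can be sketched as follows. The 120x80 pitch dimensions are assumed, and the xT surface is hypothetical: only the two cells used in the exercise are filled in, whereas a real grid would be estimated from historical data.

```python
N_COLS, N_ROWS = 12, 8          # grid size mentioned in the exercise
PITCH_X, PITCH_Y = 120.0, 80.0  # assumed pitch dimensions

def cell(x, y):
    """Map pitch coordinates to a (column, row) grid cell."""
    col = min(int(x * N_COLS / PITCH_X), N_COLS - 1)
    row = min(int(y * N_ROWS / PITCH_Y), N_ROWS - 1)
    return col, row

# Hypothetical xT surface containing only the two cells from the exercise
xt_grid = {cell(50, 40): 0.01, cell(90, 35): 0.05}

def action_value(start, end):
    """xT at the end location minus xT at the start location."""
    return xt_grid[cell(*end)] - xt_grid[cell(*start)]

value = action_value((50, 40), (90, 35))
print(f"{value:+.2f}")
```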

Exercise 3. VAEP vs xT: VAEP considers action type and context (not just position), models both scoring and conceding probability, and accounts for multi-action sequences. xT is simpler, more interpretable, and does not require modeling the defensive side.

Chapter 10: Passing Networks

Exercise 1. To build a passing network: create a directed graph where each player is a node, each pass creates a weighted edge (weight = pass count). Filter edges below a threshold (e.g., 3 passes). The player with highest betweenness centrality is most critical to ball circulation.

Exercise 3. Network density = actual edges / possible edges. A team with 10 outfield players has at most 90 directed edges. If 45 edges exceed the threshold, density = 0.50. Higher density indicates more distributed passing; lower density indicates reliance on specific passing combinations.
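
A minimal sketch of the density calculation, assuming passes arrive as (passer, receiver) pairs. The function name network_density and the three-player example are hypothetical.

```python
from collections import Counter

def network_density(passes, n_players, min_passes=3):
    """Share of possible directed edges whose pass count reaches the threshold."""
    counts = Counter(passes)  # (passer, receiver) -> number of passes
    kept = sum(1 for count in counts.values() if count >= min_passes)
    possible = n_players * (n_players - 1)  # directed edges, no self-loops
    return kept / possible

# Hypothetical example: A->B and B->C reach the threshold, B->A does not
passes = [("A", "B")] * 3 + [("B", "A")] * 2 + [("B", "C")] * 4
density = network_density(passes, n_players=3)
print(f"{density:.2f}")
```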

Chapter 11: Possession and Control

Exercise 1. Possession percentage can be computed as the share of total passes, the share of total time on the ball, or the share of completed sequences. These methods can yield different values for the same match. Pass-based possession is simplest; time-based possession from tracking data is most accurate but requires higher-quality data.

Exercise 3. Field tilt measures the proportion of a team's touches or passes occurring in the opponent's third. A team with 55% possession but only 35% field tilt is dominating the ball in non-threatening areas, suggesting possession without penetration.

Chapter 12: Defensive Metrics

Exercise 1. PPDA computation: count opponent passes in their own half (e.g., 180) and our defensive actions in the opponent's half (e.g., 45). PPDA = 180/45 = 4.0. A PPDA below 8 indicates aggressive pressing; above 12 indicates a passive approach.

Exercise 3. Tackles, interceptions, and clearances each measure different defensive contributions. Tackles indicate direct ball-winning in duels, interceptions show anticipation and reading of the game, and clearances reflect last-ditch defending. A defender with high interceptions but low tackles is typically positioned well and reads play proactively.

Chapter 13: Goalkeeper Analysis

Exercise 1. Post-shot xG minus goals allowed (PSxG-GA) measures goalkeeper shot-stopping quality. A positive value means the goalkeeper is saving more than expected. For a goalkeeper facing 120 shots on target with total PSxG of 42.0 who concedes 38 goals: PSxG-GA = 42.0 - 38 = +4.0, indicating above-average shot-stopping.

Exercise 3. Goalkeeper distribution accuracy should be evaluated by pass type: short distribution (to defenders within 20 yards), medium distribution (to midfielders), and long distribution (goal kicks, long launches). Each type has different baseline success rates, and a goalkeeper excelling at long distribution may still be poor at short passing under pressure.

Chapter 14: Set Piece Analytics

Exercise 1. Corner kick analysis should consider delivery type (inswinging, outswinging, short), target zone (near post, far post, penalty spot), and first contact outcome (shot, headed clearance, flick-on). Approximately 3-4% of corners result directly in goals, but this varies significantly by delivery quality.

Exercise 3. Man-marking vs. zonal marking at set pieces: zonal marking assigns defenders to areas, reducing vulnerability to decoy runs but requiring strong aerial ability in each zone. Man-marking tracks specific opponents but can be disrupted by crossing runs and screens. Analytics can evaluate which system yields fewer conceded goals against specific opponent set piece patterns.

Chapter 15: Player Performance Metrics

Exercise 1. Multi-criteria scoring: normalize each metric to [0, 1] using min-max scaling within the position group. Apply weights (e.g., npxG/90: 0.25, pressing: 0.15, aerial: 0.10). Composite score = weighted sum. A threshold of 0.6 (out of 1.0) provides a reasonable candidate pool.
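
The scoring steps can be sketched as below. All player values are made up for illustration, and dividing by the total weight is an added assumption so that the score stays in [0, 1] when the listed weights do not sum to 1.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.0 for everyone."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_scores(players, weights):
    """players: {name: {metric: value}}; weights: {metric: weight}."""
    names = list(players)
    scaled = {m: min_max([players[n][m] for n in names]) for m in weights}
    total_weight = sum(weights.values())
    return {
        n: sum(weights[m] * scaled[m][i] for m in weights) / total_weight
        for i, n in enumerate(names)
    }

# Hypothetical position group with made-up values
players = {
    "Player A": {"npxg_p90": 0.50, "pressing": 20.0, "aerial": 3.0},
    "Player B": {"npxg_p90": 0.30, "pressing": 26.0, "aerial": 1.0},
    "Player C": {"npxg_p90": 0.10, "pressing": 14.0, "aerial": 2.0},
}
weights = {"npxg_p90": 0.25, "pressing": 0.15, "aerial": 0.10}
scores = composite_scores(players, weights)
shortlist = [name for name, s in scores.items() if s >= 0.6]
print(shortlist)
```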

Exercise 3. Percentile rank computation: scipy.stats.percentileofscore(peer_values, player_value). A 75th percentile in progressive passes means the player is better than 75% of positional peers. Percentiles above 90 or below 10 should be highlighted as exceptional strengths or concerns.

Chapter 16: Team Performance Analysis

Exercise 1. A team's expected points (xPts) can diverge from actual points due to finishing quality, goalkeeping performance, and variance in close match outcomes. A team with 60 points but only 52 xPts is likely overperforming; regression toward the mean would predict fewer points in subsequent seasons unless the overperformance is driven by a repeatable skill.

Exercise 3. To compare team styles quantitatively: compute a feature vector for each team (possession %, PPDA, field tilt, directness index, average build-up speed) and use clustering or PCA to group similar teams. This reveals that some teams with similar points totals achieve them through very different tactical approaches.

Chapter 17: Spatial Analysis and Pitch Control

Exercise 1. A Voronoi tessellation assigns each point on the pitch to the nearest player. The dominant region of a player is their Voronoi cell area. Pitch control extends this by incorporating player velocity and direction, modeling the time to reach each point rather than just distance.

Exercise 3. Off-ball run value = change in pitch control caused by a player's movement. A run that creates space for a teammate (increasing their pitch control area) has positive value even if the running player never receives the ball.

Chapter 18: Tracking Data Analytics

Exercise 1. Speed = distance / time between consecutive frames. At 25 fps, if a player moves from (50.0, 30.0) to (50.2, 30.1): distance = sqrt(0.04 + 0.01) = 0.224m, time = 0.04s, speed = 0.224 / 0.04 = 5.59 m/s.
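
The frame-to-frame calculation can be sketched as follows, assuming positions are in meters. The function name speed_mps is illustrative.

```python
import math

def speed_mps(p0, p1, fps=25):
    """Speed in m/s between two consecutive frames sampled at fps frames per second."""
    distance = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    return distance * fps  # dividing by the frame interval 1/fps

speed = speed_mps((50.0, 30.0), (50.2, 30.1))
print(f"{speed:.2f} m/s")
```

In practice, raw frame-to-frame speeds are noisy, so a smoothing step over several frames is commonly applied before thresholds such as sprint detection are used.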

Exercise 3. Team compactness can be measured as (a) the area of the convex hull of player positions, (b) the standard deviation of x-coordinates (length compactness) and y-coordinates (width compactness), or (c) the average pairwise distance between teammates.

Chapter 19: Machine Learning for Soccer

Exercise 1. Gradient boosting vs. logistic regression for xG: gradient boosting typically achieves 0.01-0.02 lower log-loss but may require post-hoc calibration. Logistic regression is usually well calibrated without post-processing and is more interpretable. Recommendation: use gradient boosting with isotonic calibration for production; logistic regression for communication with non-technical stakeholders.

Exercise 3. Hyperparameter tuning for gradient boosting xG: key parameters are n_estimators (100-500), max_depth (3-6), learning_rate (0.01-0.1), and min_samples_leaf (10-50). Use randomized search with 5-fold stratified CV, optimizing for log-loss.

Exercise 5. Prior: Beta(3, 7) representing a 30% base conversion rate. After observing 5 goals from 12 shots: posterior = Beta(3+5, 7+7) = Beta(8, 14). Posterior mean = 8/22 = 0.364. The Bayesian estimate shrinks the raw rate (5/12 = 0.417) toward the prior, demonstrating how Bayesian methods handle small samples in soccer analytics.
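
The conjugate update is short enough to write out. A minimal sketch, with beta_update as an illustrative name:

```python
def beta_update(alpha, beta, goals, shots):
    """Beta-binomial update: goals add to alpha, misses add to beta."""
    return alpha + goals, beta + (shots - goals)

alpha_post, beta_post = beta_update(3, 7, goals=5, shots=12)
posterior_mean = alpha_post / (alpha_post + beta_post)
raw_rate = 5 / 12
print(alpha_post, beta_post, round(posterior_mean, 3), round(raw_rate, 3))
```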

Chapter 20: Predictive Modeling

Exercise 1. Monte Carlo season simulation: for each of 10,000 simulations, simulate 38 matches using Poisson(xG) for goals scored and Poisson(xGA) for goals conceded. Compute points per simulation. The 5th-95th percentile range is typically +/- 12-15 points from the mean.
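
A standard-library sketch of the simulation loop follows. It simplifies the exercise by using one constant scoring rate and one constant conceding rate for every match (both hypothetical), and it runs 2,000 seasons rather than 10,000 to keep the demo fast.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_season(xg, xga, rng, n_matches=38):
    """Total points from one simulated season."""
    points = 0
    for _ in range(n_matches):
        scored = sample_poisson(xg, rng)
        conceded = sample_poisson(xga, rng)
        if scored > conceded:
            points += 3
        elif scored == conceded:
            points += 1
    return points

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical per-match rates: 1.5 xG for, 1.2 xG against
totals = sorted(simulate_season(1.5, 1.2, rng) for _ in range(2000))
mean_points = sum(totals) / len(totals)
p5 = totals[int(0.05 * len(totals))]
p95 = totals[int(0.95 * len(totals))]
print(f"mean={mean_points:.1f}, 5th-95th percentile: {p5}-{p95}")
```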

Exercise 3. Tournament bracket simulation: simulate each knockout match using Poisson model with extra time and penalties. Run 100,000 brackets. Teams with favorable draws and high xG differential will have the highest win probabilities, but variance is high due to the single-elimination format.

Exercise 5. Rolling 5-match average of xG: captures short-term form changes while smoothing individual match variance. A 10-match window is better for identifying sustained tactical shifts; a 3-match window is too noisy for reliable conclusions.

Chapter 21: Player Recruitment and Scouting

Exercise 1. Player similarity with PCA: reduce the feature space from 15+ metrics to 2-3 principal components, then compute Euclidean distance in the reduced space. The first 3 PCs typically explain 65-80% of variance, capturing the main dimensions of playing style.

Exercise 3. Cosine similarity finds the most stylistically similar players regardless of absolute performance level. Euclidean distance is sensitive to overall performance level. For replacement scouting (finding a similar style), cosine similarity is preferred. For upgrade scouting (finding strictly better players), Euclidean distance or percentile-based ranking is more appropriate.
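
The contrast is easiest to see with two profiles that have the same shape but different volume. The vectors below are hypothetical per-90 profiles chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two metric vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same mix of actions, but player_b produces twice the volume
player_a = [1.0, 2.0, 3.0]
player_b = [2.0, 4.0, 6.0]
cos_ab = cosine_similarity(player_a, player_b)
dist_ab = euclidean_distance(player_a, player_b)
print(f"cosine={cos_ab:.3f}, euclidean={dist_ab:.3f}")
```

Cosine similarity treats the two players as stylistically identical, while Euclidean distance separates them by output level.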

Exercise 5. A quadratic age curve: y(age) = beta_0 + beta_1 * age + beta_2 * age^2. For outfield xG contribution, typical peak age is -beta_1 / (2*beta_2), usually around 27-28. Goalkeepers peak later (29-31). Defenders' age curves are flatter than forwards'. Understanding age curves is critical for evaluating whether a transfer target's best years are ahead or behind them.
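
The peak follows from setting the derivative beta_1 + 2 * beta_2 * age to zero. A minimal sketch, with hypothetical coefficients chosen so that the peak lands at 27.5:

```python
def peak_age(beta_1, beta_2):
    """Vertex of y = beta_0 + beta_1 * age + beta_2 * age^2."""
    if beta_2 >= 0:
        raise ValueError("beta_2 must be negative for the curve to have a peak")
    return -beta_1 / (2 * beta_2)

# Hypothetical fitted coefficients, for illustration only
peak = peak_age(beta_1=0.55, beta_2=-0.01)
print(f"{peak:.1f}")
```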

Chapter 22: Match Strategy and Tactics

Exercise 1. Formation detection from average positions: cluster the x-coordinates (depth along the pitch) of the 10 outfield players into 3-5 groups. The number of players in each group, sorted by depth, gives the formation string (e.g., "4-3-3"). K-means with silhouette score optimization selects the best number of lines.

Exercise 3. Pressing trap analysis: identify the pitch zones where the opponent most frequently loses possession under pressure. Design pressing triggers (e.g., backward pass to the center-back, switch of play to the weak-footed full-back) that direct the ball into these zones. Effectiveness is measured by the turnover rate and average field position of recoveries.

Chapter 23: Video Analysis and Computer Vision

Exercise 1. Object detection for player tracking requires identifying and localizing each player in video frames. A YOLO-based detector can process frames in real time. Key challenges include occlusion (players overlapping), distinguishing players from referees using jersey color classification, and maintaining identity across frames via tracking algorithms such as DeepSORT.

Exercise 3. Broadcast-derived tracking data differs from stadium-based systems in several ways: lower spatial resolution, partial pitch coverage (only the visible portion of the broadcast frame), and variable frame rates. Despite these limitations, broadcast tracking enables analysis for leagues where stadium camera systems are not installed.

Chapter 24: Deep Learning in Soccer Analytics

Exercise 1. A convolutional neural network (CNN) for action recognition in soccer video would use a sequence of frames as input, apply spatial convolutions to extract visual features, and use temporal pooling or recurrent layers to capture motion. Transfer learning from a pre-trained network (e.g., ResNet) reduces the amount of labeled soccer video required.

Exercise 3. Graph neural networks (GNNs) for passing networks: represent each player as a node with feature attributes (position, speed, recent actions), and passes as edges. The GNN learns representations that capture not only individual player features but also the relational structure of team play, enabling predictions about pass success probability and tactical patterns.

Chapter 25: Economic Analysis and Player Valuation

Exercise 1. Network community detection in passing data can identify sub-groups of players who pass among themselves more than to others, revealing tactical units (e.g., the left-side build-up triangle). This information aids valuation by quantifying how deeply integrated a player is within the team's tactical system, and thus the disruption cost of their departure.

Exercise 3. A player valuation model should incorporate: current performance metrics (xG, xA, defensive actions per 90), age and contract length, market context (selling club's league, buying club's league), injury history, and positional scarcity. The residual between the model's predicted value and the actual transfer fee identifies potentially over- or under-valued transfers.

Exercise 5. Value ratio = (on-pitch contribution in monetary terms) / (annual cost including amortized transfer fee and wages). A value ratio above 1.0 indicates the player provides more value than they cost. Clubs with consistently high average value ratios across their squad demonstrate efficient recruitment.

Chapter 26: Injury Prevention and Load Management

Exercise 1. ACWR using a 7-day rolling acute load and a 28-day EWMA chronic load: if acute load = 650 AU and chronic load = 520 AU, then ACWR = 650/520 = 1.25. This falls within the acceptable range (0.8-1.3) but is approaching the upper boundary.
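
One way to compute the two loads from daily values is sketched below. Several choices are assumptions rather than the book's method: the acute load is taken as the mean of the last 7 days, the EWMA uses the smoothing factor alpha = 2 / (span + 1), and the EWMA is initialized at the first observation.

```python
def ewma(values, span):
    """Exponentially weighted moving average, assuming alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    avg = values[0]  # initialized at the first observation
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def acwr(daily_loads, acute_days=7, chronic_span=28):
    """Acute:chronic workload ratio from a list of daily loads (oldest first)."""
    if len(daily_loads) < acute_days:
        raise ValueError("need at least acute_days of data")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = ewma(daily_loads, chronic_span)
    return acute / chronic

# Hypothetical constant load: the ratio should sit at 1.0
ratio_constant = acwr([500] * 28)

# The loads from the exercise, checked against the 0.8-1.3 range
ratio_exercise = 650 / 520
in_range = 0.8 <= ratio_exercise <= 1.3
print(round(ratio_constant, 2), ratio_exercise, in_range)
```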

Exercise 3. Risk factors beyond ACWR: previous injury history (strongest single predictor), age, match congestion (games in last 14 days), sleep quality, and asymmetry in running biomechanics. A multi-factor logistic regression model outperforms ACWR alone.

Exercise 5. Kaplan-Meier survival curve for hamstring injuries: at day 10, approximately 70% of players are still injured; at day 21, approximately 30%; at day 28, approximately 15%. Median return-to-play time is approximately 16 days. The curve enables data-driven return timelines rather than fixed protocols.

Chapter 27: Real-Time Analytics and Decision Support

Exercise 1. A basic reinforcement learning reward function for soccer: +1 for a goal, -1 for conceding, +0.01 * delta_xT for each action (incentivizing ball progression). This reward structure balances the sparse terminal reward (goals) with a denser intermediate signal (spatial progress toward the opponent's goal).

Exercise 3. Real-time substitution decision support: monitor each player's cumulative match load (distance, sprints, high-intensity running), current tactical contribution (xT generated, defensive actions), and the team's game state. Flag a substitution recommendation when a player's physical output drops below 80% of their first-half baseline while the tactical contribution can be replicated by an available substitute.

Chapter 28: Building an Analytics Department

Exercise 1. Three rules for boardroom presentations: (a) lead with the recommendation, not the methodology; (b) use at most 3 key metrics per slide; (c) present uncertainty honestly but without overwhelming technical detail.

Exercise 3. An effective scouting report includes: (a) a one-page summary with radar chart and key metrics; (b) video clips illustrating strengths and weaknesses; (c) statistical comparison to the departing player and position benchmarks; (d) risk assessment including injury history and adaptation probability.

Exercise 5. Key roles in an analytics department: Head of Analytics (strategy and stakeholder management), Data Engineer (data pipelines and infrastructure), Data Scientist (model building and research), Analyst (match and opposition analysis for coaching staff), and Video Analyst (clip preparation and tactical breakdown). The minimum viable team for a mid-table club is 3-4 people combining these roles.

Chapter 29: Comprehensive Case Studies

Exercise 1. A complete xG pipeline includes: data loading, validation, feature engineering (distance, angle, in_box, is_header, interactions), logistic regression training with 5-fold stratified CV, calibration evaluation, and model serialization. See code/exercise-solutions.py for the full implementation.

Exercise 3. Isotonic calibration produces the lowest ECE but may overfit with small test sets because it is non-parametric. Platt scaling is more stable with fewer calibration samples. With n > 2,000 test shots, isotonic is preferred; below that, Platt scaling is safer.

Chapter 30: The Future of Soccer Analytics

Exercise 1. Pose estimation features: hip orientation rate of change (degrees/second), body lean variability (standard deviation over a sequence), and stride length coefficient of variation. High variability in these metrics may indicate fatigue or injury risk.

Exercise 5. An ethics checklist for a new analytics project should include: consent mechanism identified, data retention within limits, access controls defined, bias review completed (for high-risk projects), purpose documented, data minimization applied, and player access to their own data enabled.