Exercises: Statistical Foundations for Soccer Analysis

These exercises build practical statistical skills with soccer-specific applications. Work through them systematically, showing calculations where requested.

Scoring Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Descriptive Statistics ⭐

A.1. A striker scored the following goals across 8 seasons: 12, 18, 24, 15, 21, 19, 22, 17

Calculate: a) Mean b) Median c) Standard deviation d) Range

A.2. Two goalkeepers have the following save percentages over 10 matches:

GK A: 72%, 75%, 73%, 74%, 72%, 76%, 73%, 74%, 75%, 73%
GK B: 65%, 82%, 70%, 78%, 68%, 85%, 72%, 76%, 60%, 81%

a) Calculate the mean save percentage for each goalkeeper b) Calculate the standard deviation for each c) Which goalkeeper is more consistent? Which would you prefer and why?

A.3. The following are goals scored by 20 Premier League teams in a season: 31, 34, 37, 38, 40, 42, 44, 45, 48, 51, 52, 55, 58, 59, 68, 72, 75, 83, 88, 94

a) Calculate the median b) Calculate Q1 (25th percentile) and Q3 (75th percentile) c) Calculate the interquartile range (IQR) d) Are there any potential outliers? (Use 1.5 × IQR rule)
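To check hand calculations for A.3, a minimal sketch of the quartile and 1.5 × IQR steps is given here. The function name `iqr_outliers` is hypothetical, and the choice of the "inclusive" quartile method is an assumption: textbooks and software use several quartile conventions, so Q1 and Q3 worked by hand may differ slightly.

```python
import statistics


def iqr_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 x IQR rule."""
    # Assumption: the "inclusive" method, which interpolates between data
    # points; other quartile conventions can give slightly different values.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, outliers


# Goals scored by the 20 teams in A.3
goals = [31, 34, 37, 38, 40, 42, 44, 45, 48, 51,
         52, 55, 58, 59, 68, 72, 75, 83, 88, 94]
q1, q3, iqr, outliers = iqr_outliers(goals)
print(q1, q3, iqr, outliers)
```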

A.4. Explain why the median might be preferable to the mean when analyzing transfer fees. Give a specific example.

A.5. A midfielder has a mean passing accuracy of 85% with a standard deviation of 3%. Assuming a normal distribution, in what percentage of his matches would you expect passing accuracy to be: a) Above 88%? b) Below 79%? c) Between 82% and 88%?
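Normal-distribution areas such as those in A.5 can be checked with the standard library's `NormalDist`. The sketch below is one possible helper (the name `normal_areas` is hypothetical); it splits the distribution at two cut points and returns the three areas.

```python
from statistics import NormalDist


def normal_areas(mean, sd, low, high):
    """Return (P(X < low), P(low <= X <= high), P(X > high))."""
    dist = NormalDist(mu=mean, sigma=sd)
    below_low = dist.cdf(low)
    above_high = 1 - dist.cdf(high)
    between = dist.cdf(high) - dist.cdf(low)
    return below_low, between, above_high


# A.5 c): one standard deviation either side of the mean
below, between, above = normal_areas(85, 3, 82, 88)
print(f"{below:.4f} {between:.4f} {above:.4f}")
```

The other parts of A.5 need only a different pair of cut points.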


Part B: Probability ⭐⭐

B.1. A penalty kick has approximately 76% probability of being scored.

a) What is the probability of missing a penalty? b) If penalties are independent, what is the probability of scoring all 5 penalties in a shootout? c) What is the probability of missing at least one penalty in a 5-kick shootout?

B.2. In a league, the following outcome probabilities apply to the average match:
- Home win: 45%
- Draw: 27%
- Away win: 28%

a) What is the probability a randomly selected match is NOT a draw? b) If you observe 10 independent matches, what is the expected number of draws? c) What is the probability of exactly 3 draws in 10 matches? (Use binomial)
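Binomial probabilities like the one in part c) can be verified with a few lines of Python. This is a minimal sketch; the helper names `binomial_pmf` and `binomial_at_least` are hypothetical, and the trials are assumed independent with a constant success probability.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# B.2 c): exactly 3 draws in 10 matches with P(draw) = 0.27
print(binomial_pmf(3, 10, 0.27))
```

The `binomial_at_least` helper covers "at least" questions, which sum the upper tail of the same distribution.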

B.3. A team's expected goals (xG) for a match is 2.1, modeled as Poisson.

Calculate the probability that: a) They score exactly 2 goals b) They score 0 goals c) They score 3 or more goals d) They score at least 1 goal
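For reference, these probabilities follow from the Poisson probability mass function with rate $\lambda$ equal to the match xG; the "or more" and "at least" parts are usually easiest through the complement:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad P(X \ge k) = 1 - \sum_{j=0}^{k-1} P(X = j)$$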

B.4. Historical data shows:
- P(Team wins | Team scores first) = 0.85
- P(Team scores first) = 0.52
- P(Team wins) = 0.48

a) Calculate P(Team wins AND scores first) b) Calculate P(Team scores first | Team wins) c) Is scoring first independent of winning? Explain mathematically.
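The relationships needed here can be written compactly. The symbols are introduced only for this note: $W$ for "team wins" and $F$ for "team scores first".

$$P(W \cap F) = P(W \mid F)\,P(F), \qquad P(F \mid W) = \frac{P(W \cap F)}{P(W)}$$

The two events are independent exactly when $P(W \cap F) = P(W)\,P(F)$, or equivalently when $P(W \mid F) = P(W)$.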

B.5. A scout believes 20% of academy players become professionals. A new training program claims to improve this rate.

a) If the program has no effect, what is the probability that at least 4 out of 15 academy players become professionals? (Binomial) b) If actually 5 out of 15 became professionals, is this strong evidence the program works? Explain.


Part C: Statistical Inference ⭐⭐

C.1. A sample of 25 strikers has mean xG per 90 of 0.42 with standard deviation 0.12.

a) Calculate the standard error of the mean b) Construct a 95% confidence interval for the population mean xG per 90 c) Interpret this confidence interval in context
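A sketch of the standard error and confidence interval calculation follows. Because the standard library has no t-distribution, the critical value is passed in as an argument; for 24 degrees of freedom at 95% confidence it is approximately 2.064 from a t-table. The function name `mean_confidence_interval` is hypothetical.

```python
from math import sqrt


def mean_confidence_interval(sample_mean, sample_sd, n, critical_value):
    """Return (standard_error, (lower, upper)) for a sample mean."""
    standard_error = sample_sd / sqrt(n)
    margin = critical_value * standard_error
    return standard_error, (sample_mean - margin, sample_mean + margin)


# C.1: n = 25 strikers, mean 0.42, SD 0.12, t critical value for df = 24
se, (lower, upper) = mean_confidence_interval(0.42, 0.12, 25, 2.064)
print(f"SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```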

C.2. A coach claims the team's average passing accuracy is 82%. You observe 30 matches with mean passing accuracy 79% and standard deviation 5%.

a) State the null and alternative hypotheses b) Calculate the test statistic c) Using α = 0.05, do you reject the null hypothesis? d) What is your conclusion in context?

C.3. Two formations are tested:
- Formation A: 15 matches, mean goals 1.8, SD 0.9
- Formation B: 12 matches, mean goals 2.3, SD 1.1

a) Calculate the 95% CI for each formation's mean goals b) Do the confidence intervals overlap? c) Can you conclude one formation is better? Why or why not?

C.4. A player's conversion rate over 80 shots is 16% (league average is 12%).

a) State appropriate hypotheses to test if the player is better than average b) Calculate the test statistic (use normal approximation) c) Find the p-value d) At α = 0.05, what do you conclude? e) Would your conclusion change with only 40 shots at the same rate?
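The normal-approximation test in C.4 can be sketched as below. The helper name `proportion_z_test` is hypothetical, and the test shown is one-sided (upper tail), matching a "better than average" alternative; the standard error uses the null value, as is conventional for this test.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(p_hat, n, p0):
    """One-sided (upper-tail) z-test for a proportion; returns (z, p_value)."""
    standard_error = sqrt(p0 * (1 - p0) / n)  # computed under the null
    z = (p_hat - p0) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# C.4: 16% conversion over 80 shots against a 12% league average
z, p_value = proportion_z_test(0.16, 80, 0.12)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

Rerunning with `n=40` shows how the same observed rate carries less evidence on a smaller sample.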

C.5. Explain the difference between statistical significance and practical significance using a soccer example. When might a result be statistically significant but not practically important?


Part D: Sample Size and Stabilization ⭐⭐

D.1. A goalkeeper has faced 50 shots and saved 38 (76% save rate).

a) Calculate the 95% CI for the true save rate b) If the league average is 72%, can you conclude this goalkeeper is better than average? c) How many shots would be needed for the 95% CI to have a margin of error of ±3%?
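A minimal sketch of the normal-approximation (Wald) interval for a proportion is shown here for checking part a). The function name `proportion_confidence_interval` is hypothetical, and the Wald interval is an assumption about the intended method; it can be inaccurate for small samples or rates near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# D.1 a): 38 saves from 50 shots faced
low, high = proportion_confidence_interval(38, 50)
print(f"{low:.3f} to {high:.3f}")
```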

D.2. Using the formula for sample size with specified margin of error E:

$$n = \left(\frac{z \cdot \sigma}{E}\right)^2$$

Calculate how many shots a player needs to take to estimate their conversion rate with: a) 95% confidence and ±5% margin (assume σ = 0.35) b) 95% confidence and ±2% margin (same σ) c) Why is this problematic for single-season analysis?
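The sample size formula translates directly into code. In this sketch the function name `required_sample_size` is hypothetical; the result is rounded up because a fractional observation is not possible, and z is taken from the standard normal distribution.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n satisfying n >= (z * sigma / margin) ** 2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)


# D.2 a) and b): sigma = 0.35, margins of 0.05 and 0.02
for margin in (0.05, 0.02):
    print(margin, required_sample_size(0.35, margin))
```

Because n grows with the inverse square of the margin, halving the margin roughly quadruples the shots required.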

D.3. Two players' conversion rates:
- Player A: 18% on 120 shots
- Player B: 22% on 30 shots

a) Calculate 95% CIs for each player's true conversion rate b) Which player would you rate higher? Justify your choice. c) Explain how regression to the mean affects your evaluation of Player B.

D.4. A "hot streak" analysis shows a player scored in 8 consecutive matches. Their baseline probability of scoring in any match is 0.35.

a) What is the probability of scoring in 8 consecutive matches? b) If there are 500 players in the league, how many would you expect to have such a streak purely by chance? c) What does this tell us about interpreting "hot streaks"?


Part E: Correlation and Regression ⭐⭐⭐

E.1. Calculate the correlation coefficient between team possession (%) and points for the following data:

Team  Possession (%)  Points
A     58              75
B     52              68
C     62              82
D     45              52
E     55              70
F     48              58
G     60              78
H     50              62

a) Calculate the correlation coefficient b) Interpret the strength and direction c) Does this prove possession causes success? Explain.
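For checking the hand calculation in E.1, the Pearson coefficient can be computed from the sums of squared and cross deviations. This is a sketch with a hypothetical function name, `pearson_correlation`, written without external libraries.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# E.1 data: possession (%) and points for teams A to H
possession = [58, 52, 62, 45, 55, 48, 60, 50]
points = [75, 68, 82, 52, 70, 58, 78, 62]
r = pearson_correlation(possession, points)
print(round(r, 3))
```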

E.2. Fit a simple linear regression predicting goals from xG using this data:

Team  xG  Goals
A     55  58
B     48  45
C     72  75
D     61  65
E     53  50
F     68  70

a) Calculate the slope and intercept b) Interpret the slope in context c) Predict goals for a team with xG = 60 d) Calculate R²
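The least-squares calculations in E.2 can be checked with the sketch below. The function name `simple_linear_regression` is hypothetical; it returns the slope, intercept, and R² from the same deviation sums used in a hand calculation.

```python
def simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy**2 / (sxx * syy)
    return slope, intercept, r_squared


# E.2 data: season xG and goals for teams A to F
xg = [55, 48, 72, 61, 53, 68]
goals_scored = [58, 45, 75, 65, 50, 70]
slope, intercept, r_squared = simple_linear_regression(xg, goals_scored)
predicted = intercept + slope * 60  # part c): prediction at xG = 60
print(slope, intercept, r_squared, predicted)
```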

E.3. You run a regression of league points on the following predictors:
- xG (coefficient: 0.85, p-value: 0.001)
- Possession % (coefficient: 0.42, p-value: 0.23)
- Passes completed (coefficient: 0.01, p-value: 0.85)

a) Which predictors are statistically significant at α = 0.05? b) Interpret the xG coefficient c) Why might possession appear non-significant despite its apparent relationship with success?

E.4. A regression shows that adding a new signing is associated with 5 additional points per season (p < 0.01). List three reasons why we should be cautious about interpreting this as a causal effect.


Part F: Applied Problems ⭐⭐⭐

F.1. xG Analysis

A team finished the season with:
- 52 goals scored
- 65.3 xG
- 38 goals conceded
- 45.8 xGA

a) Calculate the goal difference vs expected goal difference b) Would you expect this team to improve or decline next season based on these numbers? c) If xG is a better predictor of future performance than actual goals, how should this affect transfer strategy?

F.2. Player Comparison

Comparing two strikers over a full season:
- Striker A: 22 goals from 18.5 xG, 150 shots
- Striker B: 16 goals from 17.8 xG, 110 shots

a) Calculate each player's conversion rate and xG per shot b) Who is the "better" finisher based on goals vs xG? c) Whose performance is more sustainable? Explain using regression to the mean. d) If you could only sign one, who would you choose and why?

F.3. League Table Analysis

End-of-season analysis shows:
- The team finishing 4th had the 2nd highest xG
- The team finishing 1st had the 4th highest xG but the best (lowest) xGA
- The team finishing 20th (relegated) had the 8th highest xG but the 2nd worst xGA

a) What does this suggest about the relative importance of attack vs defense? b) Design a regression analysis to quantify the relationship between xG, xGA, and points c) What confounding factors might affect your analysis?


Part G: Computational Exercises ⭐⭐⭐

G.1. Write Python code to:

# Given a list of match results (1=win, 0=draw, -1=loss)
# Calculate:
# a) Win rate with 95% CI
# b) Test whether the win rate differs from 0.35 (league average)

results = [1, 0, 1, 1, -1, 0, 1, 1, 0, 1, -1, 1, 1, 0, 1, 1, -1, 1, 0, 1]
# Your code here

G.2. Write a function that calculates the probability of a specific scoreline, modeling each team's goals as an independent Poisson distribution:

def match_scoreline_probability(home_xg: float, away_xg: float,
                                home_goals: int, away_goals: int) -> float:
    """
    Calculate probability of exact scoreline using Poisson model.

    Parameters
    ----------
    home_xg : float
        Home team expected goals
    away_xg : float
        Away team expected goals
    home_goals : int
        Home team goals to calculate probability for
    away_goals : int
        Away team goals to calculate probability for

    Returns
    -------
    float
        Probability of this exact scoreline
    """
    # Your code here
    pass

# Test: probability of 2-1 when home xG=1.8, away xG=1.2

G.3. Write code to calculate the stabilization point for a metric:

import pandas as pd

def calculate_stabilization(df: pd.DataFrame, metric: str,
                            future_metric: str) -> int:
    """
    Find the smallest sample size at which the metric's correlation
    with its future value exceeds 0.5.

    Parameters
    ----------
    df : pd.DataFrame
        Player data with cumulative samples
    metric : str
        Current metric column name
    future_metric : str
        Future metric column name

    Returns
    -------
    int
        Approximate stabilization point
    """
    # Your code here
    pass

Part H: Critical Analysis ⭐⭐⭐⭐

H.1. A viral tweet claims: "Teams with higher pressing intensity win 68% more often than teams with lower pressing intensity."

Critique this claim. What statistical questions should you ask before accepting it?

H.2. A betting model claims 58% accuracy on match predictions. Over 1000 bets, it achieved 56.5% accuracy.

a) Test whether the observed accuracy is significantly different from the claimed 58% b) Test whether it is better than 50% (random guessing) c) Is this model useful for betting? Consider practical significance.

H.3. Simpson's Paradox Exercise

Player performance against different opposition levels:

Player  Shots (strong)  Goals (strong)  Shots (weak)  Goals (weak)
A       40              4               60            12
B       80              7               20            3

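a) Calculate conversion rates for each player against each opposition type b) Calculate overall conversion rates c) Check whether the overall ranking reverses the ranking within each opposition type, as Simpson's paradox requires, and explain how each player's mix of strong and weak opposition shapes the overall rates d) Which player is actually the better finisher?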

H.4. Research Question Design

You want to test whether teams perform better in afternoon kickoffs than evening kickoffs.

a) State formal hypotheses b) What data would you need? c) What potential confounders should you control for? d) How large a sample would you need? (Estimate) e) What would constitute strong evidence for your hypothesis?


Solutions

Selected solutions are available in:
- code/exercise-solutions.py (programming problems)
- appendices/g-answers-to-selected-exercises.md (odd-numbered problems)


Reflection Questions

  1. Which statistical concept from this chapter do you find most relevant to soccer analysis?
  2. What are the biggest challenges in applying classical statistics to soccer data?
  3. How has this chapter changed how you interpret soccer statistics in media?