Quiz: Statistical Foundations for Soccer Analysis
Test your statistical knowledge before moving to the next chapter. Target: 70% or higher to proceed. Time: ~45 minutes
Section 1: Multiple Choice (1 point each)
1. A striker's goals over 6 seasons are: 10, 15, 12, 18, 14, 25. What is the median?
- A) 14
- B) 14.5
- C) 15
- D) 15.7
Answer
**B)** 14.5 *Explanation:* Sorted: 10, 12, 14, 15, 18, 25. With n=6 (even), the median is the average of the 3rd and 4th values: (14+15)/2 = 14.5. Reference Section 3.1.2.

2. Two players have the same mean passing accuracy (85%), but Player A has SD=2% and Player B has SD=8%. Which statement is correct?
- A) Player A is more accurate than Player B
- B) Player B is more consistent than Player A
- C) Player A is more consistent than Player B
- D) The players perform identically
Answer
**C)** Player A is more consistent than Player B *Explanation:* Lower standard deviation indicates less variability, meaning more consistent performance. Both have the same average accuracy, but Player A's performance varies less. Reference Section 3.1.3.

3. A penalty has 76% probability of being scored. What is the probability of missing?
- A) 76%
- B) 34%
- C) 24%
- D) Cannot be determined
Answer
**C)** 24% *Explanation:* P(miss) = 1 - P(score) = 1 - 0.76 = 0.24 = 24%. This is the complement rule. Reference Section 3.2.2.

4. Which probability distribution is most appropriate for modeling the number of goals a team scores in a match?
- A) Normal distribution
- B) Binomial distribution
- C) Poisson distribution
- D) Uniform distribution
Answer
**C)** Poisson distribution *Explanation:* Goals are discrete events occurring at a roughly constant rate during a match, making Poisson an appropriate model. The normal distribution is for continuous data, the binomial for a fixed number of trials, and the uniform for equally likely outcomes. Reference Section 3.2.5.

5. A team's xG for a match is 2.0. Using the Poisson distribution, what is the approximate probability they score exactly 0 goals?
- A) 5%
- B) 14%
- C) 27%
- D) 37%
Answer
**B)** 14% *Explanation:* P(X=0) = e^(-λ) × λ^0 / 0! = e^(-2) × 1 / 1 = 0.135 ≈ 14%. Reference Section 3.2.5.

6. What does a 95% confidence interval mean?
- A) There is a 95% probability the true value is in this interval
- B) 95% of the data falls within this interval
- C) If we repeated the sampling process many times, 95% of intervals would contain the true value
- D) The result is 95% accurate
Answer
**C)** If we repeated the sampling process many times, 95% of intervals would contain the true value *Explanation:* CIs describe the reliability of the estimation procedure, not the probability for a specific interval. Option A is a common misinterpretation. Reference Section 3.3.3.

7. A p-value of 0.03 means:
- A) There is a 3% chance the null hypothesis is true
- B) There is a 97% chance the alternative hypothesis is true
- C) If the null hypothesis were true, there's a 3% probability of observing data this extreme or more
- D) The effect size is 0.03
Answer
**C)** If the null hypothesis were true, there's a 3% probability of observing data this extreme or more *Explanation:* The p-value is the probability of the observed data (or more extreme) under the null hypothesis, not the probability the hypothesis is true. Reference Section 3.3.4.

8. Which statement about correlation is TRUE?
- A) Correlation implies causation
- B) Correlation measures only linear relationships
- C) Correlation values range from 0 to 1
- D) High correlation means the relationship is causal
Answer
**B)** Correlation measures only linear relationships *Explanation:* Pearson correlation specifically measures linear relationships. It ranges from -1 to 1, and correlation never implies causation. Reference Section 3.5.

9. In regression, the R² value represents:
- A) The correlation coefficient
- B) The slope of the regression line
- C) The proportion of variance explained by the model
- D) The standard error of the estimate
Answer
**C)** The proportion of variance explained by the model *Explanation:* R² (coefficient of determination) indicates what percentage of the variability in the dependent variable is explained by the independent variable(s). Reference Section 3.6.2.

10. Regression to the mean suggests that:
- A) All players eventually become average
- B) Extreme performances tend to be followed by less extreme ones
- C) The mean always stays constant
- D) Better players always regress to league average
Answer
**B)** Extreme performances tend to be followed by less extreme ones *Explanation:* Regression to the mean is a statistical phenomenon where extreme observations are partly due to luck and subsequent observations tend to be closer to the mean. It doesn't mean everyone becomes average. Reference Section 3.4.3.

Section 2: True/False (1 point each)
11. The Central Limit Theorem states that sample means are approximately normally distributed for large samples, regardless of the population distribution.
Answer
**True** *Explanation:* This is exactly what the CLT states. For sufficiently large samples, the sampling distribution of the mean approaches normal. Reference Section 3.3.2.

12. A statistically significant result is always practically important.
Answer
**False** *Explanation:* Statistical significance only indicates a result is unlikely under the null hypothesis. With large samples, tiny differences can be "significant" but meaningless in practice. Always consider effect size. Reference Section 3.3.5.

13. A player with 20% conversion rate on 30 shots has a more reliable estimate than a player with 15% on 200 shots.
Answer
**False** *Explanation:* Larger samples provide more reliable estimates. The 200-shot sample has a much narrower confidence interval and is more trustworthy. Reference Section 3.4.

14. If two events are independent, P(A and B) = P(A) × P(B).
Answer
**True** *Explanation:* This is the definition/test of independence. For independent events, the multiplication rule applies directly. Reference Section 3.2.2.

15. Bayes' theorem allows us to update our beliefs based on new evidence.
Answer
**True** *Explanation:* Bayes' theorem provides a framework for updating prior beliefs with observed data to obtain posterior beliefs. Reference Section 3.2.4.

16. Shooting percentage typically stabilizes faster than save percentage because shots require smaller samples.
Answer
**False** *Explanation:* Both require large samples. Shooting percentage actually stabilizes at around 700+ shots, while save percentage requires 1000+ shots faced. Reference Section 3.4.2.

Section 3: Fill in the Blank (1 point each)
17. The standard ______ measures how much sample means typically vary from the true population mean.
Answer
**error** (standard error) *Explanation:* SE = s/√n measures the typical deviation of sample means from the population mean.

18. When testing hypotheses, we reject the null hypothesis if the p-value is less than the significance level (usually ______).
Answer
**0.05** (or 5%, or α) *Explanation:* The conventional significance level is α = 0.05, though other values (0.01, 0.10) are sometimes used.

19. The ______ distribution models the number of events occurring in a fixed interval when events happen independently at a constant rate.
Answer
**Poisson** *Explanation:* The Poisson distribution is characterized by parameter λ (the rate) and is commonly used for goal-scoring models.

20. ______ to the mean is a statistical phenomenon where extreme observations are followed by observations closer to the average.
Answer
**Regression** *Explanation:* Regression to the mean describes the tendency of extreme values to be followed by less extreme ones.

Section 4: Short Answer (2 points each)
21. A player scores 25 goals from 18 xG. What does this suggest about their future performance, and why?
Sample Answer
The player has significantly outperformed their xG, scoring 7 more goals than expected. This suggests their future goal-scoring rate will likely decline toward their xG level due to regression to the mean. The overperformance could be due to exceptional finishing skill, but is more likely partly due to luck (e.g., goalkeeper errors, fortunate deflections). We should expect future performance closer to their xG unless there's strong evidence of sustained elite finishing ability.

*Key points for full credit:*
- Recognition of overperformance
- Mention of regression to the mean
- Acknowledgment of skill vs luck distinction

22. Explain why possession percentage might have a weak correlation with winning despite appearing important.
Sample Answer
Several factors explain this:

1. **Game state effects:** Winning teams often cede possession to protect their lead, creating reverse causation
2. **Style differences:** Effective counter-attacking teams win with low possession
3. **Quality matters more than quantity:** What you do with possession matters more than how much you have
4. **Confounding:** Better teams might have both more possession AND win more, but possession isn't the cause

*Key points for full credit:*
- At least two valid explanations
- Recognition that correlation ≠ causation

23. What is the difference between a confidence interval and a prediction interval?
Sample Answer
A confidence interval estimates where a population parameter (like the mean) lies, while a prediction interval estimates where a future individual observation will fall. Prediction intervals are wider because they account for both uncertainty about the mean AND the variability of individual observations around that mean. For example, a 95% CI for a team's mean goals might be 1.5-2.1, but a prediction interval for their next match's goals would be much wider (perhaps 0-4).

*Key points for full credit:*
- Clear distinction between parameter estimation vs individual prediction
- Recognition that prediction intervals are wider

24. A study finds r = 0.92 between xG and actual goals. Does this mean xG is an excellent predictor of goals? What additional information would you want?
Sample Answer
While r = 0.92 indicates a very strong correlation, additional information is needed:

1. **Sample size:** How many teams/seasons were included? Small samples can show spurious strong correlations.
2. **Time frame:** Is this concurrent (same season) or predictive (xG predicting future goals)?
3. **Unit of analysis:** Team-seasons or match-level? Team-season aggregates will show higher correlations.
4. **Confidence interval:** What's the uncertainty around this estimate?

*Key points for full credit:*
- Question about sample size or statistical significance
- Recognition of concurrent vs predictive distinction

Section 5: Calculation Problems (3 points each)
25. A team's shot data: 15 shots with xG of 0.10, 0.15, 0.08, 0.22, 0.05, 0.12, 0.18, 0.09, 0.11, 0.14, 0.25, 0.07, 0.13, 0.16, 0.20
- a) Calculate mean xG per shot (1 point)
- b) Calculate standard deviation of xG per shot (1 point)
- c) What is the total xG for this set of shots? (1 point)
Answer
**a) Mean xG per shot:**
- Sum = 0.10 + 0.15 + 0.08 + 0.22 + 0.05 + 0.12 + 0.18 + 0.09 + 0.11 + 0.14 + 0.25 + 0.07 + 0.13 + 0.16 + 0.20 = 2.05
- Mean = 2.05 / 15 = **0.137** (or approximately 0.14)

**b) Standard deviation:** First calculate the sample variance:
- Deviations from mean: (-0.037, 0.013, -0.057, 0.083, -0.087, -0.017, 0.043, -0.047, -0.027, 0.003, 0.113, -0.067, -0.007, 0.023, 0.063)
- Sum of squared deviations ≈ 0.0461
- Variance = 0.0461/14 ≈ 0.00330
- SD = √0.00330 ≈ **0.057**

**c) Total xG:** Sum = **2.05**

26. A goalkeeper faces 80 shots and saves 60. League average save rate is 70%.
- a) Calculate the goalkeeper's save rate (1 point)
- b) Construct a 95% confidence interval for the true save rate (1 point)
- c) Is there evidence this goalkeeper is better than league average? (1 point)
Answer
**a) Save rate:** 60/80 = **75%** (or 0.75)

**b) 95% Confidence Interval:**
- Standard error for proportion: SE = √(p(1-p)/n) = √(0.75 × 0.25/80) = √0.00234 = 0.0484
- 95% CI: 0.75 ± 1.96 × 0.0484 = 0.75 ± 0.095 = **(0.655, 0.845)** or approximately **(65.5%, 84.5%)**

**c) Evidence for better than average:** The league average (70%) falls within the confidence interval (65.5%, 84.5%), so we cannot conclude this goalkeeper is statistically significantly better than league average. The sample is too small to rule out that the observed 75% is due to chance.

Section 6: Applied Problem (5 points)
27. You are analyzing whether home advantage has declined in recent seasons.
Data from two periods:
- Period 1 (2010-2015): 3000 matches, 1380 home wins (46%)
- Period 2 (2018-2023): 3000 matches, 1290 home wins (43%)
- a) State the null and alternative hypotheses (1 point)
- b) Is this a one-tailed or two-tailed test? Why? (1 point)
- c) Calculate the test statistic (2 points)
- d) What do you conclude at α = 0.05? (1 point)
Answer
**a) Hypotheses:**
- H₀: p₁ = p₂ (home win rates are equal in both periods)
- H₁: p₁ ≠ p₂ (home win rates differ between periods)
- OR specifically: H₁: p₁ > p₂ (home advantage has declined)

**b) One vs two-tailed:** If testing whether home advantage "has declined" specifically, use a **one-tailed test**. If testing whether home advantage "has changed" (could be either direction), use a two-tailed test.

**c) Test statistic:**
- Pooled proportion: p̂ = (1380 + 1290)/(3000 + 3000) = 2670/6000 = 0.445
- Standard error: SE = √[p̂(1-p̂)(1/n₁ + 1/n₂)] = √[0.445 × 0.555 × (2/3000)] = √[0.000165] = 0.0128
- z = (0.46 - 0.43)/0.0128 = 0.03/0.0128 = **2.34**

**d) Conclusion:**
- For two-tailed test: p-value ≈ 0.019 < 0.05
- For one-tailed test: p-value ≈ 0.0096 < 0.05

**Reject the null hypothesis.** There is statistically significant evidence that home win rates differ between the two periods, suggesting home advantage has declined from 46% to 43%.

Scoring
| Section | Points | Your Score |
|---|---|---|
| Multiple Choice (1-10) | 10 | ___ |
| True/False (11-16) | 6 | ___ |
| Fill in Blank (17-20) | 4 | ___ |
| Short Answer (21-24) | 8 | ___ |
| Calculations (25-26) | 6 | ___ |
| Applied Problem (27) | 5 | ___ |
| Total | 39 | ___ |
Passing Score: 27/39 (approximately 70%)
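When marking Problems 25 and 26, the arithmetic can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `shots_xg` and `wald_interval` are introduced here and do not come from the chapter, and the interval uses the same normal approximation (z = 1.96) as the answer to Problem 26.

```python
import math
import statistics

# xG values for the 15 shots listed in Problem 25.
shots_xg = [0.10, 0.15, 0.08, 0.22, 0.05, 0.12, 0.18, 0.09,
            0.11, 0.14, 0.25, 0.07, 0.13, 0.16, 0.20]

total_xg = sum(shots_xg)            # part (c)
mean_xg = statistics.mean(shots_xg)  # part (a)
sd_xg = statistics.stdev(shots_xg)   # part (b): sample SD, divides by n - 1


def wald_interval(successes, trials, z=1.96):
    """Normal-approximation interval for a proportion (hypothetical helper)."""
    p_hat = successes / trials
    se = math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - z * se, p_hat + z * se


# Problem 26: 60 saves from 80 shots, compared with a 70% league average.
low, high = wald_interval(60, 80)
league_average = 0.70
average_inside_interval = low <= league_average <= high

print(f"total xG = {total_xg:.2f}, mean = {mean_xg:.3f}, SD = {sd_xg:.3f}")
print(f"95% CI for save rate = ({low:.3f}, {high:.3f})")
print(f"league average inside interval: {average_inside_interval}")
```

`statistics.stdev` divides by n − 1, which matches the division by 14 in the answer to part (b); `statistics.pstdev` would divide by n instead and give a slightly smaller value.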
Review Recommendations
- Score < 50%: Re-read entire chapter, focusing on Sections 3.1-3.4
- Score 50-70%: Review Sections 3.3 (Inference) and 3.4 (Sample Size), redo exercise Parts B-C
- Score 70-85%: Good understanding! Review correlation/regression (Sections 3.5-3.6)
- Score > 85%: Excellent! Ready for Chapter 4
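For readers who want to verify the Problem 27 answer numerically, a minimal Python sketch of the pooled two-proportion z-test follows. The function name `two_proportion_z` is hypothetical (it is not from the chapter), and the p-values are taken from the standard normal distribution in Python's standard library (`statistics.NormalDist`, Python 3.8+).

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (hypothetical helper name)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Period 1: 1380 home wins in 3000 matches; Period 2: 1290 in 3000.
z = two_proportion_z(1380, 3000, 1290, 3000)

# One-tailed p-value for H1: p1 > p2, and the two-tailed equivalent.
p_one_tailed = 1 - NormalDist().cdf(z)
p_two_tailed = 2 * p_one_tailed

print(f"z = {z:.2f}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

The z value should come out close to 2.34, and both p-values fall below α = 0.05. Small differences in the last digit against a printed z-table are expected, because the script does not round z before looking up the tail area.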