Exercises: Statistical Foundations for Football Analytics
Difficulty Levels
- Level 1 (Foundational): Basic probability and statistics concepts
- Level 2 (Applied): Applying concepts to football data
- Level 3 (Intermediate): Multi-step analysis and interpretation
- Level 4 (Advanced): Building models and complex inference
- Level 5 (Expert): Research-level statistical analysis
Section 1: Probability (Level 1-2)
Exercise 1.1: Basic Probability
Level 1 | Probability
A kicker has the following field goal success rates by distance:
- 0-29 yards: 97%
- 30-39 yards: 88%
- 40-49 yards: 78%
- 50+ yards: 63%

Calculate:
1. Expected points from a 35-yard attempt (make = 3 points)
2. Expected points from a 52-yard attempt
3. At what success rate does attempting a FG have lower expected value than punting (assume punt = 0 points)?
# Your code here
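One way to start Exercise 1.1 is to treat expected points as the success probability times the points for a make. This sketch assumes a miss is worth 0 points, which ignores the field position handed to the opponent. The names `FG_RATES`, `fg_success_rate`, and `expected_fg_points` are illustrative and not part of any provided dataset.

```python
# Success rates from the exercise table, keyed by (min_yards, max_yards).
FG_RATES = {
    (0, 29): 0.97,
    (30, 39): 0.88,
    (40, 49): 0.78,
    (50, 99): 0.63,  # "50+" bucket; the upper bound of 99 is an arbitrary cap
}


def fg_success_rate(distance):
    """Look up the success rate for a kick of the given distance in yards."""
    for (low, high), rate in FG_RATES.items():
        if low <= distance <= high:
            return rate
    raise ValueError(f"No success rate defined for {distance} yards")


def expected_fg_points(distance, make_points=3.0, miss_points=0.0):
    """Expected points = p * make_points + (1 - p) * miss_points.

    ASSUMPTION: miss_points defaults to 0, a simplification.
    """
    p = fg_success_rate(distance)
    return p * make_points + (1 - p) * miss_points


for yards in (35, 52):
    print(yards, round(expected_fg_points(yards), 2))
```

Passing a negative `miss_points` is one way to explore how the comparison with punting changes once a miss carries a cost.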
Exercise 1.2: Conditional Probability
Level 2 | Conditional Probability
Using play-by-play data, calculate:
1. P(First Down | Third Down)
2. P(First Down | Third Down AND Yards to Go ≤ 3)
3. P(First Down | Third Down AND Yards to Go > 7)
4. P(Pass | Third Down AND Yards to Go > 7)
# Your code here
Exercise 1.3: Joint Probability
Level 2 | Joint Probability
Calculate:
1. P(Pass AND Touchdown)
2. P(Run AND Touchdown)
3. P(Third Down AND Pass AND First Down)
4. Are "Pass" and "Touchdown" independent events? How can you test this?
# Your code here
Exercise 1.4: Expected Value
Level 2 | Expected Value
For a fourth-and-1 situation at the opponent's 35-yard line, calculate expected points for:
1. Go for it (assume 70% conversion rate)
2. Punt (assume touchback)
3. Field goal attempt

Which option maximizes expected points?
# Your code here
Exercise 1.5: Bayes' Theorem Application
Level 3 | Bayesian Inference
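A team wins 60% of games when they score first. They score first 52% of the time.
1. What is P(Win)? Note that this also requires P(Win | Did Not Score First), which is not given above: estimate it from data or state your assumption.
2. If a team wins, what is the probability they scored first?
3. Use Bayes' theorem to calculate P(Score First | Win)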
# Your code here
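The calculation in Exercise 1.5 combines the law of total probability with Bayes' theorem. The exercise does not supply P(Win | Did Not Score First), so the value of 0.40 in this sketch is a placeholder assumption chosen only to make the code run. Replace it with an estimate from your data.

```python
def total_probability(p_win_given_first, p_first, p_win_given_not_first):
    """P(Win) = P(Win|First) P(First) + P(Win|Not First) P(Not First)."""
    return p_win_given_first * p_first + p_win_given_not_first * (1 - p_first)


def bayes_posterior(p_win_given_first, p_first, p_win):
    """P(First | Win) = P(Win | First) P(First) / P(Win)."""
    return p_win_given_first * p_first / p_win


P_WIN_GIVEN_FIRST = 0.60      # from the exercise
P_FIRST = 0.52                # from the exercise
P_WIN_GIVEN_NOT_FIRST = 0.40  # ASSUMPTION: placeholder, not given in the exercise

p_win = total_probability(P_WIN_GIVEN_FIRST, P_FIRST, P_WIN_GIVEN_NOT_FIRST)
p_first_given_win = bayes_posterior(P_WIN_GIVEN_FIRST, P_FIRST, p_win)

print(f"P(Win) = {p_win:.3f}")
print(f"P(Score First | Win) = {p_first_given_win:.3f}")
```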
Section 2: EPA Analysis (Level 2-3)
Exercise 2.1: EPA Interpretation
Level 2 | EPA Concepts
Using the play-by-play data:
1. What is the league-wide average EPA on passes?
2. What is the league-wide average EPA on runs?
3. What is the standard deviation of EPA for each play type?
4. Why is the pass EPA higher than run EPA?
# Your code here
Exercise 2.2: QB EPA Rankings
Level 2 | Player Evaluation
Calculate EPA per dropback for all QBs with 200+ attempts:
1. Rank QBs by EPA per dropback
2. Calculate the 95% confidence interval for the top 5 QBs
3. Identify which QBs' confidence intervals overlap with league average
# Your code here
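For the confidence-interval step in Exercise 2.2, a normal-approximation interval for a mean is one reasonable starting point when each QB has 200 or more dropbacks. In this sketch the helper names, the toy EPA values, and the league-average placeholder are all hypothetical. A real analysis would pull per-dropback EPA from the play-by-play data.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Return (low, mean, high) using mean +/- z * s / sqrt(n).

    z=1.96 corresponds to a 95% interval under a normal approximation.
    """
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean - z * std_err, mean, mean + z * std_err


def contains(low, high, reference):
    """True if the interval [low, high] includes the reference value."""
    return low <= reference <= high


# Toy per-dropback EPA values for one hypothetical QB (not real data).
toy_epa = [0.5, -0.3, 1.2, 0.1, -0.8, 0.9, 0.0, 0.4]
LEAGUE_AVERAGE = 0.0  # ASSUMPTION: placeholder for the league-wide mean

low, mean, high = mean_ci(toy_epa)
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
print("Interval includes league average:", contains(low, high, LEAGUE_AVERAGE))
```

With only eight toy observations a t-based critical value would be more appropriate. The z value is used here because the exercise restricts the sample to QBs with many attempts.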
Exercise 2.3: Situational EPA
Level 3 | Context Analysis
Calculate average EPA for passes in different situations:
1. First down vs third down
2. Leading by 7+ vs trailing by 7+
3. First half vs fourth quarter
4. Red zone vs rest of field
# Your code here
Exercise 2.4: Team EPA Decomposition
Level 3 | Aggregation
For each team, calculate:
1. Total offensive EPA
2. Pass EPA contribution (total pass EPA)
3. Rush EPA contribution (total rush EPA)
4. What percentage of each team's EPA comes from passing?
# Your code here
Exercise 2.5: EPA Model Building
Level 4 | Model Construction
Build a simple Expected Points model:
1. Calculate historical average points scored from each yardline (1-99)
2. Create a lookup table of EP by yardline
3. Use this to calculate "homemade" EPA for plays
4. Compare your EPA values to the provided EPA column
# Your code here
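The lookup-table idea in Exercise 2.5 can be sketched with plain dictionaries. This is a deliberately minimal version. It assumes each record is a hypothetical `(yardline_100, points_scored)` pair, ignores down and distance, and assumes the offense keeps the ball and does not score on the play being valued. The toy records are invented for illustration.

```python
from collections import defaultdict


def build_ep_table(records):
    """Average points scored from each yardline.

    records: iterable of (yardline_100, points_scored) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for yardline, points in records:
        totals[yardline] += points
        counts[yardline] += 1
    return {yardline: totals[yardline] / counts[yardline] for yardline in totals}


def homemade_epa(ep_table, yardline_before, yardline_after):
    """EPA = EP after the play minus EP before the play.

    ASSUMPTION: possession is retained and no score occurs on the play.
    """
    return ep_table[yardline_after] - ep_table[yardline_before]


# Toy records (not real data): (yards from the opponent's end zone, points scored).
toy_records = [(75, 0), (75, 7), (40, 3), (40, 7), (10, 7), (10, 3)]
ep_table = build_ep_table(toy_records)

print(ep_table)
print("EPA for a 35-yard gain from the 75:", homemade_epa(ep_table, 75, 40))
```

A fuller model would also handle turnovers, scoring plays, and yardlines with few observations, for example by smoothing across neighboring yardlines.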
Section 3: Win Probability (Level 2-4)
Exercise 3.1: WP Interpretation
Level 2 | WP Concepts
Using play-by-play data:
1. Find the play with the highest WPA in a given season
2. Find the play with the lowest WPA
3. What types of plays generate the most extreme WPA values?
# Your code here
Exercise 3.2: Clutch Performance
Level 3 | Player Evaluation
Calculate total WPA for QBs:
1. Rank QBs by total WPA
2. Compare rankings by total WPA vs EPA per play
3. Which QBs are "clutch" (high WPA rank, lower EPA rank)?
# Your code here
Exercise 3.3: Game Flow Analysis
Level 3 | Time Series
For a specific game:
1. Plot win probability over time for both teams
2. Identify the "turning point" (largest single-play WPA swing)
3. Calculate each team's cumulative WPA through the game
# Your code here
Exercise 3.4: Comeback Probability
Level 4 | Model Analysis
Analyze comeback likelihood:
1. What is the historical win rate when trailing by 14+ in the 4th quarter?
2. What is the win rate when trailing by 7 with 2 minutes left?
3. Build a simple model predicting win probability from score differential and time remaining
# Your code here
Exercise 3.5: WPA Variance Analysis
Level 4 | Statistical Analysis
Compare WPA distributions:
1. Calculate the standard deviation of WPA for passes vs runs
2. Test whether the difference is statistically significant
3. What does this tell us about which play types are "higher leverage"?
# Your code here
Section 4: Hypothesis Testing (Level 3-4)
Exercise 4.1: Home Field Advantage
Level 3 | Two-Sample Test
Test whether home field advantage exists:
1. Calculate mean EPA for home vs away offenses
2. Perform a two-sample t-test
3. Calculate the effect size (Cohen's d)
4. Is the effect practically significant?
# Your code here
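For the effect-size step in Exercise 4.1, Cohen's d with a pooled standard deviation can be computed with the standard library alone. The function name `cohens_d` and the toy EPA lists are illustrative. The t-test itself is left to whichever statistics package you prefer.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d = (mean_a - mean_b) / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd


# Toy per-play EPA values (not real data).
home_epa = [0.30, -0.10, 0.50, 0.00, 0.20, -0.40]
away_epa = [0.10, -0.20, 0.40, -0.10, 0.00, -0.50]

print(f"Cohen's d (home - away) = {cohens_d(home_epa, away_epa):.3f}")
```

Because play-level EPA is very noisy, a difference can be statistically detectable over tens of thousands of plays while d remains small. That is the contrast the "practically significant" question is probing.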
Exercise 4.2: QB Performance by Weather
Level 3 | Comparative Analysis
Test whether a specific QB performs differently in cold weather:
1. Define "cold" as temperature < 40°F
2. Compare EPA in cold vs normal conditions
3. Perform appropriate hypothesis test
4. Account for sample size in your conclusions
# Your code here
Exercise 4.3: Receiver Comparison
Level 3 | Pairwise Comparison
Compare two receivers' catch rates:
1. Calculate catch rate and targets for each
2. Test whether the difference is significant (z-test for proportions)
3. Calculate the confidence interval for the difference
# Your code here
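The two-proportion z-test named in Exercise 4.3 can be sketched as follows. It uses the pooled standard error for the test and the unpooled standard error for the confidence interval, which is a common convention. The catch and target counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(catches_1, targets_1, catches_2, targets_2):
    """Two-sided z-test for H0: the two catch rates are equal."""
    p1 = catches_1 / targets_1
    p2 = catches_2 / targets_2
    pooled = (catches_1 + catches_2) / (targets_1 + targets_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / targets_1 + 1 / targets_2))
    z = (p1 - p2) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def difference_ci(catches_1, targets_1, catches_2, targets_2, level=0.95):
    """Wald confidence interval for p1 - p2 (unpooled standard error)."""
    p1 = catches_1 / targets_1
    p2 = catches_2 / targets_2
    std_err = math.sqrt(p1 * (1 - p1) / targets_1 + p2 * (1 - p2) / targets_2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z_crit * std_err, diff + z_crit * std_err


# Hypothetical receivers: 70 catches on 100 targets vs 60 on 100.
z, p_value = two_proportion_ztest(70, 100, 60, 100)
ci_low, ci_high = difference_ci(70, 100, 60, 100)
print(f"z = {z:.3f}, p = {p_value:.3f}")
print(f"95% CI for the difference: ({ci_low:.3f}, {ci_high:.3f})")
```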
Exercise 4.4: Multiple Comparisons
Level 4 | Multiple Testing
Compare all 32 teams' offensive EPA to league average:
1. Perform 32 one-sample t-tests
2. Apply Bonferroni correction
3. Apply Benjamini-Hochberg FDR correction
4. How many teams are significantly different with each method?
# Your code here
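Both corrections in Exercise 4.4 operate on a list of p-values, so they can be sketched independently of how the t-tests are run. The p-values in this example are made up for illustration. In the exercise they would come from the 32 one-sample tests.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling FDR at level alpha.

    Find the largest rank k with p_(k) <= (k / m) * alpha, then reject the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            reject[index] = True
    return reject


# Made-up p-values for illustration only.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

print("Bonferroni rejections:", sum(bonferroni_reject(toy_p_values)))
print("BH rejections:", sum(benjamini_hochberg_reject(toy_p_values)))
```

Bonferroni controls the family-wise error rate and is the stricter of the two. Benjamini-Hochberg controls the false discovery rate and will reject at least as many hypotheses.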
Exercise 4.5: A/B Testing Simulation
Level 4 | Power Analysis
Design an experiment to detect play-calling differences:
1. Calculate the effect size you want to detect
2. Determine required sample size for 80% power
3. How many games would this require?
4. Simulate data to validate your power calculation
# Your code here
Section 5: Regression (Level 3-5)
Exercise 5.1: Simple Linear Regression
Level 3 | Linear Regression
Build a regression model predicting EPA from air yards:
1. Fit a simple linear regression
2. Interpret the slope coefficient
3. Calculate R-squared
4. Plot the regression line with the data
# Your code here
Exercise 5.2: Multiple Regression
Level 3 | Multiple Features
Predict EPA using multiple features:
1. Include: down, ydstogo, yardline_100, score_differential, pass
2. Fit the model and interpret coefficients
3. Which features are most important?
4. Check for multicollinearity
# Your code here
Exercise 5.3: Logistic Regression
Level 4 | Classification
Build a model to predict play success:
1. Fit logistic regression with game state features
2. Interpret coefficients as odds ratios
3. Calculate accuracy, precision, and recall
4. Plot the ROC curve
# Your code here
Exercise 5.4: Regularized Regression
Level 4 | Regularization
Compare Ridge and Lasso regression:
1. Fit both models with alpha = 1.0
2. Compare coefficient sizes
3. Which features are "zeroed out" by Lasso?
4. Cross-validate to find optimal alpha
# Your code here
Exercise 5.5: Regression Diagnostics
Level 5 | Model Validation
Perform comprehensive regression diagnostics:
1. Plot residuals vs fitted values
2. Check for heteroscedasticity
3. Test for normality of residuals
4. Identify influential observations (Cook's distance)
# Your code here
Section 6: Advanced Topics (Level 4-5)
Exercise 6.1: Bootstrap Confidence Intervals
Level 4 | Resampling
Calculate bootstrap confidence intervals:
1. For a QB's EPA per play, bootstrap 1000 samples
2. Calculate 95% CI using percentile method
3. Compare to the t-interval
4. When would bootstrap be preferred?
# Your code here
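The percentile bootstrap in Exercise 6.1 resamples plays with replacement, recomputes the mean each time, and reads the interval off the sorted resampled means. This sketch uses only the standard library. The function name, the fixed seed, and the toy EPA values are illustrative choices.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    alpha = 1 - level
    low_index = round(alpha / 2 * n_boot)
    high_index = round((1 - alpha / 2) * n_boot) - 1
    return boot_means[low_index], boot_means[high_index]


# Toy per-play EPA values for one hypothetical QB (not real data).
toy_epa = [0.5, -0.3, 1.2, 0.1, -0.8, 0.9, 0.0, 0.4, -1.5, 2.1, 0.3, -0.2]

ci_low, ci_high = bootstrap_mean_ci(toy_epa)
print(f"sample mean = {statistics.fmean(toy_epa):.3f}")
print(f"95% bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Resampling individual plays treats them as independent. If plays within a game or drive are correlated, resampling whole games is a variation worth considering.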
Exercise 6.2: Permutation Testing
Level 4 | Non-Parametric Test
Use permutation testing to compare two QBs:
1. Create null distribution by permuting labels
2. Calculate p-value from permutation distribution
3. Compare to t-test result
4. When is permutation testing more appropriate?
# Your code here
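A label-permutation test for Exercise 6.2 can be sketched as below. It uses the difference in mean EPA as the test statistic and a two-sided p-value with the common "+1" adjustment so the estimate is never exactly zero. The toy samples and the function name are hypothetical.

```python
import random
import statistics


def permutation_test(sample_a, sample_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)

    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the QB labels
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Adding 1 to numerator and denominator counts the observed labeling itself.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy per-play EPA values for two hypothetical QBs (not real data).
qb_a = [0.6, 0.2, 1.1, -0.1, 0.8, 0.4, 0.9, 0.3]
qb_b = [0.1, -0.4, 0.5, -0.6, 0.2, -0.2, 0.3, -0.3]

observed, p_value = permutation_test(qb_a, qb_b)
print(f"observed difference = {observed:.3f}, permutation p = {p_value:.4f}")
```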
Exercise 6.3: Sample Size Calculation
Level 4 | Power Analysis
For a QB comparison study:
1. Assume you want to detect 0.1 EPA/play difference
2. Calculate required sample size for 80% power
3. How many games does this represent?
4. What if you only have 100 plays per QB?
# Your code here
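For Exercise 6.3, the usual normal-approximation formula for comparing two means is $n = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2 / \delta^2$ per group. The sketch below implements it along with the matching approximate power for a given sample size. The per-play EPA standard deviation of 1.5 is a placeholder assumption. Estimate it from your own data before trusting the numbers.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Plays needed per QB to detect a mean difference of delta (two-sided)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def approximate_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test with n_per_group plays per QB.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    std_err = sigma * math.sqrt(2 / n_per_group)
    return _STD_NORMAL.cdf(delta / std_err - z_alpha)


DELTA = 0.1       # difference to detect, from the exercise
SIGMA_EPA = 1.5   # ASSUMPTION: placeholder per-play EPA standard deviation

n_needed = required_n_per_group(DELTA, SIGMA_EPA)
power_at_100 = approximate_power(DELTA, SIGMA_EPA, n_per_group=100)
print(f"plays per QB for 80% power: {n_needed}")
print(f"approximate power with 100 plays per QB: {power_at_100:.3f}")
```

Dividing `n_needed` by a typical number of dropbacks per game converts the answer into games for step 3.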
Exercise 6.4: Causal Inference Challenge
Level 5 | Causal Analysis
Address the causal question: "Does passing more lead to winning?"
1. Calculate correlation between pass rate and wins
2. Discuss why correlation ≠ causation here
3. What confounders might exist?
4. Propose a quasi-experimental approach
# Your code here
Exercise 6.5: Bayesian Analysis
Level 5 | Bayesian Statistics
Perform Bayesian estimation of a kicker's true FG percentage:
1. Use Beta prior (based on historical data)
2. Update with current season makes/attempts
3. Calculate posterior mean and 95% credible interval
4. Compare to frequentist confidence interval
# Your code here
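Because the Beta distribution is conjugate to the binomial, the update in Exercise 6.5 reduces to adding makes and misses to the prior parameters. In this sketch the Beta(85, 15) prior and the 25-for-30 season are hypothetical numbers. The credible interval is approximated by Monte Carlo draws because the standard library has no Beta quantile function.

```python
import random


def beta_posterior(prior_alpha, prior_beta, makes, attempts):
    """Conjugate update: Beta(a, b) prior -> Beta(a + makes, b + misses)."""
    return prior_alpha + makes, prior_beta + (attempts - makes)


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def credible_interval(alpha, beta, level=0.95, n_draws=20000, seed=0):
    """Equal-tailed credible interval estimated from Monte Carlo draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    tail = (1 - level) / 2
    low_index = round(tail * n_draws)
    high_index = round((1 - tail) * n_draws) - 1
    return draws[low_index], draws[high_index]


# ASSUMPTION: hypothetical prior (roughly 85% on 100 pseudo-attempts)
# and a hypothetical season of 25 makes on 30 attempts.
post_alpha, post_beta = beta_posterior(85, 15, makes=25, attempts=30)
post_mean = posterior_mean(post_alpha, post_beta)
ci_low, ci_high = credible_interval(post_alpha, post_beta)

print(f"posterior: Beta({post_alpha}, {post_beta}), mean = {post_mean:.3f}")
print(f"95% credible interval = ({ci_low:.3f}, {ci_high:.3f})")
```

If SciPy is available, `scipy.stats.beta.ppf` gives the interval endpoints directly without simulation.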
Submission Guidelines
For each exercise:
1. Include all code with comments
2. Show all statistical output
3. Provide written interpretation
4. Discuss practical significance, not just statistical significance
Grading Rubric
| Level | Points | Focus |
|---|---|---|
| 1 | 2 each | Correct calculation |
| 2 | 3 each | Calculation + interpretation |
| 3 | 4 each | Analysis quality + context |
| 4 | 5 each | Model building + validation |
| 5 | 6 each | Research quality + communication |