Exercises: Descriptive Statistics in Football
These exercises build your skills in calculating and interpreting statistical measures using football data.
Level 1: Central Tendency
Exercise 1.1: Basic Mean and Median
A quarterback has the following passer ratings for an 8-game stretch:
| Game | Rating |
|---|---|
| 1 | 95.2 |
| 2 | 82.4 |
| 3 | 110.5 |
| 4 | 78.3 |
| 5 | 145.2 |
| 6 | 88.9 |
| 7 | 92.1 |
| 8 | 85.6 |
a) Calculate the mean passer rating
b) Calculate the median passer rating
c) Which measure better represents "typical" performance? Why?
d) If Game 5 (145.2) were removed, how would each measure change?
Exercise 1.2: Comparing Players
Two running backs have the following rushing yards per game:
RB1: 85, 92, 78, 88, 95, 82, 90, 87
RB2: 45, 150, 62, 130, 55, 125, 70, 143
a) Calculate mean rushing yards for each player
b) Calculate median rushing yards for each player
c) Which player is more consistent from game to game? Use statistics to support your answer.
d) Which player would you rather have? Consider both the mean and the median.
Exercise 1.3: Mode in Football
Analyze the following play-calling tendencies:
plays = ["Pass", "Rush", "Pass", "Pass", "Rush", "Pass", "Rush",
         "Pass", "Pass", "Rush", "Pass", "Punt", "Pass", "Rush",
         "Pass", "Pass", "Rush", "Pass", "Pass", "Pass"]
a) What is the mode of play types?
b) What percentage of plays are passes?
c) What is the pass/rush ratio?
d) Would knowing only the mode help predict the next play? Why or why not?
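As a starting point, here is a minimal sketch of counting categorical outcomes with the standard library. The play log `sample_plays` is a made-up six-play example, not the exercise data, and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical six-play log (not the exercise data)
sample_plays = ["Pass", "Rush", "Pass", "Punt", "Pass", "Rush"]

counts = Counter(sample_plays)

# most_common(1) returns a list holding the single most frequent (value, count) pair
mode_play, mode_count = counts.most_common(1)[0]

pass_share = counts["Pass"] / len(sample_plays)     # proportion of all plays
pass_rush_ratio = counts["Pass"] / counts["Rush"]   # passes per rush

print(mode_play, mode_count)
print(f"{pass_share:.1%}", pass_rush_ratio)
```

The mode only names the most frequent category; the proportion and ratio say how dominant it is, which is the distinction part d) asks about.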
Exercise 1.4: Weighted Average
A quarterback plays 3 games with the following stats:
| Game | Attempts | Completion % |
|---|---|---|
| 1 | 25 | 68.0% |
| 2 | 40 | 62.5% |
| 3 | 35 | 71.4% |
a) Calculate the simple average of completion percentages
b) Calculate the weighted average (by attempts)
c) Why is the weighted average more accurate?
d) Calculate total completions and total attempts to verify
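The difference between the two averages can be sketched as follows. The function name `simple_and_weighted` and the two-game sample are assumptions made for illustration; they are not the exercise data.

```python
def simple_and_weighted(percentages, weights):
    """Return (simple mean, weighted mean) of the percentages."""
    simple = sum(percentages) / len(percentages)
    weighted = sum(p * w for p, w in zip(percentages, weights)) / sum(weights)
    return simple, weighted

# Hypothetical sample: 50% on 10 attempts, then 70% on 30 attempts
simple, weighted = simple_and_weighted([50.0, 70.0], [10, 30])
print(simple, weighted)
```

In this sample the weighted figure is pulled toward the game with more attempts, which is the same as dividing total completions (5 + 21) by total attempts (40).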
Exercise 1.5: Trimmed Mean
Consider these yards gained on individual rushing attempts (including outliers):
2, 3, 4, 3, 5, 75, 4, 3, 2, 4, 3, -2, 3, 4, 5, 3, 2, 4, 3, 50
a) Calculate the regular mean
b) Calculate a 10% trimmed mean (remove top and bottom 10%)
c) Calculate a 20% trimmed mean
d) Which measure best represents typical rushing performance?
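One possible way to implement a trimmed mean by hand is sketched below. The helper `trimmed_mean` is a hypothetical name, and it assumes the trim count per end is rounded down; SciPy's `scipy.stats.trim_mean` offers a library version if you prefer one. The sample values are made up.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))       # values dropped per end, rounded down
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion is too large; nothing left to average")
    return sum(kept) / len(kept)

sample = [1, 2, 3, 4, 100]                   # hypothetical, one large outlier
print(sum(sample) / len(sample))             # regular mean: 22.0
print(trimmed_mean(sample, 0.2))             # drops 1 and 100: 3.0
```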
Level 2: Variability
Exercise 2.1: Standard Deviation
Calculate the standard deviation for these weekly point totals:
Team A: 28, 31, 24, 35, 27, 30, 33, 26
Team B: 42, 14, 38, 21, 45, 17, 40, 23
a) Calculate mean and standard deviation for each team
b) Which team is more consistent?
c) Calculate the coefficient of variation for each
d) If you need a team to score at least 25 points to win, which team is safer?
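A rough sketch of the two quantities behind parts c) and d): the coefficient of variation (standard deviation divided by the mean) and the share of games at or above a threshold. Both function names are illustrative, the sample is made up, and the sketch assumes the sample standard deviation (dividing by n - 1).

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def share_at_least(values, threshold):
    """Fraction of observations at or above the threshold."""
    return sum(v >= threshold for v in values) / len(values)

sample = [20, 30, 40]                        # hypothetical weekly point totals
print(coefficient_of_variation(sample))      # 10 / 30
print(share_at_least(sample, 25))            # 2 of 3 games
```

Because the CV is unitless, it lets you compare consistency between teams whose scoring averages differ.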
Exercise 2.2: Percentiles and Quartiles
A receiver has the following receiving yards for 12 games:
45, 72, 38, 95, 120, 55, 82, 110, 65, 48, 88, 75
a) Calculate the five-number summary (min, Q1, median, Q3, max)
b) Calculate the interquartile range (IQR)
c) What is the 90th percentile performance?
d) How many games were above the 75th percentile?
Exercise 2.3: Consistency Analysis
Two kickers have the following field goal percentages by distance:
Kicker A (FG%)
- 0-29 yards: 100%, 100%, 100%, 100%
- 30-39 yards: 85%, 90%, 80%, 88%
- 40-49 yards: 70%, 75%, 65%, 72%
- 50+ yards: 50%, 40%, 60%, 45%
Kicker B (FG%)
- 0-29 yards: 100%, 100%, 100%, 100%
- 30-39 yards: 100%, 100%, 100%, 100%
- 40-49 yards: 60%, 55%, 65%, 58%
- 50+ yards: 30%, 25%, 35%, 28%
a) Calculate mean FG% for each distance range for both kickers
b) Calculate standard deviation for each range
c) Which kicker is more reliable at each distance?
d) If you need a 45-yard field goal to win, who do you choose?
Exercise 2.4: Range vs IQR
Consider these attendance figures (in thousands):
72, 85, 78, 92, 88, 15, 95, 82, 79, 90, 86, 110
a) Calculate the range
b) Calculate the IQR
c) Why might IQR be preferred here?
d) Identify any outliers using the IQR method
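The IQR method can be sketched with the standard library as below. The function `iqr_outliers` is a hypothetical helper and the sample is made up. It assumes the "inclusive" quartile method, which should agree with NumPy's default linear interpolation; other quartile conventions can give slightly different fences, so state which one you used.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

sample = [10, 12, 13, 14, 15, 16, 17, 18, 40]   # hypothetical values
print(iqr_outliers(sample))                      # Q1 = 13, Q3 = 17, fences 7 and 23
```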
Exercise 2.5: Variance Decomposition
A team's weekly scores are: 28, 35, 21, 42, 31
a) Calculate the mean
b) For each score, calculate (score - mean)
c) Calculate (score - mean)²
d) Sum the squared deviations and divide by n to get the (population) variance
e) Take the square root to get standard deviation
f) Verify using NumPy's np.var and np.std functions, which divide by n by default
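The steps above can be sketched in plain Python as follows, using made-up values rather than the exercise scores. The sketch also computes the n - 1 (sample) variance for contrast, since some libraries (for example `statistics.variance` and pandas) use that divisor by default.

```python
import math

sample_scores = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical values

n = len(sample_scores)
mean = sum(sample_scores) / n                      # step a
deviations = [x - mean for x in sample_scores]     # step b
squared = [d ** 2 for d in deviations]             # step c
population_variance = sum(squared) / n             # step d: divide by n
population_std = math.sqrt(population_variance)    # step e
sample_variance = sum(squared) / (n - 1)           # alternative: divide by n - 1

print(mean, population_variance, population_std)   # 5.0 4.0 2.0
print(sample_variance)
```

If a library result does not match your hand calculation, the divisor is the first thing to check.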
Level 3: Distributions
Exercise 3.1: Skewness Analysis
Analyze the distribution of play gains:
import numpy as np
np.random.seed(42)
# Rushing plays (mostly short gains, occasional big plays)
rushing = np.concatenate([
    np.random.normal(3.5, 2, 80),    # Normal gains
    np.random.uniform(15, 60, 10),   # Explosive plays
    np.random.uniform(-5, 0, 10)     # Negative plays
])
a) Calculate mean, median, and standard deviation
b) Calculate skewness
c) Is the distribution symmetric, right-skewed, or left-skewed?
d) What does this tell us about rushing plays?
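For reference, skewness and excess kurtosis (used again in Exercise 3.5) can be written directly from central moments. This is a minimal sketch using the population (biased) definitions, which are also the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`; other tools may apply a bias correction and return slightly different numbers. The function names and sample lists are illustrative.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by variance squared, minus 3 (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

print(skewness([1, 2, 3, 4, 5]))          # symmetric data: 0.0
print(skewness([1, 1, 1, 1, 10]))         # long right tail: positive
print(excess_kurtosis([1, 2, 3, 4, 5]))   # flatter than normal: negative
```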
Exercise 3.2: Outlier Detection
Team scoring by game: 28, 35, 21, 42, 17, 82, 31, 24, 38, 33
a) Calculate mean and standard deviation
b) Identify outliers using z-scores (threshold = 2.0)
c) Identify outliers using IQR method (1.5 × IQR)
d) Do both methods identify the same outliers?
e) Should the outlier be removed for analysis? Discuss.
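A minimal sketch of the z-score approach, assuming the sample standard deviation and a made-up data set. The helper name `z_score_outliers` is hypothetical.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)     # sample standard deviation (n - 1)
    return [v for v in values if abs(v - mean) / std > threshold]

sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]   # hypothetical values
print(z_score_outliers(sample))
```

Note that an extreme value inflates both the mean and the standard deviation it is measured against, which is one reason the z-score and IQR methods can disagree on small samples.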
Exercise 3.3: Distribution Comparison
Compare the scoring distributions of two conferences:
Conference A: 24, 31, 28, 35, 27, 42, 30, 33, 29, 38
Conference B: 17, 45, 21, 52, 24, 48, 28, 42, 19, 55
a) Calculate mean, median, and standard deviation for each
b) Calculate skewness for each
c) Which conference has more consistent scoring?
d) Which conference produces more extreme scores?
Exercise 3.4: Bimodal Detection
A running back gained the following yards on individual carries:
-2, 1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 15, 18, 22, 25, 28, 32
a) Calculate mean and median
b) Create a frequency distribution (bins: -5 to 0, 0 to 5, 5 to 10, 10 to 20, 20 to 35; count a value that falls on a bin edge in the higher bin)
c) Is this distribution unimodal or bimodal?
d) What does this tell us about the running back's style?
Exercise 3.5: Kurtosis Interpretation
Calculate kurtosis for two quarterbacks' completion percentages:
QB1: 62, 65, 63, 64, 66, 63, 65, 64, 62, 65 (consistent)
QB2: 45, 75, 52, 78, 48, 72, 55, 80, 50, 70 (boom-or-bust)
a) Calculate mean for each (they should be similar)
b) Calculate standard deviation
c) Calculate kurtosis
d) Which QB has higher kurtosis and what does it mean?
Level 4: Correlation
Exercise 4.1: Basic Correlation
Calculate the correlation between rushing yards and points scored:
| Game | Rush Yards | Points |
|---|---|---|
| 1 | 150 | 28 |
| 2 | 95 | 17 |
| 3 | 180 | 35 |
| 4 | 120 | 24 |
| 5 | 165 | 31 |
| 6 | 140 | 27 |
| 7 | 200 | 42 |
| 8 | 110 | 21 |
a) Calculate Pearson correlation coefficient
b) Interpret the strength and direction
c) Does rushing success cause scoring? Discuss correlation vs causation.
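If you want to see what the coefficient is made of, here is a small hand-rolled sketch. The function name `pearson_r` and the toy data are assumptions for illustration; `np.corrcoef` gives the same quantity for real work.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]                        # toy data
print(pearson_r(x, [2, 4, 6, 8, 10]))      # perfect increasing line: 1.0
print(pearson_r(x, [5, 4, 3, 2, 1]))       # perfect decreasing line: -1.0
print(pearson_r(x, [4, 1, 0, 1, 4]))       # symmetric U-shape: 0.0
```

The last line shows that a coefficient near zero rules out a linear relationship only, not a relationship of any kind.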
Exercise 4.2: Correlation Matrix
Create and analyze a correlation matrix for these team statistics:
team_stats = {
    "games": list(range(1, 11)),
    "points": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "rush_yards": [150, 120, 180, 95, 165, 140, 200, 110, 175, 130],
    "pass_yards": [280, 250, 310, 220, 275, 260, 290, 240, 300, 265],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1],
    "penalties": [5, 8, 4, 9, 6, 7, 3, 8, 5, 6]
}
a) Calculate the correlation matrix
b) Which statistic correlates most strongly with points?
c) Which statistic correlates most negatively with points?
d) Are rushing yards and passing yards correlated? Interpret.
Exercise 4.3: Spurious Correlation
Consider these (fictional) statistics:
| Year | Team Wins | Ice Cream Sales (millions) |
|---|---|---|
| 2018 | 10 | 2.5 |
| 2019 | 8 | 2.0 |
| 2020 | 6 | 1.5 |
| 2021 | 11 | 2.8 |
| 2022 | 9 | 2.3 |
a) Calculate the correlation
b) Does this mean ice cream sales cause wins?
c) What might explain this correlation?
d) How can you avoid spurious correlations in analysis?
Exercise 4.4: Partial Correlation
Points and total yards are correlated, but does that relationship hold up after controlling for turnovers?
data = {
    "points": [35, 24, 42, 17, 38, 31, 45, 21, 40, 28],
    "total_yards": [430, 370, 490, 315, 440, 400, 490, 350, 475, 395],
    "turnovers": [0, 2, 0, 3, 1, 1, 0, 2, 0, 1]
}
a) Calculate correlation between points and total yards
b) Calculate correlation between turnovers and points
c) Discuss: If turnovers affect both yards and points, how might this impact our interpretation?
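To go one step beyond the discussion in part c), the standard first-order partial correlation formula combines the three pairwise coefficients. The sketch below assumes you have already computed those coefficients; the function name and the example inputs are hypothetical and are not results from the exercise data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z from both."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical inputs: x and y correlate at 0.80, but each correlates 0.90 with z
print(partial_correlation(0.80, 0.90, 0.90))   # close to zero once z is controlled
# If z is unrelated to both, the partial correlation equals the raw one
print(partial_correlation(0.50, 0.0, 0.0))
```

A large drop from the raw to the partial coefficient suggests the third variable accounts for much of the apparent relationship.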
Exercise 4.5: Non-Linear Relationships
Consider time of possession (TOP) and winning:
import numpy as np
top_minutes = np.array([22, 25, 28, 30, 32, 35, 38, 40])
point_margin = np.array([-14, -3, 5, 10, 8, 2, -5, -12]) # Positive = win
a) Calculate the correlation
b) Why might this correlation be weak?
c) Sketch the relationship: what shape does it appear to have?
d) What's the optimal time of possession based on this data?
Level 5: Advanced Applications
Exercise 5.1: Team Profile Builder
Create a complete statistical profile for this team's season:
season_data = {
    "week": list(range(1, 13)),
    "opponent": ["Duke", "Texas", "Arkansas", "Ole Miss", "Vandy", "Tennessee",
                 "LSU", "Missouri", "Kentucky", "Auburn", "Georgia", "Alabama"],
    "points_for": [52, 28, 45, 31, 55, 24, 32, 42, 38, 27, 17, 35],
    "points_against": [10, 24, 17, 28, 7, 21, 28, 14, 10, 24, 41, 28],
    "rush_yards": [245, 120, 198, 145, 267, 98, 156, 178, 189, 134, 87, 165],
    "pass_yards": [285, 310, 245, 298, 195, 340, 278, 265, 245, 289, 198, 275],
    "turnovers": [0, 2, 1, 0, 0, 2, 1, 0, 0, 1, 3, 1]
}
Build a profile including:
a) Win-loss record
b) Scoring: mean, median, std for offense and defense
c) Yardage: rushing vs passing balance and consistency
d) Turnover analysis
e) Home vs road performance (assume first 6 are home)
Exercise 5.2: Player Comparison Tool
Compare two quarterbacks using z-scores:
QB1:
- Passing yards/game: 285 (league avg: 250, std: 45)
- TD/game: 2.1 (league avg: 1.8, std: 0.6)
- INT/game: 0.8 (league avg: 1.0, std: 0.4)
- Completion %: 68% (league avg: 63%, std: 5%)
QB2:
- Passing yards/game: 310 (same league stats)
- TD/game: 2.5
- INT/game: 1.4
- Completion %: 61%
a) Calculate z-scores for each stat for both QBs
b) Create a composite score (equal weights, invert INT)
c) Which QB is statistically better overall?
d) Which QB is better for a team that needs to avoid turnovers?
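One way to organize the composite is sketched below. It reuses the league averages and standard deviations given above, but the player `hypothetical_qb`, the dictionary layout, and the function names are assumptions for illustration, not QB1 or QB2.

```python
def z_score(value, mean, std):
    """Number of standard deviations a value sits above the mean."""
    return (value - mean) / std

# stat: (league mean, league std, True if a higher value is better)
league = {
    "yards": (250, 45, True),
    "td": (1.8, 0.6, True),
    "int": (1.0, 0.4, False),      # interceptions: lower is better, so invert
    "comp_pct": (63, 5, True),
}

def composite_score(player):
    """Equal-weight average of z-scores, with 'lower is better' stats negated."""
    total = 0.0
    for stat, (mean, std, higher_is_better) in league.items():
        z = z_score(player[stat], mean, std)
        total += z if higher_is_better else -z
    return total / len(league)

# Hypothetical player sitting exactly one std above average in every stat
hypothetical_qb = {"yards": 295, "td": 2.4, "int": 1.4, "comp_pct": 68}
print(composite_score(hypothetical_qb))
```

Being one standard deviation high in interceptions counts against this player, so the composite comes out near 0.5 rather than 1.0.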
Exercise 5.3: Efficiency Metrics
Create an efficiency index combining multiple statistics:
For 8 teams, calculate:
- Points per play (PPP)
- Yards per play (YPP)
- Success rate (successful plays / total plays)
- Turnover rate (turnovers / plays)
teams = {
    "Team A": {"points": 420, "yards": 5400, "plays": 850, "successful": 425, "turnovers": 12},
    "Team B": {"points": 380, "yards": 4900, "plays": 820, "successful": 380, "turnovers": 18},
    "Team C": {"points": 350, "yards": 5100, "plays": 800, "successful": 400, "turnovers": 8},
    # Add 5 more teams with varied stats
}
a) Calculate each metric for all teams
b) Standardize each metric (z-scores)
c) Create composite efficiency score
d) Rank teams by efficiency
e) Compare to simple points ranking
Exercise 5.4: Trend Analysis
Analyze this team's scoring trend over the season:
weeks = list(range(1, 13))
points = [28, 35, 31, 24, 38, 42, 45, 35, 48, 52, 55, 49]
a) Calculate 3-game rolling average
b) Calculate cumulative average after each game
c) Is the team improving, declining, or stable?
d) Calculate correlation between week number and points
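Rolling and cumulative averages can be sketched in plain Python as follows. The function names and sample values are illustrative; in pandas, `Series.rolling(window).mean()` and `Series.expanding().mean()` do the equivalent work.

```python
def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values (first full window onward)."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def cumulative_mean(values):
    """Mean of all values seen so far, after each new value."""
    means, running_total = [], 0
    for count, value in enumerate(values, start=1):
        running_total += value
        means.append(running_total / count)
    return means

sample = [10, 20, 30, 40, 50]          # hypothetical weekly points
print(rolling_mean(sample, 3))         # [20.0, 30.0, 40.0]
print(cumulative_mean(sample))         # [10.0, 15.0, 20.0, 25.0, 30.0]
```

The rolling average reacts to recent form, while the cumulative average changes more slowly as the season goes on.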
Exercise 5.5: Conference Analysis
Compare two conferences using comprehensive statistics:
Conference A (8 teams, per-game season averages):
PPG: 28.5, 32.1, 25.4, 35.2, 29.8, 27.3, 31.5, 26.2
YPG: 385, 420, 355, 445, 398, 372, 415, 368
Conference B (8 teams, per-game season averages):
PPG: 31.2, 24.5, 38.7, 22.1, 35.8, 28.4, 33.2, 26.8
YPG: 402, 348, 465, 325, 438, 378, 425, 355
a) Calculate mean, std, and CV for PPG in each conference
b) Calculate mean, std, and CV for YPG in each conference
c) Which conference is more balanced (lower variation)?
d) Which conference has higher average production?
e) Calculate correlation between PPG and YPG within each conference
Challenge Problems
Challenge 1: Predictive Validity
You want to know which first-half statistics best predict final margin of victory.
Given data for 20 games with first-half stats (points, yards, turnovers) and final margins:
a) Calculate correlations between each first-half stat and final margin
b) Which stat is most predictive?
c) Create a simple predictive model using the best stat
d) How accurate is your prediction?
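For parts c) and d), one simple option is a least-squares line with mean absolute error as the accuracy measure. This is only a sketch under that assumption; the function names and the four data points are made up, and other models or error measures are equally valid answers.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute gap between predicted and actual y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical first-half stat (arbitrary units) and final margins
first_half_stat = [1, 2, 3, 4]
final_margin = [2, 6, 6, 10]

slope, intercept = fit_line(first_half_stat, final_margin)
print(slope, intercept)
print(mean_absolute_error(first_half_stat, final_margin, slope, intercept))
```

Measuring error on the same games used to fit the line flatters the model, so consider holding some games back when you judge accuracy.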
Challenge 2: Strength of Schedule
Calculate strength of schedule using opponent statistics:
For each of 12 opponents, you have their season statistics:
- Win percentage
- Points per game
- Points allowed per game
a) Calculate average opponent win percentage
b) Calculate weighted average (by how well you did against each)
c) Calculate your opponents' opponents' strength
d) Create a composite SOS metric
Challenge 3: Era Adjustment
Compare a 1970s team to a modern team using standardization:
1975 Team: 28.5 PPG (era average: 21.0, std: 5.2)
2023 Team: 35.2 PPG (era average: 30.5, std: 7.8)
a) Calculate z-scores for each team within their era
b) Which team was more dominant relative to their competition?
c) What are the limitations of this comparison?
Solutions
Solutions are provided in code/exercise-solutions.py. Complete the exercises before checking solutions.
Grading Rubric
| Level | Points | Description |
|---|---|---|
| Level 1 | 20 | Central tendency |
| Level 2 | 25 | Variability |
| Level 3 | 20 | Distributions |
| Level 4 | 20 | Correlation |
| Level 5 | 15 | Advanced applications |
Total: 100 points