Exercises: Exploratory Data Analysis for Football
Difficulty Levels
- Level 1 (Foundational): Basic EDA concepts and simple visualizations
- Level 2 (Applied): Standard EDA workflows with football data
- Level 3 (Intermediate): Multi-step analysis with multiple variables
- Level 4 (Advanced): Complex analysis requiring creative approaches
- Level 5 (Expert): Open-ended research questions
Section 1: Data Inspection (Level 1-2)
Exercise 1.1: Column Exploration
Level 1 | Data Inspection
Load the 2023 play-by-play data and answer:
1. How many total columns exist?
2. How many columns contain the word "player"?
3. How many columns contain the word "epa"?
4. What percentage of columns are numeric vs object type?
# Your code here
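Hint: a minimal sketch of the counting pattern on a toy frame, not a reference solution. The DataFrame `toy` and its column names are made up for illustration; with the real data you would apply the same expressions to the loaded play-by-play frame.

```python
import pandas as pd

# Toy stand-in for play-by-play data (hypothetical columns and values).
toy = pd.DataFrame({
    "passer_player_name": ["A", "B"],
    "receiver_player_name": ["C", None],
    "epa": [0.5, -0.2],
    "qb_epa": [0.4, -0.1],
    "down": [1, 2],
})

n_cols = toy.shape[1]

# Substring match on column names.
n_player = sum("player" in col for col in toy.columns)
n_epa = sum("epa" in col for col in toy.columns)

# Share of numeric columns, in percent.
numeric_pct = toy.select_dtypes(include="number").shape[1] / n_cols * 100

print(n_cols, n_player, n_epa, numeric_pct)
```

Note that a substring match on "epa" also counts columns such as `qb_epa`, which is usually what the question intends.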
Exercise 1.2: Missing Data Analysis
Level 1 | Data Quality
For the 2023 season:
1. Which 10 columns have the most missing values?
2. What percentage of epa values are missing?
3. Why might receiver_player_name have more missing values than passer_player_name?
# Your code here
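Hint: one possible way to summarize missing values, shown on a toy frame with invented values. Only the pattern (`isna`, `sum`, `mean`) carries over to the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for play-by-play data (hypothetical values).
toy = pd.DataFrame({
    "epa": [0.3, np.nan, -0.5, 1.2],
    "passer_player_name": ["A", "A", None, "B"],
    "receiver_player_name": ["X", None, None, None],
})

# Missing-value count per column, largest first; head(10) keeps the top 10.
missing_counts = toy.isna().sum().sort_values(ascending=False)
top_missing = missing_counts.head(10)

# The mean of a boolean mask is the fraction of True values.
epa_missing_pct = toy["epa"].isna().mean() * 100

print(top_missing)
print(f"epa missing: {epa_missing_pct:.1f}%")
```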
Exercise 1.3: Play Type Distribution
Level 1 | Categorical Analysis
Create a bar chart showing the distribution of play_type values:
1. What are the top 5 most common play types?
2. What percentage of plays are passes vs runs?
3. What percentage are special teams plays?
# Your code here
Exercise 1.4: Game Coverage Check
Level 2 | Data Validation
Validate that the data covers a complete season:
1. How many unique games exist?
2. How many games per team?
3. Are there any teams with fewer than 17 games of data?
4. What's the date range of the data?
# Your code here
Exercise 1.5: Data Quality Report
Level 2 | Documentation
Write a function that generates a data quality report including:
1. Total rows and columns
2. Missing data summary
3. Duplicate row check
4. Date range covered
5. Team coverage summary
def generate_data_quality_report(pbp: pd.DataFrame) -> dict:
    """Generate comprehensive data quality report."""
    # Your code here
    pass
Section 2: Univariate Analysis (Level 2-3)
Exercise 2.1: EPA Distribution
Level 2 | Distributions
Analyze the EPA distribution for pass plays:
1. Create a histogram with 50 bins
2. Calculate mean, median, and standard deviation
3. What is the skewness of the distribution?
4. What percentage of passes have positive EPA?
# Your code here
Exercise 2.2: Yards Gained Analysis
Level 2 | Distributions
Compare the distribution of yards gained for passes vs runs:
1. Create side-by-side histograms
2. Calculate the 10th, 50th, and 90th percentiles for each
3. Which has higher variance? Why might this be?
# Your code here
Exercise 2.3: Completion Percentage Distribution
Level 2 | Player Metrics
For quarterbacks with at least 200 pass attempts:
1. Calculate each QB's completion percentage
2. Create a histogram of completion percentages
3. What is the league-wide average?
4. How many QBs are above/below average?
# Your code here
Exercise 2.4: Target Distribution
Level 3 | Concentration Analysis
Analyze how targets are distributed among receivers:
1. For each team, calculate the target share of each receiver
2. What's the average target share of the top receiver on each team?
3. Which teams have the most concentrated passing attack?
4. Create a visualization showing target concentration by team
# Your code here
Exercise 2.5: Outlier Detection
Level 3 | Statistical Analysis
Identify outlier performances:
1. Find all individual plays with EPA > 5 or EPA < -5
2. What play types generate extreme EPA values?
3. Create a function to detect and flag outliers using the IQR method
def detect_outliers(series: pd.Series, method: str = 'iqr') -> pd.Series:
    """Detect outliers in a series."""
    # Your code here
    pass
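Hint: a minimal sketch of the IQR rule, assuming the common Tukey convention of flagging values more than 1.5 × IQR outside the quartiles. The helper name `flag_iqr_outliers`, the multiplier `k`, and the sample values are illustrative assumptions, not part of the exercise specification.

```python
import pandas as pd


def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]. Missing values are never flagged."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Hypothetical EPA values with one extreme play.
epa = pd.Series([-0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 6.8])
flags = flag_iqr_outliers(epa)
print(epa[flags])
```

The fixed thresholds in step 1 (EPA > 5 or EPA < -5) and the IQR fences generally flag different sets of plays, which is worth comparing on the real data.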
Section 3: Bivariate Analysis (Level 2-3)
Exercise 3.1: Down and Distance
Level 2 | Relationships
Analyze the relationship between down and EPA:
1. Calculate average EPA by down
2. Create a bar chart comparing EPA across downs
3. Why does EPA tend to decrease on later downs?
# Your code here
Exercise 3.2: Air Yards and Success
Level 2 | Relationships
Explore the relationship between air yards and success rate:
1. Create a scatter plot of air yards vs EPA
2. Bin air yards into 5-yard buckets and calculate success rate for each
3. At what depth does success rate peak?
# Your code here
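Hint: one way to build 5-yard buckets is `pd.cut`. The sketch below uses invented values and assumes a play counts as a success when its EPA is positive; if your data already has a success column, use that instead.

```python
import pandas as pd

# Toy pass plays (hypothetical values).
toy = pd.DataFrame({
    "air_yards": [-2, 3, 4, 7, 8, 12, 14, 22],
    "epa": [0.1, 0.4, -0.3, 0.6, -0.2, 1.1, -0.8, -0.5],
})

# Assumption: success means EPA > 0.
toy["success"] = (toy["epa"] > 0).astype(int)

# Left-closed 5-yard buckets: [-5, 0), [0, 5), ..., [20, 25).
edges = list(range(-5, 30, 5))
toy["air_yards_bin"] = pd.cut(toy["air_yards"], bins=edges, right=False)

# Success rate and sample size per bucket; empty buckets are dropped.
success_by_depth = (
    toy.groupby("air_yards_bin", observed=True)["success"]
    .agg(["mean", "count"])
)
print(success_by_depth)
```

Reporting the count next to the rate helps when judging where success rate "peaks", since deep buckets often contain few plays.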
Exercise 3.3: Score Differential Impact
Level 3 | Game Context
Analyze how score differential affects play-calling and outcomes:
1. Calculate pass rate by score differential (binned)
2. Calculate EPA per play by score differential
3. Create a dual-axis plot showing both relationships
4. At what score differential do teams become "too pass-heavy"?
# Your code here
Exercise 3.4: Time Remaining Effects
Level 3 | Temporal Analysis
Study how play-calling changes with time:
1. Calculate pass rate by quarter
2. Calculate pass rate by half_seconds_remaining (binned into 1-minute intervals)
3. How does two-minute offense differ from normal offense?
# Your code here
Exercise 3.5: Field Position Analysis
Level 3 | Spatial Analysis
Analyze performance by field position:
1. Calculate EPA per play by yardline_100 (binned into 10-yard sections)
2. Where on the field is offense most efficient?
3. Create a heatmap showing EPA by field position and down
# Your code here
Section 4: Team Analysis (Level 3-4)
Exercise 4.1: Team Rankings
Level 3 | Aggregation
Create comprehensive team offensive rankings:
1. Calculate for each team: EPA per play, success rate, explosive play rate
2. Rank teams on each metric
3. Create a composite ranking using all three
4. Which teams rank differently on EPA vs success rate? Why?
# Your code here
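Hint: a sketch of one possible composite, built by averaging the three per-metric ranks. The data is invented, and two definitions are assumptions you should state and justify in your own answer: success as EPA > 0, and an explosive play as a gain of 20 or more yards.

```python
import pandas as pd

# Toy plays for three teams (hypothetical values).
toy = pd.DataFrame({
    "posteam": ["KC", "KC", "KC", "SF", "SF", "SF", "NYJ", "NYJ", "NYJ"],
    "epa": [0.5, -0.1, 1.2, 0.3, 0.2, 0.1, -0.6, 0.1, -0.2],
    "yards_gained": [8, 2, 35, 6, 5, 4, -3, 3, 1],
})
toy["success"] = toy["epa"] > 0             # assumed definition
toy["explosive"] = toy["yards_gained"] >= 20  # assumed threshold

# One row per team with the three metrics.
team_stats = toy.groupby("posteam").agg(
    epa_per_play=("epa", "mean"),
    success_rate=("success", "mean"),
    explosive_rate=("explosive", "mean"),
)

# Rank 1 = best on each metric, then rank the average of the ranks.
ranks = team_stats.rank(ascending=False, method="min")
team_stats["composite_rank"] = ranks.mean(axis=1).rank(method="min")
print(team_stats.sort_values("composite_rank"))
```

Averaging ranks weights the three metrics equally; averaging z-scores instead would preserve how far apart teams are on each metric.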
Exercise 4.2: Offensive Balance
Level 3 | Strategic Analysis
Analyze offensive balance (pass vs run):
1. Calculate pass rate for each team
2. Calculate EPA per play for passes and runs separately
3. Is there a correlation between pass rate and overall offensive efficiency?
4. Which teams most efficiently allocate between pass and run?
# Your code here
Exercise 4.3: Team Profiles
Level 4 | Multivariate
Create visual team profiles:
1. Select 5 key offensive metrics
2. Standardize each metric (z-scores)
3. Create radar charts for the top 5 and bottom 5 teams
4. What patterns distinguish elite offenses from poor ones?
def create_team_radar(team_stats: pd.DataFrame, team: str, metrics: list):
    """Create radar chart for a team."""
    # Your code here
    pass
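Hint: the standardization step can be done column-wise in one expression. The sketch below uses invented team metrics and the population standard deviation (`ddof=0`); using the sample standard deviation instead is equally defensible as long as you are consistent.

```python
import pandas as pd

# Toy team-level metrics (hypothetical values).
team_stats = pd.DataFrame(
    {
        "epa_per_play": [0.15, 0.05, -0.10],
        "success_rate": [0.48, 0.45, 0.40],
    },
    index=["KC", "SF", "NYJ"],
)

# z-score per column: subtract the column mean, divide by the column std.
z = (team_stats - team_stats.mean()) / team_stats.std(ddof=0)
print(z.round(2))
```

After standardization every metric has mean 0 and unit spread, so the axes of a radar chart become comparable across metrics.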
Exercise 4.4: Home Field Advantage
Level 3 | Split Analysis
Quantify home field advantage:
1. Calculate EPA per play at home vs away for each team
2. Which teams have the largest home field advantage?
3. Is home field advantage consistent across seasons?
4. Test whether the difference is statistically significant
# Your code here
Exercise 4.5: Divisional Performance
Level 4 | Comparative Analysis
Compare team performance against division vs non-division opponents:
1. Calculate EPA per play in division games vs out of division
2. Are teams more or less efficient against familiar opponents?
3. Which teams show the biggest splits?
# Your code here
Section 5: Player Analysis (Level 3-4)
Exercise 5.1: QB Efficiency
Level 3 | Player Metrics
Analyze quarterback efficiency:
1. Calculate EPA per dropback for all QBs with 200+ attempts
2. Calculate CPOE (if available) and correlation with EPA
3. Create a scatter plot of EPA vs volume (attempts)
4. Identify the most efficient low-volume QBs
# Your code here
Exercise 5.2: Receiver Comparison
Level 3 | Comparative Analysis
Compare receiving production:
1. Calculate yards per target, EPA per target, and catch rate
2. Create a scatter plot matrix of these three metrics
3. Which receivers excel in different areas?
4. Identify underrated receivers (high EPA, low volume)
# Your code here
Exercise 5.3: Running Back Efficiency
Level 3 | Contextualized Analysis
Analyze running back performance with context:
1. Calculate yards per carry and EPA per carry for RBs with 100+ attempts
2. What percentage of each RB's carries come in "loaded boxes" (8+ defenders)?
3. Is there a relationship between box loading and efficiency?
# Your code here
Exercise 5.4: Player Consistency
Level 4 | Variance Analysis
Measure player consistency:
1. Calculate week-by-week EPA for top 20 QBs
2. Compute the standard deviation of weekly EPA for each
3. Who are the most and least consistent QBs?
4. Is there a correlation between average performance and consistency?
# Your code here
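Hint: a sketch of the week-by-week aggregation on invented data for two quarterbacks, "A" and "B". It interprets "weekly EPA" as mean EPA per play within a week, which is an assumption; total EPA per week is another reasonable reading.

```python
import pandas as pd

# Toy dropbacks for two hypothetical quarterbacks over two weeks.
toy = pd.DataFrame({
    "passer_player_name": ["A"] * 4 + ["B"] * 4,
    "week": [1, 1, 2, 2, 1, 1, 2, 2],
    "epa": [0.2, 0.4, 0.3, 0.3, 1.5, 0.5, -0.9, -0.5],
})

# Rows: quarterback, columns: week, values: mean EPA per play that week.
weekly = (
    toy.groupby(["passer_player_name", "week"])["epa"]
    .mean()
    .unstack("week")
)

# Average level and week-to-week spread for each quarterback.
consistency = pd.DataFrame({
    "mean_epa": weekly.mean(axis=1),
    "weekly_sd": weekly.std(axis=1),
})
print(consistency)
```

A lower `weekly_sd` means steadier week-to-week output; `consistency.corr()` then addresses the final question about performance versus consistency.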
Exercise 5.5: Situational Stars
Level 4 | Conditional Analysis
Identify players who excel in specific situations:
1. Calculate EPA in each situation: early downs, third down, red zone, two-minute
2. Find players who are top-10 in one situation but average overall
3. Create a function to identify "situational specialists"
def find_situational_specialists(
    pbp: pd.DataFrame,
    situations: dict,
    min_plays: int = 30
) -> pd.DataFrame:
    """Find players who excel in specific situations."""
    # Your code here
    pass
Section 6: Visualization (Level 3-4)
Exercise 6.1: EPA Timeline
Level 3 | Time Series
Create a game flow visualization:
1. Plot cumulative EPA over the course of a game for both teams
2. Add markers for touchdowns and turnovers
3. Create this for a specific game of your choice
def plot_game_flow(pbp: pd.DataFrame, game_id: str):
    """Plot game flow showing cumulative EPA for both teams."""
    # Your code here
    pass
Exercise 6.2: Team Comparison Dashboard
Level 4 | Multi-panel
Create a 4-panel dashboard comparing two teams:
1. Panel 1: EPA per play comparison (bar chart)
2. Panel 2: Play type distribution (pie charts)
3. Panel 3: Success rate by down (grouped bar)
4. Panel 4: Yards gained distribution (overlapping histograms)
def create_team_comparison_dashboard(pbp: pd.DataFrame, team1: str, team2: str):
    """Create 4-panel comparison dashboard."""
    # Your code here
    pass
Exercise 6.3: Field Position Heatmap
Level 4 | Spatial Visualization
Create a field-position heatmap:
1. Divide the field into zones (10-yard sections)
2. Calculate EPA per play in each zone for passes and runs
3. Create a heatmap showing efficiency by zone and play type
# Your code here
Exercise 6.4: Correlation Network
Level 4 | Advanced Visualization
Visualize correlations between offensive metrics:
1. Select 8-10 team offensive metrics
2. Calculate pairwise correlations
3. Create a correlation matrix heatmap
4. Identify clusters of related metrics
# Your code here
Exercise 6.5: Interactive Explorer
Level 4 | Interactive
Create an interactive team explorer using Plotly:
1. Scatter plot of teams with selectable x and y axes
2. Dropdown to select different metrics
3. Hover information showing team details
def create_interactive_team_explorer(team_stats: pd.DataFrame):
    """Create interactive team exploration tool."""
    # Your code here
    pass
Section 7: Advanced EDA (Level 4-5)
Exercise 7.1: Pace Analysis
Level 4 | Derived Metrics
Analyze team pace:
1. Calculate plays per game for each team
2. Calculate time between plays (using game_seconds_remaining)
3. Is there a relationship between pace and efficiency?
4. Do fast teams sacrifice efficiency for volume?
# Your code here
Exercise 7.2: Trend Detection
Level 4 | Temporal Patterns
Detect season-long trends:
1. Calculate weekly EPA for a specific team
2. Fit a linear trend line
3. Which teams improved or declined most during the season?
4. Visualize the top 3 improving and declining teams
# Your code here
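Hint: a degree-1 least-squares fit is enough for the trend line. The sketch below uses `numpy.polyfit` on invented weekly values; the column name `epa_per_play` is illustrative.

```python
import numpy as np
import pandas as pd

# Toy weekly EPA per play for one team (hypothetical values).
weekly = pd.DataFrame({
    "week": [1, 2, 3, 4, 5, 6],
    "epa_per_play": [-0.10, -0.05, 0.00, 0.02, 0.08, 0.12],
})

# polyfit returns coefficients from highest degree down: slope, then intercept.
slope, intercept = np.polyfit(weekly["week"], weekly["epa_per_play"], deg=1)
weekly["trend"] = intercept + slope * weekly["week"]

print(f"EPA per play changes by {slope:+.3f} per week")
```

Comparing slopes across teams gives a simple ranking of improvement and decline, though with at most 18 regular-season data points per team the slopes are noisy.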
Exercise 7.3: Cluster Analysis
Level 5 | Unsupervised Learning
Cluster teams by offensive style:
1. Select 6-8 offensive metrics
2. Standardize the metrics
3. Apply k-means clustering (k=4)
4. Characterize each cluster (what defines each offensive style?)
from sklearn.cluster import KMeans

def cluster_teams_by_style(team_stats: pd.DataFrame, n_clusters: int = 4):
    """Cluster teams by offensive style."""
    # Your code here
    pass
Exercise 7.4: Anomaly Investigation
Level 5 | Case Study
Investigate performance anomalies:
1. Find games where a team's EPA was 2+ standard deviations from their average
2. For these games, analyze what was different
3. Look at play-calling, opponent, injuries, weather
4. Write a short report on one anomaly
# Your code here
Exercise 7.5: Comprehensive EDA Report
Level 5 | Capstone
Create a complete EDA report for a team of your choice:
1. Data quality assessment
2. Overall offensive profile
3. Situational analysis (down, distance, field position)
4. Player-level breakdown
5. Comparison to league average
6. Key insights and recommendations
The report should include at least 5 visualizations and be structured for presentation to a coaching staff.
def generate_team_eda_report(pbp: pd.DataFrame, team: str) -> str:
    """Generate comprehensive EDA report for a team."""
    # Your code here
    pass
Submission Guidelines
For each exercise:
1. Include all code used to generate answers
2. Include relevant visualizations (save as PNG)
3. Write brief interpretations of findings
4. Note any data quality issues encountered
Grading Rubric
| Level | Points | Focus |
|---|---|---|
| 1 | 2 each | Correct execution |
| 2 | 3 each | Execution + interpretation |
| 3 | 4 each | Analysis quality + code structure |
| 4 | 5 each | Insight depth + visualization quality |
| 5 | 6 each | Research quality + communication |