Exercises: Exploratory Data Analysis for Football

Difficulty Levels

  • Level 1 (Foundational): Basic EDA concepts and simple visualizations
  • Level 2 (Applied): Standard EDA workflows with football data
  • Level 3 (Intermediate): Multi-step analysis with multiple variables
  • Level 4 (Advanced): Complex analysis requiring creative approaches
  • Level 5 (Expert): Open-ended research questions

Section 1: Data Inspection (Level 1-2)

Exercise 1.1: Column Exploration

Level 1 | Data Inspection

Load the 2023 play-by-play data and answer:

  1. How many total columns exist?
  2. How many columns contain the word "player"?
  3. How many columns contain the word "epa"?
  4. What percentage of columns are numeric vs object type?

# Your code here
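
One possible starting point for Exercise 1.1, sketched on a toy frame rather than the real play-by-play data. The frame and its column names are stand-ins invented for illustration; the same expressions should carry over once the 2023 data is loaded into `pbp`.

```python
import pandas as pd

# Toy stand-in for the play-by-play frame (hypothetical columns and values).
pbp = pd.DataFrame({
    "passer_player_name": ["QB1", None, "QB2"],
    "receiver_player_name": ["WR1", None, None],
    "epa": [0.5, -0.2, 1.1],
    "qb_epa": [0.4, -0.2, 1.0],
    "down": [1, 2, 3],
})

n_cols = pbp.shape[1]

# Substring matches on column names.
n_player = sum("player" in col for col in pbp.columns)
n_epa = sum("epa" in col for col in pbp.columns)

# Share of numeric columns; everything else is treated as non-numeric here.
n_numeric = pbp.select_dtypes(include="number").shape[1]
pct_numeric = 100 * n_numeric / n_cols
pct_non_numeric = 100 - pct_numeric

print(n_cols, n_player, n_epa, pct_numeric, pct_non_numeric)
```

Note that substring matching counts `qb_epa` as an "epa" column, which may or may not be what the question intends.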

Exercise 1.2: Missing Data Analysis

Level 1 | Data Quality

For the 2023 season:

  1. Which 10 columns have the most missing values?
  2. What percentage of epa values are missing?
  3. Why might receiver_player_name have more missing values than passer_player_name?

# Your code here
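
A minimal sketch of the missing-value ranking in Exercise 1.2, assuming a small made-up frame in place of the real data. Only the pattern (`isna`, then sum or mean) is the point; the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the play-by-play frame (hypothetical values).
pbp = pd.DataFrame({
    "play_type": ["pass", "run", "pass", "punt"],
    "epa": [0.3, -0.1, np.nan, 0.0],
    "passer_player_name": ["QB1", None, "QB1", None],
    "receiver_player_name": ["WR1", None, None, None],
})

# Missing count per column, largest first; head(10) gives the top 10.
missing_count = pbp.isna().sum().sort_values(ascending=False)
top_missing = missing_count.head(10)

# The mean of a boolean mask is the fraction of True values.
pct_epa_missing = 100 * pbp["epa"].isna().mean()

print(top_missing)
print(f"epa missing: {pct_epa_missing:.1f}%")
```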

Exercise 1.3: Play Type Distribution

Level 1 | Categorical Analysis

Create a bar chart showing the distribution of play_type values:

  1. What are the top 5 most common play types?
  2. What percentage of plays are passes vs runs?
  3. What percentage are special teams plays?

# Your code here
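
The percentages in Exercise 1.3 can be sketched with `value_counts(normalize=True)`. The series below is invented, and the set of labels treated as special teams is an assumption that should be checked against the actual `play_type` values in the data.

```python
import pandas as pd

# Toy play_type column (hypothetical values).
play_type = pd.Series(
    ["pass", "run", "pass", "punt", "pass", "run", "kickoff", "pass"],
    name="play_type",
)

# Percentage share of each play type, most common first.
share = play_type.value_counts(normalize=True) * 100
top5 = share.head(5)

# Assumed special teams labels; verify against the real data.
special_teams = {"punt", "kickoff", "field_goal", "extra_point"}
pct_special = share[share.index.isin(special_teams)].sum()

print(top5)
print(f"special teams: {pct_special:.1f}%")
# With matplotlib installed, share.plot.bar() draws the bar chart.
```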

Exercise 1.4: Game Coverage Check

Level 2 | Data Validation

Validate that the data covers a complete season:

  1. How many unique games exist?
  2. How many games per team?
  3. Are there any teams with fewer than 17 games of data?
  4. What's the date range of the data?

# Your code here

Exercise 1.5: Data Quality Report

Level 2 | Documentation

Write a function that generates a data quality report including:

  1. Total rows and columns
  2. Missing data summary
  3. Duplicate row check
  4. Date range covered
  5. Team coverage summary

def generate_data_quality_report(pbp: pd.DataFrame) -> dict:
    """Generate comprehensive data quality report."""
    # Your code here
    pass

Section 2: Univariate Analysis (Level 2-3)

Exercise 2.1: EPA Distribution

Level 2 | Distributions

Analyze the EPA distribution for pass plays:

  1. Create a histogram with 50 bins
  2. Calculate mean, median, and standard deviation
  3. What is the skewness of the distribution?
  4. What percentage of passes have positive EPA?

# Your code here
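
A small sketch of the summary statistics in Exercise 2.1, using invented EPA values. With the real data, `epa` would be the `epa` column filtered to pass plays.

```python
import pandas as pd

# Toy EPA values for pass plays (hypothetical).
epa = pd.Series([-1.5, -0.5, 0.0, 0.2, 0.4, 0.6, 3.0])

summary = {
    "mean": epa.mean(),
    "median": epa.median(),
    "std": epa.std(),            # sample standard deviation (ddof=1)
    "skew": epa.skew(),          # bias-corrected sample skewness
    "pct_positive": 100 * (epa > 0).mean(),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
# With matplotlib installed, epa.plot.hist(bins=50) draws the histogram.
```

A mean above the median together with positive skewness points to a long right tail, which is worth checking against the histogram.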

Exercise 2.2: Yards Gained Analysis

Level 2 | Distributions

Compare the distribution of yards gained for passes vs runs:

  1. Create side-by-side histograms
  2. Calculate the 10th, 50th, and 90th percentiles for each
  3. Which has higher variance? Why might this be?

# Your code here

Exercise 2.3: Completion Percentage Distribution

Level 2 | Player Metrics

For quarterbacks with at least 200 pass attempts:

  1. Calculate each QB's completion percentage
  2. Create a histogram of completion percentages
  3. What is the league-wide average?
  4. How many QBs are above/below average?

# Your code here
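
One way to approach Exercise 2.3 is a grouped aggregation followed by a volume filter. The sketch below assumes a 0/1 `complete_pass` column and uses a toy threshold of 3 attempts so the filter is visible on eight rows; the exercise itself calls for 200.

```python
import pandas as pd

# Toy pass attempts (hypothetical players and outcomes).
passes = pd.DataFrame({
    "passer_player_name": ["QB1"] * 4 + ["QB2"] * 3 + ["QB3"] * 1,
    "complete_pass": [1, 1, 0, 1, 1, 0, 0, 1],
})

MIN_ATTEMPTS = 3  # toy threshold; the exercise uses 200

qb = passes.groupby("passer_player_name")["complete_pass"].agg(
    attempts="count", completions="sum"
)
qb = qb[qb["attempts"] >= MIN_ATTEMPTS].copy()
qb["comp_pct"] = 100 * qb["completions"] / qb["attempts"]

# League average pooled over every attempt, including low-volume passers.
league_avg = 100 * passes["complete_pass"].mean()
n_above = int((qb["comp_pct"] > league_avg).sum())
n_below = int((qb["comp_pct"] < league_avg).sum())

print(qb)
print(league_avg, n_above, n_below)
```

The pooled average used here is one choice; the mean of the qualifying QBs' percentages is another, and the two generally differ.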

Exercise 2.4: Target Distribution

Level 3 | Concentration Analysis

Analyze how targets are distributed among receivers:

  1. For each team, calculate the target share of each receiver
  2. What's the average target share of the top receiver on each team?
  3. Which teams have the most concentrated passing attack?
  4. Create a visualization showing target concentration by team

# Your code here
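
A sketch of target share for Exercise 2.4 on invented data. The exercise does not prescribe a concentration measure; the sum of squared shares (a Herfindahl-style index) is used here as one assumed option alongside the top receiver's share.

```python
import pandas as pd

# Toy targets: one row per targeted pass (hypothetical teams and players).
targets = pd.DataFrame({
    "posteam": ["AAA"] * 5 + ["BBB"] * 4,
    "receiver_player_name": [
        "R1", "R1", "R1", "R2", "R2",   # AAA: two receivers
        "R3", "R4", "R5", "R6",         # BBB: four receivers
    ],
})

# Targets per receiver, then each receiver's share of the team total.
counts = targets.groupby(["posteam", "receiver_player_name"]).size()
team_totals = counts.groupby(level="posteam").transform("sum")
share = counts / team_totals

# Share of the most-targeted receiver on each team.
top_share = share.groupby(level="posteam").max()

# Sum of squared shares: higher means more concentrated.
concentration = (share ** 2).groupby(level="posteam").sum()

print(top_share)
print(concentration.sort_values(ascending=False))
```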

Exercise 2.5: Outlier Detection

Level 3 | Statistical Analysis

Identify outlier performances:

  1. Find all individual plays with EPA > 5 or EPA < -5
  2. What play types generate extreme EPA values?
  3. Create a function to detect and flag outliers using the IQR method

def detect_outliers(series: pd.Series, method: str = 'iqr') -> pd.Series:
    """Detect outliers in a series."""
    # Your code here
    pass
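
For reference, the IQR rule flags values below Q1 - k*IQR or above Q3 + k*IQR, with k = 1.5 as the conventional default. The helper below, `flag_iqr_outliers`, is a hypothetical illustration of that rule on toy values, not a required signature; adapting it to `detect_outliers` is left as the exercise.

```python
import pandas as pd

def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    return (series < lower) | (series > upper)

# Toy EPA values with one extreme play (hypothetical).
epa = pd.Series([-0.4, -0.1, 0.0, 0.1, 0.2, 0.3, 6.5])
flags = flag_iqr_outliers(epa)
print(epa[flags])
```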

Section 3: Bivariate Analysis (Level 2-3)

Exercise 3.1: Down and Distance

Level 2 | Relationships

Analyze the relationship between down and EPA:

  1. Calculate average EPA by down
  2. Create a bar chart comparing EPA across downs
  3. Why does EPA tend to decrease on later downs?

# Your code here

Exercise 3.2: Air Yards and Success

Level 2 | Relationships

Explore the relationship between air yards and success rate:

  1. Create a scatter plot of air yards vs EPA
  2. Bin air yards into 5-yard buckets and calculate success rate for each
  3. At what depth does success rate peak?

# Your code here
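
The binning step in Exercise 3.2 can be sketched with `pd.cut`. The data are invented, the bin edges are an arbitrary choice, and success is taken here as EPA > 0, which is a common convention but an assumption as far as this exercise is concerned.

```python
import pandas as pd

# Toy pass plays (hypothetical values).
passes = pd.DataFrame({
    "air_yards": [-2, 1, 3, 4, 6, 8, 9, 12, 14, 22],
    "epa": [-0.3, 0.2, 0.5, -0.1, 0.8, -0.4, 0.6, 1.2, -0.9, 2.5],
})

# Assumed definition: a play is a success when EPA is positive.
passes["success"] = passes["epa"] > 0

# 5-yard buckets, left-closed: [-5, 0), [0, 5), ..., [20, 25).
edges = list(range(-5, 30, 5))
passes["air_bucket"] = pd.cut(passes["air_yards"], bins=edges, right=False)

# Success rate and sample size per bucket; empty buckets are dropped.
by_bucket = passes.groupby("air_bucket", observed=True)["success"].agg(
    ["mean", "count"]
)
print(by_bucket)
```

Reporting the count next to the rate matters: deep buckets usually hold few plays, so their success rates are noisy.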

Exercise 3.3: Score Differential Impact

Level 3 | Game Context

Analyze how score differential affects play-calling and outcomes:

  1. Calculate pass rate by score differential (binned)
  2. Calculate EPA per play by score differential
  3. Create a dual-axis plot showing both relationships
  4. At what score differential do teams become "too pass-heavy"?

# Your code here

Exercise 3.4: Time Remaining Effects

Level 3 | Temporal Analysis

Study how play-calling changes with time:

  1. Calculate pass rate by quarter
  2. Calculate pass rate by half_seconds_remaining (binned into 1-minute intervals)
  3. How does the two-minute offense differ from normal offense?

# Your code here

Exercise 3.5: Field Position Analysis

Level 3 | Spatial Analysis

Analyze performance by field position:

  1. Calculate EPA per play by yardline_100 (binned into 10-yard sections)
  2. Where on the field is offense most efficient?
  3. Create a heatmap showing EPA by field position and down

# Your code here

Section 4: Team Analysis (Level 3-4)

Exercise 4.1: Team Rankings

Level 3 | Aggregation

Create comprehensive team offensive rankings:

  1. Calculate for each team: EPA per play, success rate, explosive play rate
  2. Rank teams on each metric
  3. Create a composite ranking using all three
  4. Which teams rank differently on EPA vs success rate? Why?

# Your code here
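
A sketch of the ranking steps in Exercise 4.1 on an invented team table. The composite here is the plain average of the three ranks, which is an assumed weighting; other weightings are equally defensible.

```python
import pandas as pd

# Toy per-team metrics (hypothetical teams and values).
team_stats = pd.DataFrame(
    {
        "epa_per_play": [0.15, 0.05, -0.02, 0.10],
        "success_rate": [0.44, 0.48, 0.40, 0.46],
        "explosive_rate": [0.12, 0.08, 0.09, 0.11],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)

# Rank 1 is best on every metric, since higher is better for all three.
ranks = team_stats.rank(ascending=False, method="min")

# Equal-weight composite: mean of the three ranks (an assumption).
ranks["composite"] = ranks.mean(axis=1)
ranks = ranks.sort_values("composite")

# Teams whose EPA rank and success-rate rank disagree the most.
rank_gap = (ranks["epa_per_play"] - ranks["success_rate"]).abs()

print(ranks)
print(rank_gap.sort_values(ascending=False))
```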

Exercise 4.2: Offensive Balance

Level 3 | Strategic Analysis

Analyze offensive balance (pass vs run):

  1. Calculate pass rate for each team
  2. Calculate EPA per play for passes and runs separately
  3. Is there a correlation between pass rate and overall offensive efficiency?
  4. Which teams most efficiently allocate between pass and run?

# Your code here

Exercise 4.3: Team Profiles

Level 4 | Multivariate

Create visual team profiles:

  1. Select 5 key offensive metrics
  2. Standardize each metric (z-scores)
  3. Create radar charts for the top 5 and bottom 5 teams
  4. What patterns distinguish elite offenses from poor ones?

def create_team_radar(team_stats: pd.DataFrame, team: str, metrics: list):
    """Create radar chart for a team."""
    # Your code here
    pass
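
The standardization step in Exercise 4.3 can be sketched as follows. The table is invented and uses three metrics instead of five to stay short. The evenly spaced angles at the end are the usual input for a polar-axis radar chart; the plotting itself is left out.

```python
import numpy as np
import pandas as pd

# Toy per-team metrics (hypothetical teams and values).
team_stats = pd.DataFrame(
    {
        "epa_per_play": [0.15, 0.05, -0.02, 0.10],
        "success_rate": [0.44, 0.48, 0.40, 0.46],
        "explosive_rate": [0.12, 0.08, 0.09, 0.11],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)
metrics = list(team_stats.columns)

# z-score each metric across teams: subtract the mean, divide by the std.
# ddof=0 uses the population std; ddof=1 is an equally common choice.
subset = team_stats[metrics]
z = (subset - subset.mean()) / subset.std(ddof=0)

# One angle per metric, evenly spaced around the circle.
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)

print(z.round(2))
print(angles)
```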

Exercise 4.4: Home Field Advantage

Level 3 | Split Analysis

Quantify home field advantage:

  1. Calculate EPA per play at home vs away for each team
  2. Which teams have the largest home field advantage?
  3. Is home field advantage consistent across seasons? (Answering this requires data from more than one season.)
  4. Test whether the difference is statistically significant

# Your code here

Exercise 4.5: Divisional Performance

Level 4 | Comparative Analysis

Compare team performance against division vs non-division opponents:

  1. Calculate EPA per play in division games vs out of division
  2. Are teams more or less efficient against familiar opponents?
  3. Which teams show the biggest splits?

# Your code here

Section 5: Player Analysis (Level 3-4)

Exercise 5.1: QB Efficiency

Level 3 | Player Metrics

Analyze quarterback efficiency:

  1. Calculate EPA per dropback for all QBs with 200+ attempts
  2. Calculate CPOE (if available) and correlation with EPA
  3. Create a scatter plot of EPA vs volume (attempts)
  4. Identify the most efficient low-volume QBs

# Your code here

Exercise 5.2: Receiver Comparison

Level 3 | Comparative Analysis

Compare receiving production:

  1. Calculate yards per target, EPA per target, and catch rate
  2. Create a scatter plot matrix of these three metrics
  3. Which receivers excel in different areas?
  4. Identify underrated receivers (high EPA, low volume)

# Your code here

Exercise 5.3: Running Back Efficiency

Level 3 | Contextualized Analysis

Analyze running back performance with context:

  1. Calculate yards per carry and EPA per carry for RBs with 100+ attempts
  2. What percentage of each RB's carries come in "loaded boxes" (8+ defenders)? (This needs a defenders-in-the-box count, which may have to come from a supplementary dataset.)
  3. Is there a relationship between box loading and efficiency?

# Your code here

Exercise 5.4: Player Consistency

Level 4 | Variance Analysis

Measure player consistency:

  1. Calculate week-by-week EPA for top 20 QBs
  2. Compute the standard deviation of weekly EPA for each
  3. Who are the most and least consistent QBs?
  4. Is there a correlation between average performance and consistency?

# Your code here
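
A sketch of the two-level aggregation behind Exercise 5.4: plays are first averaged within each week, then the weekly values are summarized per player. The data are invented so that both quarterbacks have the same average but very different week-to-week spread.

```python
import pandas as pd

# Toy dropbacks for two quarterbacks over three weeks (hypothetical values).
plays = pd.DataFrame({
    "passer_player_name": ["QB1"] * 6 + ["QB2"] * 6,
    "week": [1, 1, 2, 2, 3, 3] * 2,
    "epa": [0.2, 0.4, 0.3, 0.3, 0.2, 0.4,
            1.0, 1.0, -0.8, -0.6, 0.5, 0.7],
})

# Step 1: EPA per play within each player-week.
weekly = plays.groupby(["passer_player_name", "week"])["epa"].mean()

# Step 2: mean and standard deviation of the weekly values per player.
consistency = weekly.groupby(level="passer_player_name").agg(["mean", "std"])

print(consistency)
```

Here a lower `std` means a more consistent player. With real data, weeks with very few dropbacks can inflate the spread, so a per-week minimum may be worth adding.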

Exercise 5.5: Situational Stars

Level 4 | Conditional Analysis

Identify players who excel in specific situations:

  1. Calculate EPA in each situation: early downs, third down, red zone, two-minute
  2. Find players who are top-10 in one situation but average overall
  3. Create a function to identify "situational specialists"

def find_situational_specialists(
    pbp: pd.DataFrame,
    situations: dict,
    min_plays: int = 30
) -> pd.DataFrame:
    """Find players who excel in specific situations."""
    # Your code here
    pass

Section 6: Visualization (Level 3-4)

Exercise 6.1: EPA Timeline

Level 3 | Time Series

Create a game flow visualization:

  1. Plot cumulative EPA over the course of a game for both teams
  2. Add markers for touchdowns and turnovers
  3. Create this for a specific game of your choice

def plot_game_flow(pbp: pd.DataFrame, game_id: str):
    """Plot game flow showing cumulative EPA for both teams."""
    # Your code here
    pass
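
The core of a game flow plot is a running total of EPA per team, which a grouped `cumsum` provides. The sketch below uses an invented six-play game and tracks each team's offensive EPA only; the plotting and the touchdown and turnover markers are left for the exercise.

```python
import pandas as pd

# Toy single-game plays (hypothetical teams and values).
game = pd.DataFrame({
    "play_id": [1, 2, 3, 4, 5, 6],
    "posteam": ["AAA", "AAA", "BBB", "BBB", "AAA", "BBB"],
    "epa": [0.5, -0.2, 1.0, 0.3, 0.7, -0.4],
})

# Plays must be in chronological order before accumulating.
game = game.sort_values("play_id")

# Running total of EPA for each offense.
game["cum_epa"] = game.groupby("posteam")["epa"].cumsum()

# Final cumulative EPA for each team.
final = game.groupby("posteam")["cum_epa"].last()

print(game)
print(final)
```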

Exercise 6.2: Team Comparison Dashboard

Level 4 | Multi-panel

Create a 4-panel dashboard comparing two teams:

  1. Panel 1: EPA per play comparison (bar chart)
  2. Panel 2: Play type distribution (pie charts)
  3. Panel 3: Success rate by down (grouped bar)
  4. Panel 4: Yards gained distribution (overlapping histograms)

def create_team_comparison_dashboard(pbp: pd.DataFrame, team1: str, team2: str):
    """Create 4-panel comparison dashboard."""
    # Your code here
    pass

Exercise 6.3: Field Position Heatmap

Level 4 | Spatial Visualization

Create a field-position heatmap:

  1. Divide the field into zones (10-yard sections)
  2. Calculate EPA per play in each zone for passes and runs
  3. Create a heatmap showing efficiency by zone and play type

# Your code here

Exercise 6.4: Correlation Network

Level 4 | Advanced Visualization

Visualize correlations between offensive metrics:

  1. Select 8-10 team offensive metrics
  2. Calculate pairwise correlations
  3. Create a correlation matrix heatmap
  4. Identify clusters of related metrics

# Your code here

Exercise 6.5: Interactive Explorer

Level 4 | Interactive

Create an interactive team explorer using Plotly:

  1. Scatter plot of teams with selectable x and y axes
  2. Dropdown to select different metrics
  3. Hover information showing team details

def create_interactive_team_explorer(team_stats: pd.DataFrame):
    """Create interactive team exploration tool."""
    # Your code here
    pass

Section 7: Advanced EDA (Level 4-5)

Exercise 7.1: Pace Analysis

Level 4 | Derived Metrics

Analyze team pace:

  1. Calculate plays per game for each team
  2. Calculate time between plays (using game_seconds_remaining)
  3. Is there a relationship between pace and efficiency?
  4. Do fast teams sacrifice efficiency for volume?

# Your code here

Exercise 7.2: Trend Detection

Level 4 | Temporal Patterns

Detect season-long trends:

  1. Calculate weekly EPA for a specific team
  2. Fit a linear trend line
  3. Which teams improved or declined most during the season?
  4. Visualize the top 3 improving and declining teams

# Your code here
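
For the trend line in Exercise 7.2, a degree-1 `np.polyfit` of weekly EPA on week number gives a slope per team. The weekly table below is invented; with real data it would come from grouping plays by team and week.

```python
import numpy as np
import pandas as pd

# Toy weekly EPA per play for two teams (hypothetical values).
weekly = pd.DataFrame({
    "team": ["AAA"] * 4 + ["BBB"] * 4,
    "week": [1, 2, 3, 4] * 2,
    "epa_per_play": [-0.10, 0.00, 0.10, 0.20, 0.15, 0.10, 0.05, 0.00],
})

# Least-squares slope of EPA per play against week, one value per team.
slopes = pd.Series({
    team: np.polyfit(
        group["week"].to_numpy(), group["epa_per_play"].to_numpy(), deg=1
    )[0]
    for team, group in weekly.groupby("team")
})

# Positive slope: improving over the season; negative: declining.
print(slopes.sort_values(ascending=False))
```

With only 17 or so weekly points per team, slopes are sensitive to single games, so it helps to read them alongside the plotted series.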

Exercise 7.3: Cluster Analysis

Level 5 | Unsupervised Learning

Cluster teams by offensive style:

  1. Select 6-8 offensive metrics
  2. Standardize the metrics
  3. Apply k-means clustering (k=4)
  4. Characterize each cluster (what defines each offensive style?)

import pandas as pd
from sklearn.cluster import KMeans

def cluster_teams_by_style(team_stats: pd.DataFrame, n_clusters: int = 4):
    """Cluster teams by offensive style."""
    # Your code here
    pass

Exercise 7.4: Anomaly Investigation

Level 5 | Case Study

Investigate performance anomalies:

  1. Find games where a team's EPA was 2+ standard deviations from their average
  2. For these games, analyze what was different
  3. Look at play-calling, opponent, injuries, weather
  4. Write a short report on one anomaly

# Your code here

Exercise 7.5: Comprehensive EDA Report

Level 5 | Capstone

Create a complete EDA report for a team of your choice:

  1. Data quality assessment
  2. Overall offensive profile
  3. Situational analysis (down, distance, field position)
  4. Player-level breakdown
  5. Comparison to league average
  6. Key insights and recommendations

The report should include at least 5 visualizations and be structured for presentation to a coaching staff.

def generate_team_eda_report(pbp: pd.DataFrame, team: str) -> str:
    """Generate comprehensive EDA report for a team."""
    # Your code here
    pass

Submission Guidelines

For each exercise:

  1. Include all code used to generate answers
  2. Include relevant visualizations (save as PNG)
  3. Write brief interpretations of findings
  4. Note any data quality issues encountered

Grading Rubric

Level   Points   Focus
1       2 each   Correct execution
2       3 each   Execution + interpretation
3       4 each   Analysis quality + code structure
4       5 each   Insight depth + visualization quality
5       6 each   Research quality + communication