Exercises: Python for Sports Analytics
These exercises progressively build your pandas and NumPy skills using football data. Complete them in order, as later exercises build on earlier concepts.
Level 1: DataFrame Fundamentals
Exercise 1.1: Creating DataFrames
Create a DataFrame representing SEC West teams with the following data:
| Team | Wins | Losses | Points For | Points Against |
|---|---|---|---|---|
| Alabama | 12 | 2 | 480 | 210 |
| LSU | 10 | 3 | 420 | 245 |
| Auburn | 7 | 5 | 310 | 280 |
| Texas A&M | 8 | 5 | 345 | 295 |
| Ole Miss | 11 | 2 | 495 | 230 |
| Mississippi State | 6 | 7 | 285 | 320 |
| Arkansas | 5 | 7 | 275 | 340 |
Then: a) Print the shape of the DataFrame b) Print the column names c) Print the data types for each column d) Display summary statistics
Exercise 1.2: Basic Selection
Using the DataFrame from Exercise 1.1:
a) Select only the "Team" and "Wins" columns b) Select the first 3 rows c) Select the last 2 rows d) Select the row for Alabama e) Select rows where Wins > 8
Exercise 1.3: Creating New Columns
Add the following columns to your SEC West DataFrame:
a) games - Total games played (wins + losses)
b) win_pct - Win percentage (wins / games)
c) point_diff - Point differential (points for - points against)
d) ppg - Points per game (points for / games)
e) ppg_allowed - Points allowed per game (points against / games)
Sort the DataFrame by win percentage descending.
Exercise 1.4: Boolean Filtering
Using your enhanced DataFrame:
a) Find teams with a winning record (win_pct > 0.5) b) Find teams that scored more than 400 points c) Find teams with positive point differential AND more than 8 wins d) Find teams with ppg > 30 OR ppg_allowed < 25
Exercise 1.5: Working with Series
From your DataFrame:
a) Extract the "point_diff" column as a Series b) Calculate the mean, median, and standard deviation c) Find the maximum and minimum values and which teams they belong to d) Calculate the correlation between "point_diff" and "wins"
Level 2: Data Manipulation
Exercise 2.1: GroupBy Operations
Create this game data:
import pandas as pd
import numpy as np
np.random.seed(42)
games = pd.DataFrame({
"game_id": range(1, 51),
"week": np.repeat(range(1, 11), 5),
"home_team": np.random.choice(
["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"], 50
),
"away_team": np.random.choice(
["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma"], 50
),
"home_score": np.random.randint(14, 49, 50),
"away_score": np.random.randint(10, 45, 50),
"attendance": np.random.randint(50000, 105000, 50)
})
Then calculate: a) Average home score by week b) Total attendance by home team c) Number of games per week d) Maximum score (home or away) for each week
Exercise 2.2: Multiple Aggregations
Using the games DataFrame:
a) Group by home_team and calculate: - Total games - Average home score - Average away score - Total attendance
b) Group by week and calculate: - Average total points (home + away) - Max margin of victory - Average attendance
Exercise 2.3: Apply Functions
Write and apply a function that categorizes games:
a) Create a function categorize_game(row) that returns:
- "Blowout" if margin >= 21
- "Comfortable" if margin 14-20
- "Close" if margin 7-13
- "Nail-biter" if margin < 7
b) Apply this function to create a "game_category" column c) Count how many games fall into each category
Exercise 2.4: String Operations
Create play description data:
plays = pd.DataFrame({
"play_id": range(1, 21),
"description": [
"J.Smith pass complete to M.Jones for 15 yards, TOUCHDOWN",
"T.Brown rush up the middle for 4 yards",
"J.Smith pass incomplete intended for R.Wilson",
"PENALTY on Alabama, False Start, 5 yards",
"D.Williams rush left end for 12 yards",
"J.Smith pass complete to T.Jackson for 8 yards",
"PENALTY on Georgia, Holding, 10 yards",
"T.Brown rush for no gain",
"J.Smith sacked for -7 yards",
"K.Johnson punt 45 yards",
"M.Davis rush for 3 yards, FUMBLE, recovered by Defense",
"J.Smith pass complete to M.Jones for 22 yards",
"TIMEOUT Alabama",
"Field Goal attempt from 35 yards, GOOD",
"T.Brown rush for 5 yards, FIRST DOWN",
"J.Smith pass intercepted by S.Thomas",
"D.Williams rush for 8 yards",
"J.Smith pass complete to R.Wilson for 6 yards",
"Extra Point GOOD",
"Kickoff 65 yards, touchback"
]
})
Extract: a) A boolean column for passes (description contains "pass") b) A boolean column for rushes c) A boolean column for penalties d) A boolean column for touchdowns e) Yards from descriptions where available (use regex)
Exercise 2.5: Handling Missing Data
Create this dataset with missing values:
player_stats = pd.DataFrame({
"player": ["Smith", "Jones", "Williams", "Brown", "Davis",
"Wilson", "Thomas", "Jackson", "Miller", "Anderson"],
"rush_yards": [120, None, 85, 0, None, 45, None, 92, 110, 78],
"pass_yards": [None, 250, None, 180, 320, None, 275, None, 145, 200],
"receiving_yards": [35, 95, 120, None, 80, 45, 110, None, 60, None],
"touchdowns": [1, 2, None, 0, 3, 1, None, 2, 1, None]
})
a) Count missing values in each column b) Drop rows with any missing values - how many remain? c) Drop rows with missing touchdowns only - how many remain? d) Fill missing rushing yards with 0 e) Fill missing receiving yards with the column mean f) Fill missing touchdowns with the column median
Level 3: Merging and Joining
Exercise 3.1: Basic Merging
Create these two DataFrames:
teams = pd.DataFrame({
"team_id": [1, 2, 3, 4, 5],
"team_name": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
"conference": ["SEC", "SEC", "Big Ten", "Big Ten", "SEC"],
"stadium_capacity": [101821, 92746, 102780, 107601, 100119]
})
coaches = pd.DataFrame({
"team_id": [1, 2, 3, 4, 6], # Note: team 5 missing, team 6 not in teams
"coach_name": ["Nick Saban", "Kirby Smart", "Ryan Day",
"Jim Harbaugh", "Dabo Swinney"],
"years_at_school": [16, 7, 4, 9, 14]
})
a) Perform an inner join - which teams appear? b) Perform a left join (teams on left) - which teams appear? c) Perform a right join (teams on left) - which teams appear? d) Perform an outer join - which teams appear?
Exercise 3.2: Merging on Different Columns
Create games and teams data:
games = pd.DataFrame({
"game_id": [1, 2, 3, 4, 5],
"home_team_id": [1, 2, 3, 4, 5],
"away_team_id": [3, 4, 5, 1, 2],
"home_score": [28, 35, 21, 42, 31],
"away_score": [24, 28, 17, 38, 28],
"week": [1, 1, 2, 2, 3]
})
team_info = pd.DataFrame({
"id": [1, 2, 3, 4, 5],
"name": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
"mascot": ["Crimson Tide", "Bulldogs", "Buckeyes", "Wolverines", "Longhorns"]
})
a) Merge to add home team name and mascot b) Then merge to add away team name and mascot c) Create a final DataFrame with: game_id, week, home_name, home_mascot, home_score, away_name, away_mascot, away_score
Exercise 3.3: Complex Merging Scenario
Create recruiting and roster data:
recruits = pd.DataFrame({
"player_name": ["John Smith", "Mike Johnson", "David Williams",
"James Brown", "Chris Davis"],
"high_school_state": ["TX", "GA", "FL", "OH", "AL"],
"position": ["QB", "RB", "WR", "LB", "CB"],
"stars": [5, 4, 4, 3, 5],
"signed_with": ["Alabama", "Georgia", "Texas", "Ohio State", "Alabama"]
})
roster = pd.DataFrame({
"player_name": ["John Smith", "Mike Johnson", "David Williams",
"Robert Wilson", "Kevin Thomas"],
"team": ["Alabama", "Georgia", "Texas", "Alabama", "Georgia"],
"jersey_number": [12, 28, 7, 4, 33],
"year": ["FR", "FR", "FR", "SO", "JR"]
})
a) Find recruits who are on the roster (inner join on player_name) b) Find all recruits with their roster status (left join) c) Find how many of each team's recruits made the roster
Level 4: NumPy Operations
Exercise 4.1: Array Basics
Create these arrays representing 10 games:
import numpy as np
np.random.seed(42)
home_scores = np.random.randint(14, 49, 10)
away_scores = np.random.randint(10, 45, 10)
home_rush_yards = np.random.randint(80, 200, 10)
home_pass_yards = np.random.randint(150, 350, 10)
Calculate: a) Margin of victory for each game (home - away) b) Total points for each game c) Total home yards for each game d) Which games were home wins (boolean array) e) Home win percentage
Exercise 4.2: Statistical Functions
Using the arrays from Exercise 4.1:
a) Mean and standard deviation of home scores b) Median of total points c) 25th and 75th percentiles of margins d) Correlation between home rush yards and home scores e) Cumulative sum of home scores (running total)
Exercise 4.3: Vectorized Conditionals
Using np.where and np.select:
a) Create array: "Win" if home_score > away_score, else "Loss" b) Create array categorizing margins: - "Blowout" if margin >= 21 - "Comfortable" if margin 14-20 - "Close" if margin 7-13 - "Nail-biter" if margin < 7 c) Create array: "High Scoring" if total > 50, "Low Scoring" if total < 35, "Average" otherwise
Exercise 4.4: Broadcasting
Create team season statistics:
teams = ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"]
# Rows = teams, Columns = games 1-12
points_scored = np.random.randint(20, 50, (5, 12))
a) Calculate average points per team (axis=1) b) Calculate average points per game across all teams (axis=0) c) Normalize each team's scores by their own average d) Find each team's best game (max) and worst game (min)
Exercise 4.5: Performance Comparison
Create a large dataset and compare loop vs vectorized performance:
n = 100000
home_pts = np.random.randint(14, 49, n)
away_pts = np.random.randint(10, 45, n)
a) Calculate margins using a for loop - time it b) Calculate margins using NumPy subtraction - time it c) Compare the speedup d) Do the same comparison for categorizing games as Win/Loss
Level 5: Advanced Applications
Exercise 5.1: Complete Team Analysis
Write a function analyze_team_season(games_df, team) that takes a games DataFrame and team name, and returns a dictionary containing:
- games_played
- wins, losses
- win_percentage
- home_wins, away_wins
- points_scored (total)
- points_allowed (total)
- ppg (points per game)
- ppg_allowed
- average_margin
- biggest_win
- worst_loss
- longest_win_streak
Test with sample data.
Exercise 5.2: Play Success Rate Calculator
Create a function calculate_success_rate(plays_df) that:
a) Determines play success based on standard criteria: - 1st down: Gain 40%+ of distance - 2nd down: Gain 50%+ of distance - 3rd/4th down: Get the first down
b) Returns success rate by: - Overall - By down - By play type (pass/rush) - By field position (own 20, middle, opponent's 20)
Test with sample play data.
Exercise 5.3: Season Statistics Pipeline
Build a complete data pipeline:
- Create sample season data (100+ games)
- Clean the data (handle missing values, validate ranges)
- Calculate team standings
- Identify conference champions (best record in each conference)
- Generate summary statistics report
Exercise 5.4: Efficiency Analysis
Write a function that analyzes offensive efficiency:
def analyze_offensive_efficiency(plays_df: pd.DataFrame) -> pd.DataFrame:
"""
Calculate efficiency metrics from play-by-play data.
Returns DataFrame with:
- yards_per_play
- success_rate
- explosive_play_rate (10+ yard gains)
- negative_play_rate (losses)
- first_down_rate
"""
Test with sample play-by-play data containing down, distance, yards_gained, play_type.
Exercise 5.5: Memory Optimization Challenge
Create a 1 million row play-by-play dataset:
import numpy as np
import pandas as pd
n = 1_000_000
plays = pd.DataFrame({
"play_id": range(n),
"game_id": np.random.randint(1, 10000, n),
"quarter": np.random.randint(1, 5, n),
"down": np.random.randint(1, 5, n),
"distance": np.random.randint(1, 20, n),
"play_type": np.random.choice(["Pass", "Rush", "Punt", "Kickoff"], n),
"yards_gained": np.random.randint(-10, 50, n),
"home_team": np.random.choice(
["Alabama", "Georgia", "Ohio State", "Michigan", "Texas",
"Oklahoma", "Clemson", "LSU", "Florida", "Auburn"], n
),
"away_team": np.random.choice(
["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma",
"Texas A&M", "NC State", "Tennessee", "Kentucky", "Missouri"], n
)
})
a) Measure initial memory usage b) Optimize numeric columns (downcast integers) c) Convert string columns to categorical d) Measure final memory usage e) Calculate the percentage reduction
Challenge Problems
Challenge 1: Win Probability Calculator
Build a simple win probability model:
- Create historical game data with scores at different game times
- Calculate historical win rates based on: - Current score margin - Time remaining
- Apply to new game states to estimate win probability
Challenge 2: Strength of Schedule
Calculate strength of schedule for each team:
- Create a full season of games (all teams play each other)
- Calculate each team's record
- For each team, calculate: - Opponents' average win percentage - Opponents' opponents' average win percentage - Weighted SOS
Challenge 3: Recruiting Pipeline Analysis
Build a recruiting analysis system:
- Create 5 years of recruiting data (class, stars, position, state)
- Create roster data with years in program
- Analyze: - 4-year retention rate by star rating - Average years to contribute by position - Geographic recruiting patterns over time
Solutions
Solutions are provided in code/exercise-solutions.py. Try to complete each exercise before checking the solutions.
Grading Rubric
| Level | Points | Description |
|---|---|---|
| Level 1 | 20 | DataFrame basics |
| Level 2 | 25 | Data manipulation |
| Level 3 | 20 | Merging |
| Level 4 | 20 | NumPy operations |
| Level 5 | 15 | Advanced applications |
Total: 100 points