Exercises: Python for Sports Analytics

These exercises progressively build your pandas and NumPy skills using football data. Complete them in order, as later exercises build on earlier concepts.


Level 1: DataFrame Fundamentals

Exercise 1.1: Creating DataFrames

Create a DataFrame representing SEC West teams with the following data:

Team Wins Losses Points For Points Against
Alabama 12 2 480 210
LSU 10 3 420 245
Auburn 7 5 310 280
Texas A&M 8 5 345 295
Ole Miss 11 2 495 230
Mississippi State 6 7 285 320
Arkansas 5 7 275 340

Then: a) Print the shape of the DataFrame b) Print the column names c) Print the data types for each column d) Display summary statistics


Exercise 1.2: Basic Selection

Using the DataFrame from Exercise 1.1:

a) Select only the "Team" and "Wins" columns b) Select the first 3 rows c) Select the last 2 rows d) Select the row for Alabama e) Select rows where Wins > 8


Exercise 1.3: Creating New Columns

Add the following columns to your SEC West DataFrame:

a) games - Total games played (wins + losses) b) win_pct - Win percentage (wins / games) c) point_diff - Point differential (points for - points against) d) ppg - Points per game (points for / games) e) ppg_allowed - Points allowed per game (points against / games)

Sort the DataFrame by win percentage descending.


Exercise 1.4: Boolean Filtering

Using your enhanced DataFrame:

a) Find teams with a winning record (win_pct > 0.5) b) Find teams that scored more than 400 points c) Find teams with positive point differential AND more than 8 wins d) Find teams with ppg > 30 OR ppg_allowed < 25


Exercise 1.5: Working with Series

From your DataFrame:

a) Extract the "point_diff" column as a Series b) Calculate the mean, median, and standard deviation c) Find the maximum and minimum values and which teams they belong to d) Calculate the correlation between "point_diff" and "wins"


Level 2: Data Manipulation

Exercise 2.1: GroupBy Operations

Create this game data:

import pandas as pd
import numpy as np

np.random.seed(42)
games = pd.DataFrame({
    "game_id": range(1, 51),
    "week": np.repeat(range(1, 11), 5),
    "home_team": np.random.choice(
        ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"], 50
    ),
    "away_team": np.random.choice(
        ["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma"], 50
    ),
    "home_score": np.random.randint(14, 49, 50),
    "away_score": np.random.randint(10, 45, 50),
    "attendance": np.random.randint(50000, 105000, 50)
})

Then calculate: a) Average home score by week b) Total attendance by home team c) Number of games per week d) Maximum score (home or away) for each week


Exercise 2.2: Multiple Aggregations

Using the games DataFrame:

a) Group by home_team and calculate: - Total games - Average home score - Average away score - Total attendance

b) Group by week and calculate: - Average total points (home + away) - Max margin of victory - Average attendance


Exercise 2.3: Apply Functions

Write and apply a function that categorizes games:

a) Create a function categorize_game(row) that returns: - "Blowout" if margin >= 21 - "Comfortable" if margin 14-20 - "Close" if margin 7-13 - "Nail-biter" if margin < 7

b) Apply this function to create a "game_category" column c) Count how many games fall into each category


Exercise 2.4: String Operations

Create play description data:

plays = pd.DataFrame({
    "play_id": range(1, 21),
    "description": [
        "J.Smith pass complete to M.Jones for 15 yards, TOUCHDOWN",
        "T.Brown rush up the middle for 4 yards",
        "J.Smith pass incomplete intended for R.Wilson",
        "PENALTY on Alabama, False Start, 5 yards",
        "D.Williams rush left end for 12 yards",
        "J.Smith pass complete to T.Jackson for 8 yards",
        "PENALTY on Georgia, Holding, 10 yards",
        "T.Brown rush for no gain",
        "J.Smith sacked for -7 yards",
        "K.Johnson punt 45 yards",
        "M.Davis rush for 3 yards, FUMBLE, recovered by Defense",
        "J.Smith pass complete to M.Jones for 22 yards",
        "TIMEOUT Alabama",
        "Field Goal attempt from 35 yards, GOOD",
        "T.Brown rush for 5 yards, FIRST DOWN",
        "J.Smith pass intercepted by S.Thomas",
        "D.Williams rush for 8 yards",
        "J.Smith pass complete to R.Wilson for 6 yards",
        "Extra Point GOOD",
        "Kickoff 65 yards, touchback"
    ]
})

Extract: a) A boolean column for passes (description contains "pass") b) A boolean column for rushes c) A boolean column for penalties d) A boolean column for touchdowns e) Yards from descriptions where available (use regex)


Exercise 2.5: Handling Missing Data

Create this dataset with missing values:

player_stats = pd.DataFrame({
    "player": ["Smith", "Jones", "Williams", "Brown", "Davis",
               "Wilson", "Thomas", "Jackson", "Miller", "Anderson"],
    "rush_yards": [120, None, 85, 0, None, 45, None, 92, 110, 78],
    "pass_yards": [None, 250, None, 180, 320, None, 275, None, 145, 200],
    "receiving_yards": [35, 95, 120, None, 80, 45, 110, None, 60, None],
    "touchdowns": [1, 2, None, 0, 3, 1, None, 2, 1, None]
})

a) Count missing values in each column b) Drop rows with any missing values - how many remain? c) Drop rows with missing touchdowns only - how many remain? d) Fill missing rushing yards with 0 e) Fill missing receiving yards with the column mean f) Fill missing touchdowns with the column median


Level 3: Merging and Joining

Exercise 3.1: Basic Merging

Create these two DataFrames:

teams = pd.DataFrame({
    "team_id": [1, 2, 3, 4, 5],
    "team_name": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
    "conference": ["SEC", "SEC", "Big Ten", "Big Ten", "SEC"],
    "stadium_capacity": [101821, 92746, 102780, 107601, 100119]
})

coaches = pd.DataFrame({
    "team_id": [1, 2, 3, 4, 6],  # Note: team 5 missing, team 6 not in teams
    "coach_name": ["Nick Saban", "Kirby Smart", "Ryan Day",
                   "Jim Harbaugh", "Dabo Swinney"],
    "years_at_school": [16, 7, 4, 9, 14]
})

a) Perform an inner join - which teams appear? b) Perform a left join (teams on left) - which teams appear? c) Perform a right join (teams on left) - which teams appear? d) Perform an outer join - which teams appear?


Exercise 3.2: Merging on Different Columns

Create games and teams data:

games = pd.DataFrame({
    "game_id": [1, 2, 3, 4, 5],
    "home_team_id": [1, 2, 3, 4, 5],
    "away_team_id": [3, 4, 5, 1, 2],
    "home_score": [28, 35, 21, 42, 31],
    "away_score": [24, 28, 17, 38, 28],
    "week": [1, 1, 2, 2, 3]
})

team_info = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
    "mascot": ["Crimson Tide", "Bulldogs", "Buckeyes", "Wolverines", "Longhorns"]
})

a) Merge to add home team name and mascot b) Then merge to add away team name and mascot c) Create a final DataFrame with: game_id, week, home_name, home_mascot, home_score, away_name, away_mascot, away_score


Exercise 3.3: Complex Merging Scenario

Create recruiting and roster data:

recruits = pd.DataFrame({
    "player_name": ["John Smith", "Mike Johnson", "David Williams",
                    "James Brown", "Chris Davis"],
    "high_school_state": ["TX", "GA", "FL", "OH", "AL"],
    "position": ["QB", "RB", "WR", "LB", "CB"],
    "stars": [5, 4, 4, 3, 5],
    "signed_with": ["Alabama", "Georgia", "Texas", "Ohio State", "Alabama"]
})

roster = pd.DataFrame({
    "player_name": ["John Smith", "Mike Johnson", "David Williams",
                    "Robert Wilson", "Kevin Thomas"],
    "team": ["Alabama", "Georgia", "Texas", "Alabama", "Georgia"],
    "jersey_number": [12, 28, 7, 4, 33],
    "year": ["FR", "FR", "FR", "SO", "JR"]
})

a) Find recruits who are on the roster (inner join on player_name) b) Find all recruits with their roster status (left join) c) Find how many of each team's recruits made the roster


Level 4: NumPy Operations

Exercise 4.1: Array Basics

Create these arrays representing 10 games:

import numpy as np

np.random.seed(42)
home_scores = np.random.randint(14, 49, 10)
away_scores = np.random.randint(10, 45, 10)
home_rush_yards = np.random.randint(80, 200, 10)
home_pass_yards = np.random.randint(150, 350, 10)

Calculate: a) Margin of victory for each game (home - away) b) Total points for each game c) Total home yards for each game d) Which games were home wins (boolean array) e) Home win percentage


Exercise 4.2: Statistical Functions

Using the arrays from Exercise 4.1:

a) Mean and standard deviation of home scores b) Median of total points c) 25th and 75th percentiles of margins d) Correlation between home rush yards and home scores e) Cumulative sum of home scores (running total)


Exercise 4.3: Vectorized Conditionals

Using np.where and np.select:

a) Create array: "Win" if home_score > away_score, else "Loss" b) Create array categorizing margins: - "Blowout" if margin >= 21 - "Comfortable" if margin 14-20 - "Close" if margin 7-13 - "Nail-biter" if margin < 7 c) Create array: "High Scoring" if total > 50, "Low Scoring" if total < 35, "Average" otherwise


Exercise 4.4: Broadcasting

Create team season statistics:

teams = ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"]
# Rows = teams, Columns = games 1-12
points_scored = np.random.randint(20, 50, (5, 12))

a) Calculate average points per team (axis=1) b) Calculate average points per game across all teams (axis=0) c) Normalize each team's scores by their own average d) Find each team's best game (max) and worst game (min)


Exercise 4.5: Performance Comparison

Create a large dataset and compare loop vs vectorized performance:

n = 100000
home_pts = np.random.randint(14, 49, n)
away_pts = np.random.randint(10, 45, n)

a) Calculate margins using a for loop - time it b) Calculate margins using NumPy subtraction - time it c) Compare the speedup d) Do the same comparison for categorizing games as Win/Loss


Level 5: Advanced Applications

Exercise 5.1: Complete Team Analysis

Write a function analyze_team_season(games_df, team) that takes a games DataFrame and team name, and returns a dictionary containing:

  • games_played
  • wins, losses
  • win_percentage
  • home_wins, away_wins
  • points_scored (total)
  • points_allowed (total)
  • ppg (points per game)
  • ppg_allowed
  • average_margin
  • biggest_win
  • worst_loss
  • longest_win_streak

Test with sample data.


Exercise 5.2: Play Success Rate Calculator

Create a function calculate_success_rate(plays_df) that:

a) Determines play success based on standard criteria: - 1st down: Gain 40%+ of distance - 2nd down: Gain 50%+ of distance - 3rd/4th down: Get the first down

b) Returns success rate by: - Overall - By down - By play type (pass/rush) - By field position (own 20, middle, opponent's 20)

Test with sample play data.


Exercise 5.3: Season Statistics Pipeline

Build a complete data pipeline:

  1. Create sample season data (100+ games)
  2. Clean the data (handle missing values, validate ranges)
  3. Calculate team standings
  4. Identify conference champions (best record in each conference)
  5. Generate summary statistics report

Exercise 5.4: Efficiency Analysis

Write a function that analyzes offensive efficiency:

def analyze_offensive_efficiency(plays_df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate efficiency metrics from play-by-play data.

    Returns DataFrame with:
    - yards_per_play
    - success_rate
    - explosive_play_rate (10+ yard gains)
    - negative_play_rate (losses)
    - first_down_rate
    """

Test with sample play-by-play data containing down, distance, yards_gained, play_type.


Exercise 5.5: Memory Optimization Challenge

Create a 1 million row play-by-play dataset:

import numpy as np
import pandas as pd

n = 1_000_000
plays = pd.DataFrame({
    "play_id": range(n),
    "game_id": np.random.randint(1, 10000, n),
    "quarter": np.random.randint(1, 5, n),
    "down": np.random.randint(1, 5, n),
    "distance": np.random.randint(1, 20, n),
    "play_type": np.random.choice(["Pass", "Rush", "Punt", "Kickoff"], n),
    "yards_gained": np.random.randint(-10, 50, n),
    "home_team": np.random.choice(
        ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas",
         "Oklahoma", "Clemson", "LSU", "Florida", "Auburn"], n
    ),
    "away_team": np.random.choice(
        ["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma",
         "Texas A&M", "NC State", "Tennessee", "Kentucky", "Missouri"], n
    )
})

a) Measure initial memory usage b) Optimize numeric columns (downcast integers) c) Convert string columns to categorical d) Measure final memory usage e) Calculate the percentage reduction


Challenge Problems

Challenge 1: Win Probability Calculator

Build a simple win probability model:

  1. Create historical game data with scores at different game times
  2. Calculate historical win rates based on: - Current score margin - Time remaining
  3. Apply to new game states to estimate win probability

Challenge 2: Strength of Schedule

Calculate strength of schedule for each team:

  1. Create a full season of games (all teams play each other)
  2. Calculate each team's record
  3. For each team, calculate: - Opponents' average win percentage - Opponents' opponents' average win percentage - Weighted SOS

Challenge 3: Recruiting Pipeline Analysis

Build a recruiting analysis system:

  1. Create 5 years of recruiting data (class, stars, position, state)
  2. Create roster data with years in program
  3. Analyze: - 4-year retention rate by star rating - Average years to contribute by position - Geographic recruiting patterns over time

Solutions

Solutions are provided in code/exercise-solutions.py. Try to complete each exercise before checking the solutions.


Grading Rubric

Level Points Description
Level 1 20 DataFrame basics
Level 2 25 Data manipulation
Level 3 20 Merging
Level 4 20 NumPy operations
Level 5 15 Advanced applications

Total: 100 points