Chapter 4 Exercises: Exploratory Data Analysis for Basketball

Section 1: Data Loading and Inspection (Exercises 1-6)

Exercise 1: Basic Data Inspection

Load the NBA player statistics dataset and perform the following inspections:
a) Display the first 15 rows and last 10 rows of the dataset
b) Identify all column names and their data types
c) Count the number of unique teams and positions represented
d) Calculate the total memory usage of the DataFrame

Dataset: nba_player_stats_2023_24.csv

Exercise 2: Data Type Analysis

For the player statistics dataset:
a) Separate columns into numerical and categorical types
b) Identify any columns that appear to have incorrect data types (e.g., numeric data stored as strings)
c) Convert any percentage columns from strings to proper float values
d) Create a summary table showing column name, current type, and recommended type

Exercise 3: Derived Feature Creation

Create the following derived features from the base statistics:
a) Points per minute (pts_per_min)
b) Assists-to-turnover ratio (ast_to_tov)
c) True Shooting Percentage using the formula: TS% = PTS / (2 * (FGA + 0.44 * FTA))
d) Offensive rating estimate: ORtg = (PTS / (FGA + 0.44 * FTA + TOV)) * 100
e) Position group (combine PG/SG as Guards, SF/PF as Forwards, C as Centers)
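
As a starting point, the formulas above can be sketched on a toy table. The column names (pts, fga, fta, tov, ast, min, pos) and the choice to turn zero denominators into NaN are assumptions for illustration, not part of the dataset specification.

```python
import numpy as np
import pandas as pd

# Toy data; real column names in the dataset may differ (assumption).
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "pos": ["PG", "PF", "C"],
    "pts": [25.0, 12.0, 0.0],
    "fga": [18.0, 9.0, 0.0],
    "fta": [6.0, 2.0, 0.0],
    "tov": [3.0, 0.0, 1.0],
    "ast": [7.0, 2.0, 0.0],
    "min": [34.0, 22.0, 4.0],
})

# Replace zero denominators with NaN so ratios are undefined rather than inf.
shot_attempts = (df["fga"] + 0.44 * df["fta"]).replace(0, np.nan)
possessions_used = (df["fga"] + 0.44 * df["fta"] + df["tov"]).replace(0, np.nan)

df["pts_per_min"] = df["pts"] / df["min"].replace(0, np.nan)
df["ast_to_tov"] = df["ast"] / df["tov"].replace(0, np.nan)
df["ts_pct"] = df["pts"] / (2 * shot_attempts)          # TS% formula from part (c)
df["ortg_est"] = df["pts"] / possessions_used * 100     # ORtg estimate from part (d)

# Position groups from part (e).
pos_group = {"PG": "Guards", "SG": "Guards",
             "SF": "Forwards", "PF": "Forwards", "C": "Centers"}
df["pos_group"] = df["pos"].map(pos_group)
```

Whether an undefined ratio should stay NaN or be filled with another value is a decision worth documenting in your answer.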

Exercise 4: Multi-Source Data Loading

You have three separate CSV files:
- player_bio.csv (player_id, name, height, weight, age, college)
- player_stats.csv (player_id, team, games, points, rebounds, assists)
- player_advanced.csv (player_id, per, ws, bpm, vorp)

Write code to:
a) Load all three files
b) Merge them into a single DataFrame using player_id
c) Handle any players that appear in one file but not others
d) Verify the merge was successful by checking for duplicate or missing records
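
One possible approach to parts (b) through (d) is an outer merge with indicator columns. The sketch below uses small in-memory frames in place of the CSV files, and the indicator column names (bio_vs_stats, prior_vs_adv) are hypothetical.

```python
import pandas as pd

# Stand-ins for player_bio.csv, player_stats.csv, and player_advanced.csv.
bio = pd.DataFrame({"player_id": [1, 2, 3], "name": ["A", "B", "C"]})
stats = pd.DataFrame({"player_id": [1, 2, 4], "points": [20.1, 8.4, 5.0]})
adv = pd.DataFrame({"player_id": [1, 3, 4], "per": [22.5, 11.0, 9.8]})

# Outer merges keep players found in only one file; validate raises an
# error if player_id is duplicated on either side of a merge.
merged = bio.merge(stats, on="player_id", how="outer",
                   validate="one_to_one", indicator="bio_vs_stats")
merged = merged.merge(adv, on="player_id", how="outer",
                      validate="one_to_one", indicator="prior_vs_adv")

# Verification: duplicate keys and players missing from at least one file.
n_duplicate_ids = int(merged["player_id"].duplicated().sum())
incomplete = merged[(merged["bio_vs_stats"] != "both")
                    | (merged["prior_vs_adv"] != "both")]
```

An inner merge would silently drop the incomplete players instead, so the choice of `how` should follow from what the later analysis needs.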

Exercise 5: Date and Time Handling

Load a game log dataset (player_game_logs.csv) and:
a) Convert the game_date column to datetime format
b) Extract the month and day of week, and determine whether it was a home or away game
c) Create a "days_rest" column showing the number of full days off since the player's previous game (the difference between game dates minus one)
d) Identify back-to-back games (games with 0 days rest)
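
A minimal sketch of the date handling, assuming columns named player_id and game_date. It defines days_rest as the gap between consecutive game dates minus one, so that games on consecutive calendar days have 0 days rest; that convention is an assumption chosen to match the back-to-back definition.

```python
import pandas as pd

# Toy game log; the real file has more columns (assumption).
logs = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "game_date": ["2024-01-05", "2024-01-03", "2024-01-02",
                  "2024-01-02", "2024-01-06"],
})
logs["game_date"] = pd.to_datetime(logs["game_date"])
logs = logs.sort_values(["player_id", "game_date"]).reset_index(drop=True)

logs["month"] = logs["game_date"].dt.month
logs["day_of_week"] = logs["game_date"].dt.day_name()

# Calendar-day gap to the same player's previous game (NaN for the first game).
gap_days = logs.groupby("player_id")["game_date"].diff().dt.days
logs["days_rest"] = gap_days - 1
logs["back_to_back"] = logs["days_rest"].eq(0)
```

Each player's first game has no previous game, so its days_rest is left missing rather than set to an arbitrary value.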

Exercise 6: Hierarchical Data Inspection

Given a dataset with team-level and player-level statistics:
a) Calculate the number of players per team
b) Verify that individual player statistics sum to approximately team totals
c) Identify any teams with missing players (based on total minutes not summing to expected values)
d) Create a team-level summary from player-level data

Section 2: Data Cleaning and Preprocessing (Exercises 7-12)

Exercise 7: Duplicate Detection and Handling

Using the player statistics dataset:
a) Identify all exact duplicate rows
b) Find players who appear multiple times (traded mid-season)
c) For traded players, create a "season total" row that combines their statistics
d) Document your decisions about how you handled partial season records
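
For part (c), one hedged sketch is to sum counting statistics per player and then recompute rates from the summed counts, rather than averaging per-team percentages. The column names below are assumptions.

```python
import pandas as pd

# Player 1 was traded mid-season and appears once per team (toy data).
df = pd.DataFrame({
    "player_id": [1, 1, 2],
    "team": ["TOR", "NYK", "BOS"],
    "games": [30, 40, 70],
    "points": [450, 800, 1400],
    "fgm": [170, 300, 500],
    "fga": [400, 600, 1000],
})

counting_cols = ["games", "points", "fgm", "fga"]
totals = df.groupby("player_id", as_index=False)[counting_cols].sum()
totals["n_teams"] = df.groupby("player_id")["team"].nunique().to_numpy()

# Rates are recomputed from the season totals, not averaged across stints.
totals["fg_pct"] = totals["fgm"] / totals["fga"]
totals["ppg"] = totals["points"] / totals["games"]
```

Some public datasets already include a combined row for traded players; if yours does, summing the per-team rows on top of it would double count.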

Exercise 8: Standardization

Clean the following inconsistencies in the dataset:
a) Standardize team abbreviations (e.g., PHX vs PHO, BKN vs BRK)
b) Clean player names (remove extra whitespace, standardize capitalization)
c) Standardize position labels (handle variations like "Point Guard" vs "PG")
d) Create a mapping dictionary for any standardization applied

Exercise 9: Logical Validation

Write validation functions to check for:
a) Field goal percentages outside the range [0, 1]
b) Players with more assists than field goals made (flag but don't remove)
c) Minutes per game exceeding 48 (the length of a regulation game)
d) Negative values in counting statistics
e) Three-point percentage higher than overall field goal percentage (investigate, don't remove)

Create a report summarizing all validation issues found.
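
One way to organize such checks is as a dictionary of boolean masks that feeds a small report. This is a sketch on toy data with assumed column names (fg_pct, fg3_pct, mpg, ast, fgm).

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "fg_pct": [0.48, 1.20, 0.41],
    "fg3_pct": [0.36, 0.30, 0.45],
    "mpg": [34.0, 51.0, 12.0],
    "ast": [300, 50, 10],
    "fgm": [400, 60, -5],
})

# Each check is a boolean Series: True marks a row that needs attention.
# Note that between() is False for NaN, so missing values are flagged too.
checks = {
    "fg_pct_out_of_range": ~df["fg_pct"].between(0, 1),
    "ast_exceeds_fgm": df["ast"] > df["fgm"],
    "mpg_over_48": df["mpg"] > 48,
    "negative_counts": (df[["ast", "fgm"]] < 0).any(axis=1),
    "fg3_pct_above_fg_pct": df["fg3_pct"] > df["fg_pct"],
}
flags = pd.DataFrame(checks)

report = flags.sum().rename("n_flagged").to_frame()  # issues per check
flagged_rows = df[flags.any(axis=1)]                 # rows to review, not delete
```

Keeping the flags separate from the data makes it easy to flag without removing, as parts (b) and (e) require.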

Exercise 10: Outlier Investigation

For points per game in the player dataset:
a) Calculate the IQR and identify outliers using the 1.5*IQR rule
b) Calculate Z-scores and identify outliers with |Z| > 3
c) Visualize the outliers using a box plot with individual points
d) Research the identified outliers - are they errors or exceptional performers?
e) Document your decision about how to handle each outlier
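
The two rules in parts (a) and (b) can be sketched as follows on a small made-up sample. The values are invented for illustration, and the population standard deviation (ddof=0) is an assumption; the sample version is equally defensible.

```python
import pandas as pd

# Invented points-per-game values with one very high scorer.
ppg = pd.Series([2.1, 4.0, 5.5, 6.3, 7.8, 8.2, 9.0, 10.4, 11.1, 12.6, 33.9],
                name="ppg")

# 1.5 * IQR rule.
q1, q3 = ppg.quantile(0.25), ppg.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ppg[(ppg < lower) | (ppg > upper)]

# Z-score rule with |Z| > 3.
z = (ppg - ppg.mean()) / ppg.std(ddof=0)
z_outliers = ppg[z.abs() > 3]
```

The two rules need not agree. In this small sample the IQR rule flags the highest value while the Z-score rule does not, because the extreme value itself inflates the standard deviation.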

Exercise 11: Data Transformation

Apply appropriate transformations to the following statistics:
a) Log-transform a heavily right-skewed variable (e.g., total points)
b) Square root transform count data with many zeros
c) Normalize minutes per game to a 0-1 scale
d) Standardize (Z-score) the "points" column
e) Compare the distributions before and after transformation visually
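
A minimal sketch of transformations (a) through (d), assuming hypothetical columns total_points, blocks, mpg, and points. It uses log1p, that is log(1 + x), so that zero totals stay defined, which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_points": [0, 12, 150, 900, 2100],
    "blocks": [0, 0, 1, 4, 9],
    "mpg": [4.0, 12.0, 20.0, 28.0, 36.0],
    "points": [1.0, 5.0, 9.0, 13.0, 22.0],
})

df["log_total_points"] = np.log1p(df["total_points"])  # log(1 + x) handles zeros
df["sqrt_blocks"] = np.sqrt(df["blocks"])

# Min-max scaling to the 0-1 range.
mpg_range = df["mpg"].max() - df["mpg"].min()
df["mpg_scaled"] = (df["mpg"] - df["mpg"].min()) / mpg_range

# Z-score using the sample standard deviation (pandas default, ddof=1).
df["points_z"] = (df["points"] - df["points"].mean()) / df["points"].std()
```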

Exercise 12: Creating Analysis-Ready Datasets

Create two clean datasets from the raw data:
a) A "starters" dataset: Players with at least 20 MPG and 41 games played
b) A "rotation players" dataset: Players with at least 10 MPG and 20 games played

For each dataset:
- Document all filtering criteria
- Report how many players were removed at each step
- Verify no missing values remain in key columns

Section 3: Missing Value Analysis (Exercises 13-16)

Exercise 13: Missing Value Patterns

Analyze missing values in the dataset:
a) Create a summary table showing missing count and percentage for each column
b) Visualize the missing data pattern using a heatmap
c) Identify if missing values occur in clusters (e.g., all advanced stats missing together)
d) Determine the likely mechanism (MCAR, MAR, MNAR) for each column with missing values
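
Parts (a) and (c) might be approached as below. The sketch correlates missingness indicators to look for columns that tend to be missing together; the column names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "fg3_pct": [0.38, np.nan, 0.35, np.nan],
    "per": [18.2, np.nan, 12.1, np.nan],
    "bpm": [2.1, np.nan, -0.4, np.nan],
    "height": [198.0, 211.0, np.nan, 203.0],
})

# Missing count and percentage per column, worst columns first.
summary = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100,
}).sort_values("missing_pct", ascending=False)

# Correlate 0/1 missingness indicators for columns that have any gaps;
# values near 1 suggest the columns are missing in the same rows.
cols_with_gaps = summary.index[summary["missing_count"] > 0]
co_missing = df[cols_with_gaps].isna().astype(int).corr()
```

A clustered pattern like this does not identify the mechanism by itself, but it narrows down which explanations to investigate in part (d).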

Exercise 14: Conditional Missing Values

Investigate the relationship between missing values and other variables:
a) Is three-point percentage more likely to be missing for certain positions?
b) Are advanced stats more likely to be missing for players with fewer minutes?
c) Create a visualization showing missing percentage by player position
d) Test whether the missingness is statistically associated with any observed variables

Exercise 15: Imputation Strategy Development

For a dataset with missing values in height, weight, and shooting percentages:
a) Implement mean imputation for height
b) Implement position-group mean imputation for weight
c) For shooting percentages with zero attempts, decide and implement an appropriate strategy
d) Compare player statistics before and after imputation
e) Document the rationale for each imputation decision
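
A hedged sketch of parts (a) through (c) follows. Leaving the percentage undefined (NaN) when there are zero attempts is only one defensible strategy, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos_group": ["Guards", "Guards", "Centers", "Centers", "Forwards"],
    "height": [190.0, np.nan, 213.0, 211.0, 203.0],
    "weight": [86.0, np.nan, 115.0, np.nan, 100.0],
    "fg3m": [50, 0, 0, 2, 30],
    "fg3a": [130, 0, 0, 10, 90],
})

# Keep a record of which values were imputed for the before/after comparison.
df["height_was_missing"] = df["height"].isna()

# (a) Overall mean imputation.
df["height_imp"] = df["height"].fillna(df["height"].mean())

# (b) Position-group mean imputation: transform returns one value per row.
group_mean_weight = df.groupby("pos_group")["weight"].transform("mean")
df["weight_imp"] = df["weight"].fillna(group_mean_weight)

# (c) With zero attempts the percentage is undefined, so leave it as NaN.
df["fg3_pct"] = df["fg3m"] / df["fg3a"].where(df["fg3a"] > 0)
```

Writing imputed values to new columns instead of overwriting the originals keeps part (d) straightforward.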

Exercise 16: Sensitivity Analysis

Conduct a sensitivity analysis for missing data:
a) Analyze the data using only complete cases
b) Analyze the data using mean imputation
c) Analyze the data using group-based imputation
d) Compare key summary statistics (mean points, correlation between assists and points) across all three approaches
e) Discuss which approach is most appropriate for this basketball dataset

Section 4: Distribution Visualization (Exercises 17-21)

Exercise 17: Histogram Analysis

Create comprehensive histograms for the following statistics:
a) Points per game - identify the distribution shape and any multimodality
b) Three-point percentage - explain the gap near zero
c) Minutes per game - identify distinct player groups
d) For each histogram, add mean and median lines and calculate skewness

Exercise 18: Comparative Box Plots

Create box plots comparing:
a) Points per game by position (PG, SG, SF, PF, C)
b) Three-point attempts by team (identify high-volume and low-volume teams)
c) Assists by position group (Guards, Forwards, Centers)
d) For each plot, identify and label significant outliers by player name

Exercise 19: Violin Plot Interpretation

Create violin plots for:
a) True shooting percentage by position
b) Usage rate by position group
c) Minutes per game by age group (Under 25, 25-30, Over 30)

For each violin plot:
- Describe the shape of each distribution
- Identify positions/groups with bimodal distributions
- Explain what the distribution shapes tell us about player roles

Exercise 20: Multi-Panel Distribution Analysis

Create a figure with 6 subplots showing:
a) Histograms for the "counting stats" (points, rebounds, assists)
b) Histograms for the "efficiency stats" (FG%, 3P%, FT%)
c) Ensure consistent styling across all subplots
d) Add a main title summarizing the overall findings

Exercise 21: Distribution Comparison Over Time

Using historical data from multiple seasons:
a) Compare the distribution of three-point attempts between 2013-14 and 2023-24
b) Create overlaid KDE plots showing the shift
c) Calculate the change in mean, median, and standard deviation
d) Visualize this as an animated GIF if possible, or a small multiples plot

Section 5: Relationship Visualization (Exercises 22-27)

Exercise 22: Scatter Plot Analysis

Create scatter plots examining:
a) Points vs Minutes - calculate and display R-squared
b) Assists vs Usage Rate - color points by position
c) Age vs True Shooting Percentage - add a LOWESS trend line
d) For each plot, identify notable outliers and provide context

Exercise 23: Correlation Matrix Analysis

Build a comprehensive correlation analysis:
a) Create a correlation matrix for all numeric columns
b) Identify the 10 strongest positive correlations
c) Identify the 5 strongest negative correlations
d) Create a clustered heatmap that groups similar variables together
e) Explain any surprising correlations (or lack thereof)

Exercise 24: Pair Plot Exploration

Create pair plots for:
a) The four counting stats (points, rebounds, assists, steals) colored by position
b) Shooting efficiency metrics (FG%, 3P%, FT%, TS%) colored by position group
c) Analyze the diagonal KDE plots to understand how distributions vary by position
d) Identify the strongest relationships visible in the scatter plots

Exercise 25: Bubble Chart Creation

Create a bubble chart where:
a) X-axis: True Shooting Percentage
b) Y-axis: Points Per Game
c) Bubble size: Minutes Per Game
d) Bubble color: Age

Add annotations for the top 10 scorers by name. Interpret what the visualization reveals about the relationship between efficiency and volume.

Exercise 26: Regression Analysis Visualization

For the relationship between usage rate and points per game:
a) Create a scatter plot with linear regression line
b) Calculate residuals and create a residual plot
c) Fit a polynomial regression and compare to linear
d) Create separate regression lines for each position
e) Discuss which model best describes the relationship

Exercise 27: Interactive Relationship Exploration

Using Plotly or a similar interactive library:
a) Create an interactive scatter plot of points vs assists
b) Add hover information showing player name, team, and games played
c) Enable filtering by position
d) Add the ability to zoom and pan
e) Export as an HTML file

Section 6: Time Series Analysis (Exercises 28-32)

Exercise 28: Rolling Average Analysis

For a star player's game log:
a) Calculate 5-game, 10-game, and 20-game rolling averages for points
b) Plot all three on the same graph with the raw game values
c) Identify periods of hot and cold streaks
d) Calculate the correlation between consecutive 5-game averages (autocorrelation)
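
A sketch of the rolling-average and autocorrelation steps on invented point totals. It reads "consecutive 5-game averages" as non-overlapping 5-game blocks, which is an interpretation; overlapping rolling windows share four of their five games, so their lag-1 correlation is high by construction.

```python
import pandas as pd

# Invented game-by-game points for 20 games, in chronological order.
points = pd.Series([30, 22, 18, 35, 40, 27, 12, 25, 33, 29,
                    41, 20, 24, 31, 38, 19, 26, 34, 28, 36], name="pts")

# Rolling means; the first window - 1 games are NaN unless min_periods is set.
roll5 = points.rolling(window=5).mean()
roll10 = points.rolling(window=10).mean()
roll20 = points.rolling(window=20).mean()

# Averages over non-overlapping 5-game blocks (games 1-5, 6-10, ...).
block_means = points.groupby(points.index // 5).mean()
block_autocorr = block_means.autocorr(lag=1)

# For comparison: lag-1 autocorrelation of the overlapping rolling mean.
rolling_autocorr = roll5.autocorr(lag=1)
```

With only a handful of blocks the autocorrelation estimate is very noisy, which is worth acknowledging when interpreting streaks.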

Exercise 29: Cumulative Statistics Tracking

Track cumulative statistics for MVP candidates:
a) Create cumulative points, rebounds, and assists lines for 5 players
b) Project end-of-season totals based on games played and pace
c) Identify when specific milestones (1000 points, 500 assists) were reached
d) Visualize the "race" to a specific statistical milestone

Exercise 30: Performance Consistency Analysis

Compare performance consistency for two point guards:
a) Calculate coefficient of variation for points, assists, and turnovers
b) Create side-by-side histograms of game-by-game scoring
c) Calculate the probability of a 20+ point game for each player
d) Determine which player is more reliable/consistent
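
The coefficient of variation (standard deviation divided by mean) and the empirical share of 20+ point games can be sketched as follows. The two players and their game totals are invented, and only points is shown.

```python
import pandas as pd

# Two hypothetical players with the same scoring average but different spread.
games = pd.DataFrame({
    "player": ["X"] * 5 + ["Y"] * 5,
    "pts": [22, 24, 20, 23, 21, 8, 35, 15, 30, 22],
})

grouped = games.groupby("player")["pts"]
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "std": grouped.std(),  # sample standard deviation (ddof=1)
    "p_20_plus": grouped.apply(lambda s: (s >= 20).mean()),
})
summary["cv"] = summary["std"] / summary["mean"]
```

Both players average the same number of points here, so the difference in CV comes entirely from game-to-game variability.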

Exercise 31: Segment Analysis

Divide the season into quarters and analyze:
a) Calculate average statistics for each quarter of the season
b) Test for significant differences between quarters
c) Identify players who improved or declined most dramatically
d) Create visualizations showing the seasonal progression for key players

Exercise 32: Rest and Performance Analysis

Using game log data:
a) Calculate days of rest before each game
b) Compare performance with 0, 1, 2, and 3+ days of rest
c) Create box plots showing points per game by rest days
d) Test whether the differences are statistically significant
e) Discuss implications for load management

Section 7: Shot Chart Analysis (Exercises 33-38)

Exercise 33: Basic Shot Chart Creation

For a designated player:
a) Create a shot chart showing makes and misses on a court diagram
b) Color-code by shot type (2-point vs 3-point)
c) Add a legend and shooting percentages to the title
d) Save the chart at high resolution (300 DPI)

Exercise 34: Hexbin Analysis

Create hexbin shot charts for three players with different playing styles:
a) A high-volume three-point shooter
b) A paint-dominant center
c) A mid-range specialist

Compare the patterns and discuss how the shot distributions reflect playing style.

Exercise 35: Zone-Based Shooting Analysis

Implement court zone classification and analyze:
a) Define at least 8 distinct shooting zones
b) Calculate FG% for each zone for a selected player
c) Create a zone-based shot chart with zones colored by efficiency
d) Compare zone distributions for a guard vs a center

Exercise 36: Shot Chart Comparison

Create a comparison visualization for two players:
a) Side-by-side hexbin shot charts
b) A difference chart showing where each player shoots more
c) Zone-by-zone comparison table
d) Narrative analysis of the key differences

Exercise 37: Expected Points Analysis

Implement an expected points framework:
a) Calculate league-average FG% for each shot zone
b) Assign expected point values to each shot
c) Calculate expected points per shot for multiple players
d) Identify players who exceed expectations vs underperform
e) Create a scatter plot of expected vs actual points per shot
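
One minimal reading of this framework: a shot's expected points equal the league-average FG% of its zone multiplied by the shot's point value. The sketch below uses six invented shots and assumed column names (zone, shot_value, made); a real analysis would compute the zone averages from all league shots.

```python
import pandas as pd

shots = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B"],
    "zone": ["rim", "corner3", "midrange", "rim", "midrange", "midrange"],
    "shot_value": [2, 3, 2, 2, 2, 2],
    "made": [1, 1, 0, 1, 1, 0],
})

# (a) Average FG% per zone; here the toy sample stands in for the league.
zone_fg = shots.groupby("zone")["made"].mean().rename("zone_fg_pct")
shots = shots.join(zone_fg, on="zone")

# (b) Expected and actual points for each shot.
shots["exp_pts"] = shots["zone_fg_pct"] * shots["shot_value"]
shots["act_pts"] = shots["made"] * shots["shot_value"]

# (c, d) Per-shot averages by player and the gap between actual and expected.
per_player = shots.groupby("player")[["exp_pts", "act_pts"]].mean()
per_player["pts_over_expected"] = per_player["act_pts"] - per_player["exp_pts"]
```

A positive pts_over_expected means the player scored more per shot than the zone averages predict for the same shot selection.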

Exercise 38: Temporal Shot Pattern Analysis

Analyze how shooting patterns change:
a) Compare shot charts for a player's first half vs second half of season
b) Compare shot charts from 2018-19 vs 2023-24 for a veteran player
c) Analyze shot distribution by quarter of game
d) Create an animation or small multiples visualization showing the evolution

Section 8: Comprehensive Analysis Projects (Exercises 39-40)

Exercise 39: Complete Player Profile EDA

Select a player and create a comprehensive EDA report including:
a) Basic statistics summary and rankings among peers
b) Distribution analysis of game-to-game performance
c) Trend analysis showing performance over the season
d) Shot chart analysis with zone breakdowns
e) Comparison to players at the same position
f) Written narrative (500+ words) interpreting all findings

Exercise 40: Team-Level EDA Project

Select a team and conduct a full EDA including:
a) Roster composition analysis (age, experience, position distribution)
b) Scoring distribution among players
c) Shooting efficiency by position group
d) Team-level trends over the season
e) Comparison to league averages
f) Identification of strengths and weaknesses
g) Visualization dashboard with at least 6 charts
h) Executive summary (300+ words) with key insights

Answer Guidelines

For exercises requiring code:
- Provide complete, runnable Python code
- Include appropriate comments explaining logic
- Handle edge cases (missing values, empty results)
- Follow PEP 8 style guidelines

For exercises requiring analysis:
- Support claims with specific numbers from the data
- Reference visualizations created
- Consider alternative interpretations
- Acknowledge limitations of the analysis

For visualization exercises:
- Use appropriate chart types for the data
- Include proper labels, titles, and legends
- Choose accessible color schemes
- Ensure readability at standard sizes