Chapter 4 Exercises: Exploratory Data Analysis for Basketball
Section 1: Data Loading and Inspection (Exercises 1-6)
Exercise 1: Basic Data Inspection
Load the NBA player statistics dataset and perform the following inspections:
a) Display the first 15 rows and last 10 rows of the dataset
b) Identify all column names and their data types
c) Count the number of unique teams and positions represented
d) Calculate the total memory usage of the DataFrame
Dataset: nba_player_stats_2023_24.csv
Exercise 2: Data Type Analysis
For the player statistics dataset:
a) Separate columns into numerical and categorical types
b) Identify any columns that appear to have incorrect data types (e.g., numeric data stored as strings)
c) Convert any percentage columns from strings to proper float values
d) Create a summary table showing column name, current type, and recommended type
Exercise 3: Derived Feature Creation
Create the following derived features from the base statistics:
a) Points per minute (pts_per_min)
b) Assists-to-turnover ratio (ast_to_tov)
c) True Shooting Percentage using the formula: TS% = PTS / (2 * (FGA + 0.44 * FTA))
d) Offensive rating estimate: ORtg = (PTS / (FGA + 0.44 * FTA + TOV)) * 100
e) Position group (combine PG/SG as Guards, SF/PF as Forwards, C as Centers)
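The formulas above can be sketched in pandas as follows. This is a minimal illustration, not a reference solution: the column names (pts, min, ast, tov, fga, fta, pos), the sample rows, and the helper safe_divide are assumptions and will likely differ from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical season totals; the real file's column names may differ.
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "pos": ["PG", "PF", "C"],
    "pts": [1500, 900, 0],
    "min": [2400, 1800, 0],
    "ast": [500, 120, 0],
    "tov": [200, 0, 0],
    "fga": [1100, 700, 0],
    "fta": [400, 250, 0],
})


def safe_divide(numerator, denominator):
    """Divide two Series, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Shooting possessions term shared by TS% and the ORtg estimate
shooting_poss = df["fga"] + 0.44 * df["fta"]

df["pts_per_min"] = safe_divide(df["pts"], df["min"])
df["ast_to_tov"] = safe_divide(df["ast"], df["tov"])
df["ts_pct"] = safe_divide(df["pts"], 2 * shooting_poss)
df["ortg_est"] = safe_divide(df["pts"], shooting_poss + df["tov"]) * 100

# Collapse the five positions into three groups
position_groups = {
    "PG": "Guards", "SG": "Guards",
    "SF": "Forwards", "PF": "Forwards",
    "C": "Centers",
}
df["pos_group"] = df["pos"].map(position_groups)
```

Returning NaN for zero denominators is one possible choice; it keeps players with no minutes or no turnovers from producing infinite ratios, but those rows then need a documented handling decision.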
Exercise 4: Multi-Source Data Loading
You have three separate CSV files:
- player_bio.csv (player_id, name, height, weight, age, college)
- player_stats.csv (player_id, team, games, points, rebounds, assists)
- player_advanced.csv (player_id, per, ws, bpm, vorp)
Write code to:
a) Load all three files
b) Merge them into a single DataFrame using player_id
c) Handle any players that appear in one file but not others
d) Verify the merge was successful by checking for duplicate or missing records
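One way to approach the merge and its verification is sketched below. The three small DataFrames stand in for the CSV files (with real data they would come from pd.read_csv), and the in_bio, in_stats, and in_advanced flag columns are illustrative names, not part of the dataset.

```python
import pandas as pd

# In-memory stand-ins for player_bio.csv, player_stats.csv, player_advanced.csv
bio = pd.DataFrame({"player_id": [1, 2, 3], "name": ["A", "B", "C"]})
stats = pd.DataFrame({"player_id": [1, 2, 4], "points": [20.1, 11.3, 5.0]})
advanced = pd.DataFrame({"player_id": [1, 3, 4], "per": [22.0, 14.5, 9.8]})

# Outer joins keep players that are missing from one or more files;
# validate="one_to_one" raises if player_id is duplicated in either side.
merged = (
    bio.merge(stats, on="player_id", how="outer", validate="one_to_one")
       .merge(advanced, on="player_id", how="outer", validate="one_to_one")
)

# Record which source files each player appeared in
merged["in_bio"] = merged["player_id"].isin(bio["player_id"])
merged["in_stats"] = merged["player_id"].isin(stats["player_id"])
merged["in_advanced"] = merged["player_id"].isin(advanced["player_id"])

# Players present in all three files
source_flags = ["in_bio", "in_stats", "in_advanced"]
complete = merged[merged[source_flags].all(axis=1)]
```

Starting from an outer join and filtering afterwards makes the dropped players visible, which is harder to audit if an inner join silently discards them.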
Exercise 5: Date and Time Handling
Load a game log dataset (player_game_logs.csv) and:
a) Convert the game_date column to datetime format
b) Extract the month, day of week, and whether it was a home or away game
c) Create a "days_rest" column showing the number of full days off since the player's previous game (the calendar-day gap minus one)
d) Identify back-to-back games (games with 0 days rest)
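A possible sketch of these steps is shown below. It assumes rest is counted as the calendar-day gap minus one, so that games on consecutive days have 0 days rest as item d) describes. The matchup column and its "@" convention for away games are assumptions about the file layout.

```python
import pandas as pd

# Hypothetical game log rows, deliberately out of date order
logs = pd.DataFrame({
    "player_id": [7, 7, 7, 7],
    "game_date": ["2024-01-05", "2024-01-03", "2024-01-02", "2024-01-08"],
    "matchup": ["LAL vs. BOS", "LAL @ DEN", "LAL @ UTA", "LAL vs. MIA"],
})

logs["game_date"] = pd.to_datetime(logs["game_date"])
logs = logs.sort_values(["player_id", "game_date"]).reset_index(drop=True)

logs["month"] = logs["game_date"].dt.month
logs["day_of_week"] = logs["game_date"].dt.day_name()
# Assumed convention: "@" in the matchup string marks an away game
logs["is_home"] = ~logs["matchup"].str.contains("@", regex=False)

# Calendar-day gap to the same player's previous game (NaN for the first game)
days_between = logs.groupby("player_id")["game_date"].diff().dt.days
logs["days_rest"] = days_between - 1
logs["back_to_back"] = logs["days_rest"] == 0
```

Sorting before calling diff matters: the gap is computed row to row within each player, so unsorted logs would give meaningless rest values.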
Exercise 6: Hierarchical Data Inspection
Given a dataset with team-level and player-level statistics:
a) Calculate the number of players per team
b) Verify that individual player statistics sum to approximately team totals
c) Identify any teams with missing players (based on total minutes not summing to expected values)
d) Create a team-level summary from player-level data
Section 2: Data Cleaning and Preprocessing (Exercises 7-12)
Exercise 7: Duplicate Detection and Handling
Using the player statistics dataset:
a) Identify all exact duplicate rows
b) Find players who appear multiple times (traded mid-season)
c) For traded players, create a "season total" row that combines their statistics
d) Document your decisions about how you handled partial season records
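The sketch below shows one way to separate exact duplicates from traded players and build a combined row. The column names, the sample values, and the "TOT" team label are assumptions used for illustration.

```python
import pandas as pd

# Hypothetical rows: player 1 was traded, player 3 has an exact duplicate row
df = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3],
    "team": ["LAL", "BKN", "BOS", "DEN", "DEN"],
    "games": [30, 25, 70, 60, 60],
    "pts": [450, 400, 1400, 600, 600],
    "min": [900, 800, 2300, 1500, 1500],
})

# a) Exact duplicates: every copy is shown, then all but the first are dropped
exact_dups = df[df.duplicated(keep=False)]
deduped = df.drop_duplicates()

# b) Traded players appear with more than one distinct team
team_counts = deduped.groupby("player_id")["team"].nunique()
traded_ids = team_counts[team_counts > 1].index

# c) Sum counting stats across stints, then recompute rate stats from the sums
totals = (
    deduped[deduped["player_id"].isin(traded_ids)]
    .groupby("player_id", as_index=False)[["games", "pts", "min"]]
    .sum()
)
totals["team"] = "TOT"
totals["ppg"] = totals["pts"] / totals["games"]
```

Rate statistics such as points per game should be recomputed from the summed totals rather than averaged across stints, since a simple average ignores how many games each stint covered.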
Exercise 8: Standardization
Clean the following inconsistencies in the dataset:
a) Standardize team abbreviations (e.g., PHX vs PHO, BKN vs BRK)
b) Clean player names (remove extra whitespace, standardize capitalization)
c) Standardize position labels (handle variations like "Point Guard" vs "PG")
d) Create a mapping dictionary for any standardization applied
Exercise 9: Logical Validation
Write validation functions to check for:
a) Field goal percentages outside the range [0, 1]
b) Players with more assists than field goals made (flag but don't remove)
c) Minutes per game exceeding 48 (the length of a regulation game, so a season average above it is almost certainly an error)
d) Negative values in counting statistics
e) Three-point percentage higher than overall field goal percentage (investigate, don't remove)
Create a report summarizing all validation issues found.
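The checks above can be expressed as boolean masks collected into a single report, as in the following sketch. The column names and sample values are assumptions; each mask only flags rows and removes nothing.

```python
import pandas as pd

# Hypothetical rows seeded with one or more problems each (except player A)
df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "fg_pct": [0.47, 1.20, 0.41, 0.50],
    "fg3_pct": [0.36, 0.30, 0.45, 0.00],
    "mpg": [34.0, 28.5, 51.0, 12.0],
    "ast": [300, 50, 40, 10],
    "fgm": [400, 200, 30, -5],
})

# One boolean mask per rule; True means the row is flagged.
# Note that between() is False for NaN, so missing fg_pct is flagged too.
checks = {
    "fg_pct_out_of_range": ~df["fg_pct"].between(0, 1),
    "ast_exceeds_fgm": df["ast"] > df["fgm"],
    "mpg_over_48": df["mpg"] > 48,
    "negative_counts": (df[["ast", "fgm"]] < 0).any(axis=1),
    "fg3_pct_above_fg_pct": df["fg3_pct"] > df["fg_pct"],
}

flags = pd.DataFrame(checks)
flags.index = df["player"]

# Summary report: number of flagged rows per rule, plus the rows to review
report = flags.sum()
flagged_players = flags[flags.any(axis=1)]
```

Keeping flags in their own table, rather than filtering df directly, preserves the "flag but don't remove" intent of items b) and e).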
Exercise 10: Outlier Investigation
For points per game in the player dataset:
a) Calculate the IQR and identify outliers using the 1.5*IQR rule
b) Calculate Z-scores and identify outliers with |Z| > 3
c) Visualize the outliers using a box plot with individual points
d) Research the identified outliers - are they errors or exceptional performers?
e) Document your decision about how to handle each outlier
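Items a) and b) can be sketched as follows on a small made-up sample of points-per-game values. The data is illustrative only, and the population standard deviation (ddof=0) is an assumed choice for the Z-score.

```python
import pandas as pd

# Made-up points-per-game values with one very high scorer
ppg = pd.Series([4.2, 6.1, 7.5, 8.0, 9.3, 10.4, 11.0, 12.2, 13.5, 33.9], name="ppg")

# a) 1.5 * IQR rule
q1, q3 = ppg.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ppg[(ppg < lower) | (ppg > upper)]

# b) Z-score rule with |Z| > 3
z_scores = (ppg - ppg.mean()) / ppg.std(ddof=0)
z_outliers = ppg[z_scores.abs() > 3]
```

On this sample the IQR rule flags the 33.9 value while the Z-score rule flags nothing, because a single extreme value inflates the standard deviation it is measured against. The two rules can disagree, which is worth noting in the documentation for item e).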
Exercise 11: Data Transformation
Apply appropriate transformations to the following statistics:
a) Log-transform a heavily right-skewed variable (e.g., total points)
b) Square root transform count data with many zeros
c) Normalize minutes per game to a 0-1 scale
d) Standardize (Z-score) the "points" column
e) Compare the distributions before and after transformation visually
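A minimal sketch of transformations a) through d) is given below. The column names and values are assumptions, and log1p (the log of 1 + x) is used in place of a plain log so that players with zero points do not produce negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; total_pts is right-skewed, blocks has several zeros
df = pd.DataFrame({
    "total_pts": [0, 120, 450, 900, 2100],
    "blocks": [0, 0, 1, 4, 9],
    "mpg": [8.0, 15.5, 24.0, 31.0, 36.0],
})

# a) Log transform; log1p handles zeros
df["log_pts"] = np.log1p(df["total_pts"])

# b) Square root transform for count data
df["sqrt_blocks"] = np.sqrt(df["blocks"])

# c) Min-max normalization to the 0-1 range
mpg_range = df["mpg"].max() - df["mpg"].min()
df["mpg_minmax"] = (df["mpg"] - df["mpg"].min()) / mpg_range

# d) Z-score standardization (population standard deviation, an assumed choice)
df["pts_z"] = (df["total_pts"] - df["total_pts"].mean()) / df["total_pts"].std(ddof=0)
```

After this, mpg_minmax runs from 0 to 1 and pts_z has mean 0 and standard deviation 1, which gives a quick sanity check before plotting the before-and-after comparison in item e).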
Exercise 12: Creating Analysis-Ready Datasets
Create two clean datasets from the raw data:
a) A "starters" dataset: Players with at least 20 MPG and 41 games played
b) A "rotation players" dataset: Players with at least 10 MPG and 20 games played
For each dataset:
- Document all filtering criteria
- Report how many players were removed at each step
- Verify no missing values remain in key columns
Section 3: Missing Value Analysis (Exercises 13-16)
Exercise 13: Missing Value Patterns
Analyze missing values in the dataset:
a) Create a summary table showing missing count and percentage for each column
b) Visualize the missing data pattern using a heatmap
c) Identify if missing values occur in clusters (e.g., all advanced stats missing together)
d) Determine the likely mechanism (MCAR, MAR, MNAR) for each column with missing values
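Items a) and c) can be sketched as below. The sample columns are assumptions, and correlating the missingness indicators is just one simple way to spot columns that tend to be missing together.

```python
import numpy as np
import pandas as pd

# Hypothetical data where the advanced stats (per, bpm) are missing together
df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "ppg": [20.1, 8.3, np.nan, 4.0],
    "per": [22.0, np.nan, np.nan, 9.8],
    "bpm": [5.1, np.nan, np.nan, -2.0],
})

# a) Missing count and percentage per column
is_missing = df.isna()
summary = pd.DataFrame({
    "missing_count": is_missing.sum(),
    "missing_pct": is_missing.mean() * 100,
}).sort_values("missing_count", ascending=False)

# c) Correlate missingness indicators, restricted to columns with any missing
#    values (fully observed columns have zero variance and would give NaN)
cols_with_missing = summary.index[summary["missing_count"] > 0]
co_missing = is_missing[cols_with_missing].astype(int).corr()
```

A correlation near 1 between two indicators suggests the columns share a cause for their missingness, which is useful evidence when reasoning about the mechanism in item d).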
Exercise 14: Conditional Missing Values
Investigate the relationship between missing values and other variables:
a) Is three-point percentage more likely to be missing for certain positions?
b) Are advanced stats more likely to be missing for players with fewer minutes?
c) Create a visualization showing missing percentage by player position
d) Test whether the missingness is statistically associated with any observed variables
Exercise 15: Imputation Strategy Development
For a dataset with missing values in height, weight, and shooting percentages:
a) Implement mean imputation for height
b) Implement position-group mean imputation for weight
c) For shooting percentages with zero attempts, decide and implement an appropriate strategy
d) Compare player statistics before and after imputation
e) Document the rationale for each imputation decision
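The following sketch illustrates items a) through c). The column names and values are assumptions, and leaving zero-attempt percentages as NaN with an indicator column is only one defensible strategy, not the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical roster data with gaps in height and weight
df = pd.DataFrame({
    "pos_group": ["Guards", "Guards", "Centers", "Centers", "Guards"],
    "height_in": [75, np.nan, 84, 83, 76],
    "weight_lb": [190, np.nan, 250, np.nan, 200],
    "fg3a": [300, 120, 0, 15, 200],
    "fg3m": [110, 40, 0, 4, 70],
})

# Keep a record of which values were imputed before filling anything
df["height_was_missing"] = df["height_in"].isna()
df["weight_was_missing"] = df["weight_lb"].isna()

# a) Overall mean imputation for height
df["height_in"] = df["height_in"].fillna(df["height_in"].mean())

# b) Position-group mean imputation for weight
group_mean_weight = df.groupby("pos_group")["weight_lb"].transform("mean")
df["weight_lb"] = df["weight_lb"].fillna(group_mean_weight)

# c) Zero attempts: the percentage is undefined, so leave it missing and flag it
df["fg3_pct"] = (df["fg3m"] / df["fg3a"]).where(df["fg3a"] > 0)
df["no_fg3_attempts"] = df["fg3a"] == 0
```

The indicator columns make the before-and-after comparison in item d) straightforward, since imputed rows can be isolated at any later point.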
Exercise 16: Sensitivity Analysis
Conduct a sensitivity analysis for missing data:
a) Analyze the data using only complete cases
b) Analyze the data using mean imputation
c) Analyze the data using group-based imputation
d) Compare key summary statistics (mean points, correlation between assists and points) across all three approaches
e) Discuss which approach is most appropriate for this basketball dataset
Section 4: Distribution Visualization (Exercises 17-21)
Exercise 17: Histogram Analysis
Create comprehensive histograms for the following statistics:
a) Points per game - identify the distribution shape and any multimodality
b) Three-point percentage - explain the gap near zero
c) Minutes per game - identify distinct player groups
d) For each histogram, add mean and median lines and calculate skewness
Exercise 18: Comparative Box Plots
Create box plots comparing:
a) Points per game by position (PG, SG, SF, PF, C)
b) Three-point attempts by team (identify high-volume and low-volume teams)
c) Assists by position group (Guards, Forwards, Centers)
d) For each plot, identify and label significant outliers by player name
Exercise 19: Violin Plot Interpretation
Create violin plots for:
a) True shooting percentage by position
b) Usage rate by position group
c) Minutes per game by age group (Under 25, 25-30, Over 30)
For each violin plot:
- Describe the shape of each distribution
- Identify positions/groups with bimodal distributions
- Explain what the distribution shapes tell us about player roles
Exercise 20: Multi-Panel Distribution Analysis
Create a figure with 6 subplots showing:
a) Histograms for the "counting stats" (points, rebounds, assists)
b) Histograms for the "efficiency stats" (FG%, 3P%, FT%)
c) Ensure consistent styling across all subplots
d) Add a main title summarizing the overall findings
Exercise 21: Distribution Comparison Over Time
Using historical data from multiple seasons:
a) Compare the distribution of three-point attempts between 2013-14 and 2023-24
b) Create overlaid KDE plots showing the shift
c) Calculate the change in mean, median, and standard deviation
d) Visualize this as an animated GIF if possible, or a small multiples plot
Section 5: Relationship Visualization (Exercises 22-27)
Exercise 22: Scatter Plot Analysis
Create scatter plots examining:
a) Points vs Minutes - calculate and display R-squared
b) Assists vs Usage Rate - color points by position
c) Age vs True Shooting Percentage - add a LOWESS trend line
d) For each plot, identify notable outliers and provide context
Exercise 23: Correlation Matrix Analysis
Build a comprehensive correlation analysis:
a) Create a correlation matrix for all numeric columns
b) Identify the 10 strongest positive correlations
c) Identify the 5 strongest negative correlations
d) Create a clustered heatmap that groups similar variables together
e) Explain any surprising correlations (or lack thereof)
Exercise 24: Pair Plot Exploration
Create pair plots for:
a) The four counting stats (points, rebounds, assists, steals) colored by position
b) Shooting efficiency metrics (FG%, 3P%, FT%, TS%) colored by position group
c) Analyze the diagonal KDE plots to understand how distributions vary by position
d) Identify the strongest relationships visible in the scatter plots
Exercise 25: Bubble Chart Creation
Create a bubble chart where:
a) X-axis: True Shooting Percentage
b) Y-axis: Points Per Game
c) Bubble size: Minutes Per Game
d) Bubble color: Age
Add annotations for the top 10 scorers by name. Interpret what the visualization reveals about the relationship between efficiency and volume.
Exercise 26: Regression Analysis Visualization
For the relationship between usage rate and points per game:
a) Create a scatter plot with linear regression line
b) Calculate residuals and create a residual plot
c) Fit a polynomial regression and compare to linear
d) Create separate regression lines for each position
e) Discuss which model best describes the relationship
Exercise 27: Interactive Relationship Exploration
Using Plotly or a similar interactive library:
a) Create an interactive scatter plot of points vs assists
b) Add hover information showing player name, team, and games played
c) Enable filtering by position
d) Add the ability to zoom and pan
e) Export as an HTML file
Section 6: Time Series Analysis (Exercises 28-32)
Exercise 28: Rolling Average Analysis
For a star player's game log:
a) Calculate 5-game, 10-game, and 20-game rolling averages for points
b) Plot all three on the same graph with the raw game values
c) Identify periods of hot and cold streaks
d) Calculate the correlation between consecutive 5-game averages (autocorrelation)
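Items a) and d) might be sketched as follows. The 25 point totals are made up, and reading "consecutive 5-game averages" as non-overlapping 5-game blocks is an interpretation; overlapping rolling windows would be correlated by construction.

```python
import pandas as pd

# Made-up points for 25 consecutive games
points = pd.Series(
    [28, 31, 19, 24, 35, 22, 27, 30, 18, 26, 33, 29, 21,
     25, 38, 17, 23, 32, 27, 30, 24, 36, 20, 28, 31],
    name="pts",
)

# a) Rolling averages; each stays NaN until a full window is available
rolling = pd.DataFrame({
    f"roll_{window}": points.rolling(window=window).mean()
    for window in (5, 10, 20)
})

# d) Mean of each non-overlapping 5-game block, then lag-1 autocorrelation
block_means = points.groupby(points.index // 5).mean()
block_autocorr = block_means.autocorr(lag=1)
```

With only five blocks the autocorrelation estimate is very noisy, so a full-season log gives a more meaningful value than this short sample.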
Exercise 29: Cumulative Statistics Tracking
Track cumulative statistics for MVP candidates:
a) Create cumulative points, rebounds, and assists lines for 5 players
b) Project end-of-season totals based on games played and pace
c) Identify when specific milestones (1000 points, 500 assists) were reached
d) Visualize the "race" to a specific statistical milestone
Exercise 30: Performance Consistency Analysis
Compare performance consistency for two point guards:
a) Calculate coefficient of variation for points, assists, and turnovers
b) Create side-by-side histograms of game-by-game scoring
c) Calculate the probability of a 20+ point game for each player
d) Determine which player is more reliable/consistent
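For points only, items a) and c) could be sketched as below. The two players and their game scores are invented, and the "probability" is estimated as the empirical share of games with at least 20 points.

```python
import pandas as pd

# Invented game-by-game scoring for two players with similar averages
games = pd.DataFrame({
    "player": ["A"] * 6 + ["B"] * 6,
    "pts": [22, 25, 19, 24, 21, 23, 35, 8, 30, 12, 28, 15],
})

# a) Coefficient of variation = standard deviation / mean
summary = games.groupby("player")["pts"].agg(mean="mean", std="std")
summary["cv"] = summary["std"] / summary["mean"]

# c) Empirical probability of a 20+ point game
summary["p_20_plus"] = (
    games.assign(hit=games["pts"] >= 20).groupby("player")["hit"].mean()
)
```

A lower coefficient of variation indicates steadier scoring relative to the player's own average, which is why it is preferred over the raw standard deviation when the two players score at different levels.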
Exercise 31: Segment Analysis
Divide the season into quarters and analyze:
a) Calculate average statistics for each quarter of the season
b) Test for significant differences between quarters
c) Identify players who improved or declined most dramatically
d) Create visualizations showing the seasonal progression for key players
Exercise 32: Rest and Performance Analysis
Using game log data:
a) Calculate days of rest before each game
b) Compare performance with 0, 1, 2, and 3+ days of rest
c) Create box plots showing points per game by rest days
d) Test whether the differences are statistically significant
e) Discuss implications for load management
Section 7: Shot Chart Analysis (Exercises 33-38)
Exercise 33: Basic Shot Chart Creation
For a designated player:
a) Create a shot chart showing makes and misses on a court diagram
b) Color-code by shot type (2-point vs 3-point)
c) Add a legend and shooting percentages to the title
d) Save the chart at high resolution (300 DPI)
Exercise 34: Hexbin Analysis
Create hexbin shot charts for three players with different playing styles:
a) A high-volume three-point shooter
b) A paint-dominant center
c) A mid-range specialist
Compare the patterns and discuss how the shot distributions reflect playing style.
Exercise 35: Zone-Based Shooting Analysis
Implement court zone classification and analyze:
a) Define at least 8 distinct shooting zones
b) Calculate FG% for each zone for a selected player
c) Create a zone-based shot chart with zones colored by efficiency
d) Compare zone distributions for a guard vs a center
Exercise 36: Shot Chart Comparison
Create a comparison visualization for two players:
a) Side-by-side hexbin shot charts
b) A difference chart showing where each player shoots more
c) Zone-by-zone comparison table
d) Narrative analysis of the key differences
Exercise 37: Expected Points Analysis
Implement an expected points framework:
a) Calculate league-average FG% for each shot zone
b) Assign expected point values to each shot
c) Calculate expected points per shot for multiple players
d) Identify players who exceed expectations vs underperform
e) Create a scatter plot of expected vs actual points per shot
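Steps a) through d) can be sketched as follows. The shot rows, zone names, and column names are assumptions, and the "league average" here is computed from the eight sample shots only; a real analysis would use league-wide shot data.

```python
import pandas as pd

# Invented shot-level data: made is 1 for a make, 0 for a miss
shots = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "zone": ["rim", "rim", "corner3", "corner3"] * 2,
    "shot_value": [2, 2, 3, 3] * 2,
    "made": [1, 1, 0, 1, 0, 1, 0, 0],
})

# a) Average FG% per zone, and the expected points of one shot from that zone
zone_avg = shots.groupby("zone").agg(
    fg_pct=("made", "mean"),
    shot_value=("shot_value", "first"),
)
zone_avg["xpts"] = zone_avg["fg_pct"] * zone_avg["shot_value"]

# b) Attach the zone's expected value and the actual points to every shot
shots["xpts"] = shots["zone"].map(zone_avg["xpts"])
shots["actual_pts"] = shots["made"] * shots["shot_value"]

# c) and d) Expected vs actual points per shot for each player
per_player = shots.groupby("player")[["xpts", "actual_pts"]].mean()
per_player["pts_over_expected"] = per_player["actual_pts"] - per_player["xpts"]
```

A positive pts_over_expected means the player scored more per shot than the zone averages predict for the same shot selection, which separates shot-making from shot quality.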
Exercise 38: Temporal Shot Pattern Analysis
Analyze how shooting patterns change:
a) Compare shot charts for a player's first half vs second half of season
b) Compare shot charts from 2018-19 vs 2023-24 for a veteran player
c) Analyze shot distribution by quarter of game
d) Create an animation or small multiples visualization showing the evolution
Section 8: Comprehensive Analysis Projects (Exercises 39-40)
Exercise 39: Complete Player Profile EDA
Select a player and create a comprehensive EDA report including:
a) Basic statistics summary and rankings among peers
b) Distribution analysis of game-to-game performance
c) Trend analysis showing performance over the season
d) Shot chart analysis with zone breakdowns
e) Comparison to players at the same position
f) Written narrative (500+ words) interpreting all findings
Exercise 40: Team-Level EDA Project
Select a team and conduct a full EDA including:
a) Roster composition analysis (age, experience, position distribution)
b) Scoring distribution among players
c) Shooting efficiency by position group
d) Team-level trends over the season
e) Comparison to league averages
f) Identification of strengths and weaknesses
g) Visualization dashboard with at least 6 charts
h) Executive summary (300+ words) with key insights
Answer Guidelines
For exercises requiring code:
- Provide complete, runnable Python code
- Include appropriate comments explaining logic
- Handle edge cases (missing values, empty results)
- Follow PEP 8 style guidelines
For exercises requiring analysis:
- Support claims with specific numbers from the data
- Reference visualizations created
- Consider alternative interpretations
- Acknowledge limitations of the analysis
For visualization exercises:
- Use appropriate chart types for the data
- Include proper labels, titles, and legends
- Choose accessible color schemes
- Ensure readability at standard sizes