Chapter 4: Key Takeaways
Summary
Chapter 4 established the systematic approach to exploratory data analysis (EDA) for basketball data. Before building predictive models or deriving insights, analysts must thoroughly understand their data through methodical exploration and visualization.
Core Concepts
The EDA Process
A structured EDA workflow for basketball data:
- Load and Inspect - Understand what data you have
- Clean and Validate - Address quality issues
- Explore Distributions - Understand individual variables
- Analyze Relationships - Discover variable interactions
- Visualize Patterns - Create informative graphics
- Document Findings - Record discoveries and decisions
Key Insight: EDA is not a single step but an iterative process. Initial exploration reveals questions that drive deeper investigation.
Data Loading and Inspection
Essential inspection commands:
| Method | Purpose | When to Use |
|---|---|---|
| `df.shape` | Dimensions | First look at size |
| `df.info()` | Types, nulls | Understand structure |
| `df.describe()` | Statistics | Numeric summary |
| `df.head(n)` | First rows | Sample data |
| `df[col].value_counts()` | Frequencies | Categorical data |
| `df.dtypes` | Data types | Type validation |
Key Insight: Data type issues (numeric appearing as 'object') often indicate data quality problems that need addressing.
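The inspection commands above can be sketched on a tiny example. The DataFrame, its column names, and the "DNP" placeholder are hypothetical, chosen only to show how a numeric column ends up stored as text and how `pd.to_numeric` with `errors="coerce"` surfaces the offending rows.

```python
import pandas as pd

# Hypothetical sample: PTS arrives as text because one row holds "DNP".
df = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "TEAM": ["PHX", "BOS", "PHX"],
    "PTS": ["25", "DNP", "12"],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # PTS is stored as text (object or string dtype), not a number

# Coerce to numeric; values that cannot be parsed become NaN for review.
pts_numeric = pd.to_numeric(df["PTS"], errors="coerce")

# Rows where parsing failed even though a value was present.
bad_rows = df[pts_numeric.isna() & df["PTS"].notna()]

df["PTS"] = pts_numeric
team_counts = df["TEAM"].value_counts()
```

Reviewing `bad_rows` before overwriting the column keeps a record of what was coerced, which supports the "Document Findings" step of the workflow.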
Data Cleaning Principles
Common basketball data issues and solutions:
| Issue | Example | Solution |
|---|---|---|
| Duplicate records | Traded player rows | Investigate context, deduplicate if needed |
| Inconsistent naming | PHX vs PHO | Create standardization mapping |
| Invalid values | FG% > 100% | Validate against basketball logic |
| Missing values | No 3PT% for non-shooters | Understand cause, handle appropriately |
Key Insight: Missing 3PT% for a player with zero attempts is not "missing data" - it's undefined (0 makes divided by 0 attempts). Handle it differently from values that are truly missing.
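A minimal sketch of three of the fixes from the table above, under stated assumptions: the column names (`FGM`, `FGA`, `FG3M`, `FG3A`) follow common box-score abbreviations, the rows are made up, and treating `PHX` as the canonical abbreviation is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow common box-score abbreviations.
df = pd.DataFrame({
    "TEAM": ["PHO", "PHX", "BOS"],
    "FGM": [5, 12, 3],
    "FGA": [10, 10, 6],
    "FG3M": [0, 2, 1],
    "FG3A": [0, 5, 4],
})

# Standardize abbreviations with an explicit mapping (assumed canonical form: PHX).
TEAM_ALIASES = {"PHO": "PHX"}
df["TEAM"] = df["TEAM"].replace(TEAM_ALIASES)

# Flag rows that break basketball logic: makes cannot exceed attempts.
invalid = df[df["FGM"] > df["FGA"]]

# 3PT% is undefined at zero attempts, so leave it as NaN instead of filling 0.
attempts = df["FG3A"].where(df["FG3A"] > 0)  # 0 attempts -> NaN
df["FG3_PCT"] = df["FG3M"] / attempts
```

Keeping the undefined percentage as `NaN` means later averages skip it rather than counting the player as a 0% shooter.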
Distribution Visualization
Choosing the right visualization:
| Visualization | Best For | Basketball Example |
|---|---|---|
| Histogram | Single variable distribution | PPG distribution across league |
| Box plot | Comparing groups, outliers | Scoring by position |
| Violin plot | Distribution shape comparison | Usage rate by position |
| KDE plot | Overlaying multiple groups | 3PA by era |
Key Insight: The relationship between mean and median reveals skewness. NBA scoring is right-skewed (mean > median) because a few high scorers pull the average up.
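The mean-versus-median check can be illustrated with a handful of made-up points-per-game values. The numbers are hypothetical and only mimic the shape described above: many modest scorers and a few high scorers.

```python
import statistics

# Hypothetical points-per-game values: many role players, a few high scorers.
ppg = [4.1, 5.3, 6.0, 6.8, 7.5, 8.2, 9.0, 10.4, 12.1, 27.5, 31.2]

mean_ppg = statistics.mean(ppg)
median_ppg = statistics.median(ppg)

# A mean above the median suggests a right-skewed distribution.
if mean_ppg > median_ppg:
    direction = "right-skewed"
else:
    direction = "left-skewed or symmetric"

print(f"mean={mean_ppg:.2f}, median={median_ppg:.2f} -> {direction}")
```

This comparison is a quick heuristic, not a formal test; a histogram of the same values remains the better way to confirm the shape.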
Relationship Analysis
Correlation interpretation guidelines:
| \|r\| | Interpretation | Basketball Example |
|---|---|---|
| 0.0-0.2 | Negligible | 3P% vs 3PA |
| 0.2-0.4 | Weak | Assists vs Turnovers |
| 0.4-0.6 | Moderate | Height vs Blocks |
| 0.6-0.8 | Strong | Minutes vs Points |
| 0.8-1.0 | Very strong | FGA vs Points |
Key Insight: Correlation does not imply causation. High three-point volume correlates with winning, but that doesn't mean shooting more threes causes wins.
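As a sketch, the interpretation guidelines can be turned into a small helper. The function name `interpret_correlation` is hypothetical, the sample arrays are made up, and treating each bin's lower edge as inclusive is an assumption, since the table does not say which bin a boundary value belongs to.

```python
import numpy as np


def interpret_correlation(r):
    """Map |r| to a strength label (lower bin edges assumed inclusive)."""
    strength = abs(r)
    if strength < 0.2:
        return "Negligible"
    if strength < 0.4:
        return "Weak"
    if strength < 0.6:
        return "Moderate"
    if strength < 0.8:
        return "Strong"
    return "Very strong"


# Hypothetical per-game samples for one player.
fga = np.array([10, 14, 18, 12, 20, 16, 8, 22])
pts = np.array([12, 17, 25, 13, 27, 19, 9, 30])

# Pearson correlation coefficient between attempts and points.
r = np.corrcoef(fga, pts)[0, 1]
label = interpret_correlation(r)
print(f"r={r:.2f} ({label})")
```

The label describes only the strength of a linear association; it says nothing about which variable, if either, drives the other.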
Time Series Analysis
Key techniques for performance over time:
| Technique | Purpose | Window Size Guidance |
|---|---|---|
| Rolling average | Smooth variance | 5-10 games for trends |
| Cumulative sum | Track milestones | Full season |
| Monthly aggregation | Identify patterns | By calendar month |
Key Insight: Rolling averages reveal underlying trends obscured by game-to-game variance. A 5-game window balances smoothing with responsiveness.
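A minimal sketch of the rolling-average and cumulative-sum techniques using pandas. The game log is made up, and setting `min_periods=5` is one possible choice: it leaves the first four games empty instead of averaging over fewer games.

```python
import pandas as pd

# Hypothetical game log: points in chronological order.
points = pd.Series([18, 25, 12, 30, 22, 15, 28, 35, 20, 24], name="PTS")

# 5-game rolling mean; the first four games stay NaN because a full
# window is required.
rolling_5 = points.rolling(window=5, min_periods=5).mean()

# Running season total, useful for tracking milestones.
cumulative = points.cumsum()

print(pd.DataFrame({"PTS": points, "ROLL_5": rolling_5, "CUM": cumulative}))
```

A wider window (for example 10 games) would smooth more aggressively but react more slowly to a genuine change in form.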
Shot Chart Analysis
Shot location data structure:
| Field | Unit | Range | Description |
|---|---|---|---|
| LOC_X | Tenths of feet | -250 to 250 | Horizontal position |
| LOC_Y | Tenths of feet | -50 to 890 | Vertical position |
| SHOT_DISTANCE | Feet | 0 to 90+ | Distance from basket |
Court Zones (common classification):
- Restricted Area: 0-4 feet
- Paint (Non-RA): Inside paint, 4+ feet
- Mid-Range: 4-23.75 feet (outside paint)
- Three-Point: 23.75+ feet (22 feet in corners)
Key Insight: Hexbin and KDE plots aggregate shots effectively; scatter plots work for individual players but become cluttered for league-wide analysis.
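The zone classification can be sketched as a function of `LOC_X` and `LOC_Y`. Several details are assumptions not stated in this summary: coordinates are taken to be centered on the basket, the paint is modeled as a 16-foot-wide lane whose top edge is 13.75 feet in front of the basket, and the corner three is taken to end where the 22-foot sideline segment meets the 23.75-foot arc. The function name `classify_zone` is hypothetical.

```python
import math

RESTRICTED_RADIUS_FT = 4.0
ARC_RADIUS_FT = 23.75
CORNER_DISTANCE_FT = 22.0
# Assumed paint geometry in basket-centered coordinates.
PAINT_HALF_WIDTH_FT = 8.0
PAINT_TOP_FT = 13.75
# Height at which the straight corner line meets the arc.
CORNER_BREAK_FT = math.sqrt(ARC_RADIUS_FT**2 - CORNER_DISTANCE_FT**2)


def classify_zone(loc_x, loc_y):
    """Classify one shot given LOC_X and LOC_Y in tenths of feet."""
    x_ft, y_ft = loc_x / 10, loc_y / 10
    distance = math.hypot(x_ft, y_ft)
    if distance <= RESTRICTED_RADIUS_FT:
        return "Restricted Area"
    in_corner = abs(x_ft) >= CORNER_DISTANCE_FT and y_ft <= CORNER_BREAK_FT
    if in_corner or distance >= ARC_RADIUS_FT:
        return "Three-Point"
    if abs(x_ft) <= PAINT_HALF_WIDTH_FT and y_ft <= PAINT_TOP_FT:
        return "Paint (Non-RA)"
    return "Mid-Range"


# Hypothetical shot locations.
for loc in [(0, 0), (0, 100), (150, 100), (225, 20), (0, 250)]:
    print(loc, classify_zone(*loc))
```

With a shots DataFrame, the same function could be applied row by row, for example through a list comprehension over `zip(shots_df['LOC_X'], shots_df['LOC_Y'])`.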
Checklist for Chapter Completion
Before proceeding to Chapter 5, ensure you can:
- [ ] Load and inspect NBA data with pandas
- [ ] Identify and handle missing values appropriately
- [ ] Clean data while respecting basketball logic constraints
- [ ] Create histograms, box plots, and violin plots
- [ ] Build scatter plots with regression lines
- [ ] Create and interpret correlation matrices
- [ ] Calculate and plot rolling averages
- [ ] Create shot charts using various techniques
- [ ] Classify shots into meaningful zones
- [ ] Generate automated EDA summary reports
- [ ] Document data quality decisions
Key Visualization Code Snippets
Distribution Analysis
```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_distribution_comparison(df, stat, group_col):
    """Compare distribution of stat across groups."""
    fig, ax = plt.subplots(figsize=(10, 6))
    # Overlay one KDE curve per group so distribution shapes can be compared.
    for group, group_df in df.groupby(group_col):
        sns.kdeplot(group_df[stat].dropna(), ax=ax, label=str(group))
    ax.set_xlabel(stat)
    ax.set_ylabel('Density')
    ax.legend()
    return fig
```
Correlation Heatmap
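```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def create_correlation_heatmap(df, columns):
    """Create masked correlation heatmap."""
    corr = df[columns].corr()
    # Hide the upper triangle and diagonal, which duplicate the lower triangle.
    mask = np.triu(np.ones_like(corr, dtype=bool))
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
                cmap='RdBu_r', center=0, ax=ax)
    return fig
```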
Shot Chart
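```python
import matplotlib.pyplot as plt
import numpy as np


def create_hexbin_shot_chart(shots_df):
    """Create hexbin shot chart."""
    fig, ax = plt.subplots(figsize=(12, 11))
    hexbin = ax.hexbin(
        shots_df['LOC_X'] / 10,  # tenths of feet -> feet
        shots_df['LOC_Y'] / 10,
        C=shots_df['SHOT_MADE_FLAG'],
        gridsize=25,
        cmap='RdYlGn',
        reduce_C_function=np.mean,  # mean of the 0/1 made flag per hexagon
        mincnt=5  # hide hexagons with too few shots
    )
    # Attach the colorbar to this figure explicitly; values run from 0 to 1.
    fig.colorbar(hexbin, ax=ax, label='FG% (fraction made)')
    return fig
```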
Common Pitfalls to Avoid
- Skipping inspection - Always look at your data before analyzing
- Ignoring missing values - Understand why data is missing
- Inappropriate imputation - Don't fill undefined values (0/0 percentages)
- Over-interpreting correlations - Correlation is not causation
- Using wrong visualization - Match chart type to data type and question
- Forgetting context - Era, team, and role affect statistics
- Small sample conclusions - 10 games is not enough for reliable percentages
- Not validating against logic - FG% must fall between 0 and 1 (0-100%), and minutes cannot exceed 48 in a regulation game (each overtime period adds 5)
Connections to Other Chapters
- Chapter 2 (Data Sources): Data collection determines what's available for EDA
- Chapter 3 (Environment): pandas, matplotlib, seaborn are the EDA workhorses
- Chapter 5 (Descriptive Statistics): EDA reveals which statistics are meaningful
- Chapter 6 (Box Score): EDA techniques apply directly to game-level analysis
- Chapter 7 (Advanced Metrics): Understanding distributions informs metric design
Summary Statement
Exploratory Data Analysis transforms raw numbers into understanding. Before any sophisticated modeling, analysts must know their data intimately - its structure, quality, distributions, and relationships. The techniques in this chapter form the foundation for all basketball analytics work that follows.