Chapter 4: Key Takeaways

Summary

Chapter 4 established the systematic approach to exploratory data analysis (EDA) for basketball data. Before building predictive models or deriving insights, analysts must thoroughly understand their data through methodical exploration and visualization.

Core Concepts

The EDA Process

A structured EDA workflow for basketball data:

Load and Inspect - Understand what data you have
Clean and Validate - Address quality issues
Explore Distributions - Understand individual variables
Analyze Relationships - Discover variable interactions
Visualize Patterns - Create informative graphics
Document Findings - Record discoveries and decisions

Key Insight: EDA is not a single step but an iterative process. Initial exploration reveals questions that drive deeper investigation.

Data Loading and Inspection

Essential inspection commands:

Method	Purpose	When to Use
`df.shape`	Dimensions	First look at size
`df.info()`	Types, nulls	Understand structure
`df.describe()`	Statistics	Numeric summary
`df.head(n)`	First rows	Sample data
`df[col].value_counts()`	Frequencies	Categorical data (per column)
`df.dtypes`	Data types	Type validation

Key Insight: Data type issues (for example, a numeric column stored as `object`) often indicate data quality problems, such as stray text entries, that need addressing.
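As a minimal sketch of that check, the example below uses a small made-up table in which a points column was read as text because of one non-numeric entry. The column names and the "DNP" value are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: PTS was read as text because of a stray "DNP" entry.
df = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "PTS": ["25", "DNP", "12"],
})

# PTS is reported as a text dtype (object or string), not a numeric type.
print(df.dtypes)

# Parse what can be parsed; entries that are not numbers become NaN.
parsed = pd.to_numeric(df["PTS"], errors="coerce")

# List the original entries that failed to parse so they can be investigated.
bad_values = df.loc[parsed.isna() & df["PTS"].notna(), "PTS"].tolist()
print(bad_values)

# Replace the column only after the offending values are understood.
df["PTS"] = parsed
```

Inspecting `bad_values` before overwriting the column keeps the decision about each odd entry explicit rather than silently turning it into NaN.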

Data Cleaning Principles

Common basketball data issues and solutions:

Issue	Example	Solution
Duplicate records	Traded player rows	Investigate context, deduplicate if needed
Inconsistent naming	PHX vs PHO	Create standardization mapping
Invalid values	FG% > 100%	Validate against basketball logic
Missing values	No 3PT% for non-shooters	Understand cause, handle appropriately

Key Insight: Missing 3PT% for a player with zero attempts is not "missing data" - it's undefined (0/0). Handle it differently from truly missing values.
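The sketch below illustrates two of these fixes on a made-up table: a standardization mapping for team abbreviations, and a 3PT% calculation that stays undefined (NaN) when there are no attempts. The column names (`FG3M`, `FG3A`, `FG3_PCT`) and the `TEAM_MAP` dictionary are assumptions, not a fixed schema.

```python
import pandas as pd

# Hypothetical player rows; column names are assumptions.
players = pd.DataFrame({
    "PLAYER": ["Guard A", "Center B", "Wing C"],
    "TEAM": ["PHO", "PHX", "BOS"],
    "FG3M": [120, 0, 45],
    "FG3A": [300, 0, 130],
})

# Standardize team abbreviations with an explicit mapping.
TEAM_MAP = {"PHO": "PHX"}
players["TEAM"] = players["TEAM"].replace(TEAM_MAP)

# Turn zero attempts into NaN so the percentage is undefined, not 0.
attempts = players["FG3A"].where(players["FG3A"] > 0)
players["FG3_PCT"] = players["FG3M"] / attempts

print(players)
```

Keeping the undefined percentage as NaN means later averages skip the player instead of counting a misleading 0%.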

Distribution Visualization

Choosing the right visualization:

Visualization	Best For	Basketball Example
Histogram	Single variable distribution	PPG distribution across league
Box plot	Comparing groups, outliers	Scoring by position
Violin plot	Distribution shape comparison	Usage rate by position
KDE plot	Overlaying multiple groups	3PA by era

Key Insight: The relationship between mean and median reveals skewness. NBA scoring is right-skewed (mean > median) because a few high scorers pull the average up.
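A quick way to check this on any stat is to compare the mean, the median, and the sample skewness. The values below are invented to mimic a roster with many role players and two high scorers. They are not real data.

```python
import pandas as pd

# Hypothetical points-per-game values: many role players, a few high scorers.
ppg = pd.Series([4.1, 5.3, 6.0, 7.2, 8.5, 9.1, 10.4, 12.0, 27.5, 31.2])

mean_ppg = ppg.mean()
median_ppg = ppg.median()
skewness = ppg.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_ppg:.2f}  median={median_ppg:.2f}  skew={skewness:.2f}")
```

When the mean sits well above the median and the skewness is positive, the median is usually the better description of a "typical" player.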

Relationship Analysis

Correlation interpretation guidelines:

|r|	Interpretation	Basketball Example
0.0-0.2	Negligible	3P% vs 3PA
0.2-0.4	Weak	Assists vs Turnovers
0.4-0.6	Moderate	Height vs Blocks
0.6-0.8	Strong	Minutes vs Points
0.8-1.0	Very strong	FGA vs Points

Key Insight: Correlation does not imply causation. High three-point volume correlates with winning, but that doesn't mean shooting more threes causes wins.
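The guideline table can be applied mechanically, as in the sketch below. The helper name `interpret_correlation` is hypothetical, the sample numbers are invented, and treating each band as including its lower bound is an assumption, since the table's ranges share endpoints.

```python
import pandas as pd

def interpret_correlation(r):
    """Map |r| to the guideline labels (lower bound inclusive, an assumption)."""
    strength = abs(r)
    if strength < 0.2:
        return "Negligible"
    if strength < 0.4:
        return "Weak"
    if strength < 0.6:
        return "Moderate"
    if strength < 0.8:
        return "Strong"
    return "Very strong"

# Hypothetical per-game minutes and points for six players.
stats = pd.DataFrame({
    "MIN": [12, 18, 24, 30, 34, 36],
    "PTS": [4, 9, 10, 18, 17, 25],
})

r = stats["MIN"].corr(stats["PTS"])  # Pearson correlation by default
label = interpret_correlation(r)
print(f"r={r:.2f} ({label})")
```

The label describes only the strength of the linear association. It says nothing about which variable, if either, drives the other.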

Time Series Analysis

Key techniques for performance over time:

Technique	Purpose	Window Size Guidance
Rolling average	Smooth variance	5-10 games for trends
Cumulative sum	Track milestones	Full season
Monthly aggregation	Identify patterns	By calendar month

Key Insight: Rolling averages reveal underlying trends obscured by game-to-game variance. A 5-game window balances smoothing with responsiveness.
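The rolling average and cumulative sum from the table can be sketched with pandas as follows. The game log is invented and the column names are assumptions.

```python
import pandas as pd

# Hypothetical eight-game scoring log for one player.
game_log = pd.DataFrame({
    "GAME": range(1, 9),
    "PTS": [18, 31, 12, 25, 22, 9, 35, 27],
})

# 5-game rolling mean; the first four games lack a full window, so they are NaN.
game_log["PTS_ROLL5"] = game_log["PTS"].rolling(window=5).mean()

# Running season total, useful for milestone tracking.
game_log["PTS_CUM"] = game_log["PTS"].cumsum()

print(game_log)
```

Passing `min_periods=1` to `rolling` would fill the early games with partial-window averages, at the cost of noisier values at the start of the season.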

Shot Chart Analysis

Shot location data structure:

Field	Unit	Range	Description
LOC_X	Tenths of feet	-250 to 250	Horizontal position
LOC_Y	Tenths of feet	-50 to 890	Vertical position
SHOT_DISTANCE	Feet	0 to 90+	Distance from basket

Court Zones (common classification):

Restricted Area: 0-4 feet
Paint (Non-RA): Inside paint, 4+ feet
Mid-Range: 4-23.75 feet (outside paint)
Three-Point: 23.75+ feet (22 feet in corners)
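One possible way to turn these zone definitions into code is sketched below. The function name `classify_shot_zone` is hypothetical. The paint boundaries (8 feet either side of the basket, up to 13.75 feet in front of it) are assumptions based on standard NBA lane dimensions, and distance is recomputed from `LOC_X` and `LOC_Y` rather than read from `SHOT_DISTANCE`.

```python
import numpy as np
import pandas as pd

def classify_shot_zone(shots_df):
    """Assign each shot to one of four zones using LOC_X / LOC_Y."""
    x = shots_df["LOC_X"] / 10.0  # tenths of feet -> feet
    y = shots_df["LOC_Y"] / 10.0
    dist = np.hypot(x, y)         # distance from the basket in feet

    # Above the break the line is 23.75 ft away; in the corners it is 22 ft wide.
    is_three = (dist >= 23.75) | (x.abs() >= 22.0)
    in_restricted = dist < 4.0
    # Assumed paint rectangle relative to the basket (see lead-in).
    in_paint = (x.abs() <= 8.0) & (y <= 13.75)

    # np.select takes the first matching condition, so order matters.
    zones = np.select(
        [is_three, in_restricted, in_paint],
        ["Three-Point", "Restricted Area", "Paint (Non-RA)"],
        default="Mid-Range",
    )
    return pd.Series(zones, index=shots_df.index, name="ZONE")

# Hypothetical shot locations in tenths of feet.
shots = pd.DataFrame({
    "LOC_X": [0, 50, 150, -230, 0],
    "LOC_Y": [10, 100, 50, 20, 250],
})
shots["ZONE"] = classify_shot_zone(shots)
print(shots)
```

Because every shot at least 22 feet to either side of the basket is beyond the corner line, the corner rule needs no separate vertical cutoff.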

Key Insight: Hexbin and KDE plots aggregate shots effectively; scatter plots work for individual players but become cluttered for league-wide analysis.

Checklist for Chapter Completion

Before proceeding to Chapter 5, ensure you can:

[ ] Load and inspect NBA data with pandas
[ ] Identify and handle missing values appropriately
[ ] Clean data while respecting basketball logic constraints
[ ] Create histograms, box plots, and violin plots
[ ] Build scatter plots with regression lines
[ ] Create and interpret correlation matrices
[ ] Calculate and plot rolling averages
[ ] Create shot charts using various techniques
[ ] Classify shots into meaningful zones
[ ] Generate automated EDA summary reports
[ ] Document data quality decisions

Key Visualization Code Snippets

Distribution Analysis

import matplotlib.pyplot as plt
import seaborn as sns

def plot_distribution_comparison(df, stat, group_col):
    """Compare distribution of stat across groups."""
    fig, ax = plt.subplots(figsize=(10, 6))

    # Overlay one KDE curve per group on shared axes
    for group in df[group_col].dropna().unique():
        data = df.loc[df[group_col] == group, stat].dropna()
        sns.kdeplot(data, ax=ax, label=str(group))

    ax.set_xlabel(stat)
    ax.set_ylabel('Density')
    ax.legend(title=group_col)
    return fig

Correlation Heatmap

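import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def create_correlation_heatmap(df, columns):
    """Create masked correlation heatmap."""
    corr = df[columns].corr()
    # Hide the upper triangle (and diagonal), which duplicates the lower one
    mask = np.triu(np.ones_like(corr, dtype=bool))

    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
                cmap='RdBu_r', center=0, vmin=-1, vmax=1, ax=ax)
    return fig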

Shot Chart

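import matplotlib.pyplot as plt
import numpy as np

def create_hexbin_shot_chart(shots_df):
    """Create hexbin shot chart."""
    fig, ax = plt.subplots(figsize=(12, 11))

    # LOC_X / LOC_Y are in tenths of feet; convert to feet
    hexbin = ax.hexbin(
        shots_df['LOC_X'] / 10,
        shots_df['LOC_Y'] / 10,
        C=shots_df['SHOT_MADE_FLAG'],
        gridsize=25,
        cmap='RdYlGn',
        reduce_C_function=np.mean,  # mean of 0/1 flags = FG% per hexagon
        mincnt=5  # hide sparsely populated hexagons
    )
    ax.set_aspect('equal')  # keep court proportions undistorted
    fig.colorbar(hexbin, ax=ax, label='FG% (0-1)')
    return fig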

Common Pitfalls to Avoid

Skipping inspection - Always look at your data before analyzing
Ignoring missing values - Understand why data is missing
Inappropriate imputation - Don't fill undefined values (0/0 percentages)
Over-interpreting correlations - Correlation is not causation
Using wrong visualization - Match chart type to data type and question
Forgetting context - Era, team, and role affect statistics
Small sample conclusions - 10 games is not enough for reliable percentages
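Not validating against logic - FG% must be 0-1 (or 0-100%, depending on the scale used); a player's minutes in a game must be 0-48 in regulation and can exceed 48 only with overtime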
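The last pitfall lends itself to a small automated check. The sketch below is one possible shape for it: the helper name `find_invalid_rows`, the column names, and the specific rules are assumptions, and FG% is taken to be on a 0-1 scale.

```python
import pandas as pd

def find_invalid_rows(df):
    """Return a boolean DataFrame with one column per failed logic check."""
    checks = pd.DataFrame(index=df.index)
    # Undefined (NaN) percentages are not flagged; only out-of-range values are.
    checks["fg_pct_out_of_range"] = ~df["FG_PCT"].between(0, 1) & df["FG_PCT"].notna()
    checks["makes_exceed_attempts"] = df["FGM"] > df["FGA"]
    checks["negative_minutes"] = df["MIN"] < 0
    return checks

# Hypothetical box score rows, two of which break basic basketball logic.
box = pd.DataFrame({
    "FG_PCT": [0.45, 1.2, None],
    "FGM": [5, 6, 0],
    "FGA": [11, 5, 0],
    "MIN": [30, 28, -1],
})

checks = find_invalid_rows(box)
flagged = box[checks.any(axis=1)]  # rows failing at least one check
print(flagged)
```

No upper bound on minutes is enforced here because overtime games can push a player past 48.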

Connections to Other Chapters

Chapter 2 (Data Sources): Data collection determines what's available for EDA
Chapter 3 (Environment): pandas, matplotlib, seaborn are the EDA workhorses
Chapter 5 (Descriptive Statistics): EDA reveals which statistics are meaningful
Chapter 6 (Box Score): EDA techniques apply directly to game-level analysis
Chapter 7 (Advanced Metrics): Understanding distributions informs metric design

Summary Statement

Exploratory Data Analysis transforms raw numbers into understanding. Before any sophisticated modeling, analysts must know their data intimately - its structure, quality, distributions, and relationships. The techniques in this chapter form the foundation for all basketball analytics work that follows.