Chapter 4: Key Takeaways

Summary

Chapter 4 established the systematic approach to exploratory data analysis (EDA) for basketball data. Before building predictive models or deriving insights, analysts must thoroughly understand their data through methodical exploration and visualization.


Core Concepts

The EDA Process

A structured EDA workflow for basketball data:

  1. Load and Inspect - Understand what data you have
  2. Clean and Validate - Address quality issues
  3. Explore Distributions - Understand individual variables
  4. Analyze Relationships - Discover variable interactions
  5. Visualize Patterns - Create informative graphics
  6. Document Findings - Record discoveries and decisions

Key Insight: EDA is not a single step but an iterative process. Initial exploration reveals questions that drive deeper investigation.
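The six workflow steps can be sketched as a single pass over a DataFrame. This is a minimal illustration, not a complete pipeline; the column names (`PTS`, `AST`) and the tiny dataset are illustrative:

```python
import pandas as pd

def run_basic_eda(df: pd.DataFrame) -> dict:
    """Run steps 1-4 of the workflow and return a dict of findings."""
    findings = {}
    # 1. Load and Inspect: size and structure
    findings["shape"] = df.shape
    findings["dtypes"] = df.dtypes.to_dict()
    # 2. Clean and Validate: flag duplicates and missing values
    findings["n_duplicates"] = int(df.duplicated().sum())
    findings["n_missing"] = df.isna().sum().to_dict()
    # 3. Explore Distributions: numeric summaries
    findings["summary"] = df.describe()
    # 4. Analyze Relationships: pairwise correlations
    findings["correlations"] = df.corr(numeric_only=True)
    return findings

# Tiny illustrative dataset: one row duplicates another
df = pd.DataFrame({"PTS": [25, 18, 31, 18], "AST": [7, 4, 9, 4]})
report = run_basic_eda(df)
print(report["shape"])         # (4, 2)
print(report["n_duplicates"])  # 1
```

Steps 5 and 6 (visualization and documentation) build on a findings dict like this one; the plotting snippets later in the chapter cover step 5.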


Data Loading and Inspection

Essential inspection commands:

| Method | Purpose | When to Use |
|--------|---------|-------------|
| `df.shape` | Dimensions | First look at size |
| `df.info()` | Types, nulls | Understand structure |
| `df.describe()` | Statistics | Numeric summary |
| `df.head(n)` | First rows | Sample data |
| `df['col'].value_counts()` | Frequencies | Categorical data |
| `df.dtypes` | Data types | Type validation |

Key Insight: Data type issues (a numeric column stored as 'object') often indicate data quality problems that need addressing.
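A numeric column stored as 'object' (for example, because of stray text like "DNP") can be detected and coerced with `pd.to_numeric`. The column name and the "DNP" marker here are illustrative:

```python
import pandas as pd

# A points column polluted with a non-numeric entry is stored as object
df = pd.DataFrame({"PTS": ["25", "18", "DNP", "31"]})
print(df["PTS"].dtype)  # object

# errors='coerce' turns unparseable values into NaN instead of raising
df["PTS"] = pd.to_numeric(df["PTS"], errors="coerce")
print(df["PTS"].dtype)          # float64
print(df["PTS"].isna().sum())   # 1 -- the 'DNP' entry, now flagged for cleaning
```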


Data Cleaning Principles

Common basketball data issues and solutions:

| Issue | Example | Solution |
|-------|---------|----------|
| Duplicate records | Traded player rows | Investigate context, deduplicate if needed |
| Inconsistent naming | PHX vs PHO | Create standardization mapping |
| Invalid values | FG% > 100% | Validate against basketball logic |
| Missing values | No 3PT% for non-shooters | Understand cause, handle appropriately |

Key Insight: Missing 3PT% for a player with zero attempts is not "missing data"; it is undefined (0/0). Handle it differently than truly missing values.
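One way to respect the undefined-versus-missing distinction is to compute the percentage only where attempts exist and leave `NaN` elsewhere, rather than imputing. The column names follow the NBA convention (`FG3A`, `FG3M`) but the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "PLAYER": ["Shooter", "Big"],
    "FG3A": [400, 0],   # three-point attempts
    "FG3M": [150, 0],   # three-point makes
})

# 0/0 is undefined, not missing: compute only where attempts exist
df["FG3_PCT"] = df["FG3M"].div(df["FG3A"]).where(df["FG3A"] > 0)
print(df["FG3_PCT"].tolist())  # [0.375, nan]
```

Filling the `NaN` with 0 would wrongly label a player who never shoots threes as a 0% shooter, which is why the undefined case should stay `NaN` (or be excluded) in downstream analysis.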


Distribution Visualization

Choosing the right visualization:

| Visualization | Best For | Basketball Example |
|---------------|----------|--------------------|
| Histogram | Single variable distribution | PPG distribution across league |
| Box plot | Comparing groups, outliers | Scoring by position |
| Violin plot | Distribution shape comparison | Usage rate by position |
| KDE plot | Overlaying multiple groups | 3PA by era |

Key Insight: The relationship between mean and median reveals skewness. NBA scoring is right-skewed (mean > median) because a few high scorers pull the average up.
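The mean-versus-median check is a one-liner in pandas. The PPG values below are an illustrative sample, not real league data:

```python
import pandas as pd

# Illustrative PPG sample: most players score modestly, a few stars score a lot
ppg = pd.Series([4, 6, 7, 8, 9, 10, 11, 12, 28, 33])

mean, median = ppg.mean(), ppg.median()
print(mean, median)    # 12.8 9.5 -- the two high scorers pull the mean up
print(mean > median)   # True -> right-skewed
print(ppg.skew() > 0)  # True -- pandas' skewness statistic agrees
```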


Relationship Analysis

Correlation interpretation guidelines:

| \|r\| | Interpretation | Basketball Example |
|---------|----------------|--------------------|
| 0.0-0.2 | Negligible | 3P% vs 3PA |
| 0.2-0.4 | Weak | Assists vs Turnovers |
| 0.4-0.6 | Moderate | Height vs Blocks |
| 0.6-0.8 | Strong | Minutes vs Points |
| 0.8-1.0 | Very strong | FGA vs Points |

Key Insight: Correlation does not imply causation. High three-point volume correlates with winning, but that doesn't mean shooting more threes causes wins.


Time Series Analysis

Key techniques for performance over time:

| Technique | Purpose | Window Size Guidance |
|-----------|---------|----------------------|
| Rolling average | Smooth variance | 5-10 games for trends |
| Cumulative sum | Track milestones | Full season |
| Monthly aggregation | Identify patterns | By calendar month |

Key Insight: Rolling averages reveal underlying trends obscured by game-to-game variance. A 5-game window balances smoothing with responsiveness.
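The rolling-average and cumulative-sum techniques map directly onto pandas `rolling` and `cumsum`. The game-by-game points below are illustrative:

```python
import pandas as pd

# Illustrative game-by-game points for one player
pts = pd.Series([12, 30, 8, 25, 15, 28, 10, 22, 31, 9])

# 5-game rolling mean smooths game-to-game variance;
# min_periods=1 yields values for the first few games instead of NaN
rolling = pts.rolling(window=5, min_periods=1).mean()
cumulative = pts.cumsum()  # running season total for milestone tracking

print(rolling.iloc[4])      # 18.0 -- mean of games 1-5
print(cumulative.iloc[-1])  # 190 -- season total so far
```

Widening the window beyond 5 games smooths further but reacts more slowly to genuine changes in form, which is the trade-off the Key Insight above describes.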


Shot Chart Analysis

Shot location data structure:

| Field | Unit | Range | Description |
|-------|------|-------|-------------|
| LOC_X | Tenths of feet | -250 to 250 | Horizontal position |
| LOC_Y | Tenths of feet | -50 to 890 | Vertical position |
| SHOT_DISTANCE | Feet | 0 to 90+ | Distance from basket |

Court Zones (common classification):
- Restricted Area: 0-4 feet
- Paint (Non-RA): inside the paint, 4+ feet
- Mid-Range: 4-23.75 feet (outside the paint)
- Three-Point: 23.75+ feet (22 feet in corners)

Key Insight: Hexbin and KDE plots aggregate shots effectively; scatter plots work for individual players but become cluttered for league-wide analysis.
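Using the zone boundaries listed above, a shot can be classified from its raw coordinates. The exact cutoffs used here (the corner-three test at |LOC_X| >= 220 tenths of feet and the paint boundaries) are simplifications of the real court geometry, so treat this as a sketch rather than an official zone mapping:

```python
def classify_shot_zone(loc_x: float, loc_y: float, shot_distance: float) -> str:
    """Classify a shot into the court zones above.

    loc_x and loc_y are in tenths of feet (raw LOC_X/LOC_Y units);
    shot_distance is in feet.
    """
    if shot_distance <= 4:
        return "Restricted Area"
    # Corner threes: past the 22-foot line while still low along the baseline
    # (the 92.5 baseline cutoff is an approximation)
    if abs(loc_x) >= 220 and loc_y <= 92.5:
        return "Three-Point"
    if shot_distance >= 23.75:
        return "Three-Point"
    # The paint is roughly 16 feet wide (|LOC_X| <= 80 tenths of feet);
    # the 142.5 depth cutoff is likewise approximate
    if abs(loc_x) <= 80 and loc_y <= 142.5:
        return "Paint (Non-RA)"
    return "Mid-Range"

print(classify_shot_zone(0, 10, 1))        # Restricted Area
print(classify_shot_zone(235, 50, 22.3))   # Three-Point (corner)
print(classify_shot_zone(0, 250, 25.0))    # Three-Point (above the break)
print(classify_shot_zone(100, 150, 18.0))  # Mid-Range
```

Applied row-wise with `df.apply`, a function like this produces the zone column that the chapter's checklist asks for.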


Checklist for Chapter Completion

Before proceeding to Chapter 5, ensure you can:

  • [ ] Load and inspect NBA data with pandas
  • [ ] Identify and handle missing values appropriately
  • [ ] Clean data while respecting basketball logic constraints
  • [ ] Create histograms, box plots, and violin plots
  • [ ] Build scatter plots with regression lines
  • [ ] Create and interpret correlation matrices
  • [ ] Calculate and plot rolling averages
  • [ ] Create shot charts using various techniques
  • [ ] Classify shots into meaningful zones
  • [ ] Generate automated EDA summary reports
  • [ ] Document data quality decisions

Key Visualization Code Snippets

The snippets below assume the standard analysis imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Distribution Analysis

def plot_distribution_comparison(df, stat, group_col):
    """Compare the distribution of stat across groups with overlaid KDEs."""
    fig, ax = plt.subplots(figsize=(10, 6))

    for group in df[group_col].unique():
        # Select the stat for this group, dropping missing values
        data = df.loc[df[group_col] == group, stat].dropna()
        sns.kdeplot(data, ax=ax, label=group)

    ax.set_xlabel(stat)
    ax.set_ylabel('Density')
    ax.legend(title=group_col)
    return fig

Correlation Heatmap

def create_correlation_heatmap(df, columns):
    """Create a correlation heatmap with the redundant upper triangle masked."""
    corr = df[columns].corr()
    # Mask the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones_like(corr, dtype=bool))

    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
                cmap='RdBu_r', center=0, ax=ax)
    return fig

Shot Chart

def create_hexbin_shot_chart(shots_df):
    """Create a hexbin shot chart colored by FG%."""
    fig, ax = plt.subplots(figsize=(12, 11))

    hexbin = ax.hexbin(
        shots_df['LOC_X'] / 10,           # convert tenths of feet to feet
        shots_df['LOC_Y'] / 10,
        C=shots_df['SHOT_MADE_FLAG'],     # mean of make/miss flags = FG%
        gridsize=25,
        cmap='RdYlGn',
        reduce_C_function=np.mean,
        mincnt=5                          # hide bins with fewer than 5 shots
    )
    fig.colorbar(hexbin, ax=ax, label='FG%')
    return fig

Common Pitfalls to Avoid

  1. Skipping inspection - Always look at your data before analyzing
  2. Ignoring missing values - Understand why data is missing
  3. Inappropriate imputation - Don't fill undefined values (0/0 percentages)
  4. Over-interpreting correlations - Correlation is not causation
  5. Using wrong visualization - Match chart type to data type and question
  6. Forgetting context - Era, team, and role affect statistics
  7. Small sample conclusions - 10 games is not enough for reliable percentages
  8. Not validating against logic - FG% must be 0-1, minutes 0-48
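Pitfall 8 can be automated with a simple logic check. The data below is illustrative, and the 48-minute cap applies only to regulation games (overtime games can legitimately exceed it), so adjust the bounds for your data:

```python
import pandas as pd

# Illustrative box-score rows, two with values that violate basketball logic
df = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "FG_PCT": [0.45, 1.20, 0.51],  # stored as a fraction, so must be in [0, 1]
    "MIN": [34, 22, 61],           # regulation game: 0-48 minutes
})

# Flag violating rows for investigation instead of silently dropping them
invalid = df[(df["FG_PCT"] < 0) | (df["FG_PCT"] > 1) |
             (df["MIN"] < 0) | (df["MIN"] > 48)]
print(invalid["PLAYER"].tolist())  # ['B', 'C'] -- investigate before modeling
```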

Connections to Other Chapters

  • Chapter 2 (Data Sources): Data collection determines what's available for EDA
  • Chapter 3 (Environment): pandas, matplotlib, seaborn are the EDA workhorses
  • Chapter 5 (Descriptive Statistics): EDA reveals which statistics are meaningful
  • Chapter 6 (Box Score): EDA techniques apply directly to game-level analysis
  • Chapter 7 (Advanced Metrics): Understanding distributions informs metric design

Summary Statement

Exploratory Data Analysis transforms raw numbers into understanding. Before any sophisticated modeling, analysts must know their data intimately: its structure, quality, distributions, and relationships. The techniques in this chapter form the foundation for all basketball analytics work that follows.