Chapter 26: Machine Learning in Basketball - Key Takeaways
Executive Summary
Machine learning extends traditional analytics capabilities, enabling pattern discovery, prediction at scale, and insights too subtle for manual analysis. However, ML is not a replacement for domain knowledge: the most successful applications combine algorithmic sophistication with basketball expertise. This chapter provided frameworks for applying supervised and unsupervised learning to basketball problems while maintaining interpretability and avoiding common pitfalls.
Core Concepts
1. ML Paradigms in Basketball
| Paradigm | Definition | Basketball Examples |
|---|---|---|
| Supervised | Learn from labeled data | Predict All-Star, draft success, game outcome |
| Unsupervised | Find structure in unlabeled data | Player clustering, anomaly detection |
| Reinforcement | Learn through trial and error | Lineup optimization, in-game decisions |
2. The Basketball ML Pipeline
- Problem Definition: Define clear, measurable objectives
- Data Collection: Gather relevant, high-quality data
- Feature Engineering: Create meaningful features from raw data
- Model Selection: Choose appropriate algorithm for the task
- Training: Fit model to training data
- Validation: Evaluate on held-out data
- Deployment: Put model into production use
- Monitoring: Track performance over time
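The training and validation steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data; the three features (standing in for, say, per-36 scoring, TS%, and age) are hypothetical.

```python
# Minimal sketch of the Training and Validation pipeline steps (scikit-learn).
# Data is synthetic; the features are hypothetical stand-ins.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # e.g. per-36 pts, TS%, age (already in comparable units)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = Pipeline([("scale", StandardScaler()),      # preprocessing
                  ("clf", LogisticRegression())])   # simple baseline model
model.fit(X_train, y_train)                         # Training step
val_acc = model.score(X_test, y_test)               # Validation on held-out data
```

Wrapping the scaler and model in a `Pipeline` also guards against a common leakage source: fitting the scaler on the full dataset before splitting.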
3. Feature Engineering Best Practices
Rate Statistics:
- Per-36-minute or per-100-possession rates normalize for playing time
- Adjust for team pace
Efficiency Metrics:
- True Shooting % over raw FG%
- Usage-adjusted statistics
Contextual Features:
- Conference/competition level
- Teammate quality
- Role in offense/defense
Interaction Features:
- Physical profile combinations
- Style matchup indicators
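The rate and efficiency features above compute directly from box-score totals. A small sketch using the standard per-36 and True Shooting formulas; the raw totals are made up.

```python
# Two common engineered features, following the standard definitions.
# Input numbers are invented for illustration.
def per_36(stat_total: float, minutes: float) -> float:
    """Normalize a counting stat to a per-36-minute rate."""
    return stat_total * 36.0 / minutes

def true_shooting(points: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * (FGA + 0.44 * FTA))."""
    return points / (2.0 * (fga + 0.44 * fta))

pts_per36 = per_36(stat_total=400, minutes=900)   # 16.0 points per 36 minutes
ts = true_shooting(points=400, fga=300, fta=100)  # ~0.581
```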
4. Model Selection Guide
| Task | Recommended Algorithms | Notes |
|---|---|---|
| Binary classification | Logistic Regression, Random Forest, XGBoost | Start simple |
| Multi-class | Random Forest, Neural Network | Consider class imbalance |
| Regression | Ridge, Gradient Boosting | Handle outliers |
| Clustering | K-Means, GMM | Determine K systematically |
| Dimensionality reduction | PCA, t-SNE | PCA for preprocessing, t-SNE for visualization |
5. Evaluation Metrics
Classification:
- Accuracy (only with balanced classes)
- Precision, Recall, F1 score
- AUC-ROC (threshold-independent ranking quality)
- Log Loss (rewards well-calibrated probabilistic predictions)
Regression:
- MAE (interpretable average error)
- RMSE (penalizes large errors)
- R-squared (variance explained)
Clustering:
- Silhouette score
- Within-cluster variance
- Domain-expert validation
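The caveat on accuracy is easy to demonstrate: with a 90/10 class split, a degenerate model that always predicts the majority class scores 90% accuracy yet has zero recall. A pure-Python sketch:

```python
# Why accuracy misleads under class imbalance: a "model" that always
# predicts the majority class still looks good on accuracy alone.
y_true = [0] * 90 + [1] * 10   # 90/10 class imbalance
y_pred = [0] * 100             # degenerate majority-class predictor

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)  # 0.90
recall = tp / (tp + fn) if (tp + fn) else 0.0                              # 0.0
```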
Practical Application Checklist
Before Building a Model
- [ ] Define clear problem statement
- [ ] Identify target variable(s)
- [ ] Determine success criteria
- [ ] Assess data availability and quality
- [ ] Consider if ML is even necessary
Data Preparation
- [ ] Handle missing values appropriately
- [ ] Encode categorical variables
- [ ] Scale numerical features (when required)
- [ ] Engineer domain-relevant features
- [ ] Remove or transform outliers
- [ ] Check for data leakage
Model Development
- [ ] Establish baseline performance
- [ ] Start with simple models
- [ ] Use proper cross-validation (time-based for temporal data)
- [ ] Tune hyperparameters systematically
- [ ] Evaluate on held-out test set
- [ ] Check calibration for probability predictions
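Two of the checklist items, establishing a baseline and using time-based cross-validation, can be sketched with scikit-learn's `DummyClassifier` and `TimeSeriesSplit`. The data here is synthetic and stands in for rows ordered chronologically (e.g. by game date).

```python
# Baseline + time-based cross-validation sketch. Rows are assumed to be in
# chronological order, so earlier folds train and later folds validate.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=300) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)  # never trains on the future
baseline = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=tscv).mean()
model = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=tscv).mean()
```

A real model should clearly beat the dummy baseline before any hyperparameter tuning is worth the effort.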
Interpretation and Deployment
- [ ] Analyze feature importance
- [ ] Generate local explanations (SHAP)
- [ ] Validate with domain experts
- [ ] Document model limitations
- [ ] Plan for model monitoring and updates
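The chapter points to SHAP for local explanations; as a lighter-weight sketch of the feature-importance step, scikit-learn's model-agnostic `permutation_importance` works with any fitted model. The data below is synthetic, with only the first feature carrying signal.

```python
# Global feature importance via permutation: shuffle one feature at a time
# and measure how much the model's score drops. Only feature 0 matters here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # outcome driven entirely by feature 0

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
top_feature = int(np.argmax(result.importances_mean))  # expect 0
```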
Common Mistakes to Avoid
Mistake 1: Data Leakage
Problem: Using information not available at prediction time.
Solution: Strict temporal separation and careful feature engineering.
Mistake 2: Ignoring Class Imbalance
Problem: The model always predicts the majority class.
Solution: Use appropriate metrics, resampling, or cost-sensitive learning.
Mistake 3: Overfitting
Problem: The model memorizes the training data.
Solution: Regularization, cross-validation, simpler models.
Mistake 4: Ignoring Domain Knowledge
Problem: Meaningless features and uninterpretable results.
Solution: Collaborate with basketball experts and validate findings.
Mistake 5: Wrong Metric
Problem: Optimizing for accuracy when recall matters.
Solution: Choose metrics aligned with business objectives.
Mistake 6: Black Box Acceptance
Problem: No one can explain why the model makes its predictions.
Solution: Use interpretable models or interpretation tools (SHAP).
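Mistakes 2 and 5 often appear together. One common remedy, sketched below on synthetic data, is cost-sensitive learning via `class_weight="balanced"` in scikit-learn, evaluated with recall rather than accuracy.

```python
# Cost-sensitive learning for a rare positive class (~5%). The balanced
# model reweights the loss so it stops defaulting to the majority class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(3)
n = 1000
y = (rng.random(n) < 0.05).astype(int)          # ~5% positives
X = rng.normal(size=(n, 2)) + y[:, None] * 1.0  # weak signal

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X))        # low: misses positives
recall_balanced = recall_score(y, balanced.predict(X))  # much higher
```

The trade-off is more false positives; which direction to lean depends on whether missing a prospect or wasting a scouting trip is the costlier error.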
Algorithm Quick Reference
Supervised Learning
Linear/Logistic Regression:
- Interpretable coefficients
- Fast training
- Assumes linear relationships
- Good baseline
Random Forest:
- Handles non-linearity
- Built-in feature importance
- Less prone to overfitting than single trees
- May struggle with very high-dimensional data
Gradient Boosting (XGBoost, LightGBM):
- Often the best performance
- Handles missing values
- Requires careful tuning
- Risk of overfitting
Neural Networks:
- Captures complex patterns
- Requires large amounts of data
- A "black box" without interpretation tools
- Computationally intensive
Unsupervised Learning
K-Means:
- Simple and fast
- Requires specifying K
- Assumes spherical clusters
- Sensitive to initialization
Hierarchical Clustering:
- No need to specify K upfront
- Dendrogram visualization
- Computationally expensive for large data
PCA:
- Linear dimensionality reduction
- Preserves global structure
- Components may be hard to interpret
t-SNE:
- Non-linear visualization
- Preserves local structure
- Not for preprocessing (use PCA)
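"Determine K systematically" (from the model selection guide) usually means scanning candidate values of K and keeping the one with the best silhouette score. A sketch on three synthetic, well-separated blobs:

```python
# Choosing K for K-Means by maximizing the silhouette score.
# Three well-separated synthetic blobs, so K=3 should win.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
centers = np.array([[0, 0], [6, 0], [0, 6]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # expect 3
```

With real player data the peak is rarely this clean; the chapter's advice to validate clusters with domain experts applies before trusting any K.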
Summary: When to Use ML
Use ML When:
- Large amounts of data available
- Complex, non-linear patterns expected
- Manual analysis is infeasible at scale
- Prediction accuracy is the priority
- Patterns may be too subtle for humans
Don't Use ML When:
- Simple statistical methods suffice
- Data is limited or low quality
- Interpretability is paramount
- Domain expertise already solves the problem
- Only a quick, one-time analysis is needed
Key Formulas
Classification Metrics
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Clustering Metrics
Silhouette = (b - a) / max(a, b)
where a = avg intra-cluster distance
b = avg nearest-cluster distance
Feature Scaling
Standard: z = (x - mean) / std
MinMax: x_norm = (x - min) / (max - min)
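The two scaling formulas translate directly to code. A pure-Python sketch (scikit-learn's `StandardScaler` and `MinMaxScaler` do the same thing column-wise):

```python
# Direct translations of the standardization and min-max formulas above.
def standardize(xs):
    """z = (x - mean) / std, using the population standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max(xs):
    """x_norm = (x - min) / (max - min), mapping onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

z = standardize([10.0, 20.0, 30.0])  # [-1.224..., 0.0, 1.224...]
m = min_max([10.0, 20.0, 30.0])      # [0.0, 0.5, 1.0]
```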
Further Study Recommendations
- Foundations: Complete an ML course (Andrew Ng's Coursera, fast.ai)
- Implementation: Master scikit-learn, then XGBoost/LightGBM
- Deep Learning: TensorFlow/PyTorch for advanced applications
- Interpretation: Learn SHAP and LIME for model explanation
- Production: Study MLOps for deployment best practices
- Domain: Continuously build basketball knowledge