Chapter 28 Further Reading: Feature Engineering for Sports Betting
The following annotated bibliography provides resources for deeper exploration of the feature engineering concepts introduced in Chapter 28. Entries are organized by category and chosen for their relevance to building predictive features for sports betting models.
Books: Feature Engineering and Machine Learning
1. Zheng, Alice and Casari, Amanda. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, 2018. The definitive practitioner's guide to feature engineering. Covers numeric transformations, categorical encoding, text features, and dimensionality reduction with worked examples in Python. The chapters on feature selection and model-based feature importance are directly applicable to sports betting models. Essential reading for anyone building features from raw data.
2. Kuhn, Max and Johnson, Kjell. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press, 2019. A more statistically rigorous treatment than Zheng and Casari, with emphasis on the interplay between feature engineering and model selection. The chapters on handling missing data, encoding strategies, and feature selection via wrapper methods are particularly relevant to the sparse, noisy datasets common in sports analytics.
3. Pyle, Dorian. Data Preparation for Data Mining. Morgan Kaufmann, 1999. Though dated in its software examples, Pyle's treatment of data cleaning, transformation, and feature construction remains one of the most thorough available. The chapters on outlier detection and data quality are especially relevant to sports data, where recording errors and edge cases (forfeits, shortened seasons, rule changes) create data quality challenges.
4. Müller, Andreas C. and Guido, Sarah. Introduction to Machine Learning with Python. O'Reilly Media, 2016. A practical introduction to scikit-learn that covers feature preprocessing, scaling, encoding, and selection pipelines. The code examples translate directly to sports betting applications and serve as templates for the pipeline architecture described in Chapter 28.
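To make that pipeline pattern concrete, here is a minimal sketch in the style those chapters teach. The column names (rest_days, off_rating, def_rating, home_team, opponent, home_win) are hypothetical placeholders, and the logistic regression is just one choice of final estimator:

```python
# Minimal sketch of a scikit-learn preprocessing pipeline.
# Column names are hypothetical placeholders, not from any specific dataset.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["rest_days", "off_rating", "def_rating"]   # assumed numeric features
categorical_cols = ["home_team", "opponent"]               # assumed categorical features

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),               # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode teams
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),            # any estimator fits here
])
# model.fit(train_df[numeric_cols + categorical_cols], train_df["home_win"])
```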
Books and Papers: Sports Analytics
5. Alamar, Benjamin C. Sports Analytics: A Guide for Coaches, Managers, and Other Decision Makers. Columbia University Press, 2013. A broad introduction to sports analytics that covers the feature-engineering thought process from a domain-expert perspective. Alamar emphasizes the importance of choosing metrics that capture what actually drives winning, which is the foundation of good feature engineering. The chapters on football and basketball analytics provide sport-specific feature ideas.
6. Albert, Jim, Glickman, Mark E., Swartz, Tim B., and Koning, Ruud H., eds. Handbook of Statistical Methods and Analyses in Sports. CRC Press, 2017. A comprehensive reference covering statistical methods across multiple sports. Each chapter introduces sport-specific features and metrics that serve as candidates for prediction models. The NFL, NBA, and MLB chapters are particularly rich sources of feature ideas grounded in domain expertise.
7. Thabtah, Fadi, Zhang, Li, and Abdelhamid, Neda. "NBA Game Result Prediction Using Feature Analysis and Machine Learning." Annals of Data Science, 6, 2019, pp. 103-116. A systematic study of which basketball features (offensive rating, defensive rating, four factors) are most predictive of game outcomes. The paper's feature importance analysis directly informs the feature selection process described in Chapter 28.
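For readers who want to reproduce that style of analysis, the sketch below computes Dean Oliver's four factors from team box-score totals. The column names (FGM, FGA, FG3M, FTA, TOV, OREB, OPP_DREB) are assumptions about the box-score schema and should be adapted to whatever your data source provides:

```python
# Sketch: computing the "four factors" from team box-score totals.
# Column names are assumed; adjust to your data source's schema.
import pandas as pd

def four_factors(box: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=box.index)
    # Effective field goal percentage: weights made threes by 1.5
    out["efg_pct"] = (box["FGM"] + 0.5 * box["FG3M"]) / box["FGA"]
    # Turnover rate per estimated possession (0.44 free-throw factor)
    out["tov_pct"] = box["TOV"] / (box["FGA"] + 0.44 * box["FTA"] + box["TOV"])
    # Offensive rebounding percentage against the opponent's defensive boards
    out["orb_pct"] = box["OREB"] / (box["OREB"] + box["OPP_DREB"])
    # Free throw rate: trips to the line relative to shot attempts
    out["ft_rate"] = box["FTA"] / box["FGA"]
    return out
```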
Academic Papers
8. Boulier, Bryan L. and Stekler, Herman O. "Predicting the Outcomes of National Football League Games." International Journal of Forecasting, 19(2), 2003, pp. 257-270. One of the earliest rigorous studies of NFL prediction models, comparing different feature sets (team ratings, recent performance, home-field advantage). The paper demonstrates that simple, well-chosen features can match more complex approaches, a finding that reinforces the chapter's emphasis on feature quality over quantity.
9. Manner, Hans. "Modeling and Forecasting the Outcomes of NBA Basketball Games." Journal of Quantitative Analysis in Sports, 12(1), 2016, pp. 31-41. Compares multiple feature sets for NBA game prediction, including Elo ratings, four-factors metrics, and schedule-context variables. The paper's finding that rest-day features significantly improve prediction quality supports the temporal feature engineering approach in this chapter.
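A minimal sketch of the kind of rest-day feature the paper highlights, assuming a long-format game log with one row per team-game and hypothetical team and game_date columns:

```python
# Sketch: deriving rest-day features from a team game log.
# Assumes one row per team-game with "team" and "game_date" columns (assumptions).
import pandas as pd

def add_rest_days(games: pd.DataFrame) -> pd.DataFrame:
    games = games.sort_values(["team", "game_date"]).copy()
    # Days since the previous game, minus one (consecutive days -> 0 rest days)
    games["rest_days"] = games.groupby("team")["game_date"].diff().dt.days - 1
    # Flag back-to-backs, a common NBA schedule-context feature
    games["back_to_back"] = (games["rest_days"] == 0).astype(int)
    return games
```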
10. Pelechrinis, Konstantinos and Papalexakis, Evangelos. "The Anatomy of American Football: Evidence from 7 Years of NFL Game Data." PLoS ONE, 11(12), 2016. A data-driven analysis of which play-level statistics best predict game outcomes in the NFL. The paper's EPA-based feature analysis validates the use of EPA/play as a foundational feature and identifies which game situations (early downs, competitive game states) provide the most predictive signal.
11. Haghighat, Maral, Rastegari, Hamid, and Nourafza, Nasim. "A Review of Data Mining Techniques for Result Prediction in Sports." Advances in Computer Science: An International Journal, 2(5), 2013, pp. 7-12. A survey of feature engineering and modeling techniques across multiple sports. While somewhat dated, the taxonomy of feature types (team-level, player-level, contextual, historical) provides a useful organizational framework for feature brainstorming.
Technical Resources and Tutorials
12. nflfastR Documentation and Tutorials (nflfastr.com) The definitive resource for NFL play-by-play data, including detailed documentation of every column in the dataset. The "Getting Started" and "Advanced" tutorials demonstrate how to compute EPA-based features, filter garbage-time plays, and create team-level aggregations. Essential for anyone implementing the NFL feature engineering pipeline from this chapter.
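The tutorials themselves are written in R; the pandas sketch below mirrors the same workflow on an exported play-by-play table. It assumes nflfastR-style column names (epa, posteam, season, down, wp) and uses an illustrative 5-95% win-probability band as the garbage-time filter:

```python
# Sketch of the EPA aggregation workflow described in the nflfastR tutorials,
# translated to pandas. Assumes nflfastR-style play-by-play columns:
# "epa", "posteam", "season", "down", "wp" (win probability).
import pandas as pd

def team_epa_features(pbp: pd.DataFrame) -> pd.DataFrame:
    plays = pbp[
        pbp["epa"].notna()
        & pbp["down"].isin([1, 2])           # early downs carry the most signal
        & pbp["wp"].between(0.05, 0.95)      # drop likely garbage-time plays
    ]
    return (
        plays.groupby(["season", "posteam"])["epa"]
        .mean()
        .rename("off_epa_per_play")
        .reset_index()
    )
```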
13. nba_api Python Package Documentation (github.com/swar/nba_api) The most comprehensive Python interface for NBA data, providing access to player statistics, game logs, shot charts, and more. The package documentation shows how to query the specific endpoints needed for constructing the NBA features discussed in this chapter.
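A brief sketch of what such a query might look like, using the leaguegamelog endpoint; the endpoint, parameter, and column names should be verified against the package documentation, and the season string and 10-game window are purely illustrative:

```python
# Sketch: pulling a season of game logs with nba_api as raw material for
# rolling features. Verify endpoint and column names against the package docs.
import pandas as pd
from nba_api.stats.endpoints import leaguegamelog

log = leaguegamelog.LeagueGameLog(season="2023-24")   # illustrative season
games = log.get_data_frames()[0]                      # one row per team-game

# Example downstream feature: 10-game rolling scoring average per team
games["GAME_DATE"] = pd.to_datetime(games["GAME_DATE"])
games = games.sort_values(["TEAM_ID", "GAME_DATE"])
games["pts_roll10"] = (
    games.groupby("TEAM_ID")["PTS"]
    .transform(lambda s: s.shift(1).rolling(10).mean())  # shift avoids leakage
)
```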
14. Scikit-learn Feature Selection Documentation (scikit-learn.org) The official documentation for scikit-learn's feature selection module, covering variance thresholding, univariate feature selection, recursive feature elimination (RFE), and model-based selection. Includes code examples that map directly to the selection pipelines in Chapter 28.
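A condensed sketch of two of those approaches, using placeholder X_train and y_train arrays and arbitrary illustrative settings:

```python
# Sketch of two selection approaches from the scikit-learn docs: a pipeline
# with variance thresholding plus model-based selection, and standalone RFE.
# X_train / y_train are placeholders for your feature matrix and labels.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Model-based selection inside a pipeline: keep features a random forest
# deems important, then fit the final classifier on the reduced set.
select_pipeline = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),          # drop constants
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=200))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Recursive feature elimination: repeatedly drop the weakest feature until
# the requested number remains (15 is an arbitrary illustrative choice).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)
# select_pipeline.fit(X_train, y_train); rfe.fit(X_train, y_train)
```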
15. Feature Engine Library (feature-engine.trainindata.com) An open-source Python library that provides a comprehensive suite of feature engineering transformations as scikit-learn-compatible transformers. Includes missing data imputation, categorical encoding, outlier handling, and variable transformation. The library's pipeline-friendly design makes it ideal for building reproducible feature engineering workflows.
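A sketch of how these transformers might be assembled into a pipeline; module paths and parameters follow the library's documentation but should be checked against the installed version, and the variable lists (off_epa, def_epa, stadium_type) are hypothetical column names:

```python
# Sketch of a feature-engine workflow: the transformers plug straight into a
# scikit-learn Pipeline. Verify module paths against your installed version;
# the variable lists are hypothetical column names.
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.imputation import MeanMedianImputer
from feature_engine.outliers import Winsorizer
from sklearn.pipeline import Pipeline

feature_pipeline = Pipeline([
    # Fill missing numeric values with the training-set median
    ("impute", MeanMedianImputer(imputation_method="median",
                                 variables=["off_epa", "def_epa"])),
    # Cap extreme values at 1.5 IQR beyond the quartiles
    ("winsorize", Winsorizer(capping_method="iqr", tail="both", fold=1.5,
                             variables=["off_epa", "def_epa"])),
    # Group infrequent categories, then one-hot encode
    ("rare", RareLabelEncoder(tol=0.05, variables=["stadium_type"])),
    ("encode", OneHotEncoder(variables=["stadium_type"])),
])
# features = feature_pipeline.fit_transform(train_df)
```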
Data Sources
16. Kaggle NFL and NBA Datasets (kaggle.com) Community-contributed datasets containing game-level statistics, play-by-play data, and betting odds for NFL and NBA. These datasets are useful for practicing feature engineering without needing to set up API connections. Search for "NFL play-by-play," "NBA game stats," and "sports betting odds" for relevant datasets.
17. Sports Reference / Basketball Reference / Pro Football Reference The gold-standard reference sites for historical sports statistics. While primarily designed for human consumption, these sites provide comprehensive team and player statistics that serve as the raw material for feature engineering. The "Advanced Stats" pages are particularly useful for identifying candidate features.
18. The Athletic / ESPN Analytics Articles Both publications regularly feature articles on advanced sports analytics that introduce novel features and metrics. Following their analytics coverage provides a continuous stream of feature ideas grounded in expert domain knowledge. Pay particular attention to articles that quantify the impact of rest, travel, injuries, and schedule factors.
How to Use This Reading List
For readers working through this textbook sequentially, the following prioritization is suggested:
- Start with: Zheng and Casari (entry 1) for a comprehensive feature engineering foundation, and nflfastR documentation (entry 12) for hands-on practice with sports data.
- Go deeper on feature selection: Kuhn and Johnson (entry 2) and scikit-learn documentation (entry 14).
- Go deeper on sports-specific features: Alamar (entry 5) and Albert et al. (entry 6) for domain expertise, and Pelechrinis and Papalexakis (entry 10) for NFL-specific EPA features.
- For implementation: Feature Engine library (entry 15) and nba_api (entry 13) for building automated pipelines.
Many of these resources will be referenced again in later chapters as feature engineering concepts are applied to deep learning, model evaluation, and production pipelines.