Chapter 28 Key Takeaways: Feature Engineering for Sports Betting
Key Concepts
- Feature Engineering Dominance: The quality of features matters more than the choice of algorithm. A logistic regression with well-engineered features routinely outperforms complex models with poorly designed inputs. Feature engineering is where domain expertise meets data science.
- Temporal Leakage: The most dangerous error in sports model development. Temporal leakage occurs when features incorporate information that would not have been available at prediction time. Every feature must be computed using only data from before the prediction date. This applies to season averages (which must be rolling, not full-season), standardization parameters (which must be computed on training data only), and any external data source.
- Garbage-Time Filtering: Plays occurring when the game outcome is effectively decided distort team statistics. Use win-probability-based weighting rather than binary cutoffs to downweight garbage-time plays while preserving sample size (see the weighting sketch after this list).
- Opponent Adjustment: Raw efficiency statistics are confounded by opponent quality. Iterative opponent adjustment removes this confound by rating each team's performance relative to the quality of opponents faced. Adjusted metrics consistently outrank raw metrics in feature importance analyses.
- Rolling and Exponential Features: Point-in-time features that reflect a team's current form are more predictive than season-to-date averages. Exponentially weighted moving averages (EWMA) place more weight on recent games while still incorporating historical data, balancing responsiveness with stability.
- Feature Selection: Including too many features causes overfitting, especially with the small sample sizes typical of sports data. Systematic selection methods (RFECV, Lasso, mutual information) identify the optimal feature subset. A model with 10 well-chosen features typically outperforms a model with 30 weakly predictive features.
- Interaction Features: Many predictive relationships in sports are non-additive. Pace × efficiency, rest advantage × travel distance, and matchup-specific features capture effects that individual features miss. However, interactions multiply the feature space and increase overfitting risk, so they should be added judiciously.
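A minimal sketch of win-probability-based garbage-time weighting, assuming a play-by-play DataFrame with a pre-snap win probability column `wp` and an `epa` column; the column names, the logistic shape, and the `sharpness` and 0.8 cutoff parameters are illustrative assumptions, not values prescribed by the chapter:

```python
import numpy as np
import pandas as pd

def garbage_time_weights(wp: pd.Series, sharpness: float = 10.0) -> pd.Series:
    """Downweight plays where the game is effectively decided.

    Weight is near 1.0 when the game is competitive (wp near 0.5) and
    decays smoothly toward 0 as wp approaches 0 or 1. A smooth decay,
    rather than a binary cutoff, preserves sample size.
    """
    # Distance from a coin-flip game state, scaled to [0, 1]
    decidedness = 2.0 * (wp - 0.5).abs()
    # Logistic decay: weight stays high until decidedness gets large
    return 1.0 / (1.0 + np.exp(sharpness * (decidedness - 0.8)))

# Example: weight each play's EPA before aggregating to a team average
plays = pd.DataFrame({"wp": [0.52, 0.75, 0.97], "epa": [0.3, 0.1, 0.8]})
w = garbage_time_weights(plays["wp"])
weighted_epa = (plays["epa"] * w).sum() / w.sum()
```

At wp = 0.52 the weight is essentially 1.0; at wp = 0.97 it falls to roughly 0.2, so blowout plays still contribute, just faintly.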
Key Formulas
| Formula | Expression | Example |
|---|---|---|
| EWMA | S_t = alpha * x_t + (1 - alpha) * S_{t-1} | alpha=0.3, x=[0.05, 0.12], S_1 = x_1 = 0.05: S_2 = 0.3(0.12) + 0.7(0.05) = 0.071 |
| Variance Inflation Factor | VIF_j = 1 / (1 - R_j^2) | R^2 = 0.90 gives VIF = 10 (multicollinearity threshold) |
| Smoothed Target Encoding | (n * y_cat + m * y_global) / (n + m) | n=5, y_cat=0.6, m=10, y_global=0.5: (5(0.6) + 10(0.5))/15 = 8/15 = 0.533 |
| Information Gain | IG(T, F) = H(T) - H(T|F) | Entropy reduction from splitting on feature F |
| Opponent-Adjusted Rating | adj_i = raw_i - mean(opp_ratings) | 0.10 EPA vs avg 0.03 opp = 0.07 adjusted |
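The table row shows a single adjustment pass; the chapter's method iterates until ratings stabilize. A minimal sketch of that loop, assuming a team-game DataFrame with columns `team`, `opp`, and `raw` (e.g., EPA/play); the column names, the single combined rating per team, and the fixed iteration count are simplifying assumptions of this sketch:

```python
import pandas as pd

def iterative_opponent_adjust(games: pd.DataFrame, n_iter: int = 25) -> pd.Series:
    """Iteratively rate each team relative to opponent quality.

    games: one row per team-game with columns ['team', 'opp', 'raw'].
    A full system would keep separate offensive and defensive ratings;
    this sketch uses one rating per team for brevity.
    """
    # Initialize with zero adjustment (ratings centered at league average)
    ratings = pd.Series(0.0, index=games["team"].unique())
    for _ in range(n_iter):
        # Per-game version of the table's formula: adj = raw - opponent rating
        adj = games["raw"] - games["opp"].map(ratings)
        # New rating: mean adjusted performance, re-centered each pass
        # so ratings stay on the same scale across iterations
        new = adj.groupby(games["team"]).mean()
        ratings = new - new.mean()
    return ratings
```

Each pass re-estimates every team against the current opponent ratings, so schedule strength propagates through the league until the ratings settle.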
Feature Engineering Decision Framework
When constructing features for a sports betting model, apply the following process:
Step 1 -- Define the prediction task and timeline. What are you predicting (spread, total, moneyline)? When must the prediction be available (hours before, minutes before, live)? This determines which data sources and update frequencies are feasible.
Step 2 -- Enumerate candidate features by category. Cover all five major categories: efficiency metrics, situational/contextual factors, opponent-adjusted ratings, momentum/temporal features, and external data (weather, injuries, market signals). Aim for 20-40 candidates before selection.
Step 3 -- Implement point-in-time computation. Every feature must be computable using only data available at the prediction moment. Build a feature store or pipeline that enforces this constraint automatically, rather than relying on manual inspection.
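A minimal point-in-time sketch in pandas, assuming one row per team-game with columns `team`, `date`, and `epa` (names are illustrative). The `shift(1)` guarantees each row's features see only that team's prior games, and the EWMA from the formulas table drops out of `.ewm()`:

```python
import pandas as pd

def point_in_time_features(games: pd.DataFrame, alpha: float = 0.3) -> pd.DataFrame:
    """Leak-free rolling features: each row sees only that team's prior games."""
    games = games.sort_values(["team", "date"]).copy()
    # shift(1) within each team excludes the current game, so every
    # feature is computable before kickoff (point-in-time safe)
    past = games.groupby("team")["epa"].shift(1)
    by_team = past.groupby(games["team"])
    games["epa_ewma"] = by_team.transform(lambda s: s.ewm(alpha=alpha).mean())
    games["epa_last3"] = by_team.transform(lambda s: s.rolling(3).mean())
    return games
```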
Step 4 -- Check for leakage. Run a diagnostic that verifies no feature for any game uses data from that game or future games. This is non-negotiable.
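One simple form of this diagnostic, assuming the pipeline records alongside each feature value the timestamp of the most recent raw record that fed it (`feature_asof`) plus the game's kickoff time; both column names are assumptions of this sketch:

```python
import pandas as pd

def assert_no_temporal_leakage(features: pd.DataFrame) -> None:
    """Fail loudly if any feature was built from data at or after kickoff.

    features: columns ['game_id', 'feature', 'feature_asof', 'kickoff'],
    where feature_asof is the max timestamp of raw data feeding the value.
    """
    leaked = features[features["feature_asof"] >= features["kickoff"]]
    if not leaked.empty:
        offenders = leaked[["game_id", "feature", "feature_asof", "kickoff"]]
        raise ValueError(f"Temporal leakage detected:\n{offenders.head(10)}")
```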
Step 5 -- Apply feature selection. Use a combination of correlation analysis (to remove redundant features), univariate scoring (mutual information or chi-squared), and model-based selection (Lasso or RFECV) to reduce the candidate set to the optimal size.
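A hedged sketch combining the two model-based methods named above with scikit-learn; the synthetic data stands in for a real feature matrix, and the split counts are arbitrary:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 25))   # 25 candidate features, time-ordered rows
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)

cv = TimeSeriesSplit(n_splits=5)             # respects chronological order
X_std = StandardScaler().fit_transform(X)    # in production, fit the scaler on training folds only

# Lasso screen: keep features whose coefficients survive L1 shrinkage
# (treats the binary target as numeric, which is adequate for screening)
lasso_kept = np.flatnonzero(LassoCV(cv=cv).fit(X_std, y).coef_ != 0)

# RFECV: recursively drop the weakest feature, scored by cross-validated Brier score
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=cv, scoring="neg_brier_score")
rfe_kept = np.flatnonzero(rfecv.fit(X_std, y).support_)

# Features both methods retain are the strongest candidates
consensus = sorted(set(lasso_kept) & set(rfe_kept))
```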
Step 6 -- Validate the feature set. Evaluate the model with the selected features using proper walk-forward validation. Compare to a baseline model with minimal features. The improvement should be statistically significant by the Diebold-Mariano test or a paired t-test on Brier score differences.
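A walk-forward comparison sketch using a paired t-test on per-game Brier scores, as the step suggests; scipy's `ttest_rel` is a simpler stand-in for the Diebold-Mariano test, and the synthetic data and split count are assumptions:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 12))   # engineered features, time-ordered rows
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=600) > 0).astype(int)

def walk_forward_brier(X, y, n_splits=8):
    """Per-game Brier scores from walk-forward validation (train on past, test on future)."""
    model = LogisticRegression(max_iter=1000)
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]
        scores.append((p - y[test_idx]) ** 2)   # per-game squared error
    return np.concatenate(scores)

full = walk_forward_brier(X, y)           # full selected feature set
base = walk_forward_brier(X[:, [2]], y)   # minimal baseline feature
t_stat, p_val = ttest_rel(full, base)     # paired test on per-game Brier differences
# The richer set is justified only if p_val is small AND full.mean() < base.mean()
```

Because both models are scored on the same games under the same splits, the per-game Brier differences are properly paired.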
The core principle: Features are not data; they are hypotheses about what information is predictive. Every feature represents a belief about the data-generating process. The best features encode genuine causal mechanisms (fatigue reduces performance, strong opponents inflate difficulty) rather than spurious correlations.
Ready for Chapter 29? Self-Assessment Checklist
Before moving on to Chapter 29 ("Deep Learning for Sports Prediction"), confirm that you can do the following:
- [ ] Explain the difference between raw features, derived features, and interaction features with examples
- [ ] Implement a point-in-time feature computation pipeline that prevents temporal leakage
- [ ] Construct exponentially weighted moving averages and explain the effect of the alpha parameter
- [ ] Encode categorical variables using one-hot encoding, target encoding, and ordinal encoding, and explain when each is appropriate
- [ ] Apply at least two feature selection methods (e.g., Lasso and RFECV) and compare their results
- [ ] Implement opponent-adjusted efficiency ratings using iterative adjustment
- [ ] Construct garbage-time filters using win-probability weighting
- [ ] Diagnose multicollinearity using VIF and resolve it by removing redundant features
- [ ] Design interaction features based on domain knowledge and test their marginal contribution
- [ ] Build a complete feature engineering pipeline from raw data to modeling-ready DataFrame
If you can check every box with confidence, you are well prepared for Chapter 29. If any items feel uncertain, revisit the relevant sections of Chapter 28 or work through the corresponding exercises before proceeding.