Key Takeaways: Chapter 9
Feature Selection: Reducing Dimensionality Without Losing Signal
- More features is not better. Every feature you add is a candidate for overfitting, a source of noise, a dependency to maintain, and a metric to monitor. The StreamFlow team reduced 127 features to 14 and saw AUC improve by 1.5 points, training time drop by 93%, and cross-validation variance decrease by 32%. Feature engineering creates candidates. Feature selection decides which ones earn a seat at the table.
- The three families of feature selection are tools in a sequence, not competitors. Filter methods (variance threshold, correlation, mutual information, chi-squared) are fast and scalable --- use them first for initial screening. Embedded methods (L1 regularization, tree-based importance) are efficient and model-aware --- use them for refined selection. Wrapper methods (RFE, forward/backward selection) are thorough but expensive --- use them for final optimization when every fraction of a percent matters.
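The screening stage of that sequence can be sketched with scikit-learn. This is a minimal illustration on synthetic data (not the chapter's StreamFlow code): drop near-constant features first, then rank survivors by mutual information with the target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

# Synthetic stand-in dataset: 20 features, only 5 carry signal.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Step 1: drop near-constant features (fast, model-agnostic).
vt = VarianceThreshold(threshold=0.0)
X_screened = vt.fit_transform(X)

# Step 2: rank the remaining features by mutual information with the target.
mi = mutual_info_classif(X_screened, y, random_state=0)
top10 = np.argsort(mi)[::-1][:10]
```

Embedded and wrapper methods would then operate on this shortlist rather than on all 20 columns.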
- Multicollinearity is not just a statistical nuisance. It destabilizes your model. Correlated features inflate coefficient variance, make feature importances unreliable, and waste computational resources. Detect it with correlation matrices (pairwise) and VIF (multivariate). Resolve it by dropping redundant features, combining them into composites, or using PCA on the collinear subset. VIF thresholds: below 2.5 is fine, 5-10 is a problem, above 10 is a big problem.
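VIF follows directly from its definition: regress each feature on all the others and compute VIF_j = 1 / (1 - R²_j). A self-contained NumPy sketch (the threshold interpretation is the one given above):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = samples)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with intercept).
        A = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # near-duplicate of a -> huge VIF
c = rng.normal(size=200)             # independent -> VIF near 1
vifs = vif(np.column_stack([a, b, c]))
```

Here `a` and `b` land well above 10 (drop or combine them), while `c` stays under 2.5.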
- Feature selection must happen INSIDE cross-validation. This is the single most important rule in the chapter. If you select features on the full dataset and then cross-validate, you leak information from the test folds into the selection step. The result is an optimistically biased performance estimate that will not generalize. The fix is simple: put the feature selector inside a Pipeline so it is re-fit on each fold's training data.
- L1 regularization is the most practical embedded selection method. It simultaneously selects features and trains a model by driving weak coefficients to exactly zero. Tune the regularization strength (the C parameter) using cross-validation: stronger regularization (lower C) selects fewer features; weaker regularization (higher C) keeps more. At no regularization strength do noise features survive.
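Both rules combine naturally: an L1-based selector placed inside a scikit-learn Pipeline, scored with cross-validation. A minimal sketch on synthetic data (the dataset and C value are illustrative, not the chapter's):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=50, n_informative=8,
                           random_state=0)

# The selector lives inside the Pipeline, so it is re-fit on each fold's
# training data -- no information from the validation fold leaks in.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```

Lowering C shrinks the surviving feature set; sweep it with a grid search over `select__estimator__C` if you want the data to pick the strength.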
- Permutation importance is more reliable than impurity-based importance. Tree-based impurity importance has known biases --- it favors high-cardinality and continuous features. Permutation importance measures the actual drop in model performance when a feature is shuffled, computed on held-out data. Always use the test set for permutation importance, never the training set. A feature with negative permutation importance is actively harmful and should be removed.
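In scikit-learn this is `sklearn.inspection.permutation_importance`; the key discipline from the paragraph above is passing the held-out split, not the training data. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# Computed on the held-out split, per the rule above: shuffle each feature
# and measure the drop in score. Negative mean importance = candidate to drop.
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0, scoring="roc_auc")
ranked = np.argsort(result.importances_mean)[::-1]
```

`result.importances_mean` holds the average score drop per feature; compare it against `model.feature_importances_` to see the impurity bias firsthand.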
- Validate feature selection across method families. If L1 regularization and permutation importance agree on the top features, you can be confident the selection is robust. If they disagree, the selection is unstable and requires further investigation. Agreement across methods is the strongest evidence that your feature set is correct.
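One way to operationalize "agreement" is to compare the top-k feature sets from the two families and count the overlap. A hedged sketch (synthetic data; the k=5 cutoff and the models are illustrative choices, not a prescription from the chapter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Family 1 (embedded): top features by absolute L1 coefficient.
Xs = StandardScaler().fit_transform(X_tr)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xs, y_tr)
top_l1 = set(np.argsort(np.abs(l1.coef_[0]))[::-1][:5])

# Family 2 (model-agnostic): top features by permutation importance,
# computed on held-out data.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top_perm = set(np.argsort(perm.importances_mean)[::-1][:5])

overlap = len(top_l1 & top_perm)  # 5/5 = robust; low overlap = investigate
```

High overlap between unrelated method families is the cross-check the paragraph describes; a low count is the signal to dig deeper before shipping the feature set.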
- Feature selection is a deployment enabler, not just a modeling improvement. In the ShopSmart case, feature selection was the difference between a model that could not meet its latency SLA and one that powered real-time interventions. In the StreamFlow case, it was the difference between monitoring 127 features and monitoring 14. Production constraints (latency, cost, monitoring burden) are first-class inputs to the feature selection decision.
- Missing indicators survive feature selection when they carry signal. Two of StreamFlow's 14 selected features were missing indicators from Chapter 8. Feature selection validated the insight that missingness patterns are informative --- the model learned that users whose usage data was missing were more likely to churn, and the missing indicator survived L1 penalization because it genuinely predicts the target.
- The curse of dimensionality is not theoretical. It is measurable. Adding 200 noise features to a dataset with 10 real features dropped AUC from 0.954 to 0.862 --- a 10-point decline caused entirely by noise. Even gradient boosting, which is relatively robust to irrelevant features, cannot fully compensate for a high noise-to-signal ratio. Feature selection is not optional when the feature count is high relative to the signal.
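That experiment is easy to rerun on synthetic data. This sketch mirrors the setup described above (10 informative features, 200 appended noise columns); exact AUC values will differ from the chapter's 0.954 and 0.862 depending on seed and sample size:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# 10 real features, all informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)

# Append 200 pure-noise columns.
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(500, 200))])

clf = GradientBoostingClassifier(random_state=0)
auc_clean = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
auc_noisy = cross_val_score(clf, X_noisy, y, cv=5, scoring="roc_auc").mean()
```

Comparing `auc_clean` to `auc_noisy` makes the cost of the noise columns concrete for your own data regime.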
If You Remember One Thing
Feature selection must happen inside cross-validation, inside a Pipeline. Everything else in this chapter --- the three method families, VIF, permutation importance, the decision framework --- is useful but secondary. The leakage demonstration showed that even pure noise features produce an AUC of 0.573 when selected outside cross-validation, enough to fool a careless practitioner into believing signal exists where it does not. Put the selector in the pipeline. Test the pipeline with cross-validation. Trust the numbers.
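The leakage demonstration can be reproduced in a few lines. Both the data and the target below are pure random noise, so the honest AUC is 0.5; only the placement of the selector differs (this is a sketch of the experiment described above, not the chapter's exact code, so the numbers will not match 0.573 precisely):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: no feature has any relationship to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)

# WRONG: select on the full dataset, then cross-validate. The selector has
# already seen every fold, so the estimate is optimistically inflated.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# RIGHT: selector inside the Pipeline, re-fit on each fold's training data.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LogisticRegression())])
auc_honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
```

`auc_leaky` lands well above chance while `auc_honest` hovers near 0.5, which is exactly the trap described above: noise that looks like signal.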
These takeaways summarize Chapter 9: Feature Selection. Return to the chapter for full context.