In This Chapter
- Reducing Dimensionality Without Losing Signal
- More Features Is Not More Better
- The Curse of Dimensionality --- In Practice, Not Theory
- The Three Families of Feature Selection
- Filter Methods: Fast Screening with Statistical Tests
- Multicollinearity and the Variance Inflation Factor
- Wrapper Methods: Model-Driven Feature Search
- Embedded Methods: Selection During Training
- The Critical Mistake: Feature Selection Outside Cross-Validation
- Building a Complete Feature Selection Pipeline
- Decision Framework: Which Method When?
- StreamFlow: Selecting from 25+ Engineered Features
- Progressive Project M3: Feature Selection for StreamFlow
- Chapter Summary
Chapter 9: Feature Selection
Reducing Dimensionality Without Losing Signal
Learning Objectives
By the end of this chapter, you will be able to:
- Apply filter methods (correlation, mutual information, chi-squared) for initial screening
- Use wrapper methods (RFE, forward/backward selection) for model-driven selection
- Leverage embedded methods (L1 regularization, tree-based importance) for integrated selection
- Detect and handle multicollinearity (VIF, correlation matrices)
- Build a feature selection pipeline that does not leak information
More Features Is Not More Better
War Story --- A SaaS analytics team had spent three weeks on feature engineering, and they were proud of the result. Starting from raw event logs, they had created 127 features for their churn prediction model: usage metrics, engagement ratios, behavioral sequences, temporal aggregates, interaction terms, polynomial expansions, and every ratio-of-ratios they could think of. The gradient boosted model trained on all 127 features achieved an AUC of 0.836 on the holdout set. Respectable.
Then an intern asked a question that nobody had considered: "How many of those features actually matter?" The team ran a quick feature importance plot and discovered that the top 15 features accounted for 94% of the model's total importance. The remaining 112 features split the leftover 6% among them, each contributing less than a rounding error. Some had negative permutation importance --- they were actively hurting the model by introducing noise that the algorithm had to learn to ignore.
They retrained on just the top 15 features. The AUC went up to 0.851. Training time dropped from 47 minutes to 90 seconds. The model was faster, more interpretable, more stable across time periods, and easier to deploy and monitor. Removing 88% of the features improved everything.
The lesson: feature engineering creates candidates. Feature selection decides which ones earn a seat at the table. These are different jobs, and skipping the second one is like hiring every applicant who submits a resume.
That war story should feel uncomfortable if you have been diligently engineering features for the past three chapters. Everything we built in Chapters 6, 7, and 8 --- the temporal aggregates, the encoded categoricals, the imputed values, the missing indicators --- all of it is candidate material. This chapter is the audition.
Feature selection is the process of choosing a subset of features that maximizes predictive performance while minimizing noise, redundancy, and computational cost. It is not optional. It is not something you do "if you have time." It is a core modeling step, and doing it wrong (or not at all) will cost you accuracy, interpretability, and production stability.
Here is why.
The Curse of Dimensionality --- In Practice, Not Theory
You have probably heard the phrase "curse of dimensionality" in a statistics or machine learning course, accompanied by abstract geometric arguments about how volume concentrates in high-dimensional spaces. That is true but not helpful. Let us talk about what the curse actually does to your models.
What Actually Goes Wrong with Too Many Features
Problem 1: Overfitting to noise. Every feature you add gives the model another degree of freedom to fit the training data. If a feature is genuinely uninformative --- if it has zero true relationship to the target --- the model can still find a spurious pattern in the training data by chance. With 100 uninformative features, the model will find dozens of phantom patterns. It will learn them, memorize them, and use them confidently on the training set. Then it will fail on new data because the phantom patterns do not exist in the real world.
Problem 2: Diluted signal. Regularization and early stopping work by penalizing complexity. When you have many irrelevant features alongside a few genuinely predictive ones, the regularization penalty is spread across all features equally. The model must allocate some of its limited complexity budget to suppressing noise features, leaving less capacity to learn the true signal. More features means more noise for the model to wade through.
Problem 3: Multicollinearity. Correlated features create instability in any model that estimates coefficients (linear regression, logistic regression, regularized models). When two features carry nearly identical information, the model cannot distinguish their individual contributions. Small changes in the data produce large swings in the coefficients. The model's predictions may be fine, but the feature importances and coefficients become meaningless --- and if you are using them for interpretation or decision-making, that is a serious problem.
Problem 4: Computational cost. More features means more parameters to estimate, more memory to allocate, more time to train, and more storage for model artifacts. In production, every additional feature is a dependency that must be computed, validated, and monitored. A model with 15 features is dramatically easier to deploy and maintain than a model with 127.
Problem 5: Monitoring burden. Every feature in production requires drift monitoring. If a feature's distribution shifts, you need to know about it. With 15 features, that is manageable. With 127, you are drowning in alerts, most of them for features that do not matter anyway.
Let us see the curse in action:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
np.random.seed(42)
# Generate a dataset with 10 informative features and varying noise features
n_samples = 5000
n_informative = 10
results = []
for n_noise in [0, 10, 25, 50, 100, 200]:
X, y = make_classification(
n_samples=n_samples,
n_features=n_informative + n_noise,
n_informative=n_informative,
n_redundant=0,
n_repeated=0,
n_clusters_per_class=2,
random_state=42,
)
model = GradientBoostingClassifier(
n_estimators=100,
max_depth=4,
random_state=42,
)
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
results.append({
'total_features': n_informative + n_noise,
'noise_features': n_noise,
'auc_mean': scores.mean(),
'auc_std': scores.std(),
})
results_df = pd.DataFrame(results)
print("Effect of Noise Features on Model Performance")
print("=" * 60)
print(f"{'Total Features':>16s} {'Noise':>6s} {'AUC Mean':>10s} {'AUC Std':>9s}")
print("-" * 60)
for _, row in results_df.iterrows():
print(f"{int(row['total_features']):>16d} {int(row['noise_features']):>6d} "
f"{row['auc_mean']:>10.4f} {row['auc_std']:>9.4f}")
Effect of Noise Features on Model Performance
============================================================
Total Features Noise AUC Mean AUC Std
------------------------------------------------------------
10 0 0.9542 0.0061
20 10 0.9487 0.0058
35 25 0.9381 0.0072
60 50 0.9203 0.0089
110 100 0.8946 0.0105
210 200 0.8621 0.0134
The model degrades steadily as noise features accumulate. With 200 noise features alongside 10 real ones, the AUC drops from 0.954 to 0.862 --- nearly a 10-point decline in discrimination. And gradient boosting is relatively robust to noise. Linear models degrade faster.
Key Insight --- The curse of dimensionality is not a theoretical curiosity. It is a measurable, reproducible degradation in model performance that happens every time you include features that do not carry signal.
The Three Families of Feature Selection
Feature selection methods fall into three families, and understanding the tradeoffs between them is the first step toward choosing the right approach for your problem.
| Family | How It Works | Advantages | Disadvantages |
|---|---|---|---|
| Filter | Scores features using statistical tests independent of any model | Fast, scalable, model-agnostic | Ignores feature interactions; ignores model-specific relevance |
| Wrapper | Searches subsets of features using a model's performance as the objective | Finds optimal subsets for a specific model | Slow, prone to overfitting, computationally expensive |
| Embedded | Selection happens during model training (built into the algorithm) | Efficient, accounts for interactions, produces a trained model and selected features simultaneously | Model-specific; what Lasso selects may differ from what a tree selects |
There is no single best family. In practice, you will often use all three in sequence: filter methods for initial screening, embedded methods for refined selection, and wrapper methods when you need every last fraction of a percent.
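As a sketch of that three-stage sequence, the three families can be chained inside a single scikit-learn `Pipeline`. This is an illustrative skeleton on synthetic data --- the stage sizes (30 candidates, then 10 features) and the `C=0.5` penalty are arbitrary choices for the demo, not recommendations from this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, SelectFromModel, RFE
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 features, only 8 informative
X, y = make_classification(
    n_samples=2000, n_features=60, n_informative=8, random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    # Stage 1 (filter): cheap univariate screen down to 30 candidates
    ('filter', SelectKBest(mutual_info_classif, k=30)),
    # Stage 2 (embedded): L1 penalty zeroes out weak coefficients
    ('embedded', SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
    )),
    # Stage 3 (wrapper): RFE refines whatever survives to at most 10
    ('wrapper', RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(f"Features after wrapper stage: {pipe.named_steps['wrapper'].n_features_}")
```

Because each stage is a transformer inside the pipeline, the whole sequence re-runs on each cross-validation fold --- which, as we will see later in the chapter, is exactly what prevents selection from leaking information.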
Filter Methods: Fast Screening with Statistical Tests
Filter methods evaluate each feature independently (or in pairs) using a statistical measure of relevance. They do not train a model. They are fast, scalable, and useful for reducing a large feature set to a manageable size before applying more expensive methods.
Variance Threshold: The Absolute Minimum
Before you do anything else, remove features with zero or near-zero variance. A feature that takes the same value for 99.5% of observations carries almost no information.
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import numpy as np
np.random.seed(42)
# Simulate StreamFlow features including some near-constant ones
n = 10000
df = pd.DataFrame({
'tenure_months': np.random.exponential(18, n).clip(1, 72),
'monthly_charge': np.round(np.random.uniform(9.99, 49.99, n), 2),
'sessions_last_7d': np.random.poisson(12, n),
'hours_last_30d': np.random.exponential(20, n),
'is_annual_plan': np.random.choice([0, 1], n, p=[0.85, 0.15]),
'has_parental_controls': np.random.choice([0, 1], n, p=[0.995, 0.005]), # Near-constant
'currency_is_usd': np.random.choice([0, 1], n, p=[0.98, 0.02]), # Near-constant
'platform_version_flag': np.ones(n), # Constant
})
# VarianceThreshold removes features whose variance falls below a cutoff.
# For binary features, var = p * (1 - p). A threshold of 0.02 removes
# features where the minority class is < ~2%
selector = VarianceThreshold(threshold=0.02)
X_filtered = selector.fit_transform(df)
kept = df.columns[selector.get_support()].tolist()
dropped = df.columns[~selector.get_support()].tolist()
print(f"Features kept ({len(kept)}): {kept}")
print(f"Features dropped ({len(dropped)}): {dropped}")
print()
for col in df.columns:
v = df[col].var()
status = "KEPT" if col in kept else "DROPPED"
print(f" {col:30s} var={v:10.4f} {status}")
Features kept (5): ['tenure_months', 'monthly_charge', 'sessions_last_7d', 'hours_last_30d', 'is_annual_plan']
Features dropped (3): ['has_parental_controls', 'currency_is_usd', 'platform_version_flag']
tenure_months var= 313.4127 KEPT
monthly_charge var= 133.4621 KEPT
sessions_last_7d var= 12.0384 KEPT
hours_last_30d var= 398.1056 KEPT
is_annual_plan var= 0.1275 KEPT
has_parental_controls var= 0.0050 DROPPED
currency_is_usd var= 0.0196 DROPPED
platform_version_flag var= 0.0000 DROPPED
VarianceThreshold is a blunt instrument, but it catches the obvious cases. A feature that is constant for 99.5% of observations cannot carry much information, regardless of what the remaining 0.5% looks like.
Correlation with the Target: Univariate Screening
The simplest relevance measure is the correlation between each feature and the target. For numeric features, use Pearson or Spearman correlation. For categorical features and a binary target, use chi-squared or mutual information.
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import numpy as np
np.random.seed(42)
n = 10000
# Simulate a realistic StreamFlow feature matrix with a churn target
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
trend_7d_30d = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
# Target: churn probability depends on a subset of features
churn_logit = (
-2.0
- 0.03 * tenure
+ 0.02 * monthly_charge
- 0.05 * sessions_7d
- 0.02 * hours_30d
+ 0.06 * days_since_login
+ 0.25 * support_tickets
- 1.5 * genre_diversity
- 0.8 * email_open_rate
+ 0.5 * trend_7d_30d
+ np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
# Add noise features with no real relationship
noise_1 = np.random.normal(0, 1, n)
noise_2 = np.random.uniform(0, 1, n)
noise_3 = np.random.poisson(5, n)
X = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'sessions_last_7d': sessions_7d,
'hours_last_30d': hours_30d,
'days_since_last_login': days_since_login,
'support_tickets_90d': support_tickets,
'genre_diversity': genre_diversity,
'email_open_rate': email_open_rate,
'engagement_trend_7d_30d': trend_7d_30d,
'device_count': device_count,
'noise_random_normal': noise_1,
'noise_random_uniform': noise_2,
'noise_random_poisson': noise_3,
})
y = churned
# --- Pearson Correlation ---
pearson_corr = X.corrwith(pd.Series(y, name='churned')).abs().sort_values(ascending=False)
print("Pearson |correlation| with churn target:")
print("=" * 50)
for feat, val in pearson_corr.items():
bar = "#" * int(val * 80)
print(f" {feat:30s} {val:.4f} {bar}")
print()
# --- Mutual Information ---
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)
print("Mutual Information with churn target:")
print("=" * 50)
for feat, val in mi_series.items():
bar = "#" * int(val * 200)
print(f" {feat:30s} {val:.4f} {bar}")
Pearson |correlation| with churn target:
==================================================
genre_diversity 0.2138 #################
support_tickets_90d 0.1842 ##############
days_since_last_login 0.1567 ############
email_open_rate 0.1243 #########
sessions_last_7d 0.1108 ########
engagement_trend_7d_30d 0.0921 #######
hours_last_30d 0.0854 ######
tenure_months 0.0743 #####
monthly_charge 0.0512 ####
device_count 0.0198 #
noise_random_normal 0.0124
noise_random_poisson 0.0089
noise_random_uniform 0.0041
Mutual Information with churn target:
==================================================
genre_diversity 0.0512 ##########
support_tickets_90d 0.0389 #######
days_since_last_login 0.0334 ######
sessions_last_7d 0.0298 #####
email_open_rate 0.0267 #####
engagement_trend_7d_30d 0.0198 ###
hours_last_30d 0.0187 ###
tenure_months 0.0156 ###
monthly_charge 0.0102 ##
device_count 0.0078 #
noise_random_poisson 0.0012
noise_random_normal 0.0008
noise_random_uniform 0.0005
Both methods agree on the broad ranking: genre diversity, support tickets, and login recency are the most informative features. The noise features cluster at the bottom with scores near zero. The methods disagree on details --- mutual information captures nonlinear relationships that Pearson misses, but it is noisier and requires more data to produce stable estimates.
Practical Rule --- Use Pearson correlation for a quick first look. Use mutual information when you suspect nonlinear relationships. Neither one captures feature interactions --- a feature that is useless alone but powerful in combination with another feature will score low on both measures.
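The interaction caveat above is easy to demonstrate with an XOR-style pair: two binary features that are each (nearly) independent of the target but jointly determine it. This is a synthetic sketch, not a StreamFlow feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 4000
a = rng.choice([0, 1], n)
b = rng.choice([0, 1], n)
y = a ^ b  # target is the XOR of the two features

X = pd.DataFrame({'a': a, 'b': b})

# Each feature alone is (near-)independent of the target...
print("Pearson |r|:", X.corrwith(pd.Series(y)).abs().round(3).to_dict())
mi = mutual_info_classif(X, y, discrete_features=True, random_state=42)
print("Mutual info:", dict(zip(X.columns, mi.round(4))))

# ...but together they determine it exactly, so a model that can
# represent the interaction scores near-perfect accuracy
acc = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5).mean()
print(f"5-fold accuracy with both features: {acc:.3f}")
```

Any univariate filter would discard both features here; only a method that evaluates them together (a wrapper or embedded method with an interaction-capable model) keeps them.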
Chi-Squared Test for Categorical Features
When both the feature and the target are categorical (or the feature is a binary indicator), the chi-squared test is the standard filter method. It tests whether the observed frequency distribution differs from what you would expect under independence.
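To see the observed-vs-expected mechanics directly, you can build the 2x2 contingency table yourself and hand it to `scipy.stats.chi2_contingency`. This toy example is our own (the 0.25 vs. 0.10 churn rates are invented); sklearn's `chi2` function, used next, applies a related per-feature computation.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n = 10000
feature = rng.choice([0, 1], n, p=[0.6, 0.4])
# Churn genuinely depends on the feature: likelier when feature == 0
churn_prob = np.where(feature == 1, 0.10, 0.25)
target = (rng.random(n) < churn_prob).astype(int)

# Observed counts: rows are feature value, columns are churned / not
observed = np.array([
    [np.sum((feature == 0) & (target == 0)), np.sum((feature == 0) & (target == 1))],
    [np.sum((feature == 1) & (target == 0)), np.sum((feature == 1) & (target == 1))],
])
# chi2_contingency computes the expected counts under independence
# and the test statistic comparing them to the observed counts
stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}, dof = {dof}")
```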
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np
np.random.seed(42)
n = 10000
# Simulate binary/categorical features for StreamFlow
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
has_profile_photo = np.random.choice([0, 1], n, p=[0.50, 0.50])
used_mobile_app = np.random.choice([0, 1], n, p=[0.40, 0.60])
email_opt_in = np.random.choice([0, 1], n, p=[0.30, 0.70])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
noise_binary = np.random.choice([0, 1], n, p=[0.50, 0.50])
# Churn depends on some of these
churn_logit = (
-1.5
- 0.8 * plan_premium
+ 0.1 * has_profile_photo
- 0.6 * used_mobile_app
- 0.4 * email_opt_in
- 1.2 * referred_friend
+ 0.0 * noise_binary # No relationship
+ np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X_cat = pd.DataFrame({
'plan_is_premium': plan_premium,
'has_profile_photo': has_profile_photo,
'used_mobile_app': used_mobile_app,
'email_opt_in': email_opt_in,
'referred_a_friend': referred_friend,
'noise_binary': noise_binary,
})
# Chi-squared requires non-negative values (already 0/1 here)
chi2_scores, p_values = chi2(X_cat, churned)
chi2_df = pd.DataFrame({
'feature': X_cat.columns,
'chi2_score': chi2_scores,
'p_value': p_values,
}).sort_values('chi2_score', ascending=False)
print("Chi-Squared Test Results (categorical features vs. churn):")
print("=" * 65)
print(f" {'Feature':25s} {'Chi2 Score':>12s} {'p-value':>12s} Significant?")
print("-" * 65)
for _, row in chi2_df.iterrows():
sig = "YES" if row['p_value'] < 0.01 else "no"
print(f" {row['feature']:25s} {row['chi2_score']:12.2f} {row['p_value']:12.6f} {sig}")
Chi-Squared Test Results (categorical features vs. churn):
=================================================================
Feature Chi2 Score p-value Significant?
-----------------------------------------------------------------
referred_a_friend 234.71 0.000000 YES
plan_is_premium 187.45 0.000000 YES
used_mobile_app 142.18 0.000000 YES
email_opt_in 63.52 0.000000 YES
has_profile_photo 4.12 0.042351 no
noise_binary 0.38 0.537645 no
The chi-squared test correctly identifies the features with genuine relationships to churn and assigns a near-zero score to the noise feature. has_profile_photo has a weak relationship (the 0.1 coefficient we built in is small) --- detectable at the 0.05 level but not at the stricter 0.01 cutoff used here.
Correlation Between Features: Detecting Redundancy
Filter methods also help identify redundancy --- features that are so correlated with each other that keeping both adds noise without adding signal.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
n = 10000
# Simulate StreamFlow features with built-in correlations
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n) # Highly correlated with 7d
sessions_7d = hours_7d / 0.75 + np.random.normal(0, 2, n) # Correlated
sessions_30d = sessions_7d * 4 + np.random.normal(0, 5, n) # Correlated
avg_duration = hours_7d / (sessions_7d.clip(1) / 7) + np.random.normal(0, 0.5, n)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
support_tickets = np.random.poisson(1.2, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
feature_df = pd.DataFrame({
'hours_7d': hours_7d,
'hours_30d': hours_30d,
'sessions_7d': sessions_7d,
'sessions_30d': sessions_30d,
'avg_session_duration': avg_duration,
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'support_tickets': support_tickets,
'days_since_login': days_since_login,
})
# Compute correlation matrix
corr_matrix = feature_df.corr()
# Find highly correlated pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i + 1, len(corr_matrix.columns)):
corr_val = abs(corr_matrix.iloc[i, j])
if corr_val > 0.7:
high_corr_pairs.append({
'feature_1': corr_matrix.columns[i],
'feature_2': corr_matrix.columns[j],
'abs_correlation': round(corr_val, 3),
})
print("Highly Correlated Feature Pairs (|r| > 0.7):")
print("=" * 65)
for pair in sorted(high_corr_pairs, key=lambda x: x['abs_correlation'], reverse=True):
print(f" {pair['feature_1']:25s} <-> {pair['feature_2']:25s} |r| = {pair['abs_correlation']:.3f}")
Highly Correlated Feature Pairs (|r| > 0.7):
=================================================================
hours_7d <-> sessions_7d |r| = 0.943
hours_7d <-> hours_30d |r| = 0.921
sessions_7d <-> sessions_30d |r| = 0.918
hours_30d <-> sessions_30d |r| = 0.874
hours_7d <-> avg_session_duration |r| = 0.812
sessions_7d <-> avg_session_duration |r| = 0.789
hours_30d <-> sessions_7d |r| = 0.783
hours_7d <-> sessions_30d |r| = 0.762
The usage features are highly correlated --- which makes sense, because hours_7d, hours_30d, sessions_7d, and sessions_30d all measure variations of the same underlying behavior: how much the user engages with the product. Keeping all four in the model adds redundancy without adding much new information.
Practical Rule --- When two features have |r| > 0.9, you can usually drop one with minimal loss. When |r| is between 0.7 and 0.9, test both options (keep/drop) and let model performance decide. Below 0.7, the features are different enough that both may be worth keeping.
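One way to act on that rule is a small greedy helper (the name prune_correlated is ours, not a library function) that walks the correlation matrix and drops the later column of each pair above a threshold.

```python
import numpy as np
import pandas as pd

def prune_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return the column names to KEEP after greedy correlation pruning."""
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue  # already slated for removal; skip its pairs
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])  # drop the later column of the pair
    return [c for c in cols if c not in to_drop]

# Tiny demo: x2 duplicates x1 (plus tiny noise), x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(0, 0.01, 1000),
    'x3': rng.normal(size=1000),
})
print(prune_correlated(demo, threshold=0.9))  # keeps x1 and x3
```

Because the helper keeps whichever column comes first, in practice you would sort the columns by univariate correlation with the target beforehand so the more predictive member of each pair survives.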
Multicollinearity and the Variance Inflation Factor
Correlation matrices show pairwise relationships. But multicollinearity can be more subtle --- a feature might be well-predicted by a combination of other features, even if no single pairwise correlation is particularly high. The Variance Inflation Factor (VIF) captures this.
VIF measures how much the variance of a feature's estimated coefficient is inflated due to correlation with other features. The formula is simple: for feature j, fit a linear regression predicting feature j from all other features, compute the R-squared, and calculate VIF = 1 / (1 - R-squared).
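That definition can be computed directly: regress each feature on all the others and apply VIF = 1 / (1 - R-squared). A minimal sketch on a toy frame (the statsmodels routine used in the next listing does the same regression internally):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_manual(df: pd.DataFrame) -> pd.Series:
    """VIF per column: fit each feature against the rest, use R-squared."""
    vifs = {}
    for col in df.columns:
        X_others = df.drop(columns=col)
        r2 = LinearRegression().fit(X_others, df[col]).score(X_others, df[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

rng = np.random.default_rng(42)
a = rng.normal(size=2000)
frame = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(0, 0.5, 2000),  # strongly explained by a
    'c': rng.normal(size=2000),             # independent of both
})
print(vif_manual(frame).round(2))
# a and b should show heavily inflated VIFs; c should sit near 1
```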
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 10000
# StreamFlow features with multicollinearity
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = hours_7d / 0.75 + np.random.normal(0, 2, n)
avg_duration = hours_7d * 60 / sessions_7d.clip(1)
support_tickets = np.random.poisson(1.2, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
genre_diversity = np.random.beta(2, 5, n)
X = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'hours_last_7d': hours_7d,
'hours_last_30d': hours_30d,
'sessions_last_7d': sessions_7d,
'avg_session_duration': avg_duration,
'support_tickets_90d': support_tickets,
'days_since_login': days_since_login,
'genre_diversity': genre_diversity,
})
# Standardize before computing VIF: statsmodels' variance_inflation_factor
# regresses each column on the others without adding an intercept, so
# centered features avoid spuriously inflated values
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
# Compute VIF for each feature
vif_data = pd.DataFrame({
'feature': X_scaled.columns,
'vif': [variance_inflation_factor(X_scaled.values, i) for i in range(X_scaled.shape[1])],
}).sort_values('vif', ascending=False)
print("Variance Inflation Factors:")
print("=" * 55)
print(f" {'Feature':30s} {'VIF':>8s} Assessment")
print("-" * 55)
for _, row in vif_data.iterrows():
if row['vif'] > 10:
assessment = "BIG PROBLEM"
elif row['vif'] > 5:
assessment = "Problem"
elif row['vif'] > 2.5:
assessment = "Moderate"
else:
assessment = "OK"
print(f" {row['feature']:30s} {row['vif']:8.2f} {assessment}")
Variance Inflation Factors:
=======================================================
Feature VIF Assessment
-------------------------------------------------------
hours_last_7d 18.43 BIG PROBLEM
sessions_last_7d 14.27 BIG PROBLEM
hours_last_30d 12.85 BIG PROBLEM
avg_session_duration 8.91 Problem
tenure_months 1.23 OK
monthly_charge 1.08 OK
genre_diversity 1.05 OK
support_tickets_90d 1.04 OK
days_since_login 1.02 OK
VIF Thresholds --- The rules of thumb are well-established:
- VIF < 2.5: No concern.
- VIF 2.5-5: Moderate multicollinearity. Worth monitoring but usually not actionable.
- VIF 5-10: Problem. The feature is substantially explained by other features. Consider dropping or combining.
- VIF > 10: Big problem. The feature is nearly a linear combination of other features. Drop one or more of the collinear set.
The usage features (hours_last_7d, sessions_last_7d, hours_last_30d) all have VIF above 10, confirming what the correlation matrix suggested: they carry heavily overlapping information. You do not need all of them.
Resolving Multicollinearity
When you find collinear features, you have three options:
- Drop features. Keep the one with the highest univariate correlation with the target (or the one that is most interpretable) and remove the others.
- Combine features. Create a composite feature (e.g., engagement_score = 0.5 * hours_7d + 0.3 * sessions_7d + 0.2 * avg_duration) that captures the shared information in a single variable.
- Use PCA on the collinear subset. Extract the first principal component from the correlated group and use it as a single feature. This is a mathematical version of option 2.
For the StreamFlow features, the practitioner approach is usually option 1: keep hours_last_7d (most recent, most predictive of near-term churn) and sessions_last_30d (longer window, captures different behavioral patterns), and drop hours_last_30d and avg_session_duration.
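For completeness, here is a sketch of option 3 applied to a simulated usage group. The composite name usage_pc1 is our own; the generating formulas mirror the earlier correlation example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5000
hours_7d = rng.exponential(8, n)
usage = pd.DataFrame({
    'hours_last_7d': hours_7d,
    'hours_last_30d': hours_7d * 4 + rng.normal(0, 3, n),
    'sessions_last_7d': hours_7d / 0.75 + rng.normal(0, 2, n),
})

# Standardize first so no single raw unit dominates the component,
# then extract the first principal component of the collinear group
scaled = StandardScaler().fit_transform(usage)
pca = PCA(n_components=1)
usage_pc1 = pca.fit_transform(scaled).ravel()
print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]:.1%}")
# usage_pc1 can now replace the three collinear columns in the model
```

The cost of this option is interpretability: "first principal component of the usage block" is harder to explain to stakeholders than hours_last_7d, which is one reason option 1 is the usual practitioner choice.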
Wrapper Methods: Model-Driven Feature Search
Wrapper methods evaluate feature subsets by training a model on each candidate subset and measuring its performance. They are slow but thorough --- they find the best subset for a specific model.
Recursive Feature Elimination (RFE)
RFE is the most practical wrapper method. It starts with all features, trains a model, ranks features by importance, removes the least important feature, and repeats until the desired number of features remains.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
np.random.seed(42)
n = 5000
# Recreate the StreamFlow feature set with noise features
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
churn_logit = (
-2.0 - 0.03 * tenure + 0.02 * monthly_charge
- 0.05 * sessions_7d - 0.02 * hours_30d
+ 0.06 * days_since_login + 0.25 * support_tickets
- 1.5 * genre_diversity - 0.8 * email_open_rate
+ 0.5 * engagement_trend + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
# Add noise features
X = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'sessions_last_7d': sessions_7d,
'hours_last_30d': hours_30d,
'days_since_login': days_since_login,
'support_tickets_90d': support_tickets,
'genre_diversity': genre_diversity,
'email_open_rate': email_open_rate,
'engagement_trend': engagement_trend,
'device_count': device_count,
'noise_1': np.random.normal(0, 1, n),
'noise_2': np.random.uniform(0, 1, n),
'noise_3': np.random.poisson(5, n),
})
y = churned
# RFECV: RFE with cross-validated selection of the optimal number of features
estimator = GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
)
rfecv = RFECV(
estimator=estimator,
step=1,
cv=StratifiedKFold(5, shuffle=True, random_state=42),
scoring='roc_auc',
min_features_to_select=3,
)
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
print()
print("Feature Rankings:")
print("=" * 50)
ranking_df = pd.DataFrame({
'feature': X.columns,
'rank': rfecv.ranking_,
'selected': rfecv.support_,
}).sort_values('rank')
for _, row in ranking_df.iterrows():
status = "SELECTED" if row['selected'] else f"rank {int(row['rank'])}"
print(f" {row['feature']:25s} {status}")
Optimal number of features: 9
Feature Rankings:
==================================================
genre_diversity SELECTED
support_tickets_90d SELECTED
days_since_login SELECTED
email_open_rate SELECTED
sessions_last_7d SELECTED
engagement_trend SELECTED
hours_last_30d SELECTED
tenure_months SELECTED
monthly_charge SELECTED
device_count rank 2
noise_1 rank 3
noise_3 rank 4
noise_2 rank 5
RFECV correctly identified 9 features as optimal --- exactly the 9 that enter the churn-generating logit. device_count, which has no relationship to churn in this simulation, was the last feature eliminated, and the three pure noise features were eliminated before it.
When to use RFE --- RFE is most useful when you have 20-100 features and care about finding the precise optimal subset. For 500+ features, start with filter methods to reduce the set before running RFE.
Forward Selection and Backward Elimination
Forward selection starts with zero features and adds the one that improves performance the most at each step. Backward elimination starts with all features and removes the one whose removal hurts the least. Both are conceptually simple but computationally expensive.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import numpy as np
# Using the same X, y from above
estimator = GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
)
# Forward selection: add features one at a time
sfs_forward = SequentialFeatureSelector(
estimator,
n_features_to_select=8,
direction='forward',
scoring='roc_auc',
cv=5,
)
sfs_forward.fit(X, y)
forward_features = X.columns[sfs_forward.get_support()].tolist()
print("Forward Selection (8 features):")
print(f" Selected: {forward_features}")
print()
# Backward elimination: remove features one at a time
sfs_backward = SequentialFeatureSelector(
estimator,
n_features_to_select=8,
direction='backward',
scoring='roc_auc',
cv=5,
)
sfs_backward.fit(X, y)
backward_features = X.columns[sfs_backward.get_support()].tolist()
print("Backward Elimination (8 features):")
print(f" Selected: {backward_features}")
Forward Selection (8 features):
Selected: ['tenure_months', 'sessions_last_7d', 'hours_last_30d', 'days_since_login', 'support_tickets_90d', 'genre_diversity', 'email_open_rate', 'engagement_trend']
Backward Elimination (8 features):
Selected: ['tenure_months', 'monthly_charge', 'sessions_last_7d', 'days_since_login', 'support_tickets_90d', 'genre_diversity', 'email_open_rate', 'engagement_trend']
Forward and backward selection often agree on the core features but may disagree on the marginal ones. Here, forward selection chose hours_last_30d while backward chose monthly_charge for the eighth slot. Both are reasonable --- the features are close in importance and the difference in model performance is within cross-validation noise.
Practical Note --- SequentialFeatureSelector is slow. With 13 features, 5-fold CV, and a gradient boosted model, forward selection down to 8 features evaluates 13 + 12 + 11 + ... + 6 = 76 candidate subsets, each cross-validated --- 380 model fits. With 50 features, that number explodes. Use forward selection sparingly, and only after filter methods have narrowed the field.
Embedded Methods: Selection During Training
Embedded methods build feature selection into the model training process itself. The model learns which features to use and which to ignore as part of fitting to the data.
L1 (Lasso) Regularization
L1 regularization adds a penalty proportional to the absolute value of each coefficient. This penalty has a special property: it drives weak coefficients to exactly zero, effectively removing the corresponding features from the model.
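A quick contrast makes the exactly-zero property concrete. The sketch below (synthetic data; the C value is illustrative) fits the same logistic regression with an L1 and then an L2 penalty: L2 shrinks the noise coefficients toward zero but, in practice, never lands exactly on it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 3000
X = rng.normal(size=(n, 6))            # features 0-1 carry signal, 2-5 are noise
logit = 1.2 * X[:, 0] - 0.9 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = (logit > 0).astype(int)
X_scaled = StandardScaler().fit_transform(X)

n_zeroed = {}
for penalty in ('l1', 'l2'):
    model = LogisticRegression(penalty=penalty, C=0.02, solver='saga',
                               max_iter=5000, random_state=42)
    model.fit(X_scaled, y)
    # Count coefficients driven to exactly zero by the penalty
    n_zeroed[penalty] = int((model.coef_[0] == 0.0).sum())
    print(f"{penalty}: {n_zeroed[penalty]} of 6 coefficients exactly zero")
```

The L1 fit zeroes the noise features outright; the L2 fit keeps all six coefficients non-zero, just small.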
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
np.random.seed(42)
# Using the same X, y from the filter methods section
# (recreating for self-contained code)
n = 10000
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
churn_logit = (
-2.0 - 0.03 * tenure + 0.02 * monthly_charge
- 0.05 * sessions_7d - 0.02 * hours_30d
+ 0.06 * days_since_login + 0.25 * support_tickets
- 1.5 * genre_diversity - 0.8 * email_open_rate
+ 0.5 * engagement_trend + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'sessions_last_7d': sessions_7d,
'hours_last_30d': hours_30d,
'days_since_login': days_since_login,
'support_tickets_90d': support_tickets,
'genre_diversity': genre_diversity,
'email_open_rate': email_open_rate,
'engagement_trend': engagement_trend,
'device_count': device_count,
'noise_1': np.random.normal(0, 1, n),
'noise_2': np.random.uniform(0, 1, n),
'noise_3': np.random.poisson(5, n),
})
y = churned
# Lasso logistic regression with varying regularization strength
for C in [0.01, 0.1, 1.0]:
pipe = Pipeline([
('scaler', StandardScaler()),
('lasso', LogisticRegression(penalty='l1', C=C, solver='saga',
max_iter=5000, random_state=42)),
])
pipe.fit(X, y)
coefs = pd.Series(
pipe.named_steps['lasso'].coef_[0],
index=X.columns
)
n_nonzero = (coefs != 0).sum()
n_zero = (coefs == 0).sum()
print(f"\nL1 Logistic Regression (C={C}):")
print(f" Non-zero coefficients: {n_nonzero} | Zeroed out: {n_zero}")
print(f" {'Feature':30s} {'Coefficient':>12s}")
print(f" {'-' * 44}")
for feat in coefs.abs().sort_values(ascending=False).index:
val = coefs[feat]
marker = "" if val != 0 else " (dropped)"
print(f" {feat:30s} {val:12.4f}{marker}")
L1 Logistic Regression (C=0.01):
Non-zero coefficients: 5 | Zeroed out: 8
Feature Coefficient
--------------------------------------------
genre_diversity -0.2847
support_tickets_90d 0.1923
days_since_login 0.1456
email_open_rate -0.0834
engagement_trend 0.0512
tenure_months 0.0000 (dropped)
monthly_charge 0.0000 (dropped)
sessions_last_7d 0.0000 (dropped)
hours_last_30d 0.0000 (dropped)
device_count 0.0000 (dropped)
noise_1 0.0000 (dropped)
noise_2 0.0000 (dropped)
noise_3 0.0000 (dropped)
L1 Logistic Regression (C=0.1):
Non-zero coefficients: 9 | Zeroed out: 4
Feature Coefficient
--------------------------------------------
genre_diversity -0.5614
support_tickets_90d 0.3567
days_since_login 0.2834
email_open_rate -0.2156
sessions_last_7d -0.1523
engagement_trend 0.1478
hours_last_30d -0.0945
tenure_months -0.0712
monthly_charge 0.0534
device_count 0.0000 (dropped)
noise_1 0.0000 (dropped)
noise_2 0.0000 (dropped)
noise_3 0.0000 (dropped)
L1 Logistic Regression (C=1.0):
Non-zero coefficients: 10 | Zeroed out: 3
Feature Coefficient
--------------------------------------------
genre_diversity -0.6823
support_tickets_90d 0.4012
days_since_login 0.3145
email_open_rate -0.2678
sessions_last_7d -0.1834
engagement_trend 0.1756
hours_last_30d -0.1234
tenure_months -0.0956
monthly_charge 0.0678
device_count -0.0234
noise_1 0.0000 (dropped)
noise_2 0.0000 (dropped)
noise_3 0.0000 (dropped)
As regularization weakens (C increases from 0.01 to 1.0), more features survive the L1 penalty. At strong regularization (C=0.01), only the 5 most important features remain. At moderate regularization (C=0.1), 9 features survive --- all genuine features and no noise. At weak regularization (C=1.0), 10 features survive (adding device_count). At no point do the noise features survive.
Practical Rule --- Use L1 regularization as a feature selection method when you want to simultaneously select features and train a model. Tune C using cross-validation. The optimal C balances keeping enough features to capture the signal against dropping enough to avoid noise.
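One convenient way to tune C is LogisticRegressionCV, which cross-validates a grid of C values in a single fit. A minimal sketch on synthetic data (the grid and data here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 4))            # features 0-1 informative, 2-3 noise
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = (logit > 0).astype(int)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso_cv', LogisticRegressionCV(
        Cs=np.logspace(-3, 1, 10),     # grid of C values to cross-validate
        penalty='l1', solver='saga', cv=5,
        scoring='roc_auc', max_iter=5000, random_state=42,
    )),
])
pipe.fit(X, y)
model = pipe.named_steps['lasso_cv']
print(f"Best C: {model.C_[0]:.4f}")
print(f"Coefficients: {model.coef_[0].round(3)}")
```

Keeping the scaler inside the pipeline matters here too: the L1 penalty is only comparable across features when they share a scale.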
Tree-Based Feature Importance
Decision trees and ensemble methods (random forests, gradient boosting) naturally rank features by how much they contribute to reducing impurity or improving predictions. This ranking is a form of embedded feature selection.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import pandas as pd
import numpy as np
np.random.seed(42)
# Using the same X, y as above
# Fit gradient boosting and extract feature importances
gb_model = GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42
)
gb_model.fit(X, y)
gb_importance = pd.Series(
gb_model.feature_importances_, index=X.columns
).sort_values(ascending=False)
# Fit random forest for comparison
rf_model = RandomForestClassifier(
n_estimators=200, max_depth=None, random_state=42
)
rf_model.fit(X, y)
rf_importance = pd.Series(
rf_model.feature_importances_, index=X.columns
).sort_values(ascending=False)
# Compare
comparison = pd.DataFrame({
'gb_importance': gb_importance,
'rf_importance': rf_importance,
}).sort_values('gb_importance', ascending=False)
print("Tree-Based Feature Importance Comparison:")
print("=" * 65)
print(f" {'Feature':25s} {'GB Importance':>14s} {'RF Importance':>14s}")
print("-" * 65)
for feat, row in comparison.iterrows():
print(f" {feat:25s} {row['gb_importance']:14.4f} {row['rf_importance']:14.4f}")
Tree-Based Feature Importance Comparison:
=================================================================
Feature GB Importance RF Importance
-----------------------------------------------------------------
genre_diversity 0.2134 0.1987
support_tickets_90d 0.1678 0.1423
days_since_login 0.1345 0.1256
email_open_rate 0.1102 0.1034
sessions_last_7d 0.0834 0.0923
engagement_trend 0.0756 0.0812
hours_last_30d 0.0623 0.0745
tenure_months 0.0534 0.0612
monthly_charge 0.0412 0.0489
device_count 0.0198 0.0267
noise_1 0.0134 0.0156
noise_3 0.0127 0.0148
noise_2 0.0123 0.0148
Tree-based importance does not zero out noise features --- it assigns them small but non-zero importance values. This is because trees can always find a weak split on any feature, even random noise. You need a threshold to separate signal from noise.
Practical Rule --- A useful heuristic: features with importance below 1/N (where N is the number of features) are likely noise. With 13 features, that threshold is about 0.077. Everything below that line is suspect.
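The 1/N heuristic takes only a few lines to apply. A self-contained sketch (synthetic data; the feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 3000
X = pd.DataFrame({
    'signal_a': rng.normal(size=n),
    'signal_b': rng.normal(size=n),
    'signal_c': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'noise_c': rng.normal(size=n),
})
y = ((X['signal_a'] + 0.9 * X['signal_b'] + 0.8 * X['signal_c']
      + rng.normal(scale=0.5, size=n)) > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                   random_state=42).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
threshold = 1.0 / X.shape[1]           # 1/N: anything below this is suspect
keep = importance[importance >= threshold].index.tolist()
print(f"Threshold: {threshold:.3f}")
print(f"Kept: {sorted(keep)}")
```

On this data the three signal features clear the 1/6 bar and the three noise features do not; treat the cutoff as a screening heuristic, not a proof.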
Permutation Importance: A More Reliable Alternative
Tree-based impurity importance has known biases: it favors high-cardinality features and features with many possible split points. Permutation importance avoids these biases by measuring how much model performance drops when a feature's values are randomly shuffled.
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
np.random.seed(42)
# Split to avoid information leakage in importance calculation
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
gb_model = GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42
)
gb_model.fit(X_train, y_train)
# Permutation importance on the TEST set (not training set)
perm_result = permutation_importance(
gb_model, X_test, y_test,
n_repeats=30,
random_state=42,
scoring='roc_auc',
)
perm_df = pd.DataFrame({
'feature': X.columns,
'importance_mean': perm_result.importances_mean,
'importance_std': perm_result.importances_std,
}).sort_values('importance_mean', ascending=False)
print("Permutation Importance (test set, 30 repeats):")
print("=" * 65)
print(f" {'Feature':25s} {'Mean Drop':>12s} {'Std':>8s} Signal?")
print("-" * 65)
for _, row in perm_df.iterrows():
signal = "YES" if row['importance_mean'] > 2 * row['importance_std'] else "noise"
print(f" {row['feature']:25s} {row['importance_mean']:12.4f} "
f"{row['importance_std']:8.4f} {signal}")
Permutation Importance (test set, 30 repeats):
=================================================================
Feature Mean Drop Std Signal?
-----------------------------------------------------------------
genre_diversity 0.0523 0.0034 YES
support_tickets_90d 0.0378 0.0028 YES
days_since_login 0.0312 0.0031 YES
email_open_rate 0.0245 0.0025 YES
sessions_last_7d 0.0178 0.0022 YES
engagement_trend 0.0156 0.0019 YES
hours_last_30d 0.0123 0.0018 YES
tenure_months 0.0089 0.0015 YES
monthly_charge 0.0056 0.0012 YES
device_count 0.0023 0.0011 YES
noise_1 0.0003 0.0009 noise
noise_2 -0.0001 0.0008 noise
noise_3 0.0002 0.0010 noise
Permutation importance is cleaner: noise features have importance near zero (sometimes negative, meaning shuffling them actually helped), and genuine features are clearly separated. The signal-to-noise threshold --- "is the mean importance greater than twice the standard deviation?" --- provides a principled cutoff.
Best Practice --- Always compute permutation importance on the test set, not the training set. Training-set permutation importance conflates feature importance with overfitting. A feature the model memorized will look important on training data but contribute nothing on test data.
The Critical Mistake: Feature Selection Outside Cross-Validation
This is the single most important section in this chapter. If you read nothing else, read this.
Feature selection must happen inside your cross-validation loop. If you select features on the full dataset and then cross-validate, you have leaked information from the test folds into the feature selection step. The result is an optimistically biased estimate of model performance that will not generalize.
The Wrong Way
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
np.random.seed(42)
n = 2000
p = 50
# Generate data: ALL features are pure noise. No relationship to target.
X_noise = pd.DataFrame(
np.random.normal(0, 1, (n, p)),
columns=[f'noise_{i}' for i in range(p)]
)
y_noise = np.random.choice([0, 1], n) # Random target
# WRONG: Select features on the full dataset, then cross-validate
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_noise, y_noise) # Leak!
model = GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
)
wrong_scores = cross_val_score(model, X_selected, y_noise, cv=5, scoring='roc_auc')
print("WRONG WAY: Feature selection OUTSIDE cross-validation")
print(f" AUC: {wrong_scores.mean():.3f} (+/- {wrong_scores.std():.3f})")
print(f" This looks like the model has learned something.")
print(f" It has not. Every feature is noise. The target is random.")
WRONG WAY: Feature selection OUTSIDE cross-validation
AUC: 0.573 (+/- 0.024)
This looks like the model has learned something.
It has not. Every feature is noise. The target is random.
An AUC of 0.573 on purely random data. That does not sound like much, but it is enough to convince a careless practitioner that there is signal in the data. In a real project, where some features have genuine signal and others do not, the optimistic bias can be 3-5 AUC points --- enough to change your conclusions about which model to deploy.
The Right Way
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
np.random.seed(42)
n = 2000
p = 50
X_noise = np.random.normal(0, 1, (n, p))
y_noise = np.random.choice([0, 1], n)
# RIGHT: Feature selection INSIDE the pipeline, inside cross-validation
pipe = Pipeline([
('select', SelectKBest(mutual_info_classif, k=10)),
('model', GradientBoostingClassifier(
n_estimators=100, max_depth=3, random_state=42
)),
])
right_scores = cross_val_score(pipe, X_noise, y_noise, cv=5, scoring='roc_auc')
print("RIGHT WAY: Feature selection INSIDE cross-validation (via Pipeline)")
print(f" AUC: {right_scores.mean():.3f} (+/- {right_scores.std():.3f})")
print(f" This correctly shows no signal (AUC near 0.50).")
RIGHT WAY: Feature selection INSIDE cross-validation (via Pipeline)
AUC: 0.502 (+/- 0.031)
This correctly shows no signal (AUC near 0.50).
When feature selection happens inside the pipeline, the cross-validation correctly reports AUC near 0.50 --- no signal, as expected for random data. The pipeline ensures that each fold's feature selection uses only the training data for that fold.
Non-Negotiable Rule --- Feature selection must be part of the pipeline. This is not a suggestion. This is not a best practice for advanced users. This is a correctness requirement. If your feature selection step sees the test data, your performance estimate is wrong.
Building a Complete Feature Selection Pipeline
Let us bring everything together into a production-ready pipeline that combines filter, embedded, and evaluation methods without leaking information.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
np.random.seed(42)
n = 8000
# Simulate the full StreamFlow feature set (25+ features)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = np.random.poisson(12, n)
sessions_30d = sessions_7d * 4 + np.random.poisson(3, n)
avg_duration = hours_7d * 60 / np.maximum(sessions_7d, 1)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
used_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
weekend_ratio = np.random.beta(5, 5, n)
peak_hour_ratio = np.random.beta(3, 4, n)
completion_rate = np.random.beta(4, 2, n)
search_frequency = np.random.poisson(3, n)
recommendation_clicks = np.random.poisson(2, n)
# Noise features
noise_feats = {f'noise_{i}': np.random.normal(0, 1, n) for i in range(5)}
# Target
churn_logit = (
-2.5
- 0.02 * tenure + 0.015 * monthly_charge
- 0.04 * sessions_7d - 0.01 * hours_30d
+ 0.05 * days_since_login + 0.2 * support_tickets
- 1.2 * genre_diversity - 0.6 * email_open_rate
+ 0.4 * engagement_trend - 0.3 * plan_premium
- 0.5 * referred_friend - 0.2 * completion_rate
+ np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'hours_last_7d': hours_7d,
'hours_last_30d': hours_30d,
'sessions_last_7d': sessions_7d,
'sessions_last_30d': sessions_30d,
'avg_session_duration': avg_duration,
'days_since_login': days_since_login,
'support_tickets_90d': support_tickets,
'genre_diversity': genre_diversity,
'email_open_rate': email_open_rate,
'engagement_trend': engagement_trend,
'device_count': device_count,
'plan_is_premium': plan_premium,
'referred_a_friend': referred_friend,
'used_mobile_app': used_mobile,
'weekend_ratio': weekend_ratio,
'peak_hour_ratio': peak_hour_ratio,
'completion_rate': completion_rate,
'search_frequency': search_frequency,
'recommendation_clicks': recommendation_clicks,
**noise_feats,
})
y = churned
print(f"Starting features: {X.shape[1]}")
print(f"Churn rate: {y.mean():.1%}")
# Build a pipeline: VarianceThreshold -> L1 selection -> GBT model
selection_pipeline = Pipeline([
('variance', VarianceThreshold(threshold=0.01)),
('scaler', StandardScaler()),
('l1_select', SelectFromModel(
LogisticRegression(penalty='l1', C=0.1, solver='saga',
max_iter=5000, random_state=42),
threshold='median', # Keep features above median importance
)),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42,
)),
])
# Cross-validate with feature selection INSIDE the pipeline
cv = StratifiedKFold(5, shuffle=True, random_state=42)
scores = cross_val_score(selection_pipeline, X, y, cv=cv, scoring='roc_auc')
print(f"\nPipeline with feature selection (AUC): {scores.mean():.4f} (+/- {scores.std():.4f})")
# Compare: no feature selection
no_select_pipeline = Pipeline([
('scaler', StandardScaler()),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42,
)),
])
scores_all = cross_val_score(no_select_pipeline, X, y, cv=cv, scoring='roc_auc')
print(f"All features, no selection (AUC): {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
print(f"\nDifference: {(scores.mean() - scores_all.mean()) * 100:+.2f} AUC points")
Starting features: 26
Churn rate: 28.4%
Pipeline with feature selection (AUC): 0.8567 (+/- 0.0078)
All features, no selection (AUC): 0.8489 (+/- 0.0092)
Difference: +0.78 AUC points
Feature selection improved AUC by 0.78 points and reduced variance. But the more important benefit is not measured here: the selected model is faster to train, easier to interpret, simpler to deploy, and cheaper to monitor.
To see which features the pipeline selected:
# Fit the pipeline on the full training data to inspect selections
selection_pipeline.fit(X, y)
# Get feature names that survived VarianceThreshold
variance_mask = selection_pipeline.named_steps['variance'].get_support()
after_variance = X.columns[variance_mask]
# Get feature names that survived L1 selection
l1_mask = selection_pipeline.named_steps['l1_select'].get_support()
final_features = after_variance[l1_mask]
print(f"After VarianceThreshold: {len(after_variance)} features")
print(f"After L1 selection: {len(final_features)} features")
print(f"\nFinal selected features:")
for feat in final_features:
print(f" - {feat}")
After VarianceThreshold: 26 features
After L1 selection: 13 features
Final selected features:
- tenure_months
- monthly_charge
- sessions_last_7d
- days_since_login
- support_tickets_90d
- genre_diversity
- email_open_rate
- engagement_trend
- plan_is_premium
- referred_a_friend
- used_mobile_app
- completion_rate
- search_frequency
The pipeline kept 13 of 26 features, dropping all 5 noise features, the highly correlated usage variants (hours_last_7d, hours_last_30d, sessions_last_30d, avg_session_duration), and four low-importance features (device_count, weekend_ratio, peak_hour_ratio, recommendation_clicks). This is a reasonable and defensible feature set.
Decision Framework: Which Method When?
The choice between filter, wrapper, and embedded methods depends on your context:
| Situation | Recommended Approach |
|---|---|
| 500+ features, exploratory phase | Filter first (variance threshold + correlation), then embedded (L1 or tree importance) |
| 20-100 features, model selection phase | Embedded (L1 + tree importance) or RFECV |
| <20 features, final optimization | Wrapper (RFECV or forward selection) for every last AUC point |
| Collinear features causing instability | VIF analysis, then drop or combine |
| Production deployment priority | L1 selection (fewest features, most stable) |
| Interpretability priority | Permutation importance on test set, manual review with domain experts |
Practical Rule --- In most real-world projects, the combination of VarianceThreshold + SelectFromModel with L1 regularization inside a Pipeline handles 90% of feature selection needs. Start there. Go to RFECV or forward selection only if you need to squeeze out the last fraction of performance.
StreamFlow: Selecting from 25+ Engineered Features
Let us apply everything we have learned to the StreamFlow churn prediction problem. In Chapters 6, 7, and 8, we engineered and prepared over 25 features. Now we select the subset that maximizes predictive performance.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
np.random.seed(42)
n = 15000
# Full StreamFlow feature set from previous chapters
# (See Chapter 6 for engineering details, Chapter 8 for imputation)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = np.random.poisson(12, n)
sessions_30d = sessions_7d * 4 + np.random.poisson(3, n)
avg_duration = hours_7d * 60 / np.maximum(sessions_7d, 1)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
nps_score = np.random.choice(range(0, 11), n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
used_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
weekend_ratio = np.random.beta(5, 5, n)
peak_hour_ratio = np.random.beta(3, 4, n)
completion_rate = np.random.beta(4, 2, n)
search_frequency = np.random.poisson(3, n)
recommendation_clicks = np.random.poisson(2, n)
# Missing indicators from Chapter 8
hours_7d_missing = np.random.choice([0, 1], n, p=[0.88, 0.12])
email_missing = np.random.choice([0, 1], n, p=[0.78, 0.22])
nps_missing = np.random.choice([0, 1], n, p=[0.35, 0.65])
# Interaction features from Chapter 6
tenure_x_charge = tenure * monthly_charge
hours_per_tenure = hours_30d / np.maximum(tenure, 1)
tickets_per_tenure = support_tickets / np.maximum(tenure / 12, 1)
churn_logit = (
-3.0
- 0.02 * tenure + 0.012 * monthly_charge
- 0.04 * sessions_7d + 0.05 * days_since_login
+ 0.2 * support_tickets - 1.0 * genre_diversity
- 0.5 * email_open_rate + 0.35 * engagement_trend
- 0.3 * plan_premium - 0.5 * referred_friend
- 0.2 * completion_rate + 0.8 * hours_7d_missing
+ 0.3 * nps_missing
+ np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X_full = pd.DataFrame({
'tenure_months': tenure,
'monthly_charge': monthly_charge,
'hours_last_7d': hours_7d,
'hours_last_30d': hours_30d,
'sessions_last_7d': sessions_7d,
'sessions_last_30d': sessions_30d,
'avg_session_duration': avg_duration,
'days_since_login': days_since_login,
'support_tickets_90d': support_tickets,
'genre_diversity': genre_diversity,
'email_open_rate': email_open_rate,
'nps_score': nps_score,
'engagement_trend': engagement_trend,
'device_count': device_count,
'plan_is_premium': plan_premium,
'referred_a_friend': referred_friend,
'used_mobile_app': used_mobile,
'weekend_ratio': weekend_ratio,
'peak_hour_ratio': peak_hour_ratio,
'completion_rate': completion_rate,
'search_frequency': search_frequency,
'recommendation_clicks': recommendation_clicks,
'hours_7d_missing': hours_7d_missing,
'email_missing': email_missing,
'nps_missing': nps_missing,
'tenure_x_charge': tenure_x_charge,
'hours_per_tenure': hours_per_tenure,
'tickets_per_tenure': tickets_per_tenure,
})
y = churned
print(f"Total features: {X_full.shape[1]}")
print(f"Samples: {X_full.shape[0]}")
print(f"Churn rate: {y.mean():.1%}")
print()
cv = StratifiedKFold(5, shuffle=True, random_state=42)
# Strategy 1: All features, no selection
pipe_all = Pipeline([
('scaler', StandardScaler()),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42)),
])
scores_all = cross_val_score(pipe_all, X_full, y, cv=cv, scoring='roc_auc')
# Strategy 2: VarianceThreshold + L1 selection
pipe_l1 = Pipeline([
('variance', VarianceThreshold(threshold=0.01)),
('scaler', StandardScaler()),
('l1_select', SelectFromModel(
LogisticRegression(penalty='l1', C=0.1, solver='saga',
max_iter=5000, random_state=42))),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42)),
])
scores_l1 = cross_val_score(pipe_l1, X_full, y, cv=cv, scoring='roc_auc')
print("StreamFlow Churn Prediction: Feature Selection Comparison")
print("=" * 60)
print(f" All 28 features (no selection): AUC = {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
print(f" L1-selected features: AUC = {scores_l1.mean():.4f} (+/- {scores_l1.std():.4f})")
print(f"\n Improvement: {(scores_l1.mean() - scores_all.mean()) * 100:+.2f} AUC points")
print(f" Variance reduction: {(1 - scores_l1.std() / scores_all.std()) * 100:.0f}%")
StreamFlow Churn Prediction: Feature Selection Comparison
============================================================
All 28 features (no selection): AUC = 0.8623 (+/- 0.0068)
L1-selected features: AUC = 0.8701 (+/- 0.0052)
Improvement: +0.78 AUC points
Variance reduction: 24%
Feature selection gave us a model that is more accurate, more stable, and uses roughly half the features. The missing indicators (hours_7d_missing, nps_missing) survived selection --- they carry genuine signal, as we learned in Chapter 8. The redundant usage variants were pruned. The noise was removed.
This is what a good feature selection pipeline looks like: start with everything, let the data decide what stays, and make sure the selection happens inside cross-validation so you can trust the numbers.
Progressive Project M3: Feature Selection for StreamFlow
Milestone 3, Part 3 --- Apply feature selection to the 25+ features you engineered in Chapters 6-8.
Your Task
1. Filter screening. Compute the correlation matrix for your StreamFlow features. Identify all pairs with |r| > 0.8. Compute VIF for numeric features and flag any with VIF > 5.
2. Embedded selection. Build a Pipeline with VarianceThreshold, StandardScaler, SelectFromModel (using L1-regularized logistic regression), and a GradientBoostingClassifier. Cross-validate this pipeline and record the AUC.
3. Compare. Train the same GradientBoostingClassifier on all features (no selection) and compare AUC, training time, and the number of features used.
4. Document. For each feature that was dropped, write one sentence explaining why it was removed (noise, redundancy, or low importance). For each feature that was kept, write one sentence explaining what information it provides.
Deliverables
- A notebook showing the feature selection process with all three method families
- A comparison table: all features vs. selected features (AUC, training time, feature count)
- A written justification for the final feature set (1-2 paragraphs)
Track A (Minimum)
Complete steps 1-3 with the provided pipeline code. Compare two configurations.
Track B (Recommended)
Complete all steps. Also run RFECV to find the optimal number of features and compare with the L1 selection result. Plot the RFECV curve.
Track C (Advanced)
Complete Track B. Also implement a custom feature selection step that combines filter (VIF < 5 threshold) with embedded (L1) methods. Compare the three strategies (no selection, L1 only, VIF + L1) using nested cross-validation to get unbiased estimates.
Chapter Summary
Feature selection is the bridge between feature engineering and modeling. It takes the raw material of Chapter 6 (engineered features), the encoded variables of Chapter 7 (categorical handling), and the imputed/indicator features of Chapter 8 (missing data), and distills them into a lean, predictive feature set.
The three families --- filter, wrapper, and embedded --- are not competitors. They are tools in a sequence. Filter methods for initial screening. Embedded methods for refined selection. Wrapper methods for final optimization. VIF and correlation analysis for detecting and resolving multicollinearity.
And above all: feature selection must happen inside cross-validation. The pipeline is not optional. It is the difference between a performance estimate you can trust and one that will embarrass you in production.
More features is not more better. The right features, selected rigorously, will beat a kitchen-sink model every time.
Next chapter: Chapter 10 --- Building Reproducible Data Pipelines, where we lock down the entire preprocessing and feature engineering workflow into a reproducible, version-controlled pipeline.