

Chapter 9: Feature Selection

Reducing Dimensionality Without Losing Signal


Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply filter methods (correlation, mutual information, chi-squared) for initial screening
  2. Use wrapper methods (RFE, forward/backward selection) for model-driven selection
  3. Leverage embedded methods (L1 regularization, tree-based importance) for integrated selection
  4. Detect and handle multicollinearity (VIF, correlation matrices)
  5. Build a feature selection pipeline that does not leak information

More Features Is Not More Better

War Story --- A SaaS analytics team had spent three weeks on feature engineering, and they were proud of the result. Starting from raw event logs, they had created 127 features for their churn prediction model: usage metrics, engagement ratios, behavioral sequences, temporal aggregates, interaction terms, polynomial expansions, and every ratio-of-ratios they could think of. The gradient boosted model trained on all 127 features achieved an AUC of 0.836 on the holdout set. Respectable.

Then an intern asked a question that nobody had considered: "How many of those features actually matter?" The team ran a quick feature importance plot and discovered that the top 15 features accounted for 94% of the model's total importance. The remaining 112 features split the leftover 6% among them, each contributing less than a rounding error. Some had negative importance --- they were actively hurting the model by introducing noise that the algorithm had to learn to ignore.

They retrained on just the top 15 features. The AUC went up to 0.851. Training time dropped from 47 minutes to 90 seconds. The model was faster, more interpretable, more stable across time periods, and easier to deploy and monitor. Removing 88% of the features improved everything.

The lesson: feature engineering creates candidates. Feature selection decides which ones earn a seat at the table. These are different jobs, and skipping the second one is like hiring every applicant who submits a resume.

That war story should feel uncomfortable if you have been diligently engineering features for the past three chapters. Everything we built in Chapters 6, 7, and 8 --- the temporal aggregates, the encoded categoricals, the imputed values, the missing indicators --- all of it is candidate material. This chapter is the audition.

Feature selection is the process of choosing a subset of features that maximizes predictive performance while minimizing noise, redundancy, and computational cost. It is not optional. It is not something you do "if you have time." It is a core modeling step, and doing it wrong (or not at all) will cost you accuracy, interpretability, and production stability.

Here is why.


The Curse of Dimensionality --- In Practice, Not Theory

You have probably heard the phrase "curse of dimensionality" in a statistics or machine learning course, accompanied by abstract geometric arguments about how volume concentrates in high-dimensional spaces. That is true but not helpful. Let us talk about what the curse actually does to your models.

What Actually Goes Wrong with Too Many Features

Problem 1: Overfitting to noise. Every feature you add gives the model another degree of freedom to fit the training data. If a feature is genuinely uninformative --- if it has zero true relationship to the target --- the model can still find a spurious pattern in the training data by chance. With 100 uninformative features, the model will find dozens of phantom patterns. It will learn them, memorize them, and use them confidently on the training set. Then it will fail on new data because the phantom patterns do not exist in the real world.

Problem 2: Diluted signal. Regularization and early stopping both constrain model complexity. When you have many irrelevant features alongside a few genuinely predictive ones, that limited complexity budget is spread across all features: the model must spend part of it suppressing noise features, leaving less capacity to learn the true signal. More features means more noise for the model to wade through.

Problem 3: Multicollinearity. Correlated features create instability in any model that estimates coefficients (linear regression, logistic regression, regularized models). When two features carry nearly identical information, the model cannot distinguish their individual contributions. Small changes in the data produce large swings in the coefficients. The model's predictions may be fine, but the feature importances and coefficients become meaningless --- and if you are using them for interpretation or decision-making, that is a serious problem.

Problem 4: Computational cost. More features means more parameters to estimate, more memory to allocate, more time to train, and more storage for model artifacts. In production, every additional feature is a dependency that must be computed, validated, and monitored. A model with 15 features is dramatically easier to deploy and maintain than a model with 127.

Problem 5: Monitoring burden. Every feature in production requires drift monitoring. If a feature's distribution shifts, you need to know about it. With 15 features, that is manageable. With 127, you are drowning in alerts, most of them for features that do not matter anyway.

Let us see the curse in action:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

np.random.seed(42)

# Generate a dataset with 10 informative features and varying noise features
n_samples = 5000
n_informative = 10

results = []

for n_noise in [0, 10, 25, 50, 100, 200]:
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_informative + n_noise,
        n_informative=n_informative,
        n_redundant=0,
        n_repeated=0,
        n_clusters_per_class=2,
        random_state=42,
    )

    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=4,
        random_state=42,
    )

    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')

    results.append({
        'total_features': n_informative + n_noise,
        'noise_features': n_noise,
        'auc_mean': scores.mean(),
        'auc_std': scores.std(),
    })

results_df = pd.DataFrame(results)
print("Effect of Noise Features on Model Performance")
print("=" * 60)
print(f"{'Total Features':>16s}  {'Noise':>6s}  {'AUC Mean':>10s}  {'AUC Std':>9s}")
print("-" * 60)
for _, row in results_df.iterrows():
    print(f"{int(row['total_features']):>16d}  {int(row['noise_features']):>6d}  "
          f"{row['auc_mean']:>10.4f}  {row['auc_std']:>9.4f}")
Effect of Noise Features on Model Performance
============================================================
  Total Features   Noise    AUC Mean    AUC Std
------------------------------------------------------------
              10       0      0.9542     0.0061
              20      10      0.9487     0.0058
              35      25      0.9381     0.0072
              60      50      0.9203     0.0089
             110     100      0.8946     0.0105
             210     200      0.8621     0.0134

The model degrades steadily as noise features accumulate. With 200 noise features alongside 10 real ones, the AUC drops from 0.954 to 0.862 --- nearly a 10-point decline in discrimination. And gradient boosting is relatively robust to noise. Linear models degrade faster.

Key Insight --- The curse of dimensionality is not a theoretical curiosity. It is a measurable, reproducible degradation in model performance that happens every time you include features that do not carry signal.


The Three Families of Feature Selection

Feature selection methods fall into three families, and understanding the tradeoffs between them is the first step toward choosing the right approach for your problem.

Filter --- Scores features with statistical tests, independent of any model. Advantages: fast, scalable, model-agnostic. Disadvantages: ignores feature interactions and model-specific relevance.

Wrapper --- Searches subsets of features, using a model's performance as the objective. Advantages: finds near-optimal subsets for a specific model. Disadvantages: slow, computationally expensive, and prone to overfitting the selection criterion.

Embedded --- Selection happens during model training (built into the algorithm). Advantages: efficient, accounts for interactions, and produces a trained model and the selected features simultaneously. Disadvantages: model-specific --- what Lasso selects may differ from what a tree selects.

There is no single best family. In practice, you will often use all three in sequence: filter methods for initial screening, embedded methods for refined selection, and wrapper methods when you need every last fraction of a percent.
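That three-stage sequence can be sketched as a scikit-learn Pipeline. This is a hypothetical arrangement on synthetic data --- the stage sizes (keep 30 after filtering, 10 after the wrapper) and the C=0.5 regularization strength are illustrative assumptions, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, mutual_info_classif, SelectFromModel, RFE
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=60, n_informative=8,
                           random_state=42)

pipeline = Pipeline([
    # Stage 1 (filter): drop constant features, then keep top 30 by MI
    ('variance', VarianceThreshold(threshold=0.0)),
    ('filter', SelectKBest(mutual_info_classif, k=30)),
    # Stage 2 (embedded): L1 logistic regression keeps nonzero-weight features
    ('embedded', SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.5))),
    # Stage 3 (wrapper): RFE refines down to a final subset
    ('wrapper', RFE(GradientBoostingClassifier(n_estimators=50, random_state=42),
                    n_features_to_select=10)),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X, y)
print(f"Final model trained on {pipeline.named_steps['wrapper'].n_features_} features")
```

Because each stage is a transformer inside the pipeline, the entire selection sequence is refit inside each cross-validation fold, which matters for the leakage concerns discussed later.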


Filter Methods: Fast Screening with Statistical Tests

Filter methods evaluate each feature independently (or in pairs) using a statistical measure of relevance. They do not train a model. They are fast, scalable, and useful for reducing a large feature set to a manageable size before applying more expensive methods.

Variance Threshold: The Absolute Minimum

Before you do anything else, remove features with zero or near-zero variance. A feature that takes the same value for 99.5% of observations carries almost no information.

from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import numpy as np

np.random.seed(42)

# Simulate StreamFlow features including some near-constant ones
n = 10000
df = pd.DataFrame({
    'tenure_months': np.random.exponential(18, n).clip(1, 72),
    'monthly_charge': np.round(np.random.uniform(9.99, 49.99, n), 2),
    'sessions_last_7d': np.random.poisson(12, n),
    'hours_last_30d': np.random.exponential(20, n),
    'is_annual_plan': np.random.choice([0, 1], n, p=[0.85, 0.15]),
    'has_parental_controls': np.random.choice([0, 1], n, p=[0.995, 0.005]),  # Near-constant
    'currency_is_usd': np.random.choice([0, 1], n, p=[0.98, 0.02]),         # Near-constant
    'platform_version_flag': np.ones(n),                                      # Constant
})

# VarianceThreshold removes features below a variance cutoff
# For binary features, var = p * (1 - p). A threshold of 0.02 removes
# features where the minority class is below ~2%
selector = VarianceThreshold(threshold=0.02)
X_filtered = selector.fit_transform(df)

kept = df.columns[selector.get_support()].tolist()
dropped = df.columns[~selector.get_support()].tolist()

print(f"Features kept ({len(kept)}): {kept}")
print(f"Features dropped ({len(dropped)}): {dropped}")
print()
for col in df.columns:
    v = df[col].var()
    status = "KEPT" if col in kept else "DROPPED"
    print(f"  {col:30s}  var={v:10.4f}  {status}")
Features kept (5): ['tenure_months', 'monthly_charge', 'sessions_last_7d', 'hours_last_30d', 'is_annual_plan']
Features dropped (3): ['has_parental_controls', 'currency_is_usd', 'platform_version_flag']

  tenure_months                   var=  313.4127  KEPT
  monthly_charge                  var=  133.4621  KEPT
  sessions_last_7d                var=   12.0384  KEPT
  hours_last_30d                  var=  398.1056  KEPT
  is_annual_plan                  var=    0.1275  KEPT
  has_parental_controls           var=    0.0050  DROPPED
  currency_is_usd                 var=    0.0196  DROPPED
  platform_version_flag           var=    0.0000  DROPPED

VarianceThreshold is a blunt instrument, but it catches the obvious cases. A feature that is constant for 99.5% of observations cannot carry much information, regardless of what the remaining 0.5% looks like.

Correlation with the Target: Univariate Screening

The simplest relevance measure is the correlation between each feature and the target. For numeric features, use Pearson or Spearman correlation. For categorical features and a binary target, use chi-squared or mutual information.

from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import numpy as np

np.random.seed(42)
n = 10000

# Simulate a realistic StreamFlow feature matrix with a churn target
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
trend_7d_30d = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])

# Target: churn probability depends on a subset of features
churn_logit = (
    -2.0
    - 0.03 * tenure
    + 0.02 * monthly_charge
    - 0.05 * sessions_7d
    - 0.02 * hours_30d
    + 0.06 * days_since_login
    + 0.25 * support_tickets
    - 1.5 * genre_diversity
    - 0.8 * email_open_rate
    + 0.5 * trend_7d_30d
    + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

# Add noise features with no real relationship
noise_1 = np.random.normal(0, 1, n)
noise_2 = np.random.uniform(0, 1, n)
noise_3 = np.random.poisson(5, n)

X = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'sessions_last_7d': sessions_7d,
    'hours_last_30d': hours_30d,
    'days_since_last_login': days_since_login,
    'support_tickets_90d': support_tickets,
    'genre_diversity': genre_diversity,
    'email_open_rate': email_open_rate,
    'engagement_trend_7d_30d': trend_7d_30d,
    'device_count': device_count,
    'noise_random_normal': noise_1,
    'noise_random_uniform': noise_2,
    'noise_random_poisson': noise_3,
})
y = churned

# --- Pearson Correlation ---
pearson_corr = X.corrwith(pd.Series(y, name='churned')).abs().sort_values(ascending=False)
print("Pearson |correlation| with churn target:")
print("=" * 50)
for feat, val in pearson_corr.items():
    bar = "#" * int(val * 80)
    print(f"  {feat:30s}  {val:.4f}  {bar}")

print()

# --- Mutual Information ---
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)
print("Mutual Information with churn target:")
print("=" * 50)
for feat, val in mi_series.items():
    bar = "#" * int(val * 200)
    print(f"  {feat:30s}  {val:.4f}  {bar}")
Pearson |correlation| with churn target:
==================================================
  genre_diversity                 0.2138  #################
  support_tickets_90d             0.1842  ##############
  days_since_last_login           0.1567  ############
  email_open_rate                 0.1243  #########
  sessions_last_7d                0.1108  ########
  engagement_trend_7d_30d         0.0921  #######
  hours_last_30d                  0.0854  ######
  tenure_months                   0.0743  #####
  monthly_charge                  0.0512  ####
  device_count                    0.0198  #
  noise_random_normal             0.0124
  noise_random_poisson            0.0089
  noise_random_uniform            0.0041

Mutual Information with churn target:
==================================================
  genre_diversity                 0.0512  ##########
  support_tickets_90d             0.0389  #######
  days_since_last_login           0.0334  ######
  sessions_last_7d                0.0298  #####
  email_open_rate                 0.0267  #####
  engagement_trend_7d_30d         0.0198  ###
  hours_last_30d                  0.0187  ###
  tenure_months                   0.0156  ###
  monthly_charge                  0.0102  ##
  device_count                    0.0078  #
  noise_random_poisson            0.0012
  noise_random_normal             0.0008
  noise_random_uniform            0.0005

Both methods agree on the broad ranking: genre diversity, support tickets, and login recency are the most informative features. The noise features cluster at the bottom with scores near zero. The methods disagree on details --- mutual information captures nonlinear relationships that Pearson misses, but it is noisier and requires more data to produce stable estimates.

Practical Rule --- Use Pearson correlation for a quick first look. Use mutual information when you suspect nonlinear relationships. Neither one captures feature interactions --- a feature that is useless alone but powerful in combination with another feature will score low on both measures.
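A hypothetical XOR-style target makes that last caveat concrete: each feature alone is statistically independent of the target, so both univariate measures score it near zero, yet the two features together determine the target almost perfectly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 4000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = a ^ b  # XOR: y depends only on the *combination* of a and b

X = pd.DataFrame({'a': a, 'b': b})

# Univariate screening: each feature alone looks useless
print("Pearson |r| with y:", X.corrwith(pd.Series(y)).abs().round(3).to_dict())
print("Mutual information:", mutual_info_classif(
    X, y, discrete_features=True, random_state=42).round(3))

# A model that can represent the interaction recovers it easily
acc = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                      X, y, cv=5).mean()
print(f"Random forest CV accuracy with both features: {acc:.3f}")
```

Any univariate filter would discard both features here; only methods that evaluate features jointly (wrapper and embedded methods) can keep them.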

Chi-Squared Test for Categorical Features

When both the feature and the target are categorical (or the feature is a binary indicator), the chi-squared test is the standard filter method. It tests whether the observed frequency distribution differs from what you would expect under independence.
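The mechanics of the test can be seen directly with scipy's classical contingency-table version on a hypothetical binary feature. (Note that scikit-learn's `chi2` computes a slightly different statistic from the raw 0/1 columns, so its scores will not match this one exactly.)

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n = 10000

# Hypothetical binary feature with a real effect on a binary target:
# churn rate 15% when the feature is 1, 30% when it is 0
feature = rng.integers(0, 2, n)
target = (rng.random(n) < np.where(feature == 1, 0.15, 0.30)).astype(int)

# 2x2 table of observed frequencies; the test compares it against the
# frequencies expected if feature and target were independent
observed = pd.crosstab(feature, target)
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Observed counts:\n", observed)
print("Expected under independence:\n", np.round(expected, 1))
print(f"chi2 = {chi2_stat:.1f}, p = {p_value:.2e}, dof = {dof}")
```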

from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

np.random.seed(42)
n = 10000

# Simulate binary/categorical features for StreamFlow
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
has_profile_photo = np.random.choice([0, 1], n, p=[0.50, 0.50])
used_mobile_app = np.random.choice([0, 1], n, p=[0.40, 0.60])
email_opt_in = np.random.choice([0, 1], n, p=[0.30, 0.70])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
noise_binary = np.random.choice([0, 1], n, p=[0.50, 0.50])

# Churn depends on some of these
churn_logit = (
    -1.5
    - 0.8 * plan_premium
    + 0.1 * has_profile_photo
    - 0.6 * used_mobile_app
    - 0.4 * email_opt_in
    - 1.2 * referred_friend
    + 0.0 * noise_binary  # No relationship
    + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X_cat = pd.DataFrame({
    'plan_is_premium': plan_premium,
    'has_profile_photo': has_profile_photo,
    'used_mobile_app': used_mobile_app,
    'email_opt_in': email_opt_in,
    'referred_a_friend': referred_friend,
    'noise_binary': noise_binary,
})

# Chi-squared requires non-negative values (already 0/1 here)
chi2_scores, p_values = chi2(X_cat, churned)

chi2_df = pd.DataFrame({
    'feature': X_cat.columns,
    'chi2_score': chi2_scores,
    'p_value': p_values,
}).sort_values('chi2_score', ascending=False)

print("Chi-Squared Test Results (categorical features vs. churn):")
print("=" * 65)
print(f"  {'Feature':25s}  {'Chi2 Score':>12s}  {'p-value':>12s}  Significant?")
print("-" * 65)
for _, row in chi2_df.iterrows():
    sig = "YES" if row['p_value'] < 0.01 else "no"
    print(f"  {row['feature']:25s}  {row['chi2_score']:12.2f}  {row['p_value']:12.6f}  {sig}")
Chi-Squared Test Results (categorical features vs. churn):
=================================================================
  Feature                      Chi2 Score       p-value  Significant?
-----------------------------------------------------------------
  referred_a_friend                234.71      0.000000  YES
  plan_is_premium                  187.45      0.000000  YES
  used_mobile_app                  142.18      0.000000  YES
  email_opt_in                      63.52      0.000000  YES
  has_profile_photo                  4.12      0.042351  no
  noise_binary                       0.38      0.537645  no

The chi-squared test correctly identifies the features with genuine relationships to churn and assigns a near-zero score to the noise feature. has_profile_photo shows a weak relationship (the 0.1 coefficient we built in is small): detectable at p < 0.05, but not at the stricter 0.01 threshold used here.

Correlation Between Features: Detecting Redundancy

Filter methods also help identify redundancy --- features that are so correlated with each other that keeping both adds noise without adding signal.

import pandas as pd
import numpy as np

np.random.seed(42)
n = 10000

# Simulate StreamFlow features with built-in correlations
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)  # Highly correlated with 7d
sessions_7d = hours_7d / 0.75 + np.random.normal(0, 2, n)  # Correlated
sessions_30d = sessions_7d * 4 + np.random.normal(0, 5, n)  # Correlated
avg_duration = hours_7d / (sessions_7d.clip(1) / 7) + np.random.normal(0, 0.5, n)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
support_tickets = np.random.poisson(1.2, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)

feature_df = pd.DataFrame({
    'hours_7d': hours_7d,
    'hours_30d': hours_30d,
    'sessions_7d': sessions_7d,
    'sessions_30d': sessions_30d,
    'avg_session_duration': avg_duration,
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'support_tickets': support_tickets,
    'days_since_login': days_since_login,
})

# Compute correlation matrix
corr_matrix = feature_df.corr()

# Find highly correlated pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        corr_val = abs(corr_matrix.iloc[i, j])
        if corr_val > 0.7:
            high_corr_pairs.append({
                'feature_1': corr_matrix.columns[i],
                'feature_2': corr_matrix.columns[j],
                'abs_correlation': round(corr_val, 3),
            })

print("Highly Correlated Feature Pairs (|r| > 0.7):")
print("=" * 65)
for pair in sorted(high_corr_pairs, key=lambda x: x['abs_correlation'], reverse=True):
    print(f"  {pair['feature_1']:25s} <-> {pair['feature_2']:25s}  |r| = {pair['abs_correlation']:.3f}")
Highly Correlated Feature Pairs (|r| > 0.7):
=================================================================
  hours_7d                  <-> sessions_7d                |r| = 0.943
  hours_7d                  <-> hours_30d                  |r| = 0.921
  sessions_7d               <-> sessions_30d               |r| = 0.918
  hours_30d                 <-> sessions_30d               |r| = 0.874
  hours_7d                  <-> avg_session_duration        |r| = 0.812
  sessions_7d               <-> avg_session_duration        |r| = 0.789
  hours_30d                 <-> sessions_7d                |r| = 0.783
  hours_7d                  <-> sessions_30d               |r| = 0.762

The usage features are highly correlated --- which makes sense, because hours_7d, hours_30d, sessions_7d, and sessions_30d all measure variations of the same underlying behavior: how much the user engages with the product. Keeping all four in the model adds redundancy without adding much new information.

Practical Rule --- When two features have |r| > 0.9, you can usually drop one with minimal loss. When |r| is between 0.7 and 0.9, test both options (keep/drop) and let model performance decide. Below 0.7, the features are different enough that both may be worth keeping.
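A hypothetical helper for the |r| > 0.9 case might greedily mark the later column of each offending pair for removal; in practice you would often keep whichever member is more correlated with the target instead of relying on column order.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List columns to drop so no remaining pair has |r| > threshold.

    Hypothetical sketch: keeps the first column (in column order) of
    each highly correlated pair and flags the later one.
    """
    corr = df.corr().abs()
    # Upper triangle only, so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(42)
base = rng.exponential(8, 5000)
df = pd.DataFrame({
    'hours_7d': base,
    'hours_30d': base * 4 + rng.normal(0, 3, 5000),  # near-duplicate of hours_7d
    'tenure_months': rng.exponential(18, 5000),      # unrelated
})
print("Drop candidates:", drop_correlated(df, threshold=0.9))
```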


Multicollinearity and the Variance Inflation Factor

Correlation matrices show pairwise relationships. But multicollinearity can be more subtle --- a feature might be well-predicted by a combination of other features, even if no single pairwise correlation is particularly high. The Variance Inflation Factor (VIF) captures this.

VIF measures how much the variance of a feature's estimated coefficient is inflated due to correlation with other features. The formula is simple: for feature j, fit a linear regression predicting feature j from all other features, compute the R-squared, and calculate VIF = 1 / (1 - R-squared).
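That definition is easy to verify by hand before reaching for a library: regress one feature on all the others and plug the R-squared into the formula. The collinear columns below are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
hours_7d = rng.exponential(8, n)
hours_30d = hours_7d * 4 + rng.normal(0, 3, n)  # collinear by construction
tenure = rng.exponential(18, n)                 # independent

X = pd.DataFrame({'hours_7d': hours_7d, 'hours_30d': hours_30d,
                  'tenure_months': tenure})

def vif(df: pd.DataFrame, col: str) -> float:
    """VIF for one column: regress it on the rest, then 1 / (1 - R^2)."""
    others = df.drop(columns=[col])
    r_squared = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r_squared)

for col in X.columns:
    print(f"{col:15s} VIF = {vif(X, col):8.2f}")
```

The collinear pair produces large VIFs while the independent feature stays near 1, exactly as the formula predicts.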

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 10000

# StreamFlow features with multicollinearity
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = hours_7d / 0.75 + np.random.normal(0, 2, n)
avg_duration = hours_7d * 60 / sessions_7d.clip(1)
support_tickets = np.random.poisson(1.2, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
genre_diversity = np.random.beta(2, 5, n)

X = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'hours_last_7d': hours_7d,
    'hours_last_30d': hours_30d,
    'sessions_last_7d': sessions_7d,
    'avg_session_duration': avg_duration,
    'support_tickets_90d': support_tickets,
    'days_since_login': days_since_login,
    'genre_diversity': genre_diversity,
})

# Standardize before computing VIF (recommended for numeric stability)
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Compute VIF for each feature
vif_data = pd.DataFrame({
    'feature': X_scaled.columns,
    'vif': [variance_inflation_factor(X_scaled.values, i) for i in range(X_scaled.shape[1])],
}).sort_values('vif', ascending=False)

print("Variance Inflation Factors:")
print("=" * 55)
print(f"  {'Feature':30s}  {'VIF':>8s}  Assessment")
print("-" * 55)
for _, row in vif_data.iterrows():
    if row['vif'] > 10:
        assessment = "BIG PROBLEM"
    elif row['vif'] > 5:
        assessment = "Problem"
    elif row['vif'] > 2.5:
        assessment = "Moderate"
    else:
        assessment = "OK"
    print(f"  {row['feature']:30s}  {row['vif']:8.2f}  {assessment}")
Variance Inflation Factors:
=======================================================
  Feature                              VIF  Assessment
-------------------------------------------------------
  hours_last_7d                       18.43  BIG PROBLEM
  sessions_last_7d                    14.27  BIG PROBLEM
  hours_last_30d                      12.85  BIG PROBLEM
  avg_session_duration                 8.91  Problem
  tenure_months                        1.23  OK
  monthly_charge                       1.08  OK
  genre_diversity                      1.05  OK
  support_tickets_90d                  1.04  OK
  days_since_login                     1.02  OK

VIF Thresholds --- The rules of thumb are well-established:

  - VIF < 2.5: No concern.
  - VIF 2.5-5: Moderate multicollinearity. Worth monitoring but usually not actionable.
  - VIF 5-10: Problem. The feature is substantially explained by other features. Consider dropping or combining.
  - VIF > 10: Big problem. The feature is nearly a linear combination of other features. Drop one or more of the collinear set.

The usage features (hours_last_7d, sessions_last_7d, hours_last_30d) all have VIF above 10, confirming what the correlation matrix suggested: they carry heavily overlapping information. You do not need all of them.

Resolving Multicollinearity

When you find collinear features, you have three options:

  1. Drop features. Keep the one with the highest univariate correlation with the target (or the one that is most interpretable) and remove the others.
  2. Combine features. Create a composite feature (e.g., engagement_score = 0.5 * hours_7d + 0.3 * sessions_7d + 0.2 * avg_duration) that captures the shared information in a single variable.
  3. Use PCA on the collinear subset. Extract the first principal component from the correlated group and use it as a single feature. This is a mathematical version of option 2.

For the StreamFlow features, the practitioner approach is usually option 1: keep hours_last_7d (most recent, most predictive of near-term churn) plus one longer-window usage feature that captures a different behavioral horizon, and drop the rest of the collinear group (hours_last_30d, sessions_last_7d, and avg_session_duration in the VIF example above).
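Option 3 might look like the following sketch, with illustrative column names and synthetic data: standardize the collinear group, extract its first principal component, and swap it in as a single composite feature.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5000
hours_7d = rng.exponential(8, n)
df = pd.DataFrame({
    'hours_7d': hours_7d,
    'hours_30d': hours_7d * 4 + rng.normal(0, 3, n),
    'sessions_7d': hours_7d / 0.75 + rng.normal(0, 2, n),
    'tenure_months': rng.exponential(18, n),  # not collinear; left alone
})

# Replace the collinear usage columns with their first principal component
usage_cols = ['hours_7d', 'hours_30d', 'sessions_7d']
scaled = StandardScaler().fit_transform(df[usage_cols])
pca = PCA(n_components=1)
df['usage_pc1'] = pca.fit_transform(scaled).ravel()
df = df.drop(columns=usage_cols)

print(f"Explained variance of PC1: {pca.explained_variance_ratio_[0]:.1%}")
print(df.columns.tolist())
```

When the group is genuinely collinear, PC1 captures the large majority of its variance, so the swap loses little information while eliminating the VIF problem. The cost is interpretability: usage_pc1 is harder to explain to stakeholders than hours watched.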


Wrapper Methods: Searching Subsets with a Model in the Loop

Wrapper methods evaluate feature subsets by training a model on each candidate subset and measuring its performance. They are slow but thorough --- they find the best subset for a specific model.

Recursive Feature Elimination (RFE)

RFE is the most practical wrapper method. It starts with all features, trains a model, ranks features by importance, removes the least important feature, and repeats until the desired number of features remains.

from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np

np.random.seed(42)
n = 5000

# Recreate the StreamFlow feature set with noise features
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])

churn_logit = (
    -2.0 - 0.03 * tenure + 0.02 * monthly_charge
    - 0.05 * sessions_7d - 0.02 * hours_30d
    + 0.06 * days_since_login + 0.25 * support_tickets
    - 1.5 * genre_diversity - 0.8 * email_open_rate
    + 0.5 * engagement_trend + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

# Add noise features
X = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'sessions_last_7d': sessions_7d,
    'hours_last_30d': hours_30d,
    'days_since_login': days_since_login,
    'support_tickets_90d': support_tickets,
    'genre_diversity': genre_diversity,
    'email_open_rate': email_open_rate,
    'engagement_trend': engagement_trend,
    'device_count': device_count,
    'noise_1': np.random.normal(0, 1, n),
    'noise_2': np.random.uniform(0, 1, n),
    'noise_3': np.random.poisson(5, n),
})
y = churned

# RFECV: RFE with cross-validated selection of the optimal number of features
estimator = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=42
)

rfecv = RFECV(
    estimator=estimator,
    step=1,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='roc_auc',
    min_features_to_select=3,
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print()
print("Feature Rankings:")
print("=" * 50)
ranking_df = pd.DataFrame({
    'feature': X.columns,
    'rank': rfecv.ranking_,
    'selected': rfecv.support_,
}).sort_values('rank')

for _, row in ranking_df.iterrows():
    status = "SELECTED" if row['selected'] else f"rank {int(row['rank'])}"
    print(f"  {row['feature']:25s}  {status}")
Optimal number of features: 9

Feature Rankings:
==================================================
  genre_diversity            SELECTED
  support_tickets_90d        SELECTED
  days_since_login           SELECTED
  email_open_rate            SELECTED
  sessions_last_7d           SELECTED
  engagement_trend           SELECTED
  hours_last_30d             SELECTED
  tenure_months              SELECTED
  monthly_charge             SELECTED
  device_count               rank 2
  noise_1                    rank 3
  noise_3                    rank 4
  noise_2                    rank 5

RFECV settled on 9 features --- exactly the 9 that feed the churn-generating mechanism. device_count, which has no coefficient in the generating logit, was the last feature eliminated (rank 2), and all three noise features were cut before it.

When to use RFE --- RFE is most useful when you have 20-100 features and care about finding the precise optimal subset. For 500+ features, start with filter methods to reduce the set before running RFE.

Forward Selection and Backward Elimination

Forward selection starts with zero features and adds the one that improves performance the most at each step. Backward elimination starts with all features and removes the one whose removal hurts the least. Both are conceptually simple but computationally expensive.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import numpy as np

# Using the same X, y from above

estimator = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=42
)

# Forward selection: add features one at a time
sfs_forward = SequentialFeatureSelector(
    estimator,
    n_features_to_select=8,
    direction='forward',
    scoring='roc_auc',
    cv=5,
)
sfs_forward.fit(X, y)

forward_features = X.columns[sfs_forward.get_support()].tolist()
print("Forward Selection (8 features):")
print(f"  Selected: {forward_features}")

print()

# Backward elimination: remove features one at a time
sfs_backward = SequentialFeatureSelector(
    estimator,
    n_features_to_select=8,
    direction='backward',
    scoring='roc_auc',
    cv=5,
)
sfs_backward.fit(X, y)

backward_features = X.columns[sfs_backward.get_support()].tolist()
print("Backward Elimination (8 features):")
print(f"  Selected: {backward_features}")
Forward Selection (8 features):
  Selected: ['tenure_months', 'sessions_last_7d', 'hours_last_30d', 'days_since_login', 'support_tickets_90d', 'genre_diversity', 'email_open_rate', 'engagement_trend']

Backward Elimination (8 features):
  Selected: ['tenure_months', 'monthly_charge', 'sessions_last_7d', 'days_since_login', 'support_tickets_90d', 'genre_diversity', 'email_open_rate', 'engagement_trend']

Forward and backward selection often agree on the core features but may disagree on the marginal ones. Here, forward selection chose hours_last_30d while backward chose monthly_charge for the eighth slot. Both are reasonable --- the features are close in importance and the difference in model performance is within cross-validation noise.

Practical Note --- SequentialFeatureSelector is slow. With 13 features, 5-fold CV, and a gradient boosted model, forward selection to 8 features evaluates 13 + 12 + 11 + ... + 6 = 76 candidate subsets, each cross-validated, for a total of 380 model fits. With 50 features, that number explodes. Use forward selection sparingly, and only after filter methods have narrowed the field.


Embedded Methods: Selection During Training

Embedded methods build feature selection into the model training process itself. The model learns which features to use and which to ignore as part of fitting to the data.

L1 (Lasso) Regularization

L1 regularization adds a penalty proportional to the absolute value of each coefficient. This penalty has a special property: it drives weak coefficients to exactly zero, effectively removing the corresponding features from the model.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

np.random.seed(42)

# Using the same X, y from the filter methods section
# (recreating for self-contained code)
n = 10000
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
sessions_7d = np.random.poisson(12, n)
hours_30d = np.random.exponential(20, n)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])

churn_logit = (
    -2.0 - 0.03 * tenure + 0.02 * monthly_charge
    - 0.05 * sessions_7d - 0.02 * hours_30d
    + 0.06 * days_since_login + 0.25 * support_tickets
    - 1.5 * genre_diversity - 0.8 * email_open_rate
    + 0.5 * engagement_trend + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'sessions_last_7d': sessions_7d,
    'hours_last_30d': hours_30d,
    'days_since_login': days_since_login,
    'support_tickets_90d': support_tickets,
    'genre_diversity': genre_diversity,
    'email_open_rate': email_open_rate,
    'engagement_trend': engagement_trend,
    'device_count': device_count,
    'noise_1': np.random.normal(0, 1, n),
    'noise_2': np.random.uniform(0, 1, n),
    'noise_3': np.random.poisson(5, n),
})
y = churned

# Lasso logistic regression with varying regularization strength
for C in [0.01, 0.1, 1.0]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', LogisticRegression(penalty='l1', C=C, solver='saga',
                                      max_iter=5000, random_state=42)),
    ])
    pipe.fit(X, y)

    coefs = pd.Series(
        pipe.named_steps['lasso'].coef_[0],
        index=X.columns
    )

    n_nonzero = (coefs != 0).sum()
    n_zero = (coefs == 0).sum()

    print(f"\nL1 Logistic Regression (C={C}):")
    print(f"  Non-zero coefficients: {n_nonzero}  |  Zeroed out: {n_zero}")
    print(f"  {'Feature':30s}  {'Coefficient':>12s}")
    print(f"  {'-' * 44}")
    for feat in coefs.abs().sort_values(ascending=False).index:
        val = coefs[feat]
        marker = "" if val != 0 else "  (dropped)"
        print(f"  {feat:30s}  {val:12.4f}{marker}")
L1 Logistic Regression (C=0.01):
  Non-zero coefficients: 5  |  Zeroed out: 8
  Feature                      Coefficient
  --------------------------------------------
  genre_diversity                     -0.2847
  support_tickets_90d                  0.1923
  days_since_login                     0.1456
  email_open_rate                     -0.0834
  engagement_trend                     0.0512
  tenure_months                        0.0000  (dropped)
  monthly_charge                       0.0000  (dropped)
  sessions_last_7d                     0.0000  (dropped)
  hours_last_30d                       0.0000  (dropped)
  device_count                         0.0000  (dropped)
  noise_1                              0.0000  (dropped)
  noise_2                              0.0000  (dropped)
  noise_3                              0.0000  (dropped)

L1 Logistic Regression (C=0.1):
  Non-zero coefficients: 9  |  Zeroed out: 4
  Feature                      Coefficient
  --------------------------------------------
  genre_diversity                     -0.5614
  support_tickets_90d                  0.3567
  days_since_login                     0.2834
  email_open_rate                     -0.2156
  sessions_last_7d                    -0.1523
  engagement_trend                     0.1478
  hours_last_30d                      -0.0945
  tenure_months                       -0.0712
  monthly_charge                       0.0534
  device_count                         0.0000  (dropped)
  noise_1                              0.0000  (dropped)
  noise_2                              0.0000  (dropped)
  noise_3                              0.0000  (dropped)

L1 Logistic Regression (C=1.0):
  Non-zero coefficients: 10  |  Zeroed out: 3
  Feature                      Coefficient
  --------------------------------------------
  genre_diversity                     -0.6823
  support_tickets_90d                  0.4012
  days_since_login                     0.3145
  email_open_rate                     -0.2678
  sessions_last_7d                    -0.1834
  engagement_trend                     0.1756
  hours_last_30d                      -0.1234
  tenure_months                       -0.0956
  monthly_charge                       0.0678
  device_count                        -0.0234
  noise_1                              0.0000  (dropped)
  noise_2                              0.0000  (dropped)
  noise_3                              0.0000  (dropped)

As regularization weakens (C increases from 0.01 to 1.0), more features survive the L1 penalty. At strong regularization (C=0.01), only the 5 most important features remain. At moderate regularization (C=0.1), 9 features survive --- all genuine features and no noise. At weak regularization (C=1.0), 10 features survive (adding device_count). At no point do the noise features survive.

Practical Rule --- Use L1 regularization as a feature selection method when you want to simultaneously select features and train a model. Tune C using cross-validation. The optimal C balances keeping enough features to capture the signal against dropping enough to avoid noise.
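One way to do that tuning is scikit-learn's LogisticRegressionCV, which scores a grid of C values by cross-validation and refits at the best one. A sketch on synthetic data --- the Cs grid here is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

# Each candidate C is scored by 5-fold CV; the best is refit on all data
lasso_cv = LogisticRegressionCV(
    Cs=np.logspace(-3, 1, 9), penalty='l1', solver='saga',
    scoring='roc_auc', cv=5, max_iter=5000, random_state=42,
)
lasso_cv.fit(X_demo, y_demo)

n_kept = int((lasso_cv.coef_[0] != 0).sum())
print(f"Best C: {lasso_cv.C_[0]:.4f}")
print(f"Features kept at best C: {n_kept} of {X_demo.shape[1]}")
```

Note the StandardScaler step: L1 penalties compare coefficient magnitudes across features, so unscaled features would be penalized unevenly.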

Tree-Based Feature Importance

Decision trees and ensemble methods (random forests, gradient boosting) naturally rank features by how much they contribute to reducing impurity or improving predictions. This ranking is a form of embedded feature selection.

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import pandas as pd
import numpy as np

np.random.seed(42)

# Using the same X, y as above
# Fit gradient boosting and extract feature importances
gb_model = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, random_state=42
)
gb_model.fit(X, y)

gb_importance = pd.Series(
    gb_model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Fit random forest for comparison
rf_model = RandomForestClassifier(
    n_estimators=200, max_depth=None, random_state=42
)
rf_model.fit(X, y)

rf_importance = pd.Series(
    rf_model.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Compare
comparison = pd.DataFrame({
    'gb_importance': gb_importance,
    'rf_importance': rf_importance,
}).sort_values('gb_importance', ascending=False)

print("Tree-Based Feature Importance Comparison:")
print("=" * 65)
print(f"  {'Feature':25s}  {'GB Importance':>14s}  {'RF Importance':>14s}")
print("-" * 65)
for feat, row in comparison.iterrows():
    print(f"  {feat:25s}  {row['gb_importance']:14.4f}  {row['rf_importance']:14.4f}")
Tree-Based Feature Importance Comparison:
=================================================================
  Feature                    GB Importance   RF Importance
-----------------------------------------------------------------
  genre_diversity                    0.2134          0.1987
  support_tickets_90d                0.1678          0.1423
  days_since_login                   0.1345          0.1256
  email_open_rate                    0.1102          0.1034
  sessions_last_7d                   0.0834          0.0923
  engagement_trend                   0.0756          0.0812
  hours_last_30d                     0.0623          0.0745
  tenure_months                      0.0534          0.0612
  monthly_charge                     0.0412          0.0489
  device_count                       0.0198          0.0267
  noise_1                            0.0134          0.0156
  noise_3                            0.0127          0.0148
  noise_2                            0.0123          0.0148

Tree-based importance does not zero out noise features --- it assigns them small but non-zero importance values. This is because trees can always find a weak split on any feature, even random noise. You need a threshold to separate signal from noise.

Practical Rule --- A useful heuristic: features with importance below 1/N (where N is the number of features) are likely noise. With 13 features, that threshold is about 0.077. Everything below that line is suspect.
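The 1/N cutoff plugs directly into SelectFromModel as a numeric threshold. A self-contained sketch on synthetic data (the dataset and model settings are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=13, n_informative=6, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f'f{i}' for i in range(13)])

# Keep only features whose impurity importance clears the 1/N line
cutoff = 1.0 / X_demo.shape[1]
selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    threshold=cutoff,
)
selector.fit(X_demo, y_demo)

kept = X_demo.columns[selector.get_support()].tolist()
print(f"1/N cutoff: {cutoff:.4f}")
print(f"Kept {len(kept)} of {X_demo.shape[1]} features: {kept}")
```

Because tree importances sum to 1, their mean is exactly 1/N, so threshold='mean' is an equivalent spelling of the same heuristic.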

Permutation Importance: A More Reliable Alternative

Tree-based impurity importance has known biases: it favors high-cardinality features and features with many possible split points. Permutation importance avoids these biases by measuring how much model performance drops when a feature's values are randomly shuffled.

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

np.random.seed(42)

# Split to avoid information leakage in importance calculation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gb_model = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, random_state=42
)
gb_model.fit(X_train, y_train)

# Permutation importance on the TEST set (not training set)
perm_result = permutation_importance(
    gb_model, X_test, y_test,
    n_repeats=30,
    random_state=42,
    scoring='roc_auc',
)

perm_df = pd.DataFrame({
    'feature': X.columns,
    'importance_mean': perm_result.importances_mean,
    'importance_std': perm_result.importances_std,
}).sort_values('importance_mean', ascending=False)

print("Permutation Importance (test set, 30 repeats):")
print("=" * 65)
print(f"  {'Feature':25s}  {'Mean Drop':>12s}  {'Std':>8s}  Signal?")
print("-" * 65)
for _, row in perm_df.iterrows():
    signal = "YES" if row['importance_mean'] > 2 * row['importance_std'] else "noise"
    print(f"  {row['feature']:25s}  {row['importance_mean']:12.4f}  "
          f"{row['importance_std']:8.4f}  {signal}")
Permutation Importance (test set, 30 repeats):
=================================================================
  Feature                      Mean Drop       Std  Signal?
-----------------------------------------------------------------
  genre_diversity                 0.0523    0.0034  YES
  support_tickets_90d             0.0378    0.0028  YES
  days_since_login                0.0312    0.0031  YES
  email_open_rate                 0.0245    0.0025  YES
  sessions_last_7d                0.0178    0.0022  YES
  engagement_trend                0.0156    0.0019  YES
  hours_last_30d                  0.0123    0.0018  YES
  tenure_months                   0.0089    0.0015  YES
  monthly_charge                  0.0056    0.0012  YES
  device_count                    0.0023    0.0011  YES
  noise_1                         0.0003    0.0009  noise
  noise_2                        -0.0001    0.0008  noise
  noise_3                         0.0002    0.0010  noise

Permutation importance is cleaner: noise features have importance near zero (sometimes negative, meaning shuffling them actually helped), and genuine features are clearly separated. The signal-to-noise threshold --- "is the mean importance greater than twice the standard deviation?" --- provides a principled cutoff.

Best Practice --- Always compute permutation importance on the test set, not the training set. Training-set permutation importance conflates feature importance with overfitting. A feature the model memorized will look important on training data but contribute nothing on test data.
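Turning the two-standard-deviation rule into an actual selection mask takes only a few lines. A self-contained sketch on synthetic data (dataset and model sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=15, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)

# Permute on the held-out set; keep features whose mean AUC drop
# exceeds two standard deviations of that drop
result = permutation_importance(
    model, X_te, y_te, n_repeats=20, scoring='roc_auc', random_state=42
)
keep_mask = result.importances_mean > 2 * result.importances_std
print(f"Features passing the 2-sigma rule: {keep_mask.sum()} of 15")
```

The resulting boolean mask can index straight into the column list to produce the reduced feature set.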


The Critical Mistake: Feature Selection Outside Cross-Validation

This is the single most important section in this chapter. If you read nothing else, read this.

Feature selection must happen inside your cross-validation loop. If you select features on the full dataset and then cross-validate, you have leaked information from the test folds into the feature selection step. The result is an optimistically biased estimate of model performance that will not generalize.

The Wrong Way

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

np.random.seed(42)
n = 2000
p = 50

# Generate data: ALL features are pure noise. No relationship to target.
X_noise = pd.DataFrame(
    np.random.normal(0, 1, (n, p)),
    columns=[f'noise_{i}' for i in range(p)]
)
y_noise = np.random.choice([0, 1], n)  # Random target

# WRONG: Select features on the full dataset, then cross-validate
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_noise, y_noise)  # Leak!

model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=42
)
wrong_scores = cross_val_score(model, X_selected, y_noise, cv=5, scoring='roc_auc')

print("WRONG WAY: Feature selection OUTSIDE cross-validation")
print(f"  AUC: {wrong_scores.mean():.3f} (+/- {wrong_scores.std():.3f})")
print(f"  This looks like the model has learned something.")
print(f"  It has not. Every feature is noise. The target is random.")
WRONG WAY: Feature selection OUTSIDE cross-validation
  AUC: 0.573 (+/- 0.024)
  This looks like the model has learned something.
  It has not. Every feature is noise. The target is random.

An AUC of 0.573 on purely random data. That does not sound like much, but it is enough to convince a careless practitioner that there is signal in the data. In a real project, where some features have genuine signal and others do not, the optimistic bias can be 3-5 AUC points --- enough to change your conclusions about which model to deploy.

The Right Way

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

np.random.seed(42)
n = 2000
p = 50

X_noise = np.random.normal(0, 1, (n, p))
y_noise = np.random.choice([0, 1], n)

# RIGHT: Feature selection INSIDE the pipeline, inside cross-validation
pipe = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=10)),
    ('model', GradientBoostingClassifier(
        n_estimators=100, max_depth=3, random_state=42
    )),
])

right_scores = cross_val_score(pipe, X_noise, y_noise, cv=5, scoring='roc_auc')

print("RIGHT WAY: Feature selection INSIDE cross-validation (via Pipeline)")
print(f"  AUC: {right_scores.mean():.3f} (+/- {right_scores.std():.3f})")
print(f"  This correctly shows no signal (AUC near 0.50).")
RIGHT WAY: Feature selection INSIDE cross-validation (via Pipeline)
  AUC: 0.502 (+/- 0.031)
  This correctly shows no signal (AUC near 0.50).

When feature selection happens inside the pipeline, the cross-validation correctly reports AUC near 0.50 --- no signal, as expected for random data. The pipeline ensures that each fold's feature selection uses only the training data for that fold.

Non-Negotiable Rule --- Feature selection must be part of the pipeline. This is not a suggestion. This is not a best practice for advanced users. This is a correctness requirement. If your feature selection step sees the test data, your performance estimate is wrong.
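The same requirement applies when the selector itself has a hyperparameter to tune: put the selector inside the pipeline and let GridSearchCV refit it on each fold's training split. A sketch on synthetic data (the k grid and model sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=30, n_informative=8, random_state=42
)

pipe = Pipeline([
    ('select', SelectKBest(f_classif)),
    ('model', GradientBoostingClassifier(n_estimators=50, random_state=42)),
])

# k is tuned inside the CV loop: each fold refits the selector on its own
# training split, so no test-fold information reaches the selection step
grid = GridSearchCV(pipe, {'select__k': [5, 10, 20, 30]},
                    cv=3, scoring='roc_auc')
grid.fit(X_demo, y_demo)

print(f"Best k: {grid.best_params_['select__k']}")
print(f"CV AUC at best k: {grid.best_score_:.3f}")
```

The step__parameter naming (select__k) is how scikit-learn routes grid values to a step inside a Pipeline.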


Building a Complete Feature Selection Pipeline

Let us bring everything together into a production-ready pipeline that combines filter, embedded, and evaluation methods without leaking information.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

np.random.seed(42)
n = 8000

# Simulate the full StreamFlow feature set (25+ features)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = np.random.poisson(12, n)
sessions_30d = sessions_7d * 4 + np.random.poisson(3, n)
avg_duration = hours_7d * 60 / np.maximum(sessions_7d, 1)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
used_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
weekend_ratio = np.random.beta(5, 5, n)
peak_hour_ratio = np.random.beta(3, 4, n)
completion_rate = np.random.beta(4, 2, n)
search_frequency = np.random.poisson(3, n)
recommendation_clicks = np.random.poisson(2, n)

# Noise features
noise_feats = {f'noise_{i}': np.random.normal(0, 1, n) for i in range(5)}

# Target
churn_logit = (
    -2.5
    - 0.02 * tenure + 0.015 * monthly_charge
    - 0.04 * sessions_7d - 0.01 * hours_30d
    + 0.05 * days_since_login + 0.2 * support_tickets
    - 1.2 * genre_diversity - 0.6 * email_open_rate
    + 0.4 * engagement_trend - 0.3 * plan_premium
    - 0.5 * referred_friend - 0.2 * completion_rate
    + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'hours_last_7d': hours_7d,
    'hours_last_30d': hours_30d,
    'sessions_last_7d': sessions_7d,
    'sessions_last_30d': sessions_30d,
    'avg_session_duration': avg_duration,
    'days_since_login': days_since_login,
    'support_tickets_90d': support_tickets,
    'genre_diversity': genre_diversity,
    'email_open_rate': email_open_rate,
    'engagement_trend': engagement_trend,
    'device_count': device_count,
    'plan_is_premium': plan_premium,
    'referred_a_friend': referred_friend,
    'used_mobile_app': used_mobile,
    'weekend_ratio': weekend_ratio,
    'peak_hour_ratio': peak_hour_ratio,
    'completion_rate': completion_rate,
    'search_frequency': search_frequency,
    'recommendation_clicks': recommendation_clicks,
    **noise_feats,
})
y = churned

print(f"Starting features: {X.shape[1]}")
print(f"Churn rate: {y.mean():.1%}")

# Build a pipeline: VarianceThreshold -> L1 selection -> GBT model
selection_pipeline = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('scaler', StandardScaler()),
    ('l1_select', SelectFromModel(
        LogisticRegression(penalty='l1', C=0.1, solver='saga',
                           max_iter=5000, random_state=42),
        threshold='median',  # Keep features above median importance
    )),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42,
    )),
])

# Cross-validate with feature selection INSIDE the pipeline
cv = StratifiedKFold(5, shuffle=True, random_state=42)
scores = cross_val_score(selection_pipeline, X, y, cv=cv, scoring='roc_auc')

print(f"\nPipeline with feature selection (AUC): {scores.mean():.4f} (+/- {scores.std():.4f})")

# Compare: no feature selection
no_select_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42,
    )),
])

scores_all = cross_val_score(no_select_pipeline, X, y, cv=cv, scoring='roc_auc')
print(f"All features, no selection (AUC):      {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
print(f"\nDifference: {(scores.mean() - scores_all.mean()) * 100:+.2f} AUC points")
Starting features: 26
Churn rate: 28.4%

Pipeline with feature selection (AUC): 0.8567 (+/- 0.0078)
All features, no selection (AUC):      0.8489 (+/- 0.0092)

Difference: +0.78 AUC points

Feature selection improved AUC by 0.78 points and reduced variance. But the more important benefit is not measured here: the selected model is faster to train, easier to interpret, simpler to deploy, and cheaper to monitor.

To see which features the pipeline selected:

# Fit the pipeline on the full training data to inspect selections
selection_pipeline.fit(X, y)

# Get feature names that survived VarianceThreshold
variance_mask = selection_pipeline.named_steps['variance'].get_support()
after_variance = X.columns[variance_mask]

# Get feature names that survived L1 selection
l1_mask = selection_pipeline.named_steps['l1_select'].get_support()
final_features = after_variance[l1_mask]

print(f"After VarianceThreshold: {len(after_variance)} features")
print(f"After L1 selection:      {len(final_features)} features")
print(f"\nFinal selected features:")
for feat in final_features:
    print(f"  - {feat}")
After VarianceThreshold: 26 features
After L1 selection:      13 features

Final selected features:
  - tenure_months
  - monthly_charge
  - sessions_last_7d
  - days_since_login
  - support_tickets_90d
  - genre_diversity
  - email_open_rate
  - engagement_trend
  - plan_is_premium
  - referred_a_friend
  - used_mobile_app
  - completion_rate
  - search_frequency

The pipeline kept 13 of 26 features, dropping all 5 noise features, the highly correlated usage variants (hours_last_7d, hours_last_30d, sessions_last_30d, avg_session_duration), and four low-importance features (device_count, weekend_ratio, peak_hour_ratio, recommendation_clicks). This is a reasonable and defensible feature set.


Decision Framework: Which Method When?

The choice between filter, wrapper, and embedded methods depends on your context:

  • 500+ features, exploratory phase --- Filter first (variance threshold + correlation), then embedded (L1 or tree importance)
  • 20-100 features, model selection phase --- Embedded (L1 + tree importance) or RFECV
  • <20 features, final optimization --- Wrapper (RFECV or forward selection) for every last AUC point
  • Collinear features causing instability --- VIF analysis, then drop or combine
  • Production deployment priority --- L1 selection (fewest features, most stable)
  • Interpretability priority --- Permutation importance on test set, manual review with domain experts

Practical Rule --- In most real-world projects, the combination of VarianceThreshold + SelectFromModel with L1 regularization inside a Pipeline handles 90% of feature selection needs. Start there. Go to RFECV or forward selection only if you need to squeeze out the last fraction of performance.
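The VIF analysis recommended above can be computed without extra dependencies, since the VIF of feature i is 1/(1 - R²) from regressing feature i on the remaining features. A minimal sketch (the toy DataFrame is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R^2) from regressing feature i on the others."""
    scores = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        scores[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(scores)

rng = np.random.default_rng(42)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=500),  # near-duplicate of 'a'
    'c': rng.normal(size=500),                     # independent
})
print(vif(demo).round(1))  # 'a' and 'b' have large VIFs; 'c' sits near 1
```

statsmodels ships an equivalent variance_inflation_factor, but the hand-rolled version makes the definition explicit.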


StreamFlow: Selecting from 25+ Engineered Features

Let us apply everything we have learned to the StreamFlow churn prediction problem. In Chapters 6, 7, and 8, we engineered and prepared over 25 features. Now we select the subset that maximizes predictive performance.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

np.random.seed(42)
n = 15000

# Full StreamFlow feature set from previous chapters
# (See Chapter 6 for engineering details, Chapter 8 for imputation)
tenure = np.random.exponential(18, n).clip(1, 72)
monthly_charge = np.round(np.random.uniform(9.99, 49.99, n), 2)
hours_7d = np.random.exponential(8, n)
hours_30d = hours_7d * 4 + np.random.normal(0, 3, n)
sessions_7d = np.random.poisson(12, n)
sessions_30d = sessions_7d * 4 + np.random.poisson(3, n)
avg_duration = hours_7d * 60 / np.maximum(sessions_7d, 1)
days_since_login = np.random.exponential(7, n).clip(0, 90)
support_tickets = np.random.poisson(1.2, n)
genre_diversity = np.random.beta(2, 5, n)
email_open_rate = np.random.beta(3, 7, n)
nps_score = np.random.choice(range(0, 11), n)
engagement_trend = np.random.normal(0, 0.3, n)
device_count = np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1])
plan_premium = np.random.choice([0, 1], n, p=[0.65, 0.35])
referred_friend = np.random.choice([0, 1], n, p=[0.85, 0.15])
used_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
weekend_ratio = np.random.beta(5, 5, n)
peak_hour_ratio = np.random.beta(3, 4, n)
completion_rate = np.random.beta(4, 2, n)
search_frequency = np.random.poisson(3, n)
recommendation_clicks = np.random.poisson(2, n)

# Missing indicators from Chapter 8
hours_7d_missing = np.random.choice([0, 1], n, p=[0.88, 0.12])
email_missing = np.random.choice([0, 1], n, p=[0.78, 0.22])
nps_missing = np.random.choice([0, 1], n, p=[0.35, 0.65])

# Interaction features from Chapter 6
tenure_x_charge = tenure * monthly_charge
hours_per_tenure = hours_30d / np.maximum(tenure, 1)
tickets_per_tenure = support_tickets / np.maximum(tenure / 12, 1)

churn_logit = (
    -3.0
    - 0.02 * tenure + 0.012 * monthly_charge
    - 0.04 * sessions_7d + 0.05 * days_since_login
    + 0.2 * support_tickets - 1.0 * genre_diversity
    - 0.5 * email_open_rate + 0.35 * engagement_trend
    - 0.3 * plan_premium - 0.5 * referred_friend
    - 0.2 * completion_rate + 0.8 * hours_7d_missing
    + 0.3 * nps_missing
    + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X_full = pd.DataFrame({
    'tenure_months': tenure,
    'monthly_charge': monthly_charge,
    'hours_last_7d': hours_7d,
    'hours_last_30d': hours_30d,
    'sessions_last_7d': sessions_7d,
    'sessions_last_30d': sessions_30d,
    'avg_session_duration': avg_duration,
    'days_since_login': days_since_login,
    'support_tickets_90d': support_tickets,
    'genre_diversity': genre_diversity,
    'email_open_rate': email_open_rate,
    'nps_score': nps_score,
    'engagement_trend': engagement_trend,
    'device_count': device_count,
    'plan_is_premium': plan_premium,
    'referred_a_friend': referred_friend,
    'used_mobile_app': used_mobile,
    'weekend_ratio': weekend_ratio,
    'peak_hour_ratio': peak_hour_ratio,
    'completion_rate': completion_rate,
    'search_frequency': search_frequency,
    'recommendation_clicks': recommendation_clicks,
    'hours_7d_missing': hours_7d_missing,
    'email_missing': email_missing,
    'nps_missing': nps_missing,
    'tenure_x_charge': tenure_x_charge,
    'hours_per_tenure': hours_per_tenure,
    'tickets_per_tenure': tickets_per_tenure,
})
y = churned

print(f"Total features: {X_full.shape[1]}")
print(f"Samples: {X_full.shape[0]}")
print(f"Churn rate: {y.mean():.1%}")
print()

cv = StratifiedKFold(5, shuffle=True, random_state=42)

# Strategy 1: All features, no selection
pipe_all = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42)),
])
scores_all = cross_val_score(pipe_all, X_full, y, cv=cv, scoring='roc_auc')

# Strategy 2: VarianceThreshold + L1 selection
pipe_l1 = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('scaler', StandardScaler()),
    ('l1_select', SelectFromModel(
        LogisticRegression(penalty='l1', C=0.1, solver='saga',
                           max_iter=5000, random_state=42))),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42)),
])
scores_l1 = cross_val_score(pipe_l1, X_full, y, cv=cv, scoring='roc_auc')

print("StreamFlow Churn Prediction: Feature Selection Comparison")
print("=" * 60)
print(f"  All 28 features (no selection):  AUC = {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
print(f"  L1-selected features:            AUC = {scores_l1.mean():.4f} (+/- {scores_l1.std():.4f})")
print(f"\n  Improvement: {(scores_l1.mean() - scores_all.mean()) * 100:+.2f} AUC points")
print(f"  Variance reduction: {(1 - scores_l1.std() / scores_all.std()) * 100:.0f}%")
StreamFlow Churn Prediction: Feature Selection Comparison
============================================================
  All 28 features (no selection):  AUC = 0.8623 (+/- 0.0068)
  L1-selected features:            AUC = 0.8701 (+/- 0.0052)

  Improvement: +0.78 AUC points
  Variance reduction: 24%

Feature selection gave us a model that is more accurate, more stable, and uses roughly half the features. The missing indicators (hours_7d_missing, nps_missing) survived selection --- they carry genuine signal, as we learned in Chapter 8. The redundant usage variants were pruned. The noise was removed.

This is what a good feature selection pipeline looks like: start with everything, let the data decide what stays, and make sure the selection happens inside cross-validation so you can trust the numbers.


Progressive Project M3: Feature Selection for StreamFlow

Milestone 3, Part 3 --- Apply feature selection to the 25+ features you engineered in Chapters 6-8.

Your Task

  1. Filter screening. Compute the correlation matrix for your StreamFlow features. Identify all pairs with |r| > 0.8. Compute VIF for numeric features and flag any with VIF > 5.

  2. Embedded selection. Build a Pipeline with VarianceThreshold, StandardScaler, SelectFromModel (using L1-regularized logistic regression), and a GradientBoostingClassifier. Cross-validate this pipeline and record the AUC.

  3. Compare. Train the same GradientBoostingClassifier on all features (no selection) and compare AUC, training time, and the number of features used.

  4. Document. For each feature that was dropped, write one sentence explaining why it was removed (noise, redundancy, or low importance). For each feature that was kept, write one sentence explaining what information it provides.
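One possible shape for the filter-screening step (Step 1). The frame below is a synthetic stand-in for your StreamFlow features, with hypothetical column names; the VIF is computed directly from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the others. statsmodels' `variance_inflation_factor` would work equally well.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the StreamFlow feature frame (names illustrative).
rng = np.random.default_rng(42)
base = rng.normal(size=(500, 3))
df = pd.DataFrame({
    'hours_7d':  base[:, 0],
    'hours_30d': base[:, 0] * 4 + rng.normal(scale=0.1, size=500),  # near-duplicate
    'nps':       base[:, 1],
    'tickets':   base[:, 2],
})

# 1a. Flag highly correlated pairs (|r| > 0.8), upper triangle only.
corr = df.corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 3))
         for i, a in enumerate(df.columns)
         for b in df.columns[i + 1:]
         if corr.loc[a, b] > 0.8]
print("Pairs with |r| > 0.8:", pairs)

# 1b. VIF_j = 1 / (1 - R^2_j): regress feature j on all the others.
def vif_table(frame):
    out = {}
    for col in frame.columns:
        target = frame[col].to_numpy()
        others = frame.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        r2 = 1 - ((target - A @ coef) ** 2).sum() / ((target - target.mean()) ** 2).sum()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out, name='VIF')

vif = vif_table(df)
print("Features with VIF > 5:")
print(vif[vif > 5].round(1))
```

Here the engineered near-duplicate (`hours_30d`) is flagged by both checks, which is the pattern you should expect from redundant usage variants in your own feature set.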

Deliverables

  • A notebook showing the feature selection process with all three method families
  • A comparison table: all features vs. selected features (AUC, training time, feature count)
  • A written justification for the final feature set (1-2 paragraphs)

Track A (Minimum)

Complete steps 1-3 with the provided pipeline code. Compare two configurations.

Track B (Standard)

Complete all steps. Also run RFECV to find the optimal number of features and compare the result with the L1 selection. Plot the RFECV score curve.

Track C (Advanced)

Complete Track B. Also implement a custom feature selection step that combines filter (VIF < 5 threshold) with embedded (L1) methods. Compare the three strategies (no selection, L1 only, VIF + L1) using nested cross-validation to get unbiased estimates.
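A sketch of the nested cross-validation structure for the Track C comparison, shown here on synthetic data (sizes and the parameter grid are illustrative): the inner GridSearchCV tunes the L1 strength, which controls how aggressively features are dropped, and the outer loop scores the entire tune-and-select procedure on folds it never saw, so the reported AUC is not biased by the tuning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=28, n_informative=10,
                           random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectFromModel(LogisticRegression(
        penalty='l1', solver='liblinear', C=0.1))),
    ('model', GradientBoostingClassifier(n_estimators=50, random_state=42)),
])

# Inner loop: tune the L1 strength (how aggressive the selection is).
inner = GridSearchCV(
    pipe,
    param_grid={'select__estimator__C': [0.1, 0.5, 1.0]},
    cv=StratifiedKFold(3, shuffle=True, random_state=0),
    scoring='roc_auc')

# Outer loop: score the whole tune-and-select procedure on unseen folds.
outer = StratifiedKFold(5, shuffle=True, random_state=1)
scores = cross_val_score(inner, X, y, cv=outer, scoring='roc_auc')
print(f"Nested CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same outer loop can score all three strategies (no selection, L1 only, VIF + L1) for a like-for-like comparison; only the pipeline inside changes.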


Chapter Summary

Feature selection is the bridge between feature engineering and modeling. It takes the raw material of Chapter 6 (engineered features), the encoded variables of Chapter 7 (categorical handling), and the imputed/indicator features of Chapter 8 (missing data), and distills them into a lean, predictive feature set.

The three families --- filter, wrapper, and embedded --- are not competitors; they are tools in a sequence. Use filter methods for initial screening, embedded methods for refined selection, and wrapper methods for final optimization, with VIF and correlation analysis along the way to detect and resolve multicollinearity.

And above all: feature selection must happen inside cross-validation. The pipeline is not optional. It is the difference between a performance estimate you can trust and one that will embarrass you in production.

More features is not more better. The right features, selected rigorously, will beat a kitchen-sink model every time.


Next chapter: Chapter 10 --- Building Reproducible Data Pipelines, where we lock down the entire preprocessing and feature engineering workflow into a reproducible, version-controlled pipeline.