Exercises: Chapter 9

Feature Selection: Reducing Dimensionality Without Losing Signal


Exercise 1: Classify the Method (Conceptual)

For each of the following feature selection techniques, classify it as a filter, wrapper, or embedded method. Explain your reasoning in one to two sentences.

a) Computing mutual information between each feature and the target, then keeping the top 15 features.

b) Training a Lasso logistic regression and using features with non-zero coefficients.

c) Starting with all features, removing the least important one (based on gradient boosting), and repeating until cross-validated AUC stops improving.

d) Removing all features with variance below 0.01.

e) Using SequentialFeatureSelector with direction='forward' and a random forest estimator.

f) Training a random forest and selecting features whose permutation importance exceeds a threshold.

g) Computing VIF for all features and dropping any with VIF > 10.

h) Using SelectFromModel with a fitted GradientBoostingClassifier as the estimator.
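Before classifying, it may help to see the three families side by side in scikit-learn. The sketch below is illustrative only; the estimator choices, `k`, and `C` values are arbitrary, not prescribed by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif,   # filter
    SequentialFeatureSelector,          # wrapper
    SelectFromModel,                    # embedded
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Filter: scores each feature against the target; no model in the loop.
filt = SelectKBest(mutual_info_classif, k=4).fit(X, y)

# Wrapper: repeatedly refits an estimator on candidate subsets.
wrap = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4, direction='forward', cv=3,
).fit(X, y)

# Embedded: selection falls out of the fitted model's own coefficients.
emb = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
).fit(X, y)
```

Note that all three expose the same `get_support()` interface, which is what makes them interchangeable inside a `Pipeline`.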


Exercise 2: The Correlation Trap (Applied)

Generate the following dataset and analyze feature redundancy:

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

np.random.seed(42)
n = 8000

# Base engagement signal
base_engagement = np.random.exponential(15, n)

# Five features derived from the same underlying engagement signal
hours_daily = base_engagement + np.random.normal(0, 2, n)
hours_weekly = hours_daily * 7 + np.random.normal(0, 5, n)
hours_monthly = hours_weekly * 4.3 + np.random.normal(0, 10, n)
sessions_daily = base_engagement / 1.5 + np.random.normal(0, 1, n)
avg_session_len = hours_daily / np.maximum(sessions_daily, 0.1)

# Independent features
tenure = np.random.exponential(18, n).clip(1, 72)
support_tickets = np.random.poisson(1.5, n)
plan_premium = np.random.choice([0, 1], n, p=[0.6, 0.4])

# Target depends on base engagement, tenure, tickets, plan
churn_logit = (
    -1.5
    - 0.04 * base_engagement
    - 0.02 * tenure
    + 0.3 * support_tickets
    - 0.5 * plan_premium
    + np.random.normal(0, 1.0, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = pd.DataFrame({
    'hours_daily': hours_daily,
    'hours_weekly': hours_weekly,
    'hours_monthly': hours_monthly,
    'sessions_daily': sessions_daily,
    'avg_session_length': avg_session_len,
    'tenure_months': tenure,
    'support_tickets': support_tickets,
    'plan_is_premium': plan_premium,
})
y = churned

Complete the following tasks:

a) Compute the full correlation matrix and identify all pairs with |r| > 0.8. How many highly correlated pairs exist?

b) Compute the VIF for each feature. Which features have VIF > 10? Which have VIF > 5?

c) Train a GradientBoostingClassifier on all 8 features. Report 5-fold cross-validated AUC.

d) Remove features one at a time from the correlated engagement group (keeping only hours_daily). Retrain and report AUC after each removal.

e) Compare: all 8 features vs. only 4 features (hours_daily, tenure_months, support_tickets, plan_is_premium). Is there a meaningful difference in AUC? What about in training time?

f) Write 2-3 sentences explaining why the 4-feature model is preferable even if its AUC is slightly lower.
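For part (a), one convenient pattern is to walk the upper triangle of the correlation matrix so each pair is reported exactly once. The sketch below runs on a small stand-in dataset (two features sharing a base signal, one independent); the same loop applies unchanged to the exercise's `X`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.exponential(15, 1000)
df = pd.DataFrame({
    'hours_daily': base + rng.normal(0, 2, 1000),
    'hours_weekly': (base + rng.normal(0, 2, 1000)) * 7,
    'tenure': rng.exponential(18, 1000),
})

corr = df.corr()
# Upper triangle only: i < j, so each pair appears once
pairs = [
    (corr.index[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr)) for j in range(i + 1, len(corr))
    if abs(corr.iloc[i, j]) > 0.8
]
```

On this toy data only the two engagement-derived columns exceed the threshold; `tenure` is independent of both.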


Exercise 3: VIF in Practice (Applied)

Generate a dataset where multicollinearity is hidden from pairwise correlations but detectable by VIF:

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 5000

# Three independent base signals
x1 = np.random.normal(50, 10, n)
x2 = np.random.normal(30, 8, n)
x3 = np.random.normal(20, 5, n)

# x4 is a near-perfect linear combination of x1, x2, and x3
x4 = 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + np.random.normal(0, 1.0, n)

# Two independent features
x5 = np.random.exponential(10, n)
x6 = np.random.poisson(5, n)

X = pd.DataFrame({
    'feature_1': x1,
    'feature_2': x2,
    'feature_3': x3,
    'feature_4': x4,
    'feature_5': x5,
    'feature_6': x6,
})

Complete the following tasks:

a) Compute the pairwise correlation matrix. What is the highest pairwise |r| involving feature_4? Is it above 0.9?

b) Compute VIF for all features. Which feature has the highest VIF? What is its value?

c) Explain in 2-3 sentences why VIF detected the multicollinearity that pairwise correlation missed.

d) Remove feature_4 and recompute VIF. Are all remaining VIFs acceptable (below 5)?
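The setup imports statsmodels' `variance_inflation_factor`, which is the standard tool here. As a sanity check on what it computes, here is a from-scratch NumPy sketch of the definition: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on all the others (with an intercept).

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j) from regressing column j on the rest."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        yj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, yj, rcond=None)
        resid = yj - others @ beta
        r2 = 1 - resid.var() / yj.var()
        out[j] = 1 / (1 - r2)
    return out

# Toy version of the exercise's structure: x4 is a near-exact
# combination of x1..x3, x5 is independent.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(50, 10, n)
x2 = rng.normal(30, 8, n)
x3 = rng.normal(20, 5, n)
x4 = 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + rng.normal(0, 1, n)
x5 = rng.exponential(10, n)
v = vif(np.column_stack([x1, x2, x3, x4, x5]))
```

Notice that the collinear column's VIF explodes even though no single pairwise correlation is extreme, while the independent column stays near 1.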


Exercise 4: Feature Selection Leakage Experiment (Applied)

This exercise demonstrates the danger of performing feature selection outside cross-validation.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

np.random.seed(42)
n = 1000
p = 100

# ALL features are pure noise. Target is random.
X = np.random.normal(0, 1, (n, p))
y = np.random.choice([0, 1], n)

Complete the following tasks:

a) The wrong way. Use SelectKBest(mutual_info_classif, k=10) to select 10 features on the full dataset. Then run 5-fold cross-validated AUC on the selected features using a GradientBoostingClassifier. Record the mean AUC.

b) The right way. Build a Pipeline that includes SelectKBest and GradientBoostingClassifier. Run 5-fold cross-validated AUC on the pipeline. Record the mean AUC.

c) Repeat part (a) with k=5 and k=20. Does the amount of leakage change with the number of selected features?

d) Repeat both (a) and (b) with n=5000 and p=100. Does the leakage get better or worse with more data?

e) Write a 3-4 sentence explanation of why the wrong way produces inflated AUC even though the features are pure noise.
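The contrast between (a) and (b) can be sketched in a few lines. For speed this sketch swaps in `f_classif` and a logistic regression rather than the mutual information scorer and gradient boosting the exercise specifies; the leakage mechanism is identical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (300, 100))   # pure noise
y = rng.randint(0, 2, 300)         # random target

# Wrong: selection sees every fold's test rows before CV begins
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y,
                        cv=5, scoring='roc_auc').mean()

# Right: selection is refit inside each training fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('clf', LogisticRegression(max_iter=1000))])
right = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()
```

With pure noise, `right` hovers near 0.5 while `wrong` is visibly inflated, because the ten selected columns were chosen for their spurious full-sample association with `y`.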


Exercise 5: The Full Feature Selection Pipeline (Applied)

Build a complete feature selection pipeline for a simulated e-commerce conversion dataset:

import pandas as pd
import numpy as np

np.random.seed(42)
n = 12000

# ShopSmart e-commerce features
page_views = np.random.poisson(8, n)
time_on_site = np.random.exponential(5, n)  # minutes
items_viewed = np.random.poisson(4, n)
items_added_to_cart = np.random.poisson(1.5, n)
cart_value = np.random.exponential(45, n) * (items_added_to_cart > 0)
previous_purchases = np.random.poisson(2, n)
days_since_last_visit = np.random.exponential(14, n).clip(0, 180)
email_clicks_30d = np.random.poisson(3, n)
discount_used = np.random.choice([0, 1], n, p=[0.7, 0.3])
mobile_user = np.random.choice([0, 1], n, p=[0.45, 0.55])
returning_customer = (previous_purchases > 0).astype(int)
session_depth = page_views * time_on_site / 10  # Engineered: correlated with page_views and time
browse_to_cart_ratio = items_added_to_cart / np.maximum(items_viewed, 1)
avg_item_price = cart_value / np.maximum(items_added_to_cart, 1)

# Noise features
noise_feats = {f'noise_{i}': np.random.normal(0, 1, n) for i in range(6)}

# Target: conversion
conv_logit = (
    -2.5
    + 0.15 * items_added_to_cart
    + 0.005 * cart_value
    + 0.1 * previous_purchases
    - 0.02 * days_since_last_visit
    + 0.08 * email_clicks_30d
    + 0.4 * discount_used
    + 0.3 * browse_to_cart_ratio
    + np.random.normal(0, 1.0, n)
)
converted = (1 / (1 + np.exp(-conv_logit)) > 0.5).astype(int)

X = pd.DataFrame({
    'page_views': page_views,
    'time_on_site': time_on_site,
    'items_viewed': items_viewed,
    'items_added_to_cart': items_added_to_cart,
    'cart_value': cart_value,
    'previous_purchases': previous_purchases,
    'days_since_last_visit': days_since_last_visit,
    'email_clicks_30d': email_clicks_30d,
    'discount_used': discount_used,
    'mobile_user': mobile_user,
    'returning_customer': returning_customer,
    'session_depth': session_depth,
    'browse_to_cart_ratio': browse_to_cart_ratio,
    'avg_item_price': avg_item_price,
    **noise_feats,
})
y = converted

Complete the following tasks:

a) Filter step. Compute Pearson correlation of each feature with the target. Compute mutual information scores. Identify the top 10 features by each method. Do they agree?

b) Redundancy check. Compute the correlation matrix and identify pairs with |r| > 0.7. Compute VIF for all features. Which features would you flag for removal?

c) Embedded step. Build a pipeline with VarianceThreshold, StandardScaler, SelectFromModel (L1 logistic regression with C=0.1; use solver='liblinear' or 'saga', since the default solver does not support the L1 penalty), and GradientBoostingClassifier. Cross-validate and report AUC.

d) Comparison. Cross-validate the same GradientBoostingClassifier on all 20 features (no selection). Report AUC. How much did feature selection help?

e) Feature report. For each dropped feature, state whether it was removed due to noise, redundancy, or low importance. For each kept feature, explain in one sentence what information it provides.
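The pipeline in part (c) can be wired up as follows. This sketch uses a small synthetic stand-in (two informative features out of twelve) rather than the ShopSmart data, just to show the plumbing; note the `solver='liblinear'` argument, which the L1 penalty requires.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (500, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)

pipe = Pipeline([
    ('variance', VarianceThreshold(threshold=0.0)),
    ('scale', StandardScaler()),
    ('select', SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.1))),
    ('clf', GradientBoostingClassifier(random_state=0)),
])
auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()
```

Because every step lives inside the Pipeline, the variance filter, scaler, and L1 selector are all refit per fold, so the AUC is leakage-free.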


Exercise 6: Method Comparison (Analysis)

Using the StreamFlow dataset from Exercise 2 (or the full version from the chapter), compare four feature selection strategies:

a) Strategy A: No selection (all features).

b) Strategy B: Filter only. Keep the top k features by mutual information, where k is chosen by testing k = 3, 5, 7, 9, 11.

c) Strategy C: Embedded only. SelectFromModel with L1 logistic regression, testing C = 0.001, 0.01, 0.1, 1.0, 10.0.

d) Strategy D: RFECV with GradientBoostingClassifier.

For each strategy, report:

- Number of features selected
- 5-fold cross-validated AUC
- Approximate training time

Summarize your findings in a table. Which strategy would you recommend for production deployment, and why?
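A comparison loop like the one below keeps the bookkeeping honest: each strategy is a pipeline, and timing wraps the whole cross-validation call. This sketch only implements Strategy B on a synthetic stand-in; the k grid and estimator are placeholders for the exercise's choices.

```python
import time
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (1000, 11))
y = (X[:, 0] + rng.normal(0, 1, 1000) > 0).astype(int)

results = {}
for k in [3, 5, 7]:
    pipe = Pipeline([('select', SelectKBest(mutual_info_classif, k=k)),
                     ('clf', LogisticRegression(max_iter=1000))])
    start = time.perf_counter()
    auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()
    results[k] = (auc, time.perf_counter() - start)
```

The resulting `results` dict maps each k to (AUC, seconds), which drops straight into the summary table the exercise asks for.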


Exercise 7: Domain Knowledge vs. Statistical Selection (Critical Thinking)

A colleague presents the following feature selection result for the StreamFlow churn model:

Features selected by L1 (C=0.01):
  - support_tickets_90d
  - genre_diversity
  - days_since_login
  - engagement_trend

Features dropped by L1 (C=0.01):
  - tenure_months
  - monthly_charge
  - plan_is_premium
  - sessions_last_7d
  - email_open_rate
  - hours_last_30d
  - ... (8 more)

Your colleague says: "L1 selected 4 features. That is our final model."

a) Is it reasonable that tenure_months was dropped? What might this indicate about the data or the regularization strength?

b) The colleague used C=0.01, which is very strong regularization. What happens if you increase C to 0.1 or 1.0? How does this change the number of selected features?

c) A domain expert from the customer success team insists that tenure_months and plan_is_premium must be in the model, regardless of what L1 says. Is the domain expert right? Write 3-4 sentences defending or questioning the domain expert's position.

d) Propose a compromise: how would you reconcile statistical feature selection with domain knowledge requirements?
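For part (b), sweeping C and counting surviving coefficients makes the regularization-strength argument concrete. This sketch uses synthetic stand-in data (four true signal features out of fourteen); the exact counts will differ on the StreamFlow data, but the direction of the trend should not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n, p = 2000, 14
X = rng.normal(0, 1, (n, p))
coefs = np.array([1.0, 0.6, 0.4, 0.2] + [0.0] * 10)
y = (X @ coefs + rng.normal(0, 1, n) > 0).astype(int)
Xs = StandardScaler().fit_transform(X)

counts = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(Xs, y)
    counts[C] = int((np.abs(model.coef_[0]) > 1e-8).sum())
```

Smaller C means stronger regularization, so the feature count shrinks toward zero as C decreases: a feature being "dropped by L1" often says more about the chosen C than about the feature.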


Exercise 8: Stability of Feature Selection (Advanced)

Feature selection results can be unstable: small changes in the data produce different selected features. This exercise explores that instability.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
n = 5000
p = 20

# Generate features with varying signal strength
X = np.random.normal(0, 1, (n, p))
true_coefs = np.array([1.0, 0.8, 0.5, 0.3, 0.1] + [0.0] * 15)
y = (X @ true_coefs + np.random.normal(0, 1, n) > 0).astype(int)

feature_names = [f'feature_{i}' for i in range(p)]

Complete the following tasks:

a) Run L1 feature selection (SelectFromModel with LogisticRegression(penalty='l1', solver='liblinear', C=0.1); the default solver does not support the L1 penalty) on 50 different bootstrap samples of the data. For each bootstrap sample, record which features were selected.

b) Compute the selection frequency for each feature: what fraction of the 50 runs included that feature?

c) Plot or print a bar chart of selection frequencies. Which features are consistently selected (>90% of runs)? Which are on the boundary (40-60%)?

d) Based on the stability analysis, which features would you include in your final model? How does the stability criterion change your feature set compared to a single L1 run?

e) Write 2-3 sentences explaining why stability analysis is important for feature selection in production systems.
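The bootstrap loop for parts (a) and (b) can be sketched as follows. For brevity this version uses 30 resamples instead of 50 and skips the StandardScaler (the generated features are already unit-variance); the exercise's setup should use the full pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n, p = 2000, 20
X = rng.normal(0, 1, (n, p))
true_coefs = np.array([1.0, 0.8, 0.5, 0.3, 0.1] + [0.0] * 15)
y = (X @ true_coefs + rng.normal(0, 1, n) > 0).astype(int)

n_boot = 30
freq = np.zeros(p)
for b in range(n_boot):
    idx = rng.randint(0, n, n)          # sample n rows with replacement
    sel = SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
    ).fit(X[idx], y[idx])
    freq += sel.get_support()           # bool mask accumulates as 0/1
freq /= n_boot                          # fraction of runs selecting each
```

Features with large true coefficients should sit near a selection frequency of 1.0, the pure-noise features near 0, and the weak-signal features somewhere in between, which is exactly the boundary region part (c) asks about.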


Solutions to selected exercises are provided in the appendix.