Quiz: Chapter 9
Feature Selection: Reducing Dimensionality Without Losing Signal
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
A data scientist computes mutual information between each of 50 features and a binary target, selects the top 10 features, and then runs 5-fold cross-validation on the selected features. The cross-validated AUC is 0.78. What is the primary problem with this approach?
- A) Mutual information cannot be used with binary targets
- B) Feature selection was performed outside cross-validation, causing information leakage
- C) 10 features is too few for a binary classification problem
- D) 5-fold cross-validation is not enough; 10-fold should be used
Answer: B) Feature selection was performed outside cross-validation, causing information leakage. The mutual information scores were computed using the full dataset, including the test folds. This means the feature selection step had access to test-fold information, producing an optimistically biased AUC estimate. The correct approach is to place feature selection inside a Pipeline so it is re-fit on training data only within each fold.
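The leakage-free pattern can be sketched as follows; the synthetic data from `make_classification` is a stand-in for the 50-feature problem, and the seed choices are illustrative:

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 50 features, 5 of which carry signal.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Correct pattern: selection lives inside the pipeline, so it is re-fit
# on each fold's training split only -- the test fold never leaks in.
pipe = Pipeline([
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))
```

Running `SelectKBest.fit_transform` on all of `X` first and cross-validating afterward would reproduce the leakage described in the answer.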
Question 2 (Multiple Choice)
A feature has a Variance Inflation Factor (VIF) of 14.3. What does this mean?
- A) The feature explains 14.3% of the variance in the target variable
- B) The feature's variance is 14.3 times the variance of the target
- C) 93% of the feature's variance can be explained by the other features in the model
- D) The feature's coefficient has 14.3 times more variance than it would without multicollinearity
Answer: C) 93% of the feature's variance can be explained by the other features in the model. VIF = 1 / (1 - R-squared), where R-squared is from regressing the feature on all other features. A VIF of 14.3 means R-squared = 1 - 1/14.3 = 0.930, so 93% of the feature's variation is redundant with other features. Option D describes the consequence (inflated coefficient variance) but C describes what VIF directly measures.
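The definition can be checked directly; the `vif` helper and the synthetic columns below are illustrative, computing VIF from first principles rather than via a library routine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, i):
    """VIF for column i: 1 / (1 - R^2) from regressing it on the rest."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.3 * rng.normal(size=200)   # strongly collinear with a
c = rng.normal(size=200)             # independent

X = np.column_stack([a, b, c])
print(round(vif(X, 0), 1))           # high: a is nearly explained by b
print(round(vif(X, 2), 1))           # near 1: c is not explained by a or b

# The question's value: VIF = 14.3 implies R^2 = 1 - 1/14.3
print(round(1 - 1 / 14.3, 3))        # 0.93
```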
Question 3 (Short Answer)
Explain the difference between filter methods and wrapper methods for feature selection. Give one advantage and one disadvantage of each.
Answer: Filter methods score features using statistical tests (correlation, mutual information, chi-squared) independent of any particular model. They are fast and scalable (advantage) but ignore feature interactions and model-specific relevance (disadvantage). Wrapper methods evaluate subsets of features by training a model and measuring its performance. They find the best subset for a specific model (advantage) but are computationally expensive and prone to overfitting the selection to the training data (disadvantage).
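The two families map onto different scikit-learn estimators; a minimal side-by-side sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

# Filter: scores each feature with a statistical test, no model involved.
filt = SelectKBest(f_classif, k=5).fit(X, y)

# Wrapper: repeatedly fits a model, dropping the weakest feature each round.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print(sorted(filt.get_support(indices=True)))
print(sorted(wrap.get_support(indices=True)))
```

The two selected sets often overlap but need not agree, which is the tradeoff in miniature: the filter is cheap and model-agnostic, the wrapper is expensive and model-specific.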
Question 4 (Multiple Choice)
You have a dataset with 25 features. Five pairs of features have pairwise correlations above 0.9. You compute VIF for all 25 features and find that 8 features have VIF above 10. What should you do?
- A) Drop all 8 features with high VIF
- B) Drop one feature from each highly correlated pair, then recompute VIF
- C) Ignore the multicollinearity because tree-based models are robust to correlated features
- D) Apply PCA to all 25 features to eliminate all multicollinearity
Answer: B) Drop one feature from each highly correlated pair, then recompute VIF. Removing one feature from a correlated pair often resolves the VIF issue for both. After each removal, VIF should be recomputed because the multicollinearity structure changes. Option A is too aggressive --- removing all high-VIF features may discard useful signal. Option C is only partially correct --- while trees handle correlation better than linear models, correlated features still waste computational resources and complicate interpretation. Option D is overkill and sacrifices interpretability.
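The pairwise-correlation filter can be sketched as below; the frame and column names are illustrative, with two near-duplicate columns planted deliberately:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["b0", "b1", "b2"])
df["d0"] = df["b0"] + 0.05 * rng.normal(size=300)  # near-duplicate of b0
df["d1"] = df["b1"] + 0.05 * rng.normal(size=300)  # near-duplicate of b1

# Scan the upper triangle of |corr| and flag one member of each > 0.9 pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

After dropping, VIF should be recomputed on `reduced`, since removing one member of a pair changes every remaining feature's VIF.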
Question 5 (Multiple Choice)
L1 (Lasso) regularization performs feature selection because:
- A) It adds a penalty proportional to the square of each coefficient, shrinking all coefficients toward zero
- B) It adds a penalty proportional to the absolute value of each coefficient, driving weak coefficients to exactly zero
- C) It removes features with low variance before fitting the model
- D) It uses permutation importance to rank features after fitting
Answer: B) It adds a penalty proportional to the absolute value of each coefficient, driving weak coefficients to exactly zero. The L1 penalty's geometry (diamond-shaped constraint region) means that the optimal solution often lies at a corner where one or more coefficients are exactly zero. This makes L1 a simultaneous feature selection and model fitting method. Option A describes L2 (Ridge) regularization, which shrinks coefficients but does not set them to exactly zero.
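The exact-zero behavior is easy to observe on synthetic data; the alpha values below are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only two features carry signal; the other eight are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(int(np.sum(lasso.coef_ == 0)))  # L1 zeroes the noise coefficients
print(int(np.sum(ridge.coef_ == 0)))  # L2 shrinks them but not to zero
```

The surviving nonzero Lasso coefficients identify the selected features, which is what `SelectFromModel` exploits.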
Question 6 (Short Answer)
A permutation importance analysis on the test set shows that noise_feature_7 has an importance of -0.002 (negative). What does a negative permutation importance mean? Should you keep this feature?
Answer: A negative permutation importance means the model's score improved, on average, when the feature's values were shuffled: breaking the feature's relationship to the target helped rather than hurt. A value as small as -0.002 is statistically indistinguishable from zero, so the practical reading is that the feature carries no usable signal and may be adding mild noise the model has fit to. You should drop this feature: at best it contributes nothing, and at worst it degrades generalization while still costing pipeline maintenance.
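A sketch of the scenario, with a pure-noise column appended deliberately (column names and seeds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Informative features plus one appended pure-noise column (the last one).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(size=(400, 1))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=30,
                                random_state=0)
# The noise column's importance hovers around zero and can dip negative.
print(round(result.importances_mean[-1], 3))
```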
Question 7 (Multiple Choice)
You run Recursive Feature Elimination with Cross-Validation (RFECV) on a dataset with 30 features. The optimal number of features is reported as 12. The AUC curve shows a plateau from 10 to 15 features, with AUC values within 0.002 of each other across that range. Which feature count should you choose for production?
- A) 12, because RFECV selected it as optimal
- B) 10, because fewer features is better when performance is effectively equal
- C) 15, to be safe and include all features in the plateau
- D) 30, because removing features always risks losing information
Answer: B) 10, because fewer features is better when performance is effectively equal. When model performance is essentially flat across a range of feature counts, the parsimony principle applies: choose the simplest model. Fewer features means lower maintenance cost, faster inference, fewer monitoring requirements, and better interpretability. The 0.002 AUC difference between 10 and 12 features is within cross-validation noise and should not drive the decision.
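The parsimony rule can be made mechanical: pick the smallest feature count whose score is within a tolerance of the best. The AUC values below are hypothetical, mirroring the plateau in the question:

```python
import numpy as np

# Hypothetical mean CV AUC by feature count, flat within 0.002 across 10-15.
n_features = np.array([10, 11, 12, 13, 14, 15])
auc = np.array([0.841, 0.842, 0.843, 0.842, 0.841, 0.842])

tol = 0.002
# Parsimony rule: smallest feature count within tol of the best score.
chosen = int(n_features[auc >= auc.max() - tol].min())
print(chosen)  # 10
```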
Question 8 (Multiple Choice)
Which of the following is the correct way to combine feature selection with cross-validation in scikit-learn?
- A) `SelectKBest(k=10).fit_transform(X, y)` followed by `cross_val_score(model, X_selected, y)`
- B) `cross_val_score(Pipeline([('select', SelectKBest(k=10)), ('model', model)]), X, y)`
- C) `model.fit(SelectKBest(k=10).fit_transform(X_train, y_train), y_train)` followed by `model.score(X_test, y_test)`
- D) Both B and C are correct
Answer: B) `cross_val_score(Pipeline([('select', SelectKBest(k=10)), ('model', model)]), X, y)`. The pipeline ensures that feature selection is re-fit on each fold's training data. Option A is the classic leakage pattern --- feature selection on the full dataset before cross-validation. Option C is close but broken: the selector fitted on X_train must also transform X_test (via `selector.transform(X_test)`), yet the code scores the model on the untransformed X_test, so the test columns no longer match what the model was trained on.
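For a single train/test split (the setting Option C attempts), the correct pattern reuses the fitted selector on both splits; the data below is an illustrative synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the selector on training data only, then reuse it on the test split.
sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
acc = model.score(sel.transform(X_te), y_te)
print(round(acc, 3))
```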
Question 9 (Short Answer)
A model uses 40 features in production. A new data scientist joins the team and runs feature selection, reducing the feature set to 18. The offline AUC improves by 0.3 points. Should the team deploy the 18-feature model immediately? What risks should they consider?
Answer: The team should not deploy immediately. First, they should verify that the feature selection was performed inside cross-validation (not leaked). Second, they should check that the selected features are stable across different time periods and data samples --- if the selection changes significantly with different training windows, the improvement may not persist in production. Third, removing 22 features means changing the model's decision surface, which could affect fairness across subgroups or change which customers are flagged for intervention. A/B testing the new model against the existing one in production is the safest path.
Question 10 (Multiple Choice)
You compute VIF for all features and find that hours_last_7d (VIF=16.2), hours_last_30d (VIF=13.8), and sessions_last_7d (VIF=11.4) all have VIF above 10. All three are correlated with each other but measure slightly different aspects of user engagement. Which approach is most appropriate?
- A) Drop all three features since they all have high VIF
- B) Keep the one with the highest univariate correlation with the target and drop the other two
- C) Create a composite engagement feature (e.g., weighted average or first principal component) to replace all three
- D) Both B and C are reasonable approaches; test both and compare model performance
Answer: D) Both B and C are reasonable approaches; test both and compare model performance. Dropping one and keeping the best (Option B) is simpler and preserves interpretability. Creating a composite (Option C) retains more information but loses the ability to explain the model in terms of individual usage metrics. The right choice depends on whether interpretability or raw performance matters more for the use case. Option A is too aggressive and discards useful engagement signal. Testing both options with cross-validation is the rigorous way to decide.
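A sketch of the composite route (Option C) via a first principal component; the three engagement columns are simulated around one latent usage level, so the names and noise scale are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
engagement = rng.normal(size=500)  # latent usage level driving all three
hours_7d = engagement + 0.3 * rng.normal(size=500)
hours_30d = engagement + 0.3 * rng.normal(size=500)
sessions_7d = engagement + 0.3 * rng.normal(size=500)
X = np.column_stack([hours_7d, hours_30d, sessions_7d])

# Standardize first so no single metric dominates, then keep one component.
pca = PCA(n_components=1)
composite = pca.fit_transform(StandardScaler().fit_transform(X))
print(round(pca.explained_variance_ratio_[0], 2))
```

Because the three columns share one driver, the single component retains most of the variance, which is exactly when a composite is a reasonable substitute.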
Question 11 (Short Answer)
Explain why tree-based feature importance (impurity-based) can be misleading. When should you use permutation importance instead?
Answer: Impurity-based importance has known biases: it tends to favor high-cardinality features and continuous features with many possible split points, because these features offer more opportunities to reduce impurity regardless of their true predictive value. It is also computed on training data, so it conflates genuine importance with overfitting. Permutation importance avoids these biases by measuring the actual drop in model performance (on held-out data) when a feature is randomly shuffled. You should use permutation importance whenever you need reliable importance rankings for feature selection, interpretation, or stakeholder communication.
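The two importance measures come from different data and can rank features differently; a minimal comparison on synthetic data (seeds illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

impurity = rf.feature_importances_              # computed from training splits
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              random_state=1).importances_mean  # held-out data

# Rankings can disagree; the permutation ranking reflects held-out skill.
print(np.argsort(impurity)[::-1][:3])
print(np.argsort(perm)[::-1][:3])
```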
Question 12 (Multiple Choice)
A SaaS company has 200 features for their churn model. They want to reduce this to approximately 20 features. Which sequence of methods is most efficient?
- A) Run RFECV from 200 down to 20 features
- B) Apply VarianceThreshold, then correlation filtering (drop one of each |r| > 0.9 pair), then L1 SelectFromModel
- C) Use forward selection to build up from 0 to 20 features
- D) Train a random forest on all 200 features and keep the top 20 by impurity importance
Answer: B) Apply VarianceThreshold, then correlation filtering (drop one of each |r| > 0.9 pair), then L1 SelectFromModel. This sequence uses fast methods first (variance threshold removes obvious junk, correlation filtering removes obvious redundancy) and expensive methods last (L1 selection on the reduced set). Option A is computationally prohibitive --- RFECV from 200 features requires hundreds of model-CV iterations. Option C is even more expensive (forward selection with 200 candidates). Option D is fast but unreliable due to the biases of impurity-based importance.
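The cheap-to-expensive funnel can be sketched end to end; the 200-column synthetic frame, the planted junk columns, and the regularization strength are all illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 200-feature churn table, plus two junk columns.
X, y = make_classification(n_samples=500, n_features=200, n_informative=15,
                           n_redundant=0, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(200)])
df["constant"] = 1.0                                        # zero-variance junk
rng = np.random.default_rng(0)
df["dup"] = df["f0"] + 0.01 * rng.normal(size=500)          # near-duplicate

# Step 1 (cheapest): drop (near-)constant columns.
X1 = VarianceThreshold(threshold=1e-8).fit_transform(df)

# Step 2: drop one member of each |r| > 0.9 pair.
corr = pd.DataFrame(X1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
keep = [c for c in upper.columns if not (upper[c] > 0.9).any()]
X2 = X1[:, keep]

# Step 3 (most expensive): L1-penalized selection on the reduced set.
sel = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    max_features=20,
).fit(X2, y)
print(sel.transform(X2).shape[1])  # at most 20 features survive
```

Ordering matters: the two filter steps shrink the problem before the only step that requires model fitting.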
Question 13 (Short Answer)
A dataset has 50 features and 500 observations. A colleague argues that feature selection is unnecessary because gradient boosting can "just ignore" irrelevant features. Is the colleague correct? Why or why not?
Answer: The colleague is wrong. With a 10:1 observation-to-feature ratio, the model has limited data to distinguish signal from noise. While gradient boosting handles irrelevant features better than linear models, it still suffers from the curse of dimensionality: with 50 features and only 500 observations, the model has many opportunities to find spurious patterns that will not generalize. Each irrelevant feature consumes some of the model's complexity budget and adds noise to the splits. Feature selection is especially important when the ratio of observations to features is low, precisely because the model cannot reliably identify and ignore noise on its own.
Answers are included above for self-study. For additional practice, see the exercises file for this chapter.