Quiz: Chapter 6

Feature Engineering


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A team builds a churn prediction model using 3 raw features and achieves an AUC of 0.70 with gradient boosting. They then engineer 25 features from the same raw data and train a logistic regression, achieving an AUC of 0.76. What does this demonstrate?

  • A) Gradient boosting is an inferior algorithm for churn prediction
  • B) Logistic regression is always better when you have more features
  • C) Feature engineering often provides more predictive lift than algorithm choice
  • D) More features always lead to better model performance

Answer: C) Feature engineering often provides more predictive lift than algorithm choice. The logistic regression (a simpler model) with engineered features outperformed gradient boosting (a more powerful model) with raw features. This demonstrates that the information content of the features is typically more important than the complexity of the algorithm. Option D is incorrect because more features do not always help --- irrelevant features can hurt performance.


Question 2 (Short Answer)

Explain the "feature engineering recipe" described in this chapter. What are the three steps, and why does the recipe start with domain understanding rather than code?

Answer: The recipe is: (1) understand the domain by talking to subject matter experts, (2) ask "what would a human expert look at?" to identify candidate signals, and (3) translate that intuition into computable features. It starts with domain understanding because the best features encode domain knowledge that algorithms cannot discover on their own. A customer success manager who knows that login recency predicts churn provides more value in a brainstorming session than a data scientist running blind feature generation. Code implements the insight; it does not create it.


Question 3 (Multiple Choice)

Which of the following is a correct definition of a "recency feature"?

  • A) The total number of times a user performed an action over their entire lifetime
  • B) The time elapsed since the user last performed a specific action
  • C) The average frequency of actions per month over the last year
  • D) The percentage change in activity between two time periods

Answer: B) The time elapsed since the user last performed a specific action. Recency measures "how long ago" --- examples include days since last login, days since last support ticket, or days since last purchase. Option A describes a lifetime frequency count, Option C describes a frequency rate, and Option D describes a trend feature.


Question 4 (Multiple Choice)

You compute support_tickets_last_90d / hours_last_90d as a ratio feature. A subscriber has 3 support tickets and 0 hours watched. What should you do?

  • A) Let the division produce infinity and rely on the model to handle it
  • B) Drop the subscriber from the dataset
  • C) Handle the zero denominator explicitly, for example by returning the ticket count when hours is zero
  • D) Replace the ratio with NaN and let imputation handle it

Answer: C) Handle the zero denominator explicitly, for example by returning the ticket count when hours is zero. Division by zero is a common edge case in ratio features and must be handled explicitly. Option A would produce infinity or NaN values that most libraries, including scikit-learn, reject with an error. Option B discards data unnecessarily. Option D adds imputation complexity that a simple conditional avoids. The best approach uses np.where() or equivalent conditional logic to return a sensible default when the denominator is zero.
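As a sketch of the explicit guard (the subscriber values here are illustrative):

```python
import numpy as np

# Hypothetical 90-day window values for three subscribers.
tickets_last_90d = np.array([3.0, 5.0, 0.0])
hours_last_90d = np.array([0.0, 25.0, 40.0])

# np.divide with `where=` skips the zero-denominator rows entirely;
# those rows keep the fallback value (the raw ticket count).
tickets_per_hour = np.divide(
    tickets_last_90d,
    hours_last_90d,
    out=tickets_last_90d.copy(),  # fallback where hours == 0
    where=hours_last_90d > 0,
)
# tickets_per_hour -> [3.0, 0.2, 0.0]
```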


Question 5 (Short Answer)

Explain why np.log1p() is preferred over np.log() for log-transforming features that contain zero values. What would happen if you used np.log() on a feature with zeros?

Answer: np.log(0) returns negative infinity (-inf), which breaks most machine learning algorithms. np.log1p(x) computes log(1 + x), which maps 0 to log(1) = 0 --- a finite, well-behaved value. The transformation preserves the rank ordering of all values (including zero) while compressing the right tail of the distribution. For values much larger than 1, log1p(x) is approximately equal to log(x), so the transformation behaves like a standard log transform where it matters most.
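A minimal illustration of the difference (the hours values are arbitrary):

```python
import numpy as np

hours = np.array([0.0, 3.0, 120.0])

# np.log(0) yields -inf, which most estimators reject.
with np.errstate(divide="ignore"):
    raw_log = np.log(hours)    # first entry is -inf

# np.log1p maps 0 -> 0 and tracks np.log(x) closely for large x.
safe_log = np.log1p(hours)     # [0.0, 1.386..., 4.795...]
```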


Question 6 (Multiple Choice)

Which of the following is the strongest argument for using cyclical encoding (sine/cosine) for the month-of-year feature?

  • A) It reduces the number of features from 12 (one-hot) to 2
  • B) It ensures that December (12) and January (1) are treated as adjacent, not distant
  • C) It makes the feature normally distributed
  • D) It is required by all sklearn classifiers

Answer: B) It ensures that December (12) and January (1) are treated as adjacent, not distant. When month is encoded as an integer (1-12), linear models interpret December as 12 times January and see December and January as maximally distant. Cyclical encoding (sine and cosine components) preserves the circular nature of months, so the model correctly learns that December and January are neighbors. Option A is a side benefit but not the primary motivation. Option C is incorrect; cyclical encoding does not guarantee normality. Option D is false.
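The adjacency argument can be checked directly; this sketch maps months onto the unit circle and compares distances:

```python
import numpy as np

def encode_month(month):
    """Map month 1-12 onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * (month - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])

jan, jul, dec = encode_month(1), encode_month(7), encode_month(12)

# December and January land next to each other on the circle,
# while July sits on the opposite side from January.
np.linalg.norm(dec - jan)  # ~0.518
np.linalg.norm(jul - jan)  # ~2.0
```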


Question 7 (Multiple Choice)

In target encoding with smoothing, the formula is: (count * category_mean + smoothing * global_mean) / (count + smoothing). A category with 2 observations and a 100% churn rate in a dataset with an 8% global churn rate, using smoothing = 10, would receive an encoded value closest to:

  • A) 1.00
  • B) 0.50
  • C) 0.23
  • D) 0.08

Answer: C) 0.23. The calculation is: (2 * 1.0 + 10 * 0.08) / (2 + 10) = (2.0 + 0.8) / 12 = 2.8 / 12 = 0.233. With only 2 observations, the smoothing pulls the encoded value strongly toward the global mean of 8%. This prevents the model from overfitting to the small-sample noise of 100% churn. With more observations, the category-specific mean would dominate.
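The formula from the question, as a small function; the 500-observation category is an invented contrast case:

```python
def smoothed_encode(count, category_mean, global_mean, smoothing=10):
    """Smoothed target encoding as given in the question."""
    return (count * category_mean + smoothing * global_mean) / (count + smoothing)

# The small category from the question: pulled toward the global mean.
smoothed_encode(2, 1.0, 0.08)    # ~0.233

# A large category: the category-specific mean dominates.
smoothed_encode(500, 1.0, 0.08)  # ~0.982
```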


Question 8 (Short Answer)

A team builds a feature called avg_hours_all_subscribers (the mean hours watched across all subscribers in the dataset) and uses it to normalize individual subscriber hours. They compute this mean on the entire dataset before splitting into train and test. Explain why this is a form of data leakage and how to fix it.

Answer: Computing the global mean on the full dataset means the training features incorporate statistical information from the test set. During training, the model indirectly "sees" test set values through the normalization constant. To fix this, compute the mean only on the training set after the split, then use that training-set mean to normalize both the training and test features. The difference may be small in large datasets, but the principle matters: any statistic used in feature engineering must be computed exclusively on training data.
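A sketch of the fix with a hypothetical hours column; `train_test_split` stands in for whatever split the team actually uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split

hours = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0])
hours_train, hours_test = train_test_split(hours, test_size=0.25, random_state=0)

# Leaky: the normalization constant sees every row, test set included.
leaky_mean = hours.mean()

# Correct: compute the statistic on the training split only,
# then reuse it for both splits.
train_mean = hours_train.mean()
hours_train_norm = hours_train / train_mean
hours_test_norm = hours_test / train_mean
```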


Question 9 (Multiple Choice)

When should you apply a log transformation to a skewed numeric feature?

  • A) Always, for all numeric features
  • B) Only when using tree-based models like random forests or gradient boosting
  • C) Primarily when using linear models, as tree-based models are invariant to monotonic transformations
  • D) Only when the feature has no zero values

Answer: C) Primarily when using linear models, as tree-based models are invariant to monotonic transformations. Tree-based models split on rank order, so a log transformation (which is monotonic) does not change the split points. Linear models, however, are sensitive to the distribution of features, and log transformations can make skewed distributions more symmetric, improving the linear model's ability to learn the relationship. Option D is incorrect: np.log1p() handles zeros gracefully, so zero values are not a barrier to log-transforming a feature.


Question 10 (Multiple Choice)

You are building a churn prediction model and one of your features has an individual AUC of 0.97 against the target. What should you do?

  • A) Celebrate --- you have found an excellent feature
  • B) Immediately remove it from the feature set
  • C) Investigate for data leakage --- check whether this feature uses information from after the prediction date or contains the target directly
  • D) Scale it to reduce its influence on the model

Answer: C) Investigate for data leakage --- check whether this feature uses information from after the prediction date or contains the target directly. A single feature with AUC of 0.97 is almost certainly leaking information. Real features in churn models rarely exceed AUC of 0.85 individually. Common causes include: the feature was computed using post-prediction-date data, the feature is derived from the target variable (e.g., cancellation reason), or the feature encodes a near-deterministic proxy for the target (e.g., "user visited the cancellation page").


Question 11 (Short Answer)

Explain the difference between PolynomialFeatures(degree=2, interaction_only=False) and PolynomialFeatures(degree=2, interaction_only=True). Given three input features (A, B, C), list the features produced by each setting.

Answer: With interaction_only=False, the output includes the original features, all squared terms, and all pairwise interactions: A, B, C, A^2, B^2, C^2, AB, AC, BC (9 features, assuming include_bias=False; sklearn's default include_bias=True adds a constant column as well). With interaction_only=True, the output includes only the original features and pairwise interactions, without squared terms: A, B, C, AB, AC, BC (6 features). The interaction_only=True setting is often preferred because interaction terms capture "these two things together matter" while squared terms capture non-linear effects of a single variable, which are often less interpretable.
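A quick check of both settings on a single row (the values A=2, B=3, C=5 are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 5.0]])  # one row: A=2, B=3, C=5

full = PolynomialFeatures(degree=2, include_bias=False)
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

full.fit_transform(X)   # A, B, C, A^2, AB, AC, B^2, BC, C^2 -> 9 columns
inter.fit_transform(X)  # A, B, C, AB, AC, BC -> 6 columns
```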


Question 12 (Multiple Choice)

The "genre diversity score" feature is computed as unique_genres_last_90d / total_sessions_last_90d. A subscriber who watched 1 session in 1 genre gets a score of 1.0. A subscriber who watched 60 sessions across 6 genres gets 6/60 = 0.1. Which subscriber is more diverse in their viewing?

  • A) The first subscriber (score 1.0 > 0.1)
  • B) The second subscriber (6 genres > 1 genre)
  • C) The scores are not directly comparable because the denominators are different
  • D) Both are equally diverse

Answer: C) The scores are not directly comparable because the denominators are different. The genre diversity score is a ratio feature that is sensitive to the denominator. A subscriber with 1 session has trivially explored 1 genre, giving a deceptively high score. A subscriber with 60 sessions across 6 genres is genuinely diverse. This is a common pitfall with ratio features when the denominator is very small. One fix is to only compute the ratio for subscribers with a minimum number of sessions (e.g., sessions >= 5) and set it to a default for others.
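The minimum-sessions guard can be sketched as follows; the threshold of 5 and the 0.0 default are modeling choices, not fixed rules:

```python
import numpy as np

MIN_SESSIONS = 5  # threshold is a modeling choice

unique_genres = np.array([1, 6, 2])
total_sessions = np.array([1, 60, 8])

# Only compute the ratio when there is enough viewing history;
# otherwise fall back to a neutral default of 0.0.
genre_diversity = np.where(
    total_sessions >= MIN_SESSIONS,
    unique_genres / np.maximum(total_sessions, 1),
    0.0,
)
# genre_diversity -> [0.0, 0.1, 0.25]
```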


Question 13 (Short Answer)

A team creates 500 features for a churn model. Many are highly correlated with each other (e.g., hours_last_30d and log_hours_last_30d have a correlation of 0.97). Should they remove highly correlated features before training? Does the answer depend on the model type?

Answer: It depends on the model type. For linear models (logistic regression, linear SVM), highly correlated features cause multicollinearity, which inflates coefficient variance and makes interpretation unreliable. Removing one from each correlated pair is usually wise. For tree-based models (random forest, gradient boosting), highly correlated features are not harmful to predictive performance --- the model will simply use one and ignore the other. However, keeping both wastes memory and slows training. In practice, removing obvious duplicates (raw and log-transformed versions) is good hygiene regardless of model type.


Question 14 (Multiple Choice)

Which of the following best describes why trend features (e.g., hours_change_30d) are often more predictive than snapshot features (e.g., hours_last_30d)?

  • A) Trend features have lower variance than snapshot features
  • B) Trend features capture the direction of behavioral change, which signals future intent more strongly than current state
  • C) Trend features are always normally distributed, which helps model training
  • D) Trend features require less data to compute

Answer: B) Trend features capture the direction of behavioral change, which signals future intent more strongly than current state. A subscriber watching 20 hours/month is ambiguous without context. If they watched 30 hours/month previously, the decline signals disengagement. If they watched 10 hours previously, the increase signals growing engagement. The direction of change encodes momentum, which is a leading indicator of future behavior. Current state is a snapshot; change is a trajectory.
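The snapshot-versus-trajectory point can be illustrated with three hypothetical subscribers who share the same current snapshot:

```python
import numpy as np

# Three subscribers who all watched 20 hours in the last 30 days.
hours_prev_30d = np.array([30.0, 10.0, 20.0])
hours_last_30d = np.array([20.0, 20.0, 20.0])

# Relative change, guarding the zero-history case.
hours_change_30d = np.where(
    hours_prev_30d > 0,
    (hours_last_30d - hours_prev_30d) / np.maximum(hours_prev_30d, 1e-9),
    0.0,
)
# Same snapshot, three different stories:
# decline (-0.33), growth (+1.0), stability (0.0).
```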


Question 15 (Short Answer)

Describe the "war story" about the five-minute feature that beat three weeks of neural architecture search. What is the lesson for practicing data scientists?

Answer: A SaaS company's data science team spent three weeks experimenting with neural network architectures (LSTMs, attention mechanisms, autoencoders) and achieved an AUC of 0.78. A new hire from customer success suggested adding days_since_last_login, which took five minutes to compute. Adding this single feature to a gradient boosted tree pushed the AUC to 0.84. The lesson is that domain knowledge is the highest-leverage input to any modeling project. Before investing in complex algorithms, invest in understanding the problem domain and translating expert intuition into features. The best features come from people who understand the business, not from people who understand backpropagation.


This quiz supports Chapter 6: Feature Engineering. Return to the chapter to review concepts before checking your answers.