Chapter 9: Quiz

Test your understanding of feature engineering and data pipelines. Each question has one correct answer unless otherwise stated.


Question 1

Which scaler is most appropriate for a dataset with significant outliers?

  • A) StandardScaler
  • B) MinMaxScaler
  • C) RobustScaler
  • D) MaxAbsScaler
Answer **C) `RobustScaler`** `RobustScaler` uses the median and interquartile range (IQR) instead of the mean and standard deviation. Since the median and IQR are robust statistics---they are not significantly affected by extreme values---this scaler handles outliers gracefully. `StandardScaler` and `MinMaxScaler` both use statistics (mean, min, max) that are sensitive to outliers, and `MaxAbsScaler` divides by the maximum absolute value, which a single outlier can dominate.
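
To see the difference concretely, here is a small sketch (toy data, not from the chapter) comparing the two scalers on a column with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# A column of mostly small values plus one extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# RobustScaler centers on the median (3.0) and scales by the IQR (2.0),
# so the inliers keep a sensible spread; StandardScaler squashes them
# because the outlier inflates both the mean and the standard deviation.
print("standard inlier spread:", standard[:4].std())
print("robust inlier spread:  ", robust[:4].std())
```

The median value maps exactly to zero under `RobustScaler`, while under `StandardScaler` all four inliers collapse into a narrow band near the same point.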

Question 2

What is the primary risk of applying `StandardScaler` to the entire dataset before splitting into train and test sets?

  • A) The model will be slower to train
  • B) Data leakage---test set statistics influence the training process
  • C) The features will have incorrect data types
  • D) The model will underfit
Answer **B) Data leakage---test set statistics influence the training process** When you fit the scaler on the entire dataset, the mean and standard deviation include information from the test set. This means the training data has been subtly influenced by test set values, producing overly optimistic performance estimates that do not generalize to truly unseen data.

Question 3

Which encoding method is most appropriate for a categorical feature with 10,000 unique values?

  • A) One-hot encoding
  • B) Ordinal encoding
  • C) Target encoding with smoothing
  • D) Binary encoding
Answer **C) Target encoding with smoothing** One-hot encoding would create 10,000 new columns, which is impractical and leads to severe sparsity. Ordinal encoding imposes a false ordering. Target encoding with smoothing maps each category to a single numerical value (the regularized mean of the target), keeping dimensionality low while capturing the relationship between category and target. Binary encoding (D) is also reasonable, producing about 14 columns, but target encoding typically provides more predictive power.

Question 4

In TF-IDF, what does a high IDF value indicate about a term?

  • A) The term appears in most documents
  • B) The term appears in very few documents and is therefore more discriminative
  • C) The term has high frequency within a single document
  • D) The term is a stop word
Answer **B) The term appears in very few documents and is therefore more discriminative** IDF (Inverse Document Frequency) is defined as $\log(N / df_t)$, where $N$ is the total number of documents and $df_t$ is the number of documents containing term $t$. A rare term has a small $df_t$, resulting in a high IDF value. This upweights rare, discriminative terms and downweights common terms that appear everywhere.
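
The definition can be checked with a few lines of plain Python on a hypothetical four-document corpus:

```python
import math

# Hypothetical corpus: four tiny documents.
docs = [
    "the cat sat",
    "the dog ran",
    "the cat ran",
    "quantum entanglement",
]
N = len(docs)

def idf(term):
    # df_t: number of documents containing the term.
    df = sum(term in doc.split() for doc in docs)
    return math.log(N / df)

# "the" appears in 3 of 4 documents; "quantum" in only 1.
print("idf('the')     =", idf("the"))
print("idf('quantum') =", idf("quantum"))
```

Note that scikit-learn's `TfidfVectorizer` defaults to a smoothed variant, $\log\frac{1+N}{1+df_t} + 1$, so its values differ slightly from this textbook formula.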

Question 5

Why is cyclical encoding (sine/cosine) preferred over integer encoding for the "hour of day" feature?

  • A) It produces fewer features
  • B) It preserves the circular relationship where hour 23 is close to hour 0
  • C) It is computationally faster
  • D) It removes outliers from the time data
Answer **B) It preserves the circular relationship where hour 23 is close to hour 0** Integer encoding (0, 1, 2, ..., 23) implies that hour 0 and hour 23 are maximally distant (distance = 23), when in reality they are only 1 hour apart. Sine and cosine encoding maps hours to a circle in 2D space, so adjacent hours (including 23 and 0) are nearby in the encoded representation.
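
A minimal sketch of the encoding, mapping each hour onto the unit circle (toy code, not from the chapter):

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def dist(h1, h2):
    # Euclidean distance between two hours in the 2-D encoded space.
    return np.hypot(hour_sin[h1] - hour_sin[h2], hour_cos[h1] - hour_cos[h2])

# Hour 23 and hour 0 are now exactly as close as any other adjacent pair.
print("dist(23, 0) =", dist(23, 0))
print("dist(11, 12)=", dist(11, 12))
```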

Question 6

Which feature selection method measures the mutual dependence between features and the target without assuming a linear relationship?

  • A) Pearson correlation
  • B) ANOVA F-statistic
  • C) Mutual information
  • D) Variance threshold
Answer **C) Mutual information** Mutual information measures any kind of statistical dependence---linear, nonlinear, or otherwise---between a feature and the target. Pearson correlation and ANOVA F-statistic primarily capture linear relationships. Variance threshold does not consider the target at all; it only removes features with low variance.
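
A quick illustration on synthetic data: a purely quadratic dependence that Pearson correlation misses but mutual information detects.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x ** 2  # deterministic but nonlinear relationship

# Pearson correlation is near zero for this symmetric relationship...
pearson = np.corrcoef(x, y)[0, 1]
# ...while mutual information (kNN-based estimator) is clearly positive.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print("pearson:", pearson, "mutual info:", mi)
```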

Question 7

What is the key advantage of using a scikit-learn `Pipeline` for preprocessing and modeling?

  • A) It makes the code run faster
  • B) It automatically selects the best model
  • C) It prevents data leakage during cross-validation by ensuring transformers are fit only on training folds
  • D) It reduces the number of features automatically
Answer **C) It prevents data leakage during cross-validation by ensuring transformers are fit only on training folds** When you pass a `Pipeline` to `cross_val_score`, each fold's transformers are fit exclusively on the training portion and then applied to the validation portion. This ensures that no information from the validation fold leaks into the preprocessing steps.
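
A minimal sketch of this pattern, using synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # refit on each fold's training portion
    ("clf", LogisticRegression()),
])

# Inside each of the 5 folds, the scaler is fit only on that fold's
# training data, then applied to the held-out validation portion.
scores = cross_val_score(pipe, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```

Had we scaled `X` once up front and passed the scaled array to `cross_val_score`, every validation fold would have leaked into the scaler's statistics.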

Question 8

In the context of missing data, what does "MAR" stand for, and what does it mean?

  • A) Missing Always at Random---values are deleted randomly with equal probability
  • B) Missing at Random---missingness depends on observed data but not the missing values themselves
  • C) Missing Across Rows---entire rows of data are removed
  • D) Missing and Replaced---values are missing because they were replaced during data collection
Answer **B) Missing at Random---missingness depends on observed data but not the missing values themselves** Under MAR, the probability that a value is missing may depend on other observed variables but not on the missing value itself. For example, younger respondents might be less likely to report income (missingness depends on age, an observed variable), but the actual income value does not influence whether it is missing.

Question 9

Which of the following is an example of target leakage?

  • A) Using the customer's age to predict purchase amount
  • B) Using the total refund amount to predict whether a product was returned
  • C) Using one-hot encoding for a 5-category feature
  • D) Applying log transformation to a skewed feature
Answer **B) Using the total refund amount to predict whether a product was returned** The total refund amount is only known after the return event occurs. Including it as a feature means the model has access to information that would not be available at prediction time. This is a classic case of target leakage---the feature is a direct consequence of the target, not a predictor of it.

Question 10

What does the `remainder='passthrough'` parameter do in `ColumnTransformer`?

  • A) Drops all columns not specified in any transformer
  • B) Raises an error for unspecified columns
  • C) Includes unspecified columns in the output without any transformation
  • D) Applies the default transformer to unspecified columns
Answer **C) Includes unspecified columns in the output without any transformation** By default, `ColumnTransformer` drops columns that are not assigned to any transformer (`remainder='drop'`). Setting `remainder='passthrough'` instead passes those columns through to the output unchanged.
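
A small sketch of the behavior (toy data): column 0 is scaled, column 1 rides along untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

ct = ColumnTransformer(
    [("scale", StandardScaler(), [0])],  # only column 0 is transformed
    remainder="passthrough",             # column 1 passes through unchanged
)
out = ct.fit_transform(X)
print(out)
```

With the default `remainder='drop'`, the output would contain only the scaled column 0.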

Question 11

When using `PolynomialFeatures` with degree 2 and 5 input features (no bias), how many output features are produced?

  • A) 10
  • B) 15
  • C) 20
  • D) 25
Answer **C) 20** The formula is $\binom{n + d}{d} - 1 = \binom{5 + 2}{2} - 1 = \binom{7}{2} - 1 = 21 - 1 = 20$. This includes the 5 original features, 5 squared terms ($x_i^2$), and 10 interaction terms ($x_i x_j$ for $i < j$).
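
The count is easy to verify directly:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(0).random((4, 5))  # any data with 5 features
poly = PolynomialFeatures(degree=2, include_bias=False)
n_out = poly.fit_transform(X).shape[1]
print(n_out)  # 5 linear + 5 squared + 10 interaction terms = 20
```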

Question 12

Which imputation method models each feature with missing values as a function of other features?

  • A) Mean imputation
  • B) KNN imputation
  • C) Iterative imputation
  • D) Mode imputation
Answer **C) Iterative imputation** `IterativeImputer` models each feature with missing values as a dependent variable in a regression, using all other features as predictors. It iterates over features, refining imputations in multiple rounds. KNN imputation uses nearest neighbors but does not explicitly model features as functions of each other.
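
A sketch on toy data where column 1 is exactly twice column 0, so the regression-based imputer should recover a value close to 6.0 for the missing cell:

```python
import numpy as np
# IterativeImputer is still experimental; this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],   # true value would be 6.0
              [4.0, 8.0]])

# Each feature with missing values is modeled as a regression on the others.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print("imputed value:", X_filled[2, 1])
```

Mean imputation would have filled in the column mean (about 4.67) instead, ignoring the relationship between the two columns.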

Question 13

What is the purpose of smoothing in target encoding?

  • A) To make the encoded values continuous
  • B) To blend category-level means with the global mean, reducing overfitting for rare categories
  • C) To normalize the encoded values to the range [0, 1]
  • D) To remove missing values before encoding
Answer **B) To blend category-level means with the global mean, reducing overfitting for rare categories** Without smoothing, a category that appears only once gets an encoded value equal to that single sample's target value, which is unreliable. Smoothing blends the category mean with the global mean, with the blend controlled by sample size---rare categories are pulled more strongly toward the global mean.
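
One common smoothing formula (additive smoothing; `m` here is a hypothetical tunable prior weight, not a value from the chapter) can be sketched in a few lines:

```python
# Hypothetical numbers: a rare category seen once with target value 1.0,
# against a global target mean of 0.30.
global_mean = 0.30
category_mean = 1.0
n = 1     # number of samples in the category
m = 10    # smoothing strength: weight given to the global mean

# Blend weighted by sample count: rare categories are pulled toward
# the global mean; frequent categories keep their own mean.
encoded = (n * category_mean + m * global_mean) / (n + m)
print(encoded)  # 4.0 / 11 ≈ 0.364, far from the unreliable 1.0
```

With `n = 1000` instead, the same formula would yield a value very close to the category's own mean of 1.0.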

Question 14

In the double-underscore notation `pipeline.set_params(preprocessor__num__scaler__with_mean=False)`, what does each level represent?

  • A) module__class__method__parameter
  • B) pipeline_step__column_transformer_name__sub_pipeline_step__parameter
  • C) data_type__feature_name__transform__option
  • D) outer_model__inner_model__layer__weight
Answer **B) pipeline_step__column_transformer_name__sub_pipeline_step__parameter** The double-underscore notation navigates the nested structure: `preprocessor` is the name of the `ColumnTransformer` step in the outer pipeline, `num` is the name of a transformer within the `ColumnTransformer`, `scaler` is a step within that sub-pipeline, and `with_mean` is the parameter being set on the scaler.
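
A minimal sketch of the nested structure this notation navigates:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# "num" is a named transformer inside the ColumnTransformer,
# and "scaler" is a step inside that sub-pipeline.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("scaler", StandardScaler())]), [0, 1]),
])
pipe = Pipeline([("preprocessor", preprocessor),
                 ("clf", LogisticRegression())])

# step __ transformer-name __ sub-step __ parameter
pipe.set_params(preprocessor__num__scaler__with_mean=False)
print(pipe.get_params()["preprocessor__num__scaler__with_mean"])
```

The same keys work in `GridSearchCV` param grids, which is the most common place this notation appears.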

Question 15

Which of the following is NOT a valid strategy for `KBinsDiscretizer`?

  • A) 'uniform' (equal-width bins)
  • B) 'quantile' (equal-frequency bins)
  • C) 'kmeans' (k-means clustering-based bins)
  • D) 'logarithmic' (log-scale bins)
Answer **D) `'logarithmic'` (log-scale bins)** `KBinsDiscretizer` supports three strategies: `'uniform'` (equal-width), `'quantile'` (equal-frequency), and `'kmeans'` (based on k-means clustering). There is no built-in `'logarithmic'` strategy. To achieve log-scale binning, you would first apply a log transformation and then use uniform binning.
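
A sketch of the three supported strategies, plus the log-then-uniform workaround, on toy data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [4.0], [8.0], [16.0], [32.0]])

# The three supported strategies; there is no 'logarithmic' option.
for strategy in ("uniform", "quantile", "kmeans"):
    kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, kb.fit_transform(X).ravel())

# Log-scale binning: apply the log transform first, then bin uniformly.
kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
log_bins = kb.fit_transform(np.log(X))
print("log-uniform", log_bins.ravel())
```

Because the data is exactly geometric, the log-then-uniform approach assigns two values to each bin, whereas plain uniform binning lumps most values into the first bin.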

Question 16

Why might impurity-based feature importance from a random forest be misleading?

  • A) It only works for regression tasks
  • B) It is biased toward high-cardinality features that create many potential splits
  • C) It requires the features to be scaled first
  • D) It cannot handle categorical features
Answer **B) It is biased toward high-cardinality features that create many potential splits** Features with many unique values (high cardinality) provide more potential split points, increasing their chance of being selected and appearing "important" even when they have no true predictive value. Permutation importance offers a more reliable alternative by measuring the actual performance degradation when a feature's values are shuffled.

Question 17

What does GroupKFold prevent that standard KFold does not?

  • A) Imbalanced class distributions in folds
  • B) Related samples (e.g., same customer) appearing in both train and validation folds
  • C) Missing values in the validation fold
  • D) Overfitting to the training fold
Answer **B) Related samples (e.g., same customer) appearing in both train and validation folds** When multiple samples belong to the same group (e.g., multiple transactions from the same customer), standard `KFold` may split them across training and validation sets. This creates leakage because the model effectively "sees" data from validation-set entities during training. `GroupKFold` ensures all samples from a group stay in the same fold.
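
A small sketch with three hypothetical customers, two samples each, showing that no group ever straddles a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["cust_a", "cust_a", "cust_b", "cust_b", "cust_c", "cust_c"])

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X, y, groups):
    # The intersection of train-side and validation-side groups is empty.
    overlap = set(groups[train_idx]) & set(groups[val_idx])
    print("val groups:", sorted(set(groups[val_idx])), "overlap:", overlap)
```

Plain `KFold(n_splits=3)` on the same data would put the two `cust_a` samples in different folds, leaking customer-level information.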

Question 18

Which of the following transformations requires strictly positive input values?

  • A) Yeo-Johnson
  • B) Box-Cox
  • C) Standard scaling
  • D) Min-Max scaling
Answer **B) Box-Cox** The Box-Cox transformation is only defined for strictly positive values ($x > 0$). The Yeo-Johnson transformation is a generalization that handles zero and negative values as well. Standard scaling and min-max scaling work with any real-valued inputs.
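
Both transformations are available through `PowerTransformer`; a quick sketch of the constraint (toy data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X_pos = np.array([[1.0], [2.0], [5.0], [10.0]])
X_mixed = np.array([[-1.0], [0.0], [2.0], [5.0]])

# Box-Cox works only on strictly positive data...
PowerTransformer(method="box-cox").fit_transform(X_pos)  # fine
boxcox_error = ""
try:
    PowerTransformer(method="box-cox").fit_transform(X_mixed)
except ValueError as e:
    boxcox_error = str(e)
    print("Box-Cox rejected non-positive input")

# ...while Yeo-Johnson handles zeros and negatives without complaint.
yj = PowerTransformer(method="yeo-johnson").fit_transform(X_mixed)
print(yj.ravel())
```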

Question 19

What is the correct order of operations when building a supervised ML model?

  • A) Feature engineering -> Train/test split -> Model training -> Evaluation
  • B) Train/test split -> Feature engineering (fit on train) -> Model training -> Evaluation
  • C) Model training -> Feature engineering -> Train/test split -> Evaluation
  • D) Feature engineering -> Model training -> Train/test split -> Evaluation
Answer **B) Train/test split -> Feature engineering (fit on train) -> Model training -> Evaluation** You must split the data before any preprocessing to prevent data leakage. Feature engineering steps (scaling, encoding, imputation) are fitted exclusively on the training set and then applied (without refitting) to the test set. This is exactly what scikit-learn Pipelines automate.
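
The correct order, written out manually (synthetic data), looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Fit preprocessing on the training set only...
scaler = StandardScaler().fit(X_train)

# 3. ...then apply (without refitting) to both sets.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The scaler's statistics come from the training rows alone.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```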

Question 20

What is "training-serving skew"?

  • A) When the model performs better on training data than test data
  • B) When features are computed differently at training time vs. production serving time
  • C) When the training data is imbalanced
  • D) When the model weights are skewed toward certain classes
Answer **B) When features are computed differently at training time vs. production serving time** Training-serving skew occurs when the feature computation logic in production differs from what was used during training. For example, if training features used a specific version of a lookup table that has since been updated, or if datetime features are computed in a different timezone. Feature stores help prevent this by centralizing feature definitions.

Question 21

Which of the following scenarios would `handle_unknown='ignore'` in `OneHotEncoder` help with?

  • A) When the training data has missing values
  • B) When the test data contains a category not seen during training
  • C) When the feature has too many categories
  • D) When the encoding produces dense arrays
Answer **B) When the test data contains a category not seen during training** With `handle_unknown='ignore'`, if a new category appears at prediction time that was not in the training data, the encoder sets all binary columns to zero instead of raising an error. This is essential for production systems where new categories may appear over time.

Question 22

What is the main advantage of `SelectFromModel` over `SelectKBest`?

  • A) It is faster to compute
  • B) It uses model-specific importance measures rather than univariate statistics
  • C) It works only with tree-based models
  • D) It does not require specifying the number of features
Answer **B) It uses model-specific importance measures rather than univariate statistics** `SelectFromModel` leverages the feature importances or coefficients learned by a trained model (such as L1-regularized regression or random forest), which can capture multivariate interactions. `SelectKBest` evaluates each feature independently using univariate statistical tests, potentially missing features that are only useful in combination.

Question 23

When creating lag features for time series data, which of the following is a source of data leakage?

  • A) Using a lag of 1 (previous time step)
  • B) Using the current time step's value as a feature
  • C) Computing a rolling mean that includes future values
  • D) Both B and C
Answer **D) Both B and C** Using the current time step's value to predict the current target is leakage (the value may not be available at prediction time). Computing a rolling mean that includes future observations also constitutes leakage because future data would not be available when making real-time predictions. Only backward-looking features (lags, trailing averages) are safe.
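
A small pandas sketch contrasting safe, backward-looking features with a leaky centered rolling mean (toy series):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Safe: backward-looking features only.
lag_1 = s.shift(1)                            # previous value
trailing_mean = s.shift(1).rolling(2).mean()  # mean of the two prior values

# Leaky: a centered rolling window peeks at future values.
leaky_mean = s.rolling(3, center=True).mean()

print(pd.DataFrame({"y": s, "lag_1": lag_1,
                    "trailing": trailing_mean, "leaky": leaky_mean}))
```

At index 1, the centered mean averages indices 0, 1, and 2, so it already contains the future value 30, which would not exist at prediction time.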

Question 24

What happens when you apply `VarianceThreshold(threshold=0)` to a dataset?

  • A) It removes all features
  • B) It removes only features with exactly zero variance (constant features)
  • C) It keeps all features unchanged
  • D) It normalizes all features to zero variance
Answer **B) It removes only features with exactly zero variance (constant features)** A threshold of 0 means any feature with variance greater than 0 is kept. Only features where every value is identical (variance = 0) are removed. This is a useful sanity check to eliminate constant columns that provide no information.
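
A quick sketch on toy data with one constant column:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 7.0, 0.5],
    [2.0, 7.0, 0.5],
    [3.0, 7.0, 0.9],
])  # column 1 is constant

selector = VarianceThreshold(threshold=0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)        # only the constant column is dropped
print(selector.get_support()) # mask of kept features
```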

Question 25

In a `ColumnTransformer`, what is the effect of listing the same column in two different transformers?

  • A) The column is transformed twice and both results are concatenated
  • B) The second transformer overrides the first
  • C) An error is raised
  • D) The column is silently dropped
Answer **A) The column is transformed twice and both results are concatenated** Scikit-learn's `ColumnTransformer` applies each transformer independently to its specified columns and concatenates all results. If a column appears in two transformers, it is processed by both, and both transformed versions appear in the output. This is rarely desirable and is usually a mistake.
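
A sketch demonstrating the duplication (toy data, deliberately misconfigured):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# Column 0 is listed in BOTH transformers...
ct = ColumnTransformer([
    ("std", StandardScaler(), [0]),
    ("minmax", MinMaxScaler(), [0]),
])
out = ct.fit_transform(X)
print(out.shape)  # ...so one input column yields two output columns
print(out)
```

No error is raised; the single input column simply appears twice in the output, once per transformer.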