Quiz: Chapter 7

Handling Categorical Data


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A dataset contains a color feature with values {red, blue, green}. A data scientist encodes it as red=1, blue=2, green=3 and trains a logistic regression. What is the problem?

  • A) The encoding creates too many features
  • B) The model interprets the integers as magnitudes, implying green has 3x the effect of red
  • C) The encoding cannot handle unseen categories
  • D) There is no problem; ordinal encoding works for all feature types

Answer: B) The model interprets the integers as magnitudes, implying green has 3x the effect of red. In a linear model, the learned coefficient multiplies the encoded value. Since color is nominal (no inherent order), the integer assignment is arbitrary, and the model learns a meaningless magnitude relationship. A tree-based model would not have this problem because it splits on specific values, not magnitudes.


Question 2 (Short Answer)

Explain the dummy variable trap. When does it matter, and when can you safely ignore it?

Answer: The dummy variable trap occurs when all k one-hot encoded columns are included, creating perfect multicollinearity (for every row, the k columns sum to 1). For linear models (logistic regression, linear regression), this makes the design matrix rank-deficient, so the normal equations are singular and the coefficient estimates are unstable or non-unique. The fix is to drop one column (the reference category). For tree-based models, the dummy variable trap does not matter because trees split on individual features and are unaffected by multicollinearity. You can safely include all k columns when using random forests, gradient boosting, or other tree-based methods.


Question 3 (Multiple Choice)

A feature has 350 unique categorical values. Which encoding strategy is most likely to cause overfitting with a linear model?

  • A) Target encoding with smoothing and cross-validation
  • B) Frequency encoding
  • C) One-hot encoding
  • D) Hash encoding with 64 components

Answer: C) One-hot encoding. With 350 unique values, one-hot encoding creates 350 (or 349) binary features. Most of these features will have very few positive observations, leading to unstable coefficient estimates and overfitting. Target encoding compresses to 1 column, frequency encoding to 1 column, and hash encoding to 64 columns --- all much lower dimensionality.


Question 4 (Multiple Choice)

What is the primary purpose of the smoothing parameter m in target encoding?

  • A) To speed up the encoding computation
  • B) To handle missing values in the categorical feature
  • C) To regularize the encoding for categories with few observations by shrinking toward the global mean
  • D) To prevent hash collisions

Answer: C) To regularize the encoding for categories with few observations by shrinking toward the global mean. The formula (n * category_mean + m * global_mean) / (n + m) blends the category-specific mean with the global mean. For categories with large n, the category mean dominates. For categories with small n, the encoding is pulled toward the global mean, preventing overfitting to small-sample noise.
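The smoothing formula can be sketched directly in pandas; the column names and m=10 are illustrative:

```python
import pandas as pd

def smoothed_target_mean(df, cat_col, target_col, m=10.0):
    """(n * category_mean + m * global_mean) / (n + m) for each category."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df = pd.DataFrame({"city": ["A"] * 100 + ["B"] * 2,
                   "churn": [1] * 60 + [0] * 40 + [1, 1]})
enc = smoothed_target_mean(df, "city", "churn", m=10.0)
# "A" (n=100) stays close to its own mean of 0.60; "B" (n=2) is pulled from its
# raw mean of 1.0 toward the global mean (~0.61), damping small-sample noise.
print(enc)
```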


Question 5 (Short Answer)

A colleague fits a TargetEncoder on the full training set, then uses the encoded features to train and evaluate a model on the same training set. They report an AUC of 0.92. Explain what is wrong and what the likely AUC would be with proper cross-validation.

Answer: The colleague has introduced target leakage. When the target encoder computes category means using the full training set and then applies those means back to the same training set, each row's encoded value was computed using its own target value. This is circular --- the feature partially "knows" the answer. The AUC of 0.92 is inflated and will not generalize to new data. With proper cross-validation (computing encodings from out-of-fold data only), the AUC will likely be substantially lower, potentially 0.80-0.85 or less depending on the true signal strength. The correct approach is to put the target encoder inside a Pipeline and use cross_val_score.


Question 6 (Multiple Choice)

Which encoding strategy has zero risk of target leakage?

  • A) Target encoding
  • B) Leave-one-out encoding
  • C) Frequency encoding
  • D) All encoding strategies carry some leakage risk

Answer: C) Frequency encoding. Frequency encoding uses only the distribution of the categorical feature itself (counts or proportions) and never looks at the target variable. Target encoding and leave-one-out encoding both compute statistics from the target, creating leakage risk if not implemented with cross-validation.
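A sketch of frequency encoding in pandas; the browser values are illustrative. Note that no target column appears anywhere:

```python
import pandas as pd

train = pd.Series(["chrome", "chrome", "safari", "firefox", "chrome", "safari"])

# Each category maps to its share of the training data; the target is never touched.
freq = train.value_counts(normalize=True)
encoded = train.map(freq)
print(encoded.tolist())  # chrome -> 0.5, safari -> 1/3, firefox -> 1/6
```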


Question 7 (Multiple Choice)

You are encoding ICD-10 diagnosis codes (14,000+ unique values) for a hospital readmission model. Which approach is best?

  • A) One-hot encode all 14,000 codes
  • B) Drop the feature because it has too many categories
  • C) Use domain knowledge to group codes into ICD-10 chapters (21 groups), then one-hot encode the chapters
  • D) Group codes into chapters for one-hot encoding AND target-encode the 3-character category for finer granularity

Answer: D) Group codes into chapters for one-hot encoding AND target-encode the 3-character category for finer granularity. This multi-resolution approach captures both the broad diagnostic category (stable, low-dimensional) and the finer-grained diagnosis-to-outcome relationship (informative, requires regularization). Option A produces 14,000+ extremely sparse columns, most with too few observations for stable coefficient estimates. Option B discards a potentially informative feature. Option C loses granularity that the 3-character target encoding preserves.


Question 8 (Short Answer)

Explain why tree-based models (random forest, gradient boosting) are more robust to encoding choice than linear models. Use the concept of how each model type processes a feature.

Answer: Tree-based models split on individual threshold values (e.g., "is feature <= 2.5?"), so they treat each encoded integer as a potential split point rather than as a magnitude. This means ordinal encoding of a nominal feature is harmless --- the tree can isolate any individual category through splits. Linear models learn a single coefficient per feature and multiply it by the encoded value, so the numeric assignment directly affects the prediction. An arbitrary integer assignment for a nominal feature creates a false magnitude relationship that a linear model cannot override.


Question 9 (Multiple Choice)

In production, a new category appears that was not in the training data. Which of the following is the best practice?

  • A) Retrain the model immediately
  • B) Raise an error and reject the prediction
  • C) Map the unseen category to a safe default (all zeros for OHE, global mean for target encoding) and log the event for monitoring
  • D) Randomly assign the unseen category to an existing category

Answer: C) Map the unseen category to a safe default (all zeros for OHE, global mean for target encoding) and log the event for monitoring. Production models must handle unseen inputs gracefully. Mapping to a safe default produces a reasonable prediction (equivalent to "average" behavior), while logging allows the team to monitor whether unseen categories are becoming frequent enough to warrant retraining. Option A is operationally impractical for every new category. Option B causes outages. Option D introduces randomness.


Question 10 (Multiple Choice)

Binary encoding of a feature with 47 categories produces how many columns?

  • A) 47
  • B) 46
  • C) 6
  • D) 23

Answer: C) 6. Binary encoding converts the integer-encoded category to binary representation. ceil(log2(47)) = ceil(5.55) = 6 binary digits are needed to represent 47 unique values (since 2^6 = 64 >= 47). This is a significant compression from 47 one-hot columns.
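The column count can be checked with a one-line helper; the function name is illustrative:

```python
import math

def binary_encoding_width(n_categories: int) -> int:
    """Number of binary columns needed to represent n_categories distinct values."""
    return math.ceil(math.log2(n_categories))

print(binary_encoding_width(47))  # 6, since 2**5 = 32 < 47 <= 64 = 2**6
```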


Question 11 (Short Answer)

A data scientist argues: "I always use target encoding because it compresses any categorical feature to a single column." Give two scenarios where target encoding is not the best choice.

Answer: First, target encoding is not appropriate when the target variable is not available, such as in unsupervised learning (clustering, anomaly detection). Since target encoding requires the target to compute category means, it cannot be used when there is no target. Second, target encoding is unnecessarily complex for low-cardinality nominal features (e.g., 4 device types). One-hot encoding is simpler, has no leakage risk, preserves full information about each category, and adds only 3-4 columns. The overhead of cross-validated target encoding is not justified when the cardinality is low enough for one-hot encoding.


Question 12 (Multiple Choice)

Which statement about hash encoding is correct?

  • A) Hash encoding guarantees no information loss
  • B) Hash encoding produces a variable number of columns depending on cardinality
  • C) Hash collisions cause unrelated categories to share the same encoded column
  • D) Hash encoding requires the target variable to compute

Answer: C) Hash collisions cause unrelated categories to share the same encoded column. This is the fundamental tradeoff of hash encoding: fixed dimensionality regardless of cardinality, at the cost of potential collisions. Option A is wrong because collisions cause information loss. Option B is wrong because hash encoding produces a fixed, user-specified number of columns. Option D is wrong because hashing uses only the feature values, not the target.
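A sketch of the fixed-width/collision tradeoff using scikit-learn's FeatureHasher; n_features=8 is deliberately tiny so collisions are guaranteed:

```python
from sklearn.feature_extraction import FeatureHasher

# Output width is fixed at n_features no matter how many categories ever appear.
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([[f"cat_{i}"] for i in range(100)]).toarray()

# 100 distinct categories must share at most 8 columns, so unrelated
# categories inevitably land in the same column (a collision).
cols_used = int((X != 0).any(axis=0).sum())
print(X.shape, cols_used)  # (100, 8) and a count <= 8
```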


Question 13 (Short Answer)

Explain the difference between the handle_unknown='ignore' and handle_unknown='error' settings in scikit-learn's OneHotEncoder. When would you use each?

Answer: With handle_unknown='error', the encoder raises a ValueError if it encounters a category not seen during fit(). With handle_unknown='ignore', unseen categories are encoded as all zeros (no known category matches). Use 'error' during development and testing to catch data quality issues early --- if an unexpected category appears, it might indicate a data pipeline bug. Use 'ignore' in production pipelines where you expect new categories to appear naturally (new products, new regions) and want the model to handle them gracefully by treating them as "none of the above."


Question 14 (Multiple Choice)

You are building a pipeline with ColumnTransformer for a logistic regression model. The subscription_plan feature has values {basic, standard, premium} with a clear ordering. Which encoding is correct?

  • A) OneHotEncoder(drop='first') because it is a categorical feature
  • B) OrdinalEncoder(categories=[['basic', 'standard', 'premium']]) because the feature has a natural order
  • C) TargetEncoder() because logistic regression benefits from target-encoded features
  • D) LabelEncoder() because it maps strings to integers

Answer: B) OrdinalEncoder(categories=[['basic', 'standard', 'premium']]) because the feature has a natural order. The subscription plan has a clear ordinal relationship (basic < standard < premium) that the integer encoding preserves. A linear model will learn a coefficient that captures the increasing effect of higher plan tiers, which is meaningful. One-hot encoding would work but loses the ordering information. LabelEncoder is designed for target variables, not features, and does not guarantee a meaningful ordering.
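A sketch of the correct encoder configuration; passing the explicit category list guarantees the integer codes follow the plan tiers rather than alphabetical order:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

plans = np.array([["basic"], ["premium"], ["standard"]])
enc = OrdinalEncoder(categories=[["basic", "standard", "premium"]])
codes = enc.fit_transform(plans).ravel()
print(codes)  # [0. 2. 1.] -- basic < standard < premium is preserved
```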


Question 15 (Short Answer)

A team compares one-hot encoding (47 columns) versus target encoding (1 column) for the primary_genre feature. The gradient boosted tree AUC is nearly identical (0.783 vs. 0.786). Give two practical reasons to prefer target encoding even when AUC is similar.

Answer: First, training time: a model with 47 additional columns takes longer to train and requires more memory, especially with large datasets. At 2.4 million rows, the one-hot matrix for genre alone is 112.8 million values. Target encoding's single column is 2.4 million values --- a 47x reduction. Second, handling new categories: if StreamFlow adds a new genre, the one-hot encoder must be retrained to add a new column, while the target encoder simply maps the new genre to the global mean. The target-encoded pipeline is more robust to production changes.


Solutions reference the concepts from Chapter 7. For additional practice, see the exercises.