Chapter 9: Exercises
These exercises progress from foundational concepts to advanced applications. Each exercise is tagged with a difficulty level: (B) Basic, (I) Intermediate, or (A) Advanced.
Section A: Numerical Feature Transformations
Exercise 9.1 (B) --- Comparing Scalers
Given the following dataset with outliers, apply StandardScaler, MinMaxScaler, and RobustScaler. Compare the transformed distributions by computing the mean, median, standard deviation, minimum, and maximum for each. Which scaler is least affected by the outliers?
import numpy as np

np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 10, 95),
    np.array([200, 250, 300, 350, 500])  # outliers
]).reshape(-1, 1)
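One way to set up the comparison (a sketch; describe() reports the mean, std, min, max, and the median as the 50% quantile):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

scalers = {
    'standard': StandardScaler(),
    'minmax': MinMaxScaler(),
    'robust': RobustScaler(),
}
# Fit each scaler on the same data and summarize the transformed values.
summary = {
    name: pd.Series(scaler.fit_transform(data).ravel()).describe()
    for name, scaler in scalers.items()
}
print(pd.DataFrame(summary).round(3))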
Exercise 9.2 (B) --- Log Transform and Skewness
Load a Boston-style housing dataset or create a synthetic one (the original Boston dataset has been removed from recent versions of scikit-learn). Identify all features with skewness greater than 1.0. Apply np.log1p to those features and verify that the skewness is reduced. Report the before and after skewness for each transformed feature.
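A minimal sketch of the skewness check, assuming your features live in a pandas DataFrame named df (the name is illustrative):

import numpy as np

skew = df.skew(numeric_only=True)
skewed_cols = skew[skew > 1.0].index  # candidates for the log transform

for col in skewed_cols:
    # log1p requires values greater than -1; housing-style features are non-negative
    print(f"{col}: skewness {df[col].skew():.2f} -> {np.log1p(df[col]).skew():.2f}")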
Exercise 9.3 (I) --- Binning Strategies
Create a synthetic feature with a bimodal distribution (two Gaussian peaks). Apply equal-width, equal-frequency, and k-means binning with 5 bins each. Visualize the bin edges for each strategy. Which strategy best captures the bimodal structure?
np.random.seed(42)
bimodal = np.concatenate([
    np.random.normal(25, 5, 500),
    np.random.normal(75, 5, 500)
]).reshape(-1, 1)
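KBinsDiscretizer implements all three strategies ('uniform' is equal-width, 'quantile' is equal-frequency), so one loop covers the comparison; a sketch using the array above:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

for strategy in ['uniform', 'quantile', 'kmeans']:
    kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy=strategy)
    kbd.fit(bimodal)
    # bin_edges_ holds one array of edges per input feature
    print(strategy, np.round(kbd.bin_edges_[0], 1))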
Exercise 9.4 (I) --- Polynomial Feature Explosion
Starting with 10 features, compute the number of output features for polynomial degrees 2, 3, and 4 (without bias, including interactions). Then actually generate degree-2 polynomial features using PolynomialFeatures and verify the count matches the formula $\binom{n+d}{d} - 1$.
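A sketch of the verification, using scipy's comb for the binomial coefficient:

import numpy as np
from scipy.special import comb
from sklearn.preprocessing import PolynomialFeatures

n = 10
for d in [2, 3, 4]:
    expected = comb(n + d, d, exact=True) - 1
    print(f"degree {d}: {expected} features")

X = np.random.randn(5, n)
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X).shape[1])  # should print 65, matching degree 2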
Exercise 9.5 (A) --- Custom Power Transformer
Implement a custom scikit-learn transformer that applies the Box-Cox transformation with a user-specified $\lambda$. Your transformer must meet the requirements below (a starter skeleton follows the list):
- Inherit from BaseEstimator and TransformerMixin.
- Handle the $\lambda = 0$ (log) case.
- Raise a ValueError if any input values are not strictly positive.
- Include an inverse_transform method.
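A starter skeleton that meets these requirements (a sketch; the class name and the lmbda parameter are illustrative choices, not a fixed API):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """Box-Cox transform with a fixed, user-specified lambda."""

    def __init__(self, lmbda=0.0):
        self.lmbda = lmbda

    def _check_positive(self, X):
        X = np.asarray(X, dtype=float)
        if np.any(X <= 0):
            raise ValueError("Box-Cox requires strictly positive inputs.")
        return X

    def fit(self, X, y=None):
        self._check_positive(X)
        return self

    def transform(self, X):
        X = self._check_positive(X)
        if self.lmbda == 0:
            return np.log(X)  # the lambda = 0 limiting case
        return (X ** self.lmbda - 1) / self.lmbda

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.lmbda == 0:
            return np.exp(X)
        return (self.lmbda * X + 1) ** (1 / self.lmbda)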
Section B: Categorical Encoding
Exercise 9.6 (B) --- One-Hot Encoding by Hand
Without using scikit-learn, implement one-hot encoding for the following series. Handle the drop='first' option to avoid multicollinearity.
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'blue', 'red', 'green', 'yellow'])
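Whatever you implement, pandas' get_dummies is a convenient reference to check your output against; its drop_first flag mirrors drop='first':

reference = pd.get_dummies(colors, drop_first=True)
print(reference)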
Exercise 9.7 (I) --- Target Encoding with Smoothing
Implement target encoding with Laplace smoothing for the following classification dataset. Use a smoothing factor of 10. Verify that categories with few samples are pulled toward the global mean.
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducible targets
df = pd.DataFrame({
    'city': ['NYC'] * 50 + ['LA'] * 50 + ['SF'] * 5 + ['Boston'] * 3,
    'target': np.random.binomial(1, [0.7] * 50 + [0.3] * 50 + [0.9] * 5 + [0.5] * 3)
})
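A sketch of the smoothed encoding, where the blend formula pulls small categories toward the global mean:

m = 10  # smoothing factor
global_mean = df['target'].mean()
stats = df.groupby('city')['target'].agg(['mean', 'count'])

# Smoothed mean: (n * category_mean + m * global_mean) / (n + m)
encoding = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['city_encoded'] = df['city'].map(encoding)
print(encoding.round(3))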
Exercise 9.8 (I) --- Encoding Strategy Comparison
Using a real or synthetic dataset with a categorical feature containing 50 unique values, compare model performance (logistic regression) using:
1. One-hot encoding
2. Ordinal encoding
3. Frequency encoding
4. Target encoding (with proper cross-validation to avoid leakage)
Report the cross-validated accuracy for each approach.
Exercise 9.9 (A) --- Leave-One-Out Target Encoding
Implement leave-one-out target encoding, where each sample's encoded value excludes the sample's own target when computing the category mean. Explain why this reduces overfitting compared to naive target encoding.
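The core adjustment fits on one line: for sample $i$ in category $c$ with category target sum $S_c$ and count $n_c$, the encoded value is $(S_c - y_i)/(n_c - 1)$. A vectorized sketch, assuming a DataFrame like the one in Exercise 9.7:

grp = df.groupby('city')['target']
sums = grp.transform('sum')
counts = grp.transform('count')

# Exclude each row's own target; singleton categories (n_c = 1) need a
# fallback, here the global mean.
loo = (sums - df['target']) / (counts - 1)
df['city_loo'] = loo.where(counts > 1, df['target'].mean())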
Section C: Text Features
Exercise 9.10 (B) --- Bag of Words vs. TF-IDF
Given the following corpus, compute both BoW and TF-IDF representations. Identify the top 5 most important words according to each method for the first document. Explain why the rankings differ.
corpus = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning techniques",
    "Computer vision is another application of deep learning",
    "Reinforcement learning is different from supervised learning"
]
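A sketch of both representations; get_feature_names_out aligns columns with vocabulary terms:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

for name, vec in [('BoW', CountVectorizer()), ('TF-IDF', TfidfVectorizer())]:
    X = vec.fit_transform(corpus)
    first_doc = X.toarray()[0]
    top5 = np.argsort(first_doc)[::-1][:5]
    print(name, vec.get_feature_names_out()[top5])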
Exercise 9.11 (I) --- N-gram Analysis
Using the same corpus, compare unigram, bigram, and trigram TF-IDF representations. For each, report the vocabulary size and the top 10 features by mean TF-IDF score across all documents. What information do bigrams and trigrams capture that unigrams miss?
Exercise 9.12 (I) --- Custom Text Preprocessor in a Pipeline
Build a scikit-learn Pipeline that includes:
1. A custom FunctionTransformer that lowercases text, removes punctuation, and removes stop words.
2. A TfidfVectorizer with a maximum of 1,000 features.
3. A logistic regression classifier.
Test it on a binary text classification task (e.g., sentiment analysis with synthetic data).
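A sketch of the pipeline skeleton; the stop-word list here is a toy stand-in for a real one:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

STOP_WORDS = {'the', 'a', 'is', 'of', 'and'}  # toy list for illustration

def clean_texts(texts):
    cleaned = []
    for t in texts:
        t = re.sub(r'[^\w\s]', '', t.lower())  # lowercase, strip punctuation
        cleaned.append(' '.join(w for w in t.split() if w not in STOP_WORDS))
    return cleaned

pipe = Pipeline([
    ('clean', FunctionTransformer(clean_texts)),
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('clf', LogisticRegression()),
])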
Section D: Datetime Features
Exercise 9.13 (B) --- Temporal Feature Extraction
Given a DataFrame with a timestamp column, extract the following features: year, month, day of week, hour, is_weekend, quarter, and day_of_year. Then create cyclical encodings for month and hour.
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=1000, freq='h')
df = pd.DataFrame({'timestamp': dates, 'value': np.random.randn(1000)})
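Cyclical encoding maps a periodic feature onto a circle via $\sin(2\pi x / \text{period})$ and $\cos(2\pi x / \text{period})$, so that hour 23 lands next to hour 0. A sketch for the frame above:

df['hour'] = df['timestamp'].dt.hour
df['month'] = df['timestamp'].dt.month

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
# Months run 1-12, so shift to 0-11 before dividing by the period
df['month_sin'] = np.sin(2 * np.pi * (df['month'] - 1) / 12)
df['month_cos'] = np.cos(2 * np.pi * (df['month'] - 1) / 12)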
Exercise 9.14 (I) --- Time Since Events
Create a feature that computes the number of days since each customer's first purchase and the number of days since their most recent purchase. Use the following data:
purchases = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'purchase_date': pd.to_datetime([
        '2023-01-15', '2023-03-20', '2023-06-10',
        '2023-02-01', '2023-07-15',
        '2023-01-01', '2023-04-01', '2023-08-01', '2023-12-01'
    ]),
    'amount': [100, 200, 150, 300, 250, 50, 75, 100, 200]
})
reference_date = pd.Timestamp('2024-01-01')
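A groupby sketch for the data above; per-customer min and max give the first and most recent purchase dates:

per_customer = purchases.groupby('customer_id')['purchase_date'].agg(
    first_purchase='min', last_purchase='max')
per_customer['days_since_first'] = (reference_date - per_customer['first_purchase']).dt.days
per_customer['days_since_last'] = (reference_date - per_customer['last_purchase']).dt.days
print(per_customer)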
Exercise 9.15 (A) --- Holiday and Event Features
Write a function that takes a date column and returns binary features for:
- US federal holidays (at minimum: New Year's Day, Independence Day, Thanksgiving, and Christmas).
- The day before and after each holiday.
- Whether the date falls within a "holiday season" (Nov 15 -- Jan 5).
Section E: Feature Selection
Exercise 9.16 (B) --- Variance Threshold
Generate a dataset with 20 features, where 5 features have near-zero variance (e.g., all values are the same except for 1% noise). Apply VarianceThreshold and verify that only the low-variance features are removed.
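A sketch of the setup; columns whose values are constant except for a 1% perturbation have variance near zero:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
X[:, :5] = 1.0                          # five near-constant features
mask = rng.random((500, 5)) < 0.01      # perturb roughly 1% of their entries
X[:, :5] += mask * rng.normal(scale=0.01, size=(500, 5))

vt = VarianceThreshold(threshold=0.05)
X_kept = vt.fit_transform(X)
print(X_kept.shape)                     # expect (500, 15)
print(np.where(~vt.get_support())[0])   # indices of the dropped features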
Exercise 9.17 (I) --- Filter vs. Wrapper Comparison
Using the Iris dataset (or a similar multi-class dataset):
1. Select the top 2 features using SelectKBest with f_classif.
2. Select the top 2 features using RFE with a decision tree classifier.
3. Compare the selected features. Are they the same? Train a classifier with each subset and compare accuracies.
Exercise 9.18 (I) --- Mutual Information vs. F-statistic
Create a synthetic dataset where:
- Feature 1 has a strong linear relationship with the target.
- Feature 2 has a strong nonlinear relationship (e.g., quadratic) with the target.
- Feature 3 is pure noise.
Apply both f_classif and mutual_info_classif. Which method correctly identifies Feature 2 as important? Explain why.
Exercise 9.19 (A) --- Stability of Feature Selection
Run SelectKBest (k=10) on 50 different bootstrap samples of a dataset with 30 features. For each feature, compute the fraction of times it was selected. Discuss the stability of the feature selection. Which features are consistently selected?
Exercise 9.20 (A) --- Feature Selection Within Cross-Validation
Demonstrate the difference between performing feature selection before cross-validation (wrong) and within cross-validation (correct). Use a dataset with many noise features and show that the "wrong" approach gives inflated accuracy estimates.
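The correct version is easiest to express by putting the selector inside a Pipeline, so cross_val_score refits it on every training fold; a sketch with synthetic noise features:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=5, random_state=42)

# Wrong: the selector has already seen the test folds.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# Correct: selection is refit inside each training fold.
pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('clf', LogisticRegression())])
right = cross_val_score(pipe, X, y, cv=5).mean()
print(f"wrong: {wrong:.3f}  correct: {right:.3f}")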
Section F: Pipelines and ColumnTransformers
Exercise 9.21 (B) --- Basic Pipeline
Build a Pipeline that chains SimpleImputer(strategy='median'), StandardScaler(), and LogisticRegression(). Cross-validate it on a dataset with missing values. Then modify the pipeline to use MinMaxScaler instead and compare the scores.
Exercise 9.22 (I) --- ColumnTransformer with Mixed Types
Create a ColumnTransformer that:
- Applies median imputation + standard scaling to numerical columns.
- Applies most-frequent imputation + one-hot encoding to categorical columns.
- Passes through a binary column unchanged.
Combine it with a RandomForestClassifier in a Pipeline and cross-validate.
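A sketch of the transformer layout; the column lists are placeholders for your own dataset:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['age', 'income']   # placeholder column names
cat_cols = ['city']
bin_cols = ['is_member']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
    ('bin', 'passthrough', bin_cols),
])

model = Pipeline([('prep', preprocess),
                  ('clf', RandomForestClassifier(random_state=42))])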
Exercise 9.23 (I) --- Hyperparameter Tuning in Pipelines
Using the pipeline from Exercise 9.22, perform a grid search over:
- Imputation strategy: ['mean', 'median']
- Number of trees: [50, 100, 200]
- Max depth: [3, 5, 10, None]
Report the best parameters and corresponding cross-validated score.
Exercise 9.24 (A) --- Custom Transformer in a Pipeline
Create a custom transformer OutlierClipper that clips values to the 1st and 99th percentiles. The percentiles should be learned from the training data in fit and applied in transform. Integrate it into a pipeline and verify that it does not leak information.
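A compact skeleton; learning the percentiles in fit (and only there) is what prevents test-set leakage:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierClipper(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Percentiles come from the training data only
        self.lower_ = np.percentile(X, 1, axis=0)
        self.upper_ = np.percentile(X, 99, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)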
Exercise 9.25 (A) --- Pipeline Serialization and Deployment
Build a complete pipeline for a classification task. After training:
1. Serialize it using joblib.dump.
2. Load it in a simulated "production" environment.
3. Make predictions on new data.
4. Verify that the predictions are identical.
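A minimal sketch of the round trip; pipe and X_new stand in for your fitted pipeline and new data:

import joblib
import numpy as np

joblib.dump(pipe, 'model.joblib')     # after training
loaded = joblib.load('model.joblib')  # simulated production load
assert np.array_equal(pipe.predict(X_new), loaded.predict(X_new))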
Section G: Missing Data
Exercise 9.26 (B) --- Imputation Comparison
Create a dataset, introduce 20% missing values at random, and compare model performance using:
1. Mean imputation
2. Median imputation
3. KNN imputation (k=5)
4. Iterative imputation
Report the cross-validated RMSE for each method.
Exercise 9.27 (I) --- Missing Indicator Features
Take a dataset with missing values and compare two approaches:
1. Simple imputation only.
2. Simple imputation + binary missing indicator features.
Does the addition of missing indicators improve model performance? Under what conditions would you expect it to help?
Exercise 9.28 (A) --- MCAR vs. MAR Simulation
Simulate a dataset and introduce missing values under MCAR and MAR mechanisms:
- MCAR: Remove 20% of values in column A completely at random.
- MAR: Remove values in column A where column B exceeds its median.
For each mechanism, compare the performance of mean imputation vs. iterative imputation. Which imputation method is more robust to MAR missingness?
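A sketch of the two masking mechanisms, assuming a DataFrame df with numeric columns 'A' and 'B':

import numpy as np

rng = np.random.default_rng(42)

# MCAR: every value of A has the same 20% chance of going missing
df_mcar = df.copy()
df_mcar.loc[rng.random(len(df)) < 0.2, 'A'] = np.nan

# MAR: missingness in A depends on the observed value of B
df_mar = df.copy()
df_mar.loc[df_mar['B'] > df_mar['B'].median(), 'A'] = np.nan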
Section H: Data Leakage
Exercise 9.29 (I) --- Leakage Detection
The following code contains a data leakage bug. Identify the source of leakage and fix it.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Bug: preprocessing before cross-validation
# (X and y are assumed to be an existing feature matrix and label vector)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

scores = cross_val_score(
    LogisticRegression(), X_scaled, y, cv=5, scoring='accuracy'
)
print(f"Accuracy: {scores.mean():.4f}")
Exercise 9.30 (I) --- Target Leakage Investigation
You are building a model to predict whether a loan will default. Your dataset includes the following features:
- loan_amount, interest_rate, borrower_age, credit_score
- late_payment_count, collection_recovery_fee, total_payment
Identify which features are likely to cause target leakage and explain why.
Exercise 9.31 (A) --- Temporal Leakage Simulation
Create a time series dataset where the target at time $t$ depends on features at times $t-1$ and $t-2$. Demonstrate the performance difference between:
1. Random train/test split (leakage).
2. Chronological split (correct).
3. Time-series cross-validation with TimeSeriesSplit.
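For step 3, TimeSeriesSplit keeps every test fold strictly after its training fold in time; a minimal usage sketch (model, X, and y come from your own setup):

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)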
Section I: Integration and Challenge Problems
Exercise 9.32 (A) --- End-to-End Feature Engineering
Download or create a tabular dataset with at least:
- 3 numerical features (at least one skewed)
- 2 categorical features (at least one with > 10 categories)
- 1 text feature
- 1 datetime feature
- Missing values in at least 2 columns
Build a complete pipeline that handles all feature types, applies feature selection, and trains a gradient boosting model. Report cross-validated performance.
Exercise 9.33 (A) --- Feature Engineering Tournament
Starting with a raw dataset, iteratively engineer features and track model performance at each step:
1. Baseline: raw numerical features only.
2. Add scaled numerical features.
3. Add encoded categorical features.
4. Add interaction features.
5. Add datetime features.
6. Apply feature selection.
Create a table showing the cross-validated score at each stage.
Exercise 9.34 (A) --- Reproducibility Challenge
Build two separate pipelines that should produce identical results:
1. A manual pipeline where you apply each transformation step by step.
2. A scikit-learn Pipeline/ColumnTransformer version.
Verify that both produce the same predictions on a test set (within floating-point tolerance).
Exercise 9.35 (A) --- Adversarial Feature Engineering
Create a dataset with 100 features, where only 5 are truly predictive and the remaining 95 are noise. Add 10 "leaky" features that are derived from the target. Show that:
1. A naive model achieves near-perfect accuracy with the leaky features.
2. Removing the leaky features drops accuracy but gives a realistic estimate.
3. Feature selection can help identify the 5 truly predictive features.