Chapter 9: Feature Engineering and Data Pipelines

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." --- Andrew Ng

In the preceding chapters, we explored the theoretical foundations of machine learning---from linear models (Chapter 3) through tree-based methods (Chapter 6) and ensemble techniques (Chapter 7). We examined how models learn from data and how to evaluate their performance (Chapter 8). Yet there is a critical step that sits between raw data and a trained model, a step that often determines whether your project succeeds or fails: feature engineering.

Feature engineering is the art and science of transforming raw data into representations that machine learning algorithms can effectively exploit. A well-engineered feature set can make a simple model outperform a complex one trained on raw inputs. Conversely, no amount of algorithmic sophistication can compensate for poorly constructed features.

This chapter covers the complete feature engineering workflow: transforming numerical and categorical variables, extracting information from text and datetime fields, selecting the most informative features, building reproducible pipelines with scikit-learn, handling missing data responsibly, and---perhaps most importantly---avoiding the insidious trap of data leakage. By the end, you will have a practical toolkit for turning messy, real-world data into clean, predictive features.


9.1 Why Feature Engineering Matters

Consider a dataset containing house prices. The raw data might include lot_length and lot_width as separate columns. A linear model (Chapter 3) would learn separate coefficients for each, but the truly predictive quantity is the area---the product of these two features. Creating lot_area = lot_length * lot_width gives the model direct access to this relationship without requiring it to learn a multiplicative interaction from additive terms.

This simple example illustrates a broader principle: the representation of data matters as much as the algorithm applied to it. Feature engineering bridges the gap between what your data contains and what your model needs.
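
Here is a minimal pandas sketch of this idea (the column names and values are hypothetical):

import pandas as pd

# Hypothetical raw data with separate lot dimensions
houses = pd.DataFrame({
    'lot_length': [40.0, 55.0, 30.0],
    'lot_width': [25.0, 30.0, 20.0],
    'price': [250_000, 410_000, 180_000],
})

# Derived feature: the model now sees the multiplicative relationship directly
houses['lot_area'] = houses['lot_length'] * houses['lot_width']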

The Feature Engineering Lifecycle

Feature engineering is not a one-time activity. It follows an iterative cycle:

  1. Understand the domain --- What do the raw features mean? What relationships might exist?
  2. Transform features --- Apply mathematical transformations, encode categories, extract temporal patterns.
  3. Create new features --- Combine existing features, compute aggregates, derive domain-specific indicators.
  4. Select features --- Remove redundant or uninformative features that add noise.
  5. Evaluate --- Measure model performance with cross-validation (Chapter 8).
  6. Iterate --- Refine based on results, add new transformations, remove underperformers.

Throughout this chapter, we use the following imports consistently:

import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, LabelEncoder,
    KBinsDiscretizer, PolynomialFeatures,
    FunctionTransformer
)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

9.2 Numerical Feature Transformations

Numerical features are the most common type, but raw numerical values are rarely optimal for machine learning models. Different algorithms have different sensitivities to the scale and distribution of input features.

9.2.1 Feature Scaling

Why scale? Many algorithms---including linear regression (Chapter 3), logistic regression (Chapter 4), support vector machines (Chapter 5), and k-nearest neighbors---are sensitive to the magnitude of input features. A feature ranging from 0 to 1,000,000 will dominate a feature ranging from 0 to 1, regardless of their relative importance.

Standardization (Z-score normalization) transforms each feature to have zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature computed on the training set only.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit AND transform on training data
X_test_scaled = scaler.transform(X_test)         # Only transform on test data

Min-Max normalization rescales features to a fixed range, typically $[0, 1]$:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Robust scaling uses the median and interquartile range (IQR), making it resistant to outliers:

$$x_{\text{robust}} = \frac{x - Q_2}{Q_3 - Q_1}$$

where $Q_1$, $Q_2$, and $Q_3$ are the 25th, 50th, and 75th percentiles, respectively.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

When to use which?

| Scaler | Best For | Sensitive to Outliers? |
|---|---|---|
| StandardScaler | Normally distributed data, linear models, SVMs | Yes |
| MinMaxScaler | Neural networks, bounded algorithms | Yes |
| RobustScaler | Data with significant outliers | No |

Tree-based models (Chapter 6) are generally invariant to feature scaling because they make decisions based on thresholds, not magnitudes. However, it is still good practice to scale features when building pipelines that might include different model types.

9.2.2 Nonlinear Transformations

When features have skewed distributions, nonlinear transformations can improve model performance by making distributions more symmetric.

Log transformation compresses the range of large values:

$$x_{\text{log}} = \log(1 + x)$$

The $+1$ ensures the transformation is defined for $x = 0$. This is particularly useful for features like income, population, or counts that span several orders of magnitude.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X[['income']])

Box-Cox transformation is a parametric family that finds the optimal power transformation:

$$x_{\text{BC}} = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$

The parameter $\lambda$ is estimated from the data. Note that Box-Cox requires strictly positive values.

from sklearn.preprocessing import PowerTransformer

# Box-Cox (requires positive values)
bc_transformer = PowerTransformer(method='box-cox')
X_bc = bc_transformer.fit_transform(X_positive)

# Yeo-Johnson (handles zero and negative values)
yj_transformer = PowerTransformer(method='yeo-johnson')
X_yj = yj_transformer.fit_transform(X)

9.2.3 Binning (Discretization)

Binning converts continuous features into discrete intervals. This can capture nonlinear relationships in linear models and reduce the impact of outliers.

from sklearn.preprocessing import KBinsDiscretizer

# Equal-width binning
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X[['age']])

# Equal-frequency (quantile) binning
binner_q = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned_q = binner_q.fit_transform(X[['age']])

# K-means based binning
binner_k = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
X_binned_k = binner_k.fit_transform(X[['age']])

Caution: Binning discards information within each bin. Use it judiciously---primarily when you have domain knowledge that the relationship between the feature and target changes at specific thresholds (e.g., age brackets for insurance pricing).

9.2.4 Quantile Transformation

When neither log nor Box-Cox produces a sufficiently symmetric distribution, the quantile transformer maps any distribution to a target distribution (typically uniform or Gaussian) by using the empirical cumulative distribution function:

$$x_{\text{transformed}} = \Phi^{-1}(F_n(x))$$

where $F_n$ is the empirical CDF and $\Phi^{-1}$ is the inverse CDF of the target distribution. This forces any continuous feature to follow exactly a normal (or uniform) distribution, regardless of its original shape.

from sklearn.preprocessing import QuantileTransformer

# Transform to a Gaussian distribution
qt_normal = QuantileTransformer(
    output_distribution='normal',
    n_quantiles=1000,
    random_state=42
)
X_gaussian = qt_normal.fit_transform(X_train)

# Transform to a uniform distribution
qt_uniform = QuantileTransformer(
    output_distribution='uniform',
    n_quantiles=1000,
    random_state=42
)
X_uniform = qt_uniform.fit_transform(X_train)

When to use quantile transformation: This is particularly useful for features with heavy tails, multimodal distributions, or features that contain outliers that even robust scaling cannot handle. It is the most aggressive normalization technique and should be used with care, as it discards information about the original scale and spacing of values.

Caution: The quantile transformer is fitted on training data. For test data, values outside the training range are clipped to the minimum or maximum observed quantile. This means the transformer cannot extrapolate, which may cause issues if the test distribution differs significantly from training.

9.2.5 Polynomial and Interaction Features

Polynomial features create new features by raising existing ones to powers and computing interactions:

$$\text{For features } x_1, x_2 \text{ with degree 2: } [x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X[['feature_1', 'feature_2']])
print(poly.get_feature_names_out())

Setting interaction_only=True produces only interaction terms ($x_1 x_2$) without powers ($x_1^2$, $x_2^2$), which is useful when you suspect feature interactions but not polynomial relationships.

Warning: The number of polynomial features grows combinatorially. With $n$ features and degree $d$, the number of output features is $\binom{n + d}{d} - 1$. For 100 features with degree 2, that is over 5,000 new features---a recipe for overfitting without proper regularization (Chapter 3).
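
To see the growth concretely, this small sketch counts the outputs using only the standard library (values follow from the formula above):

from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Number of PolynomialFeatures outputs, excluding the bias column."""
    return comb(n_features + degree, degree) - 1

print(n_poly_features(2, 2))    # 5   -> [x1, x2, x1^2, x1*x2, x2^2]
print(n_poly_features(100, 2))  # 5150
print(n_poly_features(100, 3))  # 176850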


9.3 Categorical Feature Encoding

Most machine learning algorithms require numerical inputs. Categorical variables---such as color (red, blue, green), city names, or product categories---must be converted to numbers. The choice of encoding method significantly affects model performance.

9.3.1 One-Hot Encoding

One-hot encoding creates a binary column for each category. A feature with $k$ categories becomes $k$ binary features (or $k-1$ with drop='first' to avoid multicollinearity in linear models).

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[['color', 'city']])

Advantages:

  • No ordinal assumption imposed on categories.
  • Works well with linear models and neural networks.

Disadvantages:

  • High-cardinality features (e.g., ZIP codes with thousands of values) explode the feature space.
  • Does not capture relationships between categories.

The handle_unknown='ignore' parameter is essential for production systems. When the test set contains a category not seen during training, the encoder sets all binary columns to zero rather than raising an error.
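
A quick sketch of this behavior on toy data (drop is omitted here so the all-zero row is easy to see):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
enc.fit([['red'], ['blue'], ['green']])

# 'purple' was never seen during fit: every output column is zero
print(enc.transform([['blue'], ['purple']]))
# [[1. 0. 0.]
#  [0. 0. 0.]]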

9.3.2 Ordinal Encoding

When categories have a natural order (e.g., low < medium < high), ordinal encoding assigns sequential integers:

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[['low', 'medium', 'high']],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_encoded = encoder.fit_transform(X[['priority']])

When to use: Only when the categorical variable has a meaningful ordering. Applying ordinal encoding to nominal categories (like color) introduces a false ordering that can mislead distance-based algorithms.

9.3.3 Target Encoding (Mean Encoding)

Target encoding replaces each category with the mean of the target variable for that category. For a classification task where the target is binary:

$$\text{encoded}(c) = \frac{\sum_{i: x_i = c} y_i}{|\{i: x_i = c\}|}$$

This is powerful for high-cardinality features but carries a serious risk of data leakage (Section 9.8). We must use regularization and compute target statistics only on training folds.

from sklearn.preprocessing import TargetEncoder

# Available in scikit-learn >= 1.3
target_encoder = TargetEncoder(smooth='auto')
X_encoded = target_encoder.fit_transform(X[['city']], y)

For earlier scikit-learn versions or custom implementations:

def target_encode_with_smoothing(
    train: pd.DataFrame,
    col: str,
    target: str,
    min_samples: int = 20,
    smoothing: float = 10.0
) -> pd.Series:
    """Apply target encoding with smoothing to prevent overfitting.

    Args:
        train: Training DataFrame.
        col: Name of the categorical column.
        target: Name of the target column.
        min_samples: Minimum samples to trust category mean.
        smoothing: Smoothing factor (higher = more regularization).

    Returns:
        Encoded series with smoothed target means.
    """
    global_mean = train[target].mean()
    agg = train.groupby(col)[target].agg(['mean', 'count'])
    smooth_factor = 1 / (1 + np.exp(-(agg['count'] - min_samples) / smoothing))
    encoded = global_mean * (1 - smooth_factor) + agg['mean'] * smooth_factor
    return train[col].map(encoded).fillna(global_mean)

The smoothing function blends the category mean with the global mean. Categories with few observations are pulled toward the global mean, reducing overfitting.

9.3.4 Target Encoding: Deeper Treatment

Target encoding is one of the most powerful techniques for handling high-cardinality categorical features, but its potential for data leakage demands careful implementation. Let us examine the technique and its safeguards in more depth.

The problem with naive target encoding: If you compute the mean target for each category using the entire training set and then use those values as features, the model effectively has access to the target variable through a proxy. This is especially harmful for rare categories: if a category appears only once and the target is 1, the encoded value is 1.0---a perfect predictor that tells the model nothing useful about unseen data.

Smoothing mitigates this by blending the category mean with the global mean. The smoothing formula uses a sigmoid function that transitions from global mean (for rare categories) to category mean (for common categories):

$$\text{encoded}(c) = \lambda(n_c) \cdot \bar{y}_c + (1 - \lambda(n_c)) \cdot \bar{y}_{\text{global}}$$

where $\lambda(n_c) = \frac{1}{1 + \exp(-(n_c - k) / f)}$ is a smooth transition function, $n_c$ is the count of category $c$, $k$ is the minimum sample size threshold, and $f$ controls the smoothness of the transition.

Worked Example: Suppose the global mean target is 0.30, and we have:

  • Category "A" with 500 samples and mean target 0.45: $\lambda \approx 1.0$, so encoded $\approx 0.45$.
  • Category "B" with 5 samples and mean target 0.80: $\lambda \approx 0.05$, so encoded $\approx 0.05 \times 0.80 + 0.95 \times 0.30 = 0.325$.
  • Category "C" with 1 sample and mean target 1.00: $\lambda \approx 0.0$, so encoded $\approx 0.30$ (falls back to the global mean).

The smoothing correctly trusts the category-level statistic only when there are enough observations to make it reliable.

Leave-one-out target encoding provides another defense against leakage. For each training sample, the target encoding of its category excludes that sample's own target value:

$$\text{encoded}_i(c) = \frac{\sum_{j \neq i, x_j = c} y_j}{n_c - 1}$$

This prevents each sample from "seeing its own label" through the encoding.
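
A minimal pandas sketch of leave-one-out encoding, assuming a training DataFrame train with a categorical column col and a numeric or binary target column target (names are illustrative):

def leave_one_out_encode(train: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Leave-one-out target encoding: each row's own label is excluded."""
    sums = train.groupby(col)[target].transform('sum')
    counts = train.groupby(col)[target].transform('count')
    encoded = (sums - train[target]) / (counts - 1)
    # Singleton categories yield 0/0 = NaN; fall back to the global mean
    return encoded.fillna(train[target].mean())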

Cross-validated target encoding computes target statistics using only other folds, similar to how cross-validation prevents leakage during evaluation. Scikit-learn's TargetEncoder (available since version 1.3) implements this approach internally during fit_transform.

9.3.5 Frequency and Binary Encoding

Frequency encoding replaces categories with their frequency in the training set:

def frequency_encode(df: pd.DataFrame, col: str) -> pd.Series:
    """Encode categories by their frequency in the dataset.

    Args:
        df: Input DataFrame.
        col: Column to encode.

    Returns:
        Series with frequency-encoded values.
    """
    freq = df[col].value_counts(normalize=True)
    return df[col].map(freq)

Binary encoding represents each integer-encoded category in binary, creating $\lceil\log_2(k)\rceil$ columns for $k$ categories. This is a compromise between one-hot encoding (too many columns) and ordinal encoding (false ordering).

Worked Example: Consider a "city" feature with 8 categories. One-hot encoding creates 8 (or 7 with drop-first) columns. Binary encoding first maps each city to an integer (0--7), then represents each integer in binary:

| City | Integer | Bit 2 | Bit 1 | Bit 0 |
|---|---|---|---|---|
| NYC | 0 | 0 | 0 | 0 |
| LA | 1 | 0 | 0 | 1 |
| Chicago | 2 | 0 | 1 | 0 |
| Houston | 3 | 0 | 1 | 1 |
| Phoenix | 4 | 1 | 0 | 0 |
| Philadelphia | 5 | 1 | 0 | 1 |
| San Antonio | 6 | 1 | 1 | 0 |
| San Diego | 7 | 1 | 1 | 1 |

Only $\lceil\log_2(8)\rceil = 3$ columns are needed, compared to 8 for one-hot. For high-cardinality features like ZIP codes (43,000+ unique values), binary encoding reduces dimensionality from 43,000 to just 16 columns.
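
Scikit-learn does not provide a binary encoder (the third-party category_encoders package does). The mapping is straightforward to sketch by hand; the helper below is illustrative, learns the mapping on training data only, and sends unseen categories to all-zero rows:

def binary_encode(train_col: pd.Series, apply_col: pd.Series) -> pd.DataFrame:
    """Binary-encode a categorical column with a mapping learned on training data."""
    categories = sorted(train_col.dropna().unique())
    n_bits = max(1, int(np.ceil(np.log2(len(categories)))))
    mapping = {cat: idx for idx, cat in enumerate(categories)}

    codes = apply_col.map(mapping).fillna(-1).astype(int).to_numpy()
    bits = np.zeros((len(codes), n_bits), dtype=int)
    for b in range(n_bits):  # most significant bit first
        bits[:, b] = np.where(codes >= 0, (codes >> (n_bits - 1 - b)) & 1, 0)
    columns = [f'{apply_col.name}_bit_{n_bits - 1 - b}' for b in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=apply_col.index)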


9.4 Text Feature Engineering

Text data requires specialized preprocessing before it can be used in machine learning models. This section covers the foundational techniques; deep learning-based text representations (word embeddings, transformers) are covered in Part IV.

9.4.1 Bag of Words

The bag-of-words (BoW) model represents text as a vector of word counts, ignoring grammar and word order:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends"
]

vectorizer = CountVectorizer(
    max_features=1000,      # Limit vocabulary size
    stop_words='english',   # Remove common words
    min_df=2,               # Ignore terms in fewer than 2 documents
    max_df=0.95,            # Ignore terms in more than 95% of documents
    ngram_range=(1, 2)      # Include unigrams and bigrams
)

X_bow = vectorizer.fit_transform(corpus)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Feature matrix shape: {X_bow.shape}")

9.4.2 TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) weighs words by their importance. Words that appear frequently in a single document but rarely across the corpus receive higher weights:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

where:

$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \quad \text{IDF}(t) = \log\frac{N}{|\{d \in D : t \in d\}|} + 1$$

Here $f_{t,d}$ is the count of term $t$ in document $d$ and $N$ is the total number of documents. The $+1$ ensures that terms appearing in every document (whose logarithm would be zero) still receive a small nonzero weight rather than being discarded entirely.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    sublinear_tf=True,      # Apply log normalization to TF
    ngram_range=(1, 3),
    min_df=5,
    max_df=0.9
)

# These settings (min_df=5, max_df=0.9) are typical for a realistically sized corpus;
# loosen them (e.g., min_df=1) to run on the 3-document toy corpus above.
X_tfidf = tfidf.fit_transform(corpus)

Practical tip: Setting sublinear_tf=True applies $1 + \log(\text{TF})$ instead of raw term frequency, which often improves performance by dampening the effect of very frequent terms.

9.4.3 Text Preprocessing Best Practices

Before vectorizing, consider these preprocessing steps:

import re

def preprocess_text(text: str) -> str:
    """Clean and normalize text for feature extraction.

    Args:
        text: Raw input text.

    Returns:
        Cleaned and normalized text.
    """
    text = text.lower()                          # Lowercase
    text = re.sub(r'<[^>]+>', '', text)          # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)     # Remove non-alpha characters
    text = re.sub(r'\s+', ' ', text).strip()     # Normalize whitespace
    return text

For more advanced preprocessing, consider lemmatization (converting words to their base form), stemming, or domain-specific tokenization. These techniques are explored further in Part IV.

9.4.4 Word Embeddings as Features

While TF-IDF captures word importance, it treats each word as an independent dimension with no notion of semantic similarity. Word embeddings map words to dense vectors in a low-dimensional space (typically 100--300 dimensions) where semantically similar words are close together.

Pre-trained embeddings like Word2Vec, GloVe, or FastText can be used as features in traditional ML models by averaging the word vectors in a document:

import numpy as np

def document_embedding(
    text: str,
    word_vectors: dict[str, np.ndarray],
    embedding_dim: int = 300,
) -> np.ndarray:
    """Compute a document embedding by averaging word vectors.

    Args:
        text: Input text (already tokenized or will be split).
        word_vectors: Dictionary mapping words to embedding vectors.
        embedding_dim: Dimensionality of word vectors.

    Returns:
        Document embedding vector of shape (embedding_dim,).
    """
    words = text.lower().split()
    vectors = [
        word_vectors[w] for w in words
        if w in word_vectors
    ]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(embedding_dim)

The averaging approach is simple but effective for many tasks. More sophisticated methods weight words by their TF-IDF scores or use sentence embedding models like Sentence-BERT. We will explore learned text representations in depth in Chapter 15 when we study transformers and attention mechanisms.
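
A sketch of the TF-IDF-weighted variant: because the loop visits every token occurrence, each word's total weight is its in-document count times its IDF. It assumes the same hypothetical word_vectors dictionary as above and an already fitted TfidfVectorizer:

def tfidf_weighted_embedding(
    text: str,
    word_vectors: dict[str, np.ndarray],
    tfidf: TfidfVectorizer,
    embedding_dim: int = 300,
) -> np.ndarray:
    """Average word vectors, weighting each occurrence by the word's IDF."""
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    weighted = np.zeros(embedding_dim)
    total_weight = 0.0
    for w in text.lower().split():
        if w in word_vectors and w in idf:
            weighted += idf[w] * word_vectors[w]
            total_weight += idf[w]
    if total_weight > 0:
        return weighted / total_weight
    return np.zeros(embedding_dim)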

9.4.5 Image Feature Extraction

For traditional ML models (as opposed to deep learning, covered in Part III), image data requires manual feature extraction. Common approaches include:

Histogram of Oriented Gradients (HOG): Captures edge and gradient structure by computing histograms of gradient directions in local image regions. HOG features are effective for object detection and pedestrian recognition.

Color histograms: Represent the distribution of pixel intensities or colors. Useful for image classification when color is discriminative (e.g., distinguishing ripe from unripe fruit).
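
As a minimal illustration, a color histogram can be computed directly with NumPy (assuming an RGB image stored as an array of shape (H, W, 3) with values in [0, 255]):

def color_histogram_features(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Concatenate per-channel intensity histograms into a single feature vector.

    Args:
        image: RGB image array of shape (H, W, 3) with values in [0, 255].
        bins: Number of histogram bins per channel.

    Returns:
        Feature vector of length 3 * bins, normalized per channel.
    """
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        features.append(hist / max(hist.sum(), 1))  # normalize to a distribution
    return np.concatenate(features)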

Pre-trained CNN features: Modern practice often uses a pre-trained convolutional neural network (such as ResNet or EfficientNet) as a feature extractor. Pass the image through the network, remove the final classification layer, and use the penultimate layer's activations as a fixed-length feature vector. This approach, called transfer learning, provides rich, general-purpose image features that work well even with small datasets:

# Conceptual example (requires torchvision)
# import torch
# from torchvision import models, transforms
#
# model = models.resnet50(pretrained=True)
# model = torch.nn.Sequential(*list(model.children())[:-1])  # Remove classifier
# model.eval()
#
# transform = transforms.Compose([
#     transforms.Resize(256),
#     transforms.CenterCrop(224),
#     transforms.ToTensor(),
#     transforms.Normalize(mean=[0.485, 0.456, 0.406],
#                          std=[0.229, 0.224, 0.225]),
# ])
#
# with torch.no_grad():
#     features = model(transform(image).unsqueeze(0)).squeeze().numpy()

We will study convolutional neural networks and transfer learning in detail in Chapter 13.


9.5 Datetime Feature Engineering

Datetime columns are rich sources of information that must be decomposed into meaningful numerical features. A single timestamp can yield dozens of useful features.

9.5.1 Basic Temporal Decomposition

def extract_datetime_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Extract comprehensive datetime features from a timestamp column.

    Args:
        df: Input DataFrame.
        col: Name of the datetime column.

    Returns:
        DataFrame with extracted datetime features.
    """
    dt = pd.to_datetime(df[col])

    features = pd.DataFrame({
        f'{col}_year': dt.dt.year,
        f'{col}_month': dt.dt.month,
        f'{col}_day': dt.dt.day,
        f'{col}_dayofweek': dt.dt.dayofweek,
        f'{col}_hour': dt.dt.hour,
        f'{col}_minute': dt.dt.minute,
        f'{col}_is_weekend': (dt.dt.dayofweek >= 5).astype(int),
        f'{col}_is_month_start': dt.dt.is_month_start.astype(int),
        f'{col}_is_month_end': dt.dt.is_month_end.astype(int),
        f'{col}_quarter': dt.dt.quarter,
        f'{col}_week_of_year': dt.dt.isocalendar().week.astype(int),
        f'{col}_day_of_year': dt.dt.dayofyear,
    })

    return features

9.5.2 Cyclical Encoding

Month, day of week, and hour are cyclical features: December (12) is close to January (1), but a naive integer encoding treats them as far apart. Cyclical encoding uses sine and cosine transformations to preserve this circular relationship:

$$x_{\sin} = \sin\left(\frac{2\pi \cdot x}{T}\right), \quad x_{\cos} = \cos\left(\frac{2\pi \cdot x}{T}\right)$$

where $T$ is the period (e.g., 12 for months, 7 for days of week, 24 for hours).

def cyclical_encode(series: pd.Series, period: int) -> pd.DataFrame:
    """Apply cyclical encoding using sine and cosine transformations.

    Args:
        series: Input series with cyclical values.
        period: The period of the cycle (e.g., 12 for months).

    Returns:
        DataFrame with sine and cosine encoded columns.
    """
    return pd.DataFrame({
        f'{series.name}_sin': np.sin(2 * np.pi * series / period),
        f'{series.name}_cos': np.cos(2 * np.pi * series / period),
    })

# Example: encode hour of day
hour_encoded = cyclical_encode(df['hour'], period=24)

9.5.3 Time-Based Aggregations

Compute rolling or lag features when working with temporal data:

def create_lag_features(
    df: pd.DataFrame,
    col: str,
    lags: list[int]
) -> pd.DataFrame:
    """Create lag features for time series data.

    Args:
        df: Input DataFrame sorted by time.
        col: Column to create lags for.
        lags: List of lag periods.

    Returns:
        DataFrame with lag features.
    """
    lag_features = pd.DataFrame()
    for lag in lags:
        lag_features[f'{col}_lag_{lag}'] = df[col].shift(lag)
    return lag_features

Important: When creating lag features, be extremely careful about data leakage (Section 9.8). Ensure that lag computations respect the temporal ordering and that no future information leaks into past observations.

9.5.4 Time Series Feature Engineering

Time series data offers unique opportunities for feature engineering beyond simple temporal decomposition. The goal is to capture temporal patterns---trends, seasonality, autocorrelation---as static features that traditional ML models can exploit.

Rolling statistics capture local trends and volatility:

def create_rolling_features(
    df: pd.DataFrame,
    col: str,
    windows: list[int],
) -> pd.DataFrame:
    """Create rolling window features for time series.

    Args:
        df: Input DataFrame sorted by time.
        col: Column to compute rolling statistics for.
        windows: List of window sizes (in number of periods).

    Returns:
        DataFrame with rolling features.
    """
    features = pd.DataFrame()
    for w in windows:
        features[f'{col}_rolling_mean_{w}'] = (
            df[col].rolling(window=w, min_periods=1).mean()
        )
        features[f'{col}_rolling_std_{w}'] = (
            df[col].rolling(window=w, min_periods=1).std()
        )
        features[f'{col}_rolling_min_{w}'] = (
            df[col].rolling(window=w, min_periods=1).min()
        )
        features[f'{col}_rolling_max_{w}'] = (
            df[col].rolling(window=w, min_periods=1).max()
        )
    return features

Expanding window features capture global trends from the beginning of the series:

def expanding_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Create expanding window features.

    Args:
        df: Input DataFrame sorted by time.
        col: Column to compute expanding statistics for.

    Returns:
        DataFrame with expanding window features.
    """
    return pd.DataFrame({
        f'{col}_expanding_mean': df[col].expanding().mean(),
        f'{col}_expanding_std': df[col].expanding().std(),
        f'{col}_expanding_count': df[col].expanding().count(),
    })

Difference features capture changes over time and help stationarize the series:

$$\Delta x_t = x_t - x_{t-1}, \quad \Delta^2 x_t = \Delta x_t - \Delta x_{t-1}$$

Ratio features capture relative change rather than absolute change:

$$r_t = \frac{x_t}{x_{t-1}}, \quad \text{or equivalently} \quad \log r_t = \log x_t - \log x_{t-1}$$

Fourier features capture periodicity at known frequencies. For a signal with period $T$, add sine and cosine terms at the fundamental frequency and its harmonics:

$$\sin\left(\frac{2\pi k t}{T}\right), \quad \cos\left(\frac{2\pi k t}{T}\right), \quad k = 1, 2, \ldots, K$$

These are particularly useful for capturing annual seasonality ($T = 365$), weekly patterns ($T = 7$), or daily patterns ($T = 24$).
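
The sketch below builds difference, log-ratio, and Fourier features for a time-indexed column (names are illustrative; the DataFrame is assumed sorted by time, and the column strictly positive where the log ratio is used):

def trend_and_seasonality_features(
    df: pd.DataFrame,
    col: str,
    period: int = 7,
    n_harmonics: int = 2,
) -> pd.DataFrame:
    """Create difference, log-ratio, and Fourier features for a sorted series."""
    features = pd.DataFrame(index=df.index)
    features[f'{col}_diff_1'] = df[col].diff(1)              # first difference
    features[f'{col}_diff_2'] = df[col].diff(1).diff(1)      # second difference
    features[f'{col}_log_ratio'] = np.log(df[col]).diff(1)   # log return, requires x > 0

    # Fourier terms at the fundamental frequency and its harmonics
    t = np.arange(len(df))
    for k in range(1, n_harmonics + 1):
        features[f'{col}_sin_{k}'] = np.sin(2 * np.pi * k * t / period)
        features[f'{col}_cos_{k}'] = np.cos(2 * np.pi * k * t / period)
    return features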

Caution on temporal leakage: Every time series feature must be computable using only past data. A rolling mean at time $t$ must use values at $t, t-1, t-2, \ldots$, never $t+1, t+2, \ldots$. Similarly, target-related statistics (like the average of the target over the past week) must exclude the current time step.


9.6 Feature Selection

After engineering features, you may have hundreds or thousands of candidates. Not all are useful---some are redundant, some are noisy, and some may cause overfitting. Feature selection identifies the most informative subset.

Feature selection methods fall into three categories: filter, wrapper, and embedded methods.

9.6.1 Filter Methods

Filter methods evaluate features independently of any specific model, using statistical measures of relevance.

Variance threshold removes features with near-zero variance (features that are nearly constant across samples):

from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
print(f"Features reduced from {X.shape[1]} to {X_filtered.shape[1]}")

Univariate statistical tests measure the relationship between each feature and the target variable:

  • ANOVA F-statistic (f_classif): Tests whether the mean of a feature differs across classes. Works well for normally distributed numerical features.
  • Mutual information (mutual_info_classif): Measures general statistical dependence, capturing nonlinear relationships. More computationally expensive.
  • Chi-squared (chi2): Tests independence between non-negative features and a categorical target.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Select top 20 features by ANOVA F-statistic
selector_f = SelectKBest(score_func=f_classif, k=20)
X_selected_f = selector_f.fit_transform(X, y)

# Select top 20 features by mutual information
selector_mi = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected_mi = selector_mi.fit_transform(X, y)

# Inspect scores
scores = pd.DataFrame({
    'feature': feature_names,
    'f_score': selector_f.scores_,
    'mi_score': selector_mi.scores_
}).sort_values('f_score', ascending=False)

Correlation-based filtering removes highly correlated features that provide redundant information:

def remove_correlated_features(
    df: pd.DataFrame,
    threshold: float = 0.95
) -> list[str]:
    """Identify features to drop based on pairwise correlation.

    Args:
        df: DataFrame containing only numerical features.
        threshold: Correlation threshold above which to drop features.

    Returns:
        List of feature names to drop.
    """
    corr_matrix = df.corr().abs()
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [
        col for col in upper_triangle.columns
        if any(upper_triangle[col] > threshold)
    ]
    return to_drop

9.6.2 Wrapper Methods

Wrapper methods evaluate feature subsets by training a model and measuring performance. They are more computationally expensive but often find better feature sets.

Recursive Feature Elimination (RFE) starts with all features and iteratively removes the least important one:

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=20, step=5)
rfe.fit(X_train, y_train)

# Which features were selected?
selected_mask = rfe.support_
selected_features = np.array(feature_names)[selected_mask]
print(f"Selected features: {selected_features}")

# Feature rankings (1 = selected)
rankings = pd.DataFrame({
    'feature': feature_names,
    'ranking': rfe.ranking_
}).sort_values('ranking')

RFECV (RFE with cross-validation) automatically determines the optimal number of features:

from sklearn.feature_selection import RFECV

rfecv = RFECV(
    estimator=estimator,
    step=5,
    cv=5,
    scoring='accuracy',
    min_features_to_select=5,
    n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")

9.6.3 Embedded Methods

Embedded methods perform feature selection as part of the model training process. They are typically faster than wrapper methods and more accurate than filter methods.

L1 (Lasso) regularization drives feature coefficients to exactly zero, effectively performing feature selection (as discussed in Chapter 3):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)

# Selected feature names
selected_features = np.array(feature_names)[selector.get_support()]

Tree-based feature importance uses the impurity reduction or permutation importance from random forests or gradient boosting:

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=200, random_state=42)
gbc.fit(X_train, y_train)

# Impurity-based importance
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': gbc.feature_importances_
}).sort_values('importance', ascending=False)

print(importances.head(20))

9.6.4 Feature Importance: A Cautionary Note

As we discussed in Chapter 7, impurity-based feature importance can be biased toward high-cardinality features. A feature with many unique values (like an ID column) will appear important because it creates many possible splits, even though it has no predictive value.

Permutation importance (Chapter 8) provides an unbiased alternative by measuring how much model performance degrades when a feature's values are randomly shuffled:

from sklearn.inspection import permutation_importance

perm_result = permutation_importance(
    gbc, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

perm_importances = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_result.importances_mean,
    'importance_std': perm_result.importances_std
}).sort_values('importance_mean', ascending=False)

Best practice: Always compare multiple feature importance methods. If a feature consistently ranks high across methods, you can be more confident in its value.

9.6.5 Feature Selection in Practice: A Complete Example

To illustrate how feature selection methods work together, let us apply several techniques to a synthetic dataset and compare their results.

from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

# Create a dataset with 20 features, only 5 are informative
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_redundant=5, random_state=42  # the remaining 10 features are pure noise
)
feature_names = [f"feat_{i}" for i in range(20)]

# Method 1: ANOVA F-test (top 10)
selector_f = SelectKBest(f_classif, k=10).fit(X, y)
anova_selected = set(np.array(feature_names)[selector_f.get_support()])

# Method 2: Mutual information (top 10)
selector_mi = SelectKBest(mutual_info_classif, k=10).fit(X, y)
mi_selected = set(np.array(feature_names)[selector_mi.get_support()])

# Method 3: L1 regularization
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
selector_l1 = SelectFromModel(l1_model).fit(X, y)
l1_selected = set(np.array(feature_names)[selector_l1.get_support()])

# Method 4: Random forest importance
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
rf_selector = SelectFromModel(rf, threshold='median').fit(X, y)
rf_selected = set(np.array(feature_names)[rf_selector.get_support()])

# Consensus: features selected by at least 3 of 4 methods
all_methods = [anova_selected, mi_selected, l1_selected, rf_selected]
consensus = {f for f in feature_names
             if sum(f in s for s in all_methods) >= 3}
print(f"Consensus features: {sorted(consensus)}")

The consensus approach is robust because no single method is universally best: ANOVA misses non-linear relationships, mutual information is noisier with small samples, L1 can be inconsistent when features are correlated, and tree-based importance is biased toward high-cardinality features. Features that survive multiple independent filters are most likely to be genuinely informative.


9.7 scikit-learn Pipelines and ColumnTransformers

Real-world datasets contain a mix of feature types: numerical, categorical, text, and datetime. Each requires different preprocessing. Scikit-learn's Pipeline and ColumnTransformer provide an elegant framework for organizing these transformations into a single, reproducible object that can be fitted, cross-validated, and serialized.

9.7.1 The Pipeline Object

A Pipeline chains multiple processing steps, where each step (except the last) is a transformer and the last step can be either a transformer or an estimator:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit and predict in one call
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Cross-validate the entire pipeline
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

Key benefits:

  1. Prevents data leakage: The fit_transform is called only on training folds during cross-validation. Test fold data never influences the transformation parameters.
  2. Reproducibility: The entire preprocessing + modeling workflow is encapsulated in one object.
  3. Deployment simplicity: Serialize a single pipeline object for production use.

9.7.2 ColumnTransformer: Different Transformations for Different Columns

ColumnTransformer applies different transformations to different subsets of columns:

from sklearn.compose import ColumnTransformer

numerical_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment_type']
text_features = 'description'

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_features),

        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
        ]), categorical_features),

        ('text', TfidfVectorizer(max_features=500), text_features),
    ],
    remainder='drop'  # Drop columns not specified
)

The remainder parameter controls what happens to columns not mentioned in any transformer:

  • 'drop' (default): Unmentioned columns are dropped.
  • 'passthrough': Unmentioned columns are included as-is.

9.7.3 Building a Complete Pipeline

Combine the preprocessor with a model to create an end-to-end pipeline:

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    ))
])

# Cross-validate the complete pipeline
scores = cross_val_score(
    full_pipeline, X_train, y_train,
    cv=5, scoring='roc_auc', n_jobs=-1
)
print(f"Cross-validated AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Fit and predict
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
y_prob = full_pipeline.predict_proba(X_test)[:, 1]

9.7.4 Custom Transformers

For transformations not available in scikit-learn, create custom transformers by inheriting from BaseEstimator and TransformerMixin:

from sklearn.base import BaseEstimator, TransformerMixin

class DatetimeFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract datetime components from timestamp columns.

    Attributes:
        features: List of datetime components to extract.
    """

    def __init__(self, features: list[str] | None = None):
        """Initialize with desired datetime features.

        Args:
            features: Components to extract. Defaults to common set.
        """
        self.features = features or [
            'month', 'dayofweek', 'hour', 'is_weekend'
        ]

    def fit(self, X: pd.DataFrame, y=None):
        """No fitting required for datetime extraction.

        Args:
            X: Input DataFrame.
            y: Ignored.

        Returns:
            self
        """
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """Extract datetime features from all datetime columns.

        Args:
            X: Input DataFrame with datetime columns.

        Returns:
            Array of extracted datetime features.
        """
        result = pd.DataFrame()
        for col in X.columns:
            dt = pd.to_datetime(X[col])
            if 'month' in self.features:
                result[f'{col}_month'] = dt.dt.month
            if 'dayofweek' in self.features:
                result[f'{col}_dayofweek'] = dt.dt.dayofweek
            if 'hour' in self.features:
                result[f'{col}_hour'] = dt.dt.hour
            if 'is_weekend' in self.features:
                result[f'{col}_is_weekend'] = (
                    dt.dt.dayofweek >= 5
                ).astype(int)
        return result.values

9.7.5 Pipeline Introspection

Scikit-learn provides tools to inspect and access components within a pipeline:

# Access a specific step
preprocessor_step = full_pipeline.named_steps['preprocessor']
model = full_pipeline.named_steps['classifier']

# Get feature names after preprocessing
feature_names_out = full_pipeline[:-1].get_feature_names_out()

# Set hyperparameters using double-underscore notation
full_pipeline.set_params(
    preprocessor__num__scaler__with_mean=False,
    classifier__n_estimators=300
)

The double-underscore notation (step__param) is especially powerful for hyperparameter tuning with GridSearchCV (Chapter 8):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [3, 5, 7],
}

grid_search = GridSearchCV(
    full_pipeline, param_grid,
    cv=5, scoring='roc_auc', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

9.8 Handling Missing Data

Missing data is ubiquitous in real-world datasets. How you handle it can significantly impact model performance.

9.8.1 Types of Missingness

Understanding why data is missing guides your imputation strategy:

  • Missing Completely at Random (MCAR): Missingness is unrelated to any observed or unobserved data. Example: A sensor randomly fails.
  • Missing at Random (MAR): Missingness depends on observed data but not the missing values themselves. Example: Income is more likely to be missing for younger respondents.
  • Missing Not at Random (MNAR): Missingness depends on the missing values. Example: High-income individuals are less likely to report their income.
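
The distinction is easiest to see in a small simulation (synthetic data, purely illustrative):

rng = np.random.default_rng(42)
n = 1_000
age = rng.integers(18, 80, size=n)
income = 20_000 + 1_000 * (age - 18) + rng.normal(0, 5_000, size=n)
df_sim = pd.DataFrame({'age': age, 'income': income})

# MCAR: every value has the same 10% chance of being missing
df_sim['income_mcar'] = df_sim['income'].mask(rng.random(n) < 0.10)

# MAR: younger respondents are more likely to skip the income question
mar_prob = np.where(df_sim['age'] < 30, 0.30, 0.05)
df_sim['income_mar'] = df_sim['income'].mask(rng.random(n) < mar_prob)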

9.8.2 Simple Imputation Strategies

from sklearn.impute import SimpleImputer

# Numerical: mean, median, or constant
num_imputer = SimpleImputer(strategy='median')

# Categorical: most_frequent or constant
cat_imputer = SimpleImputer(strategy='most_frequent')

# Constant fill
const_imputer = SimpleImputer(strategy='constant', fill_value=-999)

When to use which:

  • Mean: When data is approximately normally distributed and MCAR.
  • Median: When data is skewed or contains outliers (preferred default).
  • Most frequent: For categorical features.
  • Constant: When you want the model to learn that "missing" is a distinct category.

9.8.3 Choosing an Imputation Strategy

The choice of imputation strategy depends on the missingness mechanism and the amount of missing data:

Worked Example: Consider a medical dataset with blood pressure, cholesterol, and BMI features, where 15% of cholesterol values are missing. If cholesterol is more likely to be missing for patients who did not visit a lab (MAR---missingness depends on whether they had a recent lab visit, which is observed), we should use a model-based imputation method that conditions on other features, including the lab visit indicator.

If we suspect the missingness is MNAR (e.g., high-cholesterol patients avoid testing), no imputation method can fully correct the bias. The best approach is to include a missing indicator feature (Section 9.8.4) so the model can learn that missingness itself is predictive, and to report sensitivity analyses showing how results change under different assumptions about the missing values.

| Missing % | Mechanism | Recommended Strategy |
|---|---|---|
| < 5% | MCAR | Simple imputation (median/mean) is usually sufficient |
| 5--30% | MAR | Iterative or KNN imputation to preserve relationships |
| > 30% | Any | Consider dropping the feature, or use missingness indicators |
| Any | MNAR | Include missing indicators; no method fully corrects bias |

9.8.4 Advanced Imputation

Iterative imputation models each feature with missing values as a function of all other features:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(
    max_iter=10,
    random_state=42,
    sample_posterior=False
)
X_imputed = iterative_imputer.fit_transform(X)

K-Nearest Neighbors imputation fills missing values based on the values of the $k$ nearest complete samples:

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
X_imputed = knn_imputer.fit_transform(X)

9.8.5 Missing Indicator Features

Sometimes the fact that a value is missing is itself informative. Create binary indicators for missingness:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer, MissingIndicator

# Combine imputation with missing indicators
class ImputerWithIndicator(BaseEstimator, TransformerMixin):
    """Impute missing values and add binary missing indicators.

    Args:
        strategy: Imputation strategy for SimpleImputer.
    """

    def __init__(self, strategy: str = 'median'):
        self.strategy = strategy

    def fit(self, X: np.ndarray, y=None):
        """Fit the imputer and identify columns with missing values.

        Args:
            X: Input array with potential missing values.
            y: Ignored.

        Returns:
            self
        """
        self.imputer_ = SimpleImputer(strategy=self.strategy)
        self.indicator_ = MissingIndicator()
        self.imputer_.fit(X)
        self.indicator_.fit(X)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Impute missing values and concatenate missing indicators.

        Args:
            X: Input array.

        Returns:
            Array with imputed values and indicator columns.
        """
        X_imputed = self.imputer_.transform(X)
        X_indicator = self.indicator_.transform(X)
        return np.hstack([X_imputed, X_indicator.astype(int)])

9.9 Data Leakage Prevention

Data leakage is arguably the most dangerous pitfall in machine learning. It occurs when information from outside the training set---typically from the test set or future data---inadvertently influences model training. Leakage produces overly optimistic performance estimates that collapse in production.

9.9.1 Common Sources of Leakage

1. Preprocessing before splitting:

# WRONG: Scaling before train/test split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses statistics from ALL data
X_train, X_test = X_scaled[:800], X_scaled[800:]

# CORRECT: Scale after splitting
X_train_raw, X_test_raw = X[:800], X[800:]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # Fit on train only
X_test = scaler.transform(X_test_raw)         # Transform test

When you compute the mean and standard deviation on the entire dataset, the test set statistics "leak" into the training process. The scaler effectively learns information about data the model should never have seen.

2. Target leakage:

Features that are derived from or highly correlated with the target variable but would not be available at prediction time. Example: Including claim_amount as a feature when predicting is_fraud---the claim amount is only known after the event you are trying to predict.

3. Temporal leakage in time series:

Using future data to predict the past. When working with time series, always split chronologically and ensure features like rolling averages do not include future values.

4. Leakage through feature engineering:

Target encoding computed on the full dataset, or aggregations that include test set observations, both leak information.

9.9.2 The Pipeline Solution

Scikit-learn pipelines are the primary defense against preprocessing leakage. When you use cross_val_score with a pipeline, the fit_transform is called only on each training fold:

# This automatically prevents leakage during cross-validation
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Each fold: fit imputer+scaler on train fold, transform both folds
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

Without the pipeline, you would need to manually manage the fit/transform split for every preprocessing step in every fold---error-prone and tedious.

9.9.3 Leakage Detection Checklist

Before trusting your model's performance, ask:

  1. Is the performance too good to be true? An AUC of 0.99 on a problem that domain experts say is hard warrants suspicion.
  2. Are any features derived from the target? Trace the provenance of every feature.
  3. Was preprocessing fitted on the full dataset? All fit calls should happen on training data only.
  4. For time series: is the split chronological? Random splits violate temporal ordering.
  5. Are there features that would not be available at prediction time? Simulate the production data flow.
  6. Do any features have suspiciously high individual predictive power? Fit a simple model on each feature alone (a minimal probe is sketched after this checklist). If one feature achieves near-perfect accuracy, it is likely leaking target information.
  7. Are aggregation features computed correctly? Group-level statistics (e.g., "average spend of this customer's zip code") must be computed using only training data, not the full dataset.
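
A minimal sketch of the single-feature probe from item 6 (it assumes numeric, already imputed features; the helper name is illustrative):

def single_feature_audit(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> pd.Series:
    """Cross-validated AUC of a one-feature logistic regression for each column."""
    aucs = {}
    for col in X.columns:
        scores = cross_val_score(
            LogisticRegression(max_iter=1000),
            X[[col]], y, cv=cv, scoring='roc_auc'
        )
        aucs[col] = scores.mean()
    return pd.Series(aucs).sort_values(ascending=False)

# Any feature with a near-perfect standalone AUC is a prime leakage suspect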

A cautionary tale: In the 2012 Kaggle "Observing Dark Worlds" competition, a contestant discovered that the order of data points in the test file encoded information about the target. This is a form of leakage through metadata---the test set's structure revealed information that the model should not have had. Always scrutinize data ordering, file structure, and metadata for inadvertent information leakage.

Adversarial validation: A powerful technique for detecting distribution differences between training and test sets. Train a classifier to distinguish training from test examples. If it achieves high accuracy, the distributions differ significantly, which may indicate leakage or distribution shift:

def adversarial_validation(
    X_train: np.ndarray,
    X_test: np.ndarray,
    cv: int = 5,
) -> float:
    """Check for distribution differences via adversarial validation.

    Args:
        X_train: Training feature matrix.
        X_test: Test feature matrix.
        cv: Number of cross-validation folds.

    Returns:
        AUC score (0.5 = identical distributions, 1.0 = fully separable).
    """
    X_combined = np.vstack([X_train, X_test])
    y_combined = np.concatenate([
        np.zeros(len(X_train)),
        np.ones(len(X_test)),
    ])

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(
        clf, X_combined, y_combined,
        cv=cv, scoring="roc_auc"
    )
    return scores.mean()

# AUC near 0.5: train/test look similar (good)
# AUC near 1.0: train/test are easily distinguishable (investigate!)

9.9.4 Group-Aware Splitting

When samples are grouped (e.g., multiple records per patient, multiple transactions per customer), random splitting can place related samples in both train and test sets, creating leakage:

from sklearn.model_selection import GroupKFold

# Ensure all records from the same customer stay in the same fold
groups = df['customer_id']
group_kfold = GroupKFold(n_splits=5)

scores = cross_val_score(
    pipeline, X, y,
    cv=group_kfold, groups=groups,
    scoring='roc_auc'
)

9.10 Putting It All Together: A Complete Feature Engineering Workflow

Let us walk through a complete example that integrates the concepts from this chapter. We have a customer churn prediction dataset with numerical, categorical, text, and datetime features.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, OrdinalEncoder
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    train_test_split, cross_val_score, GroupKFold
)
from sklearn.metrics import roc_auc_score

# 1. Define column groups
numerical_cols = ['tenure_months', 'monthly_charges', 'total_charges',
                  'num_support_tickets', 'avg_session_minutes']
categorical_cols = ['contract_type', 'payment_method', 'internet_service']
ordinal_cols = ['satisfaction_rating']  # low, medium, high
text_cols = 'last_support_message'

# 2. Build preprocessing pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore')),
])

ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(
        categories=[['low', 'medium', 'high']]
    )),
])

preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols),
    ('ord', ordinal_pipeline, ordinal_cols),
    ('text', TfidfVectorizer(max_features=200), text_cols),
], remainder='drop')

# 3. Full pipeline with feature selection and model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(
        GradientBoostingClassifier(n_estimators=100, random_state=42),
        threshold='median'
    )),
    ('classifier', GradientBoostingClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        random_state=42
    ))
])

# 4. Split the data, then evaluate with cross-validation (no leakage!)
#    (df is the raw churn DataFrame with a binary 'churn' target column)
X = df.drop(columns='churn')
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scores = cross_val_score(
    full_pipeline, X_train, y_train,
    cv=5, scoring='roc_auc', n_jobs=-1
)
print(f"CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")

# 5. Final fit and evaluation
full_pipeline.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, full_pipeline.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.4f}")

This pipeline encapsulates every preprocessing decision, prevents data leakage during cross-validation, and can be serialized for deployment with a single joblib.dump(full_pipeline, 'model.pkl') call.


9.11 Advanced Topics

9.11.1 Feature Engineering for Specific Model Types

Different models benefit from different feature engineering approaches:

Linear models (Chapter 3) benefit from:

  • Feature scaling (essential for regularized models).
  • Polynomial and interaction features to capture nonlinear relationships.
  • One-hot encoding for categorical features.
  • Log transforms for skewed numerical features.

Tree-based models (Chapters 6--7) benefit from:

  • Ordinal encoding or target encoding for categorical features (more efficient than one-hot).
  • Raw numerical features (scaling is unnecessary).
  • Interaction features (trees can learn these, but explicit features can help).

Neural networks (Part III) benefit from:

  • Feature scaling (typically to $[0, 1]$ or standardized).
  • Embedding layers for high-cardinality categorical features.
  • The same preprocessing as linear models, plus learned representations.
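
A minimal sketch of how these recommendations translate into code, reusing the numerical_cols and categorical_cols lists from Section 9.10 (the hyperparameters are placeholders):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Regularized linear model: scale numericals, one-hot encode categoricals
linear_prep = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
linear_model = make_pipeline(linear_prep, LogisticRegression(max_iter=1000))

# Tree ensemble: leave numericals raw, use compact ordinal codes
tree_prep = ColumnTransformer([
    ('num', 'passthrough', numerical_cols),
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     categorical_cols),
])
tree_model = make_pipeline(tree_prep, GradientBoostingClassifier(random_state=42))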

9.11.2 Feature Stores

In production environments, feature stores provide a centralized repository for feature definitions, ensuring consistency between training and serving:

  • Offline features are computed in batch and stored for training.
  • Online features are computed or retrieved in real-time for serving.
  • Tools like Feast, Tecton, and Hopsworks manage this duality.

Feature stores solve the training-serving skew problem: ensuring that the features used at prediction time are computed exactly the same way as during training.

9.11.3 Automated Feature Engineering

Libraries like Featuretools can automatically generate features through deep feature synthesis (DFS), applying a library of transformation primitives to relational data. DFS works by exploring all possible chains of transformations across related tables, generating hundreds or thousands of candidate features.

# Conceptual example (requires featuretools package)
# import featuretools as ft
#
# # Define the entity set with relationships
# entity_set = ft.EntitySet(id="ecommerce")
# entity_set = entity_set.add_dataframe(
#     dataframe_name="customers",
#     dataframe=customers_df,
#     index="customer_id",
# )
# entity_set = entity_set.add_dataframe(
#     dataframe_name="orders",
#     dataframe=orders_df,
#     index="order_id",
#     time_index="order_date",
# )
# entity_set = entity_set.add_relationship(
#     "customers", "customer_id", "orders", "customer_id"
# )
#
# # Generate features with depth 2
# feature_matrix, feature_defs = ft.dfs(
#     entityset=entity_set,
#     target_dataframe_name="customers",
#     max_depth=2,
#     agg_primitives=["mean", "sum", "count", "max", "min", "std"],
#     trans_primitives=["month", "weekday", "is_weekend"],
# )
# print(f"Generated {len(feature_defs)} features")

DFS operates with two types of primitives:

  • Aggregation primitives (e.g., mean, count, mode) aggregate data from child tables to parent tables. For example, computing "average order value per customer" from an orders table.
  • Transformation primitives (e.g., month, log, percentile) transform individual columns within a single table.

At depth 1, DFS applies single primitives. At depth 2, it stacks primitives---for example, "the average of the monthly counts of orders per customer." The number of generated features grows exponentially with depth, so depths greater than 2 are rarely practical.
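
To make the depth distinction concrete, here is a hand-written pandas sketch of those two features (assuming an orders_df with a customer_id, a datetime order_date, and an order amount column); DFS would generate these, and many variations, automatically:

# Depth 1: aggregate child rows (orders) up to the parent (customers)
avg_order_value = (
    orders_df.groupby('customer_id')['amount']
    .mean()
    .rename('avg_order_value')
)

# Depth 2: transform (month), aggregate (count per customer-month),
# then aggregate again (mean per customer)
orders_df['order_month'] = orders_df['order_date'].dt.to_period('M')
avg_monthly_orders = (
    orders_df.groupby(['customer_id', 'order_month'])
    .size()
    .groupby('customer_id')
    .mean()
    .rename('avg_monthly_order_count')
)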

Limitations of automated feature engineering: The generated features are only as good as the data relationships defined in the entity set. DFS cannot discover features that require domain-specific formulas (e.g., body mass index from height and weight). It also tends to generate many redundant or uninformative features, making subsequent feature selection essential.

While automated feature engineering can discover useful features, it is no substitute for domain knowledge. The best results come from combining automated exploration with expert-guided feature creation. Use DFS to generate a broad set of candidates, then apply the feature selection techniques from Section 9.6 to identify the most valuable ones.

Alternative tools: Besides Featuretools, several other libraries provide automated feature engineering:

  • tsfresh: Specialized for time series features, automatically extracting hundreds of time-domain and frequency-domain features.
  • Boruta: A wrapper-based feature selection method built on random forests that identifies all relevant features (not just the top $k$).
  • OpenFE: Automatically generates candidate feature transformations and efficiently ranks them by their incremental contribution to model performance.

9.11.4 Feature Interactions

Feature interactions occur when the combined effect of two or more features on the target differs from the sum of their individual effects. Consider a simple example: the impact of exercise on health outcomes may depend on age. A 30-minute daily walk has a different effect for a 25-year-old than for a 75-year-old. The interaction feature age * exercise_minutes captures this dependency.

Explicit interaction features for linear models:

As we discussed in Section 9.2.4, PolynomialFeatures with interaction_only=True creates all pairwise products. For $d$ features, this produces $\binom{d}{2}$ new features. With 20 original features, that is 190 interactions---manageable. With 200 features, it is 19,900---a significant expansion that requires regularization.

Domain-driven interactions: Rather than creating all possible interactions, use domain knowledge to identify the most likely ones. In real estate, bedrooms * bathrooms or square_feet / bedrooms (average room size) are more informative than arbitrary feature products. In finance, income / debt (debt-to-income ratio) is a standard underwriting feature.
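
In code, such features are usually one-liners. A brief sketch with illustrative column names (the clip call guards against division by zero):

# Real-estate examples
houses['rooms_total'] = houses['bedrooms'] + houses['bathrooms']
houses['avg_room_size'] = houses['square_feet'] / houses['bedrooms'].clip(lower=1)

# Lending example: the classic debt-to-income ratio
loans['debt_to_income'] = loans['monthly_debt'] / loans['monthly_income'].clip(lower=1)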

Detecting interactions statistically: Fit a model with and without interaction terms and compare performance. Alternatively, examine partial dependence plots (which we will explore in Chapter 20) to identify features whose effect on the target depends on the value of another feature.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare model with and without interactions
model_no_interactions = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

model_with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000, C=0.1),  # Regularize!
)

score_base = cross_val_score(
    model_no_interactions, X, y, cv=5, scoring="roc_auc"
).mean()
score_interact = cross_val_score(
    model_with_interactions, X, y, cv=5, scoring="roc_auc"
).mean()

print(f"Without interactions: AUC = {score_base:.4f}")
print(f"With interactions:    AUC = {score_interact:.4f}")

9.11.5 Production Pipeline Patterns

Moving from a Jupyter notebook to a production system introduces challenges that feature engineering pipelines must address.

Training-serving skew occurs when features are computed differently during training and serving. Common causes include:

  • Using different library versions or function implementations.
  • Computing aggregations over different time windows.
  • Applying transformations in a different order.

The solution is to use the same pipeline object for both training and serving. Serialize the fitted pipeline with joblib or pickle:

import joblib

# Save the fitted pipeline
joblib.dump(full_pipeline, "model_pipeline.pkl")

# Load in production
pipeline = joblib.load("model_pipeline.pkl")
predictions = pipeline.predict(new_data)

Feature validation ensures that incoming production data conforms to expectations:

def validate_features(
    X_new: pd.DataFrame,
    X_train_stats: dict[str, dict[str, float]],
    tolerance: float = 5.0,
) -> list[str]:
    """Check for feature distribution anomalies.

    Args:
        X_new: New data to validate.
        X_train_stats: Dictionary of {feature: {mean, std, min, max}}
            computed during training.
        tolerance: Number of standard deviations to flag as anomalous.

    Returns:
        List of warning messages for anomalous features.
    """
    warnings = []
    for col in X_new.select_dtypes(include=[np.number]).columns:
        if col not in X_train_stats:
            warnings.append(f"Unknown feature: {col}")
            continue
        stats = X_train_stats[col]
        col_mean = X_new[col].mean()
        if abs(col_mean - stats['mean']) > tolerance * stats['std']:
            warnings.append(
                f"Feature '{col}' mean shifted: "
                f"expected {stats['mean']:.2f}, got {col_mean:.2f}"
            )
    return warnings
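
A minimal usage sketch (assuming X_train and a new production batch X_new are DataFrames): compute the reference statistics once at training time, then reuse them for every incoming batch:

# Reference statistics, computed once on the training data
X_train_stats = {
    col: {
        'mean': X_train[col].mean(),
        'std': X_train[col].std(),
        'min': X_train[col].min(),
        'max': X_train[col].max(),
    }
    for col in X_train.select_dtypes(include=[np.number]).columns
}

# Validate each incoming batch before scoring it
for message in validate_features(X_new, X_train_stats):
    print(f"WARNING: {message}")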

Versioning features: As feature definitions evolve, maintain a versioning scheme. A feature store (discussed in Section 9.11.2) provides centralized management, but even without one, documenting feature definitions, transformations, and their dependencies prevents confusion when updating models.

Idempotent pipelines: Ensure that running the pipeline twice on the same data produces the same result. Avoid features that depend on global state, current time (unless explicitly parameterized), or external API calls that may return different results.
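
For example, a recency feature becomes idempotent once the reference time is an explicit argument rather than a call to datetime.now(). A small sketch (the column and parameter names are illustrative):

def add_recency_features(df: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Compute recency relative to an explicit, reproducible reference date."""
    out = df.copy()
    out['days_since_last_purchase'] = (reference_date - out['last_purchase_date']).dt.days
    return out

# Same inputs -> same outputs, whether run today or re-run next month
features = add_recency_features(df, reference_date=pd.Timestamp('2024-01-01'))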


9.12 Common Pitfalls and Best Practices

Pitfalls to Avoid

  1. Leaking future information into features (Section 9.9).
  2. One-hot encoding high-cardinality features without considering alternatives.
  3. Binning continuous features unnecessarily, destroying information.
  4. Ignoring the distribution of features before choosing transformations.
  5. Feature selection on the entire dataset rather than within cross-validation folds.
  6. Not creating missing indicators when missingness is informative (a sketch follows this list).
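
For pitfall 6, a minimal sketch (reusing numerical_cols and the train/test split from Section 9.10): SimpleImputer can append binary missing-indicator columns alongside the imputed values, so the model can learn from the missingness pattern itself:

from sklearn.impute import SimpleImputer

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train[numerical_cols])
X_test_imputed = imputer.transform(X_test[numerical_cols])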

Best Practices

  1. Always use pipelines for preprocessing + modeling. As we saw in Section 9.7, scikit-learn Pipelines prevent data leakage, ensure reproducibility, and simplify deployment.
  2. Fit transformers on training data only, then apply to test data. This is the single most important rule for avoiding leakage. Every scaler, imputer, and encoder must be fitted exclusively on training data.
  3. Start simple with basic transformations and iterate. Begin with raw features and a simple model as a baseline, then add transformations one at a time, measuring the impact with cross-validation (Chapter 8) at each step.
  4. Validate feature importance with multiple methods. As discussed in Section 9.6, impurity-based importance can be misleading. Use permutation importance, SHAP values, and statistical tests to corroborate your findings.
  5. Document your feature engineering decisions for reproducibility. Record not just what transformations you applied, but why---this helps future team members (including your future self) understand the rationale behind each feature.
  6. Monitor feature distributions in production for data drift. Features that were informative during training may become stale or irrelevant as the world changes. Set up automated alerts for distributional shifts using techniques like the Kolmogorov-Smirnov test or Population Stability Index (a sketch follows this list).
  7. Use domain knowledge to guide feature creation---the most powerful features often come from understanding the problem, not from algorithms. A domain expert who knows that "time since last purchase" matters for churn prediction will create a more valuable feature than any automated method.
  8. Consider the computational cost of your features. Features that require expensive computations (e.g., real-time API calls, complex aggregations over large tables) may not be feasible in production, even if they are predictive. Design features with both offline training and online serving constraints in mind.
  9. Version your feature definitions. When you change how a feature is computed, models trained on the old definition may behave unexpectedly. Use feature stores or clear documentation to track feature versions.

9.13 Feature Engineering Checklist

Before concluding, here is a practical checklist you can use at the start of any feature engineering effort:

Data Understanding Phase:

- [ ] Examine data types, missing value patterns, and distributions for every feature.
- [ ] Understand the domain context: what do the features represent physically?
- [ ] Identify the target variable and check its distribution (balanced vs. imbalanced for classification, skewed vs. symmetric for regression).

Numerical Features:

- [ ] Check for outliers and decide on a strategy (removal, capping, robust scaling).
- [ ] Apply appropriate scaling (standard, min-max, or robust) based on the model type.
- [ ] Consider log or power transforms for skewed distributions.
- [ ] Create domain-relevant interaction and ratio features.

Categorical Features:

- [ ] Assess cardinality: use one-hot encoding for low-cardinality and target or frequency encoding for high-cardinality features.
- [ ] Check for unseen categories in validation/test data and handle gracefully.
- [ ] Consider ordinal encoding only when a natural ordering exists.

Temporal Features:

- [ ] Extract all relevant time components (hour, day, month, day of week).
- [ ] Apply cyclical encoding for periodic features.
- [ ] Create lag, rolling, and expanding window features for time series.
- [ ] Verify that no future information leaks into historical features.

Text Features:

- [ ] Preprocess text (lowercasing, removing noise, tokenizing).
- [ ] Choose between BoW, TF-IDF, and embeddings based on the task and data volume.
- [ ] Consider n-grams for capturing multi-word patterns.

Pipeline and Validation:

- [ ] Wrap all preprocessing in a scikit-learn Pipeline.
- [ ] Verify no data leakage by checking that all fit calls use only training data.
- [ ] Use cross-validation to evaluate the impact of each feature engineering decision.
- [ ] Apply feature selection to remove noise and redundancy.


Summary

Feature engineering is the bridge between raw data and effective machine learning models. In this chapter, we covered:

  • Numerical transformations: Scaling, normalization, log and power transforms, binning, and polynomial features.
  • Categorical encoding: One-hot, ordinal, target, and frequency encoding, each suited to different scenarios.
  • Text features: Bag of words and TF-IDF representations for converting text into numerical features.
  • Datetime features: Temporal decomposition and cyclical encoding for time-based data.
  • Feature selection: Filter, wrapper, and embedded methods for identifying the most informative features.
  • Pipelines: scikit-learn's Pipeline and ColumnTransformer for building reproducible, leak-free preprocessing workflows.
  • Missing data: Imputation strategies ranging from simple median filling to iterative multivariate approaches.
  • Data leakage: Understanding, detecting, and preventing the most dangerous pitfall in applied machine learning.

In Chapter 10, we will build on these foundations to explore model selection and hyperparameter tuning, using the pipelines developed here as the foundation for systematic optimization.

