Chapter 9: Key Takeaways
Core Principles
- Feature engineering often matters more than model selection. A well-engineered feature set with a simple model frequently outperforms a complex model trained on raw features. Invest time in understanding your data before reaching for sophisticated algorithms.
- The representation of data determines the ceiling of model performance. Machine learning models can only exploit patterns that are present in their input features. If a predictive relationship is not represented, directly or through engineered features, no model can learn it.
- Domain knowledge is your most powerful feature engineering tool. Automated methods can discover statistical patterns, but features motivated by domain understanding (ratios, thresholds, interaction terms) tend to be the most predictive and interpretable.
Numerical Features
- Choose your scaler based on the data and algorithm. Use `StandardScaler` for normally distributed data and linear models, `MinMaxScaler` when bounded outputs are needed, and `RobustScaler` when outliers are present. Tree-based models generally do not require scaling.
- Apply nonlinear transformations to skewed distributions. Log, Box-Cox, and Yeo-Johnson transforms make distributions more symmetric, which benefits linear models and distance-based algorithms.
- Use binning deliberately, not reflexively. Discretization destroys information within bins. Apply it only when domain knowledge suggests meaningful thresholds (e.g., age brackets, income tiers).
- Polynomial features grow combinatorially. With $n$ features and degree $d$, the output has $\binom{n+d}{d} - 1$ features. Always pair polynomial expansion with regularization or feature selection, as in the sketch after this list.
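To make these guidelines concrete, here is a minimal sketch on synthetic, purely illustrative data. It pairs `RobustScaler` (the features are skewed and outlier-prone) with a degree-2 polynomial expansion and Ridge regularization; with $n = 3$ features and $d = 2$, the expansion yields $\binom{5}{2} - 1 = 9$ features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# Synthetic skewed, outlier-prone data, for illustration only.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))
y = X[:, 0] * X[:, 1] + rng.normal(size=200)

# RobustScaler because the features contain outliers; Ridge because the
# degree-2 expansion of 3 features produces C(5, 2) - 1 = 9 features.
model = make_pipeline(
    RobustScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
model.fit(X, y)
```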
Categorical Features
- Match the encoding to the feature's nature. Use one-hot for nominal categories, ordinal for ordered categories, and target or frequency encoding for high-cardinality features.
- One-hot encoding does not scale to high cardinality. For features with hundreds or thousands of categories, use target encoding with smoothing, frequency encoding, or embedding-based approaches.
- Target encoding requires regularization to avoid leakage. Always apply smoothing and compute target statistics only on training folds, never on the full dataset (see the sketch after this list).
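A minimal sketch of leakage-safe target encoding, assuming scikit-learn 1.3 or later (which added `TargetEncoder` with built-in smoothing and cross-fitting); the high-cardinality feature here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3

# Hypothetical high-cardinality feature: 1,000 distinct category labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(5000, 1)).astype(str)
y = rng.integers(0, 2, size=5000)

# TargetEncoder smooths category means toward the global mean and uses
# internal cross-fitting, so no row is encoded with statistics computed on it.
# Keeping it inside the pipeline means each CV fold fits it on that fold's
# training split only.
pipe = make_pipeline(TargetEncoder(smooth="auto"), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```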
Text and Datetime Features
- TF-IDF is a strong baseline for text features. It captures term importance through inverse document frequency weighting, and `sublinear_tf=True` dampens the effect of very frequent terms.
- Datetime features should include cyclical encodings. Sine and cosine transforms for hours, days of week, and months preserve the circular nature of time.
- Recency features are often more useful than raw timestamps. "Days since last event" is more interpretable and predictive than the event's absolute date. Both the cyclical and recency encodings are sketched after this list.
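The sketch below illustrates all three ideas: a TF-IDF baseline with `sublinear_tf=True`, sine/cosine encodings, and a recency feature. The corpus, timestamps, and the `add_cyclical` helper are hypothetical, for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF baseline; sublinear_tf=True replaces raw counts with 1 + log(count).
docs = ["tiny example document", "another tiny example"]  # hypothetical corpus
X_text = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)

def add_cyclical(df, col, period):
    """Map a periodic column (e.g. hour, period=24) onto the unit circle."""
    radians = 2 * np.pi * df[col] / period
    df[f"{col}_sin"] = np.sin(radians)
    df[f"{col}_cos"] = np.cos(radians)
    return df

# Hypothetical timestamps for illustration.
df = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=48, freq="h")})
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek
df = add_cyclical(df, "hour", period=24)
df = add_cyclical(df, "dayofweek", period=7)

# Recency: days since the event, measured from a fixed reference time.
reference = pd.Timestamp("2024-01-03")
df["days_since_event"] = (reference - df["ts"]).dt.days
```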
Feature Selection
- Use filter methods for quick screening and wrapper methods for thorough selection. Filter methods (variance threshold, univariate tests) are fast but evaluate features independently. Wrapper methods (RFE) account for feature interactions but are computationally expensive.
- Embedded methods offer the best trade-off. L1 regularization and tree-based importance perform feature selection as part of model training, capturing multivariate relationships efficiently.
- Always perform feature selection within cross-validation. Selecting features on the entire dataset and then cross-validating introduces optimistic bias, a subtle form of data leakage. The sketch after this list avoids it by embedding selection inside the pipeline.
- Validate feature importance with multiple methods. Impurity-based importance is biased toward high-cardinality features. Always corroborate with permutation importance.
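A minimal sketch of embedded selection performed safely within cross-validation, on synthetic data from `make_classification`. Because the L1-regularized selector lives inside the pipeline, each fold refits it on that fold's training split only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Embedded L1 selection: features with zeroed coefficients are dropped.
# Nothing is selected using the full dataset, so there is no selection bias.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    LogisticRegression(),
)
scores = cross_val_score(pipe, X, y, cv=5)
```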
Pipelines
- Always use scikit-learn Pipelines for preprocessing and modeling. Pipelines prevent data leakage, ensure reproducibility, simplify cross-validation, enable hyperparameter tuning across all components, and produce a single deployable artifact.
- Use `ColumnTransformer` for heterogeneous data. Different column types require different preprocessing. `ColumnTransformer` applies the right transformation to the right columns and concatenates the results.
- Write custom transformers for domain-specific logic. Inherit from `BaseEstimator` and `TransformerMixin`, implement `fit` and `transform`, and your custom logic integrates seamlessly into pipelines, as the sketch after this list shows.
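A minimal sketch combining all three points: a `ColumnTransformer` that routes hypothetical numeric and categorical columns (`income`, `debt`, `region`) through different preprocessing, plus a custom `DebtToIncomeRatio` transformer as an example of domain-specific logic:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DebtToIncomeRatio(BaseEstimator, TransformerMixin):
    """Hypothetical domain feature: debt divided by income, columnwise."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Columns arrive in the order given below: [income, debt].
        return X[:, [1]] / (X[:, [0]] + 1e-9)

numeric = ["income", "debt"]     # hypothetical column names
categorical = ["region"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("ratio", DebtToIncomeRatio(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# One deployable artifact: preprocessing and model travel together.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
```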
Missing Data
- Median imputation is a safe default for numerical features. It is robust to outliers and preserves the central tendency better than mean imputation for skewed data.
- Add missing indicator features when missingness is informative. A binary column indicating whether a value was missing can itself be a strong predictor.
- Consider iterative imputation for complex missing patterns. When missingness follows MAR (missing-at-random) patterns, iterative imputation that models each feature as a function of the others typically outperforms simple strategies. Both approaches are sketched after this list.
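A minimal sketch of both strategies on a tiny hypothetical matrix; note that `IterativeImputer` is still marked experimental in scikit-learn and must be enabled explicitly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Tiny hypothetical matrix with missing entries, for illustration only.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Median imputation, with an indicator column appended for each feature that
# had missing values, in case the missingness itself is predictive.
X_simple = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

# Iterative imputation models each feature as a function of the others,
# which typically helps under MAR missingness.
X_iter = IterativeImputer(random_state=0).fit_transform(X)
```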
Data Leakage
- Data leakage is the most dangerous pitfall in applied ML. It produces models that appear excellent in evaluation but fail catastrophically in production. Always be suspicious of results that seem too good.
- Fit on train, transform on test; no exceptions. Every preprocessing step (scaling, encoding, imputation, feature selection) must be fitted exclusively on training data and then applied to test data without refitting.
- Use `GroupKFold` for grouped data. When multiple samples belong to the same entity (customer, patient, device), ensure all samples from the same group stay in the same fold (see the sketch after this list).
- Check for target leakage by asking: "Would this feature be available at prediction time?" If a feature is a consequence of the target or would only be known after the prediction event, it is leaking.
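A minimal sketch of group-aware cross-validation; the data and the customer-ID groups are synthetic, for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)
groups = rng.integers(0, 30, size=300)  # hypothetical customer IDs

# All samples sharing a group ID land in the same fold, so the model is
# never evaluated on a customer it has already seen during training.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=5), groups=groups)
```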
Workflow
- Iterate systematically: baseline, then add complexity. Start with raw features and a simple pipeline. Add transformations one at a time, measuring the impact of each change with cross-validation.
- Document every feature engineering decision. Future-you (and your teammates) will need to understand why each transformation was applied. A `COLUMN_CONFIG` dictionary serves as a living specification; a sketch follows this list.
- Monitor features in production. Feature distributions can drift over time. Establish monitoring to detect when production data diverges from the training data distribution.
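As one possible shape for such a specification, here is a hypothetical `COLUMN_CONFIG`; the column names, keys, and schema are illustrative, not a prescribed format:

```python
# A hypothetical COLUMN_CONFIG: one dictionary recording what is done to each
# column and why, doubling as a living specification for building the pipeline.
COLUMN_CONFIG = {
    "income": {
        "type": "numeric", "impute": "median", "scale": "robust",
        "why": "heavy right skew; outliers from self-reported values",
    },
    "region": {
        "type": "categorical", "encode": "onehot",
        "why": "nominal, low cardinality",
    },
    "last_login": {
        "type": "datetime", "derive": ["days_since", "hour_sin_cos"],
        "why": "recency is more predictive than the absolute timestamp",
    },
}
```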