Part II: Feature Engineering and Data Preparation
Here is the inconvenient truth about machine learning: the algorithm matters less than the data you feed it.
A logistic regression with brilliant features will outperform a gradient boosting model with mediocre features nine times out of ten. And the features do not create themselves. Someone has to extract them from raw databases, encode them for models that speak only in numbers, handle the missing values that real data always contains, select the ones that actually carry signal, and wire everything into a pipeline that runs identically in a notebook and in production.
That someone is you.
Part II is where data science becomes a craft. Six chapters. Six skills that separate working data scientists from people who follow tutorials.
Chapter 5: SQL for Data Scientists starts at the source. Before you can engineer features, you need data — and in most organizations, that data lives in relational databases. You will learn the SQL that the job actually requires: window functions, CTEs, complex joins, and query optimization. If you can write a LAG() window function over a PARTITION BY clause, you can compute temporal features in SQL that would take 50 lines of pandas.
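To make the LAG()-over-PARTITION BY idea concrete, here is a minimal sketch run through Python's built-in sqlite3 module (SQLite ships window functions as of version 3.25). The table and column names (`events`, `customer_id`, `event_date`) are invented for illustration, not the StreamFlow schema:

```python
import sqlite3

# Compute "days since this customer's previous event" -- a temporal
# feature -- with a LAG() window function over PARTITION BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (customer_id INTEGER, event_date TEXT);
    INSERT INTO events VALUES
        (1, '2024-01-01'), (1, '2024-01-05'), (1, '2024-01-06'),
        (2, '2024-01-02'), (2, '2024-01-09');
""")

rows = conn.execute("""
    SELECT customer_id,
           event_date,
           julianday(event_date)
             - julianday(LAG(event_date) OVER (
                   PARTITION BY customer_id ORDER BY event_date
               )) AS days_since_prev
    FROM events
    ORDER BY customer_id, event_date
""").fetchall()

for row in rows:
    print(row)  # days_since_prev is NULL for each customer's first event
```

One query, one pass over the data, and every row carries its gap-to-previous-event feature.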
Chapter 6: Feature Engineering is the art chapter. The best features come from understanding the problem, not from automated feature generation. You will learn to create temporal features, interaction terms, domain-specific transformations, and the feature engineering patterns that show up in nearly every production ML system.
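A taste of the patterns, sketched in pandas. The column names here are invented placeholders, not the StreamFlow schema:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-03-30"]),
    "monthly_spend": [9.99, 24.99, 9.99],
    "sessions_per_week": [2, 11, 0],
})

# Temporal features: decompose a timestamp into model-friendly parts.
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek

# Interaction term: spend per session can carry more signal than either
# column alone (the +1 guards against division by zero).
df["spend_per_session"] = df["monthly_spend"] / (df["sessions_per_week"] + 1)
```

Three lines, three features -- each one an encoding of something you understand about the problem.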
Chapter 7: Handling Categorical Data tackles the encoding problem. Models need numbers; your data has categories. One-hot encoding is the answer everyone knows. It is not always the right answer — especially when your categorical variable has 14,000 levels (looking at you, ICD-10 diagnosis codes).
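A quick sketch of the contrast, with an invented `plan` column: one-hot for low cardinality, and frequency encoding as one of the alternatives the chapter covers for high-cardinality columns:

```python
import pandas as pd

df = pd.DataFrame({"plan": ["basic", "pro", "basic", "enterprise", "basic"]})

# Low cardinality: one-hot is fine -- three levels, three columns.
onehot = pd.get_dummies(df["plan"], prefix="plan")

# High cardinality: map each level to its relative frequency instead,
# producing a single numeric column no matter how many levels exist.
freq = df["plan"].map(df["plan"].value_counts(normalize=True))
```

With 14,000 levels, one-hot encoding would mint 14,000 mostly-zero columns; frequency encoding still produces exactly one.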
Chapter 8: Missing Data Strategies goes beyond df.dropna(). Missing data is not just an inconvenience. In many problems, the pattern of missingness is itself a powerful feature. The sensor that stops reporting is the sensor attached to the machine that is about to fail.
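The sensor example can be sketched in a few lines of pandas (the `sensor_reading` column is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sensor_reading": [0.91, np.nan, 0.88, np.nan, np.nan]})

# Capture the missingness pattern as a feature BEFORE imputing --
# "this sensor went silent" may predict the target all by itself.
df["sensor_missing"] = df["sensor_reading"].isna().astype(int)

# Then impute so downstream models see numbers everywhere.
df["sensor_reading"] = df["sensor_reading"].fillna(df["sensor_reading"].median())
```

Order matters: impute first and the missingness signal is gone for good.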
Chapter 9: Feature Selection reduces the noise. More features is not more better. The curse of dimensionality is real, and 200 features on 10,000 rows means your model is memorizing noise. You will learn filter, wrapper, and embedded methods — and critically, how to do feature selection inside cross-validation so you do not leak information.
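The leak-free version looks like this in scikit-learn: putting the selector inside the Pipeline means it is refit on each training fold only, never on the held-out fold (the dataset here is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 50 features, only 5 of which carry signal.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),   # selection happens per fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Running SelectKBest on the full dataset before cross-validating would let every test fold influence which features get picked -- exactly the leak the chapter teaches you to avoid.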
Chapter 10: Building Reproducible Data Pipelines wires everything together. scikit-learn Pipelines, ColumnTransformers, and custom transformers turn your ad-hoc preprocessing steps into a single object that can be fitted, saved, loaded, and deployed. If your preprocessing is not in a Pipeline, your results are folklore.
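A minimal sketch of that single object, with invented placeholder columns standing in for the StreamFlow data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "monthly_spend": [9.99, None, 24.99, 14.99],
    "plan": ["basic", "pro", "basic", "pro"],
})

# Numeric columns: impute, then scale -- as one sub-pipeline.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric, ["monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

X = preprocess.fit_transform(df)  # one object: fit, save, load, deploy
print(X.shape)
```

Everything -- imputation, scaling, encoding -- now lives in one fittable object that can be pickled and shipped, so the notebook and the production service run the identical transformation.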
Progressive Project Milestones
In Part II, the StreamFlow project takes shape:
- M1 (Chapter 5): Extract customer features from StreamFlow's relational database using SQL
- M2 (Chapter 6): Engineer 15+ predictive features from the raw data
- M3 (Chapters 7–10): Encode categoricals, handle missing data, select features, and assemble the complete preprocessing Pipeline
By the end of Part II, you will have a production-ready feature pipeline — and you will not have trained a single model yet. That is intentional. The pipeline comes first. The model is almost an afterthought.
What You Need
- Python with pandas, numpy, scikit-learn
- A SQL environment (PostgreSQL recommended; SQLite acceptable for exercises)
- The StreamFlow dataset (introduced in Chapter 5)
- Chapters 1–4 completed (especially Chapter 4 for the math behind feature selection)