Further Reading: Chapter 10
Building Reproducible Data Pipelines
Foundational Documentation
1. scikit-learn User Guide --- "Pipelines and Composite Estimators"
The authoritative reference for Pipeline, ColumnTransformer, make_pipeline, and make_column_transformer. The section on "chaining estimators" covers the fit/transform contract, nested pipelines, and the memory parameter for caching. The examples are minimal but precise. The ColumnTransformer page includes a worked example applying different preprocessing to numeric and categorical features --- the exact pattern used in this chapter. Available at scikit-learn.org/stable/modules/compose.html.
2. scikit-learn User Guide --- "Developing scikit-learn Estimators"
The official guide for writing custom transformers and estimators. Covers the BaseEstimator and TransformerMixin contracts, the get_params/set_params protocol, the check_estimator utility for validating that your custom class conforms to the API, and the trailing-underscore convention for fitted attributes. Essential reading if you write custom transformers for production. Available at scikit-learn.org/stable/developers/develop.html.
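A minimal custom transformer following those contracts might look like this; the mean-centering behavior is an arbitrary illustrative choice, not an example from the guide:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Subtract the column means learned during fit."""

    def fit(self, X, y=None):
        X = check_array(X)
        # Trailing underscore marks an attribute learned during fit.
        self.mean_ = X.mean(axis=0)
        return self  # fit must return self so chaining works

    def transform(self, X):
        check_is_fitted(self, "mean_")  # raise if called before fit
        X = check_array(X)
        return X - self.mean_

X = np.array([[1.0, 10.0], [3.0, 30.0]])
centered = MeanCenterer().fit_transform(X)
print(centered)
```

Inheriting from `BaseEstimator` supplies `get_params`/`set_params` for free, and `TransformerMixin` supplies `fit_transform`, so the class drops straight into a `Pipeline`.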
3. scikit-learn User Guide --- "Common Pitfalls and Recommended Practices"
A practical guide to the most common mistakes in applied scikit-learn, including preprocessing outside of cross-validation, data leakage through feature selection, and the fit_transform vs. transform distinction. The section on "data leakage during pre-processing" directly addresses the failure mode that motivated this chapter. Available at scikit-learn.org/stable/common_pitfalls.html.
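The fit_transform vs. transform distinction comes down to learning preprocessing statistics on the training split only; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_s = scaler.transform(X_test)        # reuse them; never refit on test

# Calling scaler.fit_transform(X_test) here would leak the test
# distribution into preprocessing, the mistake the pitfalls page warns about.
print(X_train_s.shape, X_test_s.shape)
```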
Books
4. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow --- Aurélien Géron (3rd edition, O'Reilly, 2022)
Chapter 2, "End-to-End Machine Learning Project," walks through the complete pipeline construction process for a housing price dataset: custom transformer creation, ColumnTransformer assembly, and pipeline serialization. The treatment is more tutorial-oriented than this textbook and provides an excellent second perspective on the same patterns. The custom transformer examples use the same BaseEstimator/TransformerMixin approach.
5. Machine Learning with PyTorch and Scikit-Learn --- Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili (Packt, 2022)
Chapter 6 covers scikit-learn pipelines with a focus on model selection and hyperparameter tuning. The integration of pipelines with GridSearchCV and the double-underscore parameter notation is covered in more depth than in this chapter. A useful complement to Chapter 18 when you reach hyperparameter tuning.
6. Feature Engineering and Selection: A Practical Approach for Predictive Models --- Max Kuhn and Kjell Johnson (2019)
Chapter 3, "A Review of the Predictive Modeling Process," argues forcefully for the pipeline approach from a statistical perspective. The discussion of why preprocessing must be inside the resampling loop (cross-validation) is the theoretical foundation for the practical recommendation in this chapter. The code is in R (using the recipes package, R's equivalent of scikit-learn pipelines), but the concepts are directly transferable.
Papers and Technical Articles
7. "Machine Learning: The High Interest Credit Card of Technical Debt" --- D. Sculley et al. (2014)
Presented at the SE4ML workshop at NIPS 2014; the paper that coined the term "ML technical debt." Section 4, "Pipeline Jungles," describes how ad-hoc preprocessing pipelines grow into unmaintainable tangles of scripts, glue code, and implicit dependencies. The paper argues that the cost of maintaining a production ML system far exceeds the cost of building the initial model. Pipelines, as presented in this chapter, are the primary mitigation for pipeline-jungle debt. Freely available from Google Research.
8. "Hidden Technical Debt in Machine Learning Systems" --- D. Sculley et al. (NeurIPS 2015)
The expanded, peer-reviewed version of item 7. Introduces the point that only a small fraction of a production ML system is the model itself --- the vast majority is data collection, feature extraction, preprocessing, serving, and monitoring infrastructure. The Pipeline abstraction directly addresses the "glue code" and "pipeline jungles" anti-patterns. One of the most-cited papers in ML engineering.
9. "TFX: A TensorFlow-Based Production-Scale Machine Learning Platform" --- D. Baylor et al. (KDD 2017)
The paper describing Google's production ML pipeline platform. While TFX is far more complex than scikit-learn Pipelines (distributed processing, data validation, model analysis), the core principles are identical: preprocessing must be a single, versioned, reproducible artifact, and the same transformations must run at training time and serving time. Reading this paper gives context for where scikit-learn Pipelines sit in the broader ML engineering ecosystem.
Practical Guides and Tutorials
10. "Column Transformer with Mixed Types" --- scikit-learn Example Gallery
A complete worked example showing how to build a ColumnTransformer that applies different preprocessing to numeric, categorical, and text features on a real dataset (Titanic). Includes pipeline visualization with set_config(display='diagram'). The example uses make_column_selector for dtype-based column routing. Available at scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html.
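The dtype-based routing that make_column_selector enables can be sketched like this; the toy columns are placeholders, not the gallery's Titanic features:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Select columns by dtype instead of listing names explicitly.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(), make_column_selector(dtype_include=object)),
])

df = pd.DataFrame({"fare": [7.25, 71.83], "sex": ["male", "female"]})
X = preprocess.fit_transform(df)
print(X.shape)  # one scaled numeric column plus two one-hot columns
```

The selector is evaluated at fit time, so new columns of a matching dtype are picked up automatically when the pipeline is refitted on different data.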
11. "Pipelines: Chaining a PCA and a Logistic Regression" --- scikit-learn Example Gallery
A simpler example demonstrating how Pipeline integrates with GridSearchCV for hyperparameter tuning across both preprocessing and model parameters. Shows the double-underscore notation for parameter grids. A bridge between this chapter and Chapter 18. Available at scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html.
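A compressed sketch of the pattern the example demonstrates, using the same PCA + LogisticRegression pairing; the specific grid values here are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# <step name>__<parameter name> routes each grid entry to its step,
# so preprocessing and model parameters are tuned jointly.
grid = GridSearchCV(pipe, {
    "pca__n_components": [10, 20],
    "clf__C": [0.1, 1.0],
}, cv=3)

X, y = load_digits(return_X_y=True)
grid.fit(X[:500], y[:500])  # subset to keep the sketch fast
print(grid.best_params_)
```

Because the whole pipeline is refitted inside each cross-validation fold, the PCA components are re-learned per fold and the search stays leakage-free.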
12. "How to Use joblib to Serialize and Deserialize scikit-learn Models" --- Various Sources
Multiple quality tutorials exist for joblib serialization patterns. The scikit-learn documentation covers the basics (scikit-learn.org/stable/model_persistence.html), including the security warning about loading untrusted files. For production patterns (versioning, checksums, cloud storage), search for "scikit-learn model persistence best practices." The MLflow documentation also covers model serialization as part of experiment tracking (Chapter 30).
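A minimal round trip with joblib's dump and load, per the basics covered in the scikit-learn persistence page; the file path and model here are illustrative:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)       # serialize the fitted estimator
restored = joblib.load(path)   # only load files from trusted sources

# The restored estimator reproduces the original's predictions.
print((restored.predict(X) == model.predict(X)).all())
```

In practice the same call serializes an entire fitted `Pipeline`, which is the point: one artifact captures preprocessing and model together.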
Tools and Libraries
13. sklearn-pandas Library --- Documentation
A library that provides DataFrameMapper, an alternative to ColumnTransformer with tighter pandas integration. Useful when you need to preserve DataFrame structure through the entire pipeline. The library predates scikit-learn's native ColumnTransformer (added in 0.20) and is less commonly used in new projects, but appears in legacy codebases. Documentation at github.com/scikit-learn-contrib/sklearn-pandas.
14. feature-engine Library --- Documentation
A scikit-learn-compatible library of feature engineering transformers: outlier capping, rare label encoding, cyclical feature encoding, mathematical transformations, and more. Every transformer follows the BaseEstimator/TransformerMixin contract and drops into a Pipeline. A useful complement to custom transformers when your preprocessing need is common enough to have a library solution. Documentation at feature-engine.trainindata.com.
15. joblib Library --- Documentation
The official documentation for joblib, covering dump, load, compression options, and parallel processing. The section on "persistence" explains why joblib is more efficient than pickle for numpy-heavy objects. The section on "parallel processing" is relevant to Chapter 28 (Large Datasets). Documentation at joblib.readthedocs.io.
How to Use This List
If you read one thing, read the scikit-learn "Common Pitfalls" page (item 3). It covers the data leakage and fit/transform mistakes that motivate the entire chapter, with concise examples.
If you want to deepen your understanding of custom transformers, read the scikit-learn developer guide (item 2) and then the feature-engine documentation (item 14) for examples of well-designed custom transformers.
If you want the production engineering perspective, read Sculley et al. (items 7 and 8). They explain why the investment in building proper pipelines pays off at scale --- and what happens to teams that skip it.
If you are preparing for Chapter 18 (Hyperparameter Tuning), read the PCA + Logistic Regression example (item 11) to see how pipelines integrate with GridSearchCV.
This reading list supports Chapter 10: Building Reproducible Data Pipelines. Return to the chapter to review concepts before diving in.