Learning Objectives
- Explain data leakage and describe how it leads to overoptimistic performance estimates
- Build scikit-learn Pipelines that chain preprocessing and modeling steps into a single reproducible object
- Use ColumnTransformer to apply different transformations to numeric and categorical features
- Implement GridSearchCV to systematically search for optimal hyperparameters
- Design a complete ML workflow from raw data to evaluated model, avoiding common pitfalls
- Save and load trained models using joblib for deployment and reproducibility
- Apply the complete workflow to the progressive project, producing a final ML pipeline
In This Chapter
- Chapter Overview
- 30.1 The Data Leakage Problem: Why Everything You've Done Is Slightly Wrong
- 30.2 Scikit-learn Pipelines: The Solution
- 30.3 ColumnTransformer: Handling Mixed Feature Types
- 30.4 Hyperparameter Tuning: Finding the Best Settings
- 30.5 Feature Engineering: Creating Better Inputs
- 30.6 The Complete ML Workflow
- 30.7 Saving and Loading Models
- 30.8 Reproducibility: Making Sure Others Can Repeat Your Work
- 30.9 End-to-End Example: The Complete Pipeline
- 30.10 Progressive Project: The Complete ML Pipeline
- 30.11 Common Pitfalls and How to Avoid Them
- 30.12 The Big Picture: What You've Built in Part V
- Summary
Chapter 30: The Machine Learning Workflow — Pipelines, Validation, and Putting It All Together
"If you can't reproduce it, you haven't proven it." — A principle borrowed from experimental science
Chapter Overview
Here's a confession: everything you've done in Chapters 25 through 29, while correct in isolation, has been slightly wrong when put together.
Not wrong in a way that produces errors. Wrong in a way that's more insidious: wrong in a way that makes your results look better than they actually are. Wrong in a way that could cause you to deploy a model that fails in the real world, even though it looked great on your laptop.
The problem is called data leakage, and until you learn to prevent it, every model you build is a house built on sand.
Data leakage happens when information from outside the training set sneaks into your model during training. The most common form is simple and subtle: you scale your features using the mean and standard deviation of the entire dataset (including the test set), then evaluate on that test set. The test set's statistics influenced the scaling, which influenced the model, which means your evaluation is slightly optimistic. In small datasets, "slightly" can mean the difference between "this model works" and "this model is mediocre."
This chapter fixes that problem — and several others — by teaching you to build pipelines: self-contained workflows that handle preprocessing, modeling, and evaluation in a single, reproducible object. By the end, you'll be building models the way professionals do: with clean boundaries between training and testing data, systematic hyperparameter tuning, and reproducible results that a colleague can run on their machine and get the same answers.
This is the capstone of Part V. Everything you've learned about models, metrics, and evaluation comes together here.
In this chapter, you will learn to:
- Explain data leakage and how it produces overoptimistic results (all paths)
- Build scikit-learn Pipelines that chain preprocessing and modeling (all paths)
- Use ColumnTransformer for mixed feature types (standard + deep dive paths)
- Implement GridSearchCV for hyperparameter tuning (all paths)
- Design a complete ML workflow from raw data to evaluated model (all paths)
- Save and load models using joblib (standard + deep dive paths)
- Apply the complete workflow to the progressive project (all paths)
30.1 The Data Leakage Problem: Why Everything You've Done Is Slightly Wrong
Let me show you the problem with a concrete example.
The Leaky Workflow
Here's the workflow you probably used in previous chapters:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
X, y = load_data()
# Step 1: Scale ALL the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # <-- Uses mean/std of ALL data
# Step 2: Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42
)
# Step 3: Train and evaluate
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
Did you catch the problem? Look at Step 1. The StandardScaler is fit on X — the entire dataset. It computes the mean and standard deviation using all samples, including those that will later become the test set. When you then evaluate on the test set, those test samples have been scaled using statistics that include their own values. The model has indirect knowledge of the test data.
This is data leakage: information from the test set has leaked into the training process.
Why It Matters
In practice, when you deploy a model, you won't know the test data's statistics — those are future data points you haven't seen yet. Your scaler should be fit only on training data, then applied (without refitting) to new data.
The leakage effect is often small — a fraction of a percentage point — but it can be larger with small datasets or high-dimensional data. More importantly, it's a conceptual error. If you build the habit of leaking, you'll eventually build a model that looks great in development and fails in production. And the gap between "works on my laptop" and "fails in the real world" can cost real money and real trust.
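If you want to see the effect yourself, here is a small synthetic experiment contrasting the two workflows. The dataset (make_classification with few samples and relatively many features) is chosen to make leakage easier to observe; on larger, well-behaved data the gap may be negligible or absent, so treat this as an illustration rather than a guarantee:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small, high-dimensional synthetic data makes the leakage gap easier to see
X, y = make_classification(n_samples=80, n_features=40, random_state=0)

# Leaky: scale everything first, then split
X_leaky = StandardScaler().fit_transform(X)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, test_size=0.3, random_state=0)
leaky_score = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# Correct: split first, fit the scaler on training data only
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(Xtr)
clean_score = LogisticRegression(max_iter=1000).fit(
    scaler.transform(Xtr), ytr).score(scaler.transform(Xte), yte)

print(f"Leaky: {leaky_score:.3f}  Correct: {clean_score:.3f}")
```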
The Correct Workflow
# Step 1: Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Step 2: Fit scaler on training data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
# Step 3: Train and evaluate
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")
Notice the critical difference: fit_transform on training data (learn the parameters and apply them), transform on test data (apply the same parameters without learning new ones).
This works, but it's fragile. You have to remember to do it correctly every time, and with cross-validation, you'd need to repeat this fit/transform pattern inside every fold. That's error-prone and tedious.
Enter: pipelines.
Check Your Understanding
- In the "leaky" workflow, what specific information from the test set leaks into the model?
- Why is scaler.transform(X_test) correct but scaler.fit_transform(X_test) wrong when evaluating?
- If you're using cross-validation (not just a single train/test split), how would data leakage affect your cross-validation scores?
30.2 Scikit-learn Pipelines: The Solution
A Pipeline in scikit-learn chains multiple processing steps — like scaling, encoding, and modeling — into a single object. When you call fit on the pipeline, it fits each step in order, passing the output of one step as the input to the next. When you call predict, it transforms the data through all preprocessing steps (without refitting them) and then makes the prediction.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
])
# This fits the scaler on X_train ONLY, then fits the model
pipe.fit(X_train, y_train)
# This transforms X_test using training statistics, then predicts
pipe.predict(X_test)
pipe.score(X_test, y_test)
Each step in the pipeline is a tuple of (name, transformer_or_estimator). You choose the names — they can be anything descriptive. The pipeline guarantees that:
- The scaler is fit only on training data
- The scaler's learned parameters (mean, std) are stored inside the pipeline
- When you call predict or score, the scaler transforms new data using the training parameters — no leakage
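You can inspect those stored parameters directly: a fitted pipeline's named_steps attribute exposes each component under the name you gave it. A small sketch on synthetic data (the feature values here are arbitrary):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)

# named_steps gives access to each fitted component by the name you chose
print(pipe.named_steps['scaler'].mean_)   # per-feature training means
print(pipe.named_steps['model'].coef_)    # fitted model coefficients
```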
Pipelines with Cross-Validation: The Real Win
The real power of pipelines shows up with cross-validation. Remember the data leakage risk with CV? If you scale before splitting into folds, every fold's training data has been influenced by the test fold's statistics. Pipelines prevent this automatically:
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
])
# The scaler is fit INSIDE each fold — no leakage!
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
Inside each cross-validation fold, the pipeline fits the scaler on only the training portion and transforms the test portion using those parameters. This happens automatically — you don't have to manage it manually. It's the right thing, done the easy way.
Adding More Steps
Pipelines can have as many steps as you need:
from sklearn.decomposition import PCA
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=5)),
('model', LogisticRegression(max_iter=1000))
])
Every step except the last must be a transformer (an object with fit and transform methods). The last step can be either a transformer or an estimator (an object with fit and predict methods). This design lets you chain any combination of preprocessing steps with any model.
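If naming every step feels like ceremony, the make_pipeline helper builds the same object and auto-names each step after its lowercased class:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Same three-step pipeline as above, with auto-generated step names
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     LogisticRegression(max_iter=1000))
print(list(pipe.named_steps))
# Step names are the lowercased class names:
# ['standardscaler', 'pca', 'logisticregression']
```

The trade-off: explicit names like 'model' make grid-search parameter strings (covered below) easier to read, so many people prefer the explicit Pipeline form.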
30.3 ColumnTransformer: Handling Mixed Feature Types
Real datasets have mixed feature types: some columns are numeric (GDP, age, temperature), and some are categorical (country, color, department). These need different preprocessing:
- Numeric features: Scale them (StandardScaler) or normalize them
- Categorical features: Encode them (OneHotEncoder or OrdinalEncoder)
ColumnTransformer lets you apply different transformations to different columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Define which columns get which treatment
numeric_features = ['gdp_per_capita', 'health_spending_pct',
'physicians_per_1000', 'literacy_rate']
categorical_features = ['region', 'income_group']
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
categorical_features)
])
Each transformer is a tuple of (name, transformer, columns). The ColumnTransformer applies each transformer to its specified columns and concatenates the results.
Building the Full Pipeline
Now combine the ColumnTransformer with a model:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
# Fit and evaluate — everything handled automatically
full_pipeline.fit(X_train, y_train)
full_pipeline.score(X_test, y_test)
# Cross-validate — no leakage, even with mixed features
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='f1')
This is the professional way to handle mixed feature types. The ColumnTransformer ensures that:
- Numeric features are scaled using training statistics only
- Categorical features are encoded based on categories seen in training only
- The handle_unknown='ignore' parameter tells the encoder to handle categories in new data that weren't in the training set (instead of crashing)
A Complete Preprocessing Example
Let's build a full pipeline for Elena's vaccination data, now including categorical features:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
health = pd.read_csv('global_health_indicators.csv')
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)
numeric_features = ['gdp_per_capita', 'health_spending_pct',
'physicians_per_1000', 'literacy_rate',
'urban_population_pct']
categorical_features = ['region', 'income_group']
X = health[numeric_features + categorical_features].dropna()
y = health.loc[X.index, 'high_coverage']
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
categorical_features)
])
pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
In about 20 lines, you've built a complete, leak-free pipeline that handles mixed feature types, trains a random forest, and evaluates it with cross-validation. This is the kind of code that professionals write.
30.4 Hyperparameter Tuning: Finding the Best Settings
Every model has hyperparameters — settings that you choose before training and that affect the model's behavior. For a random forest: n_estimators, max_depth, max_features, min_samples_leaf. For logistic regression: C (regularization strength). For a decision tree: max_depth, min_samples_split.
How do you find the best combination? You could try a few values by hand, but that's tedious and unsystematic. Grid search automates the process.
GridSearchCV
GridSearchCV tries every combination of hyperparameters you specify, evaluates each using cross-validation, and reports which combination performs best:
from sklearn.model_selection import GridSearchCV
# Define the pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(random_state=42))
])
# Define the parameter grid
# Note: pipeline parameter names use double underscores
param_grid = {
'model__n_estimators': [100, 200, 300],
'model__max_depth': [5, 8, 12, None],
'model__min_samples_leaf': [1, 5, 10]
}
# Search
grid_search = GridSearchCV(
pipeline,
param_grid,
cv=5,
scoring='f1',
n_jobs=-1, # Use all CPU cores
verbose=1 # Show progress
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.3f}")
A few important details:
The double-underscore naming convention: When parameters are inside a pipeline step, you reference them as stepname__parametername. So model__max_depth means "the max_depth parameter of the step named model."
The number of combinations: The grid above has 3 x 4 x 3 = 36 combinations. With 5-fold CV, that's 180 model fits. With n_jobs=-1, scikit-learn parallelizes this across your CPU cores. Still, large grids can take a while.
Using the best model: After fitting, grid_search.best_estimator_ is the pipeline fitted with the best parameters on the full training set:
# The best model is ready to use
best_model = grid_search.best_estimator_
print(f"Test accuracy: {best_model.score(X_test, y_test):.3f}")
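Beyond best_params_, the cv_results_ attribute records every combination tried; loading it into a DataFrame shows how close the runner-up settings were. A self-contained sketch on synthetic data with a deliberately tiny grid:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
gs = GridSearchCV(pipe, {'model__C': [0.1, 1, 10]}, cv=3, scoring='f1')
gs.fit(X, y)

# One row per hyperparameter combination, ranked by mean CV score
results = pd.DataFrame(gs.cv_results_)
print(results[['param_model__C', 'mean_test_score', 'rank_test_score']])
```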
Comparing Multiple Models with Grid Search
You can even search across different model types:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Approach: run separate grid searches for each model type
model_configs = {
'Logistic Regression': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', LogisticRegression(max_iter=1000))
]),
'params': {
'model__C': [0.01, 0.1, 1, 10, 100]
}
},
'Decision Tree': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', DecisionTreeClassifier(random_state=42))
]),
'params': {
'model__max_depth': [3, 5, 8, 12],
'model__min_samples_leaf': [1, 5, 10, 20]
}
},
'Random Forest': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(random_state=42))
]),
'params': {
'model__n_estimators': [100, 200, 300],
'model__max_depth': [5, 8, 12],
'model__min_samples_leaf': [1, 5, 10]
}
}
}
results = {}
for name, config in model_configs.items():
gs = GridSearchCV(config['pipeline'], config['params'],
cv=5, scoring='f1', n_jobs=-1)
gs.fit(X_train, y_train)
results[name] = {
'best_score': gs.best_score_,
'best_params': gs.best_params_,
'test_score': gs.best_estimator_.score(X_test, y_test)
}
print(f"\n{name}:")
print(f" Best CV F1: {gs.best_score_:.3f}")
print(f" Best params: {gs.best_params_}")
print(f" Test accuracy: {gs.best_estimator_.score(X_test, y_test):.3f}")
This gives you a complete, systematic model comparison: different models, different hyperparameters, all evaluated fairly with cross-validation, all leak-free.
RandomizedSearchCV: When the Grid Is Too Big
If your parameter grid has many options, the total number of combinations explodes. RandomizedSearchCV samples random combinations instead of trying all of them:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
'model__n_estimators': randint(50, 500),
'model__max_depth': randint(3, 20),
'model__min_samples_leaf': randint(1, 20),
'model__max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
pipeline,
param_distributions,
n_iter=50, # Try 50 random combinations
cv=5,
scoring='f1',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
RandomizedSearchCV is often more efficient than GridSearchCV because it explores more of the parameter space without being constrained to a grid. In practice, it finds good hyperparameters faster, especially when some parameters matter much more than others.
Check Your Understanding
- Why do pipeline parameter names use double underscores (e.g., model__max_depth)?
- A grid with 4 parameters, each with 5 values, and 5-fold CV requires how many model fits?
- When would you choose RandomizedSearchCV over GridSearchCV?
30.5 Feature Engineering: Creating Better Inputs
We've focused on model selection and hyperparameter tuning, but sometimes the biggest improvement comes from giving the model better inputs. Feature engineering is the art of creating new features from existing ones that help the model learn patterns more easily.
Common Feature Engineering Techniques
Interactions: Multiply features together to capture joint effects.
# GDP per capita alone and health spending alone might not predict well,
# but their product might capture "total health investment"
health['gdp_x_health_spending'] = (
health['gdp_per_capita'] * health['health_spending_pct']
)
Binning: Convert continuous features into categories.
# Convert GDP into income groups
health['income_tier'] = pd.cut(health['gdp_per_capita'],
bins=[0, 1000, 4000, 12000, 100000],
labels=['Low', 'Lower-Mid', 'Upper-Mid', 'High'])
Log transforms: Compress skewed distributions.
import numpy as np
health['log_gdp'] = np.log1p(health['gdp_per_capita'])
Ratios: Create meaningful ratios from existing features.
health['physicians_per_spending'] = (
health['physicians_per_1000'] / health['health_spending_pct']
)
Custom Transformers in Pipelines
You can create custom transformation steps that fit into your pipeline:
from sklearn.base import BaseEstimator, TransformerMixin
class LogTransformer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self # Nothing to learn
def transform(self, X):
return np.log1p(np.abs(X))
# Use it in a pipeline
pipe = Pipeline([
('log_transform', LogTransformer()),
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
])
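For a stateless transform like this, scikit-learn's built-in FunctionTransformer avoids writing a custom class entirely. One caveat: np.log1p on its own assumes non-negative inputs, unlike the abs-wrapping class above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression

# FunctionTransformer wraps any stateless function as a pipeline step.
# Assumes features are non-negative (log1p alone, no abs).
pipe = Pipeline([
    ('log_transform', FunctionTransformer(np.log1p)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
```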
Feature engineering is as much art as science — domain knowledge helps you guess which new features might be useful, and cross-validated evaluation tells you whether they actually are.
30.6 The Complete ML Workflow
Let's formalize the complete machine learning workflow, from raw data to deployed model. This is the framework you should follow for every modeling project:
Step 1: DEFINE THE PROBLEM
- What question are you answering?
- What metric matters? (Based on costs of different errors)
- What's the baseline performance? (Majority class, simple heuristic)
Step 2: PREPARE THE DATA
- Load and inspect
- Handle missing values
- Split: hold out final test set (20-30%)
- Identify numeric vs. categorical features
Step 3: BUILD THE PIPELINE
- ColumnTransformer for mixed preprocessing
- Model as the final step
- Everything inside the pipeline — no leakage
Step 4: TUNE HYPERPARAMETERS
- GridSearchCV or RandomizedSearchCV
- Use cross-validation on training data only
- Choose scoring metric that matches your problem
Step 5: EVALUATE ON THE TEST SET
- Touch the test set exactly ONCE
- Report appropriate metrics (not just accuracy)
- Generate confusion matrix, classification report, ROC curve
Step 6: INTERPRET AND COMMUNICATE
- Feature importance (for tree-based models)
- Translate metrics into business language
- Document limitations and assumptions
Step 7: SAVE AND DEPLOY
- Save the pipeline with joblib
- Document the expected input format
- Plan for monitoring and retraining
Each step has a clear purpose, and the boundaries between steps prevent the most common sources of error.
The Critical Rule: Test Set Discipline
The test set should be treated like a sealed envelope. You prepare it at the beginning (Step 2) and open it exactly once (Step 5). All model selection, hyperparameter tuning, and feature engineering decisions are made using cross-validation on the training set.
If you touch the test set during development — adjusting your model until the test score improves — you've converted the test set into a validation set. Your final performance estimate will be optimistically biased, and you'll have no reliable way to know how the model will perform on truly new data.
30.7 Saving and Loading Models
Once you've trained a pipeline, you need to save it so you can use it later without retraining. Joblib is scikit-learn's recommended tool for this:
import joblib
# Save the entire pipeline (preprocessor + model)
joblib.dump(best_model, 'vaccination_pipeline.joblib')
# Load it later
loaded_pipeline = joblib.load('vaccination_pipeline.joblib')
# Use it on new data — preprocessing happens automatically
new_predictions = loaded_pipeline.predict(new_data)
The saved pipeline includes everything: the fitted scaler's learned mean and standard deviation, the encoder's learned categories, and the model's learned parameters. You can send this file to a colleague, and they can make predictions without access to your training data.
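A quick sanity check worth doing after a save/load round trip: the loaded pipeline should reproduce the original's predictions exactly. A sketch using a temporary file and synthetic data:

```python
import os
import tempfile
import numpy as np
import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pipe.joblib')
    joblib.dump(pipe, path)
    loaded = joblib.load(path)
    # Round-tripped pipeline must agree with the original
    assert np.array_equal(pipe.predict(X), loaded.predict(X))
```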
Pickle vs. Joblib
You might also see pickle used for saving models:
import pickle
with open('model.pkl', 'wb') as f:
pickle.dump(best_model, f)
with open('model.pkl', 'rb') as f:
loaded_model = pickle.load(f)
Both work, but joblib is preferred for scikit-learn models because it handles large NumPy arrays more efficiently. Use joblib by default.
A Warning About Security
Never load a pickle or joblib file from an untrusted source. These files can contain arbitrary executable code that runs when loaded. Only load models that you or a trusted colleague created.
30.8 Reproducibility: Making Sure Others Can Repeat Your Work
Reproducibility means that someone else (or future you) can run your code and get exactly the same results. This is harder than it sounds because machine learning involves randomness: random splits, random initialization, random sampling.
Set Random Seeds Everywhere
import numpy as np
RANDOM_STATE = 42
# Use the same seed for all random operations
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=RANDOM_STATE
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(random_state=RANDOM_STATE))
])
grid_search = GridSearchCV(
pipeline, param_grid, cv=StratifiedKFold(
n_splits=5, shuffle=True, random_state=RANDOM_STATE
)
)
Document Your Environment
# Print library versions
import sklearn
import pandas as pd
import numpy as np
print(f"scikit-learn: {sklearn.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
Different versions of scikit-learn can produce different results, even with the same random seed. Recording versions helps others reproduce your exact environment.
Save your requirements
# requirements.txt
scikit-learn==1.4.0
pandas==2.2.0
numpy==1.26.0
matplotlib==3.8.0
joblib==1.3.2
30.9 End-to-End Example: The Complete Pipeline
Let's bring everything together in one complete example. This is the pattern you should use for every modeling project going forward.
import pandas as pd
import numpy as np
from sklearn.model_selection import (train_test_split, GridSearchCV,
StratifiedKFold)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import joblib
RANDOM_STATE = 42
# === STEP 1: Load and prepare data ===
health = pd.read_csv('global_health_indicators.csv')
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)
numeric_features = ['gdp_per_capita', 'health_spending_pct',
'physicians_per_1000', 'literacy_rate',
'urban_population_pct']
categorical_features = ['region', 'income_group']
X = health[numeric_features + categorical_features].dropna()
y = health.loc[X.index, 'high_coverage']
# === STEP 2: Hold out final test set ===
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
# === STEP 3: Define preprocessing ===
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
categorical_features)
])
# === STEP 4: Define pipeline and parameter grid ===
pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(random_state=RANDOM_STATE))
])
param_grid = {
'model__n_estimators': [100, 200, 300],
'model__max_depth': [5, 8, 12],
'model__min_samples_leaf': [1, 5, 10]
}
# === STEP 5: Grid search with cross-validation ===
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
grid_search = GridSearchCV(
pipeline, param_grid, cv=cv, scoring='f1',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
# === STEP 6: Final evaluation on test set ===
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
y_proba = best_pipeline.predict_proba(X_test)[:, 1]
print("\nFinal Test Set Evaluation:")
print(classification_report(y_test, y_pred,
target_names=['Low Coverage', 'High Coverage']))
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")
# === STEP 7: Save the pipeline ===
joblib.dump(best_pipeline, 'vaccination_coverage_pipeline.joblib')
print("Pipeline saved successfully.")
# === Later: Load and use ===
# loaded_pipeline = joblib.load('vaccination_coverage_pipeline.joblib')
# predictions = loaded_pipeline.predict(new_data)
That's the complete workflow: load, split, preprocess, tune, evaluate, save. Every step is clean, reproducible, and leak-free. This is production-quality code.
30.10 Progressive Project: The Complete ML Pipeline
This is the capstone of your progressive project for Part V. You've been building toward this since Chapter 25.
Project Task: Build a Complete ML Pipeline for Vaccination Coverage
# Use the end-to-end example from Section 30.9 as your template.
# Add these enhancements:
# 1. Compare at least 3 model types with grid search
model_configs = {
'Logistic Regression': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', LogisticRegression(max_iter=1000))
]),
'params': {'model__C': [0.01, 0.1, 1, 10]}
},
'Decision Tree': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', DecisionTreeClassifier(random_state=42))
]),
'params': {
'model__max_depth': [3, 5, 8],
'model__min_samples_leaf': [5, 10, 20]
}
},
'Random Forest': {
'pipeline': Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(random_state=42))
]),
'params': {
'model__n_estimators': [100, 200],
'model__max_depth': [5, 8, 12],
'model__min_samples_leaf': [1, 5]
}
}
}
# 2. Run grid search for each model and compare
best_models = {}
for name, config in model_configs.items():
gs = GridSearchCV(config['pipeline'], config['params'],
cv=cv, scoring='f1', n_jobs=-1)
gs.fit(X_train, y_train)
best_models[name] = gs
print(f"{name}: best CV F1 = {gs.best_score_:.3f}")
print(f" Params: {gs.best_params_}")
# 3. Select the overall best model
overall_best_name = max(best_models,
key=lambda k: best_models[k].best_score_)
overall_best = best_models[overall_best_name].best_estimator_
print(f"\nOverall best: {overall_best_name}")
print(f"Final test evaluation:")
print(classification_report(
y_test, overall_best.predict(X_test),
target_names=['Low Coverage', 'High Coverage']
))
# 4. Save the final pipeline
joblib.dump(overall_best, 'project_final_pipeline.joblib')
What to write in your project notebook:
- Document the complete workflow — from data loading to saved model.
- Present a comparison table showing each model's best cross-validated F1 and its tuned hyperparameters.
- Show the final model's confusion matrix and classification report on the test set.
- Explain why you chose the final model — what metric did you optimize, and why?
- Reflect: compare the pipeline approach to the "piece-by-piece" approach from earlier chapters. What problems does the pipeline prevent?
Milestone Check: Your project notebook should now contain a complete, reproducible ML pipeline that handles preprocessing, compares models, tunes hyperparameters, and produces a saved model. This is the foundation for the professional work in Part VI — communicating results, building portfolios, and deploying real projects.
30.11 Common Pitfalls and How to Avoid Them
Pitfall 1: Preprocessing Outside the Pipeline
If you scale features, encode categories, or create features outside the pipeline, you risk data leakage. The rule is simple: everything that learns from data goes inside the pipeline.
Pitfall 2: Using Test Set Results to Make Decisions
If you see the test set results and then go back to try a different model, you've turned the test set into a validation set. Use cross-validation for all development decisions. Touch the test set once.
Pitfall 3: Ignoring the Scoring Metric
GridSearchCV optimizes whatever you put in the scoring parameter. If you forget to set it, it defaults to accuracy — which might not be what you want. Always explicitly set the scoring metric to match your problem.
Pitfall 4: Forgetting handle_unknown in OneHotEncoder
If your test data contains a category that wasn't in the training data, OneHotEncoder will crash by default. Always set handle_unknown='ignore' (or 'infrequent_if_exist') to handle this gracefully.
Pitfall 5: Not Setting Random Seeds
Without random seeds, your results change every time you run the code. Set random_state in: train_test_split, your model, StratifiedKFold, and GridSearchCV. Use the same seed everywhere for reproducibility.
Pitfall 6: Overfitting the Hyperparameters
If you search over a very fine grid with many options, you might find hyperparameters that happen to work well on the cross-validation folds but don't generalize. This is called "overfitting the hyperparameters." Keep your grids reasonable (3-5 values per parameter) and use the test set as a final sanity check.
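One guard against this, worth knowing even if you don't use it on every project, is nested cross-validation: an inner loop tunes hyperparameters while an outer loop measures performance, so the reported score never comes from folds the tuning saw. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])

# Inner loop: tunes C. Outer loop: measures the tuned model's performance.
inner = GridSearchCV(pipe, {'model__C': [0.1, 1, 10]}, cv=3, scoring='f1')
outer_scores = cross_val_score(inner, X, y, cv=5, scoring='f1')
print(f"Nested CV F1: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```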
30.12 The Big Picture: What You've Built in Part V
Let's step back and appreciate what you've accomplished across Chapters 25 through 30:
| Chapter | What You Learned | Key Skill |
|---|---|---|
| 25 | What models are and how they learn | Conceptual foundation |
| 26 | Linear regression | Your first predictive model |
| 27 | Logistic regression | Classification and probability |
| 28 | Decision trees and random forests | Non-linear models and ensembles |
| 29 | Model evaluation | Proper metrics and cross-validation |
| 30 | ML workflow and pipelines | Putting it all together properly |
You started Part V barely knowing what a model was. Now you can:
- Build three types of models (linear regression, logistic regression, tree-based)
- Evaluate them with appropriate metrics
- Compare them fairly using cross-validation
- Tune their hyperparameters systematically
- Package everything in a pipeline that prevents data leakage
- Save and share your work reproducibly
That's not a toy skill set. That's the foundation of professional machine learning. The models will get fancier — neural networks, gradient boosting, deep learning — but the workflow you learned in this chapter never changes. Every production model at every company goes through some version of: preprocess, train, tune, evaluate, deploy. You know how to do that now.
In Part VI, you'll learn to communicate your findings, think about ethics, collaborate with others, and build a portfolio. The technical skills from Part V are the engine; Part VI teaches you to drive.
Summary
Data leakage occurs when information from outside the training set influences model training, leading to overoptimistic performance estimates. Scikit-learn pipelines prevent leakage by ensuring that preprocessing steps (scaling, encoding) are fit only on training data and applied without refitting to test data.
ColumnTransformer handles mixed feature types by applying different transformations to numeric and categorical columns within a single pipeline. GridSearchCV and RandomizedSearchCV automate hyperparameter tuning using cross-validation, and the best_estimator_ attribute provides the fully trained pipeline ready for deployment.
The complete ML workflow follows a disciplined sequence: define the problem, prepare data, build the pipeline, tune hyperparameters on the training set, evaluate once on the held-out test set, interpret results, and save the model. Following this workflow consistently is what separates professional data science from amateur experimentation.
Part V Complete. You've built your first models, learned to evaluate them properly, and packaged everything into reproducible pipelines. In Part VI, you'll learn to communicate your findings, consider the ethical implications of your work, and build the kind of portfolio that launches careers. The hardest technical work is behind you. Now let's learn to make it matter.