Chapter 30 Exercises: The Machine Learning Workflow

How to use these exercises: This is the capstone exercise set for Part V. The conceptual questions test your understanding of workflow design and data leakage. The applied exercises have you building complete pipelines from scratch. The real-world and synthesis sections ask you to design workflows for novel problems. Every code exercise should be done with pipelines — no more preprocessing outside the pipeline.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension


Part A: Conceptual Understanding ⭐


Exercise 30.1: Spot the leakage

For each of the following workflows, identify whether data leakage occurs and explain how:

  1. You use StandardScaler().fit_transform(X) on the full dataset, then split into train/test.
  2. You split into train/test, then use StandardScaler().fit_transform(X_train) and StandardScaler().transform(X_test).
  3. You compute the median of a feature across the full dataset to fill missing values, then split into train/test.
  4. You split into train/test, then perform feature selection on the training set using SelectKBest, and then apply the same selected features to the test set.
Guidance 1. **Leakage.** The scaler learns the mean/std from all data, including test samples. Test data statistics influence the training. 2. **No leakage, but a bug.** Only training statistics are used, so nothing leaks — however, `StandardScaler().transform(X_test)` creates a brand-new, unfitted scaler, which raises a NotFittedError. The fix is to fit one scaler on X_train and reuse that *same* fitted object to transform X_test. 3. **Leakage.** The median includes test data values. Test data influences the imputation applied to training data. 4. **No leakage.** Feature selection is performed only on training data, and the same feature subset is applied to the test set. The test set doesn't influence which features are selected.
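The difference between cases 1 and 2 can be made concrete with a short sketch (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).normal(size=(100, 3))

# Case 1 (leaky): the scaler sees the whole dataset before any split
X_all_scaled = StandardScaler().fit_transform(X)

# Case 2 (correct): fit on the training portion only, reuse the SAME scaler
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)   # statistics come from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same fitted object, no refit
```

The key point of case 2 is that one fitted scaler object serves both splits — never call fit (or fit_transform) a second time on test data.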

Exercise 30.2: Pipeline benefits

List four specific problems that scikit-learn Pipeline solves compared to manually chaining preprocessing and modeling steps. For each problem, give a concrete example of what could go wrong without a pipeline.

Guidance 1. **Data leakage prevention**: Without a pipeline, you might fit the scaler on all data instead of just training data. 2. **Cross-validation consistency**: Without a pipeline, you'd need to manually refit the scaler inside each CV fold — easy to forget. 3. **Reproducibility**: A pipeline stores all fitted transformers together, making it easy to save and reload the entire workflow. 4. **Deployment simplicity**: A single pipeline object can preprocess new data and make predictions — you don't need to remember which transformations to apply in which order.

Exercise 30.3: Parameter naming conventions

In a pipeline with this structure:

pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])),
    ('model', RandomForestClassifier())
])

Write the full parameter names for: (a) the number of trees in the random forest, (b) the drop parameter of the OneHotEncoder. Explain the double-underscore convention.

Guidance (a) `model__n_estimators` — "model" is the step name, "n_estimators" is the parameter name. (b) `preprocessor__cat__drop` — "preprocessor" is the ColumnTransformer step, "cat" is the OneHotEncoder within it, "drop" is the parameter. The double-underscore convention allows GridSearchCV to reach into nested pipeline components. Each `__` traverses one level of nesting. This is how scikit-learn's parameter access works — it splits on `__` to find the right component.
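The double-underscore traversal can be checked directly with get_params/set_params — a minimal sketch, with placeholder column indices standing in for the real feature lists:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numeric_features, categorical_features = [0, 1], [2]   # placeholder indices

pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])),
    ('model', RandomForestClassifier())
])

# Each '__' descends one level of nesting
pipe.set_params(model__n_estimators=300,
                preprocessor__cat__drop='first')
```

GridSearchCV uses exactly this set_params mechanism when it tries each parameter combination.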

Exercise 30.4: Grid search math

You have a parameter grid with:

- n_estimators: [100, 200, 300, 400]
- max_depth: [3, 5, 8, 12, None]
- min_samples_leaf: [1, 5, 10]

Using 5-fold cross-validation, how many total model fits will GridSearchCV perform? If each fit takes 2 seconds, how long will the search take? How would RandomizedSearchCV with n_iter=30 change this?

Guidance GridSearchCV: 4 x 5 x 3 = 60 combinations x 5 folds = 300 model fits. At 2 seconds each = 600 seconds = 10 minutes (without parallelization). RandomizedSearchCV with n_iter=30: 30 combinations x 5 folds = 150 fits = 300 seconds = 5 minutes. Half the time, covering half the combinations. In practice, RandomizedSearchCV often finds similar-quality hyperparameters because it samples more diverse combinations across the full parameter space.
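The combination count can be verified with scikit-learn's ParameterGrid helper:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [3, 5, 8, 12, None],
    'min_samples_leaf': [1, 5, 10],
}

n_combos = len(ParameterGrid(param_grid))   # 4 * 5 * 3 = 60
n_fits = n_combos * 5                       # times 5 CV folds
print(n_combos, n_fits)
```

This is a handy sanity check before launching a search that might run for hours.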

Exercise 30.5: The test set rule

Explain the "sealed envelope" principle for test sets. Then describe what happens if you violate it — specifically, how the test set becomes a validation set and why your reported performance will be optimistically biased.

Guidance The sealed envelope principle: set aside a test set at the beginning and don't touch it until you've finalized every modeling decision. All model selection, feature engineering, and hyperparameter tuning should be done using cross-validation on the training set. If you violate it — repeatedly evaluating on the test set and adjusting your approach based on what you see — you're implicitly fitting to the test set. You might choose a model that happens to do well on this particular test set, not on new data in general. Your reported test performance becomes a biased estimate (too optimistic) because the test set influenced your choices. You've effectively "trained on" the test set through your decision-making process, even if the model never directly saw it during fitting.

Exercise 30.6: Reproducibility checklist

Create a checklist of at least 6 things you should do to make a machine learning analysis reproducible. For each item, explain what would go wrong if you skip it.

Guidance 1. **Set random seeds everywhere** — different results each run without them. 2. **Record library versions** — different scikit-learn versions may produce different results. 3. **Document the data source and date** — data may change over time. 4. **Save the trained pipeline** — so you can reload and verify without retraining. 5. **Include a requirements.txt** — so others can install the same environment. 6. **Version control your code** (git) — so you can track what changed when results change. 7. **Document preprocessing decisions** — so others know why you made specific choices. 8. **Save the train/test split indices** — so others can reproduce the exact split.
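Item 8 takes only a few lines in practice — a sketch with toy data (the `train_idx.npy` filename is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # toy feature matrix

# Split index positions rather than the data itself, then save them
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.25,
                                       random_state=42)
np.save('train_idx.npy', train_idx)   # reload later with np.load(...)

X_train, X_test = X[train_idx], X[test_idx]
```

Saving the indices (not just the seed) guards against the split changing if library internals or the row order of the data change.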

Part B: Applied Exercises ⭐⭐


Exercise 30.7: Your first pipeline

Build a simple pipeline that chains StandardScaler and LogisticRegression. Train it on the Breast Cancer dataset and evaluate using 5-fold cross-validation. Then build the same model WITHOUT a pipeline (scaling manually) and verify that the cross-validation scores are slightly different (because of leakage in the manual approach).

Guidance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target

# With pipeline (correct)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=5000))])
scores_pipe = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

# Without pipeline (leaky)
X_scaled = StandardScaler().fit_transform(X)
lr = LogisticRegression(max_iter=5000)
scores_leaky = cross_val_score(lr, X_scaled, y, cv=5, scoring='accuracy')

print(f"Pipeline: {scores_pipe.mean():.4f}")
print(f"Leaky:    {scores_leaky.mean():.4f}")
The leaky version will typically score slightly higher: the scaler was fit on the full dataset, so each cross-validation fold's held-out statistics leaked into the preprocessing.

Exercise 30.8: ColumnTransformer practice

Create a ColumnTransformer that applies StandardScaler to ['sepal length (cm)', 'sepal width (cm)'] and OneHotEncoder to a manually created categorical column (add a random 'color' column with values 'red', 'blue', 'green' to the Iris dataset). Fit the transformer on training data and examine the output.

Guidance
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

iris = load_iris(as_frame=True)
df = iris.frame
np.random.seed(42)  # reproducible color assignment
df['color'] = np.random.choice(['red', 'blue', 'green'], size=len(df))

numeric_cols = ['sepal length (cm)', 'sepal width (cm)']
cat_cols = ['color']

ct = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), cat_cols)
])

X_transformed = ct.fit_transform(df[numeric_cols + cat_cols])
print(f"Original shape: {df[numeric_cols + cat_cols].shape}")
print(f"Transformed shape: {X_transformed.shape}")

Exercise 30.9: GridSearchCV

Using the Breast Cancer dataset, build a pipeline with StandardScaler and RandomForestClassifier. Use GridSearchCV to search over n_estimators [50, 100, 200] and max_depth [3, 5, 8, None]. Report the best parameters and the best cross-validated F1 score.

Guidance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

# X, y from load_breast_cancer as in Exercise 30.7
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [3, 5, 8, None]
}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1', n_jobs=-1)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best F1: {gs.best_score_:.3f}")

Exercise 30.10: Model comparison pipeline

Build three separate pipelines (logistic regression, decision tree, random forest) for the Breast Cancer dataset. Run GridSearchCV for each with appropriate parameter grids. Compare the best cross-validated F1 scores and select the overall best model. Evaluate the winner on the test set.

Guidance Follow the model_configs pattern from Section 30.4. Define separate pipelines and parameter grids for each model. Run GridSearchCV for each, compare `best_score_`, and evaluate the winner with `classification_report` on the test set. Remember: the test set is only touched once, at the very end.
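A hedged sketch of that loop (the exact model_configs structure in Section 30.4 may differ; the tiny grids here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# (name, estimator, grid) triples
model_configs = [
    ('logreg', LogisticRegression(max_iter=5000), {'model__C': [0.1, 1, 10]}),
    ('tree', DecisionTreeClassifier(random_state=42), {'model__max_depth': [3, 5, None]}),
    ('forest', RandomForestClassifier(random_state=42), {'model__n_estimators': [100, 200]}),
]

best_name, best_search = None, None
for name, model, grid in model_configs:
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    gs = GridSearchCV(pipe, grid, cv=5, scoring='f1', n_jobs=-1)
    gs.fit(X_train, y_train)
    print(f"{name}: best CV F1 = {gs.best_score_:.3f}")
    if best_search is None or gs.best_score_ > best_search.best_score_:
        best_name, best_search = name, gs

# Only now, with the winner chosen, touch the test set — once
print(best_name, best_search.score(X_test, y_test))
```

Note that model selection happens entirely on cross-validated training scores; the final line is the single test-set evaluation.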

Exercise 30.11: RandomizedSearchCV

Repeat Exercise 30.9 using RandomizedSearchCV instead of GridSearchCV, with n_iter=15 and parameter distributions from scipy.stats. Compare the best found parameters and score to the GridSearchCV results. Which approach was faster? Did they find similar hyperparameters?

Guidance
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'model__n_estimators': randint(50, 300),
    'model__max_depth': [3, 5, 8, 12, None],
    'model__min_samples_leaf': randint(1, 20)
}

rs = RandomizedSearchCV(pipe, param_dist, n_iter=15, cv=5,
                        scoring='f1', random_state=42, n_jobs=-1)
rs.fit(X_train, y_train)
print(f"Best params: {rs.best_params_}")
print(f"Best F1: {rs.best_score_:.3f}")

Exercise 30.12: Save and load

Train a pipeline on the Breast Cancer dataset, save it using joblib, load it back, and verify that predictions from the loaded model match the original. Print the first 10 predictions from each to confirm they're identical.

Guidance
import joblib

pipe.fit(X_train, y_train)
original_preds = pipe.predict(X_test)

joblib.dump(pipe, 'breast_cancer_pipe.joblib')
loaded_pipe = joblib.load('breast_cancer_pipe.joblib')
loaded_preds = loaded_pipe.predict(X_test)

print(f"Predictions match: {(original_preds == loaded_preds).all()}")

Exercise 30.13: Missing value handling in pipelines

The Breast Cancer dataset has no missing values, but real data does. Create a pipeline that includes SimpleImputer (from sklearn.impute) to handle missing values before scaling. Test it by manually introducing NaN values into 5% of the training data.

Guidance
from sklearn.impute import SimpleImputer

pipe_with_imputer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000))
])

# Introduce missing values reproducibly
rng = np.random.default_rng(42)
X_train_missing = X_train.copy()
mask = rng.random(X_train_missing.shape) < 0.05
X_train_missing[mask] = np.nan

pipe_with_imputer.fit(X_train_missing, y_train)
The imputer learns the median from the training data (ignoring the missing entries) and uses it to fill NaNs. In the pipeline, this happens before scaling, so the scaler only sees complete data.

Exercise 30.14: Examining grid search results

After running a GridSearchCV, examine grid_search.cv_results_ to create a heatmap of F1 scores across max_depth and n_estimators. Which hyperparameter has a larger effect on performance? Use pd.DataFrame(grid_search.cv_results_) to explore the results.

Guidance
results_df = pd.DataFrame(grid_search.cv_results_)
# Look at columns: 'param_model__max_depth', 'param_model__n_estimators', 'mean_test_score'
# Pivot and create a heatmap with seaborn
This exercise teaches you to go beyond just the best parameters — understanding how performance varies across the grid helps you understand which hyperparameters actually matter.
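A minimal self-contained version of the pivot step (a deliberately small grid so it runs quickly; the seaborn call is left as a comment):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', RandomForestClassifier(random_state=42))])
grid = {'model__n_estimators': [50, 100], 'model__max_depth': [3, 5]}
gs = GridSearchCV(pipe, grid, cv=3, scoring='f1', n_jobs=-1).fit(X, y)

# One row per parameter combination; pivot into a depth-by-trees table
results = pd.DataFrame(gs.cv_results_)
heat = results.pivot(index='param_model__max_depth',
                     columns='param_model__n_estimators',
                     values='mean_test_score')
print(heat.round(3))
# sns.heatmap(heat, annot=True) would turn this table into the plot
```

Reading down a column vs. across a row of the pivoted table shows which hyperparameter moves the score more.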

Part C: Real-World Applications ⭐⭐⭐


Exercise 30.15: Elena's complete pipeline

Elena has a dataset with numeric features (GDP, health spending, physicians, literacy rate, urban population) and categorical features (region: 'Africa', 'Americas', 'Europe', 'Asia', 'Oceania' and income_group: 'Low', 'Lower-Middle', 'Upper-Middle', 'High'). Design and build a complete pipeline that:

1. Handles both feature types with ColumnTransformer
2. Trains a random forest
3. Tunes hyperparameters with GridSearchCV
4. Evaluates on a held-out test set

Write the full code, including data splitting, pipeline definition, grid search, and evaluation.

Guidance Follow the end-to-end example from Section 30.9. The key is defining the ColumnTransformer correctly with separate transformers for numeric and categorical features. Use `handle_unknown='ignore'` in OneHotEncoder in case the test set contains a region not seen in training.
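A skeleton of the pipeline definition, with made-up column names standing in for Elena's real ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Column names are illustrative — match them to the real DataFrame
numeric_features = ['gdp', 'health_spending', 'physicians',
                    'literacy_rate', 'urban_population']
categorical_features = ['region', 'income_group']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    # handle_unknown='ignore' keeps predict from failing on unseen categories
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipe = Pipeline([('preprocessor', preprocessor),
                 ('model', RandomForestClassifier(random_state=42))])
```

From here, split the data, wrap `pipe` in GridSearchCV, and evaluate the best estimator once on the test set.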

Exercise 30.16: Marcus's sales prediction pipeline

Marcus wants to predict daily sales using features: day_of_week (categorical), temperature (numeric), is_holiday (binary/numeric), month (categorical), and nearby_event (binary/numeric). Design the pipeline structure (you don't need actual data — describe each step in the pipeline and why it's needed).

Guidance Pipeline structure: 1. `ColumnTransformer`: StandardScaler for `temperature`; OneHotEncoder for `day_of_week` and `month`; `is_holiday` and `nearby_event` are already binary, so passthrough or no transformation needed. 2. `RandomForestRegressor` (regression, not classification, since sales is continuous). 3. `GridSearchCV` with `scoring='neg_mean_absolute_error'` (regression metric). Note: for regression problems, scikit-learn's scoring convention uses negative values for error metrics (so higher = better), hence `neg_mean_absolute_error`.
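The structure above can be sketched with ColumnTransformer's 'passthrough' option, using a tiny fabricated sample just to show the pipeline fits end to end:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['temperature']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['day_of_week', 'month']),
    ('binary', 'passthrough', ['is_holiday', 'nearby_event'])  # already 0/1
])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('model', RandomForestRegressor(random_state=42))])

# Fabricated rows (values are illustrative)
df = pd.DataFrame({'day_of_week': ['Mon', 'Tue', 'Sat'],
                   'temperature': [18.0, 22.5, 15.0],
                   'is_holiday': [0, 0, 1],
                   'month': ['Jan', 'Jan', 'Feb'],
                   'nearby_event': [1, 0, 0]})
sales = [420.0, 510.0, 380.0]
pipe.fit(df, sales)
```

The 'passthrough' entry forwards the binary columns untouched, which is exactly what the guidance recommends for features that are already 0/1.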

Exercise 30.17: Workflow critique

A colleague shows you this code and asks if it's correct:

# Scale features
X_scaled = StandardScaler().fit_transform(X)

# Feature selection
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=5)
X_selected = selector.fit_transform(X_scaled, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

# Train and evaluate
model = RandomForestClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Identify all the problems with this workflow and rewrite it using a proper pipeline.

Guidance Problems: (1) Scaling before splitting — data leakage. (2) Feature selection before splitting — leakage (SelectKBest uses target variable y, which includes test labels). (3) No random_state — not reproducible. (4) Only reporting accuracy — may not be the right metric. Correct version:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=5)),
    ('model', RandomForestClassifier(random_state=42))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe.fit(X_train, y_train)

Exercise 30.18: Deployment scenario

You've built a pipeline that predicts customer churn and saved it with joblib. A software engineer asks: "What do I need to know to use this model in our web application?" Write a brief deployment guide covering: (1) what input the model expects, (2) what output it produces, (3) what happens if a new category appears that wasn't in the training data, and (4) when the model should be retrained.

Guidance A strong answer covers: (1) Input: a DataFrame with specific column names and types. (2) Output: predicted class (0/1) and/or probability of churn. (3) `handle_unknown='ignore'` in OneHotEncoder handles unseen categories by assigning zeros to all one-hot columns. (4) Retrain when performance degrades on monitored data, or on a regular schedule (quarterly), or when the data distribution changes significantly.
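A sketch of the hand-off the engineer might receive — a stand-in pipeline trained on made-up columns, plus a wrapper function (all names here are illustrative, not the real churn model):

```python
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in for the real churn pipeline: column names and data are made up
train = pd.DataFrame({'monthly_charges': [20.0, 80.0, 55.0, 90.0],
                      'tenure_months': [40, 2, 30, 1]})
churned = [0, 1, 0, 1]
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression())])
pipe.fit(train, churned)
joblib.dump(pipe, 'churn_pipe.joblib')

def predict_churn(records: pd.DataFrame) -> pd.DataFrame:
    """Input columns must match training exactly; returns label + probability."""
    model = joblib.load('churn_pipe.joblib')
    out = records.copy()
    out['churn_pred'] = model.predict(records)
    out['churn_prob'] = model.predict_proba(records)[:, 1]
    return out
```

Because the saved object is a full pipeline, the wrapper never has to know about scaling or encoding — point (1) of the deployment guide reduces to "same columns, same types."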

Exercise 30.19: Cost-sensitive grid search

For a medical screening problem where false negatives are 10x more costly than false positives, design a GridSearchCV that optimizes for recall rather than F1. Run it on the Breast Cancer dataset and compare the selected model to one optimized for F1. Do they choose different hyperparameters?

Guidance
gs_recall = GridSearchCV(pipe, param_grid, cv=5, scoring='recall', n_jobs=-1)
gs_f1 = GridSearchCV(pipe, param_grid, cv=5, scoring='f1', n_jobs=-1)
The recall-optimized model may choose hyperparameters that produce a more aggressive classifier (more positive predictions), while the F1-optimized model may choose a more balanced one. Compare the confusion matrices to see the difference.

Part D: Synthesis and Critical Thinking ⭐⭐⭐


Exercise 30.20: End-to-end workflow design

You're starting a new project: predicting whether a restaurant will fail within its first two years. Your features include: location_type (urban/suburban/rural), cuisine_type (Italian/Mexican/American/etc.), owner_experience_years (numeric), initial_investment (numeric), nearby_competitors (numeric), avg_rent_sqft (numeric), population_density (numeric), and yelp_rating_month1 (numeric).

Design the complete workflow: data splitting strategy, preprocessing pipeline, models to try, hyperparameters to tune, evaluation metric(s), and how you'd report results. Justify each decision.

Guidance Key decisions to justify: (1) Stratified train/test split since restaurant failure may be imbalanced. (2) ColumnTransformer with StandardScaler for numeric features and OneHotEncoder for location_type and cuisine_type. (3) Try logistic regression, decision tree, and random forest. (4) Optimize for recall or F1 depending on the stakeholder — if it's investors who want to avoid funding failures, recall matters (catch the failures); if it's lenders who don't want to wrongly deny loans to viable restaurants, precision matters. (5) Report confusion matrix, classification report, and AUC. (6) Feature importance to tell the story of why restaurants fail.

Exercise 30.21: Pipeline vs. manual workflow debate

A colleague argues: "Pipelines are overengineered. I've been doing ML for years without them, and my models work fine." Write a thoughtful response (3-4 paragraphs) that acknowledges their experience while explaining the specific risks of working without pipelines. Include at least one example where the difference between a pipeline and manual workflow would lead to materially different results.

Guidance A strong response: (1) Acknowledges that many successful projects have been built without formal pipelines. (2) Explains that the risks are subtle — data leakage might add 1-2% to your reported performance, which seems small but can change model selection decisions. (3) Emphasizes that pipelines become essential at scale — when you have many features, many preprocessing steps, or need to hand off work to colleagues. (4) Gives a concrete example: cross-validating a model that requires scaling, where the manual approach leaks test data statistics into every fold. The difference might be small for the Iris dataset but larger for high-dimensional data with many correlated features.

Exercise 30.22: Feature engineering within pipelines

Discuss the challenges of doing feature engineering inside a pipeline. Specifically: (1) Can you do all feature engineering inside a pipeline? (2) What types of feature engineering might need to happen outside the pipeline? (3) How do you decide what goes inside vs. outside? Give examples.

Guidance (1) Many feature engineering steps can go inside a pipeline using custom transformers (FunctionTransformer or custom classes). (2) Feature engineering that requires domain knowledge and manual inspection — like deciding which features to create, how to bin continuous variables, or whether to create interaction terms — typically happens during exploratory analysis, outside the pipeline. The *implementation* (the actual transformation) goes inside. (3) Rule of thumb: if the transformation uses parameters learned from data (like the mean for imputation), it MUST be inside the pipeline. If it's a fixed transformation (like log(x) or x1*x2), it can safely be outside — but putting it inside is still better for reproducibility.
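Point (1) in code: a fixed transformation (log1p) wrapped in FunctionTransformer so it travels with the pipeline rather than living in a separate script (toy data for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression

# log1p has no learned parameters, but keeping it in the pipeline
# means deployment can never forget to apply it
pipe = Pipeline([
    ('log', FunctionTransformer(np.log1p)),
    ('scaler', StandardScaler()),        # learned parameters: MUST be inside
    ('model', LogisticRegression())
])

X = np.array([[1.0], [10.0], [100.0], [1000.0]])
y = [0, 0, 1, 1]
pipe.fit(X, y)
```

For engineered features that need custom logic (say, interaction terms), the same idea applies with a custom transformer class implementing fit and transform.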

Exercise 30.23: Reflecting on Part V

Write a 1-page reflection on your journey through Part V (Chapters 25-30). Address:

1. What was the most surprising thing you learned?
2. What concept was hardest to understand, and how did you eventually get it?
3. How has your understanding of "what makes a good model" changed since Chapter 25?
4. If you had to explain the ML workflow to a friend in 5 sentences, what would you say?

Guidance This is a personal reflection — there's no single right answer. But strong responses often mention: the surprise that accuracy is misleading (Chapter 29), the difficulty of understanding data leakage (Chapter 30), the realization that model building is the easy part and evaluation is the hard part, and that the workflow matters as much as the algorithm. Your 5-sentence summary should cover: (1) define the problem and metric, (2) split data and build a pipeline, (3) compare models with cross-validation, (4) tune hyperparameters, (5) evaluate once on the test set and report honestly.

Part E: Extension Challenges ⭐⭐⭐⭐


Exercise 30.24: Nested cross-validation

Research nested cross-validation: an outer loop for estimating model performance and an inner loop for hyperparameter tuning. Implement it using scikit-learn on the Breast Cancer dataset. Compare the nested CV score to the non-nested (standard GridSearchCV) score. Why is the nested score usually lower? Write a paragraph explaining when nested CV is necessary.

Guidance
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='f1', n_jobs=-1)
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='f1')
print(f"Nested CV F1: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
The nested score is usually lower because it doesn't allow the hyperparameter tuning process to "peek" at the evaluation data. Non-nested CV uses the same data for tuning and evaluation, which is optimistically biased. Nested CV is necessary when you want an honest estimate of how well your entire model selection process (not just one model) will perform on new data.

Exercise 30.25: Building a reusable ML template

Create a Python function or class called ModelComparer that:

1. Takes a list of (name, pipeline, param_grid) tuples
2. Runs GridSearchCV for each
3. Returns a summary DataFrame with best scores and parameters
4. Optionally evaluates the overall winner on a test set

Test it on the Iris dataset with at least 3 different model types.

Guidance
class ModelComparer:
    def __init__(self, models, cv=5, scoring='f1_macro'):  # macro F1 works for multiclass (e.g., Iris)
        self.models = models  # List of (name, pipeline, params) tuples
        self.cv = cv
        self.scoring = scoring
        self.results = {}

    def fit(self, X_train, y_train):
        for name, pipe, params in self.models:
            gs = GridSearchCV(pipe, params, cv=self.cv,
                              scoring=self.scoring, n_jobs=-1)
            gs.fit(X_train, y_train)
            self.results[name] = gs
        return self

    def summary(self):
        rows = []
        for name, gs in self.results.items():
            rows.append({
                'Model': name,
                'Best Score': gs.best_score_,
                'Best Params': gs.best_params_
            })
        return pd.DataFrame(rows).sort_values('Best Score', ascending=False)
This exercise bridges the gap between writing scripts and writing reusable tools — a key skill for professional data science.