Exercises: Chapter 10

Building Reproducible Data Pipelines


Exercise 1: Pipeline Fundamentals (Conceptual)

A colleague shows you their preprocessing code:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Step 1: Impute
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Step 2: Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# Step 3: Evaluate
model = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f}")

a) Identify the data leakage in this code. Which step is contaminated, and why?

b) Explain why the reported CV AUC is optimistic. Is the bias likely to be large or small?

c) Rewrite this code using a Pipeline so that cross-validation is correct.

d) Your colleague argues that "the imputer just fills in medians --- it is not really learning anything." Construct a scenario where imputing with global medians before cross-validation produces a meaningfully different (and optimistic) result compared to imputing within each fold.
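For reference while working on part (c), the general leak-free shape (shown here on synthetic data, not a full solution for this exercise) puts every fitted step inside a single Pipeline, so cross_val_score refits the imputer and scaler on each fold's training split only:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan   # inject some missing values
y = rng.integers(0, 2, size=200)

# All fitted steps live inside the Pipeline, so each CV fold
# refits the imputer and scaler on that fold's training data only.
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42, max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f}")
```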


Exercise 2: ColumnTransformer Construction (Applied)

The Metro General Hospital readmission dataset has the following features:

Feature                  | Type        | Notes
age                      | numeric     | Years, no missing
length_of_stay           | numeric     | Days, no missing
num_procedures           | numeric     | Count, 3% missing
num_diagnoses            | numeric     | Count, no missing
num_medications          | numeric     | Count, 8% missing
admission_type           | categorical | {emergency, urgent, elective}, no missing
discharge_disposition    | categorical | {home, SNF, rehab, other}, no missing
primary_diagnosis_group  | categorical | 28 unique values, no missing
insurance_type           | categorical | {Medicare, Medicaid, private, self-pay}, no missing
lab_results_abnormal     | numeric     | Fraction, 12% missing

a) Write a ColumnTransformer that:
  • Applies SimpleImputer(strategy='median') followed by StandardScaler to the numeric columns
  • Applies OneHotEncoder(sparse_output=False, handle_unknown='ignore') to the low-cardinality categoricals (fewer than 10 levels)
  • Applies OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) to the high-cardinality categorical (primary_diagnosis_group)

b) Wrap the ColumnTransformer in a full Pipeline with a LogisticRegression(random_state=42, max_iter=1000) final step.

c) Why did we choose OrdinalEncoder for primary_diagnosis_group instead of OneHotEncoder? Under what model type would this choice be a mistake?

d) Add remainder='passthrough' to your ColumnTransformer. How many features does the preprocessor output now compared to part (a)?


Exercise 3: Custom Transformer (Applied)

Write a custom transformer class called OutlierCapTransformer that caps numeric features at a specified percentile.

Requirements:

a) The transformer takes a percentile parameter (default 99) in __init__.

b) In fit, it learns the upper and lower cap values for each feature. The upper cap is the percentile-th percentile, and the lower cap is the (100 - percentile)-th percentile (e.g., percentile=99 caps at the 1st and 99th percentiles).

c) In transform, it clips all values to the learned caps.

d) Implement get_feature_names_out.

e) Demonstrate that your transformer works correctly by:
  • Creating a small DataFrame with known outliers
  • Fitting the transformer on training data
  • Transforming test data and verifying that outliers are capped at the training percentiles (not the test percentiles)

f) Add your OutlierCapTransformer to a pipeline between the imputer and the scaler. Explain why this ordering matters.
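The sklearn transformer contract your class must follow can serve as a template. The minimal transformer below is deliberately not the solution (it centers columns on the training mean), but it shows every piece OutlierCapTransformer needs: learned state set in fit with a trailing underscore, fit returning self, transform applying training statistics, and get_feature_names_out:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CenterTransformer(BaseEstimator, TransformerMixin):
    """Minimal stateful transformer: learns column means in fit,
    subtracts them in transform. Same contract OutlierCapTransformer needs."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = X.mean(axis=0)           # learned state ends in underscore
        self.n_features_in_ = X.shape[1]
        return self                             # fit must return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.means_                  # apply *training* statistics

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f'x{i}' for i in range(self.n_features_in_)]
        return np.asarray(input_features, dtype=object)

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])
ct = CenterTransformer().fit(X_train)
print(ct.transform(X_test))  # uses training means [2.0, 20.0], not test statistics
```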


Exercise 4: Pipeline Introspection (Applied)

Given the following fitted pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import numpy as np

num_cols = ['tenure_months', 'monthly_charge', 'total_hours_last_30d']
cat_cols = ['plan_type', 'device_type']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_cols),
    ('cat', Pipeline([
        ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
    ]), cat_cols)
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(random_state=42, max_iter=1000))
])

pipe.fit(X_train, y_train)

a) Write the code to extract the median values learned by the numeric imputer.

b) Write the code to extract the categories learned by the one-hot encoder.

c) Write the code to extract the logistic regression coefficients and map each coefficient to its feature name.

d) Using double-underscore notation, write the set_params call to change the imputer strategy from 'median' to 'mean' without rebuilding the pipeline.
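The access patterns you need, demonstrated on a smaller two-step pipeline (toy data, not the churn pipeline above) so you can adapt them rather than copy them:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X, y)

# Fitted attributes are reached through named_steps (or step indexing).
print(pipe.named_steps['imputer'].statistics_)   # learned per-column medians
print(pipe.named_steps['model'].coef_)           # fitted coefficients

# Double-underscore notation addresses parameters of nested steps;
# the change takes effect on the next fit.
pipe.set_params(imputer__strategy='mean')
```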


Exercise 5: Serialization and Deployment (Applied)

a) Write a complete script that:
  1. Builds and fits the StreamFlow pipeline from this chapter
  2. Saves the pipeline with joblib
  3. Saves a metadata dictionary containing sklearn version, training date, column names, training set size, and cross-validation AUC
  4. Loads the pipeline from disk
  5. Asserts that predictions from the loaded pipeline are identical to predictions from the original

b) Your pipeline file is 2.3 MB. A colleague's pipeline for the same task is 45 MB. List three possible explanations for the size difference.

c) You need to deploy the pipeline to a system that has scikit-learn 1.3.0 installed, but you trained the pipeline with scikit-learn 1.5.2. What could go wrong? What steps would you take before deploying?

d) Write a function predict_from_csv(pipeline_path, csv_path) that loads a saved pipeline and a CSV file, runs predictions, and returns a DataFrame with subscriber IDs and predicted churn probabilities.
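A save/load/verify skeleton for part (a), using a throwaway model in a temporary directory; the StreamFlow pipeline and the remaining metadata fields are yours to substitute:

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the StreamFlow pipeline.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression(random_state=42, max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / 'model.joblib'
    joblib.dump(model, path)

    # Store metadata alongside the artifact, in a plain, human-readable file.
    meta = {'sklearn_version': sklearn.__version__, 'n_train': len(X)}
    (Path(d) / 'metadata.json').write_text(json.dumps(meta))

    loaded = joblib.load(path)
    # Round-trip check: the loaded pipeline must reproduce the original output.
    assert np.array_equal(model.predict_proba(X), loaded.predict_proba(X))
    print('round-trip OK')
```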


Exercise 6: Debugging Pipeline Failures (Conceptual)

For each scenario below, identify the root cause and the fix.

a) You add a new feature referral_source (categorical, 8 levels) to the training data. The pipeline fits successfully. When you call predict() on test data that also has referral_source, you get a ValueError: Specifying the columns using strings is only supported for pandas DataFrames.

b) Your pipeline works in a Jupyter notebook but fails when deployed as a Flask API. The error is ModuleNotFoundError: No module named '__main__'. You defined your custom transformer class in the notebook.

c) A pipeline that was trained on data with 5 numeric features and 2 categorical features suddenly produces a ValueError: X has 14 features, but LogisticRegression is expecting 12 features as input. Nothing in the pipeline definition has changed. What happened?

d) You call cross_val_score on your pipeline and get an AUC of 0.72. You then call pipeline.fit(X_train, y_train) and roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]) and get 0.83. Is this a problem? Under what circumstances would this discrepancy be expected, and under what circumstances would it indicate a bug?
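Scenario (a) is easy to reproduce yourself. The sketch below (hypothetical two-column data) fits a ColumnTransformer on a DataFrame and then transforms a bare NumPy array, which has no column names to match the string selectors against:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
ct = ColumnTransformer([('num', StandardScaler(), ['a', 'b'])])
ct.fit(df)

try:
    ct.transform(df.to_numpy())   # a bare array has no column names
except ValueError as e:
    print(type(e).__name__)       # string column selection needs a DataFrame
```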


Exercise 7: Pipeline Design Challenge (Synthesis)

You are building a fraud detection pipeline for an e-commerce platform. The dataset has the following features:

Feature                  | Type        | Missingness | Notes
transaction_amount       | numeric     | 0%          | Highly right-skewed
customer_age_days        | numeric     | 0%          | Account age
items_in_cart            | numeric     | 0%          | Count
avg_item_price           | numeric     | 2%          | Derived
time_since_last_order    | numeric     | 15%         | Missing for first-time customers
shipping_country         | categorical | 0%          | 42 unique values
payment_method           | categorical | 0%          | {credit, debit, paypal, crypto, gift_card}
device_fingerprint_match | binary      | 0%          | 0 or 1
ip_risk_score            | numeric     | 8%          | From third-party API
email_domain_type        | categorical | 0%          | {corporate, free, disposable}

a) Design a complete pipeline with appropriate transformations for each feature type. Justify every design choice.

b) Write a custom transformer called TransactionRiskTransformer that creates:
  • amount_per_item: transaction_amount / items_in_cart
  • is_first_order: 1 if time_since_last_order is NaN, 0 otherwise
  • high_amount_flag: 1 if transaction_amount > 99th percentile of training data, 0 otherwise

c) Explain why high_amount_flag must be computed in fit/transform (not as a stateless function). What would go wrong if you hardcoded the threshold?

d) Draw the complete pipeline structure as a tree diagram (like the nested pipeline structure in the chapter text). Include every transformer, sub-pipeline, and the final estimator.
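Part (c)'s point can be seen in a few lines. In the toy flagger below (an illustration, not the TransactionRiskTransformer itself), the threshold is learned in fit, so new data is flagged against the training distribution rather than its own:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ThresholdFlagger(BaseEstimator, TransformerMixin):
    """Toy illustration for part (c): the flag threshold is *learned* in fit,
    so new transactions are compared against the training distribution."""

    def fit(self, X, y=None):
        self.threshold_ = np.percentile(np.asarray(X, dtype=float), 99)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) > self.threshold_).astype(int)

amounts_train = np.arange(1, 101).reshape(-1, 1)   # training amounts 1..100
flagger = ThresholdFlagger().fit(amounts_train)
print(flagger.threshold_)                           # ~99th percentile of training data

# New data is compared against the training threshold, not its own percentiles,
# so a hardcoded constant would silently drift as the data distribution shifts.
amounts_new = np.array([[50.0], [500.0]])
print(flagger.transform(amounts_new).ravel())
```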


Exercise 8: The Reproducibility Audit (Synthesis)

You join a team and inherit a churn prediction model. The "pipeline" is a 400-line Jupyter notebook with the following characteristics:

  • The notebook starts with import warnings; warnings.filterwarnings('ignore')
  • Feature engineering is done in 12 separate cells with no functions
  • The StandardScaler is fitted on the full dataset before the train/test split
  • The OneHotEncoder is fitted on the full dataset before the train/test split
  • The IterativeImputer uses random_state=None
  • The model is a GradientBoostingClassifier with random_state=None
  • The notebook references a CSV file at an absolute path on the original developer's laptop
  • There is no requirements.txt or environment specification
  • The reported holdout AUC is 0.89

a) List every reproducibility issue in this notebook. For each, explain the concrete risk it creates.

b) Estimate the likely "true" AUC after fixing the leakage issues. Justify your estimate.

c) Write a migration plan: the specific steps you would take to convert this notebook into a production-ready pipeline, in order of priority.

d) How would you validate that your new pipeline is correct? You cannot simply compare outputs to the old notebook, because the old notebook has leakage. Describe your validation strategy.


Solutions for selected exercises are available in Appendix B.