Chapter 30 Quiz: The Machine Learning Workflow

Instructions: This quiz tests your understanding of Chapter 30, the capstone of Part V. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 90.


Section 1: Multiple Choice (10 questions, 4 points each)


Question 1. Data leakage in machine learning refers to:

  • (A) Data being lost during transfer between systems
  • (B) Information from outside the training set influencing model training, leading to overoptimistic performance estimates
  • (C) The model's predictions being leaked to competitors
  • (D) Memory overflow when processing large datasets
Answer **Correct: (B)** Data leakage occurs when the model gains access to information it wouldn't have in a real deployment scenario. The most common form is preprocessing (like scaling or encoding) that uses statistics from the test set, allowing the model to indirectly "know" about test data during training. This makes performance estimates artificially high — the model appears to generalize better than it actually does.

Question 2. Which of the following causes data leakage?

  • (A) Using train_test_split with stratify=y
  • (B) Fitting a StandardScaler on the entire dataset before splitting into train/test
  • (C) Using cross_val_score with a Pipeline
  • (D) Setting random_state=42 in the model
Answer **Correct: (B)** Fitting the scaler on the entire dataset means the scaler learns the mean and standard deviation from all samples, including those that will later be in the test set. When the model is evaluated on the test set, the test data has been scaled using its own statistics — information has leaked from the test set into the training process.

  • **(A)** Stratification ensures balanced class proportions — no leakage.
  • **(C)** A Pipeline inside cross_val_score prevents leakage by fitting transformers only on training folds.
  • **(D)** Random state is for reproducibility — no leakage.
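As a concrete sketch (synthetic data, illustrative names), the leaky and leak-free versions differ only in when the scaler gets fitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# LEAKY: the scaler sees test-set statistics before any split happens
X_leaky = StandardScaler().fit_transform(X)

# LEAK-FREE: split first, then let the pipeline fit the scaler per split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LogisticRegression())])
pipe.fit(X_tr, y_tr)   # scaler statistics come from X_tr only
print(pipe.score(X_te, y_te))
```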

Question 3. In a scikit-learn Pipeline, what is the key difference between fit_transform and transform?

  • (A) fit_transform is faster than calling fit and transform separately
  • (B) fit_transform learns parameters from the data AND applies the transformation; transform only applies the transformation using previously learned parameters
  • (C) fit_transform works on training data; transform works on test data
  • (D) There is no difference; they produce identical results
Answer **Correct: (B)** `fit_transform(X)` = `fit(X)` + `transform(X)`: it learns the parameters (e.g., mean and std for StandardScaler) from X and then transforms X using those parameters. `transform(X)` only applies the transformation using parameters that were previously learned during `fit`. This distinction is critical: on test data, you must only call `transform` (using training parameters), never `fit_transform` (which would learn new parameters from the test data, causing leakage).

  • **(A)** While `fit_transform` may be slightly faster in some implementations, this isn't the important distinction.
  • **(C)** Technically, `fit_transform` is used during training and `transform` during testing, but (B) describes the mechanism, not just the typical usage.
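A tiny numeric sketch of the distinction:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
Xt_train = scaler.fit_transform(X_train)   # learns mean=2, std≈0.816 from X_train
Xt_test = scaler.transform(X_test)         # reuses the SAME mean/std on new data

print(scaler.mean_)   # [2.]
print(Xt_test)        # (4 - 2) / 0.816... ≈ [[2.449]]
```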

Question 4. What is the purpose of ColumnTransformer in a pipeline?

  • (A) To remove columns with missing values
  • (B) To apply different preprocessing transformations to different groups of columns
  • (C) To merge multiple DataFrames into one
  • (D) To rename columns for consistency
Answer **Correct: (B)** `ColumnTransformer` allows you to specify different transformations for different subsets of columns. For example, you can scale numeric features with `StandardScaler` and encode categorical features with `OneHotEncoder` in a single step. Without `ColumnTransformer`, you'd need to manually split, transform, and reassemble your data — which is error-prone and harder to integrate into a pipeline.
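A minimal sketch with made-up column names, scaling two numeric columns and encoding one categorical column in a single step:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'age': [25, 40, 31],
                   'income': [30_000, 80_000, 52_000],
                   'city': ['Oslo', 'Lima', 'Oslo']})

pre = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),   # scale the numerics
    ('cat', OneHotEncoder(), ['city'])              # one-hot the categorical
])
out = pre.fit_transform(df)
print(out.shape)   # (3, 4): 2 scaled columns + 2 one-hot city columns
```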

Question 5. In a GridSearchCV parameter grid, why are parameter names written with double underscores (e.g., model__max_depth)?

  • (A) It's a Python naming convention for private variables
  • (B) The double underscore separates the pipeline step name from the parameter name, allowing GridSearchCV to reach into nested components
  • (C) It prevents naming conflicts with Python built-in functions
  • (D) It's required by scikit-learn's type-checking system
Answer **Correct: (B)** The double-underscore convention is scikit-learn's mechanism for accessing parameters of nested estimators. In a pipeline where a step is named `'model'` and you want to tune `max_depth`, you write `model__max_depth`. The `__` tells scikit-learn to look inside the `'model'` step for a parameter called `max_depth`. For deeply nested structures (like an encoder inside a ColumnTransformer inside a Pipeline), you chain multiple `__`: `preprocessor__cat__drop`.
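A small sketch of the convention using `set_params`, which accepts the same `step__parameter` names as a GridSearchCV grid (the step names 'scale' and 'model' are this sketch's choice):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scale', StandardScaler()),
                 ('model', DecisionTreeClassifier())])

# The step__parameter convention reaches into the step named 'model'
pipe.set_params(model__max_depth=3)
print(pipe.get_params()['model__max_depth'])   # 3
```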

Question 6. A GridSearchCV with 4 hyperparameters, each with 3 values, and 10-fold cross-validation will perform how many model fits?

  • (A) 40
  • (B) 120
  • (C) 810
  • (D) 8,100
Answer **Correct: (C)** Total combinations: 3⁴ = 81. Each combination is evaluated with 10-fold CV = 10 fits per combination. Total: 81 × 10 = 810 model fits. This is why large grid searches can be computationally expensive, and why `RandomizedSearchCV` (which tries a fixed number of random combinations) is often preferred for large parameter spaces. Note: **(B)** 120 comes from counting the 4 × 3 = 12 individual parameter settings and multiplying by 10 folds — i.e., tuning each parameter independently — but GridSearchCV evaluates every *combination* of values.
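You can sanity-check the arithmetic with scikit-learn's `ParameterGrid`, which enumerates the Cartesian product (the parameter names a-d here are placeholders):

```python
from sklearn.model_selection import ParameterGrid

grid = {'a': [1, 2, 3], 'b': [1, 2, 3],
        'c': [1, 2, 3], 'd': [1, 2, 3]}
n_combos = len(ParameterGrid(grid))   # 3 * 3 * 3 * 3 combinations
print(n_combos, n_combos * 10)        # 81 810
```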

Question 7. When using GridSearchCV, best_estimator_ gives you:

  • (A) The hyperparameters with the highest cross-validation score
  • (B) A fully trained pipeline using the best hyperparameters, fitted on the entire training set
  • (C) The cross-validation fold that produced the best score
  • (D) An unfitted pipeline that you need to train yourself
Answer **Correct: (B)** After `GridSearchCV.fit()` completes, `best_estimator_` is the pipeline re-fitted on the entire training set using the hyperparameters that achieved the best cross-validation score. It's ready to make predictions — you don't need to fit it again. This is a convenience: GridSearchCV automatically retrains the best model on all available training data after finding the best hyperparameters.
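A minimal sketch on a built-in dataset; after `fit`, `best_estimator_` predicts without any further fitting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': [2, 3]}, cv=3)
gs.fit(X, y)

# best_estimator_ has already been refit on all of X with the best params
preds = gs.best_estimator_.predict(X[:5])
print(gs.best_params_, preds.shape)
```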

Question 8. Which of the following should NOT go inside a scikit-learn Pipeline?

  • (A) StandardScaler
  • (B) Exploratory data analysis (plotting distributions, computing summary statistics)
  • (C) OneHotEncoder
  • (D) SimpleImputer
Answer **Correct: (B)** Exploratory data analysis — creating visualizations, computing statistics for human understanding, making decisions about which features to use — is a human-driven process that happens before the pipeline is built. Pipelines contain automated transformation and modeling steps that need to be reproduced consistently for every data split. Plots and manual inspection can't (and shouldn't) be automated inside a pipeline. **(A)**, **(C)**, and **(D)** are all transformers that learn parameters from data and should be inside the pipeline to prevent leakage.

Question 9. What does handle_unknown='ignore' in OneHotEncoder do?

  • (A) Removes rows with unknown categories from the dataset
  • (B) When a category appears in new data that wasn't seen during training, assigns zeros to all one-hot columns instead of raising an error
  • (C) Ignores all categorical features during encoding
  • (D) Replaces unknown categories with the most frequent category
Answer **Correct: (B)** Without `handle_unknown='ignore'`, OneHotEncoder raises an error if it encounters a category in new data that wasn't present in the training data. With this setting, unseen categories get encoded as all zeros — the model treats them as "none of the known categories." This is essential for production models, where new data may contain categories that weren't in the training set.
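A tiny sketch: the encoder is fit on two colors, then asked to transform a color it has never seen:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['red'], ['blue']])   # known categories: blue, red

# 'green' was never seen during fit: it encodes as all zeros, no error
enc_out = enc.transform([['green']]).toarray()
print(enc_out)   # [[0. 0.]]
```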

Question 10. Why is joblib preferred over pickle for saving scikit-learn models?

  • (A) joblib is faster for all Python objects
  • (B) pickle cannot save scikit-learn models
  • (C) joblib handles large NumPy arrays more efficiently, which scikit-learn models contain internally
  • (D) joblib files are more secure than pickle files
Answer **Correct: (C)** Scikit-learn models internally store fitted parameters as NumPy arrays (e.g., coefficients, tree structures, scaler statistics). `joblib` is optimized for serializing objects with large NumPy arrays, making it faster and more efficient than `pickle` for this specific use case. Both `joblib` and `pickle` can save scikit-learn models, and both have similar security concerns (never load from untrusted sources).
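A sketch of the save/load round trip (the filename is illustrative):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, 'model.joblib')       # serialize the fitted model
restored = joblib.load('model.joblib')   # later, or in another process

# The restored model makes identical predictions
print((restored.predict(X) == model.predict(X)).all())   # True
```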

Section 2: True/False (4 questions, 4 points each)


Question 11. True or False: A Pipeline guarantees that preprocessing transformations are fit only on training data during cross-validation.

Answer **True.** When you pass a Pipeline to `cross_val_score` or `GridSearchCV`, scikit-learn handles the fit/transform logic correctly within each fold. In each fold, the pipeline's transformers are fit on the training portion and then only transform (without refitting) the test portion. This is the primary advantage of using pipelines — they prevent data leakage automatically during cross-validation.
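A minimal sketch: the whole pipeline is handed to `cross_val_score`, so each fold refits the scaler on that fold's training portion only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])

# Each of the 5 folds fits the scaler on its own training portion
scores = cross_val_score(pipe, X, y, cv=5)
print(len(scores))   # 5
```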

Question 12. True or False: RandomizedSearchCV always finds better hyperparameters than GridSearchCV.

Answer **False.** RandomizedSearchCV samples random combinations from the parameter space, so it might miss the absolute best combination that GridSearchCV would find by exhaustive search. However, RandomizedSearchCV is often more *efficient* — it can explore a wider region of the parameter space in fewer iterations, and in practice finds competitive hyperparameters faster, especially when the grid is large.

Question 13. True or False: The test set should be used multiple times during model development to track progress.

Answer **False.** The test set should be used exactly once — at the very end, after all model selection and hyperparameter tuning decisions have been made. Using the test set repeatedly during development turns it into a validation set, and your final performance estimate becomes optimistically biased. Use cross-validation on the training set for all intermediate decisions.

Question 14. True or False: Setting random_state=42 in train_test_split, the model, and the cross-validation splitter is sufficient for full reproducibility.

Answer **False.** Setting random seeds ensures that the *random* elements of your code are reproducible, but that alone is not sufficient. Full reproducibility also requires recording library versions (different scikit-learn versions may produce different results even with the same seed), documenting the data source and version, and ensuring the same computational environment. Random seeds handle randomness; reproducibility requires managing the entire environment.

Section 3: Short Answer (3 questions, 6 points each)


Question 15. Explain the difference between a hyperparameter and a learned parameter (model parameter). Give two examples of each for a Random Forest.

Answer **Hyperparameters** are set by the data scientist *before* training and control the model's structure and behavior. They are not learned from data. Examples for Random Forest: `n_estimators` (number of trees) and `max_depth` (maximum tree depth).

**Learned parameters** (model parameters) are determined *during* training by fitting the model to data. They represent what the model has learned. Examples for Random Forest: the split thresholds at each node (e.g., "GDP > $4,523") and the predicted class or value at each leaf node.

The key distinction: you choose hyperparameters; the model learns parameters. Hyperparameter tuning (GridSearchCV) is the process of finding the hyperparameters that lead to the best learned parameters.
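In code, the two kinds live in different places: hyperparameters are constructor arguments, while learned parameters appear as fitted attributes (trailing underscore) after `fit`. A sketch on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen before training, in the constructor
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X, y)

# Learned parameters: created by fit(), stored inside the fitted trees
first_tree = rf.estimators_[0].tree_
print(len(rf.estimators_), first_tree.node_count)
```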

Question 16. Describe the complete ML workflow in 6-7 steps. For each step, name one common mistake that beginners make.

Answer

1. **Define the problem and metric** — Mistake: jumping into modeling without a clear question or choosing the wrong evaluation metric (e.g., accuracy for an imbalanced problem).
2. **Prepare and split data** — Mistake: not holding out a test set, or using the test set during development.
3. **Build the preprocessing pipeline** — Mistake: preprocessing outside the pipeline, causing data leakage.
4. **Select and compare models** — Mistake: trying only one model, or comparing models on a single train/test split instead of cross-validation.
5. **Tune hyperparameters** — Mistake: hand-tuning one parameter at a time instead of searching systematically, or searching too fine a grid (overfitting hyperparameters).
6. **Evaluate on the held-out test set** — Mistake: reporting only accuracy, or using the test set more than once.
7. **Save, document, and deploy** — Mistake: not saving the fitted pipeline, not recording library versions, or not documenting assumptions and limitations.

Question 17. A colleague says: "I scaled my features, selected the best 10 features using SelectKBest, tuned my model with GridSearchCV, and got 92% accuracy. Then I tested on the held-out set and only got 84%." Why is there such a large gap between the GridSearchCV score and the test set score?

Answer The most likely explanation is that the scaling and/or feature selection was done *outside* the pipeline — on the full dataset before the GridSearchCV split. This causes data leakage: the scaler's mean/std and SelectKBest's feature rankings were computed using information from the validation folds (which act as "test" data within each cross-validation fold). The leaked information inflated the cross-validation scores.

Another possibility: the GridSearchCV found hyperparameters that happened to work well on the specific cross-validation folds but didn't generalize (hyperparameter overfitting). This is especially likely if the grid was very fine-grained.

The fix: put scaling and feature selection inside the pipeline so they're refit within each cross-validation fold. Use a separate held-out test set that is never touched during tuning.
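A sketch of the fix, with scaling and selection inside the pipeline so cross-validation refits both per fold (the dataset and `k` here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ('scale', StandardScaler()),                # refit within each fold
    ('select', SelectKBest(f_classif, k=10)),   # rankings from training folds only
    ('model', LogisticRegression(max_iter=1000))
])

# No leakage: each fold's scaler and selector never see its validation data
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```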

Section 4: Applied Scenarios (2 questions, 5 points each)


Question 18. You're building a model for a hospital to predict patient readmission within 30 days. Your features include: age (numeric), num_prior_admissions (numeric), length_of_stay (numeric), diagnosis_category (categorical: 15 categories), insurance_type (categorical: 4 categories), and discharge_disposition (categorical: 6 categories).

Write out the pipeline structure you would use (preprocessing + model). Include specific transformer choices and explain why you chose them. What scoring metric would you use for GridSearchCV and why?

Answer Pipeline structure:
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'num_prior_admissions', 'length_of_stay']),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
     ['diagnosis_category', 'insurance_type', 'discharge_disposition'])
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(class_weight='balanced'))
])
Choices explained:

  • StandardScaler for numeric features (needed for logistic regression comparisons; harmless for random forests).
  • OneHotEncoder with `drop='first'` to avoid multicollinearity (matters for logistic regression).
  • `handle_unknown='ignore'` because new diagnosis categories might appear.
  • `class_weight='balanced'` because readmission is relatively rare.

Scoring metric: `'recall'` because missing a patient who will be readmitted (false negative) is more costly than unnecessarily flagging a patient who won't be (false positive). The hospital can follow up with flagged patients, but a missed readmission could mean a preventable emergency.

Question 19. You run a GridSearchCV and examine the results:

Best parameters: {'model__max_depth': 3, 'model__n_estimators': 300}
Best CV F1: 0.847

All results:
  max_depth=3,  n_estimators=100: F1=0.841 (+/- 0.03)
  max_depth=3,  n_estimators=200: F1=0.845 (+/- 0.03)
  max_depth=3,  n_estimators=300: F1=0.847 (+/- 0.03)
  max_depth=5,  n_estimators=100: F1=0.838 (+/- 0.04)
  max_depth=5,  n_estimators=200: F1=0.840 (+/- 0.04)
  max_depth=5,  n_estimators=300: F1=0.842 (+/- 0.04)
  max_depth=8,  n_estimators=100: F1=0.830 (+/- 0.05)
  max_depth=8,  n_estimators=200: F1=0.833 (+/- 0.05)
  max_depth=8,  n_estimators=300: F1=0.835 (+/- 0.05)

What patterns do you notice? Would you deploy the "best" model (max_depth=3, n_estimators=300), or would you consider a different combination? Justify your answer.

Answer Key patterns:

  1. Lower max_depth consistently outperforms higher max_depth — the model benefits from simplicity (less overfitting).
  2. More trees always helps, but the improvement from 100 to 300 trees is tiny (0.006 for depth=3).
  3. The standard deviations increase with max_depth, confirming that deeper trees have more variable performance.

A reasonable argument for deploying max_depth=3, n_estimators=100 instead of the "best": it achieves F1=0.841 vs. 0.847 — a difference of 0.006 (statistically insignificant given the +/- 0.03 standard deviation) — but with 1/3 the trees, meaning 3x faster training and prediction. In production, speed and simplicity often matter more than the last fraction of a percentage point. The "best" model isn't meaningfully better than a simpler one.

This illustrates an important practical principle: when the differences between models are within the standard deviation of cross-validation, choose the simpler/faster model.

Section 5: Code Analysis (1 question, 6 points)


Question 20. This code attempts to build a complete ML pipeline but has three bugs. Find all three and explain how to fix each.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

numeric = ['age', 'income']
categorical = ['city']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric),
    ('cat', OneHotEncoder(), categorical)     # Bug 1
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # Bug 2

param_grid = {'max_depth': [3, 5, 8]}        # Bug 3

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
gs.fit(X_train, y_train)
Answer

**Bug 1:** `OneHotEncoder()` should include `handle_unknown='ignore'` (and optionally `drop='first'`). Without `handle_unknown='ignore'`, the pipeline will crash if the test set contains a city not seen in training. Fix: `OneHotEncoder(drop='first', handle_unknown='ignore')`

**Bug 2:** `train_test_split` is called without `random_state`, making the split unreproducible. It also doesn't use `stratify=y` for classification. Fix: `train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)`

**Bug 3:** The parameter name `'max_depth'` is missing the pipeline step prefix. Since the model is inside a pipeline step named `'model'`, the correct name is `'model__max_depth'`. Fix: `param_grid = {'model__max_depth': [3, 5, 8]}`

Any one of these bugs would cause the code to either crash or produce unreliable results. Bug 3 would raise a ValueError immediately. Bug 1 would crash only when unseen categories appear. Bug 2 produces technically valid but unreproducible results.
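For reference, a consolidated corrected version might look like this (synthetic X and y stand in for the question's undefined data; `n_estimators` is kept small for speed, and the optional `drop='first'` is omitted):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (the question never defines X and y)
rng = np.random.default_rng(0)
X = pd.DataFrame({'age': rng.integers(18, 70, 100),
                  'income': rng.integers(20_000, 90_000, 100),
                  'city': rng.choice(['A', 'B', 'C'], 100)})
y = rng.integers(0, 2, 100)

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])   # fix 1
])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('model', RandomForestClassifier(n_estimators=50,
                                                  random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)           # fix 2

param_grid = {'model__max_depth': [3, 5, 8]}                    # fix 3
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
gs.fit(X_train, y_train)
print(gs.best_params_)
```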