Quiz: Chapter 10

Building Reproducible Data Pipelines


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

When you call pipeline.predict(X_test), what does each intermediate step in the pipeline call?

  • A) fit_transform(X_test) for all steps, then predict on the final step
  • B) transform(X_test) for all intermediate steps, then predict on the final step
  • C) fit(X_test) for all steps, then predict on the final step
  • D) fit_transform(X_test) for intermediate steps, then fit and predict on the final step

Answer: B) transform(X_test) for all intermediate steps, then predict on the final step. During predict, the pipeline assumes all steps have already been fitted (via a prior call to pipeline.fit). Each intermediate step applies its learned transformation using transform only --- no fitting occurs. The final step calls predict on the transformed data. This is the key design that prevents test-set leakage: the pipeline never refits during prediction.
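This fit-once, transform-only contract can be verified directly: after fitting, the scaler's learned statistics are untouched by predict. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[10.0], [20.0]])  # deliberately far outside the training range

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

mean_before = pipe.named_steps["scaler"].mean_.copy()
pipe.predict(X_test)  # intermediate steps call transform() only -- no refitting
mean_after = pipe.named_steps["scaler"].mean_

print(np.array_equal(mean_before, mean_after))  # True: statistics unchanged
```

If predict had refit the scaler on X_test, the stored mean would have jumped from 1.5 to 15.0; it does not.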


Question 2 (Short Answer)

Explain why preprocessing steps must be inside the pipeline when using cross_val_score, rather than applied beforehand.

Answer: When preprocessing is applied before cross_val_score, the preprocessing step sees the entire dataset --- including the validation fold. This is data leakage. For example, a StandardScaler fitted on the full training set computes means and standard deviations using observations that will later appear in the validation fold. The validation fold is no longer truly held out, and the cross-validation score is optimistically biased. When preprocessing is inside the pipeline, cross_val_score refits the entire pipeline on each fold's training portion, ensuring the validation fold is never seen during fitting.
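A minimal sketch of the correct pattern on synthetic data: the scaler sits inside the pipeline, so cross_val_score refits it on each fold's training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# Scaler inside the pipeline: each fold fits it on that fold's training data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)  # one honest, leakage-free score per fold
```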


Question 3 (Multiple Choice)

Which ColumnTransformer parameter controls what happens to columns that are not explicitly listed in any transformer?

  • A) sparse_threshold
  • B) remainder
  • C) handle_unknown
  • D) n_jobs

Answer: B) remainder. The default is 'drop', which silently removes columns not assigned to any transformer. Setting remainder='passthrough' passes them through unchanged. You can also assign a transformer to the remainder columns (e.g., remainder=StandardScaler()). In production, explicit column lists with remainder='drop' are preferred so that an unexpected new column is excluded from the model rather than silently flowing into it, as it would with remainder='passthrough'.
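The two behaviors can be seen on a two-column toy frame (column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Only column "a" is assigned; "b" is governed by remainder.
ct_drop = ColumnTransformer([("num", StandardScaler(), ["a"])])  # remainder='drop' (default)
ct_pass = ColumnTransformer([("num", StandardScaler(), ["a"])], remainder="passthrough")

print(ct_drop.fit_transform(df).shape)  # (3, 1): "b" silently dropped
print(ct_pass.fit_transform(df).shape)  # (3, 2): "b" passed through unchanged
```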


Question 4 (Multiple Choice)

A custom transformer class has the following __init__ method:

def __init__(self, threshold=0.05):
    self._threshold = threshold

What will happen when scikit-learn tries to clone this transformer during cross-validation?

  • A) It will work correctly
  • B) It will raise a TypeError
  • C) It will create a clone with threshold=0.05 regardless of the original's value
  • D) It will create a clone with _threshold=None

Answer: C) It will create a clone with threshold=0.05 regardless of the original's value. BaseEstimator.get_params() looks for attributes matching the __init__ parameter names. The parameter is named threshold, but the attribute is _threshold. get_params will not find _threshold and will use the default value from the __init__ signature. If you set threshold=0.10 at construction time, the clone will revert to threshold=0.05. The fix is self.threshold = threshold --- the attribute name must exactly match the parameter name.
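A short sketch of the corrected pattern, showing that clone preserves the constructed value once the attribute name matches the parameter name:

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ThresholdTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.05):
        self.threshold = threshold  # attribute name matches the parameter name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

t = ThresholdTransformer(threshold=0.10)
print(clone(t).threshold)  # 0.1 -- the clone keeps the constructed value
```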


Question 5 (Short Answer)

What is the difference between joblib.dump and pickle.dump for saving scikit-learn pipelines? When would you choose one over the other?

Answer: joblib is optimized for objects containing large numpy arrays, which are common in fitted scikit-learn transformers and models. It compresses these arrays and is typically 2-10x faster than pickle for large models. pickle is part of the Python standard library and requires no additional dependencies. Use joblib for scikit-learn pipelines (it is the recommended approach in the scikit-learn documentation). Use pickle only when deploying to environments that do not have joblib installed, or when serializing non-scikit-learn objects alongside the pipeline.
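A minimal round-trip sketch with a toy model (the filename is illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, random_state=0)
model = LogisticRegression().fit(X, y)

# Dump and reload; the restored model should predict identically.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)
    same = (loaded.predict(X) == model.predict(X)).all()

print(same)  # True
```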


Question 6 (Multiple Choice)

You fit a pipeline containing a OneHotEncoder(handle_unknown='ignore') on training data where the plan_type column has values {basic, standard, premium}. At prediction time, the test data contains plan_type='enterprise'. What happens?

  • A) The pipeline raises a ValueError
  • B) The enterprise category is encoded as all zeros in the one-hot columns
  • C) The enterprise category is mapped to the most frequent training category
  • D) The pipeline silently drops the row

Answer: B) The enterprise category is encoded as all zeros in the one-hot columns. With handle_unknown='ignore', the encoder produces a row of zeros for any category not seen during training. The model then treats this subscriber as having no plan type signal --- which is imperfect but does not crash the pipeline. Without handle_unknown='ignore', the encoder raises a ValueError and the prediction fails entirely.


Question 7 (Multiple Choice)

In a nested pipeline, what is the correct double-underscore path to access the strategy parameter of a SimpleImputer named 'imputer', inside a sub-pipeline named 'num', inside a ColumnTransformer named 'preprocessor', inside the main Pipeline?

  • A) preprocessor.num.imputer.strategy
  • B) preprocessor__num__imputer__strategy
  • C) preprocessor__num__imputer_strategy
  • D) preprocessor.num.imputer_strategy

Answer: B) preprocessor__num__imputer__strategy. The double-underscore notation separates each level of nesting. preprocessor is the ColumnTransformer step in the main pipeline, num is the named transformer within it, imputer is the step within the num sub-pipeline, and strategy is the parameter. This notation is used with get_params(), set_params(), and parameter grids for GridSearchCV.
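A sketch of this path in action (step names match the question; the column indices are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean"))])
pre = ColumnTransformer([("num", num_pipe, [0, 1])])
pipe = Pipeline([("preprocessor", pre), ("clf", LogisticRegression())])

# Each double underscore descends one level of nesting.
pipe.set_params(preprocessor__num__imputer__strategy="median")
print(pipe.get_params()["preprocessor__num__imputer__strategy"])  # 'median'
```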


Question 8 (Short Answer)

Your custom transformer adds 3 new columns to the input data. You did not implement get_feature_names_out. What breaks?

Answer: Without get_feature_names_out, downstream pipeline steps and introspection tools cannot map feature indices back to meaningful names. Calling pipeline.get_feature_names_out() will raise an AttributeError. More practically, when you extract model coefficients or feature importances, you will have indices like "feature 7" instead of "tickets_per_tenure_month." You also lose the ability to use set_output(transform='pandas') through the pipeline, because the transformer cannot provide column names for its output DataFrame.
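A hypothetical transformer sketch showing one way to implement get_feature_names_out (the class name and the ratio column name are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Appends one ratio column (column 0 divided by column 1)."""

    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, [0]] / (X[:, [1]] + 1e-9)  # small epsilon avoids division by zero
        return np.hstack([X, ratio])

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray(list(input_features) + ["ratio_0_over_1"], dtype=object)

t = RatioAdder().fit(np.array([[2.0, 1.0]]))
print(t.get_feature_names_out())  # ['x0' 'x1' 'ratio_0_over_1']
```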


Question 9 (Multiple Choice)

Which of the following is a valid reason to use FunctionTransformer instead of a full custom transformer class?

  • A) The transformation requires fitting on training data
  • B) The transformation is stateless and does not learn parameters
  • C) The transformation needs get_feature_names_out to work correctly
  • D) The transformation will be used in GridSearchCV with a parameter grid

Answer: B) The transformation is stateless and does not learn parameters. FunctionTransformer wraps a plain function that takes X and returns transformed X. It has a no-op fit method, making it suitable for stateless operations like np.log1p, adding a constant, or computing ratios from existing columns. For transformations that need to learn from training data (A), produce named outputs (C), or expose tunable parameters (D), a full custom transformer class is required.
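A minimal sketch of the stateless case:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p learns nothing from the data, so fit is a no-op.
log_tf = FunctionTransformer(np.log1p)
out = log_tf.fit_transform(np.array([[0.0], [np.e - 1]]))
print(out)  # [[0.], [1.]]
```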


Question 10 (Multiple Choice)

You save a fitted pipeline with joblib.dump(pipeline, 'model.joblib') using scikit-learn 1.5.2. You then try to load it in an environment with scikit-learn 1.3.0. What is the most likely outcome?

  • A) It loads and works identically
  • B) It raises a ModuleNotFoundError
  • C) It raises a version warning or UserWarning and may produce incorrect results
  • D) It loads but the predictions are exactly 0.0 for every input

Answer: C) It raises a version warning or UserWarning and may produce incorrect results. scikit-learn does not guarantee backward compatibility for serialized models across minor versions. Internal class structures, default parameter values, and numerical implementations can change. The deserialized object may have attributes that the older version does not expect, or may be missing attributes that the older version requires. Always deploy with the same scikit-learn version used for training, and record the version in your pipeline metadata.
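A minimal sketch of recording the training environment alongside the artifact (the dictionary keys are illustrative):

```python
import json

import joblib
import sklearn

# Save this metadata next to the .joblib file so the serving
# environment can verify it matches before loading the model.
metadata = {
    "sklearn_version": sklearn.__version__,
    "joblib_version": joblib.__version__,
}
print(json.dumps(metadata))
```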


Question 11 (Short Answer)

A pipeline has the following structure:

Pipeline
  |-- MissingIndicatorTransformer
  |-- ColumnTransformer
  |     |-- num: SimpleImputer -> StandardScaler
  |     |-- cat: OneHotEncoder
  |-- SelectKBest(k=10)
  |-- LogisticRegression

The MissingIndicatorTransformer adds 3 indicator columns to the DataFrame. After the ColumnTransformer, how many of those 3 indicator columns survive? Why?

Answer: Zero, unless the indicator columns are explicitly included in one of the ColumnTransformer's column lists. By default, ColumnTransformer uses remainder='drop', which discards any column not mentioned in the num or cat column lists. The indicator columns (e.g., total_hours_last_30d_missing) were added by the previous step but are not in either list. To preserve them, either add them to the numeric column list, or set remainder='passthrough' on the ColumnTransformer. This is a common pipeline construction bug.


Question 12 (Multiple Choice)

Which BaseEstimator / TransformerMixin rule is violated by the following custom transformer?

class BadTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        self.n_columns_ = len(columns)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
    return X[self.columns]

  • A) fit does not return self
  • B) __init__ performs data-dependent computation
  • C) __init__ sets a fitted attribute (trailing underscore) before fit is called
  • D) transform modifies the input data in-place

Answer: C) __init__ sets a fitted attribute (trailing underscore) before fit is called. The attribute n_columns_ has a trailing underscore, which by convention indicates it was learned during fit. Setting it in __init__ breaks the contract: clone() and get_params() expect fitted attributes to be absent before fitting. The fix is either to remove the trailing underscore (if it is a parameter, not a learned value) or to compute n_columns_ inside fit.


Question 13 (Short Answer)

Explain the difference between Pipeline and make_pipeline. When should you prefer one over the other?

Answer: Pipeline requires explicit names for each step (e.g., ('scaler', StandardScaler())), while make_pipeline auto-generates names from the class names (e.g., standardscaler). Use Pipeline in production code and shared projects, because explicit names make debugging, logging, and hyperparameter tuning clearer (e.g., preprocessor__num__imputer__strategy is more readable than auto-generated names). Use make_pipeline in exploratory analysis and prototyping where brevity matters and you do not need to reference steps by name.
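A quick sketch comparing the step names the two constructors produce:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

explicit = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
auto = make_pipeline(StandardScaler(), LogisticRegression())

print([name for name, _ in explicit.steps])  # ['scaler', 'clf']
print([name for name, _ in auto.steps])      # ['standardscaler', 'logisticregression']
```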


Question 14 (Multiple Choice)

A fitted ColumnTransformer with get_feature_names_out() returns names like num__tenure_months and cat__plan_type_basic. What do the prefixes (num__, cat__) represent?

  • A) The dtype of the original feature
  • B) The name of the transformer sub-pipeline that produced the feature
  • C) The order in which the features were processed
  • D) The column group from make_column_selector

Answer: B) The name of the transformer sub-pipeline that produced the feature. The prefix matches the name argument in the ColumnTransformer definition (e.g., ('num', numeric_pipeline, num_cols) produces the num__ prefix). This makes it possible to trace any output feature back to its source transformer, which is essential when debugging feature importance rankings or coefficient interpretations.


Question 15 (Short Answer)

You need to deploy a pipeline that includes a custom transformer defined in your project. A colleague tries to load the pipeline on their machine and gets AttributeError: Can't get attribute 'MissingIndicatorTransformer' on <module '__main__'>. What is the problem and how do you fix it?

Answer: The custom transformer class was defined in the __main__ module (e.g., in a Jupyter notebook or script executed directly). When joblib serializes the object, it stores the module path as __main__.MissingIndicatorTransformer. On the colleague's machine, __main__ refers to their script, which does not define this class. The fix is to define custom transformers in a proper Python module (e.g., streamflow_transformers.py), import from that module before saving, and ensure the same module is importable on the colleague's machine (e.g., by including it in the project package or installing it as a dependency).


Return to the chapter to review concepts.