Quiz: Chapter 29

Software Engineering for Data Scientists


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

In the cookiecutter-data-science project structure, raw data files belong in:

  • A) src/data/ --- because data loading code and data files should be co-located
  • B) data/raw/ --- because raw data is immutable and separate from code
  • C) notebooks/ --- because notebooks are where you first load the data
  • D) The project root --- because everything should be easily accessible

Answer: B) data/raw/ --- because raw data is immutable and separate from code. The cookiecutter convention separates data into raw/ (original, never modified), interim/ (intermediate transformations), and processed/ (final feature matrices). Raw data is treated as immutable: you never edit it in place. All transformations are code in src/data/, which reads from raw/ and writes to interim/ or processed/. This ensures that anyone can regenerate all derived data from the originals by running the pipeline.
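
The relevant slice of that layout, as generated by the cookiecutter-data-science template, looks like:

```
data/
├── raw/        # original, immutable dumps; never edited in place
├── interim/    # intermediate transformations
└── processed/  # final feature matrices for modeling
src/
└── data/       # code that turns raw/ into interim/ and processed/
```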


Question 2 (Multiple Choice)

Which of the following is the primary benefit of using pytest fixtures?

  • A) They make tests run faster by caching results
  • B) They eliminate the need for assertions in tests
  • C) They provide reusable, consistently constructed test data across multiple tests
  • D) They automatically generate random test cases

Answer: C) They provide reusable, consistently constructed test data across multiple tests. Fixtures are functions decorated with @pytest.fixture that return test objects (DataFrames, models, configurations, etc.). Any test function that includes the fixture name as a parameter receives the fixture's return value automatically. This avoids copy-pasting test data setup across dozens of tests, ensures all tests use the same data construction logic, and makes it easy to update test data in one place when requirements change.
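
A minimal sketch of the pattern, using hypothetical names (make_sample_events, sample_events); the builder function keeps the data construction in one place, and the fixture hands it to any test that names it as a parameter:

```python
import pandas as pd
import pytest


def make_sample_events() -> pd.DataFrame:
    # Single source of truth for test data construction.
    return pd.DataFrame({
        "user_id": [1, 1, 2],
        "active_days": [15, 30, 6],
        "total_days": [30, 30, 30],
    })


@pytest.fixture
def sample_events() -> pd.DataFrame:
    return make_sample_events()


def test_engagement_rate_bounds(sample_events: pd.DataFrame) -> None:
    # pytest injects the fixture's return value by matching the parameter name.
    rate = sample_events["active_days"] / sample_events["total_days"]
    assert rate.between(0, 1).all()
```

Updating make_sample_events updates every test that uses the fixture, which is the "one place to change" benefit the answer describes.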


Question 3 (Short Answer)

Explain the difference between a unit test and an integration test for a data science pipeline. Give one example of each for a churn prediction system.

Answer: A unit test tests a single function in isolation with controlled inputs and a known expected output. For example, testing that compute_engagement_rate(active_days=15, total_days=30) returns 0.5. An integration test tests multiple functions working together as a pipeline. For example, testing that the full sequence of load_and_clean(), build_feature_matrix(), train_churn_model(), and compute_metrics() runs without errors and produces a model with an AUC between 0 and 1. Unit tests catch bugs in individual functions; integration tests catch bugs in how functions interact (e.g., schema mismatches between pipeline stages).
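
As a sketch of the distinction: the unit test below is runnable as-is, while the integration test shows only the shape, since the pipeline stages named in the answer (load_and_clean and the rest) are hypothetical here:

```python
def compute_engagement_rate(active_days: int, total_days: int) -> float:
    # Unit under test: one small function, no I/O, no other stages.
    return active_days / total_days


def test_compute_engagement_rate() -> None:
    # Unit test: controlled input, exact expected output.
    assert compute_engagement_rate(active_days=15, total_days=30) == 0.5


def test_pipeline_end_to_end() -> None:
    # Integration test shape (stages from the answer, hypothetical here):
    # df = load_and_clean("tests/data/small_sample.csv")
    # X = build_feature_matrix(df)
    # model = train_churn_model(X)
    # metrics = compute_metrics(model, X)
    # assert 0.0 <= metrics["auc"] <= 1.0
    ...
```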


Question 4 (Multiple Choice)

A pre-commit hook running black modifies 3 files and the commit is rejected. What should you do?

  • A) Disable the pre-commit hook and commit again
  • B) Run git add on the modified files and commit again
  • C) Revert the changes black made and reformat manually
  • D) Use --no-verify to bypass the hook

Answer: B) Run git add on the modified files and commit again. When black reformats files as part of a pre-commit hook, the hook fails the commit because it changed files in your working directory, leaving the staged content out of date. The correct response is to stage the reformatted files with git add and commit again; this time black finds nothing left to reformat, so the hook passes. Options A and D defeat the purpose of the hook, and option C wastes time redoing work black already did correctly.
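
For reference, a typical hook configuration; the repo URL and hook id are black's real ones, while the rev pin is an example and should match whatever release the team uses:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0   # example pin; use the team's chosen release
    hooks:
      - id: black
```

After a rejected commit, git add -u restages the files black touched, and the retried commit goes through.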


Question 5 (Multiple Choice)

Which code quality tool catches the following bug?

import pandas as pd
import numpy as np
import scipy.stats  # Never used in this file

df = pd.read_csv("data.csv")

  • A) black (formatter)
  • B) ruff (linter)
  • C) mypy (type checker)
  • D) pytest (testing framework)

Answer: B) ruff (linter). Ruff detects unused imports (rule F401: 'scipy.stats' imported but unused) and can auto-fix by removing the import line. Black only reformats code style (whitespace, line length, quote style) and does not analyze code logic. Mypy checks type annotations and would not flag an unused import unless it caused a type error. Pytest runs tests and would not detect this issue unless a test explicitly checked for unused imports.


Question 6 (Short Answer)

What is the DRY principle, and why is it especially critical for machine learning systems?

Answer: DRY stands for "Don't Repeat Yourself" --- every piece of knowledge or logic should have a single, authoritative representation in the codebase. In ML systems, DRY violations are especially dangerous because feature engineering logic is often duplicated between training and serving pipelines. If these copies diverge (e.g., the training pipeline uses log1p while the serving pipeline uses log), the result is training-serving skew: the model receives different features in production than it saw during training, silently degrading prediction quality. Unlike a visible bug that crashes the system, training-serving skew produces wrong predictions that look normal.
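
A minimal sketch of the remedy, assuming a hypothetical spend column: both pipelines import one transform function instead of maintaining separate copies, so the log1p-vs-log divergence described above cannot happen:

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Single authoritative definition, imported by BOTH the training
    # and the serving pipeline, so the transform cannot diverge.
    out = df.copy()
    out["log_spend"] = np.log1p(out["spend"])  # hypothetical column name
    return out


# Training and serving both call the same function:
train_features = engineer_features(pd.DataFrame({"spend": [0.0, 9.0]}))
serve_features = engineer_features(pd.DataFrame({"spend": [0.0, 9.0]}))
```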


Question 7 (Multiple Choice)

A data scientist writes the following type-annotated function:

def compute_metrics(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_prob: pd.Series,
) -> dict[str, float]:

Mypy reports an error because the function sometimes returns {"auc": None} when there is only one class in y_true. What is the correct return type?

  • A) dict[str, float | None]
  • B) dict[str, any]
  • C) dict
  • D) Optional[dict[str, float]]

Answer: A) dict[str, float | None]. The function always returns a dictionary (so Optional[dict] is wrong --- the dict itself is never None), but some values in the dictionary can be None. The correct annotation for a dictionary with string keys and values that are either floats or None is dict[str, float | None]. Option B uses any (lowercase), which is the built-in function, not a type --- the typing name is Any. Option C is too vague: it says nothing about key or value types. Option D claims the entire dictionary might be None, which describes a different function entirely.
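
One way the corrected signature can be satisfied; the rank-based AUC here is a simplified stand-in for a library metric, and the __future__ import keeps the | annotation syntax working on older interpreters:

```python
from __future__ import annotations

import pandas as pd


def auc_score(y_true: pd.Series, y_prob: pd.Series) -> float:
    # Simplified rank-based (Mann-Whitney) AUC, enough for a sketch.
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = sum(float(p > n) + 0.5 * float(p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compute_metrics(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_prob: pd.Series,
) -> dict[str, float | None]:
    metrics: dict[str, float | None] = {
        "accuracy": float((y_true == y_pred).mean()),
    }
    # AUC is undefined when only one class is present, so the VALUE
    # (not the dict) may be None -- hence dict[str, float | None].
    metrics["auc"] = auc_score(y_true, y_prob) if y_true.nunique() > 1 else None
    return metrics
```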


Question 8 (Multiple Choice)

According to Google's "Hidden Technical Debt in Machine Learning Systems" paper, the actual machine learning code (model training and prediction) in a production ML system typically represents:

  • A) About 80% of the total code
  • B) About 50% of the total code
  • C) About 20% of the total code
  • D) A small fraction --- often 5% or less --- of the total code

Answer: D) A small fraction --- often 5% or less --- of the total code. The paper's central diagram shows the ML code as a tiny black box surrounded by vastly larger infrastructure components: data collection, data verification, feature extraction, configuration management, resource management, monitoring, serving infrastructure, analysis tools, and process management. Technical debt accumulates primarily in this surrounding infrastructure, not in the model code itself. This is why software engineering skills are essential for data scientists working on production systems.


Question 9 (Short Answer)

A colleague stores model hyperparameters as hardcoded values in train.py. Explain why this is a form of technical debt and describe a better approach.

Answer: Hardcoded hyperparameters create configuration debt because: (1) changing hyperparameters requires editing source code, which can introduce bugs; (2) there is no record of which hyperparameter values were used for which experiment; (3) different parts of the system may reference different hardcoded values for the same parameter. The better approach is to externalize configuration into a YAML or JSON file, loaded by a configuration class (e.g., a dataclass with a from_yaml method). This creates a single source of truth, makes configuration changes auditable through version control, and enables experiment tracking systems to log the exact configuration used for each run.
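
A sketch of the dataclass approach; the field names are hypothetical, and JSON is used here (the answer allows YAML or JSON) so the example needs only the standard library:

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainConfig:
    # Single source of truth for hyperparameters (hypothetical fields).
    learning_rate: float
    n_estimators: int
    max_depth: int

    @classmethod
    def from_json(cls, path: str | Path) -> "TrainConfig":
        return cls(**json.loads(Path(path).read_text()))


# train.py would then do something like:
#     config = TrainConfig.from_json("config/train.json")
# and the file itself is versioned, diffable, and loggable per run.
```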


Question 10 (Multiple Choice)

The nbstripout tool, when installed as a git filter, ensures that:

  • A) Notebook code cells are automatically converted to Python scripts
  • B) Notebook output cells (images, tables, print statements) are removed before committing
  • C) Notebooks with syntax errors cannot be committed
  • D) Notebooks are automatically executed and tested before each commit

Answer: B) Notebook output cells (images, tables, print statements) are removed before committing. Notebook files (.ipynb) are JSON documents that contain both code and outputs (including embedded images, which can be megabytes). Committing outputs inflates repository size, makes diffs unreadable (base64-encoded images mixed with code changes), and creates unnecessary merge conflicts. nbstripout strips all output cells and execution counts from the notebook before it reaches git, keeping only the code and markdown cells. The outputs remain in your local working copy --- they are only stripped in the committed version.
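
nbstripout itself is the tool to use; purely as an illustration, the core of what its filter does to the notebook JSON can be sketched as:

```python
import json


def strip_outputs(notebook_json: str) -> str:
    # Simplified sketch of nbstripout's git filter: drop outputs and
    # execution counts, keep code and markdown sources untouched.
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb)
```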


Question 11 (Short Answer)

You have a function that returns different results depending on whether it is called during training or during prediction. Explain why this is a problem and how to fix it.

Answer: This is a training-serving skew problem. If the same function behaves differently in training versus prediction contexts (e.g., using mean imputation from training data during training but global constants during prediction), the model sees different feature distributions at inference time than during training. This silently degrades performance without raising errors. The fix is to make the function stateless and deterministic: it should accept all required parameters explicitly (e.g., imputation values computed during training and passed in) rather than computing them differently based on context. The training pipeline computes and saves these parameters; the prediction pipeline loads and applies them.
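
A minimal sketch of the fit/apply split, with hypothetical names: the training pipeline computes the imputation values once and persists them, and both pipelines apply them through the same stateless function:

```python
import pandas as pd


def fit_imputation(train_df: pd.DataFrame, cols: list[str]) -> dict[str, float]:
    # Training pipeline: compute imputation values once, then persist them
    # (e.g., as JSON) alongside the model artifact.
    return {c: float(train_df[c].mean()) for c in cols}


def apply_imputation(df: pd.DataFrame, values: dict[str, float]) -> pd.DataFrame:
    # Both pipelines call this with explicit parameters -- no hidden
    # "am I training or serving?" branching inside the function.
    return df.fillna(value=values)


train = pd.DataFrame({"spend": [10.0, None, 20.0]})
params = fit_imputation(train, ["spend"])  # mean of 10 and 20
serving = apply_imputation(pd.DataFrame({"spend": [None]}), params)
```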


Question 12 (Multiple Choice)

Which @pytest.mark.parametrize decorator correctly tests a function with 3 input-output pairs?

  • A) @pytest.mark.parametrize("x, y", [(1, 2), (3, 6), (5, 10)])
  • B) @pytest.mark.parametrize("x", [1, 3, 5], "y", [2, 6, 10])
  • C) @pytest.mark.parametrize([(1, 2), (3, 6), (5, 10)])
  • D) @pytest.mark.parametrize("input, output", {1: 2, 3: 6, 5: 10})

Answer: A) @pytest.mark.parametrize("x, y", [(1, 2), (3, 6), (5, 10)]). The first argument is a comma-separated string of parameter names, and the second is a list of tuples, where each tuple provides one set of values. This generates 3 separate test cases: test(x=1, y=2), test(x=3, y=6), and test(x=5, y=10). Option B passes alternating name/value-list arguments, which is not parametrize's signature; option C omits the parameter-name string entirely; option D supplies a dictionary where a list of tuples is expected.
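
In context, option A looks like this (double is a hypothetical function under test):

```python
import pytest


def double(x: int) -> int:
    return 2 * x


@pytest.mark.parametrize("x, y", [(1, 2), (3, 6), (5, 10)])
def test_double(x: int, y: int) -> None:
    # pytest generates one test case per tuple: (x=1, y=2), (x=3, y=6), ...
    assert double(x) == y
```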


Question 13 (Short Answer)

Explain the difference between git branch -d experiment/tune-xgb-lr and git branch -D experiment/tune-xgb-lr. When would you use each in a data science project?

Answer: Lowercase -d deletes a branch only if it has been fully merged into the current branch (a safe delete). Uppercase -D forces deletion regardless of merge status. In a data science project, use -d to delete experiment branches whose results were good enough to merge into develop. Use -D to delete experiment branches whose results were poor --- these were never merged, so -d would refuse to delete them. The -D flag is appropriate for experiment branches because the commit history of a failed experiment has no long-term value; the result (e.g., "AUC did not improve") should be recorded in the commit message or experiment tracker before deletion.
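
Assuming git is available, the difference can be demonstrated in a throwaway repository:

```shell
# Throwaway repo to demo safe (-d) vs forced (-D) branch deletion.
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "initial"
git switch -q -c experiment/tune-xgb-lr
git commit -q --allow-empty -m "experiment: AUC did not improve"
git switch -q -                            # back to the default branch
git branch -d experiment/tune-xgb-lr || echo "refused: branch not merged"
git branch -D experiment/tune-xgb-lr       # force delete succeeds
```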


Question 14 (Multiple Choice)

Which of the following is NOT a form of technical debt specific to machine learning systems?

  • A) Feature engineering logic duplicated between training and serving pipelines
  • B) A function with poor variable names that is hard to read
  • C) An upstream data schema change that silently breaks the feature pipeline
  • D) Model A's predictions used as input features for Model B, creating invisible coupling

Answer: B) A function with poor variable names that is hard to read. While poor variable names are technical debt, they are not ML-specific --- they occur in any software project. Options A (training-serving skew from a DRY violation), C (data-level debt from an undocumented upstream schema dependency), and D (hidden dependency debt from chaining models, which the paper discusses under entanglement and correction cascades) are all forms of debt identified by Google's paper as unique to or amplified in ML systems. ML technical debt adds these categories (data debt, configuration debt, feedback-loop debt) on top of the standard code-level debt found in all software.


Question 15 (Short Answer)

A teammate says: "I don't need type hints. Python is dynamically typed, and my code works fine without them." Give two concrete benefits of type hints for a data science codebase, with examples.

Answer: First, type hints serve as machine-checkable documentation: def compute_recency(events: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame tells readers exactly what the function expects and returns without reading the implementation. Second, mypy catches bugs before runtime: if you accidentally pass a string where a Timestamp is expected, mypy flags this at development time rather than failing at 3 AM when the pipeline runs on new data. In data science, where functions often accept and return DataFrames with implicit schemas, type hints (even without full DataFrame column typing) establish contracts that make refactoring safer and onboarding faster.
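
A sketch of the signature mentioned above, assuming a hypothetical schema with user_id and event_date columns; the annotations state the contract, and passing a plain string where the pd.Timestamp is expected is exactly the kind of mistake mypy catches at development time:

```python
import pandas as pd


def compute_recency(
    events: pd.DataFrame, reference_date: pd.Timestamp
) -> pd.DataFrame:
    # Days since each user's most recent event (hypothetical schema:
    # one row per event, with user_id and event_date columns).
    last = events.groupby("user_id", as_index=False)["event_date"].max()
    last["recency_days"] = (reference_date - last["event_date"]).dt.days
    return last[["user_id", "recency_days"]]
```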


Quiz corresponds to Chapter 29: Software Engineering for Data Scientists. See index.md for the full chapter.