Quiz: Chapter 30

ML Experiment Tracking


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

In MLflow, what is the relationship between an "experiment" and a "run"?

  • A) An experiment is a single execution; a run is a collection of experiments
  • B) An experiment is a named collection of runs; each run represents one training execution
  • C) An experiment and a run are synonyms for the same thing
  • D) A run contains multiple experiments, each with different hyperparameters

Answer: B) An experiment is a named collection of runs; each run represents one training execution. An experiment groups related runs under a common name (e.g., "streamflow-churn-v2"), while each run represents a single training execution with a specific set of hyperparameters, metrics, and artifacts. This hierarchy allows you to compare all attempts at solving the same problem.


Question 2 (Multiple Choice)

Which of the following should you log as a parameter (not a metric) in MLflow?

  • A) Validation AUC
  • B) Training time in seconds
  • C) The learning rate used for training
  • D) The number of trees selected by early stopping

Answer: C) The learning rate used for training. Parameters are inputs to the experiment --- values you set before training begins. Metrics are outputs --- values you measure during or after training. Validation AUC and training time are metrics. The number of trees selected by early stopping is arguably a metric as well, since it is determined during training. The learning rate is a hyperparameter you choose beforehand and is therefore a parameter.


Question 3 (Short Answer)

A colleague has MLflow autologging enabled and says: "I do not need manual logging because autologging captures everything." Give two specific examples of information that autologging does not capture but that you should log manually.

Answer: Autologging captures model hyperparameters, training metrics, and the model artifact, but it does not capture (1) the data version or data hash --- there is no way for the logging framework to know which version of the training data was used, and (2) custom evaluation metrics on the test set beyond what the framework logs by default, such as precision at a specific recall threshold or business-specific metrics. Other valid examples include the git commit hash, preprocessing pipeline details, and feature engineering choices.


Question 4 (Multiple Choice)

You have identified the best model from a hyperparameter search. What is the correct sequence of steps in the MLflow Model Registry?

  • A) Register the model -> Assign "Production" alias -> Deploy
  • B) Register the model -> Assign "Staging" alias -> Validate -> Reassign to "Production" alias -> Deploy
  • C) Deploy the model -> Register it -> Assign "Production" alias
  • D) Register the model -> Delete all other runs -> Deploy

Answer: B) Register the model -> Assign "Staging" alias -> Validate -> Reassign to "Production" alias -> Deploy. The registry supports a lifecycle workflow: first register the model (creating a version), then assign it to staging for validation testing, then promote it to production after validation passes. This ensures that no untested model reaches production. Deleting other runs (option D) would destroy experiment history, which defeats the purpose of tracking.


Question 5 (Multiple Choice)

What is the primary advantage of logging the git commit hash as a tag in every MLflow run?

  • A) It makes the training code run faster
  • B) It links each run to the exact code version that produced it, enabling reproduction
  • C) It prevents other team members from modifying the code
  • D) It automatically versions the training data

Answer: B) It links each run to the exact code version that produced it, enabling reproduction. When you need to reproduce a result from months ago, having the git commit hash means you can check out the exact code that was used. Without it, you are guessing which version of the code produced the result, since preprocessing logic, feature engineering, and even model configuration often change between commits.


Question 6 (Short Answer)

Explain the difference between MLflow's artifact store and its backend store. Give an example of what is stored in each.

Answer: The backend store holds structured metadata: parameters (e.g., learning_rate=0.05), metrics (e.g., val_auc=0.8862), tags, and run metadata. It is typically a relational database (SQLite, PostgreSQL). The artifact store holds files: trained model binaries, confusion matrix images, feature importance CSVs, and any other files logged with mlflow.log_artifact(). It is typically a file system or object store (S3, GCS, Azure Blob). The separation allows the backend store to be fast for queries and comparisons, while the artifact store handles large binary files.


Question 7 (Multiple Choice)

A data scientist logs the test set AUC as a metric in MLflow. Three months later, a colleague sees this run has the highest AUC in the experiment and promotes it to production. However, the model performs worse than expected. What is the most likely explanation?

  • A) MLflow corrupted the metric during storage
  • B) The test set was used for early stopping, so the logged AUC was optimistically biased
  • C) The model was overfitted to the training data
  • D) W&B would have stored the metric more accurately

Answer: B) The test set was used for early stopping, so the logged AUC was optimistically biased. If the same data used for early stopping is also used for final evaluation, the reported metric is biased upward because the model was indirectly fitted to that data through the stopping decision. The correct approach is a three-way split: training (fit), validation (early stopping), and test (final, unbiased evaluation). This is the most common source of inflated metrics in experiment logs.
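The three-way split the answer prescribes can be sketched with two chained splits (sizes are illustrative; any 60/20/20-style split works):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for a real feature matrix and labels.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# 60% train (fit), 20% validation (early stopping), 20% test (final evaluation).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# Early stopping may look only at (X_val, y_val); (X_test, y_test) is touched
# exactly once, to produce the metric that gets logged.
```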


Question 8 (Multiple Choice)

Which statement about MLflow vs. Weights & Biases is most accurate?

  • A) MLflow has a better UI than W&B for experiment visualization
  • B) W&B is open source and free for all team sizes
  • C) MLflow is self-hosted and free; W&B has a better UI but is a paid SaaS product for teams
  • D) MLflow and W&B are interchangeable with no meaningful differences

Answer: C) MLflow is self-hosted and free; W&B has a better UI but is a paid SaaS product for teams. MLflow is Apache 2.0 licensed, runs on your own infrastructure, and costs nothing beyond the infrastructure it runs on. W&B is a commercial SaaS product with a free tier but requires paid plans for team features. W&B's dashboard and visualization capabilities are widely acknowledged as superior to MLflow's UI. Each tool has clear strengths depending on team needs.


Question 9 (Short Answer)

A team has been running experiments for six months without a naming convention. The MLflow experiment list shows entries like "test," "test2," "final_model," "johns_experiment," and "churn_v3_FINAL_USE_THIS." Propose a naming convention and explain why consistency matters for experiment management.

Answer: A recommended convention: {project}-{model-type}-{version} for experiments (e.g., churn-xgboost-v2) and {model}-{search-index}-{key-params} for runs (e.g., xgb-042-lr0.03-d7). Consistency matters because (1) it makes experiments searchable and filterable --- finding all XGBoost experiments for the churn project becomes trivial, and (2) it reduces ambiguity when onboarding new team members or revisiting work months later. Without conventions, the experiment list becomes unusable at scale, and the tracking system devolves into the same disorganized state as the spreadsheet it replaced.
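One way to enforce the convention is to generate names from helpers rather than typing them by hand (the helpers below are illustrative, not part of MLflow):

```python
def experiment_name(project: str, model_type: str, version: int) -> str:
    """{project}-{model-type}-{version}, e.g. churn-xgboost-v2."""
    return f"{project}-{model_type}-v{version}"

def run_name(model: str, search_index: int, **key_params) -> str:
    """{model}-{search-index}-{key-params}, e.g. xgb-042-lr0.03-d7."""
    parts = [model, f"{search_index:03d}"]
    parts += [f"{key}{value}" for key, value in key_params.items()]
    return "-".join(parts)
```

With these, every script produces names like `churn-xgboost-v2` and `xgb-042-lr0.03-d7` by construction, so the convention survives new team members and six more months of experiments.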


Question 10 (Multiple Choice)

What does mlflow.xgboost.log_model() store as an artifact?

  • A) Only the model weights as a binary file
  • B) The model, its dependencies, and a metadata file describing how to load it
  • C) The model, the training data, and the validation data
  • D) A screenshot of the model's predictions

Answer: B) The model, its dependencies, and a metadata file describing how to load it. log_model() stores the serialized model file, a conda.yaml or requirements.txt specifying dependencies, an MLmodel metadata file describing the model flavor (e.g., xgboost, sklearn), and optionally an input/output schema. This packaging allows the model to be loaded with mlflow.pyfunc.load_model() on any machine with the correct dependencies, regardless of the original training framework.


Question 11 (Short Answer)

You run the same training script twice with random_state=42 and identical data, but MLflow shows slightly different validation AUC values (0.8847 vs. 0.8851). Give two possible explanations.

Answer: Two likely explanations: (1) Non-determinism in the training algorithm --- XGBoost and LightGBM with multithreading (n_jobs=-1) can produce different results across runs because the order of floating-point operations depends on thread scheduling, even with a fixed random seed. Setting n_jobs=1 would eliminate this at the cost of slower training. (2) The data preprocessing changed between runs --- if any step upstream of the model (feature engineering, encoding, imputation) is not deterministic or was modified, the training data is not truly identical despite the same random seed for the split.


Question 12 (Multiple Choice)

In the MLflow Model Registry, what is the purpose of model aliases (e.g., "champion," "challenger")?

  • A) To give human-readable names to model versions so deployment pipelines can reference them without hard-coding version numbers
  • B) To delete old model versions automatically
  • C) To encrypt the model for security
  • D) To make the model compatible with W&B

Answer: A) To give human-readable names to model versions so deployment pipelines can reference them without hard-coding version numbers. A deployment pipeline loads models:/churn-predictor@champion rather than models:/churn-predictor/7. When a new model version is promoted, you reassign the alias without changing the deployment code. This decouples the deployment pipeline from the experiment workflow and enables safe model updates.


Question 13 (Short Answer)

Your organization is in the healthcare industry and is evaluating MLflow vs. W&B. The legal team requires that all patient data and model artifacts remain on-premises. Which tool would you recommend and why?

Answer: MLflow is the clear choice. It is self-hosted --- you run the tracking server, backend database, and artifact store entirely on your own infrastructure. Patient data and model artifacts never leave your network. W&B's default mode sends data to W&B's cloud servers, which would violate the on-premises requirement. W&B does offer a self-hosted enterprise option (W&B Server), but it requires an enterprise license and is significantly more expensive than running MLflow on PostgreSQL and S3-compatible storage.


Question 14 (Multiple Choice)

What is the best practice for handling a hyperparameter search with 200 configurations in MLflow?

  • A) Create 200 separate experiments, one per configuration
  • B) Log all 200 as runs in a single experiment, using nested runs under a parent
  • C) Log only the top 10 runs and discard the rest
  • D) Use a spreadsheet for the search and log only the final model in MLflow

Answer: B) Log all 200 as runs in a single experiment, using nested runs under a parent. A parent run represents the search as a whole (with tags like search_method=bayes and total_configs=200), and each child run contains the parameters and metrics for one configuration. This keeps the experiment list clean while preserving the full search history. Discarding runs (option C) or using a spreadsheet (option D) defeats the purpose of experiment tracking.


Question 15 (Short Answer)

Explain why experiment tracking is described as "infrastructure" rather than a "nice-to-have" in this chapter. Give a concrete example of a failure that experiment tracking prevents.

Answer: Experiment tracking is infrastructure because it is foundational to reproducibility, collaboration, and model governance --- all requirements for production ML, not optional extras. A concrete failure it prevents: a team deploys model version 3 to production, performance degrades, and they need to roll back to version 2. Without experiment tracking, they cannot identify which hyperparameters, data version, or code produced version 2. With experiment tracking, they query the Model Registry, find the exact run, check the parameters and data version tag, and either redeploy the registered model or reproduce it from the logged configuration.


This quiz covers Chapter 30: ML Experiment Tracking. Return to the chapter for full context.