# Chapter 30 Exercises: ML Experiment Tracking
## Exercise 1: MLflow Fundamentals (Code)
Set up a local MLflow tracking server and run the following experiment. Log everything properly.
```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score, log_loss

# Generate data
X, y = make_classification(
    n_samples=8000, n_features=20, n_informative=12,
    n_redundant=4, flip_y=0.08, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
a) Create an MLflow experiment named "exercise-30-1". Write a training loop that trains a GradientBoostingClassifier with the following three configurations and logs each as a separate run:
| Config | n_estimators | learning_rate | max_depth |
|---|---|---|---|
| A | 100 | 0.1 | 3 |
| B | 300 | 0.05 | 5 |
| C | 500 | 0.03 | 4 |
For each run, log:
- All hyperparameters (use `mlflow.log_params()`)
- Validation AUC, F1, and log loss (use `mlflow.log_metrics()`)
- The trained model (use `mlflow.sklearn.log_model()`)
- A tag with your name and the config label (A, B, or C)
b) After all three runs complete, write code to query the MLflow tracking API to find the run with the highest validation AUC. Print its run ID and all logged parameters.
c) Open the MLflow UI and take a screenshot of (or describe) the parallel coordinates plot comparing the three runs. Which hyperparameter appears to have the strongest relationship with AUC?
## Exercise 2: Artifact Logging (Code)
Extend one of your runs from Exercise 1 to log the following artifacts:
a) A confusion matrix saved as a PNG image.
b) A CSV file containing the test set predictions (y_true, y_pred, y_proba).
c) A JSON file containing the feature importance score for every feature. Use this structure:
```python
import json

importance_dict = {f"feature_{i}": float(model.feature_importances_[i])
                   for i in range(len(model.feature_importances_))}
with open("feature_importance.json", "w") as f:
    json.dump(importance_dict, f, indent=2)
mlflow.log_artifact("feature_importance.json")
```
d) Navigate to the MLflow UI and verify that all three artifacts appear under the run's Artifacts tab. How would you retrieve the confusion matrix image programmatically using `mlflow.artifacts.download_artifacts()`?
## Exercise 3: Data Version Tracking (Conceptual + Code)
Consider the following scenario: your team trains a churn model on Monday. On Wednesday, the data engineering team reruns the feature pipeline and fixes a bug that was causing `payment_failures_6m` to undercount failures by 15%. The training data now has different values for this column.
a) If you retrain the model on Wednesday without logging any data version information, what problems might arise three months later when someone tries to compare Monday's and Wednesday's results?
b) Implement a `log_data_fingerprint()` function that takes a pandas DataFrame and logs the following to MLflow:
- A SHA-256 hash of the DataFrame contents
- The number of rows and columns
- The column names (as a JSON string)
- The mean of each numeric column (as metrics with the prefix `data_`)
```python
import hashlib
import json

import pandas as pd
import numpy as np
import mlflow

def log_data_fingerprint(df, prefix="train"):
    """Log a reproducibility fingerprint for a DataFrame to MLflow."""
    # YOUR CODE HERE
    pass
```
c) Why is logging the data hash not sufficient on its own? What additional information would you need to fully reconstruct the training data?
## Exercise 4: Model Registry Workflow (Code)
Using the best model from Exercise 1, implement the following Model Registry workflow:
a) Register the model with the name "exercise-churn-model". Print the version number.
b) Assign the alias "champion" to this version.
c) Load the model using the alias and verify it produces the same predictions as the original model object.
d) Now train a new model with slightly different hyperparameters (learning_rate=0.04, max_depth=5, n_estimators=400). Register it as version 2 of the same model. Compare its validation AUC to version 1.
e) If version 2 is better, reassign the "champion" alias to version 2. Write the code that a deployment pipeline would use to always load the current champion model.
## Exercise 5: Autologging vs. Manual Logging (Code)
a) Train an XGBoost model with MLflow autologging enabled:
```python
import mlflow
import xgboost as xgb

mlflow.xgboost.autolog()
model = xgb.XGBClassifier(
    learning_rate=0.05, max_depth=6, n_estimators=500,
    early_stopping_rounds=30, eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
```
Inspect the run in the MLflow UI. List all parameters and metrics that were automatically logged.
b) Now train the same model with autologging disabled and manual logging only. Log the same parameters and metrics, plus the following additional items that autologging missed:
- A `data_version` tag
- A `git_commit` tag (use a placeholder if not in a git repo)
- Test set AUC, F1, precision, and recall
- Feature importance as a CSV artifact
- The training data column names as a JSON artifact
c) Compare the two runs side by side in the MLflow UI. What did autologging capture that you might have forgotten? What did manual logging capture that autologging missed? Write a one-paragraph recommendation for when to use each approach.
## Exercise 6: Experiment Organization (Conceptual)
Your team of five data scientists is starting a new project: predicting customer lifetime value (CLV) for an e-commerce company. You will explore linear regression, Random Forest, XGBoost, and a neural network. Each team member will run their own experiments.
a) Design a naming convention for experiments and runs. Define: experiment name format, run name format, and a set of required tags.
b) Should you create one experiment for all model types or one experiment per model type? Justify your choice with at least two arguments.
c) Draft a one-page "Experiment Tracking Standards" document that your team would follow. Include: naming conventions, required logging (parameters, metrics, tags, artifacts), and rules for the Model Registry (when to register, alias naming).
d) Your colleague argues that experiment tracking is "overhead that slows us down during exploration." Write a three-sentence response.
## Exercise 7: W&B Comparison (Code, optional)
This exercise requires a free W&B account.
a) Reimplement Exercise 1 using W&B instead of MLflow. Use `wandb.init()`, `wandb.config`, `wandb.log()`, and `wandb.finish()`.
b) Use W&B Sweeps to run a Bayesian hyperparameter search over the same parameter space. Configure the sweep to maximize validation AUC and run 20 trials.
```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.15},
        "max_depth": {"values": [3, 4, 5, 6, 7]},
        "n_estimators": {"values": [100, 200, 300, 500]},
    },
}

# YOUR CODE HERE: create sweep, define train function, run agent
```
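A sketch of the missing pieces, with the sweep config repeated so it runs standalone. The project name is a placeholder, and the sweep launch is guarded so it only fires when W&B is installed and a login is configured, since a sweep needs a W&B account:

```python
import os

try:
    import wandb
except ImportError:  # W&B is optional for this exercise
    wandb = None

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=8000, n_features=20, n_informative=12,
                           n_redundant=4, flip_y=0.08, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

def train():
    # Called once per trial; the agent injects sampled hyperparameters.
    with wandb.init():
        cfg = wandb.config
        model = GradientBoostingClassifier(
            learning_rate=cfg.learning_rate,
            max_depth=cfg.max_depth,
            n_estimators=cfg.n_estimators,
            random_state=42,
        ).fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        wandb.log({"val_auc": auc})  # must match sweep_config["metric"]

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.15},
        "max_depth": {"values": [3, 4, 5, 6, 7]},
        "n_estimators": {"values": [100, 200, 300, 500]},
    },
}

# Only launch the sweep when W&B is available and a login is configured.
if wandb is not None and os.environ.get("WANDB_API_KEY"):
    sweep_id = wandb.sweep(sweep_config, project="exercise-30-7")
    wandb.agent(sweep_id, function=train, count=20)
```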
c) Compare the experience of using W&B Sweeps vs. the manual grid search you did in Exercise 1 with MLflow. Consider: setup time, visualization quality, ease of finding the best configuration, and total code written.
## Exercise 8: Reproducibility Audit (Applied)
Go back to a model you trained in a previous chapter (Chapter 14, 18, or 19). Try to reproduce the exact results. Document the following:
a) Can you identify the exact hyperparameters that produced the result? If not, what information is missing?
b) Can you identify the exact training data that was used? Was there a random seed for the train/test split?
c) Can you identify the exact code version? Did you save a git commit hash?
d) Rewrite the training script from that chapter with full MLflow tracking. Run it twice with the same parameters and verify that the results are identical (within floating-point precision). If they differ, explain why.
e) Estimate how much time experiment tracking would have saved you in this course so far. Be specific: how many times did you retrain a model because you lost track of the best hyperparameters?