Key Takeaways: Chapter 30
ML Experiment Tracking
- If you cannot tell someone what hyperparameters produced your best model, you do not have a best model --- you have a lucky guess. Experiment tracking replaces manual spreadsheets with automated, permanent, queryable records of every parameter, metric, and artifact from every training run. This is not optional overhead. It is the minimum requirement for reproducible ML.
- MLflow is the open-source standard for experiment tracking in production. It has four components: Tracking (parameters, metrics, artifacts), Projects (reproducible packaging), Models (standardized model format), and the Model Registry (versioned model lifecycle management). It is Apache 2.0 licensed, self-hosted, and free --- your data never leaves your infrastructure.
- The `with mlflow.start_run()` pattern is the fundamental building block. Everything inside the context manager is associated with a single run. Log parameters before training (`mlflow.log_params()`), metrics after evaluation (`mlflow.log_metrics()`), and artifacts at any point (`mlflow.log_artifact()`). When the block exits, the run is finalized.
- Log everything that could affect the result --- not just model hyperparameters. The data version, the git commit hash, the preprocessing choices, the random seed, the feature set, and the train/test split configuration should all be recorded. The most common reproducibility failure is not hyperparameters; it is silent changes to the training data.
- The MLflow Model Registry turns experiment results into production artifacts. Register a model, assign an alias (`champion`, `challenger`), and your deployment pipeline loads by alias: `mlflow.pyfunc.load_model("models:/churn-predictor@champion")`. Promoting a new model is a single alias reassignment. Rolling back is another. Every registered model traces back to its training run.
- Weights & Biases has a better UI and built-in hyperparameter sweeps, but it is a SaaS product. W&B's dashboard, real-time collaboration, automatic system metrics logging, and Bayesian sweep coordination are genuinely superior to MLflow's equivalents. The cost is a per-user subscription and data stored on W&B's servers (or an enterprise self-hosted license at significantly higher cost).
- The honest comparison: MLflow is free and self-hosted but requires infrastructure; W&B has better UX but is SaaS. Use MLflow if you are in a regulated industry, need data to stay on-premises, or need a mature Model Registry integrated with deployment. Use W&B if you want zero-setup exploration with excellent visualization. Use both if your workflow benefits from W&B exploration and MLflow production tracking.
- Autologging is a floor, not a ceiling. `mlflow.autolog()` automatically captures model hyperparameters, training metrics, and the model artifact with zero additional code. But it does not capture your data version, git commit, custom evaluation metrics, or domain-specific artifacts. Use autologging to catch what you forget; use manual logging to capture what matters.
- Organize experiments with consistent naming, nested runs, and required tags. Experiment name format: `{project}-{model-type}-{version}`. Run name format: `{model}-{index}-{key-params}`. Required tags: `data_version`, `author`, `purpose`. Use parent-child runs for hyperparameter searches to keep the experiment list manageable.
- The data fingerprint is the most valuable thing you can log. A SHA-256 hash of the training DataFrame, the column names, and the dtypes catches silent data pipeline changes that break reproducibility. Log it on every run. When someone asks "why are the results different?", the first thing you check is whether the data hash matches.
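One way to compute such a fingerprint, as a minimal sketch (the function name is ours; logging it with `mlflow.log_param("data_hash", ...)` would follow the pattern above):

```python
import hashlib
import pandas as pd

def data_fingerprint(df: pd.DataFrame) -> str:
    """SHA-256 fingerprint over a DataFrame's schema and contents."""
    h = hashlib.sha256()
    # Column names and dtypes: catches silent schema changes
    for col, dtype in df.dtypes.items():
        h.update(f"{col}:{dtype}".encode())
    # Row-wise content hash: catches silent value changes
    h.update(pd.util.hash_pandas_object(df, index=True).values.tobytes())
    return h.hexdigest()
```

Identical data yields an identical hash; any change to a value, a column name, or a dtype yields a different one, which is exactly the comparison you want when results diverge.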
If You Remember One Thing
Experiment tracking is infrastructure, not overhead. Five lines of code --- set the tracking URI, set the experiment, start a run, log params, log metrics --- give you a permanent, queryable record of every experiment your team runs. The Model Registry connects those experiments to production, and model lineage means you can trace any deployed model back to the exact data, code, and hyperparameters that produced it. The alternative is a spreadsheet labeled "Experiment Log" with 347 rows of fiction and a pickle file named model_ACTUALLY_final.pkl. One of these approaches scales. The other has already failed.
These takeaways summarize Chapter 30: ML Experiment Tracking. Return to the chapter for full context.