Further Reading: Chapter 30

ML Experiment Tracking


Official Documentation

1. MLflow Documentation --- mlflow.org The primary reference for everything MLflow. Start with the "Quickstart" guide for initial setup, then move to the "Tracking" section for the logging API, the "Model Registry" section for model lifecycle management, and the "MLflow Models" section for the pyfunc packaging format. The "Search Runs" documentation covers the query syntax for filtering and comparing runs programmatically. Updated with every release; the 2.x series introduced significant improvements to the Model Registry (aliases replacing stages).

2. Weights & Biases Documentation --- docs.wandb.ai Comprehensive documentation covering experiment tracking (wandb.init, wandb.log), sweeps (hyperparameter optimization), artifacts (data and model versioning), and Reports (collaborative dashboards). The "Quickstart" gets you from zero to first run in under five minutes. The "Sweeps" documentation is particularly strong, covering Bayesian, grid, random, and custom search strategies with clear examples.
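The sweep search strategies mentioned above are driven by a declarative configuration. Below is the Python-dict equivalent of the YAML format in the Sweeps docs; the metric and parameter names are placeholders, not part of any real project:

```python
# A W&B sweep configuration sketch. All names below are hypothetical.
sweep_config = {
    "method": "bayes",  # other options: "grid", "random"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-4,
            "max": 1e-1,
        },
        "batch_size": {"values": [32, 64, 128]},
    },
}

# With wandb installed and a train() function defined, a sweep would be
# launched roughly like this:
#   sweep_id = wandb.sweep(sweep_config, project="demo")
#   wandb.agent(sweep_id, function=train, count=20)
```

The same dict structure serializes directly to the YAML file the docs use, so either form works.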

3. MLflow GitHub Repository --- github.com/mlflow/mlflow The source code, issue tracker, and community discussions. The examples/ directory contains working code for every MLflow feature: sklearn tracking, XGBoost autologging, model registry workflows, Docker-based deployment, and more. The "Releases" page documents breaking changes between versions, which matters when upgrading a production tracking server.


Foundational Concepts

4. "Hidden Technical Debt in Machine Learning Systems" --- Sculley et al. (Google, 2015) The paper that popularized the concept of ML technical debt and identified experiment management as a core challenge. The authors argue that the actual machine learning code in a production ML system is a small fraction of the total code; the rest is data pipelines, configuration, monitoring, and experiment infrastructure. Published at NeurIPS 2015. This paper is the intellectual foundation for why experiment tracking matters --- it is not about convenience, it is about managing the complexity of ML systems.

5. "TFX: A TensorFlow-Based Production-Scale Machine Learning Platform" --- Baylor et al. (Google, 2017) Describes Google's internal ML platform, including metadata tracking and model management. While TFX is specific to the TensorFlow ecosystem, the architectural principles --- separating the training pipeline from the model serving pipeline, tracking metadata at every step, and using a model registry for deployment --- apply to any experiment tracking system. Published at KDD 2017.

6. Designing Machine Learning Systems --- Chip Huyen (2022) Chapter 4 ("Training Data") and Chapter 6 ("Model Development and Offline Evaluation") cover the practices that experiment tracking supports: data versioning, model selection, reproducibility, and the transition from development to production. Huyen writes from a practitioner perspective with clear examples from companies at different scales. O'Reilly. One of the best books on production ML engineering written in the 2020s.


MLflow Deep Dives

7. MLflow in Action --- Khare and Pandya (2024) A book-length treatment of MLflow, covering tracking, projects, models, and the model registry with production-oriented examples. Includes chapters on deploying MLflow on Kubernetes, integrating with CI/CD pipelines, and scaling the tracking server for large teams. Manning Publications. The most comprehensive MLflow-specific book available.

8. "Managing Your Machine Learning Experiments with MLflow" --- Databricks Blog A series of blog posts from the MLflow creators covering best practices for experiment organization, run comparison, and model registry workflows. The posts include production patterns that are not in the official documentation, such as using nested runs for hyperparameter searches and configuring artifact retention policies. Available at databricks.com/blog.

9. MLflow Model Registry Workflow Guide --- Databricks Documentation A step-by-step guide to the Model Registry lifecycle: registering models, managing versions, transitioning stages (legacy) and assigning aliases (modern), and loading registered models for inference. Includes the rationale for the transition from fixed stages to flexible aliases in MLflow 2.9+. Available at docs.databricks.com.


Weights & Biases Resources

10. Effective MLOps with W&B --- Weights & Biases (2023) A free online course (available at wandb.ai/courses) covering experiment tracking, hyperparameter sweeps, model and data versioning, and collaborative reporting. The course uses practical examples with PyTorch and scikit-learn. The sweep module is particularly well-done, covering Bayesian optimization theory and practical sweep configuration.

11. "Experiment Tracking with W&B" --- Lukas Biewald (W&B CEO), YouTube A conference talk explaining the design philosophy behind W&B: why cloud-first, why automatic system metrics, and why the UI prioritizes real-time collaboration. Biewald is candid about the tradeoffs vs. self-hosted solutions. Worth watching for the product perspective, regardless of which tool you adopt.


Experiment Tracking Comparisons

12. "MLflow vs. Weights & Biases vs. Neptune: A Practitioner's Guide" --- Neptune.ai Blog A detailed feature comparison of the three major experiment tracking platforms, written by the Neptune team (so read with appropriate bias awareness). Despite the source, the feature matrices are accurate and the discussion of tradeoffs is balanced. Covers logging APIs, UI capabilities, model registries, pricing, and deployment integration. Updated regularly.

13. "Experiment Tracking Tools for Machine Learning" --- Made With ML (Goku Mohandas) An open-source MLOps course that includes a module on experiment tracking with side-by-side MLflow and W&B examples. The code is available on GitHub, and the explanations are practitioner-focused. The comparison is notably balanced, with clear recommendations for different team sizes and requirements. Available at madewithml.com.


Reproducibility and Data Versioning

14. DVC (Data Version Control) Documentation --- dvc.org DVC handles a problem adjacent to experiment tracking: versioning datasets and ML pipelines. While MLflow tracks experiments (parameters, metrics, models), DVC tracks the data and code that feed into those experiments. Many teams use DVC for data versioning and MLflow for experiment tracking --- the two tools are complementary, not competing. The documentation includes integration guides for MLflow.

15. "Reproducibility in Machine Learning" --- Pineau et al. (2020) A position paper from the NeurIPS reproducibility committee arguing that experiment tracking is necessary but not sufficient for ML reproducibility. The authors identify four pillars: code versioning, data versioning, environment specification, and experiment logging. Published as a NeurIPS 2020 workshop paper. The framework is useful for evaluating whether your tracking setup actually achieves reproducibility or just creates an illusion of it.

16. "The ML Test Score: A Rubric for ML Production Readiness" --- Breck et al. (Google, 2017) A scoring rubric for assessing ML system maturity, including criteria for experiment tracking, model validation, and monitoring. The rubric assigns points for practices like "all hyperparameters are logged," "model performance is tracked over time," and "data dependencies are versioned." Useful as a self-assessment for teams adopting experiment tracking. Published in IEEE Big Data 2017.


Production Patterns

17. "Introducing MLflow Model Registry" --- Databricks Engineering Blog The original announcement post for the Model Registry, explaining the motivation (model versioning chaos in production), the design (stages, versions, annotations), and the integration with MLflow Tracking. The post includes a production workflow diagram that clarifies how experiments flow from tracking to registry to deployment.

18. Practical MLOps --- Noah Gift and Alfredo Deza (2021) Chapters 3-4 cover experiment tracking and model management in the context of CI/CD for ML. The book takes a DevOps-engineer perspective rather than a data-scientist perspective, which provides useful counterpoint: how does the deployment team interact with your experiment tracking system? O'Reilly.

19. "Full Stack Deep Learning" --- UC Berkeley Course (2022) Lecture 7 ("Experiment Tracking and Management") covers MLflow and W&B in the context of a complete ML development workflow. The lecture slides and video are freely available at fullstackdeeplearning.com. The course is more focused on deep learning than tabular ML, but the experiment tracking principles are universal.


Advanced Topics

20. "ML Metadata (MLMD)" --- TensorFlow Documentation Google's open-source library for tracking metadata in ML workflows, used internally at Google and in TFX pipelines. MLMD is lower-level than MLflow --- it provides a metadata store and schema, not a UI or model registry. Interesting as an alternative architectural approach: instead of a standalone tracking tool, embed metadata tracking directly into the pipeline framework. Available at tensorflow.org/tfx/guide/mlmd.


How to Use This List

If you are setting up experiment tracking for the first time, start with the MLflow documentation (item 1) and the Quickstart guide. Spend an afternoon getting a local tracking server running and logging your first experiment.

If you are evaluating tools for a team, read the comparison in item 12, then try both MLflow and W&B on a real project for one week each. The experience of using the tools is more informative than any feature matrix.

If you want to understand why experiment tracking matters, read Sculley et al. (item 4) and Huyen (item 6). The first explains the problem; the second explains the solution in production terms.

If you are ready to go beyond experiment tracking to full MLOps, read Gift and Deza (item 18) for the deployment perspective and the ML Test Score (item 16) for a maturity assessment.


This reading list supports Chapter 30: ML Experiment Tracking. Return to the chapter to review concepts before diving in.