

Chapter 30: ML Experiment Tracking

MLflow, Weights & Biases, and Reproducible Research


Learning Objectives

By the end of this chapter, you will be able to:

  1. Set up MLflow for experiment tracking, model versioning, and artifact storage
  2. Log parameters, metrics, and artifacts systematically
  3. Compare experiments in the MLflow UI
  4. Register models in the MLflow Model Registry
  5. Understand W&B as an alternative and compare the two platforms

The Spreadsheet of Shame

War Story --- A senior data scientist at an insurance company was asked to reproduce the model that had been deployed to production eight months earlier. She opened the team's shared Google Sheet --- the one labeled "Experiment Log" --- and found 347 rows of hand-entered hyperparameters, validation scores, and notes like "tried lower LR, seemed better" and "same as row 212 but with new features." There was no record of which preprocessing pipeline produced the training data. No record of the random seed. No record of the exact feature set. The model artifact was a pickle file named model_final_v3_FINAL_USE_THIS.pkl sitting in a shared drive alongside model_final_v4.pkl and model_ACTUALLY_final.pkl. She spent three weeks trying to reproduce the results. She never succeeded.

If you cannot tell me what hyperparameters produced your best model, you do not have a best model --- you have a lucky guess.

This is not a rare story. It is the default state of most data science teams. Every team starts with good intentions: "We will track everything in a spreadsheet." Within a month, the spreadsheet is incomplete. Within three months, it is unreliable. Within six months, it is fiction. The problem is not laziness. The problem is that manual tracking is fundamentally incompatible with the iterative, high-throughput nature of ML experimentation. A single hyperparameter search can produce hundreds of runs. No human is going to manually log the learning rate, max depth, subsample ratio, column sample rate, regularization strength, number of estimators, early stopping round, and six evaluation metrics for each of those runs. So they don't. And six months later, nobody can reproduce anything.

Experiment tracking tools solve this problem by automating the logging. You add a few lines of code to your training script, and the tool records every parameter, every metric, every artifact, every environment detail --- automatically, consistently, and permanently.

This chapter covers two tools: MLflow (the open-source standard) and Weights & Biases (the best-in-class SaaS alternative). We will go deep on MLflow because it is free, self-hosted, and the tool you are most likely to encounter in production environments. We will give W&B a thorough and honest comparison because it has a genuinely better user experience for certain workflows.


What Experiment Tracking Actually Tracks

Before we touch any tool, let us define the vocabulary precisely.

Experiment --- A named collection of runs that share a common objective. "StreamFlow churn model v2" is an experiment.
Run --- A single execution of a training script. One set of hyperparameters, one training pass, one set of results.
Parameter --- An input to the run: learning_rate=0.05, max_depth=6, n_estimators=2000.
Metric --- An output measurement: val_auc=0.8834, val_log_loss=0.2917, train_time_seconds=14.3.
Artifact --- A file produced by the run: the trained model, a feature importance plot, the preprocessed dataset, a confusion matrix image.
Tag --- Metadata attached to a run: author=caleb, data_version=2024-03-15, purpose=hyperparameter_search.
Model Registry --- A versioned catalog of production-ready models, with stage labels like "Staging" and "Production."

A good experiment tracking system captures all of this for every run, automatically, so that any run can be reproduced, compared, or promoted to production months later.


MLflow: The Open-Source Standard

MLflow is an open-source platform for managing the ML lifecycle. It was created by Databricks in 2018 and has become the de facto standard for experiment tracking in production environments. It has four components:

  1. MLflow Tracking --- Logs parameters, metrics, and artifacts for each run
  2. MLflow Projects --- Packages ML code in a reusable, reproducible format
  3. MLflow Models --- Provides a standard format for packaging models for deployment
  4. MLflow Model Registry --- Manages model versions and deployment stages

We will focus primarily on Tracking and the Model Registry, because those are the components you will use every day.

Installation and Setup

pip install mlflow scikit-learn xgboost pandas numpy

MLflow stores tracking data in a backend store (parameters, metrics, tags) and an artifact store (models, plots, data). By default, both use the local filesystem --- a directory called mlruns/ in your working directory. For team use, you would configure a database backend (SQLite, PostgreSQL, MySQL) and a remote artifact store (S3, GCS, Azure Blob). We will start local and scale up.

# Start the MLflow tracking server (local, for development)
mlflow server --host 127.0.0.1 --port 5000

This starts the MLflow UI at http://127.0.0.1:5000. Leave this running in a terminal and open the URL in your browser. You will see an empty experiment list. Time to fill it.
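For a team deployment, the same command takes two more flags: one pointing the backend store at a database and one pointing the artifact store at remote storage. A sketch, assuming a SQLite database and an S3 bucket (the database path and bucket name are illustrative, not from this chapter):

```shell
# Team setup (sketch): database-backed store plus remote artifact storage.
# Replace the illustrative SQLite URI and S3 bucket with your own.
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root s3://my-mlflow-artifacts \
  --host 0.0.0.0 --port 5000
```

Swapping the SQLite URI for PostgreSQL or MySQL is the usual next step once more than a couple of people share the server.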

Your First Tracked Run

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, log_loss

# Point to the tracking server
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Create or get an experiment
mlflow.set_experiment("first-experiment")

# Generate sample data
X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=10,
    n_redundant=3, flip_y=0.05, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# --- The tracked run ---
with mlflow.start_run(run_name="random-forest-baseline"):
    # Define hyperparameters
    params = {
        "n_estimators": 200,
        "max_depth": 10,
        "min_samples_split": 5,
        "random_state": 42,
    }

    # Log parameters
    mlflow.log_params(params)

    # Train
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)

    metrics = {
        "val_auc": roc_auc_score(y_test, y_pred_proba),
        "val_f1": f1_score(y_test, y_pred),
        "val_log_loss": log_loss(y_test, y_pred_proba),
    }

    # Log metrics
    mlflow.log_metrics(metrics)

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, artifact_path="model")

    # Log a tag for context
    mlflow.set_tag("model_type", "RandomForest")
    mlflow.set_tag("data_version", "v1-synthetic")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"AUC: {metrics['val_auc']:.4f}")
    print(f"F1:  {metrics['val_f1']:.4f}")
Run ID: a3f7c2e8d1b04a5f9e6c8d2a1b3f7e9c
AUC: 0.9412
F1:  0.8720

Open the MLflow UI, click on the "first-experiment" experiment, and you will see your run with all parameters, metrics, and the logged model artifact. That is the entire point: you never had to open a spreadsheet, type a number, or remember anything. It is all there.

Key Insight --- The with mlflow.start_run() context manager is the fundamental pattern. Everything inside the block is associated with a single run. When the block exits, the run is finalized. This ensures that even if your code crashes, the run is properly closed.

Logging Parameters

Parameters are the inputs to your experiment. Log them before training begins.

# Individual parameters
mlflow.log_param("learning_rate", 0.05)
mlflow.log_param("max_depth", 6)

# Batch logging (preferred for many parameters)
mlflow.log_params({
    "learning_rate": 0.05,
    "max_depth": 6,
    "n_estimators": 2000,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "early_stopping_rounds": 50,
    "random_state": 42,
})

Best Practice --- Log everything that could affect the result. Not just model hyperparameters, but preprocessing choices (scaler_type, imputation_strategy), data splits (test_size, stratify), and environmental details (python_version, sklearn_version). If you cannot reproduce a run, it is almost always because you forgot to log something you thought was constant.
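One way to make that habit cheap is a small helper that collects environment details for mlflow.log_params(). This is a sketch under my own naming --- environment_params is not an MLflow API, and the package list is illustrative:

```python
import platform
from importlib import metadata

def environment_params(packages=("scikit-learn", "xgboost", "pandas")):
    """Environment details worth logging alongside every run's hyperparameters."""
    params = {"python_version": platform.python_version()}
    for pkg in packages:
        try:
            params[f"{pkg}_version"] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            params[f"{pkg}_version"] = "not-installed"
    return params

# Inside an active run:
# mlflow.log_params(environment_params())
```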

Logging Metrics

Metrics are the outputs of your experiment. Log them after evaluation.

# Individual metrics
mlflow.log_metric("val_auc", 0.8834)
mlflow.log_metric("val_f1", 0.8102)

# Batch logging
mlflow.log_metrics({
    "val_auc": 0.8834,
    "val_f1": 0.8102,
    "val_log_loss": 0.2917,
    "train_auc": 0.9512,
    "train_time_seconds": 14.3,
})

# Step-based metrics (for tracking over training iterations)
for epoch in range(100):
    train_loss = compute_train_loss(epoch)
    val_loss = compute_val_loss(epoch)
    mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_metric("val_loss", val_loss, step=epoch)

Step-based metrics are particularly useful for gradient boosting (logging loss at each boosting round) and neural networks (logging loss at each epoch). The MLflow UI plots these as line charts, making it easy to spot overfitting visually.

Logging Artifacts

Artifacts are files --- anything your run produces that you want to keep.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Save and log a confusion matrix plot
fig, ax = plt.subplots(figsize=(6, 5))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Retained", "Churned"])
disp.plot(ax=ax, cmap="Blues")
ax.set_title("StreamFlow Churn - Confusion Matrix")
fig.savefig("confusion_matrix.png", dpi=150, bbox_inches="tight")
plt.close()

mlflow.log_artifact("confusion_matrix.png")

# Log a directory of artifacts
mlflow.log_artifacts("feature_importance_plots/", artifact_path="plots")

# Log a CSV of predictions
predictions_df = pd.DataFrame({
    "y_true": y_test,
    "y_pred": y_pred,
    "y_proba": y_pred_proba,
})
predictions_df.to_csv("predictions.csv", index=False)
mlflow.log_artifact("predictions.csv")

Practical Tip --- Log the training data schema (column names and dtypes) as an artifact. When you try to reproduce a run six months later, the most common failure is not hyperparameters --- it is that the feature set has changed.
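A minimal way to act on that tip, assuming X_train is a pandas DataFrame (the schema_of helper and the filename are illustrative, not MLflow API):

```python
import json
import pandas as pd

def schema_of(df: pd.DataFrame) -> dict:
    """Column names mapped to dtype strings -- enough to diff feature sets later."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}

# Inside an active run:
# with open("train_schema.json", "w") as f:
#     json.dump(schema_of(X_train), f, indent=2)
# mlflow.log_artifact("train_schema.json")
```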


Building a Proper Experiment: The StreamFlow Pattern

Let us put it all together with a realistic example. StreamFlow's data science team is running a hyperparameter search for their churn prediction model. They want to compare multiple XGBoost configurations and keep a permanent record of every attempt.

import mlflow
import mlflow.xgboost
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, f1_score, log_loss, precision_score,
    recall_score, average_precision_score
)
import json
import time

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("streamflow-churn-v2")

# -------------------------------------------------------
# 1. DATA PREPARATION (simplified for demonstration)
# -------------------------------------------------------
np.random.seed(42)
n = 50000

streamflow = pd.DataFrame({
    'monthly_hours_watched': np.random.exponential(18, n).round(1),
    'sessions_last_30d': np.random.poisson(14, n),
    'avg_session_minutes': np.random.exponential(28, n).round(1),
    'unique_titles_watched': np.random.poisson(8, n),
    'content_completion_rate': np.random.beta(3, 2, n).round(3),
    'binge_sessions_30d': np.random.poisson(2, n),
    'hours_change_pct': np.random.normal(0, 30, n).round(1),
    'sessions_change_pct': np.random.normal(0, 25, n).round(1),
    'months_active': np.random.randint(1, 60, n),
    'plan_price': np.random.choice([9.99, 14.99, 24.99, 29.99], n,
                                    p=[0.35, 0.35, 0.20, 0.10]),
    'devices_used': np.random.randint(1, 6, n),
    'profiles_active': np.random.randint(1, 5, n),
    'payment_failures_6m': np.random.poisson(0.3, n),
    'support_tickets_90d': np.random.poisson(1.2, n),
    'negative_sentiment_tickets': np.random.poisson(0.3, n),
    'genre_diversity': np.random.uniform(0.1, 1.0, n).round(3),
})

# Generate realistic churn target
churn_score = (
    -0.025 * streamflow['months_active']
    - 0.03 * streamflow['monthly_hours_watched']
    + 0.12 * streamflow['support_tickets_90d']
    + 0.25 * streamflow['negative_sentiment_tickets']
    + 0.35 * streamflow['payment_failures_6m']
    - 0.02 * streamflow['sessions_last_30d']
    - 0.3 * streamflow['content_completion_rate']
    - 0.4 * streamflow['genre_diversity']
    + np.random.normal(0, 0.5, n)
)
churn_prob = 1 / (1 + np.exp(-churn_score))
streamflow['churned'] = (np.random.random(n) < churn_prob).astype(int)

X = streamflow.drop(columns=['churned'])
y = streamflow['churned']

# Three-way split: train, validation (early stopping), test (final eval)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, stratify=y_temp, random_state=42
)

print(f"Train: {len(X_train):,}  Val: {len(X_val):,}  Test: {len(X_test):,}")
print(f"Churn rate --- Train: {y_train.mean():.3f}  Val: {y_val.mean():.3f}  Test: {y_test.mean():.3f}")
Train: 35,000  Val: 7,500  Test: 7,500
Churn rate --- Train: 0.323  Val: 0.324  Test: 0.322
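The second test_size is 15% of the original data re-expressed as a fraction of the remaining 85% (0.15/0.85 ≈ 0.176). A quick check of the arithmetic:

```python
n = 50_000
test = round(n * 0.15)                  # 7,500 held out for final evaluation
remaining = n - test                    # 42,500 left for train + validation
val = round(remaining * (0.15 / 0.85))  # 7,500 -- still 15% of the ORIGINAL data
train = remaining - val                 # 35,000
print(train, val, test)
```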

Now the hyperparameter search, with every run tracked:

# -------------------------------------------------------
# 2. HYPERPARAMETER SEARCH WITH FULL TRACKING
# -------------------------------------------------------
configs = [
    {"learning_rate": 0.1,  "max_depth": 4, "subsample": 0.8,
     "colsample_bytree": 0.8, "reg_alpha": 0, "reg_lambda": 1},
    {"learning_rate": 0.1,  "max_depth": 6, "subsample": 0.8,
     "colsample_bytree": 0.8, "reg_alpha": 0, "reg_lambda": 1},
    {"learning_rate": 0.05, "max_depth": 6, "subsample": 0.8,
     "colsample_bytree": 0.8, "reg_alpha": 0.1, "reg_lambda": 1},
    {"learning_rate": 0.05, "max_depth": 6, "subsample": 0.85,
     "colsample_bytree": 0.7, "reg_alpha": 0.1, "reg_lambda": 1.5},
    {"learning_rate": 0.03, "max_depth": 7, "subsample": 0.8,
     "colsample_bytree": 0.8, "reg_alpha": 0.05, "reg_lambda": 1.2},
    {"learning_rate": 0.03, "max_depth": 5, "subsample": 0.9,
     "colsample_bytree": 0.75, "reg_alpha": 0, "reg_lambda": 2.0},
]

results = []

for i, config in enumerate(configs):
    run_name = f"xgb-search-{i+1:02d}-lr{config['learning_rate']}-d{config['max_depth']}"

    with mlflow.start_run(run_name=run_name):
        # Log all parameters
        full_params = {
            **config,
            "n_estimators": 3000,
            "early_stopping_rounds": 50,
            "eval_metric": "logloss",
            "random_state": 42,
            "scale_pos_weight": (1 - y_train.mean()) / y_train.mean(),
        }
        mlflow.log_params(full_params)

        # Log data metadata
        mlflow.set_tag("data_version", "streamflow-v2-2024-03")
        mlflow.set_tag("train_rows", str(len(X_train)))
        mlflow.set_tag("feature_count", str(X_train.shape[1]))
        mlflow.set_tag("target_rate", f"{y_train.mean():.3f}")
        mlflow.set_tag("search_index", str(i + 1))

        # Train with early stopping
        start_time = time.time()
        model = xgb.XGBClassifier(
            n_estimators=3000,
            early_stopping_rounds=50,
            eval_metric="logloss",
            random_state=42,
            n_jobs=-1,
            scale_pos_weight=full_params["scale_pos_weight"],
            **config,
        )
        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False,
        )
        train_time = time.time() - start_time

        # Evaluate on validation set
        y_val_proba = model.predict_proba(X_val)[:, 1]
        y_val_pred = model.predict(X_val)

        val_metrics = {
            "val_auc": roc_auc_score(y_val, y_val_proba),
            "val_f1": f1_score(y_val, y_val_pred),
            "val_precision": precision_score(y_val, y_val_pred),
            "val_recall": recall_score(y_val, y_val_pred),
            "val_avg_precision": average_precision_score(y_val, y_val_proba),
            "val_log_loss": log_loss(y_val, y_val_proba),
            "best_iteration": model.best_iteration,
            "train_time_seconds": round(train_time, 2),
        }
        mlflow.log_metrics(val_metrics)

        # Log the model
        mlflow.xgboost.log_model(model, artifact_path="model")

        # Log feature importance as artifact
        importance = pd.DataFrame({
            "feature": X_train.columns,
            "importance": model.feature_importances_,
        }).sort_values("importance", ascending=False)
        importance.to_csv("feature_importance.csv", index=False)
        mlflow.log_artifact("feature_importance.csv")

        results.append({
            "run_name": run_name,
            "val_auc": val_metrics["val_auc"],
            "val_f1": val_metrics["val_f1"],
            "best_iteration": val_metrics["best_iteration"],
            "train_time": val_metrics["train_time_seconds"],
        })

        print(f"  {run_name}: AUC={val_metrics['val_auc']:.4f}  "
              f"F1={val_metrics['val_f1']:.4f}  "
              f"Trees={model.best_iteration}  "
              f"Time={train_time:.1f}s")

# Summary table
print("\n" + "=" * 70)
results_df = pd.DataFrame(results).sort_values("val_auc", ascending=False)
print(results_df.to_string(index=False))
  xgb-search-01-lr0.1-d4: AUC=0.8791  F1=0.7384  Trees=187  Time=2.1s
  xgb-search-02-lr0.1-d6: AUC=0.8823  F1=0.7426  Trees=142  Time=2.4s
  xgb-search-03-lr0.05-d6: AUC=0.8847  F1=0.7461  Trees=289  Time=3.8s
  xgb-search-04-lr0.05-d6: AUC=0.8851  F1=0.7470  Trees=301  Time=3.6s
  xgb-search-05-lr0.03-d7: AUC=0.8862  F1=0.7489  Trees=517  Time=6.2s
  xgb-search-06-lr0.03-d5: AUC=0.8839  F1=0.7445  Trees=483  Time=5.1s

======================================================================
                    run_name  val_auc  val_f1  best_iteration  train_time
  xgb-search-05-lr0.03-d7   0.8862  0.7489             517         6.2
  xgb-search-04-lr0.05-d6   0.8851  0.7470             301         3.6
  xgb-search-03-lr0.05-d6   0.8847  0.7461             289         3.8
  xgb-search-06-lr0.03-d5   0.8839  0.7445             483         5.1
  xgb-search-02-lr0.1-d6    0.8823  0.7426             142         2.4
  xgb-search-01-lr0.1-d4    0.8791  0.7384             187         2.1

Every run is now permanently recorded in MLflow. Open the UI, select all six runs, and click "Compare." You will see a parallel coordinates plot, a scatter matrix, and a sortable metrics table. No spreadsheet required.

Theme: Reproducibility --- Notice that we logged the random_state, the data_version tag, the exact feature count, and the target rate. Six months from now, if someone asks "What produced the 0.8862 AUC model?", you can pull up run xgb-search-05 and see every input and output. That is the difference between a tracked experiment and a lucky guess.


Comparing Experiments in the MLflow UI

The MLflow UI is where experiment tracking becomes experiment management. Here is what you can do.

The Runs Table

The default view shows all runs in a sortable, filterable table. Click any column header to sort. The most common workflow: sort by val_auc descending, look at the top three runs, and examine their parameter differences.

Parallel Coordinates Plot

Select multiple runs and click "Compare." The parallel coordinates plot shows each parameter as a vertical axis, with lines connecting parameter values across runs. This visualization answers the question: "Which hyperparameters separate my best runs from my worst runs?"

If every good run has max_depth between 5 and 7, and every bad run has max_depth of 3 or 12, that axis will show a clear pattern. This is more informative than looking at individual hyperparameters in isolation.
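The same question can be answered in code; a sketch with pandas on a hand-made runs table (the values are invented for illustration, though the params./metrics. column naming follows the DataFrame that mlflow.search_runs() returns):

```python
import pandas as pd

# Illustrative runs table -- values invented to mirror the UI question.
runs = pd.DataFrame({
    "params.max_depth": [3, 5, 6, 7, 12],
    "metrics.val_auc":  [0.861, 0.884, 0.885, 0.886, 0.858],
})

# Which max_depth values do the best runs share?
top = runs.nlargest(3, "metrics.val_auc")
print(sorted(top["params.max_depth"]))  # mid-range depths dominate
```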

Metric History

For runs that log step-based metrics (loss at each boosting round, for example), the UI plots metric curves. Overlaying the validation loss curves of multiple runs on the same chart immediately reveals which runs overfitted, which converged slowly, and which found the sweet spot.

Search and Filter

The MLflow UI supports a search syntax for filtering runs:

# Find runs with AUC above 0.88
metrics.val_auc > 0.88

# Find runs with specific learning rate
params.learning_rate = "0.05"

# Combine conditions
metrics.val_auc > 0.88 AND params.max_depth = "6"

# Filter by tag
tags.data_version = "streamflow-v2-2024-03"

Programmatic Access

Everything in the UI is also available via the Python API:

import mlflow

client = mlflow.tracking.MlflowClient()

# Get the experiment
experiment = client.get_experiment_by_name("streamflow-churn-v2")

# Search for the best run
best_runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.val_auc > 0.88",
    order_by=["metrics.val_auc DESC"],
    max_results=5,
)

print("Top 5 runs by AUC:")
for run in best_runs:
    print(f"  Run {run.info.run_id[:8]}  "
          f"AUC={run.data.metrics['val_auc']:.4f}  "
          f"LR={run.data.params['learning_rate']}  "
          f"Depth={run.data.params['max_depth']}")
Top 5 runs by AUC:
  Run a8b3c2d1  AUC=0.8862  LR=0.03  Depth=7
  Run f4e7d6c5  AUC=0.8851  LR=0.05  Depth=6
  Run b2c4e6a8  AUC=0.8847  LR=0.05  Depth=6
  Run d1e3f5a7  AUC=0.8839  LR=0.03  Depth=5
  Run c5d7e9f1  AUC=0.8823  LR=0.1   Depth=6

The MLflow Model Registry

Tracking experiments is necessary but not sufficient. Once you have identified a good model, you need a way to manage it through its lifecycle: from development to staging to production to retirement. That is what the Model Registry does.

Registering a Model

import mlflow

# Option 1: Register during logging
with mlflow.start_run(run_name="streamflow-churn-best"):
    # ... train and evaluate ...
    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        registered_model_name="streamflow-churn-predictor",
    )

# Option 2: Register an existing run's model after the fact
result = mlflow.register_model(
    model_uri="runs:/a8b3c2d1e4f5a6b7c8d9e0f1a2b3c4d5/model",
    name="streamflow-churn-predictor",
)
print(f"Registered model version: {result.version}")
Registered model version: 1

Model Versions and Stages

Every time you register a model with the same name, the version number increments. You can then assign stages:

client = mlflow.tracking.MlflowClient()

# Transition version 1 to Staging
client.transition_model_version_stage(
    name="streamflow-churn-predictor",
    version=1,
    stage="Staging",
)

# After validation, promote to Production
client.transition_model_version_stage(
    name="streamflow-churn-predictor",
    version=1,
    stage="Production",
)

# When a new model is ready, archive the old one
client.transition_model_version_stage(
    name="streamflow-churn-predictor",
    version=1,
    stage="Archived",
)
client.transition_model_version_stage(
    name="streamflow-churn-predictor",
    version=2,
    stage="Production",
)

Note --- In MLflow 2.9+, the transition_model_version_stage API is deprecated in favor of the new aliases system. Aliases are more flexible: instead of fixed stages, you assign arbitrary aliases like "champion" and "challenger" to model versions. The pattern below shows the modern approach:

# Modern MLflow: use aliases instead of stages
client.set_registered_model_alias(
    name="streamflow-churn-predictor",
    alias="champion",
    version=2,
)

# Load the production model by alias
champion_model = mlflow.pyfunc.load_model(
    model_uri="models:/streamflow-churn-predictor@champion"
)
predictions = champion_model.predict(X_test)

Loading a Registered Model for Inference

import mlflow.pyfunc

# Load by version number
model_v1 = mlflow.pyfunc.load_model(
    model_uri="models:/streamflow-churn-predictor/1"
)

# Load by alias (recommended)
champion = mlflow.pyfunc.load_model(
    model_uri="models:/streamflow-churn-predictor@champion"
)

# Load latest version (not recommended for production)
latest = mlflow.pyfunc.load_model(
    model_uri="models:/streamflow-churn-predictor/latest"
)

# Make predictions
new_data = X_test.iloc[:5]
predictions = champion.predict(new_data)
print(predictions)

The Model Registry gives you model lineage: for any model in production, you can trace back to the exact run that produced it, the exact hyperparameters, the exact training data version, and the exact metrics. That is reproducibility that survives team turnover.


MLflow Projects: Reproducible Packaging

MLflow Projects go beyond tracking individual runs --- they package the entire codebase into a reproducible unit. An MLflow Project is a directory with an MLproject file that specifies the environment and entry points.

streamflow-churn/
  MLproject
  conda.yaml
  train.py
  evaluate.py
  data/
    features.csv

The MLproject file:

name: streamflow-churn

conda_env: conda.yaml

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.05}
      max_depth: {type: int, default: 6}
      data_path: {type: str, default: "data/features.csv"}
    command: "python train.py --learning-rate {learning_rate} --max-depth {max_depth} --data-path {data_path}"

  evaluate:
    parameters:
      model_uri: {type: str}
      data_path: {type: str, default: "data/features.csv"}
    command: "python evaluate.py --model-uri {model_uri} --data-path {data_path}"

The conda.yaml file pins every dependency:

name: streamflow-churn
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - scikit-learn=1.4.0
  - xgboost=2.0.3
  - pandas=2.2.0
  - numpy=1.26.3
  - mlflow=2.10.0
  - pip:
    - shap==0.44.0

Run the project from the command line:

mlflow run streamflow-churn/ -P learning_rate=0.03 -P max_depth=7

Or from Python:

mlflow.projects.run(
    uri="streamflow-churn/",
    parameters={"learning_rate": 0.03, "max_depth": 7},
)

MLflow creates a fresh Conda environment, installs the pinned dependencies, and executes the training script. The result is a run that anyone can reproduce on any machine. No "it works on my laptop" conversations.


Weights & Biases: The SaaS Alternative

Weights & Biases (W&B, pronounced "weights and biases," not "wand-b") is a commercial experiment tracking platform. It does the same thing as MLflow Tracking --- logs parameters, metrics, artifacts, and models --- but with a different philosophy: everything is in the cloud, and the UI is exceptional.

Setup

pip install wandb
wandb login  # Authenticate with your API key from wandb.ai

The W&B Equivalent of Our MLflow Example

import wandb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, f1_score, log_loss

# Initialize a run
wandb.init(
    project="streamflow-churn",
    name="xgb-search-05-lr0.03-d7",
    config={
        "learning_rate": 0.03,
        "max_depth": 7,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "n_estimators": 3000,
        "early_stopping_rounds": 50,
        "random_state": 42,
        "data_version": "streamflow-v2-2024-03",
    },
)

# Train (using wandb.config for hyperparameters)
model = xgb.XGBClassifier(
    learning_rate=wandb.config.learning_rate,
    max_depth=wandb.config.max_depth,
    subsample=wandb.config.subsample,
    colsample_bytree=wandb.config.colsample_bytree,
    n_estimators=wandb.config.n_estimators,
    early_stopping_rounds=wandb.config.early_stopping_rounds,
    eval_metric="logloss",
    random_state=wandb.config.random_state,
    n_jobs=-1,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Evaluate and log metrics
y_val_proba = model.predict_proba(X_val)[:, 1]
y_val_pred = model.predict(X_val)

wandb.log({
    "val_auc": roc_auc_score(y_val, y_val_proba),
    "val_f1": f1_score(y_val, y_val_pred),
    "val_log_loss": log_loss(y_val, y_val_proba),
    "best_iteration": model.best_iteration,
})

# Log an artifact (model file)
artifact = wandb.Artifact("streamflow-churn-model", type="model")
model.save_model("model.json")
artifact.add_file("model.json")
wandb.log_artifact(artifact)

wandb.finish()

The code is structurally similar to MLflow. The difference becomes apparent when you open the W&B dashboard.

What W&B Does Better Than MLflow

1. Real-time collaboration. W&B is cloud-native. Your entire team sees every run the moment it finishes. No server setup, no shared database configuration, no "did you push to the tracking server?" MLflow requires you to set up and maintain a tracking server for team access.

2. Interactive visualizations. The W&B dashboard is genuinely excellent. Drag-and-drop chart builder, cross-run comparison with hover details, automatic parallel coordinates plots, custom scatter plots with configurable axes. MLflow's UI is functional. W&B's UI is a pleasure to use.

3. Sweeps (hyperparameter search). W&B has a built-in sweep agent that coordinates distributed hyperparameter searches. Define a search space in YAML, launch agents on multiple machines, and W&B handles the coordination. MLflow defers hyperparameter search to external tools (Optuna, Hyperopt).

# W&B Sweep configuration example
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.15},
        "max_depth": {"values": [4, 5, 6, 7, 8]},
        "subsample": {"min": 0.7, "max": 0.95},
        "colsample_bytree": {"min": 0.6, "max": 0.9},
    },
}

sweep_id = wandb.sweep(sweep_config, project="streamflow-churn")
wandb.agent(sweep_id, function=train_function, count=50)

4. System metrics. W&B automatically logs GPU utilization, CPU usage, memory consumption, and network I/O during training. This is invaluable for debugging performance bottlenecks. MLflow does not log system metrics by default (though the mlflow.system_metrics module was added in MLflow 2.8).

5. Reports. W&B Reports let you embed live charts, runs, and comparisons into a Markdown document that updates as new experiments are added. This is useful for stakeholder communication --- share a link to a report instead of exporting screenshots.

What MLflow Does Better Than W&B

1. Self-hosted and free. MLflow is Apache 2.0 licensed. You run it on your own infrastructure, your data never leaves your network, and there is no per-user fee. For regulated industries (healthcare, finance), this is not a preference --- it is a requirement.

2. Model Registry. MLflow's Model Registry is more mature and deeply integrated. Model versioning, stage transitions, aliases, and the ability to load a registered model with a single mlflow.pyfunc.load_model() call make the path from experiment to production smoother. W&B has a model registry (launched 2023), but it is younger and less adopted.

3. MLflow Models format. The mlflow.pyfunc format wraps any model (sklearn, XGBoost, PyTorch, a custom Python function) in a standardized interface with predict(). This makes deployment tool-agnostic: any system that knows how to serve an MLflow Model can serve your model, regardless of the framework that produced it.

4. MLflow Projects. Reproducible packaging with pinned environments. W&B has no equivalent.

5. No vendor lock-in. MLflow is open source. If Databricks disappeared tomorrow, MLflow would continue to exist. W&B is a venture-backed SaaS company. Your experiment data lives on their servers (or on your own servers with W&B Server, their self-hosted offering, which requires an enterprise license).

The Honest Comparison Table

| Criterion | MLflow | W&B |
| --- | --- | --- |
| Cost | Free (open source) | Free tier (100 GB); Teams $50/user/mo; Enterprise custom |
| Hosting | Self-hosted (you manage) | Cloud SaaS (they manage); self-hosted enterprise option |
| UI quality | Functional, improving | Excellent, best-in-class |
| Setup effort | Moderate (server, DB, artifact store) | Minimal (pip install, login) |
| Experiment tracking | Excellent | Excellent |
| Model Registry | Mature, deeply integrated | Newer, growing |
| Hyperparameter sweeps | External tools (Optuna) | Built-in sweeps |
| Reproducibility | MLflow Projects (environments) | Limited (logs config, not environment) |
| Data privacy | Full control (self-hosted) | Data on W&B servers (cloud) or self-hosted enterprise |
| Ecosystem integration | Databricks, AWS SageMaker, Azure ML | Integrations with major frameworks |
| Community | Large, enterprise-heavy | Large, research-heavy |

When to Use Which

Use MLflow if:

- You are in a regulated industry (healthcare, finance, government)
- Your organization requires data to stay on-premises
- You need a mature Model Registry integrated with your deployment pipeline
- Cost matters (MLflow is free; W&B is not for teams)

Use W&B if:

- You are a small team that wants to get started in five minutes
- The UI and collaboration features justify the cost
- You run many hyperparameter sweeps and want built-in coordination
- You value real-time team visibility over infrastructure control

Use both if:

- You use W&B for experiment exploration and visualization during development
- You use MLflow for the Model Registry and production deployment pipeline

This split is more common than you might expect.


Autologging: The Low-Effort Starting Point

MLflow supports autologging for most popular ML frameworks. With a single line of code, MLflow automatically logs parameters, metrics, and the trained model --- no manual log_param() calls required.

import mlflow

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Enable autologging for XGBoost
mlflow.xgboost.autolog()

# Enable autologging for all supported frameworks
mlflow.autolog()

With autologging enabled, a simple training script becomes fully tracked:

import mlflow
import xgboost as xgb
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("streamflow-churn-v2")
mlflow.xgboost.autolog()

# This will automatically log all XGBoost parameters,
# the best iteration, training metrics, and the model artifact.
model = xgb.XGBClassifier(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=2000,
    early_stopping_rounds=50,
    eval_metric="logloss",
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

Practical Tip --- Autologging is a good starting point but not a complete solution. It logs model hyperparameters and training metrics, but it does not log your data version, your feature engineering choices, your evaluation on the test set, or your custom artifacts. Use autologging as a floor (it catches what you forget) and manual logging as the ceiling (it captures what matters).


Structuring Experiments for a Team

When a team shares an MLflow server, naming and organization become critical. Here are the patterns that work.

Naming Conventions

Experiment name:  {project}-{model-type}-{version}
                  streamflow-churn-xgboost-v2
                  streamflow-churn-lightgbm-v3

Run name:         {model}-{search-index}-{key-params}
                  xgb-042-lr0.03-d7-ss0.8
                  lgbm-017-lr0.05-nl128

Tags:             author=caleb
                  data_version=2024-03-15
                  purpose=hyperparameter_search | final_eval | ablation
                  pipeline_version=v2.1
                  git_commit=a3f7c2e

The Git Commit Tag

This one practice prevents more headaches than any other. Log the git commit hash of the code that produced the run:

import subprocess

def get_git_hash():
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode("utf-8").strip()
    except Exception:
        return "unknown"

with mlflow.start_run():
    mlflow.set_tag("git_commit", get_git_hash())
    # ... rest of training ...

Now every run is linked to the exact code that produced it. When you need to reproduce a run from eight months ago, you check out the commit and run the script. No archaeology required.

Organizing Large Experiments

For a systematic hyperparameter search, use parent-child runs:

with mlflow.start_run(run_name="hp-search-2024-03-25") as parent_run:
    mlflow.set_tag("purpose", "hyperparameter_search")
    mlflow.set_tag("search_method", "grid")
    mlflow.set_tag("total_configs", str(len(configs)))

    best_auc = 0
    best_run_id = None

    for i, config in enumerate(configs):
        with mlflow.start_run(
            run_name=f"config-{i+1:03d}",
            nested=True,
        ) as child_run:
            mlflow.log_params(config)
            # ... train and evaluate ...
            mlflow.log_metrics(metrics)

            if metrics["val_auc"] > best_auc:
                best_auc = metrics["val_auc"]
                best_run_id = child_run.info.run_id

    # Log the best result on the parent run
    mlflow.log_metric("best_val_auc", best_auc)
    mlflow.set_tag("best_child_run_id", best_run_id)

In the MLflow UI, the parent run appears as a summary, and expanding it reveals all child runs. This keeps the experiment list clean while preserving full detail.


Common Pitfalls and How to Avoid Them

Pitfall 1: Forgetting to Log the Data Version

The most common failure mode is not hyperparameters or code --- it is data. Your features changed, a column was renamed, the data was re-extracted with different filters, and now the "best model" from last month produces different results. Always log a data version tag, and ideally log a hash of the training data:

import hashlib

import pandas as pd  # X_train is assumed to be a pandas DataFrame

# Hash the row-wise content of the training data; the first 12 hex
# characters are plenty to detect a changed extract.
data_hash = hashlib.sha256(
    pd.util.hash_pandas_object(X_train).values.tobytes()
).hexdigest()[:12]

mlflow.set_tag("data_sha256", data_hash)

Pitfall 2: Using the Test Set for Early Stopping

This was covered in Chapter 14, but it bears repeating in the tracking context. If your early stopping uses the test set, your logged "test AUC" is biased. The three-way split (train, validation, test) is non-negotiable.
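The three-way split can be produced with two chained `train_test_split` calls. A minimal sketch, with synthetic data and illustrative 60/20/20 ratios:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off the untouched test set, then split the remainder into
# train and validation. Early stopping may only ever see (X_val, y_val).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp  # 0.25 * 0.8 = 0.2
)
```

The test set is touched exactly once, for the final logged evaluation.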

Pitfall 3: Logging Too Late

Log parameters at the start of the run, not the end. If your training script crashes at minute 45, you want the parameters on record so you know what configuration failed. Metrics can be logged incrementally.

Pitfall 4: Overwriting Runs

Never reuse a run ID. Every execution of your training script should create a new run. This is the default behavior with mlflow.start_run(), but some teams try to "update" existing runs to save space. Do not do this. Disk space is cheap. Lost experiment history is not.

Pitfall 5: Ignoring the Artifact Store Size

Model artifacts accumulate. A single XGBoost model might be 10 MB, but after 500 experiments, that is 5 GB. For deep learning models, multiply by 100. Set up artifact retention policies: archive experiments older than 6 months, delete failed runs after 30 days.
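A retention job for failed runs might look like the sketch below. `purge_failed_runs` is a hypothetical helper; it works with any client exposing the `MlflowClient` search/delete methods (in practice, `client = mlflow.MlflowClient()`):

```python
def purge_failed_runs(client, experiment_id, dry_run=True):
    """Return the IDs of FAILED runs in an experiment, deleting them unless dry_run."""
    failed = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="attributes.status = 'FAILED'",  # MLflow search syntax
    )
    for run in failed:
        if not dry_run:
            client.delete_run(run.info.run_id)
    return [run.info.run_id for run in failed]
```

Note that `delete_run` only marks a run as deleted; reclaiming the disk space of its artifacts requires the `mlflow gc` command.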


Production MLflow: Beyond Local Development

For team and production use, MLflow needs infrastructure beyond the default mlruns/ directory.

Backend Store Options

| Backend | Use Case |
| --- | --- |
| Local filesystem | Single-user development |
| SQLite | Single-user, persistent tracking |
| PostgreSQL | Team use, production |
| MySQL | Team use, production |

Artifact Store Options

| Store | Use Case |
| --- | --- |
| Local filesystem | Development |
| S3 (or S3-compatible) | Production, team access |
| Azure Blob Storage | Azure-based teams |
| Google Cloud Storage | GCP-based teams |

Minimal Production Setup

# Start MLflow with PostgreSQL backend and S3 artifact store
mlflow server \
    --backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
    --default-artifact-root s3://mlflow-artifacts/experiments \
    --host 0.0.0.0 \
    --port 5000

This configuration gives you persistent tracking data in a real database, team-accessible artifact storage in S3, and a UI available to anyone on the network.


Putting It All Together: The Experiment Tracking Workflow

Here is the workflow that every team should follow.

1. Before you write any training code, set up the tracking server and create your experiment. Agree on naming conventions.

2. Every training script should start with mlflow.set_tracking_uri() and mlflow.set_experiment(). Non-negotiable.

3. Every run should log: all hyperparameters, the data version, the git commit, all evaluation metrics, and the trained model artifact.

4. After a hyperparameter search, use the UI to compare runs, identify the best configuration, and verify that the result is not an artifact of noise.

5. When a model is ready for deployment, register it in the Model Registry and assign it an alias. The deployment pipeline loads the model by alias, not by run ID.

6. When a new model is trained, register it as a new version, test it in staging, and promote it to production by reassigning the alias. The old model version remains in the registry for rollback.
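Steps 5 and 6 can be sketched with the alias API. This is a sketch under assumptions: `promote_to_production` is a hypothetical helper, the model name and version are invented, and `client` is expected to be an `mlflow.MlflowClient` (the alias call is the MLflow 2.x `set_registered_model_alias`):

```python
def promote_to_production(client, model_name, version, alias="production"):
    """Repoint the deployment alias at a new version; old versions stay for rollback."""
    client.set_registered_model_alias(model_name, alias, version)
    # The deployment pipeline then loads by alias, never by run ID:
    #   model = mlflow.pyfunc.load_model(f"models:/{model_name}@{alias}")
    return f"models:/{model_name}@{alias}"
```

Rollback is the same call with the previous version number; nothing in the deployment pipeline changes.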

This is the experiment-tracking layer of MLOps. It sits between your notebook and your deployment pipeline, and it ensures that every decision --- every hyperparameter choice, every data version, every model artifact --- is recorded, comparable, and reproducible.

Theme: Real World ≠ Kaggle --- On Kaggle, you submit a CSV and get a score. In the real world, you submit a model and get the question: "Can you reproduce this? Can you explain why this model is better than the one we deployed last quarter? Can you trace the lineage from training data to production prediction?" Experiment tracking is how you answer "yes" to all three.


Summary

Experiment tracking is not a nice-to-have --- it is infrastructure. MLflow is the open-source standard: free, self-hosted, and integrated with the deployment ecosystem through the Model Registry and the pyfunc model format. Weights & Biases offers a superior UI and built-in sweep coordination, at the cost of SaaS dependency and per-user pricing.

The minimum viable experiment tracking setup is five lines of code: set the tracking URI, set the experiment name, start a run, log parameters, and log metrics. The maximum --- with artifacts, model registration, git commit tags, data versioning, and nested runs --- is the kind of infrastructure that lets a team of twenty data scientists iterate on the same model without losing their minds.

Start with autologging. Graduate to manual logging when autologging misses something important. Adopt the Model Registry when you have more than one model in production. And never, under any circumstances, go back to the spreadsheet.


Next chapter: Chapter 31: Model Deployment --- wrapping your tracked, registered model in a FastAPI endpoint and deploying it with Docker.