In This Chapter
- MLflow, Weights & Biases, and Reproducible Research
- The Spreadsheet of Shame
- What Experiment Tracking Actually Tracks
- MLflow: The Open-Source Standard
- Building a Proper Experiment: The StreamFlow Pattern
- Comparing Experiments in the MLflow UI
- The MLflow Model Registry
- MLflow Projects: Reproducible Packaging
- Weights & Biases: The SaaS Alternative
- Autologging: The Low-Effort Starting Point
- Structuring Experiments for a Team
- Common Pitfalls and How to Avoid Them
- Production MLflow: Beyond Local Development
- Putting It All Together: The Experiment Tracking Workflow
- Summary
Chapter 30: ML Experiment Tracking
MLflow, Weights & Biases, and Reproducible Research
Learning Objectives
By the end of this chapter, you will be able to:
- Set up MLflow for experiment tracking, model versioning, and artifact storage
- Log parameters, metrics, and artifacts systematically
- Compare experiments in the MLflow UI
- Register models in the MLflow Model Registry
- Understand W&B as an alternative and compare the two platforms
The Spreadsheet of Shame
War Story --- A senior data scientist at an insurance company was asked to reproduce the model that had been deployed to production eight months earlier. She opened the team's shared Google Sheet --- the one labeled "Experiment Log" --- and found 347 rows of hand-entered hyperparameters, validation scores, and notes like "tried lower LR, seemed better" and "same as row 212 but with new features." There was no record of which preprocessing pipeline produced the training data. No record of the random seed. No record of the exact feature set. The model artifact was a pickle file named
`model_final_v3_FINAL_USE_THIS.pkl` sitting in a shared drive alongside `model_final_v4.pkl` and `model_ACTUALLY_final.pkl`. She spent three weeks trying to reproduce the results. She never succeeded.
If you cannot tell me what hyperparameters produced your best model, you do not have a best model --- you have a lucky guess.
This is not a rare story. It is the default state of most data science teams. Every team starts with good intentions: "We will track everything in a spreadsheet." Within a month, the spreadsheet is incomplete. Within three months, it is unreliable. Within six months, it is fiction. The problem is not laziness. The problem is that manual tracking is fundamentally incompatible with the iterative, high-throughput nature of ML experimentation. A single hyperparameter search can produce hundreds of runs. No human is going to manually log the learning rate, max depth, subsample ratio, column sample rate, regularization strength, number of estimators, early stopping round, and six evaluation metrics for each of those runs. So they don't. And six months later, nobody can reproduce anything.
Experiment tracking tools solve this problem by automating the logging. You add a few lines of code to your training script, and the tool records every parameter, every metric, every artifact, every environment detail --- automatically, consistently, and permanently.
This chapter covers two tools: MLflow (the open-source standard) and Weights & Biases (the best-in-class SaaS alternative). We will go deep on MLflow because it is free, self-hosted, and the tool you are most likely to encounter in production environments. We will give W&B a thorough and honest comparison because it has a genuinely better user experience for certain workflows.
What Experiment Tracking Actually Tracks
Before we touch any tool, let us define the vocabulary precisely.
| Term | Definition |
|---|---|
| Experiment | A named collection of runs that share a common objective. "StreamFlow churn model v2" is an experiment. |
| Run | A single execution of a training script. One set of hyperparameters, one training pass, one set of results. |
| Parameter | An input to the run: learning_rate=0.05, max_depth=6, n_estimators=2000. |
| Metric | An output measurement: val_auc=0.8834, val_log_loss=0.2917, train_time_seconds=14.3. |
| Artifact | A file produced by the run: the trained model, a feature importance plot, the preprocessed dataset, a confusion matrix image. |
| Tag | Metadata attached to a run: author=caleb, data_version=2024-03-15, purpose=hyperparameter_search. |
| Model Registry | A versioned catalog of production-ready models, with stage labels like "Staging" and "Production." |
A good experiment tracking system captures all of this for every run, automatically, so that any run can be reproduced, compared, or promoted to production months later.
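To make the vocabulary concrete, here is a minimal, library-free sketch of what one tracked run's record might hold. The field names are illustrative, not MLflow's internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Illustrative record of one tracked run (not MLflow's actual schema)."""
    run_name: str
    params: dict = field(default_factory=dict)     # inputs: hyperparameters, split sizes
    metrics: dict = field(default_factory=dict)    # outputs: AUC, log loss, timings
    tags: dict = field(default_factory=dict)       # metadata: author, data version
    artifacts: list = field(default_factory=list)  # file paths: model, plots, data

run = RunRecord(
    run_name="xgb-baseline",
    params={"learning_rate": 0.05, "max_depth": 6},
    metrics={"val_auc": 0.8834},
    tags={"author": "caleb", "data_version": "2024-03-15"},
    artifacts=["model/model.json", "confusion_matrix.png"],
)
print(run.params["learning_rate"])  # 0.05
```

Every tool in this chapter is, at its core, a durable, queryable store of records shaped like this one.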
MLflow: The Open-Source Standard
MLflow is an open-source platform for managing the ML lifecycle. It was created by Databricks in 2018 and has become the de facto standard for experiment tracking in production environments. It has four components:
- MLflow Tracking --- Logs parameters, metrics, and artifacts for each run
- MLflow Projects --- Packages ML code in a reusable, reproducible format
- MLflow Models --- Provides a standard format for packaging models for deployment
- MLflow Model Registry --- Manages model versions and deployment stages
We will focus primarily on Tracking and the Model Registry, because those are the components you will use every day.
Installation and Setup
pip install mlflow scikit-learn xgboost pandas numpy
MLflow stores tracking data in a backend store (parameters, metrics, tags) and an artifact store (models, plots, data). By default, both use the local filesystem --- a directory called mlruns/ in your working directory. For team use, you would configure a database backend (SQLite, PostgreSQL, MySQL) and a remote artifact store (S3, GCS, Azure Blob). We will start local and scale up.
# Start the MLflow tracking server (local, for development)
mlflow server --host 127.0.0.1 --port 5000
This starts the MLflow UI at http://127.0.0.1:5000. Leave this running in a terminal and open the URL in your browser. You will see an empty experiment list. Time to fill it.
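When you later scale up for a team, the same command takes explicit store locations. A sketch, assuming a local SQLite file and a hypothetical S3 bucket (swap in PostgreSQL and your own bucket for a real deployment):

```shell
# Hypothetical team setup: SQLite backend store, S3 artifact store
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root s3://my-mlflow-artifacts/   # bucket name is illustrative
```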
Your First Tracked Run
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, log_loss
# Point to the tracking server
mlflow.set_tracking_uri("http://127.0.0.1:5000")
# Create or get an experiment
mlflow.set_experiment("first-experiment")
# Generate sample data
X, y = make_classification(
n_samples=5000, n_features=15, n_informative=10,
n_redundant=3, flip_y=0.05, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# --- The tracked run ---
with mlflow.start_run(run_name="random-forest-baseline"):
# Define hyperparameters
params = {
"n_estimators": 200,
"max_depth": 10,
"min_samples_split": 5,
"random_state": 42,
}
# Log parameters
mlflow.log_params(params)
# Train
model = RandomForestClassifier(**params)
model.fit(X_train, y_train)
# Evaluate
y_pred_proba = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)
metrics = {
"val_auc": roc_auc_score(y_test, y_pred_proba),
"val_f1": f1_score(y_test, y_pred),
"val_log_loss": log_loss(y_test, y_pred_proba),
}
# Log metrics
mlflow.log_metrics(metrics)
# Log the model as an artifact
mlflow.sklearn.log_model(model, artifact_path="model")
# Log a tag for context
mlflow.set_tag("model_type", "RandomForest")
mlflow.set_tag("data_version", "v1-synthetic")
print(f"Run ID: {mlflow.active_run().info.run_id}")
print(f"AUC: {metrics['val_auc']:.4f}")
print(f"F1: {metrics['val_f1']:.4f}")
Run ID: a3f7c2e8d1b04a5f9e6c8d2a1b3f7e9c
AUC: 0.9412
F1: 0.8720
Open the MLflow UI, click on the "first-experiment" experiment, and you will see your run with all parameters, metrics, and the logged model artifact. That is the entire point: you never had to open a spreadsheet, type a number, or remember anything. It is all there.
Key Insight --- The `with mlflow.start_run()` context manager is the fundamental pattern. Everything inside the block is associated with a single run. When the block exits, the run is finalized. This ensures that even if your code crashes, the run is closed and marked as failed rather than left dangling.
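The crash-safety behavior is worth internalizing. Here is a library-free sketch of the pattern --- a toy stand-in for MLflow's run lifecycle, not its actual implementation:

```python
from contextlib import contextmanager

@contextmanager
def start_run(log):
    """Toy run context: always finalizes the run, recording success or failure."""
    log["status"] = "RUNNING"
    try:
        yield log
        log["status"] = "FINISHED"
    except Exception:
        log["status"] = "FAILED"  # the run is still closed, never left dangling
        raise

ok = {}
with start_run(ok):
    ok["val_auc"] = 0.88
print(ok["status"])  # FINISHED

crashed = {}
try:
    with start_run(crashed):
        raise RuntimeError("training blew up")
except RuntimeError:
    pass
print(crashed["status"])  # FAILED
```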
Logging Parameters
Parameters are the inputs to your experiment. Log them before training begins.
# Individual parameters
mlflow.log_param("learning_rate", 0.05)
mlflow.log_param("max_depth", 6)
# Batch logging (preferred for many parameters)
mlflow.log_params({
"learning_rate": 0.05,
"max_depth": 6,
"n_estimators": 2000,
"subsample": 0.8,
"colsample_bytree": 0.8,
"early_stopping_rounds": 50,
"random_state": 42,
})
Best Practice --- Log everything that could affect the result. Not just model hyperparameters, but preprocessing choices (`scaler_type`, `imputation_strategy`), data splits (`test_size`, `stratify`), and environmental details (`python_version`, `sklearn_version`). If you cannot reproduce a run, it is almost always because you forgot to log something you thought was constant.
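Environment details are easy to capture programmatically. A sketch that builds a loggable dict of interpreter and library versions --- the `mlflow.log_params` call is commented out since it needs an active run:

```python
import platform
import sys
from importlib.metadata import version, PackageNotFoundError

def environment_params(packages=("scikit-learn", "xgboost", "pandas")):
    """Collect interpreter and package versions as loggable parameters."""
    params = {
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }
    for pkg in packages:
        try:
            params[f"{pkg}_version"] = version(pkg)
        except PackageNotFoundError:
            params[f"{pkg}_version"] = "not-installed"
    return params

env = environment_params()
print(env["python_version"])
# mlflow.log_params(env)  # inside a run, alongside the model hyperparameters
```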
Logging Metrics
Metrics are the outputs of your experiment. Log them after evaluation.
# Individual metrics
mlflow.log_metric("val_auc", 0.8834)
mlflow.log_metric("val_f1", 0.8102)
# Batch logging
mlflow.log_metrics({
"val_auc": 0.8834,
"val_f1": 0.8102,
"val_log_loss": 0.2917,
"train_auc": 0.9512,
"train_time_seconds": 14.3,
})
# Step-based metrics (for tracking over training iterations)
for epoch in range(100):
train_loss = compute_train_loss(epoch)
val_loss = compute_val_loss(epoch)
mlflow.log_metric("train_loss", train_loss, step=epoch)
mlflow.log_metric("val_loss", val_loss, step=epoch)
Step-based metrics are particularly useful for gradient boosting (logging loss at each boosting round) and neural networks (logging loss at each epoch). The MLflow UI plots these as line charts, making it easy to spot overfitting visually.
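Once step-based metrics exist, the overfitting check the UI lets you do visually is also a one-liner over the history. A library-free sketch with a toy loss curve:

```python
# Toy validation-loss history: improves, then overfits after step 6
val_loss_history = [0.70, 0.55, 0.46, 0.41, 0.39, 0.38, 0.375, 0.38, 0.39, 0.41]

best_step = min(range(len(val_loss_history)), key=val_loss_history.__getitem__)
best_loss = val_loss_history[best_step]
overfitting = val_loss_history[-1] > best_loss  # loss rising after its minimum

print(f"best step={best_step} best val_loss={best_loss:.3f} overfitting={overfitting}")
# best step=6 best val_loss=0.375 overfitting=True
```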
Logging Artifacts
Artifacts are files --- anything your run produces that you want to keep.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Save and log a confusion matrix plot
fig, ax = plt.subplots(figsize=(6, 5))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Retained", "Churned"])
disp.plot(ax=ax, cmap="Blues")
ax.set_title("StreamFlow Churn - Confusion Matrix")
fig.savefig("confusion_matrix.png", dpi=150, bbox_inches="tight")
plt.close()
mlflow.log_artifact("confusion_matrix.png")
# Log a directory of artifacts
mlflow.log_artifacts("feature_importance_plots/", artifact_path="plots")
# Log a CSV of predictions
predictions_df = pd.DataFrame({
"y_true": y_test,
"y_pred": y_pred,
"y_proba": y_pred_proba,
})
predictions_df.to_csv("predictions.csv", index=False)
mlflow.log_artifact("predictions.csv")
Practical Tip --- Log the training data schema (column names and dtypes) as an artifact. When you try to reproduce a run six months later, the most common failure is not hyperparameters --- it is that the feature set has changed.
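Capturing the schema takes only a few lines. A sketch that dumps column names and dtypes to JSON, ready to be attached with `mlflow.log_artifact` (the call itself is commented out since it needs an active run):

```python
import json
import pandas as pd

df = pd.DataFrame({
    "monthly_hours_watched": [12.5, 30.1],
    "sessions_last_30d": [14, 9],
    "churned": [0, 1],
})

# Column name -> dtype string: a minimal reproducibility contract for the features
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
# mlflow.log_artifact("schema.json")  # inside an active run

print(schema["monthly_hours_watched"])  # float64
```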
Building a Proper Experiment: The StreamFlow Pattern
Let us put it all together with a realistic example. StreamFlow's data science team is running a hyperparameter search for their churn prediction model. They want to compare multiple XGBoost configurations and keep a permanent record of every attempt.
import mlflow
import mlflow.xgboost
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
roc_auc_score, f1_score, log_loss, precision_score,
recall_score, average_precision_score
)
import json
import time
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("streamflow-churn-v2")
# -------------------------------------------------------
# 1. DATA PREPARATION (simplified for demonstration)
# -------------------------------------------------------
np.random.seed(42)
n = 50000
streamflow = pd.DataFrame({
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'hours_change_pct': np.random.normal(0, 30, n).round(1),
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
'months_active': np.random.randint(1, 60, n),
'plan_price': np.random.choice([9.99, 14.99, 24.99, 29.99], n,
p=[0.35, 0.35, 0.20, 0.10]),
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
'payment_failures_6m': np.random.poisson(0.3, n),
'support_tickets_90d': np.random.poisson(1.2, n),
'negative_sentiment_tickets': np.random.poisson(0.3, n),
'genre_diversity': np.random.uniform(0.1, 1.0, n).round(3),
})
# Generate realistic churn target
churn_score = (
-0.025 * streamflow['months_active']
- 0.03 * streamflow['monthly_hours_watched']
+ 0.12 * streamflow['support_tickets_90d']
+ 0.25 * streamflow['negative_sentiment_tickets']
+ 0.35 * streamflow['payment_failures_6m']
- 0.02 * streamflow['sessions_last_30d']
- 0.3 * streamflow['content_completion_rate']
- 0.4 * streamflow['genre_diversity']
+ np.random.normal(0, 0.5, n)
)
churn_prob = 1 / (1 + np.exp(-churn_score))
streamflow['churned'] = (np.random.random(n) < churn_prob).astype(int)
X = streamflow.drop(columns=['churned'])
y = streamflow['churned']
# Three-way split: train, validation (early stopping), test (final eval)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.176, stratify=y_temp, random_state=42
)
print(f"Train: {len(X_train):,} Val: {len(X_val):,} Test: {len(X_test):,}")
print(f"Churn rate --- Train: {y_train.mean():.3f} Val: {y_val.mean():.3f} Test: {y_test.mean():.3f}")
Train: 35,000 Val: 7,500 Test: 7,500
Churn rate --- Train: 0.323 Val: 0.324 Test: 0.322
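The second split's `test_size=0.176` looks arbitrary but is derived: after holding out 15% for test, the validation fraction must be expressed relative to the remaining 85%. A quick sanity check of the arithmetic:

```python
n = 50_000
test_frac = 0.15          # fraction of the full dataset held out for test
val_frac_of_total = 0.15  # we also want ~15% of the total for validation

n_test = int(n * test_frac)  # 7,500 test rows
n_temp = n - n_test          # 42,500 rows left for train + validation
val_frac_of_temp = val_frac_of_total / (1 - test_frac)  # 0.15 / 0.85

print(round(val_frac_of_temp, 3))      # 0.176
print(round(n_temp * val_frac_of_temp))  # 7500 validation rows
```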
Now the hyperparameter search, with every run tracked:
# -------------------------------------------------------
# 2. HYPERPARAMETER SEARCH WITH FULL TRACKING
# -------------------------------------------------------
configs = [
{"learning_rate": 0.1, "max_depth": 4, "subsample": 0.8,
"colsample_bytree": 0.8, "reg_alpha": 0, "reg_lambda": 1},
{"learning_rate": 0.1, "max_depth": 6, "subsample": 0.8,
"colsample_bytree": 0.8, "reg_alpha": 0, "reg_lambda": 1},
{"learning_rate": 0.05, "max_depth": 6, "subsample": 0.8,
"colsample_bytree": 0.8, "reg_alpha": 0.1, "reg_lambda": 1},
{"learning_rate": 0.05, "max_depth": 6, "subsample": 0.85,
"colsample_bytree": 0.7, "reg_alpha": 0.1, "reg_lambda": 1.5},
{"learning_rate": 0.03, "max_depth": 7, "subsample": 0.8,
"colsample_bytree": 0.8, "reg_alpha": 0.05, "reg_lambda": 1.2},
{"learning_rate": 0.03, "max_depth": 5, "subsample": 0.9,
"colsample_bytree": 0.75, "reg_alpha": 0, "reg_lambda": 2.0},
]
results = []
for i, config in enumerate(configs):
run_name = f"xgb-search-{i+1:02d}-lr{config['learning_rate']}-d{config['max_depth']}"
with mlflow.start_run(run_name=run_name):
# Log all parameters
full_params = {
**config,
"n_estimators": 3000,
"early_stopping_rounds": 50,
"eval_metric": "logloss",
"random_state": 42,
"scale_pos_weight": (1 - y_train.mean()) / y_train.mean(),
}
mlflow.log_params(full_params)
# Log data metadata
mlflow.set_tag("data_version", "streamflow-v2-2024-03")
mlflow.set_tag("train_rows", str(len(X_train)))
mlflow.set_tag("feature_count", str(X_train.shape[1]))
mlflow.set_tag("target_rate", f"{y_train.mean():.3f}")
mlflow.set_tag("search_index", str(i + 1))
# Train with early stopping
start_time = time.time()
model = xgb.XGBClassifier(
n_estimators=3000,
early_stopping_rounds=50,
eval_metric="logloss",
random_state=42,
n_jobs=-1,
**config,
)
model.fit(
X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=False,
)
train_time = time.time() - start_time
# Evaluate on validation set
y_val_proba = model.predict_proba(X_val)[:, 1]
y_val_pred = model.predict(X_val)
val_metrics = {
"val_auc": roc_auc_score(y_val, y_val_proba),
"val_f1": f1_score(y_val, y_val_pred),
"val_precision": precision_score(y_val, y_val_pred),
"val_recall": recall_score(y_val, y_val_pred),
"val_avg_precision": average_precision_score(y_val, y_val_proba),
"val_log_loss": log_loss(y_val, y_val_proba),
"best_iteration": model.best_iteration,
"train_time_seconds": round(train_time, 2),
}
mlflow.log_metrics(val_metrics)
# Log the model
mlflow.xgboost.log_model(model, artifact_path="model")
# Log feature importance as artifact
importance = pd.DataFrame({
"feature": X_train.columns,
"importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
importance.to_csv("feature_importance.csv", index=False)
mlflow.log_artifact("feature_importance.csv")
results.append({
"run_name": run_name,
"val_auc": val_metrics["val_auc"],
"val_f1": val_metrics["val_f1"],
"best_iteration": val_metrics["best_iteration"],
"train_time": val_metrics["train_time_seconds"],
})
print(f" {run_name}: AUC={val_metrics['val_auc']:.4f} "
f"F1={val_metrics['val_f1']:.4f} "
f"Trees={model.best_iteration} "
f"Time={train_time:.1f}s")
# Summary table
print("\n" + "=" * 70)
results_df = pd.DataFrame(results).sort_values("val_auc", ascending=False)
print(results_df.to_string(index=False))
xgb-search-01-lr0.1-d4: AUC=0.8791 F1=0.7384 Trees=187 Time=2.1s
xgb-search-02-lr0.1-d6: AUC=0.8823 F1=0.7426 Trees=142 Time=2.4s
xgb-search-03-lr0.05-d6: AUC=0.8847 F1=0.7461 Trees=289 Time=3.8s
xgb-search-04-lr0.05-d6: AUC=0.8851 F1=0.7470 Trees=301 Time=3.6s
xgb-search-05-lr0.03-d7: AUC=0.8862 F1=0.7489 Trees=517 Time=6.2s
xgb-search-06-lr0.03-d5: AUC=0.8839 F1=0.7445 Trees=483 Time=5.1s
======================================================================
run_name val_auc val_f1 best_iteration train_time
xgb-search-05-lr0.03-d7 0.8862 0.7489 517 6.2
xgb-search-04-lr0.05-d6 0.8851 0.7470 301 3.6
xgb-search-03-lr0.05-d6 0.8847 0.7461 289 3.8
xgb-search-06-lr0.03-d5 0.8839 0.7445 483 5.1
xgb-search-02-lr0.1-d6 0.8823 0.7426 142 2.4
xgb-search-01-lr0.1-d4 0.8791 0.7384 187 2.1
Every run is now permanently recorded in MLflow. Open the UI, select all six runs, and click "Compare." You will see a parallel coordinates plot, a scatter matrix, and a sortable metrics table. No spreadsheet required.
Theme: Reproducibility --- Notice that we logged the `random_state`, the `data_version` tag, the exact feature count, and the target rate. Six months from now, if someone asks "What produced the 0.8862 AUC model?", you can pull up run `xgb-search-05` and see every input and output. That is the difference between a tracked experiment and a lucky guess.
Comparing Experiments in the MLflow UI
The MLflow UI is where experiment tracking becomes experiment management. Here is what you can do.
The Runs Table
The default view shows all runs in a sortable, filterable table. Click any column header to sort. The most common workflow: sort by val_auc descending, look at the top three runs, and examine their parameter differences.
Parallel Coordinates Plot
Select multiple runs and click "Compare." The parallel coordinates plot shows each parameter as a vertical axis, with lines connecting parameter values across runs. This visualization answers the question: "Which hyperparameters separate my best runs from my worst runs?"
If every good run has max_depth between 5 and 7, and every bad run has max_depth of 3 or 12, that axis will show a clear pattern. This is more informative than looking at individual hyperparameters in isolation.
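The same question can be asked programmatically: group runs by a parameter value and compare the mean metric. A small sketch over the six search results from earlier in the chapter, hard-coded here for illustration:

```python
from collections import defaultdict

# (max_depth, val_auc) pairs from the six-run search above
runs = [(4, 0.8791), (6, 0.8823), (6, 0.8847), (6, 0.8851), (7, 0.8862), (5, 0.8839)]

by_depth = defaultdict(list)
for depth, auc in runs:
    by_depth[depth].append(auc)

# Per-depth mean AUC reveals the same pattern the parallel coordinates plot shows
for depth in sorted(by_depth):
    aucs = by_depth[depth]
    print(f"max_depth={depth}: mean AUC={sum(aucs) / len(aucs):.4f} over {len(aucs)} run(s)")
```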
Metric History
For runs that log step-based metrics (loss at each boosting round, for example), the UI plots metric curves. Overlaying the validation loss curves of multiple runs on the same chart immediately reveals which runs overfitted, which converged slowly, and which found the sweet spot.
Search and Filter
The MLflow UI supports a search syntax for filtering runs:
# Find runs with AUC above 0.88
metrics.val_auc > 0.88
# Find runs with specific learning rate
params.learning_rate = "0.05"
# Combine conditions
metrics.val_auc > 0.88 AND params.max_depth = "6"
# Filter by tag
tags.data_version = "streamflow-v2-2024-03"
Programmatic Access
Everything in the UI is also available via the Python API:
import mlflow
client = mlflow.tracking.MlflowClient()
# Get the experiment
experiment = client.get_experiment_by_name("streamflow-churn-v2")
# Search for the best run
best_runs = client.search_runs(
experiment_ids=[experiment.experiment_id],
filter_string="metrics.val_auc > 0.88",
order_by=["metrics.val_auc DESC"],
max_results=5,
)
print("Top 5 runs by AUC:")
for run in best_runs:
print(f" Run {run.info.run_id[:8]} "
f"AUC={run.data.metrics['val_auc']:.4f} "
f"LR={run.data.params['learning_rate']} "
f"Depth={run.data.params['max_depth']}")
Top 5 runs by AUC:
Run a8b3c2d1 AUC=0.8862 LR=0.03 Depth=7
Run f4e7d6c5 AUC=0.8851 LR=0.05 Depth=6
Run b2c4e6a8 AUC=0.8847 LR=0.05 Depth=6
Run d1e3f5a7 AUC=0.8839 LR=0.03 Depth=5
Run c5d7e9f1 AUC=0.8823 LR=0.1 Depth=6
The MLflow Model Registry
Tracking experiments is necessary but not sufficient. Once you have identified a good model, you need a way to manage it through its lifecycle: from development to staging to production to retirement. That is what the Model Registry does.
Registering a Model
import mlflow
# Option 1: Register during logging
with mlflow.start_run(run_name="streamflow-churn-best"):
# ... train and evaluate ...
mlflow.xgboost.log_model(
model,
artifact_path="model",
registered_model_name="streamflow-churn-predictor",
)
# Option 2: Register an existing run's model after the fact
result = mlflow.register_model(
model_uri="runs:/a8b3c2d1e4f5a6b7c8d9e0f1a2b3c4d5/model",
name="streamflow-churn-predictor",
)
print(f"Registered model version: {result.version}")
Registered model version: 1
Model Versions and Stages
Every time you register a model with the same name, the version number increments. You can then assign stages:
client = mlflow.tracking.MlflowClient()
# Transition version 1 to Staging
client.transition_model_version_stage(
name="streamflow-churn-predictor",
version=1,
stage="Staging",
)
# After validation, promote to Production
client.transition_model_version_stage(
name="streamflow-churn-predictor",
version=1,
stage="Production",
)
# When a new model is ready, archive the old one
client.transition_model_version_stage(
name="streamflow-churn-predictor",
version=1,
stage="Archived",
)
client.transition_model_version_stage(
name="streamflow-churn-predictor",
version=2,
stage="Production",
)
Note --- In MLflow 2.9+, the `transition_model_version_stage` API is deprecated in favor of the new aliases system. Aliases are more flexible: instead of fixed stages, you assign arbitrary aliases like "champion" and "challenger" to model versions. The pattern below shows the modern approach:
# Modern MLflow: use aliases instead of stages
client.set_registered_model_alias(
name="streamflow-churn-predictor",
alias="champion",
version=2,
)
# Load the production model by alias
champion_model = mlflow.pyfunc.load_model(
model_uri="models:/streamflow-churn-predictor@champion"
)
predictions = champion_model.predict(X_test)
Loading a Registered Model for Inference
import mlflow.pyfunc
# Load by version number
model_v1 = mlflow.pyfunc.load_model(
model_uri="models:/streamflow-churn-predictor/1"
)
# Load by alias (recommended)
champion = mlflow.pyfunc.load_model(
model_uri="models:/streamflow-churn-predictor@champion"
)
# Load latest version (not recommended for production)
latest = mlflow.pyfunc.load_model(
model_uri="models:/streamflow-churn-predictor/latest"
)
# Make predictions
new_data = X_test.iloc[:5]
predictions = champion.predict(new_data)
print(predictions)
The Model Registry gives you model lineage: for any model in production, you can trace back to the exact run that produced it, the exact hyperparameters, the exact training data version, and the exact metrics. That is reproducibility that survives team turnover.
MLflow Projects: Reproducible Packaging
MLflow Projects go beyond tracking individual runs --- they package the entire codebase into a reproducible unit. An MLflow Project is a directory with an MLproject file that specifies the environment and entry points.
streamflow-churn/
MLproject
conda.yaml
train.py
evaluate.py
data/
features.csv
The MLproject file:
name: streamflow-churn
conda_env: conda.yaml
entry_points:
train:
parameters:
learning_rate: {type: float, default: 0.05}
max_depth: {type: int, default: 6}
data_path: {type: str, default: "data/features.csv"}
command: "python train.py --learning-rate {learning_rate} --max-depth {max_depth} --data-path {data_path}"
evaluate:
parameters:
model_uri: {type: str}
data_path: {type: str, default: "data/features.csv"}
command: "python evaluate.py --model-uri {model_uri} --data-path {data_path}"
The conda.yaml file pins every dependency:
name: streamflow-churn
channels:
- defaults
- conda-forge
dependencies:
- python=3.11
- scikit-learn=1.4.0
- xgboost=2.0.3
- pandas=2.2.0
- numpy=1.26.3
- mlflow=2.10.0
- pip:
- shap==0.44.0
Run the project from the command line:
mlflow run streamflow-churn/ -P learning_rate=0.03 -P max_depth=7
Or from Python:
mlflow.projects.run(
uri="streamflow-churn/",
parameters={"learning_rate": 0.03, "max_depth": 7},
)
MLflow creates a fresh Conda environment, installs the pinned dependencies, and executes the training script. The result is a run that anyone can reproduce on any machine. No "it works on my laptop" conversations.
Weights & Biases: The SaaS Alternative
Weights & Biases (W&B, pronounced "weights and biases," not "wand-b") is a commercial experiment tracking platform. It does the same thing as MLflow Tracking --- logs parameters, metrics, artifacts, and models --- but with a different philosophy: everything is in the cloud, and the UI is exceptional.
Setup
pip install wandb
wandb login # Authenticate with your API key from wandb.ai
The W&B Equivalent of Our MLflow Example
import wandb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, f1_score, log_loss
# Initialize a run
wandb.init(
project="streamflow-churn",
name="xgb-search-05-lr0.03-d7",
config={
"learning_rate": 0.03,
"max_depth": 7,
"subsample": 0.8,
"colsample_bytree": 0.8,
"n_estimators": 3000,
"early_stopping_rounds": 50,
"random_state": 42,
"data_version": "streamflow-v2-2024-03",
},
)
# Train (using wandb.config for hyperparameters)
model = xgb.XGBClassifier(
learning_rate=wandb.config.learning_rate,
max_depth=wandb.config.max_depth,
subsample=wandb.config.subsample,
colsample_bytree=wandb.config.colsample_bytree,
n_estimators=wandb.config.n_estimators,
early_stopping_rounds=wandb.config.early_stopping_rounds,
eval_metric="logloss",
random_state=wandb.config.random_state,
n_jobs=-1,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
# Evaluate and log metrics
y_val_proba = model.predict_proba(X_val)[:, 1]
y_val_pred = model.predict(X_val)
wandb.log({
"val_auc": roc_auc_score(y_val, y_val_proba),
"val_f1": f1_score(y_val, y_val_pred),
"val_log_loss": log_loss(y_val, y_val_proba),
"best_iteration": model.best_iteration,
})
# Log an artifact (model file)
artifact = wandb.Artifact("streamflow-churn-model", type="model")
model.save_model("model.json")
artifact.add_file("model.json")
wandb.log_artifact(artifact)
wandb.finish()
The code is structurally similar to MLflow. The difference becomes apparent when you open the W&B dashboard.
What W&B Does Better Than MLflow
1. Real-time collaboration. W&B is cloud-native. Your entire team sees every run the moment it finishes. No server setup, no shared database configuration, no "did you push to the tracking server?" MLflow requires you to set up and maintain a tracking server for team access.
2. Interactive visualizations. The W&B dashboard is genuinely excellent. Drag-and-drop chart builder, cross-run comparison with hover details, automatic parallel coordinates plots, custom scatter plots with configurable axes. MLflow's UI is functional. W&B's UI is a pleasure to use.
3. Sweeps (hyperparameter search). W&B has a built-in sweep agent that coordinates distributed hyperparameter searches. Define a search space in YAML, launch agents on multiple machines, and W&B handles the coordination. MLflow defers hyperparameter search to external tools (Optuna, Hyperopt).
# W&B Sweep configuration example
sweep_config = {
"method": "bayes",
"metric": {"name": "val_auc", "goal": "maximize"},
"parameters": {
"learning_rate": {"min": 0.01, "max": 0.15},
"max_depth": {"values": [4, 5, 6, 7, 8]},
"subsample": {"min": 0.7, "max": 0.95},
"colsample_bytree": {"min": 0.6, "max": 0.9},
},
}
sweep_id = wandb.sweep(sweep_config, project="streamflow-churn")
wandb.agent(sweep_id, function=train_function, count=50)
4. System metrics. W&B automatically logs GPU utilization, CPU usage, memory consumption, and network I/O during training. This is invaluable for debugging performance bottlenecks. MLflow does not log system metrics by default (though the mlflow.system_metrics module was added in MLflow 2.8).
5. Reports. W&B Reports let you embed live charts, runs, and comparisons into a Markdown document that updates as new experiments are added. This is useful for stakeholder communication --- share a link to a report instead of exporting screenshots.
What MLflow Does Better Than W&B
1. Self-hosted and free. MLflow is Apache 2.0 licensed. You run it on your own infrastructure, your data never leaves your network, and there is no per-user fee. For regulated industries (healthcare, finance), this is not a preference --- it is a requirement.
2. Model Registry. MLflow's Model Registry is more mature and deeply integrated. Model versioning, stage transitions, aliases, and the ability to load a registered model with a single mlflow.pyfunc.load_model() call make the path from experiment to production smoother. W&B has a model registry (launched 2023), but it is younger and less adopted.
3. MLflow Models format. The mlflow.pyfunc format wraps any model (sklearn, XGBoost, PyTorch, a custom Python function) in a standardized interface with predict(). This makes deployment tool-agnostic: any system that knows how to serve an MLflow Model can serve your model, regardless of the framework that produced it.
4. MLflow Projects. Reproducible packaging with pinned environments. W&B has no equivalent.
5. No vendor lock-in. MLflow is open source. If Databricks disappeared tomorrow, MLflow would continue to exist. W&B is a venture-backed SaaS company. Your experiment data lives on their servers (or on your own servers with W&B Server, their self-hosted offering, which requires an enterprise license).
The Honest Comparison Table
| Criterion | MLflow | W&B |
|---|---|---|
| Cost | Free (open source) | Free tier (100 GB); Teams $50/user/mo; Enterprise custom |
| Hosting | Self-hosted (you manage) | Cloud SaaS (they manage); self-hosted enterprise option |
| UI quality | Functional, improving | Excellent, best-in-class |
| Setup effort | Moderate (server, DB, artifact store) | Minimal (pip install, login) |
| Experiment tracking | Excellent | Excellent |
| Model Registry | Mature, deeply integrated | Newer, growing |
| Hyperparameter sweeps | External tools (Optuna) | Built-in sweeps |
| Reproducibility | MLflow Projects (environments) | Limited (logs config, not environment) |
| Data privacy | Full control (self-hosted) | Data on W&B servers (cloud) or self-hosted enterprise |
| Ecosystem integration | Databricks, AWS SageMaker, Azure ML | Integrations with major frameworks |
| Community | Large, enterprise-heavy | Large, research-heavy |
When to Use Which
Use MLflow if:
- You are in a regulated industry (healthcare, finance, government)
- Your organization requires data to stay on-premises
- You need a mature Model Registry integrated with your deployment pipeline
- Cost matters (MLflow is free; W&B is not for teams)

Use W&B if:
- You are a small team that wants to get started in five minutes
- The UI and collaboration features justify the cost
- You run many hyperparameter sweeps and want built-in coordination
- You value real-time team visibility over infrastructure control

Use both if:
- You use W&B for experiment exploration and visualization during development
- You use MLflow for the Model Registry and production deployment pipeline

This is more common than you might expect.
Autologging: The Low-Effort Starting Point
MLflow supports autologging for most popular ML frameworks. With a single line of code, MLflow automatically logs parameters, metrics, and the trained model --- no manual log_param() calls required.
import mlflow
# Enable autologging for scikit-learn
mlflow.sklearn.autolog()
# Enable autologging for XGBoost
mlflow.xgboost.autolog()
# Enable autologging for all supported frameworks
mlflow.autolog()
With autologging enabled, a simple training script becomes fully tracked:
import mlflow
import xgboost as xgb
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("streamflow-churn-v2")
mlflow.xgboost.autolog()

# X and y are the prepared feature matrix and labels.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Autologging records all XGBoost parameters, the best iteration,
# training metrics, and the model artifact, with no manual log calls.
model = xgb.XGBClassifier(
    learning_rate=0.05,
    max_depth=6,
    n_estimators=2000,
    early_stopping_rounds=50,
    eval_metric="logloss",
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
Practical Tip --- Autologging is a good starting point but not a complete solution. It logs model hyperparameters and training metrics, but it does not log your data version, your feature engineering choices, your evaluation on the test set, or your custom artifacts. Use autologging as a floor (it catches what you forget) and manual logging as the ceiling (it captures what matters).
Structuring Experiments for a Team
When a team shares an MLflow server, naming and organization become critical. Here are the patterns that work.
Naming Conventions
Experiment name: {project}-{model-type}-{version}
streamflow-churn-xgboost-v2
streamflow-churn-lightgbm-v3
Run name: {model}-{search-index}-{key-params}
xgb-042-lr0.03-d7-ss0.8
lgbm-017-lr0.05-nl128
Tags: author=caleb
data_version=2024-03-15
purpose=hyperparameter_search | final_eval | ablation
pipeline_version=v2.1
git_commit=a3f7c2e
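A convention only helps if everyone applies it the same way, so it is worth encoding it in a small helper rather than typing names by hand. This sketch is illustrative (the helper and its abbreviation map are our own convention, not an MLflow API):

```python
def make_run_name(model: str, index: int, params: dict) -> str:
    """Build a run name following the {model}-{search-index}-{key-params} convention."""
    # Abbreviations for common hyperparameters; extend as your search grows.
    abbrev = {"learning_rate": "lr", "max_depth": "d",
              "subsample": "ss", "num_leaves": "nl"}
    parts = [f"{abbrev.get(k, k)}{v}" for k, v in params.items()]
    return f"{model}-{index:03d}-" + "-".join(parts)

# Produces names like the examples above:
print(make_run_name("xgb", 42, {"learning_rate": 0.03, "max_depth": 7, "subsample": 0.8}))
# -> xgb-042-lr0.03-d7-ss0.8
```

Pass the result as `run_name=` to `mlflow.start_run()` and every run in the search sorts and filters cleanly in the UI.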
The Git Commit Tag
This one practice prevents more headaches than any other. Log the git commit hash of the code that produced the run:
import subprocess
def get_git_hash():
try:
return subprocess.check_output(
["git", "rev-parse", "HEAD"]
).decode("utf-8").strip()
except Exception:
return "unknown"
with mlflow.start_run():
mlflow.set_tag("git_commit", get_git_hash())
# ... rest of training ...
Now every run is linked to the exact code that produced it. When you need to reproduce a run from eight months ago, you check out the commit and run the script. No archaeology required.
Organizing Large Experiments
For a systematic hyperparameter search, use parent-child runs:
with mlflow.start_run(run_name="hp-search-2024-03-25") as parent_run:
mlflow.set_tag("purpose", "hyperparameter_search")
mlflow.set_tag("search_method", "grid")
mlflow.set_tag("total_configs", str(len(configs)))
best_auc = 0
best_run_id = None
for i, config in enumerate(configs):
with mlflow.start_run(
run_name=f"config-{i+1:03d}",
nested=True,
) as child_run:
mlflow.log_params(config)
# ... train and evaluate ...
mlflow.log_metrics(metrics)
if metrics["val_auc"] > best_auc:
best_auc = metrics["val_auc"]
best_run_id = child_run.info.run_id
# Log the best result on the parent run
mlflow.log_metric("best_val_auc", best_auc)
mlflow.set_tag("best_child_run_id", best_run_id)
In the MLflow UI, the parent run appears as a summary, and expanding it reveals all child runs. This keeps the experiment list clean while preserving full detail.
Common Pitfalls and How to Avoid Them
Pitfall 1: Forgetting to Log the Data Version
The most common failure mode is not hyperparameters or code --- it is data. Your features changed, a column was renamed, the data was re-extracted with different filters, and now the "best model" from last month produces different results. Always log a data version tag, and ideally log a hash of the training data:
import hashlib
import pandas as pd

data_hash = hashlib.sha256(
    pd.util.hash_pandas_object(X_train).values.tobytes()
).hexdigest()[:12]
mlflow.set_tag("data_sha256", data_hash)
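The fingerprint is only useful if it is deterministic across runs and sensitive to any change in the data. A quick sanity check of the approach (the `dataframe_hash` helper is our own wrapper, not a pandas or MLflow API):

```python
import hashlib
import pandas as pd

def dataframe_hash(df: pd.DataFrame) -> str:
    """Short, deterministic fingerprint of a DataFrame's contents and index."""
    return hashlib.sha256(
        pd.util.hash_pandas_object(df).values.tobytes()
    ).hexdigest()[:12]

a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 2, 4]})

print(dataframe_hash(a) == dataframe_hash(a.copy()))  # stable across copies
print(dataframe_hash(a) == dataframe_hash(b))         # changes when data changes
```

If the hash on today's run differs from the hash on the "best" run, you know the data moved before you waste time blaming the model.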
Pitfall 2: Using the Test Set for Early Stopping
This was covered in Chapter 14, but it bears repeating in the tracking context. If your early stopping uses the test set, your logged "test AUC" is biased. The three-way split (train, validation, test) is non-negotiable.
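For reference, the three-way split takes two calls to `train_test_split`; the 60/20/20 proportions and toy data here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and labels.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve off the test set, then split the remainder into train/val.
# Early stopping watches val; test is touched once, for final reporting.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
```

With 50 rows this yields a 30/10/10 split; only `val_auc` should ever drive early stopping or model selection.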
Pitfall 3: Logging Too Late
Log parameters at the start of the run, not the end. If your training script crashes at minute 45, you want the parameters on record so you know what configuration failed. Metrics can be logged incrementally.
Pitfall 4: Overwriting Runs
Never reuse a run ID. Every execution of your training script should create a new run. This is the default behavior with mlflow.start_run(), but some teams try to "update" existing runs to save space. Do not do this. Disk space is cheap. Lost experiment history is not.
Pitfall 5: Ignoring the Artifact Store Size
Model artifacts accumulate. A single XGBoost model might be 10 MB, but after 500 experiments, that is 5 GB. For deep learning models, multiply by 100. Set up artifact retention policies: archive experiments older than 6 months, delete failed runs after 30 days.
Production MLflow: Beyond Local Development
For team and production use, MLflow needs infrastructure beyond the default mlruns/ directory.
Backend Store Options
| Backend | Use Case |
|---|---|
| Local filesystem | Single-user development |
| SQLite | Single-user, persistent tracking |
| PostgreSQL | Team use, production |
| MySQL | Team use, production |
Artifact Store Options
| Store | Use Case |
|---|---|
| Local filesystem | Development |
| S3 (or S3-compatible) | Production, team access |
| Azure Blob Storage | Azure-based teams |
| Google Cloud Storage | GCP-based teams |
Minimal Production Setup
# Start MLflow with PostgreSQL backend and S3 artifact store
mlflow server \
--backend-store-uri postgresql://mlflow:password@db-host:5432/mlflow \
--default-artifact-root s3://mlflow-artifacts/experiments \
--host 0.0.0.0 \
--port 5000
This configuration gives you persistent tracking data in a real database, team-accessible artifact storage in S3, and a UI available to anyone on the network.
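The server command above assumes the process can authenticate to S3. In practice that means standard AWS credentials in the environment of both the server and every client that logs artifacts; a sketch with placeholder values:

```shell
# Credentials for the S3 artifact store (values are placeholders).
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."

# Only needed for S3-compatible stores such as MinIO.
export MLFLOW_S3_ENDPOINT_URL="https://minio.internal:9000"
```

Clients upload artifacts directly to the store, so a client missing these variables will log parameters and metrics but fail on `log_model()`.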
Putting It All Together: The Experiment Tracking Workflow
Here is the workflow that every team should follow.
1. Before you write any training code, set up the tracking server and create your experiment. Agree on naming conventions.
2. Every training script should start with mlflow.set_tracking_uri() and mlflow.set_experiment(). Non-negotiable.
3. Every run should log: all hyperparameters, the data version, the git commit, all evaluation metrics, and the trained model artifact.
4. After a hyperparameter search, use the UI to compare runs, identify the best configuration, and verify that the result is not an artifact of noise.
5. When a model is ready for deployment, register it in the Model Registry and assign it an alias. The deployment pipeline loads the model by alias, not by run ID.
6. When a new model is trained, register it as a new version, test it in staging, and promote it to production by reassigning the alias. The old model version remains in the registry for rollback.
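Loading by alias in step 5 uses MLflow's `models:/<name>@<alias>` URI scheme. A minimal sketch; the registered model name and alias are illustrative:

```python
def model_uri(name: str, alias: str) -> str:
    """Build an MLflow Model Registry URI, e.g. 'models:/streamflow-churn@production'."""
    return f"models:/{name}@{alias}"

# In the deployment pipeline (requires a running tracking server):
#   import mlflow
#   model = mlflow.pyfunc.load_model(model_uri("streamflow-churn", "production"))
print(model_uri("streamflow-churn", "production"))
# -> models:/streamflow-churn@production
```

Because the pipeline resolves the alias at load time, promoting a new version is a registry operation, not a code change.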
This is the experiment-tracking layer of MLOps. It sits between your notebook and your deployment pipeline, and it ensures that every decision --- every hyperparameter choice, every data version, every model artifact --- is recorded, comparable, and reproducible.
Theme: Real World =/= Kaggle --- On Kaggle, you submit a CSV and get a score. In the real world, you submit a model and get the question: "Can you reproduce this? Can you explain why this model is better than the one we deployed last quarter? Can you trace the lineage from training data to production prediction?" Experiment tracking is how you answer "yes" to all three.
Summary
Experiment tracking is not a nice-to-have --- it is infrastructure. MLflow is the open-source standard: free, self-hosted, and integrated with the deployment ecosystem through the Model Registry and the pyfunc model format. Weights & Biases offers a superior UI and built-in sweep coordination, at the cost of SaaS dependency and per-user pricing.
The minimum viable experiment tracking setup is five lines of code: set the tracking URI, set the experiment name, start a run, log parameters, and log metrics. The maximum --- with artifacts, model registration, git commit tags, data versioning, and nested runs --- is the kind of infrastructure that lets a team of twenty data scientists iterate on the same model without losing their minds.
Start with autologging. Graduate to manual logging when autologging misses something important. Adopt the Model Registry when you have more than one model in production. And never, under any circumstances, go back to the spreadsheet.
Next chapter: Chapter 31: Model Deployment --- wrapping your tracked, registered model in a FastAPI endpoint and deploying it with Docker.