Case Study 2: MLflow vs. W&B --- A Team Adoption Decision
Background
NovaBio is a mid-size biotechnology company with a data science team of twelve. They build predictive models for drug compound screening, clinical trial patient matching, and manufacturing quality control. The team has been using a mix of Jupyter notebooks, Google Sheets, and a shared NAS drive for experiment tracking. The CTO has approved budget for "proper ML tooling" and tasked the team lead, Diane, with evaluating and recommending an experiment tracking platform.
The candidates are MLflow (self-hosted, open source) and Weights & Biases (cloud SaaS). The evaluation period is four weeks. Three team members will use each tool on the same project --- a compound activity classification model --- and report back.
This case study follows the evaluation from both sides, documents the friction points, and arrives at a recommendation that is honest about the tradeoffs.
The Evaluation Criteria
Diane defines eight criteria before the evaluation starts. This matters. Without predefined criteria, the evaluation degrades into "which tool do I personally like more," which is not useful for an organizational decision.
| Criterion | Weight | Description |
|---|---|---|
| Setup time | 10% | From zero to first tracked run |
| Logging API quality | 15% | Ease of logging params, metrics, artifacts |
| UI and visualization | 20% | Experiment comparison, chart quality, search |
| Model Registry | 15% | Model versioning, staging, deployment support |
| Team collaboration | 15% | Multi-user access, sharing, permissions |
| Data privacy | 10% | Where data is stored, compliance |
| Cost (3-year TCO) | 10% | Total cost of ownership for 12 users |
| Ecosystem integration | 5% | Integration with existing tools (AWS, Docker, CI/CD) |
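A weighted scorecard like this reduces to simple arithmetic once each tool gets a rating per criterion. A minimal sketch of the mechanics (the 1-5 ratings here are illustrative placeholders, not the evaluation's actual scores):

```python
# Hypothetical weighted-scorecard arithmetic. The weights mirror the
# table above; the ratings are placeholders, not NovaBio's real scores.
weights = {
    "setup_time": 0.10, "logging_api": 0.15, "ui": 0.20,
    "model_registry": 0.15, "collaboration": 0.15,
    "data_privacy": 0.10, "cost": 0.10, "ecosystem": 0.05,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 100%

def weighted_score(ratings: dict) -> float:
    """Combine per-criterion ratings (1-5) into a single weighted score."""
    return sum(weights[c] * r for c, r in ratings.items())

# A tool rated 4 on every criterion scores exactly 4.0 by construction
print(weighted_score({c: 4 for c in weights}))  # 4.0
```

Predefining the weights is what keeps the exercise honest: the ratings can be argued over later, but the relative importance of each criterion is locked in before anyone has a favorite.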
Week 1: Setup and First Impressions
The MLflow Team
The three MLflow evaluators --- Raj, Tomoko, and Omar --- start by provisioning infrastructure.
# MLflow setup on AWS (simplified)
# 1. RDS PostgreSQL instance for backend store
# 2. S3 bucket for artifact store
# 3. EC2 instance for the tracking server
# Install on the tracking server
pip install mlflow psycopg2-binary boto3
# Start the server
mlflow server \
    --backend-store-uri postgresql://mlflow:password@novabio-mlflow-db.us-east-1.rds.amazonaws.com:5432/mlflow \
    --default-artifact-root s3://novabio-ml-artifacts/experiments \
    --host 0.0.0.0 \
    --port 5000
Time to first tracked run: 3 hours. Most of that was infrastructure: creating the RDS instance, configuring security groups, setting up IAM roles for S3 access, and debugging a PostgreSQL connection timeout.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
mlflow.set_tracking_uri("http://novabio-mlflow.internal:5000")
mlflow.set_experiment("compound-activity-eval")
X, y = make_classification(
    n_samples=10000, n_features=50, n_informative=20,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, "model")
    mlflow.set_tag("evaluator", "raj")

print(f"AUC: {auc:.4f}")
AUC: 0.9647
Raj's notes: "Setup was painful. The tracking server itself is easy. The infrastructure around it --- database, object storage, networking, authentication --- is not trivial. But once it is running, the logging API is clean and intuitive."
The W&B Team
The three W&B evaluators --- Lin, Marcus, and Sofia --- start with a pip install.
pip install wandb
wandb login
# Paste API key from wandb.ai/settings
Time to first tracked run: 8 minutes. No infrastructure to provision. No database. No artifact store configuration.
import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(
    n_samples=10000, n_features=50, n_informative=20,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

wandb.init(
    project="compound-activity-eval",
    name="rf-baseline",
    config={"n_estimators": 200, "max_depth": 10, "random_state": 42},
    tags=["evaluator:lin", "model:rf"],
)

model = RandomForestClassifier(**wandb.config)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
wandb.log({"test_auc": auc})
wandb.finish()
print(f"AUC: {auc:.4f}")
AUC: 0.9647
Lin's notes: "Eight minutes from pip install to a run visible in the dashboard. The dashboard auto-generates system metrics, shows run duration, and creates comparison charts without configuration. First impression: this is designed for researchers who want to experiment, not engineers who want to build pipelines."
Observation --- The setup time difference (3 hours vs. 8 minutes) is real but misleading. MLflow's setup cost is paid once; W&B's operational cost is paid monthly. The correct comparison is total cost of ownership, not first-day friction.
Week 2: Running the Actual Experiment
Both teams run a 50-configuration hyperparameter search on the compound activity dataset.
MLflow: Manual Grid Search
import itertools

import mlflow
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://novabio-mlflow.internal:5000")
mlflow.set_experiment("compound-activity-xgb-search")

# Hold out a validation set from the training data for early stopping
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

search_space = {
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [4, 5, 6, 7],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8],
}

# Generate all combinations (3 * 4 * 3 * 2 = 72), sample 50
all_configs = [dict(zip(search_space.keys(), v))
               for v in itertools.product(*search_space.values())]
np.random.seed(42)
configs = [all_configs[i]
           for i in np.random.choice(len(all_configs), 50, replace=False)]
with mlflow.start_run(run_name="xgb-search-50-configs") as parent:
    mlflow.set_tag("search_method", "random_subset_of_grid")
    mlflow.set_tag("total_configs", "50")

    best_auc = 0
    for i, config in enumerate(configs):
        with mlflow.start_run(
            run_name=f"config-{i+1:03d}",
            nested=True,
        ):
            mlflow.log_params(config)
            mlflow.log_param("n_estimators", 2000)
            mlflow.log_param("early_stopping_rounds", 40)

            model = xgb.XGBClassifier(
                n_estimators=2000,
                early_stopping_rounds=40,
                eval_metric="logloss",
                random_state=42,
                n_jobs=-1,
                **config,
            )
            model.fit(X_train, y_train,
                      eval_set=[(X_val, y_val)], verbose=False)

            y_proba = model.predict_proba(X_val)[:, 1]
            auc = roc_auc_score(y_val, y_proba)
            mlflow.log_metric("val_auc", auc)
            mlflow.log_metric("val_log_loss", log_loss(y_val, y_proba))
            mlflow.log_metric("best_iteration", model.best_iteration)

            if auc > best_auc:
                best_auc = auc

        if (i + 1) % 10 == 0:
            print(f"  Completed {i+1}/50 configs. Best AUC so far: {best_auc:.4f}")

    mlflow.log_metric("best_val_auc", best_auc)

print(f"\nSearch complete. Best validation AUC: {best_auc:.4f}")
Completed 10/50 configs. Best AUC so far: 0.9712
Completed 20/50 configs. Best AUC so far: 0.9728
Completed 30/50 configs. Best AUC so far: 0.9735
Completed 40/50 configs. Best AUC so far: 0.9741
Completed 50/50 configs. Best AUC so far: 0.9741
Search complete. Best validation AUC: 0.9741
W&B: Bayesian Sweep
import wandb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, log_loss
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.15},
        "max_depth": {"values": [4, 5, 6, 7, 8]},
        "subsample": {"min": 0.6, "max": 0.95},
        "colsample_bytree": {"min": 0.6, "max": 0.95},
    },
}

def train_sweep():
    wandb.init()
    config = wandb.config

    model = xgb.XGBClassifier(
        learning_rate=config.learning_rate,
        max_depth=config.max_depth,
        subsample=config.subsample,
        colsample_bytree=config.colsample_bytree,
        n_estimators=2000,
        early_stopping_rounds=40,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)], verbose=False)

    y_proba = model.predict_proba(X_val)[:, 1]
    wandb.log({
        "val_auc": roc_auc_score(y_val, y_proba),
        "val_log_loss": log_loss(y_val, y_proba),
        "best_iteration": model.best_iteration,
    })
    wandb.finish()

sweep_id = wandb.sweep(sweep_config, project="compound-activity-eval")
wandb.agent(sweep_id, function=train_sweep, count=50)
Sofia's notes: "The sweep dashboard updates in real time. I can see a parallel coordinates plot of all 50 runs while they are still running. The Bayesian optimizer is clearly concentrating on the promising regions of the search space --- the later runs cluster around learning_rate 0.03-0.06 and max_depth 6-7, which are the regions with the best AUC. I did not have to write any comparison code."
Raj's notes on MLflow: "I had to write the comparison code myself. The UI shows all 50 runs, and I can sort and filter, but the parallel coordinates plot requires selecting runs manually. It works, but it is not as fluid as what Sofia described."
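The comparison code Raj wrote himself might look something like the following sketch. It assumes `mlflow.search_runs`, which returns a pandas DataFrame with `params.*` columns (logged as strings) and `metrics.*` columns; the helper normalizes those into numeric columns suitable for pandas' built-in parallel-coordinates plot.

```python
import pandas as pd

def parallel_coords_frame(runs: pd.DataFrame, params: list, metric: str) -> pd.DataFrame:
    """Extract params.* and metrics.* columns from an mlflow.search_runs-style
    DataFrame and coerce them to numeric (MLflow logs params as strings)."""
    cols = [f"params.{p}" for p in params] + [f"metrics.{metric}"]
    frame = runs[cols].apply(pd.to_numeric)
    frame.columns = params + [metric]
    return frame

# Typical usage against the live tracking server (sketch):
#   runs = mlflow.search_runs(experiment_names=["compound-activity-xgb-search"])
#   frame = parallel_coords_frame(runs, ["learning_rate", "max_depth"], "val_auc")
#   frame["bucket"] = pd.qcut(frame["val_auc"], 4).astype(str)
#   pd.plotting.parallel_coordinates(frame, "bucket")
```

Twenty lines of pandas is not a large burden, but it is twenty lines that W&B users never write, which is exactly the gap Raj's notes describe.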
Week 3: Model Registry and Collaboration
MLflow Model Registry
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the best run from the search experiment
experiment = mlflow.get_experiment_by_name("compound-activity-xgb-search")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="",
    order_by=["metrics.val_auc DESC"],
    max_results=1,
)[0]

# Register the best model
result = mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="compound-activity-classifier",
)

# Assign alias
client.set_registered_model_alias(
    name="compound-activity-classifier",
    alias="champion",
    version=result.version,
)

# Load and verify
champion = mlflow.pyfunc.load_model(
    "models:/compound-activity-classifier@champion"
)
sample_predictions = champion.predict(X_test[:5])
print(f"Champion model: v{result.version}")
print(f"Sample predictions: {sample_predictions}")
Champion model: v1
Sample predictions: [1 0 1 1 0]
Tomoko's notes: "The Model Registry is MLflow's killer feature. Version management with aliases, the ability to trace any registered model back to its training run, and the standardized pyfunc loading API --- this is production-grade. W&B's registry is newer and less mature."
W&B Collaboration
Marcus's notes: "The collaboration features are where W&B shines in daily use. I can see Lin's runs and Sofia's runs in the same project dashboard without any configuration. I can add comments to specific runs. I can create a Report that embeds live charts and share it with Diane via a URL. On MLflow, sharing requires everyone to have network access to the tracking server, and there is no commenting or reporting built in."
Week 4: The Scorecard
Diane compiles the evaluation results.
Setup and Operations
| Dimension | MLflow | W&B |
|---|---|---|
| Time to first run | 3 hours | 8 minutes |
| Ongoing maintenance | ~4 hours/month (server, DB, backups) | 0 (managed SaaS) |
| Infrastructure needed | PostgreSQL, S3, EC2 | None |
| Authentication | DIY (reverse proxy, SSO integration) | Built-in (SSO, RBAC) |
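The "DIY" authentication row deserves concreteness. The open-source MLflow tracking server does not ship turnkey SSO, so a common pattern is to put a reverse proxy in front of it. A hypothetical nginx sketch (the hostname, certificate setup, and htpasswd path are placeholders, not NovaBio's actual configuration):

```nginx
server {
    listen      443 ssl;
    server_name mlflow.example.internal;  # placeholder hostname

    location / {
        # Simplest possible gate; a real deployment would likely use
        # auth_request against the company's SSO provider instead
        auth_basic           "NovaBio MLflow";
        auth_basic_user_file /etc/nginx/mlflow.htpasswd;

        proxy_pass           http://127.0.0.1:5000;
        proxy_set_header     Host $host;
    }
}
```

Even this minimal version is one more piece of infrastructure the team owns, which is the point the table is making.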
Feature Comparison
| Feature | MLflow | W&B | Winner |
|---|---|---|---|
| Parameter/metric logging | Excellent | Excellent | Tie |
| Artifact logging | Excellent | Excellent | Tie |
| Autologging | Supported (sklearn, xgboost, etc.) | Supported (similar) | Tie |
| UI: Run comparison | Good (sortable table, basic charts) | Excellent (interactive, real-time) | W&B |
| UI: Parallel coordinates | Available, manual selection | Automatic, interactive | W&B |
| UI: Custom dashboards | Limited | Extensive (drag-and-drop panels) | W&B |
| Model Registry | Mature (aliases, lineage, pyfunc) | Newer, growing | MLflow |
| Hyperparameter sweeps | External (Optuna, Hyperopt) | Built-in (Bayesian, grid, random) | W&B |
| System metrics (GPU, CPU) | Optional (mlflow.system_metrics) | Automatic | W&B |
| Reports/sharing | Not built-in | Built-in Reports | W&B |
| Reproducibility (env) | MLflow Projects (conda/docker) | Limited | MLflow |
| Deployment integration | MLflow Models, SageMaker, Azure ML | Limited | MLflow |
| API for programmatic access | Comprehensive | Comprehensive | Tie |
Cost Analysis (3-Year, 12 Users)
MLflow (self-hosted on AWS):
RDS PostgreSQL (db.t3.medium): $1,580/year
S3 storage (500 GB, growing): $140/year
EC2 instance (t3.medium): $410/year
DevOps time (4 hr/mo * $80/hr): $3,840/year
Total Year 1: $5,970
Total 3-Year: ~$19,000 (flat rate is $17,910; the rounding allows for storage growth)
W&B Teams (12 users * $50/user/month):
Subscription: $7,200/year
Infrastructure: $0
DevOps time: $0
Total Year 1: $7,200
Total 3-Year: $21,600
Difference: W&B costs ~$2,600 more over 3 years
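The totals above are easy to sanity-check in a few lines:

```python
# Reproduce the 3-year TCO arithmetic from the cost analysis above
mlflow_year1 = 1580 + 140 + 410 + 3840   # RDS + S3 + EC2 + DevOps time
wandb_year1 = 12 * 50 * 12               # 12 users x $50/user/month

print(mlflow_year1)        # 5970
print(mlflow_year1 * 3)    # 17910 flat; the scorecard's ~$19,000 allows for storage growth
print(wandb_year1 * 3)     # 21600
```

The DevOps line item dominates the MLflow side: at $3,840/year it is nearly two-thirds of the annual cost, which is why Lin's dissent (below in this case study's Week 4 discussion) about underestimated maintenance hours matters so much to the comparison.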
Important caveat: The MLflow cost estimate assumes a relatively simple setup. If NovaBio needs high availability, automated backups with point-in-time recovery, a load balancer, and monitoring for the MLflow server itself, the infrastructure cost rises significantly. The W&B cost includes all of this by default.
Data Privacy Analysis
MLflow:
- All data stays on NovaBio's AWS account
- Full control over encryption, access, retention
- HIPAA/GxP compliance achievable with standard AWS controls
- No third-party data exposure
W&B (Cloud):
- Experiment metadata and artifacts stored on W&B servers
- W&B offers SOC 2 Type II certification
- BAA available for healthcare (W&B Enterprise)
- Self-hosted option (W&B Server) requires enterprise license
W&B (Self-Hosted Enterprise):
- Data stays on NovaBio's infrastructure
- Enterprise license required (pricing: contact sales, typically $200+/user/month)
- Maintains W&B's UI and features
- Total 3-year cost: ~$86,000+ (12 users)
The Recommendation
Diane's report to the CTO:
Primary recommendation: MLflow for production, with W&B for exploration (optional).
The reasoning:
1. Data privacy is non-negotiable. NovaBio handles proprietary compound data. Sending experiment metadata --- which includes feature names, model parameters, and performance metrics on proprietary datasets --- to a third-party SaaS platform requires legal review and a data processing agreement. MLflow eliminates this concern entirely.
2. The Model Registry is critical for our deployment pipeline. NovaBio is building a CI/CD pipeline for model deployment. MLflow's Model Registry, with aliases and the pyfunc loading API, integrates directly with our Docker-based deployment. W&B's registry is usable but less mature.
3. The cost difference is small, but the risk profile is different. MLflow's cost is infrastructure (predictable, controlled). W&B's cost is a subscription (subject to pricing changes). For a 12-person team, the difference is negligible. But if the team grows to 30, W&B costs scale linearly while MLflow's infrastructure costs grow sublinearly.
4. W&B's UI is genuinely better for exploration. For hyperparameter searches, model comparison, and creating reports for stakeholders, W&B is more pleasant to use. The recommendation: use W&B's free tier for individual exploration during development, and use MLflow as the system of record for experiments that matter.
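One way to make "W&B for exploration, MLflow as system of record" concrete is a thin wrapper that always writes to MLflow and mirrors to W&B only when an active run is supplied. This is our sketch, not either library's API; the class name and interface are invented for illustration:

```python
class DualTracker:
    """Log to MLflow always; mirror to a W&B run when one is given.

    mlflow_mod is the imported mlflow module (or anything exposing
    log_params/log_metric); wandb_run is an active wandb run or None.
    """

    def __init__(self, mlflow_mod, wandb_run=None):
        self.mlflow = mlflow_mod
        self.wandb_run = wandb_run

    def log_params(self, params: dict):
        self.mlflow.log_params(params)
        if self.wandb_run is not None:
            self.wandb_run.config.update(params)

    def log_metrics(self, metrics: dict, step=None):
        for name, value in metrics.items():
            self.mlflow.log_metric(name, value, step=step)
        if self.wandb_run is not None:
            self.wandb_run.log(metrics, step=step)
```

The asymmetry is deliberate: the MLflow calls are unconditional because MLflow is the system of record, while the W&B mirror is opt-in per run.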
5. The setup cost is real but one-time. Three hours of initial setup plus four hours per month of maintenance is a modest investment for infrastructure that serves the entire team indefinitely.
The Dissenting View
Lin disagrees with the recommendation and writes a rebuttal. Her argument: "The four hours per month of MLflow maintenance is optimistic. When the PostgreSQL instance needs a version upgrade, when the S3 bucket hits a lifecycle policy conflict, when the EC2 instance needs a security patch, when the MLflow version upgrade breaks the database schema --- these events are unpredictable and each takes 2-8 hours to resolve. Over three years, the true DevOps burden of self-hosted MLflow is 2-3x the estimate. W&B eliminates this entirely."
Diane includes Lin's rebuttal in the report. She is right about the maintenance risk. The recommendation stands because data privacy requirements outweigh operational convenience, but the rebuttal ensures the CTO understands the tradeoff with eyes open.
Implementation Plan
Diane's final deliverable is a four-week migration plan:
Week 1: Provision MLflow infrastructure (PostgreSQL, S3, EC2). Configure authentication via the company's existing reverse proxy.
Week 2: Migrate two existing projects to MLflow. Create naming conventions and tagging standards. Write a log_experiment() utility function that wraps common logging patterns.
Week 3: Set up the Model Registry. Register the three models currently in production. Build the CI/CD integration that loads registered models by alias.
Week 4: Team training (half-day workshop). Document the standards. Run a parallel experiment where every team member tracks their current work in MLflow for one week.
# The utility function Diane's team builds for standardized logging
def log_experiment(
    model,
    X_train, y_train,
    X_val, y_val,
    params,
    data_version,
    run_name=None,
    tags=None,
    register_as=None,
):
    """
    Standardized experiment logging for NovaBio.

    Logs parameters, validation metrics, data metadata,
    the model artifact, and optionally registers the model.
    """
    import hashlib

    import mlflow
    import pandas as pd
    from sklearn.metrics import f1_score, log_loss, roc_auc_score

    with mlflow.start_run(run_name=run_name):
        # Parameters
        mlflow.log_params(params)

        # Data metadata
        data_hash = hashlib.sha256(
            pd.util.hash_pandas_object(X_train).values.tobytes()
        ).hexdigest()[:16]
        mlflow.set_tag("data_version", data_version)
        mlflow.set_tag("data_hash", data_hash)
        mlflow.set_tag("train_rows", str(len(X_train)))
        mlflow.set_tag("val_rows", str(len(X_val)))
        mlflow.set_tag("feature_count", str(X_train.shape[1]))

        # Custom tags
        if tags:
            for key, value in tags.items():
                mlflow.set_tag(key, value)

        # Evaluate
        y_proba = model.predict_proba(X_val)[:, 1]
        y_pred = model.predict(X_val)
        metrics = {
            "val_auc": roc_auc_score(y_val, y_proba),
            "val_f1": f1_score(y_val, y_pred),
            "val_log_loss": log_loss(y_val, y_proba),
        }
        mlflow.log_metrics(metrics)

        # Model artifact: use the matching MLflow flavor when one exists
        # (e.g. mlflow.xgboost for XGBClassifier), else fall back to sklearn
        model_flavor = type(model).__module__.split(".")[0]
        if hasattr(mlflow, model_flavor):
            getattr(mlflow, model_flavor).log_model(
                model,
                artifact_path="model",
                registered_model_name=register_as,
            )
        else:
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name=register_as,
            )

        # Feature importance (if available)
        if hasattr(model, "feature_importances_"):
            importance = pd.DataFrame({
                "feature": X_train.columns,
                "importance": model.feature_importances_,
            }).sort_values("importance", ascending=False)
            importance.to_csv("feature_importance.csv", index=False)
            mlflow.log_artifact("feature_importance.csv")

    return metrics
Outcome
Six months after the migration:
- The team has 847 tracked runs across 14 experiments
- Five models are registered in the Model Registry, three in production
- The average time to answer "what hyperparameters produced this model?" dropped from 30 minutes (spreadsheet) to 10 seconds (MLflow query)
- The compound activity classifier improved by 1.2% AUC because the team could systematically compare configurations instead of guessing
- Lin still uses W&B for her personal exploration. Nobody minds. The results end up in MLflow when they matter
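The "10 seconds" lineage query is worth spelling out: every registered model version in MLflow carries the run_id of its training run, so the hyperparameters are one lookup away. A sketch against the client API (the model name matches the registry example earlier in this case study):

```python
def hyperparams_for(client, model_name: str, version: str) -> dict:
    """Trace a registered model version back to its training run's params."""
    mv = client.get_model_version(name=model_name, version=version)
    return client.get_run(mv.run_id).data.params

# Typical usage against the live registry:
#   from mlflow.tracking import MlflowClient
#   hyperparams_for(MlflowClient(), "compound-activity-classifier", "1")
```

This lookup is exactly what the spreadsheet could never answer reliably: the spreadsheet recorded what someone remembered to type, while the registry records what actually ran.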
The spreadsheet was archived. Nobody has opened it since.
This case study compares MLflow and W&B for team adoption. Return to the chapter for the full experiment tracking framework.