Case Study 2: MLflow vs. W&B --- A Team Adoption Decision


Background

NovaBio is a mid-size biotechnology company with a data science team of twelve. They build predictive models for drug compound screening, clinical trial patient matching, and manufacturing quality control. The team has been using a mix of Jupyter notebooks, Google Sheets, and a shared NAS drive for experiment tracking. The CTO has approved budget for "proper ML tooling" and tasked the team lead, Diane, with evaluating and recommending an experiment tracking platform.

The candidates are MLflow (self-hosted, open source) and Weights & Biases (cloud SaaS). The evaluation period is four weeks. Three team members will use each tool on the same project --- a compound activity classification model --- and report back.

This case study follows the evaluation from both sides, documents the friction points, and arrives at a recommendation that is honest about the tradeoffs.


The Evaluation Criteria

Diane defines eight criteria before the evaluation starts. This matters. Without predefined criteria, the evaluation degrades into "which tool do I personally like more," which is not useful for an organizational decision.

Criterion               Weight   Description
Setup time              10%      From zero to first tracked run
Logging API quality     15%      Ease of logging params, metrics, artifacts
UI and visualization    20%      Experiment comparison, chart quality, search
Model Registry          15%      Model versioning, staging, deployment support
Team collaboration      15%      Multi-user access, sharing, permissions
Data privacy            10%      Where data is stored, compliance
Cost (3-year TCO)       10%      Total cost of ownership for 12 users
Ecosystem integration    5%      Integration with existing tools (AWS, Docker, CI/CD)
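The mechanics of a weighted scorecard are simple enough to sketch in a few lines. The weights below are the ones from the table; the per-criterion 1-5 scores are placeholders for illustration only, not the evaluation's actual numbers.

```python
# Weights from Diane's table; must sum to 1.0
weights = {
    "setup_time": 0.10, "logging_api": 0.15, "ui_visualization": 0.20,
    "model_registry": 0.15, "collaboration": 0.15, "data_privacy": 0.10,
    "cost_tco": 0.10, "ecosystem": 0.05,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (1-5 scale) into a single total."""
    return sum(weights[k] * scores[k] for k in weights)

# Placeholder scores, purely to show the mechanics
mlflow_scores = {"setup_time": 2, "logging_api": 5, "ui_visualization": 3,
                 "model_registry": 5, "collaboration": 3, "data_privacy": 5,
                 "cost_tco": 4, "ecosystem": 4}
print(f"MLflow weighted score: {weighted_score(mlflow_scores):.2f}")
```

The point of pinning the weights down in advance is that the final number is mechanical: once the team agrees on the weights, no one can retroactively tilt the outcome by arguing about which criterion "really" matters.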

Week 1: Setup and First Impressions

The MLflow Team

The three MLflow evaluators --- Raj, Tomoko, and Omar --- start by provisioning infrastructure.

# MLflow setup on AWS (simplified)
# 1. RDS PostgreSQL instance for backend store
# 2. S3 bucket for artifact store
# 3. EC2 instance for the tracking server

# Install on the tracking server
pip install mlflow psycopg2-binary boto3

# Start the server
mlflow server \
    --backend-store-uri postgresql://mlflow:password@novabio-mlflow-db.us-east-1.rds.amazonaws.com:5432/mlflow \
    --default-artifact-root s3://novabio-ml-artifacts/experiments \
    --host 0.0.0.0 \
    --port 5000

Time to first tracked run: 3 hours. Most of that was infrastructure: creating the RDS instance, configuring security groups, setting up IAM roles for S3 access, and debugging a PostgreSQL connection timeout.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np

mlflow.set_tracking_uri("http://novabio-mlflow.internal:5000")
mlflow.set_experiment("compound-activity-eval")

X, y = make_classification(
    n_samples=10000, n_features=50, n_informative=20,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, "model")
    mlflow.set_tag("evaluator", "raj")

    print(f"AUC: {auc:.4f}")
AUC: 0.9647

Raj's notes: "Setup was painful. The tracking server itself is easy. The infrastructure around it --- database, object storage, networking, authentication --- is not trivial. But once it is running, the logging API is clean and intuitive."

The W&B Team

The three W&B evaluators --- Lin, Marcus, and Sofia --- start with a pip install.

pip install wandb
wandb login
# Paste API key from wandb.ai/settings

Time to first tracked run: 8 minutes. No infrastructure to provision. No database. No artifact store configuration.

import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(
    n_samples=10000, n_features=50, n_informative=20,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

wandb.init(
    project="compound-activity-eval",
    name="rf-baseline",
    config={"n_estimators": 200, "max_depth": 10, "random_state": 42},
    tags=["evaluator:lin", "model:rf"],
)

model = RandomForestClassifier(**wandb.config)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
wandb.log({"test_auc": auc})

wandb.finish()
print(f"AUC: {auc:.4f}")
AUC: 0.9647

Lin's notes: "Eight minutes from pip install to a run visible in the dashboard. The dashboard auto-generates system metrics, shows run duration, and creates comparison charts without configuration. First impression: this is designed for researchers who want to experiment, not engineers who want to build pipelines."

Observation --- The setup time difference (3 hours vs. 8 minutes) is real but misleading. MLflow's setup cost is paid once; W&B's operational cost is paid monthly. The correct comparison is total cost of ownership, not first-day friction.


Week 2: Running the Actual Experiment

Both teams run a 50-configuration hyperparameter search on the compound activity dataset.

import mlflow
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import train_test_split
import itertools

mlflow.set_tracking_uri("http://novabio-mlflow.internal:5000")
mlflow.set_experiment("compound-activity-xgb-search")

# Split a validation set off the Week 1 training data for early stopping
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

search_space = {
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [4, 5, 6, 7],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8],
}

# Generate all combinations (72), sample 50
all_configs = [dict(zip(search_space.keys(), v))
               for v in itertools.product(*search_space.values())]
np.random.seed(42)
configs = [all_configs[i] for i in np.random.choice(len(all_configs), 50, replace=False)]

with mlflow.start_run(run_name="xgb-search-50-configs") as parent:
    mlflow.set_tag("search_method", "random_subset_of_grid")
    mlflow.set_tag("total_configs", "50")

    best_auc = 0
    for i, config in enumerate(configs):
        with mlflow.start_run(
            run_name=f"config-{i+1:03d}",
            nested=True,
        ):
            mlflow.log_params(config)
            mlflow.log_param("n_estimators", 2000)
            mlflow.log_param("early_stopping_rounds", 40)

            model = xgb.XGBClassifier(
                n_estimators=2000,
                early_stopping_rounds=40,
                eval_metric="logloss",
                random_state=42,
                n_jobs=-1,
                **config,
            )
            model.fit(X_train, y_train,
                      eval_set=[(X_val, y_val)], verbose=False)

            y_proba = model.predict_proba(X_val)[:, 1]
            auc = roc_auc_score(y_val, y_proba)
            mlflow.log_metric("val_auc", auc)
            mlflow.log_metric("val_log_loss", log_loss(y_val, y_proba))
            mlflow.log_metric("best_iteration", model.best_iteration)

            if auc > best_auc:
                best_auc = auc

            if (i + 1) % 10 == 0:
                print(f"  Completed {i+1}/50 configs. Best AUC so far: {best_auc:.4f}")

    mlflow.log_metric("best_val_auc", best_auc)

print(f"\nSearch complete. Best validation AUC: {best_auc:.4f}")
  Completed 10/50 configs. Best AUC so far: 0.9712
  Completed 20/50 configs. Best AUC so far: 0.9728
  Completed 30/50 configs. Best AUC so far: 0.9735
  Completed 40/50 configs. Best AUC so far: 0.9741
  Completed 50/50 configs. Best AUC so far: 0.9741

Search complete. Best validation AUC: 0.9741

W&B: Bayesian Sweep

import wandb
import xgboost as xgb
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import train_test_split

# Same validation split as the MLflow team, for a fair comparison
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.15},
        "max_depth": {"values": [4, 5, 6, 7, 8]},
        "subsample": {"min": 0.6, "max": 0.95},
        "colsample_bytree": {"min": 0.6, "max": 0.95},
    },
}

def train_sweep():
    wandb.init()
    config = wandb.config

    model = xgb.XGBClassifier(
        learning_rate=config.learning_rate,
        max_depth=config.max_depth,
        subsample=config.subsample,
        colsample_bytree=config.colsample_bytree,
        n_estimators=2000,
        early_stopping_rounds=40,
        eval_metric="logloss",
        random_state=42,
        n_jobs=-1,
    )
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)], verbose=False)

    y_proba = model.predict_proba(X_val)[:, 1]
    wandb.log({
        "val_auc": roc_auc_score(y_val, y_proba),
        "val_log_loss": log_loss(y_val, y_proba),
        "best_iteration": model.best_iteration,
    })
    wandb.finish()

sweep_id = wandb.sweep(sweep_config, project="compound-activity-eval")
wandb.agent(sweep_id, function=train_sweep, count=50)

Sofia's notes: "The sweep dashboard updates in real time. I can see a parallel coordinates plot of all 50 runs while they are still running. The Bayesian optimizer is clearly concentrating on the promising regions of the search space --- the later runs cluster around learning_rate 0.03-0.06 and max_depth 6-7, which are the regions with the best AUC. I did not have to write any comparison code."

Raj's notes on MLflow: "I had to write the comparison code myself. The UI shows all 50 runs, and I can sort and filter, but the parallel coordinates plot requires selecting runs manually. It works, but it is not as fluid as what Sofia described."
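The comparison code Raj describes might look like the sketch below. Against a live tracking server, `mlflow.search_runs(experiment_names=["compound-activity-xgb-search"])` returns exactly this shape of DataFrame; a local stand-in is used here so the ranking logic is visible on its own.

```python
import pandas as pd

# Stand-in for the DataFrame mlflow.search_runs() returns: one row per
# run, params.* columns as strings, metrics.* columns as floats.
runs = pd.DataFrame({
    "run_id": ["a1", "b2", "c3", "d4"],
    "params.learning_rate": ["0.05", "0.1", "0.03", "0.05"],
    "params.max_depth": ["6", "4", "7", "5"],
    "metrics.val_auc": [0.9741, 0.9688, 0.9735, 0.9712],
})

# The manual comparison: rank runs by validation AUC and inspect the
# winning hyperparameters -- the work W&B's sweep dashboard does for free.
leaderboard = runs.sort_values("metrics.val_auc", ascending=False)
best = leaderboard.iloc[0]
print(leaderboard.to_string(index=False))
print(f"\nBest: lr={best['params.learning_rate']}, "
      f"depth={best['params.max_depth']}, AUC={best['metrics.val_auc']:.4f}")
```

Ten lines of pandas is not a hardship, but it is ten lines that every MLflow user writes and every W&B user does not.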


Week 3: Model Registry and Collaboration

MLflow Model Registry

import mlflow

client = mlflow.tracking.MlflowClient()

# Find the best run from the Week 2 search
experiment = client.get_experiment_by_name("compound-activity-xgb-search")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.val_auc > 0",  # skip the parent wrapper run
    order_by=["metrics.val_auc DESC"],
    max_results=1,
)[0]

result = mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="compound-activity-classifier",
)

# Assign alias
client.set_registered_model_alias(
    name="compound-activity-classifier",
    alias="champion",
    version=result.version,
)

# Load and verify
champion = mlflow.pyfunc.load_model(
    "models:/compound-activity-classifier@champion"
)
sample_predictions = champion.predict(X_test[:5])
print(f"Champion model: v{result.version}")
print(f"Sample predictions: {sample_predictions}")
Champion model: v1
Sample predictions: [1 0 1 1 0]

Tomoko's notes: "The Model Registry is MLflow's killer feature. Version management with aliases, the ability to trace any registered model back to its training run, and the standardized pyfunc loading API --- this is production-grade. W&B's registry is newer and less mature."

W&B Collaboration

Marcus's notes: "The collaboration features are where W&B shines in daily use. I can see Lin's runs and Sofia's runs in the same project dashboard without any configuration. I can add comments to specific runs. I can create a Report that embeds live charts and share it with Diane via a URL. On MLflow, sharing requires everyone to have network access to the tracking server, and there is no commenting or reporting built in."


Week 4: The Scorecard

Diane compiles the evaluation results.

Setup and Operations

Dimension               MLflow                                 W&B
Time to first run       3 hours                                8 minutes
Ongoing maintenance     ~4 hours/month (server, DB, backups)   0 (managed SaaS)
Infrastructure needed   PostgreSQL, S3, EC2                    None
Authentication          DIY (reverse proxy, SSO integration)   Built-in (SSO, RBAC)

Feature Comparison

Feature                      MLflow                                W&B                                  Winner
Parameter/metric logging     Excellent                             Excellent                            Tie
Artifact logging             Excellent                             Excellent                            Tie
Autologging                  Supported (sklearn, xgboost, etc.)    Supported (similar)                  Tie
UI: Run comparison           Good (sortable table, basic charts)   Excellent (interactive, real-time)   W&B
UI: Parallel coordinates     Available, manual selection           Automatic, interactive               W&B
UI: Custom dashboards        Limited                               Extensive (drag-and-drop panels)     W&B
Model Registry               Mature (aliases, lineage, pyfunc)     Newer, growing                       MLflow
Hyperparameter sweeps        External (Optuna, Hyperopt)           Built-in (Bayesian, grid, random)    W&B
System metrics (GPU, CPU)    Optional (mlflow.system_metrics)      Automatic                            W&B
Reports/sharing              Not built-in                          Built-in Reports                     W&B
Reproducibility (env)        MLflow Projects (conda/docker)        Limited                              MLflow
Deployment integration       MLflow Models, SageMaker, Azure ML    Limited                              MLflow
API for programmatic access  Comprehensive                         Comprehensive                        Tie

Cost Analysis (3-Year, 12 Users)

MLflow (self-hosted on AWS):
  RDS PostgreSQL (db.t3.medium):      $1,580/year
  S3 storage (500 GB, growing):       $140/year
  EC2 instance (t3.medium):           $410/year
  DevOps time (4 hr/mo * $80/hr):     $3,840/year
  Total Year 1:                       $5,970
  Total 3-Year:                       ~$19,000

W&B Teams (12 users * $50/user/month):
  Subscription:                       $7,200/year
  Infrastructure:                     $0
  DevOps time:                        $0
  Total Year 1:                       $7,200
  Total 3-Year:                       $21,600

Difference: W&B costs ~$2,600 more over 3 years

Important caveat: The MLflow cost estimate assumes a relatively simple setup. If NovaBio needs high availability, automated backups with point-in-time recovery, a load balancer, and monitoring for the MLflow server itself, the infrastructure cost rises significantly. The W&B cost includes all of this by default.
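The cost model above is easy to parameterize, which makes it easy to stress-test. The sketch below uses the line items from the tables (it omits the storage-growth allowance behind the ~$19,000 figure, so its MLflow total comes out slightly lower), and shows what happens under Lin's pessimistic maintenance scenario from the dissent later in this case study.

```python
# Line items from the cost tables above; storage growth is not modeled,
# so the MLflow total here is a lower bound on the ~$19,000 estimate.
def mlflow_tco(years=3, devops_hours_per_month=4, devops_rate=80):
    infra = 1580 + 140 + 410                        # RDS + S3 + EC2 per year
    devops = devops_hours_per_month * 12 * devops_rate
    return years * (infra + devops)

def wandb_tco(years=3, users=12, per_user_month=50):
    return years * users * per_user_month * 12

print(f"MLflow 3-year TCO:  ${mlflow_tco():,}")
print(f"W&B 3-year TCO:     ${wandb_tco():,}")
print(f"Difference:         ${wandb_tco() - mlflow_tco():,}")

# Lin's scenario: if the real DevOps burden is 2-3x the estimate,
# the comparison flips in W&B's favor.
print(f"MLflow at 10 hr/mo: ${mlflow_tco(devops_hours_per_month=10):,}")
```

The interesting property of this model is its sensitivity: the DevOps line dominates the MLflow total, so the entire cost argument hinges on the accuracy of the "4 hours per month" estimate.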

Data Privacy Analysis

MLflow:
  - All data stays on NovaBio's AWS account
  - Full control over encryption, access, retention
  - HIPAA/GxP compliance achievable with standard AWS controls
  - No third-party data exposure

W&B (Cloud):
  - Experiment metadata and artifacts stored on W&B servers
  - W&B offers SOC 2 Type II certification
  - BAA available for healthcare (W&B Enterprise)
  - Self-hosted option (W&B Server) requires enterprise license

W&B (Self-Hosted Enterprise):
  - Data stays on NovaBio's infrastructure
  - Enterprise license required (pricing: contact sales, typically $200+/user/month)
  - Maintains W&B's UI and features
  - Total 3-year cost: ~$86,000+ (12 users)

The Recommendation

Diane's report to the CTO:

Primary recommendation: MLflow for production, with W&B for exploration (optional).

The reasoning:

1. Data privacy is non-negotiable. NovaBio handles proprietary compound data. Sending experiment metadata --- which includes feature names, model parameters, and performance metrics on proprietary datasets --- to a third-party SaaS platform requires legal review and a data processing agreement. MLflow eliminates this concern entirely.

2. The Model Registry is critical for our deployment pipeline. NovaBio is building a CI/CD pipeline for model deployment. MLflow's Model Registry, with aliases and the pyfunc loading API, integrates directly with our Docker-based deployment. W&B's registry is usable but less mature.

3. The cost difference is small, but the risk profile is different. MLflow's cost is infrastructure (predictable, controlled). W&B's cost is a subscription (subject to pricing changes). For a 12-person team, the difference is negligible. But if the team grows to 30, W&B costs scale linearly while MLflow's infrastructure costs grow sublinearly.

4. W&B's UI is genuinely better for exploration. For hyperparameter searches, model comparison, and creating reports for stakeholders, W&B is more pleasant to use. The recommendation: use W&B's free tier for individual exploration during development, and use MLflow as the system of record for experiments that matter.

5. The setup cost is real but one-time. Three hours of initial setup plus four hours per month of maintenance is a modest investment for infrastructure that serves the entire team indefinitely.

The Dissenting View

Lin disagrees with the recommendation and writes a rebuttal. Her argument: "The four hours per month of MLflow maintenance is optimistic. When the PostgreSQL instance needs a version upgrade, when the S3 bucket hits a lifecycle policy conflict, when the EC2 instance needs a security patch, when the MLflow version upgrade breaks the database schema --- these events are unpredictable and each takes 2-8 hours to resolve. Over three years, the true DevOps burden of self-hosted MLflow is 2-3x the estimate. W&B eliminates this entirely."

Diane includes Lin's rebuttal in the report. She is right about the maintenance risk. The recommendation stands because data privacy requirements outweigh operational convenience, but the rebuttal ensures the CTO understands the tradeoff with eyes open.


Implementation Plan

Diane's final deliverable is a four-week migration plan:

Week 1: Provision MLflow infrastructure (PostgreSQL, S3, EC2). Configure authentication via the company's existing reverse proxy.

Week 2: Migrate two existing projects to MLflow. Create naming conventions and tagging standards. Write a log_experiment() utility function that wraps common logging patterns.

Week 3: Set up the Model Registry. Register the three models currently in production. Build the CI/CD integration that loads registered models by alias.

Week 4: Team training (half-day workshop). Document the standards. Run a parallel experiment where every team member tracks their current work in MLflow for one week.

# The utility function Diane's team builds for standardized logging
def log_experiment(
    model,
    X_train, y_train,
    X_val, y_val,
    params,
    data_version,
    run_name=None,
    tags=None,
    register_as=None,
):
    """
    Standardized experiment logging for NovaBio.

    Logs parameters, validation metrics, data metadata,
    the model artifact, and optionally registers the model.
    """
    import mlflow
    import hashlib
    import pandas as pd
    from sklearn.metrics import roc_auc_score, f1_score, log_loss

    with mlflow.start_run(run_name=run_name):
        # Parameters
        mlflow.log_params(params)

        # Data metadata
        data_hash = hashlib.sha256(
            pd.util.hash_pandas_object(X_train).values.tobytes()
        ).hexdigest()[:16]
        mlflow.set_tag("data_version", data_version)
        mlflow.set_tag("data_hash", data_hash)
        mlflow.set_tag("train_rows", str(len(X_train)))
        mlflow.set_tag("val_rows", str(len(X_val)))
        mlflow.set_tag("feature_count", str(X_train.shape[1]))

        # Custom tags
        if tags:
            for key, value in tags.items():
                mlflow.set_tag(key, value)

        # Evaluate
        y_proba = model.predict_proba(X_val)[:, 1]
        y_pred = model.predict(X_val)
        metrics = {
            "val_auc": roc_auc_score(y_val, y_proba),
            "val_f1": f1_score(y_val, y_pred),
            "val_log_loss": log_loss(y_val, y_proba),
        }
        mlflow.log_metrics(metrics)

        # Model artifact
        model_flavor = type(model).__module__.split(".")[0]
        if hasattr(mlflow, model_flavor):
            getattr(mlflow, model_flavor).log_model(
                model,
                artifact_path="model",
                registered_model_name=register_as,
            )
        else:
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name=register_as,
            )

        # Feature importance (if available)
        if hasattr(model, "feature_importances_"):
            importance = pd.DataFrame({
                "feature": X_train.columns,
                "importance": model.feature_importances_,
            }).sort_values("importance", ascending=False)
            importance.to_csv("feature_importance.csv", index=False)
            mlflow.log_artifact("feature_importance.csv")

        return metrics
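The Week 3 deliverable, the CI/CD integration that loads registered models by alias, might be sketched as follows. The URI scheme is MLflow's standard `models:/<name>@<alias>` form used earlier in this case study; the wrapper function is a hypothetical name for NovaBio's pipeline, not an MLflow API.

```python
def model_uri(name: str, alias: str = "champion") -> str:
    """Build MLflow's alias-based model URI: models:/<name>@<alias>."""
    return f"models:/{name}@{alias}"

def load_production_model(name: str, alias: str = "champion"):
    """Resolve a registered model by alias. The CI/CD job calls this at
    image build time, so a deployment always picks up whichever version
    currently holds the alias, with lineage back to its training run."""
    import mlflow  # deferred so the URI helper stays dependency-free
    return mlflow.pyfunc.load_model(model_uri(name, alias))

# In the deployment pipeline (against the live tracking server):
#   model = load_production_model("compound-activity-classifier")
#   predictions = model.predict(batch_df)
print(model_uri("compound-activity-classifier"))
```

Because the pipeline resolves the alias at build time rather than hard-coding a version number, promoting a new model is a single registry operation: move the `champion` alias, redeploy, done.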

Outcome

Six months after the migration:

  • The team has 847 tracked runs across 14 experiments
  • Five models are registered in the Model Registry, three in production
  • The average time to answer "what hyperparameters produced this model?" dropped from 30 minutes (spreadsheet) to 10 seconds (MLflow query)
  • The compound activity classifier improved by 1.2% AUC because the team could systematically compare configurations instead of guessing
  • Lin still uses W&B for her personal exploration. Nobody minds. The results end up in MLflow when they matter

The spreadsheet was archived. Nobody has opened it since.


This case study compares MLflow and W&B for team adoption. Return to the chapter for the full experiment tracking framework.