Case Study 1: Uber's Michelangelo — Building an ML Platform at Scale


The Challenge: ML Without a Platform

In 2015, Uber was one of the fastest-growing technology companies in the world, expanding across hundreds of cities, processing millions of rides per day, and confronting business problems that were fundamentally prediction problems: How long will this ride take? What should the fare be? Where will demand spike in the next 30 minutes? Is this transaction fraudulent? Which drivers are most likely to churn?

The answers to these questions were — and are — worth billions. But Uber's ability to turn data science experiments into production systems was not keeping pace with its ambition.

At the time, Uber's ML efforts were fragmented. Individual engineering teams built bespoke solutions for each use case. The Marketplace team had its own model training infrastructure. The Safety team had a separate pipeline for fraud detection. The Maps team used different tools entirely for ETA prediction. Each team had independently solved — or more often, partially solved — the same set of problems: How do you manage training data? How do you train and evaluate models? How do you deploy models to production? How do you monitor them once deployed?

The result was what Uber's engineering leaders described as "ML anarchy." Duplicated effort. Inconsistent practices. Models that took months to deploy. Engineers reinventing infrastructure that already existed elsewhere in the company. And, critically, no shared language or process for talking about ML systems — each team used different tools, different terminology, and different standards.

Jeremy Hermann and Mike Del Balso, engineers in Uber's ML infrastructure group, characterized the situation bluntly: Uber was building world-class models on top of chaotic infrastructure. The models were excellent. Everything else was unsustainable.


The Decision: A Unified ML Platform

In 2016, Uber launched Michelangelo — an internal ML platform designed to standardize how the entire company built, trained, deployed, monitored, and managed machine learning models. The name, a nod to the Renaissance artist, reflected an ambitious goal: to provide tools that would let data scientists and engineers create masterpieces without having to sculpt the marble first.

The design principles were clear from the outset:

1. End-to-end coverage. Michelangelo would handle the full ML lifecycle — from data management and feature engineering through model training, evaluation, deployment, and monitoring. Data scientists should be able to go from idea to production using a single platform.

2. Standardization without rigidity. The platform would enforce consistent workflows and interfaces while remaining flexible enough to support different algorithms, frameworks, and use cases. A fraud detection model and a pricing model had different requirements, but they should follow the same deployment process.

3. Scalability as a design constraint. Every component had to operate at Uber's scale — training on datasets with billions of rows, serving predictions with single-digit millisecond latency, and supporting thousands of models across hundreds of teams.

4. Self-service for data scientists. Data scientists should be able to train, evaluate, and deploy models without writing deployment code or opening engineering tickets. The platform, not the engineer, should handle the operational complexity.


The Architecture: How Michelangelo Works

Data Management and Feature Engineering

Michelangelo's data management layer addressed one of the most painful aspects of Uber's pre-platform era: feature engineering. Before Michelangelo, data scientists at Uber spent significant effort writing and maintaining feature pipelines — and the same features were often re-engineered by different teams.

Michelangelo introduced a centralized Feature Store (one of the earliest large-scale implementations of the concept) with two components:

  • An offline feature store backed by Hive/HDFS for batch training. Historical feature values were stored in a time-partitioned format, enabling accurate reconstruction of feature values at any point in time — critical for training models without data leakage.

  • An online feature store backed by Cassandra for real-time serving. The most recent feature values were available with low-latency reads, enabling real-time models (like ETA prediction) to access features within milliseconds.

Data scientists could define features once and have them available in both offline and online stores. Features were versioned, documented, and searchable — a data scientist working on driver churn could discover that the Marketplace team had already computed "driver completion rate over 28 days" and reuse it rather than recomputing it.
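The point-in-time reconstruction that makes the offline store safe for training can be sketched in a few lines. This is an illustrative toy, not Uber's API: the feature name, the `feature_as_of` helper, and the data are all hypothetical, but the rule it implements is the one described above — for each training label, use the latest feature value recorded at or before the label's timestamp, never a future one.

```python
# A minimal sketch (not Uber's API) of the point-in-time lookup an offline
# feature store performs: for each training label, use the latest feature
# value recorded *at or before* the label's timestamp, so no future data
# leaks into the training set.
from bisect import bisect_right
from datetime import date

# Time-partitioned feature log per entity, as an offline store might hold it.
# (feature name and values are hypothetical)
feature_log = {
    1: [(date(2019, 1, 1), 0.91), (date(2019, 2, 1), 0.88)],
    2: [(date(2019, 1, 15), 0.75), (date(2019, 2, 10), 0.80)],
}

def feature_as_of(driver_id, ts):
    """Return the latest 'completion rate' value at or before ts."""
    log = feature_log[driver_id]
    idx = bisect_right([t for t, _ in log], ts) - 1
    return log[idx][1] if idx >= 0 else None

# Labeled training events: (driver_id, label timestamp, churned?)
labels = [(1, date(2019, 2, 5), 0), (2, date(2019, 2, 1), 1)]

training_rows = [(d, feature_as_of(d, ts), y) for d, ts, y in labels]
print(training_rows)  # driver 2's row uses the Jan 15 value, not Feb 10
```

The online store answers the same query with `ts = now`, which is exactly why a single feature definition can serve both paths without skew.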

Business Insight. Uber estimated that the Feature Store alone saved hundreds of hours of engineering time per quarter by eliminating duplicated feature engineering work. But the more significant benefit was consistency: the same feature definition was guaranteed to produce the same values in training and serving, eliminating the training-serving skew that had caused subtle production bugs across multiple teams.

Model Training

Michelangelo supported multiple training frameworks — initially XGBoost and TensorFlow, later extending to PyTorch and other frameworks — through a unified training API. Data scientists defined their model configuration (algorithm, hyperparameters, features, training data) in a standard format, and the platform handled the rest: provisioning compute resources, executing training jobs, logging metrics, and storing model artifacts.

Key training capabilities included:

  • Distributed training for large datasets that didn't fit on a single machine
  • Hyperparameter search using grid search and Bayesian optimization
  • Experiment tracking that automatically logged every training run with its configuration, metrics, and artifacts
  • Partitioned models that allowed training separate models for different segments (e.g., separate ETA models for different cities) while treating them as a single logical model
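The "standard format" for a training job can be pictured as a small declarative spec plus platform-side validation. The field names below are illustrative guesses, not Michelangelo's actual schema, but they cover the elements the text lists: algorithm, hyperparameters, features, and partitioning.

```python
# A hypothetical sketch of a declarative training configuration of the kind
# a unified training API might accept; all field names are illustrative.
train_config = {
    "model_type": "xgboost",               # or "tensorflow", "pytorch", ...
    "label": "eta_seconds",
    "features": [                          # resolved from the Feature Store
        "trip.distance_km",
        "driver.completion_rate_28d",
        "city.demand_index",
    ],
    "hyperparameters": {"max_depth": 8, "eta": 0.1, "num_round": 500},
    "hyperparameter_search": {"method": "bayesian", "max_trials": 50},
    "partition_by": "city_id",             # one sub-model per city, served
}                                          # as a single logical model

def validate(config):
    """Platform-side sanity check before compute is provisioned."""
    required = {"model_type", "label", "features", "hyperparameters"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    return True

print(validate(train_config))  # True
```

The point of the declarative form is that everything operational — provisioning, execution, metric logging, artifact storage — can be derived from the spec rather than written by the data scientist.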

Model Evaluation

Every trained model went through an automated evaluation pipeline that computed standard metrics (accuracy, precision, recall, AUC, RMSE) and generated visual reports that data scientists could use to compare model versions. The evaluation pipeline also supported:

  • Segmented evaluation — performance broken down by key dimensions (city, time of day, customer segment), ensuring that a model that performed well on average wasn't hiding poor performance on specific subgroups
  • Feature importance analysis — identifying which features contributed most to predictions
  • Baseline comparison — automatic comparison against the current production model (champion-challenger)
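Segmented evaluation is simple to state but easy to skip; a sketch makes the payoff concrete. The data and city names below are invented, and RMSE stands in for whichever metric a given model uses.

```python
# A minimal sketch of segmented evaluation: compute the metric per segment
# (here, per city) so a good global average cannot hide a weak subgroup.
# Data and segment names are illustrative.
from collections import defaultdict

def rmse(pairs):
    return (sum((y - p) ** 2 for y, p in pairs) / len(pairs)) ** 0.5

# (city, actual_eta_seconds, predicted_eta_seconds)
records = [
    ("sf", 300, 310), ("sf", 420, 400),
    ("nyc", 600, 540), ("nyc", 480, 560),
]

by_city = defaultdict(list)
for city, actual, predicted in records:
    by_city[city].append((actual, predicted))

report = {city: round(rmse(pairs), 1) for city, pairs in by_city.items()}
print(report)  # the nyc segment is markedly worse than sf
```

A model whose global RMSE looked acceptable here would still fail the segmented report, which is precisely the failure mode the pipeline is designed to surface.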

Model Deployment

Deployment was the capability that most differentiated Michelangelo from Uber's pre-platform approach. Before Michelangelo, deploying a model required weeks of engineering work — writing a serving application, containerizing it, provisioning infrastructure, configuring load balancers, and setting up monitoring.

With Michelangelo, deployment was a button click — or more precisely, a configuration change. The platform supported three serving modes:

  • Offline batch prediction: Run the model against a large dataset (e.g., score all drivers for churn risk) on a scheduled basis
  • Online prediction: Serve the model as a low-latency RPC endpoint for real-time predictions (e.g., ETA prediction for each ride request)
  • Near-real-time prediction: Process events from a Kafka stream and write predictions to a data store

The platform handled model packaging, container creation, deployment to Uber's cluster infrastructure, load balancing, autoscaling, and health checking — all transparently.
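"Deployment as a configuration change" can be sketched as a small spec handed to the platform, which maps it onto one of the three serving modes. Everything here is hypothetical — the field names, the `deploy` function, and the rollout-plan shape are illustrations of the idea, not Michelangelo's interface.

```python
# A hypothetical sketch of deployment-as-configuration: the data scientist
# supplies a small spec; the platform, not the user, decides packaging,
# scaling, and health checks. All names are illustrative.
SERVING_MODES = {"batch", "online", "streaming"}

def deploy(spec):
    """Validate a deploy spec and return a (pretend) rollout plan."""
    if spec["mode"] not in SERVING_MODES:
        raise ValueError(f"unknown serving mode: {spec['mode']}")
    plan = {"model": spec["model_version"], "mode": spec["mode"]}
    if spec["mode"] == "online":
        # Low-latency RPC endpoint: replicas, autoscaling, health checks.
        plan["replicas"] = spec.get("min_replicas", 2)
        plan["health_check"] = "/ping"
    elif spec["mode"] == "batch":
        # Scheduled scoring of a large dataset.
        plan["schedule"] = spec.get("schedule", "daily")
    else:
        # Streaming: consume events, write predictions to a store.
        plan["source_topic"] = spec["source_topic"]
    return plan

plan = deploy({"model_version": "churn-v7", "mode": "batch"})
print(plan)  # {'model': 'churn-v7', 'mode': 'batch', 'schedule': 'daily'}
```

The same model artifact flows through any of the three branches; only the spec changes, which is what made redeployment cheap enough to be routine.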

Model Monitoring

Michelangelo included a monitoring system that tracked both operational metrics (latency, throughput, error rates) and model quality metrics (prediction distributions, feature distributions, and accuracy when ground truth became available). Alerts could be configured for any metric, and dashboards provided a unified view of all deployed models.
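One concrete example of a model-quality check such a system can run is a drift score comparing a feature's training-time distribution against its recent serving distribution. The sketch below uses the Population Stability Index (PSI), a standard drift statistic; the binning scheme, smoothing, and the 0.25 "investigate" threshold are common rules of thumb, not Uber-specific values.

```python
# A minimal drift check a monitoring system might run: Population Stability
# Index (PSI) between a feature's training distribution and its recent
# serving distribution. Binning and thresholds are common conventions,
# not Uber-specific.
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """PSI over fixed bins; larger values mean a bigger distribution shift."""
    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # Smooth empty bins to avoid log(0).
        return [max(c, 1) / max(len(xs), 1) for c in counts]
    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_dist = [i / 100 for i in range(100)]                   # roughly uniform
serve_dist = [min(0.2 + i / 200, 0.99) for i in range(100)]  # shifted, narrow

score = psi(train_dist, serve_dist)
print(f"PSI = {score:.2f}")  # > 0.25 is a common "investigate" threshold
```

Wired to an alert, a check like this flags drift in prediction or feature distributions long before delayed ground truth reveals an accuracy drop.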


The Impact: From Anarchy to Platform

By 2019, Michelangelo supported thousands of models across virtually every engineering team at Uber. The impact was measurable across several dimensions:

Deployment velocity. The time from model training to production deployment dropped from weeks or months to hours. Data scientists could deploy models without opening a single engineering ticket.

Model proliferation. The number of ML models in production grew by an order of magnitude. Use cases that were previously not economically viable (because deployment cost exceeded expected value) became feasible when deployment was nearly free.

Feature reuse. The Feature Store contained thousands of features, with high reuse rates across teams. Features engineered by one team were discoverable and consumable by any other team.

Operational reliability. Standardized deployment, monitoring, and rollback procedures dramatically reduced the frequency and severity of production incidents. When issues occurred, they could be diagnosed and resolved faster because the infrastructure was consistent.

Organizational alignment. A shared platform created a shared vocabulary. Teams across Uber could discuss ML systems using common concepts — "features," "model versions," "serving modes," "evaluation reports" — reducing miscommunication and enabling cross-team collaboration.


The Challenges: What Was Hard

For all its success, Michelangelo faced significant challenges.

1. Adoption Resistance

Not every team was eager to adopt Michelangelo. Teams that had already invested in custom infrastructure were reluctant to migrate — they had working systems, and migration carried risk. Uber's ML platform team had to demonstrate concrete value (faster deployment, better monitoring, reduced operational burden) to overcome institutional inertia.

The strategy that worked: identify high-visibility use cases, migrate them to Michelangelo, demonstrate the improvements, and use those successes to motivate adoption by other teams. Top-down mandates were less effective than bottom-up demonstrations.

2. Flexibility vs. Standardization

Some teams had requirements that didn't fit neatly into Michelangelo's standardized workflows. Research teams exploring novel architectures needed more flexibility than the platform initially offered. The platform team learned to distinguish between "standardize this" (deployment, monitoring, feature management) and "leave this flexible" (model architecture, training procedures, evaluation metrics).

3. The Feature Store Governance Problem

As the Feature Store grew, governance became a challenge. Who owned a feature? What happened when a feature definition needed to change? How did you deprecate a feature that multiple models depended on? These were organizational problems as much as technical ones, and they required clear ownership policies, communication protocols, and impact analysis tools.

4. Cost at Scale

Running Michelangelo at Uber's scale was expensive — the compute infrastructure for training thousands of models and serving millions of predictions per day represented a significant cost center. Cost attribution (which team's models are consuming how much compute?) and cost optimization became ongoing concerns.

5. Keeping Pace with the Field

The ML ecosystem evolves rapidly. New frameworks, new algorithms, new serving technologies emerge constantly. Michelangelo had to continuously evolve to support new capabilities without breaking existing workflows — a classic platform engineering challenge.


Organizational Lessons

Uber's Michelangelo experience offers several lessons that apply to any organization building MLOps capability, regardless of scale:

Lesson 1: The platform must earn adoption, not mandate it. Top-down mandates to "use the platform" generate compliance, not commitment. The platform must demonstrably reduce friction for data scientists — if using the platform is harder than the ad hoc approach, the platform will be circumvented.

Lesson 2: Start with the deployment bottleneck. Uber prioritized deployment automation because that was where the most time was being wasted. Feature stores, experiment tracking, and monitoring were added later. Organizations should invest in whatever is currently the bottleneck — and the bottleneck is almost always deployment.

Lesson 3: Feature stores are transformative but require governance. Centralized feature management eliminates duplication and inconsistency, but it introduces governance requirements — ownership, versioning, deprecation policies — that must be addressed from the beginning.

Lesson 4: Standardize the operational layer, not the scientific layer. Data scientists need flexibility in how they build models. They do not need flexibility in how models are deployed, monitored, and maintained. The operational layer is where standardization delivers the most value and encounters the least resistance.

Lesson 5: Platform teams are infrastructure teams, not product teams. The Michelangelo team's customers were internal data scientists and engineers. The platform's success was measured not by its own features, but by the velocity and reliability of the ML models built on top of it.


Connection to Chapter Concepts

Michelangelo illustrates nearly every concept introduced in Chapter 12:

  • The deployment gap (Section 12.1): Uber's pre-platform state — where months were spent deploying individual models — is a large-scale example of the gap described.
  • The three pillars of MLOps (Section 12.2): Michelangelo explicitly manages data (Feature Store), models (Model Registry), and code (training and serving pipelines).
  • Model serving patterns (Section 12.3): The platform supports batch, online, and streaming prediction — matching serving patterns to business needs.
  • Feature stores (Section 12.5): Uber's Feature Store is one of the foundational implementations of the concept, with online and offline components.
  • Monitoring (Section 12.7): The monitoring subsystem tracks both operational and model quality metrics.
  • MLOps maturity (Section 12.11): Uber's journey from "ML anarchy" to Michelangelo represents a progression from Level 0 to Level 2 at massive scale.

Discussion Questions

  1. Uber built Michelangelo as an internal platform because no adequate external solution existed at the time (2016). Today, cloud providers offer managed ML platforms (SageMaker, Vertex AI, Azure ML) with similar capabilities. If Uber were starting today, should they build or buy? What factors would inform this decision?

  2. The Feature Store was one of Michelangelo's most impactful components, but it also required the most organizational governance. At what scale does a Feature Store justify its governance overhead? What would you recommend for a company with 5 models versus 50 models?

  3. Uber's adoption strategy prioritized bottom-up demonstrations over top-down mandates. Under what circumstances might a top-down mandate be more effective? What risks does each approach carry?

  4. Michelangelo standardized deployment and monitoring but left model architecture flexible. Where is the right boundary between standardization and flexibility for your organization?

  5. The Michelangelo team measured success by the velocity and reliability of models deployed on the platform — not by the platform's own features. How does this "internal customer" mindset compare to the product management approaches discussed in the textbook's later chapters (Chapter 33)?


Sources and Further Reading

  • Hermann, J., & Del Balso, M. (2017). "Meet Michelangelo: Uber's Machine Learning Platform." Uber Engineering Blog.
  • Hermann, J., Del Balso, M., et al. (2018). "Scaling Machine Learning at Uber with Michelangelo." Uber Engineering Blog.
  • Li, E. (2019). "Michelangelo PyML: Introducing Uber's Platform for Rapid Python ML Model Development." Uber Engineering Blog.
  • Del Balso, M. (2018). "Michelangelo: Uber's Machine Learning Platform." Presentation at ML Platform meetup.
  • Uber Engineering. (2019). "Evolving Michelangelo Model Representation for Flexibility at Scale." Uber Engineering Blog.