Case Study 2: Booking.com — 150 Teams, One ML Platform


The Context: A Company Built on Experimentation

Booking.com is one of the world's largest online travel platforms, offering accommodations, flights, and experiences in over 220 countries and territories. It lists more than 28 million reported accommodation listings and processes hundreds of millions of searches per day. The company's culture is famously — almost obsessively — data-driven.

What makes Booking.com unusual among large technology companies is not its scale, but its experimentation culture. At any given moment, the platform runs over 1,000 concurrent A/B tests. Every product change, every algorithmic update, every UX modification is tested against a control group before being rolled out. This culture of rigorous experimentation extends to machine learning: no ML model is deployed based on offline metrics alone. Every model must prove its value through a live A/B test against the incumbent system.

This case study examines how Booking.com scaled ML across 150 product teams, built a platform that democratized model deployment, and learned — through a series of hard-won lessons — that the relationship between offline model performance and online business impact is far more complex than most organizations assume.


The Early Days: ML as Artisanal Craft

In the early 2010s, machine learning at Booking.com was the domain of a small, specialized team. A handful of data scientists and ML engineers built and maintained the models that powered search ranking, pricing suggestions, and recommendation features. The models were sophisticated — Booking.com had access to enormous datasets and the technical talent to exploit them — but the process of building and deploying models was artisanal.

Each model had its own infrastructure. Each deployment required custom engineering. Each monitoring setup was bespoke. The specialized ML team was a bottleneck — product teams across the company wanted ML features, but they had to wait for the central team to build them.

Theo Parveen, a senior ML engineer at Booking.com, described the situation in a talk at the 2019 KDD conference: "We had maybe ten models in production, and each one was a special snowflake. We knew the limitations. But the demand for ML was growing much faster than our ability to deliver it."

The company faced a strategic choice: continue with a centralized ML team that built models for the rest of the company, or build a platform that enabled the 150+ product teams to build and deploy their own models. They chose the platform.


The Platform: Democratizing Model Deployment

Booking.com's ML platform — developed iteratively from approximately 2017 onward — was designed around a core principle: make it easy for product teams to deploy models, but make it hard for them to deploy bad models.

The Self-Service Training Pipeline

The platform provided standardized training pipelines that product teams could use with minimal configuration. A product team could:

  1. Define features by selecting from a catalog of pre-computed features in the feature store or by defining new features using a standard templating system
  2. Specify the training data using a time-range-based query system that automatically handled train/validation/test splits with proper temporal ordering (preventing data leakage)
  3. Select an algorithm from a library of supported model types (gradient-boosted trees, logistic regression, deep learning models, and custom implementations)
  4. Launch training with a single command, which provisioned compute resources, executed the training job, logged all metrics and artifacts, and registered the trained model

The key insight was abstraction without elimination. Product teams didn't need to manage infrastructure, but they did need to understand what they were building. The platform enforced guardrails — temporal data splits, minimum dataset sizes, standard evaluation metrics — that prevented common mistakes without requiring expertise in MLOps.
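The temporal-split guardrail in step 2 can be sketched in a few lines. This is a minimal illustration under assumed names (the function, the row schema, and the `event_time` field are hypothetical, not Booking.com's implementation): rows are partitioned strictly by event time, so nothing in the validation or test set predates the training cutoff.

```python
from datetime import datetime

def temporal_split(rows, train_end, valid_end):
    """Split time-stamped rows into train/validation/test sets.

    Partitioning strictly by event time means no future example can
    leak into training -- the mistake the platform's guardrail prevents.
    """
    train = [r for r in rows if r["event_time"] < train_end]
    valid = [r for r in rows if train_end <= r["event_time"] < valid_end]
    test = [r for r in rows if r["event_time"] >= valid_end]
    return train, valid, test

# Twelve monthly rows for 2019, split at September and November.
rows = [{"event_time": datetime(2019, m, 1), "label": m % 2} for m in range(1, 13)]
train, valid, test = temporal_split(rows, datetime(2019, 9, 1), datetime(2019, 11, 1))
print(len(train), len(valid), len(test))  # 8 2 2
```

A random shuffle-based split over the same rows would mix future and past examples, which is exactly the leakage the platform's query system is described as preventing.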

The Deployment Gateway

Perhaps the most distinctive feature of Booking.com's platform was its deployment gateway — a formalized process that every model had to pass through before reaching production. The gateway was not a rubber stamp; it was a rigorous validation step that combined automated checks with human review.

Automated checks included:

  • Model performance above minimum thresholds on standard metrics
  • No performance degradation on protected subgroups (fairness checks)
  • Inference latency within acceptable bounds
  • Model size within deployment constraints
  • Feature availability confirmed in the online feature store
  • No training-serving skew detected (features computed consistently)

Human review included:

  • A review by a member of the central ML team for models in high-impact contexts (search ranking, pricing)
  • A review of the A/B test design — specifically the hypothesis, the success metrics, the sample size calculation, and the expected minimum detectable effect
  • For models affecting user experience, a review by a UX researcher to ensure the model's behavior aligned with user expectations

Business Insight. Booking.com's deployment gateway embodies a principle applicable to any organization: make it easy to deploy models, but make it hard to deploy models without proper validation. The gateway added time to the deployment process (typically 1-2 days for the automated checks and review), but it prevented a much larger cost: production models that degraded user experience or produced misleading business results.
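The automated portion of such a gateway can be sketched as a simple check runner. Everything here is illustrative: the metric names, the thresholds, and the report structure are assumptions for the sketch, not Booking.com's actual values.

```python
def run_gateway_checks(model_report):
    """Run the automated portion of a hypothetical deployment gateway.

    `model_report` is an assumed dict of metrics emitted by the training
    pipeline; all thresholds are placeholder values for illustration.
    """
    checks = {
        "auc_above_threshold": model_report["auc"] >= 0.70,
        "no_subgroup_degradation": min(model_report["subgroup_aucs"].values())
            >= model_report["auc"] - 0.05,   # fairness: no subgroup far below overall
        "latency_ok": model_report["p99_latency_ms"] <= 50,
        "model_size_ok": model_report["model_size_mb"] <= 500,
        "features_online": all(model_report["feature_online_status"].values()),
        "no_serving_skew": model_report["max_feature_skew"] <= 0.01,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return len(failed) == 0, failed

report = {
    "auc": 0.74,
    "subgroup_aucs": {"mobile": 0.72, "desktop": 0.75},
    "p99_latency_ms": 31,
    "model_size_mb": 120,
    "feature_online_status": {"price_rank": True, "review_score": True},
    "max_feature_skew": 0.004,
}
ok, failed = run_gateway_checks(report)
print(ok, failed)  # True []
```

The design choice worth noting is that the gateway returns the full list of failed checks rather than stopping at the first one, so a team sees every blocking issue in a single pass.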

The A/B Testing Requirement

No model at Booking.com was deployed to 100 percent of users based on offline evaluation alone. Every model deployment was structured as an A/B test:

  1. The new model served a randomly selected treatment group (typically 10-50 percent of traffic, depending on the expected effect size and risk tolerance)
  2. The incumbent system (rule-based or previous model) served the control group
  3. The experiment ran for a statistically determined minimum duration (typically 2-4 weeks)
  4. Success was measured by business metrics — conversion rate, revenue per visitor, customer satisfaction scores — not by model accuracy or AUC
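The sample-size calculation reviewed at the gateway, which in turn determines the experiment's minimum duration, can be approximated with the standard normal-approximation formula for a two-proportion test. The baseline conversion rate, minimum detectable effect, and traffic volume below are hypothetical numbers chosen for the sketch:

```python
import math

def samples_per_arm(baseline_rate, mde_relative):
    """Approximate per-arm sample size for a two-proportion A/B test.

    Standard normal-approximation formula with z-values hard-coded for
    a two-sided 5% significance level and 80% power.
    """
    z_alpha = 1.96   # two-sided alpha = 0.05
    z_beta = 0.84    # power = 0.80
    p = baseline_rate
    delta = baseline_rate * mde_relative   # absolute minimum detectable effect
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2)

# Detecting a 1% relative lift on a 3% conversion rate, at a
# hypothetical 500,000 experiment visitors per day:
n = samples_per_arm(baseline_rate=0.03, mde_relative=0.01)
days = math.ceil(2 * n / 500_000)
print(n, days)  # 5069867 21
```

Roughly five million visitors per arm and three weeks of runtime for a 1 percent relative lift: this is why small effects on low-traffic surfaces force the multi-week experiment durations the text describes.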

This requirement was cultural, not just procedural. At Booking.com, an ML model that improved AUC by 10 percent but showed no statistically significant improvement in conversion rate was not considered a success. Conversely, a simpler model that improved conversion rate by 0.5 percent was celebrated, even if its offline metrics were unremarkable.


The Key Insight: Offline Metrics Lie (Sometimes)

Booking.com's most important contribution to the broader ML community is arguably the lessons documented in their 2019 KDD paper, "150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com" (Bernardi et al., 2019). The paper described six lessons, but the most provocative was Lesson 2: the correlation between offline model performance and online business impact is weak.

The Booking.com team analyzed the relationship between improvements in offline metrics (AUC, RMSE, NDCG) and improvements in online business metrics (conversion rate, revenue) across dozens of model deployments. Their finding was striking: many models that showed clear offline improvements produced no statistically significant online improvement. Conversely, some models with modest offline improvements produced significant online gains.
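The shape of that analysis can be sketched as a correlation between per-deployment deltas. The numbers below are invented for illustration (they are not Booking.com's data); the point is how weak the fit between offline and online improvements can be.

```python
# Hypothetical per-deployment deltas: offline metric improvement (AUC)
# vs. online business improvement (relative conversion lift).
offline_delta = [0.012, 0.034, 0.005, 0.021, 0.041, 0.009, 0.027, 0.015]
online_delta = [0.002, 0.001, 0.000, 0.005, 0.001, 0.000, 0.004, 0.000]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

r = pearson(offline_delta, online_delta)
print(f"offline/online correlation: {r:.2f}")
```

With data like this, a team that ranked candidate models by offline delta alone would frequently promote the wrong one; only the paired online measurement reveals which deployments actually moved the business metric.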

Why Offline Metrics Can Mislead

The paper identified several mechanisms that explain the disconnect:

1. Proxy Metric Misalignment

Offline metrics measure a proxy for business value — click-through rate, ranking accuracy, prediction error — not business value itself. A recommendation model might improve click-through rate (more users click on recommended items) without improving conversion rate (users don't buy those items). The proxy and the goal are correlated but not identical.

2. Saturation Effects

Beyond a certain accuracy threshold, further offline improvements produce diminishing online returns. If a ranking model already surfaces good results in the top 5 positions, improving the ranking of positions 6-20 has limited user impact — most users never scroll that far.

3. Feedback Loops

In online systems, the model's predictions influence the data it will be evaluated on. A recommendation model that successfully promotes an item increases that item's popularity, which changes the signal the model sees in future data. Offline evaluation cannot capture these dynamics.

4. User Behavior Is Complex

Users don't behave like evaluation metrics assume. A more "accurate" search ranking might surface the objectively best-rated hotel — but the user was actually browsing for fun and wasn't ready to book. A less "accurate" ranking that shows a variety of options might better match the user's actual intent (exploration), even though it scores lower on ranking metrics.

5. Interaction Effects

ML models don't operate in isolation. A recommendation model interacts with the pricing algorithm, the search ranking, the email marketing system, and the UX layout. Improving one model might create a conflict with another — for example, better recommendations might cannibalize search traffic, producing a net-zero effect on conversion.

Research Note. Bernardi et al.'s finding has significant implications for model evaluation, which we covered in Chapter 11. Offline metrics (precision, recall, AUC) are necessary for development — they enable rapid iteration and comparison. But they are not sufficient for deployment decisions. The gold standard for measuring a model's business value is a well-designed A/B test. Organizations that deploy models based solely on offline metrics are making decisions with incomplete information.


The Organizational Model: ML at Scale Without Centralization

Booking.com's organizational approach to ML was deliberately decentralized. Rather than routing all ML work through a central AI team, the company empowered product teams to build and deploy their own models using the platform.

The Three-Layer Structure

Layer 1: ML Platform Team (Central)

A dedicated platform engineering team that built and maintained the shared ML infrastructure — training pipelines, feature store, model serving, monitoring, and the deployment gateway. This team did not build product models. It built the tools that product teams used to build their own models.

Layer 2: ML Specialists (Distributed)

Experienced ML engineers and data scientists embedded within product teams. These specialists had deep domain knowledge in their team's area (search, pricing, recommendations, customer service) and used the platform to build ML solutions for their team's problems.

Layer 3: Product Engineers (Broad)

Software engineers on product teams who consumed ML model outputs. They didn't build models, but they integrated model predictions into product features — displaying recommendations, adjusting search rankings, triggering interventions. The platform's standardized API made it possible for any engineer to integrate a model's predictions without understanding the model's internals.

Why Decentralization Worked

Domain expertise matters. The team that understands search behavior best is the search team. The team that understands pricing dynamics best is the pricing team. Centralizing ML in a team removed from the domain creates a translation bottleneck — the central team must understand every domain, which is impossible at scale.

Speed at scale. With 150 teams, a centralized model would create queues. Each team would wait for the central ML team to prioritize and build their model. Decentralization enabled parallel execution — 150 teams could build models simultaneously.

Accountability. When a product team owns its model end-to-end — from problem definition through deployment and monitoring — it has stronger incentives to maintain model quality. Centralized models often suffer from "orphan model" syndrome: the team that built the model moves on, and nobody is left to maintain it.

The platform prevents chaos. Decentralization without standardization produces the "ML anarchy" described in the Uber case study. Booking.com avoided this by providing a platform that enforced consistent practices for deployment, monitoring, and evaluation — even as it allowed flexibility in model design.


Challenges and Failures

Booking.com's ML journey was not without setbacks.

The "Ship It" Pressure

The experimentation culture, while generally healthy, sometimes created pressure to launch models quickly — before adequate validation. In a culture where running A/B tests is the norm, there was a temptation to "just launch and see what happens" rather than investing in thorough offline evaluation. The deployment gateway was introduced partly to counteract this tendency.

Feature Store Growing Pains

As the feature store grew to thousands of features, discoverability became a challenge. Teams struggled to find existing features that matched their needs, sometimes re-creating features that already existed under different names. Improved search, documentation, and governance tools were needed — and took time to build.

The A/B Testing Bottleneck

Ironically, Booking.com's strength — mandatory A/B testing — could also be a bottleneck. Running statistically valid experiments takes time, especially for low-traffic use cases where accumulating sufficient sample size requires weeks. For some use cases, the experimentation overhead made rapid iteration difficult.

Model Interaction Effects

As the number of models in production grew, interaction effects between models became harder to understand and manage. A change to the search ranking model could affect the recommendation model's performance and vice versa. Booking.com invested in "holdout" groups — a small percentage of users who saw no ML-driven features — to measure the combined effect of all models, but isolating individual model contributions remained challenging.
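Holdout groups of this kind are typically implemented with deterministic hash-based bucketing, a standard experimentation technique: hashing a stable user identifier with a salt keeps the same users in the holdout across sessions, so their no-ML experience stays consistent. The salt, bucket count, and holdout percentage below are illustrative assumptions, not Booking.com's values.

```python
import hashlib

def assign_bucket(user_id, holdout_pct=1.0, salt="global-holdout"):
    """Deterministically assign a user to a global holdout group.

    The same (salt, user_id) pair always hashes to the same bucket,
    so membership is stable without storing any per-user state.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000            # 10,000 buckets of 0.01% each
    return "holdout" if bucket < holdout_pct * 100 else "ml_features"

print(assign_bucket("user-42"))  # same answer on every call and every server
```

Changing the salt reshuffles every user into a fresh bucket assignment, which is how a long-running holdout can be rotated without coordinating state across serving machines.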

Talent Development

Democratizing ML meant that product engineers needed at least foundational ML literacy — enough to frame problems, select appropriate approaches, and interpret results. Booking.com invested heavily in internal training programs, but building this literacy across 150 teams was a multi-year effort.


Quantified Impact

While Booking.com does not publicly disclose the full financial impact of its ML systems, available data points include:

  • The platform supported hundreds of ML models in production by 2020, serving predictions for search ranking, recommendations, pricing, customer service routing, and fraud detection
  • The mandatory A/B testing framework prevented numerous models that showed offline promise but no online impact from reaching production — saving the engineering cost of maintaining models that would have produced no value
  • The democratization of ML deployment resulted in a 5x increase in the number of teams shipping ML features over a three-year period
  • The centralized feature store served features to models across the company, with feature reuse rates that significantly reduced the engineering time required for new model development

Lessons for Other Organizations

1. Measure What Matters — And It's Rarely Model Accuracy

Booking.com's most important lesson is that offline model metrics are necessary but not sufficient. Organizations that celebrate AUC improvements without measuring business impact are optimizing the wrong thing. The gold standard is a controlled experiment (A/B test) that measures the model's impact on actual business outcomes.

This lesson connects directly to Chapter 11's discussion of business-aligned evaluation: model metrics tell you how the model performs. Business metrics tell you whether the model matters.

2. Platforms Enable Scale; Culture Enables Adoption

The ML platform was necessary but not sufficient. Booking.com's experimentation culture — the shared belief that decisions should be data-driven and that every change should be tested — was the foundation that made ML adoption natural. Organizations that build platforms without a data-driven culture will have platforms that nobody uses.

3. Decentralization Requires Guardrails

Empowering 150 teams to build and deploy models is powerful, but it requires strong guardrails — standardized deployment processes, mandatory validation checks, and consistent monitoring. Without guardrails, decentralization produces fragmentation.

4. Start With the Process, Not the Tools

Booking.com's success was rooted in its experimentation process — the rigorous A/B testing methodology — not in any specific tool or framework. The tools evolved over time; the process was the constant. Organizations adopting MLOps should define their processes first (how will models be evaluated? who approves deployment? what triggers retraining?) and then select tools that support those processes.

5. The Gap Between "Better Model" and "Better Product" Is Real

A model that is technically superior may not be a product improvement. The gap between model performance and product performance is mediated by user behavior, system interactions, and business context. Bridging this gap requires collaboration between data scientists, product managers, and UX researchers — a theme explored further in Chapter 33 (AI Product Management).


Connection to Chapter Concepts

  • MLOps maturity (Section 12.11): Booking.com's journey from artisanal model deployment to a standardized platform with hundreds of models mirrors the Level 0 to Level 2 progression.
  • CI/CD for ML (Section 12.6): The deployment gateway is an implementation of CI/CD gates for ML — automated testing, validation, and human review before production deployment.
  • The human side of MLOps (Section 12.12): Booking.com's three-layer organizational structure (platform team, embedded specialists, product engineers) is a real-world implementation of the ML Platform Model described in the chapter.
  • Monitoring (Section 12.7): The mandatory A/B testing requirement is an extreme form of business impact monitoring — every model's impact is continuously measured against a control group.
  • Model evaluation (Chapter 11): Booking.com's finding that offline metrics weakly correlate with online impact validates Chapter 11's emphasis on business-aligned evaluation.

Discussion Questions

  1. Booking.com found that offline model improvements often did not translate to online business improvements. Does this finding invalidate the careful evaluation methodology described in Chapter 11? How would you reconcile the two perspectives?

  2. The mandatory A/B testing requirement adds time and complexity to model deployment. For a startup with limited traffic (10,000 daily active users), is this requirement practical? What alternative approaches could you use?

  3. Booking.com chose a decentralized organizational model for ML. Under what circumstances would a centralized model be preferable? What are the trade-offs?

  4. The "deployment gateway" combines automated checks with human review. As the number of models grows, the human review step could become a bottleneck. How would you scale the gateway to support 500+ model deployments per year?

  5. Booking.com's culture of experimentation predated its ML platform. For an organization without an experimentation culture, which should come first — the platform or the culture? Can you build one without the other?


Sources and Further Reading

  • Bernardi, L., Mavridis, T., Estevez, P., et al. (2019). "150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19).
  • Parveen, T. (2019). "Machine Learning at Booking.com." Presentation at KDD 2019.
  • Booking.com Technology Blog. (2020). "How We Scaled Machine Learning Across 150 Product Teams."
  • Bernardi, L., & Mavridis, T. (2020). "Building Machine Learning Pipelines at Booking.com." Proceedings of the ACM Conference on Recommender Systems.
  • Fabijan, A., Dmitriev, P., Olsson, H. H., & Bosch, J. (2018). "Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing." Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications.