Walk past the glass cube on Oosterdokskade in Amsterdam where Booking.com keeps its engineering floors, and you are walking past one of the most relentless experimentation machines ever built into a consumer website. The company’s Director of Experimentation has put the number on the record more than once: well over a thousand controlled experiments running in parallel at any moment. Pricing nudges, button placements, headline lengths, the exact shade of orange on a Reserve button. Former staff have described running tens of thousands over a full year. The framework behind all of it is, by most accounts from former engineers, a piece of internal software first stitched together in the mid-2000s and patched continuously since.
It has never been fully rewritten. That is partly a confession and partly a brag.

The factory floor of online travel
Booking Holdings, the Nasdaq-listed parent, processed gross travel bookings of $165.6 billion in 2024. The website that does most of that work is run from Amsterdam, and inside the building, engineers do not talk about features. They talk about experiments. Almost every visible element on the homepage (the calendar widget, the map pin colour, the urgency message that says “booked 4 times in the last 24 hours”) has been through a controlled trial against at least one alternative, often dozens.
The scale is the point. A site receiving hundreds of millions of monthly visitors can detect a 0.1% lift in conversion within hours. That is statistical power most companies will never have. It is also why Booking has become a kind of pilgrimage site for product managers. Its experimentation culture is studied in business schools and copied, badly, by competitors who lack the traffic to make small effects visible.
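How quickly a given lift becomes detectable can be estimated with a standard sample-size formula for comparing two conversion rates. The sketch below is an illustration, not Booking’s method: the 2% baseline conversion rate, the 5% significance level and the 80% power are all assumptions, and the function name is hypothetical. It also shows why it matters whether “0.1%” means 0.1 percentage points or a 0.1% relative improvement.

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline, lift_abs, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH arm of a two-sided
    two-proportion z-test to detect an absolute lift of `lift_abs`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p1, p2 = baseline, baseline + lift_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of Bernoulli variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift_abs ** 2)


BASELINE = 0.02  # assumed conversion rate, not a figure from the article

# Reading 1: 0.1 percentage points (2.0% -> 2.1%)
absolute_reading = visitors_per_arm(BASELINE, 0.001)

# Reading 2: 0.1% relative (2.000% -> 2.002%)
relative_reading = visitors_per_arm(BASELINE, BASELINE * 0.001)

print(absolute_reading, relative_reading)
```

Under these assumptions the first reading needs on the order of a few hundred thousand visitors per arm, which a site with hundreds of millions of monthly visitors could plausibly collect within hours. The second needs hundreds of millions per arm, which would take far longer even at that scale.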
A framework that aged into an advantage
The internal A/B testing platform was, by the recollection of engineers who have written about it publicly, originally built by a small team in the mid-2000s. The web at the time looked different. Airbnb did not yet exist, the iPhone was a year or two from launch, and most travel sites still measured success in page views rather than completed bookings.
What that team built was deceptively simple: a way to randomly assign every visitor to a variant, log the outcome, and compute statistical significance without a data scientist in the loop. Product managers could ship a test themselves. No committee, no quarterly review, no permission. The tool democratised experimentation inside the company before that phrase existed in tech HR decks.
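The three steps described here (assign, log, test) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the company’s actual code: the function names, the SHA-256 bucketing scheme and the pooled two-proportion z-test are all choices made for the example, and the source does not say which statistical test the platform uses.

```python
import hashlib
from math import sqrt
from statistics import NormalDist


def assign_variant(visitor_id, experiment, variants=("control", "treatment")):
    """Deterministically map a visitor to a variant by hashing the
    experiment name together with the visitor id (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Usage with made-up numbers: 200 of 10,000 control visitors convert,
# against 260 of 10,000 in the treatment arm.
variant = assign_variant("visitor-123", "reserve-button-colour")
p_value = two_proportion_p_value(200, 10_000, 260, 10_000)
print(variant, p_value)
```

Hashing rather than drawing a fresh random number means the same visitor sees the same variant on every page load without any stored state, which is one common way to make self-service testing cheap.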
Roughly two decades later, the core of that system is still running. Engineers have wrapped it, extended it, plugged in new statistical methods, and added guardrails for ethical concerns and accessibility, but the foundation has survived multiple platform migrations. Most experimentation systems are torn down and rebuilt periodically, often losing institutional knowledge in the process. The decision to keep patching rather than replace has compounded into something close to a moat, because a testing framework that is never thrown away keeps accumulating organisational learning.
The button that earned a fortune
Inside the company, there is a story, repeated often enough at conferences that it has hardened into folklore, about a single button colour test in the early 2010s that produced a revenue lift so large it dwarfed the cumulative revenue of Booking.com’s first three years of existence in the late 1990s. The specific shade, the specific page, and the exact figure have never been publicly confirmed. The company does not disclose individual experiment outcomes for competitive reasons.
The story is plausible because of the maths of scale. Booking’s early years, before the Priceline acquisition in 2005, generated modest revenue from a small Dutch user base. By the time a single button test could touch every shopper on the global site, even a fraction of a percent improvement, applied across tens of billions in annual bookings, would dwarf those early totals. Purely as an illustration, a 0.5% lift on $50 billion in annual bookings is $250 million in additional bookings, before the company takes its commission. That is what compounding traffic does to small wins.
Whether the precise tale is true or not, the underlying behavioural science is well documented. Consumers shopping for travel are often deep in decision fatigue, having compared a dozen hotels across three sites and four date combinations. By the time they reach a confirm screen, their brains are running on heuristics, not analysis. A button that pops against the background by a few hue degrees becomes the difference between a completed booking and a closed tab.
Why small visual cues move so much money
The psychology behind this is no longer controversial. Consumer decision-making research in online commerce has shown that perceived risk, effort expectancy, and social proof signals (those “14 people are looking at this hotel right now” banners) operate on shoppers below conscious awareness. Trust signals, urgency cues and friction reduction all act on what marketers call the conversion funnel, and small changes at the bottom of the funnel are worth far more than equivalent changes at the top.
Cognitive load is the quiet villain. As the Psychology Today piece on consumer brain fatigue puts it, a tired brain reaches for shortcuts and becomes more susceptible to marketing cues. That is exactly the state most people are in when they finally commit to a hotel after an evening of comparison shopping. The booking page is engineered for that moment.
The discipline of doing nothing
One detail that gets less attention: a majority of Booking’s experiments do not produce a winning variant. Engineers who have spoken publicly about the program have estimated that the bulk of tested variations either underperform the control or show no statistically meaningful difference.
The company treats this as the cost of knowing. Most product teams, at most companies, would never ship something they expected to fail most of the time. Booking ships it precisely because failure is data, and at its volume, data converts into dollars eventually.
This discipline of running tests cheaply enough that most of them can lose is what writers in Forbes have called the shift from intuition to evidence in modern product development. The trick is not having opinions about what users want. The trick is having infrastructure cheap enough to test everyone’s opinions and find out.
What the framework cannot see
The model has limits, and Booking has been candid about some of them. Optimising for short-term conversion can erode long-term trust. The company has been criticised by consumer watchdogs in the UK and EU for pressure-tactic messaging such as fake scarcity warnings, some of which were rolled back after regulatory intervention from the UK Competition and Markets Authority in 2019. A test that lifts bookings by 1% today can cost loyalty over a decade. The framework measures the first effect easily. The second, almost not at all.
Writers on data-driven conversion optimisation have flagged this gap repeatedly: A/B testing rewards what is measurable in the test window, which is rarely the same as what is valuable over a customer’s lifetime. The discipline of waiting six months to know whether a winning variant produced loyal customers or churned ones is something most experimentation cultures cannot stomach.
The Amsterdam engineering culture
According to accounts from former engineers, the Booking engineering organisation has long held that velocity of learning matters more than elegance of code. New hires are reportedly taught the platform in their first week and expected to ship their first test within a month. The tool’s age is treated as a feature: a stable, well-understood substrate that everyone can reason about, not a bug to be refactored away.
This is the opposite of the dominant Silicon Valley reflex, which is to rewrite every system every few years. Whether the patience pays off forever is an open question. AI-driven personalisation, where every user sees a different page generated in real time, may eventually make traditional A/B testing obsolete. A site that shows two billion personalised variants per day cannot meaningfully run a randomised trial across all of them.
For now, the framework keeps humming. Engineers in Amsterdam push another batch of experiments to production this afternoon, and somewhere on the site, a button’s colour shifts by a few hex digits, and a hundred thousand travellers click Reserve without ever knowing they were the control.
The orange will probably win. It usually does.