Methods for Effective Online Testing in Real-Time Bidding

Luka Androjna
Outbrain Engineering
8 min read · Jun 8, 2021


The Context

When a user visits a website monetized through ads, a super-fast auction takes place. This auction is so fast, it usually finishes by the time the page loads. This provides a seamless experience for the user. Most people don’t even realize that real-time bidding (RTB) exists.

A simplified diagram of the auction process

Outbrain and Zemanta both participate in such auctions. Zemanta is one of the bidders in these auctions (DSP, or Demand Side Platform) and acts as Outbrain’s window into the programmatic ad buying ecosystem.

The Scale

At Zemanta, we handle over a million requests per second on average. That is over 100 billion processed requests each day. For context, Google reports about 3.5 billion searches per day. Saying that we work with big data would be an understatement. Zemanta also has millions of advertisements that need to get ranked before bidding. The number of publishers we can reach is in the millions as well.

On top of the sheer volume of data we process, we are also constrained by time limits. We only have about 100 ms to deliver a bid. That includes all network latency, data processing, and data science magic — all in less than the blink of an eye. This will come into play later, once we go over the problem we were trying to solve and how we approached it.

To Crop an Image

For each ad we deliver, there is an image to include. That may seem like a trivial detail, but it is an enormous challenge that requires continuous testing and optimization. Why? The images must dynamically fill ad spaces of arbitrary dimensions, depending on the web pages and apps where they might appear.

Our customers usually send an image that comes in one size, so it won’t necessarily fit in every ad space we’d like to bid on. If we only matched ad placements that had a perfect aspect ratio for the available image, it would drastically reduce the number of placements an advertisement could be served on. That would lead to worse recommendations and more expensive bids.

Delivering ads that best match the context, while using effective images in any dimension, was a big challenge. We needed to optimize the images so they could dynamically fill ad spaces of arbitrary dimensions.

Could we simply crop the images? As it turns out, there are many different strategies for cropping an image. As you’ll see below, they can return very different results depending on what gets cut and what stays. In some cases, the same image cropped two different ways can tell quite a different story.

Example of how different cropping strategies perform on the same image (originally a landscape) that gets displayed in a square widget
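To make the idea concrete, here is a naive “center crop” strategy sketched with Pillow. This is purely illustrative (the function and its behavior are our own simplification, not one of the production strategies): it fits a source image into a target aspect ratio by trimming the longer side around the center.

```python
# Illustrative only: a naive "center crop" strategy using Pillow.
from PIL import Image

def center_crop(image: Image.Image, target_w: int, target_h: int) -> Image.Image:
    """Crop the largest centered region matching the target aspect ratio, then resize."""
    src_w, src_h = image.size
    target_ratio = target_w / target_h
    if src_w / src_h > target_ratio:
        # Source is too wide: trim the left and right edges.
        new_w = int(src_h * target_ratio)
        left = (src_w - new_w) // 2
        box = (left, 0, left + new_w, src_h)
    else:
        # Source is too tall: trim the top and bottom edges.
        new_h = int(src_w / target_ratio)
        top = (src_h - new_h) // 2
        box = (0, top, src_w, top + new_h)
    return image.crop(box).resize((target_w, target_h))

# Example: fit a landscape photo into a square 300x300 widget.
# square = center_crop(Image.open("surfer.jpg"), 300, 300)
```

A center crop keeps whatever sits in the middle of the frame, which is exactly why it can cut off a subject standing near the edge; other strategies make different trade-offs.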

And that is just one side of the problem. The other is determining which cropping strategy would work best for each image. We have millions of images, so doing this manually would be next to impossible. Even if we could, it would just be our team’s subjective choices, which might not reflect the preferences of our customers or the users we are hoping to reach.

Figuring Out a Solution

As data scientists, we started by using statistics to figure out which combinations of images and cropping strategies worked best on average. The simplest place to start was an online A/B test, also known as a split test, which allowed us to randomly allocate our requests between different combinations and measure their performance.

The A/B Testing Approach

Our online testing framework allows Zemanta to deploy online tests quickly. With a bit of offline preparation and coding, we were ready to start a performance test. Our success measurement was based on CTR (click-through rate). We started with the hypothesis that all cropping strategies perform the same.

Since clicks follow a binomial distribution, we used Fisher’s Exact Test to determine whether the tested combinations differ in a statistically significant way.
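As a minimal sketch of what that comparison looks like in code (the click and impression counts below are made up for illustration, not real campaign data), Fisher’s exact test boils down to a 2x2 contingency table of clicks and non-clicks per strategy:

```python
# Compare two cropping strategies with Fisher's exact test.
# The click and impression counts are illustrative, not real data.
from scipy.stats import fisher_exact

clicks_a, impressions_a = 120, 100_000
clicks_b, impressions_b = 150, 100_000

# 2x2 contingency table: rows are strategies, columns are clicks / non-clicks.
table = [
    [clicks_a, impressions_a - clicks_a],
    [clicks_b, impressions_b - clicks_b],
]

odds_ratio, p_value = fisher_exact(table)
print(f"p-value: {p_value:.4f}")  # reject "all strategies perform the same" if below alpha
```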

The allocation of traffic for our first split test looked like the chart below. Each test group here has a preset CTR that gets plugged into a binomial distribution and sampled:

To ensure that the test had enough power, we also calculated the number of examples needed for the test.
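For a rough idea of how such a calculation can look (the baseline CTR and the lift we want to detect below are illustrative assumptions, not our real numbers), a standard two-proportion power calculation with statsmodels goes something like this:

```python
# Rough sample-size calculation for detecting a small CTR lift.
# Baseline CTR and detectable CTR are illustrative assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_ctr = 0.0012     # assumed CTR of the current strategy
detectable_ctr = 0.0013   # smallest improvement worth detecting

effect_size = proportion_effectsize(baseline_ctr, detectable_ctr)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(f"~{int(n_per_group):,} impressions needed per test group")
```

Even with these modest numbers, the answer runs well into the hundreds of thousands of impressions per group, and it grows quickly as the detectable lift shrinks or the number of test groups increases.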

Statistical tests that compare very similar distributions need a lot of examples, so that became our first problem. Even though we had over a million bidding opportunities per second, the number of combinations of images and cropping strategies dramatically increased the number of examples needed for the test to provide actionable results. Another thing to keep in mind is that campaigns have budgets, which means each advertisement can be displayed only a finite number of times.

Multi-Armed Bandits to the Rescue

One way to tackle this problem is to reframe the search for the best combination as a multi-armed bandit (MAB) problem. A MAB problem starts with a limited amount of resources that you need to divide between a set of competing options to maximize the expected gain.

MABs help solve the classic exploration-exploitation trade-off. An approach that favors exploration is learning-based. By contrast, exploitation is the greedy approach, where more resources are devoted to options that are already expected to produce higher gains. There are many great strategies to tackle MABs with, but we focused on two: epsilon-greedy and Thompson sampling.

Trying the Epsilon-Greedy Approach

The epsilon-greedy strategy is a frequentist approach to tackling MABs. Think of an A/B test with a 70/30 split, where 70% of traffic is pointed at the most effective option (exploitation), while the other 30% is dedicated to learning how the other alternatives are performing (exploration).

During this test, 70% of the traffic is used to deliver the top-performing option, while the remaining 30% runs ongoing tests, split uniformly among the remaining combinations. The results of the exploration side of the split are then used to dynamically update the exploitation side as the results continue coming in.
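A minimal sketch of that selector, assuming binary click/no-click feedback (the class and its 30% exploration share are our own illustration of the idea, not the production implementation), could look like this:

```python
# Minimal epsilon-greedy bandit for click / no-click feedback.
# epsilon=0.3 mirrors the 70/30 split described above.
import random

class EpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.3):
        self.epsilon = epsilon
        self.impressions = [0] * n_arms
        self.clicks = [0] * n_arms

    def select_arm(self) -> int:
        # Explore: with probability epsilon, pick a random arm.
        if random.random() < self.epsilon:
            return random.randrange(len(self.impressions))
        # Exploit: otherwise pick the arm with the highest observed CTR so far.
        ctrs = [c / i if i else 0.0 for c, i in zip(self.clicks, self.impressions)]
        return max(range(len(ctrs)), key=ctrs.__getitem__)

    def update(self, arm: int, clicked: bool) -> None:
        self.impressions[arm] += 1
        self.clicks[arm] += int(clicked)

# bandit = EpsilonGreedy(n_arms=4)
# arm = bandit.select_arm()  # serve that crop, then record the outcome:
# bandit.update(arm, clicked=True)
```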

Essentially, you are running a split test where the best bandit is used for the majority of traffic. This already gave us better business results than an A/B test in the same time period, since A/B tests return the average yield of all tested groups.

See below for an illustration of how epsilon-greedy allocation works. The purple section represents the top-performing bandit, while the other exploratory tests continue being split with a smaller allocation:

Here, we also noticed an issue: the testing was continuous and uniform for all bandits. That meant that if one bandit used a cropping strategy that trimmed the image in an unwanted way (like the example above that cut off part of the surfer), we wanted to drop it as quickly as possible instead of potentially showing it to users for the whole life of the ad. There are some ways of upgrading the basic epsilon-greedy approach to mitigate this issue, but we decided to try a more elegant strategy.

Thompson Sampling, the Elegant Solution

Thompson sampling solves this MAB problem with a Bayesian approach.

Thompson sampling works a bit like you would expect: you explore all bandits, and over time you build up a posterior estimate of each bandit’s performance while increasing your confidence in that estimate. In each round, you draw a sample from every bandit’s posterior and serve the bandit with the best sample, so promising but uncertain bandits still get explored.
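For binary click feedback, this is usually done with Beta posteriors. Here is a minimal sketch (illustrative, not our production code; the priors default to a uniform Beta(1, 1) but can be set to anything):

```python
# Minimal Thompson sampling for click / no-click feedback with Beta posteriors.
import numpy as np

class ThompsonSampler:
    def __init__(self, n_arms: int, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        self.alpha = np.full(n_arms, prior_alpha)  # prior + observed clicks
        self.beta = np.full(n_arms, prior_beta)    # prior + observed non-clicks

    def select_arm(self) -> int:
        # Draw one CTR sample from each arm's posterior and pick the largest.
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, clicked: bool) -> None:
        self.alpha[arm] += int(clicked)
        self.beta[arm] += 1 - int(clicked)
```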

It’s possible to set the bandit priors to a uniform distribution, which doesn’t prioritize any of the options. Instead, we decided to use custom priors, which allowed for faster convergence to the best options. The resource-allocation process can be seen in the chart below:

Simulation Results

Clicks attained in the simulated testing period based on the parameters we plugged in.

As can be seen from the chart above, both MAB methods attain far more clicks in the same time it would take us to run a proper A/B test. It has to be noted that these results are based on the parameters we plugged into the simulation. Thompson sampling only slightly outperforms epsilon-greedy, but we prefer Thompson sampling because it quickly deprioritizes bad performers.

Auction Constraints

At the beginning of this post, we covered some time constraints that face the bidder. We have to deliver a bid response within 100 ms on average. This includes network latency, data processing, making predictions, running optimizations, and other data science wizardry.

We could address the data science side by making our processing as efficient as possible. Network latency, however, was a different type of constraint, which we mitigated by sharding our bidders across different data centers around the globe.

To run bidders in such a distributed manner, we needed a central datastore. Reading and writing to it, however, is very expensive if you only have 100 ms and you have to process over a million requests per second.

Where We Ended Up

Thompson sampling is computationally expensive, but it was the most effective basis for sampling distributions and searching for the best bandit. We learned that the most resource-effective way to implement our MAB strategy was to use batches.

In the case of Thompson sampling, we now set initial priors, sample the priors, and create weights for each bandit. The weights can be interpreted as the probability of a specific bandit being chosen in real-time. The weights are then read by the bidder nodes, which run a weighted lottery.
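A sketch of what that batched flow could look like (function names, batch sizes, and the datastore plumbing are illustrative assumptions, not our production code): the heavy posterior sampling happens offline, and the bidder only runs a cheap weighted lottery per request.

```python
# Offline job: turn Beta posteriors into per-bandit selection weights.
# Bidder node: pick a bandit with a fast weighted lottery at bid time.
import random
import numpy as np

def compute_weights(alpha: np.ndarray, beta: np.ndarray, n_draws: int = 10_000) -> np.ndarray:
    """Estimate, per bandit, the probability that it wins a Thompson draw."""
    samples = np.random.beta(alpha, beta, size=(n_draws, len(alpha)))
    wins = np.bincount(samples.argmax(axis=1), minlength=len(alpha))
    return wins / n_draws

def pick_bandit(weights: np.ndarray) -> int:
    """Bidder-side weighted lottery, cheap enough to run on every bid request."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Offline, every n minutes: update alpha/beta from fresh click data,
# recompute the weights, and publish them to the central datastore.
# weights = compute_weights(alpha, beta)
# Bidder, per request: arm = pick_bandit(weights)
```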

We collect data for n minutes and then repeat the process: update the priors, sample them, create new weights, pull them, and apply them at bidding time. We then created an A/B test on top of the entire process to compare the performance of the new optimization with the way we did cropping previously. This methodology has worked great so far, but there are still many ways in which we can improve it. We might revisit those in a future post.

Something to Take Home

There are many ways to run online tests, and most of them have upsides and drawbacks. You will need to study them and figure out which approaches work for you. Diving into a challenge like this and learning about it is a lot of fun, so make sure to leverage scientific thinking and mathematical and statistical methods in your development: they are your friends.

The content of this blog was presented at The Web Conference and PyCon Israel in an extended format; both links lead to the video recordings. All the demo code can be found here: GitHub. If you have any comments or questions, leave us a note below.
