Augmented Two-Stage Bandit Framework: Practical Approaches for Improved Online Ad Selection
Seowon Han, Ryan Lakritz and Hanxiao Wu
In online advertising, maximizing user engagement and advertiser performance hinges on effective ad selection algorithms. Algorithms for multi-armed bandit problems, such as Thompson Sampling, excel at exploration, but their use of contextual information is limited. Conversely, contextual bandit approaches personalize ad selection by leveraging user- and ad-specific features; however, they perform poorly when data is sparse and often face cold-start problems for new ad groups. To address this dilemma, we propose a novel bandit framework that combines context-free and context-aware rewards, augmented with historical predicted performance, in our case predicted click-through rate (pCTR) scores. We refer to this framework as the Augmented Two-Stage Bandit Framework.
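The abstract does not specify how the historical pCTR scores enter the bandit. One plausible reading, sketched below in Python, is to convert each ad's pCTR into pseudo-counts for a Beta prior that Thompson Sampling then samples from; the helper name `pctr_beta_prior` and the `strength` parameter are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def pctr_beta_prior(pctr: float, strength: float = 10.0) -> tuple[float, float]:
    """Turn a historical pCTR score into Beta(alpha, beta) pseudo-counts.

    `strength` (a hypothetical knob) is the number of pseudo-impressions
    the prior is worth: larger values trust the pCTR model more before
    any real feedback arrives.
    """
    alpha = 1.0 + strength * pctr          # pseudo-clicks
    beta = 1.0 + strength * (1.0 - pctr)   # pseudo-non-clicks
    return alpha, beta

rng = np.random.default_rng(0)

# Three ads with pCTR scores from an upstream prediction model.
priors = [pctr_beta_prior(p) for p in (0.02, 0.05, 0.01)]

# One Thompson Sampling round: sample a CTR per ad and pick the largest.
samples = [rng.beta(a, b) for a, b in priors]
print("selected ad:", int(np.argmax(samples)))
```

Seeding the prior this way gives a new ad group a non-uniform starting point, which is one way such a framework could soften the cold-start problem.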
Our bandit framework consists of two stages. In the first stage, the framework applies context-free Thompson Sampling, augmented with historical pCTR scores, for initial exploration. The non-contextual bandit algorithm, together with the generalized patterns captured by our pCTR model, effectively mitigates the cold-start problem. In the second stage, the framework shifts to a contextual bandit algorithm for refined exploration and exploitation.
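A minimal end-to-end sketch of this two-stage flow follows, under two explicit assumptions the abstract leaves open: the switch to Stage 2 happens after a fixed impression count (`switch_after`), and the contextual stage uses linear Thompson Sampling over a per-ad Bayesian ridge model. Neither detail is stated in the paper; both are stand-ins to make the structure concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

class TwoStageBandit:
    """Illustrative two-stage bandit for one ad slot.

    Stage 1: context-free Thompson Sampling with Beta priors seeded from
    historical pCTR scores. Stage 2: linear Thompson Sampling over context
    features. The fixed-threshold switch rule and the Stage-2 model are
    assumptions made for this sketch.
    """

    def __init__(self, pctrs, dim, switch_after=1000, prior_strength=10.0):
        pctrs = np.asarray(pctrs, dtype=float)
        # Stage-1 state: pCTR-seeded Beta pseudo-counts per ad.
        self.alpha = 1.0 + prior_strength * pctrs
        self.beta = 1.0 + prior_strength * (1.0 - pctrs)
        # Stage-2 state: per-ad ridge sufficient statistics (A = X'X + I, b = X'y).
        n_ads = len(pctrs)
        self.A = np.stack([np.eye(dim) for _ in range(n_ads)])
        self.b = np.zeros((n_ads, dim))
        self.impressions = 0
        self.switch_after = switch_after

    def select(self, context):
        if self.impressions < self.switch_after:
            # Stage 1: one posterior CTR sample per ad; context is ignored.
            scores = rng.beta(self.alpha, self.beta)
        else:
            # Stage 2: sample a weight vector per ad and score the context.
            scores = np.empty(len(self.b))
            for k in range(len(self.b)):
                cov = np.linalg.inv(self.A[k])
                theta = rng.multivariate_normal(cov @ self.b[k], cov)
                scores[k] = context @ theta
        return int(np.argmax(scores))

    def update(self, ad, context, clicked):
        self.impressions += 1
        self.alpha[ad] += clicked                  # Stage-1 posterior update
        self.beta[ad] += 1 - clicked
        self.A[ad] += np.outer(context, context)   # Stage-2 posterior update
        self.b[ad] += clicked * context


# Toy run: 3 ads, 5-dim contexts, Bernoulli clicks from fixed true CTRs.
bandit = TwoStageBandit(pctrs=[0.02, 0.05, 0.01], dim=5, switch_after=200)
true_ctr = [0.02, 0.06, 0.01]
for _ in range(500):
    x = rng.normal(size=5)
    ad = bandit.select(x)
    bandit.update(ad, x, int(rng.random() < true_ctr[ad]))
```

Note that this sketch updates both stages' statistics on every impression, so Stage 2 starts from the data accumulated during Stage 1 rather than from scratch.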
We demonstrate the efficacy of our proposed method through extensive simulations and experiments on a real-world ads marketplace at Reddit. Compared to traditional bandit algorithms, our pCTR-augmented Two-Stage Bandit Framework achieves significant improvements in click-through rate. These findings underscore the ability of the Augmented Two-Stage Bandit Framework to enhance online ad selection and improve key performance metrics.