Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey (spandey@yahoo-inc.com)
Deepayan Chakrabarti (deepay@yahoo-inc.com)
Deepak Agarwal (dagarwal@yahoo-inc.com)

Yahoo! Research, Sunnyvale, CA

Abstract

We provide a framework to exploit dependencies among arms in multi-armed bandit problems, when the dependencies are in the form of a generative model on clusters of arms. We find an optimal MDP-based policy for the discounted reward case, and also give an approximation of it with a formal error guarantee. We discuss lower bounds on regret in the undiscounted reward scenario, and propose a general two-level bandit policy for it. We propose three different instantiations of our general policy and provide theoretical justifications of how the regret of the instantiated policies depends on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, and show that they significantly outperform classical policies designed for bandits with independent arms.

1. INTRODUCTION

Multi-armed bandit problems have been an active area of research since the 1950s. The problem can be stated as follows (J. C. Gittins, 1979): there are N arms, each having an unknown success probability of emitting a unit reward. The success probabilities of the arms are assumed to be independent of each other. The objective is to pull arms sequentially so as to maximize the total reward. Many policies have been proposed for this problem under the independent-arm assumption (Lai & Robbins, 1985; P. Auer et al., 2002). In this paper we drop this assumption and focus on the bandit problem where the arms are dependent. For example, consider a simple bandit instance with 3 arms, with success probabilities θ1, θ2 and θ3, where one also has a priori knowledge that |θ1 − θ2| < .001.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007.
Copyright 2007 by the author(s)/owner(s).

This constraint induces dependence between arms 1 and 2. Is it possible to construct policies that perform better than those for independent bandits by exploiting the similarity of the first two arms?

This question is not merely of theoretical interest. For instance, the lucrative Internet advertising business is based on selecting ads to display on webpages. This ad-selection problem can be cast as a bandit problem where each ad corresponds to an arm, displaying an ad corresponds to an arm pull, and user clicks are the reward. Ads with similar text, "bidding phrase," and advertiser information are likely to have similar click probabilities, and this creates dependencies between the arms of the bandit.

We formalize this problem in the paper. In particular, we propose a new variant of the multi-armed bandit problem where the arms have been grouped into clusters. For the toy example discussed previously, one can consider arms 1 and 2 together as a cluster, arm 3 as another cluster, and "reduce" the 3-arm problem to a 2-cluster problem. The latter may be more efficient to solve due to the smaller number of clusters. We show that this intuition is indeed justified, and design policies that exploit such dependencies.

Our contributions: We formalize and study multi-armed bandits with dependent arms (henceforth, dependent bandits) for both the discounted and undiscounted reward scenarios. For the discounted reward objective, we find the optimal MDP-based solution for dependent bandits. At each timestep, this policy computes an (index, arm) pair for each cluster, then picks the cluster with the highest index and pulls the corresponding arm. However, as with independent bandits, computing the optimal policy is often infeasible and approximations are necessary. We provide error bounds on a simple approximation to the optimal policy.
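To make the two-level "cluster first, then arm" idea concrete, the following is a minimal illustrative sketch, not the paper's actual policy: it uses a UCB1-style index at both levels, scoring each cluster by pooling its arms' empirical statistics. The pooling rule, function names, and reward model here are our own assumptions for illustration.

```python
import math
import random

def ucb_index(successes, pulls, t):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # force at least one pull of each choice
    return successes / pulls + math.sqrt(2 * math.log(t) / pulls)

def two_level_policy(clusters, horizon, seed=0):
    """Sketch of a two-level bandit policy.

    clusters: list of clusters, each a list of true success
    probabilities (unknown to the policy, used only to simulate rewards).
    Returns the total reward accumulated over `horizon` pulls.
    """
    rng = random.Random(seed)
    succ = [[0] * len(c) for c in clusters]
    pulls = [[0] * len(c) for c in clusters]
    reward = 0
    for t in range(1, horizon + 1):
        # Level 1: score each cluster by pooling its arms' statistics
        # (one hypothetical aggregation choice among several possible).
        best_c = max(range(len(clusters)),
                     key=lambda i: ucb_index(sum(succ[i]), sum(pulls[i]), t))
        # Level 2: run UCB1 among the arms of the chosen cluster only.
        best_a = max(range(len(clusters[best_c])),
                     key=lambda j: ucb_index(succ[best_c][j], pulls[best_c][j], t))
        # Simulate a Bernoulli reward from the pulled arm.
        r = 1 if rng.random() < clusters[best_c][best_a] else 0
        succ[best_c][best_a] += r
        pulls[best_c][best_a] += 1
        reward += r
    return reward
```

For the toy instance above, arms 1 and 2 (with nearly identical success probabilities) would form one cluster and arm 3 another, e.g. `two_level_policy([[0.30, 0.301], [0.70]], horizon=10000)`; the first-level index can then rule out the whole near-identical cluster without exploring each of its arms separately.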
For the undiscounted reward scenario, we first discuss an upper bound on the performance of any bandit policy. We then present a general and computationally