To appear in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Antwerp, Belgium, September 2008.

Online Multiagent Learning against Memory Bounded Adversaries

Doran Chakraborty and Peter Stone
Department of Computer Sciences
University of Texas, Austin, Texas, USA
{chakrado,pstone}@cs.utexas.edu

Abstract. The traditional agenda in Multiagent Learning (MAL) has been to develop learners that guarantee convergence to an equilibrium in self-play or that converge to playing the best response against an opponent using one of a fixed set of known targeted strategies. This paper introduces an algorithm called Learn or Exploit for Adversary Induced Markov Decision Process (LoE-AIM) that targets optimality against any learning opponent that can be treated as a memory bounded adversary. LoE-AIM makes no prior assumptions about the opponent and is tailored to optimally exploit any adversary that induces a Markov decision process in the state space of joint histories. LoE-AIM either explores and gathers new information about the opponent or converges to the best response to the partially learned opponent strategy in repeated play. We further extend LoE-AIM to account for online repeated interactions against the same adversary, with plays against other adversaries interleaved in between. LoE-AIM-repeated stores learned knowledge about an adversary, identifies the adversary in case of repeated interaction, and reuses the stored knowledge about the adversary's behavior to enhance learning in the current epoch of play. LoE-AIM and LoE-AIM-repeated are fully implemented, with results demonstrating their superiority over existing MAL algorithms.

1 Introduction

The aim of many adversarial strategic interactions is to learn a model of the opponent(s) and to respond accordingly [1, 3, 14].
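The idea that a memory bounded opponent induces a Markov decision process over joint histories can be made concrete with a small sketch. The example below is illustrative only, not the paper's LoE-AIM implementation: it assumes a memory-1 Tit-for-Tat opponent in the iterated prisoner's dilemma (standard payoffs and a discount factor chosen here for illustration), so the induced MDP's state is just the learner's previous action, and ordinary value iteration recovers the best response.

```python
GAMMA = 0.95  # discount factor (assumed for this illustration)
ACTIONS = ("C", "D")  # Cooperate, Defect
# Row player's prisoner's dilemma payoffs: PAYOFF[(my_action, opp_action)]
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(state):
    """A memory-1 opponent: replays the learner's previous action (the state)."""
    return state

def best_response(opponent, iters=1000):
    """Value iteration on the MDP induced by a memory-1 opponent.

    The opponent's move is a fixed function of the state, so my action
    determines both the immediate payoff and the next state (my last action).
    """
    V = {s: 0.0 for s in ACTIONS}
    for _ in range(iters):
        for s in ACTIONS:
            V[s] = max(PAYOFF[(a, opponent(s))] + GAMMA * V[a] for a in ACTIONS)
    policy = {s: max(ACTIONS,
                     key=lambda a: PAYOFF[(a, opponent(s))] + GAMMA * V[a])
              for s in ACTIONS}
    return policy, V

policy, V = best_response(tit_for_tat)
print(policy)  # → {'C': 'C', 'D': 'C'}: against Tit-for-Tat, always cooperate
```

Once the opponent's (bounded) memory is known, exploiting it is exactly this kind of planning problem; the harder part, which LoE-AIM addresses, is learning the induced MDP online while trading off exploration against exploiting the partially learned model.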
If the opponents execute static policies, then the learning agent faces a stationary environment, effectively reducing the problem to a single-agent decision problem. In the presence of other learning agents, however, the environment is inherently non-stationary, which makes the learning problem for an individual agent much harder [12]. The most popular solution concept in such multiagent settings has been the Nash equilibrium [13], and most multiagent learning (MAL) algorithms proposed to date aim at convergence to such an equilibrium in self-play [5, 8, 15]. Their popularity notwithstanding, the ability to find Nash equilibria does not solve all multiagent problems. For one thing, there can be multiple Nash