One Arm to Rule Them All: Online Learning with Multi-armed Bandits for Low-resource Conversational Agents

Vânia Mendonça 1,2[0000-0001-5729-7608], Luísa Coheur 1,2[0000-0002-2456-5028], and Alberto Sardinha 1,2[0000-0002-5782-3142]

1 INESC-ID, Lisboa, Portugal
2 Instituto Superior Técnico, Lisboa, Portugal
{vania.mendonca,luisa.coheur,jose.alberto.sardinha}@tecnico.ulisboa.pt

Abstract. In a low-resource scenario, the lack of annotated data can be an obstacle not only to training a robust system, but also to evaluating and comparing different approaches before deploying the best one for a given setting. We propose to dynamically find the best approach for a given setting by taking advantage of feedback naturally present in the scenario at hand (when it exists). To this end, we present a novel application of online learning algorithms, where we frame the choice of the best approach as a multi-armed bandits problem. Our proof-of-concept is a retrieval-based conversational agent, in which the answer selection criteria available to the agent are the competing approaches (arms). In our experiment, an adversarial multi-armed bandits approach converges to the performance of the best criterion after just three interaction turns, which suggests the appropriateness of our approach in a low-resource conversational agent.³

Keywords: Online learning · Multi-armed bandits · Conversational agents.

This work was supported by: Fundação para a Ciência e a Tecnologia (FCT) under reference UIDB/50021/2020 (INESC-ID multi-annual funding), as well as under the HOTSPOT project with reference PTDC/CCI-COM/7203/2020; Air Force Office of Scientific Research under award number FA9550-19-1-0020; P2020 program, supervised by Agência Nacional de Inovação (ANI), under the project CMU-PT Ref. 045909 (MAIA). Vânia Mendonça was funded by an FCT grant, ref. SFRH/BD/121443/2016.

³ The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-86230-5_49

1 Introduction

The state of the art in several Natural Language Processing tasks is currently dominated by deep learning approaches. In the particular case of conversational agents, such deep approaches have been applied either to generate an answer from scratch (generation-based) or to find the best match among a collection