MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries

Negar Arabzadeh, University of Waterloo, narabzad@uwaterloo.ca
Bhaskar Mitra, Microsoft, bmitra@microsoft.com
Ebrahim Bagheri, Ryerson University, bagheri@ryerson.ca

ABSTRACT
In recent years, with the growing influence of neural architectures, tasks such as ad hoc retrieval have witnessed impressive improvements in performance. In this paper, we go beyond the overall performance of state-of-the-art rankers and empirically study their performance from a finer-grained perspective. We find that while neural rankers have been able to consistently improve performance, this has been in part thanks to a specific set of queries from within the larger query set. We systematically show that there are subsets of queries that are difficult for each and every one of the neural rankers, which we refer to as obstinate queries. We show that obstinate queries are similar to easier queries in terms of the number of available relevance-judged documents and the length of the query itself, yet they are substantially more difficult for existing rankers to satisfy. Furthermore, we observe that query reformulation methods cannot help these queries. On this basis, we present three datasets derived from the MS MARCO Dev set, called the MS MARCO Chameleon datasets. We believe that the next breakthrough in performance will necessarily need to address the queries in the MS MARCO Chameleons and, as such, propose that a well-rounded evaluation strategy for any new ranker should include performance measures on both the overall MS MARCO dataset and the proposed MS MARCO Chameleon datasets.

CCS CONCEPTS
• Information systems → Information retrieval; Evaluation of retrieval results; Retrieval effectiveness; Retrieval efficiency; Information retrieval query processing; Query reformulation.
KEYWORDS
Query Difficulty, Information Retrieval, Query Reformulation

ACM Reference Format:
Negar Arabzadeh, Bhaskar Mitra, and Ebrahim Bagheri. 2021. MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM '21), November 1–5, 2021, Virtual Event, QLD, Australia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3459637.3482011

CIKM '21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11...$15.00
https://doi.org/10.1145/3459637.3482011

1 INTRODUCTION
Recent advances in neural information processing have made a noticeable impact on many information retrieval tasks, including question answering [14, 20, 25], ad hoc retrieval [10, 15, 16, 26, 31], and knowledge graph search [11], to name a few. In particular, the ad hoc retrieval task has witnessed a number of recent neural (re)rankers that have shown impressive performance improvements over traditional retrieval methods [13]. These developments have been made possible, in part, by large-scale datasets such as MS MARCO [32], which provide a large number of queries and their associated relevance judgments that can be used for training neural rankers.
When reviewing the leaderboard associated with the MS MARCO passage retrieval dataset, the performance improvements gained over the past two years are impressive. For instance, the best run submitted to the MS MARCO leaderboard in 2018 produced an MRR@10 of 0.271 on the development set, while the best run submitted in 2020 reported 0.426 on the same metric and dataset; the effectiveness of the ranking methods thus improved by more than 50% over a two-year period. While the MS MARCO leaderboard and the authors of many papers report effectiveness over the whole collection of queries, the focus of this paper is to dig deeper into the performance of recent state-of-the-art neural rankers at the query level and to explore whether the improvements obtained by the neural rankers are consistent across the whole dataset. Many ranking stacks achieve state-of-the-art performance through multi-stage ranking [8, 34]; in this work, however, we focus only on single-stage retrieval, since improving the first stage of the ranking stack would consequently boost the performance of the stack as a whole. Based on an empirical study over the runs of five leading neural first-stage retrieval methods, we find that there is a consistent set of poorly performing queries that cannot be addressed by any of the existing neural rankers. We additionally observe that the performance improvements achieved by neural rankers are due to gradual gains on a certain subset of the dataset; as such, the performance improvements reported in the literature are not necessarily due to consistent improvement over all of the queries. To substantiate our discussion, let us consider several state-of-the-art methods that have shown strong performance on the 6,980 queries in the MS MARCO Dev set. These methods, along with the mean and median of their average precision and reciprocal rank, are reported in Table 1.
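The MRR@10 figures quoted above, and the mean-versus-median contrast examined next, can be sketched as follows. This is an illustrative implementation only, not the official MS MARCO evaluation script; the `runs` and `qrels` structures and the toy data are assumptions made for the example.

```python
# Illustrative sketch: per-query reciprocal rank and its aggregation.
# All identifiers and data here are hypothetical, for exposition only.
from statistics import mean, median

def reciprocal_rank(ranked_docs, relevant, cutoff=10):
    """Return 1/rank of the first relevant document within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_docs[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Toy run: each query id maps to its ranked list of retrieved passage ids.
runs = {
    "q1": ["d5", "d2", "d7"],  # relevant doc at rank 2 -> RR = 0.5
    "q2": ["d1", "d3", "d9"],  # relevant doc at rank 1 -> RR = 1.0
    "q3": ["d4", "d6", "d8"],  # no relevant doc retrieved -> RR = 0.0
    "q4": ["d0", "d1", "d2"],  # no relevant doc retrieved -> RR = 0.0
}
# Toy judgments: each query id maps to its set of relevant passage ids.
qrels = {"q1": {"d2"}, "q2": {"d1"}, "q3": {"d99"}, "q4": {"d42"}}

per_query = [reciprocal_rank(runs[q], qrels[q]) for q in sorted(qrels)]
mrr_at_10 = mean(per_query)    # the leaderboard-style aggregate: 0.375
median_rr = median(per_query)  # 0.25 -- reveals the poorly served queries
```

Even in this toy setting the median falls well below the mean because half the queries are not served at all; the observation developed below is that the same skew appears at the scale of the full Dev set.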
The contrast between mean and median is quite meaningful and shows that a significant number of the queries obtain an average precision lower than the overall reported average. We will show later in the paper that even for the