MS MARCO Chameleons: Challenging the MS MARCO
Leaderboard with Extremely Obstinate Queries
Negar Arabzadeh
University of Waterloo
narabzad@uwaterloo.ca
Bhaskar Mitra
Microsoft
bmitra@microsoft.com
Ebrahim Bagheri
Ryerson University
bagheri@ryerson.ca
ABSTRACT
In recent years, with the growing influence of neural
architectures, tasks such as ad hoc retrieval have witnessed
impressive improvements in performance. In this paper, we go beyond
the overall performance of state-of-the-art rankers and empirically
study their performance from a finer-grained perspective.
We find that while neural rankers have been able to consistently
improve performance, this has been in part thanks to a specific set
of queries from within the larger query set. We systematically show
that there are subsets of queries that are difficult for each and every
one of the neural rankers, which we refer to as obstinate queries.
We show that obstinate queries are similar to easier queries in terms
of their number of judged relevant documents and
the length of the query itself, yet they are far more difficult
for existing rankers to satisfy. Furthermore, we observe that query
reformulation methods cannot help these queries. On this basis, we
present three datasets derived from the MS MARCO Dev set, called
the MS MARCO Chameleon datasets. We believe that the next
breakthrough in performance will necessarily need to consider
the queries in the MS MARCO Chameleons and, as such, propose that
a well-rounded evaluation strategy for any new ranker should
include performance measures on both the overall MS MARCO
dataset and the proposed MS MARCO Chameleon datasets.
CCS CONCEPTS
· Information systems → Information retrieval; Evaluation
of retrieval results; Retrieval effectiveness; Retrieval efficiency;
Information retrieval query processing; Query reformulation.
KEYWORDS
Query Difficulty, Information Retrieval, Query Reformulation
ACM Reference Format:
Negar Arabzadeh, Bhaskar Mitra, and Ebrahim Bagheri. 2021. MS MARCO
Chameleons: Challenging the MS MARCO Leaderboard with Extremely
Obstinate Queries. In Proceedings of the 30th ACM International Conference
on Information and Knowledge Management (CIKM '21), November 1–5, 2021,
Virtual Event, QLD, Australia. ACM, New York, NY, USA, 10 pages. https:
//doi.org/10.1145/3459637.3482011
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CIKM '21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11…$15.00
https://doi.org/10.1145/3459637.3482011
1 INTRODUCTION
Recent advances in neural information processing have made a
noticeable impact on many information retrieval tasks, including
question answering [14, 20, 25], ad hoc retrieval [10, 15, 16, 26, 31],
and knowledge graph search [11], to name a few. In particular,
the ad hoc retrieval task has witnessed a number of recent neural
(re)rankers that have shown impressive performance improvements
over traditional retrieval methods [13]. These developments have
been made possible, in part, thanks to the large-scale datasets such
as MS MARCO [32] that provide a large number of queries and
their associated relevance judgements, which can be used for train-
ing neural rankers. A review of the leaderboard associated
with the MS MARCO passage retrieval dataset shows that the performance
improvements gained over the past two years are impressive. For
instance, the best run submitted to the MS MARCO leaderboard
in 2018 produced an MRR@10 of 0.271 on the development set,
while the best run submitted in 2020 reported 0.426 on the same
metric and dataset. This means that the effectiveness of the ranking
methods has improved by roughly 57% (relative) over a two-year
period.
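As a concrete illustration, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage, counting only the top 10 results. The following minimal sketch uses made-up per-query ranks; only the two leaderboard scores (0.271 and 0.426) come from the text above.

```python
def mrr_at_10(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant passage
    for each query, or None if no relevant passage is retrieved at all."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 10:
            total += 1.0 / rank  # queries with no top-10 hit contribute 0
    return total / len(first_relevant_ranks)

# Three toy queries: a hit at rank 1, a hit at rank 4, and a miss.
print(round(mrr_at_10([1, 4, None]), 4))  # 0.4167

# Relative improvement between the 2018 and 2020 leaderboard bests:
print(round((0.426 - 0.271) / 0.271, 3))  # 0.572, i.e. roughly 57%
```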
While the MS MARCO leaderboard and the authors of many
papers report effectiveness over the whole collection of queries,
the focus of this paper is to dig deeper into
the performance of recent state-of-the-art neural rankers at the
query level and to explore whether the improvements obtained by
neural rankers are consistent across the whole dataset.
There are many ranking stacks that achieve state-of-the-art
performance through multi-stage ranking [8, 34]; in this work,
however, we focus only on single-stage retrieval, since improving
the first stage of the ranking stack would consequently boost the
performance of the entire stack.
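To make the single-stage focus concrete, the toy sketch below (not the authors' code; all functions, documents, and scores are hypothetical) shows why the first stage bounds the whole stack: a reranker can only reorder the candidates the first stage hands it.

```python
def first_stage(query, corpus, k=3):
    # Toy lexical scorer: rank documents by query-term overlap.
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def rerank(query, candidates):
    # Stand-in for a neural reranker; here it simply prefers shorter documents.
    return sorted(candidates, key=len)

corpus = [
    "a long survey of neural ad hoc retrieval models and architectures",
    "neural rankers for ad hoc retrieval",
    "cooking pasta at home",
]
candidates = first_stage("neural ad hoc retrieval", corpus)
print(rerank("neural ad hoc retrieval", candidates))
# A relevant document the first stage misses can never be recovered downstream.
```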
Based on an empirical study of the runs of five leading neural
first-stage retrieval methods, we find that there is a consistent
set of poorly performing queries that cannot be addressed by any
of the existing neural rankers. We additionally observe that the
performance improvements achieved by neural rankers are due to
gradual gains on a certain subset of the dataset;
as such, the performance improvements reported in the literature
are not necessarily due to consistent performance improvement
across all of the queries.
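The kind of per-query analysis described here can be sketched as follows. The ranker names and reciprocal-rank values are invented for illustration, and a query is treated as failed when its reciprocal rank is zero (no relevant passage retrieved).

```python
# Hypothetical per-query reciprocal ranks for three rankers.
per_ranker_rr = {
    "ranker_a": {"q1": 1.0, "q2": 0.0, "q3": 0.5, "q4": 0.0},
    "ranker_b": {"q1": 0.5, "q2": 0.0, "q3": 1.0, "q4": 0.0},
    "ranker_c": {"q1": 1.0, "q2": 0.0, "q3": 0.33, "q4": 0.1},
}

# A query is "obstinate" if every ranker fails on it.
failed_sets = [
    {q for q, rr in scores.items() if rr == 0.0}
    for scores in per_ranker_rr.values()
]
obstinate = set.intersection(*failed_sets)
print(sorted(obstinate))  # ['q2']  (q4 is rescued by ranker_c)
```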
To substantiate our discussion, let us consider several
state-of-the-art methods that have shown strong performance on
the 6,980 queries in the MS MARCO Dev set. These methods, along
with the mean and median of their average precision and reciprocal
rank, are reported in Table 1. The contrast between mean and median
is quite meaningful and shows that a significant number of
queries report an average precision below the overall reported
average. We will show later in the paper that even for the