Evaluation of Methods for Relative Comparison of Retrieval Systems Based on Clickthroughs

Jing He
Department of Computer Science and Technology, Peking University, Beijing, China
hj@net.pku.edu.cn

Chengxiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
czhai@cs.uiuc.edu

Xiaoming Li
Department of Computer Science and Technology, Peking University, Beijing, China
lxm@pku.edu.cn

ABSTRACT

The Cranfield evaluation method has some disadvantages, including its high cost in labor and its inadequacy for evaluating interactive retrieval techniques. As a very promising alternative, automatic comparison of retrieval systems based on the observed clicking behavior of users has recently been studied. Several methods have been proposed, but there has so far been no systematic way to assess which strategy is better, making it difficult to choose a good method for real applications. In this paper, we propose a general way to evaluate these relative comparison methods with two measures: utility to users (UtU) and effectiveness of differentiation (EoD). We evaluate two state-of-the-art methods by systematically simulating different retrieval scenarios. Inspired by the weaknesses of these methods revealed through our evaluation, we further propose a novel method that considers the positions of clicked documents. Experimental results show that our new method performs better than the existing methods.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information Search and Retrieval; H.3.4 [Systems and Software]: Performance Evaluation

General Terms: Experimentation, Measurement

Keywords: information retrieval, implicit feedback, evaluation

1. INTRODUCTION

Evaluation of an information retrieval (IR) system is critical for improving search techniques. So far, the dominant method for IR evaluation has been the Cranfield evaluation method.
However, it has some disadvantages, such as its high cost in labor and its inadequacy for interactive retrieval evaluation. As a promising alternative, automatic evaluation of retrieval systems based on the implicit feedback of users has recently been studied [2, 4]. There are two main categories of methods, based on "absolute metric" and "relative comparison test", respectively. Methods of the first category infer the absolute relevance of the retrieved documents based on the implicit feedback. Methods of the second category [2, 4] attempt to compare two systems by leveraging clickthroughs. Previous work showed that such a relative comparison strategy is more robust against the bias in clickthroughs [4].

CIKM'09, November 2–6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$10.00.

Although several methods have been proposed for the relative comparison of retrieval systems [2, 4], there has so far been no systematic way to assess which method is better, making it difficult to choose a good method for any real application. In this paper, we propose a general way to evaluate these relative comparison methods in two dimensions: (1) utility to users, which refers to the perceived utility of the merged results from a user's perspective, and (2) effectiveness of differentiation, which refers to their effectiveness in distinguishing different retrieval systems. The utility to users of a method can be measured by applying standard retrieval measures, such as MAP, to the merged result list.
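To make the utility-to-users measure concrete, the following is a minimal Python sketch of average precision (the per-query component of MAP) applied to a merged result list. The function names and data shapes are illustrative, not taken from the paper:

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list, given the set of relevant docs."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """MAP: average of per-query average precision over all queries in the run.

    `run` maps query id -> merged ranked list; `qrels` maps query id -> relevant set.
    """
    return sum(average_precision(docs, qrels.get(q, set()))
               for q, docs in run.items()) / len(run)
```

For example, `average_precision(["d1", "d2", "d3"], {"d1", "d3"})` yields (1/1 + 2/3)/2 = 5/6; MAP simply averages this quantity over the evaluated queries.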
The effectiveness of differentiation of a method can be measured based on the accuracy of its prediction of which system is better. With these measures, we can systematically evaluate any relative comparison method using a large sample of simulated search results from two retrieval systems. Simulation also allows us to systematically vary the composition of the samples to simulate different scenarios. Such variations are necessary to help understand the relative strengths and weaknesses of different methods in different application scenarios.

Using the proposed evaluation method, we systematically evaluated two state-of-the-art methods of relative comparison: balanced [2] and team-draft [4]. The results show that: (1) the two methods have identical utility to users; (2) the balanced method is more effective in distinguishing retrieval systems than the team-draft method in most cases. Our evaluation also reveals a common deficiency of both existing methods, which inspired us to further propose a novel extension of the balanced method, called preference-based balanced, which is shown to outperform both existing methods.

2. CLICKTHROUGH-BASED COMPARISON OF RETRIEVAL SYSTEMS

The basic idea of comparing retrieval systems based on clickthroughs is to interleave the search results returned by different systems for the same query and present a merged list of results to the user. The clickthroughs of users are then recorded and leveraged to infer which retrieval system has returned better results.

In general, it is quite challenging to accurately predict which system is better based only on the limited number