Managing the Quality of Large-Scale Crowdsourcing

Jeroen B. P. Vuurens
Delft University of Technology
Delft, The Netherlands
j.b.p.vuurens@tudelft.nl

Arjen P. de Vries
Centrum Wiskunde & Informatica
Amsterdam, The Netherlands
arjen@acm.org

Carsten Eickhoff
Delft University of Technology
Delft, The Netherlands
c.eickhoff@tudelft.nl

ABSTRACT
Crowdsourcing can be used to obtain the relevance judgments needed for the evaluation of information retrieval systems. However, the quality of crowdsourced relevance judgments may be questionable; a substantial number of workers appear to spam HITs in order to maximize their hourly wages, and workers may know less than expert annotators about the topic being queried. The task for the TREC 2011 Crowdsourcing track was to obtain high-quality relevance judgments. We improve the quality of the obtained annotations by removing random judgments and aggregating multiple annotations per query-document pair. Based on the estimated proportions of correctly judged query-document pairs in the crowdsourced relevance judgments and in previous TREC qrels, we conclude that crowdsourcing is a feasible alternative to expert annotation.

1. INTRODUCTION
Evaluation of IR systems generally relies on known ground truth for every query-document pair. Ground truth is commonly obtained from expert annotators who manually judge the relevance of each pair, an expensive and time-consuming process [1]. Alternatively, relevance judgments can be crowdsourced on the Internet, using anonymous web users (known as workers) as non-expert annotators [1]. Through crowdsourcing services such as Amazon's Mechanical Turk (AMT) or CrowdFlower, it is relatively inexpensive to obtain judgments from a large number of workers in a short amount of time. Typically, several judgments are obtained per query-document pair, and a consensus algorithm aggregates the judgments into a single outcome per pair [2].
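The simplest such consensus algorithm is a per-pair majority vote. The sketch below is a minimal illustration of this baseline, not the consensus algorithm used in this work; the function name and tuple layout are our own for exposition.

```python
from collections import Counter, defaultdict

def majority_vote(judgments):
    """Aggregate (worker, pair, label) tuples into one label per pair.

    A minimal consensus baseline: each query-document pair receives the
    label chosen by the most workers; ties are broken arbitrarily by
    Counter ordering. More refined consensus algorithms may instead
    weight each worker's vote by an estimate of that worker's reliability.
    """
    votes = defaultdict(Counter)
    for worker, pair, label in judgments:
        votes[pair][label] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}
```

With three workers judging one pair as rel, rel, nonrel, the pair resolves to rel.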
The use of crowdsourcing for relevance judgments comes with new challenges. There have been several reports of workers spamming questions [3], [4], [5]. The random votes these workers produce can seriously affect consensus, especially at higher spam rates. Attempts to suppress random votes within a consensus algorithm have shown mediocre results [6]. We therefore use an elimination strategy that detects spam and removes it from the dataset before consensus is determined. Section 2 discusses the importance of removing random judgments while leaving room for differences of opinion. Section 3 describes the design of the HIT, the spam detection, and the management tool used to obtain results for Task 1. Section 4 presents an adapted approach for computing consensus over the data for Task 2. The results for both tasks are described and analyzed in Section 5. In Section 6 we conclude that the results are comparable to those of expert annotators, at lower cost.

2. FRAMEWORK OF REFERENCE

2.1 Quality of relevance judgments
There is a distinct difference between random judgments and differences of opinion. Differences of opinion are inherent to subjective information needs. Voorhees compared the variance of relevance judgments created by different assessors and the intersection between assessors [8]. She concluded that different relevance assessments, created under disparate conditions, produce essentially the same comparative evaluation results. This shows that differences of opinion do not affect the usefulness of qrels for evaluation. Random judgments, on the other hand, are useless for the evaluation of IR systems; judged against purely random labels, a perfect IR system is expected to obtain the same score as a random ranker, regardless of the measure used. While there is no need to resolve differences of opinion amongst crowdsourcing workers, random judgments increase the variance of evaluation measures and can render a test set useless if they are too abundant.
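One common family of elimination strategies removes workers whose votes look random before consensus is computed. The sketch below is a hedged illustration of that general idea, not the elimination strategy developed in this work: it scores each worker by agreement with a provisional per-pair majority and drops workers below a threshold (the function name and `min_agreement` parameter are our own).

```python
from collections import Counter, defaultdict

def filter_random_workers(judgments, min_agreement=0.5):
    """Drop all judgments from workers who rarely agree with the majority.

    `judgments` is a list of (worker, pair, label) tuples. A provisional
    majority label is computed per pair; each worker is then scored by the
    fraction of their votes that match those labels. Workers who vote
    randomly on a task with several label options tend to score low, so
    judgments from workers below `min_agreement` are removed.
    """
    votes = defaultdict(Counter)
    for worker, pair, label in judgments:
        votes[pair][label] += 1
    majority = {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

    agree, total = Counter(), Counter()
    for worker, pair, label in judgments:
        total[worker] += 1
        if label == majority[pair]:
            agree[worker] += 1

    keep = {w for w in total if agree[w] / total[w] >= min_agreement}
    return [j for j in judgments if j[0] in keep]
```

A caveat of this simple scheme is that it penalizes honest differences of opinion along with spam, which is why the threshold must be chosen conservatively.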
We expect that the quality of relevance judgments can be increased by decreasing the proportion of random judgments.

2.2 Consensus for relevance judgments
The relevance judgments obtained from anonymous crowdsourcing workers are of unknown quality. Only some of the workers may behave ethically, following the instructions and aiming to produce meaningful results [6]. A common approach is to obtain several judgments for each query-document pair and to combine these with a consensus algorithm [1]. The redundant information helps to filter out judgment errors and cheat attempts.

2.3 Random judgments
Search results often contain duplicate documents, which share the same content but have different URLs. In previous TREC datasets, duplicate documents retrieved for the same topic were judged by the same assessor. Scholer et al. found that 18% of the duplicate documents in previous TREC datasets were judged inconsistently when judgments are converted to a binary scale [7]. Every inconsistently judged duplicate can be seen as a random element within the set of relevance judgments, and has the same value as random data when used in evaluation.

TREC 2011 Crowdsourcing track, team TUD_DMIR
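An inconsistency rate of the kind Scholer et al. report can be measured as follows. This is a minimal sketch under our own assumptions (not their actual procedure or code): judgments are given as graded relevance levels per (topic, document), duplicate groups are known per topic, and a level at or above `relevant_from` counts as relevant on the binary scale.

```python
def duplicate_inconsistency_rate(judgments, duplicate_groups, relevant_from=1):
    """Fraction of duplicate groups judged inconsistently on a binary scale.

    `judgments` maps (topic, doc) to a graded relevance level;
    `duplicate_groups` maps a topic to lists of docs with identical
    content. Graded judgments are binarized (level >= relevant_from is
    relevant); a group counts as inconsistent if its binarized labels
    differ across the duplicates.
    """
    inconsistent = total = 0
    for topic, groups in duplicate_groups.items():
        for docs in groups:
            labels = {judgments[(topic, d)] >= relevant_from for d in docs}
            total += 1
            if len(labels) > 1:
                inconsistent += 1
    return inconsistent / total if total else 0.0
```

For example, a duplicate pair judged highly relevant and not relevant is inconsistent, while a pair judged relevant and highly relevant is consistent once binarized.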