An entropic score to rank annotators for
crowdsourced labeling tasks
Vikas C. Raykar
Siemens Healthcare, Malvern, PA 19355
Email: vikas.raykar@siemens.com
Shipeng Yu
Siemens Healthcare, Malvern, PA 19355
Email: shipeng.yu@siemens.com
Abstract—With the advent of crowdsourcing services it has
become quite cheap and reasonably effective to get a dataset
labeled by multiple annotators in a short amount of time.
Various methods have been proposed to estimate the consensus
labels by correcting for the bias of annotators with different
kinds of expertise. Often we have low quality annotators or
spammers, that is, annotators who assign labels randomly (e.g., without
actually looking at the instance). Spammers can make the cost
of acquiring labels very expensive and can potentially degrade
the quality of the consensus labels. In this paper we propose a
score (based on the reduction in entropy) which can be used to
rank the annotators, with spammers having a score close to
zero and good annotators having a score close to one.
Index Terms—crowdsourcing, ranking annotators, entropic
score.
I. RANKING ANNOTATORS FOR CROWDSOURCING
Annotating an unlabeled dataset is one of the major bottlenecks
in using supervised learning to build good predictive
models for pattern recognition. Getting a dataset labeled by
experts can be expensive and time consuming. With the
advent of crowdsourcing services (Amazon’s Mechanical Turk
(AMT) [1] being a prime example) it has become quite easy
and inexpensive to acquire labels from a large number of
annotators in a short amount of time (see [2], [3], and [4]
for some computer vision and natural language processing
case studies). For example in AMT the requesters are able
to pose tasks known as HITs (Human Intelligence Tasks).
Workers (called providers) can then browse among existing
tasks and complete them for a small monetary payment set by
the requester.
However one drawback of most crowdsourcing services is
that we do not have control over the quality of the annotators.
The annotators can come from a diverse pool including genuine
experts, novices, biased annotators, malicious annotators,
and spammers. Hence, in order to get good quality labels,
requesters typically get each instance labeled by multiple
annotators, and these multiple annotations are then consolidated
either using a simple majority voting or more sophisticated
methods that model and correct for the annotator biases [5],
[6] and/or task complexity [7]. While majority voting assumes
all annotators are equally good, the more sophisticated methods
model the annotator performance and then appropriately give
different weights to the annotators to reach the consensus.
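The simple consolidation strategy mentioned above can be sketched in a few lines. This is a minimal illustration rather than code from the paper; the `majority_vote` helper and its tie-breaking rule (lowest class index wins) are our own choices:

```python
import numpy as np

def majority_vote(labels, K):
    """Consolidate multi-annotator labels by simple majority voting.

    labels : (N, M) matrix; labels[i, j] is annotator j's label for
             instance i, encoded in {0, ..., K-1}.
    Returns the (N,) consensus labels (ties broken by lowest class index).
    """
    N = labels.shape[0]
    consensus = np.empty(N, dtype=int)
    for i in range(N):
        # Count votes for each class and pick the most frequent one.
        consensus[i] = np.bincount(labels[i], minlength=K).argmax()
    return consensus

# Three annotators labeling three binary instances.
L = np.array([[0, 0, 1],
              [1, 1, 1],
              [0, 1, 1]])
y_hat = majority_vote(L, K=2)
```

Note that every annotator's vote carries equal weight here, which is exactly the assumption the bias-correcting methods relax.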
In this paper we are interested in ranking annotators based
on their contribution towards the final consensus
labels. A mechanism to rank annotators is a desirable feature
for any crowdsourcing marketplace. For example, one can
give monetary bonuses to good annotators and deny payments
to spammers and low quality annotators. In our context a
spammer is an annotator who assigns random labels (maybe
because the annotator does not understand the labeling criteria,
or does not look at the instances when labeling). Spammers
can significantly increase the cost of acquiring annotations and
at the same time decrease the accuracy of the final consensus
labels. The main contribution of this paper is to compute a
scalar score which can be used to rank the annotators,
with spammers having a score close to zero and good
annotators having a score close to one. This is achieved by
computing the expected reduction in entropy of the final label
due to the labels from that annotator. A similar score can
also be used to estimate the quality of the resulting consensus
ground truth, which is useful since we would like a sense of
how reliable the consensus labels are.
A recent attempt to quantify the quality of the workers based
on the confusion matrix was made by [8], who
transformed the observed labels into posterior soft labels based
on the estimated confusion matrix.
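To make the idea concrete, here is one way such an entropy-based score could be computed for a single annotator: as the reduction in the entropy of the true label after observing that annotator's label, normalized by the prior entropy. This is an illustrative sketch under our own assumptions (a known class prior and confusion matrix, and normalization by the prior entropy); the paper's exact definition is developed later.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropic_score(prior, alpha_j):
    """Normalized expected reduction in entropy of the true label
    due to annotator j's label.

    prior   : (K,) prior over the true class.
    alpha_j : (K, K) confusion matrix, alpha_j[c, k] = Pr[label k | true c].
    """
    h_prior = entropy(prior)
    # Joint Pr[true = c, observed = k] and marginal over observed labels.
    joint = prior[:, None] * alpha_j              # (K, K)
    p_obs = joint.sum(axis=0)                     # (K,)
    # Conditional entropy H(true label | annotator j's label).
    h_cond = 0.0
    for k in range(len(p_obs)):
        if p_obs[k] > 0:
            h_cond += p_obs[k] * entropy(joint[:, k] / p_obs[k])
    # Mutual information between true and observed label, in [0, H(true)],
    # normalized so a spammer scores 0 and a perfect annotator scores 1.
    return (h_prior - h_cond) / h_prior

prior = np.array([0.5, 0.5])
spammer = np.array([[0.5, 0.5],    # labels independent of the truth
                    [0.5, 0.5]])
expert = np.array([[0.99, 0.01],   # nearly perfect sensitivity/specificity
                   [0.01, 0.99]])
```

With these inputs the spammer scores exactly zero (its label carries no information about the truth) while the expert scores close to one.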
II. ANNOTATOR MODEL FOR CATEGORICAL LABELS
Suppose there are $K \ge 2$ classes. Let $y_i^j \in \{1,\ldots,K\}$
be the label assigned to the $i$th instance by the $j$th annotator,
and let $y_i \in \{1,\ldots,K\}$ be the actual (unobserved) label. We
model each annotator by the multinomial parameters
$\alpha_c^j = (\alpha_{c1}^j,\ldots,\alpha_{cK}^j)$, where
$$\alpha_{ck}^j := \Pr[y_i^j = k \mid y_i = c], \qquad \sum_{k=1}^{K} \alpha_{ck}^j = 1.$$
The term $\alpha_{ck}^j$ denotes the probability that annotator $j$ assigns
class $k$ to an instance given the true class is $c$. When $K = 2$,
$\alpha_{11}^j$ and $\alpha_{00}^j$ are the sensitivity and specificity, respectively.
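As an illustration of this model, the following sketch draws annotator labels from the multinomial parameters; the `simulate_labels` helper and the example confusion matrix are our own, and class indices start at 0 rather than 1:

```python
import numpy as np

def simulate_labels(true_labels, alpha, rng):
    """Draw y_i^j from the multinomial annotator model.

    true_labels : (N,) true classes in {0, ..., K-1}.
    alpha       : (K, K) matrix with alpha[c, k] = Pr[y_i^j = k | y_i = c];
                  each row sums to 1.
    """
    K = alpha.shape[0]
    # For each instance, sample one label from the row of alpha
    # selected by the true class.
    return np.array([rng.choice(K, p=alpha[c]) for c in true_labels])

rng = np.random.default_rng(0)
# A binary annotator with specificity alpha[0, 0] = 0.6 and
# sensitivity alpha[1, 1] = 0.9.
alpha = np.array([[0.6, 0.4],
                  [0.1, 0.9]])
y_true = rng.integers(0, 2, size=1000)
y_obs = simulate_labels(y_true, alpha, rng)
```

Over many instances the empirical confusion matrix of `y_obs` against `y_true` recovers `alpha`, which is what the estimation procedure below exploits.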
In this paper we do not dwell too much on the estimation of
the annotator model parameters. Based on $N$ instances
$\mathcal{D} = \{y_i^1, \ldots, y_i^M\}_{i=1}^{N}$ labeled by $M$ annotators, the maximum
likelihood estimates of the annotator parameters and also the
consensus ground truth can be computed iteratively [5], [6]
via the Expectation Maximization (EM) algorithm summarized
in Algorithm 1. The EM algorithm iteratively establishes
a particular gold standard (initialized via majority voting),
2011 Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics
978-0-7695-4599-8/11 $26.00 © 2011 IEEE
DOI 10.1109/NCVPRIPG.2011.14