An entropic score to rank annotators for
crowdsourced labeling tasks
Vikas C. Raykar
Siemens Healthcare, Malvern, PA 19355
Email: vikas.raykar@siemens.com
Shipeng Yu
Siemens Healthcare, Malvern, PA 19355
Email: shipeng.yu@siemens.com
Abstract—With the advent of crowdsourcing services it has
become quite cheap and reasonably effective to get a dataset
labeled by multiple annotators in a short amount of time.
Various methods have been proposed to estimate the consensus
labels by correcting for the bias of annotators with different
kinds of expertise. Often we have low quality annotators or
spammers, that is, annotators who assign labels randomly (e.g., without
actually looking at the instance). Spammers can make the cost
of acquiring labels very expensive and can potentially degrade
the quality of the consensus labels. In this paper we propose a
score (based on the reduction in entropy) which can be used to
rank the annotators, with spammers having a score close to
zero and good annotators having a score close to one.
Index Terms—crowdsourcing, ranking annotators, entropic
score.
I. RANKING ANNOTATORS FOR CROWDSOURCING
Annotating an unlabeled dataset is one of the major bottlenecks
in using supervised learning to build good predictive
models for pattern recognition. Getting a dataset labeled by
experts can be expensive and time consuming. With the
advent of crowdsourcing services (Amazon’s Mechanical Turk
(AMT) [1] being a prime example) it has become quite easy
and inexpensive to acquire labels from a large number of
annotators in a short amount of time (see [2], [3], and [4]
for some computer vision and natural language processing
case studies). For example in AMT the requesters are able
to pose tasks known as HITs (Human Intelligence Tasks).
Workers (called providers) can then browse among existing
tasks and complete them for a small monetary payment set by
the requester.
However one drawback of most crowdsourcing services is
that we do not have control over the quality of the annotators.
The annotators can come from a diverse pool including genuine
experts, novices, biased annotators, malicious annotators,
and spammers. Hence, in order to get good quality labels,
requesters typically get each instance labeled by multiple
annotators, and these multiple annotations are then consolidated
either using a simple majority voting or more sophisticated
methods that model and correct for the annotator biases [5],
[6] and/or task complexity [7]. While majority voting assumes
all annotators are equally good, the more sophisticated methods
model the annotator performance and then appropriately give
different weights to the annotators to reach the consensus.
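The simple consolidation strategy mentioned above can be sketched in a few lines. This is a minimal illustration rather than code from the paper; the `majority_vote` helper and its tie-breaking rule (lowest class index wins) are our own choices:

```python
import numpy as np

def majority_vote(labels, K):
    """Consolidate multi-annotator labels by simple majority voting.

    labels : (N, M) matrix; labels[i, j] is annotator j's label for
             instance i, encoded in {0, ..., K-1}.
    Returns the (N,) consensus labels (ties broken by lowest class index).
    """
    N = labels.shape[0]
    consensus = np.empty(N, dtype=int)
    for i in range(N):
        # Count votes for each class and pick the most frequent one.
        consensus[i] = np.bincount(labels[i], minlength=K).argmax()
    return consensus

# Three annotators labeling three binary instances.
L = np.array([[0, 0, 1],
              [1, 1, 1],
              [0, 1, 1]])
y_hat = majority_vote(L, K=2)
```

Note that every annotator's vote carries equal weight here, which is exactly the assumption the bias-correcting methods relax.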
In this paper we are interested in ranking annotators based
on their contribution towards the final consensus
labels. A mechanism to rank annotators is a desirable feature
for any crowdsourcing marketplace. For example, one can
give monetary bonuses to good annotators and deny payments
to spammers and low quality annotators. In our context a
spammer is an annotator who assigns random labels (maybe
because the annotator does not understand the labeling criteria,
or does not look at the instances when labeling). Spammers
can significantly increase the cost of acquiring annotations and
at the same time decrease the accuracy of the final consensus
labels. The main contribution of this paper is to compute a
scalar score which can be used to rank the annotators,
with spammers having a score close to zero and good
annotators having a score close to one. This is achieved by
computing the expected reduction in entropy of the final label
due to the labels from that annotator. A similar score can
also be used to estimate the quality of the resulting consensus
ground truth, which is useful since we would like a sense of
how reliable the consensus labels are.
A recent attempt to quantify the quality of the workers based
on the confusion matrix was made by [8], who
transformed the observed labels into posterior soft labels based
on the estimated confusion matrix.
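To make the idea concrete, here is one way such an entropy-based score could be computed for a single annotator: as the reduction in the entropy of the true label after observing that annotator's label, normalized by the prior entropy. This is an illustrative sketch under our own assumptions (a known class prior and confusion matrix, and normalization by the prior entropy); the paper's exact definition is developed later.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropic_score(prior, alpha_j):
    """Normalized expected reduction in entropy of the true label
    due to annotator j's label.

    prior   : (K,) prior over the true class.
    alpha_j : (K, K) confusion matrix, alpha_j[c, k] = Pr[label k | true c].
    """
    h_prior = entropy(prior)
    # Joint Pr[true = c, observed = k] and marginal over observed labels.
    joint = prior[:, None] * alpha_j              # (K, K)
    p_obs = joint.sum(axis=0)                     # (K,)
    # Conditional entropy H(true label | annotator j's label).
    h_cond = 0.0
    for k in range(len(p_obs)):
        if p_obs[k] > 0:
            h_cond += p_obs[k] * entropy(joint[:, k] / p_obs[k])
    # Mutual information between true and observed label, in [0, H(true)],
    # normalized so a spammer scores 0 and a perfect annotator scores 1.
    return (h_prior - h_cond) / h_prior

prior = np.array([0.5, 0.5])
spammer = np.array([[0.5, 0.5],    # labels independent of the truth
                    [0.5, 0.5]])
expert = np.array([[0.99, 0.01],   # nearly perfect sensitivity/specificity
                   [0.01, 0.99]])
```

With these inputs the spammer scores exactly zero (its label carries no information about the truth) while the expert scores close to one.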
II. ANNOTATOR MODEL FOR CATEGORICAL LABELS
Suppose there are $K \ge 2$ classes. Let $y_i^j \in \{1,\ldots,K\}$
be the label assigned to the $i$th instance by the $j$th annotator,
and let $y_i \in \{1,\ldots,K\}$ be the actual (unobserved) label. We
model each annotator by the multinomial parameters
$\alpha_c^j = (\alpha_{c1}^j,\ldots,\alpha_{cK}^j)$, where
$$\alpha_{ck}^j := \Pr[y_i^j = k \mid y_i = c], \qquad \sum_{k=1}^{K} \alpha_{ck}^j = 1.$$
The term $\alpha_{ck}^j$ denotes the probability that annotator $j$ assigns
class $k$ to an instance given the true class is $c$. When $K = 2$,
$\alpha_{11}^j$ and $\alpha_{00}^j$ are the sensitivity and specificity, respectively.
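As an illustration of this model, the following sketch draws annotator labels from the multinomial parameters; the `simulate_labels` helper and the example confusion matrix are our own, and class indices start at 0 rather than 1:

```python
import numpy as np

def simulate_labels(true_labels, alpha, rng):
    """Draw y_i^j from the multinomial annotator model.

    true_labels : (N,) true classes in {0, ..., K-1}.
    alpha       : (K, K) matrix with alpha[c, k] = Pr[y_i^j = k | y_i = c];
                  each row sums to 1.
    """
    K = alpha.shape[0]
    # For each instance, sample one label from the row of alpha
    # selected by the true class.
    return np.array([rng.choice(K, p=alpha[c]) for c in true_labels])

rng = np.random.default_rng(0)
# A binary annotator with specificity alpha[0, 0] = 0.6 and
# sensitivity alpha[1, 1] = 0.9.
alpha = np.array([[0.6, 0.4],
                  [0.1, 0.9]])
y_true = rng.integers(0, 2, size=1000)
y_obs = simulate_labels(y_true, alpha, rng)
```

Over many instances the empirical confusion matrix of `y_obs` against `y_true` recovers `alpha`, which is what the estimation procedure below exploits.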
In this paper we do not dwell too much on the estimation of
the annotator model parameters. Based on $N$ instances
$\mathcal{D} = \{y_i^1, \ldots, y_i^M\}_{i=1}^{N}$ labeled by $M$ annotators, the maximum
likelihood estimates of the annotator parameters and also the
consensus ground truth can be computed iteratively [5], [6]
via the Expectation Maximization (EM) algorithm summarized
in Algorithm 1. The EM algorithm iteratively establishes
a particular gold standard (initialized via majority voting),
2011 Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics
978-0-7695-4599-8/11 $26.00 © 2011 IEEE
DOI 10.1109/NCVPRIPG.2011.14