Controlling Quality and Handling Fraud in Large Scale Crowdsourcing Speech
Data Collections
Spencer Rothwell, Ahmad Elshenawy, Steele Carter, Daniela Braga, Faraz Romani,
Michael Kennewick, Bob Kennewick
VoiceBox Technologies, Bellevue, WA, USA
{spencerr, ahmade, steelec, danielab, farazr, michaelk, bobk}@voicebox.com
Abstract
This paper presents strategies for measuring and assuring high
quality when performing large-scale crowdsourcing data
collections for acoustic model training. We examine different
types of spam encountered while collecting and validating
speech audio from unmanaged crowds and describe how we
were able to identify these sources of spam and prevent our
data from being tainted. We built a custom Android mobile
application which funnels workers from a crowdsourcing
platform and allows us to gather recordings and control
conditions of the audio collection. We use a 2-step validation
process which ensures that workers are paid only when they
have actually used our application to complete their tasks. The
collected audio is run through a second crowdsourcing job
designed to validate that the speech matches the text with
which the speakers were prompted. For the validation task,
gold-standard test questions are used in combination with
expected answer distribution rules and monitoring of worker
activity levels over time to detect and expel likely spammers.
Inter-annotator agreement is used to ensure high confidence of
validated judgments. This process yielded millions of
recordings with matching transcriptions in American English.
The resulting set is 96% accurate with only minor errors.
Index Terms: unmanaged crowdsourcing, mobile speech data
collection, validation, spam, quality control, Acoustic
Modeling (AM), Automatic Speech Recognition (ASR).
1. Introduction
Many tasks in Natural Language Processing (NLP) require
large amounts of annotated or transcribed language data. It is
expensive to create and transcribe this data by hand. Costs can
be significantly reduced through the use of crowdsourcing to
accomplish simple tasks. These cost reductions often come at
the price of introducing noise and consequently lowering data
quality. The challenge of crowdsourcing is therefore to
efficiently filter out the noise introduced by unreliable workers
in order to maintain high data quality. The most common
approach for maintaining crowdsourced data quality is to
introduce gold-standard work units alongside normal work
units. Worker accuracy can then be reliably measured, and
input from inaccurate workers rejected. This prevention
mechanism, while adequate for some crowdsourcing jobs, is
inadequate or inapplicable for more complicated tasks such as
audio collection. Additionally, malicious workers can find
innovative ways to circumvent gold-standard work units such
as learning the answers to test questions and then reusing those
answers under different accounts. We refer to this type of
tainted input as spam, which we distinguish from low-quality
data: spam is intentionally produced and easily spotted
because of the regular patterns it exhibits (e.g., copying
and pasting the same string repeatedly as responses for a task),
whereas low-quality data is less obvious and not necessarily
intentional (e.g., an utterance about the weather in response to a
scenario regarding music). Spam becomes especially
problematic when collecting high volumes of data because
workers have more time to devise strategies for bypassing the
existing validation mechanisms. Preventing spam requires a
robust approach, combining a number of spam detection
strategies to ensure that only high quality responses are
collected.
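To make the combination of filters concrete, the following Python sketch (our illustration, not the paper's implementation; the thresholds and data layout are hypothetical) rejects workers whose accuracy on gold-standard units falls below a cutoff and flags workers who repeatedly paste the same response, the spam pattern described above:

```python
from collections import Counter

def filter_workers(responses, gold_answers, min_accuracy=0.8, max_dup_ratio=0.5):
    """responses: list of (worker_id, unit_id, answer) tuples.
    gold_answers: dict mapping gold-standard unit_id -> correct answer.
    Returns the set of worker_ids judged trustworthy."""
    by_worker = {}
    for worker, unit, answer in responses:
        by_worker.setdefault(worker, []).append((unit, answer))

    trusted = set()
    for worker, items in by_worker.items():
        # Check accuracy on the gold-standard units embedded in the job.
        gold = [(u, a) for u, a in items if u in gold_answers]
        if gold:
            correct = sum(1 for u, a in gold if a == gold_answers[u])
            if correct / len(gold) < min_accuracy:
                continue  # fails the gold-standard accuracy check
        # Spam pattern: the same string submitted over and over.
        counts = Counter(a for _, a in items)
        top_count = counts.most_common(1)[0][1]
        if len(items) >= 5 and top_count / len(items) > max_dup_ratio:
            continue  # likely a copy-paste spammer
        trusted.add(worker)
    return trusted
```

A real pipeline would also track these signals over time, since, as noted above, spammers adapt their strategies during long-running collections.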
Our work describes quality control and spam detection
strategies used while collecting high volumes of crowdsourced
audio samples from text prompts. We use a custom-designed
audio elicitation mobile app which allows us to measure audio
properties which are indicative of spam. We also describe
crowdsourcing strategies for validating that collected audio
matches with given text. Additionally, we describe strategies
for preventing hackers from circumventing our quality control
and spam prevention tools. All of these strategies combined
provide us with large volumes of accurate speech data at very
low cost.
Since 2008, there has been a marked increase in the
number of NLP conference papers that use crowdsourcing to
achieve some goal [1], but there has not been significant
attention paid to the issue of cheating and spamming that
arises in crowdsourcing. Spam is an important issue to address
when one considers using crowdsourcing [13]; without any
quality control in place, a substantial portion of the crowd will
consist of spammers [4, 12]. While it is not uncommon
to see issues of quality control addressed in such papers [6, 7,
8, 9], quality control approaches are commonly limited to the
use of gold-standard questions, to which answers are known a
priori and used to evaluate worker contributions [5, 9, 10].
Other quality control approaches that have been adopted
include crowdsourced recruitment, whereby a task is designed
to find quality workers who are later manually selected for
further work [11], filtering through majority vote [5], and the use
of generated completion codes to determine legitimate worker
contributions [2].
One of the most comprehensive assessments of
illegitimate worker contributions in crowdsourcing tasks is
from Eickhoff and De Vries [2], who investigate the different
types of cheaters, the different ways in which they cheat, and
the task characteristics that attract them. They
demonstrate that the complexity of a task can strongly
influence the percentage of spam received, with simpler tasks
being much more vulnerable to spammer attention than more
complicated tasks. The authors also show how the amount of
INTERSPEECH 2015, September 6–10, 2015, Dresden, Germany
Copyright © 2015 ISCA
doi: 10.21437/Interspeech.2015-586