Controlling Quality and Handling Fraud in Large Scale Crowdsourcing Speech Data Collections

Spencer Rothwell, Ahmad Elshenawy, Steele Carter, Daniela Braga, Faraz Romani, Michael Kennewick, Bob Kennewick
VoiceBox Technologies, Bellevue, WA, USA
{spencerr, ahmade, steelec, danielab, farazr, michaelk, bobk}@voicebox.com

Abstract

This paper presents strategies for measuring and assuring high quality when performing large-scale crowdsourcing data collections for acoustic model training. We examine different types of spam encountered while collecting and validating speech audio from unmanaged crowds and describe how we were able to identify these sources of spam and prevent our data from being tainted. We built a custom Android mobile application which funnels workers from a crowdsourcing platform and allows us to gather recordings and control the conditions of the audio collection. We use a two-step validation process which ensures that workers are paid only when they have actually used our application to complete their tasks. The collected audio is run through a second crowdsourcing job designed to validate that the speech matches the text with which the speakers were prompted. For the validation task, gold-standard test questions are used in combination with expected answer distribution rules and monitoring of worker activity levels over time to detect and expel likely spammers. Inter-annotator agreement is used to ensure high confidence in the validated judgments. This process yielded millions of recordings with matching transcriptions in American English. The resulting set is 96% accurate, with the remaining errors being minor.

Index Terms: unmanaged crowdsourcing, mobile speech data collection, validation, spam, quality control, Acoustic Modeling (AM), Automatic Speech Recognition (ASR).

1. Introduction

Many tasks in Natural Language Processing (NLP) require large amounts of annotated or transcribed language data. Creating and transcribing this data by hand is expensive. Costs can be significantly reduced through the use of crowdsourcing to accomplish simple tasks, but these cost reductions often come at the price of introducing noise and consequently lowering data quality. The challenge of crowdsourcing is therefore to efficiently filter out the noise introduced by unreliable workers in order to maintain high data quality.

The most common approach for maintaining crowdsourced data quality is to introduce gold-standard work units alongside normal work units. Worker accuracy can then be reliably measured, and input from inaccurate workers can be rejected. This prevention mechanism, while adequate for some crowdsourcing jobs, is inadequate or inapplicable for more complicated tasks such as audio collection. Additionally, malicious workers can find innovative ways to circumvent gold-standard work units, such as learning the answers to test questions and then reusing those answers under different accounts. We refer to this type of tainted input as spam and distinguish it from low-quality data: the former is intentionally produced and easily spotted because of the regular patterns it shows (e.g., copying and pasting the same string repeatedly as responses to a task), whereas the latter is less obvious and not necessarily intentionally produced (e.g., an utterance about the weather in response to a scenario about music). Spam becomes especially problematic when collecting high volumes of data because workers have more time to devise strategies for bypassing the existing validation mechanisms.
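To illustrate how gold-standard questions and expected answer distribution rules can be combined in a validation job of this kind, the following Python sketch tracks per-worker statistics and flags likely spammers. It is a minimal illustration only: the WorkerMonitor class, its thresholds, and the warm-up counts are assumptions chosen for exposition, not the values or code used in our system.

```python
# Illustrative sketch only (assumed names and thresholds, not our production code):
# per-worker bookkeeping for a validation job that combines gold-standard
# accuracy with an expected answer distribution rule.
from collections import Counter

GOLD_ACCURACY_MIN = 0.8        # assumed minimum accuracy on gold questions
MIN_GOLD_SEEN = 5              # assumed number of gold questions before the rule applies
MAX_SINGLE_ANSWER_SHARE = 0.9  # assumed cap on how dominant any one answer may be
MIN_JUDGMENTS_FOR_CHECK = 20   # assumed warm-up before the distribution rule applies


class WorkerMonitor:
    def __init__(self):
        self.gold_seen = 0
        self.gold_correct = 0
        self.answers = Counter()

    def record(self, answer, gold_answer=None):
        """Record one judgment; gold_answer is set only for hidden test questions."""
        self.answers[answer] += 1
        if gold_answer is not None:
            self.gold_seen += 1
            self.gold_correct += int(answer == gold_answer)

    def is_suspect(self):
        # Rule 1: low accuracy on hidden gold-standard questions.
        if self.gold_seen >= MIN_GOLD_SEEN:
            if self.gold_correct / self.gold_seen < GOLD_ACCURACY_MIN:
                return True
        # Rule 2: answer distribution far from what the job should produce,
        # e.g. a worker who marks nearly every recording as "matches".
        total = sum(self.answers.values())
        if total >= MIN_JUDGMENTS_FOR_CHECK:
            top_share = self.answers.most_common(1)[0][1] / total
            if top_share > MAX_SINGLE_ANSWER_SHARE:
                return True
        return False
```

Monitoring of worker activity levels over time can be layered on in the same way, for example by also recording judgment timestamps and flagging implausibly high submission rates.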
Preventing spam requires a robust approach that combines a number of spam detection strategies to ensure that only high quality responses are collected. Our work describes the quality control and spam detection strategies we used while collecting high volumes of crowdsourced audio samples from text prompts. We use a custom-designed audio elicitation mobile app which allows us to measure audio properties that are indicative of spam (an illustrative sketch of such checks is given at the end of this section). We also describe crowdsourcing strategies for validating that collected audio matches the given text. Additionally, we describe strategies for preventing hackers from circumventing our quality control and spam prevention tools. Combined, these strategies provide us with large volumes of accurate speech data at very low cost.

Since 2008, there has been a marked increase in the number of NLP conference papers that use crowdsourcing to achieve some goal [1], but little attention has been paid to the cheating and spamming that arise in crowdsourcing. Spam is an important issue to address when considering crowdsourcing [13]; without any quality control in place, a substantial portion of the crowd will consist of spammers [4, 12]. While it is not uncommon to see issues of quality control addressed in such papers [6, 7, 8, 9], quality control approaches are commonly limited to the use of gold-standard questions, whose answers are known a priori and used to evaluate worker contributions [5, 9, 10]. Other quality control approaches that have been adopted include crowdsourced recruitment, whereby a task is designed to find quality workers who are later manually selected for further work [11], filtering through majority vote [5], and the use of generated completion codes to determine legitimate worker contributions [2].

One of the most comprehensive assessments of illegitimate worker contributions in crowdsourcing tasks is from Eickhoff and De Vries [2], who investigate the different types of cheaters, the different ways in which they cheat, and the characteristics of the tasks that attract them. They demonstrate that the complexity of a task can strongly influence the percentage of spam received, with simpler tasks being much more vulnerable to spammer attention than more complicated ones. The authors also show how the amount of
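As a concrete illustration of the audio-side checks mentioned above, a controlled recording application makes it possible to reject recordings whose measured properties are implausible for the prompt. The sketch below flags recordings that are too short for their prompt or essentially silent; the specific properties, thresholds, the function name looks_like_spam, and the assumption of 16-bit mono PCM WAV input are illustrative assumptions, not drawn from our implementation.

```python
# Illustrative sketch only (assumed checks and thresholds): flag a recording
# as likely spam if it is implausibly short for its prompt or essentially silent.
# Assumes 16-bit mono PCM WAV input.
import wave
from array import array

MIN_SECONDS_PER_WORD = 0.15   # assumed lower bound on speaking rate
SILENCE_RMS_THRESHOLD = 100   # assumed RMS floor for 16-bit samples


def looks_like_spam(wav_path, prompt_text):
    with wave.open(wav_path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
        duration = wav.getnframes() / float(wav.getframerate())

    # Check 1: recording too short to contain the prompted words.
    min_expected = len(prompt_text.split()) * MIN_SECONDS_PER_WORD
    if duration < min_expected:
        return True

    # Check 2: recording is essentially silence.
    samples = array("h", frames)
    rms = (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5
    return rms < SILENCE_RMS_THRESHOLD
```

Heuristics of this sort only flag candidates for rejection; the final judgment of whether a recording matches its prompt is still made by crowdsourced validators in the second job described above.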