Data Collection and Annotation for State-of-the-Art NER Using Unmanaged Crowds

Spencer Rothwell, Steele Carter, Ahmad Elshenawy, Vladislavs Dovgalecs, Safiyyah Saleem, Daniela Braga, Bob Kennewick

VoiceBox Technologies, Bellevue, WA, USA
{spencerr, steelec, ahmade, vladislavsd, danielab, safiyyahs, bobk}@voicebox.com

Abstract

This paper presents strategies for generating entity-level annotated text utterances using unmanaged crowds. These utterances are then used to build state-of-the-art Named Entity Recognition (NER) models, a required component of dialogue systems. First, a wide variety of raw utterances is collected through a variant elicitation task. We ensure that these utterances are relevant by feeding them back to the crowd for a domain validation task. We also flag utterances with potential spelling errors and verify these errors with the crowd before discarding them. These strategies, combined with a periodic CAPTCHA to prevent automated responses, allow us to collect high-quality text utterances despite the inability to use the traditional gold test question approach for spam filtering. The utterances are then tagged with appropriate NER labels by unmanaged crowds. The crowd annotation was 23% more accurate and 29% more consistent than in-house annotation.

Index Terms: unmanaged crowdsourcing, data collection, spam, NER, Natural Language Understanding (NLU), dialogue systems

1. Introduction

Building NLU models requires a large number of text utterances. To collect extensive quantities of annotations cost-effectively and with fast turnaround, we leveraged crowdsourcing, specifically unmanaged crowds. Crowds can generate creative input for open-ended questions, which can then be used as bootstrapping data for NLU models. It is difficult, however, to prevent spam when collecting open text.
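One spam pattern in open-text collection is easy to catch automatically: a worker pasting the same string into most or all units of a task. The function below is an illustrative per-worker duplicate check, not the system described in this paper; the `max_duplicate_ratio` threshold is a hypothetical parameter.

```python
from collections import Counter

def flag_copy_paste_spam(responses, max_duplicate_ratio=0.5):
    """Flag a worker whose responses are dominated by one repeated string.

    responses: list of free-text answers submitted by a single worker.
    max_duplicate_ratio: hypothetical threshold on the share of responses
        that may be exact duplicates before the worker is flagged.
    """
    if not responses:
        return False
    # Normalize lightly so trivial whitespace/case changes still count
    # as the same pasted string.
    normalized = [r.strip().lower() for r in responses]
    _, most_common_count = Counter(normalized).most_common(1)[0]
    return most_common_count / len(normalized) > max_duplicate_ratio
```

A check like this only catches the "regular pattern" kind of spam; low-quality but varied responses still require the validation-by-crowd stages described below.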
We distinguish spam from low-quality data: the former is intentionally produced and easily spotted because of the regular pattern it shows (e.g., copy-and-paste of the same string across all the units of a task); the latter is less obvious and not necessarily intentional (e.g., an utterance about sports in response to a scenario regarding weather).

In most crowdsourcing tasks, gold test questions are interspersed throughout the task to measure worker accuracy and ensure data quality. This approach is not applicable when collecting open text responses because there is no single correct response. As an alternative, we insert CAPTCHAs to prevent automated responses from bots. We then ensure high quality by feeding responses back to the crowd for several stages of validation.

In addition to the difficulties of open text collection, unmanaged crowds are difficult to train for complicated or specialized tasks. Workers have limited attention spans and often neglect to read the instructions for their tasks; most crowdsourcing tasks therefore tend to be simple and intuitive. Labeling named entities is more difficult than typical crowdsourcing tasks because it requires an intimate understanding of many different entity labels across domains.

We overcame the issue of task complexity and generated accurately entity-labeled data by dividing the problem into manageable pieces for the crowd. Each worker was asked to annotate utterances with only a small number of entity labels at a time. By limiting the number of labels any one worker had to become familiar with, we kept the complexity of the job low enough to maintain high accuracy. This process yielded more accurate and more consistent annotations than those of a single in-house annotator.
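The label-partitioning strategy above amounts to batching the full label inventory so that each crowd task exposes workers to only a few labels. The sketch below illustrates the idea; the batch size and the example label names are hypothetical, as the paper does not list them.

```python
def batch_labels(entity_labels, labels_per_task=3):
    """Partition the full entity label set into small batches so that each
    crowd task asks workers to apply only a few labels at a time.

    labels_per_task: hypothetical parameter; the paper does not state
        exactly how many labels each worker saw per task.
    """
    return [entity_labels[i:i + labels_per_task]
            for i in range(0, len(entity_labels), labels_per_task)]

# Hypothetical label inventory spanning several domains.
labels = ["ARTIST", "SONG", "ALBUM", "CITY", "RESTAURANT", "DATE", "TIME"]
print(batch_labels(labels))
# → [['ARTIST', 'SONG', 'ALBUM'], ['CITY', 'RESTAURANT', 'DATE'], ['TIME']]
```

Each batch would then back one annotation task, so no single worker needs to internalize the whole cross-domain label set.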
2. Related Work

Sequence labeling tasks, such as Part-of-Speech (POS) tagging or entity labeling for NER, are increasingly common applications of crowdsourcing in Natural Language Processing (NLP). Carmel et al. [1] use crowdsourcing as a means of evaluating the output of various entity recognizers with some success. Others have used crowdsourcing for more than evaluating the performance of an entity recognizer, relying on crowd judgments to identify as well as categorize entity spans in text data. Many papers have been published on crowdsourced NER for Twitter data ([2], [3], [4]), but the approach has also been applied to email data [5].

There are typically two steps in crowdsourcing data for NER model training: first, identify the spans of the entities in a source text; second, classify each identified span as belonging to some entity type. Voyer et al. [6] rely on a hybrid method, using expert annotators to identify entity spans and crowdsourcing to classify the entities. Braunschweig et al. [7] use crowdsourcing for each step, with mixed results: their pure crowdsourcing approach to entity span identification and labeling did not perform as well as expected. This result may be attributable to the lack of any quality control mechanisms, due to issues with their crowdsourcing platform. The negative impact of spam on crowdsourced data quality in the absence of proper quality control has been well established [8], [9].

Finin et al. [2] and Bontcheva et al. [3] combine the boundary detection and entity labeling steps, presenting workers with an entity-containing phrase and allowing them to identify and classify any possible entities in context, word by word. Finin et al. [2] present each word in a phrase with a radial selection menu for each entity type being considered. Bontcheva et al. [3] present workers with a phrase and ask them to highlight which words belong to a given entity type (e.g., Location).
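Whether spans are identified and classified in one step or two, the crowd judgments must eventually be merged into token-level training data for an NER model. Below is a minimal sketch of converting labeled token spans into the common BIO tag scheme; the scheme is an assumption for illustration, since the cited papers do not all specify one.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled token spans into BIO tags for NER training.

    tokens: list of tokens for one utterance.
    spans: list of (start, end, label) token ranges, end-exclusive,
        e.g. (4, 6, "LOCATION"). Assumed non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

# Example with a hypothetical utterance and label:
tokens = ["book", "a", "flight", "to", "New", "York"]
print(spans_to_bio(tokens, [(4, 6, "LOCATION")]))
# → ['O', 'O', 'O', 'O', 'B-LOCATION', 'I-LOCATION']
```

One pair of token-and-tag sequences like this per utterance is the usual input format for sequence-labeling NER trainers.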
The results of [2] suggest

Copyright © 2015 ISCA. INTERSPEECH 2015, September 6-10, 2015, Dresden, Germany. doi: 10.21437/Interspeech.2015-587