Data Collection and Annotation for State-of-the-Art NER Using Unmanaged
Crowds
Spencer Rothwell, Steele Carter, Ahmad Elshenawy, Vladislavs Dovgalecs, Safiyyah Saleem,
Daniela Braga, Bob Kennewick
VoiceBox Technologies, Bellevue, WA, USA
{spencerr, steelec, ahmade, vladislavsd, danielab, safiyyahs, bobk}@voicebox.com
Abstract
This paper presents strategies for generating entity-level
annotated text utterances using unmanaged crowds. These
utterances are then used to build state-of-the-art Named Entity
Recognition (NER) models, a required component of
dialogue systems. First, a wide variety of raw utterances are
collected through a variant elicitation task. We ensure that
these utterances are relevant by feeding them back to the
crowd for a domain validation task. We also flag utterances
with potential spelling errors and verify these errors with the
crowd before discarding them. These strategies, combined
with a periodic CAPTCHA to prevent automated responses,
allow us to collect high quality text utterances despite the
inability to use the traditional gold test question approach for
spam filtering. These utterances are then tagged with
appropriate NER labels using unmanaged crowds. The crowd
annotation was 23% more accurate and 29% more consistent
than in-house annotation.
Index Terms: unmanaged crowdsourcing, data collection,
spam, NER, Natural Language Understanding (NLU),
dialogue systems.
1. Introduction
Building NLU models requires a large number of text
utterances. To collect extensive quantities of annotations
cost-effectively and with fast turnaround, we leveraged
crowdsourcing, specifically unmanaged crowds. Crowds can
generate creative input for open-ended questions, which can
then be used as bootstrapping data for NLU models. It is
difficult, however, to
prevent spam when collecting open text. We distinguish spam
from low-quality data, the former being intentionally produced
and easily spotted because of the regular pattern it shows (e.g.
copy and paste of the same string along all the units of the
task); the latter being less obvious and not necessarily
intentionally produced (e.g. an utterance about sports in
response to a scenario regarding weather). In most
crowdsourcing tasks, gold test questions are interspersed
throughout the task to measure worker accuracy and ensure
data quality. This approach is not applicable when collecting
open text responses because there is no single correct
response. As an alternative solution, we insert CAPTCHAs to
prevent automated responses from bots. We then ensure high
quality of responses by feeding them back to the crowd for
several stages of validation. In addition to the difficulties with
open text collection, unmanaged crowds are difficult to train
for complicated or specialized tasks. Workers have limited
attention spans and often neglect to read the instructions for
their tasks; most crowdsourcing tasks therefore tend to be
simple and intuitive. The task of labeling named entities is
more difficult than typical crowdsourcing tasks because it
requires an intimate understanding of many different entity
labels across domains. We overcame the issue of task
complexity to generate accurate entity-labeled data by dividing
the problem into manageable pieces for the crowd. Each
worker was only asked to annotate utterances with a small
number of entity labels at a time. By limiting the number of
labels any one worker had to become familiar with, we were
able to keep the complexity of the job low enough to maintain
high accuracy. This process yielded more accurate and more
consistent annotations than those done by a single, in-house
annotator.
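The label-partitioning strategy described above can be sketched in a few lines of Python. This is a minimal illustration only; the entity label names and the subset size are hypothetical, not taken from the collection described in this paper:

```python
from itertools import islice

def partition_labels(labels, subset_size):
    """Split the full entity label set into small subsets so that
    each crowd worker only has to learn a few labels at a time."""
    it = iter(labels)
    while True:
        chunk = list(islice(it, subset_size))
        if not chunk:
            return
        yield chunk

# Hypothetical entity labels for a dialogue-system domain.
labels = ["ARTIST", "SONG", "ALBUM", "CITY", "STATE", "POI"]

# Each annotation job then covers at most three labels, so no
# single worker must internalize the full label inventory.
jobs = list(partition_labels(labels, 3))
```

Each resulting subset would correspond to one annotation job posted to the crowd, keeping per-worker complexity low at the cost of routing every utterance through multiple jobs.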
Sequence labeling tasks, such as Part-of-Speech (POS)
tagging or entity labeling for NER, are increasingly common
applications of crowdsourcing in Natural Language
Processing (NLP). Carmel et al. [1] use
crowdsourcing as a means of evaluating the output of various
entity recognizers with some success. Others have utilized
crowdsourcing for more than just evaluating the performance
of an entity recognizer, relying on crowd judgments to identify
as well as categorize entity spans in text data. Many papers
have been published on crowdsourced NER on Twitter data
([2], [3], [4]), but the approach has also been applied to email
data [5].
There are typically two steps in crowdsourcing data for
NER model training. First, identify the spans of the entities in
a source text. Second, classify any identified spans as
belonging to some entity type. Voyer et al. [6] rely on a hybrid
method of using expert annotators to identify entity spans and
crowdsourcing to classify the entities. Braunschweig et al. [7]
use crowdsourcing for each step, with mixed results. They
found that using a pure crowdsourcing approach for entity
span identification and labeling did not perform as well as
expected. This result may be attributable to the absence of any
quality control mechanisms, owing to issues with
their crowdsourcing platforms. The negative impact of spam
on crowdsourced data quality in the absence of proper quality
control has been well established [8], [9].
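One common quality-control mechanism of the kind whose absence is noted above is redundant labeling with majority-vote aggregation over multiple workers' judgments. The sketch below is illustrative only; the judgment lists, label names, and agreement threshold are hypothetical:

```python
from collections import Counter

def majority_label(judgments, min_agreement=0.5):
    """Aggregate redundant crowd judgments for one entity span.
    Returns the winning label, or None when agreement is too low
    (i.e., the span should be re-queued or discarded)."""
    if not judgments:
        return None
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    if votes / len(judgments) > min_agreement:
        return label
    return None

# Hypothetical judgments for the span "Seattle".
print(majority_label(["CITY", "CITY", "STATE"]))  # CITY (2 of 3 agree)
print(majority_label(["CITY", "STATE"]))          # None (no majority)
```

Spans that fail the agreement threshold can be sent back for additional judgments rather than entering the training data, analogous to the multi-stage validation used for open text in this work.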
Finin et al. [2] and Bontcheva et al. [3] combine the
boundary detection and entity labeling steps together,
presenting workers with an entity-containing phrase and
allowing them to selectively identify and classify any possible
entities in context, word-by-word. Finin et al. [2] present each
word in a phrase with a radial selection menu for each entity
type being considered. Bontcheva et al. [3] present the workers
with a phrase and ask them to highlight which words belong to
a given entity type (e.g. Location). The results of [2] suggest
Copyright © 2015 ISCA. INTERSPEECH 2015, September 6-10, 2015, Dresden, Germany. DOI: 10.21437/Interspeech.2015-587