Label Semantics for Few Shot Named Entity Recognition

Jie Ma 1, Miguel Ballesteros 1, Srikanth Doss 1, Rishita Anubhai 1, Sunil Mallya 1*, Yaser Al-Onaizan 1*, Dan Roth 1,2
1 AWS AI Labs  2 Computer and Information Science, University of Pennsylvania
{jieman, ballemig, srikad, ranubhai, drot}@amazon.com
mallya16@gmail.com, onaizan2000@yahoo.com

Abstract

We study the problem of few shot learning for named entity recognition. Specifically, we leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors. We propose a neural architecture that consists of two BERT encoders, one to encode the document and its tokens and another one to encode each of the labels in natural language format. Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder. The label semantics signal is shown to support improved state-of-the-art results on multiple few shot NER benchmarks and on-par performance on standard benchmarks. Our model is especially effective in low resource settings.

1 Introduction

Named entity recognition (NER) seeks to locate named entity spans in unstructured text and classify them into pre-defined categories such as PERSON, LOCATION and ORGANIZATION (Tjong Kim Sang and De Meulder, 2003a). As a fundamental natural language understanding task, NER often serves as an upstream component for more complex tasks such as question answering (Mollá et al., 2006), relation extraction (Chan and Roth, 2011) and coreference resolution (Clark and Manning, 2015). However, building an accurate NER system has traditionally required large amounts of high quality annotated in-domain data (Lison et al., 2020; Chen et al., 2020). This usually involves well defined annotation guidelines and training of annotators, which requires rich domain knowledge and can be prohibitively expensive (Huang et al., 2020).
* Work done while at AWS AI Labs.

Few shot learning (FSL) (Vinyals et al., 2017; Finn et al., 2017; Snell et al., 2017) aims at performing a task using only very few annotated examples (i.e., a support set). Similarity-based methods, such as prototypical networks, have been extensively studied and show great success for FSL (Vinyals et al., 2017; Snell et al., 2017; Yu et al., 2018a; Hou et al., 2020). The core idea is to classify input examples from a new domain based on their similarities with representations of each class in the support set. These methods do not utilize the semantics of label names and usually represent labels by directly averaging the embeddings of support set examples, oversimplifying the learning of label representations.

The main premise of our work is that label names carry meaning that our models can induce from data; the labels are themselves words that appear in text in various contexts and are thus semantically related to other words that appear in text, and this relatedness can be leveraged. For example, the representation of "Lionel Messi" is more similar to that of PERSON than to the representations of LOCATION or DATE when similar priors are used for labels and words or phrases.

In this work, we propose a neural architecture that uses two separate BERT-based encoders (Devlin et al., 2019) to leverage the semantics of label names for NER.1 One encoder (a) is used to encode the document and its words while the other encoder (b) is used to encode label names (e.g. PERSON, LOCATION, etc.). The model is trained to match word representations from encoder (a) with label representations from encoder (b), and to assign a label for each word by maximizing the

1 Our model is similar to the two-tower model widely adopted in question answering (Karpukhin et al., 2020), recommender systems (Wang et al., 2021) and entity linking (Logeswaran et al., 2019; Vyas and Ballesteros, 2020).

arXiv:2203.08985v1 [cs.CL] 16 Mar 2022
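The similarity-based baseline described above — averaging support-set embeddings into one prototype per class and assigning a query to its nearest prototype — can be sketched as follows. This is an illustrative simplification with toy NumPy vectors standing in for learned embeddings, not the paper's actual model:

```python
import numpy as np

def build_prototypes(support_embeddings, support_labels):
    """Prototypical networks: average the embeddings of support examples per class."""
    prototypes = {}
    for label in set(support_labels):
        vecs = [e for e, l in zip(support_embeddings, support_labels) if l == label]
        prototypes[label] = np.mean(vecs, axis=0)
    return prototypes

def classify(query_embedding, prototypes):
    """Assign the class whose prototype is closest (negative squared Euclidean distance)."""
    return max(prototypes,
               key=lambda label: -np.sum((query_embedding - prototypes[label]) ** 2))
```

Note that the prototype here is just a mean of example embeddings; the label name "PERSON" contributes nothing — which is exactly the limitation the paper targets.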
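The two-encoder matching step can likewise be sketched in miniature. Here NumPy arrays stand in for the outputs of the two BERT encoders (encoder (a) over document tokens, encoder (b) over label names); the function name and the dot-product similarity are illustrative assumptions, not the paper's exact scoring function:

```python
import numpy as np

def match_tokens_to_labels(token_reps, label_reps, label_names):
    """Score each token representation against each label-name representation
    and assign the label with the maximum score per token.

    token_reps: (num_tokens, dim) array from the document encoder (a).
    label_reps: (num_labels, dim) array from the label encoder (b).
    """
    scores = token_reps @ label_reps.T   # (num_tokens, num_labels) similarity matrix
    best = scores.argmax(axis=1)         # index of the best-matching label per token
    return [label_names[i] for i in best]
```

Because the label side is produced by an encoder over natural-language label names rather than by averaging support examples, a semantically related token can match a label even with very few (or zero) support examples for that label.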