Remedying BiLSTM-CNN Deficiency in Modeling Cross-Context for NER

A Preprint

Peng-Hsuan Li, Tsu-Jui Fu, Wei-Yun Ma
Academia Sinica
{jacobvsdanniel,tsujuifu}@gmail.com, ma@iis.sinica.edu.tw

November 15, 2021

Abstract

Recent research has prevalently used BiLSTM-CNN as a core module for NER in a sequence-labeling setup. This paper formally shows the limitation of BiLSTM-CNN encoders in modeling cross-context patterns for each word, i.e., patterns crossing past and future for a specific time step. Two types of cross-structures are used to remedy the problem: a BiLSTM variant with cross-links between layers and a multi-head self-attention mechanism. These cross-structures bring consistent improvements across a wide range of NER domains for a core system using BiLSTM-CNN without additional gazetteers, POS taggers, language-modeling, or multi-task supervision. The model surpasses comparable previous models on OntoNotes 5.0 and WNUT 2017 by 1.4% and 4.6%, especially improving emerging, complex, confusing, and multi-token entity mentions, showing the importance of remedying the core module of NER.

1 Introduction

Named Entity Recognition (NER) is a core task for information extraction. Originally a structured prediction task, NER has since been formulated as a task of sequential token labeling. BiLSTM-CNN uses a CNN to encode each word and then uses bi-directional LSTMs to encode past and future context respectively at each time step. With state-of-the-art empirical results, most regard it as a robust core module for sequence-labeling NER [1, 2, 3, 4, 5].

However, each direction of BiLSTM only sees and encodes half of a sequence at each time step. For each token, the forward LSTM only encodes past context; the backward LSTM only encodes future context. When computing sentence representations for tasks such as sentence classification and machine translation, this is not a problem: only the rightmost hidden state of the forward LSTM and the leftmost hidden state of the backward LSTM are used, and each of these endpoint hidden states sees and encodes the whole sentence. When computing per-token representations for sequence-labeling tasks such as NER, however, this becomes a limitation: each token uses its own midpoint hidden states, which do not model the patterns that happen to cross past and future at that specific time step.

This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN. Previous studies have tried to stack multiple LSTMs for sequence-labeling NER [2]. As they follow the trend of stacking forward and backward LSTMs independently, the Baseline-BiLSTM-CNN is only able to learn higher-level representations of past or future per se. Instead, Cross-BiLSTM-CNN, which interleaves every layer of the two directions, models cross-context in an additive manner by learning higher-level representations of the whole context of each token. On the other hand, Att-BiLSTM-CNN models cross-context in a multiplicative manner by capturing the interaction between past and future with a dot-product self-attentive mechanism [6, 7].

Section 3 formulates the three models: Baseline-, Cross-, and Att-BiLSTM-CNN. The section gives a concrete proof that patterns forming an XOR cannot be modeled by the Baseline-BiLSTM-CNN used in all previous work. Cross-BiLSTM-CNN and Att-BiLSTM-CNN are shown to have additive and multiplicative cross-structures respectively to deal with the problem.
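To make the two cross-structures concrete, the following is a minimal PyTorch sketch, not the authors' exact architecture: the module names, dimensions, and the use of torch.nn.MultiheadAttention for the dot-product self-attention are assumptions for illustration. It assumes word vectors have already been produced by the CNN character/word encoder.

```python
# Illustrative sketch only; not the paper's reference implementation.
import torch
import torch.nn as nn


class CrossBiLSTM(nn.Module):
    """Additive cross-structure: after every layer, forward and backward hidden
    states are concatenated, so the next layer of *both* directions reads a
    representation of the whole context of each token."""

    def __init__(self, d_in, d_hid, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            in_dim = d_in if i == 0 else 2 * d_hid
            # One bidirectional layer at a time; interleaving happens because its
            # concatenated output feeds the next bidirectional layer.
            self.layers.append(
                nn.LSTM(in_dim, d_hid, batch_first=True, bidirectional=True))

    def forward(self, x):                      # x: (batch, seq_len, d_in)
        for lstm in self.layers:
            x, _ = lstm(x)                     # (batch, seq_len, 2 * d_hid)
        return x


class AttBiLSTM(nn.Module):
    """Multiplicative cross-structure: dot-product multi-head self-attention
    over the BiLSTM outputs lets past and future features interact at every
    time step."""

    def __init__(self, d_in, d_hid, num_heads=4):
        super().__init__()
        self.bilstm = nn.LSTM(d_in, d_hid, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * d_hid, num_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq_len, d_in)
        h, _ = self.bilstm(x)                  # (batch, seq_len, 2 * d_hid)
        ctx, _ = self.attn(h, h, h)            # token-wise cross-context
        return torch.cat([h, ctx], dim=-1)     # fed to the tag classifier


if __name__ == "__main__":
    words = torch.randn(2, 7, 100)             # toy batch: 2 sentences, 7 tokens
    print(CrossBiLSTM(100, 128)(words).shape)  # torch.Size([2, 7, 256])
    print(AttBiLSTM(100, 128)(words).shape)    # torch.Size([2, 7, 512])
```

In this sketch the additive variant mixes directions by feeding concatenated hidden states into the next layer, while the multiplicative variant lets every token attend over the full sequence of hidden states; both give each time step access to patterns that span its past and future.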
Section 4 evaluates the approaches on two challenging NER datasets spanning a wide range of domains with complex, noisy, and emerging entities. The cross-structures bring consistent improvements over the Baseline-BiLSTM-CNN.