AN ATTENTION-AWARE BIDIRECTIONAL MULTI-RESIDUAL RECURRENT NEURAL NETWORK (ABMRNN): A STUDY ABOUT BETTER SHORT-TERM TEXT CLASSIFICATION

Ye Wang 1, Han Wang 1, Xinxiang Zhang 2, Theodora Chaspari 1, Yoonsuck Choe 1 and Mi Lu 1
1 Texas A&M University, College Station, Texas, 77843, USA
2 Southern Methodist University, Dallas, Texas, 75025, USA
{wangye0523, hanwang, chaspari, choe}@tamu.edu, xinxiang@smu.edu, mlu@ece.tamu.edu

ABSTRACT

Long Short-Term Memory (LSTM) has proven to be an efficient way to model sequential data, owing to its ability to overcome the diminishing-gradient problem during training. However, due to the limited memory capacity of LSTM cells, LSTM is weak at capturing long-range dependencies in sequential data. To address this challenge, we propose an Attention-aware Bidirectional Multi-residual Recurrent Neural Network (ABMRNN). Built on LSTM, our model considers both past and future information at every time step with omniscient attention. In addition, a multi-residual mechanism is leveraged in our model to capture the relationship between the current time step and more distant time steps, rather than only the immediately preceding one. Experimental results show that our model achieves state-of-the-art performance in classification tasks.

Index Terms— Long Short-Term Memory, recurrent neural network, attention model, natural language processing, residual network

1. INTRODUCTION

Compared with Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) are widely applied to sequential data such as natural language processing [1] and speech processing [2], while CNNs are more commonly employed in image processing [3–5]. Among existing RNN models, LSTM is one of the most widely used approaches, since it first addressed the gradient vanishing and exploding problems during RNN training [6] by introducing a forget gate and a memory cell.
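As background for the discussion that follows, the forget-gate and memory-cell mechanism can be sketched as a single LSTM step in NumPy. This is an illustrative re-implementation under our own assumptions (gate ordering, weight layout, and the name `lstm_step` are hypothetical), not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).
    Assumed gate order: input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate: scales old memory
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell update
    c = f * c_prev + i * g       # additive memory path eases gradient flow
    h = o * np.tanh(c)           # hidden state exposed to the next layer
    return h, c
```

The additive update of `c` is what mitigates vanishing gradients: the gradient can flow through the `f * c_prev` term without repeatedly passing through squashing nonlinearities.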
Numerous RNN variations [6–8] have been proposed in the literature to achieve state-of-the-art performance in different tasks, with LSTM as the cornerstone of those structures. As network depth has increased, residual networks have proved their advantages in both CNNs [9] and RNNs [10]. Residual networks complement LSTMs by connecting current and distant time steps during training.

In this paper, we propose an Attention-aware Bidirectional Multi-residual Recurrent Neural Network (ABMRNN) and show improved performance on existing sequential classification tasks. To summarize our contributions:

- We propose an algorithm that updates weights by combining information from both previous and future time steps.
- We leverage a multi-residual mechanism from existing residual networks in recurrent networks for sequence learning, through which we achieve state-of-the-art performance in classification tasks.
- We provide a comprehensive analysis of the advantages and disadvantages of current cutting-edge models, including RNNs and CNNs, for sequence learning, especially in short-term text classification tasks.

2. RELATED WORK

To improve the performance of classification tasks, several directions of network exploration have been pursued. First, an increasing number of layers is employed for capturing features. Second, various feature extraction methods such as word2vec [11] and doc2vec [12] have been invented to learn better word representations. Third, variations of the interior structural units, such as LSTM and GRU [7], have been proposed. With the development of neural networks, a novel trend is to combine deeper networks with multiple neural network variations.

Since general CNN or RNN architectures do not fit well with some tasks, such as short-term text classification, the contribution of this work lies in integrating the advantages of residual networks for the tasks of interest.
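The bidirectional-with-attention idea summarized in the contributions can be sketched as follows. This is a deliberately simplified toy, not the paper's model: a plain tanh RNN cell stands in for the LSTM units, and all names (`bi_attend`, the attention vector `v`, etc.) are our own assumptions. The key point it illustrates is that every time step sees a forward (past) and backward (future) state, and attention weights every position of the sequence:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def bi_attend(xs, Wf, Wb, Uf, Ub, v):
    """Toy bidirectional encoder with attention over all time steps.
    xs: (T, D) inputs; Wf/Wb: (H, D); Uf/Ub: (H, H); v: (2H,)."""
    T, D = xs.shape
    H = Uf.shape[0]
    hf, hb = np.zeros(H), np.zeros(H)
    fwd, bwd = [], []
    for t in range(T):                      # past-to-future pass
        hf = np.tanh(Wf @ xs[t] + Uf @ hf)
        fwd.append(hf)
    for t in reversed(range(T)):            # future-to-past pass
        hb = np.tanh(Wb @ xs[t] + Ub @ hb)
        bwd.append(hb)
    bwd.reverse()
    # each time step now carries both past and future context
    states = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
    alpha = softmax(states @ v)             # attention over every position
    return alpha @ states, alpha            # context vector, weights
```

The returned context vector could then feed a classifier head; in the full model the recurrent cells would be LSTMs rather than this toy tanh cell.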
3. RESIDUAL LSTM PRELIMINARIES

LSTM solves the gradient vanishing and exploding problems. However, when the time sequence is too long, the dependency between earlier and later information is neglected in LSTM, because each time step depends only on the previous time step. To enhance such distant relationships, residual networks based on LSTM have been proposed [10, 13]. Figure 1 shows the general structure of a residual network. The basic

978-1-5386-4658-8/18/$31.00 ©2019 IEEE ICASSP 2019
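The residual idea of connecting the current time step to a distant one can be shown in a minimal sketch. Again this is an assumption-laden toy (a tanh cell in place of LSTM; the name `residual_rnn` and the skip distance `k` are hypothetical), meant only to show where the shortcut enters the recurrence:

```python
import numpy as np

def residual_rnn(xs, W, U, k=2):
    """Toy residual recurrence: each state adds a skip connection
    from k steps back, so gradients reach distant steps directly.
    xs: (T, D) inputs; W: (H, D); U: (H, H)."""
    T, D = xs.shape
    H = U.shape[0]
    hs = [np.zeros(H)]                      # hs[t+1] is the state at time t
    for t in range(T):
        h = np.tanh(W @ xs[t] + U @ hs[-1])
        if t >= k:
            h = h + hs[t - k + 1]           # residual shortcut to step t-k
        hs.append(h)
    return np.stack(hs[1:])
```

With `k = 1` this reduces to an ordinary highway-style skip to the previous step; larger `k` links the current state to more distant history, which is the relationship the multi-residual mechanism targets.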