CACHE NEURAL NETWORK LANGUAGE MODELS BASED ON LONG-DISTANCE DEPENDENCIES FOR A SPOKEN DIALOG SYSTEM

F. Zamora-Martínez, S. España-Boquera, M.J. Castro-Bleda, R. De-Mori

ESET, Universidad CEU-Cardenal Herrera, Valencia, Spain
DSIC, Universitat Politècnica de València, Valencia, Spain
LIA, University of Avignon, Avignon, France

ABSTRACT

The integration of a cache memory into a connectionist language model is proposed in this paper. The model captures long-term dependencies of both words and concepts and is particularly useful for Spoken Language Understanding tasks. Experiments conducted on a human-machine telephone dialog corpus are reported, and an increase in performance is observed when features of previous turns are taken into account for predicting the concepts expressed in a user turn. In terms of Concept Error Rate, we obtained a statistically significant improvement of 3.2 points over our baseline (a 10% relative improvement) on the French MEDIA corpus.

1. INTRODUCTION

The purpose of Language Models (LMs) in Automatic Speech Recognition (ASR) systems is to compute the probability of a word w given its history h, defined as the sequence of words uttered before w. If ASR decoding is performed on a human/human conversation or a human/computer dialog, the history of a word may be very long, and the estimation of the probability P(w|h) may be very difficult due to the immense variety of possible histories. A popular solution is to approximate histories by the n-1 words preceding w, as in word n-grams. Even in this case, estimation accuracy is affected by data sparseness. Motivated by the above considerations, a solution is proposed based on continuous space LMs and effective approximations of word histories made of summaries composed of a limited number of semantic constituents useful for word prediction. History summaries are made of discourse features.
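To make the n-gram truncation concrete, the following sketch (illustrative only, not part of the system described in this paper) estimates P(w|h) from counts after cutting the history down to the last n-1 words, here with n = 3:

```python
from collections import defaultdict

def train_trigram_counts(corpus):
    """Collect trigram and bigram counts from tokenized sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(tokens)):
            tri[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
            bi[(tokens[i - 2], tokens[i - 1])] += 1
    return tri, bi

def p_trigram(tri, bi, w, h):
    """P(w | h), with h truncated to the last n-1 = 2 words (MLE, no smoothing)."""
    h = tuple(h[-2:])  # the n-gram approximation: only 2 history words survive
    return tri[h + (w,)] / bi[h] if bi[h] else 0.0

corpus = [["i", "want", "a", "hotel"], ["i", "want", "a", "room"]]
tri, bi = train_trigram_counts(corpus)
print(p_trigram(tri, bi, "hotel", ["want", "a"]))  # 0.5: both continuations seen
```

Note that any history not observed in training receives probability zero: this is the data-sparseness problem that motivates smoothing and the continuous-space models discussed below.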
In [1], intentions and preferentially retained information are considered to model the attentional state of a conversation, and a cache model is proposed for temporarily storing this information. Inspired by these ideas and by a previous cache model presented in [2] for LM adaptation, a new cache model is proposed in this paper. Stored in the cache are semantic components used by the dialog manager for performing progressive composition of concepts into frame structures. A continuous space LM is proposed to estimate word probabilities based on n-gram and concept histories. It is expected to better predict words expressing concepts to be composed with the already hypothesized ones, even if errors in the history hypotheses may have a negative influence.

The paper introduces this new LM adaptation model and evaluates its performance on a Spoken Language Understanding (SLU) task. It is organized as follows. Section 2 summarizes previous work on LM adaptation related to the proposed approach. Section 3 introduces a new cache Neural Network LM (cacheNNLM). Section 4 reports details of experimental results of cacheNNLMs.

2. RELATED WORK

In order to take into account contexts longer than n-grams for representing contextual dependencies for word expectation, a cache memory model was proposed in [2]. Along this line, trigger models were introduced, with triggers stored in a cache to predict triggered words [3]. Expectations based on the cache are combined with probabilities computed by general static n-gram models. The modification of general static LM probabilities with features from the message to be analyzed is often referred to as LM adaptation. Important dimensions of LM adaptation are the type of context taken into account, how to obtain the adaptation data, and how to use it to update LM probabilities.

An additional concern in LM design and adaptation is the sparseness of the data available for model parameter estimation.
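The cache idea of [2] can be illustrated with a minimal sketch: recently seen words raise the probability of their own recurrence, and this cache estimate is linearly interpolated with a static model. The class name, the fixed-size cache, and the interpolation weight below are assumptions made for illustration, not the formulation used in the cited work:

```python
from collections import Counter, deque

class CacheLM:
    """Illustrative cache language model: a bounded buffer of recent words
    provides a unigram cache estimate, interpolated with a static LM.
    All parameter choices here are assumptions for the sketch."""

    def __init__(self, static_prob, cache_size=100, lam=0.2):
        self.static_prob = static_prob       # callable: P_static(w, h)
        self.cache = deque(maxlen=cache_size)
        self.lam = lam                       # weight of the cache component

    def update(self, word):
        """Push a decoded word into the cache; oldest entries fall out."""
        self.cache.append(word)

    def prob(self, w, h):
        counts = Counter(self.cache)
        p_cache = counts[w] / len(self.cache) if self.cache else 0.0
        return (1 - self.lam) * self.static_prob(w, h) + self.lam * p_cache

# Toy usage: a uniform static model over 10 words (0.1 each).
lm = CacheLM(lambda w, h: 0.1, cache_size=10, lam=0.2)
for w in ["paris"] * 5 + ["hotel"] * 5:
    lm.update(w)
print(lm.prob("paris", []))  # 0.8 * 0.1 + 0.2 * 0.5 = 0.18
```

The interpolation weight controls how strongly recent discourse overrides the static estimate; in the model proposed in this paper, the cached items are semantic components rather than raw words.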
A possible solution to this problem is to cluster histories after projecting vectors of word probabilities into a reduced space [4]. Neural Network LMs (NNLMs) were also proposed to overcome the data sparseness problem. The NNLM [5, 6, 7] was introduced to exploit the inherent generalization and discriminative power of a continuous vector space representation of word sequences. NNLMs are essentially approximators of functions that predict words based on histories. Words are coded in an internal network layer. An LM adaptation solution was proposed for NNLMs by introducing an additional hidden layer in the network and using adaptation data to modify the weights of this layer [8].
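The NNLM architecture and the adaptation-layer idea of [8] can be sketched as a forward pass. The weights below are random and untrained, and all dimensions are assumptions chosen for illustration; the point is only the structure: a shared continuous projection of history words, a hidden layer, an extra adaptation layer, and a softmax output over the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, H, n = 50, 16, 32, 3  # vocab size, projection dim, hidden units, n-gram order

C = rng.normal(0, 0.1, (V, m))            # shared word projection (embedding) matrix
W_h = rng.normal(0, 0.1, ((n - 1) * m, H))  # hidden layer weights
W_a = rng.normal(0, 0.1, (H, H))          # extra adaptation layer, as in [8]
W_o = rng.normal(0, 0.1, (H, V))          # output layer weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_prob(history):
    """P(w | last n-1 words): project each history word into the continuous
    space, concatenate, pass through hidden and adaptation layers, softmax."""
    x = np.concatenate([C[w] for w in history[-(n - 1):]])
    h1 = np.tanh(x @ W_h)
    h2 = np.tanh(h1 @ W_a)  # only these weights are re-trained on adaptation data
    return softmax(h2 @ W_o)

p = nnlm_prob([3, 7])  # p is a probability distribution over all V words
```

Because similar words receive nearby projections, probability mass generalizes to histories never observed verbatim, which is what lets the NNLM mitigate the sparseness problem discussed above.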