Use of Global and Acoustic Features Associated with Contextual Factors to Adapt Language Models for Spontaneous Speech Recognition

Shohei Toyama, Daisuke Saito, Nobuaki Minematsu
Graduate School of Engineering, The University of Tokyo
toyama@gavo.t.u-tokyo.ac.jp, dsk saito@gavo.t.u-tokyo.ac.jp, mine@gavo.t.u-tokyo.ac.jp

Abstract

In this study, we propose a new method of adapting language models for speech recognition using para-linguistic and extra-linguistic features in speech. When we talk with others, we often change our lexical choices and speaking style according to various contextual factors. This fact indicates that the performance of automatic speech recognition can be improved by taking these contextual factors into account, and that they can be estimated from speech acoustics. In this study, we attempt to find global acoustic features that are associated with those contextual factors, and then integrate those features into Recurrent Neural Network (RNN) language models for speech recognition. In experiments on Japanese spontaneous speech corpora, we examine how i-vector and openSMILE features are associated with contextual factors. We then use those features in the reranking process of RNN-based language models. Results show that perplexity is reduced by 16% relative and word error rate by 2.1% relative for highly emotional speech.

Index Terms: contextual factors, global features, spontaneous speech, language models, adaptation, reranking

1. Introduction

Recently, automatic speech recognition (ASR) systems can be found embedded in various electronic devices, but the input to these systems is often limited to voice commands. As these devices become more prevalent, it becomes increasingly necessary for them to accept spontaneous speech. Here, we can point out various differences in speakers' behavior between voice commands and spontaneous speech.
In the latter, one often communicates with others by controlling not only linguistic information but also para-linguistic and even non-verbal information such as speaking style and gestures [1]. To understand the speaker, listeners identify the spoken words while interpreting the para-linguistic and non-verbal information transmitted via speech and motion, which is related to age, gender, emotion, regional accent, attitude, and so on. It is certainly possible to adapt language models to those factors by treating them as discrete labels and using class-based language models [2]. With RNN language models, however, we can adapt the models using raw, continuous features related to these labels, and also combine different types of features very flexibly [3, 4, 5].

What kinds of acoustic features are associated with contextual factors? As far as the authors know, language model adaptation to contextual factors has been examined only using acoustic features related to small linguistic units such as syllables and words [5, 6], but we claim that long-span features are highly correlated with some contextual factors when those factors are acoustically realized as a static bias of speech features. In this paper, we therefore focus on global acoustic features associated with contextual factors and examine how they can be used for RNN language model adaptation. In experiments, we investigate i-vector and openSMILE features extracted from individual utterances and use them for language model adaptation. The resulting models are tested on highly emotional speech corpora.

2. Related works

The basic RNN language model [7] is schematically shown in Figure 1.

[Figure 1: Recurrent Neural Network Language Model]

Word x_{i-1} is converted to a fixed-length feature vector v(x_{i-1}) ∈ R^n, which is combined with the previous hidden layer h_{i-1} ∈ R^n.
The current hidden layer h_i is calculated as follows:

    h_i = f(W_hh h_{i-1} + W_xh v(x_{i-1}) + b_h),    (1)

where W_hh, W_xh ∈ R^{n×n} are weight matrices, b_h ∈ R^n is a bias vector, and f(·) is an activation function such as the hyperbolic tangent. From h_i, the following word is predicted:

    P(x_i) = softmax(W_x h_i + b_x),    (2)

where W_x ∈ R^{v×n} is a weight matrix and b_x ∈ R^v is a bias vector. P(x_i) ∈ R^v is an output vector whose dimension v is equal to the vocabulary size; each dimension represents the probability that the corresponding vocabulary item is observed after the given history. To avoid the well-known vanishing gradient problem, the hidden-layer update in Equation (1) can be replaced with a Long Short-Term Memory (LSTM) [8].

As for language model adaptation based on contextual factors, there are two types of approaches that use additional features: linguistic features and acoustic features. For the former, both local and global features were examined in [9, 10]. Here, the local features are related to morphemes [10, 11, 12] or words [9, 13], and the global features are related to sentences [14] or documents [3, 10, 15]. Furthermore, socio-situational settings were also examined for adaptation in [9].

Copyright © 2017 ISCA. INTERSPEECH 2017, August 20–24, 2017, Stockholm, Sweden. http://dx.doi.org/10.21437/Interspeech.2017-717
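The RNN language model of Equations (1)-(2) can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: all sizes and parameters are toy values, random vectors stand in for the word embeddings v(x_{i-1}), and the optional term W_ch @ c is our own illustrative assumption of one simple way to inject an utterance-level context vector (e.g. an i-vector) into the hidden update; the paper instead applies such features in the reranking of RNN LM hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, v_size, d_c = 8, 20, 4  # hidden size, vocabulary size, context dim (toy values)

# Parameters of Eqs. (1)-(2). W_ch is a hypothetical extra projection for an
# utterance-level context vector; it is NOT part of the baseline model.
W_hh = rng.normal(scale=0.1, size=(n, n))
W_xh = rng.normal(scale=0.1, size=(n, n))
W_ch = rng.normal(scale=0.1, size=(n, d_c))
b_h = np.zeros(n)
W_x = rng.normal(scale=0.1, size=(v_size, n))
b_x = np.zeros(v_size)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def rnn_lm_step(h_prev, v_x, c=None):
    """One time step: Eq. (1) hidden update, Eq. (2) next-word distribution.

    If a context vector c is given, an additive term W_ch @ c is included in
    the hidden update -- one simple way to condition on global features.
    """
    pre = W_hh @ h_prev + W_xh @ v_x + b_h
    if c is not None:
        pre = pre + W_ch @ c
    h = np.tanh(pre)                 # Eq. (1), with f = tanh
    p = softmax(W_x @ h + b_x)       # Eq. (2)
    return h, p

# Run a short toy "sentence" of random embeddings with a fixed context vector.
h = np.zeros(n)
c = rng.normal(size=d_c)
for _ in range(3):
    v_x = rng.normal(size=n)         # stands in for the embedding v(x_{i-1})
    h, p = rnn_lm_step(h, v_x, c)
```

Note that p has dimension v_size and sums to one, matching the description of P(x_i) as a distribution over the vocabulary.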