Spontaneous Speech Emotion Recognition using Prior Knowledge

Rupayan Chakraborty, Meghna Pandharipande, Sunil Kumar Kopparapu
TCS Innovation Labs-Mumbai, Yantra Park, Thane West-400 601, India
Email: {rupayan.chakraborty, meghna.pandharipande, sunilkumar.kopparapu}@tcs.com

Abstract—Automatic and spontaneous speech emotion recognition is an important part of a human-computer interactive system. However, emotion identification in spontaneous speech is difficult because, most often, the emotions expressed by the speaker are not as prominent as in acted speech. In this paper, we propose a spontaneous speech emotion recognition framework that makes use of the associated knowledge. The framework is motivated by the observation that there is significant disagreement amongst human annotators when they annotate spontaneous speech; the disagreement largely reduces when they are provided with additional knowledge related to the conversation. The proposed framework makes use of contexts (derived from linguistic content) and knowledge of the time lapse of the spoken utterances within an audio call to reliably recognize the current emotion of the speaker in spontaneous audio conversations. Our experimental results demonstrate a significant improvement in the performance of spontaneous speech emotion recognition using the proposed framework.

Index Terms—Emotion recognition; knowledge-based framework; spontaneous speech; non-acted emotion; call center audio

I. INTRODUCTION

Emotion in audio plays an important role in intelligent human-computer interactions. Much of the initial emotion recognition research has been successfully validated on acted speech (for example [1], [2], [3], [4]). With the introduction of call centers associated with the growing services industry, the focus has shifted to spontaneous speech¹ [5], [6], [7], [8], [9], [10], [11].
Speech emotion recognition systems that perform with high accuracy on acted speech datasets do not perform as well on realistic natural speech [12]. This can partly be attributed to mismatched train-test datasets; however, the fact remains that acted speech is an exaggeration of emotions in a way that spontaneous speech is not. There are two problems associated with spontaneous speech, namely (i) building a spontaneous speech database suitable for emotion recognition and (ii) reliable emotion annotation of spontaneous speech by human annotators. In spite of these problems, emotion recognition of spontaneous natural speech has attracted the attention of researchers (for example [13], [8], [12], [14], [15], [16], [17], [18]). In [14], the authors proposed a combination of acoustic, lexical, and discourse information for emotion recognition in a spoken dialogue system and found improvements in recognition performance. In [19], the authors described an approach to improve emotion recognition in spontaneous children's speech by combining acoustic and linguistic features. More recently, with a view to identifying emotions in near real time, an incremental emotion recognition system has been proposed that updates the recognized emotion with each recognized word in the conversation [20]. It makes use of three features (cepstral, intonation, and textual), obtained at the word level, to estimate the emotion with better accuracy.

In this paper, we propose a framework for emotion recognition that can work for both spontaneous and acted speech. The framework is a combination of several modules, each of which extracts information related to the emotion. Combining the output of these modules produces a better estimate of the emotion. Unlike [20], we do not rely only on word recognition to determine the emotion.

¹We will use the terms spontaneous, non-acted, and natural speech interchangeably in this paper.
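To make the module-combination idea concrete, the sketch below shows one generic way such per-module outputs could be fused at the score level. This is an illustrative sketch only, not the paper's actual method: the module names, weights, and the four-emotion set are invented assumptions.

```python
# Hypothetical score-level fusion of per-module emotion scores.
# Module names, weights, and the emotion set are illustrative
# assumptions, not the configuration used in the paper.

EMOTIONS = ["neutral", "angry", "happy", "sad"]

def fuse(module_scores, weights):
    """Weighted sum of per-module scores over the emotion set, renormalized."""
    fused = {e: 0.0 for e in EMOTIONS}
    for name, scores in module_scores.items():
        w = weights.get(name, 1.0)
        for e in EMOTIONS:
            fused[e] += w * scores.get(e, 0.0)
    total = sum(fused.values()) or 1.0
    return {e: s / total for e, s in fused.items()}

# Example: an acoustic module and a context module disagree;
# fusion lets the context evidence shift the final decision.
scores = {
    "acoustic": {"neutral": 0.4, "angry": 0.3, "happy": 0.2, "sad": 0.1},
    "context":  {"neutral": 0.2, "angry": 0.6, "happy": 0.1, "sad": 0.1},
}
weights = {"acoustic": 1.0, "context": 0.5}
fused = fuse(scores, weights)
best = max(fused, key=fused.get)  # "angry" under these invented numbers
```

A simple weighted sum is only one possible fusion rule; the point is that each module contributes evidence, and the combined score is what determines the recognized emotion.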
This makes our system feasible even for resource-deficient languages that lack a robust Automatic Speech Recognition (ASR) engine. Our framework is motivated by the hypothesis that the emotion in a spontaneous speech utterance at any instant of time depends not only on the instantaneously extracted emotion, but also on (a) the time lapse of the utterance within the audio call, and (b) context-based information (events derived from linguistic content). We validate the proposed framework through experiments in diverse scenarios covering both acted and spontaneous speech. Moreover, the usefulness of knowledge incorporation is tested with two spontaneous datasets in two different train-test conditions, where the emotion models are generated from (a) acted speech samples (mismatched scenario) and (b) spontaneous speech samples (matched scenario). The framework and its validation in realistic scenarios are the main contributions of this paper.

The rest of the paper is organized as follows. Section II presents the challenges in determining emotion in spontaneous speech and motivates the proposed framework. In Section III, we propose the framework for emotion recognition that incorporates knowledge-based information. Section IV describes the datasets, experiments and results. We conclude in Section V.
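As a concrete illustration of the time-lapse part of this hypothesis, the sketch below biases an instantaneous emotion estimate with a prior that depends on how far the call has progressed. Everything here is an invented assumption for illustration (the emotion set, the specific prior, and all numeric values); it is not the prior-knowledge model proposed in the paper.

```python
# Hypothetical illustration: re-weight an instantaneous emotion estimate
# with a time-lapse prior. The assumed (invented) prior says anger becomes
# more likely as a call progresses.

EMOTIONS = ["neutral", "angry", "happy", "sad"]

def time_lapse_prior(fraction_elapsed):
    """Invented prior over emotions given the fraction of the call elapsed."""
    anger = 0.25 + 0.5 * fraction_elapsed     # grows from 0.25 to 0.75
    rest = (1.0 - anger) / 3.0                # remainder split evenly
    return {"neutral": rest, "angry": anger, "happy": rest, "sad": rest}

def adjust(instant_scores, fraction_elapsed):
    """Combine instantaneous scores with the time-lapse prior and renormalize."""
    prior = time_lapse_prior(fraction_elapsed)
    post = {e: instant_scores[e] * prior[e] for e in EMOTIONS}
    z = sum(post.values()) or 1.0
    return {e: s / z for e, s in post.items()}

# With a perfectly ambiguous (uniform) instantaneous estimate, the prior
# alone decides: early in the call it stays uniform, late it favors anger.
uniform = {e: 0.25 for e in EMOTIONS}
early = adjust(uniform, 0.0)
late = adjust(uniform, 1.0)
```

The same multiplicative re-weighting pattern could, in principle, also apply a context-event prior; the specific form the authors use is developed in Section III.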