Emotion Recognition using Acoustic and Lexical Features

Viktor Rozgić (1), Sankaranarayanan Ananthakrishnan (1), Shirin Saleem (1), Rohit Kumar (1), Aravind Namandi Vembu (2), Rohit Prasad (1)

(1) Speech, Language and Multimedia Technologies, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA
{vrozgic,sanantha,ssaleem,rkumar,rprasad}@bbn.com
(2) Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA
namandiv@usc.edu

Approved for Public Release, Distribution Unlimited. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific under Contract No. N66001-11-C-4094. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA and Space and Naval Warfare Systems Center Pacific.

Abstract

In this paper we present an innovative approach for utterance-level emotion recognition that fuses acoustic features with lexical features extracted from automatic speech recognition (ASR) output. The acoustic features are generated by combining: (1) a novel set of features derived from segmental Mel Frequency Cepstral Coefficients (MFCCs) scored against emotion-dependent Gaussian mixture models, and (2) statistical functionals of low-level feature descriptors such as intensity, fundamental frequency, jitter, and shimmer. These acoustic features are fused with two types of lexical features extracted from the ASR output: (1) presence/absence of word stems, and (2) bag-of-words sentiment categories. The combined feature set is used to train support vector machines (SVMs) for emotion classification. We demonstrate the efficacy of our approach by performing four-way emotion recognition on the University of Southern California's Interactive Emotional Motion Capture (USC-IEMOCAP) corpus. Our experiments show that the fusion of acoustic and lexical features delivers an emotion recognition accuracy of 65.7%, outperforming the previously reported best results on this challenging dataset.

Index Terms: emotion recognition, model-based acoustic features, lexical features

1. Introduction

Automatic assessment of the emotional state of an individual plays an important role in applications ranging from affective computing for virtual training systems [1] to early diagnosis of psychological health disorders [2]. Studies have shown that information relevant for interpreting a speaker's emotional state is contained both in the linguistic content and in the acoustic, paralinguistic properties of speech. In some cases, the linguistic content is not emotionally rich (i.e., it is very difficult to recognize emotion from the transcript), and we must rely on subtle speech characteristics such as pitch, loudness, and voicing patterns to assess the speaker's emotional state. In other cases, the speech is acoustically neutral, and the meaning and sentiments inferred from the linguistic content are the main cues for emotion recognition. Humans therefore actively weigh the importance of both "What is said?" and "How is it said?".

The accuracy of automatic emotion recognition from speech depends largely on the choice of informative and meaningful features. Acoustic features play a dominant role in the emotion recognition literature.
These include segmental (energy, mel-frequency cepstral coefficients (MFCCs), formants) and supra-segmental (pitch and degree of voicing) frame-level descriptors [3]. It is accepted practice to generate fixed-dimensional vectors for utterance-level emotion classification by computing various statistical functionals (mean, standard deviation, range, etc.) over the variable number of frame-level descriptors extracted from an utterance. Linguistic features used for emotion recognition, on the other hand, include presence indicators of lexemes [4], n-grams [5], and various bag-of-words representations [6]. Since automatic speech recognition (ASR) of emotional speech is a difficult problem, most of this work is based on reference transcripts [4, 6], while a small number of studies rely on ASR output [7, 8].

In this paper we present four-way utterance-level emotion (angry, happy, sad, neutral) recognition results on the USC-IEMOCAP database [9]. Our main contributions are: (a) the introduction of a feature class based on scoring of frame-level MFCCs by emotion-dependent models (model-based features), and (b) an analysis of emotion recognition performance based on feature-level fusion of acoustic and linguistic (lexeme and sentiment) features.

We based our recognition experiments on three feature classes. The first class contains a set of basic frame-level features (energy, pitch, formants), voice quality features (jitter, shimmer), voicing statistics, and MFCCs. For features in this first class we calculated various statistical functionals of the frame-level features at the utterance level. While this approach is appealing for slowly varying features, moments of the feature value distributions can oversimplify utterance-level representations of highly non-stationary features (e.g., MFCCs). To overcome this drawback, we propose a model-based feature set obtained by scoring all MFCCs within an utterance against emotion-dependent Gaussian mixture models (GMMs). We further normalize the score vectors and fuse them in three ways: (a) calculating the mean of the normalized score vectors over the utterance, (b) generating histograms by voting for the highest-scoring emotion model, and (c) estimating the parameters of a Dirichlet distribution fit to the normalized score vectors.
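The following is a minimal sketch, not the authors' implementation, of the utterance-level statistical functionals described above: a variable-length sequence of frame-level descriptors is summarized by a fixed-dimensional vector of per-descriptor functionals. The particular functionals (mean, standard deviation, range) and the descriptor count are illustrative assumptions.

```python
import numpy as np

def utterance_functionals(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, num_descriptors) array of low-level descriptors
    (e.g. energy, F0, formants, jitter, shimmer per frame).
    Returns a fixed-length vector of per-descriptor statistical functionals."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    value_range = frames.max(axis=0) - frames.min(axis=0)
    return np.concatenate([mean, std, value_range])

# Example: 250 frames of 6 descriptors -> one 18-dimensional utterance vector
rng = np.random.default_rng(0)
x = utterance_functionals(rng.normal(size=(250, 6)))
print(x.shape)  # (18,)
```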
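The sketch below illustrates, under stated assumptions, the model-based feature idea: frame-level MFCCs are scored by per-emotion GMMs, the per-frame score vectors are normalized, and three utterance-level summaries are formed, (a) the mean normalized score vector, (b) a histogram of winning-emotion votes, and (c) parameters of a Dirichlet distribution fit to the normalized scores. The GMM configuration and the moment-matching Dirichlet fit are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["angry", "happy", "sad", "neutral"]

def train_emotion_gmms(train_mfccs_by_emotion, n_components=8):
    """train_mfccs_by_emotion: dict emotion -> (num_frames, num_ceps) array."""
    gmms = {}
    for emo in EMOTIONS:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(train_mfccs_by_emotion[emo])
        gmms[emo] = gmm
    return gmms

def model_based_features(mfccs, gmms):
    """mfccs: (num_frames, num_ceps) MFCCs for one utterance."""
    # Frame-level log-likelihoods under each emotion GMM -> (num_frames, 4)
    loglik = np.stack([gmms[e].score_samples(mfccs) for e in EMOTIONS], axis=1)
    # Normalize each frame's score vector to a probability-like simplex vector
    scores = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)
    # (a) utterance-level mean of the normalized score vectors
    mean_scores = scores.mean(axis=0)
    # (b) histogram of votes for the highest-scoring emotion model per frame
    votes = np.bincount(scores.argmax(axis=1), minlength=len(EMOTIONS))
    hist = votes / votes.sum()
    # (c) Dirichlet parameters fit to the per-frame score vectors
    #     (simple moment-matching estimate; the paper's estimator may differ)
    m = scores.mean(axis=0)
    v = scores.var(axis=0) + 1e-8
    alpha0 = np.median(m * (1.0 - m) / v - 1.0)
    dirichlet_alpha = np.maximum(m * alpha0, 1e-3)
    return np.concatenate([mean_scores, hist, dirichlet_alpha])
```

With four emotion models this yields a 12-dimensional utterance vector (4 mean scores, 4 vote-histogram bins, 4 Dirichlet parameters) that can be concatenated with the functional-based acoustic features.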
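Finally, a minimal sketch of the lexical features and the feature-level fusion described above: binary word-stem presence indicators and bag-of-words sentiment-category counts computed from the ASR 1-best hypothesis, concatenated with the acoustic vector and fed to an SVM. The Porter stemmer, the toy sentiment lexicon, and the linear SVM settings are assumptions for illustration; the paper does not specify these choices here.

```python
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

# Hypothetical sentiment lexicon: word stem -> sentiment category index
SENTIMENT_LEXICON = {"love": 0, "great": 0, "hate": 1, "terribl": 1, "cri": 2}
NUM_SENTIMENT_CATEGORIES = 3

def lexical_features(asr_hypothesis, stem_vocabulary):
    """asr_hypothesis: 1-best ASR word string for one utterance.
    stem_vocabulary: ordered list of word stems observed in training."""
    stems = [stemmer.stem(w) for w in asr_hypothesis.lower().split()]
    # Presence/absence of each vocabulary stem
    presence = np.array([1.0 if s in stems else 0.0 for s in stem_vocabulary])
    # Bag-of-words counts per sentiment category
    sentiment = np.zeros(NUM_SENTIMENT_CATEGORIES)
    for s in stems:
        if s in SENTIMENT_LEXICON:
            sentiment[SENTIMENT_LEXICON[s]] += 1.0
    return np.concatenate([presence, sentiment])

def fuse_and_train(acoustic_vectors, lexical_vectors, labels):
    """Feature-level fusion: concatenate per-utterance acoustic and lexical
    vectors, then train a multi-class (one-vs-rest) linear SVM."""
    X = np.hstack([acoustic_vectors, lexical_vectors])
    clf = LinearSVC(C=1.0)
    clf.fit(X, labels)
    return clf
```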