Enhanced Robustness in Speech Emotion Recognition Combining Acoustic and Semantic Analyses

Ronald Müller, Björn Schuller, Gerhard Rigoll
Technische Universität München, Institute for Human-Machine Communication
Arcisstr. 21, D-80333 München, Germany
{mueller, schuller, rigoll}@mmk.ei.tum.de

Abstract

In this contribution we give a brief introduction to a system that allows for enhanced robustness in emotion recognition from speech. The underlying emotion set consists of seven discrete states derived from the MPEG-4 standard: anger, disgust, fear, joy, neutral, sadness, and surprise. We briefly present several novel approaches to acoustic feature sets, to the semantic analysis of the spoken content, and to the combination of these two information streams in a soft decision fusion. First, statistical prosodic features extracted from the speech signal are ranked by their quantitative contribution to the estimation of an emotion; among the various classification methods investigated, Support Vector Machines performed best on this task. Second, an approach to emotion recognition from the spoken content is introduced, applying Bayesian network based spotting for emotional key phrases. Finally, the two information sources are integrated in a soft decision fusion using a Multi-Layer Perceptron (MLP), which leads to a remarkable improvement in the overall recognition results.

1. System overview

[Figure 1: System architecture]

Figure 1 shows the proposed system architecture, which allows for robust emotion recognition as part of an automotive infotainment application. Spoken utterances are recorded and subsequently processed by units for feature extraction and Automatic Speech Recognition (ASR), resulting in two information streams: one focusing on acoustic properties, the other addressing the contained linguistic information. At the end of the two streams we obtain 14 confidences in total, i.e. one per emotion and stream, which allows for entirely probabilistic post-processing in the stream fusion. There, an MLP performs a soft decision fusion, providing one confidence for each of the seven discrete emotions as output. If desired, the final hard decision can be made at this point via the maximum likelihood principle.
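As an illustration of this late fusion step, the following minimal sketch shows how the 14 stream confidences could be fed to an MLP that outputs seven emotion confidences, with an optional maximum likelihood hard decision. The network topology (a single hidden layer of 20 units), the training setup, and all identifiers are our own assumptions; the exact configuration is not specified here.

```python
# Hypothetical sketch of the soft decision fusion stage (Sec. 1).
# Inputs: the per-emotion confidences of the acoustic and semantic
# streams (7 each = 14 features). Topology and training are assumed.
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def train_fusion_mlp(acoustic_conf, semantic_conf, labels):
    """acoustic_conf, semantic_conf: (n_samples, 7) confidence arrays;
    labels: integer indices 0..6 into EMOTIONS (so that the column
    order of predict_proba matches EMOTIONS)."""
    X = np.hstack([acoustic_conf, semantic_conf])   # 14 input features
    mlp = MLPClassifier(hidden_layer_sizes=(20,),   # assumed hidden layer
                        activation="logistic",
                        max_iter=1000)
    mlp.fit(X, labels)
    return mlp

def fuse(mlp, acoustic_conf, semantic_conf):
    X = np.hstack([acoustic_conf, semantic_conf])
    conf = mlp.predict_proba(X)        # soft decision: 7 confidences each
    hard = conf.argmax(axis=1)         # optional maximum likelihood pick
    return conf, [EMOTIONS[i] for i in hard]
```

Keeping the fusion probabilistic until this last step is what makes the final argmax optional: downstream components may consume the seven confidences directly instead of a single hard label.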
2. Emotional Speech Corpus

The emotional speech corpus was collected within the framework of the FERMUS III project [1], which deals with emotion recognition in an automotive environment. The corpus consists of 2,828 acted emotional samples from 13 speakers, used for training and evaluation in the prosodic and semantic analyses. The samples were recorded over a period of one year to avoid anticipation effects on the part of the actors. While such acted emotions form a reasonable basis for a first impression of the obtainable performance, spontaneous emotions promise more realistic results, especially with regard to the spoken content. A second set therefore consists of 700 selected utterances from automotive infotainment speech interaction dialogs, recorded for the evaluation of the fusion. Disgust and sadness were of minor interest in this project; these emotions were therefore provoked in additional usability test setups to ensure an equal distribution of the emotions in the data set.

3. Acoustic Analysis

In contrast to former work [1], which compared static and dynamic feature sets for the prosodic analysis, we focus on derived static features here. Initially, the raw contours of pitch and energy are calculated, since these depend only on broad classes of sounds. Spectral characteristics, on the other hand, depend too strongly on the phonetic content of an utterance; therefore, only the spectral energy below 250 Hz and below 650 Hz is used as spectral information. The signal energy values correspond to the logarithmic mean energy within a frame. The Average Magnitude Difference Function (AMDF) provides the pitch contour. This method proves robust against noise but susceptible to dominant
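To make the frame-wise analysis concrete, the sketch below computes the logarithmic mean frame energy and an AMDF-based pitch estimate over an utterance. The frame and hop sizes, the pitch search range of 80-400 Hz, and all identifiers are assumptions for illustration, not the authors' exact implementation; in particular, no voiced/unvoiced decision is made.

```python
# Hypothetical sketch of the frame-wise analysis in Sec. 3:
# log mean frame energy and AMDF-based pitch estimation.
import numpy as np

def frame_log_energy(frame, eps=1e-10):
    """Logarithmic mean energy within one frame."""
    return np.log(np.mean(frame ** 2) + eps)

def amdf_pitch(frame, sr, f_lo=80.0, f_hi=400.0):
    """Pitch via the Average Magnitude Difference Function:
    D(tau) = mean(|x[n] - x[n + tau]|). The lag with the deepest
    valley inside the search range is taken as the pitch period."""
    lag_min = int(sr / f_hi)                         # shortest period
    lag_max = min(int(sr / f_lo), len(frame) - 1)    # longest period
    amdf = np.array([np.mean(np.abs(frame[:-tau] - frame[tau:]))
                     for tau in range(lag_min, lag_max + 1)])
    best_lag = lag_min + int(np.argmin(amdf))
    return sr / best_lag                             # estimate in Hz

def contours(signal, sr, frame_len=0.025, hop=0.010):
    """Raw pitch and energy contours over a whole utterance
    (25 ms frames, 10 ms hop; both values assumed)."""
    n, h = int(frame_len * sr), int(hop * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, h)]
    energy = np.array([frame_log_energy(f) for f in frames])
    pitch = np.array([amdf_pitch(f, sr) for f in frames])
    return pitch, energy
```

Because the AMDF only differences the waveform against a delayed copy of itself, additive broadband noise tends to raise the whole curve without shifting the valley, which is consistent with the robustness against noise noted above.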