F0 and pause features analysis for Anger and Fear detection in real-life spoken dialogs

L. Devillers, I. Vasilescu, L. Vidrascu
LIMSI-CNRS, Orsay, France; ENST-CNRS, TSI, Paris, France
devil@limsi.fr, vasilescu@tsi.enst.fr, vidrascu@limsi.fr

Abstract

This paper describes recent work focusing on F0 and pause feature detection for two negative emotions, Anger and Fear, occurring in real-life human-human spoken dialogs. Most current studies do not differentiate within the class of negative emotions, although an automatic system should adopt appropriate strategies according to the particular negative emotion. In this paper we consider two types of prosodic cues aimed at differentiating between the two negative emotions Anger and Fear. The work is carried out in the context of the AMITIES project, in which spoken dialog systems for call center services are being developed. The F0 features are two range parameters, one at the sentence level and the other at the sub-segment level. The pause features are meaningful silent pauses and the filler pause "euh". We correlate all the features with emotion labels and with two variables, gender and speaker (agent vs. client). The study shows that, globally, pause features are a more reliable cue than F0 parameters for distinguishing between Anger and Fear. However, differences in both F0 and pause patterns need to be interpreted according to speaker and dialogic context.

1. Introduction

In recent years there has been a growing interest in the study of emotions [1][3][11] in order to improve the capabilities of current speech technologies (speech synthesis, speech recognition, and dialog systems). In the context of human-machine interaction, the study of emotion has generally been aimed at the automatic extraction of mood features in order to dynamically adapt the dialog strategy of the automatic system. Most studies focus on the opposition between negative and positive emotions.
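As a concrete illustration (not the authors' implementation), the feature set described in the abstract, two F0 range parameters plus counts of silent pauses and "euh" fillers, might be computed along the following lines, assuming a per-frame F0 contour in Hz (0 marking unvoiced frames) and a time-aligned token list; all names, thresholds, and data here are invented for the sketch.

```python
# Illustrative sketch only: F0 range at the sentence level and at the
# sub-segment level, plus silent-pause and "euh" filler counts.
# Assumes f0 is a per-frame contour in Hz, with 0 for unvoiced frames.

def f0_ranges(f0, n_subsegments=4):
    """Return (sentence_range, max_subsegment_range) over voiced frames."""
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return 0, 0
    sentence_range = max(voiced) - min(voiced)
    # Split the contour into roughly equal sub-segments; take each local range.
    step = max(1, len(f0) // n_subsegments)
    sub_ranges = []
    for i in range(0, len(f0), step):
        chunk = [v for v in f0[i:i + step] if v > 0]
        if chunk:
            sub_ranges.append(max(chunk) - min(chunk))
    return sentence_range, max(sub_ranges) if sub_ranges else 0

def pause_features(tokens, min_silence=0.2):
    """Count silent pauses longer than min_silence (s) and 'euh' fillers.

    tokens: list of (label, start, end); label '<sil>' marks silence.
    """
    n_sil = sum(1 for lab, s, e in tokens
                if lab == "<sil>" and (e - s) >= min_silence)
    n_euh = sum(1 for lab, _, _ in tokens if lab == "euh")
    return n_sil, n_euh

# Toy example: a short contour and a three-token turn.
f0 = [0, 110, 120, 150, 140, 0, 0, 180, 170, 160]
tokens = [("oui", 0.0, 0.3), ("<sil>", 0.3, 0.8), ("euh", 0.8, 1.0)]
print(f0_ranges(f0))           # → (70, 30)
print(pause_features(tokens))  # → (1, 1)
```

The minimum-silence threshold (0.2 s here) is an assumption of the sketch; the paper speaks only of "meaningful" silent pauses without specifying a cutoff.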
However, distinctions should also be made within the negative class: according to the type of negative emotion, the system will adopt a different strategy. The question we address here is whether the two main negative emotions present in our corpus, Anger and Fear, show prosodic manifestations robust enough to differentiate them.

According to Scherer [14], the literature on emotions defines Anger as being vocally expressed by an increase in mean F0 and mean intensity, as well as in F0 variability, manifested in an increased F0 range. Further signs of Anger seem to be an increase in high-frequency energy and downward-directed F0 contours. The rate of articulation increases as well. Concerning Fear, the data show increases in mean F0, F0 range, and high-frequency energy; the rate of articulation seems to increase as well. It thus appears that the two emotions have quite similar manifestations. For other researchers, the manifestations of Fear have more intense F0 patterns than those of Anger [16]. As Scherer [15] has pointed out, there is an apparent contradiction between the difficulty of finding acoustic differentiation of emotional states and the comparative ease with which listeners are able to judge emotions from speech.

In previous studies on the AMITIES corpus [2], we have shown that F0 range variations allow us to distinguish between negative and positive emotions. We have also found that, in perceptual tests, subjects are able to differentiate Anger and Fear [9] with 75% accuracy.

In this paper we aim to analyse the F0 and pause features enabling us to differentiate between Anger and Fear. We correlate the F0 and pause features with the emotion labels given two variables: gender (male/female) and speaker (agent/client).

(This work was partially financed by the European Commission under the IST-2000-25033 AMITIES project.)
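The analysis just described, relating each feature to the emotion labels conditioned on gender and speaker role, amounts to summarizing the feature per (emotion, gender, speaker) cell. A minimal sketch, with invented field names and toy values:

```python
# Illustrative sketch only: mean feature value per (emotion, gender, speaker)
# group, as a stand-in for the per-variable correlation analysis.
from collections import defaultdict

def grouped_means(records, feature):
    """records: dicts with 'emotion', 'gender', 'speaker' and feature keys."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r["emotion"], r["gender"], r["speaker"])
        sums[key] += r[feature]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Toy data: F0 range (Hz) for a few labeled turns (values invented).
data = [
    {"emotion": "Anger", "gender": "F", "speaker": "client", "f0_range": 80},
    {"emotion": "Anger", "gender": "F", "speaker": "client", "f0_range": 100},
    {"emotion": "Fear",  "gender": "F", "speaker": "client", "f0_range": 120},
]
print(grouped_means(data, "f0_range"))
# → {('Anger', 'F', 'client'): 90.0, ('Fear', 'F', 'client'): 120.0}
```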
The present study is carried out within the framework of the IST AMITIES (Automated Multi-lingual Interaction with Information and Services) project, and makes use of a corpus of real agent-client dialogs recorded in French (for independent purposes) at a Stock Exchange Customer Service Center. In the following sections, we describe the corpus and the data processing (section 2) and the analysis of the F0 and pause features (section 3). Conclusions and further research are discussed in section 4.

2. Corpus

The dialogs are real agent-client recordings from a Web-based Stock Exchange Customer Service center. These recordings were made for purposes independent of this study, and have been made available for use in developing an automated call routing service within the context of the AMITIES project. The service center can be reached via an Internet connection or by directly calling an agent. While many of the calls involve problems in using the Web to carry out transactions (general information, complicated requests, transactions, confirmations, connection failures), some of the callers simply seem to prefer interacting with a human agent. A corpus of 100 agent-client dialogs (4 different agents) in French has been orthographically transcribed.

Table 1: Characteristics of the corpus of 100 agent-client dialogs.

  # agents        4
  # clients       100
  # turns/dialog  ave: 50   min: 5   max: 227
  # words/turn    ave: 9    min: 1   max: 128
  # words total   44.1k
  # distinct      3k

Speech Prosody 2004, Nara, Japan, March 23-26, 2004. ISCA Archive: http://www.isca-speech.org/archive
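Summary statistics of the kind reported in Table 1 (average/min/max turns per dialog and words per turn, plus total word count) can be reproduced with a short script; the toy dialogs below are invented for illustration.

```python
# Illustrative sketch only: corpus statistics as in Table 1.
# Each dialog is a list of turn strings; words are whitespace-separated.

def corpus_stats(dialogs):
    turns_per_dialog = [len(d) for d in dialogs]
    words_per_turn = [len(turn.split()) for d in dialogs for turn in d]

    def summary(xs):
        return {"ave": sum(xs) / len(xs), "min": min(xs), "max": max(xs)}

    return {"turns/dialog": summary(turns_per_dialog),
            "words/turn": summary(words_per_turn),
            "words total": sum(words_per_turn)}

# Toy corpus of two short dialogs.
dialogs = [
    ["bonjour", "je voudrais passer un ordre", "merci au revoir"],
    ["allo", "oui"],
]
print(corpus_stats(dialogs))
```

With the real 100-dialog corpus this would yield the figures in Table 1 (ave 50 turns/dialog, ave 9 words/turn, 44.1k words total).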