Evaluating an Authentic Audio-Visual Expressive Speech Corpus

Rilliard Albert, Aubergé Véronique & Audibert Nicolas
Institut de la Communication Parlée, Grenoble, France
E-mail: {rilliard, auberge, audibert}@icp.inpg.fr

ABSTRACT

This paper presents an evaluation of the acted part of an audio-visual corpus of emotional speech. This corpus is intended to collect both spontaneous and acted emotions, so the perceptive efficiency of the stimuli in carrying emotional expression has to be rated. The evaluation of the acted speech is presented here; it will give us a scale against which to measure the spontaneous expressions.

INTRODUCTION

When one aims at describing and analysing emotional speech, one rapidly faces the problem of data: how to find corpora that combine a wide range of emotions, minimal linguistic variability, and high recording quality? Most corpora of emotional speech are based on acted or elicited emotions (see e.g. Douglas-Cowie et al., 2003, for a review). We have therefore tried to build a corpus (for French) that contains both acted and spontaneous emotions, with very basic and repetitive linguistic variations, using a protocol inspired by the Wizard of Oz technique. The corpus and the methodology are described in Aubergé et al. (2003). The collected data are aimed at studying the expression of emotions in speech, in relation with the face. The study presented here consists both in evaluating the corpus and in perceptually testing some hypotheses about the prosodic morphology that encodes the expressions. It follows a general principle that we proposed for the evaluation of prosody (Rilliard & Aubergé, 2003): the evaluation of prosody must be “modularized” into the different functions encoded by prosody (direct emotion encoding being one of them) and must be related to the cognitive representation of prosody.
In a previous work on the acoustic analysis of the corpus, we proposed to extend a model of prosody based on the integration of superposed Gestalt contours and gradient tuning (Aubergé, 2002) to these contours of direct expressions of emotions.

A Corpus of Emotional Speech

One of the major singularities of these data is that the emotions are first produced spontaneously by the speakers, and then acted by the same speakers (who are professional actors). The induction of the emotional variations was not expected by the speakers, who were performing a Wizard of Oz task driven by a dedicated man-machine scenario (cf. the “Sound Teacher” scenario in Aubergé et al., 2003). For this paper, we only work on the productions of one speaker. One of the greatest advantages of such a method is the strict control over linguistic variation: the same sentences are repeatedly produced by the speakers with all the emotional variations. This is very important for analysing the acoustic parameters linked to the expression of emotions: all the non-emotional variations are counterbalanced. Some other constraints were imposed on our corpus:
- The collected emotions had to be spontaneously experienced and expressed before being acted.
- Speech was recorded in a soundproof room in order to ensure very high sound quality.
Both the speech signal and a facial video of the speaker were recorded synchronously during the experiment. Physiological data (skin conductance, skin temperature, heart rate, respiration rate and EMG) were also recorded, in order to validate the speaker’s actual physiological changes during the recording. The sentences used during the experiment were based on:
- Five monosyllabic French words referring to different colours. They were chosen in order to propose a set of vowels dispersed over the vocalic triangle: [u o a i], in the words “rouge, jaune, sable, vert, brique”.
- A longer stimulus of three syllables was also recorded: “page suivante” (next page).
At the end of the “Sound Teacher” scenario, the speaker was interviewed in order to write down a list of all the different emotions he had experienced. He then had to repeat the same utterances (the colour words and “page suivante”), plus ten sentences of three to seven syllables. Each stimulus was produced with the complete set of acted emotions, based on the “big six” emotions (i.e. happiness, sadness, fear, disgust, anger, surprise) plus the emotions he thought he had experienced during the first phase (i.e. neutral, anxiety, deception, amusement, worried, resignation, satisfaction, expectancy).

Validation of the Corpus

In order to validate the emotional expressions collected through such a paradigm, a perceptive validation has to be carried out. It must first validate the acted emotions: the “big six”, and the emotions reported by the speaker himself. The results of this test give a first map of which emotions listeners can efficiently perceive, and which ones cannot be differentiated. The spontaneous data can then be evaluated on a pre-tuned set of emotional categories. This paper presents the results of the first step of the evaluation: the analysis of the acted emotional expressions.

Subjects

26 subjects participated in this experiment: 4 males and 22 females, aged 19 to 45 (mean age 25).
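As a purely illustrative sketch (the response data below are invented, not taken from the paper), a forced-choice identification test of the kind described here is commonly scored with a confusion matrix, from which per-emotion recognition rates and typical confusions can be read off:

```python
# Hypothetical sketch: scoring a forced-choice emotion identification test
# with a confusion matrix. The labels and listener responses are invented
# for illustration; they are not data from the corpus.
from collections import Counter

EMOTIONS = ["happiness", "sadness", "fear", "disgust", "anger", "surprise"]

def confusion_matrix(responses):
    """responses: list of (intended, perceived) label pairs."""
    counts = Counter(responses)
    return {
        intended: {perceived: counts[(intended, perceived)]
                   for perceived in EMOTIONS}
        for intended in EMOTIONS
    }

def recognition_rate(matrix, emotion):
    """Proportion of stimuli intended as `emotion` that were perceived as such."""
    row = matrix[emotion]
    total = sum(row.values())
    return row[emotion] / total if total else 0.0

# Invented responses: three listeners judging one anger and one fear stimulus.
responses = [
    ("anger", "anger"), ("anger", "disgust"), ("anger", "anger"),
    ("fear", "surprise"), ("fear", "fear"), ("fear", "fear"),
]
m = confusion_matrix(responses)
print(recognition_rate(m, "anger"))  # 2 of the 3 anger stimuli identified
```

A matrix of this kind directly supports the two questions raised above: which emotions listeners perceive efficiently (high diagonal values) and which ones cannot be differentiated (systematic off-diagonal confusions).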