USING EMOTIONS TO TAG MEDIA

Marco Paleari (1), Benoit Huet (1) and Brian Duffy (2)
(1) Eurecom Institute, Sophia Antipolis, France
(2) The SmartLab, University of East London, UK

In Jamboree 2007: One Day Workshop By and For KSpace Ph.D. Students, Berlin, Germany, 2007

ABSTRACT

Multimedia information indexing and retrieval is about developing techniques that allow people to find the media they are looking for effectively. Content-based methods become necessary when dealing with large databases because of the limitations inherent in metadata-based systems. Current technology allows researchers to explore the emotional space, which is known to carry rich semantic information; emotion recognition systems, however, lack sufficient reliability when dealing with real-world data. A possible solution to this problem lies in the multimodal fusion paradigm, which aims at improving robustness to real-world noise. We argue for an integrated methodology which extracts reliable affective information through a multimodal fusion system and tags the medium itself with this semantic information. EMMA, a framework currently under development in our laboratory, is described.

1. INTRODUCTION

It has been demonstrated that events and objects appraised as emotionally relevant are memorized in a more permanent way, and also that human memory is organized such that similar remembrances (i.e. those which elicit similar emotional reactions) are linked and stored close to each other. This suggests that emotions are an important characteristic of human memory, helping us retrieve the memories we are looking for [1, 2].

Considerable effort has been devoted to emotion recognition from different media. Emotions are mainly recognized from three kinds of media: audio, images (still images and video), and physiological signals. Even though studies from the indexing and retrieval community [3] acknowledge that emotions are an important characteristic of media and could be used as semantic tags in interesting ways, only a few attempts have been made to use emotions in content-based indexing.

In this paper, we present an architecture that combines emotion recognition through multimodal fusion of affective cues with automatic tagging of videos (with audio) for content-based retrieval and summarization.

2. PREVIOUS WORK

Research in this field essentially started in 2003 with Salway and Graham [4], who extracted emotional features from the transcripts of audio descriptions of films for visually impaired people. 679 words were treated as emotion tokens, each belonging to one of the 22 emotions described in the OCC model. Miyamori et al. [5] apply similar techniques to blog texts. Chan and Jones [6] use film audio and, in particular, the pitch and energy of the actors' speech. Kuo et al. [7] use film music, classifying it with features such as tempo, melody, mode, and rhythm. Finally, Kim et al. [8] use texture and color information to infer the emotion a picture elicits in viewers. These works show the interest of the community in such approaches; nevertheless, the algorithms used are often not very reliable and the system evaluations lack completeness.

3. EMMA: EMOTION MULTIMEDIA ANNOTATION

One well-known technique for increasing reliability is to exploit multimodal information through fusion. Emotions are intrinsically multimodal (i.e. they affect speech, facial expression, physiology and many other modalities).
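As a minimal sketch of this principle (our own illustration: the label set, weights and fusion rule below are assumptions, not the exact algorithm used in EMMA), decision-level fusion can be expressed as a weighted combination of the per-modality emotion probability estimates:

import numpy as np

# Hypothetical label set; the emotion categories actually used by EMMA may differ.
EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise", "neutral"]

def fuse_decisions(p_audio, p_video, w_audio=0.5, w_video=0.5):
    # Decision-level fusion: weighted average of the per-modality
    # emotion probability distributions, renormalized to sum to one.
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    fused = w_audio * p_audio + w_video * p_video
    return fused / fused.sum()

# Example: the speech signal strongly suggests anger, the face is less certain.
p_a = [0.60, 0.05, 0.05, 0.10, 0.10, 0.10]
p_v = [0.30, 0.10, 0.15, 0.15, 0.15, 0.15]
print(EMOTIONS[int(np.argmax(fuse_decisions(p_a, p_v)))])  # -> anger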
Our approach to increasing the reliability of emotion estimation therefore relies on such a multimodal fusion framework (Figure 2). EMMA extracts emotions from both the speech (auditory) and the facial expression (visual) signals. Dynamic control (Figure 1) is used to adapt the fusion algorithm to the quality of each modality: if the lighting is poor, the use of color information should be limited and the emotion estimate should privilege the auditory modality. Furthermore, EMMA couples affective and semantic labeling of the same data. Future plans include increasing the number of modalities, starting with emotion recognition from skin conductivity and heart rate (physiology).

Pitch, pitch contours, formants, speech energy, mel-frequency cepstral coefficients (MFCC), and Rasta-PLP coefficients will be computed for the audio signal. Feature point positions and movements, together with motion flow, will be used as video features.
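The dynamic control principle can be illustrated with a short sketch (again under our own assumptions: the brightness/contrast quality measure and the SNR mapping are illustrative, not the controller implemented in EMMA) in which per-modality quality scores are turned into fusion weights, so that poor lighting automatically shifts the estimate towards the auditory modality:

import numpy as np

def lighting_quality(gray_frame):
    # Crude lighting-quality score in [0, 1] from the mean brightness and contrast
    # of a greyscale frame with values in [0, 255]; an assumed measure, not the
    # quality estimator implemented in EMMA.
    brightness = gray_frame.mean() / 255.0
    contrast = gray_frame.std() / 128.0
    # Penalize frames that are very dark, very bright, or low in contrast.
    return float(np.clip(4.0 * brightness * (1.0 - brightness), 0.0, 1.0)
                 * np.clip(contrast, 0.0, 1.0))

def dynamic_weights(gray_frame, audio_snr_db):
    # Map per-modality quality estimates to fusion weights (sketch).
    q_video = lighting_quality(gray_frame)
    q_audio = float(np.clip(audio_snr_db / 30.0, 0.0, 1.0))  # ~30 dB taken as clean speech
    total = q_audio + q_video + 1e-6
    return q_audio / total, q_video / total  # (w_audio, w_video)

# A dark, flat frame with clean audio: the weights privilege the auditory modality.
dark_frame = np.full((120, 160), 20, dtype=np.uint8)
w_audio, w_video = dynamic_weights(dark_frame, audio_snr_db=25.0)

These weights could then replace the fixed w_audio and w_video of the fusion sketch given above, so that the final estimate automatically leans on the more trustworthy modality.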