Affective Content Analysis of Music Video Clips

Ashkan Yazdani, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland, ashkan.yazdani@epfl.ch
Krista Kappeler, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland, krista.kappeler@epfl.ch
Touradj Ebrahimi, Multimedia Signal Processing Group (MMSPG), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland, touradj.ebrahimi@epfl.ch

ABSTRACT
Nowadays, the amount of multimedia content is increasing explosively, and it is often challenging to find content that is appealing or that matches a user's current mood or affective state. To achieve this goal, efficient indexing techniques should be developed to annotate multimedia content so that these annotations can be used in a retrieval process with an appropriate query. One approach to such indexing is to determine the affect (type and intensity) that a piece of multimedia content can induce in a user while consuming it. In this paper, affective content analysis of music video clips is performed to determine the emotions they can induce in people. To this end, a subjective test was conducted in which 32 participants watched different music video clips and assessed the emotions these clips induced in them. These self-assessments were used as ground truth, and the results of classification using audio, visual, and audiovisual features extracted from the music video clips are presented and compared.

Categories and Subject Descriptors
H.5.2 [INFORMATION INTERFACES AND PRESENTATION]: User Interfaces—Evaluation/methodology; I.5.4 [PATTERN RECOGNITION]: Applications—Signal processing, Waveform analysis

General Terms
Algorithms, Measurement, Performance, Experimentation

Keywords
Affect, Emotion, Multimedia content analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MIRUM'11, November 30, 2011, Scottsdale, Arizona, USA. Copyright 2011 ACM 978-1-4503-0987-5/11/11 ...$10.00.

1. INTRODUCTION
With the increasing popularity of video-on-demand (VOD) and personalized recommendation services, the development of automatic content descriptions for annotation, retrieval, and personalization purposes is a key issue. The technology required to achieve this goal is referred to as multimedia content analysis (MCA) and aims at bridging the "semantic gap", that is, at developing models of the relationship between low-level features and the semantics conveyed by multimedia content. To this end, two different approaches have been adopted for MCA so far: cognitive and emotional. The cognitive approach analyzes a piece of multimedia content in terms of the semantics of a scene: location, characters, and events. In the past decade, most MCA-related research efforts have focused on such methods.

The emotional approach, on the other hand, attempts to characterize a given multimedia content by the emotions it may elicit in viewers. This approach is often referred to as affective MCA or affective content analysis, and it predicts or infers viewers' emotional reactions when perceiving multimedia content. The emotional approach has been less investigated than the cognitive approach, but its importance has been growing rapidly with the increasing awareness of the role that the emotional load of multimedia content, and viewers' reactions to it, plays in VOD concepts and personalized multimedia recommendation.
Analyzing multimedia content at the affective level reveals information describing its emotional value. This value can be defined as the type and intensity of the affect (feeling or emotion) experienced by the audience while consuming the content. To analyze a given multimedia content at the affective level, an appropriate emotion model must be developed and used. How to represent and model emotions is, however, a challenging question. Numerous theorists and researchers have studied this subject, and a large body of literature exists, offering sometimes very different solutions. Generally, there are two families of emotion models: categorical models and dimensional models. The rationale behind categorical models is to define discrete basic categories of emotions from which every other emotion can be built by combination. The most commonly used set of basic emotions, identified by Ekman [4], consists of 'fear', 'anger', 'sadness', 'joy', 'disgust', and 'surprise'. Dimensional models, in contrast, describe the underlying components of emotions and are often represented as a two- or three-dimensional space in which emotions appear as points in the coordinate space spanned by these dimensions. The goal of a dimensional model is not to find a finite set of emotions, as in the categorical models, but to find a finite set of underlying components of emotions. Many theorists have proposed that emotions can be modeled with three underlying dimensions, namely Valence (V) or Pleasantness (P), Arousal (A), and Dominance (D).
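To make the relation between the two families of models concrete, the sketch below places Ekman's six basic emotions at points in a three-dimensional Valence-Arousal-Dominance (VAD) space, each axis scaled to [-1, 1], and maps an arbitrary VAD point to its nearest basic emotion. The coordinates are hand-picked for illustration only and are not taken from the paper or from any validated set of affective norms.

```python
import math

# Illustrative VAD coordinates (valence, arousal, dominance) for Ekman's
# six basic emotions, each axis in [-1, 1]. These placements are rough
# assumptions for demonstration, not empirically derived values.
BASIC_EMOTIONS = {
    "joy":      ( 0.8,  0.5,  0.4),
    "sadness":  (-0.7, -0.4, -0.3),
    "anger":    (-0.6,  0.7,  0.3),
    "fear":     (-0.7,  0.6, -0.6),
    "disgust":  (-0.6,  0.3,  0.1),
    "surprise": ( 0.2,  0.8, -0.1),
}

def nearest_basic_emotion(valence, arousal, dominance):
    """Map a point in VAD space to the closest basic emotion (Euclidean)."""
    point = (valence, arousal, dominance)
    return min(
        BASIC_EMOTIONS,
        key=lambda name: math.dist(point, BASIC_EMOTIONS[name]),
    )

if __name__ == "__main__":
    # A pleasant, moderately aroused state falls nearest to 'joy' here.
    print(nearest_basic_emotion(0.7, 0.4, 0.3))
```

This nearest-neighbor mapping illustrates why dimensional models subsume categorical ones: any categorical label can be recovered as a region of the continuous space, while points between the anchors express blends that a discrete model cannot.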