Multimodal Emotion Recognition with Automatic Peak Frame Selection

Sara Zhalehpour #1, Zahid Akhtar *2, Cigdem Eroglu Erdem #3
# Dept. of Electrical and Electronics Engineering, Bahcesehir University, Istanbul, Turkey
1 sara.zhalehpour@stu.bahcesehir.edu.tr
3 cigdem.eroglu@bahcesehir.edu.tr
* Dept. of Mathematics and Computer Science, University of Udine, Udine, Italy
2 zahid.akhtar@uniud.it

Abstract—In this paper we present an effective framework for multimodal emotion recognition based on a novel approach for automatic peak frame selection from audio-visual video sequences. Given a video with an emotional expression, peak frames are the ones at which the emotion is at its apex. The objective of peak frame selection is to make the training process of the automatic emotion recognition system easier by summarizing the expressed emotion over a video sequence. The main steps of the proposed framework consist of extraction of video and audio features based on peak frame selection, unimodal classification, and decision level fusion of the audio and visual results. We evaluated the performance of our approach on the eNTERFACE'05 audio-visual database containing six basic emotional classes. Experimental results demonstrate the effectiveness and superiority of the proposed system over other methods in the literature.

Keywords—multimodal emotion recognition; peak frame selection; decision level fusion; affective computing

I. INTRODUCTION

Automatic recognition of human emotional states is an important problem in human-computer interaction. Recognition of the emotional state of a person has many applications in very diverse areas such as psychology [1], security [2], health care [3], education, marketing and advertising. Humans express their emotions through various channels: speech, facial expressions, head motion, body gestures, etc. Therefore, a joint analysis of these channels in a multimodal system is hypothesized to yield better and more robust performance in automatic emotion recognition.

There are several approaches to human emotion recognition, most of which focus on the visual and audio modalities. For the audio modality, the state of the art is usually based on prosodic features along with spectral, cepstral and voice quality features [4-6]. The visual modality, on the other hand, is the most widely used channel, for which state-of-the-art methods typically rely on 2D facial features. These features can be broadly grouped into geometric features and appearance-based features. Geometric features localize salient facial points and detect the emotion based on the deformation of these points, whereas appearance-based features represent the change in the texture of the expressive face [2, 3, 7, 8].

Since human emotion relies heavily on both audio and visual information, a multimodal approach is more reasonable for an emotion recognition system. Although research on the audio and visual channels has progressed considerably in recent years, the integration of these channels is still an open research problem [9]. Recent studies have shown many advantages of fusing the audio and video channels for emotion recognition [10-15]. Mansurizadeh et al. [10] propose an asynchronous feature level fusion approach which uses both feature and decision level fusion.
This approach is based on the observation that cues from the facial image sequence and the audio track of an audio-visual recording are not temporally aligned; hence, audio and video features that are related to the same emotional event have a higher chance of overlapping in time and should therefore be fused together. Gajsek et al. [11] present an audio-visual emotion recognition system which uses prosodic and cepstral coefficients as audio features and Gabor wavelets as video features, followed by feature selection using a stepwise method. They use a multi-class classifier system to combine the outputs of the different classifiers. Datcu et al. [12] present a multimodal semantic data fusion model. Their method uses two types of geometric facial features depending on the presence or absence of speech, and removes the influence of speech on the face shape by using only eye and eyebrow related features. Yongjin et al. [13] propose a method for audio-visual emotion recognition which selects a single peak frame from each audio-visual sequence. The peak frames are selected as the frames with the highest speech amplitude, and the visual features are extracted from these peak frames using Gabor wavelets. The audio features are based on prosody and cepstral-related coefficients. The audio and video modalities are fused at the feature level.
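To make the peak-frame idea concrete, the following Python sketch selects the video frame that is aligned in time with the loudest portion of the speech signal, in the spirit of the amplitude-based selection described for [13]. The function name, default frame rate, and window length are illustrative assumptions and do not reproduce the implementation of [13] or of the framework proposed in this paper.

```python
# Minimal sketch: pick the video frame aligned with the maximum
# short-time speech energy (illustrative only).
import numpy as np


def select_peak_frame(audio, sample_rate, num_video_frames,
                      video_fps=25.0, win_sec=0.04):
    """Index of the video frame aligned with the loudest speech window."""
    win = max(1, int(win_sec * sample_rate))      # samples per analysis window
    n_windows = len(audio) // win
    if n_windows == 0:                            # clip shorter than one window
        return 0
    # Short-time energy over non-overlapping windows.
    energy = np.array([np.sum(audio[i * win:(i + 1) * win] ** 2)
                       for i in range(n_windows)])
    peak_time = (np.argmax(energy) + 0.5) * win_sec   # centre of loudest window (s)
    peak_frame = int(round(peak_time * video_fps))    # nearest video frame index
    return min(max(peak_frame, 0), num_video_frames - 1)
```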
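Unlike the feature level fusion of [13], the framework presented in this paper combines the audio and visual results at the decision level. As a generic illustration of such a step, the sketch below fuses the per-class scores of two independently trained classifiers with a weighted sum rule; the weighting and the sum rule are common choices assumed here for illustration and are not necessarily the fusion rule used in the proposed framework.

```python
# Generic decision level fusion sketch: weighted sum of normalized
# class scores from the audio and visual classifiers.
import numpy as np


def fuse_decisions(audio_scores, visual_scores, audio_weight=0.5):
    """Return the index of the emotion class predicted after fusion."""
    audio_scores = np.asarray(audio_scores, dtype=float)
    visual_scores = np.asarray(visual_scores, dtype=float)
    # Normalize each modality so the scores are comparable.
    audio_scores = audio_scores / (audio_scores.sum() + 1e-12)
    visual_scores = visual_scores / (visual_scores.sum() + 1e-12)
    fused = audio_weight * audio_scores + (1.0 - audio_weight) * visual_scores
    return int(np.argmax(fused))
```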