Multimodal Emotion Recognition with Automatic Peak Frame Selection

Sara Zhalehpour #1, Zahid Akhtar *2, Cigdem Eroglu Erdem #3

# Dept. of Electrical and Electronics Engineering, Bahcesehir University, Istanbul, Turkey
1 sara.zhalehpour@stu.bahcesehir.edu.tr
3 cigdem.eroglu@bahcesehir.edu.tr

* Dept. of Mathematics and Computer Science, University of Udine, Udine, Italy
2 zahid.akhtar@uniud.it
Abstract—In this paper, we present an effective framework for multimodal emotion recognition based on a novel approach for automatic peak frame selection from audio-visual video sequences. Given a video containing an emotional expression, peak frames are those at which the emotion is at its apex. The objective of peak frame selection is to simplify the training of the automatic emotion recognition system by summarizing the expressed emotion over a video sequence. The main steps of the proposed framework consist of extraction of video and audio features based on peak frame selection, unimodal classification, and decision-level fusion of the audio and visual results. We evaluated the performance of our approach on the eNTERFACE'05 audio-visual database containing six basic emotion classes. Experimental results demonstrate the effectiveness and superiority of the proposed system over other methods in the literature.
Keywords— multimodal emotion recognition; peak frame selection; decision-level fusion; affective computing
I. INTRODUCTION
Automatic recognition of human emotional states is an important problem in human-computer interaction. Recognizing the emotional state of a person has many applications in diverse areas such as psychology [1], security [2], health care [3], education, marketing, and advertising.
Humans express their emotions through various channels: speech, facial expressions, head motion, body gestures, etc. Therefore, a joint analysis of these channels in a multimodal system is hypothesized to yield better and more robust performance in automatic emotion recognition.
There are several approaches for human emotion recognition, most of which focus on the visual and audio modalities. For the audio modality, the state of the art is usually based on prosodic features along with spectral, cepstral, and voice quality features [4-6].
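As an illustration of this type of audio front end, the sketch below extracts cepstral (MFCC) and prosodic (pitch, energy) descriptors with librosa and pools them into a fixed-length utterance vector; the parameter values and feature choices are illustrative assumptions, not the exact feature set of any cited work.

```python
import numpy as np
import librosa

def audio_features(wav_path, sr=16000):
    """Cepstral + prosodic descriptors pooled over one utterance
    (an illustrative sketch, not the feature set of [4-6])."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Cepstral: 13 MFCCs per analysis frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Prosodic: fundamental frequency (pYIN) and short-time energy (RMS).
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]

    # Pool frame-level values into one fixed-length vector per utterance.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [np.nanmean(f0), np.nanstd(f0)],  # nan-aware: unvoiced frames are NaN
        [rms.mean(), rms.std()],
    ])
```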
On the other hand, the visual modality is the most widely used channel, for which state-of-the-art methods usually rely on 2D facial features. These features can be broadly grouped into geometric features and appearance-based features. Geometric features localize salient facial points and detect the emotion based on the deformation of these points, whereas appearance-based features represent changes in the texture of the expressive face [2, 3, 7, 8].
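For concreteness, a minimal appearance-based descriptor in the Gabor family might look as follows; the filter-bank parameters (two wavelengths, eight orientations, a 64x64 face crop, mean-pooled magnitudes) are assumptions made for the example rather than the setup of any cited method.

```python
import cv2
import numpy as np

def gabor_features(face_gray, wavelengths=(4, 8), orientations=8):
    """Appearance-based descriptor: Gabor filter magnitudes pooled over
    a grayscale face crop (illustrative parameters)."""
    face = cv2.resize(face_gray, (64, 64)).astype(np.float32)
    feats = []
    for lambd in wavelengths:              # wavelength of the sinusoid
        for k in range(orientations):      # evenly spaced orientations
            theta = k * np.pi / orientations
            kernel = cv2.getGaborKernel((31, 31), sigma=lambd / 2.0,
                                        theta=theta, lambd=lambd,
                                        gamma=0.5, psi=0)
            response = cv2.filter2D(face, cv2.CV_32F, kernel)
            feats.append(np.abs(response).mean())  # pooled magnitude
    return np.array(feats)
```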
Since human emotion is conveyed through both audio and visual information, a multimodal approach is more reasonable for an emotion recognition system. Although research on the audio and visual channels has progressed considerably in recent years, the integration of these channels is still an open research problem [9].
Recent studies have shown many advantages of fusing the audio and video channels for emotion recognition [10-15]. Mansurizadeh et al. [10] propose an asynchronous feature-level fusion approach that combines feature- and decision-level fusion. It builds on the observation that cues from the facial image sequence and the audio track of an audio-visual recording are not temporally aligned; audio and video features related to the same emotional event are more likely to overlap in time and should therefore be fused together.
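To make the distinction concrete: feature-level fusion concatenates the modality feature vectors before classification, while decision-level fusion combines the outputs of the unimodal classifiers. One common decision-level rule is a weighted sum of per-class posteriors, sketched below; the weight value is a placeholder and the exact combination rule of the proposed framework is not reproduced here.

```python
import numpy as np

def fuse_decisions(p_audio, p_video, w_audio=0.5):
    """Decision-level fusion: weighted sum of per-class posteriors from
    the audio and video classifiers (w_audio is a placeholder weight)."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return int(np.argmax(fused)), fused

# Example with six basic emotion classes: audio leans to class 2,
# video to class 1; the fused decision arbitrates between them.
label, scores = fuse_decisions([.1, .2, .4, .1, .1, .1],
                               [.1, .5, .2, .1, .05, .05])
```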
Gajsek et al. [11] present an audio-visual emotion recognition system that uses prosodic and cepstral coefficients as audio features and Gabor wavelets as video features, followed by feature selection with a stepwise method. They use a multi-class classifier system to combine the outputs of the different classifiers. Datcu et al. [12] present a multimodal semantic data fusion model. Their method uses two types of geometric facial features, depending on the presence or absence of speech, and removes the influence of speech on the face shape by using only eye- and eyebrow-related features.
Yongjin et al. [13] propose an emotion recognition method that selects a single peak frame from each audio-visual sequence. The peak frames are selected as the frames with the highest speech amplitude. The visual features are extracted from these peak frames using Gabor wavelets, while the audio features are based on prosodic and cepstral coefficients. The audio and video modalities are fused at the feature level.
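In that spirit, selecting the peak frame as the one aligned with the loudest speech segment can be sketched as follows; the 25 fps video rate and 40 ms analysis window are assumed defaults for the example, not values taken from [13].

```python
import numpy as np
import librosa

def peak_frame_index(wav_path, fps=25.0, win=0.04):
    """Pick the video frame aligned with the maximum short-time speech
    energy, in the spirit of [13]; fps and win are assumed defaults."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(sr * win)
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    t_peak = np.argmax(rms) * win      # time (s) of the loudest window
    return int(round(t_peak * fps))    # corresponding video frame index
```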