IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 7, NOVEMBER 2013 1543
Face Expression Recognition
by Cross Modal Data Association
Ashish Tawari and Mohan Manubhai Trivedi, Fellow, IEEE
Abstract—We present a novel facial expression recognition
framework using audio-visual information analysis. We propose
to model the cross-modality data correlation while allowing
them to be treated as asynchronous streams. We also show that
our framework can improve the recognition performance while
significantly reducing the computational cost: auditory
information is used to avoid processing redundant or
insignificant frames. In particular, we construct a single
representative image for each image sequence as a weighted sum
of registered face images, where the weights are derived from
auditory features.
We use a still-image-based technique for the expression
recognition task; our framework, however, generalizes to
dynamic features as well. We performed experiments on the
eNTERFACE’05 audio-visual emotional database, which contains
six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger
and Disgust. We present one-to-one binary classification as well
as multi-class classification performance, evaluated using both
subject-dependent and subject-independent strategies. Furthermore,
we compare multi-class classification accuracies with those
reported in prior work using the same database. Our
analyses show promising results.
Index Terms—Facial expression recognition, audio-visual ex-
pression recognition, key frames selection, multi-modal expression
recognition, emotion recognition, affective computing, affect
analysis.
I. INTRODUCTION AND MOTIVATION
AFFECTIVE state plays a fundamental role in human
interactions, influencing cognition, perception and even
rational decision making. This fact has inspired the research
field of “affective computing,” which aims at enabling computers
to recognize, interpret and simulate affect [1]. Such
systems can contribute to human-computer communication and
to applications such as learning environments, entertainment,
customer service, computer games, security/surveillance, and
educational software, as well as to safety-critical applications
such as driver monitoring [2], [3]. To make human-computer
interaction (HCI) more natural and friendly, it would be beneficial
to give computers the ability to recognize affect the same way a
human does. Since speech and vision are the primary senses for
human expression and perception, significant research effort
has been focused on developing intelligent systems with audio
and video interfaces [4].
Manuscript received March 05, 2012; revised September 17, 2012; accepted
November 22, 2012. Date of publication June 06, 2013; date of current version
October 11, 2013. The associate editor coordinating the review of this manu-
script and approving it for publication was Prof. K. Selcuk Candan.
The authors are with the Computer Vision and Robotics Research Labora-
tory, University of California, San Diego, La Jolla, CA, 92093 USA (e-mail:
atawari@ucsd.edu; mtrivedi@soe.ucsd.edu).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2013.2266635
Multimodal systems, specifically those with audio and visual
modalities, have revealed several interesting interactions
between the two modalities. For example, audio-visual speech
recognition (AVSR), also known as automatic lipreading or
speechreading [5], aims at improving automatic speech
recognition by exploiting the visual modality of the speaker’s mouth
region. Not surprisingly, it outperforms audio-only ASR
systems, particularly in noisy conditions. Similarly, the
well-known McGurk effect [6] is a perceptual phenomenon that
demonstrates an interaction between hearing and vision in
speech perception. Furthermore, Munhall et al. [7] suggest
that rhythmic head movements are correlated with the pitch
and amplitude of the speaker’s voice, and that visual information
can improve speech intelligibility by 100% over that possible
using auditory information only.
In the field of affect recognition, there have been a number
of efforts to exploit audio-visual information as well, and our
framework can utilize these methods. However, the above
examples, in which the visual modality improves an audio-only
system, motivated us to ask the fundamental question of how the
audio modality influences visual perception, in particular for
the task of facial expression recognition. It is evident that
speech generation influences facial expression. At the same
time, for expression recognition the coupling between the two
modalities is not as tight as in the audio-visual speech
recognition task.
Towards this end, we present a novel facial expression recog-
nition framework using bimodal information. Our framework
explicitly models the cross-modality data correlation while al-
lowing them to be treated as asynchronous streams. To recog-
nize the key emotion of an image sequence, the proposed
framework summarizes the emotion with a single image derived
from the hundreds of frames contained in the video. We
also show that the framework can improve recognition
performance while significantly reducing the computational cost
by using auditory information to avoid processing redundant or
insignificant frames.
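As a minimal sketch of this idea (not the authors’ actual pipeline), suppose per-frame audio energy is used as a stand-in for the auditory features; the summary image is then a convex combination of the registered face frames, so frames coinciding with salient audio dominate the representation:

```python
import numpy as np

def summary_image(frames, audio_energy):
    """Collapse an aligned face-image sequence into one representative
    image via a weighted sum, with weights derived from per-frame audio
    energy (a simplified stand-in for the paper's auditory features)."""
    w = np.asarray(audio_energy, dtype=float)
    w = w / w.sum()                       # normalize weights to sum to 1
    stack = np.asarray(frames, dtype=float)
    # weighted sum over the time axis: (T, H, W) -> (H, W)
    return np.tensordot(w, stack, axes=1)

# Toy example: four registered 2x2 "face" frames with constant intensity.
frames = [np.full((2, 2), v) for v in (0.0, 1.0, 2.0, 3.0)]
energy = [0.0, 0.0, 1.0, 1.0]             # only the last two frames voiced
img = summary_image(frames, energy)       # silent frames contribute nothing
```

Frames with zero audio energy drop out of the sum entirely, which is how such a scheme can skip redundant or insignificant frames rather than processing every frame in the sequence.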
II. RELATED STUDIES
Our long-term goal is to study the cross-modal influence of
the audio-visual data streams on each other for the affect
recognition task. In this study, however, our focus is on face
expression recognition. Hence we first discuss some
representative works on facial expression recognition, and then turn
to existing audio-visual affect recognition approaches to
highlight the challenges that lie in integrating the two
modalities. For an overview of audio-only, visual-only
and audio-visual affect recognition, readers are encouraged to
study a recent survey by Zeng et al. [9].
1520-9210 © 2013 IEEE