Face Expression Recognition by Cross Modal Data Association

Ashish Tawari and Mohan Manubhai Trivedi, Fellow, IEEE

Abstract—We present a novel facial expression recognition framework using audio-visual information analysis. We propose to model the cross-modality data correlation while allowing the modalities to be treated as asynchronous streams. We also show that our framework can improve recognition performance while significantly reducing the computational cost by avoiding redundant or insignificant frame processing through the use of auditory information. In particular, we design a single representative image of an image sequence as a weighted sum of registered face images, where the weights are derived from auditory features. We use a still-image-based technique for the expression recognition task; our framework, however, can be generalized to work with dynamic features as well. We performed experiments using the eNTERFACE'05 audio-visual emotional database containing six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger and Disgust. We present one-to-one binary classification as well as multi-class classification performance, evaluated using both subject-dependent and subject-independent strategies. Furthermore, we compare multi-class classification accuracies with those of previously published studies that use the same database. Our analyses show promising results.

Index Terms—Facial expression recognition, audio-visual expression recognition, key frame selection, multi-modal expression recognition, emotion recognition, affective computing, affect analysis.

Manuscript received March 05, 2012; revised September 17, 2012; accepted November 22, 2012. Date of publication June 06, 2013; date of current version October 11, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. K. Selcuk Candan. The authors are with the Computer Vision and Robotics Research Laboratory, University of California, San Diego, La Jolla, CA 92093 USA (e-mail: atawari@ucsd.edu; mtrivedi@soe.ucsd.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2013.2266635

I. INTRODUCTION AND MOTIVATION

Affective state plays a fundamental role in human interactions, influencing cognition, perception and even rational decision making. This fact has inspired the research field of "affective computing", which aims at enabling computers to recognize, interpret and simulate affects [1]. Such systems can contribute to human-computer communication and to applications such as learning environments, entertainment, customer service, computer games, security/surveillance and educational software, as well as safety-critical applications such as driver monitoring [2], [3]. To make human-computer interaction (HCI) more natural and friendly, it would be beneficial to give computers the ability to recognize affects the same way a human does. Since speech and vision are the primary senses for human expression and perception, significant research effort has been focused on developing intelligent systems with audio and video interfaces [4].

Multimodal systems, specifically those with audio and visual modalities, have shown several interesting interactions between the two modalities.
For example, audio-visual speech recognition (AVSR), also known as automatic lipreading or speechreading [5], aims at improving automatic speech recognition by exploiting the visual modality of the speaker's mouth region. Not surprisingly, it has outperformed audio-only ASR systems, particularly in noisy conditions. Similarly, the well-known McGurk effect [6] demonstrates a perceptual interaction between hearing and vision in speech perception. Furthermore, Munhall et al. [7] suggest that rhythmic head movements are correlated with the pitch and amplitude of the speaker's voice and that visual information can improve speech intelligibility by 100% over that possible using auditory information alone.

In the field of affect recognition, there have been a number of efforts to exploit audio-visual information as well, and our framework can utilize these methods. However, the above examples, in which the visual modality improves an audio-only system, motivate us to ask the fundamental question of how the audio modality influences visual perception, in particular for the task of facial expression recognition. It is evident that speech generation influences facial expression. Moreover, for expression recognition the coupling between the two modalities is not as tight as in the audio-visual speech recognition task.

Towards this end, we present a novel facial expression recognition framework using bimodal information. Our framework explicitly models the cross-modality data correlation while allowing the modalities to be treated as asynchronous streams. To recognize the key emotion of an image sequence, the proposed framework seeks to summarize the emotion in a single image derived from the hundreds of frames contained in the video. We also show that the framework can improve recognition performance while significantly reducing the computational cost by avoiding redundant or insignificant frame processing using auditory information.
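To make the idea of an audio-weighted summary image concrete, the following is a minimal sketch, not the authors' exact formulation: the short-time audio energy used as the auditory feature, the normalization scheme and the function names are assumptions made purely for illustration. It combines registered face frames into one representative image whose per-frame weights come from the synchronized audio stream.

    import numpy as np

    def summary_image(registered_faces, frame_weights):
        """Combine registered face frames into one representative image.

        registered_faces: array of shape (T, H, W), already aligned/registered.
        frame_weights:    array of shape (T,), derived from the audio stream.
        """
        w = np.asarray(frame_weights, dtype=np.float64)
        w = w / (w.sum() + 1e-12)                 # normalize weights to sum to 1
        # Weighted sum over the time axis -> single (H, W) image.
        return np.tensordot(w, registered_faces, axes=(0, 0))

    def audio_energy_weights(audio, n_frames):
        """Toy auditory weighting: short-time energy per video frame.

        This is an assumed stand-in for the paper's auditory features; it simply
        gives more weight to frames that co-occur with louder speech segments.
        audio: 1-D numpy array of the waveform synchronized with the video.
        """
        samples_per_frame = len(audio) // n_frames
        energies = [
            np.sum(audio[i * samples_per_frame:(i + 1) * samples_per_frame] ** 2)
            for i in range(n_frames)
        ]
        return np.asarray(energies)

    # Usage (hypothetical): faces is (T, H, W) registered face crops, audio the
    # synchronized waveform; the resulting single image would then be fed to a
    # still-image expression classifier.
    # weights = audio_energy_weights(audio, faces.shape[0])
    # rep_img = summary_image(faces, weights)

Under such a scheme, frames judged uninformative by the auditory cue receive negligible weight, which is one way the per-frame visual processing cost can be reduced.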