Expression recognition in videos using a weighted component-based feature descriptor

Xiaohua Huang 1,2, Guoying Zhao 1, Matti Pietikäinen 1, Wenming Zheng 2
1. Machine Vision Group, Department of Electrical and Information Engineering, University of Oulu, Finland
2. Research Center for Learning Science, Southeast University, China
{huang.xiaohua,gyzhao,mkp}@ee.oulu.fi, wenming_zheng@seu.edu.cn
http://www.ee.oulu.fi/mvg

Abstract. In this paper, we propose a weighted component-based feature descriptor for expression recognition in video sequences. First, we extract texture features and structural shape features from three facial regions of each face image: the mouth, cheeks, and eyes. We then combine the extracted feature sets using a confidence-level strategy. Since different facial components contribute differently to expression recognition, we propose a method for automatically learning component weights via multiple kernel learning. Experimental results on the Extended Cohn-Kanade database show that our approach, which combines a component-based spatiotemporal feature descriptor with a weight-learning strategy, achieves better recognition performance than state-of-the-art methods.

Keywords: Spatiotemporal features, LBP-TOP, EdgeMap, Information fusion, Multiple kernel learning, Facial expression recognition.

1 Introduction

A goal of automatic facial expression analysis is to determine the emotional state of human beings, e.g., happiness, sadness, surprise, neutral, anger, fear, and disgust, from facial images, regardless of the identity of the face. To date, several surveys have described the state-of-the-art techniques for facial expression recognition based on static images or video sequences [1,2]. These surveys show that dynamic features from video sequences can provide more accurate and robust information than static features from images.
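The weighting idea outlined in the abstract, i.e., computing one kernel per facial component and learning how much each component should contribute, can be illustrated with a toy sketch. Note that the data, the RBF kernel, and the simplex grid search used here as a crude stand-in for multiple kernel learning are all illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy stand-ins for per-component descriptors (eyes, cheeks, mouth);
# sizes and class structure are illustrative only.
n = 60
y = np.repeat(np.arange(3), n // 3)          # three expression classes
feats = [rng.normal(size=(n, 16)) + 0.8 * y[:, None] for _ in range(3)]

def rbf(X, gamma=0.1):
    """RBF kernel matrix for one component's feature set."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

kernels = [rbf(X) for X in feats]            # one kernel per component

def loo_accuracy(K):
    """Leave-one-out 1-NN accuracy in the kernel-induced distance."""
    D = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    np.fill_diagonal(D, np.inf)              # exclude each sample itself
    return (y[D.argmin(axis=1)] == y).mean()

# Crude stand-in for MKL: grid-search convex weights on the simplex
# and keep the combination with the best leave-one-out accuracy.
best_w, best_acc = None, -1.0
for w in product(np.linspace(0, 1, 6), repeat=3):
    if abs(sum(w) - 1.0) > 1e-9:
        continue
    acc = loo_accuracy(sum(wi * Ki for wi, Ki in zip(w, kernels)))
    if acc > best_acc:
        best_w, best_acc = w, acc

print("component weights:", best_w, "LOO accuracy:", round(best_acc, 3))
```

Proper MKL solves for the kernel weights jointly with the SVM objective rather than by exhaustive search, but the output has the same shape: a convex weight per component, larger for regions that carry more discriminative information.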
Feature representation is very important for automatic facial expression analysis. Methods combining geometric and appearance features have been considered earlier [2]. For example, Tian et al. [3] proposed using facial component shapes together with transient features such as crow's-feet wrinkles and nasolabial furrows. A framework combining facial appearance (scale-invariant feature transform) and shape information (pyramid histogram of oriented gradients) was proposed for facial expression recognition [6]. Both the similarity-normalized shape (SPTS) and the canonical appearance (CAPP) were derived from active appearance models (AAM) to interpret face images [11]. It