Spontaneous Facial Expression Recognition: A Part Based Approach

Nazil Perveen, Dinesh Singh and C. Krishna Mohan
Visual Intelligence and Learning Group (VIGIL), Department of Computer Science and Engineering,
Indian Institute of Technology Hyderabad, Kandi, Sangareddy-502285, India.
email: {cs14resch11006, cs14resch11003, ckm}@iith.ac.in

Abstract—A part-based approach for spontaneous expression recognition using audio-visual features and a deep convolution neural network (DCNN) is proposed. The ability of the convolution neural network to handle variations in translation and scale is exploited for extracting visual features. The sub-regions, namely, the eye and mouth parts extracted from the video faces, are given as input to the DCNN in order to extract convnet features. The audio features, namely, voice report, voice intensity, and other prosodic features, are used to obtain complementary information useful for classification. The confidence scores of the classifiers trained on the different facial parts and the audio information are combined using different fusion rules for recognizing expressions. The effectiveness of the proposed approach is demonstrated on the acted facial expressions in the wild (AFEW) dataset.

Keywords—Isotropic smoothing, expression recognition, and convolution neural network.

I. INTRODUCTION

Emotion reflects the mental status of the human mind. Mehrabian [1] indicated that the verbal part (i.e., spoken words) of a message contributes only 7% of the effect of any message; the vocal part (i.e., voice information) contributes 38%, while facial expression contributes 55% of the effect of any message. Therefore, facial expression plays an important role in the recognition of human emotions, such as angry, disgust, fear, happy, neutral, sad, and surprise.
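The score-level fusion mentioned above, where confidence scores from the eye-part, mouth-part, and audio classifiers are combined under different fusion rules, can be sketched as follows. The particular rules (sum, product, max) and the example scores are illustrative assumptions, not the exact configuration of the proposed system.

```python
import numpy as np

def fuse_scores(score_lists, rule="sum"):
    """Combine per-classifier confidence scores (one array of class
    scores per modality, e.g. eye part, mouth part, audio) and return
    the index of the predicted expression class."""
    scores = np.vstack(score_lists)      # shape: (n_classifiers, n_classes)
    if rule == "sum":
        fused = scores.sum(axis=0)
    elif rule == "product":
        fused = scores.prod(axis=0)
    elif rule == "max":
        fused = scores.max(axis=0)
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return int(np.argmax(fused))

# Hypothetical class scores over 7 expressions from three classifiers.
eye   = np.array([0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10])
mouth = np.array([0.05, 0.05, 0.10, 0.60, 0.05, 0.05, 0.10])
audio = np.array([0.20, 0.10, 0.10, 0.30, 0.10, 0.10, 0.10])
print(fuse_scores([eye, mouth, audio], rule="sum"))  # prints 3
```

Here all three rules agree on class index 3, but on real data the rules can disagree, which is why comparing fusion rules is worthwhile.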
Expressions recognized in an unconstrained environment are termed spontaneous expressions, and their recognition is a very difficult task due to various real-world issues such as illumination, posed faces, scaling, and occlusion. Handling these issues while maintaining reasonable classification accuracy is one of the biggest challenges today. Being an active research area, spontaneous expression recognition has immense applications. It can be used to make smart devices smarter using emotional intelligence [2], to perform surveys on products and services, and in engagement systems, mood recognition, psychology, real-time gaming, animated movies, etc. [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. Spontaneous expression recognition uses data science technologies such as machine learning, artificial intelligence, big data, and bio-sensors to recognize expressions. Expression analysts and data scientists are trying to synchronize stimuli to expressions, for example by detecting micro-expressions, to enhance the recognition rate of primary emotions [14].

In 1978, Paul Ekman defined the human facial expressions that can be classified into seven basic classes, namely, angry, disgust, fear, happy, neutral, sad, and surprise, also known as the universal expressions [15]. Several exhaustive research works have been carried out in the literature on the automatic recognition of expressions in static images with high recognition rates. Recent advances in expression recognition from 2013 to 2015 have changed the perception of the recognition system. In 2014, vision- and attention-theory-based sampling for continuous facial expression recognition by Bhanu et al. [16] modeled the way in which humans visualize expressions. In their approach, the dataset is divided into two categories based on the frame rate, namely, low and high frame rate. In the former, the person is idle and expresses no emotion, while in the latter, the person changes their expression frequently.
The basic contribution of Bhanu et al. is a video-based temporal sampling in which an appearance-based methodology is described for feature extraction, and the extracted features are then classified using a support vector machine classifier. The recognition rate is 75% on the standard AVEC 2011/2012, CK & CK+, and MMI datasets.

An automatic framework for textured 3-D video-based facial expression recognition by Hayat and Bennamoun [17] hypothesizes a texture-based dynamic approach for recognizing expressions. Initially, small patches are extracted from the sample videos, and these patches are represented as points such that each point lies on a Grassmannian manifold; using Grassmannian kernelization, clusters are formed by a graph-based spectral clustering mechanism. All cluster centers are embedded with each other to form a reproducing kernel Hilbert space in which a support vector machine (SVM) for each expression is learned. The recognition accuracy is 93%-94% on BU4DFE (Binghamton University 3-D facial database).

A different approach, 4-D facial expression recognition by learning geometric deformations, proposed by Ben Amor et al. [18] in 2014, represents the face as a combination of radial curves lying on a Riemannian manifold and measures the deformation induced by each facial expression. The features obtained are of very high dimension, and hence a linear discriminant analysis (LDA) transformation is applied to project them into a low dimension. Two approaches are implemented for classification: one is a temporal or dynamic HMM, and the other is mean deformation patches applied to random forest classification. The recognition rate is 93% on average across different datasets, namely, the BU4-DFE, Bosphorus, D3DFACS, and Hi4D-ADSIP datasets.

Earlier, the topic of spontaneous expression recognition, i.e., expression recognition in an unconstrained environment, was not the focus of the literature. J. F.
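The LDA projection step described above, reducing very high-dimensional expression features to a low-dimensional discriminative space, can be illustrated with a small sketch. The synthetic data, the dimensions, and the use of scikit-learn are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of LDA-based dimensionality reduction: project
# high-dimensional features down to at most (n_classes - 1) dimensions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, dim = 7, 1000                  # 7 expressions, high-dimensional features
X = rng.normal(size=(140, dim))           # 140 hypothetical feature vectors
y = np.repeat(np.arange(n_classes), 20)   # 20 samples per expression class
X[np.arange(140), y] += 5.0               # inject class-dependent structure

lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
X_low = lda.fit_transform(X, y)
print(X_low.shape)                        # prints (140, 6)
```

LDA can produce at most n_classes - 1 components, so seven expression classes yield a six-dimensional projection regardless of the input dimensionality.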
Cohn et al. introduce sponta-

2016 15th IEEE International Conference on Machine Learning and Applications  978-1-5090-6167-9/16 $31.00 © 2016 IEEE  DOI 10.1109/ICMLA.2016.162  819