Multimodal Fusion of Audio, Scene, and Face Features for First Impression Estimation

Furkan Gürpınar, Program of Computational Science and Engineering, Boğaziçi University, Bebek, Istanbul, Turkey. Email: furkan.gurpinar@boun.edu.tr
Heysem Kaya, Department of Computer Engineering, Namık Kemal University, Çorlu, Tekirdağ, Turkey. Email: hkaya@nku.edu.tr
Albert Ali Salah, Department of Computer Engineering, Boğaziçi University, Bebek, Istanbul, Turkey. Email: salah@boun.edu.tr

Abstract—Affective computing, particularly emotion and personality trait recognition, is of increasing interest in many research disciplines. The interplay of emotion and personality shows itself in the first impression left on other people. Moreover, ambient information, e.g. the environment and objects surrounding the subject, also affects these impressions. In this work, we employ pre-trained Deep Convolutional Neural Networks to extract facial emotion and ambient information from images for predicting apparent personality. We also investigate the Local Gabor Binary Patterns from Three Orthogonal Planes video descriptor and acoustic features extracted via the popularly used openSMILE tool. We subsequently propose classifying these features using a Kernel Extreme Learning Machine and fusing their predictions. The proposed system is applied to the ChaLearn Challenge on First Impression Recognition, achieving the winning test set accuracy of 0.913, averaged over the "Big Five" personality traits.

I. INTRODUCTION AND RELATED WORK

Automatic prediction of apparent personality is an interesting and challenging topic for researchers from a range of backgrounds. Machines that are able to recognize apparent personality traits can be useful in many applications such as computer-assisted tutoring systems, forensics, and user recommendation systems. The complexity of personality formation also makes it hard to recognize automatically [1], [2].
One way to handle this issue is to work on impressions (apparent personality) instead of the personality itself [3]. In this work, we tackle the problem of predicting apparent personality using the data and protocol of the ChaLearn Looking at People 2016 First Impression Challenge [3]. Our aim is to benefit from the influence of emotional facial expressions, as well as ambient cues, on first impressions.

Apparent personality is assessed along the "Big Five" personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN). The formation of a personality impression affects decisions in humans, and it is an interesting question whether computers can automatically estimate how a certain person is perceived by others. There are several recent approaches for recognizing apparent personality traits from different modalities such as audio [4], [5], text [6], [7], [8], and visual information [9], [10]. To increase the robustness of predictions, multimodal systems have also been investigated [11], [12], [13], [14], [15]. In our previous work, we have shown that for first impression prediction, deep learning approaches for face processing can be fused profitably with features that describe the scene, i.e. the context of the perceived image [15]. For classification, Support Vector Machine (SVM) approaches are widely used [5], [12], [14], but we have used Extreme Learning Machines (ELM) [16], which achieved good results with rapid classification of new samples [15]. Similar approaches have been used on related tasks such as facial age estimation [17] and emotion recognition [18], with good results.

All three winners of the first round of this challenge extensively used deep learning in their bimodal systems, while the overall approach and the type of network differed [19], [20], [21].
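As background on the Kernel Extreme Learning Machine classifier mentioned above, the sketch below shows kernel ELM regression with its closed-form output weights, beta = (I/C + K)^{-1} T, where K is the training kernel matrix, C a regularization constant, and T the target trait scores. The RBF kernel, hyperparameter values, and data dimensions are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel between rows of X and rows of Y.
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kelm_train(X, T, C=10.0, gamma=1.0):
    # Closed-form output weights: beta = (I/C + K)^-1 T
    K = rbf_kernel(X, X, gamma)
    n = K.shape[0]
    return np.linalg.solve(np.eye(n) / C + K, T)

def kelm_predict(X_train, beta, X_new, gamma=1.0):
    # Prediction is the kernel vector to the training set times beta.
    return rbf_kernel(X_new, X_train, gamma) @ beta

# Toy usage with illustrative shapes: 50 samples, 8-dim features,
# 5 regression targets (one per Big Five trait, annotated in [0, 1]).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
T = rng.uniform(size=(50, 5))
beta = kelm_train(X, T)
preds = kelm_predict(X, beta, X)
```

The single linear solve is what makes training fast compared to iterative optimization, which is the speed advantage of ELM referred to above.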
In [19], a single hidden layer neural network (NN) was used for regression on the audio modality, while deep convolutional neural networks (DCNN) were used for video representation and regression. The first runner up [20] chose a recurrent DCNN for both modalities. The learning system of the second runner up [21] was based on Residual Networks. The winner [19] and the second runner up [21] did not employ face alignment in the preprocessing step, but both works applied late fusion of modality-based scores. For facial feature extraction, the winning system of Zhang et al. [19] used the VGG-Face pre-trained DCNN model [22], which we also employ here and in our submission to the first round of the challenge [15].

Given the success of deep learning and the speed of ELM, we propose to fuse ELM models trained on audio, deep face, and scene features. Our contribution to the first round of this challenge proposed combining emotion-related and ambient features that are efficiently extracted from pre-trained/fine-tuned DCNN models [15]. Here, we further improve this system by investigating i) other visual descriptors; ii) the audio modality; and iii) a weighted score-level fusion strategy. Our method is illustrated in Figure 1.

The remainder of this paper is organized as follows. In the next section, we provide background and details on the methodology. Then, in Section III, we present the experimental results. Finally, Section IV concludes the paper with remarks on the proposed approach in context.
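The weighted score-level fusion in iii) can be sketched as a convex combination of per-modality trait predictions. The modality names, weights, and scores below are hypothetical placeholders for illustration, not values learned or reported in our experiments.

```python
import numpy as np

def fuse_scores(scores, weights):
    # scores: dict mapping modality name -> (n_samples, 5) trait predictions
    # weights: dict mapping modality name -> scalar fusion weight
    w_sum = sum(weights.values())
    fused = sum(weights[m] * scores[m] for m in scores) / w_sum
    # Apparent personality annotations lie in [0, 1], so clip the fusion.
    return np.clip(fused, 0.0, 1.0)

# Hypothetical per-modality predictions for one video clip (5 traits each).
face  = np.array([[0.6, 0.5, 0.7, 0.4, 0.5]])
scene = np.array([[0.5, 0.5, 0.6, 0.5, 0.4]])
audio = np.array([[0.4, 0.6, 0.5, 0.6, 0.5]])

# Hypothetical weights; in practice these would be tuned on a validation set.
fused = fuse_scores({"face": face, "scene": scene, "audio": audio},
                    {"face": 0.5, "scene": 0.2, "audio": 0.3})
# fused -> [[0.52, 0.53, 0.62, 0.48, 0.48]]
```

Because fusion operates only on modality scores, any regressor (ELM, SVM, NN) can supply the inputs, which is what allows the late-fusion designs of [19] and [21] and the present system to combine heterogeneous models.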