Contrasting and Combining Least Squares Based Learners for Emotion Recognition in the Wild

Heysem Kaya* (Department of Computer Engineering, Boğaziçi University, 34342 İstanbul, Turkey; heysem@boun.edu.tr), Furkan Gürpınar (Department of Computational Science and Engineering, Boğaziçi University, 34342 İstanbul, Turkey; gurpinarfurkan@gmail.com), Sadaf Afshar (Department of Computational Science and Engineering, Boğaziçi University, 34342 İstanbul, Turkey; sa.afshar.sa@gmail.com), Albert Ali Salah (Department of Computer Engineering, Boğaziçi University, 34342 İstanbul, Turkey; salah@boun.edu.tr)

ABSTRACT

This paper presents our contribution to the ACM ICMI 2015 Emotion Recognition in the Wild Challenge (EmotiW 2015). We participate in both the static facial expression (SFEW) and the audio-visual emotion recognition challenges. In both challenges, we use a set of visual descriptors together with their early and late fusion schemes. For AFEW, we additionally exploit a set of popular spatio-temporal modeling alternatives and carry out multi-modal fusion. For classification, we employ two least squares regression based learners that were shown to be fast and accurate on earlier EmotiW Challenge corpora: Partial Least Squares Regression (PLS) and Kernel Extreme Learning Machines (ELM), the latter being closely related to Kernel Regularized Least Squares. We use a Generalized Procrustes Analysis (GPA) based alignment for face registration. By employing different alignments, descriptor types, video modeling strategies and classifiers, we diversify the learners to improve the final fusion performance. The test set accuracies reached in both challenges are, in relative terms, 25% above the respective baselines.
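The relation between Kernel ELM and Kernel Regularized Least Squares mentioned in the abstract amounts to the same closed-form solve against a kernel matrix. The following is a minimal illustrative sketch, not the paper's implementation: the RBF kernel, the regularization constant C, and one-hot target coding are our assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF (Gaussian) kernel between the rows of A and B."""
    d2 = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kelm_train(X, y, C=10.0, gamma=1.0):
    """Kernel ELM / kernel regularized least squares training.

    Solves (I/C + K) beta = T, where K is the training kernel matrix
    and T holds one-hot class targets.
    """
    T = np.eye(int(y.max()) + 1)[y]                 # one-hot targets
    K = rbf_kernel(X, X, gamma)
    beta = np.linalg.solve(np.eye(len(X)) / C + K, T)
    return beta

def kelm_predict(X_train, beta, X_test, gamma=1.0):
    """Scores are kernel evaluations against training data times beta."""
    scores = rbf_kernel(X_test, X_train, gamma) @ beta
    return scores.argmax(axis=1)
```

A single linear solve replaces iterative training, which is what makes this family of learners fast on challenge-sized corpora.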
Categories and Subject Descriptors
I.5.4 [Computing Methodologies]: Pattern Recognition - Signal processing; I.4.7 [Image Processing and Computer Vision]: Feature Measurement; I.4.8 [Image Processing and Computer Vision]: Scene Analysis

*Corresponding author

ICMI '15, November 09-13, 2015, Seattle, WA, USA. Copyright is held by the authors; publication rights licensed to ACM. ACM 978-1-4503-3983-4/15/11. DOI: http://dx.doi.org/10.1145/2823327.2823334

General Terms
Human-Computer Interaction

Keywords
audio-visual emotion corpus, audio-visual fusion, feature extraction, emotion recognition in the wild, SFEW, AFEW

1. INTRODUCTION

Audio and video based emotion recognition in the wild is challenging because of noise, large idiosyncratic variance and sensor-related differences. Fixed-protocol challenges in this field provide a unique opportunity to push forward the state of the art and to compare many approaches under very similar conditions. The Emotion Recognition in the Wild (EmotiW) challenge provides out-of-laboratory data, Acted Facial Expressions in the Wild (AFEW), collected from videos that mimic real life [4, 3, 5]. In 2015, the EmotiW campaign introduced a static facial expression challenge based on in-the-wild images collected from videos [6].
In this paper we propose several systems based on combinations of learners for both static and video-based emotion recognition, and report results with the standard challenge protocols. Our contributions to the EmotiW Challenge are manifold: i) we employ a Generalized Procrustes Analysis (GPA) based alignment method for improved face registration; ii) we extract and combine a set of visual descriptors, such as Scale Invariant Feature Transform (SIFT) [17], Histogram of Oriented Gradients (HOG) [2], Local Phase Quantization (LPQ) [11, 13], Local Binary Patterns (LBP) [18] and its Gabor extension (LGBP), as well as hand-crafted geometric features computed from the fitted landmarks of GPA alignment; iii) we use the popular Three Orthogonal Planes (TOP), summarizing functionals (FUN) and Fisher Vector (FV) encoding [19] on low level descriptors for video modeling; iv) we contrast feature- and score-level fusion strategies using PLS and Kernel ELM classifiers. The remainder of this paper is organized as follows. In the next section we provide background on the signal pro-
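Contribution i) rests on Generalized Procrustes Analysis, which iteratively aligns each landmark set to an evolving mean shape with a similarity transform. A minimal sketch under our own assumptions (2D landmarks, a fixed iteration count, an SVD-based orthogonal Procrustes solver); the paper's actual registration pipeline may differ in its details:

```python
import numpy as np

def procrustes_align(src, ref):
    """Similarity-transform (scale, rotation, translation) src onto ref.

    Both are (n_landmarks, 2) arrays; the optimal rotation comes from
    the SVD of the cross-covariance (orthogonal Procrustes problem).
    """
    mu_s, mu_r = src.mean(axis=0), ref.mean(axis=0)
    A, B = src - mu_s, ref - mu_r
    U, S, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # forbid reflections
        U[:, -1] *= -1
        R = U @ Vt
    s = S.sum() / (A * A).sum()       # optimal isotropic scale
    return s * (A @ R) + mu_r

def gpa(shapes, n_iter=10):
    """Generalized Procrustes Analysis: align all shapes to an evolving mean."""
    mean = shapes[0].copy()
    for _ in range(n_iter):
        aligned = [procrustes_align(s, mean) for s in shapes]
        mean = np.mean(aligned, axis=0)
    return aligned, mean
```

Normalizing faces this way removes pose and scale nuisance variation before the appearance descriptors of contribution ii) are extracted.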