An Investigation of Annotation Delay Compensation and Output-Associative Fusion for Multimodal Continuous Emotion Prediction

Zhaocheng Huang, School of Electrical Eng. and Tele., The University of New South Wales and National ICT Australia (zhaocheng.huang@student.unsw.edu.au)
Ting Dang, School of Electrical Eng. and Tele., The University of New South Wales and National ICT Australia (ting.dang@student.unsw.edu.au)
Nicholas Cummins, School of Electrical Eng. and Tele., The University of New South Wales and National ICT Australia (n.p.cummins@unsw.edu.au)
Brian Stasak, School of Electrical Eng. and Tele., The University of New South Wales and National ICT Australia (b.stasak@student.unsw.edu.au)
Phu Le, School of Electrical Eng. and Tele., The University of New South Wales, Sydney NSW 2052, Australia (phule@unsw.edu.au)
Vidhyasaharan Sethu, School of Electrical Eng. and Tele., The University of New South Wales, Sydney NSW 2052, Australia (v.sethu@unsw.edu.au)
Julien Epps, School of Electrical Eng. and Tele., The University of New South Wales and National ICT Australia (j.epps@unsw.edu.au)

ABSTRACT
Continuous emotion dimension prediction has increased in popularity over the last few years, as the shift away from discrete classification-based tasks has introduced more realism into emotion modeling. However, many questions remain, including how best to combine information from several modalities (e.g. audio, video). As part of the AV+EC 2015 Challenge, we investigate annotation delay compensation and propose a range of multimodal systems based on an output-associative fusion framework. The performance of the proposed systems is significantly higher than the challenge baseline, with the strongest performing system yielding 66.7% and 53.9% relative increases in prediction accuracy over the AV+EC 2015 test set arousal and valence baselines respectively. Results also demonstrate the importance of annotation delay compensation for continuous emotion analysis. Of particular interest was the output-associative fusion framework, which performed very well in a number of significantly different configurations, highlighting that incorporating both affective dimensional dependencies and temporal information is a promising research direction for predicting emotion dimensions.

Categories and Subject Descriptors
G.3 [Mathematics of Computing]: Probability and Statistics – Correlation and regression analysis; Robust regression
I.5.4 [Computing Methodologies]: Pattern Recognition – Signal processing; Computer vision; Waveform analysis

General Terms
Algorithms, Performance, Design, Human Factors, Verification.

Keywords
Emotion Dimension Prediction, Support Vector Regression, Relevance Vector Machine, Output-Associative Fusion, Annotation Delay Compensation, Multimodal Fusion.

1. INTRODUCTION
Using behavioral signal processing techniques to model, analyze, detect or predict human emotions is an actively emerging area of research [1]. In recent years, there has been a shift away from extensive investigation into lab-based recognition of prototypical emotion categories (e.g. anger, fear) towards continuous prediction of emotional dimensions (e.g. arousal and valence) in more naturalistic communication. Affective dimensions are considered a more descriptive representation of subtle and complex emotions and emotion-related states [1, 2]. For continuous emotion prediction, a number of physiological and behavioral modalities have been investigated, such as audio [3], video [4], body language [5] and EEG [6].
Moreover, combining modalities can lead to further improvements [7, 8]. The 2015 Audio/Visual Emotion Challenge and Workshop (AV+EC 2015) provides an opportunity for advancing continuous emotion prediction by combining information gained from audio, video and physiological data [9]. AV+EC 2015 requires participants to continuously predict arousal and valence using multimedia signal processing and machine learning techniques. The primary aim of the analysis presented herein is to outperform the challenge baseline, as well as to provide novel insights into continuous emotion analysis. The investigations presented in this paper compare the performance of a range of multimodal prediction systems designed to capture relevant audio, video and physiological information. The experimental results demonstrate that significant gains in affect prediction performance can be obtained by compensating for the annotator delays introduced when forming the ground truth labels, and via the application of an output-associative regression framework for multimodal fusion.
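To make the delay compensation step concrete, the following is a minimal sketch of how ground truth labels can be realigned with the feature frames: the label recorded at time t + delay is paired with the feature frame at time t, and the unmatched ends of both sequences are discarded. The function name, the 25 Hz label rate and the 4 s delay are illustrative assumptions only; in practice the delay would be selected on the development set.

```python
import numpy as np

def compensate_delay(features, labels, delay_s, label_rate_hz=25.0):
    """Pair the feature frame at time t with the label recorded at
    t + delay, discarding the unmatched ends of both sequences."""
    shift = int(round(delay_s * label_rate_hz))
    if shift <= 0:
        return features, labels
    return features[:-shift], labels[shift:]

# Illustrative values only: a 4 s annotator delay at a 25 Hz label rate.
feats = np.random.randn(1000, 40)   # 1000 frames of 40-dim features
gold = np.random.randn(1000)        # frame-level gold-standard labels
X, y = compensate_delay(feats, gold, delay_s=4.0)
print(X.shape, y.shape)             # (900, 40) (900,)
```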
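Likewise, a minimal sketch of output-associative fusion under the common two-stage formulation: initial arousal and valence predictions are stacked over a sliding temporal window and fed to a second-stage regressor, so the final estimate of one dimension can exploit its dependency on the other as well as local temporal context. The window size, the SVR second stage and all variable names here are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import SVR

def oa_inputs(arousal_pred, valence_pred, window=5):
    """Stack initial arousal and valence predictions over a sliding
    temporal window to build second-stage (output-associative) inputs."""
    preds = np.column_stack([arousal_pred, valence_pred])
    pad = window // 2
    padded = np.pad(preds, ((pad, pad), (0, 0)), mode='edge')
    return np.array([padded[t:t + window].ravel()
                     for t in range(len(preds))])

# Toy first-stage predictions standing in for per-modality regressor
# outputs on the training and test partitions.
rng = np.random.default_rng(0)
a_tr, v_tr = rng.normal(size=200), rng.normal(size=200)
a_te, v_te = rng.normal(size=50), rng.normal(size=50)
gold_arousal_tr = 0.6 * a_tr + 0.2 * v_tr + 0.05 * rng.normal(size=200)

# The second-stage arousal regressor learns from initial predictions
# of *both* dimensions, exploiting their mutual dependency.
oa_model = SVR(kernel='rbf').fit(oa_inputs(a_tr, v_tr), gold_arousal_tr)
final_arousal = oa_model.predict(oa_inputs(a_te, v_te))
```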