An Investigation of Annotation Delay Compensation and
Output-Associative Fusion for Multimodal Continuous
Emotion Prediction
Zhaocheng Huang
School of Electrical Eng. and Tele.
The University of New South Wales
and National ICT Australia
zhaocheng.huang@student.unsw.edu.au
Ting Dang
School of Electrical Eng. and Tele.
The University of New South Wales
and National ICT Australia
ting.dang@student.unsw.edu.au
Nicholas Cummins
School of Electrical Eng. and Tele.
The University of New South Wales
and National ICT Australia
n.p.cummins@unsw.edu.au
Brian Stasak
School of Electrical Eng. and Tele.
The University of New South Wales
and National ICT Australia
b.stasak@student.unsw.edu.au
Phu Le
School of Electrical Eng. and Tele.
The University of New South Wales
Sydney NSW 2052 Australia
phule@unsw.edu.au
Vidhyasaharan Sethu
School of Electrical Eng. and Tele.
The University of New South Wales
Sydney NSW 2052 Australia
v.sethu@unsw.edu.au
Julien Epps
School of Electrical Eng. and Tele.
The University of New South Wales
and National ICT Australia
j.epps@unsw.edu.au
ABSTRACT
Continuous emotion dimension prediction has increased in
popularity over the last few years, as the shift away from
discrete classification-based tasks has introduced more realism
into emotion modeling. However, many questions remain,
including how best to combine information from several
modalities (e.g. audio, video, etc.). As part of the AV+EC 2015
Challenge, we investigate annotation delay compensation and
propose a range of multimodal systems based on an output-
associative fusion framework. The proposed systems perform
significantly better than the challenge baseline,
with the strongest performing system yielding 66.7% and 53.9%
relative increases in prediction accuracy over the AV+EC 2015
test set arousal and valence baselines, respectively. Results also
demonstrate the importance of annotation delay compensation
for continuous emotion analysis. Of particular interest was the
output-associative fusion framework, which performed
very well across a number of substantially different configurations,
highlighting that incorporating both cross-dimensional affective
dependencies and temporal information is a promising research
direction for predicting emotion dimensions.
Categories and Subject Descriptors
G.3 [Mathematics of Computing]: Probability and Statistics –
Correlation and regression analysis; Robust regression
I.5.4 [Computing Methodologies]: Pattern Recognition –
Signal processing; Computer vision; Waveform analysis
General Terms
Algorithms, Performance, Design, Human Factors, Verification.
Keywords
Emotion Dimension Prediction, Support Vector Regression,
Relevance Vector Machine, Output-Associative Fusion,
Annotation Delay Compensation, Multimodal Fusion.
1. INTRODUCTION
Using behavioral signal processing techniques to model,
analyze, detect or predict human emotions is an actively
emerging area of research [1]. In recent years, there has been a
shift away from extensive investigation into lab-based
recognition of prototypical emotion categories (e.g. anger, fear,
etc.) towards continuous prediction of emotional dimensions
(e.g. arousal and valence) in more naturalistic communication.
Affective dimensions are considered a more descriptive
representation of subtle and complex emotions and emotion-
related states [1, 2]. For continuous emotion prediction, a number
of physiological and behavioral modalities have been
investigated, such as audio [3], video [4], body language [5]
and EEG [6]. Furthermore, combining these modalities can lead
to further improvements [7, 8].
The 2015 Audio/Visual Emotion Challenge and Workshop
(AV+EC 2015) provides an opportunity for advancing
continuous emotion prediction by combining information gained
from audio, video and physiological data [9]. AV+EC 2015
requires participants to continuously predict arousal and valence
by utilizing multimedia signal processing and machine learning
techniques. The primary aim of the analysis presented herein is
to outperform the challenge baseline, as well as to
provide novel insights into continuous emotion analysis.
The investigations presented within this paper compare the
performance of a range of multimodal prediction systems
designed to capture relevant audio, video and physiological
information. The experimental results demonstrate that
significant gains in affect prediction performance can be achieved
by compensating for the annotator delay introduced when forming
the ground truth labels, and via the application of an output-
associative regression framework for multimodal fusion.
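To make the first idea concrete: raters react to the audiovisual stimulus with a lag, so the gold-standard trace trails the features that elicited it, and compensation amounts to shifting the labels back in time before training. Below is a minimal Python sketch of one such scheme, which selects the shift maximizing the concordance correlation coefficient (CCC, the AV+EC 2015 challenge metric) on a development partition. The variables train_X, train_y, dev_X and dev_y, the SVR first stage and the candidate delay grid are illustrative assumptions, not the exact configuration used in this work.

import numpy as np
from sklearn.svm import SVR

def ccc(x, y):
    # Concordance correlation coefficient, the AV+EC 2015 metric.
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)

def compensate_delay(features, labels, delay):
    # Pair each label with the features `delay` frames earlier, i.e.
    # shift the annotation trace back in time, then trim both
    # sequences to equal length.
    if delay == 0:
        return features, labels
    return features[:-delay], labels[delay:]

def train_and_predict(X_tr, y_tr, X_dv):
    # Stand-in first-stage regressor; any regressor would do here.
    return SVR(kernel="linear").fit(X_tr, y_tr).predict(X_dv)

best_delay, best_score = 0, -1.0
for delay in range(0, 201, 25):   # candidate delays in frames (assumed grid)
    X_tr, y_tr = compensate_delay(train_X, train_y, delay)
    X_dv, y_dv = compensate_delay(dev_X, dev_y, delay)
    score = ccc(train_and_predict(X_tr, y_tr, X_dv), y_dv)
    if score > best_score:
        best_delay, best_score = delay, score

The output-associative idea can be sketched in the same spirit: a second-stage regressor receives, for every frame, a temporal window of the initial arousal and valence predictions, so the final estimate of each dimension can exploit both cross-dimensional dependencies and temporal context. In the sketch below, a0_train/v0_train and a0_dev/v0_dev stand for first-stage arousal and valence predictions on each partition, and the second-stage SVR and window size are again assumptions for illustration only.

def oa_features(arousal0, valence0, half_win):
    # Stack a +/-half_win frame window of both dimensions' initial
    # predictions as the second-stage input for each frame.
    a = np.pad(arousal0, half_win, mode="edge")
    v = np.pad(valence0, half_win, mode="edge")
    w = 2 * half_win + 1
    return np.stack([np.concatenate((a[t:t + w], v[t:t + w]))
                     for t in range(len(arousal0))])

oa_model = SVR(kernel="linear")
oa_model.fit(oa_features(a0_train, v0_train, half_win=20), train_y)
final_arousal = oa_model.predict(oa_features(a0_dev, v0_dev, half_win=20))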