EVALUATING VARIANTS OF WAV2VEC 2.0 ON AFFECTIVE VOCAL BURST TASKS
Bagus Tris Atmaja and Akira Sasou
National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
ABSTRACT
The search for emotional biomarkers within the human voice
is a challenging research area. Previous studies focused on
predicting affective states from speech; this study explores various tasks on affective vocal bursts. Borrowing from the success of self-supervised learning in automatic speech recognition, we extracted acoustic embeddings using variants of wav2vec 2.0 for four affective vocal burst tasks: High, Two, Culture, and Type. Using a similar architecture for all tasks, our evaluation of these acoustic embeddings reveals the potential of wav2vec 2.0 variants over conventional acoustic features for affective vocal burst tasks. We evaluated both conventional acoustic features and the acoustic embeddings over twenty different random seeds, reporting the maximum and average scores with their standard deviations on the validation set. For each task, the three highest validation scores guided the generation of predictions for the test set. We compared the test scores with previous studies and obtained remarkable improvements.
Index Terms— Affective computing, affective vocal
bursts, pre-trained model, wav2vec 2.0, speech emotion
recognition
1. INTRODUCTION
Vocal bursts may have richer affective information than
speech. However, speech emotion recognition, rather than
vocal bursts, is currently gaining more attention from researchers due to its potential applications and the availability of datasets. Besides speech, affective information may also lie in short vocal bursts (e.g., crying when sad). In contrast to speech emotion recognition, which may have difficulty distinguishing between emotions, different vocal bursts may reflect different affective states more distinctly. For instance, the emotions of sadness and fear in speech sound similar, since both are expressed with a higher pitch [1]. By contrast, a specific pattern of crying may indicate sadness, whereas laughter indicates happiness. Given these benefits,
analyzing emotions from humans’ vocal bursts may improve
our understanding of human emotions.
Humans communicate through verbal and vocal channels, including when communicating emotions [2]. Verbal communication comprises the chosen words of speech, which carry semantic meaning. Vocal communication includes prosody: intonation, intensity, and rhythm. A study in cognitive brain research suggests that brain activity during emotional prosody detection is higher than during verbal detection [3]. Further studies by Tian et al. [4, 5, 6] suggest that adding non-verbal vocalization information to acoustic features improves the recognition rate of speech emotion recognition on the IEMOCAP [7] and AVEC2012 [8] datasets.
E-mail: b-atmaja@aist.go.jp; funded by NEDO Japan (project JPNP20006)
Vocal bursts, a form of non-verbal communication, constitute a potential source of emotional information [9]. A study by Cowen et al. [10] found that vocal bursts are rich in emotional information that can be conceptualized into 24 emotion categories. An earlier study by Scherer [11] proposed a model of vocal communication based on Brunswik's lens model, from expression (encoding) to perception (representation). No exact number of emotion categories emerged from that study; the author mentioned eight example emotion categories with ranges of importance for their design-feature delimitation (e.g., intensity). However, research on affective vocal bursts has since been limited by the lack of available datasets.
One way to speed up research on affective vocal bursts
is to hold workshops and competitions in the area. In [12, 13, 14], the organizers provided datasets and baseline methods, challenging participants to explore the data and surpass the baseline scores. This study, in particular, reports the evaluation of wav2vec 2.0 variants for the ACII 2022 Affective Vocal Bursts Workshop and Competition [14]. The competition comprises four tasks: three regression problems and one classification problem. The regression problems measure either the intensities of ten emotion categories or valence (the positive-negative dimension of emotion) and arousal (the low-high activation dimension of emotion). The classification problem predicts the type of vocal burst (e.g., laughter). We approached all four tasks with a similar method while observing the effect of varying the acoustic embeddings.
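To make the shared setup concrete, the general pipeline behind such approaches can be sketched as follows: a wav2vec 2.0 encoder maps a raw waveform to a sequence of frame-level embeddings, which are typically mean-pooled over time into a fixed-size utterance-level vector and then fed to a task head (regression or classification). The NumPy sketch below illustrates only this pooling-plus-head idea with randomly generated stand-in embeddings; the shapes and the sigmoid regression head are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

# Hypothetical frame-level output of a wav2vec 2.0 encoder:
# shape (num_frames, embedding_dim). Real wav2vec 2.0 models emit
# roughly one frame per 20 ms of 16 kHz audio; 768 is the base
# model's hidden size. Random values stand in for real embeddings.
num_frames, embedding_dim = 149, 768
rng = np.random.default_rng(0)
frame_embeddings = rng.standard_normal((num_frames, embedding_dim))

# Mean pooling over time yields a fixed-size utterance embedding,
# independent of the utterance's duration.
utterance_embedding = frame_embeddings.mean(axis=0)
assert utterance_embedding.shape == (embedding_dim,)

# A simple (hypothetical) regression head mapping the pooled
# embedding to ten per-emotion intensity scores in [0, 1],
# as in the High task.
W = rng.standard_normal((embedding_dim, 10)) * 0.01
b = np.zeros(10)
intensities = 1.0 / (1.0 + np.exp(-(utterance_embedding @ W + b)))
print(intensities.shape)  # (10,)
```

In practice the head would be trained with a loss such as CCC loss or cross-entropy per task, but the pooled-embedding interface stays the same, which is what allows one architecture to serve all four tasks.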
This research addresses two questions. First, we evaluated the effectiveness of seven wav2vec 2.0 variants on four affective vocal burst tasks, including variants of wav2vec 2.0 pre-trained on an affective speech dataset. Second, we combined wav2vec 2.0 embeddings with valence, arousal, and dominance (VAD) predictions in one variant to evaluate their benefit. The architecture of deep learning
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096552