EVALUATING VARIANTS OF WAV2VEC 2.0 ON AFFECTIVE VOCAL BURST TASKS

Bagus Tris Atmaja and Akira Sasou
National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan

ABSTRACT

The search for emotional biomarkers within the human voice is a challenging research area. Previous studies focused on predicting affective states from speech; this study explores various tasks on affective vocal bursts. Borrowing from the success of self-supervised learning in automatic speech recognition, we extracted acoustic embeddings using variants of wav2vec 2.0 for four affective vocal burst tasks: High, Two, Culture, and Type. Using a similar architecture for all tasks, the evaluation of acoustic embeddings reveals the potential of wav2vec 2.0 variants over conventional acoustic features in affective vocal burst tasks. We evaluated both conventional acoustic features and these acoustic embeddings over twenty different random seeds and report the maximum and average scores, with their standard deviations, on the validation set. The three highest validation scores for each task guided the generation of predictions for the test set. We compared the test scores with previous studies and obtained remarkable improvements.

Index Terms— Affective computing, affective vocal bursts, pre-trained model, wav2vec 2.0, speech emotion recognition

1. INTRODUCTION

Vocal bursts may carry richer affective information than speech. However, speech emotion recognition, rather than vocal burst analysis, currently receives more attention from researchers due to its potential applications and the availability of datasets. Besides speech, affective information may also lie in short vocal bursts (e.g., crying when sad). In contrast to speech emotion recognition, which may have difficulties distinguishing between emotions, different vocal bursts may reflect different affective states more distinctly.
For instance, the emotions of sadness and fear in speech are similar, since both are expressed with a higher pitch [1]. A specific pattern of crying may indicate sadness, whereas laughter indicates happiness. Given these benefits, analyzing emotions from human vocal bursts may improve our understanding of human emotions.

Humans communicate, including communicating emotions, through verbal and vocal channels [2]. Verbal communication comprises the chosen words of speech, which carry semantic meaning. Vocal communication includes prosody: intonation, intensity, and rhythm. A study in cognitive brain research suggests that brain activity during emotional prosody detection is higher than during verbal detection [3]. Further studies by Tian et al. [4, 5, 6] suggest that adding non-verbal vocalization information to acoustic features improved the recognition rate of speech emotion recognition on the IEMOCAP [7] and AVEC2012 [8] datasets.

Vocal bursts, a form of non-verbal communication, constitute a potential source of information about emotion [9]. A study by Cowen et al. [10] found that vocal bursts are rich in emotional information that can be conceptualized into 24 emotion categories. An earlier study by Scherer [11] proposed a model of vocal communication based on Brunswik's lens model, from expression (encoding) to perception (representation). No exact number of emotion categories emerged from that study; the author mentioned eight example emotion categories with ranges of importance for the delimitation of their design features (e.g., intensity). However, research on affective vocal bursts has since been limited by the lack of available datasets. One way to speed up research on affective vocal bursts is to hold workshops and competitions in that area.

(E-mail: b-atmaja@aist.go.jp; this work was funded by NEDO Japan, project JPNP20006.)
In [12, 13, 14], the organizers provided datasets and baseline methods to challenge participants to explore the data and surpass the baseline scores. This study, in particular, reports the evaluation of wav2vec 2.0 variants for the ACII 2022 Affective Vocal Bursts Workshop and Competition [14]. The competition comprises four tasks: three regression problems and one classification problem. The regression problems measure either the intensity of ten emotion categories or valence (the positive-negative dimension of emotion) and arousal (the low-high dimension of emotion). The classification problem predicts the type of vocal burst (e.g., laughter). We approached all four tasks with a similar method while observing the effect of varying the acoustic embeddings.

This research addresses two questions. First, we evaluated the effectiveness of seven wav2vec 2.0 variants for the four affective vocal burst tasks, including variants of wav2vec 2.0 pre-trained on an affective speech dataset. Second, we combined wav2vec 2.0 embeddings with valence, arousal, and dominance (vad) predictions in one variant to evaluate their benefits. The architecture of deep learning

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096552