Speech Emotion Recognition with Multi-task Learning

Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church
Baidu Research, USA
{xingyucai,jiahongyuan,renjiezheng,lianghuang,kennethchurch}@baidu.com

Abstract

Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework that simultaneously performs speech-to-text recognition and emotion classification, with an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.

Index Terms: speech emotion recognition, multi-task learning

1. Introduction

Emotions such as Happy, Angry, Sad and Neutral play an important role in the human communication process. Emotion has been described as an “implicit channel” that is transmitted in addition to the explicit messages [1]. Participants in a conversation can communicate more effectively if they can recognize each other's emotional states. Although it may not be very hard for humans to perceive others' emotions, it remains a challenging task for computers. Considerable effort has been devoted to emotion recognition (ER) in the human-computer interaction field over the past decades.

People express emotions in many ways, including body language, facial expressions, choice of words, tone of voice and more. A variety of ER systems based on different types of input signals have therefore been proposed, e.g. face emotion recognition [2]. Emotions are even correlated with electrochemical characteristics of humans such as EEG signals [3], suggesting that electrochemical probes can be used to capture emotions.
In this paper, we focus on the speech emotion recognition (SER) task, which takes audio speech as input and outputs emotion classes such as Happy, Angry, Sad and Neutral.

SER systems typically consist of several cascading components: feature extraction, feature selection and classification [4]. Many systems make use of spectral features, as well as explicit representations of prosodic features, voice quality features and Teager energy operator based features [5]. These approaches require strong domain knowledge and a deep understanding of speech. In recent years, end-to-end systems have tended to outperform traditional systems based on carefully engineered features. In particular, end-to-end deep neural models learn to extract features implicitly, via trainable blocks such as convolutional layers. Thanks to much larger model capacity (a significantly larger number of parameters) and the development of efficient learning algorithms, deep neural models have become the dominant and preferred systems for the SER task [6].

In this paper, we propose an end-to-end deep neural SER model that is trained using multi-task learning (MTL). The main contributions of this paper are:

• We build an end-to-end model that achieves state-of-the-art SER results on the standard IEMOCAP [7] dataset.
• We leverage the pretrained wav2vec-2.0 for speech feature extraction, and fine-tune it on SER data through two tasks: SER (emotion classification) and ASR (speech recognition).
• An ablation study verifies the effectiveness of the MTL approach, and discusses how the ASR task affects the SER task.
• A speech transcription is obtained as a byproduct.

The rest of the paper is organized as follows: Section 2 reviews recent related work on SER, MTL and the pretrained wav2vec-2.0. Next, in Section 3, we describe the proposed model, as well as the training and inference processes. Empirical results and ablation studies are presented in Section 4.
Finally, conclusions are drawn in Section 5.

2. Related Work

2.1. Speech Emotion Recognition

Speech emotion recognition detects a speaker's emotional state from the speech signal. It is often treated as a classification task: typically, each input utterance is assigned a single class label drawn from a set of predefined emotion categories, e.g. Happy.

There is a considerable literature on SER. Much of this work makes use of steps such as preprocessing, feature extraction and classification [5, 8]. In early work [4], it was common to extract features such as pitch, energy, formants, mel-band energies and mel-frequency cepstral coefficients (MFCCs) as the base features, as well as utterance-level features such as speaking rate. The next step is to feed those features into machine learning classifiers, e.g. SVMs, LDA, QDA and HMMs. SVMs and HMMs performed relatively well in terms of classification accuracy, and ensembling methods were often found to be even more effective [9, 10, 11].

Thanks to the advances of deep learning, neural models dominate recent trends in SER research. The authors of [12] evaluated CNN and LSTM architectures, and found that a concatenation of three convolutional layers plus a bi-LSTM layer yields the best results. In [13], a much larger backbone convolutional network, ResNet-101, is adopted to provide stronger feature extraction. More recently, the attention mechanism has come to play a very important role in the NLP domain, and has spread to the speech and vision areas. In [14], the authors proposed a model that consists of an attention sliding recurrent neural network (ASRNN). The authors of [15] combine encoded linguistic and acoustic features, building a multi-head self-attention model to study the influence of both feature types on the SER task. In [16], the authors leverage a deep attention-based language model and use pauses as a critical feature to detect emotions.
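The classical two-stage recipe described above — frame-level acoustic features pooled into utterance-level statistics, then a shallow classifier — can be sketched as follows. This is a minimal, dependency-free illustration and not any cited system: the synthetic 13-dimensional frames stand in for MFCCs, and a logistic-regression classifier stands in for the SVMs mentioned above.

```python
import numpy as np

def utterance_features(frames):
    """Pool frame-level features (e.g. MFCCs) into a single utterance
    vector via mean and standard deviation over time -- a common
    classical recipe for utterance-level classification."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Toy data: two "emotion" classes whose frame statistics differ.
# The 13-dim frames stand in for MFCCs; no real audio is processed.
rng = np.random.default_rng(0)
X = np.stack([utterance_features(rng.normal(loc=c, size=(50, 13)))
              for c in (0.0, 1.0) for _ in range(20)])
y = np.array([0] * 20 + [1] * 20)

# Logistic-regression classifier (a stand-in for the SVMs cited above),
# trained with plain batch gradient descent on the log loss.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid scores
    g = p - y                                 # gradient of the log loss
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

train_acc = (((X @ w + b) > 0) == (y == 1)).mean()
```

Because the two synthetic classes have clearly different frame statistics, even this simple linear model separates them well after pooling; real systems replace both stages with stronger extractors and classifiers.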
In [17], two models, CNN plus attention and bi-LSTM plus attention, are evaluated and compared from several aspects. A comprehensive survey of recent deep neural models for SER is given in [6].

INTERSPEECH 2021, 30 August – 3 September 2021, Brno, Czechia. Copyright 2021 ISCA. http://dx.doi.org/10.21437/Interspeech.2021-1852
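The multi-task objective at the heart of the proposed framework — a shared encoder trained jointly on an utterance-level emotion loss and a CTC-based ASR loss — can be sketched in numpy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the interpolation weight `alpha`, the toy dimensions, and the plain-probability (non-log-space) CTC recursion are choices made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under the CTC alignment model.
    probs: (T, V) per-frame posteriors; target: token ids (no blanks).
    Real systems run this recursion in log space for numerical stability."""
    T = probs.shape[0]
    ext = [blank]
    for tok in target:                  # interleave blanks: b y1 b y2 b ...
        ext += [tok, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]       # start with a blank ...
    alpha[0, 1] = probs[0, ext[1]]      # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]    # skip over an optional blank
            alpha[t, s] = a * probs[t, ext[s]]
    return -np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])

def joint_loss(frame_logits, utt_logits, transcript, emotion, alpha=0.1):
    """MTL objective: weighted sum of SER cross-entropy and ASR CTC loss.
    The (1 - alpha) / alpha weighting is an assumed form for illustration."""
    l_asr = ctc_loss(softmax(frame_logits), transcript)     # ASR branch
    l_ser = -np.log(softmax(utt_logits)[emotion])           # SER branch
    return (1.0 - alpha) * l_ser + alpha * l_asr
```

In the actual model, `frame_logits` would come from a token head on the wav2vec-2.0 frame outputs and `utt_logits` from a pooled utterance-level emotion head; here random logits suffice to exercise the objective, e.g. `joint_loss(rng.normal(size=(10, 5)), rng.normal(size=4), [1, 2], 0)` yields a finite positive scalar.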