Indian Cross Corpus Speech Emotion Recognition Using Multiple Spectral-Temporal-Voice
Quality Acoustic Features and Deep Convolution Neural Network
Rupali Kawade 1,2*, Sonal Jagtap 1,3

1 Department of E&TC Engineering, G H Raisoni College of Engineering and Management, Wagholi, Pune 412207, India
2 Department of E&TC Engineering, PCET's Pimpri Chinchwad College of Engineering & Research, Ravet, Pune 412101, India
3 Department of E&TC Engineering, Smt. Kashibai Navale College of Engineering, Vadgaon (Bk), Pune 411041, India

*Corresponding Author Email: rupali2118@gmail.com
Copyright: ©2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license
(http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.18280/ria.380318

ABSTRACT
Received: 6 September 2023
Revised: 2 December 2023
Accepted: 10 January 2024
Available online: 21 June 2024
Speech Emotion Recognition (SER) is crucial for enriching next-generation human-machine interaction (HMI) with emotional intelligence by extracting emotions from words and voice. However, current SER techniques are developed within experimental boundaries and face major challenges, such as a lack of robustness across languages, cultures, age groups, and speaker genders. Very little work has been carried out on SER for Indian corpora, which exhibit high diversity, a large number of dialects, and wide variation arising from regional and geographical factors. India has one of the largest user bases for HMI systems, social networking sites, and internet services; SER that focuses on Indian corpora is therefore crucial. This paper presents cross-corpus SER (CCSER) for Indian corpora using multiple acoustic features (MAF) and a deep convolutional neural network (DCNN) to improve the robustness of SER. The MAF comprise various spectral, temporal, and voice quality features. Further, the Fire Hawk Optimization (FHO) technique is utilized for salient feature selection. FHO selects the important features from the MAF to minimize computational complexity and improve feature distinctiveness based on the inter-class and intra-class variance of the features. The DCNN provides better correlation, richer feature representation, a better description of variation in timbre, intonation, and pitch, and superior connectivity between global and local features of the speech signal to characterize the corpus. The suggested DCNN-based SER is evaluated on the Indo-Aryan language family (Hindi and Urdu) and the Dravidian language family (Telugu and Kannada). The proposed scheme improves accuracy for various cross-corpus and multilingual SER tasks and outperforms traditional techniques, achieving accuracies of 58.83%, 61.75%, 69.75%, and 45.51% for Hindi, Urdu, Telugu, and Kannada, respectively, under multilingual training.
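The feature-selection criterion described above ranks features by the ratio of inter-class to intra-class variance. A minimal sketch of that variance-ratio (Fisher) score is shown below; it illustrates only the scoring criterion, not the Fire Hawk metaheuristic itself, and the data and function names are illustrative:

```python
import numpy as np

def fisher_scores(X, y):
    """Score each feature by the ratio of inter-class to intra-class variance.

    X: (n_samples, n_features) acoustic feature matrix
    y: (n_samples,) integer emotion labels
    Higher scores indicate more discriminative features.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        # Between-class scatter: class-mean deviation from the overall mean
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        # Within-class scatter: sample deviation from the class mean
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

# Toy example: feature 0 separates the two classes, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([5, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
scores = fisher_scores(X, y)
top_k = np.argsort(scores)[::-1][:1]  # indices of the best-scoring features
```

In this sketch the discriminative feature receives a far higher score than the noise feature, so truncating to the top-scoring indices shrinks the feature vector while retaining class separability.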
Keywords:
affective computing, acoustic features, cross-corpus SER, deep convolution neural network, deep learning, human computer interaction, speech recognition
1. INTRODUCTION
Affective computing seeks to facilitate people's natural interaction with computers. One of its main goals is to enable computers to understand people's emotional states so that customized responses can be provided [1, 2]. Recent years have seen growing interest in SER, which is often pursued on the premise that spoken utterances in the training and testing datasets are generated under the same conditions. However, as voice data are often gathered from many devices or locations, this assumption does not hold in practice. Owing to the disparity between the training and testing datasets, SER suffers from a class-imbalance problem [3, 4].
Emotions reflect the psychological state of a human being. Various physiological and psychological signals, such as speech, facial expressions, electrocardiograms (ECG), and electroencephalograms (EEG), are utilized to manifest emotional states. Speech is the most natural and easiest mode of interaction and carries rich emotional content and context, making SER the most straightforward channel for human-machine interaction (HMI). Generalized SER systems use the same corpus for both training and testing, which may yield poor results on a new corpus [5-7]. SER is very challenging owing to many factors, such as age, health status, gender, linguistic variability, cultural variability, recording environments, and languages with distinct corpora. Speech attributes show high variance across corpora, which leads to a poor recognition rate for SER systems designed for a single corpus. Nowadays, various cross-corpus SER systems have been implemented that use one dataset for training and another for testing [8, 9].
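The cross-corpus protocol just described trains on one dataset and evaluates on another recorded under different conditions. The sketch below illustrates that protocol with synthetic features and a nearest-centroid classifier standing in for a real SER model; the domain shift is modeled crudely as a shift in the feature means, and all names and data are illustrative:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Return per-class mean vectors (a minimal stand-in emotion classifier)."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(classes, centroids, X):
    # Assign each sample to the class with the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(1)

def make_corpus(shift, n=60):
    """Two emotion classes in a 4-D feature space; `shift` models a
    corpus-specific recording-condition offset."""
    X = np.vstack([rng.normal(0 + shift, 1, (n, 4)),
                   rng.normal(3 + shift, 1, (n, 4))])
    y = np.array([0] * n + [1] * n)
    return X, y

Xa, ya = make_corpus(shift=0.0)   # training corpus
Xb, yb = make_corpus(shift=0.8)   # testing corpus (domain shift)
classes, centroids = nearest_centroid_fit(Xa, ya)
acc = (nearest_centroid_predict(classes, centroids, Xb) == yb).mean()
```

Because the test-corpus features are offset from the training-corpus features, the cross-corpus accuracy is lower than it would be for a matched train/test split, which is the degradation cross-corpus SER methods aim to reduce.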
In past decades, most SER techniques used the same corpus for training as well as testing, and researchers achieved noteworthy success in SER under controlled experimental boundaries [10-12]. Earlier SER used traditional machine learning (ML) techniques such as the Gaussian Mixture Model (GMM) [13], Hidden Markov Model (HMM) [14], Support Vector Machine (SVM) [15], K-Nearest Neighbor (KNN) [16], Random Forest classifier (RF) [17], and Artificial Neural Network (ANN) [18], along with handcrafted
Revue d'Intelligence Artificielle
Vol. 38, No. 3, June, 2024, pp. 913-927
Journal homepage: http://iieta.org/journals/ria