Indian Cross Corpus Speech Emotion Recognition Using Multiple Spectral-Temporal-Voice
Quality Acoustic Features and Deep Convolution Neural Network
Rupali Kawade 1,2*, Sonal Jagtap 1,3

1 Department of E&TC Engineering, G H Raisoni College of Engineering and Management, Wagholi, Pune 412207, India
2 Department of E&TC Engineering, PCET's Pimpri Chinchwad College of Engineering & Research, Ravet, Pune 412101, India
3 Department of E&TC Engineering, Smt. Kashibai Navale College of Engineering, Vadgaon (Bk), Pune 411041, India

*Corresponding Author Email: rupali2118@gmail.com
Copyright: ©2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license
(http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.18280/ria.380318

ABSTRACT
Received: 6 September 2023
Revised: 2 December 2023
Accepted: 10 January 2024
Available online: 21 June 2024
Speech Emotion Recognition (SER) is crucial for enriching next-generation human-machine interaction (HMI) with emotional intelligence by extracting emotions from words and voice. However, current SER techniques are developed within experimental boundaries and face major challenges, such as a lack of robustness across languages, cultures, age groups, and speaker genders. Very little work has been carried out on SER for Indian corpora, which exhibit high diversity, a large number of dialects, and wide variation arising from regional and geographical factors. India has one of the largest user bases for HMI systems, social networking sites, and internet services; SER that focuses on Indian corpora is therefore crucial. This paper presents cross-corpus SER (CCSER) for Indian corpora using multiple acoustic features (MAF) and a deep convolutional neural network (DCNN) to improve the robustness of SER. The MAF comprise various spectral, temporal, and voice quality features. Further, the Fire Hawk Optimization (FHO) technique is utilized for salient feature selection. FHO selects the important features from the MAF to minimize computational complexity and improve feature distinctiveness based on the inter-class and intra-class variance of the features. The DCNN provides better correlation, richer feature representation, a better description of variation in timbre, intonation, and pitch, and superior connectivity between global and local features of the speech signal to characterize the corpus. The suggested DCNN-based SER is evaluated on the Indo-Aryan language family (Hindi and Urdu) and the Dravidian language family (Telugu and Kannada). The proposed scheme improves accuracy for various cross-corpus and multilingual SER tasks and outperforms traditional techniques, achieving accuracies of 58.83%, 61.75%, 69.75%, and 45.51% for Hindi, Urdu, Telugu, and Kannada, respectively, under multilingual training.
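The feature-selection criterion described above ranks features by the ratio of inter-class to intra-class variance. A minimal sketch of that variance-ratio (Fisher) score is shown below; it illustrates only the scoring criterion, not the Fire Hawk metaheuristic itself, and the data and function names are illustrative:

```python
import numpy as np

def fisher_scores(X, y):
    """Score each feature by the ratio of inter-class to intra-class variance.

    X: (n_samples, n_features) acoustic feature matrix
    y: (n_samples,) integer emotion labels
    Higher scores indicate more discriminative features.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        # Between-class scatter: class-mean deviation from the overall mean
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        # Within-class scatter: sample deviation from the class mean
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

# Toy example: feature 0 separates the two classes, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal([5, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
scores = fisher_scores(X, y)
top_k = np.argsort(scores)[::-1][:1]  # indices of the best-scoring features
```

In this sketch the discriminative feature receives a far higher score than the noise feature, so truncating to the top-scoring indices shrinks the feature vector while retaining class separability.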
Keywords:
affective computing, acoustic features, cross-corpus SER, deep convolution neural network, deep learning, human computer interaction, speech recognition
1. INTRODUCTION
Affective computing seeks to facilitate people's natural interaction with computers. One of its main goals is to enable computers to understand people's emotional states so that customized responses can be provided [1, 2]. Recent years have seen growing interest in SER, which is often pursued on the premise that spoken utterances in the training and testing datasets are generated under the same conditions. However, as voice data are often gathered from many devices or locations, this assumption does not hold in practice. Owing to the disparity between the training and testing datasets, SER suffers from a class-imbalance problem [3, 4].
Emotions reflect the psychological state of a human being. Various physiological and psychological signals, such as speech, facial expressions, electrocardiograms (ECG), and electroencephalograms (EEG), are utilized to manifest emotional states. Speech is the most natural and easiest mode of interaction and carries rich emotional content and context, making SER the most straightforward channel for human-machine interaction (HMI). Generalized SER systems use the same corpus for both training and testing, which may yield poor results on a new corpus [5-7]. SER is very challenging owing to many factors, such as age, health status, gender, linguistic variability, cultural variability, recording environments, and languages with distinct corpora. Speech attributes show high variance across corpora, which leads to a poor recognition rate for SER systems designed for a single corpus. Nowadays, various cross-corpus SER systems have been implemented that use one dataset for training and another for testing [8, 9].
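The cross-corpus protocol just described trains on one dataset and evaluates on another recorded under different conditions. The sketch below illustrates that protocol with synthetic features and a nearest-centroid classifier standing in for a real SER model; the domain shift is modeled crudely as a shift in the feature means, and all names and data are illustrative:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Return per-class mean vectors (a minimal stand-in emotion classifier)."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(classes, centroids, X):
    # Assign each sample to the class with the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(1)

def make_corpus(shift, n=60):
    """Two emotion classes in a 4-D feature space; `shift` models a
    corpus-specific recording-condition offset."""
    X = np.vstack([rng.normal(0 + shift, 1, (n, 4)),
                   rng.normal(3 + shift, 1, (n, 4))])
    y = np.array([0] * n + [1] * n)
    return X, y

Xa, ya = make_corpus(shift=0.0)   # training corpus
Xb, yb = make_corpus(shift=0.8)   # testing corpus (domain shift)
classes, centroids = nearest_centroid_fit(Xa, ya)
acc = (nearest_centroid_predict(classes, centroids, Xb) == yb).mean()
```

Because the test-corpus features are offset from the training-corpus features, the cross-corpus accuracy is lower than it would be for a matched train/test split, which is the degradation cross-corpus SER methods aim to reduce.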
In past decades, most SER techniques used the same corpus for training as well as testing, and researchers achieved noteworthy success in SER under controlled experimental boundaries [10-12]. Earlier SER used traditional machine learning (ML) techniques such as the Gaussian Mixture Model (GMM) [13], Hidden Markov Model (HMM) [14], Support Vector Machine (SVM) [15], K-Nearest Neighbor (KNN) [16], Random Forest classifier (RF) [17], and Artificial Neural Network (ANN) [18], along with handcrafted
Revue d'Intelligence Artificielle
Vol. 38, No. 3, June, 2024, pp. 913-927
Journal homepage: http://iieta.org/journals/ria