Date of publication MAR-31, 2024, date of current version FEB-09, 2024. www.computingonline.net/ computing@computingonline.net. Print ISSN 1727-6209, Online ISSN 2312-5381. DOI 10.47839/ijc.23.1.3430

Speech Emotion Recognition using Hybrid Architectures

M. NORVAL 1, Z. WANG 2
1 Department of Electrical Engineering, University of South Africa, Johannesburg (e-mail: 36825050@mylife.unisa.ac.za)
2 Department of Electrical Engineering, University of South Africa, Johannesburg (e-mail: wangz@unisa.ac.za)
Corresponding author: M. Norval (e-mail: 36825050@mylife.unisa.ac.za).

This research was supported partially by the South African National Research Foundation (Grant nos. 120106, 41951, and 132797) and the South African National Research Foundation Incentive (Grant no. 132159).

ABSTRACT The detection of human emotions from speech signals remains a challenging frontier in audio processing and human-computer interaction. This study introduces a novel approach to Speech Emotion Recognition (SER) using a Dendritic Layer combined with a Capsule Network (DendCaps). A hybrid of a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (CLSTM) is used to create a baseline, which is then compared to the DendCaps model. Integrating dendritic layers and capsule networks for speech emotion detection can harness the unique advantages of both architectures, potentially leading to more sophisticated and accurate models. Dendritic layers, inspired by the nonlinear processing properties of dendritic trees in biological neurons, can handle the intricate patterns and variability inherent in speech signals, while capsule networks, with their dynamic routing mechanism, are adept at preserving hierarchical spatial relationships within the data, enabling the model to capture more refined emotional subtleties in human speech. The main motivation for using DendCaps is to bridge the gap between the capabilities of biological neural systems and artificial neural networks.
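To make the dendritic-layer idea above concrete, the following is a minimal NumPy sketch of a forward pass in which each output neuron aggregates nonlinear responses from several dendritic "branches", each branch integrating a subset of the input. The layer sizes, even input partitioning, and tanh branch nonlinearity are illustrative assumptions, not the implementation used in this paper:

```python
import numpy as np

def dendritic_layer_forward(x, W, b, n_branches=4):
    """Toy dendritic-layer forward pass (illustrative sketch).

    Each output neuron owns `n_branches` dendritic branches. The input is
    split evenly across the branches; each branch applies a linear map
    followed by a tanh nonlinearity (the branch-local nonlinear
    integration attributed to biological dendrites), and the soma sums
    the branch outputs.

    x : (n_in,) input feature vector
    W : (n_out, n_branches, n_in // n_branches) per-branch weights
    b : (n_out, n_branches) per-branch biases
    """
    n_out, n_br, seg = W.shape
    segments = x.reshape(n_br, seg)                 # partition input across branches
    branch_out = np.tanh(np.einsum("obs,bs->ob", W, segments) + b)
    return branch_out.sum(axis=1)                   # somatic summation

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n_in, n_out, n_br = 8, 3, 4
x = rng.standard_normal(n_in)
W = rng.standard_normal((n_out, n_br, n_in // n_br))
b = np.zeros((n_out, n_br))
y = dendritic_layer_forward(x, W, b, n_branches=n_br)
print(y.shape)  # (3,)
```

Because each branch saturates independently before the somatic sum, the layer can realize piecewise nonlinear responses that a single weighted sum with one output nonlinearity cannot.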
This combination aims to capitalize on the hierarchical nature of speech data, where intricate patterns and dependencies can be better captured. Finally, two ensemble methods, namely stacking and boosting, are used for evaluating the CLSTM and DendCaps networks, and the experimental results show that stacking the CLSTM and DendCaps networks gives the superior result, with 75% accuracy.

KEYWORDS Emotion recognition; Artificial Intelligence; Dendritic Layer; Capsule Networks; Ensemble

I. INTRODUCTION
Emotion recognition from audio signals plays a pivotal role in enhancing human-computer interaction and enabling machines to understand and react to human emotions. The field has gained significant attention due to the explosion of voice-activated systems, virtual assistants, and AI-driven customer support systems. However, despite substantial advancements, emotion recognition from audio signals remains challenging due to the inherent complexity and variability of human emotions and the diverse acoustic characteristics they present. Recently, deep learning models have demonstrated superior performance in various domains, including image recognition, natural language processing, and speech recognition. Their ability to learn complex patterns and dependencies from raw data suggests that they may offer improved performance for emotion recognition from audio signals [1]. In the context of human-computer interaction and affective computing, the challenge of recognizing emotions from audio-speech signals remains a vital research problem. The task involves designing robust and efficient models capable of accurately discerning various emotional states conveyed through spoken language, accounting for factors such as speaker diversity, language variations, and the dynamic nature of emotions.
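The stacking scheme described in the abstract, in which a meta-classifier is trained on the class-probability outputs of the two base networks, can be sketched as follows. The base models here are random stand-in callables for the CLSTM and DendCaps networks, the four-class emotion setup is assumed for illustration, and the multinomial logistic-regression meta-learner is one conventional choice, not necessarily the one used in this study:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in base models (placeholders for the trained CLSTM and DendCaps
# networks). Each maps a feature vector to probabilities over 4 emotions.
def make_base_model(seed, n_in=6, n_classes=4):
    w = np.random.default_rng(seed).standard_normal((n_in, n_classes))
    def predict_proba(X):
        z = X @ w
        e = np.exp(z - z.max(axis=1, keepdims=True))   # stable softmax
        return e / e.sum(axis=1, keepdims=True)
    return predict_proba

clstm_stub, dendcaps_stub = make_base_model(2), make_base_model(3)

# Level-1 (meta) features: concatenated base-model probabilities.
def meta_features(X):
    return np.hstack([clstm_stub(X), dendcaps_stub(X)])

# Meta-learner: multinomial logistic regression fit by gradient descent.
def train_meta(X, y, n_classes=4, lr=0.5, steps=300):
    F = meta_features(X)
    W = np.zeros((F.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                           # one-hot targets
    for _ in range(steps):
        z = F @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * F.T @ (p - Y) / len(y)               # cross-entropy gradient
    return W

def stacked_predict(X, W):
    return np.argmax(meta_features(X) @ W, axis=1)

# Synthetic "utterance features" and emotion labels, for illustration only.
X = rng.standard_normal((200, 6))
y = rng.integers(0, 4, size=200)
W = train_meta(X, y)
pred = stacked_predict(X, W)
print(pred.shape)  # (200,)
```

In practice the meta-features would be produced by out-of-fold predictions of the trained base networks on held-out data, so that the meta-learner does not simply memorize base-model overfitting.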
Researchers have ventured into the utilization of ensemble techniques for emotion detection, with a specific focus on random forest averaging, locally weighted naive Bayes (LWNB), logistic classifiers, boosting-tree-based models, gradient boosting, and majority-voting classifiers [2], [3], [4], [5], [6]. Furthermore, Capsule Networks

VOLUME 23(1), 2024