International Journal of Electrical and Computer Engineering (IJECE)
Vol. 12, No. 6, December 2022, pp. 6594-6601
ISSN: 2088-8708, DOI: 10.11591/ijece.v12i6.pp6594-6601
Journal homepage: http://ijece.iaescore.com

Speech emotion recognition using 2D-convolutional neural network

Fauzivy Reggiswarashari, Sari Widya Sihwi
Department of Informatics, Faculty of Mathematics and Natural Sciences, Universitas Sebelas Maret, Surakarta, Indonesia

Article history:
Received Aug 31, 2021
Revised May 28, 2022
Accepted Jun 26, 2022

ABSTRACT
This research proposes a speech emotion recognition model that predicts human emotions with a convolutional neural network (CNN) by learning from segmented audio of specific emotions. Speech emotion recognition uses features extracted from audio waves to learn the characteristics of emotional speech; one such feature is the mel frequency cepstral coefficient (MFCC). Because the dataset plays a vital role in obtaining valuable results from model learning, this research leverages a combination of datasets. The model learns the combined dataset with audio segmentation and zero padding using a 2D-CNN; segmentation and zero padding equalize the lengths of the extracted audio features so that their characteristics can be learned. The model achieves 83.69% accuracy in predicting seven emotions (neutral, happy, sad, angry, fear, disgust, and surprise) from the combined dataset with segmentation of the audio files.

Keywords:
2D-CNN
Audio segmentation
Mel frequency cepstral coefficient
Speech emotion recognition

This is an open access article under the CC BY-SA license.

Corresponding Author:
Sari Widya Sihwi
Department of Informatics, Faculty of Mathematics and Natural Sciences, Universitas Sebelas Maret
Jl. Ir. Sutami No 36A, Surakarta, Central Java, 57126, Indonesia
Email: sariwidya@staff.uns.ac.id

1. INTRODUCTION
Artificial intelligence (AI) studies show enormous beneficial growth in most aspects of the product industry. One AI study that gathers researchers' attention is affective computing, with its humanitarian approach. Affective computing recognizes, processes, and produces human feelings using a computational approach [1]. It reaches many fields in industry, such as medicine, psychotherapy, marketing, and advertising [2]. The implementation of emotion detection brings facial expression recognition, language recognition, empathy giving, and other functionalities closer to how humans function. In addition, emotion recognition widens the technological growth of affective computing itself. Emotion recognition is the recognition of human feelings from many data sources: texts, facial expressions, voices, gestures, and behaviors [3]-[5]. Researchers face two main problems in speech emotion recognition: extracted feature selection and model selection. Speech emotion recognition is the recognition of emotion in human speech using computational models in affective computing, and it supports the forensic, security, and biometric fields [6]. Besides its lexical content, human speech expresses other characteristics: information about age, gender, language, and emotion [6], [7]. Audio contains many features that can be extracted into computational form by transforming the audio into arrays. Features extracted from the speech waveform are categorized into four groups: prosodic, spectral, audio quality, and Teager energy operator features [8].
Commonly extracted audio features include the mel frequency cepstral coefficient (MFCC), spectrogram, zero crossing rate (ZCR), Teager energy operator (TEO), and harmonic-to-noise ratio (HNR) [9]-[12]. In emotion recognition research, MFCC is the most frequently used cepstral-domain feature [12].
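As a minimal illustration of this feature extraction step, the sketch below computes MFCCs from an audio file and zero-pads (or trims) them to a fixed number of frames, producing an input suitable for a 2D-CNN. It assumes the librosa library; the sampling rate, number of coefficients, and frame count are illustrative choices, not the exact settings used in this research.

import numpy as np
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40, max_frames=300):
    """Load an audio file, compute MFCCs, and zero-pad/trim to a fixed width."""
    signal, _ = librosa.load(path, sr=sr)                         # resample to a common rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    if mfcc.shape[1] < max_frames:
        pad = max_frames - mfcc.shape[1]
        mfcc = np.pad(mfcc, ((0, 0), (0, pad)), mode="constant")  # zero padding equalizes lengths
    else:
        mfcc = mfcc[:, :max_frames]                               # trim longer utterances
    return mfcc[..., np.newaxis]                                  # add a channel axis for a 2D-CNN

# Hypothetical usage: features = extract_mfcc("audio/angry_001.wav")

Stacking such fixed-size MFCC matrices across segmented utterances yields an image-like tensor, which is what allows a 2D convolutional network to learn emotion characteristics from speech.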