Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 2, May 2018, pp. 554~561
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i2.pp554-561
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Speech Emotion Recognition Using Deep Feedforward
Neural Network
Muhammad Fahreza Alghifari¹, Teddy Surya Gunawan*², Mira Kartiwi³
¹,² Department of Electrical and Computer Engineering, Kulliyyah of Engineering, International Islamic University Malaysia, Malaysia
³ Department of Information Systems, Kulliyyah of ICT, International Islamic University Malaysia, Malaysia
Article Info

Article history:
Received Nov 26, 2017
Revised Jan 23, 2018
Accepted Feb 21, 2018

Keywords:
Deep neural network
Mel-frequency cepstral coefficients (MFCC)
Speech emotion recognition (SER)

ABSTRACT

Speech emotion recognition (SER) is currently a research hotspot due to its challenging nature and promising future prospects. The objective of this research is to utilize Deep Neural Networks (DNNs) to recognize human speech emotion. First, the chosen speech features, Mel-frequency cepstral coefficients (MFCC), were extracted from the raw audio data. Second, the extracted features were fed into the DNN to train the network. The trained network was then tested on a set of labelled emotional speech recordings, and the recognition rate was evaluated. Based on the accuracy rate, the number of MFCCs, neurons, and layers was adjusted for optimization. Moreover, a custom-made database is introduced and validated using the optimized network. The optimum configuration for SER is 13 MFCCs, 12 neurons, and 2 layers for 3 emotions, and 25 MFCCs, 21 neurons, and 4 layers for 4 emotions, achieving total recognition rates of 96.3% for 3 emotions and 97.1% for 4 emotions.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Teddy Surya Gunawan,
Department of Electrical and Computer Engineering,
Kulliyyah of Engineering, International Islamic University Malaysia, Malaysia.
Email: tsgunawan@iium.edu.my
1. INTRODUCTION
Speech Emotion Recognition (SER) can be defined as the identification of the emotional state of the
speaker from his or her speech signal [1]. SER is one of the topics in speech processing that has been
researched continuously for decades; the earliest attempts date back to the late 1950s [2]. In today's
world, SER has proven to be quite a research hotspot, as indicated by the growing number of publications
each year.
The application of SER can be targeted at several sectors. In banking, an auto-caller equipped with
SER may assist in detecting the emotion of the customer, generating customized responses based on the result. In
education, an e-learning portal with SER can detect user emotions such as frustration and stress,
determine whether the learning environment is conducive, and take appropriate countermeasures. Yet another
application is in transportation: in the near future, when vehicles are capable of auto-driving, the system
can take over the steering wheel if an unhealthy level of emotion is detected
in the driver.
A typical speech emotion recognition system is illustrated in Figure 1. Feature extraction marks the start
of an SER system; this includes selecting the features appropriate for emotion recognition. Next, these features
are processed by a classifier, which is trained by referring to an emotion database. The system is then
tested by cross-checking against the same database, and the processed results determine the final
decision, typically evaluated in terms of accuracy and processing time.
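The first stage of this pipeline, MFCC extraction, can be sketched in plain NumPy as below. Note that this is a minimal illustrative implementation: the frame size, hop length, FFT size, and filterbank size chosen here are common defaults for 16 kHz speech, not the exact settings used in this paper.

```python
# Minimal MFCC extraction sketch (illustrative parameters, not the paper's settings).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_fft=512, n_mels=26):
    """Return an (n_frames, n_mfcc) matrix of cepstral coefficients."""
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Triangular mel-spaced filterbank applied to the power spectrum.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)

    # 4. DCT-II decorrelates the log energies; keep the first n_mfcc terms.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n + 1) / (2 * n_mels))
    return mel_energy @ basis.T

# Usage: one second of a 440 Hz tone at 16 kHz yields 98 frames of 13 coefficients.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
print(features.shape)  # (98, 13)
```

The resulting per-frame coefficient vectors (13 per frame in this sketch, matching the 13-MFCC configuration reported for the 3-emotion case) are what would then be fed to the classifier in the next stage.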