Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 2, May 2018, pp. 554~561
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i2.pp554-561
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Speech Emotion Recognition Using Deep Feedforward
Neural Network
Muhammad Fahreza Alghifari¹, Teddy Surya Gunawan*², Mira Kartiwi³
¹,² Department of Electrical and Computer Engineering, Kulliyyah of Engineering, International Islamic University Malaysia, Malaysia
³ Department of Information Systems, Kulliyyah of ICT, International Islamic University Malaysia, Malaysia
Article Info

Article history:
Received Nov 26, 2017
Revised Jan 23, 2018
Accepted Feb 21, 2018

Keywords:
Deep neural network
Mel-frequency cepstral coefficients (MFCC)
Speech emotion recognition (SER)

ABSTRACT

Speech emotion recognition (SER) is currently a research hotspot due to its challenging nature and promising future prospects. The objective of this research is to utilize Deep Neural Networks (DNNs) to recognize human speech emotion. First, the chosen speech features, Mel-frequency cepstral coefficients (MFCC), were extracted from the raw audio data. Second, the extracted features were fed into the DNN to train the network. The trained network was then tested on a set of labelled emotional speech recordings, and the recognition rate was evaluated. Based on the accuracy rate, the number of MFCCs, neurons, and layers was adjusted for optimization. Moreover, a custom-made database is introduced and validated using the optimized network. The optimum configuration for SER is 13 MFCCs, 12 neurons, and 2 layers for 3 emotions, and 25 MFCCs, 21 neurons, and 4 layers for 4 emotions, achieving total recognition rates of 96.3% for 3 emotions and 97.1% for 4 emotions.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Teddy Surya Gunawan,
Department of Electrical and Computer Engineering,
Kulliyyah of Engineering, International Islamic University Malaysia, Malaysia.
Email: tsgunawan@iium.edu.my
1. INTRODUCTION
Speech Emotion Recognition (SER) can be defined as the identification of the emotional state of the
speaker from his or her speech signal [1]. SER is one of the topics in speech processing that has been
researched continuously for decades; the earliest attempts date back to the late 1950s [2]. In today's
world, SER has proven to be quite a research hotspot, as indicated by the growing number of publications
each year.
The application of SER can be targeted at several sectors. In banking, an auto-caller equipped with
SER may assist in detecting the emotion of the customer, generating customized responses based on the result. In
education, an e-learning portal with SER can detect user emotions such as frustration and stress,
determine whether the learning environment is conducive, and take appropriate countermeasures. Yet another
application is in transportation: in the near future, when vehicles are capable of auto-driving, the system
can take over the steering wheel if an unhealthy level of emotion is detected
in the driver.
A typical speech emotion recognition system is illustrated in Figure 1. Feature extraction marks the start
of an SER system; this includes selecting the features appropriate for emotion recognition. Next, these features
are processed by a classifier, which is trained by referring to an emotion database. The system is then
tested by cross-checking against the same database, and the processed results determine the final
decision, typically evaluated in terms of accuracy and processing time.
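The first stage of this pipeline, MFCC extraction, can be sketched in plain NumPy as below. Note that this is a minimal illustrative implementation: the frame size, hop length, FFT size, and filterbank size chosen here are common defaults for 16 kHz speech, not the exact settings used in this paper.

```python
# Minimal MFCC extraction sketch (illustrative parameters, not the paper's settings).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_fft=512, n_mels=26):
    """Return an (n_frames, n_mfcc) matrix of cepstral coefficients."""
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Triangular mel-spaced filterbank applied to the power spectrum.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)

    # 4. DCT-II decorrelates the log energies; keep the first n_mfcc terms.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n + 1) / (2 * n_mels))
    return mel_energy @ basis.T

# Usage: one second of a 440 Hz tone at 16 kHz yields 98 frames of 13 coefficients.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
print(features.shape)  # (98, 13)
```

The resulting per-frame coefficient vectors (13 per frame in this sketch, matching the 13-MFCC configuration reported for the 3-emotion case) are what would then be fed to the classifier in the next stage.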