International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-9 Issue-1, May 2020
Published By:
Blue Eyes Intelligence Engineering
& Sciences Publication
Retrieval Number: F9896038620/2020©BEIESP
DOI:10.35940/ijrte.F9896.059120
Abstract: Over recent years, much advancement has been made in
artificial intelligence, machine learning, human-machine
interaction, etc. Interacting with a machine by voice, or commanding
it to perform a specific task, is increasingly popular. Many consumer
electronics products are integrated with assistants such as Siri,
Alexa, Cortana, and Google Assistant. But machines have the
limitation that they cannot interact with a person like a human
conversational partner: they cannot recognize human emotions and
react to them. Emotion recognition from speech is a cutting-edge
research topic in the field of human-machine interaction. As machines
are indispensable to our lives, there is a demand for more robust
man-machine communication systems, and many researchers are
currently working on speech emotion recognition (SER) to improve
man-machine interaction. To achieve this goal, a computer should be
able to recognize emotional states and react to them in the same way
we humans do. The effectiveness of an SER system depends on the
quality of the extracted features and the type of classifier used. In
this paper we try to identify four basic emotions from speech: anger,
sadness, neutral, and happiness. Audio files of short Manipuri speech
taken from movies are used as the training and testing dataset. The
paper uses a convolutional neural network (CNN) to identify the four
emotions, with Mel Frequency Cepstral Coefficients (MFCC) as the
feature extraction technique.
Keywords: CNN, emotion recognition, human-machine interface, MFCC.
I. INTRODUCTION
Speech is the expression of one's sentiments or thoughts through
articulated sounds. A speech signal contains information about the
speaker, language, message, and feelings. Emotion is the expression
of human feelings; it may be conveyed through the face, movement, or
speech. Emotions are vital for passing on significant information.
Speech contains different kinds of emotions such as happiness,
sadness, fear, disgust, anger, and surprise. A detailed survey on
speech emotion recognition (SER) is given in [1], which discusses
features, classifier schemes, and databases. Emotion detection in
Assamese speech using Gaussian Mixture Model (GMM) classifiers and
Mel Frequency Cepstral Coefficients (MFCC) is described in [2]. A
method of emotion classification for speech using short-time log
frequency power coefficients (LFPC) and a discrete hidden Markov
model (HMM) as the classifier is described in [3]; the proposed
system classifies six emotions and yields 78% accuracy. [5] discusses
emotion recognition from speech using MFCC and DWT for a security
Revised Manuscript Received on May 21, 2020.
* Correspondence Author
G.R.Michael, Dept. of ECE, Dibrugarh University Dibrugarh , India.
Email: roberteld008@gmail.com
Dr Aditya Bihar Kandali., Electrical Department, Jorhat Engineering
college, Jorhat, India. Email: abkandali@rediffmail.com
system using an SVM classifier. In [6] the authors effectively
utilized a neural network for speech-based emotion identification;
they designed and trained the network to detect six essential
emotions from speech. In this paper we try to identify four basic
emotions from Manipuri speech using a CNN.
II. MEL FREQUENCY CEPSTRAL COEFFICIENT
Mel Frequency Cepstral Coefficients (MFCC) are the most widely used
features in automatic speech recognition. MFCC is based on human
auditory perception, which does not resolve frequencies above 1 kHz
on a linear scale. Fig. 1 shows the complete steps to obtain the MFCC
coefficients.
Fig.1 MFCC block
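As a rough end-to-end illustration of the pipeline in Fig. 1, the MFCC computation for a single windowed frame can be sketched in NumPy as follows. The filter-bank size of 26 and the 13 output coefficients are common defaults, not values stated in this paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    # 1. power spectrum of the (already windowed) frame
    power = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)
    # 2. triangular mel filter bank between 0 Hz and sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # 3. log filter-bank energies (small constant avoids log(0))
    log_e = np.log(fbank @ power + 1e-10)
    # 4. DCT-II to decorrelate the log energies -> cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e

# example: MFCCs of one 20 ms frame (320 samples at 16 kHz) of noise
frame = np.hamming(320) * np.random.randn(320)
ceps = mfcc_frame(frame, 16000)
# ceps.shape == (13,)
```

The individual stages of this sketch correspond to the sub-sections A-D below.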
A. Pre-emphasis
In the speech spectrum, more energy is concentrated at the lower
frequencies. Pre-emphasis boosts the energy of the higher-frequency
components: the speech signal is passed through a first-order filter

Y[n] = X[n] - 0.95 X[n-1]    (1)
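A minimal NumPy sketch of this pre-emphasis step, using the coefficient 0.95 from Eq. (1); the handling of the first sample (passed through unchanged) is an implementation choice, not stated in the paper:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    return np.append(x[0], x[1:] - alpha * x[:-1])

y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
# a constant (low-frequency) input is strongly attenuated: y[1:] == 0.05
```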
B. Framing and windowing
In this process the speech signal is segmented into 20 ms frames
(frame blocking); windowing is then done with a Hamming window using
the formula:

W(n) = 0.54 - 0.46 cos(2πn / (N-1)),  0 ≤ n ≤ N-1    (2)

where
W(n) = Hamming window
N = number of input samples per frame
n = sample index in the time domain
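The framing and windowing steps can be sketched in NumPy as follows; non-overlapping frames are assumed here, since the paper does not state a frame overlap:

```python
import numpy as np

def frame_and_window(x, sr, frame_ms=20):
    N = int(sr * frame_ms / 1000)              # samples per 20 ms frame
    n_frames = len(x) // N                     # non-overlapping frame blocking
    frames = x[:n_frames * N].reshape(n_frames, N)
    # Hamming window, W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return frames * w

sr = 16000
frames = frame_and_window(np.random.randn(sr), sr)  # 1 s of noise
# frames.shape == (50, 320): fifty 20 ms frames of 320 samples each
```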
C. FFT
The Fourier transform converts the N samples of each frame from the
time domain into the frequency domain. The FFT is used to find the
frequencies present in each frame [7].
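A small NumPy sketch of this step: applying the FFT to one 20 ms frame of a synthetic 1 kHz tone recovers the tone's frequency. The 8 kHz sample rate here is an assumption for illustration, not a value from the paper:

```python
import numpy as np

sr = 8000
t = np.arange(160) / sr                    # one 20 ms frame at 8 kHz
frame = np.sin(2 * np.pi * 1000 * t)       # 1 kHz test tone
spectrum = np.abs(np.fft.rfft(frame))      # magnitude spectrum of the frame
peak_hz = np.argmax(spectrum) * sr / len(frame)
# peak_hz == 1000.0: the strongest bin sits at the tone's frequency
```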
Emotion Recognition of Manipuri Speech using
Convolutional Neural Network
Gurumayum Robert Michael, Aditya Bihar Kandali