International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-9 Issue-1, May 2020. Retrieval Number: F9896038620/2020©BEIESP. DOI: 10.35940/ijrte.F9896.059120. Published by Blue Eyes Intelligence Engineering & Sciences Publication.

Abstract: Over recent years, much advancement has been made in artificial intelligence, machine learning, and human-machine interaction. Interacting with a machine by voice, or commanding it to perform a specific task, is increasingly popular, and many consumer electronics are integrated with assistants such as Siri, Alexa, Cortana, and Google Assistant. Machines, however, have the limitation that they cannot interact with a person like a human conversational partner: they cannot recognize human emotions and react to them. Emotion recognition from speech is a cutting-edge research topic in the field of human-machine interaction. As machines have become indispensable to our lives, there is a demand for more robust man-machine communication systems, and many researchers are currently working on speech emotion recognition (SER) to improve man-machine interaction. To achieve this goal, a computer should be able to recognize emotional states and react to them in the same way humans do. The effectiveness of an SER system depends on the quality of the extracted features and the type of classifier used. In this paper we try to identify four basic emotions from speech: anger, sadness, neutral, and happiness. Audio files of short Manipuri speech taken from movies are used as the training and testing dataset. This paper uses a convolutional neural network (CNN) to identify the four emotions, with Mel Frequency Cepstral Coefficients (MFCC) as the feature extraction technique.

Keywords: CNN, emotion recognition, human-machine interface, MFCC.

I. INTRODUCTION

Speech is the expression of one's sentiments or thoughts through articulated sounds. A speech signal contains information about the speaker, language, message, and emotions.
Emotion is the expression of human feelings; it may be conveyed through the face, movement, or speech. Emotions are vital for passing on significant information. Speech contains different kinds of emotions such as happiness, sadness, fear, disgust, anger, and surprise. A detailed survey on speech emotion recognition (SER) is given in [1], which discusses features, classifier schemes, and databases. Emotion detection in Assamese speech using Gaussian Mixture Model (GMM) classifiers and Mel Frequency Cepstral Coefficients (MFCC) is described in [2]. A method of emotion classification for speech using short-time log frequency power coefficients (LFPC) with a discrete hidden Markov model (HMM) as the classifier is described in [3]; the proposed system yields 78% accuracy and discusses the classification of six emotions. [5] discusses emotion recognition from speech using MFCC and DWT for a security system with an SVM classifier. In [6], the authors effectively utilized a neural network for speech-based emotion identification; they designed and trained the network to detect six essential emotions from speech. In this paper we try to identify four basic emotions from Manipuri speech using a CNN.

Revised Manuscript Received on May 21, 2020. *Correspondence Author: G. R. Michael, Dept. of ECE, Dibrugarh University, Dibrugarh, India. Email: roberteld008@gmail.com. Dr. Aditya Bihar Kandali, Electrical Department, Jorhat Engineering College, Jorhat, India. Email: abkandali@rediffmail.com.

II. MEL FREQUENCY CEPSTRAL COEFFICIENT

Mel Frequency Cepstral Coefficients (MFCC) are the most widely used feature extraction method in automatic speech recognition. MFCC is based on the characteristics of human hearing, which does not perceive frequencies above 1 kHz linearly. Fig. 1 shows the complete steps to obtain the MFCC coefficients.

Fig. 1. MFCC block

A. Pre-emphasis

In the speech spectrum, more energy is concentrated at the lower frequencies. Pre-emphasis boosts the energy of the signal at the higher frequencies.
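As a minimal sketch (using NumPy; the 0.95 coefficient follows the filter used in this section), pre-emphasis can be implemented as:

```python
import numpy as np

def pre_emphasize(x, alpha=0.95):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    # Keep the first sample as-is; filter the remaining samples.
    return np.append(x[0], x[1:] - alpha * x[:-1])

# A low-frequency (slowly varying) signal is attenuated, while a
# high-frequency (rapidly alternating) one is boosted.
slow = np.ones(100)                 # constant signal (0 Hz)
fast = np.array([1.0, -1.0] * 50)   # alternating signal (Nyquist rate)
print(np.abs(pre_emphasize(slow)[1:]).max())   # small, about 0.05
print(np.abs(pre_emphasize(fast)[1:]).max())   # large, about 1.95
```

The two printed values illustrate why this step flattens the speech spectrum: the filter suppresses slowly varying components and amplifies fast ones.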
The speech signal is passed through a first-order filter which boosts its high-frequency energy:

Y[n] = X[n] − 0.95 · X[n−1]    (1)

B. Framing and windowing

In this process the speech signal is segmented into 20 ms frames (frame blocking); each frame is then windowed with a Hamming window, given by:

w(n) = 0.54 − 0.46 · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1    (2)

where w(n) is the Hamming window, N is the number of input samples, and n is the sample index in the time domain.

C. FFT

The Fourier transform converts the N samples of each frame from the time domain into the frequency domain. The FFT is used to find all the frequencies present in a particular frame [7].

Emotion Recognition of Manipuri Speech using Convolution Neural Network
Gurumayum Robert Michael, Aditya Bihar Kandali
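The frame blocking, Hamming windowing, and FFT steps of Section II can be sketched as follows (a NumPy sketch; the 20 ms frame length follows the text, while the 10 ms hop between frames and the 16 kHz sampling rate are illustrative assumptions, not values from the paper):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (frame blocking)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def windowed_spectra(x, sr, frame_ms=20, hop_ms=10):
    """Frame the signal, apply a Hamming window, and take FFT magnitudes."""
    frame_len = int(sr * frame_ms / 1000)   # 20 ms frames, as in the text
    hop_len = int(sr * hop_ms / 1000)       # 10 ms hop (assumed overlap)
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop_len)
    # np.hamming implements w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    return np.abs(np.fft.rfft(frames * window, axis=1))

# A pure 440 Hz tone: each frame's spectral peak lands in the bin nearest 440 Hz.
sr = 16000
t = np.arange(sr) / sr
spectra = windowed_spectra(np.sin(2 * np.pi * 440 * t), sr)
frame_len = int(sr * 0.020)                     # 320 samples per frame
peak_hz = spectra[0].argmax() * sr / frame_len  # bin index -> frequency in Hz
print(round(peak_hz))
```

With 320-sample frames the frequency resolution is 50 Hz, so the peak appears in the nearest available bin to 440 Hz; finer resolution would require longer frames.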