International Journal of Computer Applications (0975-8887), Volume 145, No. 8, July 2016

Emotion Recognition and Classification in Speech using Artificial Neural Networks

Akash Shaw, Rohan Kumar Vardhan, Siddharth Saxena
National Institute of Technology, Warangal, 506004 Telangana, India

ABSTRACT
To date, relatively little research has addressed emotion classification and recognition in speech. This article therefore motivates the topic and presents a system for classifying and recognizing emotions from speech using artificial neural networks. The proposed system is speaker independent, since it is trained on a database of emotional speech samples. Various classifiers are used to differentiate emotions such as neutral, anger, happiness, and sadness. Prosodic features such as pitch, energy, and formant frequencies, together with spectral features such as mel frequency cepstral coefficients (MFCCs), are used in the system. The classifiers are trained on these features to classify emotions accurately; following classification, the same features are used to recognize the emotion of a given speech sample. Thus, several components, including pre-processing of speech, MFCC features, prosodic features, and classifiers, come together in the implementation of a speech-based emotion recognition system.

General Terms
Pattern Recognition, Speech.

Keywords
ANN, MFCC, prosodic features, emotion classification and recognition, pre-processing.

1. INTRODUCTION
Human-computer interaction has received a great deal of attention of late. It is one of the most popular areas of research and has great potential. Teaching a computer to understand human emotions is an important aspect of this interaction. Many successful applications related to speech recognition are already available in the market.
People can use their voice to give commands to cars, cell phones, computers, televisions, and many other electrical devices. Making a computer understand human emotions, and thereby provide a better interaction experience, is thus a very interesting challenge.

The most common approach to speech emotion recognition is to extract features that are related to the various emotional states from the speech signal (for example, energy is an important feature for distinguishing happiness from sadness), feed these features to the input of a classifier, and obtain the different emotions at its output. This process is shown in Figure 1. In this paper, the aim is to classify a batch of recorded speech signals into four categories, namely: happy, sad, angry, and neutral.

Before feature extraction, pre-processing is performed on the speech signals. Samples are taken from the speech, converting the analog signal into a digital one. Each sentence is then normalized to ensure that all the sentences lie in the same volume range. Finally, segmentation separates the signal into frames, so that the speech signal maintains its characteristics over each short duration.

Commonly used features are chosen for study and subsequently extracted. Energy is the most basic feature of a speech signal. Pitch is frequently used in this area, and autocorrelation is used to detect the pitch in each frame; statistical values are then calculated over the per-frame pitch estimates. The formant is another important feature: the Linear Predictive Coding (LPC) method is used to extract the first formant and, as with pitch, statistical values are calculated for it. The mel frequency cepstral coefficients (MFCCs) are a representation of the short-term power spectrum on a perceptually motivated mel scale of frequency. The first three MFCCs are taken, and their means and variances are derived.
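To make the per-frame feature extraction concrete, the following is a minimal sketch of two of the features described above, short-time energy and autocorrelation-based pitch detection, using only NumPy. The frame length, hop size, sampling rate, and F0 search range are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames
    (e.g. 25 ms frames with a 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Energy of each frame: sum of squared samples."""
    return np.sum(frames ** 2, axis=1)

def pitch_autocorr(frame, fs=16000, fmin=80, fmax=400):
    """Estimate the pitch of one frame by locating the autocorrelation
    peak within a plausible F0 lag range (fmin..fmax Hz)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# Toy example: a pure 200 Hz tone should yield pitch estimates near 200 Hz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
frames = frame_signal(x)
energies = short_time_energy(frames)
pitches = np.array([pitch_autocorr(f, fs) for f in frames])
```

As the paper describes, statistics (means, variances) of these per-frame values would then form part of the classifier's input vector; formant extraction via LPC and MFCC computation follow the same frame-by-frame pattern.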
All features of the speech samples are fed into an Artificial Neural Network (ANN). The network takes an input matrix of features together with a target matrix that indicates the emotion state of each sentence. The ANN is used to train and test on the data and perform the classification; finally, mean square error and confusion matrix figures are presented to show how good the performance is. [1][2][11][12]

Figure 1: Flow of emotion recognition and classification

2. PRE-PROCESSING FOR EMOTION RECOGNITION
Prior to feature extraction, some necessary steps are taken to manipulate the speech signal. Pre-processing mainly comprises sampling, normalization, and segmentation.

Figure 2: Pre-processing for emotion recognition

A speech signal is analog in form and needs to be converted into digital form for processing. The analog signal is converted into a discrete-time signal with the help of sampling, which ensures that the original characteristics of the signal are retained. According to the sampling theorem, when the sampling frequency is greater than or equal to twice the maximum