Speech Emotion Recognition Using Support Vector Machines
Thapanee Seehapoch¹, Sartra Wongthanavasu²
¹Cellular Automata and Knowledge Engineering (CAKE) Laboratory
²Machine Learning and Intelligent Systems (MLIS) Laboratory
Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen 40002, Thailand
¹s.thapanee@kkumail.com, ²wongsar@kku.ac.th
Abstract—Automatic recognition of emotional states from human speech is an active research topic with a wide range of applications. In this paper, an attempt is made to recognize and classify speech emotions from three language databases, namely the Berlin, Japanese, and Thai emotion databases. Speech features consisting of Fundamental Frequency (F0), Energy, Zero Crossing Rate (ZCR), Linear Predictive Coding (LPC), and Mel Frequency Cepstral Coefficients (MFCC) from short-time wavelet signals are comprehensively investigated. In this regard, Support Vector Machines (SVM) are utilized as the classification model. Empirical experimentation shows that the combined features of F0, Energy, and MFCC with a linear kernel provide the highest accuracy on all databases, giving 89.80%, 93.57%, and 98.00% classification accuracy for the Berlin, Japanese, and Thai emotion databases, respectively.
Keywords—Speech Emotion Recognition; Support Vector Machines
I. INTRODUCTION
The way of speaking is very important in human communication. It is the most natural way to express different meanings and to convey the emotion and feeling of the speaker. The tone of voice also expresses emotional state. Sometimes a person utters a sentence while in a particular emotional state, and the tone of speech changes the meaning of the sentence completely.
To date, Automatic Speech Emotion Recognition (ASER) has been a very active research area in the field of Human-Computer Interaction (HCI), with a wide range of applications. For example, in an e-learning system, the computer can analyze the emotions of a student and adjust the learning content accordingly. In an automatic remote call center, it is used to detect customer dissatisfaction in a timely manner. In robotics, teaching robots to perceive and respond to human emotions makes it possible to detect a person's stress. Even in medicine, a patient's emotions can be examined to help diagnose mental illness.
In recent years, many speech databases were built for
speech emotion research, such as Danish Emotional Speech
corpus (DES) [1], Berlin Emotional Database (EMO-DB) [2],
Spanish Emotional Speech Database (SES) [3], Chinese
Emotion Speech Database [4], Japanese Emotional Speech
Database [5], etc. The speech emotion recognition process has two major components: feature extraction and emotion classification. Feature extraction is the important step of finding a representation of the speech signal that expresses the emotion. Prosodic features and spectral features can both be used for speech emotion recognition because both contain emotional information. Many researchers have tried to extract important speech features such as pitch, energy, formant frequency [6], jitter, shimmer [7], Zero Crossing Rate (ZCR) [8], Linear Predictive Coding (LPC), Linear Prediction-based Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC) [9], Postfiltered Cepstral Coefficients (PFC), Greenwood Function Cepstral Coefficients (GFCC) [10], Perceptual Linear Prediction (PLP) Cepstral Coefficients [11], and RASTA-PLP [12], etc. In [13], LPC, jitter, and energy were used as features, with a reported classification accuracy of 62.35%. In [14], LPCC and MFCC were used, with a reported classification accuracy of 83.9%. In other works such as [15], energy, zero crossing rate, and fundamental frequency were used, reaching a classification accuracy of 97.8%. For emotion classification, many researchers have explored several promising classification methods, such as k-Nearest Neighbor (k-NN) [16], Support Vector Machines (SVM) [17], Neural Networks (NN) [18], and Hidden Markov Models (HMM) [19].
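Two of the short-time features surveyed above, energy and zero crossing rate, are simple enough to sketch directly. The fragment below is an illustration, not the authors' implementation; the frame length and hop size are assumed values:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Demo: a 440 Hz tone at 16 kHz crosses zero rarely; white noise crosses often.
sr = 16000
t = np.arange(sr) / sr
tone_frames = frame_signal(np.sin(2 * np.pi * 440 * t))
noise_frames = frame_signal(np.random.default_rng(0).standard_normal(sr))
print(zero_crossing_rate(tone_frames).mean() < zero_crossing_rate(noise_frames).mean())  # True
```

The contrast in the demo is exactly why ZCR carries voicing information: voiced (periodic) speech behaves like the tone, unvoiced fricatives like the noise.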
As investigated, a number of the stated features play a vital role in accuracy performance. In addition, promising classification models are capable of raising the peak accuracy rate. This paper investigates and integrates these features to arrive at the highest accuracy performance using Support Vector Machines.
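As a hedged illustration of the classification step, the sketch below trains a linear-kernel SVM (the kernel reported best in the abstract) with scikit-learn. The synthetic stand-in feature vectors and class structure are assumptions for demonstration, not the paper's databases:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Toy stand-in for per-utterance feature vectors (e.g., statistics of
# F0, energy, and MFCCs); two synthetic "emotion" classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(1.5, 1.0, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

# Standardizing before the SVM matters: SVM margins are scale-sensitive.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 2))
```

Swapping `kernel="linear"` for `"rbf"` or `"poly"` is how one would compare kernels on the same features.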
The paper is organized as follows. Section II provides the details of ASER. Section III discusses the experimentation. Results and performance comparisons are given in Section IV. Section V gives conclusions and discussion.
II. SPEECH EMOTION RECOGNITION SYSTEM
The structure of the speech emotion recognition system studied in this paper is depicted in Figure 1. The speech signal is first pre-processed by pre-emphasis, framing, and windowing. In this paper, five short-time features are extracted: Fundamental Frequency (F0), Energy, Zero Crossing Rate (ZCR), Linear Predictive Coding (LPC)
2013 5th International Conference on Knowledge and Smart Technology (KST)
978-1-4673-4853-9/13/$31.00 ©2013 IEEE
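The pre-processing chain of pre-emphasis, framing, and windowing described above can be sketched as follows. The pre-emphasis coefficient (0.97) and the 25 ms / 10 ms frame and hop sizes at 16 kHz are common choices assumed here, not taken from the paper:

```python
import numpy as np

def preprocess(x, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasize, frame, and Hamming-window a speech signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing: overlapping frames (25 ms frames, 10 ms hop at 16 kHz defaults).
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)

x = np.random.default_rng(1).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = preprocess(x)
print(frames.shape)  # (98, 400)
```

Each windowed frame is then the input to the feature extractors (F0, energy, ZCR, LPC, MFCC) listed above.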