Speech Emotion Recognition Using Support Vector Machines
Thapanee Seehapoch¹, Sartra Wongthanavasu²
¹Cellular Automata and Knowledge Engineering (CAKE) Laboratory
²Machine Learning and Intelligent Systems (MLIS) Laboratory
Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen 40002, Thailand
¹s.thapanee@kkumail.com, ²wongsar@kku.ac.th
Abstract—Automatic recognition of emotional states from human speech is an active research topic with a wide range of applications. In this paper, an attempt is made to recognize and classify speech emotions from three language databases, namely the Berlin, Japanese, and Thai emotion databases. Speech features consisting of Fundamental Frequency (F0), Energy, Zero Crossing Rate (ZCR), Linear Predictive Coding (LPC), and Mel Frequency Cepstral Coefficients (MFCC) from short-time wavelet signals are comprehensively investigated. In this regard, Support Vector Machines (SVM) are utilized as the classification model. Empirical experimentation shows that the combined features of F0, Energy, and MFCC with a linear kernel provide the highest accuracy on all databases, giving 89.80%, 93.57%, and 98.00% classification accuracy for the Berlin, Japanese, and Thai emotion databases, respectively.
Keywords—Speech Emotion Recognition; Support Vector Machines
I. INTRODUCTION
The way of speaking is very important in human communication. It is the most natural way to express different meanings and to convey the emotion and feeling of the speaker. The tone of voice also expresses emotional state. Sometimes a person utters a sentence while in a particular emotional state, and the tone of speech changes the meaning of the sentence completely.
To date, Automatic Speech Emotion Recognition (ASER) has been a very active research area in the field of Human-Computer Interaction (HCI), with a wide range of applications. For example, in an e-learning system, the computer can analyze the emotions of a student and adjust the learning content accordingly. In an automatic remote call center, it is used to detect customer dissatisfaction in a timely manner. In robotics, teaching robots to perceive and respond to human emotions makes it possible to detect a person's stress. Even in medicine, a patient's emotions can be examined to help diagnose mental illness.
In recent years, many speech databases were built for
speech emotion research, such as Danish Emotional Speech
corpus (DES) [1], Berlin Emotional Database (EMO-DB) [2],
Spanish Emotional Speech Database (SES) [3], Chinese
Emotion Speech Database [4], Japanese Emotional Speech
Database [5], etc. The speech emotion recognition process has two major components: feature extraction and emotion classification. Feature extraction is the important step of finding a representation of the speech signal that expresses the emotion. Prosodic features and spectral features can both be used for speech emotion recognition because both contain emotional information. Many researchers have tried to extract important speech features such as pitch, energy, formant frequency [6], jitter, shimmer [7], Zero Crossing Rate (ZCR) [8], Linear Predictive Coding (LPC), Linear Prediction-based Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC) [9], Postfiltered Cepstral Coefficients (PFC), Greenwood Function Cepstral Coefficients (GFCC) [10], Perceptual Linear Prediction (PLP) Cepstral Coefficients [11], and RASTA-PLP [12], etc. In [13], LPC, jitter, and energy were used as features, with a reported classification accuracy of 62.35%. In [14], LPCC and MFCC were used, with a reported classification accuracy of 83.9%. In other works such as [15], energy, zero crossing rate, and fundamental frequency were used, reaching a classification accuracy of 97.8%. For emotion classification, many researchers have explored several promising classification methods, such as k-Nearest Neighbor (k-NN) [16], Support Vector Machines (SVM) [17], Neural Networks (NN) [18], and Hidden Markov Models (HMM) [19].
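Two of the short-time features surveyed above, energy and zero crossing rate, are simple enough to sketch directly. The fragment below is an illustration, not the authors' implementation; the frame length and hop size are assumed values:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Demo: a 440 Hz tone at 16 kHz crosses zero rarely; white noise crosses often.
sr = 16000
t = np.arange(sr) / sr
tone_frames = frame_signal(np.sin(2 * np.pi * 440 * t))
noise_frames = frame_signal(np.random.default_rng(0).standard_normal(sr))
print(zero_crossing_rate(tone_frames).mean() < zero_crossing_rate(noise_frames).mean())  # True
```

The contrast in the demo is exactly why ZCR carries voicing information: voiced (periodic) speech behaves like the tone, unvoiced fricatives like the noise.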
As investigated, a number of the stated features play a vital role in accuracy performance. In addition, promising classification models are capable of raising the peak accuracy rate. This paper investigates and integrates these features to arrive at the highest accuracy performance using Support Vector Machines.
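As a hedged illustration of the classification step, the sketch below trains a linear-kernel SVM (the kernel reported best in the abstract) with scikit-learn. The synthetic stand-in feature vectors and class structure are assumptions for demonstration, not the paper's databases:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Toy stand-in for per-utterance feature vectors (e.g., statistics of
# F0, energy, and MFCCs); two synthetic "emotion" classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(1.5, 1.0, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

# Standardizing before the SVM matters: SVM margins are scale-sensitive.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 2))
```

Swapping `kernel="linear"` for `"rbf"` or `"poly"` is how one would compare kernels on the same features.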
The paper is organized as follows. Section II provides the details of ASER. Section III discusses the experimentation. Results and performance comparisons are given in Section IV. Section V gives conclusions and discussion.
II. SPEECH EMOTION RECOGNITION SYSTEM
The structure of the speech emotion recognition system studied in this paper is depicted in Figure 1. The speech signal is first pre-processed by pre-emphasis, framing, and windowing. In this paper, five short-time features are extracted: Fundamental Frequency (F0), Energy, Zero Crossing Rate (ZCR), Linear Predictive Coding (LPC)
2013 5th International Conference on Knowledge and Smart Technology (KST)
978-1-4673-4853-9/13/$31.00 ©2013 IEEE
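The pre-processing chain of pre-emphasis, framing, and windowing described above can be sketched as follows. The pre-emphasis coefficient (0.97) and the 25 ms / 10 ms frame and hop sizes at 16 kHz are common choices assumed here, not taken from the paper:

```python
import numpy as np

def preprocess(x, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasize, frame, and Hamming-window a speech signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing: overlapping frames (25 ms frames, 10 ms hop at 16 kHz defaults).
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)

x = np.random.default_rng(1).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = preprocess(x)
print(frames.shape)  # (98, 400)
```

Each windowed frame is then the input to the feature extractors (F0, energy, ZCR, LPC, MFCC) listed above.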