Int J Speech Technol (2009) 12: 1–13
DOI 10.1007/s10772-009-9046-4
Vocal emotion recognition in five native languages of Assam using
new wavelet features
Aditya Bihar Kandali · Aurobinda Routray ·
Tapan Kumar Basu
Received: 6 August 2009 / Accepted: 13 October 2009 / Published online: 29 October 2009
© Springer Science+Business Media, LLC 2009
Abstract The present work investigates the following specific research questions concerning vocal emotion recognition: whether vocal expressions of discrete emotions (i) can be distinguished from no-emotion (i.e. neutral), (ii) can be distinguished from one another, (iii) whether surprise, which is actually a cognitive component that may accompany any emotion, can also be recognized as a distinct emotion, and (iv) whether these expressions can be recognized cross-lingually. This study will provide more information regarding the nature and function of emotion. Furthermore, this work will help in developing a generalized vocal emotion recognition system, which will increase the efficiency of human-machine interaction systems. In this work, an emotional utterance database is created with 140 acted utterances per speaker, consisting of short sentences expressing six full-blown basic emotions and the neutral state in five native languages of Assam. This database is validated by a listening test. Four feature sets are extracted, based on WPCC2 (Wavelet Packet Cepstral Coefficients computed by method 2), MFCC (Mel-Frequency Cepstral Coefficients), tfWPCC2 (Teager-energy-operated-in-transform-domain WPCC2) and tfMFCC. The Gaussian Mixture Model (GMM) is used as the classifier. The performances of all these feature sets are compared with respect to classification accuracy in two experiments: (i) text- and speaker-independent vocal emotion recognition in individual languages, and (ii) cross-lingual vocal emotion recognition. tfWPCC2 is a new wavelet feature set proposed by the authors in a recent paper presented at a National Seminar in India, as cited in the References.
Keywords Full-blown basic emotion · Vocal emotion recognition · GMM classifier · MFCC · WPCC · Teager energy operator

A.B. Kandali (✉) · A. Routray
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, PIN-721302, West Bengal, India
e-mail: abkandali@rediffmail.com

A. Routray
e-mail: aroutray@ee.iitkgp.ac.in

T.K. Basu
Aliah University, DN 47, Sector 5, Salt Lake City, Kolkata, India
e-mail: basutk02@yahoo.co.in
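As a concrete illustration of the classification pipeline summarized in the abstract, the following is a minimal sketch, not the authors' implementation, of MFCC feature extraction followed by per-emotion GMM classification. It assumes the librosa and scikit-learn libraries; the helper names (extract_mfcc, train_emotion_models, classify) and the parameter values (8 kHz sampling rate, 13 coefficients, 8 mixture components) are illustrative and not taken from the paper.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=8000, n_mfcc=13):
    """Load an utterance and return its frame-wise MFCC matrix (frames x coeffs)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one n_mfcc-dimensional feature vector per analysis frame

def train_emotion_models(train_files, n_components=8):
    """Fit one GMM per emotion on the pooled MFCC frames of its training utterances.

    train_files: dict mapping emotion label -> list of wav paths (hypothetical layout).
    """
    models = {}
    for emotion, paths in train_files.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        gmm.fit(frames)
        models[emotion] = gmm
    return models

def classify(wav_path, models):
    """Assign the emotion whose GMM gives the highest average log-likelihood."""
    frames = extract_mfcc(wav_path)
    return max(models, key=lambda e: models[e].score(frames))

In this sketch, a wavelet-based feature set such as WPCC2 or tfWPCC2 would simply replace extract_mfcc; the per-emotion GMMs and the maximum-likelihood decision rule stay the same.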
1 Introduction
Human beings express emotions explicitly in speech, face, gait and other body language, along with internal physiological signals such as muscle voltage, blood volume pressure, skin conductivity and respiration. Vocal expressions are harder to regulate than the other explicit emotion signals, so it is possible to infer the actual affective state of a speaker from her/his voice without any physical contact. However, exact identification of emotion from the voice is very difficult, for several reasons. Speech consists broadly of two components coded simultaneously: (i) “what is said” and (ii) “how it is said”. The first component carries the linguistic information, pronounced according to the sounds of the language. The second is the non-linguistic, paralinguistic or supra-segmental component, which includes the prosody of the language, i.e. the pitch, intensity and speaking-rate rules that give lexical and grammatical emphasis to the spoken message, and the prosody of emotion, which expresses the affective state of the speaker. In addition, speakers possess their own style, i.e. a characteristic articulation rate, intonation habit and loudness. The voice also contains information about the speaker’s identity,