Int J Speech Technol (2009) 12: 1–13
DOI 10.1007/s10772-009-9046-4
Vocal emotion recognition in five native languages of Assam using
new wavelet features
Aditya Bihar Kandali · Aurobinda Routray ·
Tapan Kumar Basu
Received: 6 August 2009 / Accepted: 13 October 2009 / Published online: 29 October 2009
© Springer Science+Business Media, LLC 2009
Abstract The present work investigates the following specific research questions concerning vocal emotion recognition: whether vocal expressions of discrete emotions (i) can be distinguished from no-emotion (i.e. neutral), (ii) can be distinguished from one another, (iii) whether surprise, which is actually a cognitive component that may accompany any emotion, can also be recognized as a distinct emotion, and (iv) whether these expressions can be recognized cross-lingually. This study will provide more information regarding the nature and function of emotion. Furthermore, this work will help in developing a generalized vocal emotion recognition system, which will increase the efficiency of human-machine interaction systems. In this work, an emotional utterance database is created with 140 acted utterances per speaker, consisting of short sentences expressing six full-blown basic emotions and the neutral state in five native languages of Assam. This database is validated by a listening test. Four feature sets are extracted, based on WPCC2 (Wavelet Packet Cepstral Coefficients computed by method 2), MFCC (Mel-Frequency Cepstral Coefficients), tfWPCC2 (Teager-energy-operated-in-transform-domain WPCC2) and tfMFCC. The Gaussian Mixture Model (GMM) is used as the classifier. The performances of all these feature sets are compared with respect to classification accuracy in two experiments: (i) text- and speaker-independent vocal emotion recognition in individual languages, and (ii) cross-lingual vocal emotion recognition. tfWPCC2 is a new wavelet feature set proposed by the authors in a recent paper presented at a National Seminar in India, as cited in the References.
Keywords Full-blown basic emotion · Vocal emotion recognition · GMM classifier · MFCC · WPCC · Teager energy operator

A.B. Kandali (✉) · A. Routray
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, PIN-721302, West Bengal, India
e-mail: abkandali@rediffmail.com

A. Routray
e-mail: aroutray@ee.iitkgp.ac.in

T.K. Basu
Aliah University, DN 47, Sector 5, Salt Lake City, Kolkata, India
e-mail: basutk02@yahoo.co.in
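As a concrete illustration of the classification pipeline summarized in the abstract, the following is a minimal sketch, not the authors' implementation, of MFCC feature extraction followed by per-emotion GMM classification. It assumes the librosa and scikit-learn libraries; the helper names (extract_mfcc, train_emotion_models, classify) and the parameter values (8 kHz sampling rate, 13 coefficients, 8 mixture components) are illustrative and not taken from the paper.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=8000, n_mfcc=13):
    """Load an utterance and return its frame-wise MFCC matrix (frames x coeffs)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one n_mfcc-dimensional feature vector per analysis frame

def train_emotion_models(train_files, n_components=8):
    """Fit one GMM per emotion on the pooled MFCC frames of its training utterances.

    train_files: dict mapping emotion label -> list of wav paths (hypothetical layout).
    """
    models = {}
    for emotion, paths in train_files.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        gmm.fit(frames)
        models[emotion] = gmm
    return models

def classify(wav_path, models):
    """Assign the emotion whose GMM gives the highest average log-likelihood."""
    frames = extract_mfcc(wav_path)
    return max(models, key=lambda e: models[e].score(frames))

In this sketch, a wavelet-based feature set such as WPCC2 or tfWPCC2 would simply replace extract_mfcc; the per-emotion GMMs and the maximum-likelihood decision rule stay the same.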
1 Introduction
Human beings express emotions explicitly in speech, face, gait and other body language, along with internal physiological signals such as muscle voltage, blood volume pressure, skin conductivity and respiration. Vocal expressions are harder to regulate than the other explicit emotion signals, so it is possible to infer the actual affective state of a speaker from her/his voice without any physical contact. However, exact identification of emotion from the voice is very difficult, for several reasons. Speech consists broadly of two components coded simultaneously: (i) “what is said” and (ii) “how it is said”. The first component carries the linguistic information, pronounced according to the sounds of the language. The second is the non-linguistic, paralinguistic or supra-segmental component, which includes the prosody of the language, i.e. the pitch, intensity and speaking-rate rules that give lexical and grammatical emphasis to the spoken message, and the prosody of emotion, which expresses the affective state of the speaker. In addition, speakers possess their own style, i.e. a characteristic articulation rate, intonation habit and loudness. The voice also contains information about the speaker’s identity,