Abstract: Emotion in speech is an issue that has attracted the interest of the speech community for many years, both in the context of speech synthesis and in automatic speech recognition (ASR). In spite of the remarkable recent progress in Large Vocabulary Recognition (LVR), the technology is still far from the ultimate goal of recognising free conversational speech uttered by any speaker in any environment. Current experiments show that the error rate of state-of-the-art large vocabulary recognition systems increases substantially when they are applied to spontaneous/emotional speech. This paper shows that the recognition rate for emotionally coloured speech can be improved by using a language model built on an increased representation of emotional utterances.

Keywords: Statistical language model, N-grams, emotionally coloured speech

Manuscript received January 8, 2006. Theologos Athanaselis, Stelios Bakamidis, and Ioannis Dologlou are with the Institute for Language and Speech Processing, Artemidos 6 and Epidavrou, Maroussi, Athens, Greece, GR-15125 (phone: +302106875416; fax: +302106854270; e-mail: tathana@ilsp.gr, bakam@ilsp.gr, ydol@ilsp.gr).

I. INTRODUCTION

RECOGNISING the verbal content of emotional speech is a difficult problem, and the recognition rates reported in the literature are in fact low. Although knowledge in the area has been developing rapidly, it is still limited in fundamental ways. The first issue is that not much of the spectrum of emotionally coloured expressions has been studied. The second issue is that most research on speech and emotion has focused on recognising the emotion being expressed, rather than on the classic Automatic Speech Recognition (ASR) problem of recovering the verbal content of the speech. Read speech, and non-read speech in a ‘careful’ style, can be recognised with an accuracy higher than 95% using state-of-the-art speech recognition technology. Including information about prosody improves the recognition rate for emotions simulated by actors, but its relevance to the freer patterns of spontaneous speech is unproven. Phonetic descriptions of emotional speech show that it has multiple features which would be expected to pose problems for ASR systems. Five areas of difficulty stand out: 1) source [1], 2) intensity [2], 3) speech quality [3], 4) prosody [4], and 5) timing [5].

One solution to the problem of emotional speech recognition is to modify the training process so that recognition is sensitive to prosodic information; Polzin & Waibel [6] show that this strategy can be effective. This paper deals with a second, complementary strategy. It is well known that emotion affects language as well as speech variables. The important issue is therefore to identify corpora that reflect emotion-influenced language, so that emotion-oriented language models can be learned from them. Here, the language models are derived by adapting an already existing corpus, the British National Corpus (BNC). An emotional lexicon is used to identify emotionally coloured words, and sentences containing these words are recombined with the BNC to form a corpus with a raised proportion of emotional material. This paper confirms that emotion does have major effects on the recognition rate.
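As an illustration of the corpus-adaptation step just described, the sketch below (Python) selects sentences that contain words from an emotional lexicon and over-represents them alongside the base corpus. The file names, lexicon format, and repetition factor are assumptions made for the example, not the values or tools used in this work.

# Minimal sketch of raising the proportion of emotional material in a corpus.
# Assumptions (not from the paper): "bnc_sentences.txt" holds one sentence per
# line, "emotional_lexicon.txt" holds one emotionally coloured word per line,
# and each matching sentence is simply repeated a fixed number of times.

def load_lexicon(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def build_enhanced_corpus(corpus_path, lexicon_path, out_path, repeat=3):
    lexicon = load_lexicon(lexicon_path)
    with open(corpus_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for sentence in src:
            dst.write(sentence)                      # keep the original corpus
            words = {w.strip(".,!?;:'\"").lower() for w in sentence.split()}
            if words & lexicon:                      # emotionally coloured sentence
                dst.write(sentence * (repeat - 1))   # over-represent it

if __name__ == "__main__":
    build_enhanced_corpus("bnc_sentences.txt", "emotional_lexicon.txt",
                          "enhanced_corpus.txt")

The N-gram language model is then trained on the resulting corpus rather than on the original one; this corresponds to the enhanced language model evaluated later in the paper.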
The aim of this paper is to investigate the performance of a speech recognition system based on an emotion-oriented language model, on material that presents emotional variability. For experimental purposes, a set of four different emotional characters is used. The paper is organized as follows: Section 2 presents the architecture of the speech recognition engine. Section 3 describes the basic language model, while Section 4 gives a detailed presentation of the generation of the enhanced language model. The experimental scheme and the results obtained with the basic and the enhanced language models are discussed in Section 5, and concluding remarks are made in Section 6.

II. SYSTEM ARCHITECTURE

The proposed large vocabulary continuous speech recognition system is based on Hidden Markov Models (HMM) [7]. The unknown speech input is converted into a sequence of acoustic vectors Y = y1, y2, ..., yn by means of a parameter extraction module. The goal of the LVR system is to determine the most probable word sequence Ŵ given the observed acoustic signal Y, based on Bayes' rule, which decomposes the required probability P(W|Y) into two components, that is,

    Ŵ = argmax_W P(W|Y) = argmax_W P(Y|W) P(W),

where P(Y|W) is the probability of the acoustic observations given the word sequence (the acoustic model) and P(W) is the prior probability of the word sequence (the language model).
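To make the decomposition concrete, the following is a minimal sketch (Python) of the decision rule Ŵ = argmax_W P(Y|W) P(W) evaluated in the log domain. The acoustic scorer is a placeholder and the search runs over a small explicit candidate list; in the actual system the search is carried out by the HMM decoder, and P(W) comes from the N-gram language model trained on the basic or the enhanced corpus.

import math

# Illustration of the Bayes decision rule W_hat = argmax_W P(Y|W) * P(W).
# The acoustic scorer is a dummy placeholder; the language model here is a
# toy bigram table with a crude fallback (purely illustrative).

def acoustic_log_prob(acoustic_vectors, words):
    """Placeholder for log P(Y|W) as computed by the HMM acoustic model."""
    return -1.0 * len(words)

def language_model_log_prob(words, lm):
    """Log P(W) under a toy bigram table; (None, w) entries act as fallback."""
    logp = 0.0
    for prev, curr in zip(["<s>"] + words[:-1], words):
        logp += math.log(lm.get((prev, curr), lm.get((None, curr), 1e-8)))
    return logp

def decode(acoustic_vectors, candidates, lm):
    """Return the candidate word sequence maximising log P(Y|W) + log P(W)."""
    return max(candidates,
               key=lambda words: acoustic_log_prob(acoustic_vectors, words)
                                 + language_model_log_prob(words, lm))

if __name__ == "__main__":
    lm = {("<s>", "i"): 0.2, ("i", "am"): 0.3,
          ("am", "angry"): 0.10, (None, "calm"): 0.05}
    print(decode([], [["i", "am", "angry"], ["i", "am", "calm"]], lm))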