FILTERING-BASED ANALYSIS OF SPECTRAL AND TEMPORAL EFFECTS OF ROOM MODES ON LOW-LEVEL DESCRIPTORS OF EMOTIONALLY COLOURED SPEECH

Martin Gottschalk 1, Juliane Höbel-Müller 2, Ingo Siegert 3, Jesko Verhey 1, Andreas Wendemuth 2

1 Department of Experimental Audiology, 2 Chair of Cognitive Systems, 3 Mobile Dialog Systems, Otto von Guericke University Magdeburg

martin.gottschalk@med.ovgu.de

Abstract: Emotion recognition in far-field speech is challenging due to various acoustic factors. The present contribution specifically considers dominant low-frequency room modes, which are often found in small rooms and cause the low-frequency acoustical response to vary across listening locations. The impact of this spatial variation on the low-level descriptors used in feature sets for speech emotion recognition has not been analysed in detail so far. This paper addresses that shortfall by utilising the well-known benchmark dataset EMO-DB, which provides emotionally coloured speech of high quality. The measured room response of a speaker cabin is compared with artificial approximations of its frequency response in the low-frequency range. Two techniques are applied to obtain the approximations: the first uses multiple resonant filters in the low-frequency region, whose parameters are determined by a least-squares fit. The second uses a modified version of the cabin’s amplitude spectrum, which is set to unity for higher frequencies and transformed to minimum phase and to the time domain. To identify the impact of room modes on the low-level descriptors, correlation coefficients between the “clean” and modified EMO-DB utterances are calculated and compared to each other. Furthermore, a speech emotion recognition system is used to quantify the impact on recognition performance.

1 Introduction

Voice-based human-machine interaction (HMI) “in the wild” is exposed to varying environmental conditions.
It has been analysed in terms of superposed noise [1, 2], robust feature sets [3, 4], feature pooling [5], and feature degradation under different room acoustics [6], as well as the impact of these factors on emotion recognition performance [7]. Furthermore, the impact of room acoustic characteristics on specific feature types and on the performance of speaker-state classification has been analysed [8, 9, 10]. It could be shown that emotion recognition in far-field speech suffers performance drops due to several environmental factors, including background noise, echo, reverberation, and delay. One factor that has been neglected so far is dominant low-frequency room modes, which are often found in small rooms and cause the low-frequency acoustical response to vary across listening locations. This spatial variation impacts the speech signal, and thereby also the low-level descriptors (LLDs) used in various feature sets for speech emotion recognition. Speech emotion recognition may therefore be challenging, for instance, in Ambient Assisted Living environments, where the user’s voice must be captured in the far field for reasons of user acceptance.
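The first approximation technique named in the abstract models the low-frequency response with resonant filters fitted by least squares. As a minimal sketch of what one such fitted component could look like (not the authors' implementation), a single room mode can be approximated by a second-order peaking filter in the standard audio-EQ (RBJ cookbook) form; the centre frequency, Q, and gain used here (65 Hz, Q = 8, +12 dB) are hypothetical values of the kind a least-squares fit to a measured response might yield:

```python
import numpy as np
from scipy import signal

def peaking_biquad(f0, q, gain_db, fs):
    """Second-order resonant (peaking) filter with gain_db boost at f0 Hz."""
    a_lin = 10.0 ** (gain_db / 40.0)          # sqrt of linear peak gain
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * a_lin, -2.0 * np.cos(w0), 1.0 - alpha * a_lin])
    a = np.array([1.0 + alpha / a_lin, -2.0 * np.cos(w0), 1.0 - alpha / a_lin])
    return b / a[0], a / a[0]                 # normalised coefficients

fs = 16000
# hypothetical single mode at 65 Hz, +12 dB, moderately narrow
b, a = peaking_biquad(f0=65.0, q=8.0, gain_db=12.0, fs=fs)
w, h = signal.freqz(b, a, worN=8192, fs=fs)   # frequency response, w in Hz
x = np.random.default_rng(0).standard_normal(fs)  # 1 s noise stand-in for speech
y = signal.lfilter(b, a, x)                   # "room mode" applied to the signal
```

A full approximation would sum or cascade several such sections, one per dominant mode, with (f0, q, gain_db) per section as the free parameters of the fit.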
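The second technique from the abstract, setting the amplitude spectrum to unity above the low-frequency range and converting it to a minimum-phase, time-domain filter, can be sketched with the standard real-cepstrum construction. The cabin measurement itself is not available here, so a synthetic magnitude with a single hypothetical mode-like peak at 60 Hz stands in for it:

```python
import numpy as np

def minimum_phase_ir(mag):
    """Minimum-phase impulse response for a symmetric (even) magnitude
    spectrum, obtained by folding the real cepstrum of log|H|."""
    n = len(mag)
    cep = np.fft.ifft(np.log(np.maximum(mag, 1e-12))).real
    fold = np.zeros(n)
    fold[0] = cep[0]
    fold[1:(n + 1) // 2] = 2.0 * cep[1:(n + 1) // 2]  # double the causal part
    if n % 2 == 0:
        fold[n // 2] = cep[n // 2]                    # keep the Nyquist bin once
    # exponentiating restores the magnitude with minimum-phase phase
    return np.fft.ifft(np.exp(np.fft.fft(fold))).real

fs, n = 16000, 4096
f = np.abs(np.fft.fftfreq(n, 1.0 / fs))
# synthetic stand-in for the cabin response: one mode-like peak at 60 Hz
mag = 1.0 + 4.0 * np.exp(-((f - 60.0) ** 2) / (2.0 * 15.0 ** 2))
mag[f > 200.0] = 1.0  # set to unity above the low-frequency range
ir = minimum_phase_ir(mag)  # causal filter; convolve with speech to modify it
```

The resulting impulse response preserves the prescribed magnitude exactly while concentrating its energy at the start, which keeps the modification causal and short.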