A CORPUS-BASED APPROACH FOR ROBUST ASR IN REVERBERANT ENVIRONMENTS

Laurent Couvreur, Christophe Couvreur, Christophe Ris

Faculté Polytechnique de Mons, e-mail: {lcouv,ris}@tcts.fpms.ac.be
Lernout & Hauspie Speech Products, e-mail: christophe.couvreur@lhs.be

ABSTRACT

In this paper, we discuss the use of artificial room reverberation to increase the performance of automatic speech recognition (ASR) systems in reverberant enclosures. Our approach consists in training acoustic models on artificially reverberated speech material. In order to obtain the desired reverberated speech training database, we propose to use a reverberating filter whose impulse response is designed to match two high-level acoustic properties of the target reverberant operating environment, namely the early-to-late energy ratio and the reverberation time. Speech recognition experiments in simulated reverberant environments show that recognizers trained on speech reverberated with the proposed method outperform systems trained on clean speech, even when channel normalization methods like CMS and logRASTA-PLP are used. The extension of our approach to multi-style training is also considered.

1. INTRODUCTION

Recognition of distant-talking speech is a promising technology for man-machine interaction. Unfortunately, in many applications the operating enclosure is reverberant and the distance between the speech source and the microphone is higher than the so-called critical distance [6]. That is, most of the acoustic energy reaches the microphone after one or more reflections and the recorded speech signal is highly reverberated. The speech signal is severely distorted by this room reverberation, leading to degraded performance of speech recognizers [7].

Several methods have been proposed to cope with room reverberation in speech recognition applications. In some methods, speech is enhanced prior to the extraction of the usual acoustic features [9, 7].
In other methods, robust acoustic features are computed directly from the reverberated speech via channel normalization techniques such as cepstral mean subtraction (CMS) [3] or RASTA-like algorithms [4, 5]. Unfortunately, these methods fail to yield satisfactory results on highly reverberated speech.

The discrepancy between the training conditions (anechoic speech) and the testing conditions (reverberated speech) accounts for the poor performance of speech recognition in reverberant environments. This suggests training the recognizer on reverberated speech material rather than on anechoic speech material. Ideally, a training database would be collected every time the system is deployed in specific reverberant conditions, but this is obviously not practical. An alternative is to simulate reverberation in order to obtain adequately reverberated training material from an existing clean speech database. To do so, the anechoic speech database can be convolved with an acoustic impulse response measured in the target reverberant environment [7]. However, this approach is problematic because the acoustic impulse response is highly dependent on the geometric and acoustic characteristics of the room, on the source and microphone locations, on the air temperature and humidity, etc. [6]. Moreover, reliable measurement of an acoustic impulse response is not straightforward. Thus, it is difficult to guarantee that the measured acoustic impulse response perfectly matches the acoustic impulse response of the target reverberant environment. In practice, this method gives disappointing results.

This work was supported in part by a F.R.I.A. grant (Fonds pour la formation à la Recherche dans l'Industrie et l'Agriculture, Belgium). This work was also partly supported by the European LTR Esprit project RESPITE.
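The convolution-based simulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "speech" is white noise standing in for an anechoic utterance, and the impulse response is a hypothetical synthetic one (a unit direct path followed by exponentially decaying noise), not a measured response and not the reverberating filter proposed in this paper. The function name `reverberate`, the sampling rate, and all numerical values are illustrative choices.

```python
import numpy as np


def reverberate(speech, impulse_response):
    """Convolve a clean signal with a room impulse response.

    The output is truncated to the input length, so the reverberant tail
    extending past the end of the utterance is discarded.
    """
    wet = np.convolve(speech, impulse_response, mode="full")
    return wet[: len(speech)]


rng = np.random.default_rng(0)
fs = 8000                          # sampling rate in Hz (assumed)
speech = rng.standard_normal(fs)   # stand-in for 1 s of anechoic speech

# Hypothetical impulse response: 0.3 s of noise whose amplitude envelope
# falls by 60 dB every 0.5 s, with a unit-amplitude direct path at t = 0.
t = np.arange(int(0.3 * fs)) / fs
h = 0.05 * rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / 0.5)
h[0] = 1.0

reverberated = reverberate(speech, h)
```

In a real setting, `speech` would be replaced by utterances from the clean training database and `h` by the impulse response measured in the target room, which is precisely the quantity that is hard to obtain reliably.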
In this communication, we propose to use a "randomized" reverberating filter instead of a measured acoustic impulse response to obtain the reverberated speech training database. The impulse response of this reverberating filter is designed to match two high-level, perceptually meaningful acoustic properties of the target reverberant environment, namely the early-to-late energy ratio and the reverberation time.

The paper is organized as follows. In the next section, we describe the proposed method for artificially reverberating speech material. The efficiency of our approach is then assessed by connected digit recognition experiments in reverberant conditions. The experimental set-up is briefly described in Section 3 and results are reported in Section 4. Conclusions are drawn in Section 5.

2. ARTIFICIAL REVERBERATION

We assume that the effect of room reverberation on a speech recognizer is better characterized by high-level acoustic properties than by the fine temporal details of a complete acoustic impulse response. More specifically, we assume that the early-to-late energy ratio $R$ and the reverberation time $T_{60}$ are sufficient to specify room reverberation conditions [6]. The early-to-late energy ratio $R$ is defined as the steady-state ratio between the direct and reverberated sound energies and is expressed in dB. The reverberation time $T_{60}$ is defined as the time, expressed in seconds, that the sound energy in the room takes to decay to one millionth of its initial value (-60 dB) once the sound source is interrupted. We further assume that $T_{60}$ is frequency independent. Under the diffuse sound field assumption, these parameters can be computed easily using the well-known equations of Sabine [6].
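As a rough numerical illustration of the Sabine-type relations referred to above, the following sketch computes both parameters for a hypothetical rectangular room. It assumes the common textbook diffuse-field forms: a reverberation time of 24 ln(10) V / (c S a) and a direct-to-reverberant energy ratio of Q S a / (16 pi r^2), where a is the mean wall absorption coefficient. The function names, the default speed of sound of 343 m/s, and the room dimensions are assumptions made for the example only.

```python
import math


def mean_absorption(alphas, surfaces):
    """Surface-weighted mean of the wall absorption coefficients."""
    return sum(a * s for a, s in zip(alphas, surfaces)) / sum(surfaces)


def sabine_t60(volume, surface, alpha_bar, c=343.0):
    """Reverberation time in seconds from Sabine's formula."""
    return 24.0 * math.log(10.0) * volume / (c * surface * alpha_bar)


def energy_ratio_db(surface, alpha_bar, distance, q=1.0):
    """Direct-to-reverberant energy ratio in dB (diffuse-field form)."""
    ratio = q * surface * alpha_bar / (16.0 * math.pi * distance ** 2)
    return 10.0 * math.log10(ratio)


# Hypothetical 5 m x 4 m x 3 m room, all six walls with absorption 0.2.
lx, ly, lz = 5.0, 4.0, 3.0
surfaces = [lx * ly, lx * ly, lx * lz, lx * lz, ly * lz, ly * lz]
alphas = [0.2] * 6

S = sum(surfaces)
V = lx * ly * lz
alpha_bar = mean_absorption(alphas, surfaces)

t60 = sabine_t60(V, S, alpha_bar)
ratio_db = energy_ratio_db(S, alpha_bar, distance=2.0)

# Distance at which direct and reverberant energies are equal (0 dB),
# for an omnidirectional source (q = 1).
critical_distance = math.sqrt(S * alpha_bar / (16.0 * math.pi))
```

With these values the source-microphone distance of 2 m is well beyond the critical distance, so the ratio is negative in dB, which is the distant-talking situation described in the introduction.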
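Given the geometric and acoustic properties of a reverberant test enclosure, we have, for the early-to-late energy ratio $R$ (in dB) and the reverberation time $T_{60}$ (in seconds),

$$R = 10 \log_{10}\left(\frac{Q\,S\,\bar{\alpha}}{16\pi r^{2}}\right) \qquad (1)$$

$$T_{60} = \frac{24 \ln 10}{c}\,\frac{V}{S\,\bar{\alpha}} \qquad (2)$$

where the parameters $S$, $V$, $c$, $\bar{\alpha}$, $Q$, and $r$ denote the wall surface, the room volume, the speed of sound, the mean wall absorption coefficient, the directivity factor, and the source-microphone distance, respectively. The mean wall absorption coefficient is computed as $\bar{\alpha} = \frac{1}{S}\sum_{i} \alpha_{i} S_{i}$, with $\alpha_{i}$ and $S_{i}$ standing for the absorption coefficient and the surface of wall $i$. If the source and the microphone are omnidirectional, the directivity factor $Q$ reduces to 1. Note that $R$ and $T_{60}$ can be easily measured in practice: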