A Missing Data Approach for Robust Automatic Speech Recognition in the Presence of Reverberation

Guy J. Brown 1, Kalle Palomäki 2 and Jon Barker 1
1 Department of Computer Science, University of Sheffield, United Kingdom
2 Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland
g.brown@dcs.shef.ac.uk, kalle.palomaki@hut.fi, barker@dcs.shef.ac.uk

Abstract

We describe a technique for robust recognition of reverberated speech using the 'missing data' paradigm. Modulation filtering is used to identify time-frequency regions of the speech signal which are relatively uncontaminated by reverberation and contain strong speech energy; only these 'reliable' acoustic features are made directly available to the recogniser. The proposed system is evaluated on a connected digit recognition task using a range of reverberation conditions. Our approach improves recognition performance when the T60 reverberation time is longer than 0.7 sec., relative to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation filtered spectrogram.

1. Introduction

Much progress has been made in the field of automatic speech recognition (ASR) in recent years, but significant problems still remain; in particular, the performance of ASR systems is far below that of human listeners when speech is presented in noisy or reverberant conditions (see [7] for a review). Cooke et al. [1] note that human speech perception is robust even when speech is band limited or partially masked by noise. Accordingly, they propose a missing data approach to ASR, in which a hidden Markov model (HMM) classifier is adapted to deal with acoustic features which are known to be missing or unreliable. However, the missing data approach was conceived as a way of handling additive noise in ASR; as a result, little consideration has been given to its ability to handle convolutional interference, such as reverberation.
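The idea of passing only 'reliable' time-frequency cells to the recogniser can be sketched in a few lines. The sketch below is an illustrative assumption, not the authors' exact method: it lowpass-filters each channel's envelope in the modulation domain with a one-pole smoother, then marks a cell reliable if its smoothed energy exceeds a fixed fraction (`theta`, a hypothetical parameter) of that channel's mean smoothed energy.

```python
import math

def reliability_mask(envelopes, fs_frame=100.0, fc=8.0, theta=0.5):
    """Sketch of missing-data mask generation by modulation filtering.

    envelopes: list of per-channel envelope sequences (one value per frame).
    fs_frame:  frame rate in Hz; fc: modulation lowpass cutoff in Hz.
    A cell is 'reliable' if its modulation-lowpassed energy exceeds
    theta times the channel's mean lowpassed energy. All parameter
    values here are assumptions for illustration only.
    """
    alpha = math.exp(-2.0 * math.pi * fc / fs_frame)  # one-pole lowpass coefficient
    mask = []
    for env in envelopes:
        y, smoothed = 0.0, []
        for x in env:
            y = alpha * y + (1.0 - alpha) * x   # recursive modulation lowpass
            smoothed.append(y)
        mean = sum(smoothed) / len(smoothed)
        mask.append([s > theta * mean for s in smoothed])
    return mask
```

For example, a channel containing 50 silent frames followed by 50 frames of strong speech energy yields a mask that is unreliable over the silence and reliable over the burst; a real system would derive the envelopes from an auditory filterbank rather than synthetic data.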
In this paper, we propose a number of modifications to a missing data ASR system which allow it to perform robustly in the presence of reverberation. A typical room impulse response consists of two components. Initially, sparse early reflections occur which are highly correlated with the speech signal. These may spectrally distort the speech, because the absorptive properties of room surfaces tend to vary with frequency. Following this, higher-order reflections produce dense late reverberation, which is poorly correlated with the speech signal and therefore behaves more like additive noise. The speech spectrum is also shaped by the eigenmodes of the room, which emphasize some frequencies in preference to others. Hence, the missing data approach can be applied in reverberant conditions as follows: we use conventional missing data techniques to handle late reverberation (since it resembles additive noise), and employ spectral normalisation to deal with the distortion caused by early reflections and the eigenmodes of the room.

Conventional approaches to robust ASR in the presence of reverberation either perform dereverberation using multiple microphones or employ robust acoustic features. Such features include mel-frequency cepstral coefficients (MFCC) with cepstral mean subtraction [2], cepstral coefficients obtained by perceptual linear prediction (PLP) [3], and modulation spectrogram (MSG) features [5], [6]. The latter have proven to be particularly effective. A schematic diagram of our proposed system is shown in Fig. 1.

In the remainder of this paper, we review the missing data approach to ASR in Section 2 and describe a system for reverberation processing in Section 3. Our approach is evaluated in Section 4 using a number of reverberant conditions, and is compared against a system which uses MSG and PLP features [5], [6]. The paper concludes with a discussion in Section 5.

2. Speech recognition with missing features

2.1. Acoustic features

The missing data approach to ASR requires that regions of the time-frequency plane are labelled as reliable or unreliable evidence of the speech source. Accordingly, the recogniser used here employs spectral features derived from an auditory model, rather than conventional features for ASR such as cepstral coefficients. Here, spectral features are derived from a model of cochlear frequency analysis, consisting of an array of 32 bandpass 'gammatone' filters. The centre frequencies of the filters were spaced uniformly between 50 Hz and 3850 Hz on the equivalent rectangular bandwidth (ERB) scale (see [1]). The instantaneous Hilbert envelope is
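The uniform spacing of centre frequencies on the ERB scale can be computed directly from the ERB-rate formula of Glasberg and Moore. The sketch below maps the endpoints 50 Hz and 3850 Hz onto the ERB-rate scale, places 32 equally spaced points between them, and maps back to Hz; the function names are our own, and this is one plausible realisation rather than the authors' exact code.

```python
import math

def hz_to_erb_rate(f_hz):
    # ERB-rate scale (Glasberg & Moore): E = 21.4 * log10(0.00437 * f + 1)
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    # Inverse of the mapping above
    return (10.0 ** (erb / 21.4) - 1.0) / 4.37e-3

def erb_centre_freqs(f_lo=50.0, f_hi=3850.0, n=32):
    """Centre frequencies spaced uniformly on the ERB-rate scale."""
    e_lo, e_hi = hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi)
    step = (e_hi - e_lo) / (n - 1)
    return [erb_rate_to_hz(e_lo + k * step) for k in range(n)]
```

Calling `erb_centre_freqs()` returns 32 strictly increasing frequencies from 50 Hz to 3850 Hz, more densely packed at low frequencies, which mirrors the frequency resolution of the cochlea.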