Magnitude Spectrum Enhancement for Robust Speech Recognition

Wen-hsiang Tu and Jeih-weih Hung
Dept. of Electrical Engineering, National Chi Nan University, Taiwan, Republic of China
e-mail: aero3016@ms45.hinet.net, jwhung@ncnu.edu.tw

Abstract

In this paper, an effective compensation scheme for the spectra of speech signals is proposed in order to improve their noise robustness. In this compensation scheme, named magnitude spectrum enhancement (MSE), voice activity detection (VAD) is first performed on the frame sequence of the utterance, and then the magnitude spectra of non-speech frames are set to be small while those of speech frames are amplified. In experiments conducted on the Aurora-2 noisy digits database, MSE achieves a relative error reduction rate of nearly 50% over the baseline processing, outperforming the well-known spectral-domain speech enhancement techniques, spectral subtraction (SS) and Wiener filtering (WF). In addition, the proposed MSE can be integrated with cepstral-domain robustness methods, such as mean and variance normalization (MVN) and histogram equalization (HEQ), to further improve recognition accuracy in noise-corrupted environments.

Index Terms: speech recognition, robust speech features, speech enhancement

1. Introduction

The environmental mismatch caused by additive noise and/or channel distortion often seriously degrades the performance of a speech recognition system. Various robustness techniques have been proposed to reduce this mismatch, and one category of them aims at obtaining robust speech features. The mel-frequency cepstral coefficient (MFCC) is one of the most widely used speech feature representations due to its high recognition performance under clean conditions. However, MFCC is not very noise-robust, and thus many robustness techniques are applied in various domains of a noise-corrupted speech signal when deriving MFCC.
For example, the well-known spectral subtraction (SS) [1] and Wiener filtering (WF) [2,3] techniques are used in the linear spectral domain, and various feature statistics normalization techniques, such as cepstral mean subtraction (CMS) [4], mean and variance normalization (MVN) [5], MVN plus ARMA filtering (MVA) [6] and histogram equalization (HEQ) [7], are often used in the cepstral domain.

Besides dealing with the MFCC features, several recent papers [8-11] have found that properly compensating the logarithmic energy (logE) feature can significantly improve the recognition accuracy under noisy conditions. In our previously proposed method [10], silence feature normalization (SFN), the high-pass filtered logE is used as the indicator for speech/non-speech frame classification, and then the logE features of non-speech frames are set to be small while those of speech frames are kept nearly unchanged. We have shown that SFN is very effective despite its simplicity of implementation.

Partially motivated by the concept of SFN, in this paper we propose a new approach, named magnitude spectrum enhancement (MSE), to process the noise-corrupted signal in the linear spectral domain, with the hope that the resulting MFCC features can be more noise-robust. Briefly speaking, in MSE the magnitude spectrum of each non-speech frame is set to be small, as in SFN, while the magnitude spectrum of each speech frame is amplified by multiplying it by a weighting factor that is related to the signal-to-noise ratio (SNR). The main purpose of MSE is to highlight the spectral difference between the speech and non-speech frames, not to reconstruct the clean speech spectrum as SS and WF do. The experiments conducted on the Aurora-2 digit database show that our proposed MSE can provide a significant improvement in recognition accuracy under various noise-corrupted environments.
It performs better than SS and WF, and it can be well integrated with cepstral-domain processing techniques, such as MVN, MVA and HEQ. The best averaged accuracy rate obtained with the proposed method for the Aurora-2 clean-condition training task is as high as 90.98%.

The remainder of the paper is organized as follows: Section 2 introduces the proposed MSE method. The experimental setup is described in Section 3, and the experimental results are given and discussed in Section 4. Finally, Section 5 contains brief concluding remarks and future work.

2. The Proposed Magnitude Spectrum Enhancement Method

Assume that $\{x_m[n],\ 0 \le n \le N-1\}$ is the time-domain signal for the $m$th frame of an utterance. Taking the $K$-point DFT of $\{x_m[n]\}$, we obtain the spectrum for this frame as follows:

$$X_m[k] = \left| \sum_{n=0}^{N-1} x_m[n]\, e^{-j 2\pi n k / K} \right|, \quad 0 \le k \le K/2,\ 1 \le m \le M, \tag{1}$$

where $M$ is the number of frames in this utterance. As a result, $X_m[k]$ represents the magnitude spectrum for the $k$th frequency bin of the $m$th frame in an utterance. On the other hand, the logarithmic energy (logE) feature of the $m$th frame is calculated as follows:

$$e_m = \log\left( \sum_{n=0}^{N-1} x_m^2[n] \right), \quad 1 \le m \le M. \tag{2}$$

Then the proposed magnitude spectrum enhancement (MSE) approach uses the following two steps to create the new magnitude spectrum:

Step I: Perform voice activity detection (VAD): The VAD process that discriminates speech/non-speech frames in an utterance is based on two sources, the magnitude spectrum in eq. (1) and the logE in eq. (2).
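The two per-frame quantities in eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `magnitude_spectrum` and `log_energy`, the direct-sum DFT, the natural logarithm, and the small `floor` guard against log(0) are all assumptions made here for the example.

```python
import cmath
import math


def magnitude_spectrum(frame, K):
    """Eq. (1): magnitude of the K-point DFT of one frame, bins 0 <= k <= K/2."""
    N = len(frame)
    spectrum = []
    for k in range(K // 2 + 1):
        # Direct evaluation of the DFT sum for bin k (O(N) per bin).
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * n * k / K) for n in range(N))
        spectrum.append(abs(acc))
    return spectrum


def log_energy(frame, floor=1e-12):
    """Eq. (2): log of the sum of squared samples of one frame.

    `floor` is an assumption of this sketch to avoid log(0) on silent frames;
    the natural logarithm is also an assumption.
    """
    return math.log(max(sum(x * x for x in frame), floor))


# Toy frame: exactly one cosine cycle over N = 8 samples, with K = N.
N = K = 8
frame = [math.cos(2 * math.pi * n / N) for n in range(N)]
X = magnitude_spectrum(frame, K)  # energy concentrates in bin k = 1
e = log_energy(frame)             # sum of squares is N/2 = 4
```

For the toy frame, the magnitude at bin 1 is N/2 = 4 and all other bins are numerically zero, which is a quick sanity check on the index ranges of eq. (1). In practice an FFT routine would replace the direct sum.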
As for the first source, since the high-pass filtered logarithmic magnitude spectrum, $\log(X_m[k])$, which can be viewed as the logE feature at the $k$th frequency bin, has been shown to be more helpful in discriminating the speech and non-speech portions [10], we first process the sequence $\{\log(X_m[k])\}$ with a high-pass IIR filter whose input-output relationship is

$$Y_m[k] = \log\left(X_m[k]\right) - \lambda\, Y_{m-1}[k], \quad 0 \le k \le K/2,\ 1 \le m \le M. \tag{3}$$

Next, we sum up the high-pass filtered logarithmic spectrum, $Y_m[k]$, over the entire frequency band for each frame as follows:

978-1-4244-4296-6/10/$25.00 ©2010 IEEE ICASSP 2010