A Noise-Robust ASR Back-end Technique Based on Weighted Viterbi Recognition

Xiaodong Cui, Alexis Bernard and Abeer Alwan
Department of Electrical Engineering
University of California, Los Angeles, CA
Email: {xdcui, abernard, alwan}@icsl.ucla.edu

Abstract

The performance of speech recognition systems trained in quiet degrades significantly under noisy conditions. To address this problem, a Weighted Viterbi Recognition (WVR) algorithm that is a function of the SNR of each speech frame is proposed. The acoustic models, trained on clean data, and the acoustic front-end features are kept unchanged in this approach. Instead, a confidence/robustness factor is assigned to the output observation probability of each speech frame according to its SNR estimate during the Viterbi decoding stage. Comparative experiments are conducted with WVR using different front-end features: MFCC, LPCC and PLP. Results show consistent improvements with all three feature vectors. For a reasonable size of adaptation data, WVR outperforms environment adaptation using MLLR.

1. Introduction

Noise-robust speech recognition is an important challenge for real-world applications. The performance of recognition systems trained in quiet degrades significantly in the presence of background acoustic noise. In general, there are two ways of addressing this problem. The first approach is to reduce the mismatch in the front-end feature extraction stage [1] [2]. The other approach involves either updating 'clean' acoustic models based on noise estimates [3] or building separate HMMs of the 'clean' speech and of the noise [4].

In [7], a Weighted Viterbi Recognition (WVR) algorithm was introduced to deal with channel impairments, frame erasures and network congestion in Distributed Speech Recognition (DSR). Independent work in [9] used "soft-feature" decoding to deal with DSR channel degradation.
In this paper, we use the WVR algorithm to deal with background acoustic noise without changing the acoustic speech models. The weighting factor is a function of the SNR estimate of each speech frame. The computational complexity of this algorithm is quite low, and its structure makes it easy to implement in DSR systems. Compared with environment adaptation using MLLR with a reasonable size of adaptation data, WVR can achieve better results. Three types of feature vectors are examined: MFCC, LPCC and PLP [2].

The remainder of this paper is organized as follows. In Section 2, a system overview is provided. In Sections 3 and 4, the SNR estimation algorithm and the WVR formulation are described, respectively. Experimental results are shown in Section 5, and Section 6 concludes the paper with a summary and discussion.

Author note: Dr. Alexis Bernard is now with the DSP R&D Center of Texas Instruments in Dallas, TX. Work was initiated during his doctoral studies at UCLA.

2. System Overview

Fig. 1 gives an overview of the system. The acoustic HMMs are trained using clean data, and front-end feature extraction uses standard features such as MFCC, LPCC and PLP. The SNR is estimated for each speech frame, and the estimate is provided to a Viterbi decoding/recognition module, where a final decision is made based on the clean acoustic models and the confidence/quality of each speech frame.

[Figure 1 blocks: Speech Input, Front End Feature Extraction, Framewise SNR Estimation, Clean HMM, WVR, Output]

Figure 1: Weighted Viterbi Recognition (WVR) to deal with noisy speech given 'clean' acoustic models.

EUROSPEECH 2003 - GENEVA
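
The decoding stage outlined in the system overview can be illustrated with a small sketch. It is not the formulation of Section 4, which is not reproduced here. It assumes that the per-frame confidence acts as an exponent on the observation probability, that is, as a multiplier on the log-likelihood inside an otherwise standard Viterbi recursion. The function names, the linear SNR-to-weight mapping `snr_to_weight`, its 0 dB and 20 dB limits, and the toy two-state model are all hypothetical and chosen only for illustration.

```python
import math


def snr_to_weight(snr_db, lo_db=0.0, hi_db=20.0):
    """Hypothetical mapping (an assumption, not the paper's): clamp a frame
    SNR estimate linearly to a confidence weight in [0, 1]."""
    return min(1.0, max(0.0, (snr_db - lo_db) / (hi_db - lo_db)))


def weighted_viterbi(log_init, log_trans, log_obs, weights):
    """Viterbi decoding with per-frame weighted observation log-likelihoods.

    log_init[j]     : log prior of state j
    log_trans[i][j] : log transition probability from state i to state j
    log_obs[t][j]   : log observation likelihood of frame t in state j
    weights[t]      : confidence of frame t in [0, 1]; multiplies log_obs[t]
    Returns (best_state_path, best_log_score).
    """
    n_states = len(log_init)
    score = [log_init[j] + weights[0] * log_obs[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + weights[t] * log_obs[t][j])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example: the middle frame is noisy and misleadingly favours state 1.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.8), math.log(0.2)],
             [math.log(0.2), math.log(0.8)]]
log_obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.001), math.log(0.999)],
           [math.log(0.9), math.log(0.1)]]
frame_snr_db = [20.0, -5.0, 20.0]  # assumed per-frame SNR estimates

weights = [snr_to_weight(s) for s in frame_snr_db]
plain_path, _ = weighted_viterbi(log_init, log_trans, log_obs, [1.0] * 3)
wvr_path, _ = weighted_viterbi(log_init, log_trans, log_obs, weights)
print("unweighted:", plain_path)
print("weighted:  ", wvr_path)
```

With all weights equal to 1 the routine reduces to ordinary Viterbi decoding and follows the misleading middle frame. When the low-SNR frame receives weight 0, its observation term drops out and the path is determined by the transition model and the reliable frames, which is the behaviour the confidence factor is meant to provide.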