SEARCH ERROR RISK MINIMIZATION IN VITERBI BEAM SEARCH FOR SPEECH RECOGNITION

Takaaki Hori, Shinji Watanabe, and Atsushi Nakamura

NTT Communication Science Laboratories, NTT Corporation
2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
{hori,watanabe,ats}@cslab.kecl.ntt.co.jp

ABSTRACT

This paper proposes a method to optimize Viterbi beam search based on search error risk minimization in large vocabulary continuous speech recognition (LVCSR). Most speech recognizers employ beam search to speed up the decoding process, in which unpromising partial hypotheses are pruned during decoding. However, the pruning step involves the risk of missing the best complete hypothesis by discarding a partial hypothesis that might grow into the best. Missing the best hypothesis is called a search error. Our purpose is to reduce search errors by optimizing the pruning step. While conventional methods use heuristic criteria to prune each hypothesis based on its score, rank, and so on, our proposed method introduces a pruning function that makes a more precise decision using rich features extracted from each hypothesis. The parameters of the function can be estimated efficiently to minimize the search error risk using recognition lattices in the training step. We implemented the new method in a WFST-based decoder and achieved a significant reduction of search errors in a 200K-word LVCSR task.

Index Terms: Viterbi beam search, pruning, search error, WFST

1. INTRODUCTION

In recent years, large-vocabulary continuous-speech recognition (LVCSR) has been incorporated into various speech applications, such as dictation systems, spoken dialogue systems, and broadcast news captioning systems. The LVCSR decoding process finds the sequence of words that best matches an input signal from among a large number of hypotheses. This search problem, which amounts to finding the best path in a huge graph, essentially requires a large amount of computation.
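To make the pruning step concrete, the following is a minimal illustrative sketch of conventional score-based beam pruning, not the paper's actual decoder; the function name, the (state, score) hypothesis layout, and the numbers in the usage example are all assumptions for illustration.

```python
def prune_by_beam(hypotheses, beam_width):
    """Keep partial hypotheses whose score lies within beam_width of the
    current best; discard the rest. Higher score is better.

    A discarded hypothesis that would have grown into the best complete
    hypothesis is exactly the "search error" discussed in the text.

    hypotheses: list of (state, score) pairs (illustrative layout).
    """
    best = max(score for _, score in hypotheses)
    return [(s, sc) for s, sc in hypotheses if sc >= best - beam_width]

# Hypothetical usage: with a beam width of 5.0, the hypothesis scoring
# -30.0 falls more than 5.0 below the best (-10.0) and is pruned.
pruned = prune_by_beam([("a", -10.0), ("b", -12.5), ("c", -30.0)], 5.0)
```

A single threshold like this must be tuned by hand; the tension between a narrow beam (fast, more search errors) and a wide one (slow, fewer search errors) is what motivates learning the pruning decision instead.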
To reduce the computation, many techniques have been proposed [1]. Among them, the most basic idea is beam search [2][3]. Current speech recognizers employ beam search to speed up the decoding process, in which unpromising partial hypotheses are pruned during decoding. However, the pruning step involves the risk of missing the best complete hypothesis by discarding a partial hypothesis that might grow into the best. Missing the best hypothesis is called a search error and is one of the main causes of recognition errors.

A straightforward approach to reducing search errors is to use more accurate pruning techniques. For example, different criteria based on score, rank, whether a hypothesis is at a word end, and look-ahead score have been introduced, each with its own pruning threshold [4]. In general, such thresholds need to be tuned experimentally on development data to balance search error against decoding speed.

Our purpose is to improve this pruning step to effectively reduce search errors. While conventional methods use heuristic criteria to prune hypotheses, our proposed method introduces a pruning function that makes a more precise decision using rich features extracted from each hypothesis. The parameters of the function can be estimated automatically to minimize the search error risk using recognition lattices in the training step.

Our approach is related to a search optimization framework proposed by Daume et al. [5]. In this framework, search parameters are optimized to reduce computational cost and errors, where the parameters are used for ranking or pruning hypotheses in the queue. In [5], perceptron and large-margin methods were introduced to update the parameters of the ranking function and were evaluated on a syntactic chunking task in natural language processing. Xu et al. applied this framework to a planning problem in the AI field [6]. The computational complexity and convergence properties of the learning algorithm were also investigated in [7].
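One natural form for such a learned pruning function, sketched below under stated assumptions rather than taken from the paper, is a linear combination of per-hypothesis features whose weights can be trained. The feature names ("score_margin", "rank", "is_word_end") and the keep/prune threshold at zero are hypothetical choices for illustration.

```python
def pruning_decision(features, weights):
    """Decide whether to keep a partial hypothesis.

    features: dict mapping feature name -> value for this hypothesis
              (e.g. margin to the best score, rank in the beam,
              whether the hypothesis ends at a word boundary).
    weights:  dict mapping feature name -> learned weight.

    Returns True to keep the hypothesis, False to prune it. The
    sign convention (keep when the weighted sum is non-negative)
    is an illustrative assumption.
    """
    score = sum(weights.get(name, 0.0) * value
                for name, value in features.items())
    return score >= 0.0

# Hypothetical usage: a hypothesis slightly behind the best but at a
# word end may be kept, while one far behind the best is pruned.
keep = pruning_decision({"score_margin": -2.0, "is_word_end": 1.0},
                        {"score_margin": 0.5, "is_word_end": 1.5})
```

Compared with a single score threshold, a function like this can trade several weak cues against each other per hypothesis; the question the paper addresses is how to estimate its parameters so as to minimize the search error risk.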
In this paper, we introduce search optimization into a Viterbi beam search for LVCSR. However, it is difficult to apply the original learning algorithm of [5] to the LVCSR problem, because it updates the parameters and reconstructs the hypothesis queue whenever an incorrect hypothesis is popped from the queue, following the same procedure as the decoding algorithm. It is therefore computationally much more expensive than the decoding process, and the training step is also difficult to parallelize. Accordingly, it is unsuitable for LVCSR search optimization, in which we aim to estimate many search parameters using a large-scale training corpus.

Instead, we propose a batch-style algorithm that estimates the parameters of the pruning function using recognition lattices. In the training phase, we generate lattices by decoding training samples, where the features extracted during decoding are attached to each lattice arc. Then we iteratively update the parameters using these lattices based on the Gradient