Bag of n-gram driven decoding for LVCSR system harnessing

Fethi Bougares #1, Yannick Estève #2, Paul Deléglise #3, Georges Linarès *4
# LIUM, University of Le Mans, France
1,2,3 firstname.lastname@lium.univ-lemans.fr
* LIA, University of Avignon, France
4 firstname.lastname@univ-avignon.fr

Abstract—This paper focuses on the combination of automatic speech recognition systems based on driven decoding paradigms. The driven decoding algorithm (DDA) uses the 1-best hypothesis provided by an auxiliary system as an additional knowledge source in the search algorithm of a primary system. Previous studies showed that DDA outperforms ROVER when the primary system is guided by a more accurate system. In this paper we propose a new method to manage auxiliary transcriptions, which are presented as a bag of n-grams (BONG) without temporal matching. These modifications make it easier to combine several hypotheses provided by different auxiliary systems. Using BONG combination with hypotheses provided by two auxiliary systems, each of which has a WER above 23% on the same data, our experiments show that a CMU Sphinx based ASR system can reduce its WER from 19.85% to 18.66%, which is better than the results reached with DDA or classical ROVER combination.

Index Terms—speech recognition, system combination, bag of n-gram driven decoding.

INTRODUCTION

One of the main challenges in automatic speech recognition (ASR) research is to build accurate systems that work in real-life situations and with different kinds of speaking styles. To achieve this goal, studies have taken many directions in search of better models or more sophisticated algorithms; meanwhile, many works propose different combination schemes to benefit from the complementarity of systems. In previous studies, a variety of combination approaches were proposed. These combination schemes can be distinguished by the method used to share information and by the level at which they are applied.
Cross-adaptation techniques [1] and feature concatenation [2] are two examples of combination before the decoding process, while ROVER [3], lattice combination [4] and CNC [5] operate after it. The DDA [6] framework is more than a combination method: applied during the decoding process, it modifies the exploration of the search space and brings out new hypotheses that were not proposed by the initial system.

In order to keep the search space at a manageable size, the recognition process prunes many hypotheses according to its knowledge base and its internal heuristics. The pruning process is generally local: only local information is used to reject some word hypotheses. But these rejected words may be the words actually uttered by the speaker, and they could have been retained by a more global pruning process: a better pruning method could give a more accurate search.

Motivated by these considerations, we have chosen to explore the use of the DDA. This algorithm takes into account the output of an auxiliary ASR system to evaluate a partial hypothesis during the decoding process of a primary system. DDA helps to improve the internal pruning decisions made by the primary ASR system by using the output of another recognizer. In a previous work [6], it was shown that the DDA approach gives good results in system combination: it significantly improves the output of the primary ASR system when the auxiliary system is initially better.

In this paper, we introduce the bag-of-n-gram driven decoding approach as a modified DDA combination. Experimental results show that we can improve a primary ASR system and outperform DDA when using a single, less accurate auxiliary ASR system. Additionally, an efficient method is proposed to deal with multiple auxiliary ASR systems.

The first section presents the principle of DDA. The experimental framework is then presented in section two. In the third section, we investigate the DDA algorithm when the primary system is more accurate than the auxiliary one.
Before concluding and outlining future work, section four introduces the BONG method, the results obtained, and their analysis.

978-1-4673-0367-5/11/$26.00 ©2011 IEEE, ASRU 2011

I. DRIVEN DECODING ALGORITHM

DDA is presented in [6] as a speech recognition system combination method. It was initially proposed in [7] to help ASR systems process audio documents associated with imperfect manual transcripts (for example, subtitles). The method is based on the reevaluation of linguistic scores during the decoding process of a primary system, using a recognition hypothesis computed by an auxiliary system. During decoding, each evaluated hypothesis is aligned to the auxiliary hypothesis using the edit distance. After finding a synchronization point, a matching score α is estimated depending on the number of words correctly aligned. Then the linguistic score L is computed using the following rule:

$$L(w_i / w_{i-2}, w_{i-1}) = P(w_i / w_{i-2}, w_{i-1})^{1-\alpha(w_i)}$$

where $P(w_i / w_{i-2}, w_{i-1})$ is the initial probability of the trigram and $\alpha(w_i)$ is the DDA matching score depending on