Bag of n-gram driven decoding for LVCSR system harnessing
Fethi Bougares #1, Yannick Estève #2, Paul Deléglise #3, Georges Linarès *4
# LIUM, University of Le Mans, France
1,2,3 firstname.lastname@lium.univ-lemans.fr
* LIA, University of Avignon, France
4 firstname.lastname@univ-avignon.fr
Abstract—This paper focuses on the combination of automatic speech
recognition systems based on the driven decoding paradigm. The
driven decoding algorithm (DDA) involves the use of a 1-best
hypothesis provided by an auxiliary system as another knowledge
source in the search algorithm of a primary system. In previous
studies, it was shown that DDA outperforms ROVER when
the primary system is guided by a more accurate system. In
this paper we propose a new method to manage auxiliary
transcriptions which are presented as a bag-of-n-grams (BONG)
without temporal matching. These modifications make it easier
to combine several hypotheses given by different
auxiliary systems. Using BONG combination with hypotheses
provided by two auxiliary systems, each of which obtained a WER
above 23% on the same data, our experiments show that
a CMU Sphinx based ASR system can reduce its WER from
19.85% to 18.66%, which is better than the results reached with
DDA or classical ROVER combination.
Index Terms—speech recognition, system combination, bag of
n-gram driven decoding.
INTRODUCTION
One of the main challenges in automatic speech recognition
(ASR) research is to build accurate systems that work in
real-life situations and with different kinds of speaking styles. To
achieve this goal, studies have taken many directions in search
of better models or more sophisticated algorithms. Meanwhile,
many works propose different combination schemes to benefit
from the complementarity of systems. In previous studies, a variety
of combination approaches were proposed. These combination
schemes are distinguishable depending on the method used to
share information and the application levels. Cross-adaptation
techniques [1] and feature concatenation [2] are two examples
of combination before the decoding process, while ROVER
[3], lattice combination [4] and CNC [5] operate after. The
DDA [6] framework is more than a combination method:
applied during the decoding process, it modifies the
search space exploration and brings out new hypotheses not
proposed by the initial system.
In order to keep the search space at a manageable size,
the recognition process prunes many hypotheses according to
its knowledge base and its internal heuristics. The pruning
process is generally local and local information is used to
reject some word hypotheses. But these rejected words can be
the words uttered by the speaker, and could be retained in a
more global pruning process: a better pruning method could
yield a more accurate search. Motivated by these considerations,
we have chosen to explore the use of the DDA. This algorithm
takes into account the output given by an auxiliary ASR system
to evaluate a partial hypothesis during the decoding process of
a primary system. DDA helps improve the internal pruning
decision made by the primary ASR system using the output
of another recognizer.
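The effect described above can be illustrated with a toy example. The sketch below is not the authors' implementation: the candidate words, acoustic scores, language model probabilities, beam width, and the fixed matching score `ALPHA_MATCH` are all hypothetical values chosen for illustration. It only assumes the rescoring rule given in Section I, where the linguistic score is the trigram probability raised to the power 1 - α, and shows how a word rejected by a local beam can survive once the auxiliary hypothesis supports it.

```python
import math

# Hypothetical toy values: none of these numbers come from the paper.
ACOUSTIC_LOGP = {"whether": -4.0, "weather": -4.2, "wetter": -4.1}
LM_PROB = {"whether": 0.20, "weather": 0.01, "wetter": 0.02}
AUXILIARY_WORDS = {"weather"}  # words found in the auxiliary 1-best hypothesis
ALPHA_MATCH = 0.6              # assumed matching score for a supported word
BEAM = 2.0                     # assumed beam width in the log domain


def rescore(lm_prob: float, alpha: float) -> float:
    """Rule from Section I: L = P ** (1 - alpha), with alpha in [0, 1]."""
    if not 0.0 < lm_prob <= 1.0:
        raise ValueError("lm_prob must be in (0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return lm_prob ** (1.0 - alpha)


def word_score(word: str, use_auxiliary: bool) -> float:
    """Acoustic log score plus the log of the (possibly rescored) LM score."""
    supported = use_auxiliary and word in AUXILIARY_WORDS
    alpha = ALPHA_MATCH if supported else 0.0
    return ACOUSTIC_LOGP[word] + math.log(rescore(LM_PROB[word], alpha))


def survivors(use_auxiliary: bool) -> list[str]:
    """Local beam pruning: keep words scoring within BEAM of the best one."""
    scores = {w: word_score(w, use_auxiliary) for w in ACOUSTIC_LOGP}
    best = max(scores.values())
    return [w for w, s in scores.items() if s >= best - BEAM]


print(survivors(use_auxiliary=False))
print(survivors(use_auxiliary=True))
```

With these made-up numbers, "weather" falls outside the beam when the primary system decides alone, and stays inside it once its linguistic score is raised by the auxiliary hypothesis. A real decoder would estimate α from the alignment with the auxiliary transcript rather than use a constant.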
In a previous work [6], it was shown that the DDA approach
gives good results in system combination. It significantly
improves the output of the primary ASR system when the
auxiliary system is initially better.
In this paper, we introduce the bag-of-n-gram driven
decoding approach as a modified DDA combination. Experimental
results show that we can improve a primary ASR system and
outperform DDA when using a single, less accurate auxiliary
ASR system. Additionally, an efficient method is proposed
to deal with multiple auxiliary ASR systems. The first section
presents the principle of DDA. The experimental framework is
then presented in section two. In the third section we
investigate the DDA algorithm when the primary system is more
accurate than the auxiliary one. Section four introduces the BONG
method, the results obtained, and their analysis, before we
conclude and outline future work.
I. DRIVEN DECODING ALGORITHM
DDA is presented in [6] as a speech recognition system
combination method. Initially DDA was proposed in [7] to
help ASR systems process audio documents associated with
imperfect manual transcripts (for example subtitles). This
method is based on linguistic score reevaluation during the
decoding process in a primary system using a recognition
hypothesis computed by an auxiliary system. During the
decoding process, each evaluated hypothesis is aligned to the
auxiliary hypothesis using the edit distance. After finding a
synchronized point, a matching score α is estimated depending
on the number of words correctly aligned. Then the linguistic
score L is computed using the following rule:
$$L(w_i \mid w_{i-2}, w_{i-1}) = P(w_i \mid w_{i-2}, w_{i-1})^{1 - \alpha(w_i)}$$
where $P(w_i \mid w_{i-2}, w_{i-1})$ is the initial probability of the
trigram and $\alpha(w_i)$ is the DDA matching score depending on
278 978-1-4673-0367-5/11/$26.00 ©2011 IEEE ASRU 2011