A GENERALIZED DYNAMIC COMPOSITION ALGORITHM OF WEIGHTED FINITE
STATE TRANSDUCERS FOR LARGE VOCABULARY SPEECH RECOGNITION
Octavian Cheng¹,², John Dines¹ and Mathew Magimai Doss³
¹IDIAP Research Institute, Martigny, Switzerland
²Department of Electrical and Computer Engineering, The University of Auckland, New Zealand
³International Computer Science Institute, Berkeley, California, USA
Email: {ocheng, dines}@idiap.ch, mathew@icsi.berkeley.edu
ABSTRACT
We propose a generalized dynamic composition algorithm for weighted
finite state transducers (WFST), which avoids the creation of non-
coaccessible paths, performs weight look-ahead and does not impose
any constraints on the topology of the WFSTs. Experimental results
on the Wall Street Journal (WSJ1) 20k-word trigram task show that at
17% WER (moderately wide beam width), the decoding time of the
proposed approach is about 48% and 65% of that of the other two
dynamic composition approaches. In comparison with static composition,
at the same level of 17% WER, we observe a reduction of about 60%
in memory requirement, with an increase of about 60% in decoding
time due to the extra overheads of dynamic composition.
Index Terms— Weighted Finite State Transducers, Dynamic
Composition, Large Vocabulary Continuous Speech Recognition
1. INTRODUCTION
Recently, the use of Weighted Finite State Transducers (WFST) for
Large Vocabulary Continuous Speech Recognition (LVCSR) has become
an attractive approach [1, 2]. In simple terms, a WFST is a finite
state machine which maps sequences of input symbols to sequences of
output symbols with an associated weight. In the application of
WFSTs to LVCSR, the idea is to represent each individual knowledge
source by a WFST and fully integrate them into a unified WFST by the
composition algorithm [2]. The fully integrated WFST provides
weighted mappings from HMM state sequences to word sequences. Thus
the speech recognition problem becomes searching for the mapped
sequence with the lowest associated weight (cost). The composition
of knowledge sources is a one-off process and is done offline.
Therefore it is often referred to as static composition.
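The weighted mapping described above can be sketched in code. The following is a minimal illustration, assuming the tropical semiring (weights are costs, accumulated by addition along a path, with the minimum taken over competing paths); the symbols, weights and the `"-"` epsilon-output convention are illustrative, not from the paper.

```python
# A toy WFST: arcs carry (input symbol, output symbol, weight, next state).
class WFST:
    def __init__(self):
        self.arcs = {}        # state -> list of (isym, osym, weight, next_state)
        self.start = 0
        self.finals = set()

    def add_arc(self, src, isym, osym, weight, dst):
        self.arcs.setdefault(src, []).append((isym, osym, weight, dst))

    def path_weight(self, input_seq):
        """Return (min cost, output sequence) over all accepting paths
        consuming input_seq, or None if no path accepts it."""
        best = None

        def dfs(state, pos, cost, out):
            nonlocal best
            if pos == len(input_seq):
                if state in self.finals and (best is None or cost < best[0]):
                    best = (cost, tuple(out))
                return
            for isym, osym, w, dst in self.arcs.get(state, []):
                if isym == input_seq[pos]:
                    dfs(dst, pos + 1, cost + w, out + [osym])

        dfs(self.start, 0, 0.0, [])
        return best

# Toy lexicon-style mapping: phones "k ae t" -> word "cat".
t = WFST()
t.add_arc(0, "k", "cat", 0.5, 1)   # emit the word on the first phone
t.add_arc(1, "ae", "-", 0.0, 2)    # "-" stands for an epsilon output
t.add_arc(2, "t", "-", 0.0, 3)
t.finals.add(3)
print(t.path_weight(["k", "ae", "t"]))  # -> (0.5, ('cat', '-', '-'))
```

In an LVCSR decoder the search is of course a beam-pruned Viterbi pass rather than the exhaustive depth-first search used here.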
There are two main advantages with the static approach. First,
the decoder design is simple because all the knowledge sources are
integrated into one compact WFST. The knowledge sources are
decoupled from the Viterbi search and therefore the decoder does not
need to perform any combination of knowledge sources during
decoding. The second advantage is that the fully integrated transducer
can be further optimized by algorithms such as determinization,
minimization and weight-pushing [1, 3].
Despite the above advantages, there are several drawbacks with
the static approach. They include:
• The composition and optimization of the fully integrated WFST
has a prohibitively high memory requirement when the constituent
WFSTs are large and complex;
• The size of the fully integrated WFST can be very large, resulting
in a large memory requirement during decoding;
• It does not allow on-line modification of knowledge sources
once they have been fully integrated.
This work was supported by the EU 6th FWP IST integrated project AMI
and the Swiss National Science Foundation through the National Center of
Competence in Research (NCCR) on “Interactive Multimodal Information
Management (IM2)”.
One way of addressing these issues is to perform dynamic trans-
ducer composition during decoding. Instead of representing the en-
tire search space by an optimized transducer, it is possible to factor-
ize the search space into two or more transducers. These component
transducers are built statically and optimized separately. The combi-
nation is done dynamically during decoding.
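The essence of dynamic composition is that composed states, which are pairs of component states, are expanded only when the search first reaches them, so the full composed machine never needs to be built. The following sketch illustrates this on-demand expansion for two epsilon-free toy transducers; the function name, data layout and symbols are illustrative, not the paper's algorithm.

```python
# Lazy (on-the-fly) composition: arcs of the composed machine are
# generated and cached only for state pairs the decoder actually visits.
def lazy_compose_arcs(a_arcs, b_arcs):
    """a_arcs / b_arcs: dict state -> list of (isym, osym, weight, next)."""
    cache = {}

    def expand(pair):
        if pair in cache:
            return cache[pair]
        qa, qb = pair
        out = []
        for isym, mid, wa, na in a_arcs.get(qa, []):
            for msym, osym, wb, nb in b_arcs.get(qb, []):
                if mid == msym:  # match A's output against B's input
                    out.append((isym, osym, wa + wb, (na, nb)))
        cache[pair] = out
        return out

    return expand

# Toy components: A maps x -> y, B maps y -> z.
A = {0: [("x", "y", 1.0, 1)]}
B = {0: [("y", "z", 2.0, 1)]}
expand = lazy_compose_arcs(A, B)
print(expand((0, 0)))  # -> [('x', 'z', 3.0, (1, 1))]
```

A naive expansion like this can still generate non-coaccessible paths and delays weight information; addressing exactly these two problems is the subject of the algorithm proposed in this paper.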
In this paper, we investigate several existing dynamic composition
approaches and propose our improved algorithm, which avoids the
creation of non-coaccessible transitions, performs weight look-ahead
and does not impose any constraints on the topology of component
WFSTs. The paper is organized as follows. Section 2 briefly
describes static WFST composition and how a fully integrated WFST
is generated. Section 3 gives a general overview of current
approaches to dynamic WFST composition. Section 4 describes our
dynamic composition algorithm. Experimental results on different
composition methods are shown in Section 5. Finally, Section 6
concludes the paper.
2. STATIC WFST COMPOSITION
Static WFST composition involves integration of all the knowledge
sources. It can be represented by the following expression [2].
N = π_ε(min(det(H̃ ∘ det(C̃ ∘ det(L̃ ∘ G)))))    (1)
In the above expression, H̃ represents the HMM topology; C̃ is a
WFST which maps context-dependent phones to context-independent
phones; L̃ is the lexicon WFST and G is the language model (LM)
WFST. The symbol ∘ is the composition operator. Transducer
optimization algorithms, for example determinization and minimization,
are represented by the det and min operators respectively. The ~
symbol means that the WFST is augmented with auxiliary symbols which
are necessary for the success of transducer optimization. The π_ε
operation replaces the auxiliary symbols by ε (null) symbols. The
final transducer N is a fully integrated transducer which maps HMM
state sequences to word sequences.
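The innermost step of Eq. (1), composing two transducers such as L̃ ∘ G, can be sketched as follows. This toy version handles only epsilon-free machines over the tropical semiring and keeps only states reachable from the composed start state; real toolkits additionally handle epsilon transitions via composition filters. The L and G data below are illustrative.

```python
from collections import deque

def compose(a, b):
    """Compose two epsilon-free WFSTs.

    Each machine is (start, finals, arcs) with
    arcs: state -> list of (isym, osym, weight, next_state).
    """
    a_start, a_finals, a_arcs = a
    b_start, b_finals, b_arcs = b
    start = (a_start, b_start)
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = state = queue.popleft()
        if qa in a_finals and qb in b_finals:
            finals.add(state)
        for i, m, wa, na in a_arcs.get(qa, []):
            for m2, o, wb, nb in b_arcs.get(qb, []):
                if m == m2:  # A's output matches B's input
                    nxt = (na, nb)
                    arcs.setdefault(state, []).append((i, o, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return start, finals, arcs

# Toy "lexicon" L (phone -> word) composed with toy "grammar" G (word cost).
L = (0, {1}, {0: [("k", "cat", 0.5, 1)]})
G = (0, {1}, {0: [("cat", "cat", 2.0, 1)]})
start, finals, arcs = compose(L, G)
print(arcs[start])  # -> [('k', 'cat', 2.5, (1, 1))]
print(finals)       # -> {(1, 1)}
```

In the static approach of Eq. (1), the result of each such composition is additionally determinized, minimized and weight-pushed before use.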
IV 345 1424407281/07/$20.00 ©2007 IEEE ICASSP 2007