A GENERALIZED DYNAMIC COMPOSITION ALGORITHM OF WEIGHTED FINITE
STATE TRANSDUCERS FOR LARGE VOCABULARY SPEECH RECOGNITION
Octavian Cheng¹,², John Dines¹ and Mathew Magimai Doss³
¹IDIAP Research Institute, Martigny, Switzerland
²Department of Electrical and Computer Engineering, The University of Auckland, New Zealand
³International Computer Science Institute, Berkeley, California, USA
Email: {ocheng, dines}@idiap.ch, mathew@icsi.berkeley.edu
ABSTRACT
We propose a generalized dynamic composition algorithm for weighted
finite state transducers (WFST), which avoids the creation of non-
coaccessible paths, performs weight look-ahead and does not impose
any constraints on the topology of the WFSTs. Experimental results
on the Wall Street Journal (WSJ1) 20k-word trigram task show that at
17% WER (moderately wide beam width), the decoding time of the
proposed approach is about 48% and 65% of that of the other two
dynamic composition approaches. In comparison with static composition,
at the same level of 17% WER, we observe a reduction of about 60%
in memory requirement, with an increase of about 60% in decoding
time due to the extra overheads of dynamic composition.
Index Terms— Weighted Finite State Transducers, Dynamic
Composition, Large Vocabulary Continuous Speech Recognition
1. INTRODUCTION
Recently, the use of Weighted Finite State Transducers (WFST) for
Large Vocabulary Continuous Speech Recognition (LVCSR) has become
an attractive approach [1, 2]. In simple terms, a WFST is a finite
state machine which maps sequences of input symbols to sequences of
output symbols with an associated weight. In the application of
WFSTs to LVCSR, the idea is to represent each individual knowledge
source by a WFST and fully integrate them into a unified WFST by the
composition algorithm [2]. The fully integrated WFST provides
weighted mappings from HMM state sequences to word sequences. Thus
the speech recognition problem becomes searching for the mapped
sequence with the lowest associated weight (cost). The composition
of knowledge sources is a one-off process and is done offline.
Therefore it is often referred to as static composition.
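The weighted mapping described above can be sketched in code. The following is a minimal illustration, assuming the tropical semiring (weights are costs, accumulated by addition along a path, with the minimum taken over competing paths); the symbols, weights and the `"-"` epsilon-output convention are illustrative, not from the paper.

```python
# A toy WFST: arcs carry (input symbol, output symbol, weight, next state).
class WFST:
    def __init__(self):
        self.arcs = {}        # state -> list of (isym, osym, weight, next_state)
        self.start = 0
        self.finals = set()

    def add_arc(self, src, isym, osym, weight, dst):
        self.arcs.setdefault(src, []).append((isym, osym, weight, dst))

    def path_weight(self, input_seq):
        """Return (min cost, output sequence) over all accepting paths
        consuming input_seq, or None if no path accepts it."""
        best = None

        def dfs(state, pos, cost, out):
            nonlocal best
            if pos == len(input_seq):
                if state in self.finals and (best is None or cost < best[0]):
                    best = (cost, tuple(out))
                return
            for isym, osym, w, dst in self.arcs.get(state, []):
                if isym == input_seq[pos]:
                    dfs(dst, pos + 1, cost + w, out + [osym])

        dfs(self.start, 0, 0.0, [])
        return best

# Toy lexicon-style mapping: phones "k ae t" -> word "cat".
t = WFST()
t.add_arc(0, "k", "cat", 0.5, 1)   # emit the word on the first phone
t.add_arc(1, "ae", "-", 0.0, 2)    # "-" stands for an epsilon output
t.add_arc(2, "t", "-", 0.0, 3)
t.finals.add(3)
print(t.path_weight(["k", "ae", "t"]))  # -> (0.5, ('cat', '-', '-'))
```

In an LVCSR decoder the search is of course a beam-pruned Viterbi pass rather than the exhaustive depth-first search used here.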
There are two main advantages with the static approach. First,
the decoder design is simple because all the knowledge sources are
integrated into one compact WFST. The knowledge sources are
decoupled from the Viterbi search and therefore the decoder does not
need to perform any combination of knowledge sources during
decoding. The second advantage is that the fully integrated transducer
can be further optimized by algorithms such as determinization,
minimization and weight-pushing [1, 3].
Despite the above advantages, there are several drawbacks with
the static approach. They include:
• The composition and optimization of the fully integrated WFST
has a prohibitively high memory requirement when the constituent
WFSTs are large and complex;
• The size of the fully integrated WFST can be very large, resulting
in a large memory requirement during decoding;
• It does not allow on-line modification of knowledge sources
once they have been fully integrated.
This work was supported by the EU 6th FWP IST integrated project AMI
and the Swiss National Science Foundation through the National Center of
Competence in Research (NCCR) on “Interactive Multimodal Information
Management (IM2)”.
One way of addressing these issues is to perform dynamic trans-
ducer composition during decoding. Instead of representing the en-
tire search space by an optimized transducer, it is possible to factor-
ize the search space into two or more transducers. These component
transducers are built statically and optimized separately. The combi-
nation is done dynamically during decoding.
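The essence of dynamic composition is that composed states, which are pairs of component states, are expanded only when the search first reaches them, so the full composed machine never needs to be built. The following sketch illustrates this on-demand expansion for two epsilon-free toy transducers; the function name, data layout and symbols are illustrative, not the paper's algorithm.

```python
# Lazy (on-the-fly) composition: arcs of the composed machine are
# generated and cached only for state pairs the decoder actually visits.
def lazy_compose_arcs(a_arcs, b_arcs):
    """a_arcs / b_arcs: dict state -> list of (isym, osym, weight, next)."""
    cache = {}

    def expand(pair):
        if pair in cache:
            return cache[pair]
        qa, qb = pair
        out = []
        for isym, mid, wa, na in a_arcs.get(qa, []):
            for msym, osym, wb, nb in b_arcs.get(qb, []):
                if mid == msym:  # match A's output against B's input
                    out.append((isym, osym, wa + wb, (na, nb)))
        cache[pair] = out
        return out

    return expand

# Toy components: A maps x -> y, B maps y -> z.
A = {0: [("x", "y", 1.0, 1)]}
B = {0: [("y", "z", 2.0, 1)]}
expand = lazy_compose_arcs(A, B)
print(expand((0, 0)))  # -> [('x', 'z', 3.0, (1, 1))]
```

A naive expansion like this can still generate non-coaccessible paths and delays weight information; addressing exactly these two problems is the subject of the algorithm proposed in this paper.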
In this paper, we investigate several existing dynamic composition
approaches and propose our improved algorithm, which avoids the
creation of non-coaccessible transitions, performs weight look-ahead
and does not impose any constraints on the topology of component
WFSTs. The paper is organized as follows. Section 2 briefly
describes static WFST composition and how a fully integrated WFST
is generated. Section 3 gives a general overview of current
approaches to dynamic WFST composition. Section 4 describes our
dynamic composition algorithm. Experimental results on different
composition methods are shown in Section 5. Finally, Section 6
concludes the paper.
2. STATIC WFST COMPOSITION
Static WFST composition involves integration of all the knowledge
sources. It can be represented by the following expression [2].
N = π_ε(min(det(H̃ ∘ det(C̃ ∘ det(L̃ ∘ G)))))    (1)
In the above expression, H̃ represents the HMM topology; C̃ is a
WFST which maps context-dependent phones to context-independent
phones; L̃ is the lexicon WFST and G is the language model (LM)
WFST. The symbol ∘ is the composition operator. Transducer
optimization algorithms, for example determinization and minimization,
are represented by the det and min operators respectively. The ~
symbol means that the WFST is augmented with auxiliary symbols which
are necessary for the success of transducer optimization. The π_ε
operation replaces the auxiliary symbols by ε (null) symbols. The
final transducer N is a fully integrated transducer which maps HMM
state sequences to word sequences.
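The innermost step of Eq. (1), composing two transducers such as L̃ ∘ G, can be sketched as follows. This toy version handles only epsilon-free machines over the tropical semiring and keeps only states reachable from the composed start state; real toolkits additionally handle epsilon transitions via composition filters. The L and G data below are illustrative.

```python
from collections import deque

def compose(a, b):
    """Compose two epsilon-free WFSTs.

    Each machine is (start, finals, arcs) with
    arcs: state -> list of (isym, osym, weight, next_state).
    """
    a_start, a_finals, a_arcs = a
    b_start, b_finals, b_arcs = b
    start = (a_start, b_start)
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = state = queue.popleft()
        if qa in a_finals and qb in b_finals:
            finals.add(state)
        for i, m, wa, na in a_arcs.get(qa, []):
            for m2, o, wb, nb in b_arcs.get(qb, []):
                if m == m2:  # A's output matches B's input
                    nxt = (na, nb)
                    arcs.setdefault(state, []).append((i, o, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return start, finals, arcs

# Toy "lexicon" L (phone -> word) composed with toy "grammar" G (word cost).
L = (0, {1}, {0: [("k", "cat", 0.5, 1)]})
G = (0, {1}, {0: [("cat", "cat", 2.0, 1)]})
start, finals, arcs = compose(L, G)
print(arcs[start])  # -> [('k', 'cat', 2.5, (1, 1))]
print(finals)       # -> {(1, 1)}
```

In the static approach of Eq. (1), the result of each such composition is additionally determinized, minimized and weight-pushed before use.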
IV 345 1424407281/07/$20.00 ©2007 IEEE ICASSP 2007