LET-Decoder: A WFST-based Lazy-evaluation Token-group Decoder with Exact Lattice Generation

Hang Lv, Student Member, IEEE, Daniel Povey, Mahsa Yarmohammadi, Ke Li, Student Member, IEEE, Yiming Wang, Lei Xie, Senior Member, IEEE, Sanjeev Khudanpur, Member, IEEE

Abstract—We propose a novel lazy-evaluation token-group decoding algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous speech recognition. In the standard on-the-fly composition decoder, a base WFST and one or more incremental WFSTs are composed during decoding, and the token-passing algorithm is then employed to generate the lattice on the composed search space, resulting in substantial computational overhead. To improve speed, the proposed algorithm adopts 1) a token-group method, which groups tokens that share the same state in the base WFST on each frame and limits the capacity of each group, and 2) a lazy-evaluation method, which does not expand a token group and its source token groups until a word label is processed during decoding. Experiments show that the proposed decoder runs up to 3 times faster than the standard on-the-fly composition decoder.

Index Terms—Speech recognition, WFST, on-the-fly composition, on-the-fly lattice rescoring

I. INTRODUCTION

A decoder plays an important role in an automatic speech recognition (ASR) system: it integrates acoustic and language information to generate the most likely word sequence for an input speech signal. Many applications require the decoded results in the form of lattices [1]–[3] or N-best hypothesis lists [4], [5], since these are more informative than a single best hypothesis. Various lattice generation methods have been proposed, such as the word/phone-pair assumption [2], [6]–[8], the N-best histories method [9], [10], and the exact lattice generation method [11], which is the most widely used.
Decoders based on weighted finite-state transducers (WFSTs) [12] can efficiently compose various knowledge sources, including the acoustic model, the phonetic context decision tree, the lexicon, and the language model. Because WFST operations such as determinization and minimization make the search network very compact, WFST-based decoders generally work more efficiently than other classical approaches [13]. However, when individual knowledge sources become huge, such as a high-order language model (LM) or a million-word lexicon, the composed WFST can be memory-inefficient or even infeasible to construct.

To overcome this problem, two kinds of solutions have been proposed. The first is multiple-pass decoding [8], [14], [15], which usually breaks the decoding process into two stages: fast, relatively small models are used first to generate size-restricted lattices, which are then rescored with richer knowledge sources for better performance. However, the potential of the two-pass method is limited by the relatively small knowledge sources used in the first pass, and the two-pass procedure makes latency unavoidable; such methods are thus more suitable for offline tasks. The second method is on-the-fly (a.k.a. on-demand or lazy) composition [16]–[18], in which the WFSTs are separated into two (or more) groups and composed dynamically when needed. This reduces memory usage and is more flexible than offline composition. However, decoding becomes slower, since the search space is not optimized as well as it is in offline composition. Moreover, on-the-fly composition incurs extra computational overhead during decoding. Researchers have proposed several algorithms to optimize on-the-fly composition, such as look-ahead composition [19], [20] and on-the-fly hypothesis rescoring under the phone-pair assumption [21]–[23].

This paper proposes a novel method to optimize on-the-fly composition decoding for exact lattice generation [11]. The proposed WFST-based decoder is denoted the Lazy-evaluation Token-group decoder (LET-Decoder). This work makes three contributions: 1) we propose a token-group method and apply a lazy-evaluation method with it for speedup; 2) we employ a "bucketqueue" to implement histogram pruning in a more natural way; and 3) we develop an online version of the LET-Decoder. Experiments show that our proposed decoder can achieve up to a 3-times relative speedup. This work is open-sourced under Kaldi [24]^1.

The approach in [22] is the most closely related to our work. It performs the Viterbi search in the first WFST and rescores the hypotheses at the word level in the second WFST. However, its implementation includes an approximation during token recombination that may affect word accuracy, and the time alignment can degrade. In contrast, our proposed decoder performs exact decoding and generates exact lattices.

---
Hang Lv was with CLSP, Johns Hopkins University, Baltimore, MD, USA. He is with ASLP@NPU, School of Computer Science, Northwestern Polytechnical University, Xi'an, China (e-mail: hanglv@nwpu-aslp.org).
Daniel Povey is with Xiaomi, Beijing, China (e-mail: dpovey@xiaomi.com).
Ke Li and Yiming Wang are with CLSP, Johns Hopkins University, Baltimore, MD, USA (e-mail: keli26, yiming.wang@jhu.edu).
Lei Xie is with ASLP@NPU, School of Computer Science, Northwestern Polytechnical University, Xi'an, China (e-mail: lxie@nwpu-aslp.org).
Mahsa Yarmohammadi and Sanjeev Khudanpur are with CLSP and the Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA (e-mail: mahsa, khudanpur@jhu.edu).

^1 https://github.com/LvHang/kaldi/tree/bucket-d
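To make the on-the-fly composition setting concrete, the following is a minimal Python sketch of on-demand arc expansion over two toy acceptors. All names (`lazy_compose_arcs`, the arc-tuple representation) are hypothetical illustrations, not the paper's implementation; real toolkits such as OpenFst use a much richer delayed-composition interface.

```python
def lazy_compose_arcs(arcs_a, arcs_b, state):
    """Expand the outgoing arcs of a composed state (sa, sb) only when the
    search actually reaches it, instead of materializing the full composed
    machine up front. Each arc is a toy (label, next_state, weight) tuple,
    and weights combine by addition (tropical semiring)."""
    sa, sb = state
    out = []
    for (label_a, next_a, w_a) in arcs_a(sa):
        for (label_b, next_b, w_b) in arcs_b(sb):
            if label_a == label_b:  # matching labels compose
                out.append((label_a, (next_a, next_b), w_a + w_b))
    return out
```

The point of the sketch is that only the composed states touched by the beam search ever pay the composition cost, which is the memory saving (and the speed penalty) discussed above.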
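The first contribution, grouping tokens by their base-WFST state with a capacity limit, can be illustrated with a short Python sketch. The `Token` and `group_tokens` names and fields are hypothetical simplifications for exposition, not the decoder's actual data structures.

```python
import heapq
from collections import defaultdict

class Token:
    """A partial hypothesis: its state in the base WFST plus an
    accumulated cost (lower is better)."""
    def __init__(self, base_state, cost):
        self.base_state = base_state
        self.cost = cost

def group_tokens(tokens, capacity):
    """Group the frame's tokens by their base-WFST state, keeping only
    the `capacity` lowest-cost tokens in each group. Tokens that differ
    only in their incremental-WFST state fall into the same group, so
    the group can be expanded lazily as a unit later."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok.base_state].append(tok)
    return {state: heapq.nsmallest(capacity, group, key=lambda t: t.cost)
            for state, group in groups.items()}
```

Under the lazy-evaluation idea described above, such a group (and its source groups) would only be expanded into individual composed-state tokens once a word label is encountered.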
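The second contribution, using a bucket queue for histogram pruning, can likewise be sketched. The class below is a generic bucket queue written for illustration (names and parameters are assumptions, not the paper's code): costs are binned into fixed-width buckets relative to the current best cost, so insertion is O(1) and histogram pruning amounts to scanning buckets from cheapest to costliest until a token budget is exhausted.

```python
class BucketQueue:
    """Approximate priority queue over costs in [best_cost,
    best_cost + bucket_width * num_buckets); costlier items are
    treated as outside the beam and dropped on insertion."""
    def __init__(self, best_cost, bucket_width, num_buckets):
        self.best_cost = best_cost
        self.width = bucket_width
        self.buckets = [[] for _ in range(num_buckets)]

    def push(self, item, cost):
        idx = int((cost - self.best_cost) / self.width)
        if 0 <= idx < len(self.buckets):  # beyond the beam: drop
            self.buckets[idx].append((cost, item))

    def prune(self, max_active):
        """Histogram pruning: the survivors are the first `max_active`
        items found scanning buckets from cheapest to costliest."""
        survivors = []
        for bucket in self.buckets:
            for cost, item in bucket:
                if len(survivors) == max_active:
                    return survivors
                survivors.append(item)
        return survivors
```

Compared with sorting all tokens by cost, the bucket layout makes the "keep the best `max_active` tokens" rule fall out of the data structure naturally, which is the sense in which histogram pruning becomes "more natural" here.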