IMPROVING MULTI-LATTICE ALIGNMENT BASED SPOKEN KEYWORD SPOTTING

Hui Lin, Alex Stupakov and Jeff Bilmes
Department of Electrical Engineering, University of Washington, Seattle, Washington, USA

ABSTRACT

In previous work, we showed that using a lattice instead of the 1-best path to represent both the query and the utterance being searched is beneficial for spoken keyword spotting. In this paper, we introduce several techniques that further improve our multi-lattice alignment approach, including edit operation modeling and supervised training of the consistency conditional probability table, something which cannot be directly trained by traditional maximum likelihood estimation. Experiments on TIMIT show that the proposed methods significantly improve the performance of spoken keyword spotting.

Index Terms— Spoken keyword spotting, lattice alignment, edit operation modeling, negative training, auxiliary training

1. INTRODUCTION

In certain cases, speech-specified keyword spotting is more appropriate than text-based keyword detection, such as whenever it is inconvenient, unsafe, or impossible for the user to enter a search query using a standard keyboard. For example, modern police officers or soldiers are sometimes equipped with a multi-sensor platform that has been augmented with a close-talking microphone, a camera, and a wrist-mounted display. During many on-the-job scenarios (such as while driving, walking, or whenever the hands are unavailable), spoken queries may be more appropriate to search through recordings of conversations in order to locate audio, photos, and video that have been recorded on the device during an investigation.

In [1], we proposed a new approach to spoken keyword spotting that uses a joint alignment between multiple phone lattices. The first phone lattice comes from the database itself and can be created offline. We refer to this as the utterance lattice. A second phone lattice is generated once the user has spoken a query phrase.
This query lattice is then modified by removing its time marks, and then the two lattices are jointly aligned. Every region of time where the query lattice is properly aligned then becomes a candidate spoken keyword detection.

In this paper, we propose several methods to improve the performance of our multi-lattice alignment approach. The first method is related to edit operation modeling, which focuses on providing robustness against mistakes in the phone lattices. The learning of the consistency conditional probability table (CPT) is also investigated in this paper. As seen in [1] (a prerequisite reference for understanding the current paper), the consistency CPT plays an important role in gluing the query lattice and utterance lattice together to produce accurate alignments. Maximum likelihood training alone, however, is inappropriate for the consistency CPT; we address this issue in this paper using negative training data [2]. Experiments on TIMIT show that these methods significantly improve the performance of our spoken keyword spotting system relative to [1].

(This work was supported by DARPA's ASSIST Program (No. NBCH-C-05-0137) and an ONR MURI grant (No. N000140510388).)

2. MULTI-LATTICE ALIGNMENT

We first briefly review the multi-lattice alignment approach to spoken keyword spotting that we proposed in [1]. We refer the reader to [1] for full details of our method.

In our approach, the lattice of the query keyword and the lattice of the utterance are both represented by graphical models, as shown in Fig. 1. The upper half of the graph corresponds to the query, and the lower half to the utterance. The graphical model representations of the two lattices have similar topology: for each lattice, two distinct random variables represent the lattice node and the lattice link.
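As a concrete (and purely illustrative) picture of the two lattice types just described, the sketch below represents a phone lattice as time-marked nodes plus phone-labeled links, and derives a query-style lattice by discarding the node time marks before alignment. All class and field names here are our own invention, not structures from [1].

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Link:
    """A lattice link: one phone hypothesis spanning two lattice nodes."""
    start_node: int
    end_node: int
    phone: str  # e.g. "k", "ae"

@dataclass
class Lattice:
    """A phone lattice: nodes (optionally time-marked) plus phone links."""
    node_times: List[Optional[float]]  # start time of each node; None if discarded
    links: List[Link] = field(default_factory=list)

    def strip_times(self) -> "Lattice":
        """Produce a query-style lattice: identical topology, time marks removed."""
        return Lattice(node_times=[None] * len(self.node_times),
                       links=list(self.links))

# An utterance lattice keeps its link start/end times ...
utt = Lattice(node_times=[0.00, 0.12, 0.25],
              links=[Link(0, 1, "k"), Link(1, 2, "ae"), Link(0, 2, "g")])
# ... while a query lattice discards them before the joint alignment.
qry = utt.strip_times()
```

The time marks on the utterance side are what the time-inhomogeneous transition CPT (described next) encodes; the query side deliberately carries none, so the alignment is free to place the query anywhere in time.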
For instance, for the query lattice, we have the query node variable $N^q_t$ for the lattice node, and the query phone variable $H^q_t$ for the lattice links (which represent phones) in the phone lattice. The major difference between the two lattices is that for the graph that represents the query lattice, the time information associated with each node in the original lattice is discarded, while the graphical model for the utterance lattice uses a time-inhomogeneous conditional probability table (CPT) to encode the starting/ending time points of links in the lattice. Specifically, the utterance phone transition variable $T^u_t$ can take the value 1 (meaning there will be a transition for the utterance phone variable $H^u_{t+1}$) only when there is an actual transition in the original lattice.

The consistency variable $C_t$, which is always observed with value 1, couples the query and utterance lattices together. In particular, the CPT $p(C_t = 1 \mid H^q_t, H^u_t) = f(H^q_t, H^u_t)$ is simply a function of $H^q_t$ and $H^u_t$. If $H^q_t$ is identical or similar to $H^u_t$, $f(H^q_t, H^u_t)$ should take larger values,
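The role of the consistency CPT can be sketched as a simple phone-similarity function. The particular values and the small set of "similar" phone pairs below are illustrative placeholders of our own choosing, not the trained CPT from the paper.

```python
# Hedged sketch of a consistency function f(hq, hu): it should be large
# when the query phone and the utterance phone agree, smaller when they
# are merely confusable, and near zero otherwise. The numeric values and
# the confusable-pair set are invented for illustration only.
SIMILAR_PAIRS = {frozenset(p) for p in [("ae", "eh"), ("s", "z"), ("m", "n")]}

def consistency(hq: str, hu: str) -> float:
    if hq == hu:
        return 1.0   # identical phones: maximal consistency
    if frozenset((hq, hu)) in SIMILAR_PAIRS:
        return 0.5   # acoustically similar (confusable) phones
    return 0.05      # dissimilar phones: small but nonzero

# Playing the role of p(C_t = 1 | H^q_t, H^u_t) = f(H^q_t, H^u_t):
print(consistency("ae", "ae"))  # identical phones
print(consistency("ae", "eh"))  # similar phones
print(consistency("ae", "k"))   # dissimilar phones
```

In a real system these entries would come from a phone-confusion model or, as the paper proposes, from supervised training of the CPT itself rather than hand-set constants.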