Speech Communication 57 (2014) 50–62

An approach for efficient open vocabulary spoken term detection

Atta Norouzian, Richard Rose

Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada

Received 17 December 2012; received in revised form 20 August 2013; accepted 7 September 2013
Available online 25 September 2013

Corresponding author: A. Norouzian. Tel.: +1 5149677973. E-mail address: atta.norouzian@mail.mcgill.ca
http://dx.doi.org/10.1016/j.specom.2013.09.002

Abstract

A hybrid two-pass approach for facilitating fast and efficient open vocabulary spoken term detection (STD) is presented in this paper. A large vocabulary continuous speech recognition (LVCSR) system is deployed for producing word lattices from audio recordings. An index construction technique is used for facilitating very fast search of lattices for finding occurrences of both in vocabulary (IV) and out of vocabulary (OOV) query terms. Efficient search for query terms is performed in two passes. In the first pass, a subword approach is used for identifying audio segments that are likely to contain occurrences of the IV and OOV query terms from the index. A more detailed subword based search is performed in the second pass for verifying the occurrence of the query terms in the candidate segments. The performance of this STD system is evaluated in an open vocabulary STD task defined on a lecture domain corpus. It is shown that the indexing method presented here results in an index that is nearly two orders of magnitude smaller than the LVCSR lattices while preserving most of the information relevant for STD. Furthermore, despite using word lattices for constructing the index, 67% of the segments containing occurrences of the OOV query terms are identified from the index in the first pass. Finally, it is shown that the detection performance of the subword based term detection performed in the second pass has the effect of reducing the performance gap between OOV and IV query terms.

© 2013 Elsevier B.V. All rights reserved.

Keywords: Spoken term detection; Automatic speech recognition; Index

1. Introduction

There are many applications that require a capability for searching and retrieving spoken utterances from large media repositories. The input to this process is a set of either orthographic or spoken examples of search terms supplied by a user. Commercial systems and research prototypes have been developed for searching course lectures, online videos, and archived telephone conversations for segments that are relevant to user queries (Brno University Super Lectures, 2012; Microsoft MAVIS, 2012). These systems can also be used to support higher level tasks such as topic classification, message summarization, and assessing the quality of operator–customer interactions in call center scenarios (Koumpis and Renals, 2005; Mamou et al., 2006).

This paper is concerned with applications where users attempt to retrieve relevant audio segments from a large archive of recorded speech messages by entering orthographic examples of search terms through the user interface of an online search engine. There are a number of requirements associated with this class of applications. First, the search must be extremely fast. It is generally assumed that hypothesized term occurrences are returned with sub-second response latencies even for audio collections containing hundreds of hours of speech. Second, it is generally not reasonable to restrict search terms to be drawn from a finite pre-specified vocabulary. Query terms are often proper names or, in many cases, they are selected from specialized domains. For example, the task domain evaluated in this work involves course lectures taken from an online media archive on the topic of chemistry. Finally, the term