FEATHERWEIGHT PHONETIC KEYWORD SEARCH FOR CONVERSATIONAL SPEECH

Keith Kintzley †⋆, Aren Jansen ⋆, Hynek Hermansky ⋆
† U.S. Naval Academy, Annapolis, MD, USA
⋆ Human Language Technology Center of Excellence, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA

ABSTRACT

The point process model (PPM) for keyword search is a phonetic event-driven approach that provides a whole-word focused alternative to fast lattice matching techniques. Recent efforts in PPMs have focused on improved model estimation techniques and efficient search algorithms, but past evaluations have been limited to searching relatively easy scripted corpora for simple unigram queries, preventing comprehensive benchmarking against standard search methods. In this paper, we present techniques for score normalization and the processing of multi-word and out-of-training query terms as required by the 2006 NIST Spoken Term Detection (STD) evaluation, permitting the first comprehensive benchmark of PPM search technology against state-of-the-art word-based and phonetic-based search systems. We demonstrate PPM to be the fastest phonetic system while posting accuracies competitive with the best phonetic alternatives. Moreover, index construction time and size are better than any keyword search system entered in the NIST evaluation.

Index Terms— point process model, spoken term detection, score normalization, compact speech indexing

1. INTRODUCTION

The point process model represents a fundamentally distinct approach to the problem of speech recognition. Given that speech arises from the highly coupled movement of articulators, a core feature of the PPM framework is the notion that words are characterized by temporal patterns of speech sounds (i.e., phonetic events). Current theories of human language acquisition also lend credence to this whole-word approach to recognition.
Contrary to previous beliefs about phonemic development, a large body of evidence supports the hypothesis that infants first recognize whole words and only later construct an inventory of phonemes [1]. Additionally, the fundamental importance of temporal relations in human speech perception is corroborated by the finding that a basic neurological impairment in temporal processing lies at the root of most language learning impairment in children [2]. Beyond motivations in human speech perception, the PPM framework also possesses fundamental computational advantages. The reduction of speech to a set of distinct phonetic events produces an exceedingly sparse representation. Not only does this permit compact storage, but it also enables very fast search.

The original formulation of the point process model for keyword spotting was presented in [3]. Distinct from the dense, frame-by-frame representations of speech that characterize hidden Markov model (HMM) approaches, the PPM framework operates on a sparse sequence of discrete phonetic events, and words are modeled as inhomogeneous Poisson processes. This initial work presented keyword search experiments on the TIMIT dataset as well as the BU Radio News corpus and demonstrated that the PPM system compared favorably with HMM keyword-filler approaches. A related work [4] explored an alternative method of determining phonetic events from phone posteriorgram data. It showed that the use of phonetic matched filters and appropriate threshold selection resulted in 40% fewer phonetic events and a 20% improvement in word spotting performance. Capitalizing on this extremely sparse representation of speech, [5] introduced an upper bound on the PPM detection function that enabled keyword search more than 500,000 times faster than real time.

Other related works have addressed the issue of estimating PPM word models.
In the original presentation [3], inhomogeneous rate parameters were derived from maximum likelihood estimates (MLE), which necessitated the use of numerous keyword training examples. In [6], we demonstrated that a Bayesian approach could be applied to whole-word model estimation, significantly reducing the required number of word examples. Subsequent work presented in [7] developed improved techniques for synthesizing prior models of phonetic timing distributions using Monte Carlo and CART approaches.

Distinct from these previous works, here we address several challenges necessary for extending PPM techniques to the task of spoken term detection in conversational telephone speech. First, we consider approaches to modeling and search for multi-word terms as required in the 2006 NIST STD evaluation. We also evaluate techniques for estimating the duration of words not present in training. Next, we address score normalization of PPM detections for subsequent evaluation under the actual term-weighted value (ATWV) metric. Finally, we present the performance of a PPM system on the 2006 NIST STD evaluation data in relation to other competitive systems.

2. PPM FOR SPOKEN TERM DETECTION

In PPM keyword search, speech is first distilled to a discrete set of points in time called phonetic events, which correspond to the occurrence of phones. Typically, the acoustic signal is processed using MLP-based phone detectors that produce a phone posteriorgram representation from which phonetic events are extracted. Candidate occurrences of a keyword are identified from the PPM detection function, defined as the ratio of the likelihood of a set of phonetic events under a keyword model relative to its likelihood under a background model.
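The event extraction step described above can be sketched as thresholded peak picking over a phone posteriorgram. This is a simplified stand-in for the matched-filter approach of [4]; the function name, threshold value, and frame rate are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def extract_events(posteriorgram, threshold=0.5, frame_rate=100):
    """Distill a phone posteriorgram into sparse phonetic events.

    posteriorgram: (n_frames, n_phones) array of per-frame phone posteriors.
    Returns a sorted list of (time_sec, phone_index) pairs, one per local
    posterior maximum that exceeds the threshold.
    """
    events = []
    n_frames, n_phones = posteriorgram.shape
    for p in range(n_phones):
        track = posteriorgram[:, p]
        for t in range(1, n_frames - 1):
            # keep only above-threshold local maxima as discrete events
            if track[t] >= threshold and track[t] > track[t - 1] and track[t] >= track[t + 1]:
                events.append((t / frame_rate, p))
    return sorted(events)
```

The resulting event list is the only representation retained for search, which is the source of the framework's compact storage and speed.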
Given a keyword w and a set of observed phonetic events O(t) in the interval (t, t + T], the detection function d_w(t) is given by

    d_w(t) = log [ P(O(t) | θ_w, T) / P(O(t) | θ_bg, T) ],

where θ_w corresponds to the keyword-specific model parameters, θ_bg corresponds to the background model parameters, and T is the keyword duration. This detection function is simply a log-likelihood ra-
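The detection function above can be sketched in code by expanding both Poisson likelihoods. A minimal sketch, assuming the keyword model supplies each phone's inhomogeneous rate as a callable over normalized position in the window (together with its precomputed integral) and the background model is a constant rate per phone; all names and the rate representation are illustrative:

```python
import math

def detection_score(events, t, T, kw_rates, bg_rates):
    """Log-likelihood ratio d_w(t) for the window (t, t+T].

    events:   list of (time_sec, phone) pairs (the sparse representation).
    kw_rates: phone -> (rate_fn, integral); rate_fn gives the keyword
              model's Poisson rate (events/sec) at a normalized position
              in [0, 1], integral is its integral over one window.
    bg_rates: phone -> constant background Poisson rate (events/sec).
    """
    score = 0.0
    # difference of the exponential ("no event") terms of the two models
    for phone, (rate_fn, integral) in kw_rates.items():
        score -= integral - bg_rates[phone] * T
    # each event inside (t, t+T] contributes a log rate ratio
    for (ti, phone) in events:
        if t < ti <= t + T and phone in kw_rates:
            rate_fn, _ = kw_rates[phone]
            lam = rate_fn((ti - t) / T)
            score += math.log(max(lam, 1e-12)) - math.log(bg_rates[phone])
    return score
```

A full system would evaluate this score over many candidate window positions t; the upper bound of [5] exists precisely to prune that sweep.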