An Information Filter for Voice Prompt Suppression

John McDonough (1), Wei Chu (2), Kenichi Kumatani (3), Bhiksha Raj (1), Jill Fain Lehman (3)

(1) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
(2) Dept. of Electrical Engineering, University of California, Los Angeles, Los Angeles, CA 90024, USA
(3) Disney Research, Pittsburgh, Pittsburgh, PA 15213, USA

johnmcd@cs.cmu.edu, weichu@ee.ucla.edu, kenichi.kumatani@disneyresearch.com, bhiksha@cs.cmu.edu, jill.lehman@disneyresearch.com

Abstract

Modern speech-enabled applications provide for dialog between a machine and one or more human users. The machine prompts the user with queries that are either prerecorded or synthesized on the fly. The human users respond with their own voices, and their speech is then recognized and understood by a human language understanding module. In order to achieve as natural an interaction as possible, the human user(s) must be allowed to interrupt the machine during a voice prompt. In this work, we compare two techniques for such voice prompt suppression (VPS). The first is a straightforward adaptation of a conventional Kalman filter, which has certain advantages over the normalized least mean squares algorithm in terms of robustness and speed of convergence. The second algorithm, which is novel in this work, is also based on a Kalman filter, but differs from the first in that the update or correction step is performed in information space. This allows the use of diagonal loading to control the growth of the subband filter coefficients, and thereby adds robustness to the VPS.

Index Terms: acoustic echo cancellation, speech recognition

1. Introduction

Modern speech-enabled applications provide for dialog between a machine and one or more human users. The machine prompts the user with queries that are either prerecorded or synthesized on the fly.
The human users respond with their own voices, and their speech is then recognized and understood by a human language understanding module. In order to achieve as natural an interaction as possible, the human user(s) must be allowed to interrupt the machine during a voice prompt. This implies that the recognition engine must be running even during the voice prompt; hence, the capacity to suppress the voice prompt in the signals captured by one or more far-field microphones is essential.

The task of voice prompt suppression (VPS) is similar to that of acoustic echo cancellation (AEC). Most algorithms for AEC proposed in the literature are based on the normalized least mean squares (NLMS) algorithm developed in the field of adaptive filtering; see [1, 2, 3, 4], for example. The first algorithm investigated here is a straightforward adaptation of a conventional Kalman filter, which has certain advantages over the NLMS algorithm in terms of robustness and speed of convergence; this algorithm is similar to that described in [5]. The second algorithm, which is novel in this work, is also based on a Kalman filter, but differs from the first in that the update or correction step is performed in information space. The advantage of this approach is that the information matrix can be diagonally loaded in order to control the magnitude of the subband filter coefficients, which provides for better robustness.

As the adaptive filter tends to diverge when speech from the desired speaker is present, a double-talk detector (DTD) is needed to halt the adaptation of filter coefficients during segments containing double-talk [6]; i.e., when both the voice prompt and desired speaker are active. Jia et al. combined local decisions of double-talk detectors on subbands to make a global decision of the presence of the near-end speaker [7].
In this paper, we also propose a subband double-talk detection algorithm in which the filter for a subband is only updated when the subband speaker-to-voice prompt energy ratio is sufficiently high. The proposed subband DTD is shown to be effective in increasing the rate of convergence of the subband filters and hence in improving the sound quality.

In Section 2, we review the conventional NLMS and covariance Kalman filter techniques for voice prompt suppression (VPS). We also present the VPS algorithm based on the information Kalman filter proposed here and discuss its similarities and differences with the covariance form of the filter. Our initial experimental results with the proposed technique are tabulated and described in Section 3. A set of distant speech recognition (DSR) experiments demonstrates that the information filter provides performance superior to that obtained with the conventional Kalman filter. In the final section of this work, we present our conclusions and a brief description of our plans for future research.

2. The Information Filter

In this section, we describe the components of a VPS system. We then briefly present the operational details of the standard NLMS algorithm, as well as those of both the conventional and information formulations of the Kalman filter; we discuss how all three algorithms can be used for VPS. Finally, we present a novel DTD algorithm.

2.1. Voice Prompt Suppression

Let us define the following components of our voice prompt cancellation application:

• $V(z)$ denotes the transform of the known voice prompt;
• $S(z)$ denotes the transform of the unknown desired speech;
• $R(z) \triangleq \sum_{n=0}^{L-1} r[n] z^{-n}$ denotes the transform of the FIR filter simulating the room impulse response;
• $G(z)$ is the transform of the actual, unknown room impulse response (RIR) for the voice prompt;