An Information Filter for Voice Prompt Suppression

John McDonough (1), Wei Chu (2), Kenichi Kumatani (3), Bhiksha Raj (1), Jill Fain Lehman (3)

(1) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
(2) Dept. of Electrical Engineering, University of California, Los Angeles, Los Angeles, CA 90024, USA
(3) Disney Research, Pittsburgh, Pittsburgh, PA 15213, USA

johnmcd@cs.cmu.edu, weichu@ee.ucla.edu, kenichi.kumatani@disneyresearch.com, bhiksha@cs.cmu.edu, jill.lehman@disneyresearch.com

Abstract

Modern speech-enabled applications provide for dialog between a machine and one or more human users. The machine prompts the user with queries that are either prerecorded or synthesized on the fly. The human users respond with their own voices, and their speech is then recognized and understood by a human language understanding module. In order to achieve as natural an interaction as possible, the human user(s) must be allowed to interrupt the machine during a voice prompt. In this work, we compare two techniques for such voice prompt suppression (VPS). The first is a straightforward adaptation of a conventional Kalman filter, which has certain advantages over the normalized least mean squares (NLMS) algorithm in terms of robustness and speed of convergence. The second algorithm, which is novel in this work, is also based on a Kalman filter, but differs from the first in that the update or correction step is performed in information space. This allows diagonal loading to be used to control the growth of the subband filter coefficients, and thereby adds robustness to the VPS.

Index Terms: acoustic echo cancellation, speech recognition

1. Introduction

Modern speech-enabled applications provide for dialog between a machine and one or more human users. The machine prompts the user with queries that are either prerecorded or synthesized on the fly.
The human users respond with their own voices, and their speech is then recognized and understood by a human language understanding module. In order to achieve as natural an interaction as possible, the human user(s) must be allowed to interrupt the machine during a voice prompt. This implies that the recognition engine must be running even during the voice prompt; hence, the capacity to suppress the voice prompt in the signals captured by one or more far-field microphones is essential.

The task of voice prompt suppression (VPS) is similar to that of acoustic echo cancellation (AEC). Most algorithms for AEC proposed in the literature are based on the normalized least mean squares (NLMS) algorithm developed in the field of adaptive filtering; see [1, 2, 3, 4], for example. The first algorithm investigated here is a straightforward adaptation of a conventional Kalman filter, which has certain advantages over the NLMS algorithm in terms of robustness and speed of convergence; this algorithm is similar to that described in [5]. The second algorithm, which is novel in this work, is also based on a Kalman filter, but differs from the first in that the update or correction step is performed in information space. The advantage of this approach is that the information matrix can be diagonally loaded in order to control the magnitude of the subband filter coefficients, which provides for better robustness.

As the adaptive filter tends to diverge when speech from the desired speaker is present, a double-talk detector (DTD) is needed to halt the adaptation of filter coefficients during segments containing double-talk [6]; i.e., when both the voice prompt and desired speaker are active. Jia et al. combined local decisions of double-talk detectors on subbands to make a global decision of the presence of the near-end speaker [7].
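
For orientation, the NLMS baseline and the role of a double-talk decision can be sketched in a few lines. This is a minimal time-domain illustration and not the subband implementation used in this work; the function name `nlms_suppress`, its parameters, and the per-sample `adapt` flags (standing in for the output of a double-talk detector) are all hypothetical.

```python
import random


def nlms_suppress(prompt, mic, num_taps=8, step=0.5, eps=1e-8, adapt=None):
    """Time-domain NLMS sketch (hypothetical helper, not the paper's code).

    prompt : known voice prompt samples
    mic    : microphone samples containing the echoed prompt
    adapt  : optional per-sample booleans; False freezes the coefficients,
             as a double-talk detector would during double-talk
    Returns the final filter coefficients and the residual signal.
    """
    w = [0.0] * num_taps
    residual = []
    for n in range(len(mic)):
        # Most recent prompt samples, newest first; zero before the start.
        x = [prompt[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_estimate  # residual after prompt suppression
        if adapt is None or adapt[n]:
            # Normalized update: step size scaled by the input energy.
            norm = sum(xk * xk for xk in x) + eps
            w = [wk + step * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return w, residual


# Synthetic usage: a white-noise "prompt" passed through an assumed
# 4-tap echo path, with no desired speech present.
rng = random.Random(0)
echo_path = [0.5, -0.3, 0.2, 0.1]
prompt = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic = [
    sum(echo_path[k] * prompt[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(prompt))
]
w, residual = nlms_suppress(prompt, mic, num_taps=4)
```

In this noiseless setting the estimated coefficients approach the assumed echo path and the residual decays toward zero. Passing `adapt` flags that are False wherever the desired speaker is active keeps the coefficients fixed during those samples, which is the purpose of the DTD described above.
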
In this paper, we also propose a subband double-talk detection algorithm in which the filter for a subband is only updated when the subband speaker-to-voice prompt energy ratio is sufficiently high. The proposed subband DTD is shown to be effective in increasing the rate of convergence of the subband filters and hence in improving the sound quality.

In Section 2, we review the conventional NLMS and covariance Kalman filter techniques for voice prompt suppression (VPS). We also present the VPS algorithm based on the information Kalman filter proposed here and discuss its similarities and differences with the covariance form of the filter. Our initial experimental results with the proposed technique are tabulated and described in Section 3. A set of distant speech recognition (DSR) experiments demonstrates that the information filter provides performance superior to that obtained with the conventional Kalman filter. In the final section of this work, we present our conclusions and a brief description of our plans for future research.

2. The Information Filter

In this section, we describe the components of a VPS system. We then briefly present the operational details of the standard NLMS algorithm, as well as those of both the conventional and information formulations of the Kalman filter; we discuss how all three algorithms can be used for VPS. Finally, we present a novel DTD algorithm.

2.1. Voice Prompt Suppression

Let us define the following components of our voice prompt cancellation application:
$V(z)$ denotes the transform of the known voice prompt;
$S(z)$ denotes the transform of the unknown desired speech;
$R(z) \triangleq \sum_{n=0}^{L-1} r[n]\, z^{-n}$ denotes the transform of the FIR filter simulating the room impulse response;
$G(z)$ is the transform of the actual, unknown room impulse response (RIR) for the voice prompt;