DISTANT SPEECH RECOGNITION: BRIDGING THE GAPS

John McDonough 1,3
Saarland University
Spoken Language Systems
D-66123 Saarbrücken, Germany

Matthias Wölfel 2,3
Universität Karlsruhe (TH)
Institut für Theoretische Informatik
Am Fasanengarten 5, D-76131 Karlsruhe, Germany

ABSTRACT

While great progress has been made in both acoustic array processing and automatic speech recognition, there is currently a relatively large rift between researchers engaged in the two fields. This is unfortunate for many reasons, but most of all because it prevents the two sides, both of whom are investigating different aspects of the same problem, from truly understanding one another and cooperating. In many cases, the two sides see each other through the eyes of strangers. If groundbreaking progress is to be made in the emerging field of distant speech recognition (DSR), this abysmal state of affairs must change. In this work, we outline five pressing problems in the DSR research field, and we make initial proposals for their solutions. The problems discussed here are by no means the only ones that must be solved in order to construct truly effective DSR systems. Nonetheless, their solution, in our view, would represent a significant first step towards this goal, inasmuch as the solution of each of these problems will require a substantial change in the mindsets and thought patterns of those engaged in this field of research.

Index Terms: speech feature enhancement, particle filter, multi-step linear prediction, joint denoising and dereverberation, automatic speech recognition, beamforming, microphone arrays

1. INTRODUCTION

While great progress has been made in both fields, there is currently a relatively large rift between researchers engaged in acoustic array processing and those engaged in automatic speech recognition (ASR).
This is unfortunate for many reasons, but most of all because it prevents the two sides, who are investigating different aspects of the same problem, from truly understanding one another and cooperating. In many cases, each side is either ignorant or dismissive of research progress made by the other. Groundbreaking progress requires that this state of affairs change.

In this work, which is admittedly somewhat "editorial" in nature, we report the perspective of two researchers who have experience on both sides. While we began our careers conducting research in ASR, we have in recent years also had considerable experience with microphone arrays and multichannel signal processing techniques. We have learned the patterns of thought, both good and bad, from both sides. Here we hope to offer a unique perspective from people with one foot in each world.

In particular, our discussion is organized around five gaps in the emerging field of distant speech recognition (DSR). In each of these gaps, we perceive the possibility of making significant progress in the coming years through a change of research paradigm, so to speak, by looking at the problem through new eyes. The first gap concerns the unification of the research community involved in conventional beamforming with that involved in independent component analysis (ICA).

1 The first author is grateful for the financial support of the German Research Foundation (DFG) in connection with the international research training network IRTG 715 "Language Technology and Cognitive Systems".
2 The second author is grateful for the financial support of the DFG under Sonderforschungsbereich SFB 588: "Humanoid Robots—Learning and Cooperating Multimodal Robots".
3 Both authors are grateful to John Wiley & Sons, Ltd. for permission to reproduce the images contained in this work.
Each community examines the same problem, but defines itself by what knowledge it does not consider: those doing conventional beamforming confine themselves to second-order statistics, while those active in the ICA field use no geometric information. Hence our question: why can algorithms not be formulated that exploit both higher-order statistics and geometric information? It may arguably be said that neither source of information alone is sufficient for building effective DSR systems. But perhaps, when used together with a bit of innovation, they are.

The second gap pertains to the formulation of a consistent approach to combating the two most prominent distortions introduced by realistic acoustic environments, namely, noise and reverberation. Nearly all known techniques, such as spectral subtraction or multi-step linear prediction, are designed to suppress only one of these two distortions. We refer to current work based on the combination of a particle filter and multi-step linear prediction that aims to suppress both distortions simultaneously, and we make further suggestions for continuing in that direction.

The third gap addresses the need for more effective integration of beamforming and post-filtering. As is well known, the minimum mean squared error (MMSE) beamformer consists of a minimum variance distortionless response (MVDR) beamformer followed by some variation on the Wiener postfilter [1, §6.2.2]. This optimality, however, is based solely on second-order statistics. Can more effective post-filtering algorithms be developed based on the use of higher-order statistics?

As pointed out by Seltzer et al. [2], the information provided by a hidden Markov model (HMM) can be effectively incorporated into a beamforming algorithm in order to account for the non-stationarity of speech. How can this information be combined with other knowledge sources?
For example, how can it be combined with knowledge about the non-Gaussian nature of speech? This question constitutes the fourth gap, for which we discuss possible remedies and solutions.

The final gap, and the one most desperately in need of closing, is the one mentioned first: how can the acoustic array processing and automatic speech recognition research communities truly be integrated? What must each learn from the other? What practices must each adopt from the other? What set of skills must a new generation of researchers possess in order to effectively solve the distant speech recognition problem?
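The MVDR-plus-postfilter decomposition of the MMSE beamformer mentioned above can be sketched for a single frequency bin. The following is a minimal illustration, not code from the paper or from [1]; the function names and steering vector are hypothetical, and a real DSR front end would estimate the noise covariance matrix and the speech and noise power spectral densities from data.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    # MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d): the look
    # direction passes undistorted (w^H d = 1) while the output
    # noise power is minimized.
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def wiener_gain(speech_psd, noise_psd):
    # Single-channel Wiener postfilter applied to the beamformer
    # output, based on (estimated) speech and noise PSDs.
    return speech_psd / (speech_psd + noise_psd)

# Toy example: 4-sensor array, one frequency bin,
# hypothetical steering vector with unit-magnitude entries.
M = 4
d = np.exp(-2j * np.pi * 0.1 * np.arange(M))
Rn = np.eye(M, dtype=complex)   # spatially white noise
w = mvdr_weights(Rn, d)
```

A useful sanity check: for spatially white noise the MVDR weights reduce to the delay-and-sum weights d / M, and the distortionless constraint w^H d = 1 holds by construction.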