DISTANT SPEECH RECOGNITION: BRIDGING THE GAPS

John McDonough 1,3
Saarland University
Spoken Language Systems
D-66123 Saarbrücken, Germany

Matthias Wölfel 2,3
Universität Karlsruhe (TH)
Institut für Theoretische Informatik
Am Fasanengarten 5, D-76131 Karlsruhe, Germany

ABSTRACT

While great progress has been made in both acoustic array processing and automatic speech recognition, there is currently a relatively large rift between researchers engaged in the two fields. This is unfortunate for many reasons, but most of all because it prevents the two sides, both of whom are investigating different aspects of the same problem, from truly understanding one another and cooperating. In many cases, the two sides see each other through the eyes of strangers. If groundbreaking progress is to be made in the emerging field of distant speech recognition (DSR), this abysmal state of affairs must change. In this work, we outline five pressing problems in the DSR research field, and we make initial proposals for their solutions. The problems discussed here are by no means the only ones that must be solved in order to construct truly effective DSR systems. Nonetheless, their solution, in our view, would represent a significant first step towards this goal, inasmuch as the solution of each of these problems will require a substantial change in the mindsets and thought patterns of those engaged in this field of research.

Index Terms: speech feature enhancement, particle filter, multi-step linear prediction, joint denoising and dereverberation, automatic speech recognition, beamforming, microphone arrays

1. INTRODUCTION

While great progress has been made in both fields, there is currently a relatively large rift between researchers engaged in acoustic array processing and those engaged in automatic speech recognition (ASR).
This is unfortunate for many reasons, but most of all because it prevents the two sides, who are investigating different aspects of the same problem, from truly understanding one another and cooperating. In many cases, each side is either ignorant or dismissive of research progress made by the other. Groundbreaking progress requires that this state of affairs change.

In this work, which is admittedly somewhat "editorial" in nature, we report the perspective of two researchers who have experience on both sides. While we began our careers conducting research in ASR, we have in recent years also had considerable experience with microphone arrays and multichannel signal processing techniques. We have learned the patterns of thought, both good and bad, from both sides. Here we hope to offer a unique perspective from people with one foot in each world.

In particular, our discussion is organized around five gaps in the emerging field of distant speech recognition (DSR). In each of these gaps, we perceive the possibility of making significant progress in the coming years through a change of research paradigm, so to speak, by looking at the problem through new eyes. The first gap concerns the unification of the research community involved in conventional beamforming with that involved in independent component analysis (ICA).

1 The first author is grateful for the financial support of the German Research Foundation (DFG) in connection with the international research training network IRTG 715 "Language Technology and Cognitive Systems".
2 The second author is grateful for the financial support of the DFG under Sonderforschungsbereich SFB 588: "Humanoid Robots—Learning and Cooperating Multimodal Robots".
3 Both authors are grateful to John Wiley & Sons, Ltd. for permission to reproduce the images contained in this work.
Each community examines the same problem, but defines itself by what knowledge it does not consider: those doing conventional beamforming confine themselves to second-order statistics, while those active in the ICA field use no geometric information. Hence our question: why can algorithms not be formulated that exploit both higher-order statistics and geometric information? It may arguably be said that neither source of information alone is sufficient for building effective DSR systems. But perhaps, when used together with a bit of innovation, they are.

The second gap pertains to the formulation of a consistent approach to combating the two most prominent distortions introduced by realistic acoustic environments, namely, noise and reverberation. Nearly all known techniques, such as spectral subtraction or multi-step linear prediction, are designed to suppress only one of these two distortions. We refer to current work based on the combination of a particle filter and multi-step linear prediction that aims to suppress both distortions simultaneously, and we make further suggestions for continuing in that direction.

The third gap addresses the need for more effective integration of beamforming and post-filtering. As is well known, the minimum mean squared error (MMSE) beamformer consists of a minimum variance distortionless response (MVDR) beamformer followed by some variation on the Wiener postfilter [1, §6.2.2]. This optimality, however, is based solely on second-order statistics. Can more effective post-filtering algorithms be developed based on the use of higher-order statistics?

As pointed out by Seltzer et al. [2], the information provided by a hidden Markov model (HMM) can be effectively incorporated into a beamforming algorithm in order to account for the non-stationarity of speech. How can this information be combined with other knowledge sources?
For example, how can it be combined with knowledge about the non-Gaussian nature of speech? This question constitutes the fourth gap, for which we discuss possible remedies and solutions.

The final gap, and the one most desperately in need of closing, is the one mentioned first: how can the acoustic array processing and automatic speech recognition research communities truly be integrated? What must each learn from the other? What practices must each adopt from the other? What set of skills must a new generation of researchers possess in order to effectively solve the distant speech recognition problem?
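The MVDR-plus-postfilter decomposition of the MMSE beamformer mentioned above can be sketched for a single frequency bin. The following is a minimal illustration, not code from the paper or from [1]; the function names and steering vector are hypothetical, and a real DSR front end would estimate the noise covariance matrix and the speech and noise power spectral densities from data.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    # MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d): the look
    # direction passes undistorted (w^H d = 1) while the output
    # noise power is minimized.
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def wiener_gain(speech_psd, noise_psd):
    # Single-channel Wiener postfilter applied to the beamformer
    # output, based on (estimated) speech and noise PSDs.
    return speech_psd / (speech_psd + noise_psd)

# Toy example: 4-sensor array, one frequency bin,
# hypothetical steering vector with unit-magnitude entries.
M = 4
d = np.exp(-2j * np.pi * 0.1 * np.arange(M))
Rn = np.eye(M, dtype=complex)   # spatially white noise
w = mvdr_weights(Rn, d)
```

A useful sanity check: for spatially white noise the MVDR weights reduce to the delay-and-sum weights d / M, and the distortionless constraint w^H d = 1 holds by construction.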