Speech Recognition in reverberant environments using remote microphones

Luca Brayda, Christian Wellekens
Institut Eurecom, 2229 Route des Cretes, 06904 Sophia Antipolis, France
{brayda,wellekens}@eurecom.fr

Marco Matassoni, Maurizio Omologo
ITC-irst, via Sommarive 18, 38050 Povo (TN), Italy
{matasso,omologo}@itc.it

Abstract

This paper addresses distant-talking speech recognition by means of remote sensors in a reverberant room. Recognition performance is investigated for different ways of initializing, steering, and optimizing the related beamformer. Results show how critical this front-end processing can be in such a challenging setup, depending on the position and orientation of the speaker.

1 Introduction

Distant-talking speech recognition is a very challenging topic. To tackle it, microphone arrays [1] are generally employed, thanks to the ability of beamforming techniques to enhance the speech message while attenuating undesired contributions of environmental noise and reverberation. Microphone arrays can be steered toward the most convenient look direction, i.e. the one that ensures the best speech recognition performance. This can be accomplished by adopting a suitable filter-and-sum beamformer [2, 3], i.e. a combination of filtered versions of all the microphone signals. In the past, a wide body of literature addressed beamforming mainly with the target of deriving a signal with good properties from the perceptual point of view, rather than of maximizing speech recognition performance. More recent works have addressed the task of improving recognizer accuracy, which can represent a quite different objective. In this regard, a technique that deserves mention is Limabeam [4], which aims to optimize the beamformer parameters, given the most likely HMM state sequence observed in a first processing step.
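To make the filter-and-sum idea concrete, the following is a minimal numpy sketch (not the system evaluated in this paper): each microphone signal passes through its own FIR filter before the channels are summed, and delay-and-sum is recovered as the special case where each filter is a pure (fractional) delay. Function names, the sinc-interpolated delay filters, and all defaults are illustrative.

```python
import numpy as np

def filter_and_sum(signals, filters):
    """Filter-and-sum beamforming.

    signals: array of shape (n_mics, n_samples)
    filters: array of shape (n_mics, filter_len), one FIR filter per channel
    Returns the sum of the per-channel filtered signals.
    """
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples + filters.shape[1] - 1)
    for m in range(n_mics):
        out += np.convolve(signals[m], filters[m])
    return out

def delay_and_sum_filters(delays, filter_len=64, fs=16000):
    """Delay-and-sum as a special case of filter-and-sum: each filter
    is a sinc-interpolated delay of `delays[m]` seconds (negative values
    advance the channel), normalized by the number of microphones.
    """
    n = np.arange(filter_len)
    center = filter_len // 2
    filters = np.stack([np.sinc(n - center - d * fs) for d in delays])
    return filters / len(delays)
```

With delays chosen so that the direct-path wavefront is time-aligned across channels, the in-phase speech adds coherently while reflections and noise, arriving with different relative delays, add incoherently.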
Moreover, intensive activity on evaluating the performance of microphone array based speech recognizers is being conducted worldwide, in particular in the communities related to the EC AMI [5] and CHIL [6] projects: NIST has recently organized benchmarking campaigns (see http://www.nist.gov/speech) which showed that, on a large vocabulary spontaneous speech recognition task, the error rate provided by a 64-microphone array based recognizer is about twice the error obtained on the corresponding close-talking microphone signal. We observe that, when dealing with a real reverberant environment, the direction that ensures the best automatic speech recognition (ASR) performance can differ from the one determined by speaker localization techniques. In the past, accurate time delay estimation methods and related speaker localization systems were developed, which can be used to select a possible steering direction. However, even with this approach, in a real-world situation one may encounter problems due to head orientation, which represents another source of variability that is very difficult to address: in other words, when the speaker is not facing the array, the speech captured by each microphone of the array will be mostly characterized by contributions due to reflections. This paper investigates distant-talking speech recognition in a real, highly reverberant environment given different speaker positions, in most cases not oriented toward the microphone array. Existing techniques are presented and some possible new improvements are proposed.
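As an illustration of the kind of time delay estimation used for speaker localization, the sketch below implements the Generalized Cross-Correlation with Phase Transform (GCC-PHAT), a common choice in reverberant rooms because the whitening of the cross-spectrum sharpens the direct-path peak. The function name and defaults are illustrative, not taken from the systems discussed above.

```python
import numpy as np

def gcc_phat(x, y, fs=16000, max_tau=None):
    """Estimate the relative delay between two microphone signals via
    GCC-PHAT. Returns the delay in seconds; a positive value means
    y lags x. `max_tau` optionally bounds the search range
    (e.g. to the microphone spacing divided by the speed of sound).
    """
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12          # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)       # whitened cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

Delays estimated pairwise in this way can be intersected geometrically to localize the speaker, or fed directly to a delay-and-sum beamformer as steering delays.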
The purpose of the work is: to describe the parameters of a general microphone array processing system (Section 2), focusing on the beamforming techniques; to outline the performance that can be obtained by steering the array in different directions (Section 3); to understand the potential of delay-and-sum beamforming, given delays extracted by a technique typically used for speaker localization purposes (Section 4); to outline the room for improvement when estimating “recognition-oriented” filters (Sections 5 and 6) or when exploiting additional information about the environment, such as the room impulse responses (Section 7). Finally, Section 8 describes the experimental setup and results (derived using a multi-microphone version of the well-known TI connected digit recognition task), and Section 9 draws conclusions and discusses future work.