Computer Speech and Language 27 (2013) 703–725

Blind source extraction for robust speech recognition in multisource noisy environments

Francesco Nesta, Marco Matassoni
Fondazione Bruno Kessler CIT-irst, via Sommarive 18, 38123 Trento, Italy

Received 12 January 2012; received in revised form 30 July 2012; accepted 9 August 2012; available online 23 August 2012

Abstract

This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target source signal in order to recognize spoken commands uttered in reverberant and noisy environments and acquired by a microphone array. The architecture of the BSE system is based on multiple stages: (a) TDOA estimation, (b) mixing-system identification for the target source, (c) on-line semi-blind source separation and (d) source extraction. All the stages are effectively combined, allowing the target signal to be estimated with limited distortion. While a generalization of the BSE framework is described, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e. binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients. Moreover, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and to improve the overall performance. A detailed comparison between different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system yields a significant reduction of the error rate.

© 2012 Elsevier Ltd. All rights reserved.
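Stage (a) of the pipeline above, TDOA estimation, is commonly implemented with the generalized cross-correlation with phase transform (GCC-PHAT). The abstract does not state which estimator the authors use, so the following is an illustrative sketch under that assumption, with placeholder names rather than the authors' implementation:

```python
import numpy as np

def estimate_tdoa(x1, x2, fs):
    """Estimate the time difference of arrival tau (in seconds) such that
    x2(t) ~ x1(t - tau), via GCC-PHAT (an assumed, standard estimator)."""
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    # PHAT weighting: keep only the phase of the cross-spectrum
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    max_shift = n // 2
    # Rearrange so that index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy usage: a white-noise source delayed by 7 samples between channels
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
tau = estimate_tdoa(x[7:], x[:-7], fs=16000)   # expected near 7 / 16000 s
```

In a binaural setting such as CHiME, the recovered delay constrains the target's direction and can seed the subsequent mixing-system identification stage.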
1. Introduction

Although voice interaction is an appealing modality for human–machine interaction and speech processing technologies have been actively investigated, speech acquisition, processing and recognition in non-ideal acoustic environments remain complex tasks due to the presence of noise, reverberation and interfering speakers (Kellermann, 2006; Wölfel and McDonough, 2009). CHiME is an audio corpus designed for investigating robust speech processing and for comparing achievements obtained in both the speech enhancement and the recognition communities (Barker et al., submitted for publication). The recorded data include background recordings from a head simulator positioned in a domestic setting as well as binaural impulse responses collected in the same environment. By means of these genuine impulse responses, utterances from the Grid corpus (Cooke et al., 2006) have been added to this setting and mixed with the background noise to produce controlled yet natural audio data. The resulting task is to separate the speech and recognize the commands being spoken using a recognizer trained on noise-free recordings.

This paper has been recommended for acceptance by Jon Barker.
Corresponding author.
E-mail addresses: nesta@fbk.eu (F. Nesta), matasso@fbk.eu (M. Matassoni).
0885-2308/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2012.08.001