Computer Speech and Language 27 (2013) 703–725
Blind source extraction for robust speech recognition in multisource
noisy environments
Francesco Nesta∗, Marco Matassoni
Fondazione Bruno Kessler CIT-irst, Via Sommarive 18, 38123 Trento, Italy
Received 12 January 2012; received in revised form 30 July 2012; accepted 9 August 2012
Available online 23 August 2012
Abstract
This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target
speech source in order to recognize spoken commands uttered in reverberant and noisy environments and acquired by a microphone array.
The architecture of the BSE system is based on multiple stages: (a) TDOA estimation, (b) mixing system identification for the target
source, (c) on-line semi-blind source separation and (d) source extraction. All the stages are effectively combined, allowing the
estimation of the target signal with limited distortion.
While a generalization of the BSE framework is described, here the proposed system is evaluated on the data provided for the
PASCAL CHiME 2011 challenge, i.e. binaural recordings made in a real-world domestic environment. The CHiME mixtures are
processed with the BSE and the recovered target signal is fed to a recognizer, which uses noise robust features based on Gammatone
Frequency Cepstral Coefficients. Moreover, acoustic model adaptation is applied to further reduce the mismatch between training
and testing data and improve the overall performance. A detailed comparison between different models and algorithmic settings is
reported, showing that the approach is promising and the resulting system gives a significant reduction of the error rate.
© 2012 Elsevier Ltd. All rights reserved.
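As background for stage (a) of the pipeline above, the time difference of arrival (TDOA) between two microphone channels is commonly estimated with a GCC-PHAT cross-correlation. The following is a minimal illustrative sketch of that standard technique, not the authors' implementation; function name and parameters are our own:

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the TDOA of signal y relative to x via GCC-PHAT.

    A positive return value means y lags (arrives after) x.
    """
    n = len(x) + len(y)
    n_fft = 1 << (n - 1).bit_length()        # next power of two
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                   # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n_fft)            # generalized cross-correlation
    max_shift = n_fft // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # TDOA in seconds
```

For example, a signal delayed by 32 samples at 16 kHz yields a TDOA estimate of 2 ms. In practice the CHiME binaural data would call for frame-wise estimation with temporal smoothing, but the PHAT weighting shown here is the core of the approach.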
1. Introduction
Although voice interaction is an appealing modality for human/machine interaction and speech processing technolo-
gies have been actively investigated, speech acquisition, processing and recognition in non-ideal acoustic environments
are still complex tasks due to the presence of noise, reverberation and interfering speakers (Kellermann, 2006; Wölfel
and McDonough, 2009). CHiME is an audio corpus designed for investigating robust speech processing and for
comparing achievements obtained in both the speech enhancement and the recognition communities (Barker et al.,
submitted for publication). The recorded data includes background recordings from a head simulator positioned in a
domestic setting as well as binaural impulse responses collected in the same environment. By means of these measured
responses, utterances from the Grid corpus (Cooke et al., 2006) have been convolved into this setting and mixed with
the background noise to produce controlled yet natural audio data. The resulting task is to separate the speech and
recognize the commands being spoken using a recognizer trained on noise-free recordings.
This paper has been recommended for acceptance by Jon Barker.
∗ Corresponding author.
E-mail addresses: nesta@fbk.eu (F. Nesta), matasso@fbk.eu (M. Matassoni).
0885-2308/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2012.08.001