SPEECH RECOGNITION IN MULTISOURCE REVERBERANT ENVIRONMENTS WITH BINAURAL INPUTS

Nicoleta Roman¹, Soundararajan Srinivasan² and DeLiang Wang³

¹Department of Mathematics, Statistics and Computer Science
The Ohio State University at Lima, Lima, OH 45804, USA
²Biomedical Engineering Center
³Department of Computer Science and Engineering and Center for Cognitive Science
The Ohio State University, Columbus, OH 43210, USA
{niki, srinivso, dwang}@cse.ohio-state.edu
ABSTRACT
We present a binaural solution to robust speech recognition in
multi-source reverberant environments. We employ the notion of
an ideal time-frequency binary mask, which selects the target if it is
stronger than the interference in a local time-frequency (T-F) unit.
Our system estimates this ideal binary mask at the output of a target
cancellation module implemented using adaptive filtering. This
mask is used in conjunction with a missing-data algorithm to
decode the target utterance. A systematic evaluation in terms of
automatic speech recognition (ASR) performance shows substantial
improvements over the baseline and better results than related
two-microphone approaches.
1. INTRODUCTION
A typical auditory environment contains multiple concurrent
sources that are reflected by surfaces and may constantly change
their locations. While human listeners are able to segregate
and recognize a target signal under such adverse conditions, robust
ASR remains a challenging problem [1]. ASR systems are typically
trained on clean speech and hence face a mismatch when tested in
noisy and reverberant conditions. In this paper we address the
problem of recognizing target speech from multi-source reverberant
binaural recordings.
Microphone array processing techniques which enhance the
target speech have been employed to improve the robustness of
ASR systems in noisy environments [2]. These techniques fall
into two broad categories: beamforming and independent
component analysis (ICA) [3]. To separate multiple sound sources,
beamforming exploits their different directions of arrival,
while ICA relies on their statistical independence. A fixed
beamformer, such as the delay-and-sum beamformer, constructs a
spatial beam that enhances signals arriving from the target direction
independently of the interfering sources. However, a large number
of microphones is required to maintain a constant beam shape
across frequencies [3]. Adaptive beamforming
techniques, on the other hand, attempt to null out the interfering
sources in the mixture [4, 5]. While an adaptive beamformer with
two microphones is optimal for canceling a single directional
interference, additional microphones are required as the number of
noise sources increases. Similarly, the drawbacks of ICA
techniques include the requirement that the number of microphones
be at least equal to the number of sources, as well as poor
performance in reverberant conditions [5]. Some recent
sparse-representation methods attempt to relax the former
requirement, but their performance is limited [6]. While the above
techniques enhance
target speech independently of the recognizer, Seltzer et al.
optimize an adaptive filter based on recognition results [7].
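For illustration, the fixed delay-and-sum beamformer discussed above can be sketched as follows. This is a simplified sketch, not the system evaluated in this paper; the per-microphone steering delays are assumed to be known in advance from the array geometry and the target direction.

```python
import numpy as np

def delay_and_sum(mics, delays, fs):
    """Fixed delay-and-sum beamformer: time-align each microphone
    signal toward the target direction, then average.

    mics   : (M, N) array of M microphone signals
    delays : per-microphone steering delays in seconds (assumed
             known from array geometry and target direction)
    fs     : sampling rate in Hz
    """
    M, N = mics.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    out = np.zeros(N)
    for m in range(M):
        # Apply a fractional delay as a linear phase shift
        # in the frequency domain.
        spec = np.fft.rfft(mics[m])
        spec *= np.exp(-2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spec, n=N)
    # The target adds coherently across microphones while
    # uncorrelated noise averages down.
    return out / M
```

With uncorrelated noise at the microphones, averaging M aligned channels reduces the noise power by a factor of M while leaving the target intact, which is why many microphones are needed for strong, frequency-uniform suppression.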
Inspired by the robustness of the human auditory system,
research in computational auditory scene analysis (CASA) has been
devoted to building speech separation systems that incorporate known
principles of auditory perception [8]. In particular, binaural CASA
systems that utilize location information have shown very good
recognition results in anechoic conditions. Reverberation, however,
potentially introduces an infinite number of virtual sources due to
reflections from hard surfaces. As a result, the estimation of
location cues in individual T-F units becomes unreliable and the
performance of location-based segregation systems degrades. A
notable exception is the binaural system proposed by Palomäki et
al. [9], which includes an inhibition mechanism that emphasizes the
onset portions of the signal and groups them according to common
location. The system shows improved speech recognition results
across a range of reverberation times with a single interference.
From an information processing perspective, the notion of an
ideal T-F binary mask has been proposed as the computational goal
of CASA [10]. Such a mask can be constructed from a priori
knowledge of the target and interference; specifically, a value of 1
in the mask indicates that the target is stronger than the
interference within a particular T-F unit, and a value of 0 indicates
otherwise. Previously,
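Given a priori access to the premixed signals, this definition translates directly into code. The following is an illustrative Python sketch, not part of the original system; it assumes magnitude spectrograms (e.g., from a short-time Fourier transform or an auditory filterbank) for the target and interference.

```python
import numpy as np

def ideal_binary_mask(target_tf, interf_tf):
    """Ideal T-F binary mask from a priori knowledge of the
    premixed signals: 1 where the target is stronger than the
    interference in a local T-F unit, 0 otherwise.

    target_tf, interf_tf : (freq, time) magnitude spectrograms
    """
    return (np.abs(target_tf) > np.abs(interf_tf)).astype(np.uint8)
```

A missing-data recognizer would then treat the 1-valued units of such a mask as reliable evidence of the target and the 0-valued units as unreliable.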
we have proposed a binaural system that is capable of estimating
the ideal binary mask under multi-source reverberant conditions
[11] and reported results using a missing-data recognizer [12]
trained on reverberant speech. Note that the missing-data
recognizer treats the units labeled 1 in the mask as reliable data and
the others as unreliable during recognition. To avoid using a
different model for each reverberant condition, it is desirable to
train the ASR on anechoic data. However, we find that the
performance of the missing-data recognizer degrades considerably
when it is trained on anechoic speech.
In this paper, we propose an alternative approach that uses a
spectrogram reconstruction technique based on a speech prior [13]. In this
technique, the target speech values in the unreliable T-F units are
estimated by conditioning on the reliable ones. We observe that the
reliable units in the mask correspond to regions in the spectrogram
dominated by relatively clean target speech. Hence, the prior
1-4244-0469-X/06/$20.00 ©2006 IEEE ICASSP 2006