1196 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009
Determining Mixing Parameters From Multispeaker
Data Using Speech-Specific Information
B. Yegnanarayana, Senior Member, IEEE, R. Kumara Swamy, Member, IEEE, and K. Sri Rama Murty
Abstract—In this paper, we propose an approach for processing
multispeaker speech signals collected simultaneously using a pair
of spatially separated microphones in a real room environment.
Spatial separation of microphones results in a fixed time-delay of
arrival of speech signals from a given speaker at the pair of
microphones. These time-delays are estimated by exploiting the
impulse-like characteristic of excitation during speech production.
The differences in the time-delays for different speakers are used
to determine the number of speakers from the mixed multispeaker
speech signals. There is a difference in the signal levels due to
differences in the distances between the speaker and each of the
microphones. The differences in the signal levels dictate the values
of the mixing parameters. Knowledge of speech production,
especially the excitation source characteristics, is used to derive an
approximate weight function for locating the regions specific to a
given speaker. The scatter plots of the weighted and delay-compensated
mixed speech signals are used to estimate the mixing parameters.
The proposed method is applied to data collected in an actual
laboratory environment for an underdetermined case, where
the number of speakers is more than the number of microphones.
Enhancement of speech due to a speaker is also examined using the
information of the time-delays and the mixing parameters, and is
evaluated using objective measures proposed in the literature.
Index Terms—Excitation source, mixing parameters, multispeaker
data, speaker localization, time-delay estimation.
I. INTRODUCTION
One of the interesting and scientifically challenging problems
is to separate signals generated by different sources
from the mixture of signals picked up by spatially distributed
sensors, when there is limited or no knowledge of the nature
of the signals and the number of sources. While the motivation
for such problems is the so-called cocktail party problem
[1], the problem is addressed as a blind source separation (BSS)
problem [2] with possible applications in other areas as well,
such as in telecommunications and medical signal processing.
When the signals from spatially distributed sources are mixed in
a real room environment, the signal at any of the sensors in the
room is a mixture of the convolved source signals. Each source
signal at a sensor is convolved with the impulse response of the
acoustic path from the source to the receiver (sensor). For a BSS
problem with $M$ sensors and $N$ sources, the mixed signal at the
$i$th sensor is given by
$$x_i(n) = \sum_{j=1}^{N} \sum_{k} h_{ij}(k)\, s_j(n-k) + v_i(n) \qquad (1)$$
Here $x_i(n)$ is the mixed signal at the $i$th sensor, $s_j(n)$ is the
signal generated from the $j$th source, $v_i(n)$ is the additive noise
at the $i$th sensor, and $h_{ij}(k)$ represents the impulse response of
the acoustic path between the $j$th source and the $i$th sensor. There are
$MN$ such responses for a room with $M$ spatially distributed sensors
and $N$ spatially distributed sources. Given observations
of the $x_i(n)$, $i = 1, \ldots, M$, the problem in BSS is to determine
the number of sources and the individual sources $s_j(n)$,
$j = 1, \ldots, N$. This case of the source separation problem
is called the convolutive BSS problem. If the impulse response of
the room varies with time, it becomes a convolutive BSS problem
with a nonstationary response. In general, the
impulse response of the room between two points is assumed to
be stationary.
Manuscript received July 02, 2008; revised January 06, 2009. Current version
published July 06, 2009. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Hiroshi Sawada.
B. Yegnanarayana is with the International Institute of Information
Technology, Hyderabad 500 032, India (e-mail: yegna@iiit.ac.in).
R. K. Swamy is with the Department of Electronics and Communications
Engineering, Siddaganga Institute of Technology, Tumkur 572103, India (e-mail:
hyrkswamy@gmail.com).
K. S. R. Murty is with the Department of Computer Science and Engineering,
Indian Institute of Technology-Madras, Chennai 600 036, India (e-mail:
ksrmurty@gmail.com).
Digital Object Identifier 10.1109/TASL.2009.2016230
The separated signals are generally obtained using linear filters
$w_{ji}(k)$. The estimated source signals $y_j(n)$ are given by
$$y_j(n) = \sum_{i=1}^{M} \sum_{k} w_{ji}(k)\, x_i(n-k) \qquad (2)$$
The objective is to derive the deconvolving filters $w_{ji}(k)$, usually
of finite-impulse response (FIR) type. The problem of source
separation from convolutive mixtures can be treated as inversion
of the impulse response of the acoustic path imposed on each
sensor. Several methods have been proposed, both in the time-domain
and in the frequency-domain, to achieve source separation from
convolutive mixtures.
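The structure of the mixing model (1) and of the separation step (2) can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the method of this paper: the function name `mimo_fir`, the toy impulse responses in `h`, and the matrix `a_inst` are all hypothetical, and the additive noise term of (1) is omitted. Both equations have the same form (each output is a sum of FIR-filtered inputs), so one helper serves for both. The separation is shown only for the special case of single-tap (instantaneous) mixing, where the demixing filters are simply the entries of the inverse mixing matrix; real room responses do not admit such a simple inverse.

```python
import numpy as np

def mimo_fir(signals, filters):
    """Compute out[a](n) = sum_b sum_k filters[a, b, k] * signals[b](n - k).

    signals: array of shape (n_in, T), one row per input signal.
    filters: array of shape (n_out, n_in, K), one FIR filter per (output, input) pair.
    """
    n_out, n_in, _ = filters.shape
    n_samples = signals.shape[1]
    out = np.zeros((n_out, n_samples))
    for a in range(n_out):
        for b in range(n_in):
            # Truncate the full convolution to the input length (causal filtering).
            out[a] += np.convolve(signals[b], filters[a, b])[:n_samples]
    return out

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # two stand-in source signals (white noise)

# Hypothetical 4-tap acoustic paths: h[i, j] runs from source j to sensor i.
h = np.zeros((2, 2, 4))
h[0, 0] = [1.0, 0.0, 0.3, 0.0]
h[0, 1] = [0.0, 0.6, 0.0, 0.2]
h[1, 0] = [0.0, 0.5, 0.0, 0.1]
h[1, 1] = [1.0, 0.0, 0.0, 0.4]
x = mimo_fir(s, h)                   # convolutive mixtures, as in (1) without noise

# Special case: single-tap mixing, where the FIR demixing filters of (2)
# are the entries of the inverse mixing matrix.
a_inst = np.array([[1.0, 0.6], [0.5, 1.0]])
x_inst = mimo_fir(s, a_inst[:, :, None])
w = np.linalg.inv(a_inst)[:, :, None]
y = mimo_fir(x_inst, w)              # recovered sources for the single-tap case
```

In the blind setting neither `h` nor `a_inst` is known, which is why the deconvolving filters have to be estimated from the mixtures alone.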
Time-domain approaches consider the impulse response
along the acoustic path as an FIR filter, and estimate the filter
coefficients in the time-domain [3], [4]. Though these methods
are successful, the number of sources must be exactly known, as
they are extracted simultaneously. Moreover, the convergence
of these methods depends critically on the initial values of
the filter coefficients and on the selection of the step size for
updating them.
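The dependence on step size can be illustrated with a much simpler adaptive filter than the BSS algorithms of [3], [4]. The sketch below swaps in plain least-mean-squares (LMS) identification of a single FIR path with a known input signal, so it is not blind and it is not one of the cited methods. The names `lms_identify` and `true_path`, the filter length, and the two step sizes are assumptions chosen for illustration. Because the LMS cost is quadratic, this example shows only the step-size sensitivity, not the sensitivity to initial values.

```python
import numpy as np

def lms_identify(x, d, n_taps, step_size):
    """Estimate FIR coefficients w so that sum_k w[k] * x[n - k] tracks d[n]."""
    w = np.zeros(n_taps)
    for n in range(n_taps - 1, len(x)):
        recent = x[n - n_taps + 1 : n + 1][::-1]   # x[n], x[n-1], ..., x[n-n_taps+1]
        error = d[n] - w @ recent
        w = w + step_size * error * recent         # stochastic gradient step
    return w

rng = np.random.default_rng(1)
true_path = rng.standard_normal(8)          # hypothetical 8-tap acoustic path
x = rng.standard_normal(200)                # known white-noise input (not blind)
d = np.convolve(x, true_path)[: len(x)]     # noise-free observed response

w_small = lms_identify(x, d, 8, step_size=0.05)
w_large = lms_identify(x, d, 8, step_size=1.0)

# Coefficient error after adaptation for each step size.
err_small = np.linalg.norm(w_small - true_path)
err_large = np.linalg.norm(w_large - true_path)
```

With the smaller step the coefficient error shrinks toward zero, while the larger step lies well outside the stable range for this input and the estimate diverges.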
Frequency-domain approaches transform the problem of convolutive
mixtures into an instantaneous mixture, and solve the
instantaneous BSS problem simultaneously for individual
frequencies [5], [6]. Since the filter parameters in the
frequency-domain are mutually orthogonal, updating one parameter does
not influence the rest. Though the mutual independence of the