1196 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009

Determining Mixing Parameters From Multispeaker Data Using Speech-Specific Information

B. Yegnanarayana, Senior Member, IEEE, R. Kumara Swamy, Member, IEEE, and K. Sri Rama Murty

Abstract—In this paper, we propose an approach for processing multispeaker speech signals collected simultaneously using a pair of spatially separated microphones in a real room environment. Spatial separation of the microphones results in a fixed time-delay of arrival of speech signals from a given speaker at the pair of microphones. These time-delays are estimated by exploiting the impulse-like characteristic of excitation during speech production. The differences in the time-delays for different speakers are used to determine the number of speakers from the mixed multispeaker speech signals. There is a difference in the signal levels at the two microphones because the distances between the speaker and each of the microphones differ. The differences in the signal levels dictate the values of the mixing parameters. Knowledge of speech production, especially the excitation source characteristics, is used to derive an approximate weight function for locating the regions specific to a given speaker. The scatter plots of the weighted and delay-compensated mixed speech signals are used to estimate the mixing parameters. The proposed method is applied to data collected in an actual laboratory environment for an underdetermined case, where the number of speakers is greater than the number of microphones. Enhancement of the speech of a given speaker is also examined using the information of the time-delays and the mixing parameters, and is evaluated using objective measures proposed in the literature.

Index Terms—Excitation source, mixing parameters, multispeaker data, speaker localization, time-delay estimation.

Manuscript received July 02, 2008; revised January 06, 2009. Current version published July 06, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hiroshi Sawada.
B. Yegnanarayana is with the International Institute of Information Technology, Hyderabad 500 032, India (e-mail: yegna@iiit.ac.in).
R. K. Swamy is with the Department of Electronics and Communications Engineering, Siddaganga Institute of Technology, Tumkur 572103, India (e-mail: hyrkswamy@gmail.com).
K. S. R. Murty is with the Department of Computer Science and Engineering, Indian Institute of Technology-Madras, Chennai 600 036, India (e-mail: ksrmurty@gmail.com).
Digital Object Identifier 10.1109/TASL.2009.2016230

I. INTRODUCTION

One of the interesting and scientifically challenging problems is to separate signals generated by different sources from the mixture of signals picked up by spatially distributed sensors, when there is limited or no knowledge of the nature of the signals and the number of sources. While the motivation for such problems is the so-called cocktail party problem [1], the problem is addressed as a blind source separation (BSS) problem [2], with possible applications in other areas as well, such as telecommunications and medical signal processing. When the signals from spatially distributed sources are mixed in a real room environment, the signal at any of the sensors in the room is a mixture of the convolved source signals. Each source signal at a sensor is convolved with the impulse response of the acoustic path from the source to the receiver (sensor). For a BSS problem with $M$ sensors and $N$ sources, the mixed signal at the $i$th sensor is given by

$$x_i(n) = \sum_{j=1}^{N} \sum_{k} h_{ij}(k)\, s_j(n-k) + v_i(n), \quad i = 1, \ldots, M \tag{1}$$

Here $x_i(n)$ is the mixed signal at the $i$th sensor, $s_j(n)$ is the signal generated from the $j$th source, $v_i(n)$ is the additive noise at the $i$th sensor, and $h_{ij}(k)$ represents the impulse response of the acoustic path between the $j$th source and the $i$th sensor.
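As an illustration only, the convolutive mixing model in (1) can be sketched numerically. The sketch below assumes finite-length responses, white-noise stand-ins for the speech sources, and toy exponentially decaying impulse responses; the function name `convolutive_mix`, the array names, and all sizes are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

def convolutive_mix(sources, responses, noise_std=0.0, rng=None):
    """Mix N source signals into M sensor signals following the form of (1).

    sources   : array of shape (N, T), one row per source signal
    responses : array of shape (M, N, K); responses[i, j] is the assumed
                impulse response from source j to sensor i
    noise_std : standard deviation of the additive white sensor noise
    """
    rng = np.random.default_rng() if rng is None else rng
    M, N, K = responses.shape
    assert sources.shape[0] == N, "one row of `sources` per source"
    T = sources.shape[1]
    mixed = np.zeros((M, T))
    for i in range(M):
        for j in range(N):
            # Full convolution has length T + K - 1; keep the first T samples.
            mixed[i] += np.convolve(sources[j], responses[i, j])[:T]
    return mixed + noise_std * rng.standard_normal((M, T))

rng = np.random.default_rng(0)
N, M, T, K = 3, 2, 1000, 64       # underdetermined: more sources than sensors
s = rng.standard_normal((N, T))   # stand-in signals, not real speech
# Toy decaying responses (assumption); real room responses are much longer.
h = rng.standard_normal((M, N, K)) * np.exp(-np.arange(K) / 10.0)
x = convolutive_mix(s, h, noise_std=0.01, rng=rng)
print(x.shape)
```

Each sensor signal is the sum of every source filtered by its own acoustic path plus noise, which is why recovering the sources requires undoing the filtering as well as the mixing.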
There are $MN$ such responses for a room with $M$ spatially distributed sensors and $N$ spatially distributed sources. Given observations of the mixed signals $x_i(n)$, $i = 1, \ldots, M$, the problem in BSS is to determine the number of sources, $N$, and the individual sources $s_j(n)$, $j = 1, \ldots, N$. This form of the source separation problem is called the convolutive BSS problem. If the impulse response of the room varies with time, it becomes a nonstationary-response convolutive BSS problem. In general, the impulse response of the room between two points is assumed to be stationary.

The separated signals are generally obtained using linear filters $w_{ji}(k)$. The estimated source signals are given by

$$\hat{s}_j(n) = \sum_{i=1}^{M} \sum_{k} w_{ji}(k)\, x_i(n-k), \quad j = 1, \ldots, N \tag{2}$$

The objective is to derive the deconvolving filters $w_{ji}(k)$, usually of finite-impulse response (FIR) type. The problem of source separation from convolutive mixtures can be treated as inversion of the impulse response of the acoustic path imposed on each sensor. Several methods have been proposed, in both the time domain and the frequency domain, to achieve source separation from convolutive mixtures.

Time-domain approaches consider the impulse response along the acoustic path as an FIR filter, and estimate the filter coefficients in the time domain [3], [4]. Though these methods are successful, the number of sources must be known exactly, as the sources are extracted simultaneously. Moreover, the convergence of these methods depends critically on the initial values of the filter coefficients and on the selection of the step size for updating them.

Frequency-domain approaches transform the problem of convolutive mixtures into one of instantaneous mixtures, and solve the instantaneous BSS problem simultaneously for individual frequencies [5], [6]. Since the filter parameters in the frequency domain are mutually orthogonal, updating one parameter does not influence the rest. Though the mutual independence of the