ENERGY-BASED SOUND SOURCE LOCALIZATION AND GAIN NORMALIZATION FOR AD HOC MICROPHONE ARRAYS

Zicheng Liu, Zhengyou Zhang, Li-Wei He, Phil Chou
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

ABSTRACT

We present an energy-based technique to estimate both microphone and speaker/talker locations from an ad hoc network of microphones. An example of such an ad hoc microphone network is the set of microphones built into the laptops that some meeting participants bring into a meeting room. Compared with traditional sound source localization approaches based on time of flight, our technique does not require accurate synchronization, nor does it require each laptop to emit special signals. We estimate the meeting participants' positions from the average energies of their speech signals. In addition, we present a technique, independent of how loudly the participants speak, to estimate the relative gains of the microphones. This is crucial for aggregating the audio channels of the ad hoc microphone network into a single stream for audio conferencing.

Keywords: Ad hoc microphone array, sound source localization, meeting analysis

1. INTRODUCTION

Portable devices are becoming increasingly popular in collaborative environments such as meetings, teleconferencing, and classrooms. Many people bring laptops and PDAs to meeting rooms, and many of these devices are WiFi enabled and have built-in microphones. We are interested in how to leverage such ad hoc microphone networks to improve the user experience in collaboration and communication.

Many audio conferencing systems in meeting rooms use a special device with multiple microphones, such as Polycom's SoundStation [1] or Microsoft's RingCam [2]. The device usually sits in the middle of the table. Because all microphones are synchronized and the geometry of the microphones is known, techniques such as time delay of arrival (TDOA) [3] can be used to determine the location of the speaker/talker, a problem known as sound source localization (SSL). Once the sound source is localized, intelligent mixing or beamforming picks up the sound and outputs higher quality audio than a single microphone would.

However, not every meeting room is equipped with such a special device, and a meeting does not always take place in a meeting room. It would be very useful if we could leverage the microphones built into the laptops that some meeting participants bring to the meeting. Laptops are usually WiFi-enabled, so they can form an ad hoc network. Compared to traditional microphone array devices, such ad hoc microphone arrays are spatially distributed, and the microphones are in general closer to the meeting participants. Thus, higher audio quality can be expected, assuming the microphones in the laptops and those in the array devices have the same quality. On the other hand, ad hoc microphone arrays present many challenges:

• Microphones are not synchronized;
• The locations of the microphones/laptops are unknown;
• Microphones have different and unknown gains on different laptops; and
• The microphone quality differs, i.e., the microphones have different signal-to-noise ratios.

Lienhart et al. [4] developed a system that synchronizes the audio signals by having the microphone devices send special synchronization signals over a dedicated link. Raykar et al. [5] developed an algorithm that calibrates the positions of the microphones by having each loudspeaker play a coded chirp.
Once the microphones are time-synchronized and their positions are calibrated, traditional beamforming and sound source localization techniques can be used for speech enhancement, for directing a camera toward the active speaker, and so on, to improve the teleconferencing experience.

In this paper, we present an energy-based technique for locating human speakers (talkers). Compared to the previous technique by Raykar et al. [5], our technique does not require accurate time synchronization, and it does not require the loudspeakers to emit special coded chirps. In fact, we only use the average energy of the meeting participants' speech signals over a relatively large window, so a synchronization accuracy of 50 ms or even 100 ms suffices for our technique. The price to pay is that we cannot obtain position estimates as accurate as those reported by Raykar et al. [5]. However, the position estimates are still good enough for many scenarios, such as audio-visual speaker window selection in video conferencing [2, 6].

Because the microphones are spatially distributed, a speaker/talker is usually relatively close to one of them. Therefore, a simple mechanism for speech enhancement is to select the signal from the microphone that is closest to the speaker (or to select the signal with the best signal-to-noise ratio (SNR)). One problem is that the microphones have different gains, which results in abrupt gain changes in the output signal. We present an algorithm that estimates the relative gains of the microphones using the meeting participants' speech signals. One nice property of the algorithm is that it is independent of how loudly each participant speaks, so every participant can speak at whatever volume he/she likes.

2. ENERGY-BASED SOUND SOURCE LOCALIZATION

Throughout this paper, we limit our discussion to the meeting room scenario, where a number of meeting participants have their laptops in front of them. We assume that each laptop has a microphone and that the laptops are connected by a network.
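As a concrete illustration of the window-averaged energy features on which both the localization and the channel selection rest, the following Python sketch computes per-channel average energies over coarse windows and picks the strongest channel per window. This is a minimal sketch, not the paper's algorithm: the 100 ms window length, the channel-selection rule, and the optional gain compensation are illustrative assumptions.

import numpy as np

def average_energies(channels, sample_rate, window_sec=0.1):
    """Return an (n_channels, n_windows) array of mean squared amplitudes.

    channels: list of 1-D numpy arrays, one waveform per microphone.
    Coarse windows (50-100 ms) keep the features insensitive to the small
    synchronization errors between the unsynchronized laptops.
    """
    win = int(window_sec * sample_rate)
    n_windows = min(len(c) for c in channels) // win
    energies = np.empty((len(channels), n_windows))
    for i, c in enumerate(channels):
        frames = c[: n_windows * win].reshape(n_windows, win)
        energies[i] = np.mean(frames.astype(np.float64) ** 2, axis=1)
    return energies

def pick_loudest_channel(energies, gains=None):
    """Select, per window, the channel with the largest gain-compensated energy.

    gains: hypothetical per-channel relative gain estimates; dividing by
    gain**2 keeps the choice from favoring microphones that merely have a
    higher gain rather than being closer to the talker.
    """
    if gains is not None:
        energies = energies / (np.asarray(gains)[:, None] ** 2)
    return np.argmax(energies, axis=0)

Given waveforms recorded at a common sample rate, average_energies(channels, 16000) yields the coarse energy traces, and pick_loudest_channel then mimics the simple "closest microphone" selection described in the introduction; without the gain compensation, switching between channels with different gains would produce the abrupt level changes that the gain normalization of the later sections is designed to remove.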