arXiv:1804.05111v1 [cs.SD] 13 Apr 2018

Multi-Sound-Source Localization for Small Autonomous Unmanned Vehicles with a Self-Rotating Bi-Microphone Array

Deepak Gala, Nathan Lindsay, and Liang Sun

Abstract—While vision-based localization techniques have been widely studied for small autonomous unmanned vehicles (SAUVs), sound-source localization capability has not been fully enabled for SAUVs. This paper presents two novel approaches for SAUVs to perform multi-sound-source localization (MSSL) using only the interaural time difference (ITD) signal generated by a self-rotating bi-microphone array. The two proposed approaches are based on the DBSCAN and RANSAC algorithms, respectively, whose performances are tested and compared in both simulations and experiments. The results show that both approaches are capable of correctly identifying the number of sound sources along with their three-dimensional orientations in a reverberant environment.

I. INTRODUCTION

Small autonomous unmanned vehicles (e.g., quadcopters and ground robots) have revolutionized civilian and military missions by creating a platform for observation and permitting access to locations that are too dangerous, too difficult, or too costly to send humans. These small vehicles have proven remarkably capable in many applications, such as surveying and mapping, precision agriculture, search and rescue, traffic surveillance, and infrastructure monitoring, to name just a few.

The sensing capability of unmanned vehicles has been enabled by various sensors, such as RGB cameras, infrared cameras, LiDARs, RADARs, and ultrasound sensors. However, these mainstream sensors are subject to either lighting conditions or line-of-sight requirements. At the other end of the spectrum, sound sensors can overcome line-of-sight constraints and, thanks to their omnidirectional nature, provide a more efficient means for unmanned vehicles to acquire situational awareness.
Among the sensing tasks for unmanned vehicles, localization is of utmost significance [1]. While vision-based localization techniques have been developed based on cameras, sound source localization (SSL) has been achieved using microphone arrays with different numbers of microphones (e.g., 2, 4, 8, or 16).

Deepak Gala is with the Klipsch School of Electrical and Computer Engineering, New Mexico State University, Las Cruces, NM 88001 USA (e-mail: dgala@nmsu.edu).
Nathan Lindsay is with the Department of Mechanical and Aerospace Engineering, New Mexico State University, Las Cruces, NM 88001 USA (e-mail: nl22@nmsu.edu).
Liang Sun is with the Department of Mechanical and Aerospace Engineering, New Mexico State University, Las Cruces, NM 88001 USA (e-mail: lsun@nmsu.edu).

Although it has been reported that localization accuracy improves as the number of microphones increases [2], [3], this comes at the price of algorithm complexity and hardware cost, especially due to the expense of analog-to-digital converters (ADCs), which is proportional to the number of microphone channels. Moreover, arrays with a particular structure (e.g., linear, cubical, circular, etc.) are difficult to control, mount, and maneuver, which makes them unsuitable for use on small vehicles.

Humans and many other animals can locate sound sources with decent accuracy and responsiveness by using their two ears together with head rotations to resolve ambiguity (i.e., the cone of confusion) [4]. Recently, SSL techniques based on a self-rotating bi-microphone array have been reported in the literature [5]–[9]. Single-sound-source localization (SSSL) has been well studied using different numbers of microphones, whereas many reported techniques for multi-sound-source localization (MSSL) require large microphone arrays with specific structures, which prevents them from being mounted on small robots. Pioneering work on MSSL assumed the number of sources to be known beforehand [10], [11].
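To make the ITD cue underlying bi-microphone techniques concrete, the sketch below estimates the interaural time difference as the lag that maximizes the cross-correlation between the two channels. This is a minimal, generic illustration, not the processing chain of this paper: the sample rate, signal, and delay are synthetic, and a plain cross-correlation is assumed rather than a GCC-style weighted variant.

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) as the lag
    maximizing the full cross-correlation of the two channels.
    A positive value means the right channel is delayed (left leads)."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)  # lag in samples
    return lag / fs

# Synthetic example: a broadband signal reaching the right
# microphone 5 samples after the left one.
fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(2048)
delay = 5
left = sig
right = np.concatenate([np.zeros(delay), sig[:-delay]])
print(estimate_itd(left, right, fs))  # → 3.125e-04 (5 samples at 16 kHz)
```

A broadband (noise-like) signal is used on purpose: for a narrowband tone, the cross-correlation is periodic and the peak lag is ambiguous, which is one reason single-ITD measurements benefit from additional cues such as array rotation.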
Some of these approaches [12]–[14] are based on sparse component analysis (SCA), which requires the sources to be W-disjoint orthogonal [15] (i.e., in some time-frequency components, at most one source is active), making them unsuitable for reverberant environments. Pavlidi et al. [16] and Loesch et al. [17] presented SCA-based methods to count and localize multiple sound sources, but they require one sound source to be dominant over the others in a time-frequency zone. Clustering methods have also been used to conduct MSSL [12]–[14]. Catalbas et al. [18] presented an MSSL approach that deploys four microphones at the corners of a room and requires the sound sources to lie within its boundary. The technique was limited to localizing source orientations in the two-dimensional plane using K-medoids clustering, and the number of sound sources was estimated using the exhaustive elbow method, which is heuristic and computationally expensive. Traa et al. [19] presented an approach that converts the time delay between the microphones into the frequency domain so as to model the phase differences in each frequency bin of the short-time Fourier transform. Owing to the linear relationship between phase difference and frequency, the data were then clustered using random sample consensus (RANSAC). In our