Learning Sound Location from a Single Microphone

Ashutosh Saxena and Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305, USA.
{asaxena,ang}@cs.stanford.edu

Abstract—We consider the problem of estimating the incident angle of a sound, using only a single microphone. The ability to perform monaural (single-ear) localization is important to many animals; indeed, monaural cues are also the primary method by which humans decide whether a sound comes from the front or back, as well as estimate its elevation. Such monaural localization is made possible by the structure of the pinna (outer ear), which modifies sound in a way that depends on its incident angle. In this paper, we propose a machine learning approach to monaural localization, using only a single microphone and an “artificial pinna” (which distorts sound in a direction-dependent way). Our approach models the typical distribution of natural and artificial sounds, as well as the direction-dependent changes to sounds induced by the pinna. Our experimental results show that the algorithm is able to fairly accurately localize a wide range of sounds, such as human speech, dog barks, waterfalls, and thunder. In contrast to microphone arrays, this approach also offers the potential of significantly more compact, as well as lower cost and power, devices for sound localization.

I. INTRODUCTION

The ability to perform sound localization—i.e., to estimate the direction of a sound source—is important to many biological organisms, where sound can serve as a warning of danger, or be used to locate prey [11]. Further, sound localization has many important engineering applications, ranging from estimating the position of a speaker, to automatically deciding where to steer a directional microphone (or beamformer) or camera.

Typically, sound localization in artificial systems is performed using two (or more) microphones. From the difference in the arrival times of a sound at the two microphones, one can mathematically estimate the direction of the sound source. However, the accuracy with which an array of microphones can localize a sound (using interaural time differences) is fundamentally limited by the physical size of the array [6], [5]. If the array is too small, the microphones are spaced so closely together that they all record essentially the same sound (with interaural time differences near zero), making it extremely difficult to estimate the orientation. Thus, it is not uncommon for microphone arrays to range from tens of centimeters in length (for desktop applications) to many tens of meters in length (for underwater localization). Microphone arrays of this size, however, become impractical to use on small robots [24], [19]; even for large robots, such arrays can be cumbersome to mount and to maneuver.

In contrast, the ability to localize sound using a single microphone (which can be made extremely small) holds the potential of significantly more compact, as well as lower cost and power, devices for localization. Being able to do so is also an interesting and enlightening problem in its own right.

Biological organisms such as humans also use the “interaural time difference” as a cue for sound localization; in organisms with two ears (or microphones), this is called the binaural time difference, or the binaural cue.
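To make the binaural cue concrete, the following is a minimal sketch, not taken from the paper, of how a two-microphone system can estimate bearing from arrival-time differences. The function name, the far-field geometry assumption, and the use of cross-correlation for delay estimation are all illustrative choices.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C

def estimate_bearing(left, right, mic_spacing, fs):
    """Estimate the incident angle (radians) of a far-field source
    from the interaural time difference between two microphones.

    left, right : 1-D arrays of samples from the two microphones
    mic_spacing : distance between the microphones in meters
    fs          : sampling rate in Hz
    """
    # Find the lag (in samples) that best aligns the two recordings.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    delta_t = lag / fs  # arrival-time difference in seconds

    # Far-field geometry: delta_t = (d / c) * sin(theta).
    # Clamp the ratio to [-1, 1] so noise cannot push it out of range.
    ratio = np.clip(SPEED_OF_SOUND * delta_t / mic_spacing, -1.0, 1.0)
    return np.arcsin(ratio)
```

The sketch also exposes the size limitation discussed above: the largest possible delay is mic_spacing / SPEED_OF_SOUND, so at a 48 kHz sampling rate a 2 cm array sees delays spanning only about plus or minus 3 samples, leaving almost no resolution for the arcsine.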
However, in humans the binaural cue cannot be used to estimate the elevation of a sound, nor can it distinguish a sound coming from the front from one coming from the back. For example, if a sound source is directly in front of us, the interaural time difference will be zero regardless of the source’s height, and it will likewise be zero for a source directly behind us. Nonetheless, humans can localize the full 3D direction of a sound with reasonably high accuracy, including both its elevation and whether it comes from the front or back. Indeed, humans can perform this task using even a single ear; this is known as monaural localization. They can do so because the sound measured in the inner ear changes as a function of the source’s direction. Specifically, reflections from the pinna (the outer part of the ear, also called the auricle) and the head change the perceived sound in a way that depends on the source’s direction. This allows humans and other organisms to perform monaural localization (including estimating the elevation of a sound source).

Monaural localization, however, is a challenging problem for artificial systems, because it requires prior knowledge of the possible sounds. Indeed, the ability of humans to estimate the direction of a sound monaurally is contingent on their familiarity with that sound. Even though sounds are modified by the ear depending on their incident angle, in a narrow mathematical sense it is actually impossible to determine a sound’s direction from a monaural recording alone: one cannot know whether a sound appears different because it came from a certain direction (and was thus modified in a certain way by the pinna), or because it was originally like that. However, typical sounds found in our environments (and in natural environments) are not random—they have certain structure. Thus, it is by using our prior knowledge (perhaps gained through years of experience with sound) about what sounds are likely that we can resolve this ambiguity and perform monaural localization.
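The following toy sketch, which is not the paper’s algorithm, illustrates this idea of combining a prior over sounds with the pinna’s direction-dependent filtering. It substitutes a crude hand-written prior (natural sounds tend to have smooth, roughly 1/f magnitude spectra) for the learned model of natural sounds, and it assumes a hypothetical dictionary pinna_filters of impulse responses measured for each candidate direction.

```python
import numpy as np

def localize_monaural(recording, pinna_filters, fs):
    """Pick the candidate direction under which the inverse-filtered
    recording looks most 'natural'. A toy maximum-likelihood sketch of
    the idea in the text, not the paper's algorithm.

    recording     : 1-D array, the single-microphone signal
    pinna_filters : hypothetical dict mapping candidate angle (degrees)
                    to the artificial pinna's impulse response for that
                    angle (the real system would learn these mappings)
    fs            : sampling rate in Hz
    """
    n = len(recording)
    spec = np.abs(np.fft.rfft(recording)) + 1e-12       # recording spectrum
    log_f = np.log(np.fft.rfftfreq(n, d=1.0 / fs)[1:])  # skip the DC bin

    best_angle, best_score = None, -np.inf
    for angle, h in pinna_filters.items():
        h_mag = np.abs(np.fft.rfft(h, n=n)) + 1e-12
        # Undo the direction-dependent filtering (log-magnitude domain).
        log_src = np.log(spec / h_mag)[1:]
        # Crude prior: log-magnitude of a natural sound falls off roughly
        # linearly in log-frequency (a 1/f-like spectrum), so score each
        # candidate by how well a straight line fits the recovered spectrum.
        slope, intercept = np.polyfit(log_f, log_src, 1)
        residual = np.mean((log_src - (slope * log_f + intercept)) ** 2)
        if -residual > best_score:
            best_angle, best_score = angle, -residual
    return best_angle
```

The sketch makes the ambiguity argument concrete: without the prior (the line-fit score), every candidate direction explains the recording equally well, since any observed spectrum can be attributed to some hypothetical source; it is only the prior over sources that breaks the tie.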