Model of Auditory Localization Using Neural Networks

Scott K. Isabelle, James A. Janko, and Robert H. Gilkey

Department of Psychology, Wright State University, Dayton, Ohio 45435

Abstract: Janko, Anderson, and Gilkey [Binaural and Spatial Hearing in Real and Virtual Environments, Gilkey and Anderson (eds.), Erlbaum, pp. 557-570 (1997)] found that localization judgments collected in quiet were not sufficient to discriminate among various models of auditory localization performance. To constrain the models further, localization in the presence of a spatially fixed masker is considered here. The wideband targets and maskers were filtered by head-related transfer functions, then by a gammatone filter bank. A model for binaural interaction similar to that of Lindemann [J. Acoust. Soc. Am. 80, 1608-1622 (1986)] was used to process the filter-bank outputs. The inhibited cross-correlation output of the binaural processor was sampled and used as the input to a neural network (e.g., 26 correlation lags at each of the 12 frequency channels for 312 input nodes). The network was trained using back propagation across several values of signal-to-noise ratio (SNR) and tested as a function of SNR. Preliminary results are compared to human data.

INTRODUCTION

Typically, models of auditory localization have been applied only to stimulus conditions in which a single source is presented in the quiet. For example, Janko, Anderson, and Gilkey (3) employed simple auditory models followed by a neural network to predict subjects' localization judgments in the quiet, but found that quite different models (i.e., those emphasizing monaural information and those emphasizing binaural information) made similar predictions. In an attempt to find a more rigorous test for models of spatial hearing, we consider in this paper the localization-in-noise data of Good and Gilkey (2).
One model of binaural processing that has had some success in predicting the localization of multiple sources is that of Lindemann (5,6,1). Therefore, we use a Lindemann-like model as a front end for a neural network, in a manner similar to our previous work.

MODEL IMPLEMENTATION

The model was composed of three stages, representing the auditory periphery, the binaural display, and the decision processor. The model's task on each trial was to determine from which of 144 possible virtual locations the target had been presented (the possible target locations spanned 360° of azimuth in 15° steps and ranged from -36° to +54° elevation in 18° steps). As in Good and Gilkey, the masker, when present, was directly in front of the subject (0° azimuth, 0° elevation). To simulate the acoustic effects of the head and pinnae, the 100-Hz broadband click-train target and white-noise masker used by Good and Gilkey were convolved with the appropriate location-specific head-related transfer functions (subject SDO, 7). Targets were presented in the quiet and at signal-to-noise ratios (SNRs) from -19 to +21 dB relative to the average masked detection threshold for the subjects of Good and Gilkey.

The model for the auditory periphery consisted of a bank of 12 third-octave gammatone filters followed by compressive half-wave rectifiers (q = 0.3) and 800-Hz low-pass filters. The frequency channels were spaced at octave intervals from 0.5 to 2 kHz and at third-octave intervals from 2 to 16 kHz. For computational efficiency, no peripheral internal noise was included.

The model of binaural interaction was based on the inhibited cross-correlation model of Lindemann with no dynamic inhibition, Cd = 0, and increased static inhibition, Cs = 0.8 (n.b., the cross-correlation patterns obtained were similar to those shown by Lindemann for Cs = 0.5).
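The counts quoted above (144 target locations, 12 frequency channels) can be checked with a short Python sketch. This is only an illustrative enumeration, not the authors' code: the variable names are hypothetical, the azimuth origin (starting at 0°) is an assumption because the text gives only the span and step, and the third-octave centers are assumed to be exact powers of 2^(1/3) above 2 kHz.

```python
# Target grid: 360 degrees of azimuth in 15-degree steps (origin at 0 is an
# assumption), elevations from -36 to +54 degrees in 18-degree steps.
azimuths = list(range(0, 360, 15))         # 24 azimuths
elevations = list(range(-36, 54 + 1, 18))  # 6 elevations
locations = [(az, el) for el in elevations for az in azimuths]

# Filter center frequencies in Hz: octave spacing from 0.5 kHz up to 2 kHz,
# then third-octave spacing from 2 kHz to 16 kHz (assumed exact 2**(k/3) steps).
octave_cfs = [500.0 * 2 ** k for k in range(2)]                # 500, 1000
third_octave_cfs = [2000.0 * 2 ** (k / 3) for k in range(10)]  # 2000 ... 16000
center_freqs = octave_cfs + third_octave_cfs

print(len(locations), "target locations")
print(len(center_freqs), "frequency channels")
```

Under these assumptions the grid contains 24 × 6 = 144 locations and the two spacing rules together yield 12 channels, matching the figures stated in the text.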
Within each frequency channel, the peripherally processed signals were used to compute the running-time inhibited interaural cross-correlation function, with correlation lags between ±1 ms. The resulting pattern was averaged across running time. The set of averaged cross-correlation patterns (one pattern for each frequency channel) was sampled at a 12.5-kHz rate and corrupted by uniformly distributed internal noise to provide the input to the neural network. The level of internal noise was adjusted so that the localization performance of the model, when trained and tested in the quiet, was comparable to that of a human observer in the quiet (note that the results shown below were obtained with a model trained across SNRs of 1, 11, and 21 dB and in the quiet, without readjusting the internal noise level). The three-layer artificial neural network had 312 input nodes (12 frequency channels with 26 values of the