M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 491, April 1999.

Sound Scene Segmentation by Dynamic Detection of Correlogram Comodulation

Eric D. Scheirer
Machine Listening Group, MIT Media Laboratory
E15-401D Cambridge, MA 02139-4307 USA
eds@media.mit.edu

Abstract: A new technique for sound-scene analysis is presented. This technique operates by discovering common modulation behavior among groups of frequency subbands in the autocorrelogram domain. The analysis proceeds by first estimating the amplitude modulation and period modulation of each channel of autocorrelogram data at each time step, and then using dynamic clustering techniques to group together channels with similar modulation behavior. Implementation details of the analysis technique are presented, and its performance is demonstrated on a test sound.

1. Introduction

The autocorrelogram and similar representations of subband periodicity are now established as the preferred computational representation of early sound processing in the auditory system. This is primarily due to the accuracy with which these models explain the available experimental data on pitch perception. Although there is still debate regarding the strengths and weaknesses of different periodicity representations (Irino and Patterson 1996; Slaney 1997; de Cheveigné 1998; Kaernbach and Demany 1998), it is now relatively uncontested that some sort of temporal periodicity detection is applied to the output of the cochlear filterbank in the human listening process.

A remarkable correspondence between visual motion in the autocorrelogram and the perception of auditory scene analysis was reported some time ago (Duda, Lyon and Slaney 1990). However, as yet there has been relatively little attempt to operationalize this discovery in a computational auditory-scene-analysis (CASA) system.
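To make the abstract's grouping stage concrete, the sketch below groups subband channels whose amplitude-modulation trajectories move together, given per-channel envelopes that are assumed to have been extracted already. The greedy correlation-threshold merge is only an illustrative stand-in for the dynamic clustering techniques the abstract mentions; the function name and threshold value are hypothetical.

```python
import numpy as np

def group_by_comodulation(envelopes, threshold=0.9):
    """Group channels with similar modulation behavior.

    envelopes: array of shape (n_channels, n_frames), one amplitude
    envelope per subband channel. Returns an integer group label per
    channel. This is a toy greedy scheme, not the report's clusterer.
    """
    # Frame-to-frame modulation: first difference of each envelope.
    mod = np.diff(envelopes, axis=1)
    n = mod.shape[0]
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for i in range(n):
        if labels[i] >= 0:      # already assigned to a group
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            # Merge channel j if its modulation correlates strongly
            # with channel i's modulation.
            if labels[j] < 0:
                r = np.corrcoef(mod[i], mod[j])[0, 1]
                if r > threshold:
                    labels[j] = next_label
        next_label += 1
    return labels
```

For example, two channels carrying scaled copies of the same 3 Hz envelope modulation would be grouped together, while channels modulated at an unrelated rate would receive a different label.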
The present paper describes initial experiments in building a CASA system around the principle of detecting subband comodulation in the autocorrelogram domain. The literature on correlogram-based scene analysis is reviewed, and the few approaches that use the correlogram for purposes other than pitch analysis are highlighted. Then, the correlogram-comodulation algorithm is described. Finally, results on test signals are presented and directions for future research are discussed.

2. Background

Most of the early attempts to construct computational source-grouping systems were based on sinusoidal analysis (Quatieri and McAulay 1998). Details on these sorts of systems are now widely available in the literature (for example, Brown and Cooke 1994; Ellis 1994; Rosenthal and Okuno 1998) and will not be discussed further.

The concept of subband periodicity detection was first suggested as a model for the pitch of a sound by Licklider (1951). Licklider's model was based on a network of delay lines and coincidence detectors arranged in a two-dimensional representation. Since Licklider's formulation, this technique has been rediscovered several times, first by van Noorden (1983), who cast it in terms of the calculation of histograms of neural interspike intervals in the cochlear nerve. In the last decade, the model was reintroduced by Slaney and Lyon (1990), Meddis and Hewitt (1991), and others; it has since come to be called the autocorrelogram method of pitch analysis and is today the preferred model. The autocorrelogram is the 3-D volumetric function mapping cochlear channel, time delay (or lag), and time to the amount of periodic energy in that channel at that lag and time.
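A minimal sketch of this channel x lag x time volume follows. For simplicity it substitutes a uniform FFT-masking filterbank for a cochlear filterbank and uses plain short-time autocorrelation per band; the function name and all parameter values are illustrative assumptions, not the report's implementation.

```python
import numpy as np

def autocorrelogram(x, sr, n_bands=8, frame=1024, hop=512, max_lag=256):
    """Toy autocorrelogram: (channel, lag, time) volume of subband
    periodicity. A uniform FFT bandpass bank stands in here for the
    cochlear filterbank assumed by the report."""
    # Split the signal into n_bands uniform bands by masking the spectrum.
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), n_bands + 1).astype(int)
    bands = []
    for b in range(n_bands):
        Xb = np.zeros_like(X)
        Xb[edges[b]:edges[b + 1]] = X[edges[b]:edges[b + 1]]
        bands.append(np.fft.irfft(Xb, n=len(x)))
    # Short-time autocorrelation of each band, frame by frame.
    n_frames = 1 + (len(x) - frame) // hop
    vol = np.zeros((n_bands, max_lag, n_frames))
    for b, xb in enumerate(bands):
        for t in range(n_frames):
            seg = xb[t * hop : t * hop + frame]
            ac = np.correlate(seg, seg, mode='full')
            # Keep non-negative lags 0..max_lag-1 (zero lag is at frame-1).
            vol[b, :, t] = ac[frame - 1 : frame - 1 + max_lag]
    return vol
```

Summing such a volume across channels yields a summary correlogram whose dominant lag tracks the period of a periodic input; e.g., a 440 Hz tone sampled at 8 kHz produces a peak near lag 18 (8000/440 is approximately 18.2 samples).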