M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 491, April 1999.
Sound Scene Segmentation
by Dynamic Detection of Correlogram Comodulation
Eric D. Scheirer
Machine Listening Group, MIT Media Laboratory
E15-401D Cambridge, MA 02139-4307 USA
eds@media.mit.edu
Abstract: A new technique for sound-scene analysis is presented. This technique operates by discovering
common modulation behavior among groups of frequency subbands in the autocorrelogram domain. The
analysis is conducted by first analyzing the autocorrelogram to estimate the amplitude modulation and
period modulation of each channel of data at each time step, and then using dynamic clustering techniques
to group together channels with similar modulation behavior. Implementation details of the analysis
technique are presented, and its performance is demonstrated on a test sound.
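The two-stage analysis sketched in the abstract — per-channel modulation estimates followed by grouping of channels with similar behavior — can be illustrated with a toy grouping step. All names here are hypothetical, and the greedy nearest-centroid rule is a simple stand-in for the dynamic clustering developed later in the paper; it assumes only that each channel has been reduced to a small feature vector of modulation estimates at one time step.

```python
import numpy as np

def group_channels(features, threshold=0.5):
    """Greedily group filterbank channels by modulation similarity.

    features: (n_channels, 2) array of per-channel modulation
    estimates at one time step, e.g. [amplitude modulation,
    period modulation]. A channel joins the group whose running
    mean feature vector lies within `threshold` (Euclidean
    distance); otherwise it starts a new group.
    """
    groups = []     # lists of channel indices, one list per group
    centroids = []  # running mean feature vector for each group
    for ch, f in enumerate(features):
        if groups:
            dists = [np.linalg.norm(f - c) for c in centroids]
            best = int(np.argmin(dists))
            if dists[best] < threshold:
                groups[best].append(ch)
                centroids[best] = features[groups[best]].mean(axis=0)
                continue
        groups.append([ch])
        centroids.append(f.astype(float))
    return groups
```

Channels comodulated by a common source should present nearby feature vectors and fall into one group, while channels dominated by a different source split off into another.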
1. Introduction
The autocorrelogram and similar representations of subband periodicity are now established as the
preferred computational representation of early sound processing in the auditory system. This is primarily
due to the accuracy with which these models explain the available experimental data on pitch perception.
Although there is still debate regarding the strengths and weaknesses of different periodicity
representations (Irino and Patterson 1996; Slaney 1997; de Cheveigné 1998; Kaernbach and Demany
1998), it is now relatively uncontested that some sort of temporal periodicity detection is applied to the
output of the cochlear filterbank in the human listening process.
A remarkable correspondence between visual motion in the autocorrelogram and the perceptual organization
of the auditory scene was reported some time ago (Duda, Lyon and Slaney 1990). However, there has as yet been
relatively little attempt to operationalize this discovery in a computational auditory-scene-analysis (CASA)
system. The present paper describes initial experiments in building a CASA system around the principle
of detecting subband comodulation in the autocorrelogram domain. The literature on correlogram-based
scene analysis is reviewed, and the few approaches that use the correlogram for purposes other than pitch
analysis are highlighted. Then, the correlogram-comodulation algorithm is described. Finally, results on test
signals are presented and the directions of future research discussed.
2. Background
Most of the early attempts to construct computational source-grouping systems were based on sinusoidal
analysis (Quatieri and McAulay 1998). Details on these sorts of systems are now widely available in the
literature (for example, Brown and Cooke 1994; Ellis 1994; Rosenthal and Okuno 1998) and will not be
discussed further.
The concept of subband periodicity detection was first suggested as a model for the pitch of a sound by
Licklider (1951). Licklider’s model was based on a network of delay lines and coincidence detectors
oriented in a two-dimensional representation. Since Licklider's formulation, this technique has been
rediscovered several times, first by van Noorden (1983), who cast it in terms of the calculation of
histograms of neural interspike intervals in the cochlear nerve. In the last decade, the model was
reintroduced by Slaney and Lyon (1990), Meddis and Hewitt (1991), and others; it has since come to be
called the autocorrelogram method of pitch analysis and is today the preferred model. The autocorrelogram
is the 3-D volumetric function mapping cochlear channel, time delay (or lag), and time to the
amount of periodic energy in that band at that lag and time.
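This definition can be made concrete with a minimal sketch that computes a short-time autocorrelogram A[channel, lag, frame] from filterbank subband signals. The cochlear filterbank itself is assumed as input and not implemented here, and the frame, hop, and lag parameters are illustrative choices, not values from the paper.

```python
import numpy as np

def autocorrelogram(subbands, frame_len=512, hop=256, max_lag=256):
    """Short-time autocorrelogram of filterbank output.

    subbands: (n_channels, n_samples) array, e.g. the rectified
    output of a cochlear filterbank (assumed, not computed here).
    Returns A of shape (n_channels, max_lag, n_frames), where
    A[c, l, t] is the autocorrelation of channel c at lag l in
    frame t, normalized by the zero-lag energy of that frame.
    """
    n_ch, n = subbands.shape
    starts = list(range(0, n - frame_len - max_lag + 1, hop))
    acg = np.zeros((n_ch, max_lag, len(starts)))
    for t, s in enumerate(starts):
        frame = subbands[:, s:s + frame_len + max_lag]
        base = frame[:, :frame_len]
        for lag in range(max_lag):
            acg[:, lag, t] = np.sum(base * frame[:, lag:lag + frame_len],
                                    axis=1)
        acg[:, :, t] /= acg[:, 0:1, t] + 1e-12  # normalize by lag-0 energy
    return acg
```

A channel carrying a periodic signal produces a ridge at the lag equal to the signal's period, which is the structure that pitch models (and the comodulation analysis in this paper) operate on.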