Presented at ACM Multimedia '98 Workshop on Content Processing of Music for Multimedia Applications, Bristol UK, 12 Sept 1998

Music Content Analysis through Models of Audition

Keith D. Martin, Eric D. Scheirer, and Barry L. Vercoe
MIT Media Laboratory Machine Listening Group
Cambridge MA USA
{kdm,eds,bv}@media.mit.edu

ABSTRACT

The direct application of ideas from music theory and music signal processing has not yet led to successful musical multimedia systems. We present a research framework that addresses the limitations of conventional approaches by questioning their (often tacit) underlying principles. We discuss several case studies from our own research on the extraction of musical rhythm, timbre, harmony, and structure from complex audio signals; these projects have demonstrated the power of an approach based on a realistic view of human listening abilities. Continuing research in this direction is necessary for the construction of robust systems for music content analysis.

INTRODUCTION

Most attempts to build music-analysis systems have tried hard to respect the conventional wisdom about the structure of music. However, this model of music, based on notes grouped into rhythms, chords, and harmonic progressions, is really only applicable to a restricted class of listeners; there is strong evidence that non-musicians do not hear music in these terms. As a result, attempts to directly apply ideas from music theory and statistical signal processing have not yet led to successful musical multimedia systems. Today's computer systems are not capable of understanding music at the level of an average five-year-old; they cannot recognize a melody in a polyphonic recording or understand a song on a children's television program. We believe that to build robust and broadly useful musical systems, we must discard entrenched ideas about what it means to listen to music and start again.
The goal of this paper is to present an unconventional view of what human listeners are able and, especially, unable to do when listening to music. We provide a conceptual framework that acknowledges these perceptual limitations and even exploits them for the purpose of building artificial listening systems. We hope to convince other researchers to think deeply about the limitations of conventional approaches and to consider alternatives to the direct application of research results from structuralist music psychology to the construction of music-analysis systems. Many of these ideas are rooted in traditional music theory and have questionable relevance to practical issues in building real computational models. In contrast, we will present evidence from our own modeling work, which advocates a more directly psychoacoustic perspective.

The paper has three main sections. First, we present a collection of broad research goals in musical content analysis. Second, we describe a research framework that attempts to address the limitations of conventional approaches. Third, we present several case studies from our own research, demonstrating the power of an approach based on a realistic view of the abilities of human listeners.

BROAD RESEARCH GOALS

Current research in the Machine Listening Group at the MIT Media Lab addresses two broad goals simultaneously. The first is the scientific goal of building computer models in order to understand the properties of human perceptual processes. The second is the practical goal of engineering computer systems that can understand musical sound and building useful applications around them. Although the framework discussed here is wholly compatible with the broader project of general sound understanding, this paper addresses only music, and only from an application-centered perspective.
The case studies we present as examples here have direct parallels in our research on non-musical sound and in the scientific study of auditory perception in general.