INSTRUMENT IDENTIFICATION IN POLYPHONIC MUSIC: FEATURE WEIGHTING WITH MIXED SOUNDS, PITCH-DEPENDENT TIMBRE MODELING, AND USE OF MUSICAL CONTEXT

Tetsuro Kitahara,† Masataka Goto,‡ Kazunori Komatani,† Tetsuya Ogata,† and Hiroshi G. Okuno†

†Dept. of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
{kitahara, komatani, ogata, okuno}@kuis.kyoto-u.ac.jp

‡National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki 305-8568, Japan
m.goto@aist.go.jp

ABSTRACT

This paper addresses the problem of identifying musical instruments in polyphonic music. Musical instrument identification (MII) is an important task in music information retrieval because MII results make it possible to automatically retrieve certain types of music (e.g., piano sonatas, string quartets). Only a few studies, however, have dealt with MII in polyphonic music, which raises three issues: feature variations caused by sound mixtures, the pitch dependency of timbres, and the use of musical context. For the first issue, templates of feature vectors representing timbres are extracted not only from isolated sounds but also from sound mixtures. Because some features are not robust in the mixtures, the features are weighted according to their robustness by using linear discriminant analysis. For the second issue, we use an F0-dependent multivariate normal distribution, which approximates the pitch dependency of timbre as a function of fundamental frequency. For the third issue, when the instrument of each note is identified, the a priori probability of the note is calculated from the a posteriori probabilities of temporally neighboring notes. Experimental results showed that recognition rates improved from 60.8% to 85.8% for trio music and from 65.5% to 91.1% for duo music.

Keywords: Musical instrument identification, mixed-sound template, F0-dependent multivariate normal distribution, musical context, MPEG-7
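The F0-dependent multivariate normal distribution mentioned in the abstract can be pictured with a short sketch. The Python code below is a minimal, hypothetical rendering rather than the authors' implementation: it assumes the F0-dependent mean of each feature dimension is approximated by a low-order polynomial of log F0 and that the covariance is F0-independent, estimated from the residuals; the function names `fit_f0_dependent_gaussian` and `log_likelihood` and the toy data are invented for this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_f0_dependent_gaussian(features, f0s, degree=3):
    """Fit an F0-dependent Gaussian: the mean of each feature dimension
    is a polynomial function of log F0; the covariance is estimated
    from the residuals after removing the pitch-dependent mean."""
    log_f0 = np.log(f0s)
    # One polynomial per feature dimension: mu_d(f) = poly_d(log f)
    polys = [np.polyfit(log_f0, features[:, d], degree)
             for d in range(features.shape[1])]
    means = np.column_stack([np.polyval(p, log_f0) for p in polys])
    cov = np.cov((features - means).T)
    return polys, cov

def log_likelihood(x, f0, polys, cov):
    """Log-likelihood of feature vector x observed at fundamental frequency f0."""
    mean = np.array([np.polyval(p, np.log(f0)) for p in polys])
    return multivariate_normal(mean=mean, cov=cov).logpdf(x)

# Toy usage: synthetic notes whose features drift with log F0.
rng = np.random.default_rng(0)
f0s = rng.uniform(130.0, 1047.0, size=200)            # roughly C3..C6
feats = np.column_stack([0.5 * np.log(f0s), -0.2 * np.log(f0s)])
feats += rng.normal(scale=0.1, size=feats.shape)
polys, cov = fit_f0_dependent_gaussian(feats, f0s)
print(log_likelihood(feats[0], f0s[0], polys, cov))
```

Under this reading, identifying the instrument of a note amounts to evaluating one such model per candidate instrument at the note's estimated F0 and choosing the most likely one.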
1 INTRODUCTION

The increasing quantity of musical audio signals available through electronic music distribution services and personal music storage has forced users to spend more and more time finding the musical pieces they want. Efficient music information retrieval (MIR) technologies are indispensable for shortening this search. In particular, the automatic description of musical content in a universal framework is expected to become one of the most important key technologies for achieving sophisticated MIR. In fact, the ISO recently established a new standard called MPEG-7 [1], which provides a universal framework for describing multimedia content.

The names of musical instruments play an important role as music descriptors because musical pieces are sometimes characterized by the instruments they use. In fact, the names of some music genres are based on instrument names, such as "piano sonata" and "string quartet." In addition, when a user wants to search for certain types of musical pieces, such as piano solos or string quartets, a retrieval system can exploit descriptions of musical instrument names. Therefore, musical instrument identification (MII), which aims at determining what instruments are used in musical pieces, has been studied in recent years [2, 3, 4, 5, 6, 7, 8].

Identifying instruments in polyphonic music is more difficult than in monophonic music. In fact, most methods for identifying monophonic sounds [3, 4, 7, 9] fail when dealing with polyphonic music. For example, our previous method [9], which identified an instrument by calculating the similarities between the feature vector of a given isolated sound and prestored feature vectors of instrument-labeled sounds (called training data), had difficulty with polyphonic music because features extracted from simultaneously played instruments differ from those extracted from monophonic sounds.

To achieve highly accurate MII in polyphonic music, it is essential to resolve three issues: feature variations caused by sound mixtures, the pitch dependency of timbres, and the use of musical context. These issues, however, have not been fully addressed in existing studies. Some techniques, such as time-domain waveform template matching [5], feature adaptation [6], and the missing feature theory [2], have been proposed to address the first issue, but no attempts have been made to construct a template from polyphonic music, although doing so is expected to improve MII. To address the second issue, most existing studies have used multiple templates covering the entire pitch range of each instrument, but they have not modeled the pitch dependency of timbres effectively. To address the third issue, Kashino et al. [5] introduced music stream networks and proposed a technique, based on the Bayesian network, of propagating the a posteriori probabilities of musical notes in a network to one another. To apply musical context to identification frameworks not based on the Bayesian network, however, an alternative solution is needed.

In this paper, to address the first issue, we construct a feature vector template (i.e., a set of training data) from polyphonic sound mixtures. Because features tend to vary when multiple sounds are mixed, the features are weighted according to their robustness by using linear discriminant analysis.
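To make this feature weighting concrete, the following sketch shows one plausible way to weight features by their robustness with linear discriminant analysis, here via scikit-learn rather than anything prescribed by the paper: LDA projects feature vectors, extracted from both isolated and mixed sounds, onto axes that maximize between-instrument separation relative to within-instrument scatter, so dimensions that fluctuate under mixing receive small weights. The function name `train_weighted_template` and the random toy data are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_weighted_template(X_train, y_train):
    """Fit LDA on instrument-labeled feature vectors that include
    mixed-sound training data (the "mixed-sound template").
    Dimensions that fluctuate under mixing contribute little to the
    discriminant axes, realizing robustness-based feature weighting."""
    lda = LinearDiscriminantAnalysis()
    Z_train = lda.fit_transform(X_train, y_train)  # weighted features
    return lda, Z_train

# Toy usage with random stand-ins for extracted features:
# 200 notes, 40-dimensional feature vectors, 3 instrument labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 3, size=200)
lda, Z = train_weighted_template(X, y)

# An unseen note's feature vector is projected with the same weights
# before template matching or likelihood evaluation.
z_new = lda.transform(rng.normal(size=(1, 40)))
```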