PARAMETRIZATION OF INHARMONIC BIRD SOUNDS FOR AUTOMATIC RECOGNITION

Seppo Fagerlund *
Laboratory of Acoustics and Audio Signal Processing
Helsinki University of Technology (HUT)
Otakaari 5a, 02015, Espoo, Finland
phone: + (358) 9-451 6029, fax: + (358) 9-460 224
email: Seppo.Fagerlund@hut.fi

Aki Härmä
Philips Research
Prof. Holstlaan 4, WO-090, 5656 AA, Eindhoven, The Netherlands

ABSTRACT

We have earlier found that sinusoidal modeling and the related parametrization are a promising technique for the automatic analysis and recognition of typical sounds produced by songbirds. In this article we study techniques that can be used to characterize sounds that cannot be efficiently parameterized using the sinusoidal model. The most familiar examples of such sounds are the creaky sounds of Crows and many of the sounds produced by, e.g., Mallards. Such sounds often feature an irregular pitch pattern. We introduce a method for feature reduction and optimal feature selection for the recognition of bird species.

1. INTRODUCTION

The long-term objective of the current work is to develop feature extraction and classification methods for a system that could automatically recognize bird species by their sounds in field conditions. It has been demonstrated earlier that the sounds of many songbirds are clearly tonal and can be efficiently modeled by one or a small number of time-varying sinusoidal components [1]. Nevertheless, songbirds also regularly produce sounds that have a complex spectrum and temporal envelope. In birds other than songbirds such cases are even more common. For example, the Common Raven rarely produces anything that fits the sinusoidal signal model. In the current article we develop a more appropriate set of descriptive parameters for those sounds.

Bird sounds are divided by function into songs and calls, which are further divided into hierarchical levels: phrase, syllable, and element or note [2]. Elements are the smallest separable units of bird vocalization.
In the simplest case a syllable is constructed from one element, but more complex syllables may include several elements. A phrase is a series of syllables that occur in a particular pattern. A phrase is often, but not always, a sequence of similar syllables.

Relatively little has been done previously to find an efficient parametrization of bird sounds for recognition. For example, in [3, 4] bird sounds were represented by spectrograms of syllables or elements. Most of the earlier work on the automatic recognition of birds has been related to the recognition of bird songs [5, 6] or some restricted set of predefined sounds from one species [4]. In this work we test the recognition of bird species based on individual syllables. Nelson [6] noted that different species use different cues to recognize their own species. In this work we introduce a method to measure the importance of features for classification and try to find species-specific feature sets.

* Supported by the Academy of Finland (AveSound project).

The recognition experiments in the current article are based on the bird song database collected in the Avesound project [7] at HUT. Audio files in the database contain songs, calls, or series of calls, mainly recorded in Finland. Individual syllables are extracted from songs using a segmentation algorithm based on the short-time signal energy and an adaptive estimate of the background noise. Feature vectors are then formed from the various signal measures introduced below. Finally, syllables are classified based on those representations.

In this article we first try to characterize what types of non-tonal sounds are common in avian vocalization. Secondly, we study the performance of several different computational measures that could be used as features in an automatic recognizer. We use low-level signal parameters such as the spectral centroid and signal bandwidth.
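
As an illustration of the two low-level parameters named above, the following is a minimal sketch and not the implementation used in this work. It assumes one common definition: the spectral centroid is the power-weighted mean frequency of a frame, and the signal bandwidth is the power-weighted standard deviation of frequency around that centroid. The function names `power_spectrum` and `centroid_and_bandwidth` are hypothetical, and a naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum for DFT bins 0..N/2, using a naive O(N^2) DFT.

    Adequate for short illustrative frames; an FFT would be used in practice.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def centroid_and_bandwidth(frame, sample_rate):
    """Return (spectral centroid, bandwidth) of one frame, both in Hz.

    Assumed definitions (not taken from the paper):
      centroid  = sum(f * P(f)) / sum(P(f))
      bandwidth = sqrt(sum((f - centroid)^2 * P(f)) / sum(P(f)))
    where P(f) is the power spectrum of the frame.
    """
    n = len(frame)
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        # A silent frame has no defined centroid; return zeros by convention.
        return 0.0, 0.0
    freqs = [k * sample_rate / n for k in range(len(power))]
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    variance = sum((f - centroid) ** 2 * p for f, p in zip(freqs, power)) / total
    return centroid, math.sqrt(variance)
```

For a pure tone whose frequency falls exactly on a DFT bin, the centroid equals the tone frequency and the bandwidth is approximately zero; a noisy or creaky syllable spreads its power over many bins and therefore yields a larger bandwidth.
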
These parameters have been used previously, for example, in general audio context classification [8] and music genre classification [9], but, to our knowledge, have not previously been tested for bird sounds. For comparison we also test recognition with a Mel-frequency cepstral coefficient (MFCC) representation of syllables. The MFCC model has been a popular parametrization method in different types of audio recognition tasks, e.g., in automatic speech recognition [10].

2. THE CLASS OF INHARMONIC SOUNDS IN BIRDS

In [1], harmonic bird sounds were divided into four classes by the observed harmonic structure. Classes I and II were for pure sinusoidal and pure harmonic signals, respectively. A Class III syllable has a harmonic structure in which the fundamental frequency component (F0) is heavily attenuated, and in Class IV both F0 and F1 are weaker than F2. It was found in [1] that syllables that are not harmonic usually fell outside the four classes or went to the harmonic Class IV. In these cases the likelihood of a syllable belonging to the pure sinusoidal class (Class I) was also very small. In this article this observation has been turned into a criterion for selecting sounds that do not fit the sinusoidal signal model. In particular, if the likelihood of belonging to the pure sinusoidal class is less than 60%, the syllable is labelled as belonging to the class of inharmonic sounds. Note that the set of inharmonic sounds defined this way will contain many different types of sounds, and some of those can also be considered harmonic. The Hooded Crow (Corvus corone cornix) is a good example