UNSUPERVISED DISCOVERY OF ACOUSTIC PATTERNS IN BIRD VOCALISATIONS EMPLOYING DTW AND CLUSTERING

Peter Jančovič 1*, Münevver Köküer 2,1, Masoud Zakeri 1 and Martin Russell 1

1 School of Electronic, Electrical & Computer Engineering, University of Birmingham, UK
2 Faculty of Technology, Engineering & Environment, Birmingham City University, UK
{p.jancovic,mxz848,m.j.russell}@bham.ac.uk, munevver.kokuer@bcu.ac.uk

ABSTRACT

This paper presents a method for the unsupervised discovery of acoustic patterns in bird vocalisations recorded in real-world natural environments. The proposed method employs sinusoidal detection to provide frequency tracks, which are used as features to characterise bird tonal vocalisations. A variant of dynamic time warping, capable of searching for multiple partial matchings, is used to segment the data based on these frequency track sequences. An agglomerative hierarchical clustering approach is then employed to cluster recurring segments. Evaluations are performed on audio recordings provided by the Borror Laboratory of Bioacoustics. The obtained results indicate that structurally distinct stereotyped acoustic units can be determined.

Index Terms— unsupervised, clustering, segmentation, dynamic time warping, bird, vocalisation, sinusoid, tonal

1. INTRODUCTION

Bird vocalisations can be considered to be composed of sub-units at different levels, such as elements (also referred to as notes), syllables, phrases and songs. Elements can be taken as the smallest structurally distinct stereotyped acoustic units produced by birds, and these can be thought of as analogous to phonemes in the context of speech processing. While a large amount of data annotated at the phoneme (or higher) level exists for speech, there is no wide range of publicly available annotated data for bird vocalisations.
Such annotated bird acoustic data and an inventory of the units of bird vocalisations are important both for bioacousticians, for instance to study differences between individuals, populations or behavioural contexts, and for the development of more advanced automated systems for processing bird vocalisations.

Unsupervised processing of time series data and searching for recurring patterns relate to current research in various fields, from computational biology to audio summarisation. A recent review of time series matching approaches was presented in [1]. We focus here on works in speech and audio processing. An unsupervised derivation of variable-length acoustic units from speech signals employing hidden Markov models was investigated in [2]. The authors in [3] employed dynamic time warping (DTW) and neural networks for an unsupervised categorisation of isolated vocalisations of dolphins and whales. The work in [4] employed a segmental variant of DTW for unsupervised processing of speech data to automatically extract words and linguistic phrases from recordings of academic lectures. In [5], segmental DTW and K-means clustering were employed for unsupervised learning of acoustic events, with evaluations presented for spoken digits and non-speech sounds in meeting rooms. In [6], a similarity matrix approach was used to summarise music data.

Automatic processing of bird vocalisations is a relatively recent research field [7, 8, 9]. The data used in many studies to date consist of recordings of relatively isolated bird vocalisations without noise. Some studies used continuous recordings and split the signal into smaller segments, either by human inspection of spectrograms [9] or automatically using an energy-based threshold decision in the time or time-frequency domain [7, 10, 11, 12].
Such energy-based segmentation may be difficult to perform accurately on recordings of bird vocalisations made in their natural habitat, as these are usually contaminated by background noise or by vocalisations of other birds or animals.

In this paper, we propose an approach for the unsupervised discovery of acoustic elements in bird vocalisations. As we are dealing specifically with bird tonal vocalisations, we employ an algorithm, introduced in [13, 14], to decompose the entire acoustic scene into sinusoidal components. This decomposition is then used to detect and estimate frequency tracks, which serve in this paper as the temporal sequences for the subsequent processing stages. Note that these subsequent stages do not depend on the type of features, and thus the presented work could also be applied to birds producing non-tonal vocalisations. We developed a variant of DTW which can search for multiple partial matchings within given sequences. The resulting segments are then clustered, based on their DTW-measured similarity, using a hierarchical clustering approach. Experimental evaluations show that the proposed method can provide a set of structurally distinct stereotyped bird vocalisation patterns.

EUSIPCO 2013
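To make the last two stages outlined in the introduction more concrete, the following is a minimal sketch in Python, not the method evaluated in this paper. It assumes that frequency tracks are already available as plain lists of frequencies in Hz, and it substitutes classic full-sequence DTW for the partial-matching variant proposed here. The function names `dtw_distance` and `agglomerative`, the average-linkage rule, the 50 Hz stopping threshold and the toy track values are all illustrative assumptions.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic full-sequence DTW cost between two frequency tracks (Hz),
    normalised by the combined length. Assumption: this is standard DTW,
    not the multiple-partial-matching variant described in the paper."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)


def agglomerative(tracks, threshold):
    """Average-linkage agglomerative clustering on DTW distances.
    Merging stops once the closest pair of clusters exceeds `threshold`
    (a hypothetical stopping rule chosen for this sketch)."""
    dist = {
        (i, j): dtw_distance(tracks[i], tracks[j])
        for i, j in combinations(range(len(tracks)), 2)
    }
    clusters = [[i] for i in range(len(tracks))]

    def linkage(c1, c2):
        total = sum(dist[(min(p, q), max(p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so index x stays valid
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy frequency tracks (made-up values): two rising and two falling sweeps.
tracks = [
    [2000, 2100, 2200, 2300, 2400],        # rising
    [4000, 3800, 3600, 3400],              # falling
    [2000, 2050, 2100, 2200, 2300, 2400],  # rising, slightly slower
    [4000, 3900, 3800, 3600, 3400],        # falling, slightly slower
]
clusters = agglomerative(tracks, threshold=50.0)
print(clusters)  # [[0, 2], [1, 3]]
```

In this toy example the two rising sweeps and the two falling sweeps end up in separate clusters, because time warping absorbs the small differences in duration while the large difference in frequency range keeps the two groups apart.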