Similarity-based birdcall retrieval from environmental audio

Xueyan Dong ⁎, Michael Towsey, Anthony Truskinger, Mark Cottman-Fields, Jinglan Zhang, Paul Roe

Ecoacoustics Research Group, School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia

Article history: Received 1 December 2014; received in revised form 28 July 2015; accepted 28 July 2015; available online 5 August 2015

Keywords: Birdcall retrieval; Environmental audio; Ridge detection; Spectral peak tracks

Abstract

Automated digital recordings are useful for large-scale temporal and spatial environmental monitoring. An important research effort has been the automated classification of calling bird species. In this paper we examine a related task: the retrieval, from a database of audio recordings, of birdcalls similar to a user-supplied query call. Such a retrieval task can sometimes be more useful than an automated classifier. We compare three approaches to similarity-based birdcall retrieval, using spectral ridge features and two kinds of gradient features: the structure tensor and the histogram of oriented gradients. The retrieval accuracy of our spectral ridge method is 94%, compared to 82% for the structure tensor method and 90% for the histogram of oriented gradients method. Additionally, the spectral ridge approach potentially offers a more compact representation and is more computationally efficient.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Birds are widely regarded as a good indicator of biodiversity because they provide important ecosystem services (Gregory et al., 2005). Most biodiversity assessment studies are done by field observations using manual surveys. Manual methods, such as the five-minute bird count used in New Zealand (Department of Conservation, 2006), rely on the professional knowledge of experts and can achieve accurate results.
However, since most bird species are mobile and spot counts are necessarily of short duration, there is a probability that some species will be missed. In addition, the cost of keeping experts in the field limits the spatiotemporal scalability of manual approaches.

Automated acoustic recorders are now frequently deployed to assist biologists in bird studies (Farnsworth and Russell, 2007; Sueur et al., 2008) because they can operate unattended for long periods. Automated digital recordings, stored in an appropriate way (Kasten et al., 2012), can provide a persistent and verifiable record of the acoustic soundscape (Frommolt et al., 2008; S. H. Gage and Axel, 2014; Wimmer et al., 2010; S. Gage et al., 2004; Qi et al., 2008). Signal and image processing techniques can be used to automate the detection of animal calls (Bardeli et al., 2008), and commercial software, such as Raven and Song Scope (Agranat, 2009), is now available to segment and characterise birdcalls.

However, while fully automated analysis techniques can, in theory, scale up to process large volumes of audio data, in practice their reliability and accuracy remain problematic. It is not an easy task to build accurate birdcall recognisers, partly because the calls of interest must be disentangled from many kinds of non-biological sounds, collectively described as geophony and anthrophony (Pijanowski et al., 2011). Another reason is that birdcalls can vary geographically, seasonally and over the life-cycle of a species (Kirschel et al., 2009).

There are two kinds of birdcall identification task that are of use to ecologists: call classification and call retrieval. In the former task, a classifier is trained to recognise a fixed set of call classes (this terminology is used because some species make more than one type of call) and, in operational mode, assigns every input to one of those classes.
In the call retrieval task, an input call initiates a search through a database of audio recordings to retrieve one or more similar calls. Although these two tasks are defined differently, they typically involve common steps, namely segmentation, feature extraction and a similarity measure. An important difference between the two tasks is that it is not clear how the user would want a classifier to respond to an unexpected sound (that is, one from a class not among the training classes), whereas a retrieval system is not constrained by the notion of class. Another difference is that the feature set for a classifier is ‘tuned’ to discriminate the target classes, whereas the feature set for a retrieval task must be able to characterise any arbitrary input.

The contribution of this paper lies in its innovative features for content-based retrieval of bird vocalisations from audio recordings. Fig. 1 shows the flowchart of the retrieval system. Our work is motivated by a paper from Bardeli (2009). In the next section we review some of the literature on birdcall recognition.

Ecological Informatics 29 (2015) 66–76. ISSN 1574-9541. http://dx.doi.org/10.1016/j.ecoinf.2015.07.007
Journal homepage: www.elsevier.com/locate/ecolinf
⁎ Corresponding author. E-mail address: xueyan.dong@student.qut.edu.au (X. Dong).

1.1. Related work

Most research effort to date has been on the birdcall classification task (Aide et al., 2013; Anderson et al., 1996; Chen and Maher, 2006; S. Duan et al., 2011; Jančovič and Köküer, 2011; Kasten et al., 2010). Bird vocalisations are typically divided into songs (associated with