Similarity-based birdcall retrieval from environmental audio
Xueyan Dong ⁎, Michael Towsey, Anthony Truskinger, Mark Cottman-Fields, Jinglan Zhang, Paul Roe
Ecoacoustics Research Group, School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
Article info
Article history:
Received 1 December 2014
Received in revised form 28 July 2015
Accepted 28 July 2015
Available online 5 August 2015
Keywords:
Birdcall retrieval
Environmental audio
Ridge detection
Spectral peak tracks
Abstract
Automated digital recordings are useful for large-scale temporal and spatial environmental monitoring. An
important research effort has been the automated classification of calling bird species. In this paper we examine
a related task: retrieving, from a database of audio recordings, birdcalls that are similar to a user-supplied
query call. Such a retrieval task can sometimes be more useful than an automated classifier. We compare three
approaches to similarity-based birdcall retrieval, using spectral ridge features and two kinds of gradient
features: the structure tensor and the histogram of oriented gradients. The retrieval accuracy of our spectral
ridge method is 94%, compared to 82% for the structure tensor method and 90% for the histogram of oriented
gradients method. Additionally, the spectral ridge approach potentially offers a more compact representation
and is more computationally efficient.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Birds are widely regarded as a good indicator of biodiversity because
they provide important ecosystem services (Gregory et al., 2005). Most
biodiversity assessment studies are done by field observations using
manual surveys. Manual methods, such as the five-minute bird count
used in New Zealand (Department of Conservation, 2006), rely on the
professional knowledge of experts and can achieve accurate results.
However, since most bird species are mobile and spot counts are
necessarily of short duration, there is a risk that some species will be
missed. In addition, the cost of keeping experts in the field limits the
spatiotemporal scalability of manual approaches.
Automated acoustic recorders are now frequently deployed to assist
biologists in bird studies (Farnsworth and Russell, 2007; Sueur et al.,
2008) because they can operate unattended for long periods. Automated
digital recordings, stored in an appropriate way (Kasten et al., 2012),
can provide a persistent and verifiable record of the acoustic
soundscape (Frommolt et al., 2008; S. H. Gage and Axel, 2014; Wimmer
et al., 2010; S. Gage et al., 2004; Qi et al., 2008). Signal and image
processing techniques can be used to automate the detection of animal
calls (Bardeli et al., 2008), and commercial software, such as Raven and
Song Scope (Agranat, 2009), is now available to segment and
characterise birdcalls.
However, while fully-automated analysis techniques can, in theory,
scale up to process large volumes of audio data, in practice their
reliability and accuracy remain problematic. It is not an easy task to build
accurate birdcall recognisers partly because the calls of interest must
be disentangled from many kinds of non-biological sounds, collectively
described as geophony and anthrophony (Pijanowski et al., 2011).
Another reason is that birdcalls can vary geographically, seasonally
and over the life-cycle of a species (Kirschel et al., 2009).
There are two kinds of birdcall identification task that are of use to
ecologists: call classification and call retrieval. In the former task, a
classifier is trained to recognise a fixed set of call classes (this terminology is
used because some species make more than one type of call) and, in
operational mode, the classifier assigns every input to one of those classes. In
the call retrieval task, an input call initiates a search through a database
of audio recordings to retrieve one or more similar calls. Although these
two tasks are defined differently, they typically involve common steps,
namely segmentation, feature extraction and a similarity measure. An
important difference between the two tasks is that it is not clear how
the user would want a classifier to respond to an unexpected sound
(that is, one from a class not among the training classes) whereas a
retrieval system is not constrained by the notion of class. Another
difference between the two tasks is that the feature set for a classifier
is ‘tuned’ to discriminate the target classes, whereas the feature set for
a retrieval task must be able to characterise any arbitrary input.
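The difference between the two tasks can be illustrated with a minimal sketch. It is not the method evaluated in this paper: the three-element feature vectors, the class prototypes, the segment identifiers and the use of cosine similarity are all assumptions chosen only for illustration. The sketch shows that a closed-set classifier must assign an unexpected sound to one of its training classes, whereas a retrieval system simply ranks stored segments by their similarity to the query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def classify(query, class_prototypes):
    """Closed-set classification: always returns one of the known labels."""
    return max(
        class_prototypes,
        key=lambda label: cosine_similarity(query, class_prototypes[label]),
    )


def retrieve(query, database, top_k=3):
    """Retrieval: rank stored segments by similarity to the query."""
    scored = [
        (segment_id, cosine_similarity(query, features))
        for segment_id, features in database.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical data: two training classes and three stored call segments.
prototypes = {"whistle": [1.0, 0.0, 0.0], "trill": [0.0, 1.0, 0.0]}
database = {
    "rec01@12.5s": [0.9, 0.1, 0.0],
    "rec02@3.0s": [0.0, 0.2, 0.9],
    "rec03@40.1s": [0.1, 0.9, 0.1],
}

# A query that resembles neither training class.
unexpected = [0.0, 0.1, 1.0]

forced_label = classify(unexpected, prototypes)
ranked = retrieve(unexpected, database, top_k=2)
print(forced_label)
print(ranked)
```

In this toy example the classifier is forced to choose a label even though the query is nearly orthogonal to both prototypes, while the retrieval function returns the stored segment that most resembles the query without any reference to class.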
The contribution of this paper is a set of novel features for
content-based bird vocalisation retrieval from audio recordings. Fig. 1
shows the flowchart of the retrieval system. Our work is motivated by
the work of Bardeli (2009). In the next section we review some of
the literature on birdcall recognition.
1.1. Related work
Most research effort to date has been on the birdcall classification
task (Aide et al., 2013; Anderson et al., 1996; Chen and Maher, 2006;
S. Duan et al., 2011; Jančovič and Köküer, 2011; Kasten et al., 2010).
Bird vocalisations are typically divided into songs (associated with
Ecological Informatics 29 (2015) 66–76
⁎ Corresponding author.
E-mail address: xueyan.dong@student.qut.edu.au (X. Dong).
http://dx.doi.org/10.1016/j.ecoinf.2015.07.007
1574-9541/© 2015 Elsevier B.V. All rights reserved.