MASK: Robust Local Features for Audio Fingerprinting
Xavier Anguera, Antonio Garzon† and Tomasz Adamek‡
Telefonica Research,
Torre Telefonica Diagonal 00,
08019, Barcelona, Spain
xanguera@tid.es
Abstract—This paper presents a novel local audio fingerprint
called MASK (Masked Audio Spectral Keypoints) that can
effectively encode the acoustic information present in audio
documents and discriminate between transformed versions of
the same acoustic documents and other unrelated documents.
The fingerprint has been designed to be resilient to strong
transformations of the original signal and to be usable for
generic audio, including music and speech. Its main
characteristics are its locality, binary encoding, robustness and
compactness. The proposed audio fingerprint encodes the
local spectral energies around salient points selected among
the main spectral peaks in a given signal. The encoding is
done by centering on each point a carefully designed mask
that defines regions of the spectrogram whose average energies
are compared with each other. Each comparison yields
a single bit depending on which region has more energy, and
all bits are grouped into a final binary fingerprint. In addition,
the fingerprint stores the frequency of each peak, quantized
using a Mel filterbank. The length of the fingerprint is solely
defined by the number of compared regions used, and can
be adapted to the requirements of any particular application.
The number of salient points encoded per second
can also be easily modified. In the experimental section we show
the suitability of this fingerprint for finding matching segments
on the NIST-TRECVID benchmarking evaluation datasets,
comparing it with a well-known fingerprint and obtaining up
to 26% relative improvement in NDCR score.
Keywords-Audio fingerprinting, audio indexing, copy detection
I. INTRODUCTION
In this paper we propose a novel audio fingerprint that we
call MASK (Masked Audio Spectral Keypoints). Audio
fingerprinting is understood as the method by
which an audio signal is compactly represented so that it
is convenient for storage, indexing and comparison between
audio documents. It differs from watermarking techniques
[1] in that no external information (watermark) needs to be
encoded into the audio a priori, as the audio itself acts as the
watermark.
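The region-comparison idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rectangular region layout (REGIONS), the list of compared pairs (PAIRS) and the function names are hypothetical choices made only for illustration, and the spectrogram is assumed to be already available as a list of Mel bands, each holding one energy value per frame.

```python
from typing import List, Sequence, Tuple

# A region is a rectangle relative to the salient point, with inclusive bounds:
# (first_band_offset, last_band_offset, first_frame_offset, last_frame_offset).
Region = Tuple[int, int, int, int]

# Hypothetical mask layout, for illustration only (not the mask from the paper).
REGIONS: List[Region] = [
    (-2, -1, -2, -1),  # 0: lower bands, earlier frames
    (-2, -1, 1, 2),    # 1: lower bands, later frames
    (1, 2, -2, -1),    # 2: higher bands, earlier frames
    (1, 2, 1, 2),      # 3: higher bands, later frames
]

# Hypothetical list of region pairs to compare; each pair contributes one bit.
PAIRS: List[Tuple[int, int]] = [(0, 1), (0, 2), (1, 3), (2, 3)]


def mean_energy(spec: Sequence[Sequence[float]], band: int, frame: int,
                region: Region) -> float:
    """Average energy of one mask region centered on (band, frame)."""
    b0, b1, t0, t1 = region
    total = 0.0
    count = 0
    for b in range(band + b0, band + b1 + 1):
        for t in range(frame + t0, frame + t1 + 1):
            total += spec[b][t]
            count += 1
    return total / count


def encode_point(spec: Sequence[Sequence[float]], band: int, frame: int) -> int:
    """Return the binary code for one salient point, one bit per compared pair."""
    # The mask must fit entirely inside the spectrogram.
    if not (2 <= band < len(spec) - 2 and 2 <= frame < len(spec[0]) - 2):
        raise ValueError("salient point too close to the spectrogram border")
    energies = [mean_energy(spec, band, frame, r) for r in REGIONS]
    bits = 0
    for i, j in PAIRS:
        # Bit is 1 when the first region of the pair has more energy.
        bits = (bits << 1) | (1 if energies[i] > energies[j] else 0)
    return bits


# Toy 5x5 spectrogram: energy grows with band index and decays over time.
toy = [[band * 10 + (4 - frame) for frame in range(5)] for band in range(5)]
code = encode_point(toy, band=2, frame=2)
```

Because only the ordering of region energies is stored, multiplying the whole toy spectrogram by a constant gain leaves the code unchanged, which hints at why such comparisons tolerate some signal transformations. In a complete fingerprint the Mel band index of the peak would be stored alongside the bits, as the abstract describes.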
A good fingerprint should capture and characterize the
essence of the audio content. More specifically, the quality
of a fingerprint can be measured along four main dimensions:
discriminability, robustness, compactness and efficiency. A
fingerprint has high discriminatory power if two fingerprints
extracted from the same location in two audio
segments coming from the same source are very similar
while, at the same time, fingerprints extracted from segments
coming from different sources, or from different locations in
the same source, are very different. Another important quality
is robustness to acoustic transformations. We define a
transformation as any alteration of the original signal that
modifies its physical characteristics but still
allows a human to judge that the audio comes from the
original signal. Typical transformations include MP3 encoding,
sound equalization and mixing with external noises or
signals. Compactness is important for reducing
the amount of information that needs to be compared when
using fingerprints to search large collections of audio
documents. Finally, efficiency refers to how fast the
fingerprint can be extracted from the original signal and,
equally, to the efficiency of the retrieval methods that can be
used with the fingerprint.
† For the duration of this project Antonio Garzon was a visiting scholar
from Universitat Pompeu Fabra.
‡ Tomasz Adamek is currently with www.catchoom.com
In recent years there have been many proposals for
constructing acoustic fingerprints [2]–[6]. For
an early review of some alternatives see [7]. Some of them
are not robust enough to severe audio transformations, degrade
in performance when encoding content other than
music, or are expensive to compute or to store. Three of
the most cited audio fingerprints are probably the Shazam
fingerprint presented in [2], the system proposed by Philips
in [3] and Waveprint, proposed by Google [4].
The Shazam fingerprint [2] encodes the relationship between
two spectral maxima, one of which is called
an anchor point. Because each fingerprint encodes multiple
maxima, it is prone to errors when either of the
maxima is missing. For this reason, in order to make the
system robust, several pair combinations must be stored
for each selected anchor point within its target area,
creating an overhead of data to be stored for a given audio.
In addition, the data inside the fingerprint is encoded in
3 different blocks (one per peak frequency location, 20 bits
in total for the two peaks, and 12 bits for their time
difference). If some error is allowed in the comparison between
fingerprints, each fingerprint must first be converted from
binary form to the corresponding natural numbers, which are
then subtracted to find how far the spectral maxima are from
each other. Given
2012 IEEE International Conference on Multimedia and Expo
978-0-7695-4711-4/12 $26.00 © 2012 IEEE
DOI 10.1109/ICME.2012.137