MASK: Robust Local Features for Audio Fingerprinting

Xavier Anguera, Antonio Garzon † and Tomasz Adamek ‡
Telefonica Research, Torre Telefonica Diagonal 00, 08019, Barcelona, Spain
xanguera@tid.es

Abstract: This paper presents a novel local audio fingerprint called MASK (Masked Audio Spectral Keypoints) that effectively encodes the acoustic information present in audio documents and discriminates between transformed versions of the same acoustic document and other, unrelated documents. The fingerprint has been designed to be resilient to strong transformations of the original signal and to be usable for generic audio, including music and speech. Its main characteristics are its locality, binary encoding, robustness and compactness. The proposed audio fingerprint encodes the local spectral energies around salient points selected among the main spectral peaks in a given signal. The encoding is done by centering on each point a carefully designed mask that defines regions of the spectrogram whose average energies are compared with each other. Each comparison yields a single bit, depending on which region has more energy, and all bits are grouped into a final binary fingerprint. In addition, the fingerprint stores the frequency of each peak, quantized using a Mel filterbank. The length of the fingerprint is determined solely by the number of region comparisons used, and can be adapted to the requirements of any particular application. The number of salient points encoded per second can also be easily modified. In the experimental section we show the suitability of this fingerprint for finding matching segments on the NIST-TRECVID benchmarking evaluation datasets, comparing it with a well-known fingerprint and obtaining up to 26% relative improvement in NDCR score.

Keywords: audio fingerprinting, audio indexing, copy detection

I. INTRODUCTION

In this paper we propose a novel audio fingerprint we call MASK (which stands for Masked Audio Spectral Keypoints).
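As a rough illustration of the encoding idea summarized in the abstract, the following minimal sketch compares average energies of regions placed around a salient point and emits one bit per comparison. It is not the authors' implementation: the function names, the rectangular region format and the two-comparison `TOY_MASK` are assumptions made for this example, and the Mel-quantized peak frequency stored alongside the bits is omitted.

```python
from typing import List, Sequence, Tuple

# A region is a rectangle given as offsets relative to the salient point:
# (dt_start, dt_end, df_start, df_end), with end offsets exclusive.
# This representation is an assumption for illustration only.
Region = Tuple[int, int, int, int]


def region_mean(spec: Sequence[Sequence[float]], t: int, f: int,
                region: Region) -> float:
    """Average energy of one rectangular region placed relative to (t, f)."""
    dt0, dt1, df0, df1 = region
    values = [
        spec[ti][fi]
        for ti in range(t + dt0, t + dt1)
        for fi in range(f + df0, f + df1)
        if 0 <= ti < len(spec) and 0 <= fi < len(spec[ti])  # clip at borders
    ]
    return sum(values) / len(values) if values else 0.0


def encode_point(spec: Sequence[Sequence[float]], t: int, f: int,
                 comparisons: Sequence[Tuple[Region, Region]]) -> int:
    """Pack one bit per region comparison into an integer fingerprint."""
    bits = 0
    for region_a, region_b in comparisons:
        a = region_mean(spec, t, f, region_a)
        b = region_mean(spec, t, f, region_b)
        bits = (bits << 1) | (1 if a > b else 0)  # 1 when region_a has more energy
    return bits


# Hypothetical two-comparison mask (NOT the mask designed in the paper).
BEFORE: Region = (-2, 0, -1, 2)  # two frames before the point
AFTER: Region = (1, 3, -1, 2)    # two frames after the point
BELOW: Region = (-1, 2, -2, 0)   # two frequency bins below the point
ABOVE: Region = (-1, 2, 1, 3)    # two frequency bins above the point
TOY_MASK: List[Tuple[Region, Region]] = [(BEFORE, AFTER), (BELOW, ABOVE)]

# Toy spectrogram indexed [time][frequency]: energy grows over time
# and falls with frequency.
toy = [[10 * t + (4 - f) for f in range(5)] for t in range(5)]
print(format(encode_point(toy, 2, 2, TOY_MASK), "02b"))  # prints "01"
```

Because each bit depends only on which of two regions has more energy, scaling the whole spectrogram by a constant gain leaves the bits unchanged in this sketch, and the fingerprint length equals the number of comparisons in the mask.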
Audio fingerprinting is understood as the method by which we can compactly represent an audio signal so that it is convenient for storage, indexing and comparison between audio documents. It differs from watermarking techniques [1] in that no external information or watermark needs to be encoded into the audio a priori, as the audio itself acts as the watermark. A good fingerprint should capture and characterize the essence of the audio content. More specifically, the quality of a fingerprint can be measured in four main dimensions: discriminability, robustness, compactness and efficiency. A fingerprint has high discriminatory power if two fingerprints extracted from the same location in two audio segments coming from the same source are very similar and, at the same time, fingerprints extracted from segments coming from different sources, or from different locations in the same source, are very different. Another important quality is robustness to acoustic transformations. We define as a transformation any alteration of the original signal that modifies its physical characteristics but still allows a human to judge that the audio comes from the original signal. Typical transformations include MP3 encoding, sound equalization and mixing with external noises or signals. Next, compactness is important for reducing the amount of information that needs to be compared when using fingerprints to search large collections of audio documents. Finally, efficiency refers to how fast the fingerprint can be extracted from the original signal and, likewise, to the efficiency of the retrieval methods that can be used with such a fingerprint.

† For the duration of this project Antonio Garzon was a visiting scholar from Universitat Pompeu Fabra.
‡ Tomasz Adamek is currently with www.catchoom.com.

In recent years there have been many proposals for different ways to construct acoustic fingerprints [2]–[6]. For an early review of some alternatives see [7].
Some of them are not robust enough to severe audio transformations, degrade in performance when encoding content other than music, or are expensive to compute or to store. Three of the most cited audio fingerprints are probably the Shazam fingerprint presented in [2], the system proposed by Philips in [3] and Waveprint, proposed by Google [4].

The Shazam fingerprint [2] encodes the relationship between two spectral maxima, one of which is called an anchor point. Because each fingerprint encodes multiple maxima, it is prone to errors when either of the maxima is missing. For this reason, in order to make the system robust, several tuple combinations within the target area of each selected anchor point need to be stored, creating an overhead in the data stored for a given audio. In addition, the data inside the fingerprint is encoded in 3 different blocks (20 bits for the frequency locations of the two peaks and 12 bits for their time difference). If some error is to be tolerated when comparing fingerprints, each block first has to be converted from binary form to the corresponding natural number, and the resulting values then subtracted to find how far the spectral maxima are from each other.

2012 IEEE International Conference on Multimedia and Expo. 978-0-7695-4711-4/12 $26.00 © 2012 IEEE. DOI 10.1109/ICME.2012.137

Given