MASK: Robust Local Features for Audio Fingerprinting

Xavier Anguera, Antonio Garzon and Tomasz Adamek
Telefonica Research, Torre Telefonica Diagonal 00, 08019, Barcelona, Spain
xanguera@tid.es

For the duration of this project Antonio Garzon was a visiting scholar from Universitat Pompeu Fabra.
Tomasz Adamek is currently with www.catchoom.com

Abstract: This paper presents a novel local audio fingerprint called MASK (Masked Audio Spectral Keypoints) that can effectively encode the acoustic information present in audio documents and discriminate between transformed versions of the same acoustic documents and other unrelated documents. The fingerprint has been designed to be resilient to strong transformations of the original signal and to be usable for generic audio, including music and speech. Its main characteristics are its locality, binary encoding, robustness and compactness. The proposed audio fingerprint encodes the local spectral energies around salient points selected among the main spectral peaks in a given signal. The encoding is done by centering on each point a carefully designed mask that defines regions of the spectrogram whose average energies are compared with each other. Each comparison yields a single bit, depending on which region has more energy, and all bits are grouped into a final binary fingerprint. In addition, the fingerprint stores the frequency of each peak, quantized using a Mel filterbank. The length of the fingerprint is determined solely by the number of region comparisons used, and can be adapted to the requirements of any particular application. The number of salient points encoded per second can also be easily modified. In the experimental section we show the suitability of this fingerprint for finding matching segments using the NIST-TRECVID benchmarking evaluation datasets, and compare it with a well-known fingerprint, obtaining up to 26% relative improvement in NDCR score.

Keywords: Audio fingerprinting, audio indexing, copy detection

I. INTRODUCTION

In this paper we propose a novel audio fingerprint we call MASK (which stands for Masked Audio Spectral Keypoints). Audio fingerprinting is understood as the method by which we can compactly represent an audio signal so that it is convenient for storage, indexing and comparison between audio documents. It differs from watermarking techniques [1] in that no external information or watermark needs to be encoded into the audio a priori, as the audio itself acts as the watermark. A good fingerprint should capture and characterize the essence of the audio content. More specifically, the quality of a fingerprint can be measured in four main dimensions: discriminability, robustness, compactness and efficiency. A fingerprint has high discriminatory power if two fingerprints extracted from the same location in two audio segments coming from the same source are very similar and, at the same time, fingerprints extracted from segments coming from different sources, or from different locations in the same source, are very different. Another important quality is robustness to acoustic transformations. We define as a transformation any alteration of the original signal that modifies its physical characteristics but still allows a human to judge that the audio comes from the original signal. Typical transformations include MP3 encoding, sound equalization and mixing with external noises or signals. Next, compactness is important for reducing the amount of information that needs to be compared when using fingerprints to search large collections of audio documents. Finally, efficiency refers to how fast the fingerprint can be extracted from the original signal and, equally, to the efficiency of the retrieval methods that can be used with the fingerprint.
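As a rough illustration of the encoding outlined in the abstract, the following minimal Python sketch derives one bit per comparison of average region energies around a spectral peak. The region layout (`REGIONS`), the list of compared pairs (`PAIRS`) and all function names are hypothetical choices made for this example, not the mask used by MASK. Measuring similarity between two binary fingerprints with the Hamming distance is likewise an assumption, since this section does not specify the measure. The Mel quantization of the peak frequency is omitted.

```python
import numpy as np

# Hypothetical mask (NOT the layout from the paper): each region is
# (dt_start, dt_end, df_start, df_end), given as offsets in frames and
# bands relative to the peak, with exclusive end offsets.
REGIONS = {
    "before": (-4, 0, -1, 2),
    "after": (1, 5, -1, 2),
    "below": (-1, 2, -4, 0),
    "above": (-1, 2, 1, 5),
}

# Hypothetical list of region pairs to compare; one bit per pair.
PAIRS = [
    ("before", "after"),
    ("below", "above"),
    ("before", "above"),
    ("after", "below"),
]


def region_mean(spec, t, f, region):
    """Average energy of one mask region centered on the peak at (f, t).

    `spec` is a 2-D array indexed as spec[frequency_band, time_frame].
    Regions are clipped to the spectrogram borders.
    """
    dt0, dt1, df0, df1 = region
    f0, f1 = max(f + df0, 0), min(f + df1, spec.shape[0])
    t0, t1 = max(t + dt0, 0), min(t + dt1, spec.shape[1])
    patch = spec[f0:f1, t0:t1]
    return float(patch.mean()) if patch.size else 0.0


def encode_peak(spec, t, f):
    """Pack one bit per region comparison into an integer fingerprint."""
    bits = 0
    for first, second in PAIRS:
        more_energy = (
            region_mean(spec, t, f, REGIONS[first])
            > region_mean(spec, t, f, REGIONS[second])
        )
        bits = (bits << 1) | int(more_energy)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

For example, `encode_peak(spec, 8, 8)` encodes a peak at frame 8 and band 8 of a spectrogram `spec`, and `hamming(x, y)` counts the bits in which two such fingerprints differ. Because only comparisons between region energies are stored, scaling the whole spectrogram by a positive gain leaves the bits of this sketch unchanged, which gives some intuition for why this kind of encoding can tolerate certain transformations.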
In recent years there have been many proposals for different ways to construct acoustic fingerprints [2]–[6]. For an early review of some alternatives see [7]. Some of them are not robust enough to severe audio transformations, degrade in performance when encoding content other than music, or are expensive to compute or store. Three of the most cited audio fingerprints are probably the Shazam fingerprint presented in [2], the system proposed by Philips in [3] and the Waveprint, proposed by Google [4].

The Shazam fingerprint [2] encodes the relationship between two spectral maxima, one of which is called an anchor point. Because each fingerprint encodes more than one maximum, it is prone to errors when either of the maxima is missing. For this reason, in order to make the system robust, several tuple combinations within the target area of each selected anchor point need to be stored, creating an overhead in the data stored for a given audio. In addition, the data inside the fingerprint is encoded in three different blocks (20 bits for the frequency locations of the two peaks and 12 bits for their time difference). If the comparison between fingerprints is to tolerate some error, the blocks must first be converted from binary form to the corresponding natural numbers, and these values then subtracted to find how far the spectral maxima are from each other.

2012 IEEE International Conference on Multimedia and Expo 978-0-7695-4711-4/12 $26.00 © 2012 IEEE DOI 10.1109/ICME.2012.137 455

Given