AN AUDIO FINGERPRINTING SYSTEM FOR LIVE VERSION IDENTIFICATION USING IMAGE PROCESSING TECHNIQUES

Zafar Rafii
Northwestern University
Evanston, IL, USA
zafarrafii@u.northwestern.edu

Bob Coover, Jinyu Han
Gracenote, Inc.
Emeryville, CA, USA
{bcoover,jhan}@gracenote.com

ABSTRACT

Suppose that you are at a music festival checking out an artist, and you would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings from the artist. Services such as Shazam or SoundHound will not work here, as this is not the typical framework for audio fingerprinting or query-by-humming systems: a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor is it a hummed or sung melody. We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough Transform.

Index Terms— Adaptive thresholding, audio fingerprinting, Constant Q Transform, cover identification

1. INTRODUCTION

Audio fingerprinting systems typically aim at identifying an audio recording given a sample of it (e.g., the title of a song), by comparing the sample against a database for a match. Such systems generally first transform the audio signal into a compact representation (e.g., a binary image) so that the comparison can be performed efficiently (e.g., via hash functions) [1]. In [2], the sign of energy differences along time and frequency is computed in log-spaced bands selected from the spectrogram. In [3], a two-level principal component analysis is computed from the spectrogram.
In [4], pairs of time-frequency peaks are chosen from the spectrogram. In [5], the sign of wavelets computed from the spectrogram is used.

Audio fingerprinting systems are designed to be robust to audio degradations (e.g., encoding, equalization, noise, etc.) [1]. Some systems are also designed to handle pitch or tempo deviations [6, 7, 8]. Yet, all those systems aim at identifying the same rendition of a song, and will consider cover versions (e.g., a live performance) to be different songs. For a review of audio fingerprinting, the reader is referred to [1].

Cover identification systems precisely aim at identifying a song given an alternate rendition of it (e.g., live, remaster, remix, etc.). A cover version essentially retains the same melody, but differs from the original song in other musical aspects (e.g., instrumentation, key, tempo, etc.) [9].

In [10], beat tracking and chroma features are used to deal with variations in tempo and instrumentation, and cross-correlation is used between all key transpositions. In [11], chord sequences are extracted using chroma vectors, and a sequence alignment algorithm based on Dynamic Programming (DP) is used. In [12], chroma vectors are concatenated into high-dimensional vectors and nearest neighbor search is used. In [13], an enhanced chroma feature is computed, and a sequence alignment algorithm based on DP is used.

Cover identification systems are designed to capture the melodic similarity while being robust to the other musical aspects [9]. Some systems also propose to use short queries [14, 15, 16] or hash functions [12, 17, 18], in a formalism similar to audio fingerprinting. Yet, all those systems aim at identifying a cover song given a full and/or clean recording, and will not apply in the case of short and noisy excerpts, such as those that can be recorded from a smartphone at a concert.

This work was done while the first author was an intern at Gracenote, Inc.
For a review of cover identification, and more generally of audio matching, the reader is referred to [9] and [19].

We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. The system is specifically intended for applications where a smartphone user is attending a live performance by a known artist and would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). As computer vision has been shown to be practical for music identification [20, 5], image processing techniques are used to derive novel fingerprints that are robust to both audio degradations and audio variations, while still compact enough for efficient matching.

In Section 2, we describe our system. In Section 3, we evaluate our system using live queries against a database of studio references. In Section 4, we conclude this article.
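To make the pipeline outlined above concrete, the following is a minimal sketch of its three stages: a log-frequency spectrogram, adaptive thresholding into a binary fingerprint, and comparison via the Hamming similarity. This is not the paper's implementation: the Constant Q Transform is approximated here by pooling an ordinary STFT into log-spaced bands, the threshold is a local median, and all parameter values (band count, window sizes, neighborhood) are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def log_spectrogram(x, sr, n_fft=2048, hop=512, n_bands=64, fmin=55.0):
    """Magnitude spectrogram pooled into log-spaced frequency bands
    (a crude stand-in for a Constant Q Transform)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    S = np.abs(np.fft.rfft(frames, axis=1)).T              # (freq, time)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Log-spaced band edges from fmin up to the Nyquist frequency.
    edges = fmin * 2.0 ** np.linspace(0.0, np.log2((sr / 2.0) / fmin),
                                      n_bands + 1)
    L = np.zeros((n_bands, S.shape[1]))
    for b in range(n_bands):
        idx = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if idx.any():
            L[b] = S[idx].max(axis=0)                      # peak per band
    return L

def binarize(L, size=(5, 9)):
    """Adaptive thresholding: a cell is 1 if it exceeds the median of its
    local time-frequency neighborhood, yielding a compact binary image."""
    return L > median_filter(L, size=size)

def hamming_similarity(A, B):
    """Fraction of agreeing cells between two equally-sized binary images."""
    return float(np.mean(A == B))

# Toy usage: fingerprint a two-tone test signal and compare it to itself.
sr = 8000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
fp = binarize(log_spectrogram(x, sr))
print(fp.shape)                       # (n_bands, n_frames)
print(hamming_similarity(fp, fp))     # 1.0
```

In the actual system, the Hough Transform is additionally used during template matching to locate the best alignment between query and reference fingerprints; a sliding Hamming comparison as above would only handle a fixed time offset.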