FAST SIMILARITY SEARCH ON VIDEO SIGNATURES

Sen-ching S. Cheung*
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
P.O. Box 808, L-561, Livermore, CA 94551
sccheung@llnl.gov

Avideh Zakhor
Department of EECS
University of California
Berkeley, CA 94720
avz@eecs.berkeley.edu

ABSTRACT

Video signatures are compact representations of video sequences designed for efficient similarity measurement. In this paper, we propose a feature extraction technique to support fast similarity search on large databases of video signatures. Our proposed technique transforms the high-dimensional video signatures into low-dimensional vectors where similarity search can be performed efficiently. We exploit both the upper and lower bounds of the triangle inequality in approximating the high-dimensional metric, and combine this approximation with classical PCA to achieve the target dimension. Experimental results on a large set of web video sequences show that our technique outperforms Fastmap, Haar wavelet, PCA, and Triangle-Inequality Pruning.

1. INTRODUCTION

Thanks to the widespread availability of broadband connections and the decreasing cost of disk storage, it is now commonplace to publish, broadcast, or stream video sequences over the Internet. As video content becomes more popular on the web, there is a growing need to develop tools for analyzing, searching, and organizing visually similar video sequences. In the development of such tools, we face two major algorithmic challenges: how to efficiently measure the similarity between two video sequences, and how to identify video sequences similar to a given query out of possibly millions of entries on the web. In [1], we introduce a class of techniques called ViSig for efficient video similarity measurement. The ViSig method summarizes a video sequence into a compact video signature, consisting of a small number of representative feature vectors from the video.
Compared to other summarization techniques, video signatures are simple to compute, robust against temporal re-ordering, and capable of identifying similar video sequences regardless of their length.

In this paper, we consider the problem of searching for signatures similar to a user-defined query in a very large database. The naive approach of sequential search is typically too slow to handle large databases. Faster-than-sequential solutions have been extensively studied by the database community. Elaborate data structures, collectively known as Spatial Access Methods (SAM), have been proposed to facilitate similarity search [2, 3]. Most of these methods, however, do not scale well to high-dimensional metric spaces [4]. One strategy to mitigate this problem is to design a feature extraction mapping from the original metric space to a low-dimensional space where a SAM structure can be applied efficiently. The approach of combining feature extraction with SAM is called GEneric Multimedia INdexIng (GEMINI) [3].

* This work was supported by NSF grant ANI-9905799, AFOSR contract F49620-00-1-0327, and ARO contract DAAD19-00-1-0352. This work was done while S.-C. Cheung was with the University of California at Berkeley. Part of this work was also performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

In this paper, we propose a novel feature extraction mapping to be used in GEMINI for fast similarity search on signature data. The most commonly used feature extraction technique is Principal Component Analysis (PCA), which is optimal in approximating Euclidean distance [5]. If the underlying metric is not Euclidean, PCA is no longer optimal and more general schemes must be used. One such technique is Fastmap, a heuristic algorithm that approximates a general metric by Euclidean distance [6].
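As an illustration of the GEMINI idea (not the paper's proposed mapping), the following sketch uses PCA, computed via an SVD, to project high-dimensional database vectors down to a user-chosen target dimension where Euclidean search is cheap; all data sizes and values here are synthetic placeholders.

```python
import numpy as np

# Illustrative only: PCA as a GEMINI-style feature extraction that maps
# high-dimensional vectors into a low-dimensional Euclidean space.
rng = np.random.default_rng(1)
data = rng.random((200, 64))     # 200 database vectors, 64-dim (placeholder)

mean = data.mean(axis=0)
centered = data - mean
# Rows of vt are the principal directions of the centered data matrix.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 8                            # user-defined target dimension
proj = vt[:k].T                  # 64 x 8 orthonormal projection matrix

low = centered @ proj            # reduced vectors, ready for a SAM index
```

Euclidean distances among the `low` vectors approximate those among the originals, which is exactly why PCA is the standard choice when the underlying metric is Euclidean but suboptimal otherwise.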
Another class of techniques constructs mappings based on distances between the high-dimensional vectors and a set of random vectors [7, 8, 9, 10]. Such "random mappings" have been shown to possess certain favorable theoretical properties [7, 8]. These mappings, however, are very complex, and effectively require computing all pairwise distances between entries in the database. A more practical version has been proposed in [9] for protein matching. An even simpler version, called Triangle-Inequality Pruning (TIP), has been proposed for similarity search on image databases [10]. TIP exploits the lower bound of the triangle inequality in approximating the high-dimensional metric. Our proposed technique improves upon TIP by taking into account both the upper and lower bounds offered by the triangle inequality. In addition, it takes advantage of classical PCA to achieve any user-defined target dimension.

This paper is organized as follows: in Section 2, we briefly review the ViSig method and the GEMINI approach. The proposed feature extraction mapping and its performance evaluation on a large database of signatures are presented in Section 3.

2. REVIEW OF VISIG AND GEMINI

We begin with a brief overview of the ViSig method [1]. We assume that each video is represented by a set of high-dimensional feature vectors, X, from a metric space (F, d(·, ·))^1. The metric function d(·, ·) measures the visual dissimilarity between two feature vectors. In this paper, we use four concatenated 178-bin HSV color histograms as our feature vector, each representing a quadrant of a video frame, and the l1 distance as the metric between two histograms.
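The triangle-inequality bounds underlying TIP and our technique can be sketched as follows. For any reference vector k, the metric axioms give |d(q, k) − d(x, k)| ≤ d(q, x) ≤ d(q, k) + d(x, k); with several reference vectors, the tightest lower and upper bounds can be combined. This is a minimal illustration with synthetic vectors; the function names are ours, not the paper's.

```python
import numpy as np

def l1(a, b):
    # l1 (Manhattan) distance, the metric used between histograms above
    return float(np.abs(a - b).sum())

def triangle_bounds(d_q_refs, d_x_refs):
    """Bounds on d(q, x) from precomputed distances to shared references:
    max_j |d(q,k_j) - d(x,k_j)|  <=  d(q,x)  <=  min_j (d(q,k_j) + d(x,k_j))."""
    lower = float(np.max(np.abs(d_q_refs - d_x_refs)))
    upper = float(np.min(d_q_refs + d_x_refs))
    return lower, upper

rng = np.random.default_rng(0)
refs = rng.random((5, 16))               # 5 reference vectors (synthetic)
q, x = rng.random(16), rng.random(16)
d_q = np.array([l1(q, r) for r in refs]) # precomputed for the query
d_x = np.array([l1(x, r) for r in refs]) # precomputed for each database item
lo, hi = triangle_bounds(d_q, d_x)       # TIP prunes with lo alone
```

TIP discards a candidate x whenever the lower bound already exceeds the search radius; using both bounds gives a tighter approximation of d(q, x) without ever touching the original high-dimensional vectors.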
In order to reduce the complexity of comparing two video sequences, the ViSig method summarizes each video X in the database into a signature X_S, which consists of the feature vectors in X that are closest to a set of seed vectors S = {s_1, s_2, ..., s_m}:

    X_S = (g_X(s_1), g_X(s_2), ..., g_X(s_m)),  where  g_X(s) = argmin_{x ∈ X} d(x, s).   (1)

The central idea behind the ViSig method is that if two video clips share a large fraction of similar feature vectors, their signature vectors with respect to the same seed vectors are likely to be similar as well. The seed vectors are feature vectors randomly sampled

^1 In the remainder of this paper, we refer to a video and its feature vectors interchangeably.
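The signature extraction of Eq. (1) can be sketched directly: for each seed, keep the feature vector of the video that is closest to it under the l1 metric. The sketch below uses the paper's 4 × 178 = 712-bin histogram dimension, but the frame and seed data are random placeholders.

```python
import numpy as np

def l1(a, b):
    # l1 distance between two concatenated histograms
    return float(np.abs(a - b).sum())

def visig(X, seeds):
    """X_S = (g_X(s_1), ..., g_X(s_m)) with g_X(s) = argmin_{x in X} d(x, s)."""
    return np.array([min(X, key=lambda x: l1(x, s)) for s in seeds])

rng = np.random.default_rng(0)
frames = rng.random((50, 712))   # 50 frame feature vectors (placeholder)
seeds = rng.random((3, 712))     # m = 3 seed vectors (placeholder)
sig = visig(frames, seeds)       # one representative frame per seed
```

Note that the signature length is m regardless of the video's length, which is what makes comparing two signatures so much cheaper than comparing two full videos.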