Binary SIFT: Fast Image Retrieval Using Binary Quantized SIFT Features

Kadir A. Peker
Meliksah University, Computer Engineering Department
Talas, Kayseri, Turkey
kpeker@meliksah.edu.tr

Abstract

SIFT features are widely used in content-based image retrieval. Typically, a few thousand keypoints are extracted from each image. Image matching involves distance computations across all pairs of SIFT feature vectors from both images, which is quite costly. We show that SIFT features perform surprisingly well even after each component is quantized to binary, when the component medians are used as the quantization thresholds. The quantized features preserve both their distinctiveness and their matching properties. Almost all of the features in our 5.4 million feature test set map to distinct binary patterns after quantization. Furthermore, the number of matches between images using the original and the binary quantized SIFT features is quite similar. We investigate the distribution of SIFT features and observe that the space of 128-D binary vectors has sufficient capacity for the current performance of SIFT features. We use component median values as quantization thresholds and show, through vector-to-vector distance comparisons and image-to-image matches, that the resulting binary vectors perform comparably to the original SIFT vectors. We also discuss the computational and storage gains. Binary vector distance computation reduces to bitwise operations, and the square operation is eliminated. Fast and efficient indexing techniques, such as the signatures used for chemical databases, can also be considered.

1. Introduction

Matching images of objects and places against other images of the same or similar objects and places has long been a key problem in computer vision and pattern recognition. One successful approach to this task is to detect 'salient' or 'key' points in images and then describe them by a set of numerical descriptors [1][2].
Among these salient-point approaches, SIFT (scale-invariant feature transform) features have been found to be among the most successful [3]. In the SIFT approach, the salient points of an image are found as the extrema of a multi-resolution image computed using a difference-of-Gaussian function [4]; the key points can thus be detected at different scales. Each key point is described by a 128-dimensional vector, which is essentially a histogram of gradient directions for an image patch around the detected key point. The dominant gradient direction(s) are selected as the reference direction, thereby providing rotation invariance. Each of the 128 components takes integer values between 0 and 255. SIFT features are widely used in many applications, from stereo to object detection, and have been found to be robust against scale and orientation changes, and quite discriminative even in large databases of features [3][5][6].

When searching for an image in a database, the key points of the query image are compared to each key point of each target image. Usually, a few thousand key points are detected per image, so comparing two images involves a number of vector distance computations on the order of the square of the number of key points, which is quite costly. A number of methods have been suggested to speed up the matching process. One is suggested in the original paper that describes SIFT features [4]. Grauman et al. propose pyramid match, an approximate but fast matching method between sets of features [5]. In [6], all the features from all images in a database are clustered and a reduced set of representative vectors is selected (a "visual vocabulary"), providing a more scalable approach.

In this work, we show that SIFT feature vectors perform quite well even after each component of the vector is quantized to binary. We use the median value of each component as the quantization threshold for that component.
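The scheme described above can be illustrated with a short NumPy sketch. This is not the authors' implementation, only a minimal illustration of the two ideas: per-component median thresholding, and distance computation that reduces to an XOR followed by a bit count instead of squared differences. The function names are our own.

```python
import numpy as np

def binarize_sift(descriptors, thresholds=None):
    """Quantize each of the 128 SIFT components to one bit.

    Each component is compared against that component's median over the
    whole feature set, the thresholding choice described in the text.
    descriptors: (n, 128) array of SIFT vectors with values in [0, 255].
    Returns (bits, thresholds), where bits is an (n, 128) array of 0/1.
    """
    descriptors = np.asarray(descriptors)
    if thresholds is None:
        thresholds = np.median(descriptors, axis=0)
    bits = (descriptors > thresholds).astype(np.uint8)
    return bits, thresholds

def hamming_distance(a, b):
    """Distance between two binarized vectors via bitwise operations.

    Packing the 128 bits into 16 bytes lets the distance be computed as
    XOR plus a population count; no squaring or floating point is needed.
    """
    pa, pb = np.packbits(a), np.packbits(b)
    return int(np.unpackbits(pa ^ pb).sum())
```

A packed 128-bit vector occupies 16 bytes, versus 128 bytes (or 512 as floats) for the original descriptor, which is the storage gain mentioned in the abstract.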
Almost all of the features in our 5.4 million feature database (more than 99.86%) map to distinct binary patterns.

978-1-61284-433-6/11/$26.00 ©2011 IEEE 217 CBMI'2011
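A distinctness figure like the one above can be measured by packing each 128-bit pattern into 16 bytes and counting how many patterns occur exactly once. The sketch below is our own illustration of that check, not the paper's code.

```python
import numpy as np

def distinct_fraction(bits):
    """Fraction of binarized descriptors whose 128-bit pattern is unique.

    bits: (n, 128) array of 0/1 values, one row per binarized descriptor.
    Packs each row into 16 bytes and counts rows that appear exactly once.
    """
    packed = np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)  # (n, 16)
    _, counts = np.unique(packed, axis=0, return_counts=True)
    n_unique = int(counts[counts == 1].sum())
    return n_unique / len(packed)
```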