Binary SIFT: Fast Image Retrieval Using Binary Quantized SIFT Features
Kadir A. Peker
Meliksah University
Computer Engineering Department
Talas, Kayseri, Turkey
kpeker@meliksah.edu.tr
Abstract
SIFT features are widely used in content-based
image retrieval. Typically, a few thousand keypoints
are extracted from each image. Image matching
involves distance computations across all pairs of
SIFT feature vectors from both images, which is quite
costly. We show that SIFT features perform
surprisingly well even after quantizing each component
to binary, when the medians are used as the
quantization thresholds. Quantized features preserve
both distinctiveness and matching properties. Almost
all of the features in our 5.4 million feature test set
map to distinct binary patterns after quantization.
Furthermore, the numbers of matches between images
obtained with the original and with the binary quantized
SIFT features are quite similar. We investigate the
distribution of SIFT features and observe that the
space of 128-D binary vectors has sufficient capacity
for the current performance of SIFT features. We use
component median values as quantization thresholds
and show through vector-to-vector distance
comparisons and image-to-image matches that the
resulting binary vectors perform comparably to the
original SIFT vectors. We also discuss computational
and storage gains. Binary vector distance computation
reduces to bit-wise operations, eliminating the squaring
operation. Fast and efficient indexing techniques such
as the signatures used for chemical databases can also
be considered.
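The bit-wise distance computation mentioned above can be illustrated with a minimal sketch (the packing of the 128 bits into integers and the function name are our assumptions; the paper only states that the distance reduces to bit-wise operations):

```python
# Minimal sketch: Hamming distance between two binary SIFT signatures
# packed into Python integers (up to 128 bits each). XOR marks the
# differing bits; counting the set bits gives the distance, with no
# squaring or square root needed.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Toy 8-bit example:
print(hamming(0b10110011, 0b10011010))  # prints 3
```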
1. Introduction
Matching images of objects and places against other
images of the same or similar objects and places has been
a key problem in computer vision and pattern
recognition. One successful approach to this task is to
detect 'salient' or 'key' points in images and then
describe them by a set of numerical
descriptors [1][2]. Among these 'salient point'
approaches, SIFT (scale-invariant feature transform)
features have been found to be among the most successful [3].
In the SIFT approach, the salient points of an image
are found as the extrema of a multi-resolution
representation of the image computed using a difference
of Gaussian function [4]. Thus, key points can be
detected at different scales.
Each key point is described by a 128-dimensional
vector, which is essentially a histogram of gradient
directions for an image patch around the detected key
point. The dominant gradient direction(s) are selected
as the reference direction, hence providing rotation
invariance. Each of the 128 components takes integer
values between 0 and 255.
SIFT features are widely used in many applications,
from stereo matching to object detection, and have been
found to be robust against scale and orientation changes
and quite discriminative even in large databases of features
[3,5,6].
When searching for an image in a database, the key
points of the query image are compared to the key points
of each target image. Usually, a few thousand key points
are detected per image. Comparing two images thus
involves a number of vector distance computations on
the order of the square of the number of key points per
image, which is quite costly. A number of methods
have been suggested to speed up the matching process.
One method is suggested in the original paper that
describes SIFT features [4]. Grauman et al. propose the
pyramid match, an approximate but fast method for
matching sets of features [5]. In [6], all the features
from all the images in a database are clustered and
a reduced set of representative vectors is selected
(“visual vocabulary”), thus providing a more scalable
approach.
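As a concrete illustration of the quadratic matching cost described above, a brute-force matcher can be sketched as follows (the function names and the ratio-test threshold are illustrative; the nearest-neighbor ratio test follows the approach of [4]):

```python
import math

# Illustrative brute-force matcher (names and threshold are ours):
# every query descriptor is compared to every target descriptor,
# so the cost is len(query) * len(target) distance computations.
def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_count(query, target, ratio=0.8):
    """Count query key points whose nearest target descriptor is
    clearly closer than the second nearest (a ratio test as in [4])."""
    matches = 0
    for q in query:
        d = sorted(l2(q, t) for t in target)
        if len(d) >= 2 and d[0] < ratio * d[1]:
            matches += 1
    return matches
```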
In this work, we show that SIFT feature vectors
perform quite well even after each component of the
vector is quantized to binary. We use the median value
of each component as the quantization threshold for
that component. Almost all of the features in our 5.4
million feature database (more than 99.86%) map to
distinct binary patterns after quantization.
978-1-61284-433-6/11/$26.00 ©2011 IEEE 217 CBMI’2011
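A minimal sketch of the median-threshold binarization described above (the helper names and the packing of bits into a single integer are our assumptions; the paper specifies only that each component is thresholded at its median):

```python
# Sketch of median-threshold binary quantization: each component is
# thresholded at that component's median over the feature database,
# and the resulting bits are packed into one integer per feature.
def component_medians(features):
    """Per-component median over a list of equal-length vectors
    (128 components for SIFT)."""
    n = len(features)
    dim = len(features[0])
    return [sorted(f[i] for f in features)[n // 2] for i in range(dim)]

def binarize(feature, medians):
    """Quantize one vector to a binary pattern (as a Python int)."""
    bits = 0
    for x, m in zip(feature, medians):
        bits = (bits << 1) | (x > m)
    return bits
```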