Vis Comput (2013) 29:491–499
DOI 10.1007/s00371-013-0813-5
ORIGINAL ARTICLE
Spatial consistency of dense features within interest regions
for efficient landmark recognition
Priyadarshi Bhattacharya · Marina L. Gavrilova
Published online: 25 April 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract Recently, feature grouping has been proposed as a
method for improving retrieval results for logos and web im-
ages. This relies on the idea that a group of features match-
ing over a local region in an image is more discriminative
than a single feature match. In this paper, we extend this
concept further and apply it to the more challenging task of
landmark recognition. We propose a novel combination of
dense sampling of SIFT features with interest regions which
represent the more salient parts of the image in greater de-
tail. In place of conventional dense sampling used in cat-
egory recognition that computes features on a regular grid
at a number of fixed scales, we allow the sampling density
and scale to vary based on the scale of the interest region.
We develop new techniques for exploring stronger geomet-
ric constraints inside the feature groups and computing the
match score. The spatial information is stored efficiently in
an inverted index structure. The proposed approach consid-
ers part-based matching of interest regions instead of match-
ing entire images using a histogram under bag-of-words.
This helps reduce the influence of background clutter and
works better under occlusion. Experiments reveal that di-
recting more attention to the salient regions of the image
and applying the proposed geometric constraints help vastly
improve recognition rates for reasonable vocabulary sizes.
Keywords Computer vision · Landmark recognition ·
Dense features · Feature grouping · Inverted index · Spatial
information
P. Bhattacharya · M.L. Gavrilova
Department of Computer Science, University of Calgary,
2500 University Drive, NW, Calgary, AB, Canada
e-mail: bhattacp@ucalgary.ca
M.L. Gavrilova
e-mail: mgavrilo@ucalgary.ca
1 Introduction
The problem we consider in this paper is the retrieval of
landmark images similar to a query image from a large, un-
ordered collection of images. This is a very challenging task
because of the large variations in scale, viewpoint and illumination
conditions between two images of the same landmark.
State-of-the-art landmark or scene retrieval techniques
[11, 13, 18] use a bag-of-words approach that constructs
a visual vocabulary by quantizing features (usually SIFT
[9]) from the image database and encodes each image as
a normalized histogram of the frequency of occurrence of
visual words. The query images are encoded likewise and
then matched to database images using L1 or L2 distance
between the histograms. The bag-of-words approach is, however,
orderless and discards any spatial information about
features. Also, any noise or background clutter becomes a
part of the image representation. In order to reduce the num-
ber of false matches between unrelated images and improve
the recognition accuracy, spatial information is typically introduced
at a later stage in the recognition pipeline by re-ranking
the retrieved images using RANSAC-based [4] geometric
verification [3, 11]. However, for efficiency reasons,
geometric verification can be performed on only
a relatively small part of the ranked list of images. As a result,
the recognition accuracy depends substantially on the
quality of the initial ranked list.
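The bag-of-words encoding and histogram matching described above can be
illustrated with a minimal sketch. It is not the pipeline of the cited works:
the function names, the hand-picked three-word vocabulary and the 2-D toy
descriptors are all assumptions made for illustration (real SIFT descriptors
are 128-dimensional, and the vocabulary would normally be obtained by
clustering database features, a step omitted here).

```python
import math
from collections import Counter


def quantize(descriptor, vocabulary):
    """Return the index of the nearest visual word (Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda w: math.dist(descriptor, vocabulary[w]))


def bow_histogram(descriptors, vocabulary):
    """Encode an image as an L1-normalized histogram of visual-word counts."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    return [counts.get(w, 0) / total if total else 0.0
            for w in range(len(vocabulary))]


def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank_database(query_hist, database_hists):
    """Return database image names sorted by increasing L1 distance."""
    return sorted(database_hists,
                  key=lambda name: l1_distance(query_hist, database_hists[name]))


# Hypothetical toy data: three visual words and 2-D stand-in descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
database = {
    "tower": bow_histogram(
        [(0.1, 0.0), (0.9, 0.1), (1.0, 0.0), (0.0, 0.1)], vocabulary),
    "bridge": bow_histogram(
        [(0.0, 0.9), (0.1, 1.0), (0.0, 1.1), (0.1, 0.1)], vocabulary),
}
query = bow_histogram([(0.0, 0.1), (1.1, 0.0), (0.9, 0.0)], vocabulary)
ranking = rank_database(query, database)
```

Note that the histogram keeps only word frequencies: shuffling the descriptors
of an image, or moving them to different positions, leaves its encoding
unchanged, which is the orderless property discussed above.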
In order to improve the recognition accuracy, various ap-
proaches such as using fine quantization, soft quantization
and Hamming embedding have been proposed. But these
do not address the aforementioned weaknesses of bag-of-
words. Injecting spatial information into the inverted index
structure has been proposed to avoid the costlier geometric
verification at a later stage. But these techniques require fine
quantization, resulting in extremely large vocabularies, to obtain
reasonable recognition results. Fine quantization helps