Vis Comput (2013) 29:491–499
DOI 10.1007/s00371-013-0813-5

ORIGINAL ARTICLE

Spatial consistency of dense features within interest regions for efficient landmark recognition

Priyadarshi Bhattacharya · Marina L. Gavrilova

Published online: 25 April 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract Recently, feature grouping has been proposed as a method for improving retrieval results for logos and web images. This relies on the idea that a group of features matching over a local region in an image is more discriminative than a single feature match. In this paper, we evolve this concept further and apply it to the more challenging task of landmark recognition. We propose a novel combination of dense sampling of SIFT features with interest regions, which represent the more salient parts of the image in greater detail. In place of the conventional dense sampling used in category recognition, which computes features on a regular grid at a number of fixed scales, we allow the sampling density and scale to vary based on the scale of the interest region. We develop new techniques for exploring stronger geometric constraints inside the feature groups and computing the match score. The spatial information is stored efficiently in an inverted index structure. The proposed approach considers part-based matching of interest regions instead of matching entire images using a histogram under bag-of-words. This helps reduce the influence of background clutter and works better under occlusion. Experiments reveal that directing more attention to the salient regions of the image and applying the proposed geometric constraints helps in vastly improving recognition rates for reasonable vocabulary sizes.

Keywords Computer vision · Landmark recognition · Dense features · Feature grouping · Inverted index · Spatial information

P. Bhattacharya (✉) · M.L. Gavrilova
Department of Computer Science, University of Calgary, 2500 University Drive, NW, Calgary, AB, Canada
e-mail: bhattacp@ucalgary.ca

M.L. Gavrilova
e-mail: mgavrilo@ucalgary.ca

1 Introduction

The problem we consider in this paper is the retrieval of landmark images similar to a query image from a large, unordered collection of images. This is a very challenging task because of the large variations in scale, viewpoint and illumination conditions between two images of the same landmark. State-of-the-art landmark or scene retrieval techniques [11, 13, 18] use a bag-of-words approach that constructs a visual vocabulary by quantizing features (usually SIFT [9]) from the image database and encodes each image as a normalized histogram of the frequency of occurrence of visual words. The query images are encoded likewise and then matched to database images using the L1 or L2 distance between the histograms. The bag-of-words approach is, however, orderless and discards any spatial information about features. Also, any noise or background clutter becomes a part of the image representation. In order to reduce the number of false matches between unrelated images and improve the recognition accuracy, spatial information is typically introduced at a later stage in the recognition pipeline by re-ranking the retrieved images using RANSAC [4] based geometric verification [3, 11]. But because of efficiency considerations, geometric verification can be performed on only a relatively small part of the ranked list of images. As a result, the recognition accuracy substantially depends on the quality of the initial ranked list.

In order to improve the recognition accuracy, various approaches such as using fine quantization, soft quantization and Hamming embedding have been proposed. But these do not address the aforementioned weaknesses of bag-of-words.
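The baseline bag-of-words pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (quantize, bow_histogram, l1_distance, rank) and the toy data are hypothetical, the descriptors are 2-D points standing in for 128-D SIFT vectors, and the vocabulary is given directly rather than learned by clustering database features.

```python
from collections import Counter
from math import dist

# Hypothetical toy vocabulary: three "visual words" in a 2-D descriptor space.
# A real system would cluster SIFT descriptors to obtain these centers.
VOCAB = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]


def quantize(descriptor, vocabulary):
    """Return the index of the nearest visual word (Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda i: dist(descriptor, vocabulary[i]))


def bow_histogram(descriptors, vocabulary):
    """Encode an image as an L1-normalized histogram of visual-word counts."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[i] / total for i in range(len(vocabulary))]


def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank(query_hist, database):
    """Return database image names sorted by increasing L1 distance."""
    return sorted(database,
                  key=lambda name: l1_distance(query_hist, database[name]))


# Toy database: each image is just a list of descriptors (positions are lost).
DATABASE = {
    "A": bow_histogram([(1, 0), (0, 1), (9, 1)], VOCAB),
    "B": bow_histogram([(0, 9), (1, 10), (10, 1)], VOCAB),
}
QUERY = bow_histogram([(0, 0.5), (1, 1), (10, 0)], VOCAB)

print(rank(QUERY, DATABASE))
```

Note that the histogram depends only on which visual words occur and how often, not on where the features lie in the image. This is the orderless property that the text identifies as a weakness of the representation.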
Injecting spatial information into the inverted index structure has been proposed to avoid the costlier geometric verification at a later stage. But these techniques require fine quantization resulting in extremely large vocabularies to obtain reasonable recognition results. Fine quantization helps