Recognizing Landmarks Using Automated Classification Techniques: an Evaluation of Various Visual Features

Giuseppe Amato, Fabrizio Falchi, and Paolo Bolettieri
ISTI-CNR
via G. Moruzzi, 1 - Pisa, Italy
name.surname@isti.cnr.it

Abstract—In this paper, the performance of several visual features is evaluated in automatically recognizing landmarks (monuments, statues, buildings, etc.) in pictures. A number of landmarks were selected for the test. Pictures from a test set were classified automatically, trying to guess which landmark they contained. We evaluated both global and local features. As expected, local features performed better, given their capability of being less affected by visual variations and given that landmarks are mainly static objects that generally also maintain static local features. Among the local features, SIFT outperformed SURF and ColorSIFT.

Keywords—Image indexing, image classification, recognition, landmarks.

I. INTRODUCTION

The amount of pictures taken by individuals has exploded during the last decade due to the wide adoption of digital photography in the consumer market. However, many of these pictures remain unannotated and are stored with anonymous names on personal computers. Currently, there are no tools or effective technologies to help users search pictures by content when they are not explicitly annotated. Therefore, it is becoming more and more difficult for users to retrieve even their own pictures.

A picture contains a lot of implicit conceptual information that, if we understand how to automatically infer and use it, can open up opportunities for new advanced applications. For instance, in addition to automatically creating annotations and descriptions, pictures could also be used as queries on the web.
Given that smartphones equipped with cameras are becoming very popular nowadays, we can imagine that people, for instance tourists, can search for information on the web by simply pointing the camera of their smartphone at some subject (a monument, a restaurant, a painting). Consider in this respect the experimental service "Google Goggles" [1], recently launched by Google, which allows one to obtain information about a monument through a smartphone using this paradigm. Note that, even if many smartphones and cameras are equipped with a GPS and a compass, the geo-reference obtained with them is not enough to infer what the user is actually aiming at. Content analysis of the picture is still needed to determine more precisely the user query or the annotation to be associated with a picture.

In this respect, many researchers have been investigating the use of classification techniques, such as Support Vector Machines [2], k-Nearest Neighbor (k-NN) classifiers [3], boosting [4], etc., with visual information, with the purpose of automatically recognizing visual content.

Content-based retrieval and content-based classification techniques typically are not directly applied to image content. Rather, matching and comparisons between low-level mathematical descriptions of the images' visual appearance, in terms of color histograms, textures, shapes, points of interest, etc., are used. Different visual features represent different visual aspects of an image. Taken together, different visual features contribute, not exhaustively, to representing the complete information contained in an image. A single feature is generally able to carry just a limited amount of this information. Therefore, its performance varies depending on the specific dataset used and the type of conceptual information one wants to recognize.
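As a concrete illustration of this paradigm, the following sketch applies a single-label k-NN classifier to low-level feature vectors standing in for global descriptors such as color histograms. The toy vectors, landmark labels, and the choice of Euclidean distance are illustrative assumptions, not the actual features or dataset used in this paper:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance between two low-level feature vectors (e.g., color histograms).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training_set, k=3):
    """Assign the query image the majority label among its k nearest
    training images, where training_set is a list of
    (feature_vector, landmark_label) pairs."""
    neighbors = sorted(training_set, key=lambda item: euclidean(query, item[0]))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Toy 4-bin "histograms" for two hypothetical landmark classes.
training = [
    ([0.9, 0.1, 0.0, 0.0], "tower"),
    ([0.8, 0.2, 0.0, 0.0], "tower"),
    ([0.1, 0.1, 0.4, 0.4], "statue"),
    ([0.0, 0.2, 0.5, 0.3], "statue"),
]

print(knn_classify([0.85, 0.15, 0.0, 0.0], training))  # → tower
```

The same scheme carries over to local features: instead of a single global vector per image, each image yields many local descriptors, and the distance function is replaced by a matching score between descriptor sets.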
The goal of this paper is to identify the visual features, or combination of visual features, that provide the best performance on the above-mentioned task. In this respect, as better described in the remainder of the paper, we identified 12 landmarks, and we manually built the training sets for them by identifying a congruous number of pictures representing them. A classification algorithm was tested with these landmarks, using various visual features. We measured the ability of the classification algorithm to correctly recognize the landmark in a test set, varying the visual features used.

The rest of the paper is organized as follows. We briefly discuss related work next. In Section III we present the features used in the experiments, while in Section IV we describe the experimental environment. Finally, we present and discuss the results in Section V.

II. RELATED WORK

In [5], the MPEG-7 Visual Descriptors have been compared in terms of effectiveness for general-purpose Content-Based Image Retrieval (CBIR). The results are interesting