Bag of spatio-visual words for context inference in scene classification

A. Bolovinou a,b,*, I. Pratikakis c, S. Perantonis b

a Department of Informatics and Telecommunications, University of Athens, Greece
b Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", 153 10 Athens, Greece
c Democritus University of Thrace, Department of Electrical and Computer Engineering, GR-67100 Xanthi, Greece

Article history: Received 4 August 2010; received in revised form 30 July 2012; accepted 31 July 2012; available online 5 September 2012.

Keywords: Scene classification; Bag of spatio-visual words; Spatial co-occurrence; Contextual descriptors; Ensemble learning; High-dimensional feature clustering

Abstract

In the "bag of visual words" (BoVW) representation, each image is represented by an unordered set of visual words. In this paper, a novel approach that encodes ordered spatial configurations of visual words, in order to add context to the representation, is presented. The proposed method introduces a bag of spatio-visual words (BoSVW) representation obtained by clustering ensembles of visual-word correlograms. Specifically, the spherical K-means clustering algorithm is employed to account for the large dimensionality and the sparsity of the proposed spatio-visual descriptors. Experimental results on four standard datasets show that the proposed method significantly improves on a state-of-the-art BoVW model and compares favorably to existing context-based scene classification approaches.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Automatic semantic image annotation for the management and maintenance of visual data archives has become a goal of considerable value, in view of the evolving large-scale image collections being generated and stored by digital media worldwide [1].
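As background for the clustering step named in the abstract, the following is a minimal sketch of spherical K-means, which clusters unit-norm vectors by cosine similarity and so suits high-dimensional sparse descriptors. This is an illustrative implementation only, not the authors' code; the simple first-k initialization is an assumption (seeding strategies such as k-means++ are common in practice).

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20):
    """Spherical K-means: cluster directions by cosine similarity.

    A minimal sketch assuming the rows of X are descriptors; they are
    projected onto the unit L2 sphere before clustering.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[:k].copy()  # simple init; k-means++-style seeding is common
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # on the unit sphere, cosine similarity reduces to a dot product
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-project to the sphere
    return centroids, labels
```

Unlike standard K-means, the centroid update re-normalizes the mean direction, so centroids stay on the unit sphere and the assignment step needs only a matrix product.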
While research in object classification has achieved considerable levels of performance based solely on local appearance [28], classifying objects while taking into account the context of a scene remains an open issue towards improved scene understanding [9]. One of the main difficulties that computational recognition approaches face when including contextual information is the lack of simple representations of context and of efficient algorithms for extracting such information from the visual input.

In natural scene classification the objective is to classify images into pre-defined scene categories. Despite large variations in their content, images of natural scenes usually share a semantic setup per category: themes that co-exist in typical or abstract spatial configurations. For example, in the input image of Fig. 1b, which belongs to class "coast", we usually observe, in bottom-up order (spatial context), a dry land part, a sea part and a sky part (semantic context), captured from a long distance (scale context). This kind of underlying information, which may influence the way a scene and the objects within it are perceived, is vaguely understood as context [10]. It is noted from the beginning that this work focuses on the classification of the scene as a whole and not on the classification of objects within it.

Since the most effective scene classification approaches are vocabulary based, the proposed approach aims at inferring scene context from the co-occurrences of visual words. To that end, it explores how the notions of context introduced by Biederman et al. [11] (semantic context as in [12–14], spatial context as in [14], or scale context as in [15,16]) can be applied on top of a bag of visual words representation [4,17,18].
The standard bag of visual words model (referred to as BoVW in the sequel) assigns each local feature of an image to a visual label (namely, a visual word from the vocabulary) and then represents the visual labels' frequency of occurrence in the image, disregarding the locality and scale information of the visual labels. The proposed method couples the spatial layout with the local co-occurrence of visual labels in a spatio-visual descriptor and constructs a bag of spatio-visual words model (referred to as BoSVW in the sequel) based on this descriptor. A variant of the BoSVW model in which the descriptor adapts to the scale information of visual patches is also implemented.

While the spatial layout of visual words has been partially taken into account in the works of [5,19,20], modeling the co-location of visual words that usually co-exist within a spatial setting is much more difficult to achieve. This is due to the large number of possible co-occurrence combinations and the different spatial layouts of visual words in the image space. The latter objective becomes

* Corresponding author at: Department of Informatics and Telecommunications, University of Athens, Greece. Tel.: +30 210 6503141; fax: +30 210 653 2175.
E-mail addresses: abolov@iit.demokritos.gr, abolov@iccs.gr (A. Bolovinou); ipratika@ee.duth.gr (I. Pratikakis); sper@iit.demokritos.gr (S. Perantonis).
URLs: http://www.di.uoa.gr (A. Bolovinou); http://www.iit.demokritos.gr/cil (I. Pratikakis).
Pattern Recognition 46 (2013) 1039–1053. http://dx.doi.org/10.1016/j.patcog.2012.07.024
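The standard BoVW pipeline described in this section — hard-assigning each local descriptor to its nearest visual word and keeping only occurrence counts — can be sketched as follows. This is a generic illustration under the usual formulation, not the authors' implementation; the function and variable names are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors to their nearest visual word and count
    occurrences; position and scale information is discarded.

    `vocabulary` is assumed to be a (K, d) array of cluster centers
    learned beforehand (e.g. by K-means over training descriptors).
    """
    # squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)  # hard assignment to the nearest word
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()   # L1-normalize so images of any size compare
```

The returned histogram is exactly the "unordered set of visual words": two images with the same words in entirely different spatial arrangements yield identical vectors, which is the limitation the BoSVW descriptor targets.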