An Unsupervised Method for Clustering Images Based on Their Salient Regions of Interest

Gustavo B. Borba and Humberto R. Gamba
Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
Universidade Tecnológica Federal do Paraná
Curitiba - Paraná - Brasil
{gustavo, humberto}@cpgei.cefetpr.br

Oge Marques and Liam M. Mayron
Department of Computer Science and Engineering
Florida Atlantic University
Boca Raton, FL – USA
{omarques, lmayron}@fau.edu

ABSTRACT
We have developed a biologically-motivated, unsupervised way of grouping together images whose salient regions of interest (ROIs) are perceptually similar, regardless of the visual contents of other (less relevant) parts of the image. In the implemented model, cluster membership is assigned based on feature vectors extracted from salient ROIs. This paper focuses on the experimental evaluation of the proposed approach for several combinations of feature extraction techniques and unsupervised clustering algorithms. The results reported here show that this is a valid approach and encourage further research.

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis

General Terms
Algorithms, Human Factors, Performance.

Keywords
Visual Attention, Image Retrieval, Clustering.

1. INTRODUCTION
The dramatic growth in the number of digital images available for consumption and the popularity of inexpensive hardware and software for acquiring, storing, and distributing images have fostered considerable research activity in the field of content-based image retrieval (CBIR) [6]. In spite of the large number of related papers, prototypes, and several commercial solutions, the CBIR problem has not been satisfactorily solved.

Chen et al. [1] have shown that clustering and ranking of relevant results is a viable alternative to the usual approach of presenting the results in a ranked list format.
The results of their experiments motivated the cluster-based approach taken in our work.

We have developed a CBIR solution [4] in which results from two different computational models of visual attention (VA) are combined to extract ROIs in an unsupervised way. In [4] we present a complete evaluation of the ROI extraction algorithm as well as performance measures for the entire system. In this paper we focus on testing the proposed model for a combination of feature extraction and clustering algorithms. In doing so, we use classical clustering evaluation techniques (such as measures of purity and entropy) as indirect measures of success of the overall approach.

Copyright is held by the author/owner(s). MM’06, October 23–27, 2006, Santa Barbara, California, USA. ACM 1-59593-447-2/06/0010.

2. THE PROPOSED MODEL
This section presents an overview of the proposed model and explains its main components in detail.

2.1 Overview
We present a biologically-plausible model that extracts ROIs using saliency-based visual attention models. The extracted ROIs are then used for the image clustering process.

The visual attention models used are those proposed by Itti and Koch [3] and Stentiford [7]. The Itti-Koch model of visual attention considers the task of attentional selection from a purely bottom-up perspective, although recent efforts have been made to incorporate top-down impulses [3]. The model generates a map of the most salient points in an image, the saliency map. The model of visual attention proposed by Stentiford [7] is also a biologically inspired approach to CBIR tasks. It functions by suppressing areas of the image with patterns that are repeated elsewhere. As a result, flat surfaces and textures are suppressed while unique objects are given prominence. Regions are marked as high interest if they possess features not frequently present elsewhere in the image.
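The idea of scoring a region by how rarely its local pattern recurs elsewhere can be illustrated with a short sketch. The code below is not the implementation in [7]; it is a simplified illustration under stated assumptions: the image is grayscale, a fresh random neighbourhood ("fork") is drawn on every trial, the centre pixel is always part of the fork, and the function name `attention_map` and all of its parameters (`n_trials`, `fork_size`, `radius`, `threshold`) are hypothetical.

```python
import numpy as np

def attention_map(gray, n_trials=50, fork_size=3, radius=2, threshold=40, seed=0):
    """Simplified, Stentiford-style attention score for a 2-D grayscale image.

    A pixel scores one point each time its random local pattern fails to
    match the same pattern placed at a random location elsewhere.
    """
    rng = np.random.default_rng(seed)
    img = np.asarray(gray, dtype=np.int64)  # signed type avoids uint8 wrap-around
    h, w = img.shape
    scores = np.zeros((h, w), dtype=np.int64)
    # Only score pixels whose neighbourhood lies fully inside the image.
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            for _ in range(n_trials):
                # Random fork: a few (dy, dx) offsets around the pixel.
                offsets = rng.integers(-radius, radius + 1, size=(fork_size, 2))
                offsets[0] = 0  # assumption: always include the centre pixel
                # Random comparison location elsewhere in the image.
                cy = rng.integers(radius, h - radius)
                cx = rng.integers(radius, w - radius)
                here = img[y + offsets[:, 0], x + offsets[:, 1]]
                there = img[cy + offsets[:, 0], cx + offsets[:, 1]]
                # Mismatch: the local pattern is not repeated at that location.
                if np.any(np.abs(here - there) > threshold):
                    scores[y, x] += 1
    return scores
```

On a flat image containing one small bright object, pixels on the object mismatch almost every comparison and accumulate high scores, while background pixels match almost everywhere and stay near zero. The nested Python loops are written for readability and are only practical for small images.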
The result of the Stentiford model is a visual attention map that is similar in function to the saliency map generated by the Itti-Koch model.

Our model adheres to several key principles. It is biologically-inspired: the Itti and Stentiford models are both biologically-inspired, while the biological plausibility of clustering the results is verified by Draper et al. [2]. Our model is unsupervised and content-based: it is able to function without the intervention of a user, producing clusters of related images at its output. We limit our model to incorporating only bottom-up knowledge. Finally, our model is modular. While we rely on the Itti-Koch model of visual attention, our model allows for a variety of other models of visual attention to be used in its place. Similarly, the choice of feature extraction techniques and descriptors as well as clustering algorithms can also be varied. This allows a good degree of flexibility and fine-tuning (or customization) based