Applying the Information Bottleneck Principle to Unsupervised Clustering of Discrete and Continuous Image Representations

Shiri Gordon
Faculty of Engineering, Tel-Aviv University, Israel
gordonha@post.tau.ac.il

Hayit Greenspan
Faculty of Engineering, Tel-Aviv University, Israel
hayit@eng.tau.ac.il

Jacob Goldberger
CUTe Systems Ltd., Tel-Aviv, Israel
jacob@cute.co.il

Abstract

In this paper we present a method for unsupervised clustering of image databases. The method is based on a recently introduced information-theoretic principle, the information bottleneck (IB) principle. Image archives are clustered such that the mutual information between the clusters and the image content is maximally preserved. The IB principle is applied to both discrete and continuous image representations, using discrete image histograms and probabilistic continuous image modeling based on a mixture of Gaussian densities, respectively. Experimental results demonstrate the performance of the proposed method for image clustering on a large image database. Several clustering algorithms derived from the IB principle are explored and compared.

1. Introduction

Image clustering and categorization is a means for high-level description of image content. The goal is to find a mapping of the archive images into classes (clusters) such that the set of classes provides essentially the same prediction, or information, about the image archive as the entire image collection. The generated classes provide a concise summarization and visualization of the image content.

Image archive clustering is important for efficient handling (search and retrieval) of large image databases [8, 3, 1]. In the retrieval process, the query image is first compared with all the cluster centers. The subset of clusters with the largest similarity to the query image is chosen, after which the query image is compared with all the images within this subset of clusters.
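The two-stage retrieval just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method evaluated in this paper: images are represented as normalized histograms, each cluster center is taken to be the mean histogram of its members, the L1 distance stands in for the similarity measure, and the names `two_stage_search`, `n_probe`, and `top_k`, as well as the toy data, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Histogram = Sequence[float]


def l1_distance(p: Histogram, q: Histogram) -> float:
    """L1 distance between two histograms (an assumed stand-in measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cluster_center(members: List[Histogram]) -> List[float]:
    """Mean histogram of a cluster (an assumed definition of 'center')."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def two_stage_search(
    query: Histogram,
    clusters: Dict[str, List[Tuple[str, Histogram]]],
    n_probe: int = 1,
    top_k: int = 3,
) -> List[str]:
    """Return the ids of the top_k closest images, searching only the
    n_probe clusters whose centers are closest to the query."""
    # Stage 1: compare the query with every cluster center.
    centers = {
        cid: cluster_center([hist for _, hist in items])
        for cid, items in clusters.items()
    }
    nearest = sorted(centers, key=lambda cid: l1_distance(query, centers[cid]))
    nearest = nearest[:n_probe]

    # Stage 2: compare the query only with images in the chosen clusters.
    candidates = [
        (l1_distance(query, hist), image_id)
        for cid in nearest
        for image_id, hist in clusters[cid]
    ]
    candidates.sort()
    return [image_id for _, image_id in candidates[:top_k]]


# Toy archive: two clusters of three-bin histograms (illustrative data only).
clusters = {
    "cluster_0": [("a", [0.8, 0.1, 0.1]), ("b", [0.7, 0.2, 0.1])],
    "cluster_1": [("c", [0.1, 0.8, 0.1]), ("d", [0.2, 0.7, 0.1])],
}
query = [0.78, 0.12, 0.10]
result = two_stage_search(query, clusters, n_probe=1, top_k=3)
print(result)
```

With `n_probe=1` the query is compared with two cluster centers and then with the two images of the selected cluster, so images `c` and `d` are never examined.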
Search efficiency is improved because the query image is not compared exhaustively with all the images in the database.

Image clustering may be performed using discrete image representations (e.g. histograms) [8, 3] as well as continuous image representations (e.g. probabilistic continuous image modeling based on a mixture of Gaussian densities) [7]. In recent work comparing various image representation schemes, image modeling based on a mixture of Gaussian densities was shown to outperform discrete image representations (such as the well-known color histograms, color correlograms, and more) [15]. In the current work we demonstrate unsupervised clustering in both the discrete and the continuous image representation domains.

The clustering method presented in this work is based on the information bottleneck (IB) principle [14, 12, 10] (an earlier version was introduced in [6]). Characteristics of the proposed method include:
1) Image models are clustered rather than raw image pixels (image models may be discrete or continuous);
2) The IB method provides a simultaneous construction of both the clusters and the distance measure between them;
3) A natural termination of the bottom-up clustering process can be determined as part of the IB principle. This provides an automated means for finding the relevant number of clusters per archive;
4) The continuous agglomerative version of the IB clustering scheme is extended to include relaxation steps for better clustering results.

The continuous probabilistic image modeling scheme is presented in section 2. The information bottleneck method, along with clustering algorithms derived from the IB principle, is presented in section 3. The method's application to discrete image representation is shown. In section 4 we extend the information bottleneck method to the case of continuous densities. Section 5 presents results of the proposed clustering method.

2. Grouping pixels into GMMs

In the first layer of the grouping process the raw pixel representation of an input image is shifted to a mid-level representation. The image representation may be discrete (e.g. histograms) or continuous. Histograms are well known in the literature and have been used extensively [13]. In

Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003) 2-Volume Set 0-7695-1950-4/03 $17.00 © 2003 IEEE