SIGT: SYNTHETIC IMAGE GENERATION TOOL FOR CLUSTERING ALGORITHMS

Ayed Salman 1, Mahamed G Omran 2, Andries P Engelbrecht 3

1 Department of Computer Engineering, Kuwait University, Kuwait, KUWAIT
Phone: +965-4811188-5833
Fax: +965-4817451
Email: ayed@eng.kuniv.edu.kw

2,3 Department of Computer Science, University of Pretoria, Pretoria, SOUTH AFRICA
Email: mjomran@engineer.com, engel@driesie.cs.up.ac.za

Abstract

A new automatic image generation tool is proposed in this paper, tailored specifically for the verification and comparison of different image clustering algorithms. The tool can produce different images (in raw format) according to criteria specified by the user. The user specifies the number of clusters to be included in the image along with the probability distribution that governs the set of points belonging to the different clusters. The tool can also be used to verify how closely the result of a new algorithm approximates the original image. This allows a scientifically sound comparison between any new algorithm and existing algorithms. The tool's usefulness is demonstrated in this paper with reference to the well-known K-means clustering algorithm and a Particle Swarm Optimization (PSO)-based clustering algorithm recently proposed by the authors.

Keywords: Image classification, synthetic image generator, clustering verification, K-means, benchmarks.

1. Introduction

Image clustering [1-6] is the process of identifying groups of similar image primitives [7]. These image primitives can be pixels, regions, line elements and so on, depending on the problem encountered. Many basic image processing techniques such as quantization, segmentation and coarsening can be viewed as different instances of the clustering problem [7].

There are two approaches to image classification: supervised and unsupervised. In the supervised approach, the number and the numerical characteristics (e.g.
mean and variance) of the classes in the image are known in advance (by the analyst) and are used in the training step, which is followed by the classification step. There are several popular supervised algorithms, such as the minimum-distance-to-mean, parallelepiped and Gaussian maximum likelihood classifiers [8]. In the unsupervised approach, the classes are unknown and clustering starts by partitioning the image data into groups (or clusters) according to a similarity measure; the resulting groups can then be compared with reference data by the analyst [8]. Unsupervised classification is therefore usually referred to as clustering.

In general, the unsupervised approach has several advantages over the supervised approach [9]:

• In the unsupervised approach, there is no need for an analyst to specify in advance all the classes in the image data set. The algorithm automatically finds distinct classes, thereby dramatically reducing the work of the analyst.
• The characteristics of the objects being classified can vary with time. The unsupervised approach is an excellent way to monitor these changes.
• Some characteristics of objects may not be known in advance. The unsupervised approach will automatically flag these characteristics.
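
As an illustration of the unsupervised approach described above, the following minimal Python sketch draws grey-level pixels from one Gaussian per cluster, partitions them with a plain K-means using absolute intensity difference as the similarity measure, and compares the result with the known reference labels. It is not the SIGT implementation: the function names (generate_synthetic_pixels, kmeans_1d, agreement), the Gaussian noise model, the evenly spaced centroid initialisation and all parameter values are assumptions made only for this example.

```python
import random


def generate_synthetic_pixels(means, std, n_per_cluster, seed=0):
    """Draw grey-level pixels from one Gaussian per cluster (assumed model).

    Returns the pixel intensities and the true cluster label of each pixel.
    """
    rng = random.Random(seed)
    pixels, labels = [], []
    for k, mu in enumerate(means):
        for _ in range(n_per_cluster):
            # Clamp to the 8-bit intensity range [0, 255].
            value = min(255.0, max(0.0, rng.gauss(mu, std)))
            pixels.append(value)
            labels.append(k)
    return pixels, labels


def kmeans_1d(pixels, k, iterations=20):
    """Plain K-means on scalar intensities (similarity = absolute difference)."""
    lo, hi = min(pixels), max(pixels)
    # Deterministic initialisation: centroids spread evenly over the range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    assignment = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: abs(p - centroids[c])) for p in pixels
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(pixels, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment


def agreement(true_labels, predicted):
    """Fraction of pixels whose predicted label equals the reference label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted))
    return matches / len(true_labels)


if __name__ == "__main__":
    pixels, truth = generate_synthetic_pixels([40, 128, 215], std=8, n_per_cluster=200)
    centroids, predicted = kmeans_1d(pixels, k=3)
    print("centroids:", [round(c, 1) for c in centroids])
    print("agreement with reference labels:", agreement(truth, predicted))
```

The direct label comparison works here only because both the true cluster means and the initial centroids are ordered by increasing intensity; in general, cluster indices returned by an unsupervised algorithm are arbitrary and would first have to be matched to the reference classes.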