Novel Methods for Initial Seed Selection in K-Means Clustering for Image Compression

K. Somasundaram 1 and M. Mary Shanthi Rani 2
Department of Computer Science & Applications, Gandhigram Rural Institute, Deemed University, Tamil Nadu, India
e-mail: 1 ksomasundaram@hotmail.com, 2 shanthifrank@yahoo.com

Abstract: In this paper, we propose two methods to construct the initial codebook for K-means clustering based on covariance and spectral decomposition. Experimental results with standard images show that the proposed methods produce better-quality reconstructed images, measured in terms of Peak Signal to Noise Ratio (PSNR).

Keywords: Codebook, Variance, Spectral Decomposition, Eigenvalue

I. INTRODUCTION

Clustering is an important partitioning technique which organizes information in the form of clusters such that patterns within a cluster are more similar to each other than patterns belonging to different clusters [1]. Traditionally, clustering techniques are broadly divided into hierarchical and partitioning methods. Hierarchical algorithms build clusters incrementally, whereas partitioning algorithms determine all clusters at once. Partitioning methods generally result in a set of K clusters, with each object belonging to exactly one cluster. Each cluster may be represented by a centroid or a cluster representative, which is some form of summary description of all the objects contained in that cluster.

Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each object as a separate cluster and merge them successively into larger clusters until the termination criterion is met. Divisive algorithms begin with the whole set and recursively divide it into smaller clusters. The Pairwise Nearest Neighbor (PNN) method [2] belongs to the class of agglomerative clustering methods. It generates a hierarchical clustering through a sequence of merge operations until the desired number of clusters is obtained.
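As an illustration of the agglomerative merge process described above, the following NumPy sketch implements a naive PNN pass. It is an assumption-laden illustration, not the implementation evaluated in [2]: the function name `pnn` and the size-weighted merge cost (the increase in squared error when two clusters are combined) are our own choices here.

```python
import numpy as np

def pnn(points, k):
    """Naive Pairwise Nearest Neighbor clustering: start with every
    point as its own cluster, then repeatedly merge the pair of
    clusters with the smallest merge cost until k clusters remain."""
    # Each cluster is tracked by its centroid and its size.
    centroids = [p.astype(float) for p in points]
    sizes = [1] * len(points)
    # Cluster index of each original point.
    labels = list(range(len(points)))

    while len(centroids) > k:
        best, best_cost = None, np.inf
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                # Size-weighted merge cost: the increase in total
                # squared error caused by merging clusters i and j.
                w = sizes[i] * sizes[j] / (sizes[i] + sizes[j])
                cost = w * np.sum((centroids[i] - centroids[j]) ** 2)
                if cost < best_cost:
                    best_cost, best = cost, (i, j)
        i, j = best
        # Merge cluster j into cluster i (weighted mean of centroids).
        n = sizes[i] + sizes[j]
        centroids[i] = (sizes[i] * centroids[i] + sizes[j] * centroids[j]) / n
        sizes[i] = n
        # Relabel points from j to i; shift indices above j down by one.
        labels = [i if l == j else (l - 1 if l > j else l) for l in labels]
        del centroids[j], sizes[j]
    return np.array(centroids), labels
```

The quadratic search for the cheapest pair at every step is what makes the plain PNN slow, as noted below; fast variants avoid rescanning all pairs.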
The main drawback of the PNN method is its slowness: the time complexity of even the fastest implementation is lower-bounded by the number of data objects. K-means [3] is one of the most popular partitioning techniques and has a great number of applications in the fields of image and video compression [4], [5], image segmentation [6], pattern recognition, and data mining [7], [8].

The rest of the paper is organized as follows: Section II gives an overview of the K-means clustering technique, Section III briefly describes the proposed methods, Section IV presents the results and performance analysis of the proposed methods, and Section V concludes our work.

II. K-MEANS CLUSTERING

The K-means algorithm is a widely used VQ technique and was ranked as one of the top ten algorithms in data mining [9]. It is iterative in nature and generates a codebook from the training data using a distortion measure appropriate for the given application [3], [10]. It is simple and easy to implement, and its computation time depends mainly on the amount of training data, the codebook size, the vector dimension, and the distortion measure used for convergence. It clusters the given objects into K partitions based on their attributes. K-means comprises four steps: initialization, classification, centroid computation, and the convergence test.

There are two issues in creating a K-means clustering model:
1. Determining the optimal number of clusters to create
2. Determining the center of each cluster

Determining the number of clusters (K) is specific to the problem domain. The overall quality of a clustering is the average distance from each data point to its associated cluster center. Given the number of clusters K, the second part of the problem is determining where to place the center of each cluster. Often, points are scattered and do not fall into easily recognizable groups. The algorithm starts by partitioning the input points into K initial sets, either at random or using some heuristic.
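The clustering-quality measure mentioned above (the average distance from each data point to its assigned cluster center) can be written compactly. This NumPy sketch is illustrative only; the function name and the array layout (points and centers as rows) are our assumptions, not notation from the paper:

```python
import numpy as np

def average_distortion(points, centers, labels):
    """Average Euclidean distance from each point to the center of
    the cluster it is assigned to (lower means tighter clusters)."""
    # centers[labels] picks, for each point, its assigned center.
    dists = np.linalg.norm(points - centers[labels], axis=1)
    return dists.mean()
```

A smaller value indicates that the chosen K and center placements summarize the data more faithfully; it can be used to compare runs with different K or different initializations.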
It then calculates the centroid of each set and constructs a new partition by associating each point with the closest centroid. The centroids are then recalculated, and the algorithm alternates these two steps until convergence is reached, that is, until points no longer switch clusters or the overall squared error falls below the convergence threshold. Although the K-means procedure always terminates, it does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. The algorithm is also highly sensitive to the initially selected cluster centers. A simple approach to reduce this effect is to make multiple runs of the algorithm with different initial centers.
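The alternating procedure described above can be sketched as follows. This is a generic NumPy implementation of standard K-means with random seed selection, given for illustration; it is not the authors' proposed initialization method, and the function name, parameters, and tolerance are our assumptions:

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Standard K-means (Lloyd's algorithm): random initialization,
    then alternating assignment and centroid-update steps until the
    overall squared error stops improving."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct training vectors as seeds.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # Classification step: assign each point to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Computational step: recompute each centroid as the mean of
        # its cluster; keep the old center if a cluster becomes empty.
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
        # Convergence test on the overall squared error.
        error = np.sum((points - centers[labels]) ** 2)
        if prev_error - error < tol:
            break
        prev_error = error
    return centers, labels
```

Because the result depends on the randomly chosen seeds, one would call this with several different `seed` values and keep the run with the lowest final error, which is exactly the multiple-runs remedy mentioned above; the initialization methods proposed in this paper aim to remove that sensitivity instead.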