International Journal of Image Processing & Networking Techniques Vol. 5 No. 1 June 2014
0973 – 7650 © UPA 2014
Genetic k-means Clustering for
Software Quality Estimation
S. Suyambu Kesavan, K. Alagarsamy, S. Palanikumar
Abstract – Software quality estimation has long been a pressing problem for software developers and managers. In the current competitive business environment, the paucity of resources prohibits managers from devoting resources to all modules to ensure quality. There have been attempts to use fault data from previous system releases to construct fault-prediction models. Such models are then used to predict the fault-proneness of modules under development. Modules that are predicted to be fault-prone are allocated more resources and subjected to greater scrutiny and quality-assurance techniques. This paper proposes the use of genetic k-means clustering for software quality estimation.
I. INTRODUCTION
Clustering is the division of data into groups of similar
objects. Each group, called a cluster, consists of objects
that are similar to one another and dissimilar to
objects of other groups [1]. These clusters correspond to
hidden patterns, and the search for clusters is termed
“unsupervised learning”. One of the most popular
clustering algorithms is the k-means clustering
algorithm. Mertik et al. presented the use of an advanced
data-mining tool called Multimethod for
building a software fault-prediction model [5]. Azar et al.
gave a search-based software engineering approach to
improve the prediction accuracy of software quality
estimation models by adapting them to new, unseen
software products [6]. Prakriti and Rajeev presented a set
of software metrics that check the interconnection
between a software component and the application [7].
Naeem and Taghi presented a semi-supervised learning
scheme as a solution to software defect modeling when
there is limited prior knowledge of software quality [8].
Deepak et al. studied three object-oriented metrics
and gave a case study to show how these metrics are
useful in determining the quality of any software
designed using the object-oriented paradigm [9].
1.1 k-means Clustering
The k-means clustering algorithm follows a simple way to
classify a given data set through a certain number of
clusters fixed a priori. The algorithm starts by defining k
centroids, one for each cluster. A good choice is to place
the centroids as far apart from one another as possible.
The algorithm then takes each point in the data set and
associates it with the nearest centroid. When all points
have been assigned, the first iteration is complete and an
initial grouping has been formed. The algorithm then
recalculates k new centroids, after which a new binding
is made between the same set of data points and the new
centroids. The k centroids change step by step until no
further changes occur. The algorithm aims at minimizing
an objective function, namely a squared-error function.
The objective function

$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$,

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centres.
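The assignment and update steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the random initialization (the text suggests spreading the initial centroids as far apart as possible, which is simplified here) and the exact-equality convergence test are illustrative choices.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance, the distance measure used in the objective.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # Component-wise mean of a non-empty cluster of points.
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def objective(centroids, clusters):
    # Squared-error objective J = sum_j sum_i ||x_i^(j) - c_j||^2.
    return sum(dist2(p, centroids[j])
               for j, cluster in enumerate(clusters) for p in cluster)

def kmeans(points, k, max_iter=100):
    # Initialization: pick k distinct data points as starting centroids.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: bind each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # no more changes: converged
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the procedure converges in a few iterations and the objective drops to the small within-group scatter.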
1.2 Genetic k-means Clustering
Krishna and Murty proposed a novel hybrid genetic
algorithm (GKA) that finds a globally optimal partition of
given data into a specified number of clusters [2],
hybridizing the GA with the k-means algorithm.
The important aspects of the proposed GKA are listed
below:
• Coding – the partition W is encoded into a string $s_W$ by
considering a chromosome of length n and
allowing each allele to take values in {1, 2, …, K}.
Each allele represents a pattern, and the allele
value indicates the cluster number to which the
pattern belongs. This is called string-of-group-numbers
encoding.
• Initialization – as with most GAs, the initial
population is obtained by initializing each allele
in the population to a random number selected
from the set {1, 2, …, K}.
• Selection – a chromosome is selected from the
previous population according to the
distribution

$P(s_i) = \dfrac{F(s_i)}{\sum_{j=1}^{N} F(s_j)}$,

where $F(s_i)$ represents the fitness value of string $s_i$.
• Fitness function – in order to minimize S(W),
the total within-cluster variation, Krishna and
Murty resort to the σ-truncation mechanism.
They define $f(s_W) = -S(W)$ and $g(s_W) = f(s_W) - (\bar{F} - c\,\sigma)$,
where $\bar{F}$ denotes the average value and