International Journal of Image Processing & Networking Techniques Vol. 5 No. 1 June 2014
0973 – 7650 © UPA 2014
Genetic k-means Clustering for
Software Quality Estimation
S. Suyambu Kesavan, K. Alagarsamy, S. Palanikumar
Abstract – Software quality estimation has long been a pressing problem for software developers and managers. In the current competitive business environment, the paucity of resources prohibits managers from devoting resources to all modules to ensure quality. There have been attempts to use fault data from previous system releases to construct fault-prediction models. Such models are then used to predict the fault-proneness of modules under development. Modules that are predicted to be fault-prone are allocated more resources and subjected to greater scrutiny and quality-assurance techniques. This paper proposes the use of genetic k-means clustering for software quality estimation.
I. INTRODUCTION
Clustering is the division of data into groups of similar
objects. Each group, called a cluster, consists of objects
that are similar to one another and dissimilar to
objects of other groups [1]. These clusters correspond to
hidden patterns, and the search for clusters is termed
“unsupervised learning”. One of the most popular
clustering algorithms is the k-means clustering
algorithm. Mertik et al. presented the use of an advanced
data-mining tool called Multimethod for
building a software fault-prediction model [5]. Azar et al.
gave a search-based software engineering approach to
improve the prediction accuracy of software quality
estimation models by adapting them to new, unseen
software products [6]. Prakriti and Rajeev presented a set
of software metrics that check the interconnection
between a software component and the application [7].
Naeem and Taghi presented a semi-supervised learning
scheme as a solution to software defect modeling when
there is limited prior knowledge of software quality [8].
Deepak et al. studied three object-oriented metrics
and gave a case study to show how these metrics are
useful in determining the quality of any software
designed using the object-oriented paradigm [9].
1.1 k-means Clustering
The k-means clustering algorithm follows a simple way to
classify a given data set through a certain number of
clusters fixed a priori. The algorithm starts by defining k
centroids, one for each cluster. A good choice is to place
the centroids as far apart from one another as possible.
The algorithm then takes each point in the data set and
associates it with the nearest centroid. When all points
have been assigned, the first iteration is complete and an
initial grouping has been formed. The algorithm then
recalculates k new centroids, after which a new binding
is made between the same set of data points and the new
centroids. The k centroids change step by step until no
further changes occur. The algorithm aims at minimizing
an objective function, namely a squared-error function.
The objective function

$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$,

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centres.
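The assignment and update steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the random initialization (the text suggests spreading the initial centroids as far apart as possible, which is simplified here) and the exact-equality convergence test are illustrative choices.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance, the distance measure used in the objective.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # Component-wise mean of a non-empty cluster of points.
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def objective(centroids, clusters):
    # Squared-error objective J = sum_j sum_i ||x_i^(j) - c_j||^2.
    return sum(dist2(p, centroids[j])
               for j, cluster in enumerate(clusters) for p in cluster)

def kmeans(points, k, max_iter=100):
    # Initialization: pick k distinct data points as starting centroids.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: bind each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # no more changes: converged
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points, the procedure converges in a few iterations and the objective drops to the small within-group scatter.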
1.2 Genetic k-means Clustering
Krishna and Murty proposed a novel hybrid genetic
algorithm (GKA) that finds a globally optimal partition of
given data into a specified number of clusters [2],
hybridizing the GA with the k-means algorithm.
The important aspects of the proposed GKA are listed
below:
• Coding – the partition W is encoded into a string $s_W$ by
considering a chromosome of length n and
allowing each allele to take values in {1, 2, …, K}.
Each allele represents a pattern, and the allele
value indicates the cluster number to which the
pattern belongs. This is called string-of-group-numbers
encoding.
• Initialization – as with most GAs, the initial
population is obtained by initializing each allele
in the population to a random number selected
from the set {1, 2, …, K}.
• Selection – a chromosome is selected from the
previous population according to the
distribution

$P(s_i) = \dfrac{F(s_i)}{\sum_{j=1}^{N} F(s_j)}$,

where $F(s_i)$ represents the fitness value of string $s_i$.
• Fitness function – in order to minimize S(W),
the total within-cluster variation, Krishna and
Murty resort to the σ-truncation mechanism.
They define $f(s_W) = -S(W)$ and $g(s_W) = f(s_W) - (\bar{F} - c\,\sigma)$,
where $\bar{F}$ denotes the average value and