Modified global k-means algorithm for clustering in gene expression data sets

Adil M. Bagirov, Karim Mardaneh

Centre for Informatics and Applied Optimization, School of Information Technology and Mathematical Sciences, University of Ballarat, Victoria, 3353, Australia
Email: a.bagirov@ballarat.edu.au

Abstract

Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed. However, due to the large number of genes, only a few algorithms can be applied for the clustering of samples. The k-means algorithm and its different variations are among those algorithms. But these algorithms in general can converge only to local minima, and these local minima are significantly different from global solutions as the number of clusters increases. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm and its performance on large data sets. One of them is the global k-means algorithm. In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. We present preliminary computational results using gene expression data sets which demonstrate that the modified global k-means algorithm improves, sometimes significantly, on the results obtained by the k-means and global k-means algorithms.

1 Introduction

This paper develops an incremental algorithm for solving sum-of-squares clustering problems in gene expression data sets. Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed (see, for example, (Medvedovic & Sivaganesan 2002, Yeung et al. 2001, Yeung et al. 2003)). However, due to the large number of genes, only a few algorithms can be applied for the clustering of samples (Bagirov et al. 2003).
As the number of clusters increases, the number of variables in the clustering problem increases drastically and most clustering algorithms become inefficient for solving such problems. The k-means algorithm and its different variations are among those algorithms which are still applicable to clustering of samples in gene expression data sets. But k-means algorithms in general can converge only to local minima, and these local minima may be significantly different from global solutions as the number of clusters increases. Recently the global k-means algorithm has been proposed to improve the global search properties of k-means algorithms (Likas et al. 2003). In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets.

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at The 2006 Workshop on Intelligent Systems for Bioinformatics (WISB2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 73. Mikael Bodén and Timothy L. Bailey, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included.

Cluster analysis deals with the problem of organizing a collection of patterns into clusters based on similarity. It is also known as the unsupervised classification of patterns and has found many applications in different areas. In cluster analysis we assume that we have been given a finite set of points $A$ in the $n$-dimensional space $\mathbb{R}^n$, that is, $A = \{a^1, \ldots, a^m\}$, where $a^i \in \mathbb{R}^n$, $i = 1, \ldots, m$. There are different types of clustering.
In this paper we consider the hard unconstrained partition clustering problem, that is, the distribution of the points of the set $A$ into a given number $k$ of disjoint subsets $A^j$, $j = 1, \ldots, k$, with respect to predefined criteria such that:

1) $A^j \neq \emptyset$, $j = 1, \ldots, k$;
2) $A^j \cap A^l = \emptyset$, $j, l = 1, \ldots, k$, $j \neq l$;
3) $A = \bigcup_{j=1}^{k} A^j$;
4) no constraints are imposed on the clusters $A^j$, $j = 1, \ldots, k$.

The sets $A^j$, $j = 1, \ldots, k$, are called clusters. We assume that each cluster $A^j$ can be identified by its center (or centroid) $x^j \in \mathbb{R}^n$, $j = 1, \ldots, k$. Then the clustering problem can be reduced to the following optimization problem (see (Bock 1998, Spath 1980)):

$$\text{minimize} \quad \psi(x, w) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \|x^j - a^i\|^2 \tag{1}$$

subject to

$$x = (x^1, \ldots, x^k) \in \mathbb{R}^{n \times k}, \tag{2}$$

$$\sum_{j=1}^{k} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{3}$$

and

$$w_{ij} = 0 \text{ or } 1, \quad i = 1, \ldots, m, \; j = 1, \ldots, k, \tag{4}$$

where $w_{ij}$ is the association weight of pattern $a^i$ with cluster $j$, given by

$$w_{ij} = \begin{cases} 1 & \text{if pattern } a^i \text{ is allocated to cluster } j, \\ 0 & \text{otherwise} \end{cases}$$
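To make the objective (1) and the constraints (3)-(4) concrete, the following minimal Python sketch evaluates ψ(x, w) for a small made-up data set. The function names (`squared_distance`, `nearest_center_weights`, `psi`) and the sample points are illustrative assumptions, not part of the paper. Allocating each pattern to its nearest center is one natural choice of w for fixed centers; it is used here only to produce a feasible weight matrix.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2 between two points."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def nearest_center_weights(points, centers):
    """Build a 0/1 weight matrix w with w[i][j] = 1 iff center j is nearest to point i.

    Assumption for illustration: ties go to the lowest-index center.
    """
    weights = []
    for a in points:
        dists = [squared_distance(x, a) for x in centers]
        nearest = dists.index(min(dists))
        weights.append([1 if j == nearest else 0 for j in range(len(centers))])
    return weights


def psi(points, centers, weights):
    """Evaluate the objective (1): (1/m) * sum_i sum_j w_ij * ||x^j - a^i||^2."""
    for row in weights:
        # Constraints (3) and (4): entries are 0 or 1 and each row sums to 1.
        if any(w not in (0, 1) for w in row) or sum(row) != 1:
            raise ValueError("each pattern must belong to exactly one cluster")
    total = sum(
        w * squared_distance(x, a)
        for a, row in zip(points, weights)
        for x, w in zip(centers, row)
    )
    return total / len(points)


# Hypothetical toy data: four points in R^2 and two candidate centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

weights = nearest_center_weights(points, centers)
value = psi(points, centers, weights)
print(weights)  # [[1, 0], [1, 0], [0, 1], [0, 1]]
print(value)    # 1.0
```

In this toy case every point lies at distance 1 from its assigned center, so the objective equals 1.0; swapping the assignments raises it to 101.0, which illustrates how strongly ψ depends on the allocation w.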