Modified global k-means algorithm for clustering in gene expression data sets

Adil M. Bagirov, Karim Mardaneh

Centre for Informatics and Applied Optimization, School of Information Technology and Mathematical Sciences, University of Ballarat, Victoria, 3353, Australia
Email: a.bagirov@ballarat.edu.au

Abstract

Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed. However, due to the large number of genes, only a few algorithms can be applied to the clustering of samples. The k-means algorithm and its variations are among those algorithms. But these algorithms in general can converge only to local minima, and these local minima are significantly different from global solutions as the number of clusters increases. Over the last several years, different approaches have been proposed to improve the global search properties of the k-means algorithm and its performance on large data sets. One of them is the global k-means algorithm. In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. We present preliminary computational results using gene expression data sets which demonstrate that the modified global k-means algorithm improves, sometimes significantly, on the results obtained by the k-means and global k-means algorithms.

1 Introduction

This paper develops an incremental algorithm for solving sum-of-squares clustering problems in gene expression data sets. Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed (see, for example, Medvedovic & Sivaganesan 2002, Yeung et al. 2001, Yeung et al. 2003). However, due to the large number of genes, only a few algorithms can be applied to the clustering of samples (Bagirov et al. 2003).
As the number of clusters increases, the number of variables in the clustering problem increases drastically, and most clustering algorithms become inefficient for solving such problems. The k-means algorithm and its variations are among those algorithms which are still applicable to the clustering of samples in gene expression data sets. But k-means algorithms in general can converge only to local minima, and these local minima may be significantly different from global solutions as the number of clusters increases. Recently the global k-means algorithm has been proposed to improve the global search properties of k-means algorithms (Likas et al. 2003). In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets.

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at The 2006 Workshop on Intelligent Systems for Bioinformatics (WISB2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 73. Mikael Bodén and Timothy L. Bailey, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included.

Cluster analysis deals with the problem of organizing a collection of patterns into clusters based on similarity. It is also known as the unsupervised classification of patterns and has found many applications in different areas. In cluster analysis we assume that we have been given a finite set of points $A$ in the $n$-dimensional space $\mathbb{R}^n$, that is, $A = \{a^1, \ldots, a^m\}$, where $a^i \in \mathbb{R}^n$, $i = 1, \ldots, m$. There are different types of clustering.
In this paper we consider the hard unconstrained partition clustering problem, that is, the distribution of the points of the set $A$ into a given number $k$ of disjoint subsets $A^j$, $j = 1, \ldots, k$, with respect to predefined criteria such that:

1) $A^j \neq \emptyset$, $j = 1, \ldots, k$;
2) $A^j \cap A^l = \emptyset$, $j, l = 1, \ldots, k$, $j \neq l$;
3) $A = \bigcup_{j=1}^{k} A^j$;
4) no constraints are imposed on the clusters $A^j$, $j = 1, \ldots, k$.

The sets $A^j$, $j = 1, \ldots, k$, are called clusters. We assume that each cluster $A^j$ can be identified by its center (or centroid) $x^j \in \mathbb{R}^n$, $j = 1, \ldots, k$. Then the clustering problem can be reduced to the following optimization problem (see Bock 1998, Spath 1980):

$$\text{minimize} \quad \psi(x, w) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \, \|x^j - a^i\|^2 \tag{1}$$

subject to

$$x = (x^1, \ldots, x^k) \in \mathbb{R}^{n \times k}, \tag{2}$$

$$\sum_{j=1}^{k} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{3}$$

and

$$w_{ij} = 0 \text{ or } 1, \quad i = 1, \ldots, m, \; j = 1, \ldots, k, \tag{4}$$

where $w_{ij}$ is the association weight of pattern $a^i$ with cluster $j$, given by

$$w_{ij} = \begin{cases} 1 & \text{if pattern } a^i \text{ is allocated to cluster } j, \\ 0 & \text{otherwise} \end{cases}$$
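To make the formulation (1)-(4) concrete, the following minimal sketch evaluates the objective $\psi(x, w)$ for a small data set. It is an illustration only, not the algorithm developed in this paper. The function names `assign_nearest` and `psi`, the toy data, and the choice of NumPy are assumptions introduced here. Centers are stored as rows of a $k \times n$ array for convenience, whereas the text writes $x \in \mathbb{R}^{n \times k}$.

```python
import numpy as np

def squared_distances(points, centers):
    """Return the (m, k) array of squared Euclidean distances ||x^j - a^i||^2."""
    diff = points[:, None, :] - centers[None, :, :]   # shape (m, k, n)
    return (diff ** 2).sum(axis=2)

def assign_nearest(points, centers):
    """Build a 0/1 weight matrix w satisfying constraints (3) and (4)
    by allocating each point to its nearest center (hypothetical helper)."""
    d2 = squared_distances(points, centers)
    w = np.zeros_like(d2)
    w[np.arange(len(points)), d2.argmin(axis=1)] = 1.0
    return w

def psi(points, centers, w):
    """Objective (1): weighted sum of squared distances, averaged over m points."""
    return (w * squared_distances(points, centers)).sum() / len(points)

# Toy data, chosen only for illustration: m = 4 points in R^2, k = 2 centers.
A = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
x = np.array([[0.0, 1.0], [10.0, 1.0]])

w = assign_nearest(A, x)
value = psi(A, x, w)
print(value)
```

For fixed centers, allocating each point to its nearest center gives the smallest value of (1) among all weight matrices satisfying (3) and (4), which is why the sketch builds $w$ that way. Any other feasible allocation, such as swapping the two columns of `w`, yields a larger objective value on this data.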