International Journal of Computer Science and its Applications

Microarray Gene Expression Data Clustering using PSO based K-means Algorithm

Lopamudra Dey
Department of Computer Science and Engineering, University of Kalyani
Kalyani-741235, Nadia, West Bengal, India
Email: lopamudra.dey1@gmail.com

Anirban Mukhopadhyay
Department of Computer Science and Engineering, University of Kalyani
Kalyani-741235, Nadia, West Bengal, India
Email: anirban@klyuniv.ac.in

Abstract–This paper describes the clustering analysis of microarray gene expression data. A microarray data set consists of the expression levels of a large number of genes measured under multiple conditions. Microarray technology has made it possible to monitor the expression levels of thousands of genes concurrently across a collection of related samples. One of the most important tasks in microarray data analysis is clustering. Cluster analysis refers to partitioning a given data set into groups based on specified features, so that the data points within a group are more similar to each other than to the points in different groups. Many conventional clustering algorithms, such as K-means, FCM and hierarchical techniques, are used for gene expression data clustering. However, PSO-based K-means gives better accuracy than these existing algorithms. In this paper, a Particle Swarm Optimization (PSO)-based K-means clustering algorithm is proposed for clustering microarray gene expression data.

Keywords–Clustering, K-means, PSO, Microarray Gene Expression data.

I. INTRODUCTION

The DNA microarray is a way to measure the expression levels of thousands of genes at the same time in a cell mixture [3]. Microarray data can be viewed as an n × (m+1) matrix in which each column represents a gene and each row represents an experimental condition (a sample, a time point, etc.), as shown in Figure 1.

Fig 1. The gene expression data matrix has m columns of genes and n rows of samples. The last column is the class label, i.e.
information about which sample belongs to which cluster.

The original gene expression matrix obtained from the scanning process contains noise, missing values and systematic variations arising from the experimental procedure, so preprocessing steps such as missing value estimation and data normalization are required. Genes are expressed when they are transcribed into RNA (for example, mRNA). The set of genes is the same in nearly all cells of the body. One frequent use of microarray technology is to determine which genes are activated and which genes are repressed when two populations of cells are compared at a given point of time in the life of the organism [10]. Total RNA can be isolated from cells or tissues under different experimental conditions, and the relative amounts of transcribed RNA can be measured. A typical microarray experiment contains 10^2 to 10^4 genes, and the number of samples involved is generally less than 100.

One of the characteristics of gene expression data is that it is meaningful to cluster both genes and samples. In gene-based clustering, the genes are treated as the objects while the samples are the features. In sample-based clustering, the samples act as the objects and the genes are treated as the features. The distinction between gene-based and sample-based clustering reflects the different characteristics of clustering tasks for gene expression data.

It is now understood that only a small subset of genes takes part in any given cellular process. In this paper, the standard deviation of each gene across all the samples is calculated first; then a small set of genes with high standard deviation is taken as input to the different clustering algorithms.

A microarray is a tool for analyzing gene expression that consists of a small membrane containing samples of thousands of genes arranged in a regular pattern.
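The gene-filtering step described in this section (compute the standard deviation of each gene across all samples, then keep the genes with the highest values) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `select_high_std_genes`, the parameter `top_k`, the toy data and the use of the population standard deviation are all assumptions, since the text does not specify how many genes are kept or which estimator is used.

```python
import numpy as np

def select_high_std_genes(data, top_k):
    """Keep the top_k genes with the highest standard deviation.

    data  : n x (m+1) array; rows are samples, the first m columns are
            genes, and the last column is the class label (as in Fig 1).
    top_k : number of genes to keep (hypothetical parameter).
    """
    expression = data[:, :-1]          # n x m expression values
    labels = data[:, -1]               # class label column
    # Standard deviation of each gene across all samples.
    # Assumption: population standard deviation (NumPy default, ddof=0).
    gene_std = expression.std(axis=0)
    # Indices of the top_k most variable genes, restored to column order.
    selected = np.sort(np.argsort(gene_std)[::-1][:top_k])
    return selected, expression[:, selected], labels

# Toy matrix (made-up values): 4 samples, 5 genes, last column = class label.
toy = np.array([
    [1.0, 5.0, 2.0,  0.0, 3.0, 0],
    [1.1, 9.0, 2.0,  4.0, 3.1, 0],
    [0.9, 1.0, 2.0,  8.0, 2.9, 1],
    [1.0, 5.0, 2.0, 12.0, 3.0, 1],
])

selected, filtered, labels = select_high_std_genes(toy, top_k=2)
print("selected gene columns:", selected)
print("filtered matrix shape:", filtered.shape)
```

In this toy example the second and fourth genes vary most across the samples, so they are the ones passed on to the clustering step; the nearly constant genes are discarded.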
Microarrays may be used in a wide variety of fields, including biotechnology, agriculture, food, cosmetics and computers. This technology can simultaneously monitor the expression levels of thousands of genes, which makes it possible to study the relationships between genes and their functions and to classify genes or samples. Changes in experimental conditions, environmental changes, drugs, diseases, etc. can change the expression levels. Gene expression profiling can therefore help to distinguish a disease state from a healthy state, to identify drugs, and to assess the effect of changes in environmental conditions. Some work has been done on the performance of K-means, PSO and hybrid PSO clustering approaches on different data sets [1][2]. The Euclidean distance measure and