International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume: 2, Issue: 2, 398 – 402
IJRITCC | February 2014, Available @ http://www.ijritcc.org

A Novel Ant based Clustering of Gene Expression Data using MapReduce Framework

Bhavani R, Assistant Professor, Department of CSE, Government College of Technology, Coimbatore, India. bhavanirajasekar@gmail.com
Dr. G. Sudha Sadasivam, Professor, Department of CSE, PSG College of Technology, Coimbatore, India. sudhasadhasivam@yahoo.com

Abstract- Genes which exhibit similar expression patterns are often functionally related. Microarray technology provides a unique tool to examine how a cell's gene expression pattern changes under various conditions. Analyzing and interpreting gene expression data is a challenging task, and clustering is one of the most popular methods for extracting useful patterns from it. In this paper, a multi-colony ant-based clustering approach is proposed. The procedure is divided into two parts. The first is the construction of a minimum spanning tree from the gene expression data using a MapReduce version of ant colony optimization. The second is clustering, which is done by cutting the costlier edges from the minimum spanning tree, followed by a one-step k-means clustering procedure. Applied to gene expression data files of different sizes over different numbers of processors, the proposed approach exhibits good scalability and accuracy.

Keywords- Bioinformatics, Gene expression data, Multi colony ant system, Data mining, Clustering, MapReduce programming

I. INTRODUCTION

Data mining is the process of analyzing large datasets to find useful patterns.
Microarray technology is an experimental technique that can measure the expression levels of hundreds or thousands of genes simultaneously. Analysis of gene expression data involves many computational tools for searching for genes of interest, and for clustering and classification, in order to draw meaningful interpretations from a huge volume of data. This paper aims at clustering the genes in gene expression data. Such clustering helps in understanding gene functions and regulatory networks, and assists in diagnosing disease conditions and assessing the effects of medical treatment.

Clustering is an important data mining method that groups objects into clusters such that objects in the same cluster are similar and objects in different clusters are dissimilar. Similarity is measured with a distance function. Clustering is an unsupervised learning technique: the given dataset is analyzed and grouped into meaningful clusters without prior knowledge of the classes in the dataset [1]. Traditional clustering algorithms can be broadly classified into two categories, namely partitioning methods and hierarchical methods. K-means clustering is a partitioning method which partitions the given dataset into k clusters. It is simple and efficient, but its result depends critically on the choice of the parameter k.

Clustering based on metaheuristic algorithms is emerging as an alternative to more conventional clustering techniques. In this paper, an ant-based metaheuristic algorithm is proposed to perform gene expression data clustering. Ant colony optimization (ACO) is a metaheuristic based on the behaviour of ants seeking a path between their colony and a source of food. Solutions for a given problem are constructed by random walks of artificial ants on a so-called construction graph, which carries pheromone (weights) on its edges.
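The pheromone-guided random walk described above can be illustrated with a minimal Python sketch. It is not the MST construction proposed in this paper. It assumes the classic Ant System transition rule (probability proportional to pheromone^alpha times (1/distance)^beta) and a simple evaporate-and-deposit update. The function names `choose_next`, `ant_walk` and `update_pheromone`, and the parameters `alpha`, `beta`, `rho` and `q`, are hypothetical choices made for illustration only.

```python
import random

def choose_next(current, candidates, pheromone, distance,
                alpha=1.0, beta=2.0, rng=random):
    """Pick the next node with probability proportional to
    pheromone**alpha * (1/distance)**beta (assumed Ant System style rule)."""
    weights = [
        (pheromone[current][j] ** alpha) * ((1.0 / distance[current][j]) ** beta)
        for j in candidates
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]

def ant_walk(start, distance, pheromone, rng=random):
    """One artificial ant visits every node once; returns the edges it used."""
    n = len(distance)
    unvisited = [j for j in range(n) if j != start]
    current, edges = start, []
    while unvisited:
        nxt = choose_next(current, unvisited, pheromone, distance, rng=rng)
        edges.append((current, nxt))
        unvisited.remove(nxt)
        current = nxt
    return edges

def update_pheromone(pheromone, edges, distance, rho=0.1, q=1.0):
    """Evaporate all trails by a factor (1 - rho), then reinforce the
    edges the ant used in inverse proportion to the walk length."""
    n = len(pheromone)
    for i in range(n):
        for j in range(n):
            pheromone[i][j] *= (1.0 - rho)
    length = sum(distance[i][j] for i, j in edges)
    for i, j in edges:
        pheromone[i][j] += q / length
        pheromone[j][i] += q / length  # undirected graph: keep trails symmetric

if __name__ == "__main__":
    # Toy symmetric distance matrix between four items (off-diagonal > 0).
    distance = [[0, 1, 4, 5],
                [1, 0, 2, 6],
                [4, 2, 0, 3],
                [5, 6, 3, 0]]
    pheromone = [[1.0] * 4 for _ in range(4)]
    rng = random.Random(42)
    for _ in range(20):  # repeated walks gradually bias the trails
        walk = ant_walk(0, distance, pheromone, rng=rng)
        update_pheromone(pheromone, walk, distance)
    print(walk)
```

Over repeated walks, edges that appear in short walks accumulate more pheromone and are chosen more often, which is the positive feedback that ACO relies on. The sketch assumes all off-diagonal distances are strictly positive.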
ACO-based clustering resolves some of the problems of conventional clustering methods, such as clusters with arbitrary shapes and clusters with outliers.

Since the work involves processing large volumes of data, which is computationally intensive and time-consuming, a MapReduce model for clustering is proposed. The MapReduce programming model is typically used for distributed computing on clusters of computers. The model abstracts distributed computing into two steps. The Map step is applied to the input data and produces a list of intermediate results. The Reduce step is applied to the intermediate results and performs a merging operation to produce the output. Developers code the Map and Reduce functions and then submit the job to the MapReduce operating environment. Hadoop is an open-source implementation of the MapReduce computing model [2].

II. RELATED WORK

The related literature can be grouped under two categories, namely ant-based clustering methods and parallelization of the ACO algorithm. In [3], a sum of k-nearest-neighbor distances metric and a shrinking range strategy are combined with the ant colony optimization algorithm to resolve the problems of clusters with arbitrary shapes, clusters with outliers and bridges between outliers. ACO-based feature selection for image clustering is adopted in [4] and used in content-based image retrieval. In [5], the input to k-means clustering is preprocessed using ant-based self-organizing maps (SOM), which embed the exploitation and exploration rules of state transition into the conventional SOM algorithm. In [6], the next position of the particle in the PSO algorithm is found using ACO and is taken as the initial clusters of the k-means approach. ACO with different flavor (ACODF) uses simulated annealing concept for ants to decreasingly visit the amount