A survey on biological data analysis by biclustering
Majid Rastegar-Mojarad, Faculty of Electrical Engineering, Persian Gulf University, Bushehr, Iran, rastegar_m@comp.iust.ac.ir
Saeed Talatian-Azad, Faculty of Computer Engineering, Islamic Azad University, Bushehr, Iran, s.talatian@mail.sbu.ac.ir
Behrouz Minaei-Bidgoli, Assistant Professor, School of Computer Engineering, Iran Univ. of Sci. & Technology, Tehran, Iran, minaeibi@cse.msu.edu
Abstract: Several unsupervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, an unsupervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in various domains, such as biology. In this survey, we analyze several existing approaches to biclustering that are used in biological data analysis.
Keywords: data mining, biclustering, biological data analysis
I. INTRODUCTION
DNA chips and other techniques measure the expression level of a large number of genes, perhaps all genes of an organism, within a number of different experimental samples (conditions). The samples may correspond to different time points or different environmental conditions. In other cases, the samples may come from different organs, from cancerous or healthy tissues, or even from different individuals. Simply visualizing this kind of data, widely called gene expression data or simply expression data, is challenging, and extracting biologically relevant knowledge is harder still [1]. Several unsupervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments.
Recently, biclustering, an unsupervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions such that the genes exhibit highly correlated behavior across those conditions. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources.
Biclustering can be applied whenever the data to be analyzed take the form of a real-valued matrix A, where each value a_ij represents the relation between row i and column j. Gene expression matrices are an example of this kind of data. Biclustering can also be applied when the data can be modeled as a weighted bipartite graph. More generally, it can be used whenever the goal is to identify sub-matrices, each described by a subset of rows and a subset of columns, that satisfy certain coherence properties.
Large datasets of clinical samples are an ideal target for biclustering. Accordingly, many applications of biclustering use gene expression data obtained with microarray technologies, which allow the expression levels of thousands of genes to be measured under target experimental conditions. In this application domain, biclusters can be used to associate genes with specific clinical classes or to classify samples, among other possible applications.
DNA microarray data are usually arranged in a matrix, where each row corresponds to a gene and each column to an experimental condition. Each entry in the matrix records the expression level of a gene as a real number, which is usually derived by taking the logarithm of the relative abundance of the mRNA of that gene in a specific condition [2].
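As a minimal sketch of the data layout described above, the following Python fragment builds a small gene-by-condition matrix, extracts the sub-matrix defined by a subset of rows and a subset of columns, and scores its coherence. The function names, the toy values, and the use of a base-2 logarithm are assumptions made for illustration. The mean squared residue is used here only as one example of a coherence score from the biclustering literature; the text above does not prescribe a particular measure.

```python
import numpy as np


def log_expression(sample, reference):
    """Log2 ratio of mRNA abundance in a sample relative to a reference.

    The base-2 logarithm is an assumption; the text only says "logarithm".
    """
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.log2(sample / reference)


def submatrix(A, rows, cols):
    """Return the sub-matrix of A selected by a subset of rows and columns."""
    return A[np.ix_(rows, cols)]


def mean_squared_residue(B):
    """Mean squared residue of B (illustrative coherence score).

    The score is 0 when every entry equals row effect + column effect,
    and grows as the sub-matrix departs from that additive pattern.
    """
    row_mean = B.mean(axis=1, keepdims=True)
    col_mean = B.mean(axis=0, keepdims=True)
    residue = B - row_mean - col_mean + B.mean()
    return float((residue ** 2).mean())


# Hypothetical toy matrix: 4 genes (rows) x 4 conditions (columns).
A = np.array([
    [1.0, 2.0, 5.0, 3.0],
    [2.0, 3.0, 0.0, 4.0],
    [4.0, 5.0, 9.0, 6.0],
    [7.0, 1.0, 3.0, 2.0],
])

# Candidate bicluster: genes 0-2 under conditions 0, 1 and 3.
rows, cols = [0, 1, 2], [0, 1, 3]
B = submatrix(A, rows, cols)

score_bicluster = mean_squared_residue(B)  # additive pattern, so close to 0
score_full = mean_squared_residue(A)       # whole matrix is less coherent
```

In this toy matrix the selected genes differ only by a constant shift across the selected conditions, so the sub-matrix scores far lower than the full matrix. A search heuristic would look for row and column subsets with this property rather than being handed them.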
An important objective of analyzing this kind of data is the classification of genes and conditions and the identification of regulatory processes. Because it supports the analysis of such groups of genes and samples, clustering plays an important role in the exploratory analysis of microarray data [3]. Hartigan’s pioneering work on direct clustering was the first to reveal the potential of co-clustering, also called biclustering [4]. In a two-dimensional matrix, co-clustering aims at identifying homogeneous local patterns, each of which consists of a subset of rows and a subset of columns. In particular, co-clustering has attracted genomic researchers because the co-clustering model is compatible with our understanding of cellular processes, where a subset of genes is coregulated under certain experimental conditions but behaves almost independently under other conditions [5]. The paper by Madeira and Oliveira provides an extensive survey on the application of co-clustering to biological data analysis [6]. Another survey on biclustering algorithms can be found in [7]. Cheng and Church [8] are considered to be the first to apply co-clustering to gene expression data. They proposed a greedy search heuristic that generates biclusters, one at a time, based on a