Pattern Recognition 39 (2006) 2464–2477
www.elsevier.com/locate/patcog

Multi-objective evolutionary biclustering of gene expression data

Sushmita Mitra, Haider Banka
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700 108, India

Received 3 November 2005; received in revised form 3 February 2006; accepted 1 March 2006

Abstract

Biclustering, or simultaneous clustering of both genes and conditions, has generated considerable interest over the past few decades, particularly in the analysis of high-dimensional gene expression data in information retrieval, knowledge discovery, and data mining. The objective is to find sub-matrices, i.e., maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a range of conditions. Since these two objectives, maximal size and high coherence, are mutually conflicting, they are suitable candidates for multi-objective modeling. In this study, a novel multi-objective evolutionary biclustering framework is introduced by incorporating local search strategies. A new quantitative measure to evaluate the goodness of the biclusters is developed. The experimental results on benchmark datasets demonstrate better performance compared to existing algorithms in the literature.

© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Multi-objective optimization; Microarray; Genetic algorithms; Knowledge discovery; Clustering

1. Introduction

Microarray experiments produce gene expression patterns that offer enormous information about cell function. This is useful when investigating complex interactions within the cell [1]. Microarrays are used in the medical domain to produce molecular profiles of diseased and normal tissues of patients. Such profiles are useful for understanding various diseases, and aid in more accurate diagnosis, prognosis, treatment planning, as well as drug discovery.
Being typically high-dimensional, gene expression data requires appropriate mining strategies like feature selection and clustering [2] for further analysis.

Biological networks relate genes, gene products or their groups (like protein complexes or protein families) to each other in the form of a graph. Clustering of gene expression patterns is being used to generate gene regulatory networks [3]. A major cause of coexpression of genes is their sharing of the regulation mechanism (coregulation) at the sequence level. Clustering of coexpressed genes into biologically meaningful groups helps in inferring the biological role of an unknown gene that is coexpressed with a known gene(s).

Corresponding author. E-mail addresses: sushmita@isical.ac.in (S. Mitra), hbanka_r@isical.ac.in (H. Banka). 0031-3203/$30.00. doi:10.1016/j.patcog.2006.03.003

A cluster is a collection of data objects which are similar to one another within the same cluster but dissimilar to the objects in other clusters [4]. The problem is to group N patterns into n_c possible clusters with high intra-class similarity and low inter-class similarity by optimizing an objective function. In objective function-based clustering algorithms, the goal is to find a partition for a given value of n_c. Clustering approaches applied to gene expression data include partitional, hierarchical, grid-based and density-based methods [5], to name a few. Here the genes are typically partitioned into disjoint or overlapped groups according to the similarity of their expression patterns over all conditions.

It is often observed that a subset of genes is coregulated and coexpressed under a subset of conditions, but behaves almost independently under other conditions. Here the term "conditions" can imply environmental conditions as well as time points corresponding to one or more such environmental conditions.
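The observation above can be illustrated with a minimal sketch on synthetic data. Nothing here comes from the paper's experiments: the matrix size, the implanted profile, the noise level and the helper name mean_pairwise_correlation are all assumptions chosen for illustration. The sketch implants a shared profile for a few genes under a few conditions and compares their average correlation over that subset with their average correlation over all conditions.

```python
import numpy as np


def mean_pairwise_correlation(block):
    """Mean Pearson correlation over all distinct pairs of rows in a 2-D array."""
    corr = np.corrcoef(block)                # rows are treated as variables (genes)
    upper = np.triu_indices_from(corr, k=1)  # indices strictly above the diagonal
    return float(corr[upper].mean())


# Synthetic expression matrix (illustrative assumption): genes x conditions.
rng = np.random.default_rng(0)
n_genes, n_conditions = 20, 30
expression = rng.normal(size=(n_genes, n_conditions))

# Implant a shared profile for genes 0..4 under conditions 0..5 only.
genes = np.arange(5)
conditions = np.arange(6)
shared_profile = np.linspace(-1.0, 1.0, conditions.size)
noise = 0.1 * rng.normal(size=(genes.size, conditions.size))
expression[np.ix_(genes, conditions)] = shared_profile + noise

# Similarity of the same genes, restricted to the subset vs. over all conditions.
corr_local = mean_pairwise_correlation(expression[np.ix_(genes, conditions)])
corr_global = mean_pairwise_correlation(expression[genes, :])

print(f"mean correlation over the condition subset: {corr_local:.2f}")
print(f"mean correlation over all conditions:       {corr_global:.2f}")
```

In this toy setting the implanted genes are strongly correlated on the condition subset, while their similarity measured over all conditions is much weaker. A method that compares expression patterns over all conditions can therefore miss such a group.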
Biclustering attempts to discover such local