New Distance Measure for Microarray Gene Expressions using Linear Dynamic Range of Photo Multiplier Tube

Shubhra Sankar Ray
Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India
shubhra_r@isical.ac.in

Sanghamitra Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
sanghami@isical.ac.in

Sankar K. Pal
Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India
sankar@isical.ac.in

Abstract

This paper presents a new distance measure for genes based on their microarray expressions. The measure, called “Maxrange distance”, incorporates an experiment-specific normalization factor in the computation of the distance. The normalization factor depends on the linear dynamic range of the photo multiplier tube (PMT) used for scanning the fluorescence intensities of the gene expression values. The superiority of this distance measure in the microarray gene ordering problem has been extensively established on widely studied microarray data sets by performing statistical tests.

1 Introduction

Recent advances in DNA array technologies have resulted in a significant increase in the amount of genomic data [3, 2]. The most powerful and commonly used technique is the microarray, which enables the expression levels of thousands of genes to be monitored simultaneously. Because of the large quantity of information available from microarrays, it is necessary to find an appropriate distance measure for genes and to employ a classification process on the data in order to obtain initial conclusions about the genes.

The present article deals with the tasks of measuring the distance between genes and evaluating their biological ordering in a clustering framework.
The widely used measures for finding the global similarity between genes (where all the expression values of a gene are taken into consideration) are the Pearson correlation [3, 2] and the Euclidean distance [8]. In computing the similarity, none of these measures assigns appropriate weights to gene expressions obtained from different types of experiments, where the expressions differ by orders of magnitude from one type to another. Consequently, gene expression values in a lower dynamic range are dominated by those in a higher dynamic range. A new similarity measure between genes, called “Maxrange distance”, is defined in this article: the gene expression distance between two genes for a particular type of experiment is first normalized by a factor dependent on the linear dynamic range of the photo multiplier tube (used for scanning the fluorescence intensities of that experiment), and the normalized distances are then summed to obtain a global distance. The superiority of the proposed Maxrange distance over the related measures is established by using them in four different algorithms.

2 Gene Ordering Methods

Cluster analysis, ordering, and display of gene expression patterns are considered useful tools for detecting genes that are co-expressed or implicated in similar cellular functions [3, 2]. Hierarchical clustering approaches (single, complete and average linkage) [3, 1] group gene expressions into trees of clusters. They start with singleton sets and merge clusters until all genes belong to a single set. Hierarchical clustering does not determine unique clusters, so the user has to decide which subtrees are clusters and which are only part of a bigger cluster. In the framework of hierarchical clustering, a gene ordering algorithm therefore helps the user to identify clusters, and subclusters within big clusters, by visual inspection of the clustered gene expression data [1].
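
The two ideas described so far, per-experiment normalization of gene distances and agglomerative merging from singleton sets, can be illustrated with a minimal Python sketch. It is not the paper's formulation: the function names, the toy data, and the `ranges` values are hypothetical, and `ranges` merely stands in for a factor derived from the PMT linear dynamic range, whose exact definition is not given in this section. The sketch only shows how dividing each experiment's difference by such a factor changes which genes are merged first under single linkage.

```python
import math
from itertools import combinations


def range_normalized_distance(x, y, ranges):
    """Sum of per-experiment absolute differences, each divided by that
    experiment's normalization factor (assumed stand-in, not the paper's
    exact Maxrange formula)."""
    return sum(abs(a - b) / r for a, b, r in zip(x, y, ranges))


def euclidean(x, y):
    """Plain Euclidean distance, with no per-experiment weighting."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(genes, dist):
    """Start from singleton sets and repeatedly merge the two closest
    clusters until one set remains. Returns the merge history as
    (cluster_a, cluster_b, distance) tuples of gene indices."""
    clusters = [(i,) for i in range(len(genes))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = smallest pairwise gene distance.
        a, b, d = min(
            ((a, b, min(dist(genes[i], genes[j]) for i in a for j in b))
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[2],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Toy data: two experiment types whose values differ by orders of magnitude.
genes = [
    [0.1, 1000.0],
    [0.9, 1010.0],
    [0.2, 1400.0],
]
ranges = [1.0, 1000.0]  # hypothetical per-experiment normalization factors

merges = single_linkage(
    genes, lambda x, y: range_normalized_distance(x, y, ranges)
)
euclid_merges = single_linkage(genes, euclidean)
```

With these toy values, the Euclidean distance is driven almost entirely by the second experiment, so genes 0 and 1 are merged first, whereas the normalized distance lets the first experiment contribute as well and merges genes 0 and 2 first.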
Moreover, genes that are adjacent in a linear ordering are often functionally co-regulated and involved in the same cellular process [2, 3], and biological analysis is often done in the context of this linear ordering [1]. Ideally, one would like to obtain a linear order of all
Proceedings of the International Conference on Computing: Theory and Applications (ICCTA'07) 0-7695-2770-1/07 $20.00 © 2007