The Utility of Sequence, Function and Transcriptional Information in Gene Expression Profiling *

George Potamias 1,2

1 Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Vassilika Vouton, P.O. Box 1385, GR-71110, Heraklion, Crete, Greece. potamias@ics.forth.gr
2 Department of Computer Science, University of Crete, GR-71409, Heraklion, Crete, Greece

* Extended abstract for the EUNITE Workshop “Intelligent Technologies for Gene Expression Based Individualized Medicine”, 9th May 2003, Jena, Germany.

Extended Abstract

1. Introduction

As the determination of the DNA sequences comprising the genomes of various organisms is completed or nears completion, a shift from static structural genomics to dynamic functional genomics is taking place. In this paper we focus on functional genomics, and in particular on the analysis of microarray data. Microarray (gene-expression) data analysis depends heavily on Gene Expression Data Mining (GEDM) technology, which has attracted considerable research effort in recent years. GEDM is used to identify intrinsic patterns and relationships in gene expression data. The identification of patterns in complex gene expression datasets provides two benefits: (i) insight into gene regulation, and (ii) characterization of multiple gene expression profiles in complex biological processes, e.g. pathological states [8].

GEDM activities are based on two approaches: (a) hypothesis testing, which investigates the induction or perturbation of a biological process that leads to predicted results, and (b) knowledge discovery, which detects internal structure in biological data [1, 13]. In this paper we present an integrated methodology that combines both. It is based on a hybrid clustering approach that can compute and utilize different distances (or similarities) between the objects to be clustered. In this respect the whole exploratory data analysis process becomes more knowledgeable, in the sense that pre-established domain knowledge is used to guide the clustering.

2. Methodology

A hybrid clustering approach is devised that follows three steps [7]:

1. First, a distance is computed between all pairs of objects to be clustered. The distance may be computed taking into consideration various modalities. For microarray data, the distance between two genes may reflect their functional classification (i.e., their known assignment to the same or a similar functional activity during the metabolic process) or the occurrence of transcription factors (i.e., pre-specified and established motifs in the corresponding DNA sequences of the genes).

2. A fully connected weighted graph is devised with the genes as nodes and the computed distances as weights on the edges (links) between genes. The minimum spanning tree (MST) of this graph is found. The computed MST preserves the minimum distance between