POSTER ABSTRACT

Evolutionary Segmentation of Yeast Genome

Daniel Mateos
Computer Science Dept., U. of Seville
Avda. Reina Mercedes s/n, 41012 Seville, Spain
+34 954 553 866
mateos@lsi.us.es

José C. Riquelme
Computer Science Dept., U. of Seville
Avda. Reina Mercedes s/n, 41012 Seville, Spain
+34 954 552 775
riquelme@lsi.us.es

Jesús S. Aguilar-Ruiz
Computer Science Dept., U. of Seville
Avda. Reina Mercedes s/n, 41012 Seville, Spain
+34 954 553 871
aguilar@lsi.us.es

ABSTRACT

Segmentation algorithms differ from clustering algorithms in how they treat the physical location of genes along the sequence: segments must preserve the original positions of consecutive genes, which is not a constraint for clustering algorithms. Functional relations have been shown to exist among neighbouring genes, so locating the boundaries between these functionally similar groups of genes has become an important challenge. In this paper, we present an evolutionary algorithm to segment the yeast genome.

1. INTRODUCTION

Chromosomes are organized as gene sequences. Each chromosome has a variable number of genes that are physically located in consecutive positions. Genome studies try to determine the function of every gene. Recent research in genetics tries to discover functional relations between a gene and its "neighbours" within a chromosome. This process is known as DNA segmentation, and there is little scientific literature about it. The commonly used techniques work with DNA sequences rather than with numerical values associated with each gene. Nowadays, microarray techniques generate large amounts of data, which may be very useful for analysing the functional properties of genes, as they provide a numerical value for every gene. This clears the way for new algorithms that can handle this sort of data. In this work, we present an Evolutionary Algorithm (EA) to find valid segments in the yeast genome.
For the yeast genome study, we have a file with the sixteen chromosomes (NREG). Each gene is a row of the file. The file has three columns, each representing a genomic characteristic under specific conditions. The objective is to group consecutive genes that either have similar properties with regard to the three variables or are clearly differentiated from the adjacent groups. Each cluster is a segment of genes, since it preserves the physical location within the genome.

2. EVOLUTIONARY ALGORITHM

Each individual of the population is a fixed-size array of NCOR natural numbers, and it represents a collection of cutoffs in the yeast genome. Fifteen of these cutoffs correspond to the boundaries between the sixteen chromosomes of the yeast genome, and they are permanent. The sixteen cutoffs corresponding to the centromeres are also permanent. Although these fixed cutoffs (NCORFIJ = 31) cannot be moved, they are included in every individual, which simplifies the computation. For example, if a cutoff array includes, among others, the values 34, 57, 7, 25 and 80, there is a cutoff between the 34th and the 35th entry of the file, between the 57th and the 58th entry, between the 7th and the 8th entry, and so on. The segments therefore run from the 1st to the 7th gene, from the 8th to the 25th gene, from the 26th to the 34th gene, from the 35th to the 57th gene, and so on.

To verify the quality of the fitness functions, we run the algorithm on the original data and on randomized versions of it. We consider a fitness function valid if the results obtained with the random data are worse than those obtained with the original data. Otherwise, we have an "artifact" (an apparent experimental result that is not real but is due to the experimental methods).

The fitness function used in the first experiments calculated the median of each variable for each segment and maximized the correlations between these medians (Eq. 1).
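As a minimal sketch of the encoding and of the first fitness function described above, the following Python code decodes an unordered cutoff array into segments and computes the sum of squared correlations between the per-segment medians of each pair of variables. The function names, the `n_genes=100` value and the in-memory `rows` layout (one tuple of three values per gene) are assumptions made for illustration and do not come from the original implementation. The code only evaluates one individual; the maximization itself would be carried out by the EA.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def segments_from_cutoffs(cutoffs, n_genes):
    """Turn an unordered cutoff array into 1-based inclusive (first, last) segments.

    A cutoff c means a boundary between entry c and entry c + 1.
    """
    bounds = sorted({c for c in cutoffs if 0 < c < n_genes})
    starts = [1] + [c + 1 for c in bounds]
    ends = bounds + [n_genes]
    return list(zip(starts, ends))


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def median_correlation_fitness(rows, cutoffs):
    """Sum of squared correlations between per-segment median vectors."""
    segs = segments_from_cutoffs(cutoffs, len(rows))
    n_vars = len(rows[0])
    # meds[i][k] = median of variable i over segment k
    meds = [[median(r[i] for r in rows[a - 1:b]) for a, b in segs]
            for i in range(n_vars)]
    return sum(pearson(meds[i], meds[j]) ** 2
               for i, j in combinations(range(n_vars), 2))


# Hypothetical genome of 100 entries with the cutoffs from the text's example.
print(segments_from_cutoffs([34, 57, 7, 25, 80], n_genes=100))
# [(1, 7), (8, 25), (26, 34), (35, 57), (58, 80), (81, 100)]
```

One possible way to mimic the artifact check is to evaluate the same function on a shuffled copy of `rows` (for example with `random.shuffle`) and compare the two values; how the randomized versions were actually generated is not specified in the text.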
$$
F = \max\left(\rho_{12}^{2} + \rho_{13}^{2} + \rho_{23}^{2}\right), \qquad
\rho_{ij} = \mathrm{correl}\left(Med_i, Med_j\right), \qquad
Med_i = \bigcup_{k=1}^{NCOR+1} med_k^i
$$

where $med_k^i$ is the median of the $k$-th segment for the $i$-th variable.

Eq. 1. Fitness function (inter-median correlations)

This fitness function turned out to be an artifact, because the results with random data were similar to the results with the original data.

Another possibility for the fitness function is to maximize the difference between the values of the variables in two consecutive segments, for each variable separately and for all variables together. As the statistic, we can use the median (robust against outliers) or the classic

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC '04, March 14-17, 2004, Nicosia, Cyprus.
Copyright 2004 ACM 1-58113-812-1/03/04 …$5.00.

2004 ACM Symposium on Applied Computing