A Novel Approach to Extract Structured Motifs by Multi-Objective Genetic Algorithm

Mehmet Kaya and Melikali Güç
Data Mining and Bioinformatics Laboratory
Department of Computer Engineering, Fırat University, 23119, Elazığ, Turkey
kaya@firat.edu.tr

Abstract

Functional motifs composed of several sequential blocks are difficult to find. Current mining methods may find each motif block individually but fail to connect blocks that are separated by large irregular gaps. In this paper we propose a novel method for the efficient extraction of structured motifs from DNA sequences using a multi-objective genetic algorithm. The main advantage of our approach is that a large number of nondominated motifs can be obtained in a single run with respect to conflicting objectives: similarity maximization, support maximization, and gap minimization. To the best of our knowledge, this is the first effort in this direction. The proposed method can be applied to any data set with a sequential character. Furthermore, it allows any choice of similarity measure for finding motifs. By analyzing the obtained optimal motifs, the decision maker can understand the tradeoff between the objectives. We compare our method with two well-known structured motif extraction methods, EXMOTIF and RISOTTO. Experimental results on synthetic data demonstrate that the proposed method compares favorably with the other methods in terms of runtime.

1. Introduction

In this paper, we consider structured motifs, which are motifs composed of several disjoint single motifs placed at given distances from each other. The extraction of structured motifs is particularly interesting because of its application to the detection of binding sites that respect a distance constraint. Given a sequence s, the problem is to find repeated patterns in s according to parameters that specify the frequency and the structure required for the motifs.
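To make the problem statement concrete, the following is a minimal sketch of locating occurrences of a structured motif in a single sequence. It is not the method proposed in this paper: it assumes exact matching of each block (no mismatches), and the function name `find_structured_motif` and its parameters `blocks` and `gap_ranges` are hypothetical names introduced only for illustration.

```python
def find_structured_motif(seq, blocks, gap_ranges):
    """Return the start positions of every occurrence of a structured motif.

    Assumptions (illustrative only, not taken from the paper):
      * each block must match the sequence exactly;
      * gap_ranges[i] = (min_gap, max_gap) is the allowed number of
        characters between the end of block i and the start of block i+1.
    Each result is a tuple with one start position per block.
    """
    if len(gap_ranges) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between consecutive blocks")

    results = []

    def extend(block_idx, prev_end, starts):
        # All blocks placed: record one complete occurrence.
        if block_idx == len(blocks):
            results.append(tuple(starts))
            return
        block = blocks[block_idx]
        min_gap, max_gap = gap_ranges[block_idx - 1]
        # Try every allowed gap length after the previous block.
        for gap in range(min_gap, max_gap + 1):
            pos = prev_end + gap
            if seq.startswith(block, pos):
                extend(block_idx + 1, pos + len(block), starts + [pos])

    first = blocks[0]
    for pos in range(len(seq) - len(first) + 1):
        if seq.startswith(first, pos):
            extend(1, pos + len(first), [pos])
    return results


if __name__ == "__main__":
    # Two blocks, "ACG" and "GCA", separated by 1 to 3 arbitrary characters.
    print(find_structured_motif("ACGTTTGCAACGTAGCA", ["ACG", "GCA"], [(1, 3)]))
```

In this sketch the number of occurrences plays the role of the frequency parameter, and the block list together with the gap ranges plays the role of the structure parameter. Tightening the gap range to `(1, 2)` removes the first occurrence, whose blocks are three characters apart.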
Many simple motif extraction algorithms have been proposed, primarily for extracting transcription factor binding sites, where each motif consists of a unique binding site [1-3] or two binding sites separated by a fixed number of gaps [4]. Structured motif extraction problems, in which variable numbers of gaps are allowed, have attracted much attention recently; the structured motifs can be extracted either from multiple sequences or from a single sequence. In many cases, more than one transcription factor may cooperatively regulate a gene. Such patterns are called composite regulatory patterns. To detect composite regulatory patterns, one may apply single binding site identification algorithms to detect each component separately. However, this solution may fail when some components are not very strong. Thus it is necessary to detect the whole composite regulatory pattern directly, since its gaps and other possibly strong components can increase its significance [5, 6].

Recently, Genetic Algorithms (GAs) have been used for discovering simple motifs in multiple unaligned DNA sequences. Liu et al. [7] developed a program called FMGA for the motif discovery problem. In their method, each individual represents a candidate motif generated randomly, one motif per sequence. Che et al. [8] then proposed a new GA approach called MDGA to efficiently predict the binding sites for homologous genes. The fitness value of an individual is evaluated by summing the information content of each column in the alignment of its binding sites. Congdon et al. [9] developed a GA approach to motif inference, called GAMI, designed to work with divergent species and possibly long nucleotide sequences. The system design reduces the size of the search space compared to typical window-location approaches for motif inference. They presented preliminary results on data from the literature and from novel projects.
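The column-wise information content fitness described above for MDGA can be sketched as follows. This is a hedged illustration rather than the exact formula used in [8]: it assumes a uniform background distribution over A, C, G and T and no pseudocounts, and the function names `column_information_content` and `alignment_fitness` are hypothetical.

```python
import math
from collections import Counter


def column_information_content(column, background=None):
    """Information content (in bits) of one alignment column.

    Computes sum_b f_b * log2(f_b / p_b), where f_b is the observed
    frequency of base b in the column and p_b its background probability.
    Assumption: a uniform background (0.25 per base) unless one is given.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    counts = Counter(column)
    total = len(column)
    ic = 0.0
    for base, count in counts.items():
        freq = count / total
        ic += freq * math.log2(freq / background[base])
    return ic


def alignment_fitness(sites):
    """Sum the information content over all columns of aligned sites.

    `sites` is a list of equal-length strings, one candidate binding
    site per input sequence.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return sum(
        column_information_content([site[i] for site in sites])
        for i in range(length)
    )


if __name__ == "__main__":
    # Three fully conserved columns (2 bits each) and one uniform column (0 bits).
    print(alignment_fitness(["ACGT", "ACGA", "ACGC", "ACGG"]))
```

Under these assumptions a perfectly conserved column contributes 2 bits and a column with all four bases equally frequent contributes 0 bits, so better-conserved candidate alignments receive a higher fitness.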
Finally, Paul and Iba [10] presented a GA-based method for the identification of multiple (l, d) motifs in each of the given sequences. The method can handle longer motifs, identify multiple positions of motif instances of a consensus motif, and extract weakly conserved regions in the given sequences.

21st IEEE International Symposium on Computer-Based Medical Systems 1063-7125/08 $25.00 © 2008 IEEE DOI 10.1109/CBMS.2008.99 278