A Novel Approach to Extract Structured Motifs
by Multi-Objective Genetic Algorithm
Mehmet Kaya and Melikali Güç
Data Mining and Bioinformatics Laboratory
Department of Computer Engineering, Fırat University, 23119, Elazığ, Turkey
kaya@firat.edu.tr
Abstract
Functional motifs composed of several sequential
blocks are difficult to find. Current mining methods
may find each motif block individually but fail to
connect blocks separated by large, irregular gaps. In this
paper we propose a novel method for the efficient
extraction of structured motifs from DNA sequences
using a multi-objective genetic algorithm. The main
advantage of our approach is that a large number of
nondominated motifs can be obtained in a single run
with respect to three conflicting objectives: similarity
maximization, support maximization, and gap
minimization. To the
best of our knowledge, this is the first effort in this
direction. The proposed method can be applied to any
data set with a sequential character. Furthermore, it
allows any choice of similarity measures for finding
motifs. By analyzing the obtained optimal motifs, the
decision maker can understand the tradeoff between
the objectives. We compare our method with two
well-known structured motif extraction methods,
EXMOTIF and RISOTTO. Experimental results on a
synthetic data set demonstrate that the proposed
method outperforms the other methods in terms of
runtime.
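The notion of a nondominated motif with respect to the three objectives can be illustrated with a small Pareto filter. This is only a minimal sketch, not the selection scheme of the proposed method: the `Candidate` fields, the function names, and the numeric scores are hypothetical, and it assumes that similarity and support are maximized while the gap is minimized.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical candidate motif with its three objective values."""
    name: str
    similarity: float  # to be maximized
    support: float     # to be maximized
    gap: int           # to be minimized


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.similarity >= b.similarity
                and a.support >= b.support
                and a.gap <= b.gap)
    strictly_better = (a.similarity > b.similarity
                       or a.support > b.support
                       or a.gap < b.gap)
    return no_worse and strictly_better


def nondominated(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only; they are not results from the paper.
pool = [
    Candidate("m1", similarity=0.9, support=0.5, gap=10),
    Candidate("m2", similarity=0.8, support=0.6, gap=8),
    Candidate("m3", similarity=0.7, support=0.5, gap=12),
]
front = nondominated(pool)
```

Here `m3` is dominated by `m1`, while `m1` and `m2` trade similarity against support and gap, so both remain for the decision maker to choose between.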
1. Introduction
In this paper, we consider structured motifs, which are
motifs composed of several disjoint single motifs
placed at given distances from each other. The
extraction of structured motifs appears particularly
interesting because of its application to the detection of
binding sites that respect a distance constraint. Given a
sequence s, the problem is to find repeated patterns in s
according to some parameters that specify the
frequency and the structure required for the motifs.
Many simple motif extraction algorithms have been
proposed primarily for extracting the transcription
factor binding sites, where each motif consists of a
unique binding site [1-3] or two binding sites separated
by a fixed number of gaps [4]. Structured motif
extraction problems, in which variable numbers of gaps
are allowed, have attracted much attention recently,
where the structured motifs can be extracted either
from multiple sequences or from a single sequence. In
many cases, more than one transcription factor may
cooperatively regulate a gene. Such patterns are called
composite regulatory patterns. To detect the composite
regulatory patterns, one may apply single binding site
identification algorithms to detect each component
separately. However, this solution may fail when some
components are not very strong. Thus, it is necessary to
detect the whole composite regulatory pattern directly,
since its gaps and its other, possibly strong, components
can increase its significance [5, 6].
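To make the notion of a structured motif with a variable gap concrete, the following minimal sketch locates occurrences of a two-block motif in a sequence. It is an illustration under simplifying assumptions, not one of the cited algorithms: blocks are matched exactly (no mismatches), only two blocks are handled, support is taken to be the fraction of sequences with at least one occurrence, and the function names and example sequences are hypothetical.

```python
from typing import List, Tuple


def find_structured(seq: str, block1: str, block2: str,
                    min_gap: int, max_gap: int) -> List[Tuple[int, int]]:
    """Return (start, gap) pairs where block1 is followed by block2
    after a gap of min_gap..max_gap characters (exact matching)."""
    hits = []
    n1, n2 = len(block1), len(block2)
    for i in range(len(seq) - n1 + 1):
        if seq[i:i + n1] != block1:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + n1 + gap  # start of the second block
            if seq[j:j + n2] == block2:
                hits.append((i, gap))
    return hits


def support(seqs: List[str], block1: str, block2: str,
            min_gap: int, max_gap: int) -> float:
    """Fraction of sequences that contain at least one occurrence."""
    found = sum(1 for s in seqs
                if find_structured(s, block1, block2, min_gap, max_gap))
    return found / len(seqs)


# Made-up sequence: "ACG" and "GCA" occur twice, with gaps of 3 and 1.
hits = find_structured("ACGTTTGCAACGTGCA", "ACG", "GCA", 1, 3)
frac = support(["ACGTTTGCA", "TTTT"], "ACG", "GCA", 1, 3)
```

A single-block finder would report the two blocks independently. The gap range is what ties them into one pattern, and widening it increases the number of candidate positions that must be checked.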
Recently, Genetic Algorithms (GAs) have been used
for discovering simple motifs in multiple unaligned
DNA sequences. Among these, Liu et al. [7] developed a
program called FMGA for the motif discovery
problem. In their method, each individual represents a
candidate motif generated randomly, one motif per
sequence. Then, Che et al. [8] proposed a new GA
approach called MDGA to efficiently predict the
binding sites for homologous genes. The fitness value
for an individual is evaluated by summing up the
information content for each column in the alignment
of its binding site. Congdon et al. [9] developed a GA
approach to motif inference, called GAMI, designed to
work with divergent species and possibly long nucleotide
sequences. The system design reduces the size of the
search space as compared to typical window-location
approaches for motif inference. They presented
preliminary results on data from the literature and from
novel projects. Finally, Paul and Iba [10] presented a
GA-based method for the identification of multiple (l, d)
motifs in each of the given sequences. The method can
handle longer motifs, identify multiple positions of
motif instances of a consensus motif, and extract
weakly conserved regions in the given sequences.
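The column-wise information content fitness mentioned above for MDGA can be sketched as follows. This is a hedged illustration rather than the exact formulation of [8]: it assumes the common relative-entropy form, in which each column contributes the sum over bases of f log2(f / p), with a uniform background p = 0.25. The function name and the example sites are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def information_content(sites: List[str],
                        background: Optional[Dict[str, float]] = None) -> float:
    """Sum, over alignment columns, of sum_b f_b * log2(f_b / p_b).

    sites: aligned binding sites of equal length over the alphabet ACGT.
    background: base probabilities p_b (assumed uniform if omitted).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    total = 0.0
    for col in range(len(sites[0])):
        counts = Counter(site[col] for site in sites)
        for base, count in counts.items():
            freq = count / len(sites)  # observed frequency in this column
            total += freq * math.log2(freq / background[base])
    return total


# Perfectly conserved sites: each of the 4 columns contributes 2 bits.
conserved = information_content(["ACGT", "ACGT", "ACGT", "ACGT"])
# A column with all four bases equally frequent contributes 0 bits.
uniform = information_content(["A", "C", "G", "T"])
```

Under this measure a GA is rewarded for choosing, in each sequence, site positions whose alignment is strongly conserved.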
21st IEEE International Symposium on Computer-Based Medical Systems
1063-7125/08 $25.00 © 2008 IEEE
DOI 10.1109/CBMS.2008.99
278