MProfiler: A Profile-Based Method for DNA Motif Discovery

Doaa Altarawy, Mohamed A. Ismail, and Sahar M. Ghanem

Computer and Systems Engineering Dept., Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt
{doaa.altarawy,maismail,sghanem}@alex.edu.eg

Abstract. Motif finding is one of the most important tasks in gene regulation, which is essential in understanding biological cell functions. Based on recent studies, the performance of current motif finders is not satisfactory. A number of ensemble methods have been proposed to enhance the accuracy of the results. The overall performance of existing ensemble methods is better than that of stand-alone motif finders. A recent ensemble method, MotifVoter, significantly outperforms all existing stand-alone and ensemble methods. In this paper, we propose a method, MProfiler, that increases the accuracy of MotifVoter without increasing the run time by introducing an idea called center profiling. Our experiments show an improvement over MotifVoter in the quality of the generated clusters, in both accuracy and cluster compactness. Using 56 datasets, the final results of our method achieve an 80% improvement in the correlation coefficient nCC and a 93% improvement in the performance coefficient nPC over MotifVoter.

Keywords: Bioinformatics, DNA Motif Finding, Clustering.

1 Introduction

Computational identification of overrepresented patterns (motifs) in DNA sequences is a long-standing problem in Bioinformatics. Identification of those patterns is one of the most important tasks in gene regulation, which is essential in understanding biological cell functions. Over the last few years, the sequencing of the complete genomes of a large variety of species (including human) has accelerated advances in the field of Bioinformatics [1].

The problem of DNA motif finding is to locate common short patterns in a set of co-regulated gene promoters (DNA sequences). Those patterns are conserved but still tend to vary slightly [2].
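To make the problem statement concrete, the following is a minimal brute-force sketch in Python; it is not the method proposed in this paper. It assumes a fixed motif length `k` and a mismatch tolerance `max_mismatches`, both illustrative parameters, and the function names and toy promoter sequences are hypothetical. A candidate pattern is reported if every sequence contains an occurrence that differs from it in at most `max_mismatches` positions, which models patterns that are "conserved but still tend to vary slightly".

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, sequence: str, max_mismatches: int) -> bool:
    """True if some window of `sequence` is within `max_mismatches` of `candidate`."""
    k = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def enumerate_motifs(sequences, k, max_mismatches):
    """Return every k-mer that appears, up to `max_mismatches`, in all sequences."""
    found = []
    # Try all 4**k candidate patterns over the DNA alphabet.
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, max_mismatches) for s in sequences):
            found.append(candidate)
    return found


# Toy promoters (made up for illustration); each holds a variant of TATAAT.
promoters = ["GGCTATAATCC", "ACTATCATGGA", "TTGTAGAATCA"]
motifs = enumerate_motifs(promoters, k=6, max_mismatches=1)
```

Because the number of candidates grows as 4^k, a search of this kind is only practical for short patterns.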
Normally the patterns (motifs) are fairly short (5 to 20 base pairs long) [3]. Those motifs are the locations where transcription factors (TFs) bind in order to control protein production in cells. DNA motifs are also called transcription factor binding sites (TFBS).

Many computational methods have been proposed to solve this problem. Their strategies can be divided into two main classes: exhaustive enumeration and probabilistic methods [4].

V. Kadirkamanathan et al. (Eds.): PRIB 2009, LNBI 5780, pp. 13–23, 2009. © Springer-Verlag Berlin Heidelberg 2009