MProfiler: A Profile-Based Method for DNA Motif Discovery

Doaa Altarawy, Mohamed A. Ismail, and Sahar M. Ghanem

Computer and Systems Engineering Dept., Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt
{doaa.altarawy,maismail,sghanem}@alex.edu.eg

Abstract. Motif finding is one of the most important tasks in the study of gene regulation, which is essential to understanding biological cell functions. Recent studies show that the performance of current motif finders is not satisfactory. A number of ensemble methods have been proposed to enhance the accuracy of the results, and their overall performance is better than that of stand-alone motif finders. A recent ensemble method, MotifVoter, significantly outperforms all existing stand-alone and ensemble methods. In this paper, we propose a method, MProfiler, that increases the accuracy of MotifVoter without increasing the run time by introducing an idea called center profiling. Our experiments show an improvement over MotifVoter in the quality of the generated clusters, in both accuracy and cluster compactness. Using 56 datasets, the final results of our method achieve an 80% improvement in the correlation coefficient nCC and a 93% improvement in the performance coefficient nPC over MotifVoter.

Keywords: Bioinformatics, DNA Motif Finding, Clustering.

1 Introduction

Computational identification of overrepresented patterns (motifs) in DNA sequences is a long-standing problem in Bioinformatics. Identifying those patterns is one of the most important tasks in the study of gene regulation, which is essential to understanding biological cell functions. Over the last few years, the sequencing of the complete genomes of a large variety of species (including human) has accelerated advances in the field of Bioinformatics [1]. The problem of DNA motif finding is to locate common short patterns in a set of co-regulated gene promoters (DNA sequences). Those patterns are conserved but still tend to vary slightly [2].
Normally the patterns (motifs) are fairly short (5 to 20 base pairs long) [3]. These motifs are the locations where transcription factors (TFs) bind in order to control protein production in cells. DNA motifs are also called transcription factor binding sites (TFBSs). Many computational methods have been proposed to solve this problem. Their strategies can be divided into two main classes: exhaustive enumeration and probabilistic methods [4].

V. Kadirkamanathan et al. (Eds.): PRIB 2009, LNBI 5780, pp. 13–23, 2009. © Springer-Verlag Berlin Heidelberg 2009
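
As a rough illustration of the problem statement and of the exhaustive enumeration class mentioned in the introduction, the sketch below enumerates every candidate pattern of a fixed length and keeps those that occur in all input sequences with at most a few mismatches. This is not MProfiler or MotifVoter. The function names, the motif length `l`, the mismatch bound `d`, and the toy sequences are all assumptions made for illustration, and the approach is only practical for very short patterns because it visits all 4^l candidates.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(candidate, seq, d):
    """True if some window of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def enumerate_motifs(seqs, l, d):
    """Brute-force search (illustrative only): return every l-mer over
    ACGT that appears in each sequence with at most d mismatches."""
    motifs = []
    for letters in product("ACGT", repeat=l):
        candidate = "".join(letters)
        if all(occurs(candidate, s, d) for s in seqs):
            motifs.append(candidate)
    return motifs


# Toy promoter sequences (made up): each holds a slightly varied copy of ACGT.
toy_seqs = ["TTACGTAA", "GGACGAGG", "CCTCGTCC"]
found = enumerate_motifs(toy_seqs, l=4, d=1)
print("ACGT" in found)
```

With one mismatch allowed, the conserved pattern ACGT is among the reported candidates even though it appears exactly in only the first toy sequence. With `d=0` no 4-mer is shared by all three sequences, which reflects the observation that motifs are conserved but still vary slightly.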