A MapReduce-based Algorithm for Motif Search Hongwei Huo, Shuai Lin, Qiang Yu and Yipu Zhang School of Computer Science and Technology Xidian University Xi’an, 710071, China {hwhuo, lin_s2009, feqond, zephyr26026}@mail.xidian.edu.cn Vojislav Stojkovic Department of Computer Science Morgan State University Baltimore, MD 21251, USA vojislav.stojkovic@morgan.edu Abstract—Motif search plays an important role in gene finding and understanding gene regulation relationship. Motif search is one of the most challenging problems in bioinformatics. In this paper, we present three data partitions for the PMSP algorithm and propose the PMSP MapReduce algorithm (PMSPMR) for solving the motif search problem. For instances of the problem with different difficulties, the experimental results on the Hadoop cluster demonstrate that PMSPMR has good scalability. In particular, for the more difficult motif search problems, PMSPMR shows its advantage because the speedup is almost linearly proportional to the number of nodes in the Hadoop cluster. We also present experimental results on realistic biological data by identifying known transcriptional regulatory motifs in eukaryotes as well as in actual promoter sequences extracted from Saccharomyces cerevisiae. Keywords- Motif search; data partition; scalability; MapReduce; Hadoop I. INTRODUCTION Motif search is one of the most challenging problems in biology, molecular biology, bioinformatics, and computer science [1]. Motif search in unaligned DNA sequences plays an important role in gene finding and understanding gene regulation relationship. Das and Dai [2] made a survey of the recent developments in DNA motif search algorithms. Hu et al [3] extended earlier works to prokaryotic datasets and clarified the limitations and the potentials of existing motif search algorithms. DNA motif discovery algorithms can be divided into two categories based on the combinatorial approach used in their design, word-based methods that mostly rely on exhaustive enumeration and probabilistic sequence models [2]. The enumerative approach is exact and guarantees finding optimal solutions in the restricted search space. The probabilistic approach involves representation of the motif model by a position weight matrix. Most probabilistic motif discovery algorithms apply potent statistical techniques such as Expectation Maximization (EM) and Gibbs sampling and its extensions. Among the probabilistic methods, Gibbs sampling method [4] has been used extensively for motif discovery algorithms. It is initialized by choosing random starting positions within the various sequences and then proceeds through much iteration to execute the two steps of the Gibbs sampler: predictive update step and sampling step. The MEME algorithm [5] extended the EM algorithm for identifying motifs in unaligned biopolymer sequences, aiming to discover new motifs in a set of biopolymer sequences where little is known in advance about any motif that may be present. In PROJECTION [6], each l-mer of input sequences was projected into a smaller space through the projection template, and then, the EM algorithm is used to do refinement. GARPS [7] used an efficient hashing- based random projection strategy for processing input data, reducing the search space, and then, the genetic algorithm is used to do refinement. The word-based enumerative methods guarantee global optimality. They can be very fast when they are implemented with optimized data structures such as suffix trees. Buhler and Tompa [6] defined the challenging instances of the planted motif search problems, such as (9, 2)-, (11, 3)-, (13, 4)-, (15, 5)-, (17, 6)- and (19, 7)-motif problems. In WINNOWER [8], the graph-theoretic method was introduced to motif search for the first time by finding the maximum clique. However, it cannot handle the (15, 5)- motif problem due to the huge search space. RISOTTO [9] was the fastest algorithm in the family of suffix tree algorithms for solving motif search. Davila [11] implemented it on the machine with a Pentium4 2.40 GHz processor and a core memory size of 1 Gbyte, and found that it is capable of solving the (17, 6) challenging instance in 12 hours. Recently, the research of exact motif search algorithms mainly concentrated on the pattern-driven method. PMSP [10] tracked the following simple idea. For every l-mer x in the first sequence it generates d-neighbors of x and tries to guess if an l-mer y in that neighborhood is a motif by checking whether there is any l-mer in s i for i = 2,…,t at the distance d from it. PMSprune [11] was an improvement over PMSP by using the branch and bound strategy. Although the exact enumeration is an advantage of these methods, one limitation is that searching for long patterns is computationally expensive, and an exhaustive search through the sequence space of 4 L words often becomes impractical for L > 10 [8]. For different biological characteristics of the regulatory elements, several researchers have developed new methods/techniques such as parallel algorithms or 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops 978-0-7695-4676-6/12 $26.00 © 2012 IEEE DOI 10.1109/IPDPSW.2012.255 2063 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum 978-0-7695-4676-6/12 $26.00 © 2012 IEEE DOI 10.1109/IPDPSW.2012.255 2046