Abstract: In this paper, we propose an efficient hierarchical DNA sequence search method that improves search speed while keeping accuracy constant. For a given query DNA sequence, a fast local search method using histogram features is first applied as a filtering mechanism before the sequences in the database are scanned. Overlapping processing is newly added to improve the robustness of the algorithm. A large number of DNA sequences with low similarity are thereby excluded from the subsequent search. The Smith-Waterman algorithm is then applied to each remaining sequence. Experimental results using GenBank sequence data show that the proposed method, which combines histogram information with the Smith-Waterman algorithm, is more efficient for DNA sequence search.

Keywords: Fast search, DNA sequence, Histogram feature, Smith-Waterman algorithm, Local search

I. INTRODUCTION

The Apollo project of the life sciences [1], [2], that is, the decipherment of the 3-billion-base human genome sequence, was completed through international cooperation in April 2003. Since this achievement of the Human Genome Project, researchers around the world have been competing keenly to clarify the structure and analyze the performance of proteins, genes, and protein networks, and new gene sequences are determined every day. An enormous quantity of data has accumulated in databases such as GenBank [7], EMBL, and DDBJ. Moreover, the volume of data in genome databases is still increasing exponentially [8]. Homology search of genome sequences (DNA, mRNA, and protein) is the most important task in the life sciences. DNA is encoded by four types of nucleotides: A (adenine), C (cytosine), G (guanine), and T (thymine). If gene A and gene B have high homology, it can be surmised that the function of gene A is similar to that of gene B.
Qiu Chen, Feifei Lee, and Tadahiro Ohmi are with the New Industry Creation Hatchery Center, Tohoku University, Sendai, 980-8579 Japan (phone: +81-22-795-3977; fax: +81-22-795-3986; e-mail: qiu@fff.niche.tohoku.ac.jp). Koji Kotani is with the Department of Electronics, Graduate School of Engineering, Tohoku University, Sendai, 980-8579 Japan.

Normally, when a new DNA or protein sequence is determined, it is compared with all known sequences in annotated databases such as GenBank, EMBL, and DDBJ. Because these databases are very large, many algorithms have been studied and used to speed up the search. Needleman and Wunsch presented the Needleman-Wunsch algorithm [3], which calculates similarities between sequences by dynamic programming, and the Smith-Waterman algorithm [4] is an improved approach. However, retrieving data with these algorithms takes a long time because they require a large amount of computation. BLAST [5], FASTA [6], and PatternHunter [9], [10] are three rapid heuristic algorithms that are regularly used for searching protein and DNA sequence databases. The idea behind these tools, known as filtration, is to find subsequences that share some patterns. While BLAST and FASTA improve retrieval speed with heuristics, they may miss an alignment or give inaccurate output. Thus, many researchers have been trying to improve both search time and precision.

We have previously proposed an efficient method combining histogram features with the Smith-Waterman dynamic programming algorithm [4] in order to improve both speed and precision [11]. Histogram features are first used to compare the query sequence with the sequences in the database, yielding a similarity score for each. Only the sequences whose similarity exceeds a given threshold are then aligned using the exhaustive Smith-Waterman dynamic programming algorithm.
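The two-stage scheme described above (a cheap histogram comparison that filters the database, followed by Smith-Waterman alignment of the surviving sequences) can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration and are not taken from [11]: the histogram is taken to be a count of overlapping k-mers with k = 2, similarity is taken to be a normalized histogram intersection, and the threshold and the match, mismatch, and gap scores are arbitrary. The function names are hypothetical.

```python
from collections import Counter


def kmer_histogram(seq, k=2):
    """Count every overlapping k-mer in seq (k=2 is an assumed, illustrative choice)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def histogram_similarity(h1, h2):
    """Histogram intersection, normalized by the smaller histogram's total count.

    This similarity measure is an assumption; the source does not specify one here.
    """
    total = min(sum(h1.values()), sum(h2.values()))
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared k-mer.
    return sum((h1 & h2).values()) / total


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def hierarchical_search(query, database, threshold=0.5, k=2):
    """Stage 1: histogram filter. Stage 2: Smith-Waterman on the survivors."""
    query_hist = kmer_histogram(query, k)
    hits = []
    for name, seq in database.items():
        if histogram_similarity(query_hist, kmer_histogram(seq, k)) >= threshold:
            hits.append((name, smith_waterman_score(query, seq)))
    # Highest alignment score first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Calling `hierarchical_search(query, {"id1": "ACGT...", ...})` returns `(name, score)` pairs only for sequences that pass the filter. The histogram stage costs time linear in sequence length, whereas Smith-Waterman is quadratic, so the saving depends on how many sequences the threshold rejects.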
The effectiveness of the method in [11] has been demonstrated using GenBank sequence data; GenBank is the NIH genetic sequence database, a collection of all publicly available DNA sequences. For sequences whose lengths do not vary widely, the experimental results show that the algorithm is very efficient, but its efficiency decreases as the variation in sequence length grows.

In this paper, we propose a local search method that improves both efficiency and speed even when sequence length varies widely. Overlapping processing is newly added to improve robustness. The effects are demonstrated using GenBank sequence data.

This paper is organized as follows. Section II introduces the proposed local search algorithm using histogram features for DNA sequences in detail. Experimental results using publicly available GenBank sequence data are discussed in Section III. Finally, conclusions are given in Section IV.

An Improved Fast Search Method Using Histogram Features for DNA Sequence Database
Qiu Chen, Feifei Lee, Koji Kotani, and Tadahiro Ohmi
World Academy of Science, Engineering and Technology 45 2010 569