International Journal of Computer Applications (0975 - 8887) Volume 50 - No. 22, July 2012

Protein Sequence Similarity Search Technique Suitable for Parallel Implementation

Himanshu S Mazumdar
Head, Research & Development Center, Dharmsinh Desai University, Nadiad, Gujarat, India

Maulika S Patel
Research Scholar, Dharmsinh Desai University, Nadiad, Gujarat, India

ABSTRACT
Having entered the post-genomic era, we face a plethora of information, both genomic and proteomic. This provides ample resources to which computational and machine learning strategies can be applied to address problems of biological relevance. Searching biological databases for similar or homologous sequences is a fundamental step in many bioinformatics tasks. On discovering a new protein sequence or drug, a biologist would like to confirm the discovery by comparing it with the largest available protein database. Alignment-based methods become too complex and time consuming as the number of sequences increases. Alignment-free sequence comparison is often used as a filtering step before alignment is applied. A novel method of searching for similar sequences in a huge protein database is proposed. The method has two interesting aspects. The first is a divide and conquer approach with a hashing-like scheme for indexing the large database. The index consists of the addresses of the 15-residue words in the UniRef100.fasta database. The second is the possibility of data parallelism, since the database is divided into m segments for indexing. This can further increase the efficiency of the algorithm. Creating the index is time consuming, but the search time is constant and affordable. The method is particularly useful with large databases such as UniRef100.fasta, which consisted of 9757328 protein sequences as of May 2010. The index-based searching algorithm is implemented in C# .NET.
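The indexing idea summarized in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C# .NET implementation. The function names (build_segment_index, build_indexes, search), the choice to segment by a fixed number of sequences, and the ranking of hits by the number of shared words are all assumptions made for illustration only.

```python
from collections import defaultdict

WORD_LEN = 15  # word length used in the paper


def build_segment_index(sequences, first_seq_no, word_len=WORD_LEN):
    """Map every word of length word_len in one segment to the
    database sequence numbers in which it occurs."""
    index = defaultdict(set)
    for offset, seq in enumerate(sequences):
        for i in range(len(seq) - word_len + 1):
            index[seq[i:i + word_len]].add(first_seq_no + offset)
    return index


def build_indexes(sequences, segment_size, word_len=WORD_LEN):
    """Divide the database into segments and index each one independently.

    Assumption: segments hold a fixed number of sequences. Because each
    segment is indexed on its own, the calls could run in parallel.
    """
    return [
        build_segment_index(sequences[start:start + segment_size], start, word_len)
        for start in range(0, len(sequences), segment_size)
    ]


def search(query, indexes, word_len=WORD_LEN):
    """Return (sequence number, shared word count) pairs, best match first.

    Assumption: similarity is scored by counting the query words that
    also occur in a database sequence.
    """
    hits = defaultdict(int)
    for i in range(len(query) - word_len + 1):
        word = query[i:i + word_len]
        for index in indexes:
            for seq_no in index.get(word, ()):
                hits[seq_no] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
```

Each query word costs one dictionary lookup per segment index, so the search cost depends on the query length and on m rather than on the total number of database sequences.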
General Terms: Protein sequence similarity, alignment free

Keywords: 15-residue words, proteins, indexing, divide and conquer

1. INTRODUCTION
The post-genomic era is experiencing floods of genomic and proteomic data, encouraging more researchers to address problems such as targeted drug discovery, protein-protein interaction identification, protein function identification and more. It is well understood that searching is an important step towards finding homology, gene identification, motif identification, and other bioinformatics tasks[3, 4, 5]. Biologists are interested in identifying which sequences in a database are most similar to a new, uncharacterized sequence[9]. Alignment-based algorithms[10, 11] have been proposed, but they suffer from the curse of dimensionality: as the number of sequences to be aligned increases, the complexity increases[13]. Heuristic alignment algorithms such as BLAST[15] are also very popular for sequence alignment. In this scenario, alignment-free algorithms[2] have attracted many researchers. Different metrics have been proposed to assess the similarity obtained using alignment-free techniques[1, 7, 8, 16]. A pre-search approach is proposed in [14] to search for similar sequences in a huge database. That method works by first discovering a similar sequence and then using the common words to discover further similar sequences. Efficient sequence similarity searching becomes even more challenging when the database is huge. Indexing or hashing can reduce the latency. In the same light, an index-based divide and conquer method has been proposed and implemented for retrieving similar sequences. The method is also suitable for parallel implementation, which can further increase its efficiency.

2. MATERIALS AND METHODS
As shown in Fig.
1, the first step is to extract all 15-residue words from the database and prepare an index containing the location of these words in terms of sequence number in the database. Identifying the location of every 15-residue word in a database is expensive in terms of time, and storing the locations is expensive in terms of space, although space is of lesser concern. If done in the simplest way possible, the index will have a list of $20^{15}$ entries (one for each possible word over the 20 amino acids), each containing the sequence number in the database. The space requirement increases further with a larger database, which is usually the case in proteomic tasks. The algorithm was tested with two databases available at www.uniprot.org: UniRef100.fasta, a comprehensive and non-redundant UniProt reference cluster, and ss.txt, a FASTA-formatted file with protein sequences and secondary structures. The UniRef100.fasta database is 4.21 GB in size (9757328 sequences) as of May 2010[12], and ss.txt, a smaller dataset, contains 174372 sequences. A database of this size cannot be handled as a whole by an efficient run-time environment. To prepare the index, a divide and conquer strategy is adopted. We chose to divide the database into m segments or parts, such that each of the m segments consisted of around 100000 words, barring the last segment. This allowed us to handle the database at run time. The index is so prepared so