International Journal of Computer Applications (0975 – 8887), Volume 75, No. 4, August 2013

Compression Algorithm for all Specified bases in Nucleic Acid Sequences

Subhankar Roy
CSE Department, Academy of Technology, G. T. Road, Aedconagar, Hooghly-712121, W.B., India

Sunirmal Khatua
Department of Computer Science and Engineering, University of Calcutta, 92, A.P.C. Road, Kolkata-700009, India

ABSTRACT
Organizations such as the IT industry, colleges and scientists regularly encounter problems handling large data sets for different purposes in many areas, for example biological research. These limitations also affect internet searches that fetch data, business analysis, and so on. There is therefore a need for compression algorithms that are general yet specialized for dissimilar kinds of data, so as to obtain the highest saving percentage. In this article, compression of biological data, namely single- and double-stranded DNA and single-stranded RNA, is considered. Biological data are less random than ordinary text data, meaning that redundancy within the sequences is higher, and they have some special properties, for example different types of repeats. One such repeat is the dinucleotide repeat, which occurs frequently in any sequence. The two proposed algorithms are based on this repeat and use a static fixed-length LUT for mapping between the input file and the output file.

Keywords
Completely and incompletely specified nucleic acid bases, static LUT, dinucleotide repeats, base pair, sequence line length, compressed sequence length, compression factor, saving percentage.

1. INTRODUCTION
Until now, most nucleic acid sequence compression algorithms have considered only the completely specified bases, that is A, C, G and T/U [1-2]. If a sequence contains some incompletely specified bases, although this is a rare case, those techniques will not be adequate. So there is a need for a different algorithm which has the generalized property.
That is, the algorithm should handle not only the four completely specified bases [3-5] but also the eleven incompletely specified bases. The primary bases are adenine (A), cytosine (C) and guanine (G), which exist in both DNA and RNA sequences, thymine (T), which exists only in DNA, and uracil (U), which exists only in RNA. These symbols are typically called bases in a genome. Besides these symbols, some intermittent, incompletely specified bases may exist in nucleic acid sequences: K, M, R, S, W, Y, B, D, H, V and N [6].

The incompletely specified bases can be represented in terms of A, C, G and T/U. Keto (K) may be G or T, Amino (M) A or C, Purine (R) A or G, S C or G, W A or T, and Pyrimidine (Y) C or T; each of these has a 50% probability between two primary bases. The following bases may be any one of three primary bases: B is C or G or T, D is A or G or T, H is A or C or T, and V is A or C or G. The last one, N, may be A or C or G or T.

These bases are relevant for both deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The only difference is between T and U: RNA uses U where DNA uses T. So deriving the RNA symbols from the corresponding DNA symbols is just a matter of replacing every occurrence of T with U in the definitions of the incompletely specified bases. Since DNA and RNA differ by a single base, the storing process is essentially the same for both, and the programmer need not build separate data banks for DNA and RNA. No distinction between lower and upper case letters is made here, i.e. 'A' is equivalent to 'a' and so on.

All symbols have a corresponding complement, but only for DNA, not for RNA: A-T, C-G, K-M, M-K, R-Y, Y-R, B-V, D-H, H-D and V-B. Some symbols are self-complementary: S, W and N.

2. BRIEF REVIEW
Many DNA sequence compression techniques already exist. Only a few of them are explained here.
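As an aside on the base symbols defined in Section 1, the fifteen symbols, their case-insensitive handling, the T/U substitution for RNA and the DNA complement pairs can be sketched in a few lines of Python. This is only an illustration of the definitions, not the proposed compression algorithms; the names `IUPAC_DNA`, `COMPLEMENT`, `normalize`, `possible_bases` and `complement` are hypothetical and chosen here for readability. The table also includes the reverse pairs T-A and G-C, which the complement list implies.

```python
# Illustrative sketch only: names and structure are assumptions, not the paper's code.

# Each symbol mapped to the primary bases it may stand for (DNA form).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",                      # completely specified
    "K": "GT", "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Complement pairs (defined for DNA only); S, W and N are self-complementary.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "K": "M", "M": "K", "R": "Y", "Y": "R",
    "B": "V", "V": "B", "D": "H", "H": "D",
    "S": "S", "W": "W", "N": "N",
}


def normalize(sequence, rna=False):
    """Upper-case the sequence ('a' is treated as 'A') and reject unknown symbols."""
    allowed = set(IUPAC_DNA)
    if rna:
        allowed = (allowed - {"T"}) | {"U"}   # RNA uses U where DNA uses T
    seq = sequence.upper()
    unknown = set(seq) - allowed
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return seq


def possible_bases(symbol, rna=False):
    """Return the primary bases a single symbol may represent."""
    key = normalize(symbol, rna=rna)
    if rna:
        key = key.replace("U", "T")           # look up the DNA definition
    bases = IUPAC_DNA[key]
    return bases.replace("T", "U") if rna else bases


def complement(sequence):
    """Symbol-wise complement of a DNA sequence."""
    return "".join(COMPLEMENT[ch] for ch in normalize(sequence))
```

For example, `possible_bases("k")` gives `"GT"`, while `possible_bases("K", rna=True)` gives `"GU"`. A single table serves both DNA and RNA, which mirrors the remark that separate data banks are not needed.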
Some authors exploit the property of approximate repeats [7]. Approximate repeats are subsequences of a sequence which can be transformed into a copy of an earlier subsequence using edit operations such as substitution, insertion and deletion. However, the search process is time consuming, and the greedy approach used to save time misses long repeats, which prevents it from achieving high compression.

The Tandem Repeats Finder [8] is a program to analyze DNA sequences. It is used to find two or more contiguous exact or approximate copies of a pattern in a sequence. It helps to find which portions of a genomic sequence are similar and which are not.

Another compression technique, which divides the entire scanned DNA sequence into factors of length four, is Hashbased (Ateet Mehta et al., 2010) [9]. As its name suggests, the algorithm initially builds a hash table and assigns a unique character to each of the factors, which act as the hash keys. But this algorithm does not consider any junk characters in the sequence.

In this article, completely specified as well as all incompletely specified bases are considered. So the proposed compression techniques are flexible enough to handle a wide variety of bases.

The rest of the article is organized as follows. Section 3 two proposed generalized and specially designed algorithms which