International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 03 Issue: 05 | May-2016 www.irjet.net p-ISSN: 2395-0072 © 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 1208

Reverse Sequencing based Genome Sequence using Lossless Compression Algorithm

Rajesh Mukherjee 1, Subhrajyoti Mandal 2, Bijoy Mandal 1
1 Dept. of CSE, NSHM Knowledge Campus, Durgapur, WB, India
2 Datamatics Global Services Ltd, Bangalore, India

Abstract - We present a lossless compression algorithm for genome sequences based on reverse sequencing. The algorithm is founded on exact reverse matching and gives the best compression results on a standard benchmark of DNA sequences. Searching for all exact reverses in a very long DNA sequence is a non-trivial task, and finding the approximate reverses that are optimal for compression is time-consuming, requiring essentially quadratic time or more. Achieving both high speed and the best compression ratio is therefore challenging. The proposed DNA sequence compression method simultaneously achieves a better compression ratio and runs significantly faster than any existing compression program on the benchmark DNA sequences.

Key Words: lossless compression algorithm, encoding, decoding, palindrome, DNA sequences, ASCII character

1. INTRODUCTION

Adenine, Cytosine, Guanine and Thymine are the four bases found in DNA, abbreviated as A, C, G and T respectively. DNA sequencing is the process of determining the order of the nucleotides, or bases, in a genome, i.e. the order of the As, Cs, Gs and Ts that make up an organism's DNA. Sequencing the genome is an important step towards understanding it. A genome sequence is important because it contains clues about where genes are located, and these clues are useful for interpretation.
The human genome is made up of over 3 billion of these nucleotides. The human genome is about 20-40 percent repetitive DNA, but bacterial and viral genomes contain almost no repetition [1]. With the completion of the human genome project, an enormous quantity of different genome sequences is becoming available, with sizes ranging from millions to billions of nucleotides. In both the scientific and commercial communities there is intensive activity aimed at sequencing the DNA of many species and studying the variability of DNA between individuals of the same species. This produces huge amounts of information that need to be stored and communicated to a large number of people. Therefore, there is a great need for fast and efficient compression of DNA sequences [2].

From the viewpoint of information science, we can use compression techniques to capture the properties of DNA sequences. It is known that DNA sequences have two characteristic structures: reverse complements and approximate repeats. The reverse complement of a sequence is the reversed sequence with each symbol replaced by its complement. Approximate repeats are repeats that contain errors. Several special-purpose compression algorithms for DNA sequences have been developed (Grumbach and Tahi [3], Chen, Kwong and Li [4], Lanctot, Li and Yang [5]). These algorithms exploit these structures and can achieve a high compression ratio.

It is also known that the DNA macromolecule comprises two strands: the coding strand and the non-coding strand. The coding region contains the information (digital code) for synthesizing proteins. Only about ten percent of human genetic material consists of coding regions, i.e. genes. The rest is considered to be non-coding. The non-coding strand of DNA does not carry any information necessary to make proteins [6].
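As a brief illustration of the two structures named in this section, the following Python sketch computes a reverse complement and counts the errors between two candidate repeats. The function names are hypothetical and are not taken from the paper. Treating "errors" as substitutions only (no insertions or deletions) is a simplifying assumption made here.

```python
# Complement pairs for the four DNA bases: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse the sequence, then replace each base with its complement."""
    return seq[::-1].translate(COMPLEMENT)


def is_reverse_palindrome(seq: str) -> bool:
    """True when a sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def substitution_errors(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length fragments.

    Assumption: errors are substitutions only; real approximate repeats
    may also contain insertions and deletions, which this ignores.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(reverse_complement("AAGC"))                 # GCTT
    print(is_reverse_palindrome("GAATTC"))            # True
    print(substitution_errors("ACGTAC", "ACGAAC"))    # 1
```

A compressor that exploits these structures can replace a later occurrence of a fragment by a reference to an earlier fragment of which it is the reverse complement or an approximate copy, which is the general idea behind the special-purpose algorithms cited above.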
Since the coding region carries the information for synthesizing proteins and the non-coding region does not, the compression ratios of the coding and non-coding regions of a DNA sequence should differ, and the two regions should have different information-theoretic entropy. This is supported by a biological hypothesis (Lanctot, Li and Yang [5]). These observations raise a fundamental question about the nature of the DNA sequence, i.e. whether it is random or non-random.

Unfortunately, the compression of genetic sequences happens to be a very difficult task. At a glance, genetic sequences are very similar to random strings and have only well-hidden regularities. The classical algorithms fail to compress genetic sequences. These include compact and compress from Unix and the text compression algorithms provided in [Nel 91] [6], namely static and adaptive Huffman encodings, static and adaptive arithmetic encodings (including higher-order encodings), and various substitution algorithms based on Ziv and Lempel's methods for text compression. Rather than compressing, they expand the contents of the sequences, leading to negative compression rates [6].
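The failure of general-purpose compressors can be illustrated with a small Python sketch. A four-letter alphabet needs only 2 bits per base under a fixed-length code, so a compressor that spends more than 2 bits per base has, in effect, a negative compression rate relative to that baseline. The sketch uses zlib (Deflate, a Ziv-Lempel variant combined with Huffman coding) as a stand-in for the classical tools named above, and a synthetic uniformly random sequence as a stand-in for DNA. Both are assumptions made for illustration, the function name is hypothetical, and results on real genomes will differ.

```python
import random
import zlib

BASELINE_BITS_PER_BASE = 2.0  # fixed-length code for the alphabet {A, C, G, T}


def bits_per_base(seq: str, level: int = 9) -> float:
    """Compressed size in bits divided by the number of bases."""
    compressed = zlib.compress(seq.encode("ascii"), level)
    return 8 * len(compressed) / len(seq)


# Synthetic stand-in for a DNA sequence: uniformly random bases, fixed seed.
rng = random.Random(0)
synthetic = "".join(rng.choice("ACGT") for _ in range(100_000))

# A highly repetitive sequence, for contrast.
repetitive = "ACGT" * 25_000

if __name__ == "__main__":
    for name, seq in (("random", synthetic), ("repetitive", repetitive)):
        rate = bits_per_base(seq)
        verdict = "worse" if rate > BASELINE_BITS_PER_BASE else "better"
        print(f"{name}: {rate:.3f} bits/base ({verdict} than the 2-bit baseline)")
```

On the random sequence the rate stays above 2 bits per base, whereas the repetitive sequence compresses far below it. This is why special-purpose DNA compressors look for structure, such as reverses and approximate repeats, that text-oriented methods do not model.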