International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 03 Issue: 05 | May-2016 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 1208
Reverse Sequencing based Genome Sequence using Lossless
Compression Algorithm
Rajesh Mukherjee¹, Subhrajyoti Mandal², Bijoy Mandal¹
¹Dept. of CSE, NSHM Knowledge Campus, Durgapur, WB, India
²Datamatics Global Services Ltd, Bangalore, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - This paper presents a lossless compression
algorithm for genome sequences based on reverse sequencing.
The algorithm is founded on exact reverse matching and gives
the best compression results on the standard benchmark of DNA
sequences. Searching for all exact reverses in a very long DNA
sequence is a non-trivial task, and finding the approximate
reverses that are optimal for compression takes a long time
(essentially a quadratic-time search, or even more). Obtaining
both high speed and the best compression ratio is therefore
challenging. On the benchmark DNA sequences, the proposed
method simultaneously achieves a better compression ratio and
runs significantly faster than any existing compression
program.
Key Words: lossless compression algorithm, encoding,
decoding, palindrome, DNA sequences, ASCII character
1. INTRODUCTION
Adenine, Cytosine, Guanine and Thymine are the four
bases found in DNA, abbreviated as A, C, G and T
respectively. DNA sequencing determines the order of the
nucleotides, or bases, in a genome, i.e. the order of the
As, Cs, Gs and Ts that make up an organism's DNA.
Sequencing a genome is an important step towards
understanding it, because the sequence contains clues
about where the genes are located, and these clues are
useful for interpretation. The human genome is made up
of over 3 billion nucleotides. About 20-40 percent of the
human genome is repetitive DNA, whereas bacterial and
viral genomes contain almost no repetition [1].
With the completion of the Human Genome
Project, an enormous quantity of genome sequences is
becoming available, with sizes ranging from millions to
billions of nucleotides. In both the scientific and the
commercial communities there is intensive activity
aimed at sequencing the DNA of many species and at
studying the variability of DNA between individuals of
the same species. This produces huge amounts of
information that need to be stored and communicated to
a large number of people. Therefore, there is a great
need for fast and efficient compression of DNA
sequences [2]. From the viewpoint of information
science, compression techniques can be used to capture
the properties of DNA sequences. DNA sequences are
known to have two characteristic structures: reverse
complements and approximate repeats. The reverse
complement of a sequence is the reversed sequence in
which each symbol is replaced by its complement. The
approximate repeats are repeats that contain errors.
Several special-purpose compression algorithms for DNA
sequences have been developed (Grumbach and Tahi
[3], Chen, Kwong and Li [4], Lanctot, Li and Yang [5]).
These algorithms exploit the two structures and can
achieve a high compression ratio.
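The two structures described above can be illustrated with a minimal Python sketch. It is not the algorithm of this paper or of [3]-[5]. The function names are hypothetical, and "errors" are assumed here to be substitutions only (Hamming distance) between equal-length strings, whereas approximate repeats in practice may also involve insertions and deletions.

```python
# Complement table for the four DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence.

    Characters other than A, C, G, T are left unchanged by translate().
    """
    return seq.translate(COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_approximate_repeat(a: str, b: str, max_errors: int) -> bool:
    """Assumption: a repeat is 'approximate' if it differs from the
    original by at most max_errors substitutions."""
    return len(a) == len(b) and hamming_distance(a, b) <= max_errors
```

For example, `reverse_complement("AACG")` returns `"CGTT"`, and applying the function twice returns the original sequence.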
It is known that the DNA macromolecule
comprises two strands: the coding strand and the
non-coding strand. The coding region contains the
information (digital code) for synthesizing proteins.
Only about ten percent of human genetic material
consists of coding regions, i.e. genes; the rest is
considered to be non-coding. The non-coding strand of
DNA does not carry any information necessary to make
proteins [6]. Therefore, the compression ratios of the
coding and non-coding regions of a DNA sequence should
differ, and the two regions should have different
information-theoretic entropy. This is supported by a
biological hypothesis (Lanctot, Li and Yang [5]). These
observations raise a fundamental question about the
nature of DNA sequences: are they random or
non-random? Unfortunately, compressing genetic
sequences is a very difficult task. At a glance they are
very similar to random strings, and their regularities are
well hidden. Classical tools such as compact and
compress from Unix, and the text compression
algorithms provided in [Nel 91] [6], namely static and
adaptive Huffman encoding, static and adaptive
arithmetic encoding (including higher-order encodings)
and various substitution algorithms based on Ziv and
Lempel's methods, fail to compress genetic sequences.
Instead they expand the sequences, leading to negative
compression rates [6].
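The entropy argument above can be made concrete with a small Python sketch. A four-letter alphabet can always be stored with 2 bits per base, so this sketch assumes that a "negative compression rate" means an output larger than that baseline. The function names are hypothetical, and zlib (a Lempel-Ziv based compressor from the Python standard library) is used only as a stand-in for the classical tools named above; it is not one of the programs evaluated in [6].

```python
import math
import zlib
from collections import Counter


def order0_entropy(seq: str) -> float:
    """Empirical order-0 entropy in bits per base.

    Each base is treated as an independent symbol; this is at most
    2 bits for the alphabet {A, C, G, T}.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def zlib_bits_per_base(seq: str) -> float:
    """Bits per base used by zlib on the one-byte-per-base ASCII text."""
    if not seq:
        raise ValueError("sequence must not be empty")
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return 8 * len(compressed) / len(seq)
```

Comparing `zlib_bits_per_base(seq)` with the 2-bit baseline for a coding and a non-coding region of the same genome is one simple way to examine whether the two regions compress differently.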