NP3: A New DNA Compression Algorithm

Paul Gardner-Stephen and Greg Knowles
Embedded Systems Laboratory, School of Informatics and Engineering, Flinders University, GPO Box 2100, Adelaide 5001, Australia

Abstract:- NP3 is a nucleotide database compression algorithm that takes advantage of the redundancy among sequences in genomic databases. We demonstrate the effectiveness of the NP3 algorithm by compressing the bodies of the UniGene [1] transcriptome database at 0.71 bits per base, while offering sequential and random access decompression speeds peaking in excess of 500,000 sequences per second. This is significantly better than the results offered by existing algorithms [2, 3, 4, 5, 6].

Key-Words:- Data compression, Blast, DNA, UniGene, Human Genome, zip

1 Introduction

There is little doubt that genomic databases are growing in size at a rate which exceeds that of both Moore's Law and the increase in hard disk storage capacities. There is also the rising spectre of mass sequencing systems entering production in the future. If computer systems are swamped by current data volumes, then sequencing systems capable of perhaps 10^8 bases per hour threaten to drown them.

However, it can be observed that while the volume of sequencing data is increasing exponentially, the increase in entropy is much closer to linear. The homology between species being sequenced, the homology among individuals and varieties of organisms, and the homology of sequences within an organism together suggest that a significant proportion of the entropy presented by individual sequences is in fact common. It therefore follows that it should be possible to produce compression algorithms which take advantage of this inter-sequence redundancy to produce extremely compact representations.
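The bits-per-base figures quoted in this paper can be related to a simple calculation. The following is a minimal sketch, not part of NP3: it assumes a plain four-letter alphabet, uses the general-purpose zlib compressor as a stand-in, and all function names are hypothetical. It shows that a sequence with no internal structure costs about 2 bits per base, while a second, nearly identical sequence adds very little once the compressor can refer back to the first.

```python
import math
import random
import zlib
from collections import Counter


def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    """Express a compressed size in bits per nucleotide."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    return compressed_bytes * 8 / num_bases


def zero_order_entropy(sequence: str) -> float:
    """Empirical entropy in bits per symbol, ignoring all context."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy_demo(length: int = 2000, seed: int = 0) -> tuple:
    """Compress one synthetic sequence, then it plus a near-identical copy.

    Returns (bits per base alone, bits per base for the pair).
    zlib is used only as an illustration; it is not the NP3 algorithm.
    """
    rng = random.Random(seed)
    first = "".join(rng.choices("ACGT", k=length))
    # Hypothetical homologue: the same sequence with one substituted base.
    mid = length // 2
    swapped = "A" if first[mid] != "A" else "C"
    second = first[:mid] + swapped + first[mid + 1:]
    alone = len(zlib.compress(first.encode("ascii"), 9))
    together = len(zlib.compress((first + second).encode("ascii"), 9))
    return bits_per_base(alone, length), bits_per_base(together, 2 * length)


if __name__ == "__main__":
    single, pair = redundancy_demo()
    print(f"one sequence:  {single:.2f} bits/base")
    print(f"two homologues: {pair:.2f} bits/base")
```

The zero-order entropy is the same for both sequences in the demo, yet the cost per base of the pair is lower, which is the sense in which total entropy can grow more slowly than total data volume.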
As the volume of sequencing grows and the redundancy increases, it is not unreasonable to envisage compression ratios an order of magnitude better than the 1.6 to 1.8 bits/base of current algorithms. This is the motivation of NP3: to produce an algorithm which focuses on harnessing the inter-sequence redundancy which is present and will continue to grow as aggregate database sizes continue to balloon. Variants of this approach have been employed in the past; for example, the EMBL-ALIGN database efficiently stores groups of sequences as multiple alignments.

The general data flow of the NP3 algorithm proceeds as follows: the Fasta format database is initially parsed and separated into sequence description and body streams, which are then handled separately. The remaining processes are described in the sections below, which explain how the compressed representations of the description and body data are generated, collated and segmented to produce an NP3 file.

Finally, as genetic databases are continually evolving, we have designed the NP3 file format to be easy to patch, so that changes can be incorporated without the need to re-compress the whole database.

1.1 Discovery and Indexing of Redundancy in Sequence Bodies

Redundancy is discovered within and among sequences by searching for matching regions in a window of sequence data spanning on the order of several hundred sequences or kilobases, whichever is the lesser. This allows the discovery of redundancy between sequences with substantially conserved regions. Consider Figure 1, where three sequences share varying portions of a common subsequence, as indicated by the striped region. In this case sequence #2 contains a region which also oc-

Proceedings of the 3rd WSEAS International Conference on Mathematical Biology and Ecology, Gold Coast, Queensland, Australia, January 17-19, 2007 21