DNA CODING USING FINITE-CONTEXT MODELS AND ARITHMETIC CODING

Armando J. Pinho, António J. R. Neves, Carlos A. C. Bastos and Paulo J. S. G. Ferreira

Signal Processing Lab, DETI / IEETA
University of Aveiro, 3810–193 Aveiro, Portugal
ap@ua.pt / an@ua.pt / cbastos@ua.pt / pjf@ua.pt

ABSTRACT

The interest in DNA coding has been growing with the availability of extensive genomic databases. Although two bits are sufficient to encode each of the four DNA bases, efficient lossless compression methods are still needed, both because of the size of DNA sequences and because standard compression algorithms do not perform well on them. As a result, several specific coding methods have been proposed. Most of these methods are based on searching procedures for finding exact or approximate repeats. Low-order finite-context models have only been used as secondary, fallback mechanisms. In this paper, we show that finite-context models can also be used as main DNA encoding methods. We propose a coding method based on two finite-context models that compete for the encoding of data, on a block-by-block basis. The experimental results confirm the effectiveness of the proposed method.

Index Terms: DNA coding, source coding, finite-context modeling, bioinformatics, arithmetic coding.

1. INTRODUCTION

Recently, with the completion of the human genome sequencing, the development of efficient lossless compression methods for DNA sequences has gained considerable interest [1–7]. For example, the human genome comprises approximately 3 000 million base pairs [8], whereas the wheat genome has about 16 000 million [9]. Since DNA is based on an alphabet of four different symbols (usually known as nucleotides or bases), namely, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), it takes approximately 750 MBytes to store the human genome (using $\log_2 4 = 2$ bits per symbol) and 4 GBytes to store the wheat genome.
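The storage figures above follow directly from the two-bits-per-base baseline. The short Python sketch below reproduces that arithmetic and shows one possible way of packing four bases into a byte. The function names and the particular base-to-code mapping (A=0, C=1, G=2, T=3) are illustrative assumptions, not part of the paper.

```python
# Assumed mapping of bases to 2-bit codes (any fixed bijection would do).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def raw_size_bytes(n_bases):
    """Size in bytes of n_bases symbols stored at log2(4) = 2 bits each."""
    return n_bases * 2 / 8


def pack_2bit(seq):
    """Pack a DNA string into bytes, four bases per byte.

    The first base of each group goes into the two most significant bits;
    the last byte is zero-padded if len(seq) is not a multiple of four.
    """
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (3 - i % 4))
    return bytes(out)


def unpack_2bit(data, n_bases):
    """Inverse of pack_2bit; n_bases is needed to discard the padding."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (3 - i % 4))) & 3] for i in range(n_bases)
    )


if __name__ == "__main__":
    print(raw_size_bytes(3_000_000_000) / 1e6, "MBytes (human genome)")
    print(raw_size_bytes(16_000_000_000) / 1e9, "GBytes (wheat genome)")
```

Any lossless DNA coder is useful only if it achieves fewer than 2 bits per base on average, which is why this fixed-length packing serves as the reference point.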
In previous work [10, 11], we proposed a three-state finite-context model for DNA protein-coding regions, i.e., for the parts of the DNA that carry information regarding how proteins are synthesized. This three-state model proved to be better than a single-state model, giving additional evidence of a phenomenon that is common in these protein-coding regions, i.e., a periodicity of period three.

This work was supported in part by the FCT (Fundação para a Ciência e Tecnologia) grant PTDC/EIA/72569/2006.

More recently [12], we investigated the performance of finite-context models for unrestricted DNA, i.e., DNA including coding and non-coding parts. In that work, we showed that a characteristic usually found in DNA sequences, the occurrence of inverted repeats, which is used by most DNA coding methods (see, for example, [4–6]), could also be successfully integrated in finite-context models. Inverted repeats are copies of DNA sub-sequences that appear reversed and complemented (A ↔ T, C ↔ G) in some parts of the DNA.

In this paper, we propose a lossless coding method for DNA sequences based on finite-context models and arithmetic coding. It uses two competing finite-context models that capture the statistical information along the sequence and that compete, on a block basis, to encode the data. For each block, the better of the two models is chosen, i.e., the one that requires fewer bits to represent the block. Moreover, we give experimental evidence that a correct tuning of the parameter controlling the Lidstone estimator (which is a generalization of the Laplace law of succession [13] and also contains the Jeffreys [14] / Krichevsky-Trofimov estimator [15] as a special case) is most relevant in the case of the higher-order finite-context model.
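To make the idea of block-wise competition concrete, the following Python sketch pairs two finite-context models of different orders, each using a Lidstone estimator, and selects for every block the model with the smaller ideal code length. It is only an illustration under stated assumptions: the class and function names, the orders, the estimator parameters (called `alpha` here), the block size, the one-bit-per-block selection flag, and the choice to update both models on every symbol are all hypothetical, and the code sums $-\log_2 p$ instead of driving a real arithmetic coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class FiniteContextModel:
    """Order-k finite-context model with a Lidstone probability estimator."""

    def __init__(self, order, alpha):
        self.order = order
        # alpha = 1 gives Laplace's rule; alpha = 1/2 gives the
        # Jeffreys / Krichevsky-Trofimov estimator.
        self.alpha = alpha
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def _context(self, seq, i):
        # The `order` symbols preceding position i (shorter near the start).
        return seq[max(0, i - self.order):i]

    def prob(self, seq, i):
        """Lidstone estimate of P(seq[i] | context) from the counts so far."""
        c = self.counts[self._context(seq, i)]
        n = c[ALPHABET.index(seq[i])]
        return (n + self.alpha) / (sum(c) + len(ALPHABET) * self.alpha)

    def update(self, seq, i):
        self.counts[self._context(seq, i)][ALPHABET.index(seq[i])] += 1


def block_competition_cost(seq, block_size=8, low=(2, 1.0), high=(4, 0.5)):
    """Return (total ideal bits, per-block choices) for two competing models.

    Assumptions of this sketch: both models are updated with every symbol,
    whichever one wins the block, and one flag bit per block tells the
    decoder which model was used.
    """
    models = [FiniteContextModel(*low), FiniteContextModel(*high)]
    total_bits = 0.0
    choices = []
    for start in range(0, len(seq), block_size):
        block_bits = [0.0, 0.0]
        for i in range(start, min(start + block_size, len(seq))):
            for m, model in enumerate(models):
                # Cost is computed before the update, so it is causal.
                block_bits[m] += -math.log2(model.prob(seq, i))
                model.update(seq, i)
        best = 0 if block_bits[0] <= block_bits[1] else 1
        choices.append(best)
        total_bits += block_bits[best] + 1.0  # +1 bit of side information
    return total_bits, choices


if __name__ == "__main__":
    bits, choices = block_competition_cost("ACGTTGCAACGTACGT" * 16)
    print(f"{bits / (16 * 16):.3f} bits per base, choices: {choices}")
```

Because each probability depends only on symbols already seen, a decoder that reads the flag bit can reproduce the same estimates and stay synchronized. With many possible contexts and few observations per context, the value of `alpha` has a large effect on the estimates, which is consistent with the remark above that tuning matters most for the higher-order model.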
The experimental results show that the proposed codec gives very competitive compression results and that, therefore, finite-context models can be used as the main method for lossless coding of DNA sequences.

This paper is organized as follows. In Section 2 we describe our algorithm and, in particular, how we collect the statistical information needed by the arithmetic coder. In Section 3 we provide experimental results obtained with our method and compare them with those of one of the most recent specialized methods. Finally, in Section 4 we draw some conclusions.

978-1-4244-2354-5/09/$25.00 ©2009 IEEE, ICASSP 2009

2. THE PROPOSED METHOD

In this work, we propose a DNA lossless compression method that is based on two finite-context models of different orders that compete for encoding the data. Because DNA data are