DNA CODING USING FINITE-CONTEXT MODELS AND ARITHMETIC CODING
Armando J. Pinho, António J. R. Neves, Carlos A. C. Bastos and Paulo J. S. G. Ferreira
Signal Processing Lab, DETI / IEETA
University of Aveiro, 3810–193 Aveiro, Portugal
ap@ua.pt / an@ua.pt / cbastos@ua.pt / pjf@ua.pt
ABSTRACT
The interest in DNA coding has been growing with the avail-
ability of extensive genomic databases. Although only two
bits are sufficient to encode the four DNA bases, efficient
lossless compression methods are still needed due to the size
of DNA sequences and because standard compression algo-
rithms do not perform well on DNA sequences. As a result,
several specific coding methods have been proposed. Most of
these methods are based on searching procedures for finding
exact or approximate repeats. Low-order finite-context models
have only been used as secondary, fallback mechanisms.
In this paper, we show that finite-context models can also be
used as main DNA encoding methods. We propose a cod-
ing method based on two finite-context models that compete
for the encoding of data, on a block-by-block basis. The
experimental results confirm the effectiveness of the proposed
method.
Index Terms— DNA coding, source coding, finite-
context modeling, bioinformatics, arithmetic coding.
1. INTRODUCTION
With the completion of the human genome sequencing, the
development of efficient lossless compression methods for DNA
sequences has gained considerable interest [1–7]. For example,
the human genome consists of approximately 3 000 million base
pairs [8], whereas the wheat genome has about 16 000
million [9]. Since DNA is based
on an alphabet of four different symbols (usually known as
nucleotides or bases), namely, Adenine (A), Cytosine (C),
Guanine (G), and Thymine (T), it takes approximately 750
MBytes to store the human genome (using $\log_2 4 = 2$ bits per
symbol) and 4 GBytes to store the wheat genome.
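The storage figures above follow directly from a fixed 2-bit code per base. The short Python sketch below reproduces that arithmetic and shows one possible way of packing four bases into a byte. The particular mapping of bases to 2-bit codes and the function names are illustrative assumptions, not part of the proposed method.

```python
import math

BITS_PER_BASE = math.log2(4)  # 2 bits for a four-symbol alphabet

def raw_size_bytes(num_bases: int) -> float:
    """Size in bytes when every base is stored with a fixed 2-bit code."""
    return num_bases * BITS_PER_BASE / 8

# Hypothetical mapping: any assignment of the four 2-bit codes would do.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)

human = raw_size_bytes(3_000_000_000)   # about 750 MBytes
wheat = raw_size_bytes(16_000_000_000)  # about 4 GBytes
```

This 2 bits per base baseline is the figure that any specialized DNA coder has to beat.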
In previous work [10, 11], we proposed a three-state
finite-context model for DNA protein-coding regions, i.e., for
the parts of the DNA that carry information regarding how
proteins are synthesized. This three-state model proved to be
better than a single-state model, giving additional evidence
of a phenomenon that is common in these protein-coding
regions, i.e., a periodicity of period three.
This work was supported in part by the FCT (Fundação para a Ciência e
Tecnologia) grant PTDC/EIA/72569/2006.
More recently [12], we investigated the performance of
finite-context models for unrestricted DNA, i.e., DNA including
both coding and non-coding parts. In that work, we showed
that a characteristic usually found in DNA sequences, the
occurrence of inverted repeats, which is exploited by most
DNA coding methods (see, for example, [4–6]), could also be
successfully integrated in finite-context models. Inverted
repeats are copies of DNA sub-sequences that appear reversed
and complemented (A ↔ T, C ↔ G) in some parts of the
DNA.
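As a small illustration of the definition above, the following Python sketch computes the inverted repeat (reversed and complemented copy) of a sub-sequence. The function name is an assumption made for this example. How inverted repeats are integrated into the finite-context models is described in [12] and is not reproduced here.

```python
# Complement table: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def inverted_repeat(seq: str) -> str:
    """Return the reversed and complemented copy of a DNA sub-sequence."""
    return seq.translate(COMPLEMENT)[::-1]

example = inverted_repeat("AACG")  # complement is "TTGC", reversed gives "CGTT"
```

Applying the operation twice returns the original sub-sequence, so a sequence and its inverted repeat carry the same information.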
In this paper, we propose a lossless coding method for
DNA sequences based on finite-context models and arithmetic
coding. It uses two competing finite-context models
that capture the statistical information along the sequence
and, on a block basis, compete to encode the data. For
each block, the better of the two models is chosen, i.e., the
one that requires fewer bits to represent the block. Moreover,
we give experimental evidence that correct tuning of
the parameter controlling the Lidstone estimator (which is
a generalization of the Laplace law of succession [13] and
also contains the Jeffreys [14] / Krichevsky-Trofimov estimator
[15] as a special case) is most relevant in the case of the
higher-order finite-context model. The experimental results
obtained show that the proposed codec is able to give very
competitive compression results and that, therefore, finite-
context models can be used as the main method for lossless
coding of DNA sequences.
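The idea outlined in this paragraph can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the codec itself: the model orders, the Lidstone parameters, the block size and all names are hypothetical; the arithmetic coder is replaced by the ideal code length, i.e., minus the base-2 logarithm of the estimated probability; both models are updated with every symbol regardless of which one wins the block; and the side information needed to signal the chosen model is not counted. The Lidstone estimate with parameter delta is (n_s + delta) / (n + 4 delta), where n_s is the count of symbol s in the current context and n is the total count for that context; delta = 1 gives the Laplace estimator and delta = 1/2 the Jeffreys / Krichevsky-Trofimov estimator.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

class FiniteContextModel:
    """Order-k model: symbol counts conditioned on the k preceding symbols."""

    def __init__(self, order: int, delta: float):
        self.order = order
        self.delta = delta  # Lidstone parameter (1 -> Laplace, 1/2 -> Jeffreys / KT)
        self.counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))

    def prob(self, context: str, symbol: str) -> float:
        """Lidstone estimate (n_s + delta) / (n + delta * |alphabet|)."""
        c = self.counts[context]
        total = sum(c.values())
        return (c[symbol] + self.delta) / (total + self.delta * len(ALPHABET))

    def update(self, context: str, symbol: str) -> None:
        self.counts[context][symbol] += 1

def block_competition(seq, models, block_size):
    """For each block, return (index of the winning model, its ideal bit cost)."""
    choices = []
    for start in range(0, len(seq), block_size):
        end = min(start + block_size, len(seq))
        bits = [0.0] * len(models)
        for i in range(start, end):
            for m, model in enumerate(models):
                # Contexts near the start of the sequence are simply shorter.
                ctx = seq[max(0, i - model.order):i]
                bits[m] -= math.log2(model.prob(ctx, seq[i]))
                model.update(ctx, seq[i])  # assumption: both models always learn
        best = min(range(len(models)), key=bits.__getitem__)
        choices.append((best, bits[best]))
    return choices

# Hypothetical settings chosen only for illustration.
seq = "ACGT" * 50
models = [FiniteContextModel(order=2, delta=1.0),
          FiniteContextModel(order=8, delta=1 / 16)]
result = block_competition(seq, models, block_size=20)
total_bits = sum(b for _, b in result)
```

On a highly repetitive toy sequence such as this one, the total ideal cost falls well below the 2 bits per base baseline. A small delta lets a model with many sparsely populated contexts commit quickly to the symbols it has already seen, which is one intuition for why this parameter matters most for the higher-order model.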
This paper is organized as follows. In Section 2 we de-
scribe our algorithm, and in particular how we collect the
statistical information needed by the arithmetic coding. In
Section 3 we provide experimental results obtained with our
method and we compare the results with one of the most re-
cent specialized methods. Finally, in Section 4 we draw some
conclusions.
2. THE PROPOSED METHOD
In this work, we propose a DNA lossless compression method
that is based on two finite-context models of different orders
that compete for encoding the data. Because DNA data are
1693 978-1-4244-2354-5/09/$25.00 ©2009 IEEE ICASSP 2009