1-4244-0983-7/07/$25.00 ©2007 IEEE ICICS 2007
Comprehensive Autoregressive Modeling for
Classification of Genomic Sequences
Mahmood Akhtar, Eliathamby Ambikairajah, Julien Epps
School of Electrical Engineering and Telecommunications,
The University of New South Wales, Sydney 2052, Australia
mahmood@unsw.edu.au, ambi@ee.unsw.edu.au, j.epps@unsw.edu.au
Abstract—In this paper, we propose the novel use of an
autoregressive (AR) model to produce a multi-dimensional
feature for distinguishing between genomic protein coding and
non-coding regions at the nucleotide level. In contrast to
previous research, in which AR models were used to estimate a
single frequency, here AR model parameters characterizing the
entire short-term sequence spectrum are employed as a feature in
conjunction with Gaussian mixture model-based classification.
The optimized AR-based features are then combined with other
signal processing based time-domain and frequency-domain
features to advance detection accuracy for the coding/non-coding
region classification problem. The system described herein is
shown to produce identification accuracies of more than 78.9%
and 81.6% for protein coding and non-coding nucleotides,
respectively, when evaluated on the GENSCAN test set.
Keywords—DNA, autoregressive models, discrete Fourier
transforms, discrete cosine transforms, Gaussian mixture models
I. INTRODUCTION
One of the important problems in deoxyribonucleic acid
(DNA) sequence analysis is to identify protein coding regions.
In eukaryotic genes, these regions are commonly known as
exons, and are separated by relatively large non-coding regions
known as introns. The intergenic and intronic regions make up
most of the genome. For example, in the human genome the
exonic fraction is as low as 2%. Despite the existence of many
applications in this area, the accuracy of exon detection is still
limited. Techniques applied in the past include the
autocorrelation function (ACF) [1], discrete Fourier transforms
[2, 3, 4, 5], digital filters [6], time-domain algorithms [7],
autoregressive (AR) models [7, 8], and singular value
decomposition [9]. Almost all of the existing techniques exploit
the period-3 behavior of exons [1], according to which
particular DNA nucleotides tend to recur at the same codon
position in exon regions (a codon being a triplet drawn from
the four nucleotide types A, C, G, and T). Such occurrences in
intergenic and intronic regions are random. The recently
proposed paired and weighted spectral rotation (PWSR)
measure [10], however, successfully incorporates an alternative
statistical property of genomic sequences according to which
introns are rich in nucleotides ‘A’ and ‘T’ whereas exons are
rich in nucleotides ‘C’ and ‘G’. In addition to this
complementary property, the PWSR computes DFT magnitude
and phase angle on both DNA strands. Multi-dimensional features
combining time-domain and frequency-domain measures have
already been used to classify protein coding and non-coding
nucleotides [11]. Chakravarthy et al.
[12] have recently used the AR model parameters of a template
sequence as a feature to identify repeats and similar segments
in a long DNA sequence, using different numerical
representations of the DNA character string. Their work,
however, does not address classification between protein coding
and non-coding regions on large databases. Furthermore,
discriminatory features from both protein coding and
non-coding regions are required for highly accurate separation
of the two types of regions.
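The two statistical properties mentioned above, period-3 behavior and C/G richness, can be illustrated with a minimal sketch. It assumes binary indicator sequences (one per nucleotide) and measures the DFT power at frequency index N/3; the function names are hypothetical, and this is not the PWSR measure of [10] or the feature set of [11].

```python
import numpy as np

def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per nucleotide."""
    seq = dna.upper()
    return {b: np.array([1.0 if ch == b else 0.0 for ch in seq]) for b in "ACGT"}

def period3_power(dna):
    """Sum over A, C, G, T of the DFT power at frequency index N/3 (period 3)."""
    n = len(dna) - len(dna) % 3          # trim so that N is a multiple of 3
    k = n // 3                           # DFT bin corresponding to period 3
    total = 0.0
    for x in indicator_sequences(dna[:n]).values():
        total += abs(np.fft.fft(x)[k]) ** 2
    return total

def gc_fraction(dna):
    """Fraction of nucleotides in the string that are C or G."""
    seq = dna.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

A perfectly periodic string such as "ATG" repeated many times concentrates its power at the N/3 bin, whereas a sequence with no codon structure spreads its power across all bins, so its value at that bin is much lower.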
In this paper, we extend the use of AR model parameters
and show that, by optimizing the AR model order and data
window length, multi-dimensional features obtained from
protein coding and non-coding regions of the GENSCAN
training set can be effectively used to train Gaussian mixture
models (GMMs) for the classification of these regions in the
GENSCAN test set. AR features are then combined with the
time-frequency hybrid (TFH) feature to model the two types of
regions. Finally, the classification results are compared against
previously published work [11].
II. DIGITAL SIGNAL PROCESSING METHODS
A. Feature Extraction
In this subsection, two different methods for genomic
sequence feature extraction, based on global spectral
characteristics of the sequence and the period-3 behavior of
exonic regions of the sequence, are described. AR models
are used to capture the spectral characteristics of the modeled
coding and non-coding sequences, whereas signal processing
based time-domain and frequency-domain methods are
employed to extract period-3 based features from coding
regions only.
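As a rough illustration of how AR parameters can summarize the spectrum of a short data window, the sketch below estimates predictor coefficients with the autocorrelation (Yule-Walker) method and collects them over sliding windows. Several things here are assumptions rather than details taken from the text: the estimation method, the sign convention x_hat(n) = a_1 x(n-1) + ... + a_p x(n-p), the function names, and the input being an already numerically mapped sequence.

```python
import numpy as np

def ar_coefficients(x, order):
    """Estimate AR(order) predictor coefficients via the Yule-Walker equations.

    Assumed convention: x_hat[n] = sum_{k=1..order} a[k-1] * x[n-k].
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                     # remove the mean before modeling
    n = len(x)
    # Biased autocorrelation estimates r[0], ..., r[order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R, then solve R a = [r[1], ..., r[order]]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Least squares is used so a constant (degenerate) window does not raise
    return np.linalg.lstsq(R, r[1:], rcond=None)[0]

def ar_features(x, order, window, hop):
    """Return one AR coefficient vector per sliding window of the signal."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - window + 1, hop)
    return np.array([ar_coefficients(x[s:s + window], order) for s in starts])
```

Each row of the array returned by `ar_features` is one multi-dimensional feature vector. The model order and window length are left as free parameters, since the text describes optimizing both.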
1) Autoregressive model based features: After
considerable success in speech processing, AR modeling has
recently been used in [7, 8] to identify period-3 behavior in
exons. The work herein, however, does not exclusively identify
the period-3 component. Rather, the linear predictor
coefficients are used to model protein coding and non-coding
regions of genomic sequences, in terms of their global spectral
characteristics. The coefficients of a pth-order forward linear
predictor can be used to predict the current sample of the AR
process as a linear combination of previous samples, as
follows: