1-4244-0983-7/07/$25.00 ©2007 IEEE ICICS 2007

Comprehensive Autoregressive Modeling for Classification of Genomic Sequences

Mahmood Akhtar, Eliathamby Ambikairajah, Julien Epps
School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney 2052, Australia
mahmood@unsw.edu.au, ambi@ee.unsw.edu.au, j.epps@unsw.edu.au

Abstract—In this paper, we propose the novel use of an autoregressive (AR) model to produce a multi-dimensional feature for distinguishing between genomic protein coding and non-coding regions at the nucleotide level. In contrast to previous research, in which AR models were used to estimate a single frequency, here AR model parameters characterizing the entire short-term sequence spectrum are employed as a feature in conjunction with Gaussian mixture model-based classification. The optimized AR-based features are then combined with other signal processing based time-domain and frequency-domain features to improve detection accuracy for the coding/non-coding region classification problem. The system described herein is shown to produce identification accuracies of more than 78.9% and 81.6% for protein coding and non-coding nucleotides, respectively, when evaluated on the GENSCAN test set.

Keywords—DNA, autoregressive models, discrete Fourier transforms, discrete cosine transforms, Gaussian mixture models

I. INTRODUCTION

One of the important problems in deoxyribonucleic acid (DNA) sequence analysis is to identify protein coding regions. In eukaryotic genes, these regions are commonly known as exons, and are separated by relatively large non-coding regions known as introns. The intergenic and intronic regions make up most of the genome. For example, in the human genome the exonic fraction is as low as 2%. Despite the existence of many applications in this area, the accuracy of exon detection is still limited.
Techniques applied in the past include the autocorrelation function (ACF) [1], discrete Fourier transforms [2, 3, 4, 5], digital filters [6], time-domain algorithms [7], autoregressive (AR) models [7, 8], and singular value decomposition [9]. Almost all of the existing techniques exploit the period-3 behavior of exons [1], according to which particular DNA nucleotides are repeated in identical codon positions in exon regions (a codon being a triplet drawn from the four nucleotide types A, C, G, and T). In intergenic and intronic regions, such occurrences are random. The recently proposed paired and weighted spectral rotation (PWSR) measure [10], however, successfully incorporates an alternative statistical property of genomic sequences, according to which introns are rich in nucleotides ‘A’ and ‘T’ whereas exons are rich in nucleotides ‘C’ and ‘G’. In addition to this complementary property, the PWSR computes the DFT magnitude and phase angle on both DNA strands. Multi-dimensional combined time-domain and frequency-domain features have already been used to classify protein coding and non-coding nucleotides [11].

Chakravarthy et al. [12] have recently used the AR model parameters of a template sequence as a feature to identify repeats and similar segments in a long DNA sequence, using different numerical representations of the DNA character string. Their work, however, does not address classification of protein coding and non-coding regions on large databases. Furthermore, discriminatory features from both protein coding and non-coding regions are required for highly accurate separation of the two types of regions.
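The period-3 property exploited by most of the techniques cited above can be illustrated with a short numerical sketch. The code below is not the method of this paper or of any cited work; it assumes the commonly used binary indicator representation (one 0/1 sequence per nucleotide) and an uppercase A/C/G/T input, and the function name `period3_power` is hypothetical. It sums, over the four nucleotides, the squared DFT magnitude at frequency 1/3 and normalises by the sequence length.

```python
import cmath
import random

NUCLEOTIDES = "ACGT"
# The three possible values of exp(-j*2*pi*n/3), indexed by n mod 3.
TWIDDLES = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]


def period3_power(seq):
    """Length-normalised spectral power at frequency 1/3, summed over A, C, G, T.

    Illustrative assumption: each nucleotide is mapped to a binary indicator
    sequence (1 where the nucleotide occurs, 0 elsewhere).
    """
    n = len(seq) - len(seq) % 3  # keep whole codons so that k = N/3 is an exact DFT bin
    if n == 0:
        raise ValueError("sequence must contain at least three nucleotides")
    total = 0.0
    for base in NUCLEOTIDES:
        # DFT coefficient of the indicator sequence of `base` at k = N/3
        coeff = sum(TWIDDLES[i % 3] for i, ch in enumerate(seq[:n]) if ch == base)
        total += abs(coeff) ** 2
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy comparison is repeatable
    periodic = "ATG" * 50   # toy sequence with perfect codon-position repetition
    shuffled = "".join(rng.choice(NUCLEOTIDES) for _ in range(150))
    print("periodic:", period3_power(periodic))
    print("random:  ", period3_power(shuffled))
```

On the toy input, the perfectly periodic sequence scores far higher than the random one, which is the contrast that period-3 based detectors rely on. Real exons show a much weaker peak, so practical systems evaluate such a measure over a sliding window.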
In this paper, we extend the use of AR model parameters and show that, by optimizing the AR model order and data window length, multi-dimensional features obtained from protein coding and non-coding regions of the GENSCAN training set can be effectively used to train Gaussian mixture models (GMMs) for the classification of these regions in the GENSCAN test set. AR features are then combined with the time-frequency hybrid (TFH) feature to model the two types of regions. Finally, the classification results are compared against the previously published work in [11].

II. DIGITAL SIGNAL PROCESSING METHODS

A. Feature Extraction

In this subsection, two different methods for genomic sequence feature extraction are described, based respectively on the global spectral characteristics of the sequence and on the period-3 behavior of its exonic regions. The AR models are used to capture the spectral characteristics of the modeled coding and non-coding sequences, whereas signal processing based time-domain and frequency-domain methods are employed to extract period-3 based features from coding regions only.

1) Autoregressive model based features: After considerable success in speech processing, AR modeling has recently been used in [7, 8] to identify period-3 behavior in exons. The work herein, however, does not exclusively identify the period-3 component. Rather, the linear predictor coefficients are used to model protein coding and non-coding regions of genomic sequences in terms of their global spectral characteristics. The coefficients of a pth-order forward linear predictor can be used to predict the current sample of the AR process as a linear combination of previous samples, as follows: