S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 630 – 635, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Intelligent Data Recognition of DNA Sequences
Using Statistical Models
Jitimon Keinduangjun¹, Punpiti Piamsa-nga¹, and Yong Poovorawan²
¹ Department of Computer Engineering, Faculty of Engineering, Kasetsart University,
Bangkok, 10900, Thailand
{jitimon.k, punpiti.p}@ku.ac.th
² Department of Pediatrics, Faculty of Medicine, Chulalongkorn University,
Bangkok, 10400, Thailand
yong.p@chula.ac.th
Abstract. Intelligent data acquisition from biological sequences is a hard and
challenging problem, since most biological sequences contain large amounts of
diverse data whose meaning is largely unknown. However, intelligent data
acquisition reduces the demand for computationally intensive methods, because
the resulting data are more compact and more precise. We propose a novel approach for discovering
sequence signatures, which carry information distinctive enough to
identify the sequences. The signatures are derived from the best combination
of n-grams and statistical scoring models. In our experiments applying them
to identify the Influenza virus, we found that identifiers built from n-gram
signatures that are too short, or from inappropriate scoring models, perform
poorly, because such combinations of n-gram signatures and scoring models
produce unbalanced class and pattern score distributions. The other
identifiers, which use an appropriate combination, achieve accuracy from over
80% up to 100%. In addition to succeeding at signature recognition, our
proposed approach requires little computation time for biological sequence
identification.
1 Introduction
The rapid growth of genomic and sequencing technologies during the past few
decades has produced an enormous volume of diverse genomic data, such as DNA
and protein sequences. However, most biological sequences have very little known
meaning. Techniques for knowledge acquisition from sequences have therefore become
more important for transforming the sequences into useful, concise and compact
information. These techniques generally require long computation times; their
accuracy usually depends on data size; and there is no best known solution. Many
sequence processing projects nevertheless share a common experimental stage:
recognition of the most significant characteristics (intelligent data, i.e., signatures).
Signatures are short, informative pieces of data that can identify the types of sequences.
Recently, several areas of biological research have come to regard informative
signatures as a key to success, since signatures help reduce computation time
and yield data that are more compact and more precise [4]. Many