S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 630–635, 2005. © Springer-Verlag Berlin Heidelberg 2005

Intelligent Data Recognition of DNA Sequences Using Statistical Models

Jitimon Keinduangjun 1, Punpiti Piamsa-nga 1, and Yong Poovorawan 2

1 Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
{jitimon.k, punpiti.p}@ku.ac.th
2 Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, 10400, Thailand
yong.p@chula.ac.th

Abstract. Intelligent data acquisition from biological sequences is a hard and challenging problem, since most biological sequences contain huge amounts of diverse data whose meaning is largely unknown. However, intelligent data acquisition reduces the demand for computationally expensive methods, because the resulting data are more compact and more precise. We propose a novel approach for discovering sequence signatures, which are pieces of information distinctive enough to identify the sequences. The signatures are derived from the best combination of n-grams and statistical scoring models. In our experiments on identifying the Influenza virus, we found that identifiers constructed from n-gram signatures that are too short and from inappropriate scoring models have low efficiency, since inappropriate combinations of n-gram signatures and scoring models bring about unbalanced class and pattern score distributions. The other identifiers, which apply an appropriate combination, achieve accuracy from over 80% up to 100%. Besides succeeding at signature recognition, our proposed approach also requires low computation time for biological sequence identification.

1 Introduction

The rapid growth of genomic and sequencing technologies during the past few decades has produced an enormous volume of diverse genome data, such as DNA and protein sequences. However, little is known about the meaning of most biological sequences.
Therefore, techniques for acquiring knowledge from sequences have become increasingly important for transforming the sequences into useful, concise and compact information. These techniques generally require long computation times, their accuracy usually depends on data size, and no best solution is known. Many sequence processing projects nevertheless share a common experimental stage: recognition of the most significant characteristics of the sequences (intelligent data, or signatures). Signatures are short, informative data that can identify the types of sequences. Recently, several areas of biological research have come to rely on informative signatures as one of the important keys to success, since signatures help reduce computation time and make the data more compact and more precise [4]. Many