Chemometrics for QSAR with low sequence homology: Mycobacterial promoter sequences recognition with 2D-RNA entropies

Humberto González-Díaz a,b,⁎, Alcides Pérez-Bello b, Maykel Cruz-Monteagudo b, Yenny González-Díaz c,d, Lourdes Santana a, Eugenio Uriarte a

a Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
b CBQ, CEQA, and Department of Veterinary Medicine, Central University of ‘Las Villas’, 54830, Cuba
c Provincial Center for Human Genetics, Las Tunas, 77400, Cuba
d National Center for Human Genetics, ICBP “Victoria de Girón”, La Habana, 11600, Cuba

Received 30 May 2005; received in revised form 31 January 2006; accepted 22 March 2006
Available online 26 May 2006

Abstract

Predicting mycobacterial promoter sequences for protein synthesis is important in the study of the regulation of protein metabolism. This goal is, however, considered a challenging computational biology task because of the low inter-sequence homology. Consequently, a previous work based only on DNA sequence had to use a large set of input parameters and a multilayered feed-forward ANN architecture, trained with the error back-propagation algorithm, to reach an overall accuracy of 97% [Kalate, et al. 2003. Comput. Biol. Chem. 27, 555–564]. One could thus expect that a notably simpler model may be derived using parameters based on non-linear structural information. In the present work, a method based on molecular folding negentropies ($\Theta_k$) is introduced to predict, for the first time, mycobacterial promoter sequences (mps) from the corresponding RNA secondary structure. The best QSAR equation found was the classification function $\mathrm{mps} = 4.921 \times {}^{0}\Theta_{M} - 1.205$, which recognised 126/135 mps (93.3%) and 100% of 245 control sequences (cs). The model showed a very high Matthews correlation coefficient, C = 0.949. Both the average overall accuracy and the predictability were 97.6%.
Additionally, several machine learning algorithms were applied in order to reaffirm the validity of the LDA model from the chemometrics point of view. This linear model, with only one parameter (${}^{0}\Theta_{M}$), may be considered by far the simplest reported to date, with no loss of accuracy (97%) relative to the model of Kalate et al.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Mycobacterial promoter sequences; RNA secondary structure; Markov models; Machine learning algorithms; QSAR; Information theory; Entropy

1. Introduction

Protein synthesis promoter sequences play an important role in the regulation of the function of several important mycobacterial pathogens. In this sense, the prediction of mycobacterial promoter sequences (mps) could be of interest for the future discovery of new anti-mycobacterial drug targets or for the study of protein metabolism. Mycobacteria have a low transcription rate and a low RNA content per unit DNA. It is therefore expected that the transcription and translation signals in Mycobacteria may differ from those in other bacteria such as Escherichia coli. In this sense, Mulder et al. have listed the −35 and −10 regions of a few mycobacterial promoters [1]. Among mycobacterial promoters, in which apparent conservation of the −35 region is absent, many possess a TG dinucleotide immediately upstream of the −10 region and are thus termed “extended −10 promoters”. The large variations among the mycobacterial promoters characterized thus far suggest that the consensus sequences are not representative of all mycobacterial promoters. Consequently, a number of conflicting opinions regarding the presence and characteristics of consensus promoter sequences in the Mycobacteria have been aired in the literature [1]. So, understanding the factors responsible for the low level of

Chemometrics and Intelligent Laboratory Systems 85 (2007) 20–26
www.elsevier.com/locate/chemolab

⁎ Corresponding author.
Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain. E-mail address: qohumbe@usc.es (H. González-Díaz).
0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2006.03.005
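The performance figures quoted in the abstract can be cross-checked from the reported counts alone. The following is a minimal sketch, not the authors' code: it assumes that the 126/135 promoters and 245/245 controls form a single confusion matrix, that the coefficient C is the Matthews correlation coefficient, and that a sequence is assigned to the promoter class when the one-parameter score is positive. The zero cutoff, the function names and the example entropy values are illustrative assumptions that do not appear in the paper.

```python
from math import sqrt

# Counts taken from the abstract: 126 of 135 promoters (mps) and
# all 245 control sequences (cs) were classified correctly.
TP, FN = 126, 135 - 126
TN, FP = 245, 0


def discriminant_score(theta_0m: float) -> float:
    """Score from the reported function mps = 4.921 * theta - 1.205."""
    return 4.921 * theta_0m - 1.205


def is_promoter(theta_0m: float, cutoff: float = 0.0) -> bool:
    # ASSUMPTION: a score above zero means "promoter"; the abstract
    # does not state the decision threshold.
    return discriminant_score(theta_0m) > cutoff


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all sequences classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient of a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


sensitivity = TP / (TP + FN)
overall = accuracy(TP, TN, FP, FN)
mcc = matthews(TP, TN, FP, FN)

print(f"sensitivity = {sensitivity:.1%}")  # sensitivity = 93.3%
print(f"accuracy    = {overall:.1%}")      # accuracy    = 97.6%
print(f"MCC         = {mcc:.3f}")          # MCC         = 0.949
```

Under these assumptions the three reported values (93.3%, 97.6% and 0.949) are mutually consistent. With the assumed zero cutoff, the decision boundary lies at an entropy value of 1.205/4.921, approximately 0.245.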