Chemometrics for QSAR with low sequence homology: Mycobacterial
promoter sequences recognition with 2D-RNA entropies
Humberto González-Díaz (a,b,⁎), Alcides Pérez-Bello (b), Maykel Cruz-Monteagudo (b),
Yenny González-Díaz (c,d), Lourdes Santana (a), Eugenio Uriarte (a)
a Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
b CBQ, CEQA, and Department of Veterinary Medicine, Central University of ‘Las Villas’, 54830, Cuba
c Provincial Center for Human Genetics, Las Tunas, 77400, Cuba
d National Center for Human Genetics, ICBP “Victoria de Girón”, La Habana, 11600, Cuba
Received 30 May 2005; received in revised form 31 January 2006; accepted 22 March 2006
Available online 26 May 2006
Abstract
Predicting mycobacterial promoter sequences for protein synthesis is important in the study of the regulation of protein metabolism. This goal is,
however, considered a challenging computational biology task because of the low inter-sequence homology. Consequently, a previous work based only on
DNA sequence had to use a large input parameter set and a multilayered feed-forward ANN architecture trained with the error back-propagation
algorithm to reach an overall accuracy of up to 97% [Kalate et al., 2003. Comput. Biol. Chem. 27, 555–564]. One could therefore expect that a
notably simpler model may be derived using parameters based on non-linear structural information. In the present work, a method based on
molecular folding negentropies ($\Theta_k$) is introduced to predict, for the first time, mycobacterial promoter sequences (mps) from the corresponding
RNA secondary structure. The best QSAR equation found was the classification function $\mathrm{mps} = 4.921 \times {}^{0}\Theta_{M} - 1.205$, which recognised 126/135
mps (93.3%) and 100% of 245 control sequences (cs). The model showed a very high Matthews correlation coefficient, C = 0.949. Both average
overall accuracy and predictability were 97.6%. Additionally, several machine learning algorithms were applied to reaffirm the validity of
the LDA model from the chemometrics point of view. This linear model with only one parameter (${}^{0}\Theta_{M}$) may be considered by far the simplest reported
to date, with no loss of accuracy (97%) with respect to Kalate et al.'s model.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Mycobacterial promoter sequences; RNA secondary structure; Markov models; Machine learning algorithms; QSAR; Information theory; Entropy
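The figures quoted in the abstract can be cross-checked from the reported counts alone. The following is a minimal sketch, not the authors' code: it assumes that the 9 unrecognised mps are the only misclassifications (no control sequence was misclassified), and that a positive value of the classification function indicates a promoter, which the abstract does not state explicitly. The function names `mps_score` and `confusion_metrics` are hypothetical.

```python
from math import sqrt


def mps_score(theta0_m: float) -> float:
    """Classification function from the abstract: 4.921 * theta - 1.205.

    Assumption: a positive score is read as "promoter"; the sign
    convention is not stated in the abstract.
    """
    return 4.921 * theta0_m - 1.205


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, accuracy and Matthews coefficient."""
    total = tp + fn + tn + fp
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        # Matthews correlation coefficient; defined as 0 when the
        # denominator vanishes (a common convention).
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Counts taken from the abstract: 126 of 135 mps and all 245 cs recognised.
metrics = confusion_metrics(tp=126, fn=9, tn=245, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the sketch yields a sensitivity of 0.933, an accuracy of 0.976 and a Matthews coefficient of 0.949, in agreement with the values reported above.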
1. Introduction
Protein synthesis promoter sequences play an important
role in regulating the function of several important
mycobacterial pathogens. The prediction of
mycobacterial promoter sequences (mps) is therefore of interest
for the future discovery of new anti-mycobacterial drug
targets and for the study of protein metabolism. Mycobacteria
have a low transcription rate and a low RNA content per
unit DNA. It is therefore expected that the transcription and
translation signals in Mycobacteria may differ from
those in other bacteria such as Escherichia coli.
Mulder et al. have listed the −35 and −10
regions of a few mycobacterial promoters [1]. In
mycobacterial promoters, where apparent conservation in
the −35 region is absent, many possess a TG dinucleotide
immediately upstream of the −10 region,
and are thus termed “extended −10 promoters”. The
large variations among the mycobacterial promoters characterized
thus far suggest that the consensus sequences are not
representative of all mycobacterial promoters. Consequently,
a number of conflicting opinions regarding the presence and
characteristics of consensus promoter sequences in the
Mycobacteria have been aired in the literature [1]. So,
understanding the factors responsible for the low level of
Chemometrics and Intelligent Laboratory Systems 85 (2007) 20 – 26
www.elsevier.com/locate/chemolab
⁎ Corresponding author. Department of Organic Chemistry, Faculty of
Pharmacy, University of Santiago de Compostela, 15782, Spain.
E-mail address: qohumbe@usc.es (H. González-Díaz).
0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2006.03.005