PREDICTING CHEMICAL CLASSES THROUGH ANN 1
Copyright © 2004 John Wiley & Sons, Ltd. Phytochem. Anal. 15: 00–00 (2004)
UNCORRECTED PROOF
PHYTOCHEMICAL ANALYSIS
Phytochem. Anal. 15, 00–00 (2004)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/pca.799
Received 17 December 2002
Revised 15 June 2003
Accepted 19 September 2003
Prediction of Occurrences of Diverse Chemical
Classes in the Asteraceae through Artificial
Neural Networks
Marcelo J. P. Ferreira,¹ Antônio J. C. Brant,¹ Alessandra R. Rufino,¹ Sandra A. V. Alvarenga,² Fátima M. M. Magri³ and Vicente P. Emerenciano¹*
¹ Instituto de Química, Universidade de São Paulo, Caixa Postal 26077-05513-970, São Paulo, Brazil
² Faculdade de Engenharia de Guaratinguetá, UNESP, CEP 12516-410, Guaratinguetá, São Paulo, Brazil
³ Faculdade de Farmácia, UNIABC, São Paulo, Brazil
The training and the application of a neural network system for the prediction of occurrences of secondary
metabolites belonging to diverse chemical classes in the Asteraceae are described. From a database containing
about 604 genera and 28,000 occurrences of secondary metabolites in the plant family, information was collected
encompassing nine chemical classes and their respective occurrences for training of a multi-layer net using the
back-propagation algorithm. The net supplied as output the presence or absence of the chemical classes as well
as the number of compounds isolated from each taxon. The results provided by the net from the presence or
absence of a chemical class showed an 89% hit rate; by excluding triterpenes from the analysis, only 5% of the
genera studied exhibited errors greater than 10%.
Keywords: Artificial neural networks; chemical composition; occurrence number; secondary metabolites; Asteraceae.
* Correspondence to: V. P. Emerenciano, Instituto de Química, Universidade
de São Paulo, Caixa Postal 26077-05513-970, São Paulo, Brazil.
Email: vdpemere@quim.iq.usp.br
Contract/grant sponsor: FAPESP.
Contract/grant sponsor: CNPq.
INTRODUCTION
The Asteraceae family comprises some 23,000 species, the economic and medicinal importance of
which has been widely described. Several reviews of the family, in terms of chemical and botanical
data, are available in the literature (Heywood et al., 1977; Seaman et al., 1990; Hind and Beentje,
1994; Bremer, 1994a, b).
Botanically, the family has been divided into subfamilies and tribes by several authors (Carlquist,
1976; Wagenitz, 1976; Cronquist, 1977; Jansen et al., 1990; Bremer, 1994a, b). The family has been
extremely well studied from the chemical standpoint and, by the late 1980s, about 7000 species of
the family had already received some type of chemical study (Zdero and Bohlmann, 1990). An
enormous variety of chemical classes has been isolated from members of the family, including
monoterpenoids, sesquiterpenoids, sesquiterpene lactones, diterpenes, triterpenes, coumarins,
flavonoids, polyacetylenes and benzofurans. The database that we maintain on the Asteraceae has
entries for about 5000 species for which detailed chemical information is available, and this
number may be considered a priori as representative enough for an approach using artificial neural
networks.
Phytochemical research and the hunt for new biologically active substances from plants have been a
constant scientific activity for several decades. Typically, the factors that direct this type of
research emanate from ethnobotanical information obtained from the popular medicinal use of plants
or from chemotaxonomic information (Gottlieb, 1982; Gottlieb et al., 2002). In spite of the large
number of species already studied, the information available thus far is still somewhat scarce
since phytochemical studies for each species are rather incomplete. Certain genera have many
interesting species and are very well studied, and this generates a great deal of data. A typical
example would be the genus Artemisia, where the ratio of the number of studied species to the
number of existing species is 240/390 (i.e. 62%). In contrast, for many other genera this ratio can
be very small; thus for the genus Koanophyllon the ratio is 7/120, or only 6%. Techniques of
multivariate statistics involving, for instance, principal component analysis have been used to
explore the existence of correlations among occurrences of chemical classes in the family, but the
results of this methodology remain insufficient to indicate accurately which species should be
analysed (Alvarenga et al., 2001a). As the data available are somewhat noisy, an approach utilising
artificial neural networks (ANNs) may be more appropriate for the analysis of this type of
phytochemical information, and it is this approach that we have employed in the present work.
An ANN is an information processing paradigm inspired by the way in which biological nervous
systems, such as the brain, process information (Stergiou and Siganos, 1996). ANNs constitute an
important tool for the resolution of complex problems in many fields of human knowledge. In
chemistry, for example, the technique has been successfully applied to the prediction of biological
activity of natural products or congeneric compounds (Wrede et al., 1998), statistical and pattern
recognition methods in analytical chemistry (Gasteiger and Zupan, 1993; Zupan and Gasteiger, 1993),
the identification, distribution and recognition of patterns of chemical shifts from ¹H-NMR spectra (Gross