UNCORRECTED PROOF

PHYTOCHEMICAL ANALYSIS
Phytochem. Anal. 15, 00–00 (2004)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/pca.799
Copyright © 2004 John Wiley & Sons, Ltd.
Received 17 December 2002; Revised 15 June 2003; Accepted 19 September 2003

Prediction of Occurrences of Diverse Chemical Classes in the Asteraceae through Artificial Neural Networks

Marcelo J. P. Ferreira,¹ Antônio J. C. Brant,¹ Alessandra R. Rufino,¹ Sandra A. V. Alvarenga,² Fátima M. M. Magri³ and Vicente P. Emerenciano¹*

¹ Instituto de Química, Universidade de São Paulo, Caixa Postal 26077-05513-970, São Paulo, Brazil
² Faculdade de Engenharia de Guaratinguetá, UNESP, CEP 12516-410, Guaratinguetá, São Paulo, Brazil
³ Faculdade de Farmácia, UNIABC, São Paulo, Brazil

The training and application of a neural network system for predicting occurrences of secondary metabolites belonging to diverse chemical classes in the Asteraceae are described. From a database containing about 604 genera and 28,000 occurrences of secondary metabolites in the plant family, information encompassing nine chemical classes and their respective occurrences was collected for training a multi-layer net using the back-propagation algorithm. The net supplied as output the presence or absence of the chemical classes as well as the number of compounds isolated from each taxon. The net's predictions of the presence or absence of a chemical class showed an 89% hit rate; when triterpenes were excluded from the analysis, only 5% of the genera studied exhibited errors greater than 10%.

Keywords: Artificial neural networks; chemical composition; occurrence number; secondary metabolites; Asteraceae.

* Correspondence to: V. P. Emerenciano, Instituto de Química, Universidade de São Paulo, Caixa Postal 26077-05513-970, São Paulo, Brazil. Email: vdpemere@quim.iq.usp.br
Contract/grant sponsor: FAPESP.
Contract/grant sponsor: CNPq.

INTRODUCTION

The Asteraceae family comprises some 23,000 species, the economic and medicinal importance of which has been widely described. Several reviews of the family, in terms of chemical and botanical data, are available in the literature (Heywood et al., 1977; Seaman et al., 1990; Hind and Beentje, 1994; Bremer, 1994a, b). Botanically, the family has been divided into subfamilies and tribes by several authors (Carlquist, 1976; Wagenitz, 1976; Cronquist, 1977; Jansen et al., 1990; Bremer, 1994a, b). The family has been extremely well studied from the chemical standpoint, and by the late 1980s about 7000 species of the family had already received some type of chemical study (Zdero and Bohlmann, 1990). An enormous variety of chemical classes has been isolated from members of the family, including monoterpenoids, sesquiterpenoids, sesquiterpene lactones, diterpenes, triterpenes, coumarins, flavonoids, polyacetylenes and benzofurans. The database that we maintain on the Asteraceae has entries for about 5000 species for which detailed chemical information is available, and this number may be considered, a priori, representative enough to support an approach using artificial neural networks.

Phytochemical research and the hunt for new biologically active substances from plants have been a constant scientific activity for several decades. Typically, the factors that direct this type of research emanate from ethnobotanical information obtained from the popular medicinal use of plants or from chemotaxonomic information (Gottlieb, 1982; Gottlieb et al., 2002). In spite of the large number of species already studied, the information available thus far is still somewhat scarce, since the phytochemical studies of each species are rather incomplete. Certain genera have many interesting species and are very well studied, and this generates a great deal of data. A typical example is the genus Artemisia, for which the ratio of the number of studied species to the number of existing species is 240/390 (i.e. 62%). In contrast, for many other genera this ratio can be very small; for the genus Kaonosphyllon the ratio is 7/120, or only 6%.

Techniques of multivariate statistics such as principal component analysis have been used to explore correlations among occurrences of chemical classes in the family, but the results of this methodology remain insufficient to identify accurately which species should be analysed (Alvarenga et al., 2001a). As the data available are somewhat noisy, an approach utilising artificial neural networks (ANNs) may be more appropriate for the analysis of this type of phytochemical information, and it is this approach that we have employed in the present work.

An ANN is an information processing paradigm inspired by the strategy through which biological nervous systems, such as the brain, process information (Stergiou and Siganos, 1996). The application of ANNs constitutes an important tool for the resolution of complex problems in many fields of human knowledge. In chemistry, for example, the technique has been successfully applied to the prediction of biological activity of natural products or congeneric compounds (Wrede et al., 1998), statistical and pattern recognition methods in analytical chemistry (Gasteiger and Zupan, 1993; Zupan and Gasteiger, 1993), the identification, distribution and recognition of patterns of chemical shifts from ¹H-NMR spectra (Gross