Structure-based classification of active and inactive estrogenic compounds by decision tree, LVQ and kNN methods

Arja Asikainen a,*, Mikko Kolehmainen a, Juhani Ruuskanen a, Kari Tuppurainen b

a Department of Environmental Sciences, University of Kuopio, P.O. Box 1627, FIN-70211 Kuopio, Finland
b Department of Chemistry, University of Kuopio, P.O. Box 1627, FIN-70211 Kuopio, Finland

Received 26 November 2004; received in revised form 18 April 2005; accepted 29 April 2005
Available online 29 June 2005

Abstract

The performance of decision tree (DT), learning vector quantization (LVQ), and k-nearest neighbour (kNN) methods classifying active and inactive estrogenic compounds in terms of their structure activity relationship (SAR) was evaluated. A set of 311 compounds was used for construction of the models, the predictive power of which was verified with separate training and test sets. Principal components derived from molecular descriptors calculated with DRAGON software were used as variables representing the structures of the compounds. Broadly, kNN had the best classification ability and DT the weakest, although the performance of each method was dependent on the group of compounds used for modelling. The best performance was obtained with kNN for the calf estrogen receptor data, averaging 98.3% of correctly classified compounds in the external tests. Overall, the results indicate that all the methods tested are suitable for the SAR classification of estrogenic compounds, producing models with a predictive power ranging from adequate to excellent.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: SAR; Estrogen receptor; Endocrine disruptors; Principal components; Sammon's mapping; Tooldiag

1. Introduction

The biological action of many important natural hormones, such as estradiol derivatives, is mediated through the estrogen receptor (ER).
On the other hand, numerous naturally occurring and man-made estrogen-like compounds, called endocrine disrupting chemicals (EDCs), are ubiquitously present in the environment (Crisp et al., 1998; Singleton and Khan, 2003). The EDCs display a broad structural diversity, including at least phenols, phthalates, phytoestrogens, DDT derivatives, PCBs, pesticides, diethylstilbestrol (DES) derivatives and steroids. In broad outline, EDCs can interfere with the normal action of the ER, and thus constitute a potential environmental risk.

Experimental assays for the measurement of the biological activity of the EDCs are time-consuming and expensive, and computational and modelling methods are beginning to gain a foothold in the field of environmental toxicology as an alternative approach.

0045-6535/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.chemosphere.2005.04.115
* Corresponding author. Tel.: +358 17 162 893; fax: +358 17 163 191.
E-mail address: arja.asikainen@uku.fi (A. Asikainen).
Chemosphere 62 (2006) 658–673
www.elsevier.com/locate/chemosphere

In