World Applied Sciences Journal 2 (4): 323-332, 2007
ISSN 1818-4952
© IDOSI Publications, 2007

Data Mining on Structure-Activity/Property Relationships Models

Sorana D. Bolboaca¹ and Lorentz Jäntschi²

¹“Iuliu Haţieganu” University of Medicine and Pharmacy, Cluj-Napoca, Romania
²Technical University of Cluj-Napoca, Romania

Corresponding Author: Dr. Sorana D. Bolboaca, “Iuliu Haţieganu” University of Medicine and Pharmacy, Cluj-Napoca, Romania

Abstract: Studies using the molecular descriptors family on structure-activity/property relationships were carried out in order to identify the link between the structure of compounds and their activity/property. Fifty-five classes of properties or activities of different sets of compounds were investigated. Single- and multi-variate linear regression models using molecular descriptors as variables were identified. The models with estimation and prediction abilities, together with their associated characteristics, were stored in a database. A data mining analysis using classification and clustering was applied to the resulting database in order to search for and extract useful information. The methodology applied in searching for and extracting information and the results obtained are presented.

Key words: Knowledge-Discovery in Database (KDD); cluster analysis; Structure-Activity/Property Relationships (SAR/SPR); Molecular Descriptors Family (MDF)

INTRODUCTION

Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining and/or clustering. The term has been defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data [1], and is considered the science of extracting useful information from large data sets or databases [2].

Data mining techniques are used to search for consistent patterns and/or systematic relationships between variables in business [3], evaluation of web-based educational programs [4], computer science [5], chemistry [6], engineering [7], medicine [8] and in all domains where a large amount of data must be analyzed.

A new method of quantitative structure-activity/property relationships, abbreviated as MDF SAR/SPR (molecular descriptors family on the structure-activity/property relationships), was introduced by Jäntschi in 2004 [9] and reviewed in 2005 [10]. Since then, samples of compounds with different properties or activities have been investigated and analyzed. Some results on different properties (retention chromatography index [9], relative response factor [11], molar refraction [12], octanol/water partition coefficient [13-15]) or activities (insecticidal activity [16], herbicidal activity [17], antioxidant efficacy [18], inhibition activity [19-21], toxicity [22, 23], antituberculotic activity [24] and antimalarial activity [25]) have been reported. In addition, the overall results from the use of the molecular descriptors family on structure-property/activity relationships have also been published [26].

The best performing models in terms of correlation coefficients and cross-validation scores were collected into a database. Data mining techniques were then applied to this body of information in order to identify consistent patterns and/or relationships between variables of the MDF SAR/SPR models.

MATERIAL

Fifty-five sets of compounds were included in the analysis. The set abbreviation, the activity or property of interest and the class of compounds are presented in Table 1.

Univariate and multivariate models were obtained by applying the MDF SAR/SPR methodology to the samples of compounds; the models were stored in a database. The molecular descriptors are the variables used by the models.
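
As an illustration of the kind of record such a database could hold, the following minimal Python sketch fits a univariate linear regression on one descriptor and computes a squared correlation coefficient and a cross-validation score. It is not the MDF SAR/SPR implementation: the descriptor values and property values are hypothetical, the leave-one-out scheme is an assumption (the text states only that cross-validation scores were used), and the names `fit_line`, `loo_q_squared` and `model_record` are introduced here for illustration only.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    b, a = np.polyfit(x, y, 1)  # polyfit returns the highest-degree coefficient first
    return a, b

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_q_squared(x, y):
    """Leave-one-out score (assumed scheme): refit without compound i, predict i."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        a, b = fit_line(x[mask], y[mask])
        preds[i] = a + b * x[i]
    return r_squared(y, preds)

# Hypothetical data: one descriptor value and one measured property per compound.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])

a, b = fit_line(x, y)
r2 = r_squared(y, a + b * x)
q2 = loo_q_squared(x, y)

# One row of a hypothetical models table.
model_record = {"intercept": a, "slope": b, "r2": r2, "q2_loo": q2}
print(model_record)
```

For a least-squares fit, the leave-one-out score cannot exceed the fitted squared correlation coefficient, so the gap between the two gives a rough indication of how much a model's estimation ability overstates its prediction ability.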